A PM's thoughts on data contracts
Get clear on the problem and what can make the jobs of data professionals easier
Data contracts are one of the hottest buzzwords in the data ecosphere right now. Many teams compare data contracts to API contracts: a team publishes its data offering in terms of schema, data semantics, and availability. Many articles today discuss data contracts either at a high level or through a technical lens. Alongside those conversations, we have a chance to examine the role of data contracts from a product management perspective.
Clarifying the problem statement
A data contract is part of the two-sided marketplace that is a data platform. The contract has two parties: a data producer, who needs to establish and meet the contract, and a data consumer, who helps clarify what the contract needs to cover. The consumer could be an end consumer (think BI analyst) or another data producer (think data engineer) who needs to use data provided by an upstream data producer.
But why does this matter and what’s the larger problem that the entire organization might care about? Ultimately, data contracts exist to improve the overall reliability and usability of datasets by making handshakes across data pipelines explicit and predictable.
With data contracts, data producers will explicitly provide expectations about the data supplied in the platform. Their daily jobs do become slightly more complex as they need to state these expectations and continuously work to ensure they adhere to them. The benefit is a more reliable data product for downstream consumers.
This ultimately improves overall trust in data from downstream consumers and reduces time spent debugging data issues. Teams can build more complex and impactful data products if they have confidence that their supply of data will be stable. Over time, this value accumulates as organizations can become more data driven.
To make data contracts effective, it’s critical to explain their benefit to the overall data ecosystem and to downstream consumers (i.e. people outside the data team). Getting clear on where and how data contracts add value will help your teams decide how to implement them and how to get buy-in.
Understanding incentives
Establishing and enforcing data contracts is not easy. You are essentially trying to convince many teams to take on more daily work and overhead to maintain their data. Understanding teams and their incentives will help you drive the adoption you need to make data contracts successful.
Data suppliers have very little natural incentive to do more work in providing their data for downstream consumers. Yes, they care about being a good teammate and helping provide a solid data experience, but perhaps not enough to take on the overhead of daily maintenance and SLA upkeep for their data pipelines.
The carrot approach drives incentives by showing all teams involved the power of the data product. By showing what can be unlocked with more reliable data, you give every team more of a reason to engage in data contracts in support of a larger organizational goal tied to the use of your data product. In addition, if you can quantify and articulate the time and cost the org saves with fewer data disruptions, you can convince more teams to opt in to the extra overhead of a data contract.
Show people the improvements on how you can use data as a product and the cost savings to the organization as a carrot to drive adoption and engagement of data contracts
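One rough way to put a number on the savings argument is a back-of-the-envelope estimate. The sketch below is only an illustration: the function name, its parameters, and the assumption that savings scale linearly with the share of incidents avoided are all hypothetical, and the inputs would have to come from your own incident records.

```python
def estimated_annual_savings(incidents_per_year, hours_per_incident,
                             hourly_cost, expected_reduction):
    """Rough yearly cost avoided if data contracts prevent a share of incidents.

    All four inputs are assumptions you supply yourself:
    - incidents_per_year: data incidents observed in a typical year
    - hours_per_incident: average people-hours spent debugging each one
    - hourly_cost: blended cost of one people-hour
    - expected_reduction: fraction of incidents you expect contracts to prevent (0 to 1)
    """
    if not 0 <= expected_reduction <= 1:
        raise ValueError("expected_reduction must be between 0 and 1")
    return incidents_per_year * hours_per_incident * hourly_cost * expected_reduction
```

An estimate like this is crude, but it gives producer teams a concrete figure to weigh against the extra overhead they are being asked to take on.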
On the flip side, there’s the stick approach. If you can convince teams to incorporate the data contract into their definition of done and their team expectations, they have little choice but to participate. Making this transparent is key; you need to show the organization which data producers are meeting their data contracts and which ones are falling short. Over time, teams will accept the responsibility and bake the work into their daily routines.
For stricter enforcement of data contracts, seek to incorporate this as part of an engineering team’s definition of done and their on-call responsibilities, then drive transparency around adherence and adoption
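Transparency around adherence can start as a simple per-producer pass rate. The following is a minimal sketch, assuming each contract check is recorded as a `(producer, passed)` pair; the function names and the 95% target are hypothetical choices for illustration, not a standard.

```python
from collections import defaultdict


def adherence_by_producer(check_results):
    """Return the fraction of passed contract checks for each producer.

    check_results: iterable of (producer_name, passed) tuples (assumed format).
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for producer, passed in check_results:
        totals[producer] += 1
        if passed:
            passes[producer] += 1
    return {producer: passes[producer] / totals[producer] for producer in totals}


def falling_short(check_results, target=0.95):
    """List producers whose pass rate is below the target (hypothetical default)."""
    rates = adherence_by_producer(check_results)
    return sorted(producer for producer, rate in rates.items() if rate < target)
```

Publishing both the rates and the list of producers below target lets the organization see who is meeting their contracts without anyone having to dig through pipeline logs.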
Make data contracts easy
Finally, encouraging the success of data contracts means making it as easy as possible for teams to use them and adhere to them. The data platform needs to make it very clear what the data contract states for a given team, where the team has met the contract, and where it has not.
Building this system is complex as it requires data observability and a method to establish expectations. An ideal system helps data producers understand their data and establish the details of the contract. For example, the data contract system should make it easy for a data producer to know when they can expect a schema change, how often the semantic meaning of their data changes, and what level of freshness and reliability the data has. This makes it easier to publish any contract expectations and for consumers to know how well a data producer meets them.
Data observability and a mechanism to set and enforce data contracts in an automated fashion are must-haves for any system
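To make the idea of publishing expectations concrete, here is a minimal sketch of what a contract record might hold, along with a freshness check against it. The `DataContract` class, its field names, and `is_fresh` are hypothetical; a real system would likely track more (semantics, reliability targets) and take its timestamps from an observability tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class DataContract:
    """Hypothetical shape of a published contract for one dataset."""
    dataset: str
    owner: str
    schema: dict                     # column name -> type name, e.g. {"order_id": "int"}
    max_staleness: timedelta         # freshness expectation
    schema_change_notice: timedelta  # how far ahead schema changes are announced


def is_fresh(contract, last_updated, now=None):
    """True if the dataset was updated within the contract's freshness window."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated <= contract.max_staleness
```

With the expectations written down in one place, the same record can drive both what consumers read and what the platform checks automatically.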
Beyond that, a data contract system has the opportunity to go beyond just the management of the contracts themselves. In order to help organizations meet the ultimate goal of more reliable data, the platform should help data producers more easily meet their contract goals and even improve their performance.
Platform features around data observability, setting and tracking data contract adherence, and proactive tools to avoid downtime and breakages make it a lot easier for data producers to take on the overhead of a data contract
For example, data contracts often struggle with unexpected schema changes that break downstream pipelines. A simple system can help detect when a schema change has happened so that people know when the contract has been changed or broken and help drive quick remediation. A more advanced system will notify upstream data producers of the breaking change before it happens to minimize downtime. An even more advanced system could help minimize the impact of the schema change so that there’s no downtime for a schema change and little extra overhead for the upstream data producer.
User-focused thinking on how to ease the pain of data producers’ daily jobs can lead to innovations around data contracts and features to improve reliability
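The "simple system" that detects a schema change after the fact can be sketched as a comparison between the contracted schema and the one actually observed. This is an illustration under stated assumptions: schemas are represented as plain dictionaries of column name to type name, the function names are hypothetical, and the rule that only removed or retyped columns count as breaking is a simplification.

```python
def diff_schema(contracted, observed):
    """Compare two schemas given as {column_name: type_name} dictionaries."""
    removed = sorted(set(contracted) - set(observed))
    added = sorted(set(observed) - set(contracted))
    retyped = sorted(
        col for col in set(contracted) & set(observed)
        if contracted[col] != observed[col]
    )
    return {"removed": removed, "added": added, "retyped": retyped}


def is_breaking(diff):
    """Assumption: removed or retyped columns break consumers; added columns do not."""
    return bool(diff["removed"] or diff["retyped"])


contracted = {"order_id": "int", "amount": "float", "note": "str"}
observed = {"order_id": "str", "amount": "float", "channel": "str"}
diff = diff_schema(contracted, observed)
print(diff)               # {'removed': ['note'], 'added': ['channel'], 'retyped': ['order_id']}
print(is_breaking(diff))  # True
```

Running the same comparison against a proposed schema before a change ships, rather than against the observed one afterward, is one way to move toward the "more advanced" system that warns producers ahead of time.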
Going beyond the data contracts themselves and encouraging the whole ecosystem to participate and succeed will help you solve the deeper organizational problem of data reliability.
Recap
Data contracts are a powerful tool to help your organization provide more reliable data. In addition to the technology and infrastructure, data teams have opportunities to employ PM principles to make data contracts more successful across companies.
In particular, the data PM can drive impact by identifying and evangelizing the larger org problem to be solved and which customers might care. From there, the data PM can drive engagement and adoption of data contracts by understanding the incentives of their customers and how to use carrots and sticks to encourage proper behaviors. Finally, the data PM can drive adoption and further impact by working with the data platform to make all the various jobs to be done around data contracts as easy as possible.
This idea of data contracts seems to have had no contact with the real world. I run data engineering at my company. There's no way I'm going to do this. Half my data comes from outside my organization in the first place, so I have no ability to influence what they send me. And then this:
"You are essentially trying to convince many teams to take on more daily work and overhead to maintain their data."
Lol. Yes, that is what I would be doing. I have absolutely no authority or priority over these teams, who are responsible for the main software product. Sometimes I need stuff from them. Why would I blow all my requests on something I can manage without them?
We're really just talking about type checking, right? I can (and do) check for that on my own.