The Data Journey Manifesto
After years of working with many data analytics teams using different technologies, we’ve come to believe that reducing errors and defects in the insight production process is the key to success. Our customers have shamed and blamed us for data problems we did not cause. We’ve been trapped with existing data processes we don’t understand, and we’ve sat in dread every morning, waiting for something to break in our data, reports, models, or other customer deliverables. We are tired of the stress and wasted productivity that come with hunting for problems deep within the system. We want a method that lets us observe our data’s complicated paths so we can avoid problems, errors, and customer frustration.
Define and Know What Should Be:
At any time, in data analytic systems, know what should be, what is, and the exact difference between the two.
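As a minimal sketch of this principle, the snippet below compares a dictionary of expectations with a dictionary of observations and reports only the differences. The function name `diff_state`, the `MISSING` marker, and the example keys are all hypothetical; a real data estate would have far richer expectations.

```python
MISSING = "<missing>"  # hypothetical marker for a key present on only one side


def diff_state(expected: dict, actual: dict) -> dict:
    """Return {key: (expected_value, actual_value)} for every mismatch."""
    diffs = {}
    for key in sorted(expected.keys() | actual.keys()):
        want = expected.get(key, MISSING)
        got = actual.get(key, MISSING)
        if want != got:
            diffs[key] = (want, got)
    return diffs


# What should be versus what is (illustrative values only).
expected = {"orders.row_count": 100, "orders.load_status": "ok"}
actual = {"orders.row_count": 98, "orders.load_status": "ok"}

print(diff_state(expected, actual))
```

An empty result means "what is" matches "what should be"; anything else is the exact difference to investigate.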
Make Hope Infrequent:
Hoping your data systems work in production is not a strategy. It’s a recipe for failure.
Customers Finding Problems Is Not OK:
Customers finding problems in your data analytics is unacceptable. Find problems before your customer does.
Don’t Trust Your Data Providers:
Some data providers are on top of their game, while others barely fulfill an unwanted task. Either way, providers make mistakes. Get used to it. Protect against it. Use it as an opportunity for improvement.
Don’t Assume What Worked Last Week Will Work Today:
Your team is constantly changing the code and configuration in your data estate. Make sure it is still working.
Find The Problem Fast:
Finding the exact source of the problem – whether in raw data, integrated data, models, reports, servers, software, and/or code – is half the battle.
Perfect Data Quality Is Not a Cure-all:
Even with perfect initial data quality, many other things can still go wrong.
Avoid Manual Quality Testing Like The Plague:
Completely automate the testing of your data and tools.
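One way to picture automated data testing is a set of named checks that run against every batch with no human in the loop. This is only a sketch: the check names, the `id` and `amount` fields, and the `run_checks` helper are assumptions made for illustration, not part of any particular tool.

```python
from typing import Callable, Dict, List

Batch = List[dict]

# Hypothetical checks; each takes a batch of rows and returns True on success.
CHECKS: Dict[str, Callable[[Batch], bool]] = {
    "not_empty": lambda rows: len(rows) > 0,
    "ids_unique": lambda rows: len({r["id"] for r in rows}) == len(rows),
    "amount_non_negative": lambda rows: all(r["amount"] >= 0 for r in rows),
}


def run_checks(rows: Batch, checks: Dict[str, Callable[[Batch], bool]]) -> List[str]:
    """Run every check against the batch and return the names that failed."""
    return [name for name, check in checks.items() if not check(rows)]


batch = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": -5.0}]
print(run_checks(batch, CHECKS))
```

Because the checks are plain functions, a scheduler or pipeline step can run them on every batch and fail loudly when the returned list is not empty.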
Your Data Production Is A Factory:
Heed the lessons of Toyota, Lean, and Deming. Every tool in the system – data ingestion, transformation, database, predictive model, and visualization – is a workstation on that assembly line.
Complicated Data Architectures Need NASA’s Mission Control:
Our modern data architectures are built on performative complexity. Your data architecture has many ‘little boxes,’ each of which can fail.
Follow DataOps Principles:
Let the ideas in dataopsmanifesto.org guide your data team during development and production.
We Need A New Idea: The Data Journey
Is The Expectation Layer:
Data Journeys represent the expectations for the myriad paths data takes from source to the insight value you deliver to your customer.
Observes, But Does Not “Run”:
Data Journeys track and monitor all levels of the data stack, from data quality validation to servers, software, code, costs, and utilization. They are the ‘digital twin’ of complicated batch and streaming data architectures. Data Journeys hold expectations and don’t ‘run’ anything.
Alerts In Real Time:
A Data Journey supplies real-time statuses and alerts. With this information, you can know if everything ran on time and without errors and immediately identify the parts that didn’t.
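The sketch below shows, under simplifying assumptions, how expectations about timing and errors could be turned into statuses and alerts. It only observes events reported by components and runs nothing itself. The `RunEvent` shape, the status names, and the component names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class RunEvent:
    """A hypothetical event reported by one component of the journey."""
    component: str
    finished_at: Optional[datetime]  # None means the run has not finished
    error: Optional[str] = None


def evaluate(deadlines: Dict[str, datetime], events: List[RunEvent],
             now: datetime) -> Dict[str, str]:
    """Compare observed events with expected deadlines, one status per component."""
    latest = {e.component: e for e in events}  # the last event seen wins
    statuses = {}
    for component, deadline in deadlines.items():
        event = latest.get(component)
        if event is not None and event.error:
            statuses[component] = "FAILED"
        elif event is not None and event.finished_at is not None:
            on_time = event.finished_at <= deadline
            statuses[component] = "ON_TIME" if on_time else "LATE"
        elif now > deadline:
            statuses[component] = "MISSING"  # expected by now, nothing finished
        else:
            statuses[component] = "PENDING"
    return statuses


def alerts(statuses: Dict[str, str]) -> List[str]:
    """Return the components that need attention."""
    return sorted(c for c, s in statuses.items() if s in {"FAILED", "LATE", "MISSING"})
```

Calling `evaluate` each time a new event arrives, and again whenever a deadline passes, keeps the statuses current; `alerts` then names exactly the parts that did not run on time or without errors.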
Goes Across And Down:
Data Journeys define the process lineage for the many complex elements that deliver insight. They cover components ‘across’ your toolchain and ‘down’ your technology stack, including logs, messages, run status, metrics, data validation tests, and other information from your data estate.
A Data Journey has many components. Paraphrasing Anna Karenina: all happy, error-free Data Journeys are alike; each unhappy Data Journey is broken in its own way. Data Journeys find the ‘unhappy’ component quickly.
Trusts But Verifies:
“Trust, but verify” is an old Russian proverb. Trust comes from monitoring every component in your Data Journey, then verifying the data that touches it. Test, validate, and look for anomalies at every step.
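As one illustration of looking for anomalies at a step, the sketch below flags a metric, such as a row count, that falls far from its recent history. The z-score rule and the threshold of 3.0 are assumptions chosen for simplicity, not a method this manifesto prescribes, and `is_anomalous` is a hypothetical helper.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(history: Sequence[float], value: float, threshold: float = 3.0) -> bool:
    """Flag `value` if it lies more than `threshold` standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu  # history never varied, so any change is suspicious
    return abs(value - mu) / sigma > threshold


# Illustrative row counts from earlier runs of one step.
row_counts = [1000, 1010, 990, 1005, 995]
print(is_anomalous(row_counts, 400))
print(is_anomalous(row_counts, 1003))
```

A check like this complements fixed validation rules: the rules catch known failure modes, while the history-based test catches values that are valid in form but unusual for this step.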
Your production schedule is public property – share widely. Use it to eliminate silos between the engineers who built the components of the Data Journeys, the operators who run them, the customers who use them, and the managers who get yelled at if there is a problem.
Learns From Production History:
Each instance of a Data Journey adds to a history you can use to find the root cause of a defect, help your team improve, and demonstrate a reduction in production errors and unmet SLAs.
It Can Be A Business Workflow, Too:
An instance of a Data Journey usually represents the batch or streaming technical steps used to create value from data. However, some Data Journey instances represent a business workflow in ‘the real world.’ A particular customer may use that Data Journey to check on the status of that process.
Lowers Deployment Risk:
You can’t ship code to production based on manual checks or static analysis alone. Use the Data Journey to automatically regression test your code in development and find the impact of changes before they are deployed.
Reduces Errors And Drives Productivity:
Your data analytics team’s productivity drops when it spends time finding and fixing problems in production. Unidentified errors in your Data Journeys cause costly business mistakes, erode the trust of your customers, and may create compliance risk.