2026-01-26
Architecting Trust: The Challenge of Data Quality Part 1
In this series of blog posts, I explore the multi-faceted challenge of building data quality into an organization’s data ecosystem.

In today's world of big data (velocity, volume and variety), data quality is the bedrock upon which all data-driven decisions are made. From advanced analytics and machine learning to generative AI applications, the quality of data directly impacts the reliability and effectiveness of the outcomes. There is an age-old adage in the data community:
Garbage in, garbage out.
Despite the many tools, technologies and frameworks that promise to manage data and guarantee its quality, ensuring high-quality data in an organization remains a significant challenge. In this series of blog posts, I will write about the multi-faceted nature of building data quality into an organization's data ecosystem.
Contents
- Introduction
- The Operational Reality of Data Quality Dimensions
- Data quality expectations across workloads
- The Role of Architecture in Data Quality Outcomes
- Common Pitfalls on the path to Data Trust
- Conclusion
Introduction
The impact of poor data quality isn't merely a technical nuisance - it's a systemic issue that erodes trust, undermines operational efficiency and can lead to flawed business outcomes with significant financial and reputational consequences for an organization. The belief that better tools alone will solve an organization's data quality problems is a common misconception. While tooling is important, it addresses symptoms rather than the root cause.
The core of the data quality challenge lies in the interplay of system architectures, organizational structures and human factors. To truly address it, we need to reframe the problem not as something a silver-bullet solution can fix, but as a continuous, multi-dimensional challenge that requires a holistic approach to how data is created, transformed and consumed.
Different use-cases, whether analytical, operational or machine learning, place unique and often conflicting demands on data quality. This complexity is further compounded by the diverse sources of data, each with its own set of quality issues, and the evolving nature of data as it flows through various systems and processes within an organization.
In this series, we will look into the various dimensions of data quality challenges and the operational aspects of embedding data quality and observability into the workings of an organization, with a particular focus on ownership models, service level objectives and incident response. We will also cover the common pitfalls that many companies encounter on their data quality journey.
Experienced practitioners, whether engineers, analysts or decision makers, will confirm that data quality issues frequently arise not from the failure of a single tool but from the boundaries between systems, processes and people.
It is the implicit assumptions made during the integration of disparate systems, the misalignment of incentives among stakeholders and the lack of clear ownership that often degrade data quality and the processes surrounding it over time. For example, a schema change in an upstream source system might silently break a downstream transformation if the pipeline isn't designed to detect and adapt to such changes. Similarly, data latency expectations may be violated if the architecture does not appropriately account for the processing characteristics of different stages or the delays inherent in batch vs. streaming paradigms.
I have often observed that the ownership boundaries between teams or systems producing and consuming data create a “no-man’s land” where data quality becomes nobody’s explicit responsibility. Without proper architectural patterns that promote clear contracts, error handling and monitoring, data quality can degrade silently as it traverses these boundaries between systems and teams.
These challenges in implementing robust data quality and observability remind me of Conway's law:
Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure.
In other words, the architecture of data systems often reflects the organizational structure and communication patterns within the company. If teams are siloed and lack effective communication, it can lead to fragmented data systems with inconsistent data quality standards and practices. This misalignment can result in data quality issues that are difficult to trace and resolve, as different teams may have different priorities, standards and processes for managing data quality.
The Operational Reality of Data Quality Dimensions
Data quality is a multi-dimensional concept which can be broken down into several key dimensions: accuracy, completeness, consistency, timeliness, validity and uniqueness. In practice, these are not just abstract ideals; they manifest as concrete operational concerns that directly impact the usability and trustworthiness of data for specific use-cases.
It is rarely feasible to enforce all dimensions for all use-cases across every dataset. A pragmatic approach is to identify which dimensions are most critical for a given use-case and calibrate and enforce them accordingly.
The expectations for data quality differ between operational systems that require real-time responsiveness, analytical systems that focus on historical trends and machine learning workloads that require feature reliability and label integrity. Each of these use-cases requires a deep understanding of the operational implications and trade-offs involved in balancing these dimensions of data quality.
Freshness
Freshness refers to the latency between when an event occurs and when it becomes available for querying or analysis in its target system.
For operational systems, such as those monitoring application performance or tracking user activity in real time, freshness is critical. Decisions made on stale data could be ineffective or even detrimental. For example, if a fraud detection system that relies on analyzing transactions in real time has a freshness requirement of under 5 seconds, any delay beyond this threshold could allow fraudulent activities to go undetected. Similarly, a dashboard showing real-time website traffic loses its value if the data is hours behind.
Ensuring high freshness requires architectural choices like streaming ingestion (e.g. Apache Kafka, AWS Kinesis), real-time processing engines (e.g. Apache Flink, Spark Streaming) and storage layers optimized for low-latency access (e.g. Apache Druid, ClickHouse). However, these architectural choices must be balanced against cost, complexity and scalability considerations. Often, optimizing for one dimension conflicts with others. For example, processing data as soon as it arrives shortens the time to insight but could compromise the validation or deduplication checks that ensure accuracy and consistency. Batch processing, while offering larger windows for data cleaning, validation and transformation, introduces inherent delays that may not meet freshness requirements.
This is all to say that the acceptable level of freshness is context-dependent and needs to be defined based on the specific use-case requirements, balancing the trade-offs between latency, accuracy and resource utilization.
A common practice in the industry is to define Service Level Objectives (SLOs) for data freshness. An example could be:
95% of the order events must be available in the data warehouse within 5 minutes of occurrence.
To track this SLO, we would need to record the time of the last successful data load, the timestamp of the most recent record in the table and the lag between the two. An alert can then be triggered if the lag exceeds the defined threshold. A minimal sketch of such a check is shown below.
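To make this concrete, here is a minimal, illustrative sketch of a freshness check in Python. The function, threshold and timestamps are assumptions of mine, not part of the original example; in practice the timestamps would come from warehouse queries or pipeline metadata, and the alert would be routed to your monitoring system.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold taken from the SLO above: order events should land
# in the warehouse within 5 minutes of occurrence.
FRESHNESS_SLO = timedelta(minutes=5)

def check_freshness(latest_event_time: datetime, last_load_time: datetime) -> dict:
    """Compute the freshness lag for a table and flag SLO violations.

    `latest_event_time` is the max event timestamp currently in the target table,
    `last_load_time` is when the most recent load job finished successfully.
    Both are assumed to be in UTC; how you obtain them (warehouse query, job
    metadata) depends on your stack.
    """
    now = datetime.now(timezone.utc)
    event_lag = now - latest_event_time   # how stale the newest record is
    load_lag = now - last_load_time       # how long since data last landed
    return {
        "event_lag_seconds": event_lag.total_seconds(),
        "load_lag_seconds": load_lag.total_seconds(),
        "slo_violated": event_lag > FRESHNESS_SLO,
    }

if __name__ == "__main__":
    result = check_freshness(
        latest_event_time=datetime.now(timezone.utc) - timedelta(minutes=7),
        last_load_time=datetime.now(timezone.utc) - timedelta(minutes=2),
    )
    if result["slo_violated"]:
        # In practice this would page an on-call channel or emit a metric.
        print(f"Freshness SLO violated: lag {result['event_lag_seconds']:.0f}s")
```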
A trickier scenario arises when dealing with complex data dependencies, where a downstream process is dependent on multiple upstream sources. The overall freshness of this downstream process is constrained by the slowest upstream source. In such cases, it is important to monitor the freshness of each upstream source individually and understand their impact on the overall freshness of the downstream process.
Completeness
Completeness means that all expected data records and fields are present and populated in the dataset. Incomplete data can lead to skewed analyses, incorrect aggregations and flawed machine learning models. For example, if customer demographic data is missing key attributes for a portion of users, any segmentation analysis based on this data will be biased and potentially misleading. Similarly, in a financial reporting system, missing transaction records can lead to inaccurate financial statements and compliance issues.
There are some common strategies to ensure data completeness (a small sketch follows this list):
- Checking for null values in critical columns
- Verifying record counts against expected volumes
- Ensuring there are no gaps in timestamps for time series data
- Checking filters to ensure all expected categories or segments are represented
- Checking joins to ensure no records are lost
- Ensuring all expected partitions are present
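As a rough illustration of a few of these strategies, here is a small pandas sketch covering null checks, row count validation and time-series gap detection. The column names, expected volumes and frequency are placeholders; a real implementation would typically live in a testing framework such as Great Expectations, dbt tests or Soda rather than ad-hoc code.

```python
import pandas as pd

def completeness_report(df: pd.DataFrame,
                        critical_columns: list[str],
                        expected_min_rows: int,
                        ts_column: str | None = None,
                        expected_freq: str = "D") -> dict:
    """Run a few basic completeness checks on a dataframe.

    The column names, expected row counts and frequency are placeholders;
    in practice they would come from a data contract or a profiling baseline.
    """
    report = {}
    # 1. Null values in critical columns
    report["null_counts"] = {c: int(df[c].isna().sum()) for c in critical_columns}
    # 2. Record count vs. expected volume
    report["row_count_ok"] = len(df) >= expected_min_rows
    # 3. Gaps in a time series (assumes one record per period is expected)
    if ts_column is not None:
        observed = pd.to_datetime(df[ts_column]).dt.normalize().drop_duplicates()
        full_range = pd.date_range(observed.min(), observed.max(), freq=expected_freq)
        report["missing_periods"] = [str(d.date()) for d in full_range.difference(observed)]
    return report

# Example usage with a toy dataframe (note the null customer_id and the missing day)
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, None, 12],
    "order_date": ["2026-01-01", "2026-01-02", "2026-01-04"],
})
print(completeness_report(orders, ["order_id", "customer_id"],
                          expected_min_rows=3, ts_column="order_date"))
```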
Architecturally, data quality checks for completeness can be implemented at various stages:
- During ingestion, to ensure all data from the source system is captured
- After transformations, to ensure no critical data is lost during processing
- After loading into target systems, to ensure all expected records and fields are present
- Before consumption, to ensure data meets completeness requirements for specific use-cases
The definition of completeness also varies based on use-cases. This often requires close collaboration between data producers and consumers. Techniques like data profiling, which analyze data to understand structure and content, can help establish baselines for completeness and identify patterns of missing data.
It is also important to have lineage tracking to be able to trace the source of missing data, whether it originated in the source system, during transformation or during loading. Lineage also helps identify the datasets that would be impacted by the incomplete data, enabling proactive mitigation strategies.
Accuracy
Accuracy refers to the correctness of data values. Data is accurate if it correctly represents the real-world entities or events it is supposed to model. It is arguably one of the most important dimensions of data quality, as inaccurate data directly leads to flawed insights and poor decisions.
Typically, accuracy requires comparing data against a trusted source or ground truth, which may not always be available depending on the use-case.
For example, for aggregations on top of financial data coming from a source system like SAP, accuracy can be verified by reconciling the aggregated values against official financial reports. In contrast, for user behavior data collected from web applications, accuracy might be assessed by cross-referencing with server logs. In practice, this would be automated by running periodic reconciliation jobs that compare aggregated values in the data warehouse against those in the source systems, flagging any discrepancies beyond a certain threshold.
Operational checks for accuracy can include (see the sketch after this list):
- Referential integrity checks to ensure that foreign keys correctly reference primary keys in another table (for example, every transaction record should reference a valid customer ID)
- Data type and format validation: checks to ensure values conform to expected data types and formats (for example, dates in valid date format, email addresses have a valid structure)
- Range checks: checks for numerical values to ensure they fall within acceptable ranges (for example, a percentage is between 0 and 100)
- Uniqueness checks: ensure that unique identifiers are indeed unique (for example, no duplicate transaction IDs if the base grain is at the transaction level)
- Business rule validation: checks custom logic to enforce domain-specific rules (for example, a delivery date should not be earlier than the order date, sum of line items should equal total amount)
- Statistical analysis: using statistical methods to identify outliers or values that deviate significantly from expected distributions (for example, an order amount outside of peak season that moves beyond 2 standard deviations from the mean of the last 2 weeks)
Data engineering teams usually have limited control over upstream source systems, so patterns like data contracts and schema evolution strategies can help mitigate accuracy issues at the source. Machine learning models can also be employed to detect anomalies in data that may indicate accuracy issues.
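To illustrate a few of these checks, here is a small pandas sketch covering referential integrity, range, business rule and uniqueness validations. The table and column names are hypothetical assumptions; the actual rules would be driven by your domain.

```python
import pandas as pd

def accuracy_checks(transactions: pd.DataFrame, customers: pd.DataFrame) -> list[str]:
    """Return a list of accuracy violations found in a transactions dataset.

    Table and column names are illustrative; real checks would be derived from
    the business rules of your domain.
    """
    violations = []

    # Referential integrity: every transaction must reference a known customer
    orphaned = ~transactions["customer_id"].isin(customers["customer_id"])
    if orphaned.any():
        violations.append(f"{int(orphaned.sum())} transactions reference unknown customers")

    # Range check: discount percentage must be between 0 and 100
    bad_discount = ~transactions["discount_pct"].between(0, 100)
    if bad_discount.any():
        violations.append(f"{int(bad_discount.sum())} rows have discount_pct outside [0, 100]")

    # Business rule: delivery date must not precede order date
    late = pd.to_datetime(transactions["delivery_date"]) < pd.to_datetime(transactions["order_date"])
    if late.any():
        violations.append(f"{int(late.sum())} rows have delivery_date before order_date")

    # Uniqueness: transaction IDs must be unique at the transaction grain
    dupes = transactions["transaction_id"].duplicated()
    if dupes.any():
        violations.append(f"{int(dupes.sum())} duplicate transaction IDs")

    return violations
```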
Consistency
Consistency ensures that data does not contain contradictory information, both within a single dataset and across multiple datasets. It's about ensuring that different pieces of data related to the same entity or event align with each other. Inconsistent data can lead to confusion and mistrust, as users may not know which version of the data to rely on. For example, a customer's address might be recorded differently in the marketing database and the billing system. Another example is a mismatch between aggregated data and its underlying detailed records, such as individual sales not adding up to the reported total sales. This is particularly challenging in complex data environments with multiple data sources and different storage systems, where keeping copies of the same data in sync is a non-trivial challenge.
Operational checks for consistency include (see the reconciliation sketch below):
- Cross-table consistency checks: verifying that related data in different tables within the same database is consistent (for example, ensuring that the total sales in the summary table matches the sum of individual sales records in the transactions table)
- Cross-system consistency checks: comparing data across different systems (for example, customer contact information in CRM vs. ERP system)
- Historical consistency: ensuring that the meaning of data attributes does not change over time without proper versioning, or that historical data is not altered in a way that creates inconsistencies with current data (for example, product categories should remain consistent over time unless explicitly updated with versioning)
- Schema consistency: ensuring that the same entity is represented across different datasets with the same schema (for example, ensuring that customer IDs are represented consistently across sales, support and marketing datasets)
Architecturally, data consistency can be promoted through the use of canonical data models, master data management (MDM) solutions that provide a single source of truth for key business entities and robust ETL/ELT processes that include data reconciliation and standardization steps.
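As a small example of a cross-table consistency check, the sketch below reconciles a pre-aggregated summary table against the detail records it is supposed to summarize. The column names and the tolerance are assumptions for illustration only.

```python
import pandas as pd

def reconcile_summary(detail: pd.DataFrame, summary: pd.DataFrame,
                      key: str = "order_date", amount_col: str = "amount",
                      tolerance: float = 0.01) -> pd.DataFrame:
    """Compare per-key totals recomputed from a detail table against a summary table.

    Returns the keys where the two disagree beyond a small tolerance.
    """
    recomputed = detail.groupby(key, as_index=False)[amount_col].sum()
    merged = recomputed.merge(summary, on=key, suffixes=("_detail", "_summary"))
    merged["diff"] = (merged[f"{amount_col}_detail"] - merged[f"{amount_col}_summary"]).abs()
    return merged[merged["diff"] > tolerance]

# Example: daily sales detail vs. a pre-aggregated summary table
detail = pd.DataFrame({"order_date": ["2026-01-01"] * 2 + ["2026-01-02"],
                       "amount": [100.0, 50.0, 75.0]})
summary = pd.DataFrame({"order_date": ["2026-01-01", "2026-01-02"],
                        "amount": [150.0, 80.0]})
print(reconcile_summary(detail, summary))  # flags 2026-01-02 (75 vs. 80)
```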
Schema Stability
Schema stability refers to the consistency of data structures over time. It's not just about the schema being valid at a single point in time, but about how it evolves and the impact of that evolution on downstream systems and processes. Uncontrolled or unexpected schema changes are a major source of data pipeline failures, data corruption and analytical errors. For example, if a BI dashboard relies on a specific column named user_signup_date and an upstream system changes this column name to registration_date without proper communication or versioning, the dashboard queries will fail or return incorrect results. Similarly, if the data type of a column changes, for example from integer to string, it can lead to processing errors in ETL pipelines or incorrect aggregations in analytical queries. Architectural practices that promote schema stability include:
- Schema definition, evolution and monitoring: clear, centrally managed schemas with versioning, for example schema registries that track changes over time (such as Confluent Schema Registry for Kafka topics in Avro, Protobuf or JSON Schema formats). This comes with proper versioning strategies like backward, forward and full compatibility to ensure that changes do not break existing or future consumers. Backward compatibility is the most common strategy: consumers using the new schema can still read data produced with the old schema.
- Schema validation: automated schema checks during data ingestion and processing to ensure that incoming data conforms to the expected schema and does not introduce breaking changes. This also prevents invalid or unexpected data from entering the system.
Introducing a schema registry to enforce a data contract between producers and consumers requires not only proper tooling but also organizational alignment between teams and clear communication channels to coordinate changes. From a data engineering perspective, schema-on-read systems like data lakes are more vulnerable to silent failures due to schema drift, as structure is only enforced at query (read) time. In contrast, schema-on-write systems like data warehouses enforce structure at ingestion (write) time, providing stronger guarantees of schema stability. However, the vulnerability of schema-on-read systems can be mitigated with tools like cataloging, data profiling and automated schema validation during data ingestion.
Data quality expectations across workloads
As stated before, different data workloads have varying data quality expectations based on their specific use-cases. For example, striving for freshness by processing data in real time means that the data has less time for thorough validation and accuracy checks, potentially allowing more inaccuracies to slip through. On the other hand, a rigorous batch process prioritizes accuracy and completeness but introduces latency, making the data less fresh. Similarly, ensuring strict consistency across multiple copies of data in a distributed system can lead to reduced availability or increased latency due to the overhead of synchronization mechanisms, as described by the CAP theorem.
Pushing for extreme validity with very strict schemas and format rules can lead to rejecting records that, while not perfectly valid, could still contain valuable information, thereby impacting completeness. For example, an extra comma in a CSV file might cause the entire row to be discarded, losing all the other valid data in that row.
The operational reality of data quality is not a binary state of good or bad, but a spectrum, where the ideal point varies based on the specific use-case requirements, organizational priorities and resource constraints. A financial dashboard used for regulatory filings demands near-perfect accuracy, consistency and completeness but can tolerate some latency. A real-time dashboard monitoring website traffic might prioritize freshness and be more tolerant of minor inaccuracies or inconsistencies. A machine learning model powering a product recommendation engine would prioritize a large volume of diverse data and be more resilient to inaccuracies in individual data points.
The point is that a pragmatic approach involves understanding the specific requirements of each data consumer and use-case, leading to informed trade-offs. It requires clear communication between data producers and consumers to establish Service Level Agreements (SLAs) and Service Level Objectives (SLOs) that define acceptable levels of each dimension. A common pitfall is trying to achieve perfection across all dimensions for all data use-cases, which is often a recipe for paralysis, inefficiency and wasted resources. A more nuanced and context-aware approach is the way to go.
To summarize:
- Operational workloads, such as transaction processing systems, real-time monitoring dashboards or customer-facing applications, often prioritize freshness and availability. These systems need data that is up-to-the-second (or minute) and readily accessible to support immediate actions or decisions. While accuracy and validity are important, the tolerance for minor inconsistencies might be higher if it means achieving low latency and high throughput. For example, a real-time fraud detection system must analyze transactions as they happen, and while some delay is acceptable, the system typically tolerates false positives (flagging legitimate transactions as potentially fraudulent) over false negatives (missing actual fraud), as the cost of missing fraudulent activity far outweighs the inconvenience of temporarily blocking a legitimate transaction. Consistency is also crucial for the operational database itself, but eventual consistency is acceptable for replicated data used for less critical workloads.
- Analytical workloads involve querying large historical datasets to generate reports, identify trends and support strategic decision making. These workloads prioritize accuracy, completeness and consistency. Freshness is desired but typically less critical: a sales report for the last quarter needs to be highly accurate and complete, but a delay of hours or days is often acceptable. These workloads contain complex aggregations and joins across multiple datasets, making consistency a key concern.
- For machine learning workloads, the accuracy of labels and features is paramount for model performance. Completeness is also important, but machine learning models can often handle missing data through imputation techniques. Freshness of training data depends on the application: some models need to be trained on historical snapshots whereas others require near real-time data. Consistency is important to ensure that features are derived from stable and reliable data sources.
The Role of Architecture in Data Quality Outcomes
Data quality issues are rarely random occurrences; they are frequently the predictable outcome of systemic design choices and architectural patterns within an organization's data ecosystem. Many data quality failures stem from fundamental misalignments in how systems are designed to interact, how data flows through these systems and how responsibilities are distributed among teams. Issues like ambiguous ownership boundaries between upstream producers and downstream consumers, the reliance on implicit rather than explicit data contracts, and the inherent complexities of handling late-arriving or out-of-order data in event-driven systems are not just incidental problems. Silent failures in batch processes that go unnoticed for extended periods, and weak feedback loops that prevent timely communication of issues, are often symptoms of architectural shortcomings in monitoring, alerting and error handling mechanisms.
By recognizing these common patterns and root causes and linking them directly to architectural decisions, we can begin to see data quality not as a series of isolated incidents but as an emergent property of the overall system design. This shift in perspective is crucial for building more resilient and trustworthy data platforms, as it encourages engineers and architects to consider data quality implications at every stage of design and implementation, rather than treating it as an afterthought.
Upstream system ownership boundaries
Data rarely originates and resides in a single monolithic system. Instead, it flows through a complex web of applications, services and databases, each potentially owned and managed by different teams or even entirely separate business units. These boundaries create significant challenges for ensuring end-to-end data quality. The team producing the data, for example a product engineering team, is often focused on the functional requirements of their application such as user experience, performance and feature delivery. Data quality for downstream analytical or operational uses is often a secondary concern. This misalignment of priorities can lead to data being generated with inconsistencies, missing values or inadequate validation.
Downstream data consumers (for example: data engineering, analytics or data science teams) often have limited visibility into or control over these upstream systems. They are typically consumers of data provided via APIs, database extracts or CDC (change data capture) streams. If the quality of this source data is poor, downstream teams are left cleansing, transforming and attempting to correct issues that originated outside of their sphere of influence.
If an upstream application team changes a data type or removes a column without notifying downstream consumers, it can break multiple pipelines and reports, leading to a frantic scramble to identify and fix the issue. With frequent changes like this, the team ends up constantly firefighting data quality issues rather than focusing on delivering value through insights and analysis. Addressing this requires architectural patterns that promote better cross-team collaboration and shared responsibility. That means establishing clear data ownership and stewardship roles, implementing data contracts that explicitly define data expectations and creating centralized data catalogs or governance frameworks that provide visibility into data sources, their quality and their downstream dependencies.
With this "shift-left" approach to data quality, the data quality validation and monitoring are as close to the source as possible, catching issues early in the data lifecycle, prevents them from propagating through the downstream systems. However, this often requires a cultural shift within organizations and investment from teams whose primary focus isn't data production or analytics.
Implicit contracts between producers and consumers
Continuing from the previous point, once clear ownership boundaries are established, the next step is to formalize the expectations between data producers and consumers through explicit data contracts.
An implicit contract is an unwritten, often unspoken set of assumptions about the structure, format, content and behavior of data being exchanged between systems or teams. Examples include the data types of columns, the range of valid values, the frequency of data updates or the meaning of specific fields (semantics). While this works initially to get things moving, it becomes a major liability in complex, evolving data ecosystems.
Explicit Data Contracts
An explicit contract is a formal, machine-readable agreement between a data producer and a data consumer. It defines (a minimal sketch in code follows this list):
- Schema (field names, data types, nullability)
- Semantics (meaning of fields, business rules)
- Quality expectations (valid ranges, constraints)
- SLAs (freshness, availability)
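As one possible illustration, the sketch below captures parts of such a contract in Python, using pydantic for the schema and basic quality constraints, with semantics and SLOs as plain metadata. The field names, constraints, team name and SLO values are hypothetical, and YAML specs, Avro schemas or dedicated contract tools are equally valid representations.

```python
from datetime import datetime
from pydantic import BaseModel, Field

class OrderEvent(BaseModel):
    """Schema portion of a hypothetical 'order_events' contract.

    Field names, types and constraints are illustrative only.
    """
    order_id: str = Field(min_length=1)                 # unique business key
    customer_id: str = Field(min_length=1)              # must reference a valid customer
    amount: float = Field(ge=0)                         # quality expectation: non-negative
    currency: str = Field(min_length=3, max_length=3)   # ISO-4217 style code
    event_time: datetime                                # used for the freshness SLO

# Non-schema parts of the contract (ownership, semantics, SLOs) can live
# alongside the model as plain metadata that monitoring jobs read.
ORDER_EVENTS_CONTRACT = {
    "owner": "checkout-team",        # accountable producer team (hypothetical)
    "freshness_slo_minutes": 5,      # e.g. 95% of events within 5 minutes
    "availability_slo": 0.999,
    "semantics": {"amount": "gross order value including tax, in 'currency'"},
}
```

A validation step at the producer boundary could then instantiate OrderEvent for each outgoing record and quarantine failures, while a monitoring job compares observed freshness against freshness_slo_minutes.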
Tools and patterns already exist to support data contracts. Schema registries (like Confluent Schema Registry) allow producers to register schemas for their data streams and consumers can verify that incoming data conforms to these specifications.
Schema Evolution
Data formats like Avro and Protobuf (Protocol Buffers) allow for the definition of data schemas that can evolve in a controlled, backward-compatible manner. This ensures that changes to data schema do not break existing consumers, which are typically data pipelines, analytics or machine learning models. An example of this would be adding an optional field to a schema. Existing consumers that do not expect this field can continue to operate without any changes, while new consumers can take advantage of the additional information.
These binary formats can be used with schema registries to enforce compatibility rules during schema evolution. This prevents breaking changes from being introduced into the data stream. The kind of schema compatibility (backward, forward or full) can be defined based on the use-case requirements.
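To make the compatibility idea concrete, here is a deliberately simplified check for one backward-compatibility rule over Avro-style record schemas: fields added in a new schema must carry defaults so that data written with the old schema can still be read. Real schema registries enforce a much fuller rule set (type promotions, removed fields, nested records); this is only a sketch.

```python
def is_backward_compatible(old_schema: dict, new_schema: dict) -> tuple[bool, list[str]]:
    """Check one core backward-compatibility rule for Avro-style record schemas:
    every field added in the new (reader) schema must have a default value,
    otherwise data written with the old schema cannot be read.

    Simplified sketch only; real registries check many more rules.
    """
    old_fields = {f["name"]: f for f in old_schema["fields"]}
    problems = []
    for field in new_schema["fields"]:
        if field["name"] not in old_fields and "default" not in field:
            problems.append(f"new field '{field['name']}' has no default")
    return (len(problems) == 0, problems)

old = {"type": "record", "name": "Order", "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
]}
new = {"type": "record", "name": "Order", "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "channel", "type": ["null", "string"], "default": None},  # optional, safe
]}
print(is_backward_compatible(old, new))  # (True, [])
```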
Late-arriving and Out-of-order data
Late-arriving and out-of-order data are common challenges in event-driven and streaming-first systems. In an ideal world, events would be processed in the exact order they occurred, and all events for a given time window would arrive promptly. However, in distributed systems, network latencies, asynchronous processing and retries can lead to events arriving out of sequence or significantly delayed. For example, a user might perform an action on a mobile device while offline, and the corresponding event is only sent to the backend when the device reconnects. Or a batch process from an upstream system might be delayed, causing a chunk of data to arrive after more recent data has already been processed. If data pipelines are not designed to handle these scenarios, they can produce incorrect or inconsistent results.
Let's consider an example. A streaming pipeline calculates the total value of orders per minute. If an order with an event time of 10:05 AM arrives only after the 10:05 AM window has closed and the 10:06 AM window is being calculated, the total for 10:05 AM will be incorrect, and the late-arriving event might be wrongly attributed to 10:06 AM or dropped entirely.
In stream processing engines, watermarking is a common technique to handle late-arriving data. A watermark is a timestamp that indicates the system believes it has seen all events up to that time with a high degree of probability. Events arriving after the watermark for their timestamp has passed are considered "late data". How this late data is handled depends on the application requirements: it might be dropped, processed and used to update previous results, or diverted to a separate late-data stream for special handling. Windowing strategies also play a crucial role in managing out-of-order data:
- Tumbling windows are fixed-size and non-overlapping. They are simple to implement but can be less flexible in handling late data.
- Sliding windows can offer more flexibility but are more complex to manage.
Some systems allow for "grace periods" where late data can still be processed within a certain timeframe after the window has closed. This strikes a balance between timeliness and completeness, ensuring that late-arriving data can still contribute to the results without indefinitely delaying the output. The sketch below illustrates the idea.
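Here is a toy, framework-free sketch of tumbling windows with a watermark and an allowed-lateness (grace) period. Engines like Flink or Spark Structured Streaming provide this behaviour natively; the window size, lateness value and event times below are arbitrary assumptions for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60        # tumbling one-minute windows
ALLOWED_LATENESS = 30      # grace period after the watermark passes a window

def window_start(event_time: int) -> int:
    return event_time - (event_time % WINDOW_SECONDS)

class TumblingWindowAggregator:
    """Sums order values per one-minute window, tolerating some late events."""

    def __init__(self):
        self.totals = defaultdict(float)   # window start -> running total
        self.watermark = 0                 # highest event time seen so far
        self.dropped = []                  # events too late even for the grace period

    def process(self, event_time: int, value: float):
        self.watermark = max(self.watermark, event_time)
        w = window_start(event_time)
        window_end = w + WINDOW_SECONDS
        if self.watermark > window_end + ALLOWED_LATENESS:
            # Beyond the grace period: route to a late-data path instead of
            # silently corrupting results.
            self.dropped.append((event_time, value))
        else:
            self.totals[w] += value

agg = TumblingWindowAggregator()
agg.process(605, 20.0)   # lands in the [600, 660) window
agg.process(670, 15.0)   # advances the watermark into the next window
agg.process(610, 5.0)    # late, but within the 30s grace period -> still counted
agg.process(540, 9.0)    # too late for the [540, 600) window -> late-data path
print(dict(agg.totals), agg.dropped)
```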
For batch systems, handling late data often means implementing backfill mechanisms or reprocessing entire time ranges when delayed data arrives. This can be operationally complex and resource-intensive. In the end, this is a trade-off between data freshness and completeness of results.
Silent Failures in Batch Processes
Silent failures in batch pipelines can allow corrupted, incomplete or inconsistent data to propagate through the system unnoticed for extended periods, undermining the trust of all downstream consumers. Unlike streaming pipelines, which process data continuously and tend to fail more conspicuously, batch pipelines typically run on a schedule (nightly, hourly). If a batch job encounters an error that doesn't cause it to crash outright but, for example, silently skips some records, produces incorrect aggregations due to a logic error or fails to load a subset of the data, the problem might be discovered much later, perhaps when an end user reports a discrepancy in a report or dashboard. By then, significant damage might have been done and multiple downstream systems could be affected. Diagnosing the root cause can be challenging, especially if the error logs are insufficient or the problematic intermediate data has already been cleaned up.
From an architectural perspective, silent failures often stem from inadequate error handling, logging and alerting mechanisms embedded within pipeline code or the data transformation framework. Pipeline code should explicitly log important events and checkpoints and fail fast when encountering unrecoverable errors, rather than proceeding in an undefined state. All exceptions and error conditions should additionally be logged with sufficient context (the specific record, timestamp, stack trace and input parameters where applicable). Data quality checks must be embedded within batch pipelines to verify the integrity of intermediate data and final outputs. Simple checks include row count validations, checksum or hash comparisons and data profiling; a fail-fast sketch follows.
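As an illustration of the fail-fast idea, here is a minimal Python sketch of a batch step that validates row counts before publishing and logs enough context to diagnose what went wrong. The function names, thresholds and logging setup are assumptions, not a prescription.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("orders_batch_load")

class DataQualityError(Exception):
    """Raised when a batch output fails its quality checks."""

def load_orders_batch(extracted_rows: list[dict], expected_min_rows: int) -> list[dict]:
    """Transform and validate a batch before it is published downstream.

    Instead of silently loading whatever survived the transformation, the job
    fails loudly (and logs why) if the output looks suspicious.
    """
    logger.info("Batch started: %d rows extracted", len(extracted_rows))

    transformed = [r for r in extracted_rows if r.get("amount") is not None]
    dropped = len(extracted_rows) - len(transformed)
    if dropped:
        # Dropped records are logged with context rather than vanishing silently.
        logger.warning("Dropped %d rows with missing 'amount'", dropped)

    # Row count validation: fail fast instead of publishing a partial result.
    if len(transformed) < expected_min_rows:
        logger.error("Row count check failed: %d < expected %d",
                     len(transformed), expected_min_rows)
        raise DataQualityError("Output row count below expected minimum")

    logger.info("Batch checkpoint passed: %d rows ready to publish", len(transformed))
    return transformed
```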
Weak Feedback loops between Data Producers and Consumers
This one's more on the organizational side of things. In a healthy data ecosystem, there should be a rapid and effective channel for consumers to report data quality problems back to the producers, and for producers to acknowledge, investigate and resolve these issues. However, in many organizations, this feedback loop is either non-existent, slow or ineffective. Data consumers might discover issues during their analysis but lack a clear process to report these problems back to the data producers. Even when issues are reported, there might be delays in acknowledgment or resolution due to misaligned priorities, lack of ownership or insufficient resources on the producer side. The lack of a clear feedback mechanism means that data producers are often unaware of how their data is being used downstream and of the impact of data quality issues on business outcomes.
Architecturally, addressing weak feedback loops requires both technical and organizational solutions. As mentioned in the first point about ownership boundaries, establishing clear data stewardship roles and responsibilities is crucial. Data producers should have dedicated points of contact responsible for addressing data quality issues raised by consumers. Implementing issue tracking systems specifically for data quality problems can help formalize the reporting and resolution process. These systems should allow consumers to log issues with sufficient detail, track their status and receive updates on progress. Additionally, fostering a culture of collaboration and shared responsibility for data quality across teams is essential. Regular meetings or forums where data producers and consumers can discuss ongoing issues, share insights and align on priorities can help strengthen these feedback loops. From the technical side, implementing proper data lineage tracking helps producers quickly identify the source of reported issues and understand their downstream impact. This visibility enables faster diagnosis and resolution.
Common Pitfalls on the path to Data Trust
Even with a clear understanding of data quality dimensions and architectural considerations, organizations frequently stumble in their implementation. These missteps are not usually due to a lack of technical capability but rather stem from flawed assumptions, misaligned priorities and a misunderstanding of what it truly takes to build a culture of data trust. They often manifest as an over-reliance on technology as a silver bullet, a superficial treatment of observability as merely a collection of dashboards and the pushing of responsibility onto a single team. Many data teams also fail to connect data quality efforts to actual business outcomes. By examining some of these common pitfalls, organizations can gain a more realistic perspective on the challenge and adopt a more holistic and sustainable approach to achieving and maintaining data trust.
The aim is to reflect and encourage a shift from simplistic tool-focused solutions to deeper, systemic changes in how data is managed and valued across the enterprise.
Over-reliance on data quality tools
A prevalent misconception in the quest for data quality is the over-reliance on data quality tools as a primary solution. While data quality tools like profiling suites, cleansing applications or components within a broader data platform can enable data teams, they are not a silver bullet. Purchasing and deploying a tool without addressing the underlying systemic, architectural and organizational issues is like buying a high-end fire extinguisher while ignoring the fire hazards in the building. Tools facilitate processes; they don't define or guarantee success in data quality on their own.
The "Technology fixes everything" fallacy
There's a tendency in many organizations to believe that complex human and process problems can be solved simply by introducing new software. Data quality issues can originate from unclear business requirements, flawed data models, misaligned incentives between teams or poorly designed data architectures. A tool cannot magically fix these foundational problems. For instance, a data profiling tool might identify duplicate customer records, but it doesn't address the root cause: duplicates being created in multiple upstream systems that lack a unified customer identifier.
Underestimating the "Last Mile" problem
Tools might detect and even suggest fixes for data quality issues, but the actual implementation of those fixes, the redesign of flawed processes and the negotiation of data definitions across teams require human effort, political will and ongoing diligence. This can range from updating ETL pipelines, retraining staff and revising data governance policies to re-architecting data systems. These are non-trivial tasks that require time, resources and organizational commitment beyond just deploying a tool.
Neglecting Process and People
Effective data quality management requires robust processes (Data governance frameworks, clear data ownership, defined data quality workflows) and people who are empowered and skilled to execute these processes. Tools are meant to support these processes and people, not replace them. Without clear ownership, alerts from a data quality tool might go unaddressed. Without defined workflows, identified issues might not be prioritized or resolved effectively. Without skilled personnel, the insights provided by tools might be misinterpreted or ignored.
Assuming the Data Quality Tool/Framework is "Set and Forget"
Data quality is not a one-time project but an ongoing commitment. Data ecosystems are dynamic, with evolving data sources, changing business requirements and shifting organizational priorities. The technological, process and people investments made to ensure data quality need to be continuously revisited, refined and adapted to these changes.
A more effective approach views data quality tools as components of a broader strategy:
- Strong data governance: Clear policies, standards and ownership structures
- Well-defined data architectures: Designed for data quality from the ground up
- Skilled teams: Data stewards, engineers and analysts who understand the data and its context
- Robust processes: For issue detection, resolution and continuous improvement
- Cultural shift: Valuing data as a critical asset for decision making in the organization and the shared responsibility for its quality
Treating observability as just dashboards
Another thing that diminishes the potential impact of data observability is treating it as merely a collection of dashboards. Observability should be an integrated and actionable system that provides insight into the health, performance and quality of data systems. Beyond building dashboards, teams should build the underlying capabilities for detection, analysis, alerting and incident response.
Dashboard Warehousing
Teams invest significant effort in creating numerous dashboards, but if most of these remain static and are not actively monitored or acted upon, they provide little value. A sprawl of scattered dashboards can lead to alert and dashboard fatigue, where critical data issues are lost in a sea of metrics.
Lack of Actionable Insights
An operational dashboard can show that a metric has deviated, for example that data freshness has degraded for a certain data asset. But if it isn't accompanied by tools or processes to investigate, its utility is limited. True observability systems should connect the what (anomaly) to the why (root cause) and the how (resolution steps/playbooks).
Focusing on Vanity metrics
We all love to see our data pipelines running smoothly. However, focusing on vanity metrics like uptime percentages or number of processed records can create a false sense of security. These metrics don't necessarily reflect the actual quality or reliability of the data being produced. A pipeline might be running without errors but still produce inaccurate or incomplete data.
True observability is an active system that enables teams to:
- Detect anomalies and issues proactively through automated monitoring and alerting (see the sketch after this list)
- Diagnose root causes efficiently by providing rich context including metrics, logs and correlated events
- Predict potential failures by identifying patterns and degradations before they impact consumers/stakeholders
- Improve systems continuously by analyzing trends and feedback from incidents
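As a tiny example of proactive detection, the sketch below flags a daily ingestion volume that deviates sharply from recent history using a simple z-score. The threshold and history window are arbitrary assumptions; production systems often use seasonal baselines or dedicated anomaly-detection services instead.

```python
import statistics

def detect_volume_anomaly(history: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """Flag today's row count if it deviates strongly from recent history.

    `history` holds the daily row counts for, say, the last couple of weeks.
    """
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > z_threshold

# Example: a sudden drop in ingested rows triggers an alert
recent_counts = [10_250, 9_980, 10_400, 10_100, 9_875, 10_300, 10_050]
if detect_volume_anomaly(recent_counts, today=4_200):
    print("ALERT: today's ingested row count is anomalous; investigate upstream loads")
```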
Centralizing data quality responsibility in data teams
A common organizational pitfall is to push the responsibility for data quality entirely onto the data team (data engineering, data ops or data platform teams). While data teams play a crucial role in designing pipelines, implementing transformations and performing validation, they are often the last link in a long chain of data creation and modification. Attributing sole responsibility to them ignores the systemic nature of data and the critical roles played by other parts of the organization.
Upstream issues remain unaddressed
If source system owners (for example: product engineering teams) are not held accountable for the quality of the data they produce, the data team is left to deal with an endless stream of "garbage in, garbage out" scenarios. They can attempt to clean and transform the data, but this is often inefficient, imperfect and doesn't solve the root cause. For example, if a CRM application allows users to enter country names in free text without validation, the data team will constantly struggle to standardize these values downstream. This reaffirms the need for the clear ownership and stewardship discussed earlier.
Data Quality becomes a downstream bottleneck
Data teams can become overwhelmed with requests for data fixes, as downstream consumers struggle with poor quality/inconsistent data. This reactive firefighting diverts the teams' attention from more strategic initiatives like building new data products, improving architectures or enhancing observability. Instead of being enablers of data-driven decision making, they become gatekeepers constantly patching up data issues.
Ineffective data contracts
Data contracts, as we discussed earlier, are an explicit agreement between the data producers and consumers. If the contracts are incomplete, poorly defined or not enforced, data teams are left to interpret and fill in the gaps.
Democratizing the data quality effort means distributing responsibility across roles like:
- Data Producers (Source systems & Applications): Teams that develop and operate source systems are responsible for ensuring that the data they generate meets defined data accuracy, validity and completeness standards.
- Data Engineering teams: These teams are responsible for reliable data movement, schema validation and ensuring the integrity of data transformations. They implement data transformations and business logic, so they need to add robust testing and data quality checks within their pipelines.
- Data Consumers (Analytics, BI, ML, Business Users): These teams have a responsibility to validate that the data is fit for their specific use-cases. They should provide the right requirements, perform an adequate level of user acceptance testing (UAT) and report any data quality issues back to the producers.
- Data Governance & Stewardship: A cross-functional group (often representatives from various teams) that oversees data standards, policies, quality rules and compliance. They facilitate communication between producers and consumers, manage data contracts and ensure accountability.
Measuring data quality in isolation from business outcomes
Another common pitfall in data quality initiatives is measuring data quality without linking it to business outcomes. Data quality metrics expressed in purely technical terms, like "reduce null values by 10%" or "99% of data passes schema validation", can be difficult for business stakeholders to grasp in terms of their actual impact. In isolation, these metrics feel abstract and detached from the core objectives of the organization. This can lead to a lack of buy-in, insufficient resources and a general perception that data quality is a "tech problem" rather than a business concern. Therefore, data quality measurement and reporting must be translated into tangible business outcomes:
Identifying business-critical data assets
Not all data is created equal (pun intended). The focus should be on the datasets and metrics that have the most significant impact on key business decisions, KPIs, operational efficiency, customer experience or regulatory compliance.
Quantifying the cost of poor data quality
This can be challenging to do well, but it is highly impactful.
- Direct costs: Wasted engineering time spent on finding and correcting data, rework due to flawed reports, costs associated with incorrect data exports to external partners or regulatory fines due to non-compliance.
- Opportunity costs: Lost revenue due to bad decisions based on faulty data, missed market opportunities, ineffective marketing campaigns or delayed product launches.
- Customer impact: Customer churn, dissatisfaction or support costs due to errors in customer-facing data (billing, wrong recommendations).
Translating data quality metrics to business KPIs
Directly connect data quality dimensions to improvements in key business KPIs. For example:
- How did improving the accuracy of customer segmentation data lead to measurable increase in campaign conversion rates?
- How did enhancing the freshness of risk data improve fraud detection rates and reduce financial losses?
- How did ensuring the completeness of sales data lead to more accurate revenue forecasting?
Communicating in business terms
When reporting data quality initiatives to leadership or stakeholders, frame the progress in terms of business value delivered, not just technical metrics:
- We reduced the time to detect fraud by 20%, enabling us to prevent $X in potential losses.
- Enhancing data accuracy for our financial data reporting led to a Y% reduction in audit findings, saving $Z in compliance costs.
Conclusion
The journey towards robust data quality and effective observability is not a destination but an ongoing process of continuous improvement. It requires a holistic approach that encompasses not just technical solutions but also organizational alignment, cultural shifts and systemic changes in how data is managed and valued across the company.
Similarly, observability is not just about passive monitoring but about building active systems that enable detection, diagnosis and improvement of data systems. By avoiding common pitfalls such as over-reliance on tools, superficial treatment of observability, centralized responsibility and disconnected metrics, organizations can build a more resilient and trustworthy data ecosystem. Ultimately, the goal is to foster a culture where data quality is everyone's responsibility and where data is recognized as a critical asset that drives informed decision-making and business success.
I will continue this discussion in the next post, where I will explore patterns for building data quality into the architecture of data systems. These include traditional data warehousing patterns as well as modern data lakehouse and streamhouse architectures.
If you're a decision maker struggling with data quality challenges in your organization, you don't have to navigate this journey alone. I help companies architect robust data systems that deliver reliable, trustworthy insights.
Book a free 30-minute consultation to discuss:
- Your specific data quality challenges and pain points
- Architectural patterns that fit your organization's needs
- Strategies for building data observability and governance
- Roadmaps for implementing sustainable data quality practices
Whether you’re dealing with silent pipeline failures, inconsistent data across systems, or struggling to secure stakeholder buy-in for data quality initiatives, let’s explore how the right architectural approach can transform your data ecosystem.
