You Can’t Trust COUNT and SUM: Scalable Data Validation with Merkle Trees

The Problem Is Subtle — and Everywhere

Data pipelines are the backbone of modern analytics, but they’re more fragile than we like to admit.

Imagine a typical workflow: you extract raw data from a source like PostgreSQL, move it to a staging layer, enrich it with joins or calculations, and load it into a data warehouse like Snowflake or BigQuery. From there, it’s transformed again—aggregated for dashboards, filtered for machine learning models, or reshaped for business reports. One dataset spawns multiple versions, each tailored for specific teams or tools, often spanning different databases, cloud platforms, or even external systems.

At every step, the data evolves, but its core truth must remain intact. A single source dataset and its derivatives—whether in staging, production, or analytics layers—need to stay consistent, no matter how they’re sliced or processed. Yet, things go wrong, quietly.

To catch issues, most engineers rely on quick checks: COUNT(*), SUM(amount), or filters for NULLs. These are lightweight and give a sense of control. But they’re deceptive:

  • A row gets dropped in a join while another is duplicated by a CDC stream? COUNT(*) won't flinch.

  • A column's values get rounded by a type mismatch (say, float to integer)? SUM() can hide the drift, especially when the errors offset each other.

  • A faulty mapping overwrites a column with NULLs? Aggregations sail right past, as the quick demo below shows.
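
A minimal, self-contained illustration of this blind spot, using invented inline data: two versions of the same table report identical COUNT(*) and SUM(amount) even though every amount has changed.

WITH v1 AS (
  SELECT 1 AS id, 100 AS amount UNION ALL SELECT 2, 200  -- toy "before" data
),
v2 AS (
  SELECT 1 AS id, 150 AS amount UNION ALL SELECT 2, 150  -- toy "after" data
)
SELECT 'v1' AS version, COUNT(*) AS row_count, SUM(amount) AS total FROM v1
UNION ALL
SELECT 'v2', COUNT(*), SUM(amount) FROM v2;
-- Both versions report 2 rows and a total of 300, yet no row is the same.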

These aren’t hypothetical bugs—they’re daily realities in ETL jobs, cloud migrations, and data syncs. Full row-by-row comparisons could catch these issues, but they’re impractical for large datasets, grinding to a halt on billions of rows. The reality is clear: we need a way to verify data integrity—across all versions and platforms—quickly, scalably, and without moving terabytes of data.

This is where Merkle Trees come in.

Enter Merkle Trees: Hierarchical Integrity at Scale

Merkle Trees weren’t built for data pipelines — but they’re a game-changer for anyone wrestling with data integrity.

At their core, Merkle Trees create a compact "fingerprint" of your dataset using a hierarchical structure of hashes. You start by hashing individual rows, then build layers of hashes up to a single root hash that represents the entire table. If even one value changes, the root hash shifts, and you can drill down to find the exact mismatch — all with simple SQL queries, right in your database.

Here’s how you can structure it for a data pipeline:

Row hashes: For each row, compute a hash (e.g., md5(concat_ws('|', col1, col2, col3))) based on the columns you want to track (or the entire row).
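
One practical caveat: concat_ws silently skips NULL arguments, so two different rows can end up with the same hash string. A more defensive variant, sketched in the Spark-flavored SQL used later in this article (the table and column names match that example):

SELECT
  md5(concat_ws('|',
    coalesce(cast(sales_amount AS string), '<null>'),
    coalesce(cast(customer_id  AS string), '<null>'),
    coalesce(cast(product_id   AS string), '<null>')
  )) AS row_hash
FROM sales_table;
-- Every column is cast to string explicitly and NULLs get their own marker,
-- so NULL and empty values no longer collide.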

Your custom hierarchy (group and higher-level hashes, tailored to your data):

Countries: For each country on a given date, aggregate the row hashes into a country-level hash (e.g., md5 over the sorted, concatenated row hashes for that country and day).

Dates: For each date, aggregate the country hashes of that day into a date-level hash.

Months: For each month, aggregate its date hashes into a month-level hash.

Table (root hash): At the top, a single hash captures the entire table’s integrity.

If the root hashes of two tables differ, you can trace the issue down through the layers — from table to month, date, or country, down to a single row — in seconds. No need to export billions of rows or crunch data over the network; all computations happen locally in your database (BigQuery, Redshift, Snowflake, etc.).

This isn’t just theory. Merkle Trees power robust systems like:

  • Git, tracking changes in massive codebases.

  • Blockchain networks like Bitcoin and Ethereum, verifying transactions across untrusted nodes.

  • Distributed databases like Cassandra and DynamoDB, catching replication errors.

  • Data sync and reconciliation tooling around pipelines built on Kafka MirrorMaker or Fivetran, where the same hash-comparison idea verifies that transfers are complete.

Why do these systems rely on Merkle Trees? Because they deliver proof of data consistency, lightning-fast detection of changes, and scalability to billions of rows.

For data engineers, this is a lifeline. Instead of slogging through row-by-row comparisons, you compare a handful of hashes to catch any data drift. By organizing hashes hierarchically (by date, region, or other logical groups), you can pinpoint mismatches — like an error in Germany or on a specific day — all within your existing SQL engine.

Picture it as a map of your data’s integrity — compact, precise, and ready to guide you to any issue.

Applying It to Data Pipelines

Merkle Trees sound great in theory, but how do you actually use them to keep your data pipelines honest?

The idea is simple: use the hierarchical hash structure described earlier (rows → countries → dates → months → table) to create a verifiable fingerprint of your data. Then, compare these hashes to catch inconsistencies and drill down to the root of the problem — all within your existing SQL engine.

Start by computing hashes for the columns or rows you want to track. You can group them by logical categories like source_system (e.g., CRM, ERP), campaign_id, or region, tailoring the hierarchy to your pipeline’s needs. The key is that each level’s hash depends on the hashes below it, creating a chain of trust from rows to table.

To compare two tables (say, staging vs. production in BigQuery), check their root hashes. If they match, the tables are identical. If not, drill down through the hierarchy — months, dates, countries — to pinpoint the mismatch, like a corrupted row in Germany on 2025-07-01. All of this happens locally, without exporting data, and scales to billions of rows.

For example, a single SQL query can build the entire hash hierarchy:

-- This query computes hierarchical hashes for table validation.
-- It creates: row_hash → country_hash → date_hash → month_hash → table_hash.

WITH row_hashes AS (
  SELECT
    md5(concat_ws('|', sales_amount, customer_id, product_id)) AS row_hash,
    country,
    event_date,
    EXTRACT(MONTH FROM event_date) AS event_month
  FROM sales_table
),

country_hashes AS (
  SELECT
    country,
    event_date,
    event_month,
    -- sort the hashes before concatenating so the result does not depend on row order
    md5(concat_ws('|', sort_array(collect_list(row_hash)))) AS country_hash
  FROM row_hashes
  GROUP BY country, event_date, event_month
),

date_hashes AS (
  SELECT
    event_date,
    event_month,
    md5(concat_ws('|', sort_array(collect_list(country_hash)))) AS date_hash
  FROM country_hashes
  GROUP BY event_date, event_month
),

month_hashes AS (
  SELECT
    event_month,
    md5(concat_ws('|', sort_array(collect_list(date_hash)))) AS month_hash
  FROM date_hashes
  GROUP BY event_month
)

-- Final aggregation: hash representing the entire table
SELECT
  md5(concat_ws('|', sort_array(collect_list(month_hash)))) AS table_hash
FROM month_hashes;

-- Sorted aggregation per engine (so the hash does not depend on row order):
-- In Spark / Databricks:
--     concat_ws('|', sort_array(collect_list(row_hash)))
-- In BigQuery:
--     STRING_AGG(row_hash, '|' ORDER BY row_hash)
--     (BigQuery's MD5 returns BYTES; wrap it in TO_HEX for a string hash)
-- In PostgreSQL:
--     string_agg(row_hash, '|' ORDER BY row_hash)
-- In Redshift / Snowflake:
--     LISTAGG(row_hash, '|') WITHIN GROUP (ORDER BY row_hash)

This query builds the hashes for rows, countries, dates, and months in its CTEs and returns the table-level hash; select from (or materialize) the intermediate CTEs when you want to store or compare results at each level. It's lightweight and fast, perfect for daily ETL checks, CDC validations, or cloud migrations.

You can save these hashes in a table for historical snapshots or retrospective analysis, like tracking how data evolves over time. However, for real-time comparisons, always compute hashes on the fly. Hashes stored in a table reflect the data at the moment they were calculated — if the underlying data changes, those hashes become outdated. To ensure accuracy, run the query online during validation to capture the current state of your tables.
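
If you do persist them, one possible layout is a small snapshot table, shown here as a sketch (the table name, schema, and the idea of materializing the month_hashes CTE as a view or temporary table are assumptions, not a fixed design):

CREATE TABLE IF NOT EXISTS merkle_snapshots (
  snapshot_ts TIMESTAMP,
  table_name  STRING,
  level       STRING,   -- 'table', 'month', 'date', 'country'
  level_key   STRING,   -- e.g. '2025-07' or 'DE|2025-07-01'
  hash_value  STRING
);

-- month_hashes stands for the month-level CTE above, materialized beforehand.
INSERT INTO merkle_snapshots
SELECT current_timestamp(), 'sales_table', 'month',
       cast(event_month AS string), month_hash
FROM month_hashes;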


How to Use Hash Trees in Practice

Merkle Trees make data validation precise and fast, catching errors and pinpointing their location in any pipeline. Here’s how they transform key data engineering tasks.

Comparing Tables Across Environments

Moving data between staging, production, or databases like PostgreSQL and BigQuery risks inconsistencies. Merkle Trees compare root hashes in seconds, and when they differ, drilling down the hierarchy points to the specific partition or row in, say, a sales table, letting you fix issues fast. Group by region, system, or channel to fit your pipeline, perfect for validating migrations or enriched layers.
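
A sketch of the drill-down step, assuming the date-level hashes from the query above have been materialized for each environment (the names staging_date_hashes and prod_date_hashes are illustrative):

-- staging_date_hashes / prod_date_hashes: placeholder tables with event_date + date_hash
SELECT
  coalesce(s.event_date, p.event_date) AS event_date,
  s.date_hash AS staging_hash,
  p.date_hash AS prod_hash
FROM staging_date_hashes s
FULL OUTER JOIN prod_date_hashes p
  ON s.event_date = p.event_date
WHERE coalesce(s.date_hash, '') <> coalesce(p.date_hash, '');
-- Every row returned is a day where the two environments disagree;
-- repeat the same join at the country level to narrow it down further.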

Daily DAG Reconciliation

Daily ETL jobs in Airflow or dbt need constant checks. Store table hashes after each run to spot errors, like a missing batch in a CRM dataset, and alert via Slack. Automate the comparison with dbt hooks or Airflow tasks, and add CI/CD checks for schema updates, for reliability that goes beyond COUNT(*).
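
As a sketch, a dbt singular test (a .sql file under tests/; the model names are placeholders) could fail the run whenever the staging and production root hashes diverge, since dbt treats any returned row as a test failure:

-- tests/assert_merkle_root_hashes_match.sql (hypothetical file name)
WITH staging AS (
  SELECT table_hash FROM {{ ref('sales_table_hash_staging') }}
),
prod AS (
  SELECT table_hash FROM {{ ref('sales_table_hash_prod') }}
)
SELECT s.table_hash AS staging_hash, p.table_hash AS prod_hash
FROM staging s
CROSS JOIN prod p
WHERE s.table_hash <> p.table_hash;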

Validating CDC and Synchronization

CDC pipelines (e.g., PostgreSQL to BigQuery) or sync tools like Kafka MirrorMaker can drop events. Comparing hashes per time window catches mismatches inside a streaming job's window, keeping syncs trustworthy without heavy scans.
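
A sketch of such a windowed check, assuming both sides expose row hashes alongside an event timestamp (cdc_source_hashes and cdc_target_hashes are placeholder names):

WITH src AS (
  SELECT date_trunc('hour', event_ts) AS window_start,
         md5(concat_ws('|', sort_array(collect_list(row_hash)))) AS window_hash
  FROM cdc_source_hashes   -- placeholder: row hashes on the source side
  GROUP BY date_trunc('hour', event_ts)
),
tgt AS (
  SELECT date_trunc('hour', event_ts) AS window_start,
         md5(concat_ws('|', sort_array(collect_list(row_hash)))) AS window_hash
  FROM cdc_target_hashes   -- placeholder: row hashes on the target side
  GROUP BY date_trunc('hour', event_ts)
)
SELECT coalesce(s.window_start, t.window_start) AS window_start
FROM src s
FULL OUTER JOIN tgt t
  ON s.window_start = t.window_start
WHERE coalesce(s.window_hash, '') <> coalesce(t.window_hash, '');
-- Each row returned is an hour of data that does not match between source and target.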

Why It Matters

Unlike COUNT or SUM, which miss silent errors, Merkle Trees guarantee integrity in BigQuery, Redshift, or PostgreSQL, scaling to billions of rows. They fit dbt, Airflow, or CI/CD workflows, revealing where and why issues occur — a single row or a faulty partition. Adopt Merkle Trees to make your pipelines rock-solid.

How It Compares to Traditional Methods

Data engineers need validation that ensures every row is correct, not just rough totals. Traditional methods fall short, while Merkle Trees deliver precision. Here’s how they compare:

  • COUNT(*) checks row counts but misses altered or replaced rows if the total stays the same.

  • SUM(col) tracks numeric totals but overlooks rounding errors or changed values that preserve the sum, hiding issues.

  • EXCEPT detects row differences accurately but scales poorly, slowing down on large datasets.

  • Merkle Trees ensure full integrity with a single SQL query, scaling to billions of rows with modern cloud infrastructure (e.g., BigQuery, Snowflake) and pinpointing mismatches, like a faulty row in a sales table.

Unlike COUNT or SUM, which mask subtle errors, or EXCEPT, which struggles with big data, Merkle Trees are fast and precise, the go-to for robust pipelines.

Tool Integration for Merkle Trees in Data Pipelines

To make Merkle Trees even more practical for data pipelines, they can be seamlessly integrated with modern tools and libraries. This streamlines automation, boosts performance, and simplifies embedding data validation into existing workflows. Here are some key approaches:

  • Apache DataSketches: This open-source library from Apache provides high-performance hashing and probabilistic data structures. In the context of Merkle Trees, DataSketches can optimize hash computation for large datasets using structures like Theta Sketches to create compact representations of massive data volumes. For example, instead of concatenating millions of row hashes in collect_list, Theta Sketches can generate approximate yet efficient group-level hashes (e.g., for countries or dates), reducing memory overhead in platforms like BigQuery or Spark.

  • dbt Macros: dbt (data build tool) is ideal for automating data validation in pipelines. You can create a dbt macro that generates SQL queries for building the Merkle Tree hash hierarchy (as shown in the example). For instance, the macro could take a list of columns and groupings (e.g., country, date, month) and produce a query to compute hashes. This enables you to integrate integrity checks into your dbt models, running them as tests (e.g., dbt test) or embedding them in CI/CD pipelines for automated validation (a rough sketch follows this list).
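
A rough sketch of what such a macro could look like (the macro name, its arguments, and the Spark-flavored functions are assumptions; a fuller version would generate every level of the hierarchy instead of a single grouping):

{% macro merkle_group_hashes(relation, hash_columns, group_column) %}
WITH row_hashes AS (
    SELECT
        {{ group_column }},
        md5(concat_ws('|', {{ hash_columns | join(', ') }})) AS row_hash
    FROM {{ relation }}
)
SELECT
    {{ group_column }},
    md5(concat_ws('|', sort_array(collect_list(row_hash)))) AS group_hash
FROM row_hashes
GROUP BY {{ group_column }}
{% endmacro %}

In a model or singular test, it could then be invoked as {{ merkle_group_hashes(ref('sales_table'), ['sales_amount', 'customer_id', 'product_id'], 'country') }} and compared against the same call on another environment.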

Complementary Validation with Bloom Filters

While Merkle Trees provide precise, scalable data integrity checks for pipelines, they can be paired with other techniques for added efficiency. Bloom Filters, a probabilistic data structure, are a strong complement. They quickly verify if rows or keys (e.g., customer IDs) exist across datasets, like checking for missing records in a CDC sync from PostgreSQL to BigQuery. Bloom Filters are fast and memory-efficient but may yield false positives and can’t locate specific mismatches. Use them for rapid initial scans in high-throughput pipelines, followed by Merkle Trees’ hierarchical hashes to pinpoint and resolve discrepancies. This combination ensures speed and precision in ETL jobs or cloud migrations, leveraging SQL-native workflows.

Final Thoughts: Why Consider Merkle Trees

Merkle Trees provide a robust way to ensure data integrity in pipelines. They work seamlessly with SQL engines like BigQuery, Redshift, Snowflake, or PostgreSQL, and integrate with dbt or Airflow. A single query validates massive datasets with precision, pinpointing errors efficiently. For critical data, such as financial reporting or compliance, consider applying Merkle Trees to enhance reliability. Exploring this approach in your ETL jobs or migrations can strengthen trust in your data.

Know not just that your data broke — but exactly where.
