TIL how to investigate missing data easily with DuckDB
⋅ 2 minute read
In data engineering pipelines we commonly store output data in S3 buckets at various stages.
In this particular case we stored the input data received from customers as csv files in the S3 bucket s3://customer-ingested/customerA.csv
and after some transformations we stored the transformed data as csv files in another S3 bucket s3://customer-transformed/customerA.csv
.
Problem
For some of the customers the csv file was missing in the first bucket customer-ingested
while we were sure that all customers had a file in customer-transformed
. However, there was a legit reason why the file could be missing which was that in the application the customer had a state of is_deactivated = True
.
The goal of this investigation was to find all the active customers for which we were missing data in the bucket customer-ingested
.
Solution
The nice thing about DuckDB in this scenario is that I can combine filesystem data with information from an application table and work in (familiar) SQL land.
Here is the query:
with transformed as (
select regexp_extract(file, '/([^/]+)\.csv$', 1) AS filenames_transformed from glob('s3://customer-transformed/*.csv')
),
ingested as (
select regexp_extract(file, '/([^/]+)\.csv$', 1) AS filenames_ingested from glob('s3://customer-ingested/*.csv')
),
customer_status as (
select customer_name, is_deactivated from read_parquet('s3://application_table_snapshot/customers.parquet')
)
select * from transformed
left join ingested on transformed.filenames_transformed = ingested.filenames_ingested
left join customer_status as cs on transformed.filenames_transformed = cs.customer_name
where is_deactivated = False and filenames_ingested is null;
┌───────────────────────┬────────────────────┬────────────────────┬────────────────┐
│ filenames_transformed │ filenames_ingested │ customer_name │ is_deactivated │
│ varchar │ varchar │ varchar │ varchar │
├───────────────────────┼────────────────────┼────────────────────┼────────────────┤
│ customerT │ NULL │ customerT │ False │
│ customerW │ NULL │ customerW │ False │
│ customerZ │ NULL │ customerZ │ False │
└───────────────────────┴────────────────────┴────────────────────┴────────────────┘
In the CTE, I am creating a table transformed
with all the filenames in the S3 bucket customer-transformed
. I am using DuckDB’s utility function glob(search_path)
to retrieve the filenames. I do the same for the bucket customer-ingested
.
To get the customer status I read from a recent replication snapshot of the application database to get customer_status
.
Finally, I join the three expressions to identify missing files (missing values) in the column filenames_ingested
for active customers.
If you have any thoughts, questions, or feedback about this post, I would love to hear it. Please reach out to me via email.
Tags:#duck-db #sql