staticnotes.org

DuckDB use cases for data scientists: Querying remote S3 files

5 minute read


DuckDB is a pretty cool in-process analytical (OLAP) database that I have started to spin up on the fly for quick data analysis. What SQLite is to Postgres, DuckDB is to Snowflake. It is a single executable without dependencies and stores databases in local files.

I can think of four use cases for data science work:

  1. DuckDB supports larger-than-memory workloads by streaming data instead of loading the full dataset into memory. You can use it to analyse datasets that are too large for Pandas (and too small to justify PySpark); see the short sketch after this list.
  2. I can query CSV, parquet, and JSON files directly from remote endpoints, e.g. S3, using SQL.
  3. I can replace Snowflake queries with DuckDB queries in unit / integration tests.
  4. I can set up a (DuckDB + dbt) data warehouse for local development.
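
As a minimal sketch of the first use case (the file and column names are made up for illustration), a CSV that would be uncomfortable to load into a Pandas dataframe can be aggregated directly:

-- aggregate a large local CSV without loading it fully into memory
-- (hypothetical file and column names)
SELECT user_id, count(*) AS num_events
FROM read_csv_auto('big_clickdata.csv')
GROUP BY user_id
ORDER BY num_events DESC;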

I want to share my workflow for the second use case here. Inspecting parquet files in AWS S3 is a pain because I can’t easily view them in the AWS console. For the past few months I have been using DuckDB to load, inspect, and analyse parquet files from the command line. I found this reduces my cognitive load in situations where I quickly want to check a remote file, because I don’t have to download the parquet file and write a Python script to inspect it.

Installing DuckDB

I use macOS and can install DuckDB via homebrew: brew install duckdb. To work with remote files I also install the httpfs extension.

I start the DuckDB shell with duckdb and run the SQL commands:

INSTALL httpfs;
LOAD httpfs;

Authentication with AWS

To load files from S3, I need to configure AWS credentials. I am assuming here that your workplace has set up AWS SSO temporary credentials, but this also works with static credentials (ACCESS_KEY_ID, SECRET_ACCESS_KEY). There are two ways of doing this:

1. Set AWS credentials in the session:

I can set AWS credentials inside a DuckDB session like this:

LOAD httpfs;
SET s3_region = 'eu-west-2';
SET s3_access_key_id = '???';
SET s3_secret_access_key = '???';

2. Let aws-vault handle credentials

I use the AWS credentials management tool aws-vault, which manages my temporary AWS credentials in the background and exposes them to my shell sub-process. I start the DuckDB shell with aws-vault exec [profile-name] -- duckdb, which ensures that AWS credentials are automatically set and updated.
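
If a DuckDB session does not pick up these environment variables automatically (behaviour can differ between DuckDB versions, so treat this as a hedged sketch rather than part of the workflow above), the aws extension can load them from the standard AWS credential chain:

INSTALL aws;
LOAD aws;
LOAD httpfs;
-- read credentials from the AWS credential chain (e.g. the variables
-- exported by aws-vault) and apply them to the current session
CALL load_aws_credentials();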

Querying parquet files

Let’s assume that I have multiple parquet files stored in an S3 bucket work-project and they all have the same schema:

s3://work-project/clickdata_001.parquet
s3://work-project/clickdata_002.parquet
[...]
s3://work-project/clickdata_100.parquet

These files might be placed in this S3 bucket as part of an ETL pipeline before they are loaded into a table in a data warehouse. Let’s imagine I want to investigate a bug in the pipeline and need to inspect the files.

Inspect schema of parquet file

To familiarize myself with the schema I run parquet_schema

LOAD httpfs; -- once at the beginning of the session
SELECT * FROM parquet_schema('s3://work-project/clickdata_001.parquet');

which will return something like:

clicked_at::timestamp
user_id::int
event_type::string
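
If I only need the column names and the types DuckDB will read them as, a plain DESCRIBE over the same file works too (a sketch; the reported type names depend on the DuckDB version):

DESCRIBE SELECT * FROM read_parquet('s3://work-project/clickdata_001.parquet');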

Query a parquet file

I can use DuckDB’s Postgres-like SQL dialect to query the parquet files with the read_parquet function, e.g.

-- count the number of events in the file
SELECT count(*) FROM read_parquet('s3://work-project/clickdata_001.parquet');

-- find the last event time of a particular user
SELECT max(clicked_at) FROM read_parquet('s3://work-project/clickdata_001.parquet') WHERE user_id = 1234;

Query multiple parquet files

Say I want to find the users with the highest number of events across all files. I can use glob syntax with * to run a query against all files as if they were one table:

SELECT user_id, count(*) AS num_events FROM read_parquet('s3://work-project/*.parquet') GROUP BY user_id ORDER BY num_events DESC;

Create table from files

Let’s assume that all clickdata files together have a size of 500MB. It would be annoying if I had to download these files for every query that I want to run. Let’s instead create a table from the files.

CREATE TABLE clickdata AS SELECT * FROM read_parquet('s3://work-project/*.parquet');

This table is stored in memory until I close the shell, which allows me to run different queries against it until I am done with my analysis, e.g.

SELECT * FROM clickdata;
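
Once the data is local I can iterate on queries without re-downloading anything from S3, for example (using the column names from the example schema above):

-- events per day, computed against the local clickdata table
SELECT date_trunc('day', clicked_at) AS day, count(*) AS num_events
FROM clickdata
GROUP BY day
ORDER BY day;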

I can also store the table in a local database on my computer in case I need to work with the data for a longer time. The following SQL statement

EXPORT DATABASE 'mydatabase' (FORMAT PARQUET);

will create a local folder mydatabase which stores the tables currently in memory. In a later duckdb session I can reload the table from the database using the following command

IMPORT DATABASE 'mydatabase';

Query column statistics

A nice feature of the parquet file format is that it stores statistics about each column, e.g. min, max, null_count, distinct_count. If you are interested, you can inspect parquet file metadata using SELECT * FROM parquet_metadata('file.parquet'). Assume the click event dataset is several gigabytes in size and I want to identify the earliest clicked_at time. I would run the query below.

SELECT min(clicked_at) FROM read_parquet('s3://work-project/*.parquet');

Fortunately, DuckDB uses the column statistics stored in the parquet files to compute the answer without having to download the whole dataset from S3. Simon Willison shows a more extreme example of this feature.
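
To see the statistics DuckDB can rely on, parquet_metadata can be pointed at one of the files. This is a sketch; the exact names of the statistics columns are an assumption and may vary between DuckDB versions:

-- per-row-group statistics for the clicked_at column
-- (column names such as stats_min_value are assumptions; check your DuckDB version)
SELECT path_in_schema, stats_min_value, stats_max_value, stats_null_count
FROM parquet_metadata('s3://work-project/clickdata_001.parquet')
WHERE path_in_schema = 'clicked_at';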

DuckDB is a great tool for quick investigations of remotely hosted files. Especially after configuring automatic handling of AWS credentials, I can spin up a duckdb shell with a single command. I hope it becomes a time saver for you too.

If you have any thoughts, questions, or feedback about this post, I would love to hear it. Please reach out to me via email.

#data-engineering   #data-science   #sql   #duck-db