DuckDB use cases for data scientists: Querying remote S3 files
DuckDB is a pretty cool in-process analytical (OLAP) database that I have started to spin up on the fly for quick data analysis. What SQLite is to Postgres, DuckDB is to Snowflake. It is a single executable without dependencies and stores databases in local files.
I can think of four use cases for data science work:
- DuckDB supports larger-than-memory workloads by loading data sequentially. You can use it to analyse datasets that are too large for Pandas (and too small to justify PySpark); see the sketch after this list.
- I can query CSV, parquet, and JSON files directly from remote endpoints, e.g. S3, using SQL.
- I can replace Snowflake queries with DuckDB queries in unit / integration tests.
- I can set up a (DuckDB + dbt) data warehouse for local development.
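To illustrate the first point, here is a minimal sketch, assuming a hypothetical events.csv that is too large to load into a Pandas DataFrame; DuckDB can aggregate it without reading the whole file into memory at once:
-- minimal sketch: aggregate a CSV that would not fit into a Pandas DataFrame
-- 'events.csv' and its user_id column are hypothetical
SELECT user_id, count(*) AS num_events
FROM read_csv_auto('events.csv')
GROUP BY user_id
ORDER BY num_events DESC;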
I want to share here my workflow for the second use case. Inspecting parquet files in AWS S3 is a pain because the AWS console offers no easy way to look inside them. For the past few months I have been using DuckDB to load, inspect, and analyse parquet files from the command line. I found this reduces my cognitive load when I quickly want to check a remote file, because I don’t have to download the parquet file and write a Python script to inspect it.
Installing DuckDB
I use macOS and can install DuckDB via Homebrew: brew install duckdb. To work with remote files I also install the httpfs extension. I start the DuckDB shell with duckdb and run the SQL commands:
INSTALL httpfs;
LOAD httpfs;
Authentication with AWS
To load files from S3, I need to configure AWS credentials. I am assuming here that your workplace has configured AWS SSO temporary credentials, but this also works with static credentials (ACCESS_KEY_ID, SECRET_ACCESS_KEY). There are two ways of doing this:
1. Set AWS credentials in the session:
I can set AWS credentials inside a DuckDB session like this:
LOAD httpfs;
SET s3_region = 'eu-west-2';
SET s3_access_key_id = '???';
SET s3_secret_access_key = '???';
2. Let aws-vault handle credentials
I use the AWS credentials management tool aws-vault (see here), which manages my temporary AWS credentials in the background and exposes them to my shell sub-process. I start the DuckDB shell with aws-vault exec [profile-name] -- duckdb, which ensures that AWS credentials are automatically set and updated.
Querying parquet files
Let’s assume that I have multiple parquet files stored in an S3 bucket called work-project, and they all have the same schema:
s3://work-project/clickdata_001.parquet
s3://work-project/clickdata_002.parquet
[...]
s3://work-project/clickdata_100.parquet
These files might be placed in this S3 bucket as part of an ETL pipeline before they are loaded into a table in a data warehouse. Let’s imagine I want to investigate a bug in the pipeline and need to inspect the files.
Inspect schema of parquet file
To familiarize myself with the schema I run the parquet_schema function:
LOAD httpfs; -- once at the beginning of the session
SELECT * FROM parquet_schema('s3://work-project/clickdata_001.parquet');
which will return something like:
clicked_at::timestamp
user_id::int
event_type::string
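As far as I know, DuckDB’s DESCRIBE also works on an arbitrary query, which gives a similar overview of column names and types:
-- alternative way to inspect the columns of a remote file
DESCRIBE SELECT * FROM read_parquet('s3://work-project/clickdata_001.parquet');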
Query a parquet file
I can use the Postgres SQL dialect to query the parquet files with the read_parquet function, e.g.
-- count number of events in the file
SELECT count(*) FROM read_parquet('s3://work-project/clickdata_001.parquet');

-- find the last event time of a particular user
SELECT max(clicked_at) FROM read_parquet('s3://work-project/clickdata_001.parquet') WHERE user_id = 1234;
Query multiple parquet files
Say I want to find the users with the highest number of events across all files. I can use glob syntax with * to run a query against all files as if they were one table:
SELECT user_id, count(*) AS num_events
FROM read_parquet('s3://work-project/*.parquet')
GROUP BY user_id
ORDER BY num_events DESC;
Create table from files
Let’s assume that the clickdata files have a combined size of 500 MB. It would be annoying to download these files for every query I want to run. Let’s instead create a table from the files.
CREATE TABLE clickdata AS SELECT * FROM read_parquet('s3://work-project/*.parquet');
This table is stored in memory until I close the shell, which allows me to run different queries against it until I am done with my analysis, e.g.
SELECT * FROM clickdata;
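For example, an aggregation like the one below now runs entirely against the local copy instead of hitting S3 again (the column names are taken from the example schema above):
-- events per day, computed from the in-memory table
SELECT date_trunc('day', clicked_at) AS day, count(*) AS num_events
FROM clickdata
GROUP BY day
ORDER BY day;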
I can also store the table in a local database on my computer in case I need to work with the data for a longer time. The following SQL statement
EXPORT DATABASE 'mydatabase' (FORMAT PARQUET);
will create a local folder mydatabase, which stores the tables currently in memory. In a later DuckDB session I can reload the table from the database using the following command:
IMPORT DATABASE 'mydatabase';
Query column statistics
A nice feature of the parquet file format is that it stores statistics about each column, e.g. min, max, null_count, and distinct_count. If you are interested, you can inspect parquet file metadata using SELECT * FROM parquet_metadata('file.parquet');
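As a rough sketch, pulling out just the per-column statistics for one file might look like the query below; treat the exact names of the stats columns as an assumption, since they can differ between DuckDB versions:
-- per-column statistics for a single file
-- column names such as stats_min_value are assumptions and may vary by DuckDB version
SELECT path_in_schema, stats_min_value, stats_max_value, stats_null_count
FROM parquet_metadata('s3://work-project/clickdata_001.parquet');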
Assume the click event dataset is several GB in size and I want to identify the earliest clicked_at time. I would run the query below.
SELECT min(clicked_at) FROM read_parquet('s3://work-project/*.parquet');
Fortunately, DuckDB uses the column statistics stored in the parquet files to compute the answer without having to download the whole dataset from S3. Simon Willison shows a more extreme example of this feature.
DuckDB is a great tool for quick investigations of remotely hosted files. Especially with automatic handling of AWS credentials configured, I can spin up a DuckDB shell with a single command. I hope it becomes a time saver for you too.
If you have any thoughts, questions, or feedback about this post, I would love to hear it. Please reach out to me via email.