Building a local data warehouse with DuckDB, dbt, and Superset
In previous blog posts, I described two DuckDB use cases for data scientists and data engineers: Querying remote parquet files and processing larger-than-memory datasets.
Now I want to explore whether I can set up a local, open-source version of the analytics data stack that we use at my company. At work, we use Snowflake, dbt Cloud, and Google's Looker, which cost us several thousand EUR per month. I am going to use the following open-source tools in my local setup:
| | cloud | local |
|---|---|---|
| analytical database | Snowflake | DuckDB |
| data modelling tool | dbt Cloud | local dbt |
| dashboard tool | Looker | Apache Superset |
Figure 1 shows how the components work together.
I will use a toy visualization problem to demonstrate the setup. My goal is to load the race time dataset that I used in Data analysis: Vätternrundan 2024 results into DuckDB. I then use dbt to create data models and Superset to create an interactive dashboard to visualize the data.
Loading the raw data into DuckDB
I want to use DuckDB for all data processing. Therefore, I will load the raw race time data into DuckDB.
I start by installing DuckDB as my analytical database. Since I use homebrew as a package manager, I can run:
brew install duckdb
I then create a persistent database (a file on my machine):
duckdb database.duckdb
and create a new DuckDB table from the parquet file that contains the raw data.
CREATE TABLE vatternrundan AS
SELECT startnumber, city, country, result_time, start_time
FROM read_parquet('results_vatternrundan24.parquet');
To verify that everything works, I run:
describe table vatternrundan;
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ startnumber │ BIGINT │ YES │ NULL │ NULL │ NULL │
│ city │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ country │ VARCHAR │ YES │ NULL │ NULL │ NULL │
│ result_time │ BIGINT │ YES │ NULL │ NULL │ NULL │
│ start_time │ VARCHAR │ YES │ NULL │ NULL │ NULL │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
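To get a better feel for the data, one could additionally run a couple of quick queries against the new table (a small sketch, not part of the original walkthrough):
-- count the loaded rows and peek at a few of them
SELECT count(*) AS num_rows FROM vatternrundan;
SELECT * FROM vatternrundan LIMIT 5;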
Setting up dbt for data transformations
I want to use dbt to create data models on top of the raw data. I start by creating a new poetry project to install dependencies:
poetry init
poetry add dbt-duckdb
Then I initialize the dbt project local_warehouse with:
poetry run dbt init local_warehouse
Next, I configure dbt to use DuckDB as the data processing backend. In my work setup, this would point at a Snowflake instance instead. I point at the local database file in the dbt profiles file profiles.yml:
local_warehouse:
  outputs:
    dev:
      type: duckdb
      path: ../duckdb/database.duckdb
  target: dev
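Before building any models, one way to confirm that dbt can reach the DuckDB file is dbt's built-in connection check (a sketch, assuming it is run from inside the local_warehouse project directory, next to profiles.yml):
poetry run dbt debug --profiles-dir=.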
dbt models
I define the DuckDB table vatternrundan as a dbt source in sources.yml:
version: 2

sources:
  - name: vatternrundan_db
    schema: main
    tables:
      - name: vatternrundan
This means I can now refer to this raw data in dbt.
On top of this source, I want to create two models: a thin staging layer over the raw data, stg_vatternrundan.sql, and a refined model that aggregates the rider data by country, average_speed_by_country.sql.
The staging model is stored in the dbt project as models/staging/stg_vatternrundan.sql:
select
    startnumber::int64 as startnumber,
    city::varchar as city,
    country::varchar as country,
    to_microseconds((result_time/1000)::int64) as result_time
from {{ source('vatternrundan_db', 'vatternrundan') }}
As you can see, I refer to the DuckDB source table, define the column types, and do some light conversions.
The aggregation model is stored in models/refined/average_speed_by_country.sql:
with speeds as (
    select
        startnumber,
        country,
        60 * datepart('hours', result_time) + datepart('minutes', result_time) as result_time_minutes,
        315 * 60 / result_time_minutes as average_speed
    from {{ ref('stg_vatternrundan') }}
)

select
    country,
    count(*) as num_riders,
    mean(average_speed) as average_speed_of_country
from speeds
group by country
order by average_speed_of_country desc
I use this dbt model to compute the average rider speed for every country in the dataset. The 315 in the query is the race distance in kilometres, so dividing 315 * 60 by the result time in minutes gives the average speed in km/h.
Now that I have defined my dbt models, I configure dbt to run them by adding the following to dbt_project.yml:
models:
  local_warehouse:
    staging:
      +schema: staging
      +materialized: table
    refined:
      +schema: refined
      +materialized: table
This tells dbt how to materialize the dbt models in DuckDB; with dbt's default schema naming, the +schema values are prefixed with the target schema, so the models end up in main_staging and main_refined. Let's run dbt to build the models:
poetry run dbt run --profiles-dir=.
Check that the tables were created
I can verify in DuckDB that these tables have been created. I open the database again with
duckdb database.duckdb
followed by:
show all tables;
which should now show:
database.main_staging.stg_vatternrundan
database.main_refined.average_speed_by_country
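To spot-check the refined model, one could also query it directly from the DuckDB CLI (a sketch; the schema name main_refined comes from the listing above):
-- inspect the aggregated model that dbt built
SELECT country, num_riders, average_speed_of_country
FROM main_refined.average_speed_by_country
ORDER BY average_speed_of_country DESC
LIMIT 5;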
Superset for data analytics dashboards
After modelling my data, I want to create a dashboard and visualize the data. I use Apache Superset as my dashboarding tool. Superset works with many analytical databases. However, setting it up to work with DuckDB is a bit clunky and I had to troubleshoot quite a bit.
Install Superset with DuckDB support
I follow the Quickstart Guide with some modifications. First, I download the git repository and check out the latest tagged commit:
git clone https://github.com/apache/superset
cd superset
git checkout tags/4.1.1
I need to make two modifications before building the container.
- Install duckdb-engine inside the container. This allows us to select DuckDB as a database in Superset later. To do this, add
RUN pip install duckdb-engine
to the Dockerfile.
- I need to make the DuckDB database file available in the Docker container, so I add its path as a volume in docker-compose-image-tag.yml:
superset:
  [...]
  volumes:
    - ./docker:/app/docker
    - superset_home:/app/superset_home
    - <local-machine-path-to-duckdb-database>:/app/duckdb
and force the Superset container to build from the Dockerfile and not from the image:
superset:
  [...]
  build:
    context: .
    dockerfile: Dockerfile
  # image: apache/superset:   <- comment out or remove the image line
Then build the container with
docker compose build
and start Superset with
docker compose -f docker-compose-image-tag.yml up
Then, I can access Superset in my browser under http://localhost:8088 and log in with username admin and password admin.
Configure Superset
Now, I can add DuckDB as a database in the Superset UI. I navigate to Settings -> Data -> Database Connections, click + Database, select DuckDB, and add the link to the DuckDB database file as the SQLAlchemy URI:
duckdb:////app/duckdb/database.duckdb?access_mode=READ_ONLY
Clicking on Test Connection should respond with "Connection looks good!".
Add datasets in Superset
In the navbar under Datasets, I can add the DuckDB tables stg_vatternrundan and average_speed_by_country as new Superset datasets.
Create dashboard in Superset
Creating the dashboard is self-explanatory.
In the navbar, I select Dashboards and add a new dashboard called "Vätternrundan Dashboard". Next, I create two charts:
- number of riders per country
- average speed of riders per country
Just select the relevant Superset Dataset, build the charts, and add them to the dashboard. Figure 2 shows how I did it.
Conclusion
The combination of DuckDB, dbt, and Apache Superset is a local version of "the modern data stack". This can be useful for personal projects and for experimenting with data transformations and visualizations.
I found that the integration of DuckDB and dbt worked seamlessly. On the other hand, connecting Superset to DuckDB was clunky. Superset would benefit from better DuckDB support out-of-the-box.
Troubleshooting
- DuckDB only allows one connection with read-write access, but multiple concurrent read-only connections. Make sure you are not connected to your DuckDB database with more than one client, e.g. Superset and the DuckDB CLI in your terminal, since connections open in read-write mode by default (for a read-only CLI session, see the sketch after this list).
- Make sure to connect Superset to DuckDB in read-only mode, i.e. add
access_mode=READ_ONLY
to the SQLAlchemy URI.
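If you still want to poke around in the database from the terminal while Superset is connected, the DuckDB CLI can also be opened in read-only mode (a small sketch using the CLI's readonly flag):
duckdb -readonly database.duckdb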
If you have any thoughts, questions, or feedback about this post, I would love to hear it. Please reach out to me via email.
Tags:#data-engineering #data-science #dbt #superset #duck-db