Building a local data warehouse with DuckDB, dbt, and Superset



In previous blog posts, I described two DuckDB use cases for data scientists and data engineers: querying remote Parquet files and processing larger-than-memory datasets.

Now I want to explore whether I can set up a local, open-source version of the analytics data stack that we use at my company. At work, we are using Snowflake, dbt Cloud, and Google's Looker, which cost us several thousand EUR per month. I am going to use the following open-source tools in my local setup:

                      cloud       local
analytical database   Snowflake   DuckDB
data modelling tool   dbt Cloud   dbt
dashboard tool        Looker      Apache Superset

Figure 1 shows how the components work together.

Figure 1. Components of my local data warehouse stack.

I will use a toy visualization problem to demonstrate the setup. My goal is to load the race time dataset that I used in Data analysis: Vätternrundan 2024 results into DuckDB. I then use dbt to create data models and Superset to create an interactive dashboard to visualize the data.

Loading the raw data into DuckDB

I want to use DuckDB for all data processing. Therefore, I will load the raw race time data into DuckDB.

I start by installing DuckDB as my analytical database. Since I use Homebrew as a package manager, I can run:

brew install duckdb

I then create a persistent database (a file on my machine):

duckdb database.duckdb

and create a new DuckDB table from the parquet file that contains the raw data.

CREATE TABLE vatternrundan AS
SELECT startnumber, city, country, result_time, start_time
FROM read_parquet('results_vatternrundan24.parquet');

To verify that everything works, I run:

describe table vatternrundan;
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ startnumber │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ city        │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ country     │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
│ result_time │ BIGINT      │ YES     │ NULL    │ NULL    │ NULL    │
│ start_time  │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
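
As an optional extra check, I can also count the rows and peek at a few records directly in the DuckDB CLI:

select count(*) from vatternrundan;
select * from vatternrundan limit 5;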

Setting up dbt for data transformations

I want to use dbt to create data models on top of the raw data. I start by creating a new poetry project to install dependencies:

poetry init
poetry add dbt-duckdb

Then I initialize the dbt project local_warehouse with

poetry run dbt init local_warehouse

Next, I configure dbt to use DuckDB as the data processing backend. In my work setup, this would point at a Snowflake instance instead. I point dbt at the local database file in the profiles file profiles.yml:

local_warehouse:
  outputs:
    dev:
      type: duckdb
      path: ../duckdb/database.duckdb
  target: dev
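
Before building any models, I can let dbt verify the connection to the DuckDB file (an optional check; dbt debug reads the same profiles.yml):

poetry run dbt debug --profiles-dir=.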

dbt models

I define the DuckDB table vatternrundan as a dbt source in sources.yml:

version: 2
sources:
  - name: vatternrundan_db
    schema: main 
    tables:
    - name: vatternrundan

This means I can now refer to this raw data in dbt. On top of it, I create two models: a thin staging layer over the raw data, stg_vatternrundan.sql, and a refined model that aggregates the rider data by country, average_speed_by_country.sql.

The staging model is stored in the dbt project as models/staging/stg_vatternrundan.sql:

select 
    startnumber::int64 as startnumber,
    city::varchar as city,
    country::varchar as country,
    to_microseconds((result_time/1000)::int64) as result_time,
from {{ source('vatternrundan_db', 'vatternrundan') }}

As you can see, I refer to the DuckDB source table, define the column types, and do some light conversions.

The aggregation model is stored in models/refined/average_speed_by_country.sql:

with speeds as (
    select
        startnumber,
        country,
        -- result_time is an interval (from the staging model); convert it to minutes
        60 * datepart('hours', result_time) + datepart('minutes', result_time) as result_time_minutes,
        -- the course is 315 km, so this gives the average speed in km/h
        315 * 60 / result_time_minutes as average_speed,
    from {{ ref('stg_vatternrundan') }}
)

select country, count(*) as num_riders, mean(average_speed) as average_speed_of_country 
from speeds
group by country
order by average_speed_of_country desc

I use this dbt model to compute the average rider speed for every country in the dataset.
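
Although not needed for this toy example, dbt also lets me declare simple data tests next to the models. A minimal sketch (the file name models/staging/schema.yml and the tested columns are just my own choice here) could look like this:

version: 2
models:
  - name: stg_vatternrundan
    columns:
      - name: startnumber
        tests:
          - not_null
          - unique
      - name: country
        tests:
          - not_null

Running poetry run dbt test --profiles-dir=. would then execute these checks against DuckDB.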

Now that I have defined my dbt models, I configure dbt to run them by adding the following to dbt_project.yml:

models:
  local_warehouse:
    staging:
      +schema: staging
      +materialized: table
    refined:
      +schema: refined
      +materialized: table

This tells dbt how to materialize the dbt models in DuckDB. The +schema values are appended to the default main schema, so the tables will end up in main_staging and main_refined. Let's run dbt to build the models:

poetry run dbt run --profiles-dir=.
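
If I only want to rebuild a single model while iterating, dbt's --select flag does that (an optional variation of the command above):

poetry run dbt run --select average_speed_by_country --profiles-dir=.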

Checking that the tables were created

I can verify in DuckDB that these tables have been created. I run:

duckdb database.duckdb

followed by:

show all tables;

which should now show:

database.main_staging.stg_vatternrundan
database.main_refined.average_speed_by_country
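
To spot-check the refined model, I can query it directly in the same DuckDB session (an optional step):

select * from main_refined.average_speed_by_country limit 10;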

Superset for data analytics dashboards

After modelling my data, I want to create a dashboard and visualize the data. I use Apache Superset as my dashboarding tool. Superset works with many analytical databases. However, setting it up to work with DuckDB is a bit clunky and I had to troubleshoot quite a bit.

Install Superset with DuckDB support

I follow the Quickstart Guide with some modifications. First, I clone the git repository and check out the latest tagged commit:

git clone https://github.com/apache/superset
cd superset
git checkout tags/4.1.1

I need to make two modifications before building the container.

  1. Install duckdb-engine inside the container. This allows me to later select DuckDB as a database in Superset. To do this, add
RUN pip install duckdb-engine

to the Dockerfile.

  2. I need to make the DuckDB database file available in the Docker container. I add its path as a volume by modifying docker-compose-image-tag.yml:
superset:
[...]
    volumes:
        - ./docker:/app/docker
        - superset_home:/app/superset_home
        - <local-machine-path-to-duckdb-database>:/app/duckdb  

and force the Superset container to build from the Dockerfile and not from the image:

superset:
[...]
    build:
        context: .
        dockerfile: Dockerfile
#   image: apache/superset   # comment out or remove this line

Then build the container with

docker compose -f docker-compose-image-tag.yml build

and start Superset with

docker compose -f docker-compose-image-tag.yml up

Then, I can access Superset in my browser at http://localhost:8088 and log in with username admin and password admin.

Configure Superset

Now, I can add DuckDB as a database in the Superset UI. I navigate to Settings -> Data -> Database Connections -> + Database -> DuckDB

and add the link to the DuckDB database file as the SQLAlchemy URI:

duckdb:////app/duckdb/database.duckdb?access_mode=READ_ONLY

Clicking on Test Connection should respond with “Connection looks good!”.

Add datasets in Superset

In the navbar under Datasets I can add the DuckDB tables stg_vatternrundan and average_speed_by_country as new Superset datasets.

Create dashboard in Superset

Creating the dashboard is self-explanatory. In the navbar, I select Dashboards and add a new Dashboard called “Vätternrundan Dashboard”. Next, I create two charts:

  • number of riders per country
  • average speed of riders per country

Just select the relevant Superset Dataset, build the charts, and add them to the dashboard. Figure 2 shows how I did it.

Figure 2. Visualization of Vätternrundan rider speed by country in Apache Superset.

Conclusion

The combination of DuckDB, dbt, and Apache Superset is a local version of “the modern data stack”. This can be useful for personal projects and for experimenting with data transformations and visualizations.

I found that the integration of DuckDB and dbt worked seamlessly. On the other hand, connecting Superset to DuckDB was clunky. Superset would benefit from better DuckDB support out-of-the-box.

Troubleshooting

  • DuckDB allows only one connection with read-write access, but multiple concurrent read-only connections. Since connections are read-write by default, make sure you are not connected to your DuckDB database with more than one client at a time, e.g. Superset and the DuckDB CLI in your terminal.
  • Connect Superset to DuckDB in read-only mode, i.e. add access_mode=READ_ONLY to the SQLAlchemy URI (see the CLI example below).
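
If I still want to inspect the database from a terminal while Superset is connected, I can open the CLI in read-only mode as well (assuming the -readonly flag of the DuckDB CLI):

duckdb -readonly database.duckdb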

If you have any thoughts, questions, or feedback about this post, I would love to hear it. Please reach out to me via email.

Tags:
#data-engineering   #data-science   #dbt   #superset   #duck-db