
Processing 112M rows of steam reviews locally with DuckDB


In "DuckDB use cases for data scientists: Querying remote S3 files" I wrote about how I use DuckDB as a convenient way to query data from CSV or Parquet files in S3. Another use case for data scientists and data engineers is DuckDB's ability to process larger-than-memory data on your local machine. For us data scientists this fills the gap between working with in-memory tools like pandas and setting up more complex infrastructure or moving the work to a remote machine.

In this post I am going to run some exploratory queries against this Kaggle Steam review dataset. It contains 112M rows of Steam game reviews and comes as an uncompressed 46GB CSV file (17GB compressed). Let's find out how my MacBook copes with it. Later I am going to run a small comparison against polars, a DataFrame library, which is often mentioned for out-of-memory processing. I am not comparing against pandas because its inability to work with large datasets was one of the reasons why I explored this topic in the first place. However, you could work around it with chunking.

Useful commands and settings

Before I start, here are some useful DuckDB commands and settings that I often use:

Command                          What is it good for?
.timer on                        Prints the execution time after each SQL command.
FROM duckdb_memory();            Gives an overview of how much memory DuckDB is using.
SET enable_progress_bar = true;  Prints a progress bar while a query runs.
SET memory_limit = '2GB';        Controls how much data DuckDB may keep in RAM.
.mode line                       Prints each column value of a row on its own line (useful for wide results). Default: .mode duckbox
SUMMARIZE (SELECT *)             Summarizes the contents of a table or query result.
.exit                            Stops DuckDB.
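
As a quick illustration, a session for the rest of this post could be set up roughly like this (a sketch; the dot commands are DuckDB CLI features, the SET statements are plain SQL, and the 2GB memory cap is just the example value from the table):

.timer on
.mode line
SET enable_progress_bar = true;
-- optional: cap how much RAM DuckDB may use
SET memory_limit = '2GB';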

Steam review dataset

First, I want to familiarize myself with the dataset. I downloaded the 46GB file all_reviews.csv to my 2023 MacBook Pro M3 with 18GB RAM. After starting a non-persistent, in-memory session with the duckdb command, I use the above commands to activate the timer and line mode. I then take a look at an example row with:

select * from read_csv('all_reviews.csv', ignore_errors=true) limit 1;
              recommendationid = 148919893
                         appid = 10
                          game = Counter-Strike
                author_steamid = 76561199036724879
        author_num_games_owned = 0
            author_num_reviews = 3
       author_playtime_forever = 197
author_playtime_last_two_weeks = 197
     author_playtime_at_review = 197
            author_last_played = 1698336369
                      language = russian
                        review = старость
             timestamp_created = 1698336397
             timestamp_updated = 1698336397
                      voted_up = 1
                      votes_up = 0
                   votes_funny = 0
           weighted_vote_score = 0.0
                 comment_count = 0
                steam_purchase = 1
             received_for_free = 0
   written_during_early_access = 0
         hidden_in_steam_china = 1
          steam_china_location = 

This gives me an overview of the file columns and an idea of their content.
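
Another way to get such an overview is the SUMMARIZE command from the table above. A sketch of how it can be pointed at the CSV directly (note that it scans the whole 46GB file, so it is not instant):

SUMMARIZE SELECT * FROM read_csv('all_reviews.csv', ignore_errors = true);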

Processing the file directly

Rather than slowly loading all the data into DuckDB first, I want to test DuckDB's stream processing capabilities. Instead of fully materializing the data in memory, the execution engine reads and processes it in chunks. This is also a useful feature if you want to convert larger-than-memory files from one format to another, e.g. CSV to Parquet. But that's not what we are here for today.
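
As a minimal sketch of such a conversion (the Parquet file name is just an example), DuckDB can stream the CSV straight into a Parquet file with a single COPY statement:

COPY (
    SELECT * FROM read_csv('all_reviews.csv', ignore_errors = true)
) TO 'all_reviews.parquet' (FORMAT parquet);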

I am going to use two aggregation queries for my benchmark:

  1. The first query counts the number of English reviews per Steam account and sorts them from highest to lowest:

    D select author_steamid, 
    COUNT(*) AS num_reviews  
    FROM read_csv('all_reviews.csv', ignore_errors = true)  
    WHERE language = 'english' 
    GROUP BY author_steamid 
    ORDER BY num_reviews DESC;
    100% ▕████████████████████████████████████████████████████████████▏ 
    ┌───────────────────┬─────────────┐
    │  author_steamid   │ num_reviews │
    │       int64       │    int64    │
    ├───────────────────┼─────────────┤
    │ 76561198030784015 │        9674 │
    │ 76561198024340430 │        5930 │
    │ 76561198067298289 │        5534 │
    │ 76561198094803808 │        4341 │
    │ 76561198125392509 │        4124 │
    │ 76561198027267313 │        4124 │
    │ 76561197960373660 │        3350 │
    │ 76561197970602587 │        3212 │
    │ 76561197961017729 │        2810 │
    │ 76561198155150242 │        2554 │
    │ 76561198066590240 │        2490 │
    │ 76561198045381877 │        2280 │
    │ 76561198069159152 │        2216 │
    │ 76561197960319772 │        1978 │
    │ 76561198062813911 │        1940 │
    │ 76561198025731804 │        1928 │
    │ 76561198137285867 │        1898 │
    │ 76561198043135631 │        1883 │
    │ 76561198036629241 │        1875 │
    │ 76561198055119582 │        1843 │
    │         ·         │           · │
    │         ·         │           · │
    │         ·         │           · │
    │ 76561198111175247 │           1 │
    ├───────────────────┴─────────────┤
    │    15324507 rows (40 shown)     │
    └─────────────────────────────────┘
    Run Time (s): real 22.556 user 166.382276 sys 9.815767
    

    This query processes the 112M rows in a surprisingly short 22.5s. The most active Steam user has written 9674 game reviews. You can find them here.

  2. I made the second query intentionally more complex. I filter for reviews that contain the word "love", and I also compute the mean of the votes_up score and the sum of the votes_funny score for each user. I then filter the aggregation with HAVING, keeping only users whose sum_votes_funny is greater than 100.

    D SELECT author_steamid, 
        COUNT(*) AS num_reviews,
        avg(votes_up) as mean_votes_up, 
        sum(votes_funny) as sum_votes_funny,
        FROM read_csv('all_reviews.csv', ignore_errors = true) 
        WHERE language = 'english' and review LIKE '%love%'
        GROUP BY author_steamid 
        having sum_votes_funny > 100 
        ORDER BY num_reviews DESC; 
    100% ▕████████████████████████████████████████████████████████████▏ 
    ┌───────────────────┬─────────────┬────────────────────┬─────────────────┐
    │  author_steamid   │ num_reviews │   mean_votes_up    │ sum_votes_funny │
    │       int64       │    int64    │       double       │     int128      │
    ├───────────────────┼─────────────┼────────────────────┼─────────────────┤
    │ 76561198155150242 │         777 │   1.09009009009009 │             362 │
    │ 76561198042406453 │         665 │ 13.478195488721804 │             341 │
    │ 76561198043135631 │         475 │ 2.9410526315789474 │             116 │
    │ 76561198149437416 │         352 │ 1.2386363636363635 │             256 │
    │ 76561198007343154 │         342 │  4.038011695906433 │             375 │
    │ 76561198066590240 │         273 │  9.293040293040294 │             247 │
    │ 76561197970314107 │         241 │  38.15767634854772 │             319 │
    │ 76561197961017729 │         237 │ 10.278481012658228 │             251 │
    │ 76561197981638563 │         231 │ 26.372294372294373 │             509 │
    │ 76561197972040704 │         223 │  8.560538116591928 │             123 │
    │ 76561197992694498 │         223 │   73.1390134529148 │             759 │
    │ 76561197970761123 │         222 │ 22.603603603603602 │             150 │
    │ 76561198043609914 │         217 │ 18.792626728110598 │             129 │
    │ 76561198053422627 │         202 │  17.04950495049505 │             284 │
    │ 76561198007888370 │         196 │  93.96938775510205 │            3345 │
    │ 76561198040884867 │         191 │  38.41884816753927 │             509 │
    │ 76561198817597644 │         166 │ 22.542168674698797 │             119 │
    │ 76561198356141989 │         166 │ 19.246987951807228 │             216 │
    │ 76561198031599084 │         165 │ 17.163636363636364 │             171 │
    │ 76561198011647032 │         159 │  6.345911949685535 │             192 │
    │         ·         │           · │                ·   │              ·  │
    │         ·         │           · │                ·   │              ·  │
    │         ·         │           · │                ·   │              ·  │
    │ 76561198106232693 │           1 │               96.0 │             155 │
    ├───────────────────┴─────────────┴────────────────────┴─────────────────┤
    │ 2383 rows (40 shown)                                         4 columns │
    └────────────────────────────────────────────────────────────────────────┘
    Run Time (s): real 26.648 user 213.432443 sys 7.281861
    

    At ~26s this is roughly on par with the previous query. User 76561198155150242 used the word "love" in 777 reviews and collected 362 "funny" votes across them.

These timings show how incredibly convenient DuckDB is for analysing large datasets. Under 30s is still in the realm where I can work with the dataset interactively and explore different queries. You can push DuckDB to its limits with a blocking operator, e.g. a rank() window function: such a query will be slow because the entire input needs to be buffered to compute the result.
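
For illustration, a blocking query along those lines could look like the sketch below (I did not benchmark it for this post). Ranking over all rows forces DuckDB to buffer the entire input before it can emit the first result row:

SELECT recommendationid, votes_up,
       rank() OVER (ORDER BY votes_up DESC) AS votes_up_rank
FROM read_csv('all_reviews.csv', ignore_errors = true);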

When I explore data it is more common that I work in a notebook instead of the CLI. Fortunately, I can run the same queries using the DuckDB Python client. In the next section I am going to compare it against polars, another relatively new kid on the block for out-of-memory analytics.

Working in a Jupyter notebook

DuckDB Python client

I am going to write the code to run the above queries using the duckdb Python package and return the results as pandas DataFrames.

import duckdb
import pandas as pd

# Connect to DuckDB
conn = duckdb.connect(':memory:')

# Define the queries
query1 = """
SELECT author_steamid, COUNT(*) AS num_reviews
FROM read_csv('./steam_reviews/all_reviews/all_reviews.csv', ignore_errors = true)
WHERE language = 'english'
GROUP BY author_steamid
ORDER BY num_reviews DESC
"""

query2 = """
SELECT author_steamid,
COUNT(*) AS num_reviews,
avg(votes_up) as mean_votes_up,
sum(votes_funny) as sum_votes_funny,
FROM read_csv('./steam_reviews/all_reviews/all_reviews.csv', ignore_errors = true)
WHERE language = 'english' and review LIKE '%love%'
GROUP BY author_steamid
having sum_votes_funny > 100
ORDER BY num_reviews DESC
"""
%%timeit

# Execute the query
result1 = conn.execute(query1)
# Convert the result to a pandas DataFrame
df1 = result1.df()
16.6 s ± 627 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
df1.head(5)

author_steamid num_reviews
0 76561198030784015 9674
1 76561198024340430 5930
2 76561198067298289 5534
3 76561198094803808 4341
4 76561198027267313 4124
%%timeit

# Execute the query
result2 = conn.execute(query2)
# Convert the result to a pandas DataFrame
df2 = result2.df()
19.9 s ± 343 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
df2.head(5)

author_steamid num_reviews mean_votes_up sum_votes_funny
0 76561198155150242 777 1.090090 362.0
1 76561198042406453 665 13.478195 341.0
2 76561198043135631 475 2.941053 116.0
3 76561198149437416 352 1.238636 256.0
4 76561198007343154 342 4.038012 375.0

Not surprisingly, this gives the same results as the queries run from the CLI. Having the aggregated results as pandas DataFrames allows me to use them in downstream work, and I don't need to learn another syntax.

# clear memory
import gc
del result1, result2, df1, df2;
gc.collect();

Polars

Since polars is often brought up as a faster and better pandas, I want to do a performance comparison. Polars supports lazy DataFrames, which let me define operations on the data without loading it fully into memory first.

Below are the two equivalent queries written using polars syntax.

import polars as pl

# Create a lazy DataFrame
df_lazy = pl.scan_csv("./steam_reviews/all_reviews/all_reviews.csv", ignore_errors=True)
%%time

# Define the query using lazy operations
result_polars1 = (
    df_lazy.filter(pl.col("language") == "english")
    .group_by("author_steamid")
    .agg(pl.len().alias("num_reviews"))
    .sort("num_reviews", descending=True)
)

# Execute the query and collect the results
df_polars1 = result_polars1.collect()
df_polars1.head(5)
CPU times: user 1min 3s, sys: 22.1 s, total: 1min 26s
Wall time: 2min 7s
shape: (5, 2)
author_steamid     num_reviews
i64                u32
76561198030784015  9822
76561198024340430  5983
76561198067298289  5577
76561198094803808  4408
76561198125392509  4203
%%time

result_polars2 = (
    df_lazy
    .filter(
        (pl.col("language") == "english") &
        pl.col("review").str.contains("love")
    )
    .group_by("author_steamid")
    .agg([
        pl.len().alias("num_reviews"),
        pl.col("votes_up").mean().alias("mean_votes_up"),
        pl.col("votes_funny").sum().alias("sum_votes_funny")
    ])
    .filter(pl.col("sum_votes_funny") > 100)
    .sort("num_reviews", descending=True)
)

# Execute the query and collect the results
df_polars2 = result_polars2.collect()
df_polars2.head(5)

CPU times: user 1min 21s, sys: 42.7 s, total: 2min 4s
Wall time: 5min 14s
shape: (5, 4)
author_steamid     num_reviews  mean_votes_up  sum_votes_funny
i64                u32          f64            i64
76561198155150242  786          1.094148       369
76561198042406453  673          13.43685       345
76561198043135631  480          2.922917       116
76561198149437416  360          1.216667       256
76561198007343154  354          4.014124       381

Now we can compare DuckDB vs. polars execution times for the two queries:

wall time   DuckDB   polars
query 1     17s      2min 7s
query 2     20s      5min 14s

Conclusion

This investigation shows that DuckDB is a powerful and convenient tool for processing larger-than-memory datasets on a single machine. As a data scientist this is useful because I can focus on the data itself during the early exploration phase; I don't yet have to spend time setting up more complex tools or working on a remote machine. The quick comparison with polars also shows DuckDB's speed advantage and that it can be used as a drop-in replacement when aggregating data in Jupyter notebooks.

If you have any thoughts, questions, or feedback about this post, I would love to hear it. Please reach out to me via email.

#work   #data-engineering   #data-science   #duck-db