Processing 112M rows of Steam reviews locally with DuckDB
In "DuckDB use cases for data scientists: Querying remote S3 files" I wrote about how I use DuckDB as a convenient way to query data from CSV or Parquet files in S3. Another use case for data scientists and data engineers is DuckDB's ability to process larger-than-memory data on your local machine. For us data scientists this fills the gap between
- small data that you can transform with pandas
- big data that typically requires a multi-node processing engine like PySpark.
In this post I am going to run some exploratory queries against this Kaggle Steam review dataset. It contains 112M rows of Steam game reviews and comes as an uncompressed 46GB CSV file (17GB compressed). Let's find out how my MacBook copes with it. Later I am going to run a small comparison against polars, a DataFrame library that is often mentioned for out-of-memory processing. I am not comparing against pandas because its inability to work with large datasets was one of the reasons why I explored this topic. However, you could do this with chunking.
Useful commands and settings
Before I start, here are some useful DuckDB commands and settings that I often use:
| Command | What is it good for? |
|---|---|
| .timer on | Prints the execution time after each SQL command |
| FROM duckdb_memory(); | Gives an overview of how much memory is used by DuckDB |
| SET enable_progress_bar = true; | Prints a progress bar for query runs |
| SET memory_limit = '2GB'; | Controls how much data DuckDB can keep in RAM |
| .mode line | Prints query results one at a time (useful if many columns). Default: .mode duckbox |
| SUMMARIZE (select *) | Summarizes the contents of a table |
| .exit | Stops DuckDB |
Steam review dataset
First, I want to familiarize myself with the dataset. I downloaded the 46GB file all_reviews.csv
to my 2023 MacBook Pro M3 with 18GB RAM. After starting a non-persistent session with the duckdb
command, I use the above commands to activate the timer and line mode. I then take a look at an example row with:
select * from read_csv('all_reviews.csv', ignore_errors=true) limit 1;
recommendationid = 148919893
appid = 10
game = Counter-Strike
author_steamid = 76561199036724879
author_num_games_owned = 0
author_num_reviews = 3
author_playtime_forever = 197
author_playtime_last_two_weeks = 197
author_playtime_at_review = 197
author_last_played = 1698336369
language = russian
review = старость
timestamp_created = 1698336397
timestamp_updated = 1698336397
voted_up = 1
votes_up = 0
votes_funny = 0
weighted_vote_score = 0.0
comment_count = 0
steam_purchase = 1
received_for_free = 0
written_during_early_access = 0
hidden_in_steam_china = 1
steam_china_location =
This gives me an overview of the file columns and an idea of their content.
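To go one step further, the SUMMARIZE command from the table above computes per-column statistics such as min, max, approximate distinct counts and null percentages. This is a sketch I did not benchmark here; unlike the LIMIT 1 query, it scans the entire 46GB file:

-- per-column summary statistics for the whole CSV
SUMMARIZE SELECT * FROM read_csv('all_reviews.csv', ignore_errors = true);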
Processing the file directly
Rather than slowly loading all data into DuckDB first, I want to test DuckDB's stream processing capabilities. Instead of fully materializing the data in memory, the execution engine reads and processes the data in chunks. This is a useful feature if you want to convert larger-than-memory files from one format to another, e.g. CSV to Parquet, as sketched below. But that's not what we are here for today.
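In case you do want that conversion, it is a single COPY statement. The following is only a sketch I did not benchmark for this post, and the output file name is made up:

-- stream the CSV through DuckDB and write it back out as Parquet;
-- the data is read and written in chunks, so the 46GB file never has to fit into RAM
COPY (
    SELECT * FROM read_csv('all_reviews.csv', ignore_errors = true)
) TO 'all_reviews.parquet' (FORMAT parquet, COMPRESSION zstd);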
I am going to use two aggregation queries for my benchmark:
- The first query counts the number of English reviews per Steam account and sorts them from highest to lowest:
D select author_steamid, COUNT(*) AS num_reviews FROM read_csv('all_reviews.csv', ignore_errors = true) WHERE language = 'english' GROUP BY author_steamid ORDER BY num_reviews DESC;
100% ▕████████████████████████████████████████████████████████████▏
┌───────────────────┬─────────────┐
│  author_steamid   │ num_reviews │
│       int64       │    int64    │
├───────────────────┼─────────────┤
│ 76561198030784015 │        9674 │
│ 76561198024340430 │        5930 │
│ 76561198067298289 │        5534 │
│ 76561198094803808 │        4341 │
│ 76561198125392509 │        4124 │
│ 76561198027267313 │        4124 │
│ 76561197960373660 │        3350 │
│ 76561197970602587 │        3212 │
│ 76561197961017729 │        2810 │
│ 76561198155150242 │        2554 │
│ 76561198066590240 │        2490 │
│ 76561198045381877 │        2280 │
│ 76561198069159152 │        2216 │
│ 76561197960319772 │        1978 │
│ 76561198062813911 │        1940 │
│ 76561198025731804 │        1928 │
│ 76561198137285867 │        1898 │
│ 76561198043135631 │        1883 │
│ 76561198036629241 │        1875 │
│ 76561198055119582 │        1843 │
│         ·         │      ·      │
│         ·         │      ·      │
│         ·         │      ·      │
│ 76561198111175247 │           1 │
├───────────────────┴─────────────┤
│ 15324507 rows (40 shown)        │
└─────────────────────────────────┘
Run Time (s): real 22.556 user 166.382276 sys 9.815767
This query processes the 112M rows in a surprisingly short 22.5s. The most active Steam user has written 9674 game reviews. You can find them here.
- I made the second query intentionally more complex. I filter for reviews that contain the word "love" and I also compute the mean of the votes_up score and the sum of the votes_funny score for each user. I then filter the aggregation using having to keep only users that have a sum_votes_funny score of more than 100.
D SELECT author_steamid, COUNT(*) AS num_reviews, avg(votes_up) as mean_votes_up, sum(votes_funny) as sum_votes_funny, FROM read_csv('all_reviews.csv', ignore_errors = true) WHERE language = 'english' and review LIKE '%love%' GROUP BY author_steamid having sum_votes_funny > 100 ORDER BY num_reviews DESC;
100% ▕████████████████████████████████████████████████████████████▏
┌───────────────────┬─────────────┬────────────────────┬─────────────────┐
│  author_steamid   │ num_reviews │   mean_votes_up    │ sum_votes_funny │
│       int64       │    int64    │       double       │     int128      │
├───────────────────┼─────────────┼────────────────────┼─────────────────┤
│ 76561198155150242 │         777 │   1.09009009009009 │             362 │
│ 76561198042406453 │         665 │ 13.478195488721804 │             341 │
│ 76561198043135631 │         475 │ 2.9410526315789474 │             116 │
│ 76561198149437416 │         352 │ 1.2386363636363635 │             256 │
│ 76561198007343154 │         342 │  4.038011695906433 │             375 │
│ 76561198066590240 │         273 │  9.293040293040294 │             247 │
│ 76561197970314107 │         241 │  38.15767634854772 │             319 │
│ 76561197961017729 │         237 │ 10.278481012658228 │             251 │
│ 76561197981638563 │         231 │ 26.372294372294373 │             509 │
│ 76561197972040704 │         223 │  8.560538116591928 │             123 │
│ 76561197992694498 │         223 │   73.1390134529148 │             759 │
│ 76561197970761123 │         222 │ 22.603603603603602 │             150 │
│ 76561198043609914 │         217 │ 18.792626728110598 │             129 │
│ 76561198053422627 │         202 │  17.04950495049505 │             284 │
│ 76561198007888370 │         196 │  93.96938775510205 │            3345 │
│ 76561198040884867 │         191 │  38.41884816753927 │             509 │
│ 76561198817597644 │         166 │ 22.542168674698797 │             119 │
│ 76561198356141989 │         166 │ 19.246987951807228 │             216 │
│ 76561198031599084 │         165 │ 17.163636363636364 │             171 │
│ 76561198011647032 │         159 │  6.345911949685535 │             192 │
│         ·         │      ·      │         ·          │        ·        │
│         ·         │      ·      │         ·          │        ·        │
│         ·         │      ·      │         ·          │        ·        │
│ 76561198106232693 │           1 │               96.0 │             155 │
├───────────────────┴─────────────┴────────────────────┴─────────────────┤
│ 2383 rows (40 shown)                                         4 columns │
└─────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 26.648 user 213.432443 sys 7.281861
With ~26s this is on par with the previous query. User 76561198155150242 used the word "love" in 777 reviews and collected 362 "funny" votes across their reviews.
These timings show how incredibly convenient DuckDB is for analysing large datasets. <30s is still in the realm where I can work with the dataset interactively and explore different queries. You can push DuckDB to the limit by using a blocking operator, e.g. a rank() window function. Such a query will be slow because the entire input needs to be buffered to compute the result (see the sketch below).
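For illustration, a query along the following lines would hit that limit. This is a sketch only; the particular ranking is arbitrary and I did not time it for this post:

-- rank every review by votes_up within its language;
-- the window function is a blocking operator, so the entire input must be
-- buffered (spilling to disk if necessary) before the first result row appears
SELECT recommendationid,
       language,
       votes_up,
       rank() OVER (PARTITION BY language ORDER BY votes_up DESC) AS votes_rank
FROM read_csv('all_reviews.csv', ignore_errors = true);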
When I explore data, I more commonly work in a notebook than in the CLI. Fortunately, I can run the same queries using the DuckDB Python client. In the next section I am going to compare it against polars, another relatively new kid on the block for out-of-memory analytics.
Working in a Jupyter notebook
DuckDB Python client
I am going to write the code to run the above queries using the duckdb Python package and return the results as a pandas dataframe.
import duckdb
import pandas as pd

# Connect to DuckDB
conn = duckdb.connect(':memory:')

# Define the queries
query1 = """
SELECT author_steamid, COUNT(*) AS num_reviews
FROM read_csv('./steam_reviews/all_reviews/all_reviews.csv', ignore_errors = true)
WHERE language = 'english'
GROUP BY author_steamid
ORDER BY num_reviews DESC
"""

query2 = """
SELECT author_steamid,
COUNT(*) AS num_reviews,
avg(votes_up) as mean_votes_up,
sum(votes_funny) as sum_votes_funny,
FROM read_csv('./steam_reviews/all_reviews/all_reviews.csv', ignore_errors = true)
WHERE language = 'english' and review LIKE '%love%'
GROUP BY author_steamid
having sum_votes_funny > 100
ORDER BY num_reviews DESC
"""
%%timeit

# Execute the query
result1 = conn.execute(query1)
# Convert the result to a pandas DataFrame
df1 = result1.df()
16.6 s ± 627 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
df1.head(5)
| | author_steamid | num_reviews |
|---|---|---|
| 0 | 76561198030784015 | 9674 |
| 1 | 76561198024340430 | 5930 |
| 2 | 76561198067298289 | 5534 |
| 3 | 76561198094803808 | 4341 |
| 4 | 76561198027267313 | 4124 |
%%timeit

# Execute the query
result2 = conn.execute(query2)
# Convert the result to a pandas DataFrame
df2 = result2.df()
19.9 s ± 343 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
df2.head(5)
| | author_steamid | num_reviews | mean_votes_up | sum_votes_funny |
|---|---|---|---|---|
| 0 | 76561198155150242 | 777 | 1.090090 | 362.0 |
| 1 | 76561198042406453 | 665 | 13.478195 | 341.0 |
| 2 | 76561198043135631 | 475 | 2.941053 | 116.0 |
| 3 | 76561198149437416 | 352 | 1.238636 | 256.0 |
| 4 | 76561198007343154 | 342 | 4.038012 | 375.0 |
Not surprisingly, this gives the same results as the queries run from the CLI. Having the aggregated results as a pandas dataframe allows me to use them in downstream work. I don't need to learn another syntax.
## clear memory
import gc
del result1, result2, df1, df2;
gc.collect();
Polars
Since polars is often brought up as a faster and better pandas, I want to do a performance comparison. Polars supports lazy DataFrames, which allow me to define operations on the dataframe without loading it fully into memory first.
Below are the two equivalent queries written using polars syntax.
import polars as pl

# Create a lazy DataFrame
df_lazy = pl.scan_csv("./steam_reviews/all_reviews/all_reviews.csv", ignore_errors=True)
%%time

# Define the query using lazy operations
result_polars1 = (
    df_lazy.filter(pl.col("language") == "english")
    .group_by("author_steamid")
    .agg(pl.len().alias("num_reviews"))
    .sort("num_reviews", descending=True)
)

# Execute the query and collect the results
df_polars1 = result_polars1.collect()
df_polars1.head(5)
CPU times: user 1min 3s, sys: 22.1 s, total: 1min 26s
Wall time: 2min 7s
| author_steamid | num_reviews |
|---|---|
| i64 | u32 |
| 76561198030784015 | 9822 |
| 76561198024340430 | 5983 |
| 76561198067298289 | 5577 |
| 76561198094803808 | 4408 |
| 76561198125392509 | 4203 |
%%time

result_polars2 = (
    df_lazy
    .filter(
        (pl.col("language") == "english") &
        pl.col("review").str.contains("love")
    )
    .group_by("author_steamid")
    .agg([
        pl.len().alias("num_reviews"),
        pl.col("votes_up").mean().alias("mean_votes_up"),
        pl.col("votes_funny").sum().alias("sum_votes_funny")
    ])
    .filter(pl.col("sum_votes_funny") > 100)
    .sort("num_reviews", descending=True)
)

# Execute the query and collect the results
df_polars2 = result_polars2.collect()
df_polars2.head(5)
CPU times: user 1min 21s, sys: 42.7 s, total: 2min 4s
Wall time: 5min 14s
| author_steamid | num_reviews | mean_votes_up | sum_votes_funny |
|---|---|---|---|
| i64 | u32 | f64 | i64 |
| 76561198155150242 | 786 | 1.094148 | 369 |
| 76561198042406453 | 673 | 13.43685 | 345 |
| 76561198043135631 | 480 | 2.922917 | 116 |
| 76561198149437416 | 360 | 1.216667 | 256 |
| 76561198007343154 | 354 | 4.014124 | 381 |
Now we can compare DuckDB vs. polars execution times for the two queries:
| wall time | DuckDB | polars |
|---|---|---|
| query 1 | 17s | 2min 7s |
| query 2 | 20s | 5min 14s |
Conclusion
This investigation shows that DuckDB is a powerful and convenient tool for processing larger-than-memory datasets on a single machine. As a data scientist I find this useful because I can focus on the data itself during the early exploration phase; I don't yet have to spend time setting up more complex tools or working on a remote machine. The quick comparison with polars also shows DuckDB's speed advantage and that it can be used as a drop-in replacement when aggregating data in Jupyter notebooks.
If you have any thoughts, questions, or feedback about this post, I would love to hear it. Please reach out to me via email.