Processing 112M rows of Steam reviews locally with DuckDB
In "DuckDB use cases for data scientists: Querying remote S3 files" I wrote about how I use DuckDB as a convenient way to query data from CSV or Parquet files in S3. Another use case for data scientists and data engineers is DuckDB's ability to process larger-than-memory data on your local machine. For us data scientists this fills the gap between
- small data that you can transform with pandas
- big data that typically requires a multi-node processing engine like PySpark.
In this post I am going to run some exploratory queries against this Kaggle Steam review dataset. It contains 112M rows of Steam game reviews and comes as an uncompressed 46GB CSV file (17GB compressed). Let's find out how my MacBook copes with it. Later I am going to run a small comparison against polars, a DataFrame library that is often mentioned for out-of-memory processing. I am not comparing against pandas because its inability to work with large datasets was one of the reasons why I explored this topic. However, you could do this with chunking.
Useful commands and settings
Before I start, here are some useful DuckDB commands and settings that I often use:
| Command | What is it good for? |
|---|---|
| .timer on | Prints the execution time after each SQL command |
| FROM duckdb_memory(); | Gives an overview of how much memory is used by DuckDB |
| SET enable_progress_bar = true; | Prints a progress bar for query runs |
| SET memory_limit = '2GB'; | Controls how much data DuckDB can keep in RAM |
| .mode line | Prints query results one at a time (useful if many columns). Default: .mode duckbox |
| SUMMARIZE (select *) | Summarizes the contents of a table |
| .exit | Stops DuckDB |
Steam review dataset
First, I want to familiarize myself with the dataset. I downloaded the 46GB file all_reviews.csv
to my 2023 MacBook Pro M3 with 18GB RAM. After starting a non-persistent session with the duckdb
command, I use the above commands to activate the timer and line mode. I then take a look at an example row with:
select * from read_csv('all_reviews.csv', ignore_errors=true) limit 1;
recommendationid = 148919893
appid = 10
game = Counter-Strike
author_steamid = 76561199036724879
author_num_games_owned = 0
author_num_reviews = 3
author_playtime_forever = 197
author_playtime_last_two_weeks = 197
author_playtime_at_review = 197
author_last_played = 1698336369
language = russian
review = старость
timestamp_created = 1698336397
timestamp_updated = 1698336397
voted_up = 1
votes_up = 0
votes_funny = 0
weighted_vote_score = 0.0
comment_count = 0
steam_purchase = 1
received_for_free = 0
written_during_early_access = 0
hidden_in_steam_china = 1
steam_china_location =
This gives me an overview of the file columns and an idea of their content.
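To go one step further, the SUMMARIZE command from the table above computes per-column statistics such as min, max, approximate distinct counts and null percentages. This is a sketch I did not benchmark here; unlike the LIMIT 1 query, it scans the entire 46GB file:

-- per-column summary statistics for the whole CSV
SUMMARIZE SELECT * FROM read_csv('all_reviews.csv', ignore_errors = true);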
Processing the file directly
Rather than slowly loading all data into DuckDB first, I want to test DuckDB's stream processing capabilities. Instead of fully materializing the data in memory, the execution engine reads and processes the data in chunks. This is a useful feature if you want to convert larger-than-memory files from one format to another, e.g. CSV to Parquet, as sketched below. But that's not what we are here for today.
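In case you do want that conversion, it is a single COPY statement. The following is only a sketch I did not benchmark for this post, and the output file name is made up:

-- stream the CSV through DuckDB and write it back out as Parquet;
-- the data is read and written in chunks, so the 46GB file never has to fit into RAM
COPY (
    SELECT * FROM read_csv('all_reviews.csv', ignore_errors = true)
) TO 'all_reviews.parquet' (FORMAT parquet, COMPRESSION zstd);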
I am going to use two aggregation queries for my benchmark:
- The first query counts the number of English reviews per Steam account and sorts them from highest to lowest:
D select author_steamid, COUNT(*) AS num_reviews FROM read_csv('all_reviews.csv', ignore_errors = true) WHERE language = 'english' GROUP BY author_steamid ORDER BY num_reviews DESC;
100% ▕████████████████████████████████████████████████████████████▏
┌───────────────────┬─────────────┐
│  author_steamid   │ num_reviews │
│       int64       │    int64    │
├───────────────────┼─────────────┤
│ 76561198030784015 │        9674 │
│ 76561198024340430 │        5930 │
│ 76561198067298289 │        5534 │
│ 76561198094803808 │        4341 │
│ 76561198125392509 │        4124 │
│ 76561198027267313 │        4124 │
│ 76561197960373660 │        3350 │
│ 76561197970602587 │        3212 │
│ 76561197961017729 │        2810 │
│ 76561198155150242 │        2554 │
│ 76561198066590240 │        2490 │
│ 76561198045381877 │        2280 │
│ 76561198069159152 │        2216 │
│ 76561197960319772 │        1978 │
│ 76561198062813911 │        1940 │
│ 76561198025731804 │        1928 │
│ 76561198137285867 │        1898 │
│ 76561198043135631 │        1883 │
│ 76561198036629241 │        1875 │
│ 76561198055119582 │        1843 │
│         ·         │      ·      │
│         ·         │      ·      │
│         ·         │      ·      │
│ 76561198111175247 │           1 │
├───────────────────┴─────────────┤
│ 15324507 rows (40 shown)        │
└─────────────────────────────────┘
Run Time (s): real 22.556 user 166.382276 sys 9.815767
This query processes the 112M rows in a surprisingly short 22.5s. The most active Steam user has written 9674 game reviews. You can find them here.
- I made the second query intentionally more complex. I filter for reviews that contain the word "love" and I also compute the mean of the votes_up score and the sum of the votes_funny score for each user. I then filter the aggregation using having to keep only users that have a sum_votes_funny score of more than 100.
D SELECT author_steamid, COUNT(*) AS num_reviews, avg(votes_up) as mean_votes_up, sum(votes_funny) as sum_votes_funny, FROM read_csv('all_reviews.csv', ignore_errors = true) WHERE language = 'english' and review LIKE '%love%' GROUP BY author_steamid having sum_votes_funny > 100 ORDER BY num_reviews DESC;
100% ▕████████████████████████████████████████████████████████████▏
┌───────────────────┬─────────────┬────────────────────┬─────────────────┐
│  author_steamid   │ num_reviews │   mean_votes_up    │ sum_votes_funny │
│       int64       │    int64    │       double       │     int128      │
├───────────────────┼─────────────┼────────────────────┼─────────────────┤
│ 76561198155150242 │         777 │   1.09009009009009 │             362 │
│ 76561198042406453 │         665 │ 13.478195488721804 │             341 │
│ 76561198043135631 │         475 │ 2.9410526315789474 │             116 │
│ 76561198149437416 │         352 │ 1.2386363636363635 │             256 │
│ 76561198007343154 │         342 │  4.038011695906433 │             375 │
│ 76561198066590240 │         273 │  9.293040293040294 │             247 │
│ 76561197970314107 │         241 │  38.15767634854772 │             319 │
│ 76561197961017729 │         237 │ 10.278481012658228 │             251 │
│ 76561197981638563 │         231 │ 26.372294372294373 │             509 │
│ 76561197972040704 │         223 │  8.560538116591928 │             123 │
│ 76561197992694498 │         223 │   73.1390134529148 │             759 │
│ 76561197970761123 │         222 │ 22.603603603603602 │             150 │
│ 76561198043609914 │         217 │ 18.792626728110598 │             129 │
│ 76561198053422627 │         202 │  17.04950495049505 │             284 │
│ 76561198007888370 │         196 │  93.96938775510205 │            3345 │
│ 76561198040884867 │         191 │  38.41884816753927 │             509 │
│ 76561198817597644 │         166 │ 22.542168674698797 │             119 │
│ 76561198356141989 │         166 │ 19.246987951807228 │             216 │
│ 76561198031599084 │         165 │ 17.163636363636364 │             171 │
│ 76561198011647032 │         159 │  6.345911949685535 │             192 │
│         ·         │      ·      │         ·          │        ·        │
│         ·         │      ·      │         ·          │        ·        │
│         ·         │      ·      │         ·          │        ·        │
│ 76561198106232693 │           1 │               96.0 │             155 │
├───────────────────┴─────────────┴────────────────────┴─────────────────┤
│ 2383 rows (40 shown)                                         4 columns │
└─────────────────────────────────────────────────────────────────────────┘
Run Time (s): real 26.648 user 213.432443 sys 7.281861
With ~26s this is on par with the previous query. User 76561198155150242 used the word "love" in 777 reviews and collected 362 "funny" votes across their reviews.
These timings show how incredibly convenient DuckDB is for analysing large datasets. <30s is still in the realm where I can work with the dataset interactively and explore different queries. You can push DuckDB to the limit by using a blocking operator, e.g. a rank() window function. Such a query will be slow because the entire input needs to be buffered to compute the result (see the sketch below).
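For illustration, a query along the following lines would hit that limit. This is a sketch only; the particular ranking is arbitrary and I did not time it for this post:

-- rank every review by votes_up within its language;
-- the window function is a blocking operator, so the entire input must be
-- buffered (spilling to disk if necessary) before the first result row appears
SELECT recommendationid,
       language,
       votes_up,
       rank() OVER (PARTITION BY language ORDER BY votes_up DESC) AS votes_rank
FROM read_csv('all_reviews.csv', ignore_errors = true);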
When I explore data, I more commonly work in a notebook than in the CLI. Fortunately, I can run the same queries using the DuckDB Python client. In the next section I am going to compare it against polars, another relatively new kid on the block for out-of-memory analytics.
Working in a Jupyter notebook
DuckDB Python client
I am going to write the code to run the above queries using the duckdb Python package and return the results as a pandas dataframe.
import duckdb
import pandas as pd

# Connect to DuckDB
conn = duckdb.connect(':memory:')

# Define the queries
query1 = """
SELECT author_steamid, COUNT(*) AS num_reviews
FROM read_csv('./steam_reviews/all_reviews/all_reviews.csv', ignore_errors = true)
WHERE language = 'english'
GROUP BY author_steamid
ORDER BY num_reviews DESC
"""

query2 = """
SELECT author_steamid,
COUNT(*) AS num_reviews,
avg(votes_up) as mean_votes_up,
sum(votes_funny) as sum_votes_funny,
FROM read_csv('./steam_reviews/all_reviews/all_reviews.csv', ignore_errors = true)
WHERE language = 'english' and review LIKE '%love%'
GROUP BY author_steamid
having sum_votes_funny > 100
ORDER BY num_reviews DESC
"""
%%timeit

# Execute the query
result1 = conn.execute(query1)
# Convert the result to a pandas DataFrame
df1 = result1.df()
16.6 s ± 627 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
df1.head(5)
| | author_steamid | num_reviews |
|---|---|---|
| 0 | 76561198030784015 | 9674 |
| 1 | 76561198024340430 | 5930 |
| 2 | 76561198067298289 | 5534 |
| 3 | 76561198094803808 | 4341 |
| 4 | 76561198027267313 | 4124 |
%%timeit

# Execute the query
result2 = conn.execute(query2)
# Convert the result to a pandas DataFrame
df2 = result2.df()
19.9 s ± 343 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
df2.head(5)
| | author_steamid | num_reviews | mean_votes_up | sum_votes_funny |
|---|---|---|---|---|
| 0 | 76561198155150242 | 777 | 1.090090 | 362.0 |
| 1 | 76561198042406453 | 665 | 13.478195 | 341.0 |
| 2 | 76561198043135631 | 475 | 2.941053 | 116.0 |
| 3 | 76561198149437416 | 352 | 1.238636 | 256.0 |
| 4 | 76561198007343154 | 342 | 4.038012 | 375.0 |
Not surprisingly, this gives the same results as the queries run from the CLI. Having the aggregated results as a pandas dataframe allows me to use them in downstream work. I don't need to learn another syntax.
## clear memory
import gc
del result1, result2, df1, df2;
gc.collect();
Polars
Since polars is often brought up as a faster and better pandas, I want to do a performance comparison. Polars supports lazy DataFrames, which allow me to define operations on the dataframe without loading it fully into memory first.
Below are the two equivalent queries written using polars syntax.
import polars as pl

# Create a lazy DataFrame
df_lazy = pl.scan_csv("./steam_reviews/all_reviews/all_reviews.csv", ignore_errors=True)
%%time

# Define the query using lazy operations
result_polars1 = (
    df_lazy.filter(pl.col("language") == "english")
    .group_by("author_steamid")
    .agg(pl.len().alias("num_reviews"))
    .sort("num_reviews", descending=True)
)

# Execute the query and collect the results
df_polars1 = result_polars1.collect()
df_polars1.head(5)
CPU times: user 1min 3s, sys: 22.1 s, total: 1min 26s
Wall time: 2min 7s
| author_steamid | num_reviews |
|---|---|
| i64 | u32 |
| 76561198030784015 | 9822 |
| 76561198024340430 | 5983 |
| 76561198067298289 | 5577 |
| 76561198094803808 | 4408 |
| 76561198125392509 | 4203 |
%%time

result_polars2 = (
    df_lazy
    .filter(
        (pl.col("language") == "english") &
        pl.col("review").str.contains("love")
    )
    .group_by("author_steamid")
    .agg([
        pl.len().alias("num_reviews"),
        pl.col("votes_up").mean().alias("mean_votes_up"),
        pl.col("votes_funny").sum().alias("sum_votes_funny")
    ])
    .filter(pl.col("sum_votes_funny") > 100)
    .sort("num_reviews", descending=True)
)

# Execute the query and collect the results
df_polars2 = result_polars2.collect()
df_polars2.head(5)
CPU times: user 1min 21s, sys: 42.7 s, total: 2min 4s
Wall time: 5min 14s
| author_steamid | num_reviews | mean_votes_up | sum_votes_funny |
|---|---|---|---|
| i64 | u32 | f64 | i64 |
| 76561198155150242 | 786 | 1.094148 | 369 |
| 76561198042406453 | 673 | 13.43685 | 345 |
| 76561198043135631 | 480 | 2.922917 | 116 |
| 76561198149437416 | 360 | 1.216667 | 256 |
| 76561198007343154 | 354 | 4.014124 | 381 |
Now we can compare DuckDB vs. polars execution times for the two queries:
| wall time | DuckDB | polars |
|---|---|---|
| query 1 | 17s | 2min 7s |
| query 2 | 20s | 5min 14s |
Conclusion
This investigation shows that DuckDB is a powerful and convenient tool for processing larger-than-memory datasets on a single machine. As a data scientist I find this useful because I can focus on the data itself during the early exploration phase; I don't yet have to spend time setting up more complex tools or working on a remote machine. The quick comparison with polars also shows DuckDB's speed advantage and that it can be used as a drop-in replacement when aggregating data in Jupyter notebooks.
If you have any thoughts, questions, or feedback about this post, I would love to hear it. Please reach out to me via email.