Data analysis: Vätternrundan 2024 results
⋅ 9 minute read
Contents
In 2024, I was foolish enough to participate in the Vätternrundan , which is a 315km distance bike sportive around Sweden’s second biggest lake. It took me and my friends a little less than 11.5 hours of cycling (15h including breaks).
Since this event is a sportive , not a race, the organizers only publish finishing times via a bib number search on their website . However, I was curious to understand how well my group did overall. To get my hands on the data I wrote a small scraping script to collect all finish times from the results page.
I can use this data to find out:
- information about the participants, e.g. what countries had a lot of starters
- distribution of race times, e.g. how fast is the median rider
- what percentage of people finished the race
Scraping the dataset
Since getting the timing information from the results page involves several clicks for each start number, I use selenium to automate the browser actions. The script performs the actions for each start number between 1 and 30,000 and collects the timing information. Some start numbers are not used and many people didn’t finish the race. All the data is stored in a parquet file that is used in the next section. I had to do some light data cleaning, e.g. converting scraped strings to timedelta, removing data for unused startnumbers, etc.
Loading the dataset
I load the data from the parquet file into a pandas dataframe. I perform a few additional data cleaning steps, so it is easier to work with the dataset.
1import pandas as pd
2import numpy as np
3import datetime as dt
4import matplotlib.pyplot as plt
5import seaborn as sns
6from matplotlib.ticker import FuncFormatter
1df = pd.read_parquet("results_vatternrundan24.parquet")
2df['result_time_minutes'] = df['result_time'].dt.total_seconds() / 60
3df['average_speed'] = 1/df['result_time_minutes'] * 315 * 60
4df['has_finished'] = ~df['result_time'].isna() & (df['timing_consistent'])
1print(f'The dataset contains {df.shape[0]} entries.')
The dataset contains 15813 entries.
Analysing the dataset
The dataset has the following columns:
startnumber
initials
: initials of the riderclub
: club of the ridercity
: home city of the ridercountry
: home country of the riderresult_time
: full duration between crossing start and finish lines (includes breaks at food stations)average_speed
is_sub9
: whether the rider was part of a sub9 group that started later in the day (after 11:30 on Saturday)has_finished
: a boolean whether a rider completed the event or dropped out
1display(df[['startnumber', 'initials', 'club','city', 'country', 'result_time', 'average_speed', 'has_finished']].head(5))
2df.dtypes
startnumber | initials | club | city | country | result_time | average_speed | has_finished | |
---|---|---|---|---|---|---|---|---|
40 | 5121 | JK | - | Mantorp | SE | 0 days 15:11:00 | 20.746432 | True |
41 | 5122 | MI | - | Skärblacka | SE | 0 days 15:11:00 | 20.746432 | True |
42 | 5123 | TP | - | Höör | SE | 0 days 14:35:00 | 21.600000 | True |
43 | 5124 | KH | Trimgutta | Løvenstad | NO | 0 days 14:47:00 | 21.307779 | True |
44 | 5125 | HN | - | Jessheim | NO | NaT | NaN | False |
startnumber int64
initials object
club object
city object
country object
result_time timedelta64[ns]
start_time object
station_records object
timing_consistent bool
is_sub9 bool
result_time_minutes float64
average_speed float64
has_finished bool
dtype: object
The riders
Let’s first look at the top countries, cities, and clubs that are in this dataset. Not surprisingly, Sweden (SE) has the most participants, followed by Germany (DE), Norway (NO), and Finnland (FI).
1df.groupby('country').agg(num_starters = ('startnumber', 'count'), num_finishers = ('has_finished', 'sum'), perc_finished = ('has_finished', 'mean'), mean_result_time = ('result_time', 'mean'), fastest_finisher = ('result_time', 'min')).sort_values('num_starters', ascending=False).head(10)
num_starters | num_finishers | perc_finished | mean_result_time | fastest_finisher | |
---|---|---|---|---|---|
country | |||||
SE | 12035 | 9904 | 0.822933 | 0 days 13:38:44.486094316 | 0 days 05:15:00 |
DE | 1437 | 1174 | 0.816980 | 0 days 14:30:54.331914893 | 0 days 07:37:00 |
NO | 846 | 658 | 0.777778 | 0 days 11:15:29.272727272 | 0 days 07:16:00 |
FI | 402 | 335 | 0.833333 | 0 days 12:56:00.537313432 | 0 days 08:09:00 |
DK | 346 | 281 | 0.812139 | 0 days 13:24:26.501766784 | 0 days 07:57:00 |
GB | 269 | 194 | 0.721190 | 0 days 14:24:04.307692307 | 0 days 02:54:00 |
NL | 87 | 77 | 0.885057 | 0 days 14:25:03.896103896 | 0 days 08:53:00 |
43 | 36 | 0.837209 | 0 days 15:36:35 | 0 days 09:01:00 | |
US | 38 | 25 | 0.657895 | 0 days 13:39:55.384615384 | 0 days 09:19:00 |
PL | 38 | 30 | 0.789474 | 0 days 12:52:26 | 0 days 09:20:00 |
The cities with most participants are all Swedish: Stockholm, Göteborg, Linköping, Uppsala, Malmö.
1df.groupby('city').agg(num_starters = ('startnumber', 'count'), num_finishers = ('has_finished', 'sum'), perc_finished = ('has_finished', 'mean'), mean_result_time = ('result_time', 'mean'), fastest_finisher = ('result_time', 'min')).sort_values('num_starters', ascending=False).head(10)
num_starters | num_finishers | perc_finished | mean_result_time | fastest_finisher | |
---|---|---|---|---|---|
city | |||||
Stockholm | 851 | 726 | 0.853114 | 0 days 13:37:09.752066115 | 0 days 07:31:00 |
Göteborg | 613 | 516 | 0.841762 | 0 days 13:30:09.266409266 | 0 days 07:35:00 |
Linköping | 378 | 313 | 0.828042 | 0 days 13:48:16.815286624 | 0 days 07:31:00 |
Uppsala | 323 | 268 | 0.829721 | 0 days 13:50:19.029850746 | 0 days 07:37:00 |
Malmö | 233 | 191 | 0.819742 | 0 days 13:52:36.125654450 | 0 days 07:31:00 |
Örebro | 230 | 184 | 0.800000 | 0 days 13:34:19.677419354 | 0 days 08:19:00 |
Motala | 223 | 186 | 0.834081 | 0 days 14:07:49.354838709 | 0 days 07:35:00 |
Västerås | 186 | 142 | 0.763441 | 0 days 13:25:54.929577464 | 0 days 07:37:00 |
Lund | 168 | 138 | 0.821429 | 0 days 13:56:18.260869565 | 0 days 07:16:00 |
Jönköping | 147 | 124 | 0.843537 | 0 days 13:49:49.354838709 | 0 days 08:55:00 |
The following clubs had the most riders.
1df.groupby('club').agg(city = ('city', 'first'), num_starters = ('startnumber', 'count'), num_finishers = ('has_finished', 'sum'), perc_finished = ('has_finished', 'mean'), mean_result_time = ('result_time', 'mean'), fastest_finisher = ('result_time', 'min')).sort_values('num_starters', ascending=False)[1:10]
city | num_starters | num_finishers | perc_finished | mean_result_time | fastest_finisher | |
---|---|---|---|---|---|---|
club | ||||||
IMOVEFORCANCER | Karlskrona | 69 | 58 | 0.840580 | 0 days 09:54:26.896551724 | 0 days 07:37:00 |
Fredrikshof | Stockholm | 56 | 48 | 0.857143 | 0 days 10:46:42.500000 | 0 days 09:20:00 |
Örebrocyklisterna | Kumla | 55 | 42 | 0.763636 | 0 days 11:31:47.142857142 | 0 days 08:04:00 |
Team Kungälv | Kungälv | 43 | 29 | 0.674419 | 0 days 10:37:20 | 0 days 08:53:00 |
schulz sportreisen | Dresden | 40 | 36 | 0.900000 | 0 days 13:47:45 | 0 days 10:20:00 |
KCE - Kävlinge Cykelentusiaster | Kävlinge | 38 | 32 | 0.842105 | 0 days 11:50:01.875000 | 0 days 09:41:00 |
Fredrikshofs IF Cykelklubb | Enskede | 35 | 30 | 0.857143 | 0 days 11:29:52 | 0 days 09:44:00 |
Försvarsmakten | Örebro | 32 | 29 | 0.906250 | 0 days 13:51:00 | 0 days 09:21:00 |
Team Sportia Uppsala | Uppsala | 31 | 29 | 0.935484 | 0 days 10:50:47.586206896 | 0 days 09:24:00 |
Number of finishers
315km is a long event. Let’s check how many of the participants finished the event:
1num_starters = df.shape[0]
2num_finishers = df.query('has_finished').shape[0]
3print(f"Among the {num_starters} starting riders in the dataset, we found recorded and consistent finishing times for {num_finishers}. \
4This is a finishing rate of {100 * num_finishers / num_starters:.2f}%.")
Among the 15813 starting riders in the dataset, we found recorded and consistent finishing times for 12920. This is a finishing rate of 81.70%.
The podium and top 10
1fastest_rider = df.query('has_finished').sort_values('result_time').head(1)
2print(f'The fastest rider {fastest_rider.initials.values[0]} from {fastest_rider.city.values[0]} only needed {dt.timedelta(microseconds= float(fastest_rider.result_time.values[0])/1000)} to complete the 315km. \
3That is a mindblowing average speed of {(fastest_rider.average_speed.values[0]):.1f} km/h.')
The fastest rider JL from Lund only needed 7:16:00 to complete the 315km. That is a mindblowing average speed of 43.3 km/h.
The 10 fastest riders were below 7h 32min.
1df.query('has_finished')[['startnumber', 'initials', 'city', 'club', 'country', 'result_time', 'average_speed']].sort_values('result_time', ascending=True).head(10)
startnumber | initials | city | club | country | result_time | average_speed | |
---|---|---|---|---|---|---|---|
16311 | 27012 | JL | Lund | Kjekkas IF | SE | 0 days 07:16:00 | 43.348624 |
16301 | 27002 | TJ | Slattum | - | NO | 0 days 07:16:00 | 43.348624 |
16304 | 27005 | EO | Nittedal | - | NO | 0 days 07:16:00 | 43.348624 |
16308 | 27009 | BHB | Follebu | - | NO | 0 days 07:16:00 | 43.348624 |
16310 | 27011 | DT | Oslo | - | NO | 0 days 07:16:00 | 43.348624 |
16316 | 27017 | JK | Lillehammer | - | NO | 0 days 07:16:00 | 43.348624 |
16314 | 27015 | SD | Oslo - Norway | - | NO | 0 days 07:16:00 | 43.348624 |
16781 | 27482 | HD | Kalmar | - | SE | 0 days 07:31:00 | 41.906874 |
16782 | 27483 | TE | Rockneby | - | SE | 0 days 07:31:00 | 41.906874 |
16783 | 27484 | JL | Ekerö | - | SE | 0 days 07:31:00 | 41.906874 |
The 10 fastest riders that didn’t ride as part of a sub-9 registered teams were:
1df.query('has_finished and not is_sub9')[['startnumber', 'initials', 'city', 'club', 'country', 'result_time', 'average_speed']].sort_values('result_time', ascending=True).head(10)
startnumber | initials | city | club | country | result_time | average_speed | |
---|---|---|---|---|---|---|---|
8860 | 16661 | MN | Haslum | Team Tøff i Tryne | NO | 0 days 08:02:00 | 39.211618 |
8850 | 16651 | HL | Oslo | Team Tøff i Trynet | NO | 0 days 08:02:00 | 39.211618 |
28298 | 18529 | MS | Otalampi | Team Wassu | FI | 0 days 08:12:00 | 38.414634 |
28297 | 18528 | JK | Kotka | Team Wassu | FI | 0 days 08:12:00 | 38.414634 |
28295 | 18526 | JP | Helsinki | Team Wassu | FI | 0 days 08:12:00 | 38.414634 |
8849 | 16650 | SG | Eiksmarka | Team Tøff i Trynet | NO | 0 days 08:17:00 | 38.028169 |
28296 | 18527 | JS | Hamina | Team Wassu | FI | 0 days 08:21:00 | 37.724551 |
8861 | 16662 | FA | Oslo | Team Tøff i Trynet | NO | 0 days 08:23:00 | 37.574553 |
11817 | 23218 | WG | Lemgo | RC Sprintax Bielefeld | DE | 0 days 08:24:00 | 37.500000 |
11819 | 23220 | CB | Bielefeld | RC Sprintax Bielefeld | DE | 0 days 08:24:00 | 37.500000 |
Lanterne rouge
The slowest finisher took almost 28h.
1df.query('has_finished')[['startnumber', 'initials', 'city', 'country', 'result_time', 'average_speed']].sort_values('result_time', ascending=True).tail(5)
startnumber | initials | city | country | result_time | average_speed | |
---|---|---|---|---|---|---|
3416 | 2367 | AH | Danderyd | SE | 1 days 02:37:00 | 11.834690 |
1464 | 415 | MP | Östersund | SE | 1 days 02:49:00 | 11.746426 |
1465 | 416 | ZYP | Stockholm | SE | 1 days 02:49:00 | 11.746426 |
1260 | 211 | HB | Svedala | SE | 1 days 03:36:00 | 11.413043 |
1109 | 60 | KH | Skara | SE | 1 days 03:44:00 | 11.358173 |
Finishing times (excluding sub-9 groups)
To get a better sense of the finishing times, I plot the histogram and get some summary statistics of the result_time
column for every finisher.
I only consider the times of the regular riders below. This means excluding the sub-9 cycling teams that start separately later on Saturday.
With a little less than 15h total (11.5h moving time) my group was slower than the median rider. Next time, we might want to take shorter breaks.
1times = df.query('has_finished and not is_sub9')
2times['result_time'].describe()
count 12175
mean 0 days 13:55:05.829979466
std 0 days 02:57:47.200342638
min 0 days 08:02:00
25% 0 days 11:37:00
50% 0 days 13:44:00
75% 0 days 15:48:00
max 1 days 03:44:00
Name: result_time, dtype: object
1print(f"To be among the top 5% fastest finishers, you need to beat {times['result_time'].quantile(q=0.05)}.")
2print(f"To be among the top 10% fastest finishers, you need to beat {times['result_time'].quantile(q=0.1)}.")
3print(f"To be among the top 25% fastest finishers, you need to beat {times['result_time'].quantile(q=0.25)}.")
To be among the top 5% fastest finishers, you need to beat 0 days 09:44:00.
To be among the top 10% fastest finishers, you need to beat 0 days 10:10:00.
To be among the top 25% fastest finishers, you need to beat 0 days 11:37:00.
To plot the distribution of result times, I am binning the result time in minutes and also add some vertical lines indicating the 0.1- and 0.25- percentiles. It’s not surprising that the distribution is right-skewed with many riders that have very long finishing times, but nobody faster than 8 hours.
1
2def minutes_to_hours_minutes(x, pos):
3 """Define a function to convert minutes to HH:MM. Used for the axis labelling."""
4 hours = int(x // 60)
5 minutes = int(x % 60)
6 return f"{hours:02d}:{minutes:02d}"
7
8# create histogram plot of result time distribution
9plt.figure(figsize=(10, 6))
10sns.histplot(data=times, x="result_time_minutes", binwidth=15)
11
12plt.gca().xaxis.set_major_formatter(FuncFormatter(minutes_to_hours_minutes))
13
14# draw colored vertical percentile lines
15plt.axvline(times.result_time_minutes.median(), color='red', linestyle='--', linewidth=1.5)
16plt.axvline(times.result_time_minutes.quantile(0.25), color='green', linestyle='--', linewidth=1.5)
17plt.axvline(times.result_time_minutes.quantile(0.1), color='orange', linestyle='--', linewidth=1.5)
18plt.text(times.result_time_minutes.median(), plt.ylim()[1] * 0.95, f'Median',
19 color='red', ha='left', va='bottom')
20plt.text(times.result_time_minutes.quantile(0.25), plt.ylim()[1] * 0.95, f'Top 25%',
21 color='green', ha='center', va='bottom')
22plt.text(times.result_time_minutes.quantile(0.1), plt.ylim()[1] * 0.95, f'Top 10%',
23 color='orange', ha='right', va='bottom');
24plt.xlabel('Finish time (incl. breaks)');
If you are planning to race next year’s Vatternrundan, this distribution might give you an indication of what to expect and a time to aim for.
Jupyter Notebook
You can find the jupyter notebook and the datasets for this post here .
If you have any thoughts, questions, or feedback about this post, I would love to hear it. Please reach out to me via email.