staticnotes.org

Data analysis: Vätternrundan 2024 results

⋅ 9 minute read

Contents

In 2024, I was foolish enough to participate in the Vätternrundan external link , which is a 315km distance bike sportive around Sweden’s second biggest lake. It took me and my friends a little less than 11.5 hours of cycling (15h including breaks).

Since this event is a sportive external link , not a race, the organizers only publish finishing times via a bib number search on their website external link . However, I was curious to understand how well my group did overall. To get my hands on the data I wrote a small scraping script to collect all finish times from the results page.

I can use this data to find out:

Scraping the dataset

Since getting the timing information from the results page involves several clicks for each start number, I use selenium external link to automate the browser actions. The script performs the actions for each start number between 1 and 30,000 and collects the timing information. Some start numbers are not used and many people didn’t finish the race. All the data is stored in a parquet file that is used in the next section. I had to do some light data cleaning, e.g. converting scraped strings to timedelta, removing data for unused startnumbers, etc.

Loading the dataset

I load the data from the parquet file into a pandas dataframe. I perform a few additional data cleaning steps, so it is easier to work with the dataset.

1import pandas as pd
2import numpy as np
3import datetime as dt
4import matplotlib.pyplot as plt
5import seaborn as sns
6from matplotlib.ticker import FuncFormatter
1df = pd.read_parquet("results_vatternrundan24.parquet")
2df['result_time_minutes'] = df['result_time'].dt.total_seconds() / 60
3df['average_speed'] = 1/df['result_time_minutes'] * 315 * 60
4df['has_finished'] = ~df['result_time'].isna() & (df['timing_consistent'])
1print(f'The dataset contains {df.shape[0]} entries.')
The dataset contains 15813 entries.

Analysing the dataset

The dataset has the following columns:

1display(df[['startnumber', 'initials', 'club','city', 'country', 'result_time', 'average_speed', 'has_finished']].head(5))
2df.dtypes

startnumber initials club city country result_time average_speed has_finished
40 5121 JK - Mantorp SE 0 days 15:11:00 20.746432 True
41 5122 MI - Skärblacka SE 0 days 15:11:00 20.746432 True
42 5123 TP - Höör SE 0 days 14:35:00 21.600000 True
43 5124 KH Trimgutta Løvenstad NO 0 days 14:47:00 21.307779 True
44 5125 HN - Jessheim NO NaT NaN False
startnumber                      int64
initials                        object
club                            object
city                            object
country                         object
result_time            timedelta64[ns]
start_time                      object
station_records                 object
timing_consistent                 bool
is_sub9                           bool
result_time_minutes            float64
average_speed                  float64
has_finished                      bool
dtype: object

The riders

Let’s first look at the top countries, cities, and clubs that are in this dataset. Not surprisingly, Sweden (SE) has the most participants, followed by Germany (DE), Norway (NO), and Finnland (FI).

1df.groupby('country').agg(num_starters = ('startnumber', 'count'), num_finishers = ('has_finished', 'sum'), perc_finished = ('has_finished', 'mean'), mean_result_time = ('result_time', 'mean'), fastest_finisher = ('result_time', 'min')).sort_values('num_starters', ascending=False).head(10)

num_starters num_finishers perc_finished mean_result_time fastest_finisher
country
SE 12035 9904 0.822933 0 days 13:38:44.486094316 0 days 05:15:00
DE 1437 1174 0.816980 0 days 14:30:54.331914893 0 days 07:37:00
NO 846 658 0.777778 0 days 11:15:29.272727272 0 days 07:16:00
FI 402 335 0.833333 0 days 12:56:00.537313432 0 days 08:09:00
DK 346 281 0.812139 0 days 13:24:26.501766784 0 days 07:57:00
GB 269 194 0.721190 0 days 14:24:04.307692307 0 days 02:54:00
NL 87 77 0.885057 0 days 14:25:03.896103896 0 days 08:53:00
43 36 0.837209 0 days 15:36:35 0 days 09:01:00
US 38 25 0.657895 0 days 13:39:55.384615384 0 days 09:19:00
PL 38 30 0.789474 0 days 12:52:26 0 days 09:20:00

The cities with most participants are all Swedish: Stockholm, Göteborg, Linköping, Uppsala, Malmö.

1df.groupby('city').agg(num_starters = ('startnumber', 'count'), num_finishers = ('has_finished', 'sum'), perc_finished = ('has_finished', 'mean'), mean_result_time = ('result_time', 'mean'), fastest_finisher = ('result_time', 'min')).sort_values('num_starters', ascending=False).head(10)

num_starters num_finishers perc_finished mean_result_time fastest_finisher
city
Stockholm 851 726 0.853114 0 days 13:37:09.752066115 0 days 07:31:00
Göteborg 613 516 0.841762 0 days 13:30:09.266409266 0 days 07:35:00
Linköping 378 313 0.828042 0 days 13:48:16.815286624 0 days 07:31:00
Uppsala 323 268 0.829721 0 days 13:50:19.029850746 0 days 07:37:00
Malmö 233 191 0.819742 0 days 13:52:36.125654450 0 days 07:31:00
Örebro 230 184 0.800000 0 days 13:34:19.677419354 0 days 08:19:00
Motala 223 186 0.834081 0 days 14:07:49.354838709 0 days 07:35:00
Västerås 186 142 0.763441 0 days 13:25:54.929577464 0 days 07:37:00
Lund 168 138 0.821429 0 days 13:56:18.260869565 0 days 07:16:00
Jönköping 147 124 0.843537 0 days 13:49:49.354838709 0 days 08:55:00

The following clubs had the most riders.

1df.groupby('club').agg(city = ('city', 'first'), num_starters = ('startnumber', 'count'), num_finishers = ('has_finished', 'sum'), perc_finished = ('has_finished', 'mean'), mean_result_time = ('result_time', 'mean'), fastest_finisher = ('result_time', 'min')).sort_values('num_starters', ascending=False)[1:10]

city num_starters num_finishers perc_finished mean_result_time fastest_finisher
club
IMOVEFORCANCER Karlskrona 69 58 0.840580 0 days 09:54:26.896551724 0 days 07:37:00
Fredrikshof Stockholm 56 48 0.857143 0 days 10:46:42.500000 0 days 09:20:00
Örebrocyklisterna Kumla 55 42 0.763636 0 days 11:31:47.142857142 0 days 08:04:00
Team Kungälv Kungälv 43 29 0.674419 0 days 10:37:20 0 days 08:53:00
schulz sportreisen Dresden 40 36 0.900000 0 days 13:47:45 0 days 10:20:00
KCE - Kävlinge Cykelentusiaster Kävlinge 38 32 0.842105 0 days 11:50:01.875000 0 days 09:41:00
Fredrikshofs IF Cykelklubb Enskede 35 30 0.857143 0 days 11:29:52 0 days 09:44:00
Försvarsmakten Örebro 32 29 0.906250 0 days 13:51:00 0 days 09:21:00
Team Sportia Uppsala Uppsala 31 29 0.935484 0 days 10:50:47.586206896 0 days 09:24:00

Number of finishers

315km is a long event. Let’s check how many of the participants finished the event:

1num_starters = df.shape[0]
2num_finishers = df.query('has_finished').shape[0]
3print(f"Among the {num_starters} starting riders in the dataset, we found recorded and consistent finishing times for {num_finishers}. \
4This is a finishing rate of {100 * num_finishers / num_starters:.2f}%.")
Among the 15813 starting riders in the dataset, we found recorded and consistent finishing times for 12920. This is a finishing rate of 81.70%.

The podium and top 10

1fastest_rider = df.query('has_finished').sort_values('result_time').head(1)
2print(f'The fastest rider {fastest_rider.initials.values[0]} from {fastest_rider.city.values[0]} only needed {dt.timedelta(microseconds= float(fastest_rider.result_time.values[0])/1000)} to complete the 315km. \
3That is a mindblowing average speed of {(fastest_rider.average_speed.values[0]):.1f} km/h.')
The fastest rider JL from Lund only needed 7:16:00 to complete the 315km. That is a mindblowing average speed of 43.3 km/h.

The 10 fastest riders were below 7h 32min.

1df.query('has_finished')[['startnumber', 'initials', 'city', 'club', 'country', 'result_time', 'average_speed']].sort_values('result_time', ascending=True).head(10)

startnumber initials city club country result_time average_speed
16311 27012 JL Lund Kjekkas IF SE 0 days 07:16:00 43.348624
16301 27002 TJ Slattum - NO 0 days 07:16:00 43.348624
16304 27005 EO Nittedal - NO 0 days 07:16:00 43.348624
16308 27009 BHB Follebu - NO 0 days 07:16:00 43.348624
16310 27011 DT Oslo - NO 0 days 07:16:00 43.348624
16316 27017 JK Lillehammer - NO 0 days 07:16:00 43.348624
16314 27015 SD Oslo - Norway - NO 0 days 07:16:00 43.348624
16781 27482 HD Kalmar - SE 0 days 07:31:00 41.906874
16782 27483 TE Rockneby - SE 0 days 07:31:00 41.906874
16783 27484 JL Ekerö - SE 0 days 07:31:00 41.906874

The 10 fastest riders that didn’t ride as part of a sub-9 registered teams were:

1df.query('has_finished and not is_sub9')[['startnumber', 'initials', 'city', 'club', 'country', 'result_time', 'average_speed']].sort_values('result_time', ascending=True).head(10)

startnumber initials city club country result_time average_speed
8860 16661 MN Haslum Team Tøff i Tryne NO 0 days 08:02:00 39.211618
8850 16651 HL Oslo Team Tøff i Trynet NO 0 days 08:02:00 39.211618
28298 18529 MS Otalampi Team Wassu FI 0 days 08:12:00 38.414634
28297 18528 JK Kotka Team Wassu FI 0 days 08:12:00 38.414634
28295 18526 JP Helsinki Team Wassu FI 0 days 08:12:00 38.414634
8849 16650 SG Eiksmarka Team Tøff i Trynet NO 0 days 08:17:00 38.028169
28296 18527 JS Hamina Team Wassu FI 0 days 08:21:00 37.724551
8861 16662 FA Oslo Team Tøff i Trynet NO 0 days 08:23:00 37.574553
11817 23218 WG Lemgo RC Sprintax Bielefeld DE 0 days 08:24:00 37.500000
11819 23220 CB Bielefeld RC Sprintax Bielefeld DE 0 days 08:24:00 37.500000

Lanterne rouge

The slowest finisher took almost 28h.

1df.query('has_finished')[['startnumber', 'initials', 'city', 'country', 'result_time', 'average_speed']].sort_values('result_time', ascending=True).tail(5)

startnumber initials city country result_time average_speed
3416 2367 AH Danderyd SE 1 days 02:37:00 11.834690
1464 415 MP Östersund SE 1 days 02:49:00 11.746426
1465 416 ZYP Stockholm SE 1 days 02:49:00 11.746426
1260 211 HB Svedala SE 1 days 03:36:00 11.413043
1109 60 KH Skara SE 1 days 03:44:00 11.358173

Finishing times (excluding sub-9 groups)

To get a better sense of the finishing times, I plot the histogram and get some summary statistics of the result_time column for every finisher. I only consider the times of the regular riders below. This means excluding the sub-9 cycling teams that start separately later on Saturday. With a little less than 15h total (11.5h moving time) my group was slower than the median rider. Next time, we might want to take shorter breaks.

1times = df.query('has_finished and not is_sub9')
2times['result_time'].describe()
count                        12175
mean     0 days 13:55:05.829979466
std      0 days 02:57:47.200342638
min                0 days 08:02:00
25%                0 days 11:37:00
50%                0 days 13:44:00
75%                0 days 15:48:00
max                1 days 03:44:00
Name: result_time, dtype: object
1print(f"To be among the top 5% fastest finishers, you need to beat {times['result_time'].quantile(q=0.05)}.")
2print(f"To be among the top 10% fastest finishers, you need to beat {times['result_time'].quantile(q=0.1)}.")
3print(f"To be among the top 25% fastest finishers, you need to beat {times['result_time'].quantile(q=0.25)}.")
To be among the top 5% fastest finishers, you need to beat 0 days 09:44:00.
To be among the top 10% fastest finishers, you need to beat 0 days 10:10:00.
To be among the top 25% fastest finishers, you need to beat 0 days 11:37:00.

To plot the distribution of result times, I am binning the result time in minutes and also add some vertical lines indicating the 0.1- and 0.25- percentiles. It’s not surprising that the distribution is right-skewed with many riders that have very long finishing times, but nobody faster than 8 hours.

 1
 2def minutes_to_hours_minutes(x, pos):
 3    """Define a function to convert minutes to HH:MM. Used for the axis labelling."""
 4    hours = int(x // 60)
 5    minutes = int(x % 60)
 6    return f"{hours:02d}:{minutes:02d}"
 7
 8# create histogram plot of result time distribution
 9plt.figure(figsize=(10, 6)) 
10sns.histplot(data=times, x="result_time_minutes", binwidth=15)
11
12plt.gca().xaxis.set_major_formatter(FuncFormatter(minutes_to_hours_minutes))
13
14# draw colored vertical percentile lines
15plt.axvline(times.result_time_minutes.median(), color='red', linestyle='--', linewidth=1.5)
16plt.axvline(times.result_time_minutes.quantile(0.25), color='green', linestyle='--', linewidth=1.5)
17plt.axvline(times.result_time_minutes.quantile(0.1), color='orange', linestyle='--', linewidth=1.5)
18plt.text(times.result_time_minutes.median(), plt.ylim()[1] * 0.95, f'Median',
19         color='red', ha='left', va='bottom')
20plt.text(times.result_time_minutes.quantile(0.25), plt.ylim()[1] * 0.95, f'Top 25%',
21         color='green', ha='center', va='bottom')
22plt.text(times.result_time_minutes.quantile(0.1), plt.ylim()[1] * 0.95, f'Top 10%',
23         color='orange', ha='right', va='bottom');
24plt.xlabel('Finish time (incl. breaks)');

If you are planning to race next year’s Vatternrundan, this distribution might give you an indication of what to expect and a time to aim for.

Jupyter Notebook

You can find the jupyter notebook and the datasets for this post here external link .

If you have any thoughts, questions, or feedback about this post, I would love to hear it. Please reach out to me via email.

#notebook   #statistics