Data analysis: Vätternrundan 2024 results

2024-08-03 ⋅ 9 minute read

Contents

In 2024, I was foolish enough to participate in the Vätternrundan , which is a 315km distance bike sportive around Sweden’s second biggest lake. It took me and my friends a little less than 11.5 hours of cycling (15h including breaks).

Since this event is a sportive , not a race, the organizers only publish finishing times via a bib number search on their website . However, I was curious to understand how well my group did overall. To get my hands on the data I wrote a small scraping script to collect all finish times from the results page.

I can use this data to find out:

information about the participants, e.g. what countries had a lot of starters
distribution of race times, e.g. how fast is the median rider
what percentage of people finished the race

Scraping the dataset

Since getting the timing information from the results page involves several clicks for each start number, I use selenium to automate the browser actions. The script performs the actions for each start number between 1 and 30,000 and collects the timing information. Some start numbers are not used and many people didn’t finish the race. All the data is stored in a parquet file that is used in the next section. I had to do some light data cleaning, e.g. converting scraped strings to timedelta, removing data for unused startnumbers, etc.

Loading the dataset

I load the data from the parquet file into a pandas dataframe. I perform a few additional data cleaning steps, so it is easier to work with the dataset.

import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import FuncFormatter

df = pd.read_parquet("results_vatternrundan24.parquet")
df['result_time_minutes'] = df['result_time'].dt.total_seconds() / 60
df['average_speed'] = 1/df['result_time_minutes'] * 315 * 60
df['has_finished'] = ~df['result_time'].isna() & (df['timing_consistent'])

print(f'The dataset contains {df.shape[0]} entries.')

The dataset contains 15813 entries.

Analysing the dataset

The dataset has the following columns:

startnumber
initials: initials of the rider
club: club of the rider
city: home city of the rider
country: home country of the rider
result_time: full duration between crossing start and finish lines (includes breaks at food stations)
average_speed
is_sub9: whether the rider was part of a sub9 group that started later in the day (after 11:30 on Saturday)
has_finished: a boolean whether a rider completed the event or dropped out

display(df[['startnumber', 'initials', 'club','city', 'country', 'result_time', 'average_speed', 'has_finished']].head(5))
df.dtypes

	startnumber	initials	club	city	country	result_time	average_speed	has_finished
40	5121	JK	-	Mantorp	SE	0 days 15:11:00	20.746432	True
41	5122	MI	-	Skärblacka	SE	0 days 15:11:00	20.746432	True
42	5123	TP	-	Höör	SE	0 days 14:35:00	21.600000	True
43	5124	KH	Trimgutta	Løvenstad	NO	0 days 14:47:00	21.307779	True
44	5125	HN	-	Jessheim	NO	NaT	NaN	False

startnumber                      int64
initials                        object
club                            object
city                            object
country                         object
result_time            timedelta64[ns]
start_time                      object
station_records                 object
timing_consistent                 bool
is_sub9                           bool
result_time_minutes            float64
average_speed                  float64
has_finished                      bool
dtype: object

The riders

Let’s first look at the top countries, cities, and clubs that are in this dataset. Not surprisingly, Sweden (SE) has the most participants, followed by Germany (DE), Norway (NO), and Finnland (FI).

df.groupby('country').agg(num_starters = ('startnumber', 'count'), num_finishers = ('has_finished', 'sum'), perc_finished = ('has_finished', 'mean'), mean_result_time = ('result_time', 'mean'), fastest_finisher = ('result_time', 'min')).sort_values('num_starters', ascending=False).head(10)

	num_starters	num_finishers	perc_finished	mean_result_time	fastest_finisher
country
SE	12035	9904	0.822933	0 days 13:38:44.486094316	0 days 05:15:00
DE	1437	1174	0.816980	0 days 14:30:54.331914893	0 days 07:37:00
NO	846	658	0.777778	0 days 11:15:29.272727272	0 days 07:16:00
FI	402	335	0.833333	0 days 12:56:00.537313432	0 days 08:09:00
DK	346	281	0.812139	0 days 13:24:26.501766784	0 days 07:57:00
GB	269	194	0.721190	0 days 14:24:04.307692307	0 days 02:54:00
NL	87	77	0.885057	0 days 14:25:03.896103896	0 days 08:53:00
	43	36	0.837209	0 days 15:36:35	0 days 09:01:00
US	38	25	0.657895	0 days 13:39:55.384615384	0 days 09:19:00
PL	38	30	0.789474	0 days 12:52:26	0 days 09:20:00

The cities with most participants are all Swedish: Stockholm, Göteborg, Linköping, Uppsala, Malmö.

df.groupby('city').agg(num_starters = ('startnumber', 'count'), num_finishers = ('has_finished', 'sum'), perc_finished = ('has_finished', 'mean'), mean_result_time = ('result_time', 'mean'), fastest_finisher = ('result_time', 'min')).sort_values('num_starters', ascending=False).head(10)

	num_starters	num_finishers	perc_finished	mean_result_time	fastest_finisher
city
Stockholm	851	726	0.853114	0 days 13:37:09.752066115	0 days 07:31:00
Göteborg	613	516	0.841762	0 days 13:30:09.266409266	0 days 07:35:00
Linköping	378	313	0.828042	0 days 13:48:16.815286624	0 days 07:31:00
Uppsala	323	268	0.829721	0 days 13:50:19.029850746	0 days 07:37:00
Malmö	233	191	0.819742	0 days 13:52:36.125654450	0 days 07:31:00
Örebro	230	184	0.800000	0 days 13:34:19.677419354	0 days 08:19:00
Motala	223	186	0.834081	0 days 14:07:49.354838709	0 days 07:35:00
Västerås	186	142	0.763441	0 days 13:25:54.929577464	0 days 07:37:00
Lund	168	138	0.821429	0 days 13:56:18.260869565	0 days 07:16:00
Jönköping	147	124	0.843537	0 days 13:49:49.354838709	0 days 08:55:00

The following clubs had the most riders.

df.groupby('club').agg(city = ('city', 'first'), num_starters = ('startnumber', 'count'), num_finishers = ('has_finished', 'sum'), perc_finished = ('has_finished', 'mean'), mean_result_time = ('result_time', 'mean'), fastest_finisher = ('result_time', 'min')).sort_values('num_starters', ascending=False)[1:10]

	city	num_starters	num_finishers	perc_finished	mean_result_time	fastest_finisher
club
IMOVEFORCANCER	Karlskrona	69	58	0.840580	0 days 09:54:26.896551724	0 days 07:37:00
Fredrikshof	Stockholm	56	48	0.857143	0 days 10:46:42.500000	0 days 09:20:00
Örebrocyklisterna	Kumla	55	42	0.763636	0 days 11:31:47.142857142	0 days 08:04:00
Team Kungälv	Kungälv	43	29	0.674419	0 days 10:37:20	0 days 08:53:00
schulz sportreisen	Dresden	40	36	0.900000	0 days 13:47:45	0 days 10:20:00
KCE - Kävlinge Cykelentusiaster	Kävlinge	38	32	0.842105	0 days 11:50:01.875000	0 days 09:41:00
Fredrikshofs IF Cykelklubb	Enskede	35	30	0.857143	0 days 11:29:52	0 days 09:44:00
Försvarsmakten	Örebro	32	29	0.906250	0 days 13:51:00	0 days 09:21:00
Team Sportia Uppsala	Uppsala	31	29	0.935484	0 days 10:50:47.586206896	0 days 09:24:00

Number of finishers

315km is a long event. Let’s check how many of the participants finished the event:

num_starters = df.shape[0]
num_finishers = df.query('has_finished').shape[0]
print(f"Among the {num_starters} starting riders in the dataset, we found recorded and consistent finishing times for {num_finishers}. \
This is a finishing rate of {100 * num_finishers / num_starters:.2f}%.")

Among the 15813 starting riders in the dataset, we found recorded and consistent finishing times for 12920. This is a finishing rate of 81.70%.

The podium and top 10

fastest_rider = df.query('has_finished').sort_values('result_time').head(1)
print(f'The fastest rider {fastest_rider.initials.values[0]} from {fastest_rider.city.values[0]} only needed {dt.timedelta(microseconds= float(fastest_rider.result_time.values[0])/1000)} to complete the 315km. \
That is a mindblowing average speed of {(fastest_rider.average_speed.values[0]):.1f} km/h.')

The fastest rider JL from Lund only needed 7:16:00 to complete the 315km. That is a mindblowing average speed of 43.3 km/h.

The 10 fastest riders were below 7h 32min.

df.query('has_finished')[['startnumber', 'initials', 'city', 'club', 'country', 'result_time', 'average_speed']].sort_values('result_time', ascending=True).head(10)

	startnumber	initials	city	club	country	result_time	average_speed
16311	27012	JL	Lund	Kjekkas IF	SE	0 days 07:16:00	43.348624
16301	27002	TJ	Slattum	-	NO	0 days 07:16:00	43.348624
16304	27005	EO	Nittedal	-	NO	0 days 07:16:00	43.348624
16308	27009	BHB	Follebu	-	NO	0 days 07:16:00	43.348624
16310	27011	DT	Oslo	-	NO	0 days 07:16:00	43.348624
16316	27017	JK	Lillehammer	-	NO	0 days 07:16:00	43.348624
16314	27015	SD	Oslo - Norway	-	NO	0 days 07:16:00	43.348624
16781	27482	HD	Kalmar	-	SE	0 days 07:31:00	41.906874
16782	27483	TE	Rockneby	-	SE	0 days 07:31:00	41.906874
16783	27484	JL	Ekerö	-	SE	0 days 07:31:00	41.906874

The 10 fastest riders that didn’t ride as part of a sub-9 registered teams were:

df.query('has_finished and not is_sub9')[['startnumber', 'initials', 'city', 'club', 'country', 'result_time', 'average_speed']].sort_values('result_time', ascending=True).head(10)

	startnumber	initials	city	club	country	result_time	average_speed
8860	16661	MN	Haslum	Team Tøff i Tryne	NO	0 days 08:02:00	39.211618
8850	16651	HL	Oslo	Team Tøff i Trynet	NO	0 days 08:02:00	39.211618
28298	18529	MS	Otalampi	Team Wassu	FI	0 days 08:12:00	38.414634
28297	18528	JK	Kotka	Team Wassu	FI	0 days 08:12:00	38.414634
28295	18526	JP	Helsinki	Team Wassu	FI	0 days 08:12:00	38.414634
8849	16650	SG	Eiksmarka	Team Tøff i Trynet	NO	0 days 08:17:00	38.028169
28296	18527	JS	Hamina	Team Wassu	FI	0 days 08:21:00	37.724551
8861	16662	FA	Oslo	Team Tøff i Trynet	NO	0 days 08:23:00	37.574553
11817	23218	WG	Lemgo	RC Sprintax Bielefeld	DE	0 days 08:24:00	37.500000
11819	23220	CB	Bielefeld	RC Sprintax Bielefeld	DE	0 days 08:24:00	37.500000

Lanterne rouge

The slowest finisher took almost 28h.

df.query('has_finished')[['startnumber', 'initials', 'city', 'country', 'result_time', 'average_speed']].sort_values('result_time', ascending=True).tail(5)

	startnumber	initials	city	country	result_time	average_speed
3416	2367	AH	Danderyd	SE	1 days 02:37:00	11.834690
1464	415	MP	Östersund	SE	1 days 02:49:00	11.746426
1465	416	ZYP	Stockholm	SE	1 days 02:49:00	11.746426
1260	211	HB	Svedala	SE	1 days 03:36:00	11.413043
1109	60	KH	Skara	SE	1 days 03:44:00	11.358173

Finishing times (excluding sub-9 groups)

To get a better sense of the finishing times, I plot the histogram and get some summary statistics of the result_time column for every finisher. I only consider the times of the regular riders below. This means excluding the sub-9 cycling teams that start separately later on Saturday. With a little less than 15h total (11.5h moving time) my group was slower than the median rider. Next time, we might want to take shorter breaks.

times = df.query('has_finished and not is_sub9')
times['result_time'].describe()

count                        12175
mean     0 days 13:55:05.829979466
std      0 days 02:57:47.200342638
min                0 days 08:02:00
25%                0 days 11:37:00
50%                0 days 13:44:00
75%                0 days 15:48:00
max                1 days 03:44:00
Name: result_time, dtype: object

print(f"To be among the top 5% fastest finishers, you need to beat {times['result_time'].quantile(q=0.05)}.")
print(f"To be among the top 10% fastest finishers, you need to beat {times['result_time'].quantile(q=0.1)}.")
print(f"To be among the top 25% fastest finishers, you need to beat {times['result_time'].quantile(q=0.25)}.")

To be among the top 5% fastest finishers, you need to beat 0 days 09:44:00.
To be among the top 10% fastest finishers, you need to beat 0 days 10:10:00.
To be among the top 25% fastest finishers, you need to beat 0 days 11:37:00.

To plot the distribution of result times, I am binning the result time in minutes and also add some vertical lines indicating the 0.1- and 0.25- percentiles. It’s not surprising that the distribution is right-skewed with many riders that have very long finishing times, but nobody faster than 8 hours.


def minutes_to_hours_minutes(x, pos):
    """Define a function to convert minutes to HH:MM. Used for the axis labelling."""
    hours = int(x // 60)
    minutes = int(x % 60)
    return f"{hours:02d}:{minutes:02d}"

# create histogram plot of result time distribution
plt.figure(figsize=(10, 6)) 
sns.histplot(data=times, x="result_time_minutes", binwidth=15)

plt.gca().xaxis.set_major_formatter(FuncFormatter(minutes_to_hours_minutes))

# draw colored vertical percentile lines
plt.axvline(times.result_time_minutes.median(), color='red', linestyle='--', linewidth=1.5)
plt.axvline(times.result_time_minutes.quantile(0.25), color='green', linestyle='--', linewidth=1.5)
plt.axvline(times.result_time_minutes.quantile(0.1), color='orange', linestyle='--', linewidth=1.5)
plt.text(times.result_time_minutes.median(), plt.ylim()[1] * 0.95, f'Median',
         color='red', ha='left', va='bottom')
plt.text(times.result_time_minutes.quantile(0.25), plt.ylim()[1] * 0.95, f'Top 25%',
         color='green', ha='center', va='bottom')
plt.text(times.result_time_minutes.quantile(0.1), plt.ylim()[1] * 0.95, f'Top 10%',
         color='orange', ha='right', va='bottom');
plt.xlabel('Finish time (incl. breaks)');

If you are planning to race next year’s Vatternrundan, this distribution might give you an indication of what to expect and a time to aim for.

Jupyter Notebook

You can find the jupyter notebook and the datasets for this post here .

If you have any thoughts, questions, or feedback about this post, I would love to hear it. Please reach out to me via email.

Tags:
#notebook #statistics