Now that we’ve seen linear regression in R applied to MLB batting averages, let’s practice what we’ve learned in a context we haven’t worked with before. We will look at past results of the Ironman Triathlon and investigate relationships between racing splits and demographic variables. First let us download the Ironman Data and save it to our “data” folder.
Let’s read in the data and take a look at the variables.
## Rows: 22124 Columns: 389
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (103): Source Table, Name, Country, Gender, Division, Swim, Bike, Run, Overall, Division Ra...
## dbl (102): BIB, Gender Rank, Overall Rank, Log Rank, Bike 28.2 mi Distance, Bike 50.4 mi Distan...
## time (184): Swim Total Split Time, Swim Total Race Time, T1, Bike 28.2 mi Split Time, Bike 28.2 ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This is a large data set with hundreds of variables. Digging deeper, we can see that many of the variables seem to be repeated and many have data only in certain rows. This is common in historical racing results, so it’s important to determine which variables are most important for our analysis. In our case, this analysis will be looking at the results from the 2017 Florida race, and we will focus on the swim, bike, run, and overall finishing times as well as demographics including country, gender, and division.
Moreover, we note that the finishing times are actually stored as character strings in the form ‘minutes:seconds’. Keeping these variables in this form would pose problems, since we are going to perform operations that require numeric data. So, we need to convert these clock times into pure numbers. The best way to work with clock times within the tidyverse framework is to use the lubridate package. Within this package, we can parse times in the ‘minutes:seconds’ format using the ms() function, and then convert the result to a number using the period_to_seconds() function (the result is a time in seconds, so we divide by 60 to get back to minutes).
#install.packages("lubridate")
library(lubridate)
# Keep the 2017 Florida race and the columns we need, drop incomplete rows,
# then convert each 'minutes:seconds' string to a number of minutes
results_2017 <- data %>%
  filter(`Source Table` == '2017 - Florida') %>%
  select(Country, Gender, Division, Swim, Bike, Run, Overall) %>%
  drop_na() %>%
  mutate(Swim = period_to_seconds(ms(Swim)) / 60,
         Bike = period_to_seconds(ms(Bike)) / 60,
         Run = period_to_seconds(ms(Run)) / 60,
         Overall = period_to_seconds(ms(Overall)) / 60)
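To see what this conversion does to a single value, here is a minimal sketch. The sample string "75:30" is made up for illustration and is not taken from the Ironman data.

```r
library(lubridate)

# A made-up clock time in 'minutes:seconds' form (not from the real data)
example_time <- "75:30"

# ms() parses the string into a lubridate Period of 75 minutes and 30 seconds
example_period <- ms(example_time)

# period_to_seconds() turns the Period into a plain number of seconds
example_seconds <- period_to_seconds(example_period)

# Dividing by 60 gives decimal minutes
example_minutes <- example_seconds / 60
example_minutes
```

Here 75 minutes and 30 seconds is 4530 seconds, which comes out as 75.5 minutes.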
Now that we have our tidy dataset, let’s start looking at the relationships between each leg of the triathlon.
Create scatterplots of swim times vs. bike times, bike times vs. run times, and swim times vs. run times. Save them as variables titled ‘plot.swim_bike’, etc.
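One possible way to build these plots is sketched below. So that the example runs on its own, it uses a small made-up data frame, toy_results, with the same column names as results_2017; the values are invented. The variable plot_data is our own placeholder, and the choice of which leg goes on which axis is an assumption.

```r
library(ggplot2)

# Made-up stand-in with the same column names as results_2017 (times in minutes)
toy_results <- data.frame(
  Swim = c(62, 70, 75, 81, 90, 95),
  Bike = c(310, 335, 350, 372, 390, 420),
  Run  = c(215, 240, 262, 270, 305, 330)
)

# Swap in results_2017 here to plot the real data
plot_data <- toy_results

# One scatterplot per pair of legs, saved under the names used by ggarrange()
plot.swim_bike <- ggplot(plot_data) + geom_point(aes(Swim, Bike))
plot.bike_run  <- ggplot(plot_data) + geom_point(aes(Bike, Run))
plot.swim_run  <- ggplot(plot_data) + geom_point(aes(Swim, Run))
```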
To compare the relationships between these three variables, we can display all three plots in the same window, rather than having to scroll through each one individually. To do so, we use the ggarrange function within the ggpubr package. As arguments of the function, we pass our three plots and can specify the number of rows and columns to be displayed.
#install.packages("ggpubr")
library(ggpubr)
ggarrange(plot.swim_bike, plot.bike_run, plot.swim_run, nrow = 2, ncol = 2)
## [1] 0.6379414
## [1] 0.642096
## [1] 0.4433809
## (Intercept) Swim
## 179.129651 2.395295
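The code that produced the three values and the coefficients above is not shown. Assuming the three values are correlations between pairs of legs, the sketch below shows how a correlation and a simple linear model of Bike on Swim might be computed. It runs on a made-up data frame, toy_results, so its numbers will not match the output above. The name lm.swim_bike comes from the plotting code that follows; the formula Bike ~ Swim is inferred from the Swim coefficient in the output, and model_data is our own placeholder.

```r
# Made-up stand-in for results_2017 (times in minutes)
toy_results <- data.frame(
  Swim = c(62, 70, 75, 81, 90, 95),
  Bike = c(310, 335, 350, 372, 390, 420)
)

# Swap in results_2017 here to work with the real data
model_data <- toy_results

# Pearson correlation between two legs
cor(model_data$Swim, model_data$Bike)

# Simple linear regression of bike time on swim time
lm.swim_bike <- lm(data = model_data, formula = Bike ~ Swim)

# Named vector holding the intercept and the slope on Swim
lm.swim_bike$coefficients
```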
Use predict() to generate predictions based on the model, and mutate() to add them to your data. Then, use geom_line() to add this predicted line to the plot. You may need to increase the line’s size to make it more visible.
results_2017 <- results_2017 %>%
  mutate(swim_bike_pred = predict(lm.swim_bike))
ggplot(results_2017) +
  geom_point(aes(Swim, Bike)) +
  geom_line(aes(x = Swim, y = swim_bike_pred), color = 'lightcoral', linewidth = 1.5)
In the following two lectures we will be looking at other forms of regression, namely logistic regression and multiple regression. Whereas simple linear regression deals with two continuous variables that share a linear relationship, we can extend our ideas of regression to more than two variables as well as variables that are binary or discrete in scale, or share nonlinear relationships.
What if we had reason to believe that the association between the finishing times of each leg of the triathlon was quadratic, rather than linear? For instance, imagine that higher times on the run leg were associated with slightly higher times on the bike leg, but not by a constant amount per extra minute of running. This could happen if the top racers in the Ironman are all good on the bike but separate themselves with really fast run times, while the slower racers are equally slow on the bike and on the run.
## (Intercept) Run
## 205.7610166 0.5147168
results_2017 <- results_2017 %>%
  mutate(run_bike_pred = predict(lm.run_bike))
ggplot(results_2017) +
  geom_point(aes(Run, Bike)) +
  geom_line(aes(x = Run, y = run_bike_pred), color = 'lightcoral')
## (Intercept) Run I(Run^2)
## 26.222478107 1.649746436 -0.001732342
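The fitting code for lm.run_bike and lm.run_bike2 is not shown above. Judging by the coefficient names, the second model adds a squared run term wrapped in I(), which tells lm() to treat ^ as ordinary arithmetic rather than as formula syntax. A minimal sketch, assuming the same lm() call style used elsewhere in this lab and using made-up data (toy_results and the placeholder model_data are our own) so that it runs on its own:

```r
# Made-up stand-in for results_2017 (times in minutes)
toy_results <- data.frame(
  Run  = c(200, 230, 260, 290, 320, 350, 380),
  Bike = c(300, 330, 352, 368, 380, 388, 392)
)

# Swap in results_2017 here to fit the real models
model_data <- toy_results

# Straight-line model: Bike = b0 + b1 * Run
lm.run_bike <- lm(data = model_data, formula = Bike ~ Run)

# Quadratic model: Bike = b0 + b1 * Run + b2 * Run^2
lm.run_bike2 <- lm(data = model_data, formula = Bike ~ Run + I(Run^2))

lm.run_bike2$coefficients
```

A negative coefficient on I(Run^2), as in the output above, means the fitted curve bends downward: bike times rise more slowly as run times get longer.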
results_2017 <- results_2017 %>%
  mutate(run_bike_pred2 = predict(lm.run_bike2))
# Linear fit in coral, quadratic fit in blue
ggplot(results_2017) +
  geom_point(aes(Run, Bike)) +
  geom_line(aes(x = Run, y = run_bike_pred), color = 'lightcoral', linewidth = 1.5) +
  geom_line(aes(x = Run, y = run_bike_pred2), color = 'lightblue4', linewidth = 1.5)
reframe(results_2017,
        rmse.run_bike = sqrt(mean(lm.run_bike$residuals^2)),
        rmse.run_bike2 = sqrt(mean(lm.run_bike2$residuals^2)))
## # A tibble: 1 × 2
##   rmse.run_bike rmse.run_bike2
##           <dbl>          <dbl>
## 1        37.266         36.524
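The RMSE values above are computed directly from each model’s stored residuals. As a small sketch of the same calculation wrapped in a reusable function, shown on made-up data (the helper name rmse and the toy.* objects are our own and do not come from any package used here):

```r
# Hypothetical helper: root mean squared error of a fitted lm object
rmse <- function(model) {
  sqrt(mean(residuals(model)^2))
}

# Made-up stand-in for results_2017 (times in minutes)
toy_results <- data.frame(
  Run  = c(200, 230, 260, 290, 320, 350, 380),
  Bike = c(300, 330, 352, 368, 380, 388, 392)
)

toy.linear    <- lm(data = toy_results, formula = Bike ~ Run)
toy.quadratic <- lm(data = toy_results, formula = Bike ~ Run + I(Run^2))

# In-sample RMSE for each model; lower means a closer fit to these data
c(linear = rmse(toy.linear), quadratic = rmse(toy.quadratic))
```

Because the quadratic model contains the linear model as a special case, its in-sample RMSE can never be larger. A small drop such as 37.266 to 36.524 is therefore only weak evidence that the curve is needed.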
Let’s start with our Run vs. Bike linear model again; what if we now want to add a gender effect? This would be the case if we think that the bike times differ depending on a racer’s gender.
lm.run_bike_gender <- lm(data = results_2017, formula = Bike ~ Run + Gender)
lm.run_bike_gender$coefficients
## (Intercept) Run GenderMale
## 229.316963 0.502562 -26.143851
results_2017 <- results_2017 %>%
  mutate(run_bike_gender = predict(lm.run_bike_gender))
# Single pooled line in coral, one fitted line per gender in blue and green
ggplot(results_2017) +
  geom_point(aes(Run, Bike)) +
  geom_line(aes(x = Run, y = run_bike_pred), color = 'lightcoral', linewidth = 1.5) +
  geom_line(aes(x = Run, y = run_bike_gender, col = Gender), linewidth = 1.5) +
  scale_color_manual(values = c(Female = "lightblue4", Male = "darkseagreen3"))
reframe(results_2017,
        rmse.run_bike = sqrt(mean(lm.run_bike$residuals^2)),
        rmse.run_bike_gender = sqrt(mean(lm.run_bike_gender$residuals^2)))
## # A tibble: 1 × 2
##   rmse.run_bike rmse.run_bike_gender
##           <dbl>                <dbl>
## 1        37.266               35.511
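To make the gender coefficient concrete, the sketch below plugs the coefficients printed for lm.run_bike_gender into the model equation by hand. The run time of 300 minutes is an arbitrary value chosen for illustration, and the variable names are our own.

```r
# Coefficients copied from the lm.run_bike_gender output
intercept   <- 229.316963
slope_run   <- 0.502562
gender_male <- -26.143851

# Arbitrary run time (in minutes) chosen for illustration
run_time <- 300

# Female is the baseline level, so the GenderMale term drops out
bike_female <- intercept + slope_run * run_time

# For a male racer the GenderMale coefficient shifts the prediction down
bike_male <- intercept + slope_run * run_time + gender_male

c(female = bike_female, male = bike_male, difference = bike_male - bike_female)
```

For a 300-minute run, the model predicts a bike time of about 380.1 minutes for a female racer and about 353.9 minutes for a male racer. The gap equals the GenderMale coefficient at every run time, which is why the two gender lines in the plot are parallel.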