Now that we’ve seen linear regression in R applied to MLB batting averages, let’s practice what we’ve learned in a context we haven’t worked with before. We will look at past results of the Ironman Triathlon and investigate relationships between racing splits and demographic variables. First let us download the Ironman Data and save it to our “data” folder.
Let’s read in the data and take a look at the variables.
## Rows: 22124 Columns: 389
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (103): Source Table, Name, Country, Gender, Division, Swim, Bike, Run, Overall, Division Rank, Swim Total Pace, Bike 28.2 mi...
## dbl (102): BIB, Gender Rank, Overall Rank, Log Rank, Bike 28.2 mi Distance, Bike 50.4 mi Distance, Bike 68.6 mi Distance, Bike 8...
## time (184): Swim Total Split Time, Swim Total Race Time, T1, Bike 28.2 mi Split Time, Bike 28.2 mi Race Time, Bike 50.4 mi Split ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
This is a large data set with hundreds of variables. Digging deeper, we can see that many of the variables seem to be repeated and many have data only in certain rows. This is common in historical racing results, so it’s important to determine which variables are most important for our analysis. In our case, this analysis will be looking at the results from the 2017 Florida race, and we will focus on the swim, bike, run, and overall finishing times as well as demographics including country, gender, and division.
Moreover, we note that the finishing times are actually character-types in the form of ‘minutes:seconds’. Keeping these variables in this form will pose problems as we are going to be performing operations that require numeric data. So, we need to convert these clock times into pure numbers. The best way to work with clock times within the tidyverse framework is to use the lubridate package. Within this package, we can parse the times in the ‘minutes:seconds’ format using the ms() function, and then convert the result to a numeric using the period_to_seconds() function (the result will be a time in seconds, so we can get this back to minutes by dividing by 60).
# install.packages("lubridate")
library(lubridate)
# Keep the 2017 Florida race, select the variables of interest, drop incomplete
# rows, and convert each 'minutes:seconds' string to a number of minutes
results_2017 <- data %>%
  filter(`Source Table` == '2017 - Florida') %>%
  select(Country, Gender, Division, Swim, Bike, Run, Overall) %>%
  drop_na() %>%
  mutate(Swim = period_to_seconds(ms(Swim)) / 60,
         Bike = period_to_seconds(ms(Bike)) / 60,
         Run = period_to_seconds(ms(Run)) / 60,
         Overall = period_to_seconds(ms(Overall)) / 60)
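To see what this conversion does, it can help to run it on a few hand-made strings first. The sketch below uses made-up clock times (the vector name clock_times is hypothetical and not part of the Ironman data) and assumes they follow the same ‘minutes:seconds’ layout as the race columns.

```r
library(lubridate)

# Made-up clock times in 'minutes:seconds' form (not taken from the data set)
clock_times <- c("75:30", "345:12", "9:45")

# ms() parses each string into a lubridate Period of minutes and seconds
parsed <- ms(clock_times)

# period_to_seconds() collapses each Period into a plain number of seconds
seconds <- period_to_seconds(parsed)

# Dividing by 60 expresses the same durations in minutes
minutes <- seconds / 60
minutes
```

Checking a handful of values like this before mutating whole columns is a cheap way to confirm that the parser reads the strings the way we expect.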
Now that we have our tidy dataset, let’s start looking at the relationships between each leg of the triathlon.
Create scatterplots of swim times vs. bike times, bike times vs. run times, and swim times vs. run times. Save them as variables titled ‘plot.swim_bike’, etc.
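One possible way to approach this exercise is sketched below. Because the sketch has to run on its own, it uses a simulated data frame, sim_results, as a made-up stand-in for results_2017; the simulated numbers are assumptions chosen only to look roughly like times in minutes. With the real data you would pass results_2017 to ggplot() instead.

```r
library(ggplot2)

# Simulated stand-in for results_2017 (made-up values, in minutes)
set.seed(1)
n <- 200
sim_results <- data.frame(Swim = rnorm(n, mean = 80, sd = 12))
sim_results$Bike <- 180 + 2.4 * sim_results$Swim + rnorm(n, sd = 30)
sim_results$Run <- 100 + 0.5 * sim_results$Bike + rnorm(n, sd = 35)

# One scatterplot per pair of legs, stored under the names the exercise asks for
plot.swim_bike <- ggplot(sim_results) + geom_point(aes(Swim, Bike))
plot.bike_run <- ggplot(sim_results) + geom_point(aes(Bike, Run))
plot.swim_run <- ggplot(sim_results) + geom_point(aes(Swim, Run))
```

Saving each plot to a variable, rather than printing it immediately, is what lets us hand all three to an arranging function afterwards.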
To compare the relationships between these three variables, we can plot all three plots in the same window, rather than having to scroll through each one individually. To do so, we use the ggarrange function within the ggpubr package. As arguments of the function, we select our three plots and can specify the number of rows and columns to be displayed.
#install.packages("ggpubr")
library(ggpubr)
ggarrange(plot.swim_bike, plot.bike_run, plot.swim_run, nrow = 2, ncol = 2)
## [1] 0.6379414
## [1] 0.642096
## [1] 0.4433809
## (Intercept) Swim
## 179.129651 2.395295
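The three values printed above are on the scale of correlation coefficients, and the coefficient table has the shape of a simple linear model of Bike on Swim. Assuming that is how they were produced, a minimal sketch of the two calls might look like the following. It runs on simulated data (sim_results is a made-up stand-in for results_2017), so its numbers will not match the output above.

```r
# Simulated stand-in for results_2017 (made-up values, in minutes)
set.seed(1)
n <- 200
sim_results <- data.frame(Swim = rnorm(n, mean = 80, sd = 12))
sim_results$Bike <- 180 + 2.4 * sim_results$Swim + rnorm(n, sd = 30)

# Pearson correlation between the two legs
cor.swim_bike <- cor(sim_results$Swim, sim_results$Bike)
cor.swim_bike

# Simple linear regression of bike time on swim time
lm.swim_bike <- lm(Bike ~ Swim, data = sim_results)
lm.swim_bike$coefficients
```

In a simple linear regression the slope equals the correlation multiplied by sd(Bike) / sd(Swim), so the two summaries describe the same linear association on different scales.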
We have seen two ways to overlay a fitted regression line on a scatterplot: the first used geom_abline(), while the second involved creating a data_grid over the predictor variable, adding the predictions to the grid, and then drawing a line through the predictions created on the grid using geom_line. As discussed in Lecture 6, the first approach only works for simple linear regression, while the second can be extended for all types of regression. For this reason, let’s practice using the second approach.
library(modelr)
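# Build a grid of the observed Swim values and attach the model's predictions
grid.swim <- results_2017 %>% data_grid(Swim)
grid.swim <- grid.swim %>% add_predictions(model = lm.swim_bike, var = 'pred.swim_bike')
# Scatterplot of the data with the fitted line drawn through the grid predictions
ggplot(results_2017) +
  geom_point(aes(Swim, Bike)) +
  geom_line(data = grid.swim, aes(x = Swim, y = pred.swim_bike), color = 'red')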
In the following two lectures we will be looking at other forms of regression, namely logistic regression and multiple regression. Whereas simple linear regression deals with two continuous variables that share a linear relationship, we can extend our ideas of regression to more than two variables as well as variables that are binary or discrete in scale, or share nonlinear relationships.
What if we had reason to believe the association between the finishing times of each leg of the triathlon was quadratic, rather than linear? For instance, imagine higher times on the run leg were associated with slightly higher times on the bike leg, but not linearly higher? This could be possible if we had reason to believe that the top racers in the Ironman are good on the bike but separate themselves with really fast run times, while the slower racers are equally slow on the bike and on the run.
## (Intercept) Run
## 205.7610166 0.5147168
# Grid over Run with predictions from the linear model
grid.run <- results_2017 %>% data_grid(Run)
grid.run <- grid.run %>% add_predictions(model = lm.run_bike, var = 'pred.run_bike')
ggplot(results_2017) +
  geom_point(aes(Run, Bike)) +
  geom_line(data = grid.run, aes(x = Run, y = pred.run_bike), color = 'red')
## (Intercept) Run I(Run^2)
## 26.222478107 1.649746436 -0.001732342
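The term name I(Run^2) in the output above suggests that the quadratic model was fitted by adding a squared term to the linear formula. A minimal sketch of both fits under that assumption is shown below. It uses simulated data (sim_results is a made-up stand-in for results_2017, with a curved relationship built in), so the coefficients will differ from those printed above.

```r
# Simulated stand-in for results_2017 (made-up values, in minutes)
set.seed(2)
n <- 300
sim_results <- data.frame(Run = runif(n, min = 180, max = 420))

# Made-up curved relationship: bike time rises with run time, then levels off
sim_results$Bike <- 30 + 1.6 * sim_results$Run -
  0.0017 * sim_results$Run^2 + rnorm(n, sd = 30)

# Straight-line model
lm.run_bike <- lm(Bike ~ Run, data = sim_results)

# Quadratic model: I() makes ^ mean arithmetic squaring inside the formula
lm.run_bike2 <- lm(Bike ~ Run + I(Run^2), data = sim_results)
lm.run_bike2$coefficients
```

Note that the quadratic model is still fitted with lm(): it is linear in its coefficients, and only the predictor has been transformed.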
# Add the quadratic model's predictions to the same grid
grid.run <- grid.run %>% add_predictions(model = lm.run_bike2, var = 'pred.run_bike2')
# Linear fit in red, quadratic fit in green
ggplot(results_2017) +
  geom_point(aes(Run, Bike)) +
  geom_line(data = grid.run, aes(x = Run, y = pred.run_bike), color = 'red') +
  geom_line(data = grid.run, aes(x = Run, y = pred.run_bike2), color = 'green')
# Root mean squared error (in minutes) of each model on the data it was fitted to
summarize(results_2017,
          rmse.run_bike = sqrt(mean(lm.run_bike$residuals^2)),
          rmse.run_bike2 = sqrt(mean(lm.run_bike2$residuals^2)))
## # A tibble: 1 × 2
## rmse.run_bike rmse.run_bike2
## <dbl> <dbl>
## 1 37.3 36.5
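One caveat when reading the table above: the linear model is a special case of the quadratic one (set the squared coefficient to zero), so the quadratic model’s RMSE on the data it was fitted to can never be larger. A small drop is therefore expected even when the true relationship is a straight line. The sketch below illustrates this with simulated data in which Bike is linear in Run by construction; sim_results and the helper rmse() are hypothetical names introduced only for this example.

```r
# Simulated data where the true relationship is exactly linear (made-up values)
set.seed(3)
n <- 300
sim_results <- data.frame(Run = runif(n, min = 180, max = 420))
sim_results$Bike <- 200 + 0.5 * sim_results$Run + rnorm(n, sd = 35)

# Hypothetical helper: root mean squared error of a fitted model's residuals
rmse <- function(model) sqrt(mean(residuals(model)^2))

fit_linear <- lm(Bike ~ Run, data = sim_results)
fit_quadratic <- lm(Bike ~ Run + I(Run^2), data = sim_results)

# The quadratic RMSE is at most the linear RMSE, even though the truth is linear
c(linear = rmse(fit_linear), quadratic = rmse(fit_quadratic))
```

For this reason, a small improvement in fitted RMSE is weak evidence on its own, and it is worth looking at the plot of the two curves as well.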
Let’s start with our Run vs. Bike linear model again; what if we now want to add a gender effect? This would be the case if we think that the bike times differ depending on a racer’s gender.
lm.run_bike_gender <- lm(data = results_2017, formula = Bike ~ Run + Gender)
lm.run_bike_gender$coefficients
## (Intercept) Run GenderMale
## 229.316963 0.502562 -26.143851
# Grid over every combination of Run and Gender, with the model's predictions
grid.run_gender <- results_2017 %>% data_grid(Run, Gender)
grid.run_gender <- grid.run_gender %>%
  add_predictions(model = lm.run_bike_gender, var = 'pred.run_bike_gender')
# Original linear fit in red, plus one fitted line per gender
ggplot(results_2017) +
  geom_point(aes(Run, Bike)) +
  geom_line(data = grid.run, aes(x = Run, y = pred.run_bike), color = 'red') +
  geom_line(data = grid.run_gender, aes(x = Run, y = pred.run_bike_gender, col = Gender)) +
  scale_color_manual(values = c(Female = "blue", Male = "green"))
# Compare fitted RMSE (in minutes) with and without the gender term
summarize(results_2017,
          rmse.run_bike = sqrt(mean(lm.run_bike$residuals^2)),
          rmse.run_bike_gender = sqrt(mean(lm.run_bike_gender$residuals^2)))
## # A tibble: 1 × 2
## rmse.run_bike rmse.run_bike_gender
## <dbl> <dbl>
## 1 37.3 35.5
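In a model of the form Bike ~ Run + Gender, both genders share the same Run slope, and the GenderMale coefficient shifts the whole line up or down. In the output above, that coefficient is about -26.1, so at any given run time the model predicts a male bike time roughly 26 minutes below the female one. The sketch below checks this interpretation on simulated data (sim_results and same_run are made-up names, and the simulated effect sizes are assumptions).

```r
# Simulated stand-in for results_2017 (made-up values, in minutes)
set.seed(4)
n <- 300
sim_results <- data.frame(
  Run = runif(n, min = 180, max = 420),
  Gender = sample(c("Female", "Male"), n, replace = TRUE)
)
sim_results$Bike <- 230 + 0.5 * sim_results$Run -
  26 * (sim_results$Gender == "Male") + rnorm(n, sd = 35)

lm.run_bike_gender <- lm(Bike ~ Run + Gender, data = sim_results)

# Predict for a female and a male racer with the same run time
same_run <- data.frame(Run = c(300, 300), Gender = c("Female", "Male"))
pred <- predict(lm.run_bike_gender, newdata = same_run)

# The male-minus-female gap equals the GenderMale coefficient at any run time
gap <- unname(pred[2] - pred[1])
gap
coef(lm.run_bike_gender)["GenderMale"]
```

This is why the two gender lines in the plot are parallel: the model as written allows different intercepts but not different slopes.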