In Lecture 4 and Lecture 6, we started building a model for predicting a baseball player’s 2015 batting average using his 2014 batting average. We found that some models, even though they fit the data quite well, appeared to overfit and may not predict future observations well. Overfitting happens when the model is too complex; the model is improperly describing the random error rather than properly describing the relationship between variables. We will switch gears a little bit and discuss how to diagnose overfit issues using data on field goals in the NFL.
Suppose we have fit multiple models to a given dataset. How should we choose which model is the best?
One strategy might be to see how well the models predict the data we used to fit them. While this seems intuitive, it is possible to “over-learn” the patterns in this training data. Over-learning leads models to apply only the sample with which it was built, not the overall population or future data instances
A common alternative is to split our original dataset into two parts: a training set and a testing set. We fit all of our models on the training set and then see how well they predict the values in the testing set. Think of the training set as a sample and the testing set as a population.
The file nfl_fg_train.csv contains a large dataset about field goals attempted in the NFL between 2005 and 2015. First, we will train several models of field goal success with this data. Then, we will evaluate their predictive performance using the data contained in nfl_fg_test.csv. Download both of these files and move them to your “data” folder.
## Rows: 9505 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Team, Kicker
## dbl (5): Year, GameMinute, Distance, ScoreDiff, Success
## lgl (1): Grass
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The simplest forecast for field goal success probability is the overall average success rate. This forecast does not differentiate between players or attempt to adjust for distance or other game contexts. To compute this forecast, we need to find the average of the data in the column “Success.” The code below adds a column to the tbl called “phat_all”.
It turns out that kickers make just over 83% of their attempts.
However, we know that there are some truly elite kickers (e.g. Justin
Tucker) who make well over 83% of their attempts, and other kickers
(e.g. Billy Cundiff) who make less than 83% of their attempts. Instead
of forecasting field goal success probabilities with the overall
average, we could compute each individual kicker’s conversion rate. This
can be done using group_by()
and mutate()
. In
the code below, we add a column to fg_train
called
“phat_kicker”, which contains each kicker’s individual field goal
conversion rate. Notice that because we adding a grouping to carry out
this computation, we need to remove the grouping using
ungroup()
when we’re done.
Intuitively, distance is one of the main factors of whether a kicker
makes a field goal. We will use the cut()
function to bin
the data according to distance. Then, we will compute the conversion
rate (averaged over all kickers) within each bin. To do this properly,
we must have used the ungroup()
function at the end of our
last line of code.
In our dataset, the shortest field goal attempt was 18 yards and the longest was 76. We will begin by binning our data into 10 yard increments, 10 – 20, 20 – 30, …, 70 – 80. We then save the bin label in a column called “Dist_10” and the predictions in a column called “phat_dist_10”.
fg_train <-
fg_train %>%
mutate(Dist_10 = cut(Distance, breaks = seq(from = 10, to = 80, by = 10))) %>%
group_by(Dist_10) %>%
mutate(phat_dist_10 = mean(Success)) %>%
ungroup()
We can now look at a scatter plot of distance and our forecasts “phat_dist_10”. Since there are many attempts from certain yardages, we’ll use alpha-blending (see Lecture 2 for a refresher on this!) to change change the transparency of the points according to their frequency. Essentially, darker dots have higher n values than lighter dots.
It certainly looks like our predictions make some intuitive sense: the estimated probability of making a field goal decreases as the distance increases.
We also could have binned field goal Distance 5- or 2-yard
increments, saving bin labels into columns called “Dist_5” or “Dist_2”.
And similar to the 10-yard bins, we could add columns to
fg_train
that computes the overall conversion rate within
each of these 5-yard or 2-yard bins.
fg_train <- fg_train %>%
mutate(Dist_5 = cut(Distance, breaks = seq(from = 10, to = 80, by = 5))) %>%
group_by(Dist_5) %>%
mutate(phat_dist_5 = mean(Success)) %>%
ungroup()
fg_train <- fg_train %>%
mutate(Dist_2 = cut(Distance, breaks = seq(from = 10, to = 80, by = 2))) %>%
group_by(Dist_2) %>%
mutate(phat_dist_2 = mean(Success)) %>%
ungroup()
Now we plot our predictions based on three different binning methods.
fg_plot <-
ggplot(fg_train) +
geom_point(aes(x = Distance, y = phat_dist_10), alpha = 0.1, col = 'black') +
geom_point(aes(x = Distance, y = phat_dist_5), alpha = 0.1, col = 'blue') +
geom_point(aes(x = Distance, y = phat_dist_2), alpha = 0.1, col = 'red')
fg_plot
Now when we binned our data into 2-yard increments, we find that our forecasts are no longer monotonic decreasing. Probability of making a field decreases overall; but in our 2-yard bins, the probability increases from 44 to 46 and 52 to 54 years. Remember, monotonic means only decreasing or only increasing for all values. In Lecture 8, we’ll address this with a more formal regression method.
We now have several predictive models of field goal success: phat_all, phat_kicker, phat_dist_10, phat_dist_5, and phat_dist_2. Recall from Lecture 4 that to assess how well we are predicting a continuous outcome, we could use the RMSE. When predicting a binary outcome, we have a few more options. To set the stage, let \(y_{i}\) be the outcome of the \(i^{\text{th}}\) observation and let \(\hat{p}_{i}\) be the forecasted probability that \(y_{i} = 1.\) (i.e. of making the FG)
The Brier Score is defined as \[ BS = \frac{1}{n}\sum_{i = 1}^{n}{(y_{i} - \hat{p}_{i})^{2}} \] Looking at the formula, we see that the Brier score is just the mean square error of our forecasts. Using code that is very nearly identical to that used to compute RSME in Lecture 4, we can compute the Brier score of each of our prediction models.
summarize(fg_train,
phat_all = mean( (Success - phat_all)^2),
phat_kicker = mean( (Success - phat_kicker)^2),
phat_dist_10 = mean( (Success - phat_dist_10)^2),
phat_dist_5 = mean( (Success - phat_dist_5)^2),
phat_dist_2 = mean( (Success - phat_dist_2)^2))
## # A tibble: 1 × 5
## phat_all phat_kicker phat_dist_10 phat_dist_5 phat_dist_2
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.1403 0.1383 0.1245 0.1233 0.1231
Which model has the lowest Brier score? Do you think this model over-fits the data?
Before assessing how well our models predict out-of-sample,
it will be useful to create tibbles that summarize each model’s
forecasts. For instance, when we grouped the data by kickers, we don’t
need multiple rows recording the same prediction for each kicker.
Instead, we can create a tbl called phat_kicker
which has
one row per kicker and two columns, one identifying kicker and one for
the associated forecast:
Create similar tibbles for phat_dist_10, phat_dist_5, and phat_dist_2.
phat_dist_10 <-
fg_train %>%
group_by(Dist_10) %>%
summarize(phat_dist_10 = mean(Success))
phat_dist_5 <-
fg_train %>%
group_by(Dist_5) %>%
summarize(phat_dist_5 = mean(Success))
phat_dist_2 <-
fg_train %>%
group_by(Dist_2) %>%
summarize(phat_dist_2 = mean(Success))
The file “nfl_fg_test.csv” contains additional data on more field
goals kicked between 2005 and 2015. Since we have not used the data in
this file to train our models, we can get a sense of the
out-of-sample predictive performance of our models by looking
at how well they predict these field goals. We first load the data into
a tbl called fg_test
.
## Rows: 1682 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Team, Kicker
## dbl (5): Year, GameMinute, Distance, ScoreDiff, Success
## lgl (1): Grass
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Using mutate()
and cut()
, add columns
“Dist_10”, “Dist_5”, and “Dist_2” to fg_test
that bin the
data into 10-yard, 5-yard, and 2-yard increments.
fg_test <-
fg_test %>%
mutate(Dist_10 = cut(Distance, breaks = seq(from = 10, to = 80, by = 10)),
Dist_5 = cut(Distance, breaks = seq(from = 10, to = 80, by = 5)),
Dist_2 = cut(Distance, breaks = seq(from = 10, to = 80, by = 2)))
We are now ready to apply our fg_train
model’s forecasts
to fg_test
. For instance, we can add the forecast from
“phat_all”, which is just the overall conversion rate averaged over all
kickers. Notice that instead of computing the mean of the column
“Success” from fg_test
, we are computing the mean from
fg_train
.
Joining is a powerful technique to combine data from two different
tables based on a key. The key is what enables us to match rows
between the tables. For instance, to add the forecasts from the tbl
phat_kicker
to the tbl fg_test
join works
row-by-row. First, for each field goal in fg_test
, the
function identifies the kicker who attempted the field goal. Then it
matches that kicker with corresponding kicker row from the tbl
phat_kicker
. The join function then appends that kicker’s
forecast from phat_kicker
to the row in
fg_test.
What we have just described is what is known as an
“inner join”. In this situation, the key was the Kicker. The code to
carry out these operations is given below.
To verify that we have successfully performed this join, we can print
out a few rows of fg_test
:
## # A tibble: 1,682 × 4
## Kicker Success phat_all phat_kicker
## <chr> <dbl> <dbl> <dbl>
## 1 Akers 1 0.8312 0.7884
## 2 Akers 1 0.8312 0.7884
## 3 Bironas 1 0.8312 0.8504
## 4 Bironas 1 0.8312 0.8504
## 5 Bironas 1 0.8312 0.8504
## 6 Brien 0 0.8312 0.3333
## 7 Brown 0 0.8312 0.8291
## 8 Brown 1 0.8312 0.8291
## 9 Brown 1 0.8312 0.8291
## 10 Brown 1 0.8312 0.8291
## # … with 1,672 more rows
Let’s unpack the inner join code line-by-line: in the first two
lines, we are telling R that we want to over-write fg_test
.
Next, we pipe fg_test
to the function
inner_join
, which takes two more arguments. The first
argument phat_kicker
tells the function where the
additional data corresponding to each key value is. The last argument
by = "Kicker"
tells the function that the key we want to
use is the kicker.
Mimic the code above, we can add columns for fg_test
for
the remaining three predictive models: phat_dist_10
,
phat_dist_5
, and phat_dist_2
.
fg_test <-
fg_test %>%
inner_join(phat_dist_10, by = "Dist_10") %>%
inner_join(phat_dist_5, by = "Dist_5") %>%
inner_join(phat_dist_2, by = "Dist_2")
Now, we can compute the out-of-sample Brier scores for each of our models. Which has the best out-of-sample performance?
## # A tibble: 1 × 5
## phat_all phat_kicker phat_dist_10 phat_dist_5 phat_dist_2
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.1340 0.1332 0.1198 0.1177 0.1180
Tomorrow, we are going to use logistic regression to take
the “binning-and-averaging” approach we used above to its logical
extreme (i.e. what would happen if we made our bins infinitesimally
small). In order to do that, we will want to save our tbls
fg_train
and fg_test
: