We will now continue modeling data with logistic regression, using nfl_wp.csv to analyze win probability as a function of multiple variables. The dataset includes game_id; season; label_win, a binary (0 or 1) indicator of whether the possessing team won the game; score_differential, the current score differential between the team in possession and the team on defense; posteam_spread, the +/- point spread for the possessing team prior to the start of the game; yardline_100; down; and ydstogo.
library(dplyr)  # needed for %>% and filter() below

nfl_wp = read.csv('data/nfl_wp.csv')
set.seed(42)
#get unique games
unique_game_ids <- unique(nfl_wp$game_id)
# randomly sample 80% of the game ids for training, rounding down to nearest integer
train_game_ids <- sample(unique_game_ids, size = floor(0.8 * length(unique_game_ids)))
#filter for train and test
nfl_train <- nfl_wp %>% filter(game_id %in% train_game_ids)
nfl_test <- nfl_wp %>% filter(!(game_id %in% train_game_ids))
Next, do some exploratory data analysis (EDA) to understand the data. First, plot a histogram of score_differential to show the distribution of the variable. Then, make a separate data table grouped by game and select the first instance of posteam_spread so that we have only one value per game. Plot the distribution of this variable.
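The EDA steps above might look like the following sketch. It assumes dplyr and ggplot2 are installed; a small simulated data frame (demo) stands in for nfl_train so the snippet runs on its own.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in for nfl_train: 5 games, 3 plays each
set.seed(1)
demo <- data.frame(
  game_id = rep(1:5, each = 3),
  score_differential = rnorm(15, mean = 0, sd = 7),
  posteam_spread = rep(c(-3, 2.5, -7, 1, 6), each = 3)
)

# Histogram of the play-level score differential
ggplot(demo, aes(x = score_differential)) +
  geom_histogram(binwidth = 3)

# One spread per game: group by game and keep the first value
spreads <- demo %>%
  group_by(game_id) %>%
  summarise(posteam_spread = first(posteam_spread))

# Distribution of the game-level spread
ggplot(spreads, aes(x = posteam_spread)) +
  geom_histogram(binwidth = 2)
```

With the real data, swap demo for nfl_train; the spread histogram should use the one-row-per-game table so repeated plays do not inflate the counts.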
Then fit a logistic model called diff_model using glm to predict label_win based on score_differential. Call the model after you fit it to see the coefficients, and then add its predictions to the dataset.
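One way this step could look is sketched below. Simulated data stands in for nfl_train so the example is self-contained; with the real data, replace demo with nfl_train.

```r
# Simulated stand-in for nfl_train
set.seed(1)
demo <- data.frame(score_differential = rnorm(200, mean = 0, sd = 10))
# Wins become more likely as the score differential grows
demo$label_win <- rbinom(200, 1, plogis(0.2 * demo$score_differential))

# family = binomial makes glm() fit a logistic regression
diff_model <- glm(label_win ~ score_differential,
                  data = demo, family = binomial)
diff_model  # calling the model prints the coefficients

# type = "response" returns probabilities rather than log-odds
demo$diff_pred_prob <- predict(diff_model, type = "response")
```

Note that without type = "response", predict() returns values on the log-odds scale.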
Repeat step 3, but predicting with posteam_spread instead.
Now, plot the predicted probabilities of winning from the two models you created in steps 3 and 4. Put the predictor variable on the x-axis and the predicted win probability on the y-axis. Color the points by the label_win variable using as.factor(). Which variable seems to predict better?
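For one of the two models, the plot might be sketched like this (again with simulated data standing in for the real training set and its diff_pred_prob column):

```r
library(ggplot2)

# Simulated stand-in for nfl_train with fitted probabilities attached
set.seed(1)
demo <- data.frame(score_differential = rnorm(200, mean = 0, sd = 10))
demo$label_win <- rbinom(200, 1, plogis(0.2 * demo$score_differential))
m <- glm(label_win ~ score_differential, data = demo, family = binomial)
demo$diff_pred_prob <- predict(m, type = "response")

# as.factor() makes ggplot treat the 0/1 outcome as discrete colors
p <- ggplot(demo, aes(x = score_differential, y = diff_pred_prob,
                      color = as.factor(label_win))) +
  geom_point() +
  labs(y = "Predicted win probability", color = "label_win")
print(p)
```

A predictor that separates well will show the colored groups pulling apart as the predicted probability moves away from 0.5.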
Finally, let’s make a model that accounts for both variables. Name your model spread_diff_model and follow the previous steps to obtain predictions and plot the data as a function of score differential (since this variable is more continuous than the spread, as we saw with our histograms). How does this model compare to the previous two?
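Fitting the two-predictor model could be sketched as follows; in glm formula syntax, + adds a second predictor. Simulated data again stands in for nfl_train.

```r
# Simulated stand-in for nfl_train with both predictors
set.seed(1)
demo <- data.frame(
  score_differential = rnorm(200, mean = 0, sd = 10),
  posteam_spread = rnorm(200, mean = 0, sd = 5)
)
demo$label_win <- rbinom(200, 1,
                         plogis(0.2 * demo$score_differential -
                                0.1 * demo$posteam_spread))

# Two predictors joined with + in the formula
spread_diff_model <- glm(label_win ~ score_differential + posteam_spread,
                         data = demo, family = binomial)
spread_diff_model

demo$spread_diff_pred_prob <- predict(spread_diff_model, type = "response")
```

The fitted model has an intercept plus one coefficient per predictor, so three coefficients in total.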
Let’s test all of our models. Use predict() on the test data with each of the previously created models, then add the predicted probabilities to the test dataset. Then, run the code below to convert from win probability to a predicted win or loss and calculate the accuracy. Print your results. Which model ended up being the best?
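Creating the three probability columns that the accuracy code expects might look like this sketch. A simulated train/test split stands in for nfl_train and nfl_test so the snippet runs on its own; the column names match those used in the accuracy code.

```r
library(dplyr)

# Simulated stand-in for the train/test split
set.seed(1)
make_df <- function(n) {
  d <- data.frame(score_differential = rnorm(n, mean = 0, sd = 10),
                  posteam_spread = rnorm(n, mean = 0, sd = 5))
  d$label_win <- rbinom(n, 1, plogis(0.2 * d$score_differential))
  d
}
train <- make_df(300)
test <- make_df(100)

diff_model <- glm(label_win ~ score_differential,
                  data = train, family = binomial)
spread_model <- glm(label_win ~ posteam_spread,
                    data = train, family = binomial)
spread_diff_model <- glm(label_win ~ score_differential + posteam_spread,
                         data = train, family = binomial)

# newdata scores the held-out games; type = "response" gives probabilities
test <- test %>%
  mutate(
    diff_pred_prob = predict(diff_model, newdata = test,
                             type = "response"),
    spread_pred_prob = predict(spread_model, newdata = test,
                               type = "response"),
    spread_diff_pred_prob = predict(spread_diff_model, newdata = test,
                                    type = "response")
  )
```

Forgetting newdata = ... is a common slip: predict() would then silently return fitted values for the training rows instead of predictions for the test set.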
#set win = >=0.5 win probability, loss otherwise
nfl_test <- nfl_test %>%
mutate(
diff_pred_label = ifelse(diff_pred_prob >= 0.5, 1, 0),
spread_pred_label = ifelse(spread_pred_prob >= 0.5, 1, 0),
spread_diff_pred_label = ifelse(spread_diff_pred_prob >= 0.5, 1, 0)
)
#calculate accuracy as being the proportion of correct predictions
accuracy <- function(actual, predicted) {
mean(actual == predicted)
}
#apply to all three models
acc_diff <- accuracy(nfl_test$label_win, nfl_test$diff_pred_label)
acc_spread <- accuracy(nfl_test$label_win, nfl_test$spread_pred_label)
acc_both <- accuracy(nfl_test$label_win, nfl_test$spread_diff_pred_label)
Challenge: Can you make an even better model with the other variables in the dataset or with numerical transformations to variables? Additionally, feel free to explore other metrics for out-of-sample performance, such as log-loss or AUC!