Logistic Regression

We will now continue modeling data with logistic regression, using nfl_wp.csv to analyze win probability as a function of multiple variables. The dataset includes game_id; season; label_win, a binary (0 or 1) indicator of whether the team in possession won the game; score_differential, the current score differential between the team in possession and the team on defense; posteam_spread, the pregame point spread (positive or negative) for the team in possession; yardline_100; down; and ydstogo.

  1. First, load the data. Previously, separate training and test datasets were provided. This time we create the split ourselves with the code below, which may also be helpful for your projects. Note that the split is made by game_id rather than by row, so every row from a given game lands in the same set. For the first half of the problem set, use the training data.
library(dplyr)  # provides %>% and filter(), used below

nfl_wp <- read.csv('data/nfl_wp.csv')

set.seed(42)

#get unique games   
unique_game_ids <- unique(nfl_wp$game_id)
# randomly sample 80% of the game ids for training, rounding down to nearest integer
train_game_ids <- sample(unique_game_ids, size = floor(0.8 * length(unique_game_ids))) 

#filter for train and test 
nfl_train <- nfl_wp %>% filter(game_id %in% train_game_ids) 
nfl_test <- nfl_wp %>% filter(!(game_id %in% train_game_ids))
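
As a quick sanity check on the idea behind this split, the sketch below repeats it on a small made-up data frame and confirms that no game ends up in both sets. The names toy, toy_train, and toy_test are hypothetical and the data are invented for illustration; the sketch uses base R only so that it runs without nfl_wp.csv.

```r
# Toy stand-in for nfl_wp (hypothetical): 10 games with 5 rows each
set.seed(42)
toy <- data.frame(
  game_id = rep(sprintf("g%02d", 1:10), each = 5),
  play    = rep(1:5, times = 10)
)

# Same idea as above: sample 80% of the game ids, not 80% of the rows
ids <- unique(toy$game_id)
train_ids <- sample(ids, size = floor(0.8 * length(ids)))

toy_train <- toy[toy$game_id %in% train_ids, ]
toy_test  <- toy[!(toy$game_id %in% train_ids), ]

# No game should appear in both sets, and no rows should be lost
shared_games <- intersect(toy_train$game_id, toy_test$game_id)
length(shared_games)                            # 0
nrow(toy_train) + nrow(toy_test) == nrow(toy)   # TRUE
```

The same two checks can be run on nfl_train and nfl_test after the split above.
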
  2. Next, do some exploratory data analysis (EDA) to understand the data. First, plot a histogram of score_differential to show the distribution of the variable. Then, create a separate table grouped by game_id that keeps the first value of posteam_spread, so that each game contributes exactly one value. Plot the distribution of this variable.

  3. Then fit a logistic model called diff_model using glm to predict label_win from score_differential. Print the model after fitting it to see the coefficients, and then add its predictions to the training dataset.

  4. Repeat step 3, but predict with posteam_spread instead.

  5. Now, plot the predicted win probabilities from the two models you created in steps 3 and 4. Put the predictor variable on the x-axis and the predicted win probability on the y-axis. Color the points by label_win, wrapped in as.factor() so that it is treated as categorical. Which variable seems to predict better?

  6. Finally, let’s make a model that accounts for both variables. Name your model spread_diff_model and follow the previous steps to obtain predictions. Plot the predictions as a function of score differential (since, as we saw in the histograms, this variable is closer to continuous than the spread). How does this model compare to the previous two?

  7. Let’s test all of our models. Use predict() with type = "response" on the test data for each of the previously created models, and add the results to the test dataset as diff_pred_prob, spread_pred_prob, and spread_diff_pred_prob. Then run the code below to convert each win probability to a predicted win or loss and calculate the accuracy. Print your results. Which model performed best?

# label a predicted win when the win probability is >= 0.5, a loss otherwise
nfl_test <- nfl_test %>%
  mutate(
    diff_pred_label = ifelse(diff_pred_prob >= 0.5, 1, 0),
    spread_pred_label = ifelse(spread_pred_prob >= 0.5, 1, 0),
    spread_diff_pred_label = ifelse(spread_diff_pred_prob >= 0.5, 1, 0)
  )

#calculate accuracy as being the proportion of correct predictions
accuracy <- function(actual, predicted) {
  mean(actual == predicted)
}

#apply to all three models
acc_diff  <- accuracy(nfl_test$label_win, nfl_test$diff_pred_label)
acc_spread <- accuracy(nfl_test$label_win, nfl_test$spread_pred_label)
acc_both  <- accuracy(nfl_test$label_win, nfl_test$spread_diff_pred_label)
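
To see the whole fit, predict, threshold, and score workflow in one place, here is a minimal sketch on simulated data. It is not a solution for nfl_wp.csv: the data frame sim, the model toy_model, and the coefficient 0.15 used to generate the labels are all assumptions made up for illustration. The accuracy() helper is redefined only so that the sketch runs on its own.

```r
set.seed(1)

# Simulated stand-in data (hypothetical, not from nfl_wp.csv):
# the chance of winning rises with the score differential
n <- 1000
sim <- data.frame(score_differential = round(rnorm(n, mean = 0, sd = 10)))
sim$label_win <- rbinom(n, size = 1,
                        prob = plogis(0.15 * sim$score_differential))

# A row-level split is acceptable here because the simulated rows are independent
train_rows <- sample(n, size = floor(0.8 * n))
sim_train <- sim[train_rows, ]
sim_test  <- sim[-train_rows, ]

# Logistic regression: family = binomial uses the logit link by default
toy_model <- glm(label_win ~ score_differential,
                 data = sim_train, family = binomial)

# type = "response" returns probabilities; the default returns log-odds
sim_test$pred_prob  <- predict(toy_model, newdata = sim_test, type = "response")
sim_test$pred_label <- ifelse(sim_test$pred_prob >= 0.5, 1, 0)

# Same accuracy definition as above
accuracy <- function(actual, predicted) mean(actual == predicted)
toy_acc <- accuracy(sim_test$label_win, sim_test$pred_label)

# Reference point: always predicting the more common test outcome
baseline_acc <- max(mean(sim_test$label_win), 1 - mean(sim_test$label_win))

print(c(model = toy_acc, baseline = baseline_acc))
```

Comparing against the majority-class baseline is one way to judge whether an accuracy figure is actually informative; the same comparison can be made for acc_diff, acc_spread, and acc_both.
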

Challenge: Can you make an even better model using the other variables in the dataset or numerical transformations of the variables? Additionally, feel free to explore other metrics for out-of-sample performance, such as log-loss or AUC!
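
For the metrics mentioned in the challenge, the following is one possible base R sketch. The helper names log_loss and auc and the small example vectors are hypothetical, written here for illustration rather than taken from a package. Unlike accuracy, both metrics use the predicted probabilities directly instead of the thresholded labels.

```r
# Log-loss: average negative log-likelihood of the observed labels.
# eps keeps log() finite if a probability is exactly 0 or 1.
log_loss <- function(actual, prob, eps = 1e-15) {
  prob <- pmin(pmax(prob, eps), 1 - eps)
  -mean(actual * log(prob) + (1 - actual) * log(1 - prob))
}

# AUC via the rank (Mann-Whitney) formula: the probability that a randomly
# chosen win receives a higher predicted probability than a randomly chosen loss
auc <- function(actual, prob) {
  r <- rank(prob)                  # average ranks handle ties
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Tiny hand-made example (hypothetical values)
actual <- c(1, 0, 1, 1, 0)
prob   <- c(0.9, 0.2, 0.6, 0.4, 0.5)

log_loss(actual, prob)   # lower is better
auc(actual, prob)        # 5/6, about 0.833; higher is better
```

On the test set these would be called as, for example, log_loss(nfl_test$label_win, nfl_test$diff_pred_prob) and auc(nfl_test$label_win, nfl_test$diff_pred_prob), assuming the probability columns from step 7 exist.
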