Testing Regression Models

We will now continue modeling data with regression, but we can now test and compare different models more efficiently. We will use this dataset on mlb payrolls: mlb_payrolls_train.csv to analyze general manager performance based on team payrolls and winning percentages.

  1. First, load in the data to R. Let’s see the range of seasons that are included in this data
payroll <- read_csv("data/mlb_payrolls_train.csv")
## Rows: 603 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): teamID, name
## dbl (7): yearID, G, W, L, wins_162, payroll_norm, log_payroll_norm
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
unique(payroll$yearID)
##  [1] 1998 1999 2000 2002 2003 2004 2005 2006 2007 2008 2010 2011 2013 2014 2017
## [16] 2018 2020 2021 2022 2001 2009 2012 2015 2019 2023 2016
  1. We can see that the data ranges from 1998 to 2023. Since the 2020 season was shortened due to Covid, filter it out. Then use mutate() to create a column named WP based on wins (W) and games (G).

  2. Then, explore the relationship between payrolls and winning percentage by generating a scatterplot. Use ifelse() with the %in% operator to mark the Oakland A’s and New York Yankees’ points on the plot. Be sure to mark them with different colors by using scale_color_manual() and to change the alpha values or colors if needed so that overlapping point’s don’t completely obscure each other.

  3. Now, fit a regression line to the data as a function of payroll_norm, then add it to the plot.

  4. Next, try fitting a model based on log_payroll_norm and adding that to the plot.

Just by looking at the plots from 3 and 4, which seems to fit the data better?

  1. We saw that in lecture 6, one way to potentially look at model performance is with the residuals. Using add_residuals from modelr, add a column of residuals from each model to your payroll table. Then, generate a scatter plot of the residuals for each model.

An indication of a poor model fit is a non-uniform residual plot. This would indicate that perhaps a linear model is being used when the underlying trend is instead logarithmic, or polynomial. Do you observe this?

  1. Finally, we can directly compare our models with out-of-sample performance. Load in mlb_payrolls_test.csv and call it payrolls_test. Do not refit your model but instead use predict with the models you have already created to generate predictions for the test data set called test_pred and test_log_pred, similar to its usage in Lecture 6. Then, use the line below to compute the RMSE for each model.
rmse <- sqrt(mean((payrolls_test$WP - payrolls_test$test_pred)^2))
rmse_log <- sqrt(mean((payrolls_test$WP - payrolls_test$test_log_pred)^2))

Based on the RMSE values, which model is better?

Challenge: Find the difference between the actual WP values and predictions of the log model in the test data and add them to payrolls_test. Group by team and find the mean difference across all years. Multiply their percentages by 162 to obtain number of wins. Make a bar plot showing teams’ mean performance compared to predictions.