We will now continue modeling data with regression, but we can now test and compare different models more efficiently. We will use this dataset on mlb payrolls: mlb_payrolls_train.csv to analyze general manager performance based on team payrolls and winning percentages.
## Rows: 603 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): teamID, name
## dbl (7): yearID, G, W, L, wins_162, payroll_norm, log_payroll_norm
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## [1] 1998 1999 2000 2002 2003 2004 2005 2006 2007 2008 2010 2011 2013 2014 2017
## [16] 2018 2020 2021 2022 2001 2009 2012 2015 2019 2023 2016
We can see that the data ranges from 1998 to 2023. Since the 2020
season was shortened due to Covid, filter it out. Then use
mutate()
to create a column named WP
based on
wins (W) and games (G).
Then, explore the relationship between payrolls and winning
percentage by generating a scatterplot. Use ifelse()
with
the %in%
operator to mark the Oakland A’s and New York
Yankees’ points on the plot. Be sure to mark them with different colors
by using scale_color_manual()
and to change the alpha
values or colors if needed so that overlapping point’s don’t completely
obscure each other.
Now, fit a regression line to the data as a function of
payroll_norm
, then add it to the plot.
Next, try fitting a model based on log_payroll_norm
and adding that to the plot.
Just by looking at the plots from 3 and 4, which seems to fit the data better?
add_residuals
from
modelr
, add a column of residuals from each model to your
payroll table. Then, generate a scatter plot of the residuals for each
model.An indication of a poor model fit is a non-uniform residual plot. This would indicate that perhaps a linear model is being used when the underlying trend is instead logarithmic, or polynomial. Do you observe this?
payrolls_test
. Do not refit your model but instead use
predict
with the models you have already created to
generate predictions for the test data set called test_pred
and test_log_pred
, similar to its usage in Lecture 6. Then,
use the line below to compute the RMSE for each model.rmse <- sqrt(mean((payrolls_test$WP - payrolls_test$test_pred)^2))
rmse_log <- sqrt(mean((payrolls_test$WP - payrolls_test$test_log_pred)^2))
Based on the RMSE values, which model is better?
Challenge: Find the difference between the actual WP values
and predictions of the log model in the test data and add them
to payrolls_test
. Group by team and find the mean
difference across all years. Multiply their percentages by 162 to obtain
number of wins. Make a bar plot showing teams’ mean performance compared
to predictions.