In this problem set, you will practice using the pipe
%>%
and grouped calculations with batting statistics
taken from the Lahman dataset. We will compute the same statistics as we
did in Problem Set 2 and also standardize these
statistics within year, year and league, and also historical era.
To load the batting data into R, we can do the following:
Unfortunately, some statistics like hit-by-pitch (HBP) were not
recorded in the earlier decades of baseball. In the tbl, these missing
values are designated NA
. To see some of these, we can run
the following code:
## # A tibble: 113,799 × 10
## playerID yearID teamID AB BB HBP SH SF IBB GIDP
## <chr> <int> <fct> <int> <int> <int> <int> <int> <int> <int>
## 1 aardsda01 2004 SFN 0 0 0 0 0 0 0
## 2 aardsda01 2006 CHN 2 0 0 1 0 0 0
## 3 aardsda01 2007 CHA 0 0 0 0 0 0 0
## 4 aardsda01 2008 BOS 1 0 0 0 0 0 0
## 5 aardsda01 2009 SEA 0 0 0 0 0 0 0
## 6 aardsda01 2010 SEA 0 0 0 0 0 0 0
## 7 aardsda01 2012 NYA 0 0 0 0 0 0 0
## 8 aardsda01 2013 NYN 0 0 0 0 0 0 0
## 9 aardsda01 2015 ATL 1 0 0 0 0 0 0
## 10 aaronha01 1954 ML1 468 28 3 6 4 NA 13
## # ℹ 113,789 more rows
A common convention for dealing with the missing values when
computing \(\text{PA}\) is to replace
the NA
with a 0
. To do this, we can use the
function replace_na()
within a pipe as follows.
batting <-
batting %>%
replace_na(list(IBB = 0, HBP = 0, SH = 0, SF = 0, GIDP = 0))
batting %>% select(playerID, yearID, teamID, AB, BB, HBP, SH, SF, IBB)
## # A tibble: 113,799 × 9
## playerID yearID teamID AB BB HBP SH SF IBB
## <chr> <int> <fct> <int> <int> <int> <int> <int> <int>
## 1 aardsda01 2004 SFN 0 0 0 0 0 0
## 2 aardsda01 2006 CHN 2 0 0 1 0 0
## 3 aardsda01 2007 CHA 0 0 0 0 0 0
## 4 aardsda01 2008 BOS 1 0 0 0 0 0
## 5 aardsda01 2009 SEA 0 0 0 0 0 0
## 6 aardsda01 2010 SEA 0 0 0 0 0 0
## 7 aardsda01 2012 NYA 0 0 0 0 0 0
## 8 aardsda01 2013 NYN 0 0 0 0 0 0
## 9 aardsda01 2015 ATL 1 0 0 0 0 0
## 10 aaronha01 1954 ML1 468 28 3 6 4 0
## # ℹ 113,789 more rows
The syntax for replace_na()
is a bit involved but the
basic idea is you have to specify the value you want to replace each
NA
. When using replace_na()
it is very
important to remember to include the list(...)
bit.
Load the Lahman data and run the above code to create the tbl
Batting
, which has replaced all of the NA
’ in
the columns IBB, HBP, SH, SF
and GIDP`.
Using the pipe %>%
, mutate()
,
filter()
, and select()
, create a tbl
batting
by:
mutate()
) for plate appearances
(PA), unintentional walks (uBB), singles (X1B), batting average (BA),
on-base percentage (OBP), on-base plus slugging (OPS), and weighted
On-Base Average (wOBA). Note that the formula for plate appearances is
\(\text{PA} = \text{AB} + \text{BB} +
\text{HBP} + \text{SH} + \text{SF}.\) Formulae for the remaining
statistics are given in Problem Set 2.filter()
Standardize each of BA, OBP, OPS, and wOBA using data from all of
the years. Name the columns containing these new standardized values
zBA_all
, zOPB_all
, etc. Remember, you can
do this with the scale()
function from Lecture 3 Who were the best and worst
batters according to these four metrics?
Group batting
by year and compute the standardized
BA, OBP, OPS, and wOBA within each year. Name the columns containing
these new standardized values zBA_year
,
zOBP_year
, etc. Now who are the best and worst batters
according to the four measures?
Remove the grouping by year and instead group by year and league.
Once again, standardize OBP, OPS, and wOBA within each league-year
combination, and name the columns containing these new standardized
values zBA_year_lg
, zOBP_year_lg
, etc. Are the
best and worst batters still the same?
Remove the grouping you created in Problem 4. Bill James divided baseball history into several eras as follows:
Use mutate()
and case_when()
(just like we
did in Lecture 2) to add a column called
Hist_era to batting
that records the historical
era.
Group batting
by Hist_era and standardize BA, OBP,
OPS, and wOBA within historical era. Who are the best and worst batters
now? Name the columns containing these new standardized values
zBA_hist
, zOBP_hist
, etc.
Optional: choose one historical statistic calculated in question 7 and plot the top 10 players for this statistic.
Recall from Problem Set 2, we plotted the relative payroll of MLB teams against their winning percentage. In that problem set, we read in a file that had included the relative payroll for each team as a separate column. To get some additional practice with dplyr, we will read in a different dataset and re-compute these relative payrolls.
scale()
to compute z-scores. We can manually recreate this
code with a function, as reproduced below:standardize <- function(x){
mu <- mean(x, na.rm = TRUE)
sigma <- sd(x, na.rm = TRUE)
return( (x - mu)/sigma)
}
x
, compute
its median, and then divides every element of x
by the
median.Read in the MLB Payroll Data
and load it into a tibble called mlb_payrolls
.
Using the pipe %>%
, group_by()
, and
mutate()
, add a column to mlb_payrolls
that
contains the relative payroll for each team.
Make a scatterplot of winning percentage against relative payrolls. Comment on the relationship. Your scatterplot should be identical to one you made in Problem Set 2.
Using the reframe()
function, compute the average
team payroll and relative payroll for each year. Save these results in a
new tbl called payroll_avg
.
Make a scatterplot that shows how team payrolls have evolved over the year. Similar to what we did in Lecture 3, add a line to this scatterplot that shows the average team payroll. Do the same thing for relative payroll. What do you notice about the average team payroll and relative payroll?
As you will see in coming lectures, correlation is a measure of
the strength of the linear relationship between two variables.
The closer to +1 or -1 the correlation between two variables is, the
more predictable they are of each other. We can compute it using the
cor()
function. Using summary()
and
cor()
, compute the correlation between relative payroll and
winning percentage within each year. What do you notice about how the
relationship between winning percentage and relative payroll changes
year to year?
Without running the code, work with your teammates to see if you can figure out what the code below is doing.
batting_2014_2015 <-
batting %>%
filter(yearID %in% c(2014, 2015)) %>%
group_by(playerID) %>%
filter(n() == 2) %>%
select(playerID, yearID, BA) %>%
arrange(playerID)
Now, run the code above and save the tbl
batting_2014_2015
to the file “data/batting_2014_2015.csv”
using the write_csv()
function. We will return to this
dataset in Lecture 4.