Please spend a few minutes reading through the notes from Lecture 2. As in Problem Set 1, you should go through each code block with someone in your group and see if you can explain to each other what the code does.
In lecture, Professor Wyner discussed the relationship between a team’s payroll and its winning percentage. In particular, for each season, he computed the “relative payroll” of each team by taking its payroll and dividing it by the median of payrolls of all teams in that season. We will replicate his analysis in the following problems using the dataset “mlb_relative_payrolls.csv”, which we saved to the “data” folder of our working directory back in Lecture 0. You should save all of the code for this analysis in an R script called “ps2_mlb_payroll.R”.
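To make the definition of relative payroll concrete, here is a minimal sketch on made-up numbers. The tibble toy_payrolls and its column names (season, team, payroll) are hypothetical and are not taken from “mlb_relative_payrolls.csv”. The sketch also assumes that group_by() is an acceptable way to compute a per-season median.

```r
library(dplyr)
library(tibble)

# Made-up payrolls (in millions of dollars) for two seasons; not real data
toy_payrolls <- tibble(
  season  = c(2010, 2010, 2010, 2011, 2011),
  team    = c("A", "B", "C", "A", "B"),
  payroll = c(50, 100, 200, 80, 120)
)

# Divide each team's payroll by the median payroll of its own season
toy_payrolls <- group_by(toy_payrolls, season)
toy_payrolls <- mutate(toy_payrolls, relative_payroll = payroll / median(payroll))
toy_payrolls <- ungroup(toy_payrolls)

toy_payrolls
```

A team with a relative payroll of 1 spent exactly the median amount for that season, and a team with a relative payroll of 2 spent twice the median.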
Read the data in from “mlb_relative_payrolls.csv” and save it as a tbl called “relative_payroll”.
Make a histogram of team winning percentages. Play around with different binwidths.
Make a histogram of the relative payrolls.
Make a scatterplot with relative payroll on the horizontal axis and winning percentage on the vertical axis.
Without executing the code below, discuss with your group and see if you can figure out what it is doing.
In this problem set, we will gain more experience using the dplyr verbs we learned in Lecture 2 to analyze batting statistics of MLB players with at least 502.2 plate appearances. We will be using the dataset hitting_qualified.csv (click on that link to download it, then move it into your “data” folder). You should save all of the code for this analysis in an R script called “ps2_mlb_batting.R”.
Load the data into a tibble called hitting_qualified using read_csv().
The columns of this dataset include:
playerID: the player’s ID code
yearID: Year
stint: the player’s stint (order of appearances within a season)
teamID: the player’s team
lgID: the player’s league
G: the number of Games the player played in that year
AB: number of At Bats of that player in that year
PA: number of plate appearances by the player that year
R: number of Runs the player made in that year
H: number of Hits the player had in that year
X2B: number of Doubles (hits on which the batter reached second base safely)
X3B: number of Triples (hits on which the batter reached third base safely)
HR: number of Homeruns the player made that year
RBI: number of Runs Batted In the player made that year
SB: number of Bases Stolen by the player in that year
CS: number of times a player was Caught Stealing that year
BB: Base on Balls
SO: number of Strikeouts the player had that year
IBB: Intentional walks
HBP: Hit by pitch
SH: Sacrifice hits
SF: Sacrifice flies
GIDP: Grounded into double plays
Use arrange() to find out the first and last season for which we have data. Hint: you may need to use desc() as well.
Use summarize() to find out the first and last season for which we have data. Hint: you only need one line of code to do this.
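As a rough illustration of how a single summarize() call can return several summaries at once, here is a sketch on a made-up tibble. The name toy_seasons, its values, and the output column names first_season and last_season are hypothetical.

```r
library(dplyr)
library(tibble)

# Made-up player-seasons; not the real hitting_qualified data
toy_seasons <- tibble(
  playerID = c("a01", "b01", "c01"),
  yearID   = c(1999, 1884, 2015)
)

# Each named argument to summarize() becomes one column of the result
season_range <- summarize(toy_seasons,
                          first_season = min(yearID),
                          last_season  = max(yearID))
season_range
```

The result is a one-row tibble, so both answers can be read off at a glance without sorting the data.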
When you print out hitting_qualified, you’ll notice that some columns were read in as characters and not integers or numerics. This can sometimes happen when the original csv file has missing values. In this case, the columns IBB, HBP, SH, SF, and GIDP were read in as characters. We want to convert these to integers. We can do this using mutate() and the function as.integer().
hitting_qualified <- mutate(hitting_qualified,
                            IBB = as.integer(IBB),
                            HBP = as.integer(HBP),
                            # finish on your own
## # A tibble: 12,043 × 8
##    playerID  yearID    AB   IBB   HBP    SH    SF  GIDP
##    <chr>      <dbl> <dbl> <int> <int> <int> <int> <int>
##  1 ansonca01   1884   475    NA    NA    NA    NA    NA
##  2 bradyst01   1884   485    NA     0    NA    NA    NA
##  3 connoro01   1884   477    NA    NA    NA    NA    NA
##  4 dalryab01   1884   521    NA    NA    NA    NA    NA
##  5 farreja02   1884   469    NA    NA    NA    NA    NA
##  6 gleasbi01   1884   472    NA    12    NA    NA    NA
##  7 hinespa01   1884   490    NA    NA    NA    NA    NA
##  8 hornujo01   1884   518    NA    NA    NA    NA    NA
##  9 jonesch01   1884   472    NA    10    NA    NA    NA
## 10 nelsoca01   1884   432    NA     9    NA    NA    NA
## # … with 12,033 more rows
The output above contains many NA values, which indicates that some of these values are missing. This makes sense, since a lot of these statistics were not recorded in the early years of baseball. A popular convention for dealing with these missing statistics is to impute the missing values with 0. That is, every place we see an NA, we need to replace it with a 0. We can do that with the replace_na() function as follows.
hitting_qualified <- replace_na(hitting_qualified,
                                list(IBB = 0, HBP = 0, SH = 0, SF = 0, GIDP = 0))
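To see what this call does on a small scale, here is a sketch that applies replace_na() to a made-up tibble. The name toy_missing and its values are hypothetical. The replacements are written as 0L so that they match the integer columns exactly.

```r
library(tidyr)
library(tibble)

# Made-up integer columns with missing values; not the real data
toy_missing <- tibble(
  playerID = c("a01", "b01", "c01"),
  IBB      = c(NA, 4L, NA),
  HBP      = c(2L, NA, 0L)
)

# The list names the columns to fill and the value to use for each one
toy_filled <- replace_na(toy_missing, list(IBB = 0L, HBP = 0L))
toy_filled
```

Only the columns named in the list are touched, and values that were already recorded are left unchanged.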
We will say more about replace_na() later in lecture. Now, rerun the select() function from above to check that the NAs were replaced with zeros.
Use mutate() to add a column for the number of singles, which can be computed as \(\text{X1B} = \text{H} - \text{X2B} - \text{X3B} - \text{HR}\).
The variable BB includes as a subset all intentional walks (IBB). Use mutate() to add a column to hitting_qualified that counts the number of un-intentional walks (uBB). Be sure to save the resulting tibble as hitting_qualified.
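As a quick arithmetic check of these two definitions, here is a sketch on a single made-up batting line. The tibble toy_batting and its numbers are hypothetical. The formula uBB = BB - IBB is an assumption that follows from intentional walks being a subset of all walks.

```r
library(dplyr)
library(tibble)

# One made-up player-season; not taken from the real data
toy_batting <- tibble(H = 150, X2B = 30, X3B = 5, HR = 20, BB = 60, IBB = 8)

# mutate() can add several columns in one call
toy_batting <- mutate(toy_batting,
                      X1B = H - X2B - X3B - HR,  # singles: 150 - 30 - 5 - 20
                      uBB = BB - IBB)            # un-intentional walks: 60 - 8
toy_batting
```

Here the player has 95 singles and 52 un-intentional walks.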
Use mutate() to add columns for the following offensive statistics, whose formulae are given below. We have also included links to pages on Fangraphs that define and discuss each of these statistics.
Use mutate() and case_when() to add the ratings for walk percentage (BBP), strike-out percentage (KP), on-base percentage (OBP), on-base plus slugging (OPS), and wOBA. Call the columns “BBP_rating”, “KP_rating”, “OBP_rating”, “OPS_rating”, and “wOBA_rating”.
hitting_qualified <- mutate(hitting_qualified,
                            BBP_rating = case_when(BBP >= .15 ~ "Excellent",
                                                   BBP < .15 & BBP >= .125 ~ "Great",
                                                   # finish on your own
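To show how case_when() works through its conditions, here is a sketch on three made-up walk percentages. It uses only the two cutoffs that appear in the code above. The catch-all label "Other" is a placeholder for illustration and is not one of the Fangraphs ratings.

```r
library(dplyr)
library(tibble)

# Made-up walk percentages; not real players
toy_rates <- tibble(BBP = c(0.16, 0.13, 0.08))

# Conditions are checked top to bottom and the first TRUE one wins,
# so the second line does not need to repeat "BBP < .15"
toy_rates <- mutate(toy_rates,
                    BBP_rating = case_when(BBP >= .15  ~ "Excellent",
                                           BBP >= .125 ~ "Great",
                                           TRUE        ~ "Other"))  # placeholder label
toy_rates
```

Writing the upper bound explicitly, as in the problem's starter code, gives the same result and can be easier to read.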
Use filter() to subset the players who played between 2000 and 2015. Call the new tbl tmp_batting.
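As an illustration of filtering on a range of values, here is a sketch on a made-up tibble. The name toy_years and its values are hypothetical, and the sketch assumes that “between 2000 and 2015” includes both endpoints.

```r
library(dplyr)
library(tibble)

# Made-up player-seasons; not the real data
toy_years <- tibble(
  playerID = c("a01", "b01", "c01", "d01", "e01"),
  yearID   = c(1999, 2000, 2010, 2015, 2016)
)

# Conditions separated by commas must all hold for a row to be kept
toy_subset <- filter(toy_years, yearID >= 2000, yearID <= 2015)
toy_subset
```

Only the rows for 2000, 2010, and 2015 are kept.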
Use select() on tmp_batting to create a tibble called batting_recent containing all players who played between 2000 and 2015 with the following columns: playerID, yearID, teamID, lgID, and all of the statistics and ratings created in Problems 8 and 9.
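When many columns share a naming pattern, a selection helper can save typing. Here is a sketch on a made-up tibble that uses ends_with() to pick up a rating column. The name toy_wide and its values are hypothetical, and listing every column by name works just as well.

```r
library(dplyr)
library(tibble)

# Made-up tibble with one statistic and its rating; not the real data
toy_wide <- tibble(
  playerID   = c("a01", "b01"),
  yearID     = c(2005, 2012),
  teamID     = c("PHI", "NYA"),
  lgID       = c("NL", "AL"),
  G          = c(150, 148),
  BBP        = c(0.16, 0.08),
  BBP_rating = c("Excellent", "Other")
)

# Name the ID columns and the statistic directly, then match the ratings by suffix
toy_recent <- select(toy_wide, playerID, yearID, teamID, lgID, BBP,
                     ends_with("_rating"))
names(toy_recent)
```

The column G is dropped because it is neither named nor matched by the helper.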
Use histograms to explore the distribution of some of the batting statistics introduced in Problem 8 using the tbl batting_recent. Then explore the relationship between some of these statistics with scatterplots.