Problem Set 2

Review Lecture 2 Notes

Please spend a few minutes reading through the notes from Lecture 2. Like in Problem Set 1, you should go through each code block with someone in your group and see if you can both explain to each other what all of the code does.

Payroll and Winning Percentage in the MLB

In lecture, Professor Wyner discussed the relationship between a team’s payroll and its winning percentage. In particular, for each season, he computed the “relative payroll” of each team by taking its payroll and dividing it by the median of payrolls of all teams in that season. We will replicate his analysis in the following problems using the dataset “mlb_relative_payrolls.csv”, which we saved to the “data” folder of our working directory back in Lecture 0. You should save all of the code for this analysis in an R script called “ps2_mlb_payroll.R”.

Read the data in from “mlb_relative_payrolls.csv” and save it as a tbl called “relative_payroll”
Make a histogram of team winning percentages. Play around with different binwidths.
Make a histogram of the relative payrolls.
Make a scatterplot with relative payroll on the horizontal axis and winning percentage on the vertical axis.
Without executing the code below, discuss with your group and see if you can figure out what it is doing.

ggplot(data = relative_payroll) + 
  geom_point(mapping = aes(x = Year, y = Team_Payroll))

Execute the code above. What can you say about how team payrolls have evolved over time? Make a similar plot that visualizes how relative payrolls have evolved over time.

MLB Batting Statistics

In this problem set, we will gain more experience using the dplyr verbs we learned in Lecture 2 to analyze batting statistics of MLB players with at least 502.2 plate appearances. We will be using the dataset hitting_qualified.csv (click on that link to download it, then move it into your “data” folder). You should write save all of the code for this analyses in an R script called “ps2_mlb_batting.R”.

Load the data into a tibble called hitting_qualified using read_csv().

The columns of this dataset include:
- playerID: the player’s ID code
- yearID: Year
- stint: the player’s stint (order of appearances within a season)
- teamID: the player’s team
- lgID: the player’s league
- G: the number of Games the player played in that year
- AB: number of At Bats of that player in that year
- PA: number of plate appearances by the player that year
- R: number of Runs the player made in that year
- H: number of Hits the player had in that year
- X2B: number of Doubles (hits on which the batter reached second base safely)
- X3B: number of Triples (hits on which the batter reached third base safely)
- HR: number of Homeruns the player made that year
- RBI: number of Runs Batted In the player made that year
- SB: number of Bases Stolen by the player in that year
- CS: number of times a player was Caught Stealing that year
- BB: Base on Balls
- SO: number of Strikeouts the player had that year
- IBB Intentional walks
- HBP: Hit by pitch
- SH: Sacrifice hits
- SF Sacrifice flies
- GIDP Grounded into double plays
Use arrange() to find out the first and last season for which we have data. Hint: you may need to use desc() as well.
Use reframe() to find out the first and last season for which we have data. Hint, you only need one line of code to do this
When you print out hitting_qualified you’ll notice that some columns were read in as characters and not integers or numerics. This can happen sometimes whenever the original csv file has missing values. In this case, the columns IBB, HBP, SH, SF, and GIDP were read in as characters. We want to convert these to integers. We can do this using mutate() and the function as.integer().

hitting_qualified <- mutate(hitting_qualified,
                            IBB = as.integer(IBB),
                            HBP = as.integer(HBP),
                            # finish on your own

Let’s take a look at some of the columns we just converted:

select(hitting_qualified, playerID, yearID, AB, IBB, HBP, SH, SF, GIDP)

## # A tibble: 12,043 × 8
##    playerID  yearID    AB   IBB   HBP    SH    SF  GIDP
##    <chr>      <dbl> <dbl> <int> <int> <int> <int> <int>
##  1 ansonca01   1884   475    NA    NA    NA    NA    NA
##  2 bradyst01   1884   485    NA     0    NA    NA    NA
##  3 connoro01   1884   477    NA    NA    NA    NA    NA
##  4 dalryab01   1884   521    NA    NA    NA    NA    NA
##  5 farreja02   1884   469    NA    NA    NA    NA    NA
##  6 gleasbi01   1884   472    NA    12    NA    NA    NA
##  7 hinespa01   1884   490    NA    NA    NA    NA    NA
##  8 hornujo01   1884   518    NA    NA    NA    NA    NA
##  9 jonesch01   1884   472    NA    10    NA    NA    NA
## 10 nelsoca01   1884   432    NA     9    NA    NA    NA
## # ℹ 12,033 more rows

You’ll notice that a lot of these columns contain NA values, which indicates that some of these values are missing. This makes sense, since a lot of these statistics were not recorded in the early years of baseball. A popular convention for dealing with these missing statistics is to impute the missing values with 0. That is, for instance, every place we see an NA we need to replace it with a 0. We can do that with mutate() and replace_na() function as follows.

hitting_qualified <- replace_na(hitting_qualified, 
                                list(IBB = 0, HBP = 0, SH = 0, SF = 0, GIDP = 0))

We will discuss the syntax for replace_na() later in lecture. Now, rerun the select() function from above to check that the NAs were replaced with zeros.

Use mutate() to add a column for the number of singles, which can be computed as \(\text{X1B} = \text{H} - \text{X2B} - \text{X3B} - \text{HR}\).
The variable BB includes as a subset all intentional walks (IBB). Use mutate() to add a column to hitting_qualified that counts the number of un-intentional walks (uBB). Be sure to save the resulting tibble as hitting_qualified.
Use mutate() to add columns for the following offensive statistics, whose formulae are given below. We have also included links to pages on Fangraphs that define and discuss each of these statistics.
- Walk Percentage (BBP): \[ \text{BBP} = \frac{\text{BB}}{\text{PA}} \]
- Strike-out Percentage (KP): \[\text{KP} = \frac{\text{SO}}{\text{PA}}\]
- On-Base Percentage (OBP): \[\text{OBP} = \frac{\text{H} + \text{BB} + \text{HBP}}{\text{AB} + \text{BB} + \text{HBP} + \text{SF}}\]
- Slugging (SLG): \[ \text{SLG} = \frac{\text{X1B} + 2 \times \text{X2B} + 3\times \text{X3B} + 4\times \text{HR}}{\text{AB}} \]
- On-Base Plus Slugging (OPS): \[\text{OPS} = \text{OBP} + \text{SLG}\]
- weighted On-Base Average (wOBA): We will use the 2013 weights which can be found here \[ \text{wOBA} = \frac{0.687 \times \text{uBB} + 0.718 \times \text{HBP} + 0.881 \times \text{X1B} + 1.256 \times \text{X2B} + 1.594 \times \text{X3B} + 2.065 \times \text{HR}}{\text{AB} + \text{uBB} + \text{SF} + \text{HBP}} \]

hitting_qualified <- mutate(hitting_qualified,
                  BBP = BB/PA,
                  KP = SO/PA,
                  # finish on your own

For most of the statistics in the previous question, Fangraphs has defined rating scales (to see these ratings, click on the linked page for each statistic in Question 6 and scroll down to the “Context” section of the page). Use mutate() and case_when() to add the ratings for walk percentage (BBP), strike-out percentage (KP), on-base percentage (OBP), on-base plus slugging (OPS), and wOBA. Call the columns “BBP_rating”, “KP_rating”, “OBP_rating”, “OPS_rating”, and “wOBA_rating.”

hitting_qualified <- mutate(hitting_qualified, 
                            BBP_rating = case_when(BBP >= .15 ~ "Excellent",
                                                   BBP < .15 & BBP >= .125 ~ "Great",
                                                   # finish on your own

Use filter() to subset the players who played between 2000 and 2015. Call the new tbl tmp_batting.
Use select() on tmp_batting to create a tibble called batting_recent containing all players who played between 2000 and 2015 with the following columns: playerID, yearID, teamID, lgID, and all of the statistics and rankings created in Problems 8 and 9.
Explore the distribution of some of the batting statistics introduced in problem 8 using the tbl batting_recent using histograms. Then explore the relationship between some of these statistics with scatterplots.
Explore the distribution of the ratings for the statistics introduced in Problem 9 using bar plots. Hint: You may need to mutate the rating columns with factor(BBP_rating, levels = c(Awful, Poor...)) to ensure the ratings are plotted in the correct order.