In this part, we will continue to use heatmaps (introduced briefly in Lecture 2) to explore the strike zone in baseball. We will focus on data collected by PITCHf/x. At a high-level, PITCHf/x consists of a set of cameras installed at every ballpark which tracks the motion of each pitch. For more information about the system, check out this article by Mike Fast The data collected by PITCHf/x is then transmitted to the MLB Gameday application along with contextual information about the pitch. The dataset we’ll be using contains the measurements from the PITCHf/x system recorded in 2015.
Download pitchfx_2015.csv to
your “data” folder. Read the CSV file into a tbl called
pitches
using the read_csv
function.
The columns are:
Description
: Records the outcome of the pitch (Called
Strike, Swinging Strike, Foul, etc.)X
and Z
: the horizontal and vertical
coordinates of the pitch in inches. Note that the center of home plate
corresponds to X = 0
.
X
coordinate are recorded from the
catcher’s perspective, with negative values on the left and positive
values on the right. In this coordinate system, a right-handed batter
will line up to the left (i.e. negative X
values).COUNT
: The ball-strike count for each pitchP_HAND
and B_HAND
: the handedness of the
batter and pitcher.
To visualize the strike zone, we are going to want to filter out
only the called strikes and balls. Moreover, it will be helpful to
convert the Description to numeric values (1 for called strikes, 0 for
balls). Use the pipe operator, filter()
,
mutate()
, and case_when()
to create a new tbl
called_pitches
containing only the called strike and balls
and that includes a new column “Call” whose value is 0 for balls and 1
for called strike.
To get started, we will create a plot and then add to it sequentially:
stat_summary_2d()
function, which takes three
aesthetics:
stat_summary_2d()
divides the plane into
rectangles based on the aesthetics x and y, and then computes the
average value of z for observations in the bin. We can add this layer to
our plot g
as follows and obtain the following plot.stat_summary_2d()
has added a legend to our plot. However, the title of the legend is a
somewhat non-informative. Moreover, the color scheme does not
distinguish between different values particularly well. We can change
both the title of the legend and the color scheme inside a function
called scale_fill_distiller
. Don’t worry too much about
what this function means for now; we will cover it in more depth in Lecture 5.xmin
and xmax
arguments give the
horizontal limits of the strike zone (in this case, the coordinates of
the edges of the strike zone) and the ymin
and
ymax
arguments are the average vertical limits measured by
PITCHf/x. Note: these values were pre-computed using a much
larger datasetg <- g + annotate("rect", xmin = -8.5, xmax = 8.5, ymin = 19, ymax = 41.5, alpha = 0, color = "black")
g
g <- g + theme_classic() +
theme(axis.title.x = element_blank(),
axis.title.y = element_blank()) +
labs(title = "Estimated Strike Zone")
g
The file “nba_boxscore.csv” lists detailed box score information about every NBA player in every season ranging from 1996–97 season and 2015-16 season. We will look at team shooting statistics over this 20-season span.
raw_boxscore
.table()
function. This function takes a vector and returns
the frequencies of each unique value.## .
## ATL BOS BRK CHA CHH CHI CHO CLE DAL DEN DET GSW HOU IND LAC LAL MEM MIA MIL MIN NJN NOH NOK NOP NYK OKC
## 347 350 72 183 101 335 34 359 356 354 321 355 359 319 347 319 271 348 337 328 298 161 34 63 345 144
## ORL PHI PHO POR SAC SAS SEA TOR TOT UTA VAN WAS WSB
## 340 352 348 338 332 342 195 366 1047 309 80 333 15
filter()
to take a closer look at these
players.## # A tibble: 15 × 22
## Season Player Pos Age Tm G GS MP FGM FGA TPM TPA FTM FTA ORB DRB AST STL BLK TOV PF
## <dbl> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1997 Ashraf A… PF 25 WSB 31 0 144 12 40 1 1 15 28 19 33 3 7 3 10 29
## 2 1997 Calbert … SG 25 WSB 79 79 2411 369 730 4 30 95 137 70 198 114 77 18 94 226
## 3 1997 Matt Fish C 27 WSB 5 0 7 1 3 0 0 0 0 1 4 0 0 0 2 2
## 4 1997 Harvey G… PF 31 WSB 78 25 1604 129 314 28 89 30 39 63 193 68 46 48 30 167
## 5 1997 Juwan Ho… SF 23 WSB 82 82 3324 638 1313 0 2 294 389 202 450 311 93 23 246 259
## 6 1997 Jaren Ja… SG 29 WSB 75 0 1133 134 329 53 158 53 69 31 101 65 45 16 60 131
## 7 1997 Tim Legl… SG 30 WSB 15 0 182 15 48 8 29 6 7 0 21 7 3 5 9 21
## 8 1997 Gheorghe… C 25 WSB 73 69 1849 327 541 0 0 123 199 141 340 29 43 96 117 230
## 9 1997 Tracy Mu… SF 25 WSB 82 1 1814 288 678 106 300 135 161 84 169 78 69 19 86 150
## 10 1997 Gaylon N… SG 27 WSB 1 0 6 1 3 0 1 0 0 0 1 0 1 0 1 0
## 11 1997 Rod Stri… PG 30 WSB 82 81 2997 515 1105 13 77 367 497 95 240 727 143 14 270 166
## 12 1997 Ben Wall… PF 22 WSB 34 0 197 16 46 0 0 6 20 25 33 2 8 11 18 27
## 13 1997 Chris We… PF 23 WSB 72 72 2806 604 1167 60 151 177 313 238 505 331 122 137 230 258
## 14 1997 Chris Wh… PG 25 WSB 82 1 1117 139 330 58 163 94 113 13 91 182 49 4 68 100
## 15 1997 Lorenzo … C 27 WSB 19 0 264 20 31 0 0 5 7 28 41 4 6 8 18 49
## # … with 1 more variable: PTS <dbl>
These fifteen players during the 1996-97 season on the Washington Bullets, which was renamed the Washington Wizards at the end of that season.There are a few other examples: VAN refers to the Vancouver Grizzlies who moved to Memphis and CHH refers to the original Charlotte Hornets franchise, which ultimately relocated to New Orleans.
One of the teams listed is “TOT”. This does not refer any specific team. Instead these rows record the total statistics recorded by a player if he played for multiple teams in a single season. For the purposes of understanding how team shooting statistics changed over time, we will not want to include these rows in our analysis.
Use filter()
, group_by()
,
summarize()
, and mutate()
to create a new tbl
called team_boxscore
that does the following:
Tm == "TOT"
)Use filter()
to create a new tbl called
reduced_boxscore
that pulls out the rows of
team_boxscore
corresponding to the following teams: BOS,
CLE, DAL, DET, GSW, LAL, MIA, and SAS. Then create a plot of these
teams’ three point percentage in each season. Be sure to color the
points according to the team (Hint: to map the Tm
variable to the points as colors, use the color =
argument
within the aes
function. We will learn more about this in
Lecture 5). What patterns do you
notice?
Once you finish reviewing the material from earlier this week, we’d like you to use some of the tools we introduced in Lecture 4 to read in new data into R. Then, using the skills you’ve learned in the first four lectures, we’d like to you do some type of analysis with this data. It doesn’t need to be sophisticated – even making a few interesting visualizations or computing some interesting summaries is enough. We just want you to get some practice working with some data that you’ve collected yourselves!