Problem Set 5

Recreate a Graph

There are many types of visualizations you can explore with ggplot. Sometimes the most difficult part is choosing the type of graph and designing a layout that will best communicate your thoughts. A great way to find inspiration is to look at visualizations from other sources.

For instance, let’s take an example from Neil Paine. Neil is a writer for FiveThirtyEight, a leader in “data journalism” for sports and politics. Many organizations like this one have separate teams of visualization experts who design and code graphics for popular articles.

Let’s try to recreate one of Neil’s graphics from his article on the top fight songs of college teams. Start by reading Neil’s original article. Then, download the data file used to make this article from here, and save it to your “data” folder.

library(ggplot2)
fight_songs <- read.csv('data/fight-songs.csv')

The data contains information on fight songs from all schools in the ACC, the Big Ten, the Big 12, the Pac-12, and the SEC, plus Notre Dame. Start by inspecting the structure of the dataset using the str() function. It is important to identify the “type” of each variable as you create graphics. In this case, the two variables that we will place on the x and y axes, sec_duration and bpm, are already stored as integer variables.

str(fight_songs)

## 'data.frame':    65 obs. of  23 variables:
##  $ school         : chr  "Notre Dame" "Baylor" "Iowa State" "Kansas" ...
##  $ conference     : chr  "Independent" "Big 12" "Big 12" "Big 12" ...
##  $ song_name      : chr  "Victory March" "Old Fight" "Iowa State Fights" "I'm a Jayhawk" ...
##  $ writers        : chr  "Michael J. Shea and John F. Shea" "Dick Baker and Frank Boggs" "Jack Barker, Manly Rice, Paul Gnam, Rosalind K. Cook" "George \"Dumpy\" Bowles" ...
##  $ year           : chr  "1908" "1947" "1930" "1912" ...
##  $ student_writer : chr  "No" "Yes" "Yes" "Yes" ...
##  $ official_song  : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ contest        : chr  "No" "No" "No" "No" ...
##  $ bpm            : int  152 76 155 137 80 153 180 81 149 159 ...
##  $ sec_duration   : int  64 99 55 62 67 37 29 65 47 54 ...
##  $ fight          : chr  "Yes" "Yes" "Yes" "No" ...
##  $ number_fights  : int  1 4 5 0 6 0 5 17 2 8 ...
##  $ victory        : chr  "Yes" "Yes" "No" "No" ...
##  $ win_won        : chr  "Yes" "Yes" "No" "No" ...
##  $ victory_win_won: chr  "Yes" "Yes" "No" "No" ...
##  $ rah            : chr  "Yes" "No" "Yes" "No" ...
##  $ nonsense       : chr  "No" "No" "No" "Yes" ...
##  $ colors         : chr  "Yes" "Yes" "No" "No" ...
##  $ men            : chr  "Yes" "No" "Yes" "Yes" ...
##  $ opponents      : chr  "No" "No" "No" "Yes" ...
##  $ spelling       : chr  "No" "Yes" "Yes" "No" ...
##  $ trope_count    : int  6 5 4 3 3 2 4 4 6 3 ...
##  $ spotify_id     : chr  "15a3ShKX3XWKzq0lSS48yr" "2ZsaI0Cu4nz8DHfBkPt0Dl" "3yyfoOXZQCtR6pfRJqu9pl" "0JzbjZgcjugS0dmPjF9R89" ...
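If the full str() output is more than you need, a shorter check is to list only the class of each column. The sketch below uses a small stand-in data frame (the first three rows of three columns, copied from the output above) rather than the real file, so the name songs is an assumption made for illustration.

```r
# Stand-in for fight_songs: three rows copied from the str() output above
songs <- data.frame(
  school = c("Notre Dame", "Baylor", "Iowa State"),
  bpm = c(152L, 76L, 155L),
  sec_duration = c(64L, 99L, 55L)
)

# Class of every column, returned as a named character vector
col_classes <- sapply(songs, class)
print(col_classes)

# Names of the numeric columns, which can go on a continuous axis
numeric_cols <- names(songs)[sapply(songs, is.numeric)]
print(numeric_cols)
```

The same two sapply() calls should work unchanged on fight_songs once it is loaded.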

To visualize the data, use geom_point(), mapping sec_duration to the x axis and bpm to the y axis.

ggplot(fight_songs, aes(x = sec_duration, y = bpm)) +
  geom_point()

Next, color by school name (we omit the legend as it would be too large and confusing with labels for all 65 schools).

ggplot(fight_songs, aes(x = sec_duration, y = bpm, color = school)) +
  geom_point() +
  theme(legend.position = "none")

Use ggplot arguments to make the following changes to the graph to recreate Neil’s original graph:

Use the alpha and size arguments to match the original graph.
Add a title, x label, and y label that match the labels on the original graph.
Add a black point to the graph to mark Notre Dame.
Label the Notre Dame point with text.
Use geom_vline and geom_hline to draw reference lines at the average duration and the average bpm.
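If any of these tools are unfamiliar, the sketch below shows each of them on R's built-in mtcars data rather than on the fight song data, so it is an illustration of the techniques and not a solution. The highlighted car, the colors, and the label text are arbitrary choices made for the example.

```r
library(ggplot2)

# Pick one row to highlight; the Mazda RX4 is an arbitrary choice
highlight <- mtcars["Mazda RX4", ]

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  # alpha and size are constants, so they are set outside aes()
  geom_point(alpha = 0.6, size = 3, color = "steelblue") +
  # a second point layer with its own data marks the highlighted row
  geom_point(data = highlight, color = "black", size = 3) +
  # annotate() places a single text label at fixed coordinates
  annotate("text", x = highlight$wt, y = highlight$mpg,
           label = "Mazda RX4", vjust = -1) +
  # dashed reference lines at the mean of each variable
  geom_vline(xintercept = mean(mtcars$wt), linetype = "dashed") +
  geom_hline(yintercept = mean(mtcars$mpg), linetype = "dashed") +
  labs(title = "Fuel economy versus weight",
       x = "Weight (1000 lbs)", y = "Miles per gallon")

print(p)
```

For the fight song graph, the same pattern should apply with sec_duration and bpm in place of wt and mpg, and a one-row subset for Notre Dame in place of highlight.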

The final plot should look as follows.

Make it Interactive

You may have noticed a lot more interactive data visualization pieces on the internet over the last few years. The goal of these graphics is to allow the user to explore specific parts of the data and increase engagement. The COVID-19 Dashboard put together by the Center for Systems Science and Engineering at Johns Hopkins University is a good example.

Plotly is one package that can be used in R to make interactive visualizations. Once you have your graph saved as a new object, install the plotly package and run the function ggplotly on the graph in order to add interactive elements. It will look as follows (place your mouse over the points to see how it’s interactive).
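As a minimal sketch of that workflow, the example below saves a ggplot built from the built-in mtcars data as an object and passes it to ggplotly(). The object names p and interactive_p are placeholders; your own saved fight song plot would take the place of p.

```r
# install.packages("plotly")  # run once if plotly is not yet installed
library(ggplot2)
library(plotly)

# Save the static graph as an object first
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

# ggplotly() converts the saved ggplot object into an interactive widget
interactive_p <- ggplotly(p)
interactive_p
```

Hovering over a point in the resulting widget shows the values mapped in aes().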

Work with new data to create a graph

Choose from the plots on FiveThirtyEight and recreate a graph using ggplot. Try to match the style and format of the graph, or use new tools like plotly to make the graph interactive. If you find a graph that you think would be better visualized in another way, see if you can use the data to build a better version.

You can find data and code behind some of the articles here.

Alternatively, you can work with the graph we have recreated from this article on Fatal Collisions. See if you can turn this into a stacked bar chart and replicate the other graphs in the article. The data can be downloaded here.

# load dplyr (for %>%, rename, mutate, and arrange) and the data
library(dplyr)
drivers <- read.csv('data/bad-drivers.csv')

# rename columns
drivers <- drivers %>% 
  rename(total_collisions = Number.of.drivers.involved.in.fatal.collisions.per.billion.miles,
         non_distracted = Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Not.Distracted)

# convert the percentage into a rate per billion miles, then sort by state
drivers <- drivers %>%
  mutate(non_distracted_per = (non_distracted/100) * total_collisions) %>%
  arrange(State)

# plot data
ggplot(data = drivers, aes(x = as.factor(State), y = non_distracted_per)) +
  geom_bar(stat = "identity", fill = "orange") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Drivers Involved in Fatal Collisions Who Were Not Distracted",
       subtitle = "As a share of the number of fatal collisions per billion miles, 2021") +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank())
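For the stacked bar chart mentioned above, one common approach is to put the data in long format (one row per state and category) and map the category to fill. The sketch below uses a made-up data frame; the names toy, state, category, and collisions and all of the values are hypothetical and do not come from the bad-drivers file.

```r
library(ggplot2)

# Hypothetical long-format data: one row per state and category
toy <- data.frame(
  state = rep(c("State A", "State B", "State C"), each = 2),
  category = rep(c("Not distracted", "Distracted"), times = 3),
  collisions = c(12, 3, 15, 5, 9, 2)
)

p_stacked <- ggplot(toy, aes(x = state, y = collisions, fill = category)) +
  # geom_col() plots the y values as given and stacks fill groups by default
  geom_col() +
  coord_flip() +
  theme_minimal()

print(p_stacked)
```

To apply this to drivers, you would first need to reshape the relevant columns into this long layout, for example with tidyr's pivot_longer().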