There are many types of visualizations you can explore with ggplot. Often the most difficult part is choosing the type of graph and designing a layout that best communicates your ideas. A great way to find inspiration is to look at visualizations from other sources.
For instance, let’s take an example from Neil Paine. Neil is a writer for FiveThirtyEight, a leader in “data journalism” for sports and politics. Many organizations like this one have separate teams of visualization experts who design and code graphics for popular articles.
Let’s try to recreate one of Neil’s graphics from his article on the top fight songs of college teams. Start by reading Neil’s original article. Then, download the data file used to make this article from here, and save it to your “data” folder.
The data contains information on fight songs from all schools in the ACC, the Big Ten, the Big 12, the Pac-12, and the SEC, plus Notre Dame. Start by inspecting the structure of the dataset using the str() function. It is important to identify the type of each variable as you create graphics. In this case, the two variables that we will use for the x and y axes, sec_duration and bpm, are already stored as integers.
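A minimal sketch of this step is shown here. The file name and path are assumptions (the text only says the file is saved in the “data” folder), so adjust them to match your download.

```r
# Assumed file name and location; change this to wherever you saved the download
fight_songs_path <- "data/fight-songs.csv"

if (file.exists(fight_songs_path)) {
  # Read the CSV, keeping text columns as character vectors
  fight_songs <- read.csv(fight_songs_path, stringsAsFactors = FALSE)

  # Inspect the structure: one line per variable with its type and first values
  str(fight_songs)

  # Confirm the two plotting variables are stored as integers
  print(sapply(fight_songs[c("sec_duration", "bpm")], class))
}
```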
## 'data.frame': 65 obs. of 23 variables:
## $ school : chr "Notre Dame" "Baylor" "Iowa State" "Kansas" ...
## $ conference : chr "Independent" "Big 12" "Big 12" "Big 12" ...
## $ song_name : chr "Victory March" "Old Fight" "Iowa State Fights" "I'm a Jayhawk" ...
## $ writers : chr "Michael J. Shea and John F. Shea" "Dick Baker and Frank Boggs" "Jack Barker, Manly Rice, Paul Gnam, Rosalind K. Cook" "George \"Dumpy\" Bowles" ...
## $ year : chr "1908" "1947" "1930" "1912" ...
## $ student_writer : chr "No" "Yes" "Yes" "Yes" ...
## $ official_song : chr "Yes" "Yes" "Yes" "Yes" ...
## $ contest : chr "No" "No" "No" "No" ...
## $ bpm : int 152 76 155 137 80 153 180 81 149 159 ...
## $ sec_duration : int 64 99 55 62 67 37 29 65 47 54 ...
## $ fight : chr "Yes" "Yes" "Yes" "No" ...
## $ number_fights : int 1 4 5 0 6 0 5 17 2 8 ...
## $ victory : chr "Yes" "Yes" "No" "No" ...
## $ win_won : chr "Yes" "Yes" "No" "No" ...
## $ victory_win_won: chr "Yes" "Yes" "No" "No" ...
## $ rah : chr "Yes" "No" "Yes" "No" ...
## $ nonsense : chr "No" "No" "No" "Yes" ...
## $ colors : chr "Yes" "Yes" "No" "No" ...
## $ men : chr "Yes" "No" "Yes" "Yes" ...
## $ opponents : chr "No" "No" "No" "Yes" ...
## $ spelling : chr "No" "Yes" "Yes" "No" ...
## $ trope_count : int 6 5 4 3 3 2 4 4 6 3 ...
## $ spotify_id : chr "15a3ShKX3XWKzq0lSS48yr" "2ZsaI0Cu4nz8DHfBkPt0Dl" "3yyfoOXZQCtR6pfRJqu9pl" "0JzbjZgcjugS0dmPjF9R89" ...
To visualize the data, add a geom_point() layer and map sec_duration to the x axis and bpm to the y axis.
Next, color by school name (we omit the legend as it would be too large and confusing with labels for all 65 schools).
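library(ggplot2)

# scatter plot of song length against tempo, one color per school
ggplot(fight_songs, aes(x = sec_duration, y = bpm, color = school)) +
  geom_point() +
  theme(legend.position = "none")  # hide the 65-entry legend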
Use ggplot arguments to make the following changes to the graph to recreate Neil’s original graph:
The final plot should look as follows.
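As an illustration of the kind of layer-by-layer adjustments this step involves, the sketch below adds axis labels, a title, and dashed reference lines at the mean of each axis. These particular choices, including the title text, are assumptions for illustration and not the list of changes from the original exercise. To stay self-contained, it uses only the first four rows shown in the str() output rather than the full dataset.

```r
library(ggplot2)

# First four rows as shown in the str() output; a stand-in for the full dataset
fight_songs_sample <- data.frame(
  school = c("Notre Dame", "Baylor", "Iowa State", "Kansas"),
  sec_duration = c(64L, 99L, 55L, 62L),
  bpm = c(152L, 76L, 155L, 137L)
)

styled_plot <- ggplot(fight_songs_sample,
                      aes(x = sec_duration, y = bpm, color = school)) +
  geom_point(size = 3) +
  # Dashed reference lines at the mean of each axis (an illustrative choice)
  geom_vline(xintercept = mean(fight_songs_sample$sec_duration),
             linetype = "dashed") +
  geom_hline(yintercept = mean(fight_songs_sample$bpm),
             linetype = "dashed") +
  # Hypothetical title and axis labels
  labs(title = "Fight song tempo and length",
       x = "Duration (seconds)",
       y = "Beats per minute") +
  theme_minimal() +
  theme(legend.position = "none")

styled_plot
```

Each change is a separate layer or argument added with +, so you can add them one at a time and rerun the plot to see the effect of each.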
You may have noticed a lot more interactive data visualization pieces on the internet over the last few years. The goal of these graphics is to allow the user to explore specific parts of the data and increase engagement. The COVID-19 Dashboard put together by the Center for Systems Science and Engineering at Johns Hopkins University is a good example.
Plotly is one package that can be used in R to make interactive visualizations. Once you have your graph saved as a new object, install the plotly package and run the function ggplotly() on the graph to add interactive elements.
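A minimal sketch of that workflow follows. The object names p and interactive_plot are arbitrary, and the small data frame holds only the first four rows shown in the str() output, so substitute the full fight_songs data when you run it yourself.

```r
# install.packages("plotly")  # run once if the package is not yet installed
library(ggplot2)
library(plotly)

# First four rows as shown in the str() output; a stand-in for the full dataset
fight_songs_sample <- data.frame(
  school = c("Notre Dame", "Baylor", "Iowa State", "Kansas"),
  sec_duration = c(64L, 99L, 55L, 62L),
  bpm = c(152L, 76L, 155L, 137L)
)

# Save the static ggplot as an object first
p <- ggplot(fight_songs_sample,
            aes(x = sec_duration, y = bpm, color = school)) +
  geom_point() +
  theme(legend.position = "none")

# Convert the saved plot into an interactive plotly widget
interactive_plot <- ggplotly(p)
interactive_plot
```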
It will look as follows (place your mouse over the points to see how
it’s interactive).
Choose from the plots on FiveThirtyEight and recreate a graph using ggplot. Try to match the style and format of the graph, or use new tools like plotly to make the graph interactive. Perhaps you will find a graph that you think would be better visualized in another way. See if you can use the data to come up with a better approach.
You can find data and code behind some of the articles here.
Alternatively, you can work with the graph we have recreated from this article on Fatal Collisions. See if you can turn this into a stacked bar chart and replicate the other graphs in the article. The data can be downloaded here.
# load packages
library(dplyr)
library(ggplot2)

# load data
drivers <- read.csv("data/bad-drivers.csv")

# rename columns
drivers <- drivers %>%
  rename(total_collisions = Number.of.drivers.involved.in.fatal.collisions.per.billion.miles,
         non_distracted = Percentage.Of.Drivers.Involved.In.Fatal.Collisions.Who.Were.Not.Distracted)

# convert the non-distracted percentage into collisions per billion miles
drivers <- drivers %>%
  mutate(non_distracted_per = (non_distracted / 100) * total_collisions) %>%
  arrange(State)

# plot data
ggplot(data = drivers, aes(x = as.factor(State), y = non_distracted_per)) +
  geom_bar(stat = "identity", fill = "orange") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Drivers Involved in Fatal Collisions Who Were Not Distracted",
       subtitle = "As a share of the number of fatal collisions per billion miles, 2021") +
  theme(axis.title.x = element_blank(),
        axis.title.y = element_blank())
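One possible starting point for the stacked bar chart is sketched below. ggplot stacks bars when the data is in long format, with one row per state and group, and the group is mapped to fill. The state names and numbers here are placeholders made up for illustration, not values from bad-drivers.csv, and the “Distracted” group is assumed to be the remainder of each state’s total.

```r
library(ggplot2)

# Placeholder values for illustration only; not taken from bad-drivers.csv
drivers_toy <- data.frame(
  State = c("State A", "State B", "State C"),
  total_collisions = c(18, 15, 12),
  non_distracted = c(90, 80, 75)  # percent of drivers who were not distracted
)

# Split each state's total into the two parts that will be stacked
not_distracted <- drivers_toy$total_collisions * drivers_toy$non_distracted / 100

# Long format: one row per state and group
drivers_long <- data.frame(
  State = rep(drivers_toy$State, times = 2),
  group = rep(c("Not distracted", "Distracted"), each = nrow(drivers_toy)),
  collisions = c(not_distracted, drivers_toy$total_collisions - not_distracted)
)

stacked <- ggplot(drivers_long, aes(x = State, y = collisions, fill = group)) +
  geom_col() +  # bars sharing an x value are stacked by default
  coord_flip() +
  theme_minimal() +
  theme(axis.title = element_blank())

stacked
```

With the real data, the same reshaping could be done on the renamed columns from the code above, so that each bar’s full length equals total_collisions.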