15 Presenting spatial data without maps
Maps are a powerful tool for visualizing spatial data, but they are not always the best choice. This chapter explores alternative methods for presenting spatial data effectively, including tables and charts. By understanding when to use these techniques, you will learn how to communicate spatial information clearly and concisely. The chapter covers key scenarios where maps may be less effective, such as when summarizing small datasets or comparing multiple variables.
Open RStudio or – if you already have RStudio open – click Session then Restart R. Make sure you’re working inside the Crime Mapping RStudio project you created in Section 1.4.2, then you’re ready to start mapping.
15.1 Introduction
Making maps is the core of analysing spatial data. But just because a particular dataset has a spatial element to it does not always mean that a map is the best way to present that data. In this chapter we will learn some other techniques for presenting data that can be more effective than maps for answering certain questions about spatial data.
As with so much in spatial analysis, whether it is best to make a map or use some other technique to convey information will depend on the circumstances. When you decide how to communicate information about the data you are analysing, you will need to consider the questions you are trying to answer, the audience that you are communicating to, what they will be using the information for and in what circumstances they will be using it.
While the best choice of how to communicate spatial information will depend on the circumstances, there are a few instances in which maps are typically not the best way to communicate your data. These include:
When you only need to convey a handful of pieces of information
Maps are very effective for communicating detailed information, such as the density of crime across thousands of cells in a KDE grid. But to do this, maps typically encode information into aesthetics such as colour, size and so on. This is necessary for communicating large amounts of information, but it makes the connection between the data and the visual representation of the data less direct. If you only need to communicate a small amount of information, there is less justification for forcing your audience to mentally translate the aesthetic into whatever it represents.
For example, if you wanted to show the number of violent and sexual offences in each of the seven districts in Northamptonshire in England, a choropleth map is less clear than a bar chart (for example, in being able to decipher if there were more offences in Kettering or in Wellingborough).
A map might be a useful addition to the bar chart in this case if you are trying to communicate information to people who are not familiar with the locations of the districts. In that case, we might want to add a small reference map to help people understand which area is which:
But in most circumstances in which you create crime maps, you will be creating them for an audience (such as local police officers) that already has sufficient knowledge of the area and so an inset map such as this would not be needed. In that case, a bar chart will probably be more effective at showing this information than a map would be.
When you need to convey several different things about one place
Maps are generally most effective when they show a single piece of data about each place (e.g. a grid cell or a polygon representing a statistical area). For example, a choropleth map shows a single shade of colour for each area on the map to represent a single value, such as the frequency or rate of crime in that area. If you wanted to show the frequency of burglary and the frequency of robbery in the same area on a map, this would be quite hard. So if you need to convey multiple different things about each place, it is generally best to do this in a table or chart, rather than a map.
One exception to this is when you present multiple maps side by side, each showing a single thing about an area. These are called small multiple maps and we will learn about them in Chapter 16.
When the geographic relationship between places on the map is not the most important thing about them
Maps emphasise the spatial relationship between different places, but they do this at the expense of making non-spatial relationships between those places less obvious. If the spatial relationships are the most important thing that you want to convey, a map makes sense. For example, a hotspot map is often a very good way to communicate where crime is most concentrated. But in other cases the geographic relationships between variables will be much less important. For example, if you wanted to show the relationship between the amount of crime in an area and the level of poverty there, a scatter plot would probably be a more-effective way to do this than a map would be.
What is one reason why spatial relationships might not be important in some visualizations?
What is a key factor in deciding whether to use a map, table, or chart?
15.2 Tables
Well-designed tables can be a very effective way of communicating information, whereas badly designed tables can be confusing and even lead your audience to give up trying to engage with the information you’re trying to communicate.
Tables used to present information almost always show only a summary of the available data, so the first step in preparing a table is to wrangle the data into the right format. In Section 3.6 we learned about the summarise() function from the dplyr package that we can use to produce summaries of rows of data.
To learn about creating a good table for displaying summary data in a report, we will use the example of the frequency of different types of violence in the different states of Malaysia in 2017.
Open a new R script file and save it as chapter15a.R. Copy this code into that file and run it.
We can get a feel for the data by looking at a random sample of rows using the slice_sample() function from the dplyr package (remember dplyr was loaded automatically when we loaded tidyverse).
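As a minimal sketch of how slice_sample() is called, the code below builds a small stand-in for the violence object (the name violence_sample is hypothetical, and its rows are copied from the tables in this chapter) and draws three random rows from it. On the full dataset you would call slice_sample() on violence itself and ask for more rows.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the `violence` object: a few rows copied from
# the tables in this chapter, stored in the same long format
violence_sample <- tribble(
  ~region,         ~state,    ~year, ~crime_type,       ~count,
  "East Malaysia", "Sabah",   2017,  "murder",          36,
  "East Malaysia", "Sabah",   2017,  "rape",            211,
  "East Malaysia", "Sarawak", 2017,  "murder",          27,
  "East Malaysia", "Sarawak", 2017,  "rape",            150,
  "West Malaysia", "Johor",   2017,  "murder",          66,
  "West Malaysia", "Johor",   2017,  "unarmed robbery", 1701
)

# Show three randomly chosen rows (the rows returned change on every run)
sampled_rows <- slice_sample(violence_sample, n = 3)
sampled_rows
```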
# A tibble: 10 × 5
region state year crime_type count
<chr> <chr> <dbl> <chr> <dbl>
1 West Malaysia Kelantan 2017 unarmed robbery 219
2 West Malaysia Melaka 2017 murder 7
3 West Malaysia Melaka 2017 unarmed robbery 589
4 East Malaysia Sarawak 2017 armed robbery 3
5 West Malaysia Johor 2017 unarmed robbery 1701
6 West Malaysia Melaka 2017 rape 69
7 West Malaysia Pahang 2017 rape 163
8 West Malaysia Kuala Lumpur 2017 aggravated assault 651
9 West Malaysia Pulau Pinang 2017 unarmed robbery 706
10 West Malaysia Johor 2017 murder 66
The output of slice_sample() looks acceptable as a table, especially if it is included in a Quarto document, but readers of our reports probably don’t want to know the type of each variable (underneath the variable names) and won’t want to page through the table if there are more rows or columns than can fit in the available space. We can make this table much more useful for readers by wrangling it into a different format.
15.2.1 Making data wider for presentation
One issue with printing the violence object as a table is that it has 70 rows, so it will take up a lot of space on a page or screen. We can make the data more compact by converting it from long format to wide format. In Chapter 12 we learned that data are often easier to analyse in long format. But it is often better to present data in a table in wide format. When you are choosing between storing data in long versus wide format, remember: analyse in long format, present in wide format.
To convert the table to a wider format we can use the pivot_wider() function from the tidyr package, just as we used the corresponding pivot_longer() function to tidy data in Chapter 12. To make data wider, we specify a single column in the data to use as the names of multiple new columns using the names_from argument and a column to use as the values for the new columns using the values_from argument.
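The reshaping described above can be sketched on a small scale. The tibble violence_long below is a hypothetical stand-in for the violence object, holding only four rows copied from the tables in this chapter; the pipeline itself uses the same pivot_wider() and janitor::clean_names() calls that appear in the later code in this section.

```r
library(tibble)
library(tidyr)

# Hypothetical stand-in for `violence`: one row per state and crime type
violence_long <- tribble(
  ~region,         ~state,    ~year, ~crime_type,       ~count,
  "East Malaysia", "Sabah",   2017,  "murder",          36,
  "East Malaysia", "Sabah",   2017,  "unarmed robbery", 284,
  "East Malaysia", "Sarawak", 2017,  "murder",          27,
  "East Malaysia", "Sarawak", 2017,  "unarmed robbery", 328
)

violence_wide <- violence_long |>
  # Each distinct value of `crime_type` becomes a new column holding `count`
  pivot_wider(names_from = crime_type, values_from = count) |>
  # Convert new column names such as `unarmed robbery` to snake case
  janitor::clean_names()

violence_wide
```

The four long-format rows collapse into two rows, one per state, with one column for each crime type.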
R Console
# A tibble: 14 × 8
region state year aggravated_assault armed_robbery murder rape
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 East Malaysia Sabah 2017 230 0 36 211
2 East Malaysia Sarawak 2017 368 3 27 150
3 West Malaysia Johor 2017 614 1 66 196
4 West Malaysia Kedah 2017 364 2 21 119
5 West Malaysia Kelantan 2017 252 2 13 114
6 West Malaysia Kuala Lump… 2017 651 4 37 132
7 West Malaysia Melaka 2017 176 1 7 69
8 West Malaysia Negeri Sem… 2017 241 2 14 91
9 West Malaysia Pahang 2017 188 1 16 163
10 West Malaysia Perak 2017 380 4 35 95
11 West Malaysia Perlis 2017 47 0 2 30
12 West Malaysia Pulau Pina… 2017 275 0 17 80
13 West Malaysia Selangor 2017 1108 14 83 321
14 West Malaysia Terengganu 2017 130 0 5 64
# ℹ 1 more variable: unarmed_robbery <dbl>
janitor::clean_names()
You will be used to seeing janitor::clean_names() used to clean the column names in a dataset that has just been loaded. In this case, the new columns created by pivot_wider() will have spaces in them, because the names are taken from the values of the crime_type column in the original dataset. Column names with spaces in them are harder to work with, so this code converts them to snake case.
Later in the code we will replace these column names with labels that are suitable for displaying the data in a table.
Now the table has only 14 rows, which makes it much easier to present both on screen and in print. We can also see that the year column is constant (all the values are the same), so we can remove this using the select() function from dplyr. We can also use select() to change the order of the columns from left to right so that the two types of robbery appear next to each other.
R Console
# Produce table of crime counts in each Malaysian state
violence |>
# Convert data to have one row per state
pivot_wider(names_from = crime_type, values_from = count) |>
# Convert new column names (the former values of `crime_type`) to snake case
janitor::clean_names() |>
# Choose only the columns we want to show in the table
select(
region, state, murder, rape, aggravated_assault, armed_robbery,
unarmed_robbery
)

# A tibble: 14 × 7
region state murder rape aggravated_assault armed_robbery unarmed_robbery
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 East Mal… Sabah 36 211 230 0 284
2 East Mal… Sara… 27 150 368 3 328
3 West Mal… Johor 66 196 614 1 1701
4 West Mal… Kedah 21 119 364 2 490
5 West Mal… Kela… 13 114 252 2 219
6 West Mal… Kual… 37 132 651 4 3175
7 West Mal… Mela… 7 69 176 1 589
8 West Mal… Nege… 14 91 241 2 536
9 West Mal… Paha… 16 163 188 1 288
10 West Mal… Perak 35 95 380 4 626
11 West Mal… Perl… 2 30 47 0 53
12 West Mal… Pula… 17 80 275 0 706
13 West Mal… Sela… 83 321 1108 14 4944
14 West Mal… Tere… 5 64 130 0 155
15.2.2 Using the gt package to make better tables
The table we created in the last section was better than simply showing the raw data to readers of a report. But we can create much better display tables with the gt package, which is designed to format data for display. The gt package works in a similar way to the ggplot2 package, in that tables are made up of stacks of functions that contribute to the appearance of the final table. One difference is that the layers in a gt stack are joined using the pipe operator (|>) rather than the plus operator (+).
We can create a very basic gt table by just passing a data frame or tibble to the gt() function. So we can add gt() to the end of the pipeline of functions we have already started to build to create a good display table. At this point, the only argument we will add to gt() is the rowname_col argument, which we use to specify which column in the data holds the row labels (in this case, the name of each state).
R Console
# Produce table of crime counts in each Malaysian state
violence |>
# Convert data to have one row per state
pivot_wider(names_from = crime_type, values_from = count) |>
# Convert new column names (the former values of `crime_type`) to snake case
janitor::clean_names() |>
# Choose only the columns we want to show in the table
select(
region, state, murder, rape, aggravated_assault, armed_robbery,
unarmed_robbery
) |>
# Functions from tidyverse above and functions from gt below
gt(rowname_col = "state")

| | region | murder | rape | aggravated_assault | armed_robbery | unarmed_robbery |
|---|---|---|---|---|---|---|
| Sabah | East Malaysia | 36 | 211 | 230 | 0 | 284 |
| Sarawak | East Malaysia | 27 | 150 | 368 | 3 | 328 |
| Johor | West Malaysia | 66 | 196 | 614 | 1 | 1701 |
| Kedah | West Malaysia | 21 | 119 | 364 | 2 | 490 |
| Kelantan | West Malaysia | 13 | 114 | 252 | 2 | 219 |
| Kuala Lumpur | West Malaysia | 37 | 132 | 651 | 4 | 3175 |
| Melaka | West Malaysia | 7 | 69 | 176 | 1 | 589 |
| Negeri Sembilan | West Malaysia | 14 | 91 | 241 | 2 | 536 |
| Pahang | West Malaysia | 16 | 163 | 188 | 1 | 288 |
| Perak | West Malaysia | 35 | 95 | 380 | 4 | 626 |
| Perlis | West Malaysia | 2 | 30 | 47 | 0 | 53 |
| Pulau Pinang | West Malaysia | 17 | 80 | 275 | 0 | 706 |
| Selangor | West Malaysia | 83 | 321 | 1108 | 14 | 4944 |
| Terengganu | West Malaysia | 5 | 64 | 130 | 0 | 155 |
This table is already better than the default table produced by Quarto if we just print a data frame or tibble. The gt table does not take up the whole width of the page unnecessarily (which can make it harder to read across rows) and has hidden the type of each column.
We can add more functions to the gt() stack to adjust the appearance of the table. For example, we can format the numeric columns as numbers using the fmt_number() function. This adds thousand separators (in British English, commas) to make it easier to read the large numeric values and can make various other changes such as adding a prefix or suffix to numbers (useful for showing units), scaling numbers (useful for very large numbers) or automatically formatting numbers according to the conventions of the language your computer is set to use (referred to in R help pages as the locale of your computer).
We choose which columns fmt_number() should format using the columns argument. In this case, we want to format all the numeric columns in the data, so we will set columns = where(is.numeric).
We don’t want the numbers in the table to have any decimal places (since the crime counts are all whole numbers), so we also set decimals = 0. We can use the default values of all the other arguments to fmt_number() – type ?gt::fmt_number in the R console to find out more about the different options available on the help page for the fmt_number() function.
R Console
# Produce table of crime counts in each Malaysian state
violence |>
# Convert data to have one row per state
pivot_wider(names_from = crime_type, values_from = count) |>
# Convert new column names (the former values of `crime_type`) to snake case
janitor::clean_names() |>
# Choose only the columns we want to show in the table
select(
region, state, murder, rape, aggravated_assault, armed_robbery,
unarmed_robbery
) |>
# Functions from tidyverse above and functions from gt below
gt(rowname_col = "state") |>
# Format numbers with thousand separators and no decimals
fmt_number(columns = where(is.numeric), decimals = 0)

| | region | murder | rape | aggravated_assault | armed_robbery | unarmed_robbery |
|---|---|---|---|---|---|---|
| Sabah | East Malaysia | 36 | 211 | 230 | 0 | 284 |
| Sarawak | East Malaysia | 27 | 150 | 368 | 3 | 328 |
| Johor | West Malaysia | 66 | 196 | 614 | 1 | 1,701 |
| Kedah | West Malaysia | 21 | 119 | 364 | 2 | 490 |
| Kelantan | West Malaysia | 13 | 114 | 252 | 2 | 219 |
| Kuala Lumpur | West Malaysia | 37 | 132 | 651 | 4 | 3,175 |
| Melaka | West Malaysia | 7 | 69 | 176 | 1 | 589 |
| Negeri Sembilan | West Malaysia | 14 | 91 | 241 | 2 | 536 |
| Pahang | West Malaysia | 16 | 163 | 188 | 1 | 288 |
| Perak | West Malaysia | 35 | 95 | 380 | 4 | 626 |
| Perlis | West Malaysia | 2 | 30 | 47 | 0 | 53 |
| Pulau Pinang | West Malaysia | 17 | 80 | 275 | 0 | 706 |
| Selangor | West Malaysia | 83 | 321 | 1,108 | 14 | 4,944 |
| Terengganu | West Malaysia | 5 | 64 | 130 | 0 | 155 |
fmt_number() is one of several formatting functions available in gt. For example, we could use fmt_currency() to format columns according to the conventions for currency values, fmt_date() for dates or fmt_percent() for percentages.
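As a brief illustration of two of these other formatting functions, the sketch below applies fmt_currency() and fmt_percent() to a tiny table. The data in example_values are invented purely for this example and are not part of the Malaysian crime data; the choice of pounds sterling as the currency is likewise an assumption.

```r
library(tibble)
library(gt)

# Hypothetical values invented for illustration only
example_values <- tibble(
  item = c("a", "b"),
  cost = c(1250.5, 980),
  share = c(0.561, 0.439)
)

example_table <- example_values |>
  gt(rowname_col = "item") |>
  # Format `cost` as a currency value (here assumed to be pounds sterling)
  fmt_currency(columns = cost, currency = "GBP") |>
  # Format `share` (a proportion between 0 and 1) as a percentage
  fmt_percent(columns = share, decimals = 1)

example_table
```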
The region column only has two values: West Malaysia for states and territories in Peninsular Malaysia and East Malaysia for states on the island of Borneo. Rather than repeat these two values on every row of the table – which is a waste of space and makes the table more cluttered than necessary – we can instead group the rows according to these two regions and then only show the region names once at the top of each group.
gt() will automatically create group headings in a table if the data frame or tibble passed to gt() contains groups created by the group_by() function from the dplyr package. All we have to do is use group_by() to specify which column (in this case, region) contains the values that we should use to determine which group each row is in.
R Console
# Produce table of crime counts in each Malaysian state
violence |>
# Convert data to have one row per state
pivot_wider(names_from = crime_type, values_from = count) |>
# Convert new column names (the former values of `crime_type`) to snake case
janitor::clean_names() |>
# Choose only the columns we want to show in the table
select(
region, state, murder, rape, aggravated_assault, armed_robbery,
unarmed_robbery
) |>
# Specify the table rows should be grouped by the values of `region`
group_by(region) |>
# Functions from tidyverse above and functions from gt below
gt(rowname_col = "state") |>
# Format numbers with thousand separators and no decimals
fmt_number(columns = where(is.numeric), decimals = 0)

| | murder | rape | aggravated_assault | armed_robbery | unarmed_robbery |
|---|---|---|---|---|---|
| East Malaysia | | | | | |
| Sabah | 36 | 211 | 230 | 0 | 284 |
| Sarawak | 27 | 150 | 368 | 3 | 328 |
| West Malaysia | | | | | |
| Johor | 66 | 196 | 614 | 1 | 1,701 |
| Kedah | 21 | 119 | 364 | 2 | 490 |
| Kelantan | 13 | 114 | 252 | 2 | 219 |
| Kuala Lumpur | 37 | 132 | 651 | 4 | 3,175 |
| Melaka | 7 | 69 | 176 | 1 | 589 |
| Negeri Sembilan | 14 | 91 | 241 | 2 | 536 |
| Pahang | 16 | 163 | 188 | 1 | 288 |
| Perak | 35 | 95 | 380 | 4 | 626 |
| Perlis | 2 | 30 | 47 | 0 | 53 |
| Pulau Pinang | 17 | 80 | 275 | 0 | 706 |
| Selangor | 83 | 321 | 1,108 | 14 | 4,944 |
| Terengganu | 5 | 64 | 130 | 0 | 155 |
In tables containing lots of numbers it can be difficult to see patterns. One way to help readers to understand patterns is to map the numbers to an aesthetic property such as colour that people can easily see patterns in. To do this, we can colour the cells in a column according to the value of each cell using the data_color() function (note the spelling of ‘color’ in this function). To use data_color(), we specify the columns we want to shade using the columns argument and the colour palette we want to use using the palette argument.
In this example, we will only colour the values in two columns, so we will pass the column names to the columns argument.
The easiest way to specify a colour palette is to use one of the built-in colour palettes that the gt package understands automatically. These use the same colour palette names we have used in previous chapters when using functions such as scale_fill_distiller().
R Console
# Produce table of crime counts in each Malaysian state
violence |>
# Convert data to have one row per state
pivot_wider(names_from = crime_type, values_from = count) |>
# Convert new column names (the former values of `crime_type`) to snake case
janitor::clean_names() |>
# Choose only the columns we want to show in the table
select(
region, state, murder, rape, aggravated_assault, armed_robbery,
unarmed_robbery
) |>
# Specify the table rows should be grouped by the values of `region`
group_by(region) |>
# Functions from tidyverse above and functions from gt below
gt(rowname_col = "state") |>
# Format numbers with thousand separators and no decimals
fmt_number(columns = where(is.numeric), decimals = 0) |>
# Show distribution of values in some columns using colour
data_color(columns = unarmed_robbery, palette = "Oranges") |>
data_color(columns = rape, palette = "Blues")

| | murder | rape | aggravated_assault | armed_robbery | unarmed_robbery |
|---|---|---|---|---|---|
| East Malaysia | | | | | |
| Sabah | 36 | 211 | 230 | 0 | 284 |
| Sarawak | 27 | 150 | 368 | 3 | 328 |
| West Malaysia | | | | | |
| Johor | 66 | 196 | 614 | 1 | 1,701 |
| Kedah | 21 | 119 | 364 | 2 | 490 |
| Kelantan | 13 | 114 | 252 | 2 | 219 |
| Kuala Lumpur | 37 | 132 | 651 | 4 | 3,175 |
| Melaka | 7 | 69 | 176 | 1 | 589 |
| Negeri Sembilan | 14 | 91 | 241 | 2 | 536 |
| Pahang | 16 | 163 | 188 | 1 | 288 |
| Perak | 35 | 95 | 380 | 4 | 626 |
| Perlis | 2 | 30 | 47 | 0 | 53 |
| Pulau Pinang | 17 | 80 | 275 | 0 | 706 |
| Selangor | 83 | 321 | 1,108 | 14 | 4,944 |
| Terengganu | 5 | 64 | 130 | 0 | 155 |
In this table we use two different colours to show the patterns in the frequency of rape and unarmed robbery. This is because we want readers to remember that different types of crime are different and so comparisons that treat crimes as being equivalent to one another are likely to be flawed. If we used the same colour across columns, readers might end up seeing that the shade used for unarmed robberies in Kuala Lumpur was darker than the shade showing the number of rapes and conclude that unarmed robberies were a bigger problem than rapes. This would be a potentially false conclusion because a single rape and a single unarmed robbery are not the same in terms of their seriousness.
For the same reason the table does not include a column showing the total number of crimes in each state – when we total all types of crime together, we are implicitly assuming that all types of crime are the same when that is obviously untrue.
15.2.3 Changing column names
Now that we have formatted the data, we can move on to changing the column labels. At the moment these are taken from the column names in the data, which means we have column labels such as aggravated_assault. Underscore characters (_) aren’t standard in English text, so we should change the labels to remove them. We can do this by adding the cols_label() function to the gt() stack. As well as removing the underscores, we can also use cols_label() to abbreviate labels or split them over multiple lines so that the column labels don’t force the columns to be wider than necessary.
We can use the md() helper function to use Markdown formatting to control the appearance of the labels. As well as using markup such as asterisks to create **strongly emphasised text** we can also use HTML markup to add more-advanced formatting. For example, we can use the code <br> to insert a line break to split labels over multiple lines.
R Console
# Produce table of crime counts in each Malaysian state
violence |>
# Convert data to have one row per state
pivot_wider(names_from = crime_type, values_from = count) |>
# Convert new column names (the former values of `crime_type`) to snake case
janitor::clean_names() |>
# Choose only the columns we want to show in the table
select(
region, state, murder, rape, aggravated_assault, armed_robbery,
unarmed_robbery
) |>
# Specify the table rows should be grouped by the values of `region`
group_by(region) |>
# Functions from tidyverse above and functions from gt below
gt(rowname_col = "state") |>
# Format numbers with thousand separators and no decimals
fmt_number(columns = where(is.numeric), decimals = 0) |>
# Show distribution of values in some columns using colour
data_color(columns = unarmed_robbery, palette = "Oranges") |>
data_color(columns = rape, palette = "Blues") |>
# Add column labels
cols_label(
"aggravated_assault" ~ "agg. assault",
"armed_robbery" ~ md("robbery<br>(armed)"),
"unarmed_robbery" ~ md("robbery<br>(unarmed)")
)

| | murder | rape | agg. assault | robbery (armed) | robbery (unarmed) |
|---|---|---|---|---|---|
| East Malaysia | | | | | |
| Sabah | 36 | 211 | 230 | 0 | 284 |
| Sarawak | 27 | 150 | 368 | 3 | 328 |
| West Malaysia | | | | | |
| Johor | 66 | 196 | 614 | 1 | 1,701 |
| Kedah | 21 | 119 | 364 | 2 | 490 |
| Kelantan | 13 | 114 | 252 | 2 | 219 |
| Kuala Lumpur | 37 | 132 | 651 | 4 | 3,175 |
| Melaka | 7 | 69 | 176 | 1 | 589 |
| Negeri Sembilan | 14 | 91 | 241 | 2 | 536 |
| Pahang | 16 | 163 | 188 | 1 | 288 |
| Perak | 35 | 95 | 380 | 4 | 626 |
| Perlis | 2 | 30 | 47 | 0 | 53 |
| Pulau Pinang | 17 | 80 | 275 | 0 | 706 |
| Selangor | 83 | 321 | 1,108 | 14 | 4,944 |
| Terengganu | 5 | 64 | 130 | 0 | 155 |
15.2.4 Adding summary rows
The final thing we will add to this table is a summary row containing the total number of each type of crime across all the states and territories. We do this using the summary_rows() function from gt. We specify the columns we want to summarise using the columns argument as we did for fmt_number().
Summary rows can be produced using lots of different R functions. For example, we could use the mean() function to produce a summary row showing the mean (average) number of crimes of each type across the states. In this case, we want to know the total number of each type of crime across all states, so we will use the sum() function. To specify this, we pass the fns argument to summary_rows(). The fns argument has two parts, separated by a tilde (~). On the left-hand side we specify the label we want the summary row to have, and on the right-hand side we specify the function we want to use to calculate the summary. In this case, we can specify fns = "regional total" ~ sum(.) to say we want the summary row to have the label ‘regional total’ and that we want to summarise the rows using the sum() function. The . in the code fns = "regional total" ~ sum(.) is a place-holder that represents the data we want to summarise.
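To see what sum() and mean() would each produce for a group of rows, the sketch below calculates the same summaries with dplyr rather than gt. The tibble violence_counts is a hypothetical stand-in containing the two East Malaysia states and, to keep the example short, only two of the West Malaysia states, so the West Malaysia figures here will not match the full table.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in: counts copied from the tables in this chapter
violence_counts <- tribble(
  ~region,         ~state,    ~murder, ~rape,
  "East Malaysia", "Sabah",   36,      211,
  "East Malaysia", "Sarawak", 27,      150,
  "West Malaysia", "Perlis",  2,       30,
  "West Malaysia", "Melaka",  7,       69
)

regional_summary <- violence_counts |>
  group_by(region) |>
  # For every numeric column, calculate both the total and the mean
  summarise(
    across(where(is.numeric), list(total = sum, mean = mean)),
    .groups = "drop"
  )

regional_summary
```

The murder_total and rape_total values for East Malaysia are the figures that a ‘regional total’ summary row shows for that region.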
As well as summarising the data in each column, we want to specify how the summary values should be formatted. To do that, we use the fmt argument of summary_rows(). This is also a two-sided (‘formula’) argument, with the two sides separated by a ~. On the left-hand side we specify which summary values we want to format. In this case we want all the summary values to be formatted as numbers, so we can use the everything() helper function. On the right-hand side we use a call to one of the fmt_*() family of functions we used earlier: in this case, we use fmt_number(). Looking at the code below, you’ll notice that the code fmt = everything() ~ fmt_number(., decimals = 0) again uses the . place-holder to specify that we want to format the summary value produced by sum().
The summary_rows() function produces a summary for each group of rows (in the case of this table, one summary for each region). As well as having a regional total, it would also be useful to have a total for all the groups together (i.e. for the whole country of Malaysia). To do that, we add the grand_summary_rows() function to our gt() stack, using the same arguments as for the summary_rows() function.
Paste this code into the chapter15a.R file and run it.
chapter15a.R
# Produce table of crime counts in each Malaysian state
violence |>
# Convert data to have one row per state
pivot_wider(names_from = crime_type, values_from = count) |>
# Convert new column names (the former values of `crime_type`) to snake case
janitor::clean_names() |>
# Choose only the columns we want to show in the table
select(
region, state, murder, rape, aggravated_assault, armed_robbery,
unarmed_robbery
) |>
# Specify the table rows should be grouped by the values of `region`
group_by(region) |>
# Functions from tidyverse above and functions from gt below
gt(rowname_col = "state") |>
# Format numbers with thousand separators and no decimals
fmt_number(columns = where(is.numeric), decimals = 0) |>
# Show distribution of values in some columns using colour
data_color(columns = unarmed_robbery, palette = "Oranges") |>
data_color(columns = rape, palette = "Blues") |>
# Add column labels
cols_label(
"aggravated_assault" ~ "agg. assault",
"armed_robbery" ~ md("robbery<br>(armed)"),
"unarmed_robbery" ~ md("robbery<br>(unarmed)")
) |>
# Add a summary row showing the total number of crimes in each region
summary_rows(
columns = where(is.numeric),
fns = "regional total" ~ sum(.),
fmt = everything() ~ fmt_number(., decimals = 0)
) |>
# Add a summary row showing the total number of crimes in Malaysia
grand_summary_rows(
columns = where(is.numeric),
fns = "national total" ~ sum(.),
fmt = everything() ~ fmt_number(., decimals = 0)
)

| | murder | rape | agg. assault | robbery (armed) | robbery (unarmed) |
|---|---|---|---|---|---|
| East Malaysia | | | | | |
| Sabah | 36 | 211 | 230 | 0 | 284 |
| Sarawak | 27 | 150 | 368 | 3 | 328 |
| regional total | 63 | 361 | 598 | 3 | 612 |
| West Malaysia | | | | | |
| Johor | 66 | 196 | 614 | 1 | 1,701 |
| Kedah | 21 | 119 | 364 | 2 | 490 |
| Kelantan | 13 | 114 | 252 | 2 | 219 |
| Kuala Lumpur | 37 | 132 | 651 | 4 | 3,175 |
| Melaka | 7 | 69 | 176 | 1 | 589 |
| Negeri Sembilan | 14 | 91 | 241 | 2 | 536 |
| Pahang | 16 | 163 | 188 | 1 | 288 |
| Perak | 35 | 95 | 380 | 4 | 626 |
| Perlis | 2 | 30 | 47 | 0 | 53 |
| Pulau Pinang | 17 | 80 | 275 | 0 | 706 |
| Selangor | 83 | 321 | 1,108 | 14 | 4,944 |
| Terengganu | 5 | 64 | 130 | 0 | 155 |
| regional total | 316 | 1,474 | 4,426 | 31 | 13,482 |
| national total | 379 | 1,835 | 5,024 | 34 | 14,094 |
Tables are good for showing detailed information, particularly when we want to present multiple pieces of information about a single place. But it can be hard to spot patterns in tables even with coloured cells. For this reason, do not use tables when you are primarily trying to show the relationship between two or more variables. In the next section, we will learn to create bar charts in R to show patterns more effectively.
In which of these circumstances is a table typically more effective than a map?
Why is it often preferable to present summary data in a wide format?
Which function is used in R to convert long-format data into wide-format data?
15.3 Bar charts
Bar charts are useful for showing values of one continuous variable (e.g. a count of crimes) for each value of one categorical variable (e.g. states of a country). Bar charts are very common, but there are several things we can do to make them more useful. In this section we will learn how to construct a good bar chart.
You’re already an expert at making maps using functions from the ggplot2 package. We can use these same functions to create many other types of graphics. For example, we can use geom_bar() to create bar charts just as we use geom_sf() to create a map using data stored in an SF object.
geom_bar() calculates the length of each bar on a chart by counting the number of rows of data in each category. This isn’t what we want to do to visualise the violence object, since the data provided by the Royal Malaysian Police are already in the form of counts of crimes. Instead, we will use the geom_col() function, which creates bar charts from this type of summary data.
To create a simple bar chart, we will work with the original (long-format) data and filter it to show only the number of murders in each state.
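A minimal sketch of such a chart is shown here. It assumes a small stand-in for the violence object (the hypothetical violence_sample, with counts copied from the tables above) and assumes that state names are mapped to the x aesthetic and counts to the y aesthetic.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Hypothetical stand-in for `violence`, in long format
violence_sample <- tribble(
  ~state,    ~crime_type, ~count,
  "Sabah",   "murder",    36,
  "Sabah",   "rape",      211,
  "Sarawak", "murder",    27,
  "Sarawak", "rape",      150,
  "Johor",   "murder",    66,
  "Johor",   "rape",      196
)

basic_chart <- violence_sample |>
  # Keep only rows representing murder counts
  filter(crime_type == "murder") |>
  ggplot() +
  # Translate columns in the data to aesthetics on the chart
  aes(x = state, y = count) +
  # Add one bar per state, with length given by the value of `count`
  geom_col()

basic_chart
```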
R Console

You might notice that this code uses the aes() function differently to what we’ve seen in previous chapters. As we learned in Chapter 6, aes() is used to specify which aspects of a map or chart should be controlled by the values of particular columns in a dataset. When we use aes() inside a geom_*() function (as we have done with geom_sf() in previous chapters), the mapping between columns in a dataset and aspects of the map or chart appearance applies only to that layer. In a map, that is usually what we want because there is typically only one layer (e.g. a layer showing the density of crime) that represents values in a dataset. But we can also use aes() outside a geom_*() function by adding it to the stack directly after the call to ggplot() itself. In that case, the mapping between data and chart will apply to all the layers on the chart.
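The two placements of aes() can be compared directly. In this sketch, the tibble murders is a hypothetical three-row stand-in using counts from the tables above; with only one layer on the chart, both versions should draw the same bars.

```r
library(ggplot2)
library(tibble)

# Hypothetical stand-in: murder counts for three states
murders <- tribble(
  ~state,    ~count,
  "Sabah",   36,
  "Sarawak", 27,
  "Johor",   66
)

# Mapping added to the stack after ggplot(): every layer inherits it
chart_global <- ggplot(murders) +
  aes(x = state, y = count) +
  geom_col()

# Mapping specified inside the geom: it applies to that layer only
chart_layer <- ggplot(murders) +
  geom_col(aes(x = state, y = count))

# Compare the bar heights calculated for each version of the chart
identical(layer_data(chart_global)$y, layer_data(chart_layer)$y)
```

The difference only matters once a chart has several layers: a mapping added after ggplot() is shared by all of them, whereas a mapping inside a geom_*() function has to be repeated for each layer that needs it.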
We can improve this basic chart in several ways:
- We can switch the order of the variables used for the x and y aesthetics so that the bars are horizontal rather than vertical, which will stop the state names from overlapping. It is almost always better for bar charts to use horizontal bars rather than vertical bars, to avoid overlapping labels.
- We can use labs() to add a title and caption, as well as controlling the x and y axis titles on the chart. In the code below, we set y = NULL to remove the title for the vertical axis, since a title is unnecessary when it is obvious from the context what the values on that axis represent (Malaysian states).
- We can reduce the visual clutter in the chart using theme_minimal(). While theme_void() is generally the best ggplot2 theme for maps, for charts it’s almost always best to use theme_minimal().
R Console
# Create bar chart of murder counts
violence |>
# Keep only rows representing murder counts
filter(crime_type == "murder") |>
ggplot() +
# Translate columns in the data to aesthetics on the chart
aes(x = count, y = state) +
# Add bars
geom_col() +
# Add labels
labs(
title = "Murders in Malaysian states, 2017",
caption = "Data from the Royal Malaysian Police",
x = "number of murders",
y = NULL
) +
# Use a minimal theme to reduce visual clutter
theme_minimal()
This chart is better, but we can improve it further. For example, we can reduce the space between the state names and the bars by setting the expand argument to the scale_x_continuous() function. scale_x_continuous() works in a similar way to the other scale functions we have used already, such as using scale_fill_brewer() to control the colour of areas in a choropleth map.
Although we are trying to reduce the gap between the bars and labels on the y axis, we use a function that changes the x axis. This is because the space we are reducing is created by R adding, by default, some space to each end of any continuous axis, such as the count of murders.
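We can specify the space at the end of each axis using the helper function expansion(). In this case we just want to remove the space completely, so we could set expand = expansion(0). The code below uses the equivalent shorthand expand = c(0, 0), which also sets both the multiplicative and the additive padding to zero.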
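To see what expansion() produces, the following sketch prints the values it returns and then uses it in a chart. The data frame `toy` is hypothetical, and keeping 5% of padding at the upper end is only one possible choice.

```r
library(ggplot2)

# expansion() returns the padding to add at each end of an axis: `mult` is a
# proportion of the axis range and `add` is an amount in data units
no_padding <- expansion(mult = 0)
no_padding

# Remove the padding at the lower end only, keeping 5% at the upper end
upper_padding <- expansion(mult = c(0, 0.05))
upper_padding

# Hypothetical data, used only to show the helper inside a scale function
toy <- data.frame(category = c("a", "b"), value = c(10, 20))

padded_chart <- ggplot(toy) +
  aes(x = value, y = category) +
  geom_col() +
  scale_x_continuous(expand = upper_padding)
```

Keeping a little space at the upper end can stop the longest bar from touching the edge of the plot, while still removing the gap between the bars and the category labels.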
At the same time, we can also remove the grid lines on the y axis (i.e. those running along the length of the bars) since they don’t really make it any easier to understand the chart. As a general principle, we want to remove anything on a chart that does not contribute to communicating information, since unnecessary chart elements can distract readers from understanding the data.
We can remove the grid lines by setting the panel.grid.major.y and panel.grid.minor.y arguments to the theme() function. The value we want to use is the helper function element_blank(), which sets the grid lines to be blank.
R Console
# Create bar chart of murder counts
violence |>
# Keep only rows representing murder counts
filter(crime_type == "murder") |>
ggplot() +
# Translate columns in the data to aesthetics on the chart
aes(x = count, y = state) +
# Add bars
geom_col() +
# Remove space at either end of horizontal axis
scale_x_continuous(expand = c(0, 0)) +
# Add labels
labs(
title = "Murders in Malaysian states, 2017",
caption = "Data from the Royal Malaysian Police",
x = "number of murders",
y = NULL
) +
# Use a minimal theme to reduce visual clutter
theme_minimal() +
# Remove unnecessary horizontal grid lines
theme(
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank()
)
One of the reasons why bar charts are easy to interpret is that the length of each bar directly corresponds to the relative size of that particular value. But this direct relationship between bar length and value only applies if the bars start at zero. If you create a bar chart in which the bars don’t start at zero, readers are likely to be misled, so remember that bar charts should always start at zero. But don’t worry – ggplot() will handle this for you automatically.
15.3.1 Ordering bar charts by value
If you were trying to find the three Malaysian states or territories with the most murders from this chart, it would be pretty easy to see that Selangor had the most murders, followed by Johor. But at a glance, it’s not so easy to see which state or territory comes third. We can make this easier to see by changing the order of the bars from the default alphabetical order to an order based on how many murders there were.
To do this, we need to convert the state column in the data to a new type of variable: a factor. Factors are what R calls categorical variables that have a defined set of possible values. For example, a factor recording if a person was under or over 18 might have two possible values: ‘adult’ and ‘child’.
One of the benefits of storing a variable as a factor is that we can specify an order for the categories. This is useful for categories that have a meaningful order, such as ‘bad’, ‘acceptable’, ‘good’, ‘excellent’. But we can also use this feature of factors to specify that values should appear in a particular order in any charts produced from the data, whatever the order of the values in the data itself.
To work with factors in R we can use the forcats package, so-called because it’s for working with categories. forcats is loaded as part of tidyverse, so we don’t need to load it separately.
All the functions in the forcats package start with the letters fct_, just as all the functions in the SF package start st_. For our bar chart, we will use the fct_reorder() function. This takes a factor or character variable (such as the names of the Malaysian states and territories) and sets the order of the categories according to the values of a numeric variable (such as the number of murders in a state). So to re-order the state variable according to the count of murders, we can use fct_reorder(state, count). Since we’re changing an existing variable, we will do this inside a call to the mutate() function.
R Console
# Create bar chart of murder counts
violence |>
# Keep only rows representing murder counts
filter(crime_type == "murder") |>
# Re-order states according to number of murders
mutate(state = fct_reorder(state, count)) |>
ggplot() +
# Translate columns in the data to aesthetics on the chart
aes(x = count, y = state) +
# Add bars
geom_col() +
# Remove space at either end of horizontal axis
scale_x_continuous(expand = c(0, 0)) +
# Add labels
labs(
title = "Murders in Malaysian states, 2017",
caption = "Data from the Royal Malaysian Police",
x = "number of murders",
y = NULL
) +
# Use a minimal theme to reduce visual clutter
theme_minimal() +
theme(
# Remove unnecessary horizontal grid lines
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank()
)
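To see what fct_reorder() does to the order of the categories, it can help to try it on a few values outside a chart. The vectors `area` and `crimes` in this sketch are made up for illustration.

```r
library(forcats)

# Hypothetical categories and values (not the Malaysian data)
area <- c("North", "South", "East")
crimes <- c(30, 10, 20)

# Without re-ordering, the categories of a factor are in alphabetical order
levels(factor(area))

# fct_reorder() sorts the categories by the values of the numeric variable,
# from smallest to largest
area_ordered <- fct_reorder(area, crimes)
levels(area_ordered)

# Setting .desc = TRUE reverses the order
area_descending <- fct_reorder(area, crimes, .desc = TRUE)
levels(area_descending)
```

Because ggplot2 places the first category at the bottom of a vertical axis, the default smallest-to-largest order puts the longest bar at the top of a horizontal bar chart.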
15.3.2 Colour in bar charts
We can further improve our bar chart by using colour to indicate which states are in which of the two regions of Malaysia. To do this, we will:
- specify in the call to aes() that the fill colour of the bars should be controlled by the region column in the data,
- specify in the call to labs() that we don’t want the legend to have a title, since the meaning is obvious from the values ‘East Malaysia’ and ‘West Malaysia’, and
- specify in the call to theme() that we would like the legend to use up some of the empty space in the bottom-right corner of the chart, rather than making the chart smaller to give space for the legend on the right-hand side.
To move the legend, we need to specify several different arguments in the theme() function. legend.position determines where around the plot the legend should be placed: ‘top’, ‘right’, ‘bottom’, ‘left’ or ‘inside’. In this case we want the legend to appear in some spare space on the plot itself, so we will set legend.position = "inside". We then use the legend.position.inside argument to theme() to specify exactly where inside the plot we want the legend to appear. We do that by specifying where the legend should appear horizontally and vertically, as a proportion of the axis length, on a scale from zero to one.
Using this specification, we can place the legend in the right-most point on the horizontal axis and the bottom-most point on the vertical axis by specifying legend.position.inside = c(1, 0).
legend.position.inside sets the anchor point from which the legend is created, with the actual size of the legend depending on how much space is required by its contents. By default, a legend will spread out in all directions from the anchor point, i.e. the legend will be horizontally and vertically centred on the anchor point. As we have positioned the legend in a corner of the plot, this is probably not what we want since some of the legend will be hidden outside the plot area. Instead, we can set the legend.justification argument of theme() using a similar specification to that for legend.position.inside, based on which way we want the legend to grow.

If you want the legend to grow ‘inwards’ from a corner, just set legend.justification to the same value as you used for legend.position.inside. In this case, we want the legend to be anchored in the bottom-right corner and to grow inwards from it, so we will set both arguments to c(1, 0).
Add this code to your script file and run it.
R Console
# Create bar chart of murder counts
malaysia_murder_bar_chart <- violence |>
# Keep only rows representing murder counts
filter(crime_type == "murder") |>
# Re-order states according to number of murders
mutate(state = fct_reorder(state, count)) |>
ggplot() +
# Translate columns in the data to aesthetics on the chart
aes(x = count, y = state, fill = region) +
# Add bars
geom_col() +
# Remove space at either end of horizontal axis
scale_x_continuous(expand = c(0, 0)) +
# Add labels
labs(
title = "Murders in Malaysian states, 2017",
caption = "Data from the Royal Malaysian Police",
x = "number of murders",
y = NULL
) +
# Use a minimal theme to reduce visual clutter
theme_minimal() +
theme(
# Move legend to bottom-right corner and give it a solid white background
legend.background = element_rect(colour = NA, fill = "white"),
legend.justification = c(1, 0),
legend.position = "inside",
legend.position.inside = c(1, 0),
# Remove unnecessary horizontal grid lines
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank()
)
One issue with this chart is that the default colours that ggplot() produces are not easy for everyone to discern. In particular, people with colour blindness may struggle to distinguish between some combinations of colours. Some colour combinations are also hard (or impossible) to distinguish even for people with normal colour vision if a chart is printed in black and white or viewed on a screen in some lighting conditions.
We can check how well people with different colour vision will be able to read a chart using the cvd_grid() function from the colorblindr package. This function takes an existing ggplot() stack and prints several versions of the chart that simulate how different people will see it.
The colorblindr package is not on CRAN, the repository we usually install R packages from. That means we need to use slightly different code to install it. Instead of installing from CRAN, we will instead install the package from GitHub, a website that programmers use to store versions of their code. To install packages from GitHub we can use the p_install_gh() function from the pacman package (the same package we use to load packages at the start of each R script).
Remember that because we only need to install a package once on each computer we use R on, you should never install packages inside an R script. This means you should only ever run pacman::p_install_gh() in the R Console, never in an R script.
Once you’ve installed the colorblindr package, you can use it to check how different people are likely to see the chart of murder in Malaysia.
R Console

From this we can see that this combination of colours works well for people with some types of colour blindness, but is likely to be hard to read for people with other types, and indeed for everyone if the chart is printed on a black-and-white printer.
Fortunately, there are lots of different R packages that provide colour palettes that are suitable for people with different colour vision. The paletteer package brings a lot of these colour palettes together in one place. paletteer provides three pairs of functions for different types of colour scale:
- scale_colour_paletteer_c() / scale_fill_paletteer_c() for continuous scales that are suitable for representing continuous variables.
- scale_colour_paletteer_d() / scale_fill_paletteer_d() for discrete scales that are suitable for representing categorical variables.
- scale_colour_paletteer_binned() / scale_fill_paletteer_binned() for binned scales that are suitable for showing continuous variables that have been sub-divided (‘binned’) into ordered categories.
We can use each of these functions in the same way we have used functions like scale_fill_distiller() in previous chapters. The first argument to all the main functions in the paletteer package is the name of a colour palette, of which a total of 2,759 different palettes are available. We can specify which palette to use using the same syntax we have sometimes used to refer to R functions: package_name::palette_name. For example, to create a discrete colour scale using the OKeeffe2 palette from the MetBrewer package (a palette inspired by the painting Red and Yellow Cliffs by Georgia O’Keeffe) we could use scale_colour_paletteer_d("MetBrewer::OKeeffe2").
There are many useful colour palettes in the PrettyCols package. Let’s use the Bright palette from this package to control the colours on the chart.
chapter15a.R
# Create bar chart of murder counts
malaysia_murder_bar_chart <- violence |>
# Keep only rows representing murder counts
filter(crime_type == "murder") |>
# Re-order states according to number of murders
mutate(state = fct_reorder(state, count)) |>
ggplot() +
# Translate columns in the data to aesthetics on the chart
aes(x = count, y = state, fill = region) +
# Add bars
geom_col() +
# Remove space at either end of horizontal axis
scale_x_continuous(expand = c(0, 0)) +
# Specify colour-blind-safe fill colours
scale_fill_paletteer_d("PrettyCols::Bright") +
# Add labels
labs(
title = "Murders in Malaysian states, 2017",
caption = "Data from the Royal Malaysian Police",
x = "number of murders",
y = NULL
) +
# Use a minimal theme to reduce visual clutter
theme_minimal() +
theme(
# Move legend to bottom-right corner and give it a solid white background
legend.background = element_rect(colour = NA, fill = "white"),
legend.justification = c(1, 0),
legend.position = "inside",
legend.position.inside = c(1, 0),
# Remove unnecessary horizontal grid lines
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank()
)
To see if these colours are likely to work for different people, we can again check the colours with the colorblindr package:
R Console

From this, we can see that this colour palette is going to be much more useful for people with different colour vision, as well as for everyone if the chart is printed in black and white or viewed on a screen in bad light.
There are lots of other ways to control colours on charts and maps in R. For more detail, read Working with colours in R.
Bar charts are a very common way of presenting a numeric variable for each value of a categorical variable. Bar charts are easy to interpret, even for people who are not used to interpreting charts or who only have time to look at the chart for a few seconds.
When creating a bar chart, why might you want to sort the bars by value?
What is one reason why a reference map might be included alongside a bar chart?
Why might a bar chart be more effective than a choropleth map for presenting crime data?
15.4 Visualising distributions
Bar charts show a single piece of information about each category present in a dataset. So we might use a bar chart to show, for example, the average number of burglaries in neighbourhoods in different districts. But what if the average values masked substantial differences in the number of burglaries within each district? Averages often mask variation, and can sometimes be misleading as a result. In those circumstances it would be better to show more detail rather than a misleading average.
Let’s start with the simple example of showing the distribution of burglary counts within a single district. Restart R (Session > Restart R), open a new R script file and save it as chapter15b.R. Use this code to load a dataset of burglaries in each lower-layer super output area (LSOA) in Northamptonshire in England in 2020.
To show the distribution of burglary counts we can create a histogram using geom_histogram(). A histogram divides the range of values present in the data into a number of equally sized bins, then shows bars representing the number of observations (rows) in the data that have values fitting into each bin. We can either allow geom_histogram() to choose the bins automatically, or control the width of each bin ourselves with the binwidth argument.
We will set binwidth = 1 in the call to geom_histogram() so that each bar on the chart will show how many LSOAs have each individual value. We will also add some labels to help readers interpret the chart.
chapter15b.R
# Create histogram of burglary counts
burglary |>
ggplot() +
# Specify which column in the data contains the value we're interested in
aes(x = count) +
# Add a histogram to the chart
geom_histogram(binwidth = 1) +
# Remove unnecessary space at either end of both axes
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
# Add labels
labs(
title = "Number of burglaries in Northamptonshire neighbourhoods",
x = "count of burglaries, 2020",
y = "number of LSOAs"
) +
theme_minimal() +
theme(
# Remove unnecessary vertical grid lines
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()
)
Interpreting histograms can be slightly counter-intuitive. The important thing to remember is that the horizontal position of each bar represents a number of burglaries and the vertical height of the bar represents how many neighbourhoods had that number of burglaries. This is the opposite of a bar chart, where the length of the bar represents the number of burglaries.
In this chart, we can see that most LSOAs had only a few burglaries in 2020 (represented by the tallest bars, which are to the left of the chart), while a few LSOAs had a larger number (the bars to the right of the chart). This is what we would expect, since we know that crimes are generally concentrated in a few places.
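One way to check how a histogram with binwidth = 1 relates to the underlying data is to extract the bins that ggplot2 calculates and compare them with a direct count of each value. This sketch uses a made-up data frame `toy` rather than the burglary data.

```r
library(ggplot2)

# Hypothetical burglary counts for seven neighbourhoods
toy <- data.frame(count = c(0, 0, 1, 2, 2, 2, 5))

toy_histogram <- ggplot(toy) +
  aes(x = count) +
  geom_histogram(binwidth = 1)

# Extract the bins that ggplot2 calculated: `x` is the centre of each bin and
# `count` is the number of rows falling into that bin
bins <- layer_data(toy_histogram)[, c("x", "count")]
bins

# For comparison, count the rows with each value directly
table(toy$count)
```

Each row of `bins` corresponds to one bar on the chart, so the horizontal position is a number of burglaries and the height is a number of neighbourhoods.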
15.4.1 Plotting density curves
Histograms are one way to show the distribution of a variable (in this case, the count of burglaries). Another way to show the distribution of a variable is to create a density curve with geom_density(). A density curve is a smoothed version of a histogram, which is useful to show the general distribution of a variable (in this case, the number of LSOAs with different numbers of burglaries) at the cost of not showing the exact data. The mathematical procedure used by geom_density() to calculate a density curve is the same as the kernel-density estimation process we have already learned to use to show concentrations of crime on a map.
chapter15b.R
# Create a density plot of burglaries
burglary |>
ggplot() +
# Specify which column in the data contains the value we're interested in
aes(x = count) +
# Add a density curve to the chart
geom_density(colour = "red", linewidth = 1) +
# Remove unnecessary space at either end of both axes
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(labels = scales::label_percent()) +
# Add labels
labs(
title = "Number of burglaries in Northamptonshire neighbourhoods",
x = "count of burglaries, 2020",
y = "percentage of LSOAs"
) +
theme_minimal() +
theme(
# Remove unnecessary vertical grid lines
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank()
)
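The smoothing behind a density curve can be explored away from a chart using the density() function from base R. The sketch below uses a made-up vector of counts, and the bandwidth of 3 is an arbitrary value chosen only to show the effect of more smoothing.

```r
# Hypothetical burglary counts for ten neighbourhoods
counts <- c(0, 1, 1, 2, 2, 2, 3, 4, 6, 12)

# density() places a small kernel on each value and adds the kernels together,
# which is kernel-density estimation in one dimension
curve_default <- density(counts)

# A larger bandwidth gives a smoother, flatter curve
curve_smooth <- density(counts, bw = 3)

# The bandwidth chosen automatically, then the height of the peak of each curve
curve_default$bw
max(curve_default$y)
max(curve_smooth$y)
```

As with KDE on a map, a larger bandwidth spreads each observation over a wider range of values, so the peak of the curve becomes lower and fine detail is lost.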
We can use density curves to show the distribution of a variable across multiple categories at once. For example, we could show the distribution of burglary counts at the neighbourhood level for all the districts in Northamptonshire. To do this we use the geom_density_ridges() function from the ggridges package to create a ridge plot. Although this function does not come from the ggplot2 package, it is designed to be used inside a ggplot() stack.
chapter15b.R
# Add ridge plot of Northamptonshire burglary
burglary |>
# Wrap the district names by replacing any space in a name with a new-line
mutate(district = str_replace_all(district, "\\s", "\n")) |>
ggplot() +
# Specify which columns in the data contain the values we're interested in
aes(x = count, y = district) +
# Add ridge plot
ggridges::geom_density_ridges() +
# Remove unnecessary space at either end of x axis
scale_x_continuous(expand = c(0, 0)) +
# Add labels
labs(
title = "Number of burglaries in Northamptonshire neighbourhoods",
x = "count of burglaries, 2020",
y = NULL
) +
theme_minimal()
Picking joint bandwidth of 1.98

What does Picking joint bandwidth of 1.98 mean?
As you know from previous chapters, density estimation depends on us choosing a bandwidth to control the degree of smoothing between data points. By default, geom_density_ridges() chooses a suitable bandwidth automatically and reports this in a message. The bandwidth is referred to as ‘joint’ because the same bandwidth is used for all the density curves on a chart.
If you wanted to include a ridge plot in a Quarto document, you would probably not want this message to appear in your report. To suppress the message, you can use the Quarto chunk option #| message: false.
The ridge plot shows the distribution of burglary counts in LSOAs within each district, with the distributions overlapping slightly to save space. From this we can see that across all districts most LSOAs have few burglaries, with a small number of LSOAs having more. We can also see there are a small number of LSOAs (probably, in fact, just one LSOA) in Wellingborough district with a much higher number of burglaries than anywhere else in Northamptonshire.
15.4.2 Small-multiple charts
Density plots can be helpful to summarise a lot of information, but they have some disadvantages. In particular, they don’t show that the number of LSOAs in each district is quite different: there are 131 LSOAs in Northampton but only 41 LSOAs in Corby. To make this clearer we can instead produce several histograms, one for each district – what are called small-multiple charts.
We could create small-multiple charts by producing a separate histogram for each district and then combine them with the patchwork package, but that would involve a lot of repeated code. Fortunately, we can use a feature of the ggplot2 package called faceting to split our single histogram into multiple plots based on a column in the data (in this case, the district name).
Adding facet_wrap() to a ggplot() stack will cause R to create multiple plots and wrap them across multiple rows and columns so that they approximately fit into the available space. If we only want the small multiples to appear on top of each other (i.e. in multiple rows) or next to each other (i.e. in multiple columns), we can use the facet_grid() function. In this case we want the small multiples to appear on top of each other, so we will use facet_grid() and say that the small multiples (which ggplot2 calls facets) should be based on the district column in the data by specifying rows = vars(district) (it is necessary to wrap the name of the column that you want to use as the basis of the small multiples in the vars() helper function, but we do not need to go into why).
chapter15b.R
# Create histogram of burglary counts
burglary |>
ggplot() +
# Specify which column in the data contains the value we're interested in
aes(x = count) +
# Add a histogram to the chart
geom_histogram(binwidth = 1) +
# Split chart into small multiples, one for each district
facet_grid(rows = vars(district), labeller = label_wrap_gen(width = 10)) +
# Remove unnecessary space at either end of both axes
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
# Add labels
labs(
title = "Number of burglaries in Northamptonshire neighbourhoods",
x = "count of burglaries, 2020",
y = "number of LSOAs"
) +
theme_minimal() +
theme(
# Remove unnecessary vertical grid lines
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
# Control alignment of facet titles
strip.text.y = element_text(angle = 0, hjust = 0)
)
You might have noticed we made some other changes to our code for this chart to make it clearer:
- Wrapped the facet labels using the label_wrap_gen() helper function so that some of the longer district names don’t take up too much space horizontally.
- Made the facet labels easier to read by making the text horizontal (rather than the default vertical text) using the strip.text.y argument to theme() and the element_text() helper function. angle sets the rotation of the text (or in this case, specifies that there should be no rotation) and hjust = 0 specifies that the text should be left aligned.
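For comparison, the sketch below builds the same small histogram with both faceting functions. The data frame `toy` and its district names are made up, and ncol = 2 is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical data: counts for neighbourhoods in three made-up districts
toy <- data.frame(
  district = rep(c("District A", "District B", "District C"), each = 4),
  count = c(0, 1, 1, 3, 0, 0, 2, 5, 1, 2, 2, 4)
)

toy_histogram <- ggplot(toy) +
  aes(x = count) +
  geom_histogram(binwidth = 1)

# facet_grid() with rows = vars(...) stacks one facet per district vertically,
# so all the facets share the same horizontal axis
stacked <- toy_histogram + facet_grid(rows = vars(district))

# facet_wrap() instead wraps the facets across rows and columns
wrapped <- toy_histogram + facet_wrap(vars(district), ncol = 2)
```

Stacking the facets in a single column makes it easier to compare the horizontal position of bars between districts, which is usually what matters when comparing distributions.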
There are many more-technical ways to show distributions, such as box plots or violin plots. However, these can be difficult to interpret for people who are not used to looking at those particular types of chart, so they should probably be avoided for communicating with general audiences.
15.5 Comparing continuous variables
So far we have used bar charts to communicate a single number (in our example, a number of murders) for each value of a categorical variable (the name of each Malaysian state or territory), and histograms to show multiple numbers (burglary counts for each neighbourhood) for each value of a categorical variable (districts in Northamptonshire).
Both these types of chart compare a numeric variable to a categorical one. But sometimes we may want to compare two continuous variables. We can do this with a scatter plot.
Restart R (Session > Restart R), open a new R script file and save it as chapter15c.R. Use this code to load rates of thefts of and from motor vehicles per 1,000 households saying they own a vehicle for a selection of municipalities in South Africa.
Since thefts of vehicles and thefts from vehicles are different but related crimes, we might want to see if there is a relationship between the rates of each type.
To create a ggplot() scatter plot we use geom_point(), the same function we previously used to create point maps. This makes sense, since point maps are a specialised type of scatter plot in which the x and y axes of the chart show the latitude and longitude or easting and northing of each crime location.
The data in the vehicle_theft object are in long format, with each row representing the rate of crime in a particular category for a particular municipality. To make a scatter plot where each point represents a municipality, we need to have all the data for a municipality in a single row of data, so we will need to transform the data with pivot_wider() (as we did for some of the tables at the start of this chapter). This converts the values of the crime_category column into column names. Since these names are quite long, we first convert them to snake case using janitor::clean_names() and then use rename() to shorten the names so they are easier to work with.
Add this code to the R script file.
We can now make a basic scatter plot.
R Console
# Create scatter plot of vehicle theft
ggplot(vehicle_theft_wider) +
# Specify which columns in the data should control the x and y positions of
# each point
aes(x = theft_of, y = theft_from) +
# Add the points
geom_point() +
# Specify the format for the axis labels
scale_x_continuous(labels = scales::comma_format()) +
scale_y_continuous(labels = scales::comma_format()) +
# Add labels
labs(
title = "Vehicle thefts in South African municipalities",
subtitle = "each dot represents one municipality, 2018-19",
x = "rate of thefts of motor vehicles per 1,000 vehicle-owning households",
y = "rate of thefts from motor vehicles per 1,000 vehicle-owning households"
) +
theme_minimal()
From this plot we can see that most areas have low rates of both theft of and theft from motor vehicles, with a few areas having very high rates of one type or the other (but none having high rates of both).
Looking at the bottom-left corner of the chart we can see that we have again encountered the problem of overlapping points making patterns less clear. We can try to deal with this by making the points semi-transparent using the alpha argument to geom_point().
Scatter plots can be hard for people to interpret, especially if they are not used to interpreting charts. To help readers, we can annotate the plot to show how to interpret each region of the chart. We will add two types of annotation: lines to show the median value on each axis, and labels to help interpretation.
We can add median lines using the geom_hline() and geom_vline() functions, which add horizontal and vertical lines to plots. We will add these to the ggplot() stack before geom_point() so that the lines appear behind the points.
To add text annotations we use the annotate() function from ggplot2, which allows us to add data to a chart by specifying the aesthetics (x and y position, etc.) directly rather than by referencing columns in the data. To add a text annotation, we set the geom argument of annotate() to "text".
R Console
# Create scatter plot of vehicle theft
ggplot(vehicle_theft_wider) +
# Specify which columns in the data should control the x and y positions of
# each point
aes(x = theft_of, y = theft_from) +
# Add vertical line showing median rate of theft of a vehicle
geom_vline(
xintercept = median(pull(vehicle_theft_wider, "theft_of")),
linetype = "22"
) +
# Add horizontal line showing median rate of theft from a vehicle
geom_hline(
yintercept = median(pull(vehicle_theft_wider, "theft_from")),
linetype = "22"
) +
# Add points
geom_point(alpha = 0.5) +
# Add annotations to aid interpretation
annotate(
geom = "text",
x = 20,
y = 0,
label = "high rate of thefts of vehicles\nlow rate of thefts from vehicles",
hjust = 1,
lineheight = 1
) +
annotate(
geom = "text",
x = 1,
y = 75,
label = "low rate of thefts of vehicles\nhigh rate of thefts from vehicles",
hjust = 0,
lineheight = 1
) +
# Specify the format for the axis labels
scale_x_continuous(labels = scales::comma_format()) +
scale_y_continuous(labels = scales::comma_format()) +
# Add labels
labs(
title = "Vehicle thefts in South African municipalities",
subtitle = str_glue(
"each dot represents one municipality, 2018-19, dashed lines show ",
"median values"
),
x = "rate of thefts of motor vehicles per 1,000 vehicle-owning households",
y = "rate of thefts from motor vehicles per 1,000 vehicle-owning households"
) +
theme_minimal()
From this plot we can now see that half of municipalities have very low rates of both types of theft (shown by the dots below and to the left of the median lines).
We can make some further changes to this chart. For example, instead of labelling areas on the plot we could label the municipalities with high rates of vehicle theft (we cannot include both types of label because they would overlap). To do that, we will create a new column in the data containing either the municipality name (for high-rate municipalities) or NA (meaning ggplot() will not create a label for that row if we set na.rm = TRUE). We can then use geom_label_repel() from the ggrepel package to add the labels to the chart, remembering to add label = label to the aes() function so ggplot() knows which column in the data to use for the labels.
R Console
# Create scatter plot of vehicle theft
vehicle_theft_wider |>
  # Create a new column in the data, either containing the municipality name or
  # `NA` depending on the values of `theft_of` and `theft_from`
  mutate(label = if_else(theft_of > 17 | theft_from > 65, municipality, NA)) |>
  ggplot() +
  # Specify which columns in the data should control the x and y positions of
  # each point and the labels (for those points that have labels)
  aes(x = theft_of, y = theft_from, label = label) +
  # Add vertical line showing median rate of theft of a vehicle
  geom_vline(
    xintercept = median(pull(vehicle_theft_wider, "theft_of")),
    linetype = "22"
  ) +
  # Add horizontal line showing median rate of theft from a vehicle
  geom_hline(
    yintercept = median(pull(vehicle_theft_wider, "theft_from")),
    linetype = "22"
  ) +
  # Add points
  geom_point(alpha = 0.5) +
  # Add labels
  geom_label_repel(na.rm = TRUE, label.size = 0, lineheight = 1) +
  # Specify the format for the axis labels
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::comma_format()) +
  # Add labels
  labs(
    title = "Vehicle thefts in South African municipalities",
    subtitle = str_glue(
      "each dot represents one municipality, 2018-19, dashed lines show ",
      "median values"
    ),
    x = "rate of thefts of motor vehicles per 1,000 vehicle-owning households",
    y = "rate of thefts from motor vehicles per 1,000 vehicle-owning households"
  ) +
  theme_minimal()
Finally, we can add a trend line to the plot. We do this using the geom_smooth() function from ggplot2. geom_smooth() can add different types of trend line to a plot, but in this example we will specify a simple linear trend line by setting method = "lm". We will also specify formula = y ~ x (the default) to avoid geom_smooth() producing a message to tell us what formula it used to calculate the trend.
Add this code to the R script file and run it.
chapter15c.R
# Create scatter plot of vehicle theft
vehicle_theft_wider |>
  # Create a new column in the data, either containing the municipality name or
  # `NA` depending on the values of `theft_of` and `theft_from`
  mutate(label = if_else(theft_of > 17 | theft_from > 65, municipality, NA)) |>
  ggplot() +
  # Specify which columns in the data should control the x and y positions of
  # each point and the labels (for those points that have labels)
  aes(x = theft_of, y = theft_from, label = label) +
  # Add vertical line showing median rate of theft of a vehicle
  geom_vline(
    xintercept = median(pull(vehicle_theft_wider, "theft_of")),
    linetype = "22"
  ) +
  # Add horizontal line showing median rate of theft from a vehicle
  geom_hline(
    yintercept = median(pull(vehicle_theft_wider, "theft_from")),
    linetype = "22"
  ) +
  # Add trend line
  geom_smooth(method = "lm", formula = y ~ x, colour = "grey20") +
  # Add points
  geom_point(alpha = 0.5) +
  # Add labels
  geom_label_repel(na.rm = TRUE, label.size = 0, lineheight = 1) +
  # Specify the format for the axis labels
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::comma_format()) +
  # Add labels
  labs(
    title = "Vehicle thefts in South African municipalities",
    subtitle = str_glue(
      "each dot represents one municipality, 2018-19, dashed lines show ",
      "median values"
    ),
    x = "rate of thefts of motor vehicles per 1,000 vehicle-owning households",
    y = "rate of thefts from motor vehicles per 1,000 vehicle-owning households"
  ) +
  theme_minimal()
Warning: The following aesthetics were dropped during statistical transformation: label.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?

The following aesthetics were dropped ...?
The code that produces the chart above also produces a warning saying some aesthetics were dropped during a statistical transformation. This is because geom_smooth() does not know how to use the label aesthetic. There are two options to prevent this warning appearing in a Quarto document. First, you could set the #| warning: false chunk option, although this will suppress any other warnings produced by this code chunk, so you should only add this chunk option once you are sure the code runs without any problems.
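The second option is to remove label = label from the call to aes() on line 9 of the code and instead add aes(label = label) as the first argument to geom_label_repel() on line 25. The label aesthetic is then passed only to the layer that draws the labels, so geom_smooth() never receives it.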
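As a minimal sketch of this second option, the code below moves the label aesthetic out of the plot-wide aes() call and into geom_label_repel(). It uses a small made-up data frame called toy_thefts (a hypothetical stand-in for vehicle_theft_wider, with invented values) so that it can be run on its own, and it assumes the ggplot2 and ggrepel packages are installed.

```r
library(ggplot2)
library(ggrepel)

# Hypothetical data standing in for `vehicle_theft_wider` (values are invented)
toy_thefts <- data.frame(
  municipality = c("A", "B", "C", "D", "E"),
  theft_of = c(1, 3, 5, 9, 20),
  theft_from = c(4, 10, 70, 30, 60)
)

# Label only the high-rate places, using the same thresholds as the chart above
toy_thefts$label <- ifelse(
  toy_thefts$theft_of > 17 | toy_thefts$theft_from > 65,
  toy_thefts$municipality,
  NA_character_
)

toy_plot <- ggplot(toy_thefts) +
  # Only the x and y positions are shared by every layer
  aes(x = theft_of, y = theft_from) +
  # The trend line never sees the `label` aesthetic, so it has nothing to drop
  geom_smooth(method = "lm", formula = y ~ x, colour = "grey20") +
  geom_point(alpha = 0.5) +
  # The label aesthetic is specified only for the layer that uses it
  geom_label_repel(aes(label = label), na.rm = TRUE, label.size = 0) +
  theme_minimal()

toy_plot
```

Unlike the chunk option, this approach leaves any other warnings visible, which is usually the safer choice while you are still developing a chart.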
From this chart, we can see which municipalities have particularly unusual vehicle-theft rates. For example, we might well want to explore the rates of theft from vehicles in Beaufort West or Stellenbosch municipalities to see what makes them so different from the others, and similarly for the rate of theft of vehicles in Ethekwini.
One note of caution when using geom_smooth(): this function will show the direction of the relationship between two variables regardless of the strength of that relationship. In extreme cases, that could mean that a chart would show a trend line between two variables even if the variables had almost no relationship to one another.
For example, geom_smooth() produces a line showing the direction of the relationship between the two variables in each of these three charts, even though the relationship on the right is much stronger than the one on the left.

Be very careful about trying to interpret the strength of relationships between two variables by plotting them on a chart. It is much better to measure the strength of the relationship using a statistical test such as a correlation test, but statistical tests are outside the scope of this course.
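Although statistical tests are not covered in this course, the following rough sketch illustrates the point using simulated data (not the vehicle-theft data). It fits a linear trend to one weak and one strong relationship and then uses cor.test() from base R to estimate the correlation in each case. The variable names and the simulated values are assumptions made purely for illustration.

```r
# Simulated data: `y_weak` is mostly noise, `y_strong` closely follows `x`
set.seed(123)
n <- 200
x <- rnorm(n)
y_weak <- 0.1 * x + rnorm(n)
y_strong <- 2 * x + rnorm(n, sd = 0.5)

# A linear model returns a slope in both cases, just as geom_smooth() would
# draw a line in both cases ...
slope_weak <- coef(lm(y_weak ~ x))[["x"]]
slope_strong <- coef(lm(y_strong ~ x))[["x"]]

# ... but the correlation estimates show how different the relationships are
cor_weak <- cor.test(x, y_weak)
cor_strong <- cor.test(x, y_strong)

# Print the estimated correlation coefficients
cor_weak$estimate
cor_strong$estimate
```

A correlation coefficient close to zero indicates a weak linear relationship even when a trend line can be drawn through the points, while a coefficient close to 1 or -1 indicates a strong one.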
Which of these types of visualization is best used to show the distribution of a single continuous variable?
What is an advantage of using a density plot over a histogram?
15.6 In summary
In this chapter we have learned how to present data about crime at places without using maps. These techniques give us more flexibility about how to best present data to communicate the main points that we want to get across.
Whether to use a map or a chart, and which type of map or chart to use, are design decisions for you to make. When you make these decisions, always remember that what is most important is that your audience understands your message. This makes it very important that you understand your audience.
Visualising data with charts is a very large topic and there are lots of resources available to help you learn more. To get started, you might want to look at:
- An Introduction to ggplot2 from the University of Cincinnati Business Analytics team.
- The ggplot2 cheat sheet by the team that develops the ggplot2 package.
- The R Graph Gallery for examples of many other types of chart that you can produce in R.
Answer these questions to check you have understood the main points covered in this chapter. Write between 50 and 100 words to answer each question.
- Explain at least two scenarios where using a table or a chart would be more effective than a map for presenting spatial data.
- Why might you choose to use a bar chart rather than a choropleth map for displaying crime data in some circumstances?
- When should you use a table instead of a bar chart to convey information about a continuous variable for each of several categories?
- When is it useful to include a reference map alongside a chart, and when is it unnecessary?
- What factors should you consider when choosing the best way to present spatial data?




