15  Presenting spatial data without maps

Maps are a powerful tool for visualizing spatial data, but they are not always the best choice. This chapter explores alternative methods for presenting spatial data effectively, including tables and charts. By understanding when to use these techniques, you will learn how to communicate spatial information clearly and concisely. The chapter covers key scenarios where maps may be less effective, such as when summarizing small datasets or comparing multiple variables.

Before you start

Open RStudio or – if you already have RStudio open – click Session then Restart R. Make sure you’re working inside the Crime Mapping RStudio project you created in Section 1.4.2, then you’re ready to start mapping.

15.1 Introduction

Making maps is the core of analysing spatial data. But just because a particular dataset has a spatial element to it does not always mean that a map is the best way to present that data. In this chapter we will learn some other techniques for presenting data that can be more effective than maps for answering certain questions about spatial data.

As with so much in spatial analysis, whether it is best to make a map or use some other technique to convey information will depend on the circumstances. When you decide how to communicate information about the data you are analysing, you will need to consider the questions you are trying to answer, the audience that you are communicating to, what they will be using the information for and in what circumstances they will be using it.

While the best choice of how to communicate spatial information will depend on the circumstances, there are a few instances in which maps are typically not the best way to communicate your data. These include:

When you only need to convey a handful of pieces of information

Maps are very effective for communicating detailed information, such as the density of crime across thousands of cells in a KDE grid. But to do this, maps typically encode information into aesthetics such as colour, size and so on. This is necessary for communicating large amounts of information, but it makes the connection between the data and the visual representation of the data less direct. If you only need to communicate a small amount of information, there is less justification for forcing your audience to mentally translate the aesthetic into whatever it represents.

For example, if you wanted to show the number of violent and sexual offences in each of the seven districts in Northamptonshire in England, a choropleth map is less clear than a bar chart (for example, in being able to decipher if there were more offences in Kettering or in Wellingborough).

Choropleth map and bar chart both showing the frequency of violence in each district in Northamptonshire in 2020

A map might be a useful addition to the bar chart in this case if you are trying to communicate information to people who are not familiar with the locations of the districts. In that case, we might want to add a small reference map to help people understand which area is which:

Bar chart showing the frequency of violence in each district in Northamptonshire in 2020, with a reference map showing the area covered by each district

But in most circumstances in which you create crime maps, you will be creating them for an audience (such as local police officers) that already has sufficient knowledge of the area and so an inset map such as this would not be needed. In that case, a bar chart will probably be more effective at showing this information than a map would be.

When you need to convey several different things about one place

Maps are generally most effective when they show a single piece of data about each place (e.g. a grid cell or a polygon representing a statistical area). For example, a choropleth map shows a single shade of colour for each area on the map to represent a single value, such as the frequency or rate of crime in that area. If you wanted to show the frequency of burglary and the frequency of robbery in the same area on a map, this would be quite hard. So if you need to convey multiple different things about each place, it is generally best to do this in a table or chart, rather than a map.

One exception to this is when you present multiple maps side by side, each showing a single thing about an area. These are called small multiple maps and we will learn about them in Chapter 16.

When the geographic relationship between places on the map is not the most important thing about them

Maps emphasise the spatial relationship between different places, but they do this at the expense of making non-spatial relationships between those places less obvious. If the spatial relationships are the most important thing that you want to convey, a map makes sense. For example, a hotspot map is often a very good way to communicate where crime is most concentrated. But in other cases the geographic relationships between variables will be much less important. For example, if you wanted to show the relationship between the amount of crime in an area and the level of poverty there, a scatter plot would probably be a more effective way to do this than a map would be.
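As a minimal sketch of the scatter plot described above, the following code plots crime against poverty for six areas. The `areas` object, its column names and every value in it are invented purely for illustration; they are not real data.

```r
# Load packages
library(ggplot2)
library(tibble)

# Hypothetical data: all names and values here are invented for illustration
areas <- tribble(
  ~area, ~poverty_rate, ~crimes_per_1000,
  "A",   5,             21,
  "B",   9,             30,
  "C",   12,            28,
  "D",   18,            45,
  "E",   22,            52,
  "F",   27,            49
)

# Create a scatter plot with one point per area
crime_poverty_plot <- ggplot(areas) +
  # Translate columns in the data to aesthetics on the chart
  aes(x = poverty_rate, y = crimes_per_1000) +
  # Add points
  geom_point() +
  labs(x = "households in poverty (%)", y = "crimes per 1,000 residents") +
  theme_minimal()

crime_poverty_plot
```

In a chart like this, the position of each point shows both variables at once, which a choropleth map showing a single value per area cannot do.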

Maps, charts and tables

What is one reason why spatial relationships might not be important in some visualizations?

What is a key factor in deciding whether to use a map, table, or chart?

15.2 Tables

Well-designed tables can be a very effective way of communicating information, whereas badly designed tables can be confusing and even lead your audience to give up trying to engage with the information you’re trying to communicate.

Tables used to present information almost always show only a summary of the available data, so the first step in preparing a table is to wrangle the data into the right format. In Section 3.6 we learned about the summarise() function from the dplyr package that we can use to produce summaries of rows of data.
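As a brief reminder of how summarise() condenses rows of data, here is a minimal sketch. The `offences` object and the values in it are hypothetical, invented only to show the pattern of grouping rows and then summarising them.

```r
# Load packages
library(dplyr)
library(tibble)

# Hypothetical data: one row per offence type per district (invented values)
offences <- tribble(
  ~district, ~offence,   ~count,
  "North",   "burglary", 12,
  "North",   "robbery",  3,
  "South",   "burglary", 8,
  "South",   "robbery",  5
)

# Collapse the four rows into one summary row per district
district_totals <- offences |>
  group_by(district) |>
  summarise(total = sum(count), mean_count = mean(count))

district_totals
```

The result has one row for each district, which is usually much closer to what readers of a report need than the original rows.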

To learn about creating a good table for displaying summary data in a report, we will use the example of the frequency of different types of violence in the different states of Malaysia in 2017.

Open a new R script file and save it as chapter15a.R. Copy this code into that file and run it.

chapter15a.R
# Load packages
pacman::p_load(gt, paletteer, tidyverse)

# Load annual counts of different types of violence in Malaysia, 2006 to 2017
violence <- read_rds("https://mpjashby.github.io/crimemappingdata/malaysia_violence_counts.rds") |> 
  # Keep only counts from 2017
  filter(year == 2017)

We can get a feel for the data by looking at a random sample of rows using the slice_sample() function from the dplyr package (remember dplyr was loaded automatically when we loaded tidyverse).

R Console
slice_sample(violence, n = 10)
# A tibble: 10 × 5
   region        state         year crime_type         count
   <chr>         <chr>        <dbl> <chr>              <dbl>
 1 West Malaysia Kelantan      2017 unarmed robbery      219
 2 West Malaysia Melaka        2017 murder                 7
 3 West Malaysia Melaka        2017 unarmed robbery      589
 4 East Malaysia Sarawak       2017 armed robbery          3
 5 West Malaysia Johor         2017 unarmed robbery     1701
 6 West Malaysia Melaka        2017 rape                  69
 7 West Malaysia Pahang        2017 rape                 163
 8 West Malaysia Kuala Lumpur  2017 aggravated assault   651
 9 West Malaysia Pulau Pinang  2017 unarmed robbery      706
10 West Malaysia Johor         2017 murder                66

The output of slice_sample() looks acceptable as a table, especially if it is included in a Quarto document, but readers of our reports probably don’t want to know the type of each variable (underneath the variable names) and won’t want to page through the table if there are more rows or columns than can fit in the available space. We can make this table much more useful for readers by wrangling it into a different format.

15.2.1 Making data wider for presentation

One issue with printing the violence object as a table is that it has 70 rows, so it will take up a lot of space on a page or screen. We can make the data more compact by converting it from long format to wide format. In Chapter 12 we learned that data are often easier to analyse in long format. But it is often better to present data in a table in wide format. When you are choosing between storing data in long versus wide format, remember: analyse in long format, present in wide format.

To convert the table to a wider format we can use the pivot_wider() function from the tidyr package, just as we used the corresponding pivot_longer() function to tidy data in Chapter 12. To make data wider, we specify a single column in the data to use as the names of multiple new columns using the names_from argument and a column to use as the values for the new columns using the values_from argument.

R Console
# Produce table of crime counts in each Malaysian state
violence |> 
  # Convert data to have one row per state
  pivot_wider(names_from = crime_type, values_from = count) |> 
  # Convert new column names (the former values of `crime_type`) to snake case
  janitor::clean_names()
# A tibble: 14 × 8
   region        state        year aggravated_assault armed_robbery murder  rape
   <chr>         <chr>       <dbl>              <dbl>         <dbl>  <dbl> <dbl>
 1 East Malaysia Sabah        2017                230             0     36   211
 2 East Malaysia Sarawak      2017                368             3     27   150
 3 West Malaysia Johor        2017                614             1     66   196
 4 West Malaysia Kedah        2017                364             2     21   119
 5 West Malaysia Kelantan     2017                252             2     13   114
 6 West Malaysia Kuala Lump…  2017                651             4     37   132
 7 West Malaysia Melaka       2017                176             1      7    69
 8 West Malaysia Negeri Sem…  2017                241             2     14    91
 9 West Malaysia Pahang       2017                188             1     16   163
10 West Malaysia Perak        2017                380             4     35    95
11 West Malaysia Perlis       2017                 47             0      2    30
12 West Malaysia Pulau Pina…  2017                275             0     17    80
13 West Malaysia Selangor     2017               1108            14     83   321
14 West Malaysia Terengganu   2017                130             0      5    64
# ℹ 1 more variable: unarmed_robbery <dbl>

You will be used to seeing janitor::clean_names() used to clean the column names in a dataset that has just been loaded. In this case, the new columns created by pivot_wider() will have spaces in them, because the names are taken from the values of the crime_type column in the original dataset. Column names with spaces in them are harder to work with, so this code converts them to snake case.
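To see why names containing spaces are awkward, consider this small sketch. The `wide_names` object is a hypothetical one-row tibble built by hand, using the counts for Melaka from the data above.

```r
# Load packages
library(tibble)

# Hypothetical one-row tibble with column names that contain spaces
wide_names <- tibble(
  state = "Melaka",
  `aggravated assault` = 176,
  `armed robbery` = 1,
  `unarmed robbery` = 589
)

# Names containing spaces must be wrapped in back-ticks whenever we use them
wide_names$`unarmed robbery`

# After cleaning, the same column can be referred to without back-ticks
cleaned <- janitor::clean_names(wide_names)
names(cleaned)
cleaned$unarmed_robbery
```

Forgetting the back-ticks around a name with spaces causes an error, so converting the names once saves repeated effort later in the pipeline.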

Later in the code we will replace these column names with labels that are suitable for displaying the data in a table.

Now the table has only 14 rows, which makes it much easier to present both on screen and in print. We can also see that the year column is constant (all the values are the same), so we can remove this using the select() function from dplyr. We can also use select() to change the order of the columns from left to right so that the two types of robbery appear next to each other.

R Console
# Produce table of crime counts in each Malaysian state
violence |> 
  # Convert data to have one row per state
  pivot_wider(names_from = crime_type, values_from = count) |> 
  # Convert new column names (the former values of `crime_type`) to snake case
  janitor::clean_names() |> 
  # Choose only the columns we want to show in the table
  select(
    region, state, murder, rape, aggravated_assault, armed_robbery, 
    unarmed_robbery
  )
# A tibble: 14 × 7
   region    state murder  rape aggravated_assault armed_robbery unarmed_robbery
   <chr>     <chr>  <dbl> <dbl>              <dbl>         <dbl>           <dbl>
 1 East Mal… Sabah     36   211                230             0             284
 2 East Mal… Sara…     27   150                368             3             328
 3 West Mal… Johor     66   196                614             1            1701
 4 West Mal… Kedah     21   119                364             2             490
 5 West Mal… Kela…     13   114                252             2             219
 6 West Mal… Kual…     37   132                651             4            3175
 7 West Mal… Mela…      7    69                176             1             589
 8 West Mal… Nege…     14    91                241             2             536
 9 West Mal… Paha…     16   163                188             1             288
10 West Mal… Perak     35    95                380             4             626
11 West Mal… Perl…      2    30                 47             0              53
12 West Mal… Pula…     17    80                275             0             706
13 West Mal… Sela…     83   321               1108            14            4944
14 West Mal… Tere…      5    64                130             0             155

15.2.2 Using the gt package to make better tables

The table we created in the last section was better than simply showing the raw data to readers of a report. But we can create much better display tables with the gt package, which is designed to format data for display. The gt package works in a similar way to the ggplot2 package, in that tables are made up of stacks of functions that contribute to the appearance of the final table. One difference is that the layers in a gt stack are joined using the pipe operator (|>) rather than the plus operator (+).

We can create a very basic gt table by just passing a data frame or tibble to the gt() function. So we can add gt() to the end of the pipeline of functions we have already started to build to create a good display table. At this point, the only argument we will add to gt() is the rowname_col argument, which we use to specify which column in the data holds the row labels (in this case, the name of each state).

R Console
# Produce table of crime counts in each Malaysian state
violence |> 
  # Convert data to have one row per state
  pivot_wider(names_from = crime_type, values_from = count) |> 
  # Convert new column names (the former values of `crime_type`) to snake case
  janitor::clean_names() |> 
  # Choose only the columns we want to show in the table
  select(
    region, state, murder, rape, aggravated_assault, armed_robbery, 
    unarmed_robbery
  ) |> 
  # Functions from tidyverse above and functions from gt below
  gt(rowname_col = "state")
                 region         murder  rape  aggravated_assault  armed_robbery  unarmed_robbery
Sabah            East Malaysia      36   211                 230              0              284
Sarawak          East Malaysia      27   150                 368              3              328
Johor            West Malaysia      66   196                 614              1             1701
Kedah            West Malaysia      21   119                 364              2              490
Kelantan         West Malaysia      13   114                 252              2              219
Kuala Lumpur     West Malaysia      37   132                 651              4             3175
Melaka           West Malaysia       7    69                 176              1              589
Negeri Sembilan  West Malaysia      14    91                 241              2              536
Pahang           West Malaysia      16   163                 188              1              288
Perak            West Malaysia      35    95                 380              4              626
Perlis           West Malaysia       2    30                  47              0               53
Pulau Pinang     West Malaysia      17    80                 275              0              706
Selangor         West Malaysia      83   321                1108             14             4944
Terengganu       West Malaysia       5    64                 130              0              155

This table is already better than the default table produced by Quarto if we just print a data frame or tibble. The gt table does not take up the whole width of the page unnecessarily (which can make it harder to read across rows) and has hidden the type of each column.

We can add more functions to the gt() stack to adjust the appearance of the table. For example, we can format the numeric columns as numbers using the fmt_number() function. This adds thousand separators (in British English, commas) to make it easier to read the large numeric values and can make various other changes such as adding a prefix or suffix to numbers (useful for showing units), scaling numbers (useful for very large numbers) or automatically formatting numbers according to the conventions of the language your computer is set to use (referred to in R help pages as the locale of your computer).

We choose which columns fmt_number() should format using the columns argument. In this case, we want to format all the numeric columns in the data, so we will set columns = where(is.numeric).

We don’t want the numbers in the table to have any decimal places (since the crime counts are all whole numbers), so we also set decimals = 0. We can use the default values of all the other arguments to fmt_number() – type ?gt::fmt_number in the R console to find out more about the different options available on the help page for the fmt_number() function.

R Console
# Produce table of crime counts in each Malaysian state
violence |> 
  # Convert data to have one row per state
  pivot_wider(names_from = crime_type, values_from = count) |> 
  # Convert new column names (the former values of `crime_type`) to snake case
  janitor::clean_names() |> 
  # Choose only the columns we want to show in the table
  select(
    region, state, murder, rape, aggravated_assault, armed_robbery, 
    unarmed_robbery
  ) |> 
  # Functions from tidyverse above and functions from gt below
  gt(rowname_col = "state") |> 
  # Format numbers with thousand separators and no decimals
  fmt_number(columns = where(is.numeric), decimals = 0)
                 region         murder  rape  aggravated_assault  armed_robbery  unarmed_robbery
Sabah            East Malaysia      36   211                 230              0              284
Sarawak          East Malaysia      27   150                 368              3              328
Johor            West Malaysia      66   196                 614              1            1,701
Kedah            West Malaysia      21   119                 364              2              490
Kelantan         West Malaysia      13   114                 252              2              219
Kuala Lumpur     West Malaysia      37   132                 651              4            3,175
Melaka           West Malaysia       7    69                 176              1              589
Negeri Sembilan  West Malaysia      14    91                 241              2              536
Pahang           West Malaysia      16   163                 188              1              288
Perak            West Malaysia      35    95                 380              4              626
Perlis           West Malaysia       2    30                  47              0               53
Pulau Pinang     West Malaysia      17    80                 275              0              706
Selangor         West Malaysia      83   321               1,108             14            4,944
Terengganu       West Malaysia       5    64                 130              0              155

fmt_number() is one of several formatting functions available in gt. For example, we could use fmt_currency() to format columns according to the conventions for currency values, fmt_date() for dates or fmt_percent() for percentages.
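As a minimal sketch of two of these other formatting functions, the following code formats one column as percentages and another as currency. The `costs` object, its column names and its values are hypothetical and were invented for this illustration.

```r
# Load packages
library(gt)
library(tibble)

# Hypothetical data: invented values for illustration only
costs <- tribble(
  ~district, ~detection_rate, ~cost,
  "North",   0.253,           12500,
  "South",   0.118,           8300
)

formatted_table <- costs |>
  gt(rowname_col = "district") |>
  # Show proportions as percentages with one decimal place
  fmt_percent(columns = detection_rate, decimals = 1) |>
  # Show costs in pounds sterling with no decimal places
  fmt_currency(columns = cost, currency = "GBP", decimals = 0)

formatted_table
```

Each formatting function is applied only to the columns named in its columns argument, so several different formats can be combined in one table.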

The region column only has two values: West Malaysia for states and territories in Peninsular Malaysia and East Malaysia for states on the island of Borneo. Rather than repeat these two values on every row of the table – which is a waste of space and makes the table more cluttered than necessary – we can instead group the rows according to these two regions and then only show the region names once at the top of each group.

gt() will automatically create group headings in a table if the data frame or tibble passed to gt() contains groups created by the group_by() function from the dplyr package. All we have to do is use group_by() to specify which column (in this case, region) contains the values that we should use to determine which group each row is in.

R Console
# Produce table of crime counts in each Malaysian state
violence |> 
  # Convert data to have one row per state
  pivot_wider(names_from = crime_type, values_from = count) |> 
  # Convert new column names (the former values of `crime_type`) to snake case
  janitor::clean_names() |> 
  # Choose only the columns we want to show in the table
  select(
    region, state, murder, rape, aggravated_assault, armed_robbery, 
    unarmed_robbery
  ) |> 
  # Specify the table rows should be grouped by the values of `region`
  group_by(region) |> 
  # Functions from tidyverse above and functions from gt below
  gt(rowname_col = "state") |> 
  # Format numbers with thousand separators and no decimals
  fmt_number(columns = where(is.numeric), decimals = 0)
                 murder  rape  aggravated_assault  armed_robbery  unarmed_robbery
East Malaysia
Sabah                36   211                 230              0              284
Sarawak              27   150                 368              3              328
West Malaysia
Johor                66   196                 614              1            1,701
Kedah                21   119                 364              2              490
Kelantan             13   114                 252              2              219
Kuala Lumpur         37   132                 651              4            3,175
Melaka                7    69                 176              1              589
Negeri Sembilan      14    91                 241              2              536
Pahang               16   163                 188              1              288
Perak                35    95                 380              4              626
Perlis                2    30                  47              0               53
Pulau Pinang         17    80                 275              0              706
Selangor             83   321               1,108             14            4,944
Terengganu            5    64                 130              0              155

In tables containing lots of numbers it can be difficult to see patterns. One way to help readers to understand patterns is to map the numbers to an aesthetic property such as colour that people can easily see patterns in. To do this, we can colour the cells in a column according to the value of each cell using the data_color() function (note the spelling of ‘color’ in this function). To use data_color(), we specify the columns we want to shade using the columns argument and the colour palette we want to use using the palette argument.

In this example, we will only colour the values in two columns, so we will pass the column names to the columns argument.

The easiest way to specify a colour palette is to use one of the built-in colour palettes that the gt package understands automatically. These use the same colour palette names we have used in previous chapters when using functions such as scale_fill_distiller().

R Console
# Produce table of crime counts in each Malaysian state
violence |> 
  # Convert data to have one row per state
  pivot_wider(names_from = crime_type, values_from = count) |> 
  # Convert new column names (the former values of `crime_type`) to snake case
  janitor::clean_names() |> 
  # Choose only the columns we want to show in the table
  select(
    region, state, murder, rape, aggravated_assault, armed_robbery, 
    unarmed_robbery
  ) |> 
  # Specify the table rows should be grouped by the values of `region`
  group_by(region) |> 
  # Functions from tidyverse above and functions from gt below
  gt(rowname_col = "state") |> 
  # Format numbers with thousand separators and no decimals
  fmt_number(columns = where(is.numeric), decimals = 0) |> 
  # Show distribution of values in some columns using colour
  data_color(columns = unarmed_robbery, palette = "Oranges") |> 
  data_color(columns = rape, palette = "Blues")
                 murder  rape  aggravated_assault  armed_robbery  unarmed_robbery
East Malaysia
Sabah                36   211                 230              0              284
Sarawak              27   150                 368              3              328
West Malaysia
Johor                66   196                 614              1            1,701
Kedah                21   119                 364              2              490
Kelantan             13   114                 252              2              219
Kuala Lumpur         37   132                 651              4            3,175
Melaka                7    69                 176              1              589
Negeri Sembilan      14    91                 241              2              536
Pahang               16   163                 188              1              288
Perak                35    95                 380              4              626
Perlis                2    30                  47              0               53
Pulau Pinang         17    80                 275              0              706
Selangor             83   321               1,108             14            4,944
Terengganu            5    64                 130              0              155

Avoid using the same colour across multiple columns

In this table we use two different colours to show the patterns in the frequency of rape and unarmed robbery. This is because we want readers to remember that different types of crime are different and so comparisons that treat crimes as being equivalent to one another are likely to be flawed. If we used the same colour across columns, readers might end up seeing that the shade used for unarmed robberies in Kuala Lumpur was darker than the shade showing the number of rapes and conclude that unarmed robberies were a bigger problem than rapes. This would be a potentially false conclusion because a single rape and a single unarmed robbery are not the same in terms of their seriousness.

For the same reason the table does not include a column showing the total number of crimes in each state – when we total all types of crime together, we are implicitly assuming that all types of crime are the same when that is obviously untrue.
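If none of the named palettes suits a table, the palette argument of data_color() can also be given a vector of colour names, which gt interpolates between. The following is a minimal sketch assuming a recent version of gt (one in which data_color() has a palette argument, as in the code above). The `robbery` object is a hypothetical three-row tibble built by hand from counts shown in the table above.

```r
# Load packages
library(gt)
library(tibble)

# Hypothetical small tibble using three unarmed robbery counts from the table
robbery <- tribble(
  ~state,     ~unarmed_robbery,
  "Perlis",   53,
  "Melaka",   589,
  "Selangor", 4944
)

robbery_table <- robbery |>
  gt(rowname_col = "state") |>
  # Format numbers with thousand separators and no decimals
  fmt_number(columns = unarmed_robbery, decimals = 0) |>
  # Shade cells from white (lowest value) to dark orange (highest value)
  data_color(columns = unarmed_robbery, palette = c("white", "darkorange"))

robbery_table
```

Whichever palette is used, the advice above still applies: choose a different colour for each type of crime.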

15.2.3 Changing column names

Now that we have formatted the data, we can move on to changing the column labels. At the moment these are taken from the column names in the data, which means we have column labels such as aggravated_assault. Underscore characters (_) aren’t standard in English text, so we should change the labels to remove them. We can do this by adding the cols_label() function to the gt() stack. As well as removing the underscores, we can also use cols_label() to abbreviate labels or split them over multiple lines so that the column labels don’t force the columns to be wider than necessary.

We can use the md() helper function to use Markdown formatting to control the appearance of the labels. As well as using markup such as asterisks to create **strongly emphasised text** we can also use HTML markup to add more-advanced formatting. For example, we can use the code <br> to insert a line break to split labels over multiple lines.

R Console
# Produce table of crime counts in each Malaysian state
violence |> 
  # Convert data to have one row per state
  pivot_wider(names_from = crime_type, values_from = count) |> 
  # Convert new column names (the former values of `crime_type`) to snake case
  janitor::clean_names() |> 
  # Choose only the columns we want to show in the table
  select(
    region, state, murder, rape, aggravated_assault, armed_robbery, 
    unarmed_robbery
  ) |> 
  # Specify the table rows should be grouped by the values of `region`
  group_by(region) |> 
  # Functions from tidyverse above and functions from gt below
  gt(rowname_col = "state") |> 
  # Format numbers with thousand separators and no decimals
  fmt_number(columns = where(is.numeric), decimals = 0) |> 
  # Show distribution of values in some columns using colour
  data_color(columns = unarmed_robbery, palette = "Oranges") |> 
  data_color(columns = rape, palette = "Blues") |> 
  # Add column labels
  cols_label(
    "aggravated_assault" ~ "agg. assault",
    "armed_robbery" ~ md("robbery<br>(armed)"),
    "unarmed_robbery" ~ md("robbery<br>(unarmed)")
  )
                 murder  rape  agg. assault  robbery    robbery
                                             (armed)  (unarmed)
East Malaysia
Sabah                36   211           230        0        284
Sarawak              27   150           368        3        328
West Malaysia
Johor                66   196           614        1      1,701
Kedah                21   119           364        2        490
Kelantan             13   114           252        2        219
Kuala Lumpur         37   132           651        4      3,175
Melaka                7    69           176        1        589
Negeri Sembilan      14    91           241        2        536
Pahang               16   163           188        1        288
Perak                35    95           380        4        626
Perlis                2    30            47        0         53
Pulau Pinang         17    80           275        0        706
Selangor             83   321         1,108       14      4,944
Terengganu            5    64           130        0        155

15.2.4 Adding summary rows

The final thing we will add to this table is a summary row containing the total number of each type of crime across all the states and territories. We do this using the summary_rows() function from gt. We specify the columns we want to summarise using the columns argument as we did for fmt_number().

Summary rows can be produced using lots of different R functions. For example, we could use the mean() function to produce a summary row showing the mean (average) number of crimes of each type across the states. In this case, we want to know the total number of each type of crime across all states, so we will use the sum() function. To specify this, we pass the fns argument to summary_rows(). The fns argument has two parts, separated by a tilde (~). On the left-hand side we specify the label we want the summary row to have, and on the right-hand side we specify the function we want to use to calculate the summary. In this case, we can specify fns = "regional total" ~ sum(.) to say we want the summary row to have the label ‘regional total’ and that we want to summarise the rows using the sum() function. The . in the code fns = "regional total" ~ sum(.) is a place-holder that represents the data we want to summarise.

As well as summarising the data in each column, we want to specify how the summary values should be formatted. To do that, we use the fmt argument of summary_rows(). This is also a two-sided (‘formula’) argument, with the two sides separated by a ~. On the left-hand side we specify which summary values we want to format. In this case we want all the summary values to be formatted as numbers, so we can use the everything() helper function. On the right-hand side we use a call to one of the fmt_*() family of functions we used earlier: in this case, we use fmt_number(). Looking at the code below, you’ll notice that the code fmt = everything() ~ fmt_number(., decimals = 0) again uses the . place-holder to specify that we want to format the summary value produced by sum().

The summary_rows() function produces a summary for each group of rows (in the case of this table, one summary for each region). As well as having a regional total, it would also be useful to have a total for all the groups together (i.e. for the whole country of Malaysia). To do that, we add the grand_summary_rows() function to our gt() stack, using the same arguments as for the summary_rows() function.
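The totals that summary rows display can be cross-checked independently using dplyr. This is a minimal sketch rather than part of the table code: the `east_malaysia` object is a hypothetical tibble typed in by hand, using the counts for Sabah and Sarawak from the tables earlier in this section.

```r
# Load packages
library(dplyr)
library(tibble)

# Counts for the two East Malaysian states, copied from the earlier tables
east_malaysia <- tribble(
  ~region,         ~state,    ~murder, ~rape, ~aggravated_assault, ~armed_robbery, ~unarmed_robbery,
  "East Malaysia", "Sabah",   36,      211,   230,                 0,              284,
  "East Malaysia", "Sarawak", 27,      150,   368,                 3,              328
)

# Total every numeric column within each region
regional_totals <- east_malaysia |>
  group_by(region) |>
  summarise(across(where(is.numeric), sum))

regional_totals
```

These values should match the regional total that summary_rows() shows for East Malaysia. This is only a console check and is separate from the script code that follows.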

Paste this code into the chapter15a.R file and run it.

chapter15a.R
# Produce table of crime counts in each Malaysian state
violence |> 
  # Convert data to have one row per state
  pivot_wider(names_from = crime_type, values_from = count) |> 
  # Convert new column names (the former values of `crime_type`) to snake case
  janitor::clean_names() |> 
  # Choose only the columns we want to show in the table
  select(
    region, state, murder, rape, aggravated_assault, armed_robbery, 
    unarmed_robbery
  ) |> 
  # Specify the table rows should be grouped by the values of `region`
  group_by(region) |> 
  # Functions from tidyverse above and functions from gt below
  gt(rowname_col = "state") |> 
  # Format numbers with thousand separators and no decimals
  fmt_number(columns = where(is.numeric), decimals = 0) |> 
  # Show distribution of values in some columns using colour
  data_color(columns = unarmed_robbery, palette = "Oranges") |> 
  data_color(columns = rape, palette = "Blues") |> 
  # Add column labels
  cols_label(
    "aggravated_assault" ~ "agg. assault",
    "armed_robbery" ~ md("robbery<br>(armed)"),
    "unarmed_robbery" ~ md("robbery<br>(unarmed)")
  ) |> 
  # Add a summary row showing the total number of crimes in each region
  summary_rows(
    columns = where(is.numeric),
    fns = "regional total" ~ sum(.),
    fmt = everything() ~ fmt_number(., decimals = 0)
  ) |> 
  # Add a summary row showing the total number of crimes in Malaysia
  grand_summary_rows(
    columns = where(is.numeric),
    fns = "national total" ~ sum(.),
    fmt = everything() ~ fmt_number(., decimals = 0)
  )
                 murder  rape  agg. assault  robbery    robbery
                                             (armed)  (unarmed)
East Malaysia
Sabah                36   211           230        0        284
Sarawak              27   150           368        3        328
regional total       63   361           598        3        612
West Malaysia
Johor                66   196           614        1      1,701
Kedah                21   119           364        2        490
Kelantan             13   114           252        2        219
Kuala Lumpur         37   132           651        4      3,175
Melaka                7    69           176        1        589
Negeri Sembilan      14    91           241        2        536
Pahang               16   163           188        1        288
Perak                35    95           380        4        626
Perlis                2    30            47        0         53
Pulau Pinang         17    80           275        0        706
Selangor             83   321         1,108       14      4,944
Terengganu            5    64           130        0        155
regional total      316 1,474         4,426       31     13,482
national total      379 1,835         5,024       34     14,094

Tables are not good at showing patterns

Tables are good for showing detailed information, particularly when we want to present multiple pieces of information about a single place. But it can be hard to spot patterns in tables even with coloured cells. For this reason, do not use tables when you are primarily trying to show the relationship between two or more variables. In the next section, we will learn to create bar charts in R to show patterns more effectively.

Tables

In which of these circumstances is a table typically more effective than a map?

Why is it often preferable to present summary data in a wide format?

Which function is used in R to convert long-format data into wide-format data?

15.3 Bar charts

Bar charts are useful for showing values of one continuous variable (e.g. a count of crimes) for each value of one categorical variable (e.g. states of a country). Bar charts are very common, but there are several things we can do to make them more useful. In this section we will learn how to construct a good bar chart.

You’re already an expert at making maps using functions from the ggplot2 package. We can use these same functions to create many other types of graphics. For example, we can use geom_bar() to create bar charts just as we use geom_sf() to create a map using data stored in an SF object.

geom_bar() calculates the length of each bar on a chart by counting the number of rows of data in each category. This isn’t what we want to do to visualise the violence object, since the data provided by the Royal Malaysian Police are already in the form of counts of crimes. Instead, we will use the geom_col() function, which creates bar charts from this type of summary data.
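To make the difference between the two functions concrete, here is a minimal sketch using a small made-up dataset (the incidents and state_counts objects are hypothetical and are not part of the Malaysian data). It assumes the ggplot2 and dplyr packages are installed.

```r
library(ggplot2)
library(dplyr)

# Hypothetical incident-level data: one row per crime
incidents <- tibble(state = c("A", "A", "A", "B", "B", "C"))

# geom_bar() counts the rows in each category for us
bar_from_rows <- ggplot(incidents) +
  aes(y = state) +
  geom_bar()

# Summary data: one row per state, with the count already calculated
state_counts <- count(incidents, state, name = "count")

# geom_col() uses the pre-calculated counts as the bar lengths
bar_from_counts <- ggplot(state_counts) +
  aes(x = count, y = state) +
  geom_col()
```

Both charts should show the same bars. The only difference is whether the counting is done by ggplot2 or has already been done in the data.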

To create a simple bar chart, we will work with the original (long-format) data and filter it to show only the number of murders in each state.

R Console
# Create bar chart of murder counts
violence |> 
  # Keep only rows representing murder counts
  filter(crime_type == "murder") |> 
  ggplot() +
  # Translate columns in the data to aesthetics on the chart
  aes(x = state, y = count) +
  # Add bars
  geom_col()

You might notice that this code uses the aes() function differently to what we’ve seen in previous chapters. As we learned in Chapter 6, aes() is used to specify which aspects of a map or chart should be controlled by the values of particular columns in a dataset. When we use aes() inside a geom_*() function (as we have done with geom_sf() in previous chapters), the mapping between columns in a dataset and aspects of the map or chart appearance applies only to that layer. In a map, that is usually what we want because there is typically only one layer (e.g. a layer showing the density of crime) that represents values in a dataset. But we can also use aes() outside a geom_*() function by adding it to the stack directly after the call to ggplot() itself. In that case, the mapping between data and chart will apply to all the layers on the chart.
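The difference between the two ways of using aes() is easiest to see on a chart with more than one layer. The sketch below uses a small made-up dataset (toy_counts is hypothetical) and adds a text label to each bar, assuming the ggplot2 package is installed.

```r
library(ggplot2)

# Hypothetical summary data
toy_counts <- data.frame(state = c("A", "B", "C"), count = c(3, 2, 1))

# aes() added directly after ggplot(): the mapping applies to every layer, so
# both the bars and the text labels use the same x and y positions
shared_mapping <- ggplot(toy_counts) +
  aes(x = count, y = state, label = count) +
  geom_col() +
  geom_text(hjust = -0.5)

# aes() inside a geom_*() function: the mapping applies only to that layer, so
# it has to be repeated for any other layer that needs it
layer_mapping <- ggplot(toy_counts) +
  geom_col(aes(x = count, y = state)) +
  geom_text(aes(x = count, y = state, label = count), hjust = -0.5)
```

The two stacks should produce the same chart, but the first avoids repeating the mapping for each layer.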

We can improve this basic chart in several ways:

  • We can switch the order of the variables used for the x and y aesthetics so that the bars are horizontal rather than vertical, which will stop the state names from overlapping. It is almost always better for bar charts to use horizontal bars rather than vertical bars, to avoid overlapping labels.
  • We can use labs() to add a title and caption, as well as controlling the x and y axis titles on the chart. In the code below, we set y = NULL to remove the title for the vertical axis, since a title is unnecessary when it is obvious from the context what the values on that axis represent (Malaysian states).
  • We can reduce the visual clutter in the chart using theme_minimal(). While theme_void() is generally the best ggplot2 theme for maps, for charts it’s almost always best to use theme_minimal().
R Console
# Create bar chart of murder counts
violence |> 
  # Keep only rows representing murder counts
  filter(crime_type == "murder") |> 
  ggplot() +
  # Translate columns in the data to aesthetics on the chart
  aes(x = count, y = state) +
  # Add bars
  geom_col() +
  # Add labels
  labs(
    title = "Murders in Malaysian states, 2017",
    caption = "Data from the Royal Malaysian Police",
    x = "number of murders",
    y = NULL
  ) +
  # Remove unnecessary chart elements
  theme_minimal()

This chart is better, but we can improve it further. For example, we can reduce the space between the state names and the bars by setting the expand argument of the scale_x_continuous() function. scale_x_continuous() works in a similar way to the other scale functions we have used already, such as using scale_fill_brewer() to control the colour of areas in a choropleth map.

Although we are trying to reduce the gap between the bars and labels on the y axis, we use a function that changes the x axis. This is because the space we are reducing is created by R adding, by default, some space to each end of any continuous axis, such as the count of murders.

We can specify the space at the end of each axis using the helper function expansion(). In this case we just want to remove the space completely, so we can set expand = expansion(0). The code below uses the equivalent shorthand expand = c(0, 0).
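As a brief sketch of what expansion() does, the code below prints the values it produces. The object names are made up for illustration, and the example assumes the ggplot2 package is installed.

```r
library(ggplot2)

# expansion() returns four numbers: the multiplicative and additive padding at
# the lower end of the axis, then the same two values for the upper end
no_padding <- expansion(0)
no_padding

# Remove the padding at the lower end of the axis (next to the state names) but
# keep a little space (5% of the axis range) beyond the longest bar
padding_at_upper_end <- expansion(mult = c(0, 0.05))
padding_at_upper_end
```

You could pass either value to the expand argument, e.g. scale_x_continuous(expand = expansion(mult = c(0, 0.05))), if you wanted to keep a small gap at the far end of the bars.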

At the same time, we can also remove the grid lines on the y axis (i.e. those running along the length of the bars) since they don’t really make it any easier to understand the chart. As a general principle, we want to remove anything on a chart that does not contribute to communicating information, since unnecessary chart elements can distract readers from understanding the data.

We can remove the grid lines by setting the panel.grid.major.y and panel.grid.minor.y arguments to the theme() function. The value we want to use is the helper function element_blank(), which sets the grid lines to be blank.

R Console
# Create bar chart of murder counts
violence |> 
  # Keep only rows representing murder counts
  filter(crime_type == "murder") |> 
  ggplot() +
  # Translate columns in the data to aesthetics on the chart
  aes(x = count, y = state) +
  # Add bars
  geom_col() +
  # Remove space at either end of horizontal axis
  scale_x_continuous(expand = c(0, 0)) +
  # Add labels
  labs(
    title = "Murders in Malaysian states, 2017",
    caption = "Data from the Royal Malaysian Police",
    x = "number of murders",
    y = NULL
  ) +
  # Remove unnecessary chart elements
  theme_minimal() +
  # Remove unnecessary horizontal grid lines
  theme(
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank()
  )

Bars on a bar chart should always start at zero

One of the reasons why bar charts are easy to interpret is that the length of each bar directly corresponds to the relative size of that particular value. But this direct relationship between bar length and value only applies if the bars start at zero. If you create a bar chart in which the bars don’t start at zero, readers are likely to be misled, so remember that bar charts should always start at zero. But don’t worry – ggplot() will handle this for you automatically.

15.3.1 Ordering bar charts by value

If you were trying to find the three Malaysian states or territories with the most murders from this chart, it would be pretty easy to see that Selangor had the most murders, followed by Johor. But at a glance, it’s not so easy to see which state or territory comes third. We can make this easier to see by changing the order of the bars from the default alphabetical order to an order based on how many murders there were.

To do this, we need to convert the state column in the data to a new type of variable: a factor. Factors are what R calls categorical variables that have a defined set of possible values. For example, a factor recording if a person was under or over 18 might have two possible values: ‘adult’ and ‘child’.

One of the benefits of storing a variable as a factor is that we can specify an order for the categories. This is useful for categories that have a meaningful order, such as ‘bad’, ‘acceptable’, ‘good’, ‘excellent’. But we can also use this feature of factors to specify that values should appear in a particular order in any charts produced from the data, whatever the order of the values in the data itself.


To work with factors in R we can use the forcats package, so-called because it’s for working with categories. forcats is loaded as part of tidyverse, so we don’t need to load it separately.

All the functions in the forcats package start with the letters fct_, just as all the functions in the SF package start with st_. For our bar chart, we will use the fct_reorder() function. This takes a factor or character variable (such as the names of the Malaysian states and territories) and sets the order of the categories according to the values of a numeric variable (such as the number of murders in a state). So to re-order the state variable according to the count of murders, we can use fct_reorder(state, count). Since we’re changing an existing variable, we will do this inside a call to the mutate() function.
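To see what fct_reorder() does to the order of the categories, we can try it on a few made-up values (the state and count vectors below are hypothetical, not the Malaysian data). This sketch assumes the forcats package is installed.

```r
library(forcats)

# Hypothetical values for illustration only
state <- c("A", "B", "C")
count <- c(3, 1, 2)

# Without re-ordering, the categories are in alphabetical order
levels(factor(state))

# fct_reorder() orders the categories from the smallest count to the largest
state_ordered <- fct_reorder(state, count)
levels(state_ordered)
```

Because ggplot2 draws the first category at the bottom of a vertical axis, ordering from smallest to largest puts the longest bar at the top of a horizontal bar chart. If you need the opposite order, fct_reorder() has a .desc argument that you can set to TRUE.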

R Console
# Create bar chart of murder counts
violence |> 
  # Keep only rows representing murder counts
  filter(crime_type == "murder") |> 
  # Re-order states according to number of murders
  mutate(state = fct_reorder(state, count)) |> 
  ggplot() +
  # Translate columns in the data to aesthetics on the chart
  aes(x = count, y = state) +
  # Add bars
  geom_col() +
  # Remove space at either end of horizontal axis
  scale_x_continuous(expand = c(0, 0)) +
  # Add labels
  labs(
    title = "Murders in Malaysian states, 2017",
    caption = "Data from the Royal Malaysian Police",
    x = "number of murders",
    y = NULL
  ) +
  # Remove unnecessary chart elements
  theme_minimal() +
  theme(
    # Remove unnecessary horizontal grid lines
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank()
  )

15.3.2 Colour in bar charts

We can further improve our bar chart by using colour to indicate which states are in which of the two regions of Malaysia. To do this, we will:

  • specify in the call to aes() that the fill colour of the bars should be controlled by the region column in the data,
  • specify in the call to labs() that we don’t want the legend to have a title, since the meaning is obvious from the values ‘East Malaysia’ and ‘West Malaysia’, and
  • specify in the call to theme() that we would like the legend to use up some of the empty space in the bottom-right corner of the chart, rather than making the chart smaller to give space for the legend on the right-hand side.

To move the legend, we need to specify several different arguments in the theme() function. legend.position determines where around the plot the legend should be placed: ‘top’, ‘right’, ‘bottom’, ‘left’ or ‘inside’. In this case we want the legend to appear in some spare space on the plot itself, so we will set legend.position = "inside". We then use the legend.position.inside argument to theme() to specify exactly where inside the plot we want the legend to appear. We do that by specifying where the legend should appear horizontally and vertically, as a proportion of the axis length, on a scale from zero to one.

Using this specification, we can place the legend in the right-most point on the horizontal axis and the bottom-most point on the vertical axis by specifying legend.position.inside = c(1, 0).

legend.position.inside sets the anchor point from which the legend is created, with the actual size of the legend depending on how much space is required by its contents. By default, a legend will spread out in all directions from the anchor point, i.e. the legend will be horizontally and vertically centred on the anchor point. As we have positioned the legend in a corner of the plot, this is probably not what we want since some of the legend will be hidden outside the plot area. Instead, we can set the legend.justification argument of theme() using a similar specification to that for legend.position.inside based on which way we want the legend to grow.

If you want the legend to grow ‘inwards’ from a corner, just set legend.justification to the same value as you used for legend.position.inside. In this case, we want the legend to be anchored in the bottom-right corner and to grow inwards from it, so we will set both arguments to c(1, 0).

Add this code to your script file and run it.

R Console
# Create bar chart of murder counts
malaysia_murder_bar_chart <- violence |> 
  # Keep only rows representing murder counts
  filter(crime_type == "murder") |> 
  # Re-order states according to number of murders
  mutate(state = fct_reorder(state, count)) |> 
  ggplot() +
  # Translate columns in the data to aesthetics on the chart
  aes(x = count, y = state, fill = region) +
  # Add bars
  geom_col() +
  # Remove space at either end of horizontal axis
  scale_x_continuous(expand = c(0, 0)) +
  # Add labels
  labs(
    title = "Murders in Malaysian states, 2017",
    caption = "Data from the Royal Malaysian Police",
    x = "number of murders",
    y = NULL
  ) +
  # Remove unnecessary chart elements
  theme_minimal() +
  theme(
    # Move legend to bottom-right corner and give it a solid white background
    legend.background = element_rect(colour = NA, fill = "white"),
    legend.justification = c(1, 0),
    legend.position = "inside",
    legend.position.inside = c(1, 0),
    # Remove unnecessary horizontal grid lines
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank()
  )
R Console
malaysia_murder_bar_chart

One issue with this chart is that the default colours that ggplot() produces are not easy for everyone to discern. In particular, people with colour blindness may struggle to distinguish between some combinations of colours. Some colour combinations are also hard (or impossible) to distinguish even for people with normal colour vision if a chart is printed in black and white or viewed on a screen in some lighting conditions.

We can check how well people with different colour vision will be able to read a chart using the cvd_grid() function from the colorblindr package. This function takes an existing ggplot() stack and prints several versions of the chart that simulate how different people will see it.

The colorblindr package is not on CRAN, the repository we usually install R packages from. That means we need to use slightly different code to install it. Instead of installing from CRAN, we will instead install the package from GitHub, a website that programmers use to store versions of their code. To install packages from GitHub we can use the p_install_gh() function from the pacman package (the same package we use to load packages at the start of each R script).

R Console
# Install the colorblindr package from GitHub
pacman::p_install_gh("clauswilke/colorblindr")

Remember that because we only need to install a package once on each computer we use R on, you should never install packages inside an R script. This means you should only ever run pacman::p_install_gh() in the R Console, never in an R script.

Once you’ve installed the colorblindr package, you can use it to check how different people are likely to see the chart of murder in Malaysia.

R Console
# Check if chart colours are safe for different people
colorblindr::cvd_grid(malaysia_murder_bar_chart)

From this we can see that this combination of colours works well for people with some types of colour blindness, but is likely to be hard to read for some people, and indeed for everyone if the chart is printed on a black-and-white printer.


Fortunately, there are lots of different R packages that provide colour palettes that are suitable for people with different colour vision. The paletteer package brings a lot of these colour palettes together in one place. paletteer provides three pairs of functions for different types of colour scale:

  • scale_colour_paletteer_c()/scale_fill_paletteer_c() for continuous scales that are suitable for representing continuous variables.
  • scale_colour_paletteer_d()/scale_fill_paletteer_d() for discrete scales that are suitable for representing categorical variables.
  • scale_colour_paletteer_binned()/scale_fill_paletteer_binned() for binned scales that are suitable for showing continuous variables that have been sub-divided (‘binned’) into ordered categories.

We can use each of these functions in the same way we have used functions like scale_fill_distiller() in previous chapters. The first argument to all the main functions in the paletteer package is the name of a colour palette, of which a total of 2,759 different palettes are available. We can specify which palette to use using the same syntax we have sometimes used to refer to R functions: package_name::palette_name. For example, to create a discrete colour scale using the OKeeffe2 palette from the MetBrewer package (a palette inspired by the painting Red and Yellow Cliffs by Georgia O’Keeffe) we could use scale_colour_paletteer_d("MetBrewer::OKeeffe2").
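If you want to look at the colours in a palette before using it on a chart, one option is the paletteer_d() function, which returns the colours in a discrete palette. This is a minimal sketch, assuming the paletteer package is installed; okeeffe_colours is just a name chosen for this example.

```r
library(paletteer)

# Look up a discrete palette using the package_name::palette_name syntax
okeeffe_colours <- paletteer_d("MetBrewer::OKeeffe2")

# Print the colours in the palette
okeeffe_colours

# Check how many colours the palette contains
length(okeeffe_colours)
```

Checking the number of colours is useful because a discrete palette needs at least as many colours as there are categories in the data.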

There are many useful colour palettes in the PrettyCols package. Let’s use the Bright palette from this package to control the colours on the chart.

chapter15a.R
# Create bar chart of murder counts
malaysia_murder_bar_chart <- violence |> 
  # Keep only rows representing murder counts
  filter(crime_type == "murder") |> 
  # Re-order states according to number of murders
  mutate(state = fct_reorder(state, count)) |> 
  ggplot() +
  # Translate columns in the data to aesthetics on the chart
  aes(x = count, y = state, fill = region) +
  # Add bars
  geom_col() +
  # Remove space at either end of horizontal axis
  scale_x_continuous(expand = c(0, 0)) +
  # Specify colour-blind-safe fill colours
  scale_fill_paletteer_d("PrettyCols::Bright") +
  # Add labels
  labs(
    title = "Murders in Malaysian states, 2017",
    caption = "Data from the Royal Malaysian Police",
    x = "number of murders",
    y = NULL
  ) +
  # Remove unnecessary chart elements
  theme_minimal() +
  theme(
    # Move legend to bottom-right corner and give it a solid white background
    legend.background = element_rect(colour = NA, fill = "white"),
    legend.justification = c(1, 0),
    legend.position = "inside",
    legend.position.inside = c(1, 0),
    # Remove unnecessary horizontal grid lines
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank()
  )
R Console
malaysia_murder_bar_chart

To see if these colours are likely to work for different people, we can again check the colours with the colorblindr package:

R Console
# Check if chart colours are safe for different people
colorblindr::cvd_grid(malaysia_murder_bar_chart)

From this, we can see that this colour palette is going to be much more useful for people with different colour vision, as well as for everyone if the chart is printed in black and white or viewed on a screen in bad light.

There are lots of other ways to control colours on charts and maps in R. For more detail, read Working with colours in R.

Bar charts are a very common way of presenting a numeric variable for each value of a categorical variable. Bar charts are easy to interpret, even for people who are not used to interpreting charts or who only have time to look at the chart for a few seconds.

Quiz

When creating a bar chart, why might you want to sort the bars by value?

What is one reason why a reference map might be included alongside a bar chart?

Why might a bar chart be more effective than a choropleth map for presenting crime data?

15.4 Visualising distributions

Bar charts show a single piece of information about each category present in a dataset. So we might use a bar chart to show, for example, the average number of burglaries in neighbourhoods in different districts. But what if the average values masked substantial differences in the number of burglaries within each district? Averages often mask variation, and can sometimes be misleading as a result. In those circumstances it would be better to show more detail rather than a misleading average.

Let’s start with the simple example of showing the distribution of burglary counts within a single district. Restart R (Session > Restart R), open a new R script file and save it as chapter15b.R. Use this code to load a dataset of burglaries in each lower-layer super output area (LSOA) in Northamptonshire in England in 2020.

chapter15b.R
# Load packages
pacman::p_load(tidyverse)

# Load data
burglary <- read_rds("https://mpjashby.github.io/crimemappingdata/northants_burglary_counts.rds")

To show the distribution of burglary counts we can create a histogram using geom_histogram(). A histogram divides the range of values present in the data into a number of equally sized bins, then shows bars representing the number of observations (rows) in the data that have values fitting into each bin. We can either allow geom_histogram() to set the number of bins automatically, or control the bins ourselves by setting their width with the binwidth argument.

We will set the binwidth argument of the geom_histogram() function to binwidth = 1 so that each bar on the chart will show how many LSOAs have each individual value. We will also add some labels to help readers interpret the chart.
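To see what the bars on such a histogram represent, we can produce the same tally as a table. This sketch uses a handful of made-up burglary counts (toy_burglary is hypothetical, not the Northamptonshire data) and assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical burglary counts for seven neighbourhoods
toy_burglary <- tibble(count = c(0, 1, 1, 2, 2, 2, 5))

# Tally how many neighbourhoods have each number of burglaries: with whole
# numbers and binwidth = 1, this is what each bar on the histogram shows
toy_tally <- toy_burglary |>
  count(count, name = "number_of_lsoas")

toy_tally
```

In this made-up example the tallest bar would be at a count of two burglaries (three neighbourhoods), with no bars at all at counts of three or four.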

chapter15b.R
# Create histogram of burglary counts
burglary |> 
  ggplot() +
  # Specify which column in the data contains the value we're interested in
  aes(x = count) +
  # Add a histogram to the chart
  geom_histogram(binwidth = 1) +
  # Remove unnecessary space at either end of both axes
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  # Add labels
  labs(
    title = "Number of burglaries in Northamptonshire neighbourhoods",
    x = "count of burglaries, 2020",
    y = "number of LSOAs"
  ) +
  theme_minimal() +
  theme(
    # Remove unnecessary vertical grid lines
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank()
  )

Interpreting histograms can be slightly counter-intuitive. The important thing to remember is that the horizontal position of each bar represents a number of burglaries and the vertical height of the bar represents how many neighbourhoods had that number of burglaries. This is the opposite of a bar chart, where the length of the bar represents the number of burglaries.

We can see on this chart that most LSOAs had only a few burglaries in 2020 (represented by the tallest bars being to the left of the chart), while a few LSOAs had a larger number (the bars to the right of the chart). This is what we would expect, since we know that crime is generally concentrated in a few places.

15.4.1 Plotting density curves

Histograms are one way to show the distribution of a variable (in this case, the count of burglaries). Another way to show the distribution of a variable is to create a density curve with geom_density(). A density curve is a smoothed version of a histogram, which is useful to show the general distribution of a variable (in this case, the number of LSOAs with different numbers of burglaries) at the cost of not showing the exact data. The mathematical procedure used by geom_density() to calculate a density curve is the same as the kernel-density estimation process we have already learned to use to show concentrations of crime on a map.
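As a rough illustration of one-dimensional kernel density estimation, the sketch below uses the density() function that comes with R on a few made-up burglary counts. The toy_counts values are hypothetical, and the settings that geom_density() uses may differ from the defaults shown here.

```r
# Hypothetical burglary counts for ten neighbourhoods
toy_counts <- c(0, 1, 1, 2, 2, 2, 3, 4, 6, 12)

# Estimate a density curve using the default (automatically chosen) bandwidth
toy_density <- density(toy_counts)

# The bandwidth that controls how much the curve is smoothed
toy_density$bw

# The curve is stored as pairs of x positions and estimated density values
head(data.frame(x = toy_density$x, density = toy_density$y))
```

As with the density maps in earlier chapters, a larger bandwidth produces a smoother curve and a smaller bandwidth produces a curve that follows the individual data points more closely.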

chapter15b.R
# Create a density plot of burglaries
burglary |> 
  ggplot() +
  # Specify which column in the data contains the value we're interested in
  aes(x = count) +
  # Add a density curve to the chart
  geom_density(colour = "red", linewidth = 1) +
  # Remove unnecessary space at either end of the x axis and show values on the
  # y axis as percentages
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(labels = scales::label_percent()) +
  # Add labels
  labs(
    title = "Number of burglaries in Northamptonshire neighbourhoods",
    x = "count of burglaries, 2020",
    y = "percentage of LSOAs"
  ) +
  theme_minimal() +
  theme(
    # Remove unnecessary vertical grid lines
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank()
  )

We can use density curves to show the distribution of a variable across multiple categories at once. For example, we could show the distribution of burglary counts at the neighbourhood level for all the districts in Northamptonshire. To do this we use the geom_density_ridges() function from the ggridges package to create a ridge plot. Although this function does not come from the ggplot2 package, it is designed to be used inside a ggplot() stack.

chapter15b.R
# Add ridge plot of Northamptonshire burglary
burglary |> 
  # Wrap the district names by replacing any space in a name with a new-line
  mutate(district = str_replace_all(district, "\\s", "\n")) |> 
  ggplot() +
  # Specify which columns in the data contain the values we're interested in
  aes(x = count, y = district) +
  # Add ridge plot
  ggridges::geom_density_ridges() +
  # Remove unnecessary space at either end of x axis
  scale_x_continuous(expand = c(0, 0)) +
  # Add labels
  labs(
    title = "Number of burglaries in Northamptonshire neighbourhoods",
    x = "count of burglaries, 2020",
    y = NULL
  ) +
  theme_minimal()
Picking joint bandwidth of 1.98

As you know from previous chapters, density estimation depends on us choosing a bandwidth to control the degree of smoothing between data points. By default, geom_density_ridges() chooses a suitable bandwidth automatically and reports this in a message. The bandwidth is referred to as ‘joint’ because the same bandwidth is used for all the density curves on a chart.

If you wanted to include a ridge plot in a Quarto document, you would probably not want this message to appear in your report. To suppress the message, you can use the Quarto chunk option #| message: false.

The ridge plot shows the distribution of burglary counts in LSOAs within each district, with the distributions overlapping slightly to save space. From this we can see that across all districts most LSOAs have few burglaries, with a small number of LSOAs having more. We can also see there are a small number of LSOAs (probably, in fact, just one LSOA) in Wellingborough district with a much higher number of burglaries than anywhere else in Northamptonshire.

15.4.2 Small-multiple charts

Density plots can be helpful to summarise a lot of information, but they have some disadvantages. In particular, they don’t show that the number of LSOAs in each district is quite different: there are 131 LSOAs in Northampton but only 41 LSOAs in Corby. To make this clearer we can instead produce several histograms, one for each district – what are called small-multiple charts.

We could create small-multiple charts by producing a separate histogram for each district and then combine them with the patchwork package, but that would involve a lot of repeated code. Fortunately, we can use a feature of the ggplot2 package called faceting to split our single histogram into multiple plots based on a column in the data (in this case, the district name).

Adding facet_wrap() to a ggplot() stack will cause R to create multiple plots and wrap them across multiple rows and columns so that they approximately fit into the available space. If we only want the small multiples to appear on top of each other (i.e. in multiple rows) or next to each other (i.e. in multiple columns), we can use the facet_grid() function. In this case we want the small multiples to appear on top of each other, so we will use facet_grid() and say that the small multiples (which ggplot2 calls facets) should be based on the district column in the data by specifying rows = vars(district) (it is necessary to wrap the name of the column that you want to use as the basis of the small multiples in the vars() helper function, but we do not need to go into why).
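The difference between the two faceting functions can be sketched using mpg, an example dataset about cars that is included in the ggplot2 package (it is used here only because it is always available, and has nothing to do with crime). The drv column in that dataset has three different values, so each chart will have three facets.

```r
library(ggplot2)

# A single histogram that we will then split into small multiples
base_chart <- ggplot(mpg) +
  aes(x = hwy) +
  geom_histogram(binwidth = 1)

# facet_wrap() arranges the facets across rows and columns to fit the space
wrapped <- base_chart + facet_wrap(vars(drv))

# facet_grid(rows = ...) stacks the facets on top of each other
stacked <- base_chart + facet_grid(rows = vars(drv))
```

Stacking the facets in a single column means they all share the same horizontal axis, which makes it easier to compare where the bars fall in each one.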

chapter15b.R
# Create histogram of burglary counts
burglary |> 
  ggplot() +
  # Specify which column in the data contains the value we're interested in
  aes(x = count) +
  # Add a histogram to the chart
  geom_histogram(binwidth = 1) +
  # Split chart into small multiples, one for each district
  facet_grid(rows = vars(district), labeller = label_wrap_gen(width = 10)) +
  # Remove unnecessary space at either end of both axes
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  # Add labels
  labs(
    title = "Number of burglaries in Northamptonshire neighbourhoods",
    x = "count of burglaries, 2020",
    y = "number of LSOAs"
  ) +
  theme_minimal() +
  theme(
    # Remove unnecessary vertical grid lines
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    # Control alignment of facet titles
    strip.text.y = element_text(angle = 0, hjust = 0)
  )

You might have noticed we made some other changes to our code for this chart to make it clearer:

  • Wrapped the facet labels using the label_wrap_gen() helper function so that some of the longer district names don’t take up too much space horizontally.
  • Made the facet labels easier to read by making the text horizontal (rather than the default vertical text) using the strip.text.y argument of theme() and the element_text() helper function. angle sets the rotation of the text (or in this case, specifies that there should be no rotation) and hjust = 0 specifies that the text should be left aligned.

There are many more-technical ways to show distributions, such as box plots or violin plots. However, these can be difficult to interpret for people who are not used to looking at those particular types of chart, so they should probably be avoided for communicating with general audiences.

15.5 Comparing continuous variables

So far we have used bar charts to communicate a single number (in our example, a number of murders) for each value of a categorical variable (the name of each Malaysian state or territory), and histograms to show multiple numbers (burglary counts for each neighbourhood) for each value of a categorical variable (districts in Northamptonshire).

Both these types of chart compare a numeric variable to a categorical one. But sometimes we may want to compare two continuous variables. We can do this with a scatter plot.

Restart R (Session > Restart R), open a new R script file and save it as chapter15c.R. Use this code to load rates of thefts of and from motor vehicles per 1,000 households saying they own a vehicle for a selection of municipalities in South Africa.

chapter15c.R
# Load packages
pacman::p_load(ggrepel, tidyverse)

# Load South African vehicle theft rates
vehicle_theft <- read_rds("https://mpjashby.github.io/crimemappingdata/south_africa_vehicle_theft.rds")

Since thefts of vehicles and thefts from vehicles are different but related crimes, we might want to see if there is a relationship between the rates of each type.

To create a ggplot() scatter plot we use geom_point(), the same function we previously used to create point maps. This makes sense, since point maps are a specialised type of scatter plot in which the x and y axes of the chart show the latitude and longitude or easting and northing of each crime location.

The data in the vehicle_theft object are in long format, with each row representing the rate of crime in a particular category for a particular municipality. To make a scatter plot where each point represents a municipality, we need to have all the data for a municipality in a single row of data, so we will need to transform the data with pivot_wider() (as we did for some of the tables at the start of this chapter). This converts the values of the crime_category column into column names. Since these names are quite long, we first convert them to snake case using janitor::clean_names() and then use rename() to shorten the names so they are easier to work with.
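As a reminder of what pivot_wider() does, here is a minimal sketch using a made-up dataset. The municipalities, category names and rates in toy_long are hypothetical and much simpler than the real data. The example assumes the tidyr and dplyr packages are installed.

```r
library(tidyr)
library(dplyr)

# Hypothetical long-format data: one row per municipality and crime category
toy_long <- tibble(
  municipality = c("A", "A", "B", "B"),
  crime_category = c("theft of", "theft from", "theft of", "theft from"),
  theft_rate = c(1.5, 10.2, 0.4, 3.1)
)

# Wide format: one row per municipality, one column per crime category
toy_wide <- toy_long |>
  pivot_wider(names_from = crime_category, values_from = theft_rate)

toy_wide
```

The new column names contain spaces because they are copied directly from the values in the crime_category column, which is why it is usually helpful to clean up the names afterwards.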

Add this code to the R script file.

chapter15c.R
# Convert vehicle theft data to wide format
vehicle_theft_wider <- vehicle_theft |> 
  pivot_wider(names_from = crime_category, values_from = theft_rate) |> 
  janitor::clean_names() |> 
  rename(
    theft_of = theft_of_motor_vehicle,
    theft_from = theft_out_of_or_from_motor_vehicle
  )

We can now make a basic scatter plot.

R Console
# Create scatter plot of vehicle theft
ggplot(vehicle_theft_wider) +
  # Specify which columns in the data should control the x and y positions of
  # each point
  aes(x = theft_of, y = theft_from) +
  # Add the points
  geom_point() +
  # Specify the format for the axis labels
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::comma_format()) +
  # Add labels
  labs(
    title = "Vehicle thefts in South African municipalities",
    subtitle = "each dot represents one municipality, 2018-19",
    x = "rate of thefts of motor vehicles per 1,000 vehicle-owning households",
    y = "rate of thefts from motor vehicles per 1,000 vehicle-owning households"
  ) +
  theme_minimal()

From this plot we can see that most areas have low rates of both theft of and theft from motor vehicles, with a few areas having very high rates of one type or the other (but none have high rates of both).

Looking at the bottom-left corner of the chart we can see that we have again encountered the problem of overlapping points making patterns less clear. We can try to deal with this by making the points semi-transparent using the alpha argument to geom_point().

Scatter plots can be hard for people to interpret, especially if they are not used to interpreting charts. To help readers, we can annotate the plot to show how to interpret each region of the chart. We will add two types of annotation: lines to show the median value on each axis, and labels to help interpretation.

We can add median lines using the geom_hline() and geom_vline() functions, which add horizontal and vertical lines to plots. We will add these to the ggplot() stack before geom_point() so that the lines appear behind the points.

To add text annotations we use the annotate() function from ggplot2, which allows us to add data to a chart by specifying the aesthetics (x and y position, etc.) directly rather than by referencing columns in the data. To add a text annotation, we set the geom argument of annotate() to "text".

R Console
# Create scatter plot of vehicle theft
ggplot(vehicle_theft_wider) +
  # Specify which columns in the data should control the x and y positions of
  # each point
  aes(x = theft_of, y = theft_from) +
  # Add vertical line showing median rate of theft of a vehicle
  geom_vline(
    xintercept = median(pull(vehicle_theft_wider, "theft_of")),
    linetype = "22"
  ) +
  # Add horizontal line showing median rate of theft from a vehicle
  geom_hline(
    yintercept = median(pull(vehicle_theft_wider, "theft_from")),
    linetype = "22"
  ) +
  # Add points
  geom_point(alpha = 0.5) +
  # Add annotations to aid interpretation
  annotate(
    geom = "text", 
    x = 20, 
    y = 0, 
    label = "high rate of thefts of vehicles\nlow rate of thefts from vehicles", 
    hjust = 1,
    lineheight = 1
  ) +
  annotate(
    geom = "text", 
    x = 1, 
    y = 75, 
    label = "low rate of thefts of vehicles\nhigh rate of thefts from vehicles", 
    hjust = 0,
    lineheight = 1
  ) +
  # Specify the format for the axis labels
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::comma_format()) +
  # Add labels
  labs(
    title = "Vehicle thefts in South African municipalities",
    subtitle = str_glue(
      "each dot represents one municipality, 2018-19, dashed lines show ",
      "median values"
    ),
    x = "rate of thefts of motor vehicles per 1,000 vehicle-owning households",
    y = "rate of thefts from motor vehicles per 1,000 vehicle-owning households"
  ) +
  theme_minimal()

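From this plot we can now see that many municipalities have low rates of both types of theft (shown by the dots below and to the left of the median lines). Each median line has half of the municipalities on either side of it, so the dots in the bottom-left quadrant are those with below-median rates of both types of theft.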

We can make some further changes to this chart. For example, instead of labelling areas on the plot we could label the municipalities with high rates of vehicle theft (we cannot include both types of label because they would overlap). To do that, we will create a new column in the data containing either the municipality name (for high-rate municipalities) or NA. ggplot() will not create a label for rows where the label is NA, and setting na.rm = TRUE stops it from producing a warning about the rows it has dropped. We can then use geom_label_repel() to add the labels to the chart, remembering to add label = label to the aes() function so ggplot() knows which column in the data to use for the labels.

R Console
# Create scatter plot of vehicle theft
vehicle_theft_wider |> 
  # Create a new column in the data, either containing the municipality name or
  # `NA` depending on the values of `theft_of` and `theft_from`
  mutate(label = if_else(theft_of > 17 | theft_from > 65, municipality, NA)) |> 
  ggplot() +
  # Specify which columns in the data should control the x and y positions of
  # each point and the labels (for those points that have labels)
  aes(x = theft_of, y = theft_from, label = label) +
  # Add vertical line showing median rate of theft of a vehicle
  geom_vline(
    xintercept = median(pull(vehicle_theft_wider, "theft_of")),
    linetype = "22"
  ) +
  # Add horizontal line showing median rate of theft from a vehicle
  geom_hline(
    yintercept = median(pull(vehicle_theft_wider, "theft_from")),
    linetype = "22"
  ) +
  # Add points
  geom_point(alpha = 0.5) +
  # Add labels
  geom_label_repel(na.rm = TRUE, label.size = 0, lineheight = 1) +
  # Specify the format for the axis labels
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::comma_format()) +
  # Add labels
  labs(
    title = "Vehicle thefts in South African municipalities",
    subtitle = str_glue(
      "each dot represents one municipality, 2018-19, dashed lines show ",
      "median values"
    ),
    x = "rate of thefts of motor vehicles per 1,000 vehicle-owning households",
    y = "rate of thefts from motor vehicles per 1,000 vehicle-owning households"
  ) +
  theme_minimal()

Finally, we can add a trend line to the plot. We do this using the geom_smooth() function from ggplot2. geom_smooth() can add different types of trend line to a plot, but in this example we will specify a simple linear trend line by setting method = "lm". We will also specify formula = y ~ x (the default) to avoid geom_smooth() producing a message to tell us what formula it used to calculate the trend.

Add this code to the R script file and run it.

chapter15c.R
# Create scatter plot of vehicle theft
vehicle_theft_wider |> 
  # Create a new column in the data, either containing the municipality name or
  # `NA` depending on the values of `theft_of` and `theft_from`
  mutate(label = if_else(theft_of > 17 | theft_from > 65, municipality, NA)) |> 
  ggplot() +
  # Specify which columns in the data should control the x and y positions of
  # each point and the labels (for those points that have labels)
  aes(x = theft_of, y = theft_from, label = label) +
  # Add vertical line showing median rate of theft of a vehicle
  geom_vline(
    xintercept = median(pull(vehicle_theft_wider, "theft_of")),
    linetype = "22"
  ) +
  # Add horizontal line showing median rate of theft from a vehicle
  geom_hline(
    yintercept = median(pull(vehicle_theft_wider, "theft_from")),
    linetype = "22"
  ) +
  # Add trend line
  geom_smooth(method = "lm", formula = y ~ x, colour = "grey20") +
  # Add points
  geom_point(alpha = 0.5) +
  # Add labels
  geom_label_repel(na.rm = TRUE, label.size = 0, lineheight = 1) +
  # Specify the format for the axis labels
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::comma_format()) +
  # Add labels
  labs(
    title = "Vehicle thefts in South African municipalities",
    subtitle = str_glue(
      "each dot represents one municipality, 2018-19, dashed lines show ",
      "median values"
    ),
    x = "rate of thefts of motor vehicles per 1,000 vehicle-owning households",
    y = "rate of thefts from motor vehicles per 1,000 vehicle-owning households"
  ) +
  theme_minimal()
Warning: The following aesthetics were dropped during statistical transformation: label.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

The code that produces the chart above also produces a warning saying some aesthetics were dropped during a statistical transformation. This is because geom_smooth() does not know how to use the label aesthetic. There are two options to prevent this warning appearing in a Quarto document. First, you could set the #| warning: false chunk option, although this will suppress any other warnings produced by this code chunk, so you should only add this chunk option once you are sure the code runs without any problems.

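The second option is to remove label = label from the call to aes() on line 9 of the code and instead add aes(label = label) as an argument to geom_label_repel() on line 25. The label aesthetic is then only supplied to the layer that uses it, so geom_smooth() never sees it.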
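The second option can be sketched on a small example. The toy_theft data below is hypothetical (the municipality names and rates are made up to stand in for vehicle_theft_wider), and the chart is stripped down to the layers that matter for the warning.

```r
library(ggplot2)
library(ggrepel)

# Hypothetical data standing in for `vehicle_theft_wider`; the values and
# municipality names are made up for illustration
toy_theft <- data.frame(
  municipality = c("A", "B", "C", "D", "E", "F"),
  theft_of = c(1, 2, 3, 4, 5, 20),
  theft_from = c(5, 9, 12, 70, 18, 25)
)

# Label only the municipalities with unusually high rates
toy_theft$label <- ifelse(
  toy_theft$theft_of > 17 | toy_theft$theft_from > 65,
  toy_theft$municipality,
  NA
)

p_labels <- ggplot(toy_theft) +
  # The shared aesthetics no longer include `label` ...
  aes(x = theft_of, y = theft_from) +
  geom_smooth(method = "lm", formula = y ~ x, colour = "grey20") +
  geom_point(alpha = 0.5) +
  # ... because it is now supplied only to the layer that uses it
  geom_label_repel(aes(label = label), na.rm = TRUE) +
  theme_minimal()

p_labels
```

Aesthetics specified inside a particular geom_*() function apply only to that layer, while aesthetics specified at the top level of the plot are passed to every layer.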

From this chart, we can see which municipalities have particularly unusual vehicle-theft rates. For example, we might well want to explore the rates of theft from vehicles in Beaufort West or Stellenbosch municipalities to see what makes them so different from the others, and similarly for the rate of theft of vehicles in Ethekwini.

One note of caution when using geom_smooth(): this function will show the direction of the relationship between two variables regardless of the strength of that relationship. In extreme cases, that could mean that a chart would show a trend line between two variables even if the variables had almost no relationship to one another.

For example, geom_smooth() produces a line showing the direction of the relationship between the two variables in each of these three charts, even though the relationship on the right is much stronger than the one on the left.
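A comparison of this kind can be sketched with simulated data. In the code below the three relationships share the same underlying direction but have different amounts of random noise added. The noise levels, the sim data and the object names are arbitrary choices made for illustration, not the data behind the charts described above.

```r
library(ggplot2)

set.seed(42)
n <- 200
x <- rnorm(n)

# Three simulated relationships with the same underlying direction but
# different amounts of noise (the noise levels are arbitrary)
sim <- rbind(
  data.frame(strength = "weak", x = x, y = x + rnorm(n, sd = 5)),
  data.frame(strength = "moderate", x = x, y = x + rnorm(n, sd = 1)),
  data.frame(strength = "strong", x = x, y = x + rnorm(n, sd = 0.2))
)
# Order the panels from weakest (left) to strongest (right)
sim$strength <- factor(sim$strength, levels = c("weak", "moderate", "strong"))

p_trend <- ggplot(sim) +
  aes(x = x, y = y) +
  geom_point(alpha = 0.5) +
  # A linear trend line is drawn in every panel, however noisy the data
  geom_smooth(method = "lm", formula = y ~ x, colour = "grey20") +
  facet_wrap(vars(strength)) +
  theme_minimal()

p_trend

# Pearson correlation in each panel, as a rough indication of strength
cors <- sapply(split(sim, sim$strength), function(d) cor(d$x, d$y))
cors
```

Every panel gets a trend line, but the points in the weak panel are scattered far more widely around it. The correlation coefficients printed at the end give a rough numerical indication of the same difference.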

Don’t try to interpret the strength of a relationship from a trend line

Be very careful about trying to interpret the strength of relationships between two variables by plotting them on a chart. It is much better to measure the strength of the relationship using a statistical test such as a correlation test, but statistical tests are outside the scope of this course.

Visualising distributions

Which of these types of visualization is best used to show the distribution of a single continuous variable?

What is an advantage of using a density plot over a histogram?

15.6 In summary

In this chapter we have learned how to present data about crime at places without using maps. These techniques give us more flexibility about how to best present data to communicate the main points that we want to get across.

Whether to use a map or a chart, and which type of map or chart to use, are design decisions for you to make. When you make these decisions, always remember that what is most important is that your audience understands your message. This makes it very important that you understand your audience.

Visualising data with charts is a very large topic and there are lots of resources available to help you learn more. To get started, you might want to look at:

Check your knowledge: Revision questions

Answer these questions to check you have understood the main points covered in this chapter. Write between 50 and 100 words to answer each question.

  1. Explain at least two scenarios where using a table or a chart would be more effective than a map for presenting spatial data.
  2. Why might you choose to use a bar chart rather than a choropleth map for displaying crime data in some circumstances?
  3. When should you use a table instead of a bar chart to convey information about a continuous variable for each of several categories?
  4. When is it useful to include a reference map alongside a chart, and when is it unnecessary?
  5. What factors should you consider when choosing the best way to present spatial data?