
Introduction to ggplot2


Session 4

Artwork by @allison_horst

1 / 56

Acknowledgement

This session shadows Chapter 3 of the excellent:

2 / 56

ggplot2

Is one of several plotting systems in R

{plotly} is used by Public Health Scotland

3 / 56

Why is ggplot popular?

  1. Well designed and supported
  2. Highly versatile
  3. Attractive graphics (with a little work)
4 / 56

5 / 56

ggplot2

ggplot2 is part of the tidyverse.
So, at the top of your script type:

library(tidyverse)
6 / 56

Project 1:

Let’s explore a perennial
challenge for the NHS:

7 / 56

Pressures in A&E

  1. Picture 1 by Unknown Author is licensed under CC BY SA NC
  2. Picture 2 by Unknown Author is licensed under CC BY SA NC
8 / 56

Data: Capacity in A&E

The dataset we loaded earlier, capacity_ae, shows
changes in the capacity of A&E departments from
2017 to 2018

Closely based on datasets collected by the NHS Benchmarking Network

The object named capacity_ae is a data frame

9 / 56

What is a data frame?

A data frame stores tabular data:

Artwork by @allison_horst

10 / 56

tibble = data frame

In the tidyverse you may see the term "tibble"
We’ll take "tibble" to be synonymous with "data frame"

A tibble... is a modern reimagining of the data.frame...

Tibbles are data.frames that are lazy and surly: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and complain more (e.g. when a variable does not exist).

This forces you to confront problems earlier, typically leading to cleaner, more expressive code.

emphasis added to quote
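
The "don't do partial matching" point in the quote can be seen in a small sketch. The two objects below are made-up toy data (not the real capacity_ae dataset), and the tibble package is assumed to be installed:

```r
library(tibble)

# Hypothetical toy data with a single column, for illustration only
df <- data.frame(dwait = c(5, -3, 12))
tb <- tibble(dwait = c(5, -3, 12))

# A data frame partially matches column names when using $,
# so asking for "dw" quietly returns the dwait column
df$dw

# A tibble does not: it warns about the unknown column and returns NULL
tb$dw
```

The tibble's complaint is the behaviour the quote describes: a mistyped column name surfaces straight away.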

11 / 56

Viewing the data frame

Option 1

12 / 56

Viewing the data frame

This brings up a view of the data in a new tab:

13 / 56

Viewing the data frame

Click here to show the data frame in a new window

Useful when using multiple monitors

14 / 56

Viewing the data frame

Option 2

Type the name of the dataset in editor/console, and run the line (shortcut Ctrl + Enter)
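
As a rough sketch of Option 2, the code below prints a data frame by typing its name, then lists its variable names and types. capacity_ae_demo and its values are hypothetical stand-ins for capacity_ae, reusing only column names that appear in this session:

```r
# Hypothetical stand-in for capacity_ae (made-up values, for illustration only)
capacity_ae_demo <- data.frame(
  site = c("A", "B", "C"),
  dcubicles = c(-2, 0, 4),
  dwait = c(15, 3, -10)
)

capacity_ae_demo         # typing the name prints the data frame
names(capacity_ae_demo)  # the variable (column) names
str(capacity_ae_demo)    # the type of each column
```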

15 / 56

Q. Do we understand the variable names?

(and what they mean)

16 / 56

"The simple graph has brought more information to the data analyst's mind than any other device"

John Tukey, quoted in R for Data Science

17 / 56

Q. Is a change in the number of cubicles available in A&E associated with a change in length of attendance?

Let's explain the code

We begin our plot with ggplot2

ggplot() +

Inside ggplot() we can specify the dataset

ggplot(data = capacity_ae)

Next, we add layer(s) with + at the end

ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait))
18 / 56

Choices

There are choices about the chart to use but also the details of the chart

19 / 56

Choices

1. What shape will represent the data points? A geometric object (geom)

2. What visual (aesthetic) attributes do we give to the geom?

size
position (x axis)
position (y axis)
colour

26 / 56

A statistical graphic

The shape, colour and size of the geom are all left at their defaults

ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait))
27 / 56

Functions ()

ggplot(), geom_point(), and aes() are functions

Running a function does something
Functions are given zero or more inputs (arguments)
Arguments of a function are separated by commas

28 / 56

Functions ()

You can explicitly name arguments:

ggplot(data = capacity_ae) +

Or not:

ggplot(capacity_ae) +
29 / 56

Functions ()

Arguments such as x and y are expected in a particular order:

ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait))

Because they are named, it is possible to write them the other way round:

ggplot(data = capacity_ae) +
  geom_point(aes(y = dwait, x = dcubicles))

But this could be confusing.

30 / 56

Functions ()

Here, we have provided ggplot() with one named argument

ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait))

And given aes() two named arguments

Arguments that are left unspecified will often fall back to default values

31 / 56

Shorthand

Since ggplot2 knows the order of essential arguments, it is not necessary to name arguments:

data = can be omitted

and

x = goes first and y = goes second

ggplot(capacity_ae) +
  geom_point(aes(dcubicles, dwait))
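
One way to check that the shorthand and the fully named version describe the same plot is to compare the values ggplot2 computes for each. This is a sketch on made-up toy data (demo is hypothetical, not the real capacity_ae), using layer_data() from ggplot2 to pull out the computed x and y positions:

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
demo <- data.frame(
  dcubicles = c(-2, 0, 1, 3, 5, 8),
  dwait = c(18, 9, 8, 1, -6, -13)
)

# Fully named arguments
p_named <- ggplot(data = demo) +
  geom_point(aes(x = dcubicles, y = dwait))

# Shorthand: data comes first in ggplot(), x then y in aes()
p_short <- ggplot(demo) +
  geom_point(aes(dcubicles, dwait))

# layer_data() returns the values computed for a layer (the first by default)
xy_named <- layer_data(p_named)[c("x", "y")]
xy_short <- layer_data(p_short)[c("x", "y")]
all.equal(xy_named, xy_short)
```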
32 / 56

geoms

We tend to describe plots in terms of the geom used:

33 / 56

Layering geoms

We can display more than one geom in a plot. Use + to add a layer, then specify another geom:

ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait)) +
  geom_smooth(aes(x = dcubicles, y = dwait))

34 / 56

Your turn

This is our current plot:

ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait))

Add a geom_smooth layer (to help identify patterns)

Hint: Don't forget the + and aes() values in the new layer

35 / 56

Your turn - Answer

ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait)) +
  geom_smooth(aes(x = dcubicles, y = dwait))
36 / 56

37 / 56

One more thing

We'd probably prefer a linear fit rather than a non-linear fit:

ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait)) +
  geom_smooth(aes(x = dcubicles, y = dwait),
              method = "lm")
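
With method = "lm", geom_smooth() draws a straight line fitted by a linear model of y on x. As a sketch, the same kind of fit can be run directly with lm() to read off the intercept and slope. The data below are made up for illustration (demo is hypothetical, not the real capacity_ae):

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
demo <- data.frame(
  dcubicles = c(-2, 0, 1, 3, 5, 8),
  dwait = c(18, 9, 8, 1, -6, -13)
)

# The plot with a linear fit layered over the points
p <- ggplot(data = demo) +
  geom_point(aes(x = dcubicles, y = dwait)) +
  geom_smooth(aes(x = dcubicles, y = dwait),
              method = "lm")

# The equivalent model, fitted directly so the numbers can be inspected
fit <- lm(dwait ~ dcubicles, data = demo)
coef(fit)  # intercept and slope of the fitted line
```

In this toy data the slope is negative: more cubicles go with shorter waits. Whether that holds in capacity_ae is for the real plot to show.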
38 / 56

39 / 56

What is happening here?

40 / 56

Hypothesis

The two sites have seen staffing increases

We can map point colour (aesthetic attribute) to the staff_increase variable to find out

We will add colour to the chart depending on the value of staff_increase (TRUE or FALSE, 1 or 0)

41 / 56

Adding another dimension

Put an argument inside aes() if you want a visual attribute to change with different values of a variable.

ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait,
                 colour = staff_increase)) +
  geom_smooth(aes(x = dcubicles, y = dwait),
              method = "lm")

We could equally have chosen size or shape, but these make the graphic less clear

42 / 56

What is happening here? - Answer

The two sites have indeed seen an increase in staff levels, which has had an effect on dwait even though their dcubicles values are relatively low.

43 / 56

Important distinction

If you want a visual attribute to be applied across the whole plot, the argument goes outside aes():

ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait),
             colour = "red") +
  geom_smooth(aes(x = dcubicles, y = dwait),
              method = "lm")

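Putting the colour inside aes() also runs, but "red" is then treated as a data value rather than a colour name, so the points take the default palette colour and a legend appears:

ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait,
                 colour = "red")) +
  geom_smooth(aes(x = dcubicles, y = dwait),
              method = "lm")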
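
To see what each placement of colour = "red" actually does, one option is to inspect the colours ggplot2 computes for the points with layer_data(). This is a sketch on made-up toy data (demo is hypothetical, not the real capacity_ae):

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
demo <- data.frame(
  dcubicles = c(-2, 0, 1, 3, 5, 8),
  dwait = c(18, 9, 8, 1, -6, -13)
)

# colour outside aes(): a fixed setting applied to every point
p_outside <- ggplot(demo) +
  geom_point(aes(dcubicles, dwait), colour = "red")

# colour inside aes(): "red" is mapped like a variable with one value
p_inside <- ggplot(demo) +
  geom_point(aes(dcubicles, dwait, colour = "red"))

unique(layer_data(p_outside)$colour)  # the colour as set
unique(layer_data(p_inside)$colour)   # a colour chosen by the default scale
```

Outside aes() the points are drawn in red. Inside aes() the string is treated as a category, so the colour scale picks its own colour and adds a legend.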
44 / 56

Important distinction

Or apply a size globally:

ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait),
             size = 4) +
  geom_smooth(aes(x = dcubicles, y = dwait),
              method = "lm")
45 / 56

Layering geoms

To avoid duplication, we can pass the common local aes() arguments to ggplot to make them global. Instead of duplicating the same aes(dcubicles, dwait):

ggplot(data = capacity_ae) +
  geom_point(aes(dcubicles, dwait)) +
  geom_smooth(aes(dcubicles, dwait))

Move the aes() into ggplot() to make it global:

ggplot(data = capacity_ae, aes(dcubicles, dwait)) +
  geom_point() +
  geom_smooth()
46 / 56

Small multiples magic

Another way to visualise the relationship between multiple variables is with a facet_wrap() layer:

ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait)) +
  facet_wrap(~ staff_increase)
47 / 56

Small multiples

The ncol argument sets the number of columns; with ncol = 1 the panels are stacked vertically:

ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait)) +
  facet_wrap(~ staff_increase,
             ncol = 1)
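
As a sketch of what facet_wrap() does with the data, the code below splits a made-up toy dataset (demo is hypothetical, not the real capacity_ae) by staff_increase and counts how many points land in each panel, using layer_data() from ggplot2:

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
demo <- data.frame(
  dcubicles = c(-2, 0, 1, 3, 5, 8),
  dwait = c(18, 9, 8, 1, -6, -13),
  staff_increase = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

p <- ggplot(data = demo) +
  geom_point(aes(x = dcubicles, y = dwait)) +
  facet_wrap(~ staff_increase,
             ncol = 1)

# Each row of the layer data records the panel it is drawn in
table(layer_data(p)$PANEL)
```

There is one panel per distinct value of staff_increase, and every point appears in exactly one of them.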
48 / 56

Demonstrating geom charts

(note: these are simple,
unpolished graphics)

49 / 56

Q. How are "wait" values distributed?

Histogram

ggplot(data = capacity_ae) +
  geom_histogram(aes(dwait))
50 / 56

Q. How are “wait” values distributed?

Histogram, with the bin width set explicitly so that the bins are more evenly spread:

ggplot(data = capacity_ae) +
  geom_histogram(aes(dwait),
                 binwidth = 10)

51 / 56

Q. Number of attendances by site?

Bar plot

ggplot(data = capacity_ae) +
  geom_col(aes(x = site,
               y = attendance2018))

Reorder site by attendances

ggplot(data = capacity_ae) +
  geom_col(aes(x = reorder(site, attendance2018),
               y = attendance2018))
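
To see what reorder() does before it reaches the plot, it can be run on its own. The sketch below uses made-up toy data (demo is hypothetical, not the real capacity_ae) and looks at the order of the resulting factor levels, which is the order the bars are drawn in:

```r
# Hypothetical toy data, for illustration only
demo <- data.frame(
  site = c("A", "B", "C"),
  attendance2018 = c(300, 100, 200)
)

# reorder() returns a factor whose levels are sorted by the second argument
ordered_site <- reorder(demo$site, demo$attendance2018)
levels(ordered_site)  # "B" "C" "A": lowest attendance first

# Negating the values reverses the order, putting the largest first
levels(reorder(demo$site, -demo$attendance2018))  # "A" "C" "B"
```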
52 / 56

Q. Does the change in wait differ by staffing increase?

Boxplot

ggplot(data = capacity_ae) +
  geom_boxplot(aes(staff_increase, dwait))

Plot labels

Can be applied to all types of charts:

ggplot(data = capacity_ae) +
  geom_boxplot(aes(staff_increase, dwait)) +
  labs(title = "Do changes in staffing...",
       y = "Waiting")
53 / 56

To save a plot

ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait)) +
  geom_smooth(aes(x = dcubicles, y = dwait),
              method = "lm")

ggsave("plot_name.png")

Call ggsave() on its own line rather than adding it with +, which gives an error in recent versions of ggplot2. It saves the last plot displayed and, by default, uses the same dimensions as the plot window.

In future, you'll wish to add height, width and units arguments to specify plot dimensions:

ggsave("plot_name.png", units = "cm",
       height = 10, width = 8)
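
ggsave() also accepts the plot to save through its plot argument, which avoids relying on whichever plot was displayed last. This is a sketch on made-up toy data (demo is hypothetical, not the real capacity_ae); it writes to a temporary folder so that it does not clutter a project directory:

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
demo <- data.frame(
  dcubicles = c(-2, 0, 1, 3, 5, 8),
  dwait = c(18, 9, 8, 1, -6, -13)
)

# Store the plot in an object so it can be passed to ggsave()
p <- ggplot(data = demo, aes(dcubicles, dwait)) +
  geom_point()

# Save to a temporary folder, stating the dimensions explicitly
out_file <- file.path(tempdir(), "plot_name.png")
ggsave(out_file, plot = p, units = "cm",
       height = 10, width = 8)

file.exists(out_file)
```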

55 / 56

This work is licensed as Creative Commons Attribution ShareAlike 4.0 International.
To view a copy of this license, visit
https://creativecommons.org/licenses/by/4.0/

56 / 56
