# Acknowledgement

This session shadows [Chapter 3]( of the excellent:

<img class="center" src="data_image/png;base64,#img/session04/r-for-data-science.PNG" width="40%"/>

---

# ggplot2

Is one of several plotting systems in R

<img class="center" src="data_image/png;base64,#img/session04/tweet-poll.PNG" width="40%"/>

{plotly} is used by [Public Health Scotland](

---

# Why is ggplot popular?

1. Well designed and supported
2. Highly versatile
3. Attractive graphics (with a little work)

.footnote[1. Why start with ggplot : 2. Argument against:]

---

[ <img class="center" src="data_image/png;base64,#img/session04/bbc-plots.PNG"/>](

---

# ggplot2

ggplot2 is part of the tidyverse. So, at the top of your script type:

```r
library(tidyverse)
```

---
class: inverse, middle, center

## Project 1: Let’s explore a perennial challenge for the NHS:

---

# Pressures in A&E

<img class="center" src="data:image/png;base64,#img/session04/demand-and-capacity.PNG"/>

.footnote[
1. [Picture 1]( by Unknown Author is licensed under [CC BY SA NC](
2. [Picture 2]( by Unknown Author is licensed under [CC BY SA NC](
]

---
class: center

# Data: Capacity in A&E

The dataset we loaded earlier, `capacity_ae`, shows changes in the capacity of A&E departments from 2017 to 2018

.footnote[Closely based on datasets collected by the NHS Benchmarking Network]

--

The object named .green[capacity_ae] is a data frame

---

## What is a data frame?

A data frame stores tabular data:

<img class="center" src="data:image/png;base64,#img/session04/tidydata_1.JPG" width="90%"/>

.footnote[Artwork by @allison_horst]

---

## tibble = data frame

In the tidyverse you may see the term "tibble"

We’ll take "tibble" to be synonymous with "data frame"

>A tibble... is a modern reimagining of the data.frame...
>
>Tibbles are data.frames that are .blue[**lazy and surly**]: they do less (i.e. they don’t change variable names or types, and don’t do partial matching) and .blue[**complain more**] (e.g. when a variable does not exist).
>
> This forces you to confront problems earlier, typically leading to cleaner, more expressive code.

.footnote[emphasis added to [quote](]

---

## Viewing the data frame

### Option 1

<img class="center" src="data:image/png;base64,#img/session04/view-data-frame.PNG" width="90%"/>

---

## Viewing the data frame

This brings up a view of the data in a new tab:

<img class="center" src="data:image/png;base64,#img/session04/view-capacity-ae.PNG" width="90%"/>

---

## Viewing the data frame

Click here to show the data frame in a new window

<img class="center" src="data:image/png;base64,#img/session04/open-view-window.PNG" width="90%"/>

Useful when using multiple monitors

---

## Viewing the data frame

### Option 2

Type the name of the dataset in the editor or console, and run the line (shortcut <kbd>Ctrl + Enter</kbd>)

<img class="center" src="data:image/png;base64,#img/session04/view-data-frame2.PNG" width="60%"/>

---
class: inverse, center, middle

## Q. Do we understand the variable names?

_(and what they mean)_

--

<img class="center" src="data:image/png;base64,#img/session04/variable-names.PNG" width="60%"/>

---

### "The simple graph has brought more information to the data analyst's mind than any other device"

<img class="center" src="img/session04/ggplot2_exploratory.PNG" width="50%">

.footnote[John Tukey, quoted in [R for Data Science](]

---

# Q. Is a change in the number of cubicles available in A&E associated with a change in length of attendance?

--

### Let's explain the code

We begin our plot with the ggplot() function

```r
ggplot() +
```

--

Inside ggplot() we can specify the dataset

```r
ggplot(data = capacity_ae)
```

--

Next, we add layer(s), with + at the end of the preceding line

```r
ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait))
```

---
class: center, middle

# Choices

There are choices about the chart to use but also the details of the chart

<img class="center" src="data:image/png;base64,#img/session04/pie_charts.JPG" width="80%"/>

---
class: center, middle

# Choices

</br>1.
What shape will represent the data points?

# .black[<svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;height:2em;" xmlns=""> <path d="M512 512H0V0h512v512z"></path></svg>]

---
class: center, middle

# Choices

</br>1. What shape will represent the data points?

# .black[<svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;height:2em;" xmlns=""> <path d="M256 8C119 8 8 119 8 256s111 248 248 248 248-111 248-248S393 8 256 8z"></path></svg>]

.pull-right[
.blue[**geom**]etric object]

---
class: center, middle

# Choices

.darkgrey[
1\. What shape will represent the data? .blue[geom]]

2\. What visual (.blue[**aes**]thetic) attributes do we give to the geom?

# .black[<svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;height:2em;" xmlns=""> <path d="M256 8C119 8 8 119 8 256s111 248 248 248 248-111 248-248S393 8 256 8z"></path></svg>]

---
class: center, middle

# Choices

.darkgrey[
1\. What shape will represent the data? .blue[geom]]

2\. What visual (.blue[**aes**]thetic) attributes do we give to the geom?

# .black[<svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns=""> <path d="M256 8C119 8 8 119 8 256s111 248 248 248 248-111 248-248S393 8 256 8z"></path></svg>]

.pull-right[
## .blue[size]
]

---
class: center, middle

# Choices

.darkgrey[
1\. What shape will represent the data? .blue[geom]]

2\. What visual (.blue[**aes**]thetic) attributes do we give to the geom?
# <svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;height:2em;" xmlns=""> <path d="M377.941 169.941V216H134.059v-46.059c0-21.382-25.851-32.09-40.971-16.971L7.029 239.029c-9.373 9.373-9.373 24.568 0 33.941l86.059 86.059c15.119 15.119 40.971 4.411 40.971-16.971V296h243.882v46.059c0 21.382 25.851 32.09 40.971 16.971l86.059-86.059c9.373-9.373 9.373-24.568 0-33.941l-86.059-86.059c-15.119-15.12-40.971-4.412-40.971 16.97z"></path></svg> .black[<svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;height:2em;" xmlns=""> <path d="M256 8C119 8 8 119 8 256s111 248 248 248 248-111 248-248S393 8 256 8z"></path></svg>]

.pull-right[
## position (x axis)
]

---
class: center, middle

# Choices

.darkgrey[
1\. What shape will represent the data? .blue[geom]]

2\. What visual (.blue[**aes**]thetic) attributes do we give to the geom?

# <svg viewBox="0 0 256 512" style="position:relative;display:inline-block;top:.1em;height:2em;" xmlns=""> <path d="M214.059 377.941H168V134.059h46.059c21.382 0 32.09-25.851 16.971-40.971L144.971 7.029c-9.373-9.373-24.568-9.373-33.941 0L24.971 93.088c-15.119 15.119-4.411 40.971 16.971 40.971H88v243.882H41.941c-21.382 0-32.09 25.851-16.971 40.971l86.059 86.059c9.373 9.373 24.568 9.373 33.941 0l86.059-86.059c15.12-15.119 4.412-40.971-16.97-40.971z"></path></svg> .black[<svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;height:2em;" xmlns=""> <path d="M256 8C119 8 8 119 8 256s111 248 248 248 248-111 248-248S393 8 256 8z"></path></svg>]

.pull-right[
## position (y axis)
]

---
class: center, middle

# Choices

.darkgrey[
1\. What shape will represent the data? .blue[geom]]

2\. What visual (.blue[**aes**]thetic) attributes do we give to the geom?
# <svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;fill:green;height:2em;" xmlns=""> <path d="M256 8C119 8 8 119 8 256s111 248 248 248 248-111 248-248S393 8 256 8z"></path></svg>

.pull-right[
## colour
]

---

# A statistical graphic

Shape/colour/size <svg viewBox="0 0 448 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns=""> <path d="M190.5 66.9l22.2-22.2c9.4-9.4 24.6-9.4 33.9 0L441 239c9.4 9.4 9.4 24.6 0 33.9L246.6 467.3c-9.4 9.4-24.6 9.4-33.9 0l-22.2-22.2c-9.5-9.5-9.3-25 .4-34.3L311.4 296H24c-13.3 0-24-10.7-24-24v-32c0-13.3 10.7-24 24-24h287.4L190.9 101.2c-9.8-9.3-10-24.8-.4-34.3z"></path></svg> all geom defaults

```r
ggplot(data = capacity_ae) +
*  geom_point(aes(x = dcubicles, y = dwait))
```

---
class: middle, center

# Functions ()

ggplot(), geom_point(), and aes() are functions

Running a function does something

Functions are given zero or more inputs (arguments)

Arguments of a function are separated by commas

---
class: center

# Functions ()

You can explicitly name arguments:

```r
ggplot(data = capacity_ae) +
```

Or not:

```r
ggplot(capacity_ae) +
```

---

# Functions ()

Other arguments, like x and y in aes(), have a set order:

```r
ggplot(data = capacity_ae) +
*  geom_point(aes(x = dcubicles, y = dwait))
```

Because they are named, it is possible to write them in a different order:

```r
ggplot(data = capacity_ae) +
*  geom_point(aes(y = dwait, x = dcubicles))
```

But this could be confusing.
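---

# Functions ()

A minimal sketch of how argument names and argument order interact in aes(). It assumes a small made-up data frame, `demo_ae`, as a stand-in for `capacity_ae` (the values and the helper `xy()` are illustrative only), and uses ggplot2's layer_data() to look at the x and y values each plot would draw:

```r
library(ggplot2)

# Hypothetical stand-in for capacity_ae: values are made up for illustration
demo_ae <- data.frame(
  dcubicles = c(-2, 0, 3, 5),
  dwait     = c(10, 4, -6, -12)
)

# Named arguments: the order they are written in does not matter
p_named   <- ggplot(data = demo_ae) + geom_point(aes(x = dcubicles, y = dwait))
p_swapped <- ggplot(data = demo_ae) + geom_point(aes(y = dwait, x = dcubicles))

# Unnamed arguments: the first is taken as x and the second as y
p_positional <- ggplot(demo_ae) + geom_point(aes(dcubicles, dwait))

# Unnamed and reversed: dwait now ends up on the x axis
p_reversed <- ggplot(demo_ae) + geom_point(aes(dwait, dcubicles))

# Illustrative helper: the x/y values ggplot2 computed for the first layer
xy <- function(p) layer_data(p)[, c("x", "y")]

all.equal(xy(p_named), xy(p_swapped))     # named, either order: same plot
all.equal(xy(p_named), xy(p_positional))  # positional, x then y: same plot
xy(p_reversed)$x                          # holds the dwait values
```

Naming x and y costs a little typing but protects against accidentally swapping the axes.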
---

# Functions ()

Here, we have provided **ggplot()** with one named argument

.pull-right[
.darkgrey[ggplot(.blue[**data = capacity_ae**]) + geom_point(.blue[**aes(x = dcubicles, y = dwait)**])]
]

And given **aes()** two named arguments

Arguments you do not specify will often fall back to .green[default values]

---

# Shorthand

Since ggplot2 knows the order of essential arguments, it is not necessary to name arguments:

.green[data = can be omitted ] and .green[x = goes first and y = goes second]

```r
ggplot(capacity_ae) +
*  geom_point(aes(dcubicles, dwait))
```

---

# geoms

We tend to describe plots in terms of the geom used:

<img class="center" src="data:image/png;base64,#img/session04/geoms.PNG" width="80%"/>

---

# Layering geoms

We can display more than one geom in a plot:

.blue[<svg viewBox="0 0 448 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns=""> <path d="M416 208H272V64c0-17.67-14.33-32-32-32h-32c-17.67 0-32 14.33-32 32v144H32c-17.67 0-32 14.33-32 32v32c0 17.67 14.33 32 32 32h144v144c0 17.67 14.33 32 32 32h32c17.67 0 32-14.33 32-32V304h144c17.67 0 32-14.33 32-32v-32c0-17.67-14.33-32-32-32z"></path></svg>] to add a layer

ggplot(data = capacity_ae) .blue[<svg viewBox="0 0 448 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns=""> <path d="M416 208H272V64c0-17.67-14.33-32-32-32h-32c-17.67 0-32 14.33-32 32v144H32c-17.67 0-32 14.33-32 32v32c0 17.67 14.33 32 32 32h144v144c0 17.67 14.33 32 32 32h32c17.67 0 32-14.33 32-32V304h144c17.67 0 32-14.33 32-32v-32c0-17.67-14.33-32-32-32z"></path></svg>]

.blue[geom_point](aes(x = dcubicles, y = dwait)) .blue[<svg viewBox="0 0 448 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns=""> <path d="M416 208H272V64c0-17.67-14.33-32-32-32h-32c-17.67 0-32 14.33-32 32v144H32c-17.67 0-32 14.33-32 32v32c0 17.67 14.33 32 32 32h144v144c0 17.67 14.33 32 32 32h32c17.67 0 32-14.33 32-32V304h144c17.67 0 32-14.33 32-32v-32c0-17.67-14.33-32-32-32z"></path></svg>]

.blue[geom_smooth](aes(x = dcubicles, y = dwait))

.blue[then specify another geom...]

---

# Your turn

This is our current plot:

```r
ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait))
```

Add a geom_smooth layer (to help identify patterns)

Hint: Don't forget the .blue[+] and aes() values in the new layer

---

# Your turn - Answer

```r
ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait)) +
  geom_smooth(aes(x = dcubicles, y = dwait))
```

---

<!-- -->

---

# One more thing

We'd probably prefer a linear fit rather than a non-linear fit:

```r
ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait)) +
  geom_smooth(aes(x = dcubicles, y = dwait),
*             method = "lm")
```

---

<!-- -->

---

# What is happening here?

<!-- -->

---
class: center, middle

# Hypothesis

The two sites have seen staffing increases

We can map point .blue[colour] (an aesthetic attribute) to the staff_increase variable to find out

We will add colour to the chart depending on the value of staff_increase (TRUE or FALSE, 1 or 0)

---

# Adding another dimension

Put an argument **inside** aes() if you want a visual attribute to change with different values of a variable.

```r
ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait,
*                colour = staff_increase)) +
  geom_smooth(aes(x = dcubicles, y = dwait), method = "lm")
```

We could equally have chosen size or shape, but these make the graphic less clear

---

# What is happening here? - Answer

The two sites have indeed seen an increase in staff levels, which has had an effect on dwait even though dcubicles is relatively low.
<img src="data:image/png;base64,#04-workshop_ggplot2_files/figure-html/unnamed-chunk-14-1.png" width="50%" />

---

# Important distinction

If you want a visual attribute to be the same for every point in a layer, the argument goes **outside** aes():

```r
ggplot(data = capacity_ae) +
*  geom_point(aes(x = dcubicles, y = dwait), colour = "red") +
  geom_smooth(aes(x = dcubicles, y = dwait), method = "lm")
```

This also runs, but inside aes() the string "red" is treated as a data value, not a colour name: the points get the default palette colour and a legend labelled "red" appears. Prefer the version above:

```r
ggplot(data = capacity_ae) +
*  geom_point(aes(x = dcubicles, y = dwait, colour = "red")) +
  geom_smooth(aes(x = dcubicles, y = dwait), method = "lm")
```

---

# Important distinction

Or apply one size to every point:

```r
ggplot(data = capacity_ae) +
*  geom_point(aes(x = dcubicles, y = dwait),
*             size = 4) +
  geom_smooth(aes(x = dcubicles, y = dwait), method = "lm")
```

---

# Layering geoms

To avoid duplication, we can pass the common local aes() arguments to ggplot() to make them global.

Instead of duplicating the same aes(dcubicles, dwait):

```r
ggplot(data = capacity_ae) +
*  geom_point(aes(dcubicles, dwait)) +
*  geom_smooth(aes(dcubicles, dwait))
```

Move the aes() into the global ggplot() call:

```r
ggplot(data = capacity_ae, aes(dcubicles, dwait)) +
*  geom_point() +
*  geom_smooth()
```

---

# Small multiples magic

Another way to visualise the relationship between multiple variables is with a facet_wrap() layer:

```r
ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait)) +
*  facet_wrap(~ staff_increase)
```

---

# Small multiples

Another way to visualise the relationship between multiple variables is with a facet_wrap() layer:

```r
ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait)) +
  facet_wrap(~ staff_increase,
*             ncol = 1)
```

---
class: center, middle

# Demonstrating geom charts

(note: these are simple, unpolished graphics)

---

# Q. How are "wait" values distributed?

## Histogram

```r
ggplot(data = capacity_ae) +
  geom_histogram(aes(dwait))
```

---

# Q. How are “wait” values distributed?
## Histogram

```r
ggplot(data = capacity_ae) +
  geom_histogram(aes(dwait), binwidth = 10)
```

With binwidth set, each bin spans 10 units of dwait:

---

# Q. Number of attendances by site?

## Bar plot

```r
ggplot(data = capacity_ae) +
  geom_col(aes(x = site, y = attendance2018))
```

--

Reorder site __by__ attendances

```r
ggplot(data = capacity_ae) +
*  geom_col(aes(x = reorder(site, attendance2018), y = attendance2018))
```

---

# Q. How does dwait vary with staff_increase?

## Boxplot

```r
ggplot(data = capacity_ae) +
  geom_boxplot(aes(staff_increase, dwait))
```

--

### Plot labels

Can be applied to all types of charts:

```r
ggplot(data = capacity_ae) +
  geom_boxplot(aes(staff_increase, dwait)) +
*  labs(title = "Do changes in staffing...",
*       y = "Waiting")
```

---

# To save a plot

```r
ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait)) +
  geom_smooth(aes(x = dcubicles, y = dwait), method = "lm")

# ggsave() is a separate call (not added with +); it saves the last plot displayed
*ggsave("plot_name.png")
```

---

# To save a plot

```r
ggplot(data = capacity_ae) +
  geom_point(aes(x = dcubicles, y = dwait)) +
  geom_smooth(aes(x = dcubicles, y = dwait), method = "lm")

# ggsave() is a separate call (not added with +); it saves the last plot displayed
*ggsave("plot_name.png", units = "cm",
*       height = 10, width = 8)
```

By default, ggsave() saves the plot at the size of the current plot window.
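---

# To save a plot

A minimal sketch of saving a plot that is held in an object. It assumes a made-up data frame, `demo_ae`, as a stand-in for `capacity_ae`, and writes to a temporary folder; both are illustrative and not part of the session data. Passing the plot through ggsave()'s `plot` argument makes it explicit which plot is written:

```r
library(ggplot2)

# Hypothetical stand-in for capacity_ae: values are made up for illustration
demo_ae <- data.frame(
  dcubicles = c(-2, 0, 3, 5),
  dwait     = c(10, 4, -6, -12)
)

# Store the plot in an object instead of only printing it
p <- ggplot(data = demo_ae, aes(x = dcubicles, y = dwait)) +
  geom_point()

# Temporary location for this sketch; use e.g. "plot_name.png" in a project
out_file <- file.path(tempdir(), "plot_name.png")

# Name the plot explicitly and fix its size in centimetres
ggsave(out_file, plot = p, width = 8, height = 10, units = "cm")

# Check that the file was written
file.exists(out_file)
```

Setting width, height and units means the saved image does not depend on how large the plot window happens to be.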