4 Exercise 1
4.1 Q1. Getting Started
Before we get started, we need to open our project.
As discussed in the Project Management chapter (Chapter 3), we first want to open our QEH project file. To open the QEH.Rproj file, we head over to the folder we put it in and double-click on it in our file browser, or we open it from within RStudio.
We can now confirm that we are in the right project by checking the project name in the top right of RStudio, and that we are in the right working directory by running:
getwd()
Let’s create a new RMarkdown file (see 3.3) by selecting R Notebook in the “File” menu, and let’s save it as Introduction_1.Rmd in our project folder. Because we want to create a PDF report with a proper title and our name, we modify the header of the RMarkdown document slightly (most importantly, we replace html_notebook with pdf_document).
---
title: "Introduction Exercise 1"
author: "Moritz Schwarz"
date: "15. January 2021"
output: pdf_document
---
Next, we want to make sure we have all the tools ready that we will need in this exercise, so we load a few libraries (use install.packages("packagename") first if you are missing one):
library(tidyverse) # our main collection of functions
library(tidylog) # prints additional output from the tidyverse commands - load after tidyverse
library(skimr) # allows us to get an overview over the data quickly
library(haven) # allows us to load .dta (Stata specific) files
library(here) # needed to navigate to folders and files in a project
library(esquisse) # an app to help us with the plotting in ggplot
Now we are ready to actually start our work!
4.1.1 Loading Data
As pointed out in the exercise, we load the titanic3s12.dta dataset. I have put all my datasets into a data folder in our project folder.
Having loaded the haven package, and using the here() command to navigate to our data folder, we read the file into an object called titanic.
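A minimal sketch of this loading step, assuming the file sits at data/titanic3s12.dta inside the project folder and that the object is called titanic (the name used in the rest of this exercise). read_dta() is haven's reader for Stata files. The file.exists() guard is not part of the exercise; it is only there so that the sketch also runs when the file is absent.

```r
library(haven) # read_dta() reads Stata .dta files
library(here)  # here() builds paths relative to the project root

# Assumed location: <project folder>/data/titanic3s12.dta
titanic_path <- here("data", "titanic3s12.dta")

if (file.exists(titanic_path)) {
  titanic <- read_dta(titanic_path)
} else {
  message("File not found: ", titanic_path)
}
```

Building the path with here() rather than typing it out keeps the code working no matter which subfolder of the project the Rmd file is saved in.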
4.2 Q2. Explore the data
Let’s get a feel for the data first, by printing it to the console:
titanic
# A tibble: 1,309 x 15
survived name pclass age child old female sibsp parch alone
<dbl> <chr> <dbl> <dbl> <dbl+lb> <dbl> <dbl+l> <dbl> <dbl> <dbl>
1 1 "" 3 NA NA NA 1 [Fem~ 0 0 1
2 1 "" 2 45 0 [Adu~ 0 1 [Fem~ 0 0 1
3 1 "" 2 6 1 [Chi~ 0 1 [Fem~ 0 1 0
4 0 "" 3 NA NA NA 0 [Mal~ 0 0 1
5 0 "" 3 NA NA NA 0 [Mal~ 0 0 1
6 1 "" 3 15 1 [Chi~ 0 1 [Fem~ 0 0 1
7 1 "" 1 21 0 [Adu~ 0 1 [Fem~ 2 2 0
8 1 "" 3 18 0 [Adu~ 0 0 [Mal~ 0 0 1
9 1 "" 3 NA NA NA 1 [Fem~ 0 0 1
10 1 "" 3 0.167 1 [Chi~ 0 1 [Fem~ 1 2 0
# ... with 1,299 more rows, and 5 more variables: fare <dbl>,
# cherbourg <dbl>, queenstown <dbl>, southampton <dbl>,
# familymembers <dbl>
Then let’s open the full data set so we can browse through it. As we are using the tidyverse style of programming here, we use the pipe operator %>% to string multiple commands together (shortcut CTRL + Shift + M). Remember, just think of the pipe as saying “then”, i.e. do one command, then do the next. Here we want to take the data that is called titanic and then view it:
titanic %>% View()
Because it is easier to visually follow a command, it is advisable to press Enter in your code editor after each pipe operator. This gives us a nice automatic indentation (you can also select your code and use CTRL + I to auto-indent it).
titanic %>%
  View()
Of course, as we discussed earlier, this is equivalent to using View(titanic).
We can use two functions to get a basic summary for the data:
titanic %>%
  summary()
survived name pclass age
Min. :0.000 Length:1309 Min. :1.000 Min. : 0.1667
1st Qu.:0.000 Class :character 1st Qu.:2.000 1st Qu.:21.0000
Median :0.000 Mode :character Median :3.000 Median :28.0000
Mean :0.382 Mean :2.295 Mean :29.8811
3rd Qu.:1.000 3rd Qu.:3.000 3rd Qu.:39.0000
Max. :1.000 Max. :3.000 Max. :80.0000
NA's :263
child old female sibsp
Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.0000
Median :0.0000 Median :0.0000 Median :0.000 Median :0.0000
Mean :0.1099 Mean :0.1052 Mean :0.356 Mean :0.4989
3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000 Max. :1.000 Max. :8.0000
NA's :263 NA's :263
parch alone fare cherbourg
Min. :0.000 Min. :0.0000 Min. : 0.000 Min. :0.0000
1st Qu.:0.000 1st Qu.:0.0000 1st Qu.: 7.896 1st Qu.:0.0000
Median :0.000 Median :1.0000 Median : 14.454 Median :0.0000
Mean :0.385 Mean :0.6035 Mean : 33.294 Mean :0.2066
3rd Qu.:0.000 3rd Qu.:1.0000 3rd Qu.: 31.275 3rd Qu.:0.0000
Max. :9.000 Max. :1.0000 Max. :512.000 Max. :1.0000
NA's :1 NA's :2
queenstown southampton familymembers
Min. :0.00000 Min. :0.0000 Min. : 0.0000
1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.: 0.0000
Median :0.00000 Median :1.0000 Median : 0.0000
Mean :0.09411 Mean :0.6993 Mean : 0.8839
3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.: 1.0000
Max. :1.00000 Max. :1.0000 Max. :10.0000
NA's :2 NA's :2
or:
titanic %>%
  skim()
Name | Piped data |
Number of rows | 1309 |
Number of columns | 15 |
_______________________ | |
Column type frequency: | |
character | 3 |
numeric | 12 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
name | 0 | 1.0 | 0 | 82 | 42 | 1243 | 0 |
child | 263 | 0.8 | 1 | 1 | 0 | 2 | 0 |
female | 0 | 1.0 | 1 | 1 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
survived | 0 | 1.0 | 0.38 | 0.49 | 0.00 | 0.0 | 0.00 | 1.00 | 1 | ▇▁▁▁▅ |
pclass | 0 | 1.0 | 2.29 | 0.84 | 1.00 | 2.0 | 3.00 | 3.00 | 3 | ▃▁▃▁▇ |
age | 263 | 0.8 | 29.88 | 14.41 | 0.17 | 21.0 | 28.00 | 39.00 | 80 | ▂▇▅▂▁ |
old | 263 | 0.8 | 0.11 | 0.31 | 0.00 | 0.0 | 0.00 | 0.00 | 1 | ▇▁▁▁▁ |
sibsp | 0 | 1.0 | 0.50 | 1.04 | 0.00 | 0.0 | 0.00 | 1.00 | 8 | ▇▁▁▁▁ |
parch | 0 | 1.0 | 0.39 | 0.87 | 0.00 | 0.0 | 0.00 | 0.00 | 9 | ▇▁▁▁▁ |
alone | 0 | 1.0 | 0.60 | 0.49 | 0.00 | 0.0 | 1.00 | 1.00 | 1 | ▅▁▁▁▇ |
fare | 1 | 1.0 | 33.29 | 51.75 | 0.00 | 7.9 | 14.45 | 31.27 | 512 | ▇▁▁▁▁ |
cherbourg | 2 | 1.0 | 0.21 | 0.41 | 0.00 | 0.0 | 0.00 | 0.00 | 1 | ▇▁▁▁▂ |
queenstown | 2 | 1.0 | 0.09 | 0.29 | 0.00 | 0.0 | 0.00 | 0.00 | 1 | ▇▁▁▁▁ |
southampton | 2 | 1.0 | 0.70 | 0.46 | 0.00 | 0.0 | 1.00 | 1.00 | 1 | ▃▁▁▁▇ |
familymembers | 0 | 1.0 | 0.88 | 1.58 | 0.00 | 0.0 | 0.00 | 1.00 | 10 | ▇▁▁▁▁ |
Let’s now use the str() function (str for structure) to get a better sense of the types of data we are working with:
titanic %>%
  str()
tibble [1,309 x 15] (S3: tbl_df/tbl/data.frame)
$ survived : num [1:1309] 1 1 1 0 0 1 1 1 1 1 ...
..- attr(*, "label")= chr "Passenger survived"
..- attr(*, "format.stata")= chr "%8.0g"
$ name : chr [1:1309] "" "" "" "" ...
..- attr(*, "label")= chr "Name of passenger"
..- attr(*, "format.stata")= chr "%82s"
$ pclass : num [1:1309] 3 2 2 3 3 3 1 3 3 3 ...
..- attr(*, "label")= chr "Passenger class"
..- attr(*, "format.stata")= chr "%8.0g"
$ age : num [1:1309] NA 45 6 NA NA ...
..- attr(*, "label")= chr "Age of passenger"
..- attr(*, "format.stata")= chr "%9.0g"
$ child : dbl+lbl [1:1309] NA, 0, 1, NA, NA, 1, 0, 0, NA, 1, 0, NA, ...
..@ label : chr "Child (< 16 years old)"
..@ format.stata: chr "%9.0g"
..@ labels : Named num [1:2] 0 1
.. ..- attr(*, "names")= chr [1:2] "Adult" "Child"
$ old : num [1:1309] NA 0 0 NA NA 0 0 0 NA 0 ...
..- attr(*, "label")= chr "Old passenger (>= 50 years old)"
..- attr(*, "format.stata")= chr "%9.0g"
$ female : dbl+lbl [1:1309] 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, ...
..@ label : chr "Female passenger"
..@ format.stata: chr "%9.0g"
..@ labels : Named num [1:2] 0 1
.. ..- attr(*, "names")= chr [1:2] "Male" "Female"
$ sibsp : num [1:1309] 0 0 0 0 0 0 2 0 0 1 ...
..- attr(*, "label")= chr "Number of siblings and spouses aboard"
..- attr(*, "format.stata")= chr "%8.0g"
$ parch : num [1:1309] 0 0 1 0 0 0 2 0 0 2 ...
..- attr(*, "label")= chr "Number of parents and children aboard"
..- attr(*, "format.stata")= chr "%8.0g"
$ alone : num [1:1309] 1 1 0 1 1 1 0 1 1 0 ...
..- attr(*, "label")= chr "Passenger travelled alone"
..- attr(*, "format.stata")= chr "%9.0g"
$ fare : num [1:1309] 7.88 13.5 33 7.75 8.05 ...
..- attr(*, "label")= chr "Passenger fare (in Pre-1970 British Pounds)"
..- attr(*, "format.stata")= chr "%9.0g"
$ cherbourg : num [1:1309] 0 0 0 0 0 1 1 0 0 0 ...
..- attr(*, "label")= chr "Embarked at Cherbourg (France)"
..- attr(*, "format.stata")= chr "%9.0g"
$ queenstown : num [1:1309] 1 0 0 1 0 0 0 0 1 0 ...
..- attr(*, "label")= chr "Embarked at Queenstown (Ireland)"
..- attr(*, "format.stata")= chr "%9.0g"
$ southampton : num [1:1309] 0 1 1 0 1 0 0 1 0 1 ...
..- attr(*, "label")= chr "Embarked at Southampton (UK)"
..- attr(*, "format.stata")= chr "%9.0g"
$ familymembers: num [1:1309] 0 0 1 0 0 0 4 0 0 3 ...
..- attr(*, "format.stata")= chr "%9.0g"
Given that each variable is in one column, we could either look at the output of the summaries above to find out the number of variables, or use the ncol() command, which gives us the number of columns in an object:
titanic %>%
  ncol()
[1] 15
We use a similar command, nrow(), which gives us the number of rows, to answer the first part of question 2:
titanic %>%
  nrow()
[1] 1309
We can use the summary() command from above again to check the number of missing values. The skim() command helpfully gives us this information in an n_missing column, so we can use the select() command from the tidyverse to select just that!
select: dropped 16 variables (skim_type, skim_variable, complete_rate, character.min, character.max, …)
# A tibble: 15 x 1
n_missing
<int>
1 0
2 263
3 0
4 0
5 0
6 263
7 263
8 0
9 0
10 0
11 1
12 2
13 2
14 2
15 0
To make it easier to see what each row refers to, we can also select the variable name column (skim_variable) from the skim() output.
select: dropped 15 variables (skim_type, complete_rate, character.min, character.max, character.empty, …)
# A tibble: 15 x 2
skim_variable n_missing
<chr> <int>
1 name 0
2 child 263
3 female 0
4 survived 0
5 pclass 0
6 age 263
7 old 263
8 sibsp 0
9 parch 0
10 alone 0
11 fare 1
12 cherbourg 2
13 queenstown 2
14 southampton 2
15 familymembers 0
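The two outputs above come from skim() followed by select(). A minimal sketch of both calls, run on a small made-up tibble (toy, with hypothetical values) instead of the titanic data, so the counts differ from those shown above:

```r
library(dplyr)
library(skimr)

# Made-up stand-in for the titanic data (values are hypothetical)
toy <- tibble(
  name = c("A", "B", "C", "D"),
  age  = c(22, NA, 35, NA),
  fare = c(7.25, 71.3, NA, 8.05)
)

# Only the number of missing values per variable
toy %>%
  skim() %>%
  select(n_missing)

# The same counts together with the variable names
missing_toy <- toy %>%
  skim() %>%
  select(skim_variable, n_missing)

missing_toy
```

On the real data, replacing toy with titanic should give the 15-row tables shown above.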
I personally don’t like the name of the new variable column, so I want to rename it using the (surprise) rename() command: we add another line, again with the pipe operator %>%, to rename it to just variable:
select: dropped 15 variables (skim_type, complete_rate, character.min, character.max, character.empty, …)
rename: renamed one variable (variable)
# A tibble: 15 x 2
variable n_missing
<chr> <int>
1 name 0
2 child 263
3 female 0
4 survived 0
5 pclass 0
6 age 263
7 old 263
8 sibsp 0
9 parch 0
10 alone 0
11 fare 1
12 cherbourg 2
13 queenstown 2
14 southampton 2
15 familymembers 0
Now all we have to do, if we want to use the object again later, is to save it! We do so using the right-hand assignment operator -> at the end of the pipe, and can then print the object again to look at it:
titanic %>%
  skim() %>%
  select(skim_variable, n_missing) %>%
  rename(variable = skim_variable) -> missing_each_variable
Warning: Couldn't find skimmers for class: haven_labelled, vctrs_vctr,
double, numeric; No user-defined `sfl` provided. Falling back to
`character`.
Warning: Couldn't find skimmers for class: haven_labelled, vctrs_vctr,
double, numeric; No user-defined `sfl` provided. Falling back to
`character`.
select: dropped 15 variables (skim_type, complete_rate, character.min, character.max, character.empty, …)
rename: renamed one variable (variable)
4.3 Q3. List the first few names
Question 3 asks us to list the first 10 names that were on the Titanic. Not a problem: the head() command does just that, it always shows us the first few elements/rows of an object. To get the first 10, we can use head(data, 10), or just head(10) if we use it in the pipe. To get the last few elements, we can use tail().
So let’s do this on the titanic data. We first select() just the name column and then take a look at the first 10 lines:
select: dropped 14 variables (survived, pclass, age, child, old, …)
# A tibble: 10 x 1
name
<chr>
1 ""
2 ""
3 ""
4 ""
5 ""
6 ""
7 ""
8 ""
9 ""
10 ""
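The output above comes from combining select() and head(). A minimal sketch on a small made-up tibble (toy, with hypothetical names), which also shows tail() for the last rows:

```r
library(dplyr)

# Made-up stand-in for the titanic data (names are hypothetical)
toy <- tibble(
  name = c("", "", "Doe, Mr. John", "Roe, Mrs. Jane"),
  age  = c(NA, 45, 30, 28)
)

# First two names: both are empty strings, not NA
first_names <- toy %>%
  select(name) %>%
  head(2)

# Last two names, using tail() instead of head()
last_names <- toy %>%
  select(name) %>%
  tail(2)

first_names
last_names
```

On the real data the corresponding call would use titanic and head(10).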
Hmmm… the first 10 names seem to be empty. If we use e.g. the View() command, we’ll quickly see that these values are not actually missing (i.e. they are not NA): they are just empty character strings "".
So if we want the first real names, we need to filter the empty ones out. And again, there is a filter() function just for that:
filter: removed 42 rows (3%), 1,267 rows remaining
select: dropped 14 variables (survived, pclass, age, child, old, …)
# A tibble: 10 x 1
name
<chr>
1 " Agnes Hughes)"
2 " Alexander)"
3 " Dart Trevaskis)"
4 " Elias)"
5 " Godfrey)"
6 " Inglis Milne)"
7 " Mowad)"
8 " Treanor)"
9 " Watson)"
10 ")"
Great! But now we still have a name in our short list that is just ")", which is again clearly a data error! So let’s filter those out as well. There are many different ways to add this second condition, so here are two simple examples that both lead to the same result. See 1.2 for more detail:
# Add a second statement to the filter command
titanic %>%
  filter(name != "" & name != ")") %>%
  select(name) %>%
  head(10)
# Add a second filter command
titanic %>%
  filter(name != "") %>%
  filter(name != ")") %>%
  select(name) %>%
  head(10)
Now we get 10 names that make sense! Fantastic.
Let’s now find all our Allisons on the Titanic. We can use the filter() command here again. However, filter(name == "Allison") would not work, because the name column contains both first and last names, so we need to be a bit smarter. We use the grepl() command here, which takes a pattern argument. It searches our name column for this pattern and returns TRUE or FALSE for each row, depending on whether the pattern is found:
filter: removed 1,305 rows (>99%), 4 rows remaining
# A tibble: 4 x 15
survived name pclass age child old female sibsp parch alone
<dbl> <chr> <dbl> <dbl> <dbl+l> <dbl> <dbl+l> <dbl> <dbl> <dbl>
1 1 Alli~ 1 0.917 1 [Chi~ 0 0 [Mal~ 1 2 0
2 0 Alli~ 1 2 1 [Chi~ 0 1 [Fem~ 1 2 0
3 0 Alli~ 1 30 0 [Adu~ 0 0 [Mal~ 1 2 0
4 0 Alli~ 1 25 0 [Adu~ 0 1 [Fem~ 1 2 0
# ... with 5 more variables: fare <dbl>, cherbourg <dbl>,
# queenstown <dbl>, southampton <dbl>, familymembers <dbl>
If we want to get the actual number as a value, we can of course add the nrow() command again:
filter: removed 1,305 rows (>99%), 4 rows remaining
[1] 4
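The pattern search described above can be sketched as follows, again on a small made-up tibble (toy, with hypothetical names) rather than the titanic data:

```r
library(dplyr)

# Made-up stand-in for the titanic data (names are hypothetical)
toy <- tibble(
  name   = c("Allison, Mr. A", "Brown, Mrs. B", "Allison, Miss. C", ""),
  pclass = c(1, 2, 1, 3)
)

# grepl() returns TRUE/FALSE per row; filter() keeps the TRUE rows
allisons <- toy %>%
  filter(grepl("Allison", name))

# Number of matching rows as a single value
n_allisons <- allisons %>%
  nrow()

allisons
n_allisons
```

Note that grepl() matches the pattern anywhere in the string, so a first name or a maiden name containing "Allison" would be counted too.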
4.4 Q4. List the last few names
Now we just repeat what we already know, using tail() rather than head():
filter: removed 66 rows (5%), 1,243 rows remaining
select: dropped 14 variables (survived, pclass, age, child, old, …)
# A tibble: 10 x 1
name
<chr>
1 de Messemaeker, Mr. Guillaume Joseph
2 de Messemaeker, Mrs. Guillaume Joseph (Emma)
3 de Mulder, Mr. Theodore
4 de Pelsmaeker, Mr. Alfons
5 del Carlo, Mr. Sebastiano
6 del Carlo, Mrs. Sebastiano (Argenia Genovesi)
7 van Billiard, Master. James William
8 van Billiard, Master. Walter John
9 van Billiard, Mr. Austin Blyler
10 van Melkebeke, Mr. Philemon
filter: removed 1,307 rows (>99%), 2 rows remaining
[1] 2
4.5 Q5. The oldest passenger
Identifying the oldest passenger is also not a problem with the same tools! We first try to identify the maximum age in the dataset using the summarise() command together with max():
summarise: now one row and one column, ungrouped
# A tibble: 1 x 1
`max(age)`
<dbl>
1 NA
What does it return? It returns one row, but the value is actually NA. This is one of R’s ways of saving us from ourselves: if there are missing values in the data, it wants us to be aware of this! So we need to call max() with the additional argument na.rm = TRUE to ignore all missing values:
summarise: now one row and one column, ungrouped
# A tibble: 1 x 1
`max(age, na.rm = TRUE)`
<dbl>
1 80
Now we see that the maximum age in the dataset is 80!
Let’s make the result above slightly prettier by naming the new column that we get from the summarise() command:
summarise: now one row and one column, ungrouped
# A tibble: 1 x 1
maxage
<dbl>
1 80
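The three summarise() steps above can be sketched on a small made-up tibble (toy, with hypothetical ages); the column name maxage is taken from the output above:

```r
library(dplyr)

# Made-up ages, including one missing value
toy <- tibble(age = c(22, NA, 80, 35))

# Step 1: a missing value makes the maximum NA
with_na <- toy %>%
  summarise(max(age))

# Step 2: na.rm = TRUE drops missing values before taking the maximum
without_na <- toy %>%
  summarise(max(age, na.rm = TRUE))

# Step 3: give the result column a readable name
named <- toy %>%
  summarise(maxage = max(age, na.rm = TRUE))

named
```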
Knowing that 80 is the maximum age, we could now just use filter(age == 80). This is fine for this particular dataset, as we are quite unlikely to get an update to this data and find someone aged, e.g., 83. However, with more recent data, we don’t want to hard-code this value into our filter command, because an update to the dataset could mean that 80 is no longer the highest value. So let’s make our result more dynamic:
filter: removed 1,308 rows (>99%), one row remaining
# A tibble: 1 x 15
survived name pclass age child old female sibsp parch alone fare
<dbl> <chr> <dbl> <dbl> <dbl+l> <dbl> <dbl+l> <dbl> <dbl> <dbl> <dbl>
1 1 Bark~ 1 80 0 [Adu~ 1 0 [Mal~ 0 0 1 30
# ... with 4 more variables: cherbourg <dbl>, queenstown <dbl>,
# southampton <dbl>, familymembers <dbl>
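A minimal sketch of such a dynamic filter, on a small made-up tibble (toy, with hypothetical values): the maximum is recomputed inside filter() every time the code runs, so nothing is hard-coded.

```r
library(dplyr)

# Made-up stand-in for the titanic data (values are hypothetical)
toy <- tibble(
  name = c("A", "B", "C", "D"),
  age  = c(22, NA, 80, 35)
)

# Keep the row(s) whose age equals the current maximum age
oldest <- toy %>%
  filter(age == max(age, na.rm = TRUE))

oldest
```

Rows with a missing age are dropped automatically, because filter() only keeps rows where the condition is TRUE.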
As you can see, rather than just putting 80 into the filter condition, we have now used the same expression that we used in the summarise() command above.
4.6 Q6. Fare Histogram
To plot our histogram, we start to use ggplot() for the very first time (see the plotting section). ggplot is incredibly powerful, but a bit challenging to get started with! So we fire up a helper app from the esquisse package by running esquisser().
We then select the titanic dataset and drag fare to the x section, and already our histogram appears. Feel free to play around with the colours etc. a bit! When you are done, you can click on Export & Code and ask the app to insert the code into your script!
We don’t need to load the ggplot2 library separately, because it is already loaded as part of the tidyverse, so the script it spits out for me is:
ggplot(titanic) +
  aes(x = fare) +
  geom_histogram(bins = 30L, fill = "#0c4c8a") +
  theme_minimal()
Warning: Removed 1 rows containing non-finite values (stat_bin).

The most basic form of this would have been:
titanic %>%
  ggplot() +
  geom_histogram(aes(x = fare))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1 rows containing non-finite values (stat_bin).

But the configuration options are nearly endless!!
4.7 Q7. Ticket Price Stats (Max and Median)
Now we want to summarise some information about the ticket price. We already know how to do this:
summarise: now one row and 2 columns, ungrouped
# A tibble: 1 x 2
maxfare medianfare
<dbl> <dbl>
1 512 14.5
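The call behind this output computes two summaries in a single summarise(). A minimal sketch on made-up fares (toy is hypothetical; the column names maxfare and medianfare are taken from the output above):

```r
library(dplyr)

# Made-up fares, including one missing value
toy <- tibble(fare = c(7.25, 512, NA, 14.5, 30))

# Several summaries can be computed in one summarise() call
fare_stats <- toy %>%
  summarise(
    maxfare    = max(fare, na.rm = TRUE),
    medianfare = median(fare, na.rm = TRUE)
  )

fare_stats
```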
4.8 Q8. Who bought the most expensive fare
Again, nothing new for us:
filter: removed 1,305 rows (>99%), 4 rows remaining
# A tibble: 4 x 15
survived name pclass age child old female sibsp parch alone fare
<dbl> <chr> <dbl> <dbl> <dbl+l> <dbl> <dbl+l> <dbl> <dbl> <dbl> <dbl>
1 1 Card~ 1 36 0 [Adu~ 0 0 [Mal~ 0 1 0 512
2 1 Card~ 1 58 0 [Adu~ 1 1 [Fem~ 0 1 0 512
3 1 Lesu~ 1 35 0 [Adu~ 0 0 [Mal~ 0 0 1 512
4 1 Ward~ 1 35 0 [Adu~ 0 1 [Fem~ 0 0 1 512
# ... with 4 more variables: cherbourg <dbl>, queenstown <dbl>,
# southampton <dbl>, familymembers <dbl>
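This is the same dynamic filter idea as for the oldest passenger, applied to fare. A minimal sketch on a made-up tibble (toy, with hypothetical values) in which two passengers share the highest fare:

```r
library(dplyr)

# Made-up stand-in for the titanic data (values are hypothetical)
toy <- tibble(
  name = c("A", "B", "C", "D"),
  fare = c(512, 7.25, 512, NA)
)

# All rows tied for the highest fare are kept
top_fare <- toy %>%
  filter(fare == max(fare, na.rm = TRUE))

top_fare
```

Because the comparison uses ==, every passenger tied for the maximum is returned, which is why the real data yields four rows.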
4.9 Q9. Average Ticket Prices
To calculate the average ticket price by point of embarkation, we could use filter() and summarise() again:
filter: removed 1,039 rows (79%), 270 rows remaining
summarise: now one row and one column, ungrouped
# A tibble: 1 x 1
meanfare
<dbl>
1 62.3
filter: removed 395 rows (30%), 914 rows remaining
summarise: now one row and one column, ungrouped
# A tibble: 1 x 1
meanfare
<dbl>
1 27.4
filter: removed 1,186 rows (91%), 123 rows remaining
summarise: now one row and one column, ungrouped
# A tibble: 1 x 1
meanfare
<dbl>
1 12.4
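The three results above each come from filtering on one of the embarkation indicators and then averaging the fare. A minimal sketch on a made-up tibble (toy, with hypothetical values); the order of the three calls is inferred from the row counts and is an assumption:

```r
library(dplyr)

# Made-up stand-in: one 0/1 indicator column per port
toy <- tibble(
  fare        = c(60, 80, 10, 20, 30, NA),
  cherbourg   = c(1, 1, 0, 0, 0, 0),
  southampton = c(0, 0, 1, 1, 0, 1),
  queenstown  = c(0, 0, 0, 0, 1, 0)
)

mean_cherbourg <- toy %>%
  filter(cherbourg == 1) %>%
  summarise(meanfare = mean(fare, na.rm = TRUE))

mean_southampton <- toy %>%
  filter(southampton == 1) %>%
  summarise(meanfare = mean(fare, na.rm = TRUE))

mean_queenstown <- toy %>%
  filter(queenstown == 1) %>%
  summarise(meanfare = mean(fare, na.rm = TRUE))

mean_cherbourg
mean_southampton
mean_queenstown
```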
4.10 Q10-11. Create new variable and arrange by size
To create a new variable, we use the mutate() command to add the sibsp and the parch variables together. To make it easier for us to find it, we can either use select() to keep just the columns that we care about or use relocate() to bring it to the left!
To then order the data according to this variable, we add the arrange() command, which orders the data according to the variables used as arguments. Its default is ascending order, so we wrap our argument in desc() to have the largest family on top!
titanic %>%
  select(survived, sibsp, parch) %>%
  mutate(familymembers = sibsp + parch) %>%
  arrange(desc(familymembers))
select: dropped 12 variables (name, pclass, age, child, old, …)
mutate: new variable 'familymembers' (double) with 9 unique values and 0% NA
# A tibble: 1,309 x 4
survived sibsp parch familymembers
<dbl> <dbl> <dbl> <dbl>
1 0 8 2 10
2 0 8 2 10
3 0 8 2 10
4 0 8 2 10
5 0 8 2 10
6 0 8 2 10
7 0 8 2 10
8 0 8 2 10
9 0 8 2 10
10 0 1 9 10
# ... with 1,299 more rows
Because this variable only counts a passenger’s relatives on board (siblings, spouses, parents and children) and not the passenger themselves, the largest family on the Titanic actually had 11 members, and not 10, which is the maximum value in the data.
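The arithmetic behind this can be sketched by adding one for the passenger themselves. The tibble toy and the column name familysize are hypothetical and not part of the exercise:

```r
library(dplyr)

# Made-up rows mirroring the sibsp/parch combinations shown above
toy <- tibble(
  sibsp = c(8, 1, 0),
  parch = c(2, 9, 0)
)

family <- toy %>%
  mutate(
    familymembers = sibsp + parch,    # relatives on board
    familysize    = familymembers + 1 # relatives plus the passenger
  ) %>%
  arrange(desc(familysize))

family
```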
4.11 Q12. Frequency Table
To investigate the passenger classes, we can simply use the count() command:
count: now 3 rows and 2 columns, ungrouped
mutate: new variable 'total' (integer) with one unique value and 0% NA
new variable 'freq' (double) with 3 unique values and 0% NA
# A tibble: 3 x 4
pclass n total freq
<dbl> <int> <int> <dbl>
1 1 323 1309 0.247
2 2 277 1309 0.212
3 3 709 1309 0.542
If we want to add the relative frequency, we can use the mutate() command to add another column, dividing each count by the sum of all counts.
count: now 3 rows and 2 columns, ungrouped
mutate: new variable 'share' (double) with 3 unique values and 0% NA
# A tibble: 3 x 3
pclass n share
<dbl> <int> <dbl>
1 1 323 0.247
2 2 277 0.212
3 3 709 0.542
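A minimal sketch of this frequency table on made-up class values (toy is hypothetical; the column name share is taken from the output above):

```r
library(dplyr)

# Made-up passenger classes
toy <- tibble(pclass = c(1, 1, 2, 3, 3, 3, 3, 3))

class_table <- toy %>%
  count(pclass) %>%           # one row per class, counts in column n
  mutate(share = n / sum(n))  # relative frequency of each class

class_table
```

count() names its result column n by default, which is why the share can be computed as n / sum(n).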
4.12 Q13. Average age by class
To calculate the average age by passenger class, we need to use the group_by() command for the first time. Declaring a dataset as grouped means that each following command will be executed for each group separately. Using the summarise() command again, we can get the mean age by passenger class quite easily using mean() with the option na.rm = TRUE again.
group_by: one grouping variable (pclass)
summarise: now 3 rows and 2 columns, ungrouped
# A tibble: 3 x 2
pclass mean_age
<dbl> <dbl>
1 1 39.2
2 2 29.5
3 3 24.8
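A minimal sketch of the grouped summary on a made-up tibble (toy, with hypothetical values; the column name mean_age is taken from the output above):

```r
library(dplyr)

# Made-up stand-in for the titanic data (values are hypothetical)
toy <- tibble(
  pclass = c(1, 1, 2, 2, 3, 3),
  age    = c(40, 38, 30, NA, 20, 30)
)

# summarise() now returns one row per passenger class
age_by_class <- toy %>%
  group_by(pclass) %>%
  summarise(mean_age = mean(age, na.rm = TRUE))

age_by_class
```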
4.13 Q14. Boxplots
To plot a boxplot, we use esquisser() again to get some drag-and-drop help, this time calling it on the titanic dataset straight away.
esquisser(titanic)
Playing around with the app a little bit gets me to this:
ggplot(titanic) +
  aes(x = "", y = age, fill = pclass, group = pclass) +
  geom_boxplot() +
  scale_fill_distiller(palette = "Pastel1") +
  theme_minimal()

This is great, but there is one small aspect I’m not happy with: the legend seems to indicate that a class could also be e.g. 1.5 or 2.5, which certainly does not make sense! The reason for this is that the type of the pclass variable is double, which allows any continuous numeric value.
For this reason, I’m changing (using the mutate() command) the type of the variable to a factor(), which only allows certain distinct values, in this case the 1, 2, and 3 that stand for the passenger classes.
mutate: converted 'pclass' from double to factor (0 new NA)
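A minimal sketch of this conversion on a made-up tibble (toy and toy_to_plot are hypothetical names). In the exercise, the same mutate() applied to titanic presumably creates the titanic_data_to_plot object used in the plots that follow:

```r
library(dplyr)

# Made-up stand-in for the titanic data (values are hypothetical)
toy <- tibble(
  pclass = c(3, 1, 2, 3),
  age    = c(22, 45, 30, NA)
)

# Convert the numeric class into a factor with distinct levels
toy_to_plot <- toy %>%
  mutate(pclass = factor(pclass))

levels(toy_to_plot$pclass)
```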
Now using esquisser() again, the ggplot() command looks slightly different in one line: we now use scale_fill_brewer(), which works for discrete variables, rather than scale_fill_distiller(), which interpolates a colour scheme to work with a continuous numeric variable.
ggplot(titanic_data_to_plot) +
  aes(x = "", y = age, fill = pclass, group = pclass) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel1") +
  theme_minimal()

One more tip: if we exchange geom_boxplot() for geom_violin(), we can get an even better looking graph to represent the distribution of a variable (in my mind):
ggplot(titanic_data_to_plot) +
  aes(x = "", y = age, fill = pclass, group = pclass) +
  geom_violin() +
  scale_fill_brewer(palette = "Pastel1") +
  theme_minimal()

4.14 Q15. Dot Plots
Creating a dot plot is slightly more difficult in R than it is in Stata.
First we use the factor() command again, this time with the labels argument, to label the value 0 in the female variable as Male and 1 as Female.
Then we use the stat_summary() command from ggplot2 to calculate the mean of the alone variable by both gender and passenger class.
Finally we use facet_wrap() to split the plot by the female variable.
titanic_data_to_plot %>%
  mutate(female = factor(female, labels = c("Male", "Female"))) %>%
  ggplot() +
  aes(x = alone, y = pclass) +
  stat_summary(fun = mean) + # fun replaces the deprecated fun.y argument
  facet_wrap(~female, nrow = 2) +
  theme_minimal()
mutate: converted 'female' from double to factor (0 new NA)
Warning: Removed 3 rows containing missing values (geom_segment).
Warning: Removed 3 rows containing missing values (geom_segment).

4.15 Q16. Adding a title to the dot plot
The next step, adding a title to the dot plot, is however very easy and actually the same for all ggplot() commands: we just add a line with the labs() command, which sets titles and axis labels (here, the x-axis label):
titanic_data_to_plot %>%
  mutate(female = factor(female, labels = c("Male", "Female"))) %>%
  ggplot() +
  aes(x = alone, y = pclass) +
  stat_summary(fun = mean) +
  facet_wrap(~female, nrow = 2) +
  labs(x = "Share of passengers travelling alone") +
  theme_minimal()
mutate: converted 'female' from double to factor (0 new NA)
Warning: Removed 3 rows containing missing values (geom_segment).
Warning: Removed 3 rows containing missing values (geom_segment).

4.16 Q17. Frequency Table for Survivors
count: now 2 rows and 2 columns, ungrouped
mutate: new variable 'share' (double) with 2 unique values and 0% NA
# A tibble: 2 x 3
survived n share
<dbl> <int> <dbl>
1 0 809 0.618
2 1 500 0.382
4.17 Q18. Crosstable of women and survivors
We can easily get a cross-table of two variables by first selecting them using select() and then using the simple table() command.
select: dropped 13 variables (name, pclass, age, child, old, …)
female
survived 0 1
0 682 127
1 161 339
If we want to get the share of each cell, we simply add prop.table() as the next command.
titanic %>%
  select(survived, female) %>%
  table() %>%
  prop.table()
select: dropped 13 variables (name, pclass, age, child, old, …)
female
survived 0 1
0 0.52100840 0.09702063
1 0.12299465 0.25897632
4.18 Q19. Crosstable of children and survivors
titanic %>%
  select(survived, child) %>%
  table() %>%
  prop.table()
select: dropped 13 variables (name, pclass, age, old, female, …)
child
survived 0 1
0 0.54493308 0.04684512
1 0.34512428 0.06309751
Finishing Up
Once you are finished with your exercises, click on “Knit” to create the PDF document from your RMarkdown file. See 3.3 for more information on how to create a final PDF.