4 Exercise 1

4.1 Q1. Getting Started

Before we get started we need to open our project.

As discussed in our Project Management chapter (3), we first want to open our QEH Project file. To open the QEH.Rproj file, we head over to the folder we put it in and double click on it in our file browser - or we open it from RStudio.

Now we can check that we are in the right working directory and in the right project by looking at the project name in the top right of RStudio.

Let’s create a new RMarkdown file (see 3.3) by selecting R Notebook in the “File” menu and let’s save it as Introduction_1.Rmd in our project folder. Because we want to create a PDF report with a proper title and our name, we modify the header of the RMarkdown document ever so slightly (most importantly, we replace html_notebook with pdf_document).

---
title: "Introduction Exercise 1"
author: "Moritz Schwarz"
date: "15. January 2021"
output: pdf_document
---

Next, we want to make sure we have all tools ready that we will need in this exercise, so we load a few libraries (use install.packages("packagename") first, if you are missing one):

library(tidyverse) # our main collection of functions
library(tidylog) # prints additional output from the tidyverse commands - load after tidyverse 
library(skimr) # allows us to get an overview over the data quickly
library(haven) # allows us to load .dta (Stata specific) files
library(here) # needed to navigate to folders and files in a project
library(esquisse) # an app to help us with the plotting in ggplot
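
If you are unsure which of these packages are already installed, a small check like the following can help. This is only a sketch: the vector `needed` simply repeats the package names from the library() calls above, and `missing_pkgs` is a name chosen here for illustration.

```r
# Packages used in this exercise (taken from the library() calls above)
needed <- c("tidyverse", "tidylog", "skimr", "haven", "here", "esquisse")

# requireNamespace() returns TRUE if a package is installed, without attaching it
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still have to be installed
missing_pkgs <- needed[!installed]
missing_pkgs

# Uncomment to install whatever is missing:
# install.packages(missing_pkgs)
```

The install line is left commented out on purpose, so that running the check never changes your R installation by accident.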

Now we are ready to actually start our work!

4.1.1 Loading Data

As pointed out in the exercise, we load the titanic3s12.dta dataset - I have put all my datasets into a data folder in our project folder.

Having loaded the haven package and using the here() command to navigate to our data folder, we use:

titanic <- read_dta(here("data","titanic3s12.dta"))

4.2 Q2. Explore the data

Let’s get a feel for the data first, by printing it to the console:

titanic
# A tibble: 1,309 x 15
   survived name  pclass    age    child   old  female sibsp parch alone
      <dbl> <chr>  <dbl>  <dbl> <dbl+lb> <dbl> <dbl+l> <dbl> <dbl> <dbl>
 1        1 ""         3 NA     NA          NA 1 [Fem~     0     0     1
 2        1 ""         2 45      0 [Adu~     0 1 [Fem~     0     0     1
 3        1 ""         2  6      1 [Chi~     0 1 [Fem~     0     1     0
 4        0 ""         3 NA     NA          NA 0 [Mal~     0     0     1
 5        0 ""         3 NA     NA          NA 0 [Mal~     0     0     1
 6        1 ""         3 15      1 [Chi~     0 1 [Fem~     0     0     1
 7        1 ""         1 21      0 [Adu~     0 1 [Fem~     2     2     0
 8        1 ""         3 18      0 [Adu~     0 0 [Mal~     0     0     1
 9        1 ""         3 NA     NA          NA 1 [Fem~     0     0     1
10        1 ""         3  0.167  1 [Chi~     0 1 [Fem~     1     2     0
# ... with 1,299 more rows, and 5 more variables: fare <dbl>,
#   cherbourg <dbl>, queenstown <dbl>, southampton <dbl>,
#   familymembers <dbl>

and then let’s open the full data set so we can browse through it. As we are using the tidyverse style of programming here, we use the pipe operator %>% to string multiple commands together (shortcut CTRL + Shift + M). Remember, just think of the pipe as saying “then”, i.e. do one command, then do the next. Here we want to take the data that is called titanic and then we want to view it:

titanic %>% View()

Because it is easier to visually follow a command, it is advisable that you press enter in your code editor after each pipe operator - this gives us a nice automatic indentation (also select and use CTRL + I to auto-indent your code).

titanic %>% 
  View()

Of course, as we discussed in Section ??, this is equivalent to using View(titanic).
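
To convince ourselves that the pipe really is just another way of writing a function call, here is a minimal sketch. It uses a small made-up tibble called `toy` (not the Titanic data) and nrow() instead of View(), so that it also runs outside RStudio.

```r
library(dplyr)  # provides the %>% pipe and tibble()

# Hypothetical toy data, not the Titanic file
toy <- tibble(name = c("A", "B", "C"), age = c(30, NA, 12))

piped  <- toy %>% nrow()   # "take toy, then count its rows"
nested <- nrow(toy)        # the same call written the classic way

identical(piped, nested)

# Extra arguments simply follow the piped object:
# toy %>% head(2) is the same as head(toy, 2)
toy %>% head(2)
```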

We can use two functions to get a basic summary for the data:

titanic %>% 
  summary()
    survived         name               pclass           age         
 Min.   :0.000   Length:1309        Min.   :1.000   Min.   : 0.1667  
 1st Qu.:0.000   Class :character   1st Qu.:2.000   1st Qu.:21.0000  
 Median :0.000   Mode  :character   Median :3.000   Median :28.0000  
 Mean   :0.382                      Mean   :2.295   Mean   :29.8811  
 3rd Qu.:1.000                      3rd Qu.:3.000   3rd Qu.:39.0000  
 Max.   :1.000                      Max.   :3.000   Max.   :80.0000  
                                                    NA's   :263      
     child             old             female          sibsp       
 Min.   :0.0000   Min.   :0.0000   Min.   :0.000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:0.0000  
 Median :0.0000   Median :0.0000   Median :0.000   Median :0.0000  
 Mean   :0.1099   Mean   :0.1052   Mean   :0.356   Mean   :0.4989  
 3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.000   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.000   Max.   :8.0000  
 NA's   :263      NA's   :263                                      
     parch           alone             fare           cherbourg     
 Min.   :0.000   Min.   :0.0000   Min.   :  0.000   Min.   :0.0000  
 1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:  7.896   1st Qu.:0.0000  
 Median :0.000   Median :1.0000   Median : 14.454   Median :0.0000  
 Mean   :0.385   Mean   :0.6035   Mean   : 33.294   Mean   :0.2066  
 3rd Qu.:0.000   3rd Qu.:1.0000   3rd Qu.: 31.275   3rd Qu.:0.0000  
 Max.   :9.000   Max.   :1.0000   Max.   :512.000   Max.   :1.0000  
                                  NA's   :1         NA's   :2       
   queenstown       southampton     familymembers    
 Min.   :0.00000   Min.   :0.0000   Min.   : 0.0000  
 1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.: 0.0000  
 Median :0.00000   Median :1.0000   Median : 0.0000  
 Mean   :0.09411   Mean   :0.6993   Mean   : 0.8839  
 3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.: 1.0000  
 Max.   :1.00000   Max.   :1.0000   Max.   :10.0000  
 NA's   :2         NA's   :2                         

or:

titanic %>% 
  skim()
Table 4.1: Data summary

Name                    Piped data
Number of rows          1309
Number of columns       15
_______________________
Column type frequency:
  character             3
  numeric               12
________________________
Group variables         None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
name                   0            1.0    0   82     42      1243           0
child                263            0.8    1    1      0         2           0
female                 0            1.0    1    1      0         2           0

Variable type: numeric

skim_variable  n_missing  complete_rate   mean     sd    p0   p25    p50    p75  p100  hist
survived               0            1.0   0.38   0.49  0.00   0.0   0.00   1.00     1  ▇▁▁▁▅
pclass                 0            1.0   2.29   0.84  1.00   2.0   3.00   3.00     3  ▃▁▃▁▇
age                  263            0.8  29.88  14.41  0.17  21.0  28.00  39.00    80  ▂▇▅▂▁
old                  263            0.8   0.11   0.31  0.00   0.0   0.00   0.00     1  ▇▁▁▁▁
sibsp                  0            1.0   0.50   1.04  0.00   0.0   0.00   1.00     8  ▇▁▁▁▁
parch                  0            1.0   0.39   0.87  0.00   0.0   0.00   0.00     9  ▇▁▁▁▁
alone                  0            1.0   0.60   0.49  0.00   0.0   1.00   1.00     1  ▅▁▁▁▇
fare                   1            1.0  33.29  51.75  0.00   7.9  14.45  31.27   512  ▇▁▁▁▁
cherbourg              2            1.0   0.21   0.41  0.00   0.0   0.00   0.00     1  ▇▁▁▁▂
queenstown             2            1.0   0.09   0.29  0.00   0.0   0.00   0.00     1  ▇▁▁▁▁
southampton            2            1.0   0.70   0.46  0.00   0.0   1.00   1.00     1  ▃▁▁▁▇
familymembers          0            1.0   0.88   1.58  0.00   0.0   0.00   1.00    10  ▇▁▁▁▁

Let’s now use a few functions to get a better sense of the types of data we are working with (str for structure):

titanic %>% 
  str()
tibble [1,309 x 15] (S3: tbl_df/tbl/data.frame)
 $ survived     : num [1:1309] 1 1 1 0 0 1 1 1 1 1 ...
  ..- attr(*, "label")= chr "Passenger survived"
  ..- attr(*, "format.stata")= chr "%8.0g"
 $ name         : chr [1:1309] "" "" "" "" ...
  ..- attr(*, "label")= chr "Name of passenger"
  ..- attr(*, "format.stata")= chr "%82s"
 $ pclass       : num [1:1309] 3 2 2 3 3 3 1 3 3 3 ...
  ..- attr(*, "label")= chr "Passenger class"
  ..- attr(*, "format.stata")= chr "%8.0g"
 $ age          : num [1:1309] NA 45 6 NA NA ...
  ..- attr(*, "label")= chr "Age of passenger"
  ..- attr(*, "format.stata")= chr "%9.0g"
 $ child        : dbl+lbl [1:1309] NA,  0,  1, NA, NA,  1,  0,  0, NA,  1,  0, NA, ...
   ..@ label       : chr "Child (< 16 years old)"
   ..@ format.stata: chr "%9.0g"
   ..@ labels      : Named num [1:2] 0 1
   .. ..- attr(*, "names")= chr [1:2] "Adult" "Child"
 $ old          : num [1:1309] NA 0 0 NA NA 0 0 0 NA 0 ...
  ..- attr(*, "label")= chr "Old passenger (>= 50 years old)"
  ..- attr(*, "format.stata")= chr "%9.0g"
 $ female       : dbl+lbl [1:1309] 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, ...
   ..@ label       : chr "Female passenger"
   ..@ format.stata: chr "%9.0g"
   ..@ labels      : Named num [1:2] 0 1
   .. ..- attr(*, "names")= chr [1:2] "Male" "Female"
 $ sibsp        : num [1:1309] 0 0 0 0 0 0 2 0 0 1 ...
  ..- attr(*, "label")= chr "Number of siblings and spouses aboard"
  ..- attr(*, "format.stata")= chr "%8.0g"
 $ parch        : num [1:1309] 0 0 1 0 0 0 2 0 0 2 ...
  ..- attr(*, "label")= chr "Number of parents and children aboard"
  ..- attr(*, "format.stata")= chr "%8.0g"
 $ alone        : num [1:1309] 1 1 0 1 1 1 0 1 1 0 ...
  ..- attr(*, "label")= chr "Passenger travelled alone"
  ..- attr(*, "format.stata")= chr "%9.0g"
 $ fare         : num [1:1309] 7.88 13.5 33 7.75 8.05 ...
  ..- attr(*, "label")= chr "Passenger fare (in Pre-1970 British Pounds)"
  ..- attr(*, "format.stata")= chr "%9.0g"
 $ cherbourg    : num [1:1309] 0 0 0 0 0 1 1 0 0 0 ...
  ..- attr(*, "label")= chr "Embarked at Cherbourg (France)"
  ..- attr(*, "format.stata")= chr "%9.0g"
 $ queenstown   : num [1:1309] 1 0 0 1 0 0 0 0 1 0 ...
  ..- attr(*, "label")= chr "Embarked at Queenstown (Ireland)"
  ..- attr(*, "format.stata")= chr "%9.0g"
 $ southampton  : num [1:1309] 0 1 1 0 1 0 0 1 0 1 ...
  ..- attr(*, "label")= chr "Embarked at Southampton (UK)"
  ..- attr(*, "format.stata")= chr "%9.0g"
 $ familymembers: num [1:1309] 0 0 1 0 0 0 4 0 0 3 ...
  ..- attr(*, "format.stata")= chr "%9.0g"

Given that each variable is in one column, we could either look at the output of the summaries above to find out the number of variables, or we use the ncol() command, which gives us the number of columns in an object:

titanic %>% 
  ncol()
[1] 15

We use a similar command, nrow() giving us the number of rows, to answer the first part of question 2:

titanic %>% 
  nrow()
[1] 1309

We can use the summary() command from above again to check the number of missing values. The skim() command helpfully gives us this information in an n_missing column, so we can use the select() command from the tidyverse to select just that!

titanic %>% 
  skim() %>% 
  select(n_missing)
select: dropped 16 variables (skim_type, skim_variable, complete_rate, character.min, character.max, …)
# A tibble: 15 x 1
   n_missing
       <int>
 1         0
 2       263
 3         0
 4         0
 5         0
 6       263
 7       263
 8         0
 9         0
10         0
11         1
12         2
13         2
14         2
15         0

To make it a bit easier to see what each row refers to, we can also select the variable name column from the skim() output.

titanic %>% 
  skim() %>% 
  select(skim_variable,n_missing) 
select: dropped 15 variables (skim_type, complete_rate, character.min, character.max, character.empty, …)
# A tibble: 15 x 2
   skim_variable n_missing
   <chr>             <int>
 1 name                  0
 2 child               263
 3 female                0
 4 survived              0
 5 pclass                0
 6 age                 263
 7 old                 263
 8 sibsp                 0
 9 parch                 0
10 alone                 0
11 fare                  1
12 cherbourg             2
13 queenstown            2
14 southampton           2
15 familymembers         0

I personally don’t like the name of the new variable column - so I want to rename it using the (surprise) rename command by just adding another line and again the pipe operator %>% to then rename it to just variable:

titanic %>% 
  skim() %>% 
  select(skim_variable,n_missing) %>% 
  rename(variable = skim_variable)
select: dropped 15 variables (skim_type, complete_rate, character.min, character.max, character.empty, …)
rename: renamed one variable (variable)
# A tibble: 15 x 2
   variable      n_missing
   <chr>             <int>
 1 name                  0
 2 child               263
 3 female                0
 4 survived              0
 5 pclass                0
 6 age                 263
 7 old                 263
 8 sibsp                 0
 9 parch                 0
10 alone                 0
11 fare                  1
12 cherbourg             2
13 queenstown            2
14 southampton           2
15 familymembers         0

Now all we have to do, if we want to use the object again later, is to save it! We do so using the right-hand assignment operator -> at the end of the pipe (typing missing_each_variable afterwards prints it again):

titanic %>% 
  skim() %>% 
  select(skim_variable,n_missing) %>% 
  rename(variable = skim_variable) -> missing_each_variable
Warning: Couldn't find skimmers for class: haven_labelled, vctrs_vctr,
double, numeric; No user-defined `sfl` provided. Falling back to
`character`.

Warning: Couldn't find skimmers for class: haven_labelled, vctrs_vctr,
double, numeric; No user-defined `sfl` provided. Falling back to
`character`.
select: dropped 15 variables (skim_type, complete_rate, character.min, character.max, character.empty, …)
rename: renamed one variable (variable)
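
If we only care about the number of missing values, we do not strictly need skim(). The sketch below shows two alternatives on a small made-up tibble `toy` (an assumption for illustration, not the Titanic data); the across() variant assumes dplyr 1.0.0 or newer.

```r
library(dplyr)

# Hypothetical toy data with a few missing values
toy <- tibble(
  name = c("A", "", "C", "D"),
  age  = c(30, NA, 12, NA),
  fare = c(7.9, 13.5, NA, 33)
)

# Base R: is.na() gives a TRUE/FALSE matrix, colSums() counts the TRUEs per column
na_counts_base <- colSums(is.na(toy))
na_counts_base

# dplyr: apply the same count to every column in one summarise() call
na_counts_dplyr <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts_dplyr
```

Note that the empty string in `name` is not counted as missing by either approach.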

4.3 Q3. List the first few names

Question 3 asks us to list the first 10 names that were on the Titanic. Not a problem, indeed the head() command does just that: it always shows us the first few elements/rows of an object. To get the first 10, we can use head(data, 10) or just head(10) if we use it in the pipe. To get the last few elements, we can use tail().

So let’s do this on the titanic data. We first select() just the name column and then take a look at the first 10 lines:

titanic %>% 
  select(name) %>% 
  head(10)
select: dropped 14 variables (survived, pclass, age, child, old, …)
# A tibble: 10 x 1
   name 
   <chr>
 1 ""   
 2 ""   
 3 ""   
 4 ""   
 5 ""   
 6 ""   
 7 ""   
 8 ""   
 9 ""   
10 ""   

Hmmm… the first 10 names seem to be empty. If we use e.g. the View() command, we’ll quickly see that these values are not actually missing (i.e. are not NA) - they are just empty character strings "".
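
The difference between an empty string and a missing value can be seen in a small sketch. The tibble `toy` and the name in it are made up for illustration; na_if() from dplyr is one possible way to turn empty strings into proper NA values, should we want to.

```r
library(dplyr)

is.na("")             # FALSE: an empty string is still a value
is.na(NA_character_)  # TRUE: this one is really missing

# Hypothetical toy data: one empty name, one real name, one missing name
toy <- tibble(name = c("", "Doe, Mr. John", NA))

toy %>%
  mutate(is_missing = is.na(name),
         is_empty   = name == "")

# na_if() replaces a chosen value (here "") with NA
cleaned <- toy %>%
  mutate(name = na_if(name, ""))

sum(is.na(cleaned$name))
```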

So if we want the first real names, we need to filter them out. And again, there is a filter() function just for that:

titanic %>% 
  filter(name != "") %>% 
  select(name) %>% 
  head(10)
filter: removed 42 rows (3%), 1,267 rows remaining
select: dropped 14 variables (survived, pclass, age, child, old, …)
# A tibble: 10 x 1
   name              
   <chr>             
 1 " Agnes Hughes)"  
 2 " Alexander)"     
 3 " Dart Trevaskis)"
 4 " Elias)"         
 5 " Godfrey)"       
 6 " Inglis Milne)"  
 7 " Mowad)"         
 8 " Treanor)"       
 9 " Watson)"        
10 ")"               

Great! But now we still have a name in our short list that is just ")" - again clearly a data error! So let’s filter those out as well - there are many different ways to add this second condition, so here are two simple examples that both lead to the same result. See 1.2 for more detail:

# Add a second statement to the filter command
titanic %>% 
  filter(name != "" & name != ")") %>% 
  select(name) %>% 
  head(10)

# Add a second filter command
titanic %>% 
  filter(name != "") %>% 
  filter(name != ")") %>% 
  select(name) %>% 
  head(10)

Now we get 10 names that make sense! Fantastic.
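
To check that the two ways of writing the filter really agree, we can compare them on a tiny made-up example. The tibble `toy` and its names are assumptions for illustration; the third variant with %in% is one more common spelling of the same idea.

```r
library(dplyr)

# Hypothetical toy data with the two kinds of broken names
toy <- tibble(name = c("", ")", "Doe, Mr. John", "Roe, Mrs. Jane", ""))

# One filter with two conditions
one_filter <- toy %>%
  filter(name != "" & name != ")")

# Two filters, one after the other
two_filters <- toy %>%
  filter(name != "") %>%
  filter(name != ")")

# Keep rows whose name is not in a set of unwanted values
with_in <- toy %>%
  filter(!name %in% c("", ")"))

identical(one_filter$name, two_filters$name)
identical(one_filter$name, with_in$name)
```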

Let’s now find all our Allisons on the Titanic - we can use the filter() command here again. However, filter(name == "Allison") would not work, because the names contain both first and last names, so we need to be a bit smarter. We’re using the grepl() command here, which takes a pattern argument. It then searches our name column for this pattern and returns TRUE or FALSE for each row, depending on whether the pattern is found:

titanic %>% 
  filter(grepl(pattern = "Allison",x = name))
filter: removed 1,305 rows (>99%), 4 rows remaining
# A tibble: 4 x 15
  survived name  pclass    age   child   old  female sibsp parch alone
     <dbl> <chr>  <dbl>  <dbl> <dbl+l> <dbl> <dbl+l> <dbl> <dbl> <dbl>
1        1 Alli~      1  0.917 1 [Chi~     0 0 [Mal~     1     2     0
2        0 Alli~      1  2     1 [Chi~     0 1 [Fem~     1     2     0
3        0 Alli~      1 30     0 [Adu~     0 0 [Mal~     1     2     0
4        0 Alli~      1 25     0 [Adu~     0 1 [Fem~     1     2     0
# ... with 5 more variables: fare <dbl>, cherbourg <dbl>,
#   queenstown <dbl>, southampton <dbl>, familymembers <dbl>

If we wanted to get the actual number as a value, we can of course add the nrow() command again:

titanic %>% 
  filter(grepl(pattern = "Allison",x = name)) %>% 
  nrow()
filter: removed 1,305 rows (>99%), 4 rows remaining
[1] 4
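
To see what grepl() returns on its own, here is a small base R sketch on a made-up vector `passenger_names` (the names are invented for illustration). It also shows that a plain pattern matches anywhere in the string, which may or may not be what we want.

```r
# Hypothetical names, not taken from the Titanic data
passenger_names <- c("Allison, Mr. A", "Smith, Mrs. B", "McAllison, Mr. D")

# One TRUE/FALSE per element: does the pattern occur anywhere in the string?
matches <- grepl(pattern = "Allison", x = passenger_names)
matches

# TRUE counts as 1, so sum() gives the number of matches
sum(matches)

# "^" anchors the pattern to the start of the string,
# so "McAllison" no longer matches
anchored <- grepl(pattern = "^Allison,", x = passenger_names)
anchored
```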

4.4 Q4. List the last few names

Now we just repeat what we already know - using tail() rather than head():

titanic %>% 
  filter(name != "" & name != ")") %>% 
  select(name) %>% 
  tail(10)
filter: removed 66 rows (5%), 1,243 rows remaining
select: dropped 14 variables (survived, pclass, age, child, old, …)
# A tibble: 10 x 1
   name                                         
   <chr>                                        
 1 de Messemaeker, Mr. Guillaume Joseph         
 2 de Messemaeker, Mrs. Guillaume Joseph (Emma) 
 3 de Mulder, Mr. Theodore                      
 4 de Pelsmaeker, Mr. Alfons                    
 5 del Carlo, Mr. Sebastiano                    
 6 del Carlo, Mrs. Sebastiano (Argenia Genovesi)
 7 van Billiard, Master. James William          
 8 van Billiard, Master. Walter John            
 9 van Billiard, Mr. Austin Blyler              
10 van Melkebeke, Mr. Philemon                  

We can count the passengers named Zakarian in the same way as we counted the Allisons:

titanic %>% 
  filter(grepl(pattern = "Zakarian",x = name)) %>% 
  nrow()
filter: removed 1,307 rows (>99%), 2 rows remaining
[1] 2

4.5 Q5. The oldest passenger

Identifying the oldest passenger is also not a problem with the same tools! We first identify the maximum age in the dataset using the summarise() command, which does just what we want it to do:

titanic %>% 
  summarise(max(age))
summarise: now one row and one column, ungrouped
# A tibble: 1 x 1
  `max(age)`
       <dbl>
1         NA

What does it return? It returns one row, but the value it returns is actually NA. Now we see one of R’s ways to save us from ourselves - if there are missing values in the dataset, it wants us to be aware of this! So we need to use the max() command with the additional argument na.rm = TRUE to ignore all missing values in the dataset:

titanic %>% 
  summarise(max(age,na.rm=TRUE))
summarise: now one row and one column, ungrouped
# A tibble: 1 x 1
  `max(age, na.rm = TRUE)`
                     <dbl>
1                       80

Now we see that the maximum age in the dataset is 80!
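
The same behaviour can be reproduced on a plain vector, which makes it easier to see what na.rm does. The vector `ages` is made up for illustration.

```r
# Hypothetical ages with one missing value
ages <- c(45, 6, NA, 80)

max(ages)                 # NA: R refuses to guess what the missing value is
max(ages, na.rm = TRUE)   # 80: missing values are removed first

# The same argument works for many other summary functions
mean(ages, na.rm = TRUE)
median(ages, na.rm = TRUE)
```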

Let’s make the result above slightly more pretty by naming the new column that we get from the summarise command like this:

titanic %>% 
  summarise(maxage = max(age, na.rm = TRUE))
summarise: now one row and one column, ungrouped
# A tibble: 1 x 1
  maxage
   <dbl>
1     80

Knowing that 80 is the maximum age, we could now just use filter(age == 80). This is great for this particular dataset - we are quite unlikely to get an update to this data and find someone e.g. aged 83. However, with more recent data, we don’t want to hard-code this value into our filter command, because an update to a dataset could mean that 80 might not be the highest number anymore. So let’s make our result more dynamic:

titanic %>% 
  filter(age == max(age, na.rm = TRUE))
filter: removed 1,308 rows (>99%), one row remaining
# A tibble: 1 x 15
  survived name  pclass   age   child   old  female sibsp parch alone  fare
     <dbl> <chr>  <dbl> <dbl> <dbl+l> <dbl> <dbl+l> <dbl> <dbl> <dbl> <dbl>
1        1 Bark~      1    80 0 [Adu~     1 0 [Mal~     0     0     1    30
# ... with 4 more variables: cherbourg <dbl>, queenstown <dbl>,
#   southampton <dbl>, familymembers <dbl>

You can see that, rather than just putting 80 into the filter evaluation, we have now used the same max() expression that we used in the summarise() command above.
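
Why this matters can be sketched with a made-up dataset that receives an update. The tibble `toy`, the helper function `oldest()` and the added passenger are all hypothetical and only serve to illustrate the point.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(name = c("A", "B", "C"), age = c(45, 80, NA))

# Hypothetical helper: keep the row(s) with the highest age
oldest <- function(data) {
  data %>% filter(age == max(age, na.rm = TRUE))
}

oldest(toy)$name

# Now the data gets updated with an even older passenger
updated <- bind_rows(toy, tibble(name = "D", age = 83))

dynamic    <- oldest(updated)$name                          # follows the data
hard_coded <- updated %>% filter(age == 80) %>% pull(name)  # stuck at 80

dynamic
hard_coded
```

The dynamic version picks up the new passenger, while the hard-coded version silently keeps returning the old answer.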

4.6 Q6. Fare Histogram

To plot our histogram, we start to use ggplot() for the very first time (see (plotting)). ggplot is incredibly powerful - but a bit challenging to get started with! So we fire up a helper app, called esquisse:

We then select the titanic dataset and drag fare to the x section - and already our histogram appears. Feel free to play around with the colours etc. a bit! When you are done, you can click on Export & Code and ask the app to insert the code into your script!

We don’t need to load the ggplot2 library separately, because it is already loaded as part of the tidyverse, so the script it spits out for me is:

ggplot(titanic) +
  aes(x = fare) +
  geom_histogram(bins = 30L, fill = "#0c4c8a") +
  theme_minimal()
Warning: Removed 1 rows containing non-finite values (stat_bin).

The most basic form of this would have been:

titanic %>% 
  ggplot() +
  geom_histogram(aes(x=fare))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1 rows containing non-finite values (stat_bin).

But the configuration options are nearly endless!!
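
As one example of such a configuration, the message above suggests picking a binwidth ourselves. The sketch below does this on a small made-up data frame `toy`; the bin width of 10 and the axis labels are arbitrary choices for illustration, not recommendations for the Titanic data.

```r
library(ggplot2)

# Hypothetical fares, including one missing value
toy <- data.frame(fare = c(7.9, 8.1, 13.5, 26, 31.3, 52, 80, NA))

p <- ggplot(toy) +
  aes(x = fare) +
  # binwidth sets the width of each bar in units of x;
  # na.rm = TRUE drops missing values without a warning
  geom_histogram(binwidth = 10, na.rm = TRUE) +
  labs(x = "Fare", y = "Number of passengers") +
  theme_minimal()

p
```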

4.7 Q7. Ticket Price Stats (Max and Median)

Now we want to summarise some information about the ticket price. We already know how to do this:

titanic %>% 
  summarise(maxfare = max(fare,na.rm=TRUE),
            medianfare = median(fare,na.rm=TRUE))
summarise: now one row and 2 columns, ungrouped
# A tibble: 1 x 2
  maxfare medianfare
    <dbl>      <dbl>
1     512       14.5

4.8 Q8. Who bought the most expensive fare

Again, nothing new for us:

titanic %>% 
  filter(fare == max(fare, na.rm = TRUE))
filter: removed 1,305 rows (>99%), 4 rows remaining
# A tibble: 4 x 15
  survived name  pclass   age   child   old  female sibsp parch alone  fare
     <dbl> <chr>  <dbl> <dbl> <dbl+l> <dbl> <dbl+l> <dbl> <dbl> <dbl> <dbl>
1        1 Card~      1    36 0 [Adu~     0 0 [Mal~     0     1     0   512
2        1 Card~      1    58 0 [Adu~     1 1 [Fem~     0     1     0   512
3        1 Lesu~      1    35 0 [Adu~     0 0 [Mal~     0     0     1   512
4        1 Ward~      1    35 0 [Adu~     0 1 [Fem~     0     0     1   512
# ... with 4 more variables: cherbourg <dbl>, queenstown <dbl>,
#   southampton <dbl>, familymembers <dbl>

4.9 Q9. Average Ticket Prices

To calculate the average ticket prices by point of embarkation, we could use filter() and summarise() again:

titanic %>% 
  filter(cherbourg == 1) %>% 
  summarise(meanfare = mean(fare, na.rm = TRUE))
filter: removed 1,039 rows (79%), 270 rows remaining
summarise: now one row and one column, ungrouped
# A tibble: 1 x 1
  meanfare
     <dbl>
1     62.3
titanic %>% 
  filter(southampton == 1) %>% 
  summarise(meanfare = mean(fare, na.rm = TRUE))
filter: removed 395 rows (30%), 914 rows remaining
summarise: now one row and one column, ungrouped
# A tibble: 1 x 1
  meanfare
     <dbl>
1     27.4
titanic %>% 
  filter(queenstown == 1) %>% 
  summarise(meanfare = mean(fare, na.rm = TRUE))
filter: removed 1,186 rows (91%), 123 rows remaining
summarise: now one row and one column, ungrouped
# A tibble: 1 x 1
  meanfare
     <dbl>
1     12.4
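
If we prefer to see all three averages side by side, one possible alternative is to subset the fare inside a single summarise() call. This is only a sketch on a made-up tibble `toy` with the same 0/1 column layout; the column names mean_cherbourg etc. are chosen here so that they do not clash with the existing dummy variables.

```r
library(dplyr)

# Hypothetical toy data with one 0/1 column per port
toy <- tibble(
  fare        = c(60, 64, 27, 28, 12, NA),
  cherbourg   = c(1, 1, 0, 0, 0, 0),
  southampton = c(0, 0, 1, 1, 0, 1),
  queenstown  = c(0, 0, 0, 0, 1, 0)
)

# which() returns the positions where the condition is TRUE
# (and skips NA), so fare[...] keeps only that port's fares
means <- toy %>%
  summarise(
    mean_cherbourg   = mean(fare[which(cherbourg == 1)], na.rm = TRUE),
    mean_southampton = mean(fare[which(southampton == 1)], na.rm = TRUE),
    mean_queenstown  = mean(fare[which(queenstown == 1)], na.rm = TRUE)
  )

means
```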

4.10 Q10-11. Create new variable and arrange by size

To create a new variable, we use the mutate() command to add the sibsp and the parch variable together. To make it easier for us to find it, we can either use select() to just select the columns that we care about or use relocate() to bring it to the left!

To then order according to this variable, we add the arrange() command, which orders the data according to the variables used as arguments - its default is to use ascending ordering, so we wrap our argument in desc() to have the largest family on top!

titanic %>% 
  select(survived, sibsp,parch) %>% 
  mutate(familymembers = sibsp + parch) %>% 
  arrange(desc(familymembers))
select: dropped 12 variables (name, pclass, age, child, old, …)
mutate: new variable 'familymembers' (double) with 9 unique values and 0% NA
# A tibble: 1,309 x 4
   survived sibsp parch familymembers
      <dbl> <dbl> <dbl>         <dbl>
 1        0     8     2            10
 2        0     8     2            10
 3        0     8     2            10
 4        0     8     2            10
 5        0     8     2            10
 6        0     8     2            10
 7        0     8     2            10
 8        0     8     2            10
 9        0     8     2            10
10        0     1     9            10
# ... with 1,299 more rows

Because this variable only counts the siblings, spouses, parents and children travelling with a passenger, and not the passenger themselves, the largest family on the Titanic actually had 11 members, and not the maximum number in the data, which is 10.
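
If we wanted the family size including the passenger, we could add one to the count. The sketch below uses a made-up tibble `toy`, and `familysize` is a hypothetical variable name that is not part of the dataset.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(sibsp = c(0, 8, 1), parch = c(0, 2, 9))

sizes <- toy %>%
  mutate(familymembers = sibsp + parch,        # relatives on board
         familysize    = familymembers + 1) %>% # relatives plus the passenger
  arrange(desc(familysize))

sizes
```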

4.11 Q12. Frequency Table

To investigate the passenger classes, we can simply use the count() command:

titanic %>% 
  count(pclass) %>% 
  mutate(total = sum(n),
         freq = n/total)
count: now 3 rows and 2 columns, ungrouped
mutate: new variable 'total' (integer) with one unique value and 0% NA
        new variable 'freq' (double) with 3 unique values and 0% NA
# A tibble: 3 x 4
  pclass     n total  freq
   <dbl> <int> <int> <dbl>
1      1   323  1309 0.247
2      2   277  1309 0.212
3      3   709  1309 0.542

If we only want to add the relative frequency, we can use a single mutate() step to add one column, dividing each count by the sum of all observations.

titanic %>% 
  count(pclass) %>% 
  mutate(share = n/sum(n))
count: now 3 rows and 2 columns, ungrouped
mutate: new variable 'share' (double) with 3 unique values and 0% NA
# A tibble: 3 x 3
  pclass     n share
   <dbl> <int> <dbl>
1      1   323 0.247
2      2   277 0.212
3      3   709 0.542

4.12 Q13. Average age by class

To calculate the average age by passenger class, we need to use the group_by() command for the first time. Declaring a dataset as a grouped dataset means that each following command will be executed for each group separately. Using the summarise() command again, we can get the average age by passenger class quite easily using mean() and the option na.rm = TRUE again.

titanic %>% 
  group_by(pclass) %>% 
  summarise(mean_age = mean(age, na.rm = TRUE))
group_by: one grouping variable (pclass)
summarise: now 3 rows and 2 columns, ungrouped
# A tibble: 3 x 2
  pclass mean_age
   <dbl>    <dbl>
1      1     39.2
2      2     29.5
3      3     24.8
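
Because na.rm = TRUE quietly drops passengers without an age, it can be useful to report how many observations each group mean is based on. Here is a sketch on a made-up tibble `toy`; the column names n_passengers and n_age_known are chosen for illustration.

```r
library(dplyr)

# Hypothetical toy data with some missing ages
toy <- tibble(
  pclass = c(1, 1, 2, 3, 3, 3),
  age    = c(40, NA, 30, 20, 25, NA)
)

by_class <- toy %>%
  group_by(pclass) %>%
  summarise(
    n_passengers = n(),                      # rows in the group
    n_age_known  = sum(!is.na(age)),         # rows with a non-missing age
    mean_age     = mean(age, na.rm = TRUE)   # mean over the known ages only
  )

by_class
```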

4.13 Q14. Boxplots

To plot a boxplot, we use esquisser() again to get some drag-and-drop help, and we call it on the titanic dataset straight away.

esquisser(titanic)

Playing around with the app a little bit gets me to this:

ggplot(titanic) +
  aes(x = "", y = age, fill = pclass, group = pclass) +
  geom_boxplot() +
  scale_fill_distiller(palette = "Pastel1") +
  theme_minimal()

This is great - but there is one small aspect I’m not happy with: the legend seems to indicate that a class could also be e.g. 1.5 or 2.5, which certainly does not make sense! The reason for this is that the type of the pclass variable is double, which allows any continuous numeric value.

For this reason, I’m changing (using the mutate() command) the type of the variable to a factor(), which only allows certain distinct values - in this case the 1, 2, and 3 that stand for the passenger classes.

titanic %>% 
  mutate(pclass = factor(pclass)) -> titanic_data_to_plot
mutate: converted 'pclass' from double to factor (0 new NA)
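
What this conversion does can be seen on a plain vector. The sketch below uses a made-up vector `pclass_num`; the labels First, Second and Third are an optional extra chosen here for illustration.

```r
# Hypothetical passenger classes stored as numbers
pclass_num <- c(3, 2, 2, 1)
class(pclass_num)    # "numeric": any value in between would be allowed

# As a factor, only the distinct values that occur are allowed
pclass_fct <- factor(pclass_num)
class(pclass_fct)    # "factor"
levels(pclass_fct)   # the distinct classes, stored as text labels

# Optionally, the levels can be given more readable labels
pclass_lab <- factor(pclass_num,
                     levels = c(1, 2, 3),
                     labels = c("First", "Second", "Third"))
pclass_lab
```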

Now using esquisser() again, the ggplot() command looks slightly different in one line: we now use scale_fill_brewer(), which works for discrete variables, rather than scale_fill_distiller(), which interpolates a colour scheme so that it works with continuous numeric variables.

ggplot(titanic_data_to_plot) +
  aes(x = "", y = age, fill = pclass, group = pclass) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Pastel1") +
  theme_minimal()

One more tip: if we exchange geom_boxplot() for geom_violin(), we get an even better-looking graph to represent the distribution of a variable (in my mind):

ggplot(titanic_data_to_plot) +
  aes(x = "", y = age, fill = pclass, group = pclass) +
  geom_violin() +
  scale_fill_brewer(palette = "Pastel1") +
  theme_minimal()

4.14 Q15. Dot Plots

To create a dot plot is slightly more difficult in R than it is in Stata.

First we use the factor() command again and this time use the labels argument to label the value 0 in the female variable as Male and 1 as Female.

Then we use the stat_summary() command from ggplot2 to calculate the mean of the alone variable for each combination of gender and passenger class.

Finally we use facet_wrap() to separate the plot by the female variable.

titanic_data_to_plot %>% 
  mutate(female = factor(female, labels = c("Male","Female"))) %>% 
  ggplot() + 
  aes(x=alone, y=pclass)+
  stat_summary(fun = mean) + 
  facet_wrap(~female, nrow=2) + 
  theme_minimal()
mutate: converted 'female' from double to factor (0 new NA)
Warning: Removed 3 rows containing missing values (geom_segment).

Warning: Removed 3 rows containing missing values (geom_segment).
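
If we want to see the numbers behind the dots, we can compute the same group means as a table. This is a sketch on a made-up tibble `toy`; the column name share_alone is chosen for illustration, and the .groups argument assumes dplyr 1.0.0 or newer.

```r
library(dplyr)

# Hypothetical toy data in the same 0/1 coding as the Titanic data
toy <- tibble(
  female = c(0, 0, 0, 1, 1, 1),
  pclass = c(1, 3, 3, 1, 1, 3),
  alone  = c(1, 1, 0, 0, 1, 0)
)

shares <- toy %>%
  mutate(female = factor(female, levels = c(0, 1),
                         labels = c("Male", "Female"))) %>%
  group_by(female, pclass) %>%
  # the mean of a 0/1 variable is the share of ones in each group
  summarise(share_alone = mean(alone), .groups = "drop")

shares
```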

4.15 Q16. Adding a title to the dot plot

The next step, adding a title to the dot plot, is however very easy and actually the same for all ggplot() commands: we just add a line with the labs() command. labs() accepts arguments such as title, x and y; here we use x to label the x-axis:

titanic_data_to_plot %>% 
  mutate(female = factor(female, labels = c("Male","Female"))) %>% 
  ggplot() + 
  aes(x=alone, y=pclass)+
  stat_summary(fun = mean) + 
  facet_wrap(~female, nrow=2) + 
  
  labs(x = "Share of passengers travelling alone") +
  
  theme_minimal()
mutate: converted 'female' from double to factor (0 new NA)
Warning: Removed 3 rows containing missing values (geom_segment).

Warning: Removed 3 rows containing missing values (geom_segment).

4.16 Q17. Frequency Table for Survivors

titanic %>% 
  count(survived) %>% 
  mutate(share = n/sum(n))
count: now 2 rows and 2 columns, ungrouped
mutate: new variable 'share' (double) with 2 unique values and 0% NA
# A tibble: 2 x 3
  survived     n share
     <dbl> <int> <dbl>
1        0   809 0.618
2        1   500 0.382

4.17 Q18. Crosstable of women and survivors

We can easily get a cross-table of two variables by first selecting them using select() and then using the simple table() command.

titanic %>% 
  select(survived, female) %>% 
  table()
select: dropped 13 variables (name, pclass, age, child, old, …)
        female
survived   0   1
       0 682 127
       1 161 339

If we want to get the share of each cell, we simply add prop.table() as a next command.

titanic %>% 
  select(survived, female) %>% 
  table() %>% 
  prop.table()
select: dropped 13 variables (name, pclass, age, child, old, …)
        female
survived          0          1
       0 0.52100840 0.09702063
       1 0.12299465 0.25897632
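
By default, prop.table() divides every cell by the grand total. If we would rather see survival shares within each gender, we can use its margin argument. The sketch below rebuilds the cross-table by hand from the counts printed above, so that it runs without the dataset; `tab` and `by_sex` are names chosen for illustration.

```r
# Counts copied from the cross-table above (rows: survived, columns: female)
tab <- matrix(c(682, 161, 127, 339), nrow = 2,
              dimnames = list(survived = c("0", "1"),
                              female   = c("0", "1")))

# margin = 2 divides each cell by its column total,
# i.e. shares within men and within women
by_sex <- prop.table(tab, margin = 2)
round(by_sex, 3)

# Each column now sums to 1
colSums(by_sex)
```

Using margin = 1 instead would give the shares within each row, i.e. the gender split among those who died and among those who survived.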

4.18 Q19. Crosstable of children and survivors

titanic %>% 
  select(survived, child) %>% 
  table() %>% 
  prop.table()
select: dropped 13 variables (name, pclass, age, old, female, …)
        child
survived          0          1
       0 0.54493308 0.04684512
       1 0.34512428 0.06309751
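
One thing to keep in mind here: child contains missing values, and table() leaves them out by default, so the shares above refer only to passengers with a known age. A small base R sketch on made-up vectors shows the difference; the useNA argument makes the missing values visible.

```r
# Hypothetical toy vectors with missing values in child
survived <- c(1, 0, 0, 1, 0)
child    <- c(0, 1, NA, 0, NA)

# Default: observations with NA are dropped from the table
tab_default <- table(survived, child)
sum(tab_default)

# useNA = "ifany" adds an extra NA column when missing values exist
tab_with_na <- table(survived, child, useNA = "ifany")
sum(tab_with_na)
tab_with_na
```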

Finishing Up

Once you are finished with your exercises, click on “Knit” to create the PDF document from your RMarkdown. See 3.3 for more information on how to create a final PDF.