2 Data Science with R and the Tidyverse

2.1 TL;DR

  • The tidyverse is the future of R - load it with library(tidyverse) at the start of each script to get access to all its features
  • Use the tidyverse pipe %>% (shortcut CTRL + Shift + M) to chain commands together e.g. data %>% select(column_A)
  • Use CTRL + i to re-indent and visually arrange your code
  • Remember the most important tidyverse commands to get started with data science: select, mutate, filter, full_join, pivot_wider, pivot_longer, rename as well as the slightly more advanced group_by, summarise and across.
  • Use ggplot to create beautiful plots - and use esquisser() from the esquisse package to create a ggplot using drag-and-drop

2.2 The tidyverse

2.2.1 Installing the tidyverse and getting the data

Let’s say we want to apply more than one function to an object (e.g. a data.frame). For this we’ll load a few packages first: the tidyverse package, which contains most of the functions we use in the following sections, and the palmerpenguins package, which contains exactly what you think it contains: data on penguins (and who doesn’t like penguins?). We also load a small package that makes it easier to get started with the tidyverse: tidylog, which gives you more output on what each command actually does.

To do this, we might need to first install these packages (only needed once) and can then always use the library() command to load them (see 1.4.2 for more information).

install.packages("tidyverse")
install.packages("tidylog")
install.packages("palmerpenguins")

Once we have installed the packages, we can load them with library(). (Note: make sure to load the tidylog package after the tidyverse package to get its full functionality - if you have not followed this order, just restart R and load them in the correct order.)
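
library(tidyverse)
library(tidylog)        # load after the tidyverse so its messages describe the tidyverse functions
library(palmerpenguins)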

Let’s now take a look at the penguins data, which the palmerpenguins package makes available as an object simply called penguins.

penguins
# A tibble: 344 x 8
   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
   <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
 1 Adelie  Torge~           39.1          18.7              181        3750
 2 Adelie  Torge~           39.5          17.4              186        3800
 3 Adelie  Torge~           40.3          18                195        3250
 4 Adelie  Torge~           NA            NA                 NA          NA
 5 Adelie  Torge~           36.7          19.3              193        3450
 6 Adelie  Torge~           39.3          20.6              190        3650
 7 Adelie  Torge~           38.9          17.8              181        3625
 8 Adelie  Torge~           39.2          19.6              195        4675
 9 Adelie  Torge~           34.1          18.1              193        3475
10 Adelie  Torge~           42            20.2              190        4250
# ... with 334 more rows, and 2 more variables: sex <fct>, year <int>

If you are working in RMarkdown, you can now browse through the data directly in the output. If you want to see all the data, you can use the View() command.

View(penguins)

and we can find out some more information about the dataset using:

  • str() to find out more about the structure of the dataset
  • summary() to get a quick overview
  • skim() from the skimr package (which we have not loaded here - see the short sketch after the summary output below)
str(penguins)
tibble [344 x 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
summary(penguins)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 
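
If you want to try skim(), here is a short sketch (assuming you install and load the skimr package first, which we have not done above):

install.packages("skimr")   # only needed once
library(skimr)
skim(penguins)              # detailed overview, including missing values per column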

2.2.2 Introduction to the tidyverse

The tidyverse is an opinionated collection of R packages designed for data science. All packages in the tidyverse share an underlying design philosophy, grammar, and data structures (e.g. part of this grammar is that each command is named as a verb).

With the tidyverse, you can import, clean, visualise and analyse data. To give you an overview of the functions that are included in the tidyverse, consider Figure 2.1.

Figure 2.1: An overview of the functions included in the tidyverse.

Don’t worry, we will not go through all of them in detail. Below, you will find a short introduction to the tidyverse.

The ultimate guide to the tidyverse is this book by Hadley Wickham: R for Data Science.

2.2.3 Applying multiple commands

In base R, selecting a variable from a dataset is typically done using the $ operator. Let’s say we are interested in the years the penguins data was collected, then we can extract the information using penguins$year and could e.g. find the average year that the data was collected by wrapping it in a mean() command:

mean(penguins$year)
[1] 2008.029

This gives us a value of 2008.0290698 - what should we do with the .029? This is a bit weird for a year, so let’s just round the number to the nearest integer using the round() function.

We could do this either by saving the first result in a variable:

mean_of_year <- mean(penguins$year)
round(mean_of_year)
[1] 2008

But this might get a bit tedious if we want to do several calculations. So alternatively, we could simply wrap the two commands together:

round(mean(penguins$year))
[1] 2008

Which gives us the same result! Great!!

But even this could get a bit tedious if we start adding more and more commands.

as.character(sqrt(round(mean(penguins$year))))
[1] "44.8107130048162"

Phew, this is getting complicated, because I need to read this command essentially from the inside out! R starts evaluating a command at the innermost function and then moves outwards.

If I want to check each separate step, this is getting difficult, as more commands are added. See here an illustration:

Figure 2.2: Here, we have just wrapped a few commands together. Checking the progress for each is quite tedious - I need to select the correct brackets for each command to check an intermediate step.

The tidyverse has a different way of combining commands!! Rather than evaluating a command from the inside out, it evaluates commands from the top left to the bottom right - just as we normally read text in English.

To make sure that the order of the commands is still correct, the tidyverse uses something called the pipe operator, which looks like this: %>%. This looks a bit weird to start with, but as you get used to it, you will see how useful it is! You can fundamentally think of the pipe operator as meaning "do one command, then do another command".

The pipe is included in the tidyverse package collection. This operator will forward a value, or the result of an expression, into the next function call/expression. For instance a function to filter data can be written as:

filter(data, variable == numeric_value)

or it can be written like this:

data %>% filter(variable == numeric_value)

Both commands complete the same task, and the benefit of using %>% may not be immediately evident; however, when you want to chain multiple functions together, its advantage becomes obvious. For instance, if we want to filter some data, group it by categories, summarise it, and then order the summarised results, we could write it out in three different ways - nested calls, intermediate objects, or a pipe (see the sketch below). See more information here: Simplify Your Code with %>%.
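
To make this concrete, here is a sketch of such a chain written in the piped style on the penguins data (the variable choices here are purely for illustration):

penguins %>% 
  filter(!is.na(bill_length_mm)) %>%                 # keep rows with a bill measurement
  group_by(species) %>%                              # one group per species
  summarise(mean_bill = mean(bill_length_mm)) %>%    # summarise each group
  arrange(desc(mean_bill))                           # order the summarised results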

Of course it helps to indent each command, as I do below, starting a new line after each %>%. If you don’t want to type the operator over and over again, make sure to use the CTRL + Shift + M shortcut.

Now back to our arbitrary example: notice how all the commands, and indeed the result, are the same!

penguins$year %>% 
  mean() %>% 
  round() %>% 
  sqrt() %>% 
  as.character()
[1] "44.8107130048162"

Now if I want to check the intermediate steps, I just need to select all commands from the start to just before the next pipe:

Figure 2.3: The commands and results are the same when using the pipe operator - but reading the specific steps and evaluating intermediate steps is much easier!

Consider this example from Andrew Heiss who illustrates the pipe operator like this (the commands are not actual R commands):

Figure 2.4: The logic behind the pipe command again. Slide taken from Andrew Heiss.

2.2.4 Selecting columns

The select() command keeps only the columns we name and drops the rest. Here we keep four columns and save the result in a new object, penguins_subsetted, using the right-hand assignment operator ->:

penguins %>% 
  select(species, island, bill_length_mm, sex) -> penguins_subsetted
select: dropped 4 variables (bill_depth_mm, flipper_length_mm, body_mass_g, year)

2.2.5 Filtering a Dataset

Now that we have defined what the pipe is and have a dataset loaded, we will consider how we can go about cleaning and modifying datasets.

The first function we consider is filter(). This function subsets rows, i.e. it removes certain rows while keeping others.

R decides which rows to keep by evaluating what is called a logical expression - essentially, we come up with statements that evaluate to TRUE or FALSE. For more information on logical expressions see Section 1.9.2.

Let’s look at an example: say we now want to consider only female penguins from the penguins data. To achieve this, we construct a filter() command that keeps only the rows where the sex column is "female". If we check the help section for the function using ?filter(), we see that the first argument is the data and the following arguments are the expressions to evaluate - so the command could look like filter(penguins, sex == "female"). In this case, though, we will use the tidyverse logic and pass the data to the filter() command using the pipe operator %>%, so we construct our command like this:

penguins_subsetted %>% 
  filter(sex=="female")
filter: removed 179 rows (52%), 165 rows remaining
# A tibble: 165 x 4
   species island    bill_length_mm sex   
   <fct>   <fct>              <dbl> <fct> 
 1 Adelie  Torgersen           39.5 female
 2 Adelie  Torgersen           40.3 female
 3 Adelie  Torgersen           36.7 female
 4 Adelie  Torgersen           38.9 female
 5 Adelie  Torgersen           41.1 female
 6 Adelie  Torgersen           36.6 female
 7 Adelie  Torgersen           38.7 female
 8 Adelie  Torgersen           34.4 female
 9 Adelie  Biscoe              37.8 female
10 Adelie  Biscoe              35.9 female
# ... with 155 more rows

Great, we now have a dataset with only female penguins - indeed we have removed more than half of all rows in the dataset.

Let’s say we additionally want to consider only Gentoo penguins. We can simply add another filter command like this:

penguins_subsetted %>% 
  filter(sex == "female") %>% 
  filter(species == "Gentoo")

Alternatively, we could also implement this in a single filter() command using our logical operators (see Table 1.2).

penguins_subsetted %>% 
  filter(sex == "female" & species == "Gentoo")

Figure 2.5: An overview of the filter command.

2.2.6 Useful data cleaning tools

  • clean_names() from the janitor package turns messy or inconsistent column names into clean, snake_case ones
  • drop_na() from the tidyr package removes rows that contain missing values

Figure 2.6: Weirdly named columns? No problem, simply use the clean_names command from the janitor package!
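
For instance, a short sketch of drop_na() on the penguins data (without arguments it keeps only complete rows; with column names it only checks those columns):

penguins %>% 
  drop_na()      # keep only rows without any missing value
penguins %>% 
  drop_na(sex)   # keep only rows where sex is not missing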

2.2.7 Adding a new column or changing an existing one

To add a new column or to change an existing one, we use the mutate() command. We can, for example, convert the bill_length_mm column to metres:

penguins %>% 
  mutate(bill_length_m = bill_length_mm / 1000)
mutate: new variable 'bill_length_m' (double) with 165 unique values and 1% NA
# A tibble: 344 x 9
   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
   <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
 1 Adelie  Torge~           39.1          18.7              181        3750
 2 Adelie  Torge~           39.5          17.4              186        3800
 3 Adelie  Torge~           40.3          18                195        3250
 4 Adelie  Torge~           NA            NA                 NA          NA
 5 Adelie  Torge~           36.7          19.3              193        3450
 6 Adelie  Torge~           39.3          20.6              190        3650
 7 Adelie  Torge~           38.9          17.8              181        3625
 8 Adelie  Torge~           39.2          19.6              195        4675
 9 Adelie  Torge~           34.1          18.1              193        3475
10 Adelie  Torge~           42            20.2              190        4250
# ... with 334 more rows, and 3 more variables: sex <fct>, year <int>,
#   bill_length_m <dbl>

Figure 2.7: Rather than creating a really complicated if statement infrastructure, the case_when command helps us to organise these commands really nicely.
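
For instance, a sketch of case_when() inside mutate() (the size categories and cut-offs here are made up purely for illustration):

penguins %>% 
  mutate(size_class = case_when(
    is.na(body_mass_g) ~ NA_character_,   # keep missing values missing
    body_mass_g < 3500 ~ "small",
    body_mass_g < 4500 ~ "medium",
    TRUE               ~ "large"          # everything else
  ))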

After we have added new columns, we can move them to the front (the left-most position) if we want to. To do this, we use the relocate() command.

Figure 2.8: Move a column around easily with relocate.
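
For instance, a short sketch continuing the example above:

penguins %>% 
  mutate(bill_length_m = bill_length_mm / 1000) %>% 
  relocate(bill_length_m)   # without further arguments, the column is moved to the front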

2.2.8 Summarising
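
The summarise() command collapses a dataset into a single row of summary statistics (or one row per group, as we will see below). A minimal sketch:

penguins %>% 
  summarise(
    mean_body_mass = mean(body_mass_g, na.rm = TRUE),   # average body mass
    n_penguins     = n()                                # number of rows
  )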

2.2.9 The special across command

The across() command can be used within the mutate() and the summarise() commands. It makes it easy to apply the same transformation across a number of columns.

Figure 2.9: The across command takes a bit to understand - but is incredibly useful after that. Always remember: you first choose which columns you want to transform and then which function you want to apply to each column.
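
For instance, a short sketch that takes the mean of every column ending in _mm (column selection first, then the function to apply to each of them):

penguins %>% 
  summarise(across(ends_with("_mm"), ~ mean(.x, na.rm = TRUE)))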

2.2.10 Performing operations by group

We can perform certain operations separately for each group using the group_by() command. We can, for example, calculate the mean bill length for each sex using the summarise() command:

penguins %>% 
  group_by(sex) %>% 
  summarise(mean = mean(bill_length_mm, na.rm=TRUE))
group_by: one grouping variable (sex)
summarise: now 3 rows and 2 columns, ungrouped
# A tibble: 3 x 2
  sex     mean
  <fct>  <dbl>
1 female  42.1
2 male    45.9
3 <NA>    41.3

We can also use the group_by() command together with mutate(). Instead of collapsing the data to one row per group, each row then receives the value calculated for its group:

penguins %>% 
  group_by(island) %>% 
  mutate(max = max(bill_length_mm, na.rm = TRUE)) %>% 
  ungroup()
group_by: one grouping variable (island)
mutate (grouped): new variable 'max' (double) with 3 unique values and 0% NA
ungroup: no grouping variables
# A tibble: 344 x 9
   species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
   <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
 1 Adelie  Torge~           39.1          18.7              181        3750
 2 Adelie  Torge~           39.5          17.4              186        3800
 3 Adelie  Torge~           40.3          18                195        3250
 4 Adelie  Torge~           NA            NA                 NA          NA
 5 Adelie  Torge~           36.7          19.3              193        3450
 6 Adelie  Torge~           39.3          20.6              190        3650
 7 Adelie  Torge~           38.9          17.8              181        3625
 8 Adelie  Torge~           39.2          19.6              195        4675
 9 Adelie  Torge~           34.1          18.1              193        3475
10 Adelie  Torge~           42            20.2              190        4250
# ... with 334 more rows, and 3 more variables: sex <fct>, year <int>,
#   max <dbl>

2.2.11 Merging and reshaping

A number of useful functions for this are contained in the dplyr and tidyr packages, which are both part of the tidyverse; tidyr in particular was built for the sole purpose of simplifying the process of creating tidy data. In real-world data science, we often need to change, combine, and reshape our datasets.

2.2.11.1 Join two datasets

To join together two datasets, we can use four different functions that all work using the same logic:

Figure 2.10: Source: RStudio.

So these four functions are:

full_join()

inner_join()

left_join()

right_join()

And one additional function is:

anti_join()

To explain how they work, we can also visualise this using a Venn diagram:

Figure 2.11: Source: Hadley Wickham.

Let’s see them in action:

We use the band_members dataset, which is a tiny example of a dataset.

band_members
# A tibble: 3 x 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles

We also have a second tiny dataset, which is band_instruments.

band_instruments
# A tibble: 3 x 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar

If we want to combine the information from both datasets, we use the fact that both have a name column. But we also note that Mick doesn’t play an instrument (to be fair, he doesn’t need to) - and that Keith is missing from the band_members data.

We specify that both datasets should be matched by the name column. I recommend always using full_join() in these cases, because it retains all rows, even if there are no matching rows in the other dataset:

band_members %>% 
  full_join(band_instruments, by = "name") -> full
full_join: added one column (plays)
           > rows only in x   1
           > rows only in y   1
           > matched rows     2
           >                 ===
           > rows total       4
full
# A tibble: 4 x 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  
4 Keith <NA>    guitar

The opposite of that is the inner_join(), where only rows that appear in both datasets are retained:

band_members %>% 
  inner_join(band_instruments, by = "name") -> inner
inner_join: added one column (plays)
            > rows only in x  (1)
            > rows only in y  (1)
            > matched rows     2
            >                 ===
            > rows total       2
inner
# A tibble: 2 x 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  

Then we can see that we only retain John and Paul - because they are in both datasets.

The left_join() command retains all rows from the “left” dataset, in this case the band_members:

band_members %>% 
  left_join(band_instruments, by = "name") -> left
left_join: added one column (plays)
           > rows only in x   1
           > rows only in y  (1)
           > matched rows     2
           >                 ===
           > rows total       3
left
# A tibble: 3 x 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  

While the right_join() does the opposite: here it keeps all rows from the band_instruments dataset and uses NA values where there is no value in the other dataset.

band_members %>% 
  right_join(band_instruments, by = "name") -> right
right_join: added one column (plays)
            > rows only in x  (1)
            > rows only in y   1
            > matched rows     2
            >                 ===
            > rows total       3
right
# A tibble: 3 x 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  
3 Keith <NA>    guitar

And then we also have a final function, anti_join(), which sometimes comes in handy: it keeps only the rows from the first dataset that have no match in the second dataset.

band_members %>% 
  anti_join(band_instruments, by = "name") -> anti
anti_join: added no columns
           > rows only in x   1
           > rows only in y  (1)
           > matched rows    (2)
           >                 ===
           > rows total       1
anti
# A tibble: 1 x 2
  name  band  
  <chr> <chr> 
1 Mick  Stones

2.2.11.2 Reshaping datasets
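
Reshaping is done with pivot_longer() and pivot_wider() from tidyr (mentioned in the TL;DR). A minimal sketch that stacks the three _mm measurement columns into a long format (pivot_wider() reverses the operation):

penguins %>% 
  pivot_longer(
    cols      = ends_with("_mm"),   # the three measurement columns
    names_to  = "measurement",      # their names go into this column
    values_to = "value"             # their values go into this one
  )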

2.3 Plotting with R

2.3.1 ggplot

Figure 2.12: You can produce incredible plots with the “ggplot” package - but it does take some getting used to the syntax. To get started, simply type “esquisser()” into your console (having previously installed the “esquisse” package) to get some help.
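
To give you a first impression, here is a minimal ggplot sketch with the penguins data (ggplot2 is loaded as part of the tidyverse; note that ggplot layers are combined with +, not with the pipe):

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g, colour = species)) +
  geom_point()   # a scatterplot of flipper length against body mass, coloured by species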