2 Data Science with R and the Tidyverse
2.1 TL;DR
- The tidyverse is the future of R - load it with library(tidyverse) at the start of each script to get access to all its features
- Use the tidyverse pipe %>% (shortcut CTRL + Shift + M) to bind commands together, e.g. data %>% select(column_A)
- Use CTRL + i to visually arrange your code
- Remember the most important tidyverse commands to get started with data science: select, mutate, filter, full_join, pivot_wider, pivot_longer, rename, as well as the slightly more advanced group_by, summarise and across
- Use ggplot to create beautiful plots - and use esquisser() from the esquisse package to create a ggplot using drag-and-drop
2.2 The tidyverse
2.2.1 Installing the tidyverse and getting the data
Let’s say we want to apply more than one function to an object (e.g. a data.frame) - for this we’ll load a few packages first: the tidyverse package, which contains most of the functions we use in the following sections, and the palmerpenguins package, which contains exactly what you think it contains: data on penguins (and who doesn’t like penguins?). We are also loading a small package that makes it easier for you to get started with the tidyverse: tidylog, which gives you more output on what each command actually does.
To do this, we might need to first install all three packages (only needed once) and then we can always use the library() command to load them (see 1.4.2 for more information).
install.packages("tidyverse")
install.packages("tidylog")
install.packages("palmerpenguins")
Once we have installed the packages, we can load them with library(tidyverse), library(tidylog) and library(palmerpenguins), in that order. Make sure to load the tidylog package after the tidyverse package to get its full functionality - if you have not followed this order, just restart R and load them again in the correct order.
Let’s now take a look at the penguins data, which is stored in an object called penguins.
penguins
# A tibble: 344 x 8
species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torge~ 39.1 18.7 181 3750
2 Adelie Torge~ 39.5 17.4 186 3800
3 Adelie Torge~ 40.3 18 195 3250
4 Adelie Torge~ NA NA NA NA
5 Adelie Torge~ 36.7 19.3 193 3450
6 Adelie Torge~ 39.3 20.6 190 3650
7 Adelie Torge~ 38.9 17.8 181 3625
8 Adelie Torge~ 39.2 19.6 195 4675
9 Adelie Torge~ 34.1 18.1 193 3475
10 Adelie Torge~ 42 20.2 190 4250
# ... with 334 more rows, and 2 more variables: sex <fct>, year <int>
If you are working in RMarkdown, you can now directly browse a little bit through the data. If you want to see all the data, you can use the View() command.
View(penguins)
We can also find out some more information about the dataset using:
- str() to find out more about the structure of the dataset
- summary() to get a quick overview
- skim() from the skimr package (which we have not loaded here)
str(penguins)
tibble [344 x 8] (S3: tbl_df/tbl/data.frame)
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
summary(penguins)
species island bill_length_mm bill_depth_mm
Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
Mean :43.92 Mean :17.15
3rd Qu.:48.50 3rd Qu.:18.70
Max. :59.60 Max. :21.50
NA's :2 NA's :2
flipper_length_mm body_mass_g sex year
Min. :172.0 Min. :2700 female:165 Min. :2007
1st Qu.:190.0 1st Qu.:3550 male :168 1st Qu.:2007
Median :197.0 Median :4050 NA's : 11 Median :2008
Mean :200.9 Mean :4202 Mean :2008
3rd Qu.:213.0 3rd Qu.:4750 3rd Qu.:2009
Max. :231.0 Max. :6300 Max. :2009
NA's :2 NA's :2
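skim() was mentioned above but not run. A minimal sketch, assuming the skimr package is installed (it is not part of the tidyverse and has to be installed separately):

```r
# Assumption: skimr has been installed once via install.packages("skimr")
library(palmerpenguins)

# Calling the function with the package prefix avoids having to load skimr
penguin_overview <- skimr::skim(penguins)
penguin_overview
```

The printed result summarises each column, with the summaries grouped by column type.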
2.2.2 Introduction to the tidyverse
The tidyverse is an opinionated collection of R packages designed for data science. All packages that we consider in the tidyverse share an underlying design philosophy, grammar, and data structures (e.g. part of this grammar is that each command is a verb).
With the tidyverse, you can import, clean, visualise and analyse data. To give you an overview of the functions that are included in the tidyverse, consider Figure 2.1.

Figure 2.1: An overview of the functions included in the tidyverse.
Don’t worry, we will not go through all of them in detail. But below, you find a short introduction to the tidyverse.
The ultimate guide to the tidyverse is this book by Hadley Wickham: R for Data Science.
2.2.3 Applying multiple commands
In base R, selecting a variable from a dataset is typically done using the $ operator. Let’s say we are interested in the years in which the penguins data was collected; then we can extract the information using penguins$year and could, e.g., find the average year of collection by wrapping it in a mean() command:
mean(penguins$year)
[1] 2008.029
This gives us a value of 2008.0290698 - what should we do with the .029? This is a bit weird for a year, so let’s just round the number to the nearest integer using the round() function.
We could do this either by saving the first result in a variable:
[1] 2008
But this might get a bit tedious if we want to do several calculations. So alternatively, we could simply wrap the two commands together:
[1] 2008
Which gives us the same result! Great!!
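The code for these two variants is not shown above. A minimal sketch of what they might look like, where the variable names mean_year and rounded_year are only illustrative choices:

```r
library(palmerpenguins)

# Variant 1: save the intermediate result in a variable, then round it
mean_year <- mean(penguins$year)
rounded_year <- round(mean_year)
rounded_year

# Variant 2: wrap the two commands together
round(mean(penguins$year))
```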
But even this could get a bit tedious, if we start adding more and more commands into this list.
as.character(sqrt(round(mean(penguins$year))))
[1] "44.8107130048162"
Phew, this is getting complicated, because I need to read this command essentially from the inside out! R always starts to evaluate a command from the centre and then moves outwards.
Checking each separate step gets more difficult as more commands are added. Here is an illustration:

Figure 2.2: Here, we have just wrapped a few commands together. Checking the progress for each is quite tedious - I need to select the correct brackets for each command to check an intermediate step.
The tidyverse has a different way of combining commands!! Rather than evaluating a command from the inside out, it evaluates commands from the top left to the bottom right, just how we normally read text in English.
To make sure that the order of the commands is still correct, the tidyverse uses something that is called the pipe operator, which looks like this: %>%. This looks a bit weird to start with, but as you get used to it, you will see how useful this is! You can fundamentally think of the pipe operator as saying "do one command, then do another command".
The pipe is included in the tidyverse package collection. This operator will forward a value, or the result of an expression, as the first argument into the next function call/expression. For instance a function to filter data can be written as:
filter(data, variable == numeric_value)
or it can be written like this:
data %>% filter(variable == numeric_value)
Both calls complete the same task and the benefit of using %>% may not be immediately evident; however, when you want to apply several functions in a row, its advantage becomes obvious. For instance, if we want to filter some data, group it by categories, summarize it, and then order the summarized results, we could write it out three different ways. See more information here: Simplify Your Code with %>%.
Of course it helps to indent each command, as I’ve done here where after each %>% I start a new line. If you don’t want to type the operator over and over again, make sure to use the CTRL + Shift + M shortcut.
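The three ways of writing such a chain are not spelled out above. As a sketch, here is how they could look for the penguins data; the choice of year == 2008, the grouping by species and the column name mean_mass are assumptions made purely for illustration:

```r
library(dplyr)
library(palmerpenguins)

# 1. Nested calls: has to be read from the inside out
nested <- arrange(
  summarise(
    group_by(filter(penguins, year == 2008), species),
    mean_mass = mean(body_mass_g, na.rm = TRUE)
  ),
  desc(mean_mass)
)

# 2. Intermediate objects: readable, but clutters the environment
step1 <- filter(penguins, year == 2008)
step2 <- group_by(step1, species)
step3 <- summarise(step2, mean_mass = mean(body_mass_g, na.rm = TRUE))
stepwise <- arrange(step3, desc(mean_mass))

# 3. The pipe: one step per line, read from top to bottom
piped <- penguins %>%
  filter(year == 2008) %>%
  group_by(species) %>%
  summarise(mean_mass = mean(body_mass_g, na.rm = TRUE)) %>%
  arrange(desc(mean_mass))

# All three approaches should give the same table
all.equal(nested, piped)
all.equal(stepwise, piped)
```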
Now back to our arbitrary example: Notice, how all commands, and indeed the result, are the same!!
penguins$year %>%
  mean() %>%
  round() %>%
  sqrt() %>%
  as.character()
[1] "44.8107130048162"
Now if I want to check the intermediate steps, I just need to select all commands from the start to just before the next pipe:

Figure 2.3: The commands and results are the same when using the pipe operator - but reading the specific steps and evaluating intermediate steps is much easier!
Consider this example from Andrew Heiss who illustrates the pipe operator like this (the commands are not actual R commands):

Figure 2.4: The logic behind the pipe command again. Slide taken from Andrew Heiss.
2.2.4 Selecting columns
# Keep only four columns and store the result as penguins_subsetted
penguins %>%
  select(species, island, bill_length_mm, sex) -> penguins_subsetted
select: dropped 4 variables (bill_depth_mm, flipper_length_mm, body_mass_g, year)
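select() can also drop columns or pick them by a name pattern. A short sketch using helpers that come with dplyr; the object names are made up for illustration:

```r
library(dplyr)
library(palmerpenguins)

# Drop a single column by putting a minus sign in front of its name
penguins_without_year <- penguins %>%
  select(-year)

# Keep every column whose name starts with "bill"
penguins_bill <- penguins %>%
  select(starts_with("bill"))

names(penguins_bill)
```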
2.2.5 Filtering a Dataset
Now that we have defined what the pipe is and have a dataset loaded, we will consider how we can go about cleaning and modifying datasets.
The first function we consider is filter(). The function subsets rows, i.e. removes certain rows while keeping others.
The way that R decides which rows to keep is by evaluating what is called a logical expression - so essentially we are trying to come up with statements that evaluate to TRUE or FALSE. For more information on logical expressions see Section 1.9.2.
Let’s look at an example: Let’s say that we now want to consider only female penguins from the penguins dataset. To achieve this, we construct a filter() command that keeps only the rows where the sex column is female. If we check the help page for the function using ?filter, we see that the first argument is the data argument and the following arguments are the expressions to evaluate - so the command could look like filter(penguins, sex == "female"). In this case, though, we will use the tidyverse logic and pass the data to the filter() command using the pipe operator %>%, so we construct our command like this:
penguins_subsetted %>%
  filter(sex == "female")
filter: removed 179 rows (52%), 165 rows remaining
# A tibble: 165 x 4
species island bill_length_mm sex
<fct> <fct> <dbl> <fct>
1 Adelie Torgersen 39.5 female
2 Adelie Torgersen 40.3 female
3 Adelie Torgersen 36.7 female
4 Adelie Torgersen 38.9 female
5 Adelie Torgersen 41.1 female
6 Adelie Torgersen 36.6 female
7 Adelie Torgersen 38.7 female
8 Adelie Torgersen 34.4 female
9 Adelie Biscoe 37.8 female
10 Adelie Biscoe 35.9 female
# ... with 155 more rows
Great, we now have a dataset with only female penguins - indeed we have removed more than half of all rows in the dataset.
Let’s say we additionally want to consider only Gentoo penguins. We can simply add another filter command like this:
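The chained version is not shown in the text; a minimal sketch of what it might look like, recreating penguins_subsetted first so that the block runs on its own (the name female_gentoo is made up):

```r
library(dplyr)
library(palmerpenguins)

penguins_subsetted <- penguins %>%
  select(species, island, bill_length_mm, sex)

# Two filter() calls in a row: each one narrows down the previous result
female_gentoo <- penguins_subsetted %>%
  filter(sex == "female") %>%
  filter(species == "Gentoo")

female_gentoo
```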
Alternatively, we could also implement this in a single filter() command using our logical operators (see Table 1.2).
penguins_subsetted %>%
  filter(sex == "female" & species == "Gentoo")

Figure 2.5: An overview of the filter command.
2.2.6 Useful data cleaning tools
clean_names()
drop_na()

Figure 2.6: Weirdly named columns? No problem, simply use the clean_names command from the janitor package!
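Neither command is demonstrated above, so here is a small sketch. It assumes the janitor package is installed; the messy_data tibble and its column names are invented for illustration:

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# A tiny made-up dataset with awkward column names
messy_data <- tibble(
  `First Name` = c("Ada", "Grace"),
  `Body Mass (g)` = c(3750, NA)
)

# clean_names() turns the column names into snake_case
cleaned <- messy_data %>%
  janitor::clean_names()
names(cleaned)

# drop_na() removes every row that contains at least one missing value
complete_penguins <- penguins %>%
  drop_na()
nrow(complete_penguins)
```

drop_na() also accepts column names, e.g. drop_na(sex), if only missing values in particular columns should lead to a row being removed.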
2.2.7 Adding a new column or changing an existing one
We can, for example, convert the bill_length_mm column to meters using the mutate() command.
penguins %>%
  mutate(bill_length_m = bill_length_mm / 1000)
mutate: new variable 'bill_length_m' (double) with 165 unique values and 1% NA
# A tibble: 344 x 9
species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torge~ 39.1 18.7 181 3750
2 Adelie Torge~ 39.5 17.4 186 3800
3 Adelie Torge~ 40.3 18 195 3250
4 Adelie Torge~ NA NA NA NA
5 Adelie Torge~ 36.7 19.3 193 3450
6 Adelie Torge~ 39.3 20.6 190 3650
7 Adelie Torge~ 38.9 17.8 181 3625
8 Adelie Torge~ 39.2 19.6 195 4675
9 Adelie Torge~ 34.1 18.1 193 3475
10 Adelie Torge~ 42 20.2 190 4250
# ... with 334 more rows, and 3 more variables: sex <fct>, year <int>,
# bill_length_m <dbl>

Figure 2.7: Rather than creating a really complicated if statement infrastructure, the case_when command helps us to organise these commands really nicely.
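As a sketch of the case_when command mentioned in the caption: the size categories and the cut-offs of 3500 and 4500 grams are arbitrary assumptions, chosen only to show the syntax.

```r
library(dplyr)
library(palmerpenguins)

penguins_sized <- penguins %>%
  mutate(
    size = case_when(
      is.na(body_mass_g) ~ NA_character_,  # keep missing values missing
      body_mass_g < 3500 ~ "small",
      body_mass_g < 4500 ~ "medium",
      TRUE ~ "large"                       # everything that is left over
    )
  )

# How many penguins fall into each category?
count(penguins_sized, size)
```

The conditions are checked from top to bottom and the first one that is TRUE wins, which is why the second condition does not need to repeat the lower bound.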
Now that we have added our columns, we can move them to the front (the leftmost position) if we want to. To do this, we use the relocate() command.

Figure 2.8: Move a column around easily with relocate.
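A minimal sketch of relocate(), reusing the bill_length_m column from the mutate() example; the object names are made up for illustration:

```r
library(dplyr)
library(palmerpenguins)

# Without further arguments, relocate() moves the column to the front
penguins_relocated <- penguins %>%
  mutate(bill_length_m = bill_length_mm / 1000) %>%
  relocate(bill_length_m)

# .after (or .before) places the column next to another one instead
penguins_next_to <- penguins %>%
  mutate(bill_length_m = bill_length_mm / 1000) %>%
  relocate(bill_length_m, .after = bill_length_mm)

names(penguins_next_to)
```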
2.2.9 The special across command
The across() command can be used within the mutate() and the summarise() command. The command makes it easy to apply the same transformation across a number of columns.

Figure 2.9: The across command takes a bit to understand - but is incredibly useful after that. Always remember, you first need to choose which columns you want to transform and then which function you want to apply to each column.
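To make the "columns first, function second" idea concrete, here is a short sketch; the selected columns and the object names are illustrative choices:

```r
library(dplyr)
library(palmerpenguins)

# First choose the columns (here: all columns ending in "_mm"),
# then the function to apply to each of them
mean_mm <- penguins %>%
  summarise(across(ends_with("_mm"), ~ mean(.x, na.rm = TRUE)))
mean_mm

# The same idea inside mutate(): turn every factor column into text
penguins_chr <- penguins %>%
  mutate(across(where(is.factor), as.character))
```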
2.2.10 Performing operations by group
We can perform certain operations separately for each group using the group_by() command. We can, for example, calculate the mean bill length by sex using the summarise() command:
group_by: one grouping variable (sex)
summarise: now 3 rows and 2 columns, ungrouped
# A tibble: 3 x 2
sex mean
<fct> <dbl>
1 female 42.1
2 male 45.9
3 <NA> 41.3
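The code that produced this output is not shown. Judging from the tidylog messages and the column name mean, it was presumably something along these lines (a reconstruction, not necessarily the original code; the object name bill_by_sex is made up):

```r
library(dplyr)
library(palmerpenguins)

bill_by_sex <- penguins %>%
  group_by(sex) %>%
  summarise(mean = mean(bill_length_mm, na.rm = TRUE))

bill_by_sex
```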
We can also use the group_by() command when using the mutate() command.
group_by: one grouping variable (island)
mutate (grouped): new variable 'max' (double) with 3 unique values and 0% NA
ungroup: no grouping variables
# A tibble: 344 x 9
species island bill_length_mm bill_depth_mm flipper_length_~ body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torge~ 39.1 18.7 181 3750
2 Adelie Torge~ 39.5 17.4 186 3800
3 Adelie Torge~ 40.3 18 195 3250
4 Adelie Torge~ NA NA NA NA
5 Adelie Torge~ 36.7 19.3 193 3450
6 Adelie Torge~ 39.3 20.6 190 3650
7 Adelie Torge~ 38.9 17.8 181 3625
8 Adelie Torge~ 39.2 19.6 195 4675
9 Adelie Torge~ 34.1 18.1 193 3475
10 Adelie Torge~ 42 20.2 190 4250
# ... with 334 more rows, and 3 more variables: sex <fct>, year <int>,
# max <dbl>
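Again the code is missing from the text. The tidylog messages tell us that the data was grouped by island, that a column called max was added and that the data was ungrouped afterwards. Which column the maximum was taken of cannot be recovered from the output, so bill_length_mm below is an assumption:

```r
library(dplyr)
library(palmerpenguins)

penguins_max <- penguins %>%
  group_by(island) %>%
  # Assumed column: the maximum bill length on each island
  mutate(max = max(bill_length_mm, na.rm = TRUE)) %>%
  ungroup()

penguins_max
```

Unlike summarise(), the grouped mutate() keeps all 344 rows and simply repeats the group value in every row of that group.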
2.2.11 Merging and reshaping
A number of useful functions are contained in the tidyr package, which again is part of the tidyverse. tidyr was built for the sole purpose of simplifying the process of creating tidy data (the join functions used below come from dplyr, which is also part of the tidyverse). In real world data science, we often need to change, combine, and reshape our datasets.
2.2.11.1 Join two datasets
To join together two datasets, we can use four different functions that all work using the same logic:

Figure 2.10: Source: RStudio.
So these four functions are:
full_join()
inner_join()
left_join()
right_join()
And one additional function is:
anti_join()
To explain how they work, we can also visualise this using a Venn diagram:

Figure 2.11: Source: Hadley Wickham.
Let’s see them in action:
We use the band_members dataset, a tiny example dataset that ships with dplyr.
band_members
# A tibble: 3 x 2
name band
<chr> <chr>
1 Mick Stones
2 John Beatles
3 Paul Beatles
We also have a second tiny dataset, which is band_instruments.
band_instruments
# A tibble: 3 x 2
name plays
<chr> <chr>
1 John guitar
2 Paul bass
3 Keith guitar
If we want to combine the information from both datasets, we use the fact that both have a name column. But we also note that Mick doesn’t play an instrument (to be fair, he doesn’t need to) - and that Keith is missing from the band_members data.
We specify that both datasets should be matched by the name column. I recommend always using full_join() in these cases, because it retains all rows, even if there are no matching rows in the other dataset:
band_members %>%
  full_join(band_instruments, by = "name") -> full
full_join: added one column (plays)
> rows only in x 1
> rows only in y 1
> matched rows 2
> ===
> rows total 4
full
# A tibble: 4 x 3
name band plays
<chr> <chr> <chr>
1 Mick Stones <NA>
2 John Beatles guitar
3 Paul Beatles bass
4 Keith <NA> guitar
The opposite of that is inner_join(), where only rows that appear in both datasets are retained:
band_members %>%
  inner_join(band_instruments, by = "name") -> inner
inner_join: added one column (plays)
> rows only in x (1)
> rows only in y (1)
> matched rows 2
> ===
> rows total 2
inner
# A tibble: 2 x 3
name band plays
<chr> <chr> <chr>
1 John Beatles guitar
2 Paul Beatles bass
Then we can see that we only retain John and Paul - because they are in both datasets.
The left_join() command retains all rows from the “left” dataset, in this case band_members:
band_members %>%
  left_join(band_instruments, by = "name") -> left
left_join: added one column (plays)
> rows only in x 1
> rows only in y (1)
> matched rows 2
> ===
> rows total 3
left
# A tibble: 3 x 3
name band plays
<chr> <chr> <chr>
1 Mick Stones <NA>
2 John Beatles guitar
3 Paul Beatles bass
The right_join() command does the opposite: here it keeps all rows from the band_instruments dataset and uses NA values where there is no value in the other dataset.
band_members %>%
  right_join(band_instruments, by = "name") -> right
right_join: added one column (plays)
> rows only in x (1)
> rows only in y 1
> matched rows 2
> ===
> rows total 3
right
# A tibble: 3 x 3
name band plays
<chr> <chr> <chr>
1 John Beatles guitar
2 Paul Beatles bass
3 Keith <NA> guitar
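And then we also have a final function, anti_join(), which sometimes comes in handy: it keeps only those rows of the first dataset that have no match in the second, and adds no columns: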
band_members %>%
  anti_join(band_instruments, by = "name") -> anti
anti_join: added no columns
> rows only in x 1
> rows only in y (1)
> matched rows (2)
> ===
> rows total 1
anti
# A tibble: 1 x 2
name band
<chr> <chr>
1 Mick Stones
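Swapping the two datasets answers the opposite question, namely which rows of band_instruments have no partner in band_members. A small sketch; the object name anti_reversed is made up:

```r
library(dplyr)

# Rows of band_instruments whose name does not appear in band_members
anti_reversed <- band_instruments %>%
  anti_join(band_members, by = "name")

anti_reversed
```

This is handy as a check before a join: if the anti-joins in both directions return zero rows, every key has a match in the other dataset.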