6 Exercise 3
Before we get started we need to open our project.
As discussed in our Project Management chapter 3, we first want to open our QEH
Project file. To open the QEH.Rproj
file, we head over to the folder we put it in and double click on it in our file browse - or we open it from RStudio.
Now we can check that we are in the right working directory and can chech that we are in the right project by checking in the top right of RStudio.
getwd()
Let’s create a new RMarkdown file (see 3.3) by selecting R Notebook in the “File” menu and let’s save it as Introduction_1.Rmd
in our project folder. Because we want to create a PDF report with a proper title and our name, we modify the header of the Rmarkdown document ever so slightly (most importantly, we replace the html_notebook
with pdf_document
).
Next, we want to make sure we have all tools ready that we will need in this exercise, so we load a few libraries (use install.packages("packagename")
first, if you are missing one):
library(tidyverse) # our main collection of functions
library(tidylog) # prints additional output from the tidyverse commands - load after tidyverse
library(skimr) # allows us to get an overview over the data quickly
library(readxl) # allows us to load .xlsx (Excel) files
library(here) # needed to navigate to folders and files in a project
library(haven) # allows us to load .dta (Stata specific) files
library(janitor) # to allow us to use the clean_names() command
library(DescTools) # for the Mode() command
library(esquisse) # an app to help us with the plotting in ggplot
Now we are ready to actually start our work!
6.1 Import Data
As pointed out in the exercise, we load the issp1995subset.dta
dataset - I have put all my datasets into a data
folder in our project folder.
Having loaded the haven
package, which we need need to load Stata .dta
files, and using the here()
command to navigate to our data folder, we use:
We will quickly do something that is only needed because we are using a .dta
file, which is a Stata specific file: This is needed because Stata uses labels and R does not, so we simply convert the country variable to a factor.
mutate: converted 'country' from double to factor (0 new NA)
Now, let’s consider some basic summaries of the data:
issp %>%
skim()
Name | Piped data |
Number of rows | 30894 |
Number of columns | 26 |
_______________________ | |
Column type frequency: | |
character | 15 |
factor | 1 |
numeric | 10 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
age | 228 | 0.99 | 2 | 2 | 0 | 84 | 0 |
marital | 236 | 0.99 | 1 | 1 | 0 | 5 | 0 |
status | 477 | 0.98 | 1 | 2 | 0 | 10 | 0 |
socialclass | 5291 | 0.83 | 1 | 1 | 0 | 6 | 0 |
citizen | 1454 | 0.95 | 1 | 1 | 0 | 2 | 0 |
rural | 6820 | 0.78 | 1 | 1 | 0 | 3 | 0 |
tumember | 6232 | 0.80 | 1 | 1 | 0 | 2 | 0 |
partyid | 12542 | 0.59 | 1 | 1 | 0 | 5 | 0 |
relserv | 2099 | 0.93 | 1 | 1 | 0 | 6 | 0 |
teachforlang | 1228 | 0.96 | 1 | 1 | 0 | 5 | 0 |
immiecon | 3003 | 0.90 | 1 | 1 | 0 | 5 | 0 |
immijobs | 1900 | 0.94 | 1 | 1 | 0 | 5 | 0 |
party_germany | 29106 | 0.06 | 1 | 2 | 0 | 16 | 0 |
party_uk | 29878 | 0.03 | 1 | 2 | 0 | 9 | 0 |
party_us | 29529 | 0.04 | 1 | 2 | 0 | 8 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
country | 0 | 1 | FALSE | 24 | aus: 2438, nl: 2089, pl: 1598, rus: 1585 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1.00 | 1497526.64 | 1047331.11 | 1321.0 | 700966.2 | 1400977.50 | 2001235.75 | 5014485.00 | ▆▇▃▁▁ |
tradepro | 0 | 1.00 | 0.21 | 0.41 | 0.0 | 0.0 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
tradecon | 0 | 1.00 | 0.56 | 0.50 | 0.0 | 0.0 | 1.00 | 1.00 | 1.00 | ▆▁▁▁▇ |
male | 116 | 1.00 | 0.47 | 0.50 | 0.0 | 0.0 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
nominalincome | 9034 | 0.71 | 29090.18 | 50113.62 | 4.0 | 3319.5 | 12500.00 | 32500.00 | 834000.00 | ▇▁▁▁▁ |
schooling | 2872 | 0.91 | 11.64 | 3.64 | 0.0 | 9.0 | 12.00 | 14.00 | 20.00 | ▁▃▇▅▂ |
highschoolincpl | 600 | 0.98 | 0.21 | 0.41 | 0.0 | 0.0 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
highschool | 600 | 0.98 | 0.30 | 0.46 | 0.0 | 0.0 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▃ |
college | 600 | 0.98 | 0.24 | 0.43 | 0.0 | 0.0 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
gdpcap | 612 | 0.98 | 16505.00 | 7484.32 | 3439.6 | 8660.1 | 19624.51 | 22275.08 | 27904.94 | ▅▃▂▇▂ |
issp %>%
summary()
id country tradepro tradecon
Min. : 1321 aus : 2438 Min. :0.0000 Min. :0.0000
1st Qu.: 700966 nl : 2089 1st Qu.:0.0000 1st Qu.:0.0000
Median :1400978 pl : 1598 Median :0.0000 Median :1.0000
Mean :1497527 rus : 1585 Mean :0.2127 Mean :0.5649
3rd Qu.:2001236 cdn : 1543 3rd Qu.:0.0000 3rd Qu.:1.0000
Max. :5014485 n : 1527 Max. :1.0000 Max. :1.0000
(Other):20114
male age marital status
Min. :0.0000 Min. :14.00 Min. :1.000 Min. : 1.000
1st Qu.:0.0000 1st Qu.:31.00 1st Qu.:1.000 1st Qu.: 1.000
Median :0.0000 Median :43.00 Median :1.000 Median : 2.000
Mean :0.4706 Mean :45.08 Mean :2.095 Mean : 3.864
3rd Qu.:1.0000 3rd Qu.:58.00 3rd Qu.:3.000 3rd Qu.: 7.000
Max. :1.0000 Max. :98.00 Max. :5.000 Max. :10.000
NA's :116 NA's :228 NA's :236 NA's :477
socialclass nominalincome schooling highschoolincpl
Min. :1.00 Min. : 4 Min. : 0.00 Min. :0.0000
1st Qu.:2.00 1st Qu.: 3320 1st Qu.: 9.00 1st Qu.:0.0000
Median :4.00 Median : 12500 Median :12.00 Median :0.0000
Mean :3.15 Mean : 29090 Mean :11.64 Mean :0.2132
3rd Qu.:4.00 3rd Qu.: 32500 3rd Qu.:14.00 3rd Qu.:0.0000
Max. :6.00 Max. :834000 Max. :20.00 Max. :1.0000
NA's :5291 NA's :9034 NA's :2872 NA's :600
highschool college citizen rural
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :1.000
1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:1.000
Median :0.0000 Median :0.0000 Median :1.0000 Median :1.000
Mean :0.3037 Mean :0.2417 Mean :0.9723 Mean :1.696
3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:3.000
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :3.000
NA's :600 NA's :600 NA's :1454 NA's :6820
tumember partyid relserv teachforlang
Min. :0.000 Min. :1.000 Min. :1.000 Min. :1.00
1st Qu.:0.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:1.00
Median :0.000 Median :3.000 Median :5.000 Median :2.00
Mean :0.317 Mean :2.906 Mean :4.003 Mean :2.06
3rd Qu.:1.000 3rd Qu.:4.000 3rd Qu.:6.000 3rd Qu.:3.00
Max. :1.000 Max. :5.000 Max. :6.000 Max. :5.00
NA's :6232 NA's :12542 NA's :2099 NA's :1228
immiecon immijobs gdpcap party_germany
Min. :1.000 Min. :1.000 Min. : 3440 Min. : 1.000
1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 8660 1st Qu.: 1.000
Median :3.000 Median :3.000 Median :19625 Median : 2.000
Mean :3.139 Mean :2.922 Mean :16505 Mean : 4.423
3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:22275 3rd Qu.: 4.000
Max. :5.000 Max. :5.000 Max. :27905 Max. :96.000
NA's :3003 NA's :1900 NA's :612 NA's :29106
party_uk party_us
Min. : 1.00 Min. : 1.000
1st Qu.: 1.00 1st Qu.: 2.000
Median : 2.00 Median : 4.000
Mean :11.25 Mean : 5.457
3rd Qu.: 3.00 3rd Qu.: 6.000
Max. :96.00 Max. :95.000
NA's :29878 NA's :29529
6.2 Detailed Summary of schooling
variable
select: dropped 25 variables (id, country, tradepro, tradecon, male, …)
schooling
Min. : 0.00
1st Qu.: 9.00
Median :12.00
Mean :11.64
3rd Qu.:14.00
Max. :20.00
NA's :2872
select: dropped 25 variables (id, country, tradepro, tradecon, male, …)
Name | Piped data |
Number of rows | 30894 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
schooling | 2872 | 0.91 | 11.64 | 3.64 | 0 | 9 | 12 | 14 | 20 | ▁▃▇▅▂ |
To get the mode, we can either use a table()
command (similar to the tabulate command in Stata):
select: dropped 25 variables (id, country, tradepro, tradecon, male, …)
.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
166 54 76 116 372 374 748 643 2992 1764 3338 2879 4433 2215 2091
15 16 17 18 19 20
1312 1635 941 842 353 678
Or simply using the count()
command:
issp %>%
count(schooling)
count: now 22 rows and 2 columns, ungrouped
# A tibble: 22 x 2
schooling n
<dbl> <int>
1 0 166
2 1 54
3 2 76
4 3 116
5 4 372
6 5 374
7 6 748
8 7 643
9 8 2992
10 9 1764
# ... with 12 more rows
And the getting the row with the highest count:
count: now 22 rows and 2 columns, ungrouped
filter: removed 21 rows (95%), one row remaining
# A tibble: 1 x 2
schooling n
<dbl> <int>
1 12 4433
Or we can use the Mode()
command from the DescTools
package:
summarise: now one row and one column, ungrouped
# A tibble: 1 x 1
schooling_median
<dbl>
1 12
6.3 Dot Plot for average schooling years
To create a useful dot plot, in R we first summarise the data down to the means for each group (country). Note that we first use the as_factor()
command to convert the labels from the country
variable to a factor - something that is necessary because Stata uses labels, but R generally does not.
group_by: one grouping variable (country)
summarise: now 24 rows and 2 columns, ungrouped
# A tibble: 24 x 2
country schooling
<fct> <dbl>
1 aus 11.9
2 d-w 10.9
3 d-e 10.9
4 gb 11.3
5 usa 13.4
6 a 10.3
7 h 10.5
8 i 11.0
9 irl 12.3
10 nl 12.7
# ... with 14 more rows
Given this data, we can then easily recreate the Stata dotplot by using ggplot()
(note: you can of course also use esquisser()
on the modified data). In the ggplot
command, we use the aes()
to do our mapping - to recreate the Stata plot, we choose countries on the y axis and schooling on the x-axis. We also add x-limit to display the full range down to 0.
issp %>%
group_by(country) %>%
summarise(schooling = mean(schooling, na.rm=TRUE)) %>%
ggplot() +
aes(x=schooling, y = country) + #
geom_point() +
xlim(c(0,15))

Great, so we can recreate the plot from Stata exactly. Now we can modify the command above a bit to improve the styling using the theme()
command. I switch around the axes (personal preference) and use column rather than points - this way, we can also easily spot that we are missing all schooling values for Bulgaria.
issp %>%
mutate(country = as_factor(country)) %>% # this is needed because Stata uses labels and R does not
group_by(country) %>%
summarise(schooling = mean(schooling, na.rm=TRUE)) %>%
ggplot() +
aes(x = country, y = schooling) +
geom_col(fill = "darkblue")+
labs(x = NULL, y = "Average Years of Schooling", title = "Average Schooling") +
theme_minimal()

6.4 Summary Statistics for the age
variable
Just as above:
select: dropped 25 variables (id, country, tradepro, tradecon, male, …)
Warning: Couldn't find skimmers for class: haven_labelled, vctrs_vctr,
double, numeric; No user-defined `sfl` provided. Falling back to
`character`.
Name | Piped data |
Number of rows | 30894 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
character | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
age | 228 | 0.99 | 2 | 2 | 0 | 84 | 0 |
And getting the precise percentiles:
summarise: now 5 rows and 2 columns, ungrouped
# A tibble: 5 x 2
age percentile
<dbl> <chr>
1 14 0%
2 31 25%
3 43 50%
4 58 75%
5 98 100%
6.5 Using the return
command
In R, the return
command is not the same as in Stata. Essentially, you will not need this for now. To use any summary statistic, we can simply use our usual summarise()
and mutate()
commands.
6.6 Country Boxplot for age
We can easily create a boxplot:
issp %>%
ggplot() +
aes(x = country, y = age, fill=country) +
geom_boxplot() +
# some styling
theme_minimal() +
theme(legend.position = "none")+
labs(x=NULL, y="Age", title = "Age Distribution by Country")

An alternative to using boxplot is to use a violin plot:
issp %>%
ggplot() +
aes(x = country, y = age, fill = country) +
geom_violin() +
# some styling
theme_minimal() +
theme(legend.position = "none") +
labs(x=NULL, y="Age", title = "Age Distribution by Country")

Finding out the median for each country is of course also possible like this:
group_by: one grouping variable (country)
summarise: now 24 rows and 2 columns, ungrouped
# A tibble: 24 x 2
country age
<fct> <dbl>
1 aus 48
2 d-w 45
3 d-e 48
4 gb 44
5 usa 42
6 a 46
7 h 46
8 i 42
9 irl 44
10 nl 41
# ... with 14 more rows
6.7 Histograms by country
To plot a separate histogram for each country, we facet_wrap()
by our grouping variable (country in this case).
If we want to add a normal distribution to a histogram, we need to use a package that is not yet published on the standard repository that all R packages lie on - indeed this is more difficult than in Stata.
We can install this using the command devtools::install_github("teunbrand/ggh4x")
. Then we can load it normally again:
# devtools::install_github("teunbrand/ggh4x") # install this new package like this
library(ggh4x) # A package that allows us to add the normal distribution
issp %>%
ggplot() +
aes(x = age, fill = country) +
# We first convert our data from a count to a density (like in Stat)
geom_histogram(aes(y = after_stat(density))) +
stat_theodensity() + # and add the normal density here
# Now we make sure each country is displayed separately
facet_wrap(~country) + # if you want each histogram to have an individual scale, use the scales = "free" option
# some styling
theme_minimal() +
theme(legend.position = "none") +
labs(x=NULL, y="Count", title = "Age Distribution by Country")
Warning: Removed 228 rows containing non-finite values (stat_bin).

6.8 Create an old
dummy variable
mutate: new variable 'old' (double) with 3 unique values and 1% NA
# A tibble: 30,894 x 27
id country tradepro tradecon male age marital status
<dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl+l> <dbl+lb>
1 100001 aus 0 1 1 77 1 [mar~ 7 [ret~
2 100002 aus 0 1 1 56 3 [div~ 5 [une~
3 100003 aus 1 0 1 41 1 [mar~ 1 [f-t~
4 100004 aus 0 1 0 28 1 [mar~ 8 [hou~
5 100005 aus 1 0 0 43 1 [mar~ 1 [f-t~
6 100006 aus 1 0 0 49 1 [mar~ 1 [f-t~
7 100007 aus 0 0 1 46 1 [mar~ 1 [f-t~
8 100008 aus 0 1 1 48 1 [mar~ 7 [ret~
9 100009 aus 0 1 1 22 5 [not~ 1 [f-t~
10 100010 aus 0 1 0 55 1 [mar~ NA
# ... with 30,884 more rows, and 19 more variables: socialclass <dbl+lbl>,
# nominalincome <dbl>, schooling <dbl>, highschoolincpl <dbl>,
# highschool <dbl>, college <dbl>, citizen <dbl+lbl>, rural <dbl+lbl>,
# tumember <dbl+lbl>, partyid <dbl+lbl>, relserv <dbl+lbl>,
# teachforlang <dbl+lbl>, immiecon <dbl+lbl>, immijobs <dbl+lbl>,
# gdpcap <dbl>, party_germany <dbl+lbl>, party_uk <dbl+lbl>,
# party_us <dbl+lbl>, old <dbl>
6.9 nominalincome
for Australia
Let’s first request the summary for the nominalincome
for Australia:
filter: removed 28,456 rows (92%), 2,438 rows remaining
select: dropped 25 variables (id, country, tradepro, tradecon, male, …)
Name | Piped data |
Number of rows | 2438 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
nominalincome | 388 | 0.84 | 29511.22 | 28877.36 | 1500 | 12500 | 22500 | 37500 | 3e+05 | ▇▁▁▁▁ |
And now let’s analyse the skew (you could use the skew()
command from the DescTools
package to formally test this) by looking at the histogram of the variable for Australia:
issp %>%
filter(country == "aus") %>%
ggplot() +
aes(x = nominalincome) +
geom_histogram() +
#styling
theme_minimal() +
labs(title = "Distribution of Nominal Income", subtitle = "Australia")

6.10 nominalincome
for all countries
group_by: one grouping variable (country)
summarise: now 24 rows and 2 columns, ungrouped
# A tibble: 24 x 2
country nominal_income
<fct> <dbl>
1 aus 29511.
2 d-w 2513.
3 d-e 1791.
4 gb 14885.
5 usa 19283.
6 a 13402.
7 h 20013.
8 i NaN
9 irl 9308.
10 nl 45551.
# ... with 14 more rows