6 Exercise 3

Before we get started we need to open our project.

As discussed in our Project Management chapter 3, we first want to open our QEH Project file. To open the QEH.Rproj file, we head over to the folder we put it in and double click on it in our file browse - or we open it from RStudio.

Now we can check that we are in the right working directory and can chech that we are in the right project by checking in the top right of RStudio.

Let’s create a new RMarkdown file (see 3.3) by selecting R Notebook in the “File” menu and let’s save it as Introduction_1.Rmd in our project folder. Because we want to create a PDF report with a proper title and our name, we modify the header of the Rmarkdown document ever so slightly (most importantly, we replace the html_notebook with pdf_document).

Next, we want to make sure we have all tools ready that we will need in this exercise, so we load a few libraries (use install.packages("packagename") first, if you are missing one):

library(tidyverse) # our main collection of functions
library(tidylog) # prints additional output from the tidyverse commands - load after tidyverse 
library(skimr) # allows us to get an overview over the data quickly
library(readxl) # allows us to load .xlsx (Excel) files
library(here) # needed to navigate to folders and files in a project
library(haven) # allows us to load .dta (Stata specific) files
library(janitor) # to allow us to use the clean_names() command
library(DescTools) # for the Mode() command
library(esquisse) # an app to help us with the plotting in ggplot

Now we are ready to actually start our work!

6.1 Import Data

As pointed out in the exercise, we load the issp1995subset.dta dataset - I have put all my datasets into a data folder in our project folder.

Having loaded the haven package, which we need need to load Stata .dta files, and using the here() command to navigate to our data folder, we use:

issp <- read_dta(here("data","issp1995subset.dta"))

We will quickly do something that is only needed because we are using a .dta file, which is a Stata specific file: This is needed because Stata uses labels and R does not, so we simply convert the country variable to a factor.

issp %>% 
  mutate(country = as_factor(country)) -> issp  
mutate: converted 'country' from double to factor (0 new NA)

Now, let’s consider some basic summaries of the data:

issp %>% 
  skim()
Table 6.1: Data summary
Name Piped data
Number of rows 30894
Number of columns 26
_______________________
Column type frequency:
character 15
factor 1
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
age 228 0.99 2 2 0 84 0
marital 236 0.99 1 1 0 5 0
status 477 0.98 1 2 0 10 0
socialclass 5291 0.83 1 1 0 6 0
citizen 1454 0.95 1 1 0 2 0
rural 6820 0.78 1 1 0 3 0
tumember 6232 0.80 1 1 0 2 0
partyid 12542 0.59 1 1 0 5 0
relserv 2099 0.93 1 1 0 6 0
teachforlang 1228 0.96 1 1 0 5 0
immiecon 3003 0.90 1 1 0 5 0
immijobs 1900 0.94 1 1 0 5 0
party_germany 29106 0.06 1 2 0 16 0
party_uk 29878 0.03 1 2 0 9 0
party_us 29529 0.04 1 2 0 8 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
country 0 1 FALSE 24 aus: 2438, nl: 2089, pl: 1598, rus: 1585

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 1497526.64 1047331.11 1321.0 700966.2 1400977.50 2001235.75 5014485.00 ▆▇▃▁▁
tradepro 0 1.00 0.21 0.41 0.0 0.0 0.00 0.00 1.00 ▇▁▁▁▂
tradecon 0 1.00 0.56 0.50 0.0 0.0 1.00 1.00 1.00 ▆▁▁▁▇
male 116 1.00 0.47 0.50 0.0 0.0 0.00 1.00 1.00 ▇▁▁▁▇
nominalincome 9034 0.71 29090.18 50113.62 4.0 3319.5 12500.00 32500.00 834000.00 ▇▁▁▁▁
schooling 2872 0.91 11.64 3.64 0.0 9.0 12.00 14.00 20.00 ▁▃▇▅▂
highschoolincpl 600 0.98 0.21 0.41 0.0 0.0 0.00 0.00 1.00 ▇▁▁▁▂
highschool 600 0.98 0.30 0.46 0.0 0.0 0.00 1.00 1.00 ▇▁▁▁▃
college 600 0.98 0.24 0.43 0.0 0.0 0.00 0.00 1.00 ▇▁▁▁▂
gdpcap 612 0.98 16505.00 7484.32 3439.6 8660.1 19624.51 22275.08 27904.94 ▅▃▂▇▂
issp %>% 
  summary()
       id             country         tradepro         tradecon     
 Min.   :   1321   aus    : 2438   Min.   :0.0000   Min.   :0.0000  
 1st Qu.: 700966   nl     : 2089   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :1400978   pl     : 1598   Median :0.0000   Median :1.0000  
 Mean   :1497527   rus    : 1585   Mean   :0.2127   Mean   :0.5649  
 3rd Qu.:2001236   cdn    : 1543   3rd Qu.:0.0000   3rd Qu.:1.0000  
 Max.   :5014485   n      : 1527   Max.   :1.0000   Max.   :1.0000  
                   (Other):20114                                    
      male             age           marital          status      
 Min.   :0.0000   Min.   :14.00   Min.   :1.000   Min.   : 1.000  
 1st Qu.:0.0000   1st Qu.:31.00   1st Qu.:1.000   1st Qu.: 1.000  
 Median :0.0000   Median :43.00   Median :1.000   Median : 2.000  
 Mean   :0.4706   Mean   :45.08   Mean   :2.095   Mean   : 3.864  
 3rd Qu.:1.0000   3rd Qu.:58.00   3rd Qu.:3.000   3rd Qu.: 7.000  
 Max.   :1.0000   Max.   :98.00   Max.   :5.000   Max.   :10.000  
 NA's   :116      NA's   :228     NA's   :236     NA's   :477     
  socialclass   nominalincome      schooling     highschoolincpl 
 Min.   :1.00   Min.   :     4   Min.   : 0.00   Min.   :0.0000  
 1st Qu.:2.00   1st Qu.:  3320   1st Qu.: 9.00   1st Qu.:0.0000  
 Median :4.00   Median : 12500   Median :12.00   Median :0.0000  
 Mean   :3.15   Mean   : 29090   Mean   :11.64   Mean   :0.2132  
 3rd Qu.:4.00   3rd Qu.: 32500   3rd Qu.:14.00   3rd Qu.:0.0000  
 Max.   :6.00   Max.   :834000   Max.   :20.00   Max.   :1.0000  
 NA's   :5291   NA's   :9034     NA's   :2872    NA's   :600     
   highschool        college          citizen           rural      
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:1.000  
 Median :0.0000   Median :0.0000   Median :1.0000   Median :1.000  
 Mean   :0.3037   Mean   :0.2417   Mean   :0.9723   Mean   :1.696  
 3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:3.000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :3.000  
 NA's   :600      NA's   :600      NA's   :1454     NA's   :6820   
    tumember        partyid         relserv       teachforlang 
 Min.   :0.000   Min.   :1.000   Min.   :1.000   Min.   :1.00  
 1st Qu.:0.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:1.00  
 Median :0.000   Median :3.000   Median :5.000   Median :2.00  
 Mean   :0.317   Mean   :2.906   Mean   :4.003   Mean   :2.06  
 3rd Qu.:1.000   3rd Qu.:4.000   3rd Qu.:6.000   3rd Qu.:3.00  
 Max.   :1.000   Max.   :5.000   Max.   :6.000   Max.   :5.00  
 NA's   :6232    NA's   :12542   NA's   :2099    NA's   :1228  
    immiecon        immijobs         gdpcap      party_germany   
 Min.   :1.000   Min.   :1.000   Min.   : 3440   Min.   : 1.000  
 1st Qu.:2.000   1st Qu.:2.000   1st Qu.: 8660   1st Qu.: 1.000  
 Median :3.000   Median :3.000   Median :19625   Median : 2.000  
 Mean   :3.139   Mean   :2.922   Mean   :16505   Mean   : 4.423  
 3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:22275   3rd Qu.: 4.000  
 Max.   :5.000   Max.   :5.000   Max.   :27905   Max.   :96.000  
 NA's   :3003    NA's   :1900    NA's   :612     NA's   :29106   
    party_uk        party_us     
 Min.   : 1.00   Min.   : 1.000  
 1st Qu.: 1.00   1st Qu.: 2.000  
 Median : 2.00   Median : 4.000  
 Mean   :11.25   Mean   : 5.457  
 3rd Qu.: 3.00   3rd Qu.: 6.000  
 Max.   :96.00   Max.   :95.000  
 NA's   :29878   NA's   :29529   

6.2 Detailed Summary of schooling variable

issp %>% 
  select(schooling) %>% 
  summary()
select: dropped 25 variables (id, country, tradepro, tradecon, male, …)
   schooling    
 Min.   : 0.00  
 1st Qu.: 9.00  
 Median :12.00  
 Mean   :11.64  
 3rd Qu.:14.00  
 Max.   :20.00  
 NA's   :2872   
issp %>% 
  select(schooling) %>% 
  skim()
select: dropped 25 variables (id, country, tradepro, tradecon, male, …)
Table 6.2: Data summary
Name Piped data
Number of rows 30894
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
schooling 2872 0.91 11.64 3.64 0 9 12 14 20 ▁▃▇▅▂

To get the mode, we can either use a table() command (similar to the tabulate command in Stata):

issp %>% 
  select(schooling) %>% 
  table()
select: dropped 25 variables (id, country, tradepro, tradecon, male, …)
.
   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
 166   54   76  116  372  374  748  643 2992 1764 3338 2879 4433 2215 2091 
  15   16   17   18   19   20 
1312 1635  941  842  353  678 

Or simply using the count() command:

issp %>% 
  count(schooling)
count: now 22 rows and 2 columns, ungrouped
# A tibble: 22 x 2
   schooling     n
       <dbl> <int>
 1         0   166
 2         1    54
 3         2    76
 4         3   116
 5         4   372
 6         5   374
 7         6   748
 8         7   643
 9         8  2992
10         9  1764
# ... with 12 more rows

And the getting the row with the highest count:

issp %>% 
  count(schooling) %>% 
  filter(n == max(n))
count: now 22 rows and 2 columns, ungrouped
filter: removed 21 rows (95%), one row remaining
# A tibble: 1 x 2
  schooling     n
      <dbl> <int>
1        12  4433

Or we can use the Mode() command from the DescTools package:

issp %>% 
  summarise(schooling_median = Mode(schooling, na.rm=TRUE))
summarise: now one row and one column, ungrouped
# A tibble: 1 x 1
  schooling_median
             <dbl>
1               12

6.3 Dot Plot for average schooling years

To create a useful dot plot, in R we first summarise the data down to the means for each group (country). Note that we first use the as_factor() command to convert the labels from the country variable to a factor - something that is necessary because Stata uses labels, but R generally does not.

issp %>% 
  group_by(country) %>% 
  summarise(schooling = mean(schooling, na.rm=TRUE))
group_by: one grouping variable (country)
summarise: now 24 rows and 2 columns, ungrouped
# A tibble: 24 x 2
   country schooling
   <fct>       <dbl>
 1 aus          11.9
 2 d-w          10.9
 3 d-e          10.9
 4 gb           11.3
 5 usa          13.4
 6 a            10.3
 7 h            10.5
 8 i            11.0
 9 irl          12.3
10 nl           12.7
# ... with 14 more rows

Given this data, we can then easily recreate the Stata dotplot by using ggplot() (note: you can of course also use esquisser() on the modified data). In the ggplot command, we use the aes() to do our mapping - to recreate the Stata plot, we choose countries on the y axis and schooling on the x-axis. We also add x-limit to display the full range down to 0.

issp %>% 
  group_by(country) %>% 
  summarise(schooling = mean(schooling, na.rm=TRUE)) %>% 
  
  ggplot() + 
  aes(x=schooling, y = country) + # 
  geom_point() + 
  xlim(c(0,15))

Great, so we can recreate the plot from Stata exactly. Now we can modify the command above a bit to improve the styling using the theme() command. I switch around the axes (personal preference) and use column rather than points - this way, we can also easily spot that we are missing all schooling values for Bulgaria.

issp %>% 
  mutate(country = as_factor(country)) %>%  # this is needed because Stata uses labels and R does not
  group_by(country) %>% 
  summarise(schooling = mean(schooling, na.rm=TRUE)) %>% 
  
  ggplot() +
  aes(x = country, y = schooling) +
  geom_col(fill = "darkblue")+
  labs(x = NULL, y = "Average Years of Schooling", title = "Average Schooling") +
  theme_minimal()

6.4 Summary Statistics for the age variable

Just as above:

issp %>% 
  select(age) %>% 
  skim()
select: dropped 25 variables (id, country, tradepro, tradecon, male, …)
Warning: Couldn't find skimmers for class: haven_labelled, vctrs_vctr,
double, numeric; No user-defined `sfl` provided. Falling back to
`character`.
Table 6.3: Data summary
Name Piped data
Number of rows 30894
Number of columns 1
_______________________
Column type frequency:
character 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
age 228 0.99 2 2 0 84 0

And getting the precise percentiles:

issp %>% 
  summarise(age = quantile(age, na.rm=TRUE),
            percentile = names(age))
summarise: now 5 rows and 2 columns, ungrouped
# A tibble: 5 x 2
    age percentile
  <dbl> <chr>     
1    14 0%        
2    31 25%       
3    43 50%       
4    58 75%       
5    98 100%      

6.5 Using the return command

In R, the return command is not the same as in Stata. Essentially, you will not need this for now. To use any summary statistic, we can simply use our usual summarise() and mutate() commands.

6.6 Country Boxplot for age

We can easily create a boxplot:

issp %>% 
  
  ggplot() + 
  aes(x = country, y = age, fill=country) + 
  geom_boxplot() + 
  
  # some styling
  theme_minimal() + 
  theme(legend.position = "none")+
  labs(x=NULL, y="Age", title = "Age Distribution by Country")

An alternative to using boxplot is to use a violin plot:

issp %>% 
  
  ggplot() + 
  aes(x = country, y = age, fill = country) + 
  geom_violin() + 
  
  # some styling
  theme_minimal() + 
  theme(legend.position = "none") +
  labs(x=NULL, y="Age", title = "Age Distribution by Country")

Finding out the median for each country is of course also possible like this:

issp %>% 
  group_by(country) %>% 
  summarise(age = median(age))
group_by: one grouping variable (country)
summarise: now 24 rows and 2 columns, ungrouped
# A tibble: 24 x 2
   country   age
   <fct>   <dbl>
 1 aus        48
 2 d-w        45
 3 d-e        48
 4 gb         44
 5 usa        42
 6 a          46
 7 h          46
 8 i          42
 9 irl        44
10 nl         41
# ... with 14 more rows

6.7 Histograms by country

To plot a separate histogram for each country, we facet_wrap() by our grouping variable (country in this case).

If we want to add a normal distribution to a histogram, we need to use a package that is not yet published on the standard repository that all R packages lie on - indeed this is more difficult than in Stata.

We can install this using the command devtools::install_github("teunbrand/ggh4x"). Then we can load it normally again:

# devtools::install_github("teunbrand/ggh4x") # install this new package like this
library(ggh4x) # A package that allows us to add the normal distribution

issp %>% 
  
  ggplot() + 
  aes(x = age, fill = country) + 
  
  # We first convert our data from a count to a density (like in Stat)
  geom_histogram(aes(y = after_stat(density))) + 
  stat_theodensity() + # and add the normal density here 
  
  # Now we make sure each country is displayed separately
  facet_wrap(~country) + # if you want each histogram to have an individual scale, use the scales = "free" option
  
  # some styling
  theme_minimal() + 
  theme(legend.position = "none") +
  labs(x=NULL, y="Count", title = "Age Distribution by Country")
Warning: Removed 228 rows containing non-finite values (stat_bin).

6.8 Create an old dummy variable

issp %>% 
  mutate(old = ifelse(test = age > mean(age, na.rm=TRUE), 
                      yes = 1,
                      no = 0)) 
mutate: new variable 'old' (double) with 3 unique values and 1% NA
# A tibble: 30,894 x 27
       id country tradepro tradecon  male   age marital   status
    <dbl> <fct>      <dbl>    <dbl> <dbl> <dbl> <dbl+l> <dbl+lb>
 1 100001 aus            0        1     1    77 1 [mar~  7 [ret~
 2 100002 aus            0        1     1    56 3 [div~  5 [une~
 3 100003 aus            1        0     1    41 1 [mar~  1 [f-t~
 4 100004 aus            0        1     0    28 1 [mar~  8 [hou~
 5 100005 aus            1        0     0    43 1 [mar~  1 [f-t~
 6 100006 aus            1        0     0    49 1 [mar~  1 [f-t~
 7 100007 aus            0        0     1    46 1 [mar~  1 [f-t~
 8 100008 aus            0        1     1    48 1 [mar~  7 [ret~
 9 100009 aus            0        1     1    22 5 [not~  1 [f-t~
10 100010 aus            0        1     0    55 1 [mar~ NA      
# ... with 30,884 more rows, and 19 more variables: socialclass <dbl+lbl>,
#   nominalincome <dbl>, schooling <dbl>, highschoolincpl <dbl>,
#   highschool <dbl>, college <dbl>, citizen <dbl+lbl>, rural <dbl+lbl>,
#   tumember <dbl+lbl>, partyid <dbl+lbl>, relserv <dbl+lbl>,
#   teachforlang <dbl+lbl>, immiecon <dbl+lbl>, immijobs <dbl+lbl>,
#   gdpcap <dbl>, party_germany <dbl+lbl>, party_uk <dbl+lbl>,
#   party_us <dbl+lbl>, old <dbl>

6.9 nominalincome for Australia

Let’s first request the summary for the nominalincome for Australia:

issp %>% 
  
  filter(country == "aus") %>% 
  select(nominalincome) %>% 
  skim()
filter: removed 28,456 rows (92%), 2,438 rows remaining
select: dropped 25 variables (id, country, tradepro, tradecon, male, …)
Table 6.4: Data summary
Name Piped data
Number of rows 2438
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
nominalincome 388 0.84 29511.22 28877.36 1500 12500 22500 37500 3e+05 ▇▁▁▁▁

And now let’s analyse the skew (you could use the skew() command from the DescTools package to formally test this) by looking at the histogram of the variable for Australia:

issp %>% 
  
  filter(country == "aus") %>% 
  ggplot() + 
  aes(x = nominalincome) + 
  geom_histogram() + 
  
  #styling 
  theme_minimal() + 
  labs(title = "Distribution of Nominal Income", subtitle = "Australia")

6.10 nominalincome for all countries

issp %>% 
  group_by(country) %>% 
  summarise(nominal_income = mean(nominalincome, na.rm=TRUE))
group_by: one grouping variable (country)
summarise: now 24 rows and 2 columns, ungrouped
# A tibble: 24 x 2
   country nominal_income
   <fct>            <dbl>
 1 aus             29511.
 2 d-w              2513.
 3 d-e              1791.
 4 gb              14885.
 5 usa             19283.
 6 a               13402.
 7 h               20013.
 8 i                 NaN 
 9 irl              9308.
10 nl              45551.
# ... with 14 more rows