5 Exercise 2
Before we get started we need to open our project.
As discussed in the Project Management chapter (3), we first want to open our QEH project file. To open the QEH.Rproj file, we go to the folder we saved it in and double-click on it in our file browser, or we open it from within RStudio.
Now we can confirm that we are in the right project by looking at the project name in the top right of RStudio, and check that we are in the right working directory:
getwd()
Let’s create a new RMarkdown file (see 3.3) by selecting R Notebook in the “File” menu and let’s save it as Introduction_1.Rmd
in our project folder. Because we want to create a PDF report with a proper title and our name, we modify the YAML header of the RMarkdown document slightly (most importantly, we replace html_notebook with pdf_document).
Next, we want to make sure we have all tools ready that we will need in this exercise, so we load a few libraries (use install.packages("packagename")
first, if you are missing one):
library(tidyverse) # our main collection of functions
library(tidylog) # prints additional output from the tidyverse commands - load after tidyverse
library(skimr) # allows us to get an overview over the data quickly
library(readxl) # allows us to load .xlsx (Excel) files
library(here) # needed to navigate to folders and files in a project
library(janitor) # to allow us to use the clean_names() command
library(esquisse) # an app to help us with the plotting in ggplot
Now we are ready to actually start our work!
5.1 Import Data
survey <- read_excel(path = here("data","QMSurvey.xlsx"))
5.2 Rename variables
Again, we use the tidyverse pipe operator %>% (shortcut CTRL + SHIFT + M) to string multiple commands together. Remember, just think of the pipe as saying “then”, i.e. do one command, then do the next one.
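To make the "then" reading concrete, here is a minimal sketch with made-up numbers (the vector values is purely illustrative and not part of the survey data). It computes the same result twice, once with nested function calls and once with the pipe:

```r
library(dplyr)  # makes the %>% pipe available

# Made-up numbers, only for illustration
values <- c(4, 9, 16)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(values)), 1)

# The pipe is read from left to right:
# take values, then take the square root, then the mean, then round
piped <- values %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```

Both versions do exactly the same work; the piped one just lists the steps in the order in which they happen.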
Before we rename the columns, which represent each variable, we want to see what they are called right now:
survey %>%
names()
[1] "Timestamp"
[2] "1. What was your undergraduate degree in (major and minor, if applicable)?"
[3] "2. If you already have a masters degree, what was it it?"
[4] "3. Did you attend any university-level courses in quantitative methods (including statistics, econometrics etc.)? If yes, how many courses or up to which level?"
[5] "4. Which of the following topics have you ever heard of?"
[6] "4. Which of the following topics do you feel at ease with?"
[7] "5. Which statistical softwares have you already used?"
[8] "6. Have you already used quantitative methods in your academic work?"
[9] "7. Do you plan to do so for your MPhil dissertation?"
[10] "8. If you have prior work experience (including internship), how much have you used quantitative methods at work?"
[11] "9. How often do you expect to use quantitative methods in your career?"
[12] "10. Finally [I usually need examples to understand theory]"
[13] "10. Finally [I feel uncomfortable with maths / calculus]"
[14] "10. Finally [I feel comfortable with figures]"
[15] "10. Finally [I am usually good at learning how to use a new software]"
[16] "10. Finally [I feel more at ease with qualitative methods (than with quantitative methods)]"
[17] "10. Finally [Whenever I see a formula / maths in a paper, I tend to skip that part]"
[18] "10. Finally [Whenever possible, I let other people do the maths for me]"
[19] "11. What are you most hoping to get out of the QM class?"
[20] "12. What are your main topics of interest? For instance, the topic on which you would like to work in your thesis."
[21] "Which degree course are you following?"
You can see that R, unlike Stata, does not transform the variable names at all but keeps all of the information intact. To make renaming the variables a little easier, we will simplify them first using the clean_names() command.
survey %>%
clean_names() %>%
names()
[1] "timestamp"
[2] "x1_what_was_your_undergraduate_degree_in_major_and_minor_if_applicable"
[3] "x2_if_you_already_have_a_masters_degree_what_was_it_it"
[4] "x3_did_you_attend_any_university_level_courses_in_quantitative_methods_including_statistics_econometrics_etc_if_yes_how_many_courses_or_up_to_which_level"
[5] "x4_which_of_the_following_topics_have_you_ever_heard_of"
[6] "x4_which_of_the_following_topics_do_you_feel_at_ease_with"
[7] "x5_which_statistical_softwares_have_you_already_used"
[8] "x6_have_you_already_used_quantitative_methods_in_your_academic_work"
[9] "x7_do_you_plan_to_do_so_for_your_m_phil_dissertation"
[10] "x8_if_you_have_prior_work_experience_including_internship_how_much_have_you_used_quantitative_methods_at_work"
[11] "x9_how_often_do_you_expect_to_use_quantitative_methods_in_your_career"
[12] "x10_finally_i_usually_need_examples_to_understand_theory"
[13] "x10_finally_i_feel_uncomfortable_with_maths_calculus"
[14] "x10_finally_i_feel_comfortable_with_figures"
[15] "x10_finally_i_am_usually_good_at_learning_how_to_use_a_new_software"
[16] "x10_finally_i_feel_more_at_ease_with_qualitative_methods_than_with_quantitative_methods"
[17] "x10_finally_whenever_i_see_a_formula_maths_in_a_paper_i_tend_to_skip_that_part"
[18] "x10_finally_whenever_possible_i_let_other_people_do_the_maths_for_me"
[19] "x11_what_are_you_most_hoping_to_get_out_of_the_qm_class"
[20] "x12_what_are_your_main_topics_of_interest_for_instance_the_topic_on_which_you_would_like_to_work_in_your_thesis"
[21] "which_degree_course_are_you_following"
Ok, this makes it slightly easier - albeit not perfect. There are a few options in the clean_names()
command that work nicely, but let’s not get too hung up about them here.
Let’s get on with renaming the variables as instructed, and save the result to a new object, survey_renamed:
survey %>%
clean_names() %>%
rename(nomaths = x10_finally_whenever_possible_i_let_other_people_do_the_maths_for_me,
skipmaths = x10_finally_whenever_i_see_a_formula_maths_in_a_paper_i_tend_to_skip_that_part,
uncomfortable = x10_finally_i_feel_uncomfortable_with_maths_calculus,
examples = x10_finally_i_usually_need_examples_to_understand_theory) -> survey_renamed
rename: renamed 4 variables (examples, uncomfortable, skipmaths, nomaths)
Better! But you can see how important it is to find good names for variables - you’ll be using them a lot as you do your analysis.
5.3 Converting a variable to numeric
We first need to tell R which number each answer corresponds to. We do this with case_when(), which checks the conditions one after another and returns the value belonging to the first condition that is true.
survey_renamed %>%
mutate(c_nomaths = case_when(nomaths == "Totally disagree"~1,
nomaths == "Disagree"~2,
nomaths == "Don't especially agree or disagree"~3,
nomaths == "Agree"~4,
nomaths == "Fully agree"~5)) %>%
select(nomaths, c_nomaths)
mutate: new variable 'c_nomaths' (double) with 5 unique values and 0% NA
select: dropped 20 variables (timestamp, x1_what_was_your_undergraduate_degree_in_major_and_minor_if_applicable, x2_if_you_already_have_a_masters_degree_what_was_it_it, x3_did_you_attend_any_university_level_courses_in_quantitative_methods_including_statistics_econometrics_etc_if_yes_how_many_courses_or_up_to_which_level, x4_which_of_the_following_topics_have_you_ever_heard_of, …)
# A tibble: 57 x 2
nomaths c_nomaths
<chr> <dbl>
1 Don't especially agree or disagree 3
2 Agree 4
3 Don't especially agree or disagree 3
4 Agree 4
5 Disagree 2
6 Agree 4
7 Agree 4
8 Disagree 2
9 Disagree 2
10 Agree 4
# ... with 47 more rows
You can now see that the c_nomaths
variable is of type double
(e.g. using str()
), which means it’s a numeric variable.
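One behaviour of case_when() is worth keeping in mind: a value that matches none of the conditions becomes NA. The sketch below uses a made-up tibble (toy_answers is not the survey data, and only two of the five conditions are written out) to show how a misspelt answer slips through, which is why the "0% NA" in the tidylog message above is a useful sanity check:

```r
library(dplyr)

# Made-up answers: the third one has a typo (lower case, trailing space)
toy_answers <- tibble(nomaths = c("Agree", "Disagree", "agree ", NA))

toy_recoded <- toy_answers %>%
  mutate(c_nomaths = case_when(nomaths == "Disagree" ~ 2,
                               nomaths == "Agree" ~ 4))

toy_recoded

# Count how many values did not match any condition
sum(is.na(toy_recoded$c_nomaths))
```

Here the misspelt answer and the missing answer both end up as NA in c_nomaths.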
5.4 Labelling the new numeric variable
R and Stata differ a bit with regard to labelling: R does not generally label values, so the value you see is the value that the variable has. But there is a way to achieve the same functionality as in Stata, using factor() variables. factor() variables are perfect for categorical variables: each category has a level and a label. The resulting variable is, however, no longer a numeric variable.
Let’s create a factor variable (and then select the nomaths
and c_nomaths
to compare our results):
survey_renamed %>%
mutate(c_nomaths = case_when(nomaths == "Totally disagree"~1,
nomaths == "Disagree"~2,
nomaths == "Don't especially agree or disagree"~3,
nomaths == "Agree"~4,
nomaths == "Fully agree"~5)) %>%
mutate(c_nomaths = factor(c_nomaths, levels = c(1:5), labels = c("Totally disagree",
"Disagree",
"Don't especially agree or disagree",
"Agree",
"Fully agree"))) %>%
select(nomaths, c_nomaths)
mutate: new variable 'c_nomaths' (double) with 5 unique values and 0% NA
mutate: converted 'c_nomaths' from double to factor (0 new NA)
select: dropped 20 variables (timestamp, x1_what_was_your_undergraduate_degree_in_major_and_minor_if_applicable, x2_if_you_already_have_a_masters_degree_what_was_it_it, x3_did_you_attend_any_university_level_courses_in_quantitative_methods_including_statistics_econometrics_etc_if_yes_how_many_courses_or_up_to_which_level, x4_which_of_the_following_topics_have_you_ever_heard_of, …)
# A tibble: 57 x 2
nomaths c_nomaths
<chr> <fct>
1 Don't especially agree or disagree Don't especially agree or disagree
2 Agree Agree
3 Don't especially agree or disagree Don't especially agree or disagree
4 Agree Agree
5 Disagree Disagree
6 Agree Agree
7 Agree Agree
8 Disagree Disagree
9 Disagree Disagree
10 Agree Agree
# ... with 47 more rows
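To see what a factor stores, here is a minimal sketch on a made-up vector (toy_factor and agreement_labels are illustrative names, not objects from the exercise). It shows that the labels are what you see, that the factor is no longer numeric, and that the underlying codes can still be recovered with as.integer():

```r
agreement_labels <- c("Totally disagree",
                      "Disagree",
                      "Don't especially agree or disagree",
                      "Agree",
                      "Fully agree")

# Made-up numeric answers turned into a labelled factor
toy_factor <- factor(c(3, 4, 2), levels = 1:5, labels = agreement_labels)

levels(toy_factor)         # all five labels, in the order of the levels
as.character(toy_factor)   # the label of each answer
is.numeric(toy_factor)     # FALSE: a factor is not a numeric variable
as.integer(toy_factor)     # the position of each answer among the levels
```

Because the levels here are 1 to 5 in order, as.integer() returns the original numbers; with other level definitions it would return the position of the level instead.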
5.5 Using a loop for each variable - or using across
from the tidyverse
Similarly to Stata, we can use a for()
loop - in this instance, we let var
cycle through the vector c("nomaths","skipmaths", "uncomfortable", "examples")
.
The syntax of the for()
loop is fairly straightforward:
vector <- 1:5
for (variable in vector) {
  # print the current value of the loop variable
  print(paste0("Round ", variable))
}
This is also possible with tidyverse commands, but because the tidyverse is not really meant for loop operations, we need to include some unusual operators (!!, := and get()). For completeness, the loop is shown directly below, but we will focus on an alternative afterwards.
survey_renamed_loop <- survey_renamed
for (var in c("nomaths","skipmaths", "uncomfortable", "examples")){
# start from survey_renamed_loop so that the columns added in earlier rounds are kept
survey_renamed_loop %>%
mutate(!!paste0("c_",var) := case_when(get(var) == "Totally disagree"~1,
get(var) == "Disagree"~2,
get(var) == "Don't especially agree or disagree"~3,
get(var) == "Agree"~4,
get(var) == "Fully agree"~5)) %>%
mutate(!!paste0("c_",var) := factor(get(paste0("c_",var)),
levels = c(1:5),
labels = c("Totally disagree",
"Disagree",
"Don't especially agree or disagree",
"Agree",
"Fully agree"))) -> survey_renamed_loop
}
mutate: new variable 'c_nomaths' (double) with 5 unique values and 0% NA
mutate: converted 'c_nomaths' from double to factor (0 new NA)
mutate: new variable 'c_skipmaths' (double) with 5 unique values and 0% NA
mutate: converted 'c_skipmaths' from double to factor (0 new NA)
mutate: new variable 'c_uncomfortable' (double) with 5 unique values and 0% NA
mutate: converted 'c_uncomfortable' from double to factor (0 new NA)
mutate: new variable 'c_examples' (double) with 4 unique values and 0% NA
mutate: converted 'c_examples' from double to factor (0 new NA)
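The operators in the loop can look cryptic, so here is a minimal sketch of the same pattern on made-up data (toy_scores and its columns are invented for illustration). It swaps get(var) for the .data[[var]] pronoun, which is another way dplyr offers to look up a column whose name is stored in a string:

```r
library(dplyr)

# Made-up data, only for illustration
toy_scores <- tibble(maths = c(1, 2, 3), stats = c(10, 20, 30))

toy_doubled <- toy_scores
for (var in c("maths", "stats")) {
  toy_doubled <- toy_doubled %>%
    # !! inserts the pasted string as the name of the new column,
    # := is used instead of = because the name on the left is computed,
    # .data[[var]] picks the column whose name is stored in var
    mutate(!!paste0("double_", var) := .data[[var]] * 2)
}

names(toy_doubled)
```

Each round adds one new column to toy_doubled, so the result carries both double_maths and double_stats.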
To do this in a more tidyverse-like fashion, we make use of the command across() within the mutate() command. This command takes a set of columns over which you want to execute a certain function. Inside that function, the current column is simply represented by a . and, because we have written the function inline with some arguments, we need to precede it by ~. See 2.2.9 for more detailed information on the command:
survey_renamed %>%
# Translate the character values to a numeric variable
mutate(across(.cols = c(nomaths,skipmaths, uncomfortable, examples),
.fns = ~case_when(. == "Totally disagree"~1,
. == "Disagree"~2,
. == "Don't especially agree or disagree"~3,
. == "Agree"~4,
. == "Fully agree"~5),
.names = "c_{.col}")) -> survey_renamed_numeric
mutate: new variable 'c_nomaths' (double) with 5 unique values and 0% NA
new variable 'c_skipmaths' (double) with 5 unique values and 0% NA
new variable 'c_uncomfortable' (double) with 5 unique values and 0% NA
new variable 'c_examples' (double) with 4 unique values and 0% NA
survey_renamed_numeric %>%
# Create the labels for the values by creating a factor variable
mutate(across(.cols = c(c_nomaths,c_skipmaths, c_uncomfortable, c_examples),
.fns = ~factor(., levels = c(1:5), labels = c("Totally disagree",
"Disagree",
"Don't especially agree or disagree",
"Agree",
"Fully agree")))) -> survey_renamed_factor
mutate: converted 'c_nomaths' from double to factor (0 new NA)
converted 'c_skipmaths' from double to factor (0 new NA)
converted 'c_uncomfortable' from double to factor (0 new NA)
converted 'c_examples' from double to factor (0 new NA)
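The ~ and . shorthand is only needed when the function is written inline. As a hedged alternative sketch, the recoding can be put into a named helper and passed to across() directly; agreement_to_number and toy_survey below are hypothetical names on made-up data, not part of the exercise:

```r
library(dplyr)

# Hypothetical helper: map the five answer labels to the numbers 1 to 5
agreement_to_number <- function(x) {
  case_when(x == "Totally disagree" ~ 1,
            x == "Disagree" ~ 2,
            x == "Don't especially agree or disagree" ~ 3,
            x == "Agree" ~ 4,
            x == "Fully agree" ~ 5)
}

# Made-up answers for two of the variables
toy_survey <- tibble(nomaths = c("Agree", "Disagree"),
                     skipmaths = c("Fully agree", "Totally disagree"))

toy_numeric <- toy_survey %>%
  mutate(across(.cols = c(nomaths, skipmaths),
                .fns = agreement_to_number,   # no ~ needed for a named function
                .names = "c_{.col}"))         # new columns get the prefix c_

toy_numeric
```

Whether the inline or the named version reads better is a matter of taste; the named helper pays off when the same recoding is needed in several places.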
5.6 Create a histogram
Using our helper app again:
esquisser(survey_renamed_factor)
We can get to a bar plot (technically the right graph here: a histogram is for numeric values, not categorical ones):
ggplot(survey_renamed_factor) +
aes(x = c_nomaths) +
# We want a bar plot and then specify the colour of the fill
geom_bar(fill = "#0c4c8a") +
# We disable the title of the x-axis as it would just be "c_nomaths"
labs(x = NULL) +
theme_minimal()

We could redo this for all four variables. If you want to save these plots, assign each one to an object and then use the ggsave() command to write it to a file, e.g. as a png or pdf.
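Saving a plot could look like the following minimal sketch. The data, the object name bar_plot, the file name and the plot size are all made up for illustration, and the file is written to a temporary folder; in the project you would more likely build the path with here():

```r
library(ggplot2)

# Made-up answers, only for illustration
toy_data <- data.frame(answer = factor(c("Agree", "Disagree", "Agree")))

# Assign the plot to an object instead of printing it
bar_plot <- ggplot(toy_data) +
  aes(x = answer) +
  geom_bar(fill = "#0c4c8a") +
  labs(x = NULL) +
  theme_minimal()

# The file extension decides the format (here: pdf); width and height are in inches
output_file <- file.path(tempdir(), "toy_barplot.pdf")
ggsave(filename = output_file, plot = bar_plot, width = 6, height = 4)
```

Changing the extension to .png would write a png file instead.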
5.7 Box Plot
Before we draw a box plot, we need to make sure that 1) the variable we plot is indeed numeric and not a factor, and 2) the variable we want to separate the data by is indeed a factor.
We use mutate() and factor() again (and can use esquisser() to find the geom_boxplot() command and to see that we need to map the grouping variable to the group and fill aesthetics to get a legend).
survey_renamed_numeric %>%
# modify the grouping variable to be a factor
# (optional - but allows us to get labels that we then see in the plot)
mutate(used_quant_before = factor(x6_have_you_already_used_quantitative_methods_in_your_academic_work,
levels = c(1,2),
labels = c("No","Yes"))) %>%
ggplot() +
aes(y=c_uncomfortable,fill=used_quant_before, group=used_quant_before) +
geom_boxplot() +
theme_minimal() +
# we just turn off the text at the bottom of the x-axis, as it does not make a lot of sense in this context
theme(axis.text.x = element_blank())
mutate: new variable 'used_quant_before' (factor) with 2 unique values and 0% NA

5.8 Scatter Plot
We reuse the numeric version of the data from above, in which the variables have not yet been converted to factors (i.e. they are still numeric):
survey_renamed_numeric %>%
ggplot() +
aes(x=c_nomaths, y=c_skipmaths) +
geom_point() +
theme_minimal()

Really not great! This does not tell us a lot, does it? Many respondents share the same pair of answers, so their points sit exactly on top of each other. Let’s see if we can make this a bit better by using the count() command.
survey_renamed_numeric %>%
count(c_skipmaths, c_nomaths) %>%
ggplot() +
aes(x=c_nomaths, y=c_skipmaths, size = n, colour=n) +
# Create a scatter plot
geom_point() +
# Define the colour scale from Red to Blue
scale_color_distiller(palette = "RdBu") +
# Here we just delete the second legend for the size of the points - we don't need it
scale_size(guide = "none") +
theme_minimal()

This tells us a little more than before!
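To see what count() hands over to the plot, here is a minimal sketch on made-up answer pairs (toy_pairs is not the survey data). Each distinct combination becomes one row, and the new column n records how often it occurs, which is what the size and colour of the points are mapped to:

```r
library(dplyr)

# Made-up answer pairs: the combination (1, 3) occurs twice
toy_pairs <- tibble(c_skipmaths = c(1, 1, 2, 1),
                    c_nomaths   = c(3, 3, 4, 5))

toy_counts <- toy_pairs %>%
  count(c_skipmaths, c_nomaths)

toy_counts
```

Four respondents collapse into three rows here, one of them with n equal to 2.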
5.9 Correlation
survey_renamed_numeric %>%
# Now we just want to retain the columns we need!
select(starts_with("c_")) %>%
cor()
c_nomaths c_skipmaths c_uncomfortable c_examples
c_nomaths 1.00000000 0.6966001 0.46989701 0.04674316
c_skipmaths 0.69660012 1.0000000 0.40077664 0.14288471
c_uncomfortable 0.46989701 0.4007766 1.00000000 -0.07783833
c_examples 0.04674316 0.1428847 -0.07783833 1.00000000
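By default, cor() reports the Pearson correlation coefficient. As a minimal sketch on made-up numbers (x and y are illustrative, not survey variables), the value can be reproduced by hand from the deviations around the means:

```r
# Made-up numbers, only for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Pearson correlation: co-variation of x and y, scaled by their own variation
dx <- x - mean(x)
dy <- y - mean(y)
by_hand <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# The built-in function gives the same value
all.equal(by_hand, cor(x, y))

# cor() can also compute a rank-based (Spearman) coefficient,
# which some analysts prefer for ordered answer scales such as 1 to 5
cor(x, y, method = "spearman")
```

Which coefficient is more appropriate for the survey answers is a judgment call that this sketch does not settle.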
If we want to be a bit fancier, we can use some more libraries to plot correlation:
library(PerformanceAnalytics)
survey_renamed_numeric %>%
# Now we just want to retain the columns we need!
select(starts_with("c_")) %>%
chart.Correlation()
