8 HT2 Sampling
library(tidyverse) # our main collection of functions
library(tidylog) # prints additional output from the tidyverse commands - load after tidyverse
library(haven) # allows us to load .dta (Stata specific) files
library(here) # needed to navigate to folders and files in a project
8.2 Sample
We sample 5% of the values using sample_frac()
:
df %>%
sample_frac(0.05)
sample_frac: removed 183 rows (95%), 10 rows remaining
# A tibble: 10 x 6
cname ccodealp bl_asy25mf chga_demo gle_rgdpc wdi_trade
<chr> <chr> <dbl> <dbl+lbl> <dbl> <dbl>
1 Zambia ZMB 6.49 0 [0. Dictators~ 1690. 81.7
2 Switzerland CHE 10.3 1 [1. Democracy] 44426. 92.7
3 Mauritania MRT 3.73 0 [0. Dictators~ 1981. 126.
4 Cambodia KHM 4.02 0 [0. Dictators~ 1764. 114.
5 Suriname SUR NA 1 [1. Democracy] 5606. NA
6 Antigua and Ba~ ATG NA 1 [1. Democracy] 11203. 104.
7 Uganda UGA 4.73 0 [0. Dictators~ 1194. 53.9
8 Mongolia MNG 8.32 1 [1. Democracy] 4802. 117.
9 Gabon GAB 7.43 0 [0. Dictators~ 10921. 85.6
10 Iceland ISL 10.4 1 [1. Democracy] 28660. 103.
We can use the familiar summarise()
command to get the mean.
df %>%
sample_frac(0.05) %>%
summarise(gdp_mean = mean(gle_rgdpc, na.rm=TRUE))
sample_frac: removed 183 rows (95%), 10 rows remaining
summarise: now one row and one column, ungrouped
# A tibble: 1 x 1
gdp_mean
<dbl>
1 11245.
8.3 Loop
We then add the row to an external object, a dataframe. We create three dataframes for all three sample sizes:
sample_5pc <- tibble()
sample_25pc <- tibble()
sample_50pc <- tibble()
And then we can create a loop:
for(i in 1:5){
df %>%
sample_frac(0.05) %>%
summarise(gdp_mean_005 = mean(gle_rgdpc, na.rm=TRUE)) %>%
mutate(sample = i) %>%
relocate(sample) %>%
bind_rows(sample_5pc,.) -> sample_5pc
df %>%
sample_frac(0.25) %>%
summarise(gdp_mean_025 = mean(gle_rgdpc, na.rm=TRUE)) %>%
mutate(sample = i) %>%
relocate(sample) %>%
bind_rows(sample_25pc,.) -> sample_25pc
df %>%
sample_frac(0.5) %>%
summarise(gdp_mean_05 = mean(gle_rgdpc, na.rm=TRUE)) %>%
mutate(sample = i) %>%
relocate(sample) %>%
bind_rows(sample_50pc,.) -> sample_50pc
}
And then let’s visualise this as a table:
sample_5pc %>%
full_join(sample_25pc, by = "sample") %>%
full_join(sample_50pc, by = "sample") -> all_samples
all_samples
# A tibble: 5 x 4
sample gdp_mean_005 gdp_mean_025 gdp_mean_05
<int> <dbl> <dbl> <dbl>
1 1 14320. 11352. 14796.
2 2 6001. 13208. 13547.
3 3 9461. 11731. 13419.
4 4 12419. 15617. 10642.
5 5 11693. 12012. 13696.
Let’s display the means and standard deviation by sample size:
all_samples %>%
pivot_longer(-sample) %>%
group_by(name) %>%
summarise(mean = mean(value),
sd = sd(value))
pivot_longer: reorganized (gdp_mean_005, gdp_mean_025, gdp_mean_05) into (name, value) [was 5x4, now 15x3]
group_by: one grouping variable (name)
summarise: now 3 rows and 3 columns, ungrouped
# A tibble: 3 x 3
name mean sd
<chr> <dbl> <dbl>
1 gdp_mean_005 10779. 3187.
2 gdp_mean_025 12784. 1729.
3 gdp_mean_05 13220. 1541.
8.4 Visualising
Let’s now repeat this - but rather than 5 replications, I will now use 50 to plot a histogram:
# A tibble: 50 x 4
sample gdp_mean_005 gdp_mean_025 gdp_mean_05
<int> <dbl> <dbl> <dbl>
1 1 3080. 17162. 14032.
2 2 8381. 10599. 12172.
3 3 19040. 17071. 15076.
4 4 3247. 13024. 13157.
5 5 7010. 11271. 11022.
6 6 14195. 12389. 13648.
7 7 12489. 12197. 12967.
8 8 18581. 15404. 12514.
9 9 23673. 13487. 12823.
10 10 8718. 12792. 11949.
# ... with 40 more rows
all_samples %>%
pivot_longer(-sample) %>%
ggplot() +
aes(x = value, fill = name, group=name) +
geom_density(alpha = 0.5) +
theme_minimal()
