8 HT2 Sampling

library(tidyverse) # our main collection of functions
library(tidylog) # prints additional output from the tidyverse commands - load after tidyverse 
library(haven) # allows us to load .dta (Stata specific) files
library(here) # needed to navigate to folders and files in a project

8.1 Load the Data

Load the dataset:

df <- read_dta(here("data","PS1.dta"))

8.2 Sample

We sample 5% of the values using sample_frac():

df %>% 
  sample_frac(0.05)
sample_frac: removed 183 rows (95%), 10 rows remaining
# A tibble: 10 x 6
   cname           ccodealp bl_asy25mf        chga_demo gle_rgdpc wdi_trade
   <chr>           <chr>         <dbl>        <dbl+lbl>     <dbl>     <dbl>
 1 Zambia          ZMB            6.49 0 [0. Dictators~     1690.      81.7
 2 Switzerland     CHE           10.3  1 [1. Democracy]    44426.      92.7
 3 Mauritania      MRT            3.73 0 [0. Dictators~     1981.     126. 
 4 Cambodia        KHM            4.02 0 [0. Dictators~     1764.     114. 
 5 Suriname        SUR           NA    1 [1. Democracy]     5606.      NA  
 6 Antigua and Ba~ ATG           NA    1 [1. Democracy]    11203.     104. 
 7 Uganda          UGA            4.73 0 [0. Dictators~     1194.      53.9
 8 Mongolia        MNG            8.32 1 [1. Democracy]     4802.     117. 
 9 Gabon           GAB            7.43 0 [0. Dictators~    10921.      85.6
10 Iceland         ISL           10.4  1 [1. Democracy]    28660.     103. 

We can use the familiar summarise() command to get the mean.

df %>% 
  sample_frac(0.05) %>% 
  summarise(gdp_mean = mean(gle_rgdpc, na.rm=TRUE))
sample_frac: removed 183 rows (95%), 10 rows remaining
summarise: now one row and one column, ungrouped
# A tibble: 1 x 1
  gdp_mean
     <dbl>
1   11245.

8.3 Loop

We then add the row to an external object, a dataframe. We create three dataframes for all three sample sizes:

sample_5pc <- tibble()
sample_25pc <- tibble()
sample_50pc <- tibble()

And then we can create a loop:

for(i in 1:5){
  df %>% 
    sample_frac(0.05) %>% 
    summarise(gdp_mean_005 = mean(gle_rgdpc, na.rm=TRUE)) %>% 
    mutate(sample = i) %>% 
    relocate(sample) %>% 
    bind_rows(sample_5pc,.) -> sample_5pc
  
  df %>% 
    sample_frac(0.25) %>% 
    summarise(gdp_mean_025 = mean(gle_rgdpc, na.rm=TRUE)) %>% 
    mutate(sample = i) %>% 
    relocate(sample) %>% 
    bind_rows(sample_25pc,.) -> sample_25pc
  
  df %>% 
    sample_frac(0.5) %>% 
    summarise(gdp_mean_05 = mean(gle_rgdpc, na.rm=TRUE)) %>% 
    mutate(sample = i) %>% 
    relocate(sample) %>% 
    bind_rows(sample_50pc,.) -> sample_50pc
}

And then let’s visualise this as a table:

sample_5pc %>% 
  
  full_join(sample_25pc, by = "sample") %>% 
  full_join(sample_50pc, by = "sample") -> all_samples
all_samples
# A tibble: 5 x 4
  sample gdp_mean_005 gdp_mean_025 gdp_mean_05
   <int>        <dbl>        <dbl>       <dbl>
1      1       14320.       11352.      14796.
2      2        6001.       13208.      13547.
3      3        9461.       11731.      13419.
4      4       12419.       15617.      10642.
5      5       11693.       12012.      13696.

Let’s display the means and standard deviation by sample size:

all_samples %>% 
  pivot_longer(-sample) %>% 
  group_by(name) %>% 
  summarise(mean = mean(value),
            sd = sd(value))
pivot_longer: reorganized (gdp_mean_005, gdp_mean_025, gdp_mean_05) into (name, value) [was 5x4, now 15x3]
group_by: one grouping variable (name)
summarise: now 3 rows and 3 columns, ungrouped
# A tibble: 3 x 3
  name           mean    sd
  <chr>         <dbl> <dbl>
1 gdp_mean_005 10779. 3187.
2 gdp_mean_025 12784. 1729.
3 gdp_mean_05  13220. 1541.

8.4 Visualising

Let’s now repeat this - but rather than 5 replications, I will now use 50 to plot a histogram:

# A tibble: 50 x 4
   sample gdp_mean_005 gdp_mean_025 gdp_mean_05
    <int>        <dbl>        <dbl>       <dbl>
 1      1        3080.       17162.      14032.
 2      2        8381.       10599.      12172.
 3      3       19040.       17071.      15076.
 4      4        3247.       13024.      13157.
 5      5        7010.       11271.      11022.
 6      6       14195.       12389.      13648.
 7      7       12489.       12197.      12967.
 8      8       18581.       15404.      12514.
 9      9       23673.       13487.      12823.
10     10        8718.       12792.      11949.
# ... with 40 more rows
all_samples %>% 
  pivot_longer(-sample) %>% 
  ggplot() + 
  aes(x = value, fill = name, group=name) + 
  geom_density(alpha = 0.5) + 
  theme_minimal()