14 HT8 Regression: Difference-in-Difference

library(tidyverse) # our main collection of functions
library(tidylog) # prints additional output from the tidyverse commands - load after tidyverse 
library(haven) # allows us to load .dta (Stata specific) files
library(here) # needed to navigate to folders and files in a project
library(skimr) # allows us to get an overview over the data quickly
library(estimatr) # to estimate robust models
library(performance) # to check our model - also install the 'see' and the 'qqplotr' package
library(magrittr) # to use the exposition operator
library(webr) # to plot the t-test outcome
library(modelsummary) # to get nice model tables

A microcredit program was implemented in randomly selected Bangladeshi villages. The program had two arms: a male microcredit program and a female microcredit program.

We will only look at the program targeted at women. Our variable of interest (outcome) is lexptot, the natural logarithm (\(\ln\)) of total household expenditure.

14.1 Load the Data

Load the dataset:

df <- read_dta(here("data","hh_98.dta"))
df %>% 
  skim
Table 14.1: Data summary
Name Piped data
Number of rows 1129
Number of columns 29
_______________________
Column type frequency:
numeric 29
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
nh 0 1 169970.45 87505.85 11054.00 93014.00 181173.00 233183.00 323103.00 ▇▆▇▇▆
year 0 1 1.00 0.00 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁
villid 0 1 2.18 0.95 1.00 1.00 2.00 3.00 4.00 ▇▇▁▇▂
thanaid 0 1 16.77 8.75 1.00 9.00 18.00 23.00 32.00 ▇▅▇▇▆
agehead 0 1 46.01 12.68 18.00 36.00 45.00 55.00 95.00 ▂▇▅▂▁
sexhead 0 1 0.91 0.29 0.00 1.00 1.00 1.00 1.00 ▁▁▁▁▇
educhead 0 1 2.32 3.48 0.00 0.00 0.00 4.00 16.00 ▇▂▁▁▁
famsize 0 1 5.30 2.21 1.00 4.00 5.00 6.00 18.00 ▇▇▂▁▁
hhland 0 1 76.83 204.02 0.20 4.00 15.00 69.00 4208.00 ▇▁▁▁▁
hhasset 0 1 155576.41 849719.93 1.00 13240.00 34990.00 109595.35 24235540.00 ▇▁▁▁▁
expfd 0 1 3660.19 1558.64 945.32 2602.07 3373.66 4232.55 15270.67 ▇▃▁▁▁
expnfd 0 1 1813.08 3316.89 89.55 514.37 865.31 1710.24 43411.15 ▇▁▁▁▁
exptot 0 1 5473.27 4140.22 1193.33 3253.76 4432.26 6038.71 47981.01 ▇▁▁▁▁
dmmfd 0 1 0.19 0.40 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
dfmfd 0 1 0.53 0.50 0.00 0.00 1.00 1.00 1.00 ▇▁▁▁▇
weight 0 1 1.00 0.82 0.12 0.47 0.65 1.39 5.00 ▇▂▁▁▁
vaccess 0 1 0.84 0.37 0.00 1.00 1.00 1.00 1.00 ▂▁▁▁▇
pcirr 0 1 0.56 0.33 0.00 0.15 0.65 0.85 0.99 ▆▁▂▆▇
rice 0 1 10.28 1.57 6.88 9.25 9.92 10.83 15.45 ▂▇▅▂▁
wheat 0 1 7.47 0.85 5.41 6.77 7.44 8.12 9.47 ▂▆▅▇▂
milk 0 1 10.90 3.38 6.77 8.12 10.83 13.53 20.30 ▇▇▃▂▁
potato 0 1 6.96 1.06 4.66 6.26 6.72 7.33 10.46 ▂▇▅▂▁
egg 0 1 1.95 0.37 1.35 1.69 2.03 2.03 2.71 ▂▇▇▂▃
oil 0 1 39.40 4.01 23.01 37.89 40.60 40.60 50.75 ▁▁▆▇▁
lexptot 0 1 8.45 0.51 7.09 8.09 8.40 8.71 10.78 ▁▇▅▁▁
lnland 0 1 0.38 0.51 0.00 0.04 0.14 0.52 3.76 ▇▂▁▁▁
vill 0 1 169.90 87.50 11.00 93.00 181.00 233.00 323.00 ▇▆▇▇▆
progvillm 0 1 0.60 0.49 0.00 0.00 1.00 1.00 1.00 ▅▁▁▁▇
progvillf 0 1 0.94 0.24 0.00 1.00 1.00 1.00 1.00 ▁▁▁▁▇
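One note on the outcome variable: lexptot is simply the natural log of exptot, since R’s log() is the natural logarithm by default. A toy check with made-up values (not rows of the data):

```r
# Toy check (made-up values): lexptot should equal log(exptot),
# where log() in R is the natural logarithm by default.
exptot  <- c(1200, 5500)
lexptot <- log(exptot)
round(lexptot, 2)   # 7.09 8.61
```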

14.2 Using a t-test, test the impact of the program

We can use a simple formula framework lexptot ~ progvillf:

df %$%
  t.test(lexptot ~ progvillf, paired = FALSE)

    Welch Two Sample t-test

data:  lexptot by progvillf
t = -1.9585, df = 74.081, p-value = 0.05394
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.261949559  0.002256459
sample estimates:
mean in group 0 mean in group 1 
       8.328525        8.458371 

To make interpretation easier, we can plot the result (remember to load the webr package):

df %$%
  t.test(lexptot ~ progvillf, paired = FALSE) %>% 
  plot()

At first glance the plot suggests a difference between program and non-program villages - but note that the p-value (0.054) is just above the conventional 5% threshold, so the effect is only marginally significant.

14.3 Obtain a similar result using a regression model

Now we obtain the same result using a linear model, lm().

df %>% 
  lm(lexptot ~ progvillf, data = .) %>% 
  summary

Call:
lm(formula = lexptot ~ progvillf, data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.37303 -0.36086 -0.05676  0.25641  2.44470 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.32852    0.06269 132.843   <2e-16 ***
progvillf    0.12985    0.06464   2.009   0.0448 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5132 on 1127 degrees of freedom
Multiple R-squared:  0.003567,  Adjusted R-squared:  0.002683 
F-statistic: 4.035 on 1 and 1127 DF,  p-value: 0.04481

Let’s use the command lm_robust() from the estimatr package to get the same result. We set se_type = "stata" to replicate Stata’s robust (HC1) standard errors; the default in lm_robust() is HC2.

df %>% 
  lm_robust(lexptot ~ progvillf, se_type = "stata" ,data = .) -> base_model
summary(base_model)

Call:
lm_robust(formula = lexptot ~ progvillf, data = ., se_type = "stata")

Standard error type:  HC1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  CI Lower CI Upper   DF
(Intercept)   8.3285    0.06398 130.167    0.000 8.2029844   8.4541 1127
progvillf     0.1298    0.06589   1.971    0.049 0.0005698   0.2591 1127

Multiple R-squared:  0.003567 , Adjusted R-squared:  0.002683 
F-statistic: 3.884 on 1 and 1127 DF,  p-value: 0.049
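For the curious, HC1 is just the classical "sandwich" variance with a small-sample scaling of n/(n-k). A base-R sketch on simulated toy data (nothing here comes from df):

```r
# Sketch of the HC1 ("stata") robust variance on simulated toy data.
set.seed(2)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
X <- cbind(1, x)                     # design matrix with intercept
fit <- lm(y ~ x)
e <- resid(fit)
n <- nrow(X); k <- ncol(X)

bread <- solve(t(X) %*% X)
meat  <- t(X) %*% (X * e^2)          # sum_i e_i^2 * x_i x_i'
vc_hc1 <- n / (n - k) * bread %*% meat %*% bread

sqrt(diag(vc_hc1))                   # HC1 standard errors (intercept, slope)
```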

14.4 Add covariates

You may want to hold a series of relevant factors constant. Run the estimation taking the covariates vaccess and pcirr into account:

df %>% 
  lm_robust(lexptot ~ progvillf + vaccess + pcirr, se_type = "stata" ,data = .) -> model_covariates

summary(model_covariates)

Call:
lm_robust(formula = lexptot ~ progvillf + vaccess + pcirr, data = ., 
    se_type = "stata")

Standard error type:  HC1 

Coefficients:
            Estimate Std. Error  t value Pr(>|t|) CI Lower CI Upper   DF
(Intercept)  8.38687    0.08006 104.7574  0.00000  8.22978 8.543951 1125
progvillf    0.11794    0.06632   1.7785  0.07560 -0.01218 0.248066 1125
vaccess     -0.07567    0.04309  -1.7561  0.07934 -0.16020 0.008874 1125
pcirr        0.02865    0.04538   0.6313  0.52797 -0.06038 0.117675 1125

Multiple R-squared:  0.006507 , Adjusted R-squared:  0.003857 
F-statistic: 2.294 on 3 and 1125 DF,  p-value: 0.07631

We can use the amazing functionality of the modelsummary package to get a quick plot of the effect sizes:

modelplot(list(base_model, model_covariates), coef_omit = "(Intercept)")

And we can also get a quick summary table of the results:

modelsummary(list(base_model, model_covariates), stars = TRUE)
Model 1 Model 2
(Intercept) 8.329*** 8.387***
(0.064) (0.080)
progvillf 0.130** 0.118*
(0.066) (0.066)
vaccess -0.076*
(0.043)
pcirr 0.029
(0.045)
Num.Obs. 1129 1129
R2 0.004 0.007
R2 Adj. 0.003 0.004
se_type HC1 HC1
* p < 0.1, ** p < 0.05, *** p < 0.01

14.5 Testing the impact of the program at the household level

We again are instructed to use a t-test:

df %$% 
  t.test(lexptot ~ dfmfd, paired = FALSE) %>% 
  plot()

14.6 Deal with the stratified survey using weighted least squares

The survey was stratified, so the observations must be weighted accordingly. We now use a regression framework, weighted least squares, to estimate the impact of the program at the household level. We do this by specifying weights.
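Before running the models, here is a tiny illustration (made-up numbers) of what the weights do: in an intercept-only regression, weighted least squares recovers the weighted mean of the outcome, so survey weights simply let each observation count according to its sampling weight:

```r
# Toy illustration (made-up numbers): an intercept-only WLS fit
# recovers the weighted mean of the outcome.
y <- c(2, 4, 6)
w <- c(1, 1, 2)                      # the third observation counts double
coef(lm(y ~ 1, weights = w))[1]      # 4.5
weighted.mean(y, w)                  # 4.5
```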

Let’s first explore the weights:

df %>% 
  ggplot(aes(x = weight)) + 
  geom_histogram() + 
  theme_minimal()

noweights <- lm_robust(lexptot ~ dfmfd, data = df, se_type = "stata")
weighted_dfmfd <- lm_robust(lexptot ~ dfmfd, weights = weight, data = df, se_type = "stata")
weighted_prgvillf <- lm_robust(lexptot ~ progvillf, weights = weight, data = df, se_type = "stata")

Let’s explore them:

modelsummary(list(noweights, weighted_dfmfd, weighted_prgvillf), stars = TRUE)
Model 1 Model 2 Model 3
(Intercept) 8.448*** 8.520*** 8.527***
(0.023) (0.029) (0.121)
dfmfd 0.005 -0.071*
(0.031) (0.037)
progvillf -0.036
(0.122)
Num.Obs. 1129 1129 1129
R2 0.000 0.004 0.000
R2 Adj. -0.001 0.003 -0.001
se_type HC1 HC1 HC1
* p < 0.1, ** p < 0.05, *** p < 0.01

14.7 Now we add the covariates again

full_model <- lm_robust(lexptot ~ dfmfd + vaccess + pcirr, se_type = "stata", data = df)
full_model_weighted <- lm_robust(lexptot ~ dfmfd + vaccess + pcirr, se_type = "stata", data = df, weights = weight)

# We also want to give our models names
modellist <- list(full_model, full_model_weighted)
names(modellist) <- c("Full Model without Weights","Full Model with Weights")
modelsummary(modellist, stars = TRUE)
Full Model without Weights Full Model with Weights
(Intercept) 8.504*** 8.614***
(0.047) (0.056)
dfmfd 0.002 -0.072*
(0.031) (0.037)
vaccess -0.084* -0.123**
(0.043) (0.055)
pcirr 0.027 0.014
(0.045) (0.061)
Num.Obs. 1129 1129
R2 0.004 0.012
R2 Adj. 0.001 0.009
se_type HC1 HC1
* p < 0.1, ** p < 0.05, *** p < 0.01

We now see that the coefficient on dfmfd changes considerably once the survey weights are applied: it flips sign and becomes marginally significant.

modelplot(modellist, coef_omit = "(Intercept)")

14.8 Now we acquire new data

This dataset adds the 1991 baseline survey for the same region, which will allow us to run a difference-in-difference model.

Let’s load the new data:

df_new <- read_dta(here("data","hh_9198.dta"))
df_new %>% 
  skim
Table 14.2: Data summary
Name Piped data
Number of rows 1652
Number of columns 26
_______________________
Column type frequency:
numeric 26
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
nh 0 1 153219.63 81544.31 11054.00 82035.00 153087.00 222103.00 293154.00 ▆▇▆▇▆
year 0 1 0.50 0.50 0.00 0.00 0.50 1.00 1.00 ▇▁▁▁▇
villid 0 1 2.02 0.82 1.00 1.00 2.00 3.00 3.00 ▇▁▇▁▇
thanaid 0 1 15.11 8.15 1.00 8.00 15.00 22.00 29.00 ▆▇▆▇▆
agehead 0 1 43.37 12.72 18.00 34.00 42.00 52.00 95.00 ▅▇▅▂▁
sexhead 0 1 0.93 0.26 0.00 1.00 1.00 1.00 1.00 ▁▁▁▁▇
educhead 0 1 2.32 3.37 0.00 0.00 0.00 4.00 16.00 ▇▂▁▁▁
famsize 0 1 5.30 2.24 1.00 4.00 5.00 6.00 19.00 ▆▇▁▁▁
hhland 0 1 79.48 237.38 0.20 4.00 14.00 65.00 4618.00 ▇▁▁▁▁
hhasset 0 1 125535.49 710461.35 1.00 11861.00 30350.00 91551.25 24235540.00 ▇▁▁▁▁
expfd 0 1 3409.38 1326.88 1085.90 2533.39 3137.55 3925.82 15270.67 ▇▂▁▁▁
expnfd 0 1 1449.59 2807.25 62.80 436.29 706.23 1343.89 43411.15 ▇▁▁▁▁
exptot 0 1 4858.97 3538.51 1193.33 3110.72 3989.23 5377.24 47981.01 ▇▁▁▁▁
dmmfd 0 1 0.21 0.41 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
dfmfd 0 1 0.44 0.50 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▆
weight 0 1 1.00 0.82 0.13 0.47 0.66 1.36 5.00 ▇▂▁▁▁
vaccess 0 1 0.90 0.31 0.00 1.00 1.00 1.00 1.00 ▁▁▁▁▇
pcirr 0 1 0.51 0.33 0.00 0.20 0.60 0.80 0.99 ▇▃▃▇▇
rice 0 1 10.04 1.38 6.88 9.02 10.00 10.60 15.45 ▂▇▃▁▁
wheat 0 1 8.04 1.28 1.00 7.00 8.12 9.00 11.00 ▁▁▃▇▂
milk 0 1 11.48 3.30 1.00 9.47 10.83 13.53 20.30 ▁▃▇▃▁
potato 0 1 7.84 1.31 4.66 6.87 8.00 9.00 12.00 ▂▆▇▂▁
egg 0 1 2.12 0.41 1.35 1.69 2.03 2.50 3.00 ▂▇▆▆▂
oil 0 1 46.73 8.36 23.01 40.60 48.00 55.00 64.00 ▁▆▇▇▆
lexptot 0 1 8.35 0.47 7.09 8.04 8.29 8.59 10.78 ▁▇▃▁▁
dmmfd98 0 1 0.53 0.50 0.00 0.00 1.00 1.00 1.00 ▇▁▁▁▇

In the data, we now have a variable called dmmfd98, a dummy variable that takes the value 1 if the household participated in the female microcredit program in 1998. The outcome we are interested in is still the same (lexptot).

Technically we now don’t need the old data frame df anymore - but as you can see, R can easily hold both data frames in memory at the same time.

Now we will run our diff-in-diff model. As a reminder, the model in this context will look like this:

\[ Outcome_{i,t} = \beta_0 + \beta_1 Treatment_{i,t} + \beta_2 Time_{i,t} + \beta_3 Treatment_{i,t} \times Time_{i,t} + \varepsilon_{i,t}\\ lexptot_{i,t} = \beta_0 + \beta_1 dmmfd98_{i,t} + \beta_2 year_{i,t} + \beta_3 dmmfd98_{i,t} \times year_{i,t} + \varepsilon_{i,t} \] where \(i\) stands for the individual household and \(t\) stands for either 1991 or 1998.

To estimate this, we turn to lm_robust() again. To estimate \(\beta_3\) we need the interaction term, which we can create in several ways: build it manually, e.g. with df_new %>% mutate(DiD = year * dmmfd98); use the colon symbol : to let R create just the interaction term; use * to include the interaction plus both main effects; or wrap the product in I(). All models below produce the same result:

lm_robust(lexptot ~ dmmfd98 * year, se_type = "stata", data = df_new)
lm_robust(lexptot ~ dmmfd98 + year + I(dmmfd98 * year), se_type = "stata", data = df_new)
lm_robust(lexptot ~ dmmfd98 + year + dmmfd98:year, se_type = "stata", data = df_new)

14.9 Let’s look at the treatment effect

Let’s save the first one and also estimate one model where we use the stratified weights:

did_model <- lm_robust(lexptot ~ dmmfd98 * year, se_type = "stata", data = df_new)
did_model_weighted <- lm_robust(lexptot ~ dmmfd98 * year, se_type = "stata", data = df_new, weights = weight)

The coefficient dmmfd98:year (or, in the table below, dmmfd98 × year) now corresponds to the treatment effect. Why? Because the term year controls for anything that changed for all units between 1991 and 1998 - e.g. living conditions may have improved everywhere for reasons unrelated to the program. At the same time, the dmmfd98 term absorbs any constant difference between households that participated in the program and those that did not.

So it’s the combination of both that we’re interested in: how the outcome changed from 1991 to 1998 for participating households, compared to how it changed for non-participating households - which is exactly what the dmmfd98:year coefficient gives us.
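To see this mechanically, here is a small simulation (not df_new - all variable names and numbers below are made up) confirming that the interaction coefficient in a saturated two-by-two model equals the "difference in differences" of the four group means:

```r
# Sanity check on simulated data: the interaction coefficient in
# lm(y ~ treat * post) equals the double difference of the group means.
set.seed(1)
n <- 400
treat <- rep(c(0, 1), each = n / 2)
post  <- rep(c(0, 1), times = n / 2)
y <- 1 + 0.5 * treat + 0.3 * post + 0.2 * treat * post + rnorm(n, sd = 0.1)

m <- tapply(y, list(treat, post), mean)       # 2x2 table of group means
did_means <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])
did_lm    <- coef(lm(y ~ treat * post))["treat:post"]

all.equal(unname(did_lm), unname(did_means))  # TRUE
```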

modellist <- list(
  "DiD Model OLS" = did_model, 
  "DiD Model WLS" = did_model_weighted
)
modelsummary(modellist, stars = TRUE)
DiD Model OLS DiD Model WLS
(Intercept) 8.310*** 8.380***
(0.022) (0.031)
dmmfd98 -0.115*** -0.174***
(0.027) (0.037)
year 0.147*** 0.157***
(0.036) (0.048)
dmmfd98 × year 0.111** 0.080
(0.046) (0.059)
Num.Obs. 1652 1652
R2 0.055 0.054
R2 Adj. 0.054 0.052
se_type HC1 HC1
* p < 0.1, ** p < 0.05, *** p < 0.01

Here you can also see the importance of accounting for a stratified sample: if we use the OLS results (i.e. without using weights), we find a statistically significant DiD effect - when we use Weighted Least Squares (WLS, i.e. with weights for our stratified sample), the significance disappears.

An important thing to note: be careful with DiD models that are not estimated with OLS, for example Logit or Probit models (which we don’t cover in this course). These models rely on Maximum Likelihood Estimation, and the DiD interaction estimates they produce are not always valid.

14.10 We now control for other factors again

If we want to control for other aspects and factors, we simply add more covariates to the model the way we always do - so nothing new or surprising there:

did_model_control <- lm_robust(lexptot ~ dmmfd98 * year + agehead + sexhead + educhead + famsize, 
                               se_type = "stata", data = df_new)
did_model_control_weighted <- lm_robust(lexptot ~ dmmfd98 * year + agehead + sexhead + educhead + famsize, 
                                        se_type = "stata", data = df_new, weights = weight)

Let’s consider all models together and polish the table a little:

modellist <- list("DiD OLS Base" = did_model,
                  "DiD OLS Full" = did_model_control,
                  "DiD WLS Base" = did_model_weighted,
                  "DiD WLS Full" = did_model_control_weighted)
modelsummary(modellist, stars = TRUE)
DiD OLS Base DiD OLS Full DiD WLS Base DiD WLS Full
(Intercept) 8.310*** 8.186*** 8.380*** 8.129***
(0.022) (0.068) (0.031) (0.087)
dmmfd98 -0.115*** -0.079*** -0.174*** -0.109***
(0.027) (0.026) (0.037) (0.035)
year 0.147*** 0.114*** 0.157*** 0.125***
(0.036) (0.033) (0.048) (0.044)
dmmfd98 × year 0.111** 0.120*** 0.080 0.088
(0.046) (0.043) (0.059) (0.055)
agehead 0.005*** 0.005***
(0.001) (0.001)
sexhead -0.039 0.001
(0.052) (0.059)
educhead 0.047*** 0.052***
(0.003) (0.005)
famsize -0.030*** -0.023***
(0.006) (0.007)
Num.Obs. 1652 1652 1652 1652
R2 0.055 0.183 0.054 0.202
R2 Adj. 0.054 0.180 0.052 0.198
se_type HC1 HC1 HC1 HC1
* p < 0.1, ** p < 0.05, *** p < 0.01

We can also consider this as a plot (we exclude the Intercept again for stylistic reasons):

modelplot(modellist, coef_omit = "(Intercept)")

Want to spice up the plot a bit? I’d recommend at least adding a line for 0 to see more clearly whether the confidence intervals indicate statistical significance.

To do this, we simply use the command from before and set draw = FALSE. This gives us a tidy data frame that we can feed into ggplot:

df_to_plot <- modelplot(modellist, coef_omit = "(Intercept)", draw = FALSE)
df_to_plot
             term        model     estimate     conf.low    conf.high
10        famsize DiD OLS Full -0.030046487 -0.041319929 -0.018773045
20        famsize DiD WLS Full -0.022983816 -0.037241340 -0.008726292
9        educhead DiD OLS Full  0.047092625  0.040247910  0.053937340
19       educhead DiD WLS Full  0.051732776  0.042484965  0.060980587
8         sexhead DiD OLS Full -0.039319930 -0.141745809  0.063105949
18        sexhead DiD WLS Full  0.001164941 -0.114693496  0.117023378
7         agehead DiD OLS Full  0.004768854  0.002917701  0.006620007
17        agehead DiD WLS Full  0.005215859  0.002579159  0.007852559
3  dmmfd98 × year DiD OLS Base  0.111376418  0.022025210  0.200727625
6  dmmfd98 × year DiD OLS Full  0.120322707  0.036822362  0.203823051
13 dmmfd98 × year DiD WLS Base  0.079545825 -0.035463339  0.194554989
16 dmmfd98 × year DiD WLS Full  0.087562324 -0.020585254  0.195709901
2            year DiD OLS Base  0.147318771  0.077140815  0.217496726
5            year DiD OLS Full  0.114014613  0.050011762  0.178017464
12           year DiD WLS Base  0.157108221  0.062532794  0.251683649
15           year DiD WLS Full  0.124888765  0.038095784  0.211681747
1         dmmfd98 DiD OLS Base -0.114567069 -0.167072070 -0.062062069
4         dmmfd98 DiD OLS Full -0.079428734 -0.129502314 -0.029355154
11        dmmfd98 DiD WLS Base -0.173806804 -0.246657990 -0.100955618
14        dmmfd98 DiD WLS Full -0.108696635 -0.178211509 -0.039181761

Now with a bit of ggplot magic, we get this:

df_to_plot %>% 
  ggplot() + 
  aes(y = term, color = model) + 
  geom_vline(aes(xintercept = 0)) + # adding a vertical line at 0 to see significance more clearly
  
  # the command below adds a point with confidence intervals. The position dodge command ensure they don't overlap
  geom_pointrange(aes(y = term, x = estimate, xmin = conf.low, xmax = conf.high), 
                  position = position_dodge(width = .5)) + 
  
  # some styling
  scale_color_viridis_d(name = "Model:") + 
  theme_minimal() + 
  scale_y_discrete(labels = function(x){str_to_title(x)}) + # just a small command to format the y axis labels
  labs(x = "Estimate", y = "Coefficient", title = "Difference-in-Difference Estimates using different Models")+
  theme(legend.position = "bottom")

14.11 Introducing Fixed Effects

Now we introduce a new concept: so-called fixed effects. Conceptually, fixed effects control for anything that is invariant along a certain dimension, i.e. that does not change. Most often, fixed effects are specified along one of two dimensions: a certain group or a certain time.

In this case, we want to control for where a certain household is - the reasoning for this is that households in one village might have a fixed difference to households of another village - no matter whether we look at them in 1991 or 1998. So what we want to do is to add each of those villages in as a dummy variable - which is what fixed effects essentially are: dummy variables for each different unit (e.g. a dummy_1 that is 1 for village 1 and 0 otherwise, a dummy_2 that is 1 for village 2 and 0 otherwise, etc.).
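To see what R does under the hood, here is a minimal sketch (with a hypothetical village variable, not our data) of how a factor expands into exactly those dummy columns:

```r
# Minimal sketch: a factor with three villages expands into two dummy
# columns plus the intercept - village 1 is absorbed as the reference level.
village <- factor(c(1, 1, 2, 2, 3, 3))
model.matrix(~ village)
# columns: (Intercept), village2, village3
```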

Conceptually, each fixed effect gives a village a unique intercept. So while before we only had one intercept that was valid for all, we now have an intercept for each unit that we look at.

Here, the level of control we introduce is the ‘Thana’ level, which is an administrative unit (in our data this is an id for each Thana called thanaid). In effect, we control for constant differences across administrative units; we are less interested in the coefficients of the individual fixed effects than in the other coefficients, once these differences have been controlled for.

Let’s look at this in practice: We can do this by simply adding the thanaid as a factor using as.factor(thanaid) to the formula equation. We can also create this factor beforehand, e.g. using df_new %>% mutate(thanaid = as.factor(thanaid)), which would give us the same answer.

lm_robust(lexptot ~ dmmfd98 * year + agehead + sexhead + educhead + famsize + as.factor(thanaid), se_type = "stata", data = df_new)
                         Estimate   Std. Error    t value     Pr(>|t|)
(Intercept)           8.258290403 0.0924886327 89.2897879 0.000000e+00
dmmfd98              -0.103139633 0.0257629947 -4.0034024 6.526685e-05
year                  0.112027938 0.0313610205  3.5722032 3.643696e-04
agehead               0.004910911 0.0009125816  5.3813389 8.478655e-08
sexhead              -0.005062933 0.0476536388 -0.1062444 9.154016e-01
educhead              0.046298490 0.0033846348 13.6790208 2.192249e-40
famsize              -0.036656569 0.0056991788 -6.4319037 1.655451e-10
as.factor(thanaid)2  -0.275276021 0.0955273872 -2.8816450 4.008292e-03
as.factor(thanaid)3   0.314425667 0.0943580708  3.3322604 8.808829e-04
as.factor(thanaid)4  -0.128139628 0.0845607580 -1.5153557 1.298779e-01
as.factor(thanaid)5  -0.059071425 0.0879975240 -0.6712851 5.021349e-01
as.factor(thanaid)6   0.031807249 0.0869995106  0.3656026 7.147095e-01
as.factor(thanaid)7   0.009957096 0.0808168628  0.1232057 9.019596e-01
as.factor(thanaid)8  -0.090852943 0.0785979620 -1.1559198 2.478848e-01
as.factor(thanaid)9  -0.196714744 0.0843840522 -2.3311839 1.986609e-02
as.factor(thanaid)10 -0.133555038 0.0746956980 -1.7879883 7.396519e-02
as.factor(thanaid)11  0.009294655 0.0852636245  0.1090108 9.132075e-01
as.factor(thanaid)12 -0.073805329 0.0860043005 -0.8581586 3.909322e-01
as.factor(thanaid)13 -0.054536910 0.0930646520 -0.5860110 5.579500e-01
as.factor(thanaid)14 -0.241798795 0.0782476325 -3.0901739 2.034520e-03
as.factor(thanaid)15  0.088865055 0.0804657240  1.1043840 2.695910e-01
as.factor(thanaid)16 -0.178674773 0.0871452885 -2.0503090 4.049531e-02
as.factor(thanaid)17 -0.094422861 0.0771446843 -1.2239711 2.211416e-01
as.factor(thanaid)18 -0.110108822 0.0840424527 -1.3101572 1.903289e-01
as.factor(thanaid)19 -0.210815176 0.0801843340 -2.6291317 8.641497e-03
as.factor(thanaid)20 -0.206317682 0.0791537469 -2.6065435 9.230220e-03
as.factor(thanaid)21  0.039054598 0.0832697017  0.4690133 6.391234e-01
as.factor(thanaid)22 -0.098483055 0.0777047548 -1.2674006 2.051947e-01
as.factor(thanaid)23  0.269307630 0.0844652270  3.1883846 1.458104e-03
as.factor(thanaid)24  0.103105800 0.1031030518  1.0000267 3.174473e-01
as.factor(thanaid)25 -0.070302314 0.0830918769 -0.8460793 3.976338e-01
as.factor(thanaid)26 -0.143039643 0.0861495689 -1.6603640 9.703525e-02
as.factor(thanaid)27 -0.053292191 0.0951226960 -0.5602469 5.753887e-01
as.factor(thanaid)28 -0.071077036 0.0846062397 -0.8400921 4.009810e-01
as.factor(thanaid)29 -0.307561242 0.0764550612 -4.0227715 6.017220e-05
dmmfd98:year          0.121585103 0.0405631643  2.9974265 2.764219e-03
                         CI Lower     CI Upper   DF
(Intercept)           8.076880141  8.439700664 1616
dmmfd98              -0.153672023 -0.052607244 1616
year                  0.050515396  0.173540480 1616
agehead               0.003120943  0.006700879 1616
sexhead              -0.098532355  0.088406489 1616
educhead              0.039659755  0.052937224 1616
famsize              -0.047835127 -0.025478011 1616
as.factor(thanaid)2  -0.462646596 -0.087905446 1616
as.factor(thanaid)3   0.129348628  0.499502706 1616
as.factor(thanaid)4  -0.293999894  0.037720638 1616
as.factor(thanaid)5  -0.231672678  0.113529827 1616
as.factor(thanaid)6  -0.138836467  0.202450965 1616
as.factor(thanaid)7  -0.148559770  0.168473962 1616
as.factor(thanaid)8  -0.245017584  0.063311697 1616
as.factor(thanaid)9  -0.362228413 -0.031201075 1616
as.factor(thanaid)10 -0.280065649  0.012955574 1616
as.factor(thanaid)11 -0.157944237  0.176533546 1616
as.factor(thanaid)12 -0.242497007  0.094886349 1616
as.factor(thanaid)13 -0.237076994  0.128003175 1616
as.factor(thanaid)14 -0.395276288 -0.088321302 1616
as.factor(thanaid)15 -0.068963076  0.246693186 1616
as.factor(thanaid)16 -0.349604423 -0.007745124 1616
as.factor(thanaid)17 -0.245736995  0.056891273 1616
as.factor(thanaid)18 -0.274952467  0.054734822 1616
as.factor(thanaid)19 -0.368091380 -0.053538973 1616
as.factor(thanaid)20 -0.361572458 -0.051062906 1616
as.factor(thanaid)21 -0.124273348  0.202382543 1616
as.factor(thanaid)22 -0.250895730  0.053929619 1616
as.factor(thanaid)23  0.103634742  0.434980519 1616
as.factor(thanaid)24 -0.099123934  0.305335534 1616
as.factor(thanaid)25 -0.233281468  0.092676840 1616
as.factor(thanaid)26 -0.312016255  0.025936969 1616
as.factor(thanaid)27 -0.239868991  0.133284609 1616
as.factor(thanaid)28 -0.237026511  0.094872439 1616
as.factor(thanaid)29 -0.457522727 -0.157599758 1616
dmmfd98:year          0.042023171  0.201147034 1616

As you can see, each administrative unit now gets a separate coefficient - note that as.factor(thanaid)1 is missing. This is because we already have an intercept, so the first thana serves as the reference category. If you repeat the above without an intercept (by adding -1 to the formula), you get one more fixed-effect coefficient, while all the other estimates stay exactly the same: lm_robust(lexptot ~ dmmfd98 * year + agehead + sexhead + educhead + famsize + as.factor(thanaid) - 1, se_type = "stata", data = df_new)

As mentioned, more often than not we don’t actually care about the individual fixed effects, so we can instead pass them to the dedicated fixed_effects argument of lm_robust(), which expects a one-sided formula:

did_model_control_FE <- lm_robust(lexptot ~ dmmfd98 * year + agehead + sexhead + educhead + famsize, 
                                  se_type = "stata", 
                                  fixed_effects = ~ thanaid,
                                  data = df_new)
did_model_control_FE
                 Estimate   Std. Error    t value     Pr(>|t|)
dmmfd98      -0.103139633 0.0257629947 -4.0034024 6.526685e-05
year          0.112027938 0.0313610205  3.5722032 3.643696e-04
agehead       0.004910911 0.0009125816  5.3813389 8.478655e-08
sexhead      -0.005062933 0.0476536388 -0.1062444 9.154016e-01
educhead      0.046298490 0.0033846348 13.6790208 2.192249e-40
famsize      -0.036656569 0.0056991788 -6.4319037 1.655451e-10
dmmfd98:year  0.121585103 0.0405631643  2.9974265 2.764219e-03
                 CI Lower     CI Upper   DF
dmmfd98      -0.153672023 -0.052607244 1616
year          0.050515396  0.173540480 1616
agehead       0.003120943  0.006700879 1616
sexhead      -0.098532355  0.088406489 1616
educhead      0.039659755  0.052937224 1616
famsize      -0.047835127 -0.025478011 1616
dmmfd98:year  0.042023171  0.201147034 1616

If you compare these coefficients with the ones above, you’ll see that all estimates are exactly the same! Let’s quickly do the same for our weighted least squares:

did_model_control_FE_weighted <- lm_robust(lexptot ~ dmmfd98 * year + agehead + sexhead + educhead + famsize, 
                                           se_type = "stata", 
                                           fixed_effects = ~ thanaid,
                                           weights = weight,
                                           data = df_new)

And consider all estimates together again to compare them:

modellist <- list(
  "DiD OLS Base" = did_model,
  "DiD OLS Full" = did_model_control,
  "DiD OLS Full Fixed Effects" = did_model_control_FE,
  "DiD WLS Base" = did_model_weighted,
  "DiD WLS Full" = did_model_control_weighted,
  "DiD WLS Full Fixed Effects" = did_model_control_FE_weighted
)
modelsummary(modellist, stars = TRUE, add_rows = data.frame("Fixed Effects", "No","No","Yes","No","No","Yes"))
DiD OLS Base DiD OLS Full DiD OLS Full Fixed Effects DiD WLS Base DiD WLS Full DiD WLS Full Fixed Effects
(Intercept) 8.310*** 8.186*** 8.380*** 8.129***
(0.022) (0.068) (0.031) (0.087)
dmmfd98 -0.115*** -0.079*** -0.103*** -0.174*** -0.109*** -0.129***
(0.027) (0.026) (0.026) (0.037) (0.035) (0.034)
year 0.147*** 0.114*** 0.112*** 0.157*** 0.125*** 0.118***
(0.036) (0.033) (0.031) (0.048) (0.044) (0.041)
dmmfd98 × year 0.111** 0.120*** 0.122*** 0.080 0.088 0.092*
(0.046) (0.043) (0.041) (0.059) (0.055) (0.052)
agehead 0.005*** 0.005*** 0.005*** 0.005***
(0.001) (0.001) (0.001) (0.001)
sexhead -0.039 -0.005 0.001 0.008
(0.052) (0.048) (0.059) (0.053)
educhead 0.047*** 0.046*** 0.052*** 0.051***
(0.003) (0.003) (0.005) (0.004)
famsize -0.030*** -0.037*** -0.023*** -0.031***
(0.006) (0.006) (0.007) (0.007)
Num.Obs. 1652 1652 1652 1652 1652 1652
R2 0.055 0.183 0.273 0.054 0.202 0.297
R2 Adj. 0.054 0.180 0.257 0.052 0.198 0.282
se_type HC1 HC1 HC1 HC1 HC1 HC1
Fixed Effects No No Yes No No Yes
* p < 0.1, ** p < 0.05, *** p < 0.01

And our plot again:

modelplot(modellist, draw = FALSE, coef_omit = "(Intercept)") %>% 

  ggplot() + 
  aes(y = term, color = model) + 
  geom_vline(aes(xintercept = 0)) + # adding a vertical line at 0 to see significance more clearly
  
  # the command below adds a point with confidence intervals. The position dodge command ensure they don't overlap
  geom_pointrange(aes(y = term, x = estimate, xmin = conf.low, xmax = conf.high), 
                  position = position_dodge(width = .5)) + 
  
  # some styling
  scale_color_viridis_d(name = "Model:") + 
  theme_minimal() + 
  scale_y_discrete(labels = function(x){str_to_title(x)}) + # just a small command to format the y axis labels
  labs(x = "Estimate", y = "Coefficient", title = "Difference-in-Difference Estimates using different Models")+
  theme(legend.position = "bottom")

And that’s it - this was a first taste of regression estimation in R.