13 HT7 Regression: IV Estimation

library(tidyverse) # our main collection of functions
library(tidylog) # prints additional output from the tidyverse commands - load after tidyverse 
library(haven) # allows us to load .dta (Stata specific) files
library(here) # needed to navigate to folders and files in a project
library(skimr) # allows us to get a quick overview of the data
library(lmtest) # for robust standard errors
library(sandwich) # for robust standard errors (named this way because of the "Sandwich Estimator")
library(performance) # to check our model - also install the 'see' and 'qqplotr' packages
library(ivmodel) # to estimate Instrumental Variable models

13.1 Load the Data

Load the dataset:

df <- read_dta(here("data","Nepal.dta"))
df %>% 
  skim()
Table 13.1: Data summary
Name Piped data
Number of rows 414
Number of columns 25
_______________________
Column type frequency:
character 2
numeric 23
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
belt 0 1 1 1 0 3 0
region 0 1 1 1 0 5 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
WWWHH 0 1 14027.76 7723.51 401.00 8103.50 13008.50 20755.50 26914.00 ▆▇▇▅▇
nkilled_dis 0 1 0.26 0.33 0.01 0.05 0.14 0.33 1.98 ▇▁▁▁▁
hhsize 0 1 6.22 2.56 1.00 5.00 6.00 7.00 19.00 ▃▇▂▁▁
live_bg 0 1 4.20 2.48 1.00 2.00 4.00 5.00 20.00 ▇▃▁▁▁
livest_sm 0 1 3.21 3.66 0.00 1.00 2.00 5.00 40.00 ▇▁▁▁▁
land_ha 0 1 1.09 1.24 0.00 0.40 0.79 1.35 15.58 ▇▁▁▁▁
land_pakho 0 1 0.48 0.99 0.00 0.00 0.23 0.62 15.58 ▇▁▁▁▁
land_khet 0 1 0.61 0.91 0.00 0.06 0.34 0.71 6.94 ▇▁▁▁▁
land_cultiv 0 1 2.24 2.30 0.07 0.92 1.76 2.71 31.15 ▇▁▁▁▁
pension 0 1 1002.57 5356.76 0.00 0.00 0.00 0.00 47868.20 ▇▁▁▁▁
inc_remit 0 1 8642.34 24690.92 0.00 0.00 0.00 6376.30 263524.66 ▇▁▁▁▁
selflab 0 1 2380.72 1622.88 24.00 1320.00 2040.00 3136.50 9894.00 ▇▆▂▁▁
totcon_freq 0 1 42770.67 27050.79 7821.60 26669.89 36459.41 51455.32 274078.47 ▇▂▁▁▁
selfinc 0 1 29112.88 31727.75 574.96 13265.27 20537.42 34066.29 335860.72 ▇▁▁▁▁
fixed_inc 0 1 14994.41 31301.84 0.00 0.00 1030.16 17727.19 383308.59 ▇▁▁▁▁
edu 0 1 13.25 10.61 1.00 5.00 10.00 19.00 74.00 ▇▃▁▁▁
selflab_stock 0 1 3.98 1.57 1.00 3.00 3.66 4.75 11.50 ▆▇▃▁▁
ln_selfinc 0 1 9.96 0.78 6.35 9.49 9.93 10.44 12.72 ▁▁▇▅▁
ln_live_bg 0 1 1.27 0.59 0.00 0.69 1.39 1.61 3.00 ▂▇▇▂▁
ln_edu 0 1 2.26 0.85 0.00 1.61 2.30 2.94 4.30 ▂▅▇▇▁
ln_land_cult 0 1 0.46 0.87 -2.69 -0.08 0.57 1.00 3.44 ▁▂▇▃▁
ln_selflab 0 1 7.51 0.82 3.18 7.19 7.62 8.05 9.20 ▁▁▂▇▃
ln_selflab_stock 0 1 1.31 0.38 0.00 1.10 1.30 1.56 2.44 ▁▃▇▅▁

13.2 Estimate the model

lm(ln_selfinc~ln_live_bg + ln_edu + ln_land_cult + ln_selflab, data = df) -> initial_model
summary(initial_model)

Call:
lm(formula = ln_selfinc ~ ln_live_bg + ln_edu + ln_land_cult + 
    ln_selflab, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.44026 -0.42238  0.04034  0.40734  2.30682 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.40786    0.30803  24.049  < 2e-16 ***
ln_live_bg    0.23228    0.05855   3.967 8.58e-05 ***
ln_edu        0.15601    0.03992   3.908 0.000109 ***
ln_land_cult  0.23310    0.04072   5.725 2.00e-08 ***
ln_selflab    0.23926    0.04140   5.779 1.49e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6541 on 409 degrees of freedom
Multiple R-squared:  0.3108,    Adjusted R-squared:  0.304 
F-statistic: 46.11 on 4 and 409 DF,  p-value: < 2.2e-16
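
A quick aside (my own sketch, not part of the original handout): if you prefer the regression output as a data frame, the broom package - installed together with the tidyverse, but not attached by library(tidyverse) - can tidy the model object:

library(broom) # installed with the tidyverse, but needs to be loaded explicitly

tidy(initial_model)   # one row per coefficient: estimate, std.error, statistic, p.value
glance(initial_model) # one-row model summary: r.squared, sigma, the F statistic etc.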

13.3 Analyse the model

We also want to run some diagnostic checks on our model, so we plot several key diagnostics using the fantastic performance package (for the plots you also want to install the see and the qqplotr packages):

check_model(initial_model)

Now we can assess whether our assumptions are valid: whether the residuals are normally distributed (which looks fairly plausible from a visual inspection) and whether the error variance is homogeneous (otherwise we need to use a robust model).
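
As an aside (not part of the original workflow): if you cannot install the see or qqplotr packages, base R's plot() method for lm objects produces a similar set of diagnostic plots:

par(mfrow = c(2, 2)) # arrange the four classic diagnostic plots in a 2x2 grid
plot(initial_model)
par(mfrow = c(1, 1)) # reset the plotting layout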

Next, we can test formally for error normality:

check_normality(initial_model)
Warning: Non-normality of residuals detected (p < .001).

So again, we have a problem: this test suggests that our residuals are not normally distributed! Going forward, we'd need to reconsider our specification, i.e. do we really need all the logged variables, can we include a non-linear term, or have we missed something altogether (endogeneity etc.)? For now, we'll just continue with the model though!
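
If you want to run the underlying test yourself - for linear models, check_normality() is, to my knowledge, based on the Shapiro-Wilk test - you can apply it directly to the residuals:

shapiro.test(residuals(initial_model)) # classical Shapiro-Wilk test of the residuals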

We can also formally test the no-serial-correlation assumption (which we did not discuss in this course) - you can see the command below, but I do not execute it because we are not dealing with time series or panel data, where autocorrelation matters most (autocorrelation can also be a problem in cross-sectional data, but that is beyond this material).

check_autocorrelation(initial_model)
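
For reference, the classical equivalent is the Durbin-Watson test, available as dwtest() in the lmtest package we already loaded (again shown but not executed, for the same reason):

dwtest(initial_model) # Durbin-Watson test for first-order serial correlation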

We can also test formally for homoscedasticity:

check_heteroscedasticity(initial_model)
Warning: Heteroscedasticity (non-constant error variance) detected (p = 0.005).

This test tells us that we have a problem with heteroscedasticity - so we need to use a robust model!
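
If you want the classical test behind this check - to my knowledge, check_heteroscedasticity() implements a Breusch-Pagan-type test - lmtest provides it directly:

bptest(initial_model) # (studentized) Breusch-Pagan test for heteroscedasticity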

If we want to estimate the regression again with robust standard errors, we can simply pass our estimated model to coeftest() (from the lmtest package) and specify a new vcov ("variance-covariance matrix", which is what the robust S.E. calculation actually modifies). We use vcovHC() from the sandwich package (named after the so-called "sandwich estimator") - the HC in vcovHC stands for heteroscedasticity-consistent.

We can use different types of robust estimation - the default in R is HC3, which differs slightly from the Stata results.

coeftest(initial_model, vcov = vcovHC(initial_model, type="HC3"))

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept)  7.407857   0.360788 20.5325 < 2.2e-16 ***
ln_live_bg   0.232283   0.057437  4.0441 6.278e-05 ***
ln_edu       0.156007   0.039504  3.9491 9.235e-05 ***
ln_land_cult 0.233099   0.048551  4.8011 2.217e-06 ***
ln_selflab   0.239264   0.048457  4.9377 1.154e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If you want exactly the result you get in Stata, use HC1:

coeftest(initial_model, vcov = vcovHC(initial_model, type="HC1"))

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)    
(Intercept)  7.407857   0.353722 20.9426 < 2.2e-16 ***
ln_live_bg   0.232283   0.056628  4.1019 4.946e-05 ***
ln_edu       0.156007   0.039090  3.9910 7.797e-05 ***
ln_land_cult 0.233099   0.047703  4.8865 1.476e-06 ***
ln_selflab   0.239264   0.047482  5.0390 7.038e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As you can see, using robust standard errors does not fundamentally change our conclusions about the coefficients.
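
To see how much the HC variants actually differ, here is a small sketch (my own, using purrr from the tidyverse) that collects the robust standard error of ln_selflab under HC0 through HC3:

c("HC0", "HC1", "HC2", "HC3") %>% 
  set_names() %>% # name the vector by its own values so the result is labelled
  map_dbl(~ sqrt(diag(vcovHC(initial_model, type = .x)))["ln_selflab"])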

13.4 Use an Instrument

For more information on IV-estimation in R see here.

To use IV estimation in R we need to use another package - the ivmodel package.

We use ivmodelFormula(), which works in the same way as an lm() model. We specify a formula of the form y ~ endog + exog | instr + exog, where y is our dependent variable, endog is the endogenous variable that we want to instrument, instr is the instrument we use in the first-stage regression to instrument the endog variable, and exog stands for all the variables that are exogenous (i.e. that we don't need to instrument). Note that both formulas, separated by |, contain the exogenous variables.

Let’s see this in action:

ivmodelFormula(ln_selfinc ~ ln_live_bg + ln_edu + ln_land_cult + ln_selflab | 
                 ln_selflab_stock + ln_live_bg + ln_edu + ln_land_cult, 
               data = df) -> model_using_iv

Now let’s see the output (which is long!):

model_using_iv

Call:
ivmodel(Y = Y, D = D, Z = Z, X = X, intercept = intercept, beta0 = beta0, 
    alpha = alpha, k = k, manyweakSE = manyweakSE, heteroSE = heteroSE, 
    clusterID = clusterID, deltarange = deltarange, na.action = na.action)
sample size: 414
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

First Stage Regression Result:

F=21.58976, df1=1, df2=409, p-value is 4.5603e-06
R-squared=0.05013997,   Adjusted R-squared=0.04781757
Residual standard error: 0.7613904 on 410 degrees of freedom
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

Coefficients of k-Class Estimators:

            k Estimate Std. Error t value Pr(>|t|)    
OLS    0.0000   0.2393     0.0414   5.779 1.49e-08 ***
Fuller 0.9976   0.7404     0.2107   3.515  0.00049 ***
TSLS   1.0000   0.7649     0.2183   3.504  0.00051 ***
LIML   1.0000   0.7649     0.2183   3.504  0.00051 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

Alternative tests for the treatment effect under H_0: beta=0.

Anderson-Rubin test (under F distribution):
F=16.45739, df1=1, df2=409, p-value is 5.9602e-05
95 percent confidence interval:
 [0.396758589790007, 1.37422089106227]

Conditional Likelihood Ratio test (under Normal approximation):
Test Stat=16.45739, p-value is 5.9602e-05
95 percent confidence interval:
 [0.396758617266406, 1.37422081576433]

What do we want to focus on?

First of all, we want to analyse the first-stage regression, shown under First Stage Regression Result - the first number already gives us the F-test result, which we want to be above 10 (the common rule of thumb for a sufficiently strong instrument).
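
To connect this to what we already know, here is a sketch (my own) of the first stage as a plain lm(). Assuming ivmodel reports the test of the instrument's coefficient, the F above (with df1 = 1) should equal the squared t-statistic of ln_selflab_stock in this regression:

first_stage <- lm(ln_selflab ~ ln_selflab_stock + ln_live_bg + ln_edu + ln_land_cult, 
                  data = df)
summary(first_stage)$coefficients["ln_selflab_stock", "t value"]^2 # should be ~21.59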

Next, we consider the Coefficients of k-Class Estimators, which give the coefficient of the endogenous variable under a number of different models - the original OLS, the Fuller-k estimate, the traditional Two-Stage Least Squares (TSLS) that we are focusing on, and the Limited Information Maximum Likelihood (LIML) estimator (which, again, you can ignore). We can also access this information explicitly:

model_using_iv %>% 
  coef()
              k  Estimate Std. Error  t value     Pr(>|t|)
OLS    0.000000 0.2392641 0.04140123 5.779156 1.490238e-08
Fuller 0.997555 0.7403625 0.21065283 3.514610 4.896194e-04
TSLS   1.000000 0.7648576 0.21830326 3.503647 5.096036e-04
LIML   1.000000 0.7648576 0.21830326 3.503647 5.096036e-04

Personally, I prefer a little printing trick using printCoefmat(), which produces slightly nicer output:

model_using_iv %>% 
  coef() %>% 
  printCoefmat()
              k Estimate Std. Error t value  Pr(>|t|)    
OLS    0.000000 0.239264   0.041401  5.7792  1.49e-08 ***
Fuller 0.997555 0.740363   0.210653  3.5146 0.0004896 ***
TSLS   1.000000 0.764858   0.218303  3.5036 0.0005096 ***
LIML   1.000000 0.764858   0.218303  3.5036 0.0005096 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This means we can now compare the coefficient of the initial OLS model with the IV-estimated models - and clearly see the difference in the coefficient estimates for the endogenous variable.
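
To make the mechanics concrete, here is a by-hand sketch of TSLS (my own illustration) using two lm() calls. The second-stage slope on the first-stage fitted values should reproduce the TSLS estimate of about 0.76 from above - but note that the second-stage standard errors from lm() are not valid and should not be reported:

first_stage <- lm(ln_selflab ~ ln_selflab_stock + ln_live_bg + ln_edu + ln_land_cult, 
                  data = df)
second_stage <- lm(ln_selfinc ~ fitted(first_stage) + ln_live_bg + ln_edu + ln_land_cult, 
                   data = df)
coef(second_stage) # the slope on fitted(first_stage) matches the TSLS coefficient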

If we want to see the coefficient estimates of all the other variables, we need to use the command coefOther():

model_using_iv %>% 
  coefOther() 
$OLS
              Estimate Std. Error   t value     Pr(>|t|)
ln_live_bg   0.2322826 0.05854911  3.967312 8.582435e-05
ln_edu       0.1560072 0.03992143  3.907856 1.089522e-04
ln_land_cult 0.2330989 0.04071572  5.725034 2.004022e-08
intercept    7.4078569 0.30802799 24.049298 0.000000e+00

$TSLS
               Estimate Std. Error   t value   Pr(>|t|)
ln_live_bg   0.08728102 0.09068675 0.9624451 0.33639471
ln_edu       0.11072546 0.05057385 2.1893815 0.02913356
ln_land_cult 0.14719816 0.05933081 2.4809733 0.01350353
intercept    3.78506193 1.51092983 2.5051209 0.01262886

$LIML
               Estimate Std. Error   t value   Pr(>|t|)
ln_live_bg   0.08728102 0.09068675 0.9624451 0.33639471
ln_edu       0.11072546 0.05057385 2.1893815 0.02913356
ln_land_cult 0.14719816 0.05933081 2.4809733 0.01350353
intercept    3.78506193 1.51092983 2.5051209 0.01262886

$Fuller
               Estimate Std. Error  t value    Pr(>|t|)
ln_live_bg   0.09403876 0.08863420 1.060976 0.289326885
ln_edu       0.11283580 0.04976586 2.267333 0.023891034
ln_land_cult 0.15120153 0.05809182 2.602802 0.009582603
intercept    3.95390086 1.45825658 2.711389 0.006982554

We can also use robust estimation easily by setting heteroSE = TRUE:

ivmodelFormula(ln_selfinc ~ ln_live_bg + ln_edu + ln_land_cult + ln_selflab | 
                 ln_selflab_stock + ln_live_bg + ln_edu + ln_land_cult, 
               data = df, heteroSE = TRUE) 

Call:
ivmodel(Y = Y, D = D, Z = Z, X = X, intercept = intercept, beta0 = beta0, 
    alpha = alpha, k = k, manyweakSE = manyweakSE, heteroSE = heteroSE, 
    clusterID = clusterID, deltarange = deltarange, na.action = na.action)
sample size: 414
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

First Stage Regression Result:

F=21.58976, df1=1, df2=409, p-value is 4.5603e-06
R-squared=0.05013997,   Adjusted R-squared=0.04781757
Residual standard error: 0.7613904 on 410 degrees of freedom
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

Coefficients of k-Class Estimators:

             k Estimate Std. Error t value Pr(>|t|)    
OLS    0.00000  0.23926    0.04719   5.070 6.05e-07 ***
Fuller 0.99756  0.74036    0.21898   3.381 0.000792 ***
TSLS   1.00000  0.76486    0.23209   3.296 0.001068 ** 
LIML   1.00000  0.76486    0.23209   3.296 0.001068 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

Alternative tests for the treatment effect under H_0: beta=0.

Anderson-Rubin test (under F distribution):
F=16.45739, df1=1, df2=409, p-value is 5.9602e-05
95 percent confidence interval:
 [0.396758589790007, 1.37422089106227]

Conditional Likelihood Ratio test (under Normal approximation):
Test Stat=16.45739, p-value is 5.9602e-05
95 percent confidence interval:
 [0.396758617266406, 1.37422081576433]
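
As a closing aside (my own addition, not part of the original handout): the ivreg() function from the AER package uses the same two-part formula syntax and returns an lm-like object, so it works directly with the coeftest()/vcovHC() workflow from above:

library(AER) # not loaded at the top - install it first if necessary

ivreg(ln_selfinc ~ ln_live_bg + ln_edu + ln_land_cult + ln_selflab | 
        ln_live_bg + ln_edu + ln_land_cult + ln_selflab_stock, 
      data = df) -> model_ivreg

coeftest(model_ivreg, vcov = vcovHC(model_ivreg, type = "HC1")) # robust IV standard errors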