Review

Estimands

In the past few weeks we have covered several different estimands: most commonly, the average treatment effect (or ATE), and most recently, the complier average causal effect (or CACE). By way of review, I will give you a toy dataset, and ask you to estimate the following quantities:

Average Treatment Effect (ATE)

\[\begin{equation} E[Y_i(1) - Y_i(0)] \end{equation}\]

Average Treatment Effect on the Treated (ATT)

\[\begin{equation} E[Y_i(1) - Y_i(0) | D_i = 1] \end{equation}\]

Average Treatment Effect on Control Units (ATC)

\[\begin{equation} E[Y_i(1) - Y_i(0) | D_i = 0] \end{equation}\]

Intent to Treat Effect (\(ITT_Y\))

\[\begin{equation} E[Y_i | Z_i = 1] - E[Y_i | Z_i = 0] \end{equation}\]

Complier Average Causal Effect (CACE)

\[\begin{equation} E[Y_i(1) - Y_i(0) | D_i(1) > D_i(0)] \end{equation}\]

library(tidyverse)

estimands_dataset <- data_frame(
  ID = c(1:5),
  Y1 = c(5,8,4,6,4),
  Y0 = c(4,5,6,5,3),
  D1 = c(1,1,0,1,1),
  D0 = c(0,0,0,1,0),
  Z = c(1,0,1,0,0),
  D = D1*Z + D0*(1-Z),
  Y_Z1 = Y1*D1 + Y0*(1-D1),
  Y_Z0 = Y1*D0 + Y0*(1-D0),
  Y = Y_Z1*Z + Y_Z0*(1-Z)
)

estimands_dataset %>% select(-c(D,Y_Z1, Y_Z0,Y)) %>%
  knitr::kable(., 
               caption = "Exercise 1: Toy Dataset",
               col.names = c("ID","Y(D=1)","Y(D=0)","D(Z=1)","D(Z=0)","Z"),
               format = "pandoc")

Exercise 1: Toy Dataset
ID	Y(D=1)	Y(D=0)	D(Z=1)	D(Z=0)	Z
1	5	4	1	0	1
2	8	5	1	0	0
3	4	6	0	0	1
4	6	5	1	1	0
5	4	3	1	0	0

Answer

First, lets use the switching equation to create three new columns: \(D_i = D_i(1) \times Z_i + D_i(0) \times (1-Z_i)\); \(Y_i(D_i(Z_i=1)) = Y_i(1) \times D_i(1) + Y_i(0) \times (1-D_i(1))\); and \(Y_i(D_i(Z_i=0)) = Y_i(1) \times D_i(0) + Y_i(0) \times (1-D_i(0))\) This gives the following science table:

estimands_dataset %>%
  knitr::kable(., 
               caption = "Exercise 1: Filling-in columns in the toy dataset",
               col.names = c("ID","Y(D=1)","Y(D=0)","D(Z=1)","D(Z=0)","Z", "D","Y(D(Z=1))","Y(D(Z=0))","Y"),
               format = "pandoc")

Exercise 1: Filling-in columns in the toy dataset
ID	Y(D=1)	Y(D=0)	D(Z=1)	D(Z=0)	Z	D	Y(D(Z=1))	Y(D(Z=0))	Y
1	5	4	1	0	1	1	5	4	5
2	8	5	1	0	0	0	8	5	5
3	4	6	0	0	1	0	6	6	6
4	6	5	1	1	0	1	6	6	6
5	4	3	1	0	0	0	4	3	3

Next, lets calculate each of the quantities:

Average Treatment Effect (ATE) is the average difference in treated and untreated potential outcomes.

\[\begin{equation} E[Y_i(1) - Y_i(0)] = \frac{5+8+4+6+4}{5} - \frac{4+5+6+5+3}{5} = \frac{27}{5} - \frac{23}{5} = \frac{4}{5} = 0.8 \end{equation}\]

R: with(data, mean(Y1-Y0))

Average Treatment Effect on the Treated (ATT) is the average difference in treated and untreated potential outcomes of treated subjects (i.e. subjects 1 and 4).

\[\begin{equation} E[Y_i(1) - Y_i(0) | D_i = 1] = \frac{6+5}{2} - \frac{5+4}{2} = \frac{11-9}{2} = 1 \end{equation}\]

R: with(data, mean(Y1[D==1]) - mean(Y0[D==1]))

Average Treatment Effect on Control Units (ATC) is the mean difference in the treated and untreated potential outcomes of control subjects (i.e. subjects 2, 3, and 5).

\[\begin{equation} E[Y_i(1) - Y_i(0) | D_i = 0] = \frac{8+4+4}{3} - \frac{5+6+3}{3} = \frac{16}{3} - \frac{14}{3} = \frac{2}{3} \approx 0.6667 \end{equation}\]

R: with(data, mean(Y1[D==0]) - mean(Y0[D==0]))

Claim: ATE is a weighted average of ATT and ATC.

\[\begin{equation} ATE = ATT\times P(D_i=1) + ATC\times P(D_i=0) = (\frac{2}{5}\times 1) + (\frac{3}{5}\times \frac{2}{3}) = \frac{2}{5} + \frac{2}{5} = \frac{4}{5} \end{equation}\]

Intent to Treat Effect (\(ITT_Y\)) is the mean difference in potential outcomes in the assigned-to-treatment, and assigned-to-control groups.

\[\begin{equation} E[Y_i | Z_i = 1] - E[Y_i | Z_i = 0] = E[Y_i(D_i(Z_i=1))] - E[Y_i(D_i(Z_i=0))] = \frac{5+8+6+6+4}{5} - \frac{4+5+6+6+3}{5} = \frac{29-24}{5} = 1 \end{equation}\]

R: with(data, mean(Y_Z1) - mean(Y_Z0))

Note: This is the true value of the \(ITT\). Using observable data, we can also get an estimate, \(\widehat{ITT}\). That is called the “reduced form” or the difference-in-means: \(\overline{Y_{Z=1}} - \overline{Y_{Z=0}}\)

\[\begin{equation} \overline{Y_{Z=1}} - \overline{Y_{Z=0}} = \frac{5+6}{2} - \frac{5+6+3}{3} = \frac{11}{2} - \frac{14}{3} \approx 0.833 \end{equation}\]

R: with(data, mean(Y[Z==1]) - mean(Y[Z==0]))

Furthermore: We can calculate \(ITT_D = E[D_i | Z_i = 1] - E[D_i | Z_i = 0] = E[D_i(Z_i=1)] - E[D_i(Z_i=0)]\). Which is: \((\frac{1+1+0+1+1}{5}) - (\frac{0+0+0+1+0}{5}) = \frac{4}{5} - \frac{1}{5} = \frac{3}{5} = 0.6\). In R this is: with(data, mean(D1) - mean(D0)). This is different from \(\widehat{ITT_D}\), or the first-stage estimate: \(\overline{D_{Z=1}} - \overline{D_{Z=0}} = \frac{1+0}{2} - \frac{0+1+0}{3} = 0.16667\). In R: with(data, mean(D[Z==1]) - mean(D[Z==0]))

Complier Average Causal Effect (CACE) is the average difference in treated and untreated potential outcomes of compliers (i.e. subjects that satisfy \(D_i(Z_i = 1) > D_i(Z_i = 0)\)). In this case, subjects 1, 2 and 5.

\[\begin{equation} E[Y_i(1) - Y_i(0) | D_i(1) > D_i(0)] = \frac{8+5+4}{3} - \frac{4+5+3}{3} = \frac{17}{3} - \frac{12}{3} \approx 1.6667 \end{equation}\]

R: with(data, mean(Y1[D1 > D0]) - mean(Y0[D1 > D0]))

Fitted and Residualized \(D_i\)

Consider an outcome \(Y\), an explanatory variable \(D\), and some confounder \(X\). In the past few weeks, we have used a regression of \(D\) on \(X\) in at least four different ways. As recap, lets go through the various ways in which we used fitted and residualized values of \(D\).

Say, the regression specification is: \[\begin{equation} D_i = \alpha_0 + \alpha_1 \cdot X_i + \epsilon_i \end{equation}\]

And the fitted or predicted values are \(\widehat{D_i} = \widehat{\alpha_0} + \widehat{\alpha_1} \cdot X_i\); and the residuals are \(\tilde{D_i} = D_i - (\widehat{\alpha_0} + \widehat{\alpha_1} \cdot X_i)\).

Use 1: \(\tilde{D_i}\) in the partial regression framework

We showed that in the ordinary least squares regression \(Y_i = \beta_0 + \beta_1 \cdot D_i + \beta_2 \cdot X_i + \xi_i\),

\[\begin{equation} \beta_1 = \frac{cov(\tilde{Y},\tilde{D})}{var(\tilde{D})} \end{equation}\]

\(\beta_1\) can be exactly recovered by regressing \(D_i\) on \(X_i\), collecting the residuals \(\tilde{D_i}\), similarly regressing \(Y_i\) on \(X_i\), collecting the residuals \(\tilde{Y_i}\), and running a bivariate regression \(\tilde{Y_i} \sim \tilde{D_i}\).

Use 2: \(\tilde{D_i}\) in regression weights

We know that regression weights observations by the conditional variance of treatment, \(Var(D_i | X_i)\). Aronow and Samii (2016) tell us:

We may estimate each of the multiple regression weights with the estimator \(\widehat{w_i} = \tilde{D_i^2}\) where \(\tilde{D_i}\) is the residual from a regression of \(D_i\) on \(X_i\). (AS 2016:255)

To understand this, consider the expression \(Var(D_i | X_i)\), and re-write it as:

\[\begin{equation} Var(D_i | X_i) = Var(\widehat{D_i} + \tilde{D_i} | X_i) \end{equation}\]

We know from the first-order condition while deriving OLS coefficients that \(X_i \perp \epsilon_i\), i.e. \(E[X_i\epsilon_i] = 0\). Since \(\widehat{D_i} = \alpha_0 + \alpha_1 \cdot X_i\) (where \(\alpha_0\) and \(\alpha_1\) are constants, \(\widehat{D_i} \rightarrow f(X_i)\), and \(\tilde{D_i} = \epsilon_i\). So \(\widehat{D_i} \perp \tilde{D_i}\). This means we can split up the variance as follows:

\[\begin{equation} Var(\{\alpha_0 + \alpha_1 \cdot X_i\} | X_i) + Var(\tilde{D_i} | X_i) \end{equation}\]

The first quantity simplifies to \(\alpha_1^2 Var(X_i | X_i) = 0\). This leaves:

\[\begin{equation} Var(\tilde{D_i} | X_i) = E[\tilde{D_i}^2 | X_i] - E[\tilde{D_i} | X_i]^2 \end{equation}\]

But we know that \(E[\tilde{D_i} | X_i] = E[\epsilon_i | X_i] = 0\). So we are left with \(E[\tilde{D_i}^2]\). This motivates the use of squared-residuals, \(\tilde{D_i^2}\), to estimate regression weights.

Use 3: \(\widehat{D_i}\) in propensity score theorem, inverse-probability weighting

The propensity score theorem implies that “the only covariate [we] need to control for is the probability of treatment itself” (Angrist and Pischke, p81).

We know that the propensity score, \(p(X_i) = \widehat{D_i} = \widehat{\alpha_0} + \widehat{\alpha_1} \cdot X_i\). In section 2, I showed that we can recover the average treatment effect by weighting outcomes of treated units by \(\frac{1}{p(X_i)}\) and outcomes of control units by \(\frac{1}{1 - p(X_i)}\). Alternatively, we include \(\widehat{D_i}\) as a covariate in the regression \(Y_i \sim D_i + \widehat{D_i}\).

Use 3: \(\widehat{D_i}\) in propensity score matching

Finally, last week’s problem set used the propensity score to find matches and estimate the ATT. Ichino and Nathan (2013) have \(n = 169\) observations, 86 of which are treated (i.e. primaries are held). One way to find matches for these 86 observations is to identify an untreated observation with the most similar propensity score. Here are the steps:

Regress \(D_i\) on all the covariates \(X_{1,i}, X_{2,i} \cdots X_{k,i}\). Capture the predicted values \(\widehat{D_i}\).
For every treated unit \(j\), with propensity score \(\widehat{D_j}\), find the nearest untreated subject \(k\) in propensity score space, (i.e. \(min(\widehat{D_j} - \widehat{D_k})^2\)). This produces \(j\) pairs, and \(2j\) units for the analysis.
Estimate the “treatment effect” within each pair, aggregate these estimates, weighting by the probability of treatment (specifically: \(\frac{P(D_i = 1 | X_i = x) \cdot P(X_i =x)}{\sum_x P(D_i = 1 | X_i = x) \cdot P(X_i =x)}\)).

library(Matching)
library(tidyverse)
library(estimatr)

# Upload the dataset
ndc <- read.csv("primariesdata_NDC.csv") 

# Logistic regression of D on all Xs:
prop_model <-
  glm(
  prim_elec ~ pres_t_1 + ethfrac + year2008 + holds1 + incumbent,
  family = "binomial",
  data = ndc
  )

# Extracted the predicted values or "propensity scores"
ndc$propensityscore  <- prop_model$fitted

# Use the Match package, specify X = propensityscore,
# M = 1 (i.e. find only 1 match for each treated unit)

prop_score_estimate <-  
  with(ndc, Match(Y = winner_t, Tr = prim_elec, X = propensityscore, M = 1))

summary(prop_score_estimate)

## 
## Estimate...  0.075581 
## AI SE......  0.089403 
## T-stat.....  0.8454 
## p.val......  0.39789 
## 
## Original number of observations..............  169 
## Original number of treated obs...............  86 
## Matched number of observations...............  86 
## Matched number of observations  (unweighted).  90

I want to draw your attention to two facts about matching: (1) we drop untreated observations that are not matched with any treated observation; and (2) we can recycle untreated observations if they are the best match for multiple treated observations.

Here are the treated observations referenced by their row number. Note that unique(\(\cdot\)) = 86 . Match does recycle some treated observations if they have more than 1 equally good matches, then down-weights these observations to maintain \(n_{D=1} = 86\)

prop_score_estimate$index.treated # 86 unique values

##  [1]   1   2   3  11  14  15  16  17  19  20  23  24  25  27  29  32  35
## [18]  39  41  42  45  56  57  63  64  65  66  67  69  70  71  72  74  75
## [35]  78  79  80  81  86  86  87  88  90  93  95  97  98  99  99 100 103
## [52] 104 106 107 108 110 112 113 117 118 119 127 128 129 131 133 134 136
## [69] 137 138 142 143 144 147 148 149 151 151 152 153 154 155 156 158 158
## [86] 160 164 167 168 169

And here are the “matched” untreated observations, also referenced by their row number. Observe that unique(\(\cdot\)) = 51 . In other words, some untreated observations are picked as matches multiple times (e.g.: 123 is matched with 2 and 169).

prop_score_estimate$index.control # 51 unique values

##  [1]  58 123  22 140 139  52   6  55  34 132  33  50  96   6 146 161  68
## [18]  82  36  77 141 116 165 123  82 139  13 139 150  38  76  18 146  31
## [35]  13 139  33  13 125 145  37  34  83 146  44  89  18   9  53  84  62
## [52] 105 157  13  48  55 162  31  92 123  47  37  18  73  92  28 150 121
## [69]  83 139  83  94 146 124 139  83 125 145  83  38  84 135 146 125 145
## [86]  83  46 146  50 123

Instrumental Variables

Deriving \(\beta_{2SLS}\)

Consider a treatment \(D\) (education), and its impact on an outcome \(Y\) (wages). This association is confounded by an unobservable variable, \(A\) (ability). Since the effect of education on wages is unidentified, we search for an instrument. We find that an educational body provided college college scholarships (\(Z\)) to randomly selected high school students. It, however, awarded scholarships with different probabilities to male and female students. Here is a long-form regression and a DAG that captures this information:

\[\begin{equation} Y_i = \beta_0 + \beta_1 D_i + \beta_2 X_i + \beta_3 A_i + \xi_{0,i} \end{equation}\]

Where \(D_i\) is a dummy variable that equals 1 if the subject finished college, and \(\beta_3 A_i + \xi_{1,i}\) is the composite error term because \(A_i\) is unobserved.

Figure 1: Instrumental Variable Setup

We will show: \[\begin{equation} \beta_{2SLS} = \frac{cov(\tilde{Y_i},\tilde{Z_i})}{cov(\tilde{D_i},\tilde{Z_i})} = \frac{\frac{cov(\tilde{Y_i},\tilde{Z_i})}{var(\tilde{Z_i})}}{\frac{cov(\tilde{D_i},\tilde{Z_i})}{var(\tilde{Z_i})}} = \frac{\beta_{Y \cdot Z + X}}{\beta_{D \cdot Z + X}} \end{equation}\]
\(\beta_{2SLS}\) is a ratio of the regression coefficient from the reduced form and first stage.

Consider the “first-stage” regression:

\[\begin{equation} D_i = \alpha_0 + \alpha_1 X_i + \alpha_2 Z_i + \xi_{1,i} \end{equation}\]

Plug-in \(D_i\) from this equation into the long-form regression:

\[\begin{equation} Y_i = \beta_0 + \beta_1(\alpha_0 + \alpha_1 X_i + \alpha_2 Z_i + \xi_{1,i}) + \beta_2 X_i + \beta_3 A_i + \xi_{0,i} \end{equation}\]

And re-arrange the terms so we get:

\[\begin{equation} Y_i = (\beta_0 + \beta_1 \alpha_0) + X_i(\beta_1 \alpha_1 + \beta_2) + Z_i(\beta_1 \alpha_2) + \{ \beta_1 \xi_{1,i} + \beta_3 A_i + \xi_{0,i}\} \end{equation}\]

We now have a regression of the type: \(Y_i = \pi_0 + \pi_1 X_i + \pi_2 Z_i + \xi_{2,i}\) (or the “reduced form”). Focusing on the coefficients

\[\begin{equation} \pi_1 = (\beta_1 \alpha_1) + \beta_2 \end{equation}\]

\(\beta_1 \alpha_1\) is the “indirect” effect of \(X_i\) on \(Y_i\), and \(\beta_2\) is the “direct” effect of \(X_i\) on \(Y_i\).

We are interested in isolating \(\beta_1\) or the effect of schooling on wages:

\[\begin{equation} \pi_2 = (\beta_1 \alpha_2) \Rightarrow \beta_1 = \frac{\pi_2}{\alpha_2} \end{equation}\]

Note that \(\pi_2\) captures the variation in \(Y_i\) caused by \(Z_i\) solely through \(D_i\), and \(\alpha_2\) captures the variation in \(D_i\) caused by \(Z_i\). Since \(X_i\) is a covariate in both steps, the partial regression framework tells us:

\[\begin{equation} \pi_2 = \frac{cov(\tilde{Y_i},\tilde{Z_i})}{var(\tilde{Z_i})} \end{equation}\] \[\begin{equation} \alpha_2 = \frac{cov(\tilde{D_i},\tilde{Z_i})}{var(\tilde{Z_i})} \end{equation}\]

Where \(\tilde{D_i}\) is residualized D by partialling-out the variation caused by \(X_i\), \(\tilde{Y_i}\) is the residual from \(Y_i \sim X_i\), and \(\tilde{Z_i}\) from \(Z_i \sim X_i\).

Finally, if we re-arrange the terms in another way we can see what the second-stage looks like:

\[\begin{equation} Y_i = (\beta_0 + \beta_1 \alpha_0) + \beta_2 X_i + \beta_1(\alpha_1 X_i + \alpha_2 Z_i) + \{ \beta_1 \xi_{1,i} + \beta_3 A_i + \xi_{0,i}\} \end{equation}\]

And \(\alpha_1 X_i + \alpha_2 Z_i\) are just the fitted values from the first-stage (\(D_i \sim Z_i + X_i\)).

This is why we do the estimation in two stages:

Regress \(D_i \sim Z_i + X_i\), and get the fitted values \(\widehat{D_i}\)
Estimate the “second stage” using the fitted values from the first stage: \(Y_i \sim \widehat{D_i} + X_i\)

Note: In deriving \(\beta_{2SLS}\) we relied on five assumptions: constant effects, conditional independence or “exogeneity” (\(Y_i(1),Y_i(0) \perp Z_i | X_i\)), excludability (\(Y_i(d,z) = Y_i(d)\)), and non-zero first stage. We did not need monotonicity (\(D_i(1) > D_i(0)\)).

2SLS in `R`

Let’s confirm the two-stage least squares procedure by doing it manually, then checking if we get the same result by specifying an instrumental variables regression in iv_robust. Here is what I do in the attached code:

Step 1: Create data in line with the DAG and system of equations specified above.

Step 2: Run the “first stage” (\(D_i \sim Z_i + X_i\)) regression, collect fitted values \(\widehat{D_i}\), and plug these into the second stage regression \(Y_i \sim \widehat{D_i} + X_i\).

Step 3: Specify the IV setup iv_robust(Y ~ D + X | Z + X, dat) . Note a few things: (a) we write this in the outcome ~ endogenous regressor + covariate | instrument + covariate format; (b) covariates are included in both stages (not doing so will produce biased estimates).

Step 4: Printing a htmlreg table with all three models.

# STEP 1: Create data matching the above specification
library(DeclareDesign)

iv_design <-
  declare_population(N = 2000, 
                     noise = rnorm(N, mean = 10, sd = 4),
                     A = rnorm(N),
                     X = rbinom(N, size = 1, prob = 0.48)) +
  declare_potential_outcomes(D ~ as.numeric(pnorm(Z + A) > 0.5),
                             assignment_variables = "Z") +
  declare_potential_outcomes(Y ~  noise + 10*A + 1.25*X + 10*D,
                             assignment_variables = "D") +
  declare_estimand(ATE = mean(Y_D_1 - Y_D_0)) +
  declare_assignment(blocks = X, block_prob = c(0.6, 0.4),
                     assignment_variable = "Z") +
  declare_estimator(Y ~ D + X | Z + X, 
                    model = iv_robust, 
                    term = "D", 
                    estimand = "ATE", 
                    label = "IV")

dat <- iv_design %>% draw_data

# STEP 2: Two stage least squares done manually

first_stage = lm_robust(D ~ Z + X, data = dat)

dat$D_hat <- first_stage$fitted.values

second_stage = lm_robust(Y ~ D_hat + X, data = dat)

# Now using iv_robust

iv_model <- iv_robust(Y ~ D + X | Z + X, dat)

# Putting it together in a table:

texreg::htmlreg(list(first_stage, second_stage, iv_model),
  stars = c(0.001, 0.01, 0.05),
  custom.coef.names = c("(Intercept)", "Z", "X",
  "Fitted-D", "D"),
  custom.model.names = c("First Stage", "Second Stage", "IV Robust"),
  digits = 3,
  caption = "Two-Stage Least Squares: Manually and Using iv_robust",
  include.ci = FALSE
  )

Two-Stage Least Squares: Manually and Using iv_robust
	First Stage	Second Stage	IV Robust
(Intercept)	0.487^***	9.041^***	9.041^***
	(0.020)	(1.403)	(1.012)
Z	0.359^***
	(0.020)
X	-0.004	1.477^*	1.477^**
	(0.020)	(0.664)	(0.491)
Fitted-D		11.467^***
		(1.846)
D			11.467^***
			(1.363)
R²	0.145	0.019	0.466
Adj. R²	0.144	0.018	0.465
Num. obs.	2000	2000	2000
RMSE	0.436	14.471	10.681
p < 0.001, p < 0.01, p < 0.05

Why are the standard errors different?

When we do the two-stage least squares manually, the residuals in the second stage are \(\beta_1 \xi_{1,i} + \beta_3 A_i + \xi_{0,i}\). However, in the long-form regression they were \(\beta_3 A_i + \xi_{0,i}\). We are trying to estimate \(\beta_1\) from the long-form regression, so we must use \(\beta_3 A_i + \xi_{0,i}\), not \(\color{red}{\beta_1 \xi_{1,i}} + \beta_3 A_i + \xi_{0,i}\). The computer makes this correction while estimating standard errors, something we may miss doing if we manually do the procedure.

IV and OLS

Now lets compare the ordinary least squares regression to the two-stage least squares estimator. In DeclareDesign, I specify both estimators, sims = 500, and plot the resulting sampling distribution using ggplot2.

# Step 1: Specify both estimators 
iv_design <-
  declare_population(N = 2000, 
                     noise = rnorm(N, mean = 10, sd = 4),
                     A = rnorm(N),
                     X = rbinom(N, size = 1, prob = 0.48)) +
  declare_potential_outcomes(D ~ as.numeric(pnorm(Z + A) > 0.5),
                             assignment_variables = "Z") +
  declare_potential_outcomes(Y ~  noise + 10*A + 1.25*X + 10*D,
                             assignment_variables = "D") +
  declare_estimand(ATE = mean(Y_D_1 - Y_D_0)) +
  declare_assignment(blocks = X, block_prob = c(0.6, 0.4),
                     assignment_variable = "Z") +
  declare_estimator(Y ~ D + X | Z + X, 
                    model = iv_robust, 
                    term = "D", 
                    estimand = "ATE", 
                    label = "IV")+
  declare_estimator(Y ~ D + X, 
                    model = lm_robust, 
                    term = "D", 
                    estimand = "ATE", 
                    label = "OLS")

simulations <- simulate_designs(iv_design, sims = 500)


estimator_performance <- simulations %>% 
  group_by(estimator_label) %>%
  summarise(mean_estimand = mean(estimand),
            mean_estimate = mean(estimate))  %>%
  gather(key = "Quantity", value = "Value", 
         mean_estimand, mean_estimate)

## Step 2: Creating a figure using ggplot2

library(ggthemes)

figure8 <- ggplot(data = simulations,
                  aes(x = estimate)) +
                  xlab("Estimate") + ylab("Count") +
                  geom_histogram(bins = 30) +
                  facet_wrap(~ estimator_label)+
  geom_vline(
  data = estimator_performance,
  aes(xintercept = Value, color = Quantity),
  size = 0.5,
  linetype = "dashed"
  ) +
  theme_economist() +
  theme(legend.position = "bottom") +
  ggtitle(label = "Sampling Distribution of Different Estimators")

figure8

With Heterogenous Effects

So far we have assumed that \(\beta_1\) (effect of \(D_i\) on \(Y_i\)) is constant. That is: \(Y_i(1) - Y_i(0) = \beta_1\) \(\forall i\). We can relax this assumption but would need monotonicity to identify the complier average causal effect (LATE). In this section, I briefly describe each assumption, why we need it, and how we can break it.

First, lets define the estimand:

\[\begin{equation} CACE = \frac{ITT_Y}{ITT_D} = \frac{E[Y_i | Z_i = 1] - E[Y_i | Z_i = 0]}{E[D_i | Z_i = 1] - E[D_i | Z_i = 0]} = E[Y_i(1) - Y_i(0) | D_i(1) > D_i(0)] \end{equation}\]

Independence

\(Z_i\) is as good as randomly assigned. Values of \(Z_i\) do not covary with potential outcomes or the type of response to treatment assignment. \(\{Y_i(d,z) \forall d,z\}, D_i(Z=1), D_i(Z=0) \perp Z_i\)

Implication

If the instrument \(Z_i\) is not exogenously determined, or conditionally independent, we still have confounding (i.e. open backdoor paths connecting \(Z\) and \(Y\)). Visually, imagine the DAG we had earlier but now there is some unobserved variable \(u\) that causes \(Y\) and \(Z\). In such circumstances, when we estimate the reduced form (\(Y_i \sim Z_i\)), the backdoor path \(Z \leftarrow u \rightarrow Y\) will be unblocked.

Figure 2: Exogeneity Violation

DeclareDesign

We would violate exogeneity by defining the assignment procedure, declare_assignment(\(\cdot\)), in terms of noise (some unobservable factor that causes \(Y\)). For example:

declare_assignment(prob_unit = 0.4 + 0.2*X - 0.25*as.numeric(noise <= 10), simple = TRUE)

Excludability

Potential outcomes respond to treatment status \(D_i\), not treatment assignment \(Z_i\). The instrument operates through a single known causal channel. Formally: \(Y_i(d,0) = Y_i(d,1) = Y_i(d) \forall d\)

Implication: If the instrument affects outcomes through causal channels other than \(D_i\), then the reduced form overestimates the effect of \(D\) on \(Y\). For example, if \(Z\) affects \(Y\) through \(M\) and \(D\), \(ITT_Y\) combines the effect of both channels.

Figure 3: Excludability Violation

DeclareDesign

We would violate excludability by defining potential outcomes in terms of Z and D. Excludability implies that \(Y_i(d_i,z_i) = Y_i(d_i)\) (i.e. if we fix \(i\)’s treatment status \(d_i\), then \(i\)’s outcome will not change because of any changes in their assignment status \(z_i\)). An excludability violation can take two forms: either \(Z_i\) directly affects \(Y_i\), or \(Z_i\) affects \(Y_i\) through a mediator other than \(D_i\) (\(M_i\) in the DAG above).

What this implies is that potential outcomes are no longer solely a function of \(D_i\). Take the case of \(Z_i\) directly affecting \(Y_i\). We now have four potential outcomes: \(Y_i(Z=1,D=1)\), \(Y_i(Z=1,D=0)\), \(Y_i(Z=0,D=1)\), and \(Y_i(Z=0,D=0)\). If we specify a mediator \(M_i\), these become: \(Y_i(M=1,D=1)\), \(Y_i(M=1,D=0)\), \(Y_i(M=0,D=1)\), and \(Y_i(M=0,D=0)\). Since \(M\) is itself a potential outcome with two values - \(M_i(Z=1)\) and \(M_i(Z=0)\) - we can re-write the four potential outcomes as a function of \(D\) and \(Z\).

This means we have a slightly complicated estimand.The effect of \(D\) on \(Y\) is \(E[Y|D=1] - E[Y| D =0]\). Think of \(E[Y|D=1]\) as being a weighted average of two quantities: \(E[Y|Z=1, D=1]\) and \(E[Y|Z=0, D=1]\). Similarly, \(E[Y|D=0]\) is a weighted average of \(E[Y|Z=1, D=0]\), \(E[Y|Z=0, D=0]\). So the ATE is essentially the average of \(E[Y| D=1, Z=0] - E[Y| D=0, Z=0]\) and \(E[Y| D=1, Z=1] - E[Y| D=0, Z=1]\) (i.e. fixing \(Z\), and only moving \(D\) we calculate the effect for each value of \(Z\)). In DeclareDesign this looks as follows:

Step 1: Declare a mediator \(M\) that opens an alternative causal pathway from \(Z\) to \(Y\):

declare_potential_outcomes(M ~ rnorm(N) + 2*Z, assignment_variables = “Z”)

Step 2: Write \(Y\) as a function of \(D\) and \(M\), remember to specify two assignment variables (\(Z\) and \(D\)):

declare_potential_outcomes(Y ~ noise + 10*A + 1.25*X + 10*D + 2*(rnorm(N) + 2*Z),assignment_variables = c(D,Z))

Note: If we specified \(M\) instead of \((rnorm(N) + 2\cdot Z)\), \(Y\) would have to be a function of \(D\) and \(M\). Since \(M\) is itself a function of \(Z\), I specify \(Y\) in terms of the constituents of \(M\), i.e. \((rnorm(N) + 2\cdot Z)\). Now, \(Y\) can be a function of \(D\) and \(Z\), which clearly illustrates the exclusion restriction.

Step 3: Declare estimands, i.e. the Complier Average Causal Effect and Average Treatment Effect. In section, I derived \(\beta_{2SLS}\) assuming constant effects, and without ruling out defiers. As a result, my dataset contains compliers, always takers, never takers, and defiers. However, the treatment effect for all of them is the same (\(\beta_1 = 10 \forall i\)). By extension, the \(CACE = ATE\). However, if we relax this assumption (i.e. constant effects), and assume monotonicity (no defiers), the \(CACE \neq ATE\), and we can separately estimate these quantities by specifying subset = D_Z_1 > D_Z_0 (in the case of the problem set: subset = “CO”).

declare_estimand(CACE = mean(((Y_D_1_Z_0 - Y_D_0_Z_0) + (Y_D_1_Z_1 - Y_D_0_Z_1))/2), subset = D_Z_1 > D_Z_0)

and

declare_estimand(ATE = mean(((Y_D_1_Z_0 - Y_D_0_Z_0) + (Y_D_1_Z_1 - Y_D_0_Z_1))/2))

I reproduce this in code for both the example used in section, and the problem set question. I then produce two tables summarizing the mean(estimand) and mean(estimate) for each estimator.

# Section Example (Constant Effects)

iv_design <-
  declare_population(
  N = 2000,
  noise = rnorm(N, mean = 10, sd = 4),
  A = rnorm(N),
  X = rbinom(N, size = 1, prob = 0.48)
  ) +
  declare_potential_outcomes(D ~ as.numeric(pnorm(Z + A) > 0.5),
  assignment_variables = "Z") +
  # Z now affects Y through M
  declare_potential_outcomes(M ~ rnorm(N) + 2 * Z,
  assignment_variables = "Z") +
  # The effect of D on Y is 10 for everyone. Compare this to section where D's impact varies by type
  declare_potential_outcomes(Y ~  noise + 10 * A + 1.25 * X + 10 * D + 2 *
  (rnorm(N) + 2 * Z),
  assignment_variables = c(D, Z)) +
  # E[Y | D=1] and E[Y | D=0] is a weighted average of two quantities; 
  # I subset to D_Z_1 > D_Z_0 because only compliers satisfy D(Z=1) > D(Z=0)
  declare_estimand(CACE = mean(((Y_D_1_Z_0 - Y_D_0_Z_0) + (Y_D_1_Z_1 - Y_D_0_Z_1)
  ) / 2), subset = D_Z_1 > D_Z_0) +
  declare_estimand(ATE = mean(((Y_D_1_Z_0 - Y_D_0_Z_0) + (Y_D_1_Z_1 - Y_D_0_Z_1)
  ) / 2)) +
  declare_assignment(blocks = X, block_prob = c(0.6, 0.4)) +
  declare_estimator(
  Y ~ D + X | Z + X,
  model = iv_robust,
  term = "D",
  estimand = "CACE",
  label = "IV"
  ) +
  declare_estimator(
  Y ~ D + X,
  model = lm_robust,
  term = "D",
  estimand = "ATE",
  label = "OLS"
  )

simulations <- simulate_designs(iv_design, sims = 500)

estimator_performance <- simulations %>% 
  group_by(estimator_label) %>%
  summarise(mean_estimand = mean(estimand),
            mean_estimate = mean(estimate))

estimator_performance %>% knitr::kable(.,
                                        digits = 2,
                                        caption = "Section Example (Exclusion Restriction Violation)")

Section Example (Exclusion Restriction Violation)
estimator_label	mean_estimand	mean_estimate
IV	10	21.65
OLS	10	26.07

# Problem Set Question (Heterogenous effects, Assuming Monotonicity)

pi_CO <- 0.5
pi_AT <- 0.1
pi_NT <- 0.4
pi_DE <- 1 - (pi_CO + pi_AT + pi_NT)

pset_design <-
  declare_population(
  N = 100,
  noise = rnorm(N),
  type = sample(
  c("CO", "AT", "NT", "DE"),
  N,
  replace = TRUE,
  prob = c(pi_CO, pi_AT, pi_NT, pi_DE)
  )
  ) +
  declare_potential_outcomes(D ~ as.numeric(Z == 1 &
  type %in% c("CO", "AT") |
  Z == 0 &
  type %in% c("DE", "AT")),
  assignment_variables = "Z") +
  declare_potential_outcomes(M ~ rnorm(N) + 2 * Z,
  assignment_variables = "Z") +
  declare_potential_outcomes(
  Y ~  noise+-0.5 * D * (type == "CO") +
  0.0 * D * (type == "AT") +
  0.5 * D * (type == "NT") +
  1.0 * D * (type == "DE") +
  2 * (rnorm(N) + 2 * Z),
  assignment_variables = c(D, Z)
  ) +
  declare_estimand(CACE = mean(((Y_D_1_Z_0 - Y_D_0_Z_0) + (Y_D_1_Z_1 - Y_D_0_Z_1)
  ) / 2), subset = type == "CO") +
  declare_estimand(ATE = mean(((Y_D_1_Z_0 - Y_D_0_Z_0) + (Y_D_1_Z_1 - Y_D_0_Z_1)
  ) / 2)) +
  declare_assignment(prob = 0.5, assignment_variable = "Z") +
  declare_estimator(
  Y ~ D | Z,
  model = iv_robust,
  term = "D",
  estimand = "CACE",
  label = "IV"
  ) +
  declare_estimator(
  Y ~ D,
  model = lm_robust,
  term = "D",
  estimand = "ATE",
  label = "OLS"
  )

simulations2 <- simulate_design(pset_design, sims = 500)

estimator_performance2 <- simulations2 %>%
  group_by(estimator_label) %>%
  summarise(mean_estimand = mean(estimand),
  mean_estimate = mean(estimate))


estimator_performance2 %>% knitr::kable(.,
                                        digits = 2,
                                        caption = "Problem Set (Exclusion Restriction Violation)")

Problem Set (Exclusion Restriction Violation)
estimator_label	mean_estimand	mean_estimate
IV	-0.49	7.68
OLS	-0.04	1.89

Monotonicity

The instrument can have no effect on some units but all those affected should be affected in the same way. Formally: \(D_i(1) > D_i(0)\)

Implication

Without monotonicity, \(ITT_Y\) is the weighted difference in the treatment effect among compliers and defiers, and \(ITT_D \neq \pi_c\). To see this, observe that:

\[\begin{equation} E[Y | Z = 1] = E[Y_1 | C]\pi_C + E[Y_1 | AT]\pi_{AT} + E[Y_0 | NT]\pi_{NT} + E[Y_0 | D]\pi_D \end{equation}\] \[\begin{equation} E[Y | Z = 0] = E[Y_0 | C]\pi_C + E[Y_1 | AT]\pi_{AT} + E[Y_0 | NT]\pi_{NT} + E[Y_1 | D]\pi_D \end{equation}\]

So the difference between these quantities, or \(ITT_Y\) is:

\[\begin{equation} E[Y_1 - Y_0 | C]\pi_C - E[Y_1 - Y_0 |D]\pi_D \end{equation}\] Similarly, \[\begin{equation} E[D | Z = 1] - E[D | Z = 0] = (\pi_C + \pi_{AT}) - (\pi_{AT} + \pi_D) = \pi_C - \pi_D \end{equation}\]

Note:

When compliers respond differently to treatment than defiers, the ratio estimator \(\frac{ITT_Y}{ITT_D}\) becomes:

\[\begin{equation} \frac{ITT_Y}{ITT_D} = \frac{E[Y_1 - Y_0 | C]\pi_C - E[Y_1 - Y_0 |D]\pi_D}{(\pi_C - \pi_D)} \end{equation}\]

Now imagine constant effects, i.e. \(E[Y_1 - Y_0 | C] = E[Y_1 - Y_0 |D] = \beta_1\). Then we can rewrite the above as:

\[\begin{equation} \frac{ITT_Y}{ITT_D} = \frac{\beta_1(\pi_C - \pi_D)}{(\pi_C - \pi_D)} = \beta_1 \end{equation}\]

This is why monotonicity was not assumed when I derived the \(\beta_{2SLS}\) under constant effects.

Non-Zero First Stage

The instrument must produce some variation in D. Formally: \(E[D_i | Z_i =1] - E[D_i | Z_i =0] \neq 0\).

Figure 4: Weak First Stage

This is necessary because the Wald estimator is undefined if \(E[D_i | Z_i =1] - E[D_i | Z_i =0] = 0\). Note also that as \(E[D_i | Z_i =1] - E[D_i | Z_i =0] \rightarrow 0\), we amplify any biases in the \(ITT_Y\) due to exclusion restriction violations. The intuition behind this is that when we have exclusion restriction violations, the \(ITT_Y\) overestimates the effect of \(D\) on \(Y\). If we then rescale this estimate by a very small \(\pi_C\), we “blow-up” the bias in the reduced form estimate.

DeclareDesign

To model this in Declare Design, we must do two things: (1) \(Z\) should not determine values of \(D\) beacuse if \(Z \perp D\), we will get a close-to-zero first stage; (2) \(D\) must still be endogenous (i.e. caused by a predictor of \(Y\)).

Here is code using both the section example, and problem set question, illustrating this.

# Section Example

iv_design <-
  declare_population(
  N = 2000,
  noise = rnorm(N, mean = 10, sd = 4),
  A = rnorm(N),
  X = rbinom(N, size = 1, prob = 0.48),
  # Specify Z in the population because it is just a background variable
  Z = X*rbinom(N, size = 1, prob = 0.6) + (1-X)*rbinom(N, size = 1, prob = 0.4)
  ) +
  declare_potential_outcomes(Y ~  noise + 10 * A + 1.25 * X + 10 * D,
  assignment_variables = "D") +
  declare_estimand(ATE = mean(Y_D_1 - Y_D_0)) +
  # D must be endogenous (noise causes D and Y)
  # D must not be caused by Z
  declare_assignment(prob_unit = pnorm(noise), simple = TRUE,
  assignment_variable = "D"
  ) +
  declare_estimator(
  Y ~ D | Z,
  model = iv_robust,
  term = "D",
  estimand = "ATE",
  label = "IV"
  ) +
  declare_estimator(
  Y ~ D,
  model = lm_robust,
  term = "D",
  estimand = "ATE",
  label = "OLS"
  )
simulations <- simulate_designs(iv_design, sims = 500)

estimator_performance <- simulations %>% 
  group_by(estimator_label) %>%
  summarise(mean_estimand = mean(estimand),
            mean_estimate = mean(estimate))

estimator_performance %>% knitr::kable(.,
                                        digits = 2,
                                        caption = "Section Example (Weak First Stage)")

Section Example (Weak First Stage)
estimator_label	mean_estimand	mean_estimate
IV	10	-1.398186e+11
OLS	10	2.069000e+01

# Problem Set Question

pi_CO <- 0.5
pi_AT <- 0.1
pi_NT <- 0.4
pi_DE <- 1 - (pi_CO + pi_AT + pi_NT)

pset_design <-
  declare_population(N = 100, 
                     noise = rnorm(N),
                     type = sample(c("CO", "AT", "NT", "DE"), 
                                   N, 
                                   replace = TRUE, 
                                   prob = c(pi_CO, pi_AT, pi_NT, pi_DE)),
                      # Specify Z in the population because it is just a background variable
                     Z = rbinom(N, size = 1, prob = 0.5)) +
  declare_potential_outcomes(Y ~  noise + 
                               -0.5 * D * (type == "CO") +
                               0.0 * D * (type == "AT") +
                               0.5 * D * (type == "NT") +
                               1.0 * D * (type == "DE"),
                             assignment_variables = "D") +
  declare_estimand(CACE = mean(Y_D_1 - Y_D_0), subset = type == "CO") +
  declare_estimand(ATE = mean(Y_D_1 - Y_D_0)) +
  # Specify D as a function of noise because it is endogenous
  # D should not be a function of Z (this ensures a weak first stage)
  declare_assignment(prob_unit = pnorm(noise), simple = TRUE,
                     assignment_variable = "D") +
  declare_estimator(Y ~ D | Z, 
                    model = iv_robust, 
                    term = "D", 
                    estimand = "CACE", 
                    label = "IV") +
  declare_estimator(Y ~ D, 
                    model = lm_robust, 
                    term = "D", 
                    estimand = "ATE", 
                    label = "OLS")

simulations2 <- simulate_design(pset_design, sims = 500)
# Note: you may end up with some IV estimates being NA
# This happens because when the first stage is very close to 0
# ITT / 0 = Inf. 
# To factor this in, include na.rm = TRUE below

estimator_performance2 <- simulations2 %>%
  group_by(estimator_label) %>%
  summarise(mean_estimand = mean(estimand, na.rm = T),
  mean_estimate = mean(estimate, na.rm = T))


estimator_performance2 %>% knitr::kable(.,
                                        digits = 2,
                                        caption = "Problem Set (Weak First Stage)")

Problem Set (Weak First Stage)
estimator_label	mean_estimand	mean_estimate
IV	-0.50	-1.617827e+11
OLS	-0.05	1.080000e+00

Section 4: Instrumental Variables

Shikhar Singh

3 October 2019

Review

Estimands

Fitted and Residualized \(D_i\)

Use 1: \(\tilde{D_i}\) in the partial regression framework

Use 2: \(\tilde{D_i}\) in regression weights

Use 3: \(\widehat{D_i}\) in propensity score theorem, inverse-probability weighting

Use 3: \(\widehat{D_i}\) in propensity score matching

Instrumental Variables

Deriving \(\beta_{2SLS}\)

2SLS in `R`

IV and OLS

With Heterogenous Effects

Independence

Excludability

Monotonicity

Non-Zero First Stage

Section 4: Instrumental Variables

Shikhar Singh

3 October 2019

Review

Estimands

Fitted and Residualized \(D_i\)

Use 1: \(\tilde{D_i}\) in the partial regression framework

Use 2: \(\tilde{D_i}\) in regression weights

Use 3: \(\widehat{D_i}\) in propensity score theorem, inverse-probability weighting

Use 3: \(\widehat{D_i}\) in propensity score matching

Instrumental Variables

Deriving \(\beta_{2SLS}\)

2SLS in R

IV and OLS

With Heterogenous Effects

Independence

Excludability

Monotonicity

Non-Zero First Stage

2SLS in `R`