In the past few weeks we have covered several different estimands: most commonly, the average treatment effect (or ATE), and most recently, the complier average causal effect (or CACE). By way of review, I will give you a toy dataset, and ask you to estimate the following quantities:
library(tidyverse)
estimands_dataset <- data_frame(
ID = c(1:5),
Y1 = c(5,8,4,6,4),
Y0 = c(4,5,6,5,3),
D1 = c(1,1,0,1,1),
D0 = c(0,0,0,1,0),
Z = c(1,0,1,0,0),
D = D1*Z + D0*(1-Z),
Y_Z1 = Y1*D1 + Y0*(1-D1),
Y_Z0 = Y1*D0 + Y0*(1-D0),
Y = Y_Z1*Z + Y_Z0*(1-Z)
)
estimands_dataset %>% select(-c(D,Y_Z1, Y_Z0,Y)) %>%
knitr::kable(.,
caption = "Exercise 1: Toy Dataset",
col.names = c("ID","Y(D=1)","Y(D=0)","D(Z=1)","D(Z=0)","Z"),
format = "pandoc")
ID | Y(D=1) | Y(D=0) | D(Z=1) | D(Z=0) | Z |
---|---|---|---|---|---|
1 | 5 | 4 | 1 | 0 | 1 |
2 | 8 | 5 | 1 | 0 | 0 |
3 | 4 | 6 | 0 | 0 | 1 |
4 | 6 | 5 | 1 | 1 | 0 |
5 | 4 | 3 | 1 | 0 | 0 |
Answer
First, lets use the switching equation to create three new columns: \(D_i = D_i(1) \times Z_i + D_i(0) \times (1-Z_i)\); \(Y_i(D_i(Z_i=1)) = Y_i(1) \times D_i(1) + Y_i(0) \times (1-D_i(1))\); and \(Y_i(D_i(Z_i=0)) = Y_i(1) \times D_i(0) + Y_i(0) \times (1-D_i(0))\) This gives the following science table:
estimands_dataset %>%
knitr::kable(.,
caption = "Exercise 1: Filling-in columns in the toy dataset",
col.names = c("ID","Y(D=1)","Y(D=0)","D(Z=1)","D(Z=0)","Z", "D","Y(D(Z=1))","Y(D(Z=0))","Y"),
format = "pandoc")
ID | Y(D=1) | Y(D=0) | D(Z=1) | D(Z=0) | Z | D | Y(D(Z=1)) | Y(D(Z=0)) | Y |
---|---|---|---|---|---|---|---|---|---|
1 | 5 | 4 | 1 | 0 | 1 | 1 | 5 | 4 | 5 |
2 | 8 | 5 | 1 | 0 | 0 | 0 | 8 | 5 | 5 |
3 | 4 | 6 | 0 | 0 | 1 | 0 | 6 | 6 | 6 |
4 | 6 | 5 | 1 | 1 | 0 | 1 | 6 | 6 | 6 |
5 | 4 | 3 | 1 | 0 | 0 | 0 | 4 | 3 | 3 |
Next, lets calculate each of the quantities:
R: with(data, mean(Y1-Y0))
R: with(data, mean(Y1[D==1]) - mean(Y0[D==1]))
R: with(data, mean(Y1[D==0]) - mean(Y0[D==0]))
Claim: ATE is a weighted average of ATT and ATC.
\[\begin{equation} ATE = ATT\times P(D_i=1) + ATC\times P(D_i=0) = (\frac{2}{5}\times 1) + (\frac{3}{5}\times \frac{2}{3}) = \frac{2}{5} + \frac{2}{5} = \frac{4}{5} \end{equation}\]R: with(data, mean(Y_Z1) - mean(Y_Z0))
Note: This is the true value of the \(ITT\). Using observable data, we can also get an estimate, \(\widehat{ITT}\). That is called the “reduced form” or the difference-in-means: \(\overline{Y_{Z=1}} - \overline{Y_{Z=0}}\)
\[\begin{equation} \overline{Y_{Z=1}} - \overline{Y_{Z=0}} = \frac{5+6}{2} - \frac{5+6+3}{3} = \frac{11}{2} - \frac{14}{3} \approx 0.833 \end{equation}\]R: with(data, mean(Y[Z==1]) - mean(Y[Z==0]))
Furthermore: We can calculate \(ITT_D = E[D_i | Z_i = 1] - E[D_i | Z_i = 0] = E[D_i(Z_i=1)] - E[D_i(Z_i=0)]\). Which is: \((\frac{1+1+0+1+1}{5}) - (\frac{0+0+0+1+0}{5}) = \frac{4}{5} - \frac{1}{5} = \frac{3}{5} = 0.6\). In R this is: with(data, mean(D1) - mean(D0)). This is different from \(\widehat{ITT_D}\), or the first-stage estimate: \(\overline{D_{Z=1}} - \overline{D_{Z=0}} = \frac{1+0}{2} - \frac{0+1+0}{3} = 0.16667\). In R: with(data, mean(D[Z==1]) - mean(D[Z==0]))
R: with(data, mean(Y1[D1 > D0]) - mean(Y0[D1 > D0]))
Consider an outcome \(Y\), an explanatory variable \(D\), and some confounder \(X\). In the past few weeks, we have used a regression of \(D\) on \(X\) in at least four different ways. As recap, lets go through the various ways in which we used fitted and residualized values of \(D\).
Say, the regression specification is: \[\begin{equation} D_i = \alpha_0 + \alpha_1 \cdot X_i + \epsilon_i \end{equation}\]And the fitted or predicted values are \(\widehat{D_i} = \widehat{\alpha_0} + \widehat{\alpha_1} \cdot X_i\); and the residuals are \(\tilde{D_i} = D_i - (\widehat{\alpha_0} + \widehat{\alpha_1} \cdot X_i)\).
We showed that in the ordinary least squares regression \(Y_i = \beta_0 + \beta_1 \cdot D_i + \beta_2 \cdot X_i + \xi_i\),
\[\begin{equation} \beta_1 = \frac{cov(\tilde{Y},\tilde{D})}{var(\tilde{D})} \end{equation}\]
\(\beta_1\) can be exactly recovered by regressing \(D_i\) on \(X_i\), collecting the residuals \(\tilde{D_i}\), similarly regressing \(Y_i\) on \(X_i\), collecting the residuals \(\tilde{Y_i}\), and running a bivariate regression \(\tilde{Y_i} \sim \tilde{D_i}\).
We know that regression weights observations by the conditional variance of treatment, \(Var(D_i | X_i)\). Aronow and Samii (2016) tell us:
We may estimate each of the multiple regression weights with the estimator \(\widehat{w_i} = \tilde{D_i^2}\) where \(\tilde{D_i}\) is the residual from a regression of \(D_i\) on \(X_i\). (AS 2016:255)
To understand this, consider the expression \(Var(D_i | X_i)\), and re-write it as:
\[\begin{equation} Var(D_i | X_i) = Var(\widehat{D_i} + \tilde{D_i} | X_i) \end{equation}\]We know from the first-order condition while deriving OLS coefficients that \(X_i \perp \epsilon_i\), i.e. \(E[X_i\epsilon_i] = 0\). Since \(\widehat{D_i} = \alpha_0 + \alpha_1 \cdot X_i\) (where \(\alpha_0\) and \(\alpha_1\) are constants, \(\widehat{D_i} \rightarrow f(X_i)\), and \(\tilde{D_i} = \epsilon_i\). So \(\widehat{D_i} \perp \tilde{D_i}\). This means we can split up the variance as follows:
\[\begin{equation} Var(\{\alpha_0 + \alpha_1 \cdot X_i\} | X_i) + Var(\tilde{D_i} | X_i) \end{equation}\]The first quantity simplifies to \(\alpha_1^2 Var(X_i | X_i) = 0\). This leaves:
\[\begin{equation} Var(\tilde{D_i} | X_i) = E[\tilde{D_i}^2 | X_i] - E[\tilde{D_i} | X_i]^2 \end{equation}\]But we know that \(E[\tilde{D_i} | X_i] = E[\epsilon_i | X_i] = 0\). So we are left with \(E[\tilde{D_i}^2]\). This motivates the use of squared-residuals, \(\tilde{D_i^2}\), to estimate regression weights.
The propensity score theorem implies that “the only covariate [we] need to control for is the probability of treatment itself” (Angrist and Pischke, p81).
We know that the propensity score, \(p(X_i) = \widehat{D_i} = \widehat{\alpha_0} + \widehat{\alpha_1} \cdot X_i\). In section 2, I showed that we can recover the average treatment effect by weighting outcomes of treated units by \(\frac{1}{p(X_i)}\) and outcomes of control units by \(\frac{1}{1 - p(X_i)}\). Alternatively, we include \(\widehat{D_i}\) as a covariate in the regression \(Y_i \sim D_i + \widehat{D_i}\).
Finally, last week’s problem set used the propensity score to find matches and estimate the ATT. Ichino and Nathan (2013) have \(n = 169\) observations, 86 of which are treated (i.e. primaries are held). One way to find matches for these 86 observations is to identify an untreated observation with the most similar propensity score. Here are the steps:
Regress \(D_i\) on all the covariates \(X_{1,i}, X_{2,i} \cdots X_{k,i}\). Capture the predicted values \(\widehat{D_i}\).
For every treated unit \(j\), with propensity score \(\widehat{D_j}\), find the nearest untreated subject \(k\) in propensity score space, (i.e. \(min(\widehat{D_j} - \widehat{D_k})^2\)). This produces \(j\) pairs, and \(2j\) units for the analysis.
Estimate the “treatment effect” within each pair, aggregate these estimates, weighting by the probability of treatment (specifically: \(\frac{P(D_i = 1 | X_i = x) \cdot P(X_i =x)}{\sum_x P(D_i = 1 | X_i = x) \cdot P(X_i =x)}\)).
library(Matching)
library(tidyverse)
library(estimatr)
# Upload the dataset
ndc <- read.csv("primariesdata_NDC.csv")
# Logistic regression of D on all Xs:
prop_model <-
glm(
prim_elec ~ pres_t_1 + ethfrac + year2008 + holds1 + incumbent,
family = "binomial",
data = ndc
)
# Extracted the predicted values or "propensity scores"
ndc$propensityscore <- prop_model$fitted
# Use the Match package, specify X = propensityscore,
# M = 1 (i.e. find only 1 match for each treated unit)
prop_score_estimate <-
with(ndc, Match(Y = winner_t, Tr = prim_elec, X = propensityscore, M = 1))
summary(prop_score_estimate)
##
## Estimate... 0.075581
## AI SE...... 0.089403
## T-stat..... 0.8454
## p.val...... 0.39789
##
## Original number of observations.............. 169
## Original number of treated obs............... 86
## Matched number of observations............... 86
## Matched number of observations (unweighted). 90
I want to draw your attention to two facts about matching: (1) we drop untreated observations that are not matched with any treated observation; and (2) we can recycle untreated observations if they are the best match for multiple treated observations.
Here are the treated observations referenced by their row number. Note that unique(\(\cdot\)) = 86 . Match does recycle some treated observations if they have more than 1 equally good matches, then down-weights these observations to maintain \(n_{D=1} = 86\)
prop_score_estimate$index.treated # 86 unique values
## [1] 1 2 3 11 14 15 16 17 19 20 23 24 25 27 29 32 35
## [18] 39 41 42 45 56 57 63 64 65 66 67 69 70 71 72 74 75
## [35] 78 79 80 81 86 86 87 88 90 93 95 97 98 99 99 100 103
## [52] 104 106 107 108 110 112 113 117 118 119 127 128 129 131 133 134 136
## [69] 137 138 142 143 144 147 148 149 151 151 152 153 154 155 156 158 158
## [86] 160 164 167 168 169
And here are the “matched” untreated observations, also referenced by their row number. Observe that unique(\(\cdot\)) = 51 . In other words, some untreated observations are picked as matches multiple times (e.g.: 123 is matched with 2 and 169).
prop_score_estimate$index.control # 51 unique values
## [1] 58 123 22 140 139 52 6 55 34 132 33 50 96 6 146 161 68
## [18] 82 36 77 141 116 165 123 82 139 13 139 150 38 76 18 146 31
## [35] 13 139 33 13 125 145 37 34 83 146 44 89 18 9 53 84 62
## [52] 105 157 13 48 55 162 31 92 123 47 37 18 73 92 28 150 121
## [69] 83 139 83 94 146 124 139 83 125 145 83 38 84 135 146 125 145
## [86] 83 46 146 50 123
Consider a treatment \(D\) (education), and its impact on an outcome \(Y\) (wages). This association is confounded by an unobservable variable, \(A\) (ability). Since the effect of education on wages is unidentified, we search for an instrument. We find that an educational body provided college college scholarships (\(Z\)) to randomly selected high school students. It, however, awarded scholarships with different probabilities to male and female students. Here is a long-form regression and a DAG that captures this information:
\[\begin{equation} Y_i = \beta_0 + \beta_1 D_i + \beta_2 X_i + \beta_3 A_i + \xi_{0,i} \end{equation}\]Where \(D_i\) is a dummy variable that equals 1 if the subject finished college, and \(\beta_3 A_i + \xi_{1,i}\) is the composite error term because \(A_i\) is unobserved.
We will show: \[\begin{equation} \beta_{2SLS} = \frac{cov(\tilde{Y_i},\tilde{Z_i})}{cov(\tilde{D_i},\tilde{Z_i})} = \frac{\frac{cov(\tilde{Y_i},\tilde{Z_i})}{var(\tilde{Z_i})}}{\frac{cov(\tilde{D_i},\tilde{Z_i})}{var(\tilde{Z_i})}} = \frac{\beta_{Y \cdot Z + X}}{\beta_{D \cdot Z + X}} \end{equation}\]\(\beta_{2SLS}\) is a ratio of the regression coefficient from the reduced form and first stage.
Consider the “first-stage” regression:
\[\begin{equation} D_i = \alpha_0 + \alpha_1 X_i + \alpha_2 Z_i + \xi_{1,i} \end{equation}\]Plug-in \(D_i\) from this equation into the long-form regression:
\[\begin{equation} Y_i = \beta_0 + \beta_1(\alpha_0 + \alpha_1 X_i + \alpha_2 Z_i + \xi_{1,i}) + \beta_2 X_i + \beta_3 A_i + \xi_{0,i} \end{equation}\]And re-arrange the terms so we get:
\[\begin{equation} Y_i = (\beta_0 + \beta_1 \alpha_0) + X_i(\beta_1 \alpha_1 + \beta_2) + Z_i(\beta_1 \alpha_2) + \{ \beta_1 \xi_{1,i} + \beta_3 A_i + \xi_{0,i}\} \end{equation}\]We now have a regression of the type: \(Y_i = \pi_0 + \pi_1 X_i + \pi_2 Z_i + \xi_{2,i}\) (or the “reduced form”). Focusing on the coefficients
\[\begin{equation} \pi_1 = (\beta_1 \alpha_1) + \beta_2 \end{equation}\]\(\beta_1 \alpha_1\) is the “indirect” effect of \(X_i\) on \(Y_i\), and \(\beta_2\) is the “direct” effect of \(X_i\) on \(Y_i\).
We are interested in isolating \(\beta_1\) or the effect of schooling on wages:
\[\begin{equation} \pi_2 = (\beta_1 \alpha_2) \Rightarrow \beta_1 = \frac{\pi_2}{\alpha_2} \end{equation}\]Note that \(\pi_2\) captures the variation in \(Y_i\) caused by \(Z_i\) solely through \(D_i\), and \(\alpha_2\) captures the variation in \(D_i\) caused by \(Z_i\). Since \(X_i\) is a covariate in both steps, the partial regression framework tells us:
\[\begin{equation} \pi_2 = \frac{cov(\tilde{Y_i},\tilde{Z_i})}{var(\tilde{Z_i})} \end{equation}\] \[\begin{equation} \alpha_2 = \frac{cov(\tilde{D_i},\tilde{Z_i})}{var(\tilde{Z_i})} \end{equation}\]Where \(\tilde{D_i}\) is residualized D by partialling-out the variation caused by \(X_i\), \(\tilde{Y_i}\) is the residual from \(Y_i \sim X_i\), and \(\tilde{Z_i}\) from \(Z_i \sim X_i\).
Finally, if we re-arrange the terms in another way we can see what the second-stage looks like:
\[\begin{equation} Y_i = (\beta_0 + \beta_1 \alpha_0) + \beta_2 X_i + \beta_1(\alpha_1 X_i + \alpha_2 Z_i) + \{ \beta_1 \xi_{1,i} + \beta_3 A_i + \xi_{0,i}\} \end{equation}\]And \(\alpha_1 X_i + \alpha_2 Z_i\) are just the fitted values from the first-stage (\(D_i \sim Z_i + X_i\)).
This is why we do the estimation in two stages:
Regress \(D_i \sim Z_i + X_i\), and get the fitted values \(\widehat{D_i}\)
Estimate the “second stage” using the fitted values from the first stage: \(Y_i \sim \widehat{D_i} + X_i\)
Note: In deriving \(\beta_{2SLS}\) we relied on five assumptions: constant effects, conditional independence or “exogeneity” (\(Y_i(1),Y_i(0) \perp Z_i | X_i\)), excludability (\(Y_i(d,z) = Y_i(d)\)), and non-zero first stage. We did not need monotonicity (\(D_i(1) > D_i(0)\)).
Let’s confirm the two-stage least squares procedure by doing it manually, then checking if we get the same result by specifying an instrumental variables regression in iv_robust. Here is what I do in the attached code:
Step 1: Create data in line with the DAG and system of equations specified above.
Step 2: Run the “first stage” (\(D_i \sim Z_i + X_i\)) regression, collect fitted values \(\widehat{D_i}\), and plug these into the second stage regression \(Y_i \sim \widehat{D_i} + X_i\).
Step 3: Specify the IV setup iv_robust(Y ~ D + X | Z + X, dat) . Note a few things: (a) we write this in the outcome ~ endogenous regressor + covariate | instrument + covariate format; (b) covariates are included in both stages (not doing so will produce biased estimates).
Step 4: Printing a htmlreg table with all three models.
# STEP 1: Create data matching the above specification
library(DeclareDesign)
iv_design <-
declare_population(N = 2000,
noise = rnorm(N, mean = 10, sd = 4),
A = rnorm(N),
X = rbinom(N, size = 1, prob = 0.48)) +
declare_potential_outcomes(D ~ as.numeric(pnorm(Z + A) > 0.5),
assignment_variables = "Z") +
declare_potential_outcomes(Y ~ noise + 10*A + 1.25*X + 10*D,
assignment_variables = "D") +
declare_estimand(ATE = mean(Y_D_1 - Y_D_0)) +
declare_assignment(blocks = X, block_prob = c(0.6, 0.4),
assignment_variable = "Z") +
declare_estimator(Y ~ D + X | Z + X,
model = iv_robust,
term = "D",
estimand = "ATE",
label = "IV")
dat <- iv_design %>% draw_data
# STEP 2: Two stage least squares done manually
first_stage = lm_robust(D ~ Z + X, data = dat)
dat$D_hat <- first_stage$fitted.values
second_stage = lm_robust(Y ~ D_hat + X, data = dat)
# Now using iv_robust
iv_model <- iv_robust(Y ~ D + X | Z + X, dat)
# Putting it together in a table:
texreg::htmlreg(list(first_stage, second_stage, iv_model),
stars = c(0.001, 0.01, 0.05),
custom.coef.names = c("(Intercept)", "Z", "X",
"Fitted-D", "D"),
custom.model.names = c("First Stage", "Second Stage", "IV Robust"),
digits = 3,
caption = "Two-Stage Least Squares: Manually and Using iv_robust",
include.ci = FALSE
)
First Stage | Second Stage | IV Robust | ||
---|---|---|---|---|
(Intercept) | 0.487*** | 9.041*** | 9.041*** | |
(0.020) | (1.403) | (1.012) | ||
Z | 0.359*** | |||
(0.020) | ||||
X | -0.004 | 1.477* | 1.477** | |
(0.020) | (0.664) | (0.491) | ||
Fitted-D | 11.467*** | |||
(1.846) | ||||
D | 11.467*** | |||
(1.363) | ||||
R2 | 0.145 | 0.019 | 0.466 | |
Adj. R2 | 0.144 | 0.018 | 0.465 | |
Num. obs. | 2000 | 2000 | 2000 | |
RMSE | 0.436 | 14.471 | 10.681 | |
p < 0.001, p < 0.01, p < 0.05 |
Why are the standard errors different?
When we do the two-stage least squares manually, the residuals in the second stage are \(\beta_1 \xi_{1,i} + \beta_3 A_i + \xi_{0,i}\). However, in the long-form regression they were \(\beta_3 A_i + \xi_{0,i}\). We are trying to estimate \(\beta_1\) from the long-form regression, so we must use \(\beta_3 A_i + \xi_{0,i}\), not \(\color{red}{\beta_1 \xi_{1,i}} + \beta_3 A_i + \xi_{0,i}\). The computer makes this correction while estimating standard errors, something we may miss doing if we manually do the procedure.
Now lets compare the ordinary least squares regression to the two-stage least squares estimator. In DeclareDesign, I specify both estimators, sims = 500, and plot the resulting sampling distribution using ggplot2.
# Step 1: Specify both estimators
iv_design <-
declare_population(N = 2000,
noise = rnorm(N, mean = 10, sd = 4),
A = rnorm(N),
X = rbinom(N, size = 1, prob = 0.48)) +
declare_potential_outcomes(D ~ as.numeric(pnorm(Z + A) > 0.5),
assignment_variables = "Z") +
declare_potential_outcomes(Y ~ noise + 10*A + 1.25*X + 10*D,
assignment_variables = "D") +
declare_estimand(ATE = mean(Y_D_1 - Y_D_0)) +
declare_assignment(blocks = X, block_prob = c(0.6, 0.4),
assignment_variable = "Z") +
declare_estimator(Y ~ D + X | Z + X,
model = iv_robust,
term = "D",
estimand = "ATE",
label = "IV")+
declare_estimator(Y ~ D + X,
model = lm_robust,
term = "D",
estimand = "ATE",
label = "OLS")
simulations <- simulate_designs(iv_design, sims = 500)
estimator_performance <- simulations %>%
group_by(estimator_label) %>%
summarise(mean_estimand = mean(estimand),
mean_estimate = mean(estimate)) %>%
gather(key = "Quantity", value = "Value",
mean_estimand, mean_estimate)
## Step 2: Creating a figure using ggplot2
library(ggthemes)
figure8 <- ggplot(data = simulations,
aes(x = estimate)) +
xlab("Estimate") + ylab("Count") +
geom_histogram(bins = 30) +
facet_wrap(~ estimator_label)+
geom_vline(
data = estimator_performance,
aes(xintercept = Value, color = Quantity),
size = 0.5,
linetype = "dashed"
) +
theme_economist() +
theme(legend.position = "bottom") +
ggtitle(label = "Sampling Distribution of Different Estimators")
figure8
So far we have assumed that \(\beta_1\) (effect of \(D_i\) on \(Y_i\)) is constant. That is: \(Y_i(1) - Y_i(0) = \beta_1\) \(\forall i\). We can relax this assumption but would need monotonicity to identify the complier average causal effect (LATE). In this section, I briefly describe each assumption, why we need it, and how we can break it.
First, lets define the estimand:
\[\begin{equation} CACE = \frac{ITT_Y}{ITT_D} = \frac{E[Y_i | Z_i = 1] - E[Y_i | Z_i = 0]}{E[D_i | Z_i = 1] - E[D_i | Z_i = 0]} = E[Y_i(1) - Y_i(0) | D_i(1) > D_i(0)] \end{equation}\]\(Z_i\) is as good as randomly assigned. Values of \(Z_i\) do not covary with potential outcomes or the type of response to treatment assignment. \(\{Y_i(d,z) \forall d,z\}, D_i(Z=1), D_i(Z=0) \perp Z_i\)
Implication
If the instrument \(Z_i\) is not exogenously determined, or conditionally independent, we still have confounding (i.e. open backdoor paths connecting \(Z\) and \(Y\)). Visually, imagine the DAG we had earlier but now there is some unobserved variable \(u\) that causes \(Y\) and \(Z\). In such circumstances, when we estimate the reduced form (\(Y_i \sim Z_i\)), the backdoor path \(Z \leftarrow u \rightarrow Y\) will be unblocked.
DeclareDesign
We would violate exogeneity by defining the assignment procedure, declare_assignment(\(\cdot\)), in terms of noise (some unobservable factor that causes \(Y\)). For example:
declare_assignment(prob_unit = 0.4 + 0.2*X - 0.25*as.numeric(noise <= 10), simple = TRUE)
Potential outcomes respond to treatment status \(D_i\), not treatment assignment \(Z_i\). The instrument operates through a single known causal channel. Formally: \(Y_i(d,0) = Y_i(d,1) = Y_i(d) \forall d\)
Implication: If the instrument affects outcomes through causal channels other than \(D_i\), then the reduced form overestimates the effect of \(D\) on \(Y\). For example, if \(Z\) affects \(Y\) through \(M\) and \(D\), \(ITT_Y\) combines the effect of both channels.
DeclareDesign
We would violate excludability by defining potential outcomes in terms of Z and D. Excludability implies that \(Y_i(d_i,z_i) = Y_i(d_i)\) (i.e. if we fix \(i\)’s treatment status \(d_i\), then \(i\)’s outcome will not change because of any changes in their assignment status \(z_i\)). An excludability violation can take two forms: either \(Z_i\) directly affects \(Y_i\), or \(Z_i\) affects \(Y_i\) through a mediator other than \(D_i\) (\(M_i\) in the DAG above).
What this implies is that potential outcomes are no longer solely a function of \(D_i\). Take the case of \(Z_i\) directly affecting \(Y_i\). We now have four potential outcomes: \(Y_i(Z=1,D=1)\), \(Y_i(Z=1,D=0)\), \(Y_i(Z=0,D=1)\), and \(Y_i(Z=0,D=0)\). If we specify a mediator \(M_i\), these become: \(Y_i(M=1,D=1)\), \(Y_i(M=1,D=0)\), \(Y_i(M=0,D=1)\), and \(Y_i(M=0,D=0)\). Since \(M\) is itself a potential outcome with two values - \(M_i(Z=1)\) and \(M_i(Z=0)\) - we can re-write the four potential outcomes as a function of \(D\) and \(Z\).
This means we have a slightly complicated estimand.The effect of \(D\) on \(Y\) is \(E[Y|D=1] - E[Y| D =0]\). Think of \(E[Y|D=1]\) as being a weighted average of two quantities: \(E[Y|Z=1, D=1]\) and \(E[Y|Z=0, D=1]\). Similarly, \(E[Y|D=0]\) is a weighted average of \(E[Y|Z=1, D=0]\), \(E[Y|Z=0, D=0]\). So the ATE is essentially the average of \(E[Y| D=1, Z=0] - E[Y| D=0, Z=0]\) and \(E[Y| D=1, Z=1] - E[Y| D=0, Z=1]\) (i.e. fixing \(Z\), and only moving \(D\) we calculate the effect for each value of \(Z\)). In DeclareDesign this looks as follows:
Step 1: Declare a mediator \(M\) that opens an alternative causal pathway from \(Z\) to \(Y\):
declare_potential_outcomes(M ~ rnorm(N) + 2*Z, assignment_variables = “Z”)
Step 2: Write \(Y\) as a function of \(D\) and \(M\), remember to specify two assignment variables (\(Z\) and \(D\)):
declare_potential_outcomes(Y ~ noise + 10*A + 1.25*X + 10*D + 2*(rnorm(N) + 2*Z),assignment_variables = c(D,Z))
Note: If we specified \(M\) instead of \((rnorm(N) + 2\cdot Z)\), \(Y\) would have to be a function of \(D\) and \(M\). Since \(M\) is itself a function of \(Z\), I specify \(Y\) in terms of the constituents of \(M\), i.e. \((rnorm(N) + 2\cdot Z)\). Now, \(Y\) can be a function of \(D\) and \(Z\), which clearly illustrates the exclusion restriction.
Step 3: Declare estimands, i.e. the Complier Average Causal Effect and Average Treatment Effect. In section, I derived \(\beta_{2SLS}\) assuming constant effects, and without ruling out defiers. As a result, my dataset contains compliers, always takers, never takers, and defiers. However, the treatment effect for all of them is the same (\(\beta_1 = 10 \forall i\)). By extension, the \(CACE = ATE\). However, if we relax this assumption (i.e. constant effects), and assume monotonicity (no defiers), the \(CACE \neq ATE\), and we can separately estimate these quantities by specifying subset = D_Z_1 > D_Z_0 (in the case of the problem set: subset = “CO”).
declare_estimand(CACE = mean(((Y_D_1_Z_0 - Y_D_0_Z_0) + (Y_D_1_Z_1 - Y_D_0_Z_1))/2), subset = D_Z_1 > D_Z_0)
and
declare_estimand(ATE = mean(((Y_D_1_Z_0 - Y_D_0_Z_0) + (Y_D_1_Z_1 - Y_D_0_Z_1))/2))
I reproduce this in code for both the example used in section, and the problem set question. I then produce two tables summarizing the mean(estimand) and mean(estimate) for each estimator.
# Section Example (Constant Effects)
iv_design <-
declare_population(
N = 2000,
noise = rnorm(N, mean = 10, sd = 4),
A = rnorm(N),
X = rbinom(N, size = 1, prob = 0.48)
) +
declare_potential_outcomes(D ~ as.numeric(pnorm(Z + A) > 0.5),
assignment_variables = "Z") +
# Z now affects Y through M
declare_potential_outcomes(M ~ rnorm(N) + 2 * Z,
assignment_variables = "Z") +
# The effect of D on Y is 10 for everyone. Compare this to section where D's impact varies by type
declare_potential_outcomes(Y ~ noise + 10 * A + 1.25 * X + 10 * D + 2 *
(rnorm(N) + 2 * Z),
assignment_variables = c(D, Z)) +
# E[Y | D=1] and E[Y | D=0] is a weighted average of two quantities;
# I subset to D_Z_1 > D_Z_0 because only compliers satisfy D(Z=1) > D(Z=0)
declare_estimand(CACE = mean(((Y_D_1_Z_0 - Y_D_0_Z_0) + (Y_D_1_Z_1 - Y_D_0_Z_1)
) / 2), subset = D_Z_1 > D_Z_0) +
declare_estimand(ATE = mean(((Y_D_1_Z_0 - Y_D_0_Z_0) + (Y_D_1_Z_1 - Y_D_0_Z_1)
) / 2)) +
declare_assignment(blocks = X, block_prob = c(0.6, 0.4)) +
declare_estimator(
Y ~ D + X | Z + X,
model = iv_robust,
term = "D",
estimand = "CACE",
label = "IV"
) +
declare_estimator(
Y ~ D + X,
model = lm_robust,
term = "D",
estimand = "ATE",
label = "OLS"
)
simulations <- simulate_designs(iv_design, sims = 500)
estimator_performance <- simulations %>%
group_by(estimator_label) %>%
summarise(mean_estimand = mean(estimand),
mean_estimate = mean(estimate))
estimator_performance %>% knitr::kable(.,
digits = 2,
caption = "Section Example (Exclusion Restriction Violation)")
estimator_label | mean_estimand | mean_estimate |
---|---|---|
IV | 10 | 21.65 |
OLS | 10 | 26.07 |
# Problem Set Question (Heterogenous effects, Assuming Monotonicity)
pi_CO <- 0.5
pi_AT <- 0.1
pi_NT <- 0.4
pi_DE <- 1 - (pi_CO + pi_AT + pi_NT)
pset_design <-
declare_population(
N = 100,
noise = rnorm(N),
type = sample(
c("CO", "AT", "NT", "DE"),
N,
replace = TRUE,
prob = c(pi_CO, pi_AT, pi_NT, pi_DE)
)
) +
declare_potential_outcomes(D ~ as.numeric(Z == 1 &
type %in% c("CO", "AT") |
Z == 0 &
type %in% c("DE", "AT")),
assignment_variables = "Z") +
declare_potential_outcomes(M ~ rnorm(N) + 2 * Z,
assignment_variables = "Z") +
declare_potential_outcomes(
Y ~ noise+-0.5 * D * (type == "CO") +
0.0 * D * (type == "AT") +
0.5 * D * (type == "NT") +
1.0 * D * (type == "DE") +
2 * (rnorm(N) + 2 * Z),
assignment_variables = c(D, Z)
) +
declare_estimand(CACE = mean(((Y_D_1_Z_0 - Y_D_0_Z_0) + (Y_D_1_Z_1 - Y_D_0_Z_1)
) / 2), subset = type == "CO") +
declare_estimand(ATE = mean(((Y_D_1_Z_0 - Y_D_0_Z_0) + (Y_D_1_Z_1 - Y_D_0_Z_1)
) / 2)) +
declare_assignment(prob = 0.5, assignment_variable = "Z") +
declare_estimator(
Y ~ D | Z,
model = iv_robust,
term = "D",
estimand = "CACE",
label = "IV"
) +
declare_estimator(
Y ~ D,
model = lm_robust,
term = "D",
estimand = "ATE",
label = "OLS"
)
simulations2 <- simulate_design(pset_design, sims = 500)
estimator_performance2 <- simulations2 %>%
group_by(estimator_label) %>%
summarise(mean_estimand = mean(estimand),
mean_estimate = mean(estimate))
estimator_performance2 %>% knitr::kable(.,
digits = 2,
caption = "Problem Set (Exclusion Restriction Violation)")
estimator_label | mean_estimand | mean_estimate |
---|---|---|
IV | -0.49 | 7.68 |
OLS | -0.04 | 1.89 |
The instrument can have no effect on some units but all those affected should be affected in the same way. Formally: \(D_i(1) > D_i(0)\)
Implication
Without monotonicity, \(ITT_Y\) is the weighted difference in the treatment effect among compliers and defiers, and \(ITT_D \neq \pi_c\). To see this, observe that:
\[\begin{equation} E[Y | Z = 1] = E[Y_1 | C]\pi_C + E[Y_1 | AT]\pi_{AT} + E[Y_0 | NT]\pi_{NT} + E[Y_0 | D]\pi_D \end{equation}\] \[\begin{equation} E[Y | Z = 0] = E[Y_0 | C]\pi_C + E[Y_1 | AT]\pi_{AT} + E[Y_0 | NT]\pi_{NT} + E[Y_1 | D]\pi_D \end{equation}\]So the difference between these quantities, or \(ITT_Y\) is:
\[\begin{equation} E[Y_1 - Y_0 | C]\pi_C - E[Y_1 - Y_0 |D]\pi_D \end{equation}\] Similarly, \[\begin{equation} E[D | Z = 1] - E[D | Z = 0] = (\pi_C + \pi_{AT}) - (\pi_{AT} + \pi_D) = \pi_C - \pi_D \end{equation}\]Note:
When compliers respond differently to treatment than defiers, the ratio estimator \(\frac{ITT_Y}{ITT_D}\) becomes:
\[\begin{equation} \frac{ITT_Y}{ITT_D} = \frac{E[Y_1 - Y_0 | C]\pi_C - E[Y_1 - Y_0 |D]\pi_D}{(\pi_C - \pi_D)} \end{equation}\]Now imagine constant effects, i.e. \(E[Y_1 - Y_0 | C] = E[Y_1 - Y_0 |D] = \beta_1\). Then we can rewrite the above as:
\[\begin{equation} \frac{ITT_Y}{ITT_D} = \frac{\beta_1(\pi_C - \pi_D)}{(\pi_C - \pi_D)} = \beta_1 \end{equation}\]This is why monotonicity was not assumed when I derived the \(\beta_{2SLS}\) under constant effects.
The instrument must produce some variation in D. Formally: \(E[D_i | Z_i =1] - E[D_i | Z_i =0] \neq 0\).
This is necessary because the Wald estimator is undefined if \(E[D_i | Z_i =1] - E[D_i | Z_i =0] = 0\). Note also that as \(E[D_i | Z_i =1] - E[D_i | Z_i =0] \rightarrow 0\), we amplify any biases in the \(ITT_Y\) due to exclusion restriction violations. The intuition behind this is that when we have exclusion restriction violations, the \(ITT_Y\) overestimates the effect of \(D\) on \(Y\). If we then rescale this estimate by a very small \(\pi_C\), we “blow-up” the bias in the reduced form estimate.
DeclareDesign
To model this in Declare Design, we must do two things: (1) \(Z\) should not determine values of \(D\) beacuse if \(Z \perp D\), we will get a close-to-zero first stage; (2) \(D\) must still be endogenous (i.e. caused by a predictor of \(Y\)).
Here is code using both the section example, and problem set question, illustrating this.
# Section Example
iv_design <-
declare_population(
N = 2000,
noise = rnorm(N, mean = 10, sd = 4),
A = rnorm(N),
X = rbinom(N, size = 1, prob = 0.48),
# Specify Z in the population because it is just a background variable
Z = X*rbinom(N, size = 1, prob = 0.6) + (1-X)*rbinom(N, size = 1, prob = 0.4)
) +
declare_potential_outcomes(Y ~ noise + 10 * A + 1.25 * X + 10 * D,
assignment_variables = "D") +
declare_estimand(ATE = mean(Y_D_1 - Y_D_0)) +
# D must be endogenous (noise causes D and Y)
# D must not be caused by Z
declare_assignment(prob_unit = pnorm(noise), simple = TRUE,
assignment_variable = "D"
) +
declare_estimator(
Y ~ D | Z,
model = iv_robust,
term = "D",
estimand = "ATE",
label = "IV"
) +
declare_estimator(
Y ~ D,
model = lm_robust,
term = "D",
estimand = "ATE",
label = "OLS"
)
simulations <- simulate_designs(iv_design, sims = 500)
estimator_performance <- simulations %>%
group_by(estimator_label) %>%
summarise(mean_estimand = mean(estimand),
mean_estimate = mean(estimate))
estimator_performance %>% knitr::kable(.,
digits = 2,
caption = "Section Example (Weak First Stage)")
estimator_label | mean_estimand | mean_estimate |
---|---|---|
IV | 10 | -1.398186e+11 |
OLS | 10 | 2.069000e+01 |
# Problem Set Question
pi_CO <- 0.5
pi_AT <- 0.1
pi_NT <- 0.4
pi_DE <- 1 - (pi_CO + pi_AT + pi_NT)
pset_design <-
declare_population(N = 100,
noise = rnorm(N),
type = sample(c("CO", "AT", "NT", "DE"),
N,
replace = TRUE,
prob = c(pi_CO, pi_AT, pi_NT, pi_DE)),
# Specify Z in the population because it is just a background variable
Z = rbinom(N, size = 1, prob = 0.5)) +
declare_potential_outcomes(Y ~ noise +
-0.5 * D * (type == "CO") +
0.0 * D * (type == "AT") +
0.5 * D * (type == "NT") +
1.0 * D * (type == "DE"),
assignment_variables = "D") +
declare_estimand(CACE = mean(Y_D_1 - Y_D_0), subset = type == "CO") +
declare_estimand(ATE = mean(Y_D_1 - Y_D_0)) +
# Specify D as a function of noise because it is endogenous
# D should not be a function of Z (this ensures a weak first stage)
declare_assignment(prob_unit = pnorm(noise), simple = TRUE,
assignment_variable = "D") +
declare_estimator(Y ~ D | Z,
model = iv_robust,
term = "D",
estimand = "CACE",
label = "IV") +
declare_estimator(Y ~ D,
model = lm_robust,
term = "D",
estimand = "ATE",
label = "OLS")
simulations2 <- simulate_design(pset_design, sims = 500)
# Note: you may end up with some IV estimates being NA
# This happens because when the first stage is very close to 0
# ITT / 0 = Inf.
# To factor this in, include na.rm = TRUE below
estimator_performance2 <- simulations2 %>%
group_by(estimator_label) %>%
summarise(mean_estimand = mean(estimand, na.rm = T),
mean_estimate = mean(estimate, na.rm = T))
estimator_performance2 %>% knitr::kable(.,
digits = 2,
caption = "Problem Set (Weak First Stage)")
estimator_label | mean_estimand | mean_estimate |
---|---|---|
IV | -0.50 | -1.617827e+11 |
OLS | -0.05 | 1.080000e+00 |