Identification using the backdoor criterion

Figure 1: Identification, Estimation, Aggregation

Last week we discussed Pearl’s backdoor criterion, and identification of parameters by closing all backdoor paths that connect the exposure variable (\(Z\)) to the outcome (\(Y\)). Here is a recap of the two conditions we need to satisfy:

  1. All back-door paths from \(Z\) to \(Y\) should be blocked, after conditioning on \(C\). We do this by “writing down back-door paths from [Z] to [Y], determine which ones are unblocked, and then search for a candidate conditioning set of observed variables that will block all unblocked back-door paths” (Morgan and Winship, 2015:109).

  2. No variable in the conditioning set should be a descendant of \(Z\) that lies on (or descends from a variable on) a directed path from \(Z\) to \(Y\). We do this to “verify that the variables in the conditioning set do not block or otherwise adjust away any portion of the causal effect of interest” (Morgan and Winship, 2015:109).

The fundamental intuition behind conditioning is that within each stratum of \(C\), treatment is as-if randomly assigned:

\[\begin{equation} \{Y_i(1), Y_i(0), X_i\} \perp Z_i | C_i = c \end{equation}\]

Last Week’s Case

Say we have a causal diagram or model of the world that specifies the following dependencies:

Figure 2: Underlying Model

The conditioning set here is \(C\). We can identify the effect of \(Z\) on \(Y\) after controlling for \(C\). But as we discovered, merely including \(C\) in a regression specification does not solve the problem. We must properly specify how \(C\) affects \(Y\) or, as we shall see in this section, how \(C\) affects \(Z\). The good news is: get either of those right, and you can identify the effect of \(Z\) on \(Y\).
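As an aside, we can check the backdoor logic mechanically. The sketch below uses the dagitty package and assumes the Figure 2 model is simply \(C \rightarrow Z\), \(C \rightarrow Y\), and \(Z \rightarrow Y\) (my reading of the figure); it is illustrative, not part of the original design code.

library(dagitty)

# DAG structure assumed from Figure 2: C confounds the Z -> Y relationship
g <- dagitty("dag { C -> Z; C -> Y; Z -> Y }")

# Which sets of observed variables block all backdoor paths from Z to Y?
adjustmentSets(g, exposure = "Z", outcome = "Y")
# Expected output: { C }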

As a recap, and to start things off, let's look at the data generated from this model:

library(DeclareDesign)
library(tidyverse)

confounding_design <-
  declare_population(
    N = 1000,
    # C is a categorical variable with 5 levels
    C = sample(c(1:5), N, replace = TRUE),
    C_squared = C ^ 2,
    # Idiosyncratic variation in outcomes generated here
    noise = rnorm(N, mean = 0, sd = 1)
  ) +
  # Y is a function of Z, C, and noise
  declare_potential_outcomes(Y ~ 0.5 * Z - 1.5 * C + 1.2 * C ^ 2 + 7 + noise) +
  # We are interested in the causal effect of Z on Y
  declare_estimand(ATE = mean(Y_Z_1 - Y_Z_0)) +
  # Z is a function of C (unequal probabilities of assignment)
  # Note: C also causes Y, so the independence assumption breaks down
  declare_assignment(blocks = C,
                     block_prob = c(0.9, 0.5, 0.1, 0.3, 0.7)) +
  # Estimator when we condition on C
  declare_estimator(Y ~ Z + C,
                    model = lm_robust,
                    estimand = "ATE",
                    label = "OLS conditioning on C") +
  # Estimator when we also condition on C^2
  declare_estimator(Y ~ Z + C + C_squared,
                    model = lm_robust,
                    estimand = "ATE",
                    label = "OLS with quadratic term")

dat <- confounding_design %>% draw_data()

library(ggplot2)
library(ggthemes)

dat_plot <- ggplot(dat, aes(x = C, y = Y, color = as.factor(Z))) +
  geom_point(position = "jitter", alpha = 0.2) +
  geom_smooth(method = "lm_robust") +
  theme_economist_white(gray_bg = TRUE) +
  ggtitle("Illustrative Case: A Draw of Data")

dat_plot


Further, when we simulate the two estimators:

\[\begin{equation} Y_i \sim Z_i + C_i \end{equation}\] \[\begin{equation} Y_i \sim Z_i + C_i + C_i^2 \end{equation}\]

We find that conditioning on \(C\) alone does not get us the right answer on average. However, conditioning on \(C\) and \(C^2\) does get us pretty close to the true average treatment effect.

simulations <- simulate_design(confounding_design, sims = 300)

estimator_performance <- simulations %>%
  group_by(estimator_label) %>%
  summarise(mean_estimand = mean(estimand),
  mean_estimate = mean(estimate))

estimator_performance %>% knitr::kable(
  caption = "Average Estimates from Different Estimators",
  col.names = c("Estimator", "Mean Estimand", "Mean Estimate"),
  format = "pandoc"
)
Average Estimates from Different Estimators

Estimator                  Mean Estimand   Mean Estimate
------------------------   -------------   -------------
OLS conditioning on C                0.5        2.661663
OLS with quadratic term              0.5        0.496503
estimator_performance <- estimator_performance %>%
  gather(key = "Quantity",
  value = "Value",
  mean_estimand,
  mean_estimate)

estimator_perf_fig <- ggplot(data = simulations,
                             aes(x = estimate)) +
                             xlab("Estimate") + ylab("Count") +
                             geom_histogram(bins = 50) +
                             facet_wrap( ~ estimator_label) +
                             geom_vline(
                             data = estimator_performance,
                             aes(xintercept = Value, color = Quantity),
                             size = 0.5,
                             linetype = "dashed"
                             ) +
                             theme_economist_white(gray_bg = TRUE) +
                             theme(legend.position = "bottom") +
                             ggtitle(label = "Sampling Distribution of Different Estimators")

estimator_perf_fig


Partial Regression (FWL Theorem)

Let’s say we specify the following regression:

\[\begin{equation} Y_i = \beta_0 + \beta_1 \times X_{1,i} + \beta_2 X_{2,i} + \beta_3 X_{3,i} + e_i \end{equation}\]

Frisch and Waugh (1933) find that

\[\begin{equation} \beta_k = \frac{cov(\tilde{Y_{k,i}},\tilde{X_{k,i}})}{Var(\tilde{X_{k,i}})} \end{equation}\]

Where, taking \(k = 1\) as an example, \(\tilde{Y_{k,i}}\) are the residuals from the regression \(Y_i \sim X_{2,i} + X_{3,i}\), and \(\tilde{X_{k,i}}\) are the residuals from \(X_{1,i} \sim X_{2,i} + X_{3,i}\).

Angrist and Pischke describe the logic as follows:

Each coefficient in a multivariate regression is the bivariate slope coefficient for the corresponding regressor after partialling out all the other covariates. (AP, 2009:35)
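As a quick sanity check on this claim, here is a minimal sketch in base R (with made-up variables x1, x2, x3) showing that the coefficient from the full regression matches the residual-on-residual slope:

# Minimal numerical check of FWL with simulated data (illustrative only)
set.seed(123)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + 0.3 * x3 + rnorm(n)

# Coefficient on x1 from the full multivariate regression
beta_full <- coef(lm(y ~ x1 + x2 + x3))["x1"]

# FWL: residualize y and x1 on the other covariates, then regress residuals
y_tilde  <- resid(lm(y ~ x2 + x3))
x1_tilde <- resid(lm(x1 ~ x2 + x3))
beta_fwl <- coef(lm(y_tilde ~ x1_tilde))["x1_tilde"]

all.equal(unname(beta_full), unname(beta_fwl))  # TRUE, up to numerical error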


Wrong functional form

I will now use the partial regression framework to explain why conditioning on \(C\) was insufficient in last week’s illustrative case. Here is what I do:

  1. I draw_data using the design object confounding_design.

  2. I then estimate two regressions: \(Y_i \sim Z_i + C_i\) and \(Y_i \sim Z_i + C_i + C_i^2\). I report the results of these models in a table made using texreg.

  3. I then apply the Frisch, Waugh and Lovell theorem to demonstrate the regression anatomy.

  4. Crucially, look at the residualized values of \(Y\) and \(Z\) (i.e. \(\tilde{Y_{k,i}}\) and \(\tilde{Z_{k,i}}\)). We clearly see that not all of the variation in \(Y\) and \(Z\) associated with \(C\) is netted out when we only condition on \(C\): we do not fully “partial out” the effect of \(C\) on \(Y\) and the effect of \(C\) on \(Z\), because we misspecified the functional form.

library(estimatr)
library(texreg)

# Step 1: Draw a dataset
dat <- confounding_design %>% draw_data()

# Step 2: Estimate the two models

model1 <- lm_robust(Y ~ Z + C, data = dat)
model2 <- lm_robust(Y ~ Z + C + C_squared, data = dat)

reg_table_1 <- texreg::htmlreg(
  list(model1, model2),
  stars = c(0.001, 0.01, 0.05),
  custom.coef.names = c("(Intercept)", "Z", "C", "C^2"),
  digits = 2,
  caption = "Different Conditioning Strategies",
  include.ci = FALSE
)

reg_table_1
Different Conditioning Strategies

                Model 1      Model 2
(Intercept)    -2.73***      6.77***
               (0.16)       (0.20)
Z               2.59***      0.53***
               (0.12)       (0.08)
C               5.79***     -1.26***
               (0.05)       (0.14)
C^2                          1.15***
                            (0.02)
R2              0.95         0.99
Adj. R2         0.95         0.99
Num. obs.       1000         1000
RMSE            1.91         1.00

*** p < 0.001; ** p < 0.01; * p < 0.05
# Step 3a: Partial regression framework (writing the first stage models)
model3 <- lm_robust(Z ~ C, data = dat)
model4 <- lm_robust(Z ~ C + C_squared, data = dat)

model5 <- lm_robust(Y ~ C, data = dat)
model6 <- lm_robust(Y ~ C + C_squared, data = dat)

# Step 3b: Partial regression framework (calculating residualized values)
dat <- dat %>%
  mutate(
    Z_res_model3 = Z - model3$fitted.values,
    Z_res_model4 = Z - model4$fitted.values,
    Y_res_model5 = Y - model5$fitted.values,
    Y_res_model6 = Y - model6$fitted.values
    
  )

# Step 3c: Res-Res plot

library(ggplot2)

ggplot(dat, aes(x = Z_res_model3, y = Y_res_model5)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm_robust") +
  theme_economist_white(gray_bg = TRUE) +
  xlab("Residualized Z") +
  ylab("Residualized Y") +
  ggtitle("Residualized Z and Y from Y ~ Z + C")

ggplot(dat, aes(x = Z_res_model4, y = Y_res_model6)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm_robust") +
  theme_economist_white(gray_bg = TRUE) +
  xlab("Residualized Z") +
  ylab("Residualized Y") +
  ggtitle("Residualized Z and Y from Y ~ Z + C + C^2")

# Step 3d: Final step, Res-Res regressions
model7 <- lm_robust(Y_res_model5 ~ Z_res_model3, data = dat)

model8 <- lm_robust(Y_res_model6 ~ Z_res_model4, data = dat)

reg_table_2 <- texreg::htmlreg(
  list(model7, model8),
  stars = c(0.001, 0.01, 0.05),
  custom.coef.names = c("(Intercept)", "res(Z)", "res(Z)"),
  digits = 2,
  caption = "Partial Regression Framework (FWL Theorem)",
  include.ci = FALSE
)

reg_table_2
Partial Regression Framework (FWL Theorem)

                Model 1      Model 2
(Intercept)    -0.00        -0.00
               (0.06)       (0.03)
res(Z)          2.59***      0.53***
               (0.12)       (0.08)
R2              0.31         0.05
Adj. R2         0.31         0.05
Num. obs.       1000         1000
RMSE            1.91         1.00

*** p < 0.001; ** p < 0.01; * p < 0.05

Finally, have a look at the residuals:

head(dat) %>%
  select(ID,C,C_squared,Z,Y,Z_res_model3,Z_res_model4,Y_res_model5, Y_res_model6) %>%
  knitr::kable(., digits = 2,
               caption = "Dataset with Residualized Values")
Dataset with Residualized Values

ID     C   C_squared   Z       Y   Z_res_model3   Z_res_model4   Y_res_model5   Y_res_model6
0001   5          25   0   27.91          -0.38          -0.69           0.70          -1.74
0002   4          16   0   20.31          -0.44          -0.28          -1.26          -0.01
0003   1           1   1    7.52           0.38           0.06           2.86           0.36
0004   2           4   0    9.69          -0.56          -0.40          -0.61           0.61
0005   3           9   0   11.69          -0.50          -0.19          -4.25          -1.78
0006   4          16   0   20.27          -0.44          -0.28          -1.31          -0.05

Post-treatment variables

Angrist and Pischke point out that yet another bad conditioning strategy is to “control for variables that are themselves outcome variables”. Consider a study in which we randomly assign college education: \(Z=1\) for college graduates and \(Z=0\) otherwise. Let \(Y\) be earnings, the main outcome of interest. Finally, \(R\) is a dummy variable capturing profession: \(R=1\) for white collar workers and \(R=0\) otherwise.

Here is a causal diagram capturing these relationships:

Figure 3: Audit Study DAG

Imagine now that we condition on \(R_i=1\) (i.e. we subset to white collar workers). Angrist and Pischke would argue that this is a “bad control” because we have conditioned on a consequence of treatment \(Z_i\). Specifically:

\[\begin{equation} E[Y_i | Z_i = 1, R_i = 1] - E[Y_i | Z_i = 0, R_i = 1] = E[Y_1 - Y_0 | Z_i = 1, R_i = 1] \\ + \{E[Y_0 | Z_i = 1, R_i = 1] - E[Y_0 | Z_i = 0, R_i = 1]\} \end{equation}\]

Where \(E[Y_1 - Y_0 | Z_i = 1, R_i = 1]\) is the causal effect of \(Z\) on \(Y\) among white collar workers; and \(E[Y_0 | Z_i = 1, R_i = 1] - E[Y_0 | Z_i = 0, R_i = 1]\) is the baseline difference in the untreated potential outcomes of college graduates working in white collar jobs (\(Z_i = 1, R_i =1\)) and non-graduates working in white collar jobs (\(Z_i = 0, R_i =1\)). This is the selection bias we have induced by conditioning on a post-treatment variable. Angrist and Pischke put it as follows: “the selection bias term reflects the fact that college (\(Z\)) changes the composition of the pool of white collar workers” (Angrist and Pischke, 2009:66).


Exercise

To better understand this, consider the following toy population (\(n = 6\)). Please form pairs, pick one of the quantities specified below, and try estimating it using these data.

bad_control_dat <- data.frame(
  ID = c(1:6),
  Y1 = c(6,3,8,5,6,6),
  Y0 = c(3,2,5,6,7,5),
  Z = c(1,1,0,0,1,0),
  R = c(1,1,1,1,0,0)
)

bad_control_dat %>% knitr::kable(.,
               caption = "Exercise: Post-Treatment Bias")
Exercise: Post-Treatment Bias

ID   Y1   Y0   Z   R
 1    6    3   1   1
 2    3    2   1   1
 3    8    5   0   1
 4    5    6   0   1
 5    6    7   1   0
 6    6    5   0   0

Can you estimate the following?

  1. \(E[\tau_i]\) or the average treatment effect.

  2. \(\widehat{E[\tau_i | R_i = 1]} = E[Y_i | Z_i = 1, R_i =1] - E[Y_i | Z_i = 0, R_i =1]\) or the difference in means conditional on being a white collar worker. Does it equal \(E[\tau_i]\)?

  3. \(E[Y_1 - Y_0 | Z_i = 1, R_i = 1]\) or the causal effect of \(Z\) on \(Y\) among white collar workers with college degrees.

  4. \(E[Y_0 | Z_i = 1, R_i = 1]\) or the expected value of untreated potential outcomes among white collar workers assigned to treatment (college degree).

  5. \(E[Y_0 | Z_i = 0, R_i = 1]\) or the expected value of untreated potential outcomes among white collar workers assigned to control (no college degree).

Are you able to put together the constituent terms in steps 3-5, and get back to \(E[Y_i | Z_i = 1, R_i =1] - E[Y_i | Z_i = 0, R_i =1]\)? Which of these quantities is not estimable if we did not have god-like knowledge about potential outcomes?

Answer

  1. \(E[\tau_i] = \frac{3 + 1 + 3 + (-1) + (-1) + 1}{6} = 1\)

  2. \(\widehat{E[\tau_i | R_i = 1]} = \frac{6+3}{2} - \frac{5+6}{2} = 4.5 - 5.5 = -1\)

  3. \(E[Y_1 - Y_0 | Z_i = 1, R_i = 1] = \frac{6+3}{2} - \frac{3+2}{2} = 4.5 - 2.5 = 2\)

  4. \(E[Y_0 | Z_i = 1, R_i = 1] = \frac{3+2}{2} = 2.5\)

  5. \(E[Y_0 | Z_i = 0, R_i = 1] = \frac{5+6}{2} = 5.5\)

And putting these quantities back into the equation:

\(2 + (2.5 - 5.5) = 2 - 3 = -1\)
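For completeness, here is a small sketch that reproduces these hand calculations from bad_control_dat (observed outcomes are Y1 for treated units and Y0 for control units):

# Reproduce the hand calculations above from the toy data
with(bad_control_dat, {
  ate        <- mean(Y1 - Y0)                                            # 1
  naive_diff <- mean(Y1[Z == 1 & R == 1]) - mean(Y0[Z == 0 & R == 1])    # -1
  causal_wc  <- mean(Y1[Z == 1 & R == 1] - Y0[Z == 1 & R == 1])          # 2
  selection  <- mean(Y0[Z == 1 & R == 1]) - mean(Y0[Z == 0 & R == 1])    # -3
  c(ATE = ate, naive = naive_diff, causal = causal_wc, selection = selection)
})
# Note: causal + selection = 2 + (-3) = -1, the naive difference in means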


Proxy controls

Not only do Angrist and Pischke consider post-treatment variables “bad controls”, they also suggest we should not condition on “variables that partially control for omitted factors but are themselves affected by the variable of interest”.

To see this, consider a “long” regression:

\[\begin{equation} Y_i = \alpha + \rho \cdot s_i + \gamma a_i + e_i \end{equation}\]

Where \(s_i\) is the variable of interest, and we cannot measure \(a_i\) so we use a “proxy” measure \(l_i\):

\[\begin{equation} l_i = \pi_0 + \pi_1 \cdot s_i + \pi_2 a_i + \eta_i \end{equation}\]

The problem is that if we exclude \(l_i\) from the regression, we have omitted variable bias; if we include it as a control, we have post-treatment bias. To see this, plug in \(a_i\) from the second equation (setting aside \(\eta_i\)) into the long regression and re-arrange the terms. We get:

\[\begin{equation} Y_i = \alpha + \rho \cdot s_i + \gamma [\frac{l_i - \pi_0 - \pi_1 \cdot s_i}{\pi_2}] + e_i \end{equation}\]

And after re-arranging the terms within the brackets:

\[\begin{equation} Y_i = (\alpha - \gamma \cdot\frac{\pi_0}{\pi_2}) + (\rho - \gamma \cdot \frac{\pi_1}{\pi_2}) \cdot s_i + \frac{\gamma}{\pi_2}\cdot l_i + e_i \end{equation}\]

Note that in this new regression: \(\alpha^\star = \alpha - \gamma \cdot\frac{\pi_0}{\pi_2}\), \(\rho^\star = \rho - \gamma \cdot \frac{\pi_1}{\pi_2}\), and \(\gamma^\star = \frac{\gamma}{\pi_2}\). This is the bias in the coefficients induced by conditioning on a proxy variable. For more on this, see Angrist and Pischke, pp. 64-69.
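A small simulation makes the proxy-control result concrete. All parameter values below are illustrative assumptions (and I set \(\eta_i = 0\) so the plug-in algebra above holds exactly):

# Illustrative simulation of proxy-control bias (parameter values are made up)
set.seed(42)
n <- 100000
a <- rnorm(n)                          # unobserved factor (e.g. ability)
s <- 1 + 0.8 * a + rnorm(n)            # variable of interest, correlated with a
l <- 2 + 0.6 * s + 1.5 * a             # proxy: pi_0 = 2, pi_1 = 0.6, pi_2 = 1.5, eta = 0
y <- 1 + 1.0 * s + 2.0 * a + rnorm(n)  # rho = 1, gamma = 2

coef(lm(y ~ s + a))["s"]  # "long" regression: recovers rho = 1
coef(lm(y ~ s))["s"]      # omitting a: omitted variable bias
coef(lm(y ~ s + l))["s"]  # proxy control: rho - gamma * pi_1 / pi_2 = 1 - 2 * 0.6 / 1.5 = 0.2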


Aggregation Strategies

Figure 4: Regressions and Matching

Estimands

Regressions and matching estimators shoot for different estimands. Regressions seek to estimate the average treatment effect, while matching estimators seek to estimate the average treatment effect on the treated:

\[\begin{equation} \delta_{ATE} = E[Y_{1,i} - Y_{0,i}] = \sum_x \delta_x \cdot P(X_i = x) \end{equation}\] \[\begin{equation} \delta_{ATT} = E[Y_{1,i} - Y_{0,i} | D_i = 1] = \sum_x \delta_x \cdot P(X_i = x | D_i = 1) \end{equation}\]

Where \(\delta_x\) is the within-stratum difference in means (i.e. \(\delta_x = E[Y_i | D_i = 1, X_i = x] - E[Y_i | D_i = 0, X_i = x]\)). Note that OLS and matching estimators weight by different quantities: \(w_{ols}\) weights stratum \(x\) by its share of the sample (scaled, as shown below, by the conditional variance of treatment within it), while \(w_{matching}\) weights it by the conditional probability of being in stratum \(x\) among treated units.
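Since DeclareDesign keeps the potential-outcome columns in the drawn data, we can compute both estimands directly on a draw (assuming dat still contains the Y_Z_0 and Y_Z_1 columns):

# Both estimands computed directly from a draw of the design's data
dat %>%
  summarise(
    ATE = mean(Y_Z_1 - Y_Z_0),            # average over all units
    ATT = mean((Y_Z_1 - Y_Z_0)[Z == 1])   # average over treated units only
  )
# In this design the treatment effect is constant, so both equal 0.5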


Regressions

Claim: Regression puts most weight on covariate cells where the conditional variance of treatment status is largest. Assuming \(D_i \sim Bern(p)\), the variance of \(D_i\) is \(p\cdot(1-p)\), which is maximized at \(p = 0.5\).

Short Proof:

Let there be a regression: \(Y_i \sim D_i + X_i\)

We know:

\[\begin{equation} \delta_R = \frac{cov(Y_i, \tilde{D_i})}{Var(\tilde{D_i})} = \frac{E_X\{E[(D_i - E(D_i|X_i))^2 | X_i] \, \delta_X\}}{E_X\{E[(D_i - E(D_i|X_i))^2 | X_i]\}} \end{equation}\]

Where \(E[(D_i - E(D_i|X_i))^2 | X_i]\) is the conditional variance of treatment assignment, using the variance formula \(Var(D) = E[(D - E(D))^2]\). Because \(D_i\) is binary, we can also write this as \(P(D_i = 1 | X_i = x)[1 - P(D_i = 1| X_i = x)]\). The expression then becomes:

\[\begin{equation} \delta_R = \frac{\sum_x \delta_xP(D_i=1 | X_i = x)(1 - P(D_i = 1| X_i=x)) P(X_i = x)}{\sum_x P(D_i=1 | X_i = x)(1 - P(D_i = 1| X_i=x)) P(X_i = x)} \end{equation}\]

Note: We weight the conditional variance of \(D_i\) by the probability of being in the stratum \(P(X_i=x)\) so that we can average over all \(x\).


Matching Estimator

Claim: Unlike regressions, matching estimators put the most weight on covariate cells containing those most likely to be treated.

Short Proof:

\[\begin{equation} \delta_M = E[Y_{1,i} - Y_{0,i} | D_i = 1] = \sum_x \delta_x \cdot P(X_i = x | D_i = 1) \end{equation}\]

But note that by Bayes’ rule we know \(P(X_i = x | D_i = 1) = \frac{P(D_i = 1|X_i = x)P(X_i = x)}{P(D_i = 1)}\), and by the law of total probability \(P(D_i = 1) = \sum_x P(D_i = 1|X_i = x)P(X_i = x)\). Plugging that in gives us:

\[\begin{equation} \delta_M = \frac{\sum_x \delta_x \cdot P(D_i = 1 | X_i = x) \cdot P(X_i =x)}{\sum_x P(D_i = 1 | X_i = x) \cdot P(X_i =x)} \end{equation}\]

Special Cases

  1. When \(\delta_X\) does not vary across cells, these weights don’t matter.

  2. When \(D_i \perp X_i\), as in experiments, then \(P(D_i = 1 | X_i = x) = P(D_i = 1)\) and is thus a constant. The matching estimator then equals the regression estimator (a short sketch after this list illustrates this). To see this, note that \(\sum_x \delta_x \cdot P(D_i = 1 | X_i = x) \cdot P(X_i =x) = P(D_i = 1) \sum_x \delta_x \cdot P(X_i =x)\). The first term \(P(D_i = 1)\) cancels because it comes out of the summation \(\sum_x\) in the denominator as well.

  3. Neither estimator gives weight to covariate cells that do not contain both treated and untreated observations.
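Here is the sketch referenced in special case 2: with a constant propensity across strata, the OLS and matching weights collapse to the same thing (the stratum shares). The numbers below are illustrative.

# Special case 2: constant propensity => OLS and matching weights coincide
strata <- data.frame(
  p_x   = c(0.2, 0.3, 0.5),  # P(X = x), illustrative stratum shares
  p_trt = 0.4                # constant P(D = 1 | X = x), as in an experiment
)

w_ols      <- with(strata, p_trt * (1 - p_trt) * p_x / sum(p_trt * (1 - p_trt) * p_x))
w_matching <- with(strata, p_trt * p_x / sum(p_trt * p_x))

all.equal(w_ols, w_matching)  # TRUE: both reduce to P(X = x)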


Example

I take the data from the illustrative case, and report \(\delta_X\) (within-stratum difference in means), \(w_{ols}\) (regression weights), \(w_{matching}\) (matching weights), and \(w_{block}\) (stratified estimator weights) in the table below:

## Regression vs. matching

strata_estimates <- dat %>%
  group_by(C) %>%
  summarise(ATE_block = mean(Y[Z==1]) - mean(Y[Z==0]),
            N_j = n(),
            p_trt = mean(Z==1))

strata_estimates %>%
  mutate(
    w_block = N_j / sum(N_j),
    w_ols = (w_block * p_trt * (1 - p_trt)) / sum(w_block * p_trt * (1 - p_trt)),
    w_matching = p_trt * w_block / sum(p_trt * w_block)
  ) %>%
  knitr::kable(., digits = 4,
               caption = "Within-Strata Estimates and Aggregation")
Within-Strata Estimates and Aggregation

C   ATE_block   N_j    p_trt   w_block    w_ols   w_matching
1      0.2346   195   0.9026     0.195   0.1000       0.3527
2      0.5411   197   0.4975     0.197   0.2873       0.1964
3      0.8381   196   0.1020     0.196   0.1048       0.0401
4      0.3560   208   0.3029     0.208   0.2562       0.1263
5      0.6427   204   0.6961     0.204   0.2517       0.2846
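To check the aggregation claims on this draw, we can re-compute the weights and compare the weighted averages of the within-stratum effects against the corresponding estimators. Note that the regression-weighting result applies to a specification saturated in \(C\), so I use as.factor(C) here rather than the linear-in-\(C\) specification from before; this is a sketch, not part of the original chunk.

# Re-compute the weights and aggregate the within-stratum effects
weights_tab <- strata_estimates %>%
  mutate(
    w_block    = N_j / sum(N_j),
    w_ols      = w_block * p_trt * (1 - p_trt) / sum(w_block * p_trt * (1 - p_trt)),
    w_matching = p_trt * w_block / sum(p_trt * w_block)
  )

with(weights_tab, c(
  ols_weighted      = sum(w_ols * ATE_block),       # should match the saturated OLS coefficient
  matching_weighted = sum(w_matching * ATE_block)   # the stratified ATT estimate
))

# Coefficient on Z from a regression saturated in C
lm_robust(Y ~ Z + as.factor(C), data = dat)$coefficients["Z"]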

Guidance: If you would like to work through these proofs, please look at Angrist and Pischke, pp. 74-76. I’d be happy to walk you through the proofs using my notes on the same pages.


Propensity Score Theorem

If the conditional independence assumption holds such that \(\{Y_{1,i},Y_{0,i}\} \perp D_i | X_i\), then \(\{Y_{1,i},Y_{0,i}\} \perp D_i | p(X_i)\), where \(p(X_i) = E[D_i | X_i] = P(D_i = 1 | X_i)\).

Implication: “The only covariate you need to control for is the probability of treatment itself” (Angrist and Pischke, p. 81).

Intuition for the proof: Angrist and Pischke sketch a concise proof of the theorem on page 81. Here is an annotated version:

The aim is to show that \(P(D_i = 1 | p(X_i), Y_{j,i}) = P(D_i = 1 | p(X_i)) = p(X_i)\), where \(j \in \{0,1\}\) indexes treated or untreated potential outcomes. In simple terms, we want to show that conditional on the propensity score, the probability of being assigned to treatment does not depend on potential outcomes and is precisely equal to the propensity score. Put another way, \(Y_{i,j}\) is “excludable”.

The way Angrist and Pischke show this is:

\[\begin{equation} P(D_i = 1 | Y_{i,j}, p(X_i)) = E[D_i | Y_{i,j}, p(X_i)] \end{equation}\]

By the law of iterated expectations

\[\begin{equation} E[D_i | Y_{i,j}, p(X_i)] = E_x\{E(D_i| X_i)| Y_{i,j}, p(X_i)\} \end{equation}\]

But we know \(E(D_i| X_i) = p(X_i)\)

\[\begin{equation} E_x\{E(D_i| X_i)| Y_{i,j}, p(X_i)\} = E_x\{p(X_i) | Y_{i,j}, p(X_i)\} = p(X_i) \end{equation}\]
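To see the theorem at work in the illustrative data, here is a sketch that estimates the propensity score as the within-stratum share of treated units and then conditions on it alone. The object name dat_ps is mine; and because the treatment effect in this design is constant, both estimates should sit near the true ATE of 0.5.

# Estimate the propensity score within each stratum of C, then condition on it alone
dat_ps <- dat %>%
  group_by(C) %>%
  mutate(p_hat = mean(Z)) %>%   # empirical P(Z = 1 | C)
  ungroup()

# Conditioning on C and C^2 versus conditioning only on the propensity score
lm_robust(Y ~ Z + C + C_squared, data = dat_ps)$coefficients["Z"]
lm_robust(Y ~ Z + as.factor(p_hat), data = dat_ps)$coefficients["Z"]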

Propensity score weighting meets FWL theorem

The propensity score theorem tells us that the only covariate we need to control for is the probability of treatment itself. In this section, I show that weighting the outcomes of treated units by \(\frac{1}{p(X_i)}\) and outcomes of control units by \(\frac{1}{1 - p(X_i)}\) is similar to “controlling” in a regression.

To see this, let's start with the ATE estimand and write it in a more familiar form:

\[\begin{equation} E[Y_{1,i} - Y_{0,i}] = E[\frac{Y_i\cdot D_i}{p(X_i)} - \frac{Y_i(1 - D_i)}{1 - p(X_i)}] \end{equation}\]

If we combine the two terms over the common denominator \(p(X_i)[1 - p(X_i)]\), we get:

\[\begin{equation} E\left[\frac{Y_i\cdot (D_i - p(X_i))}{p(X_i)[1 - p(X_i)]}\right] \end{equation}\]

Now think back to the FWL theorem and the partial regression framework. If we regress \(Y_i \sim D_i + X_i\), we know that \(\beta_D\) is

\[\begin{equation} \frac{cov(\tilde{D_i},Y_i)}{Var(\tilde{D_i})} = \frac{E[Y_i \cdot (D_i - p(X_i))]}{E[p(X_i)(1 - p(X_i))]} \end{equation}\]

Note: The partial regression coefficient is a ratio of expectations (i.e. \(\frac{E(A)}{E(B)}\)). This does not equal the expectation of a ratio (i.e. \(E[\frac{A}{B}]\)). That is why conditioning on the propensity score produces consistent but not unbiased estimates.
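Finally, here is a sketch of the weighting estimator itself on the same draw, using the empirical propensity scores in dat_ps from the previous sketch. As the note above suggests, this plug-in estimate is consistent but noisier than the regression estimate in any single draw.

# Inverse-propensity-weighted estimate of the ATE (plug-in sample analogue)
ipw_ate <- with(dat_ps, mean(Y * Z / p_hat - Y * (1 - Z) / (1 - p_hat)))
ipw_ate

# Compare with the correctly specified regression
lm_robust(Y ~ Z + C + C_squared, data = dat_ps)$coefficients["Z"]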


Preview: Doubly Robust Estimators

Let's go back to the illustrative case from last week. Remember, we found that controlling for \(C\) was not sufficient, but adding a quadratic term \(C^2\) got us the right answer on average. The reason is that in that dataset both the \(Z\)-\(C\) and the \(C\)-\(Y\) relationships were non-linear. As a result, only a quadratic term would either “partial out the entire effect of \(C\)” or “accurately predict propensity scores, \(p(X_i)\)”.

In reality, we only need to get one of these right. To see this, I take exactly the same data-generating process but now make the probability of treatment a linear function of \(C\). You'll see that controlling for \(C\) alone now does the job, as does controlling for \(C\) and \(C^2\). This is the intuition behind “doubly robust” estimators (you get two chances to be right!).

### See the non-linearity in the initial dataset

strata_estimates_1 <- dat %>%
  group_by(C) %>%
  summarise(p_trt = mean(Z==1))

ggplot(strata_estimates_1, aes(x = C, y = p_trt)) + 
  geom_point() + 
  ylim(c(0,1)) +
  theme_economist_white(gray_bg = TRUE) +
  ggtitle(label = "Propensity Scores at Different Levels of C",
          subtitle = "(Original Data)")


Now, let's declare exactly the same design, but with \(Z\) linearly related to \(C\).

Specifying a linear relationship between C and Z

confounding_design2 <-
  declare_population(
    N = 1000,
    # C is a categorical variable with 5 levels
    C = sample(c(1:5), N, replace = TRUE),
    C_squared = C ^ 2,
    # Idiosyncratic variation in outcomes generated here
    noise = rnorm(N, mean = 0, sd = 1)
  ) +
  # Y is a function of Z, C, and noise
  declare_potential_outcomes(Y ~ 0.5 * Z - 1.5 * C + 1.2 * C ^ 2 + 7 + noise) +
  # We are interested in the causal effect of Z on Y
  declare_estimand(ATE = mean(Y_Z_1 - Y_Z_0)) +
  # Z is now a linear function of C (assignment probabilities rise steadily with C)
  # Note: C still causes Y, so the independence assumption breaks down
  declare_assignment(blocks = C,
                     block_prob = c(0.2, 0.35, 0.5, 0.65, 0.8)) +
  # Estimator when we condition on C
  declare_estimator(Y ~ Z + C,
                    model = lm_robust,
                    estimand = "ATE",
                    label = "OLS conditioning on C") +
  # Estimator when we also condition on C^2
  declare_estimator(Y ~ Z + C + C_squared,
                    model = lm_robust,
                    estimand = "ATE",
                    label = "OLS with quadratic term")
  
dat2 <- confounding_design2 %>% draw_data()

strata_estimates_2 <- dat2 %>%
  group_by(C) %>%
  summarise(p_trt = mean(Z==1))

ggplot(strata_estimates_2, aes(x = C, y = p_trt)) + 
  geom_point() + 
  ylim(c(0,1))+ 
  theme_economist_white(gray_bg = TRUE) +
  ggtitle(label = "Propensity Scores at Different Levels of C",
          subtitle = "(New Data)")


Alright, let's simulate the properties of our two estimators. Remember, we have two OLS specifications:

\[\begin{equation} Y_i \sim Z_i + C_i \end{equation}\]

And

\[\begin{equation} Y_i \sim Z_i + C_i + C_i^2 \end{equation}\]
simulations <- simulate_design(confounding_design2, sims = 300)

simulations %>% head() %>%
  select(sim_ID, estimand_label, estimand, 
         estimator_label, estimate, std.error) %>%
  knitr::kable(., digits = 2,
               caption = "Head of the Simulations Table")
Head of the Simulations Table

sim_ID   estimand_label   estimand   estimator_label           estimate   std.error
     1   ATE                   0.5   OLS conditioning on C         0.51        0.15
     1   ATE                   0.5   OLS with quadratic term       0.51        0.07
     2   ATE                   0.5   OLS conditioning on C         0.49        0.15
     2   ATE                   0.5   OLS with quadratic term       0.50        0.07
     3   ATE                   0.5   OLS conditioning on C         0.58        0.15
     3   ATE                   0.5   OLS with quadratic term       0.59        0.07
estimator_performance <- simulations %>% 
  group_by(estimator_label) %>%
  summarise(mean_estimand = mean(estimand),
            mean_estimate = mean(estimate))

estimator_performance %>% knitr::kable(
  caption = "Average Estimates from Different Estimators",
  col.names = c("Estimator", "Mean Estimand", "Mean Estimate"),
  format = "pandoc"
)
Average Estimates from Different Estimators

Estimator                  Mean Estimand   Mean Estimate
------------------------   -------------   -------------
OLS conditioning on C                0.5       0.4980436
OLS with quadratic term              0.5       0.4974127