Alert: You can download this document from http://github.com/shikhar46/quantmethods or Canvas’ Files page.
Situation: Say I am interested in the average treatment effect in some population, \(E[Y_i(1) - Y_i(0)]\). I use a difference in means estimator to estimate this parameter or quantity of interest: \(\widehat{ATE} = \frac{\sum_{i=1}^m Y_i}{m} - \frac{\sum_{i=m+1}^N Y_i}{N-m}\). The true ATE is 0.5. In a random sample of data, and given some assignment vector \(Z\), we estimate the average treatment effect to be 1.5. Is the estimate “biased”?
Answer: Bias, and consistency, are properties of an estimation procedure (“estimator”); not any single realized value from that procedure.
library(DeclareDesign)
library(tidyverse)
basic_design <-
declare_population(N = 100, noise = rnorm(N)) +
declare_potential_outcomes(Y ~ noise + (0.5 + noise)*Z) +
declare_estimand(SATE = mean(Y_Z_1 - Y_Z_0)) +
declare_assignment(prob = 0.5) +
declare_reveal(Y, Z) +
declare_estimator(Y ~ Z, model = difference_in_means, estimand = "SATE")
simulations <-
simulate_design(basic_design, sims = 400)
simulations %>% head %>%
select(sim_ID, estimand_label, estimand,
estimator_label, term, estimate,
std.error, p.value) %>%
knitr::kable()
sim_ID | estimand_label | estimand | estimator_label | term | estimate | std.error | p.value |
---|---|---|---|---|---|---|---|
1 | SATE | 0.4414911 | estimator | Z | 0.6626830 | 0.3737734 | 0.0813785 |
2 | SATE | 0.5261206 | estimator | Z | 0.5538947 | 0.2961459 | 0.0654089 |
3 | SATE | 0.6110187 | estimator | Z | 0.3634664 | 0.3155997 | 0.2536787 |
4 | SATE | 0.4978010 | estimator | Z | 0.5253538 | 0.3175410 | 0.1029199 |
5 | SATE | 0.4884269 | estimator | Z | 0.1393658 | 0.3184552 | 0.6628616 |
6 | SATE | 0.5751908 | estimator | Z | 0.5797356 | 0.3514047 | 0.1034923 |
Estimands can be defined for different populations. For example, the average treatment effect can be defined for a sample (\(n=100\)), or a population (\(N=100,000\)) from which a random sample is drawn. Every random sample from the population will have one, true sample average treatment effect (SATE) = \(E[Y_i(1) - Y_i(0) | S_i = 1]\) (where \(S_i = 1\) for units in the population that are selected into the sample). The population, as a whole, will also have one, true population average treatment effect (PATE) = \(E[Y_i(1) - Y_i(0)]\). However, the SATE may not be equal to the PATE in a given sample, because different units respond differently to treatment and every sample comprises different units.
In a sense, yes. Random assignment is the assignment of units to treatment conditions with known probabilities between 0 and 1. In the problem set, Alex defined prob = pnorm(noise). If noise were observed, we could calculate each unit’s probability of being assigned to treatment, and by design this would always be between 0 and 1 (though never exactly 0 or 1).
That said, the problem such an assignment procedure poses for identification is that \(\{Y_i(1),Y_i(0),X\} \perp Z\) does not hold. The analytic fix for this is to estimate \(\widehat{p_i(Z = 1)}\) using \(noise\), and either condition on this propensity score, or use an inverse probability weighted estimator (where the \(weight_i = \frac{1}{\widehat{p_i(Z = 1)}}\)). We now have conditional independence of treatment assignment: \(\{Y_i(1),Y_i(0),X\} \perp Z | \widehat{p_i(Z=1)}\).
Definition: We say that an average causal effect is identifiable under a particular set of assumptions if these assumptions imply that the distribution of the observed data is compatible with a single value of the effect measure. Conversely, we say that an average causal effect is nonidentifiable under the assumptions when the distribution of the observed data is compatible with several values of the effect measure. Hernán MA, Robins JM (2019:27).
Some related points:
A parameter is “underidentified” when several parameter values are consistent with the data and assumptions. It is “overidentified” if no parameter value is consistent with the data. It is “just identified” if only one parameter value is consistent with the data and assumptions. In a way, under-identification is a problem of too little information; over-identification is a problem of too much information, and we strive for the middle ground (just the right amount of information to infer a value of the parameter).
Identification problems are different from statistical problems. Identification revolves around the parameter being estimable, that is: “if we have infinite data points, what assumptions must we make so that there is only one value of the parameter” (HR 2019). These questions usually have to do with confounding, selection, or measurement issues. In contrast, statistical problems refer to finite sample constraints (i.e. estimating a population parameter with limited amount of data). For more on this, see Hernán MA, Robins JM (2019) Chapter 10.1.
Lets practice making DAGs based on a functional summary of the data (i.e. some “model”). To do this, open http://dagitty.net on your browser, click on ‘Launch DAGitty’. I will work through an example, and then we will break into an exercise where I will supply a model, and you will make a DAG using dagitty.
Example 1:
\[\begin{equation} Z \sim f(u_Z, C) \end{equation}\] \[\begin{equation} C \sim f(u_C) \end{equation}\] \[\begin{equation} Y \sim f(Z, C, u_Y) \end{equation}\]Here is what I get:
Exercise 1:
Now its your turn. Can you please make a DAG based on the following information:
\[\begin{equation} Z = u_C + (u_C)^2 + u_Z \end{equation}\] \[\begin{equation} X = Z + u_C + Z\times(u_C) \end{equation}\] \[\begin{equation} M = log(Z) \end{equation}\] \[\begin{equation} Y = Z + M + (u_Y + u_C) \end{equation}\]Note that variables specified as u are unobserved.
Answer:
Here is what my DAG looks like.
Is the total effect of \(Z\) on \(Y\) identified, or recovarable using covariate adjustment?
Did we lose information in converting a system of equations into a causal graph? What type of information is not represented in the graph?
Exercise 2: Do this with the person sitting next to you
Lets do the opposite of what we just did. I will give you a causal graph. Can you open R Studio, load DeclareDesign, and specify a model that captures the relationships in this DAG? I’d probably start by writing the equations, then use declare_population to specify the latent or unobserved variables, declare_potential_outcomes to specify outcomes, and declare_assignment to specify the treatment variable.
Here is the DAG:
Answer:
Here is my code, and the associated dataset.
library(DeclareDesign)
library(tidyverse)
mediator_design <-
declare_population(
N = 1000,
U_z = rnorm(N),
U_m = rnorm(N),
U_y = rnorm(N)
) +
declare_potential_outcomes(M ~ 0.5 * Z + U_m) +
declare_potential_outcomes(Y ~ 0.5 * Z + (0.5 * Z + U_m) + U_y) +
declare_assignment(prob_unit = pnorm(U_z), simple = TRUE) +
declare_reveal(c(M, Y), Z) +
declare_estimand(ATE = mean(Y_Z_1 - Y_Z_0)) +
declare_estimator(Y ~ Z, estimand = "ATE", label = "DIM") +
declare_estimator(Y ~ Z + M,
model = lm_robust,
estimand = "ATE",
label = "OLS")
mediator_design %>% draw_data %>% head() %>% knitr::kable(., digits = 2)
ID | U_z | U_m | U_y | M_Z_0 | M_Z_1 | Y_Z_0 | Y_Z_1 | Z | Z_cond_prob | M | Y |
---|---|---|---|---|---|---|---|---|---|---|---|
0001 | 1.00 | -1.04 | 0.66 | -1.04 | -0.54 | -0.38 | 0.62 | 1 | 0.84 | -0.54 | 0.62 |
0002 | -0.59 | -0.16 | -0.33 | -0.16 | 0.34 | -0.49 | 0.51 | 0 | 0.72 | -0.16 | -0.49 |
0003 | -1.07 | -1.39 | 0.94 | -1.39 | -0.89 | -0.46 | 0.54 | 1 | 0.14 | -0.89 | 0.54 |
0004 | -0.67 | -0.53 | -1.77 | -0.53 | -0.03 | -2.30 | -1.30 | 1 | 0.25 | -0.03 | -1.30 |
0005 | -0.75 | 0.78 | -1.16 | 0.78 | 1.28 | -0.38 | 0.62 | 0 | 0.77 | 0.78 | -0.38 |
0006 | -0.18 | -1.59 | 0.58 | -1.59 | -1.09 | -1.02 | -0.02 | 0 | 0.57 | -1.59 | -1.02 |
Think About
Is the average treatment effect (“total effect of \(Z\) on \(Y\)”) identified? Are we dealing with a “Fundamentally Unidentified Question”?
Why did I do this, and not: \(Y \sim 0.5 \times Z + M + U_y\) ?
Is \(Z\) randomly assigned in this model? Does the assignment procedure satisfy the independence assumption: \(\{Y_i(1),Y_i(0),X\} \perp Z\)? Relatedly: is the difference in means estimator unbiased; and does conditioning on \(M\) bias the estimation procedure?
Do your equations have to look identical to mine? Why not?
Pearl says that the causal effect of \(Z\) on \(Y\) is identified when:
All back-door paths from \(Z\) to \(Y\) are blocked, after conditioning on \(X\). We do this by “writing down back-door paths from [Z] to [Y], determine which ones are unblocked, and then search for a candidate conditioning set of observed variables that will block all unblocked back-door paths” (Morgan and Winship, 2015:109).
No variable in the conditioning set is a descendent of \(Z\), and lies on a directed path that reaches \(Y\). We do this to “verify that the variables in the conditioning set do not block or otherwise adjust away any portion of the causal effect of interest” (MW, 2015:109).
Lets start with a simple model of the world captured by this DAG:
I input this into DeclareDesign:
library(DeclareDesign)
library(tidyverse)
confounding_design <-
declare_population(N = 500,
# C is a categorical variable with 3 levels
C = sample(c(1:5), N, replace = TRUE),
# Idiosyncratic variation in outcomes generated here
noise = rnorm(N, mean = 0, sd = 1)) +
# Y is a function of Z, C, and noise
declare_potential_outcomes(Y ~ 0.5 * Z - 1.5 * C + 1.2 * C ^ 2 + 7 +noise) +
# We are interested in the causal effect of Z on Y
declare_estimand(ATE = mean(Y_Z_1 - Y_Z_0)) +
# Z is a function of C (unequal probabilities of assignment)
# Note: C also causes Y, so the indp. assumption breaks down
declare_assignment(blocks = C,
block_prob = c(0.9, 0.5,0.1, 0.3, 0.7)) +
# Estimator when there is no conditioning
declare_estimator(Y ~ Z,
model = difference_in_means,
estimand = "ATE",
label = "Naive DIM")
# Step 1: Lets see the data, then visualize it
dat <- confounding_design %>% draw_data()
dat %>% head() %>% knitr::kable(., digits = 2,
caption = "Illustrative Case: A Draw of Data")
ID | C | noise | Y_Z_0 | Y_Z_1 | Z | Z_cond_prob | Y |
---|---|---|---|---|---|---|---|
001 | 1 | -0.43 | 6.27 | 6.77 | 1 | 0.9 | 6.77 |
002 | 3 | 2.05 | 15.35 | 15.85 | 0 | 0.9 | 15.35 |
003 | 3 | 0.07 | 13.37 | 13.87 | 0 | 0.9 | 13.37 |
004 | 2 | -0.09 | 8.71 | 9.21 | 0 | 0.5 | 8.71 |
005 | 1 | 1.52 | 8.22 | 8.72 | 0 | 0.1 | 8.22 |
006 | 5 | -0.38 | 29.12 | 29.62 | 1 | 0.7 | 29.62 |
library(ggplot2)
library(ggthemes)
dat_plot <- ggplot(dat, aes(x = C, y =Y, color = as.factor(Z))) +
geom_point(position = "jitter") +
theme_economist_white(gray_bg = TRUE) +
ggtitle("Illustrative Case: A Draw of Data")
dat_plot
# Step 2: DIM without covariate adjustment is biased
simulations <- simulate_design(confounding_design, sims = 300)
simulations %>% head() %>%
select(sim_ID, estimand_label, estimand,
estimator_label, estimate, std.error) %>%
knitr::kable(., digits = 2)
sim_ID | estimand_label | estimand | estimator_label | estimate | std.error |
---|---|---|---|---|---|
1 | ATE | 0.5 | Naive DIM | -0.43 | 0.74 |
2 | ATE | 0.5 | Naive DIM | 0.02 | 0.75 |
3 | ATE | 0.5 | Naive DIM | 0.29 | 0.78 |
4 | ATE | 0.5 | Naive DIM | 0.05 | 0.75 |
5 | ATE | 0.5 | Naive DIM | 0.28 | 0.77 |
6 | ATE | 0.5 | Naive DIM | -0.75 | 0.73 |
simulations %>% summarise(avg.estimand = mean(estimand),
avg.estimate = mean(estimate)) %>%
knitr::kable()
avg.estimand | avg.estimate |
---|---|
0.5 | -0.0996861 |
Lets compare the performance of three estimators: the naive difference in means estimator, OLS with covariate adjustment, and the stratified difference in means estimator. What are these?
Naive DIM:
\[\begin{equation} Y_i \sim Z_i \Leftrightarrow \frac{\sum_{i=1}^m Y_i}{m} - \frac{\sum_{i=m+1}^N Y_i}{N-m} \end{equation}\]OLS with Covariate Adjustment:
\[\begin{equation} Y_i \sim Z_i + C_i \end{equation}\]Stratified Difference in Means:
\[\begin{equation} \widehat{ATE} = \sum_{j=1}^J \frac{N_{j}}{N}(\widehat{ATE_{j}}) \end{equation}\]# Step 1: Specify the design + three estimators
confounding_design <-
declare_population(N = 500,
# C is a categorical variable with 3 levels
C = sample(c(1:5), N, replace = TRUE),
# Idiosyncratic variation in outcomes generated here
noise = rnorm(N, mean = 0, sd = 1)) +
# Y is a function of Z, C, and noise
declare_potential_outcomes(Y ~ 0.5 * Z - 1.5 * C + 1.2 * C ^ 2 + 7 +noise) +
# We are interested in the causal effect of Z on Y
declare_estimand(ATE = mean(Y_Z_1 - Y_Z_0)) +
# Z is a function of C (unequal probabilities of assignment)
# Note: C also causes Y, so the indp. assumption breaks down
declare_assignment(blocks = C,
block_prob = c(0.9, 0.5,0.1, 0.3, 0.7)) +
# Estimator when there is no conditioning
declare_estimator(Y ~ Z,
model = difference_in_means,
estimand = "ATE",
label = "Naive DIM") +
# Add the new estimators
declare_estimator(
Y ~ Z,
model = difference_in_means,
blocks = C,
estimand = "ATE",
label = "Stratified DIM"
) +
declare_estimator(Y ~ Z + C,
model = lm_robust,
estimand = "ATE",
label = "OLS")
# Step 2: Simulate, Assess Estimators
simulations <- simulate_design(confounding_design, sims = 300)
simulations %>% head() %>%
select(sim_ID, estimand_label, estimand,
estimator_label, estimate, std.error) %>%
knitr::kable(., digits = 2,
caption = "Head of the Simulations Table")
sim_ID | estimand_label | estimand | estimator_label | estimate | std.error |
---|---|---|---|---|---|
1 | ATE | 0.5 | Naive DIM | 0.19 | 0.76 |
1 | ATE | 0.5 | Stratified DIM | 0.62 | 0.12 |
1 | ATE | 0.5 | OLS | 2.80 | 0.17 |
2 | ATE | 0.5 | Naive DIM | -0.31 | 0.75 |
2 | ATE | 0.5 | Stratified DIM | 0.53 | 0.13 |
2 | ATE | 0.5 | OLS | 2.72 | 0.18 |
estimator_performance <- simulations %>%
group_by(estimator_label) %>%
summarise(mean_estimand = mean(estimand),
mean_estimate = mean(estimate))
estimator_performance %>% knitr::kable(
caption = "Average Estimates from Different Estimators",
col.names = c("Estimator", "Mean Estimand", "Mean Estimate"),
format = "pandoc"
)
Estimator | Mean Estimand | Mean Estimate |
---|---|---|
Naive DIM | 0.5 | -0.1473962 |
OLS | 0.5 | 2.6640462 |
Stratified DIM | 0.5 | 0.4904856 |
As you can see, both the naive difference-in-means estimator and OLS with covariate adjustment are biased. Why might this be the case?
Lets take simulations and plot the sampling distribution of each estimator. I will use ggplot2 and tidyverse functions spread and gather for this.
These cheet sheets are available here: https://ggplot2.tidyverse.org (ggplot2), and https://tidyr.tidyverse.org/index.html (gather and spread)
So here is what we will do:
I will use gather to create a data frame of summary statistics
I will use simulations and the summary statistics dataframe in ggplot2 to create a figure.
## Step 1: Data of Summary Statistics
estimator_performance <- estimator_performance %>%
gather(key = "Quantity", value = "Value",
mean_estimand, mean_estimate)
## Step 2: Creating a figure using ggplot2
figure8 <- ggplot(data = simulations,
aes(x = estimate)) +
xlab("Estimate") + ylab("Count") +
geom_histogram(bins = 30) +
facet_wrap(~ estimator_label)
figure8
## Embellishing it
figure8 +
geom_vline(
data = estimator_performance,
aes(xintercept = Value, color = Quantity),
size = 0.5,
linetype = "dashed"
) +
theme_economist_white(gray_bg = TRUE) +
theme(legend.position = "bottom") +
ggtitle(label = "Sampling Distribution of Different Estimators")
Why is the stratified estimator doing so well? To understand this, lets go back to how we specified the data generating process. Note two things:
Y is non-linearly associated with C
The probability of being assigned to treatment, \(\widehat{p_i(Z = 1)}\), varies considerably across strata.
OLS is not doing a good job mainly because of (1) - it assumes a linear relationship when there isn’t one. To see this, lets go back to the figure in which we plotted \(C\) and \(Y\) (with treated and untreated units in different colors). Here is the same figure with regression lines fitted for the treatment (\(Z=1\)) and control (\(Z=0\)) groups. This is analytically equivalent to running an OLS with the specification: \(Y_i\sim Z_i + C_i + (Z_i \times C_i)\). Despite allowing greater flexibility than an OLS that only “controls for” \(C\), this estimator does a bad job because it imposes an unlikely functional form assumption.
dat_plot2 <- ggplot(dat, aes(x = C, y =Y, color = as.factor(Z))) +
geom_point(position = "jitter", color = "gray80") +
geom_smooth(method = "lm_robust") +
theme_economist_white(gray_bg = TRUE) +
ggtitle("Illustrative Case: A Draw of Data")
dat_plot2
Next, lets think about how OLS weights within-strata estimates. Consider a regression specification:
\[\begin{equation} Y_i \sim Z_i + I_{C=2} + I_{C=3} + I_{C=4} + I_{C=5} \Leftrightarrow Y_i \sim Z_i + as.factor(C_i) \end{equation}\]Where \(I_{C=j}\) is an indicator variable that takes a value of 1 if the unit is in the \(j^{th}\) stratum of C. This specification is very close to the stratified estimator, considerably less parametric, and relaxes the linearity assumption compared to when \(C\) is treated as a continuous variable. Now, OLS performs as well as the stratified estimator, but note that it weights within-strata estimates roughly proportional to \(\widehat{p_i(Z = 1)} (1 - \widehat{p_i(Z = 1)}\) or the variance in treatment assignment. This is very different from the stratified estimator (something we will explore in greater detail next week!). I illustrate this in the table below:
## Step 1: Draw some data using this design
dat <- confounding_design %>% draw_data()
## Step 2: Calculate the ATE within each strata
strata_estimates <- dat %>%
group_by(C) %>%
summarise(ATE_block = mean(Y[Z==1]) - mean(Y[Z==0]),
N_j = n(),
p_trt = mean(Z==1))
strata_estimates %>%
mutate(
w_block = N_j / sum(N_j),
w_ols = (w_block * p_trt * (1 - p_trt)) / (sum(w_block *
p_trt * (1 - p_trt)))
) %>%
select(C, N_j, p_trt, ATE_block, w_block, w_ols) %>%
knitr::kable(., digits = 4,
caption = "Within-Strata Estimates and Aggregation")
C | N_j | p_trt | ATE_block | w_block | w_ols |
---|---|---|---|---|---|
1 | 79 | 0.8987 | 0.7949 | 0.158 | 0.0817 |
2 | 114 | 0.5000 | 0.4140 | 0.228 | 0.3237 |
3 | 101 | 0.0990 | 0.2882 | 0.202 | 0.1023 |
4 | 101 | 0.2970 | 0.7577 | 0.202 | 0.2395 |
5 | 105 | 0.6952 | 0.5168 | 0.210 | 0.2527 |
The table reports the strata (\(C\)), the number of units in that stratum (\(N_j\)), the probability of being assigned to treatment in that stratum (\(p_{trt}\)), the within-strata difference in means estimate (\(ATE_{block}\)), and the weights \(w\) (suffixed with \(\_block\) for the stratified estimator, and \(\_ols\) for the ordinary least squares regression with covariate adjustment).