Alert: You can download this document from http://github.com/shikhar46/quantmethods or Canvas’ Files page.

Notes from last week

  1. Can an estimate be biased?

Situation: Say I am interested in the average treatment effect in some population, \(E[Y_i(1) - Y_i(0)]\). I use a difference-in-means estimator for this parameter or quantity of interest: \(\widehat{ATE} = \frac{\sum_{i=1}^m Y_i}{m} - \frac{\sum_{i=m+1}^N Y_i}{N-m}\) (where units \(1,\dots,m\) are treated). The true ATE is 0.5. In a random sample of data, and given some assignment vector \(Z\), we estimate the average treatment effect to be 1.5. Is the estimate “biased”?

Answer: No. Bias and consistency are properties of an estimation procedure (an “estimator”), not of any single realized value from that procedure. An estimator is unbiased if its estimates, averaged over repeated samples and assignments, equal the true parameter; a single estimate of 1.5 when the truth is 0.5 is just one draw from the estimator’s sampling distribution and, by itself, tells us nothing about bias.
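To make this concrete, here is a quick toy simulation (my own sketch, not from the problem set): we run the difference-in-means procedure many times under a known, constant treatment effect of 0.5. Any single estimate can land far from 0.5, but the procedure is unbiased because its estimates are centered on the truth.

set.seed(42)
estimates <- replicate(5000, {
  N <- 100
  Y0 <- rnorm(N)                            # potential outcomes under control
  Y1 <- Y0 + 0.5                            # constant effect of 0.5
  Z <- sample(rep(c(0, 1), each = N / 2))   # complete random assignment
  Y <- ifelse(Z == 1, Y1, Y0)               # observed outcomes
  mean(Y[Z == 1]) - mean(Y[Z == 0])         # one realized estimate
})
mean(estimates)    # approximately 0.5: the procedure is unbiased
range(estimates)   # yet single estimates can land far from 0.5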


  2. Why does the value of the estimand change when we introduce heterogeneity in treatment effects?
library(DeclareDesign)
library(tidyverse)

basic_design <-
  declare_population(N = 100, noise = rnorm(N)) +
  # The unit-level treatment effect is (0.5 + noise): heterogeneous across units
  declare_potential_outcomes(Y ~ noise + (0.5 + noise) * Z) +
  # The estimand is the sample average of the unit-level effects
  declare_estimand(SATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(prob = 0.5) +
  declare_reveal(Y, Z) +
  declare_estimator(Y ~ Z, model = difference_in_means, estimand = "SATE")
simulations <-
  simulate_design(basic_design, sims = 400)
simulations %>% head %>% 
  select(sim_ID, estimand_label, estimand, 
         estimator_label, term, estimate, 
         std.error, p.value) %>% 
  knitr::kable()
sim_ID  estimand_label   estimand  estimator_label  term   estimate  std.error    p.value
------  --------------  ---------  ---------------  ----  ---------  ---------  ---------
     1  SATE            0.4414911  estimator        Z     0.6626830  0.3737734  0.0813785
     2  SATE            0.5261206  estimator        Z     0.5538947  0.2961459  0.0654089
     3  SATE            0.6110187  estimator        Z     0.3634664  0.3155997  0.2536787
     4  SATE            0.4978010  estimator        Z     0.5253538  0.3175410  0.1029199
     5  SATE            0.4884269  estimator        Z     0.1393658  0.3184552  0.6628616
     6  SATE            0.5751908  estimator        Z     0.5797356  0.3514047  0.1034923

Estimands can be defined for different populations. For example, the average treatment effect can be defined for a sample (\(n=100\)) or for a population (\(N=100,000\)) from which a random sample is drawn. Each random sample from the population has a single, true sample average treatment effect, SATE \(= E[Y_i(1) - Y_i(0) \mid S_i = 1]\) (where \(S_i = 1\) for units in the population that are selected into the sample). The population as a whole also has a single, true population average treatment effect, PATE \(= E[Y_i(1) - Y_i(0)]\). However, the SATE in a given sample need not equal the PATE, because units respond differently to treatment and every sample comprises different units.
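A quick way to see this (a toy sketch of my own; the numbers are arbitrary): fix a population with heterogeneous unit-level effects, then watch the SATE vary across random samples while the PATE stays fixed.

set.seed(7)
tau <- rnorm(100000, mean = 0.5, sd = 1)   # heterogeneous unit-level effects
PATE <- mean(tau)                          # one fixed population quantity
SATEs <- replicate(1000, mean(sample(tau, size = 100)))  # SATE in each sample
PATE
sd(SATEs)   # the SATE moves from sample to sample around the PATE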


  3. If \(Z \sim f(noise)\), do we still have random assignment?

In a sense, yes. Random assignment is the assignment of units to treatment conditions with known probabilities strictly between 0 and 1. In the problem set, Alex defined prob = pnorm(noise). If noise were observed, we could calculate each unit’s probability of being assigned to treatment, and by construction pnorm() always returns a probability between 0 and 1 (though never exactly 0 or 1).

That said, the problem such an assignment procedure poses for identification is that \(\{Y_i(1),Y_i(0),X\} \perp Z\) does not hold. The analytic fix is to estimate \(\widehat{p_i(Z = 1)}\) using \(noise\), and either condition on this propensity score, or use an inverse-probability-weighted estimator (with \(weight_i = \frac{1}{\widehat{p_i(Z=1)}}\) for treated units and \(weight_i = \frac{1}{1-\widehat{p_i(Z=1)}}\) for control units). We then have conditional independence of treatment assignment: \(\{Y_i(1),Y_i(0),X\} \perp Z \mid \widehat{p_i(Z=1)}\).
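Here is a minimal sketch of that fix (my own illustration, not from the problem set; it assumes noise is observed and enters the outcome directly):

# Units with high noise are more likely to be treated AND have higher Y,
# so the naive DIM is biased; IPW undoes the unequal probabilities.
set.seed(11)
N <- 10000
noise <- rnorm(N)
p <- pnorm(noise)                        # known assignment probabilities
Z <- rbinom(N, size = 1, prob = p)
Y <- noise + 0.5 * Z                     # true effect is 0.5
mean(Y[Z == 1]) - mean(Y[Z == 0])        # naive DIM: biased upward
w <- ifelse(Z == 1, 1 / p, 1 / (1 - p))  # inverse probability weights
weighted.mean(Y[Z == 1], w[Z == 1]) -
  weighted.mean(Y[Z == 0], w[Z == 0])    # approximately 0.5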


Causal Graphs

Identification

Definition: We say that an average causal effect is identifiable under a particular set of assumptions if these assumptions imply that the distribution of the observed data is compatible with a single value of the effect measure. Conversely, we say that an average causal effect is nonidentifiable under the assumptions when the distribution of the observed data is compatible with several values of the effect measure. Hernán MA, Robins JM (2019:27).

Figure 1: When is a parameter identified? (Adapted from Frederick Savje’s slides)

Some related points:

  1. A parameter is “underidentified” when several parameter values are consistent with the data and assumptions. It is “overidentified” if no parameter value is consistent with the data. It is “just identified” if only one parameter value is consistent with the data and assumptions. In a way, under-identification is a problem of too little information; over-identification is a problem of too much information, and we strive for the middle ground (just the right amount of information to infer a value of the parameter).

  2. Identification problems are different from statistical problems. Identification revolves around the parameter being estimable, that is: “if we have infinite data points, what assumptions must we make so that there is only one value of the parameter” (HR 2019). These questions usually have to do with confounding, selection, or measurement issues. In contrast, statistical problems refer to finite-sample constraints (i.e. estimating a population parameter with a limited amount of data). For more on this, see Hernán MA, Robins JM (2019) Chapter 10.1.


dagitty

Let’s practice making DAGs based on a functional summary of the data (i.e. some “model”). To do this, open http://dagitty.net in your browser and click ‘Launch DAGitty’. I will work through an example, and then we will break into an exercise where I will supply a model and you will make a DAG using dagitty.

Example 1:

\[\begin{equation} Z \sim f(u_Z, C) \end{equation}\] \[\begin{equation} C \sim f(u_C) \end{equation}\] \[\begin{equation} Y \sim f(Z, C, u_Y) \end{equation}\]

Here is what I get:

Figure 2: Making DAGs from Equations
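If you prefer working in R, the dagitty package accepts the same model syntax as the web tool. Here is a sketch of Example 1 (the object name g1 and the layout call are mine):

library(dagitty)

g1 <- dagitty("dag {
  u_Z [latent]
  u_C [latent]
  u_Y [latent]
  u_Z -> Z
  u_C -> C
  u_Y -> Y
  C -> Z
  C -> Y
  Z -> Y
}")
# Plot with an automatic layout; unobserved variables are marked latent
plot(graphLayout(g1))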

Exercise 1:

Now it’s your turn. Please make a DAG based on the following information:

\[\begin{equation} Z = u_C + (u_C)^2 + u_Z \end{equation}\] \[\begin{equation} X = Z + u_C + Z\times(u_C) \end{equation}\] \[\begin{equation} M = \log(Z) \end{equation}\] \[\begin{equation} Y = Z + M + (u_Y + u_C) \end{equation}\]

Note that variables prefixed with \(u\) are unobserved.

Answer:

Here is what my DAG looks like.

Figure 3: DAG from Exercise 1
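For reference, here is one way to encode this DAG with the dagitty R package (a sketch; your layout may differ as long as the edges match). The last call anticipates question 1 below: adjustmentSets() returns the sets of observed variables, if any, that recover the total effect by covariate adjustment, and returns no set when adjustment cannot identify it.

library(dagitty)

g2 <- dagitty("dag {
  u_C [latent]
  u_Z [latent]
  u_Y [latent]
  u_C -> Z
  u_Z -> Z
  Z -> X
  u_C -> X
  Z -> M
  Z -> Y
  M -> Y
  u_C -> Y
  u_Y -> Y
}")
adjustmentSets(g2, exposure = "Z", outcome = "Y", effect = "total")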

  1. Is the total effect of \(Z\) on \(Y\) identified, or recoverable using covariate adjustment?

  2. Did we lose information in converting a system of equations into a causal graph? What type of information is not represented in the graph?


DeclareDesign

Exercise 2: Do this with the person sitting next to you

Let’s do the opposite of what we just did. I will give you a causal graph. Can you open RStudio, load DeclareDesign, and specify a model that captures the relationships in this DAG? I’d probably start by writing the equations, then use declare_population to specify the latent or unobserved variables, declare_potential_outcomes to specify outcomes, and declare_assignment to specify the treatment variable.

Here is the DAG:

Figure 4: Writing functional summaries

Answer:

Here is my code, and the associated dataset.

library(DeclareDesign)
library(tidyverse)

mediator_design <-
  declare_population(
    N = 1000,
    U_z = rnorm(N),
    U_m = rnorm(N),
    U_y = rnorm(N)
  ) +
  # M is a mediator: it responds to treatment
  declare_potential_outcomes(M ~ 0.5 * Z + U_m) +
  # Y is written in terms of M's ingredients (0.5 * Z + U_m), not M itself
  declare_potential_outcomes(Y ~ 0.5 * Z + (0.5 * Z + U_m) + U_y) +
  # Assignment probabilities are a function of the latent U_z
  declare_assignment(prob_unit = pnorm(U_z), simple = TRUE) +
  declare_reveal(c(M, Y), Z) +
  declare_estimand(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_estimator(Y ~ Z, estimand = "ATE", label = "DIM") +
  declare_estimator(Y ~ Z + M,
                    model = lm_robust,
                    estimand = "ATE",
                    label = "OLS")

mediator_design %>% draw_data %>% head() %>% knitr::kable(., digits = 2)
ID      U_z    U_m    U_y  M_Z_0  M_Z_1  Y_Z_0  Y_Z_1  Z  Z_cond_prob      M      Y
----  -----  -----  -----  -----  -----  -----  -----  -  -----------  -----  -----
0001   1.00  -1.04   0.66  -1.04  -0.54  -0.38   0.62  1         0.84  -0.54   0.62
0002  -0.59  -0.16  -0.33  -0.16   0.34  -0.49   0.51  0         0.72  -0.16  -0.49
0003  -1.07  -1.39   0.94  -1.39  -0.89  -0.46   0.54  1         0.14  -0.89   0.54
0004  -0.67  -0.53  -1.77  -0.53  -0.03  -2.30  -1.30  1         0.25  -0.03  -1.30
0005  -0.75   0.78  -1.16   0.78   1.28  -0.38   0.62  0         0.77   0.78  -0.38
0006  -0.18  -1.59   0.58  -1.59  -1.09  -1.02  -0.02  0         0.57  -1.59  -1.02

Think About

  1. Is the average treatment effect (“total effect of \(Z\) on \(Y\)”) identified? Are we dealing with a “Fundamentally Unidentified Question”?

  2. Observe that Y is not defined as a function of M, but of the ingredients of M: \[\begin{equation} Y \sim 0.5 \times Z + (0.5 \times Z + U_m) + U_y \end{equation}\]

    Why did I do this, and not \(Y \sim 0.5 \times Z + M + U_y\)?

  3. Is \(Z\) randomly assigned in this model? Does the assignment procedure satisfy the independence assumption: \(\{Y_i(1),Y_i(0),X\} \perp Z\)? Relatedly: is the difference in means estimator unbiased; and does conditioning on \(M\) bias the estimation procedure?

  4. Do your equations have to look identical to mine? Why not?


Backdoor Criterion and Conditioning

Pearl says that the causal effect of \(Z\) on \(Y\) is identified when:

  1. All back-door paths from \(Z\) to \(Y\) are blocked, after conditioning on \(X\). We do this by “writing down back-door paths from [Z] to [Y], determine which ones are unblocked, and then search for a candidate conditioning set of observed variables that will block all unblocked back-door paths” (Morgan and Winship, 2015:109).

  2. No variable in the conditioning set is a descendant of \(Z\) that lies on a directed path to \(Y\). We do this to “verify that the variables in the conditioning set do not block or otherwise adjust away any portion of the causal effect of interest” (MW, 2015:109). (A dagitty sketch below the figure shows how to check both conditions in code.)


Figure 5: Identifying Causal Effects Using Pearl’s Criterion
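dagitty can automate both checks. Here is a sketch on a minimal confounding triangle (the example DAG is my own, not the one in the figure): paths() lists the back-door paths, and adjustmentSets() searches for conditioning sets of observed variables that satisfy Pearl’s criterion.

library(dagitty)

g <- dagitty("dag {
  X -> Z
  X -> Y
  Z -> Y
}")
# Step 1: list all paths from Z to Y and whether each is open
paths(g, "Z", "Y", directed = FALSE)
# Step 2: search for conditioning sets satisfying the back-door criterion
adjustmentSets(g, exposure = "Z", outcome = "Y", effect = "total")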


Illustrative Case

Let’s start with a simple model of the world captured by this DAG:

Figure 5: Underlying Model

I input this into DeclareDesign:

library(DeclareDesign)
library(tidyverse)

confounding_design <-
  declare_population(N = 500,
    # C is a categorical variable with five levels (1 to 5)
    C = sample(c(1:5), N, replace = TRUE),
    # Idiosyncratic variation in outcomes generated here
    noise = rnorm(N, mean = 0, sd = 1)) +
  # Y is a function of Z, C, and noise
  declare_potential_outcomes(Y ~ 0.5 * Z - 1.5 * C + 1.2 * C ^ 2 + 7 + noise) +
  # We are interested in the causal effect of Z on Y
  declare_estimand(ATE = mean(Y_Z_1 - Y_Z_0)) +
  # Z is a function of C (unequal probabilities of assignment)
  # Note: C also causes Y, so the independence assumption breaks down
  declare_assignment(blocks = C,
    block_prob = c(0.9, 0.5, 0.1, 0.3, 0.7)) +
  # Estimator when there is no conditioning
  declare_estimator(Y ~ Z,
    model = difference_in_means,
    estimand = "ATE",
    label = "Naive DIM")

# Step 1: Let's see the data, then visualize it

dat <- confounding_design %>% draw_data()

dat %>% head() %>% knitr::kable(., digits = 2,
                                caption = "Illustrative Case: A Draw of Data")
Illustrative Case: A Draw of Data

ID    C  noise  Y_Z_0  Y_Z_1  Z  Z_cond_prob      Y
---  --  -----  -----  -----  -  -----------  -----
001   1  -0.43   6.27   6.77  1          0.9   6.77
002   3   2.05  15.35  15.85  0          0.9  15.35
003   3   0.07  13.37  13.87  0          0.9  13.37
004   2  -0.09   8.71   9.21  0          0.5   8.71
005   1   1.52   8.22   8.72  0          0.1   8.22
006   5  -0.38  29.12  29.62  1          0.7  29.62
library(ggplot2)
library(ggthemes)

dat_plot <- ggplot(dat, aes(x = C, y =Y, color = as.factor(Z))) +
  geom_point(position = "jitter") +
  theme_economist_white(gray_bg = TRUE) +
  ggtitle("Illustrative Case: A Draw of Data")
dat_plot

# Step 2: DIM without covariate adjustment is biased 

simulations <- simulate_design(confounding_design, sims = 300)

simulations %>% head() %>%
  select(sim_ID, estimand_label, estimand, 
         estimator_label, estimate, std.error) %>%
  knitr::kable(., digits = 2)
sim_ID  estimand_label  estimand  estimator_label  estimate  std.error
------  --------------  --------  ---------------  --------  ---------
     1  ATE                  0.5  Naive DIM           -0.43       0.74
     2  ATE                  0.5  Naive DIM            0.02       0.75
     3  ATE                  0.5  Naive DIM            0.29       0.78
     4  ATE                  0.5  Naive DIM            0.05       0.75
     5  ATE                  0.5  Naive DIM            0.28       0.77
     6  ATE                  0.5  Naive DIM           -0.75       0.73
simulations %>% summarise(avg.estimand = mean(estimand),
                          avg.estimate = mean(estimate)) %>%
  knitr::kable()
avg.estimand  avg.estimate
------------  ------------
         0.5    -0.0996861

How do different estimators fare?

Let’s compare the performance of three estimators: the naive difference-in-means estimator, OLS with covariate adjustment, and the stratified difference-in-means estimator. What are these?

Naive DIM:

\[\begin{equation} Y_i \sim Z_i \Leftrightarrow \frac{\sum_{i=1}^m Y_i}{m} - \frac{\sum_{i=m+1}^N Y_i}{N-m} \end{equation}\]

OLS with Covariate Adjustment:

\[\begin{equation} Y_i \sim Z_i + C_i \end{equation}\]

Stratified Difference in Means:

\[\begin{equation} \widehat{ATE} = \sum_{j=1}^J \frac{N_{j}}{N}(\widehat{ATE_{j}}) \end{equation}\]
# Step 1: Specify the design + three estimators

confounding_design <-
  declare_population(N = 500,
    # C is a categorical variable with five levels (1 to 5)
    C = sample(c(1:5), N, replace = TRUE),
    # Idiosyncratic variation in outcomes generated here
    noise = rnorm(N, mean = 0, sd = 1)) +
  # Y is a function of Z, C, and noise
  declare_potential_outcomes(Y ~ 0.5 * Z - 1.5 * C + 1.2 * C ^ 2 + 7 + noise) +
  # We are interested in the causal effect of Z on Y
  declare_estimand(ATE = mean(Y_Z_1 - Y_Z_0)) +
  # Z is a function of C (unequal probabilities of assignment)
  # Note: C also causes Y, so the independence assumption breaks down
  declare_assignment(blocks = C,
    block_prob = c(0.9, 0.5, 0.1, 0.3, 0.7)) +
  # Estimator when there is no conditioning
  declare_estimator(Y ~ Z,
    model = difference_in_means,
    estimand = "ATE",
    label = "Naive DIM") +
  # Add the new estimators
  # Difference in means within each stratum of C, then aggregated
  declare_estimator(Y ~ Z,
    model = difference_in_means,
    blocks = C,
    estimand = "ATE",
    label = "Stratified DIM") +
  # OLS that "controls for" C linearly
  declare_estimator(Y ~ Z + C,
    model = lm_robust,
    estimand = "ATE",
    label = "OLS")


# Step 2: Simulate, Assess Estimators

simulations <- simulate_design(confounding_design, sims = 300)


simulations %>% head() %>%
  select(sim_ID, estimand_label, estimand, 
         estimator_label, estimate, std.error) %>%
  knitr::kable(., digits = 2,
               caption = "Head of the Simulations Table")
Head of the Simulations Table

sim_ID  estimand_label  estimand  estimator_label  estimate  std.error
------  --------------  --------  ---------------  --------  ---------
     1  ATE                  0.5  Naive DIM            0.19       0.76
     1  ATE                  0.5  Stratified DIM       0.62       0.12
     1  ATE                  0.5  OLS                  2.80       0.17
     2  ATE                  0.5  Naive DIM           -0.31       0.75
     2  ATE                  0.5  Stratified DIM       0.53       0.13
     2  ATE                  0.5  OLS                  2.72       0.18
estimator_performance <- simulations %>% 
  group_by(estimator_label) %>%
  summarise(mean_estimand = mean(estimand),
            mean_estimate = mean(estimate)) 

estimator_performance %>% knitr::kable(
  caption = "Average Estimates from Different Estimators",
  col.names = c("Estimator", "Mean Estimand", "Mean Estimate"),
  format = "pandoc"
)
Average Estimates from Different Estimators

Estimator       Mean Estimand  Mean Estimate
--------------  -------------  -------------
Naive DIM                 0.5     -0.1473962
OLS                       0.5      2.6640462
Stratified DIM            0.5      0.4904856

As you can see, both the naive difference-in-means estimator and OLS with covariate adjustment are biased. Why might this be the case?


Visualizing using ggplot2

Let’s take simulations and plot the sampling distribution of each estimator. I will use ggplot2 and the tidyverse functions spread and gather for this.

Figure 6: ggplot2 CheatSheet

These cheat sheets are available at https://ggplot2.tidyverse.org (ggplot2) and https://tidyr.tidyverse.org/index.html (gather and spread).


Figure 7: Spread and Gather

So here is what we will do:

  1. I will use gather to reshape the estimator_performance summary table into long format (one row per estimator and quantity)

  2. I will use simulations and the summary statistics dataframe in ggplot2 to create a figure.

## Step 1: Data of Summary Statistics

estimator_performance <- estimator_performance %>%
  gather(key = "Quantity", value = "Value", 
         mean_estimand, mean_estimate)

## Step 2: Creating a figure using ggplot2

figure8 <- ggplot(data = simulations,
                  aes(x = estimate)) +
                  xlab("Estimate") + ylab("Count") +
                  geom_histogram(bins = 30) +
                  facet_wrap(~ estimator_label)
figure8

## Embellishing it

figure8 +
  geom_vline(
  data = estimator_performance,
  aes(xintercept = Value, color = Quantity),
  size = 0.5,
  linetype = "dashed"
  ) +
  theme_economist_white(gray_bg = TRUE) +
  theme(legend.position = "bottom") +
  ggtitle(label = "Sampling Distribution of Different Estimators")


Stratified DIM Estimator

Why is the stratified estimator doing so well? To understand this, let’s go back to how we specified the data generating process. Note two things:

  1. Y is non-linearly associated with C

  2. The probability of being assigned to treatment, \(p_i(Z = 1)\), varies considerably across strata.

OLS is not doing a good job mainly because of (1): it assumes a linear relationship between \(C\) and \(Y\) when there isn’t one. To see this, let’s go back to the figure in which we plotted \(C\) and \(Y\) (with treated and untreated units in different colors). Here is the same figure with regression lines fitted for the treatment (\(Z=1\)) and control (\(Z=0\)) groups. This is analytically equivalent to running an OLS with the specification \(Y_i\sim Z_i + C_i + (Z_i \times C_i)\). Despite allowing greater flexibility than an OLS that only “controls for” \(C\), this estimator does a bad job because it imposes an unlikely functional form assumption.

dat_plot2 <- ggplot(dat, aes(x = C, y =Y, color = as.factor(Z))) +
  geom_point(position = "jitter", color = "gray80") +
  geom_smooth(method = "lm_robust") + 
  theme_economist_white(gray_bg = TRUE) +
  ggtitle("Illustrative Case: A Draw of Data")
dat_plot2

Next, let’s think about how OLS weights within-strata estimates. Consider a regression specification:

\[\begin{equation} Y_i \sim Z_i + I_{C=2} + I_{C=3} + I_{C=4} + I_{C=5} \Leftrightarrow Y_i \sim Z_i + as.factor(C_i) \end{equation}\]

Where \(I_{C=j}\) is an indicator variable that takes a value of 1 if the unit is in the \(j^{th}\) stratum of \(C\). This specification is very close to the stratified estimator: it is considerably less parametric, and it relaxes the linearity assumption imposed when \(C\) is treated as a continuous variable. Now OLS performs about as well as the stratified estimator, but note that it weights within-strata estimates roughly in proportion to \(N_j \, \widehat{p_j(Z=1)}\,(1 - \widehat{p_j(Z=1)})\), the within-stratum variance in treatment assignment scaled by stratum size. This is very different from the stratified estimator, which weights by the stratum share \(\frac{N_j}{N}\) (something we will explore in greater detail next week!). I illustrate this in the table below:

## Step 1: Draw some data using this design

dat <- confounding_design %>% draw_data()

## Step 2: Calculate the ATE within each stratum

strata_estimates <- dat %>%
  group_by(C) %>%
  summarise(ATE_block = mean(Y[Z==1]) - mean(Y[Z==0]),
            N_j = n(),
            p_trt = mean(Z==1))

strata_estimates %>%
  mutate(
  w_block = N_j / sum(N_j),
  w_ols = (w_block * p_trt * (1 - p_trt)) / (sum(w_block *
  p_trt * (1 - p_trt)))
  ) %>%
  select(C, N_j, p_trt, ATE_block, w_block, w_ols) %>%
  knitr::kable(., digits = 4,
  caption = "Within-Strata Estimates and Aggregation")
Within-Strata Estimates and Aggregation
C N_j p_trt ATE_block w_block w_ols
1 79 0.8987 0.7949 0.158 0.0817
2 114 0.5000 0.4140 0.228 0.3237
3 101 0.0990 0.2882 0.202 0.1023
4 101 0.2970 0.7577 0.202 0.2395
5 105 0.6952 0.5168 0.210 0.2527

The table reports the stratum (\(C\)), the number of units in that stratum (\(N_j\)), the probability of being assigned to treatment in that stratum (\(p_{trt}\)), the within-stratum difference-in-means estimate (\(ATE_{block}\)), and the weights \(w\) (suffixed \(\_block\) for the stratified estimator, and \(\_ols\) for the ordinary least squares regression with covariate adjustment).
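To see how much the aggregation rule matters, we can aggregate the within-strata estimates under both weighting schemes (a short sketch reusing strata_estimates from above; I recompute the weights because the mutate() above was not assigned):

strata_tab <- strata_estimates %>%
  mutate(w_block = N_j / sum(N_j),
         w_ols = (w_block * p_trt * (1 - p_trt)) /
           sum(w_block * p_trt * (1 - p_trt)))

# Stratified estimator: weight each within-stratum estimate by N_j / N
with(strata_tab, sum(w_block * ATE_block))
# OLS with strata indicators: weight roughly by N_j * p(1 - p)
with(strata_tab, sum(w_ols * ATE_block))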