Covariates

Covariates are supplementary variables that predict outcomes but are unaffected by treatment assignment. Covariates remain fixed regardless of whether an observation is assigned to treatment or control. Formally, \(X_i(1) = X_i(0) = X_i\). (Gerber and Green 2012:95-96)

In any experiment, covariates have three main uses:

  1. In estimation, to obtain more precise estimates of the average treatment effect.

  2. To check the integrity of the random assignment procedure. Since covariates are independent of treatment assignment (\(X_i \perp D_i\)), “any correlation between treatment and other factors that affect outcomes arises purely by chance, due to the way in which observations happened to be allocated to treatment and control groups” (Gerber and Green 2012:95). Formally, \(\mathbb{E}[X_i] = \mathbb{E}[X_i | D_i=1] = \mathbb{E}[X_i | D_i=0]\). (See the sketch after this list.)

  3. To use in blocking, which can improve precision of estimates.
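
To illustrate use (2), here is a minimal sketch of a randomization check on simulated data (the data and variable names here are hypothetical, not from any course data set): compare the covariate across the randomly formed groups and expect a null difference.

library(estimatr)

set.seed(1)
sim <- data.frame(
  X = rnorm(200),                     # covariate, fixed before assignment
  D = sample(rep(0:1, each = 100))    # complete random assignment
)

# Because X is unaffected by D, any X-D association arises purely by chance:
difference_in_means(X ~ D, data = sim)  # expect an estimate near zero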


Covariates for Precision

Claim: Using covariates for precision does not affect the unbiasedness of the difference-in-means estimator. To see why, let \(X_i\) be a pre-intervention measure of the outcome variable \(Y_i\). One way to gain precision through a covariate is to use the change score, i.e. \(Y_i - X_i\), as the outcome variable. Gerber and Green show that using the change score as the outcome still yields, in expectation, the average treatment effect. Here is the proof.

With the change score as the outcome, the expected value of the estimator, \(\mathbb{E}[\widehat{ATE}]\), can be written as:

\(\mathbb{E}[\widehat{ATE}] = \mathbb{E}[Y_i - X_i | D_i=1] - \mathbb{E}[Y_i - X_i | D_i=0]\)

Applying the linearity of expectations, we get:

\(\mathbb{E}[Y_i| D_i=1] - \mathbb{E}[X_i| D_i=1] - \mathbb{E}[Y_i | D_i=0] + \mathbb{E}[X_i| D_i=0]\)

Since random assignment implies \(\mathbb{E}[X_i | D_i=1] = \mathbb{E}[X_i | D_i=0]\), the covariate terms cancel, and this simplifies to:

\(\mathbb{E}[Y_i| D_i=1] - \mathbb{E}[Y_i | D_i=0]\)
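
To make this concrete, here is a minimal simulation sketch (a hypothetical data-generating process, not any course data set) checking that the change-score estimator recovers the true ATE on average:

set.seed(7)
n <- 100
X <- rnorm(n, mean = 5, sd = 2)         # pre-treatment measure of the outcome
Y0 <- X + rnorm(n)                      # untreated potential outcome
Y1 <- Y0 + 2                            # true ATE = 2

est <- replicate(5000, {
  D <- sample(rep(0:1, each = n / 2))   # complete random assignment
  Y <- ifelse(D == 1, Y1, Y0)           # switching equation
  mean((Y - X)[D == 1]) - mean((Y - X)[D == 0])
})

mean(est)                               # approximately 2, the true ATE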


Estimators

We will use two types of estimators that incorporate covariates for precision: the difference-in-differences estimator, and regression-based covariate adjustment.

To start off, consider the difference-in-means (DIM) estimator: \(Y_i = a + bD_i + u_i\), where \(b\) estimates the average treatment effect and \(u_i\) is the error term, i.e. the individual-level residual variation in the outcome.

The difference-in-differences (DID) estimator is: \(Y_i - X_i = a + bD_i + u_i\)

Covariate adjustment in an ordinary least squares regression is: \(Y_i = a + bD_i + cX_i + u_i\). In this specification, we essentially “control for” \(X_i\).

DID is a special case of OLS with covariate adjustment: To see this, start with the difference-in-differences estimator (\(Y_i - X_i = a + bD_i + u_i\)) and re-arrange the terms: \(Y_i = a + bD_i + X_i + u_i\). Notice that this is the OLS covariate-adjustment specification with \(c=1\). In other words, DID assumes that a unit increase (or decrease) in the covariate \(X_i\) is associated with a unit increase (or decrease) in the outcome \(Y_i\). When this is the case, DID produces precision gains similar to OLS with covariate adjustment. When \(c \neq 1\), OLS with covariate adjustment gives more precise estimates than DID. (A code sketch of this equivalence follows.)
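
Here is a minimal sketch (simulated data, not the Gendelman study) verifying the algebra: the DID coefficient on \(D_i\) equals the OLS coefficient when \(c\) is constrained to 1, which base R's lm can impose via an offset.

set.seed(42)
n <- 100
X <- rnorm(n, mean = 5, sd = 2)       # pre-treatment covariate
D <- rbinom(n, 1, 0.5)                # random assignment
Y <- 2 + 1 * D + 1 * X + rnorm(n)     # c = 1 by construction

did <- lm(I(Y - X) ~ D)               # change score as the outcome
ols_fixed_c <- lm(Y ~ D, offset = X)  # OLS with the coefficient on X fixed at 1

coef(did)["D"]                        # identical estimates of b, ...
coef(ols_fixed_c)["D"]                # ... confirming DID = OLS with c = 1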


Example (\(c=1\))

To illustrate this point, I use data from Gendelman (2004). In this study of Israeli elementary school students, students were randomly assigned to treatment and control conditions after taking an initial computer-administered test. This produced a pre-treatment test score that is used as a covariate for precision. The treatment is a 30-minute session with an instructor who describes some rules of thumb to keep in mind when solving logic puzzles. The outcome of interest is the number of correctly solved puzzles, i.e. the score on the computer-administered test after the intervention. Details are available in Question 2 of Chapter 4.

In this example, we will estimate the average treatment effect (ATE) using the difference-in-means estimator, difference-in-differences estimator, and OLS with covariate adjustment.

First, here is a scatterplot that shows the association between the pre-test score (\(X_i\)) and the post-test score (\(Y_i\)). For reference, I include a \(45^{\circ}\) line that captures the \(c=1\) scenario.

In this case, do you think the difference-in-differences estimator will give as precise estimates as OLS with covariate adjustment? Why or why not?

library(DeclareDesign)
library(tidyverse)
library(knitr)
library(estimatr)

dat <- read_csv("gendelman2004.csv") %>%
  mutate(
    # Label the treatment arms for plotting
    Treatment = case_when(treatment == 0 ~ "Control",
                          treatment == 1 ~ "Treatment")
  )


ggplot(dat,
       aes(x = pretest, y = posttest)) +
  geom_point(aes(color = Treatment), position = "jitter", size = 2) +
  # Dashed 45-degree line: the c = 1 reference (see text)
  geom_abline(aes(intercept = 0, slope = 1), linetype = "dashed") +
  geom_smooth(method = "lm_robust") +
  xlim(0, 12) + ylim(0, 12) +
  ylab("DV: Post-Test Score") +
  xlab("Covariate: Pre-Test Score") +
  theme_bw() +
  theme(
    panel.grid = element_blank(),
    legend.position = c(.75, .25)
  )

Here are the estimates from the three estimators:

# Difference in means: posttest ~ treatment
dim_estimator <- tidy(difference_in_means(posttest ~ treatment, data = dat)) %>%
  mutate(Estimator = "DIM") %>%
  select(-term, -df)

# Difference in differences: the data set's pre-computed change score
# (improvement = posttest - pretest) as the outcome
did_estimator <- tidy(difference_in_means(improvement ~ treatment, data = dat)) %>%
  mutate(Estimator = "Difference in Differences") %>%
  select(-term, -df)

# OLS with covariate adjustment: keep only the treatment coefficient
cov_adj_estimator <- tidy(lm_robust(posttest ~ treatment + pretest, data = dat)) %>%
  mutate(Estimator = "Cov. Adjustment") %>%
  filter(term == "treatment") %>%
  select(-term, -df)

results <- bind_rows(dim_estimator,
                     did_estimator,
                     cov_adj_estimator)

kable(results, 
      caption = "Precision Gains from Covariate Adjustment",
      digits = 3,
      caption.above=TRUE)
Precision Gains from Covariate Adjustment

estimate  std.error  statistic  p.value  conf.low  conf.high  outcome      Estimator
  -0.333      1.027     -0.324    0.750    -2.524      1.857  posttest     DIM
   1.000      0.448      2.233    0.041     0.049      1.951  improvement  Difference in Differences
   0.742      0.439      1.691    0.112    -0.193      1.677  posttest     Cov. Adjustment
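
As expected when \(c \approx 1\), the change score (SE = 0.448) and covariate adjustment (SE = 0.439) deliver similar precision, and both cut the standard error of the unadjusted difference in means (SE = 1.027) by more than half.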

Task: Imagine we estimate difference_in_means(pretest ~ treatment, data = dat). Do you expect treatment to have a statistically significant effect on the pre-treatment score? Why or why not? Download this data set and confirm your intuition. This sort of exercise is often referred to as a randomization check.


Example (\(c\neq1\))

Now let's consider a situation in which \(c \neq 1\). I use DeclareDesign to specify a data-generating process in which a covariate \(X_i\) is negatively associated with potential outcomes. I use the declare_model function, specifically the potential_outcomes component, to operationalize this.

Step 1: Specify the design.

design <- declare_model(
  N = 1000,
  X = rnorm(N, mean = 5, sd = 2),
  U = rnorm(N, mean = 0, sd = 2),
  # True ATE is 1; X enters the potential outcomes negatively (c = -1)
  potential_outcomes(Y ~ 1 * Z + U - X)
) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_assignment(Z = simple_ra(N, prob = 0.5)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
  # Difference in means: ignores X
  declare_estimator(Y ~ Z,
                    model = lm_robust,
                    label = "DIM",
                    inquiry = "ATE") +
  # OLS with covariate adjustment: estimates c from the data
  declare_estimator(
    Y ~ Z + X,
    model = lm_robust,
    term = "Z",
    label = "Cov Adj",
    inquiry = "ATE"
  ) +
  # Difference in differences: change score Y - X as the outcome (assumes c = 1)
  declare_estimator(
    Y - X ~ Z,
    model = lm_robust,
    term = "Z",
    label = "DID",
    inquiry = "ATE"
  )

Step 2: Pass the design object to draw_data(). I provide a glimpse of the resulting data set in the table below. I also plot the covariate \(X_i\) against the outcome \(Y_i\) in a figure made with ggplot2. What can we infer about \(c\) (the association between \(X_i\) and \(Y_i\)) from this figure?

dat2 <- draw_data(design)

kable(dat2 %>% head(), digits = 3, 
      caption = "Glimpse of Data",
      caption.above = TRUE)
Glimpse of Data

ID        X       U   Y_Z_0   Y_Z_1  Z       Y
0001  6.846  -1.378  -8.224  -7.224  0  -8.224
0002  4.725  -1.058  -5.783  -4.783  0  -5.783
0003  6.723  -0.759  -7.482  -6.482  0  -7.482
0004  2.591  -0.741  -3.331  -2.331  1  -2.331
0005  5.567   2.434  -3.133  -2.133  0  -3.133
0006  5.197  -0.195  -5.392  -4.392  1  -4.392

ggplot(dat2,
       aes(x = X, y = Y)) +
  geom_point(aes(color = factor(Z)), position = "jitter", size = 2) +
  # Dashed slope-1 line as a c = 1 reference (intercept shifted to -8 for visibility)
  geom_abline(aes(intercept = -8, slope = 1), linetype = "dashed") +
  geom_smooth(method = "lm_robust") +
  ylab("Y") +
  xlab("Covariate X") +
  theme_bw() +
  theme(
    panel.grid = element_blank(),
    legend.position = c(.25, .25)
  )

Step 3: We call run_design(), which draws data and obtains estimates from each of the estimators specified in the design. Here is what that looks like.

run_design(design) %>% kable(digits = 3)
inquiry  estimand  estimator  term  estimate  std.error  statistic  p.value  conf.low  conf.high   df  outcome
ATE             1  DIM        Z        0.938      0.179      5.231    0.000     0.586      1.290  998  Y
ATE             1  Cov Adj    Z        1.046      0.123      8.502    0.000     0.805      1.287  997  Y
ATE             1  DID        Z        0.833      0.286      2.912    0.004     0.272      1.395  998  Y - X
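
Already in this single draw the pattern emerges: covariate adjustment has the smallest standard error (0.123), while DID (0.286) is less precise than even the unadjusted difference in means (0.179).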

Step 4: Finally, we use simulate_design() to run_design() several times (1,000 times, to be specific). These simulations give us the sampling distribution of estimates under each estimator. I then plot these estimates with ggplot2, using facet_wrap(~estimator) to create three histograms: one for the difference-in-means estimator, another for the difference-in-differences estimator, and a third for OLS with covariate adjustment.

simulations <- simulate_design(design, sims = 1000)

# Mean estimate per estimator, computed separately so that each facet
# gets its own reference line rather than the grand mean across estimators
facet_means <- simulations %>%
  group_by(estimator) %>%
  summarise(mean_estimate = mean(estimate))

ggplot(data = simulations,
       aes(x = estimate)) +
  geom_histogram() +
  geom_vline(data = facet_means, aes(xintercept = mean_estimate),
             linetype = "dashed", color = "blue") +
  facet_wrap(~estimator, ncol = 1) +
  theme_bw() +
  theme(
    panel.grid = element_blank()
  )

Conclusion: When \(X_i\) is negatively associated with \(Y_i\), the change-score method (difference-in-differences) is counterproductive. OLS with covariate adjustment estimates \(c\) rather than wrongly assuming \(c=1\), and provides the most precise estimates (the spread of its sampling distribution is the smallest). DID, which imposes \(c=1\), yields a sampling distribution with a larger spread than even the unadjusted difference-in-means estimator.
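
The intuition follows from a variance decomposition: \(\mathrm{Var}(Y_i - X_i) = \mathrm{Var}(Y_i) + \mathrm{Var}(X_i) - 2\,\mathrm{Cov}(X_i, Y_i)\). When the covariance is positive and large (roughly, when \(c\) is near 1), subtracting \(X_i\) removes variance from the outcome; when the covariance is negative, subtracting \(X_i\) adds variance, so the change score is noisier than \(Y_i\) itself.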


Revision

Randomization Inference

Please download the Clingingsmith, Khwaja, and Kremer data set from Canvas. This paper looks at the impact of a visa lottery for the pilgrimage to Mecca on the views of Pakistani Muslims towards people from other countries. Load the haven package and use the read_dta command to load the data set. The treatment variable in this study is success. The outcome variable is views. Evaluate the sharp null hypothesis that winning the visa lottery has no effect on views for any subject. Assume that the Pakistani authorities assigned visas using complete random assignment.

Alert: Practice doing such a task in 15 minutes or less. The last section of these notes includes the solution. We have done the following simulation-based tasks so far: (1) getting the sampling distribution of the average treatment effect; (2) obtaining a p-value under the sharp null hypothesis; (3) estimating the “statistical power” of a study with \(n\) observations. You should have code for each of these that can easily be adapted to different data sets in an exam setting.


Concepts

  1. How to read potential outcomes (Examples: \(E[Y_i | D_i = 1]\), \(E[Y_i | d_i=1]\)). In particular, the difference between \(d_i\) and \(D_i\).

Example: Consider a data set (\(n=4\), \(m=2\)) in which we know each subject’s potential outcomes, and the observed assignment vector is \(d_1\). There are \({4 \choose 2} = 6\) ways to assign exactly 2 of 4 subjects to treatment. The remaining assignment vectors are labelled \(d_2, \cdots, d_6\); they are not observed or realized. As practice, and to clarify potential outcomes, compute the following quantities: \(E[Y_i(1)]\), \(E[Y_i(0) | d_i = 1]\), \(E[Y_i(0) | D_i=1]\).

Unit  Y1  Y0  d1  d2  d3  d4  d5  d6
   1  10   5   1   0   0   0   1   1
   2   8   4   1   0   1   1   0   0
   3  12  10   0   1   0   1   0   1
   4   3   2   0   1   1   0   1   0

Solutions:

\(E[Y_i(1)] = \frac{10+12+8+3}{4} = 8.25\)

\(E[Y_i(0) | d_i = 1] = \frac{5+4}{2} = 4.5\)

\(E[Y_i(0) | D_i = 1] = (\frac{1}{6} \cdot \frac{5+4}{2})+ (\frac{1}{6} \cdot \frac{10+2}{2}) + (\frac{1}{6} \cdot \frac{4+2}{2}) + (\frac{1}{6} \cdot \frac{10+4}{2}) + (\frac{1}{6} \cdot \frac{5+2}{2}) + (\frac{1}{6} \cdot \frac{5+10}{2}) = 5.25\)
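
Notice that \(E[Y_i(0) | D_i = 1]\) works out to the unconditional mean, \(E[Y_i(0)] = \frac{5+4+10+2}{4} = 5.25\): under random assignment, the treated group’s average untreated potential outcome is, in expectation, simply the average for the whole pool.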

Switching Equation: Create a new column, \(Y_i\), that applies the switching equation, \(Y_i = d_i Y_i(1) + (1 - d_i) Y_i(0)\), to the potential outcomes, assuming \(d_6\) is the assignment vector that was randomly picked by a researcher. What are the observed \(Y_i\) in this case? (A sketch follows.)
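
Here is a minimal R sketch of the switching equation applied to the four-unit example above:

# Potential outcomes and the realized assignment vector from the table
Y1 <- c(10, 8, 12, 3)
Y0 <- c(5, 4, 10, 2)
d6 <- c(1, 0, 1, 0)

Y <- d6 * Y1 + (1 - d6) * Y0  # switching equation
Y                             # observed outcomes: 10 4 12 2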


  2. Core assumptions in an experiment (in words and notation). Evaluating violations of these assumptions in some study. Remember, exclusion restriction violations include differential attrition across groups, asymmetric measurement in experimental groups, and bundled treatments.

  3. Difference between estimand, estimator, and estimate. Identifying various estimands we have covered in the course, and how different estimators map onto these estimands (Note: Every estimator “shoots for” an estimand.) For example, we have encountered the following estimators: difference-in-means, blocked ATE, difference-in-means with clustered data, difference-in-differences, covariate adjustment in OLS.

  4. Showing that the difference-in-means estimator is unbiased for the average treatment effect.

  5. What is the formula for the true standard error? What are some of the design implications of this formula?

  6. What is blocking? Why does it reduce sampling variability?

  7. What is clustering? When does it increase sampling variability?

  8. How is blocking different from clustering?

  9. When does a difference-in-differences (\(Y_i - X_i \sim D_i\)) estimator perform worse than a difference-in-means without covariate adjustment (\(Y_i \sim D_i\))? How does covariate adjustment (\(Y_i \sim D_i + X_i\)) address this situation?

  10. What is the sharp null hypothesis? What does a p-value generated under the sharp null hypothesis tell us?


Solutions

library(haven)      # read_dta
library(randomizr)  # declare_ra, obtain_permutation_matrix
# (estimatr, dplyr, and knitr are already loaded from earlier in these notes)

dat3 = read_dta("clingingsmith.dta")

# To find exactly how many assigned to treatment

dat3 %>% group_by(success) %>% summarise(N=n()) %>% kable()
success    N
      0  448
      1  510
# Evaluating the sharp null

declaration <- declare_ra(N = nrow(dat3), m = 510)

# All permissible assignment vectors under complete random assignment with
# m = 510 treated (sampled, up to the function's default of 10,000)
D <- obtain_permutation_matrix(declaration)

# Obtain the sampling distribution under the sharp null: since the sharp null
# says treatment has no effect for any subject, observed outcomes stay fixed
# while we re-estimate under each simulated assignment
ests_sharp_null <- rep(NA, ncol(D))

for(i in 1:ncol(D)){
  dat3$d_temp <- D[, i]
  estimate_sharp_null <- difference_in_means(views ~ d_temp, data = dat3)
  ests_sharp_null[i] <- estimate_sharp_null$coefficients
}

# Estimate from the study

fit <- difference_in_means(views ~ success, data = dat3)
estimate <- fit$coefficients

# Calculate the p-value
results <- tibble(
    `One Sided` = mean(ests_sharp_null >= estimate),
    `Two Sided` = mean(abs(ests_sharp_null) >= abs(estimate))
    )

kable(results, 
      caption = "p values from Randomization Inference",
      caption.above = TRUE,
      digits = 3)
p values from Randomization Inference

One Sided  Two Sided
    0.003      0.004