From The Midterm

Sharp Null Hypothesis

Concept: The sharp null hypothesis implies a treatment effect of 0 for every subject. Formally, \(Y_i(1) = Y_i(0) \,\forall\, i\). This is different from a null hypothesis of no average effect, which states that \(\mu_{Y_i(1)} = \mu_{Y_i(0)}\).

Consider the following potential outcomes, a random assignment vector \(Z_i\), and an observed outcome \(Y_i\):

Table of Potential Outcomes

| Y(0) | Y(1) | Z | Y |
|------|------|---|---|
| 5    | 9    | 1 | 9 |
| 6    | 4    | 1 | 4 |
| 1    | 1    | 0 | 1 |
| 6    | 8    | 0 | 6 |

To construct potential outcomes that correspond to the sharp null hypothesis, start by identifying the missing data problem:

Missing Potential Outcomes

| Y(0) | Y(1) | Z | Y | Y0_null | Y1_null |
|------|------|---|---|---------|---------|
| 5    | 9    | 1 | 9 |         | 9       |
| 6    | 4    | 1 | 4 |         | 4       |
| 1    | 1    | 0 | 1 | 1       |         |
| 6    | 8    | 0 | 6 | 6       |         |

Under the sharp null hypothesis, every respondent’s \(Y_i = Y_i(1) = Y_i(0)\). So we fill in the missing potential outcomes:

Potential Outcomes Under Sharp Null

| Y(0) | Y(1) | Z | Y | Y0_null | Y1_null |
|------|------|---|---|---------|---------|
| 5    | 9    | 1 | 9 | 9       | 9       |
| 6    | 4    | 1 | 4 | 4       | 4       |
| 1    | 1    | 0 | 1 | 1       | 1       |
| 6    | 8    | 0 | 6 | 6       | 6       |
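
In R, this fill-in step is a single mutate. Here is a minimal sketch using the four observations above (dat, Y0_null, and Y1_null are illustrative names):

library(tidyverse)

# Observed assignment and outcome from the table above
dat <- tibble(Z = c(1, 1, 0, 0),
              Y = c(9, 4, 1, 6))

# Under the sharp null, Y_i(0) = Y_i(1) = Y_i for every subject,
# so the observed outcome fills in both missing potential outcomes
dat <- dat %>%
  mutate(Y0_null = Y,
         Y1_null = Y)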

Estimating Standard Errors

Takeaway: var_pop and cov_pop are used to calculate the true standard error (i.e. Equation 3.4). This is only possible when we know the potential outcomes \(Y_i(1)\) and \(Y_i(0)\) for all subjects. When we only have observed data, that is \(D_i\) and \(Y_i\), we estimate the standard error instead. These functions are not used when estimating the standard error, i.e. when we apply Equation 3.6.

The true standard error is:

\(\text{SE}(\widehat{ATE}) = \sqrt{\frac{1}{N-1}\{\frac{m}{N-m}Var[Y_i(0)] + \frac{N-m}{m}Var[Y_i(1)] + 2Cov[Y_i(0),Y_i(1)]\}}\)

Note that \(Var(Y_i(1)) = \frac{1}{N}\sum_{i=1}^N (Y_i(1) - \frac{\sum_1^N Y_i(1)}{N})^2\). When we do var(Y1), the software assumes we are estimating the variance of a variable, and so it calculates: \(\frac{1}{\color{red}{N-1}}\sum_{i=1}^N (Y_i(1) - \frac{\sum_1^N Y_i(1)}{N})^2\). To correct for this, we write a custom function var_pop that essentially calculates sum((x-mean(x))^2)/length(x) where length(x) = N. The same logic applies to \(Var(Y_i(0))\).

Similarly, \(Cov(Y_i(1),Y_i(0)) = \frac{1}{N}\sum_{i=1}^N (Y_i(1) - \frac{\sum_1^N Y_i(1)}{N})\cdot(Y_i(0) - \frac{\sum_1^N Y_i(0)}{N})\). However, cov(Y1,Y0) calculates the following quantity: \(\frac{1}{\color{red}{N-1}}\sum_{i=1}^N (Y_i(1) - \frac{\sum_1^N Y_i(1)}{N})\cdot(Y_i(0) - \frac{\sum_1^N Y_i(0)}{N})\). This is why we use cov_pop, which calculates sum((x-mean(x))*(y-mean(y)))/length(x) where length(x)= N.
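
Putting the two corrections together, here is a minimal sketch of var_pop and cov_pop, plus the true standard error from Equation 3.4, assuming the full schedules of potential outcomes Y0 and Y1 and the number of treated subjects m are known (true_se is an illustrative name):

var_pop <- function(x) {
  # Population variance: divide by N, not N - 1
  sum((x - mean(x))^2) / length(x)
}

cov_pop <- function(x, y) {
  # Population covariance: divide by N, not N - 1
  sum((x - mean(x)) * (y - mean(y))) / length(x)
}

true_se <- function(Y0, Y1, m) {
  # Equation 3.4: requires knowing both potential outcomes for all N subjects
  N <- length(Y0)
  sqrt((1 / (N - 1)) * ((m / (N - m)) * var_pop(Y0) +
                          ((N - m) / m) * var_pop(Y1) +
                          2 * cov_pop(Y0, Y1)))
}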

By contrast, an estimate of the standard error is:

\(\widehat{SE} = \sqrt{\frac{\widehat{Var(Y_i(0))}}{N-m} + \frac{\widehat{Var(Y_i(1))}}{m}}\)

Here, \(\widehat{Var(Y_i(1))} = \frac{1}{m-1} \sum_{i:\,d_i=1} \left(Y_i - \frac{\sum_{i:\,d_i=1} Y_i}{m}\right)^2\). Crucially, var(Y[d==1]) computes exactly this quantity, and the same holds for \(\widehat{Var(Y_i(0))}\) and var(Y[d==0]). We therefore do not need var_pop here. And because we never observe \(Y_i(1)\) and \(Y_i(0)\) for the same subject, \(Cov(Y_i(1),Y_i(0))\) cannot be estimated from observed data and does not appear in Equation 3.6 at all. This is why we do not use var_pop and cov_pop when we are estimating the standard error (i.e. we have actual data and are applying Equation 3.6).
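
And a minimal sketch of the estimator in Equation 3.6, assuming only an observed treatment indicator d and outcome Y (est_se is an illustrative name):

est_se <- function(Y, d) {
  # var() divides by the group size minus 1, which is exactly what
  # the sample-based estimator in Equation 3.6 calls for
  m <- sum(d)      # number of treated subjects
  N <- length(d)   # total number of subjects
  sqrt(var(Y[d == 0]) / (N - m) + var(Y[d == 1]) / m)
}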


Treatment-by-Covariate Interactions

An interaction effect refers to a treatment effect that varies across sub-groups or covariate profiles. For example, let a covariate \(X_i\) take on two values: 0 and 1. A given sample of subjects then contains two sub-groups: study participants with \(X_i=1\) and those with \(X_i=0\). An interaction effect means that the treatment effect for sub-group \(X_i=1\) (or \(\widehat{ATE_{X=1}}\)) differs from the treatment effect for sub-group \(X_i=0\) (or \(\widehat{ATE_{X=0}}\)). An ATE for a particular subset or sub-group of subjects is referred to as a conditional average treatment effect, or CATE.

Note that sub-group differences in treatment effects are descriptive findings. They are not the causal effect of the covariate \(X\) because we do not randomly assign values of \(X\). Any number of factors correlated with \(X\) could account for differences in the treatment effect.


Example

I create some data in which the treatment effect interacts with a covariate, \(X_i \in \{0,1\}\).

library(tidyverse)   # tibble(), mutate(), write_csv(), ggplot2
library(randomizr)   # complete_ra()
library(knitr)       # kable()

set.seed(04012022)

# Specify potential outcomes that are unobserved

dat_cov_interaction <- tibble(
  X = complete_ra(N = 500, m = 250),
  U = rnorm(500, mean = 2, sd = 2.5),
  Y0 = rnorm(500, mean = 0, sd = 2.5) + U * X,
  Y1 = Y0 + 0.5 + 0.25 * X
)

# Conduct a random assignment, apply the switching equation to get observed outcomes

dat_cov_interaction <- dat_cov_interaction %>%
  mutate(Z = complete_ra(N = 500, m = 250), # 250 of 500 subjects assigned to treatment
         Y = Y0 * (1 - Z) + Z * Y1)

# Select the observed variables

actual_dat <- dat_cov_interaction %>%
  select(X,Z,Y)

write_csv(actual_dat, file = "covariate_treatment_interaction.csv")

# Make a table of the head of this data set

kable(head(actual_dat),
      caption = "Glimpse of Dataset",
      digits = 3)
Glimpse of Dataset

| X | Z | Y     |
|---|---|-------|
| 1 | 1 | 1.224 |
| 1 | 0 | 0.390 |
| 1 | 0 | 1.348 |
| 0 | 0 | 0.346 |
| 1 | 1 | 1.454 |
| 0 | 0 | 4.183 |

Here is the average treatment effect for the two sub-groups, \(\widehat{ATE_{X=1}}\) and \(\widehat{ATE_{X=0}}\). These are called conditional average treatment effects:

library(estimatr)   # lm_robust()
library(broom)      # tidy()

cates <- actual_dat %>%
  group_by(X) %>%
  summarise(
    # cur_data() supplies the current sub-group's rows to lm_robust()
    tidy(lm_robust(Y ~ Z, data = cur_data()))
  ) %>%
  select(-df, -outcome)

kable(cates,
      caption = "Conditional Average Treatment Effects",
      digits = 2)
Conditional Average Treatment Effects

| X | term        | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|-------------|----------|-----------|-----------|---------|----------|-----------|
| 0 | (Intercept) | -0.31    | 0.24      | -1.27     | 0.20    | -0.79    | 0.17      |
| 0 | Z           | 0.58     | 0.32      | 1.79      | 0.08    | -0.06    | 1.21      |
| 1 | (Intercept) | 1.53     | 0.32      | 4.83      | 0.00    | 0.91     | 2.16      |
| 1 | Z           | 1.31     | 0.43      | 3.05      | 0.00    | 0.47     | 2.16      |

And here is a coefficient plot made with ggplot2 that visualizes the same information:

fig_cates <- ggplot(data = cates %>% filter(term != "(Intercept)"),
                    aes(x = estimate, 
                        y = as.factor(X))) +
  geom_point() +
  geom_linerange(aes(xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  xlim(-3,3) +
  xlab("CATE") +
  ylab("Sub-Group") +
  theme_bw()

fig_cates


Exercise I

Load the data set covariate_treatment_interaction.csv and evaluate the hypothesis that \(\widehat{ATE_{X=1}} - \widehat{ATE_{X=0}} \neq 0\). In other words, the treatment effect in sub-group \(X_i=1\) is not equal to the treatment effect in sub-group \(X_i=0\).

Hint: You should specify a regression of the following type:

\(\text{Outcome} = \beta_0 + \beta_1\cdot\text{Treatment} + \beta_2\cdot\text{Covariate} + \beta_3\cdot(\text{Treatment} \times \text{Covariate})\).

We are interested in \(\beta_3\), the difference in conditional average treatment effects.

Extra: Can you use the information in the table titled Conditional Average Treatment Effects to calculate \(\beta_0\), \(\beta_1\), \(\beta_2\) and \(\beta_3\)? Confirm your calculations with the regression output.


Treatment-by-Treatment Interactions

A treatment-by-treatment interaction refers to a design in which subjects are randomly assigned to two treatments, \(Z_1\) and \(Z_2\). In effect, subjects can be assigned to one of four possible conditions: both \(Z_1\) and \(Z_2\) equal to 0, both equal to 1, or exactly one of the two equal to 1. A factorial design is a generalization of this in which \(k\) treatments are randomly assigned, producing \(2^k\) experimental conditions (assuming every treatment has 2 levels).

To estimate treatment-by-treatment interaction effects, we specify the following regression:

\(\text{Outcome} = \beta_0 + \beta_1Z_1 + \beta_2Z_2 + \beta_3(Z_1 \times Z_2)\)

Where \(Z_1\) is an indicator variable that takes a value of 1 if the subject receives treatment 1, otherwise 0; and \(Z_2\) is an indicator variable that takes a value of 1 if the subject receives treatment 2, otherwise 0. Here \(\beta_3\) captures the interaction: the difference between the effect of \(Z_1\) when \(Z_2 = 1\) and the effect of \(Z_1\) when \(Z_2 = 0\) (equivalently, the difference in the effect of \(Z_2\) across the levels of \(Z_1\)).

Note that this interaction effect is causal because we randomly assign subjects to both treatment conditions, namely \(Z_1\) and \(Z_2\).
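
In R, this regression is one call to estimatr's lm_robust(). A minimal sketch, assuming a 2 \(\times\) 2 factorial data set dat_2x2 with indicators Z1 and Z2 and outcome Y (all hypothetical names):

library(estimatr)

# Z1 * Z2 expands to Z1 + Z2 + Z1:Z2, matching the specification above
fit_2x2 <- lm_robust(Y ~ Z1 * Z2, data = dat_2x2)
summary(fit_2x2)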


Example

We go back to Bertrand and Mullainathan (2004)’s study that sends resumes to firms for a job opening, randomly varying the applicant’s name (perceptions of race) and “quality” of the resume. They conduct the study in two cities, Boston and Chicago. The outcome of interest is whether the resume gets a call back. Here is a table that reproduces the results in the paper, i.e. the call back rates by treatment conditions (race and quality) and the covariate city.

Figure 1: Bertrand and Mullainathan (2004) Results


Exercise II

  1. Propose a regression model that assesses the effects of treatments (race and quality), interaction between them, and interactions between the treatments and the covariate, city. Let race = 1 if the resume uses a name that cues Black identity, quality = 1 if the resume is of low quality, and city = 1 for Boston.

  2. Use the table above to estimate the parameters of that regression model.

Extra: Can you confirm your calculations by estimating the regression using the Bertrand and Mullainathan (2004) data set we used in a prior week's problem set?


Solutions

Exercise I

library(estimatr)   # lm_robust()
library(texreg)     # htmlreg()
library(readr)      # read_csv()

dat <- read_csv("covariate_treatment_interaction.csv")

fit <- lm_robust(Y ~ Z + X + Z:X, data = dat)

htmlreg(list(fit),
        include.ci = FALSE,
        digits = 2,
        caption = "CATE Estimates")
CATE Estimates

|             | Model 1 |
|-------------|---------|
| (Intercept) | -0.31   |
|             | (0.24)  |
| Z           | 0.58    |
|             | (0.32)  |
| X           | 1.84*** |
|             | (0.40)  |
| Z:X         | 0.74    |
|             | (0.54)  |
| R2          | 0.14    |
| Adj. R2     | 0.13    |
| Num. obs.   | 500     |
| RMSE        | 3.01    |

***p < 0.001; **p < 0.01; *p < 0.05

Takeaway: The regression table reports \(\beta_3\) as Z:X, the interaction effect. This term captures the difference in CATEs, or \(\widehat{ATE_{X=1}} - \widehat{ATE_{X=0}}\). In the table we see that this difference is positive, but we cannot reject the null hypothesis \(H_0: \beta_3 = 0\). In other words, we cannot rule out the possibility that the treatment effect is the same for both sub-groups. This means there is no empirical support for the claim that the treatment “works” for sub-group \(X=1\) but not for sub-group \(X=0\), or for the related claim that the treatment effect is “driven” by sub-group \(X=1\).
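
As for the Extra in Exercise I: the regression parameters can be recovered from the table titled Conditional Average Treatment Effects, up to rounding in the displayed digits. \(\beta_0\) is the intercept in the \(X=0\) sub-group, \(-0.31\); \(\beta_1\) is the Z coefficient in the \(X=0\) sub-group, \(0.58\); \(\beta_2\) is the difference in intercepts, \(1.53 - (-0.31) = 1.84\); and \(\beta_3\) is the difference in Z coefficients, \(1.31 - 0.58 = 0.73\), which matches the reported \(0.74\) up to rounding.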


Exercise II

  1. The regression model can have the following specification:

\[\begin{equation} \text{Call Back} = \beta_0 + \beta_1(\text{Black}_i) + \beta_2(\text{Low}_i) + \beta_3(\text{Boston}_i) \\ + \beta_4(\text{Black}_i \times \text{Low}_i) + \beta_5(\text{Black}_i \times \text{Boston}_i) \\ + \beta_6(\text{Low}_i \times \text{Boston}_i) \\ + \beta_7(\text{Black}_i \times \text{Low}_i \times \text{Boston}_i) \end{equation}\]

  2. The parameter estimates are:

\(\beta_0 = 8.94\), because the reference group is White, “high” quality resume in Chicago.

\(\beta_1 = -3.66\), because the racial difference in call back rates among high quality resumes in Chicago is \(5.28 - 8.94 = -3.66\).

\(\beta_2 = -1.78\), because the quality-based difference in call back rates among White applicants in Chicago is \(7.16 - 8.94 = -1.78\).

\(\beta_3 = 4.18\), because the difference in call back rates between Boston and Chicago among high quality resumes of White applicants is \(13.12 - 8.94 = 4.18\).

\(\beta_4 = 2.02\), because the Black–White gap among low quality resumes in Chicago minus the same gap among high quality resumes is \((5.52 - 7.16) - (5.28 - 8.94) = 2.02\).

\(\beta_5 = -0.96\), because the Black–White gap among high quality resumes in Boston minus the same gap in Chicago is \((8.50 - 13.12) - (5.28 - 8.94) = -0.96\).

\(\beta_6 = -1.19\), because the low–high quality gap among White applicants in Boston minus the same gap in Chicago is \((10.15 - 13.12) - (7.16 - 8.94) = -1.19\).

\(\beta_7 = -0.54\), because the Black \(\times\) Low interaction in Boston minus the same interaction in Chicago is \([(7.01 - 10.15) - (8.50 - 13.12)] - [(5.52 - 7.16) - (5.28 - 8.94)] = -0.54\).
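
For the Extra, a minimal sketch of the confirming regression, assuming the Bertrand and Mullainathan (2004) data are loaded as a data frame bm with a call back outcome in percentage points and 0/1 indicators black, low, and boston (all hypothetical names to be matched to the actual data set):

library(estimatr)

# black * low * boston expands to all main effects plus all two- and
# three-way interactions, matching the specification above
fit_bm <- lm_robust(call_back ~ black * low * boston, data = bm)
summary(fit_bm)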