Notes from Last Week

Showing that \(\mathbb{E}[ATT] = ATE\)

Question 4 in Chapter 2 asks us to prove the following claim: “When treatments are allocated using complete random assignment, the ATT is, in expectation, equal to the ATE.”

Let's rewrite the ATT for a given assignment, \(\frac{\sum_{i=1}^N \tau_i d_i}{\sum_{i=1}^N d_i}\), in terms of an expectation:

\(\frac{\sum_{i=1}^N \tau_i d_i}{\sum_{i=1}^N d_i} = \frac{\sum_{i: d_i = 1} \tau_i}{m} = \mathbb{E}[\tau_i \mid d_i = 1]\)

Note that the expectation operator in this step signifies an average over subjects (here, the \(m\) subjects treated under assignment \(d\)).

The question says that this quantity, in expectation over all possible random assignments, equals the ATE. Formally:

\(\mathbb{E}\{ \mathbb{E}[\tau_i | d_i = 1] \} = \mathbb{E}_d\{\mathbb{E}[\tau_i | d_i = 1, d] \} = \mathbb{E}[\tau_i | D_i = 1]\)

Note that the expectation operator in the final step averages over both subjects \(i\) and assignment vectors \(d\).

We now show that \(\mathbb{E}[\tau_i | D_i = 1] = \mathbb{E}[Y_i(1) - Y_i(0) | D_i = 1]\)

Using the linearity of expectations, we get:

\(\mathbb{E}[Y_i(1) - Y_i(0) | D_i = 1] = \mathbb{E}[Y_i(1)| D_i = 1] - \mathbb{E}[Y_i(0) | D_i = 1]\)

Random assignment implies \(\mathbb{E}[Y_i(0)] = \mathbb{E}[Y_i(0) | D_i = 1] = \mathbb{E}[Y_i(0) | D_i = 0]\). So we can replace the second term and rewrite the above quantity as:

\(\mathbb{E}[Y_i(1)| D_i = 1] - \mathbb{E}[Y_i(0) | D_i = 0]\). Random assignment likewise implies \(\mathbb{E}[Y_i(1)] = \mathbb{E}[Y_i(1) | D_i = 1]\), so this quantity equals \(\mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)]\), which is the average treatment effect.
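Putting the steps together:

\[
\begin{aligned}
\mathbb{E}_d\{\mathbb{E}[\tau_i \mid d_i = 1]\} &= \mathbb{E}[\tau_i \mid D_i = 1] \\
&= \mathbb{E}[Y_i(1) \mid D_i = 1] - \mathbb{E}[Y_i(0) \mid D_i = 1] \\
&= \mathbb{E}[Y_i(1)] - \mathbb{E}[Y_i(0)] = ATE
\end{aligned}
\]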


Non-interference and Excludability

Questions 10 and 12 in Chapter 2 ask us to evaluate the non-interference and excludability assumptions. How are these assumptions different from each other?

Intuition: The non-interference assumption focuses on how interaction between units affects potential outcomes. The excludability assumption pertains to what we randomly assign and whether treatment is the only thing that moves with assignment. Notationally:

Non-interference implies \(Y_i(z,d) = Y_i(z',d')\) whenever \(z_i = z'_i\) and \(d_i=d'_i\). To illustrate, say we have \(n=3\) and \(m=1\). Let one assignment vector be \(z = \{0,1,0\}\), in which the third subject (\(i=3\)) is untreated. If we assume non-interference, we in effect think that subject 3's potential outcome is the same when the assignment vector is \(z'= \{1,0,0\}\). Note that in both \(z\) and \(z'\), subject 3's assignment status is the same; only the first and second subjects' assignment statuses change.
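To make this concrete, we can enumerate the possible assignment vectors with randomizr (a small sketch; decl and Z are just illustrative names):

library(randomizr)

# All possible assignment vectors when m = 1 of N = 3 subjects is treated
decl <- declare_ra(N = 3, m = 1)
Z <- obtain_permutation_matrix(decl)
Z
# Each column is one assignment vector z. In the two columns where the third
# row is 0, subject i = 3 is untreated; non-interference says their potential
# outcome is identical across those two columns.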

Excludability implies that \(Y_i(z,d) = Y_i(d)\), or equivalently \(Y_i(z=1,d) = Y_i(z=0,d)\). This says that whether we assign \(i\) to treatment or control does not matter in itself; the only thing that determines which of \(i\)'s potential outcomes is revealed is their treatment status.

Example: In question 10, non-interference is compromised if student \(i\) sits next to “treated” student \(j\) in the cafeteria and reads \(j\)’s newspaper. Now, \(i\)’s potential outcome is a function not only of their own treatment assignment (\(z_i=0\)) but also of \(j\)’s assignment and treatment status (\(z_j=1\) and \(d_j=1\)). In contrast, the exclusion restriction is violated because students assigned to treatment (\(z=1\)) receive not only a newspaper (\(D\)) but also a letter informing them that they are part of a study (\(L\)). If the letter affects outcomes, “the exclusion restriction breaks down [because] random assignment sets in motion [another] cause of \(Y_i\) other than treatment \(d_i\)” (Gerber and Green 2012:40). It is then no longer true that \(Y_i(z=1,d=1) = Y_i(z=0,d=1)\).


Standard Error

The difference-in-means estimate changes in every iteration or hypothetical replication of an experiment. In each iteration, we draw a new assignment vector \(d\): a different (but random) subset of subjects is in the treatment group and reveals \(Y_i(1)\), and a different subset is in the control group and reveals \(Y_i(0)\). Consequently, the average value of the observed outcome \(Y_i\) changes with \(d\).

Sampling Distribution: A sampling distribution is the frequency distribution of a statistic obtained from hypothetical replications of a randomized experiment.

We can retrieve the sampling distribution in R using a for loop.

Standard Error: It is the standard deviation of the sampling distribution. Let \(\widehat{\theta}_j\) be the difference-in-means estimate in the \(j\)th iteration of the experiment. Then \(\text{SE} = \sqrt{\frac{1}{J}\sum_{j=1}^J \big(\widehat{\theta}_j - \overline{\widehat{\theta}}\big)^2}\) (where \(\overline{\widehat{\theta}}\) is the mean of the sampling distribution, and there are \(J\) possible random assignments).

Mathematically, the true standard error equals: \(\sqrt{\frac{1}{N-1}\{\frac{m}{N-m}Var[Y_i(0)] + \frac{N-m}{m}Var[Y_i(1)] + 2Cov[Y_i(0),Y_i(1)]\}}\)
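This formula translates directly into R. Here is a minimal sketch, assuming we can see both full potential-outcome vectors (which is only possible with simulated data); note the population (divide-by-\(N\)) variance and covariance:

# Sketch: the true SE, given full potential-outcome vectors Y1 and Y0
# and m treated subjects out of N
true_se <- function(Y1, Y0, m){
  N <- length(Y1)
  var_pop <- function(x){sum((x - mean(x))^2)/length(x)}                # divides by N, not N-1
  cov_pop <- function(x, y){sum((x - mean(x))*(y - mean(y)))/length(x)}
  sqrt((1/(N - 1))*((m/(N - m))*var_pop(Y0) + ((N - m)/m)*var_pop(Y1) + 2*cov_pop(Y0, Y1)))
}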

Some implications of Equation 3.4

  1. The larger the sample (\(N\)), the smaller the standard error.

  2. The smaller the variance in \(Y_i(1)\) or \(Y_i(0)\), the smaller the standard error. Measurement error in the outcome increases the variance in potential outcomes and the standard error.

  3. Researchers should allocate more units to the condition with higher variance in potential outcomes, as the simulation below demonstrates.

library(tidyverse)

set.seed(45) # for replicability

## Case 1: Create some data such that Var(Y1) = Var(Y0)

Y1 <- runif(80, min = 0, max = 10)
Y0 <- Y1 - 5 # A constant treatment effect gives us Var[Y1] = Var[Y0].

# A function called sims that calculates the standard error, given some data

sims <- function(N,m,Y1,Y0){
# Getting all possible assignment vectors when m of N are treated
declaration <- randomizr::declare_ra(N = N, m = m)
D <- randomizr::obtain_permutation_matrix(declaration) 
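# Note: if the number of possible assignments exceeds the function's cap
# (10,000 by default, via its maximum_permutations argument), it returns a
# random sample of assignment vectors rather than all of them.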
# Defining the function dif_in_means
dif_in_means <- function(Y1,Y0,Z){
  # We need to ensure the dataset columns are numerical vectors, not characters
  as.num <- function(x) {as.numeric(as.character(x))} 
  Y1 <- as.num(Y1)
  Y0 <- as.num(Y0)
  Z <- as.num(Z)
  #Switching equation
  Y <- Y1*Z + (1-Z)*Y0
  # ATE estimate
  estimate <- mean(Y[Z==1], na.rm = T) - mean(Y[Z==0], na.rm = T)
  return(estimate)
}
# Getting sampling distribution of ATE
reps <- dim(D)[2] # gives the number of columns in the permutation matrix D
ate <- rep(0,reps)
for(i in 1:reps){
  Z = D[,i]
  ate[i] <- dif_in_means(Y1,Y0, Z)
}
# Creating a one-row tibble with m and the standard error
output <- tibble(
  M = m,
  SE = sd(ate)
)
return(output)
}

# Calculating the standard error for different m

output1 <- c()

for(i in 1:79){
  result <- sims(N=80,m=i,Y1,Y0)
  output1 = bind_rows(output1, result)
}


# Case 2:  Create some data such that Var(Y1) > Var(Y0)

Y0_new <- runif(80, min = 0, max = 10)
Y1_new <- Y0_new + (Y0_new/2) # A treatment effect that is positively correlated with Y0 gives us Var[Y1] > Var[Y0].

output2 <- c()
for(i in 1:79){
  result <- sims(N=80,m=i,Y1_new,Y0_new)
  output2 = bind_rows(output2, result)
}

# Comparing the two simulations

library(patchwork)

fig1 <- ggplot(data = output1, aes(x = M, y = SE)) + 
  geom_point() + 
  geom_vline(xintercept = 40, linetype = "dashed") +
  ylim(c(0,4)) +
  ggtitle("Var[Y1] = Var[Y0]") +
  theme_bw()

fig2 <- ggplot(data = output2, aes(x = M, y = SE)) + 
  geom_point() + 
  geom_vline(xintercept = 40, linetype = "dashed") +
  ylim(c(0,4)) +
  ggtitle("Var[Y1] > Var[Y0]") +
  theme_bw()


fig1 + fig2

When \(Var[Y_i(1)] = Var[Y_i(0)]\), we observe the “tightest” sampling distribution (i.e. the lowest standard error) when \(m \approx 40\). When \(Var[Y_i(1)] > Var[Y_i(0)]\), the optimal design has \(m \approx 47\) subjects in treatment. To minimize the standard error, we should allocate more units to the treatment condition, which has the higher variance in potential outcomes.


Calculating the True SE

Download the data set from Section I, and calculate the true standard error by applying the formula (equation 3.4) and conducting simulations using a for-loop. Note that functions in R like var, cov and sd adjust for degrees of freedom (i.e. divide by \(N-1\) rather than \(N\)) because they provide an estimate of these quantities. You will have to write custom functions or apply a degrees of freedom correction to get matching answers.

Note: The solution is available in the last tab of these section notes.


SE Estimate

Since we do not observe both potential outcomes for any subject, we cannot calculate the true standard error using either the formula or simulations. Instead, we estimate the standard error using observed data.

\(\widehat{SE} = \sqrt{\frac{\widehat{Var(Y_i(0))}}{N-m} + \frac{\widehat{Var(Y_i(1))}}{m}}\)
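As a sketch, here is how this estimator can be computed from observed data (the data below are simulated purely for illustration, and the variable names are hypothetical):

library(randomizr)

set.seed(10)
N <- 100; m <- 50
d <- complete_ra(N = N, m = m)   # realized assignment vector
y <- rnorm(N) + 2*d              # hypothetical observed outcomes

# var() divides by n - 1, which is what the estimated variances call for
se_hat <- sqrt(var(y[d == 0])/(N - m) + var(y[d == 1])/m)
se_hat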


Randomization Inference

Randomization inference is a kind of hypothesis testing that generates a p value, which tells us the probability of obtaining an estimate at least as extreme as the one observed, under the sharp null hypothesis of no effect.

The sharp null hypothesis of no effect stipulates that \(\tau_i = 0\) for all \(i\). Formally, \(Y_i(1) = Y_i(0) \ \forall \ i\). If \(\tau_i = 0 \ \forall \ i\), then by implication \(\mathbb{E}[\tau_i] = 0\), i.e. the average treatment effect is also 0.

The steps involved in this hypothesis test are as follows:

  1. Given some data, we only observe \(Y_i(1)\) for treatment group subjects and \(Y_i(0)\) for control group subjects. This is the observed outcome or \(Y_i\). The data set also contains a column with each subject’s treatment status, or \(d_{1,i}\). This is one of several possible assignment vectors that randomly assigns subjects to treatment and control conditions.

  2. Estimate the average treatment effect using the difference-in-means estimator, given the observed data. Call this \(\widehat{\theta}\).

  3. Use mutate to create two new columns: \(Y^0_{\text{null}}\) and \(Y^1_{\text{null}}\). Under the sharp null hypothesis, the observed \(Y_i = Y^0_{\text{null}} = Y^1_{\text{null}}\).

  4. Draw a new treatment status vector, \(d_{2,i}\), apply the switching equation to get the observed outcome \(Y_i\) given this treatment status vector, and estimate the average treatment effect. Repeat this process a large number of times, storing the estimate of the average treatment effect in each iteration.

  5. The resulting sampling distribution will be centered on 0. It is the sampling distribution of the ATE under the sharp null hypothesis. We are interested in the probability of observing an estimate as extreme as \(\widehat{\theta}\) or more extreme (i.e. \(|\theta| \geq |\widehat{\theta}|\)), given the sampling distribution under the sharp null hypothesis. Formally, \(p(|\theta| \geq |\widehat{\theta}|)\) for a two-tailed p value.

P Value Interpretation: If randomization inference yields a large p value, say \(p=0.4\), this is not evidence in support of the sharp null hypothesis. This p value tells us that there is a 40% chance of observing an estimate at least as extreme as the one we get from the data, assuming the null hypothesis is correct.


RI in R

I will create some data where \(Y_1\) and \(Y_0\) are unobserved potential outcomes, and \(X_1\) and \(X_2\) are pre-treatment characteristics that can be used for blocking (more on this below). \(C_1\) and \(C_2\) are clusters (everyone within a cluster is assigned to the same treatment condition). For this week, please ignore the cluster variables. Assume that we are randomly assigning each subject \(i\) to a treatment condition (\(d_i=1\)) or control condition (\(d_i=0\)).

Given the data generating process, the true ATE, \(\mathbb{E}[Y_i(1) - Y_i(0)]\), is approximately 5, but note that \(\tau_i \neq 5\) \(\forall i\).

library(tidyverse)
library(knitr)
library(randomizr)

dat2 <- tibble(
  Y0 = rnorm(100, mean = 0, sd = 8),
  Y1 = Y0 + rnorm(100, mean = 5, sd = 4),
  X1 = as.numeric(Y0 >= 0),
  X2 = randomizr::complete_ra(N = 100, m = 50),
  C1 = ifelse(Y0 <= -1,1,ifelse((Y0 > -1) & (Y0 <= 0),2,ifelse((Y0 > 0) & (Y0 <= 1),3,4))),
  C2 = sample(x = c(1,2,3,4), size = 100, replace = T, prob = c(0.25,0.25,0.25,0.25))
)

kable(head(dat2), caption = "Dataset for Randomization Inference", row.names = T)
Dataset for Randomization Inference
     Y0            Y1           X1  X2  C1  C2
1    6.4251017     14.498177    1   1   4   3
2    -1.9473761    -1.092236    0   1   1   2
3    -4.2344162    1.111981     0   1   1   3
4    -17.5068722   -11.844754   0   1   1   4
5    0.8481727     7.005424     1   0   3   3
6    3.4715242     9.665946     1   1   4   2
dat2 %>% summarise(`True ATE` = mean(Y1 - Y0)) %>% kable()
True ATE
5.13877

Step 1: Assign exactly \(m=50\) of \(N=100\) subjects to treatment.

# Step 1: Create an assignment vector (i.e. the "realized" or "actual" assignment status)

dat2 <- dat2 %>%
  mutate(
    d_i = complete_ra(N = 100, m = 50),
    y_i = Y1*d_i + (1-d_i)*Y0
  )

kable(dat2 %>% select(y_i,d_i) %>% head(), caption = "Observed Data", row.names = T)
Observed Data
     y_i          d_i
1    6.425102     0
2    -1.092236    1
3    -4.234416    0
4    -17.506872   0
5    7.005424     1
6    3.471524     0

Step 2: Estimate the difference in means, \(\widehat{\theta}\)

ate_estimate = dat2 %>% summarise(`DIM` = mean(y_i[d_i==1]) - mean(y_i[d_i==0]))

kable(ate_estimate)
DIM
1.93908

Step 3: Use mutate to create two new columns: \(Y^0_{\text{null}}\) and \(Y^1_{\text{null}}\). Under the sharp null hypothesis, the observed \(Y_i = Y^0_{\text{null}} = Y^1_{\text{null}}\)

dat2 <- dat2 %>%
  mutate(
    y1_null = y_i,
    y0_null = y_i
  )

kable(dat2 %>% select(y_i,d_i,y1_null,y0_null) %>% head(), caption = "Data Under the Sharp Null Hypothesis of No Effect", row.names = T)
Data Under the Sharp Null Hypothesis of No Effect
     y_i          d_i   y1_null      y0_null
1    6.425102     0     6.425102     6.425102
2    -1.092236    1     -1.092236    -1.092236
3    -4.234416    0     -4.234416    -4.234416
4    -17.506872   0     -17.506872   -17.506872
5    7.005424     1     7.005424     7.005424
6    3.471524     0     3.471524     3.471524

Step 4: Draw a new treatment status vector, \(d_{2,i}\), apply the switching equation to get the observed outcome \(Y_i\) given this treatment status vector, and estimate the average treatment effect. Repeat this process a large number of times, storing the estimate of the average treatment effect in each iteration.

# Obtain a permutation matrix (all possible assignment vectors)
declaration = declare_ra(N = nrow(dat2), m=50)

D <- obtain_permutation_matrix(declaration) 
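# Note: with N = 100 and m = 50, the number of possible assignments is
# astronomically large; obtain_permutation_matrix then returns a random
# sample of them (10,000 by default, via its maximum_permutations argument),
# so the distribution below approximates the full randomization distribution.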

# Obtain sampling distribution under the sharp null
ests_sharp_null <- c()

for(i in 1:ncol(D)){
  dat2$d_temp <- D[,i]
  estimate_sharp_null <- dat2 %>% summarise(`DIM` = mean(y_i[d_temp==1]) - mean(y_i[d_temp==0]))
  ests_sharp_null = bind_rows(ests_sharp_null, estimate_sharp_null)
}

kable(ests_sharp_null %>% head(), caption = "Sampling Distribution Under the Sharp Null Hypothesis", row.names = T)
Sampling Distribution Under the Sharp Null Hypothesis
     DIM
1    -0.0696396
2    -1.4523337
3    1.8396020
4    -0.3762381
5    0.8862125
6    -2.9122631
ggplot(data = ests_sharp_null, aes(x = DIM)) +
  geom_histogram(bins = 50) +
  geom_vline(xintercept = 0, linetype = "dashed", color = "blue") +
  xlab("Estimates") + ylab("Frequency") + ggtitle("Sampling Distribution Under the Sharp Null Hyp.") + theme_bw()


Step 5: Calculate \(p(|\theta| \geq |\widehat{\theta}|)\) to get the two-sided p value.

p_values = ests_sharp_null %>% summarise(
  `One Sided P Value` = mean(DIM >= ate_estimate$DIM),
  `Two Sided P Value` = mean(abs(DIM) >= abs(ate_estimate$DIM))
)

kable(p_values, digits = 5)
One Sided P Value   Two Sided P Value
0.1124              0.2238
ggplot(data = ests_sharp_null, aes(x = DIM)) +
  geom_histogram(bins = 50) +
  geom_vline(xintercept = ate_estimate$DIM, linetype = "dashed", color = "blue") +
  geom_text(data = ate_estimate, aes(y = 550, x = DIM + 1.25, label = "DIM Estimate")) +
  xlab("Estimates") + ylab("Frequency") + 
  ggtitle("Prob. of Observing the Estimate Under Sharp Null") + theme_bw()


Advanced

Blocking

In this data, \(X_1\) is highly correlated with potential outcomes: \(\rho(Y_1, X_1) = 0.7\) and \(\rho(Y_0, X_1) = 0.77\).
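These correlations can be checked directly in the data (exact values depend on the random draw):

cor(dat2$Y1, dat2$X1)
cor(dat2$Y0, dat2$X1)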

Consider a design in which we conduct two experiments, one for units with covariate profile \(X_1 = 1\) and another for those with \(X_1 = 0\). Here is how to conduct randomization inference with blocked assignment: 50% of subjects in the \(X_1=1\) block are assigned to treatment (\(m_{X_1=1}=27\) or \(28\)), and 50% of subjects in the \(X_1=0\) block are assigned to treatment (\(m_{X_1=0}=23\) or \(22\)).

Step 1: Specify blocks in randomizr

# Step 1: Create an assignment vector (i.e. the "realized" or "actual" assignment status)

blocked_declaration = declare_ra(N = nrow(dat2), blocks = dat2$X1, block_prob = c(.5,.5)) 
# You can use block_m = c() to specify the exact number
# Since our blocks have an odd number of subjects, a 50% split yields either 27 or 28 treated subjects in X1=1 block
# and 22 or 23 subjects in X1=0 block. 
# If we use block_m, we would pick one of the two numbers for each block.
# If you use block_prob, randomizr will randomly pick one of the two numbers for each block in each iteration of the experiment.

Step 2: Conduct blocked random assignment, and estimate the blocked average treatment effect.

According to Gerber and Green, the overall ATE estimate under block random assignment is \(\sum_{j=1}^J \frac{N_j}{N}\widehat{ATE_j}\) (where \(J\) is the number of blocks, \(N_j\) is the number of subjects in the \(j\)th block, and \(\widehat{ATE_j}\) is the estimated ATE within block \(j\)).

library(estimatr)
library(texreg)

dat2 <- dat2 %>% mutate(
  d_blocked = block_ra(blocks = X1, block_prob = c(.5,.5)),
  y_blocked = Y1*d_blocked + (1-d_blocked)*Y0
)

m1 <- difference_in_means(y_blocked ~ d_blocked, blocks = X1, data = dat2)

texreg::htmlreg(list(m1),
                include.ci = FALSE,
                booktabs = TRUE,
                caption = "Overall ATE, using estimatr",
                caption.above = TRUE
                )
Overall ATE, using estimatr

             Model 1
d_blocked    4.06***
             (1.10)
DF           96.00
nobs         100
nblocks      2
nclusters    0
condition2   1
condition1   0

*** p < 0.001; ** p < 0.01; * p < 0.05
# Applying the formula
block_DIM = dat2 %>% group_by(X1) %>% summarise(DIM = mean(y_blocked[d_blocked==1]) - mean(y_blocked[d_blocked==0]), Nj = n())

overall_ate <- block_DIM %>% summarise(`Block Weighted Overall ATE` = sum(DIM*(Nj/sum(Nj))))

kable(block_DIM, caption = "Block Level ATEs")
Block Level ATEs
X1   DIM        Nj
0    3.617449   45
1    4.427061   55
kable(overall_ate, caption = "Overall ATE, using the formula")
Overall ATE, using the formula
Block Weighted Overall ATE
4.062735

Step 3: Conduct randomization inference

D = obtain_permutation_matrix(blocked_declaration)

ests_sharp_null <- c()

for(i in 1:ncol(D)){
  dat2$d_blocked_temp <- D[,i]
  estimate_sharp_null <- difference_in_means(y_blocked ~ d_blocked_temp, 
                                             blocks = X1, 
                                             data = dat2)
  
  ests_sharp_null = bind_rows(ests_sharp_null, estimate_sharp_null$coefficients)
}

ggplot(data = ests_sharp_null, aes(x = d_blocked_temp)) +
  geom_histogram(bins = 50) +
  geom_vline(xintercept = overall_ate$`Block Weighted Overall ATE`, linetype = "dashed", color = "blue") +
  geom_text(data = overall_ate, aes(y = 550, x = `Block Weighted Overall ATE` + 1, label = "DIM\nEstimate")) +
  xlab("Estimates") + ylab("Frequency") + 
  ggtitle("Prob. of Observing the Estimate Under Sharp Null") + theme_bw()


Solutions

Calculating the true SE

library(tidyverse) # needed in this tab for read_csv, mutate, and the pipe
library(knitr)     # for kable
library(randomizr)

# Load the data set
dat <- read_csv("w1_practice_data.csv")

estimates <- c()

for(i in 1:10000){
  dat <- dat %>%
  mutate(
    d = complete_ra(N = nrow(dat), m = 500),
    Y = Y1*d + (1-d)*Y0
  )
  estimate = dat %>% summarise(DIM = mean(Y[d==1]) - mean(Y[d==0]))
  estimates = bind_rows(estimates, estimate)
}

# Write custom functions:

var_pop <- function(x){sum((x-mean(x))^2)/(length(x))}
cov_pop <- function(x,y){sum((x-mean(x))*(y-mean(y)))/(length(x))}

# Formula-based calculation of SE

m <- 500
n <- nrow(dat) - m

out1 <- dat %>% summarise(`SE using formula` = sqrt(1/((m+n)-1)*((m/n)*var_pop(dat$Y0) + (n/m)*var_pop(dat$Y1) + 2*cov_pop(dat$Y1,dat$Y0))))
  
# Simulation based calculation

out2 <- estimates %>% summarise(`SE with simulations` = sqrt(var_pop(DIM)))

output <- bind_cols(out1, out2)

kable(output,
      digits = 4,
      caption = "True SE: Formula v. Simulations",
      caption.above = TRUE)
True SE: Formula v. Simulations
SE using formula   SE with simulations
0.1002             0.0996