Notes from Week II

In Q.4, how do we get from the estimand \(\frac{\sum_1^N \tau_i d_i}{\sum_1^N d_i}\) to the ATT?

Intuition: The estimand changes depending on which units the random assignment happens to place in treatment. The first few steps of the derivation are:

\(\frac{\sum_{i=1}^N \tau_i d_i}{\sum_{i=1}^N d_i} = \frac{\sum_{i:\, d_i = 1} \tau_i}{m} = E[\tau_i \mid d_i = 1]\)

Note that the expectation operator in this step is just a finite average over the \(m\) units assigned to treatment under this particular assignment.

The question says that this estimand, in expectation, equals the ATE. Formally then:

\(E\{ E[\tau_i | d_i = 1] \} = E_d\{ E[\tau_i | d_i = 1, d] \} = E[\tau_i | D_i = 1]\)

The proof is straightforward from here, but it is important to note that the expectation operator in the second step is an average over all possible random assignments (as described in footnote 3, p. 28).
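As a quick numerical check of this claim, here is a minimal sketch (the \(\tau_i\) values and the \(n=4\), \(m=2\) design are made up for illustration, not taken from the question): the per-assignment ATT, averaged over every possible complete random assignment, recovers the ATE.

tau <- c(1, 4, -2, 7)                        # hypothetical unit-level treatment effects
assignments <- combn(4, 2)                   # all C(4,2) = 6 ways to treat 2 of 4 units
att_each <- apply(assignments, 2, function(treated) mean(tau[treated]))  # ATT under each assignment
mean(att_each)                               # average over all assignments ...
mean(tau)                                    # ... equals the ATE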

How is non-interference different from excludability?

Intuition: The non-interference assumption is about whether interaction between units affects potential outcomes. The excludability assumption is about what we randomly assign and whether treatment is the only thing that moves with assignment. Notationally:

Non-interference implies \(Y_i(z,d) = Y_i(z',d')\) where \(z_i = z'_i\) and \(d_i=d'_i\). To illustrate, say we have \(n=3\) and \(m=1\). Let one assignment vector be \(z = \{0,1,0\}\), in which unit \(i=3\) is untreated. If we assume non-interference, we in effect think that \(i\)’s potential outcome is the same when the assignment vector is \(z'= \{1,0,0\}\). Note that in both \(z\) and \(z'\), \(i\)’s assignment status is the same (only the first and second units’ assignment statuses change).

Excludability implies that \(Y_i(z,d) = Y_i(d)\), or \(Y_i(z=1,d) = Y_i(z=0,d)\). What this is saying is that whether we assign \(i\) to treatment or control does not matter; the only thing that determines which of \(i\)’s potential outcomes is realized is her/his treatment status.

Example: In question 10, non-interference is compromised if student \(i\) sits next to treated student \(j\) in the cafeteria, and reads her newspaper. Now, which potential outcome \(i\) reveals depends on \(j\)’s assignment and treatment status. In contrast, excludability is violated if being assigned to treatment (\(z=1\)) entails receiving a letter and newspaper. If the letter affects outcomes, it is no longer true that \(Y_i(z=1,d=1) = Y_i(z=0,d=1)\).
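To make the non-interference side concrete in code (a small sketch assuming the same \(n=3\), \(m=1\) setup; the vectors are simply the three possible assignments):

z_vectors <- list(c(1, 0, 0), c(0, 1, 0), c(0, 0, 1))   # every assignment with exactly one treated unit
split(z_vectors, sapply(z_vectors, `[`, 3))              # group the vectors by unit 3's own assignment
# Non-interference says unit 3's potential outcome may differ across these groups
# but not within them: (1,0,0) and (0,1,0) both leave unit 3 untreated.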

Robin Hood treatments: How does \(Cov[Y_i(1),Y_i(0)]\) affect sampling variability?

The key intuition here is that when \(Y_i(0)\) and \(Y_i(1)\) positively covary, some units have both high treated and untreated potential outcomes, while others have low \(Y_i(0)\) and \(Y_i(1)\) values. As a result, when a unit \(i\) (say one with high \(Y_i(1)\) and \(Y_i(0)\) values) moves between treatment conditions under different hypothetical assignments, \(i\) changes the group means considerably. When \(i\) is assigned to treatment, \(i\) pushes up the treatment group mean and pushes down the control group mean, producing a large ATE estimate. When \(i\) is assigned to control, \(i\) pushes up the control group mean and pushes down the treatment group mean, producing a small ATE estimate. This extreme flip-flopping increases the sampling variability. Here is an illustration:

Science Table: Cov[Y0, Y1] > 0

Unit           Y0   Y1    d1    d2    d3
1               2    5     0     0     1
2               3    6     0     1     0
3               4    7     1     0     0
ATE Estimate    -    -   4.5     3   1.5

Variance of the three ATE estimates:
## [1] 1.5

Science Table: Cov[Y0, Y1] < 0

Unit           Y0   Y1    d1    d2    d3
1               2    7     0     0     1
2               3    6     0     1     0
3               4    5     1     0     0
ATE Estimate    -    -   2.5     3   3.5

Variance of the three ATE estimates:
## [1] 0.1666667

In the above setup (\(n=3\), \(m=1\)), the main takeaway is that when \(Cov[Y_i(0),Y_i(1)] > 0\), I get three ATE estimates that are far apart: 4.5, 3, and 1.5 (with \(Var[\hat{ATE}] = 1.5\)). However, when \(Cov[Y_i(0),Y_i(1)] < 0\), I get three ATE estimates that are close to each other: 2.5, 3, and 3.5 (with \(Var[\hat{ATE}] = 0.167\)).
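Here is a small sketch that reproduces the two science tables above (the potential outcome values are copied from the tables; the helper function is mine):

Y0 <- c(2, 3, 4)

ate_by_assignment <- function(Y0, Y1) {
    # n = 3, m = 1: treat each unit in turn and compute the difference in means
    sapply(1:3, function(treated) mean(Y1[treated]) - mean(Y0[-treated]))
}

est_pos <- ate_by_assignment(Y0, Y1 = c(5, 6, 7))   # Cov[Y0, Y1] > 0
est_neg <- ate_by_assignment(Y0, Y1 = c(7, 6, 5))   # Cov[Y0, Y1] < 0

est_pos                                   # 1.5 3.0 4.5 -- the same three estimates, ordered by treated unit
est_neg                                   # 3.5 3.0 2.5
sum((est_pos - mean(est_pos))^2) / 3      # 1.5
sum((est_neg - mean(est_neg))^2) / 3      # 0.1666667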

Randomization Inference

In the following subsections, we will do randomization inference when there is (a) complete random assignment, (b) block random assignment, and (c) clustering. For the time being, we will use randomizr but in subsequent weeks move to ri2. The first step is to create some data with the following properties:

  • \(E[Y_i(1) - Y_i(0)] = 5\), but treatment effects are heterogeneous: \(\tau_i \neq 5\) for individual units.

  • A blocking variable \(X_1\) that is highly correlated with outcomes.

  • A second blocking variable \(X_2\) that is uncorrelated with outcomes.

  • Homogeneous clusters (the clustering variable \(C_1\) below), so that inter-cluster variation is high.

  • A second clustering variable (\(C_2\)) assigned to units at random, so that intra-cluster variation is high.

library(dplyr)
library(randomizr)
library(knitr)

set.seed(100)

data <- data_frame(
    Y0 = rnorm(100, mean = 0, sd = 8),                  # untreated potential outcomes
    Y1 = Y0 + rnorm(100, mean = 5, sd = 4),              # treated potential outcomes (effects vary around 5)
    X1 = as.numeric(Y0 >= 0),                            # blocking variable correlated with outcomes
    X2 = randomizr::complete_ra(N = 100, m = 50),        # blocking variable unrelated to outcomes
    C1 = ifelse(Y0 <= -1, 1,
           ifelse(Y0 > -1 & Y0 <= 0, 2,
             ifelse(Y0 > 0 & Y0 <= 1, 3, 4))),           # homogeneous clusters built from Y0
    C2 = sample(x = c(1, 2, 3, 4), size = 100, replace = TRUE,
                prob = c(0.25, 0.25, 0.25, 0.25))        # heterogeneous clusters assigned at random
)

kable(head(data), caption = "Dataset for Randomization Inference", row.names = T)
Dataset for Randomization Inference
            Y0           Y1   X1   X2   C1   C2
1   -4.0175388   -0.3492322    0    1    1    2
2    1.0522493   11.5047042    1    0    4    4
3   -0.6313367    2.4920739    0    0    2    3
4    7.0942785   15.4657810    1    0    4    3
5    0.9357702    0.1037953    1    1    3    3
6    2.5490407    5.9478170    1    0    4    3
mean(data$Y1 - data$Y0)  # True ATE in this sample
## [1] 5.044563

Case 1: Complete Random Assignment

In this setup, exactly \(m\) of \(N\) units are assigned to treatment. We ignore covariate- and cluster-related information.

# Step 1: Create an assignment vector (i.e. the 'realized' or 'actual'
# assignment status)

declaration = declare_ra(N = nrow(data), m = 50)
declaration
## Random assignment procedure: Complete random assignment 
## Number of units: 100 
## Number of treatment arms: 2 
## The possible treatment categories are 0 and 1.
## The probabilities of assignment are constant across units.
Z <- conduct_ra(declaration)

# Step 2: Switching equation for observed values of Y
data$Y <- data$Y1 * Z + data$Y0 * (1 - Z)

# Step 2B: ATE estimate given Z
ate_estimate <- mean(data$Y[Z == 1]) - mean(data$Y[Z == 0])
ate_estimate
## [1] 3.956781
# Step 3: Obtaining a permutation matrix (all possible assignment vectors)
D <- obtain_permutation_matrix(declaration)

# Step 4: Obtain sampling distribution under the sharp null
ate1 <- rep(NA, ncol(D))  #ncol(D) is the number of columns in the permutation matrix

for (i in 1:ncol(D)) {
    Z <- D[, i]
    ate1[i] <- mean(data$Y[Z == 1]) - mean(data$Y[Z == 0])
}

# Step 5: For p values

mean(ate1 >= ate_estimate)  # one-sided
## [1] 0.0091
mean(abs(ate1) >= abs(ate_estimate))  # two-sided
## [1] 0.0197
# Step 6: To visualize this in terms of a histogram
hist(ate1, breaks = 100)
abline(v = ate_estimate, col = "blue")

Case 2: Blocking on a strong predictor (\(X_1\))

We have designed the data in such a way that \(X_1\) is highly correlated with potential outcomes:

cor(data$Y0, data$X1)
## [1] 0.7713862
cor(data$Y1, data$X1)
## [1] 0.7060083

So now consider a design in which we conduct two experiments, one for units with covariate profile \(X_1 = 1\), and another for those with \(X_1 = 0\). Here is how to conduct randomization inference with blocked assignment.

# Step 1: Create an assignment vector (i.e. the 'realized' or 'actual'
# assignment status)

declaration = declare_ra(N = nrow(data), blocks = data$X1, block_prob = c(0.5, 
    0.5))
declaration
## Random assignment procedure: Block random assignment 
## Number of units: 100 
## Number of blocks: 2 
## Number of treatment arms: 2 
## The possible treatment categories are 0 and 1.
## The probabilities of assignment are constant across units.
# There are other ways to do this:

table(data$X1)
## 
##  0  1 
## 52 48
declaration1 = declare_ra(N = nrow(data), blocks = data$X1, block_m = c(30, 
    24))

# For multi-arm experiments we use 'block_prob_each' or 'block_m_each'.
# Here we supply a matrix with the same number of rows as blocks and the
# same number of columns as treatment arms.


# Step 2: Get an assignment vector based on design specification
set.seed(23)
data$Z_blocked <- conduct_ra(declaration)

# Note: If we have unequal treatment assignment probabilities (as in
# 'declaration1'), we can check the probability of the condition each unit was assigned to:

Z1_blocked <- conduct_ra(declaration1)
obtain_condition_probabilities(declaration1, assignment = Z1_blocked)
##   [1] 0.4230769 0.5000000 0.4230769 0.5000000 0.5000000 0.5000000 0.4230769
##   [8] 0.5000000 0.5769231 0.5769231 0.5000000 0.5000000 0.5769231 0.5000000
##  [15] 0.5000000 0.5769231 0.4230769 0.5000000 0.4230769 0.5000000 0.5769231
##  [22] 0.5000000 0.5000000 0.5000000 0.4230769 0.5769231 0.5769231 0.5000000
##  [29] 0.5769231 0.5000000 0.5769231 0.5000000 0.4230769 0.5769231 0.5769231
##  [36] 0.4230769 0.5000000 0.5000000 0.5000000 0.5000000 0.5769231 0.5000000
##  [43] 0.4230769 0.5000000 0.5769231 0.5000000 0.4230769 0.5000000 0.5000000
##  [50] 0.4230769 0.5769231 0.4230769 0.5000000 0.5000000 0.5769231 0.5000000
##  [57] 0.5769231 0.5000000 0.5000000 0.4230769 0.4230769 0.5769231 0.5769231
##  [64] 0.5000000 0.5000000 0.5769231 0.5000000 0.5000000 0.4230769 0.5769231
##  [71] 0.5000000 0.5769231 0.4230769 0.5000000 0.4230769 0.5000000 0.5769231
##  [78] 0.5000000 0.5000000 0.4230769 0.5000000 0.4230769 0.5769231 0.5769231
##  [85] 0.5000000 0.4230769 0.5000000 0.5000000 0.5000000 0.4230769 0.4230769
##  [92] 0.5000000 0.5769231 0.5769231 0.5769231 0.5000000 0.5769231 0.5000000
##  [99] 0.5769231 0.5769231
# Step 3: Switching equation for observed values of Y
data$Y <- data$Y1 * data$Z_blocked + data$Y0 * (1 - data$Z_blocked)

# Step 4: Calculate the blocked ATE estimate, given Z

block_DIM <- data %>% group_by(X1) %>% summarise(DIM = mean(Y[Z_blocked == 1]) - 
    mean(Y[Z_blocked == 0]), Nj = n())

ate <- block_DIM %>% summarise(estimate = sum(DIM * (Nj/sum(Nj))))

ate_estimate <- ate$estimate
ate_estimate
## [1] 3.61079
# Step 5: Obtain permutation matrix

D_blocked <- obtain_permutation_matrix(declaration)

# Step 6: Obtain sampling distribution under the sharp null

ate_block <- rep(NA, ncol(D_blocked))  #ncol(D_blocked) is the number of columns in the permutation matrix

for (i in 1:ncol(D_blocked)) {
    data$Z_blocked <- D_blocked[, i]
    block_DIM <- data %>% group_by(X1) %>% summarise(DIM = mean(Y[Z_blocked == 
        1]) - mean(Y[Z_blocked == 0]), Nj = n())
    ate <- block_DIM %>% summarise(hat = sum(DIM * Nj/sum(Nj)))
    ate_block[i] <- ate$hat
}

# Step 7: For p values

mean(ate_block >= ate_estimate)  # one-sided
## [1] 4e-04
mean(abs(ate_block) >= abs(ate_estimate))  # two-sided
## [1] 7e-04
# Step 8: To visualize this in terms of a histogram
hist(ate_block, breaks = 100)
abline(v = ate_estimate, col = "blue")

Case 3: Clustering

Consider a case where data are clustered by \(C_1\), such that potential outcomes are highly correlated within a cluster, and there is considerable inter-cluster variation. To see this, I subset outcomes by cluster. For comparison, I also subset outcomes by another clustering variable \(C_2\) which was randomly assigned to units:

cluster_outcomes1 <- data %>% group_by(C1) %>% summarise(Y0_avg = mean(Y0), 
    Y0_sd = sd(Y0), Y1_avg = mean(Y1), Y1_sd = sd(Y1), N = n())
kable(cluster_outcomes1, caption = "Clustering by C1 (Homogeneous Clusters)")
Clustering by C1 (Homogeneous Clusters)
C1 Y0_avg Y0_sd Y1_avg Y1_sd N
1 -7.2744323 4.6722130 -1.692289 5.359986 42
2 -0.6377496 0.2062309 4.140585 2.115267 10
3 0.6440536 0.3476875 3.654301 4.184801 6
4 7.3897472 5.3706004 12.250735 6.042714 42
cluster_outcomes2 <- data %>% group_by(C2) %>% summarise(Y0_avg = mean(Y0), 
    Y0_sd = sd(Y0), Y1_avg = mean(Y1), Y1_sd = sd(Y1), N = n())
kable(cluster_outcomes2, caption = "Clustering by C2 (Heterogeneous Clusters)")
Clustering by C2 (Heterogeneous Clusters)
C2 Y0_avg Y0_sd Y1_avg Y1_sd N
1 2.9470758 9.258148 8.439956 9.367646 29
2 0.0682238 7.889888 4.849499 8.023810 19
3 -0.9841677 7.037718 3.499210 7.037770 23
4 -2.1308808 7.524174 3.082943 7.797912 29

In this part, we use randomizr to compute p values under the sharp null.

# Step 1: Create an assignment vector (i.e. the 'realized' or 'actual'
# assignment status)

declaration = declare_ra(N = nrow(data), clusters = data$C1, m = 2)
# If there are more than two conditions, use 'm_each' to specify number of
# clusters (or units) per condition.
declaration
## Random assignment procedure: Cluster random assignment 
## Number of units: 100 
## Number of clusters: 4 
## Number of treatment arms: 2 
## The possible treatment categories are 0 and 1.
## The probabilities of assignment are constant across units.
set.seed(13)
Z <- conduct_ra(declaration)

# To check that whole clusters have been assigned to a condition
kable(xtabs(~Z + data$C1))
Z \ C1     1    2    3    4
  0        0   10    0   42
  1       42    0    6    0
# Step 2: Switching equation for observed values of Y
data$Y <- data$Y1 * Z + data$Y0 * (1 - Z)

# Step 2B: ATE estimate given Z
ate_estimate <- mean(data$Y[Z == 1]) - mean(data$Y[Z == 0])
ate_estimate
## [1] -6.869963
# Step 3: Obtaining a permutation matrix (all possible assignment vectors)
D_cluster <- obtain_permutation_matrix(declaration)

# Step 4: Obtain sampling distribution under the sharp null
ate3 <- rep(NA, ncol(D_cluster))  #ncol(D_cluster) is the number of columns in the permutation matrix

for (i in 1:ncol(D_cluster)) {
    Z <- D_cluster[, i]
    ate3[i] <- mean(data$Y[Z == 1]) - mean(data$Y[Z == 0])
}

# Step 5: For p values

mean(ate3 <= ate_estimate)  # one-sided
## [1] 0.3333333
mean(abs(ate3) >= abs(ate_estimate))  # two-sided
## [1] 0.6666667
# Step 6: To visualize this in terms of a histogram
hist(ate3)
abline(v = ate_estimate, col = "blue")

Sampling Variability with Blocking and Clustering

Since we have the full schedule of potential outcomes, let’s look at the sampling distribution of ATE estimates under various types of random assignment.

Randomization              Mean_ATE         SE
Blocking on X1             5.062608   1.071294
Blocking on X2             5.037609   1.622848
Complete                   5.047940   1.627865
Heterogeneous Clusters     5.027069   2.874211
Homogeneous Clusters       4.916638  10.776919
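The table above comes from simulating over assignments; here is a minimal sketch of how the complete and clustered rows could be produced (this is not necessarily the exact code used, and the blocked rows would substitute the block-weighted estimator from Case 2 for the plain difference in means):

sampling_dist <- function(declaration) {
    D <- obtain_permutation_matrix(declaration)
    apply(D, 2, function(Z) {
        Y <- data$Y1 * Z + data$Y0 * (1 - Z)    # switching equation
        mean(Y[Z == 1]) - mean(Y[Z == 0])       # difference-in-means estimate
    })
}

est_complete <- sampling_dist(declare_ra(N = 100, m = 50))
est_homog    <- sampling_dist(declare_ra(N = 100, clusters = data$C1, m = 2))
est_heterog  <- sampling_dist(declare_ra(N = 100, clusters = data$C2, m = 2))

rbind(Complete = c(Mean_ATE = mean(est_complete), SE = sd(est_complete)),
      `Homogeneous Clusters` = c(Mean_ATE = mean(est_homog), SE = sd(est_homog)),
      `Heterogeneous Clusters` = c(Mean_ATE = mean(est_heterog), SE = sd(est_heterog)))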

Clustering and Bias in Difference-in-Means Estimator

We are told that the difference-in-means estimator can be biased when treatment is assigned by cluster and clusters vary in size. Here, I go through Middleton and Aronow (2011)’s proof:

Let’s start with the difference-in-means estimator and re-write it for the case where there are \(M\) clusters, cluster \(j\) has \(n_j\) units, and \(m_t\) of the \(M\) clusters are assigned to treatment (\(m_c\) to control). Call the set of clusters in treatment \(J_1\), and the set in control \(J_0\):

\(\frac{\sum_{i=1}^m Y_i}{m} - \frac{\sum_{i=m+1}^N Y_i}{N-m} = \frac{\sum_{j \in J_1} \sum_{i=1}^{n_j} Y_i}{\sum_{j \in J_1} n_j} - \frac{\sum_{j \in J_0} \sum_{i=1}^{n_j} Y_i}{\sum_{j \in J_0} n_j}\)

The trouble is that \(m\) and \(N-m\) are fixed (or non-random) quantities, but \(\sum_{j \in J_1} n_j\) and \(\sum_{j \in J_0} n_j\) are random variables. This is because “the total number of individuals in treatment and control now depends on the size of the particular clusters assigned to the experimental groups” (Middleton and Aronow 2011: 44).

This matters because an unbiased estimator satisfies the property \(E[\hat{\theta}] = \theta\). In the ordinary case:

\(E[\frac{X}{m}] = \frac{E[X]}{m}\) because \(m\) is a constant. But when \(m\) is a random variable:

\(E[\frac{X}{m}] = \frac{1}{E[m]}[E[X] - Cov(\frac{X}{m}, m)]\).
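A quick numerical check of this identity (a sketch with arbitrary, made-up random variables; it is not part of Middleton and Aronow’s argument):

set.seed(7)
m_rand <- sample(1:10, 100000, replace = TRUE)       # a random denominator
X <- m_rand^2 + rnorm(100000)                        # a numerator that covaries with the denominator
mean(X / m_rand)                                     # direct estimate of E[X/m]
(mean(X) - cov(X / m_rand, m_rand)) / mean(m_rand)   # right-hand side of the identity (approximately equal)
mean(X) / mean(m_rand)                               # the naive E[X]/E[m], which is noticeably different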

Going back to the difference-in-means estimator: to see whether it remains unbiased, we take expectations and apply the above rule:

\(E\{\frac{\sum_{j \in J_1} \sum_{i=1}^{n_j} Y_i}{\sum_{j \in J_1} n_j} - \frac{\sum_{j \in J_0} \sum_{i=1}^{n_j} Y_i}{\sum_{j \in J_0} n_j}\}\)

Using linearity of expectations we can write this as:

\(E\{\frac{\sum_{j \in J_1} \sum_{i=1}^{n_j} Y_i}{\sum_{j \in J_1} n_j}\} - E\{\frac{\sum_{j \in J_0} \sum_{i=1}^{n_j} Y_i}{\sum_{j \in J_0} n_j}\}\)

And applying the expectation rule from above we get:

\(E[\frac{\sum_{i=1}^m Y_i}{m} - \frac{\sum_{i=m+1}^N Y_i}{N-m}] - \frac{M}{N}\{\frac{1}{m_t}Cov(\frac{\sum_{j \in J_1} \sum_{i=1}^{n_j} Y_i}{\sum_{j \in J_1} n_j}, \sum_{j \in J_1} n_j) - \frac{1}{m_c}Cov(\frac{\sum_{j \in J_0} \sum_{i=1}^{n_j} Y_i}{\sum_{j \in J_0} n_j}, \sum_{j \in J_0} n_j) \}\)

Note that the bias term depends on two things: (a) the extent to which cluster size covaries with potential outcomes; and (b) the number of clusters assigned to each treatment condition (\(m_t\) and \(m_c\)).
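To see the bias numerically, here is a small simulation sketch (an invented example, not from Middleton and Aronow: cluster sizes and untreated outcomes both grow with the cluster index, the treatment effect is a constant 5, and 2 of the 6 clusters are treated):

set.seed(1)
n_j <- c(2, 4, 8, 16, 32, 64)                            # cluster sizes
cl  <- rep(seq_along(n_j), times = n_j)                  # cluster id for each unit
y0  <- 2 * cl + rnorm(sum(n_j))                          # untreated outcomes rise with cluster size
y1  <- y0 + 5                                            # constant treatment effect of 5

decl <- declare_ra(N = sum(n_j), clusters = cl, m = 2)   # 2 of 6 clusters assigned to treatment
D    <- obtain_permutation_matrix(decl)                  # all C(6,2) = 15 possible assignments

est <- apply(D, 2, function(Z) {
    Y <- y1 * Z + y0 * (1 - Z)
    mean(Y[Z == 1]) - mean(Y[Z == 0])
})
mean(est)    # noticeably below 5: the difference-in-means estimator is biased in this design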

Appendix I: Confidence Intervals Under Constant Effects

I will use the complete random assignment case to build confidence intervals, as described in Gerber and Green (Chapter III). Note that these confidence intervals assume constant effects.

# Recreating the setup from before:
declaration = declare_ra(N = nrow(data), m = 50)
Z <- conduct_ra(declaration)

data$Y <- data$Y1 * Z + data$Y0 * (1 - Z)

# ATE estimate given Z
ate_estimate <- mean(data$Y[Z == 1]) - mean(data$Y[Z == 0])
ate_estimate
## [1] 4.614019
# Obtaining a permutation matrix (all possible assignment vectors)
D <- obtain_permutation_matrix(declaration)

### New step: getting confidence interval under constant effects

## Step 1: Construct potential outcomes under constant effects assumption
Y0_temp <- data$Y - ate_estimate * Z
Y1_temp <- Y0_temp + ate_estimate

ci_completera <- rep(NA, ncol(D))

for (i in 1:ncol(D)) {
    Z_temp <- D[, i]
    # Compute ATE
    ci_completera[i] <- mean(Y1_temp[Z_temp == 1]) - mean(Y0_temp[Z_temp == 
        0])
}

hist(ci_completera, breaks = 100, main = "Sampling Distribution Under Fixed Trt. Effect")
abline(v = ate_estimate, col = "blue")

ci_95 <- quantile(ci_completera, probs = c(0.025, 0.975))
ci_95
##     2.5%    97.5% 
## 1.361424 7.994932

Appendix II: Randomization Checks

What if I want to check whether treatment was randomly assigned? If assignment was indeed random, we should observe, in expectation, balance on covariates such as \(X_1\) and \(C_1\).

# Step 1: Estimate a regression in which the treatment vector is regressed
# on all covariates, pull out the F-statistic (test of joint significance of
# all covariates)

ra_model <- lm(Z ~ data$X1 + data$C1)
f_stat <- summary(ra_model)$fstatistic[1]
f_stat
##    value 
## 0.494898
fstat_distrib <- rep(NA, ncol(D))

for (i in 1:ncol(D)) {
    Z_temp <- D[, i]
    # Compute f-statistic from regression Z_temp ~ Covariates
    model <- lm(Z_temp ~ data$X1 + data$C1)
    fstat_distrib[i] <- summary(model)$fstatistic[1]
}

# Distribution of F-statistics
hist(fstat_distrib, breaks = 100, main = "Sampling Distribution of F Statistic")
abline(v = f_stat, col = "blue")

# P value
mean(fstat_distrib >= f_stat)
## [1] 0.6184