Problem Set 1 has been graded and returned on Canvas. Please let me know if you have trouble viewing my comments.
Question 3 describes an encouragement-design with experimental features. The treatment was a simplified procedure to purchase a private water connection at 0% interest rate. A random subset of the sample received this “incentive”, while the remaining subset did not. This allows us to identify the causal effect of that incentive, and with some assumptions, the causal effect of possessing a private connection. In Chapters 5 and 6, we will discuss non-compliance and the necessary assumptions to identify complier average causal effects.
\(d_i\) is the observed treatment assignment of unit \(i\)
\(D_i\) is a random variable that indicates whether unit \(i\) would be treated in a hypothetical experiment.
In words: \(d_i\) is a particular realization of \(D_i\).
Unit | Y1 | Y0 | d_i |
---|---|---|---|
1 | 10 | 15 | 1 |
2 | 8 | 13 | 1 |
3 | 6 | 11 | 0 |
4 | 4 | 9 | 0 |
What is \(E[Y_i(1) | d_i = 0]\)?
\(\frac{Y_3(1) + Y_4(1)}{2} = \frac{6+4}{2} = 5\)
How many ways can two of four units be assigned to treatment?
\({4 \choose 2} = \frac{4!}{2!\times 2!} = 6\) ways
Lets use randomizr to get the six different assignment vectors. The R package manual is available here.
# If you do not have this package:
# install.packages('randomizr')
# library(randomizr)
declaration <- declare_ra(N = 4, m = 2) #this gives randomizr the necessary information
declaration
## Random assignment procedure: Complete random assignment
## Number of units: 4
## Number of treatment arms: 2
## The possible treatment categories are 0 and 1.
## The probabilities of assignment are constant across units.
# If you are doing simple random assignment
declare_ra(N = 4, simple = TRUE)
## Random assignment procedure: Simple random assignment
## Number of units: 4
## Number of treatment arms: 2
## The possible treatment categories are 0 and 1.
## The probabilities of assignment are constant across units.
conduct_ra(declaration) # This uses that information to generate an assignment vector. This vector will be different every time we run the command. See below
## [1] 0 1 0 1
Z <- conduct_ra(declaration)
Z # Different from the first assignment vector
## [1] 1 1 0 0
obtain_condition_probabilities(declaration, Z) # This gives us each unit's probability of being assigned to treatment. When there are no blocks or clusters, this will be the same for all units.
## [1] 0.5 0.5 0.5 0.5
D <- obtain_permutation_matrix(declaration)
# Generates all possible assignment vectors for N=4, m=2
dim(D)
## [1] 4 6
print(D)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0 0 0 1 1 1
## [2,] 0 1 1 0 0 1
## [3,] 1 0 1 0 1 0
## [4,] 1 1 0 1 0 0
For any unit \(i \in \{1,2,3,4\}\), \(D_i = 1\) with probability \(\frac{1}{2}\) and 0 with probability \(\frac{1}{2}\).
We can manually check this in table D. Unit 1 is assigned to treatment in three of six assignment vectors (columns 4, 5, and 6). Unit 4 is assigned three of six times to treatment as well (columns 1,2, and 4).
Now lets see the full science table:
Unit | Y1 | Y0 | d_i | d1 | d2 | d3 | d4 | d5 |
---|---|---|---|---|---|---|---|---|
1 | 10 | 15 | 1 | 0 | 0 | 0 | 1 | 1 |
2 | 8 | 13 | 1 | 0 | 1 | 1 | 0 | 0 |
3 | 6 | 11 | 0 | 1 | 0 | 1 | 0 | 1 |
4 | 4 | 9 | 0 | 1 | 1 | 0 | 1 | 0 |
Formally, we have:
\(Y_i(1), Y_i(0),X \perp\!\!\!\perp D_i\)
Unit | Y1 | Y0 | d_i | d1 | d2 | d3 | d4 | d5 |
---|---|---|---|---|---|---|---|---|
1 | 10 | 15 | 1 | 0 | 0 | 0 | 1 | 1 |
2 | 8 | 13 | 1 | 0 | 1 | 1 | 0 | 0 |
3 | 6 | 11 | 0 | 1 | 0 | 1 | 0 | 1 |
4 | 4 | 9 | 0 | 1 | 1 | 0 | 1 | 0 |
r(Y1,d) | NA | NA | 0.89 | -0.89 | -0.45 | 0 | 0 | 0.45 |
r(Y0,d) | NA | NA | 0.89 | -0.89 | -0.45 | 0 | 0 | 0.45 |
What might be average value of \(r(Y_i(1),d)\) and \(r(Y_i(0),d)\)?
Answer: Zero! Which is an implied property of random assignment.
Can you provide an intuition for why the correlation between any given assignment vector and treated/untreated potential outcomes is the same?
Answer: This schedule uses a constant treatment effect, so \(Y1 = Y0 - 5\). This implies \(Var[Y1] = Var[Y0 - 5] = Var[Y0]\), and that \(Cov(Y1,d_i) = Cov(Y0-5,d_i) = Cov(Y0,d_i)\)
Finally, what is \(E[Y_i(1) | D_i = 0]\)?
Gerber and Green say:
“The notation \(E[Y_i(1) | D_i = 0]\) may be regarded as shorthand for \(E[E[Y_i(1)|d_i = 0, \textbf{d}]]\), where d refers to a vector of treatment assignments and \(d_i\) refers its \(i\)th element. Given d, we may calculate the probability distribution function for all \(\{Y(1),d\}\) pairs and the expectation given this set of assignments. Then we may take the expectation of this expected value by summing over all possible d vectors.” (P28, Footnote 3)
\(E[E[Y_i(1)|d_i = 0, \textbf{d}]]\) = \((\frac{1}{6})(\frac{6+4}{2}) + (\frac{1}{6})(\frac{10+8}{2}) + (\frac{1}{6})(\frac{10+6}{2}) + (\frac{1}{6})(\frac{10+4}{2}) + (\frac{1}{6})(\frac{8+6}{2}) + (\frac{1}{6})(\frac{8+4}{2})\)
Which equals: \(\frac{1}{6} \times \frac{84}{2} = 7\)
Unsurprisingly, \(E[Y_i(1)] = \frac{10+8+6+4}{4} = 7\)
This is what random assignment suggests: \(E[Y_i(1) | D_i = 0] = E[Y_i(1) | D_i = 1] = E[Y_i(1)]\)
Caveat: Correlations tell us whether there is a linear association between two variables. By itself, it is not evidence of statistical independence because two variables might be non-linearly associated.
(For each assignment vector, manually calculate the ATE (or ATT) using the table above. Then compare with code-generated results below)
library(dplyr)
data1 <- data1 %>% slice(1:4) # remove the correlation rows from the dataset
data1 <- data1 %>% select(Unit, Y1, Y0, d_i) # to remove the permutation matrix D.
data1
## # A tibble: 4 x 4
## Unit Y1 Y0 d_i
## <chr> <chr> <chr> <chr>
## 1 1 10 15 1
## 2 2 8 13 1
## 3 3 6 11 0
## 4 4 4 9 0
## 'Loop and Function' approach to get the sampling distribution of the ATE
# Step 1: Define a function that computes the difference-in-means, given a
# treatment assignment vector Z
dif_in_means <- function(Y1, Y0, Z) {
# We need to ensure the dataset columns are numerical vectors, not
# characters
as.num <- function(x) {
as.numeric(as.character(x))
}
Y1 <- as.num(Y1)
Y0 <- as.num(Y0)
Z <- as.num(Z)
# Switching equation
Y <- Y1 * Z + (1 - Z) * Y0
# ATE estimate
estimate <- mean(Y[Z == 1], na.rm = T) - mean(Y[Z == 0], na.rm = T)
return(estimate)
}
# Step 2: Create an output vector, and run a loop with this function
ate <- rep(NA, 6) # We will have one ATE estimate for each assignment vector, hence 6 slots
# another way to do this is:
reps <- dim(D)[2] # gives the number of columns in the permutation matrix D
ate <- rep(0, reps)
for (i in 1:reps) {
Z = D[, i]
ate[i] <- dif_in_means(data1$Y1, data1$Y0, Z)
}
print(ate)
## [1] -9 -7 -5 -5 -3 -1
mean(ate) # Should be -5
## [1] -5
hist(ate) # Sampling distribution of ATE estimate
sd(ate) #SD of sampling distribution, or SE of ATE estimate
## [1] 2.828427
Under random assignment, I have shown that: \(E[Y_i(1) | D_i = 0] = E[Y_i(1) | D_i = 1] = E[Y_i(1)]\)
This is also true of untreated potential outcomes: \(E[Y_i(0) | D_i = 0] = E[Y_i(0) | D_i = 1] = E[Y_i(0)]\)
It follows that \(ATT = E[Y_i(1) | D_i = 1] - E[Y_i(0) | D_i = 1] = E[Y_i(1)] - E[Y_i(0)] = ATE\). We will now see this using a simulation.
# Define a function that calculates the ATC
atc_estimator <- function(Y1, Y0, Z) {
# We need to ensure the dataset columns are numerical vectors, not
# characters
as.num <- function(x) {
as.numeric(as.character(x))
}
Y1 <- as.num(Y1)
Y0 <- as.num(Y0)
Z <- as.num(Z)
# ATC estimate
estimate <- mean(Y1[Z == 0], na.rm = T) - mean(Y0[Z == 0], na.rm = T)
return(estimate)
}
# Run a similar loop to get the sampling distribution of the ATC
atc <- rep(0, reps) # reps = no. of columns in permutation matrix D
for (i in 1:reps) {
Z = D[, i]
atc[i] <- atc_estimator(data1$Y1, data1$Y0, Z)
}
print(atc)
## [1] -5 -5 -5 -5 -5 -5
mean(atc)
## [1] -5
Task: Alter my code to estimate the ATT. Is the mean(att) = -5 ?
You will observe that the sd(att) = 0 because across assignment vectors, the ATT estimate is exactly \(-5\). This is not typically the case, it so because I defined potential outcomes in a certain way: \(Y_i(1) = Y_i(0) - 5\). When \(\tau_i\) varies from unit to unit, the ATT estimate will vary with assignment vectors.
The true standard error of the ATE estimate is:
\(\sqrt{\frac{1}{N-1}\{\frac{m}{N-m}Var[Y_i(0)] + \frac{N-m}{m}Var[Y_i(1)] + 2Cov[Y_i(0),Y_i(1)]\}}\)
There are many design implications of this formula, one of which is that we should allocate more units to the condition with higher variance in potential outcomes. Lets look at two cases: one in which treated and untreated potential outcomes have equal variance; another in which they have unequal variances. At what \(m\) do we get the tightest sampling distribution for the ATE (i.e. the smallest standard error)?
set.seed(45)
#### Lets create some data, n = 80 that meets this description:
Y1 <- runif(80, min = 0, max = 10)
Y0 <- Y1 - 5 # A constant treatment effect gives us Var[Y1] = Var[Y0].
#### Define a mammoth function 'sims'
sims <- function(N, m, Y1, Y0) {
# Getting all possible assignment vectors when m of N are treated
declaration <- randomizr::declare_ra(N = N, m = m)
D <- randomizr::obtain_permutation_matrix(declaration)
# Defining the function dif_in_means
dif_in_means <- function(Y1, Y0, Z) {
# We need to ensure the dataset columns are numerical vectors, not
# characters
as.num <- function(x) {
as.numeric(as.character(x))
}
Y1 <- as.num(Y1)
Y0 <- as.num(Y0)
Z <- as.num(Z)
# Switching equation
Y <- Y1 * Z + (1 - Z) * Y0
# ATE estimate
estimate <- mean(Y[Z == 1], na.rm = T) - mean(Y[Z == 0], na.rm = T)
return(estimate)
}
# Getting sampling distribution of ATE
reps <- dim(D)[2] # gives the number of columns in the permutation matrix D
ate <- rep(0, reps)
for (i in 1:reps) {
Z = D[, i]
ate[i] <- dif_in_means(Y1, Y0, Z)
}
# Creating output vector
output <- c(m, sd(ate))
return(output)
}
#### Checking that the function works
sims(N = 80, m = 50, Y1, Y0)
## [1] 50.0000000 0.6431916
#### Now running this function with different values of m, and storing the
#### standard errors
out1 <- matrix(NA, ncol = 2, nrow = 79)
for (i in 1:79) {
out1[i, ] <- sims(N = 80, m = i, Y1, Y0)
}
#### Creating a table of the output
colnames(out1) <- c("m", "SE")
out1 <- as.data.frame(out1)
#### Which value of m minimizes standard error:
out1[which.min(out1$SE), ]
## m SE
## 40 40 0.6197052
#### Using ggplot to create a figure
library(ggplot2)
fig1 <- ggplot(data = out1, aes(x = m, y = SE)) + geom_point() + theme_bw()
fig1
When \(m \approx 40\) we observe the “tightest” sampling distribution (i.e. the lowest standard error).
Now consider a case in which \(Var[Y_i(1)] > Var[Y_i(0)]\):
# Create some data
Y0_new <- runif(80, min = 0, max = 10)
Y1_new <- Y0_new + (Y0_new/2) # A treatment effect that is positively correlated with Y0 gives us Var[Y1] > Var[Y0].
var(Y0_new)
## [1] 9.304405
var(Y1_new)
## [1] 20.93491
# Lets perform the same simulation on this data
out2 <- matrix(NA, ncol = 2, nrow = 79)
for (i in 1:79) {
out2[i, ] <- sims(N = 80, m = i, Y1_new, Y0_new)
}
colnames(out2) <- c("m", "SE")
out2 <- as.data.frame(out2)
#### Which value of m minimizes standard error:
out2[which.min(out2$SE), ]
## m SE
## 47 47 0.8254226
#### Using ggplot to create a figure
fig2 <- ggplot(data = out2, aes(x = m, y = SE)) + geom_point() + theme_bw()
fig2
Now, the optimal procedure would have \(m \approx 47\). As expected, we should allocate more units to the treatment arm when \(Var[Y_i(1)] > Var[Y_i(0)]\)
Formally, excludability implies:
\(Y_i(z_i,d_i) = Y_i(d_i)\)
In words: potential outcomes respond to actual treatment, not treatment assignment.
Why is it a problem when measurement error is correlated with treatment assignment? For concreteness, imagine we observe \(Y_i(0)^* = Y_i(0) + e_{i0}\) where \(e_{i0}\) is the error in measuring the outcome in the control group. And in the treatment group \(Y_i(1)^* = Y_i(1) + e_{i1}\).
Gerber and Green show that when \(E[e_{i0} | D_i = 0] \neq E[e_{i1} | D_i = 1]\), the difference-in-means estimator is biased. I reproduce the proof here:
\(E[\frac{\sum_1^m Y_i}{m} - \frac{\sum_{m+1}^N Y_i}{N-m}] = E[Y_i(1)^* | D_i = 1] - E[Y_i(0)^* | D_i = 1]\)
Which equals: \(E[Y_i(1) | D_i = 1] - E[Y_i(0) | D_i = 1] + E[e_{i1} | D_i = 1] - E[e_{i0} | D_i = 0]\)
Re-written as: \(ATE + \{E[e_{i1} | D_i = 1] - E[e_{i0} | D_i = 0]\}\)
Below, I plot the sampling distribution of estimated ATEs from Case 1 and 2 at different levels of \(m\).