Recap from Week I

  • Problem Set 1 has been graded and returned on Canvas. Please let me know if you have trouble viewing my comments.

  • Question 3 describes an encouragement-design with experimental features. The treatment was a simplified procedure to purchase a private water connection at 0% interest rate. A random subset of the sample received this “incentive”, while the remaining subset did not. This allows us to identify the causal effect of that incentive, and with some assumptions, the causal effect of possessing a private connection. In Chapters 5 and 6, we will discuss non-compliance and the necessary assumptions to identify complier average causal effects.

\(d_i\) and \(D_i\)

  • \(d_i\) is the observed treatment assignment of unit \(i\)

  • \(D_i\) is a random variable that indicates whether unit \(i\) would be treated in a hypothetical experiment.

In words: \(d_i\) is a particular realization of \(D_i\).

An example with \(n\)=4, \(m\)=2

Schedule of Potential Outcomes and Treatment Assignment
Unit Y1 Y0 d_i
1 10 15 1
2 8 13 1
3 6 11 0
4 4 9 0

What is \(E[Y_i(1) | d_i = 0]\)?

\(\frac{Y_3(1) + Y_4(1)}{2} = \frac{6+4}{2} = 5\)

How many ways can two of four units be assigned to treatment?

\({4 \choose 2} = \frac{4!}{2!\times 2!} = 6\) ways

Lets use randomizr to get the six different assignment vectors. The R package manual is available here.

# If you do not have this package:

# install.packages('randomizr')

# library(randomizr)

declaration <- declare_ra(N = 4, m = 2)  #this gives randomizr the necessary information
declaration
## Random assignment procedure: Complete random assignment 
## Number of units: 4 
## Number of treatment arms: 2 
## The possible treatment categories are 0 and 1.
## The probabilities of assignment are constant across units.
# If you are doing simple random assignment
declare_ra(N = 4, simple = TRUE)
## Random assignment procedure: Simple random assignment 
## Number of units: 4 
## Number of treatment arms: 2 
## The possible treatment categories are 0 and 1.
## The probabilities of assignment are constant across units.
conduct_ra(declaration)  # This uses that information to generate an assignment vector. This vector will be different every time we run the command. See below
## [1] 0 1 0 1
Z <- conduct_ra(declaration)
Z  # Different from the first assignment vector
## [1] 1 1 0 0
obtain_condition_probabilities(declaration, Z)  # This gives us each unit's probability of being assigned to treatment. When there are no blocks or clusters, this will be the same for all units.
## [1] 0.5 0.5 0.5 0.5
D <- obtain_permutation_matrix(declaration)
# Generates all possible assignment vectors for N=4, m=2
dim(D)
## [1] 4 6
print(D)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    0    1    1    1
## [2,]    0    1    1    0    0    1
## [3,]    1    0    1    0    1    0
## [4,]    1    1    0    1    0    0

For any unit \(i \in \{1,2,3,4\}\), \(D_i = 1\) with probability \(\frac{1}{2}\) and 0 with probability \(\frac{1}{2}\).

We can manually check this in table D. Unit 1 is assigned to treatment in three of six assignment vectors (columns 4, 5, and 6). Unit 4 is assigned three of six times to treatment as well (columns 1,2, and 4).

Now lets see the full science table:

Full Schedule of Potential Outcomes and Treatment Assignments
Unit Y1 Y0 d_i d1 d2 d3 d4 d5
1 10 15 1 0 0 0 1 1
2 8 13 1 0 1 1 0 0
3 6 11 0 1 0 1 0 1
4 4 9 0 1 1 0 1 0

What is the random assignment assumption?

Formally, we have:

\(Y_i(1), Y_i(0),X \perp\!\!\!\perp D_i\)

Full Schedule of Potential Outcomes and Treatment Assignments
Unit Y1 Y0 d_i d1 d2 d3 d4 d5
1 10 15 1 0 0 0 1 1
2 8 13 1 0 1 1 0 0
3 6 11 0 1 0 1 0 1
4 4 9 0 1 1 0 1 0
r(Y1,d) NA NA 0.89 -0.89 -0.45 0 0 0.45
r(Y0,d) NA NA 0.89 -0.89 -0.45 0 0 0.45

What might be average value of \(r(Y_i(1),d)\) and \(r(Y_i(0),d)\)?

Answer: Zero! Which is an implied property of random assignment.

Can you provide an intuition for why the correlation between any given assignment vector and treated/untreated potential outcomes is the same?

Answer: This schedule uses a constant treatment effect, so \(Y1 = Y0 - 5\). This implies \(Var[Y1] = Var[Y0 - 5] = Var[Y0]\), and that \(Cov(Y1,d_i) = Cov(Y0-5,d_i) = Cov(Y0,d_i)\)

Finally, what is \(E[Y_i(1) | D_i = 0]\)?

Gerber and Green say:

“The notation \(E[Y_i(1) | D_i = 0]\) may be regarded as shorthand for \(E[E[Y_i(1)|d_i = 0, \textbf{d}]]\), where d refers to a vector of treatment assignments and \(d_i\) refers its \(i\)th element. Given d, we may calculate the probability distribution function for all \(\{Y(1),d\}\) pairs and the expectation given this set of assignments. Then we may take the expectation of this expected value by summing over all possible d vectors.” (P28, Footnote 3)

\(E[E[Y_i(1)|d_i = 0, \textbf{d}]]\) = \((\frac{1}{6})(\frac{6+4}{2}) + (\frac{1}{6})(\frac{10+8}{2}) + (\frac{1}{6})(\frac{10+6}{2}) + (\frac{1}{6})(\frac{10+4}{2}) + (\frac{1}{6})(\frac{8+6}{2}) + (\frac{1}{6})(\frac{8+4}{2})\)

Which equals: \(\frac{1}{6} \times \frac{84}{2} = 7\)

Unsurprisingly, \(E[Y_i(1)] = \frac{10+8+6+4}{4} = 7\)

This is what random assignment suggests: \(E[Y_i(1) | D_i = 0] = E[Y_i(1) | D_i = 1] = E[Y_i(1)]\)

Caveat: Correlations tell us whether there is a linear association between two variables. By itself, it is not evidence of statistical independence because two variables might be non-linearly associated.

Sampling Distribution of ATE

(For each assignment vector, manually calculate the ATE (or ATT) using the table above. Then compare with code-generated results below)

library(dplyr)
data1 <- data1 %>% slice(1:4)  # remove the correlation rows from the dataset
data1 <- data1 %>% select(Unit, Y1, Y0, d_i)  # to remove the permutation matrix D.
data1
## # A tibble: 4 x 4
##    Unit    Y1    Y0   d_i
##   <chr> <chr> <chr> <chr>
## 1     1    10    15     1
## 2     2     8    13     1
## 3     3     6    11     0
## 4     4     4     9     0
## 'Loop and Function' approach to get the sampling distribution of the ATE

# Step 1: Define a function that computes the difference-in-means, given a
# treatment assignment vector Z
dif_in_means <- function(Y1, Y0, Z) {
    # We need to ensure the dataset columns are numerical vectors, not
    # characters
    as.num <- function(x) {
        as.numeric(as.character(x))
    }
    Y1 <- as.num(Y1)
    Y0 <- as.num(Y0)
    Z <- as.num(Z)
    # Switching equation
    Y <- Y1 * Z + (1 - Z) * Y0
    # ATE estimate
    estimate <- mean(Y[Z == 1], na.rm = T) - mean(Y[Z == 0], na.rm = T)
    return(estimate)
}

# Step 2: Create an output vector, and run a loop with this function

ate <- rep(NA, 6)  # We will have one ATE estimate for each assignment vector, hence 6 slots
# another way to do this is:
reps <- dim(D)[2]  # gives the number of columns in the permutation matrix D
ate <- rep(0, reps)

for (i in 1:reps) {
    Z = D[, i]
    ate[i] <- dif_in_means(data1$Y1, data1$Y0, Z)
}

print(ate)
## [1] -9 -7 -5 -5 -3 -1
mean(ate)  # Should be -5
## [1] -5
hist(ate)  # Sampling distribution of ATE estimate

sd(ate)  #SD of sampling distribution, or SE of ATE estimate
## [1] 2.828427

ATT=ATC=ATE

Under random assignment, I have shown that: \(E[Y_i(1) | D_i = 0] = E[Y_i(1) | D_i = 1] = E[Y_i(1)]\)

This is also true of untreated potential outcomes: \(E[Y_i(0) | D_i = 0] = E[Y_i(0) | D_i = 1] = E[Y_i(0)]\)

It follows that \(ATT = E[Y_i(1) | D_i = 1] - E[Y_i(0) | D_i = 1] = E[Y_i(1)] - E[Y_i(0)] = ATE\). We will now see this using a simulation.

# Define a function that calculates the ATC

atc_estimator <- function(Y1, Y0, Z) {
    # We need to ensure the dataset columns are numerical vectors, not
    # characters
    as.num <- function(x) {
        as.numeric(as.character(x))
    }
    Y1 <- as.num(Y1)
    Y0 <- as.num(Y0)
    Z <- as.num(Z)
    # ATC estimate
    estimate <- mean(Y1[Z == 0], na.rm = T) - mean(Y0[Z == 0], na.rm = T)
    return(estimate)
}

# Run a similar loop to get the sampling distribution of the ATC
atc <- rep(0, reps)  # reps = no. of columns in permutation matrix D

for (i in 1:reps) {
    Z = D[, i]
    atc[i] <- atc_estimator(data1$Y1, data1$Y0, Z)
}

print(atc)
## [1] -5 -5 -5 -5 -5 -5
mean(atc)
## [1] -5

Task: Alter my code to estimate the ATT. Is the mean(att) = -5 ?

You will observe that the sd(att) = 0 because across assignment vectors, the ATT estimate is exactly \(-5\). This is not typically the case, it so because I defined potential outcomes in a certain way: \(Y_i(1) = Y_i(0) - 5\). When \(\tau_i\) varies from unit to unit, the ATT estimate will vary with assignment vectors.

Equation 3.4 and Study Design

The true standard error of the ATE estimate is:

\(\sqrt{\frac{1}{N-1}\{\frac{m}{N-m}Var[Y_i(0)] + \frac{N-m}{m}Var[Y_i(1)] + 2Cov[Y_i(0),Y_i(1)]\}}\)

There are many design implications of this formula, one of which is that we should allocate more units to the condition with higher variance in potential outcomes. Lets look at two cases: one in which treated and untreated potential outcomes have equal variance; another in which they have unequal variances. At what \(m\) do we get the tightest sampling distribution for the ATE (i.e. the smallest standard error)?

Case 1: \(Var[Y_i(1)] = Var[Y_i(0)]\)

set.seed(45)

#### Lets create some data, n = 80 that meets this description:

Y1 <- runif(80, min = 0, max = 10)
Y0 <- Y1 - 5  # A constant treatment effect gives us Var[Y1] = Var[Y0].


#### Define a mammoth function 'sims'

sims <- function(N, m, Y1, Y0) {
    # Getting all possible assignment vectors when m of N are treated
    declaration <- randomizr::declare_ra(N = N, m = m)
    D <- randomizr::obtain_permutation_matrix(declaration)
    # Defining the function dif_in_means
    dif_in_means <- function(Y1, Y0, Z) {
        # We need to ensure the dataset columns are numerical vectors, not
        # characters
        as.num <- function(x) {
            as.numeric(as.character(x))
        }
        Y1 <- as.num(Y1)
        Y0 <- as.num(Y0)
        Z <- as.num(Z)
        # Switching equation
        Y <- Y1 * Z + (1 - Z) * Y0
        # ATE estimate
        estimate <- mean(Y[Z == 1], na.rm = T) - mean(Y[Z == 0], na.rm = T)
        return(estimate)
    }
    # Getting sampling distribution of ATE
    reps <- dim(D)[2]  # gives the number of columns in the permutation matrix D
    ate <- rep(0, reps)
    for (i in 1:reps) {
        Z = D[, i]
        ate[i] <- dif_in_means(Y1, Y0, Z)
    }
    # Creating output vector
    output <- c(m, sd(ate))
    return(output)
}

#### Checking that the function works
sims(N = 80, m = 50, Y1, Y0)
## [1] 50.0000000  0.6431916
#### Now running this function with different values of m, and storing the
#### standard errors

out1 <- matrix(NA, ncol = 2, nrow = 79)

for (i in 1:79) {
    out1[i, ] <- sims(N = 80, m = i, Y1, Y0)
}

#### Creating a table of the output

colnames(out1) <- c("m", "SE")
out1 <- as.data.frame(out1)

#### Which value of m minimizes standard error:
out1[which.min(out1$SE), ]
##     m        SE
## 40 40 0.6197052
#### Using ggplot to create a figure

library(ggplot2)

fig1 <- ggplot(data = out1, aes(x = m, y = SE)) + geom_point() + theme_bw()
fig1

When \(m \approx 40\) we observe the “tightest” sampling distribution (i.e. the lowest standard error).

Case 2: \(Var[Y_i(1)] \neq Var[Y_i(0)]\)

Now consider a case in which \(Var[Y_i(1)] > Var[Y_i(0)]\):

# Create some data

Y0_new <- runif(80, min = 0, max = 10)
Y1_new <- Y0_new + (Y0_new/2)  # A treatment effect that is positively correlated with Y0 gives us Var[Y1] > Var[Y0].
var(Y0_new)
## [1] 9.304405
var(Y1_new)
## [1] 20.93491
# Lets perform the same simulation on this data

out2 <- matrix(NA, ncol = 2, nrow = 79)
for (i in 1:79) {
    out2[i, ] <- sims(N = 80, m = i, Y1_new, Y0_new)
}

colnames(out2) <- c("m", "SE")
out2 <- as.data.frame(out2)

#### Which value of m minimizes standard error:
out2[which.min(out2$SE), ]
##     m        SE
## 47 47 0.8254226
#### Using ggplot to create a figure
fig2 <- ggplot(data = out2, aes(x = m, y = SE)) + geom_point() + theme_bw()
fig2

Now, the optimal procedure would have \(m \approx 47\). As expected, we should allocate more units to the treatment arm when \(Var[Y_i(1)] > Var[Y_i(0)]\)

Asymmetric Measurement Error and Excludability

Formally, excludability implies:

\(Y_i(z_i,d_i) = Y_i(d_i)\)

In words: potential outcomes respond to actual treatment, not treatment assignment.

Why is it a problem when measurement error is correlated with treatment assignment? For concreteness, imagine we observe \(Y_i(0)^* = Y_i(0) + e_{i0}\) where \(e_{i0}\) is the error in measuring the outcome in the control group. And in the treatment group \(Y_i(1)^* = Y_i(1) + e_{i1}\).

Gerber and Green show that when \(E[e_{i0} | D_i = 0] \neq E[e_{i1} | D_i = 1]\), the difference-in-means estimator is biased. I reproduce the proof here:

\(E[\frac{\sum_1^m Y_i}{m} - \frac{\sum_{m+1}^N Y_i}{N-m}] = E[Y_i(1)^* | D_i = 1] - E[Y_i(0)^* | D_i = 1]\)

Which equals: \(E[Y_i(1) | D_i = 1] - E[Y_i(0) | D_i = 1] + E[e_{i1} | D_i = 1] - E[e_{i0} | D_i = 0]\)

Re-written as: \(ATE + \{E[e_{i1} | D_i = 1] - E[e_{i0} | D_i = 0]\}\)

Appendix (Histograms)

Below, I plot the sampling distribution of estimated ATEs from Case 1 and 2 at different levels of \(m\).

Case 1 (Equal Variances)

Case 2 (Unequal Variances)