Notes Over Spring Break

  1. Midterms have been graded, and comments have been posted on Canvas. I would be happy to discuss the solutions with you in office hours or at a mutually convenient time. Two summary statistics may be helpful: the mean score on the test was \(\mu = 89.73\) (SD \(= 11.07\)), and the median score was \(90.5\).

  2. Replication papers are due at the end of the semester. We will now go around the room; please share the following details with the class: the paper you have shortlisted for replication, the key variables (outcome and treatment), whether replication data and code are available, and any extensions or limitations of the analysis you have come across.

Bias Due To Attrition

Missing outcome data (or attrition) is a threat to unbiased inference. This is because attrition jeopardizes the core assumptions of an experiment:

  • De-randomization: When missingness is systematically related to potential outcomes, the observed treatment and control groups are no longer random subsets of the population. This implies that the treatment group mean is not an unbiased estimator of the population mean of \(Y_i(1)\):

\(E[\frac{\sum_{i=1}^m Y_i \cdot Z_i \cdot R_i}{\sum_{i=1}^m Z_i \cdot R_i}] \neq E[Y_i(1)]\)

and the control group mean is not an unbiased estimator of the population mean of \(Y_i(0)\):

\(E[\frac{\sum_{i=m+1}^N Y_i \cdot (1-Z_i) \cdot R_i}{\sum_{i=m+1}^N (1-Z_i) \cdot R_i}] \neq E[Y_i(0)]\)

Figure 1: Attrition undoes random assignment. Consider a situation in which outcome data are missing for those treated units that have high Y1 values, and those untreated units that have low Y0 values.

  • Excludability violations: When missingness covaries with potential outcomes, treatment assignment \(Z\) affects outcomes \(Y\) through a channel other than treatment \(D\), namely the response indicator \(R\).
Figure 2: Z affects Y through R

ATE as a weighted average of types

Let there be four types of subjects in an experiment:

Type                   R(Z=1)   R(Z=0)
Always Responder            1        1
If-Treated Responder        1        0
If-Control Responder        0        1
Never Responder             0        0

Then the Average Treatment Effect can be written as a weighted average of types:

\(ATE = E[Y_i(1)] - E[Y_i(0)] = E[Y_i | D_i =1] - E[Y_i | D_i = 0] = E[Y_i | Z_i =1] - E[Y_i | Z_i = 0]\)

where the second and third equalities follow jointly from the random assignment assumption (\(\{Y_i(0), Y_i(1)\} \perp Z_i\)) and full compliance (\(d_i(1) > d_i(0) \; \forall i\), so that \(D_i = Z_i\)).

Focus on the treatment group:

\(E[Y_i | Z_i = 1] = E[Y_i | Z_i = 1, \color{red}{R_i = 1}]\{Pr[R_i = 1 | Z_i = 1]\} + E[Y_i | Z_i = 1, \color{red}{R_i = 0}] \{1 - Pr[R_i = 1 | Z_i = 1]\}\)

Now say we condition on response \(R_i = 1\), letting \(\pi_{AR}\), \(\pi_{ITR}\), and \(\pi_{ICR}\) denote the population shares of each type:

\(E[Y_i | Z_i = 1, R_i = 1] = E[Y_i | AR]\frac{\pi_{AR}}{\pi_{AR} + \pi_{ITR}} + E[Y_i | ITR]\frac{\pi_{ITR}}{\pi_{AR} + \pi_{ITR}}\)

Now let's look at the control group:

\(E[Y_i | Z_i = 0] = E[Y_i | Z_i = 0, \color{red}{R_i = 1}]\{Pr[R_i = 1 | Z_i = 0]\} + E[Y_i | Z_i = 0, \color{red}{R_i = 0}] \{1 - Pr[R_i = 1 | Z_i = 0]\}\)

And if we condition on response \(R_i = 1\):

\(E[Y_i | Z_i = 0, R_i = 1] = E[Y_i | AR]\frac{\pi_{AR}}{\pi_{AR} + \pi_{ICR}} + E[Y_i | ICR]\frac{\pi_{ICR}}{\pi_{AR} + \pi_{ICR}}\)

In other words, when we condition on response (i.e., estimate \(E[Y_i | Z_i = 1, R_i = 1] - E[Y_i | Z_i = 0, R_i = 1]\)) we are effectively comparing “Always Responders” and “If-Treated Responders” in the treatment group to “Always Responders” and “If-Control Responders” in the control group.
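
To see the resulting bias concretely, here is a minimal simulation sketch. The type shares, outcome means, and variable names are illustrative assumptions (there are no Never Responders, and If-Control Responders are given high \(Y_i(0)\) values), so conditioning on response inflates the control group mean:

set.seed(42)
n <- 100000

# Illustrative type shares: 50% AR, 25% ITR, 25% ICR
type <- sample(c("AR", "ITR", "ICR"), n, replace = TRUE, prob = c(0.5, 0.25, 0.25))

Y0 <- rnorm(n, mean = ifelse(type == "ICR", 8, 5))  # ICRs have high Y(0)
Y1 <- Y0 + 2                                        # constant true ATE of 2
Z <- rbinom(n, 1, 0.5)

# Response follows the table of types above
R <- ifelse(type == "AR", 1, ifelse(type == "ITR", Z, 1 - Z))
Y <- ifelse(R == 1, ifelse(Z == 1, Y1, Y0), NA)

# Complete-case difference-in-means: AR + ITR (treated) vs. AR + ICR (control)
mean(Y[Z == 1], na.rm = TRUE) - mean(Y[Z == 0], na.rm = TRUE)
# about 1, well below the true ATE of 2, because the high-Y(0) ICRs inflate
# the control group mean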

Recovering ATE with \(MIPO\)

If we assume \(Y_i(z) \perp R_i(z)\) (i.e. missingness is independent of potential outcomes), then:

\(E[Y_i | Z_i = 1 , R_i = 1] = E[Y_i | Z_i = 1 , R_i = 0] = E[Y_i | Z_i = 1]\)

(observed treated units are a random subset of the treatment group)

and

\(E[Y_i | Z_i = 0 , R_i = 1] = E[Y_i | Z_i = 0 , R_i = 0] = E[Y_i | Z_i = 0]\)

(observed untreated units are a random subset of the control group)

With this assumption, conditioning on response still gives the ATE:

\(E[Y_i | Z_i = 1, R_i = 1] - E[Y_i | Z_i = 0, R_i = 1] = E[Y_i | Z_i = 1] - E[Y_i | Z_i = 0] = ATE\)
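
As a quick check, the simulation sketch from the previous section can be re-run with response drawn independently of the potential outcomes (the parameters are again illustrative); the complete-case difference-in-means is then centered on the true ATE:

set.seed(42)
n <- 100000
Y0 <- rnorm(n, mean = 5)
Y1 <- Y0 + 2              # constant true ATE of 2
Z <- rbinom(n, 1, 0.5)
R <- rbinom(n, 1, 0.7)    # MIPO: response is independent of Y(1) and Y(0)
Y <- ifelse(R == 1, ifelse(Z == 1, Y1, Y0), NA)

# Complete-case difference-in-means, now approximately 2
mean(Y[Z == 1], na.rm = TRUE) - mean(Y[Z == 0], na.rm = TRUE)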

Recovering ATE with \(MIPO|X\)

“MIPO” is a strong assumption: it implies that missingness is completely at random. A weaker assumption, “MIPO|X”, states that missingness is random within (i.e., conditional on) a covariate profile. Formally:

\(Y_i(z) \perp R_i(z) \mid \textbf{X}_i = x, \; \forall x\) (where \(\textbf{X}_i\) is a vector of covariates)

Intuition: Say that for some outcome \(Y\), women generally have higher values than men, or \(\mu_{Female} > \mu_{Male}\). Outcome data are missing for both men and women, but more so for men (\(Pr[R_i = 1 | Male] = 0.4\) and \(Pr[R_i = 1 | Female] = 0.8\)). Then, by construction, missingness is correlated with potential outcomes: attrition decreases as \(Y\) increases, so “MIPO” is an untenable assumption. But “MIPO|X” may be tenable if, conditional on being female (or male), outcome data are missing at random.

Claim: An IPW (inverse probability weighting) estimator is unbiased and recovers the average treatment effect if we assume “MIPO|X”.

Proof: I will demonstrate that in the treatment group, within any covariate stratum, \(E[\frac{Y_i(1) \cdot R_i}{p_R(X_i)}|X_i] = E[Y_i(1) | X_i]\), where \(p_R(X_i) = Pr[R_i = 1 | X_i]\) is the probability of response given the covariates. Put another way, the IPW estimator equals the conditional mean of \(Y_i(1)\). And if we average over all values of the covariates, the law of iterated expectations tells us: \(E_X[E[Y_i(1)|X_i]] = E[Y_i(1)]\).

To see this, start with the IPW estimator:

\(E[\frac{Y_i(1) \cdot R_i}{p_R(X_i)}|X_i] = E[Y_i(1) \cdot \frac{R_i}{Pr[R_i=1 | X_i]} |X_i]\)

MIPO|X implies that \(Y_i(1) \perp R_i | X_i\), so we can rewrite the above as:

\(E[Y_i(1) \cdot \frac{R_i}{Pr[R_i=1 | X_i]} |X_i] = E[Y_i(1) | X_i] \cdot E[\frac{R_i}{Pr[R_i=1 | X_i]} |X_i]\)

\(= E[Y_i(1) | X_i] \cdot \frac{E[R_i | X_i]}{Pr[R_i=1 | X_i]} = E[Y_i(1) | X_i] \cdot \frac{Pr[R_i=1 | X_i]}{Pr[R_i=1 | X_i]}\)

Which implies: \(E[\frac{Y_i(1) \cdot R_i}{p_R(X_i)}|X_i] = E[Y_i(1)|X_i]\)

And averaging over all possible values of \(X\) gives: \(E_X[E[\frac{Y_i(1) \cdot R_i}{p_R(X_i)}|X_i]] = E_X[E[Y_i(1)|X_i]] = E[Y_i(1)]\)

Similarly, in the control group: \(E_X[E[\frac{Y_i(0) \cdot R_i}{p_R(X_i)}|X_i]] = E_X[E[Y_i(0)|X_i]] = E[Y_i(0)]\)

This demonstrates that:

\(E_X[E[\frac{Y_i(1) \cdot R_i}{p_R(X_i)}|X_i]] - E_X[E[\frac{Y_i(0) \cdot R_i}{p_R(X_i)}|X_i]] = E[Y_i(1)] - E[Y_i(0)] = ATE\)
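
To make the proof concrete, here is a simulation sketch built on the gender intuition from the previous section. The response rates (0.8 for women, 0.4 for men) come from that paragraph; the outcome means, effect sizes, and variable names are illustrative assumptions. Because gender shifts both the outcome and the response probability, MIPO fails, but MIPO|X holds, and weighting observed outcomes by \(1/p_R(X_i)\) recovers \(E[Y_i(1)]\), \(E[Y_i(0)]\), and hence the ATE:

set.seed(42)
n <- 100000
female <- rbinom(n, 1, 0.5)

Y0 <- rnorm(n, mean = 5 + 3 * female)  # women have higher outcomes
Y1 <- Y0 + 1 + 2 * female              # heterogeneous effect; true ATE = 2
Z <- rbinom(n, 1, 0.5)
p_R <- ifelse(female == 1, 0.8, 0.4)   # MIPO|X: response is random within gender
R <- rbinom(n, 1, p_R)

# Naive complete-case estimate over-represents women, who respond more often
Y_obs <- ifelse(R == 1, ifelse(Z == 1, Y1, Y0), NA)
mean(Y_obs[Z == 1], na.rm = TRUE) - mean(Y_obs[Z == 0], na.rm = TRUE)
# about 2.33

# IPW estimate: average Y_i * R_i / p_R(X_i) within each arm
# (missing outcomes contribute 0 because R_i = 0)
YR <- ifelse(R == 1, Y_obs, 0)
mean((YR / p_R)[Z == 1]) - mean((YR / p_R)[Z == 0])
# about 2, the true ATE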

Uncertainty around bounds

In this section, we will work with the data in Table 7.1 (Gerber and Green 2012:217). I will first estimate extreme value bounds that capture the fundamental uncertainty in the ATE estimate. However, the upper (or lower) bound is itself a random variable with a sampling distribution. In what follows, I will use a bootstrapping procedure and randomization inference to recover these sampling distributions.

Table 7.1
    Y(0)   Y(1)   r(0)   r(1)   Y(0)|r(0)   Y(1)|r(1)   Z   Y_obs
1      3      8      0      1          NA           8   0      NA
2      3      7      0      1          NA           7   1       7
3      8     10      1      1           8          10   1      10
4      7      8      1      1           7           8   0       7
5      6      6      0      0          NA          NA   1      NA
6      5      8      1      1           5           8   0       5
7      6      6      1      0           6          NA   1      NA
8      6      7      1      1           6           7   0       6
  • \(\color{red}{Warning:}\) R functions like lm and lm_robust perform listwise deletion, which can conceal attrition. To see this, let's pass our dataset to model.matrix, which returns the \(\textbf{X}\) matrix in \(\textbf{Y} = \textbf{X}\beta\):
model.matrix(Y_obs ~ Z, table7.1)
##   (Intercept) Z
## 2           1 1
## 3           1 1
## 4           1 0
## 6           1 0
## 8           1 0
## attr(,"assign")
## [1] 0 1
# Note that despite no missing Z values, this matrix has only 5 rows
  • Extreme value bounds: Assume that \(Y_{min} = 0\) and \(Y_{max} = 20\); we can then place Manski bounds on the ATE estimate.
library(estimatr)
library(dplyr)  # for select(), mutate(), and recode()
library(knitr)  # for kable()

# Select only relevant columns from Table 7.1 (Y_obs and Z), and create a
# column for missingness (Y_obs - Y_obs is NA when Y_obs is NA, else 0)
data <- table7.1 %>% select(Y_obs, Z) %>% mutate(missing = Y_obs - Y_obs)

# Recode missingness as 1 if Y_obs is NA, leaving it 0 otherwise
data <- data %>% mutate(missing = recode(missing, .missing = 1))

# Impute values and create two new columns Y_upper and Y_lower (Alex's code)

data <- within(data, {
    Y_upper <- Y_lower <- Y_obs
    Y_upper[Z == 1 & missing == 1] <- 20
    Y_upper[Z == 0 & missing == 1] <- 0
    Y_lower[Z == 1 & missing == 1] <- 0
    Y_lower[Z == 0 & missing == 1] <- 20
})

kable(data, caption = "Extreme Value Bounds Imputed Data")
Extreme Value Bounds Imputed Data
 Y_obs   Z   missing   Y_upper   Y_lower
    NA   0         1         0        20
     7   1         0         7         7
    10   1         0        10        10
     7   0         0         7         7
    NA   1         1        20         0
     5   0         0         5         5
    NA   1         1        20         0
     6   0         0         6         6
# Lower bound
difference_in_means(Y_lower ~ Z, data)
## Design:  Standard 
##   Estimate Std. Error  Pr(>|t|)  CI Lower CI Upper       DF
## Z    -5.25   4.337338 0.2760474 -16.13225 5.632249 5.442531
# Upper bound
difference_in_means(Y_upper ~ Z, data)
## Design:  Standard 
##   Estimate Std. Error   Pr(>|t|)   CI Lower CI Upper       DF
## Z     9.75   3.716517 0.05553307 -0.3620691 19.86207 4.217635

Note: From Table 7.1 we know that the true ATE, or mean(Y1 - Y0), is 2. In this step we estimated the fundamental uncertainty around that average treatment effect: \([-5.25, 9.75]\). But observe that the lower and upper bounds are sensitive to several things: the values of \(Y_{min}\) and \(Y_{max}\), the number of missing observations, and sampling variability. For instance, in the textbook \(Y_{min} = 0\) and \(Y_{max} = 10\), and the corresponding Manski bounds are \([-2.75, 4.75]\). These are different from the bounds reported here, primarily because we impute with \(Y_{max} = 20\) (see the check below). However, both sets of bounds are also affected by sampling variability. That is to say, they would be different had we observed a different draw from the population, or a different assignment vector.
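
As a check on that claim, we can re-impute with the textbook extremes and reproduce the \([-2.75, 4.75]\) bounds (the column names Y_upper10 and Y_lower10 are mine):

# Re-impute missing outcomes with the textbook extremes (Y_min = 0, Y_max = 10)
data <- within(data, {
    Y_upper10 <- Y_lower10 <- Y_obs
    Y_upper10[Z == 1 & missing == 1] <- 10
    Y_upper10[Z == 0 & missing == 1] <- 0
    Y_lower10[Z == 1 & missing == 1] <- 0
    Y_lower10[Z == 0 & missing == 1] <- 10
})

# Lower bound: the estimate should be -2.75
difference_in_means(Y_lower10 ~ Z, data)

# Upper bound: the estimate should be 4.75
difference_in_means(Y_upper10 ~ Z, data)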

  • Uncertainty: We now consider the two types of uncertainty. We know that the lower and upper bounds will be different in another random draw from the population. We can use a bootstrapping procedure to capture this uncertainty:
  1. Draw a sample of \(n = 8\) with replacement from data.

  2. For each bootstrap sample, estimate the lower bound \(\widehat{ATE}_{lower}\) and the upper bound \(\widehat{ATE}_{upper}\).

  3. Repeat the process \(k = 10000\) times (essentially, a large number of times), storing \(\widehat{ATE}_{lower}\) and \(\widehat{ATE}_{upper}\) in each iteration.

  4. The standard deviation of each of the two vectors is the bootstrapped standard error of the corresponding bound.

# Define output vectors
sims <- 10000

ate_lower <- rep(NA, sims)
ate_upper <- rep(NA, sims)

# Bootstrapping procedure

for (i in 1:sims) {
    # Draw 8 rows with replacement from data
    boot.sample <- sample_n(data, size = 8, replace = TRUE)
    # If a bootstrap sample happens to be all-treated or all-control, the Z
    # coefficient is undefined and lm() returns NA (see the note below)
    ate_lower[i] <- lm(Y_lower ~ Z, boot.sample)$coefficients[2]
    ate_upper[i] <- lm(Y_upper ~ Z, boot.sample)$coefficients[2]
}


fig1.data <- data.frame(ATE = c(ate_lower, ate_upper),
    Type = c(rep("Lower", sims), rep("Upper", sims)))

Type     Mean        SE
Lower   -5.171885   4.068366
Upper    9.668185   3.491560

Note: When we sample with replacement from small datasets, there are some bootstrap samples in which all the rows are drawn from the treatment or control group. In those cases, the lower and upper bounds are undefined. For this reason, bootstrapping procedures should be performed on larger datasets.

Finally, we expect that the lower and upper bounds will change with the assignment vector \(Z\). In Table 7.1 we know every unit's treated and untreated potential outcome, and their type. We can use this information to estimate the lower and upper bounds under different assignments, as in the sketch that follows the caution below:

Caution: We can measure the uncertainty due to random assignment only because Table 7.1 gives \(Y_i(1)\), \(Y_i(0)\), and each respondent's type. When the full schedule of potential outcomes and the respondents' types are unknown, we do not know whether a respondent would reveal their outcome under some hypothetical assignment, or what the value of that outcome would be. Therefore, this simulation is only for illustrative purposes.
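
Here is a minimal sketch of that simulation. It assumes the full schedule in Table 7.1 is stored in table7.1 with columns named Y0, Y1, r0, and r1 (those names are my assumption; adapt them to however the table is stored):

sims <- 10000
ri_lower <- rep(NA, sims)
ri_upper <- rep(NA, sims)

for (i in 1:sims) {
    # Complete random assignment: 4 of the 8 units treated
    Z <- sample(rep(c(0, 1), each = 4))
    # Reveal response and outcome from the full schedule
    R <- ifelse(Z == 1, table7.1$r1, table7.1$r0)
    Y <- ifelse(Z == 1, table7.1$Y1, table7.1$Y0)
    Y[R == 0] <- NA
    # Impute extreme values (Y_min = 0, Y_max = 20) as before
    Y_upper <- ifelse(is.na(Y), ifelse(Z == 1, 20, 0), Y)
    Y_lower <- ifelse(is.na(Y), ifelse(Z == 1, 0, 20), Y)
    ri_lower[i] <- mean(Y_lower[Z == 1]) - mean(Y_lower[Z == 0])
    ri_upper[i] <- mean(Y_upper[Z == 1]) - mean(Y_upper[Z == 0])
}

# Spread of each bound across assignment vectors
sd(ri_lower)
sd(ri_upper)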

Trimming Bounds

Estimand: ATE | AR

Assumption: Monotonicity, either \(r_i(1) \geq r_i(0)\) (no ICRs) or \(r_i(0) \geq r_i(1)\) (no ITRs)

Procedure: Estimate \(Q\), the proportion of observed subjects in the treatment group who are the “extra” responders, i.e. If-Treated Responders (when we assume there are no If-Control Responders).

\(Q = \frac{Pr[R_i = 1 | Z_i = 1] - Pr[R_i = 1 | Z_i = 0]}{Pr[R_i = 1 | Z_i = 1]} = \frac{(\pi_{AR} + \pi_{ITR}) - \pi_{AR}}{\pi_{AR} + \pi_{ITR}} = \frac{\pi_{ITR}}{\pi_{AR} + \pi_{ITR}}\)

Then sort the observed \(Y_i(1)\) values in descending order (highest to lowest). To get the “upper bound”, trim off the lowest \(100 \cdot Q\%\) of the \(Y_i(1)\) values and average the remaining values; this yields an upper estimate of \(E[Y_i(1) | AR]\), because it attributes the lowest \(Y_i(1)\) values to ITRs. Conversely, to get the “lower bound”, trim off the highest \(100 \cdot Q\%\) of the \(Y_i(1)\) values and average the remaining values; this yields a lower estimate of \(E[Y_i(1) | AR]\), because it attributes the highest \(Y_i(1)\) values to ITRs.

Intuition

Let's work through the intuition of trimming bounds by looking at a small dataset (\(n = 10\) units, of which \(m = 6\) are treated):

Trimming Toy Dataset
      Y   Z
1    10   1
2     8   1
3     5   1
4     2   1
5    NA   1
6    NA   1
7     5   0
8     5   0
9    NA   0
10   NA   0

Step 1: Estimate \(Q\)

# Create two variables 'observed' and 'missing'
trimming.data <- trimming.data %>% mutate(missing = Y - Y) %>% mutate(missing = recode(missing, 
    .missing = 1), observed = 1 - missing)

# Estimating Q
Q <- with(trimming.data, (mean(observed[Z == 1]) - mean(observed[Z == 0]))/mean(observed[Z == 
    1]))

Q  # should be 0.25
## [1] 0.25
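
To verify by hand: 4 of the 6 treated units report an outcome and 2 of the 4 control units do, so \(Q = \frac{4/6 - 2/4}{4/6} = \frac{1/6}{2/3} = 0.25\).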

Step 2: Trimming

# Create two vectors: observed values in treatment and control group (i.e.
# drop missing data)
observed_treatment <- trimming.data %>% filter(Z == 1 & observed == 1)
observed_treatment
##    Y Z missing observed
## 1 10 1       0        1
## 2  8 1       0        1
## 3  5 1       0        1
## 4  2 1       0        1
observed_control <- trimming.data %>% filter(Z == 0 & observed == 1)
observed_control
##   Y Z missing observed
## 1 5 0       0        1
## 2 5 0       0        1
mean(observed_control$Y)  # E[Y0 | AR]
## [1] 5
# Arrange the treated observations in descending order
observed_treatment <- observed_treatment %>% arrange(desc(Y))

# Identify the cutoffs for the lower and upper bounds

quantile(observed_treatment$Y, Q)
##  25% 
## 4.25
quantile(observed_treatment$Y, (1 - Q))
## 75% 
## 8.5
# Trimming-off the lowest and highest values

observed_treatment_high <- filter(observed_treatment, Y > quantile(Y, probs = Q))
observed_treatment_high
##    Y Z missing observed
## 1 10 1       0        1
## 2  8 1       0        1
## 3  5 1       0        1
observed_treatment_low <- filter(observed_treatment, Y < quantile(Y, probs = (1 - 
    Q)))
observed_treatment_low
##   Y Z missing observed
## 1 8 1       0        1
## 2 5 1       0        1
## 3 2 1       0        1

Step 3: Estimate the lower and upper bounds for ATE | AR

# Upper bound
mean(observed_treatment_high$Y) - mean(observed_control$Y)
## [1] 2.666667
# Lower bound
mean(observed_treatment_low$Y) - mean(observed_control$Y)
## [1] 0

Practice

Download the Angrist, Bettinger, and Kremer (2006) dataset posted on Canvas or Github. The treatment in their study is \(vouch0\), and the outcome variable is \(read\). Since outcome data are missing for many units, there are two additional variables: \(observed\) (which equals 1 if the reading score is reported) and \(missing\) (which equals 1 if the reading score is not reported). Calculate the trimming bounds for ATE | AR under the assumption \(r_i(1) \geq r_i(0)\) (i.e. there are no “If-Control Responders”).

Solution

(Reproduced from Alex Coppock’s class slides)

# Load the dataset
load("/Users/shikharsingh/Dropbox/Yale/Teaching/PLSC512_FieldExperiments/W7_Trimming.Rda")

# Confirm that missingness is higher in the control group (its non-respondents
# include both NRs and ITRs)

tab <- data %>% group_by(vouch0) %>% summarise(Missingness = mean(missing))
kable(tab, digits = 3)
vouch0   Missingness
     0         0.698
     1         0.624
# Calculate Q

Q = with(data, (mean(observed[vouch0 == 1]) - mean(observed[vouch0 == 0]))/mean(observed[vouch0 == 
    1]))
Q
## [1] 0.1956908
# Subsetting to only observed data

observed_treatment <- data %>% filter(vouch0 == 1 & observed == 1)

observed_control <- data %>% filter(vouch0 == 0 & observed == 1)

mean(observed_control$read)  # E[Y0 | AR]
## [1] 46.92081
# Arrange the treated observations in descending order
observed_treatment <- observed_treatment %>% arrange(desc(read))

# Identify the cutoffs for the lower and upper bounds

quantile(observed_treatment$read, Q)
## 19.56908% 
##        43
quantile(observed_treatment$read, (1 - Q))
## 80.43092% 
##        52
# Trimming-off the lowest and highest values (from Alex's code)

observed_treatment_high <- filter(observed_treatment, read > quantile(read, 
    probs = Q))

observed_treatment_low <- filter(observed_treatment, read < quantile(read, probs = (1 - 
    Q)))

# Upper bound
mean(observed_treatment_high$read) - mean(observed_control$read)
## [1] 2.812247
# Lower bound
mean(observed_treatment_low$read) - mean(observed_control$read)
## [1] -1.463869