Notes from Week VIII

Trimming Bounds

With binary outcome variables, my code for trimming bounds breaks: filtering on the quantile cutoff drops every observation tied with the cutoff value, so it trims far more than the intended share. To see this, let’s attempt Question 8 (Chapter 7) using the section code:

library(tidyverse)
library(estimatr)

# Create the dataset using the table on page 246

data <- data.frame(Hispanic = c(rep(0, 106), rep(1, 111)), Y = c(rep(1, 50), 
    rep(0, 28), rep(NA, 28), rep(1, 68), rep(0, 26), rep(NA, 17)), Observed = c(rep(1, 
    50 + 28), rep(0, 28), rep(1, 68 + 26), rep(0, 17)))

# Calculate Q

Q = with(data, (mean(Observed[Hispanic == 1]) - mean(Observed[Hispanic == 0]))/mean(Observed[Hispanic == 
    1]))
Q
## [1] 0.1310719
# Subsetting to only observed data

observed_treatment <- data %>% filter(Hispanic == 1 & Observed == 1)

observed_control <- data %>% filter(Hispanic == 0 & Observed == 1)

mean(observed_control$Y)  # E[Y0 | AR]
## [1] 0.6410256
# Identify the cutoffs for the lower and upper bounds

quantile(observed_treatment$Y, Q)
## 13.10719% 
##         0
quantile(observed_treatment$Y, (1 - Q))
## 86.89281% 
##         1
# Earlier approach: trimming off the lowest and highest values

observed_treatment_high <- filter(observed_treatment, Y > quantile(Y, probs = Q))

observed_treatment_low <- filter(observed_treatment, Y < quantile(Y, probs = (1 - 
    Q)))

# Error: we over-trim because, with a binary Y, the strict inequality drops
# every observation equal to the quantile cutoff (all the 0s or all the 1s)

length(observed_treatment$Y)  #n = 94
## [1] 94
length(observed_treatment_high$Y)  # n should be 82 (removing .13*94 = 12 observations)
## [1] 68
# Thus, the bound estimate is wrong:
mean(observed_treatment_high$Y) - mean(observed_control$Y)
## [1] 0.3589744
# Instead: Arrange the treated observations in descending order

observed_treatment <- observed_treatment %>% arrange(desc(Y))

# Calculate the number of If-Treated-Reporters (ITRs), i.e. the number of
# observations to trim

Q * length(observed_treatment$Y)  # approx 12
## [1] 12.32075
# Upper bound group: keep the 82 largest outcomes (drop the 12 smallest)
observed_treatment_high <- observed_treatment[1:82, ]

# Lower bound group: keep the 82 smallest outcomes (drop the 12 largest)
observed_treatment_low <- observed_treatment[13:94, ]


# Upper bound
mean(observed_treatment_high$Y) - mean(observed_control$Y)
## [1] 0.1882427
# Lower bound
mean(observed_treatment_low$Y) - mean(observed_control$Y)
## [1] 0.04190119
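
A minimal sketch of the same fix, assuming the observed_treatment, observed_control, and Q objects defined above: it computes the trim count instead of hard-coding the row indices, using dplyr’s slice_max()/slice_min() with with_ties = FALSE so that exactly the intended number of rows is kept despite the binary ties. It should reproduce the two bounds just computed.

# Number of treated observations to keep after trimming a share Q
n_keep <- nrow(observed_treatment) - round(Q * nrow(observed_treatment))

# Upper bound: keep the n_keep largest treated outcomes
trim_high <- slice_max(observed_treatment, Y, n = n_keep, with_ties = FALSE)
mean(trim_high$Y) - mean(observed_control$Y)

# Lower bound: keep the n_keep smallest treated outcomes
trim_low <- slice_min(observed_treatment, Y, n = n_keep, with_ties = FALSE)
mean(trim_low$Y) - mean(observed_control$Y)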

Non-Interference Assumption

Non-interference is an implicit but important assumption for unbiased causal inference. Typically, the assumption is that subject \(i\)’s potential outcomes depend solely on \(i\)’s assignment (\(z_i\)) and treatment status (\(d_i\)), not on the treatment assignment or status of any other subject \(j \neq i\). Formally, we state this as:

\(Y_i(\textbf{z},\textbf{d}) = Y_i(\textbf{z'},\textbf{d'})\) where \(z_i = z'_i\) and \(d_i = d'_i\)

Note that we have already assumed non-interference when we say that units have only two potential outcomes \(Y_i(1)\) and \(Y_i(0)\), or that the outcome they reveal is determined by the switching equation \(Y_i = Y_i(1)\cdot d_i + Y_i(0)\cdot[1-d_i]\).
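
As a quick illustration of the switching equation (a made-up toy simulation, not anything from the book), each unit below reveals \(Y_i(1)\) when treated and \(Y_i(0)\) otherwise:

# Toy simulation of the switching equation (invented values, illustration only)
set.seed(8)
n <- 10
Y0 <- rbinom(n, 1, 0.5)  # potential outcome under control
Y1 <- rbinom(n, 1, 0.7)  # potential outcome under treatment
d <- sample(rep(c(0, 1), n/2))  # complete random assignment: half the units treated
Y <- Y1 * d + Y0 * (1 - d)  # switching equation: reveal Y1 if treated, Y0 otherwise
data.frame(d, Y0, Y1, Y)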

So what happens if this assumption breaks down? In other words, what if \(i\)’s outcomes are sensitive to \(j\)’s treatment assignment or status? In this section, we will look at the three-step solution to spillovers:

  1. Defining an exposure model and potential outcomes.

  2. Writing an estimand in terms of the potential outcome model.

  3. Using a design that randomly samples from those potential outcomes.

Defining Potential Outcomes

When there is interference between units, the first step is to posit an exposure model and define potential outcomes. This entails specifying the types of interference, writing potential outcomes that reflect the underlying interactions between units, and “restabilizing” outcomes.

Example: In Camerer’s (1998) study, bets were placed on one of two horses running in the same race. For each pair \(i\) and \(j\), Camerer randomly selected one horse and placed two $500 bets on that horse before the start of the race. The outcome of interest is the change in total bets for each horse (i.e., the difference between total bets placed post-treatment and pre-treatment). Crucially, bets depend on a horse’s odds, which reflect the proportion of total bets placed on that horse. This means that horse \(i\)’s betting odds are sensitive to the bets placed on horse \(j\). This creates an interference problem: if Camerer bet on horse \(i\) in some race, \(i\)’s treatment status affects \(j\)’s betting odds (and hence \(j\)’s outcome).

To resolve this issue, the first step is to define an exposure model:

Horse \(i\)’s potential outcomes are affected by: (a) \(i\)’s own treatment status \(d_i\); and (b) \(j\)’s treatment status \(d_j\). Accordingly, horse \(i\)’s potential outcomes are not sensitive to the treatment status of any horse \(k\) in any other pair or race.

If we believe this exposure model, horse \(i\) has four potential outcomes:

\(Y_i(d_i = 1, d_j = 1)\)

\(Y_i(d_i = 1, d_j = 0)\)

\(Y_i(d_i = 0, d_j = 1)\)

\(Y_i(d_i = 0, d_j = 0)\)

Note that Camerer’s (1998) random assignment protocol (a matched-pair design) allows only one horse per pair to be treated. Thus horse \(i\) never reveals \(Y_i(d_i = 1, d_j = 1)\) or \(Y_i(d_i = 0, d_j = 0)\); those two potential outcomes are observed with probability zero.
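
To make the exposure model concrete, here is a small sketch of one pair’s potential-outcome schedule. The dollar values are invented for illustration (only the exposure structure follows the description above); it shows that under the matched-pair assignment each horse can only ever reveal \(Y(1, 0)\) or \(Y(0, 1)\).

# Hypothetical potential-outcome schedule for one pair (values are invented)
pair <- data.frame(horse = c("i", "j"),
    Y_own1_other1 = c(900, 950),    # Y(d_own = 1, d_other = 1), never observed
    Y_own1_other0 = c(1100, 1200),  # Y(d_own = 1, d_other = 0)
    Y_own0_other1 = c(300, 250),    # Y(d_own = 0, d_other = 1)
    Y_own0_other0 = c(600, 650))    # Y(d_own = 0, d_other = 0), never observed

# Matched-pair assignment: exactly one horse in the pair is bet on, so each
# horse reveals either Y(1, 0) or Y(0, 1)
treated <- sample(pair$horse, 1)
pair$d <- as.numeric(pair$horse == treated)
pair$revealed <- ifelse(pair$d == 1, pair$Y_own1_other0, pair$Y_own0_other1)
pair[, c("horse", "d", "revealed")]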

Practice 1: Village Study

Consider Figure 8.1 in Gerber and Green (2012). There are six locations (labeled A-F), of which five are inhabited (labeled 1-5). Say a researcher randomly selects one village for treatment, and specifies the following exposure model:

Village \(x\)’s outcomes depend on two things: \(x\)’s treatment status (\(d_x\)), and the treatment status of its immediate neighbor.
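
A sketch of step 1 for this practice problem is below. The neighbor map is a hypothetical stand-in, not the actual layout of Figure 8.1; it only shows how each village’s exposure condition would be classified once a single village is randomly assigned to treatment.

# Hypothetical neighbor map (NOT the Figure 8.1 layout), used only to show how
# exposure conditions follow from this exposure model
villages <- 1:5
neighbor <- c(2, 1, 4, 3, 4)  # made-up immediate neighbor of each village

treated_village <- sample(villages, 1)  # one village randomly treated

exposure <- case_when(villages == treated_village ~ "treated",
    neighbor == treated_village ~ "untreated, neighbor treated",
    TRUE ~ "untreated, neighbor untreated")
data.frame(village = villages, exposure)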