Notes from Week IV

As you write a pre-analysis plan, here is a useful tool for power calculations.

Properties of an Estimating Procedure

The main takeaway from question 3 is that the properties of an estimation procedure depend both on the estimator and on the decisions taken by the researcher.

\(\text{Estimation Procedure} = f(\text{Estimator}, \text{Researcher})\)

Illustration 1: The “pick the higher estimate” decision rule

Consider the sampling distribution of two estimators: the difference-in-means, and regression with covariate adjustment. Using simulated data (created in Step 1 below), I will demonstrate that both estimators are unbiased and that the latter is more precise. However, if a researcher applies a decision rule such as “pick the higher of the two estimates,” the resulting procedure is biased. Much of this mirrors question 3.

Step 1: Create some data with \(Y_i(0)\), \(Y_i(1)\), and a covariate \(X_1\). By construction the true ATE or \(E[\tau_i] \approx 5\).

library(dplyr)
library(knitr)
library(randomizr)
library(ri2)
library(estimatr)


set.seed(100)
dat1 <- tibble(Y0 = rnorm(1000, mean = 0, sd = 20),
    Y1 = Y0 + rnorm(1000, mean = 5, sd = 20),
    X1 = as.numeric(Y0 >= 0) + runif(1000, min = -1, max = 1),
    Z = complete_ra(N = 1000, m = 500),
    Y = Y1 * Z + Y0 * (1 - Z))

true_ate <- dat1 %>% summarise(value = mean(Y1) - mean(Y0))
true_ate$value
## [1] 5.082935

Step 2: Next, let’s get the sampling distribution of each estimator

# Specify design, get a large number of possible assignment vectors
declaration <- declare_ra(N = 1000, m = 500)
D <- obtain_permutation_matrix(declaration, maximum_permutations = 50000)

# Pre-allocate vectors to store the estimates from each procedure

DIM <- rep(NA, ncol(D))
DIM_covariate <- rep(NA, ncol(D))
DIM_higher <- rep(NA, ncol(D))

# Run a loop to get sampling distribution
for (i in 1:ncol(D)) {
    Z <- D[, i]
    Y <- Z * dat1$Y1 + (1 - Z) * dat1$Y0
    m1 <- lm_robust(Y ~ Z)
    m2 <- lm_robust(Y ~ Z + dat1$X1)
    DIM[i] <- coef(m1)["Z"]
    DIM_covariate[i] <- coef(m2)["Z"]
    # Decision rule: report whichever of the two estimates is higher
    DIM_higher[i] <- max(DIM[i], DIM_covariate[i])
}

out1 <- data.frame(Method = c("DIM", "Covariate Adjustment", "Pick Higher"), 
    Mean = c(mean(DIM), mean(DIM_covariate), mean(DIM_higher)), SE = c(sd(DIM), 
        sd(DIM_covariate), sd(DIM_higher)))
kable(out1)
Method Mean SE
DIM 5.085978 1.447851
Covariate Adjustment 5.083390 1.286763
Pick Higher 5.350184 1.343805

Note that the true ATE \(= 5.082935\). As we see, the difference-in-means and the covariate-adjusted regression are unbiased across repeated random assignments (and the covariate-adjusted estimator has the smaller standard error), whereas the “pick the higher” procedure is biased upward.
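To make the bias explicit, we can subtract the true ATE from the mean of each sampling distribution (a small illustrative addition using the objects created above):

# Bias of each procedure = mean of its sampling distribution minus the true ATE
# (roughly 0.003 and 0.0005 for the first two procedures, about 0.27 for 'Pick Higher')
out1 %>% mutate(Bias = Mean - true_ate$value)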

Illustration 2: Restricted Randomization

Another example is restricted randomization. Here the assignment procedure involves the decision rule “discard an assignment vector if it fails a balance test.” As we have seen, this procedure does not have the same properties as unrestricted random assignment; in particular, it can yield differential probabilities of assignment to conditions across units.

To learn those probabilities, we draw a large number of assignment vectors following the restricted randomization protocol and compute each unit’s empirical probability of being assigned to treatment and to control, as sketched below.
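Here is a minimal sketch of that computation, reusing the permutation matrix D from above. The balance rule (discard an assignment if the absolute difference in mean X1 between groups exceeds 0.05) and its threshold are illustrative assumptions, not part of any particular protocol.

# Keep only the assignment vectors that pass the (illustrative) balance test
balanced <- apply(D, 2, function(z) {
    abs(mean(dat1$X1[z == 1]) - mean(dat1$X1[z == 0])) <= 0.05
})
D_restricted <- D[, balanced]

# Each unit's empirical probability of treatment among the retained assignments
prob_treat <- rowMeans(D_restricted)
summary(prob_treat)

Whether, and by how much, these probabilities differ across units depends on the specific restriction rule.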

Confidence Intervals Under Constant Effects

There is some confusion about how to estimate confidence intervals under the assumption of constant effects. There are essentially two approaches: one using ri2, the other using a loop.

  • With sharp_hypothesis = 0, conduct_ri does not give a confidence interval around the ATE estimate; the interval it reports describes the randomization distribution under the sharp null.
# Calculate the observed ate

ate_estimate <- dat1 %>% summarise(value = mean(Y[Z == 1]) - mean(Y[Z == 0]))
ate_estimate$value
## [1] 2.393333
# Conduct randomization inference to get p value for this ATE estimate:

out2 <- conduct_ri(Y ~ Z, declaration = declaration, assignment = "Z", sharp_hypothesis = 0, 
    data = dat1)

summary(out2)
##   coefficient estimate two_tailed_p_value null_ci_lower null_ci_upper
## 1           Z 2.393333               0.13     -2.963273      3.278193

The interval reported here is not a confidence interval around \(\hat{ATE} = 2.393333\). Rather, null_ci_lower is the 2.5th percentile of the sampling distribution of the estimate under the sharp null, and null_ci_upper is the 97.5th percentile of that distribution.
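To see where those two numbers come from, we can rebuild the sharp-null distribution ourselves. Under the sharp null of no effect for any unit, the observed Y is the same for every assignment, so the sketch below simply re-computes the difference-in-means for each column of the permutation matrix D and takes the 2.5th and 97.5th percentiles. It matches the ri2 output only approximately, because the two approaches use different random draws of assignment vectors.

# Sampling distribution of the difference-in-means under the sharp null of zero effect
null_dist <- rep(NA, ncol(D))

for (i in 1:ncol(D)) {
    Z_null <- D[, i]
    null_dist[i] <- mean(dat1$Y[Z_null == 1]) - mean(dat1$Y[Z_null == 0])
}

# These percentiles correspond (approximately) to null_ci_lower and null_ci_upper
quantile(null_dist, probs = c(0.025, 0.975))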

  • To obtain confidence intervals using ri2, specify: \(sharp\_hypothesis = \hat{ATE}\)
# This is exactly the same command but with sharp_hypothesis = 2.393333 (the observed estimate)

out3 <- conduct_ri(Y ~ Z, declaration = declaration, assignment = "Z", sharp_hypothesis = ate_estimate$value, 
    data = dat1)

summary(out3)
##   coefficient estimate two_tailed_p_value null_ci_lower null_ci_upper
## 1           Z 2.393333              0.509    -0.4490959      5.479461

This interval, \([-0.4490959, 5.479461]\), is the 95% confidence interval (under constant effects) that we are looking for.

  • Another way to get (approximately) the same interval is to use a loop. The result differs slightly from the ri2 interval because the two approaches use different random draws of assignment vectors:
# Create potential outcomes under constant effects. This involves two steps:

# Step 1: Imputing Y0 for all units

# Step 2: Imputing Y1 for all units

dat1 <- dat1 %>% mutate(Y0_temp = (Y - ate_estimate$value * Z), Y1_temp = Y0_temp + 
    ate_estimate$value)

# Note that we impute from the observed Y, even though in this simulated dataset we
# happen to know Y0 and Y1 for every unit. We do NOT keep the true Y0 and impute only
# Y1 (i.e., Y1_temp = Y0 + ate_estimate$value).

kable(head(dat1))
Y0 Y1 X1 Z Y Y0_temp Y1_temp
-10.043847 16.90916 0.8226190 1 16.909156 14.515823 16.9091558
2.630623 31.25135 0.7031790 1 31.251353 28.858021 31.2513533
-1.578342 15.17187 -0.1817087 0 -1.578342 -1.578342 0.8149908
17.735696 44.25915 1.7347557 1 44.259148 41.865815 44.2591478
2.339425 30.07248 0.0831873 1 30.072483 27.679151 30.0724834
6.372602 26.57846 0.2425147 1 26.578455 24.185122 26.5784550
# Now compute ATE estimates for all possible assignment vectors:

out4 <- rep(NA, ncol(D))

for (i in 1:ncol(D)) {
    Z_temp <- D[, i]
    # Compute ATE
    ate_temp <- dat1 %>% summarise(estimate = mean(Y1_temp[Z_temp == 1]) - mean(Y0_temp[Z_temp == 
        0]))
    out4[i] <- ate_temp$estimate
}

ci_95 <- quantile(out4, probs = c(0.025, 0.975))
ci_95
##       2.5%      97.5% 
## -0.6805252  5.4891112

Condition Probabilities

Consider the situation in Chapter 4, Q8: an experiment randomly assigns 500 of 1000 names to a treatment; however, it is later discovered that 600 names appear once and 200 names appear twice. The question is, what is the probability of assignment to treatment among those whose name appeared once, and those whose name appeared twice?

Simulation approach

# Create the register from which random assignment is taking place
names <- c(1:600, rep(601:800, 2))

# Write a loop in which we sample 500 of 1000 names in each iteration

trt_1 <- rep(NA, 50000)
trt_601 <- rep(NA, 50000)

for (i in 1:50000) {
    treatment <- sample(names, size = 500, replace = FALSE)
    trt_1[i] <- is.element(1, treatment)  # a name that appears once in the register
    trt_601[i] <- is.element(601, treatment)  # a name that appears twice
}

mean(trt_1)
## [1] 0.50044
mean(trt_601)
## [1] 0.74784

Analytical solution

Think of this in terms of a coin-flip, where \(Heads = Treatment = 1\), and \(Tails = Control = 0\). For those whose name appeared once, treatment status is determined by one coin flip:

\(Pr_1[Treatment] = \frac{500}{1000} = \frac{1}{2}\)

However, if a name appeared twice, treatment status is determined by two coin flips:

\(Pr_{601}[Treatment] = Pr[1,1] + Pr[0,1] + Pr[1,0] = (\frac{1}{2}\times\frac{1}{2}) + (\frac{1}{2}\times\frac{1}{2}) + (\frac{1}{2}\times\frac{1}{2}) = \frac{3}{4}\)

Another way to state this:

\(Pr_{601}[Treatment] = 1 - Pr_{601}[Not Treated] = 1 - Pr[0,0] = 1 - \frac{1}{4} = \frac{3}{4}\)
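As a quick arithmetic check (an illustrative addition), these probabilities line up with the simulated frequencies above:

# Coin-flip probabilities: 0.5 for names listed once, 0.75 for names listed twice
c(p_once = 1/2, p_twice = 1 - (1/2)^2)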

Weighting \(\delta_X\)

In this section, I compare the weights used by regression, matching estimators, and blocked ATE estimates. To set things up, consider a covariate \(X\), and let \(\delta_X\) denote the difference in mean outcomes between the treatment and control groups at each value of \(X_i\):

\(\delta_X = E[Y_i | D_i = 1, X_i] - E[Y_i | D_i = 0, X_i]\)

A blocked ATE estimate (like Equation 3.10) weights \(\delta_X\) (or the block-level ATE estimate) by \(P(X_i = x) = \frac{N_x}{N}\) (where \(N_x\) is the number of observations in block \(x\), and \(N\) is the total number of observations). So the blocked ATE estimate is:

\(\delta_{Block} = \sum_x \delta_x P(X_i =x) = \sum_x \delta_x \frac{N_x}{N}\)

A regression of the form \(Y_i \sim D_i + X_i\) places the following weight on each \(\delta_X\):

\(\frac{P(D_i=1 | X_i = x)(1 - P(D_i = 1| X_i=x)) P(X_i = x)}{\sum_x P(D_i=1 | X_i = x)(1 - P(D_i = 1| X_i=x)) P(X_i = x)}\)

To demystify this, let the probability of being assigned to the treatment condition in a block be \(p_x = P(D_i = 1 | X_i = x)\). Then the conditional probability of being assigned to control is \(1-p_x\). Think of this in terms of a Bernoulli trial: \(p_x\) is the probability of success, and the variance of a Bernoulli distribution is \(p_x(1-p_x)\). Now rewrite the weight:

\(\frac{p_x(1 - p_x) \frac{N_x}{N}}{\sum_x p_x(1 - p_x) \frac{N_x}{N}}\)

Consequently, the regression estimate is:

\(\delta_R = \frac{\sum_x \delta_x p_x(1 - p_x) \frac{N_x}{N}}{\sum_x p_x(1 - p_x) \frac{N_x}{N}}\)

This is why we say regressions produce a treatment-variance weighted average of \(\delta_X\) (Angrist and Pischke 2009:75). When is \(p_x(1-p_x)\) largest? When exactly half the units are assigned to treatment, and the other half to control. Then \(p_x = 0.5\) and \(p_x(1-p_x) = 0.5^2\). Put another way, regressions place maximum weight on those block-level ATE estimates where units are equally split between treatment and control conditions.

So what about matching estimators? Note that matching estimators target the ATT (the average treatment effect on the treated). They therefore weight by the conditional distribution of \(X_i\) among the treated, \(P(X_i = x| D_i = 1)\): the proportion of all treated units that fall in block \(x\). Specifically:

\(\hat{ATT} = E[Y_{1,i} - Y_{0,i} | D_i = 1] = \sum_x \delta_x P(X_i = x | D_i = 1)\)

Note that using Bayes’ rule we can rewrite the weight as:

\(P(X_i = x | D_i = 1) = \frac{P(D_i = 1 | X_i = x) P(X_i =x)}{P(D_i=1)}\)

And accordingly rewrite the matching estimator as:

\(\delta_M = \frac{\sum_x \delta_x P(D_i = 1 | X_i = x) P(X_i =x)}{P(D_i=1)}\)

Angrist and Pischke say these “weights are proportional to the probability of treatment at each value of the covariate” (76). In other words, \(P(D_i=1 | X_i = x)P(X_i=x) = P(D_i=1 \cap X_i = x)\), which is the joint probability of observing \(X_i = x\) and treatment.

Bottom line: in each case, we estimate the difference in group means within strata and then take a weighted average of those estimates. OLS with block fixed effects and the blocked ATE estimate both shoot for the ATE but use different weighting schemes: the former weights in proportion to the conditional variance of treatment assignment (times block share), the latter weights by block share alone. In contrast, a matching estimator shoots for the ATT and weights each block by the share of treated units it contains.

Task

Let’s use a toy dataset to compute \(\delta_X\) and the weights used by each estimator, and then compare the regression, blocked ATE, and matching estimates:

Unit Y Di X
1 6 1 0
2 4 0 0
3 5 0 1
4 8 1 1
5 10 1 1

Answer: Here are the relevant quantities:

\(\delta_{x=0} = 6 - 4 = 2\) and \(\delta_{x=1} = (\frac{10+8}{2} - 5 ) = 9 - 5 =4\)

\(\frac{N_{x=0}}{N} = \frac{2}{5}\) and \(\frac{N_{x=1}}{N} = \frac{3}{5}\)

Next, let’s compute the relevant treatment assignment probabilities:

\(P(D_i=1) = \frac{3}{5}\)

\(P(D_i=1 | X_i=0) = p_{x=0} = \frac{1}{2}\)

\(P(D_i=1 | X_i=1) = p_{x=1} = \frac{2}{3}\)

And now putting these together to get each estimate:

Blocked ATE = \((2 \times \frac{2}{5}) + (4 \times \frac{3}{5}) = \frac{4}{5} + \frac{12}{5} = \frac{16}{5} = 3.2\)

OLS with Block FE = \(\frac{(2 \cdot \frac{2}{5}\cdot\frac{1}{2}\cdot\frac{1}{2}) + (4 \cdot \frac{3}{5}\cdot\frac{2}{3}\cdot\frac{1}{3})}{(\frac{2}{5}\cdot\frac{1}{2}\cdot\frac{1}{2}) + (\frac{3}{5}\cdot\frac{2}{3}\cdot\frac{1}{3})} = \frac{22}{7} \approx 3.143\)

Matching estimator = \(\frac{(2\cdot \frac{1}{2} \cdot \frac{2}{5}) + (4 \cdot \frac{2}{3} \cdot \frac{3}{5})}{\frac{3}{5}} = \frac{\frac{10}{5}}{\frac{3}{5}} = \frac{10}{3} = 3.333\)

Note that the raw average of the block-level ATEs (the \(\delta_x\)’s) is \(\frac{2+4}{2} = 3\), and each procedure gives an estimate between 2 and 4. OLS up-weights block 1 (\(X_i = 0\)) because it has an equal number of units in treatment and control (hence the highest conditional variance in treatment assignment); the matching estimator up-weights block 2 (\(X_i = 1\)) because a larger proportion of treated units is in block 2 than in block 1 (and \(p_{x=1} > p_{x=0}\)).
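We can verify these hand calculations in R. This is a small illustrative sketch (the objects toy and blocks are mine, not part of the original exercise); the last line checks the regression-weighting claim directly by running OLS with the block dummy:

# Toy data from the table above
toy <- data.frame(Y = c(6, 4, 5, 8, 10), D = c(1, 0, 0, 1, 1), X = c(0, 0, 1, 1, 1))

# Block-level ATE estimates, block sizes, and within-block treatment probabilities
blocks <- toy %>% group_by(X) %>% summarise(delta = mean(Y[D == 1]) - mean(Y[D == 0]), 
    N_x = n(), p_x = mean(D))

# Blocked ATE: weight delta_x by block share N_x/N (= 3.2)
with(blocks, sum(delta * N_x)/sum(N_x))

# OLS with block FE: weight delta_x by p_x(1 - p_x) N_x/N (~ 3.143)
with(blocks, sum(delta * p_x * (1 - p_x) * N_x)/sum(p_x * (1 - p_x) * N_x))

# Matching (ATT): weight delta_x by the share of treated units in block x (~ 3.333)
with(blocks, sum(delta * p_x * N_x)/sum(p_x * N_x))

# Direct check: the coefficient on D from a regression with the block dummy (~ 3.143)
lm(Y ~ D + X, data = toy)$coefficients["D"]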

Re-thinking Q9 (Chapter 4)

library(foreign)
dat3 <- read.dta("W4_Q9.dta")

# Computing block-level ATE estimates, and two types of weights. I am
# essentially doing exactly the same calculation as above, but in code:

block_dat3 <- dat3 %>% group_by(strata) %>% summarise(ATE_block = mean(vote02[treat2 == 
    1]) - mean(vote02[treat2 == 0]), N_j = n(), p_trt = mean(treat2 == 1)) %>% 
    mutate(w_block = N_j/sum(N_j), w_ols = w_block * p_trt * (1 - p_trt)/sum(w_block * 
        p_trt * (1 - p_trt))) %>% select(strata, ATE_block, w_block, w_ols)

kable(block_dat3)
strata ATE_block w_block w_ols
1 0.0096403 0.0494868 0.2192159
2 -0.0078288 0.1520981 0.2475176
3 -0.0136212 0.6266159 0.2767679
4 0.0082712