As you write a pre-analysis plan, here is a useful tool for power calculations.
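One simple way to do such a calculation is by simulation: repeatedly generate data under an assumed effect size, randomly assign treatment, estimate, and record how often the null of no effect is rejected. Below is a minimal sketch along those lines (the effect size, outcome noise, sample size, significance level, and number of simulations are illustrative assumptions, not values from any particular study); it uses randomizr and estimatr, which are loaded again below.
library(randomizr)
library(estimatr)
# Simulation-based power: the share of simulated experiments in which the
# treatment effect estimate is significant at the chosen alpha level
power_sim <- function(N = 1000, tau = 5, sd_y = 20, alpha = 0.05, sims = 500) {
    rejections <- replicate(sims, {
        Y0 <- rnorm(N, mean = 0, sd = sd_y)
        Y1 <- Y0 + tau
        Z <- complete_ra(N = N, m = N / 2)
        Y <- Z * Y1 + (1 - Z) * Y0
        lm_robust(Y ~ Z)$p.value[["Z"]] < alpha
    })
    mean(rejections)
}
power_sim()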
The main takeaway from question 3 is that the properties of an estimation procedure depend on the estimator and decisions taken by the researcher.
Estimation Procedure = \(f(Estimator, Researcher)\)
Consider the sampling distribution of two estimators: the difference-in-means, and regression with covariate adjustment. I will use some data created in the previous section to demonstrate that both estimators are unbiased, and the latter is more precise. However, if a researcher applies some decision rule (“pick the higher of the two estimates”), the resulting estimator is biased. Much of this mirrors question 3.
Step 1: Create some data with \(Y_i(0)\), \(Y_i(1)\), and a covariate \(X_1\). By construction the true ATE or \(E[\tau_i] \approx 5\).
library(dplyr)
library(knitr)
library(randomizr)
library(ri2)
library(estimatr)
set.seed(100)
dat1 <- tibble(Y0 = rnorm(1000, mean = 0, sd = 20),
               Y1 = Y0 + rnorm(1000, mean = 5, sd = 20),
               X1 = as.numeric(Y0 >= 0) + runif(1000, min = -1, max = 1),
               Z = complete_ra(N = 1000, m = 500),
               Y = Y1 * Z + Y0 * (1 - Z))
true_ate <- dat1 %>% summarise(value = mean(Y1) - mean(Y0))
true_ate$value
## [1] 5.082935
Step 2: Next, let's obtain the sampling distribution of each estimator.
# Specify design, get a large number of possible assignment vectors
declaration <- declare_ra(N = 1000, m = 500)
D <- obtain_permutation_matrix(declaration, maximum_permutations = 50000)
# Pre-allocate vectors to hold the estimates
DIM <- rep(NA, ncol(D))
DIM_covariate <- rep(NA, ncol(D))
DIM_higher <- rep(NA, ncol(D))
# Loop over assignment vectors to build each sampling distribution
for (i in 1:ncol(D)) {
    # Reveal outcomes under the i-th assignment vector
    Z <- D[, i]
    Y <- Z * dat1$Y1 + (1 - Z) * dat1$Y0
    # Difference-in-means and covariate-adjusted regression
    m1 <- lm_robust(Y ~ Z)
    m2 <- lm_robust(Y ~ Z + dat1$X1)
    DIM[i] <- coef(m1)[["Z"]]
    DIM_covariate[i] <- coef(m2)[["Z"]]
    # "Pick the higher of the two estimates" decision rule
    DIM_higher[i] <- max(DIM[i], DIM_covariate[i])
}
out1 <- data.frame(
    Method = c("DIM", "Covariate Adjustment", "Pick Higher"),
    Mean = c(mean(DIM), mean(DIM_covariate), mean(DIM_higher)),
    SE = c(sd(DIM), sd(DIM_covariate), sd(DIM_higher))
)
kable(out1)
Method | Mean | SE |
---|---|---|
DIM | 5.085978 | 1.447851 |
Covariate Adjustment | 5.083390 | 1.286763 |
Pick Higher | 5.350184 | 1.343805 |
Note that the true ATE \(= 5.082935\). As the table shows, the difference-in-means and the covariate-adjusted regression are essentially unbiased, and the latter has the smaller standard error; the "pick the higher of the two" procedure, by contrast, is biased upward.
Another example of this is restricted randomization, where the assignment procedure involves the decision rule: "discard an assignment vector if it fails the balance test." As we have seen, this assignment procedure does not have the same properties as unrestricted random assignment; in fact, such a procedure can yield differential probabilities of assignment to conditions across units.
To know those probabilities for each unit, we draw a large number of assignment vectors following the restricted randomization protocol and compute individual probabilities of being assigned to treatment and control.
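As a rough sketch of that computation in code (the balance rule below, which discards any draw where the t-statistic for X1 exceeds 1 in absolute value, is an arbitrary illustrative choice; whether and how much the resulting probabilities differ across units depends on the rule actually used):
# Draw complete random assignments, keep only those that pass the balance
# check, and average each unit's treatment indicator across accepted draws
accepted <- list()
for (i in 1:5000) {
    Z_try <- complete_ra(N = 1000, m = 500)
    balance_t <- t.test(dat1$X1 ~ Z_try)$statistic
    if (abs(balance_t) < 1) accepted[[length(accepted) + 1]] <- Z_try
}
Z_mat <- do.call(cbind, accepted)
# Per-unit probability of assignment to treatment under the restricted scheme
prob_treat <- rowMeans(Z_mat)
summary(prob_treat)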
There is often confusion about how to estimate confidence intervals under the assumption of constant effects. There are essentially two approaches: one using ri2, the other using a loop.
# Calculate the observed ate
ate_estimate <- dat1 %>% summarise(value = mean(Y[Z == 1]) - mean(Y[Z == 0]))
ate_estimate$value
## [1] 2.393333
# Conduct randomization inference to get p value for this ATE estimate:
out2 <- conduct_ri(Y ~ Z, declaration = declaration, assignment = "Z", sharp_hypothesis = 0,
data = dat1)
summary(out2)
## coefficient estimate two_tailed_p_value null_ci_lower null_ci_upper
## 1 Z 2.393333 0.13 -2.963273 3.278193
The interval reported here is not a confidence interval around \(\hat{ATE} = 2.393333\): null_ci_lower is the 2.5th percentile of the sampling distribution of the estimate under the sharp null of no effect, and null_ci_upper is the 97.5th percentile of that distribution.
# The same command as before, but with sharp_hypothesis set to the observed estimate (2.393333)
out3 <- conduct_ri(Y ~ Z, declaration = declaration, assignment = "Z", sharp_hypothesis = ate_estimate$value,
data = dat1)
summary(out3)
## coefficient estimate two_tailed_p_value null_ci_lower null_ci_upper
## 1 Z 2.393333 0.509 -0.4490959 5.479461
This interval \([-0.4490959, 5.479461]\) is the confidence interval we are looking for.
# Create potential outcomes under constant effects. This involves two steps:
# Step 1: Imputing Y0 for all units
# Step 2: Imputing Y1 for all units
dat1 <- dat1 %>%
    mutate(Y0_temp = Y - ate_estimate$value * Z,
           Y1_temp = Y0_temp + ate_estimate$value)
# Note that we impute from the observed Y, even though in this simulated
# data we happen to know Y0 and Y1 for every unit. We do NOT keep the true
# Y0 and impute only Y1 (i.e., Y1_temp = Y0 + ate_estimate$value).
kable(head(dat1))
Y0 | Y1 | X1 | Z | Y | Y0_temp | Y1_temp |
---|---|---|---|---|---|---|
-10.043847 | 16.90916 | 0.8226190 | 1 | 16.909156 | 14.515823 | 16.9091558 |
2.630623 | 31.25135 | 0.7031790 | 1 | 31.251353 | 28.858021 | 31.2513533 |
-1.578342 | 15.17187 | -0.1817087 | 0 | -1.578342 | -1.578342 | 0.8149908 |
17.735696 | 44.25915 | 1.7347557 | 1 | 44.259148 | 41.865815 | 44.2591478 |
2.339425 | 30.07248 | 0.0831873 | 1 | 30.072483 | 27.679151 | 30.0724834 |
6.372602 | 26.57846 | 0.2425147 | 1 | 26.578455 | 24.185122 | 26.5784550 |
# Now compute ATE estimates for all possible assignment vectors:
out4 <- rep(NA, ncol(D))
for (i in 1:ncol(D)) {
    Z_temp <- D[, i]
    # Compute the ATE estimate under this hypothetical assignment vector
    ate_temp <- dat1 %>%
        summarise(estimate = mean(Y1_temp[Z_temp == 1]) - mean(Y0_temp[Z_temp == 0]))
    out4[i] <- ate_temp$estimate
}
ci_95 <- quantile(out4, probs = c(0.025, 0.975))
ci_95
## 2.5% 97.5%
## -0.6805252 5.4891112
Consider the situation in Chapter 4, Q8: an experiment randomly assigns 500 of 1000 names to a treatment; however, it is later discovered that 600 names appear once and 200 names appear twice. The question is, what is the probability of assignment to treatment among those whose name appeared once, and those whose name appeared twice?
# Create the register from which random assignment is taking place:
# names 1-600 appear once, names 601-800 appear twice
names <- c(1:600, rep(601:800, 2))
# Write a loop in which we sample 500 of the 1000 entries in each iteration
trt_1 <- rep(NA, 50000)
trt_601 <- rep(NA, 50000)
for (i in 1:50000) {
    treatment <- sample(names, size = 500, replace = FALSE)
    trt_1[i] <- is.element(1, treatment)
    trt_601[i] <- is.element(601, treatment)
}
mean(trt_1)
## [1] 0.50044
mean(trt_601)
## [1] 0.74784
Think of this in terms of a coin-flip, where \(Heads = Treatment = 1\), and \(Tails = Control = 0\). For those whose name appeared once, treatment status is determined by one coin flip:
\(Pr_1[Treatment] = \frac{500}{1000} = \frac{1}{2}\)
However, if a name appeared twice, treatment status is determined by two coin flips:
\(Pr_{601}[Treatment] = Pr[1,1] + Pr[0,1] + Pr[1,0] = (\frac{1}{2}\times\frac{1}{2}) + (\frac{1}{2}\times\frac{1}{2}) + (\frac{1}{2}\times\frac{1}{2}) = \frac{3}{4}\)
Another way to state this:
\(Pr_{601}[Treatment] = 1 - Pr_{601}[Not Treated] = 1 - Pr[0,0] = 1 - \frac{1}{4} = \frac{3}{4}\)
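As a quick sanity check (not part of the original answer), the exact probability under the actual protocol of drawing 500 of the 1000 register entries without replacement is only slightly different from the coin-flip value: a twice-listed name is untreated only if neither of its two entries is drawn.
# Exact probability that a twice-listed name is treated:
# 1 - Pr(neither of its two entries is among the 500 drawn)
1 - (500/1000) * (499/999)  # roughly 0.7503, close to the simulated 0.74784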
In this section, I compare the weights used in regression, matching estimators, and blocked ATE estimates. To set things up, consider a covariate \(X\), and \(\delta_X\) the difference in the mean outcome of the treatment and control group, at each value of \(X_i\):
\(\delta_X = E[Y_i | D_i = 1, X_i] - E[Y_i | D_i = 0, X_i]\)
A blocked ATE estimate (like Equation 3.10) weights \(\delta_X\) (or the block-level ATE estimate) by \(P(X_i = x) = \frac{N_x}{N}\) (where \(N_x\) is the number of observations in block \(x\), and \(N\) is the total number of observations). So the blocked ATE estimate is:
\(\delta_{Block} = \sum_x \delta_x P(X_i =x) = \sum_x \delta_x \frac{N_x}{N}\)
A regression of type \(Y_i \sim D_i + X_i\) places the following weight on each \(\delta_X\)
\(\frac{P(D_i=1 | X_i = x)(1 - P(D_i = 1| X_i=x)) P(X_i = x)}{\sum_x P(D_i=1 | X_i = x)(1 - P(D_i = 1| X_i=x)) P(X_i = x)}\)
To demystify this, let the probability of being assigned to the treatment condition in a block be \(p_x = P(D_i = 1 | X_i = x)\). Then the conditional probability of being assigned to control is \(1-p_x\). Think of this in terms of a Bernoulli trial: \(p_x\) is the probability of success, and the variance of a Bernoulli distribution is \(p_x(1-p_x)\). Now re-write the weight as:
\(\frac{p_x(1 - p_x) \frac{N_x}{N}}{\sum_x p_x(1 - p_x) \frac{N_x}{N}}\)
Consequently, the regression estimate is:
\(\delta_R = \frac{\sum_x \delta_x p_x(1 - p_x) \frac{N_x}{N}}{\sum_x p_x(1 - p_x) \frac{N_x}{N}}\)
This is why we say regressions produce a treatment-variance weighted average of \(\delta_X\) (Angrist and Pischke 2009:75). When is \(p_x(1-p_x)\) largest? When exactly half the units are assigned to treatment, and the other half to control. Then \(p_x = 0.5\) and \(p_x(1-p_x) = 0.5^2\). Put another way, regressions place maximum weight on those block-level ATE estimates where units are equally split between treatment and control conditions.
So what about matching estimators? Note that matching estimators estimate the ATT (treatment effect on the treated). They therefore weight by the conditional probability of \(X_i\): \(P(X_i = x| D_i = 1)\). This is the proportion of all treated units that are also in block \(x\). Specifically then:
\(\hat{ATT} = E[Y_{1,i} - Y_{0,i} | D_i = 1] = \sum_x \delta_x P(X_i = x | D_i = 1)\)
Note that using Bayes rule we can re-write the weight as:
\(P(X_i = x | D_i = 1) = \frac{P(D_i = 1 | X_i = x) P(X_i =x)}{P(D_i=1)}\)
And accordingly re-write the matching estimator as:
\(\delta_M = \frac{\sum_x \delta_x P(D_i = 1 | X_i = x) P(X_i =x)}{P(D_i=1)}\)
Angrist and Pischke say these “weights are proportional to the probability of treatment at each value of the covariate” (76). In other words, \(P(D_i=1 | X_i = x)P(X_i=x) = P(D_i=1 \cap X_i = x)\), which is the joint probability of observing \(X_i = x\) and treatment.
Bottom line: In each case, we estimate the difference in group means within strata and then weight those estimates. OLS with block fixed effects and the blocked ATE estimator both shoot for the ATE but use different weighting schemes: the former weights each block by its size times the conditional variance of treatment assignment, the latter by block size alone. In contrast, a matching estimator shoots for the ATT and weights by the conditional probability of treatment assignment within each block.
Let's use a toy dataset to compute \(\delta_x\) and the weights used by each estimator, and then compare the regression, blocked ATE, and matching estimates:
Unit | Y | Di | X |
---|---|---|---|
1 | 6 | 1 | 0 |
2 | 4 | 0 | 0 |
3 | 5 | 0 | 1 |
4 | 8 | 1 | 1 |
5 | 10 | 1 | 1 |
Answer: Here are the relevant quantities:
\(\delta_{x=0} = 6 - 4 = 2\) and \(\delta_{x=1} = (\frac{10+8}{2} - 5 ) = 9 - 5 =4\)
\(\frac{N_{x=0}}{N} = \frac{2}{5}\) and \(\frac{N_{x=1}}{N} = \frac{3}{5}\)
Next, lets compute various treatment assignment probabilities:
\(P(D_i=1) = \frac{3}{5}\)
\(P(D_i=1 | X_i=0) = p_{x=0} = \frac{1}{2}\)
\(P(D_i=1 | X_i=1) = p_{x=1} = \frac{2}{3}\)
And now putting these together to get each estimate:
Blocked ATE = \((2 \times \frac{2}{5}) + (4 \times \frac{3}{5}) = \frac{4}{5} + \frac{12}{5} = \frac{16}{5} = 3.2\)
OLS with Block FE = \(\frac{(2 \cdot \frac{2}{5}\cdot\frac{1}{2}\cdot\frac{1}{2}) + (4 \cdot \frac{3}{5}\cdot\frac{2}{3}\cdot\frac{1}{3})}{(\frac{2}{5}\cdot\frac{1}{2}\cdot\frac{1}{2}) + (\frac{3}{5}\cdot\frac{2}{3}\cdot\frac{1}{3})} = \frac{11/15}{7/30} = \frac{22}{7} \approx 3.143\)
Matching estimator = \(\frac{(2\cdot \frac{1}{2} \cdot \frac{2}{5}) + (4 \cdot \frac{2}{3} \cdot \frac{3}{5})}{\frac{3}{5}} = \frac{\frac{10}{5}}{\frac{3}{5}} = \frac{10}{3} = 3.333\)
Note that the raw average of the block-level ATEs (\(\delta_x\)'s) is \(\frac{2+4}{2} = 3\), and each procedure gives us an estimate between 2 and 4. OLS weights up block 1 (\(X_i = 0\)) because it has an equal number of units in treatment and control (hence the highest conditional variance in treatment assignment); the matching estimator weights up block 2 (\(X_i = 1\)) because a larger proportion of treated units is in block 2 than in block 1 (and \(p_{x=1} > p_{x=0}\)). The sketch below verifies these numbers in code.
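Here is a quick check that reproduces the three estimates from the toy data in code (the object names `toy` and `blocks` are just illustrative); it uses dplyr and estimatr, loaded above.
# Reconstruct the toy data
toy <- data.frame(Y = c(6, 4, 5, 8, 10), D = c(1, 0, 0, 1, 1), X = c(0, 0, 1, 1, 1))
# Block-level ATE estimates and the ingredients of each weighting scheme
blocks <- toy %>%
    group_by(X) %>%
    summarise(delta_x = mean(Y[D == 1]) - mean(Y[D == 0]),
              N_x = n(), n_treated = sum(D), p_x = mean(D))
# Blocked ATE: weight each delta_x by its block share N_x / N
with(blocks, sum(delta_x * N_x / sum(N_x)))                                      # 3.2
# Regression: weight by p_x * (1 - p_x) * N_x, then normalize
with(blocks, sum(delta_x * p_x * (1 - p_x) * N_x) / sum(p_x * (1 - p_x) * N_x))  # 22/7 = 3.1428...
# The same number from OLS with a block fixed effect
coef(lm_robust(Y ~ D + X, data = toy))[["D"]]                                    # 3.1428...
# Matching (ATT): weight each delta_x by the share of treated units in the block
with(blocks, sum(delta_x * n_treated / sum(n_treated)))                          # 10/3 = 3.3333...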
library(foreign)
dat3 <- read.dta("W4_Q9.dta")
# Computing block-level ATE estimates, and two types of weights. I am
# essentially doing exactly the same calculation as above, but in code:
block_dat3 <- dat3 %>%
    group_by(strata) %>%
    summarise(ATE_block = mean(vote02[treat2 == 1]) - mean(vote02[treat2 == 0]),
              N_j = n(),
              p_trt = mean(treat2 == 1)) %>%
    mutate(w_block = N_j / sum(N_j),
           w_ols = w_block * p_trt * (1 - p_trt) / sum(w_block * p_trt * (1 - p_trt))) %>%
    select(strata, ATE_block, w_block, w_ols)
kable(block_dat3)
strata | ATE_block | w_block | w_ols |
---|---|---|---|
1 | 0.0096403 | 0.0494868 | 0.2192159 |
2 | -0.0078288 | 0.1520981 | 0.2475176 |
3 | -0.0136212 | 0.6266159 | 0.2767679 |
4 | 0.0082712 |