Notes from Week IV

As you write a pre-analysis plan, here is a useful tool for power calculations.

Properties of an Estimating Procedure

The main takeaway from question 3 is that the properties of an estimation procedure depend on the estimator and decisions taken by the researcher.

Estimation Procedure = \(f(Estimator, Researcher)\)

Illustration I: Pick the higher estimate decision-rule

Consider the sampling distributions of two estimators: the difference-in-means, and regression with covariate adjustment. I will use simulated data to demonstrate that both estimators are unbiased, and that the latter is more precise. However, if a researcher applies a decision rule (“pick the higher of the two estimates”), the resulting procedure is biased. Much of this mirrors question 3.

Step 1: Create some data with \(Y_i(0)\), \(Y_i(1)\), and a covariate \(X_1\). By construction the true ATE or \(E[\tau_i] \approx 5\).

library(dplyr)
library(knitr)
library(randomizr)
library(ri2)
library(estimatr)


set.seed(100)
dat1 <- data_frame(Y0 = rnorm(1000, mean = 0, sd = 20), Y1 = Y0 + rnorm(1000, 
    mean = 5, sd = 20), X1 = as.numeric(Y0 >= 0) + runif(1000, min = -1, max = 1), 
    Z = complete_ra(N = 1000, m = 500), Y = Y1 * Z + Y0 * (1 - Z))

true_ate <- dat1 %>% summarise(value = mean(Y1) - mean(Y0))
true_ate$value
## [1] 5.082935

Step 2: Next, let's get the sampling distribution of each estimator

# Specify design, get a large number of possible assignment vectors
declaration <- declare_ra(N = 1000, m = 500)
D <- obtain_permutation_matrix(declaration, maximum_permutations = 50000)

# Specify outputs

DIM <- rep(NA, ncol(D))
DIM_covariate <- rep(NA, ncol(D))
DIM_higher <- rep(NA, ncol(D))

# Run a loop to get sampling distribution
for (i in 1:ncol(D)) {
    Z <- D[, i]
    Y <- Z * dat1$Y1 + (1 - Z) * dat1$Y0
    m1 <- lm_robust(Y ~ Z)
    m2 <- lm_robust(Y ~ Z + dat1$X1)
    DIM[i] <- summary(m1)$coefficient[2]
    DIM_covariate[i] <- summary(m2)$coefficient[2]
    DIM_higher[i] <- ifelse(summary(m1)$coefficient[2] >= summary(m2)$coefficient[2], 
        summary(m1)$coefficient[2], summary(m2)$coefficient[2])
}

out1 <- data.frame(Method = c("DIM", "Covariate Adjustment", "Pick Higher"), 
    Mean = c(mean(DIM), mean(DIM_covariate), mean(DIM_higher)), SE = c(sd(DIM), 
        sd(DIM_covariate), sd(DIM_higher)))
kable(out1)
Method Mean SE
DIM 5.085978 1.447851
Covariate Adjustment 5.083390 1.286763
Pick Higher 5.350184 1.343805

Note that the true ATE \(= 5.082935\). As the table shows, the difference-in-means and covariate-adjusted regression procedures are unbiased (their means are very close to the true ATE), and covariate adjustment has the smaller standard error. The third procedure, which picks the higher of the two estimates, is biased.

Illustration II: Restricted Randomization

Another example of this is restricted randomization. Here the assignment procedure involves the decision-rule: “discard an assignment vector if it fails the balance test”. As we have seen, this assignment procedure does not have the same properties as random assignment. In fact, such a procedure yields differential probabilities of assignment to conditions.

To know those probabilities for each unit, we draw a large number of assignment vectors following the restricted randomization protocol and compute individual probabilities of being assigned to treatment and control.
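
Here is a minimal sketch of that computation. The balance rule used below (discard an assignment if a t-test on \(X_1\) rejects at the 5% level) is only a stand-in for whatever rule the design actually specifies; the sketch re-uses dat1 from above.

# Sketch: draw many candidate assignments, keep only those that pass the
# balance rule, and average the kept assignment vectors unit by unit.

sims <- 10000
Z_sum <- rep(0, nrow(dat1))  # running count of times each unit is treated
n_kept <- 0                  # number of assignments that pass the balance test

for (i in 1:sims) {
    Z_try <- complete_ra(N = nrow(dat1), m = 500)
    balance_p <- t.test(dat1$X1[Z_try == 1], dat1$X1[Z_try == 0])$p.value
    if (balance_p > 0.05) {  # example rule: keep 'balanced' assignments only
        Z_sum <- Z_sum + Z_try
        n_kept <- n_kept + 1
    }
}

prob_treat <- Z_sum/n_kept  # per-unit probability of treatment under restriction
summary(prob_treat)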

Confidence Intervals Under Constant Effects

There is some confusion about how to construct confidence intervals under the assumption of constant effects. There are essentially two approaches: one uses ri2, the other a loop.

  • Note that, by default, conduct_ri does not report the confidence interval around the ATE estimate when doing randomization inference.
# Calculate the observed ate

ate_estimate <- dat1 %>% summarise(value = mean(Y[Z == 1]) - mean(Y[Z == 0]))
ate_estimate$value
## [1] 2.393333
# Conduct randomization inference to get p value for this ATE estimate:

out2 <- conduct_ri(Y ~ Z, declaration = declaration, assignment = "Z", sharp_hypothesis = 0, 
    data = dat1)

summary(out2)
##   coefficient estimate two_tailed_p_value null_ci_lower null_ci_upper
## 1           Z 2.393333               0.13     -2.963273      3.278193

The interval estimate reported here is not the confidence interval around \(\hat{ATE} = 2.393333\): null_ci_lower is the 2.5th percentile of the sampling distribution of the estimate under the sharp null, and null_ci_upper is the 97.5th percentile of that distribution.
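
To see where those two numbers come from, here is a sketch that reproduces them by hand. Under the sharp null of no effect for any unit, the observed \(Y\) is every unit's outcome regardless of assignment, so we can re-estimate the difference-in-means across the assignment vectors in D and take the same percentiles (the values will differ slightly because the permutations differ).

# Sampling distribution of the difference-in-means under the sharp null of 0
null_dist <- rep(NA, ncol(D))
for (i in 1:ncol(D)) {
    Z_temp <- D[, i]
    null_dist[i] <- mean(dat1$Y[Z_temp == 1]) - mean(dat1$Y[Z_temp == 0])
}
quantile(null_dist, probs = c(0.025, 0.975))  # compare with null_ci_lower / null_ci_upper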

  • To obtain confidence intervals using ri2, specify: \(sharp\_hypothesis = \hat{ATE}\)
# This is exactly the same command but with sharp_hypothesis = 2.39333

out3 <- conduct_ri(Y ~ Z, declaration = declaration, assignment = "Z", sharp_hypothesis = ate_estimate$value, 
    data = dat1)

summary(out3)
##   coefficient estimate two_tailed_p_value null_ci_lower null_ci_upper
## 1           Z 2.393333              0.509    -0.4490959      5.479461

This interval \([-0.4490959, 5.479461]\) is the confidence interval we are looking for.

  • Another way to get the same interval is to use a loop:
# Create potential outcomes under constant effects. This involves two steps:

# Step 1: Imputing Y0 for all units

# Step 2: Imputing Y1 for all units

dat1 <- dat1 %>% mutate(Y0_temp = (Y - ate_estimate$value * Z), Y1_temp = Y0_temp + 
    ate_estimate$value)

# Note that we construct these from the observed Y, just as we would in a
# real experiment, even though this simulated dataset also contains the true
# Y0 and Y1. We do NOT keep the true Y0 and only impute Y1 (i.e. Y1_temp =
# Y0 + ate_estimate$value).

kable(head(dat1))
Y0 Y1 X1 Z Y Y0_temp Y1_temp
-10.043847 16.90916 0.8226190 1 16.909156 14.515823 16.9091558
2.630623 31.25135 0.7031790 1 31.251353 28.858021 31.2513533
-1.578342 15.17187 -0.1817087 0 -1.578342 -1.578342 0.8149908
17.735696 44.25915 1.7347557 1 44.259148 41.865815 44.2591478
2.339425 30.07248 0.0831873 1 30.072483 27.679151 30.0724834
6.372602 26.57846 0.2425147 1 26.578455 24.185122 26.5784550
# Now compute ATE estimates for all possible assignment vectors:

out4 <- rep(NA, ncol(D))

for (i in 1:ncol(D)) {
    Z_temp <- D[, i]
    # Compute ATE
    ate_temp <- dat1 %>% summarise(estimate = mean(Y1_temp[Z_temp == 1]) - mean(Y0_temp[Z_temp == 
        0]))
    out4[i] <- ate_temp$estimate
}

ci_95 <- quantile(out4, probs = c(0.025, 0.975))
ci_95
##       2.5%      97.5% 
## -0.6805252  5.4891112

Condition Probabilities

Consider the situation in Chapter 4, Q8: an experiment randomly assigns 500 of 1000 names to a treatment; however, it is later discovered that 600 names appear once and 200 names appear twice. The question is, what is the probability of assignment to treatment among those whose name appeared once, and those whose name appeared twice?

Simulation approach

# Create the register from which random assignment is taking place
names <- c(1:600, rep(601:800, 2))

# Write a loop in which we sample 500 of 1000 names in each iteration

trt_1 <- rep(NA, 50000)
trt_601 <- rep(NA, 50000)

for (i in 1:50000) {
    treatment = sample(names, size = 500, replace = FALSE)
    trt_1[i] <- is.element(1, treatment)
    trt_601[i] <- is.element(601, treatment)
}

mean(trt_1)
## [1] 0.50044
mean(trt_601)
## [1] 0.74784

Analytical solution

Think of this in terms of a coin-flip, where \(Heads = Treatment = 1\), and \(Tails = Control = 0\). For those whose name appeared once, treatment status is determined by one coin flip:

\(Pr_1[Treatment] = \frac{500}{1000} = \frac{1}{2}\)

However, if a name appeared twice, treatment status is determined by two coin flips:

\(Pr_{601}[Treatment] = Pr[1,1] + Pr[0,1] + Pr[1,0] = (\frac{1}{2}\times\frac{1}{2}) + (\frac{1}{2}\times\frac{1}{2}) + (\frac{1}{2}\times\frac{1}{2}) = \frac{3}{4}\)

Another way to state this:

\(Pr_{601}[Treatment] = 1 - Pr_{601}[Not Treated] = 1 - Pr[0,0] = 1 - \frac{1}{4} = \frac{3}{4}\)
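
As a quick sanity check, the coin-flip arithmetic lines up with the simulation above (the two flips are treated as independent here, whereas the actual draws are without replacement, so the match is approximate):

c(analytical = 1 - 0.5^2, simulated = mean(trt_601))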

Weighting \(\delta_X\)

In this section, I compare the weights used in regression, matching estimators, and blocked ATE estimates. To set things up, consider a covariate \(X\), and \(\delta_X\) the difference in the mean outcome of the treatment and control group, at each value of \(X_i\):

\(\delta_X = E[Y_i | D_i = 1, X_i] - E[Y_i | D_i = 0, X_i]\)

A blocked ATE estimate (like Equation 3.10) weights \(\delta_X\) (or the block-level ATE estimate) by \(P(X_i = x) = \frac{N_x}{N}\) (where \(N_x\) is the number of observations in block \(x\), and \(N\) is the total number of observations). So the blocked ATE estimate is:

\(\delta_{Block} = \sum_x \delta_x P(X_i =x) = \sum_x \delta_x \frac{N_x}{N}\)

A regression of type \(Y_i \sim D_i + X_i\) places the following weight on each \(\delta_X\)

\(\frac{P(D_i=1 | X_i = x)(1 - P(D_i = 1| X_i=x)) P(X_i = x)}{\sum_x P(D_i=1 | X_i = x)(1 - P(D_i = 1| X_i=x)) P(X_i = x)}\)

To demystify this, let the probability of being assigned to the treatment condition in a block be \(p_x = P(D_i = 1 | X_i = x)\). Then the conditional probability of being assigned to control is \(1-p_x\). Think of this in terms of a Bernoulli trial: \(p_x\) is the probability of success, and the variance of a Bernoulli distribution is \(p_x(1-p_x)\). Now rewrite the weight:

\(\frac{p_x(1 - p_x) \frac{N_x}{N}}{\sum_x p_x(1 - p_x) \frac{N_x}{N}}\)

Consequently, the regression estimate is:

\(\delta_R = \frac{\sum_x \delta_x p_x(1 - p_x) \frac{N_x}{N}}{\sum_x p_x(1 - p_x) \frac{N_x}{N}}\)

This is why we say regressions produce a treatment-variance weighted average of \(\delta_X\) (Angrist and Pischke 2009:75). When is \(p_x(1-p_x)\) largest? When exactly half the units in a block are assigned to treatment and the other half to control: then \(p_x = 0.5\) and \(p_x(1-p_x) = 0.25\), its maximum. Put another way, regression places the most weight on block-level ATE estimates from blocks where units are equally split between treatment and control.

So what about matching estimators? Note that matching estimators target the ATT (the average treatment effect on the treated). They therefore weight by the conditional probability of \(X_i\) given treatment, \(P(X_i = x| D_i = 1)\): the proportion of all treated units that fall in block \(x\). Specifically:

\(\hat{ATT} = E[Y_{1,i} - Y_{0,i} | D_i = 1] = \sum_x \delta_x P(X_i = x | D_i = 1)\)

Note that using Bayes rule we can re-write the weight as:

\(P(X_i = x | D_i = 1) = \frac{P(D_i = 1 | X_i = x) P(X_i =x)}{P(D_i=1)}\)

And accordingly re-write the matching estimator as:

\(\delta_M = \frac{\sum_x \delta_x P(D_i = 1 | X_i = x) P(X_i =x)}{P(D_i=1)}\)

Angrist and Pischke say these “weights are proportional to the probability of treatment at each value of the covariate” (76). In other words, \(P(D_i=1 | X_i = x)P(X_i=x) = P(D_i=1 \cap X_i = x)\), which is the joint probability of observing \(X_i = x\) and treatment.

Bottom line: In each case, we estimate the difference in group means within strata and then weight those estimates. OLS with block fixed effects and the blocked ATE estimator both target the ATE but use different weighting schemes: the former weights proportionally to the conditional variance of treatment assignment, the latter by block size. In contrast, a matching estimator targets the ATT and uses weights proportional to the probability of treatment within each block (times the block's share of observations).

Task

Let's use a toy dataset to compute \(\delta_X\) and the weights used by each estimator, then compare the regression, blocked ATE, and matching estimates:

Unit Y Di X
1 6 1 0
2 4 0 0
3 5 0 1
4 8 1 1
5 10 1 1

Answer: Here are the relevant quantities:

\(\delta_{x=0} = 6 - 4 = 2\) and \(\delta_{x=1} = (\frac{10+8}{2} - 5 ) = 9 - 5 =4\)

\(\frac{N_{x=0}}{N} = \frac{2}{5}\) and \(\frac{N_{x=1}}{N} = \frac{3}{5}\)

Next, lets compute various treatment assignment probabilities:

\(P(D_i=1) = \frac{3}{5}\)

\(P(D_i=1 | X_i=0) = p_{x=0} = \frac{1}{2}\)

\(P(D_i=1 | X_i=1) = p_{x=1} = \frac{2}{3}\)

And now putting these together to get each estimate:

Blocked ATE = \((2 \times \frac{2}{5}) + (4 \times \frac{3}{5}) = \frac{4}{5} + \frac{12}{5} = \frac{16}{5} = 3.2\)

OLS with Block FE = \(\frac{(2 \cdot \frac{2}{5}\cdot\frac{1}{2}\cdot\frac{1}{2}) + (4 \cdot \frac{3}{5}\cdot\frac{2}{3}\cdot\frac{1}{3})}{(\frac{2}{5}\cdot\frac{1}{2}\cdot\frac{1}{2}) + (\frac{3}{5}\cdot\frac{2}{3}\cdot\frac{1}{3})} = \frac{11/15}{7/30} = \frac{22}{7} \approx 3.1429\)

Matching estimator = \(\frac{(2\cdot \frac{1}{2} \cdot \frac{2}{5}) + (4 \cdot \frac{2}{3} \cdot \frac{3}{5})}{\frac{3}{5}} = \frac{\frac{10}{5}}{\frac{3}{5}} = \frac{10}{3} = 3.333\)

Note that the raw average of the block-level ATEs (the \(\delta_x\)’s) is \(\frac{2+4}{2} = 3\), and each procedure gives an estimate between 2 and 4. OLS weights up block 1 (\(X_i = 0\)) because it has an equal number of units in treatment and control (hence the highest conditional variance in treatment assignment); the matching estimator weights up block 2 (\(X_i = 1\)) because a larger share of treated units is in block 2 than in block 1 (and \(p_{x=1} > p_{x=0}\)).
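
For completeness, here is a short sketch that replicates these hand calculations in R (the toy data are entered directly, with D standing in for \(D_i\)); the coefficient on D from the block fixed-effects regression should reproduce the \(\frac{22}{7}\) figure.

# Toy data from the table above
toy <- data.frame(Y = c(6, 4, 5, 8, 10), D = c(1, 0, 0, 1, 1), X = c(0, 0, 1, 1, 1))

# Block-level quantities: delta_x, block size, and P(D = 1 | X = x)
toy_blocks <- toy %>% group_by(X) %>% summarise(delta_x = mean(Y[D == 1]) - 
    mean(Y[D == 0]), N_x = n(), p_x = mean(D))

# Blocked ATE: weight delta_x by block share N_x / N
with(toy_blocks, sum(delta_x * N_x)/sum(N_x))  # 3.2

# OLS weights: proportional to p_x (1 - p_x) N_x
with(toy_blocks, sum(delta_x * p_x * (1 - p_x) * N_x)/sum(p_x * (1 - p_x) * N_x))  # 22/7

# The same number straight from the regression with block fixed effects
lm_robust(Y ~ D + as.factor(X), data = toy)$coefficients["D"]

# Matching (ATT) weights: proportional to p_x N_x, i.e. P(X = x | D = 1)
with(toy_blocks, sum(delta_x * p_x * N_x)/sum(p_x * N_x))  # 10/3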

Re-thinking Q9 (Chapter 4)

library(foreign)
dat3 <- read.dta("W4_Q9.dta")

# Computing block-level ATE estimates, and two types of weights. I am
# essentially doing exactly the same calculation as above, but in code:

block_dat3 <- dat3 %>% group_by(strata) %>% summarise(ATE_block = mean(vote02[treat2 == 
    1]) - mean(vote02[treat2 == 0]), N_j = n(), p_trt = mean(treat2 == 1)) %>% 
    mutate(w_block = N_j/sum(N_j), w_ols = w_block * p_trt * (1 - p_trt)/sum(w_block * 
        p_trt * (1 - p_trt))) %>% select(strata, ATE_block, w_block, w_ols)

kable(block_dat3)
strata ATE_block w_block w_ols
1 0.0096403 0.0494868 0.2192159
2 -0.0078288 0.1520981 0.2475176
3 -0.0136212 0.6266159 0.2767679
4 0.0082712 0.1717991 0.2564986
# Computing the two estimates: blocked ATE and OLS-weighted

out5 <- block_dat3 %>% summarise(blocked_estimate = sum(ATE_block * w_block), 
    ols_estimate = sum(ATE_block * w_ols))
kable(out5)
blocked_estimate ols_estimate
-0.007828 -0.0014728
# Now, getting the OLS-weighted estimates using OLS!

model1 <- lm_robust(vote02 ~ treat2 + as.factor(strata), data = dat3)
summary(model1)
## 
## Call:
## lm_robust(formula = vote02 ~ treat2 + as.factor(strata), data = dat3)
## 
## Standard error type =  HC2 
## 
## Coefficients:
##                     Estimate Std. Error   Pr(>|t|)  CI Lower  CI Upper
## (Intercept)         0.485022   0.002360  0.000e+00  0.480396  0.489647
## treat2             -0.001473   0.003032  6.272e-01 -0.007416  0.004471
## as.factor(strata)2  0.080367   0.002679 1.108e-197  0.075117  0.085617
## as.factor(strata)3 -0.048928   0.002440  1.941e-89 -0.053709 -0.044146
## as.factor(strata)4 -0.002194   0.002648  4.072e-01 -0.007384  0.002995
##                        DF
## (Intercept)        940710
## treat2             940710
## as.factor(strata)2 940710
## as.factor(strata)3 940710
## as.factor(strata)4 940710
## 
## Multiple R-squared:  0.008548 ,  Adjusted R-squared:  0.008544 
## F-statistic:  2028 on 4 and 940710 DF,  p-value: < 2.2e-16
# And then, blocked ATE estimate using OLS (we need to do inverse
# probability weighting):

declaration = declare_ra(N = nrow(dat3), blocks = dat3$strata, block_m = c(6935, 
    7007, 7548, 7229))
declaration
## Random assignment procedure: Block random assignment 
## Number of units: 940715 
## Number of blocks: 4 
## Number of treatment arms: 2 
## The possible treatment categories are 0 and 1.
## The probabilities of assignment are NOT constant across units. Your analysis strategy must account for differential probabilities of assignment, typically by employing inverse probability weights.
cond_prob <- obtain_condition_probabilities(declaration, dat3$treat2)
weights = 1/cond_prob

model2 <- lm_robust(vote02 ~ treat2 + as.factor(strata), weights = weights, 
    data = dat3)
summary(model2)
## 
## Call:
## lm_robust(formula = vote02 ~ treat2 + as.factor(strata), data = dat3, 
##     weights = weights)
## 
## Weighted, Standard error type =  HC2 
## 
## Coefficients:
##                     Estimate Std. Error  Pr(>|t|) CI Lower  CI Upper
## (Intercept)         0.492100   0.003685 0.000e+00  0.48488  0.499322
## treat2             -0.007828   0.003859 4.249e-02 -0.01539 -0.000265
## as.factor(strata)2  0.073599   0.004454 2.583e-61  0.06487  0.082329
## as.factor(strata)3 -0.058747   0.004333 7.311e-42 -0.06724 -0.050254
## as.factor(strata)4 -0.001659   0.004432 7.081e-01 -0.01035  0.007027
##                        DF
## (Intercept)        940710
## treat2             940710
## as.factor(strata)2 940710
## as.factor(strata)3 940710
## as.factor(strata)4 940710
## 
## Multiple R-squared:  0.009379 ,  Adjusted R-squared:  0.009374 
## F-statistic:  2227 on 4 and 940710 DF,  p-value: < 2.2e-16

One-Sided Non-Compliance

Subject Types and Potential Outcomes

  • What does the notation \(d_i(1) = 0\) mean?

It means that subject \(i\), if assigned to the treatment group (\(z_i = 1\)), remains untreated (\(d_i = 0\)).

  • Write down the notational shorthand for different types of subjects in an experiment
Type d(z=1) d(z=0)
Complier 1 0
Always Taker 1 1
Never Taker 0 0
Defier 0 1

Source: Gerber and Green (2012)

  • What assumption do we make in one-sided non-compliance? Which types does this eliminate?

We assume that the control group will not have access to treatment, thus ruling out Always Takers and Defiers.

  • How would you read the following potential outcomes?

\(E[Y_i (z = 1, d = 0) | d_i(1)=0]\)

\(d_i(1) = 0\) is the shorthand for “Never Takers” under one-sided non-compliance. The term gives the expected outcome of Never Takers assigned to the treatment group (who remain untreated).

\(E[Y_i(z=1,d(0))]\)

This is the mean outcome for units assigned to the treatment group (\(z=1\)) if they take up the treatment status they would have had under control, \(d_i(0)\). For instance, for Compliers \(d_i(0) = 0\), so the outcome they express is \(Y_i(z=1,d=0)\) (their “untreated” outcome). For Always Takers \(d_i(0)=1\), so the outcome they express is \(Y_i(z=1,d=1)\) (their “treated” outcome).

  • In this context, excludability assumes \(Y_i(z=1,d=0) = Y_i(z=0,d=0) = Y_i(d=0)\). A small sketch after this list makes the subject types concrete.
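
The sketch below is purely illustrative: it hard-codes a few hypothetical subjects and their treatment take-up \(d_i(z)\) under each assignment, just to show how the notation maps onto data under one-sided non-compliance.

# Hypothetical subjects with their potential treatment status d_i(z)
types <- data.frame(type = c("Complier", "Complier", "Never Taker"), 
    d_z1 = c(1, 1, 0),  # d_i(1): treatment status if assigned to treatment
    d_z0 = c(0, 0, 0))  # d_i(0): always 0 under one-sided non-compliance

# With d_i(0) = 0 for everyone, d_i(1) alone identifies the type:
# d_i(1) = 1 for Compliers, d_i(1) = 0 for Never Takers
table(types$type, types$d_z1)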

Excludability violations

The excludability assumption states \(Y_i(z,d) = Y_i(d)\). Put another way: \(Y_i(z=1,d) = Y_i(z=0,d) = Y_i(d)\). Look at the potential outcomes table below and identify the “types” that violate the excludability assumption (i.e. subjects whose potential outcomes respond to treatment assignment \(z\)):

Type Y(z=1,d=1) Y(z=1,d=0) Y(z=0,d=0) Y(z=0,d=1)
A 10 5 5 10
A 10 5 5 10
B 8 7 4 8
B 7 6 5 7
C 12 8 8 10
C 10 6 6 8

Answer: Type “B” subjects violate excludability because \(Y_i(z=1,d=0)\neq Y_i(z=0,d=0)\). Type “C” subjects violate the assumption as well, because \(Y_i(z=1,d=1)\neq Y_i(z=0,d=1)\).

  • Under one-sided non-compliance, which of these columns is ruled-out by assumption?

The last column (\(Y_i(z=0,d=1)\)) is ruled out, because subjects assigned to the control group cannot receive the treatment.

  • Under one-sided non-compliance then, which type(s) violate the excludability assumption?

Only type “B”.
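
If it helps, here is a small sketch that encodes the potential-outcomes table above and flags the violations mechanically (the column names are ad hoc):

# Each row is a subject; columns are the four potential outcomes from the table
po <- data.frame(type = c("A", "A", "B", "B", "C", "C"), 
    Y_z1_d1 = c(10, 10, 8, 7, 12, 10), Y_z1_d0 = c(5, 5, 7, 6, 8, 6), 
    Y_z0_d0 = c(5, 5, 4, 5, 8, 6), Y_z0_d1 = c(10, 10, 8, 7, 10, 8))

po %>% mutate(violates_at_d0 = Y_z1_d0 != Y_z0_d0,  # the relevant check under one-sided non-compliance
    violates_at_d1 = Y_z1_d1 != Y_z0_d1)            # moot once Y(z=0,d=1) is ruled out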

Appendix I (Creating Data)

In creating “fake” data for power calculations, think carefully about the data-generating process: what is the underlying distribution from which you are drawing observations?

Example 1: Normal distribution

Use this if a random variable has a “most likely” value (the mean) and is equally likely to take values above or below it; for example, roughly 68% of draws fall within \(\pm 1\) SD of the mean:

y <- rnorm(n = 1000, mean = 0, sd = 1)
mean(y)
## [1] -0.05294129
sd(y)
## [1] 0.9731606
hist(y, breaks = 100)

# Adjust the 'sd' parameter depending on how much variation in values you
# expect in the population.

y <- rnorm(n = 1000, mean = 0, sd = 100)
mean(y)
## [1] 3.428463
sd(y)
## [1] 94.89124
hist(y, breaks = 100)

Example 2: Uniform distribution

Use this if a random variable is equally likely to take on values in some interval \([a,b]\):

y <- runif(n = 1000, min = -1, max = 1)
# min sets the lower interval or 'a', and max sets the upper interval or 'b'
mean(y)
## [1] 0.001730345
hist(y, breaks = 100)

Note that the mean of a uniform distribution specified in the interval \([a,b]\) is \(\frac{a+b}{2}\) and the variance is \(\frac{(b-a)^2}{12}\).
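
A quick check of these formulas for \(a = -1\), \(b = 1\) (mean 0, variance \(\frac{4}{12} = \frac{1}{3}\)):

y_check <- runif(n = 1e+05, min = -1, max = 1)
c(mean = mean(y_check), variance = var(y_check))  # should be near 0 and 1/3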

Example 3: Poisson distribution

Typically used for event counts in a fixed time interval (i.e. the random variable takes on non-negative integer values representing the number of events occurring in some time period):

y <- rpois(n = 1000, lambda = 1)
# A poisson distribution has a mean and variance of lambda
hist(y, breaks = 100)

Example 4: Binomial distribution

This gives the distribution of \(k\) successes in \(n\) trials, when the probability of success in each trial is \(p\). A special case arises when \(n=1\): this is the Bernoulli distribution, in which success occurs with probability \(p\). If, however, \(n>1\), \(k\) can take several values. For example, if there are \(n=2\) trials, \(k\) can be 0, 1, or 2. The probability of (say) \(k=1\) success is then given by the expression \({n\choose k}p^k(1-p)^{n-k}\).
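
A quick check of this formula for \(n = 2\), \(p = 0.5\), \(k = 1\) (both lines should return 0.5):

choose(2, 1) * 0.5^1 * (1 - 0.5)^(2 - 1)  # the formula by hand
dbinom(1, size = 2, prob = 0.5)           # the same value from R's binomial pmf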

# Say you have a binary variable, and you want to draw from a Bernoulli
# distribution:

y <- rbinom(n = 1000, size = 1, prob = 0.25)

# Where size is the number of trials, prob is the probability of success in
# each trial

hist(y, breaks = 100)

Note: Look at Goldberger (1991), chapters 2 and 7 for more on univariate distributions.