Notes from Week IX

Within-Subject Studies

Within-subject experiments track a single person or entity over a period of time, and random assignment determines when a treatment is administered (Gerber and Green 2012:273). Such designs typically make two non-interference assumptions: no anticipation (\(D_{i,t+1}\) does not affect \(Y_{i,t}\)) and no persistence (\(D_{i,t-1}\) does not affect \(Y_{i,t}\)). In this section, I use the “tetris” dataset to evaluate these assumptions (this corresponds to Question 10 (b) and (c) in the previous week’s problem set):

library(tidyverse)
library(randomizr)
library(estimatr)
library(knitr)

# Load the dataset
data <- foreign::read.dta("W8_Tetris.dta")

# Create a lagged treatment indicator (to test persistence) and a lead
# treatment indicator (to test anticipation); the dataset has 26 observations
data$run_lag <- c(NA, data$run[1:25])
data$run_anticp <- c(data$run[2:26], NA)

# Estimate the effects:

effects <- data.frame(
    Bivariate = c(lm_robust(tetris ~ run, data = data)$coefficients[2],
        "(Coefficient from tetris ~ run)"),
    Persistence = c(summary(lm_robust(tetris ~ run + run_lag, data = data))$fstatistic[1],
        "(F Statistic from tetris ~ run + run_lag)"),
    Anticipation = c(lm_robust(tetris ~ run_anticp, data = data)$coefficients[2],
        "(Coefficient from tetris ~ run_anticp)"))

kable(effects, row.names = F, caption = "Estimates", digits = 2)
Estimates

Bivariate                          Persistence                                  Anticipation
13613.1                            4.5445923571162                              645.621212121212
(Coefficient from tetris ~ run)    (F Statistic from tetris ~ run + run_lag)    (Coefficient from tetris ~ run_anticp)
# Conduct randomization inference:

## Note that every time we generate a new assignment vector ('run'), the
## lagged and future values also change. So we need to write a loop for
## randomization inference.

## Step 1: Declare design, define output vectors
set.seed(343)

declaration <- declare_ra(N = 26, prob = 0.5, simple = TRUE)
perms <- obtain_permutation_matrix(declaration)
dim(perms)
## [1]    26 10000
bivariate.out <- rep(NA, 10000)
persistence.out <- rep(NA, 10000)
anticipation.out <- rep(NA, 10000)

# Step 2: Run a loop, estimating the quantities for each assignment vector
# from 'perms'

for (i in 1:10000) {
    # Define variables
    data$Z <- perms[, i]
    data$Z_lag <- c(NA, data$Z[1:25])
    data$Z_anticp <- c(data$Z[2:26], NA)
    
    # Store output
    bivariate.out[i] <- lm_robust(tetris ~ Z, data = data)$coefficients[2]
    persistence.out[i] <- summary(lm_robust(tetris ~ Z + Z_lag, data = data))$fstatistic[1]
    anticipation.out[i] <- lm_robust(tetris ~ Z_anticp, data = data)$coefficients[2]
}

# For RI p value in the regression tetris ~ run
mean(abs(bivariate.out) >= 13613.1)
## [1] 0.0109
# For RI p value on the F statistic in tetris ~ run + run_lag
mean(abs(persistence.out) >= 4.545)
## [1] 0.019
# For RI p value in the regression tetris ~ run_anticp
mean(abs(anticipation.out) >= 645.621)
## [1] 0.8994

Heterogeneous Treatment Effects

This section focuses on the variability in treatment effects, \(Var[\tau_i]\). I will begin with a discussion of how to detect heterogeneous treatment effects: we can either place bounds on \(Var[\tau_i]\), or test the hypothesis \(Var[Y_i(1)] = Var[Y_i(0)]\), which holds under the constant effects assumption. After this, I will show some regression-based strategies to model heterogeneity: first when treatment interacts with covariates, and then when treatments interact with each other.

Detecting Heterogeneity

Bounds on \(Var[\tau_i]\)

Intuition: We can never point-estimate \(Var[\tau_i]\) because \(Cov[Y_i(1),Y_i(0)]\) is never known: we never observe both potential outcomes for the same unit. However, we can place bounds on this quantity by estimating the minimum and maximum possible covariance between treated and untreated potential outcomes.

To see this, note that:

\(Var[\tau_i] = Var[Y_i(1) - Y_i(0)] = Var[Y_i(1)] + Var[Y_i(0)] - 2\cdot Cov[Y_i(1),Y_i(0)]\)

We can use the observations in the treatment group to get \(\widehat{Var[Y_i(1)]}\); and similarly use the control group units to get \(\widehat{Var[Y_i(0)]}\). This leaves the covariance term:

  • \(\widehat{Cov[Y_i(1),Y_i(0)]}\) is largest if we pair the smallest \(Y_i(1)\) values with the smallest \(Y_i(0)\) values, and the largest \(Y_i(1)\) values with the largest \(Y_i(0)\) values. Assuming an equal number of observations in the treatment and control groups, this means estimating the covariance between the two sets of observed outcomes after sorting both in ascending order. Call this \(\widehat{Cov_{max}[Y_i(1),Y_i(0)]}\).
Figure 1: Largest Covariance Estimate

  • \(\widehat{Cov[Y_i(1),Y_i(0)]}\) is smallest if we pair the smallest \(Y_i(1)\) values with the largest \(Y_i(0)\) values, and the largest \(Y_i(1)\) values with the smallest \(Y_i(0)\) values. This can be done by estimating the covariance when the control group outcomes are sorted in ascending order and the treatment group outcomes in descending order. Call this \(\widehat{Cov_{min}[Y_i(1),Y_i(0)]}\).
Figure 2: Smallest Covariance Estimate

Then we have two bounds:

Upper bound: \(\widehat{Var[Y_i(1)]} + \widehat{Var[Y_i(0)]} - 2\cdot \widehat{Cov_{min}[Y_i(1),Y_i(0)]}\)

Lower bound: \(\widehat{Var[Y_i(1)]} + \widehat{Var[Y_i(0)]} - 2\cdot \widehat{Cov_{max}[Y_i(1),Y_i(0)]}\)

Here is the code to do this in R:

library(fabricatr)
library(lmtest)
library(ri2)

# Setup: Create some data using 'fabricatr'
set.seed(46)
data <- fabricate(N = 1000, X1 = rnorm(N), X2 = rnorm(N),
    Y0 = rnorm(N, mean = 0, sd = 4) + X1 + X2, Y1 = Y0 + 5,
    Z = complete_ra(N), Y = Y0 * (1 - Z) + Y1 * Z)

kable(head(data), caption = "Example Dataset")
Example Dataset
ID X1 X2 Y0 Y1 Z Y
0001 -0.8989165 1.1591073 -0.5525463 4.447454 0 -0.5525463
0002 0.2121311 0.5222966 -3.7698224 1.230178 1 1.2301776
0003 -0.7285901 1.0230716 0.1487505 5.148751 0 0.1487505
0004 1.2355204 -0.0198303 3.1718613 8.171861 1 8.1718613
0005 1.1688400 0.5450542 2.5888604 7.588860 1 7.5888604
0006 -0.6224851 -1.3212777 -3.1347933 1.865207 0 -3.1347933
# Step 0: True Variance in Potential Outcomes (Var[Y1] = Var[Y0] given this
# data generating process)
true.var <- data %>% summarise(VarY0 = var(Y0), VarY1 = var(Y1))
kable(true.var)
VarY0 VarY1
17.89246 17.89246
# Step 1: Estimate Var[Y0] and Var[Y1]

var.estimates <- data %>% group_by(Z) %>% summarise(Variance = var(Y))
kable(var.estimates)
Z Variance
0 17.78453
1 17.94717
var.dif <- var.estimates$Variance[2] - var.estimates$Variance[1]
var.dif
## [1] 0.1626368
# Step 2: Estimate the minimum and maximum values of Cov[Y0,Y1]

control.group <- data %>% filter(Z == 0) %>% arrange(Y)

treat.group.hightolow <- data %>% filter(Z == 1) %>% arrange(desc(Y))
treat.group.lowtohigh <- data %>% filter(Z == 1) %>% arrange(Y)

cov.max <- cov(control.group$Y, treat.group.lowtohigh$Y)
cov.max
## [1] 17.79711
cov.min <- cov(control.group$Y, treat.group.hightolow$Y)
cov.min
## [1] -17.82733
# Step 3: Estimate the upper and lower bound for Var[tau]

var.lower <- var.estimates$Variance[2] + var.estimates$Variance[1] - 2 * cov.max
var.lower
## [1] 0.1374816
var.upper <- var.estimates$Variance[2] + var.estimates$Variance[1] - 2 * cov.min
var.upper
## [1] 71.38636

So we end up with an interval estimate: \(Var[\tau_i] \in [0.1374816, 71.38636]\). Note that \(Var[\tau_i] \geq 0\) by construction, so if the estimated lower bound is negative it is truncated at 0.
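For completeness, here is that truncation as a one-liner (a minimal follow-up using var.lower and var.upper from the chunk above; var.lower.reported is just an illustrative name):

# A variance cannot be negative, so report a lower bound of 0 if the estimate is negative
var.lower.reported <- max(0, var.lower)
c(lower = var.lower.reported, upper = var.upper)
# here var.lower is already positive (0.1374816), so nothing changes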

Difference in Variances

A second way to test for heterogeneity is to evaluate the null hypothesis \(Var[\tau_i] = 0\) against the alternative \(H_a: Var[\tau_i] \neq 0\). As I will show below, rejecting the hypothesis \(Var[Y_i(1)] = Var[Y_i(0)]\) also means rejecting the hypothesis \(Var[\tau_i] = 0\). To see this, note that:

\(Var[Y_i(1)] = Var[Y_i(0) + \tau_i] = Var[Y_i(0)] + Var[\tau_i] + 2\cdot Cov[Y_i(0),\tau_i]\)

Under the constant effects assumption, \(Var[\tau_i] = 0\) and:

\(2 \cdot Cov[Y_i(0), \tau_i] = 0\) because \(\tau_i = \tau \forall i\) (i.e. \(\tau_i\) is a constant, and the covariance between a random variable and a constant is zero).

This means under the null hypothesis of constant treatment effects:

\(Var[Y_i(1)] = Var[Y_i(0)]\) and \(Var[Y_i(1)] - Var[Y_i(0)] = 0\)

The strategy then is straightforward:

Step 1: Estimate the difference in variances or \(\widehat{Var[Y_i(1)]} - \widehat{Var[Y_i(0)]}\).

Step 2: Assuming constant effects, construct a potential outcomes table. Under this null hypothesis: \(Var[\tau_i] = 0\) and \(Var[Y_i(1)] = Var[Y_i(0)]\)

Step 3: Conduct randomization inference to get the \(p\) value for the observed difference in variances, i.e. the probability of obtaining a value at least as extreme as the one observed if the null hypothesis is correct.

If \(p \geq \alpha\) (a pre-specified significance level), we cannot reject the null hypothesis that \(Var[\tau_i] = 0\) and \(Var[Y_i(1)] = Var[Y_i(0)]\). In other words, we cannot rule out constant treatment effects (i.e. a lack of heterogeneity).

Here is R code to perform this operation:

# Step 1: Estimate the difference in variances: Var[Y1] - Var[Y0]
var.dif
## [1] 0.1626368
# Step 2: Create a schedule of potential outcomes under the constant
# treatment effects assumption

ate_hat <- lm_robust(Y ~ Z, data = data)$coefficients[2]
ate_hat
##        Z 
## 5.421689
ri.data <- data %>% select(Y, Z) %>% mutate(Y0_hyp = Y - ate_hat * Z, Y1_hyp = Y0_hyp + 
    ate_hat)
kable(head(ri.data))
Y Z Y0_hyp Y1_hyp
-0.5525463 0 -0.5525463 4.869143
1.2301776 1 -4.1915118 1.230178
0.1487505 0 0.1487505 5.570440
8.1718613 1 2.7501719 8.171861
7.5888604 1 2.1671710 7.588860
-3.1347933 0 -3.1347933 2.286896
# Step 3A: Specify the random assignment

declaration <- declare_ra(N = 1000, m = 500)

# Step 3B: Write a function that estimates difference in variance

dif_in_var <- function(data) {
    output <- var(data$Y1_hyp[data$Z == 1]) - var(data$Y0_hyp[data$Z == 0])
    return(output)
}

dif_in_var(ri.data)
## [1] 0.1626368
# Step 3C: Conduct randomization inference using 'ri2'

ri.out1 <- conduct_ri(test_function = dif_in_var, assignment = "Z", declaration = declaration, 
    data = ri.data, sims = 10000)
summary(ri.out1)
##             coefficient  estimate two_tailed_p_value null_ci_lower
## 1 Custom Test Statistic 0.1626368             0.9221      -3.16951
##   null_ci_upper
## 1      3.214688
plot(ri.out1)

As the hypothesis test reveals, \(p = 0.9221\): the observed test statistic (\(\widehat{Var[Y_i(1)]} - \widehat{Var[Y_i(0)]}\)) is highly consistent with the null hypothesis (\(Var[\tau_i] = 0\), equivalently \(Var[Y_i(1)] = Var[Y_i(0)]\)). We fail to reject the null of constant treatment effects. This is in line with our expectations, given the data generating process specified for this example (a constant treatment effect of 5).

Interaction Effects

In this part, I discuss regression-based strategies to model treatment effect heterogeneity. I begin with treatment-by-covariate interactions, then move to treatment-by-treatment interactions.

Treatment-Covariate Interaction

Consider the dataset from the previous section (\(n=1000\), \(m=500\)). Say we are interested in knowing if treatment effects vary by covariate profile (i.e. different levels of \(X_1\) and \(X_2\)). How might we model and estimate that heterogeneity?

Intuition: Interaction terms in an OLS regression capture heterogeneity in treatment effects. To see this, consider two models:

Restricted Model: \(Y_i = \alpha_1 + \beta_{1} Z_i + \gamma_{11} X_{1,i} + \gamma_{12} X_{2,i} + u_i\)

Unrestricted Model: \(Y_i = \alpha_2 + \beta_{2} Z_i + \gamma_{21} X_{1,i} + \gamma_{22} X_{2,i} + \gamma_{23} (Z_i \cdot X_{1,i}) + \gamma_{24} (Z_i \cdot X_{2,i}) + \epsilon_i\)

We want to know whether \(\gamma_{23} = \gamma_{24} = 0\); in words, whether there is any heterogeneity in treatment effects across levels of \(X_1\) and \(X_2\). One way to do this is to estimate the F-statistic for the joint significance of the interaction terms, and then use randomization inference to see how likely we are to observe an F-statistic at least that large if there were no heterogeneity in treatment effects (i.e. under the null hypothesis of constant effects). Here are the steps to do that:

Step 1: Estimate the F-statistic

\(F = \frac{\frac{SSR_{R} - SSR_{UR}}{k - q}}{\frac{SSR_{UR}}{N-k}}\)

where \(k\) is the number of parameters in the unrestricted model (6), and \(q\) the number of parameters in the restricted model (4). Put another way, \(k-q\) captures the number of additional coefficients whose statistical significance is being evaluated (two interaction terms in this case).
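To connect this formula to the code that follows, here is a minimal sketch that computes the F statistic directly from the residual sums of squares of two nested lm fits (the helper name f_from_ssr is my own; lmtest::waldtest(..., test = "F"), used in Step 1 below, should return the same quantity for nested OLS models):

# F statistic for nested OLS models, computed from residual sums of squares
f_from_ssr <- function(restricted, unrestricted) {
    ssr_r <- sum(residuals(restricted)^2)     # SSR of the restricted model
    ssr_ur <- sum(residuals(unrestricted)^2)  # SSR of the unrestricted model
    df_num <- length(coef(unrestricted)) - length(coef(restricted))  # k - q
    df_den <- df.residual(unrestricted)                              # N - k
    ((ssr_r - ssr_ur) / df_num) / (ssr_ur / df_den)
}
# e.g. f_from_ssr(model1, model2) should match waldtest(model1, model2, test = "F")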

Step 2: Generate a full schedule of potential outcomes using the estimated treatment effect \(\hat{\beta}_1\) from the restricted (“base”) model. This schedule is generated under the constant effects assumption.

Step 3: Conduct randomization inference to get the distribution of F statistics under the null hypothesis of constant treatment effects. We can then estimate a \(p\) value that gives the probability of observing our F-statistic (or more extreme), under the null hypothesis of constant treatment effects.

Here is the R code to perform this operation:

# Step 0: Estimate models

model1 <- lm(Y ~ Z + X1 + X2, data = data)
summary(model1)
## 
## Call:
## lm(formula = Y ~ Z + X1 + X2, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6128  -2.5867  -0.0092   2.6175  12.3267 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.1351     0.1818  -0.743    0.458    
## Z             5.3212     0.2571  20.693  < 2e-16 ***
## X1            0.8460     0.1253   6.751 2.49e-11 ***
## X2            0.8226     0.1325   6.207 7.92e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.061 on 996 degrees of freedom
## Multiple R-squared:  0.3475, Adjusted R-squared:  0.3456 
## F-statistic: 176.8 on 3 and 996 DF,  p-value: < 2.2e-16
model2 <- lm(Y ~ Z + X1 + X2 + Z * X1 + Z * X2, data = data)

summary(model2)
## 
## Call:
## lm(formula = Y ~ Z + X1 + X2 + Z * X1 + Z * X2, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.3455  -2.6071   0.0317   2.5405  12.4775 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.1349     0.1819  -0.742 0.458459    
## Z             5.3075     0.2574  20.623  < 2e-16 ***
## X1            0.7518     0.1740   4.321 1.71e-05 ***
## X2            0.6438     0.1872   3.439 0.000609 ***
## Z:X1          0.1908     0.2508   0.761 0.446848    
## Z:X2          0.3553     0.2650   1.341 0.180348    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.061 on 994 degrees of freedom
## Multiple R-squared:  0.3491, Adjusted R-squared:  0.3458 
## F-statistic: 106.6 on 5 and 994 DF,  p-value: < 2.2e-16
# Step 1: Perform an F-test (joint significance of interaction terms Z*X1
# and Z*X2)

test.out <- lmtest::waldtest(model1, model2, test = "F")
test.out
## Wald test
## 
## Model 1: Y ~ Z + X1 + X2
## Model 2: Y ~ Z + X1 + X2 + Z * X1 + Z * X2
##   Res.Df Df      F Pr(>F)
## 1    996                 
## 2    994  2 1.1993 0.3018
# Step 2: Randomization Inference to get p value

## Declare the randomization procedure
declaration <- declare_ra(N = 1000, m = 500)

## Create potential outcomes table under the constant treatment effects
## assumption

ate_hat <- model1$coefficients["Z"]
ri.data <- data %>% select(Y, Z, X1, X2) %>% mutate(Y0_hyp = Y - ate_hat * Z, 
    Y1_hyp = Y0_hyp + ate_hat)

## Write a function for 'ri2' that computes F statistic

ftest <- function(data) {
    data$Y_sim <- data$Y0_hyp
    data$Y_sim[data$Z == 1] <- data$Y1_hyp[data$Z == 1]
    model1 <- lm(Y_sim ~ Z + X1 + X2, data = data)
    model2 <- lm(Y_sim ~ Z + X1 + X2 + Z * X1 + Z * X2, data = data)
    test.out <- lmtest::waldtest(model1, model2, test = "F")
    test.out$F[2]
}

# Step 3: Conduct randomization inference using 'ri2'

ri.out2 <- conduct_ri(test_function = ftest, assignment = "Z", declaration = declaration, 
    data = ri.data, sims = 10000)

summary(ri.out2)
##             coefficient estimate two_tailed_p_value null_ci_lower
## 1 Custom Test Statistic 1.199303             0.3184    0.02875406
##   null_ci_upper
## 1      3.871089
plot(ri.out2)

Treatment-Treatment Interaction

Finally, consider a factorial design in which there are two treatments: \(Z_1 \in \{0,1\}\) and \(Z_2 \in \{0,1\}\). We are interested in how \(Z_1\)’s effect varies at different levels of \(Z_2\) and vice versa. To do this, consider the following models:

Restricted Model: \(Y_i = \alpha_1 + \beta_{11} Z_1 + \beta_{12} Z_2 + u_i\)

Unrestricted Model: \(Y_i = \alpha_2 + \beta_{21} Z_1 + \beta_{22} Z_2 + \beta_{23} (Z_1 \cdot Z_2) + \epsilon_i\)

We are interested in the interaction term \(\beta_{23}\) because it captures the relevant heterogeneity. Visually speaking:

Figure 3: Factorial Design

\(\beta_{23} = (d - b) - (c - a)\)

Alternatively: \(\beta_{23} = (d - c) - (b - a)\)
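Note that the two expressions are algebraically identical, so it does not matter which factor we difference over first. A quick toy check, with a, b, c and d standing in for hypothetical cell means as labeled in Figure 3:

# Hypothetical cell means for the four conditions of the 2 x 2 design
a <- 10; b <- 12; c <- 15; d <- 20
(d - b) - (c - a)  # 3
(d - c) - (b - a)  # 3: the same interaction contrast either way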

To do randomization inference on \(\beta_{23}\), we follow these steps:

  • Step 1: Estimate the restricted and unrestricted model.

  • Step 2: Estimate the F statistic, call this f.obs.

  • Step 3: Use the restricted (“base”) model to construct a full schedule of potential outcomes under the constant and additive effects assumption. Note that each observation has four potential outcomes, \(Y_i(Z_1,Z_2)\): \(Y_{00}\), \(Y_{01}\), \(Y_{10}\), \(Y_{11}\). So we must use \(\widehat{\beta_{11}}\) and \(\widehat{\beta_{12}}\) to generate potential outcomes.

  • Step 4: Declare the factorial design, and obtain the permutation matrix for randomization inference. (Think of this as a four-arm trial in which units can be assigned with equal probability to “00”,“01”,“10”,“11”).

  • Step 5: For each iteration of the loop, let Y_sim be the revealed potential outcome of unit \(i\), which depends on \(i\)’s simulated assignment status. Use Y_sim to estimate the two models and the F-statistic. Repeat this a large number of times to obtain the distribution of F-statistics under the null hypothesis of constant effects. Call this vector f.stats.

  • Step 6: Obtain a \(p\) value by calculating mean(abs(f.stats) >= f.obs). This is the probability of observing f.obs or more extreme under the null hypothesis of constant effects (i.e. no heterogeneity across units or treatment conditions). If \(p \leq \alpha\), we can reject the null hypothesis. Put another way, the interaction term is statistically significant.

Practice: Let’s attempt this by looking at the Rind and Bordia (1996) dataset from Question 6 in this week’s problem set.

data <- foreign::read.dta("W9_RindBordia.dta")

# Step 0: Create treatment indicators
data <- data %>% mutate(Sex = as.numeric(female == "Female"), Face = as.numeric(happyface == 
    "Happy Face"))
head(data)
##   female happyface   tip xhappy tipround Sex Face
## 1 Female      None 44.78      0       45   1    0
## 2 Female      None 27.55      0       28   1    0
## 3 Female      None 18.12      0       18   1    0
## 4   Male      None  0.00      0        0   0    0
## 5 Female      None 30.08      0       30   1    0
## 6   Male      None 16.28      0       16   0    0
# Step 1: Estimate the restricted and unrestricted models

model1 <- lm(tip ~ Sex + Face, data = data)

model2 <- lm(tip ~ Sex + Face + Sex * Face, data = data)

# Step 2: Estimate the F-statistic (significance of the interaction term)

test.out <- lmtest::waldtest(model1, model2, test = "F")
test.out  # F= 3.9944
## Wald test
## 
## Model 1: tip ~ Sex + Face
## Model 2: tip ~ Sex + Face + Sex * Face
##   Res.Df Df      F  Pr(>F)  
## 1     86                    
## 2     85  1 3.9944 0.04885 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Step 3: Construct schedule of potential outcomes under the constant
# effects assumption

ate_hat_sex <- model1$coefficients["Sex"]

ate_hat_face <- model1$coefficients["Face"]

# Under constant, additive effects each potential outcome is Y00 plus the full
# estimated effect(s) of the treatment(s) received, regardless of the unit's
# observed assignment
ri.data <- data %>% mutate(Y00 = tip - ate_hat_face * Face - ate_hat_sex * Sex,
    Y10 = Y00 + ate_hat_face, Y01 = Y00 + ate_hat_sex,
    Y11 = Y00 + ate_hat_face + ate_hat_sex)

# Step 4: Declare randomization procedure, obtain permutation matrix

declaration = declare_ra(N = nrow(data), prob_each = c(0.25, 0.25, 0.25, 0.25), 
    condition_names = c("00", "10", "01", "11"))

perm <- obtain_permutation_matrix(declaration)
dim(perm)
## [1]    89 10000

Task: In pairs, write a loop for randomization inference and calculate the two-sided p value for f.obs (observed F statistic). Here is what I get:

# Step 5: Randomization inference using a loop

f.out <- rep(NA, 10000)

for (i in 1:ncol(perm)) {
    # Realized assignment status
    ri.data <- within(ri.data, {
        Z_sim <- perm[, i]
        Face_sim <- as.numeric(Z_sim == "10" | Z_sim == "11")
        Sex_sim <- as.numeric(Z_sim == "01" | Z_sim == "11")
    })
    
    # Realized outcome values
    ri.data <- within(ri.data, {
        Y_sim <- Y00
        Y_sim[Face_sim == 0 & Sex_sim == 1] <- Y01[Face_sim == 0 & Sex_sim == 
            1]
        Y_sim[Face_sim == 1 & Sex_sim == 0] <- Y10[Face_sim == 1 & Sex_sim == 
            0]
        Y_sim[Face_sim == 1 & Sex_sim == 1] <- Y11[Face_sim == 1 & Sex_sim == 
            1]
    })
    
    # Estimate models
    m1 <- lm(Y_sim ~ Face_sim + Sex_sim, data = ri.data)
    m2 <- lm(Y_sim ~ Face_sim + Sex_sim + Face_sim * Sex_sim, data = ri.data)
    
    # Report F statistic
    f.out[i] <- lmtest::waldtest(m1, m2, test = "F")$F[2]
}

# Step 6: P Value and Graphic

mean(abs(f.out) >= test.out$F[2])
## [1] 0.0474
fig1 <- ggplot2::ggplot(data.frame(f.out), aes(x = f.out)) + geom_histogram() + 
    geom_vline(xintercept = test.out$F[2]) + theme_bw() + xlab("F Statistic Distribution Under the Null")
fig1