Cluster Random Assignment

Cluster random assignment implies that all subjects in the same cluster are placed as a group into either the treatment or control condition. (Gerber and Green 2012:80)

Consider a data set that contains potential outcomes (\(Y_1, Y_0\)) and two clustering variables (\(C_1,C_2\)). Here is a glimpse of the data set. We will focus on \(C_1\) and \(C_2\).

set.seed(03042022)

library(tidyverse)
library(knitr)

 dat2 <- tibble(
  Y0 = runif(1000, min= -1, max = 1),
  Y1 = Y0 + rnorm(1000, mean = 5, sd = 4),
  X1 = as.numeric(Y0 >= 0),
  X2 = randomizr::complete_ra(N = 1000, m = 50),
  C1 = case_when(
    Y0 <= -0.75 ~ 1,
    (Y0 > -0.75) & (Y0 <= -0.5) ~ 2,
    (Y0 > -0.5) & (Y0 <= -0.25) ~ 3,
    (Y0 > -0.25) & (Y0 <= 0) ~ 4,
    (Y0 > 0) & (Y0 <= 0.25) ~ 5,
    (Y0 > 0.25) & (Y0 <= 0.5) ~ 6,
    (Y0 > 0.5) & (Y0 <= 0.75) ~ 7,
    (Y0 > 0.75) ~ 8, 
  ),
  #(Y0 <= -1,1,ifelse((Y0 > -1) & (Y0 <= 0),2,ifelse((Y0 > 0) & (Y0 <= 1),3,4))),
  C2 = sample(x = c(1:8), size = 1000, replace = T, prob = rep(1/8,8))
)

kable(head(dat2), caption = "Dataset from Prior Week", row.names = T)

Dataset from Prior Week
	Y0	Y1	X1	X2	C1	C2
1	0.4603614	5.112706	1	0	6	3
2	0.6667425	13.504230	1	0	7	3
3	0.3806324	6.758841	1	1	6	4
4	0.3238680	8.622701	1	0	6	3
5	0.2214657	4.901916	1	0	5	2
6	0.0338077	-1.833710	1	0	5	5

dat2 %>% summarise(`True ATE` = mean(Y1 - Y0)) %>% kable()

True ATE
4.828663

Homogeneous and Heterogeneous Clusters

\(C_1\) produces clusters such that potential outcomes are highly correlated within a cluster. This means there is considerable inter-cluster variation. To see this, I group_by cluster and report the average value of the treated and untreated potential outcomes. For comparison, we look at another case when subjects are randomly assigned to clusters (call this clustering variable \(C_2\)).

library(patchwork)

cluster_outcomes1 <-
  dat2 %>%
  group_by(C1) %>%
  summarise(
    `Average Y0` = mean(Y0),
    `SD(Y0)` = sd(Y0),
    `Average Y1` = mean(Y1),
    `SD(Y1)` = sd(Y1),
    N = n()
  )


cluster_outcomes1 <- pivot_longer(
  cluster_outcomes1,
  cols = c(`Average Y0`, `Average Y1`),
  names_to = "potential_outcome",
  values_to = "cluster_average"
) %>%
  mutate(
    SD = case_when(
      potential_outcome == "Average Y0" ~ `SD(Y0)`,
      potential_outcome == "Average Y1" ~ `SD(Y1)`
    )
  )

fig1 <- ggplot(cluster_outcomes1,
               aes(x = potential_outcome, y = cluster_average)) +
  geom_point() +
  ylab("") +
  xlab("") +
  ylim(-1,7) +
  ggtitle("Homogeneous Clusters (C1)") +
  theme_bw() +
  theme(
    panel.grid = element_blank()
  )
  

cluster_outcomes2 <-
  dat2 %>% 
  group_by(C2) %>% 
  summarise(
    `Average Y0` = mean(Y0),
    `SD(Y0)` = sd(Y0),
    `Average Y1` = mean(Y1),
    `SD(Y1)` = sd(Y1),
    N = n()
  )

cluster_outcomes2 <- pivot_longer(
  cluster_outcomes2,
  cols = c(`Average Y0`, `Average Y1`),
  names_to = "potential_outcome",
  values_to = "cluster_average"
) %>%
  mutate(
    SD = case_when(
      potential_outcome == "Average Y0" ~ `SD(Y0)`,
      potential_outcome == "Average Y1" ~ `SD(Y1)`
    )
  )

fig2 <- ggplot(cluster_outcomes2,
               aes(x = potential_outcome, y = cluster_average)) +
  geom_point() +
  ylab("") +
  xlab("") +
  ylim(-1,7) +
  ggtitle("Heterogeneous Clusters (C2)") +
  theme_bw() +
  theme(
    panel.grid = element_blank()
  )

fig1 + fig2

Standard Errors

Clustering typically increases sampling variability or the standard error of estimates. The extent to which standard errors increase depends on how similar potential outcomes are, on average, across clusters. This is called inter-cluster variation. To see how this affects standard errors, lets focus on Equation 3.22:

\(SE(\widehat{ATE}) = \sqrt{\frac{1}{k-1}\{\frac{m}{N-m}Var[\overline{Y_i(0)}] + \frac{N-m}{m}Var[\overline{Y_i(1)}] + 2Cov[\bar{Y_i(0)},\overline{Y_i(1)}]\}}\)

In equation 3.22, \(Var[\bar{Y_i(0)}]\) is the variance in the cluster-level average untreated potential outcome, and \(Var[\bar{Y_i(1)}]\) is the variance in the cluster-level average treated potential outcome. When clusters are very homogeneous (like \(C_1\)), potential outcomes are very similar within a cluster and very different across clusters. As a result, the variance in cluster averages or \(Var(\cdot)\) is large. In contrast, when clusters are very heterogeneous (like \(C_2\)), potential outcomes are very different within a cluster but very similar across clusters. This means cluster averages are very similar, and the variance in those values or \(Var(\cdot)\) is small.

Example

To illustrate this point, lets conduct simulations in which we compare the sampling distribution when clusters are homogeneous (\(C_1\)) and heterogeneous (\(C_2\)).

Step 1: Lets specify two types of cluster random assignments - cluster random assignment with \(C_1\) (exactly \(m=2\) of \(N=4\) clusters is assigned to treatment), and similarly with \(C_2\) (exactly two of the clusters, \(m=2\), is assigned to treatment).

Note: In randomizr, when you specify a cluster, m should indicate the number of clusters (not the number of subjects) to be assigned to the treatment condition. You will notice that if clusters are of unequal size, then the actual number of treated subjects will vary, depending on which clusters are assigned to the treatment condition.

library(randomizr)

c1_ra_declaration = declare_ra(N = nrow(dat2),
                               clusters = dat2$C1,
                               m = 2)

c2_ra_declaration = declare_ra(N = nrow(dat2),
                               clusters = dat2$C2,
                               m = 2)

Step 2: Using simulations, I recover the sampling distribution of the average treatment effect under the two cluster assignments. Note that we use difference_in_means from the estimatr package and we specify clusters.

library(estimatr)

# When C1 is the clustering variable

D_c1_ra <- obtain_permutation_matrix(c1_ra_declaration) 

ests_c1_ra <- c()

for(i in 1:ncol(D_c1_ra)) {
  dat_temp <- dat2 %>%
    mutate(d_temp = D_c1_ra[, i],
           y_temp = d_temp * Y1 + (1 - d_temp) * Y0)
  estimate = tidy(difference_in_means(y_temp ~ d_temp, clusters = C1, data = dat_temp))
  ests_c1_ra = bind_rows(ests_c1_ra, estimate)
}

# When C2 is the clustering variable

D_c2_ra <- obtain_permutation_matrix(c2_ra_declaration) 

ests_c2_ra <- c()

for(i in 1:ncol(D_c2_ra)) {
  dat_temp <- dat2 %>%
    mutate(d_temp = D_c2_ra[, i],
           y_temp = d_temp * Y1 + (1 - d_temp) * Y0)
  estimate = tidy(difference_in_means(y_temp ~ d_temp, clusters = C2, data = dat_temp))
  ests_c2_ra = bind_rows(ests_c2_ra, estimate)
}

# Make a figure to compare the two sampling distributions


fig1 <- ggplot(data = ests_c1_ra, aes(x = estimate)) +
  geom_histogram(bins = 50) +
  geom_vline(
    xintercept = mean(ests_c1_ra$estimate),
    linetype = "dashed",
    color = "red"
  ) +
  ylab("Frequency") +
  xlim(c(3, 7)) +
  ggtitle("Homogeneous Clusters (C1)") +
  theme_bw()


fig2 <- ggplot(data = ests_c2_ra, aes(x = estimate)) +
  geom_histogram(bins = 50) +
  geom_vline(
    xintercept = mean(ests_c2_ra$estimate),
    linetype = "dashed",
    color = "red"
  ) +
  ylab("Frequency") +
  xlim(c(3, 7)) +
  ggtitle("Heterogeneous Clusters (C2)") +
  theme_bw()


fig1 / fig2

Advanced

Equations 3.4 and 3.22

Claim: As the number of clusters \(k\) tends to the number of subjects \(N\), equation 3.22 tends to equation 3.4

To start, write down the regular standard error (Equation 3.4), and compare it to the clustered standard error (Equation 3.22):

Equation 3.4: \(\sqrt{\frac{1}{N-1}\{\frac{m}{N-m}Var[Y_i(0)] + \frac{N-m}{m}Var[Y_i(1)] + 2Cov[Y_i(0),Y_i(1)]\}}\)

Equation 3.22: \(\sqrt{\frac{1}{k-1}\{\frac{m}{N-m}Var[\overline{Y_i(0)}] + \frac{N-m}{m}Var[\overline{Y_i(1)}] + 2Cov[\bar{Y_i(0)},\overline{Y_i(1)}]\}}\)

Note that the only difference between these equations is that \(\frac{1}{N-1}\) is replaced by \(\frac{1}{k-1}\); and unit-level potential outcomes (\(Y_i(1)\) and \(Y_i(0)\)) are replaced by cluster-level average potential outcomes (\(\overline{Y_i(1)}\) and \(\overline{Y_i(0)}\)). To show equality, we must establish some relationship between these terms.

\(Var[\overline{Y_{cluster}}] \approx \frac{Var[Y]}{\frac{N-1}{k-1}}\)

\(Cov[\overline{Y_{cluster}},\overline{X_{cluster}}] \approx \frac{Cov[Y,X]}{\frac{N-1}{k-1}}\)

To explain why this is the case, consider the limit case in which \(N\) units are randomly assigned to \(k=N\) clusters. In this case, we would expect (for example) variance in unit-level potential outcomes to equal the variance in cluster averages. This is indeed the case:

\(Var[\overline{Y_{cluster}}] \approx \frac{Var[Y]}{\frac{N-1}{k-1}} = \frac{Var[Y]}{\frac{N-1}{\color{red}{N}-1}} = Var[Y]\)

This equality turns into an approximation (\(\approx\)) as the number of clusters becomes smaller. Here is a simulation using some data (\(n=1000\)) to confirm this.

se_equality <- function(cluster_n) {
  # Create some data
  data <- tibble(Y = rnorm(mean = 5, sd = 4, 10000),
                     cluster = conduct_ra(N = 10000, prob_each = rep(1 / cluster_n, cluster_n)))
  # Compute variance of y for the full data
  var_pop <- mean((data$Y - mean(data$Y)) ^ 2)
  # Compute variance in cluster averages
  var_clust <-
    data %>% 
    group_by(cluster) %>% 
    summarise(Y_avg = mean(Y)) %>%
    summarise(outcome = mean((Y_avg - mean(Y_avg))^2))
  # Specify output
  adjustment <- (10000 - 1) / (cluster_n - 1) # (N-1)/(k-1)
  output <-
    c(var_clust, adjustment, var_clust * adjustment, var_pop)
  return(output)
}
    
q10_simulation <- tibble(
  `No. of Clusters` = c(5, 10, 20, 50, 100, 500, 1000),
  `Var(Y Cluster)` = rep(NA, 7),
  `(n-1)/(k-1)` =  rep(NA, 7),
  `Var(Y Cluster) x (n-1)/(k-1)` = rep(NA, 7),
  `Var(Y)` = rep(NA, 7)
)

    
q10_simulation[1, 2:5] <- se_equality(5)
q10_simulation[2, 2:5] <- se_equality(10)
q10_simulation[3, 2:5] <- se_equality(20)
q10_simulation[4, 2:5] <- se_equality(50)
q10_simulation[5, 2:5] <- se_equality(100)
q10_simulation[6, 2:5] <- se_equality(500)
q10_simulation[7, 2:5] <- se_equality(1000)
    
kable(q10_simulation)

No. of Clusters	Var(Y Cluster)	(n-1)/(k-1)	Var(Y Cluster) x (n-1)/(k-1)	Var(Y)
5	0.0050348	2499.75000	12.58577	16.30450
10	0.0155372	1111.00000	17.26187	16.12729
20	0.0191439	526.26316	10.07470	15.75334
50	0.0714861	204.06122	14.58753	16.41436
100	0.1335960	101.00000	13.49319	15.90312
500	0.7820279	20.03808	15.67033	16.23850
1000	1.5602297	10.00901	15.61635	15.64081

Power Calculations

The “statistical power” of a study refers to the probability a researcher will be able to reject the null hypothesis of no treatment effect (Gerber and Green:93). It is \(1 - Pr\)(Committing Type 2 Error), where a Type 2 error is defined as the probability of retaining a false null hypothesis.

In this section, we calculate the statistical power of an experiment (\(n=120\)) to detect an average treatment effect of \(\approx\) 0.5 SD units. Power calculations are typically part of a pre-analysis plan, and involve the following steps:

Write a for loop. In each iteration of the loop:

Generate some data according to an assumed data generating process.
Estimate the average treatment effect, given some observed assignment vector \(Z_i\) and outcome values \(Y_i\).
Estimate the uncertainty (\(se\)) around your DIM estimate and construct a confidence interval.
Set a success criterion (say \(p \leq 0.05\)) and record whether the experiment rejected the null hypothesis (\(success = 1\)), or failed to reject it (\(p > 0.05\) and \(success =0\)).

Example: Here is a data generating process that yields an average treatment effect of 0.5 SD units. Assume we are planning an experiment with \(n=120\). What is the power of such an experiment to detect this effect?

# Basic information

sims <- 1000 # Number of simulations
N <- 120 # Size of the experimental sample
m <- N/2 # Number of units to be assigned to treatment

# Data generating process

dat3 <- tibble(
  Y0 = rnorm(N, mean = 0, sd = 1),
  Y1 = Y0 + rnorm(N, mean = 0.5, sd = 1),
  Z = complete_ra(N = N, m = m),
  Y = Y1 * Z + Y0 * (1 - Z)
)

# To make sure the data generating process fits our description:

out3 <- dat3 %>% 
  summarise(`DIM` = mean(Y[Z==1]) - mean(Y[Z==0]), 
            `SD(Y) in Control` = sd(Y[Z==0]), 
            `DIM/SD(Y)` = `DIM`/`SD(Y) in Control`)
kable(out3,
      digits = 3)

DIM	SD(Y) in Control	DIM/SD(Y)
0.644	1.008	0.638

Here is a for loop that estimates the statistical power of a study (\(n=200\)) to detect a 0.5 SD effect:

output = c()

for(i in 1:sims){
  # Create Data
  temp_data <- tibble(
    Y0 = rnorm(N, mean = 0, sd = 1),
    Y1 = Y0 + rnorm(N, mean = 0.5, sd = 1),
    Z = complete_ra(N = N, m = m),
    Y = Y1 * Z + Y0 * (1 - Z)
  ) 
  # Estimate the ATE
  fit <- difference_in_means(Y ~ Z, data = temp_data) %>% tidy()
  # Save the estimate, std. error, and p value
  output = bind_rows(output, fit)
}

# Define "success" in a hypothetical experiment if p<=0.05

output <- output %>%
  mutate(
    success = case_when(p.value <= 0.05 ~ 1,
                        p.value > 0.05 ~ 0)
  )

# Power of the experiment
mean(output$success)

[1] 0.612

# Distribution of p-values

ggplot(data = output,
       aes(x = p.value)) +
  geom_histogram() +
  geom_vline(xintercept = 0.05, linetype = "dashed", color = "red") +
  ylab("Frequency") +
  xlab("P Value") +
  ggtitle("Distribution of P Values") +
  theme_bw() +
  theme(
    panel.grid = element_blank()
  )

Section Exercise: Using the for loop method described above, can you calculate the statistical power of a study with \(n=250\) subjects and a constant treatment effect of 0.25?

Writing Functions

For this section exercise, I write a function that computes statistical power given some sample size. I then test this function with \(n=250\), and plot the statistical power for different sample sizes by applying this function stat.power to \(n=\{50,75,100,125,150,175,250,300\}\) Here is the plot:

### Define a function ### 

stat.power <- function(N,sims){
# Define output vectors:
output = c()
# Run simulations  
for(i in 1:sims){
  # Create Data
  temp_data <- data_frame(
    Y0 = rnorm(N, mean = 0, sd = 1),
    Y1 = Y0 + 0.25,
    Z = complete_ra(N = N, prob = 0.5),
    Y = Y1 * Z + Y0 * (1 - Z)
  )  
  # Estimate the ATE
  fit <- difference_in_means(Y ~ Z, data = temp_data) %>% tidy()
  # Report outcomes
  output = bind_rows(output, fit)
}
  # Define success
output <- output %>%
  mutate(success = case_when(p.value <= 0.05 ~ 1,
                             p.value > 0.05 ~ 0))
power = mean(output$success)
return(power)
}

# Test the function with n= 250

stat.power(250,500)

[1] 0.478

fig_data <- tibble(
  `Sample Size` = c(50, 100, 150, 200, 250, 300, 350, 400, 450, 500),
  `Statistical Power` = c(
    stat.power(50, 500),
    stat.power(100, 500),
    stat.power(150, 500),
    stat.power(200, 500),
    stat.power(250, 500),
    stat.power(300, 500),
    stat.power(350, 500),
    stat.power(400, 500),
    stat.power(450, 500),
    stat.power(500, 500)
  )
)

ggplot(data = fig_data,
       aes(x = `Sample Size`, y = `Statistical Power`)) +
  geom_vline(xintercept = 250,
             linetype = "dashed",
             color = "blue") +
  geom_point() +
  geom_smooth() +
  scale_x_continuous(breaks = c(50, 100, 150, 200, 250, 300, 350, 400, 450, 500)) +
  theme_bw() +
  theme(
    panel.grid = element_blank()
  )

Section 5: Clustering, Statistical Power

Shikhar Singh

March 4, 2022