Note: Every week, I will post section material on https://github.com/shikhar46/logicofexperiments and Canvas’ Files page.

Logistics

Sections will be held on Fridays, 10:30 AM to 11:20 AM and 11:25 AM to 12:30 PM in Watson A68.

Office hours will be held on Tuesdays, 2:30 PM to 4:30 PM in RKZ 204. You can sign up for a slot here: https://calendly.com/shikhar-singh/experiments


Notes from Week I

Field Experiments

Experiments randomly assign subjects to conditions with known probabilities that lie strictly between 0 and 1.

Randomized studies that are conducted in “real-world settings” are often called field experiments (Gerber and Green 2012:10). Specifically,

  1. Treatment resembles the intervention of interest in the world.
  2. Study participants resemble people who would ordinarily encounter these interventions.
  3. Treatment is administered in a real-world setting.
  4. Study outcomes resemble actual outcomes in the world.

The Private Water Connection Study

Question 3 describes an encouragement design with experimental features. The treatment was a simplified procedure to purchase a private water connection at a 0% interest rate. A random subset of the sample received this “incentive”, while the remaining subset did not. This allows us to identify the causal effect of that incentive and, with some assumptions, the causal effect of possessing a private connection. In Chapters 5 and 6, we will discuss non-compliance and the assumptions needed to identify complier average causal effects.


Potential Outcomes Framework

Every subject \(i\) has two potential outcomes, \(Y_i(1)\) (the outcome when subject \(i\) is treated), and \(Y_i(0)\) (the outcome when subject \(i\) is untreated).

The causal effect of treatment (\(\tau_i\)) is defined as

\[\begin{equation} \tau_i \equiv Y_i(1) - Y_i(0) \end{equation}\]

The fundamental problem of causal inference is that we never observe both potential outcomes for any subject \(i\). In the real world, when subjects are randomly assigned to treatment conditions, they reveal one of their potential outcomes. Call this \(Y_i\) or the observed outcome.

According to the switching equation, the revealed or observed outcome is:

\[\begin{equation} Y_i = \underbrace{d_i \cdot Y_i(1)}_{\text{When } i \text{ is in control, this equals 0 and } Y_i=Y_i(0)} + \underbrace{(1 - d_i) \cdot Y_i(0)}_{\text{When } i \text{ is in treatment, this equals 0 and } Y_i=Y_i(1)} \end{equation}\]
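
As a quick illustration, the switching equation can be written as one line of R. This is only a sketch: the three subjects and their potential outcomes below are hypothetical values made up for illustration, not data from any study.

```r
# Hypothetical potential outcomes for three subjects (illustrative values only)
Y1 <- c(5, 7, 9)   # outcome each subject would reveal if treated
Y0 <- c(2, 4, 6)   # outcome each subject would reveal if untreated
d  <- c(1, 0, 1)   # one realized assignment: subjects 1 and 3 are treated

# Switching equation: treated subjects reveal Y1, untreated subjects reveal Y0
Y <- d * Y1 + (1 - d) * Y0
Y  # 5 4 9
```

Subject 2 is in control, so we see 4 (their untreated outcome) and never learn that their treated outcome would have been 7.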

Estimand, Estimator, Estimate

An estimand is a population-level quantity of interest. The average treatment effect (or ATE) is one such estimand.

\[\begin{equation}\text{ATE} = E[\tau_i] = E[Y_i(1) - Y_i(0)] = \frac{1}{N} \sum_{i=1}^N \tau_i = \frac{1}{N} \sum_{i=1}^N \left(Y_i(1) - Y_i(0)\right)\end{equation}\]

An estimator is a procedure or rule for estimating the given quantity based on some observed data. The difference-in-means is an estimator that uses observed data or \(Y_i\) to estimate the average treatment effect.

\[\begin{equation}\text{DIM} = \frac{\sum_{i=1}^m Y_i}{m} - \frac{\sum_{i=m+1}^N Y_i}{N - m} \end{equation}\]

where \(m\) of \(N\) subjects are assigned to the treatment condition, and subjects are indexed so that the first \(m\) are the treated ones.
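
To make the estimator concrete, here is a minimal base R sketch of the difference-in-means as a function. The function name `difference_in_means` and the six observed outcomes are my own hypothetical choices for illustration.

```r
# Difference-in-means: mean observed outcome among treated units
# minus mean observed outcome among control units.
# Y: numeric vector of observed outcomes; d: 0/1 assignment vector of the same length.
difference_in_means <- function(Y, d) {
  stopifnot(length(Y) == length(d), all(d %in% c(0, 1)))
  mean(Y[d == 1]) - mean(Y[d == 0])
}

# Hypothetical observed data for six subjects (illustrative values only)
Y_obs <- c(3, 5, 4, 1, 2, 3)
d_obs <- c(1, 1, 1, 0, 0, 0)

dim_hat <- difference_in_means(Y_obs, d_obs)
dim_hat  # (3 + 5 + 4)/3 - (1 + 2 + 3)/3 = 4 - 2 = 2
```

Indexing with `d == 1` and `d == 0` means the units do not have to be sorted so that the treated ones come first.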

An estimate is one particular result or guess of the quantity of interest, given some data. To understand why the true ATE is conceptually and empirically different from an estimate of the ATE, let's focus on the difference between \(d_i\) and \(D_i\).

Difference between \(d_i\) and \(D_i\)

  • \(d_i\) is the observed treatment assignment of unit \(i\).

  • \(D_i\) is a random variable that indicates whether unit \(i\) would be treated in a hypothetical experiment.

In words: \(d_i\) is a particular realization of \(D_i\).

An example with \(N\)=4, \(m\)=2

Schedule of Potential Outcomes and Treatment Assignment
Unit  Y1  Y0  d_i
1     10  15  1
2      8  13  1
3      6  11  0
4      4   9  0

In this study, we randomly assigned 2 of 4 subjects to treatment. It so happens that in this iteration of that random assignment procedure, units 1 and 2 get assigned to the treatment condition (\(d_i =1\)).

However, there are many ways in which two of four units can be assigned to treatment:

\({4 \choose 2} = \frac{4!}{2!\times 2!} = 6\) ways
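
The count can be double-checked in base R. This sketch uses `choose()` for the number of ways and `combn()` to list which two units are treated in each one; the object names are my own.

```r
# Number of ways to pick 2 treated units out of 4
n_ways <- choose(4, 2)
n_ways  # 6

# Each column lists the two units that are treated under one possible assignment
treated_sets <- combn(4, 2)
treated_sets
```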

Let's use randomizr to get the six different assignment vectors. The R package manual is available here.

Step 1: Installing and loading the package

# If you do not have this package:

# install.packages("randomizr")

library(randomizr)
library(knitr) # provides kable(), which prints the tables below

Step 2: Declaring the design or randomization procedure

declaration <- declare_ra(N = 4, m = 2) #this gives randomizr the necessary design information

declaration
## Random assignment procedure: Complete random assignment 
## Number of units: 4 
## Number of treatment arms: 2 
## The possible treatment categories are 0 and 1.
## The number of possible random assignments is 6.  
## The probabilities of assignment are constant across units: 
## prob_0 prob_1 
##    0.5    0.5

Step 3: Conduct a random assignment, or obtain \(d_i\). You will notice that \(d_i\) is typically different every time we conduct this assignment procedure.

conduct_ra(declaration) # This uses that information to generate an assignment vector. This vector will be different every time we run the command. See below
## [1] 0 1 0 1
D <- conduct_ra(declaration)
D # Different from the first assignment vector
## [1] 0 1 1 0

Step 4: Getting the six possible assignment vectors.

D <- obtain_permutation_matrix(declaration) 
# Generates all possible assignment vectors for N=4, m=2

kable(D, caption = "Assignment Vectors")
Assignment Vectors
0 0 0 1 1 1
0 1 1 0 0 1
1 0 1 0 1 0
1 1 0 1 0 0

For any unit \(i \in \{1,2,3,4\}\), \(d_i = 1\) with probability \(\frac{1}{2}\) and 0 with probability \(\frac{1}{2}\).

We can manually check this in the table. Unit 1 is assigned to treatment in three of six assignment vectors (columns 4, 5, and 6). Unit 4 is also assigned to treatment in three of six assignment vectors (columns 1, 2, and 4).
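
The same check can be done for all four units at once. In this sketch I type the matrix in by hand from the table above (rows are units, columns are assignment vectors) so that it runs without randomizr; the object names are my own.

```r
# Assignment vectors copied from the table above: 4 units (rows) by 6 vectors (columns)
perms <- matrix(
  c(0, 0, 0, 1, 1, 1,
    0, 1, 1, 0, 0, 1,
    1, 0, 1, 0, 1, 0,
    1, 1, 0, 1, 0, 0),
  nrow = 4, byrow = TRUE
)

# Share of assignment vectors in which each unit is treated
prob_treated <- rowMeans(perms)
prob_treated  # 0.5 0.5 0.5 0.5

# Every assignment vector treats exactly m = 2 units
colSums(perms)  # 2 2 2 2 2 2
```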

Now let's see the full science table:

Full Schedule of Potential Outcomes and Treatment Assignments
Unit  Y1  Y0  d_i  d1  d2  d3  d4  d5
1     10  15  1    0   0   0   1   1
2      8  13  1    0   1   1   0   0
3      6  11  0    1   0   1   0   1
4      4   9  0    1   1   0   1   0

What is \(E[Y_i(0) | d_i = 0]\)?

\(\frac{Y_3(0) + Y_4(0)}{2} = \frac{11+9}{2} = 10\)

What is \(E[Y_i(0) | D_i = 0]\)?

It is \(E_d[E(Y_i(0) | d_i=0)] = \sum_d E(Y_i(0) | d_i=0) \cdot p(d)\)

where \(p(d)\) is the probability of obtaining that assignment vector.

In our case, there are 6 possible assignment vectors, so \(p(d)=\frac{1}{6}\), and the overall quantity is:

\(\frac{1}{6}(\frac{Y_3(0) + Y_4(0)}{2}) + \frac{1}{6}(\frac{Y_1(0) + Y_2(0)}{2}) + \frac{1}{6}(\frac{Y_1(0) + Y_3(0)}{2}) + \frac{1}{6}(\frac{Y_1(0) + Y_4(0)}{2}) + \frac{1}{6}(\frac{Y_2(0) + Y_3(0)}{2}) + \frac{1}{6}(\frac{Y_2(0) + Y_4(0)}{2})\)

Which equals:

\(\frac{1}{6}[\frac{11+9}{2} + \frac{15+13}{2} + \frac{15+11}{2} + \frac{15+9}{2} + \frac{13+11}{2} + \frac{13+9}{2}]\)

\(= \frac{1}{6}\left[\frac{144}{2}\right] = 12\)

Note that, in this example, \(E[Y_i(0) | D_i = 0] \neq E[Y_i(0) | d_i = 0]\) (12 versus 10), and that \(E[Y_i(0) | D_i = 0] = E[Y_i(0)]\).
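
The hand calculation above can be reproduced in a few lines of base R. This is a sketch with my own object names: it lists every possible pair of control units with `combn()`, averages their untreated outcomes, and then averages across the six equally likely assignments.

```r
# Untreated potential outcomes from the example
Y0 <- c(15, 13, 11, 9)

# Each column: the two units that end up in control under one possible assignment
control_sets <- combn(4, 2)

# Control-group mean of Y0 under each of the six assignments
control_means <- apply(control_sets, 2, function(idx) mean(Y0[idx]))
control_means  # 14 13 12 12 11 10

# Each assignment has probability 1/6, so the expectation is the simple average
expected_control_mean <- mean(control_means)
expected_control_mean  # 12

# This matches the average of Y0 over all four units
mean(Y0)  # 12
```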


Tidyverse

We will use the Tidyverse suite of packages in R to analyze data from experiments.

Figure 1: Steps in Tidyverse

Step 1: Installing and loading the tidyverse packages

# If you do not have this package:

# install.packages("tidyverse")

library(tidyverse)

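Step 2: Let's use data from the above example to calculate the true ATE. This is only possible here because we (unrealistically) know both potential outcomes for every unit.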

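# data1 holds the schedule of potential outcomes from the example above
data1 <- tibble(
  Unit = 1:4,
  Y1 = c(10, 8, 6, 4), Y0 = c(15, 13, 11, 9),
  d_i = c(1, 1, 0, 0)
)

results_estimand <- data1 %>%
  summarise(
    averageY1 = mean(Y1),
    averageY0 = mean(Y0),
    estimand = averageY1 - averageY0
  )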

kable(results_estimand, 
      caption = "Calculating the true ATE")
Calculating the true ATE
averageY1 averageY0 estimand
7 12 -5

Step 3: In the real world, we do not observe people’s potential outcomes. We conduct a study in which people are randomly assigned to conditions and we observe either their treated or untreated potential outcome. We use the difference-in-means estimator to estimate the ATE.

group_means <- data1 %>%
  mutate(
    Y = Y1*d_i + Y0*(1-d_i)
  ) %>%
  group_by(d_i) %>%
  summarise(Average = mean(Y))

kable(group_means, 
      caption = "Estimating Group Means")
Estimating Group Means
d_i Average
0 10
1 9

We can then compute the difference-in-means estimate:

estimate <- group_means %>%
  summarise(DIM = Average[d_i==1] - Average[d_i==0])

kable(estimate, 
      caption = "Difference in Means")
Difference in Means
DIM
-1
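
The estimate of -1 comes from just one of the six possible assignments, which is why it differs from the true ATE of -5. As a rough sketch (base R, with my own object names), we can recompute the difference in means under every possible assignment and see how the estimates vary:

```r
# Potential outcomes from the example
Y1 <- c(10, 8, 6, 4)
Y0 <- c(15, 13, 11, 9)

# Each column: the two units treated under one possible assignment
treated_sets <- combn(4, 2)

dim_estimates <- apply(treated_sets, 2, function(treated) {
  d <- as.numeric(seq_along(Y1) %in% treated)  # 0/1 assignment vector
  Y <- d * Y1 + (1 - d) * Y0                   # switching equation
  mean(Y[d == 1]) - mean(Y[d == 0])            # difference in means
})

dim_estimates        # -1 -3 -5 -5 -7 -9
mean(dim_estimates)  # -5
```

Any single experiment gives one of these six numbers. In this example, their average across all six assignments equals the true ATE.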

Practice

Download the dataset uploaded along with this script. Conduct complete random assignment using the complete_ra function in randomizr, assigning exactly 350 subjects to the treatment condition and the rest to the control condition. Use the switching equation to create a new column of observed outcomes, \(Y_i\). Calculate the true ATE (\(E[Y_i(1) - Y_i(0)]\)) and the difference-in-means estimate (\(\frac{\sum_{i=1}^m Y_i}{m} - \frac{\sum_{i=m+1}^N Y_i}{N - m}\)). Do you get the same answer?

Note: I am including the solution code but please attempt doing this without looking at that code in the first instance.

dat <- read_csv("w1_practice_data.csv")

# Conduct complete random assignment

dat <- dat %>%
  mutate(
    d = complete_ra(N = nrow(dat), m = 350),
    Y = Y1*d + (1-d)*Y0
  )

# True ATE

dat %>% summarise(ATE = mean(Y1-Y0))

# DIM estimate in this iteration of the experiment

dat %>% summarise(DIM = mean(Y[d==1]) - mean(Y[d==0]))