Overview of DiD
Problems with Staggered DiD
Simulation Results
Some Alternative Methods
Application
Think Card and Krueger minimum wage study comparing NJ and PA.
2 units and 2 time periods.
1 unit (T) is treated, and receives treatment in the second period. The control unit (C) is never treated.
Building upon Angrist & Pischke (2008, p. 228), we can think of these simple 2x2 DiDs as a fixed effects estimator.
Potential Outcomes
The expected outcome is a linear function of unit and time fixed effects:
$$E[Y^0_{i,t}] = \alpha_i + \alpha_t$$
$$E[Y^1_{i,t}] = \alpha_i + \alpha_t + \delta D_{i,t}$$
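Difference in expectations for the control unit between times t = 1 and t = 0:
$$E[Y^0_{C,1}] = \alpha_1 + \alpha_C$$
$$E[Y^0_{C,0}] = \alpha_0 + \alpha_C$$
$$E[Y^0_{C,1}] - E[Y^0_{C,0}] = \alpha_1 - \alpha_0$$
Now do the same thing for the treated unit:
$$E[Y^1_{T,1}] = \alpha_1 + \alpha_T + \delta$$
$$E[Y^1_{T,0}] = \alpha_0 + \alpha_T$$
$$E[Y^1_{T,1}] - E[Y^1_{T,0}] = \alpha_1 - \alpha_0 + \delta$$
$$\delta = \left(E[Y^1_{T,1}] - E[Y^1_{T,0}]\right) - \left(E[Y^0_{C,1}] - E[Y^0_{C,0}]\right)$$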
The DiD can be estimated through linear regression of the form:
$$y_{it} = \alpha + \beta_1 TREAT_i + \beta_2 POST_t + \delta (TREAT_i \cdot POST_t) + \epsilon_{it} \tag{1}$$
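The coefficients from the regression estimate in (1) recover the same parameters as the double-differencing performed above:
$$\alpha = E[y_{it} \mid i = C, t = 0] = \alpha_0 + \alpha_C$$
$$\beta_1 = E[y_{it} \mid i = T, t = 0] - E[y_{it} \mid i = C, t = 0] = (\alpha_0 + \alpha_T) - (\alpha_0 + \alpha_C) = \alpha_T - \alpha_C$$
$$\beta_2 = E[y_{it} \mid i = C, t = 1] - E[y_{it} \mid i = C, t = 0] = (\alpha_1 + \alpha_C) - (\alpha_0 + \alpha_C) = \alpha_1 - \alpha_0$$
$$\delta = \left(E[y_{it} \mid i = T, t = 1] - E[y_{it} \mid i = T, t = 0]\right) - \left(E[y_{it} \mid i = C, t = 1] - E[y_{it} \mid i = C, t = 0]\right) = \delta$$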
Advantage of regression DiD - it provides both estimates of δ and standard errors for the estimates.
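A minimal sketch of this equivalence, assuming hypothetical fixed effects, a true δ of 0.3, and Gaussian noise (none of these values come from the slides). Because the regression is saturated in the two dummies, its OLS coefficients are exactly the contrasts of the four cell means computed here.

```python
import random
from statistics import mean


def did_2x2(rows):
    """Return (alpha, beta1, beta2, delta) from rows of (treat, post, y)."""
    cell = {
        (g, t): mean(y for treat, post, y in rows if treat == g and post == t)
        for g in (0, 1)
        for t in (0, 1)
    }
    alpha = cell[0, 0]                      # control unit, pre-period
    beta1 = cell[1, 0] - cell[0, 0]         # treated minus control, pre-period
    beta2 = cell[0, 1] - cell[0, 0]         # post minus pre, control unit
    delta = (cell[1, 1] - cell[1, 0]) - (cell[0, 1] - cell[0, 0])
    return alpha, beta1, beta2, delta


# Hypothetical parameters, chosen only for illustration.
rng = random.Random(1)
unit_fe = {0: 1.0, 1: 2.0}   # alpha_C, alpha_T
time_fe = {0: 0.5, 1: 0.8}   # alpha_0, alpha_1
true_delta = 0.3

rows = [
    (g, t, unit_fe[g] + time_fe[t] + true_delta * g * t + rng.gauss(0, 0.1))
    for g in (0, 1)
    for t in (0, 1)
    for _ in range(500)
]
alpha, beta1, beta2, delta = did_2x2(rows)
print(round(alpha, 2), round(beta1, 2), round(beta2, 2), round(delta, 2))
```

The cell-mean version gives no standard errors, which is the practical reason to run the regression instead.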
Angrist & Pischke (2008):
Two-way fixed effects estimator:
$$y_{it} = \alpha_i + \alpha_t + \delta^{DD} D_{it} + \epsilon_{it}$$
$\alpha_i$ and $\alpha_t$ are unit and time fixed effects; $D_{it}$ is the unit-time indicator for treatment.
$TREAT_i$ and $POST_t$ are now subsumed by the fixed effects.
The model can easily be modified to include a covariate matrix $X_{it}$, time trends, dynamic treatment effect estimation, etc.
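One way to see what the TWFE estimator does is to compute it by hand. The sketch below assumes a balanced panel with a single treatment regressor, where removing unit and time fixed effects is equivalent to double-demeaning both the outcome and the treatment indicator. The function name `twfe_delta` and the 2x2 numbers are hypothetical.

```python
def twfe_delta(y, d):
    """Within estimate of delta for a balanced panel.

    y and d are lists of rows: one row per unit, one column per period.
    """
    n, periods = len(y), len(y[0])

    def double_demean(x):
        unit_mean = [sum(row) / periods for row in x]
        time_mean = [sum(x[i][t] for i in range(n)) / n for t in range(periods)]
        grand_mean = sum(unit_mean) / n
        return [
            [x[i][t] - unit_mean[i] - time_mean[t] + grand_mean
             for t in range(periods)]
            for i in range(n)
        ]

    y_tilde, d_tilde = double_demean(y), double_demean(d)
    num = sum(d_tilde[i][t] * y_tilde[i][t]
              for i in range(n) for t in range(periods))
    den = sum(d_tilde[i][t] ** 2 for i in range(n) for t in range(periods))
    return num / den


# Noise-free 2x2 example: control unit first, treated unit second.
y = [[1.5, 1.8],
     [2.5, 3.1]]
d = [[0, 0],
     [0, 1]]
delta_dd = twfe_delta(y, d)
print(delta_dd)
```

In the 2x2 case this returns the same number as the double difference, (3.1 - 2.5) - (1.8 - 1.5).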
There is now a developed literature on the issues with TWFE DiD under "staggered treatment timing" (Abraham and Sun (2018), Borusyak and Jaravel (2018), Callaway and Sant'Anna (2019), Goodman-Bacon (2019), Strezhnev (2018), Athey and Imbens (2018)).
Staggered timing is probably the most common use of DiD today. If done right, it can increase the amount of cross-sectional variation.
Without digging into the literature:
$\delta^{DD}$ with staggered treatment timing is a weighted average of many different treatment effects.
We know little about what it measures when treatment timing varies, how it compares means across groups, or why different specifications change estimates.
The weights are often negative and non-intuitive.
Important Insights
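$\delta^{DD}$ is just the weighted average of the four 2x2 treatment effects (with an early-treated, a late-treated, and a never-treated group: each treated group vs. the untreated group, early vs. late, and late vs. early). The weights are a function of the size of the subsample, the relative size of treatment and control units, and the timing of treatment in the subsample.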
Already-treated units act as controls even though they are treated.
Given the weighting function, panel length alone can change the DiD estimates substantially, even when each 2x2 $\delta^{DD}$ does not change.
Groups treated closer to middle of panel receive higher weights than those treated earlier or later.
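The timing point can be illustrated with the variance of the treatment dummy. In Goodman-Bacon's decomposition, the weight on a treated vs. untreated comparison is proportional to, among other things, $\bar{D}(1 - \bar{D})$, where $\bar{D}$ is the share of panel periods the group spends treated. The sketch below computes only that timing term, ignoring the group-size terms; the treatment years and panel window are illustrative assumptions.

```python
def treatment_variance(treat_year, first_year=1980, last_year=2010):
    """Variance of the treatment dummy for a group treated from treat_year on."""
    n_years = last_year - first_year + 1
    share_treated = (last_year - treat_year + 1) / n_years
    return share_treated * (1 - share_treated)


# Timing term for four hypothetical cohorts in a 1980-2010 panel.
variances = {g: treatment_variance(g) for g in (1986, 1992, 1998, 2004)}
for g, v in variances.items():
    print(g, round(v, 3))

# Lengthening the panel changes the term without changing any 2x2 estimate.
longer_panel = treatment_variance(1992, last_year=2020)
print(round(longer_panel, 3))
```

Cohorts treated near the middle of the window have a treated share near one half, which maximizes this term.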
We can show how easily $\delta^{DD}$ goes awry through a simulation exercise.
Consider two sets of DiD estimates - one where the treatment occurs in one period, and one where the treatment is staggered.
The data generating process is linear: $y_{it} = \alpha_i + \alpha_t + \delta_{it} + \epsilon_{it}$.
We will consider two different treatment assignment setups for $\delta_{it}$.
There are 40 states s, and 1000 units i randomly drawn from the 40 states.
Data covers years 1980 to 2010, and half the states receive "treatment" in 1995.
For every unit incorporated in a treated state, we pull a unit-specific treatment effect from $\mu_i \sim N(0.3, (1/5)^2)$.
Treatment effects here are trend breaks rather than unit shifts: the accumulated treatment effect $\delta_{it}$ is $\mu_i \times (year - 1995 + 1)$ for years after 1995.
We then estimate the average treatment effect as $\hat{\delta}$ from:
$$y_{it} = \hat{\alpha}_i + \hat{\alpha}_t + \hat{\delta} D_{it}$$
Simulate this data 1,000 times and plot the distribution of estimates $\hat{\delta}$ and the true effect (red line).
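A single draw of this simulation can be sketched as follows. Several details are assumptions not stated in the slides: the fixed effects are standard normal, the noise has standard deviation 0.5, treatment starts in 1995 itself, and the "true effect" is taken to be the average of $\delta_{it}$ over treated observations. The estimator is the double-demeaning form of TWFE for a balanced panel.

```python
import random


def twfe_delta(y, d):
    """Within estimate of delta for a balanced panel (rows = units)."""
    n, periods = len(y), len(y[0])

    def double_demean(x):
        unit = [sum(row) / periods for row in x]
        time = [sum(x[i][t] for i in range(n)) / n for t in range(periods)]
        grand = sum(unit) / n
        return [[x[i][t] - unit[i] - time[t] + grand for t in range(periods)]
                for i in range(n)]

    yt, dt = double_demean(y), double_demean(d)
    num = sum(dt[i][t] * yt[i][t] for i in range(n) for t in range(periods))
    den = sum(dt[i][t] ** 2 for i in range(n) for t in range(periods))
    return num / den


def simulate_single_date(seed=0, n_units=1000, n_states=40, treat_year=1995):
    rng = random.Random(seed)
    years = range(1980, 2011)
    treated_states = set(rng.sample(range(n_states), n_states // 2))
    year_fe = {t: rng.gauss(0, 1) for t in years}
    y, d, effects = [], [], []
    for _ in range(n_units):
        in_treated_state = rng.randrange(n_states) in treated_states
        unit_fe = rng.gauss(0, 1)
        mu = rng.gauss(0.3, 0.2)  # unit-specific trend break
        y_row, d_row = [], []
        for t in years:
            treated = in_treated_state and t >= treat_year
            delta_it = mu * (t - treat_year + 1) if treated else 0.0
            if treated:
                effects.append(delta_it)
            y_row.append(unit_fe + year_fe[t] + delta_it + rng.gauss(0, 0.5))
            d_row.append(1.0 if treated else 0.0)
        y.append(y_row)
        d.append(d_row)
    return twfe_delta(y, d), sum(effects) / len(effects)


delta_hat, true_att = simulate_single_date()
print(round(delta_hat, 2), round(true_att, 2))
```

With one treatment date, the estimate and the average effect on treated observations should sit close together; repeating the draw over many seeds would give the distribution described above.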
Run similar analysis with staggered treatment.
The 40 states are randomly assigned to four treatment cohorts of 250 units each, defined by year of treatment assignment (1986, 1992, 1998, and 2004).
The DGP is identical, except that now $\delta_{it}$ is equal to $\mu_i \times (year - \tau_g + 1)$, where $\tau_g$ is the treatment assignment year.
Estimate the treatment effect using TWFE and compare to the analytically derived true δ (red line).
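A single draw of the staggered version, sketched under the same kind of assumptions: standard normal fixed effects, noise standard deviation 0.5, ten states per cohort with units drawn at random from states (so cohorts hold roughly, not exactly, 250 units), and the true effect defined as the average of $\delta_{it}$ over treated observations.

```python
import random


def twfe_delta(y, d):
    """Within estimate of delta for a balanced panel (rows = units)."""
    n, periods = len(y), len(y[0])

    def double_demean(x):
        unit = [sum(row) / periods for row in x]
        time = [sum(x[i][t] for i in range(n)) / n for t in range(periods)]
        grand = sum(unit) / n
        return [[x[i][t] - unit[i] - time[t] + grand for t in range(periods)]
                for i in range(n)]

    yt, dt = double_demean(y), double_demean(d)
    num = sum(dt[i][t] * yt[i][t] for i in range(n) for t in range(periods))
    den = sum(dt[i][t] ** 2 for i in range(n) for t in range(periods))
    return num / den


def simulate_staggered(seed=0, n_units=1000, n_states=40,
                       cohorts=(1986, 1992, 1998, 2004)):
    rng = random.Random(seed)
    years = range(1980, 2011)
    states = list(range(n_states))
    rng.shuffle(states)
    # Deal shuffled states into cohorts so each cohort gets the same number.
    state_cohort = {s: cohorts[i % len(cohorts)] for i, s in enumerate(states)}
    year_fe = {t: rng.gauss(0, 1) for t in years}
    y, d, effects = [], [], []
    for _ in range(n_units):
        tau_g = state_cohort[rng.randrange(n_states)]
        unit_fe = rng.gauss(0, 1)
        mu = rng.gauss(0.3, 0.2)
        y_row, d_row = [], []
        for t in years:
            treated = t >= tau_g
            delta_it = mu * (t - tau_g + 1) if treated else 0.0
            if treated:
                effects.append(delta_it)
            y_row.append(unit_fe + year_fe[t] + delta_it + rng.gauss(0, 0.5))
            d_row.append(1.0 if treated else 0.0)
        y.append(y_row)
        d.append(d_row)
    return twfe_delta(y, d), sum(effects) / len(effects)


delta_hat, true_att = simulate_staggered()
print(round(delta_hat, 2), round(true_att, 2))
```

In this setup every unit is eventually treated, so later cohorts are compared against earlier cohorts whose effects are still growing, and the TWFE estimate lands far below the average effect on treated observations.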
Main problem - we use prior treated units as controls.
When the treatment effect is "dynamic", i.e. takes more than one period to be incorporated into your dependent variable, you are subtracting the still-growing treatment effects of previously treated units, which now serve as controls, from the estimates for later-treated units.
This biases your estimates towards zero when all the treatment effects are the same.
Can we actually get estimates for δ that are of the wrong sign? Yes, if treatment effects for early treated units are larger (in absolute magnitude) than the treatment effects on later treated units.
Here firms are randomly assigned to one of 50 states. The 50 states are randomly assigned into one of 5 treatment groups Gg based on treatment being initiated in 1985, 1991, 1997, 2003, and 2009.
All treated firms incorporated in a state in treatment group $G_g$ receive a treatment effect $\delta_i \sim N(\delta_g, 0.2^2)$.
The treatment effect is cumulative or dynamic: $\delta_{it} = \delta_i \times (year - G_g)$.
`Gg` | `δg` |
---|---|
1985 | 0.5 |
1991 | 0.4 |
1997 | 0.3 |
2003 | 0.2 |
2009 | 0.1 |
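The sign flip can be reproduced in a single draw. The sketch below uses the group effects from the table; the number of firms (1,000), the panel window (1980 to 2015), the fixed-effect and noise distributions, and the definition of the true effect as the average $\delta_{it}$ over treated observations are all assumptions made for illustration.

```python
import random


def twfe_delta(y, d):
    """Within estimate of delta for a balanced panel (rows = units)."""
    n, periods = len(y), len(y[0])

    def double_demean(x):
        unit = [sum(row) / periods for row in x]
        time = [sum(x[i][t] for i in range(n)) / n for t in range(periods)]
        grand = sum(unit) / n
        return [[x[i][t] - unit[i] - time[t] + grand for t in range(periods)]
                for i in range(n)]

    yt, dt = double_demean(y), double_demean(d)
    num = sum(dt[i][t] * yt[i][t] for i in range(n) for t in range(periods))
    den = sum(dt[i][t] ** 2 for i in range(n) for t in range(periods))
    return num / den


def simulate_decaying_effects(seed=0, n_firms=1000, n_states=50,
                              first_year=1980, last_year=2015):
    rng = random.Random(seed)
    group_effect = {1985: 0.5, 1991: 0.4, 1997: 0.3, 2003: 0.2, 2009: 0.1}
    groups = list(group_effect)
    states = list(range(n_states))
    rng.shuffle(states)
    state_group = {s: groups[i % len(groups)] for i, s in enumerate(states)}
    years = range(first_year, last_year + 1)
    year_fe = {t: rng.gauss(0, 1) for t in years}
    y, d, effects = [], [], []
    for _ in range(n_firms):
        g = state_group[rng.randrange(n_states)]
        firm_fe = rng.gauss(0, 1)
        delta_i = rng.gauss(group_effect[g], 0.2)
        y_row, d_row = [], []
        for t in years:
            treated = t >= g
            delta_it = delta_i * (t - g) if treated else 0.0
            if treated:
                effects.append(delta_it)
            y_row.append(firm_fe + year_fe[t] + delta_it + rng.gauss(0, 0.5))
            d_row.append(1.0 if treated else 0.0)
        y.append(y_row)
        d.append(d_row)
    return twfe_delta(y, d), sum(effects) / len(effects)


delta_hat, true_att = simulate_decaying_effects()
print(round(delta_hat, 2), round(true_att, 2))
```

Every firm's treatment effect is positive on average, yet the TWFE coefficient comes out negative because the early, large-effect cohorts act as controls for the later ones.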
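Callaway & Sant'Anna's group-time average treatment effect, where $G_g$ indicates units first treated in period $g$, $C$ indicates never-treated units, and $p_g(X)$ is the generalized propensity score:
$$ATT(g,t) = E\left[\left(\frac{G_g}{E[G_g]} - \frac{\frac{p_g(X)\,C}{1 - p_g(X)}}{E\left[\frac{p_g(X)\,C}{1 - p_g(X)}\right]}\right)\left(Y_t - Y_{g-1}\right)\right]$$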
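With no covariates, the propensity score $p_g(X)$ in the expression above is a constant $p$, so the control weights no longer depend on it. A short derivation of that special case, assuming $C$ marks never-treated units and $G_g$ marks units first treated in period $g$:

$$\frac{\frac{p\,C}{1-p}}{E\left[\frac{p\,C}{1-p}\right]} = \frac{C}{E[C]}$$

$$ATT(g,t) = E\left[\frac{G_g}{E[G_g]}(Y_t - Y_{g-1})\right] - E\left[\frac{C}{E[C]}(Y_t - Y_{g-1})\right] = E[Y_t - Y_{g-1} \mid G_g = 1] - E[Y_t - Y_{g-1} \mid C = 1]$$

In this case each $ATT(g,t)$ is a simple 2x2 DiD: the change in outcomes from period $g-1$ to $t$ for cohort $g$, minus the same change for never-treated units.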
A relatively straightforward extension of the standard event-study TWFE model:
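$$y_{it} = \alpha_i + \alpha_t + \sum_e \sum_{l \neq -1} \delta_{el} \left(1\{E_i = e\} \cdot D^l_{it}\right) + \epsilon_{it}$$
You saturate the relative time indicators (i.e. t = -2, -1, ...) with indicators for the treatment initiation year group, and then aggregate the cohort-specific estimates into overall relative-time estimates, weighting by cohort size.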
In the case of no covariates, this gives you the same estimate as Callaway & Sant'Anna if you fully saturate the model with time indicators (leaving only two relative year identifiers missing).
The authors don't claim that it can be used with covariates, but it seemingly follows if we think it is okay with normal TWFE DiD.
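The aggregation step can be sketched as a cohort-size-weighted average of the cohort-specific coefficients $\delta_{el}$ at each relative time $l$. The function name, estimates, and cohort sizes below are hypothetical, and the sketch assumes the weights are each cohort's share among the cohorts observed at that relative time.

```python
def aggregate_by_cohort_size(cohort_estimates, cohort_sizes):
    """Average delta_{e,l} across cohorts e for each relative time l.

    cohort_estimates maps (cohort, relative_time) to an estimate;
    cohort_sizes maps cohort to its number of units.
    """
    totals, weights = {}, {}
    for (cohort, rel_time), estimate in cohort_estimates.items():
        size = cohort_sizes[cohort]
        totals[rel_time] = totals.get(rel_time, 0.0) + size * estimate
        weights[rel_time] = weights.get(rel_time, 0) + size
    return {l: totals[l] / weights[l] for l in sorted(totals)}


# Hypothetical cohort-specific estimates for two cohorts.
estimates = {
    (1986, 0): 0.2, (1986, 1): 0.6,
    (1992, 0): 0.4, (1992, 1): 0.8,
}
sizes = {1986: 100, 1992: 300}
event_study = aggregate_by_cohort_size(estimates, sizes)
print(event_study)
```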
Similar to the standard TWFE DiD, but we ensure that no previously treated units enter as controls by trimming the sample.
For each treatment cohort $G_g$, get all treated units and all units that are not treated by year $g+k$, where $g$ is the treatment year and $k$ is the outermost relative year that you want to test (e.g. if you do an event study plot from -5 to 5, $k$ would equal 5).
Keep only observations within years g−k and g+k for each cohort-specific dataset, and then stack them in relative time.
Run the same TWFE estimates as in standard DiD, but include interactions for the cohort-specific dataset with all of the fixed effects, controls, and clusters.
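The trimming and stacking steps above can be sketched as follows. The record layout (`unit`, `year`, `treat_year`) and the helper name `build_stack` are assumptions made for illustration; never-treated units carry `treat_year=None`.

```python
def build_stack(panel, k):
    """Build cohort-specific datasets and stack them in relative time.

    panel is a list of dicts with keys unit, year, treat_year.
    """
    cohorts = sorted({r["treat_year"] for r in panel
                      if r["treat_year"] is not None})
    stacked = []
    for g in cohorts:
        for r in panel:
            treat_year = r["treat_year"]
            is_treated = treat_year == g
            # Clean controls: never treated, or not yet treated by g + k.
            is_clean_control = treat_year is None or treat_year > g + k
            in_window = g - k <= r["year"] <= g + k
            if in_window and (is_treated or is_clean_control):
                stacked.append({
                    **r,
                    "stack": g,                  # cohort-specific dataset id
                    "rel_time": r["year"] - g,   # event time
                    "treated_cohort": int(is_treated),
                })
    return stacked


# Hypothetical panel: three treated units and one never-treated unit.
treat_years = {"A": 1990, "B": 1994, "C": 2000, "D": None}
panel = [
    {"unit": u, "year": t, "treat_year": ty}
    for u, ty in treat_years.items()
    for t in range(1980, 2006)
]
stacked = build_stack(panel, k=3)
print(len(stacked))
```

Unit A is dropped from the 1994 and 2000 stacks because it is already treated, which is the point of the trimming. The regression would then interact the unit and time fixed effects with the `stack` identifier.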
In the stylized example all the models work. How do they differ?
Callaway & Sant'Anna
Abraham & Sun
Cengiz et al.
Bachhuber et al. 2014 found, using a staggered DiD, that states with medical cannabis laws experienced a slower increase in opioid overdose mortality from 1999-2010.
Shover et al. 2020 extend the data sample from 2010 to 2017, a period during which 32 extra states passed medical marijuana laws (MMLs).
Not only do the results go away, but the sign flips; MMLs are associated with higher opioid overdose mortality rates.
Authors don't call it difference-in-differences, but it uses TWFE with a binary indicator variable (thus is effectively DiD).
Little evidence covariates matter here, so estimate standard DiD with no controls over the two periods:
$$y_{it} = \alpha_i + \alpha_t + \sum_{k = Pre,\, Post} \delta_k + \sum_{k=-3}^{3} \delta_k + \epsilon_{it}$$
So we can verify that in the first sample (1999 - 2010), there appears to be a negative effect of law introduction, while in the full sample (1999 - 2017), there is a positive effect.
But there appears to be evidence of pre-trends in the full sample.
In addition, by the end of the sample the number of states adopting MMLs is quite large.
If there are dynamic treatment effects, then these estimates could be biased from using many prior treated states as controls.
The unweighted average of the 2x2 treatment effects is negative for the earlier vs. later treated comparisons (unbiased), while it is positive for the later vs. earlier treated comparisons (biased).
The effect is also positive for the treated vs. untreated units, but there are not many untreated states (i.e. states without medical cannabis laws).
Type | Average Estimate | Number of 2x2 Comparisons | Total Weight |
---|---|---|---|
Earlier vs Later Treated | -0.16 | 91 | 0.38 |
Later vs Earlier Treated | 0.32 | 105 | 0.42 |
Treated vs Untreated | 0.44 | 14 | 0.20 |
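As a rough check on how the three comparison types combine, the table's averages can be multiplied by their total weights. This is only an approximation: the "Average Estimate" column is an unweighted mean within each type, whereas the TWFE coefficient weights every individual 2x2 comparison separately.

```python
# (comparison type, average estimate, total weight), copied from the table.
decomposition = [
    ("Earlier vs Later Treated", -0.16, 0.38),
    ("Later vs Earlier Treated", 0.32, 0.42),
    ("Treated vs Untreated", 0.44, 0.20),
]

total_weight = sum(weight for _, _, weight in decomposition)
approx_estimate = sum(est * weight for _, est, weight in decomposition)

for name, est, weight in decomposition:
    print(f"{name}: contribution {est * weight:+.4f}")
print(f"total weight {total_weight:.2f}, approximate estimate {approx_estimate:.4f}")
```

The only negative contribution comes from the earlier vs. later comparisons, and it is outweighed by the other two types.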
DiDs are a powerful tool and we are going to keep using them.
But we should make sure we understand what we're doing! DiD is a comparison of means and at a minimum we should know which means we're comparing.
Multiple new methods have been proposed, all of which ensure that you aren't using prior treated units as controls.
You should probably tailor your selection of method to your data structure: the methods use and discard different numbers of control units, and depending on your setting this might matter.
It is unclear what's going on with MMLs and opioid mortality rates, but it is very unlikely that the results in the first published paper are robust.