class: center, middle, inverse, title-slide

# Difference-in-Differences: What it DiD?
### Andrew Baker
### Stanford University
### 2020-05-25

---

<style type="text/css">
@media print {
  .has-continuation {
    display: block !important;
  }
}
</style>

# .center.pull[Outline of Talk]

`\(\hspace{2cm}\)`

1. Overview of DiD
2. Problems with Staggered DiD
3. Simulation Results
4. Some Alternative Methods
5. Application

---

# .center.pull[Difference-in-Differences]

`\(\hspace{2cm}\)`

- Think of the Card and Krueger minimum wage study comparing NJ and PA.

- 2 units and 2 time periods.

- 1 unit (T) is treated, and receives treatment in the second period. The control unit (C) is never treated.

---

# .center.pull[Difference-in-Differences]

<img src="DiD_files/figure-html/d1-1.png" width="720" style="display: block; margin: auto;" />

---

# .center.pull[Difference-in-Differences]

- Building upon `\(\color{blue}{\text{Angrist & Pischke (2008, p. 228)}}\)`, we can think of these simple 2x2 DiDs as a fixed effects estimator.

- Potential Outcomes
  - `\(Y_{i, t}^1\)` = value of dependent variable for unit `\(i\)` in period `\(t\)` with treatment.
  - `\(Y_{i, t}^0\)` = value of dependent variable for unit `\(i\)` in period `\(t\)` without treatment.

- The expected outcome is a *linear function* of unit and time fixed effects:

`$$E[Y_{i, t}^0] = \alpha_i + \alpha_t$$`

`$$E[Y_{i, t}^1] = \alpha_i + \alpha_t + \delta D_{it}$$`

- Goal of DiD is to get an unbiased estimate of the treatment effect `\(\delta\)`.

---

# .center.pull[Difference-in-Differences as Solving a System of Equations for an Unknown Variable]

- Difference in expectations for the *control* unit between times t = 1 and t = 0:

`$$\begin{align*} E[Y_{C, 1}^0] & = \alpha_1 + \alpha_C \\ E[Y_{C, 0}^0] & = \alpha_0 + \alpha_C \\ E[Y_{C, 1}^0] - E[Y_{C, 0}^0] & = \alpha_1 - \alpha_0 \end{align*}$$`

- Now do the same thing for the *treated* unit:

`$$\begin{align*} E[Y_{T, 1}^1] & = \alpha_1 + \alpha_T + \delta \\ E[Y_{T, 0}^1] & = \alpha_0 + \alpha_T \\ E[Y_{T, 1}^1] - E[Y_{T, 0}^1] & = \alpha_1 - \alpha_0 + \delta \end{align*}$$`

- If we assume the linear structure of DiD, then an unbiased estimate of `\(\delta\)` is:

`$$\delta = \left( E[Y_{T, 1}^1] - E[Y_{T, 0}^1] \right) - \left( E[Y_{C, 1}^0] - E[Y_{C, 0}^0] \right)$$`

---

# .center.pull[Two-Way Differencing]

<img src="DiD_files/figure-html/d2-1.gif" style="display: block; margin: auto;" />

---

# .center.pull[Regression DiD]

The DiD can be estimated through linear regression of the form:

`$$\tag{1} y_{it} = \alpha + \beta_1 TREAT_i + \beta_2 POST_t + \delta (TREAT_i \cdot POST_t) + \epsilon_{it}$$`

The coefficients from the regression estimate in (1) recover the same parameters as the double-differencing performed above:

`$$\begin{align*} \alpha &= E[y_{it} | i = C, t = 0] = \alpha_0 + \alpha_C \\ \beta_1 &= E[y_{it} | i = T, t = 0] - E[y_{it} | i = C, t = 0] \\ &= (\alpha_0 + \alpha_T) - (\alpha_0 + \alpha_C) = \alpha_T - \alpha_C \\ \beta_2 &= E[y_{it} | i = C, t = 1] - E[y_{it} | i = C, t = 0] \\ &= (\alpha_1 + \alpha_C) - (\alpha_0 + \alpha_C) = \alpha_1 - \alpha_0 \\ \delta &= \left(E[y_{it} | i = T, t = 1] - E[y_{it} | i = T, t = 0] \right) - \\ &\hspace{.5cm} \left(E[y_{it} | i = C, t = 1] - E[y_{it} | i = C, t = 0] \right) = \delta \end{align*}$$`

---

# .center.pull[Regression DiD]

<center>

`\(\hspace{2cm}\)`

![](https://media.giphy.com/media/Mab1lyzb70X0YiNLUj/giphy.gif)

</center>

---

# .center.pull[Regression DiD - The Workhorse Model]

- Advantage of regression DiD - it provides both an estimate of `\(\delta\)` and a standard error for that estimate.

- `\(\color{blue}{\text{Angrist & Pischke (2008)}}\)`:
  - "It's also easy to add additional (units) or periods to the regression setup... [and] it's easy to add additional covariates."

- Two-way fixed effects estimator (sketched in code below):

`$$y_{it} = \alpha_i + \alpha_t + \delta^{DD} D_{it} + \epsilon_{it}$$`

- `\(\alpha_i\)` and `\(\alpha_t\)` are unit and time fixed effects, and `\(D_{it}\)` is the unit-time indicator for treatment.

- `\(TREAT_i\)` and `\(POST_t\)` are now subsumed by the fixed effects.

- Can be easily modified to include a covariate matrix `\(X_{it}\)`, time trends, dynamic treatment effects estimation, etc.
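
- As a concrete illustration, here is a minimal sketch of both regressions in R. The data frame `dat`, its column names, and the use of the `fixest` package are assumptions for illustration, not code from these slides:

```r
# Minimal sketch (illustration only). Assumes a panel data frame `dat` with
# columns: unit, year, y, treat (1 if ever treated), post (1 if the period is
# in the post-treatment era), and d = treat * post.
library(fixest)

# 2x2 regression DiD from equation (1): delta is the interaction coefficient
did_2x2 <- feols(y ~ treat + post + treat:post, data = dat)

# Two-way fixed effects (workhorse) version: unit and year fixed effects
# subsume TREAT and POST; delta^DD is the coefficient on d
did_twfe <- feols(y ~ d | unit + year, data = dat, cluster = ~unit)

summary(did_twfe)
```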
---

# .center.pull[Where It Goes Wrong]

- There is now a well-developed literature on the issues with TWFE DiD under "staggered treatment timing" <span style="color:blue">(Abraham and Sun (2018), Borusyak and Jaravel (2018), Callaway and Sant'Anna (2019), Goodman-Bacon (2019), Strezhnev (2018), Athey and Imbens (2018))</span>
  - Different units receive treatment at different periods in time.
  - Probably the most common use of DiD today. If done right, it can increase the amount of cross-sectional variation.

- Without digging into the literature:
  - `\(\delta^{DD}\)` with staggered treatment timing is a *weighted average of many different treatment effects*.
  - We know little about what `\(\delta^{DD}\)` measures when treatment timing varies, which group means it compares, or why different specifications change the estimates.
  - The weights are often negative and non-intuitive.

---

# .center.pull[Bias with TWFE - Goodman-Bacon (2019)]

- `\(\color{blue}{\text{Goodman-Bacon (2019)}}\)` provides a clear graphical intuition for the bias. Assume three treatment groups - never treated units (U), early treated units (k), and later treated units (l).

<img src="DiD_files/figure-html/d3-1.png" width="504" style="display: block; margin: auto;" />

---

# .center.pull[Bias with TWFE - Goodman-Bacon (2019)]

- `\(\color{blue}{\text{Goodman-Bacon (2019)}}\)` shows that we can form four different 2x2 groups in this setting, where the effect can be estimated using the simple regression DiD in each group:

<img src="DiD_files/figure-html/d4-1.png" width="504" style="display: block; margin: auto;" />

---

# .center.pull[Bias with TWFE - Goodman-Bacon (2019)]

- Important Insights
  - `\(\delta^{DD}\)` is just the weighted average of the four 2x2 treatment effects. The weights are a function of the size of the subsample, the relative size of treatment and control units, and the timing of treatment in the subsample.
  - Already-treated units act as controls even though they are treated.
  - Given the weighting function, panel length alone can change the DiD estimates substantially, even when each 2x2 `\(\delta^{DD}\)` does not change.
  - Groups treated closer to the middle of the panel receive higher weights than those treated earlier or later.

---

# .center.pull[Simulation Exercise]

- Can show how easily `\(\delta^{DD}\)` goes awry through a simulation exercise.

- Consider two sets of DiD estimates - one where the treatment occurs in one period, and one where the treatment is staggered.

- The data generating process is linear: `\(y_{it} = \alpha_i + \alpha_t + \delta_{it} + \epsilon_{it}\)`.
  - `\(\alpha_i, \alpha_t \sim N(0, 1)\)`
  - `\(\epsilon_{it} \sim N\left(0, \left(\frac{1}{2}\right)^2\right)\)`

- We will consider two different treatment assignment setups for `\(\delta_{it}\)`.

---

# .center.pull[Simulation 1 - 1 Period Treatment]

- There are 40 states `\(s\)`, and 1000 units `\(i\)` randomly drawn from the 40 states.

- Data covers years 1980 to 2010, and half the states receive "treatment" in 1995.

- For every unit incorporated in a treated state, we pull a unit-specific treatment effect from `\(\mu_i \sim N(0.3, (1/5)^2)\)`.

- Treatment effects here are trend breaks rather than level shifts: the accumulated treatment effect `\(\delta_{it}\)` is `\(\mu_i \times (year - 1995 + 1)\)` for years after 1995.

- We then estimate the average treatment effect as `\(\hat{\delta}\)` from:

`$$y_{it} = \hat{\alpha}_i + \hat{\alpha}_t + \hat{\delta} D_{it}$$`

- Simulate this data 1,000 times and plot the distribution of estimates `\(\hat{\delta}\)` against the true effect (red line). One draw of this design is sketched in code below.
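
- A minimal sketch of a single draw, with assumed variable names and `fixest` standing in for whatever estimation code generated the plots:

```r
# One draw of Simulation 1 (illustrative sketch, not the slides' exact code).
library(fixest)

set.seed(1)
n_units <- 1000
years   <- 1980:2010
state   <- sample(1:40, n_units, replace = TRUE)   # units drawn from 40 states
treated_states <- sample(1:40, 20)                 # half the states are treated

dat <- expand.grid(unit = 1:n_units, year = years)
dat$state   <- state[dat$unit]
dat$treated <- dat$state %in% treated_states

alpha_i <- rnorm(n_units)                          # unit fixed effects
alpha_t <- rnorm(length(years))                    # year fixed effects
mu      <- rnorm(n_units, mean = 0.3, sd = 0.2)    # unit-specific trend breaks

dat$D     <- as.numeric(dat$treated & dat$year >= 1995)
dat$delta <- ifelse(dat$D == 1, mu[dat$unit] * (dat$year - 1995 + 1), 0)
dat$y     <- alpha_i[dat$unit] + alpha_t[dat$year - 1979] + dat$delta +
             rnorm(nrow(dat), sd = 0.5)

# TWFE estimate of the average treatment effect for this draw
feols(y ~ D | unit + year, data = dat)
```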
---

# .center.pull[Simulation 1 - 1 Period Treatment]

<img src="DiD_files/figure-html/d5-1.png" width="720" style="display: block; margin: auto;" />

---

# .center.pull[Simulation 1 - 1 Period Treatment]

<center>

`\(\hspace{2cm}\)`

![](https://media.giphy.com/media/drwxYI2fxqQGqRZ9Pe/giphy.gif)

</center>

---

# .center.pull[Simulation 2 - Staggered Treatment]

- Run a similar analysis with staggered treatment.

- The 40 states are randomly assigned into four treatment cohorts by year of treatment adoption (1986, 1992, 1998, and 2004), so each cohort contains roughly 250 units.

- The DGP is identical, except that now `\(\delta_{it}\)` is equal to `\(\mu_i \times (year - \tau_g + 1)\)` where `\(\tau_g\)` is the treatment assignment year.

- Estimate the treatment effect using TWFE and compare to the analytically derived true `\(\delta\)` (red line).

---

# .center.pull[Simulation 2 - Staggered Treatment]

<img src="DiD_files/figure-html/d6-1.png" width="720" style="display: block; margin: auto;" />

---

# .center.pull[Simulation 2 - Staggered Treatment]

<center>

`\(\hspace{2cm}\)`

![](https://media.giphy.com/media/4cuyucPeVWbNS/giphy.gif)

</center>

---

# .center.pull[Simulation 2 - Staggered Treatment]

- Main problem - we use prior treated units as controls.

- When the treatment effect is "dynamic", i.e. it takes more than one period to be incorporated into your dependent variable, you are *subtracting* the treatment effects of prior treated units, which now serve as controls, from the estimates for later treated units.

- This biases your estimates towards zero when all the treatment effects are the same.

---

# .center.pull[Another Simulation]

- Can we actually get estimates for `\(\delta\)` that are of the *wrong sign*? Yes, if treatment effects for early treated units are larger (in absolute magnitude) than the treatment effects on later treated units.

- Here firms are randomly assigned to one of 50 states. The 50 states are randomly assigned into one of 5 treatment groups `\(G_g\)` based on treatment being initiated in 1985, 1991, 1997, 2003, and 2009.

- All treated firms incorporated in a state in treatment group `\(G_g\)` receive a treatment effect `\(\delta_i \sim N(\delta_g, 0.2^2)\)`.

- The treatment effect is cumulative or dynamic - `\(\delta_{it} = \delta_i \times (year - G_g)\)`.

---

# .center.pull[Another Simulation]

- The average treatment effect `\(\delta_g\)` decreases with later treatment timing:

`\(\hspace{2cm}\)`

<table class="table" style="margin-left: auto; margin-right: auto;">
<caption>Treatment Effect Averages</caption>
<thead>
<tr>
<th style="text-align:center;"> `\(G_g\)` </th>
<th style="text-align:center;"> `\(\delta_g\)` </th>
</tr>
</thead>
<tbody>
<tr> <td style="text-align:center;"> 1985 </td> <td style="text-align:center;"> 0.5 </td> </tr>
<tr> <td style="text-align:center;"> 1991 </td> <td style="text-align:center;"> 0.4 </td> </tr>
<tr> <td style="text-align:center;"> 1997 </td> <td style="text-align:center;"> 0.3 </td> </tr>
<tr> <td style="text-align:center;"> 2003 </td> <td style="text-align:center;"> 0.2 </td> </tr>
<tr> <td style="text-align:center;"> 2009 </td> <td style="text-align:center;"> 0.1 </td> </tr>
</tbody>
</table>

---

# .center.pull[Another Simulation]

- First let's look at the distribution of `\(\delta^{DD}\)` using TWFE estimation with this simulated sample:

<img src="DiD_files/figure-html/d8-1.png" width="720" style="display: block; margin: auto;" />

---

# .center.pull[Goodman-Bacon Decomposition]

<img src="DiD_files/figure-html/d9-1.png" width="720" style="display: block; margin: auto;" />
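
- One way to produce a decomposition like this is with the `bacondecomp` package; the call below is an assumed illustration rather than the code behind the plot:

```r
# Goodman-Bacon (2019) decomposition via bacondecomp (illustrative sketch).
# Assumes a balanced state-by-year panel `dat` with outcome y and a 0/1
# treatment indicator D that turns on at the cohort's treatment year and stays on.
library(bacondecomp)

gb <- bacon(y ~ D, data = dat, id_var = "state", time_var = "year")

# Each row is a 2x2 comparison (earlier vs. later, later vs. earlier, treated
# vs. untreated) with its estimate and weight; the weighted average of the
# 2x2 estimates reproduces the TWFE coefficient.
head(gb)
sum(gb$estimate * gb$weight)
```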
---

# .center.pull[Callaway & Sant'Anna]

- The group-time average treatment effect `\(ATT(g, t)\)` is an inverse propensity weighted long difference in outcomes between units treated in cohort `\(g\)` and untreated comparison units:

`$$ATT(g, t) = \mathbb{E} \left[\left( \frac{G_g}{\mathbb{E}[G_g]} - \frac{\frac{p_g(X)C}{1 - p_g(X)}}{\mathbb{E}\left[\frac{p_g(X)C}{1 - p_g(X)} \right]} \right) \left(Y_t - Y_{g - 1}\right)\right]$$`

- Without covariates, as in the simulated example here, it calculates the simple long difference between all treated units `\(i\)` in relative year `\(k\)` and all potential control units that have not yet been treated by year `\(k\)` (see the sketch below).
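
- A minimal sketch of this estimator using Callaway & Sant'Anna's `did` package; the data frame `dat` and its column names are assumptions for illustration:

```r
# Callaway & Sant'Anna via the `did` package (illustrative sketch).
# Assumes `dat` has columns: unit (numeric id), year, y, and first_treat
# (the unit's first treatment year, 0 if never treated).
library(did)

cs <- att_gt(
  yname  = "y",
  tname  = "year",
  idname = "unit",
  gname  = "first_treat",
  data   = dat,
  control_group = "notyettreated"   # use not-yet-treated units as comparisons
)

# Aggregate the ATT(g, t)'s into an event study in relative time and plot
cs_dyn <- aggte(cs, type = "dynamic")
ggdid(cs_dyn)
```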
---

# .center.pull[Callaway & Sant'Anna]

<img src="DiD_files/figure-html/d10-1.png" width="720" style="display: block; margin: auto;" />

---

# .center.pull[Abraham and Sun]

- A relatively straightforward extension of the standard event-study TWFE model:

`$$y_{it} = \alpha_i + \alpha_t + \sum_e \sum_{l \neq -1} \delta_{el}(1\{E_i = e\} \cdot D_{it}^l) + \epsilon_{it}$$`

- You saturate the relative time indicators (i.e. t = -2, -1, ...) with indicators for the treatment initiation year group, and then aggregate the cohort-specific coefficients into overall relative time estimates, weighting by cohort size.

- In the case of no covariates, this gives you the same estimate as Callaway & Sant'Anna if you *fully saturate* the model with relative time indicators (omitting only two relative year indicators).

- The authors don't claim that it can be used with covariates, but it seemingly follows if we think covariates are okay in normal TWFE DiD.

---

# .center.pull[Abraham and Sun]

<img src="DiD_files/figure-html/d11-1.png" width="720" style="display: block; margin: auto;" />

---

# .center.pull[Cengiz et al. (2019)]

- Similar to the standard TWFE DiD, but we ensure that no previously treated units enter as controls by trimming the sample.

- For each treatment cohort `\(G_g\)`, get all treated units, and all units that are not treated by year `\(g + k\)`, where `\(g\)` is the treatment year and `\(k\)` is the outermost relative year that you want to test (e.g. if you do an event study plot from -5 to 5, `\(k\)` would equal 5).

- Keep only observations within years `\(g - k\)` and `\(g + k\)` for each cohort-specific dataset, and then stack them in relative time.

- Run the same TWFE estimates as in standard DiD, but interact the cohort-specific dataset (stack) identifier with all of the fixed effects, controls, and clusters. A sketch of this stacking is shown below.
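
- A minimal sketch of the stacking described above, with assumed column names (`first_treat` is the unit's first treatment year, `NA` if never treated) and `fixest` used for the interacted fixed effects; this is an illustration, not the code behind the figures:

```r
# Cengiz et al. (2019)-style stacked DiD (illustrative sketch).
library(fixest)

k <- 5                                             # event window: -5 to +5
cohorts <- sort(unique(na.omit(dat$first_treat)))  # treatment years

stacks <- lapply(cohorts, function(g) {
  # cohort g units, plus "clean" controls not treated by year g + k
  keep <- (!is.na(dat$first_treat) & dat$first_treat == g) |
          (is.na(dat$first_treat) | dat$first_treat > g + k)
  sub  <- dat[keep & dat$year >= g - k & dat$year <= g + k, ]
  sub$stack <- g
  sub$D <- as.numeric(!is.na(sub$first_treat) & sub$first_treat == g &
                      sub$year >= g)
  sub
})
stacked <- do.call(rbind, stacks)
stacked$unit_stack <- interaction(stacked$unit, stacked$stack)

# Unit and year fixed effects interacted with the stack identifier
feols(y ~ D | unit^stack + year^stack, data = stacked, cluster = ~unit_stack)
```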
---

# .center.pull[Cengiz et al. (2019)]

<img src="DiD_files/figure-html/d12-1.png" width="720" style="display: block; margin: auto;" />

---

# .center.pull[Model Comparison]

- In the stylized example all the models work. How do they differ?

- Callaway & Sant'Anna
  - Can be *very* flexible in determining which control units to consider.
  - Has a more flexible functional form as well (IPW instead of OLS).
  - IPW can run into issues with p-scores near 0 or 1. But just because OLS runs doesn't mean it's right!

- Abraham & Sun
  - Very similar to regular TWFE OLS and hence easy to explain.
  - Control units are all units not treated within the data sample. If most of your units are treated by the end (or all of them are), this can make the control units very non-representative and restricted.

- Cengiz et al.
  - Also fairly close to regular DiD.
  - Can modify this framework to allow different forms of control units as well.
  - Not theoretically derived.

---

# .center.pull[Application - Medical Marijuana Laws and Opioid Overdose Deaths]

- `\(\color{blue}{\text{Bachhuber et al. 2014}}\)` found, using a staggered DiD, that states with medical cannabis laws experienced a slower increase in opioid overdose mortality from 1999-2010.

- `\(\color{blue}{\text{Shover et al. 2020}}\)` extend the data sample from 2010 to 2017, a period during which 32 additional states passed MMLs.

- Not only do the results go away, but the sign flips; MMLs are associated with *higher* opioid overdose mortality rates.

- The authors don't call it difference-in-differences, but it uses TWFE with a binary indicator variable (thus it is effectively DiD).

---

# .center.pull[Replication]

<img src="DiD_files/figure-html/d13-1.png" width="720" style="display: block; margin: auto;" />

---

# .center.pull[Event Study Estimates]

- Little evidence that covariates matter here, so we estimate a standard DiD with no controls over the two sample periods:

`$$y_{it} = \alpha_i + \alpha_t + \sum_{k} \delta_k D_{it}^k + \epsilon_{it}$$`

where `\(k\)` indexes either the Pre/Post periods or relative years `\(-3\)` through `\(3\)`.

<img src="DiD_files/figure-html/d14-1.png" width="720" style="display: block; margin: auto;" />

---

# .center.pull[Event Study Estimates]

- So we can verify that in the first sample (1999 - 2010), there appears to be a negative effect of law introduction, while in the full sample (1999 - 2017), there is a positive effect.

- *But* there appears to be evidence of pre-trends in the full sample.

- In addition, by the end of the sample the number of states adopting MMLs is quite large.

- If there are dynamic treatment effects, then these estimates could be biased from using many prior treated states as controls.

---

# .center.pull[Goodman-Bacon Decomposition]

<img src="DiD_files/figure-html/d15-1.png" width="720" style="display: block; margin: auto;" />

---

# .center.pull[Goodman-Bacon Decomposition]

- The unweighted average of the 2x2 treatment effects is negative for the earlier vs. later treated comparisons (unbiased), while positive for the later vs. earlier treated comparisons (biased).

- The effect is also positive for the treated vs. untreated units, but there are not many untreated states (i.e. states without medical cannabis laws).

<table class="table table-striped table-hover" style="margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:center;"> Type </th>
<th style="text-align:center;"> Average Estimate </th>
<th style="text-align:center;"> Number of 2x2 Comparisons </th>
<th style="text-align:center;"> Total Weight </th>
</tr>
</thead>
<tbody>
<tr> <td style="text-align:center;"> Earlier vs Later Treated </td> <td style="text-align:center;"> -0.16 </td> <td style="text-align:center;"> 91 </td> <td style="text-align:center;"> 0.38 </td> </tr>
<tr> <td style="text-align:center;"> Later vs Earlier Treated </td> <td style="text-align:center;"> 0.32 </td> <td style="text-align:center;"> 105 </td> <td style="text-align:center;"> 0.42 </td> </tr>
<tr> <td style="text-align:center;"> Treated vs Untreated </td> <td style="text-align:center;"> 0.44 </td> <td style="text-align:center;"> 14 </td> <td style="text-align:center;"> 0.20 </td> </tr>
</tbody>
</table>

---

# .center.pull[Callaway & Sant'Anna]

<img src="DiD_files/figure-html/d17-1.png" width="720" style="display: block; margin: auto;" />

---

# .center.pull[Abraham & Sun]

- Skip for now - without covariates it's the same as Callaway & Sant'Anna

---

# .center.pull[Cengiz et al.]

- First, we can plot the state-specific DiD estimates, separated by adoption period:

<img src="DiD_files/figure-html/d19-1.png" width="720" style="display: block; margin: auto;" />

---

# .center.pull[Cengiz et al.]

<img src="DiD_files/figure-html/d20-1.png" width="720" style="display: block; margin: auto;" />

---

# .center.pull[Takeaways]

- DiDs are a powerful tool and we are going to keep using them.

- But we should make sure we understand what we're doing! DiD is a comparison of means, and at a minimum we should know which means we're comparing.

- Multiple new methods have been proposed, all of which ensure that you aren't using prior treated units as controls.

- You should probably tailor your selection of method to your data structure: the methods use and discard different amounts of control units, and depending on your setting this might matter.

- Unclear what's going on with MMLs and opioid mortality rates, but it is very unlikely that the results in the first published paper are robust.