In my previous post, I presented a convenient model for describing counterfactual questions, which ponder the potential outcomes we would see given a change in a particular explanatory variable. I ended the post with a formalization of the average treatment effect (ATE), the arithmetic mean of all causal effects that a particular explanatory variable has on individual measurements of an outcome variable. Unfortunately, as a result of the fundamental problem of causal inference, we cannot directly measure average treatment effects: we cannot witness more than one potential outcome for an individual, because we cannot set an explanatory variable to more than one value at once. I mentioned that there is a way to estimate average treatment effects, and now I can describe the method in detail.
Recall that in order to estimate the causal effect due to a particular explanatory variable, we must observe data with variation between treated individuals, who received treatment, and untreated individuals, who did not. When considering the estimation of average treatment effects, it will be helpful to also consider the average treatment effect on the treated (ATT) and the average treatment effect on the untreated (ATU).
ATT and ATU
The former is the average treatment effect for the individuals who are treated, for whom the explanatory variable describing their treatment, $D$, is equal to $1$. The latter is the average treatment effect for the individuals who are untreated, for whom $D$ is equal to $0$. Formally, the ATT is described by the following equation, where $Y_i^1$ and $Y_i^0$ are the potential outcomes of individual $i$ with and without treatment. In the calculation of the arithmetic mean for treated individuals, we represent the number of individuals that are treated with $N_T$.
$$ATT = E[Y^1 - Y^0 \mid D = 1] = \frac{1}{N_T} \sum_{i : D_i = 1} \left( Y_i^1 - Y_i^0 \right)$$
With the number of untreated individuals represented by $N_C$, the ATU can be formally written as follows.
$$ATU = E[Y^1 - Y^0 \mid D = 0] = \frac{1}{N_C} \sum_{i : D_i = 0} \left( Y_i^1 - Y_i^0 \right)$$
For example, consider a scenario in which I decide to vary treatment between the same 8 email subscribers I presented in my previous post, arbitrarily choosing to send the first 4 email alerts without images and the last 4 with images. Let us extend our observations in table 1, adding a column marking the variation between email subscribers who are treated and receive email alerts with images ($D = 1$) and email subscribers who are untreated and receive email alerts without images ($D = 0$).
For this treatment assignment, email subscribers 0 through 3 are untreated. We can take the potential outcomes $Y^1$ and $Y^0$ of these individuals to calculate the average treatment effect on the untreated:
Similarly, we can use the potential outcome values of email subscribers 4 through 7 to calculate the average treatment effect on the treated individuals as follows:
Note that, just like the ATE, the ATT and the ATU cannot be calculated from observed outcomes alone. While we can calculate $E[Y^1 \mid D = 1]$ in order to calculate the ATT, we also need $E[Y^0 \mid D = 1]$, the mean outcome of treated individuals in the hypothetical universe in which they did not receive treatment. In addition, while we can calculate $E[Y^0 \mid D = 0]$ in order to calculate the ATU, we also need $E[Y^1 \mid D = 0]$, the mean outcome of untreated individuals in the hypothetical universe in which they received treatment.
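To make these definitions concrete, here is a minimal sketch in Python. The potential outcomes below are hypothetical values I made up for illustration (they are not the table from my previous post), and the names `y0`, `y1` and `d` are my own. The sketch only works because it pretends we can see both potential outcomes for every subscriber, which is exactly what we cannot do in practice.

```python
import numpy as np

# Hypothetical potential outcomes for 8 subscribers (illustrative only,
# NOT the data from the post). 1 = opens the blog post, 0 = does not.
y0 = np.array([0, 1, 0, 0, 1, 0, 1, 0])  # outcome without images
y1 = np.array([1, 1, 0, 1, 1, 1, 1, 0])  # outcome with images
d = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # treatment: last four get images

effects = y1 - y0                # individual treatment effects

ate = effects.mean()             # average over everyone
att = effects[d == 1].mean()     # average over treated subscribers
atu = effects[d == 0].mean()     # average over untreated subscribers

# The ATE is the share-weighted average of the ATT and the ATU.
pi = d.mean()                    # share of subscribers who are treated
weighted = pi * att + (1 - pi) * atu

print(f"ATE={ate}, ATT={att}, ATU={atu}, weighted={weighted}")
```

With real data, only `y1[d == 1]` and `y0[d == 0]` would be visible, so none of these three quantities could be computed this way.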
Ok, well is there anything that we can calculate to help us quantify the causal effect generated by an explanatory variable?!
Why yes, I’m glad you asked.
Simple Difference In Mean Outcomes
Let’s recall what values I can calculate given the outcomes I observe when inferring the causal effect of images in email alerts on my email subscribers. I can calculate $E[Y^1 \mid D = 1]$, the open rate of subscribers chosen to receive email alerts with images in the universe in which they received email alerts with images (the universe I am currently observing). I can also calculate $E[Y^0 \mid D = 0]$, the open rate of email subscribers chosen to receive email alerts without images in the universe in which they received email alerts without images (again, the universe I am currently observing). The difference between these values is known as the simple difference in mean outcomes (SDO) and is the main mechanism we will use for estimating ATEs. The equation below uses $E[Y^1 \mid D = 1]$ and $E[Y^0 \mid D = 0]$ as well as $N_T$, the number of observed individuals who are treated, and $N_C$, the number of observed individuals who are untreated. Here $Y_i$ is the observed outcome of individual $i$.
$$SDO = E[Y^1 \mid D = 1] - E[Y^0 \mid D = 0] = \frac{1}{N_T} \sum_{i : D_i = 1} Y_i - \frac{1}{N_C} \sum_{i : D_i = 0} Y_i$$
For our example data set regarding images in email alerts, the simple difference in mean outcomes can be calculated as follows.
Often, this is where “traditional” inference practitioners stop when trying to estimate an average causal effect. It is the intuitive estimation strategy you may be familiar with: simply the difference in means between treated and untreated individuals. Unfortunately, it is not exactly equivalent to the ATE. The SDO has two main sources of bias which can systematically distort the statistic away from the true value of an average treatment effect. The equation relating the SDO to the average treatment effect is as follows, where $\pi$ is the proportion of individuals who are treated:
$$SDO = ATE + \left( E[Y^0 \mid D = 1] - E[Y^0 \mid D = 0] \right) + (1 - \pi)(ATT - ATU)$$
I will refrain from presenting the proof of this equation within this post, but would encourage you to check it out in detail on page 90 of Causal Inference: The Mixtape (Cunningham, 2021). It is crucially important to discuss the implications of the excess terms on the right-hand side of this equation, in order to understand why we must be careful when using “simple” difference in means estimations to estimate average treatment effects.
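The relationship above (the SDO equals the ATE plus a selection bias term plus a heterogeneous treatment effect bias term) can be checked numerically. The following is a sketch under stated assumptions: `decompose_sdo` is a helper name I made up, and the potential outcomes are hypothetical values chosen so that both bias terms are non-zero. It is not a proof, just one worked instance.

```python
import numpy as np

def decompose_sdo(y0, y1, d):
    """Split the SDO into ATE, selection bias, and HTE bias.

    Requires BOTH potential outcomes for every individual, so this is
    only possible on made-up data, never on real observations.
    """
    y0, y1, d = (np.asarray(a) for a in (y0, y1, d))
    treated, untreated = d == 1, d == 0
    pi = treated.mean()  # share of individuals who are treated

    # Observable part: treated mean of Y^1 minus untreated mean of Y^0.
    sdo = y1[treated].mean() - y0[untreated].mean()

    effects = y1 - y0
    ate = effects.mean()
    att = effects[treated].mean()
    atu = effects[untreated].mean()

    selection_bias = y0[treated].mean() - y0[untreated].mean()
    hte_bias = (1 - pi) * (att - atu)
    return {"sdo": sdo, "ate": ate,
            "selection_bias": selection_bias, "hte_bias": hte_bias}

# Hypothetical potential outcomes (illustrative only).
parts = decompose_sdo(
    y0=[0, 1, 0, 0, 1, 0, 1, 0],
    y1=[1, 1, 0, 1, 1, 1, 1, 0],
    d=[0, 0, 0, 0, 1, 1, 1, 1],
)
print(parts)

# The three right-hand-side terms add up to the SDO.
total = parts["ate"] + parts["selection_bias"] + parts["hte_bias"]
assert np.isclose(parts["sdo"], total)
```

In this made-up sample the two bias terms have opposite signs and partially cancel, which is a reminder that an SDO close to the ATE does not by itself show that either bias is absent.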
The first term to the right of the ATE in the equation relating the SDO to the ATE is the selection bias,
$$E[Y^0 \mid D = 1] - E[Y^0 \mid D = 0]$$
which represents a systematic difference in how treated and untreated individuals would respond given no treatment. For example, let’s say I placed images in the email alerts of my subscribers indexed as 2 and 4 (marked in the table below with 👨 and 👩), and assigned images in email alerts to two other subscribers randomly. A causal diagram describing the effect that the explanatory variable isMyParent has on both Images In Email Alert and Opens Blog Post is as follows.
Let’s say 2 and 4 are my two parents and I want to ensure they receive the best experience possible from Causal Flows. A possible treatment assignment which satisfies these requirements could be as follows.
The simple difference in outcomes calculated from my treatment assignment is as follows:
This is a huge effect! If the average treatment effect of adding images is 1, then I should see a blog post open rate close to 100% once I start adding images to my email alerts. However, this simple difference in mean outcomes calculation suffers from selection bias, as both of my parents (who will open my blog posts regardless of whether or not they receive an email alert with images) are treated with images in their email alert. We can quantify this selection bias as follows:
Note that for this example, the SDO minus the calculated selection bias is equal to the average treatment effect, which was calculated in my previous post to be 50%. I constructed this example so there would be no heterogeneous treatment effect bias, and thus its resultant SDO is only affected by one type of additive bias. My choice to send email alerts with images to my parents, who would open my blog post regardless of its content, upwardly biased my estimate of the average treatment effect. I systematically chose to treat email subscribers who would respond positively regardless of my treatment. In order to ensure my estimated ATE is not distorted by selection bias, I must ensure that my treatment assignment strategy does not yield a significant difference in the potential outcome given no treatment ($Y^0$) between treated and untreated individuals.
Heterogeneous Treatment Effect Bias
Recall from my previous post that heterogeneous treatment effects (HTEs) characterize differing responses to treatment from different portions of the population. Oftentimes, a policy solution, UI feature, or medical therapy does not have the same effect on all individuals in a population; causal inference often involves estimating treatment responses despite these differences. When using the SDO to estimate average treatment effects, we must be cautious of differing responses to treatment between treated individuals and untreated individuals, as they incur bias which obscures our estimate of the average treatment effect of an entire sample population. Such bias is known as heterogeneous treatment effect bias and is the second form of additive bias that can adversely affect an SDO estimation of the average treatment effect. In particular, if untreated individuals have a systematically different response to treatment than treated individuals, then the SDO, which only incorporates the treatment responses of treated individuals, will be systematically different from the true ATE of a given sample population. The formula for heterogeneous treatment effect bias is the difference between the average treatment effect on treated individuals (ATT) and the average treatment effect on untreated individuals (ATU), times the proportion of observed individuals who are untreated. Formally, HTE bias is defined with the following equation. In the calculation of heterogeneous treatment effect bias, we represent the proportion of individuals who are treated with $\pi$ and the proportion of individuals who are untreated with $1 - \pi$.
$$\text{HTE bias} = (1 - \pi)(ATT - ATU)$$
For example, consider a scenario in which I add images to email alerts depending on whether or not an individual email subscriber is an economist. Just as my parents will open my blog post regardless of whether or not its corresponding email alert has images, some of my subscribers are economists and have a deep understanding of the introductory causal inference topics I currently cover. They won’t be reading my posts on structural causal models, as they are waiting for me to discuss more advanced topics. Suppose that I want to ensure that economists get the best possible experience from Causal Flows (maybe I want to make a good impression for my graduate school application). Therefore, I choose to send them email alerts with images while assigning email alerts with images to other subscribers randomly. A causal diagram describing the effect that the explanatory variable isEconomist has on both Images In Email Alert and Opens Blog Post is as follows.
If indices 3 and 7 (marked in the table below as 📈 and 📉) represent the economists amongst my sample observations, a possible treatment assignment which satisfies these requirements could be as follows.
The resultant SDO from this treatment assignment is
This is much lower than my previously calculated ATE of 50%; however, nothing has changed besides the treatment assignment used for my estimation.
In this example, the ATT is much lower than the ATU. The treatment effect on individuals who have been assigned treatment, which includes both economists, is (on average) less than the treatment effect on individuals who were not assigned treatment. We can calculate the ATT and the ATU as follows.
With the ATT and ATU of my observed individuals, as well as the proportion of untreated individuals ($1 - \pi$), I can calculate the heterogeneous treatment effect bias as follows.
In this example, the SDO minus the calculated HTE bias is equal to the average treatment effect, which was calculated in my previous post to be 50%. Here, the heterogeneous treatment effect bias is the only type of additive bias on the SDO. My decision to send email alerts with images to the economists subscribed to Causal Flows downwardly biased my estimate of the average treatment effect, because I systematically chose to treat email subscribers who were not affected by my treatment. In order to ensure my estimated ATE is not distorted by heterogeneous treatment effect bias, I must ensure that my treatment assignment strategy does not yield a significant difference in the potential outcome given treatment ($Y^1$) between treated and untreated individuals.
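The economist scenario can be sketched with a small hypothetical data set. The numbers below are made up to match the qualitative story (two economists at indices 3 and 7 who never open the post, an ATE of 50%, and no selection bias); they are not the values from my table, and the variable names are my own.

```python
import numpy as np

# Hypothetical potential outcomes (illustrative only). Indices 3 and 7
# play the role of economists: they never open, with or without images.
y0 = np.array([0, 0, 0, 0, 0, 1, 1, 0])  # outcome without images
y1 = np.array([1, 1, 1, 0, 1, 1, 1, 0])  # outcome with images
d = np.array([0, 1, 0, 1, 0, 0, 1, 1])   # both economists are treated

effects = y1 - y0
pi = d.mean()                             # share treated

ate = effects.mean()
att = effects[d == 1].mean()              # dragged down by the economists
atu = effects[d == 0].mean()

sdo = y1[d == 1].mean() - y0[d == 0].mean()
selection_bias = y0[d == 1].mean() - y0[d == 0].mean()
hte_bias = (1 - pi) * (att - atu)

print(f"ATE={ate}, ATT={att}, ATU={atu}")
print(f"SDO={sdo}, selection bias={selection_bias}, HTE bias={hte_bias}")

# With no selection bias, removing the HTE bias recovers the ATE.
assert np.isclose(sdo - hte_bias, ate)
```

Because the treated group contains the subscribers who respond least to images, the ATT sits below the ATU and the SDO understates the ATE.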
Extension To Regression
Oftentimes, the SDO estimation of an ATE can be calculated with a linear regression, which models a linear relationship between explanatory variables and outcome variables. Consider the following switching equation presented in my previous post:
$$Y_i = D_i Y_i^1 + (1 - D_i) Y_i^0$$
With a little algebra, this equation can be rewritten as follows.
$$Y_i = Y_i^0 + \left( Y_i^1 - Y_i^0 \right) D_i$$
Thus, the treatment effect can be interpreted as the coefficient on treatment in a regression of an individual’s observed outcome on their treatment, with a constant term representing the individual’s outcome in the absence of treatment ($Y^0$). With a binary treatment, the least-squares slope of such a regression is exactly the SDO. Heterogeneous treatment effect estimation (analysis of how treatment effects vary across individuals) is also closely linked to regression. When we wish to extend treatment effect estimation to variables that take a wide range of real values rather than a binary 0 and 1 (such as minimum wages, drug dosages, and e-commerce prices), causal inference practitioners often use regressions to generate treatment effect estimates, while remaining cautious of the biases noted in this post (and many more not covered).
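As a quick check of the link between the SDO and regression, the sketch below fits an ordinary least squares line of observed outcome on a binary treatment using NumPy. The observed outcomes are hypothetical, and `y` and `d` are names I chose for illustration. Only observed data is used here, so this is something we could actually run on a real sample.

```python
import numpy as np

# Hypothetical observed data (illustrative only): one outcome per subscriber.
d = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)  # 1 = received images
y = np.array([0, 1, 0, 0, 1, 1, 1, 0], dtype=float)  # 1 = opened the post

# Simple difference in mean outcomes.
sdo = y[d == 1].mean() - y[d == 0].mean()

# OLS of y on a constant and d: columns are [intercept, treatment].
X = np.column_stack([np.ones_like(d), d])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

print(f"SDO={sdo}, slope={slope}, intercept={intercept}")

# The slope reproduces the SDO; the intercept is the untreated mean.
assert np.isclose(slope, sdo)
assert np.isclose(intercept, y[d == 0].mean())
```

The regression does not remove any bias by itself: the slope inherits whatever selection bias and HTE bias the SDO carries.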
How Can We Deal With Bias In An ATE Estimation?
Ok, so we understand the ways in which the simple difference in mean outcomes can be significantly biased away from the true ATE. However, we know the formulas to calculate this bias, so can’t we just compute its value and subtract it from our calculated SDO? Well no, unfortunately the biases we’ve defined are calculated with values which we cannot observe. Selection bias (defined as $E[Y^0 \mid D = 1] - E[Y^0 \mid D = 0]$) contains the term $E[Y^0 \mid D = 1]$, which is the mean outcome of treated individuals in a hypothetical universe in which they did not receive treatment. This is a universe we cannot observe, and for which we cannot measure outcomes. Similarly, recall that the ATT and ATU cannot be calculated solely from observed outcomes, and thus heterogeneous treatment effect bias (defined as $(1 - \pi)(ATT - ATU)$) also cannot be calculated.
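Ok, so how can we confidently estimate ATEs with SDOs, given that we can’t even calculate our two main sources of bias? Are there particular methodologies we can use to ensure there is no significant bias in our estimation? Most techniques used to remove bias from an ATE estimate ensure that individuals are assigned treatment such that $D$ is independent of the potential outcomes $Y^1$ and $Y^0$. From probability theory, if $D$ is independent of $Y^1$ and $Y^0$, then
$$E[Y^1 \mid D = 1] = E[Y^1 \mid D = 0] \quad \text{and} \quad E[Y^0 \mid D = 1] = E[Y^0 \mid D = 0]$$
As a result, both heterogeneous treatment effect bias and selection bias are eliminated, as
$$E[Y^0 \mid D = 1] - E[Y^0 \mid D = 0] = 0$$
$$ATT - ATU = \left( E[Y^1 \mid D = 1] - E[Y^0 \mid D = 1] \right) - \left( E[Y^1 \mid D = 0] - E[Y^0 \mid D = 0] \right) = 0$$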
In experimental studies, or studies where an analyst has control over treatment assignment, analysts can randomly assign treatment to individuals to ensure that the treatment and the potential outcomes of observed individuals are drawn from independent probability distributions. For example, if I were to assign email alerts with images to my email subscribers randomly, their potential outcomes from receiving emails with and without images would be statistically independent of the event that they were chosen for treatment. While randomizing treatment makes the SDO an unbiased estimate of the ATE, any single randomized assignment can still land far from the ATE by chance, so practitioners must also calculate the SDO from a sufficiently large sample to keep the variance of the estimate low.
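To illustrate why randomization helps, the sketch below takes a small set of hypothetical potential outcomes (made up for illustration, not my table) and enumerates every way of treating exactly half of the 8 subscribers. Under that kind of random assignment each of these assignments is equally likely, so their average SDO is the expected SDO. The half-and-half design is an assumption of this sketch, not something prescribed above.

```python
from itertools import combinations

import numpy as np

# Hypothetical potential outcomes (illustrative only).
y0 = np.array([0, 0, 0, 0, 0, 1, 1, 0])  # outcome without images
y1 = np.array([1, 1, 1, 0, 1, 1, 1, 0])  # outcome with images
n = len(y0)
ate = (y1 - y0).mean()

# Enumerate every assignment that treats exactly half of the subscribers.
sdos = []
for treated_idx in combinations(range(n), n // 2):
    d = np.zeros(n, dtype=bool)
    d[list(treated_idx)] = True
    # Only Y^1 of the treated and Y^0 of the untreated would be observed.
    sdos.append(y1[d].mean() - y0[~d].mean())
sdos = np.array(sdos)

mean_sdo = sdos.mean()
print(f"assignments={len(sdos)}, ATE={ate}, mean SDO={mean_sdo}")
print(f"smallest SDO={sdos.min()}, largest SDO={sdos.max()}")

# Averaged over all equally likely assignments, the SDO equals the ATE.
assert np.isclose(mean_sdo, ate)
```

The spread between the smallest and largest SDO shows the other half of the story: with only 8 subscribers a single random assignment can still be far from the ATE, which is why sample size matters.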
Many challenging causal inference questions can only be answered by observational studies, for which an analyst is unable to control treatment assignment and must design a strategy to obtain an unbiased estimate of a treatment effect from observed data. One common strategy for estimating average treatment effects is to leverage natural experiments, or natural processes which assign treatment to individuals in a way that is statistically independent of their potential outcomes. For example, Angrist (1990) used the Vietnam War draft lottery to estimate the effect of military service on earnings. Other strategies concern eliminating bias due to confounding variables, or variables which have a causal effect on both an explanatory variable and an outcome variable, obscuring the causal effect between these variables. Recall that, when considering the effect that images in my email alerts had on the open rate of my blog posts, I presented examples of particular email subscriber characteristics which could have an effect on both treatment assignment and potential outcomes. In these examples, the status of an email subscriber as one of my parents or as an economist affects both the explanatory variable (Images In Email Alert) and the outcome variable (Opens Blog Post) in my hypothesized causal model.
In my next blog post, I will formalize strategies for characterizing, identifying, and eliminating confounding bias, by combining structural causal models and the potential outcomes model to thoroughly describe hypothesized causal relationships between the interconnected processes of our universe.
- Cunningham, S. (2021). Causal Inference: The Mixtape. Yale University Press.
- Angrist, J. (1990). Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social Security Administrative Records. The American Economic Review, 80(3), 313-336. Retrieved June 7, 2020, from www.jstor.org/stable/2006669