What are bad controls?
Many of us are control freaks. If you’ve decided to read this blog, you probably are one too. Applied researchers, especially those just starting out with observational data, are keen to control for variables they think relate to the outcome or exposure. To claim causality and avoid omitted variable bias, we’re often motivated to control for variables that feel very “close” to our variables of interest. But there’s a big problem here. You’ve heard it many times: correlation is not causation.
First, let’s understand the three sources of association because these help us identify good and bad controls:
1
. Z affects X and Y, and introduces a non-causal association (induce spurious associations) between X and Y. In this cases, we have learned that we must control for Z. Here, controlling or conditioning on Z means that we block the spurious association and identify the causal effect of X on Y. Thus, Z is a good (i.e., necessary) control here.2
. X and Y, both affect Z. Unlike the previous scenario, X and Y having effect on Z does not introduce association. In this case, our results would be biased if we control for Z; we would induce a non-causal association. Thus, Z is a bad control here. Do not control for Z!!!3
visualizes a chain. Controlling for Z on this causal pathway stops the flow of association and results in overcontrol bias. Thus, Z is again a bad control. Again, do not control for Z!!!CAVEAT!!! Here, I’m assuming that the interest lies in the total effect of X on Y. Conditioning on Z is relevant if the researcher is interested in the controlled direct effect (CDE), which we commonly encounter in mediation analyses.
Actually, not all controls must be good or bad. There are also neutral controls. Let’s start with a scenario where Z and X affect Y, but Z and X are not correlated. We assume no effect of X on Z or Z on X. In this case, Z is considered a neutral control. Controlling for Z does no harm, and likely to improve the precision of the effect estimate, because controlling for Z would reduce the variation of Y. This is visualized in Figure 4
I believe it is also necessary to highlight one more scenario (among many more possible), where Z determines exposure (X) but not the outcome of interest (Y). This is visualized in Figure 5
. Although we may still consider Z as a neutral control (since it does not introduce any bias), unlike the previous scenario in Figure 4
, this does harm the precision of the effect estimate we are interested to find.
It is important to mention that I have only covered the simplest scenarios and one can think of much more challenging scenarios, but for the sake of simplicity and basic understanding, I will leave it at this. I hope this is helpful.