How much control do we really need?

What are bad controls?

Orkhan Sariyev
2024-06-06

Three sources of association

Many of us are control freaks. If you’ve decided to read this blog, you probably are one too. Applied researchers, especially those just starting out with observational data, are keen to control for variables they think relate to the outcome or exposure. To claim causality and avoid omitted variable bias, we’re often motivated to control for variables that feel very “close” to our variables of interest. But there’s a big problem here. You’ve heard it many times: correlation is not causation.

First, let’s understand the three sources of association because these help us identify good and bad controls:

  1. Confounder or common cause means that exposure (X) and outcome (Y) share a common cause (Z). This is visualized in Figure 1. Z affects both X and Y and thereby induces a non-causal (spurious) association between X and Y. In this case, we have learned that we must control for Z. Controlling for, or conditioning on, Z blocks the spurious association and identifies the causal effect of X on Y. Thus, Z is a good (i.e., necessary) control here.

Figure 1: Confounder

  2. Collider means that exposure (X) and outcome (Y) share a common effect (Z). This is visualized in Figure 2: X and Y both affect Z. Unlike the previous scenario, the fact that X and Y both affect Z does not by itself introduce an association between them. In this case, our results would be biased if we controlled for Z, because conditioning on Z induces a non-causal association between X and Y. Thus, Z is a bad control here. Do not control for Z!!!

Figure 2: Collider

  3. Mediator means that exposure (X) affects the outcome of interest (Y) through the mediator (Z). Figure 3 visualizes this chain. Z lies on the causal pathway from X to Y, so controlling for Z stops the flow of association along that pathway and results in overcontrol bias. Thus, Z is again a bad control. Again, do not control for Z!!!

CAVEAT!!! Here, I’m assuming that the interest lies in the total effect of X on Y. Conditioning on Z is relevant if the researcher is interested in the controlled direct effect (CDE), which we commonly encounter in mediation analyses.


Figure 3: Mediator
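To make the three cases concrete, here is a minimal simulation sketch in Python (NumPy only). Everything in it is an illustrative assumption rather than something taken from the figures: the linear data-generating processes, the coefficient values, the sample size, and the helper name `ols_slope_on_x`. In each scenario the true total effect of X on Y is set to 1.0, and we compare the estimated coefficient on X with and without Z in the regression.

```python
import numpy as np


def ols_slope_on_x(y, x, *controls):
    """OLS coefficient on x when regressing y on an intercept, x, and optional controls."""
    design = np.column_stack([np.ones_like(x), x, *controls])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]  # position 0 is the intercept, position 1 is x


rng = np.random.default_rng(42)
n = 100_000  # large sample so sampling noise is small (assumed value)

# Confounder: Z -> X and Z -> Y. True effect of X on Y is 1.0 (assumed).
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = 1.0 * x + 2.0 * z + rng.normal(size=n)
confounder_without_z = ols_slope_on_x(y, x)    # biased away from 1.0
confounder_with_z = ols_slope_on_x(y, x, z)    # close to 1.0

# Collider: X -> Z <- Y. True effect of X on Y is 1.0 (assumed).
x = rng.normal(size=n)
y = 1.0 * x + rng.normal(size=n)
z = x + y + rng.normal(size=n)
collider_without_z = ols_slope_on_x(y, x)      # close to 1.0
collider_with_z = ols_slope_on_x(y, x, z)      # biased away from 1.0

# Mediator: X -> Z -> Y. True total effect of X on Y is 1.0 (assumed).
x = rng.normal(size=n)
z = 1.0 * x + rng.normal(size=n)
y = 1.0 * z + rng.normal(size=n)
mediator_without_z = ols_slope_on_x(y, x)      # close to the total effect 1.0
mediator_with_z = ols_slope_on_x(y, x, z)      # the path through Z is blocked

print(f"confounder: without Z {confounder_without_z:.2f}, with Z {confounder_with_z:.2f}")
print(f"collider:   without Z {collider_without_z:.2f}, with Z {collider_with_z:.2f}")
print(f"mediator:   without Z {mediator_without_z:.2f}, with Z {mediator_with_z:.2f}")
```

Under these assumed coefficients, the confounder case should recover roughly 1.0 only when Z is included, while the collider and mediator cases should recover roughly 1.0 only when Z is left out.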

Neutral controls

Actually, not every control is either good or bad. There are also neutral controls. Let’s start with a scenario where Z and X both affect Y, but Z and X are not correlated. We assume no effect of X on Z or of Z on X. In this case, Z is considered a neutral control. Controlling for Z does no harm and is likely to improve the precision of the effect estimate, because it reduces the unexplained variation in Y. This is visualized in Figure 4.


Figure 4: Neutral control: good for precision
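The precision gain can be checked with a small repeated-sampling sketch in Python. The data-generating process, the coefficients (1.0 for X, 2.0 for Z), the sample size, the number of replications, and the helper `ols_slope_on_x` are all assumptions chosen for illustration. X and Z are drawn independently, so leaving Z out does not bias the estimate; it only makes it noisier.

```python
import numpy as np


def ols_slope_on_x(y, x, *controls):
    """OLS coefficient on x when regressing y on an intercept, x, and optional controls."""
    design = np.column_stack([np.ones_like(x), x, *controls])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


rng = np.random.default_rng(0)
n, reps = 200, 1000  # assumed sample size and number of simulated datasets

without_z, with_z = [], []
for _ in range(reps):
    x = rng.normal(size=n)
    z = rng.normal(size=n)                      # independent of X
    y = 1.0 * x + 2.0 * z + rng.normal(size=n)  # Z affects Y only
    without_z.append(ols_slope_on_x(y, x))
    with_z.append(ols_slope_on_x(y, x, z))

without_z = np.array(without_z)
with_z = np.array(with_z)

# Both estimators centre on the true effect; the spread is what differs.
print(f"without Z: mean {without_z.mean():.3f}, sd {without_z.std():.3f}")
print(f"with Z:    mean {with_z.mean():.3f}, sd {with_z.std():.3f}")
```

Both averages should sit near 1.0, and the standard deviation across replications should be clearly smaller when Z is included.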

I believe it is also necessary to highlight one more scenario (among many more possible), where Z determines exposure (X) but not the outcome of interest (Y). This is visualized in Figure 5. We may still consider Z a neutral control, since it does not introduce any bias. Unlike the previous scenario in Figure 4, however, controlling for Z harms the precision of the effect estimate we are interested in, because it removes part of the variation in X.


Figure 5: Neutral control: bad for precision
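A repeated-sampling sketch in Python illustrates the loss of precision. As before, the data-generating process, the coefficients, the sample size, the number of replications, and the helper `ols_slope_on_x` are assumptions made for illustration. Here Z drives X strongly but has no effect on Y, so conditioning on Z leaves only the part of X that Z does not explain.

```python
import numpy as np


def ols_slope_on_x(y, x, *controls):
    """OLS coefficient on x when regressing y on an intercept, x, and optional controls."""
    design = np.column_stack([np.ones_like(x), x, *controls])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


rng = np.random.default_rng(1)
n, reps = 200, 1000  # assumed sample size and number of simulated datasets

without_z, with_z = [], []
for _ in range(reps):
    z = rng.normal(size=n)
    x = 2.0 * z + rng.normal(size=n)   # Z determines much of X
    y = 1.0 * x + rng.normal(size=n)   # Z has no effect on Y
    without_z.append(ols_slope_on_x(y, x))
    with_z.append(ols_slope_on_x(y, x, z))

without_z = np.array(without_z)
with_z = np.array(with_z)

# Both estimators centre on the true effect; including Z widens the spread.
print(f"without Z: mean {without_z.mean():.3f}, sd {without_z.std():.3f}")
print(f"with Z:    mean {with_z.mean():.3f}, sd {with_z.std():.3f}")
```

Both averages should again sit near 1.0, but this time the standard deviation across replications should be larger when Z is included.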

It is important to mention that I have only covered the simplest scenarios and one can think of much more challenging scenarios, but for the sake of simplicity and basic understanding, I will leave it at this. I hope this is helpful.