Confounding
In causal inference, a confounder is traditionally understood to be a variable that independently predicts the outcome, is associated with the exposure, and is not on the causal pathway between the exposure and the outcome. Failure to control for a confounder results in a spurious association between exposure and outcome.
Confounding is a causal concept rather than a purely statistical one, and therefore cannot be fully described by correlations or associations alone. The presence of confounders helps explain why correlation does not imply causation, and why careful study design and analytical methods are required to distinguish causal effects from spurious associations.
Several notation systems and formal frameworks, such as causal directed acyclic graphs, have been developed to represent and detect confounding, making it possible to identify when a variable must be controlled for in order to obtain an unbiased estimate of a causal effect.
Confounders are threats to internal validity.
Definition
Confounding is defined in terms of the data generating model. Let X be an exposure, and let Y be the outcome. Traditionally, a variable Z was considered to confound the relationship between X and Y if Z independently predicts Y, is associated with X, and is not on the causal pathway between X and Y. Not controlling for Z introduces a spurious relationship between X and Y.
However, several developments in causal inference over the past decades have shown that this definition of confounding is inadequate. This is because there can be pre-exposure variables associated with the outcome that, when controlled for, introduce rather than eliminate bias.
Modern causal inference therefore typically defines a confounder in terms of the minimally sufficient adjustment set. Formally, a set of variables Z is a sufficient adjustment set for the effect of X on Y if, conditional on Z, the potential outcomes are independent of X; that is, after adjusting for Z, the exposed and unexposed groups are exchangeable with respect to the outcome. A minimally sufficient adjustment set is a sufficient adjustment set Z in which every member is required to control for confounding. Under this framework, a confounder is defined as a member of the minimally sufficient adjustment set.
In the language of directed acyclic graphs, confounding corresponds to the presence of one or more open backdoor paths between X and Y. A set of variables Z is a sufficient adjustment set if conditioning on Z blocks all backdoor paths from X to Y. The set is minimally sufficient if no proper subset of Z satisfies this property. Removing any variable from a minimally sufficient set reopens at least one backdoor path.
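The back-door reading of sufficiency can be sketched in a few lines of Python. The block below is only an illustration under stated assumptions: the graph is given as a set of (parent, child) edges, the function names and the two example graphs are hypothetical, and paths are enumerated exhaustively, which is practical only for small graphs. It also applies the usual additional requirement that the adjustment set contain no descendant of X.

```python
def descendants(edges, node):
    """All nodes reachable from `node` by following directed edges."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def backdoor_paths(edges, x, y):
    """Simple paths from x to y (direction ignored) whose first edge points into x."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    paths = []

    def walk(path):
        if path[-1] == y:
            paths.append(path)
            return
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in path:
                walk(path + [nxt])

    for first in sorted(neighbours.get(x, ())):
        if (first, x) in edges:  # the path must start with an arrow into x
            walk([x, first])
    return paths


def is_blocked(edges, path, z):
    """A path is blocked by z if any intermediate node blocks it."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, node) in edges and (nxt, node) in edges
        if collider:
            # A collider blocks unless it, or one of its descendants, is in z.
            if node not in z and not (descendants(edges, node) & z):
                return True
        elif node in z:
            # A chain or fork node blocks when it is conditioned on.
            return True
    return False


def satisfies_backdoor(edges, x, y, z):
    """True if z contains no descendant of x and blocks every back-door path."""
    z = set(z)
    if z & descendants(edges, x):
        return False
    return all(is_blocked(edges, p, z) for p in backdoor_paths(edges, x, y))


# Hypothetical graph 1: Z is a common cause of X and Y.
confounded = {("Z", "X"), ("Z", "Y"), ("X", "Y")}
# Hypothetical graph 2: M is a pre-exposure collider between two unobserved causes.
m_graph = {("U1", "X"), ("U1", "M"), ("U2", "M"), ("U2", "Y"), ("X", "Y")}

print(satisfies_backdoor(confounded, "X", "Y", set()))
print(satisfies_backdoor(confounded, "X", "Y", {"Z"}))
print(satisfies_backdoor(m_graph, "X", "Y", set()))
print(satisfies_backdoor(m_graph, "X", "Y", {"M"}))
```

In the first graph the empty set fails and {Z} succeeds, so Z is a confounder in the sense defined above. In the second graph the empty set is already sufficient, while adjusting for M alone opens the path through the collider. This is the situation in which controlling for a pre-exposure variable introduces rather than removes bias.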
Examples
Simple example
A trucking company compares the fuel economy of trucks from two manufacturers, A and B, by measuring miles per gallon (MPG) over one month. They find that A trucks appear more fuel-efficient. However, A trucks are more often assigned highway routes while B trucks are more often assigned city routes. Here, truck make is the independent variable, MPG is the dependent variable, and route type is the confounder. Because route type affects MPG and differs across truck makes, it confounds the comparison. Thus the observed difference likely reflects highway versus city driving rather than truck make.
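The arithmetic behind the trucking example can be illustrated with made-up numbers. In the sketch below, the fleet sizes and MPG figures are hypothetical and were chosen so that the two makes perform identically on each route type. The function names are likewise illustrative only.

```python
# Hypothetical fleet data: (make, route) -> (number of trucks, mean MPG).
fleet = {
    ("A", "highway"): (80, 7.0),
    ("A", "city"): (20, 5.0),
    ("B", "highway"): (20, 7.0),
    ("B", "city"): (80, 5.0),
}


def crude_mpg(make):
    """Mean MPG for a make, ignoring which routes its trucks drove."""
    rows = [(n, mpg) for (m, _), (n, mpg) in fleet.items() if m == make]
    return sum(n * mpg for n, mpg in rows) / sum(n for n, _ in rows)


def standardized_mpg(make):
    """Mean MPG the make would show if it drove the fleet-wide mix of routes."""
    total = sum(n for n, _ in fleet.values())
    result = 0.0
    for route in ("highway", "city"):
        route_share = sum(n for (_, r), (n, _) in fleet.items() if r == route) / total
        result += fleet[(make, route)][1] * route_share
    return result


for make in ("A", "B"):
    print(make, round(crude_mpg(make), 2), round(standardized_mpg(make), 2))
```

With these numbers the crude comparison favors make A only because most A trucks drive highway routes. Once both makes are evaluated on the same mix of routes, the difference disappears.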
Relationship between birth order and Down syndrome
A scientist is studying the relationship between birth order and the presence of Down syndrome in the child. However, it is known that:
- Higher maternal age is directly associated with Down syndrome in the child
- Higher maternal age is directly associated with Down syndrome, regardless of birth order
- Maternal age is directly associated with birth order
- Maternal age is not a consequence of birth order
Maternal age therefore confounds the relationship between birth order and Down syndrome.
Relationship between smoking and lung disease
A scientist is studying the relationship between smoking status and the presence of lung disease. However, it is known that:
- Alcohol consumption and diet are directly associated with lung disease and overall health.
- Alcohol consumption and diet affect health regardless of smoking status.
- Alcohol consumption and diet are associated with smoking status.
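- Alcohol consumption and diet are not consequences of smoking itself.
Alcohol consumption and diet therefore confound the relationship between smoking status and lung disease.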
Control
Consider a researcher attempting to assess the effectiveness of a drug from population data in which drug usage was a patient's choice. Let X denote drug use and Y denote recovery. The data show that gender influences a patient's choice of drug as well as their chances of recovery. In this scenario, gender Z confounds the relation between X and Y, since Z is a cause of both X and Y. We have that
$$P(y \mid do(x)) \neq P(y \mid x)$$
because the observational quantity $P(y \mid x)$ contains information about the correlation between X and Z, and the interventional quantity $P(y \mid do(x))$ does not. It can be shown that, in cases where only observational data are available, an unbiased estimate of the desired quantity, $P(y \mid do(x))$, can be obtained by "adjusting" for all confounding factors, namely, conditioning on their various values and averaging the result. In the case of a single confounder Z, this leads to the "adjustment formula":
$$P(y \mid do(x)) = \sum_{z} P(y \mid x, z)\, P(z)$$
which gives an unbiased estimate for the causal effect of X on Y. The same adjustment formula works when there are multiple confounders except that, in this case, the choice of a set Z of variables that would guarantee unbiased estimates must be made with caution. The criterion for a proper choice of variables is called the Back-Door criterion and requires that the chosen set Z "blocks" every path between X and Y that contains an arrow into X. Such sets are called "Back-Door admissible" and may include variables which are not common causes of X and Y, but merely proxies thereof.
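Returning to the drug use example, since Z complies with the Back-Door requirement, the Back-Door adjustment formula is valid:
$$P(Y = \text{recovered} \mid do(X = \text{drug})) = \sum_{z} P(Y = \text{recovered} \mid X = \text{drug}, Z = z)\, P(Z = z)$$
where the sum runs over the values of gender Z.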
In this way the physician can predict the likely effect of administering the drug from observational studies in which the conditional probabilities appearing on the right-hand side of the equation can be estimated by regression.
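The gap between the observational and interventional quantities can be made concrete with a small calculation. In the sketch below, every probability is a hypothetical value invented for illustration, as are the variable and function names. The interventional quantity averages P(y | x, z) over the marginal distribution of Z, as the adjustment formula prescribes. The observational quantity implicitly averages over the distribution of Z among those who chose treatment x.

```python
# Hypothetical distributions for the drug example (all numbers are made up).
p_z = {"male": 0.6, "female": 0.4}            # P(Z = z)
p_x1_given_z = {"male": 0.8, "female": 0.2}   # P(X = 1 | Z = z), drug uptake
p_y1_given_xz = {                             # P(Y = recovered | X = x, Z = z)
    (1, "male"): 0.7,
    (0, "male"): 0.6,
    (1, "female"): 0.4,
    (0, "female"): 0.3,
}


def p_x_given_z(x, z):
    """P(X = x | Z = z) for binary x."""
    return p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]


def interventional(x):
    """Adjustment formula: sum over z of P(y | x, z) * P(z)."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in p_z)


def observational(x):
    """P(y | x): sum over z of P(y | x, z) * P(z | x), via Bayes' rule."""
    p_x = sum(p_x_given_z(x, z) * p_z[z] for z in p_z)
    joint = sum(p_y1_given_xz[(x, z)] * p_x_given_z(x, z) * p_z[z] for z in p_z)
    return joint / p_x


causal_effect = interventional(1) - interventional(0)
naive_difference = observational(1) - observational(0)
print(round(causal_effect, 3), round(naive_difference, 3))
```

Under these assumed numbers the drug raises the recovery probability by 0.10 in each gender, and the adjusted contrast recovers that value. The unadjusted contrast is considerably larger because the group that takes the drug more often also recovers more often regardless of treatment.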
Contrary to common belief, adding covariates to the adjustment set Z can introduce bias. A typical counterexample occurs when Z is a common effect of X and Y, a case in which Z is not a confounder and adjusting for Z would create bias known as "collider bias" or "Berkson's paradox". Controls that are not good confounders are sometimes called bad controls.
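Collider bias is easy to reproduce in simulation. The following sketch assumes a hypothetical data-generating process in which X and Y are independent standard normal variables and Z is their sum plus noise. The seed, sample size, and selection threshold are arbitrary choices, and selecting on Z > 1 stands in for "adjusting" for the collider.

```python
import random


def correlation(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
sample = []
for _ in range(20000):
    x = rng.gauss(0, 1)
    y = rng.gauss(0, 1)           # independent of x: no effect, no confounding
    z = x + y + rng.gauss(0, 1)   # collider: a common effect of x and y
    sample.append((x, y, z))

overall = correlation([(x, y) for x, y, _ in sample])
selected = correlation([(x, y) for x, y, z in sample if z > 1.0])
print(round(overall, 3), round(selected, 3))
```

In the full sample the correlation between X and Y is close to zero, as it should be. Among units with a high value of Z it becomes clearly negative, even though neither variable influences the other.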
In general, confounding can be controlled by adjustment if and only if there is a set of observed covariates that satisfies the Back-Door condition. Moreover, if Z is such a set, then the adjustment formula $P(y \mid do(x)) = \sum_{z} P(y \mid x, z)\, P(z)$ is valid. Pearl's do-calculus provides all possible conditions under which $P(y \mid do(x))$ can be estimated, not necessarily by adjustment.
History
According to Morabia, the word confounding derives from the Medieval Latin verb "confundere", which meant "mixing", and was probably chosen to represent the confusion between the cause one wishes to assess and other causes that may affect the outcome and thus confuse, or stand in the way of the desired assessment. Greenland, Robins and Pearl note an early use of the term "confounding" in causal inference by John Stuart Mill in 1843.
Fisher introduced the word "confounding" in his 1935 book "The Design of Experiments" to refer specifically to a consequence of blocking the set of treatment combinations in a factorial experiment, whereby certain interactions may be "confounded with blocks". This popularized the notion of confounding in statistics, although Fisher was concerned with the control of heterogeneity in experimental units, not with causal inference.
According to Vandenbroucke it was Kish who used the word "confounding" in the sense of "incomparability" of two or more groups in an observational study. Formal conditions defining what makes certain groups "comparable" and others "incomparable" were later developed in epidemiology by Greenland and Robins using the counterfactual language of Neyman and Rubin. These were later supplemented by graphical criteria such as the Back-Door condition.
Graphical criteria were shown to be formally equivalent to the counterfactual definition but more transparent to researchers relying on process models.
Types
In the case of risk assessments evaluating the magnitude and nature of risk to human health, it is important to control for confounding to isolate the effect of a particular hazard such as a food additive, pesticide, or new drug. For prospective studies, it is difficult to recruit and screen for volunteers with the same background, and in historical studies, there can be similar variability. Because the variability among volunteers in human studies cannot be fully controlled, confounding is a particular challenge. For these reasons, experiments offer a way to avoid most forms of confounding.
In some disciplines, confounding is categorized into different types. In epidemiology, one type is "confounding by indication", which relates to confounding from observational studies. Because prognostic factors may influence treatment decisions, controlling for known prognostic factors may reduce this problem, but it is always possible that a forgotten or unknown factor was not included or that factors interact in complex ways. Confounding by indication has been described as the most important limitation of observational studies. Randomized trials are not affected by confounding by indication because treatment is assigned at random.
Confounding variables may also be categorised according to their source: the choice of measurement instrument (operational confounding), situational characteristics (procedural confounding), or inter-individual differences (person confounding).
- Operational confounding can occur in both experimental and non-experimental research designs. This type of confounding occurs when a measure designed to assess a particular construct inadvertently measures something else as well.
- Procedural confounding can occur in a laboratory experiment or a quasi-experiment. This type of confounding occurs when the researcher mistakenly allows another variable to change along with the manipulated independent variable.
- Person confounding occurs when two or more groups of units are analyzed together despite differing in one or more other characteristics.