Uncertainty modeling and quantification for causal inference
Structural causal models: a simple introduction and example
The most complete and versatile axiomatic treatment of causal inference is given by the structural causal model (SCM) framework, developed by the computer scientist and philosopher Judea Pearl (Pearl 2009). Let us start with a simple example:
Consider an experiment where we apply an external force \(F\) to objects with different masses \(m\) and then measure their induced acceleration \(a\). Let us put all these variables together: \(\mathcal{V}=\{m,F,a\}\). Here, \(m\) and \(F\) are controlled (possibly randomly assigned) by the researcher, so they might be considered exogenous.
Acceleration measurements are noisy due to random measurement errors and residual applied forces (e.g., air currents). Such noise is captured by the random variable \(U_a\). Let us put all exogenous variables and noises together in \(\mathcal{U}=\{m,F,U_a\}\). Observations of these variables come, respectively, from the independent distributions \(P(m)\), \(P(F)\), and \(P(U_a)\).
Since we measure the induced acceleration after applying a force, and we know there’s a theoretical (causal) relation between them, we can say that the acceleration is caused by the force. We can represent this graphically as \(F\rightarrow a\). We also know that changes in the mass being pushed induce changes in the resulting acceleration, so \(m\) is also a cause of \(a\), or \(m\rightarrow a\). The external force is applied independently of the mass, so neither is a cause of the other. We can succinctly represent this causal structure with a directed acyclic graph \(\mathcal{G}\):
\[ \mathcal{G} :\quad F\rightarrow a\leftarrow m \] By Newton’s second law (with noise), the measured acceleration can be expressed as \(a = m^{-1}F+U_a\). This can be written compactly as \(a=f_a(m,F,U_a)\), where \(f_a\) is the acceleration’s causal mechanism: a deterministic function defined as \(f_a(x,y,z)=x^{-1}y+z\).
Then, the tuple of mathematical objects \(\mathfrak{M}=(\mathcal{V},\mathcal{U},\mathcal{G},f_a,{P}(\mathcal{U}))\) is a structural causal model (SCM) that fully describes the system and can be leveraged to answer queries at three levels (a simulation sketch follows the list):
- Observational: keeping the mass constant at \(m=2\)kg, what is the observed curve \(F\) vs. \(a\)?
- Causal: what’s the average change in \(a\) if \(F\) changes from \(F=1\)N to \(F=2\)N with a mass of \(1\)kg?
- Counterfactual: what would have been the value of \(a\) under \(F=2\)N for a mass \(m=1\)kg that actually experienced \(F=1\)N and measured \(a=1\) m/s²?
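To make the three levels concrete, here is a minimal simulation sketch in Python. The particular choices of \(P(F)\) (uniform) and \(P(U_a)\) (Gaussian) are illustrative assumptions, not fixed by the model; only the mechanism \(f_a\) comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_a(m, F, u_a):
    """Causal mechanism from the text: Newton's second law plus noise."""
    return F / m + u_a

n = 10_000
F = rng.uniform(0.5, 5.0, size=n)    # assumed P(F); any positive-support choice works
u_a = rng.normal(0.0, 0.1, size=n)   # assumed P(U_a): small Gaussian measurement noise

# Observational: F vs. a curve with the mass held at m = 2 kg
a_obs = f_a(2.0, F, u_a)

# Causal: average change in a when F goes from 1 N to 2 N with m = 1 kg
ate = np.mean(f_a(1.0, 2.0, u_a) - f_a(1.0, 1.0, u_a))  # exactly 1.0 m/s^2 (the noise cancels)

# Counterfactual (abduction-action-prediction) for a unit with m = 1 kg
# that experienced F = 1 N and measured a = 1 m/s^2:
u_hat = 1.0 - f_a(1.0, 1.0, 0.0)  # abduction: solve a = F/m + U_a for U_a, giving 0.0
a_cf = f_a(1.0, 2.0, u_hat)       # action + prediction: a would have been 2.0 m/s^2
```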
Structural causal models: a more technical introduction
An SCM is a tuple of mathematical objects \(\mathfrak{M}=(\mathcal{V},\mathcal{U},\mathcal{G},\mathcal{F},{P}(\mathcal{U}))\)¹, such that the following hold (a code sketch follows the list):
- \(\mathcal{V}\) is a finite set of relevant random variables (the data)
- \(\mathcal{U}\) is a finite set of exogenous random variables and background noises
- \(\mathcal{G}\) is a directed acyclic graph on \(\mathcal{V}\)
- \(P(\mathcal{U})\) is a probability measure for \(\mathcal{U}\)
- \(\mathcal{F}=\{f_V\}_{V\in\mathcal{V}}\) is an indexed collection of measurable functions specifying the causal relations, i.e., for every \(V\in\mathcal{V}\) there is a \(U_V\in\mathcal{U}\) and a function \(f_V:\text{supp}\, \text{pa}(V;\mathcal{G})\times \text{supp}\, U_V\rightarrow\text{supp}\, V\), such that \(V=f_V(\text{pa}(V;\mathcal{G}),U_V)\)² almost surely
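As a concrete (non-authoritative) illustration of this definition, the sketch below encodes the tuple \((\mathcal{V},\mathcal{U},\mathcal{G},\mathcal{F},P(\mathcal{U}))\) as a small Python container; the class and method names are mine, and the distributions in the Newton instance are the same illustrative assumptions as before.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

import numpy as np

@dataclass
class SCM:
    """Minimal SCM: G as parent lists, F as mechanisms f_V, P(U) as noise samplers."""
    parents: Dict[str, List[str]]            # pa(V; G) for each V; G must be acyclic
    mechanisms: Dict[str, Callable]          # f_V(parent values..., u_V)
    noises: Dict[str, Callable[[], float]]   # a sampler for each U_V

    def sample(self) -> Dict[str, float]:
        """Draw U ~ P(U), then evaluate V = f_V(pa(V; G), U_V) in topological order."""
        values: Dict[str, float] = {}
        pending = set(self.parents)
        while pending:  # acyclicity of G guarantees progress on each pass
            for v in [v for v in pending if all(p in values for p in self.parents[v])]:
                values[v] = self.mechanisms[v](
                    *(values[p] for p in self.parents[v]), self.noises[v]()
                )
                pending.remove(v)
        return values

# The Newton example; exogenous m and F are modeled as noise-only mechanisms
rng = np.random.default_rng(0)
newton = SCM(
    parents={"m": [], "F": [], "a": ["m", "F"]},
    mechanisms={"m": lambda u: u, "F": lambda u: u,
                "a": lambda m, F, u: F / m + u},
    noises={"m": lambda: 2.0,                    # mass held fixed at 2 kg
            "F": lambda: rng.uniform(0.5, 5.0),  # assumed P(F)
            "a": lambda: rng.normal(0.0, 0.1)},  # assumed P(U_a)
)
```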
An SCM is Markovian if all the background noises \(\{U_V\}_{V\in\mathcal{V}}\) are assumed to be mutually independent. In this case, the set of conditional independence statements encoded in \(\mathcal{G}\) allows a Bayesian network representation that factorizes the joint distribution in terms of independent causal families (Pearl 2009):
\[ p(\mathcal{V}) = \prod_{V\in\mathcal{V}}p(V\mid \text{pa}(V;\mathcal{G})) \]
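For the Newton example above, where \(a\) is the only non-root node, this factorization reads:

\[ p(m,F,a) = p(m)\,p(F)\,p(a\mid m,F) \]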
A hard intervention on a collection of variables \(A=(A_j)_{j\in J}\subset\mathcal{V}\) to the assigned value \(a=(a_j)_{j\in J}\in\text{supp}\, A\) is denoted \(do(A=a)\). Such an intervention induces a new SCM \(\mathfrak{M}_{A=a}\), where all \(f_{A_j}\) are replaced by constant functions that output the respective value \(a_j\). Its associated graph is the mutilated graph \(\mathcal{G}[\overline{A}]\) that removes all incoming arrows to \(A\) (Bareinboim et al. 2022).
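Continuing the illustrative sketch above, a hard intervention can be implemented by swapping in a constant mechanism and cutting the incoming edges; the helper `do` below is my own naming, not a library call.

```python
def do(scm: SCM, var: str, value: float) -> SCM:
    """do(var = value): constant mechanism for var, mutilated graph without pa(var)."""
    return SCM(
        parents={**scm.parents, var: []},                    # G[bar A]: no incoming arrows
        mechanisms={**scm.mechanisms, var: lambda u: value}, # constant function
        noises=scm.noises,                                   # P(U) is unchanged
    )

newton_do_F2 = do(newton, "F", 2.0)  # M_{F=2}: every sample now has F = 2 N
```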
Let disjoint \(A,Y\subset\mathcal{V}\) denote the exposure and outcome variables, respectively. The unit-level counterfactual \(Y_{a}(u)\) is the value \(Y\) takes according to \(\mathfrak{M}_{A=a}\) in the individual context \(\mathcal{U}=u\). Its induced population-level distribution, called the interventional distribution, can be expressed as:
\[ p(y\mid do(A=a)) := p_{Y_{a}}(y)=\int_{\mathcal{U}_a[y]}\text{d} P(u) \]
where \(\mathcal{U}_a[y]=\{u\in\text{supp}\, \mathcal{U} : Y_{a}(u)=y \}\) is the preimage of \(y\in\text{supp}\, Y\) under the map \(u\mapsto Y_{a}(u)\) for a given \(a\in\text{supp}\, A\).
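When the SCM is fully specified, this integral can be approximated by simple Monte Carlo: draw contexts from \(P(\mathcal{U})\) and push them through the intervened model (for continuous \(Y\), replace the indicator with a density estimator):

\[ p(y\mid do(A=a)) \approx \frac{1}{n}\sum_{i=1}^{n}\mathbb{1}\{Y_{a}(u^{(i)})=y\},\qquad u^{(i)}\overset{\text{iid}}{\sim} P(\mathcal{U}) \]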
The average treatment effect (ATE), \(\psi\), and the \(X\)-specific conditional average treatment effect (CATE), \(\psi_X(\cdot)\), with \(X\subseteq\text{nd}(A;\mathcal{G})\)³, are two of the most commonly investigated causal effects/estimands. For binary \(A\), they correspond to difference functionals of the interventional distribution, and are defined as:
\[ \begin{aligned} \psi &:= \Delta_a \mathbb{E}\left[Y\mid do(A=a) \right]\\ \psi_X(x) &:= \Delta_a \mathbb{E}\left[Y\mid do(A=a),X=x \right],\, x\in\text{supp}\, X,\, X\subseteq\text{nd}(A;\mathcal{G}) \end{aligned} \]
where, for binary \(A\), \(\Delta_a g(a) := g(1)-g(0)\) denotes the contrast between the two treatment levels.
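With the sketch above, the ATE can be estimated by a Monte Carlo contrast of two intervened models; `ate` is again an illustrative helper, and the two generic levels `a0`, `a1` specialize to the binary case of the definition.

```python
def ate(scm: SCM, treatment: str, outcome: str,
        a0: float, a1: float, n: int = 10_000) -> float:
    """Monte Carlo estimate of E[Y | do(A=a1)] - E[Y | do(A=a0)]."""
    m1, m0 = do(scm, treatment, a1), do(scm, treatment, a0)
    return (np.mean([m1.sample()[outcome] for _ in range(n)])
            - np.mean([m0.sample()[outcome] for _ in range(n)]))

# With m fixed at 2 kg in `newton`, moving F from 1 N to 2 N gives an ATE near 0.5 m/s^2
psi_hat = ate(newton, "F", "a", 1.0, 2.0)
```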
An interventional distribution or causal effect is said to be nonparametrically identifiable from positive \(P(\mathcal{V})\) if it can be uniquely computed from it (using the conditional independence statements embedded in \(\mathcal{G}\) and its mutilation). In other words, a query \(Q\), such as an interventional distribution, the ATE, or the CATE, is nonparametrically identifiable from \(P(\mathcal{V})\) if there exists a functional/algorithm \(\Psi_\mathcal{G}:P(\mathcal{V})\mapsto Q\) that returns a unique value up to some equivalence relation.
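A canonical example of such a functional \(\Psi_\mathcal{G}\) is the back-door adjustment formula (Pearl 2009): if \(X\subseteq\mathcal{V}\) contains no descendants of \(A\) and blocks every back-door path from \(A\) to \(Y\) in \(\mathcal{G}\), then

\[ p(y\mid do(A=a)) = \sum_{x} p(y\mid a,x)\,p(x), \]

which is computable from the observational distribution \(P(\mathcal{V})\) alone.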
Uncertainty modeling and quantification in structural causal models
The two natures of uncertainty, aleatoric and epistemic (Hüllermeier and Waegeman 2021), can be associated with different parts of an SCM.
Aleatoric uncertainty
It is generally induced by randomness, i.e., \(P(\mathcal{U})\) and, in turn, \(P(\mathcal{V})\).
Epistemic uncertainty
It is the result of working with unknowns. It can be further broken down into:
Model/mechanism uncertainty: induced by unknown \(\mathcal{F}\). Its analytic treatment depends on where the true mechanism lies in relation to the working models/hypotheses \(\mathcal{M}\) (linear regressions, neural networks, etc.).
- \(\mathcal{M}\)-closed world: If \(\mathcal{F}\in\mathcal{M}\), model uncertainty can be integrated via Bayesian model-averaging.
- \(\mathcal{M}\)-open/complete world: If \(\mathcal{F}\notin\mathcal{M}\), model uncertainty can be integrated via Bayesian stacking of predictive distributions (Yao et al. 2018).
Structure uncertainty: induced by unknown \(\mathcal{G}\), typically under the causal sufficiency assumption, i.e., all relevant endogenous variables \(\mathcal{V}\) are observed, but the graph connecting them is not (Kitson et al. 2023).
- Here, ‘relevant’ depends on the downstream task. If the goal is full structure learning, then all \(\mathcal{V}\) must be observed. If the goal is downstream causal inference, then a minimal adjustment set must be observed.
Identification uncertainty: induced by latent/hidden variables \(V_H\subset\mathcal{V}\) that are needed for the identification of causal and counterfactual queries.
- When point-identification is not possible due to latent variables, such as unmeasured confounders, partial identification (bounds) can still be informative (Chernozhukov et al. 2022), as illustrated below.
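For instance, for binary \(A\) and \(Y\), with no assumptions beyond consistency and writing \(a'\) for the other treatment level, decomposing the interventional probability over the observed treatment yields the classic no-assumptions (Manski-style) bounds:

\[ p(y\mid do(A=a)) = p(y\mid a)\,p(a) + \underbrace{p(Y_a=y\mid A=a')}_{\in\,[0,1]}\,p(a') \;\in\; \left[\, p(y\mid a)\,p(a),\; p(y\mid a)\,p(a)+p(a') \,\right] \]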
References
Footnotes
¹ Some authors do not include \(\mathcal{G}\) directly in \(\mathfrak{M}\), but rather say that \(\mathcal{G}\) is associated with \(\mathfrak{M}\). This can be useful in some uncertainty-related contexts, such as when there are hidden variables, as many marginal graphs can be associated with the same SCM.
² \(\text{pa}(A;\mathcal{G})\) stands for the parents of node \(A\) in \(\mathcal{G}\).
³ \(\text{nd}(A;\mathcal{G})\) stands for the non-descendants of node \(A\) in \(\mathcal{G}\).