Semiparametric estimation for the mean ReLU

semiparametric
IF
Bayesian
counterfactual
A simple setting for a nonregular inference problem
Published

May 17, 2024


Parametric model and inference

Let \(Y\in\{0,1\}\) be a binary variable, and let \(X\) be another variable, which can be either absolutely continuous or categorical, serving as a predictor of \(Y\). Using a linear specification for the logit of the conditional mean, we can formulate a logistic regression for \(Y\):
\[ \Psi(x)= \mathbb{E}[Y\mid X=x] = \text{logit}^{-1}\left(\beta_0+\beta_1x \right). \]

\(\Psi(x)\) is effectively the conditional mean of \(Y\) within the subpopulation or stratum where \(X = x\). Since this is a functional of the true distribution \(P_0\), a more precise notation would be \(\Psi[P_0](x)\).

The proposed logistic regression is a parametric model, involving finite-dimensional parameters \(\beta_0\) and \(\beta_1\). If the model is correctly specified, the Maximum Likelihood (ML) estimators of such parameters are \(\sqrt{n}\)-consistent, asymptotically efficient, and converge in distribution to a normal distribution with variance equal to the inverse of the Fisher information matrix. Consequently, as more data is added, the estimates quickly become more accurate, and the asymptotic confidence intervals rapidly concentrate around the true.

Furthermore, according to the Bernstein–von Mises theorem, if the (joint) prior distribution for \(\beta_0\) and \(\beta_1\) is absolutely continuous (e.g., Gaussian), then the Bayesian credible intervals will be asymptotically equivalent to frequentist confidence intervals.

After estimating \(\hat{\beta}_0\) and \(\hat{\beta}_1\) via Maximum Likelihood Estimation (MLE), a plug-in estimate for \(\Psi(x)\) is simply \(\text{logit}^{-1}\left(\hat{\beta}_0 + \hat{\beta}_1 x \right)\). In the case of Bayesian inference, a point estimator for \(\Psi(x)\) would be the posterior mean of these plug-ins. That is, if \(\{(\beta_0^j, \beta_1^j)\}_{j=1}^J\) are \(J\) draws from the posterior distribution, then a point estimate for \(\Psi(x)\) would be \(J^{-1}\sum_{j=1}^J \text{logit}^{-1}\left(\beta_0^j + \beta_1^j x \right)\).

Semiparametric model and inference

A semiparametric approach involves replacing the parameters \(\beta_0\) and \(\beta_1\) with an unknown function \(f\) to provide greater flexibility and fitting to the data: \[ \mathbb{E}[Y\mid X] = \text{logit}^{-1}[f(X)]. \]

Various statistical and machine learning methods can be employed for this purpose, including Single-Index Models (SIM), Generalized Additive Models (GAM), Bayesian Additive Regression Trees (BART), Neural Networks (NN), Kernel Regression Models (KRM), Gaussian Processes (GP), and others.

In semiparametric theory, the concept of asymptotical linearity allows oneself to talk about convergence in the same way as with parametric models.

Asymptotical linearity

An estimator \(\hat{\psi}\) constructed using data \(\{\mathcal{V}_i\}_{i=1}^n\) for a functional of a distribution \(\Psi[P_0]\) (e.g., a stratified mean, a risk difference, a log odds-ratio), is asymptotically linear with influence function \(D\in L_0^2(P_0)\) if: \[ \underbrace{\hat{\psi}-\Psi[P_0]}_{\text{bias}}=\underbrace{\frac{1}{n}\sum_{i=1}^n D(\mathcal{V}_i)}_{\text{average of } D} + \underbrace{o_{\mathbb{P}}(n^{-1/2})}_{\text{remainder}}. \]

In other words, an estimator is asymptotically linear if its asymptotic bias resembles a simple average of a function (the influence function) plus a remainder that diminishes rapidly towards zero. To put it even more simply, an estimator is asymptotically linear if it behaves like a sample average in the limit (Bickel et al. 1998).

The influence function of \(\Psi\) at \(P_0\) can be derived as: \[ D(\mathcal{V}_i) = \frac{d}{dt}\Psi[(1-t)P_0+t\delta_{\mathcal{V}_i}]\,\big|_{t=0}, \] provided the derivative exists, and \(D \in L_0^2(P_0)\). The latter means that \(D\) must satisfy two conditions: zero mean and finite variance. It is important to note that \((1-t)P_0 + t\delta_{\mathcal{V}_i}\) represents a slight fluctuation of the true distribution by mixing it with a point mass at the data point \(\mathcal{V}_i\) (van der Vaart 2000).

This immediately allows for the construction of a correction to any initial plug-in estimator \(\Psi[\hat{P}]\) simply by adding the estimated influence functions: \[ \underbrace{\tilde{\psi}}_{\text{one-step corrected}} = \underbrace{\Psi[\hat{P}]}_{\text{plug-in}} + \underbrace{\frac{1}{n}\sum_{i=1}^n \hat{D}(\mathcal{V}_i)}_{\text{average of } \hat{D}} \]

Under certain technical conditions, the one-step corrected estimator is asymptotically linear for \(\Psi[P_0]\) and efficient (with the minimum variance). One of those conditions is that either:

  • The plug-in and the influence functions are estimated using different halves of the dataset. This effectively halves the sample size but allows the use of data-adaptive methods such as BART.
  • There is no sample splitting, but it is assumed that the estimated influence functions lie within a Donsker class (no overfitting), meaning they should not be fitted with highly flexible and data-adaptive methods like BART.

Another condition is that \(\Psi\) is pathwise differentiable at \(P_0\) (Hines et al. 2022).

Pathwise differentiability

A functional \(\Psi\) is said to be pathwise differentiable at \(P_0\) with canonical gradient \(\tilde{D}_{P_0}\in L_0^2(P_0)\) if for all \(\tilde{P}\) one has: \[ \frac{d}{dt}\Psi[(1-t)P_0+t\tilde{P}]\,\big|_{t=0} = \mathbb{E}_{\tilde{P}}[\tilde{D}_{P_0}(\mathcal{V})] - \mathbb{E}_{P_0}[\tilde{D}_{P_0}(\mathcal{V})]. \]

This is to say, \(\Psi\) is differentiable in a functional sense, and it has functional derivative \(\tilde{D}_{P_0}\). In some special cases (RAL estimator, defined later), the canonical gradient and the influence function are the same (van der Laan 1995).

Properties of one-step corrected estimators

One-step corrected estimators, and asymptotically linear estimators in general, are powerful due to their highly valuable properties:

  1. Rapid convergence to the true value, even when using very flexible, ML-based, and data-adaptive methods (under sample splitting).
  2. Valid (asymptotic) uncertainty quantification, as the influence function fully determines the asymptotic variance.
  3. Multiple robustness, maintaining consistency even when some components are misspecified (under certain technical conditions).

In causal inference, when the functional \(\Psi[P_0]\) represents the average treatment effect (ATE), the one-step corrected estimator is named the augmented inverse probability weighting (AIPW) estimator. \[ \operatorname{AIPW} = \underbrace{\frac{1}{n}\sum_{i=1}^{n}\Delta_a\hat{\mathbb{E}}[Y\mid X_i,A=a]}_{\text{regression plug-in}}+\underbrace{\frac{1}{n}\sum_{i=1}^{n}\frac{(2A_i-1)(Y_i-\hat{\mathbb{E}}[Y\mid X_i,A_i])}{\hat{\mathbb{P}}(A=A_i\mid X_i)}}_{\text{ average of } \hat{D}}. \]

A different estimand

Now consider the target estimand not being the conditional mean, but instead: \[ \Psi[P_0] = \mathbb{E}[\max\{0,\mathbb{E}[Y\mid X]\}] = \mathbb{E}\operatorname{ReLU}(\mathbb{E}[Y\mid X]). \]

This parameter is the mean of a function that assigns the value of the conditional mean \(\mathbb{E}[Y\mid X]\) if it is positive, and zero if the conditional mean is negative. In other words, it replaces the negative conditional mean at the unit level with zero, retains the positive ones, and averages them.

If the first term (0) in \(\max\{\cdot,\cdot\}\) has index 0, and the second (\(\mathbb{E}[Y\mid X]\)) has index 1, what happens when these terms are equal (\(\mathbb{E}[Y\mid X]=0\))? In this case, \(\operatorname{ReLU}(\mathbb{E}[Y\mid X])=0\) regardless of which term has the maximal index.

Estimands of this type are relevant because they are intrinsically related to lower bounds of partially identified parameters in causal inference, such as the lower bound of the probability of benefit (Tian and Pearl 2000).

Let us approach the following questions. Is it possible to build an estimator for \(\Psi[P_0]\) that is …

  1. Semiparametric Bayesian?
  2. Semiparametric Bayesian with credible intervals that are also asymptotically valid confidence intervals?
  3. RAL (regular and asymptotically linear)?

Q.1. Semiparametric Bayesian estimator

Certainly. The Bayesian framework is versatile and powerful enough to address this inference problem. Semiparametric Bayesian models, such as Bayesian splines, BART, or Gaussian Processes (GPs), can be implemented to generate posterior samples \(\{f^j\}_{j=1}^J\) of the function \(f(\cdot)=\mathbb{E}[Y\mid X=\cdot]\). Using a nonparametric specification for \(P_X\) and, by the inherent Bayesian property of plug-in uncertainty propagation, a posterior sample for \(\Psi[P_0]\) would be: \[ \left\{\frac{1}{n}\sum_{i=1}^n\operatorname{ReLU}[f^j(X_i)] \right\}_{j=1}^J \]

The average of such a samples builds a mixed-type Bayesian point-estimator \(\check{\psi}\) for \(\Psi[P_0]\). Moreover, several uncertainty quantification measures can be computed directly from the samples.

Q.2. Semiparametric Bayesian estimator with frequentist-like UQ?

As observed in the parametric case, credible intervals can be asymptotically equivalent to confidence intervals under very mild conditions. Would the Bayesian estimator defined previously possess this property? The answer is not generally.

It is important to highlight that there are two different philosophical interpretations of Bayesian uncertainty:

  1. There is a true point-value for \(\Psi[P_0]\), and the posterior distribution captures all aleatoric and prior-epistemic uncertainty. This perspective is supported by prominent Bayesian researchers such as Greenland (2006) and Gelman et al. (2013).
  2. There is no single true \(\Psi[P_0]\), as it is inherently nondeterministic, and thus collapsing the posterior distribution into a point value is meaningless. This perspective is appropriate for quantum systems.

Under the second philosophy, there is no issue with the solution provided in Q.1., as samples from the posterior distribution are all that is needed. In contrast, adopting the first philosophy implies that frequentist-like coverage is important for uncertainty quantification, as we want to evaluate how well the credible intervals perform at containing the true value.

Condition for frequentist-like coverage of Bayesian posteriors

Assuming the first philosophy, we can ask: under what conditions would the Bayesian procedure proposed in Q.1. generate credible intervals with nominal coverage? The answer is precisely whenever \(\Psi[P_0]\) is pathwise differentiable at \(P_0\) (Dümbgen 1993; Fang and Santos 2019; Kitagawa et al. 2020)

Q.3. RAL estimator

An asymptotically linear estimator that is also regular is called… regular asymptotically linear (RAL).

Regular estimator

An estimator \(\hat{\psi}_n\) for \(\Psi[P_0]\) is called regular if the asymptotic distribution of \[ \sqrt{n}\left(\hat{\psi}_n - \Psi\left[\left(1-\frac{1}{\sqrt{n}}\right)P_0+\frac{1}{\sqrt{n}}\tilde{P}\right] \right) \] is the same for all \(\tilde{P}\). This is, this asymptotic distribution does not depend on the scores \(\frac{d\tilde{P}}{dP_0}-1\).

In simpler terms, a regular estimator is one whose asymptotic distribution is not affected by slight fluctuations of the true distribution \(P_0\).

It has been proven that, if \(\Psi[P_0]\) is not pathwise differentiable at \(P_0\), then no regular estimator of \(\Psi[P_0]\) exists (Hirano and Porter 2012). So, again, we need pathwise differentiability!

Pathwise differentiability

The function \(\operatorname{ReLU}(\cdot)=\max\{0,\cdot\}\) is differentiable everywhere on the real line, except at zero. Consequently, one condition for pathwise differentiability of \(\Psi[P_0] = \mathbb{E}\operatorname{ReLU}(\mathbb{E}[Y\mid X])\) at \(P_0\) is that \(\mathbb{E}[Y\mid X]\) is \(P_0\)-almost surely different from zero: \[ \int \mathbb{I}(\mathbb{E}[Y\mid X]=0)\, dP_0 = 0. \]

If \(X\) is discrete, this implies that there should be no stratum \(X=x\) with positive probability where the conditional mean of \(Y\) is equal to zero.

It is important to note that whether the condition is met is determined by the true distribution \(P_0\) (relative to the target estimand), not by the models used. Distributions that violate this condition are referred to as exceptional (Robins 2004; Luedtke and van der Laan 2016).

It is not hard to envision exceptional cases. For instance, if the target is the lower bound for the probability of benefit, then we are dealing with the estimand \(\mathbb{E}\operatorname{ReLU}[\operatorname{CATE}(X)]\). An immune stratum \(X=x_0\), could exist where the treatment has no effect—neither benefit nor harm. Note that in a quantum-Bayesian framework, if \(Y\) is continuous, the true \(\operatorname{CATE}(X)\) is a distribution, not a point-value, so we can say it is not exactly zero.

Summary so far

We aim to estimate \(\Psi[P_0]= \mathbb{E}\operatorname{ReLU}(\mathbb{E}[Y\mid X])\) and perform inference with valid uncertainty quantification through confidence or credible intervals that achieve nominal frequentist-like coverage.

If the true distribution \(P_0\) is nonexceptional, we can construct a one-step corrected estimator that is RAL under mild conditions, as well as a mixed-type Bayesian estimator that would be consistent under correct specification.

If the true distribution \(P_0\) is exceptional, no RAL estimator exists, and Bayesian credible intervals will not be asymptotically equivalent to valid confidence intervals. What can we do in this case then?

Now we sketch two potentials solutions: smooth surrogates and the online one-step estimator (OOSE).

Possible solution 1: Smooth surrogates

Consider replacing the original target estimand \(\Psi[P_0]\) for the following one: \[ \Gamma[P_0] = \mathbb{E}\operatorname{GELU}_h(\mathbb{E}[Y\mid X]), \]

where GELU stands for the Gaussian Error Linear Unit, a smooth approximation of ReLU function given by \(\operatorname{GELU}_h(x)=x\Phi(x/h)\), where \(\Phi\) is the Gaussian CDF and \(h\geq 0\) is a smoothing hyperparameter (Hendrycks and Gimpel 2016). Notice that, in the limit \(h\rightarrow 0\), one has \(\operatorname{GELU}_h(\cdot)\rightarrow \operatorname{ReLU}(\cdot)\).

Advantages

  • \(\Gamma[P_0]\) is pathways differentiable, so one can construct RAL estimators and semiparametric Bayesian point-estimators with valid credible/confidence intervals.

Disadvantages

  • The original target estimand has been replaced with a surrogate, resulting in a different interpretation from the intended parameter. This is known as smoothing bias.
  • How should hyperparameter \(h\) be selected?
  • It is not data-adaptive. If the distribution is actually nonexceptional, smoothing is unnecessary and introduces bias without benefit.

Possible solution 2: Online one-step estimator (OOSE)

Paper under construction!

References

Bickel, P. J., C. A. J. Klaassen, Y. Ritov, and J. A. Wellner. 1998. Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins Series in the Mathematical Sciences. Springer New York. https://books.google.com/books?id=lSnTm6SC_SMC.
Dümbgen, Lutz. 1993. “On Nondifferentiable Functions and the Bootstrap.” Probability Theory and Related Fields 95: 125–40.
Fang, Zheng, and Andres Santos. 2019. “Inference on Directionally Differentiable Functions.” The Review of Economic Studies 86 (1 (306)): 377–412.
Gelman, A., J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, and D. B. Rubin. 2013. Bayesian Data Analysis, Third Edition. Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis. https://books.google.com.co/books?id=ZXL6AQAAQBAJ.
Greenland, Sander. 2006. Bayesian perspectives for epidemiological research: I. Foundations and basic methods.” International Journal of Epidemiology 35 (3): 765–75. https://doi.org/10.1093/ije/dyi312.
Hendrycks, Dan, and Kevin Gimpel. 2016. “Gaussian Error Linear Units (GELUs).” arXiv Preprint arXiv:1606.08415.
Hines, Oliver, Oliver Dukes, Karla Diaz-Ordaz, and Stijn Vansteelandt. 2022. “Demystifying Statistical Learning Based on Efficient Influence Functions.” The American Statistician 76 (3): 292–304.
Hirano, Keisuke, and Jack R. Porter. 2012. “Impossibility Results for Nondifferentiable Functionals.” Econometrica 80 (4): 1769–90. http://www.jstor.org/stable/23271416.
Kitagawa, Toru, José Luis Montiel Olea, Jonathan Payne, and Amilcar Velez. 2020. “Posterior Distribution of Nondifferentiable Functions.” Journal of Econometrics 217 (1): 161–75. https://doi.org/https://doi.org/10.1016/j.jeconom.2019.10.009.
Luedtke, Alexander R, and Mark J van der Laan. 2016. “Statistical Inference for the Mean Outcome Under a Possibly Non-Unique Optimal Treatment Strategy.” Annals of Statistics 44 (2): 713.
Robins, James M. 2004. “Optimal Structural Nested Models for Optimal Sequential Decisions.” In Proceedings of the Second Seattle Symposium in Biostatistics, 179–326. Lecture Notes in Statististics. Springer, New York.
Tian, Jin, and Judea Pearl. 2000. “Probabilities of Causation: Bounds and Identification.” Annals of Mathematics and Artificial Intelligence 28 (1): 287–313. https://doi.org/10.1023/A:1018912507879.
van der Laan, Mark J. 1995. Efficient and Inefficient Estimation in Semiparametric Models. CWI Tract - Centrum Voor Wiskunde En Informatica. Centrum voor Wiskunde en Informatica. https://books.google.com/books?id=ZwjvAAAAMAAJ.
van der Vaart, A. W. 2000. Asymptotic Statistics. Asymptotic Statistics. Cambridge University Press. https://books.google.com/books?id=UEuQEM5RjWgC.