Intuition for the Doubly Robust Estimator
Introduction
To estimate the ATE, we can use either outcome regression or inverse propensity weighting (IPW). While each approach has merits, combining them offers a significant advantage: double robustness. In this post, I summarize the intuition for the doubly robust estimator from Professor Ding's textbook (Ding 2024) and connect this framework to debiased machine learning (DML) through Riesz representation theory. These connections give some insight into how modern causal inference methods correct for bias in treatment effect estimation.
Two characterizations of the ATE
Let $Y(0), Y(1)$ be potential outcomes and $D$ be a binary treatment variable. Consider the ATE, $\tau = \E\{Y(1) - Y(0)\}$.
First, we can use the outcome regression,
$$ \tau = \E\{\mu_1(X) - \mu_0(X) \}, $$ where $$ \mu_1(X) = \E\{Y(1) \mid X \} = \E\{Y \mid D = 1, X \}, $$ $$ \mu_0(X) = \E\{Y(0) \mid X \} = \E\{Y \mid D = 0, X \}. $$Second, we can use the inverse propensity score weighting (IPW) approach,
$$\tau = \E\left\{\frac{DY}{e(X)} \right\} - \E\left\{\frac{(1-D)Y}{1-e(X)} \right\},$$where $e(X) = \P(D = 1 \mid X)$ is the propensity score.
The IPW estimator completely ignores the outcome model. However, if there exist covariates $X$ that are predictive of $Y$, then including an outcome model, even a misspecified one, can reduce the variance compared to using IPW alone.
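To make the two characterizations concrete, here is a minimal simulation sketch (the data-generating process is made up for illustration; the propensity score is assumed known) that computes both the outcome-regression and IPW estimates of the ATE:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical observational data with true ATE = 2.0; X confounds D and Y.
X = rng.normal(size=n)
e = 1 / (1 + np.exp(-X))               # true propensity score P(D=1 | X)
D = rng.binomial(1, e)
Y = 2.0 * D + X + rng.normal(size=n)

# Outcome regression: fit E[Y | D, X] by OLS; the D coefficient is the ATE
# under this (correctly specified) linear model.
design = np.column_stack([np.ones(n), D, X])
beta = np.linalg.lstsq(design, Y, rcond=None)[0]
tau_or = beta[1]

# IPW: weight observed outcomes by the inverse propensity score.
tau_ipw = np.mean(D * Y / e) - np.mean((1 - D) * Y / (1 - e))

print(tau_or, tau_ipw)  # both should be close to the true ATE of 2.0
```

Both estimators are consistent here; in practice the IPW estimate is noticeably noisier, which motivates the variance-reduction argument below.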
Key Insights
This motivates combining the two approaches to:
1. Reduce the variance of the IPW estimator
2. Reduce the bias of the outcome regression
Reducing the Variance
Write $\mu_1 = E\{Y(1)\}$ and $\mu_0 = E\{Y(0)\}$, and let $\mu_1(X, \beta_1)$, $\mu_0(X, \beta_0)$ denote working outcome models. Then
$$ \mu_1=E\{Y(1)\}=E\left\{Y(1)-\mu_1\left(X, \beta_1\right)\right\}+E\left\{\mu_1\left(X, \beta_1\right)\right\}. $$
Idea:
- View $Y-\mu_1\left(X, \beta_1\right)$ as a “pseudo potential outcome”.
- Then apply IPW to it:
$$ \begin{aligned} \mu_1 & =E\left\{\frac{D\left\{Y-\mu_1\left(X, \beta_1\right)\right\}}{e(X)}\right\}+E\left\{\mu_1\left(X, \beta_1\right)\right\} \\ & =E\left\{\frac{D\left\{Y-\mu_1\left(X, \beta_1\right)\right\}}{e(X)}+\mu_1\left(X, \beta_1\right)\right\}. \end{aligned} $$
Similarly,
$$ \begin{aligned} \mu_0 & =E\left\{\frac{(1-D)\left\{Y-\mu_0\left(X, \beta_0\right)\right\}}{1-e(X)}\right\}+E\left\{\mu_0\left(X, \beta_0\right)\right\} \\ & =E\left\{\frac{(1-D)\left\{Y-\mu_0\left(X, \beta_0\right)\right\}}{1-e(X)}+\mu_0\left(X, \beta_0\right)\right\}, \end{aligned} $$Notice that,
$$ \begin{aligned} \mu_1 - \mu_0 & = E\left\{\frac{D\left\{Y-\mu_1\left(X, \beta_1\right)\right\}}{e(X)}+\mu_1\left(X, \beta_1\right)\right\} \\ &\quad - E\left\{\frac{(1-D)\left\{Y-\mu_0\left(X, \beta_0\right)\right\}}{1-e(X)}+\mu_0\left(X, \beta_0\right)\right\} \end{aligned} $$Replacing the expectations in $\mu_1 - \mu_0$ with sample averages gives the AIPW estimator.
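Replacing the expectations with sample averages gives a minimal AIPW sketch. In this hypothetical simulation the outcome models are deliberately misspecified (linear fits to a nonlinear outcome) while the propensity score is known, so AIPW should remain consistent:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical data: nonlinear outcome, true ATE = 2, known propensity score.
X = rng.normal(size=n)
e = 1 / (1 + np.exp(-X))
D = rng.binomial(1, e)
Y = 2.0 * D + np.sin(2 * X) + X**2 + rng.normal(size=n)

# Deliberately misspecified outcome models: linear in X within each arm.
def fit_ols(Xa, Ya):
    Za = np.column_stack([np.ones(len(Xa)), Xa])
    return np.linalg.lstsq(Za, Ya, rcond=None)[0]

b1 = fit_ols(X[D == 1], Y[D == 1])
b0 = fit_ols(X[D == 0], Y[D == 0])
mu1 = b1[0] + b1[1] * X   # mu_1(X, beta_1)
mu0 = b0[0] + b0[1] * X   # mu_0(X, beta_0)

# AIPW: outcome-model contrast plus IPW-weighted residual correction.
tau_aipw = np.mean(D * (Y - mu1) / e + mu1) \
         - np.mean((1 - D) * (Y - mu0) / (1 - e) + mu0)
print(tau_aipw)  # consistent because the propensity score is correct
```

This illustrates double robustness from one side: a wrong outcome model does no harm as long as the propensity score is right.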
Reducing the Bias
$$ \mu_1 =E\left\{\frac{D\left\{Y-\mu_1\left(X, \beta_1\right)\right\}}{e(X)}\right\}+E\left\{\mu_1\left(X, \beta_1\right)\right\} $$Idea: We can view $Y-\mu_1\left(X, \beta_1\right)$ as the regression residual, to which we apply IPW to extract the remaining signal. Alternatively, we can view $\mu_1\left(X, \beta_1\right) - Y$ as the bias of the outcome model, which the IPW term then corrects.
Connecting to DML: Riesz Representation for Bias Correction
In the generic debiased framework (Chernozhukov, Newey, and Singh 2022), the Riesz representer plays the role that IPW played above: it extracts the signal from the residuals and corrects the bias of the plug-in moment.
- The data are $Z = (Y, D, X)$.
- Let $g$ be the outcome regression, $g(D, X)=E[Y \mid D, X]$.
- Suppose the parameter of interest satisfies $\theta = E[m(Z ; g)]$ for some moment function $m(\cdot)$ that is linear in $g$.

Then the debiased version of the moment condition is:
$$ \theta-E[m(Z ; g)+a(D, X) \cdot(Y-g(D, X))]=0, $$
where $a(D, X)$ is the Riesz representer of the linear functional $L(g) := E[m(Z ; g)]$. Its existence is guaranteed by the Riesz representation theorem: for all square-integrable $g(\cdot)$, we have
$$ E[m(Z ; g)]=E[a(D, X) \cdot g(D, X)]. $$
For the ATE, the moment function is $m(Z ; g) = g(1, X) - g(0, X)$, and the Riesz representer is just an inverse-propensity-score term,
$$ a(D, X):=\frac{D}{e(X)}-\frac{(1-D)}{1-e(X)}, $$
Consider the expression
$$ E[m(Z ; g)+a(D, X) \cdot(Y-g(D, X))]. $$
Plugging in $m(Z ; g) = g(1, X) - g(0, X)$ and the representer above recovers exactly the AIPW identity from the previous section, with $g(1, X)$ and $g(0, X)$ playing the roles of $\mu_1(X, \beta_1)$ and $\mu_0(X, \beta_0)$.
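As a numerical sanity check, we can verify the Riesz identity $E[m(Z ; g)] = E[a \cdot g]$ for the ATE functional $m(Z ; g) = g(1, X) - g(0, X)$. The test function $g$ and the distribution of $(D, X)$ below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Hypothetical joint distribution of (D, X) with known propensity score.
X = rng.normal(size=n)
e = 1 / (1 + np.exp(-X))
D = rng.binomial(1, e)

# An arbitrary square-integrable test function g(d, x).
def g(d, x):
    return d * np.cos(x) + x**2

# Left side: the ATE moment, E[m(Z; g)] = E[g(1, X) - g(0, X)].
lhs = np.mean(g(1, X) - g(0, X))

# Right side: E[a(D, X) * g(D, X)] with the propensity-score representer.
a = D / e - (1 - D) / (1 - e)
rhs = np.mean(a * g(D, X))

print(lhs, rhs)  # the two Monte Carlo averages should nearly agree
```

The agreement holds for any square-integrable $g$, which is exactly what makes the representer useful: it converts a functional of $g$ into a weighted average of $g$ itself.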
Doubly robust estimation of the ATT
How about the doubly robust estimator for the ATT? Can we use the same idea to derive it? Yes!
We also have a doubly robust estimator for $\E\{Y(0) \mid D=1\}$ which combines the propensity score and the outcome models.
How to come up with $\tilde{\mu}_{0 \mathrm{T}}^{\mathrm{dr}}$ ? Exactly the same idea as before!
$$ \E\{Y(0) \mid D=1\} = \E\{Y(0) - \mu_0\left(X, \beta_0\right) \mid D=1\} + \E\{\mu_0\left(X, \beta_0\right) \mid D=1\} $$Now, we can view $Y(0) - \mu_0\left(X, \beta_0\right)$ as a “pseudo potential outcome” under the control and apply the ATT version of IPW, which weights control units by the odds $e(X)/\{1-e(X)\}$; this yields the form of $\tilde{\mu}_{0 \mathrm{T}}^{\mathrm{dr}}$.
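Concretely, this derivation leads to an estimator that weights control residuals by the odds $e(X)/\{1-e(X)\}$ and normalizes by $\P(D=1)$. A minimal simulation sketch, with a hypothetical data-generating process, a known propensity score, and a deliberately misspecified control-arm outcome model:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Hypothetical setup: constant treatment effect 2, so ATT = ATE = 2.
X = rng.normal(size=n)
e = 1 / (1 + np.exp(-X))
D = rng.binomial(1, e)
Y = 2.0 * D + X + X**2 + rng.normal(size=n)

# Misspecified control-arm outcome model (linear; the truth is quadratic).
Z0 = np.column_stack([np.ones((D == 0).sum()), X[D == 0]])
b0 = np.linalg.lstsq(Z0, Y[D == 0], rcond=None)[0]
mu0 = b0[0] + b0[1] * X    # mu_0(X, beta_0) evaluated at every unit

pi = D.mean()              # estimate of P(D = 1)

# Doubly robust estimate of E{Y(0) | D=1}: weight control residuals by the
# odds e(X)/(1-e(X)), then add back the outcome model averaged over treated.
mu0_dr = np.mean(e * (1 - D) * (Y - mu0) / (1 - e)) / pi \
       + np.mean(D * mu0) / pi

tau_att = Y[D == 1].mean() - mu0_dr
print(tau_att)  # should be close to the true ATT of 2
```

As with the ATE, the estimate stays consistent despite the wrong outcome model because the propensity score is correct; the symmetric protection holds when the roles are reversed.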
Reference
Ding, Peng (2024), A First Course in Causal Inference, Chapman & Hall/CRC.
Chernozhukov, Victor, Whitney K. Newey, and Rahul Singh (2022), “Automatic Debiased Machine Learning of Causal and Structural Effects,” Econometrica, 90 (3), 967–1027.