Intuition for the Doubly Robust Estimator
Introduction
To estimate the ATE, we can use either outcome regression or inverse propensity weighting (IPW). While each approach has merits, combining them offers a significant advantage: double robustness. In this post, I summarize the intuition for the doubly robust estimator from Professor Ding's textbook (Ding 2024) and connect this framework to debiased machine learning (DML) through Riesz representation theory. These connections give some insight into how modern causal inference methods correct for bias in treatment effect estimation.
Two characterizations of the ATE
Let $Y(0), Y(1)$ be potential outcomes and $D$ be a binary treatment variable. Consider the ATE, $\tau = \E\{Y(1) - Y(0)\}$.
First, we can use the outcome regression,
$$ \tau = \E\{\mu_1(X) - \mu_0(X) \}, $$ where $$ \mu_1(X) = \E\{Y(1) \mid X \} = \E\{Y \mid D = 1, X \}, $$ $$ \mu_0(X) = \E\{Y(0) \mid X \} = \E\{Y \mid D = 0, X \}. $$Second, we can use the inverse propensity score weighting (IPW) approach,
$$\tau = \E\left\{\frac{DY}{e(X)} \right\} - \E\left\{\frac{(1-D)Y}{1-e(X)} \right\},$$where $e(X) = \P(D = 1 \mid X)$ is the propensity score.
The IPW estimator completely ignores the outcome model. However, if there exist covariates $X$ that are predictive of $Y$, then including an outcome model, even a misspecified one, can reduce the variance compared to using IPW alone.
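To make the two characterizations concrete, here is a minimal simulation sketch (the data-generating process is made up for illustration; the propensity score is assumed known) that computes both the outcome-regression and IPW estimates of the ATE:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical observational data with true ATE = 2.0; X confounds D and Y.
X = rng.normal(size=n)
e = 1 / (1 + np.exp(-X))               # true propensity score P(D=1 | X)
D = rng.binomial(1, e)
Y = 2.0 * D + X + rng.normal(size=n)

# Outcome regression: fit E[Y | D, X] by OLS; the D coefficient is the ATE
# under this (correctly specified) linear model.
design = np.column_stack([np.ones(n), D, X])
beta = np.linalg.lstsq(design, Y, rcond=None)[0]
tau_or = beta[1]

# IPW: weight observed outcomes by the inverse propensity score.
tau_ipw = np.mean(D * Y / e) - np.mean((1 - D) * Y / (1 - e))

print(tau_or, tau_ipw)  # both should be close to the true ATE of 2.0
```

Both estimators are consistent here; in practice the IPW estimate is noticeably noisier, which motivates the variance-reduction argument below.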
Key Insights
This motivates combining the two approaches to:
1. Reduce the variance of the IPW estimator
2. Reduce the bias of the outcome regression
Reducing the Variance
Write $\mu_1 = E\{Y(1)\}$ and $\mu_0 = E\{Y(0)\}$, and let $\mu_1(X, \beta_1)$, $\mu_0(X, \beta_0)$ denote working outcome models. Then
$$ \mu_1=E\{Y(1)\}=E\left\{Y(1)-\mu_1\left(X, \beta_1\right)\right\}+E\left\{\mu_1\left(X, \beta_1\right)\right\}. $$
Idea:
- View $Y-\mu_1\left(X, \beta_1\right)$ as a “pseudo potential outcome”.
- Then apply IPW to it:
$$ \begin{aligned} \mu_1 & =E\left\{\frac{D\left\{Y-\mu_1\left(X, \beta_1\right)\right\}}{e(X)}\right\}+E\left\{\mu_1\left(X, \beta_1\right)\right\} \\ & =E\left\{\frac{D\left\{Y-\mu_1\left(X, \beta_1\right)\right\}}{e(X)}+\mu_1\left(X, \beta_1\right)\right\}. \end{aligned} $$
Similarly,
$$ \begin{aligned} \mu_0 & =E\left\{\frac{(1-D)\left\{Y-\mu_0\left(X, \beta_0\right)\right\}}{1-e(X)}\right\}+E\left\{\mu_0\left(X, \beta_0\right)\right\} \\ & =E\left\{\frac{(1-D)\left\{Y-\mu_0\left(X, \beta_0\right)\right\}}{1-e(X)}+\mu_0\left(X, \beta_0\right)\right\}, \end{aligned} $$Notice that,
$$ \begin{aligned} \mu_1 - \mu_0 & = E\left\{\frac{D\left\{Y-\mu_1\left(X, \beta_1\right)\right\}}{e(X)}+\mu_1\left(X, \beta_1\right)\right\} \\ &\quad - E\left\{\frac{(1-D)\left\{Y-\mu_0\left(X, \beta_0\right)\right\}}{1-e(X)}+\mu_0\left(X, \beta_0\right)\right\} \end{aligned} $$Replacing the expectations in $\mu_1 - \mu_0$ with sample averages gives the AIPW estimator.
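Replacing the expectations with sample averages gives a minimal AIPW sketch. In this hypothetical simulation the outcome models are deliberately misspecified (linear fits to a nonlinear outcome) while the propensity score is known, so AIPW should remain consistent:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical data: nonlinear outcome, true ATE = 2, known propensity score.
X = rng.normal(size=n)
e = 1 / (1 + np.exp(-X))
D = rng.binomial(1, e)
Y = 2.0 * D + np.sin(2 * X) + X**2 + rng.normal(size=n)

# Deliberately misspecified outcome models: linear in X within each arm.
def fit_ols(Xa, Ya):
    Za = np.column_stack([np.ones(len(Xa)), Xa])
    return np.linalg.lstsq(Za, Ya, rcond=None)[0]

b1 = fit_ols(X[D == 1], Y[D == 1])
b0 = fit_ols(X[D == 0], Y[D == 0])
mu1 = b1[0] + b1[1] * X   # mu_1(X, beta_1)
mu0 = b0[0] + b0[1] * X   # mu_0(X, beta_0)

# AIPW: outcome-model contrast plus IPW-weighted residual correction.
tau_aipw = np.mean(D * (Y - mu1) / e + mu1) \
         - np.mean((1 - D) * (Y - mu0) / (1 - e) + mu0)
print(tau_aipw)  # consistent because the propensity score is correct
```

This illustrates double robustness from one side: a wrong outcome model does no harm as long as the propensity score is right.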
Reducing the Bias
$$ \mu_1 =E\left\{\frac{D\left\{Y-\mu_1\left(X, \beta_1\right)\right\}}{e(X)}\right\}+E\left\{\mu_1\left(X, \beta_1\right)\right\} $$Idea: We can view $Y-\mu_1\left(X, \beta_1\right)$ as the regression residual, to which we apply IPW to extract the remaining signal. Alternatively, we can view $\mu_1\left(X, \beta_1\right) - Y$ as the bias of the outcome model, which the IPW term then corrects.
Connecting to DML: Riesz Representation for Bias Correction
In the generic debiased framework (Chernozhukov, Newey, and Singh 2022), the Riesz representer plays the role that IPW played above: it extracts the signal from the residuals and corrects the bias of the plug-in moment.
- The data are $Z = (Y, D, X)$.
- Let $g$ be the outcome regression, $g(D, X)=E[Y \mid D, X]$.
- Suppose the parameter of interest satisfies $\theta = E[m(Z ; g)]$ for some moment function $m(\cdot)$ that is linear in $g$.

Then the debiased version of the moment condition is:
$$ \theta-E[m(Z ; g)+a(D, X) \cdot(Y-g(D, X))]=0, $$
where $a(D, X)$ is the Riesz representer of the linear functional $L(g) := E[m(Z ; g)]$. Its existence is guaranteed by the Riesz representation theorem: for all square-integrable $g(\cdot)$, we have
$$ E[m(Z ; g)]=E[a(D, X) \cdot g(D, X)]. $$
For the ATE, the moment function is $m(Z ; g) = g(1, X) - g(0, X)$, and the Riesz representer is just an inverse-propensity-score term,
$$ a(D, X):=\frac{D}{e(X)}-\frac{(1-D)}{1-e(X)}, $$
Consider the expression
$$ E[m(Z ; g)+a(D, X) \cdot(Y-g(D, X))]. $$
Plugging in $m(Z ; g) = g(1, X) - g(0, X)$ and the representer above recovers exactly the AIPW identity from the previous section, with $g(1, X)$ and $g(0, X)$ playing the roles of $\mu_1(X, \beta_1)$ and $\mu_0(X, \beta_0)$.
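As a numerical sanity check, we can verify the Riesz identity $E[m(Z ; g)] = E[a \cdot g]$ for the ATE functional $m(Z ; g) = g(1, X) - g(0, X)$. The test function $g$ and the distribution of $(D, X)$ below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Hypothetical joint distribution of (D, X) with known propensity score.
X = rng.normal(size=n)
e = 1 / (1 + np.exp(-X))
D = rng.binomial(1, e)

# An arbitrary square-integrable test function g(d, x).
def g(d, x):
    return d * np.cos(x) + x**2

# Left side: the ATE moment, E[m(Z; g)] = E[g(1, X) - g(0, X)].
lhs = np.mean(g(1, X) - g(0, X))

# Right side: E[a(D, X) * g(D, X)] with the propensity-score representer.
a = D / e - (1 - D) / (1 - e)
rhs = np.mean(a * g(D, X))

print(lhs, rhs)  # the two Monte Carlo averages should nearly agree
```

The agreement holds for any square-integrable $g$, which is exactly what makes the representer useful: it converts a functional of $g$ into a weighted average of $g$ itself.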
Doubly robust estimation of the ATT
How about the doubly robust estimator for the ATT? Can we use the same idea to derive it? Yes!
We also have a doubly robust estimator for $\E\{Y(0) \mid D=1\}$ which combines the propensity score and the outcome models.
How to come up with $\tilde{\mu}_{0 \mathrm{T}}^{\mathrm{dr}}$ ? Exactly the same idea as before!
$$ \E\{Y(0) \mid D=1\} = \E\{Y(0) - \mu_0\left(X, \beta_0\right) \mid D=1\} + \E\{\mu_0\left(X, \beta_0\right) \mid D=1\} $$Now, we can view $Y(0) - \mu_0\left(X, \beta_0\right)$ as a “pseudo potential outcome” under the control and apply the ATT version of IPW, which weights control units by the odds $e(X)/\{1-e(X)\}$; this yields the form of $\tilde{\mu}_{0 \mathrm{T}}^{\mathrm{dr}}$.
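Concretely, this derivation leads to an estimator that weights control residuals by the odds $e(X)/\{1-e(X)\}$ and normalizes by $\P(D=1)$. A minimal simulation sketch, with a hypothetical data-generating process, a known propensity score, and a deliberately misspecified control-arm outcome model:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Hypothetical setup: constant treatment effect 2, so ATT = ATE = 2.
X = rng.normal(size=n)
e = 1 / (1 + np.exp(-X))
D = rng.binomial(1, e)
Y = 2.0 * D + X + X**2 + rng.normal(size=n)

# Misspecified control-arm outcome model (linear; the truth is quadratic).
Z0 = np.column_stack([np.ones((D == 0).sum()), X[D == 0]])
b0 = np.linalg.lstsq(Z0, Y[D == 0], rcond=None)[0]
mu0 = b0[0] + b0[1] * X    # mu_0(X, beta_0) evaluated at every unit

pi = D.mean()              # estimate of P(D = 1)

# Doubly robust estimate of E{Y(0) | D=1}: weight control residuals by the
# odds e(X)/(1-e(X)), then add back the outcome model averaged over treated.
mu0_dr = np.mean(e * (1 - D) * (Y - mu0) / (1 - e)) / pi \
       + np.mean(D * mu0) / pi

tau_att = Y[D == 1].mean() - mu0_dr
print(tau_att)  # should be close to the true ATT of 2
```

As with the ATE, the estimate stays consistent despite the wrong outcome model because the propensity score is correct; the symmetric protection holds when the roles are reversed.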
Reference
Ding, Peng (2024), A First Course in Causal Inference, Chapman & Hall/CRC.
Chernozhukov, Victor, Whitney K. Newey, and Rahul Singh (2022), “Automatic Debiased Machine Learning of Causal and Structural Effects,” Econometrica, 90 (3), 967–1027.