Notes on Propensity Score Methods

Jun 3, 2025 6 min read causal inference, econometrics

Introduction

Here are my notes on propensity scores, mainly from Prof. Ding’s textbook (2024).

The traditional propensity score analysis workflow is shown in the image below, which I will not cover in detail. Instead, I will summarize the key theorems and results from Ding’s textbook.

Image from Chap3.3_observational_PS, page 16 — Figure 1: Traditional propensity score analysis workflow

I will also draw some connections to the Riesz representer (RR).

Why connect with the Riesz Representer (RR)? The connection provides a powerful generalization of the foundational Rosenbaum-Rubin (1983) result.
Rosenbaum and Rubin showed that conditioning on the propensity score is sufficient for removing confounding bias when estimating causal effects. The Riesz representer extends this principle: it suffices to regress on the Riesz representer to obtain unbiased estimates of the average treatment effect.
The key insight is that the Riesz representer, like the propensity score, serves as a sufficient statistic – it captures all the confounding information necessary for unbiased estimation of your target causal parameter.

Setting & Notation

Binary treatment $Z$
Potential outcomes $\{Y(0), Y(1)\}$
Propensity score: $e(X) = P(Z = 1 \mid X)$, where $X$ represents covariates

Two approaches to learning causal relationships:

Outcome process (via outcome regression)
Treatment assignment mechanism (via propensity score)

The following summarizes the key theorems and results related to propensity scores from Prof. Ding’s textbook.

1. The propensity score as a dimension reduction tool

Theorem 1 (pscore as dimension reduction tool).

If $Z \perp\!\!\!\perp \{Y(1), Y(0)\} \mid X$, then $Z \perp\!\!\!\perp \{Y(1), Y(0)\} \mid e(X)$.

Covariates $X$ can be high dimensional, but the propensity score, $e(X) \in \mathbb{R}$, is a one-dimensional scalar
We can therefore view the propensity score as a dimension reduction tool

2. Propensity score stratification

Idea: Discretize the estimated propensity score by its $K$ quantiles, and treat unconfoundedness as holding approximately given the discretized score $\hat{e}'(X)$:
$Z \perp\!\!\!\perp \{Y(1), Y(0)\} \mid \hat{e}'(X) = e_{k} \quad (k = 1, \dots, K).$
Estimate the ATE within each subclass, then average the subclass estimates weighted by block size
Advantage: The propensity score stratification estimator only requires the correct ordering of the estimated propensity scores rather than their exact values, which makes it relatively robust to misspecification of the propensity score model compared with other methods
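The stratification recipe above can be sketched in a few lines of NumPy. This is only a minimal illustration under my own assumptions: the function name `stratified_ate`, the simulated data, and the use of the true propensity score in place of an estimated one are all hypothetical choices, not something taken from the textbook.

```python
import numpy as np

def stratified_ate(y, z, e_hat, n_strata=5):
    """Propensity score stratification estimate of the ATE (illustrative sketch)."""
    y, z, e_hat = map(np.asarray, (y, z, e_hat))
    # Interior quantile cut points of the (estimated) propensity score
    cuts = np.quantile(e_hat, np.linspace(0, 1, n_strata + 1)[1:-1])
    strata = np.searchsorted(cuts, e_hat, side="right")  # labels 0..K-1
    n = len(y)
    tau = 0.0
    for k in range(n_strata):
        in_k = strata == k
        treated = in_k & (z == 1)
        control = in_k & (z == 0)
        if treated.sum() == 0 or control.sum() == 0:
            raise ValueError(f"stratum {k} lacks treated or control units")
        # Difference in means within the stratum, weighted by block size
        diff = y[treated].mean() - y[control].mean()
        tau += in_k.sum() / n * diff
    return tau

# Simulated example (assumed data-generating process, true ATE = 2)
rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))            # propensity score, known here
z = rng.binomial(1, e)
y = 2.0 * z + x + rng.normal(size=n)

tau_strat = stratified_ate(y, z, e, n_strata=5)
tau_naive = y[z == 1].mean() - y[z == 0].mean()
print(tau_strat, tau_naive)
```

With only five strata some within-stratum confounding remains, so the stratified estimate should be much closer to the truth than the naive difference in means without matching it exactly.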

3. Propensity score weighting

Theorem 2 (Inverse propensity score weighting (IPW)).

If $Z \perp\!\!\!\perp \{Y(1), Y(0)\} \mid X$ and $0 < e(X) < 1$, then

$$E\{Y(1)\} = E\left\{\frac{Z Y}{e(X)}\right\}, \quad E\{Y(0)\} = E\left\{\frac{(1 - Z) Y}{1 - e(X)}\right\}$$

and

$$\begin{aligned} \tau &= E\{Y(1) - Y(0)\} \\ &= E\left\{\frac{Z Y}{e(X)} - \frac{(1 - Z) Y}{1 - e(X)}\right\} \\ &= E\{H Y\}, \end{aligned}$$

where

$$H := \frac{Z}{e(X)} - \frac{1 - Z}{1 - e(X)}$$

is called the Horvitz-Thompson transform.
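For my own reference, here is a short derivation of the first identity. It is a sketch that assumes the usual consistency relation $Y = Z Y(1) + (1 - Z) Y(0)$, which these notes do not state explicitly; the identity for $E\{Y(0)\}$ follows in the same way.

```latex
\begin{aligned}
E\left\{\frac{Z Y}{e(X)}\right\}
&= E\left\{\frac{Z\, Y(1)}{e(X)}\right\}
   && \text{since } Z Y = Z\, Y(1) \\
&= E\left[\frac{E\{Z \mid X\}\; E\{Y(1) \mid X\}}{e(X)}\right]
   && \text{by } Z \perp\!\!\!\perp Y(1) \mid X \\
&= E\left[ E\{Y(1) \mid X\} \right]
   && \text{since } E\{Z \mid X\} = e(X) > 0 \\
&= E\{Y(1)\}.
\end{aligned}
```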

Connection to the Riesz Representer (RR)

Remark 1 (RR in the case of ATE).

In the case of the ATE, the Riesz representer, $\alpha(Z, X)$, has the same form as the Horvitz-Thompson transform above: $\alpha(Z, X) = \frac{Z}{e(X)} - \frac{1 - Z}{1 - e(X)}$
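To see why this $\alpha$ deserves the name, one can check the defining (Riesz) property directly. The sketch below assumes $g(z, x)$ is any square-integrable function of treatment and covariates; the symbol $g$ is my notation, not the textbook's.

```latex
\begin{aligned}
E\{\alpha(Z, X)\, g(Z, X)\}
&= E\left\{\frac{Z\, g(1, X)}{e(X)} - \frac{(1 - Z)\, g(0, X)}{1 - e(X)}\right\} \\
&= E\left\{\frac{e(X)\, g(1, X)}{e(X)} - \frac{\{1 - e(X)\}\, g(0, X)}{1 - e(X)}\right\} \\
&= E\{g(1, X) - g(0, X)\}.
\end{aligned}
```

Taking $g(z, x) = E(Y \mid Z = z, X = x)$ recovers $\tau = E\{\alpha(Z, X)\, Y\} = E\{H Y\}$ under unconfoundedness, which matches Theorem 2.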

3.1 Estimation

The sample version of IPW is called the Horvitz–Thompson (HT) estimator:
${\hat{τ}}^{ht} = \frac{1}{n} \sum_{i = 1}^{n} \frac{Z_{i} Y_{i}}{\hat{e} (X_{i})} - \frac{1}{n} \sum_{i = 1}^{n} \frac{(1 - Z_{i}) Y_{i}}{1 - \hat{e} (X_{i})}$
The HT estimator ${\hat{τ}}^{ht}$ has some well-known problems
Problem: lack of invariance. If we replace $Y_{i}$ by $Y_{i} + c$, then ${\hat{τ}}^{ht}$ changes by an amount that depends on $c$. This is not reasonable, because adding a constant to every outcome should not change the treatment effect.
Solution: normalize the weights, which gives the Hajek estimator
${\hat{τ}}^{hajek} = \frac{\sum_{i = 1}^{n} \frac{Z_{i} Y_{i}}{\hat{e} (X_{i})}}{\sum_{i = 1}^{n} \frac{Z_{i}}{\hat{e} (X_{i})}} - \frac{\sum_{i = 1}^{n} \frac{(1 - Z_{i}) Y_{i}}{1 - \hat{e} (X_{i})}}{\sum_{i = 1}^{n} \frac{1 - Z_{i}}{1 - \hat{e} (X_{i})}} .$
The Hajek estimator is invariant to location transformations of the outcome
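The invariance point is easy to check numerically. The following is a minimal sketch; the function names `ht_estimator` and `hajek_estimator` and the simulated data are my own assumptions, and the true propensity score stands in for $\hat{e}(X_i)$.

```python
import numpy as np

def ht_estimator(y, z, e_hat):
    """Horvitz-Thompson estimator: unnormalized inverse propensity weights."""
    return np.mean(z * y / e_hat) - np.mean((1 - z) * y / (1 - e_hat))

def hajek_estimator(y, z, e_hat):
    """Hajek estimator: weights normalized within each treatment arm."""
    w1 = z / e_hat
    w0 = (1 - z) / (1 - e_hat)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

# Simulated example (assumed data-generating process)
rng = np.random.default_rng(1)
n = 2_000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))
z = rng.binomial(1, e)
y = 1.0 * z + x + rng.normal(size=n)

# Shift every outcome by the same constant c and compare the estimates
c = 100.0
ht_shift = ht_estimator(y + c, z, e) - ht_estimator(y, z, e)
hajek_shift = hajek_estimator(y + c, z, e) - hajek_estimator(y, z, e)
print(ht_shift, hajek_shift)
```

The HT shift equals $c$ times the difference between the two average weights, which is generally nonzero in a finite sample, while the Hajek shift is zero up to floating-point error.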

3.2 Strong overlap condition

Many asymptotic analyses require a strong overlap condition,

$0 < α_{L} \leq e (X) \leq α_{U} < 1.$

In practice:

Crump et al. (2009) suggested $α_{L} = 0.1$ and $α_{U} = 0.9$
Kurth et al. (2005) suggested $α_{L} = 0.05$ and $α_{U} = 0.95$
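In code, applying these thresholds amounts to a simple filter on the estimated propensity scores. The helper name `trim_by_propensity` and the toy scores below are hypothetical; note also that dropping units changes the population the estimate refers to.

```python
import numpy as np

def trim_by_propensity(e_hat, alpha_l=0.1, alpha_u=0.9):
    """Boolean mask of units whose estimated propensity score lies in [alpha_l, alpha_u]."""
    e_hat = np.asarray(e_hat)
    return (e_hat >= alpha_l) & (e_hat <= alpha_u)

# Toy estimated propensity scores (made up for illustration)
e_hat = np.array([0.02, 0.08, 0.30, 0.55, 0.92, 0.97])

keep_crump = trim_by_propensity(e_hat)              # thresholds 0.1 and 0.9
keep_kurth = trim_by_propensity(e_hat, 0.05, 0.95)  # thresholds 0.05 and 0.95
print(keep_crump, keep_kurth)
```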

4. Balancing property

Theorem 3 (balancing property).

Conditional on $e(X)$, the treatment and the covariates are independent: $Z \perp\!\!\!\perp X \mid e(X)$
Within the same level of the propensity score, the covariate distributions are balanced across the treatment and control groups
Useful implication: we can check whether the propensity score model is specified well enough to ensure covariate balance in the data
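One common way to run such a check is to compare covariate means across arms before and after inverse propensity weighting. The sketch below is my own illustration: the function `weighted_smd` (a standardized mean difference) and the simulated data are assumptions, not a diagnostic prescribed by the textbook.

```python
import numpy as np

def weighted_smd(x, z, w):
    """Standardized difference in weighted covariate means between treated and control."""
    x, z, w = map(np.asarray, (x, z, w))
    t, c = z == 1, z == 0
    mean_t = np.average(x[t], weights=w[t])
    mean_c = np.average(x[c], weights=w[c])
    # Scale by the pooled unweighted standard deviation of the two arms
    pooled_sd = np.sqrt((x[t].var(ddof=1) + x[c].var(ddof=1)) / 2)
    return (mean_t - mean_c) / pooled_sd

# Simulated example (assumed data-generating process)
rng = np.random.default_rng(2)
n = 20_000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))
z = rng.binomial(1, e)

smd_raw = weighted_smd(x, z, np.ones(n))   # no weighting: clear imbalance
ipw = z / e + (1 - z) / (1 - e)            # inverse propensity weights
smd_ipw = weighted_smd(x, z, ipw)          # after weighting: close to zero
print(smd_raw, smd_ipw)
```

If the weighted difference stays far from zero for some covariate, that is a hint the propensity score model may be misspecified.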

4.1 Propensity score is a balancing score

Definition 1.

Theorem 4 (Propensity score is a balancing score).

This is relevant in subgroup analysis
The conditional independence in (11.5) ensures unconfoundedness holds given the propensity score, within each level of $X_{1}$ . Therefore, we can perform the same analysis based on the propensity score, within each level of $X_{1}$ , yielding estimates for two subgroup effects

5. Doubly Robust or AIPW

The following theorem is summarized from Prof. Wager’s lecture notes (2024).

Theorem 5 (strong double robustness of AIPW estimator).

Define the outcome regression as

$$\mu_{(z)}(x) = E[Y_{i}(z) \mid X_{i} = x],$$

and define the AIPW estimator as

$$\begin{aligned} \hat{\tau}_{AIPW} &= \underbrace{\frac{1}{n} \sum_{i = 1}^{n} \left(\hat{\mu}_{(1)}(X_{i}) - \hat{\mu}_{(0)}(X_{i})\right)}_{\text{outcome regression estimator}} \\ &\quad + \underbrace{\frac{1}{n} \sum_{i = 1}^{n} \left(\frac{Z_{i}}{\hat{e}(X_{i})} \left(Y_{i} - \hat{\mu}_{(1)}(X_{i})\right) - \frac{1 - Z_{i}}{1 - \hat{e}(X_{i})} \left(Y_{i} - \hat{\mu}_{(0)}(X_{i})\right)\right)}_{\text{applying IPW to the regression residuals}}. \end{aligned}$$

If we use estimators $\hat{\mu}_{(z)}(x)$ and $\hat{e}(x)$ that are both consistent, with root-mean-squared error (RMSE) decaying faster than $n^{-\alpha_{\mu}}$ and $n^{-\alpha_{e}}$ respectively, and if furthermore $\alpha_{\mu} + \alpha_{e} \geq 1/2$, then

$$\begin{aligned} \sqrt{n}\,(\hat{\tau}_{AIPW} - \tau) &\Rightarrow N(0, V_{AIPW}), \\ V_{AIPW} &= \operatorname{Var}[\tau(X_{i})] + E\left[\frac{\sigma_{(0)}^{2}(X_{i})}{1 - e(X_{i})}\right] + E\left[\frac{\sigma_{(1)}^{2}(X_{i})}{e(X_{i})}\right], \end{aligned}$$

where $\sigma_{(z)}^{2}(x) = \operatorname{Var}[Y_{i}(z) \mid X_{i} = x]$ and $\tau(x) = \mu_{(1)}(x) - \mu_{(0)}(x)$.
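The AIPW formula translates directly into code once the nuisance estimates are available. This is a minimal sketch under my own assumptions: the function name `aipw_estimator` is hypothetical, the nuisance values are passed in as arrays rather than fitted, and the standard error is the usual plug-in based on the sample variance of the per-unit scores.

```python
import numpy as np

def aipw_estimator(y, z, e_hat, mu1_hat, mu0_hat):
    """AIPW point estimate and plug-in standard error from per-unit scores."""
    y, z, e_hat, mu1_hat, mu0_hat = map(
        np.asarray, (y, z, e_hat, mu1_hat, mu0_hat)
    )
    scores = (
        mu1_hat - mu0_hat                          # outcome regression part
        + z / e_hat * (y - mu1_hat)                # IPW on treated residuals
        - (1 - z) / (1 - e_hat) * (y - mu0_hat)    # IPW on control residuals
    )
    tau_hat = scores.mean()
    se = scores.std(ddof=1) / np.sqrt(len(scores))
    return tau_hat, se

# Simulated example (assumed data-generating process, true ATE = 2)
rng = np.random.default_rng(4)
n = 50_000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))
z = rng.binomial(1, e)
y = 2.0 * z + x + rng.normal(size=n)

# Case 1: both nuisance functions correct
tau_good, se_good = aipw_estimator(y, z, e, 2.0 + x, x)
# Case 2: outcome model badly wrong (all zeros), propensity score correct
tau_bad_mu, se_bad_mu = aipw_estimator(y, z, e, np.zeros(n), np.zeros(n))
print(tau_good, se_good, tau_bad_mu, se_bad_mu)
```

The second case illustrates the weaker, classical sense of double robustness: the estimate stays close to the truth when only the propensity score is right, at the cost of a larger standard error.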

Check my previous post: Intuition for Doubly Robust Estimator
AIPW provides a natural starting point for understanding Double Machine Learning
Key insight of RR in DML framework: Leverage the Riesz Representer, a “generalized version of propensity score” to “correct the bias”

6. Other Estimands related to IPW

More generally, Li et al. (2018a) gave a unified discussion of the causal estimands in observational studies.

Theorem 6 (Ding (2024), Section 13.4).

Summary Table of common estimands:

This table gives a good way to understand and remember the IPW estimator for the ATT
How to remember $τ^{h}$? Apply IPW to the “pseudo outcome” $Y h(X)$, then divide by $E\{h(X)\}$
When the parameter of interest is the ATT, $h(X) = e(X)$, so $E\{h(X)\} = E\{e(X)\} = E\{E(Z \mid X)\} = E(Z) = P(Z = 1) = e$
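Written out, the mnemonic above reads as follows. This is my own transcription of the idea, assuming unconfoundedness and overlap, with $\tau(X)$ the conditional average treatment effect and $e = P(Z = 1)$.

```latex
\begin{aligned}
\tau^{h}
&= \frac{E\{h(X)\, \tau(X)\}}{E\{h(X)\}}
 = \frac{E\left\{ \dfrac{Z\, Y\, h(X)}{e(X)} - \dfrac{(1 - Z)\, Y\, h(X)}{1 - e(X)} \right\}}{E\{h(X)\}}, \\[6pt]
\tau^{ATT}
&= \frac{1}{e}\, E\left\{ Z\, Y - \frac{e(X)}{1 - e(X)}\, (1 - Z)\, Y \right\}
 \qquad \text{with } h(X) = e(X).
\end{aligned}
```

So for the ATT the treated units keep weight 1, while the control units are weighted by the odds $e(X) / \{1 - e(X)\}$.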

7. Propensity Score in Regression

PS as a covariate

Theorem 7 (regression with pscore as a covariate).

Under unconfoundedness, the coefficient of $Z$ in the population OLS fit of $Y \sim 1 + Z + e(X)$ equals

$$\tau_{O} = \frac{E[e(X) \{1 - e(X)\} \tau(X)]}{E[e(X) \{1 - e(X)\}]},$$

which is the overlap-weighted average treatment effect.

Based on the theorem above, we also have:

Corollary 1.

Under unconfoundedness,
1. the coefficient of $Z$ in the population OLS fit of $Y \sim 1 + Z + e(X) + X$ also equals $\tau_{O}$;
2. the coefficient of $Z - e(X)$ in the population OLS fit of $Y \sim [Z - e(X)]$ or $Y \sim 1 + [Z - e(X)]$ also equals $\tau_{O}$.
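A quick simulation helps me believe these results, since they say OLS targets the overlap-weighted effect rather than the ATE. The sketch below checks Theorem 7 and the no-intercept version of Corollary 1, part 2, on a large sample. The data-generating process is entirely my own assumption, chosen so that $\tau(X)$ is heterogeneous and $\tau_{O}$ differs visibly from the ATE; the true $e(X)$ is used in place of an estimate.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-2 * x))        # true propensity score e(X)
tau_x = 1 + x**2                    # heterogeneous effect tau(X)
z = rng.binomial(1, e)
y = x + z * tau_x + rng.normal(size=n)   # Y(0) = X + noise, Y(1) = Y(0) + tau(X)

# Theorem 7: coefficient of Z in the OLS fit of Y on (1, Z, e(X))
design = np.column_stack([np.ones(n), z, e])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
coef_z = coef[1]

# Corollary 1, part 2: OLS of Y on Z - e(X) without an intercept
resid_z = z - e
coef_resid = np.sum(resid_z * y) / np.sum(resid_z**2)

# Sample analogues of the overlap-weighted effect and of the ATE
overlap_w = e * (1 - e)
tau_overlap = np.sum(overlap_w * tau_x) / np.sum(overlap_w)
tau_ate = tau_x.mean()
print(coef_z, coef_resid, tau_overlap, tau_ate)
```

Both regression coefficients should land near the overlap-weighted value and well away from the ATE, because units with $e(X)$ near 0.5 receive the most weight.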

PS as a weight

There is a convenient way to obtain ${\hat{τ}}^{hajek}$ based on WLS.

Proposition 1 (a convenient way to obtain ${\hat{τ}}^{hajek}$ based on WLS).

Use the bootstrap to obtain the standard error
Why does WLS give a consistent estimator for $τ$?
In an RCT with a constant propensity score, we can simply use the coefficient of $Z_{i}$ in the OLS fit of $Y_{i}$ on $(1, Z_{i})$ to estimate $τ$
In observational studies, we need to deal with selection bias. The key idea is:
- If we weight the treated units by $\frac{1}{e (X_{i})}$ and the control units by $\frac{1}{1 - e (X_{i})}$, then both the treated and control groups can represent the whole population
- Thus, by weighting, we effectively have a pseudo-randomized experiment
- Remark 2 (IPCW).
  
  Inverse Probability of Censoring Weighting (IPCW) follows the same idea: it adjusts for censoring bias by reweighting observations based on their probability of being uncensored.
Consequently, the difference between the weighted means is consistent for $τ$. The numerical equivalence of ${\hat{τ}}^{hajek}$ and WLS is not only a neat numerical fact in itself but also useful for motivating more complex estimators with covariate adjustment
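The numerical equivalence can be verified directly. In this sketch I assume the WLS fit is of $Y_i$ on $(1, Z_i)$ with weights $Z_i / \hat{e}(X_i) + (1 - Z_i) / \{1 - \hat{e}(X_i)\}$, which is how I read the weighting idea above; the function names and the simulated data are hypothetical.

```python
import numpy as np

def hajek_estimator(y, z, e_hat):
    """Hajek estimator: difference of normalized inverse-propensity-weighted means."""
    w1 = z / e_hat
    w0 = (1 - z) / (1 - e_hat)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

def wls_coef_on_z(y, z, weights):
    """Coefficient of z in the weighted least squares fit of y on (1, z)."""
    design = np.column_stack([np.ones(len(y)), z])
    # WLS is OLS after scaling rows by the square root of the weights
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    return coef[1]

# Simulated example (assumed data-generating process)
rng = np.random.default_rng(5)
n = 5_000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))
z = rng.binomial(1, e)
y = 1.5 * z + x + rng.normal(size=n)

w = z / e + (1 - z) / (1 - e)
tau_wls = wls_coef_on_z(y, z, w)
tau_hajek = hajek_estimator(y, z, e)
print(tau_wls, tau_hajek)
```

The two numbers agree up to floating-point error, since with only an intercept and $Z_i$ in the model the WLS coefficient is exactly the difference between the two weighted means.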

References

Ding, P. (2024). A First Course in Causal Inference. CRC Press.

Wager, S. (2024). Causal inference: A statistical learning approach. https://web.stanford.edu/~swager/causal_inf_book.pdf


Chen Xing

Founder & Data Scientist

Enjoy Life & Enjoy Work!
