Notes on Propensity Score Methods

Introduction

Here are my notes on propensity scores, mainly from Prof. Ding’s textbook (2024).

The traditional propensity score analysis workflow is shown in the figure below; I will not walk through it in detail. Instead, I will summarize the key theorems and results from Ding’s textbook.

Figure 1: Traditional propensity score analysis workflow (image from Chap3.3_observational_PS, page 16)

I will also draw some connections with the Riesz Representer (RR).

  • Why connect with the Riesz Representer (RR)? The connection provides a powerful generalization of the foundational Rosenbaum-Rubin (1983) result.

  • Rosenbaum and Rubin showed that conditioning on the propensity score is sufficient for removing confounding bias when estimating causal effects. The Riesz representer extends this principle: it suffices to regress on the Riesz representer to obtain unbiased estimates of the average treatment effect.

  • The key insight is that the Riesz representer, like the propensity score, serves as a sufficient statistic – it captures all the confounding information necessary for unbiased estimation of your target causal parameter.

Setting & Notation

  • Binary treatment $Z$
  • Potential outcomes $\{Y(0), Y(1)\}$
  • Propensity score: $e(X) = P(Z = 1 \mid X)$, where $X$ represents covariates

Two approaches to learning causal relationships:

  • Outcome process (via outcome regression)

  • Treatment assignment mechanism (via propensity score)

The following summarizes the key theorems and results related to propensity scores from Prof. Ding’s textbook.

1. The propensity score as a dimension reduction tool

Theorem 1 (pscore as dimension reduction tool).
If $Z \perp \{Y(1), Y(0)\} \mid X$, then $Z \perp \{Y(1), Y(0)\} \mid e(X)$.

  • Covariates $X$ can be high dimensional, but the propensity score $e(X) \in \mathbb{R}$ is a one-dimensional scalar

  • We can view the propensity score as a dimension reduction tool

2. Propensity score stratification

  • Idea: Discretize the estimated propensity score by its $K$ quantiles, and use the approximation

    $$Z \perp \{Y(1), Y(0)\} \mid \hat{e}(X) = e_k \quad (k = 1, \ldots, K).$$

    Estimate the ATE within each subclass, then average the subclass estimates weighted by block size (see the sketch after this list)

  • Advantage: The propensity score stratification estimator only requires the correct ordering of the estimated propensity scores rather than their exact values, which makes it relatively robust compared with other methods
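As a concrete illustration, here is a minimal sketch of this estimator in Python; the simulated data, the logistic model for $\hat{e}(X)$, and $K = 5$ are illustrative assumptions, not from the textbook:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated observational data (illustrative only); true ATE = 1
n = 5000
X = rng.normal(size=(n, 2))
e_true = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))
Z = rng.binomial(1, e_true)
Y = Z + X[:, 0] + rng.normal(size=n)

# Step 1: estimate the propensity score
e_hat = LogisticRegression().fit(X, Z).predict_proba(X)[:, 1]

# Step 2: discretize e_hat by its K quantiles
K = 5
edges = np.quantile(e_hat, np.linspace(0, 1, K + 1))
strata = np.clip(np.searchsorted(edges, e_hat, side="right") - 1, 0, K - 1)

# Step 3: difference in means within each stratum, averaged by block size
tau_hat = sum(
    (strata == k).mean()
    * (Y[(strata == k) & (Z == 1)].mean() - Y[(strata == k) & (Z == 0)].mean())
    for k in range(K)
)
print(f"Stratification estimate of the ATE: {tau_hat:.3f}")
```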

3. Propensity score weighting

Theorem 2 (Inverse propensity score weighting (IPW)).
If $Z \perp \{Y(1), Y(0)\} \mid X$ and $0 < e(X) < 1$, then

$$E\{Y(1)\} = E\left\{\frac{ZY}{e(X)}\right\}, \qquad E\{Y(0)\} = E\left\{\frac{(1 - Z)Y}{1 - e(X)}\right\},$$

and

$$\tau = E\{Y(1) - Y(0)\} = E\left\{\frac{ZY}{e(X)} - \frac{(1 - Z)Y}{1 - e(X)}\right\} = E\{HY\},$$

where $H := \dfrac{Z}{e(X)} - \dfrac{1 - Z}{1 - e(X)}$ is called the Horvitz–Thompson transform.
  • Connection to the Riesz Representer (RR)

    Remark 1 (RR in the case of the ATE).
    In the case of the ATE, the Riesz Representer $\alpha(Z, X)$ has the same form as the Horvitz–Thompson transform above: $\alpha(Z, X) = \dfrac{Z}{e(X)} - \dfrac{1 - Z}{1 - e(X)}$

3.1 Estimation

  • The sample version of IPW is called the Horvitz–Thompson (HT) estimator,

    $$\hat{\tau}_{\text{ht}} = \frac{1}{n} \sum_{i=1}^n \frac{Z_i Y_i}{\hat{e}(X_i)} - \frac{1}{n} \sum_{i=1}^n \frac{(1 - Z_i) Y_i}{1 - \hat{e}(X_i)}$$
  • The HT estimator $\hat{\tau}_{\text{ht}}$ has several problems

  • Problem: lack of invariance, i.e., if we replace $Y_i$ by $Y_i + c$, then $\hat{\tau}_{\text{ht}}$ changes because it depends on $c$, which is not reasonable

  • Solution: normalize the weights,

    $$\hat{\tau}_{\text{hajek}} = \frac{\sum_{i=1}^n \frac{Z_i Y_i}{\hat{e}(X_i)}}{\sum_{i=1}^n \frac{Z_i}{\hat{e}(X_i)}} - \frac{\sum_{i=1}^n \frac{(1 - Z_i) Y_i}{1 - \hat{e}(X_i)}}{\sum_{i=1}^n \frac{1 - Z_i}{1 - \hat{e}(X_i)}}$$

  • The Hajek estimator is invariant to location shifts of the outcome, as the sketch below illustrates
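A minimal sketch contrasting the two estimators on simulated data (the data-generating process is an illustrative assumption); note how the HT estimate moves when the outcome is shifted by a constant while the Hajek estimate does not:

```python
import numpy as np

def ipw_estimators(Y, Z, e_hat):
    """Horvitz-Thompson and Hajek IPW estimates of the ATE."""
    w1, w0 = Z / e_hat, (1 - Z) / (1 - e_hat)
    tau_ht = np.mean(w1 * Y - w0 * Y)
    tau_hajek = np.sum(w1 * Y) / np.sum(w1) - np.sum(w0 * Y) / np.sum(w0)
    return tau_ht, tau_hajek

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=n)
e = 1 / (1 + np.exp(-X))          # true propensity score
Z = rng.binomial(1, e)
Y = Z + X + rng.normal(size=n)    # true ATE = 1

print(ipw_estimators(Y, Z, e))        # both close to 1
print(ipw_estimators(Y + 100, Z, e))  # HT drifts with c = 100; Hajek is unchanged
```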

3.2 Strong overlap condition

Many asymptotic analyses require a strong overlap condition,

$$0 < \alpha_L \leq e(X) \leq \alpha_U < 1.$$

In practice, the estimated propensity scores are often trimmed to such a range (a small helper is sketched after these suggestions):

  • Crump et al. (2009) suggested $\alpha_L = 0.1$ and $\alpha_U = 0.9$

  • Kurth et al. (2005) suggested $\alpha_L = 0.05$ and $\alpha_U = 0.95$
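In code, trimming to such a range is a one-liner; this small helper (my own naming) defaults to the Crump et al. (2009) thresholds:

```python
import numpy as np

def trim_overlap(e_hat, alpha_l=0.1, alpha_u=0.9):
    """Boolean mask keeping units with alpha_l <= e_hat <= alpha_u."""
    return (e_hat >= alpha_l) & (e_hat <= alpha_u)

# Usage: keep = trim_overlap(e_hat); Y, Z, X = Y[keep], Z[keep], X[keep]
```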

4. Balancing property

Theorem 3 (balancing property).
$Z \perp X \mid e(X)$.
  • Conditional on e(X), the treatment and the covariates are independent

  • Within the same level of the propensity score, the covariate distributions are balanced across the treatment and control groups

  • Useful implication: we can check whether the propensity score model is specified well enough to ensure covariate balance in the data (see the sketch below)
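One common diagnostic is the standardized mean difference (SMD) of each covariate before and after weighting; below is a minimal sketch (the function name and the 0.1 rule of thumb are conventions, not from the textbook):

```python
import numpy as np

def smd(x, Z, w=None):
    """(Weighted) standardized mean difference of covariate x across groups."""
    if w is None:
        w = np.ones_like(x)
    m1 = np.average(x[Z == 1], weights=w[Z == 1])
    m0 = np.average(x[Z == 0], weights=w[Z == 0])
    s = np.sqrt((x[Z == 1].var() + x[Z == 0].var()) / 2)
    return (m1 - m0) / s

# With IPW weights, |SMD| should shrink (a common rule of thumb: below 0.1)
# w = Z / e_hat + (1 - Z) / (1 - e_hat)
# print(smd(X[:, 0], Z), smd(X[:, 0], Z, w))
```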

4.1 Propensity score is a balancing score

Definition 1 (balancing score).
A balancing score $b(X)$ is a function of the covariates such that $Z \perp X \mid b(X)$.
Theorem 4 (Propensity score is a balancing score).
(1) The propensity score $e(X)$ is a balancing score; (2) a function $b(X)$ is a balancing score if and only if $e(X) = g\{b(X)\}$ for some function $g$. In particular, $b(X) = (X_1, e(X))$ is a balancing score, so under unconfoundedness

$$Z \perp \{Y(1), Y(0)\} \mid X_1, e(X). \tag{11.5}$$
  • This is relevant in subgroup analysis

  • The conditional independence in (11.5) ensures that unconfoundedness holds given the propensity score within each level of $X_1$. Therefore, we can perform the same propensity-score-based analysis within each level of $X_1$, yielding estimates of the two subgroup effects (a sketch follows)
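A sketch of such a subgroup analysis, assuming a discrete $X_1$, an $(n, p)$ covariate matrix X, and a Hajek IPW estimate within each level (all naming and model choices are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def subgroup_ate(Y, Z, X, x1):
    """Hajek IPW estimate of the ATE within each level of a discrete x1."""
    estimates = {}
    for level in np.unique(x1):
        m = x1 == level  # restrict to one subgroup
        e = LogisticRegression().fit(X[m], Z[m]).predict_proba(X[m])[:, 1]
        w1, w0 = Z[m] / e, (1 - Z[m]) / (1 - e)
        estimates[level] = (np.sum(w1 * Y[m]) / np.sum(w1)
                            - np.sum(w0 * Y[m]) / np.sum(w0))
    return estimates
```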

5. Doubly Robust or AIPW

The following Theorem is summarized from Prof. Wager’s lecture notes (2024).

Theorem 5 (strong double robustness of the AIPW estimator).
Define the outcome regression as $\mu_{(z)}(x) = E[Y_i(z) \mid X_i = x]$, and define the AIPW estimator as

$$\hat{\tau}_{\text{AIPW}} = \underbrace{\frac{1}{n} \sum_{i=1}^n \left( \hat{\mu}_{(1)}(X_i) - \hat{\mu}_{(0)}(X_i) \right)}_{\text{outcome regression estimator}} + \underbrace{\frac{1}{n} \sum_{i=1}^n \left( \frac{Z_i}{\hat{e}(X_i)} \left( Y_i - \hat{\mu}_{(1)}(X_i) \right) - \frac{1 - Z_i}{1 - \hat{e}(X_i)} \left( Y_i - \hat{\mu}_{(0)}(X_i) \right) \right)}_{\text{applying IPW to the regression residuals}}$$

If we use estimators $\hat{\mu}_{(z)}(x)$ and $\hat{e}(x)$ that are both consistent, with root-mean-squared errors (RMSE) decaying faster than $n^{-\alpha_\mu}$ and $n^{-\alpha_e}$ respectively, and if furthermore $\alpha_\mu + \alpha_e \geq 1/2$, then

$$\sqrt{n} \left( \hat{\tau}_{\text{AIPW}} - \tau \right) \Rightarrow \mathcal{N}(0, V_{\text{AIPW}}), \qquad V_{\text{AIPW}} = \operatorname{Var}[\tau(X_i)] + E\left[ \frac{\sigma_{(0)}^2(X_i)}{1 - e(X_i)} \right] + E\left[ \frac{\sigma_{(1)}^2(X_i)}{e(X_i)} \right],$$

where $\sigma_{(z)}^2(x) = \operatorname{Var}[Y_i(z) \mid X_i = x]$.
  • Check my previous post: Intuition for Doubly Robust Estimator

  • AIPW provides a natural starting point for understanding Double Machine Learning; a cross-fitted sample version is sketched after this list

  • Key insight of the RR in the DML framework: leverage the Riesz Representer, a “generalized version of the propensity score,” to “correct the bias”
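Here is a cross-fitted sample version of the AIPW estimator, as a sketch; the logistic and linear nuisance models are illustrative stand-ins for any learners with the required convergence rates:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import KFold

def aipw_ate(Y, Z, X, n_folds=2, seed=0):
    """Cross-fitted AIPW estimate of the ATE and a plug-in standard error."""
    n = len(Y)
    mu1, mu0, e_hat = np.zeros(n), np.zeros(n), np.zeros(n)
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        # Nuisance models fit on the training folds, evaluated on the held-out fold
        e_hat[test] = (LogisticRegression()
                       .fit(X[train], Z[train]).predict_proba(X[test])[:, 1])
        t1, t0 = train[Z[train] == 1], train[Z[train] == 0]
        mu1[test] = LinearRegression().fit(X[t1], Y[t1]).predict(X[test])
        mu0[test] = LinearRegression().fit(X[t0], Y[t0]).predict(X[test])
    # Outcome regression estimate plus IPW applied to the regression residuals
    psi = (mu1 - mu0
           + Z / e_hat * (Y - mu1)
           - (1 - Z) / (1 - e_hat) * (Y - mu0))
    return psi.mean(), psi.std(ddof=1) / np.sqrt(n)
```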

6. A unified view of causal estimands

More generally, Li et al. (2018a) gave a unified discussion of the causal estimands in observational studies.

Theorem 6 (Ding (2024), Section 13.4).
Under unconfoundedness and overlap, for a weighting function $h(X)$ the estimand

$$\tau_h = \frac{E\{h(X)\, \tau(X)\}}{E\{h(X)\}}$$

can be identified by IPW:

$$\tau_h = \frac{E\left\{ h(X) \left[ \dfrac{Z}{e(X)} - \dfrac{1 - Z}{1 - e(X)} \right] Y \right\}}{E\{h(X)\}}.$$

Summary table of common estimands:

| $h(X)$ | Estimand |
| --- | --- |
| $1$ | ATE |
| $e(X)$ | ATT |
| $1 - e(X)$ | ATC |
| $e(X)\{1 - e(X)\}$ | Overlap-weighted average treatment effect ($\tau_O$) |
  • This table gives us a good way to understand and remember the IPW estimator for the ATT

  • How to remember $\tau_h$? Apply IPW to the “pseudo outcome” $h(X) Y$, then divide by $E\{h(X)\}$

  • When the parameter of interest is the ATT, $E\{h(X)\} = E\{e(X)\} = E\{E(Z \mid X)\} = E(Z) = P(Z = 1) = e$

  • Plugging $h(X) = e(X)$ into $\tau_h$ in this way recovers the IPW estimator for the ATT (a sample version is sketched below)
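The mnemonic translates directly into code; a sketch of the sample analogue of $\tau_h$ (the function name is mine), with the common choices of $h$ from the table:

```python
import numpy as np

def tau_h_hat(Y, Z, e_hat, h):
    """IPW on the pseudo outcome h(X) * Y, divided by the mean of h(X).
    Pass h already evaluated at the observed covariates."""
    ht = Z / e_hat - (1 - Z) / (1 - e_hat)  # Horvitz-Thompson transform
    return np.mean(ht * h * Y) / np.mean(h)

# Common choices of h from the summary table:
# ATE: h = np.ones_like(e_hat)     ATT: h = e_hat
# ATC: h = 1 - e_hat               ATO: h = e_hat * (1 - e_hat)
```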

7. Propensity Score in Regression

PS as a covariate

Theorem 7 (regression with pscore as a covariate).
Under unconfoundedness, the coefficient of $Z$ in the population OLS fit of $Y \sim 1 + Z + e(X)$ equals

$$\tau_O = \frac{E[e(X)\{1 - e(X)\}\, \tau(X)]}{E[e(X)\{1 - e(X)\}]},$$

which is the overlap-weighted average treatment effect.

Based on the above theorem, we also have:

Corollary 1.
Under unconfoundedness,

1. the coefficient of $Z$ in the population OLS fit of $Y \sim 1 + Z + e(X) + X$ also equals $\tau_O$;

2. the coefficient of $Z - e(X)$ in the population OLS fit of $Y \sim [Z - e(X)]$ or $Y \sim 1 + [Z - e(X)]$ also equals $\tau_O$ (a quick numerical check follows).
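A quick numerical check of part 2, reusing Y, Z, and e_hat from the earlier sketches; the population statement is exact, while the sample coefficient is only approximately $\tau_O$:

```python
import numpy as np
import statsmodels.api as sm

# Coefficient of [Z - e_hat] in the OLS fit of Y ~ 1 + [Z - e_hat]
tau_O_hat = sm.OLS(Y, sm.add_constant(Z - e_hat)).fit().params[1]
print(f"Overlap-weighted ATE estimate: {tau_O_hat:.3f}")
```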

PS as a weight

There is a convenient way to obtain $\hat{\tau}_{\text{hajek}}$ based on WLS.

Proposition 1 (obtaining $\hat{\tau}_{\text{hajek}}$ via WLS).
$\hat{\tau}_{\text{hajek}}$ equals the coefficient of $Z_i$ in the WLS fit of $Y_i$ on $(1, Z_i)$ with weights $w_i = \dfrac{Z_i}{\hat{e}(X_i)} + \dfrac{1 - Z_i}{1 - \hat{e}(X_i)}$.
  • We need to use the bootstrap for standard errors, since the WLS standard errors ignore the uncertainty in $\hat{e}(X)$

  • Why does the WLS give a consistent estimator for $\tau$?

  • In an RCT with a constant propensity score, we can simply use the coefficient of $Z_i$ in the OLS fit of $Y_i$ on $(1, Z_i)$ to estimate $\tau$

  • In observational studies, we need to deal with selection bias. The key idea is:

    • If we weight the treated units by $1 / \hat{e}(X_i)$ and the control units by $1 / \{1 - \hat{e}(X_i)\}$, then both the treated and control groups can represent the whole population

    • Thus, by weighting, we effectively have a pseudo-randomized experiment

    • Remark 2 (IPCW).
      Inverse Probability of Censoring Weighting (IPCW) follows the same idea — it adjusts for censoring bias by reweighting observations based on their probability of being uncensored.
  • Consequently, the difference between the weighted means is consistent for $\tau$. The numerical equivalence of $\hat{\tau}_{\text{hajek}}$ and WLS is not only a fun numerical fact in itself but also useful for motivating more complex estimators with covariate adjustment (see the sketch below)
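A minimal numerical demonstration of this equivalence, using statsmodels for the weighted fit and assuming Y, Z, and e_hat from the earlier sketches:

```python
import numpy as np
import statsmodels.api as sm

def hajek_vs_wls(Y, Z, e_hat):
    """Check that the Hajek estimator equals the WLS coefficient of Z."""
    w = Z / e_hat + (1 - Z) / (1 - e_hat)  # IPW weights
    tau_hajek = (np.sum(Z * Y / e_hat) / np.sum(Z / e_hat)
                 - np.sum((1 - Z) * Y / (1 - e_hat)) / np.sum((1 - Z) / (1 - e_hat)))
    tau_wls = sm.WLS(Y, sm.add_constant(Z), weights=w).fit().params[1]
    return tau_hajek, tau_wls  # identical up to numerical precision
```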

References

Ding, P. (2024). A First Course in Causal Inference. CRC Press.

Wager, S. (2024). Causal inference: A statistical learning approach. https://web.stanford.edu/~swager/causal_inf_book.pdf
