Notes on Instrumental Variables

May 29, 2025 6 min read causal inference, econometrics

$$ \newcommand{\indep}{\mathrel{\perp\mkern-10mu\perp}} \newcommand{\P}{\mathbb{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \newcommand{\Var}{\operatorname{Var}} \newcommand{\Cov}{\operatorname{Cov}} \newcommand{\1}[1]{\mathbf{1}\\{#1\\}} $$

Here are my notes on instrumental variables from Stefan’s lecture materials.

Motivation

How can we identify causal eﬀects $W \rightarrow Y$ when we are in the presence of unobserved confounding $U$?

One popular way is to ﬁnd and use instrumental variables.

Partially Linear IV Models

When instrumental variables are available, it becomes possible to point identify causal effects in partially linear models and certain types of causal effects in nonlinear models.

The keyword in this section is linear, constant effect IV

Here we begin with partially linear models.

Assumption 1 (partially linear).

Assume the structural equation for $Y$ is linear: $$ Y=f_Y\left(W, U, \varepsilon_Y\right)=\alpha+W \tau+\varepsilon \tag{PLM}, $$ where $\epsilon$ is an error term that captures the contribution of both $U$ and $\epsilon_Y$.

This is a semiparametric specification, in that we impose a linear relationship between $W$ and $Y$ but let the rest be non-parametric. Notice that, besides the linearity assumption, we also assume a constant treatment effect (i.e. $\tau$), which is also a strong assumption.

Remark 1 (partialling out X).

I ignore the role of covariates $X$ and the possible extension like $$ \varepsilon_i \indep Z_i \mid X_i $$ because we can recover the same DAG after partialling out observed confounder $X$. As an illustration, we can use $\tilde{Y}, \tilde{W}, \tilde{Z}$, where $$\tilde{V} = V - \E[V \mid X]$$

Remark 2 (relax linearity assumption).

The key limitation here is that we assume the linear structure and constant treatment effect. Later, we will consider IV estimator without the linearity assumption.

When the constant treatment eﬀect model (PLM) doesn’t hold, the average treatment eﬀect $\tau_{ATE} = \E [Y_i (1) − Y_i (0)]$ is NOT identified without more data, because we don’t have any observations on treated never takers, etc. Without linearity, the estimator $\tau_{IV}$ still converges to a large-sample limit $\tau_{LATE}$, the local average treatment effect (LATE).

Identifying assumptions

There are 3 identification assumptions. Or let’s say there are three main assumptions that must be satisfied for a variable $Z$ to be considered an instrument.

Assumption 2 (exogeneity).

The instrument $Z_i$ must be exogenous, which here means $\epsilon_i \indep Z_i$

Assumption 3 (relevance).

The instrument $Z_i$ must be relevant, such that $\Cov[W_i, Z_i] \neq 0$

Graphically, the relevance assumption corresponds to the existence of an active edge from $Z$ to $W$ in the causal graph.

Assumption 4 (exclusion restriction).

The instrument $Z_i$ must satisfy exclusion restriction, meaning that any eﬀect of $Z_i$ on $Y_i$ must be fully mediated via the treatment $W_i$

Graphically, this means that we’ve excluded enough potential edges between variables in the causal graph so that all causal paths from $Z$ to $Y$ go through $W$.

Case I (Easiest)

Consider the fully linear version,

$$ \begin{align} & Y=\alpha+W \tau+\varepsilon, \quad \varepsilon \indep Z \tag{C1} \\ & W=Z \gamma+\eta \end{align} $$

Then,

$$\operatorname{Cov}[Y, Z]=\operatorname{Cov}[\tau W+\varepsilon, Z]=\tau \operatorname{Cov}[W, Z] $$ $$\tau= \frac{\operatorname{Cov}[Y, Z]}{\operatorname{Cov}[W, Z]}$$

It implies,

$$ \hat{\tau}_{IV}= \frac{\widehat{\operatorname{Cov}}[Y_i, Z_i]}{\widehat{\operatorname{Cov}}[W_i, Z_i]} $$

Case II (More general: optimal instruments)

Case I assumes (1) linear relationship between $Y$ and $W$ and (2) linear relationship between $W$ and $Z$. This may be too restrictive. How to extend to more general specification? What should we do if

we have multiple instruments
or we believe that the instrument may act non-linearly

Consider the following,

$$ Y=\tau W+\varepsilon, \quad \varepsilon \indep Z, \quad Y, W \in \mathbb{R}, \quad Z \in \mathcal{Z}, \tag{C2} $$

where $\mathcal{Z}$ can be a high-dimensional space. Define function $w$ that maps $Z_i$ to the real line $$w: \mathcal{Z} \rightarrow \R$$

Then by the same argument as the Case I (note: we can regard $w(Z)$ as a “pseudo $Z$”),

$$ \tau=\frac{\operatorname{Cov}[Y, w(Z)]}{\operatorname{Cov}[W, w(Z)]} $$

provided the denominator is non-zero, resulting a feasible estimator

$$ \hat{\tau}_{I V}=\frac{\widehat{\operatorname{Cov}}\left[Y_i, w\left(Z_i\right)\right]}{\widehat{\operatorname{Cov}}\left[W_i, w\left(Z_i\right)\right]}=\frac{\frac{1}{n} \sum_{i=1}^n\left(Y_i-\bar{Y}\right)\left(w\left(Z_i\right)-\overline{w(Z)}\right)}{\frac{1}{n} \sum_{i=1}^n\left(W_i-\bar{W}\right)\left(w\left(Z_i\right)-\overline{w(Z)}\right)} $$

Theorem 1.

Suppose $\left(X_i, W_i, Y_i, Z_i\right)$ are IID draws from a distribution satisfying (C2), and let $w: \mathcal{Z} \rightarrow \mathbb{R}$ be such that $\operatorname{Cov}[W, w(Z)] \neq 0$. Then, $\hat{\tau}_{I V}$ as given above is consistent for $\tau$, and $$ \sqrt{n}\left(\hat{\tau}_{I V}-\tau\right) \Rightarrow \mathcal{N}\left(0, V_w\right), \quad V_w=\frac{\operatorname{Var}\left[\varepsilon_i\right] \operatorname{Var}\left[w\left(Z_i\right)\right]}{\operatorname{Cov}\left[W_i, w\left(Z_i\right)\right]^2} $$

What is the best function $w$, say $w^*(z)$, that minimizes the variance $V_w$? It turns out that the optimal instrument is the best prediction of $W_i$ from $Z_i$.

$$ w^*(z)=\mathbb{E}\left[W_i \mid Z_i=z\right] $$

How to do we estimate? Do cross-fitting! Why? Because

$$ \epsilon_i \rightarrow W_i \rightarrow \hat{w}(Z_i) $$

We no longer have $\hat{w}(Z_i) \indep \epsilon_i$. Therefore, we need cross-ﬁtting to address this issue.

Case III (Much more general: non-parametric IV regression)

The more general version than Case II is the following,

$$ Y_i=\alpha+g\left(W_i\right)+\varepsilon_i, \quad Z_i \indep \varepsilon_i, \quad Y_i, W_i \in \mathbb{R}, \quad Z_i \in \mathcal{Z} \tag{C3} $$

where $g(\cdot)$ is some generic smooth function we want to estimate. Note that:

(C3) still requires the eﬀect of $W_i$ on $Y_i$ to be additive
however, unlike (C2), it now allows this additive eﬀect to be modified by a non-linearity $g(\cdot)$

Now,

$$ \begin{aligned} \mathbb{E}\left[Y_i \mid Z_i=z\right] & =\mathbb{E}\left[\alpha+g\left(W_i\right)+\varepsilon_i \mid Z_i=z\right] \\ & =\alpha+\mathbb{E}\left[g\left(W_i\right) \mid Z_i=z\right] \\ & =\alpha+\int_{\mathbb{R}} g(w) f(w \mid z) d w, \end{aligned} $$

There are two steps for learning $g(\cdot)$:

non-parametric model $\hat{f}(w \mid z)$ using cross-fitting
estimate $g(\cdot)$ using empirical minimization

For more details, see Section 9.2 in Stefan’s lecture.

Local Average Treatment Effects

Motivation

IV without the linearity assumption: One may doubt the validity the linearity and constant treatment eﬀect assumption in previous section. What about non-parametric identification using IV?
Encouragement Design and Noncompliance: Noncompliance is a common problem in encouragement designs involving human beings as experimental units. In those cases, the experimenters cannot force the units to take the treatment but rather only encourage them to do so. Heterogeneous effects should be allowed.

Setup

Consider an randomized experiment,

Let $Z_i \in \{0, 1\}$ be the treatment assigned
Let $W_i \in \{0, 1\}$ be the treatment received
When $Z_i \neq W_i$, the noncompliance problem arises
Potential outcome $\{W_i(1), W_i(0)\}$ s.t. $W_i = W_i(Z_i)$
Potential outcome $\{Y_i(w,z)\}_{(w,z) \in \{0,1\}}$ s.t. $Y_i = Y_i(W_i, Z_i)$

Identifying assumptions

LATE Theorem

Idea of proof:

Start with $\Cov(Y, Z) = \E[YZ] - \E[Y]\E[Z]$, then apply LIE by conditioning on $Z$
Derive $\Cov(W, Z)$ similarly as above, then get the ratio
Decompose the ATE on $Y$, $\E[Y(1) - Y(0)]$, into four terms (always-taker, compiler, defier, never-taker) using law of total probability
By exclusion restriction and monotonicity assumption, only “compiler” remains

Multiple instruments

We may have access to data from multiple randomized trials that can be used to study a treatment effect via a non-compliance analysis.

Marketing example:

goal: study the effect of subscription to a loyalty program ($W_i$) on long-term customer CLV ($Y_i$)
randomized trial 1: offering discounts for joining the loyalty program $\left(Z_i=\1(\{\right.$ customer received a discount $\left.\})\right)$
randomized trial 2: showing advertisements $\left(Z_i=\1(\{\right.$ customer was shown an ad for the program $\left.\})\right)$

Previously, under the linear treatment effect model, multiple instruments could be combined into a single optimal instrument, and the optimal instrument corresponds to the summary of all the instruments that best predicts the treatment.

Without the linear treatment effect model, however, we caution that no such result is available. Diﬀerent instruments may induce diﬀerence compliance patterns, and so the LATEs identiﬁed diﬀerent instruments may not be the same.

In the marketing example, the ATE for customers who respond to a discount may be diﬀerent from the ATE for customers who respond to an advertisement.

Reference

Wager, S. (2024). Causal inference: A statistical learning approach. https://web.stanford.edu/~swager/causal_inf_book.pdf

unobserved confounder instrmental variable LATE IV

Chen Xing

Founder & Data Scientist

Enjoy Life & Enjoy Work!

Notes on Instrumental Variables

Motivation

Partially Linear IV Models

Identifying assumptions

Case I (Easiest)

Case II (More general: optimal instruments)

Case III (Much more general: non-parametric IV regression)

Local Average Treatment Effects

Motivation

Setup

Identifying assumptions

LATE Theorem

Multiple instruments

Reference

Chen Xing

Founder & Data Scientist

Related