Notes on Matrix Completion Methods

Here are my study notes on the matrix completion method for causal panel data models proposed by Athey et al. (2021).

1. Setup and Notation

Let $Y(0), Y(1)$ be the potential outcome matrices and $W$ be the matrix of binary treatment indicators, with $W_{it} \in \{0,1\}$.

  • Rows (i) of Y and W correspond to units

  • Columns (t) of Y and W correspond to time periods

We need to impute the missing potential outcomes in $Y(0)$, an $N \times T$ matrix, denoted $Y_{N \times T}$.

  • From here on, I drop the $(0)$ and simply use $Y$ to refer to $Y(0)$

This is now a matrix completion problem.

For illustration, let's consider a block structure on the missing data, with a subset of the units adopting an irreversible treatment at a particular point in time $T_0+1$. Note that this method can also handle the staggered adoption case.

Figure 1: block structure

Let $\mathcal{M}$ denote the set of index pairs $(i,t) \in [N] \times [T]$ corresponding to the missing entries. Let $\mathcal{O}$ denote the set of index pairs corresponding to the observed entries.

The causal parameter of interest is the average treatment effect on the treated (ATT):

$\tau = \frac{\sum_{(i,t)} W_{it}\,\big(Y_{it}(1) - Y_{it}(0)\big)}{\sum_{(i,t)} W_{it}}$
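
As a worked example, once the missing $Y(0)$ entries have been imputed, the ATT is just an average over the treated cells. A minimal sketch in R (the names `Y_obs`, `Y0_hat`, and `W` are illustrative, not from the paper):

```r
# ATT = sum over treated cells of (observed Y - imputed Y(0)), divided by
# the number of treated cells. Y_obs holds observed outcomes (Y(1) where
# treated), Y0_hat the imputed Y(0), W the binary treatment matrix.
att <- function(Y_obs, Y0_hat, W) {
  sum(W * (Y_obs - Y0_hat)) / sum(W)
}
```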

2. Horizontal / Unconfoundedness Regression

The unconfoundedness literature focuses on:

  • The single-treated-period structure with a thin matrix ($N \gg T$).

  • A large number of treated and control units

Figure 2: thin matrix
  • Impute the missing potential outcomes using control units with similar lagged outcomes
    Remark 1.
    Doesn't this remind you of time-series forecasting with exponential smoothing, where more recent observations receive higher weights and more distant ones lower weights? The horizontal regression approach, however, chooses the weights by running a linear regression on the lagged observations.

In this setting, the identification depends on the following assumption.

Assumption 1 (Identification Assumption).
For the units with $(i,t) \in \mathcal{M}$, i.e., the units that have missing entries, $Y_{i,T}(0) \;\perp\; W_{i,T} \mid Y_{i,1}, \ldots, Y_{i,T-1}.$

Roughly speaking, the idea is that if, conditional on the pre-treatment outcomes, the treatment assignment is as good as random, then we can use flexible adjustment methods from the unconfoundedness literature (e.g., doubly robust methods).

The horizontal or unconfoundedness-type regression regresses the last-period outcome on the lagged outcomes and uses the estimated regression to predict the missing potential outcomes:

  • Fit lm($Y_{iT} \sim 1 + Y_{i1} + Y_{i2} + \cdots + Y_{i,T-1}$) on the control units

  • Predict $\hat{Y}_{iT}$ as a linear combination of the unit's own lagged observations

For the units with $(i,t) \in \mathcal{M}$, the predicted outcome is:

$\hat{Y}_{iT} = \hat{\beta}_0 + \sum_{s=1}^{T-1} \hat{\beta}_s Y_{is},$ where

$\hat{\beta} = \arg\min_{\beta} \sum_{i:(i,T)\in\mathcal{O}} \Big(Y_{iT} - \beta_0 - \sum_{s=1}^{T-1} \beta_s Y_{is}\Big)^2.$
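
A minimal sketch of this horizontal regression in R (illustrative, not the paper's code; `treated` is assumed to be a logical vector flagging the rows whose period-$T$ outcome is missing):

```r
# Regress the last-period outcome on all lagged outcomes using the control
# rows ((i,T) in O), then predict Y_hat_{iT} for the treated rows ((i,T) in M).
horizontal_impute <- function(Y, treated) {
  df <- as.data.frame(Y[, -ncol(Y), drop = FALSE])  # lagged outcomes
  names(df) <- paste0("lag", seq_len(ncol(df)))
  df$y <- Y[, ncol(Y)]                              # period-T outcome
  fit <- lm(y ~ ., data = df[!treated, ])           # fit on control units only
  predict(fit, newdata = df[treated, ])             # imputed Y_hat_{iT}
}
```

When $T-1$ approaches or exceeds the number of control units, `lm` would be replaced by a penalized fit (e.g., via the glmnet package), per the regularization remark below.
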
Assumption 2 (Patterns over time are stable across units).
Horizontal / unconfoundedness approach assumes that patterns over time are stable across units.

More specifically, it assumes that the relation between the treated-period outcome and the pre-treatment outcomes is the same for all units. Note: $\beta_s$ is the same for ALL units $i$.

A more flexible, nonparametric version of this estimator corresponds to matching:

  • For each treated unit $i$, we find a corresponding control unit $j$ with $Y_{jt} \approx Y_{it}$ for all pre-treatment periods $t = 1, \ldots, T-1$.

If $T$ is large relative to $N_C$ (many lagged outcomes, few control observations), one could use regularized regression such as LASSO, ridge, or elastic net.

3. Vertical Regression and the Synthetic Control literature

The synthetic control methods focus on:

  • The single-treated-unit block structure with a fat ($T \gg N$) or approximately square ($T \approx N$) matrix.

  • A large number of pre-treatment periods (i.e., $T_0$ is large)

  • $Y_{Nt}$ is missing for $t > T_0$, and there are no missing entries for the other units

Figure 3: fat matrix

In this setting, the identification depends on the assumption that:

Assumption 3 (Identification Assumption).
$Y_{N,t}(0) \;\perp\; W_{N,t} \mid Y_{1,t}, \ldots, Y_{N-1,t}$

Previous studies show how the synthetic control method can be interpreted as regressing the pre-treatment outcomes for the treated unit on the outcomes for the control units in the same periods.

  • Fit lm($Y_{N,t} \sim 1 + Y_{1t} + Y_{2t} + \cdots + Y_{N-1,t}$) on the pre-treatment periods $t = 1, \ldots, T_0$

That is, for the treated unit in period $t$, for $t = T_0+1, \ldots, T$, the predicted outcome is:

$\hat{Y}_{Nt} = \hat{\omega}_0 + \sum_{i=1}^{N-1} \hat{\omega}_i Y_{it}$

where

$\hat{\omega} = \arg\min_{\omega} \sum_{t:(N,t)\in\mathcal{O}} \Big(Y_{Nt} - \omega_0 - \sum_{i=1}^{N-1} \omega_i Y_{it}\Big)^2.$

This is referred to as vertical regression.
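
A minimal R sketch of the vertical regression (illustrative; unit $N$, the last row of `Y`, is assumed to be the treated unit):

```r
# Regress the treated unit's outcomes on the control units' outcomes over the
# pre-treatment periods t = 1, ..., T0, then predict for t = T0+1, ..., T.
vertical_impute <- function(Y, T0) {
  df <- as.data.frame(t(Y[-nrow(Y), , drop = FALSE]))  # rows = periods, cols = controls
  names(df) <- paste0("unit", seq_len(ncol(df)))
  df$y <- Y[nrow(Y), ]                                 # treated unit's outcomes
  fit <- lm(y ~ ., data = df[seq_len(T0), ])           # fit on (N, t) in O
  predict(fit, newdata = df[(T0 + 1):nrow(df), ])      # imputed Y_hat_{Nt}
}
```

Note that classic synthetic control further restricts the weights $\omega_i$ to be non-negative and to sum to one, which `lm` does not impose; the sketch above is the unrestricted least-squares reading.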

Assumption 4 (Patterns across units are stable over time).
Vertical / synthetic control approach assumes that patterns across units are stable over time.

More specifically, it assumes that the relation between the different units is stable over time. Note: $\omega_i$ is the same for ALL time periods $t$.

A more flexible, nonparametric version of this estimator corresponds to matching:

  • For each post-treatment period $t$, we find a corresponding pre-treatment period $s$ with $Y_{is} \approx Y_{it}$ for all control units $i = 1, \ldots, N-1$.

If $N$ is large relative to $T_0$ (many control units, few pre-treatment observations), one could use regularized regression such as LASSO, ridge, or elastic net.

4. Panel with Fixed Effects

Before moving to fixed effects, factor, and interactive fixed effects models, let's do a short summary.

Horizontal regression:

  • Data: Thin matrix (many units, few periods), single treated period (period T)

  • Strategy: Use controls to regress $Y_{iT}$ on the lagged outcomes $Y_{i1}, Y_{i2}, \ldots, Y_{i,T-1}$

    • Fit lm($Y_{iT} \sim 1 + Y_{i1} + Y_{i2} + \cdots + Y_{i,T-1}$)

    • $N_C$ observations, $T-1$ regressors

  • Does not work well if Y is fat (few units, many periods)

  • Key identifying assumption: $Y_{i,T}(0) \;\perp\; W_{i,T} \mid Y_{i,1}, \ldots, Y_{i,T-1}$

The horizontal regression focuses on a pattern in the time path of the outcome $Y_{it}$, specifically the relation between $Y_{iT}$ and the lagged $Y_{it}$ for $t = 1, \ldots, T-1$ for the units for which these values are observed, and assumes that this pattern is the same for units with missing outcomes.

Vertical regression:

  • Data: Fat matrix (few units, many periods), single treated unit (unit $N$), treatment starts in period $T_0+1$.

  • Strategy: Use pre-treatment periods to regress $Y_{N,t}$ on the contemporaneous outcomes $Y_{1t}, Y_{2t}, \ldots, Y_{N-1,t}$

    • Fit lm($Y_{N,t} \sim 1 + Y_{1t} + Y_{2t} + \cdots + Y_{N-1,t}$)

    • $T_0$ observations, $N-1$ regressors

  • Does not work well if the matrix is thin (many units, few periods)

  • Key identifying assumption: $Y_{N,t}(0) \;\perp\; W_{N,t} \mid Y_{1,t}, \ldots, Y_{N-1,t}$

The vertical regression focuses on a pattern between units at times when we observe all outcomes, and assumes this pattern continues to hold for periods when some outcomes are missing.

However, by focusing on only one of these patterns, cross-section or time series, these approaches ignore alternative patterns that may help in imputing the missing values. One alternative is to consider approaches that exploit both stable patterns over time and stable patterns across units. Such methods have a long history in the panel data literature, including two-way fixed effects and factor and interactive fixed effects models.

In the absence of covariates, the common two-way fixed effect model is:

$Y_{it} = \delta_i + \gamma_t + \epsilon_{it}$

In this setting, the identification depends on the assumption that:

$Y_{i,t}(0) \;\perp\; W_{i,t} \mid \delta_i, \gamma_t.$

So the predicted outcome based on the unit and time fixed effects is:

$\hat{Y}_{NT} = \hat{\delta}_N + \hat{\gamma}_T$

In matrix form, the model can be rewritten as:

$Y_{N \times T} = \delta\,\iota_T^\top + \iota_N\,\gamma^\top + \epsilon_{N \times T},$

where $\delta = (\delta_1, \ldots, \delta_N)^\top$, $\gamma = (\gamma_1, \ldots, \gamma_T)^\top$, and $\iota$ denotes a vector of ones, so the systematic part $L = \delta\,\iota_T^\top + \iota_N\,\gamma^\top$ has rank at most two.

The matrix formulation of the identification assumption is:

$Y_{N \times T}(0) \;\perp\; W_{N \times T} \mid L_{N \times T}$
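
A minimal sketch of two-way fixed effects imputation in R (illustrative): fit $Y_{it} = \delta_i + \gamma_t + \epsilon_{it}$ on the observed cells only, then predict the missing ones.

```r
# Fit unit and time fixed effects on the observed cells (i, t) in O, then
# impute the treated cells as delta_hat_i + gamma_hat_t.
twfe_impute <- function(Y, W) {
  d <- data.frame(
    y    = as.vector(Y),
    unit = factor(as.vector(row(Y))),
    time = factor(as.vector(col(Y)))
  )
  obs <- as.vector(W) == 0
  fit <- lm(y ~ unit + time, data = d[obs, ])
  predict(fit, newdata = d[!obs, ])  # imputed Y_hat_it for (i, t) in M
}
```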

5. Interactive Fixed Effects

5.1 Why Use Interactive Fixed Effects?

In panel data, traditional fixed effects control for differences across units and time periods. But what if there are hidden factors that evolve over time and affect each unit differently — like how each state reacts differently to national economic trends?

Classical models can’t handle that. This is where interactive fixed effects shine: they model these latent time-varying confounders using a flexible structure, $\lambda_i^\top F_t$ (Bai 2009), helping us avoid bias when estimating causal effects.

5.2 Interactive Fixed Effects Model

With interactive fixed effects, instead of exploiting the additive structure of unit and time effects, we exploit a low-rank, or interactive, structure of unit and time effects in the panel data regression.

The common interactive fixed effects model is:

$Y_{it} = \sum_{r=1}^{R} \delta_{ir}\,\gamma_{tr} + \varepsilon_{it},$

where

  • $\gamma_{tr}$ are called factors

    • Note: Factors are unobserved, time-varying variables that capture common influences or shocks affecting all units at time $t$. They represent the latent drivers of the outcome that change over time but are shared across units.
  • $\delta_{ir}$ are called factor loadings

    • Note: Factor loadings are unobserved, unit-specific coefficients that determine how much each unit $i$ is affected by the common factors. They reflect the heterogeneity in how units respond to the time-varying factors.

Note that both factors and factor loadings are parameters that need to be estimated. Typically it is assumed that the number of factors R is fixed, although it is not necessarily known to the researcher.

Again, the identification depends on the assumption that:

$Y_{i,t}(0) \;\perp\; W_{i,t} \mid \delta_i, \gamma_t.$

The predicted outcome based on the interactive fixed effects would be:

$\hat{Y}_{NT} = \sum_{r=1}^{R} \hat{\delta}_{Nr}\,\hat{\gamma}_{Tr}$

In matrix form, $Y_{N \times T}$ can be rewritten as:

$Y_{N \times T} = \delta_{N \times R}\,\gamma_{T \times R}^\top + \varepsilon_{N \times T},$

so the systematic part $L = \delta\,\gamma^\top$ has rank at most $R$.

The matrix formulation of the identification assumption is:

$Y_{N \times T}(0) \;\perp\; W_{N \times T} \mid L_{N \times T}$

Using the interactive fixed effects model, we estimate $\delta$ and $\gamma$ by least squares and use them to impute the missing values.
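
With no missing data, the rank-$R$ least-squares fit is just the truncated SVD of $Y$; with missing entries, estimation is iterative. A hedged sketch of that idea in R (a simple hard-impute style loop meant only to convey the mechanics; Bai's 2009 iterative principal-components estimator and its inference are more involved):

```r
# Alternate between (1) a rank-R SVD approximation of the completed matrix
# and (2) refilling the missing cells with that approximation.
ife_impute <- function(Y, W, R, n_iter = 200) {
  Z <- Y
  Z[W == 1] <- mean(Y[W == 0])          # crude initialization of missing cells
  for (k in seq_len(n_iter)) {
    s <- svd(Z)
    L <- s$u[, 1:R, drop = FALSE] %*% diag(s$d[1:R], R, R) %*%
      t(s$v[, 1:R, drop = FALSE])       # rank-R least-squares approximation
    Z[W == 1] <- L[W == 1]              # refill missing cells with the fit
  }
  L                                     # L_hat_it = sum_r delta_hat_ir gamma_hat_tr
}
```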

6. The Matrix Completion with Nuclear Norm Minimization Estimator

In the absence of covariates, the $Y_{N \times T}$ matrix can be written as:

$Y_{N \times T} = L_{N \times T} + \epsilon_{N \times T}$
Assumption 5 (Key Assumptions).
1. $W_{N \times T} \perp \varepsilon_{N \times T}$ (but $W$ may depend on $L$).
2. Staggered entry: $W_{i,t+1} \geq W_{i,t}$, i.e., once a unit is treated it remains treated.
3. $L$ has low rank relative to $N$ and $T$.

In the more general case with covariates, $Y_{it}$ is equal to:

$Y_{it} = L_{it} + \sum_{p=1}^{P} \sum_{q=1}^{Q} X_{ip} H_{pq} Z_{qt} + \gamma_i + \delta_t + V_{it}\beta + \varepsilon_{it}$

where

  • $X_i$: unit-specific $P$-component covariate vector

  • $Z_t$: time-specific $Q$-component covariate vector

  • $V_{it}$: unit-time-specific covariate

  • $\gamma_i$: unit fixed effect

  • $\delta_t$: time fixed effect

We do not necessarily need the fixed effects $\gamma_i$ and $\delta_t$, as they can be subsumed into $L$. However, it is convenient and efficient to include them explicitly, given that we regularize $L$.

With too many parameters, especially in the $N \times T$ matrix $L$, we need regularization to shrink $L$ and $H$ toward zero. To regularize $H$, we use a Lasso-type element-wise $\ell_1$ norm, defined as

$\|H\|_{1,e} = \sum_{p=1}^{P} \sum_{q=1}^{Q} |H_{pq}|$

But how do we regularize L ?

By the singular value decomposition (SVD), $L_{N \times T}$ can be written as:

$L_{N \times T} = S_{N \times N}\,\Sigma_{N \times T}\,R_{T \times T}^\top$

where

  • $S, R$ are unitary

  • $\Sigma$ is rectangular diagonal with entries $\sigma_i(L)$, the singular values

  • The rank of $L$ is the number of non-zero $\sigma_i(L)$

There are three ways to regularize L :

$\|L\|_F^2 = \sum_{i,t} |L_{it}|^2 = \sum_{j=1}^{\min(N,T)} \sigma_j^2(L)$ (Frobenius norm, like ridge)

$\|L\|_* = \sum_{j=1}^{\min(N,T)} \sigma_j(L)$ (nuclear norm, like LASSO)

$\|L\|_R = \sum_{j=1}^{\min(N,T)} \mathbf{1}\{\sigma_j(L) > 0\}$ (rank, like subset selection)
  • The Frobenius norm penalizes each entry separately and ends up imputing the missing values as 0

  • The rank penalty is computationally infeasible for general missing-data patterns

  • The preferred nuclear norm leads to a low-rank matrix and is computationally feasible
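
All three quantities come straight from the singular values, as a quick R illustration:

```r
# Compute the three candidate penalties from the singular values of a matrix.
L <- matrix(rnorm(5 * 8), 5, 8)
s <- svd(L)$d
frobenius <- sqrt(sum(s^2))  # ||L||_F: shrinks like ridge
nuclear   <- sum(s)          # ||L||_*: shrinks like LASSO
rank_L    <- sum(s > 1e-10)  # rank(L): like best-subset selection
```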

So the Matrix-Completion with Nuclear Norm Minimization (MC-NNM) estimator uses the nuclear norm:

$\hat{L} = \arg\min_{L} \frac{1}{|\mathcal{O}|} \sum_{(i,t)\in\mathcal{O}} (Y_{it} - L_{it})^2 + \lambda_L \|L\|_*$

For the general case, we estimate $H, L, \delta, \gamma$, and $\beta$ as

$\min_{H,L,\delta,\gamma,\beta} \frac{1}{|\mathcal{O}|} \sum_{(i,t)\in\mathcal{O}} \Big(Y_{it} - L_{it} - \sum_{p=1}^{P} \sum_{q=1}^{Q} X_{ip} H_{pq} Z_{qt} - \gamma_i - \delta_t - V_{it}\beta\Big)^2 + \lambda_L \|L\|_* + \lambda_H \|H\|_{1,e}$

And we choose $\lambda_L$ and $\lambda_H$ through cross-validation.
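
The paper solves this with an iterative singular-value shrinkage ("soft-impute" style) algorithm. A minimal hedged sketch for the no-covariates case (illustrative, not the paper's implementation; `lambda` would be chosen by cross-validation on held-out observed entries):

```r
# Soft-threshold the singular values of A at level lambda.
shrink <- function(A, lambda) {
  s <- svd(A)
  d <- pmax(s$d - lambda, 0)
  s$u %*% diag(d, length(d), length(d)) %*% t(s$v)
}

# Iterate L <- shrink(P_O(Y) + P_O_perp(L)): keep observed cells from Y,
# fill missing cells from the current estimate, then shrink.
mcnnm <- function(Y, W, lambda, n_iter = 500) {
  L <- matrix(0, nrow(Y), ncol(Y))
  for (k in seq_len(n_iter)) {
    Z <- ifelse(W == 0, Y, L)
    L <- shrink(Z, lambda)
  }
  L  # the imputed Y_hat(0) is read off the treated cells of L
}
```

Combined with the `att` sketch from Section 1, `att(Y, mcnnm(Y, W, lambda), W)` would give the estimated treatment effect.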

References

Athey, Susan, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens, and Khashayar Khosravi (2021), “Matrix Completion Methods for Causal Panel Data Models,” Journal of the American Statistical Association, 116 (536), 1716–30.

Liu, Licheng, Ye Wang, and Yiqing Xu (2024), “A Practical Guide to Counterfactual Estimators for Causal Inference with Time-Series Cross-Sectional Data,” American Journal of Political Science, 68 (1), 160–76.

Chapter 8 Matrix Completion Methods | Machine Learning-Based Causal Inference Tutorial

7.1 Simple Exponential Smoothing | Forecasting: Principles And Practice (2nd Ed)

Bai, Jushan (2009), “Panel Data Models With Interactive Fixed Effects,” Econometrica, 77 (4), 1229–79.

R packages 📦 to implement: MCPanel (the companion code to Athey et al. 2021) and fect (which accompanies Liu, Wang, and Xu 2024) both implement the MC-NNM estimator.
