PCA and stratification in GWAS / A primer on random matrix theory

Alex Bloemendal
Neale Lab
A primer on random matrix theory

Abstract: High-dimensional data behave in ways that seem to contradict intuitions from low-dimensional geometry and classical statistics, particularly in testing for and recovering low-rank signal. Random matrix theory is a branch of mathematics that characterizes such phenomena; I will sketch a few relevant results.

Christina Chen
Neale Lab
Controlling for stratification in (meta)-GWAS with PCA: theory, applications, and implications

Abstract: Principal component analysis (PCA) is the standard method for estimating population structure and sample ancestry in genetic datasets. Population structure can induce confounding in genome-wide association studies (GWAS), which is typically addressed by including principal components (PCs) as covariates. However, results from random matrix theory (RMT) predict that PCA fails to detect population differentiation below a particular threshold and that, even above the threshold, sample PCs may be only partially correlated with the true axes of differentiation. For each PC, these phenomena depend on the corresponding eigenvalue; we extend previous work to characterize and interpret the eigenvalues for general population structures. Moreover, we propose an estimator for the effective number of unlinked variants that outperforms previous moments-based estimators. We then combine this estimator with RMT results to estimate the inaccuracy of each PC and to predict how this inaccuracy leads to residual confounding in GWAS on stratified phenotypes. We validate our method via downsampling experiments on real data, including the UK Biobank, and suggest that this behavior may be driving the uncorrected stratification recently observed in some large meta-analyses of smaller GWAS.
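
The threshold behavior mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the speakers' method: it assumes the textbook spiked covariance model (Gaussian data, identity covariance plus a single spike of strength `ell`), and the function names `simulate_spike` and `predicted`, the sample sizes, and the spike strengths are all hypothetical choices made for illustration. It does not model genotypes, linkage disequilibrium, or the effective number of unlinked variants. Under these assumptions, with aspect ratio `gamma = p / n`, the standard asymptotic results say the top sample eigenvector carries no information about the true direction when `ell <= sqrt(gamma)`, and is only partially aligned with it above that threshold.

```python
import numpy as np


def simulate_spike(n, p, ell, rng):
    """Simulate n Gaussian samples in p dimensions with covariance I + ell * e1 e1^T.

    Returns the top eigenvalue of the sample covariance and the squared cosine
    between the top sample eigenvector and the true spike direction e1.
    """
    X = rng.standard_normal((n, p))
    X[:, 0] *= np.sqrt(1.0 + ell)      # variance 1 + ell along the first axis
    S = X.T @ X / n                    # p x p sample covariance (mean known to be 0)
    evals, evecs = np.linalg.eigh(S)   # eigenvalues in ascending order
    top_vec = evecs[:, -1]
    return float(evals[-1]), float(top_vec[0] ** 2)


def predicted(gamma, ell):
    """Asymptotic top eigenvalue and squared cosine for the spiked model (assumed)."""
    if ell <= np.sqrt(gamma):
        # Below the threshold: eigenvalue sticks to the bulk edge, no alignment.
        return float((1.0 + np.sqrt(gamma)) ** 2), 0.0
    lam = (1.0 + ell) * (1.0 + gamma / ell)
    cos2 = (1.0 - gamma / ell**2) / (1.0 + gamma / ell)
    return float(lam), float(cos2)


# Illustrative run: gamma = 0.5, so the threshold is sqrt(0.5), roughly 0.71.
rng = np.random.default_rng(0)
n, p = 1000, 500
for ell in (0.3, 1.0, 2.0):
    lam, cos2 = simulate_spike(n, p, ell, rng)
    lam_pred, cos2_pred = predicted(p / n, ell)
    print(f"ell={ell}: eigenvalue {lam:.2f} (predicted {lam_pred:.2f}), "
          f"cos^2 {cos2:.2f} (predicted {cos2_pred:.2f})")
```

In this toy setting, a squared cosine well below 1 above the threshold corresponds to a sample PC that is only partially correlated with the true axis, which is the kind of inaccuracy the abstract connects to residual confounding.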