Knockoffs for Finding Variables with Statistical Guarantees/Challenges in High-Dimensional Variable Selection
Harvard Statistics
Using Knockoffs to Find Important Variables with Statistical Guarantees
Abstract: Despite the significant recent progress in high-dimensional variable selection (reviewed in the primer), it remains unclear how to powerfully select important variables while controlling the fraction of false discoveries, even in simple models like logistic regression, not to mention general high-dimensional nonlinear models. To address this practical problem, we propose a new framework of model-X knockoffs, which acts as a wrapper around any measure of variable importance (however complex, e.g., one drawn from machine learning) and identifies important variables while exactly controlling the false discovery rate (FDR). Our method relies only on a model for the explanatory variables X, and in fact makes no assumptions at all about the response variable's distribution. To our knowledge, no other procedure solves the FDR-controlled variable selection problem in such generality, but in the restricted settings where competitors exist we demonstrate the superior power of knockoffs through simulations. We also demonstrate model-X knockoffs on genome-wide association study (GWAS) data from a case-control study of Crohn’s disease in the United Kingdom, making twice as many discoveries as the original analysis of the same data.
Wenshuo Wang
Harvard Statistics
Primer: Challenges in High-Dimensional Variable Selection
Abstract: Identifying relevant features to explain a response variable has always been an important problem in many areas of science. As data sets become more complex, the number of candidate features is quickly growing and very often even exceeds the number of observations we can afford to collect. This brings huge challenges for statisticians and scientists, as traditional variable selection methods fail in these cases. This talk reviews these challenges and existing statistical methods to address them. We will discuss the advantages and disadvantages of those methods, ultimately motivating the novel approach presented in the main talk.