MIT CSAIL, HMS, Harvard CS
Detecting novel associations in large data sets
Abstract: As data sets grow in dimensionality, making sense of the wealth of interactions they contain has become a daunting task, not just because of the sheer number of relationships but also because relationships come in different forms (e.g., linear, exponential, periodic) and strengths. If you do not already know what kinds of relationships might be interesting, how do you find the most important or unanticipated ones effectively and efficiently? This is commonly done by using a statistic to rank relationships in a data set and then manually examining the top of the resulting list. For such a strategy to succeed, though, the statistic must give similar scores to equally noisy relationships of different types. In this talk we will formalize this property, called equitability, and show how it is related to a variety of traditional statistical concepts. We will then introduce the maximal information coefficient, a statistic with state-of-the-art equitability in a wide range of settings, and discuss how its equitability translates into practical benefits in the search for dependence structure in high-dimensional data, using examples from global health and the human gut microbiome.
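To make the ranking-and-equitability idea concrete, here is a minimal, hypothetical Python sketch; it is not the maximal information coefficient and not code from the talk. It scores three equally noisy relationships of different functional forms with Pearson r^2 and with a crude binned mutual-information estimate, illustrating the kind of "similar scores for equally noisy relationships" comparison the abstract describes. The helper name binned_mi and all parameter choices are illustrative assumptions.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

def binned_mi(x, y, bins=10):
    """Plug-in mutual information estimate (in nats) from a 2D histogram.

    A crude, biased stand-in for more careful estimators (and for MIC);
    used here only to contrast with Pearson r^2.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

# Three relationships with the same noise level but different forms.
n, noise = 1000, 0.1
x = rng.uniform(-1, 1, n)
relationships = {
    "linear":     x + rng.normal(0, noise, n),
    "parabolic":  x**2 + rng.normal(0, noise, n),
    "sinusoidal": np.sin(4 * np.pi * x) + rng.normal(0, noise, n),
}

for name, y in relationships.items():
    r, _ = pearsonr(x, y)
    print(f"{name:>10}: Pearson r^2 = {r**2:.2f}, binned MI = {binned_mi(x, y):.2f}")
```

Pearson r^2 is high only for the linear case even though all three relationships are equally noisy; an equitable statistic would instead assign the three cases roughly similar scores.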
MIT CSAIL, HMS, Harvard CS
Primer: Hypothesis testing and measures of dependence
Abstract: Searching for departures from statistical independence in data is a fundamental problem that has been formalized in a variety of ways. We will cover two frameworks in which this problem has historically been understood. The first is statistical and involves framing the search as a hypothesis test in a finite-sample setting. The second is probabilistic and involves defining functions of random variables that have useful properties in the large-sample limit. We will close with a discussion of common themes underlying measures of dependence arising from each of these paradigms.
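As a concrete instance of the first (finite-sample, hypothesis-testing) framework, here is a small hypothetical Python sketch of a permutation test of independence; the choice of test statistic (absolute Spearman correlation), the function names, and the data are illustrative assumptions, not material from the primer.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)

def abs_spearman(a, b):
    """Absolute Spearman rank correlation, used as the test statistic."""
    rho, _ = spearmanr(a, b)
    return abs(rho)

def permutation_test(x, y, stat=abs_spearman, n_perm=2000):
    """Permutation test of H0: x and y are statistically independent.

    Permuting y preserves both marginal distributions but destroys any
    dependence, so the permuted statistics approximate the null
    distribution of the test statistic under H0.
    """
    observed = stat(x, y)
    null = np.array([stat(x, rng.permutation(y)) for _ in range(n_perm)])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value

x = rng.uniform(0, 1, 200)
y_dep = np.exp(x) + rng.normal(0, 0.2, 200)  # noisy monotone dependence
y_ind = rng.uniform(0, 1, 200)               # independent of x

for label, y in [("dependent", y_dep), ("independent", y_ind)]:
    obs, p = permutation_test(x, y)
    print(f"{label:>11}: |Spearman rho| = {obs:.2f}, permutation p = {p:.4f}")
```

A measure of dependence in the second (probabilistic) sense would instead be defined at the population level, for example mutual information of the joint distribution, with its finite-sample estimate converging to that population quantity in the large-sample limit.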