Meaningful Signals Within Deep Learning Models; Primer: Meaningful choice/curation of pre-training data in alignment with a downstream task

Alex Lu
Microsoft Research New England
Primer: Towards Meaningful Pretrained Models for Biology

Modern biological experiments and curation efforts have amassed tremendous amounts of data across domains. These big datasets now drive efforts in large pretrained models, where researchers expose deep learning models to large quantities of unlabeled data with the aim of initializing them with a foundational knowledge of biology so that they can be rapidly transferred to useful analyses. While pretrained models promise to democratize the benefits of deep learning, current models are not guaranteed to provide any meaningful signal for analyses, and in some cases worsen a biologist’s ability to resolve signal. This hinders the usefulness of models, especially in exploratory analysis and hypothesis discovery applications where there may not be enough prior annotations to empirically benchmark models.

Meeting: Stanley Hua
University of Toronto
Meaningful choice/curation of pre-training data in alignment with a downstream task

Motivation: Lack of labeled data is a common problem when applying computer vision to biomedical imaging data. A common approach is "transfer learning", which pre-trains a model on a separate (and sometimes unrelated) dataset to give it a good initialization, so that its features generalize to the new task. Most pre-training approaches in biomedical computer vision rely upon ImageNet or other natural image datasets. However, given the large shift in domain, we hypothesized that transfer from natural image datasets may not be the most effective approach for transfer learning on biomedical images. Here, we sought to understand whether curation of and pre-training on microscopy images can yield better performance on tasks relating to the analysis of microscopy images. We present CytoImageNet, a large-scale dataset of open-source, weakly-labeled microscopy images (890K images, 894 classes). Intriguingly, while models pretrained on CytoImageNet do not surpass the performance of ImageNet-pretrained models, we show that fusing their features improves performance on unseen microscopy images across the board, suggesting that CytoImageNet features capture information not available in ImageNet-trained features. Our work highlights the potential of meaningfully curating domain-relevant datasets to learn domain-relevant features. The CytoImageNet dataset is made available at .
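The feature fusion described above can be sketched as a simple concatenation of per-image feature vectors from the two pretrained encoders. The abstract does not specify the exact fusion scheme, so this is an illustrative sketch, not the authors' implementation; the function name and the per-row normalization step (included so neither encoder dominates by scale) are assumptions.

```python
import numpy as np

def fuse_features(feats_a, feats_b):
    """Fuse per-image features from two pretrained encoders by concatenation.

    feats_a, feats_b: arrays of shape (n_images, dim_a) and (n_images, dim_b),
    e.g. from an ImageNet-pretrained and a CytoImageNet-pretrained model
    (illustrative; the original fusion details may differ).
    """
    def l2_normalize(f):
        # Normalize each row to unit length so both feature sets
        # contribute on a comparable scale (an assumption of this sketch).
        norms = np.linalg.norm(f, axis=1, keepdims=True)
        return f / np.clip(norms, 1e-12, None)

    # Concatenate along the feature dimension: (n, dim_a + dim_b).
    return np.concatenate([l2_normalize(feats_a), l2_normalize(feats_b)], axis=1)
```

The fused vectors can then be fed to any downstream evaluation (e.g. nearest-neighbor retrieval on unseen microscopy images).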


Meeting Part 2:
Alexander Lin
School of Engineering and Applied Sciences
Harvard University

Disentangling Meaningful Signal from Experimental Noise within Deep Learning Models

Data collected by high-throughput microscopy experiments are affected by batch effects, stemming from slight technical differences between experimental batches. Batch effects significantly impede machine learning efforts, as models learn spurious technical variation that does not generalize. We introduce batch effects normalization (BEN), a simple method for correcting batch effects that can be applied to any neural network with batch normalization (BN) layers. BEN aligns the concept of a "batch" in biological experiments with that of a "batch" in deep learning. During each training step, the data points forming the deep learning batch are always sampled from the same experimental batch. This small tweak turns the batch normalization statistics into an estimate of the batch effects shared across images, allowing these technical effects to be standardized out during training and inference. We demonstrate that BEN yields dramatic performance boosts in both supervised and unsupervised learning, leading to state-of-the-art performance on the RxRx1-Wilds benchmark.
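The sampling tweak at the heart of BEN can be sketched as a mini-batch sampler that only ever draws indices from a single experimental batch, so that each forward pass through a BN layer normalizes out statistics shared within that experimental batch. This is a minimal sketch of the sampling scheme the abstract describes; the function name and signature are assumptions, not the authors' API.

```python
import numpy as np
from collections import defaultdict

def ben_batches(exp_batch_labels, batch_size, seed=0):
    """Group dataset indices into mini-batches drawn from one experimental batch each.

    exp_batch_labels: per-image experimental batch identifiers (length n_images).
    Returns a shuffled list of mini-batches (lists of indices). Feeding these
    to a network with BatchNorm layers makes each layer's per-mini-batch
    statistics an estimate of that experimental batch's shared technical
    effects, which normalization then standardizes out.
    """
    rng = np.random.default_rng(seed)

    # Bucket image indices by their experimental batch.
    groups = defaultdict(list)
    for idx, label in enumerate(exp_batch_labels):
        groups[label].append(idx)

    # Chunk each experimental batch into mini-batches of size batch_size.
    batches = []
    for indices in groups.values():
        indices = np.array(indices)
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batches.append(indices[start:start + batch_size].tolist())

    # Shuffle the order of mini-batches across experimental batches.
    rng.shuffle(batches)
    return batches
```

At inference time the same grouping would be applied so that BN statistics are again computed within an experimental batch rather than across batches.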


For more information visit: /mia.