Active learning for phenotypic profiling

Focus on Research and Development by Prof. Kevin Smith, available now as a product from BioDataAnalysis GmbH.

High-content screening is a powerful method to discover new drugs and carry out basic biological research. Increasingly, high-content screens have come to rely on supervised machine learning (SML) to perform automatic phenotypic classification as an essential step of the analysis. However, this comes at a cost, namely, the labeled examples required to train the predictive model. Classification performance increases with the number of labeled examples, and because labeling examples demands time from an expert, the training process represents a significant time investment. Active learning strategies attempt to overcome this bottleneck by presenting the most relevant examples to the annotator, thereby achieving high accuracy while minimizing the cost of obtaining labeled data.

Active learning. Starting from the top, an instance or set of instances is sampled from the unlabeled pool using a query strategy and presented as queries to the expert. The expert provides labels for the queries, and the labeled instances are added to the labeled training set. Next, the predictive model is retrained with the updated labeled training data. After retraining, the SML algorithm makes predictions for each instance in the unlabeled pool, which the query strategy then uses to measure the informativeness of each instance when choosing the next query.
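The cycle described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in the study: a toy nearest-centroid classifier stands in for the SML method, prediction entropy stands in for the informativeness measure, and all names (fit_centroids, predict_proba, active_learning, the oracle callback) are hypothetical.

```python
import math
import random


def fit_centroids(X, y):
    """Toy stand-in for an SML method: one mean vector (centroid) per class."""
    sums, counts = {}, {}
    for features, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_proba(centroids, features):
    """Class probabilities from a softmax over negative squared distances."""
    labels = sorted(centroids)
    scores = [-sum((a - b) ** 2 for a, b in zip(features, centroids[c])) for c in labels]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return {c: w / total for c, w in zip(labels, weights)}


def entropy(probabilities):
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probabilities.values() if p > 0)


def active_learning(pool, oracle, seed_indices, cycles, qc, informativeness=entropy):
    """Run up to `cycles` rounds, asking the oracle for `qc` labels per round."""
    labeled = {i: oracle(i) for i in seed_indices}
    unlabeled = [i for i in range(len(pool)) if i not in labeled]
    model = fit_centroids([pool[i] for i in labeled], list(labeled.values()))
    for _ in range(cycles):
        if not unlabeled:
            break
        # Query strategy: rank unlabeled instances by the current model's uncertainty.
        ranked = sorted(
            unlabeled,
            key=lambda i: informativeness(predict_proba(model, pool[i])),
            reverse=True,
        )
        # The expert (oracle) labels the qc most informative instances.
        for i in ranked[:qc]:
            labeled[i] = oracle(i)
        unlabeled = [i for i in unlabeled if i not in labeled]
        # Retrain the predictive model on the enlarged labeled set.
        model = fit_centroids([pool[i] for i in labeled], list(labeled.values()))
    return model, labeled


# Synthetic two-class pool (made-up data, not from the screens).
rng = random.Random(0)
pool = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
pool += [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(50)]
truth = [0] * 50 + [1] * 50

model, labeled = active_learning(
    pool, oracle=lambda i: truth[i], seed_indices=[0, 50], cycles=5, qc=3
)
print(len(labeled))  # 2 seed labels + 5 cycles * 3 queries = 17
```

Passing the informativeness measure as a parameter keeps the loop independent of any particular query strategy, so different strategies can be compared with the same model.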

We investigated the impact of active learning on single-cell–based phenotype recognition, using data from three large-scale RNA interference high-content screens representing diverse phenotypic profiling problems. We considered several combinations of active learning strategies and popular SML methods. Our results show that active learning significantly reduces the time cost, in several cases by a factor of 3 or more. Active learning does not reduce the quality of the predictions, and can reveal the same phenotypic targets identified using SML.

We tested 42 combinations of SML methods and active learning strategies on data from 3 different high-content screens: ribosome biogenesis, Semliki Forest virus, and Uukuniemi virus. Each test was repeated 5 times, the area under the learning curve was computed, and the results were averaged. We tested 17 different values of qc, the number of queries posed to the oracle per active learning cycle. In total, 672 tests were performed, using 10,320 CPU hours distributed over 48 cores.
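As an illustration of the scoring step, the area under a learning curve can be approximated with the trapezoidal rule and then averaged over repeats. The sketch below makes several assumptions that the text does not specify: the curve is accuracy versus cost (number of labels or elapsed time), the area is normalized by the cost range so that it stays between 0 and 1, and the accuracy values are invented for the example.

```python
from statistics import mean


def area_under_learning_curve(costs, accuracies):
    """Trapezoidal area under accuracy-vs-cost, normalized by the cost range."""
    if len(costs) != len(accuracies) or len(costs) < 2:
        raise ValueError("need at least two matching (cost, accuracy) points")
    area = 0.0
    for k in range(1, len(costs)):
        width = costs[k] - costs[k - 1]
        area += width * (accuracies[k] + accuracies[k - 1]) / 2.0
    return area / (costs[-1] - costs[0])


# Hypothetical accuracies after 3, 6, 9 and 12 labels, for two repeats.
labels_seen = [3, 6, 9, 12]
repeats = [
    [0.60, 0.75, 0.85, 0.90],
    [0.55, 0.70, 0.80, 0.90],
]

# Average the per-repeat areas into one score for the combination.
score = mean(area_under_learning_curve(labels_seen, run) for run in repeats)
print(round(score, 4))  # 0.7625
```

A combination whose accuracy rises quickly with few labels obtains a larger area than one that reaches the same final accuracy more slowly, which is why the area is a convenient single-number summary.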

Area under the learning curve for 42 combinations of SML methods and active learning strategies for qc=3, each test repeated 5 times.

The experiment was repeated with a per-query analysis instead of a per-unit-time analysis in order to obtain software- and hardware-agnostic results. We found that, while no single combination performs best, Random Forest with least confidence sampling and qc=3 performs well across the board.
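Least confidence sampling queries the instances whose most probable class has the lowest predicted probability. A minimal sketch with qc=3 follows; the function name and the probability values are hypothetical, and in practice the rows would be the class probabilities predicted by the trained model (for example, a Random Forest) for each unlabeled instance.

```python
def least_confidence_queries(probabilities, qc=3):
    """Return indices of the qc instances whose top class probability is lowest."""
    confidence = [max(row) for row in probabilities]
    ranked = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    return ranked[:qc]


# Made-up class probabilities for five unlabeled cells and three phenotypes.
predicted = [
    [0.95, 0.03, 0.02],
    [0.40, 0.35, 0.25],
    [0.55, 0.30, 0.15],
    [0.34, 0.33, 0.33],
    [0.80, 0.10, 0.10],
]

print(least_confidence_queries(predicted, qc=3))  # [3, 1, 2]
```

The cell at index 3 is queried first because the model is nearly undecided among the three phenotypes, while the confidently classified cells at indices 0 and 4 are left in the pool.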


  1. K. Smith, P. Horvath. Active Learning Strategies for Phenotypic Profiling of High-Content Screens. Journal of Biomolecular Screening, to appear 2014.