Focus on Research and Development by Prof. Kevin Smith, available now as a product from BioDataAnalysis GmbH.
High-content screening is a powerful method to discover new drugs and carry out basic biological research. Increasingly, high-content screens have come to rely on supervised machine learning (SML) to perform automatic phenotypic classification as an essential step of the analysis. However, this comes at a cost, namely, the labeled examples required to train the predictive model. Classification performance increases with the number of labeled examples, and because labeling examples demands time from an expert, the training process represents a significant time investment. Active learning strategies attempt to overcome this bottleneck by presenting the most relevant examples to the annotator, thereby achieving high accuracy while minimizing the cost of obtaining labeled data.
We investigated the impact of active learning on single-cell–based phenotype recognition, using data from three large-scale RNA interference high-content screens representing diverse phenotypic profiling problems. We considered several combinations of active learning strategies and popular SML methods. Our results show that active learning significantly reduces the time cost, in several cases by a factor or 3 or more. Active learning does not reduce the quality of the predictions, and can reveal the same phenotypic targets identified using SML.
We tested 42 combinations of SML methods and active learning strategies on data from 3 different high content screens Ribosome biogensis, Semliki forest virus, and Uukuniemi virus. Each test was repeated 5 times, the area under the learning curve was computed, and the results were averaged. We tested for 17 different values of qc, the number of queries posed to oracle per active learning cycle. In total 672 tests were performed, using 10,320 CPU hours distributed on 48 cores.
The experiment was repeated with a per-query analysis instead of per-unit-time in order to obtain software/hardware agnostic results. Our finding was that while no single combination performs best, Random Forest with least confidence sampling and qc =3 performs well across the board.
References
- K. Smith, P. Horvath. Active Learning Strategies for Phenotypic Profling of High-Content Screens. Journal of Biomolecular Screening, to appear 2014.