Development of machine learning techniques for flow cytometry data

01 January 2014 → 31 December 2017
Regional and community funding: IWT/VLAIO
Research disciplines
  • Engineering and technology
    • Computer hardware
    • Computer theory
    • Scientific computing
    • Other computer engineering, information technology and mathematical engineering
Keywords
flowcytometric data, Immunity
Project description

The immune system, a complex system consisting of many different cell types, is our body’s main defense mechanism against all kinds of intruders. It plays a major role in most diseases, either by battling the culprit in infectious diseases such as flu, or because something goes wrong with its functioning in immune diseases such as asthma. Determining the immune profile of patients can help to diagnose them or to follow their treatment, whereas studying immune cells in vitro or the immune system of laboratory animals is crucial in drug development. To determine an immune profile, a high-throughput technology called flow cytometry is often used. Biological samples are stained with antibodies bound to fluorochromes and the cells in solution are passed through a fluidics system. Using an optics system including lasers and bandpass filters, the fluorescence emission of every individual cell is measured, which reports the presence of specific proteins or ‘markers’ on the cell surface. This makes it possible to identify different cell types and gives insight into the immune profile of a patient. This technology can capture information on thousands of individual cells per second.
The analysis of flow cytometry data typically consists of multiple steps. First, some quality control steps should be executed, such as removing erroneous measurements caused by obstructions in the machine, dead cells or doublets. Some artifacts from the optics system need to be corrected as well, by compensating and transforming the data. Next, the different cell types can be distinguished. Traditionally, this is done by ‘gating’ the data, a procedure in which subsets of cells are iteratively selected by drawing polygons on two-dimensional scatter plots. Detecting the different cell types is rarely the final goal of the research. Often an additional analysis is executed on the cell type counts or percentages, to determine differences between patient groups or lab animals.
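The compensation, transformation and gating steps described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the two-channel spillover matrix, the arcsinh cofactor of 150 and the polygon coordinates are assumptions chosen only for illustration, and all function names are hypothetical.

```python
import math


def invert_2x2(matrix):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def compensate(event, spillover):
    """Recover fluorochrome signals from detector readings.

    Assumes measured = true @ spillover, where spillover[i][j] is the
    fraction of fluorochrome i that is picked up by detector j.
    """
    inverse = invert_2x2(spillover)
    return [sum(event[i] * inverse[i][j] for i in range(2)) for j in range(2)]


def arcsinh_transform(value, cofactor=150.0):
    """Arcsinh transform; the cofactor is an assumed, tunable parameter."""
    return math.asinh(value / cofactor)


def in_polygon(x, y, polygon):
    """Ray-casting test: is the point (x, y) inside the polygon gate?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


if __name__ == "__main__":
    spillover = [[1.0, 0.1], [0.05, 1.0]]    # hypothetical spillover values
    measured = [1010.0, 300.0]               # raw detector readings of one cell
    signal = compensate(measured, spillover)
    transformed = [arcsinh_transform(v) for v in signal]
    gate = [(1.0, 0.0), (4.0, 0.0), (4.0, 2.0), (1.0, 2.0)]  # hypothetical gate
    print(signal, transformed, in_polygon(transformed[0], transformed[1], gate))
```

In practice, a gating strategy applies a sequence of such polygon tests, each one to the cells retained by the previous gate.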
In recent years, the number of markers that can be measured simultaneously has strongly increased. The original machine design in the seventies could measure only two colors, and this increased gradually to twelve in the nineties. With the discovery of fluorochromes with narrower emission spectra and the advent of mass cytometry, this number increased to thirty and more in the last ten years. This causes the traditional way of analyzing this data to fall short. While for smaller datasets observing two parameters at a time was enough to identify the cell populations, this view is too limited for high-dimensional datasets. It is not only time-consuming, but also strongly biased towards the expected populations. Many cells are ‘gated out’ and never analyzed, and all markers are rarely studied together for a single cell. Additionally, as more and more cell populations can be detected, it becomes harder to identify which (combinations of) cell populations are predictive of a clinical outcome.
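To make the scaling problem concrete, the number of distinct two-marker scatter plots grows quadratically with the number of markers. The short sketch below counts them for the marker counts mentioned above; it assumes that every pair of markers would be inspected, which a manual gating strategy does not actually do, and the function name is hypothetical.

```python
import math


def pairwise_plot_count(n_markers):
    """Number of distinct two-marker scatter plots for n_markers markers."""
    return math.comb(n_markers, 2)


if __name__ == "__main__":
    for n in (2, 12, 30, 50):
        print(n, "markers:", pairwise_plot_count(n), "two-dimensional plots")
```

Thirty markers already yield 435 possible plots for each gated population, which illustrates why an exhaustive manual inspection is not feasible.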
Machine learning, a branch of computer science in which models are learned from data, might help to tackle these problems. It offers algorithms to handle high-dimensional data (such as dimensionality reduction and feature selection), algorithms to select subpopulations in the data (called clustering) and algorithms to predict values, such as a group label or a survival time, from a description of a patient (classification and regression). Most of these machine learning techniques can find useful applications in flow cytometry research.
In this work, we evaluate which algorithms are best suited for this type of data and develop several specific solutions for different use cases. The first chapter is a general introduction to the flow cytometry technique, showing its uses in immunology research, followed by a short overview of machine learning approaches.
In the second chapter, we develop a comprehensive visualization tool for flow cytometry data, called FlowSOM, as the traditional 2D scatter plots give an incomplete picture, and alternative techniques such as SPADE and viSNE were not able to handle the millions of cells processed from flow cytometry samples. FlowSOM uses a self-organizing map, making it computationally very scalable, and includes an additional metaclustering step, allowing clusters of strongly varying sizes and shapes. The clusters from the self-organizing map are visualized in a minimal spanning tree, a view which has a very intuitive interpretation, with the separate branches representing different cell types and the separate nodes in the branches representing small variations in a specific cell population.
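The two building blocks named above, a self-organizing map and a minimal spanning tree over its nodes, can be sketched in a few lines. This is a simplified illustration and not the FlowSOM implementation: the learning-rate and neighborhood schedules are assumptions, the metaclustering step is omitted, and all function names are hypothetical.

```python
import math
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_som(cells, grid_w, grid_h, epochs=10, seed=0):
    """Train a small self-organizing map; returns one vector per grid node."""
    rng = random.Random(seed)
    n_nodes = grid_w * grid_h
    # Initialise every node with a randomly chosen cell.
    nodes = [list(rng.choice(cells)) for _ in range(n_nodes)]
    coords = [(k % grid_w, k // grid_w) for k in range(n_nodes)]
    total_steps = epochs * len(cells)
    step = 0
    for _ in range(epochs):
        order = list(range(len(cells)))
        rng.shuffle(order)
        for idx in order:
            cell = cells[idx]
            frac = step / total_steps
            rate = 0.5 * (1.0 - frac) + 0.01             # assumed decay schedule
            radius = max(grid_w, grid_h) / 2.0 * (1.0 - frac) + 0.5
            # Best matching unit: the node closest to this cell.
            bmu = min(range(n_nodes),
                      key=lambda k: squared_distance(nodes[k], cell))
            bx, by = coords[bmu]
            for k, (x, y) in enumerate(coords):
                grid_d2 = (x - bx) ** 2 + (y - by) ** 2
                if grid_d2 <= radius ** 2:
                    influence = math.exp(-grid_d2 / (2.0 * radius ** 2))
                    nodes[k] = [w + rate * influence * (c - w)
                                for w, c in zip(nodes[k], cell)]
            step += 1
    return nodes


def assign_cells(cells, nodes):
    """Index of the closest node (cluster) for every cell."""
    return [min(range(len(nodes)),
                key=lambda k: squared_distance(nodes[k], cell))
            for cell in cells]


def minimum_spanning_tree(nodes):
    """Prim's algorithm on the node vectors; returns a list of (i, j) edges."""
    in_tree = {0}
    edges = []
    while len(in_tree) < len(nodes):
        _, i, j = min((squared_distance(nodes[i], nodes[j]), i, j)
                      for i in in_tree
                      for j in range(len(nodes)) if j not in in_tree)
        edges.append((i, j))
        in_tree.add(j)
    return edges
```

Because each cell only has to be compared with a fixed number of nodes, the cost of training grows linearly with the number of cells, which is what makes this type of model scale to millions of events.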
While the first version of the FlowSOM algorithm managed to give a comprehensive overview of a dataset, we soon noticed that it was not yet able to answer many questions of the immunologists without effort, such as ‘What is the immunophenotypic difference between these two groups of patients?’ and ‘Which branch represents the dendritic cells?’. In the third chapter, we describe some additional functionality that was implemented in the FlowSOM R package, which is available on Bioconductor, and that allows users to do a more complete analysis of their data without much extra work.
The fourth chapter describes our participation in the FlowCAP IV challenge. The FlowCAP consortium provided a flow cytometry dataset of HIV patients with known progression time to AIDS, and was looking for cell populations which could predict this progression rate. We built a pipeline called FloReMi, which first applied extensive preprocessing to clean the files and then used a combination of the existing flowDensity and flowType algorithms to automatically detect many possible populations. We applied a supervised feature selection procedure to find populations of interest with minimal redundancy, and the progression time was predicted using a random survival forest. Our final results outperformed the submissions of the other eight teams participating in this challenge.
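The idea of selecting informative populations with minimal redundancy can be sketched as a greedy filter. The example below is only an illustration under simplifying assumptions and not the FloReMi procedure: it scores each population by its absolute Pearson correlation with the progression time, ignores censoring, uses an assumed redundancy threshold of 0.8, and does not include the random survival forest. All names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long lists (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(features, outcome, max_features=3, max_redundancy=0.8):
    """Greedy selection: most relevant first, skipping redundant features.

    features maps a population name to its value in every patient;
    outcome holds the progression time of every patient.
    """
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], outcome)),
                    reverse=True)
    selected = []
    for name in ranked:
        # Keep a feature only if it is not too similar to any selected one.
        if all(abs(pearson(features[name], features[kept])) < max_redundancy
               for kept in selected):
            selected.append(name)
        if len(selected) == max_features:
            break
    return selected


if __name__ == "__main__":
    progression = [1, 2, 3, 4, 5, 6]              # made-up toy values
    populations = {
        "pop_a": [1, 2, 3, 4, 5, 6],
        "pop_a_like": [2, 4, 6, 8, 10, 13],       # nearly a copy of pop_a
        "pop_b": [2, 1, 4, 3, 1, 5],
        "pop_c": [3, 1, 2, 2, 1, 3],
    }
    print(select_features(populations, progression, max_features=2))
```

In this toy example the near-duplicate population is skipped even though it is strongly associated with the outcome, because it adds no information beyond the population that was already selected.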
In the fifth chapter, a review of computational flow cytometry techniques is given, including the two techniques from the previous chapters and many other techniques developed in different research groups. Even though several algorithms exist, most people in the lab still use the traditional gating approach to analyze their data. It is necessary to introduce these new techniques to immunologists and to give them an overview of all the different approaches and their advantages, so they can make an informed decision about what could be worth learning to advance their research.
Although all previous chapters focused on flow cytometry data, the algorithms can also be used in mass cytometry settings. Mass cytometry is a variation on flow cytometry where the cells are labeled with rare earth metals instead of fluorochromes. This circumvents the limitations of the optics system and allows the number of markers to go up to fifty and more. As mass cytometry is used more and more often in clinical studies, it is of utmost importance that the values are comparable between the different samples. To ensure this, samples are often processed per plate, but even then, batch effects can appear between different plates. In the sixth chapter, we propose a new normalization procedure based on quantile normalization, which takes into account possible cell-type-specific effects by incorporating the FlowSOM algorithm.
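The principle of a cluster-wise quantile normalization can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure proposed in the sixth chapter: cluster labels are assumed to be given already (for instance by a clustering step), a single marker is handled, the reference is simply the mean of the per-plate quantiles, and all function names are hypothetical.

```python
def quantiles(values, n_q):
    """n_q evenly spaced quantiles (min to max) with linear interpolation."""
    s = sorted(values)
    result = []
    for k in range(n_q):
        pos = k * (len(s) - 1) / (n_q - 1)
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        result.append(s[lo] + (pos - lo) * (s[hi] - s[lo]))
    return result


def interpolate(x, xs, ys):
    """Piecewise-linear map from xs to ys, clamped at both ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            if xs[i] == xs[i - 1]:
                return ys[i]
            t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + t * (ys[i] - ys[i - 1])
    return ys[-1]


def normalize_plates(plates, n_q=11):
    """Quantile-normalize one marker, separately for every cluster.

    plates maps a plate name to {cluster name: list of marker values}.
    Each cluster is mapped onto the mean of its per-plate quantiles.
    """
    clusters = {c for per_cluster in plates.values() for c in per_cluster}
    normalized = {name: {} for name in plates}
    for cluster in clusters:
        plate_q = {name: quantiles(per_cluster[cluster], n_q)
                   for name, per_cluster in plates.items()
                   if per_cluster.get(cluster)}
        reference = [sum(q[k] for q in plate_q.values()) / len(plate_q)
                     for k in range(n_q)]
        for name, q in plate_q.items():
            normalized[name][cluster] = [interpolate(v, q, reference)
                                         for v in plates[name][cluster]]
    return normalized
```

Normalizing per cluster rather than per plate as a whole means that a shift affecting only one cell type can be corrected without distorting the cell types that were not affected.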
In conclusion, we developed new tools for all steps of a flow cytometry analysis, from preprocessing through cell type identification to prognosis prediction. Making use of machine learning techniques allowed us to improve on existing analysis tools, and several of our methods have been adopted by other research groups. However, most people in the lab will not yet take the leap to start programming scripts. It will take some time until these new analysis tools are implemented in the commercial point-and-click solutions used by most immunologists. In the meantime, strong collaborations between wet lab teams and bioinformatics teams will keep pushing computational flow cytometry to a new level.