Project iCA_07-01_2020: Machine learning and statistical methods for metabolomics data analysis guided by metabolic pathway for feature selection

My research focuses on the analysis of metabolomics data from disease studies, covering both untargeted and targeted metabolomics. The goal of my PhD project is to develop statistical and computational methods and tools to enhance metabolomics research and discovery.

Metabolomics is a rapidly growing field that aims to comprehensively characterize metabolites, i.e. small molecular compounds, in a high-throughput manner. It is an interdisciplinary field that combines analytical chemistry and mass spectrometry with sophisticated data analysis. Metabolomics analysis is typically conducted in one of two ways: untargeted or targeted. Targeted studies focus on identifying and quantifying a defined set of known metabolites, while untargeted studies allow a more comprehensive evaluation of metabolomic profiles and can thereby screen for putative metabolites.

In untargeted metabolomics analysis, the annotation of unknown features is still a major bottleneck. A single biological sample contains thousands of ion signals from metabolites, contaminants, artifacts and background noise. In addition, each metabolite produces multiple signals due to the presence of isotopes, adducts, multimers, different charge states and neutral loss fragments. These degenerate features are an obstacle to compound identification and lead to false discoveries in downstream statistical analysis. To improve the accuracy of feature annotation, statistical methods (e.g. correlation) and unsupervised learning methods (e.g. clustering, distance metrics) are employed to group features by a similarity score.
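As a rough illustration of grouping degenerate features by a similarity score, the following minimal sketch links features whose intensity profiles across samples are highly correlated and merges linked features into groups. It is not the project's actual method: the feature names, the intensity values, the retention-time check and both thresholds (min_corr, max_rt_diff) are assumptions chosen for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length intensity profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant profile: treat as uncorrelated
    return cov / (sx * sy)


def group_features(profiles, rt, min_corr=0.9, max_rt_diff=2.0):
    """Group features that correlate across samples and elute together.

    profiles: dict of feature id -> intensities, one value per sample
    rt:       dict of feature id -> retention time in seconds
    Both thresholds are illustrative assumptions, not recommended values.
    """
    parent = {f: f for f in profiles}

    def find(f):
        # Union-find root lookup with path halving.
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    for a, b in combinations(profiles, 2):
        close_in_time = abs(rt[a] - rt[b]) <= max_rt_diff
        if close_in_time and pearson(profiles[a], profiles[b]) >= min_corr:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for f in profiles:
        groups.setdefault(find(f), []).append(f)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical data: three signals of one metabolite plus one unrelated feature.
profiles = {
    "M+H": [100, 200, 150, 400],
    "M+Na": [52, 98, 77, 205],
    "M+1 isotope": [11, 21, 16, 43],
    "other": [300, 120, 310, 90],
}
rt = {"M+H": 120.0, "M+Na": 120.4, "M+1 isotope": 119.8, "other": 120.1}

groups = group_features(profiles, rt)
print(groups)
```

In this toy data the adduct and isotope signals end up in one group, while the unrelated feature stays on its own even though it elutes at almost the same time, because its profile does not follow the others.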

In targeted metabolomics analysis, the objective is to develop algorithms and workflows that handle large-scale datasets, in order to better understand the biological mechanisms involved in pathology development and to identify predictive biomarkers. The work consists of feature selection and pathway mapping using statistical approaches (e.g. ANOVA) and machine learning approaches (e.g. support vector machines, random forests, neural networks).
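To make the ANOVA-based feature selection step concrete, here is a minimal sketch that computes the one-way ANOVA F statistic per metabolite and ranks metabolites by it. The metabolite names, the three-group design and the concentration values are hypothetical, and the sketch stops at ranking: it does not compute p-values or correct for multiple testing, which a real analysis would need.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a single metabolite.

    groups: list of lists, one list of measurements per condition.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of measurements around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_metabolites(data):
    """Return (name, F) pairs sorted by descending F statistic."""
    scores = {name: anova_f(groups) for name, groups in data.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical concentrations for two metabolites in three conditions.
data = {
    "metabolite_A": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
    "metabolite_B": [[5.0, 6.0, 4.0], [5.5, 4.5, 5.0], [6.0, 4.0, 5.0]],
}

ranking = rank_metabolites(data)
print(ranking)
```

Metabolites with a large F statistic differ more between conditions than within them, so they are natural candidates to pass on to pathway mapping or to a downstream classifier.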

Name of Doctoral Researcher

Ruibing Shi

Name of Supervisor

Frank Klawonn

Institute / Department

Biostatistics group, HZI

Department of Computer Science, Ostfalia University of Applied Sciences

Contact details

ruibing.shi@helmholtz-hzi.de