Assistant Professor, Cornell University, Ithaca, New York
Plants produce a vast diversity of chemicals. Many of these can be detected using untargeted liquid chromatography-tandem mass spectrometry (LC-MS/MS); however, fewer than 5% of the peaks can be assigned compound names based on matches to public spectral databases. Novel machine learning (ML) methods provide an opportunity to assess a larger proportion of the metabolome by annotating peaks with their structural classes instead of their compound names. Using grid search and publicly available MS/MS training datasets, we developed a random forest model for predicting lipid classes from MS/MS data. Unfortunately, this model showed signatures of over-fitting. We then used a novel, publicly available deep learning-based algorithm to predict metabolite classes from MS/MS data and to infer metabolomic changes under stress. We were able to predict metabolite classes for ~82% of the detected peaks, vs. ~5% using spectral matches alone. However, extensive quality testing using orthogonal methods and authentic standards revealed that the predictions depend on various experimental and training conditions, and left only ~37% of the fragmented peaks usable. Combining information theory approaches with the above ML-based method allowed inference of co-regulated metabolite classes in roots and leaves of Brachypodium distachyon under environmental change. We also developed a method for predicting perturbed metabolic pathways from ML-based annotation of metabolome data. Based on these results, I will discuss the importance of validation when applying ML-based methods to real experimental datasets, and the insights that these methods can reveal when applied to plant metabolomic analyses.