Staff Scientist, Lawrence Berkeley National Laboratory, Orinda, California
Genomes have the potential to serve as blueprints for capturing the complex physiology, biochemistry, metabolism, phenotypes, and responses of plants to their environment. Towards this goal, combining ‘omics approaches with targeted protein-level experimentation can provide the much-needed systems biology understanding grounded in molecular-level knowledge. However, rather than being blueprints, plant genomes are more like catalogs of what we think we know and what we do not know, revealing the wealth of functionality yet to be discovered. The success of machine learning (ML)-based prediction tools, such as AlphaFold, has heralded the use of such approaches to solve post-genomic challenges in translating sequence to function. However, building accurate ML models requires diverse, high-quality, representative training data, such as the comprehensive sequence-to-structure datasets, collected over many decades, that were used to train the neural network-based model AlphaFold. Towards ML-augmented plant metabolic modeling, we have surveyed the known and unknown metabolic landscapes of plants, substantiating the gaps in data that need to be filled to support the next generation of accurate ML-based tools.