E180 - Fakultät für Informatik E105 - Institut für Computational Statistics
-
Date (published):
2024
-
Number of Pages:
178
-
Keywords:
Deep Learning; Large Language Models
Abstract:
This thesis addresses the challenge of metabolite structure prediction, a key task in drug discovery, by framing it as a neural machine translation (NMT) problem. Using an encoder-decoder transformer architecture, we encoded molecules as 1D strings to predict metabolite structures, considering scenarios both with and without stereochemistry. A major contribution of this work is the establishment of rigorous data-splitting criteria, enhancing transparency and fairness in dataset development. We also introduce Meta-Trans V2, a benchmark dataset with expanded test and validation sets, enabling a more robust evaluation of model generalization. Additionally, a tailored data-cleaning pipeline for enzymatic reactions was developed to align the data more closely with the metabolite prediction task. However, incorporating this cleaned dataset did not yield the expected performance improvements, indicating challenges in data quality. In our exploration of molecular string representations, we compared the popular SMILES notation with the more complex SAFE notation, which maintains spatial proximity in the string. Contrary to our expectations, the added complexity of SAFE did not result in better performance. Similarly, experimenting with a Byte Pair Encoding (BPE) tokenizer specific to the drug and metabolite space provided no significant gains, reinforcing the symbolic nature of chemical language. We also investigated novel augmentation techniques, such as Site-of-Metabolism (SOM) integration and a multiple-output training strategy. The latter, which aligns with the one-to-many nature of metabolite structure prediction, demonstrated superior accuracy and precision compared to existing methods.
By training the model under conditions that mirror the true underlying process, in which a single substrate can yield multiple metabolites, the multiple-output augmentation technique allows the model to produce more diverse and accurate predictions, capturing the complexity of metabolite formation. In conclusion, this thesis advances metabolite structure prediction through new data preparation, model training, and evaluation strategies. The findings offer new insights and resources, contributing to more effective methodologies in the field and laying the groundwork for future research in metabolite prediction using deep learning (DL) approaches.