Neural machine translation for metabolite structure prediction

Bartmann, Christoph

doi:10.34726/hss.2024.120144

Record link:

https://doi.org/10.34726/hss.2024.120144
http://hdl.handle.net/20.500.12708/202248

Title:

Neural machine translation for metabolite structure prediction

Citation:

Bartmann, C. (2024). Neural machine translation for metabolite structure prediction [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2024.120144

reposiTUm DOI:

10.34726/hss.2024.120144

CatalogPlus:

AC17333379

Publication Type:

Thesis - Diplomarbeit

Language:

English

Authors:

Bartmann, Christoph

Advisor:

Filzmoser, Peter

Organisational Unit:

E180 - Fakultät für Informatik
E105 - Institut für Computational Statistics

Date (published):

2024

Number of Pages:

178

Keywords:

Deep Learning; Large Language Models

Abstract:

This thesis addresses the challenge of metabolite structure prediction, a key task in drug discovery, by framing it as a neural machine translation (NMT) problem. Using an encoder-decoder transformer architecture, we encoded molecules as 1D strings to predict metabolite structures, considering scenarios both with and without stereochemistry.
A major contribution of this work is the establishment of rigorous data splitting criteria, enhancing transparency and fairness in dataset development. We also introduce Meta-Trans V2, a benchmark dataset with expanded test and validation sets, enabling a more robust evaluation of model generalization. Additionally, a tailored data-cleaning pipeline for enzymatic reactions was developed to align more closely with the metabolite prediction task. However, incorporating this cleaned dataset did not yield the expected performance improvements, indicating challenges in data quality.
In our exploration of molecular string representations, we compared the popular SMILES notation with the more complex SAFE notation, which maintains spatial proximity in the string. Contrary to our expectations, the added complexity of SAFE did not result in better performance. Similarly, experimenting with a Byte Pair Encoding (BPE) tokenizer specific to the drug and metabolite space provided no significant gains, reinforcing the symbolic nature of chemical language.
We also investigated novel augmentation techniques, such as Site-of-Metabolism (SOM) integration and a multiple-output training strategy. The latter, which aligns with the one-to-many nature of metabolite structure prediction, demonstrated superior accuracy and precision compared to existing methods. By training the model under conditions that mirror the true underlying process, in which a single substrate can yield multiple metabolites, the multiple-output augmentation technique allows the model to produce more diverse and accurate predictions, effectively capturing the complexity of metabolite formation.
In conclusion, this thesis advances metabolite structure prediction through innovative data preparation, model training, and evaluation strategies. The findings offer new insights and resources, contributing to more effective methodologies in the field and laying the groundwork for future research in metabolite prediction using deep learning (DL) approaches.
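The one-to-many framing mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `build_multi_output_pairs`, the use of "." as a separator between target metabolites, the sorted target order, and the toy SMILES pairs are all assumptions made here for illustration only.

```python
from collections import defaultdict

# Hypothetical separator: "." already denotes disconnected molecules in SMILES,
# so it is used here to join several metabolites into one target string.
SEP = "."


def build_multi_output_pairs(pairs, sep=SEP):
    """Group (substrate, metabolite) pairs into one example per substrate.

    Each substrate maps to a single target string that lists all of its
    known metabolites, so a sequence-to-sequence model sees the full
    one-to-many relationship in a single training example.
    """
    grouped = defaultdict(set)
    for substrate, metabolite in pairs:
        grouped[substrate].add(metabolite)
    # Sort the metabolites so the target string is deterministic.
    return {s: sep.join(sorted(ms)) for s, ms in grouped.items()}


# Toy data for illustration only (assumed, not taken from the thesis).
toy_pairs = [
    ("CCO", "CC=O"),
    ("CCO", "CC(=O)O"),
    ("c1ccccc1", "Oc1ccccc1"),
]

examples = build_multi_output_pairs(toy_pairs)
for substrate, target in examples.items():
    print(substrate, "->", target)
```

In this sketch each substrate appears once with all of its metabolites in the target, instead of once per substrate-metabolite pair. Whether the thesis concatenates targets in this way, or orders them differently, is not stated in the abstract.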

License:

In Copyright

Appears in Collections:

Thesis