Eger, M. (2017). Data mining and computational intelligence in bioprocessing [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2017.38784
E166 - Institut für Verfahrenstechnik, Umwelttechnik und technische Biowissenschaften
-
Date (published):
2017
-
Number of Pages:
80
-
Keywords:
Machine learning; bioprocess analysis
Abstract (English):
Biotechnology production plants offer a wealth of information in recorded bioprocessing data through modern data logging and archiving systems. This rich body of data makes it possible to optimize individual bioprocesses, conduct quality control, and characterize processes. Because of the variety of data logging systems in use, the recorded bioprocess data is stored in different document formats, storage and database systems, and largely unstructured shapes. State-of-the-art methods of bioprocess data alignment rely on the development of format-specific extraction scripts and manual data alignment, so a large part of the analytical effort goes into time-consuming manual alignment work. Machine Learning (ML) techniques can uncover hidden patterns in these seemingly unstructured shapes and thus allow bioprocess data to be aligned automatically, independent of storage format and shape; applying ML techniques is therefore a promising way to reduce the time-consuming effort spent on data alignment. The aim of this thesis is to develop a bioprocess data extraction and alignment framework based on different Machine Learning techniques for data preprocessing and classification. The data set used in the scope of this thesis consists of online- and offline-recorded batch processes collected from six companies in the pharmaceutical, bioreactor, and biosensor industries (see appendix). All data sources were anonymized, and recorded values were replaced by random values prior to any processing steps. Each recorded batch is partitioned into a vectorized grid in which every data entry is represented by a single cell. Features for the classification process are built from cell properties and the information of the surrounding cell neighborhood within a given radius.
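The cell-and-neighborhood featurization described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the type categories, the reference token `"Time"` used for the string-similarity feature, and the function names are all assumptions for the sake of the example.

```python
from difflib import SequenceMatcher

def cell_type(value):
    # Coarse data-type label for a raw cell value (hypothetical categories).
    try:
        float(value)
        return "numeric"
    except (TypeError, ValueError):
        return "empty" if value in (None, "") else "text"

def cell_features(grid, row, col, radius=2, ref="Time"):
    """Feature vector for one grid cell: its own properties plus the
    data types of all cells inside the given neighborhood radius."""
    value = grid[row][col]
    features = {
        "type": cell_type(value),
        # Similarity of the cell text to a reference header token
        # (an assumed stand-in for the thesis' string-similarity feature).
        "string_similarity": SequenceMatcher(None, str(value), ref).ratio(),
    }
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            if dr == 0 and dc == 0:
                continue
            r, c = row + dr, col + dc
            if 0 <= r < len(grid) and 0 <= c < len(grid[0]):
                features[f"nb_{dr}_{dc}"] = cell_type(grid[r][c])
            else:
                features[f"nb_{dr}_{dc}"] = "outside"
    return features

# Toy batch record: a header row followed by logged values.
grid = [["Time", "pH", "Temp"],
        ["0.0", "7.1", "30.2"],
        ["0.5", "7.0", "30.4"]]
feats = cell_features(grid, 1, 1, radius=1)
```

With radius 1 each interior cell contributes its own two properties plus the types of its eight neighbors; the cell above a numeric value, for example, is recognized as text, which is exactly the kind of directional context the classifier can exploit.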
Due to the high dimensionality of the resulting feature space, and to retain maximum variance, the input dimension is reduced using Principal Component Analysis (PCA). The processed feature set is tested on two classifiers: a Support Vector Machine trained with Stochastic Gradient Descent (SGD) and L2 regularization, and Gradient tree Boosting. Tests were run using different neighborhood distances, different training/testing ratios, and different sets of hyperparameters for the chosen classifiers. The variation in training/testing ratios showed that the spread between the highest and lowest test results steadily decreased as the number of training samples increased. Based on the averaged classifier F-scores across the neighborhood radius tests, a neighborhood radius of 2 showed a good balance between specificity and sensitivity. Between the two classifiers, the Gradient tree Boosting method achieved an overall higher prediction accuracy than the Support Vector Machine. The feature importance tests for the cell features showed that the data type and string similarity features contributed the most during the training phase. The feature importances for the different neighborhood directions indicated a strong bias towards the cell information below each cell. The final prediction model achieved an average F-score of 86.27 with a low standard deviation of only 0.0477. A machine-learning-based extraction method for bioprocess data thus proved successful, albeit with limitations: improvements in the parsing of the bioprocess data, the addition of new cell features, and a higher information content in the cell neighborhood features would greatly refine the accuracy of the prediction models.
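The evaluation pipeline — PCA reduction followed by the two classifiers compared via F-score — can be sketched with scikit-learn. The synthetic data, the component count, and the hyperparameters below are placeholders, since the thesis data set and tuned settings are not public; only the pipeline shape (scale, project, classify, score) mirrors the described approach.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the vectorized cell features (the real data is anonymized).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    # Hinge loss + L2 penalty: a linear SVM fitted by stochastic gradient descent.
    "SGD (linear SVM)": make_pipeline(
        StandardScaler(), PCA(n_components=10),
        SGDClassifier(loss="hinge", penalty="l2", random_state=0)),
    "Gradient tree boosting": make_pipeline(
        StandardScaler(), PCA(n_components=10),
        GradientBoostingClassifier(random_state=0)),
}

# F-score per classifier on the held-out split.
scores = {name: f1_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
```

Varying `test_size` here corresponds to the training/testing-ratio experiments, and swapping the estimator hyperparameters corresponds to the hyperparameter sweeps reported above.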
Future opportunities to expand this work include an extension of the hyperparameter optimization for the Gradient Boosting classifier, improved sampling methods to reduce the class-label imbalance in the training set, and a comparison of different nearest-neighbor metrics in relation to the respective feature importances.
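One of the future directions mentioned, reducing class-label imbalance by better sampling, could start from something as simple as random oversampling of the minority class. The sketch below is one generic rebalancing strategy, not a method proposed in the thesis; the function name and seed handling are assumptions.

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Naive random oversampling: duplicate minority-class samples until
    every class is as frequent as the largest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        # Draw (with replacement) the extra samples needed to reach n_max.
        extra = rng.choice(cls_idx, size=n_max - count, replace=True)
        idx.extend(cls_idx)
        idx.extend(extra)
    idx = np.asarray(idx)
    return X[idx], y[idx]

# Imbalanced toy set: 8 samples of class 0, 2 of class 1.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = oversample_minority(X, y)
```

After resampling, both classes appear equally often, which keeps a classifier from trading away minority-class recall for overall accuracy.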