Data mining and computational intelligence in bioprocessing

Eger, Marcus

doi:10.34726/hss.2017.38784

DC Field

Value

Language

dc.contributor.advisor

Herwig, Christoph

dc.contributor.author

Eger, Marcus

dc.date.accessioned

2020-06-28T05:22:55Z

dc.date.issued

2017

dc.date.submitted

2018-01

dc.identifier.citation

Eger, M. (2017). Data mining and computational intelligence in bioprocessing [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2017.38784

dc.identifier.uri

https://doi.org/10.34726/hss.2017.38784

dc.identifier.uri

http://hdl.handle.net/20.500.12708/3052

dc.description.abstract

Biotechnology production plants offer a wealth of information in recorded bioprocessing data through modern data logging and archiving systems. This wealth of data makes it possible to optimize individual bioprocesses, conduct quality control, and characterize processes. Because many different data logging systems are in use, the recorded bioprocess data is stored in different document formats, storage and database systems, and largely unstructured shapes. State-of-the-art methods of bioprocess data alignment include the development of specific extraction scripts and manual data alignment. A large part of the analytical effort is thus spent on time-consuming manual data alignment. Machine Learning (ML) techniques can uncover hidden patterns in these seemingly unstructured shapes and thus allow for the automatic alignment of bioprocess data, independent of storage format and shape. Applying ML techniques is therefore potentially advantageous for reducing the time spent on data alignment. The aim of this thesis is to develop a bioprocess data extraction and alignment framework based on different Machine Learning techniques for data preprocessing and classification. The data set used in this thesis consists of online- and offline-recorded batch processes collected from six companies in the pharmaceutical, bioreactor, and biosensor industries (see appendix). All data sources were anonymized and recorded values were replaced by random values prior to any processing steps. Each recorded batch is partitioned into a vectorized grid, where every data entry is represented by a single cell on the grid. Features for the classification process are built from cell properties and from the information in the surrounding cell neighborhood for a given radius.
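The grid and neighborhood idea described above can be illustrated with a minimal sketch. It is not the thesis implementation: the helper names `cell_type` and `neighborhood_features`, the three coarse data types, the square window, and the example grid values are all assumptions made for illustration.

```python
from typing import List, Optional

Grid = List[List[Optional[str]]]


def cell_type(value: Optional[str]) -> str:
    """Coarse data type of one cell: 'empty', 'number', or 'text' (assumed categories)."""
    if value is None or value.strip() == "":
        return "empty"
    try:
        float(value)
        return "number"
    except ValueError:
        return "text"


def neighborhood_features(grid: Grid, row: int, col: int, radius: int = 2) -> List[str]:
    """Data types of the cell at (row, col) and of every cell in a square
    window of the given radius around it, scanned row by row.
    Positions that fall outside the grid are reported as 'outside'."""
    features = []
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            r, c = row + dr, col + dc
            if 0 <= r < len(grid) and 0 <= c < len(grid[r]):
                features.append(cell_type(grid[r][c]))
            else:
                features.append("outside")
    return features


# Hypothetical batch record: one header row followed by two rows of values.
grid = [
    ["Time [h]", "pH", "Temp [C]"],
    ["0.0", "7.01", "37.0"],
    ["0.5", "6.98", "37.1"],
]

# Features for the "pH" header cell with radius 1 (a 3 x 3 window).
feats = neighborhood_features(grid, row=0, col=1, radius=1)
print(feats)
```

In this toy grid, the cells below a header cell are numeric while its horizontal neighbors are text, which is the kind of directional cue a classifier could use to tell header cells from value cells.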
Because the resulting feature space is high-dimensional, and to ensure maximum variability, the input dimension is reduced using Principal Component Analysis (PCA). The processed feature set is tested on two classifiers: Stochastic Gradient Descent (SGD) with L2 regularization (Support Vector Machine) and Gradient Tree Boosting. Tests were run using different neighborhood distances, different training/testing ratios, and different sets of hyperparameters for the chosen classifiers. The variation in training/testing ratios showed that the spread between the highest and lowest test result steadily decreased as the number of training samples increased. Based on the averaged classifier F-scores for the different neighborhood radii tested, a neighborhood radius of 2 demonstrated a good agreement between specificity [...]. Between the two classifiers, the Gradient Tree Boosting method achieved an overall higher prediction accuracy than the Support Vector Machine. The feature importance tests for the cell features showed that the features data type and string similarity contributed the most during the training phase. The feature importances for different neighborhood directions indicated a strong bias towards the cell information below each cell. The final prediction model achieved an average F-score of 86.27 with a low standard deviation of only 0.0477. A machine-learning-based extraction method for bioprocess data proved to be successful, but with limitations. Improvements in the parsing process of the bioprocess data, the addition of new cell features, and a higher information content in the cell neighborhood features would greatly refine the accuracy of the prediction models.
Future opportunities to expand this work include extending the hyperparameter optimization for the Gradient Boosting classifier, improving the sampling methods to reduce the class label imbalance in the training set, and comparing different nearest neighbor metrics in relation to the respective feature importance.
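The abstract reports an averaged F-score together with its standard deviation. As a minimal sketch of how such a summary can be computed over repeated test runs, the example below uses a binary F1 score; the binary setting, the helper name `f1_score`, and the label vectors are assumptions for illustration and do not reproduce the thesis results.

```python
from statistics import mean, stdev
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Binary F1 score: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: define the score as 0 to avoid division by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical (true labels, predicted labels) pairs from three test runs.
runs = [
    ([1, 1, 0, 0, 1], [1, 1, 0, 0, 1]),
    ([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]),
    ([1, 0, 0, 1, 1], [1, 0, 0, 0, 1]),
]

scores = [f1_score(t, p) for t, p in runs]
print("mean F1:", mean(scores))
print("sample standard deviation:", stdev(scores))
```

Reporting the standard deviation alongside the mean shows how much the score varies from run to run.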

dc.language

English

dc.language.iso

dc.rights.uri

http://rightsstatements.org/vocab/InC/1.0/

dc.subject

Machine learning

dc.subject

bioprocess analysis

dc.title

Data mining and computational intelligence in bioprocessing

dc.type

Thesis

dc.type

Hochschulschrift

dc.rights.license

In Copyright

dc.rights.license

Urheberrechtsschutz

dc.identifier.doi

10.34726/hss.2017.38784

dc.contributor.affiliation

TU Wien, Österreich

dc.rights.holder

Marcus Eger

dc.publisher.place

Wien

tuw.version

vor

tuw.thesisinformation

Technische Universität Wien

tuw.publication.orgunit

E166 - Institut für Verfahrenstechnik, Umwelttechnik und technische Biowissenschaften

dc.type.qualificationlevel

Diploma

dc.identifier.libraryid

AC14536536

dc.description.numberOfPages

dc.identifier.urn

urn:nbn:at:at-ubtuw:1-107221

dc.thesistype

Diplomarbeit

dc.thesistype

Diploma Thesis

dc.rights.identifier

In Copyright

dc.rights.identifier

Urheberrechtsschutz

tuw.advisor.staffStatus

staff

item.languageiso639-1

item.openairetype

master thesis

item.grantfulltext

open

item.fulltext

with Fulltext

item.cerifentitytype

Publications

item.mimetype

application/pdf

item.openairecristype

http://purl.org/coar/resource_type/c_bdcc

item.openaccessfulltext

Open Access

crisitem.author.dept

TU Wien

Appears in Collections:

Thesis

Fulltext (Version of Record (published version))

Adobe PDF

(7.76 MB)

In Copyright

Page view(s)

289

checked on Nov 19, 2023

Download(s)

102

checked on Nov 19, 2023
