Segmentation of assembly operations using pose estimation, optical flow and deep learning

Bacher, Manuel

doi:10.34726/hss.2024.105340

DC Field

Value

Language

dc.contributor.advisor

Schlund, Sebastian

dc.contributor.author

Bacher, Manuel

dc.date.accessioned

2024-12-09T09:53:43Z

dc.date.issued

2024

dc.date.submitted

2024-11

dc.identifier.citation

Bacher, M. (2024). Segmentation of assembly operations using pose estimation, optical flow and deep learning [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2024.105340

dc.identifier.uri

https://doi.org/10.34726/hss.2024.105340

dc.identifier.uri

http://hdl.handle.net/20.500.12708/205426

dc.description.abstract

Assistance systems are increasingly used in manufacturing to provide instructions or feedback based on workers' needs. These assistance systems are typically smart assembly tables that combine augmented reality, displays, and cameras to support complex work. To help workers, the assistance system must know its environment and thus capture the context of the work. An important aspect is the task the worker is performing, which can be determined by building a visual sensor system and an algorithm that classifies and analyzes the assembly steps in real time. However, recognizing actions and activities in the context of assembly processes poses several challenges. In assembly scenarios, the camera is typically placed above the worker so as not to obstruct the assembly process, which makes it harder to detect the full body skeleton that modern algorithms often rely on. In addition, interaction with tools and objects can temporarily occlude parts of the worker's hands, and assembly activities vary greatly in duration. A further challenge is the limited availability of datasets for recognizing assembly operations. Modern methods such as 3D CNNs, which have a large number of parameters and are computationally complex, can be hard to train under these circumstances. This thesis attempts to address the above challenges by combining pose estimation, optical flow, and deep learning. The main contribution of this thesis is a multimodal network consisting of pose estimation and optical flow extraction, two convolutional neural networks (CNNs) for extracting high-level features, and a temporal CNN for action segmentation.
The developed algorithm was trained on an adapted version of the publicly available Assembly101 dataset and later fine-tuned on a dataset from TU Wien. On the adapted Assembly101 dataset, the algorithm achieved a validation accuracy of 23.48% and an edit score of 21.01%. On the TU Wien Assembly dataset, the algorithm achieved a validation accuracy of 55.06% and an edit score of 70.85%. In addition to the multimodal model, recognition was also tested with individual modalities, using pose, optical flow, and RGB streams as input. The combination of pose estimation and optical flow achieved results similar to those of pure RGB data. However, using individual modalities, i.e., pose or optical flow, led to results similar to those of the multimodal model.

dc.description.abstract

Assistance systems are increasingly used in manufacturing to provide instructions or feedback based on the needs of the worker. Typically, these assistance systems are smart assembly tables that include augmented reality, displays, and cameras to accommodate complex work. To assist workers, the assistance system must know about its environment and thus become aware of the context of the work. An important aspect is the task the worker is performing, which can be identified by creating a visual sensor system and an algorithm that classifies and analyzes the assembly steps performed by the human worker in real time. However, there are several challenges when performing action and activity recognition in the context of assembly operations. In assembly scenarios, the camera is typically placed above the worker so as not to hinder the assembly process, which makes full-body skeleton recognition, commonly used in state-of-the-art algorithms, more difficult. Moreover, the interaction with tools and objects can temporarily occlude parts of the worker's hands, and the assembly operations vary considerably in duration. Another challenge is the limited availability of datasets for the recognition of assembly operations. State-of-the-art approaches such as 3D CNNs, which have many parameters and are computationally complex, can be difficult to train under these circumstances. This thesis addresses the above-mentioned challenges by combining pose estimation, optical flow, and deep learning. The main contribution of this thesis is a multimodal network consisting of pose estimation and optical flow extraction, two convolutional neural networks for high-level feature extraction, and a temporal convolutional neural network for action segmentation. The developed algorithm was trained on an adapted version of the publicly available Assembly101 dataset and later fine-tuned on a dataset from TU Wien.
On the adapted Assembly101 dataset, the algorithm achieved a validation accuracy of 23.48% and an edit score of 21.01%. On the TU Wien Assembly dataset, the algorithm achieved a validation accuracy of 55.06% and an edit score of 70.85%. In addition to the multimodal model, recognition was also tested with single modalities, with pose, optical flow, and RGB streams as inputs. The combination of pose estimation and optical flow achieved results similar to those of plain RGB features. However, using a single input, i.e., pose or optical flow alone, achieved results similar to those of the multimodal model. This suggests that, although motion information is relevant, these modalities do not complement each other.
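
The abstract reports two metrics, accuracy and edit score, without defining them. The following is a minimal sketch of how such metrics are commonly computed in action segmentation, assuming frame-wise accuracy and the segmental edit score (a normalized Levenshtein distance over the sequence of segment labels). Whether the thesis uses exactly these definitions is an assumption; the function names and the toy labels are hypothetical and not taken from the thesis.

```python
from typing import List, Sequence


def to_segments(frame_labels: Sequence[str]) -> List[str]:
    """Collapse runs of identical frame labels into one label per segment."""
    segments: List[str] = []
    for label in frame_labels:
        if not segments or segments[-1] != label:
            segments.append(label)
    return segments


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance (insert, delete, substitute) between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_score(pred: Sequence[str], truth: Sequence[str]) -> float:
    """Segmental edit score in percent (assumed definition, see lead-in)."""
    p, t = to_segments(pred), to_segments(truth)
    longest = max(len(p), len(t))
    if longest == 0:
        return 100.0
    return 100.0 * (1.0 - levenshtein(p, t) / longest)


def frame_accuracy(pred: Sequence[str], truth: Sequence[str]) -> float:
    """Percentage of frames whose predicted label matches the ground truth."""
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same number of frames")
    if len(truth) == 0:
        raise ValueError("at least one frame is required")
    correct = sum(p == t for p, t in zip(pred, truth))
    return 100.0 * correct / len(truth)


# Hypothetical toy example: the prediction briefly flips back to "pick",
# which costs few frames but adds two spurious segments.
truth = ["pick"] * 4 + ["screw"] * 4 + ["place"] * 2
pred = ["pick"] * 3 + ["screw"] * 2 + ["pick"] + ["screw"] * 2 + ["place"] * 2

print(frame_accuracy(pred, truth))
print(edit_score(pred, truth))
```

In this toy case most frames are labeled correctly, yet the edit score is lower because the prediction is over-segmented. Under these assumed definitions the two metrics measure different things (per-frame correctness versus the order of segments), so they need not move together.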

dc.language

English

dc.language.iso

dc.rights.uri

http://rightsstatements.org/vocab/InC/1.0/

dc.subject

human activity recognition

dc.subject

pose estimation

dc.subject

optical flow

dc.subject

adaptive assistance

dc.subject

assembly

dc.title

Segmentation of assembly operations using pose estimation, optical flow and deep learning

dc.type

Thesis

dc.type

Hochschulschrift

dc.rights.license

In Copyright

dc.rights.license

Urheberrechtsschutz

dc.identifier.doi

10.34726/hss.2024.105340

dc.contributor.affiliation

TU Wien, Österreich

dc.rights.holder

Manuel Bacher

dc.publisher.place

Wien

tuw.version

vor

tuw.thesisinformation

Technische Universität Wien

dc.contributor.assistant

Kostolani, David

tuw.publication.orgunit

E180 - Fakultät für Informatik

dc.type.qualificationlevel

Diploma

dc.identifier.libraryid

AC17387568

dc.description.numberOfPages

dc.thesistype

Diplomarbeit

dc.thesistype

Diploma Thesis

dc.rights.identifier

In Copyright

dc.rights.identifier

Urheberrechtsschutz

tuw.advisor.staffStatus

staff

tuw.assistant.staffStatus

staff

tuw.advisor.orcid

0000-0002-8142-0255

item.languageiso639-1

item.openairetype

master thesis

item.grantfulltext

open

item.fulltext

with Fulltext

item.cerifentitytype

Publications

item.openairecristype

http://purl.org/coar/resource_type/c_bdcc

item.openaccessfulltext

Open Access

Appears in Collections:

Thesis

Bacher Manuel - 2024 - Segmentation of Assembly Operations Using Pose Estimation...pdf

Adobe PDF

(2.1 MB)
