Automated skill extraction and classification from German job listings using transfer learning

Scheuvens, Malte

doi:10.34726/hss.2025.114402

Record link:

https://doi.org/10.34726/hss.2025.114402
http://hdl.handle.net/20.500.12708/213387

Citation:

Scheuvens, M. (2025). Automated skill extraction and classification from German job listings using transfer learning [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2025.114402

reposiTUm DOI:

10.34726/hss.2025.114402

CatalogPlus:

AC17471756

Publication Type:

Thesis - Diplomarbeit

Language:

English

Authors:

Scheuvens, Malte

Advisor:

Ansari Chaharsoughi, Fazel

Co-advisor:

Kohl, Linus

Organisational Unit:

E180 - Fakultät für Informatik
E330 - Institut für Managementwissenschaften

Date (published):

2025

Number of Pages:

113

Keywords:

Artificial Intelligence; Computational Linguistics; Natural Language Processing; NLP; Competence Management; Job Vacancies

Abstract:

People's skills and knowledge play an important role in meeting the ever-changing demands of the labour market. Automated skill extraction and classification can help detect emerging trends in skill demand and can also support the matching process between job seekers and employers. Both uses require a standardised classification of skills. Previous approaches have mainly focused on extracting skills from English job listings. This thesis addresses the challenges of automated skill extraction from German-language job listings, focusing on two main problems: the absence of a standardised mapping of skills to competency taxonomies (P1) and the lack of publicly available benchmarking datasets (P2). To address these challenges, the thesis investigates the use of state-of-the-art transformer-based natural language processing methods for skill extraction and classification, with particular emphasis on classification according to the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy. The research involves creating a new benchmarking dataset of German job listings, annotated according to a set of annotation guidelines developed for this purpose. The research methodology is based on the Design Science Research Process (DSRP) and the Cross Industry Standard Process for Data Mining (CRISP-DM), which frame the overall research design. It also includes a systematic literature review to assess the current state of the art. Limitations, such as the focus on the ESCO taxonomy and the dependence on existing models, are taken into account.
In the practical application of the findings, the thesis applies existing transformer models adapted to the job domain and the German language, such as JobGBERT, and introduces language-specific pre-processing techniques that address particularities of German, such as the use of compound words and the frequent nominalisation of verbs. The effectiveness of these methods is evaluated using standard performance metrics such as precision, recall, and the F1 score. The findings indicate that language-specific adaptations substantially improve extraction and classification performance, whereas domain-specific models do not necessarily improve overall performance in certain settings. Additionally, the novel annotated job listing dataset addresses the lack of benchmarking resources, allowing for reproducible research and comparable evaluation of skill extraction methods. This thesis contributes to advancing the field of skill extraction and classification from job listings, particularly within the context of the German labour market, by providing a novel annotated German job listing dataset as well as a structured skill extraction pipeline that enables the comparison of different state-of-the-art models.

License:

In Copyright

Appears in Collections:

Thesis