Mauczka, A. (2016). Design and evaluation of a natural language processing based methodology for classification and profiling of artifacts in software evolution [Dissertation, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2016.38867
Mining Software Repositories; Natural Language Processing; Software Evolution; Software Maintenance
en
Abstract:
Software evolution is an active field of research and has featured many different approaches to learn more about the processes and people that drive software engineering efforts. Recent studies have further advanced research on software evolution by incorporating Natural Language Processing methodologies to mine textual artifacts accessible in repositories like Bug Tracking Systems and Version Control Systems for information about the nature of software engineering. We propose a methodology called SubCat that exploits Natural Language Processing and data mining capabilities to provide a framework that gives both researchers and managers access to software evolution meta-data contained within their repositories. The proposed methodology incorporates in its design the current state of the art in the mining of software repositories and addresses defined problems with current tool support. We apply the resulting framework in different scenarios to validate the method's efficiency. In these scenarios, various aspects of software evolution were analyzed, new findings were made, and existing assumptions were partially refuted. The methodology was applied to cover a broad range of topics, from classification of code changes to using Sentiment Analysis on comments in a bug tracker. Further, the methodology was used to identify security-relevant changes, which could be validated by using existing Security Advisories. Additionally, we employ the framework to generate content for a Bug Tracker based on information available in a Code Repository to showcase a potential use for projects that did not start out with a Bug Tracker. Aside from the mentioned scenarios, we created a classification mechanism for code changes into maintenance categories and evaluated it for cross-project validity.
The dictionary and the provided Sentiment Analysis capabilities of the framework were then used to generate developer profiles to showcase a potential use for future studies on long-term developer motivation or dashboards for project managers to see possible conflicts and problems at a glance.