<div class="csl-bib-body">
<div class="csl-entry">Urbanke, P. (2022). <i>A framework for evaluating the readability of test code in the context of code maintainability: A family of empirical studies</i> [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2022.103606</div>
</div>
-
dc.identifier.uri
https://doi.org/10.34726/hss.2022.103606
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/137096
-
dc.description
Differing title according to the author's own translation
-
dc.description.abstract
Context and Motivation: Software testing is a common practice in software development and serves many functions: it provides certain guarantees that the software works as expected across the life cycle of the system, it helps with finding and fixing erroneous behaviour, it acts as documentation, and it provides usage examples. Still, test code is often treated as an orphan, which leads to poor-quality tests, also with respect to readability. If a test is hard to read, activities that build on it, such as maintaining tests or drawing correct conclusions from test results, may be compromised. But what is readable test code? Since test code has a different purpose than production code and contains exclusive features like assertion methods, the factors influencing its readability may deviate from those of production code.

Objective: We propose a framework that can be used to evaluate the readability of test code. It also provides information on factors influencing readability and gives best-practice examples for improvements. Aside from this main goal, we give an overview of the academic literature in the field of test code readability and compare it to the opinions of practitioners. We investigate the impact of modifications related to widely discussed readability factors on the readability of test cases. Furthermore, we gather readability rating criteria from free-text answers, investigate the impact of developer experience on readability ratings, and evaluate the accuracy of a readability rating tool that is often used in other studies.

Methods: We collect extensive information on test code readability by combining a systematic mapping of academic literature with a systematic mapping of grey literature. We conduct a human-based experiment on test code readability with 77 mostly junior-level participants in an academic context to investigate various factors influencing readability. We categorise and group the free-text answers from the experiment's participants and compare the human readability ratings with tool-generated readability ratings. Finally, after constructing the readability assessment framework, which is based on the previous results, we evaluate it and compare the outcome to the results of the initial human-based experiment.

Results: The literature studies yield 16 relevant sources from the scientific community and 56 sources from practitioners. Both literature mappings show an ongoing interest in test code readability. Scientific sources focus on investigating automatically generated test code, which is often compared to manually written tests (88%). For capturing human readability, they primarily use surveys (44%), which contain Likert scales in almost all cases. The grey literature mostly consists of blogs in which practitioners share their opinions and experience on problems found in their daily work. There is a clear intersection of readability factors discussed in both communities, but some factors are exclusive to each community. In the human-based experiment, we found a statistically significant influence on the readability of test cases for five of the ten investigated modifications, which map to readability factors. We see little influence of experience on readability ratings, although previous research found that experience influences understanding and maintenance tasks.
Judging from the categorisation of around 2500 free-text answers, the participants rate readability based on Test naming, Structure, and Dependencies (i.e., does the test ensure only one behaviour?); these factors are illustrated in the sketch after the abstract. The ratings of the readability rating tool lie between the 0.25 and 0.75 quantiles of our human ratings in around 51% of the investigated test cases. We also found that invisible differences in formatting (i.e., spaces and tabulators) affect the tool's ratings by up to 0.25 on a scale from 0 to 1. The framework evaluation shows a decreased variation in the ratings across participants and an increased rating speed compared to the gut-feeling ratings from the initial experiment. Overall, the framework rates tests too optimistically. Nevertheless, the validity of this evaluation is very limited due to the small number of survey participants (5); it is therefore merely a proof of concept, which we pursue in future work.

Conclusion: From the literature mappings we found differing views on test case readability between practitioners and academia, which stem from the different contexts of the two communities. The ratings from the readability tool are not accurate enough to be trusted blindly; they still need to be complemented with human expertise. Our readability evaluation framework enables a more efficient assessment of readability. A large-scale evaluation is planned for future work.
en
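
Illustrative sketch (not from the thesis): the abstract names Test naming, Structure, and Dependencies as the criteria participants used when rating readability. The thesis does not prescribe a particular language or framework, so the following JUnit 5 test is only a hypothetical, minimal example of how those factors tend to appear in a single readable test case; the class under test here is the JDK's LocalDate, chosen purely to keep the example self-contained.

import static org.junit.jupiter.api.Assertions.assertEquals;

import java.time.LocalDate;
import org.junit.jupiter.api.Test;

class LocalDatePlusDaysTest {

    // "Test naming": the method name states the expected behaviour,
    // so the test reads as documentation.
    @Test
    void plusDaysRollsOverIntoTheNextMonth() {
        // "Dependencies": only the input this one behaviour needs is set up.
        LocalDate lastOfJanuary = LocalDate.of(2022, 1, 31);

        // "Structure": arrange, act, assert; the test exercises exactly
        // one behaviour and ends with a single matching assertion.
        LocalDate firstOfFebruary = lastOfJanuary.plusDays(1);

        assertEquals(LocalDate.of(2022, 2, 1), firstOfFebruary);
    }
}

Keeping one behaviour per test is what the abstract's question "does the test ensure only one behaviour?" points at: a reader can judge the test's intent from the name and the single assertion alone.
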
dc.language
English
-
dc.language.iso
en
-
dc.rights.uri
http://rightsstatements.org/vocab/InC/1.0/
-
dc.subject
test
en
dc.subject
code
en
dc.subject
readability
en
dc.subject
evaluation
en
dc.subject
quality
en
dc.subject
maintainability
en
dc.subject
mapping study
en
dc.subject
grey literature
en
dc.title
A framework for evaluating the readability of test code in the context of code maintainability: A family of empirical studies
en
dc.title.alternative
Ein Framework für die Qualitätsbeurteilung von Test Code für die Wartung von Software Code: Eine Familie von empirischen Studien
de
dc.type
Thesis
en
dc.type
Hochschulschrift
de
dc.rights.license
In Copyright
en
dc.rights.license
Urheberrechtsschutz
de
dc.identifier.doi
10.34726/hss.2022.103606
-
dc.contributor.affiliation
TU Wien, Österreich
-
dc.rights.holder
Pirmin Urbanke
-
dc.publisher.place
Wien
-
tuw.version
vor
-
tuw.thesisinformation
Technische Universität Wien
-
dc.contributor.assistant
Biffl, Stefan
-
tuw.publication.orgunit
E194 - Institut für Information Systems Engineering
-
dc.type.qualificationlevel
Diploma
-
dc.identifier.libraryid
AC16727234
-
dc.description.numberOfPages
113
-
dc.thesistype
Diplomarbeit
de
dc.thesistype
Diploma Thesis
en
dc.rights.identifier
In Copyright
en
dc.rights.identifier
Urheberrechtsschutz
de
tuw.advisor.staffStatus
staff
-
tuw.assistant.staffStatus
staff
-
tuw.advisor.orcid
0000-0002-4743-3124
-
tuw.assistant.orcid
0000-0002-3413-7780
-
item.grantfulltext
open
-
item.openairecristype
http://purl.org/coar/resource_type/c_bdcc
-
item.mimetype
application/pdf
-
item.openairetype
master thesis
-
item.openaccessfulltext
Open Access
-
item.languageiso639-1
en
-
item.cerifentitytype
Publications
-
item.fulltext
with Fulltext
-
crisitem.author.dept
E194 - Institut für Information Systems Engineering