<div class="csl-bib-body">
<div class="csl-entry">Schwendinger, F., Vana, L., & Hornik, K. (2024). Readability prediction: How many features are necessary? <i>Annals of Applied Statistics</i>, <i>18</i>(2), 1010–1034. https://doi.org/10.1214/23-AOAS1820</div>
</div>
-
dc.identifier.issn
1932-6157
-
dc.identifier.uri
http://hdl.handle.net/20.500.12708/196784
-
dc.description.abstract
Traditionally, readability prediction has relied on readability formulas, which are based on shallow text characteristics such as average word and sentence length. With recent advances in text mining and natural language processing, more complex text properties can be incorporated into readability prediction models, with papers in the literature suggesting the use of up to 200 features for predicting text readability. However, many of the features generated using natural language processing tools are highly correlated and can be thought to measure similar latent text properties. When dealing with a high-dimensional space of correlated features, removing the redundant variables has two advantages: (1) improving interpretability and (2) increasing the predictive power of the model. In this paper, we propose an ordinal version of the averaged lasso, which combines hierarchical clustering with the lasso, in order to identify relevant features for readability prediction. We illustrate the approach on two corpora and show improved prediction accuracy when benchmarking against a set of competing models. The annotated corpora as well as the steps necessary for feature creation are freely available as R packages, thus allowing the obtained results to be directly incorporated into a readability estimation pipeline.
en
dc.description.sponsorship
FWF - Österr. Wissenschaftsfonds
-
dc.language.iso
en
-
dc.publisher
INST MATHEMATICAL STATISTICS-IMS
-
dc.relation.ispartof
Annals of Applied Statistics
-
dc.subject
Averaged ordinal lasso
en
dc.subject
Model selection
en
dc.subject
NLP
en
dc.subject
Pipeline
en
dc.subject
Readability prediction
en
dc.title
Readability prediction: How many features are necessary?
en
dc.type
Article
en
dc.type
Artikel
de
dc.contributor.affiliation
University of Klagenfurt, Austria
-
dc.contributor.affiliation
Vienna University of Economics and Business, Austria
-
dc.description.startpage
1010
-
dc.description.endpage
1034
-
dc.relation.grantno
ZK 35-G
-
dc.type.category
Original Research Article
-
tuw.container.volume
18
-
tuw.container.issue
2
-
tuw.journal.peerreviewed
true
-
tuw.peerreviewed
true
-
tuw.project.title
Hochdimensionales statistisches Lernen: Neue Methoden zur Förderung der Wirtschafts- und Nachhaltigkeitspolitik