Gander, A. (2021). Text analysis using colexification networks [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2021.83049
colexification; text analysis; text similarity; word similarity; computational science; nlp; linguistics; machine learning
en
Abstract:
Colexification is the phenomenon in which two concepts are expressed by the same word in at least one language. We use this linguistic principle to construct a theory-driven text analysis method. Unlike many state-of-the-art natural language processing (NLP) models, this method is fully interpretable, allowing precise insights into the structure of the model. Such theory-driven approaches are increasingly in demand, since with large NLP models it is difficult for developers to understand a model's dynamics and their implications. Furthermore, the proposed method is domain-independent because it is built on the language layer itself, whereas the majority of state-of-the-art methods are trained on large text corpora.
The proposed text analysis method is based on a word similarity measure built on top of a colexification network, i.e. a network of concepts linked by occurrences of colexification. Inspired by similar approaches in other domains, we compute the word similarity measure as the stationary visiting distribution in each node and validate it using several of the most widely used word similarity datasets in NLP. The results show that the colexification-based method significantly outperforms other word and graph embedding approaches in the task of word similarity prediction. After validating the word similarity metric, we define a text similarity measure inspired by a state-of-the-art approach to the same task. In various experiments on databases of English texts, we validate the measure by showing that it can distinguish text excerpts by genre, author and text of origin with reasonable accuracy. We compare the results of the method with those of a standard NLP approach on the genre recognition task and find that the two models reach comparable performance.
The text analysis method developed in this work allows us to validate the hypothesis that colexification occurrences encode semantic relationships between concepts. Furthermore, we show that a colexification-based approach to NLP has significant merits in various text analysis tasks, leading to meaningful insights. For instance, we perform a historical analysis of American English fiction literature, showing that the style and content of fiction literature have become more diverse over time, with the rate of change increasing particularly sharply in recent decades. These insights can be linked to other findings in computational social science, suggesting that the flux of cultural content has been increasing in recent decades.
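The word similarity measure described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the toy network, the edge weights, the function names and the restart probability of 0.15 are all assumptions made for illustration, and the walk is modelled here as a random walk with restart, which is one common way to obtain a per-node stationary visiting distribution.

```python
import numpy as np

# Toy colexification network (hypothetical data, not taken from the thesis).
# An edge weight stands for the number of languages that colexify the pair.
concepts = ["TREE", "WOOD", "FOREST", "FIRE", "SUN"]
edges = {
    ("TREE", "WOOD"): 5,
    ("WOOD", "FOREST"): 3,
    ("WOOD", "FIRE"): 2,
    ("FIRE", "SUN"): 1,
}
index = {c: i for i, c in enumerate(concepts)}


def transition_matrix(concepts, edges):
    """Row-normalised transition matrix of the weighted, undirected network."""
    n = len(concepts)
    weights = np.zeros((n, n))
    for (a, b), w in edges.items():
        weights[index[a], index[b]] = w
        weights[index[b], index[a]] = w
    # Every concept in the toy network has at least one edge,
    # so no row sum is zero here.
    return weights / weights.sum(axis=1, keepdims=True)


def visiting_distribution(P, start, restart=0.15, tol=1e-12, max_iter=10_000):
    """Stationary visiting distribution of a random walk with restart.

    Assumption: at each step the walker returns to `start` with
    probability `restart`; otherwise it follows an edge according to P.
    """
    e = np.zeros(P.shape[0])
    e[start] = 1.0
    p = e.copy()
    for _ in range(max_iter):
        nxt = (1.0 - restart) * (p @ P) + restart * e
        if np.abs(nxt - p).sum() < tol:
            return nxt
        p = nxt
    return p


def similarity(P, a, b):
    """Similarity of concept b to concept a: visiting mass at b for walks from a."""
    return visiting_distribution(P, index[a])[index[b]]


P = transition_matrix(concepts, edges)
print(similarity(P, "TREE", "WOOD"), similarity(P, "TREE", "SUN"))
```

In this toy example, concepts that are directly colexified with the start concept receive more visiting mass than distant ones, which is the intuition behind using the distribution as a similarity score. Note that the score is not symmetric in general.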
Additional information:
Title differs according to the author's translation