Using natural language processing to automate the Bechdel test

Westphal, Krista

doi:10.34726/hss.2018.26183

Record link:

https://doi.org/10.34726/hss.2018.26183
http://hdl.handle.net/20.500.12708/4344

Title:

Using natural language processing to automate the Bechdel test

Citation:

Westphal, K. (2018). Using natural language processing to automate the Bechdel test [Diploma Thesis, Technische Universität Wien]. reposiTUm. https://doi.org/10.34726/hss.2018.26183

reposiTUm DOI:

10.34726/hss.2018.26183

CatalogPlus:

AC14552095

Publication Type:

Thesis - Diploma Thesis

Language:

English

Authors:

Westphal, Krista

Advisor:

Hanbury, Allan

Organisational Unit:

E188 - Institut für Softwaretechnik und Interaktive Systeme

Date (published):

2018

Keywords:

Bechdel Test; Natürliche Sprachverarbeitung; Maschinelles Lernen

Bechdel Test; Natural Language Processing; Machine Learning

Abstract:

The Bechdel test asks three questions: does a movie contain two named female characters, do two female characters converse at some point during the movie, and is there at least one conversation between female characters that is not about a man? If all three questions can be answered positively, the film passes the Bechdel test. This thesis defines and implements methods for automating the Bechdel test for screenplays and novels. Automating this task would allow for large-scale analyses, for example permitting researchers to analyse trends over long time periods, which would otherwise only be possible with time-consuming manual methods. Previous research on automating the Bechdel test for screenplays exists and provided the basis for the approach described in this thesis. Although the Bechdel test was originally formulated for movies, the questions are just as applicable to novels. However, as far as we could find, no previous research exists on automating the Bechdel test for novels.

For screenplays, we first parsed the text using a new rule-based approach that relies on the specialized text formatting required for screenplays. Then we identified all the characters who appeared in speaking roles and assigned each a gender using a newly developed algorithm that incorporates census data about names and Internet Movie Database (IMDb) information about the specific film. We also used a machine learning approach to predict whether there is at least one conversation between the identified female characters about something other than a man. The results achieved for screenplays are comparable to those of previously published work.

Novels required a different approach from screenplays because of the structural differences between the two kinds of text. For novels, we used a Named-Entity Recognizer and a rule-based algorithm that connects the different names used for each character throughout the text to identify all the characters in a novel. Using quote attribution, we then determined which character says which lines of dialogue, and so established who converses with whom. The method developed for novels achieved perfect accuracy on a small dataset of five novels.
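
The three questions of the test can be illustrated with a minimal Python sketch. It is not the implementation from the thesis: the `Conversation` type, the `about_man` flag and the `bechdel_results` function are hypothetical names introduced here, and the sketch assumes that the characters, their genders and the conversation topics have already been extracted. In the thesis, those inputs come from parsing, gender assignment, quote attribution and a machine learning classifier. The sketch also assumes that a conversation counts as "between female characters" when at least two named female characters take part in it.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class Conversation:
    """One conversation: who takes part and whether it is about a man."""
    speakers: FrozenSet[str]
    about_man: bool  # hypothetical label, assumed to be supplied by an earlier step


def bechdel_results(
    named_female: Iterable[str],
    conversations: Iterable[Conversation],
) -> Tuple[bool, bool, bool]:
    """Return the answers to the three Bechdel questions, in order."""
    women = set(named_female)

    # Question 1: are there at least two named female characters?
    q1 = len(women) >= 2

    # Assumption: a conversation is "between female characters" when at
    # least two of the named female characters take part in it.
    female_convs = [c for c in conversations if len(c.speakers & women) >= 2]

    # Question 2: do two female characters converse at some point?
    q2 = q1 and bool(female_convs)

    # Question 3: is at least one of those conversations not about a man?
    q3 = q2 and any(not c.about_man for c in female_convs)

    return q1, q2, q3
```

A text passes the test only if all three returned values are true, that is, when `all(bechdel_results(...))` holds.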

Additional information:

Title differs from the original; translated by the author

License:

In Copyright

Appears in Collections:

Thesis