The Literary Canons of Large-Language Models: An Exploration of the Frequency of Novel and Author Generations Across Gender, Race and Ethnicity, and Nationality
Abstract
Large language models (LLMs) are an emerging site for computational literary and cultural analysis. While such research has focused on applying LLMs to the analysis of literary text passages, the probabilistic mechanism used by these models for text generation lends them to also understanding literary and cultural trends. Indeed, we can imagine LLMs as constructing their own "literary canons" by encoding particular authors and book titles with high probability distributions around relevant words and text. This paper explores the frequency with which certain literary titles and authors are generated by a selection of popular proprietary and open-source models and compares it to existing conceptions of literary canon. It investigates the diversity of author mentions across gender, ethnicity, nationality as well as LLMs' ability to accurately report such characteristics. We demonstrate that the literary canons of popular large-language models are generally aligned with the Western literary canon in that they slightly prioritize male authors and overwhelmingly prioritize White American and British authors.