Jazz Genius

About the Project


Project Purpose

Our project offers an exploratory analysis of how Digital Humanities methods can be applied to a corpus of lyrics performed by American jazz singers in the 20th century. These methods include web scraping, TEI (Text Encoding Initiative) encoding, text classification, topic modeling, and various forms of data visualization. Specifically, we are interested in how these methods can be applied to a non-standardized form of text like song lyrics and, just as important, in what the social, political, and discursive power of song lyrics can tell us about American history throughout the 20th century. The project demonstrates the utility of digital humanities methods for answering research questions about texts that are figurative and discursive in nature.

Research Questions

  • Building on the work of the Linked Open Jazz project, how can additional analyses of jazz lyrics expand our understanding of the social and rhetorical role of jazz music in the United States in the 20th century?
  • How have trends in the form and style of American jazz lyrics developed throughout the 20th century and how do they vary by gender of the performer and/or additional performer-specific metadata?
  • What limitations does the Genius.com API present for web scraping of jazz lyrics?
  • Knowing that topic models of figurative language do not produce topics with the same clarity as those of non-fiction or academic texts (Rhody, 2012), how do jazz lyrics perform as a source corpus for topic modeling? Which topics/themes frequently appear in American jazz lyrics, how do they compare across gender of the artist, and how do these topics/themes reflect their societal and historical contexts?
  • How can song lyrics be encoded with additional tags/metadata, what coding is “useful” for providing deeper understanding of those lyrics, and how does extra coding enhance our ability to perform textual analysis?
  • How can song lyrics be visualized as stand-alone text or how can visualizations be derived from additional encoded data, text classifications, topic models, or other forms of text analysis?
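
To make the encoding question more concrete, the following is a minimal sketch of how plain-text lyrics might be wrapped in TEI-style verse markup. It assumes that stanzas are separated by blank lines and uses the TEI elements div, head, byline, lg, and l. The function name lyrics_to_tei and the sample text are hypothetical placeholders, and this is not the encoding scheme the project settled on.

```python
import re
import xml.etree.ElementTree as ET


def lyrics_to_tei(title, performer, lyrics):
    """Wrap plain-text lyrics in a TEI-style <div> of stanzas and lines.

    Assumption: stanzas are separated by one or more blank lines.
    """
    div = ET.Element("div", {"type": "song"})
    ET.SubElement(div, "head").text = title
    ET.SubElement(div, "byline").text = performer

    # Split on blank lines; each stanza becomes an <lg>, each line an <l>.
    stanzas = [s for s in re.split(r"\n\s*\n", lyrics.strip()) if s.strip()]
    for n, stanza in enumerate(stanzas, start=1):
        lg = ET.SubElement(div, "lg", {"type": "stanza", "n": str(n)})
        for line in stanza.splitlines():
            if line.strip():
                ET.SubElement(lg, "l").text = line.strip()
    return div


# Placeholder text only; these are not real lyrics.
sample = "Placeholder first line\nPlaceholder second line\n\nPlaceholder refrain line"
song = lyrics_to_tei("Example Song", "Example Singer", sample)
print(ET.tostring(song, encoding="unicode"))
```

Additional tags (for example, marking references to places or historical events inside a line) would be layered on top of this basic structure, which is where the question of which coding is “useful” comes in.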

Workflow and Data Management

Most of our file sharing was conducted through Google Drive, especially for the project report and presentation deliverables. GitHub housed our DH project corpus and any documents created for the project (XML files, etc.). GitHub also gave us greater version control. We used the comment feature associated with uploads to record exactly what had been done or changed in each document, which also made the comments useful for our weekly updates.

Data Sources

Genius.com is a website that collates information about music, including artist biographies, production information, and, most importantly for our purposes, song lyrics. Genius also has a public API that registered users can freely use to request information from the website. For our project, we used the Genius API to automatically save song lyrics as individual text files. Because the metadata on Genius.com is sometimes incomplete, we also supplemented the lyrics collected from Genius with additional metadata from Discogs.
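
The step of saving each song as its own text file can be sketched roughly as follows. This is an illustrative example rather than the project's actual script: it assumes the lyrics have already been retrieved as (artist, title, lyrics) records, leaves out the API request itself, and uses hypothetical helper names (safe_filename, save_lyrics) and a hypothetical file-naming pattern.

```python
import re
from pathlib import Path


def safe_filename(artist, title):
    """Build a filesystem-friendly name like 'artist-name_song-title.txt'."""
    def slug(text):
        # Lowercase, then collapse every run of other characters into one hyphen.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return f"{slug(artist)}_{slug(title)}.txt"


def save_lyrics(songs, out_dir):
    """Write each (artist, title, lyrics) record to its own UTF-8 text file.

    Records with empty lyrics are skipped. Returns the paths written.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for artist, title, lyrics in songs:
        if not lyrics or not lyrics.strip():
            continue  # nothing usable was returned for this song
        target = out_path / safe_filename(artist, title)
        target.write_text(lyrics.strip() + "\n", encoding="utf-8")
        written.append(target)
    return written
```

Keeping the artist in the file name is one way to carry basic metadata along with each file. Note that two songs with the same artist and title would overwrite each other under this naming pattern.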

See the methods page for more specifics on this process.

Our initial thought was to search via the Jazz meta tag. However, Genius classifies Jazz as a secondary tag, which makes it difficult to search for. The site also relies on users to tag the genre of songs, which could lead to misclassifications and contaminate our dataset. Using the meta tags therefore wasn’t an option.

Our solution was to create a list of 124 artists associated with jazz. We attempted to balance the list by gender and to include artists from multiple eras. This solution was only partially successful: we obtained over 5,000 songs, but popular artists such as Ella Fitzgerald dominated the song returns.
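
One simple way to see how strongly a few artists dominate a collection like this is to count the songs returned per artist and compute each artist's share of the total. The sketch below is illustrative only; the function name artist_shares and the toy artist labels are hypothetical, and the numbers are not drawn from our corpus.

```python
from collections import Counter


def artist_shares(song_artists):
    """Return (artist, count, share) tuples, sorted from most to fewest songs.

    song_artists holds one artist name per collected song.
    """
    counts = Counter(song_artists)
    total = sum(counts.values())
    return [(artist, n, n / total) for artist, n in counts.most_common()]


# Toy input: one entry per collected song (not real project data).
collected = ["Artist A", "Artist A", "Artist A", "Artist B"]
for artist, n, share in artist_shares(collected):
    print(f"{artist}: {n} songs ({share:.0%} of the corpus)")
```

A table like this makes it easier to decide whether later analyses should account for the imbalance, for example by comparing artists on a per-song basis rather than by raw totals.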

Annotated Bibliography

Antoniou, M. (2018). Text analytics & topic modelling on music genres song lyrics. Towards Data Science. https://towardsdatascience.com/text-analytics-topic-modelling-on-music-genres-song-lyrics-deb82c86caa2

Uses a Kaggle dataset of 380k songs released since 1970 and analyzes various characteristics of songs by genre, including comparisons across genres. Shows that jazz has the lowest average number of lyrics and the lowest median length. Uses box plots and word clouds (of the lyrics from the top genres) to visualize the findings.

Franzke, A.S., Bechmann, A., Zimmer, M., & Ess, C. M. (2020). Internet Research: Ethical Guidelines 3.0. Association of Internet Researchers. https://aoir.org 

The Association of Internet Researchers (AoIR) is a professional organization of academics and other stakeholders who develop methods and approaches for internet-based research. They have published guidelines for researchers to consider when studying born-digital content, particularly its implications as human-subjects research. These guidelines will be useful if we choose to collect information about the users who contribute to Genius.com or about their specific comments.

Hartmann, J., Huppertz, J., Schamp, C., & Heitmann, M. (2019). Comparing automated text classification methods. International Journal of Research in Marketing, 36(1), 20–38. https://doi.org/10.1016/j.ijresmar.2018.09.009

This article focuses on marketing research applications but is concerned with the effectiveness of different text classification methods on unstructured text data, word-choice classification and sentiment analysis methods in particular. Naïve Bayes works well with small samples of unstructured text. Lexicon-based methods aren’t traditionally used in marketing research.

Janicke, S., Franzini, G., Cheema, M. F., & Scheuermann, G. (2017). Visual Text Analysis in Digital Humanities. Computer Graphics Forum, 36(6), 226–250. https://doi.org/10.1111/cgf.12873

Discussion of existing research relating to visualization process and techniques for close reading (annotations) and distant reading (abstraction) in the DH (text sources, data transformation steps, types of visualizations used).

Pre-processing steps include: XML-based TEI, with XSLT stylesheets to transform TEI into visualizations; tokenization and normalization (chunking and frequency analysis); part-of-speech (POS) tagging; named entity recognition (NER) for people and place names; and topic modeling (LDA is the most popular).

For close reading, the structure of the text is generally retained, so one can perform deep analysis and compare text editions and linguistic patterns. Visualizations use color, font size, glyphs, and connections. For distant reading, summary information about the corpus is what matters, and visualizations include heat maps, maps, tag/word clouds, timelines, and graphs.

Lin, J., Milligan, I., Oard, D. W., Ruest, N., & Shilton, K. (2020). We Could, but Should We?: Ethical Considerations for Providing Access to GeoCities and Other Historical Digital Collections. Proceedings of the 2020 Conference on Human Information Interaction and Retrieval, 135–144. https://doi.org/10.1145/3343413.3377980

This article considers the ethical issues of web archiving, particularly for researchers who want to work with older content. The authors suggest that research using certain online data sets, particularly the archived collection of GeoCities websites, necessarily changes the context of publication. Despite being published on the public web, GeoCities sites were once relatively private, and making the archive freely available could disclose personal information in a way that had never been anticipated. While our use of the Genius.com database does not raise the same issues of private information, this article serves as an important reminder that by using publicly available data we are nevertheless transforming it and changing the context of publication.

Moser, S. (2007). Media modes of poetic reception: Reading lyrics versus listening to songs. Poetics, 35(4), 277–300. https://doi.org/10.1016/j.poetic.2007.01.002

In this article, Moser argues that “songs are a multisensorial mode of linguistic communication” and that analyses of how song texts are received may therefore need to consider many factors. Although lyrics exist in multiple modalities, such as oral, printed, and audiovisual forms, most analyses of lyrics follow traditional methods of textual analysis. Moser suggests that lyrics that have been separated from their melody and vocal reproduction do not necessarily represent the full song text. This is an important caveat for our project, where lyrics specifically take a central role. While we may be able to describe broad trends in jazz lyrics, we must be wary of overgeneralizing our findings as representative of jazz music more broadly.

Myers, M. (2013). Why Jazz Happened. University of California Press.

This book provides a social history of mid-century jazz, specifically 1942-1972, that connects changes in style to changes in the music industry and in American culture at large. The narrative focuses almost entirely on the major commercial and technological forces that allowed jazz to be recorded and broadcast. Some of the major extra-musical factors that made this possible include developments in business, technology, the economy, demographics, and race relations. Myers accomplishes this by using a combination of sources, many drawn from insider interviews that he personally conducted between 2008 and 2011 with performers, producers, and many others within the industry. In our encoding of song lyrics, this history can help us correctly tag or annotate words and phrases that refer to specific social and historical events.

Rhody, L. M. (2012). Topic Modeling and Figurative Language. Journal of Digital Humanities, 2(1). http://journalofdigitalhumanities.org/2-1/topic-modeling-and-figurative-language-by-lisa-m-rhody/ 

LDA looks for a finite number of topics within a corpus of texts. Topic modeling of figurative texts does not produce topics with the same clarity as it does for non-fiction or academic texts in general (how would this work for song lyrics?). We cannot label topics in the same way, on the assumption that the topics are “thematic,” especially if we know the texts and are predisposed to reading them a certain way (e.g., meanings are more fluid in figurative texts). Topics are representations of discourse (language as it is used and as it participates in recognized social forms) rather than thematic strings of coherent terms; these are the types of topics the model produces. The next step is to examine the documents, or samples of documents, that the model associates with each type of discourse to see what they tell you about the generated topic.

Rustin-Paschal, N., & Tucker, S. (Eds.). (2008). Big ears: Listening for gender in jazz studies. Duke University Press.

Tucker and Rustin-Paschal put together a collection of articles by eminent scholars in multiple disciplines, all centered on the idea of jazz and gender. The articles cover women and men, masculinity and femininity, race, class, and space in varied ways. In the article "Separated at 'Birth': Singing and the History of Jazz," the author critiques the ways in which singing (gendered female) and actual female singers have been removed from the genre and history of jazz in favor of the dominance of male-coded instrumentalism. By focusing (almost) exclusively on song lyrics as objects of textual analysis, our DH project re-emphasizes the importance of singers and their significant role within the jazz genre. We will not use every article in this book, since some chapters go beyond our scope, such as Ursel Schlicht's article on women musicians and audiences in post-war Germany.

Stratton, V. N., & Zalanowski, A. H. (1994). Affective Impact of Music Vs. Lyrics. Empirical Studies of the Arts, 12(2), 173–184. https://doi.org/10.2190/35T0-U4DT-N09Q-LQHW

This article demonstrates that lyrics alone can evoke different affective responses, as compared to music alone or music and lyrics combined. This research demonstrates the value in a project such as ours, which focuses solely on the lyrics of music. However, it also rightly reminds us that lyrics have a different impact on individuals when they are separated from their original music. Therefore, our project must be precise in how we describe our methodologies, as well as recognize the limited scope of our analysis.

Sugimoto, G. (2019). Introduction to Populating a Website with API Data. Programming Historian. https://programminghistorian.org/en/lessons/introduction-to-populating-a-website-with-api-data

The article provides information on integrating API data into a webpage. It walks through the steps of registering for an API, using the Europeana API as an example. Then, it provides the steps to set up a virtual server using XAMPP and goes over basic HTML/PHP syntax. Finally, it walks through the process of integrating a JSON file from the API data in the previous example into a new web page. This is a useful tutorial if our group decides to display elements of our project on a webpage. While PHP can become complicated, this particular use of it shouldn’t be too difficult.

Tucker, S. (1999). Telling Performances: Jazz History Remembered and Remade by the Women in the Band. Oral History Review, 26(1), 67–84. https://doi.org/10.1093/ohr/26.1.67

In this article, author Sherrie Tucker argues that more attention should be given to the oral histories of women jazz musicians. By doing so, jazz historians would gain a more complex understanding of the history of jazz and the contributions that women have made to the genre. Tucker contrasts this proposed methodology with the contemporary scholarship that focuses on “favored artistes” and “superior genres.” The article is an interesting feminist historical analysis and critique that highlights the overlooked contributions of women to jazz. It is also an important reminder that academic methodology can shape the way history is told.

Turkel, W. J., & Crymble, A. (2012). Output Keywords in Context in an HTML File with Python. Programming Historian. https://programminghistorian.org/en/lessons/output-keywords-in-context-in-html-file

This tutorial in Programming Historian describes a method for using Python to write data into an HTML file. While the tutorial’s example (a dictionary of n-grams) may differ from the analysis we eventually conduct, its description of writing a Python function to wrap data in HTML tags is useful for getting our work into a presentable format.

Walsh, M. (2021). Song Genius Data Collection. In Introduction to Cultural Analytics & Python. https://doi.org/10.5281/ZENODO.4411250

In her textbook, Melanie Walsh describes how to use Python for Cultural Analytics methods. In this section, she provides a helpful tutorial on how to interact with the Genius.com API via Python. The tutorial outlines basic concepts for working with the API, examples for collecting song lyrics, and some initial scripts to begin processing song lyrics that have been saved to text files. Most importantly, she introduces John Miller’s Python package LyricsGenius, which is a helpful wrapper for interacting with the Genius API.