Jazz Genius

Sample Code

Good coders create. Great coders steal.

One of our primary goals with this project was to explore the types of research that become possible when applying Digital Humanities to song lyrics and an online resource such as Genius.com. To deliver on this goal, we determined it was important to provide as much documentation and explanation as possible. This way, if any future projects hope to achieve similar results or to ask similar kinds of questions, the work that we have produced may serve as a useful starting point.

To that end, we are providing sample code from various portions of the project. These examples serve as supplementary material to our full methods descriptions.

You can show or hide sections by clicking on their headings.


Genius Scraping

This is the 'megascraper' script, which was used to collect the bulk of our dataset. It reads a list of artist names from a file called artists.txt and searches Genius.com for up to 100 songs from each of those artists. As it does this, it saves each song's lyrics to a text file and appends each Genius song ID to idlist.txt so the IDs can be reused later. It also requests each song's metadata from the Genius API and writes it to a specified CSV file. The script takes a very long time to run when requesting large numbers of artists or songs! We recommend testing with a much smaller query before running the 'real thing'.

Read more about the Data Collection methodology



import lyricsgenius
import http.client, json
import csv
import requests
from time import sleep
from datetime import datetime

auth_string = '[PUT YOUR GENIUS.COM API KEY HERE]'

#Name the output filename:
outputName = 'genius-metadata.csv'

#Enter column headers
columnHeads = [
"Genius Song ID",
"Title",
"Artist",
"Genius Artist ID",
"Release Date",
"Display Date",
"Recording Location",
"lyrics_state",
"Genius URL",
"Apple Music URL",
"Apple Music ID",
"song_art_image_url"
]

#songlist = []

LyricsGenius = lyricsgenius.Genius(auth_string)




artistslist = []


def geniusSearch(searchartist):

	artist = LyricsGenius.search_artist(searchartist, max_songs=100)

	#search_artist returns None when no matching artist is found
	if artist is None:
		print('No artist found for ' + str(searchartist))
		return

	print(artist.songs)

	for song in artist.songs:
		#searchsongid(song)
		try:
			song.save_lyrics(extension='txt')
			#songlist.append(song.id)
			songid = song.id
			writeIDs(songid)
			getData(songid)
		except Exception as error:
			#Catch Exception rather than using a bare except, so Ctrl-C can still stop a long run
			print('saving song failed: ' + str(error))
		
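def searchsongid(song):
	#Unused debugging helper: search Genius for a song and print the hits
	genius_search_url = 'https://api.genius.com/search'
	sleep(0.5)
	#Let requests URL-encode the query instead of concatenating it by hand
	response = requests.get(genius_search_url, params={'q': str(song), 'access_token': auth_string})
	
	json_data = response.json()
	item = json_data['response']['hits']
	print(item)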
		
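def setupCSV():
	print("Creating File: " + outputName)
	#newline='' stops the csv module from writing blank rows on Windows
	with open(outputName, "w", newline='', encoding='utf-8') as csvfile:
		csv.writer(csvfile).writerow(columnHeads)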


def getartistlist():
	with open("artists.txt", "r") as text_file:
		for item in text_file:
			#Strip the trailing newline and skip blank lines
			name = item.strip()
			if name:
				artistslist.append(name)


def getData(songid):
    print('Retrieving metadata for song ID #' + str(songid))
    conn = http.client.HTTPSConnection("api.genius.com")
    request_uri = '/songs/' + str(songid)
    headersMap = {
            "User-Agent": "CompuServe Classic/1.22",
            "Accept": "application/json",
            "Authorization": "Bearer " + auth_string
    }
    sleep(0.5)
    conn.request("GET", request_uri, headers=headersMap)
    response = conn.getresponse()
    ### Output the HTTP status code and reason text...
    print (response.status)
    print(response.reason)
    data = response.read()
    result = json.loads(data)

    #Print the whole json response (if needed)
    #print(json.dumps(result, indent=4, sort_keys=True))
    
    songtitle = result['response']['song']['title']
    artist = result['response']['song']['primary_artist']['name']
    artist_id = result['response']['song']['primary_artist']['id']
    release_date = result['response']['song']['release_date']
    display_date = result['response']['song']['release_date_for_display']
    recording_location = result['response']['song']['recording_location']
    lyrics_state = result['response']['song']['lyrics_state']
    genius_url = result['response']['song']['url']
    apple_music_url = result['response']['song']['apple_music_player_url']
    apple_music_id = result['response']['song']['apple_music_id']
    song_art_image_url = result['response']['song']['song_art_image_url']
    
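    #Open and close the file on every call so each row is saved even if the script fails later
    with open(outputName, "a", newline='', encoding='utf-8') as csvfile:
        f = csv.writer(csvfile)
        f.writerow([songid, songtitle, artist, artist_id, release_date, display_date, recording_location, lyrics_state, genius_url, apple_music_url, apple_music_id, song_art_image_url])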
    
   
def writeIDs(songid):
	ids = open("idlist.txt", "a")
	ids.write(str(songid))
	ids.write('\n')
	ids.close()

startTime = datetime.now()
print('Started at ' + str(startTime))

getartistlist()  
setupCSV()


for artist in artistslist:
	geniusSearch(artist)
	
print('Finished getting lyrics.')
print('Now retrieving metadata')
	
#print(songlist)
#ids = open("idlist.txt", "a")
#ids.write(str(songlist))
#ids.close()

#Moved to run after each artist - so if the script fails, we still get some data
#for song in songlist:
#	getData(song)
	
totalTime = datetime.now() - startTime
print('Total time: ' + str(totalTime))		
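
The recommendation to test with a much smaller query before the full run could be sketched roughly as follows. This is only an illustration and not part of the original script: the helper name `load_artists` and its `limit` parameter are hypothetical, and the artist names are placeholders. The sketch only prepares a shortened artist list that the final loop could iterate over instead of the full `artistslist`.

```python
import os
import tempfile


def load_artists(path, limit=None):
    """Return artist names from a text file, one per line.

    Hypothetical helper: strips trailing newlines, skips blank lines,
    and keeps only the first `limit` names when a limit is given.
    """
    with open(path, "r", encoding="utf-8") as text_file:
        names = [line.strip() for line in text_file if line.strip()]
    return names if limit is None else names[:limit]


if __name__ == "__main__":
    # Write a throwaway artists file so the sketch runs on its own.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as handle:
        handle.write("Artist A\n\nArtist B\nArtist C\n")
        sample_path = handle.name

    # A trial run would loop over just the first two artists.
    print(load_artists(sample_path, limit=2))
    os.remove(sample_path)
```

Running the scraper over a list limited this way first should show whether the API key, the output files, and the request pacing behave as expected before committing to the full run.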

Discogs Scraping

This script reads a CSV file containing song metadata. It reads the title and artist (currently in columns 2 and 3) from each row and searches Discogs for the release year. It then appends that year to the row and saves the result into a new CSV file.

It requires the Python csv library as well as the official Discogs API client for Python.

Read more about the Data Collection methodology



import discogs_client, csv

#Only the user token is used below; the key and secret are not needed for token authentication
apikey = '[PUT YOUR DISCOGS API KEY HERE]'
consumersecret = '[PUT YOUR DISCOGS CONSUMER SECRET HERE]'
usertoken = '[PUT YOUR DISCOGS USER TOKEN HERE]'

filepath = '[PUT INPUT FILEPATH HERE]'
outputpath = '[PUT OUTPUT FILEPATH HERE]'



#The API wants there to be some sort of a client header, but it can be anything that follows the correct format/structure:
discogs = discogs_client.Client('jazzGenius/0.1', user_token=usertoken)



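def searchDiscogs(query):
	print("Searching for " + str(query))
	results = discogs.search(query, type='release')
	try:
		#Use the year of the first release in the search results
		year = results[0].year
	except Exception:
		#No results (or a failed request): leave the year blank
		year = ''
	print(year)
	return year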
	
def readCSV():
	csvfile = open(filepath, "r")
	csvreader = csv.reader(csvfile, delimiter=',')
	return csvreader
		
	
def setupOutput():
	outputfile = open(outputpath, "a")
	csvwriter = csv.writer(outputfile)
	return csvwriter
	
def writeRow(rowdata):
	csvwriter.writerow(rowdata)
	
csvreader = readCSV()
csvwriter = setupOutput()

for row in csvreader:
	songtitle = row[1] #get the title
	songartist = row[2] #get the artist
	
	searchquery = str(songtitle) + " " + str(songartist) #build the query
	
	result = str(searchDiscogs(searchquery))
	
	row.append(result)
	writeRow(row)		
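
The row-by-row pattern above (build a query from two columns, look something up, append the answer) can be tried without a Discogs account. The following is a minimal sketch under stated assumptions: `add_year_column` and `fake_lookup` are hypothetical names, the sample rows are placeholders, and the lookup is a stand-in for the real Discogs search.

```python
import csv
import io


def add_year_column(rows, lookup_year, title_col=1, artist_col=2):
    """Yield each row with a release year appended.

    `lookup_year` is any function that takes a "title artist" query
    string and returns a year (or '' when nothing is found).
    Column numbers are zero-based.
    """
    for row in rows:
        query = str(row[title_col]) + " " + str(row[artist_col])
        yield list(row) + [str(lookup_year(query))]


if __name__ == "__main__":
    # Stand-in for the Discogs search, so no network or token is needed.
    fake_years = {"Song One Artist A": 1957}

    def fake_lookup(query):
        return fake_years.get(query, "")

    source = io.StringIO("101,Song One,Artist A\n102,Song Two,Artist B\n")
    output = io.StringIO()
    writer = csv.writer(output)
    for new_row in add_year_column(csv.reader(source), fake_lookup):
        writer.writerow(new_row)
    print(output.getvalue())
```

Passing the lookup in as a function keeps the CSV handling separate from the API call, so the column layout can be checked on a handful of rows before any requests are sent.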

Sentiment Analysis

This Python script reads a CSV file that contains song lyrics and uses VADER to analyze the emotional sentiment of their content. The script then outputs a new CSV file in which each row has the compound, negative, neutral, and positive scores appended. The script expects the song lyrics to be in the fourth column of the CSV file (row[3]), but this can easily be adjusted to match a different CSV layout. Remember that Python starts counting at 0, not 1!

The script was written following this tutorial on Programming Historian. It requires the Python csv library as well as the Natural Language Toolkit (NLTK).

Read more about the sentiment analysis methodology




# first, we import the relevant modules from the NLTK library
from nltk.sentiment.vader import SentimentIntensityAnalyzer

#import the CSV processor
import csv

#Import VADER into an object we can use
sid = SentimentIntensityAnalyzer()


filepath = '[PUT INPUT FILEPATH HERE]'
outputpath = '[PUT OUTPUT FILEPATH HERE]'

def readCSV():
	csvfile = open(filepath, "r")
	csvreader = csv.reader(csvfile, delimiter=',')
	return csvreader
	
def setupOutput():
	outputfile = open(outputpath, "a")
	csvwriter = csv.writer(outputfile)
	return csvwriter
	
def writeRow(rowdata):
	csvwriter.writerow(rowdata)
	
csvreader = readCSV()
csvwriter = setupOutput()

def analyzeSentiment(message_text):
	#print(message_text)
	# Calling the polarity_scores method on sid and passing in the message_text outputs a dictionary with negative, neutral, positive, and compound scores for the input text
	scores = sid.polarity_scores(message_text)
		
	return scores
    
    
for row in csvreader:
	print("\n") #Start each song on a new line
	print(row[1]) #Print the artist
	print(row[0]) #Print the title
	
	songlyrics = row[3] #get the lyrics
	
	scores = analyzeSentiment(songlyrics)
	
	# Here we loop through the keys contained in scores (pos, neu, neg, and compound scores) and print the key-value pairs on the screen
	for key in sorted(scores):
		print('{0}: {1}, '.format(key, scores[key]), end='')
	
	#Set the scores to some variables
	#This is very inefficient, because we're not doing anything with this data after we write it. 
	#
	
	compound = scores['compound']
	negative = scores['neg']
	neutral = scores['neu']
	positive = scores['pos']
	
	row.append(compound)
	row.append(negative)
	row.append(neutral)
	row.append(positive)
	writeRow(row)
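
To adapt the script to a different CSV layout, only the lyrics column index needs to change. The sketch below illustrates that idea; it is not the project's code. `append_scores` is a hypothetical helper, the scorer is passed in as a function so the example runs without NLTK, and the numbers returned by `fake_scorer` are made up.

```python
def append_scores(row, score_text, lyrics_col=3):
    """Return a copy of `row` with four sentiment scores appended.

    `score_text` is any function that maps text to a dict with the keys
    'compound', 'neg', 'neu' and 'pos', which is the shape returned by
    VADER's polarity_scores. `lyrics_col` is zero-based, so 3 means
    the fourth column.
    """
    scores = score_text(row[lyrics_col])
    return list(row) + [scores[key] for key in ("compound", "neg", "neu", "pos")]


if __name__ == "__main__":
    # Stand-in scorer with made-up numbers so the sketch runs without NLTK.
    def fake_scorer(text):
        return {"compound": 0.5, "neg": 0.0, "neu": 0.6, "pos": 0.4}

    row = ["Song One", "Artist A", "1957", "placeholder lyrics"]
    print(append_scores(row, fake_scorer))
```

With NLTK and the VADER lexicon installed, `SentimentIntensityAnalyzer().polarity_scores` could be passed as `score_text` in place of the stand-in.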
		

TEI Template

For the TEI component, I put together a basic template for the future encoding of a subset of the songs we gathered from the Genius API. The purpose of the template is to make sure that all XML files created for our project contain the same information and that every file validates.

Read more about the TEI methodology




<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>

<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
	<fileDesc>
		<titleStmt>
			<title type="main"></title>
			<respStmt> 
				<!-- We are using "respStmt" here because author/editor isn't correct in our situation -->
				<resp>Vocalist</resp>
				<persName>Artist Name Here</persName>
			</respStmt>
			<respStmt>
				<resp>Encoding</resp>
				<name xml:id="initials_here">Put Your Name Here</name>
			</respStmt>
		</titleStmt>
		<editionStmt>
			<edition>
			<date>2021</date>
			</edition>
		</editionStmt>
		<publicationStmt>
			<!-- Put basic information here for publication stuff, we can change later once we discuss/etc. -->
			<authority>LIS 768 Analytics Project</authority>
			<publisher>University of Wisconsin Madison</publisher>
			<pubPlace>Madison, Wisconsin</pubPlace> 
			<idno>File Name Here</idno>
			<availability>
			<p>Under Section 107 of the U.S. Copyright Act of 1976, we are using these song lyrics for education and research. These materials are not for public distribution.</p>
			</availability>
		</publicationStmt>
		<notesStmt>
			<note type="project">This file was produced for Library &amp; Information Sciences 768: Digital Humanities Analytics, iSchool, University of Wisconsin Madison, Spring 2021.</note>
		</notesStmt>
		<sourceDesc>
			<bibl>
				<respStmt>
				<resp>Vocalist</resp>
				<name>Jazz Vocalist Name Here</name></respStmt>
				<title type="main">Song Title Here</title>
				<date when="2021">Date (Year) Here (N.D. if unavailable/unknown); put the correct date in the when attribute as well (if no date, just delete the when attribute)</date>
			</bibl>
		</sourceDesc>
	</fileDesc>
	<encodingDesc>
		<editorialDecl>
		<correction>
			<p></p>
		</correction>
		<normalization>
			<p></p>
		</normalization>
		</editorialDecl>
	</encodingDesc>
	<profileDesc>
		<!-- We may not need textClass tag (unless we decide to catch geographic information?) It depends on what we want to do here. -->
		<textClass>
		<keywords scheme="original" n="category">
			<term>Song</term>
		</keywords>
		<keywords scheme="lcsh" n="keywords">
			<term></term>
		</keywords>
		</textClass>
		<!-- we may use "particDesc" tag since it describes any individual (or group) in text (so, use "personGrp" or "org" to describe something the vocalist is talking about) the options are endless! -->
		<particDesc>
			<person></person>
		</particDesc>
	</profileDesc>        
	<revisionDesc>
		<change when="2021-03"><name>Your Name Here</name> tei encoding</change>
		<change> </change>
	</revisionDesc>
</teiHeader>
<text>
<body>           
	<div1 type="song">
	<!-- we will use "lg" (line group) and "l" (line) for song lyrics;  -->         
	</div1>
</body>
</text>
</TEI>
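
One way to check that encoded files still carry the same information as the template is a small script that looks for a few required elements. The following is a rough sketch, not a project tool: the list in `REQUIRED_PATHS` is an assumption about which elements matter, `missing_elements` is a hypothetical helper, and Python's standard library only checks that the XML is well-formed; it does not validate against the TEI schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Paths this sketch assumes every project file should contain.
REQUIRED_PATHS = [
    ".//tei:titleStmt/tei:title",
    ".//tei:titleStmt/tei:respStmt",
    ".//tei:publicationStmt/tei:idno",
    ".//tei:sourceDesc/tei:bibl",
    ".//tei:text/tei:body",
]


def missing_elements(xml_text):
    """Return the required paths that are absent from a TEI document.

    Raises xml.etree.ElementTree.ParseError if the text is not
    well-formed XML. This does not validate against the TEI schema.
    """
    root = ET.fromstring(xml_text)
    return [path for path in REQUIRED_PATHS if root.find(path, TEI_NS) is None]


if __name__ == "__main__":
    # A cut-down document that leaves out the respStmt on purpose.
    sample = (
        '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
        "<teiHeader><fileDesc>"
        "<titleStmt><title>Song Title Here</title></titleStmt>"
        "<publicationStmt><idno>File Name Here</idno></publicationStmt>"
        "<sourceDesc><bibl/></sourceDesc>"
        "</fileDesc></teiHeader>"
        "<text><body/></text>"
        "</TEI>"
    )
    print(missing_elements(sample))
```

Full validation against tei_all.rng would still need a RELAX NG validator, such as the one built into an XML editor.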
		

Web Database

This is the .sql file produced when exporting the database structure from Sequel Ace. This does not contain any of the song data itself, but can be used to recreate the database structure and table relations on a new server. Check out the methods page for more details about how the database structure has been set up.




# ************************************************************
# Sequel Ace SQL dump
# Version 3028
#
# https://sequel-ace.com/
# https://github.com/Sequel-Ace/Sequel-Ace
#
# Host: localhost (MySQL 5.5.5-10.5.9-MariaDB-1:10.5.9+maria~buster)
# Database: jazzGenius
# Generation Time: 2021-04-17 20:29:55 +0000
# ************************************************************


/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
SET NAMES utf8mb4;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE='NO_AUTO_VALUE_ON_ZERO', SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
/*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */;


# Dump of table artists
# ------------------------------------------------------------

CREATE TABLE `artists` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `artist_name` varchar(255) DEFAULT NULL,
  `sort_name` varchar(255) DEFAULT NULL,
  `gender` varchar(64) DEFAULT NULL,
  `artist_notes` text DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1988020 DEFAULT CHARSET=utf8mb4;



# Dump of table songs
# ------------------------------------------------------------

CREATE TABLE `songs` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `artist_id` int(11) unsigned NOT NULL,
  `title` varchar(255) DEFAULT NULL,
  `genius_url` varchar(2083) DEFAULT NULL,
  `art_url` varchar(2083) DEFAULT NULL,
  `apple_music_url` varchar(2083) DEFAULT NULL,
  `notes` text DEFAULT NULL,
  `lyrics` text DEFAULT NULL,
  `year` year(4) DEFAULT NULL,
  `sentiment` float DEFAULT 0,
  PRIMARY KEY (`id`),
  KEY `songs_artist_join` (`artist_id`),
  CONSTRAINT `songs_artist_join` FOREIGN KEY (`artist_id`) REFERENCES `artists` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=6594775 DEFAULT CHARSET=utf8mb4;



# Dump of table tei
# ------------------------------------------------------------

CREATE TABLE `tei` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `songid` int(11) unsigned DEFAULT NULL,
  `encoder` varchar(16) DEFAULT NULL,
  `tei_notes` text DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `songid` (`songid`),
  CONSTRAINT `tei_ibfk_1` FOREIGN KEY (`songid`) REFERENCES `songs` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4;




/*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */;
/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;
/*!40101 SET CHARACTER_SET_RESULTS=@OLD_CHARACTER_SET_RESULTS */;
/*!40101 SET COLLATION_CONNECTION=@OLD_COLLATION_CONNECTION */;
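
The table relations in this dump (each song points to one artist, and each TEI record points to one song) can be explored without a MySQL or MariaDB server. Below is a rough sketch that uses Python's built-in sqlite3 module with a simplified subset of the columns. SQLite's types and syntax differ from MariaDB's, so this is an illustration of the relations only, not a substitute for importing the .sql file; the function names and sample values are hypothetical.

```python
import sqlite3

# Simplified, SQLite-flavoured subset of the three tables in the dump.
SCHEMA = """
CREATE TABLE artists (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    artist_name TEXT
);
CREATE TABLE songs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    artist_id INTEGER NOT NULL REFERENCES artists (id),
    title TEXT,
    sentiment REAL DEFAULT 0
);
CREATE TABLE tei (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    songid INTEGER REFERENCES songs (id),
    encoder TEXT
);
"""


def build_database():
    """Create the simplified tables in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def songs_with_artists(conn):
    """Join songs to their artists, mirroring the songs_artist_join key."""
    query = (
        "SELECT songs.title, artists.artist_name "
        "FROM songs JOIN artists ON songs.artist_id = artists.id "
        "ORDER BY songs.id"
    )
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = build_database()
    conn.execute("INSERT INTO artists (artist_name) VALUES (?)", ("Artist A",))
    conn.execute(
        "INSERT INTO songs (artist_id, title) VALUES (?, ?)", (1, "Song One")
    )
    print(songs_with_artists(conn))
    conn.close()
```

Because of the foreign key constraints, a song cannot be inserted before its artist exists, and a TEI record cannot be inserted before its song exists.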
		

Parsing TEI for the Web

TEI files are written in XML, which is highly machine-readable and, supposedly, human-readable. Even with this consistent formatting and file structure, reading an XML file and displaying its data in an HTML webpage is still a considerable challenge. We are using the PHP SimpleXML extension to read the contents of each XML file. This is one key portion of the display script. It loops through each line in the body of the TEI file, searches for specified strings that correspond to the XML tags, and replaces them with HTML code that displays mouse-over text.


#Remove the <l> and </l> tags:
$processedline = str_ireplace('&lt;l&gt;', '', htmlspecialchars($line->asXML()));
$processedline = str_ireplace('&lt;/l&gt;', '', $processedline);

#Find and replace <persName> and </persName>
$processedline = str_ireplace('&lt;persName&gt;', '<span class="tooltip"><span class="tooltiptext">Person Name</span>', $processedline);
$processedline = str_ireplace('&lt;/persName&gt;', '</span>', $processedline);

#Find and replace <rs type="person">
$processedline = str_ireplace('&lt;rs type=&quot;person&quot;&gt;', '<span class="tooltip"><span class="tooltiptext">Reference String - Person</span>', $processedline);

#Find and replace <rs type="place">
$processedline = str_ireplace('&lt;rs type=&quot;place&quot;&gt;', '<span class="tooltip"><span class="tooltiptext">Reference String - Place</span>', $processedline);

#Find and replace <rs type="clothing">
$processedline = str_ireplace('&lt;rs type=&quot;clothing&quot;&gt;', '<span class="tooltip"><span class="tooltiptext">Reference String - Clothing</span>', $processedline);

#Find and replace <rs type="food">
$processedline = str_ireplace('&lt;rs type=&quot;food&quot;&gt;', '<span class="tooltip"><span class="tooltiptext">Reference String - Food</span>', $processedline);

#Find and replace </rs>
$processedline = str_ireplace('&lt;/rs&gt;', '</span>', $processedline);

#Find and replace <name type="person">
$processedline = str_ireplace('&lt;name type=&quot;person&quot;&gt;', '<span class="tooltip"><span class="tooltiptext">Name Tag - Person</span>', $processedline);

#Find and replace <name>
$processedline = str_ireplace('&lt;name&gt;', '<span class="tooltip"><span class="tooltiptext">Name Tag</span>', $processedline);


#Find and replace </name>
$processedline = str_ireplace('&lt;/name&gt;', '</span>', $processedline);

#Find and replace <placeName>
$processedline = str_ireplace('&lt;placeName&gt;', '<span class="tooltip"><span class="tooltiptext">Place Name</span>', $processedline);

#Find and replace </placeName>
$processedline = str_ireplace('&lt;/placeName&gt;', '</span>', $processedline);

#Find and replace <foreign xml:lang="fr">
$processedline = str_ireplace('&lt;foreign xml:lang=&quot;fr&quot;&gt;', '<span class="tooltip"><span class="tooltiptext">Foreign Language - French</span>', $processedline);

#Find and replace </foreign>
$processedline = str_ireplace('&lt;/foreign&gt;', '</span>', $processedline);

#Find and replace <seg type="structure-label">
$processedline = str_ireplace('&lt;seg type=&quot;structure-label&quot;&gt;', '<span class="tooltip"><span class="tooltiptext">Segment Label: Song Structure</span>', $processedline);

#Find and replace <seg type="accompaniment">
$processedline = str_ireplace('&lt;seg type=&quot;accompaniment&quot;&gt;', '<span class="tooltip"><span class="tooltiptext">Segment Label: Accompaniment</span>', $processedline);

#Find and replace </seg>
$processedline = str_ireplace('&lt;/seg&gt;', '</span>', $processedline);


#Find and replace <sic>
$processedline = str_ireplace('&lt;sic&gt;', '<span class="tooltip"><span class="tooltiptext">[sic]</span>', $processedline);

#Find and replace </sic>
$processedline = str_ireplace('&lt;/sic&gt;', '</span>', $processedline);

#Find and replace <rhyme label="a">
$processedline = str_ireplace('&lt;rhyme label=&quot;a&quot;&gt;', '<span class="tooltip"><span class="tooltiptext">Rhyme A</span>', $processedline);

#Find and replace <rhyme label="b">
$processedline = str_ireplace('&lt;rhyme label=&quot;b&quot;&gt;', '<span class="tooltip"><span class="tooltiptext">Rhyme B</span>', $processedline);

#Find and replace <rhyme label="c">
$processedline = str_ireplace('&lt;rhyme label=&quot;c&quot;&gt;', '<span class="tooltip"><span class="tooltiptext">Rhyme C</span>', $processedline);

#Find and replace <rhyme label="d">
$processedline = str_ireplace('&lt;rhyme label=&quot;d&quot;&gt;', '<span class="tooltip"><span class="tooltiptext">Rhyme D</span>', $processedline);


#Find and replace </rhyme>
$processedline = str_ireplace('&lt;/rhyme&gt;', '</span>', $processedline);



echo $processedline;
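
The same escape-then-replace idea can be expressed with a lookup table instead of one call per tag. The following Python sketch is an analogue for illustration, not a port of the PHP display script: it covers only a handful of the tags, and the names `OPENING_TAGS`, `CLOSING_TAGS`, `replace_ignore_case` and `render_line` are hypothetical.

```python
import html
import re

TOOLTIP = '<span class="tooltip"><span class="tooltiptext">{}</span>'

# Opening tags and the mouse-over label each one should display.
OPENING_TAGS = {
    "<persName>": "Person Name",
    '<rs type="person">': "Reference String - Person",
    '<rs type="place">': "Reference String - Place",
    "<placeName>": "Place Name",
    "<sic>": "[sic]",
}

# Closing tags all collapse to the same closing span.
CLOSING_TAGS = ["</persName>", "</rs>", "</placeName>", "</sic>"]


def replace_ignore_case(text, old, new):
    """Case-insensitive literal replacement, similar to PHP's str_ireplace."""
    return re.sub(re.escape(old), lambda match: new, text, flags=re.IGNORECASE)


def render_line(line_xml):
    """Turn one TEI <l> element (as a string) into HTML with tooltips."""
    # Escape everything first, so only the tags listed above become markup.
    processed = html.escape(line_xml)
    for tag in ("<l>", "</l>"):
        processed = replace_ignore_case(processed, html.escape(tag), "")
    for tag, label in OPENING_TAGS.items():
        processed = replace_ignore_case(
            processed, html.escape(tag), TOOLTIP.format(label)
        )
    for tag in CLOSING_TAGS:
        processed = replace_ignore_case(processed, html.escape(tag), "</span>")
    return processed


if __name__ == "__main__":
    print(render_line("<l>Meet me in <placeName>Paris</placeName></l>"))
```

Keeping the tags in a table means a new TEI element only needs one new entry, and any tag that is not listed stays escaped and is shown as literal text rather than interpreted as HTML.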

Web Odds and Ends

Collapsible divs for Section Organization

Throughout the website, we are using jQuery to dynamically show and hide various sections of each page. Especially on pages with lots of content, this is a convenient way to avoid overwhelming the user, while also avoiding the problem of having too many separate pages. Ben originally wrote this code for his personal website, so it was a simple matter to recycle it here.



$(document).ready(function(){

  // Add the indicator arrows to the trigger/header elements
	$('.trigger').each(function(index, value) {
		//First, we store the h2 text as the variable label
		var label = $(this).text();
		
		//Next, we re-write that element with label plus the unicode for the down-facing arrow
		$(this).text(label + " \u25BE");
		return
	});
	
	//When we click on an h2, hide the next div
	$('.trigger').click(function() {

		//Store whatever the current value of the h2 label is as a variable
		var label = $(this).text();
		
		//First, we need to test if the section is visible or not
		//Declare newLabel locally so it does not leak into the global scope
		var newLabel;
		if( $(this).next('div.showHide').is(':visible')) {
			//If so, find the down facing arrow and replace with a right facing arrow
			label = label.replace(" \u25BE", "");
			newLabel = label + " \u25B8";
			$(this).text(newLabel);
		} else {
			//If it's hidden, find the right facing arrow and replace with a down facing arrow
			label = label.replace(" \u25B8", "");
			newLabel = label + " \u25BE";
			$(this).text(newLabel);
		}
		
		//And then we show/hide the particular section
		$(this).next('div.showHide').slideToggle("fast");
	
	})

});		

Code Highlighter

We are using highlight.js to prettify all of the code blocks displayed throughout the website. Using highlight.js is simple: we just had to download a copy of its JavaScript and CSS files, include these files on each page, and then place the code content within <pre> <code> tags.

<link rel="stylesheet" href="/path/to/styles/default.css">
<script src="/path/to/highlight.min.js"></script>
<script>hljs.highlightAll();</script>
		

Image Lightbox

To display images in-line with the text, but include an option to click and view a larger version, we are using Magnific Popup. This script uses jQuery to provide responsive interactivity.

Magnific Popup required us to download a copy of its JavaScript and link to it from each page. Then, to include an image lightbox, we just have to add the data-lightbox attribute to the link that wraps the image.


<a data-lightbox="gallery-name-goes-here" data-title="The Image Caption Goes Here" href="<?php echo HTTP_URL . '/images/image-name.png';?>"><img src="<?php echo HTTP_URL . '/images/image-name.png';?>" alt="The image alt-text for accessibility goes here" class="screenshot"></a>

		