vsupalov

Analyzing my Spotify Music Library With Jupyter And a Bit of Pandas

Or the quest to clean up my saved songs so I can play everything on random.

Are you a Spotify user? By now, that’s the way I consume most of my music.

Recently, I have been developing a web app which uses the Spotify API, in part for a series of posts about developing a small real-life application and web development in general. The app is written in Python and Flask, and displays full-sized album art for the songs which are playing on a user’s Spotify account.

While playing around with the Spotify Web API and building a login flow in the app, it was pretty easy to get an access token for my account with all kinds of permissions for access to my data. Among other things, that token is good for everything needed to analyze the heck out of your whole music library - information about songs and albums in particular.

The Motive

My music library on Spotify is quite big. What I like to do is take all of my “saved” songs and put them on shuffle play. I’m not sure the library is meant for such use - the way songs end up being “saved” gets a bit in the way.

To listen to songs in offline mode, you can “download” them on a particular device. Doing so also “saves” them, thus adding the complete album to your music library. You can get around that with playlists, but that’s a few more clicks.

As I’m in the habit of giving complete albums a listen every once in a while, and forgetting about the stuff I download-saved, my music library is now a bit messy. This leads to songs playing which are not as enjoyable without the album context. Oh the pain.

The Cure

Having the amazing data crunching powers of the aforementioned TOKEN, I thought it’d be cool to take a look at which albums I have in my library, maybe clean up obvious mis-saves, and also see what stuff I really seem to like. Here’s the plan:

  • Get data on all songs in my library (especially which album they belong to)
  • Get data for each album referenced (total number of songs, how many of those are in my library)
  • Do something with the information (aka cleanup)

There is more one could do with the data, for example handcrafting a simple artist/album recommendation script of one’s own, but that’s for another time.

Before we Proceed

A word of caution:

The code below was never meant to be reliable or presentable. That doesn’t mean it’s bad in general. I wrote it to satisfy my own curiosity, with only one person in mind who would ever see it: yours truly. The samples are not as polished as other specimens you will find elsewhere, and might not be the best learning resource regarding best practices.

That said, it works, does what it’s supposed to do given the context and priorities, and probably looks very similar to many real-world projects, where results and time constraints are significant factors.

Alright! Let’s jump into it!

The Setup Step by Step

Here’s what I usually do when starting an exploratory data project with Python.

When tinkering with data years ago, I used to rely on running tiny Python scripts and saving intermediate results. Then I got introduced to IPython notebooks and never looked back. By now, Jupyter notebooks are what you should use. They are the perfect tool for working with Python and data when you’re trying stuff out and playing with an early-stage data project.

With Python projects, I like setting up a virtual environment for each one. This makes it easy to isolate dependencies, install EXACTLY what is needed in the version which is needed, and make sure that the code runs out of the box.

So I went ahead and created a new virtual environment using virtualenvwrapper. It’s really convenient. If you don’t know it and like Python, check it out! In the following, lines starting with $ are what happens in a terminal, while everything else is Python code.

$ mkvirtualenv spotify

Jupyter is a python module, and can be installed using pip:

$ pip install jupyter

For data-tastic Python fun, I usually install a few modules by default, to make sure I can do basic data crunching, plotting and http requests without much effort. Here is my choice of tools for all of the above:

  • requests - A great library for talking to web APIs and fetching single web pages.
  • furl - For creating URLs and requests in general with parameters for GET / POST stuff.
  • pandas - Makes Python feel a bit like R, namely able to work with data frames and crunch data without getting too verbose about it.
  • matplotlib - To give pandas plotting-superpowers.

Here’s how you’d install it all:

$ pip install pandas matplotlib requests furl

In this investigation, I did not really use pandas or matplotlib much, apart from a tiny diversion. Don’t be discouraged if they appear to be of little value here; they shine when the data handling gets more challenging.

Once everything was installed, I went into a new directory created for the project and started the Jupyter notebook server with:

$ jupyter notebook

This also opens the Jupyter web interface in a new tab in your browser. All that was left to do at this point was to create a new notebook, give it a name and start the fun.

Hello API

The code samples below are what I typed into the Jupyter notebook cells. They depend on each other and use previously defined variables as you would in a single script, but can be re-executed as needed. It’s perfect for experimenting and getting things to work without enduring unnecessary waiting time.

But anyway, to access the Spotify Web API, we need an API access token. You could use the big album art project in development mode. You’ll need a private Spotify app; insert its tokens as environment variables, add a “print” statement in the home view, run the app locally and you’ll be set. Don’t forget to adjust the permission request (see a few lines below).

The permission scopes we’ll need are:

"user-read-private user-read-email user-read-playback-state user-read-currently-playing user-library-read"

I just made the app print the token value, and copied it into the first cell of the notebook. The token expires after a while, so you might need to refresh it when working on the code.

TOKEN = "???"
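Pasting the token into a cell works, but it also ends up saved in the notebook file. A minimal sketch of an alternative, assuming you export the token in the shell before starting Jupyter. The variable name SPOTIFY_TOKEN and the helper load_token are my own inventions, not anything Spotify prescribes:

```python
import os


def load_token(env=os.environ, name="SPOTIFY_TOKEN"):
    """Return the access token stored under `name`, or fail loudly.

    `name` is a made-up environment variable name; pick whatever you like.
    """
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError("Set {} before starting the notebook".format(name))
    return token


# In the first notebook cell this would replace the pasted value:
# TOKEN = load_token()
```

Since the token expires after a while, refreshing it then means re-exporting the variable and re-running that one cell.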

I like to put constants at the beginning of a project and write their names in UPPER_CASE_NOTATION. We will need a few Python modules for convenience - the ones we installed previously. Let’s import them:

import json
import requests
from furl import furl
from math import ceil

# to save some typing
import pandas as pd
import matplotlib

# to display plots in the notebook
%matplotlib inline
import matplotlib.pyplot as plt

The code above also makes it possible for plots to appear inline in the notebook, which looks nice. We’re ready to get the data!

The First Request

Now that we have the token and tools in place, it’s really simple to get the first bit of information.

As we want to get the tracks of our user, we’ll go ahead and take a peek at how many there are. The API is well documented, and can be browsed here. The tracks endpoint is what we need. The response is paginated, but for the field we need this does not matter. The total number of songs is available under the “total” key.

url = "https://api.spotify.com/v1/me/tracks"
headers = {'Authorization': "Bearer {}".format(TOKEN)}
r = requests.get(url, headers=headers)
parsed = json.loads(r.text)

count_songs = parsed["total"]
print "Total number of songs: {}".format(count_songs)

We make the request using requests, add a header with the authentication info (as described in the docs), parse the response as JSON and - without checking for errors - store the “total” field in a variable. If the request goes wrong for some reason, we will get an exception in the notebook and should be able to fix it and rerun the cell. So no fancy edge-case handling is needed.
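If you do want the failure to be a bit more explicit (an expired token comes back as an error response, which then surfaces as a confusing KeyError on "total"), a small helper is enough. This is only a sketch: get_json is a hypothetical name, and the session argument is assumed to behave like requests.Session or the requests module itself. The methods raise_for_status() and json() are the ones requests provides on its response objects.

```python
def get_json(session, url, token):
    """GET `url` with a bearer token and return the decoded JSON body.

    `session` needs a .get(url, headers=...) method, like requests.Session
    or the requests module itself.
    """
    headers = {"Authorization": "Bearer {}".format(token)}
    response = session.get(url, headers=headers)
    response.raise_for_status()  # raises on 4xx / 5xx, e.g. an expired token
    return response.json()


# In the notebook this could be used as:
#   parsed = get_json(requests, "https://api.spotify.com/v1/me/tracks", TOKEN)
#   count_songs = parsed["total"]
```

It also saves repeating the header dictionary in every cell.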

Apparently I have 2416 songs in the library.

Getting Track Data

User libraries can be quite large. Thousands of songs. It would be unnecessary and at some point impractical to return all songs on every single request to the tracks endpoint. That’s why it is paginated. We need all of the pages, of course.

We got the number of songs, and the first “page” of songs in the user library. The maximum number of songs returned per request is 50, if you ask for it with the limit parameter. So all we need is the total number and a for loop.

Once again, as this code is pretty unlikely to fail and can be re-executed if needed, we don’t care about crashing if something goes wrong.

# paginate over all tracks
all_songs = []
for i in range(int(ceil(count_songs/50.0))):
    offset = 50*i
    url = "https://api.spotify.com/v1/me/tracks?limit=50&offset={}".format(offset)
    headers = {'Authorization': "Bearer {}".format(TOKEN)}
    r = requests.get(url, headers=headers)
    parsed = json.loads(r.text)

    all_songs.extend(parsed["items"])
print "Number of gathered songs: {}".format(len(all_songs))

The printed number equals the one above. Neato. If any of the requests had failed, we’d have noticed. Now we have the complete user library of songs. Imagine the possibilities.
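An alternative to computing offsets by hand: each page of the response also carries a "next" field with the URL of the following page, which is null on the last one. Here is a sketch of a loop that just follows those links. Both fetch_all_items and the fetch_page callable are hypothetical helpers of mine; fetch_page stands for "do the request and return the parsed JSON".

```python
def fetch_all_items(fetch_page, first_url):
    """Collect `items` from every page by following the `next` links.

    `fetch_page` is any callable that takes a URL and returns the parsed
    JSON of one page, for example a thin wrapper around requests.
    """
    items = []
    url = first_url
    while url:  # `next` is null (None in Python) on the last page
        page = fetch_page(url)
        items.extend(page["items"])
        url = page.get("next")
    return items


# Possible use in the notebook, with the same headers as before:
#   fetch = lambda u: json.loads(requests.get(u, headers=headers).text)
#   all_songs = fetch_all_items(
#       fetch, "https://api.spotify.com/v1/me/tracks?limit=50")
```

This version does not need the total count up front, so the first request and the loop collapse into one step.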

How Many Albums Are Referenced?

Staying on track. We’d like to get the albums which are referenced by the songs. Unfortunately, the track data does not have everything we need - there’s only the album id and a bit of other info. We’ll need to get detailed data on each relevant album, especially the number of tracks each has in total.

Using a Python set, we can get a unique list of all album ids which we will need.

album_ids = set()

for song in all_songs:
    album_id = song["track"]["album"]["id"]
    album_ids.add(album_id)
    
print "Number of albums: {}".format(len(album_ids))

For me that prints 1307 as the number of albums. Roughly half of my library song count. Huh. Who would have thought.
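The set only tells us which albums exist. If you also want to know how many saved songs point at each album right away, collections.Counter does both jobs in one pass. This is a sketch on made-up sample entries that mimic the track structure used above; count_songs_per_album is a hypothetical helper name.

```python
from collections import Counter


def count_songs_per_album(songs):
    """Map album id -> number of saved songs that reference it."""
    return Counter(song["track"]["album"]["id"] for song in songs)


# Made-up entries with the same nesting as the items in all_songs
sample_songs = [
    {"track": {"album": {"id": "album-a"}}},
    {"track": {"album": {"id": "album-a"}}},
    {"track": {"album": {"id": "album-b"}}},
]

per_album = count_songs_per_album(sample_songs)
print(len(per_album))            # number of distinct albums
print(per_album.most_common(1))  # the album with the most saved songs
```

The keys of the counter are exactly the album ids from the set above.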

Lots of Requests Later

With the album ids at hand, we can proceed. I usually get the raw data and produce derived datasets in later steps. This way we can go back and use fields which we ignored at first. This also suits the iterative superpowers of Jupyter - single computation steps can be reexecuted in isolation, not requiring the previous ones to run again.

This part is anything but elegant. In fact it’s a bit rude. We could request multiple album ids at once and only use 1/20th of the current requests. Also, it would be faster if we watched the rate limits and the Retry-After header. Once the waiting times get long enough, or if there were multiple users, I’d consider being more polite and less lazy.

Running this takes a few minutes.

# gather information on all albums
album_info_by_id = {}

from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

# silly solution, but good enough
# could handle the header entry 'Retry-After' for better manners
# https://developer.spotify.com/web-api/user-guide/#rate-limiting
s = requests.Session()
retries = Retry(total=3000, backoff_factor=1, status_forcelist=[ 502, 503, 504, 429 ])
s.mount('https://', HTTPAdapter(max_retries=retries))

for album_id in album_ids:
    url = "https://api.spotify.com/v1/albums/{}".format(album_id)
    headers = {'Authorization': "Bearer {}".format(TOKEN)}
    r = s.get(url, headers=headers)
    
    #TODO: check for error apart from dumb retries?
    
    parsed = json.loads(r.text)    
    
    album_info = parsed

    #TODO: sanity checks?
    try:
        a = album_info["tracks"]
    except:
        print "Entry seems wrong. Fix it:"
        print album_info
        break
    
    album_info_by_id[album_id] = album_info

In this step, we actually care about the data quality, as mistakes might be more subtle than in previous steps. We want to get a snapshot of the album data, and really can’t have the data be incomplete or wrong for some reason. By trying to access the “tracks” field in each item, we are making sure that the data is not formatted in a way we don’t expect. If unexpected stuff happens, we prefer to crash instead of quietly saving garbage.

API-related rate limits could be handled more gracefully, but in our case the number of requests is still relatively small, and silly retries on expected errors are what we roll with. The total number of retries is limited to a still-large-but-not-huge number. Just in case.
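For the record, the better-mannered version is not much code either. When the API answers with status 429, the Retry-After header says how many seconds to wait. A sketch under a few assumptions: get_with_retry_after is a hypothetical helper, session behaves like requests.Session, and max_attempts=5 is an arbitrary choice of mine.

```python
import time


def get_with_retry_after(session, url, headers, max_attempts=5, sleep=time.sleep):
    """GET `url`, waiting as long as the server asks whenever it answers 429."""
    for _ in range(max_attempts):
        response = session.get(url, headers=headers)
        if response.status_code != 429:
            return response
        # Retry-After holds the number of seconds to wait; fall back to 1
        wait_seconds = int(response.headers.get("Retry-After", 1))
        sleep(wait_seconds)
    raise RuntimeError("Still rate limited after {} attempts".format(max_attempts))


# Possible use inside the album loop:
#   r = get_with_retry_after(s, url, headers)
```

The sleep function is a parameter only so the waiting can be swapped out when trying the helper without a network.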

Obviously, we really don’t want to be so lazy for really large requests or with many users.
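The "1/20th of the requests" idea mentioned earlier could look roughly like this. As far as I read the API docs, the albums endpoint also accepts up to 20 comma-separated ids in an "ids" query parameter and returns the results as a list under an "albums" key. The helper names chunked and several_albums_urls are hypothetical.

```python
def chunked(ids, size=20):
    """Yield lists of at most `size` ids."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def several_albums_urls(album_ids, size=20):
    """Build one multi-album request URL per chunk of ids."""
    base = "https://api.spotify.com/v1/albums?ids={}"
    return [base.format(",".join(chunk)) for chunk in chunked(album_ids, size)]


# Possible use in the notebook (the response layout is an assumption
# based on the API docs):
#   for url in several_albums_urls(album_ids):
#       parsed = json.loads(s.get(url, headers=headers).text)
#       for album_info in parsed["albums"]:
#           album_info_by_id[album_info["id"]] = album_info
```

For my 1307 albums, that would mean 66 requests instead of 1307.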

Putting it All Together

After the previous chunk of code ran through without complaining, we have everything we need in our notebook for more elaborate tinkering. No more API requests are needed and we can work with the data we have. Neat-o!

This is a nice time to create a checkpoint :)
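A notebook checkpoint saves the cells, not the variables living in memory, so after a kernel restart all those requests would have to run again. A minimal sketch of dumping the raw data to disk instead, using plain JSON. The function names and the file name in the usage comment are made up.

```python
import json


def save_checkpoint(path, all_songs, album_info_by_id):
    """Write the raw API data to a JSON file."""
    with open(path, "w") as f:
        json.dump({"songs": all_songs, "albums": album_info_by_id}, f)


def load_checkpoint(path):
    """Read the raw API data back as (all_songs, album_info_by_id)."""
    with open(path) as f:
        data = json.load(f)
    return data["songs"], data["albums"]


# Possible use in the notebook:
#   save_checkpoint("spotify_raw.json", all_songs, album_info_by_id)
#   all_songs, album_info_by_id = load_checkpoint("spotify_raw.json")
```

The album ids are not stored separately, as they can be rebuilt from the songs in a second.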

How do we handle the raw data? I really like to put convenient wrapper classes around the data entries. This way, we can perform basic tasks without giving them too much thought and keep the code readable.

For accessing album data and computing simple derived values, I created an “AlbumBin” class. It will provide information on all of the user’s tracks which are related to an individual album and accessors to raw album metainformation.

You can read up on the data we are working with in the albums API endpoint docs.

class AlbumBin:
    def __init__(self, album_id, album_info):
        self.album_id = album_id
        self.album_info = album_info
        self.my_tracks = []

    def add(self, song):
        self.my_tracks.append(song)

    def my_track_count(self):
        return len(self.my_tracks)

    def total_track_count(self):
        return len(self.album_info["tracks"]["items"])
    
    def get_completeness_ratio(self):
        return (1.0 * self.my_track_count() /  self.total_track_count())
    
    def get_name(self):
        return self.album_info["name"]    
    
    def get_artists(self):
        """ comma separated list of artists for pretty printing"""
        return ", ".join(map(lambda x: x["name"], self.album_info["artists"]))

So much for the data class. It can tell us how “complete” an album is (the ratio of its songs which are saved in the library), and offers straightforward ways of getting the name of the album as well as the artists without caring about the underlying raw data structure. Now, we can create a binning of tracks (aka put each track into the bin of its album).

binning = {}

# create album bins
for album_id in album_ids:
    album_info = album_info_by_id[album_id]
    
    the_bin = AlbumBin(album_id, album_info)
    
    binning[album_id] = the_bin

# fill album bins with songs
for song in all_songs:
    album_id = song["track"]["album"]["id"]

    the_bin = binning.get(album_id)
    the_bin.add(song)

My naming is on the inconsistent side here - songs and tracks are pretty much interchangeable in my mind it seems.
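One caveat about total_track_count in the AlbumBin class above: it counts the entries of the "items" list. The track list inside an album response is itself a paginated object, so for very long albums "items" may hold only the first page. That object also has a "total" field, which should be the safer number. A sketch of a more defensive variant, written as a standalone function with a made-up example entry:

```python
def total_track_count(album_info):
    """Number of tracks on the album, according to the paging object."""
    tracks = album_info["tracks"]
    # `total` counts all tracks; `items` may hold only the first page
    return tracks.get("total", len(tracks["items"]))


# Made-up album entry: 60 tracks in total, but only 2 items on this page
example_album = {"tracks": {"total": 60, "items": [{"id": "t1"}, {"id": "t2"}]}}
print(total_track_count(example_album))
```

For the albums in my library this makes no practical difference, as the largest one shown below has 32 tracks.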

Printing First Results - Terribly

This first one is a part I’m not proud of. I could have been way lazier, but chose to copy-paste stuff and tweak the numbers instead of thinking for a bit. Originally I just wanted to get a grasp of the almost-completely-saved album counts, but later got interested in the complete distribution. Bear with me, we will handle this better with pandas. But first, here’s the copypaste bit for you to behold:

album_bins = binning.values()

def fi(at_least, at_most):
    """ at_most is non-inclusive """
    return filter(lambda x: at_least <= x.get_completeness_ratio() and at_most > x.get_completeness_ratio(), album_bins)

album_count = len(album_bins)
albums_100_percent = fi(1.0, 9000.0)
albums_90_percent = fi(0.9, 1.0)
albums_80_percent = fi(0.8, 0.9)
albums_70_percent = fi(0.7, 0.8)
albums_60_percent = fi(0.6, 0.7)
albums_50_percent = fi(0.5, 0.6)
albums_40_percent = fi(0.4, 0.5)
albums_30_percent = fi(0.3, 0.4)
albums_20_percent = fi(0.2, 0.3)
albums_10_percent = fi(0.1, 0.2)
albums_0_percent = fi(0.0, 0.1)

print "Album count: {}".format(album_count)
print ""
print "Album count at/over 100%: {}".format(len(albums_100_percent))
print "Album count over 90% but under 100%: {}".format(len(albums_90_percent))
print "Album count over 80% but under 90%: {}".format(len(albums_80_percent))
print "Album count over 70% but under 80%: {}".format(len(albums_70_percent))
print "Album count over 60% but under 70%: {}".format(len(albums_60_percent))
print "Album count over 50% but under 60%: {}".format(len(albums_50_percent))
print "Album count over 40% but under 50%: {}".format(len(albums_40_percent))
print "Album count over 30% but under 40%: {}".format(len(albums_30_percent))
print "Album count over 20% but under 30%: {}".format(len(albums_20_percent))
print "Album count over 10% but under 20%: {}".format(len(albums_10_percent))
print "Album count over 0% but under 10%: {}".format(len(albums_0_percent))

URGH. Research code, right? It did what it was supposed to do, but it’s anything but good. One more function getting a list of arguments would have totally fixed this.

But why. We have a first impression of the “completeness” distribution of saved albums:

Album count: 1307

Album count over 100%: 172
Album count over 90% but under 100%: 5
Album count over 80% but under 90%: 1
Album count over 70% but under 80%: 0
Album count over 60% but under 70%: 2
Album count over 50% but under 60%: 23
Album count over 40% but under 50%: 8
Album count over 30% but under 40%: 25
Album count over 20% but under 30%: 90
Album count over 10% but under 20%: 236
Album count over 0% but under 10%: 745

There are a whopping 172 albums which are at 100%. This can be explained by single-song releases (singles). But I don’t think all of those are. Otherwise, I seem to be a picky listener. The albums at 50% or 30% might be worth revisiting, as they are either well-pruned or have potential to be just my type.
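For completeness, here is roughly what the "one more function" could have looked like. It is a sketch that reproduces the same [lower, upper) buckets as the copy-pasted code, plus one bucket for everything at 100%. It works on a plain list of ratios, and the sample values are made up.

```python
def bucket_counts(ratios, step_percent=10):
    """Count ratios per [lower, upper) percent bucket, plus one for >= 100%."""
    counts = []
    for lower in range(0, 100, step_percent):
        upper = lower + step_percent
        n = sum(1 for r in ratios if lower / 100.0 <= r < upper / 100.0)
        counts.append(("[{}%, {}%)".format(lower, upper), n))
    counts.append(("100%", sum(1 for r in ratios if r >= 1.0)))
    return counts


# Made-up completeness ratios; in the notebook this would be
# [b.get_completeness_ratio() for b in album_bins]
sample_ratios = [0.05, 0.1, 0.5, 0.55, 1.0]

for label, n in bucket_counts(sample_ratios):
    print("Album count in {}: {}".format(label, n))
```

The labels say what the filter actually does: the lower bound is included, the upper one is not.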

Printing and Visualizing Results - a bit better

Let’s pause for a moment, and look at how we could handle the distribution issue with pandas instead of lots of copy-pasted code.

There are many ways to visualize data in Pandas. For helping us understand the distribution, and maybe tell a story, a histogram would be nice. A box plot would make it very easy to understand the distribution a bit better.

Pandas feels really convenient for common data-crunching tasks. Just as R does. The code might look a bit daunting at first, but if you get into the topic, you’ll be comfortable in reading and writing this flavor of Python.

completeness_ratios = map(lambda x: x.get_completeness_ratio(), album_bins)
df = pd.DataFrame(completeness_ratios)
df.describe()

Basically, we create a list of “completeness ratios” for each album, and put it into a “data frame”. The ‘describe’ function outputs useful numbers to understand the distribution of the underlying data.

Results of the *describe* function of album completeness ratios.

Simple, right? However, that’s not very intuitive and takes time to understand. Plots are better for this, and can be generated without much hassle:

ax = df.plot(kind='hist', title ="Album Completeness Histogram", figsize=(15, 10), legend=False, fontsize=12)
ax.set_xlabel("Completeness Percent", fontsize=12)
ax.set_ylabel("Album Count", fontsize=12)
plt.show()

ax = df.plot(kind='box', title ="Album Completeness Distribution Boxplot", figsize=(15, 10), legend=False, fontsize=12)
ax.set_xlabel("All tracks in the user library", fontsize=12)
ax.set_ylabel("Completeness Percent", fontsize=12)
plt.show()

The results are two plots:

A histogram of album completeness ratios.
A box plot of album completeness ratios.

To get a bin overview as in the code above, we can offload a big chunk of the work to pandas.

# originally I used
#bins = list(range(0,101,10))
# but I wanted 100 to be a special case, so:
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 99, 100]

bin_labels = []
for i in range(len(bins)-1):
    bin_labels.append("({}, {}]".format(bins[i], bins[i+1]))

# convert the percent edges to ratios, once, after the labels are built
bins = map(lambda x: x/100.0, bins)

# results in bins (0, 10], so 0 is not included, but 10 is
# this is different than our handling above, with [0, 10)
# and results in different numbers

df["bins"] = pd.cut(df[0], bins, labels=bin_labels)
print df["bins"].groupby(df["bins"]).count()

The output looks as you would expect:

Completeness percent based bins using pandas.
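The pandas bins above are (a, b] while the hand-rolled buckets were [a, b), which is why the numbers differ. If you want pandas to reproduce the first variant, pd.cut takes right=False to close the bins on the left instead. A sketch on made-up ratios; the extra infinite edge is my own trick to catch the albums at exactly 100%, mirroring the fi(1.0, 9000.0) call from before.

```python
import pandas as pd

# Made-up completeness ratios; in the notebook this would be df[0]
ratios = pd.Series([0.05, 0.1, 0.5, 0.55, 1.0])

# Edges 0.0, 0.1, ..., 1.0 plus an open-ended last bin for "100%"
edges = [i / 10.0 for i in range(11)] + [float("inf")]
labels = ["[{}%, {}%)".format(10 * i, 10 * (i + 1)) for i in range(10)] + ["100%"]

# right=False makes every bin closed on the left and open on the right
binned = pd.cut(ratios, bins=edges, labels=labels, right=False)
counts = binned.value_counts(sort=False)
print(counts)
```

With sort=False the counts should come out in bin order rather than sorted by size, which reads more like a histogram.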

Alright, back on track.

List of Albums Which Are Completely Saved - Ready For Pruning

With our list of albums and an easy way to compute completeness ratios, we can just filter the albums for every entry which is

  • Not from a one-song album
  • Completely saved in the library

With a bit of sorting based on track count, we get the information I was interested in originally:

albs = fi(1.0, 9000.0)
albs_big = filter(lambda x: x.total_track_count()!=1, albs)

albs_big.sort(key=lambda x: -x.total_track_count())

for a in albs_big:
    print u"[{tracks_total}] '{name}' - {artists}".format(
        name=a.get_name(),
        tracks_total=a.total_track_count(),
        artists=a.get_artists())

The top part of the results (of 48 non-single-albums):

[32] 'Zirkus Zeitgeist (Deluxe)' - Saltatio Mortis
[24] 'Meet The EELS: Essential EELS 1996-2006 Vol. 1' - Eels
[23] 'Подмосковные вечера (Имена на все времена)' - Владимир Трошин
[21] 'Great Divide (Deluxe Version)' - Twin Atlantic
[20] 'Opposites (Deluxe)' - Biffy Clyro
[20] 'Holy Wood [International Version (UK)]' - Marilyn Manson
[18] 'Come What[ever] May' - Stone Sour

That’s way more than I expected. And among the very first results are some entries I’d rather not keep around in their entirety. Hooray! That’s all the actionable information I needed to begin pruning.
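A note in case you run this on Python 3: there, filter returns an iterator, so calling .sort() or len() on its result fails. A list comprehension plus sorted() does the same job on both versions. The sketch below uses a small stand-in tuple instead of the AlbumBin class to stay self-contained; the Album type and its fields are hypothetical, and only the first sample entry comes from my results above.

```python
from collections import namedtuple

# Stand-in for AlbumBin with precomputed counts (for illustration only)
Album = namedtuple("Album", "name my_tracks total_tracks")


def complete_multi_track_albums(albums):
    """Fully saved albums with more than one track, largest first."""
    complete = [a for a in albums
                if a.total_tracks > 1 and a.my_tracks >= a.total_tracks]
    return sorted(complete, key=lambda a: a.total_tracks, reverse=True)


sample_albums = [
    Album("Some Single", 1, 1),                # made up
    Album("Half Saved Album", 5, 10),          # made up
    Album("Come What[ever] May", 18, 18),
    Album("Another Complete Album", 20, 20),   # made up
]

for a in complete_multi_track_albums(sample_albums):
    print("[{}] '{}'".format(a.total_tracks, a.name))
```

Singles and partially saved albums drop out, the rest is ordered by track count.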

Just How Many Songs From Complete Albums Are in my Library?

Easy to answer. Interesting to know.

count_of_songs_on_albums = sum(map(lambda x: x.my_track_count(), albs_big))
print "Songs from 'complete' albums: {}".format(count_of_songs_on_albums)
print "Total number of songs: {}".format(count_songs)

percentage = 100.0*count_of_songs_on_albums/count_songs
print "Percent: {:.2f}%".format(percentage)

Turns out, it’s quite a lot:

Songs from 'complete' albums: 652
Total number of songs: 2416
Percent: 26.99%

Over one fourth. Alright, that’s why I started noticing. That took a while.

Conclusion

After a bit of coding, I got all the answers I was interested in. My Spotify library can use a bit of cleaning up, and I’m really looking forward to improving my all-random listening enjoyment.

Python spiced with Jupyter and other helpful modules is great for tinkering with data. Requests + Pandas are amazing libraries, which you should add to your arsenal if you’re working with data. Pandas did not really shine here; I think it deserves its own article with a few harder questions to justify its use and demonstrate its full potential. Do you have an idea about an interesting way to look at the data, or a question you’d be interested in answering? Just drop me a mail :) Also, the Spotify API is cool and well documented.

Thanks a lot for reading. I’d be thrilled to hear from you! If you have any questions, remarks or cool projects in mind just write me a mail.
