As music lovers ourselves, we set out to build a music preference classifier that uses machine learning
to adapt to each user’s personal taste. We started the project with a single user’s Spotify music dataset
from Kaggle (“Spotify Song Attributes”), using data visualization to understand which types of music the
user likes and to select the input attributes for the modeling stage. For data preparation, we used
Principal Component Analysis to turn a set of correlated variables into a set of linearly uncorrelated
variables. From there, we built ten machine learning models and compared their accuracy. We chose the
three best-performing models to construct the combined predictor of music preferences, namely
K-nearest neighbor, Support Vector Machine and
Data Visualization and Model Building
1) Innovative Algorithm:
The music filtering method Spotify currently uses is called Collaborative Filtering. Although this
algorithm is widely adopted on the Internet, it is not perfect. One bias it introduces on Spotify in
particular is that popular songs are generally more likely to be recommended to users than
non-mainstream ones. To minimize this bias and make recommendations more personalized, we aim to
build a music classifier that is based on the music itself, independent of the choices of any other users.
The Spotify API describes each song with 16 features: duration, acousticness, danceability,
speechiness, loudness, energy, valence, etc. To understand how the attributes relate to one another and
how they affect whether the user likes a song, we cleaned the data and chose 13 of the 16 attributes of
each song to run an OLS regression. Based on the resulting t-statistics and p-values, we observed
multiple statistically significant attributes. We then divided the original dataset into two groups
according to the user’s preference, plotted the probability density function (PDF) of each attribute, and
found distinct differences between the two groups. Finally, a pairplot of these attributes again indicated
that the attributes are strongly related in a nonlinear way. Hence, we need more sophisticated models
to understand and predict a user’s preference for each song.
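The OLS step above can be sketched as follows. This is a minimal illustration, not the project's actual code: it uses NumPy only, and the data are synthetic stand-ins (the column names `danceability` and `loudness` are borrowed from the Spotify feature list, but the values and the assumed relationship to the like/dislike label are invented for the example).

```python
import numpy as np

def ols_t_stats(X, y):
    """Fit y = b0 + X @ b by ordinary least squares.

    Returns the coefficients (intercept first) and their t-statistics.
    """
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])          # add an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares solution
    residuals = y - A @ beta
    dof = n - A.shape[1]                          # residual degrees of freedom
    sigma2 = residuals @ residuals / dof          # unbiased noise variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)         # covariance of the estimates
    std_err = np.sqrt(np.diag(cov))
    return beta, beta / std_err

# Synthetic stand-in data (assumption): the user tends to like danceable songs,
# while loudness has no effect on the label.
rng = np.random.default_rng(0)
n_songs = 500
danceability = rng.uniform(0, 1, n_songs)
loudness = rng.normal(-8, 3, n_songs)
liked = (rng.uniform(0, 1, n_songs) < 0.2 + 0.6 * danceability).astype(float)

X = np.column_stack([danceability, loudness])
beta, t_stats = ols_t_stats(X, liked)
for name, b, t in zip(["intercept", "danceability", "loudness"], beta, t_stats):
    print(f"{name:>12}: coef={b:+.3f}  t={t:+.2f}")
```

An attribute whose t-statistic is large in absolute value (roughly above 2) would be a candidate for the statistically significant attributes mentioned above.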
2) Self-studying Machine learning models:
Before setting up the models, we first performed feature scaling to reduce bias in preparation for
Principal Component Analysis. We then applied Principal Component Analysis to reduce the number of
dimensions and improve training efficiency. We built 10 machine learning models, seven of which were
not covered in class, such as SVM and Neural Network. The largest challenge we encountered was
building the Neural Network. To construct a model that meets our requirements, we decided to build the
Neural Network entirely on our own so that we could customize its parameters, including code for the
sigmoid function, forward propagation, backward propagation, etc. We also addressed potential
overfitting by incorporating regularization terms into the cost function used for gradient descent.
Furthermore, to make our models more tailored to our specific questions, we applied algorithms such as
grid search to tune our models by
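The hand-built Neural Network described above might look roughly like the sketch below. It is an assumption-laden illustration rather than the project's implementation: one hidden layer, sigmoid activations, a cross-entropy cost with an L2 regularization term, and plain gradient descent. The layer size, learning rate, regularization strength `lam`, and the synthetic two-feature dataset are all chosen only for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """Forward propagation; returns hidden and output activations."""
    W1, b1, W2, b2 = params
    A1 = sigmoid(X @ W1 + b1)   # hidden layer, shape (n, hidden)
    A2 = sigmoid(A1 @ W2 + b2)  # predicted P(like), shape (n, 1)
    return A1, A2

def cost(X, y, params, lam):
    """Mean cross-entropy plus an L2 penalty on the weights (not the biases)."""
    W1, _, W2, _ = params
    _, A2 = forward(X, params)
    n = X.shape[0]
    eps = 1e-12  # guards against log(0)
    cross_entropy = -np.mean(y * np.log(A2 + eps) + (1 - y) * np.log(1 - A2 + eps))
    return cross_entropy + lam / (2 * n) * (np.sum(W1 ** 2) + np.sum(W2 ** 2))

def backward(X, y, params, lam):
    """Backward propagation; returns gradients of the regularized cost."""
    W1, _, W2, _ = params
    n = X.shape[0]
    A1, A2 = forward(X, params)
    dZ2 = (A2 - y) / n                  # cross-entropy + sigmoid output
    dW2 = A1.T @ dZ2 + lam / n * W2
    db2 = dZ2.sum(axis=0)
    dZ1 = (dZ2 @ W2.T) * A1 * (1 - A1)  # chain rule through the hidden sigmoid
    dW1 = X.T @ dZ1 + lam / n * W1
    db1 = dZ1.sum(axis=0)
    return dW1, db1, dW2, db2

def train(X, y, hidden=8, lam=0.1, lr=0.5, epochs=2000, seed=0):
    """Gradient descent; returns the trained parameters and the cost history."""
    rng = np.random.default_rng(seed)
    params = [rng.normal(0, 0.5, (X.shape[1], hidden)), np.zeros(hidden),
              rng.normal(0, 0.5, (hidden, 1)), np.zeros(1)]
    history = []
    for _ in range(epochs):
        history.append(cost(X, y, params, lam))
        grads = backward(X, y, params, lam)
        params = [p - lr * g for p, g in zip(params, grads)]
    history.append(cost(X, y, params, lam))
    return params, history

# Synthetic, nonlinearly separable stand-in data (assumption): label 1 inside a circle.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (np.sum(X ** 2, axis=1) < 1.5).astype(float).reshape(-1, 1)

params, history = train(X, y)
accuracy = np.mean((forward(X, params)[1] > 0.5) == (y > 0.5))
print(f"cost: {history[0]:.3f} -> {history[-1]:.3f}, training accuracy: {accuracy:.2f}")
```

Increasing `lam` shrinks the weights more strongly, trading training accuracy for less overfitting; a grid search would simply repeat `train` over a set of candidate values such as `lam` and `hidden` and keep the combination with the best validation score.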
3) General Suitability:
We did not want to stop at a model that predicts the music preference of a single user, so we made the
algorithm more practical by turning it into a commercial prototype with two extra modules. One module
asks users for the Spotify IDs of several playlists of songs they like and dislike, and automatically turns
these playlists into dataframes containing the 16 attributes of each song using the Spotify API. The
information users provide in this step is used to train our models and tune their parameters to predict
which songs they will like. The other module takes in any new playlist that users are interested in and
uses the trained model to recommend songs from that playlist that users may like, in just a few seconds.
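The two modules can be sketched as a train step and a recommend step. The sketch below is hypothetical throughout: it skips the Spotify API entirely and assumes the song attributes have already been fetched into NumPy arrays, and it stands in for the trained model with feature scaling, PCA, and a K-nearest-neighbor vote. `PreferenceModel`, `recommend`, and the three-attribute synthetic songs are names and data invented for illustration.

```python
import numpy as np

class PreferenceModel:
    """Hypothetical stand-in for the model shared by both modules."""

    def __init__(self, n_components=2, k=5):
        self.n_components = n_components
        self.k = k

    def fit(self, liked, disliked):
        """Module 1: train on attribute rows from liked and disliked playlists."""
        X = np.vstack([liked, disliked])
        self.labels_ = np.concatenate([np.ones(len(liked)), np.zeros(len(disliked))])
        # Feature scaling: zero mean and unit variance per attribute.
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0) + 1e-12
        Z = (X - self.mean_) / self.std_
        # PCA via SVD of the centered data: rows of Vt are principal directions.
        _, _, Vt = np.linalg.svd(Z, full_matrices=False)
        self.components_ = Vt[: self.n_components].T
        self.train_ = Z @ self.components_
        return self

    def predict(self, X):
        """Return True for songs whose k nearest training songs are mostly liked."""
        P = ((X - self.mean_) / self.std_) @ self.components_
        dists = np.linalg.norm(P[:, None, :] - self.train_[None, :, :], axis=2)
        nearest = np.argsort(dists, axis=1)[:, : self.k]
        return self.labels_[nearest].mean(axis=1) > 0.5

def recommend(model, song_names, song_features):
    """Module 2: keep the songs from a new playlist that the model predicts as liked."""
    mask = model.predict(song_features)
    return [name for name, keep in zip(song_names, mask) if keep]

# Synthetic attribute rows (assumption): columns could be danceability, energy, valence.
rng = np.random.default_rng(0)
liked = rng.normal([0.8, 0.8, 0.5], 0.05, size=(50, 3))
disliked = rng.normal([0.2, 0.2, 0.5], 0.05, size=(50, 3))
model = PreferenceModel().fit(liked, disliked)

new_names = ["song_a", "song_b"]
new_features = np.array([[0.85, 0.75, 0.5], [0.15, 0.25, 0.5]])
recommended = recommend(model, new_names, new_features)
print(recommended)
```

Keeping the scaling statistics and principal directions inside the fitted model means a new playlist is projected with exactly the transformation learned from the user's own playlists.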
4) Successful Validation:
To validate the results, we tested our prototype on our group members. As input, each of us compiled
two lists of songs: one of songs we liked and one of songs we disliked. Each of us contributed about
1400 songs on average. After obtaining the attributes of these songs from the Spotify API, we trained
the model and obtained accuracies of over 70%, with some reaching 90%.
||Aug 08, 2019
||Feb 22, 2021