How I created a classifier to determine the potential popularity of a song.

A tool that helps musicians succeed.

In my experience, many musicians, whether bands or solo artists, never reach the popularity they hope for, while a few dominate the music industry. I therefore set out to build a system that lets any music enthusiast who wants to showcase their talent through a song assess how popular that song is likely to be with an audience.

Popularity is tied to a rating that depends on audience preference. There are three rating classes: excellent, moderate and poor. A higher rating (excellent) suggests popularity similar to that of other highly rated songs, a poor rating suggests the opposite, and the moderate class holds the tracks in between.

TensorFlow: a popular machine learning library

pandas: a data handling and manipulation library for Python

librosa: a music and audio analysis library

scikit-learn: another machine learning library (with a high-level API)

NumPy: a handy tool for performing matrix (2D array) operations efficiently

os: handles file system related operations

CatBoost: a library that facilitates the implementation of boosted trees with less over-fitting

XGBoost: a gradient boosting library

We need a dataset to work with, and we have to build it by hand. I use Sinhala songs because Sinhala is my native language and because little research has been done on them. I could not find an existing dataset of Sinhala songs, hence the need to create one. This is a classification task with a supervised learning approach. The primary classifier is a neural network, specifically a multi-layer neural network implemented in TensorFlow. The results section compares it with several other techniques.

While developing the solution we considered several approaches in order to choose the best one. Their performance is compared in the results section. The following implementations were considered:

1. Vanilla neural network (multi-layer neural network)

2. Ensemble technique (random forest)

3. Boosting (XGBoost, CatBoost)

4. Stacking (2 base learners, 1 meta learner)
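To make the stacking idea concrete, here is a minimal sketch using scikit-learn's StackingClassifier. The post does not say which base learners or meta learner were used, so the random forest, k-nearest-neighbours and logistic regression below are assumptions, and the data is synthetic rather than the Sinhala song dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the song features: 3 classes, like the three ratings.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Two base learners (an assumption; the post does not name them).
base_learners = [
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# One meta learner, trained on cross-validated predictions of the base learners.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
stack_accuracy = stack.score(X_test, y_test)
print(f"stacking accuracy: {stack_accuracy:.2f}")
```

The meta learner never sees the raw features here, only the base learners' predictions, which is what separates stacking from simply averaging models.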

Creating the dataset

We extract three meaningful features from each song in a repository of more than 8,000 tracks using librosa. The three features are tempo (beats per minute), Mel-frequency cepstral coefficients (MFCCs, which mimic some aspects of human speech production and perception) and the harmonic element (the harmonic component of the audio signal). We use these three because they are considered high-level features of music, and high-level features are considered stronger determinants of audience preference since they capture the characteristics listeners value most.

Next we need to label the dataset. For this we use K-means clustering to group the data points into three clusters, one per rating class. The assumption is that songs with similar characteristics produce feature values that lie close together, so songs with a similar rating sit at similar distances from the cluster centres and fall into the same cluster. Once the labels were determined, the features and labels were merged to create the dataset.

K-means assigns each of the three clusters an arbitrary whole number between 0 and the number of clusters minus 1. The labels 0, 1 and 2 only distinguish the clusters and carry no numeric meaning. So to find out which label corresponds to excellent, moderate and poor, you have to define success from your own perspective, because musical taste is subjective and varies from person to person. For instance, to assess my song relative to what I see as a popular song, I first choose three songs that I perceive as excellent, moderate and poor, extract their features, give them to the system and obtain a label for each. Now that I know what the labels mean from my perspective, I can give my own creation to the system, obtain its label and see where I stand.
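The extraction and labelling steps can be sketched roughly as follows. How the MFCC matrix and the harmonic signal were reduced to one value per song is not stated in the post, so taking their means is an assumption, as are the function names extract_features and label_by_clustering. The clustering part runs on synthetic numbers so it can be tried without any audio files.

```python
import numpy as np
from sklearn.cluster import KMeans


def extract_features(path):
    """Return [tempo, mean MFCC, mean harmonic magnitude] for one audio file.

    Reducing the MFCCs and the harmonic signal to a single mean each is an
    assumption made for this sketch.
    """
    import librosa  # imported lazily so the clustering part runs without it

    y, sr = librosa.load(path)
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr)
    harmonic = librosa.effects.harmonic(y)
    return np.array([
        float(np.atleast_1d(tempo)[0]),      # beats per minute
        float(np.mean(mfcc)),
        float(np.mean(np.abs(harmonic))),
    ])


def label_by_clustering(features, n_clusters=3, seed=0):
    """Cluster the songs with K-means and return one integer label per song."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return kmeans.fit_predict(features)


# Synthetic stand-in for the extracted feature matrix (one row per song).
rng = np.random.default_rng(0)
centers = np.array([[80.0, -20.0, 0.02], [120.0, -5.0, 0.05], [160.0, 10.0, 0.08]])
features = np.vstack(
    [c + rng.normal(scale=[2.0, 1.0, 0.002], size=(50, 3)) for c in centers]
)

labels = label_by_clustering(features)
dataset = np.column_stack([features, labels])  # features and labels merged
print(dataset.shape)  # (150, 4)
```

With real audio, the feature matrix would be built by calling extract_features on every file in the repository and stacking the results.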
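The calibration step described above amounts to building a small lookup table from cluster label to rating name. The sketch below assumes a hypothetical predict function standing in for the trained system; name_clusters, toy_predict and the feature values are all made up for illustration.

```python
def name_clusters(predict_label, reference_songs):
    """Map each integer label to the rating of the reference song that received it.

    predict_label: callable taking a feature vector and returning an integer label.
    reference_songs: dict mapping a rating name to that song's feature vector.
    """
    names = {}
    for rating, features in reference_songs.items():
        label = predict_label(features)
        if label in names:
            raise ValueError(
                f"'{rating}' and '{names[label]}' landed in the same cluster; "
                "pick more distinct reference songs"
            )
        names[label] = rating
    return names


# Toy stand-in for the trained classifier: it labels songs by tempo alone.
def toy_predict(features):
    tempo = features[0]
    return 0 if tempo < 100 else (1 if tempo < 140 else 2)


references = {
    "excellent": [150.0, 8.0, 0.07],
    "moderate": [118.0, -4.0, 0.05],
    "poor": [85.0, -18.0, 0.02],
}
label_names = name_clusters(toy_predict, references)
my_rating = label_names[toy_predict([125.0, -2.0, 0.05])]
print(label_names, my_rating)
```

If two reference songs receive the same label, the mapping is ambiguous, which is why the sketch raises an error instead of silently overwriting an entry.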

Data preprocessing

LabelBinarizer() from sklearn is used to one-hot encode the labels.

StandardScaler() is used to standardize each feature to a mean of zero and a standard deviation of one.
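A small sketch of both preprocessing steps on made-up numbers (the feature values and labels below are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, StandardScaler

features = np.array([[80.0, -20.0], [120.0, -5.0], [160.0, 10.0], [100.0, 0.0]])
labels = np.array([0, 1, 2, 1])

# One column per class; each row has a single 1 marking the song's label.
one_hot = LabelBinarizer().fit_transform(labels)

# Each feature column is shifted and scaled to mean 0 and standard deviation 1.
scaled = StandardScaler().fit_transform(features)

print(one_hot)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

In practice the scaler is usually fitted on the training split only and then applied to the test split with transform(), so that no information from the test data leaks into training.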

Constructing the neural network

The neural network has an input layer, two hidden layers and an output layer. First we create three placeholders: one for the features, one for the labels and one for the probability of keeping each neuron, which the dropout layer requires as a parameter. The values for these placeholders are provided at runtime. Then we create each layer of the network by declaring its weights and biases. Dropout is applied to the output of the activation function after every layer except the output layer.

The basic idea of a feed-forward neural network is that the inputs fed to a layer are multiplied by that layer's weight matrix and added to its biases. These weights and biases are the variables that are adjusted during training so that the learned function generalizes. The result is passed to an activation function, which maps it to a range that depends on the function chosen. The input and hidden layers use tanh, while the output layer uses softmax. tanh turned out to be the best activation for the intermediate layers; a comparison of popular activation functions is given in the results section.
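The forward pass described above can be sketched in plain NumPy. This is not the TensorFlow code from the project, only an illustration of the same computation: two tanh hidden layers of 120 neurons, dropout after each of them and a softmax output. The number of input features (3) and the weight initialization are assumptions.

```python
import numpy as np


def init_layer(rng, n_in, n_out):
    """Small random weights and zero biases for one fully connected layer."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)


def dropout(a, keep_prob, rng):
    """Inverted dropout: keep each unit with probability keep_prob and
    scale the kept units by 1 / keep_prob."""
    if keep_prob >= 1.0:
        return a
    mask = rng.random(a.shape) < keep_prob
    return a * mask / keep_prob


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def forward(x, params, keep_prob, rng):
    (w1, b1), (w2, b2), (w_out, b_out) = params
    h1 = dropout(np.tanh(x @ w1 + b1), keep_prob, rng)
    h2 = dropout(np.tanh(h1 @ w2 + b2), keep_prob, rng)
    return softmax(h2 @ w_out + b_out)  # no dropout after the output layer


rng = np.random.default_rng(0)
n_features, n_hidden, n_classes = 3, 120, 3  # n_features is an assumption
params = [
    init_layer(rng, n_features, n_hidden),
    init_layer(rng, n_hidden, n_hidden),
    init_layer(rng, n_hidden, n_classes),
]
x = rng.normal(size=(5, n_features))
probs = forward(x, params, keep_prob=1.0, rng=rng)
train_probs = forward(x, params, keep_prob=0.5, rng=rng)
print(probs.shape)  # (5, 3)
```

Each row of the output is a probability distribution over the three rating classes, so the predicted class is the index of the largest value in the row.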

Cost function

The cost function we used is cross-entropy. It takes the log of the predicted probabilities and multiplies it element-wise by the actual one-hot labels, then sums the products across the classes (with a negative sign) to get one cost value per sample. To obtain the cost for a batch we take the mean of these per-sample values.
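As a worked example, here is the same calculation in NumPy. The function name cross_entropy and the clipping constant eps are choices made for this sketch, not taken from the project code.

```python
import numpy as np


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy over a batch of one-hot labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    per_sample = -np.sum(y_true * np.log(y_pred), axis=1)
    return per_sample.mean()


y_true = np.array([[1, 0, 0], [0, 1, 0]])
y_pred = np.array([[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]])
cost = cross_entropy(y_true, y_pred)
print(f"{cost:.4f}")  # 0.6931
```

Because the labels are one-hot, only the log-probability assigned to the correct class contributes to each sample's cost; here both samples give the correct class a probability of 0.5, so the cost is -log(0.5).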

Optimization function

For optimization we use stochastic gradient descent, which moves the weights and biases in the direction that reduces the cost, with a step size set by the learning rate.
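The update rule itself is short. The sketch below applies it to a toy one-parameter cost rather than to the network, purely to show the mechanics; sgd_step and the learning rate of 0.05 are illustrative choices.

```python
import numpy as np


def sgd_step(params, grads, learning_rate):
    """Return the parameters moved one step against their gradients."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


# Toy cost: (w - 3)^2, whose gradient is 2 * (w - 3); the minimum is at w = 3.
w = [np.array(0.0)]
for _ in range(500):
    grad = [2.0 * (w[0] - 3.0)]
    w = sgd_step(w, grad, learning_rate=0.05)
print(float(w[0]))
```

In the real network the gradients come from back-propagating the cross-entropy cost of one batch, which is what makes the descent "stochastic": each step sees only a sample of the data.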


Training is done in batches to reduce over-fitting, and the number of training epochs is set to 200. Training has to happen within a TensorFlow session, because the computational graph is evaluated only inside a session. Values for the placeholders are supplied during training through the feed_dict parameter. The run method of the Session class runs the operations of the computational graph.
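To show what batch training over 200 epochs looks like without depending on TensorFlow sessions, here is a NumPy sketch that trains a single softmax layer, not the full two-hidden-layer network, on synthetic data. The batch size of 32 and the learning rate of 0.1 are assumptions chosen so the toy example converges quickly.

```python
import numpy as np


def iterate_batches(x, y, batch_size, rng):
    """Yield shuffled (features, labels) mini-batches covering the data once."""
    order = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        yield x[idx], y[idx]


def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
# Synthetic data: three classes centred at different points.
centers = np.array([[-2.0, 0.0], [0.0, 2.0], [2.0, 0.0]])
classes = rng.integers(0, 3, size=300)
x = centers[classes] + rng.normal(scale=0.5, size=(300, 2))
y = np.eye(3)[classes]  # one-hot labels

w, b = np.zeros((2, 3)), np.zeros(3)
training_epochs, batch_size, learning_rate = 200, 32, 0.1
for epoch in range(training_epochs):
    for xb, yb in iterate_batches(x, y, batch_size, rng):
        probs = softmax(xb @ w + b)
        grad_logits = (probs - yb) / len(xb)  # gradient of the mean cross-entropy
        w -= learning_rate * (xb.T @ grad_logits)
        b -= learning_rate * grad_logits.sum(axis=0)

accuracy = np.mean(softmax(x @ w + b).argmax(axis=1) == classes)
print(f"training accuracy: {accuracy:.2f}")
```

The data is reshuffled at the start of every epoch, so the batches differ from one pass to the next.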

Setting Hyper-parameters

Training epochs

An epoch is complete when all of the training data has been used once in the training process. The training data consists of the training features and their corresponding labels. Here we set the number of training epochs to 200, which means we pass over the entire training data 200 times. There is no ideal number of epochs; it depends on the complexity of your data. You should therefore tune this parameter, that is, try a few configurations to find a suitable value.

Hyperparameter 1: training_epochs = 200

Since we are implementing a multi-layer neural network, it consists of an input layer, two hidden layers and an output layer.

Number of neurons in the hidden layers

Hidden layers perform transformations on the input data to identify patterns and help the model generalize. I used 120 neurons in each of the two hidden layers, which was sufficient to achieve decent accuracy. As explained earlier, though, every hyperparameter should be tuned to improve your own model.

Hyperparameter 2: n_neurons_in_h1 = 120

Hyperparameter 3: n_neurons_in_h2 = 120

Learning rate

This is the pace at which the algorithm learns. Machine learning practitioners advise starting with a high learning rate and gradually reducing it to achieve the best results. They also advise keeping the learning rate between 0 and 1.

Hyperparameter 4: learning_rate = 0.001
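One common way to start high and gradually reduce the rate is exponential decay. The sketch below is only an illustration of that advice; the project itself uses a fixed learning rate, and the function name and all the numbers here are hypothetical.

```python
def decayed_learning_rate(initial_rate, decay_rate, epoch, decay_every=50):
    """Smooth exponential decay: the rate is multiplied by decay_rate
    every decay_every epochs."""
    return initial_rate * decay_rate ** (epoch / decay_every)


rates = [decayed_learning_rate(0.1, 0.5, epoch) for epoch in (0, 50, 100, 200)]
print(rates)
```

With these example values the rate halves every 50 epochs, going from 0.1 at the start to 0.00625 by epoch 200.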


Dropout is used to reduce over-fitting during training.

keep_prob = 0.5 for training and 1.0 for testing. Dropout is applied only during training, not testing. keep_prob specifies the probability that each neuron is kept in a layer.

Finally, the model can be saved after training using the save() method of the Saver() class.

Measures taken to reduce over-fitting

1. Shuffling the dataset

2. Standardizing the dataset

3. Adding dropout layers

4. Training on the dataset in batches of samples

In this section we will be assessing the performance of each approach we used in solving the problem and the inferences we could gain.

Parameter tuning results are as follows:

Other parameter(batch size & split percentage) changes:

The above diagram shows the predictive power of each feature as measured by the F-score, 2TP / (2TP + FP + FN), where TP is the number of true positives, FP the number of false positives and FN the number of false negatives.

It's clear that, in terms of prediction accuracy, stacking gives impressive results, classifying every element in each class correctly, while boosting and bagging also perform well. The neural network implementation is less impressive but gives acceptable results. Most importantly, despite the biased (imbalanced) nature of the dataset, the neural network still managed to identify the small proportion of class 2, though only after the techniques mentioned above were applied to minimize over-fitting. Finally, tanh proved to be the best choice of activation function for layers other than the output layer.
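For reference, the F-score formula quoted above can be computed from predictions like this. The sketch evaluates it per class on made-up labels; how the score was derived per feature for the diagram is not shown in the post, so this only illustrates the formula itself.

```python
import numpy as np


def f_score_per_class(y_true, y_pred, n_classes):
    """Return the F-score 2TP / (2TP + FP + FN) for each class."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(float(2 * tp / denom) if denom else 0.0)
    return scores


y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
scores = f_score_per_class(y_true, y_pred, 3)
print(scores)
```

For class 0 in this example there is one true positive, one false positive and one false negative, giving 2 / (2 + 1 + 1) = 0.5.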
