Building an Audio Classifier using Deep Neural Networks

How to use a deep convolutional neural network architecture to classify audio, and how to use transfer learning and data augmentation effectively to improve model accuracy on small datasets.

By Narayan Srinivasan.

Understanding sound is one of the basic tasks that our brain performs. Sounds can be broadly classified into speech and non-speech. Noise-robust speech recognition systems already exist, but there is still no general-purpose acoustic scene classifier that lets a computer listen to and interpret everyday sounds and act on them as humans do, for example by moving out of the way when we hear a horn or a dog barking behind us.

Our model is only as complex as our data, so getting labelled data is very important in machine learning. The complexity of machine learning systems arises from the

The audio classification problem is now transformed into an image classification problem. We need to detect the presence of a particular entity (‘Dog’, ‘Cat’, ‘Car’, etc.) in this image.
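To make the audio-to-image step concrete, here is a minimal sketch of a magnitude spectrogram computed with a short-time Fourier transform. It uses only the Python standard library and a naive DFT, so it is far too slow for real recordings; in practice an audio library would normally do this. The function name, frame size, hop length and test tone are all illustrative assumptions, not details from the article.

```python
import cmath
import math


def magnitude_spectrogram(samples, frame_size=64, hop=32):
    """Return a list of frames; each frame lists the magnitude per frequency bin."""
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    n_bins = frame_size // 2 + 1  # keep the non-negative frequencies only
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        # Naive DFT, O(frame_size^2) per frame: fine for a demo only.
        spectrum = [
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                    for n in range(frame_size)))
            for k in range(n_bins)
        ]
        frames.append(spectrum)
    return frames


# Hypothetical input: a pure tone completing 8 cycles per 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = magnitude_spectrogram(tone)

# Index of the strongest frequency bin in every time frame.
peak_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in spec]
```

Each row of `spec` is one time slice and each column one frequency bin, so the result can be treated as a greyscale image: the test tone shows up as a bright horizontal line at the bin matching its frequency.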

Step 2. Choosing an Architecture

We use a convolutional neural network (CNN) to classify the spectrogram images. CNNs are good at detecting local feature patterns (edges etc.) in different parts of the image, and at capturing hierarchical features that become progressively more complex with every layer, as illustrated in the image.
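As a rough illustration of what "detecting local feature patterns" means, the sketch below slides a small kernel over a toy image and records the response at each position. The kernel here is hand-picked to respond to vertical edges; in a trained CNN the kernel values are learned from data. The function name and the toy image are assumptions made for this example.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and return the response map.

    Like most CNN layers, this computes a cross-correlation: the kernel is
    not flipped before being applied.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy 4x6 "spectrogram": quiet on the left, loud on the right.
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

# Hand-picked vertical-edge kernel (a learned filter would replace this).
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

response = conv2d_valid(image, kernel)
```

The response is non-zero only where the window straddles the boundary between the quiet and loud regions, which is the sense in which a convolutional filter picks out a local pattern wherever it occurs in the image.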

Another way to approach this is to use a recurrent neural network to capture the sequential information in sound.

rubberband -t 1.5 -p 2 input.wav output.wav

This one-line terminal command gives us a new audio file that is 50% longer than the original and has its pitch shifted up by two semitones (rubberband's -p option is measured in semitones, so a full octave would be -p 12).
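The sketch below spells out what the two options mean numerically and builds the same command as an argument list. It assumes the rubberband command-line convention that -t is a duration ratio and -p is a pitch shift in semitones; the helper names and the 10-second clip length are hypothetical.

```python
def pitch_ratio(semitones):
    """Frequency multiplier for a pitch shift of the given number of semitones."""
    return 2.0 ** (semitones / 12.0)


def rubberband_args(src, dst, time_ratio, semitones):
    """Build the argument list for the rubberband command shown above."""
    return ["rubberband", "-t", str(time_ratio), "-p", str(semitones), src, dst]


args = rubberband_args("input.wav", "output.wav", 1.5, 2)

# A hypothetical 10-second clip stretched with -t 1.5 lasts 15 seconds.
new_duration = 10.0 * 1.5

# -p 2 raises every frequency by about 12%; -p 12 doubles it (one octave).
two_semitones = pitch_ratio(2)
one_octave = pitch_ratio(12)
```

If rubberband is installed, the list can be passed to `subprocess.run(args, check=True)`; looping over several time ratios and semitone values is one way to generate a set of augmented files from each recording.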

To visualise what this means, look at this image of a cat I took from the internet.

If we only have the image on the right, we can use data augmentation to make a mirror image of it, and it is still a cat (additional training data!). To a computer these are two completely different pixel distributions, which helps it learn more general concepts (if A is a dog, the mirror image of A is a dog too).
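The mirror-image idea can be sketched in a few lines. The example assumes an image is stored as a list of pixel rows and a dataset as a list of (image, label) pairs; both representations and the function names are illustrative.

```python
def mirror(image):
    """Return a left-right flipped copy of a 2-D image (a list of pixel rows)."""
    return [row[::-1] for row in image]


def augment_with_mirrors(dataset):
    """Return the original (image, label) pairs plus a mirrored copy of each."""
    return dataset + [(mirror(img), label) for img, label in dataset]


# Tiny stand-in for the cat photo: 2 rows of 3 pixel values.
cat = [[1, 2, 3],
       [4, 5, 6]]

augmented = augment_with_mirrors([(cat, "cat")])
```

The label is copied unchanged, which is the whole point: the pixels differ but the class does not, so the dataset doubles in size at no labelling cost.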

Similarly, we apply time-stretching (slowing the sound down or speeding it up) and pitch-shifting (making it more or less shrill) to get more generalised training data for our network. In this case, where the training set was small, this improved validation accuracy by 8-9%.

We observed that the performance of the model for each sound class is influenced differently by each augmentation set, suggesting that the performance of the model could be improved further by applying class-conditional data augmentation.

Overfitting is a major problem in deep learning, and data augmentation is one way to combat it. Other ways to encourage generalisation include dropout layers and L1 and L2 regularisation. [Ref]
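For readers unfamiliar with these techniques, here is a minimal standard-library sketch of each. It shows the common "inverted dropout" formulation and the usual L1 and L2 penalty terms; the function names, the dropout rate and the penalty strength are assumptions for illustration, and deep learning frameworks provide their own implementations.

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def l1_penalty(weights, strength):
    """L1 term added to the loss: strength times the sum of absolute weights."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """L2 term added to the loss: strength times the sum of squared weights."""
    return strength * sum(w * w for w in weights)


rng = random.Random(0)  # seeded so the example is repeatable
dropped = dropout([1.0, 1.0, 1.0, 1.0], rate=0.5, rng=rng)

weights = [3.0, -4.0]
l1 = l1_penalty(weights, strength=0.01)
l2 = l2_penalty(weights, strength=0.01)
```

Dropout is applied only during training, and the penalty terms are added to the training loss so that large weights are discouraged.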

In this article we proposed a deep convolutional neural network architecture for classifying audio, and showed how to use transfer learning and data augmentation effectively to improve model accuracy on small datasets.

Bio: Narayan Srinivasan is interested in building autonomous vehicles. He is a graduate of the Indian Institute of Technology, Madras.

