Predicting Singer Voice Using Convolutional Neural Network

This document discusses using a convolutional neural network to predict a singer's voice from an audio signal. It begins with an introduction to CNNs and their use in audio analysis. Features are extracted from audio files, including MFCCs. A CNN model is trained on spectrograms generated from song clips to classify singers. The model is able to accurately predict which singer is heard based solely on the raw audio data and extracted features.

Uploaded by

Puru Mathur

MINOR PROJECT

Predicting Singer Voice Using a Convolutional Neural Network
Table of Contents

 Introduction
 Audio file overview
 Applications of Audio Processing
 Audio Processing with Python
 Spectrogram
 Feature extraction from Audio signal
 Voice detection using Convolutional Neural Networks (CNN)
 Conclusion
INTRODUCTION

 Convolutional neural networks (CNNs or ConvNets) are a popular group of neural networks that belong to a wider family of methods known as deep learning.
 CNNs were designed to process image data efficiently; to this end, they have properties such as local connectivity, spatial invariance, and hierarchical features.
 Audio data analysis is about analyzing and understanding audio signals captured by digital devices, with numerous applications in the enterprise, healthcare, productivity, and smart cities.
 In this project, we start with audio data analysis and extract the necessary features from a sound/audio file. We then build a Convolutional Neural Network (CNN) to predict the singer's voice.
Audio file overview

• The sound excerpts are digital audio files in .wav format. Sound waves are digitized by sampling them at discrete intervals known as the sampling rate (typically 44.1 kHz for CD-quality audio, meaning samples are taken 44,100 times per second).
• Each sample is the amplitude of the wave at a particular instant, and the bit depth determines how finely each sample is resolved, also known as the dynamic range of the signal (typically 16-bit, which means a sample can take one of 65,536 amplitude values).
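The numbers above can be verified with Python's built-in wave module alone. The following is a minimal sketch that writes one second of a 440 Hz test tone (the tone, amplitude scaling, and file name are illustrative choices, not from the slides):

```python
import math
import struct
import wave

SAMPLE_RATE = 44100      # CD quality: 44,100 samples per second
AMPLITUDE = 2 ** 15 - 1  # 16-bit signed samples span 65,536 values (-32768..32767)

# One second of a 440 Hz sine tone, quantized to 16-bit integers.
samples = [
    int(0.5 * AMPLITUDE * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
    for n in range(SAMPLE_RATE)
]

with wave.open("tone.wav", "wb") as f:
    f.setnchannels(1)            # mono
    f.setsampwidth(2)            # 2 bytes per sample = 16-bit depth
    f.setframerate(SAMPLE_RATE)  # sampling rate
    f.writeframes(struct.pack("<%dh" % len(samples), *samples))
```

Reading the file back with wave.open("tone.wav", "rb") reports exactly the sampling rate and sample width set here.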
What is Sampling and Sampling frequency?

In signal processing, sampling is the reduction of a continuous signal into a series of discrete values. The sampling frequency, or rate, is the number of samples taken over some fixed amount of time. A high sampling frequency results in less information loss but higher computational expense; low sampling frequencies have higher information loss but are fast and cheap to compute.
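The trade-off can be made concrete with NumPy by sampling the same one-second wave at two different rates (the 5 Hz wave and the two rates are illustrative, assuming NumPy is available):

```python
import numpy as np

def sample_signal(freq_hz, duration_s, sample_rate):
    """Sample a continuous sine wave into discrete values at the given rate."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    return np.sin(2 * np.pi * freq_hz * t)

# The same one-second, 5 Hz wave at two different sampling frequencies:
high = sample_signal(5, 1.0, 1000)  # 1000 values: low information loss, more compute
low = sample_signal(5, 1.0, 20)     # 20 values: cheap, but coarse
print(len(high), len(low))  # 1000 20
```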
Applications of Audio Processing

 Indexing music collections according to their audio features.
 Recommending music for radio channels.
 Similarity search for audio files (e.g., Shazam).
 Speech processing and synthesis: generating artificial voice for conversational agents.
Audio Data Handling using Python

 Sound is represented in the form of an audio signal with parameters such as frequency, bandwidth, decibel level, etc. A typical audio signal can be expressed as a function of amplitude and time.
 There are devices that capture these sounds and represent them in a computer-readable format. Examples of these formats are:
wav (Waveform Audio File) format
mp3 (MPEG-1 Audio Layer 3) format
WMA (Windows Media Audio) format

 A typical audio processing pipeline involves the extraction of acoustic features relevant to the task at hand, followed by decision-making schemes that involve detection, classification, and knowledge fusion.
Python Audio Libraries

 Python has some great libraries for audio processing, such as Librosa and PyAudio. There are also built-in modules for some basic audio functionality.
 We will mainly use two libraries for audio acquisition and playback:
1. Librosa: a Python module to analyze audio signals in general, but geared more towards music. It includes the nuts and bolts to build a MIR (music information retrieval) system, and it is very well documented, with a lot of examples and tutorials.
2. IPython.display.Audio: lets you play audio directly in a Jupyter notebook.
Spectrogram
• A spectrogram is a visual way of representing the signal strength, or "loudness", of a signal over time at the various frequencies present in a particular waveform. Not only can one see whether there is more or less energy at, for example, 2 Hz vs 10 Hz, but one can also see how energy levels vary over time.
• A spectrogram is usually depicted as a heat map, i.e., as an image with the intensity shown by varying the color or brightness.
• We can display a spectrogram using librosa.display.specshow.
Feature extraction from Audio signal

 Extraction of features is a very important part of analyzing and finding relations between different things. Models cannot understand raw audio data directly; feature extraction is used to convert it into an understandable format.
 It is a process that captures most of the information in the data in an understandable way. Feature extraction is required for classification, prediction, and recommendation algorithms.

Feature Extraction: MFCC (Mel-Frequency Cepstral Coefficients)

 This is one of the most important methods for extracting features from an audio signal, and it is used widely when working with audio. The mel-frequency cepstral coefficients (MFCCs) of a signal are a small set of features (usually about 10-20) which concisely describe the overall shape of the spectral envelope.
Voice detection using Convolutional Neural Networks (CNN)
 Song Dataset:
Collecting five songs by five different singers and making a song dataset.
 Preprocess the data:
Breaking each song of each singer into small raw audio files by taking 20-second clips from it.
 Computing the spectrogram:
Computing the spectrogram, i.e., the short-time Fourier transform, for each 20-second clip. After computing it, we have the spectrogram as an image, i.e., the audio data is converted to image data.
 Applying a Convolutional Neural Network:
We will apply a convolutional neural network to that image data to predict the singer's voice, using Python libraries such as LibROSA, an open-source Python package for music and audio analysis.
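The slides do not specify the network architecture, so the following is only a minimal sketch of the final step, written here with PyTorch; the layer sizes, the 128x128 spectrogram shape, and the five-singer output are assumptions, not details from this project:

```python
import torch
import torch.nn as nn

class SingerCNN(nn.Module):
    """Toy CNN that maps a 1-channel spectrogram image to per-singer scores."""

    def __init__(self, n_singers=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),   # collapse each feature map to one value
            nn.Flatten(),
            nn.Linear(32, n_singers),  # one score per singer
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SingerCNN()
batch = torch.randn(4, 1, 128, 128)  # 4 fake spectrogram "images"
logits = model(batch)
print(logits.shape)  # torch.Size([4, 5])
```

In practice, each 20-second clip's spectrogram would be resized to a fixed shape, batched, and trained against the singer labels with a cross-entropy loss.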
Conclusion

 This work is an approach to applying deep neural network models to music signal processing. Singers' voices are predicted using raw audio data with extracted MFCC and pitch as feature vectors. A 2-hidden-layer deep neural network was used, with a final linear layer for age prediction and a sigmoid function for voice prediction. This work shows very promising results using a convolutional neural network based method to predict singers' voices. In the future, we can include more singers' information in the model, as well as explore more features from raw audio data.
