
JISA (Jurnal Informatika dan Sains) e-ISSN: 2614-8404

Vol. 05, No. 01, June 2022 p-ISSN:2776-3234

Music Genre Recommendations Based on Spectrogram Analysis Using Convolutional Neural Network Algorithm with RESNET-50 and VGG-16 Architecture

I Nyoman Purnama1
1Sistem Informasi, STMIK PRIMAKARA
Email: [email protected]

Abstract − Recommendations are a very useful tool in many industries. Recommendations provide the best selection of what the user wants and give more satisfaction than an ordinary search. In the music industry, recommendations are used to suggest songs that are similar in genre or theme. There are various genres in the world of music, including pop, classical, reggae and others. Genre makes the difference between one song and another clearly audible, and it can be analyzed through spectrogram analysis. The Convolutional Neural Network (CNN) is a neural network algorithm that is commonly used to recognize and classify image data. In this study, spectrogram image analysis was developed to provide the input features for a Convolutional Neural Network, and the CNN classifies songs and provides recommendations according to what the user wants. In addition, testing was carried out with two different CNN architectures, namely VGG-16 and RESNET-50. From the results of the study, the best accuracy was obtained by the VGG-16 model trained for 20 epochs, with an accuracy of 60%, compared to the RESNET-50 model trained for more than 20 epochs. The recommendations generated on the test data also show better similarity values for VGG-16 than for RESNET-50.

Keywords – Recommendation, VGG16, Resnet50, CNN, Spectrogram, Music

I. INTRODUCTION
Music is an inseparable part of people's lives. Music often accompanies people in their activities, and listening to music can also affect the listener's mood. Usually a person listens to music according to how they feel at that moment, so music plays an important role in managing people's psychology[1].
The right song can affect the listener's feelings. Because of the large amount of music available on the internet and in other music service applications, it is difficult for people to choose the songs they want. Music is also distinguished by a wide variety of genres, speeds, tempos and themes[2]. For western songs, the genres include Hip-Hop, International, Electronic, Folk, Experimental, Rock, Pop and Instrumental. This makes it difficult for music lovers to choose the right song.
Music lovers usually find the music they want manually, for example by asking friends for recommendations or listening to music shows[3]. Often the song they end up listening to does not match their mood, or belongs to a genre they are not a fan of. Recommendations are therefore implemented on various music player platforms on the internet to improve the listening experience. A recommendation system is able to predict the favorite music desired by the user. Besides benefiting users, recommendations are also useful for music service providers, because they can increase user satisfaction with the service.
Deep learning is a part of Artificial Neural Network-based machine learning. With deep learning, a computer can classify and recommend data in the form of images or sounds[4]. One of the methods commonly used for classification and recommendation is the Convolutional Neural Network (CNN). CNN is an extension of the Multilayer Perceptron that learns from images using supervised learning, in which a target output is provided and compared with past learning experience. Several architectures can be used to optimize a CNN so that it achieves better classification results, such as VGG, MobileNet and ResNet. ResNet, short for Residual Network, is a classic neural network[5] and the winner of the ImageNet challenge in 2015; it is easier to optimize and its accuracy can increase with greater depth. ResNet-50 is one of the best-performing CNN architectures, as demonstrated in the research by Talo on the classification of brain diseases from MRI images[6].
A spectrogram is a visual representation of the frequency spectrum of a signal and is formed using the Fourier transform. Making a spectrogram with the FFT (Fast Fourier Transform) is done by taking the data in the time domain, breaking it into several segments, and applying a Fourier transform to calculate the magnitude of the frequency spectrum of each segment. The spectrogram is very useful for analyzing sound, because it shows magnitude as a function of frequency at a given time.
Music recommendations can also be made based on the mood of the user. In the research done by Amala George et al., the system developed analyzes the user's mood from their face using the CNN algorithm[2], and from the results of this mood classification, recommendations are then given by a recommendation module. From the research that has been done, the accuracy obtained is 98%.


Research on classifying music types with the CNN algorithm by analyzing spectrogram images has also been carried out. The spectrogram images generated from the music are processed with deep learning using a CNN model. That research found that training for 35 epochs gives an optimal accuracy of 81.33%, and that CNN produces better accuracy than the KNN method[1]. Other research on spectrogram analysis for music genre classification using CNN and Mel-spectrograms shows that the number of datasets, the training iterations and the computer specifications greatly affect the accuracy and the modeling time; the resulting accuracy in classifying music genres was very good, namely 99% for the ReLU activation function and 95% for ELU[7].
Music recommendation based on genre has also been carried out using a Convolutional Recurrent Neural Network (CRNN). That study also used spectrograms, analyzed them with a CRNN, and compared the CRNN and CNN methods for classifying music genres; it found that the CRNN, which takes both frequency and time-sequence features into account, performs better than the CNN[8]. Research on next-song recommendation has also been carried out, where neural networks performed well in all types of tests. It concluded that NN-based next-song recommenders such as CNN-rec, NN-rec and Word2Vec outperform the non-NN-based ones, and that the recommenders which combine users' general preferences with their sequential listening patterns have the highest performance[9].
Music recommendation using deep content-based features was also studied by Aäron van den Oord et al.[10]. Their research showed that recent advances in deep learning translate very well to the music recommendation setting, with deep convolutional neural networks significantly outperforming a more traditional approach based on bag-of-words representations of audio signals. Other research approaches music recommendation through user behaviour[11]; it considered genre, recording year, freshness, favor and time pattern as factors for recommending songs, and the evaluation results demonstrate that the approach is effective.
Research on music recommendation by genre has also compared several machine learning algorithms such as KNN, RF, NB, DT and SVM[12]. According to the results summarized in that research, SVM achieved better classification results than the other methods, and changing the window size and window type caused only very small performance changes. Research on music recommendation that compares similarity based on an assigned genre value with similarity based on feature-vector distance has also been done by Lee et al., whose paper proposed a recommendation system based on preference classification using real-time user brainwaves and genre feature classification; the proposed user-preference classifier achieved an overall accuracy of 81.07%[13].
Based on the research described above, this study carries out a music genre recommendation process using the GTZAN dataset, which contains 10 genres, where the music data is first converted into spectrograms. The spectrograms are classified using the CNN algorithm with the RESNET50 and VGG16 architectures, and the generated recommendations are tested to see whether they match the song desired by the user.

II. RESEARCH METHODOLOGY
The method used in this research consists of dataset preparation, pre-processing, spectrogram generation, the classification process, and similarity calculation using cosine similarity.

A. Dataset
This research uses a dataset in the form of spectrogram images taken from the GTZAN dataset. To simplify the classification of the music data with a neural network, the music data must first be converted into mel-spectrograms that the network can process. GTZAN consists of music files and the mel-spectrograms generated from them. It is a public dataset that is widely used for evaluating music genre recognition (MGR). GTZAN is a collection of music gathered in 2000-2001 from various sources such as CDs, radio and microphone recordings. The dataset consists of 10 genres, namely blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae and rock; each clip lasts 30 seconds and each genre contains 100 music files. The dataset used in this study is divided into three parts, training data, validation data and test data, with the following details for each class:

Table 1. Number of samples used for each class
Genre      Training data   Validation data   Testing data
Blues      80              10                10
Classical  80              10                10
Country    80              10                10
Disco      80              10                10
Hiphop     80              10                10
Jazz       80              10                10
Metal      80              10                10
Pop        80              10                10
Reggae     80              10                10
Rock       80              10                10

B. Spectrogram
A spectrogram is a visual representation of the frequency spectrum of a signal[14]. In the GTZAN dataset, spectrograms have already been generated and stored per class. Before being fed to the CNN, this data is further divided into training, validation and test data. Each spectrogram image is loaded into an array and labeled according to the index of its class folder; after labeling, the data is appended to an array so that it is easier to pass on to the network. A spectrogram image exposes many values and features of the music file. The following is an example of the spectrogram of each class in this study.


Figure 2. Example of spectrogram for each class

Based on the figure above, it can be seen that the spectrograms of the different genres differ. The images on the right show spectrograms for the hip-hop and rock genres, whose frequency density is visibly higher than in the spectrograms on the left. The spectrogram image represents the audio in the time, frequency and magnitude domains. To generate the spectrograms of each music genre and use them in an artificial neural network, this study used the Librosa library. With Librosa we can retrieve important features of a music file, such as tempo, chroma and the spectrogram.
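As an illustration of this step, the sketch below shows one way a mel-spectrogram image could be generated from a 30-second GTZAN clip with Librosa and saved for later use by the CNN. The file paths are hypothetical and the parameter values are common Librosa defaults, not necessarily the exact configuration used in this study.

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load a 30-second clip (hypothetical path into the GTZAN audio folders).
y, sr = librosa.load("genres/blues/blues.00000.wav", duration=30)

# Mel-scaled spectrogram, converted from power to decibels.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Other features mentioned in the text can be retrieved in the same way.
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)

# Save the spectrogram as an image so it can later be fed to the CNN.
fig, ax = plt.subplots(figsize=(4, 4))
librosa.display.specshow(mel_db, sr=sr, hop_length=512, ax=ax)
ax.axis("off")
fig.savefig("spectrograms/blues/blues.00000.jpg", bbox_inches="tight", pad_inches=0)
plt.close(fig)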

C. Convolutional Neural Network
A Convolutional Neural Network is an artificial neural network that is widely used in the field of image classification. In this study, the audio/music signal is represented as a spectrogram, which is a 2D image, and a CNN is used to classify the music genres with the help of these spectrograms. The pattern of the audio signal can be seen in the spectrogram image of each music genre, so each genre can serve as input to the CNN.
In this study, two CNN architectures were used, namely ResNet and VGG16. ResNet is a residual network designed for image recognition. RESNET-50 is a deeper successor to VGG-16, where the number at the end of each architecture name represents the number of layers in the network. RESNET stands for Residual Network, an artificial neural network innovation that won the 2015 ILSVRC classification competition with an error rate of only 3.15%[15]. VGG-16 stands for Visual Geometry Group, with 16 layers; it is a well-known model that participated in the 2014 ILSVRC and obtained an accuracy rate of 92.7%. VGG-16 is also widely used in image classification and is very popular because of its ease of implementation. Figure 3 shows a comparison of the RESNET and VGG architectures.

Figure 3. Comparison of the ResNet-50 and VGG-16 architectures

D. Research flow
The research flow used in this study can be outlined as follows:

[Flowchart: Start -> Collect spectrogram data -> Divide data into training, test and validation sets -> CNN RESNET50 / CNN VGG-16 -> Compute cosine similarity -> Recommendation]

Figure 4. Research flow

The first step in this research is to collect the dataset, in the form of spectrogram images from the GTZAN dataset. The required libraries, such as ImageDataGenerator from the Keras library, are then used to manage the training, test and validation data, after the MFCC image data of the 10 music genres has been grouped by category. The next step is to build a CNN model with Keras.
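As a concrete sketch of this step, the code below sets up the three ImageDataGenerator flows and a baseline CNN in Keras, assuming the parameters reported in Section III (224x224 RGB input, rescaling by 1./255, batch size 64, two Conv2D blocks with 32 filters of size 3x3, glorot_uniform initialization, ReLU, dropout of 0.2 and the Adam optimizer). The directory names and the size of the dense hidden layer are hypothetical, since the paper does not publish its source code.

from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Normalize pixel values to [0, 1]; one generator is reused for all three splits.
datagen = ImageDataGenerator(rescale=1.0 / 255)

train_gen = datagen.flow_from_directory("spectrograms/train", target_size=(224, 224),
                                        batch_size=64, class_mode="categorical")
val_gen = datagen.flow_from_directory("spectrograms/val", target_size=(224, 224),
                                      batch_size=64, class_mode="categorical")
test_gen = datagen.flow_from_directory("spectrograms/test", target_size=(224, 224),
                                       batch_size=64, class_mode="categorical", shuffle=False)

# Baseline CNN with the layer parameters described in Section III.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", kernel_initializer="glorot_uniform",
                  input_shape=(224, 224, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu", kernel_initializer="glorot_uniform"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.2),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),    # hypothetical hidden size, not stated in the paper
    layers.Dense(10, activation="softmax"),  # one output per genre
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_gen, validation_data=val_gen, epochs=20)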


Two different training processes are then carried out, using the VGG-16 and ResNet-50 architectures.
After a model has been obtained from the training process with the two architectures, the similarity between feature vectors is computed using cosine similarity. The application displays a recommendation of 5 songs that match a song in the validation data, where the recommendation process is carried out by calculating the similarity of features between one piece of music and another. The first step is to choose a piece of music from each genre that will be used as the basis for the recommendation; the prediction for that base music is then computed by the artificial neural network. The cosine similarity value is calculated from the two feature vectors being compared. To calculate the similarity of two pieces of music with N features, where the first piece has feature vector x = [x1, x2, x3, …, xn] and the second has feature vector y = [y1, y2, y3, …, yn], the formula shown in Figure 5 is used.

Figure 5. Cosine similarity formula

III. RESULTS AND DISCUSSION
The system was implemented using Google Colab. The libraries used in this research are NumPy, pandas, Librosa, Keras and scikit-learn. This research uses a spectrogram image dataset obtained from the GTZAN dataset, with a total of 1000 pieces of music divided into 10 categories, namely Blues, Classical, Country, Disco, Hip Hop, Jazz, Metal, Pop, Reggae and Rock. These spectrogram images are the input for the Convolutional Neural Network; each image is a mel-spectrogram representation of its audio file, saved in ".jpg" format.
To build the image dataset in this study, the ImageDataGenerator library was used. As input for this library, the GTZAN dataset must be divided into 3 folders, namely training, test and validation data, and the 10 classes are divided into their respective folders. Parameters that must be initialized are batch_size=64 and an initial number of epochs of 20. Another important parameter is the input shape of all images, which in this study is set to (224, 224, 3), i.e. RGB channels, and the images are normalized with a scale of 1./255. The CNN is implemented using the Python programming language with the Keras and TensorFlow libraries. CNN modeling is done by initializing the network layer parameters, namely 2 Conv2D layers in the model, 32 filters in each Conv2D layer, filter size (3,3), the glorot_uniform initializer, the ReLU activation function, a dropout of 0.2, and the "Adam" optimizer. The CNN model consists of several layers, namely convolution layers, pooling layers, dropout layers, a flatten layer, dense layers and the ReLU activation function. The result of the convolution process is a feature map that is used in the repeated convolution process. The resulting model summary is shown below.

Figure 6. CNN output model

The next process is transfer learning with the 2 different architectures, namely VGG16 and RESNET50. Transfer learning is the process of using an existing model for a different problem; by using transfer learning, it is hoped that the training results will be better. The parameter needed in this transfer learning process is lastfourtrainable: if its value is false, only the last fully connected layer is trained, and if it is true, the last 4 layers that have parameters are trained. For both architectures the "adam" optimizer is used, and the training process is carried out for 20 epochs per model. This training produces the models that are used in the testing process.
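The training code itself is not listed in the paper; the sketch below shows one plausible reading of the transfer-learning setup described above for the VGG16 branch, with a last_four_trainable flag interpreted as in the text (the ResNet-50 branch would use tf.keras.applications.ResNet50 in the same way). The ImageNet weights and the classification head are assumptions.

from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_transfer_model(last_four_trainable=True, num_classes=10):
    # Pretrained convolutional base without its original classifier.
    base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    base.trainable = False
    if last_four_trainable:
        # Unfreeze the last 4 layers that carry trainable parameters.
        for layer in [l for l in base.layers if l.weights][-4:]:
            layer.trainable = True
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(256, activation="relu"),   # hypothetical head size
        layers.Dropout(0.2),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

vgg_model = build_transfer_model()
# vgg_model.fit(train_gen, validation_data=val_gen, epochs=20)   # 20 epochs, as in the text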


A. Result analysis
After the training process, the precision, recall and F1-score values were obtained for each music class. The following are the precision, recall, F1-score and accuracy values of the VGG16 model.

Figure 7. VGG16 accuracy values

The results of the confusion matrix for the CNN-VGG16 model are shown in Figure 8, where the results obtained are quite good in classifying the classes of the music dataset used. The best classification was obtained for the classical, jazz and reggae classes, while the lowest classification was obtained for the disco class.

Figure 8. VGG16 confusion matrix

After the models were formed with transfer learning on VGG16 and RESNET50, feature extraction was carried out. For the VGG16 model, for example, the model previously stored during the training process is loaded and the output weights are taken from the layer before the classification layer of this model. From this model we derive the feature vectors for the training and validation datasets; the similarity of these feature vectors is then computed with cosine similarity.
In Figure 9, 5 music recommendations are obtained based on the spectrogram image of the music file desired by the user, where the "test image" is a music spectrogram from the testing data. As seen with VGG16, the recommended spectrograms have almost the same shape as the test spectrogram. With a test image from the Blues class, the recommendations are also obtained from the Blues class, with a similarity level of 1.

Figure 9. Recommendation output for the VGG16 model

RESNET50 uses the same testing process as VGG16. From the experiments, the training process with RESNET50 requires more epochs to reach better accuracy; in this study, reasonably good accuracy was obtained at 30 epochs for RESNET50. The figure below shows the precision, recall, F1-score and accuracy of the RESNET50 model. The resulting accuracy is slightly lower than that of the VGG16 model, even with a larger number of epochs.

Figure 10. RESNET50 accuracy values

The results of the confusion matrix for the CNN-RESNET50 model are shown in Figure 11, where the results obtained are still lower than those of the VGG16 model in classifying the classes of the music dataset. The best classification is obtained for the reggae and pop classes, while in the rock class the RESNET50 model is not able to give good classification results.

Figure 11. RESNET50 confusion matrix

For the testing process, the steps taken are the same as for the VGG16 model, that is, extracting features from the test image and computing their cosine similarity with the features extracted from the training dataset, so that music spectrogram recommendations are obtained that correspond to the test data used. Figure 12 shows the 5 most similar images for the tested test data. It can be seen that the spectrogram recommendations are quite good; only the similarity level is lower than for the VGG16 model.

Figure 12. Recommendation output for the RESNET50 model
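As a rough illustration of this recommendation step, the sketch below takes the output of the layer before the classifier as the feature vector of each spectrogram and ranks training spectrograms by cosine similarity, cos(x, y) = (x · y) / (||x|| ||y||), returning the 5 best matches. The function names, the choice of feature layer and the variables in the usage example are assumptions, since the paper does not list the corresponding code.

import numpy as np
from tensorflow.keras import models

def build_feature_extractor(trained_model):
    # Feature vectors are taken from the layer just before the softmax classifier.
    return models.Model(inputs=trained_model.input,
                        outputs=trained_model.layers[-2].output)

def cosine_similarity(x, y):
    # cos(x, y) = (x . y) / (||x|| * ||y||)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-10))

def recommend(extractor, test_image, train_images, train_labels, top_k=5):
    # test_image: (224, 224, 3); train_images: (N, 224, 224, 3); values already rescaled.
    test_vec = extractor.predict(test_image[np.newaxis, ...])[0]
    train_vecs = extractor.predict(train_images)
    scores = [cosine_similarity(test_vec, v) for v in train_vecs]
    best = np.argsort(scores)[::-1][:top_k]
    return [(train_labels[i], scores[i]) for i in best]

# Hypothetical usage with the VGG16 transfer model from the previous sketch:
# extractor = build_feature_extractor(vgg_model)
# print(recommend(extractor, test_image, train_images, train_labels))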


IV. CONCLUSION
This research was implemented using Python, Google Colab, and the TensorFlow and Keras libraries. The input shape of the CNN models in this research is 224x224 pixels, the filter size is 3x3, the number of epochs is 20 and 30, and there are 799 training samples and 100 validation samples. CNN is the most widely used method for image data; for research with audio data, the data is first processed by spectrogram analysis into Cartesian coordinates with the amplitude of the music on the y-axis. In this study, the resulting spectrograms become the input for CNNs with the VGG16 and RESNET50 architectures.
Based on the results of the prediction system designed with the CNN method, the VGG16 model reaches a training accuracy of 0.8408, a training loss of 0.4827, a test accuracy of 0.6094 and a test loss of 1.2762. Meanwhile, the RESNET50 model reaches a training accuracy of 0.6286, a training loss of 1.0383, a test accuracy of 0.3438 and a test loss of 1.8529. From these results it can be concluded that both models still generalize poorly on the test data, because the dataset contains many songs whose characteristics are similar across classes.
The best accuracy was obtained by the VGG16 model with 20 epochs, compared to the RESNET50 model with more than 20 epochs, and the recommendations generated on the test data show better similarity values for VGG16 than for RESNET50. A suggestion for future work is to enlarge the dataset so that the accuracy obtained is even better, because in this study the songs in the dataset do not have clear boundaries between one genre and another. In addition, the number of epochs used during the training process can be increased further so that the accuracy of each CNN model improves.

REFERENCES
[1] Y. M. G. Costa, L. S. Oliveira, A. L. Koerich, and F. Gouyon, "Music genre recognition using spectrograms," Int. Conf. Syst. Signals, Image Process., pp. 151–154, 2011.
[2] A. George, S. Suneesh, S. Sreelakshmi, and T. E. Paul, "Music Recommendation System Using CNN," vol. 9, no. 6, pp. 4197–4200, 2020.
[3] C. R. Wairata, E. R. Swedia, and M. Cahyanti, "Pengklasifikasian Genre Musik Indonesia Menggunakan Convolutional Neural Network," Sebatik, vol. 25, no. 1, pp. 255–261, 2021, doi: 10.46984/sebatik.v25i1.1286.
[4] J. Dias, "Music genre classification from Spectrogram using CNN," [Online]. Available: http://cs230.stanford.edu/files_winter_2018/projects/6936608.pdf.
[5] F. Nashrullah, S. A. Wibowo, and G. Budiman, "The Investigation of Epoch Parameters in ResNet-50 Architecture for Pornographic Classification," J. Comput. Electron. Telecommun., vol. 1, no. 1, pp. 1–8, 2020, doi: 10.52435/complete.v1i1.51.
[6] M. Talo, O. Yildirim, U. B. Baloglu, G. Aydin, and U. R. Acharya, "Convolutional neural networks for multi-class brain disease detection using MRI images," Comput. Med. Imaging Graph., vol. 78, p. 101673, Dec. 2019, doi: 10.1016/j.compmedimag.2019.101673.
[7] D. Lionel, R. Adipranata, and E. Setyati, "Klasifikasi Genre Musik Menggunakan Metode Deep Learning Convolutional Neural Network dan Mel-Spektrogram," J. Infra Petra, vol. 7, no. 1, pp. 51–55, 2019, [Online]. Available: http://publication.petra.ac.id/index.php/teknik-informatika/article/view/8044.
[8] Adiyansjah, A. A. S. Gunawan, and D. Suhartono, "Music recommender system based on genre using convolutional recurrent neural networks," Procedia Comput. Sci., vol. 157, pp. 99–109, 2019, doi: 10.1016/j.procs.2019.08.146.
[9] K.-C. Hsu, S.-Y. Chou, Y.-H. Yang, and T.-S. Chi, "Neural Network Based Next-Song Recommendation," 2016, [Online]. Available: http://arxiv.org/abs/1606.07722.
[10] D. G. W. Ingram et al., "Computer Aided Design, International Conference, University of Southampton, Engl, Apr 24-28 1972," Inst. Electr. Eng. Conf. Publ., no. 86, IEE, pp. 1–9, 1972.
[11] Y. Hu, "NEXTONE PLAYER: A music recommendation system based on user behavior," in Proc. 12th Int. Society for Music Information Retrieval Conf. (ISMIR 2011), pp. 103–108, 2011.
[12] A. Elbir, H. B. Çam, M. E. Iyican, B. Öztürk, and N. Aydin, "Music Genre Classification and Recommendation by Using Machine Learning Techniques," Proc. 2018 Innov. Intell. Syst. Appl. Conf. (ASYU 2018), 2018, doi: 10.1109/ASYU.2018.8554016.
[13] J. Lee, K. Yoon, D. Jang, S. J. Jang, S. Shin, and J. H. Kim, "Music recommendation system based on genre distance and user preference classification," J. Theor. Appl. Inf. Technol., vol. 96, no. 5, pp. 1285–1292, 2018.
[14] M. H. Ashshiddieqy, Jondri, and A. Rizal, "Klasifikasi Suara Paru Dengan Convolutional Neural Network (CNN)," eProceedings Eng., vol. 07, no. 02, pp. 8506–8512, 2020.
[15] W. Setiawan, "Perbandingan Arsitektur Convolutional Neural Network Untuk Klasifikasi Fundus," J. Simantec, vol. 7, no. 2, pp. 48–53, 2020, doi: 10.21107/simantec.v7i2.6551.

JISA (Jurnal Informatika dan Sains) (e-ISSN: 2614-8404) is published by Program Studi Teknik Informatika, Universitas Trilogi
under Creative Commons Attribution-ShareAlike 4.0 International License.