
Survey on Multimodal Emotion Detection

Abstract- Many people today suffer from low mood, creating a need for a system that recommends music based on human emotions such as anger, sadness, and happiness. In this paper we develop a system, named multimodal emotion detection, that suggests music and movies based on the user's emotion. The system is implemented using the FER-2013 dataset, which contains 35,887 images labeled with emotions such as sad, happy, and angry, together with the Python TensorFlow framework and the Haar cascade algorithm, to recommend movies and music from human facial emotions. The main concern in existing recommendation systems is manual sorting; this model is proposed to avoid it. The 'Movie and Music Recommendation System based on Facial Expressions' provides a way to automatically play music and movies without spending much time browsing for them.

Keywords - Image processing, Facial recognition, Movie Recommendation, Song Recommendation

1. Introduction

Music and movies are well known to alter human emotions, enhancing mood and reducing depression. The primary objective of this project is to develop a model that can detect emotions based on facial expressions. To achieve this objective, we involve a database of songs and movies; this database is used to recommend songs and movies based on the detected expression. The project also aims to provide an easy-to-use interface so that users can easily get their hands on the website. The interface of the website is created using HTML and CSS with Flask. A further objective is to evaluate and test the accuracy and effectiveness of the model in detecting emotions and retrieving recommendations. Since the many available music and movie options create confusion when choosing, this model is helpful for users who want to get results seamlessly. The development of an easy-to-use interface is the most important part of delivering recommendations in real time. A Convolutional Neural Network (CNN) is used to detect and analyze the emotions; it consists of an input layer, convolutional layers, dense layers, and an output layer. The CNN extracts features from the image and determines the specific expression. To accurately detect emotions, we need to detect faces within an image and run the model to detect expressions. In terms of results, the major goal was to create and establish a framework that assists users in finding an ideal choice. The project also seeks to discover the correlation and similarity between different songs and to construct a recommendation-system framework that suggests new movies/songs.
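As a concrete illustration of the CNN described above, the following minimal sketch builds such a network for 48x48 grayscale FER-2013 images with seven emotion classes, assuming a TensorFlow/Keras setup; the layer sizes are illustrative choices, not the exact configuration used in this work.

# Minimal sketch of an emotion-classification CNN for 48x48 grayscale
# FER-2013 face crops (7 emotion classes). Layer sizes are illustrative.
from tensorflow.keras import layers, models

def build_emotion_cnn(num_classes=7):
    model = models.Sequential([
        layers.Input(shape=(48, 48, 1)),                   # input layer: grayscale face crop
        layers.Conv2D(32, (3, 3), activation="relu"),      # convolutional feature extraction
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),              # dense layer
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),   # output layer: emotion probabilities
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_emotion_cnn()
model.summary()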
2. Background and Key Concepts

Multimodal Emotion Detection

Multimodal emotion detection refers to the process of identifying and interpreting human emotions by combining multiple types of data inputs, typically visual (facial expressions), audio (speech, voice tone), and textual (spoken or written words). The goal of multimodal emotion detection is to create a more accurate and holistic understanding of a user's emotional state by integrating these different sources of information. Emotion detection has become increasingly relevant with the rise of human-computer interaction (HCI) applications, where systems need to understand and respond to users' emotions to enhance user experience. This capability is particularly important in areas like personalized content recommendation, healthcare, education, and virtual assistants.

2.1 Key Elements in Multimodal Emotion Detection:

• Facial Expression Analysis: One of the most commonly used methods for detecting emotions. Facial expressions are analyzed through computer vision techniques, such as Convolutional Neural Networks (CNNs), which process facial landmarks (e.g., eyes, mouth, eyebrows) to classify emotions like happiness, sadness, anger, or surprise.

• Speech and Voice Analysis: Audio features such as tone, pitch, and intensity are analyzed to identify the emotional state of a speaker. This modality is often paired with facial expression analysis to provide a more comprehensive understanding of emotions.

• Textual Data Analysis: In some cases, text from social media, reviews, or chatbots is analyzed using Natural Language Processing (NLP) techniques to detect sentiment and emotions.
2.2. Facial Expression-Based Emotion Detection

Facial expressions are a primary channel through which humans express their emotions. Advanced computer vision techniques are used to capture and analyze these expressions to detect underlying emotions. The process typically involves the following steps:

1. Face Detection: Identifying the presence and location of faces in images or videos using algorithms like the Viola-Jones detector (built on Haar cascades) or modern deep learning-based detectors.

2. Feature Extraction: Detecting facial landmarks (such as eyes, mouth, and eyebrows) or using deep learning models (like CNNs) to extract key features from the face.

3. Emotion Classification: Assigning a category (e.g., happiness, anger, fear) to the detected emotion using machine learning classifiers or neural networks. Popular models include CNNs, Recurrent Neural Networks (RNNs), and Support Vector Machines (SVMs).

Facial expression analysis is considered reliable for real-time emotion detection, particularly in environments where visual data can be easily captured, such as smartphones, webcams, or in-vehicle cameras.
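As an illustration of the face-detection step described above, the sketch below uses OpenCV's bundled Haar cascade to locate faces and crop them to the 48x48 grayscale format expected by the emotion classifier; the file name and crop size are assumptions for demonstration.

# Sketch: detect faces with OpenCV's pre-trained Haar cascade and prepare
# crops for the emotion classifier. Paths and sizes are illustrative.
import cv2

# Load the frontal-face Haar cascade that ships with OpenCV.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

def extract_face_crops(image_path, size=(48, 48)):
    """Return grayscale face crops resized for the emotion model."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [cv2.resize(gray[y:y + h, x:x + w], size) for (x, y, w, h) in faces]

# Example usage (hypothetical file name):
# crops = extract_face_crops("user_frame.jpg")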
2.3. Recommendation Systems

Recommendation systems use algorithms and data-driven techniques to suggest personalized content (e.g., movies, music) to users. In traditional recommendation systems, user preferences are derived from historical data, such as past interactions (e.g., movies watched, songs liked) or demographic information. However, integrating emotion detection with recommendation systems enhances personalization by adapting recommendations to users' emotional states in real-time.

2.3.1. Key Approaches in Recommendation Systems:

• Collaborative Filtering: Makes recommendations by identifying users with similar preferences and suggesting content that those users have enjoyed.

• Content-Based Filtering: Recommends content similar to items a user has previously liked, based on features of the items (e.g., genre, director, actors).

• Hybrid Systems: Combine both collaborative and content-based approaches to provide more accurate recommendations.

In the context of emotion-based recommendations, the system dynamically adapts suggestions based on the user's detected emotional state. For example, if the system detects that a user is feeling sad, it might recommend uplifting music or feel-good movies to improve their mood.
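A minimal sketch of this emotion-driven adaptation is shown below, assuming a hand-curated mapping from detected emotions to content tags; the tags and titles are placeholders rather than items from an actual catalogue.

# Sketch: map a detected emotion to a curated content list.
# The mapping and titles are illustrative placeholders.
EMOTION_TO_TAGS = {
    "sad":     ["uplifting", "feel-good"],
    "happy":   ["upbeat", "party"],
    "angry":   ["calming", "instrumental"],
    "neutral": ["popular", "trending"],
}

CATALOG = [
    {"title": "Feel-Good Movie A", "tags": ["feel-good", "comedy"]},
    {"title": "Calm Playlist B",   "tags": ["calming", "instrumental"]},
    {"title": "Upbeat Album C",    "tags": ["upbeat", "party"]},
]

def recommend(emotion, catalog=CATALOG, k=5):
    """Return up to k catalogue items whose tags match the detected emotion."""
    wanted = set(EMOTION_TO_TAGS.get(emotion, ["popular"]))
    return [item for item in catalog if wanted & set(item["tags"])][:k]

print(recommend("sad"))  # items tagged 'uplifting' or 'feel-good'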

3. Literature Review

Multimodal emotion detection and its application in personalized recommendation systems (e.g., movie and music recommendations) have gained significant attention in recent years. This section reviews key approaches and technologies used in emotion detection through facial expressions and the integration of these techniques into recommendation systems.

3.1. Emotion Detection through Facial Expressions

Facial expression analysis has been a foundational technique in emotion detection due to its non-invasive and widely applicable nature. Early methods relied on handcrafted features and traditional machine learning models, while recent advancements have shifted towards deep learning-based approaches.

3.1.1. Traditional Methods

Before the advent of deep learning, emotion recognition from facial expressions was based on feature extraction techniques such as:

• Facial Action Coding System (FACS): This system breaks down facial expressions into a set of Action Units (AUs) based on muscle movements, which can be mapped to specific emotions. Early machine learning models, like Support Vector Machines (SVMs), were used to classify emotions based on the extracted AUs.

• Gabor Filters: These filters were employed to capture facial texture features, which were then fed into classifiers such as k-Nearest Neighbors (k-NN) or Decision Trees to identify emotions.

3.1.2. Deep Learning Approaches

With the rise of deep learning, Convolutional Neural Networks (CNNs) have become the dominant method for facial emotion recognition. CNNs automatically learn hierarchical features from facial images, eliminating the need for handcrafted feature extraction.

• Convolutional Neural Networks (CNNs): CNNs have shown remarkable success in emotion detection by processing raw facial images and learning emotion-relevant features through layers of convolutions and pooling.

• Recurrent Neural Networks (RNNs): When dealing with video data, RNNs, particularly Long Short-Term Memory (LSTM) networks, are employed to capture the temporal dynamics of facial expressions. These networks help model the sequential nature of emotions, which unfold over time rather than being static.

• Generative Adversarial Networks (GANs): GANs have been explored for synthesizing emotional expressions or enhancing datasets by generating more varied facial expressions. This helps improve the robustness of emotion detection models, particularly in dealing with data scarcity.

3.1.3. Multimodal Fusion for Emotion Detection

Facial expressions are often combined with other modalities like audio and text to improve the robustness of emotion detection. Common multimodal fusion techniques include:

• Feature-Level Fusion: Combines raw features from different modalities (e.g., facial landmarks, voice pitch) before feeding them into a classifier.

• Decision-Level Fusion: Processes each modality separately and then combines the predictions from each modality to make the final emotion classification. Ensemble methods like boosting and bagging are often used here.

This fusion improves the accuracy of emotion detection, as each modality can compensate for the shortcomings of the others (e.g., if facial expressions are ambiguous, voice analysis can provide clearer emotional cues).
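As a simple illustration of decision-level fusion, the sketch below combines per-modality class probabilities with a weighted average (a common late-fusion rule); the modality weights are illustrative assumptions rather than values taken from the literature reviewed here.

# Sketch: decision-level (late) fusion by weighted averaging of the
# per-modality probability vectors. Weights are illustrative assumptions.
import numpy as np

EMOTIONS = ["angry", "happy", "sad", "neutral"]

def fuse_predictions(face_probs, voice_probs, w_face=0.6, w_voice=0.4):
    """Combine two modality-specific probability vectors into one decision."""
    fused = (w_face * np.asarray(face_probs, dtype=float)
             + w_voice * np.asarray(voice_probs, dtype=float))
    fused /= fused.sum()                      # renormalize to a distribution
    return EMOTIONS[int(np.argmax(fused))], fused

# Example: an ambiguous face but a clearly sad voice.
label, probs = fuse_predictions([0.3, 0.3, 0.3, 0.1], [0.1, 0.1, 0.7, 0.1])
print(label)  # prints: sad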
3.2. Emotion-Based Recommendation Systems

Integrating emotion detection into recommendation systems represents a shift from static, historical data-based recommendations to dynamic, real-time personalization. The goal is to enhance user experience by recommending content (music, movies) that matches or modulates the user's emotional state.

3.2.1. Traditional Recommendation Systems

Traditional recommendation systems rely on two main approaches:

• Collaborative Filtering: Uses past behaviors and preferences of similar users to recommend content. This approach faces limitations when emotional context is required, as it only looks at historical data.

• Content-Based Filtering: Recommends items similar to those the user has previously interacted with, based on item features (e.g., genre, mood). However, it doesn't take the user's real-time emotional state into account.

3.2.2. Emotion-Driven Recommendation

Emotion-driven recommendation systems enhance traditional methods by incorporating emotion detection as an input. Techniques include:

• Emotion-Adaptive Collaborative Filtering: Modifies collaborative filtering algorithms to factor in real-time emotional data (see the sketch at the end of this subsection). For instance, if a user feels sad, the system may recommend comforting music or movies that have been popular with other users in a similar emotional state.

• Emotion-Based Content Filtering: Uses detected emotions to filter content based on emotional tone. For instance, if a user appears stressed, the system might recommend soothing or uplifting content to improve the user's mood.

3.2.3. Techniques for Emotion-Based Recommendations

• Hybrid Models: Combine traditional recommendation algorithms with emotion detection systems to offer more nuanced and personalized recommendations. For example, collaborative filtering can be paired with real-time facial expression analysis to modify suggestions based on the user's detected emotional state.

• Real-Time Personalization: Deep learning techniques such as Reinforcement Learning have been explored for real-time personalization. These systems learn from real-time user interactions and emotional feedback to continuously adjust the recommended content, offering a more responsive and personalized experience.
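The sketch below illustrates the emotion-adaptive idea from Section 3.2.2: scores from a base collaborative-filtering model are re-weighted according to how well each item's emotional tag matches the user's detected state. The base scores, mood tags, and boost factor are illustrative assumptions.

# Sketch: boost collaborative-filtering scores for items whose emotional
# tone matches the user's detected emotion. All values are illustrative.
def emotion_adaptive_rank(cf_scores, item_moods, detected_emotion,
                          preferred_moods, boost=1.5):
    """cf_scores: {item: score} from a base collaborative-filtering model;
    item_moods: {item: mood_tag}; preferred_moods: {emotion: set of mood_tags}."""
    wanted = preferred_moods.get(detected_emotion, set())
    adjusted = {
        item: score * (boost if item_moods.get(item) in wanted else 1.0)
        for item, score in cf_scores.items()
    }
    return sorted(adjusted, key=adjusted.get, reverse=True)

# Example usage with placeholder data:
ranking = emotion_adaptive_rank(
    cf_scores={"movie_a": 0.8, "movie_b": 0.7, "movie_c": 0.4},
    item_moods={"movie_a": "dark", "movie_b": "uplifting", "movie_c": "uplifting"},
    detected_emotion="sad",
    preferred_moods={"sad": {"uplifting", "comforting"}},
)
print(ranking)  # -> ['movie_b', 'movie_a', 'movie_c']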

4. System Architecture

Fig. System Architecture

The entire system is structured into five key steps:

1. Image Acquisition: The first step in the image processing workflow is to gather user images from a camera source, ensuring they are in the .jpg format.

2. Pre-processing: This stage is crucial for removing unnecessary details from the captured images and standardizing them. Images are converted from RGB to grayscale, and key facial areas such as the eyes, nose, and mouth are identified using the Haar cascade algorithm.

3. Feature Extraction: In this step, important facial features are extracted and represented as vectors during both the training and testing phases. The main features analyzed include the mouth, eyes, nose, and forehead, as these areas reflect the most expressive emotions. Principal Component Analysis (PCA) is employed to extract and represent these features.

4. Expression Recognition: A Euclidean distance classifier is used to identify a person's expression. It finds the closest match between test images and training images, assigning an expression label (happy, sad, fear, surprise, anger, disgust, or neutral) based on the smallest distance from the average image.

5. Music/Movie Recommendation: The final step involves recommending music and movies based on the user's identified emotional state. After facial expression classification using the CNN, a curated list of songs or movies corresponding to the detected emotion is presented for selection. These are grouped by emotional categories, allowing users to choose based on their current mood.
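A minimal sketch of steps 3 and 4 is given below, assuming scikit-learn's PCA over flattened 48x48 face crops; here each test face is compared against per-emotion mean vectors in PCA space, which is one plausible reading of the 'smallest distance from the average image' rule rather than the exact implementation.

# Sketch of feature extraction (PCA) and expression recognition
# (nearest per-emotion mean in PCA space, by Euclidean distance).
# Data shapes and the component count are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

def fit_emotion_prototypes(train_faces, train_labels, n_components=50):
    """train_faces: (N, 48*48) flattened grayscale crops; train_labels: emotion strings."""
    pca = PCA(n_components=n_components).fit(train_faces)
    features = pca.transform(train_faces)
    labels = np.array(train_labels)
    prototypes = {lab: features[labels == lab].mean(axis=0) for lab in set(train_labels)}
    return pca, prototypes

def classify_expression(face, pca, prototypes):
    """Assign the emotion whose average image is closest in Euclidean distance."""
    feat = pca.transform(face.reshape(1, -1))[0]
    return min(prototypes, key=lambda lab: np.linalg.norm(feat - prototypes[lab]))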
5. Evaluation Methodology

5.1. Datasets

Selecting appropriate datasets is crucial for evaluating the performance of multimodal emotion detection systems. Commonly used datasets in the field include:

• FER+: A widely used dataset containing labeled facial expressions. It improves on the original FER dataset by providing more consistent and accurate annotations.

• AffectNet: A large facial expression dataset containing images labeled with emotions such as happiness, sadness, anger, surprise, and more.

These datasets provide a diverse set of emotions and multimodal data, making them suitable for training and testing the model.

5.2. Evaluation Metrics

To assess the effectiveness of the system, several metrics can be used to measure how well the system predicts emotions. Common metrics include:

• Accuracy: The percentage of correctly classified emotions out of all predictions.

• Precision: The ratio of true positive predictions to the sum of true positives and false positives. It shows how many of the predicted emotions were correct.

• Recall (Sensitivity): The ratio of true positives to the sum of true positives and false negatives. It indicates how many of the actual emotions were correctly identified.

• F1-Score: The harmonic mean of precision and recall. It provides a balanced measure, especially useful when the class distribution is uneven.

• Confusion Matrix: A table that shows the number of correct and incorrect predictions for each emotion. It provides a deeper understanding of which emotions are harder to classify.
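These metrics map directly onto standard library calls; a brief sketch using scikit-learn is shown below (the label arrays are placeholder values for illustration).

# Sketch: computing the evaluation metrics above with scikit-learn.
# y_true / y_pred are placeholder label arrays.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = ["happy", "sad", "angry", "sad", "happy", "neutral"]
y_pred = ["happy", "sad", "sad",   "sad", "angry", "neutral"]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("Recall   :", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("F1-score :", f1_score(y_true, y_pred, average="macro", zero_division=0))
print(confusion_matrix(y_true, y_pred))  # rows: true labels, columns: predictions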
5.3. Cross-Validation

To ensure that the model generalizes well to unseen data, k-fold cross-validation can be performed. In this method:

• The dataset is divided into k subsets (folds).

• The model is trained on k-1 folds and tested on the remaining fold.

• This process is repeated k times, and the results are averaged to provide a reliable estimate of the model's performance.
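A brief sketch of this procedure with scikit-learn is shown below; the feature matrix, labels, and classifier are placeholders standing in for the real pipeline.

# Sketch: 5-fold cross-validation with scikit-learn.
# X (features) and y (labels) are random placeholders for the real data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))       # e.g., 100 extracted feature vectors
y = rng.integers(0, 7, size=100)     # 7 emotion classes

scores = cross_val_score(SVC(), X, y, cv=5, scoring="accuracy")
print("Per-fold accuracy:", scores)
print("Mean accuracy    :", scores.mean())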
5.4. Comparison with Baselines

The proposed model's performance should be compared against baseline models to demonstrate improvements. Common baselines include:

• Single-modality models (e.g., facial expressions or speech only).

• Traditional machine learning methods (e.g., Support Vector Machines, Decision Trees).

• Existing multimodal systems from recent literature.

5.5. Ablation Study

An ablation study is conducted to understand the contribution of different components of the system. For example:

• Test the system with only facial expression data.

• Test with only speech data.

• Test with both modalities together to show how combining them improves performance.

5.6. Computational Efficiency

It is important to assess the computational cost of the system, especially for real-time applications. Relevant measures include:

• Inference Time: The time it takes for the system to detect emotions and recommend content.

• Resource Usage: The CPU/GPU memory usage and the complexity of the model in terms of the number of parameters.
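Inference time can be measured with a simple wall-clock timer around the prediction call, as in the sketch below; the model.predict call and the 48x48 input shape assume the Keras model sketched earlier.

# Sketch: measuring average per-image inference time for the emotion model.
# Assumes the Keras CNN sketched earlier and 48x48 grayscale inputs.
import time
import numpy as np

def average_inference_time(model, n_images=100):
    batch = np.random.rand(n_images, 48, 48, 1).astype("float32")
    start = time.perf_counter()
    model.predict(batch, verbose=0)
    return (time.perf_counter() - start) / n_images

# print(f"Average inference time: {average_inference_time(model) * 1000:.2f} ms/image")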
6. Challenges in Multimodal Emotion Detection

6.1. Data Quality and Diversity

• Limited Data Availability: High-quality, labeled datasets for multimodal emotion detection are often scarce. The lack of comprehensive datasets that capture a wide range of emotions across diverse demographics can hinder the training of effective models.

• Data Imbalance: Emotion datasets may have a disproportionate number of samples for certain emotions, leading to biased models that perform well on frequently represented emotions while struggling with less common ones.

6.2. Real-Time Processing

• Latency: Achieving low-latency processing for real-time emotion detection and recommendations can be difficult, especially when dealing with simultaneous processing of video and audio data. Ensuring that the system can provide immediate feedback is crucial for user experience.

• Resource Constraints: Real-time applications may have limitations on hardware resources (CPU/GPU), necessitating the optimization of models to balance performance and resource consumption effectively.

7. Future Directions

7.1. Expanding Dataset Diversity

Encourage the creation of larger and more diverse datasets that include a variety of emotional expressions across different cultures and demographics. This will enhance the model's ability to generalize.

7.2. Addressing Ethical Concerns

Develop privacy-preserving techniques, such as differential privacy and federated learning, to ensure user data is protected during emotion detection processes.

7.3. Real-World Application Enhancements

Enhance personalized recommendation systems by considering not only detected emotions but also contextual information (user history and preferences) to tailor suggestions more effectively.

7.4. Advancements in Model Interpretability

Invest in research on explainable AI to improve transparency in how emotion detection systems make decisions, fostering user trust and understanding.

8. Conclusion

This paper discusses the importance of using facial expressions to detect emotions, particularly in creating personalized recommendations for music and movies. While there have been significant advancements in this field, challenges remain, such as the need for diverse and high-quality data and concerns about privacy and bias. Moving forward, research should focus on improving the accuracy of facial expression recognition and making these systems easier for users to understand and trust. By effectively using facial expressions to detect emotions, we can develop technologies that are more engaging and responsive, ultimately enhancing how users interact with digital content.
