Survey Paper
Abstract- Many people suffer from unhappiness in today's world, so there is a need for a system that recommends music based on human emotions such as anger, sadness, and happiness. In this paper we develop a system, called multimodal emotion detection, that suggests music and movies based on the user's emotion. We implemented this system using the FER-2013 dataset, which contains 35,887 images labelled with emotions such as sad, happy, and angry, and used the Python TensorFlow framework together with the Haar cascade algorithm to recommend movies and music from human facial emotions. The main concern with existing recommendation systems is manual sorting; this model is proposed to avoid it. The 'Movie and Music Recommendation System based on Facial Expressions' provides a way to play music and movies automatically without spending much time browsing for them.
We need to detect faces within an image and then run the model to detect expressions. In terms of results, the major goal was to create and establish a framework that helps customers find an ideal choice for them. The project seeks to discover the correlation and framework that suggests new movies/songs.

2. Background and Key Concepts

Multimodal emotion detection has applications in areas like personalized content recommendation, healthcare, education, and virtual assistants.

2.1. Key Elements in Multimodal Emotion Detection

Facial expressions are among the most commonly used methods for detecting emotions. They are analyzed through computer vision techniques, such as Convolutional Neural Networks (CNNs).
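To make the CNN-based analysis concrete, the sketch below defines a small Keras network for 48x48 grayscale face images with seven emotion classes, the input format used by FER-2013. It is a typical architecture for this task offered as an assumption, not necessarily the exact network used in the system described later.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_emotion_cnn(num_classes: int = 7) -> tf.keras.Model:
    """Small CNN for 48x48 grayscale facial-expression images (FER-2013 style)."""
    model = models.Sequential([
        layers.Input(shape=(48, 48, 1)),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),                      # 48x48 -> 24x24
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),                      # 24x24 -> 12x12
        layers.Conv2D(128, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),                      # 12x12 -> 6x6
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),                        # regularization against overfitting
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

After compiling, such a model would be trained with model.fit on FER-2013 images scaled to the [0, 1] range and their integer emotion labels.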
2.2. Facial Expression-Based Emotion Detection

Facial expressions are a primary channel through which humans express their emotions. Advanced computer vision techniques are used to detect and classify these emotions. The process typically involves locating a face in an image and then classifying its expression, and the necessary data can be easily captured with devices such as smartphones, webcams, or in-vehicle cameras.

2.3. Recommendation Systems

Recommendation systems use data-driven techniques to suggest content that users are likely to enjoy. Common approaches include:
Collaborative Filtering: Recommends items based on the preferences of users with similar tastes.

Content-Based Filtering: Recommends items similar to those the user has previously liked, based on features of the items (e.g., genre, director, actors).

Hybrid Systems: Combine both collaborative and content-based approaches to provide more accurate recommendations.

In the context of emotion-based recommendations, the system dynamically adapts suggestions based on the user's detected emotional state. For example, if the system detects that a user is feeling sad, it might recommend uplifting music or feel-good movies to improve their mood.

3.1. Emotion Detection through Facial Expressions

Facial expression analysis has been a foundational technique in emotion detection due to its non-invasive and widely applicable nature. Early methods relied on handcrafted features and traditional machine learning models, while recent advancements have shifted towards deep learning-based approaches.

3.1.1. Traditional Methods

Before the advent of deep learning, emotion recognition from facial expressions was based on feature extraction techniques combined with classical classifiers such as k-Nearest Neighbors (k-NN) or Decision Trees to identify emotions.
3.1.2. Deep Learning Approaches

With the rise of deep learning, Convolutional Neural Networks (CNNs) have become the dominant method for facial emotion recognition. CNNs automatically learn hierarchical features from facial images, eliminating the need for handcrafted feature extraction.

Convolutional Neural Networks (CNNs): CNNs have shown remarkable success in emotion detection by processing raw facial images and learning emotion-relevant features through layers of convolutions and pooling.

Recurrent Neural Networks (RNNs): When dealing with video data, RNNs, particularly Long Short-Term Memory (LSTM) networks, are employed to capture the temporal dynamics of facial expressions. These networks help model the sequential nature of emotions, which unfold over time rather than being static.

Generative Adversarial Networks (GANs): GANs have been explored for synthesizing emotional expressions or enhancing datasets by generating more varied facial expressions. This helps improve the robustness of emotion detection models, particularly in dealing with data scarcity.

3.1.3. Multimodal Fusion for Emotion Detection

Facial expressions are often combined with other modalities like audio and text to improve the robustness of emotion detection. Common multimodal fusion techniques include:

Feature-Level Fusion: Combines raw features from different modalities (e.g., facial landmarks, voice pitch) before feeding them into a classifier.

Decision-Level Fusion: Processes each modality separately and then combines the predictions from each modality to make the final emotion classification. Ensemble methods like boosting and bagging are often used here.

This fusion improves the accuracy of emotion detection, as each modality can compensate for the shortcomings of the others (e.g., if facial expressions are ambiguous, voice analysis can provide clearer emotional cues).
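To make the decision-level scheme concrete, the following sketch is our own illustration: it averages the per-class probability vectors produced by separate face and audio classifiers and picks the most likely emotion. The emotion label set and the weighting are assumptions.

import numpy as np

EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def decision_level_fusion(face_probs: np.ndarray,
                          audio_probs: np.ndarray,
                          face_weight: float = 0.6) -> str:
    """Fuse two modality-specific softmax outputs by weighted averaging."""
    fused = face_weight * face_probs + (1.0 - face_weight) * audio_probs
    return EMOTIONS[int(np.argmax(fused))]

# Example: the face model is ambiguous, but the audio model clearly signals sadness.
face_probs = np.array([0.10, 0.05, 0.05, 0.25, 0.30, 0.05, 0.20])
audio_probs = np.array([0.05, 0.02, 0.03, 0.05, 0.70, 0.05, 0.10])
print(decision_level_fusion(face_probs, audio_probs))  # -> "sad"

Feature-level fusion would instead concatenate the modality features (for example with np.concatenate) before a single classifier is applied.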
3.2. Emotion-Based Recommendation Systems

Integrating emotion detection into recommendation systems represents a shift from static, historical data-based recommendations to dynamic, real-time personalization. The goal is to enhance user experience by recommending content (music, movies) that matches or modulates the user's emotional state.

3.2.2. Emotion-Driven Recommendation

Emotion-driven recommendation systems enhance traditional methods by incorporating emotion detection as an input. Techniques include:

Emotion-Adaptive Collaborative Filtering: Modifies collaborative filtering algorithms to factor in real-time emotional data. For instance, if a user feels sad, the system may recommend comforting music or movies that have been popular with other users in a similar emotional state.
Other techniques in this category similarly modify suggestions based on the user's detected emotional state.

Real-Time Personalization: Deep learning techniques such as Reinforcement Learning have been explored for real-time personalization. These systems learn from real-time user interactions and emotional feedback to continuously adjust the recommended content, offering a more responsive and personalized experience.

4. System Architecture
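The abstract describes the building blocks of the system: Haar cascade face detection, a TensorFlow emotion classifier trained on FER-2013, and emotion-driven selection of music or movies. The following sketch shows one way these stages might be wired together; the model path, the playlist mapping, and the function name are illustrative assumptions rather than the project's actual implementation.

import cv2
import numpy as np
import tensorflow as tf

# Label order assumed to match the training labels of the emotion model.
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

# Hypothetical assets: OpenCV's bundled frontal-face cascade and a trained Keras model.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
emotion_model = tf.keras.models.load_model("emotion_cnn.h5")  # assumed path

# Illustrative emotion-to-content mapping (would normally come from a catalogue).
PLAYLISTS = {"sad": "uplifting-hits", "happy": "feel-good-movies",
             "angry": "calming-instrumentals"}

def recommend_from_frame(frame_bgr: np.ndarray) -> str:
    """Detect the largest face, classify its emotion, and map it to content."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    if len(faces) == 0:
        return "no-face-detected"
    x, y, w, h = max(faces, key=lambda box: box[2] * box[3])       # largest face
    face = cv2.resize(gray[y:y + h, x:x + w], (48, 48)) / 255.0    # FER-2013 input size
    probs = emotion_model.predict(face.reshape(1, 48, 48, 1), verbose=0)[0]
    emotion = EMOTIONS[int(np.argmax(probs))]
    return PLAYLISTS.get(emotion, "default-mix")

In a live system, this function would run on frames captured from a webcam, and the returned identifier would be passed to the music or movie player.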
5.1. Datasets

Selecting appropriate datasets is crucial for evaluating the performance of multimodal emotion detection systems. Commonly used datasets in the field include:

FER+: A widely-used dataset containing labeled facial expressions. It improves on the original FER dataset by providing more consistent and accurate annotations.

AffectNet: A large facial expression dataset containing images labeled with emotions such as happiness, sadness, anger, surprise, and more.

These datasets provide a diverse set of emotions and multimodal data, making them suitable for testing and training your model.

5.2. Evaluation Metrics

To assess the effectiveness of the system, you can use several metrics that measure how well the system predicts emotions. Common metrics include:

Accuracy: The percentage of correctly classified emotions out of all predictions.

Precision: The ratio of true positive predictions to the sum of true positives and false positives. It shows how many of the predicted emotions were correct.

Recall (Sensitivity): The ratio of true positives to the sum of true positives and false negatives. It indicates how many of the actual emotions were correctly identified.

F1-Score: The harmonic mean of precision and recall. It provides a balanced measure, especially useful when the class distribution is uneven.

Confusion Matrix: A table that shows the number of correct and incorrect predictions for each emotion. It provides a deeper understanding of which emotions are harder to classify.

5.3. Cross-Validation

To ensure that the model generalizes well to unseen data, you can perform k-fold cross-validation. In this method:

The dataset is divided into k subsets (folds).

The model is trained on k-1 folds and tested on the remaining fold.

This process is repeated k times, and the results are averaged to provide a reliable estimate of the model's performance.

5.4. Comparison with Baselines

You should compare your proposed model's performance against baseline models to demonstrate improvements. Common baselines include:

Single-modality models (e.g., facial expressions or speech only).

Traditional Machine Learning Methods (e.g., Support Vector Machines, Decision Trees).

Existing multimodal systems from recent literature.
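The metrics of Section 5.2 and the k-fold protocol of Section 5.3 can be computed with scikit-learn. The sketch below assumes the true and predicted emotion labels are available as integer arrays; the variable names are illustrative.

import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold

EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def report(y_true: np.ndarray, y_pred: np.ndarray) -> None:
    """Accuracy, per-class precision/recall/F1, and the confusion matrix (Section 5.2)."""
    print("accuracy:", accuracy_score(y_true, y_pred))
    print(classification_report(y_true, y_pred,
                                labels=list(range(len(EMOTIONS))),
                                target_names=EMOTIONS))
    print(confusion_matrix(y_true, y_pred))  # rows: true emotion, columns: predicted

def cross_validate(build_model, X: np.ndarray, y: np.ndarray, k: int = 5) -> float:
    """k-fold cross-validation (Section 5.3): train on k-1 folds, test on the held-out fold."""
    scores = []
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    for train_idx, test_idx in folds.split(X, y):
        model = build_model()                          # fresh model for every fold
        model.fit(X[train_idx], y[train_idx], epochs=10, verbose=0)
        _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
        scores.append(acc)
    return float(np.mean(scores))                      # averaged estimate of performance

Here build_model can be, for example, the build_emotion_cnn factory sketched in Section 2.1.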
5.5. Ablation Study

An ablation study is conducted to understand the contribution of different components of your system. For example:

Test the system with only facial expression data.

Test with only speech data.

Test with both modalities together to show how combining them improves performance.

5.6. Computational Efficiency

It's important to assess the computational cost of your system, especially for real-time applications. You may include:

Inference Time: The time it takes for the system to detect emotions and recommend content.

Resource Usage: Evaluate the CPU/GPU memory usage and the complexity of the model in terms of the number of parameters.

Imbalanced training data can lead to biased models that perform well on frequently represented emotions while struggling with less common ones.

6.2. Real-Time Processing

Latency: Achieving low-latency processing for real-time emotion detection and recommendations can be difficult, especially when dealing with simultaneous processing of video and audio data. Ensuring that the system can provide immediate feedback is crucial for user experience.

Resource Constraints: Real-time applications may have limitations on hardware resources (CPU/GPU), necessitating the optimization of models to balance performance and resource consumption effectively.
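A simple way to quantify the inference-time and latency concerns raised in Sections 5.6 and 6.2 is to time the model directly. The sketch below (model path assumed) averages the per-frame prediction time of a loaded Keras model and reports its parameter count.

import time
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("emotion_cnn.h5")     # assumed path
dummy = np.random.rand(1, 48, 48, 1).astype("float32")   # one FER-sized frame

model.predict(dummy, verbose=0)                # warm-up run (graph building, caching)

runs = 100
start = time.perf_counter()
for _ in range(runs):
    model.predict(dummy, verbose=0)
elapsed = time.perf_counter() - start

print(f"average inference time: {1000 * elapsed / runs:.1f} ms per frame")
print(f"parameters: {model.count_params():,}")

For real-time use, this per-frame figure plus the face-detection time must stay below the frame interval (about 33 ms at 30 fps).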
7. Future Directions

7.1. Expanding Dataset Diversity

Future systems could consider not only emotions but also contextual information (user history and preferences) to tailor suggestions more effectively.

7.4. Advancements in Model Interpretability

Invest in research on explainable AI to improve transparency in how emotion detection systems make decisions, fostering user trust and understanding.

8. Conclusion