Shofiyati Nur Karimah, Japan Advanced Institute of Science and Technology, Japan
Shinobu Hasegawa, Japan Advanced Institute of Science and Technology, Japan
Abstract
This paper proposes a framework for the practical use of real-time engagement estimation to assess a learner's engagement state during online learning activities such as reading, writing, watching video tutorials, taking online exams, and attending online classes. The framework depicts the whole picture of how to implement an engagement estimation tool in an online learning management system (LMS) in a web-based environment, where the input is real-time images of the learners captured by a webcam. We built a face recognition and engagement classification model that analyses learners' facial features and adopts a convolutional neural network (CNN) to classify them into one of three engagement classes: very engaged, normally engaged, or not engaged. The deep learning model is evaluated on the open Dataset for Affective States in E-Environments (DAiSEE) with a hard-labelling modification. To prepare the dataset to be fed into the CNN, images are extracted from every 10-second video snippet. The engagement states are recorded in a file so that the learner's engagement during any online learning activity can be evaluated.
1. Introduction
In this paper, the term online learning includes reading, writing, watching video lectures, taking online exams, and attending real-time online classes through conference applications such as Zoom, Webex, and Google Meet.
Nevertheless, unlike in a traditional classroom, educators in online learning cannot see whether all the learners are engaged during the lecture. Real-time engagement assessment would enable educators to adjust their teaching strategy the way they do in a traditional classroom, e.g., by suggesting useful reading materials or changing the course contents (Woolf et al., 2009). Consequently, several lines of research on automatic engagement estimation for online learning have been proposed.
Based on the input features to be analyzed, engagement estimation methods fall into three groups: log-file analysis, sensor data analysis, and computer vision-based methods. Computer vision-based methods are promising compared to the other two because they are non-intrusive in nature and require only cost-effective hardware and software (Dewan et al., 2019). Therefore, in this paper, we work on computer vision-based engagement estimation for online learning, where a convolutional neural network (CNN) is adopted for engagement level classification.
Although many techniques for automatic engagement estimation have been proposed, in most cases a recommendation of how to implement the models or tools in the actual learning process is omitted. Therefore, in this paper, we propose a framework that shows the whole picture of real-time engagement estimation, from the input data and data processing to the classification model, together with a recommendation of how to implement the tools in a learning management system. We use a publicly available engagement dataset, the Dataset for Affective States in E-Environments (DAiSEE; Gupta et al., 2016), to train the model and classify the images into one of three engagement levels: very engaged, normally engaged, or not engaged.
2. Related Work
Several methods have been proposed to automatically estimate the engagement level in online learning by extracting various traits: features captured through computer vision analysis (e.g., facial expression, eye gaze, and body pose), physiological and neurological sensor analysis, and analysis of learners' activity log-files in online learning (Dewan et al., 2019). Cocea and Weibelzahl (2009, 2011), Sundar and Kumar (2016), and Aluja-Banet et al. (2019) used data mining and machine learning approaches to estimate engagement from learners' actions in online learning, such as total time spent studying, number of forum posts, average time to solve a problem, and number of pages accessed, which are stored in log-files. However, in log-file analysis, the annotation is not straightforward, since many attributes need to be analyzed: Cocea and Weibelzahl (2009, 2011) analyzed 30 attributes, Sundar and Kumar (2016) combined them with the user profile, and Aluja-Banet et al. (2019) added 14 behavioral indicators to their analyses.
On the other hand, computer vision-based methods offer several ways to estimate learners' engagement by exploiting appearance features such as body pose, eye gaze, and facial expression. Grafsgaard et al. (2013), Whitehill et al. (2014), and Monkaresi et al. (2017) used machine learning to estimate engagement from facial expression features. They used machine learning toolboxes, e.g., the Computer Expression Recognition Toolbox (CERT) and WEKA, for face tracking and classification. However, using such toolboxes automates part of the classification process but not the implementation in a real-time educational setting, since humans must manually input the extracted features. In contrast, Nezami et al. (2017, 2018) and Dewan et al. (2018) used deep learning to build their own classification models to estimate the engagement of online learners, which makes it possible to perform the pre-processing in the same way in both the deployment and the training processes, so that the input for engagement prediction follows the same distribution as the input used to train the classification model.
Therefore, in this work, we focus on utilizing deep learning for the real-time implementation of automatic engagement estimation. First, we draw the framework to show the whole mechanism: while learners join online learning, the tool captures their faces through a webcam or a PC's built-in camera and records the engagement state in a file. The file contains all the learners' engagement state records and can be downloaded at any time by the educator to evaluate their teaching or course planning. For the engagement classification model, we use a convolutional neural network (CNN) to classify each real-time image into the very engaged, normally engaged, or not engaged class.
To train the model, we use the DAiSEE dataset, applying the same feature extraction that is used to extract the learners' face features while they join online learning. We use a CNN because it is relatively simple and is one of the deep learning methods broadly used in the literature (Gudi et al., 2015; Li & Deng, 2020; Murshed et al., 2019; Nezami et al., 2017). Furthermore, we believe that simplicity and cost efficiency are the keys to a reliable implementation of engagement estimation in the actual online learning process.
3. Proposed Framework
As shown in Figure 1(b), in the output part, the engagement log file contains the learner's engagement state together with the time at which each state was recorded, plus the average engagement state of the learner at the moment he or she signs out from the LMS or leaves the course content page.
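For illustration, such a log might take a simple CSV form; the column names and values below are our own assumption, not a format prescribed by the framework:

    timestamp,learner_id,engagement_state
    2020-11-10 09:00:10,student01,very engaged
    2020-11-10 09:00:20,student01,normally engaged
    2020-11-10 09:00:30,student01,not engaged
    2020-11-10 09:30:00,student01,average: normally engaged

Each row corresponds to one periodic prediction, and the final row stores the session average written when the learner signs out.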
Figure 1: (a) The proposed framework of the engagement assessment system; the classification system is generically depicted in (b).
3.1. Pre-process
The term pre-process in this work refers to the processing of the input video to be fed as input to the classification model. For fully automatic engagement estimation, the pre-process is required not only when building the classification model, where the input is a set of reference images with engagement state labels, but also when the system is running, where the input is the real-time video stream of a learner joining online learning, whose engagement state needs to be classified. The pre-processing when the system is online must be done in the same way as the pre-processing for training the classification model, so that the images to be predicted are in the same distribution as the images used to train the model.
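A minimal sketch of such a shared pre-processing step, assuming OpenCV's stock Haar cascade as the V&J detector and a 48x48 grayscale input size (the input size is our assumption, not a value stated in this paper):

    import cv2
    import numpy as np

    # The same detector and transform serve both training and the live stream.
    FACE_CASCADE = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def preprocess(frame, size=(48, 48)):
        """Detect the largest face with the V&J cascade, crop, resize, normalize."""
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = FACE_CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            return None  # no face found in this frame
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep largest detection
        face = cv2.resize(gray[y:y + h, x:x + w], size)
        return face.astype(np.float32) / 255.0  # scale pixel values to [0, 1]

Calling the same preprocess function from the training script and from the online system is what keeps the two input distributions aligned.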
In this work, the pre-processing comprises the Viola-Jones (V&J) face detector (Viola & Jones, 2004), in which rectangle features are used to detect the presence of facial features in the given images. Figure 2 shows the three types of rectangle features used in V&J face detection, i.e., the two-rectangle, three-rectangle, and four-rectangle features. The sum of the pixels under the white rectangle is subtracted from the sum of the pixels under the black rectangle, resulting in a single value for each feature.
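To make the rectangle-feature computation concrete, here is a small sketch of a two-rectangle feature evaluated via an integral image, which is how V&J makes these sums cheap; the region coordinates are arbitrary illustration:

    import numpy as np

    def rect_sum(ii, x, y, w, h):
        """Sum of pixels in the w-by-h rectangle at (x, y), from integral image ii."""
        # ii is padded with a leading row and column of zeros.
        return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

    img = np.random.rand(24, 24)  # a toy 24x24 face window
    ii = np.pad(img, ((1, 0), (1, 0))).cumsum(0).cumsum(1)  # integral image

    # Two-rectangle feature: white sum subtracted from the adjacent black sum.
    white = rect_sum(ii, x=4, y=6, w=8, h=4)
    black = rect_sum(ii, x=4, y=10, w=8, h=4)
    feature_value = black - white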
3.2. Classification Model
The classification model in this work employs a convolutional neural network (CNN) for engagement classification, using the image features obtained from V&J face detection. We use a typical CNN architecture that contains an input layer, multiple hidden layers, and an output layer. The hidden layers combine convolutional layers, activation layers, pooling layers, normalization layers, and fully connected layers, which we group into convolution blocks and a fully connected block, as depicted in Figure 1(b).
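A minimal sketch of such an architecture in Keras, assuming the 48x48 grayscale input from the pre-process above; the depth, filter counts, and other hyper-parameters here are our assumptions, not the exact configuration used in the experiments:

    from tensorflow.keras import layers, models

    def build_model(input_shape=(48, 48, 1), num_classes=3):
        """Two convolution blocks followed by a fully connected block."""
        model = models.Sequential([
            layers.Input(shape=input_shape),
            # Convolution block 1: conv -> activation -> normalization -> pooling
            layers.Conv2D(32, 3, padding="same", activation="relu"),
            layers.BatchNormalization(),
            layers.MaxPooling2D(),
            # Convolution block 2
            layers.Conv2D(64, 3, padding="same", activation="relu"),
            layers.BatchNormalization(),
            layers.MaxPooling2D(),
            # Fully connected block: very / normally / not engaged
            layers.Flatten(),
            layers.Dense(128, activation="relu"),
            layers.Dropout(0.5),
            layers.Dense(num_classes, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="categorical_crossentropy",
                      metrics=["accuracy"])
        return model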
3.3. Dataset
Figure 4: Samples of extracted images from the DAiSEE dataset. (a)-(e), (d)-(h), and (i)-(m) are images labelled as very engaged, not engaged, and normally engaged, respectively.
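The abstract mentions extracting images from every 10-second video snippet; a minimal sketch of that step, assuming OpenCV and one frame taken per interval (the file handling and the one-frame-per-snippet choice are our assumptions):

    import cv2

    def extract_frames(video_path, interval_s=10):
        """Grab one frame every interval_s seconds from a DAiSEE clip."""
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        frames, idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % int(fps * interval_s) == 0:  # every 10 s of footage
                frames.append(frame)
            idx += 1
        cap.release()
        return frames

Each extracted frame would then be passed through the shared preprocess function before being fed to the CNN.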
4. Experiment and Results
With the network and hyper-parameters set as above, we obtained a training accuracy of 71% and a validation accuracy of 62%. To build a web-based application with the classifier model we obtained, we used the Flask framework for Python; a screenshot of the running application is shown in Figure 6.
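A minimal sketch of how such a Flask endpoint might look, assuming the trained model has been saved to disk, that the browser posts a webcam frame as an image file, and that preprocess is the shared function sketched in Section 3.1; the route, field names, file name, and class ordering are all hypothetical:

    import numpy as np
    import cv2
    from flask import Flask, request, jsonify
    from tensorflow.keras.models import load_model

    app = Flask(__name__)
    model = load_model("engagement_cnn.h5")  # hypothetical saved model file
    CLASSES = ["not engaged", "normally engaged", "very engaged"]  # assumed order

    @app.route("/predict", methods=["POST"])
    def predict():
        # Decode the uploaded frame and pre-process it the same way as in training.
        data = np.frombuffer(request.files["frame"].read(), np.uint8)
        frame = cv2.imdecode(data, cv2.IMREAD_COLOR)
        face = preprocess(frame)  # shared pre-process from Section 3.1
        if face is None:
            return jsonify({"state": "no face detected"})
        probs = model.predict(face[np.newaxis, ..., np.newaxis])[0]
        return jsonify({"state": CLASSES[int(np.argmax(probs))]})

The returned state can then be appended to the engagement log file described in Section 3.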
5. Conclusion
In this section, we conclude this paper with a brief discussion of our main contribution, based on the description in the previous sections, and consider the limitations and future work.
For future work, a more in-depth analysis of the dataset should consider its annotation method and data distribution. Furthermore, intensity normalization can be considered in the feature extraction to address the illumination problem. In addition, trying out other engagement datasets such as EmotiW 2018 (Dhall et al., 2018; Niu et al., 2018) is another possibility.
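As one concrete option for the intensity normalization mentioned above, histogram equalization of the cropped face is a common choice (our suggestion, not a step evaluated in this paper); it would slot into the preprocess function sketched earlier:

    import cv2

    def normalize_intensity(gray_face):
        """Spread the grayscale histogram to reduce illumination differences."""
        return cv2.equalizeHist(gray_face)  # expects an 8-bit single-channel image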
Another limitation of this work is associated with the neural network we used for engagement estimation, whose results are far from perfect. We acknowledge this limitation because we used a typical CNN model that works by minimizing a loss function, which is computationally feasible but can yield inaccurate predictions (Murshed et al., 2019). Other features, such as head pose, eye gaze, and the distance between the monitor and the face, can be considered as additional input features for better accuracy.
It is our hope that the engagement state can be included in student assessment in online learning, with the engagement estimated fully automatically. Such fully automatic estimation is expected to lead to more effective learning and teaching, especially in an online learning environment. To that end, we presented a framework for building an online learning scenario with a built-in engagement estimation tool, in which future improvements to the training dataset and the pre-processing, both for building the classification model and when the system is running online, might increase the accuracy and overcome the overfitting.
References
Alexander, K. L., Entwisle, D. R., & Horsey, C. S. (1997). From First Grade Forward:
Early Foundations of High School Dropout. Sociology of Education, 70(2), 87.
https://doi.org/10.2307/2673158
Aluja-Banet, T., Sancho, M.-R., & Vukic, I. (2019). Measuring motivation from the
Virtual Learning Environment in secondary education. Journal of Computational
Science, 36, 100629. https://doi.org/10.1016/j.jocs.2017.03.007
Chaouachi, M., Chalfoun, P., Jraidi, I., & Frasson, C. (2010). Affect and mental
engagement: Towards adaptability for intelligent systems. Proceedings of the 23rd
International Florida Artificial Intelligence Research Society Conference, FLAIRS-
23, Flairs, 355–360.
Cocea, M., & Weibelzahl, S. (2009). Log file analysis for disengagement detection in
e-Learning environments. User Modeling and User-Adapted Interaction, 19(4), 341–
385. https://doi.org/10.1007/s11257-009-9065-5
Dewan, M. A. A., Lin, F., Wen, D., Murshed, M., & Uddin, Z. (2018). A Deep
Learning Approach to Detecting Engagement of Online Learners. 2018 IEEE
SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing,
Scalable Computing & Communications, Cloud & Big Data Computing, Internet of
People and Smart City Innovation
(SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), 1895–1902.
https://doi.org/10.1109/SmartWorld.2018.00318
Dewan, M. A. A., Murshed, M., & Lin, F. (2019). Engagement detection in online
learning: a review. Smart Learning Environments, 6(1), 1.
https://doi.org/10.1186/s40561-018-0080-z
Dhall, A., Kaur, A., Goecke, R., & Gedeon, T. (2018). EmotiW 2018. Proceedings of
the 2018 on International Conference on Multimodal Interaction - ICMI ’18, 653–
656. https://doi.org/10.1145/3242969.3264993
Goldberg, B. S., Sottilare, R. A., Brawner, K. W., & Holden, H. K. (2011). Predicting
Learner Engagement during Well-Defined and Ill-Defined Computer-Based
Intercultural Interactions. In Lecture Notes in Computer Science (including subseries
Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics): Vol.
6974 LNCS (Issue PART 1, pp. 538–547). https://doi.org/10.1007/978-3-642-24600-
5_57
Grafsgaard, J. F., Wiggins, J. B., Boyer, K. E., Wiebe, E. N., & Lester, J. C. (2013).
Automatically recognizing facial expression: Predicting engagement and frustration.
Proceedings of the 6th International Conference on Educational Data Mining, EDM
2013.
Gudi, A., Tasli, H. E., den Uyl, T. M., & Maroulis, A. (2015). Deep learning based
FACS Action Unit occurrence and intensity estimation. 2015 11th IEEE International
Conference and Workshops on Automatic Face and Gesture Recognition (FG), 2015-
Janua, 1–5. https://doi.org/10.1109/FG.2015.7284873
Gupta, A., D’Cunha, A., Awasthi, K., & Balasubramanian, V. (2016). DAiSEE:
Towards User Engagement Recognition in the Wild. 14(8), 1–12.
http://arxiv.org/abs/1609.01885
Kaur, A., Ghosh, B., Singh, N. D., & Dhall, A. (2019). Domain Adaptation based
Topic Modeling Techniques for Engagement Estimation in the Wild. 2019 14th IEEE
International Conference on Automatic Face & Gesture Recognition (FG 2019), 1–6.
https://doi.org/10.1109/FG.2019.8756511
Lee, J.-S. (2014). The Relationship Between Student Engagement and Academic
Performance: Is It a Myth or Reality? The Journal of Educational Research, 107(3),
177–185. https://doi.org/10.1080/00220671.2013.807491
Li, S., & Deng, W. (2020). Deep Facial Expression Recognition: A Survey. IEEE
Transactions on Affective Computing, 3045(c), 1–1.
https://doi.org/10.1109/TAFFC.2020.2981446
Monkaresi, H., Bosch, N., Calvo, R. A., & D’Mello, S. K. (2017). Automated
Detection of Engagement Using Video-Based Estimation of Facial Expressions and
Heart Rate. IEEE Transactions on Affective Computing, 8(1), 15–28.
https://doi.org/10.1109/TAFFC.2016.2515084
Murshed, M., Dewan, M. A. A., Lin, F., & Wen, D. (2019). Engagement Detection in
e-Learning Environments using Convolutional Neural Networks. 2019 IEEE Intl Conf
on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive
Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf
on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech),
80–86. https://doi.org/10.1109/DASC/PiCom/CBDCom/CyberSciTech.2019.00028
Nezami, O. M., Dras, M., Hamey, L., Richards, D., Wan, S., & Paris, C. (2018).
Automatic Recognition of Student Engagement using Deep Learning and Facial
Expression. ArXiv, 2. http://arxiv.org/abs/1808.02324
Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of
simple features. Proceedings of the 2001 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition. CVPR 2001, 1, I-511-I–518.
https://doi.org/10.1109/CVPR.2001.990517
Viola, P., & Jones, M. J. (2004). Robust Real-Time Face Detection. International
Journal of Computer Vision, 57(2), 137–154.
https://doi.org/10.1023/B:VISI.0000013087.49260.fb
Whitehill, J., Serpell, Z., Lin, Y. C., Foster, A., & Movellan, J. R. (2014). The faces
of engagement: Automatic recognition of student engagement from facial expressions.
IEEE Transactions on Affective Computing, 5(1), 86–98.
https://doi.org/10.1109/TAFFC.2014.2316163
Woolf, B., Burleson, W., Arroyo, I., Dragon, T., Cooper, D., & Picard, R. (2009).
Affect-aware tutors: recognising and responding to student affect. International
Journal of Learning Technology, 4(3/4), 129.
https://doi.org/10.1504/IJLT.2009.028804