Real-time Attention Span Tracking in Online Education
Rahul RK, Shanthakumar S, Vykunth P, Sairamnath K
Department of Information Technology, Sri Venkateswara College of Engineering, Sriperumbudur (Autonomous, Affiliated to Anna University)
Abstract—Over the last decade, e-learning has revolutionized how students learn by providing them access to quality education whenever and wherever they want. However, students often get distracted for various reasons, which affects their learning capacity to a great extent. Many researchers have been trying to improve the quality of online education, but we need a holistic approach to address this issue. This paper intends to provide a mechanism that uses the camera feed and microphone input to monitor the real-time attention level of students during online classes. We explore various image processing techniques and machine learning algorithms throughout this study. We propose a system that uses five distinct non-verbal features to calculate the attention score of the student during computer-based tasks and generate real-time feedback for both students and the organization. We can use the generated feedback as a heuristic value to analyze the overall performance of students as well as the teaching standards of the lecturers.
Keywords—Artificial Intelligence, Attention, Blink rate, Drowsiness, Eye gaze tracking, Emotion classification, Face recognition, Body Posture estimation, Noise detection.
I. INTRODUCTION
The demand and need for online education are increasing rapidly. Almost all the schools and colleges throughout the world have shifted to the online mode of lectures and exams due to the recent coronavirus outbreak, and this trend will most likely continue in the upcoming years. The increasing demand for online education opens the gate to automation in the field. One major issue in the online mode of lectures is that students tend to lose their concentration after a certain period and there is no automated mechanism to monitor their activities during the classes. Some students just start a lecture online and move away from the place, or might even use a proxy to write online tests for them. This situation also arises on online course platforms such as edX and Coursera, where the student tries to skip lectures just for the sake of completion and certification. The loss of concentration not only affects the student’s knowledge level but also hurts society by producing low-skilled laborers. We propose a solution in our paper to address this issue. The paper is structured as follows: Section 2 reviews the literature, Section 3 describes the proposed methodology and its working, Section 4 evaluates the performance, and Section 5 concludes the paper and discusses ideas for future work.
II. LITERATURE REVIEW
The purpose of attention span detection during online classes is to gather data and analyze the state of the student, and to evaluate their performance based on concentration level instead of just academic scores. According to [1], the average blink rate of a person is between 8 and 21 blinks per minute, but when the person is deeply focused on a specific visual task, the blink rate drops significantly to an average of 4.5 blinks per minute. Likewise, the blink rate rises to over 32.5 blinks per minute when the individual’s concentration level is low. The study [2] explores how the emotional state of students varies during the learning process and how emotional feedback can improve learning experiences. Emotional states such as happiness, joy, surprise, and neutral denote a positive, constructive learning experience, whereas emotions like sadness, fear, anger, and disgust represent a negative experience. The study [3] discusses how mind wandering can have negative effects on performance and how eye gaze data, collected using a dedicated eye tracker, can automatically detect loss of attention during computer-based tasks. The eye behavior patterns were observed, and it was found that the patterns varied distinctly during mind wandering.
Article [4] shows how the loss of attention significantly affects the learning efficiency of students. The review paper [5] suggests that drowsiness, caused by sleep, restlessness, and mental pressure, is one of the major factors that lead to loss of attention. Various state-of-the-art drowsiness detection techniques were compared, and it was found that the Haar classifier and Support Vector Machine (SVM) give better accuracy in real-time scenarios. The study [6] establishes evidence that the attention span of students is contingent upon the environmental noise conditions, and the results suggest that noise levels greater than 75 dB have a serious impact on the accuracy of the students. The paper [7] proposes an automated facial recognition model using a Convolutional Neural Network (CNN) and Principal Component Analysis (PCA). The study [8] proposes a two-layer CNN to learn the high-level sparse and selective facial feature maps. The Sparse Representation Classifier (SRC) improves the performance by using a sparsely selected feature extractor. The study [9] examines the features of body posture and head pose to predict the user’s attention level by identifying patterns of behavior associated with attention.
From the literature review, we have identified that five parameters (blink rate, facial expression, eye gaze, background noise, and body posture) make a good feature set to assess the attention level of the students.
III. PROPOSED METHODOLOGY
This study makes use of five parameters to calculate the attention-span level of the student attending the online class. Facial recognition is used to validate the student’s attendance. The attention span score is calculated using blink rate, facial expression, eye gaze, background noise, and body posture, and is updated continuously over a window length of 5 seconds. Instead of sequential execution, all the models required to calculate the attention span are executed in parallel once the online lecture starts. This is achieved by running all the functions in separate threads, which plays a major role in reducing the time consumption of each model as well as the whole system. Every 5 seconds, the model generates the attention span score and provides real-time feedback to the students in the form of live graphs, which are plotted for each parameter as well as the calculated attention span score. The following sections explain each of the models used in this study in detail, along with their significance in calculating the attention span. The overall architecture of the proposed system is shown in Fig. 1 and the working of each module is represented as flowcharts in Fig. 2.
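The parallel, windowed execution described in the methodology can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the five module functions are hypothetical stand-ins that return a fixed score in [0, 1], and a thread pool from Python's standard library is assumed as the threading mechanism.

```python
from concurrent.futures import ThreadPoolExecutor

WINDOW_SECONDS = 5  # window length stated in the text

# Hypothetical stand-ins for the five scoring modules. Each takes the data
# captured during one window and returns a score in [0, 1].
def blink_score(window):
    return 0.9

def gaze_score(window):
    return 0.7

def emotion_score(window):
    return 0.8

def posture_score(window):
    return 0.6

def noise_score(window):
    return 1.0

MODULES = {
    "blink": blink_score,
    "gaze": gaze_score,
    "emotion": emotion_score,
    "posture": posture_score,
    "noise": noise_score,
}

def score_window(window):
    """Run all modules on the same window concurrently and collect their scores."""
    with ThreadPoolExecutor(max_workers=len(MODULES)) as pool:
        futures = {name: pool.submit(fn, window) for name, fn in MODULES.items()}
        return {name: future.result() for name, future in futures.items()}

if __name__ == "__main__":
    # In a live system this call would repeat once every WINDOW_SECONDS.
    print(score_window({"frames": [], "audio": []}))
```

Submitting every module to the pool before collecting any result is what lets the modules overlap in time; how much this helps in practice depends on how much of each module's work runs outside the Python interpreter lock.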
978-1-7281-7571-3/20/$31.00 ©2020 IEEE
Fig. 1. The architecture of the proposed system.
Fig. 2. Flowcharts of each module: (a) Blink rate detection, (b) Eye-gaze tracking, (c) Emotion classification, (d) Face recognition, (e) Body posture, (f) Noise level.
A. Facial Landmark Detection
Face detection is implemented using the Viola-Jones algorithm [13], which uses a windowing mechanism to scan images for features of human faces. The paper [8] provides an efficient real-time approach to extract 68 key points from the detected face image using OpenCV and the dlib library, as shown in Fig. 3. We use rectangular regions to extract Haar features from the image. The landmarks are classified into five categories of facial features: eyebrows, eyes, nose, mouth, and jaw, which are denoted sequentially using the key points. We use these individual landmark features as inputs to further modules. We can improve the accuracy of the face detection module by computing more key points. However, this increases the processing time.
Fig. 3. Face detection and facial landmarks.
B. Blink Rate Detection
Blink rate is one of the important factors in determining the state of mind, that is, whether the student is actively listening or drowsy during the class. In this module, we crop the regions containing the eye pairs and divide each eye into two halves. We calculate the Eye Aspect Ratio (EAR) using Euclidean distances between the eye key points (Fig. 4a) for every frame as per Formula (1) to identify whether the eyes are open or closed. We also have a timer, which is activated once a blink is detected, to keep track of the number of seconds the eyes are closed. It can be concluded that the user is feeling drowsy (loss of attention) if the eyes are found to be closed for more than two seconds [5], and an alert is then raised both visually, as shown in Fig. 4c, and with a warning alarm sound. We count the number of blinks continuously over an interval of 5 seconds to determine the average blink rate of the user. The EAR threshold value is set to 0.2 based on test experiments. We tested the blink detection module on 306 images comprising 156 closed eyes and 150 open eyes, and classified the blinks with an accuracy of 91.02% and open eyes with an accuracy of 92.66%.
$$\mathrm{EAR} = \frac{\lVert p_2 - p_6 \rVert + \lVert p_3 - p_5 \rVert}{2\,\lVert p_1 - p_4 \rVert} \tag{1}$$
Fig. 4. Blink rate and drowsiness detection: (a) Eye key points (p1, p2, p3, p4, p5, p6), (b) Opened eye, (c) Closed eye and drowsiness alarm.
C. Eye-gaze Tracking
The eye gaze of a student is tracked to determine where they are looking, which is often closely associated with the distraction level of the student. As suggested in [10], we analyze the extracted eye region coordinates for rectangular features to identify eye regions containing the pupil. The pupil coordinates (x, y) of each eye (Fig. 5a) are calculated and mapped to determine the eye gaze direction. Based on the resolution of the screen, we have established two possible classes of eye gaze: looking at the screen (Fig. 5c) and looking away (Fig. 5b), either right or left. We collected 150 images with two different classes: looking straight and looking away. Our model was able to classify the eye gaze correctly for 113 images, an accuracy of 75.33%.
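The two eye-based checks from the blink rate and eye-gaze modules can be sketched together. This is an illustrative sketch, not the authors' code: the EAR function uses the commonly cited definition over six eye key points, which is assumed to correspond to Formula (1); the 0.2 threshold is the value reported in the text; the gaze classifier and its `low`/`high` bounds are hypothetical choices that place the pupil within the horizontal extent of the eye region.

```python
import math

EAR_THRESHOLD = 0.2  # threshold reported in the text

def eye_aspect_ratio(p1, p2, p3, p4, p5, p6):
    """EAR from six (x, y) eye key points.

    p1 and p4 are the eye corners, p2 and p3 lie on the upper lid,
    and p6 and p5 lie below them on the lower lid.
    """
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    horizontal = math.dist(p1, p4)
    if horizontal == 0:
        raise ValueError("eye corners coincide")
    return vertical / (2.0 * horizontal)

def is_eye_closed(ear, threshold=EAR_THRESHOLD):
    """Treat the eye as closed when the EAR falls below the threshold."""
    return ear < threshold

def classify_gaze(pupil_x, eye_left_x, eye_right_x, low=0.35, high=0.65):
    """Return 'center' or 'away' from the pupil's relative horizontal position.

    The low/high bounds are assumed values, not taken from the paper.
    """
    width = eye_right_x - eye_left_x
    if width <= 0:
        raise ValueError("eye region has no width")
    ratio = (pupil_x - eye_left_x) / width
    return "center" if low <= ratio <= high else "away"

if __name__ == "__main__":
    open_eye = [(0, 0), (2, 1), (4, 1), (6, 0), (4, -1), (2, -1)]
    print(is_eye_closed(eye_aspect_ratio(*open_eye)))
    print(classify_gaze(pupil_x=50, eye_left_x=40, eye_right_x=60))
```

Because the EAR is a ratio of distances, it is largely insensitive to how far the student sits from the camera, which is one reason a single fixed threshold can work across frames.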
Fig. 5. Eye-gaze tracking: (a) Pupil coordinates (left, center, and right), (b) Looking away, (c) Looking center.
D. Emotion Classification
The emotion of the person attending the online class plays a major role in their attention level. This study uses facial features such as eyes, nose, and mouth, extracted using a Haar-cascade classifier and a facial landmark detector. A Support Vector Machine (SVM) algorithm is used to classify the emotion of the students into seven different classes: angry, disgust, fear, happy, sad, surprise, and neutral. A score is given for each emotion depending on its effect on the attention level of the user. The traditional method for emotion classification pre-processing uses only the cropped eye. However, [11] proposed an alternative solution to include mouth features for better accuracy. As this method only uses Haar cascades to classify the emotion, the processing speed was much faster when compared to Sobel edge eye detection. The model was validated using the JAFFE dataset, which contains 213 images of 7 emotions posed by 10 different Japanese women. The training set consisted of 42 and 7 classes of emotion. The test set consisted of 70 images. We obtained an average accuracy of 82.55% on our test set (Fig. 6).
Fig. 6. Emotion Classification: (a) Happy, (b) Neutral, (c) Sad, (d) Surprise.
E. Face Recognition
A robust facial recognition system is essential to authenticate the student based on biometrics, to avoid student proxies, and to automate the attendance management process using the webcam feed during classes. In this module, we modified the architecture of [8] and implemented a three-layer Convolutional Neural Network (CNN) consisting of three convolutional layers with max-pooling, a fully connected layer with dropout, and a Sparse Representation Classifier (SRC) output layer. The dropout layer helps in reducing the computational cost. We created a dataset consisting of 500 images with four different faces. We split the dataset into 300 images for training and 200 images for validation. After 15 epochs of training, we attained a training accuracy of 94.8% and a validation accuracy of 90% (Fig. 7).
Fig. 7. Face Recognition
F. Body Posture Estimation
Many researchers have used CNN or R-CNN to estimate the pose with high accuracy. However, the main goal of our research is to calculate the attention level of the student in real time without compromising on processing time. Hence, we make use of the TensorFlow pose estimator (PoseNet), based on MobileNet SSD, to estimate the posture of the student. This model uses the heat map to estimate the pose difference between one frame and the previous frame. PoseNet can identify 17 key points including the face, shoulders, elbows, wrists, hips, knees, and ankles. We assign a pixel similarity score by comparing the change of head pose and body posture between consecutive frames to predict whether the student is restless or focused during the online lecture.
Fig. 8. Body Posture Estimation
G. Background Noise Detection
We use the Python package PyAudio to capture the input audio from the device’s microphone. The background noise during the class might affect the concentration level of the student. The average sound level in a school is 50 dB, and 75 dB is set as the threshold for loud noise [6]. Anything above 75 dB is considered a noisy environment, and the scores are inversely proportional to the background noise. The model monitors the background noise continuously and we calculate the average noise level every 5 seconds.
H. Overall Attention level Detection
The scores from the above parameters (blink rate detection, eye gaze tracking, emotion classification, body posture estimation, and background noise detection) are normalized to calculate the attention score as per Formula (2). We plot live graphs, as shown in Fig. 9, with the predicted attention level of the student along with the scores for each parameter updated in real time. We do not use face recognition in the scoring method because it does not contribute to determining the attention level of the student; rather, we use it for biometric authentication and automated attendance of the students.
∑ (…) ∗ 100    (2)
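The exact form of Formula (2) is not legible in this copy, so the following is only a sketch of one plausible normalization, not the authors' formula. It assumes that each of the five parameter scores has already been scaled to [0, 1], that all parameters are weighted equally, and that the attention score is their mean expressed as a percentage. The `noise_score` helper is likewise hypothetical: it maps the 50 dB and 75 dB levels mentioned in the text linearly onto [1, 0].

```python
QUIET_DB = 50.0  # average sound level in a school, as cited in the text
LOUD_DB = 75.0   # threshold for loud noise, as cited in the text

def clamp01(x):
    """Clamp a value to the range [0, 1]."""
    return max(0.0, min(1.0, x))

def noise_score(level_db, quiet_db=QUIET_DB, loud_db=LOUD_DB):
    """Hypothetical mapping: 1.0 at or below quiet_db, 0.0 at or above loud_db."""
    return clamp01((loud_db - level_db) / (loud_db - quiet_db))

def attention_score(scores):
    """Equal-weight mean of the parameter scores, scaled to 0-100 (assumed form)."""
    if not scores:
        raise ValueError("at least one parameter score is required")
    normalized = [clamp01(v) for v in scores.values()]
    return 100.0 * sum(normalized) / len(normalized)

if __name__ == "__main__":
    window_scores = {
        "blink": 1.0,
        "gaze": 0.5,
        "emotion": 0.5,
        "posture": 1.0,
        "noise": noise_score(62.5),
    }
    print(attention_score(window_scores))
```

Unequal weights would be a natural extension if some parameters prove more predictive than others, but the text does not state any weighting, so none is assumed here.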
Fig. 9. Live graph plotting with real-time attention level
IV. PERFORMANCE EVALUATION
The system’s performance was analyzed using a dataset of 15 undergraduate students consisting of nine males and six females. The students were asked to attend online lectures on different topics, each for 500 seconds, and three human observers were asked to provide an observed attention score based on the recorded web camera video, which is used as the ground truth. We compared the predicted attention scores with the observed scores to evaluate the overall performance of our system. Fig. 10 shows the comparison between predicted scores and observed scores, and Table I shows the system’s performance metrics.
Fig. 10. Predicted Attention Score vs Observed Attention Score Comparison.
TABLE I. PERFORMANCE METRICS
Metric    RMSE      MAE      R2       MAPE
Value     11.152    9.837    0.154    15.248
Our system was able to perform quite well given the limited data used for training the models. We obtained the overall accuracy of our attention-tracking model by taking the average of the accuracies of each module. Compiling OpenCV’s DNN module and Caffe with CUDA support improved the performance and significantly reduced the inference time of our models, as shown in Table II. We achieved an overall accuracy of 84.6233%.
TABLE II. SYSTEM PERFORMANCE
Module                    Accuracy     Inference time
Facial Landmarks          89.67 %      0.033 ms
Blink rate detection      91.02 %      0.026 ms
Eye gaze tracking         75.33 %      0.032 ms
Emotion classification    82.55 %      0.057 ms
Facial recognition        90.11 %      0.052 ms
Body posture              79.06 %      0.048 ms
Overall System            84.6233 %    0.258 ms
V. CONCLUSION AND FUTURE WORKS
In this paper, we have implemented a system to tackle the issues involved in online education using five parameters. We used the face recognition model to verify the student attending the online class. We used the other five parameters (blink rate, eye gaze, emotion, posture, and noise level) to calculate the attention level of the student throughout the lecture. Since this involves real-time processing, we have implemented and used lightweight models to reduce the processing time. We visualize the scores in the form of a live graph and generate automated reports. The feedback generated can be used for:
1) Evaluating student performance
2) Improving teaching standards
3) Preventing malpractice during online examinations
As part of future work, we can improve our system’s performance further by training our models using more data. Also, the same attention tracking mechanism can be further optimized to work simultaneously with multiple subjects in a classroom using video footage from CCTV cameras. Moreover, we have used human-observed attention scores as ground truth values, as we currently do not have any dataset for measuring the attention span during online lectures. A standard dataset can help to evaluate the system’s performance more reliably.
VI. REFERENCES
[1] Mark B. Abelson, MD, CM, FRCSC, ORA staff, Andover, Mass, "It’s Time to Think About the Blink," Review of Ophthalmology, June 2011.
[2] Liping Shen, Minjuan Wang, and Ruimin Shen, "Affective e-Learning: Using 'Emotional' Data to Improve Learning in a Pervasive Learning Environment," International Forum of Educational Technology & Society, 2009.
[3] Bixler, R., D’Mello, S., "Automatic gaze-based user-independent detection of mind wandering during computerized reading," User Model User-Adap Inter 26, 33–68, 2016.
[4] Smallwood, J., Fishman, D.J. & Schooler, J.W., "Counting the cost of an absent mind: Mind wandering as an underrecognized influence on educational performance," Psychonomic Bulletin & Review 14, 230–236, 2007.
[5] M. Ramzan, H. U. Khan, S. M. Awan, A. Ismail, M. Ilyas, and A. Mahmood, "A Survey on State-of-the-Art Drowsiness Detection Techniques," in IEEE Access, vol. 7, pp. 61904-61919, 2019.
[6] Zhen Zhang, Yuan Zhang, "An Experimental Study on the Influence of Environmental Noise on Students’ Attention," EuroNoise 2018 conference.
[7] S. Sawhney, K. Kacker, S. Jain, S. N. Singh, and R. Garg, "Real-time Smart Attendance System using Face Recognition Techniques," 2019 9th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 2019, pp. 522-525.
[8] Eric-Juwei Cheng, Kuang-Pen Chou, Shantanu Rajora, Bo-Hao Jin, M. Tanveer, Chin-Teng Lin, Ku-Young Young, Wen-Chieh Lin, Mukesh Prasad, "Deep Sparse Representation Classifier for facial recognition and detection system."
[9] Stanley, Darren, "Measuring attention using Microsoft Kinect" (2013). Thesis. Rochester Institute of Technology.
[10] C. Morimoto, D. Koons, A. Amir, and M. Flickner, "Pupil detection and tracking using multiple light sources," Image Vis. Comput., vol. 18, no. 4, pp. 331–336, 2008.
[11] D. Yang, Abeer Alsadoon, P.W.C. Prasad, A.K. Singh, A. Elchouemi, "An Emotion Recognition Model Based on Facial Recognition in Virtual Learning Environment."
[12] Vahid Kazemi, Josephine Sullivan, "One Millisecond Face Alignment with an Ensemble of Regression Trees," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1867-1874.
[13] Daniel Hefenbrock, Jason Oberg, Nhat Tan Nguyen Thanh, Ryan Kastner, Scott B. Baden, "Accelerating Viola-Jones Face Detection to FPGA-Level using GPUs."