Mooc Dropouts
Figure 1: Dropout rates of different demographics of users. (a) user age (b) course category (c) user education level.
system. Experiments on both datasets show that the proposed method achieves much better performance than several state-of-the-art methods. We have deployed the proposed method in XuetangX to help improve user retention.

Related Work
Prior studies apply generalized linear models (including logistic regression and linear SVMs (Kloft et al. 2014; He et al. 2015)) to predict dropout. Balakrishnan et al. (2013) present a hybrid model that combines Hidden Markov Models (HMM) and logistic regression to predict student retention in a single course. Another attempt, by Xing et al. (2016), uses an ensemble stacking generalization approach to build robust and accurate prediction models. Deep learning methods are also used for predicting dropout. For example, Fei et al. (2015) tackle this problem from a sequence labeling perspective and apply an RNN-based model to predict students' dropout probability. Wang et al. (2017) propose a hybrid deep neural network dropout prediction model by combining a CNN and an RNN. Ramesh et al. (2014) develop a probabilistic soft logic (PSL) framework to predict user retention by modeling student engagement types using latent variables. Cristea et al. (2018) propose a lightweight method that can predict dropout before a user starts learning, based only on her/his registration date. Besides prediction itself, Nagrecha et al. (2017) focus on the interpretability of existing dropout prediction methods. Whitehill et al. (2015) design an online intervention strategy to boost users' callback in MOOCs. Dalipi et al. (2018) review the techniques of dropout prediction and propose several insightful suggestions for this task. Moreover, XuetangX organized KDDCUP 2015² for dropout prediction. In that competition, most teams adopted ensembling strategies to improve prediction performance, and the "Intercontinental Ensemble" team achieved the best performance by assembling over sixty single models.

More recent works mainly focus on analyzing student engagement based on statistical methods and explore how to improve it (Kellogg 2013; Reich 2015). Zheng et al. (2015) apply grounded theory to study users' motivations for choosing a course and to understand the reasons that users drop out of a course. Qiu et al. (2016) study the relationship between student engagement and their certificate rate, and propose a latent dynamic factor graph (LadFG) to model and predict learning behavior in MOOCs.

² https://biendata.com/competition/kddcup2015

Data and Insights
The analysis in this work is performed on two datasets from XuetangX. XuetangX, launched in October 2013, is now one of the largest MOOC platforms in China. It has provided over 1,000 courses and attracted more than 10,000,000 registered users. XuetangX has twelve categories of courses: art, biology, computer science, economics, engineering, foreign language, history, literature, math, philosophy, physics, and social science. Users in XuetangX can choose between two learning modes: Instructor-paced Mode (IPM) and Self-paced Mode (SPM). IPM follows the same course schedule as conventional classrooms, while in SPM one can follow a more flexible schedule and study online by herself/himself. Usually an IPM course spans 16 weeks in XuetangX, while an SPM course spans a longer period. Each user can enroll in one or more courses. When a user studies a course, the system records multiple types of activities: video watching (watch, stop, and jump), forum discussion (asking questions and replying), assignment completion (with correct/incorrect answers, and reset), and web page clicking (click and close a course page).

Two Datasets. The first dataset contains 39 IPM courses and their enrolled students. It was also used for KDDCUP 2015. Table 1 lists statistics of this dataset. With this dataset, we compare our proposed method with existing methods, as the challenge attracted 821 teams to participate. We refer to this dataset as KDDCUP.

The other dataset contains 698 IPM courses and 515 SPM courses. Table 2 lists its statistics. This dataset contains richer information, which can be used to test the robustness and generalization of the proposed method. It is referred to as XuetangX.

Insights
Before proposing our methodology, we try to gain a better understanding of the users' learning behavior. We first per-
Table 1: Statistics of the KDDCUP dataset.

Category     Type                      Number
log          # video activities        1,319,032
             # forum activities        10,763,225
             # assignment activities   2,089,933
             # web page activities     7,380,344
enrollment   # total                   200,904
             # dropouts                159,223
             # completions             41,681
total        # users                   112,448
             # courses                 39

Table 2: Statistics of the XuetangX dataset.

Table 3: Results of clustering analysis. C1-C5 — Cluster 1 to 5; CAR — average correct answer ratio.

Category     Type           C1       C2      C3        C4       C5
video        #watch         21.83    46.78   12.03     19.57    112.1
             #stop          28.45    68.96   20.21     37.19    84.15
             #jump          16.30    16.58   11.44     14.54    21.39
forum        #question      0.04     0.38    0.02      0.03     0.03
             #answer        0.13     3.46    0.13      0.12     0.17
assignment   CAR            0.22     0.76    0.19      0.20     0.59
             #revise        0.17     0.02    0.04      0.78     0.01
session      seconds        1,715    714     1,802     1,764    885
             count          3.61     8.13    2.18      4.01     7.78
enrollment   #enrollment    21,048   9,063   401,123   25,042   10,837
             #users         2,735    4,131   239,302   4,229    4,121
             dropout rate   0.78     0.29    0.83      0.66     0.28
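The clusters in Table 3 group enrollments by their activity statistics. The clustering procedure itself is not spelled out in this excerpt; as a minimal sketch of one standard choice, k-means over per-enrollment feature vectors could look as follows (the toy features and values are illustrative, not drawn from the datasets):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal k-means: cluster feature vectors (lists of floats) into k groups."""
    rnd = random.Random(seed)
    centers = [list(p) for p in rnd.sample(points, k)]

    def nearest(p):
        # index of the center with the smallest squared Euclidean distance
        return min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))

    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p)].append(p)          # assignment step
        for i, g in enumerate(groups):
            if g:                                  # update step: mean of members
                centers[i] = [sum(col) / len(g) for col in zip(*g)]
    return [nearest(p) for p in points]

# toy enrollments described by [#watch, #stop, CAR] (illustrative values only)
enrollments = [[12.0, 20.0, 0.2], [10.0, 18.0, 0.2],
               [110.0, 85.0, 0.6], [115.0, 80.0, 0.6]]
labels = kmeans(enrollments, k=2)
```

On real data, the per-enrollment statistics of Table 3 (watch/stop/jump counts, forum activity, CAR, session length) would form the feature vectors, and the number of clusters would be validated, e.g., by silhouette analysis (Rousseeuw 1987).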
Figure 2: Dropout correlation analysis between courses. The x-axis denotes the weeks from 1 to 16 and the y-axis is the slope of the linear regression for dropout correlation between two different courses. The red line is the result for different-category courses, the green line denotes the slope for same-category courses, and the black line pools results over all courses.

[Figure: prediction timeline — from the learning start, the first Dh days form the history period, and the next Dp days are the prediction period.]

Formulation
In order to formulate this problem more precisely, we first introduce the following definitions.

Definition 2. Enrollment Relation: Let C denote the set of courses, U the set of users, and the pair (u, c) denote that user u ∈ U enrolls in course c ∈ C. The set of courses enrolled by u is denoted as Cu ⊂ C, and the set of users who have enrolled in course c is denoted as Uc ⊂ U. We use E to denote the set of all enrollments, i.e., E = {(u, c)}.

Definition 3. Learning Activity: In MOOCs, user u's learning activities in a course c can be formulated as an mx-dimensional vector X(u, c), where each element xi(u, c) ∈ X(u, c) is a continuous feature value associated with u's learning activity in course c. These features are extracted from users' historical logs and mainly comprise statistics of users' activities.
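Definitions 2 and 3 map directly onto simple data structures. A minimal sketch of building the enrollment set E and the activity vectors X(u, c) from raw logs (the log records and the action schema below are hypothetical stand-ins for XuetangX's real log format):

```python
from collections import Counter

# hypothetical raw log records: (user, course, action)
logs = [
    ("u1", "c1", "watch"), ("u1", "c1", "watch"), ("u1", "c1", "stop"),
    ("u1", "c1", "question"), ("u2", "c1", "watch"),
]

# feature schema: one count statistic per action type, a stand-in
# for the paper's m_x continuous activity features
ACTIONS = ["watch", "stop", "jump", "question", "answer"]

def build_features(logs):
    """Return the enrollment set E = {(u, c)} and one vector X(u, c) per enrollment."""
    counts = {}
    for user, course, action in logs:
        counts.setdefault((user, course), Counter())[action] += 1
    E = sorted(counts)
    X = {e: [float(counts[e][a]) for a in ACTIONS] for e in E}
    return E, X

E, X = build_features(logs)
```

In the paper's setting the entries of X(u, c) are statistics computed over the history period only; raw counts are shown here purely for illustration.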
[Figure: CFIN architecture — embedding layer, attention-based interaction with weighted sum, fully connected DNN layers, and context information Z(u, c), producing the prediction ŷ.]

matrix of X̂_g^(i). After that, the next step is feature fusion. We employ a one-dimensional convolutional neural network (CNN) to compress each E_g^(i) (1 ≤ i ≤ mx) to a vector. More formally, a vector V_g^(i) ∈ R^{d_f} is generated from E_g^(i) by
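The convolution-plus-pooling step that compresses each embedding matrix E_g^(i) into a fixed-length vector V_g^(i) can be sketched in plain Python. The excerpt cuts off before the exact formula, so the filter widths and the max-pooling choice below are assumptions for illustration:

```python
def conv1d_compress(E, filters):
    """Slide each (w x d) filter over the rows of the (T x d) matrix E and
    max-pool over positions, yielding one scalar per filter — i.e., a
    d_f-dimensional vector when d_f filters are supplied."""
    T, d = len(E), len(E[0])
    out = []
    for W in filters:
        w = len(W)  # filter width along the first axis
        acts = [sum(W[i][j] * E[t + i][j] for i in range(w) for j in range(d))
                for t in range(T - w + 1)]  # valid convolution positions
        out.append(max(acts))               # max-pooling over positions
    return out

# one width-2 filter over a 3x2 matrix yields a 1-dim vector
v = conv1d_compress([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                    [[[1.0, 1.0], [1.0, 1.0]]])  # -> [3.0]
```

In a real implementation this step would be a standard 1-D convolution layer (e.g., TensorFlow's `Conv1D`) followed by pooling, applied per feature index i.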
L(Θ) = − Σ_{(u,c)∈E} [ y_{(u,c)} log(ŷ_{(u,c)}) + (1 − y_{(u,c)}) log(1 − ŷ_{(u,c)}) ]    (9)

where Θ denotes the set of model parameters, y_{(u,c)} is the corresponding ground truth, and E is the set of all enrollments.

Model Ensemble
To further improve the prediction performance, we also design an ensemble strategy that combines CFIN with XGBoost (Chen and Guestrin 2016), one of the most effective gradient boosting frameworks. Specifically, we obtain V_d^{(L−1)}, the output of the DNN's (L−1)-th layer, from a successfully trained CFIN model, and use it together with the original features, i.e., X and Z, to train an XGBoost classifier. This strategy is similar to Stacking (Wolpert 1992).

Experiments
We conduct various experiments to evaluate the effectiveness of CFIN on two datasets: KDDCUP and XuetangX.⁴

Table 4: Overall results on the KDDCUP dataset and IPM courses of the XuetangX dataset.

                 KDDCUP               XuetangX
Methods    AUC (%)   F1 (%)     AUC (%)   F1 (%)
LR         86.78     90.86      82.23     89.35
SVM        88.56     91.65      82.86     89.78
RF         88.82     91.73      83.11     89.96
DNN        88.94     91.81      85.64     90.40
GBDT       89.12     91.88      85.18     90.48
CFIN       90.07     92.27      86.40     90.92
CFIN-en    90.93     92.87      86.71     90.95

Table 5: Contribution analysis for different engagements on the KDDCUP dataset and IPM courses of the XuetangX dataset.

                   KDDCUP               XuetangX
Features       AUC (%)   F1 (%)    AUC (%)   F1 (%)
All            90.07     92.27     86.50     90.95
- Video        87.40     91.61     84.40     90.32
- Forum        88.61     91.93     85.13     90.41
- Assignment   86.68     91.39     84.83     90.34
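The objective in Eq. (9) is a per-enrollment binary cross-entropy summed over E. As a minimal sketch in pure Python (the `eps` clamp is a standard numerical guard, not part of the paper's formula):

```python
import math

def dropout_loss(y_true, y_pred, eps=1e-12):
    """L(Θ) = -Σ_{(u,c)∈E} [ y log(ŷ) + (1 - y) log(1 - ŷ) ]  (Eq. 9).

    y_true: ground-truth dropout labels per enrollment (0 or 1).
    y_pred: predicted dropout probabilities ŷ.
    eps clamps predictions away from 0 and 1 so log() stays finite.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total
```

Minimizing this loss over Θ with Adam matches the training setup described in the experiments.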
Experimental Setup
Implementation Details. We implement CFIN with TensorFlow and adopt Adam (Kingma and Ba 2014) to optimize the model. To avoid overfitting, we apply L2 regularization to the weight matrices. We adopt the Rectified Linear Unit (ReLU) (Nair and Hinton 2010) as the activation function. All features are normalized before being fed into CFIN. We test CFIN's performance on both the KDDCUP and XuetangX datasets. For the KDDCUP dataset, the history period and prediction period are set to 30 days and 10 days, respectively, by the competition organizers. We do not use the attention mechanism of CFIN on this dataset, as no context information is provided. For the XuetangX dataset, the history period is set to 35 days and the prediction period to 10 days, i.e., Dh = 35, Dp = 10.

Comparison Methods. We conduct comparison experiments with the following methods:
• LR: logistic regression model.
• SVM: support vector machine with a linear kernel.
• RF: Random Forest model.
• GBDT: Gradient Boosting Decision Tree.
• DNN: 3-layer deep neural network.
• CFIN: the CFIN model.
• CFIN-en: the assembled CFIN using the strategy proposed in Model Ensemble.

For the baseline models (LR, SVM, RF, GBDT, DNN) above, we use all the features (including learning activity X and context information Z) as input. When training the models, we tune the parameters based on 5-fold cross validation (CV) with grid search, and use the best group of parameters in all experiments. The evaluation metrics include Area Under the ROC Curve (AUC) and F1 Score (F1).

⁴ All datasets and code used in this paper are publicly available at http://www.moocdata.cn.

Prediction Performance
Table 4 presents the results on the KDDCUP dataset and the IPM courses of the XuetangX dataset for all comparison methods. Overall, CFIN-en achieves the best performance on both datasets, and its AUC score on the KDDCUP dataset reaches 90.93%, comparable to the winning team of KDDCUP 2015². Compared to LR and SVM, CFIN achieves 1.51–3.29% and 3.54–4.17% AUC improvements on KDDCUP and XuetangX, respectively. Moreover, compared to the ensemble methods (i.e., RF and GBDT) and DNN, CFIN also shows better performance.

Feature Contribution
In order to identify the importance of different kinds of engagement activities in this task, we conduct feature ablation experiments for three major activity features, i.e.,

Table 6: Average attention weights of different clusters. C1-C5 — Cluster 1 to 5; CAR — average correct answer ratio.

Category     Type        C1      C2      C3      C4      C5
video        #watch      0.078   0.060   0.079   0.074   0.072
             #stop       0.090   0.055   0.092   0.092   0.053
             #jump       0.114   0.133   0.099   0.120   0.125
forum        #question   0.136   0.127   0.138   0.139   0.129
             #answer     0.142   0.173   0.142   0.146   0.131
assignment   CAR         0.036   0.071   0.049   0.049   0.122
             #reset      0.159   0.157   0.159   0.125   0.136
session      seconds     0.146   0.147   0.138   0.159   0.151
             count       0.098   0.075   0.103   0.097   0.081
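AUC, one of the two reported metrics, can be computed without any ROC plotting via the rank-sum identity: AUC equals the probability that a randomly chosen positive enrollment is scored higher than a randomly chosen negative one, with ties counted as one half. A minimal sketch:

```python
def auc_score(y_true, y_score):
    """AUC via the rank-sum (Mann-Whitney) identity.

    y_true: binary labels (1 = dropout), y_score: predicted probabilities.
    Assumes at least one positive and one negative example.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    # count positive-negative pairs ranked correctly; ties count as 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This O(|pos|·|neg|) form is fine for illustration; production code would use a sorting-based O(n log n) implementation such as scikit-learn's `roc_auc_score`.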
Figure 6: System snapshots of the three intervention strategies. (a) Strategy 1: certificate driven; (b) Strategy 2: certificate driven in video; (c) Strategy 3: effort driven.

Table 7: Results of intervention by A/B test. WVT — average time (s) of video watching; ASN — average number of completed assignments; CAR — average ratio of correct answers.

Activity   No intervention   Strategy 1   Strategy 2   Strategy 3
WVT        4736.04           4774.59      5969.47      3402.96
ASN        4.59              9.34*        2.95         11.19**
CAR        0.29              0.34         0.22         0.40
*: p-value ≤ 0.1, **: p-value ≤ 0.05 by t-test.

video activity, assignment activity, and forum activity, on the two datasets. Specifically, we first input all the features to CFIN, then remove each type of activity features one by one to observe the change in performance. The results are shown in Table 5. We can observe that all three kinds of engagement are useful in this task. On KDDCUP, assignment plays the most important role, while on XuetangX, video seems more useful.

We also perform a fine-grained analysis of different features for different groups of users. Specifically, we feed a set of typical features into CFIN and compute their average attention weights for each cluster. The results are shown in Table 6. We can observe that the distributions of attention weights over the five clusters are quite different. The most significant difference appears in CAR (correct answer ratio): its attention weight on cluster 5 (hard workers) is much higher than those on the other clusters, which indicates that the correct answer ratio is most important in predicting dropout for hard workers. For users with more forum activities (cluster 2), answering questions in the forum seems to be the key factor, as the corresponding attention weight on "#answer" is the highest. Another interesting observation concerns the users with high dropout rates (clusters 1, 3, and 4): they get much higher attention weights on the numbers of video stops and watches compared to clusters 2 and 5. This indicates that video activities play a more important role in predicting dropout for learners with poor engagement than for active learners.

From Prediction to Online Intervention
We have deployed the proposed algorithm onto XiaoMu, an intelligent learning assistant subsystem on XuetangX, to help improve user retention. Specifically, we use our algorithm to predict the dropout probability of each user in a course. If a user's dropout probability is greater than a threshold, XiaoMu sends the user an intervention message. We ran an A/B test considering different strategies.

• Strategy 1: Certificate driven. Users in this group receive a message like "Based on our study, the probability of you obtaining a certificate can be increased by about 3% for every hour of video watching."
• Strategy 2: Certificate driven in video. Users in this group receive the same message as in Strategy 1, but it is delivered while the user is watching a course video.
• Strategy 3: Effort driven. Users in this group receive a message summarizing her/his efforts in this course, such as "You have spent 300 minutes learning and completed 2 homework questions in the last week, keep going!"

Figure 6 shows the system snapshots of the three strategies. We ran the A/B test on four courses (i.e., Financial Analysis and Decision Making, Introduction to Psychology, C++ Programming, and Java Programming) to examine the differences among the intervention strategies. Users are split into four groups: three treatment groups corresponding to the three intervention strategies and one control group. We collect two weeks of data and examine the video activities and assignment activities of the different groups of users. Table 7 shows the results. We see that Strategy 1 and Strategy 3 can significantly improve users' engagement on assignments, while Strategy 2 is more effective in encouraging users to watch videos.

Conclusion
In this paper, we conduct a systematic study of the dropout problem in MOOCs. We first conduct statistical analyses to identify factors that cause users to drop out. We find several interesting phenomena, such as dropout correlation between courses and dropout influence between friends. Based on these analyses, we propose a context-aware feature interaction network (CFIN) to predict users' dropout probability. Our method achieves good performance on two datasets: KDDCUP and XuetangX. The proposed method has been deployed onto XiaoMu, an intelligent
learning assistant in XuetangX, to help improve student retention. We are also working on applying the method to several other systems such as ArnetMiner (Tang et al. 2008).

Acknowledgements. The work is supported by the National Natural Science Foundation of China (61631013), the Center for Massive Online Education of Tsinghua University, and XuetangX.

References
Balakrishnan, G., and Coetzee, D. 2013. Predicting student retention in massive open online courses using hidden markov models. Electrical Engineering and Computer Sciences, University of California at Berkeley.
Chen, T., and Guestrin, C. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794.
Cristea, A. I.; Alamri, A.; Kayama, M.; Stewart, C.; Alshehri, M.; and Shi, L. 2018. Earliest predictor of dropout in moocs: a longitudinal study of futurelearn courses.
Dalipi, F.; Imran, A. S.; and Kastrati, Z. 2018. Mooc dropout prediction using machine learning techniques: Review and research challenges. In Global Engineering Education Conference (EDUCON), 2018 IEEE, 1007–1014. IEEE.
Fei, M., and Yeung, D.-Y. 2015. Temporal models for predicting student dropout in massive open online courses. In 2015 IEEE International Conference on Data Mining Workshop (ICDMW), 256–263.
Halawa, S.; Greene, D.; and Mitchell, J. 2014. Dropout prediction in moocs using learner activity features. Experiences and Best Practices in and around MOOCs, 3–12.
He, J.; Bailey, J.; Rubinstein, B. I. P.; and Zhang, R. 2015. Identifying at-risk students in massive open online courses. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 1749–1755.
Kellogg, S. 2013. Online learning: How to make a mooc. Nature, 369–371.
Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kizilcec, R. F.; Piech, C.; and Schneider, E. 2013. Deconstructing disengagement: Analyzing learner subpopulations in massive open online courses. In Proceedings of the Third International Conference on Learning Analytics and Knowledge, 170–179.
Kloft, M.; Stiehler, F.; Zheng, Z.; and Pinkwart, N. 2014. Predicting MOOC dropout over weeks using machine learning methods. 60–65.
Nagrecha, S.; Dillon, J. Z.; and Chawla, N. V. 2017. Mooc dropout prediction: Lessons learned from making pipelines interpretable. In WWW'17, 351–359.
Nair, V., and Hinton, G. E. 2010. Rectified linear units improve restricted boltzmann machines. In ICML'10, 807–814.
Onah, D. F.; Sinclair, J.; and Boyatt, R. 2014. Dropout rates of massive open online courses: behavioural patterns. EDULEARN'14, 5825–5834.
Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 701–710.
Qi, Y.; Wu, Q.; Wang, H.; Tang, J.; and Sun, M. 2018. Bandit learning with implicit feedback. In NIPS'18.
Qiu, J.; Tang, J.; Liu, T. X.; Gong, J.; Zhang, C.; Zhang, Q.; and Xue, Y. 2016. Modeling and predicting learning behavior in moocs. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, 93–102.
Ramesh, A.; Goldwasser, D.; Huang, B.; Daumé III, H.; and Getoor, L. 2014. Learning latent engagement patterns of students in online courses. In AAAI'14, 1272–1278.
Reich, J. 2015. Rebooting mooc research. Science, 34–35.
Rousseeuw, P. J. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20:53–65.
Seaton, D. T.; Bergner, Y.; Chuang, I.; Mitros, P.; and Pritchard, D. E. 2014. Who does what in a massive open online course? Communications of the ACM, 58–65.
Shah, D. 2018. A product at every price: A review of mooc stats and trends in 2017. Class Central.
Tang, J.; Zhang, J.; Yao, L.; Li, J.; Zhang, L.; and Su, Z. 2008. Arnetminer: Extraction and mining of academic social networks. In KDD'08, 990–998.
Wang, W.; Yu, H.; and Miao, C. 2017. Deep model for dropout prediction in moocs. In Proceedings of the 2nd International Conference on Crowd Science and Engineering, 26–32. ACM.
Whitehill, J.; Williams, J.; Lopez, G.; Coleman, C.; and Reich, J. 2015. Beyond prediction: First steps toward automatic intervention in mooc student stopout.
Wolpert, D. H. 1992. Stacked generalization. Neural Networks 5(2):241–259.
Xing, W.; Chen, X.; Stein, J.; and Marcinkowski, M. 2016. Temporal predication of dropouts in moocs: Reaching the low hanging fruit through stacking generalization. Computers in Human Behavior, 119–129.
Zheng, S.; Rosson, M. B.; Shih, P. C.; and Carroll, J. M. 2015. Understanding student motivation, behaviors and perceptions in moocs. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work, 1882–1895.
Zhenghao, C.; Alcorn, B.; Christensen, G.; Eriksson, N.; Koller, D.; and Emanuel, E. 2015. Who's benefiting from moocs, and why. Harvard Business Review 25.