Mooc Dropouts
Figure 1: Dropout rates of different demographics of users. (a) user age (b) course category (c) user education level.
system. Experiments on both datasets show that the proposed method achieves much better performance than several state-of-the-art methods. We have deployed the proposed method in XuetangX to help improve user retention.

Related Work
Prior studies apply generalized linear models (including logistic regression and linear SVMs (Kloft et al. 2014; He et al. 2015)) to predict dropout. Balakrishnan et al. (2013) present a hybrid model that combines Hidden Markov Models (HMM) and logistic regression to predict student retention in a single course. Another attempt, by Xing et al. (2016), uses an ensemble stacking generalization approach to build robust and accurate prediction models. Deep learning methods are also used for predicting dropout. For example, Fei et al. (2015) tackle this problem from a sequence labeling perspective and apply an RNN-based model to predict students' dropout probability. Wang et al. (2017) propose a hybrid deep neural network dropout prediction model by combining a CNN and an RNN. Ramesh et al. (2014) develop a probabilistic soft logic (PSL) framework to predict user retention by modeling student engagement types using latent variables. Cristea et al. (2018) propose a lightweight method that can predict dropout before a user starts learning, based only on her/his registration date. Besides prediction itself, Nagrecha et al. (2017) focus on the interpretability of existing dropout prediction methods. Whitehill et al. (2015) design an online intervention strategy to boost users' callback in MOOCs. Dalipi et al. (2018) review the techniques of dropout prediction and propose several insightful suggestions for this task. Moreover, XuetangX organized KDDCUP 2015² for dropout prediction. In that competition, most teams adopted ensembling strategies to improve prediction performance, and the "Intercontinental Ensemble" team achieved the best performance by assembling over sixty single models.

More recent works mainly focus on analyzing student engagement based on statistical methods and explore how to improve it (Kellogg 2013; Reich 2015). Zheng et al. (2015) apply grounded theory to study users' motivations for choosing a course and to understand the reasons that users drop out of a course. Qiu et al. (2016) study the relationship between student engagement and their certificate rate, and propose a latent dynamic factor graph (LadFG) to model and predict learning behavior in MOOCs.

² https://biendata.com/competition/kddcup2015

Data and Insights
The analysis in this work is performed on two datasets from XuetangX. XuetangX, launched in October 2013, is now one of the largest MOOC platforms in China. It has provided over 1,000 courses and attracted more than 10,000,000 registered users. XuetangX has twelve categories of courses: art, biology, computer science, economics, engineering, foreign language, history, literature, math, philosophy, physics, and social science. Users in XuetangX can choose between two learning modes: Instructor-paced Mode (IPM) and Self-paced Mode (SPM). IPM follows the same course schedule as conventional classrooms, while in SPM one can follow a more flexible schedule and study online by herself/himself. Usually an IPM course spans 16 weeks in XuetangX, while an SPM course spans a longer period. Each user can enroll in one or more courses. When a user studies a course, the system records multiple types of activities: video watching (watch, stop, and jump), forum discussion (asking questions and replying), assignment completion (with correct/incorrect answers, and reset), and web page clicking (click and close a course page).

Two Datasets. The first dataset contains 39 IPM courses and their enrolled students. It was also used for KDDCUP 2015. Table 1 lists statistics of this dataset. With this dataset, we compare our proposed method with existing methods, as the challenge attracted 821 teams to participate. We refer to this dataset as KDDCUP.

The other dataset contains 698 IPM courses and 515 SPM courses. Table 2 lists its statistics. This dataset contains richer information, which can be used to test the robustness and generalization of the proposed method. It is referred to as XuetangX.

Insights
Before proposing our methodology, we try to gain a better understanding of the users' learning behavior. We first per-
Table 1: Statistics of the KDDCUP dataset.

Category     Type                      Number
log          # video activities        1,319,032
             # forum activities        10,763,225
             # assignment activities   2,089,933
             # web page activities     7,380,344
enrollment   # total                   200,904
             # dropouts                159,223
             # completions             41,681
total        # users                   112,448
             # courses                 39

Table 2: Statistics of the XuetangX dataset.

Table 3: Results of clustering analysis. C1-C5 — Cluster 1 to 5; CAR — average correct answer ratio.

Category     Type           C1       C2      C3        C4       C5
video        #watch         21.83    46.78   12.03     19.57    112.1
             #stop          28.45    68.96   20.21     37.19    84.15
             #jump          16.30    16.58   11.44     14.54    21.39
forum        #question      0.04     0.38    0.02      0.03     0.03
             #answer        0.13     3.46    0.13      0.12     0.17
assignment   CAR            0.22     0.76    0.19      0.20     0.59
             #revise        0.17     0.02    0.04      0.78     0.01
session      seconds        1,715    714     1,802     1,764    885
             count          3.61     8.13    2.18      4.01     7.78
enrollment   #enrollment    21,048   9,063   401,123   25,042   10,837
             #users         2,735    4,131   239,302   4,229    4,121
             dropout rate   0.78     0.29    0.83      0.66     0.28
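The clusters in Table 3 group enrollments by their activity statistics. The clustering procedure itself is not spelled out in this excerpt; as a minimal sketch of one standard choice, k-means over per-enrollment feature vectors could look as follows (the toy features and values are illustrative, not drawn from the datasets):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal k-means: cluster feature vectors (lists of floats) into k groups."""
    rnd = random.Random(seed)
    centers = [list(p) for p in rnd.sample(points, k)]

    def nearest(p):
        # index of the center with the smallest squared Euclidean distance
        return min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))

    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p)].append(p)          # assignment step
        for i, g in enumerate(groups):
            if g:                                  # update step: mean of members
                centers[i] = [sum(col) / len(g) for col in zip(*g)]
    return [nearest(p) for p in points]

# toy enrollments described by [#watch, #stop, CAR] (illustrative values only)
enrollments = [[12.0, 20.0, 0.2], [10.0, 18.0, 0.2],
               [110.0, 85.0, 0.6], [115.0, 80.0, 0.6]]
labels = kmeans(enrollments, k=2)
```

On real data, the per-enrollment statistics of Table 3 (watch/stop/jump counts, forum activity, CAR, session length) would form the feature vectors, and the number of clusters would be validated, e.g., by silhouette analysis (Rousseeuw 1987).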
Figure 2: Dropout correlation analysis between courses. The x-axis denotes the weeks from 1 to 16 and the y-axis is the slope of the linear regression for dropout correlation between two different courses. The red line is the result for different-category courses, the green line denotes the slope for same-category courses, and the black line pools results over all courses.

[Figure: prediction timeline — from the learning start, the first Dh days form the history period, and the next Dp days are the prediction period.]

Formulation
In order to formulate this problem more precisely, we first introduce the following definitions.

Definition 2. Enrollment Relation: Let C denote the set of courses, U the set of users, and the pair (u, c) denote that user u ∈ U enrolls in course c ∈ C. The set of courses enrolled by u is denoted as Cu ⊂ C, and the set of users who have enrolled in course c is denoted as Uc ⊂ U. We use E to denote the set of all enrollments, i.e., E = {(u, c)}.

Definition 3. Learning Activity: In MOOCs, user u's learning activities in a course c can be formulated as an mx-dimensional vector X(u, c), where each element xi(u, c) ∈ X(u, c) is a continuous feature value associated with u's learning activity in course c. These features are extracted from users' historical logs and mainly comprise statistics of users' activities.
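Definitions 2 and 3 map directly onto simple data structures. A minimal sketch of building the enrollment set E and the activity vectors X(u, c) from raw logs (the log records and the action schema below are hypothetical stand-ins for XuetangX's real log format):

```python
from collections import Counter

# hypothetical raw log records: (user, course, action)
logs = [
    ("u1", "c1", "watch"), ("u1", "c1", "watch"), ("u1", "c1", "stop"),
    ("u1", "c1", "question"), ("u2", "c1", "watch"),
]

# feature schema: one count statistic per action type, a stand-in
# for the paper's m_x continuous activity features
ACTIONS = ["watch", "stop", "jump", "question", "answer"]

def build_features(logs):
    """Return the enrollment set E = {(u, c)} and one vector X(u, c) per enrollment."""
    counts = {}
    for user, course, action in logs:
        counts.setdefault((user, course), Counter())[action] += 1
    E = sorted(counts)
    X = {e: [float(counts[e][a]) for a in ACTIONS] for e in E}
    return E, X

E, X = build_features(logs)
```

In the paper's setting the entries of X(u, c) are statistics computed over the history period only; raw counts are shown here purely for illustration.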
[Figure: CFIN architecture — embedding layer, attention-based interaction with weighted sum, fully connected DNN layers, and context information Z(u, c), producing the prediction ŷ.]

matrix of X̂_g^(i). After that, the next step is feature fusion. We employ a one-dimensional convolutional neural network (CNN) to compress each E_g^(i) (1 ≤ i ≤ mx) to a vector. More formally, a vector V_g^(i) ∈ R^{d_f} is generated from E_g^(i) by
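The convolution-plus-pooling step that compresses each embedding matrix E_g^(i) into a fixed-length vector V_g^(i) can be sketched in plain Python. The excerpt cuts off before the exact formula, so the filter widths and the max-pooling choice below are assumptions for illustration:

```python
def conv1d_compress(E, filters):
    """Slide each (w x d) filter over the rows of the (T x d) matrix E and
    max-pool over positions, yielding one scalar per filter — i.e., a
    d_f-dimensional vector when d_f filters are supplied."""
    T, d = len(E), len(E[0])
    out = []
    for W in filters:
        w = len(W)  # filter width along the first axis
        acts = [sum(W[i][j] * E[t + i][j] for i in range(w) for j in range(d))
                for t in range(T - w + 1)]  # valid convolution positions
        out.append(max(acts))               # max-pooling over positions
    return out

# one width-2 filter over a 3x2 matrix yields a 1-dim vector
v = conv1d_compress([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
                    [[[1.0, 1.0], [1.0, 1.0]]])  # -> [3.0]
```

In a real implementation this step would be a standard 1-D convolution layer (e.g., TensorFlow's `Conv1D`) followed by pooling, applied per feature index i.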
L(Θ) = − Σ_{(u,c)∈E} [ y_{(u,c)} log(ŷ_{(u,c)}) + (1 − y_{(u,c)}) log(1 − ŷ_{(u,c)}) ]    (9)

where Θ denotes the set of model parameters, y_{(u,c)} is the corresponding ground truth, and E is the set of all enrollments.

Model Ensemble
To further improve the prediction performance, we also design an ensemble strategy that combines CFIN with XGBoost (Chen and Guestrin 2016), one of the most effective gradient boosting frameworks. Specifically, we obtain V_d^{(L−1)}, the output of the DNN's (L−1)-th layer, from a successfully trained CFIN model, and use it together with the original features, i.e., X and Z, to train an XGBoost classifier. This strategy is similar to Stacking (Wolpert 1992).

Experiments
We conduct various experiments to evaluate the effectiveness of CFIN on two datasets: KDDCUP and XuetangX.⁴

Table 4: Overall results on the KDDCUP dataset and IPM courses of the XuetangX dataset.

                 KDDCUP               XuetangX
Methods    AUC (%)   F1 (%)     AUC (%)   F1 (%)
LR         86.78     90.86      82.23     89.35
SVM        88.56     91.65      82.86     89.78
RF         88.82     91.73      83.11     89.96
DNN        88.94     91.81      85.64     90.40
GBDT       89.12     91.88      85.18     90.48
CFIN       90.07     92.27      86.40     90.92
CFIN-en    90.93     92.87      86.71     90.95

Table 5: Contribution analysis for different engagements on the KDDCUP dataset and IPM courses of the XuetangX dataset.

                   KDDCUP               XuetangX
Features       AUC (%)   F1 (%)    AUC (%)   F1 (%)
All            90.07     92.27     86.50     90.95
- Video        87.40     91.61     84.40     90.32
- Forum        88.61     91.93     85.13     90.41
- Assignment   86.68     91.39     84.83     90.34
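The objective in Eq. (9) is a per-enrollment binary cross-entropy summed over E. As a minimal sketch in pure Python (the `eps` clamp is a standard numerical guard, not part of the paper's formula):

```python
import math

def dropout_loss(y_true, y_pred, eps=1e-12):
    """L(Θ) = -Σ_{(u,c)∈E} [ y log(ŷ) + (1 - y) log(1 - ŷ) ]  (Eq. 9).

    y_true: ground-truth dropout labels per enrollment (0 or 1).
    y_pred: predicted dropout probabilities ŷ.
    eps clamps predictions away from 0 and 1 so log() stays finite.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total
```

Minimizing this loss over Θ with Adam matches the training setup described in the experiments.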
Experimental Setup
Implementation Details. We implement CFIN with TensorFlow and adopt Adam (Kingma and Ba 2014) to optimize the model. To avoid overfitting, we apply L2 regularization to the weight matrices. We adopt the Rectified Linear Unit (ReLU) (Nair and Hinton 2010) as the activation function. All features are normalized before being fed into CFIN. We test CFIN's performance on both the KDDCUP and XuetangX datasets. For the KDDCUP dataset, the history period and prediction period are set to 30 days and 10 days, respectively, by the competition organizers. We do not use the attention mechanism of CFIN on this dataset, as no context information is provided. For the XuetangX dataset, the history period is set to 35 days and the prediction period to 10 days, i.e., Dh = 35, Dp = 10.

Comparison Methods. We conduct comparison experiments with the following methods:
• LR: logistic regression model.
• SVM: support vector machine with a linear kernel.
• RF: Random Forest model.
• GBDT: Gradient Boosting Decision Tree.
• DNN: 3-layer deep neural network.
• CFIN: the CFIN model.
• CFIN-en: the assembled CFIN using the strategy proposed in Model Ensemble.

For the baseline models (LR, SVM, RF, GBDT, DNN) above, we use all the features (including learning activity X and context information Z) as input. When training the models, we tune the parameters based on 5-fold cross validation (CV) with grid search, and use the best group of parameters in all experiments. The evaluation metrics include Area Under the ROC Curve (AUC) and F1 Score (F1).

⁴ All datasets and code used in this paper are publicly available at http://www.moocdata.cn.

Prediction Performance
Table 4 presents the results on the KDDCUP dataset and the IPM courses of the XuetangX dataset for all comparison methods. Overall, CFIN-en achieves the best performance on both datasets, and its AUC score on the KDDCUP dataset reaches 90.93%, comparable to the winning team of KDDCUP 2015². Compared to LR and SVM, CFIN achieves 1.51–3.29% and 3.54–4.17% AUC improvements on KDDCUP and XuetangX, respectively. Moreover, compared to the ensemble methods (i.e., RF and GBDT) and DNN, CFIN also shows better performance.

Feature Contribution
In order to identify the importance of different kinds of engagement activities in this task, we conduct feature ablation experiments for three major activity features, i.e.,

Table 6: Average attention weights of different clusters. C1-C5 — Cluster 1 to 5; CAR — average correct answer ratio.

Category     Type        C1      C2      C3      C4      C5
video        #watch      0.078   0.060   0.079   0.074   0.072
             #stop       0.090   0.055   0.092   0.092   0.053
             #jump       0.114   0.133   0.099   0.120   0.125
forum        #question   0.136   0.127   0.138   0.139   0.129
             #answer     0.142   0.173   0.142   0.146   0.131
assignment   CAR         0.036   0.071   0.049   0.049   0.122
             #reset      0.159   0.157   0.159   0.125   0.136
session      seconds     0.146   0.147   0.138   0.159   0.151
             count       0.098   0.075   0.103   0.097   0.081
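AUC, one of the two reported metrics, can be computed without any ROC plotting via the rank-sum identity: AUC equals the probability that a randomly chosen positive enrollment is scored higher than a randomly chosen negative one, with ties counted as one half. A minimal sketch:

```python
def auc_score(y_true, y_score):
    """AUC via the rank-sum (Mann-Whitney) identity.

    y_true: binary labels (1 = dropout), y_score: predicted probabilities.
    Assumes at least one positive and one negative example.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    # count positive-negative pairs ranked correctly; ties count as 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This O(|pos|·|neg|) form is fine for illustration; production code would use a sorting-based O(n log n) implementation such as scikit-learn's `roc_auc_score`.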
Figure 6: System snapshots of the three intervention strategies. (a) Strategy 1: certificate driven; (b) Strategy 2: certificate driven in video; (c) Strategy 3: effort driven.

Table 7: Results of intervention by A/B test. WVT — average time (s) of video watching; ASN — average number of completed assignments; CAR — average ratio of correct answers.

Activity   No intervention   Strategy 1   Strategy 2   Strategy 3
WVT        4736.04           4774.59      5969.47      3402.96
ASN        4.59              9.34*        2.95         11.19**
CAR        0.29              0.34         0.22         0.40
*: p-value ≤ 0.1, **: p-value ≤ 0.05 by t-test.

video activity, assignment activity, and forum activity, on the two datasets. Specifically, we first input all the features to CFIN, then remove each type of activity features one by one to observe the change in performance. The results are shown in Table 5. We can observe that all three kinds of engagement are useful in this task. On KDDCUP, assignment plays the most important role, while on XuetangX, video seems more useful.

We also perform a fine-grained analysis of different features for different groups of users. Specifically, we feed a set of typical features into CFIN and compute their average attention weights for each cluster. The results are shown in Table 6. We can observe that the distributions of attention weights over the five clusters are quite different. The most significant difference appears in CAR (correct answer ratio): its attention weight on cluster 5 (hard workers) is much higher than those on the other clusters, which indicates that the correct answer ratio is most important in predicting dropout for hard workers. For users with more forum activities (cluster 2), answering questions in the forum seems to be the key factor, as the corresponding attention weight on "#answer" is the highest. Another interesting observation concerns the users with high dropout rates (clusters 1, 3, and 4): they get much higher attention weights on the numbers of video stops and watches compared to clusters 2 and 5. This indicates that video activities play a more important role in predicting dropout for learners with poor engagement than for active learners.

From Prediction to Online Intervention
We have deployed the proposed algorithm onto XiaoMu, an intelligent learning assistant subsystem on XuetangX, to help improve user retention. Specifically, we use our algorithm to predict the dropout probability of each user in a course. If a user's dropout probability is greater than a threshold, XiaoMu sends the user an intervention message. We ran an A/B test considering different strategies.

• Strategy 1: Certificate driven. Users in this group receive a message like "Based on our study, the probability of you obtaining a certificate can be increased by about 3% for every hour of video watching."
• Strategy 2: Certificate driven in video. Users in this group receive the same message as in Strategy 1, but it is delivered while the user is watching a course video.
• Strategy 3: Effort driven. Users in this group receive a message summarizing her/his efforts in this course, such as "You have spent 300 minutes learning and completed 2 homework questions in the last week, keep going!"

Figure 6 shows the system snapshots of the three strategies. We ran the A/B test on four courses (i.e., Financial Analysis and Decision Making, Introduction to Psychology, C++ Programming, and Java Programming) to examine the differences among the intervention strategies. Users are split into four groups: three treatment groups corresponding to the three intervention strategies and one control group. We collect two weeks of data and examine the video activities and assignment activities of the different groups of users. Table 7 shows the results. We see that Strategy 1 and Strategy 3 can significantly improve users' engagement on assignments, while Strategy 2 is more effective in encouraging users to watch videos.

Conclusion
In this paper, we conduct a systematic study of the dropout problem in MOOCs. We first conduct statistical analyses to identify factors that cause users to drop out. We find several interesting phenomena, such as dropout correlation between courses and dropout influence between friends. Based on these analyses, we propose a context-aware feature interaction network (CFIN) to predict users' dropout probability. Our method achieves good performance on two datasets: KDDCUP and XuetangX. The proposed method has been deployed onto XiaoMu, an intelligent
learning assistant in XuetangX, to help improve student retention. We are also working on applying the method to several other systems such as ArnetMiner (Tang et al. 2008).

Acknowledgements. The work is supported by the National Natural Science Foundation of China (61631013), the Center for Massive Online Education of Tsinghua University, and XuetangX.

References
Balakrishnan, G., and Coetzee, D. 2013. Predicting student retention in massive open online courses using hidden markov models. Electrical Engineering and Computer Sciences, University of California at Berkeley.
Chen, T., and Guestrin, C. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794.
Cristea, A. I.; Alamri, A.; Kayama, M.; Stewart, C.; Alshehri, M.; and Shi, L. 2018. Earliest predictor of dropout in moocs: a longitudinal study of futurelearn courses.
Dalipi, F.; Imran, A. S.; and Kastrati, Z. 2018. Mooc dropout prediction using machine learning techniques: Review and research challenges. In Global Engineering Education Conference (EDUCON), 2018 IEEE, 1007–1014. IEEE.
Fei, M., and Yeung, D.-Y. 2015. Temporal models for predicting student dropout in massive open online courses. In 2015 IEEE International Conference on Data Mining Workshop (ICDMW), 256–263.
Halawa, S.; Greene, D.; and Mitchell, J. 2014. Dropout prediction in moocs using learner activity features. Experiences and Best Practices in and around MOOCs, 3–12.
He, J.; Bailey, J.; Rubinstein, B. I. P.; and Zhang, R. 2015. Identifying at-risk students in massive open online courses. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 1749–1755.
Kellogg, S. 2013. Online learning: How to make a mooc. Nature, 369–371.
Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kizilcec, R. F.; Piech, C.; and Schneider, E. 2013. Deconstructing disengagement: Analyzing learner subpopulations in massive open online courses. In Proceedings of the Third International Conference on Learning Analytics and Knowledge, 170–179.
Kloft, M.; Stiehler, F.; Zheng, Z.; and Pinkwart, N. 2014. Predicting MOOC dropout over weeks using machine learning methods. 60–65.
Nagrecha, S.; Dillon, J. Z.; and Chawla, N. V. 2017. Mooc dropout prediction: Lessons learned from making pipelines interpretable. In WWW'17, 351–359.
Nair, V., and Hinton, G. E. 2010. Rectified linear units improve restricted boltzmann machines. In ICML'10, 807–814.
Onah, D. F.; Sinclair, J.; and Boyatt, R. 2014. Dropout rates of massive open online courses: behavioural patterns. EDULEARN'14, 5825–5834.
Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 701–710.
Qi, Y.; Wu, Q.; Wang, H.; Tang, J.; and Sun, M. 2018. Bandit learning with implicit feedback. In NIPS'18.
Qiu, J.; Tang, J.; Liu, T. X.; Gong, J.; Zhang, C.; Zhang, Q.; and Xue, Y. 2016. Modeling and predicting learning behavior in moocs. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, 93–102.
Ramesh, A.; Goldwasser, D.; Huang, B.; Daumé III, H.; and Getoor, L. 2014. Learning latent engagement patterns of students in online courses. In AAAI'14, 1272–1278.
Reich, J. 2015. Rebooting mooc research. Science, 34–35.
Rousseeuw, P. J. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20:53–65.
Seaton, D. T.; Bergner, Y.; Chuang, I.; Mitros, P.; and Pritchard, D. E. 2014. Who does what in a massive open online course? Communications of the ACM, 58–65.
Shah, D. 2018. A product at every price: A review of mooc stats and trends in 2017. Class Central.
Tang, J.; Zhang, J.; Yao, L.; Li, J.; Zhang, L.; and Su, Z. 2008. Arnetminer: Extraction and mining of academic social networks. In KDD'08, 990–998.
Wang, W.; Yu, H.; and Miao, C. 2017. Deep model for dropout prediction in moocs. In Proceedings of the 2nd International Conference on Crowd Science and Engineering, 26–32. ACM.
Whitehill, J.; Williams, J.; Lopez, G.; Coleman, C.; and Reich, J. 2015. Beyond prediction: First steps toward automatic intervention in mooc student stopout.
Wolpert, D. H. 1992. Stacked generalization. Neural Networks 5(2):241–259.
Xing, W.; Chen, X.; Stein, J.; and Marcinkowski, M. 2016. Temporal predication of dropouts in moocs: Reaching the low hanging fruit through stacking generalization. Computers in Human Behavior, 119–129.
Zheng, S.; Rosson, M. B.; Shih, P. C.; and Carroll, J. M. 2015. Understanding student motivation, behaviors and perceptions in moocs. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work, 1882–1895.
Zhenghao, C.; Alcorn, B.; Christensen, G.; Eriksson, N.; Koller, D.; and Emanuel, E. 2015. Who's benefiting from moocs, and why. Harvard Business Review 25.