Educational Data Mining
Educational data mining (EDM) is a research field concerned with the application of data mining,
machine learning and statistics to information generated from educational settings (e.g., universities and
intelligent tutoring systems). At a high level, the field seeks to develop and improve methods for exploring
this data, which often has multiple levels of meaningful hierarchy, in order to discover new insights about
how people learn in the context of such settings.[1] In doing so, EDM has contributed to theories of
learning investigated by researchers in educational psychology and the learning sciences.[2] The field is
closely tied to that of learning analytics, and the two have been compared and contrasted.[3]
Definition
Educational data mining refers to techniques, tools, and research designed for automatically extracting
meaning from large repositories of data generated by or related to people's learning activities in educational
settings.[4] Quite often, this data is extensive, fine-grained, and precise. For example, several learning
management systems (LMSs) track information such as when each student accessed each learning object,
how many times they accessed it, and how many minutes the learning object was displayed on the user's
computer screen. As another example, intelligent tutoring systems record data every time a learner submits a
solution to a problem. They may collect the time of the submission, whether or not the solution matches the
expected solution, the amount of time that has passed since the last submission, the order in which solution
components were entered into the interface, etc. The precision of this data is such that even a fairly short
session with a computer-based learning environment (e.g. 30 minutes) may produce a large amount of
process data for analysis.
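As a rough illustration of such process data, the sketch below models tutoring-system submissions as simple records and derives the time elapsed since each learner's previous submission. The Submission record, its field names, and the sample log are hypothetical, chosen only to mirror the kinds of values described above; real systems log in their own formats.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    """One logged attempt (hypothetical schema, for illustration only)."""
    student_id: str
    problem_id: str
    timestamp: float  # seconds since the start of the session
    correct: bool     # whether the answer matched the expected solution


def seconds_since_previous(events):
    """Pair each event, in time order, with the gap since the same
    student's previous submission (None for a student's first event)."""
    last_seen = {}
    gaps = []
    for event in sorted(events, key=lambda e: e.timestamp):
        previous = last_seen.get(event.student_id)
        gap = None if previous is None else event.timestamp - previous
        gaps.append((event, gap))
        last_seen[event.student_id] = event.timestamp
    return gaps


# Invented sample log: two students, four submissions.
LOG = [
    Submission("s1", "p1", 12.0, False),
    Submission("s1", "p1", 30.5, True),
    Submission("s2", "p1", 15.0, True),
    Submission("s1", "p2", 95.0, False),
]

GAPS = [gap for _, gap in seconds_since_previous(LOG)]
print(GAPS)
```

Even this four-row log yields a derived feature (the gap between attempts) that is not stored directly, which is one reason short sessions can produce a large amount of data for analysis.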
In other cases, the data is less fine-grained. For example, a student's university transcript may contain a
temporally ordered list of courses taken by the student, the grade that the student earned in each course, and
when the student selected or changed his or her academic major. EDM leverages both types of data to
discover meaningful information about different types of learners and how they learn, the structure of
domain knowledge, and the effect of instructional strategies embedded within various learning
environments. These analyses provide new information that would be difficult to discern by looking at the
raw data. For example, analyzing data from an LMS may reveal a relationship between the learning objects
that a student accessed during the course and their final course grade. Similarly, analyzing student transcript
data may reveal a relationship between a student's grade in a particular course and their decision to change
their academic major. Such information provides insight into the design of learning environments, which
allows students, teachers, school administrators, and educational policy makers to make informed decisions
about how to interact with, provide, and manage educational resources.
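A relationship of the kind described above could be checked, as a first pass, with a correlation coefficient. The following is a minimal sketch that computes Pearson's r between the number of learning objects a student accessed and the final course grade. The variable names and all data values are invented for illustration, and a high correlation on its own would not show that accessing more objects causes higher grades.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-student values extracted from an LMS.
OBJECTS_ACCESSED = [3, 8, 5, 12, 7]
FINAL_GRADE = [55, 78, 62, 90, 70]

R = pearson(OBJECTS_ACCESSED, FINAL_GRADE)
print(f"Pearson r = {R:.3f}")
```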
History
While the analysis of educational data is not itself a new practice, recent advances in educational
technology, including the increase in computing power and the ability to log fine-grained data about
students' use of a computer-based learning environment, have led to an increased interest in developing
techniques for analyzing the large amounts of data generated in educational settings. This interest translated
into a series of EDM workshops held from 2000 to 2007 as part of several international research
conferences.[5] In 2008, a group of researchers established what has become an annual international
research conference on EDM, the first of which took place in Montreal, Quebec, Canada.[6]
As interest in EDM continued to increase, EDM researchers established an academic journal in 2009, the
Journal of Educational Data Mining, for sharing and disseminating research results. In 2011, EDM
researchers established the International Educational Data Mining Society to connect EDM researchers and
continue to grow the field.
The introduction of public educational data repositories in 2008, such as the Pittsburgh Science of
Learning Center's (PSLC) DataShop and the National Center for Education Statistics (NCES), made public
data sets available, which has made educational data mining more accessible and feasible and contributed to its growth.[7]
Goals
Ryan S. Baker and Kalina Yacef[8] identified the following four goals of EDM:
1. Predicting students' future learning behavior – With the use of student modeling, this
goal can be achieved by creating student models that incorporate the learner's
characteristics, including detailed information such as their knowledge, behaviours and
motivation to learn. The user experience of the learner and their overall satisfaction with
learning are also measured.
2. Discovering or improving domain models – Through the various methods and
applications of EDM, new models can be discovered and existing models improved.
Examples include illustrating the educational content to engage learners and determining
optimal instructional sequences to support the student's learning style.
3. Studying the effects of educational support that can be achieved through learning
systems.
4. Advancing scientific knowledge about learning and learners by building and
incorporating student models, the field of EDM research and the technology and software
used.
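To make the idea of a student model in the first goal more concrete, the sketch below uses Bayesian Knowledge Tracing, one well-known student-modeling technique that the text above does not name; it is substituted here purely as an example. The model keeps a probability that a learner knows a skill and updates it after each observed answer. The parameter values for guess, slip, and learn are arbitrary assumptions, not fitted estimates.

```python
def bkt_update(p_know, correct, guess=0.2, slip=0.1, learn=0.15):
    """Return the updated probability that the skill is known after one
    observed answer (standard Bayesian Knowledge Tracing update)."""
    if correct:
        evidence = p_know * (1 - slip)
        posterior = evidence / (evidence + (1 - p_know) * guess)
    else:
        evidence = p_know * slip
        posterior = evidence / (evidence + (1 - p_know) * (1 - guess))
    # The learner may also acquire the skill during this practice step.
    return posterior + (1 - posterior) * learn


def predict_correct(p_know, guess=0.2, slip=0.1):
    """Probability that the next answer is correct under the model."""
    return p_know * (1 - slip) + (1 - p_know) * guess


# Trace a hypothetical learner who starts with a 30% chance of knowing the skill.
p = 0.3
for observed in [False, True, True, True]:
    p = bkt_update(p, observed)
    print(f"answer correct={observed!s:5}  P(known)={p:.3f}  "
          f"P(next correct)={predict_correct(p):.3f}")
```

The prediction of future behavior comes from predict_correct, while the running estimate of P(known) is the part of the model that represents the learner's knowledge.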
Learners – Learners are interested in understanding student needs and methods to improve
the learner's experience and performance.[9] For example, learners can benefit from the
discovered knowledge by using EDM tools that suggest activities and resources
based on their interactions with the online learning tool and on insights from past or
similar learners.[10] For younger learners, educational data mining can also inform parents
about their child's learning progress.[11] It is also necessary to group learners effectively in
an online environment. The challenge is to use the complex data to learn and interpret these
groups by developing actionable models.[12]
Educators – Educators attempt to understand the learning process and the ways in which they
can improve their teaching.[9] Educators can use the applications of EDM to
determine how to organize and structure the curriculum, the best methods to deliver course
information and the tools to use to engage their learners for optimal learning outcomes.[13] In
particular, the distillation of data for human judgment technique provides an opportunity for
educators to benefit from EDM because it enables educators to quickly identify behavioural
patterns, which can support their teaching methods during the course or help to
improve future courses. Educators can determine indicators that show student satisfaction
and engagement with course material, and also monitor learning progress.[13]
Researchers – Researchers focus on the development and the evaluation of data mining
techniques for effectiveness.[9] A yearly international conference for researchers began in
2008. Topics in EDM range from using data mining to improve
institutional effectiveness to improving student performance.
Administrators – Administrators are responsible for allocating the resources for
implementation in institutions.[9] As institutions are increasingly held responsible for student
success, the administering of EDM applications is becoming more common in educational
settings. Faculty and advisors are becoming more proactive in identifying and addressing
at-risk students. However, it is sometimes a challenge to get the information to the decision
makers to administer the application in a timely and efficient manner.
Phases
As research in the field of educational data mining has continued to grow, a myriad of data mining
techniques have been applied to a variety of educational contexts. In each case, the goal is to translate raw
data into meaningful information about the learning process in order to make better decisions about the
design and trajectory of a learning environment. Thus, EDM generally consists of four phases:[2][5]
1. The first phase of the EDM process (not counting pre-processing) is discovering
relationships in data. This involves searching through a repository of data from an
educational environment with the goal of finding consistent relationships between variables.
Several algorithms for identifying such relationships have been utilized, including
classification, regression, clustering, factor analysis, social network analysis, association
rule mining, and sequential pattern mining.
2. Discovered relationships must then be validated in order to avoid overfitting.
3. Validated relationships are applied to make predictions about future events in the learning
environment.
4. Predictions are used to support decision-making processes and policy decisions.
During phases 3 and 4, data is often visualized or in some other way distilled for human judgment.[2] A
large amount of research has been conducted on best practices for visualizing data.
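The four phases can be sketched end to end on a toy problem. The example below fits a least-squares line on one set of students (discovering a relationship), measures its error on held-out students (validation), predicts a score for a new student, and applies a cut-off to support a decision. The data, the choice of linear regression, the choice of mean absolute error, and the cut-off of 60 are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    slope, intercept = model
    return slope * x + intercept


def mean_abs_error(model, xs, ys):
    """Average absolute difference between predictions and observations."""
    return sum(abs(predict(model, x) - y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: hours of practice vs. post-test score.
TRAIN_HOURS = [1, 2, 3, 4, 5, 6]
TRAIN_SCORES = [52, 55, 61, 64, 70, 73]
HELD_OUT_HOURS = [2.5, 4.5, 7]
HELD_OUT_SCORES = [58, 67, 78]

# Phase 1: discover a relationship in the training data.
MODEL = fit_line(TRAIN_HOURS, TRAIN_SCORES)
# Phase 2: validate on students the model has not seen, to detect overfitting.
HELD_OUT_ERROR = mean_abs_error(MODEL, HELD_OUT_HOURS, HELD_OUT_SCORES)
# Phase 3: predict the outcome for a new student with 1.5 hours of practice.
PREDICTED = predict(MODEL, 1.5)
# Phase 4: support a decision (the cut-off of 60 is an arbitrary assumption).
NEEDS_SUPPORT = PREDICTED < 60

print(MODEL, HELD_OUT_ERROR, PREDICTED, NEEDS_SUPPORT)
```

If the held-out error were much larger than the training error, that would suggest the discovered relationship does not generalize and should not be used for the later phases.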
Main approaches
Of the general categories of methods mentioned, prediction, clustering and relationship mining are
considered universal methods across all types of data mining; however, Discovery with Models and
Distillation of Data for Human Judgment are considered more prominent approaches within educational
data mining.[7]
In the Discovery with Models method, a model is developed via prediction, clustering, or human
reasoning (knowledge engineering) and then used as a component in another analysis, namely in prediction
or relationship mining.[7] When used in prediction, the created model's predictions serve as predictor variables for
a new variable.[7] When used in relationship mining, the created model enables the analysis of relationships between its
predictions and additional variables in the study.[7] In many cases, discovery with models uses validated
prediction models that have proven generalizability across contexts.
Key applications of this method include discovering relationships between student behaviors, characteristics
and contextual variables in the learning environment.[7] Broad and specific research
questions across a wide range of contexts can also be explored using this method.
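A minimal sketch of discovery with models follows: the output of an existing model becomes a new variable, which is then related to another variable in the study. The detector shown is a hypothetical stand-in for a previously validated model (a simple rule that flags students who answer most items in under two seconds), and the student records and learning gains are invented.

```python
from statistics import mean


def rapid_guess_detector(response_secs, threshold_secs=2.0):
    """Hypothetical stand-in for a validated behavior model: flags a
    student when more than half of their responses are very fast."""
    fast = sum(1 for secs in response_secs if secs < threshold_secs)
    return fast / len(response_secs) > 0.5


# Invented records: response times in seconds and a pre/post learning gain.
STUDENTS = {
    "s1": {"response_secs": [1.0, 1.5, 0.8, 1.2], "gain": 0.05},
    "s2": {"response_secs": [8.0, 12.5, 9.1], "gain": 0.30},
    "s3": {"response_secs": [1.1, 0.9, 1.4], "gain": 0.10},
    "s4": {"response_secs": [6.5, 7.0, 10.2], "gain": 0.25},
}


def mean_gain_by_flag(students):
    """Use the model's output as a variable and compare mean learning
    gain between flagged and unflagged students."""
    groups = {True: [], False: []}
    for record in students.values():
        flag = rapid_guess_detector(record["response_secs"])
        groups[flag].append(record["gain"])
    return {flag: mean(gains) for flag, gains in groups.items() if gains}


GAIN_BY_FLAG = mean_gain_by_flag(STUDENTS)
print(GAIN_BY_FLAG)
```

The second analysis is only as trustworthy as the model feeding it, which is why this approach typically relies on models that have already been validated.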
Humans can make inferences about data that may be beyond the scope of what an automated data mining
method provides.[7] In educational data mining, data is distilled for human judgment for two key
purposes: identification and classification.[7]
For the purpose of identification, data is distilled to enable humans to identify well-known patterns, which
may otherwise be difficult to interpret. For example, the learning curve, classic to educational studies, is a
pattern that clearly reflects the relationship between learning and experience over time.
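As an illustration of distilling data into a learning curve, the sketch below aggregates attempt records into an error rate per practice opportunity and prints a crude text chart that a person can inspect. The record layout (student, opportunity number, correctness) and the sample attempts are assumptions made for this example.

```python
from collections import defaultdict


def learning_curve(attempts):
    """Map each practice opportunity number to the share of incorrect
    answers across all students at that opportunity."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for _student, opportunity, correct in attempts:
        totals[opportunity] += 1
        if not correct:
            errors[opportunity] += 1
    return {opp: errors[opp] / totals[opp] for opp in sorted(totals)}


# Invented attempts: (student_id, opportunity_number, answered_correctly).
ATTEMPTS = [
    ("s1", 1, False), ("s1", 2, False), ("s1", 3, True),
    ("s2", 1, False), ("s2", 2, True), ("s2", 3, True),
    ("s3", 1, True), ("s3", 2, True), ("s3", 3, True),
    ("s4", 1, False), ("s4", 2, False), ("s4", 3, False),
]

CURVE = learning_curve(ATTEMPTS)
for opportunity, error_rate in CURVE.items():
    bar = "#" * round(error_rate * 20)
    print(f"opportunity {opportunity}: {error_rate:.2f} {bar}")
```

A falling error rate across opportunities is the familiar shape a reader would look for; a flat or rising curve would prompt a closer look at the skill or the content.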
Data is also distilled for the purpose of classifying features of data, which in educational data mining is
used to support the development of a prediction model. Classification helps to expedite the development of
the prediction model considerably.
The goal of this method is to summarize and present the information in a useful, interactive and visually
appealing way in order to make sense of the large amounts of education data and to support decision
making.[9] In particular, this method is beneficial to educators in understanding usage information and
effectiveness in course activities.[9] Key applications for the distillation of data for human judgment include
identifying patterns in student learning and behavior, identifying opportunities for collaboration, and labeling data for future
use in prediction models.[7]
Applications
A list of the primary applications of EDM is provided by Cristobal Romero and Sebastian Ventura.[5] In
their taxonomy, the areas of EDM application are:
New EDM applications will focus on allowing non-technical users to use and engage with data mining tools and
activities, making data collection and processing more accessible for all users of EDM. Examples include
statistical and visualization tools that analyze social networks and their influence on learning outcomes and
productivity.[14]
Courses
1. In October 2013, Coursera offered a free online course on "Big Data in Education" that
taught how and when to use key methods for EDM.[15] This course moved to edX in the
summer of 2015,[16] and has continued to run on edX annually since then. A course archive
is now available online.[17]
2. Teachers College, Columbia University offers an MS in Learning Analytics.[18]
Publication venues
Considerable amounts of EDM work are published at the peer-reviewed International Conference on
Educational Data Mining, organized by the International Educational Data Mining Society.
EDM papers are also published in the Journal of Educational Data Mining (JEDM).
Many EDM papers are routinely published in related conferences, such as Artificial Intelligence and
Education, Intelligent Tutoring Systems, and User Modeling, Adaptation, and Personalization.
In 2011, Chapman & Hall/CRC Press, Taylor and Francis Group published the first Handbook of
Educational Data Mining. This resource was created for those who are interested in participating in the
educational data mining community.[14]
Contests
In 2010, the Association for Computing Machinery's KDD Cup was conducted using data from an
educational setting.[33] The data set was provided by the DataShop, and it consisted of over 1,000,000 data
points from students using a cognitive tutor.[34] Six hundred teams competed for over US$8,000 in prize
money (which was donated by Facebook). The goal for contestants was to design an algorithm that, after
learning from the provided data, would make the most accurate predictions from new data. The winners
submitted an algorithm that utilized feature generation (a form of representation learning), random forests,
and Bayesian networks.[35]
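Prediction accuracy in a task like this can be scored in several ways. The sketch below assumes root-mean-square error (RMSE) between predicted probabilities of a correct answer and the observed outcomes; the metric choice and the sample values are assumptions for illustration and are not taken from the competition data.

```python
from math import sqrt


def rmse(observed, predicted):
    """Root-mean-square error between observed outcomes (0 or 1) and
    predicted probabilities; lower values mean better predictions."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return sqrt(sum(squared) / len(squared))


# Invented example: whether four steps were answered correctly on the
# first attempt, and a model's predicted probability for each.
OBSERVED = [1, 0, 1, 1]
PREDICTED = [0.9, 0.2, 0.6, 0.8]

SCORE = rmse(OBSERVED, PREDICTED)
print(f"RMSE = {SCORE:.3f}")
```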
Criticisms
Generalizability – Research in EDM may be specific to the particular educational setting
and time in which the research was conducted, and as such, may not be generalizable to
other institutions. Research also indicates that the field of educational data mining is
concentrated in western countries and cultures and, consequently, other countries and
cultures may not be represented in the research and findings.[8] Development of future
models should consider applications across multiple contexts.
Privacy – Individual privacy is a continuing concern for the application of data mining tools.
With free, accessible and user-friendly tools on the market, students and their families may
be at risk from the information that learners provide to the learning system in the hope of
receiving feedback that will benefit their future performance. As users become savvier in their
understanding of online privacy, administrators of educational data mining tools need to be
proactive in protecting the privacy of their users and be transparent about how the information
will be used and with whom it will be shared. Development of EDM tools should consider
protecting individual privacy while still advancing the research in this field.
Plagiarism – Plagiarism detection is an ongoing challenge for educators and faculty
whether in the classroom or online. However, due to the complexities associated with
detecting and preventing digital plagiarism in particular, educational data mining tools are
not currently sophisticated enough to accurately address this issue. Thus, the development
of predictive capability in plagiarism-related issues should be an area of focus in future
research.
Adoption – It is unknown how widespread the adoption of EDM is and the extent to which
institutions have applied and considered implementing an EDM strategy. As such, it is
unclear whether there are any barriers that prevent users from adopting EDM in their
educational settings.
See also
Big data
Data mining
Education
Educational technology
Glossary of education terms
Learning analytics
Machine learning
Statistics
References
1. "EducationalDataMining.org" (http://www.educationaldatamining.org/). 2013. Retrieved
2013-07-15.
2. R. Baker (2010) Data Mining for Education. In McGaw, B., Peterson, P., Baker, E. (Eds.)
International Encyclopedia of Education (3rd edition), vol. 7, pp. 112-118. Oxford, UK:
Elsevier.
3. G. Siemens, R.S.j.d. Baker (2012). "Learning analytics and educational data mining:
towards communication and collaboration". Proceedings of the 2nd International
Conference on Learning Analytics and Knowledge: 252–254.
doi:10.1145/2330601.2330661 (https://doi.org/10.1145%2F2330601.2330661).
S2CID 207196058 (https://api.semanticscholar.org/CorpusID:207196058).
4. "educationaldatamining.org" (https://educationaldatamining.org/). Retrieved 2020-11-14.
5. C. Romero, S. Ventura. Educational Data Mining: A Review of the State-of-the-Art. IEEE
Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews. 40(6),
601-618, 2010.
6. "http://educationaldatamining.org/EDM2008/" Retrieved 2013-09-04
7. Baker, Ryan. "Data Mining for Education" (http://www.columbia.edu/~rsb2162/Encyclopedia%20Chapter%20Draft%20v10%20-fw.pdf) (PDF). Oxford, UK: Elsevier. Retrieved
9 February 2014.
8. Baker, R.S.; Yacef, K (2009). "The state of educational data mining in 2009: A review and
future visions". JEDM-Journal of Educational Data Mining. 1 (1): 3–17.
9. Romero, Cristobal; Ventura, Sebastian (January–February 2013). "WIREs Data Mining
Knowl Discov". Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 3
(1): 12–27. doi:10.1002/widm.1075 (https://doi.org/10.1002%2Fwidm.1075).
S2CID 18019486 (https://api.semanticscholar.org/CorpusID:18019486).
10. Romero, Cristobal; Ventura, Sebastian (2007). "Educational data mining: A survey from 1995
to 2005". Expert Systems with Applications. 33 (1): 135–146.
doi:10.1016/j.eswa.2006.04.005 (https://doi.org/10.1016%2Fj.eswa.2006.04.005).
11. "Assessing the Economic Impact of Copyright Reform in the Area of Technology-Enhanced
Learning" (https://web.archive.org/web/20140413145423/https://www.ic.gc.ca/eic/site/ippd-dppi.nsf/eng/ip01102.html). Industry Canada. Archived from the original (https://www.ic.gc.ca/eic/site/ippd-dppi.nsf/eng/ip01102.html) on 13 April 2014. Retrieved 6 April 2014.
12. Azarnoush, Bahareh, et al. "Toward a Framework for Learner Segmentation." JEDM-Journal
of Educational Data Mining 5.2 (2013): 102-126.
13. U.S. Department of Education, Office of Educational Technology. "Enhancing Teaching and
Learning Through Educational Data Mining and Learning Analytics: An Issue Brief" (https://web.archive.org/web/20140611233540/http://www.ed.gov/edblogs/technology/files/2012/03/edm-la-brief.pdf) (PDF). Archived from the original (https://www.ed.gov/edblogs/technology/files/2012/03/edm-la-brief.pdf) (PDF) on 11 June 2014. Retrieved 30 March 2014.
14. Romero, C.; Ventura, S.; Pechenizkiy, M.; Baker, R. S. (2010). Handbook of educational data
mining. CRC Press.
15. "Big Data in Education" (https://www.coursera.org/course/bigdata-edu). Coursera. Retrieved
30 March 2014.
16. "Big Data in Education" (https://www.edx.org/course/big-data-education-pennx-bde1x-0).
edX. Retrieved 13 October 2015.
17. "Big Data in Education" (http://www.upenn.edu/learninganalytics/MOOT/bigdataeducation.html). Retrieved 17 July 2018.
18. "Learning Analytics | Teachers College Columbia University" (http://www.tc.columbia.edu/human-development/learning-analytics/). www.tc.columbia.edu. Retrieved 2015-10-13.
19. "Home" (https://www.educationaldatamining.org/EDM2008/).
www.educationaldatamining.org. Retrieved 1 July 2022.
20. "EDM'09 - Home" (https://www.educationaldatamining.org/EDM2009/).
www.educationaldatamining.org. Retrieved 1 July 2022.
21. "EDM2010" (https://web.archive.org/web/20111020063642/http://educationaldatamining.org/EDM2010/). 2011-10-20. Archived from the original (http://educationaldatamining.org/EDM2010/) on 20 October 2011. Retrieved 2022-07-02.
22. "EDM2011" (https://web.archive.org/web/20210509052855/https://educationaldatamining.org/EDM2011/). 20 October 2011. Archived from the original (http://www.educationaldatamining.org/EDM2011) on 9 May 2021. Retrieved 1 July 2022.
23. "EDM2012" (https://web.archive.org/web/20130508060102/https://www.educationaldatamining.org/EDM2012/). Archived from the original (http://www.educationaldatamining.org/EDM2012) on 8 May 2013. Retrieved 1 July 2022.
24. "EDM2013" (https://web.archive.org/web/20131229095950/https://www.sites.google.com/a/iis.memphis.edu/edm-2013-conference/). Archived from the original (https://www.sites.google.com/a/iis.memphis.edu/edm-2013-conference/) on 29 December 2013. Retrieved 1 July
2022.
25. "EDM2014" (https://web.archive.org/web/20140130023940/https://www.educationaldatamining.org/EDM2014/). Archived from the original (http://www.educationaldatamining.org/EDM2014/) on 30 January 2014. Retrieved 1 July 2022.
26. "EDM2015" (https://web.archive.org/web/20141008025929/https://www.educationaldatamining.org/EDM2015/). Archived from the original (http://www.educationaldatamining.org/EDM2015/) on 30 January 2014. Retrieved 1 July 2022.
27. "EDM2016" (https://web.archive.org/web/20220513092757/https://www.educationaldatamining.org/EDM2016/). Archived from the original (http://www.educationaldatamining.org/EDM2016/) on 13 May 2022. Retrieved 1 July 2022.
28. "EDM2017" (https://web.archive.org/web/20170430002603/http://educationaldatamining.org/EDM2017). Archived from the original (http://www.educationaldatamining.org/EDM2017/) on
30 April 2017. Retrieved 1 July 2022.
29. "EDM2018" (https://web.archive.org/web/20220513092759/https://educationaldatamining.org/EDM2018/). Archived from the original (http://www.educationaldatamining.org/EDM2018/)
on 13 May 2022. Retrieved 1 July 2022.
30. "EDM2019" (https://web.archive.org/web/20220513092801/https://educationaldatamining.org/edm2019/). Archived from the original (http://www.educationaldatamining.org/EDM2019/)
on 13 May 2022. Retrieved 1 July 2022.
31. "EDM2020" (https://web.archive.org/web/20220122123547/https://www.educationaldatamining.org/EDM2020/). Archived from the original (http://www.educationaldatamining.org/EDM2020/) on 22 January 2022. Retrieved 1 July 2022.
32. "EDM2021" (https://web.archive.org/web/20210815200103/https://educationaldatamining.org/edm2021/). Archived from the original (https://educationaldatamining.org/edm2021/) on 15
August 2021. Retrieved 1 July 2022.
33. "KDD Cup 2010" (https://web.archive.org/web/20100715171814/http://www.kdd.org/kdd2010/kddcup.shtml). KDD. Archived from the original (http://www.kdd.org/kdd2010/kddcup.shtml)
on 15 July 2010. Retrieved 1 July 2022.
34. "PSLC DataShop" (https://web.archive.org/web/20100626050514/https://pslcdatashop.web.cmu.edu/). DataShop. Archived from the original (https://pslcdatashop.web.cmu.edu/) on 26
June 2010. Retrieved 1 July 2022.
35. Yu, Hsiang-Fu; Lin, Chih-Jen; Lin, Hsuan-Tien; Lin, Shou-De; Wei, Yin-Hsuan; Weng, Jui-Yu;
Chang, Chun-Fu; Yan, En-Syu; McKenzie, Todd; Lou, Jing-Kai; Hsieh, Hsun-Ping
(2010). "Feature Engineering and Classifier Ensemble for KDD Cup 2010" (https://web.archive.org/web/20220303192425/https://pslcdatashop.web.cmu.edu/KDDCup/workshop/papers/kdd2010ntu.pdf) (PDF). DataShop. Archived from the original (https://www.csie.ntu.edu.tw/~htlin/paper/doc/wskdd10cup.pdf) (PDF) on 3 March 2022. Retrieved 1 July 2022.
36. "How Can Educational Data Mining and Learning Analytics Improve and Personalize
Education?" (http://edtechreview.in/trends-insights/insights/389-data-mining-and-learning-analytics-improving-education). EdTechReview. Retrieved 9 April 2014.