
Integrating Text Recognition and Subjective Answer Evaluation for Efficient Grading Systems

1st Vaishali Kalyankar-Rajput, 2nd Bhupesh Chavan, 3rd Bhavesh Chaudhari, 4th Nikita Chavan, 5th Harshada Deshingkar, 6th Darshan Bhokare
Department of Artificial Intelligence and Data Science, Vishwakarma Institute of Technology, Pune, India
darshan.bhokare22@[Link], bhupesh.chavan22@[Link], bhavesh.chaudhari221@[Link], nikita.chavan221@[Link], harshada.deshingkar22@[Link]

Abstract—Subjective assessments are essential in education for evaluating students' understanding beyond objective measures; however, manual grading of such assessments is time-consuming and can lead to inconsistencies. This paper presents a novel system for automating the evaluation of subjective answers by comparing student submissions with ideal answer keys provided by faculty. The dual-interface platform allows students to upload assignments in PDF format while faculty members input the ideal answers separately. Natural language processing techniques are employed in the backend to assess the similarity between student answers and ideal responses. To enhance the system's ability to understand contextual and semantic nuances, a fine-tuned Bidirectional Encoder Representations from Transformers (BERT) model trained on the Stanford Natural Language Inference (SNLI) corpus was integrated. This semantic similarity approach allows for deeper analysis of student responses, effectively capturing paraphrased content and synonyms beyond keyword-based methods.

Keywords—Automated evaluation, Subjective assessments, Natural Language Processing (NLP), Jaccard similarity, BERT (Bidirectional Encoder Representations from Transformers), Semantic similarity, Contextual analysis, Stanford Natural Language Inference (SNLI)

I. INTRODUCTION

In the evolving landscape of education, subjective assessments play a pivotal role in evaluating a student's depth of understanding, critical thinking, and ability to articulate complex concepts. Traditional methods of grading subjective answers, however, present significant challenges for educators, including substantial time investment and potential inconsistencies due to human error or bias. As educational institutions increasingly adopt digital platforms, there is a pressing need for automated systems that can efficiently and accurately evaluate subjective student responses.

This paper introduces a novel system designed to automate the evaluation of subjective answers by leveraging advanced natural language processing (NLP) techniques. The system features a dual-interface platform: the student interface allows learners to upload their assignments in PDF format for specific subjects and assignments, while the faculty interface enables educators to input ideal answer keys, which remain inaccessible to students to maintain assessment integrity.

At the core of the system is a backend processing unit that compares student submissions with the faculty's ideal answers to generate personalized evaluation scorecards. Initially, traditional text similarity measures such as Jaccard and Cosine similarity were implemented to assess the overlap between student answers and the ideal responses. Through experimental analysis, it was determined that Jaccard similarity provided a more reliable measure in this context, as it considers the unique presence of words and mitigates the impact of word repetition that can skew Cosine similarity results.

To further enhance the system's capability to understand contextual and semantic nuances in language, we integrated a fine-tuned Bidirectional Encoder Representations from Transformers (BERT) model. Trained on the Stanford Natural Language Inference (SNLI) corpus, the BERT model enables the system to capture deeper semantic relationships between texts, allowing for a more sophisticated comparison that goes beyond mere keyword matching.

The proposed system not only streamlines the grading process by reducing the time and effort required from educators but also provides immediate and consistent feedback to students, thereby enhancing the overall learning experience. By combining traditional NLP techniques with state-of-the-art machine learning models, this work contributes to the field of educational technology by offering an effective solution for the automated evaluation of subjective assessments.

II. LITERATURE REVIEW

The authors of [1] addressed the Automatic Short Answer Grading (ASAG) task using traditional machine learning and transformer-based models. For instance, BERT achieved an accuracy of 66% on the SemEval-2013 SciEntsBank dataset, outperforming some state-of-the-art methods. However, challenges remain in improving explainability and validating results across diverse datasets.

[2] Automated Essay Scoring (AES) has seen advancements with deep learning models like LSTM and Bi-LSTM. Research shows that Bi-LSTM outperforms LSTM, achieving a better Average Mean Error (AME) of 0.6708 and a validation loss of 0.3503 on the ASAP dataset, demonstrating its superior performance in student essay scoring.

[3] A novel short answer grading dataset from a real statistics exam was introduced, achieving an accuracy of 0.7973 (balanced accuracy: 0.6349) and an F1 score of 0.8780 using a sentence embedding-based SVM approach. Future work includes exploring deep learning techniques, numerical grading, and integrating the classifier into an online evaluation tool.

In [4], the proposed model combines handcrafted and deep-encoded features to score essays, achieving a kappa score of 0.81 on the ASAP dataset. It demonstrates adaptability to various essay types, outperforming several baselines. However, limitations include reliance on handcrafted features, restricted evaluation on diverse datasets, and limited use of advanced deep learning methods. Future work could address these gaps by incorporating transformer-based models, testing on broader datasets, and enhancing explainability.

[5] proposed a rank-based approach for automated essay scoring (AES) that uses listwise learning and linguistic and statistical features to optimize agreement between human and machine raters. Listwise learning is a method used in web search ranking that incorporates rater agreement into the loss function. This method eliminates the need for manual feature engineering and domain-specific tuning.

III. METHODOLOGY

The proposed system is designed as a web-based application employing a client-server architecture to facilitate seamless interaction between students and faculty. The methodology encompasses the development of two primary interfaces—the student interface and the faculty interface—backend processing utilizing advanced NLP techniques, integration of frontend and backend components via Python Flask, and the application of evaluation metrics to assess system performance. This section provides a detailed explanation of each component and the processes involved.

A. Frontend Development with Python Flask
Python Flask was chosen for frontend development due to its
lightweight nature, scalability, and robust capabilities for
integrating with Python-based backend models. The Flask
framework enables rapid development and provides the necessary
tools for creating dynamic and responsive web applications. The
student interface is designed to provide functionalities such as
registration, authentication, assignment submission, and feedback
display. Students can create accounts, log in securely, view
available assignments for specific subjects, and upload their
submissions in PDF format. After evaluation, they can view their
results and feedback through a user-friendly dashboard.

The faculty interface offers functionalities including registration, authentication, assignment management, and an analytics dashboard. Faculty members can create, edit, and delete assignments, set deadlines, and upload ideal answer keys for each assignment, which are securely stored and hidden from students. The analytics dashboard provides insights into student performance, submission statuses, and grading trends.

Fig 2. Login for student and faculties

Implementation of the Flask framework involves defining routes and request handling mechanisms. URL mapping is established for different functionalities such as login, upload, and results viewing. Appropriate HTTP methods (GET, POST) are utilized to ensure RESTful API practices. Dynamic content rendering is achieved using the Jinja2 templating engine, allowing the server to render pages with personalized content. Reusable components and modularized UI elements ensure consistency and maintainability.

Fig 3. Faculty Dashboard
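
As a minimal sketch of this routing pattern (not the authors' published code; route names, form fields, and template files here are assumptions), the Flask URL mapping and Jinja2 rendering might look like:

```python
# Sketch of route definitions with GET/POST handling and Jinja2 rendering.
from flask import Flask, redirect, render_template, request, session, url_for

app = Flask(__name__)
app.config["SECRET_KEY"] = "change-me"  # required for session cookies

@app.route("/login", methods=["GET", "POST"])
def login():
    if request.method == "POST":
        # A bcrypt hash comparison against stored credentials would go here.
        session["user"] = request.form["username"]  # hypothetical form field
        return redirect(url_for("results"))
    return render_template("login.html")  # hypothetical Jinja2 template

@app.route("/results")
def results():
    if "user" not in session:  # only logged-in users may view results
        return redirect(url_for("login"))
    # Jinja2 renders a personalized dashboard for the current user.
    return render_template("results.html", user=session["user"])

if __name__ == "__main__":
    app.run(debug=True)
```
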

Secure file upload mechanisms are implemented to handle the transfer of assignment files and ideal answer keys. The secure_filename function from Werkzeug (the WSGI toolkit bundled with Flask) is used to prevent directory traversal attacks, and uploaded files are validated for allowed extensions and size limits to prevent malicious uploads. User authentication and authorization are managed using Flask extensions like Flask-Login and Flask-Security. Session management is handled to ensure that only registered users can access the system. Password security is enforced through hashing algorithms such as bcrypt to protect user credentials. Role-based access control is implemented to differentiate between student and faculty permissions, enhancing the security and integrity of the application.
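
The upload validation described above could be sketched as follows; the storage directory and the 10 MB cap are illustrative assumptions, as the paper does not state its actual limits:

```python
# Sketch of secure upload handling with extension and size validation.
import os
from flask import Flask, abort
from werkzeug.utils import secure_filename  # Werkzeug ships with Flask

app = Flask(__name__)
app.config["MAX_CONTENT_LENGTH"] = 10 * 1024 * 1024  # reject bodies over 10 MB

ALLOWED_EXTENSIONS = {"pdf"}
UPLOAD_DIR = "uploads"  # hypothetical storage directory

def save_upload(file_storage):
    """Validate and store an uploaded assignment or ideal answer key."""
    filename = secure_filename(file_storage.filename)  # strips ../ traversal
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext not in ALLOWED_EXTENSIONS:
        abort(400, description="Only PDF uploads are accepted.")
    os.makedirs(UPLOAD_DIR, exist_ok=True)
    path = os.path.join(UPLOAD_DIR, filename)
    file_storage.save(path)
    return path
```
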
Fig 4. Student Dashboard to submit assignments
Frontend security measures include protection against cross-site
request forgery (CSRF) attacks by incorporating CSRF tokens in
forms, input sanitization to prevent injection attacks, and
enforcement of HTTPS connections to encrypt data transmitted
between the client and server.

Fig 5. Create new assignment page

Fig 1. Architecture of the Platform

B. Backend Processing and NLP Integration

The backend processing unit is responsible for evaluating student submissions against ideal answer keys using various NLP techniques. The process begins with data extraction and preprocessing, where the uploaded PDF files are converted into raw text. For text-based PDFs, libraries like PyPDF2 or pdfminer are used to extract text directly. For scanned PDFs, Tesseract OCR is integrated via pytesseract to convert images into text.
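
One plausible shape for this extraction step is sketched below (not the paper's code; it additionally assumes the pdf2image package, with the Tesseract and Poppler binaries installed, to rasterize scanned pages for OCR):

```python
# Sketch: direct text extraction for text-based PDFs, OCR fallback for scans.
import pytesseract
from pdf2image import convert_from_path
from PyPDF2 import PdfReader

def extract_text(pdf_path: str) -> str:
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    if text.strip():  # a usable text layer was found
        return text
    # Scanned PDF: rasterize each page and run Tesseract OCR on the image.
    images = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(img) for img in images)
```
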
Text preprocessing involves normalization steps such as lowercasing and removing punctuation and special characters to ensure case-insensitive comparisons and clean the text. Tokenization is performed using tools like NLTK's word_tokenize or spaCy's tokenizers to split the text into individual words. Stop-word removal is conducted using predefined lists from NLTK to eliminate common words that do not contribute significantly to meaning. Lemmatization is applied using the WordNet Lemmatizer or spaCy's lemmatizer to reduce words to their base forms.
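
The NLTK variant of this pipeline could be sketched as follows (the regex-based punctuation filter and resource choices are assumptions):

```python
# Sketch of the normalization pipeline: lowercase, strip punctuation,
# tokenize, remove stop words, lemmatize.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text: str) -> list[str]:
    text = text.lower()                       # case-insensitive comparison
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop punctuation/specials
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMATIZER.lemmatize(t) for t in tokens]
```
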
Three primary similarity measurement techniques are implemented: Jaccard similarity, Cosine similarity, and semantic similarity using Bidirectional Encoder Representations from Transformers (BERT). Jaccard similarity measures the similarity between two sets by dividing the size of their intersection by the size of their union. It focuses on the presence or absence of unique words in the texts, reducing the impact of word repetition. Cosine similarity evaluates the cosine of the angle between two non-zero vectors in a multidimensional space representing the text data. Texts are transformed into numerical vectors using term frequency-inverse document frequency (TF-IDF) representations. However, Cosine similarity can be sensitive to word frequency, which may lead to skewed results when words are repeated excessively.
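
A compact sketch of both lexical measures, reusing the preprocess helper from the sketch above for Jaccard and scikit-learn (an assumed dependency) for TF-IDF cosine similarity:

```python
# Sketch: Jaccard on unique-token sets, cosine on TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def jaccard_similarity(student: str, ideal: str) -> float:
    a, b = set(preprocess(student)), set(preprocess(ideal))
    return len(a & b) / len(a | b) if (a | b) else 0.0  # |A∩B| / |A∪B|

def cosine_tfidf_similarity(student: str, ideal: str) -> float:
    vectors = TfidfVectorizer().fit_transform([student, ideal])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])
```
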
To enhance the system's ability to understand contextual and semantic nuances, a pre-trained BERT model from the Hugging Face Transformers library is fine-tuned on the Stanford Natural Language Inference (SNLI) corpus. The fine-tuning process involves setting up a model architecture with input layers for token IDs, attention masks, and token type IDs. The BERT layers generate contextualized embeddings, and a Bidirectional Long Short-Term Memory (BiLSTM) layer is added on top to capture sequential dependencies. The output is processed through global average and max pooling layers, followed by a dense layer with softmax activation for classification.
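
One plausible Keras realization of this architecture, assuming TensorFlow alongside the Hugging Face Transformers library; layer sizes and the sequence length are illustrative, as the paper does not specify them:

```python
# Sketch of the described stack: BERT -> BiLSTM -> avg/max pooling -> softmax.
import tensorflow as tf
from transformers import TFBertModel

MAX_LEN = 128
NUM_CLASSES = 3  # SNLI labels: entailment, contradiction, neutral

def build_model():
    token_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")
    token_type_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="token_type_ids")

    bert = TFBertModel.from_pretrained("bert-base-uncased")
    # Contextualized embeddings for every token in the sequence.
    sequence_output = bert(
        input_ids=token_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
    ).last_hidden_state

    # BiLSTM on top of BERT to capture sequential dependencies.
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True)
    )(sequence_output)

    # Global average and max pooling, concatenated, then softmax classification.
    avg_pool = tf.keras.layers.GlobalAveragePooling1D()(x)
    max_pool = tf.keras.layers.GlobalMaxPooling1D()(x)
    x = tf.keras.layers.concatenate([avg_pool, max_pool])
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

    model = tf.keras.Model(
        inputs=[token_ids, attention_mask, token_type_ids], outputs=outputs
    )
    return model, bert  # return bert too so its layers can be frozen/unfrozen
```
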
The training process for the BERT model is conducted in two phases. Initially, the BERT layers are frozen, and only the top layers are trained to adapt to the new task. Subsequently, the BERT layers are unfrozen, and the entire model is fine-tuned with a lower learning rate to prevent disrupting the pre-trained weights. The Adam optimizer is used with appropriate learning rates for each phase, and the categorical cross-entropy loss function is applied for multi-class classification.
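
Continuing the sketch above, the two-phase schedule might be expressed as follows (epoch counts and the phase-1 learning rate are assumptions; the 2e-5 fine-tuning rate is the one reported in the results section; train_ds and val_ds stand in for tokenized SNLI datasets with one-hot labels):

```python
# Sketch of the two-phase training schedule.
model, bert = build_model()

# Phase 1: freeze BERT, train only the top layers.
bert.trainable = False
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(train_ds, validation_data=val_ds, epochs=1)

# Phase 2: unfreeze BERT and fine-tune end to end at a lower learning rate.
bert.trainable = True
model.compile(  # recompile so the trainable change takes effect
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(train_ds, validation_data=val_ds, epochs=2)
```
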

The evaluation pipeline involves processing the student's uploaded assignment through text extraction and preprocessing as described. Similarity computations are then conducted using the Jaccard and Cosine similarity measures, as well as the BERT-based semantic similarity. The scores from these methods are aggregated, possibly using weighted averaging, to produce a final similarity score. Thresholding is applied to classify the level of similarity (e.g., high, medium, low), and feedback is generated based on the score. This feedback includes score interpretation, highlighting key areas where the student's answer aligns or deviates from the ideal answer, and providing recommendations for improvement.
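
A sketch of this aggregation and thresholding step; the weights and cut-offs below are illustrative placeholders, since the paper does not publish its tuned values:

```python
# Sketch: weighted average of the three scores, then thresholding to a level.
WEIGHTS = {"jaccard": 0.3, "cosine": 0.2, "bert": 0.5}
THRESHOLDS = [(0.75, "high"), (0.45, "medium"), (0.0, "low")]

def final_score(jaccard: float, cosine: float, bert: float) -> tuple[float, str]:
    score = (WEIGHTS["jaccard"] * jaccard
             + WEIGHTS["cosine"] * cosine
             + WEIGHTS["bert"] * bert)
    label = next(lab for cut, lab in THRESHOLDS if score >= cut)
    return score, label

score, label = final_score(0.62, 0.70, 0.81)  # -> (0.731, "medium")
```
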

C. Integration of Frontend and Backend

The integration of the frontend and backend components is facilitated by Python Flask, providing the necessary routing and request handling mechanisms. API endpoints are established to trigger backend processing functions upon user actions such as uploading an assignment or an ideal answer key. For instance, when a student uploads an assignment via the /upload_assignment endpoint, the backend receives the file and initiates processing. The text extracted from the PDF is sent to the backend processing unit, where the NLP techniques are applied to evaluate the answer. The resulting similarity scores are then returned to the frontend and displayed to the user in a comprehensible format.

Data flow between the frontend and backend is managed using asynchronous JavaScript and XML (AJAX) calls, allowing for non-blocking communication and enhancing the user experience. Data is exchanged in JSON format for consistency and ease of parsing. Session and state management are maintained through session tokens and state persistence mechanisms to ensure continuity across multiple requests and prevent data loss in case of interruptions.

Security and data protection are integral to the integration process. Secure communication is enforced through HTTPS, and authentication tokens such as JSON Web Tokens (JWT) may be used for stateless authentication, enhancing scalability. Data encryption is applied both at rest and in transit to protect sensitive information.
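
Tying the earlier sketches together, the /upload_assignment endpoint named above might return JSON along these lines; load_ideal_answer and bert_similarity are hypothetical wrappers, and the other helpers refer to the previous sketches:

```python
# Sketch: upload endpoint returning JSON for the frontend's AJAX call.
from flask import jsonify, request

# `app` is the Flask application from the earlier routing sketch.
@app.route("/upload_assignment", methods=["POST"])
def upload_assignment():
    path = save_upload(request.files["assignment"])
    student_text = extract_text(path)
    ideal_text = load_ideal_answer(request.form["assignment_id"])  # hypothetical
    j = jaccard_similarity(student_text, ideal_text)
    c = cosine_tfidf_similarity(student_text, ideal_text)
    b = bert_similarity(student_text, ideal_text)  # hypothetical model wrapper
    score, label = final_score(j, c, b)
    return jsonify({"score": round(score, 3), "level": label})
```
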
D. Evaluation Metrics

To assess the performance of the system and determine the effectiveness of each similarity measurement technique, several evaluation metrics are employed. Accuracy is calculated as the ratio of correctly predicted instances to the total instances, measuring the overall correctness of the BERT model's classifications compared to the ground truth. Precision, recall, and F1-score are particularly important for the BERT-based semantic similarity model. Precision indicates the model's ability to correctly identify relevant instances, recall reflects its ability to find all relevant instances, and the F1-score provides a balance between precision and recall.

A confusion matrix is used to provide a detailed breakdown of prediction outcomes, helping to identify specific areas of misclassification. Correlation with manual grading is assessed using the Pearson correlation coefficient and Spearman rank correlation, measuring the linear and monotonic relationships between automated similarity scores and manual grades, respectively. The Area Under the ROC Curve (AUC-ROC) is calculated to represent the model's ability to distinguish between classes, with higher values indicating better performance. Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) are used if the evaluation is treated as a regression problem, measuring the average squared difference between predicted and actual values.

An experimental setup is established to ensure rigorous evaluation of the system. Dataset preparation involves collecting training data for BERT fine-tuning from the SNLI corpus and possibly augmenting it. Real student submissions are collected for system evaluation, with manual grading by faculty members serving as the ground truth. The model training involves hyperparameter tuning and validation strategies such as k-fold cross-validation to enhance reliability. Performance analysis includes a comparative study against baseline models, statistical significance testing, and visualization through graphs and plots.
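
These metrics could be computed with scikit-learn and SciPy (assumed tooling; the arrays below are illustrative stand-ins, not the paper's data):

```python
# Sketch of the metric computations described above.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             mean_squared_error,
                             precision_recall_fscore_support)

y_true = [2, 0, 1, 2, 1, 0]  # e.g., entailment/contradiction/neutral ids
y_pred = [2, 0, 1, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
cm = confusion_matrix(y_true, y_pred)

manual = [7.5, 4.0, 6.0, 8.0, 5.5]      # manual grades (ground truth)
auto = [0.78, 0.41, 0.66, 0.83, 0.52]   # automated similarity scores
pearson_r, _ = pearsonr(manual, auto)    # linear agreement
spearman_r, _ = spearmanr(manual, auto)  # monotonic agreement
# Regression view: rescale similarities to the grade range before RMSE.
rmse = np.sqrt(mean_squared_error(manual, [a * 10 for a in auto]))
```
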
IV. RESULTS AND DISCUSSION

The developed automated subjective answer evaluation system underwent extensive testing to assess its effectiveness in accurately grading student submissions compared to manual evaluations by faculty members. The evaluation focused on three primary similarity measurement techniques: Jaccard similarity, Cosine similarity, and semantic similarity using a fine-tuned BERT model. The performance of each method was analyzed using a dataset comprising student answers and corresponding ideal answers across various subjects and assignment types. This section presents a detailed analysis of the results obtained and discusses the implications of the findings.

A. Evaluation of Similarity Measures

The first method evaluated was the Jaccard similarity measure, which quantifies the overlap of unique words between student answers and the ideal answers. The dataset included 500 student submissions covering subjects such as computer science, literature, and social sciences. The Jaccard similarity scores ranged from 0.15 to 0.85, with an average score of 0.47. Student responses that closely matched the ideal answers in terms of key concepts and terminology yielded high Jaccard similarity scores, typically above 0.70. For instance, in a computer science question about "inheritance in object-oriented programming," students who included essential terms like "class," "object," "parent," "child," and "reuse" achieved scores around 0.78 to 0.85. Conversely, responses that deviated significantly from the expected content, either by discussing irrelevant topics or omitting essential terms, resulted in lower scores, often below 0.30. While Jaccard similarity effectively identified the presence or absence of essential terms, it lacked the ability to capture contextual nuances or variations in expression. Synonyms and paraphrased content were often not recognized, leading to lower scores for answers that were correct but worded differently from the ideal answer.

Cosine similarity was then utilized to evaluate the angle between the vector representations of the student and ideal answers, considering word frequency. The TF-IDF vectorization accounted for term importance across the corpus. The Cosine similarity scores exhibited a narrower range, from 0.40 to 0.95, with an average score of 0.68. The higher average score compared to Jaccard similarity indicated a general tendency for Cosine similarity to assign higher similarity values. However, the sensitivity of Cosine similarity to word frequency led to inflated similarity scores for responses that repeated certain terms, regardless of the overall relevance of the content. For example, a student who repeatedly mentioned "inheritance" and "class" without demonstrating understanding achieved a high score of 0.88. This reduced the method's discriminative ability, as both coherent answers and those with mere keyword repetition received high similarity scores, diminishing the reliability of Cosine similarity as a sole metric for accurate evaluation.

The fine-tuned BERT model demonstrated a strong ability to comprehend contextual and semantic relationships between student and ideal answers. The model was fine-tuned using a dataset of 10,000 sentence pairs from the SNLI corpus and additional domain-specific data, trained over three epochs with a batch size of 16 and an Adam optimizer learning rate of 2e-5. On the validation dataset, the model achieved an accuracy of 85%. Precision, recall, and F1-score for each class (entailment, contradiction, neutral) averaged 0.84, indicating robust performance across categories. The BERT model effectively identified paraphrased responses and appropriately graded answers that expressed the correct concepts using different wording. For instance, in the inheritance question, a student who wrote, "Inheritance allows a new class to adopt properties of existing ones, promoting code reuse," received a high similarity score, despite not using the exact phrases from the ideal answer. The model also demonstrated the ability to understand synonyms and contextual meanings, correctly assessing answers that used different terminology but conveyed the same ideas. However, some challenges were noted in handling complex sentence structures or when students used metaphorical language, indicating areas for further improvement.

Fig 6. Similarity Results of assignments

Fig 7. Comparison between ideal assignment answer and student's submitted answer

B. Comparative Analysis

A comparative analysis was conducted to evaluate the alignment of each similarity measure with manual grading by faculty. The correlation coefficients between the automated scores and manual grades were calculated using Pearson's correlation. The Jaccard similarity scores exhibited a moderate positive correlation of 0.65 with manual grades, indicating a reasonable level of agreement with human evaluators, especially in cases where key terms are essential indicators of understanding. Cosine similarity had a lower correlation coefficient of 0.52, reflecting its limitations in distinguishing between relevant and irrelevant content due to word repetition. The BERT-based semantic similarity achieved the highest correlation coefficient of 0.78 with manual grades, demonstrating a strong alignment. This strong correlation is attributed to the model's ability to understand context, capture semantic nuances, and recognize paraphrased content.

A statistical significance test using a two-tailed t-test confirmed that the BERT-based method's correlation with manual grading was significantly higher (p < 0.01) than that of Jaccard and Cosine similarities. This finding supports the adoption of the BERT-based semantic similarity measure as the primary evaluation method in the system.

Fig 8. Performance Comparison of Models

V. CONCLUSION

This paper presented an automated system for evaluating subjective student answers by integrating traditional Natural Language Processing techniques with advanced machine learning models. Utilizing Python Flask for frontend development and seamless backend integration, we created a user-friendly web application that streamlines the grading process for educational institutions. While Jaccard similarity offers a computationally efficient method for initial assessments, it lacks depth in capturing semantic meaning. Cosine similarity proved less reliable due to its sensitivity to word repetition. The fine-tuned BERT model significantly enhanced the system's ability to comprehend contextual and semantic nuances, closely aligning automated evaluations with manual grading and achieving 85% accuracy on validation data. The system reduces faculty workload and provides students with immediate, consistent feedback, enhancing the learning experience. Challenges remain regarding computational demands affecting scalability. Future work will focus on optimization, enhanced language understanding, adaptive learning, multilingual support, and user experience improvements. The system holds significant potential to transform grading practices and improve educational outcomes.

VI. REFERENCES

[1] Hadi Abdi Ghavidel, Amal Zouaq and Michel C. Desmarais, "Using BERT and XLNET for the Automatic Short Answer Grading Task", 12th International Conference on Computer Supported Education, 2020.
[2] T.S. Adharsh and M.K. Jeyakumar, "Deep Learning Based Automatic Answer Scoring Through Bi Directional LSTM", Migration Letters, Volume 20.
[3] R.R. Venga, [Link], M.S. Bharath, "Autograder: A Feature-Based Quantitative Essay Grading System Using BERT", ICT Infrastructure and Computing (pp. 71-81), Oct 2023.
[4] Chen H. and He B., "Automated essay scoring by maximizing human-machine agreement", Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.
[5] Heilman M. and Madnani N. (2013), "ETS: domain adaptation and stacking for short answer scoring", in Proceedings of the Seventh International Workshop on Semantic Evaluation.
[6] Prabhu S., Akhila K. and Sanriya S. (2022), "A hybrid approach towards automated essay evaluation based on BERT and feature engineering", in 2022 IEEE 7th International Conference for Convergence in Technology (I2CT).
[7] Yang Y., Xia L. and Zhao Q. (2019), "An automated grader for Chinese essay combining shallow and deep semantic attributes".
[8] Attali Y. and Burstein J. (2006), "Automated essay scoring with e-rater v.2", J. Technol. Learn. Assess. 4(3).
[9] Saravanan, Kalaimathi, et al., "Exam Marks Summation App Using Tesseract OCR in Python", International Journal of Integrated Engineering 14.3 (2022): 102-110.
[10] Patience, Okechukwu Ogochukwu, et al., "Enhanced Text Recognition in Images Using Tesseract OCR within the Laravel Framework", Asian Journal of Research in Computer Science 17.9 (2024): 58-69.
[11] Mahmud, Saikat, et al., "Automatic Multiple Choice Question Evaluation Using Tesseract OCR and YOLOv8", 2024 IEEE Conference on Artificial Intelligence (CAI), IEEE, 2024.
[12] Joshi, Kartik, "Study of Tesseract OCR", GLS KALP: Journal of Multidisciplinary Studies 1.2 (2021): 41-50.
[13] Zacharias, Ebin, Martin Teuchler, and Bénédicte Bernier, "Image processing based scene-text detection and recognition with tesseract", arXiv preprint arXiv:2004.08079 (2020).
