
Automatic Extraction of Segments from Resumes using Machine Learning

Gunaseelan B, Supriya Mandal, Rajagopalan V
Digital Technology Solutions, Mirafra Software Technologies Pvt Ltd, Bangalore, India

2020 IEEE 17th India Council International Conference (INDICON) | 978-1-7281-6916-3/20/$31.00 ©2020 IEEE | DOI: 10.1109/INDICON49873.2020.9342596

Abstract— Online recruitment systems, or automatic resume processing systems, are becoming more popular because they save time for both employers and job seekers. Manually processing resumes and fitting them to several job specifications is a difficult task. Given the growing volume of data, it is a big challenge to analyze each resume effectively on parameters such as experience and skill set. Processing, extracting information from and reviewing these applications automatically would save time and money. Automatic data extraction focuses primarily on the skillset, experience and education in a resume, so it is extremely helpful for mapping the appropriate resume to the right job description. In this research study, we propose a system that uses multi-level classification techniques to automatically extract detailed segment information such as skillset, experience and education from a resume based on specific parameters. We have achieved state-of-the-art accuracy in segmenting resumes to identify skill sets.

Keywords—Information Retrieval, IR, resume parsing, segmentation, classification, bagging, boosting, GBDT and random forest.

I. INTRODUCTION
Online or intelligent recruitment systems play a significant role in the recruitment channel with advanced text processing techniques. In recent years, most companies have posted their job requirements on various online platforms or gathered a large number of resumes from different sources. The recruitment industry spends an excessive amount of time and energy on filtering resumes and pulling out the data relevant to job requirements. Reviewing resumes against a job description is tedious because resumes are in an unstructured format. Therefore, in recent years, many recruitment tools have been developed for the retrieval of information. Although fundamental theories and processing methods for web data extraction exist, most recruitment tools are still subject to text processing and candidates complying with job requirements. Chif et al. [1] use ontology matching to distinguish technical skills with a set of common multi-word part-of-speech patterns, such as lexicalized, multi-word language. If a specific skill is not mentioned in the database, the method fails to recognize the information to extract from the resume. Named-entity recognition (NER) has been widely studied in NLP to identify and extract named entities from text. With increased computational power, deep learning models such as LSTM and RNN are also being used for named-entity recognition. M. Zhao et al. [2] proposed NER and named entity normalization (NEN) to extract relevant information from resumes. NER refers to the identification of a phrase that expresses a skill, while NEN goes further and relates the phrase to a specific individual or entity. The authors define a SKILL-considered application which detects, extracts and links a skill to a qualified target. Most recruitment systems invest more time in structuring the information from unstructured resumes by several techniques. Skill tagging matches the skills found in resumes with the concepts of a skill taxonomy, which needs to be developed for each domain. This task was automated using MediaWiki and rules established on category tags. Chen et al. [3] integrated text block classification by defining the text as separate blocks, with the help of the title words and writing style of the document in resumes. This is more complicated for unstructured resumes, however, because the writing style and the title words vary across resumes. There are some NLP techniques to understand the context in the text. Hence, a heuristic model has been used to retrieve information from resumes based on lexical, syntactic and semantic analysis [4]. Unlike the classification model, Bernabé-Moreno et al. [5] used a word embedding technique with the GloVe model to mine skill information by capturing the semantic relationships between terms, and also clustered the entities with the help of t-SNE. Sequential models such as LSTM and RNN are extensively used for sequence tagging tasks in text, speech, and images. Ayishathahira et al. [6] developed a Convolutional Neural Network (CNN) segmentation model that segmented the knowledge into family, educational, occupational, and others. In their research, CRF and Bi-LSTM-CNN models were created for information extraction. They also showed that CRF outperforms the neural network models, based on segmentation accuracy. Kumar, R. et al. [7] proposed a supervised learning approach that uses deep learning to predict keyword relevance in order to identify the skills in a resume. They took skills as relevant keywords and all others as irrelevant keywords. They then measured the importance of each word in the text, which reflects the probability that the word is a skill in the document. Feature extraction and feature selection are important steps in text classification, to understand the pattern and reduce the number of features in the high-dimensional feature space. Shah et al.



Authorized licensed use limited to: BRACT's Vishwakarma Institute Pune. Downloaded on March 27,2025 at 10:59:41 UTC from IEEE Xplore. Restrictions apply.
[8] extracted the features using principal component analysis (PCA) to produce a lower-dimensional feature set from the original text and get rid of the irrelevant feature space. They also used information gain and ambiguity measures for feature selection to improve the model. Dwivedi et al. [9] carried out a comparative study of rule-based model (RBM) features and lexicon features for the optimization of sentiment classification. Kumar, S. et al. [10] worked on N-gram model sequences for feature extraction, including unigrams, bigrams and trigrams based on the importance of context, for varying analysis of co-occurring words within a given window. The methodology helps to understand the sequences of words in a given sample text. In [11], the authors proposed Term Frequency-Inverse Document Frequency (TF-IDF) for feature extraction, and a Support Vector Machine (SVM) trained on these features produced a satisfactory classification. Manchanda et al. [12] use a dynamic programming algorithm for segmenting text from labelled documents with the help of a distant multi-class supervised approach. Jo [13] proposed text segmentation with K Nearest Neighbors, classifying whether a sentence or paragraph belongs to a specific group by calculating the similarity among feature vectors, which also considers the similarity between attributes. This analysis found that similarity features play an important role in segmenting various information from the document. Hakak et al. [14] present a survey of single-pattern exact string matching algorithms and the working process of different string matching algorithms, to identify the proper string matching algorithm depending on the application and the complexity of the problem. Ensemble learning combines more than one machine learner into one ensemble learner according to an ensemble policy. Ensemble learning algorithms try to produce a set of learners from the training data set and integrate these learners to deal with the same task. Zhang et al. [15] use ensembling techniques such as Random Forest (RF), Bagging and AdaBoost classifiers for imbalanced text classification, and also propose sampling techniques such as under-sampling and bootstrap re-sampling to improve the classification performance. Goudjil et al. [16] used active learning with a multi-class support vector machine (SVM), which gives a better model with minimal labelled data: the algorithm chooses the data to learn from by selecting the samples considered to be the most informative. Xu et al. [17] proposed a model using Extreme Gradient Boosting (XGB) for a recommendation engine, predicting the probability for every sample and taking an appropriate threshold to decide the label. They thus converted the recommendation problem into a binary classification problem and extracted features from consumers' behaviour history to predict the target user. Their research shows that the model can handle enormous numbers of records in a short time and performs well.

In our research study, we have proposed two levels of classification techniques to extract segment information from the resume. First, a classification model has been built to predict whether a text line in the resume is a heading or not. The heading is then classified into different heading categories. Finally, we implemented a method that extracts the segment under each heading. The paper is set out as follows. The processing of data is dealt with in Section II. Section III explains the proposed approach to resume segmentation. The next section focuses on performance and technique evaluation. Finally, we conclude the paper.

II. DATA PREPARATION
A. Data Collection
In our recruitment system, all the resumes gathered from different sources for various requirements are stored in a database. Resumes were stored in various file formats such as DOC, DOCX, PDF, TXT and HTML. We considered MS Word and PDF files for this experiment. 400 resumes were randomly picked from the recruitment database and used as the base dataset for our experiments. Extracting text from a resume is a challenging task. We explored Python packages such as PyPDF2 and docx2txt, as well as OCR, to extract the text from a resume, but these led to issues including a line break inserted where a bold word begins, and incorrect text and table alignment. All the files were processed first to extract text and tables from the resumes. Pydocx, an open-source library, was used to parse docx files, and Pdfminer, a Python-based open-source library, was used to extract text and tables from PDF files. The extracted text was then used to train classifiers for heading prediction.

B. Data Labelling
Data labelling refers to assigning labels to data points to make the information suitable for training supervised machine learning models. Each row of text extracted from the resumes was labelled manually by experts. Label 1 was assigned to every text line that is a heading and 0 to every not-a-heading line. 26,092 lines of text from resumes were labelled. Table I shows an example of labelled data points for a resume.

TABLE I. SAMPLE LABELLED DATA
Text | Target class
SUMMARY OF EXPERIENCE | 1
Highly skilled at building teams, maintaining a healthy and conducive environment for productivity and Team work. | 0
SKILLS PROFILE | 1
Platform/Technologies: SAS, SQL, PL/SQL Basics, Unix Commands and Basic level Shell Scripting | 0
Learning Technologies: R, Tableau and Python, Data Science Concepts | 0
PROJECTS SUMMARY | 1
Project Names: AG Insurance | 0
Description: The customer is an insurance-banking firm involved in developing marketing insurance and services for self-employed persons, private customers and small businesses based on specifications by the business analysts as well as programming using SAS Base. | 0

Fig. 1. Distribution of target class

The class ratio is shown in Fig. 1. If a text line was a heading, the label was set to 1, otherwise 0. Labelling text is one of the most critical steps because the model's performance depends on the quality of the data.

C. Data Pre-processing
The process of transforming raw data into usable training data is referred to as data pre-processing. Resumes from which properly aligned text could not be extracted were dropped from the sample. Resumes usually contain text in different layouts such as lists, tables, or a combination of both; we excluded from the classifier model the table-format resumes that did not give a proper alignment. Raw data is rarely usable as it stands and cannot be sent through a model directly because it causes errors, so the data is pre-processed before a model is fitted. After this step, 333 unique resumes were finalized from the selected sample and processed further to extract text lines. Next, we dropped missing values and text lines containing only special characters, to reduce model complexity.

D. Feature Extraction
Different types of features are extracted and analyzed to collect meaningful insights from the text and to differentiate headings from not-a-heading lines. Some features are described below.
1) Word Count: The total number of words in the text. The intuition behind this feature is that a heading generally contains fewer words than a not-a-heading line.
2) Length of text: Number of characters in the text.
3) Text ends with symbol: Whether the text line ends with a special pattern or symbol. Headings usually end with a pattern such as a colon (:), a hyphen (-) or a combination of both (:-), whereas not-a-heading lines or contents end with a dot (.).
4) Average word length: The average length of the words in each text block. This can also potentially help in improving the model.
5) Stop Words: Number of stop words.
6) Numeric: Number of numeric values.
7) Special Characters: Number of special characters.
8) Text Case Feature: Whether all the words in the text are in upper case, lower case or camel case (first letter of every word in upper case). People mainly use camel case and upper case for headings, to set them apart from not-a-heading text.
9) Part of Speech Feature (POS): The number of occurrences of every POS tag is determined and used to compute the frequency of each POS tag in the text of every data point. These frequencies serve as potential features for the model.
10) Similarity Features: We took the most repeated words across all heading texts and created a feature for each of these words. Then, we calculated the cosine similarity of each word feature with the text present in each block of information.

E. Feature Analysis
All the features were analysed to observe their impact on the target class. For example, the average word count is higher in class 0 than in class 1, as shown in Fig. 2, because not-a-heading lines contain more words than headings. The probability of the text-ends-with-symbol feature is higher in class 1 than in class 0, as shown in Fig. 3, because, unlike not-a-heading lines, headings mostly end with symbols such as a colon, a hyphen or a combination of both.

Fig. 2. Word count impact on target class.

F. Feature Transformation
Standardization is an important technique, mostly preferred as a pre-processing step before applying machine learning models. Most algorithms perform well when variables are on a relatively similar scale or close to a normal distribution. Some of the extracted features are highly skewed. To normalize the skewed features, we used the Robust Scaler technique. It transforms the variable vector by subtracting the median and then dividing by the interquartile range (75th percentile minus 25th percentile). Because the median and the interquartile range are barely affected by extreme values, this scaling also copes with outliers in the data. Then, the categorical features, including bold and text ending with the special pattern, were transformed by applying one-hot encoding. Keeping every dummy variable would result in a dummy variable trap, as the other variables can easily predict the outcome of one variable, and the trap leads to a multicollinearity problem because of the dependency between the independent variables. So, we dropped one of the dummy variables from each categorical feature to get rid of this problem.

Fig. 3. Impact of text end with special symbols on target class.

G. Creation of Training and Validation data
The data is divided into two parts, train and test, using the stratified sampling technique, so that the two labels are distributed in the same ratio in both sets. The model can thus generalize well on unseen data. In other words, it can predict more accurately once its internal parameters have been adjusted during training and validation. The total number of observations in the train and test sets is shown in Fig. 4.

III. METHODS
In general, resumes are organized into multiple sections such as objective, summary, skillset, education, experience, project and personal details. Under each heading, the content gives a detailed description. All the sections are separated by title headings. We considered each section of a resume as a segment. To extract the details of such a segment, we used multi-level classification and retrieved the information under that segment. The whole process has three main steps: a) heading prediction, which predicts whether a line is a heading or not; b) segment classification, which maps a heading to a generalized segment label; and c) segment extraction, which extracts the segment details.

1) Heading Prediction: A classifier model is fitted on the labelled data so that, given text from a resume, it differentiates headings from not-a-heading lines. For model training, we extracted 61 features and learned the function that maps the input (variables) to the output (target). We split the dataset into train and test sets for model building and used the scikit-learn module for the classifier algorithms. Multiple classification models were built using various techniques, such as K-Nearest Neighbors (KNN), Support Vector Machine (SVM) with RBF kernel, and ensembling techniques such as Bagging, Random Forest (RF), Gradient Boosting (GB), Extreme Gradient Boosting (XGB), AdaBoost and LightGBM. Finally, the best classifier was chosen for heading prediction by the evaluation metrics. However, heading prediction behaves differently when the entire resume is in table format, because of the difficulties mentioned in the data pre-processing section. For such resumes, the similarity features are used instead of the classifier model to identify headings based on a threshold value. We therefore proposed the following rule: if the resume is in list format or in a combination of list and table format, headings are predicted by the classifier model; if it is in table format, headings are predicted from the similarity score using a threshold value.
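As an illustration of the hand-crafted line features listed under Feature Extraction (Section II-D) that feed the heading classifier, the following minimal sketch computes features 1) to 8) for a single text line. It is not the authors' implementation: the function names, the stop-word list and the exact definition of each feature are assumptions made for the example, and the POS and similarity features are left out.

```python
import re
import string

# Assumed stop-word list; the paper does not say which list was used.
STOP_WORDS = {"a", "an", "and", "at", "for", "in", "of", "on", "the", "to", "with"}
# End-of-line patterns that the paper associates with headings.
HEADING_ENDINGS = (":-", ":", "-")


def text_case(words):
    """Return 'upper', 'lower', 'camel' (every word capitalised) or 'mixed'."""
    alpha = [w for w in words if any(c.isalpha() for c in w)]
    if not alpha:
        return "mixed"
    if all(w.isupper() for w in alpha):
        return "upper"
    if all(w.islower() for w in alpha):
        return "lower"
    if all(w[0].isupper() for w in alpha):
        return "camel"
    return "mixed"


def line_features(line):
    """Compute features 1) to 8) of Section II-D for one text line."""
    text = line.strip()
    words = text.split()
    return {
        "word_count": len(words),
        "char_length": len(text),
        "ends_with_symbol": text.endswith(HEADING_ENDINGS),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "stop_words": sum(w.lower().strip(string.punctuation) in STOP_WORDS for w in words),
        "numeric": len(re.findall(r"\d+", text)),
        "special_chars": sum(c in string.punctuation for c in text),
        "case": text_case(words),
    }


heading = line_features("TECHNICAL SKILLS:")
content = line_features("Worked and contributed with the team in the project.")
```

On these two sample lines the heading is short, upper case and ends with a colon, while the content line is longer, mixed case and ends with a dot, which is the contrast the feature analysis describes.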

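The robust scaling applied to skewed features in Section II-F can be sketched with the standard library alone. This is only an illustration of the median and interquartile-range formula described there, in the spirit of scikit-learn's `RobustScaler`; the function name, the quartile interpolation method and the sample values are assumptions.

```python
import statistics


def robust_scale(values):
    """Scale values as (x - median) / IQR, with IQR = Q3 - Q1."""
    median = statistics.median(values)
    # method="inclusive" interpolates linearly between the sorted data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        # Nearly constant feature: centre only, to avoid dividing by zero.
        return [x - median for x in values]
    return [(x - median) / iqr for x in values]


word_counts = [1, 2, 3, 4, 100]  # hypothetical, heavily skewed word counts
scaled = robust_scale(word_counts)
print(scaled)
```

The outlier (100) stays far from the rest after scaling, but it does not distort the scale of the typical values, because neither the median nor the quartiles move much when it is present.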
Fig. 4. Number of observations for train and test data.

2) Segment Extraction: After heading prediction, we segmented the different sections of information in the resume with a rule-based technique. The rule identifies the start of a heading, i.e. a line whose predicted label is 1, and extracts all the information until the next heading appears. All the extracted information is therefore considered as the details of the preceding heading. We then created a dictionary to store all the section information as key-value pairs for every resume. The key holds the title or heading and the value holds its respective content. Keys must be unique, and each heading introduces different information under a distinctive title, so the heading serves as the key and the not-a-heading lines as its value. Table II shows an example of the dictionary, which maps all the contents to their unique titles from a predicted resume. With this technique, we converted the unstructured information in the resume to a structured format. The dictionary keys are passed into the next step for the extraction of a specific segment.
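The segment extraction rule can be sketched as a single pass over the lines and their predicted labels. This is a minimal illustration, not the authors' code: the function name is hypothetical, and the treatment of repeated headings and of lines that appear before the first heading is an assumption, since the paper does not specify it.

```python
def extract_segments(lines, labels):
    """Group not-a-heading lines under the most recent predicted heading."""
    segments = {}
    current = None
    for line, label in zip(lines, labels):
        text = line.strip()
        if not text:
            continue
        if label == 1:
            current = text
            # Assumption: a repeated heading keeps appending to the same key.
            segments.setdefault(current, [])
        elif current is not None:
            segments[current].append(text)
        # Lines before the first heading are ignored in this sketch.
    return {heading: " ".join(body) for heading, body in segments.items()}


lines = [
    "TECHNICAL SKILLS:",
    "Programming Languages: JAVA, SQL(BASIC), R, PYTHON,",
    "Operating System: Windows, Linux",
    "EXPERIENCE:",
    "Tata Consultancy Services",
]
labels = [1, 0, 0, 1, 0]  # hypothetical output of the heading classifier
resume = extract_segments(lines, labels)
```

Here `resume` has one key per predicted heading, with the lines that follow it joined into a single value, which is the key-value layout described above.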

TABLE II. DICTIONARY OF HEADING AND ITS DETAILS
Key | Value
OBJECTIVE: | A post graduate who is seeking a challenging position, with a strong emphasis on Data science Technology, where I can use my abilities and explore the best of skills and acumen to become a valuable asset to the organization.
PROFESIONAL SUMMARY: | Have knowledge about working and architecture of HDFS', Worked and contributed with the team in the project.
TECHNICAL SKILLS: | Programming Languages: JAVA, SQL(BASIC), R, PYTHON, Operating System: Windows, Linux
EXPERIENCE: | Tata Consultancy Services Internship 20 November 2017 - 19 March 2018 Chennai, Siruseri campus, Got initial training in Scala and Kafka.
AWARDS & ACHIEVEMENTS: | Won inter house debate competitions in school. Served as organizer Jet club-2014 in college.

3) Segment Classification: We have defined 20 classes covering all the major headings in resumes. We implemented a rule to extract the specific skill information from the dictionary, based on an approximate string matching algorithm (fuzzy matching). A fuzzy search also finds misspelt or differently punctuated words, unlike a rigid search, which looks only for exact instances. Each key-value pair is passed through the partial string-matching algorithm to match the key against all possible names of the skill section. The fuzzy match algorithm calculates the partial string match using the Levenshtein distance metric, with configurable parameters such as the maximum allowed distance, substitutions, deletions and insertions. Each key is ranked based on the distance and matched against the possible elements in a skill list. The highest-ranked key, based on the distance score, is considered to hold the skill details. Table III shows the output of skillset information. We have explored only the skillset category, although this can apply to other section categories.

TABLE III. EXTRACTED SEGMENT FOR SKILLSET
Segment Category | Output
Skillset | Programming Languages: JAVA, SQL(BASIC), R, PYTHON, Operating System: Windows, linux

IV. RESULTS
The classification models were evaluated using performance metrics such as Accuracy, F1 Score, Precision, Recall and Area Under Curve (AUC). These metrics can be calculated from the true positive (TP), false positive (FP) and false negative (FN) values. TP is the case when a text is correctly identified as a heading. If a text is identified as a heading but is a not-a-heading, the case is considered an FP. FN is the case when a heading is detected as not-a-heading. The F1 score is calculated as the harmonic mean of precision and recall. Equations (1) and (2) give the formulas for precision and recall, and the F1 score is calculated according to (3). The AUC-ROC graph shows the overall performance of the classification model at all thresholds. We chose the F1 score instead of accuracy for the assessment of heading classification, as the dataset is imbalanced.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$F1 = \frac{2 \cdot (\text{Recall} \cdot \text{Precision})}{\text{Recall} + \text{Precision}} \tag{3}$$

F1 score, precision and recall were calculated on the test dataset, and the performance of all the classifiers used for model training is given in Table IV. XGBoost outperforms the other classifiers, so we chose XGBoost as the classification model for predicting whether a text is a heading or not. Moreover, we separately measured the impact of features for the XGBoost model. Table V shows the confusion matrix of the actual and predicted values from the model. Fig. 5 shows the ROC curve and AUC for XGBoost. We also tried to tune the classification model through the probability threshold for predicting heading and not-a-heading with the XGBoost model. In our experiment, precision dropped when the threshold was set below 0.5 and recall dropped when it was set above 0.5. Either change results in a lower F1 score, which depends on both values. Therefore, the default probability threshold of 0.5 gives the best performance of the model. Fig. 6 shows the important features with predictive power. The number of words present in the text has the highest impact. We also tested the performance of skillset extraction on test data consisting of 33 resumes in PDF and MS Word files. We successfully extracted the skillset from 28 resumes, which gives an accuracy of 85%.

TABLE IV. PERFORMANCE OF CLASSIFIERS
Performance measures (as fractions)
Classifiers | F1-Score | Precision | Recall
KNN | 0.846 | 0.872 | 0.823
BAG | 0.875 | 0.920 | 0.834
RF | 0.891 | 0.902 | 0.881
GB | 0.893 | 0.896 | 0.890
AB | 0.876 | 0.876 | 0.876
XGB | 0.901 | 0.914 | 0.898
SVM | 0.85 | 0.899 | 0.811
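As a worked check of equations (1) to (3), the short sketch below recomputes the XGB metrics from the counts in the XGB confusion matrix (Table V), taking heading as the positive class. The function name is hypothetical. The counts reproduce the reported precision (0.914) and F1 score (0.901); they give a recall of about 0.888, whereas Table IV lists 0.898, so one of the two figures may contain a typo.

```python
def precision_recall_f1(tp, fp, fn):
    """Evaluate equations (1), (2) and (3) for one set of counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (recall * precision) / (recall + precision)
    return precision, recall, f1


# Counts read from Table V: 382 headings found, 36 false alarms, 48 missed.
tp, fp, fn = 382, 36, 48
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The 4752 true negatives do not enter any of the three formulas, which is why these metrics are less flattered by the large not-a-heading class than accuracy would be.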

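The skillset extraction evaluated above relies on the fuzzy heading matching of the segment classification step (Section III). A minimal sketch of that idea follows. It uses a plain whole-string Levenshtein distance rather than the partial matching with separate substitution, deletion and insertion limits that the paper mentions, and the alias list, the threshold and all function names are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Edit distance with unit cost for insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def normalise(heading):
    """Lower-case the heading and keep only letters and single spaces."""
    kept = "".join(c for c in heading.lower() if c.isalpha() or c.isspace())
    return " ".join(kept.split())


# Hypothetical list of names under which a skills section may appear.
SKILL_ALIASES = ["technical skills", "skills profile", "skill set", "key skills"]


def best_skill_key(segments, max_distance=2):
    """Return the dictionary key closest to any skill alias, or None."""
    best_key, best_dist = None, max_distance + 1
    for key in segments:
        dist = min(levenshtein(normalise(key), alias) for alias in SKILL_ALIASES)
        if dist < best_dist:
            best_key, best_dist = key, dist
    return best_key


segments = {
    "OBJECTIVE:": "A post graduate who is seeking a challenging position",
    "TECHNCAL SKILS:": "Programming Languages: JAVA, SQL(BASIC), R, PYTHON",
    "EXPERIENCE:": "Tata Consultancy Services",
}
skill_key = best_skill_key(segments)
```

The misspelt heading is two edits away from "technical skills", so it is still selected, while an exact-match lookup would miss it. Raising `max_distance` tolerates more typos at the cost of more false matches.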
TABLE V. CONFUSION MATRIX FOR XGB
Actual vs. Predicted | Predicted Not-a-Heading | Predicted Heading
Actual Not-a-Heading | 4752 | 36
Actual Heading | 48 | 382

Fig. 5. ROC and AUC Based on XGB Model.

Fig. 6. Feature importance of XGB Model.

V. CONCLUSION
We have extracted various features from text and built a classifier to predict whether a text line in a resume is a heading or not. Our research shows that XGBoost outperforms the other classifiers in predicting headings. After the prediction, using a fuzzy search technique based on the edit distance, we successfully extracted the skillset information from both PDF and Word files. Manually annotated data created a more realistic view of the data generation process, so the model generalized well on unseen resumes. In future, we will enhance our work by including all the different file formats of resumes and by optimizing the performance of the classifier, to improve the accuracy in predicting and retrieving all the 20 segments from the resume. We also want to build a classification model for predicting the category of each heading. The extracted skillset, experience and education can be used to recommend resumes for a job requirement.

REFERENCES
[1] Chif, E. S., Chifu, V. R., Popa, I., & Salomie, I. (2017, September). A system for detecting professional skills from resumes written in natural language. In 2017 13th IEEE International Conference on Intelligent Computer Communication and Processing (ICCP) (pp. 189-196). IEEE.
[2] Zhao, M., Javed, F., Jacob, F., & McNair, M. (2015). SKILL: A System for Skill Identification and Normalization. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (pp. 4012-4017).
[3] Chen, J., Zhang, C., & Niu, Z. (2018). A two-step resume information extraction algorithm. Mathematical Problems in Engineering, 2018.
[4] Sanyal, S., Hazra, S., Adhikary, S., & Ghosh, N. (2017). Resume Parser with Natural Language Processing. International Journal of Engineering Science, 4484.
[5] Bernabé-Moreno, J., Tejeda-Lorente, Á., Herce-Zelaya, J., Porcel, C., & Herrera-Viedma, E. (2019). An automatic skills standardization method based on subject expert knowledge extraction and semantic matching. Procedia Computer Science, 162, 857-864.
[6] Ayishathahira, C. H., Sreejith, C., & Raseek, C. (2018, July). Combination of Neural Networks and Conditional Random Fields for Efficient Resume Parsing. In 2018 International CET Conference on Control, Communication, and Computing (IC4) (pp. 388-393). IEEE.
[7] Kumar, R., Agnihotram, G., Naik, P., & Trivedi, S. (2019). A Supervised Method to Find the Relevance of Extracted Keywords Using Deep Learning Approaches. In Emerging Technologies in Data Mining and Information Security (pp. 837-847). Springer, Singapore.
[8] Shah, F. P., & Patel, V. (2016, March). A review on feature selection and feature extraction for text classification. In 2016 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET) (pp. 2264-2268). IEEE.
[9] Dwivedi, R. K., Aggarwal, M., Keshari, S. K., & Kumar, A. (2019). Sentiment analysis and feature extraction using rule-based model (RBM). In International Conference on Innovative Computing and Communications (pp. 57-63). Springer, Singapore.
[10] Kumar, S. S., & Rajini, A. (2019, July). Extensive Survey on Feature Extraction and Feature Selection Techniques for Sentiment Classification in Social Media. In 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT) (pp. 1-6). IEEE.
[11] Dadgar, S. M. H., Araghi, M. S., & Farahani, M. M. (2016, March). A novel text mining approach based on TF-IDF and Support Vector Machine for news classification. In 2016 IEEE International Conference on Engineering and Technology (ICETECH) (pp. 112-116). IEEE.
[12] Manchanda, S., & Karypis, G. (2018, November). Text segmentation on multilabel documents: A distant-supervised approach. In 2018 IEEE International Conference on Data Mining (ICDM) (pp. 1170-1175). IEEE.
[13] Jo, T. (2017, January). Using K Nearest Neighbors for text segmentation with feature similarity. In 2017 International Conference on Communication, Control, Computing and Electronics Engineering (ICCCCEE) (pp. 1-5). IEEE.
[14] Hakak, S. I., Kamsin, A., Shivakumara, P., Gilkar, G. A., Khan, W. Z., & Imran, M. (2019). Exact String Matching Algorithms: Survey, Issues, and Future Research Directions. IEEE Access, 7, 69614-69637.
[15] Zhang, D., Ma, J., Yi, J., Niu, X., & Xu, X. (2015, August). An ensemble method for unbalanced sentiment classification. In 2015 11th International Conference on Natural Computation (ICNC) (pp. 440-445). IEEE.
[16] Goudjil, M., Koudil, M., Bedda, M., & Ghoggali, N. (2018). A novel active learning method using SVM for text classification. International Journal of Automation and Computing, 15(3), 290-298.
[17] Xu, A. L., Liu, B. J., & Gu, C. Y. (2018, July). A recommendation system based on extreme gradient boosting classifier. In 2018 10th International Conference on Modelling, Identification and Control (ICMIC) (pp. 1-5). IEEE.
