autoextraction
Abstract— Online recruitment systems or automatic resume processing systems are becoming popular because they save time for both employers and job seekers. Manually processing these resumes and matching them to several job specifications is a difficult task. Due to the increased amount of data, it is a big challenge to effectively analyze each resume on various parameters like experience, skill set, etc. Processing, extracting information from, and reviewing these applications automatically would save time and money. Automatic data extraction focuses primarily on the skillset, experience and education in a resume, so it is extremely helpful for mapping the appropriate resume to the right job description. In this research study, we propose a system that uses multi-level classification techniques to automatically extract detailed segment information like skillset, experience and education from a resume based on specific parameters. We have achieved state-of-the-art accuracy in identifying skill sets from resume segments.

Keywords—Information Retrieval, IR, resume parsing, segmentation, classification, bagging, boosting, GBDT and random forest.

I. INTRODUCTION
Online or intelligent recruitment systems play a significant role in the recruitment channel thanks to advanced text processing techniques. In recent years, most companies have posted their job requirements on various online platforms or gathered large numbers of resumes from different sources. The recruitment industry spends an excessive amount of time and energy on filtering resumes and pulling out data relevant to job requirements. The entire resume review process based on a job description is quite tedious because resumes are in unstructured formats. Therefore, in recent years, many recruitment tools have been developed for information retrieval. Although fundamental theories and processing methods for web data extraction exist, most recruitment tools still depend on text processing and on candidates complying with job requirements.

Chifu et al. [1] use a technique of ontology matching to identify technical skills with a set of common multi-word part-of-speech patterns, such as lexicalized multi-word expressions. If a specific skill is not present in the database, the system fails to recognize and extract that information from the resume. Named-entity recognition (NER) has been widely studied in NLP to identify and extract names and entities from text. With increased computational power, deep learning models like LSTM and RNN are also being used for named-entity recognition. M. Zhao et al. [2] proposed NER together with named-entity normalization (NEN) to extract relevant information from resumes. NER refers to the identification of a phrase that expresses a skill, while NEN goes further and links the phrase to a specific individual or entity. The authors describe SKILL, an application that detects, extracts and links a skill to a qualified target. Most recruitment systems spend considerable time structuring the information from unstructured resumes using several techniques. Skill tagging matches the skills found in resumes against the concepts of a skill taxonomy, which must be developed for each domain; the authors automated this task using MediaWiki and rules established on category tags. Chen et al. [3] integrated text block classification by treating the text as separate blocks, using the title words and the writing style of the document. This becomes more complicated for unstructured resumes, because writing styles and title words vary across resumes. Some NLP techniques can capture the context in text; hence, a heuristic model has been used to retrieve information from resumes based on lexical, syntactic and semantic analysis [4]. Unlike classification models, Bernabé-Moreno et al. [5] applied a word embedding technique with the GloVe model to mine skill information by capturing the semantic relationships between terms, and also clustered the entities with the help of t-SNE. Sequential models like LSTM and RNN are extensively used for sequence tagging tasks in text, speech, and images. Ayishathahira et al. [6] developed a Convolutional Neural Network (CNN) segmentation model that segmented resume content into family, educational, occupational, and other categories. In their research, CRF and Bi-LSTM-CNN models were created for information extraction, and they showed that CRF outperforms the other neural networks in segmentation accuracy. Kumar, R. et al. [7] proposed a supervised deep learning approach that predicts keyword relevance to identify skills in a resume: skills are treated as relevant keywords and all other words as irrelevant. They then measured the importance of each word in the text, which reflects the probability that a word in the document is a skill. Feature extraction and feature selection are important steps in text classification, helping to understand patterns and to reduce the number of features in the high-dimensional feature space. Shah et al.
Authorized licensed use limited to: BRACT's Vishwakarma Institute Pune. Downloaded on March 27,2025 at 10:59:41 UTC from IEEE Xplore. Restrictions apply.
5) Stop Words: Number of stop words.
6) Numeric: Number of numeric values.
7) Special Characters: Number of special characters.
8) Text Case Feature: This feature describes whether all the words in the text are in upper case, lower case, or camel case (first letter of every word in upper case). People mainly use camel case and upper case for headings, so this feature helps distinguish a heading from a not-a-heading line.
9) Part of Speech Feature (POS): The occurrences of every POS label are counted and used to compute the frequency of each POS tag in the text for every data point. These frequencies serve as potential features for the model.
10) Similarity Features: We considered the most frequently repeated words across the heading texts and created a feature for each word taken. Then, we calculated the cosine similarity of each word feature with the text present in each block of information.

Fig. 1. Distribution of target class.
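The count-based and case features above can be sketched as follows; this is a minimal illustration, and the feature names and stop-word list are placeholders rather than the paper's exact configuration:

```python
import re

# Illustrative stop-word list; a real system would use a full NLP stop list.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "to", "is"}

def line_features(text: str) -> dict:
    words = text.split()
    return {
        "num_stop_words": sum(w.lower() in STOP_WORDS for w in words),
        "num_numeric": sum(w.isdigit() for w in words),
        "num_special_chars": len(re.findall(r"[^\w\s]", text)),
        # Case features: headings tend to be upper case or camel case.
        "is_upper": text.isupper(),
        "is_lower": text.islower(),
        "is_camel": all(w[:1].isupper() for w in words) if words else False,
    }

print(line_features("TECHNICAL SKILLS:"))
```

Each resume line yields one such feature dictionary, which can then be assembled into the training matrix for the heading classifier.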
require scaling the data and achieving a normal distribution. Then, the categorical features, including the bold feature and the text-ending-with-a-special-pattern feature, were transformed by applying the one-hot encoding technique. One-hot encoding can produce a dummy variable trap, in which the other dummy variables can easily predict the outcome of one variable; this dependency between the independent variables leads to a multicollinearity problem. So, we dropped one of the dummy variables from each categorical feature to get rid of this problem.

III. METHODS
In general, resumes are organized into multiple sections like objective, summary, skillset, education, experience, project and personal details. Under each heading, the content provides a detailed description, and the sections are separated by their title headings. We considered each such segment as a section of a resume. To extract a segment's details, we used multi-level classification and then retrieved the information under that segment. The whole process has three main steps: a) heading prediction, which predicts whether a line is a heading or not; b) segment classification, which maps a heading to a generalized segment label; and c) segment extraction, which extracts the segment details.
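The one-hot encoding step, with one dummy variable dropped per categorical feature to avoid the dummy variable trap, can be sketched with pandas; the column names here are hypothetical stand-ins for the paper's features:

```python
import pandas as pd

# Hypothetical categorical features: "is_bold" and "ends_with_colon"
# stand in for the bold and special-ending-pattern features.
df = pd.DataFrame({
    "is_bold": ["yes", "no", "yes", "no"],
    "ends_with_colon": ["yes", "yes", "no", "no"],
})

# drop_first=True drops one dummy per categorical column, removing the
# perfectly collinear column that causes the dummy variable trap.
encoded = pd.get_dummies(df, drop_first=True)
print(list(encoded.columns))  # ['is_bold_yes', 'ends_with_colon_yes']
```

With `drop_first=False`, the two dummies of each column would always sum to one, which is exactly the multicollinearity the text describes.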
TABLE II. DICTIONARY OF HEADING AND ITS DETAILS

Key: OBJECTIVE
Value: A post graduate who is seeking a challenging position, with a strong emphasis on Data Science Technology, where I can use my abilities and explore the best of skills and acumen to become a valuable asset to the organization.

Key: PROFESSIONAL SUMMARY
Value: Have knowledge about working and architecture of HDFS. Worked and contributed with the team in the project.

Key: TECHNICAL SKILLS
Value: Programming Languages: JAVA, SQL (BASIC), R, PYTHON; Operating System: Windows, Linux.

Key: EXPERIENCE
Value: Tata Consultancy Services Internship, 20 November 2017 - 19 March 2018, Chennai, Siruseri campus. Got initial training in Scala and Kafka.

Key: AWARDS & ACHIEVEMENTS
Value: Won inter-house debate competitions in school. Served as organizer of Jet club-2014 in college.

3) Segment Classification: We categorized 20 classes covering all the major headings in resumes. We implemented a rule to extract the specific skill information from the dictionary, based on an approximate string matching (fuzzy matching) algorithm. A fuzzy search also finds outputs that are misspelt or punctuated differently, unlike a rigid search, which looks only for exact instances. Each key-value entry is passed through the partial string-matching algorithm to match against all possible names of a skill. The fuzzy match algorithm calculates the partial string match using the Levenshtein distance metric, with configurable parameters such as the maximum allowed distance, substitutions, deletions and insertions. Each key was ranked based on its distance range and matched against the possible elements of a skill list; the highest-ranked match by distance score was taken as the skill details. Table III shows the output of the skillset information. We have explored only the skillset category, although the approach can be applied to the other section categories.

TABLE III. EXTRACTED SEGMENT FOR SKILLSET

If a text is identified as a heading but it is actually not-a-heading, the case is counted as a false positive (FP). A false negative (FN) is the case when a heading is detected as not-a-heading. The F1 score is the harmonic mean of precision and recall. Equations (1) and (2) give the formulas for precision and recall, and the F1 score is calculated by (3). The AUC-ROC graph shows the overall performance of the classification model at all thresholds. We have chosen the F1 score instead of accuracy for the assessment of heading classification, as the dataset is imbalanced.

Precision = TP / (TP + FP)                                (1)

Recall = TP / (TP + FN)                                   (2)

F1 = 2 * (Recall * Precision) / (Recall + Precision)      (3)

F1 score, precision and recall were calculated on the test dataset to analyze the performance of all the classifiers used for model training, as reported in Table IV. XGBoost outperforms the other classifiers used, so we chose XGBoost as the classification model for predicting whether a text is a heading or not. Moreover, we separately measured the impact of features on the XGBoost model. Table V shows the confusion matrix visualizing the actual and predicted values from the model, and Fig. 5 shows the ROC curve and AUC for XGBoost. We also tried to tune the classification model by adjusting the probability threshold for predicting heading versus not-a-heading. In our experiment, precision fell when the threshold was lowered below 0.5 and recall fell when it was raised above 0.5; either change lowers the F1 score, which depends on both values. Therefore, the default probability threshold of 0.5 gives the best performance of the model. Fig. 6 shows the most important features by predictive power; the number of words present in the text has the highest impact. We also tested the performance of skillset extraction on test data consisting of 33 resumes in PDF and MS Word formats. We successfully extracted the skillset from 28 resumes, which gives 85% accuracy.

TABLE IV. PERFORMANCE OF CLASSIFIERS
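The fuzzy skill matching described under Segment Classification can be sketched as follows; the skill list is hypothetical, and the paper's separate substitution/deletion/insertion limits are collapsed into a single maximum-distance threshold here:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting substitutions, deletions and insertions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Hypothetical skill taxonomy; max_distance is the configurable parameter.
SKILLS = ["java", "python", "sql", "scala", "kafka"]

def match_skill(token: str, max_distance: int = 1):
    """Rank skills by edit distance; return the best one within the limit."""
    ranked = sorted((levenshtein(token.lower(), s), s) for s in SKILLS)
    dist, best = ranked[0]
    return best if dist <= max_distance else None

print(match_skill("Pyton"))  # 'python' (distance 1, a missing letter)
```

A rigid exact-match search would reject "Pyton" outright; the distance-based ranking is what lets misspelt or differently punctuated skill names still be linked to a taxonomy entry.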
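As a quick check, the metrics defined in (1)-(3) can be recomputed from the XGBoost confusion-matrix counts reported in Table V (TP = 382, FP = 36, FN = 48 for the heading class):

```python
# Heading-class counts taken from Table V.
TP, FP, FN = 382, 36, 48

precision = TP / (TP + FP)                           # Eq. (1)
recall = TP / (TP + FN)                              # Eq. (2)
f1 = 2 * recall * precision / (recall + precision)   # Eq. (3)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.914 recall=0.888 f1=0.901
```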
TABLE V. CONFUSION MATRIX FOR XGB

Confusion Matrix         Predicted Not-a-Heading    Predicted Heading
Actual Not-a-Heading     4752                       36
Actual Heading           48                         382

Fig. 5. ROC and AUC based on the XGB model.

Fig. 6. Feature importance of the XGB model.

V. CONCLUSION
We have extracted various features from text and built a classifier to predict whether a text line in a resume is a heading or not. Our research shows that XGBoost outperforms the other classifiers in predicting headings. After the prediction, using a fuzzy search technique based on edit distance, we successfully extracted the skillset information from both PDF and Word files. Manually annotated data created a more realistic view of the data generation process, so the model generalized well on unseen resumes. In future, we will enhance our work by including resumes in all file formats and optimize the performance of the classifier to improve the accuracy of predicting and retrieving all 20 segments from the resume. We also want to build a classification model to predict the category of each heading. The extracted skillset, experience and education can then be used to recommend resumes for a job requirement.

REFERENCES
[1] Chifu, E. S., Chifu, V. R., Popa, I., & Salomie, I. (2017, September). A system for detecting professional skills from resumes written in natural language. In 2017 13th IEEE International Conference on Intelligent Computer Communication and Processing (ICCP) (pp. 189-196). IEEE.
[2] Zhao, M., Javed, F., Jacob, F., & McNair, M. (2015). SKILL: A system for skill identification and normalization. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (pp. 4012-4017).
[3] Chen, J., Zhang, C., & Niu, Z. (2018). A two-step resume information extraction algorithm. Mathematical Problems in Engineering, 2018.
[4] Sanyal, S., Hazra, S., Adhikary, S., & Ghosh, N. (2017). Resume parser with natural language processing. International Journal of Engineering Science, 4484.
[5] Bernabé-Moreno, J., Tejeda-Lorente, Á., Herce-Zelaya, J., Porcel, C., & Herrera-Viedma, E. (2019). An automatic skills standardization method based on subject expert knowledge extraction and semantic matching. Procedia Computer Science, 162, 857-864.
[6] Ayishathahira, C. H., Sreejith, C., & Raseek, C. (2018, July). Combination of neural networks and conditional random fields for efficient resume parsing. In 2018 International CET Conference on Control, Communication, and Computing (IC4) (pp. 388-393). IEEE.
[7] Kumar, R., Agnihotram, G., Naik, P., & Trivedi, S. (2019). A supervised method to find the relevance of extracted keywords using deep learning approaches. In Emerging Technologies in Data Mining and Information Security (pp. 837-847). Springer, Singapore.
[8] Shah, F. P., & Patel, V. (2016, March). A review on feature selection and feature extraction for text classification. In 2016 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET) (pp. 2264-2268). IEEE.
[9] Dwivedi, R. K., Aggarwal, M., Keshari, S. K., & Kumar, A. (2019). Sentiment analysis and feature extraction using rule-based model (RBM). In International Conference on Innovative Computing and Communications (pp. 57-63). Springer, Singapore.
[10] Kumar, S. S., & Rajini, A. (2019, July). Extensive survey on feature extraction and feature selection techniques for sentiment classification in social media. In 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT) (pp. 1-6). IEEE.
[11] Dadgar, S. M. H., Araghi, M. S., & Farahani, M. M. (2016, March). A novel text mining approach based on TF-IDF and support vector machine for news classification. In 2016 IEEE International Conference on Engineering and Technology (ICETECH) (pp. 112-116). IEEE.
[12] Manchanda, S., & Karypis, G. (2018, November). Text segmentation on multilabel documents: A distant-supervised approach. In 2018 IEEE International Conference on Data Mining (ICDM) (pp. 1170-1175). IEEE.
[13] Jo, T. (2017, January). Using K nearest neighbors for text segmentation with feature similarity. In 2017 International Conference on Communication, Control, Computing and Electronics Engineering (ICCCCEE) (pp. 1-5). IEEE.
[14] Hakak, S. I., Kamsin, A., Shivakumara, P., Gilkar, G. A., Khan, W. Z., & Imran, M. (2019). Exact string matching algorithms: Survey, issues, and future research directions. IEEE Access, 7, 69614-69637.
[15] Zhang, D., Ma, J., Yi, J., Niu, X., & Xu, X. (2015, August). An ensemble method for unbalanced sentiment classification. In 2015 11th International Conference on Natural Computation (ICNC) (pp. 440-445). IEEE.
[16] Goudjil, M., Koudil, M., Bedda, M., & Ghoggali, N. (2018). A novel active learning method using SVM for text classification. International Journal of Automation and Computing, 15(3), 290-298.
[17] Xu, A. L., Liu, B. J., & Gu, C. Y. (2018, July). A recommendation system based on extreme gradient boosting classifier. In 2018 10th International Conference on Modelling, Identification and Control (ICMIC) (pp. 1-5). IEEE.