Personality Prediction System Based On Graphology Using Machine Learning

Uploaded by

Humam Alani
Personality Prediction System Based on Graphology using Machine Learning

Hyeon Gu Kim · 15 min read · Jan 6, 2022

Team Members: Lucy Hwang, Yashaswini Kalva, Hyeon Gu Kim, Kaushik Kumaran,
Archit Patel

Abstract
Graphology is a method of identifying, evaluating and understanding human
personality traits through the strokes and patterns revealed by handwriting.
Handwriting reveals the true personality, including emotional outlay, fears, honesty,
defenses and many other traits. Professional handwriting examiners, called graphologists,
often identify the writer from a piece of handwriting. The accuracy of handwriting
analysis depends on how skilled the analyst is. Although human intervention in
handwriting analysis has been effective, it is costly and prone to error. Hence the
proposed methodology focuses on developing a system that can predict personality
traits with the aid of machine learning and without human intervention. To make this
happen, we considered seven handwriting features: (i) size of letters, (ii) slant of the
writing, (iii) baseline, (iv) pen pressure, (v) spacing between letters, (vi) spacing
between words and (vii) top margin in a document, to predict eight personality traits
of a writer, as shown in Figure 1.0.

Figure 1.0: Handwriting attributes and respective personality behavior

After extracting all these features from the images containing the handwriting, we
applied a Random Forest classifier for each personality trait of the writer. We also
built ANN and CNN models on the raw image data.

Introduction
Graphology is defined as the analysis of the physical characteristics and patterns of
the handwriting of an individual to understand his or her psychological state at the
time of writing. Handwriting is a kind of projective test where the unconscious
comes to the fore and expresses itself in the conscious [1]. A graphologist can
roughly interpret an individual’s character and personality traits by analysing the
handwriting, so graphology can be used to build a personality and character
profile of a person.

Objective
The objective of this project is to develop a system that takes an image document
containing the handwriting of a person and outputs a few of his/her personality
traits based on some selected handwriting features. Carefully analysing all the
significant characteristics of handwriting manually is not only time consuming
but also prone to errors. Automating the analysis of a few selected
characteristics of handwriting will speed up the process and reduce the errors.

Motivation
Handwriting analysis is one among several methods to understand the psychology
of a person. Graphology is used in the following two areas:

Psychological analysis: Graphology is used clinically by counsellors and
psychotherapists.

Employment profiling: Companies use handwriting analysis for recruitment. A
graphological report is meant to be used in conjunction with other tools, such as
comprehensive background checks, practical demonstration or record of work
skills.

Handwriting analysis with a computer is fast and accurate, and it identifies patterns
better than visual inspection. Moreover, machine-learning-assisted analysis is
efficient and avoids the errors of manual inspection.

Literature Review
The project focuses on developing a system that predicts some psychological traits
of a person by analyzing his or her handwriting using machine learning. Many
researchers have done similar work on computer-aided graphology.

Shitala Prasad, Vivek Kumar Singh and Akshay Sapre of the Department of
Information Technology, Indian Institute of Information Technology Allahabad,
India, predicted human personality from handwriting using support vector
machines [4]. Navin Karanth, Vijay Desai and S. M. Kulkarni of the Mechanical
Engineering Department, National Institute of Technology Karnataka, India,
predicted a writer’s personality through graphology without any machine
learning [5]. Champa H N, Assistant Professor of the Department of Computer Science
and Engg., University Visvesvaraya College of Engineering, Karnataka, India, and
Dr. K R Ananda Kumar, Professor of the Department of Computer Science and Engg.,
SJB Institute of Technology, Karnataka, India, worked on computer-aided graphology
using artificial neural networks [6]. These works differ fundamentally in their
selection of handwriting features, extraction methods, classification and output.

Problem Statement
A system is proposed to automate the basic handwriting analysis tasks of
graphology and determine a few important personality traits. Seven features of the
handwriting are extracted from a sample handwriting image. Each of the seven
resulting raw values is assigned to a category of its feature’s variations. The
classifiers then predict the personality traits of the writer. An overview is shown
below:

Figure 1.1: The proposed system. A handwriting sample is taken and the personality traits are predicted.

Data Acquisition
The data was obtained from the IAM Handwriting Database of the Research Group on
Computer Vision and Artificial Intelligence (INF), University of Bern, Switzerland. The
data was readily available for download for non-profit research purposes. The
database contains 1539 pages of scanned text, for which 657 writers contributed
samples of their handwriting. Each handwriting sample was labelled with the
corresponding psychological traits by manually studying each document.

Pre Processing
The handwriting images we obtained contain unwanted noise, printed text and
lines. The aim of pre-processing is to make the image data suitable for feature
extraction, for which we adopted the methods below.

1. Image resizing
The images were cropped and saved as PNG images with an automatic action
script. After this step, the width of every image is 850 pixels and the height depends on
the amount of handwriting in the image. PNG format is used instead of JPEG
because the former is a lossless format and is more suitable for storing text images,
whether printed or handwritten.

Figure 1.21: Original image data sample obtained from the IAM Handwriting Database.

Figure 1.22: Cropped and normalized image data sample with 850px width.

2. Noise Removal
Image noise is defined as random variation of brightness or color information in
images, and is usually an aspect of electronic noise.

We applied a bilateral filter to remove it. As the two images below show, the
bilateral filter preserves the edges of the subjects in the image.

Figure 1.31: Noisy image before any filter is applied.

Figure 1.32: Noiseless image after bilateral filter is applied.
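
To make the edge-preserving behaviour concrete, here is a minimal NumPy sketch of a bilateral filter. It is an illustration only, not the project's code: the function name, radius and sigma values are assumptions, and in practice OpenCV's `cv2.bilateralFilter` would be the usual choice. Each output pixel is a weighted average of its neighbours, where the weight falls off with both spatial distance and difference in intensity, so pixels across an edge contribute almost nothing.

```python
import numpy as np

def bilateral_filter(img, radius=2, sigma_space=2.0, sigma_range=25.0):
    """Naive bilateral filter for a 2-D grayscale image (illustrative, slow)."""
    img = img.astype(np.float64)
    padded = np.pad(img, radius, mode="edge")
    acc = np.zeros_like(img)
    weight_sum = np.zeros_like(img)
    h, w = img.shape
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # The whole image shifted by (dy, dx), i.e. one neighbour per pixel.
            shifted = padded[radius + dy:radius + dy + h,
                             radius + dx:radius + dx + w]
            # Weight 1: closer neighbours count more.
            spatial = np.exp(-(dy * dy + dx * dx) / (2 * sigma_space ** 2))
            # Weight 2: neighbours with similar intensity count more.
            tonal = np.exp(-((shifted - img) ** 2) / (2 * sigma_range ** 2))
            weight = spatial * tonal
            acc += weight * shifted
            weight_sum += weight
    return acc / weight_sum

# Synthetic example: a noisy vertical edge between dark (50) and bright (200).
rng = np.random.default_rng(0)
clean = np.full((20, 20), 50.0)
clean[:, 10:] = 200.0
noisy = clean + rng.normal(0, 5, clean.shape)
smoothed = bilateral_filter(noisy)
```

In this toy example the noise on each side of the edge is averaged away, while the jump from 50 to 200 stays sharp.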

3. Grayscale and Binarization
The image instances were converted to grayscale and binarized using inverted
global thresholding. An example is given in Figure 1.4.

Figure 1.4: A binarized version of the image
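
The two operations can be sketched in a few lines of NumPy. This is a hedged illustration: the luminance weights are the common ITU-R BT.601 values, and the threshold of 127 is an assumption, since the post does not state the value used. "Inverted" means that dark ink becomes the white foreground (255) and light paper becomes the black background (0).

```python
import numpy as np

def to_grayscale(rgb):
    """Luminance-weighted grayscale from an H x W x 3 RGB array."""
    weights = np.array([0.299, 0.587, 0.114])
    return (rgb[..., :3].astype(np.float64) @ weights).round().astype(np.uint8)

def binarize_inverted(gray, threshold=127):
    """Inverted global threshold: dark ink -> 255, light paper -> 0."""
    return np.where(gray > threshold, 0, 255).astype(np.uint8)

# Tiny example: a white page with a single dark stroke pixel in the middle.
page = np.full((3, 3, 3), 255, dtype=np.uint8)
page[1, 1] = (20, 20, 20)
binary = binarize_inverted(to_grayscale(page))
```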

4. Contour and Warp Affine Transformation
After noise was removed and the image was converted to grayscale and inversely
binarized, the lines of the handwriting were straightened using dilation, contour
detection and the warp affine transformation of the OpenCV library.

Figure 1.5: The sample image after applying dilation with a 5x100 kernel. The foreground pixels are spread horizontally.
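
The sketch below covers only the dilation step, written in plain NumPy so the effect of a wide 5x100 kernel is easy to see. The function name and the toy image are assumptions; the project itself used OpenCV, where the contour and warp steps would typically go through `cv2.findContours` and `cv2.warpAffine`. A wide, short kernel smears foreground pixels sideways, so the words of one text line fuse into a single blob whose contour can then be measured and rotated.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def dilate_rect(binary, kernel_h=5, kernel_w=100):
    """Dilate a 2-D binary image with an all-ones kernel_h x kernel_w kernel."""
    top, left = kernel_h // 2, kernel_w // 2
    padded = np.pad(
        binary,
        ((top, kernel_h - 1 - top), (left, kernel_w - 1 - left)),
        mode="constant",
        constant_values=0,
    )
    # Every output pixel is the maximum over the kernel-sized window around it.
    windows = sliding_window_view(padded, (kernel_h, kernel_w))
    return windows.max(axis=(2, 3))

# Two separate "words" on one text line merge into a single blob.
line = np.zeros((9, 300), dtype=np.uint8)
line[4, 40:60] = 255     # first word
line[4, 120:140] = 255   # second word
dilated = dilate_rect(line)
```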

5. Horizontal and Vertical Projections
In the context of this project, the horizontal projection of an image is a Python list
containing the sum of the pixel values of each row of the image, while the vertical
projection is a Python list containing the sum of the pixel values of each column of
the image. Both of these operations are performed on grayscale images.
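
A minimal sketch of both projections, assuming the image is held as a NumPy array (the function names are illustrative):

```python
import numpy as np

def horizontal_projection(img):
    """Sum of the pixel values in each row, as a Python list."""
    return img.sum(axis=1).tolist()

def vertical_projection(img):
    """Sum of the pixel values in each column, as a Python list."""
    return img.sum(axis=0).tolist()

img = np.array([[0, 255, 0],
                [0, 255, 255],
                [0, 0, 0]], dtype=np.uint8)
rows = horizontal_projection(img)  # [255, 510, 0]
cols = vertical_projection(img)    # [0, 510, 255]
```

Rows whose sum is zero contain no ink, which is what makes the horizontal projection useful for finding the gaps between text lines.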

Feature Extraction
The features used for building the Random Forest models are: Baseline; Line; Letter
Size; Line Spacing; Word Spacing; Top Margin; Pen Pressure; Slant of Letters.

Classification Labels
1. Openness

2. Conscientiousness

3. Agreeableness

4. Neuroticism

Random Forest
Random forest is widely used for predictive modeling and behavior analysis. We chose
it because it does not require feature scaling and is less affected by noise.

Given below are the steps followed for predicting personality traits using Random
Forest:

Figure 1.6: Steps for predicting personality traits using Random Forest

A separate random forest classifier was built for predicting each personality trait.
Given below is a snippet of the input data fed into the models:

Figure 1.7: Input data fed into the models

Hyperparameter Tuning
We used randomized search to find the best hyperparameters for the
Random Forest classifier. The following hyperparameters were tuned:

n_estimators

max_features

max_depth

min_samples_split

ccp_alpha
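
A hedged sketch of how such a search could look with scikit-learn's `RandomizedSearchCV`. The candidate values, the number of iterations, the number of cross-validation folds and the synthetic data are all assumptions made for illustration; only the five hyperparameter names come from the list above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the extracted feature table (7 features per sample).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))
y = (X[:, 0] + X[:, 3] > 0).astype(int)  # one binary trait label

# Candidate values are illustrative, not the ones used in the project.
param_distributions = {
    "n_estimators": [4, 10, 50, 100],
    "max_features": ["sqrt", "log2", None],
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": [2, 5, 10],
    "ccp_alpha": [0.0, 0.001, 0.01],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,       # number of random combinations to try
    cv=3,            # 3-fold cross-validation per combination
    random_state=0,
)
search.fit(X, y)
best_params = search.best_params_
```

Unlike an exhaustive grid search, only `n_iter` randomly drawn combinations are evaluated, which keeps the cost fixed no matter how many candidate values are listed.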

Feature Importance
Using the Random Forest models, we were able to understand the importance of the
features that we extracted in the pre-processing step, as each model assigns
importance to a feature based on the frequency of its inclusion in the sample by all
trees.

Below is the summary of feature importance:


Figure 1.8: Feature importance

Figure 1.9: The most important features for each personality type
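
For readers who want to reproduce a ranking like the one in the figures, the sketch below reads the `feature_importances_` attribute of a fitted scikit-learn forest. Note that this attribute is impurity-based (mean decrease in impurity across all trees); whether the project used exactly this measure is an assumption. The feature names and the synthetic data are also assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative feature names; the real column names are not given in the post.
feature_names = ["baseline", "letter_size", "line_spacing", "word_spacing",
                 "top_margin", "pen_pressure", "slant"]

# Synthetic data in which only pen_pressure (column 5) determines the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, len(feature_names)))
y = (X[:, 5] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair each feature with its importance and sort from most to least important.
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name:>13}: {score:.3f}")
```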

Results
Below are the results obtained from Random Forest:

Test Accuracy = 97.06%

Test Recall Score = 93.70%

Test Precision Score = 100%

The accuracy achieved by the random forest classifier with 4 trees is 97.06%. Changing
the number of estimators, max features, depth, ccp_alpha and min_samples_split for this
data didn’t significantly improve the results.
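
As a reminder of how these three scores relate, here is a small sketch that computes them from confusion-matrix counts for a single binary classifier. The counts are made up for illustration and are not the project's data.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Made-up counts: no false positives, but three positives were missed.
accuracy, precision, recall = binary_metrics(tp=45, fp=0, fn=3, tn=52)
```

With no false positives the precision is exactly 1.0, while the missed positives pull the recall down to 45/48 = 0.9375. A precision of 100% combined with a lower recall, as reported above, follows the same pattern.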

ANN

Before we dive into the art of neural networks, we first need to understand what
an ANN is. In short, an Artificial Neural Network (ANN) is a machine learning algorithm
that mimics the processing of the brain. In other words, an ANN enables machines to
process data in a way similar to how the human brain does. The figure below shows
how a biological neuron and an ANN process data similarly:

Figure 2.0: Biological Neuron vs ANN

This is the simplest form of ANN, which consists of inputs (x1, x2, …, xn), weights
(w1, w2, …, wn) and an activation function. Similar to how a biological neuron takes
inputs through its dendrites, processes them from the nucleus to the axon and outputs
the results at the axon terminals, an ANN takes input data, assigns a weight to each
input, passes the weighted inputs through an activation function and outputs the result.
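
The single neuron described above can be written in a few lines. This is a generic sketch: the bias term `b` and the sigmoid activation are assumptions added for completeness, since the description lists only inputs, weights and an activation function.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    return activation(np.dot(w, x) + b)

x = np.array([1.0, 2.0, 3.0])    # inputs x1..x3
w = np.array([0.5, -0.25, 0.0])  # weights w1..w3
output = neuron(x, w, b=0.0, activation=sigmoid)  # sigmoid(0.0) = 0.5
```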

Because of the vast amount of complex data from the preprocessing steps, the simplest
form of ANN above is not enough. For this reason, we decided to include two hidden
layers, which distill redundant data and make the process more efficient and faster.
This is called a Multi Layer Perceptron (MLP), which consists of an input layer, one
or more hidden layers and an output layer (Figure 2.1).

Figure 2.1: Multi Layer Perceptron


Figure 2.2: Feedforward & Backpropagation

In addition to processing from the input layer to the output layer, which is called a
feedforward network, what makes an ANN even more powerful is the opposite notion,
the backpropagation algorithm (Figure 2.2). With the backpropagation algorithm, an
ANN can learn from its errors and improve the model further.
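
To show the two passes side by side, here is a small self-contained NumPy sketch that trains a network with one hidden layer on the XOR problem. It is a teaching example under assumed settings (4 tanh hidden units, a linear output, mean squared error, learning rate 0.05) and is unrelated to the project's data. The forward pass computes a prediction, and the backward pass applies the chain rule to find how much each weight contributed to the error.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)   # input -> hidden
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)   # hidden -> output
lr = 0.05
losses = []

for _ in range(2000):
    # Feedforward: input -> hidden (tanh) -> output (linear).
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - y
    losses.append(float((err ** 2).mean()))

    # Backpropagation: chain rule from the loss back to each weight.
    grad_pred = 2 * err / len(X)
    grad_W2 = h.T @ grad_pred
    grad_b2 = grad_pred.sum(axis=0)
    grad_h = grad_pred @ W2.T
    grad_z1 = grad_h * (1 - h ** 2)  # derivative of tanh
    grad_W1 = X.T @ grad_z1
    grad_b1 = grad_z1.sum(axis=0)

    # Gradient-descent update: step against the gradient.
    W1 -= lr * grad_W1
    b1 -= lr * grad_b1
    W2 -= lr * grad_W2
    b2 -= lr * grad_b2
```

The list `losses` records the error after every forward pass, so plotting it gives a loss curve of the same kind as the ones shown later in this post.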

Now that we have a better understanding of ANN, let’s see how we implemented
ANN for predicting personality from handwriting. The overall process of the
implementation is quite simple: convert the pre-processed data into arrays
of pixels and feed the arrays into the ANN. The figure below shows a high-level view of
the ANN process in this project.

Figure 2.3: High-level view of the ANN process. With datasets of handwriting images, we converted them
into arrays of pixels and put them into the ANN model.

Although the data had already been preprocessed, we still needed a data
transformation step in which we encoded the categorical variable (the personality
labels, which are our target variable), reshaped the data matrices for the ANN, and
split the data into train, validation and test sets (70%, 15%, 15%, respectively).
Then we used Keras from TensorFlow for the ANN:

Figure 2.4: Code snippet of ANN model
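
Since the original snippet is only available as an image, the transformation step can be sketched roughly as follows. This is not the project's code: the dummy data, the 32x32 image size, the random seed and all variable names are assumptions. Only the label encoding, the reshaping and the 70/15/15 split come from the description above.

```python
import numpy as np

# Dummy stand-ins for the real data: 100 labels and 100 blank 32x32 images.
labels = np.array(["openness", "neuroticism", "agreeableness",
                   "conscientiousness"] * 25)
images = np.zeros((len(labels), 32, 32), dtype=np.uint8)

# Encode string labels as integers 0..3 (classes are sorted alphabetically).
classes, y = np.unique(labels, return_inverse=True)

# Add a channel axis so each sample has shape (height, width, 1).
X = images[..., np.newaxis]

# Shuffle once, then cut at 70% and 85% for a 70/15/15 split.
rng = np.random.default_rng(42)
order = rng.permutation(len(X))
n_train = len(X) * 70 // 100
n_val = len(X) * 85 // 100
train_idx, val_idx, test_idx = np.split(order, [n_train, n_val])

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]
```

Integer-encoded labels like `y` are the form expected by a sparse categorical cross-entropy loss, so no one-hot encoding is needed.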

The ANN is constructed as follows:

Rescaling & flattening

An input layer: 113 nodes, activation function=ReLU

Two hidden layers: 128 and 64 nodes, activation function=ReLU

An output layer: 4 nodes, activation function=Softmax

Two regularization (“Dropout”) layers between the layers to prevent overfitting

Sparse Categorical Cross entropy loss function

RMSprop optimizer
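
The layer list above could be expressed in Keras roughly as follows. This is a reconstruction from the list, not the code shown in Figure 2.4: the input shape, the dropout rate and the exact position of the two Dropout layers are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_ann(input_shape=(64, 64, 1), dropout_rate=0.2):
    """MLP following the layer list; input_shape and dropout_rate are guesses."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        layers.Rescaling(1.0 / 255),             # pixel values 0..255 -> 0..1
        layers.Flatten(),                        # image -> 1-D vector
        layers.Dense(113, activation="relu"),    # "input layer" in the list
        layers.Dropout(dropout_rate),
        layers.Dense(128, activation="relu"),    # first hidden layer
        layers.Dropout(dropout_rate),
        layers.Dense(64, activation="relu"),     # second hidden layer
        layers.Dense(4, activation="softmax"),   # one unit per personality label
    ])
    model.compile(optimizer="rmsprop",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_ann()
```

Training would then be a call along the lines of `model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=60, batch_size=32)`, where the array names are placeholders for the split data.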

Hyperparameter Tuning
Epochs: 60

Batch size: tried batch sizes of 16, 32, and 64

We chose the ReLU activation function because it mitigates the vanishing gradient
problem (its gradient is constant for positive inputs) and is computationally lighter
and faster. Moreover, we chose the Softmax activation function for the output layer
since it calculates the relative probability of each class, which suits multiclass
classification problems like this project. Similarly, the Sparse Categorical Cross
entropy loss function was used because this project is a multiclass classification
problem with integer-encoded labels. Lastly, we used the RMSprop optimizer because it
adapts its step size as it moves down towards a minimum, which made it faster than the
other optimizers we tried. We also tried the ADAM optimizer, but the accuracy turned
out to be a little lower than when we used RMSprop.
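
For reference, the three functions named in this paragraph can be written in plain NumPy as below. This is a generic sketch of the standard definitions, not the Keras internals.

```python
import numpy as np

def relu(z):
    """ReLU: pass positive values through, clamp negatives to zero."""
    return np.maximum(0.0, z)

def softmax(logits):
    """Row-wise softmax; subtracting the max keeps exp() numerically stable."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability of the true class (integer labels)."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked).mean())

logits = np.array([[2.0, 0.0, 0.0, 0.0],   # leans towards class 0
                   [0.0, 0.0, 0.0, 0.0]])  # undecided between 4 classes
probs = softmax(logits)
loss = sparse_categorical_crossentropy(probs, np.array([0, 2]))
```

Each row of `probs` sums to 1, and the undecided row gives every class a probability of 0.25.
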
The figures below show the results of the ANN:

Figure 2.5: Train accuracy vs Test accuracy

We can see from the train and test accuracy graph in Figure 2.5 that the ANN is
performing well. One interesting fact is that the test accuracy starts to
outperform the train accuracy after the 34th epoch. Next, let’s see the relationship
between accuracy and loss.

Figure 2.6: Relationship between the test loss and accuracy

Similarly, in Figure 2.6, we can observe an equilibrium between the accuracy and
the loss at the 34th epoch, and the accuracy continues to increase as the loss
continues to decrease.

Figure 2.7: Train loss vs Test loss

The graph above compares the train loss and the test loss.
Interestingly, the test loss diverges from the train loss at epoch 20.

Figure 2.8: Classification Report

The figures above show a multiclass confusion matrix and a classification report
(Figure 2.8) from our ANN. We can observe that the model has successfully classified the
data into our four personality labels: agreeableness, conscientiousness,
neuroticism and openness. One thing to note is that the model has the lowest F1 score
on conscientiousness and the highest F1 score on openness. This could
be due to the amount of training data for each class: openness has the most training
data while conscientiousness has the least.

We implemented ANN because of three key advantages:

Can learn and model non-linear and complex relationships

Doesn’t impose fixed constraints on the input variables

Robust to data with heteroskedasticity (data with highly volatile and non-constant
variance)

However, ANN is not an all-powerful algorithm. Recall that our objective is to predict
personality from handwriting, and the data consists of images. An ANN cannot
take image data as it is; the images have to be flattened into arrays of numbers,
which could lead to the loss of important information. Furthermore, the high test
accuracy score could raise concerns about overfitting on future data. Therefore, we
decided to try another popular neural network model: the Convolutional Neural Network (CNN).

CNN
Inspired by the way human visual perception recognizes things, a CNN follows a
hierarchical model that builds a network shaped like a funnel and finally
gives out a fully-connected layer where all the neurons are connected to each other
and the output is processed. The input image is fed into the CNN layers, and these
layers are trained to extract relevant features from the image. A CNN convolves learned
features with input data and uses 2D convolutional layers, making this architecture
well suited to processing 2D data, such as images.

Figure 2.9: How CNN classifies handwritten digits

CNN Methodology

Data Preprocessing
As a first step, we separated the data into training, validation and test sets in the
ratio of 70%, 15% and 15% respectively.

Since the training set had only 657 images, Data Augmentation was used in an effort
to increase the number of samples.
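
The post does not say which augmentations were applied, so the sketch below is only an example of the idea: it creates extra training samples by shifting an image a few pixels and adding mild noise. The function name, the shift range and the noise level are all assumptions.

```python
import numpy as np

def augment(img, rng, max_shift=5, noise_std=3.0):
    """Return a randomly shifted, slightly noisy copy of a grayscale image."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.roll(img, shift=(dy, dx), axis=(0, 1))
    # np.roll wraps around, so paint the wrapped rows/columns as white paper.
    if dy > 0:
        shifted[:dy, :] = 255
    elif dy < 0:
        shifted[dy:, :] = 255
    if dx > 0:
        shifted[:, :dx] = 255
    elif dx < 0:
        shifted[:, dx:] = 255
    noisy = shifted.astype(np.float64) + rng.normal(0, noise_std, img.shape)
    return np.clip(noisy, 0, 255).round().astype(np.uint8)

rng = np.random.default_rng(0)
original = np.full((40, 60), 255, dtype=np.uint8)
original[18:22, 10:50] = 0  # a dark horizontal stroke on white paper
augmented = [augment(original, rng) for _ in range(4)]
```

Small shifts and mild noise keep the handwriting recognizable. Stronger transforms such as large rotations would change the slant and baseline, which are themselves features of interest here.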

Model Building
Since the number of available images was limited even after augmentation, there
was a need to use Transfer Learning so that the model takes the lower level
features from a pre-trained network. The base model used was Inception-ResNet-v2
with weights pre-trained on the ImageNet dataset.

This base layer was followed by the following layers:

Max Pooling layer: It helps in extracting sharp and smooth features

Dropout layer: Used to prevent overfitting which was initially observed

Batch Normalization was used to scale the inputs and thereby make the network
more stable.

Finally, the network had a fully connected layer of 50 units.

The ReLU activation function was used for all the hidden layers and the Softmax
activation was used for the output layer. The Adam optimizer was used for gradient descent.

Model checkpoints were incorporated to store the best weights of the model.
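
A rough Keras sketch of this architecture is given below. It is a reconstruction from the description, not the code in Figure 3.1: the input size, the use of global max pooling, the dropout rate, the loss function, the checkpoint file name and the monitored metric are assumptions. The base network is frozen here, which matches the later suggestion to unfreeze layers as a next step.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn(input_shape=(299, 299, 3), weights="imagenet", dropout_rate=0.3):
    """Frozen Inception-ResNet-v2 base followed by a small classification head."""
    base = tf.keras.applications.InceptionResNetV2(
        include_top=False, weights=weights, input_shape=input_shape)
    base.trainable = False  # keep the pre-trained weights fixed
    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        base,
        layers.GlobalMaxPooling2D(),            # assumed form of the max pooling layer
        layers.Dropout(dropout_rate),
        layers.BatchNormalization(),
        layers.Dense(50, activation="relu"),    # fully connected layer of 50 units
        layers.Dense(4, activation="softmax"),  # four personality labels
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss="sparse_categorical_crossentropy",  # assumed; not stated for the CNN
        metrics=["accuracy"],
    )
    return model

# Keep only the best weights seen during training (file name is hypothetical).
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_cnn.weights.h5", monitor="val_accuracy",
    save_best_only=True, save_weights_only=True)
```

Calling `build_cnn()` downloads the ImageNet weights on first use. The checkpoint would be passed to training as `callbacks=[checkpoint]`.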

Hyperparameter Tuning
Epochs: set the number of epochs to 30

Batch size: tried batch sizes of 16, 32, and 64

Learning rate: 0.001

After running several iterations, the following values were selected:

The optimal number of epochs was found to be 30.

The best batch size was found to be 16.

The best learning rate was 0.001.

CNN Results
Accuracy on the training set — 78.9%

Accuracy on the validation set — 67.3%

Accuracy on the test set — 65.5%

Precision: 66.3%

Recall: 62.5%

Figure 3.0: Train accuracy vs Validation accuracy

Figure 3.1: Code snippet of CNN


CNN Next Steps
The following can be tried as next steps to improve the accuracy of the model:

Unfreeze certain layers and try retraining the model with our dataset for those
layers.

Try other architectures which could potentially outperform Inception-ResNet-v2
for the given dataset.

Augment the data further for the imbalanced classes.

Tune parameters like the optimizer, number of layers, etc.

Conclusion
We used machine learning to automate the graphology process and determine
important personality traits through different classifiers: Random Forest,
ANN and CNN. After image preprocessing, features were extracted. The feature
importance we obtained for each trait from the classifiers was similar to the
importance given by graphologists in determining the personality traits. Random
forest performed better than CNN and ANN because subject knowledge was
incorporated into the pre-processing phase.

However, we are aware there are additional resources available to better understand
human personality. The samples were not standardized for pen type or ink
color. With standardization of pen, paper and margins, as well as guiding personality
questions, we could further enhance our automated handwriting analysis process and
obtain more accurate results.

References
[1] D. J. Antony. Personality Profile Through Handwriting Analysis. Anugraha
Publications, 2008.

[2] Karen Amend and Mary S. Ruiz. Handwriting Analysis: The Complete Basic Book.
New Page Books, 1980.

[3] Alessandro Vinciarelli, Juergen Luettin. A new normalization technique for
cursive handwritten words. Pattern Recognition Letters 22 (2001) 1043–1050, IDIAP,
Switzerland, 26 February 2001.

[4] Shitala Prasad, Vivek Kumar Singh, Akshay Sapre. Handwriting Analysis based
on Segmentation Method for Prediction of Human Personality using Support Vector
Machine. International Journal of Computer Applications (0975–8887), Volume 8,
No. 12, October 2010.

[5] Vikram Kamath, Nikhil Ramaswamy, P. Navin Karanth, Vijay Desai and S. M.
Kulkarni. Development of an Automated Handwriting Analysis System. ARPN
Journal of Engineering and Applied Sciences, Vol. 6, No. 9, September 2011.

[6] Champa H N, K R AnandaKumar. Artificial Neural Network for Human Behavior
Prediction through Handwriting Analysis. International Journal of Computer
Applications (0975–8887), Volume 2, No. 2, May 2010.
