International Journal of Advanced Computational Engineering and Networking, ISSN: 2320-2106, Volume-2, Issue-12, Dec.-2014

ADAPTIVE HIERARCHICAL LEADER FOLLOWER WITH
EVOLUTIONARY COMPUTING ALGORITHM FOR
EMAIL CLASSIFICATION
¹U. SURYA KAMESWARI, ²I. RAMESH BABU
¹Asst. Professor, ²Professor, Dept. of CSE, Acharya Nagarjuna University, Guntur, AP, India

Abstract- Most existing systems categorize a document or email corpus based on term similarity, by finding the document-term relationship. They cannot identify conceptual similarity or correlation among documents. The proposed system focuses on both term-wise and conceptual similarity to compute the email statistics used to tag each email with a suitable type. Categorization of email data is multi-fold in the proposed system, and its major stages are described in the following sections. Emails are classified based on their conceptual similarity rather than blind term-wise similarity, along with mail header, HTML content and attachment analysis.

Keywords- Email Classification, Adaptive Hierarchical Leader, Dynamic Building of Clusters, Dynamic Selection of
Similarity Thresholds.

I. INTRODUCTION

Clustering maps data items into clusters, where clusters are natural groupings of data items based on similarity or probability density methods. Unlike classification and prediction, which analyze class-labelled data objects, clustering analyzes data objects without class labels and tries to generate such labels. A similarity measure or similarity function is a real-valued function that quantifies the similarity between two objects. This paper classifies emails based on their conceptual similarity rather than blind term-wise similarity, along with mail header, HTML content and attachment analysis.

II. PROPOSED APPROACH

The proposed system focuses on both term-wise and conceptual similarity to compute the email statistics used to tag each email with a suitable type. Categorization of email data is multi-fold in the proposed system. The major stages of the proposed approach are described in the following sections.

A. Machine Learning
Machine learning is the process of giving knowledge to the proposed learner. For that purpose the proposed system is initially trained with an existing email corpus with heterogeneous content. A novel unsupervised machine learning algorithm named "Adaptive Hierarchical Leader Follower with Evolutionary Computing algorithm" is used by the proposed system for effective handling of concept drift in email streams. Before going into the details of the algorithm, the pre-processing steps described below have to be completed.

III. PRE PROCESSING STEPS

Email pre-processing has the following stages:

A. Email Header processing
Generally an email header contains the following information: From, Subject, Date, To, Return-Path, Envelope-To, Delivery Date, Received, Message-id, Content-type, Content-Length, X-spam status and Message Body. Among these fields the proposed system uses Subject, Date, To, Delivery Date, Message Id, Received, Content type, Content length and Message Body for email statistics and analysis.

Initially the historical or training email corpus is loaded and the above-mentioned fields are extracted from each email. These fields are then stored as a transaction table in the database for offline or online querying. A view with the specified group by statements is created over this transaction table. This view is used to get the frequency of a particular path and content, which helps to filter the mails at a basic level. Based on these statistics an initial decision is made and some of the mail paths or ids are marked as suspect. Later the body processing stage is initiated.

B. Body processing
In this stage the body is identified based on the content type. It may be HTML, plain text or another MIME type. Processing is divided according to the content type, and again has two stages of work.

1. HTML Processing: Here the body text is scanned for <A> and <SCRIPT> tags, because in most cases hyperlinks and script elements are the cause of phishing and virus injection. Every hyperlink is scanned to identify the source of that link. This is achieved by processing the host part of the link with "trace route"-like tools. The results of the trace route are compared with the "phish tank" database. If any address in the trace route belongs to the phish tank database, the proposed system removes that hyperlink. The phish tank database is publicly available.
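The HTML processing stage can be illustrated with a minimal Python sketch. It only covers the tag scan and a host lookup: the trace route step is left out, and the `blocklist` set is a hypothetical stand-in for a lookup against the phish tank database. The names `LinkScriptScanner` and `scan_body` are illustrative and do not come from the paper.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkScriptScanner(HTMLParser):
    """Collects hyperlink hosts and counts <script> elements in an HTML body."""

    def __init__(self):
        super().__init__()
        self.hosts = []
        self.script_count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names, so <A HREF=...> arrives as "a"/"href".
        if tag == "a":
            href = dict(attrs).get("href")
            host = urlparse(href).hostname if href else None
            if host:
                self.hosts.append(host)
        elif tag == "script":
            self.script_count += 1


def scan_body(html, blocklist):
    """Return (all link hosts, hosts found in the blocklist, number of script tags).

    `blocklist` is an assumed in-memory set of known phishing hosts.
    """
    scanner = LinkScriptScanner()
    scanner.feed(html)
    flagged = [host for host in scanner.hosts if host in blocklist]
    return scanner.hosts, flagged, scanner.script_count


if __name__ == "__main__":
    body = '<p>Hello</p><A HREF="http://bad.example/login">pay</A><script>x()</script>'
    print(scan_body(body, {"bad.example"}))
```

A flagged host would correspond to a hyperlink that the proposed system removes; a non-zero script count marks a body that needs closer inspection.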


2. Text processing of the email body includes a number of stages. This is the heart of conceptual document mining. It includes some natural language processing steps, described in the following modules. The body is identified with the <P> tag as well as the <TD> tag. After that, all the individual <P> [<TD>] plain text parts are merged into a single document for each mail and placed in a local mail body store. Here we assume that this local store is maintained in the firewall system.

C. Conceptual Document Processing

This stage is executed in parallel by utilizing recent advancements in processor technology. Before the procedure is executed, the entire mail body corpus is divided into multiple partitions. Each partition is submitted to a different processor as one of multiple Fork-Join tasks. In every partition the procedure is executed as follows.
• Each document is split into multiple sentences.
• Each sentence is split into multiple terms.
• A stemming algorithm such as the Porter Stemming Algorithm is applied to each term. This algorithm finds the stem word of the given term to obtain the generalization of that term.
• To find the conceptual similarity there is a need to identify the negation phrases.

For example, "Rama is not bad boy" means "Rama is good boy". Both of these sentences are conceptually similar. If we only consider term-wise similarity then they may be treated as different. To overcome this type of problem, the negation words in each sentence are identified and replaced with antonym words. For this purpose we have deployed the WordNet [6] tool with the Java-WordNet-Interface developed by MIT team members. It returns not only antonyms but also synonyms, hypernyms and hyponyms, which are also used in our text processing.

• Similarly, some words or sentences conceptually exhibit a similar meaning. For example, "you have to pay your bill" may be conceptually equal to "you are supposed to pay the bill". Another example could be "you need to submit documents", which is conceptually equal to "you required submitting document". To achieve this, the proposed system finds the synonyms, hypernyms and hyponyms of each term and compares them with the respective list. However, maintaining all the terms and their synonyms in a list in memory is not a scalable or memory-efficient solution.
• For this purpose the synset ids acquired from WordNet are maintained in a hash map. Every time a new word is found, the respective index word is looked up and the related synset id is retrieved. If the indexed word id is found under the synset id, the process ignores that word; otherwise it inserts the word into the hash map. This is an ongoing process till the end of the stream. This type of mechanism is required to adaptively cluster the documents based on conceptual similarity.

IV. PROPOSED SYSTEM ARCHITECTURE

Figure 1: Proposed Architecture

Sometimes the same word can be treated as a noun, a verb or an adjective. In that case there is a need to consider the sense of the word in the given sentence for the given context. This can be achieved through automatic tagging of the words in a given sentence with natural language processing tools such as OpenNLP. It has a POS (Parts-of-Speech) utility to determine the part of speech of a term. Finally, document vectors are generated and normalized by other processes simultaneously. Initially, document vectors are generated with equal-sized term frequencies. A document-term joint probability matrix is also generated prior to the clustering.

V. ADAPTIVE HIERARCHICAL LEADER FOLLOWER WITH EVOLUTIONARY COMPUTING ALGORITHM

After pre-processing is done as mentioned above (although it is an ongoing process), documents are clustered as follows. The base of this algorithm is acquired from the Leader Follower algorithm and modified according to the demands of the context presented in this paper.

Documents in the training e-mail corpus are equal-sized document vectors, so they can be directly compared with each other using the familiar cosine similarity measure.

Along with cosine similarity, conceptual similarity is also compared simultaneously, as mentioned above.

1. The first document itself is treated as a leader and forms a new, first cluster.
2. From the second document onwards, each document is compared with the existing clusters using the given similarity thresholds. Here similarity means both conceptual and cosine similarity. Each similarity value is compared with the respective similarity threshold.
3. If both of the similarities are above the given thresholds then the document is placed in the current cluster.
4. Otherwise the current document forms a new cluster and announces itself as the leader of that cluster.
5. The same process is repeated for all the documents in the training email corpus.

This is a slightly modified version of the actual Leader Follower algorithm, but it has a limitation. Consider some nth document. That document has no knowledge of the leaders created after it was processed, so it is assigned to a leader that was created before it. If a later leader also has a similarity greater than the given threshold and is nearer to the nth document than its current leader, the existing algorithm gives no solution. So in this paper we extend the algorithm to meet this requirement, i.e., all documents except the leader documents are compared with the leaders that were formed after their own assignment. If any leader or cluster is nearer than the current cluster, the document is removed from the current cluster and assigned to the new cluster, and its leader is updated. Otherwise no change is made.

There is still a limitation in handling the concept drift that occurs in mail stream processing. The existing mail corpus is static, the process is fully offline, and all the documents are converted into equal-sized document vectors. This raises no problem because the dataset is offline and static. But it cannot meet the demands of online processing and dynamic document vector sizes. This is achieved through our novel "Adaptive Hierarchical Leader Follower with Evolutionary Computing algorithm".

It includes the following sub-modules.

A. Mail Stream Processor Implementation

• In this module the proposed system asks the user to submit his email credentials in order to process the mails in the inbox.
• After a successful login, each unread mail is extracted by the mail stream processor by default. If the user changes the setting to "all" then both read and unread mails are extracted.
• The email header part is processed first and compared with the existing training database. If it matches any existing mail header information then the current mail is marked according to the existing header. Otherwise a new entry is submitted to the database.
• Next the mail body is processed. Here all the text processing steps are executed as mentioned in the previous sections. But the documents to be clustered may have varied lengths and so cannot be directly compared with the existing clustered documents.
• So the mail body document vector is subdivided or extended based on the actual cluster document vector size. These sub-document vectors are individually processed and compared with the existing leaders.
• In this case sub-document vectors may be assigned to different clusters. The cluster to which the main document belongs is decided based on the weight calculated at each cluster for the sub-document vectors. If the overall probability is less than the given threshold then the document itself forms a new cluster with those sub-document vectors that do not fit any other cluster.

Here two more things have to be considered. Firstly, a new document may have a new set of terms that are not yet placed in the hash map. So the dictionary, i.e., the hash map, is updated with the new terms and their respective synset ids. This is used in conceptual similarity. Secondly, the existing similarity thresholds may not fit when concept drift occurs.

To overcome these limitations evolutionary computing is introduced. It is used to find the best-fit threshold for the given context. According to this method an initial population is given and the process is terminated based on the fitness function value. Here the fitness function is a cluster validity measure, which is used to find the quality of the clusters as well as of the clustering scheme for the given similarity thresholds.

B. Cluster evolutionary algorithm
Input: population size, expected similarity threshold, mutation probability
Output: Best-fit thresholds
1. Randomly generate similarity thresholds in [0,1], as many as the population size
2. Find the fitness of each array in the population
3. Select the best-fit top half of the arrays and clone them
4. Mutate all of the arrays to get offspring and find the fitness again
5. Select the best-fit top half of the arrays from this offspring
6. Merge the previous half and the current half of the population to get the next generation
7. Repeat the process till the fitness value >= the expected similarity threshold

One more enhancement is making the clustering hierarchical, to clearly differentiate generalization from specialization. A child-level cluster is crisper in nature than its parent level and exhibits more specific nature and behaviour. To do this, the same Adaptive Leader Follower algorithm is applied to each individual cluster to get multiple levels. Here "adaptive" is the more significant term because the proposed algorithm can handle concept drifts, conceptual similarity, dynamic building of clusters, and dynamic selection of similarity thresholds.
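The leader follower pass and the evolutionary threshold search can be sketched together in Python. This is a simplified illustration under several assumptions: each individual is a single cosine threshold (the conceptual similarity threshold is omitted), the fitness is a stand-in validity score (mean similarity of documents to their own leader minus mean similarity between leaders) rather than the paper's cluster validity measure, every parent produces two mutated clones so that the population size stays constant, and a `max_generations` cap is added so the loop always terminates. All function and parameter names are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def leader_follower(docs, threshold):
    """Return (leaders, labels): leader document indices and one cluster index per document."""
    leaders, labels = [], []
    for i, doc in enumerate(docs):
        sims = [cosine(doc, docs[j]) for j in leaders]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            labels.append(best)              # follow the most similar existing leader
        else:
            leaders.append(i)                # the document announces itself as a new leader
            labels.append(len(leaders) - 1)
    # Extension: followers are re-checked against every leader, including later ones.
    for i, doc in enumerate(docs):
        if i not in leaders:
            labels[i] = max(range(len(leaders)),
                            key=lambda c: cosine(doc, docs[leaders[c]]))
    return leaders, labels


def fitness(docs, threshold):
    """Stand-in validity score: cohesion around leaders minus similarity between leaders."""
    leaders, labels = leader_follower(docs, threshold)
    within = sum(cosine(d, docs[leaders[c]]) for d, c in zip(docs, labels)) / len(docs)
    pairs = [(a, b) for k, a in enumerate(leaders) for b in leaders[k + 1:]]
    # A single cluster has no leader pairs; it is penalised with the maximum similarity.
    between = sum(cosine(docs[a], docs[b]) for a, b in pairs) / len(pairs) if pairs else 1.0
    return within - between


def evolve_threshold(docs, pop_size=8, expected_fitness=0.9, mutation_prob=0.5,
                     max_generations=30, seed=0):
    """Search [0, 1] for a threshold whose fitness reaches expected_fitness."""
    rng = random.Random(seed)

    def score(t):
        return fitness(docs, t)

    population = [rng.random() for _ in range(pop_size)]             # step 1
    for _ in range(max_generations):
        parents = sorted(population, key=score, reverse=True)[: pop_size // 2]  # steps 2-3
        if score(parents[0]) >= expected_fitness:                    # step 7
            break
        offspring = []
        for t in parents * 2:                                        # step 4: mutate clones
            if rng.random() < mutation_prob:
                t = min(1.0, max(0.0, t + rng.gauss(0.0, 0.1)))
            offspring.append(t)
        survivors = sorted(offspring, key=score, reverse=True)[: pop_size // 2]  # step 5
        population = parents + survivors                             # step 6
    return max(population, key=score)
```

Applying `leader_follower` again to the documents of a single cluster, with a stricter threshold, would give the child level of the hierarchy described above.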
C. Attachment Processing
Last but not least is attachment classification. Attachments may carry viruses. Mail systems such as Gmail already disallow files with extensions such as exe, bat and sh, but they consider only the extension; i.e., if the file is renamed or placed in a WinRAR archive, it cannot be traced and can still be uploaded. To overcome this limitation, the first two bytes of every attachment have to be extracted to verify the file type, because in most cases the first two bytes represent a unique magic number belonging to the file type. If those two bytes belong to any disallowed type then either the attachment has to be stopped or the mail must not be forwarded to the actual client system. In the case of archive files this process should be applied only after extraction.

D. Data Post Processing
Finally, by aggregating all of the results retrieved from mail header statistics, body processing and attachment processing, the suspect mails are finalized. To label the clusters with the proper type, a phishing and spam word database has to be used. Every cluster is scanned for these words and the probability of these words in each mail body document is calculated. Clusters whose probability of malicious documents > the given malicious threshold are labelled Spam, Phish or Malware accordingly, and the rest of the clusters are marked as normal. All these clusters are in turn used as training data for the next generation of upcoming mails. This is called online learning.

VI. SIMULATION RESULTS

Table 1: Comparison of cluster validity, term-wise vs. conceptual similarity

Fig 1: Variance of email feature vector with phishing susceptibility

Fig 2: Validating false positives

Fig 3: Email documents feature difference

CONCLUSION AND FUTURE WORK

This paper focuses on the conceptual similarity of email messages, along with mail header and attachment information, to detect and separate normal mails from malicious mails. For that purpose a novel "Adaptive Hierarchical Leader Follower with Evolutionary Computing algorithm" is introduced to handle concept drifts, conceptual similarity, dynamic building of clusters, and dynamic selection of similarity thresholds. Future work could be the implementation of a cloud-based email filter technique to overcome the memory and processor requirements.

REFERENCES

[1] C.J. van Rijsbergen, S.E. Robertson and M.F. Porter, 1980. New models in probabilistic information retrieval. London: British Library. (British Library Research and Development Report, no. 5587)

[2] Karen Sparck Jones and Peter Willett, 1997. Readings in Information Retrieval. San Francisco: Morgan Kaufmann, ISBN 1-55860-454-4

[3] P. A. Vijaya, M. Narasimha Murty and D. K. Subramanian, Department of Computer Science and Automation, Intelligent Systems Lab, Indian Institute of Science, Bangalore 560 012, India

[4] Davies, David L.; Bouldin, Donald W. (1979). "A Cluster Separation Measure". IEEE Transactions on Pattern Analysis and Machine Intelligence. PAMI-1 (2): 224–227. doi:10.1109/TPAMI.1979.4766909

[5] OpenNLP, https://opennlp.apache.org/

[6] G. A. Miller, R. Beckwith, C. D. Fellbaum, D. Gross, K. Miller. 1990. WordNet: An online lexical database. Int. J. Lexicograph. 3, 4, pp. 235–244

[7] WordNet URL http://wordnet.princeton.edu/wordnet/


