Abstract- Most existing systems categorize a document or email corpus by term similarity, derived from the
document-term relationship. They cannot identify conceptual similarity or correlation among documents. The proposed system
considers both term-wise and concept-wise similarity when computing the email statistics used to tag each email with a suitable type.
Categorization of email data in the proposed system is multi-fold, and the major stages of the approach are described in the
following sections. Emails are classified by their conceptual similarity rather than by blind term-wise similarity,
together with analysis of the mail header, HTML content and attachments.
Keywords- Email Classification, Adaptive Hierarchical Leader, Dynamic Building of Clusters, Dynamic Selection of
Similarity Thresholds.
Adaptive Hierarchical Leader Follower with Evolutionary Computing Algorithm for Email Classification
International Journal of Advanced Computational Engineering and Networking, ISSN: 2320-2106, Volume-2, Issue-12, Dec.-2014
2. Text processing of the email body includes a number of stages. This is the heart of the conceptual
document mining. It includes some natural language processing steps, which are described in the following
modules. The body is identified by the <P> tag as well as the <TD> tag. All the individual <P> [<TD>]
plain text parts are then merged into a single document for each mail and placed in a local mail body store.
Here we assume that this local store is maintained in the firewall system.

that word; otherwise it will insert the word into the hash map. This is an ongoing process until the end of the stream.
This type of mechanism is required to adaptively cluster the documents based on conceptual similarity.

IV. PROPOSED SYSTEM ARCHITECTURE
value is compared with the respective similarity thresholds.
3. If both of the similarities are above the given thresholds, the document is placed in the current cluster.
4. Otherwise the current document forms a new cluster and announces itself as the leader of that cluster.
5. The same process is repeated for all the documents in the training email corpus.

This is a slightly modified version of the actual Leader Follower algorithm, but it has a limitation.
Consider some nth document. That document has no knowledge of the leaders created after it was processed,
so it can only be assigned to a leader created before it. If a leader created later also has similarity
above the given threshold and is nearer to the nth document than its current leader, the existing algorithm
gives no solution. In this paper we extend the algorithm to meet this requirement, i.e. every document except
the leader documents is compared with the leaders formed after its creation. If any such leader or cluster is
nearer than the current cluster, the document is removed from the current cluster, assigned to the new cluster,
and its leader is updated. Otherwise no change is made.

There is still a limitation in handling the concept drift that occurs in mail stream processing.
The existing mail corpus is static, processing is fully offline, and all the documents are converted into
equal-sized document vectors. This raises no problem because the dataset is offline and, moreover, static.
But it cannot meet the demands of online processing and dynamic document vector sizes. This is
achieved through our novel "Adaptive Hierarchical Leader Follower with Evolutionary Computing
algorithm".

It includes the following sub modules.

A. Mail Stream Processor Implementation

In this module the proposed system asks the user to submit email credentials so that the mails in the inbox
can be processed.
After successful login, each unread mail is extracted by the Mail Stream Processor by default. If the user
changes the setting to "all", both read and unread mails are extracted.
The email header part is processed first and compared with the existing training database. If it matches any
existing mail header information, the current mail is marked according to the existing header. Otherwise a new
entry is submitted to the database.
Next the mail body is processed. Here all the text processing steps are executed as mentioned in the previous
sections. But the documents to be clustered may have varied lengths, so they cannot be compared directly with
the existing clustered documents.
So the mail body document vector is subdivided or extended based on the actual cluster document vector size.
These sub-document vectors are individually processed and compared with the existing leaders.
In this case the sub-document vectors may be assigned to different clusters. The cluster to which the main
document belongs is determined from the weight calculated at each cluster for the sub-document vectors. If the
overall probability is less than the given threshold, the document itself forms a new cluster with those
sub-document vectors that do not fit any other cluster.

Here two more things have to be considered. Firstly, a new document may contain new terms that are not yet
in the hash map. So the dictionary, i.e. the hash map, is updated with the new terms and their respective
synset ids. This is used in computing conceptual similarity. Secondly, the existing similarity thresholds may
no longer fit when concept drift occurs.

To overcome these limitations, evolutionary computing is introduced. It is used to find the best-fit
threshold for the given context. In this method an initial population is given, and the process
terminates based on the fitness function value. Here the fitness function is a cluster validity measure, which
is used to assess the quality of the clusters as well as of the clustering scheme for the given similarity thresholds.

B. Cluster Evolutionary Algorithm
Input: population size, expected similarity threshold, mutation probability
Output: best-fit thresholds
1. Randomly generate arrays of similarity thresholds in [0,1], as many as the population size
2. Find the fitness of each array in the population
3. Select the best-fit top half of the arrays and clone them
4. Apply mutation to all of the arrays to get offspring and find the fitness again
5. Select the best-fit top half of the arrays from this offspring
6. Merge the previous half and the current half to get the next generation
7. Repeat the process until the fitness value >= the expected similarity threshold

One more enhancement is making the clustering hierarchical, to clearly differentiate generalization from
specialization. A child-level cluster is crisper in nature than its parent-level cluster and exhibits
more specific nature and behaviour. To do this, the same Adaptive Leader Follower algorithm is applied to
each individual cluster to obtain multiple levels. Here adaptive is
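
The cluster evolutionary algorithm of Section B can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `evolve_thresholds`, `mutate` and `toy_validity` are hypothetical, and so are the Gaussian mutation step and the `max_generations` safety cap. The paper does not specify its cluster validity measure, so `toy_validity` is only a stand-in fitness function. The sketch also assumes that "clone" in step 3 means each selected array is copied twice, which keeps the population size constant after the merge in step 6.

```python
import random


def mutate(thresholds, mutation_prob, rng):
    """Perturb each threshold with probability mutation_prob, clamped to [0, 1]."""
    mutated = []
    for t in thresholds:
        if rng.random() < mutation_prob:
            t = min(1.0, max(0.0, t + rng.gauss(0.0, 0.1)))  # assumed mutation step
        mutated.append(t)
    return mutated


def evolve_thresholds(fitness, pop_size=8, n_thresholds=2, target=0.95,
                      mutation_prob=0.3, max_generations=200, seed=0):
    """Search for similarity thresholds whose fitness reaches the target."""
    rng = random.Random(seed)
    half = pop_size // 2  # pop_size is assumed to be even
    # Step 1: random threshold arrays in [0, 1].
    population = [[rng.random() for _ in range(n_thresholds)]
                  for _ in range(pop_size)]
    for _ in range(max_generations):
        # Steps 2-3: rank by fitness and keep the best half.
        parents = sorted(population, key=fitness, reverse=True)[:half]
        # Step 7: stop once the best array reaches the expected value.
        if fitness(parents[0]) >= target:
            return parents[0]
        # Step 3 (clone) and step 4 (mutate the clones to get offspring).
        clones = [list(p) for p in parents for _ in range(2)]
        offspring = [mutate(c, mutation_prob, rng) for c in clones]
        # Step 5: keep the best half of the offspring.
        best_offspring = sorted(offspring, key=fitness, reverse=True)[:half]
        # Step 6: merge both halves into the next generation.
        population = parents + best_offspring
    # Safety cap reached: return the best array found so far.
    return max(population, key=fitness)


def toy_validity(thresholds):
    """Stand-in fitness: closeness to a made-up ideal pair (0.6, 0.4)."""
    ideal = (0.6, 0.4)
    error = sum(abs(t - i) for t, i in zip(thresholds, ideal)) / len(ideal)
    return 1.0 - error


if __name__ == "__main__":
    best = evolve_thresholds(toy_validity)
    print(best, toy_validity(best))
```

Because the best half of each generation is carried over unchanged, the best fitness never decreases from one generation to the next. In a real run the fitness function would cluster the corpus with the candidate thresholds and score the result with a cluster validity measure.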
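
The Leader Follower procedure described in this section can likewise be sketched in Python, including the proposed extension that re-checks every follower against leaders created after it. This is a simplified illustration under stated assumptions, not the authors' code. It uses a single cosine similarity and a single threshold, whereas the paper compares both a term-wise and a conceptual similarity against their respective thresholds. The names `cosine` and `leader_follower` and the `reassign` flag are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def leader_follower(docs, threshold, similarity=cosine, reassign=True):
    """Return, for each document, the index of the document leading its cluster."""
    leaders = []      # indices of documents that became leaders, in creation order
    assignment = []   # assignment[i] is the leader index of document i
    # First pass: join the first leader that is similar enough, else become a leader.
    for i, doc in enumerate(docs):
        chosen = None
        for leader in leaders:
            if similarity(doc, docs[leader]) >= threshold:
                chosen = leader
                break
        if chosen is None:
            leaders.append(i)
            chosen = i
        assignment.append(chosen)
    if not reassign:
        return assignment
    # Second pass (the extension): compare each follower with later leaders.
    for i, doc in enumerate(docs):
        if assignment[i] == i:
            continue  # leader documents are not moved
        current_sim = similarity(doc, docs[assignment[i]])
        for leader in leaders:
            if leader > i:  # this leader was created after document i
                sim = similarity(doc, docs[leader])
                if sim >= threshold and sim > current_sim:
                    assignment[i], current_sim = leader, sim
    return assignment


if __name__ == "__main__":
    docs = [[1, 0], [4, 3], [3, 4]]
    print(leader_follower(docs, 0.75, reassign=False))  # [0, 0, 2]
    print(leader_follower(docs, 0.75))                  # [0, 2, 2]
```

In the small example, the second document first follows document 0 (similarity 0.8). Document 2 then becomes a leader, and the second pass moves the second document to it because its similarity to the newer leader (0.96) is higher.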
[5] OpenNLP, https://opennlp.apache.org/