


Spam detection using Compression and PSO

Michal Prilepok, Tomas Jezowicz, Jan Platos, Vaclav Snasel


Department of Computer Science, FEECS
IT4Innovations, Centre of Excellence
VSB-Technical University of Ostrava
17. listopadu 15, 708 33, Ostrava Poruba, Czech Republic
{michal.prilepok, tomas.jezowicz, jan.platos, vaclav.snasel}@vsb.cz

Abstract—The problem of spam e-mails is still growing; therefore, the development of algorithms able to solve this problem is also a very active area. This paper presents two different algorithms for spam detection. The first algorithm is based on a Bayesian filter, improved with data compression algorithms for the cases in which the Bayesian filter cannot decide. The second algorithm is based on a document classification algorithm using Particle Swarm Optimization. The results of the presented algorithms are promising.

Keywords—e-mail; spam; Bayesian filter; data compression; similarity; particle swarm optimization

I. INTRODUCTION

Spam senders are flooding us with an increasing number of unrequested e-mails. This practice forces us to think about developing defensive techniques. Currently, there are many techniques and approaches which enable us to eliminate unrequested e-mails. These techniques and approaches can be divided into the following categories:

• Sender or mediator analysis
• E-mail content analysis
• Sponsor analysis

None of these categories is effective enough on its own. Nowadays, combinations of various techniques and attitudes are used in order to improve the success of the fight against spam [1].

The organization of this paper is as follows. The second section describes current research in the area of spam detection. The third section describes a spam detection algorithm based on Bayesian filtering and data compression, and the fourth section describes a spam detection algorithm based on document classification using Particle Swarm Optimization. The fifth section describes the performed experiments and their results for both designed algorithms. The last section concludes the paper.

II. CURRENT RELATED WORK

The first techniques for fighting spam were based on the detection of key words in the e-mail subject. These spam filters compared the words contained in the e-mail subject with a list of prohibited words. Spam authors started to avoid this way of spam detection with various modifications of the e-mail subject. One frequently used modification makes us think that the message is an e-mail forwarded by a third person, e.g. "RE: About us" [2], [1].

Later, filtering techniques started to use the body of the e-mail as well. In the same way, the words from the e-mail body were compared with the list of prohibited words. This extension brought only a small improvement. The rate of success in detecting unrequested e-mail depended on the quality of the list of prohibited words. The individual prohibited words had to be chosen very carefully so that the final false detection rate would not increase.

Filters also use check-sums and marks to eliminate false detections. A check-sum is calculated over each received e-mail, and the resulting hash is compared with a database of known spam e-mails. One-way transformation functions, also called hash functions (MD5, SHA1, ...), are used as check-sum calculators.

Spam filtering based on key words [1] did not bring the required success in the detection of unrequested e-mails. The analysis of recent e-mail trends seems to be a very effective method in the fight against unrequested mail, as this technique reduces the rate of false detections. Heuristic filters are ranked among these techniques: rules-based filters and learning-based filters.

Rules-based filters [1] look for features in e-mails which are characteristic for spam. These are certain words (e.g. viagra), collocations and mistakes typical for spam. Examples of such mistakes are an e-mail sent with a future date, prohibited marks in the header, an incorrectly marked MIME type of the e-mail, etc. Each detected feature has a defined point score. Usually, the points are summed up, and if the sum exceeds a defined limit, the e-mail is marked as spam. The detected features are defined with the help of rules that have to be regularly updated and adapted to spammers' practices.

Learning-based filters (often called Bayesian) [3], [4], [5], [6] use techniques from the area of artificial intelligence. In the learning phase, e-mails are submitted to the filter. Each e-mail is marked as spam or ham (not spam). The filter extracts features from each e-mail and stores them in a database. Usually, the e-mail is divided into words (or other text segments), and the probability of the individual words is computed and statistically evaluated. The word probabilities are evaluated for both spam and ham e-mails.
In the detection phase, the filter uses the collected information and computes the probability that the tested e-mail is spam. The most frequently used formula for this probability calculation was suggested by the mathematician Bayes. Learning filters are most efficient when they are taught what is and what is not spam by end users according to their individual opinion. Bayesian filters are also used on servers, where the learning is done by all users together.

Even though the Bayesian filter shows a high success rate in unrequested e-mail detection, false marking or non-detection still happens in some cases. This imperfection can be eliminated by comparing the content of the examined e-mail with unrequested e-mails known in advance [2], [1].

III. SPAM DETECTION USING BAYESIAN FILTER AND DATA COMPRESSION

In this section, we describe the first approach to spam detection, based on a Bayesian filter and data compression techniques. The Bayesian filter is a very frequently used spam detection algorithm, but in our approach we improve its precision using data compression, which improves the precision of the algorithm significantly.

A. The Bayesian spam filter

The Bayesian spam filter is a statistical technique for filtering e-mails. It uses naive Bayesian classifiers for spam identification. The Bayesian classifiers work with relations between elements (typically words) from unrequested (spam) and requested e-mails. They calculate the probability that an e-mail is spam or not with the help of Bayesian statistics. Particular words have particular probabilities of occurrence in unrequested e-mails and in legitimate e-mails. The filter does not know these probabilities in advance; it first has to learn them so that it can build upon them. Each new e-mail must be manually marked as spam or not spam. For all words in each e-mail, the filter adjusts in its database the probability with which the given word occurs in spam or in legitimate e-mails. For instance, words such as "viagra" or "refinance" are often found in spam e-mails, while the names of friends or family members are often found in legitimate e-mails [7].

1) Calculation of the probability that an e-mail containing a word is spam: The probability that an e-mail containing a particular word is spam can be calculated with the help of the following formula [3], [4], [5], [6]:

Pr(S | W) = (Pr(W | S) × Pr(S)) / (Pr(W | S) × Pr(S) + Pr(W | H) × Pr(H))

where:
• Pr(S | W) is the probability that the e-mail is spam, given that it contains the examined word.
• Pr(S) is the total probability that an e-mail is spam.
• Pr(W | S) is the probability that the examined word occurs in spam e-mails.
• Pr(H) is the probability that a given e-mail is not spam (ham).
• Pr(W | H) is the probability that the examined word occurs in ham e-mails.

The probabilities Pr(W | S) and Pr(W | H) are determined in the learning phase of the filter. With the help of Pr(S) and Pr(H), the partiality or impartiality of the filter towards the checked mails can be adjusted. The closer the value of Pr(S) is to 1.0, the more the filter is biased against spam mails. The value of Pr(H) adjusts the filter in the opposite direction: the higher this value, the less the filter is biased against spam mails. The sum of Pr(H) and Pr(S) must equal 1.0. Statistics show that the probability of spam is approximately 80%; on this basis we could set Pr(S) = 0.8 and Pr(H) = 0.2, in which case the Bayesian filter anticipates that 80% of the checked e-mails are spam and the remaining 20% are legitimate ham mails. The majority of Bayesian filters, however, use the hypothesis that incoming e-mail is no more likely to be spam than legitimate e-mail (ham), and therefore set both probabilities to 50% (Pr(S) = 0.5; Pr(H) = 0.5). Filters which use this hypothesis can be called impartial: they have no prejudice against incoming mail. This hypothesis enables a simplification of the general formula to:

Pr(S | W) = Pr(W | S) / (Pr(W | S) + Pr(W | H))

This number is called the spamcity (or spaminess) of the examined word. The value of Pr(W | S) used in this formula is approximated by the frequency of e-mails containing the examined word among the e-mails marked as spam during the learning phase. Similarly, Pr(W | H) is approximated by the frequency of e-mails containing the examined word among the e-mails marked as ham during the learning phase. Because of these approximations, the collection of e-mails used for learning has to be representative enough and, in accordance with the 50% hypothesis, the data sets of ham and spam e-mails should be of the same size. Determining whether an e-mail is spam or ham on the basis of a single word is prone to mistakes. That is why the Bayesian filter takes several words into account and combines their spamcities in order to determine the total probability.

2) Combination of individual probabilities: The Bayesian filter of unrequested e-mail assumes that the words are independent. This assumption is wrong for natural languages, where, for example, the probability of an adjective is affected by the probability of a noun. Under this assumption, we can derive a further formula from the Bayesian theorem:

p = (p1 × p2 … pn) / (p1 × p2 … pn + (1 − p1) × (1 − p2) … (1 − pn))
where p is the probability that the suspected e-mail is spam and pi = Pr(S | Wi) is the spamcity of the i-th examined word.

The result p is compared to a specific value. If p is higher than the given limit, the e-mail is considered spam; otherwise it is a ham mail.
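To make the detection phase concrete, the following is a minimal Python sketch of the procedure described above: per-word spamcities estimated from training counts (under the 50% hypothesis with equally sized corpora), the rare-word correction described in the next subsection with strength s = 3, and the product-form combination. The function names, toy counts and the 0.5 decision limit are our illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of the Bayesian detection phase described above.
# Training counts, s = 3 and the 0.5 decision limit are illustrative.
def spamcity(word, spam_count, ham_count, s=3.0, pr_s=0.5):
    """Corrected spamcity of one word (see 'Use of rare words' below)."""
    n_spam = spam_count.get(word, 0)
    n_ham = ham_count.get(word, 0)
    n = n_spam + n_ham
    if n == 0:
        return pr_s                          # unseen word: fall back to Pr(S)
    p = n_spam / n                           # raw Pr(S | W), equal-sized corpora
    return (s * pr_s + n * p) / (s + n)      # beta-distribution correction

def combined_probability(words, spam_count, ham_count):
    """p = prod(p_i) / (prod(p_i) + prod(1 - p_i))."""
    num, den = 1.0, 1.0
    for w in words:
        p_i = spamcity(w, spam_count, ham_count)
        num *= p_i
        den *= 1.0 - p_i
    return num / (num + den)

spam_count = {"viagra": 40, "refinance": 25}     # e-mails containing the word
ham_count = {"viagra": 1, "meeting": 30}
p = combined_probability(["viagra", "meeting"], spam_count, ham_count)
print("spam" if p > 0.5 else "ham", round(p, 3))
```

Because the correction keeps every p_i strictly between 0 and 1, the combined formula never divides by zero even for words seen only as spam or only as ham.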
3) Use of rare words: If a word did not occur during the learning phase, both the numerator and the denominator are equal to zero, in the general formula as well as in the spamcity formula. The software may simply leave out words which do not provide any information. Words that occurred only a few times during the learning phase also cause a problem, because it would be a mistake to trust such blind information. A solution to this problem is to prevent such words from being taken into account. The Bayesian theorem is applied several times, and the division between spam and ham e-mails containing the examined word is a random quantity with a beta distribution. Therefore, we may use a modified formula for the calculation of the probability:

P̄r(S | W) = (s × Pr(S) + n × Pr(S | W)) / (s + n)

where
• P̄r(S | W) is the corrected probability that the e-mail is spam, given that it contains the examined word.
• s is the strength we give to the basic information about incoming unrequested mail.
• Pr(S) is the probability that an incoming e-mail is spam.
• n is the number of occurrences of the examined word during the learning phase.
• Pr(S | W) is the spamcity of the examined word.

This corrected probability is used instead of the spamcity in the combined formula. Pr(S) may be set to 0.5 in order to avoid too great a distrust towards incoming mail. A good value for s is three, which means that the examined word must occur at least three times within the learning e-mails for the trust in its spamcity value to outweigh the default value. The formula also covers the case when n equals 0 (where the spamcity is not defined); in this case, the result is simply Pr(S).

4) Other heuristics: Neutral words like "the", "and", "some" or "is" (in English), or their equivalents in other languages, may be ignored. More generally, some Bayesian filters ignore all words whose spamcity is around 0.5, because they contribute insufficiently to the decision. The words taken into account should have a spamcity close to 0.0 (a distinctive mark of legitimate mail) or to 1.0 (a distinctive mark of spam mail). For example, a method may take into account the 10 words with the highest absolute value |0.5 − p|.

Some software products take into account the fact that a given word occurs several times in the examined e-mail; others do not.

Some software products use patterns (word sequences) instead of separate words of natural language. For instance, they calculate the spamcity value of the four-word sequence "Viagra is good for" instead of calculating the spamcity values of the words "Viagra", "is", "good" and "for" separately. This method provides a higher sensitivity to context and leads to a better elimination of the Bayesian noise, at the cost of a bigger database.

B. Metrics for Compression-based Similarity

Formally, a distance is a function D with nonnegative real values, defined on the Cartesian product X × X of a set X. It is called a metric on X if, for every x, y, z ∈ X:
• D(x, y) = 0 if and only if x = y (the identity axiom);
• D(x, y) + D(y, z) ≥ D(x, z) (the triangle inequality);
• D(x, y) = D(y, x) (the symmetry axiom).
A set X provided with a metric is called a metric space. For example, every set X has the trivial discrete metric D(x, y) = 0 if x = y and D(x, y) = 1 otherwise [8].

1) Universal Similarity Metric: The most outstanding metric based on Kolmogorov complexity is the Universal Similarity Metric (USM) proposed by Li et al. [8]. The USM has been used to detect the similarity of mitochondrial genomes, to cluster SARS viruses, musical pieces and images, and to detect plagiarism in student assignments, but it fails to compare TOPS diagrams [9].

The resulting distance is calculated by the following formula:

USM(x, y) = max(C(xy) − C(x), C(yx) − C(y)) / max(C(x), C(y))

where:
• C(xy) is the length of the compression of the concatenation of x and y;
• C(x) is the length of the compression of x;
• max(x, y) is the maximum of the values x and y;
• min(x, y) is the minimum of the values x and y.

The USM lies in the interval 0 ≤ USM(x, y) ≤ 1. If USM(x, y) = 0, the files x and y are equal; they differ the most when USM(x, y) = 1.

2) Normalized Compression Distance: The Normalized Compression Distance (NCD) is a mathematical way of measuring the similarity of objects. The similarity is measured with the help of compression, where repeating parts are suppressed by the compressor. It is based on the Normalized Information Distance (NID), which builds on Andrey Kolmogorov's notion of algorithmic complexity. NCD may be used for the comparison of different objects, such as images, music, texts or gene sequences, and also for plagiarism detection and visual data extraction [11]. NCD places a requirement on the compressor: it should meet the condition C(x) = C(xx) within logarithmic bounds [10].

The resulting distance is calculated by the following formula:

NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y))

where:
• C(x) is the length of the compression of x;
• C(xy) is the length of the compression of the concatenation of x and y;
• min{x, y} is the minimum of the values x and y;
• max{x, y} is the maximum of the values x and y.

The NCD lies in the interval 0 ≤ NCD(x, y) ≤ 1. If NCD(x, y) = 0, the files x and y are equal; they differ the most when NCD(x, y) = 1.
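For illustration, both metrics can be computed with any real compressor standing in for C. The sketch below uses Python's zlib as the compressor; this choice and the toy messages are our assumptions, while the paper evaluates its own set of compression algorithms, listed in Section V.

```python
import zlib

def C(data: bytes) -> int:
    """Compressed length; zlib stands in for the compressor C."""
    return len(zlib.compress(data, 9))

def usm(x: bytes, y: bytes) -> float:
    """Universal Similarity Metric, as defined above."""
    return max(C(x + y) - C(x), C(y + x) - C(y)) / max(C(x), C(y))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, as defined above."""
    return (C(x + y) - min(C(x), C(y))) / max(C(x), C(y))

m1 = b"Buy cheap viagra now, limited time offer, click here!"
m2 = b"Buy cheap viagra today, limited time offer, click now!"
m3 = b"Dear colleagues, the meeting agenda is attached below."
print(round(ncd(m1, m2), 3), round(ncd(m1, m3), 3))  # similar pair scores lower
```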
3) Compression-based Dissimilarity Measure: The Compression-based Dissimilarity Measure (CDM) is a response to NCD. It is called a "simpler measure", as it avoids theoretical analysis.

CDM(x, y) = C(xy) / (C(x) + C(y))

The authors of this metric are aware that CDM is nonmetric, failing the identity property. CDM gives values in the range [1/2, 1], where 1/2 indicates pure identity and 1 pure disparity [10].

C. Compression-based Cosine

CosS(x, y) = 1 − (C(x) + C(y) − C(xy)) / sqrt(C(x) C(y))

This measure is normalized to the range [0, 1], with 0 indicating total identity between the files x and y, and 1 total dissimilarity [10].

IV. DOCUMENT SIMILARITY USING PARTICLE SWARM OPTIMIZATION

A. Document classification

Document classification is the process which decides whether a document di ∈ D belongs to the category ci ∈ C according to the knowledge of the correct categories known for a subset DT ⊂ D of training documents, where D is the collection of all documents and C is the collection of all categories. Generally, each document may belong to more than one category and each category may contain more than one document [12].

The document classification task may be solved by humans or by automatic classifiers. Automatic document classifiers need several preprocessing steps which convert the documents into a form suitable for automatic classification. The first task is the preprocessing of the words and the creation of a vector representation of the documents. Each document is parsed and the list of used words with their frequencies is extracted. Each word is compared with a list of stop-words, which are useless in the classification process because they are present in most of the documents and, therefore, bring no information about them. Each word is converted into its canonical form using a normalization algorithm such as the Porter stemmer [13]. When this process is finished, the document collection is represented as a document-term frequency matrix (Doci × TFij), where Doci refers to each document in the collection and TFij refers to the frequency of the term j in the document i. In this representation, only the relation of a term to an individual document is captured, but the classification must be computed across the whole document collection; therefore, we need to evaluate the importance of a term in an individual document according to the importance of the term in the collection (these values will be called weights). One way to define the weights of the terms is according to TF-IDF (term frequency, inverse document frequency), described in [14], [15]. The weight vij of the term j in the document i is computed as:

vij = tij × log(F / fj)

where tij is the number of times the term j appears in the document i, fj is the number of times the term j appears in the entire document database, and F is the number of unique terms in the document collection.

The previous paragraph described the process of converting the document collection into a (document, term-weight) matrix, but the number of terms with non-trivial weight in each document is still large (tens of thousands). This may cause problems for automatic classifiers because of the large number of terms to process. Therefore, feature/term selection may be applied. Several approaches to feature selection have been developed [16], [17]. One popular approach is the entropy weighting scheme [18]. The entropy weight of each term is computed as a multiplication of the local weighting scheme Lij and the global weighting scheme Gj for the document i and the term j. The definition of the schemes is the following:

Lij = 1 + log TFij, if TFij > 0; 0 otherwise

Gj = 1 + (Σ_{k=1}^{N} (TFkj / Fj) log(TFkj / Fj)) / log N

where N is the number of documents in the collection, TFkj refers to the frequency of the term j in the document k, and Fj is the frequency of the term j in the entire document collection. In our work, we choose the terms with the highest entropy value. This may reduce the number of processed terms to several hundreds or less.
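The following Python sketch illustrates the entropy weighting and term selection step just described (the local weight Lij is omitted for brevity). The toy corpus, the helper name and the number of selected terms are our illustrative assumptions.

```python
import math

def entropy_weights(docs):
    """Global entropy weight G_j for each term j in a list of token lists."""
    N = len(docs)                                     # number of documents, N > 1
    tf = [{t: d.count(t) for t in set(d)} for d in docs]
    vocab = set().union(*(set(d) for d in docs))
    F = {t: sum(d.get(t, 0) for d in tf) for t in vocab}  # collection frequency
    G = {}
    for t in vocab:
        s = sum((d[t] / F[t]) * math.log(d[t] / F[t]) for d in tf if t in d)
        G[t] = 1.0 + s / math.log(N)
    return G

docs = [["cheap", "viagra", "offer"],
        ["meeting", "agenda", "offer"],
        ["cheap", "cheap", "offer"]]
G = entropy_weights(docs)
selected = sorted(G, key=G.get, reverse=True)[:2]  # keep highest-entropy terms
print(selected)
```

Terms concentrated in few documents score near 1, while terms spread evenly across the whole collection score near 0, which matches the intent of keeping the most discriminative features.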
B. Particle Swarm Optimization

Particle Swarm Optimization (PSO) is an optimization technique inspired by the behavior of swarms of birds and other animals with collective behavior. It was developed by Kennedy and Eberhart in 1995 [19], [20] and was inspired by various interpretations of the movement of organisms in a bird flock or fish school [19]. The principle of PSO is very simple. The goal is to find the optimal solution of a fitness function defined over the search space. PSO generates a set of particles, where each particle is defined by its position and velocity. The particles travel through the search space and try to find the optimal value of the fitness function. The movement of a particle is influenced by the position of the best value found by the given particle (the local optimum) and the position of the best value over all particles (the global optimum). The influence is realized as a change of the velocity vector of each particle. The velocity v_i^{t+1} of the i-th particle in iteration t + 1 is defined as follows:

v_i^{t+1} = w v_i^t + c1 r1 × (lBest − x_i^t) + c2 r2 × (gBest − x_i^t)

where r1, r2 are randomly generated numbers in the interval <0, 1), c1, c2 are parameters of the algorithm called learning factors, x_i^t is the position of the i-th particle in iteration t, and w is the inertia factor of the algorithm. lBest and gBest are the local and global best solutions found.
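A minimal sketch of this update rule follows, minimizing a toy sphere function instead of maximizing the F1 fitness used later in the paper; the parameter values here are placeholders rather than the settings reported in Section V.

```python
import random

def fitness(x):
    """Toy objective to minimize; the paper maximizes F1 instead."""
    return sum(v * v for v in x)

def pso_step(particles, gbest, w=0.7, c1=2.0, c2=2.0):
    """One iteration of the velocity/position update defined above."""
    for p in particles:
        for d in range(len(p["x"])):
            r1, r2 = random.random(), random.random()
            p["v"][d] = (w * p["v"][d]
                         + c1 * r1 * (p["lbest"][d] - p["x"][d])
                         + c2 * r2 * (gbest[d] - p["x"][d]))
            p["x"][d] += p["v"][d]
        if fitness(p["x"]) < fitness(p["lbest"]):
            p["lbest"] = list(p["x"])                 # update local optimum
    return min((p["lbest"] for p in particles), key=fitness)  # global optimum

particles = [{"x": [random.uniform(-5, 5) for _ in range(3)],
              "v": [0.0, 0.0, 0.0]} for _ in range(20)]
for p in particles:
    p["lbest"] = list(p["x"])
gbest = min((p["lbest"] for p in particles), key=fitness)
for _ in range(50):
    gbest = pso_step(particles, gbest)
print([round(v, 3) for v in gbest])   # should approach the optimum at [0, 0, 0]
```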
PSO has been successfully used in many optimization tasks, such as dealing with the equality and inequality constraints of economic dispatch problems with non-smooth cost functions [21], optimization of the reactive power flow in a power system network in order to minimize real power losses and several other applications in electric power systems [22], estimation of non-linear parameters of mathematical models in chemistry [23], solving the no-wait flowshop scheduling problem [24], solving the vehicle routing problem with simultaneous pick-up and delivery [25], and many others.

1) GPU implementation: Implementing PSO on a GPU is a very natural step due to its implicitly parallel nature. Usually, PSO algorithms use tens to hundreds of particles, but the most expensive part of the algorithm is usually the computation of the fitness value; therefore, the main task is the implementation of the evaluation of a particle's position on the GPU. In the past, several works focused on PSO and GPUs were published. Mussi et al. [26] use PSO implemented on the CUDA architecture for road sign detection in Advanced Driver Assistance Systems. Rymut and Kwolek [27] use PSO on a GPU in cooperation with an Adaptive Appearance Model for object tracking. Zhang et al. [28] use PSO for terrain simplification. The following section describes our implementation of PSO for document classification.

C. Algorithm

As was mentioned above, we are solving the document classification problem using Particle Swarm Optimization. First of all, we must define the classification problem in terms of PSO. In our approach, as well as in [15], each particle represents a vector which best describes an individual category. Each particle (a vector describing a category) is compared with each vector in the training collection during each iteration. The similarity between vectors is computed using the standard Euclidean distance. When all comparisons are done, each particle updates its best solution if necessary, and the global maximum is updated if necessary.

The definition of the fitness function is the main task in every evolutionary algorithm. Among the most popular metrics in document classification are precision and recall. Precision (Pr) is defined as the probability that a selected document is classified correctly, and recall (Re) is defined as the probability that a randomly selected document is assigned to the correct category [12]. The mathematical definitions are as follows:

Pr = TP / (TP + FP)

Re = TP / (TP + FN)

where TP (true positives) is the count of correctly classified documents, FP (false positives) is the count of documents incorrectly assigned to a category, and FN (false negatives) is the count of documents incorrectly not assigned to a category.

The final fitness function F1 is a combination of precision and recall and is defined according to [29] as

F1 = (2 × Pr × Re) / (Pr + Re)

The fitness function F1 works well when only one category is used. In our case, we need to define classification vectors for more than one category (according to the document collections); therefore, we need to generalize the fitness function to more categories. The solution is averaging the precision and recall over all categories. Two basic averaging approaches exist according to [12]: micro- and macro-averaging. The macro-average is defined as the arithmetic mean of precision and recall and, therefore, favors categories with a low number of documents. The micro-average is a proportional average according to the number of documents, and therefore favors categories with a higher number of documents.
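A sketch of how such a fitness evaluation could look in the multi-category case, macro-averaging per-category F1 as described above; the document ids, labels and counts are our illustrative assumptions.

```python
def macro_f1(assigned, truth, categories):
    """Arithmetic mean of per-category F1 (macro-averaging).

    assigned and truth map a document id to a category label."""
    scores = []
    for c in categories:
        tp = sum(1 for d in truth if assigned[d] == c and truth[d] == c)
        fp = sum(1 for d in truth if assigned[d] == c and truth[d] != c)
        fn = sum(1 for d in truth if assigned[d] != c and truth[d] == c)
        pr = tp / (tp + fp) if tp + fp else 0.0
        re = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * pr * re / (pr + re) if pr + re else 0.0)
    return sum(scores) / len(scores)

truth = {1: "spam", 2: "spam", 3: "ham", 4: "ham"}
assigned = {1: "spam", 2: "ham", 3: "ham", 4: "ham"}
print(round(macro_f1(assigned, truth, ["spam", "ham"]), 3))  # 0.733
```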
1) GPU Classification using PSO: The basic algorithm for the classification of documents using PSO was defined in the previous paragraphs; the precision of the classification is presented in Section V. When we analyze the process of searching for the optimal classification vectors, we see that the most time consuming part is the comparison of a category vector with all documents in the collection. Therefore, we need to optimize this part of the algorithm. We chose GPU units for this task, because a GPU brings massive parallelism for a reasonable price.

We define two approaches for this part of the algorithm. Both versions have a common goal: to compute the similarity between all category vectors and document vectors. Let C be the set of M categories and D a collection of N documents. The size of the comparison matrix is M × N, and the dimension of all vectors is D.

Figure 1. Visualization of the vector-comparison computation

Figure 2. Histogram of the NCD metric with the tested compression algorithms

Figure 3. Histogram of the USM metric with the tested compression algorithms
In our approach, each kernel computes the comparison of one particular vector c ∈ C with k vectors from the collection D. Therefore, we need to run (M × N) / k threads, because each kernel computes exactly k comparisons. The situation is depicted in Fig. 1.
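As a CPU-side sketch of this comparison step, the whole M × N Euclidean distance matrix between category vectors and document vectors can be computed as below. numpy is our illustrative stand-in; the paper implements the step as CUDA kernels, each handling k comparisons.

```python
import numpy as np

def distance_matrix(categories, documents):
    """All-pairs Euclidean distances; rows are the M category vectors,
    columns the N document vectors, all of dimension D."""
    diff = categories[:, np.newaxis, :] - documents[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))          # shape (M, N)

rng = np.random.default_rng(0)
cats = rng.random((4, 300))    # M = 4 category vectors
docs = rng.random((10, 300))   # N = 10 document vectors
print(distance_matrix(cats, docs).shape)             # (4, 10)
```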
V. EXPERIMENTS

A. Experiment with the Bayesian spam detector and data compression

The testing dataset contains 48 360 spam e-mails and 36 450 ham e-mails. The dataset was taken from the Text REtrieval Conference (TREC) organized in 2005, co-sponsored by the National Institute of Standards and Technology (NIST) and the U.S. Department of Defense.

A k-fold cross-validation was executed on this test e-mail database (TREC 2005). The spam and ham e-mails were split into 10 subsets. Nine spam and ham groups were marked as training data and the one remaining group was used for testing. Each part included 43 524 spam and 32 805 ham e-mails for training the BSF, and 4 836 spam and 3 645 ham e-mails were used for testing. The results of the k-fold cross-validation are listed in Table I.
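A minimal sketch of this splitting protocol; the group sizes reproduce the counts above, while the helper name and placeholder e-mails are our illustrative assumptions.

```python
def k_fold_splits(emails, k=10):
    """Yield (train, test) pairs: each of the k groups serves once as the
    test set while the remaining k - 1 groups form the training set."""
    groups = [emails[i::k] for i in range(k)]
    for i in range(k):
        train = [e for j, g in enumerate(groups) if j != i for e in g]
        yield train, groups[i]

spam = [f"spam-{i}" for i in range(48360)]
ham = [f"ham-{i}" for i in range(36450)]
for (sp_tr, sp_te), (ha_tr, ha_te) in zip(k_fold_splits(spam), k_fold_splits(ham)):
    # train the filter on sp_tr + ha_tr, evaluate on sp_te + ha_te
    pass
print(len(sp_tr), len(ha_tr), len(sp_te), len(ha_te))  # 43524 32805 4836 3645
```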
In our experiments, we used several classical compression algorithms. The first was the Lempel-Ziv-Welch (LZW) algorithm, developed in 1984. The second was Adaptive Huffman Encoding (AHUFF). The third and fourth algorithms were based on the Burrows-Wheeler Transform, with Adaptive Huffman Encoding (BWT-AHUFF) and with Adaptive Huffman and Fibonacci encoding (BWT-AHUFF-FIB). The remaining algorithms are based on the Lempel-Ziv 77 (LZ77) and Lempel-Ziv-Storer-Szymanski (LZSS) algorithms; the differences are in the final encoding, which was either direct bit encoding (no final encoding at all) or Adaptive Huffman encoding. All compression algorithms are listed below.

1) LZW
2) Adaptive Huffman
3) Burrows-Wheeler with Adaptive Huffman Encoding
4) Burrows-Wheeler with Adaptive Huffman Encoding, RLE and Fibonacci Encoding
5) LZ77, DirectBit
6) LZSS lazy, DirectBit
7) LZ77, Huffman
8) LZSS, Huffman
9) LZSS lazy, Huffman

B. Experiments with the PSO based algorithm

In our work, we set up the PSO based document clustering algorithm in the following way. The preprocessing of the e-mails was done in the same way as in [15]. Numbers and words shorter than 3 characters were removed from the input texts. The number of features was set to 300.
Table II. Results of the PSO based document clustering algorithm

                 HAM                        SPAM
Fold   Precision  Recall  F1      Precision  Recall  F1
1      0.4901     0.9855  0.6546  0.5897     0.9050  0.7141
2      0.4998     0.9767  0.6612  0.5884     0.8965  0.7105
3      0.5069     0.9822  0.6687  0.5807     0.9038  0.7071
4      0.4983     0.9837  0.6615  0.5838     0.9082  0.7108
5      0.4874     0.9798  0.6509  0.5895     0.9002  0.7125
6      0.4928     0.9830  0.6565  0.5895     0.9034  0.7134
7      0.4947     0.9807  0.6576  0.5806     0.9042  0.7071
8      0.4935     0.9824  0.6570  0.5800     0.9026  0.7062
9      0.4909     0.9812  0.6544  0.5943     0.9063  0.7179
10     0.4906     0.9836  0.6547  0.5961     0.9074  0.7196
Avg    0.4944     0.9818  0.6577  0.5872     0.9037  0.7119

Figure 4. Histogram of the CDM metric with the tested compression algorithms

The PSO used 1000 particles, 15 iterations, and the maximum velocity was set to 25. The coefficient k was set to 0.99, c1 to 2.0 and c2 to 2.0. The GPU used 256 threads per block. The experiments were performed on a TESLA 2050 GPU with 448 CUDA cores.

The results of the PSO based document clustering algorithm are shown in Table II. As may be seen, the average precision of HAM detection is 49%, but the F1 score is 65%. The F1 score of SPAM detection is 71%. The achieved speed-up on one GPU was 3.

Figure 5. Histogram of the CosS metric with the tested compression algorithms

VI. CONCLUSION

In this paper, we presented two algorithms for spam detection. The first algorithm was based on the Bayesian spam filter, which is very common for this purpose, but we improved the precision of the spam detection using compression algorithms and the defined metrics. We achieved a precision of more than 99% for ham detection, and the spam detection precision ranges from 66% to 90%. The second algorithm was based on a document classification algorithm using Particle Swarm Optimization. This algorithm achieved a precision of around 60% for spam and 50% for ham. The results of the second algorithm are too poor for practical deployment, but tuning of the used parameters may improve the efficiency of the algorithm.

ACKNOWLEDGMENT

This work was partially supported by the Grant Agency of the Czech Republic under grant no. P202/11/P142, by SGS, VSB-Technical University of Ostrava, Czech Republic, under grant no. SP2012/58, by the European Regional Development Fund in the IT4Innovations Centre of Excellence project (CZ.1.05/1.1.00/02.0070), and by the Bio-Inspired Methods: research, development and knowledge transfer project, reg. no. CZ.1.07/2.3.00/20.0073, funded by the Operational Programme Education for Competitiveness, co-financed by the ESF and the state budget of the Czech Republic.

Table I. K-fold cross-validation of the BSF and NCD

Compression method   Spam      Ham       FNR       FPR
0                    87.087%   99.199%   12.913%   0.801%
1                    86.616%   99.237%   13.384%   0.763%
2                    90.139%   99.098%    9.861%   0.902%
3                    90.087%   99.106%    9.913%   0.894%
4                    65.765%   99.443%   34.235%   6.527%
5                    68.552%   99.314%   31.448%   0.686%
6                    66.518%   99.320%   33.482%   0.680%
7                    76.633%   99.257%   23.367%   0.746%
8                    77.468%   99.180%   22.532%   0.820%

REFERENCES

[1] P. Wolfe, C. Scott, and M. Erwin, Anti-Spam Tool Kit. McGraw-Hill Osborne Media, March 2004.

[2] A. Khorsi, "An overview of content-based spam filtering techniques," Informatica (Slovenia), vol. 31, no. 3, pp. 269–277, 2007.

[3] Y. Song, A. Kolcz, and C. L. Giles, "Better naive bayes classification for high-precision spam detection," Softw. Pract. Exper., vol. 39, pp. 1003–1024, August 2009. [Online]. Available: http://dl.acm.org/citation.cfm?id=1568514.1568517
[4] B. C. Dhinakaran, D. Nagamalai, and J.-K. Lee, "Bayesian approach based comment spam defending tool," in Proceedings of the 3rd International Conference and Workshops on Advances in Information Security and Assurance, ser. ISA '09. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 578–587. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-02617-1_59

[5] Y. Begriche and A. Serhrouchni, "Bayesian statistical analysis for spams," in Local Computer Networks (LCN), 2010 IEEE 35th Conference on, Oct. 2010, pp. 989–992.

[6] T. Almeida, A. Yamakami, and J. Almeida, "Evaluation of approaches for dimensionality reduction applied with naive bayes anti-spam filters," in Machine Learning and Applications, 2009. ICMLA '09. International Conference on, Dec. 2009, pp. 517–522.

[7] M. Prilepok, J. Platos, V. Snasel, and E. El-Qawasmeh, "The bayesian spam filter with ncd," in DATESO, ser. CEUR Workshop Proceedings, J. Pokorný, V. Snásel, and K. Richta, Eds., vol. 837. CEUR-WS.org, 2012, pp. 60–68.

[8] M. Li, X. Chen, X. Li, B. Ma, and P. M. B. Vitányi, "The similarity metric," IEEE Transactions on Information Theory, vol. 50, no. 12, pp. 3250–3264, 2004.

[9] J. Rocha, F. Rosselló, and J. Segura, "Compression ratios based on the universal similarity metric still yield protein distances far from cath distances," CoRR, vol. abs/q-bio/0603007, 2006.

[10] D. Sculley and C. E. Brodley, "Compression and machine learning: A new perspective on feature space vectors," in DCC, 2006, pp. 332–332.

[11] P. M. B. Vitányi, "Universal similarity," CoRR, vol. abs/cs/0504089, p. 5, 2005.

[12] F. Sebastiani, "Machine learning in automated text categorization," ACM Comput. Surv., vol. 34, no. 1, pp. 1–47, Mar. 2002. [Online]. Available: http://doi.acm.org/10.1145/505282.505283

[13] M. F. Porter, "An algorithm for suffix stripping," in Readings in Information Retrieval, K. Sparck Jones and P. Willett, Eds. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1997, pp. 313–316. [Online]. Available: http://dl.acm.org/citation.cfm?id=275537.275705

[14] V. P. Guerrero Bote, F. de Moya Anegón, and V. H. Solana, "Document organization using kohonen's algorithm," Inf. Process. Manage., vol. 38, no. 1, pp. 79–89, Jan. 2002. [Online]. Available: http://dx.doi.org/10.1016/S0306-4573(00)00066-2

[15] Z. Wang, Q. Zhang, and D. Zhang, "A pso-based web document classification algorithm," in Proceedings of the Eighth ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing - Volume 03, ser. SNPD '07. Washington, DC, USA: IEEE Computer Society, 2007, pp. 659–664. [Online]. Available: http://dx.doi.org/10.1109/SNPD.2007.84

[16] Y. Saeys, I. Inza, and P. Larrañaga, "A review of feature selection techniques in bioinformatics," Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007. [Online]. Available: http://bioinformatics.oxfordjournals.org/content/23/19/2507.abstract

[17] Y. Yang and J. Pedersen, Feature selection in statistical learning of text categorization. Morgan Kaufmann, 1997, pp. 412–420.

[18] S. Dumais, "Improving the retrieval of information from external sources," Behavior Research Methods, vol. 23, pp. 229–236, 1991. [Online]. Available: http://dx.doi.org/10.3758/BF03203370

[19] J. Kennedy and R. Eberhart, "Particle swarm optimization," in Neural Networks, 1995. Proceedings., IEEE International Conference on, vol. 4, Nov./Dec. 1995, pp. 1942–1948.

[20] R. Eberhart and J. Kennedy, "A new optimizer using particle swarm theory," in Micro Machine and Human Science, 1995. MHS '95., Proceedings of the Sixth International Symposium on, Oct. 1995, pp. 39–43.

[21] J.-B. Park, K.-S. Lee, J.-R. Shin, and K. Lee, "A particle swarm optimization for economic dispatch with nonsmooth cost functions," Power Systems, IEEE Transactions on, vol. 20, no. 1, pp. 34–42, Feb. 2005.

[22] M. AlRashidi and M. El-Hawary, "A survey of particle swarm optimization applications in electric power systems," Evolutionary Computation, IEEE Transactions on, vol. 13, no. 4, pp. 913–918, Aug. 2009.

[23] M. Schwaab, E. C. Biscaia, Jr., J. L. Monteiro, and J. C. Pinto, "Nonlinear parameter estimation through particle swarm optimization," Chemical Engineering Science, vol. 63, no. 6, pp. 1542–1552, 2008. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0009250907008755

[24] Q.-K. Pan, M. F. Tasgetiren, and Y.-C. Liang, "A discrete particle swarm optimization algorithm for the no-wait flowshop scheduling problem," Computers & Operations Research, vol. 35, no. 9, pp. 2807–2839, 2008, Part Special Issue: Bio-inspired Methods in Combinatorial Optimization. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0305054806003170

[25] T. J. Ai and V. Kachitvichyanukul, "A particle swarm optimization for the vehicle routing problem with simultaneous pickup and delivery," Computers & Operations Research, vol. 36, no. 5, pp. 1693–1702, 2009, Selected papers presented at the Tenth International Symposium on Locational Decisions (ISOLDE X). [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0305054808000774

[26] L. Mussi, S. Cagnoni, and F. Daolio, "Gpu-based road sign detection using particle swarm optimization," in Intelligent Systems Design and Applications, 2009. ISDA '09. Ninth International Conference on, Nov.–Dec. 2009, pp. 152–157.

[27] B. Rymut and B. Kwolek, "Gpu-supported object tracking using adaptive appearance models and particle swarm optimization," in Computer Vision and Graphics, ser. Lecture Notes in Computer Science, L. Bolc, R. Tadeusiewicz, L. Chmielewski, and K. Wojciechowski, Eds. Springer Berlin / Heidelberg, 2010, vol. 6375, pp. 227–234. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-15907-7_28

[28] H. Zhang, J. Sun, J. Liu, and N. Lv, "A new simplification method for terrain model using discrete particle swarm optimization," in Proceedings of the 15th annual ACM international symposium on Advances in geographic information systems, ser. GIS '07. New York, NY, USA: ACM, 2007, pp. 67:1–67:4. [Online]. Available: http://doi.acm.org/10.1145/1341012.1341091

[29] Y. Yang, "An evaluation of statistical approaches to text categorization," Information Retrieval, vol. 1, pp. 69–90, 1999. [Online]. Available: http://dx.doi.org/10.1023/A:1009982220290
