Automatic Malware Signature Generation

Karin Ask

October 16, 2006

Abstract
The times when malware researchers could spend weeks analyzing a new piece of malware are long gone. Today new malicious programs are written and distributed at such speed that this kind of analysis is simply not possible. Virus scanners are the most common countermeasure against malware attacks, and they need up-to-date signatures to successfully identify malware. This thesis describes Autosig, a program for the automatic generation of malware signatures.
The generation of signatures is based on the fact that most malware come in many
different variants, but still share some invariant code. Using statistical data on how
often certain byte combinations appear in legitimate files, Autosig extracts a substring
from this invariant code to generate a signature. The signatures are tested and those
that fail to pass all tests are discarded. By remembering all discarded signatures, Au-
tosig learns which code to avoid.
This technique has turned out to be successful in many of the time-consuming routine cases, leaving the human analysts more time to generate working signatures for more complicated malware. It has also proved helpful in replacing overlapping and redundant signatures, leading to a smaller signature database.
Acknowledgements
This work has been carried out at Ikarus Software GmbH and I want to thank them for
giving me the opportunity to work on such an interesting subject.
I also want to thank Christopher Krügel and Engin Kirda at TU Wien for getting me
interested in Internet security in the first place and for their help and advice throughout
my work on this thesis.
Christian Schulte at KTH Stockholm has been very helpful in giving me feedback
and valuable comments on my writing.
Moreover my parents Ulla & Gösta Ask deserve a big thank you for their constant
help and support, not just during the writing of this thesis but throughout my entire
studies.
Contents

1 Introduction
1.1 Goals
1.2 Structure of this thesis
1.3 Glossary

2 Malware
2.1 A Brief History
2.2 Classification of Malware
2.2.1 Malware Environments
2.3 Malware Naming
2.4 Detecting Malware
2.4.1 Virus Scanners
2.4.2 Behavior Blockers
2.4.3 Intrusion Detection Systems
2.5 Distribution of Malware
2.5.1 Malware Propagation

3 Signatures
3.1 Basic Properties of Signatures
3.1.1 Generic Signatures
3.1.2 IDS Signatures
3.2 Signature Structure
3.3 Existing Workflow at Ikarus Software GmbH
3.3.1 Incoming Malware
3.3.2 Signature Generation
3.3.3 Tests
4 Design of Autosig
4.1 Analyzing Malware
4.2 Preparation Steps
4.3 Finding Invariant Code
4.4 Finding Code Suitable for Signature Generation
4.4.1 Negative Signatures
4.4.2 Bad Starting Bytes

5 Implementation of Autosig
5.1 External Dependencies
5.2 Find Invariant Code
5.2.1 Algorithm
5.2.2 Sorting List of Common Code
5.3 Find Good Signature Code
5.3.1 Bad Starting Bytes
5.3.2 Negative Signatures
5.4 Generate Signature
5.5 Tests

6 Test Results
6.1 Daily Tests
6.2 Database Tests
6.3 Case Study: W32/Sobig
6.4 Case Study: W32/Polip
6.5 Evaluation of Test Results

7 Conclusion
7.1 Future Work
Chapter 1
Introduction
In 1984 Dr. Frederick B. Cohen defined a major computer security problem that he called a virus: a program able to attach itself to other programs and cause them to become viruses as well [10]. Today, twenty-two years later, the problem has grown more severe than anyone could have imagined. The popularity of the Internet, a lack of diversity in the software running on Internet-attached hosts and an increasing variety of exploitable vulnerabilities in software have brought with them an explosion in the number of malicious software attacks.
The most common countermeasure against these attacks is the virus scanner, which searches files for signatures: small bytepatterns that correspond to malicious code. Since only already known malware, for which a signature is available, can be detected, this approach requires a fast response to new malware. As long as there is no signature for a certain malware, it will pass through the virus scanner as a non-malware file. The signatures have to be carefully chosen, so that they are not found in innocent files by coincidence. There is always a trade-off between fast generation of signatures and avoiding false positives.
In the best-case scenario, every new malware strain is analyzed and a signature corresponding to its unique function is generated. The problem is that this type of analysis takes time. In January 2006, 2,312 new samples of malware were spotted [28]. Even if this is a new record for the number of malware spotted in one month, it is not hard to see that the analysts in an anti-virus lab do not have enough time to analyze every new malware and generate a working signature for it. Malware writers are well aware of this weakness and have challenged anti-virus software with zero-day, short-span and serial-variant attacks as well as other new propagation techniques.
Motivated by the slow pace of manual signature generation, the goal of this master's thesis has been to develop and implement the tool Autosig, which generates signatures automatically. Autosig identifies invariant code among several instances of the same malware strain and uses statistical data on this code to generate signatures. The relevance of such an approach to automatic signature generation has already been shown by Kephart and Arnold [19].
Much malware today is generated with the help of virus generation kits, or consists of slightly modified copies of already known malware. In practice these malware instances often contain some invariant code. By identifying this code it is possible to generate a signature that finds all variants of the same malware. Such generic signatures help reduce the memory needed to store signatures and speed up the scanning of files.
It is not anticipated that working signatures can be generated automatically for all malware. Much malware evades detection by polymorphism techniques, through which a program may encode and re-encode itself, thus changing its appearance between attacks. Even though there is invariant code in these files too, such as the encoding routine, this type of malware normally needs to be analyzed by human experts. However, by generating some signatures automatically, the human analysts will have more time for these complicated attacks.
1.1 Goals
This master's thesis analyzes the existing signature generation workflow at Ikarus Software GmbH, an Austrian anti-virus company, with its benefits and drawbacks. Based on this analysis, the tool Autosig, which automatically generates signatures, is to be developed. The main question to be answered by this work is whether automatic signature generation is feasible. If so, where are the main benefits to be found? Is it possible to enhance the quality of the signatures generated? Can a faster reaction to new malware be achieved this way?
The goals can be summarized as follows:
• Analyze the existing workflow used by the Ikarus analysis team for manually generating malware signatures.
• Develop a concept for automating the process of generating signatures for malware samples.
• Autosig should try to generate signatures for all malware samples, independent of file type or target machine.
The ultimate goal is then to integrate Autosig with the other tools at Ikarus so that
signatures are generated without any human intervention, when possible. An overview
of such a system is schematically shown in figure 1.1.
[Figure 1.1: Schematic overview of automatic signature generation in the Ikarus domain, from incoming malware samples to the virus scanner at the user's site. SIGDB is the database for storing working signatures. All malware samples are stored in the database VBASE.]
1.2 Structure of this thesis

The second chapter gives an overview of malware in general and of how to deal with malware attacks. It also compares the anti-virus solution at Ikarus Software to these approaches. In the third chapter signatures are discussed in detail and the present signature generation workflow at Ikarus is analyzed. The design and implementation of Autosig are explained in the fourth and fifth chapters. In the sixth chapter some test results are presented. The seventh and last chapter concludes the thesis and outlines future work. In the appendices, log files from running Autosig can be found.
The term "virus" is often used to describe all kinds of malicious software, including those that are more properly classified as worms, trojans or something else. In this thesis the term malware will be used instead, to avoid any misunderstandings.
1.3 Glossary
In this section, important technical terms that are used throughout this thesis are de-
scribed.
Portable Executable (PE)

The Portable Executable (PE) format is the standard format for executable files on modern Windows systems; executable programs are stored in this format. Since most malware is targeted at Windows, PE files are generally the dominant malware format.
Malware

The word malware is a portmanteau of "malicious software" and denotes software designed to infiltrate or damage a computer system without the owner's consent [36]. Malware is a collective name for viruses, worms, trojan horses, spyware, adware and their likes. Commonly the word "virus" is used to describe all of the above, even though actual viruses make up only a small share of existing malware. In 2004 viruses accounted for only about 4 percent of all malware, whereas worms accounted for the majority [23].
The following chapter gives an overview of malware in general. It also addresses
some different ways to detect malware, how to approach malware attacks and describes
the anti-virus solution at Ikarus.
A subsequent development was The Reaper, a ”search and destroy” program which
stalked Creeper across the same network, eliminating it from all its hiding places. It has
been postulated that The Reaper was thus the first anti-virus program [33].
One of the earliest verified reports of a computer virus was issued in 1981. Elk Cloner [15] infected the boot sectors of Apple II diskettes and copied itself to the boot sectors of other diskettes as they were accessed. On every 50th boot, Elk Cloner hooked the reset handler; thus, only pressing reset triggered the payload of the virus. The virus contained a number of payloads: it inverted the screen, caused text to flash and displayed a short poem.
In 1984 Dr. Frederick B. Cohen introduced the term virus based on the recom-
mendation of his advisor Professor Leonard Adleman, who picked the name from sci-
ence fiction novels. Cohen also provided a formal mathematical model for computer
viruses [11].
Reports of the Brain-virus started to appear in newspapers worldwide in 1986. This
boot sector virus, which infected 360K floppy disks, quickly spread worldwide with
reports being received from Asia, the Middle East, Europe and the US. The virus had
been written by two Pakistani brothers who ran a software company and were reportedly annoyed at the extent of software piracy and theft taking place in their country. The virus was an attempt to gauge the extent of such copying.
One of the first mass-mailers was the infamous CHRISTMA EXEC worm [6]. In
December 1987 it caused the first major network attack. The CHRISTMA EXEC was
a chain letter that spread from the University of Clausthal Zellerfeld in Germany onto
Bitnet, via a gateway to the European Academic Research Network (EARN), and then
onto IBM's internal network known as VNet. The CHRISTMA EXEC was written in
the script language REXX introduced by IBM and spread on VM/CMS installations. It
relied on social engineering and since users were happy to follow the instructions in the
script it was able to spread rapidly. When run, it drew a Christmas tree on the screen,
looked around for user IDs on the system and sent a copy of itself to these, using the
SENDFILE command.
The Morris Internet worm [29] was released on 2 November 1988. The worm repli-
cated by exploiting a number of bugs in the Unix operating system on VAX and Sun
Microsystems hardware, including a bug in sendmail (an electronic mailing program)
and in fingerd (a program for obtaining details of who is logged on to the system).
Stanford University, MIT, the University of Maryland and the University of California at Berkeley were infected within five hours of the worm being released. The NASA Ames Research Center and the Lawrence Livermore National Laboratory were also infected.
It is believed that about 6,000 computer systems were infected, about 10 percent of the Internet at that time. The worm consisted of some 4,000 lines of C code and, once it had been analyzed, specialists distributed bug fixes to sendmail and fingerd which prevented the worm from spreading further.
Viruses As defined earlier, a computer virus is code that recursively replicates a possi-
bly evolved copy of itself. Viruses infect a host file or system area, or they simply
modify a reference to such objects to take control and then multiply again to form
new generations.
Worms are typically standalone applications without a host program. They primarily
replicate on networks, usually without any help from a user. However, some
worms also spread as a file-infector virus and infect host programs, which is why
they are often classified as a special subclass of viruses. If the primary target of a
virus is the network, it should be classified as a worm [31].
Trojan Horses (trojans) portray themselves as something other than what they are at
the point of execution. Though it may advertise its activity after launching, this
information is not apparent to the user beforehand. A trojan neither replicates nor
copies itself, but causes damage or compromises the security of the computer. A
trojan must be sent by someone or carried by another program and may arrive in
the form of a joke program or software of some sort. The malicious functionality
of a trojan may be anything undesirable for a computer user, including data de-
struction or compromising a system by providing a means for another computer
to gain access, thus bypassing normal access controls [30].
For example, on UNIX-based systems, intruders often leave a modified version of
”ps” (a tool to display a process list) to hide a particular process ID (PID).
Probably the most famous trojan horse is the AIDS TROJAN DISK [1] that was sent to about 7,000 research organizations on a diskette. When the trojan was introduced to the system, it scrambled the names of files and filled the empty areas of the disk. A recovery solution was offered in exchange for a ransom.
Spyware are programs that have the ability to scan systems or monitor activity and
pass this information on to the attacker. Common information that may be actively or passively gathered includes passwords, log-in details, account numbers, personal information, and individual files or other personal documents. Spyware may
also gather and distribute information related to the user’s computer, applications
running on the computer, Internet browser usage or other computing habits.
Spyware frequently attempts to remain unnoticed, either by actively hiding or by
simply not making its presence on a system known to the user. Spyware can be
downloaded from Web sites (typically in shareware or freeware), email messages,
and instant messengers. Additionally, a user may unknowingly receive and/or
trigger spyware, for example by accepting an End User License Agreement from
a software program linked to the spyware [30].
Downloaders are programs that install a set of other items on the machine under at-
tack. Usually a downloader is sent in email, and when it is executed, it down-
loads malicious content from a website or some other location and then extracts
and runs its content.
Keyloggers are applications that monitor a user’s keystrokes and send this informa-
tion back to the attacker. This can happen via email or to a server controlled by
the attacker. Information commonly collected is email and online banking user-
names, passwords, PINs and credit card numbers. Attackers often use keyloggers
to commit identity theft.
Keyloggers are also sold as legitimate applications, often as monitoring tools for
concerned parents (or suspicious spouses).
Rootkits are components that use stealth to maintain a persistent and undetectable
presence on the machine. Actions performed by a rootkit, such as installation and
any form of code execution, are done without end user consent or knowledge.
Rootkits do not infect machines by themselves like viruses or worms, but rather,
seek to provide an undetectable environment for malicious code to execute. At-
tackers typically leverage vulnerabilities in the target machine, or use social en-
gineering techniques, to manually install rootkits. In some cases rootkits can be
installed automatically upon execution of a virus or worm or simply by browsing
a malicious website [30].
Once installed, a rootkit allows an attacker to perform virtually any function on the system, including remote access and eavesdropping, as well as hiding processes, files, registry keys and communication channels. Rootkits exist for most operating systems.
Hoaxes or Chain letters usually arrive in the form of an email warning about a new computer virus infection and asking the recipient of the message to forward it to others. End users then spread the email hoax to others, "replicating" the message on the Internet by themselves and overloading email systems.
Combining two or more of the categories above can lead to even more powerful
attacks. For example, a worm can contain a payload that installs a backdoor to allow
remote access. When the worm replicates to a new system (via email or other means),
the backdoor is installed on that system, thus providing an attacker with a quick and
easy way to gain access to a large set of hosts.
In The Wild
Another classification of malware, often used by malware researchers and anti-virus companies, is whether the malware has been seen in the wild or not.
In The Wild Viruses The term in the wild was coined by the IBM researcher Dave Chess, who used it to describe computer viruses encountered on production systems [17]. The WildList Organization [37] keeps a list of current malware in the wild, and anti-virus companies are expected to detect 100% of these. Today the term typically denotes malware that has been seen by at least two independent WildList submitters in at least two different regions.
Zoo Viruses Zoo viruses are the opposite of in-the-wild viruses: malware that is rarely reported anywhere in the world, but exists in the collections of researchers. Its prevalence can increase to the point where it is considered to be in the wild. Though zoo viruses seldom pose a threat to the common user, they might prove some concept, making the detection of zoo viruses important for anti-virus software.
[<type>://][<platform>/]<family>[.<group>]
[.<length>].<variant>[<modifiers>][!<comment>]
In practice very little malware requires all name components; practically everything other than the family name is optional. The most commonly used identifiers are explained in the following.
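As an illustration, consider a hypothetical name (not taken from the Ikarus database) for variant F of the Sobig mass-mailer family, a worm for 32-bit Windows:

worm://W32/Sobig.F

Here worm is the type, W32 the platform, Sobig the family and F the variant; all other components are omitted.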
2.4 Detecting Malware

with detected malware. This is not the focus of this master's thesis. Instead, some available methods for detecting malware are described in the following.
2.4.1 Virus Scanners

emulator to identify parts that might be infected or that are typical for malware. To this end the scan engine tries to identify instructions that transfer program control, or program code in sections where it normally does not belong, such as data sections. These sections are called blocks.
The signature database at Ikarus contains a table of known headers. Some file types, such as PDF or RTF, are known to be somewhat "immune" to malware. To keep scan times down, the Ikarus scan engine avoids scanning such files. A PDF file, for example, always starts with the bytes 255044462D ("%PDF-"). If the scanner finds these bytes at offset 0, scanning is terminated. The same goes for the starting bytes 7B5C727466 ("{\rtf") that identify an RTF file. At the time of writing, 108 such known headers are kept in the database.
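A minimal sketch of such a header check (a hypothetical function; the table below shows only the two headers mentioned, while the real database holds 108):

#include <cstddef>
#include <cstring>

// Known "immune" file headers; files starting with one of these at
// offset 0 are skipped by the scanner.
struct KnownHeader { const char *bytes; size_t len; };
static const KnownHeader kHeaders[] = {
    { "%PDF-", 5 },   // 25 50 44 46 2D
    { "{\\rtf", 5 },  // 7B 5C 72 74 66
};

// Returns true if the buffer starts with a known immune header.
bool hasKnownHeader(const unsigned char *buf, size_t size) {
    for (const KnownHeader &h : kHeaders) {
        if (size >= h.len && std::memcmp(buf, h.bytes, h.len) == 0)
            return true;
    }
    return false;
}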
Window of vulnerability
Virus scanners can only find malware they are looking for. One major drawback with
this is that the virus scanners need constant updates. Virus scanners are necessarily a
step behind the latest crop of malware since there is a lot that has to happen before the
scanner has an up-to-date signature database:
• Malware has to be detected somehow to begin with. Since virus scanners will not
detect new malware, it will have some time to spread before someone detects it
by other means.
• The signature must be incorporated into the next release of the signature database.
In the case of retail software, the software must be sent to be packaged, to the distributors, and then on to the retail outlets. Commercial retail software takes so long to get to the shelves that it is almost certainly out of date. Virtually all product makers today provide some way to obtain updates via the Internet in order to speed up the update process.
The window of vulnerability equals the time an anti-virus vendor needs to respond to new malware, that is, how long it takes to generate and distribute a signature for it. There are many metrics by which to measure the efficiency and effectiveness of an anti-virus product; this response time is one commonly used metric.
To compensate for this weakness most virus scanners are also capable of some
heuristic analysis.
Heuristics
The term heuristic comes from the Greek word heuriskein, meaning ”to discover”, and
describes the technique of approaching a problem through past experience, or knowl-
edge.
When a file is scanned with a signature, it is determined to be either a specific malware or not. The result is easy to interpret and signature scanning is quite accurate. Heuristic techniques, on the other hand, work with the probability of a file being infected. A heuristic scanner looks for behaviors and indicators that malware use. These may include code that checks a date, an oversized file, or attempts to access address books. Any one of these indicators could belong to an innocent application, but when a number of such occurrences are combined, they may indicate malicious code. With heuristic scanners there is a thin line between being too stringent, producing a lot of false positives, and too lenient, missing malicious code.
Heuristics can also be used in a reverse manner, looking for properties that rule out certain malware. In some cases it is faster to determine that a file could not be malware than to find evidence that it could. For example, the W32/Simile virus contains not only the payload, but also code to compile, decompile, encrypt and decrypt the virus, resulting in a bigger file. The infection is 32K to 130K in length, so a file of size 25K could not contain that virus.
Heuristic search is supported by the virus scanner at Ikarus, but the results are to this date not presented to its users. In most cases the results are too inexact, and users do not know how to respond to a message stating that "this file might be infected with some unknown virus". This problem is not unique to Ikarus.
If the behavior blocker detects that a program is initiating some behavior that could
be malicious, it can block these behaviors in real-time and/or terminate the responsible
program. This gives it an advantage over virus scanners or heuristics. While there
are literally trillions of different ways to obfuscate the instructions of a virus or worm,
many of which will evade detection by scanners or heuristics, eventually malicious
code must make a request to the operating system. Given that the behavior blocker can
intercept all such requests, it can identify and block malicious actions regardless of how
obfuscated the program logic is.
Watching software as it runs in real-time has some major drawbacks. Since the
malicious code must actually run on the target machine before its behaviors can be
identified, it can cause a great deal of harm before it has been detected and blocked.
For instance, a virus might shuffle a number of seemingly unimportant files around the
hard drive before infecting a file and being blocked. Even though the actual infection
was blocked, the user may be unable to locate her files.
Behavior blockers can also have problems identifying, for example, trojans, since these do not have an easily characterized set of behaviors like viruses or worms. This in turn has led to many false positives in the past, which is one of the reasons why behavior blocking has only had limited success on the market [26].
Other spreading mechanisms have also been observed. In June 2004, for example, the SymbOS/Cabir worm appeared, which ran on Nokia Series 60 phones running the Symbian operating system. It spread via the Bluetooth feature of wireless phones.
However, new malware distribution patterns, such as short-span attacks and serial
variant attacks have been seen. In these cases malware is distributed in ways that en-
able attacks to be executed fully before they can be blocked by signatures. Widespread
adoption of these distribution methods could pose a serious threat to signature-based
protection methods.
Typically, an entire short-span attack is completed within a few hours, sometimes
within as little as 20 minutes. Unlike viral-propagation attacks, which die slowly, short-
span attacks have a rapid buildup, steady distribution rate throughout the attack, and
almost instant dropping off (see Figure 2.2). In many short-span attacks, anti-virus
vendors avoid the trouble of developing a signature, since it will be obsolete by the
time it is released.
A severe short-span attack was W32/Beagle.BQ, which started and finished within
seven hours. Of 20 major anti-virus products tested independently by VirusTotal (a tool
for scanning of suspicious files using several anti-virus engines), 10 did not manage to
produce a signature before the end of the outbreak. 24 hours later, seven anti-virus
vendors still had no signature for it at all.
Serial-variant attacks use the short-span characteristic, but extend it by a cumulative factor. A series of variants of a certain malware, prepared in advance, is launched at closely spaced time intervals. Each of the variants requires a new signature. The overall window of vulnerability of the attack is the cumulative vulnerable time span of the individual variants (see Figure 2.3). Theoretically, if an unlimited number of variants could be released, the window of vulnerability would be extended indefinitely.
Chapter 3

Signatures
False Positives A false positive exists when a test incorrectly reports that it has found a positive result where none really exists [36]. In anti-virus terms this means that malware has been reported in a file which is not infected. Anti-virus software that keeps finding malware in uninfected files will soon lose credibility with its users. This in turn might lead to users ignoring the warning when malware really is found. Worse yet is when the anti-virus software starts trying to remove the supposedly infected file.
An example of what false positives can cause occurred on the 10th of March 2006. On this date McAfee released the file 4715.DAT, which contained a signature identifying a number of common commercial programs as the W95/CTX virus. The list of programs reported as viruses is seven pages long and contains, among others, Microsoft Excel, Macromedia's Flash Player, Adobe Update Manager and Norton Ghost. These files were quarantined, patched and/or deleted, depending on the user's local settings, leaving the programs concerned unable to run.
False Negatives A false negative exists when a test incorrectly reports that a result was not detected when it was really present. Why this is bad hardly needs an explanation: anti-virus software unable to identify known malware obviously has a problem.
Bugs In April 2005 Trend Micro released a signature file which, under certain conditions, caused systems to experience very high CPU utilization [32]. To better identify the W32/Rbot worm family, Trend Micro had enhanced the decompression ability by supporting three new heuristic patterns. Unfortunately this enhancement caused performance issues on certain computer configurations. This incident shows the importance of testing signature files before they are released.
over the network. Signatures that characterize inputs carrying attacks are then automatically generated. The system Polygraph [27] automatically generates signatures for polymorphic worms; Polygraph can generate multiple shorter byte sequences as signatures. All of the above take advantage of the fact that even in the case of polymorphic worms, some invariant code has to be present.
Name
A signature should have a name so that when a match has been found, the user can
be notified which malware has been identified. The names should follow the CARO
naming convention, discussed in section 2.3, as closely as possible. Ikarus uses the same names as Kaspersky Labs [18].
Type
At Ikarus, signatures are classified by the type of infection they represent: boot sector, COM file, PE file, scripts or macros, to mention a few. When a particular file is scanned, only the signatures that pertain to that file type are used. For example, a boot sector signature will not be used to scan a macro file. This helps keep scan times down.
Table 3.1 shows the different signature types used at Ikarus. SIG_ALL is an additional type that tells the scanner to search all possible files for the signature in question. SIG_ALL is allowed for testing purposes, but should not be used in production, since it slows down scanning. Multiple types are allowed for files where more than one type might fit.
Bytepattern
The most essential part of a signature is the bytepattern as it is the actual code sequence
identifying the malware in question.
Wildcards Some virus scanners allow wildcards in their signatures. Wildcards allow the scanner to detect malicious code even when it is padded with junk code. Normally wildcards let the scanner skip bytes or ranges of bytes.
Mismatches Mismatches allow N bytes in the signature to take any value, regardless of their position. Mismatches are especially useful in generic signatures, but they make scanning rather slow.
At Ikarus, wildcards are allowed, identified with a > (jump X bytes forward from
this position) or < (jump X bytes backward) followed by the number of bytes to be
skipped. The } and { identifiers tell the scanner to skip an unknown number of bytes
until the pattern after the identifier is found. No mismatches are allowed. The bytepat-
tern part of the signature may also contain one or more checksums marked with a #
identifier.
A signature for the trojan downloader W32/Banload.aap could for example have
the following bytepattern:
7BC231D9898B3B2ED20CC0530803530C3BC27514D53126D8022F8B4367
412C00430C014604EB15033B59B7B685750DB7 >33 #0x78DD9178,144
This signature tells the scanner to match the 48 bytes starting with 7BC2 exactly. In this case 7BC2 will be searched for initially, and only when these 2 bytes are found will a complete match be attempted. The following 33 bytes are skipped, and the checksum over the 144 consecutive bytes after that must equal 0x78DD9178.
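A sketch of how such a pattern could be matched is shown below. The checksum function used by the Ikarus engine is not disclosed, so a plain CRC32 stands in for it, and all names are illustrative:

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Bitwise CRC32 (polynomial 0xEDB88320), standing in for the
// actual checksum function used by the scan engine.
uint32_t crc32(const uint8_t *d, size_t n) {
    uint32_t c = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; ++i) {
        c ^= d[i];
        for (int k = 0; k < 8; ++k)
            c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1u)));
    }
    return ~c;
}

// Matches a pattern of the form "<lead bytes> >skip #checksum,count":
// the lead bytes must match exactly, then `skip` bytes are jumped over
// and the checksum of the next `count` bytes must equal the stored value.
bool matchSignature(const uint8_t *file, size_t size, size_t pos,
                    const std::vector<uint8_t> &lead,
                    size_t skip, uint32_t checksum, size_t count) {
    if (pos + lead.size() + skip + count > size)
        return false;
    if (std::memcmp(file + pos, lead.data(), lead.size()) != 0)
        return false;
    const uint8_t *p = file + pos + lead.size() + skip;  // the '>33' jump
    return crc32(p, count) == checksum;                  // '#0x78DD9178,144'
}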
The following bytepattern is generated from the virus W32/Polip.a:
In this case when the first four bytes have been found the scanner is told to jump
forward until it finds the byte sequence starting with 4244 and ending with EC04. If
this sequence is found it should jump forward again until the byte sequence starting
with 1015 is found, and so on.
Signature Size
An important property, which has to be carefully chosen, is the size of a signature. The
longer the byte sequence, the smaller the probability of false positives [25]. The shorter
the signature, the more likely it is to be found in some perfectly legitimate code just
by coincidence. However a very long signature will slow down scanning of files, since
more bytes have to be matched. It also wastes memory.
One way to get around this, implemented at Ikarus, is using checksums. In this case
a relatively small byte sequence is chosen and then a checksum over a bigger sequence
is calculated and added to the signature. Only if the scanner finds the byte sequence in
a file will it also calculate the checksum. This reduces the false positive probability, but
keeps the signature database small. It does not reduce performance as the checksum is
only calculated when the leading bytepattern has been found in a file.
The maximum total size of a signature at Ikarus is 500 bytes, but most signatures are smaller than that.
If common parts are found between the files, this code should be used for signature
generation. A suitable byte sequence then has to be identified within the common code.
As there is seldom enough time to analyze what the code actually does, the analysts use
their experience to this end. First of all, actual program code is identified, since data
sections are seldom suitable for signature generation. This is done by finding code
that looks random to the human eye. It is then made sure that the first 2 bytes of
the signature are as unique as possible. Starting bytes that should be avoided are for
example 0x0000 or 0x9090, since they appear very often in uninfected files. A byte
sequence of about 48 bytes is then extracted. If more common code is present, a CRC32 checksum is calculated over this code and appended to the signature.
3.3.3 Tests
Each signature generated has to be tested before it can be distributed to the users of the Ikarus anti-virus product. For this purpose the newly generated signature is saved in a temporary signature database. First a false negatives test is performed by scanning the samples from which the signature was generated with this temporary signature database. If the signature finds the malware it is supposed to find, it has to be checked for false positives. Ikarus has a large collection of uninfected files for this purpose 1 . A scan of these files is initiated and, of course, no malware should be found using the current signatures. Since this test takes about half an hour, the signatures are not tested one by one. Instead the test is run when about 20 signatures have been generated.
At the time of writing, about 10 percent of the manually generated signatures fail to pass the false positives test and have to be redone.
If a signature passes both these tests it is added to the signature database for distribution to end users. Updates are distributed according to need. Normally there will be at least one update a day, but during a malware outbreak more updates are normally needed.
1 At the time of writing this collection has reached about 30 GB.
Chapter 4

Design of Autosig
The starting point for designing Autosig was to imitate the way human analysts gen-
erate signatures. During implementation some features were discovered that Autosig
can use to improve its ability to generate signatures.
Autosig is primarily designed to be run over the collection of malware at Ikarus.
When properly integrated with the other tools used by the human analyst team at
Ikarus Software GmbH it can also be used in their daily work.
In this chapter the design principles of Autosig are explained.
Analyzing Malware
The most common approach when analyzing malware is reverse engineering. With this approach the function of the malware in question is analyzed and the malware-specific code can be identified. This code, or parts of it, can then be used to generate a working signature. For automatic signature generation, however, reverse engineering is not practicable. To automatically analyze a reverse-engineered file, a set of rules would have to be implemented stating which routines are typical for malware; code segments including these routines would then be characterized as malicious. This might work temporarily, but it is highly unlikely that rules suitable for the malware of today will also be valid for the malware of tomorrow.
The scan engine at Ikarus only scans between certain addresses in a file, which further complicates this kind of analysis. A disassembler or decompiler can never return an exact copy of the program that, after compilation, produced the file being analyzed. This means that the addresses in the original file differ from those in the reverse-engineered one. Since the scan engine does not scan whole files, generating a signature that lies outside the scanned address span will cause false negatives. Thus, this kind of analysis should be left to the human experts.
Having said this, one might wonder how Autosig finds the code segments from
which signatures are generated. Basically this is done by finding invariant code across
two or more malware samples. Based on statistical data, the code segments most suit-
able for signatures are chosen from the invariant code. Then the same trial-and-error principle as used by the human analysts is applied.
Since the malware samples being compared are all of the same family, we can assume that some similarities exist. When this assumption holds, it is only necessary to compare two files with each other and remember the common code between them. In the next step, this common code is searched for in the rest of the files. Finding a byte sequence in a file is significantly faster than comparing two whole files. Only in the worst-case scenario, when no file has anything in common with the others, does every file have to be compared with every other file.
If no invariant code is found across the malware samples, no signature is to be
generated.
Multiple Signatures
It can be the case that it is not possible to generate one signature that finds all of the
malware samples in question. Some code suitable for signatures might be found in half
of the samples and some other code in the other half. In this case two signatures are
generated by Autosig.
When writing the routine for finding invariant code, some special cases had to be con-
sidered.
An exe packer is a compression tool for executable files. Exe packers allow the com-
pressed files to be executed without prior decompression. This is done by compressing an executable file and prepending a decompression stub, which is responsible for decompressing the executable and initiating execution. The decompression stub is a
standalone executable, making packed and unpacked executables indistinguishable to
users. No additional steps are required to start execution [36]. Some of the reasons
for using exe packers are to reduce storage requirements, to decrease load times across
networks and to decrease download times from the Internet. Exe packers are also often
used by malware writers to prevent direct disassembly. Although it does not elimi-
nate the chance of reverse engineering, exe packers can make the process more costly.
Commonly used exe packers are ASPack, UPX and FSG.
Obviously, when comparing a compressed file with an original version of the same file, no common code will be found. This, however, is not a problem we have to worry about. When generating blocks from the original file, the T3 scan engine analyzes the file in a simulated environment: the files are unpacked, the interesting code segments are extracted and written to block files, and these block files are later used to find common code.
DOS stubs
At the beginning of every PE file there is a header containing a DOS stub [12]. This is just a minimal DOS EXE program that displays an error message (usually "this program cannot be run in DOS mode" or "this program must be run under Win32", as shown in figures 4.1 and 4.2). The DOS stub remains for compatibility with 16-bit Windows systems.
Since this header is present at the beginning of the file, some DOS viruses can infect PE images correctly at their DOS stub. Because of this, a virus scanner has to scan this part of a PE file.
Consider the trojan horse Win32/VB.alc with an original size of 139,264 bytes. One of its samples is divided into two blocks by the scan engine at Ikarus: one 112-byte block consisting of the DOS header and one 32,768-byte block corresponding to some other code segment. In fact, when generating block files from any Win32/VB.alc sample, the result will always include this 112-byte block. When comparing these blocks with one another, the encouraging result is that they are exactly the same. However, using this code as a signature would cause a flood of false positives, since it is found in almost every PE file. Therefore it is not necessary to include these blocks in the search for common code, and before starting the comparison of files Autosig removes block files corresponding to DOS stubs.
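The thesis does not spell out how these blocks are recognized; a plausible sketch, assuming a stub block starts with the DOS "MZ" magic and contains one of the standard stub messages, could look as follows:

#include <cstddef>
#include <string>

// Heuristic: treat a block as a DOS stub if it starts with the "MZ"
// magic and contains one of the standard stub messages. Such blocks
// are excluded from the search for common code.
bool looksLikeDosStub(const unsigned char *block, size_t size) {
    if (size < 2 || block[0] != 'M' || block[1] != 'Z')
        return false;
    std::string s(reinterpret_cast<const char *>(block), size);
    return s.find("This program cannot be run in DOS mode") != std::string::npos
        || s.find("This program must be run under Win32") != std::string::npos;
}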
Broken Samples
Incoming malware is sometimes corrupted on its way to Ikarus. These samples are not to be used for signature generation, as they are unlikely to contain meaningful code; instead they would only slow down the search for invariant code.
Broken samples normally differ considerably in file size from the other samples of the same malware strain. Autosig uses this relatively naive criterion: files whose size is much bigger or much smaller than the average of the files to be compared are removed before the comparison.
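A minimal sketch of this filter (the deviation factor is an assumption; the thesis does not state the exact cut-off):

#include <cstddef>
#include <string>
#include <vector>

struct Sample { std::string path; size_t size; };

// Drop samples whose size deviates from the mean by more than `factor`;
// e.g. factor = 2.0 keeps sizes between mean/2 and mean*2.
std::vector<Sample> dropBrokenSamples(const std::vector<Sample> &in,
                                      double factor = 2.0) {
    if (in.empty()) return {};
    double mean = 0.0;
    for (const Sample &s : in) mean += static_cast<double>(s.size);
    mean /= static_cast<double>(in.size());

    std::vector<Sample> out;
    for (const Sample &s : in)
        if (s.size >= mean / factor && s.size <= mean * factor)
            out.push_back(s);
    return out;
}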
These statistics are produced by scanning a big collection of uninfected files with the current signature database. It should be noted that the resulting statistics are correlated with the signature database. The values shown in table 4.1 are not the actual counts of how often a particular 2-byte combination occurs, but weighted values. As an example, the bytes 0x0000 appear extremely often in every file, but are not represented in these statistics, because no signature starting with this value is allowed.
When designing Autosig, the choice had to be made whether these 2-byte combinations should be hard-coded into the program or loaded dynamically at run-time. Since new operating systems or programs could alter these statistics, the choice fell on dynamic loading. Whenever new files are added to the collection of uninfected files, the user can choose to renew the statistics. Some values not present in these statistics, such as 0x0000, have to be hard-coded into the program.
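A sketch of how such statistics could be loaded at run-time is given below. The file format and all names are assumptions; the thesis does not specify them:

#include <array>
#include <cstdint>
#include <fstream>
#include <string>

// Weighted hit counts for all 65,536 possible 2-byte combinations,
// indexed by (first byte << 8) | second byte.
using ByteStats = std::array<uint32_t, 65536>;

// Assumed format: 65,536 little-endian 32-bit counters in one flat file.
bool loadByteStats(const std::string &path, ByteStats &stats) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    in.read(reinterpret_cast<char *>(stats.data()),
            stats.size() * sizeof(uint32_t));
    // Combinations such as 0x0000 are never allowed, so they are pinned
    // to a sentinel value regardless of the file contents.
    stats[0x0000] = UINT32_MAX;
    return static_cast<bool>(in);
}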
Chapter 5
Implementation of Autosig
[Figure: Flow of Autosig — samples are compared; when a match is found, a signature is generated and tested; if a test fails, generation is retried; finally the temporary directory and all its files are removed and the program exits.]
5.2.1 Algorithm
Autosig uses a combination of hashtables and a slightly modified version of the well-known Knuth-Morris-Pratt (KMP) string search algorithm to find invariant code between files.
The simplest and least efficient way to find a pattern P within a text S is to check each position, byte by byte. This approach has a worst-case performance of O(n * m), where n is the length of S and m is the length of P. Fortunately, many fast string searching algorithms exist, such as the KMP [21] or Boyer-Moore algorithm [4]. The KMP algorithm searches for occurrences of P within S by employing the simple observation that when a mismatch occurs, the pattern itself embodies sufficient information to determine where the next match could begin; re-examination of previously matched characters is thereby avoided. The algorithm has complexity O(n + m). Boyer-Moore is a slightly faster algorithm with an average complexity of O(n/m) and a worst case of O(n). Normally, string searching algorithms return exact matches, but for the purposes of Autosig it is enough if a match bigger than or equal to MIN_SIG_SIZE is found. The Boyer-Moore algorithm starts searching with the last character of P and skips the preceding characters when a mismatch occurs, which is precisely what makes the algorithm so fast. For this very reason it is not suitable for Autosig: the skipped bytes could contain a smaller match, perfectly suitable for signature generation. Instead the choice fell on KMP, which is also relatively easy to implement.
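The exact modification is not given in the thesis; a sketch of the general idea, tracking the longest prefix of P matched anywhere in S and accepting it once it reaches MIN_SIG_SIZE, could look like this:

#include <cstddef>
#include <vector>

const size_t MIN_SIG_SIZE = 32;  // minimum usable signature length

// Returns the length of the longest prefix of P found in S, and its
// start position in S via outPos, using the KMP failure function.
size_t longestPrefixMatch(const std::vector<unsigned char> &S,
                          const std::vector<unsigned char> &P,
                          size_t &outPos) {
    outPos = 0;
    if (P.empty()) return 0;
    // Standard KMP failure function for P.
    std::vector<size_t> fail(P.size(), 0);
    for (size_t i = 1, k = 0; i < P.size(); ++i) {
        while (k > 0 && P[i] != P[k]) k = fail[k - 1];
        if (P[i] == P[k]) ++k;
        fail[i] = k;
    }
    size_t best = 0, k = 0;
    for (size_t i = 0; i < S.size(); ++i) {
        while (k > 0 && S[i] != P[k]) k = fail[k - 1];
        if (S[i] == P[k]) ++k;
        if (k > best) { best = k; outPos = i + 1 - k; }
        if (k == P.size()) break;  // full match; cannot do better
    }
    return best;  // caller accepts the match if best >= MIN_SIG_SIZE
}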
If the given path only contains one malware sample, Autosig exits. Until all files
have been parsed the following steps are executed:
Step One
The first step is to compare two whole files (in the following, a "file" means all block files corresponding to the original file). The biggest file, in the following called file 1, is set as the base file to compare the others with. The second biggest file, file 2, is chosen to start comparing with. If file 1 and file 2 have exactly the same size, it has turned out that they often have common code at the same addresses. Knowing this, we can perform an initial byte-to-byte check, which is a lot faster than looking for similarities at every possible position. If matches are found, we step directly to step four.
Step Two
If step one fails, a checksum for every 32-byte sequence in file 1 is calculated. These checksums are used as keys in a hashtable, where the address in file 1 of the first of the 32 bytes is the value. Since it is not unlikely that a 32-byte sequence repeats itself within a file, duplicate keys are allowed. Then file 2 is opened and parsed byte by byte, calculating the same checksum. Each checksum is used to retrieve the corresponding value in the hashtable, i.e. the position of this checksum in file 1. Matching positions in both files are saved as an integer pair in a vector.
Step Three
When file 2 has been parsed, the vector containing the position pairs is evaluated. As long as the difference between a pair and its predecessor is 1 in both the first and the second component, the pairs describe a contiguous run of common code.
If no matches are found, step one is repeated with the next biggest file, which replaces file 2. Should it be the case that file 1 has nothing in common with any other file, it is simply ignored and the procedure is repeated with file 2 replacing file 1.
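A sketch of steps two and three together; the 32-byte checksum function is not named in the thesis, so a simple stand-in is used, and all names are illustrative:

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

typedef std::vector<unsigned char> Bytes;

// Stand-in for the 32-byte checksum used as hashtable key.
static uint64_t window32(const Bytes &f, size_t pos) {
    uint64_t s = 0;
    for (size_t i = 0; i < 32; ++i) s = s * 31 + f[pos + i];
    return s;
}

// Step two: hash all 32-byte windows of file 1 (duplicate keys allowed),
// then slide over file 2 and record matching position pairs. Step three
// merges runs where both positions advance by 1 into common code.
std::vector<std::pair<size_t, size_t>>
findCandidatePairs(const Bytes &f1, const Bytes &f2) {
    std::unordered_multimap<uint64_t, size_t> table;
    for (size_t i = 0; i + 32 <= f1.size(); ++i)
        table.emplace(window32(f1, i), i);

    std::vector<std::pair<size_t, size_t>> pairs;
    for (size_t j = 0; j + 32 <= f2.size(); ++j) {
        auto range = table.equal_range(window32(f2, j));
        for (auto it = range.first; it != range.second; ++it)
            pairs.emplace_back(it->second, j);  // (pos in file 1, pos in file 2)
    }
    return pairs;
}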
Step Four
When reaching this step common code has been found and we create a new instance of
a code struct, as shown below.
struct code {
    char *file;              // which file was used as file 1
    int start, size;         // start position and size of the common code
    vector<char *> files;    // other files in which the code is present

    // Order code segments by the number of files they appear in.
    bool operator<(const code &a) const
    {
        return files.size() < a.files.size();
    }
};
This struct contains information on which file has been used as file 1, the starting position of the common code in this file, the size of the common code found, and in which other files the code is present. Only code segments bigger than MIN_SIG_SIZE are considered. The struct is added to a list.
Step Five
Now we can begin looking for the common code sections in the other files. This is done by opening the next biggest file, file 3, and reading it into a char array. Then the list of common code is parsed, node by node. The code segment represented by the code struct is read into another char array. Using a slightly modified version of the KMP search algorithm, it is checked whether the code segment is present in file 3. If an exact match is found, file 3 is added to the file list in the struct. If only a smaller match is found, the size is also set to the new, smaller value. This step is repeated until there are no more files to compare with.
If there are files that have nothing in common with file 1, step one is repeated with only these files, the biggest of them serving as file 1.
5.2.2 Sorting List of Common Code

Of course there might exist code segments that are found in equally many files. How these are ranked internally is decided by which flag the user has passed when starting Autosig.
If the user has entered the "-ss" flag, the code segments with the biggest size are ranked highest. This might help find a good signature faster, as the probability of finding a good signature in a big code segment is higher than in a small one. Figure 5.2 shows part of a list sorted with this option.
With the alternative flag, code segments are put in sequential order depending on their position in the file. Choosing this option should make scanning of files slightly faster, since the scan engine can parse the file from beginning to end without having to jump back and forth. Figure 5.3 shows the same list as above, sorted with this option.
Following Bytes
It has turned out that these statistics are not only helpful in choosing good starting bytes; they can also be used to check the following bytes, for example to reject code segments consisting of just zeros. Autosig only uses code segments whose 2-byte combinations all have 8 hits or fewer in the statistics.
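A sketch of this filter, reusing the ByteStats table from the loader sketch in chapter 4 (the threshold of 8 comes from the text above; the function name is illustrative):

#include <array>
#include <cstddef>
#include <cstdint>

using ByteStats = std::array<uint32_t, 65536>;  // as in the loader sketch

// Accept a candidate code segment only if every adjacent 2-byte
// combination is rare enough in the statistics over clean files.
bool allPairsRare(const unsigned char *code, size_t size,
                  const ByteStats &stats, uint32_t maxHits = 8) {
    for (size_t i = 0; i + 1 < size; ++i) {
        size_t idx = (static_cast<size_t>(code[i]) << 8) | code[i + 1];
        if (stats[idx] > maxHits)
            return false;
    }
    return true;
}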
Signature number Every signature has to have a number. This is normally set auto-
matically when adding the signature to the database, but a temporary number
has to be set for the test database. This variable is saved in an external file and the
number is incremented by 1 per signature.
Signature type The signatures also have to have a type compatible with the type given by the scan engine, as explained in section 4.2. The type is retrieved by reading the log file generated by the scan engine in the preparation steps; it states which malware type the scanner assigns the malware in question to. The log file often contains multiple types, but these are avoided by Autosig when possible. It is assumed that if the type SIG_EXE_PE or SIG_TROJAN_DOS is present, this will be adequate. Otherwise multiple types are used.
Signature name Autosig uses the name of the malware given as input as signature
name. Since more than one signature might be necessary a number is appended
to the name.
Bytepattern This field holds the actual code segment found in the previous steps. If the list from the previous steps contains more than one node, the first one is used as a pure bytepattern and the remaining nodes are used to calculate up to 20 checksums; more than 20 checksums are normally not necessary and would only waste memory. Should the first node be bigger than MIN_SIG_SIZE, a CRC checksum is calculated over the exceeding code segment (see the sketch after this list).
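A sketch of how the bytepattern field could be assembled from the list of code segments, using the '>' jump and '#' checksum syntax from section 3.2 (helper names are illustrative, and the segments are assumed to be sorted by position):

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

struct Segment { size_t start; std::vector<unsigned char> bytes; };

// Builds "<hex lead bytes> >skip #0xCHECKSUM,count ...". segs must be
// non-empty and sorted by start position; at most 20 checksums are added.
std::string buildBytepattern(const std::vector<Segment> &segs,
                             uint32_t (*crc)(const unsigned char *, size_t)) {
    std::string sig;
    char buf[64];
    const Segment &lead = segs.front();
    for (unsigned char b : lead.bytes) {  // hex-encode the lead pattern
        std::snprintf(buf, sizeof(buf), "%02X", static_cast<unsigned>(b));
        sig += buf;
    }
    size_t end = lead.start + lead.bytes.size();
    for (size_t i = 1; i < segs.size() && i <= 20; ++i) {
        const Segment &s = segs[i];
        std::snprintf(buf, sizeof(buf), " >%zu #0x%08X,%zu",
                      s.start - end,  // '>' jump distance
                      static_cast<unsigned>(crc(s.bytes.data(), s.bytes.size())),
                      s.bytes.size());
        sig += buf;
        end = s.start + s.bytes.size();
    }
    return sig;
}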
5.5 Tests
Before any signature is ready to be shipped to the users of the Ikarus anti-virus software, it has to be thoroughly tested. The finished signatures are written to a file, and from this a temporary signature database is generated. The directory with the malware from which the signatures have been generated is then scanned with the temporary signature database. All malware from which a signature has been generated should be found during this scan. Chapter 6 will show that the signatures generated by Autosig hold in 99 percent of the cases. If a signature does not find the malware it is supposed to find, it is discarded.
At this point Autosig has finished its execution and exits after removing block and log files. The signatures also have to be tested for false positives, but this test must be initiated by the user. Signatures that do not pass this test will be (manually) added to the file with negative signatures, as described in section 5.3.2.
Chapter 6
Test Results
Autosig has been thoroughly tested to identify and measure the benefits of automatic
signature generation. Two different areas of use are of interest: signature generation from new malware samples on a daily basis, and signature generation from old samples in the malware database, the latter with the aim of shrinking the number of signatures in the database. Both these areas of interest have been tested. During the tests the following was observed:
False Negatives Rate Only one false negative was produced during all of the tests. It was due to a faulty log file from the scan engine, resulting in an incorrect signature type. If the signature type is not correct, the scanner will not search for the signature in the area where it can be found. After manually changing the signature type, the signature worked.
False Positives Rate A finished signature is not allowed to generate any false positives. However, this is not a hard limit for Autosig, since signatures causing false positives are simply discarded as negative signatures and a new attempt is made. This way the program should learn which code segments to avoid. As anticipated, the false positives rate dropped after a few days, as more negative signatures were added, as seen in table 6.1.
Sort Algorithm For sorting the list of common code as described in section 5.2.2, two different options are available to the user: sorting according to size or according to position. However, it turned out that these options did not have much effect on the overall performance. No improvement of the signatures could be noticed in either case, nor did either of the sorting options yield any time benefits.
The initial false positives rate dropped quite quickly after the first week of tests, reaching a near minimum at the end of the month. This shows that Autosig was able to learn which code segments to avoid. On the first day Autosig had to be run as many as five times to sort out the code to avoid. This was because a substring was extracted from a relatively big code segment that is also present in morpheus.exe, a file-sharing client. Since a signature was only generated from a part of this code, after adding the signature to the negative signatures there was still Morpheus code left from which Autosig tried to generate a signature, leading to yet another false positive.
Another encouraging result of these daily tests was that some of the signatures gen-
erated managed to identify new samples from the same malware family. For example
the signature
B12F6E70DA2746823D12BA24F4CAFF3B33CE4BD20F7F4419FB058B09997
DD0DBC5CF2057AA3E0F78A95322CA68787506 >10 #0x19d1e599,108
>1 #0xa90b914b,740 >4 #0x507a8328,108
generated from the Backdoor trojan W32/IRCBot.pu on the 2nd of April also found the
new W32/IRCBot.pu samples arriving on the 3rd, 12th, 13th and 14th. This shows that
the signatures generated by Autosig are indeed able to find new examples of already
known malware.
During the test it was noticed that when more than 10 samples were parsed the time
limit of 8 minutes was often reached.
4C98FFFF3BFE7410FF75F48B3D40404100FFD7FF75F0FFD73975087409FF
7508FF150C4041006A01 >5 #0xDD5B7F86,44 >42 #0x891F1CD6,164
>6 #0xF1793237,48
Parsing the 545 W32/Sobig samples was surprisingly fast and took only 37 seconds.
This is probably because invariant code can be found at the same position in almost all
of the files. A log file from this execution of Autosig can be found in Appendix A.
This one signature finds 508 of the samples and generates no false positives. The
result clearly shows one of the benefits of automatic signature generation: manually
comparing all the block files generated from the 545 samples is tedious work, better
done by a computer.
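The cheap case described here, invariant code sitting at the same file position in every sample, can be sketched as a column-wise comparison. This is a minimal illustration, not Autosig's actual block comparison:

def invariant_runs(samples, min_len=32):
    # Return (offset, bytes) runs that are identical, at the same file
    # position, in all sample buffers. Only positions up to the length
    # of the shortest sample are compared.
    limit = min(len(s) for s in samples)
    runs, start = [], None
    for i in range(limit):
        same = all(s[i] == samples[0][i] for s in samples)
        if same and start is None:
            start = i                      # a run of agreement begins
        elif not same and start is not None:
            if i - start >= min_len:       # keep only runs long enough
                runs.append((start, samples[0][start:i]))
            start = None
    if start is not None and limit - start >= min_len:
        runs.append((start, samples[0][start:limit]))
    return runs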
Autosig did not manage to generate any signature for W32/Polip, as anticipated in
the case of a polymorphic virus. It did, however, find some invariant code between two
of the samples, shown highlighted in figure 6.1.
This code was characterized as not suitable for signature generation. The leading
0xCCCC bytes are obviously not suitable to start a signature with, as these are padding
bytes that appear frequently in all kinds of files. To obtain a signature of the minimum
size of 32 bytes, it would have to start at address 0x2fef. However, none of the bytes
513D001000008D4C2408721481E900 preceding this address are suitable to start a
signature with, as they all appear too often in the statistics.
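The start-byte check behind this observation can be sketched as follows. The stats dictionary, the thresholds and the function name are stand-ins; the thesis only fixes the threshold of 0 hits for starting 2-byte combinations.

def suitable_start(code, pos, stats, sig_len=32, start_max=0, rest_max=2):
    # Can a signature of sig_len bytes start at pos in code?
    # stats maps 2-byte combinations to their hit count in legitimate files.
    if pos + sig_len > len(code):
        return False
    if stats.get(code[pos:pos + 2], 0) > start_max:
        return False                       # start bytes appear too often
    for i in range(pos + 1, pos + sig_len - 1):
        if stats.get(code[i:i + 2], 0) > rest_max:
            return False                   # a consecutive pair is too common
    return True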
Autosig Parameters
There are a number of parameters that affect the performance of Autosig. The mini-
mum signature size is one such parameter. During the tests it turned out that 32 bytes
might be too small for a signature: most of the signatures of this size failed the false
positives test. By increasing the minimum size, fewer signatures might fail the false
positives test.
To this date only four code segments from the list of good signature code have been
used for signature generation, even when more were found. The number four was cho-
sen because it is the standard used by the human analysts. By increasing this number,
signatures that identify the malware in question more precisely could be generated.
Another parameter that affects the generated signatures is the value that starting
2-byte combinations are allowed to have in the statistics. Increasing this value from
0 to 1 grows the number of allowed starting bytes by about 50%, which might lead to
more signatures being generated. However, this could also lead to a deterioration in
the quality of the generated signatures. The consecutive bytes can also be allowed more
hits in the statistics, characterizing more byte combinations as suitable for signature
generation. In both cases it has to be carefully observed how the false positive rate
is affected.
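For reference, the tuning parameters discussed in this chapter can be collected in one place. The names are illustrative, not Autosig's actual configuration; the defaults are the values mentioned in the text, except the consecutive-pair threshold, which the text leaves open.

from dataclasses import dataclass

@dataclass
class AutosigParams:
    min_signature_size: int = 32   # bytes; possibly too small in practice
    max_code_segments: int = 4     # segments used per signature, the analysts' standard
    start_pair_max_hits: int = 0   # allowed statistics hits for starting 2-byte pairs
    rest_pair_max_hits: int = 0    # allowed hits for consecutive pairs (assumed default)
    time_limit_seconds: int = 480  # the 8 minute limit observed in the tests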
Chapter 7
Conclusion
This Master Thesis shows that it is indeed possible to generate working malware sig-
natures automatically. It shows that, when dealing with malware where many variants
of the same family exist, generic signatures can be generated quickly and accurately.
This way the human analysts have more time to analyze polymorphic, metamorphic
or otherwise complicated malware.
The tool Autosig has been implemented and tested. The results show that it is pos-
sible to extract a substring suitable for signature generation without performing any
analysis of the underlying functionality. Even though the number of signatures gener-
ated did not reach the figures hoped for, it is believed that fine-tuning parameters, such
as the minimum signature length, will raise these numbers. It is also anticipated that
increasing the number of negative signatures will allow more working signatures to be
generated. To this date, the signatures generated by Autosig have caused only one
false negative.
7.1 Future Work
Database Connectivity To this date Autosig has generated signatures from local copies
of malware samples. The next development step will be to connect directly to the mal-
ware database and integrate a search function. This will make it easier to compare
old and new samples of the same malware family and its subfamilies. A user could then
enter "Sober" to generate a signature for all Sober samples, or pass "Sober.b" as a
parameter in order to parse only the "b" subfamily of Sober. This step will automate
the signature generation process further by removing the steps needed to identify
which malware to parse and to copy the samples to the local computer.
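The envisioned family and subfamily lookup could behave as sketched below. The sample list stands in for the future database query; the matching rule is an assumption based on the "Family.subfamily" naming used above.

def select_samples(samples, query):
    # samples: list of (name, path) pairs; query: e.g. "Sober" or "Sober.b".
    # "Sober" matches every Sober sample, "Sober.b" only the b subfamily.
    q = query.lower()
    def matches(name):
        n = name.lower()
        return n == q or n.startswith(q + ".")
    return [path for name, path in samples if matches(name)]

samples = [("Sober.a", "/samples/a"), ("Sober.b", "/samples/b"), ("Sobig.f", "/samples/f")]
print(select_samples(samples, "Sober"))    # both Sober samples
print(select_samples(samples, "Sober.b"))  # only the b subfamily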
Broken Samples Broken files should be removed before Autosig starts parsing the
samples. This could be implemented either as a part of Autosig or as an external tool
that cleans up the malware database before Autosig is executed. An external tool has
the advantage that it does not slow down signature generation.
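One way such a clean-up tool could decide whether a sample is obviously broken is to check the basic PE file landmarks. A minimal sketch, checking only the MZ magic, the e_lfanew pointer and the PE signature within the first 4 KB, not a complete validity test:

import struct

def looks_like_valid_pe(path):
    # Inspect only the first 4 KB; a PE header beyond that is treated
    # as broken by this sketch.
    with open(path, "rb") as f:
        data = f.read(4096)
    if len(data) < 64 or data[:2] != b"MZ":
        return False                       # missing DOS header
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if e_lfanew + 4 > len(data):
        return False                       # PE header pointer out of range
    return data[e_lfanew:e_lfanew + 4] == b"PE\x00\x00"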
Negative Signatures All signatures that cause false positives are saved in the signa-
ture database at Ikarus. To this date Autosig has only used the false positives generated
by its own signatures. A further step would be to take advantage of all false positive
signatures in the signature database as well.
It would probably also be profitable to add more code from different exe-packers,
and other code segments known to be unsuitable for signatures, to the negative sig-
nature collection. During the tests it was noted that some signatures were generated
from the import tables of the files in question, and these signatures often caused false
positives. The initial false positive rate could probably be lowered by looking into how
these areas can be avoided.
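The import table area could, for instance, be located up front so that candidate signatures overlapping it are rejected before the false positives test. A sketch using the third-party pefile library (an assumption; the thesis tool does not use it):

import pefile

def import_table_range(path):
    # Return the (start, end) file offsets of the import table, or None.
    pe = pefile.PE(path, fast_load=True)
    idx = pefile.DIRECTORY_ENTRY["IMAGE_DIRECTORY_ENTRY_IMPORT"]
    entry = pe.OPTIONAL_HEADER.DATA_DIRECTORY[idx]
    if entry.VirtualAddress == 0 or entry.Size == 0:
        return None                        # no import table present
    start = pe.get_offset_from_rva(entry.VirtualAddress)
    return (start, start + entry.Size)

def overlaps_imports(sig_offset, sig_len, imp_range):
    # Reject signature candidates that fall inside the import table.
    if imp_range is None:
        return False
    return sig_offset < imp_range[1] and sig_offset + sig_len > imp_range[0]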