Mark Stamp - Introduction To Machine Learning With Applications in Information Security
Chapman & Hall/CRC
Machine Learning & Pattern Recognition Series
SERIES EDITORS
This series reflects the latest advances and applications in machine learning and pattern
recognition through the publication of a broad range of reference works, textbooks, and
handbooks. The inclusion of concrete examples, applications, and methods is highly encouraged.
The scope of the series includes, but is not limited to, titles in the areas of machine learning,
pattern recognition, computational intelligence, robotics, computational/statistical learning
theory, natural language processing, computer vision, game AI, game theory, neural networks,
computational neuroscience, and other relevant topics, such as machine learning applied to
bioinformatics or cognitive science, which might be proposed by potential contributors.
PUBLISHED TITLES
BAYESIAN PROGRAMMING
Pierre Bessière, Emmanuel Mazer, Juan-Manuel Ahuactzin, and Kamel Mekhnacha
UTILITY-BASED LEARNING FROM DATA
Craig Friedman and Sven Sandow
HANDBOOK OF NATURAL LANGUAGE PROCESSING, SECOND EDITION
Nitin Indurkhya and Fred J. Damerau
COST-SENSITIVE MACHINE LEARNING
Balaji Krishnapuram, Shipeng Yu, and Bharat Rao
COMPUTATIONAL TRUST MODELS AND MACHINE LEARNING
Xin Liu, Anwitaman Datta, and Ee-Peng Lim
MULTILINEAR SUBSPACE LEARNING: DIMENSIONALITY REDUCTION OF
MULTIDIMENSIONAL DATA
Haiping Lu, Konstantinos N. Plataniotis, and Anastasios N. Venetsanopoulos
MACHINE LEARNING: An Algorithmic Perspective, Second Edition
Stephen Marsland
SPARSE MODELING: THEORY, ALGORITHMS, AND APPLICATIONS
Irina Rish and Genady Ya. Grabarnik
A FIRST COURSE IN MACHINE LEARNING, SECOND EDITION
Simon Rogers and Mark Girolami
INTRODUCTION TO MACHINE LEARNING WITH APPLICATIONS IN
INFORMATION SECURITY
Mark Stamp
INTRODUCTION TO MACHINE LEARNING with APPLICATIONS in INFORMATION SECURITY
Mark Stamp
San Jose State University
California
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize
to copyright holders if permission to publish in this form has not been obtained. If any copyright material
has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information storage
or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been granted a
photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation without intent to infringe.
Preface
Acknowledgments
1 Introduction
1.1 What Is Machine Learning?
1.2 About This Book
1.3 Necessary Background
1.4 A Few Too Many Notes
II Applications
Index
Preface
For the past several years, I’ve been teaching a class on “Topics in Information
Security.” Each time I taught this course, I’d sneak in a few more machine
learning topics. For the past couple of years, the class has been turned on
its head, with machine learning being the focus, and information security
only making its appearance in the applications. Unable to find a suitable
textbook, I wrote a manuscript, which slowly evolved into this book.
In my machine learning class, we spend about two weeks on each of the
major topics in this book (HMM, PHMM, PCA, SVM, and clustering). For
each of these topics, about one week is devoted to the technical details in
Part I, and another lecture or two is spent on the corresponding applications
in Part II. The material in Part I is not easy; by including relevant
applications, the material is reinforced, and the pace is more reasonable.
I also spend a week covering the data analysis topics in Chapter 8, and
I cover several of the mini topics in Chapter 7 as time constraints
and student interest allow.1
Machine learning is an ideal subject for substantive projects. In topics
classes, I always require projects, which are usually completed by pairs of
students, although individual projects are allowed. At least one week is allocated
to student presentations of their project results.
A suggested syllabus is given in Table 1. This syllabus should leave time
for tests, project presentations, and selected special topics. Note that the
applications material in Part II is intermixed with the material in Part I.
Also note that the data analysis chapter is covered early, since it’s relevant
to all of the applications in Part II.
1 Who am I kidding? Topics are selected based on my interests, not student interest.
Mark Stamp
Los Gatos, California
April, 2017
2 In my experience, in-person lectures are infinitely more valuable than any recorded or
online format. Something happens in live classes that will never be fully duplicated in any
dead (or even semi-dead) format.
About the Author
My work experience includes more than seven years at the National Security
Agency (NSA), which was followed by two years at a small Silicon Valley
startup company. Since 2002, I have been a card-carrying member of the
Computer Science faculty at San Jose State University (SJSU).
My love affair with machine learning began during the early 1990s, when
I was working at the NSA. In my current job at SJSU, I’ve supervised vast
numbers of master’s student projects, most of which involve some combination
of information security and machine learning. In recent years, students have
become even more eager to work on machine learning projects, which I would
like to ascribe to the quality of the book that you have before you and my
magnetic personality, but instead, it’s almost certainly a reflection of trends
in the job market.
I do have a life outside of work.3 Recently, kayak fishing and sailing my
Hobie kayak in the Monterey Bay have occupied most of my free time. I also
ride my mountain bike through the local hills and forests whenever possible.
In case you are a masochist, a more complete autobiography can be found at
http://www.sjsu.edu/people/mark.stamp/
If you have any comments or questions about this book (or anything else)
you can contact me via email at [email protected]. And if you happen
to be local, don’t hesitate to stop by my office to chat.
3 Of course, here I am assuming that what I do for a living could reasonably be classified
as work. My wife (among others) has been known to dispute that assumption.
Acknowledgments
The first draft of this book was written while I was on sabbatical during the
spring 2014 semester. I first taught most of this material in the fall semester
of 2014, then again in fall 2015, and yet again in fall 2016. After the third
iteration, I was finally satisfied that the manuscript had the potential to be
book-worthy.
All of the students in these three classes deserve credit for helping to
improve the book to the point where it can now be displayed in public without
excessive fear of ridicule. Here, I’d like to single out the following students
for their contributions to the applications in Part II.
Topic       Students
HMM         Sujan Venkatachalam, Rohit Vobbilisetty
PHMM        Lin Huang, Swapna Vemparala
PCA         Ranjith Jidigam, Sayali Deshpande, Annapurna Annadatha
SVM         Tanuvir Singh, Annapurna Annadatha
Clustering  Chinmayee Annachhatre, Swathi Pai, Usha Narra
Chapter 1
Introduction
I took a speed reading course and read War and Peace in twenty minutes.
It involves Russia.
— Woody Allen
the primary goal of this book is to provide the reader with a deeper
understanding of what is actually happening inside those mysterious machine
learning black boxes.
Why should anyone care about the inner workings of machine learning algorithms
when a simple black box approach can (and often does) suffice? If
you are like your curious author, you hate black boxes, and you want to know
how and why things work as they do. But there are also practical reasons
for exploring the inner sanctum of machine learning. As with any technical
field, the cookbook approach to machine learning is inherently limited. When
applying machine learning to new and novel problems, it is often essential to
have an understanding of what is actually happening “under the covers.” In
addition to being the most interesting cases, such applications are also likely
to be the most lucrative.
By way of analogy, consider a medical doctor (MD) in comparison to a
nurse practitioner (NP).1 It is often claimed that an NP can do about 80%
to 90% of the work that an MD typically does. And the NP requires less
training, so when possible, it is cheaper to have NPs treat people. But, for
challenging or unusual or non-standard cases, the higher level of training of
an MD may be essential. So, the MD deals with the most challenging and
interesting cases, and earns significantly more for doing so. The aim of this
book is to enable the reader to earn the equivalent of an MD in machine
learning.
The bottom line is that the reader who masters the material in this book
will be well positioned to apply machine learning techniques to challenging
and cutting-edge applications. Most such applications would likely be beyond
the reach of anyone with a mere black box level of understanding.
sometimes skip a few details, and on occasion, we might even be a little bit
sloppy with respect to mathematical niceties. The goal here is to present
topics at a fairly intuitive level, with (hopefully) just enough detail to clarify
the underlying concepts, but not so much detail as to become overwhelming
and bog down the presentation.3
In this book, the following machine learning topics are covered in chapter-
length detail.
Topic                                  Where
Hidden Markov Models (HMM)             Chapter 2
Profile Hidden Markov Models (PHMM)    Chapter 3
Principal Component Analysis (PCA)     Chapter 4
Support Vector Machines (SVM)          Chapter 5
Clustering (K-Means and EM)            Chapter 6
Topic                                  Where
k-Nearest Neighbors (k-NN)             Section 7.2
Neural Networks                        Section 7.3
Boosting and AdaBoost                  Section 7.4
Random Forest                          Section 7.5
Linear Discriminant Analysis (LDA)     Section 7.6
Vector Quantization (VQ)               Section 7.7
Naïve Bayes                            Section 7.8
Regression Analysis                    Section 7.9
Conditional Random Fields (CRF)        Section 7.10
http://www.cs.sjsu.edu/~stamp/ML/
where you’ll find links to PowerPoint slides, lecture videos, and other relevant
material. An updated errata list is also available. And for the reader’s benefit,
all of the figures in this book are available in electronic form, and in color.
3 Admittedly, this is a delicate balance, and your unbalanced author is sure that he didn’t
always achieve an ideal compromise. But you can rest assured that it was not for lack of
trying.