Crivat B., MacLennan J. - Detect Anomalies in Excel Spreadsheets
ACCESS ADVISOR - SQL SERVER DEVELOPMENT
Web issue 2004 week 36, Print issue October 2004, Doc #14413
Use SQL Server 2005 Data Mining inside Excel.
By Bogdan Crivat and Jamie MacLennan, Microsoft SQL Server Data Mining
Files for this article (an example to find anomalies in spreadsheets) are
on this issue's Professional Resource CD.
Microsoft Excel becomes more and more versatile with each release and
solves a wider variety of business needs. Its flexibility and
programmability let you integrate different technologies to better
understand and process the data in your spreadsheets. From its inception
in SQL Server 2000, Microsoft's data mining solution has provided a
programming model to access data mining technologies, which has
expanded with SQL Server 2005. This article shows you how these two
technologies can work together seamlessly.
[...] for an Excel worksheet.
This solution doesn't require any separate treatment of the data; it
works directly in your [...] Visual Basic for Applications (VBA) and uses
Data Mining eXtensions (DMX) for SQL statements in modeling the data and
detecting the anomalies. Some knowledge of these technologies might help
you understand and further extend the solution. The "Add-in details"
section also contains a brief description of the DMX language and the
location of the specification.
The add-in also requires that you have permissions to create a temporary
file on the C:\ drive, although you can change the code to use any folder
where the users have permissions.
The problem
Real, meaningful data usually contains patterns. You can describe these
patterns in terms of relations between various column values. An example
is "IF the value of the Age column is smaller than 18 THEN the value of
the Occupation column IS LIKELY TO BE Student." Sometimes you can
detect a simple rule easily just by visually inspecting the data or by using
common sense. A spreadsheet entry with 10 as the value of the Age
column and Lawyer as the value of the Occupation column usually raises
eyebrows and will most likely be treated as an anomaly.
If the spreadsheet has a small number of columns (we also use the term
"attributes" for columns), data visualization tools (such as the graph
component of Excel) are helpful. With scatter plots (such as the X-Y
graphs in Excel), you can usually detect simple relations between two or
three columns just by visually inspecting the spreadsheet.
"IF
the value of the Age column is greater than 21
BUT smaller than 25 AND
the value of the Credit Score column is smaller than 720
AND
the value of the Income column is greater than 50000 BUT
the value of the Number of Children column IS NOT 0
THEN
the value of the Home Ownership column IS LIKELY TO BE
Rent"
Now, such a rule isn't easy to find. A good understanding of the data
always helps, but visually inspecting a few thousand rows in a
spreadsheet is a daunting task even for a person familiar with the
columns.
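
Once a rule like the one above is known, checking it against a row is mechanical; the hard part, as noted, is discovering it. The following is a minimal Python sketch of such a check. It assumes each row is a dictionary keyed by column name and reads each BUT in the rule as AND; the sample rows are invented for illustration and don't come from any real spreadsheet.

```python
def rule_applies(row):
    """Return True when the row meets every condition of the example rule."""
    return (
        21 < row["Age"] < 25
        and row["Credit Score"] < 720
        and row["Income"] > 50000
        and row["Number of Children"] != 0
    )


def breaks_rule(row):
    """A row is suspicious when the rule applies but the outcome differs."""
    return rule_applies(row) and row["Home Ownership"] != "Rent"


# Invented sample rows: only the second one matches the rule and breaks it.
rows = [
    {"Age": 23, "Credit Score": 680, "Income": 62000,
     "Number of Children": 1, "Home Ownership": "Rent"},
    {"Age": 24, "Credit Score": 700, "Income": 55000,
     "Number of Children": 2, "Home Ownership": "Own"},
    {"Age": 40, "Credit Score": 700, "Income": 55000,
     "Number of Children": 2, "Home Ownership": "Own"},
]
suspicious = [row for row in rows if breaks_rule(row)]
```

The third row is not flagged because the rule says nothing about 40-year-olds; a rule only produces anomalies among the rows it covers.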
The problem is even more complex because the relations between two
columns may change completely for different values in a third column. For
example, the relations between the number of children and the home
ownership status change a lot with age: regardless of the
number of children, people at the beginning of their careers are less likely
to own a home than seasoned professionals. Therefore, you must
consider all the possible values of a column before attempting to use that
column in a rule.
The anomaly detection process can only start when you've completely
determined the set of rules, and it requires reading all the data again and
verifying, for each row, any rule that might apply.
The problem treated in this article is finding the values that are anomalies
for a specified column (we refer to this column as "the target column"),
that is, those spreadsheet entries that don't abide by the rules that, in
general, relate the values in the target column to the values in the other
columns. After you resolve this problem, the spreadsheet user can take
action and clean the spreadsheet in various ways, depending on his
purpose. For example, he can:
For this article, the spreadsheet of interest looks like figure 1. The
spreadsheet has six columns: Student ID, Gender, Parent Income, IQ
(intelligence quotient), Parent Encouragement, and College Plans. These
are the values in each column:
Here's the problem we want to solve: Who are the students whose college
plans don't fit their potential?
high IQ
parental encouragement
and high-income parents
would plan to go to college. Therefore any student who didn't follow this
pattern would be an anomaly. But are these the only ones? Are there
other cases that don't fit the general pattern?
The problem becomes even more difficult when the user of the
spreadsheet doesn't fully understand the data.
In a typical usage scenario, you first train a Data Mining engine from the
whole data or just a subset. During training, the engine learns the rules
and patterns in the subset. In a second phase, you apply the rules for
various purposes, such as detecting anomalies in your spreadsheet.
With these rules at hand, for each row in the spreadsheet, you decide
what's the most likely value for the College Plans column based on the
values in the other columns. In other words, for each student whose IQ,
Parent Income, and Parent Encouragement are known, you have to decide
whether he's likely to continue with college education. Then, if the likely
college plans don't match the actual college plans, you treat that student
as an anomaly from the set of discovered rules.
Let's assume, for now, the factors affecting a student's college plans are
(in the descending order of importance) IQ and the parents' income. The
Microsoft_Decision_Trees algorithm will find and organize rules like those
shown in table 1.
You can think of this structure of rules as a tree because all the students
are first divided based on the most important attribute (here, IQ). Then,
each branch is divided again based on the most important factor for that
subset of data. In table 1, we assumed the parents' income to be the
most important factor for those students with high IQ.
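
The tree structure described above can be pictured as nested tests that end in a most likely value. The Python sketch below is only an illustration: the split points echo the IQ and income thresholds used in this article's example rule, while the leaf label "Does not plan to attend" and the dictionary layout are assumptions, not the contents of table 1 or the output of the Microsoft_Decision_Trees algorithm.

```python
# Hypothetical tree: each inner node tests one attribute, each leaf is the
# most likely value of College Plans for the students that reach it.
TREE = {
    "test": lambda s: s["IQ"] >= 100,
    "yes": {
        "test": lambda s: s["ParentIncome"] > 20000,
        "yes": "Plans to attend",
        "no": "Does not plan to attend",
    },
    "no": "Does not plan to attend",
}


def most_likely_plans(student, node=TREE):
    """Walk from the root until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        node = node["yes"] if node["test"](student) else node["no"]
    return node


def is_anomaly(student):
    """A student is an anomaly when actual plans differ from the likely ones."""
    return student["CollegePlans"] != most_likely_plans(student)


# Invented example: high IQ and income, yet no plans to attend college.
student = {"IQ": 120, "ParentIncome": 50000,
           "CollegePlans": "Does not plan to attend"}
flagged = is_anomaly(student)
```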
IF "IQ >= 100" AND "Parent's income > 20000" THEN (the student)
MOST LIKELY "Plans to attend"
Now, let's take another look at the rule above. What does it mean by
"MOST LIKELY Plans to attend"? How likely is "MOST LIKELY"? Let's
assume, in the context of the rule above, that the support is 100. This
means that, in the whole spreadsheet, we found 100 students who have
an IQ greater than or equal to 100 and parents with income greater than
$20,000. Now, if all these students plan to attend college, this is a strong
rule; there seems to be no exception. However, this hardly happens in real
life. You usually end up with something like this: 82 of the students plan
to attend college, but 17 don't plan to attend college, and one didn't
mention any plans.
Now, you can define most likely college plans as those plans shared by
most of the students who match this rule. This is "Plans to attend", with
82 votes. The likelihood of such plans is 82 out of 100, or 82 percent
(0.82). This value is the confidence (or probability) of the rule.
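
The arithmetic above can be written out as a short Python sketch. It assumes the values of the target column for all rows matching a rule have been collected into a list; the function name and the label "Does not plan to attend" are illustrative, and the counts are the ones from the example (82, 17, and one missing answer).

```python
from collections import Counter


def support_and_confidence(values):
    """Return (most likely value, support, confidence) for one rule.

    `values` holds the target-column value of every row matching the rule;
    None stands for a row that has no value in the target column.
    """
    support = len(values)
    likely, votes = Counter(values).most_common(1)[0]
    return likely, support, votes / support


observed = (["Plans to attend"] * 82
            + ["Does not plan to attend"] * 17
            + [None])
likely, support, confidence = support_and_confidence(observed)
```

Here the missing answer still counts toward the support of 100, which is why the confidence is 82/100 rather than 82/99.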
If you select Yes, Excel saves a copy of this add-in into a special Add-Ins
folder. Otherwise, the add-in only works as long as its original location is
still valid.
Please note that, depending on your current Excel security settings, this
procedure might not work if you've disabled macros. Usually, an error
message indicates this problem. You can solve it by selecting Tools >
Options and, in the Security panel of the resulting dialog, clicking on the
"Macro Security …" button. You see a new dialog that lets you select the
security level for running macros. The Medium level lets you choose
whether to allow macros; in particular, you can select whether you want to
allow the Anomaly Detection add-in to run. After adjusting security
settings, you have to re-add the data mining add-in.
Now, you can use the Anomaly Detection add-in to find problems in the
CollegePlans.xls spreadsheet. You can apply these steps to any
spreadsheet:
In the spreadsheet, select the columns and rows that contain the data you
want to analyze. The selection must include the names of the Excel
columns in the first row, and it must have at least two rows (at least one
data row, besides the column names). You don't have to select all the
data in the spreadsheet, but this is how we got the results we describe below.
After you select the range of data, select the new entry by choosing Data
Mining Anomaly Detection from the Tools menu. A dialog like figure 2
displays.
The first input of this dialog (the one containing the range "$A$1:$F$9001"
in figure 2) lets you change or make a selection. After you make the
selection, the add-in populates the two drop-down lists for the key column
and anomaly detection column.
Microsoft Data Mining also lets you see the reasons behind the anomalies.
If you don't want this, deselect the box marked "Click on anomalies to
show the rule they break." When enabled, this option creates a hyperlink
inside each Excel cell containing an anomaly. By clicking on that cell, you'll
be able to see the rule that identified the cell as an anomaly. However, if
the column to search for anomalies in your particular spreadsheet already
contains hyperlinks, leave this box unchecked, as the new hyperlinks will
remove the existing ones.
After you've made all the selections, click on the OK button to instruct the
add-in to start looking for rules and anomalies. This process took us about
2 to 3 minutes for all 9,000 rows of the CollegePlans.xls spreadsheet. A
small dialog appears and informs you of the status of the operation.
After the anomalies are detected, you can move on to inspect the results.
After Excel completes the analysis and detects the anomalies, it highlights
them on the spreadsheet. The cells detected as anomalies are red and
have comments associated with them. If you selected the "Click on
anomalies to show the rule they break" option, these cells are also
hyperlinks (that is, clicking on the cells shows the rule they break).
Figure 3 shows how the spreadsheet looks after you run the Anomaly
Detection add-in.
As you can see, each anomaly cell has a comment now, describing the
expected college plans for that particular student and the probability
(confidence) of the rule that fits the student.
Finally, the add-in adds a new worksheet to the Excel workbook
containing those rules the Data Mining add-in found were relevant in
detecting the anomalies.
The newly created "Rules found by Data Mining" worksheet looks like
figure 4.
Confidence and Support -- These are measures for the quality of the
rule, as mentioned before in the "How likely is 'most likely'?" section.
Likely value for College Plans -- This is the most likely value for the
College Plans of students that fit into this rule.
Note that the rules may differ a lot depending on the spreadsheet data
you're analyzing. For example, if you only select the first few rows, you'll
likely find fewer rules and each rule will have fewer conditions.
fill the empty cells with the "[Missing data]" string, and they'll have a
comment with their most likely value and a link to a rule that justifies the
comment.
With small changes, you can use the add-in for a purpose other than
detecting anomalies. For instance, you can use it to partition the
spreadsheet in groups of rows with common characteristics (a problem
known in Data Mining as clustering, or segmentation).
While running the add-in, if you indicate a file to export the rules to, you
can load the set of rules on Microsoft Analysis Services 2005. To do this,
open the SQL Server Management Studio and connect to a running
instance of Microsoft Analysis Services 2005. In the context menu
associated with the Databases node of the Object Explorer, select Restore
and indicate the name of the file to which you exported the rules. Also,
enter a name for the database to contain the local mining model you're
restoring and go ahead with the Restore operation. As a result, you create
a new database on the Analysis Services server containing the local
mining model built on your spreadsheet. This model doesn't contain the
actual data, only the set of rules Data Mining discovered while processing
the Excel spreadsheet.
After the data is loaded on the server, a rich set of tools is available for
graphically displaying the rules.
On the server side, Data Mining analysis can handle large volumes of data
by taking advantage of multiple processors. Also, a collection of various
Data Mining algorithms can help with various business problems.
Microsoft Data Mining comes with a query language similar to SQL. The
DMX query language lets you model data, train algorithms, and execute
business intelligence operations, such as retrieving the rules for the
spreadsheet or determining the most likely values for various attributes.
The next section includes a few examples of the DMX syntax. It also
describes how the add-in works and shows you the implementation
details.
Add-in details
Microsoft Data Mining is designed as a platform for developing the various
applications that can take advantage of the Data Mining technology. For
data warehouse applications, it contains a powerful, scalable server that
can handle large volumes of data and help many users. It also contains a
solution for lightweight, embedded Data Mining usage, such as finding
patterns in Excel spreadsheets.
This embedded solution is called "local mining models" and is a library with
many of the most commonly used functionalities of the Data Mining
server.
For the server and the local mining models, communication with the Data
Mining framework occurs via an OLE DB provider that lets you send
commands in the SQL-like DMX language and read results. The Excel add-
in uses this local server to perform data mining on the spreadsheet data.
Note the following elements specific to the OLE DB provider for Analysis
Services:
To perform the analysis, Data Mining must first model the data. For this to
happen, you must create a mining model object. This object will be the
container of all the rules and patterns in the data you analyze. Creating a
mining model is similar to creating a table in SQL. Here's the DMX
statement that creates the model associated with the spreadsheet:
You obtain the name of the model columns from the first line of the
selection to which you apply the add-in. The type of the columns is
inferred from the values in the second row of the selection (the first row of
data). Note the PREDICT keyword that marks the CollegePlans column; it
signifies the model is supposed to find rules for that column.
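
To make the column-naming and type-inference step concrete, here is a Python sketch of the idea. It is not the add-in's VBA code: the mapping from sample values to the DMX type names LONG, DOUBLE, and TEXT, the choice of CONTINUOUS versus DISCRETE, and the sample first data row are all assumptions made for illustration.

```python
def infer_column_type(sample):
    """Guess a column type name from one sample value (assumed mapping)."""
    if isinstance(sample, bool):
        return "BOOLEAN"
    if isinstance(sample, int):
        return "LONG"
    if isinstance(sample, float):
        return "DOUBLE"
    return "TEXT"


def column_definitions(header_row, first_data_row, key, target):
    """Build one column definition per column in the selection."""
    definitions = []
    for name, sample in zip(header_row, first_data_row):
        column_type = infer_column_type(sample)
        if name == key:
            role = "KEY"
        elif column_type in ("LONG", "DOUBLE"):
            role = "CONTINUOUS"
        else:
            role = "DISCRETE"
        if name == target:
            role += " PREDICT"  # the column to find rules for
        definitions.append(f"[{name}] {column_type} {role}")
    return definitions


# Names come from the first row of the selection, types from the second.
header = ["StudentID", "Gender", "ParentIncome", "IQ",
          "ParentEncouragement", "CollegePlans"]
first_row = [1, "Male", 45000.0, 110, "Encouraged", "Plans to attend"]
definitions = column_definitions(header, first_row,
                                 key="StudentID", target="CollegePlans")
```

Inferring types from a single row is simple but fragile: if the first data row holds an untypical value (for example, an empty cell), the whole column is typed accordingly.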
After you create the model, you have to train it, that is, feed it with data
to find rules. Here's the DMX statement (again, similar to the SQL INSERT
statement) to do this:
The add-in contains code that reads the whole selection and packs it
into the XMLA format. This code lives in the XMLARowsetGen class
module, which is included in the add-in. The XMLARowsetGen class simply
serializes each row in the XML format described by the XMLA 1.1
specification; rows are added to the XMLARowsetGen object one by one.
The GenerateRowset method of this class module returns a string that
contains the XML serialization of all the rows added so far.
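
The row-by-row serialization can be sketched in Python as follows. This is a simplified illustration, not a port of the XMLARowsetGen module: the class and method names are hypothetical, the inline schema that a full XMLA rowset carries is omitted, and only spaces in column names are escaped (as _x0020_) so they form valid XML element names.

```python
import xml.etree.ElementTree as ET

# Namespace used by XMLA rowsets.
ROWSET_NS = "urn:schemas-microsoft-com:xml-analysis:rowset"


class RowsetBuilder:
    """Collects rows and serializes them as a simplified XML rowset."""

    def __init__(self, column_names):
        # Spaces are not allowed in XML element names, so escape them.
        self.tags = [name.replace(" ", "_x0020_") for name in column_names]
        self.root = ET.Element("root", {"xmlns": ROWSET_NS})

    def add_row(self, values):
        """Append one <row> element; empty cells (None) are skipped."""
        row = ET.SubElement(self.root, "row")
        for tag, value in zip(self.tags, values):
            if value is not None:
                ET.SubElement(row, tag).text = str(value)

    def generate_rowset(self):
        """Return the XML serialization of all the rows added so far."""
        return ET.tostring(self.root, encoding="unicode")


builder = RowsetBuilder(["Student ID", "Gender", "College Plans"])
builder.add_row([1, "Male", "Plans to attend"])
builder.add_row([2, "Female", None])
xml_text = builder.generate_rowset()
```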
Here's the code that attaches an XMLA 1.1 rowset as a parameter to the
ADODB command:
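' Assumption: param is a new ADODB.Parameter; its declaration isn't
' part of the listing as published.
Dim param As New ADODB.Parameter
param.Name = "MySpreadsheet"     ' referenced in DMX as @MySpreadsheet
param.Type = adBSTR              ' the rowset is passed as a string
param.Direction = adParamInput   ' input-only parameter
param.Attributes = adParamLong   ' the value can be long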
Note that you have to name the parameter. This is a requirement for the OLE
DB for Analysis Services provider. Also, the parameter is of type adBSTR.
Then, this statement is sent to the local Data Mining server, which
processes the mining model.
With the processed model, you can start predicting the most likely value
for the College Plans column. You can use a statement similar to SQL
SELECT with JOIN:
SELECT
Predict(__ExcelTemp.[CollegePlans], EXCLUDE_NULL),
PredictProbability(__ExcelTemp.[CollegePlans]),
PredictNodeId(__ExcelTemp.[CollegePlans])
FROM
__ExcelTemp
NATURAL PREDICTION JOIN
@MySpreadsheet as __Input
The __ExcelTemp local mining model, created with the CREATE MINING
MODEL … DMX statement, contains a number of columns, matching the
columns in the spreadsheet. The @MySpreadsheet table input parameter
also contains the columns in the spreadsheet.
ON
[Mining Model Column] = [Input Column]
"For each row in the @MySpreadsheet input table, using the rules
detected while you trained the __ExcelTemp mining model, compute:
The predicted value for the [College Plans] column, excluding null
values
The probability of this prediction
The node identifier of the rule that governs this prediction
When:
The [Student Id] column of the mining model takes the value of the
[Student Id] column of the input table
The [Gender] column of the mining model takes the value of the
[Gender] column of the input table
The [ParentIncome] column of the mining model takes the value of
the [ParentIncome] column of the input table
And so on"
The predicted value is the most likely value for the [College Plans]
column, according to the rules that apply to the current row in the input
table. The probability of the prediction is the confidence of the rule. The
add-in uses the node identifier of the rule later to fetch the description,
support, and confidence for the respective rule from the Data Mining local
server.
For each line in the input table, the statement computes three values;
therefore, the result is a table with three columns for each row in the
input table.
Because @MySpreadsheet contains the whole current selection, each row
in the response contains the three named columns for one row in the
selection.
After the add-in fetches this result, it walks the selection row by row. If
the value of the College Plans column differs from the most likely value
the Data Mining server returned, the respective row is marked as
containing an anomaly.
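
The row-by-row walk can be sketched in Python as follows. The sketch assumes the prediction results arrive as one (predicted value, probability, node identifier) tuple per selection row, in the same order as the rows; the function name, the comment wording, and the node identifiers are invented for illustration and don't reproduce the add-in's VBA code.

```python
def find_anomalies(rows, predictions, target="College Plans"):
    """Pair each row with its prediction and report the mismatches.

    Returns a list of (row index, comment text, node id) tuples, one per
    row whose actual target value differs from the predicted one.
    """
    anomalies = []
    for index, (row, prediction) in enumerate(zip(rows, predictions)):
        predicted, probability, node_id = prediction
        if row[target] != predicted:
            comment = f"Expected: {predicted} (probability {probability:.2f})"
            anomalies.append((index, comment, node_id))
    return anomalies


# Invented selection rows and prediction results.
rows = [
    {"Student ID": 1, "College Plans": "Plans to attend"},
    {"Student ID": 2, "College Plans": "Does not plan to attend"},
]
predictions = [
    ("Plans to attend", 0.82, "node-1"),
    ("Plans to attend", 0.82, "node-1"),
]
anomalies = find_anomalies(rows, predictions)
```

Keeping the node identifier with each anomaly is what later allows the rule description, support, and confidence to be looked up for the cell's comment and hyperlink.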
The last DMX statement the add-in issues exports the local mining model
into a file you can reuse in Analysis Services 2005. The statement is:
The DMX language is powerful and not hard to comprehend for
someone familiar with SQL syntax. The detailed specification for DMX is
included in the "OLE DB for Data Mining" specification, which you can
find on the Microsoft Web site at
http://www.microsoft.com/downloads/details.aspx?FamilyID=c66af00d-51be-4d8d-9056-82cb2410ae3f&displaylang=en.
If the data source isn't an Excel spreadsheet and doesn't support an add-
in development language such as VBA, you can solve the anomaly
detection problem inside the Analysis Services 2005 server (although with
a few more clicks than required to run the add-in).
In the past, data mining has been a field of academics and high-end
researchers and analysts. The simplicity and ease of use of the new
Microsoft SQL Server 2005 Data Mining platform bring this powerful
technology to everybody's fingertips.