Crivat B., MacLennan J. - Detect Anomalies in Excel Spreadsheets

Access Advisor :: Detect Anomalies in Excel Spreadsheets -- Microsoft Excel Microsoft SQL Server Data Integration Database Da...

ACCESS ADVISOR
SQL SERVER DEVELOPMENT

Detect Anomalies in Excel Spreadsheets
Use SQL Server 2005 Data Mining inside Excel.

By Bogdan Crivat and Jamie MacLennan, Microsoft SQL Server Data Mining

ARTICLE INFO
Web issue 2004 week 36
Print issue October 2004
Length 6.25 pages
Doc #14413
Files for this article are on this issue's Professional Resource CD.
File Description: Example to find anomalies in spreadsheets (763,329 bytes).

Microsoft Excel becomes more and more versatile with each release and
solves a wider variety of business needs. Its flexibility and
programmability let you integrate different technologies to better
understand and process the data in your spreadsheets. From its inception
in SQL Server 2000, Microsoft's data mining solution has provided a
programming model to access data mining technologies, which has
expanded with SQL Server 2005. This article shows you how these two
technologies can work together seamlessly.

A common problem that requires a data mining solution is anomaly
detection -- that is, finding those values that "do not fit" based on the
patterns present in the rest of the data. You'll see how to solve this
problem using SQL Server 2005 Data Mining for an Excel worksheet. This
solution doesn't require any separate treatment of the data; it works
directly in your workbook after a few mouse clicks. The code that
accompanies this article is an Excel add-in you can install in Microsoft
Excel and use for detecting anomalies in any Excel worksheet.

You don't need any previous knowledge of data mining or programming to
use the solution in this article. However, the last part of the article (the
"Add-in details" section) contains details about the solution
implementation. The Excel add-in is developed with Visual Basic for
Applications (VBA) and uses Data Mining eXtensions (DMX) for SQL
statements in modeling the data and detecting the anomalies. Some
knowledge of these technologies might help you understand and further
extend the solution. The "Add-in details" section also contains a brief
description of the DMX language and the location of the specification.

http://accessvbsqladvisor.com/doc/14413 09/03/2005 17:24:34

Requirements

We tested the Excel add-in presented here with the SQL Server 2005 Beta
2 release, which Microsoft shipped to more than 200,000 Microsoft
Developer Network (MSDN) subscribers. The Microsoft Analysis Services
2005 server, part of which is Microsoft Data Mining, is included in this beta
release. For the add-in to work, you have to install the Connectivity
Components included with SQL Server 2005 Beta 2 on the machine.

The add-in also requires that you have permissions to create a temporary
file on the C:\ drive, although you can change the code to use any folder
where the users have permissions.

The problem
Real, meaningful data usually contains patterns. You can describe these
patterns in terms of relations between various column values. An example
is "IF the value of the Age column is smaller than 18 THEN the value of
the Occupation column IS LIKELY TO BE Student." Sometimes you can
detect a simple rule easily just by visually inspecting the data or by using
common sense. A spreadsheet entry with 10 as the value of the Age
column and Lawyer as the value of the Occupation column usually raises
eyebrows and will most likely be treated as an anomaly.

If the spreadsheet has a small number of columns (we also use the term
"attributes" for columns), data visualization tools (such as the graph
component of Excel) are helpful. With scatter plots (such as the X-Y
graphs in Excel), you can usually detect simple relations between two or
three columns just by visually inspecting the spreadsheet.

However, the complexity of these patterns grows with the dimensionality
of the data: the more columns a spreadsheet has, the more complex the
rules describing the patterns tend to be. For example:

"IF
the value of the Age column is greater than 21
BUT smaller than 25 AND
the value of the Credit Score column is smaller than 720
AND
the value of the Income column is greater than 50000 BUT
the value of the Number of Children column IS NOT 0
THEN
the value of the Home Ownership column IS LIKELY TO BE
Rent"

Now, such a rule isn't easy to find. A good understanding of the data
always helps, but visually inspecting a few thousand rows in a
spreadsheet is a daunting task even for a person familiar with the
columns.
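
To make the shape of such a rule concrete, here is a minimal sketch of the
quoted rule written as a check on one row. It is in Python rather than the
add-in's VBA, purely for illustration; the function names and the sample
rows are made up, and only the thresholds come from the rule above.

```python
def matches_rent_rule(row):
    """True when every condition of the quoted rule holds for one row."""
    return (
        21 < row["Age"] < 25
        and row["Credit Score"] < 720
        and row["Income"] > 50000
        and row["Number of Children"] != 0
    )


def breaks_rent_rule(row):
    """True when the rule applies but the row lacks the likely value."""
    return matches_rent_rule(row) and row["Home Ownership"] != "Rent"


# Made-up sample rows, for illustration only.
rows = [
    {"Age": 23, "Credit Score": 690, "Income": 62000,
     "Number of Children": 2, "Home Ownership": "Own"},
    {"Age": 24, "Credit Score": 700, "Income": 55000,
     "Number of Children": 1, "Home Ownership": "Rent"},
    {"Age": 40, "Credit Score": 780, "Income": 90000,
     "Number of Children": 0, "Home Ownership": "Own"},
]

# Indexes of rows that match the rule's conditions but are not "Rent".
suspects = [index for index, row in enumerate(rows) if breaks_rent_rule(row)]
print(suspects)
```

Writing one rule by hand is easy; the hard part, as the text says, is
discovering which rules hold in the data in the first place.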

The problem is even more complex because the relations between two
columns may change completely for different values in a third column. For
example, the relations between the number of children and the home
ownership status change a lot with age. For instance, regardless of the
number of children, people at the beginning of their careers are less likely
to own a home than seasoned professionals. Therefore, you must
consider all the possible values of a column before attempting to use that
column in a rule.

The anomaly detection process can only start when you've completely
determined the set of rules, and it requires reading all the data again and
verifying, for each row, any rule that might apply.

The problem treated in this article is finding the values that are anomalies
for a specified column (we refer to this column as "the target column") --
that is, those spreadsheet entries that don't abide by the rules that, in
general, relate the values in the target column to the values in the other
columns. After you resolve this problem, the spreadsheet user can take
action and clean the spreadsheet in various ways, depending on his
purpose. For example, he can:

- Recheck the data for the entries that contain anomalies
- Eliminate entries considered abnormal
- Correct the abnormal entries by changing the anomaly values to the
  ones suggested by the rules

For this article, the spreadsheet of interest looks like figure 1. The
spreadsheet has six columns: Student ID, Gender, Parent Income, IQ
(intelligence quotient), Parent Encouragement, and College Plans. These
are the values in each column:

- Student ID contains a unique identifier (a number) for each entry in
  the spreadsheet. In terms of relational databases, this column is a
  key: It can help you connect this spreadsheet to another one
  containing the name and address for each student, or to one
  containing the exam grades for each course a student has taken.
  The values in this column depend only on the order in which the
  student information was entered in the spreadsheet. They're in no
  way related to the other columns, so you can ignore this column.
- The Gender column contains demographic information about each
  student. The values are Male and Female.
- The Parent Income column contains the income information for the
  parents of each student. It contains values between 5,000 and
  74,900.
- The IQ column contains the intelligence quotient for each student.
  The values are between 60 and 140.
- The Parent Encouragement column describes whether the parents
  encourage a student to continue his education through college. It
  contains values of Encouraged and Not Encouraged.
- The College Plans column shows whether a student intends to go to
  college.

Here's the problem we want to solve: Who are the students whose college
plans don't fit their potential?

A first guess would indicate that students with:

- high IQ
- parental encouragement
- and high-income parents

would plan to go to college. Therefore, any student who didn't follow this
pattern would be an anomaly. But are these the only ones? Are there
other cases that don't fit the general pattern?

The problem becomes even more difficult when the user of the
spreadsheet doesn't fully understand the data.

How Data Mining can help


You can think of SQL Server 2005 Data Mining as a set of technologies
that deal with automatically discovering meaning in data, as opposed to
imperative technologies such as query languages, where the user explicitly
asks for certain properties of the data. This isn't really a definition, but
this explanation describes how we use data mining in this article. You'll
see how data mining can help in finding rules and anomalies in your
spreadsheet.

How can data mining find rules in my spreadsheet?
For Microsoft SQL Server 2005 Data Mining, data is always represented as
a set of input cases. These input cases share a set of attributes. Generally,
each case has a value for each attribute. However, for some cases, certain
attributes may be missing. For the Excel spreadsheet, each row is an input
case, with column values acting as attributes.

In a typical usage scenario, you first train a Data Mining engine on the
whole data set or just a subset. During training, the engine learns the rules
and patterns in that data. In a second phase, you apply the rules for
various purposes, such as detecting anomalies in your spreadsheet.

You can reformulate the problem of detecting anomalies in terms of data
mining technology: You have to find the rules and patterns behind the
columns in the spreadsheet.

With these rules at hand, for each row in the spreadsheet, you decide
what's the most likely value for the College Plans column based on the
values in the other columns. In other words, for each student whose IQ,
Parent Income, and Parent Encouragement are known, you have to decide
whether he's likely to continue with college education. Then, if the likely
college plans don't match the actual college plans, you treat that student
as an anomaly from the set of discovered rules.
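
The comparison described above can be sketched in a few lines. This is an
illustration in Python, not the add-in's VBA code: find_anomalies and
toy_predict are hypothetical names, toy_predict merely stands in for the
rules a trained model would supply, and "Does not plan to attend" is an
assumed label for the second College Plans value.

```python
def find_anomalies(rows, target, predict):
    """Return (row index, actual value, expected value) for every row whose
    target value differs from the value the rules consider most likely."""
    anomalies = []
    for index, row in enumerate(rows):
        expected = predict(row)
        actual = row.get(target)
        if actual != expected:
            anomalies.append((index, actual, expected))
    return anomalies


def toy_predict(row):
    # Hypothetical stand-in for the learned rules; a real model decides this.
    if row["IQ"] >= 100 and row["Parent Income"] > 20000:
        return "Plans to attend"
    return "Does not plan to attend"


# Made-up sample rows, for illustration only.
students = [
    {"IQ": 120, "Parent Income": 45000, "College Plans": "Plans to attend"},
    {"IQ": 125, "Parent Income": 60000, "College Plans": "Does not plan to attend"},
    {"IQ": 85, "Parent Income": 12000, "College Plans": "Does not plan to attend"},
]

anomalies = find_anomalies(students, "College Plans", toy_predict)
print(anomalies)
```

Only the second student is reported: the likely value and the actual value
disagree for that row alone.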

This kind of problem, in terms of data mining, is a classification problem.
You're classifying each row based on the student's plans to go to college.
SQL Server 2005 Data Mining provides a few algorithms for solving
classification problems. The Microsoft_Decision_Trees algorithm is a
particularly good fit for your spreadsheet problem because it's proven to
find good rules with high accuracy, it's highly optimized for performance,
and it describes the rules in an intuitive form.

For a specific attribute, the Microsoft_Decision_Trees algorithm can
determine the factors that most influence the value of that attribute.
Furthermore, it's able to clearly describe the relative importance of these
factors.

Let's assume, for now, the factors affecting a student's college plans are
(in descending order of importance) IQ and the parents' income. The
Microsoft_Decision_Trees algorithm will find and organize rules like those
shown in table 1.

You can think of this structure of rules as a tree because all the students
are first divided based on the most important attribute (here, IQ). Then,
each branch is divided again based on the most important factor for that
subset of data. In table 1, we assumed the parents' income to be the
most important factor for those students with high IQ.

Here's an example of the format of a rule the Microsoft_Decision_Trees
algorithm discovers:

IF "IQ >= 100" AND "Parent's income > 20000" THEN (the student)
MOST LIKELY "Plans to attend"
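
One way to picture how rules like this hang together is as a small tree
that is walked from the root to a leaf. The sketch below is a Python
illustration under stated assumptions: the nested-dictionary layout and the
classify function are hypothetical, the two thresholds come from the rule
above, and the "Does not plan to attend" leaves are invented placeholders
(the article's table 1 is not reproduced here).

```python
import operator

OPS = {">=": operator.ge, ">": operator.gt}

# Each internal node tests one attribute; each leaf holds the likely value.
TREE = {
    "test": ("IQ", ">=", 100),
    "yes": {
        "test": ("Parent Income", ">", 20000),
        "yes": {"leaf": "Plans to attend"},
        "no": {"leaf": "Does not plan to attend"},  # hypothetical leaf
    },
    "no": {"leaf": "Does not plan to attend"},      # hypothetical leaf
}


def classify(node, row, path=()):
    """Walk the tree for one row; return (likely value, rule description)."""
    if "leaf" in node:
        return node["leaf"], " AND ".join(path)
    column, op, threshold = node["test"]
    condition = f"{column} {op} {threshold}"
    if OPS[op](row[column], threshold):
        return classify(node["yes"], row, path + (condition,))
    return classify(node["no"], row, path + ("NOT " + condition,))


likely, rule = classify(TREE, {"IQ": 110, "Parent Income": 30000})
print(likely, "|", rule)
```

The conditions collected along the path form the IF part of the rule, most
important attribute first.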

How likely is "most likely"?


Before moving further, let's see how much you can trust these rules. The
Microsoft_Decision_Trees algorithm never generates rules that aren't
reflected in the data. However, some rules are more important than
others, and some rules are to be trusted more than others.

Clearly, a rule that applies to 1,000 students deserves more consideration
than a rule that applies to only two students. So, a first measure of
confidence is the support: the number of students (spreadsheet rows) for
whom the rule applies, or, in data mining terminology, the number of
cases that support this rule.

Now, let's take another look at the rule above. What does it mean by
"MOST LIKELY Plans to attend"? How likely is "MOST LIKELY"? Let's
assume, in the context of the rule above, that the support is 100. This
means that, in the whole spreadsheet, we found 100 students who have
an IQ greater than or equal to 100 and parents with income greater than
$20,000. Now, if all these students plan to attend college, this is a strong
rule; there seems to be no exception. However, this hardly happens in real
life. You usually end up with something like this: 82 of the students plan
to attend college, but 17 don't plan to attend college, and one didn't
mention any plans.

Now, you can define most likely college plans as those plans shared by
most of the students who match this rule. This is "Plans to attend", with
82 votes. The likelihood of such plans is 82 out of 100 -- 82 percent (or
0.82). This value is the confidence (or probability) of the rule.
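
The arithmetic above is simple enough to check by hand, but a short sketch
shows how support and confidence fall out of the same count. This is
illustrative Python, not part of the add-in; rule_statistics is a
hypothetical name, and "Does not plan to attend" is an assumed label for
the second College Plans value.

```python
from collections import Counter


def rule_statistics(target_values):
    """Given the target values of all rows that match one rule, return
    (support, most likely value, confidence)."""
    counts = Counter(target_values)
    support = sum(counts.values())               # rows the rule applies to
    likely_value, votes = counts.most_common(1)[0]
    return support, likely_value, votes / support


# The example from the text: 82 plan to attend, 17 don't, one didn't say.
values = (["Plans to attend"] * 82
          + ["Does not plan to attend"] * 17
          + [None])

support, likely_value, confidence = rule_statistics(values)
print(support, likely_value, confidence)
```

Following the text, the student with no stated plans still counts toward
the support of 100.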

Microsoft_Decision_Trees finds both the support and the confidence for
each rule and provides access to this information through the SQL-like
DMX language.

Integration with Excel


Microsoft Data Mining provides an environment for describing the data to
be mined and for intuitively displaying the rules and patterns you found.
However, we want to solve this problem completely inside Excel. Microsoft
Data Mining also comes with an extensive programmability solution, a set
of libraries that simplify the task of integration in various applications. It's
this set of libraries that allows seamless integration with Excel and other
applications.

Excel provides a handy feature called add-ins. An add-in is a library that
can extend the workbook functionality. The DataMining Anomaly
Detection.xla file provided with this article is such an add-in. To install it,
perform these steps:

1. Open a workbook in Excel (for example, open the CollegePlans.xls file
that comes with this article).
2. In the Tools menu, select the "Add-Ins …" menu item. A box labeled
Add-Ins appears.
3. In that box, look for the "Browse …" button. After you click on it, the
usual Windows file selection box comes up. Select the DataMining Anomaly
Detection.xla file from the location where you saved it.
4. A new box might show up, with a message like this:

"Copy Data Mining Anomaly Detection.xla to the Add-Ins folder
for <your user name>?"

If you select Yes, Excel saves a copy of this add-in into a special Add-Ins
folder. Otherwise, the add-in only works as long as its original location is
still valid.

5. A new entry shows up in the list of add-ins: Data Mining Anomaly
Detection.
6. After you close the Add-Ins box, Excel adds a new entry to the Tools
menu, Data Mining Anomaly Detection.

Please note that, depending on your current Excel security settings, this
procedure might not work if you've disabled macros. Usually, an error
message indicates this problem. You can solve it by selecting Tools >
Options and, in the Security panel of the resulting dialog, clicking on the
"Macro Security …" button. You see a new dialog that lets you select the
security level for running macros. The Medium level lets you choose
whether to allow macros; in particular, you can select whether you want to
allow the Anomaly Detection add-in to run. After adjusting security
settings, you have to re-add the data mining add-in.

Now, you can use the Anomaly Detection add-in to find problems in the
CollegePlans.xls spreadsheet. You can apply these steps to any
spreadsheet:

1. Select the range of data you want to analyze.

In the spreadsheet, select the columns and rows that contain the data you
want to analyze. The selection must include the names of the Excel
columns in the first row, and it must have at least two rows (at least one
data row, besides the column names). You don't have to select all the
data in the spreadsheet, but that's how we got the results described below.

After you select the range of data, choose the new Data Mining Anomaly
Detection entry from the Tools menu. A dialog like the one in figure 2
appears.
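
The requirements on the selection (column names in the first row, at least
one data row) are easy to check before any mining starts. Here is a minimal
sketch of such a check in Python; it is not the add-in's own VBA code, and
validate_selection is a hypothetical name. The selection is assumed to
arrive as a list of rows, each row a list of cell values.

```python
def validate_selection(cells):
    """Split a selection into (column names, data rows), or raise ValueError
    when it does not meet the requirements described above."""
    if len(cells) < 2:
        raise ValueError("Select the header row plus at least one data row.")
    header = cells[0]
    if any(name is None or str(name).strip() == "" for name in header):
        raise ValueError("Every selected column needs a name in the first row.")
    return [str(name).strip() for name in header], cells[1:]


# Made-up selection, for illustration only.
selection = [
    ["Student ID", "IQ", "College Plans"],
    [1, 110, "Plans to attend"],
]
names, data_rows = validate_selection(selection)
print(names, len(data_rows))
```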

2. Tell the add-in what to do.

The first input of this dialog (the one containing the range "$A$1:$F$9001"
in figure 2) lets you change or make a selection. After you make the
selection, the add-in populates the two drop-down lists for the key column
and anomaly detection column.

If your spreadsheet contains a key column, such as a row identifier,
indicate this in the "Select the key column, if any" field to tell the
Anomaly Detection add-in that the column carries no significant
information. Or, simply don't select that column at all and leave the key
column set to "<none>". Then, select a column to search for anomalies.
For the data set described in this article, select the College Plans column.

Microsoft Data Mining also lets you see the reasons behind the anomalies.
If you don't want this, deselect the box marked "Click on anomalies to
show the rule they break." When enabled, this option creates a hyperlink
inside each Excel cell containing an anomaly. By clicking on that cell, you'll
be able to see the rule that identified the cell as an anomaly. However, if
the column to search for anomalies in your particular spreadsheet already
contains hyperlinks, leave this box unchecked, as the new hyperlinks will
remove the existing ones.

If you have Microsoft Analysis Services 2005 installed on your machine,
you can export the set of rules for your spreadsheet to a file. This lets you
import the rules into Analysis Services 2005 and further explore them. The
"What else can Data Mining do with my data?" section describes the
procedure to follow. If you don't have Microsoft Analysis Services installed
or don't want to export your data, just leave the file name empty.

After you've made all the selections, click on the OK button to instruct the
add-in to start looking for rules and anomalies. This process took us about
2 to 3 minutes for all 9,000 rows of the CollegePlans.xls spreadsheet. A
small dialog appears and informs you of the status of the operation.

After the anomalies are detected, you can move on to inspect the results.

3. Inspect the results.

After Excel completes the analysis and detects the anomalies, it highlights
them on the spreadsheet. The cells detected as anomalies are red and
have comments associated with them. If you selected the "Click on
anomalies shows the rule they break" option, these cells are also
hyperlinks (that is, clicking on the cells shows the rule they break).

Figure 3 shows how the spreadsheet looks after you run the Anomaly
Detection add-in.

As you can see, each anomaly cell has a comment now, describing the
expected college plans for that particular student and the probability
(confidence) of the rule that fits the student.

Finally, the add-in adds a new worksheet to the Excel workbook. It
contains the rules that the add-in found relevant in detecting the
anomalies.

The newly created "Rules found by Data Mining" worksheet looks like
figure 4.

For each rule, the following columns are present:

Rule Description -- Contains the verbose description of the rule. As we
discussed in the "How Data Mining can help" section, the conditions in a
rule description are ordered by importance. Notice the most
important factor for determining the college plans of a student is the
encouragement he receives from the parents.

Confidence and Support -- These are measures for the quality of the
rule, as mentioned before in the "How likely is 'most likely'?" section.

Likely value for College Plans -- This is the most likely value for the
College Plans of students that fit into this rule.

Note that the rules may differ a lot depending on the spreadsheet data
you're analyzing. For example, if you only select the first few rows, you'll
likely find fewer rules and each rule will have fewer conditions.

What else can Data Mining do with my data?


Microsoft Data Mining can do a lot with your data. Besides detecting
anomalies, this add-in can suggest the most likely value when the
information is missing. To try this, just empty a few cells in the College
Plans column and run the Anomaly Detection add-in again. The add-in will
fill the empty cells with the "[Missing data]" string, and they'll have a
comment with their most likely value and a link to a rule that justifies the
comment.

With small changes, you can use the add-in for a purpose other than
detecting anomalies. For instance, you can use it to partition the
spreadsheet in groups of rows with common characteristics (a problem
known in Data Mining as clustering, or segmentation).

While running the add-in, if you indicate a file to export the rules to, you
can load the set of rules on Microsoft Analysis Services 2005. To do this,
open the SQL Server Management Studio and connect to a running
instance of Microsoft Analysis Services 2005. In the context menu
associated with the Databases node of the Object Explorer, select Restore
and indicate the name of the file to which you exported the rules. Also,
enter a name for the database to contain the local mining model you're
restoring and go ahead with the Restore operation. As a result, you create
a new database on the Analysis Services server containing the local
mining model built on your spreadsheet. This model doesn't contain the
actual data, only the set of rules Data Mining discovered while processing
the Excel spreadsheet.

After the model is loaded on the server, a rich set of tools is available for
graphically displaying the rules.

Figure 5 shows how the rules discovered inside the CollegePlans
spreadsheet display inside the Microsoft Decision Tree Viewer. You can
easily follow the way Data Mining discovers rules as well as the
importance of various attributes (spreadsheet columns) in determining the
outcome. You can also easily understand the confidence and support for
each rule.

On the server side, Data Mining analysis can handle large volumes of data
by taking advantage of multiple processors. Also, a collection of various
Data Mining algorithms can help with various business problems.

Microsoft Data Mining comes with a query language similar to SQL. The
DMX query language lets you model data, train algorithms, and execute
business intelligence operations, such as retrieving the rules for the
spreadsheet or determining the most likely values for various attributes.

The next section includes a few examples of the DMX syntax. It also
describes how the add-in works and shows you the implementation
details.

Add-in details
Microsoft Data Mining is designed as a platform for developing
applications that take advantage of Data Mining technology. For
data warehouse applications, it contains a powerful, scalable server that
can handle large volumes of data and serve many users. It also contains a
solution for lightweight, embedded Data Mining usage, such as finding
patterns in Excel spreadsheets.

This embedded solution is called "local mining models" and is a library with
many of the most commonly used functionalities of the Data Mining
server.

For the server and the local mining models, communication with the Data
Mining framework occurs via an OLE DB provider that lets you send
commands in the SQL-like DMX language and read results. The Excel add-
in uses this local server to perform data mining on the spreadsheet data.

First, you have to initialize a connection to the OLE DB provider for
Analysis Services. In the connection string, you can set the Data
Source property, which is usually a server name, to a file name. When a
file name is detected, the provider understands that it's supposed to load
the local server. In VBA, the ADODB library is a great instrument for
dealing with OLE DB providers. The add-in uses ADODB for sending DMX
requests to the local mining model. This is the VBA code snippet that
opens an ADODB connection to a local mining model, hosted in a
temporary file in the root of the C:\ drive:

' Module-level connection to the local mining model
Private m_cnAS As ADODB.Connection

' Data Source is a file name, so the provider loads the local server
Set m_cnAS = New ADODB.Connection
m_cnAS.Open "Provider=MSOLAP.3;Data Source=c:\ExcelAddIn.cub"

Note the following elements specific to the OLE DB provider for Analysis
Services:

- The provider signature is "MSOLAP.3".
- The data source is a file name because the connection is created
  against a local mining model.

To perform the analysis, Data Mining must first model the data. For this to
happen, you must create a mining model object. This object will be the
container of all the rules and patterns in the data you analyze. Creating a
mining model is similar to creating a table in SQL. Here's the DMX
statement that creates the model associated with the spreadsheet:

CREATE MINING MODEL __ExcelTemp(
    [StudentID] TEXT KEY,
    [Gender] TEXT DISCRETE,
    [ParentIncome] DOUBLE CONTINUOUS,
    [IQ] DOUBLE CONTINUOUS,
    [ParentEncouragement] TEXT DISCRETE,
    [CollegePlans] TEXT DISCRETE PREDICT
) USING Microsoft_Decision_Trees

You obtain the name of the model columns from the first line of the
selection to which you apply the add-in. The type of the columns is
inferred from the values in the second row of the selection (the first row of
data). Note the PREDICT keyword that marks the CollegePlans column; it
signifies the model is supposed to find rules for that column.
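
The way such a statement could be assembled from a selection can be
sketched as follows. This is a Python illustration, not the add-in's VBA
code: the function names are hypothetical, and the inference rule (numbers
become DOUBLE CONTINUOUS, everything else TEXT DISCRETE) is an assumption
that happens to reproduce the statement shown above. The sketch also
assumes a key column is always supplied.

```python
def dmx_column_type(sample):
    """Guess a DMX column type from one sample value (assumed rule)."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(sample, (int, float)) and not isinstance(sample, bool):
        return "DOUBLE CONTINUOUS"
    return "TEXT DISCRETE"


def create_model_statement(model, header, first_row, key, target):
    """Build a CREATE MINING MODEL statement from the column names (first
    row of the selection) and the first row of data."""
    columns = []
    for name, sample in zip(header, first_row):
        if name == key:
            spec = "TEXT KEY"
        else:
            spec = dmx_column_type(sample)
            if name == target:
                spec += " PREDICT"   # the column to find rules for
        columns.append(f"    [{name}] {spec}")
    body = ",\n".join(columns)
    return (f"CREATE MINING MODEL {model}(\n{body}\n"
            ") USING Microsoft_Decision_Trees")


header = ["StudentID", "Gender", "ParentIncome", "IQ",
          "ParentEncouragement", "CollegePlans"]
first_row = [1, "Male", 24000, 110, "Encouraged", "Plans to attend"]  # made up

statement = create_model_statement("__ExcelTemp", header, first_row,
                                   key="StudentID", target="CollegePlans")
print(statement)
```

Inferring types from a single row is fragile (an empty or atypical first
cell would mislead it), which is worth keeping in mind when extending the
add-in.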

After you create the model, you have to train it -- that is, feed it with data
to find rules. Here's the DMX statement (again, similar to the SQL INSERT
statement) to do this:

INSERT INTO __ExcelTemp (
    [Gender],
    [ParentIncome],
    [IQ],
    [ParentEncouragement],
    [CollegePlans])
@MySpreadsheet

The Analysis Services OLE DB provider supports parameters, such as
@MySpreadsheet. More specifically, @MySpreadsheet is a data table
parameter: its value is a set of data rows. You pass this type of
parameter in the format described by the XML for Analysis (XMLA) 1.1
specification, which is available at http://www.xmla.org.

The add-in contains some code that reads the whole selection and packs it
into the XMLA format. This code is in the XMLARowsetGen class module,
which is included in the add-in. The XMLARowsetGen object reads the rows
one by one and serializes each row in the XML format described by the
XMLA 1.1 specification. The GenerateRowset method of this class module
returns a string, which contains the XML serialization of all the rows
added so far.
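
To give a feel for what such a serializer does, here is a much simplified
sketch in Python. It is not the XMLARowsetGen code: the RowsetSketch class
and its method names are hypothetical, and the output is only an
approximation of an XMLA rowset. It uses the XMLA rowset namespace and one
row element per spreadsheet row, but it omits the schema section a complete
rowset carries and does not encode column names that aren't valid XML names
(for example, names containing spaces).

```python
import xml.etree.ElementTree as ET

ROWSET_NS = "urn:schemas-microsoft-com:xml-analysis:rowset"


class RowsetSketch:
    """Collects rows and serializes them as a simplified XMLA-style rowset."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.rows = []

    def add_row(self, values):
        if len(values) != len(self.columns):
            raise ValueError("Row length does not match the column list.")
        self.rows.append(list(values))

    def generate_rowset(self):
        """Return the XML serialization of all rows added so far."""
        root = ET.Element("root", {"xmlns": ROWSET_NS})
        for values in self.rows:
            row = ET.SubElement(root, "row")
            for name, value in zip(self.columns, values):
                if value is not None:      # a missing cell is simply left out
                    ET.SubElement(row, name).text = str(value)
        return ET.tostring(root, encoding="unicode")


# Made-up rows, for illustration only.
sketch = RowsetSketch(["Gender", "ParentIncome", "IQ"])
sketch.add_row(["Male", 24000, 110])
sketch.add_row(["Female", None, 95])
rowset_xml = sketch.generate_rowset()
print(rowset_xml)
```

Building the XML through a library rather than by string concatenation
takes care of escaping characters such as & and < in cell values.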

Here's the code that attaches an XMLA 1.1 rowset as a parameter to the
ADODB command:

' Execute Training command
Dim cmd As New ADODB.Command
cmd.ActiveConnection = m_cnAS

' The INSERT INTO DMX Statement
cmd.CommandText = strInsert
cmd.NamedParameters = True

Dim param As ADODB.Parameter
Set param = cmd.CreateParameter

param.Name = "MySpreadsheet"
param.Type = adBSTR

param.Direction = adParamInput
param.Attributes = adParamLong

' The XMLA 1.1 serialized rowset
param.Value = m_xmla.GenerateRowset

Note that you have to name the parameter; this is a requirement of the
OLE DB for Analysis Services provider. Also, the parameter is of type
adBSTR. The statement is then sent to the local Data Mining server, which
processes the mining model.

With the processed model, you can start predicting the most likely value
for the CollegePlans column. You can use a statement similar to SQL
SELECT with JOIN:

SELECT
Predict(__ExcelTemp.[CollegePlans], EXCLUDE_NULL),
PredictProbability(__ExcelTemp.[CollegePlans]),
PredictNodeId(__ExcelTemp.[CollegePlans])
FROM
__ExcelTemp
NATURAL PREDICTION JOIN
@MySpreadsheet as __Input

@MySpreadsheet has the same meaning as above: a parameter that contains
the selection in which you're looking for anomalies.

We'll analyze the semantics of this prediction statement, as it's
important for understanding how Microsoft Data Mining works.

The __ExcelTemp local mining model, created with the CREATE MINING
MODEL … DMX statement, contains a number of columns, matching the
columns in the spreadsheet. The @MySpreadsheet table input parameter
also contains the columns in the spreadsheet.

The NATURAL PREDICTION JOIN part of the prediction statement instructs
the local mining model to map each case in the input table to the columns
of the local mining model based on the names of the columns. If the
columns in the input table have different names from the ones in the
mining model, you have to specify the mappings explicitly with a syntax
similar to SQL JOIN:

ON
[Mining Model Column] = [Input Column]
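
As an illustration of the explicit form, the following Python sketch builds a prediction statement that uses PREDICTION JOIN with an ON clause instead of NATURAL PREDICTION JOIN. The helper name build_prediction_join and the input column names ("Sex", "Parent Income") are hypothetical; they stand in for a spreadsheet whose headers differ from the model's column names.

```python
def build_prediction_join(model, target, mapping, input_alias="__Input"):
    """Build a DMX prediction query with explicit column mappings.

    mapping: dict of mining-model column name -> input column name.
    """
    # One "model column = input column" condition per mapped pair,
    # combined with AND as in a SQL join condition.
    on_clause = " AND\n    ".join(
        f"{model}.[{model_col}] = {input_alias}.[{input_col}]"
        for model_col, input_col in mapping.items()
    )
    return (
        "SELECT\n"
        f"    Predict({model}.[{target}], EXCLUDE_NULL),\n"
        f"    PredictProbability({model}.[{target}]),\n"
        f"    PredictNodeId({model}.[{target}])\n"
        "FROM\n"
        f"    {model}\n"
        "PREDICTION JOIN\n"
        f"    @MySpreadsheet AS {input_alias}\n"
        "ON\n"
        f"    {on_clause}"
    )


# Hypothetical input headers that differ from the model's column names.
query = build_prediction_join(
    "__ExcelTemp",
    "CollegePlans",
    {"Gender": "Sex", "ParentIncome": "Parent Income"},
)
print(query)
```

When the input headers match the model columns exactly, the shorter NATURAL PREDICTION JOIN form makes the ON clause unnecessary.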

The NATURAL PREDICTION JOIN statement, translated to plain English, is:

"For each row in the @MySpreadsheet input table, using the rules
detected while you trained the __ExcelTemp mining model, compute:

- The predicted value for the [CollegePlans] column, excluding null
  values
- The probability of this prediction
- The node identifier of the rule that governs this prediction

When:

- The [StudentID] column of the mining model takes the value of the
  [StudentID] column of the input table
- The [Gender] column of the mining model takes the value of the
  [Gender] column of the input table
- The [ParentIncome] column of the mining model takes the value of
  the [ParentIncome] column of the input table
- And so on"

The predicted value is the most likely value for the [CollegePlans]
column, according to the rules that apply to the current row in the input
table. The probability of the prediction is the confidence of the rule. The
add-in uses the node identifier of the rule later to fetch the description,
support, and confidence for the respective rule from the local Data Mining
server.

For each row in the input table, the statement computes three values;
therefore, the result is a table with three columns and one row per row
in the input table.

Because @MySpreadsheet contains the entire current selection, each row
in the response holds the three computed columns for one row in the
selection.

After the add-in fetches this result, it walks the selection row by row. If
the value of the College Plans column differs from the most likely value
the Data Mining server returned, the respective row is marked as
containing an anomaly.
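
The row-by-row comparison can be sketched in a few lines. This Python example is an illustration under assumptions, not the add-in's VBA: the function name find_anomalies, the tuple layout of each prediction (predicted value, probability, node identifier), and all sample values including the node identifiers are made up.

```python
def find_anomalies(actual_values, predictions):
    """Return the rows whose actual value differs from the predicted one.

    predictions holds one (predicted_value, probability, node_id) tuple
    per row, in the same order as actual_values.
    """
    anomalies = []
    for row_index, (actual, prediction) in enumerate(
            zip(actual_values, predictions)):
        predicted, probability, node_id = prediction
        if actual != predicted:
            # Keep the node identifier so the matching rule can be
            # looked up later.
            anomalies.append(
                (row_index, actual, predicted, probability, node_id))
    return anomalies


# Made-up CollegePlans values and prediction results for three rows.
actual = ["Plans to attend", "Does not plan to attend", "Plans to attend"]
predictions = [
    ("Plans to attend", 0.91, "node-a"),
    ("Plans to attend", 0.78, "node-b"),
    ("Plans to attend", 0.66, "node-a"),
]

flagged = find_anomalies(actual, predictions)
print(flagged)
```

Only the second row is flagged here, because it is the only one whose spreadsheet value disagrees with the most likely value.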

The last DMX statement the add-in issues exports the local mining model
into a file you can reuse in Analysis Services 2005. The statement is:

EXPORT MINING MODEL __ExcelTemp TO 'c:\MySpreadsheetRulesFile.abf'

This statement creates a new file on the hard disk, named
c:\MySpreadsheetRulesFile.abf, which contains an Analysis Services 2005
database with a single mining model. You can later restore this file as a
database or import it into an existing database with an IMPORT
statement:

IMPORT FROM 'c:\MySpreadsheetRulesFile.abf'

The DMX language is powerful and not hard to comprehend for someone
familiar with SQL syntax. The detailed specification for DMX is included
in the "OLE DB for Data Mining" specification, which you can find on the
Microsoft Web site at
http://www.microsoft.com/downloads/details.aspx?FamilyID=c66af00d-51be-4d8d-9056-82cb2410ae3f&displaylang=en.

Simply by using a different algorithm name in the CREATE MINING MODEL
statement and changing some of the functions in the prediction
statement, you can modify the add-in to solve other business problems.

If the data source isn't an Excel spreadsheet and doesn't support an
add-in development language such as VBA, you can solve the anomaly
detection problem inside the Analysis Services 2005 server (although with
a few more clicks than required to run the add-in).

Power at your fingertips


Microsoft SQL Server 2005 Data Mining gives you a lot of versatility in
finding meaning in your data. Use this add-in with your spreadsheets or,
even better, play with the add-in code to further explore what local mining
models can do for you. You can change the algorithm from
Microsoft_Decision_Trees to Microsoft_Clustering. The add-in continues to
work as it did before, but by adding a few more DMX statements, you can
see how your data is partitioned and receive a detailed description of
each of these partitions.

In the past, data mining has been a field of academics and high-end
researchers and analysts. The simplicity and ease of use of the new
Microsoft SQL Server 2005 Data Mining platform bring this powerful
technology to everybody's fingertips.

Figure 1: Example -- The CollegePlans spreadsheet.

Figure 2: The Data Mining Anomaly Detection dialog -- Select your options
before finding the anomalies.

Figure 3: Result of the Anomaly Detection add-in -- Note the highlighted
cells with anomalies and the comment that describes the likely value in
those cells.

Figure 4: Rules found by Data Mining -- The rule's verbose description is
accompanied by the confidence and the support, as well as the most likely
College Plans value for cases matching that rule.

Figure 5: Microsoft Decision Tree Viewer -- The cases are divided based on
various attributes. The Node Legend window shows the confidence and the
support for the selected rule node.

Table 1: Rules organized by Microsoft_Decision_Trees -- From left to
right, each row is divided based on values of the attributes.

              Most important    Secondary factor          Most likely
              factor and value  and value                 college plans
All Students  IQ >= 100         Parents' income > 20000   Plans to attend
                                Parents' income <= 20000  Does not plan to attend
              IQ < 100          ...                       ...


Posted 08/27/2004 Modified 03/07/2005 03:46:38 PM
