
Written by JBacon 14/7/13 1

 To introduce the module by discussing the module


study guide, drawing out important aspects and
discussing what is expected of you.
 To discuss what is meant by data mining, focussing
on the goal of data mining and its association with
other areas such as data warehousing and statistics.
 To provide an overview of some of the major data
mining applications.
 To discuss how identifying patterns in data,
i.e.) relationships, is not always straightforward, as
data quality needs to be considered.
 To be aware of the drawbacks of poor quality data.

Written by JBacon 14/7/13 2


Over the past decade, the amount of data
stored in electronic format has grown at an
explosive rate.
 Every day thousands of terabytes of data are
circulated via the Internet. We have difficulty finding
the information we need in large amounts of data.
 This data contains a large number of variables, from which
various models can be built and trends predicted.

Question: Which organisations hold data about you?

Written by JBacon 14/7/13 3


 Data is Rich, Information is Poor…
Businesses need to extract knowledge and
meaningful information from the vast amounts
of data available to them.

 “There’s gold in that data!”


Businesses want to find patterns in the data
… to make use of the ‘patterns’ (so that their job is
done more effectively)
… to enable them to modify their business strategies
accordingly
…to give them a competitive edge.

Written by JBacon 14/7/13 4


Data miners get a ‘hunch’ about
patterns in data.
Historical Example: survived several many
John Snow(1854) got a ‘hunch’
about cholera coming from
contaminated water. Data were
gathered manually from the died most few
registered deaths and Snow
drew a map where people had
died.
close far
From this basic visualisation what
is the pattern? most people
died close to the water pump.
 Visualisation is a powerful medium to generate initial ideas.
 Data mining is about detective work…

...“The gold you seek may be sparse and very hard to find…it
may not even exist at all…To find it will require a methodical
and organisational approach.”
...” You will be luckier if you have the right tools and know how
to use them.”
...“No tools automatically generate knowledge but help in the
analysis process.”
...“In most cases you will not find the answer, just part of the
jigsaw puzzle.”
 “Data miners are the business detectives.”

Written by JBacon 14/7/13 6


Literally:
 ‘to mine’ means ‘to extract’. ‘Mining’
refers to operations that extract from the
Earth its hidden, precious resources.

 Associated with data, ‘data mining’


suggests an in-depth search to find
additional information which previously
went unnoticed in the mass of data
available.

Written by JBacon 14/7/13 7


“Data mining is the process of
data selection, exploration and
building models using vast data
stores to uncover previously
unknown patterns.” (SAS
Institute 2004)

i.e.) data mining involves


analysing large quantities of
data in order to discover
meaningful patterns and rules,
for the purpose of decision
making.
 Data mining is the process of using “raw”
data to infer important “business”
relationships.
 Data Mining is a collection of powerful
techniques intended for analysing large
amounts of data.
 There is no single data mining approach,
but rather a set of techniques that can be
used stand alone or in combination with
each other.
Written by JBacon 14/7/13 9
 The goal is to allow a business/organisation
… to IMPROVE its ‘competitive
edge’
… the potential for increasing its
market share by giving it a
competitive advantage.

How?
Through better understanding of its
data.
Written by JBacon 14/7/13 10
Written by JBacon 14/7/13 11
 For organisational learning to take place, data
from many sources must be gathered together
and organised in a consistent and useful way –
hence, Data Warehousing.
 Data Mining techniques then make use of the data in a
Data Warehouse.

Written by JBacon 14/7/13 12


(Diagram: the enterprise “database” – customers, orders,
transactions, vendors, etc. – is copied, organised and summarised
into a Data Warehouse, which the Data Miners, the “business
detectives”, then mine.)
Written by JBacon 14/7/13 13


In statistics:
 data often needs to be collected

 sample sizes are smaller

 accuracy is very important

whilst
 Data Mining makes use of the data available
and accuracy has to be balanced with
timeliness.

Written by JBacon 14/7/13 14


 Direct marketing – for
targeting people who are
most likely to buy certain
products/services,
e.g.1) supermarkets have
used point-of-sale data to
decide what coupons to
print for which customers.
e.g.2) web retailers have
used past purchases in
order to determine what to
display when customers
return to the site.
Written by JBacon 14/7/13 15
 Typical mailing of 100,000 pieces costs about
£100,000 (£1/piece)
 Typical response rates < 10%
 Any list of customers that can be ranked by
likelihood of response is good
 Campaign focused at top of list to increase
response rate %
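As an illustration of why such a ranking pays off, here is a minimal Python sketch (not part of the module materials): it assumes some model has already assigned each customer a predicted response probability, ranks the list, and mails only the top of it. All names and numbers are hypothetical.

```python
# Illustrative only: rank customers by a model's predicted response
# probability and mail just the top slice of the list.
customers = [
    {"id": 1, "p_response": 0.01},
    {"id": 2, "p_response": 0.12},
    {"id": 3, "p_response": 0.04},
    {"id": 4, "p_response": 0.18},
]

ranked = sorted(customers, key=lambda c: c["p_response"], reverse=True)

budget = 2                       # pieces we can afford to mail
to_mail = ranked[:budget]        # campaign focused at the top of the list

rate_all = sum(c["p_response"] for c in customers) / len(customers)
rate_top = sum(c["p_response"] for c in to_mail) / len(to_mail)
print(f"expected response rate, mail everyone: {rate_all:.0%}")
print(f"expected response rate, mail top {budget} only: {rate_top:.0%}")
```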

Written by JBacon 14/7/13 16


 Trend analysis – to identify trends,
e.g.) modelling the stock market.

 Fraud detection – to determine fraudulent


use of various products,
e.g.) insurance claims, cellular phone
calls, credit card purchases, etc.

Written by JBacon 14/7/13 17


 Existing customers – organisation has a lot of data, e.g.)
number of calls made, length of call, usage of text
message service, personal data: age, occupation, etc.

From the phone company’s perspective:


 Who is likely to remain a loyal customer?

 What determines whether a customer will respond to a


certain offer?

 What is the next product the customer will want?

Data mining techniques help make it possible to exploit


the vast amounts of data generated by interactions
with customers to get to know them better!

Written by JBacon 14/7/13 18


 New customers – no data.

From the phone company’s perspective:


 How do we attract new customers?
Use data on existing customers.

 Customer behaviours captured in corporate data


are not random, but reflect the differing needs,
preferences and treatment of customers.
The goal of data mining is to find patterns in
the data that shed light on those needs and
preferences.

Written by JBacon 14/7/13 19


 Some relationships
are meaningful
BUT
 Some relationships
are meaningless.
Just because there are
lots of variables in the
data doesn’t mean all
of them need to be
used in the
relationship!
 The task is made difficult by the fact that
the patterns are not always strong, and
 the signals sent by customers are noisy
and confusing.

 Separating signal from noise – recognising


the fundamental patterns beneath
seemingly random variations – is an
important role of data mining.

Written by JBacon 14/7/13 21


 Data mining techniques can produce lots of
statistical models but the quality of the data
that a data miner is working with is the key!
So data quality is fundamental.

 Poor quality data makes data mining very


risky and can lead to:
 Distorted models when summarising the data, or
 Useless patterns within the data.

Written by JBacon 14/7/13 22


A bank uses a scorecard based on various
characteristics of a customer,
e.g.) age, income, mortgage, previous loans etc.

However the bank omitted to record whether


the customer was married so failed to take into
account the overall joint income in these cases!
So the bank didn’t give as many loans as in previous
years to couples/families.

 Result: Distorted model provided incorrect conclusion.


The bank made wrong decisions on whether to give a
loan, which led to lost revenue!

Written by JBacon 14/7/13 23


In a hospital, a patient’s date of birth should be
recorded. If, for some reason, the d.o.b. is missing
when the data are entered into the database,
six ones ‘111111’ are entered as default as the
system won’t accept a series of zeros.

Thus: a pattern emerges that a very large proportion of


patients have birthdays on 11th November!

 Result: Useless pattern!


Understanding/knowledge of the variables and data
structure would have prevented wasted time being
spent analysing inappropriate variables and discovering
useless data patterns.

Written by JBacon 14/7/13 24


…requires an understanding of the distortion
process.

Data miners attempt to detect the source of


errors,
e.g.) Are there more errors…
…in a particular department?
...on a Friday afternoon?

 If an understanding can be gained, then a model


can be produced and necessary adjustments
made.

Written by JBacon 14/7/13 25


 But a data miner analyses data – a data miner
doesn’t perform miracles

 GIGO (Garbage In Garbage Out)

 In practice, 60-75% of analysis time is spent


cleaning the data!

 Quality costs - At what point does the cost of


detecting errors in a dataset outweigh the
gains?

Written by JBacon 14/7/13 26


1. Data are being produced
2. Data are being warehoused
3. Computing power is more affordable
4. Competitive pressures are enormous
5. Data Mining software is available

Written by JBacon 14/7/13 27


As previously discussed, data are produced but
 Often data are gathered because it is needed for
some operational purpose,
e.g.) inventory control or billing.

 Once it has served that purpose, the data either


remains on disk (untouched) or is discarded!

Thus, the potential of data has not been fully


exploited by all businesses and organisations YET!

Written by JBacon 14/7/13 28


Two main reasons:

1)Information is scattered within different archive


systems that are not connected with one another,
producing an inefficient organisation of the data.

2)The lack of awareness and/or understanding


about statistical tools and their potential.

Written by JBacon 14/7/13 29


 Software and hardware – to allow
companies to collect and organise
data in structures that give easier
access and transfer.

 Data Mining.

Written by JBacon 14/7/13 30


 The largest challenge a data miner may face is the sheer
volume of data in the data warehouse.

 It is quite important, then, that summary data also be


available to get the analysis started.

 A major problem is that this sheer volume may mask the


important relationships the data miner is interested in.

 The ability to overcome the volume and be able to


interpret the data is important.

Written by JBacon 14/7/13 31


You should now :
 Know what the goal of data mining is and be
able to explain data mining in more detail
(knowing the differences when compared to
statistics or data warehousing).
 Be able to describe in detail some of the major
data mining applications.
 Appreciate how identifying patterns in data is
not always straightforward.
 Be aware of the need to consider data quality in
data mining.

Written by JBacon 14/7/13 32


 Written notes for Week 1 Tutorial.
 Completion of SAS Trainer.
 Background reading of chapter 1 in text:
Berry Michael JA and Linoff Gordon S (2003).
Data Mining Techniques. 2nd ed. Wiley. pp1-19.
 Have access to the mandatory text:
Data Mining using SAS Enterprise Miner: A Case
Study approach. 2nd ed. SAS Publishing.
*NOTE You will need to reference this text in the Week
3 lab*

 Read over your lecture notes and supplement your


notes with further background reading.

Written by JBacon 14/7/13 33


(Where appropriate) References will be provided
for each lecture, e.g.):
Mohamed, Omar (2004). Career story & You’ve got
direct mail. Significance; vol 1(2): pp76-80.
Berry Michael JA and Linoff Gordon S (2011). Data
Mining Techniques. 3rd ed. Wiley. pp1-11; 68-74.

Also of general interest:


SAS Institute, Inc (2004) http://www.sas.com

Note Lectures provide an overview of the module


topics. It is essential that you read around the
subject area in order to gain the required
depth of knowledge.
Written by JBacon 14/7/13 34
written by JBacon 14/7/13 1
 To explain the need for a data mining cycle when
viewed from the perspective of Customer
Relationship Management and to discuss each of
the stages in the Virtuous Cycle of data mining.
 To introduce the concepts of hypothesis testing,
models, profiling and prediction.
 To raise the awareness of a need for a data mining
methodology.
 To make you appreciate the need for learning some
technical details about the data mining techniques
used in data mining software as a customer-centric
corporate culture rarely exists in the real world!

written by JBacon 14/7/13 2


 Marketing literature makes it look easy!!!
 Users of data mining software will be told: “just
apply automated algorithms created by neural
networks, decision trees, etc. THEN
…magic happens!!!”

Not so …
 Data Mining is an iterative, learning process
 Data Mining takes conscientious, long-term hard
work and commitment, but…

 Data Mining’s Reward: Success transforms a


company from being reactive to being proactive!

written by JBacon 14/7/13 3


 To be effective, data mining must occur within a
context that allows an organisation to change
its behaviour as a result of what it learns.

For example, it is no use knowing that mobile


phone customers who are on the wrong rate
plan are likely to cancel their phone contracts if
there is no one in the organisation empowered
to propose that the customers switch to
something more suitable for them.

written by JBacon 14/7/13 4


 Data mining should be embedded in a corporate
customer relationship strategy that spells out the
actions to be taken as a result of what is learned
through data mining,
e.g.)
 When low-value customers are identified, how will they
be treated?
 Are there strategies in place to stimulate their usage
to increase their ‘value’ as a customer?
 Or does it make more sense to lower the cost of
serving them?
 If some channels bring in more profitable customers,
how can resources be shifted to these channels?

written by JBacon 14/7/13 5


 Just ‘finding patterns’ is not enough!
 Businesses must:
◦ Respond to the pattern(s) by taking action.

◦ Turning:
 Data into Information
 Information into Action
 Action into Value

 Hence, the Virtuous Cycle


of Data Mining.

written by JBacon 14/7/13 6


In order to form a learning relationship
with its customers, a business/organisation
must be able to:
1. Notice – what its customers are doing.
2. Remember – what it and its
customers have done over time.
3. Learn – from what it has remembered.
4. Act On – what it has learned to make
customers more profitable to the business!

written by JBacon 14/7/13 7


1. Identify the business problem or
opportunity.
2. Mine the data to transform it into
actionable information.
3. Act on the information.
4. Measure the results.

written by JBacon 14/7/13 8


Many business processes are good
candidates for data mining, e.g.)
Planning for a new product introduction

Planning direct marketing campaigns

Understanding customer behaviours.

Note It is important to measure the impact of


whatever actions are taken in order to judge the
value of the data mining effort itself. If we cannot
measure the results of mining the data, then we
cannot learn from the effort, indicating that there
would not be a virtuous cycle.

written by JBacon 14/7/13 9


Measurements of past efforts and ad hoc
questions also suggest data mining
opportunities:

What types of customers responded to the last


campaign?

What other product(s) should be promoted with


our XYZ product?

written by JBacon 14/7/13 10


 Success is making business sense of the data

 Numerous data “issues” interfere with the ability to use


the results of data mining:
Bad data formats, e.g.) missing data or bogus data.
Confusing data fields.
Legal ramifications, i.e.) having to provide a legal
reason when rejecting a loan (rather than saying ‘my
neural network told me to!’)
Organisational factors including lack of timeliness

written by JBacon 14/7/13 11


This is the purpose of Data Mining
(with the hope of adding value.)
 Actions are usually in line with what the business is
doing anyway, e.g.)
Interactions with customers, prospects, suppliers.
(Different messages may go to different people).

Modifying service procedures to prioritise customer


service.

Adjusting inventory levels, etc.

written by JBacon 14/7/13 12


 This stage assesses the impact of the action
taken, and is the stage that is often overlooked!
 “How can the results be measured?” should be considered
at the beginning when identifying the business problem (not
after it is “all over”),
e.g.) A company that sends out coupons to encourage sales
of their products will measure the coupon redemption rate.
 Comparing expectations to actual results makes it possible
to recognise promising opportunities to exploit on the next
round of the virtuous cycle.

written by JBacon 14/7/13 13


 Some examples of questions that have future value:
Did this campaign do what we hoped?
What are the characteristics of the most loyal
customers reached by this campaign?
Did these customers purchase additional products?
Did some messages (or offers) work better than
others?
 Data mining is about connecting the past –
through learning – to future actions.

written by JBacon 14/7/13 14


 The outline is the same but the emphasis shifts. Instead of
identifying a business problem, we turn our attention to
translating business problems into data mining problems,
i.e.)
Transforming data into information would include
hypothesis testing, profiling and predictive modelling.

Taking action would include model deployment and


scoring.

Measurement would assess a model’s stability &


effectiveness before it is used.

written by JBacon 14/7/13 15


 The Virtuous Cycle of Data Mining (with 4
stages) is iterative.

 No steps should be skipped.


 Common sense prevails with respect to how
rigorous each step is carried out.
 Simplest approach: ad-hoc queries to test
hypotheses.
 Rigorous approach: The 4 stages of the
virtuous cycle expand to become an 11-step
methodology.

written by JBacon 14/7/13 16


 A Data Mining methodology which includes Data
Mining Best Practices helps to avoid:
◦ Learning things that are not true
◦ Learning things that are true, but not useful

 Learning things that are not true is the most


dangerous!
 Patterns may not represent any underlying rule.
 The sample may not reflect its parent
population, hence bias.
 Data may be at the wrong level of detail
(granularity; aggregation).
written by JBacon 14/7/13 17
 A hypothesis is a proposed explanation whose
validity can be tested by analysing data.
 Its purpose is to substantiate or disprove
preconceived ideas.

 Hypothesis testing is at its most valuable when


it reveals that the assumptions that have been
guiding a company’s actions in the
marketplace are incorrect.
 Hypothesis testing is useful (more technical
details to follow in future lectures) but not
totally sufficient for the data miner.…models
need to be created based on the data.

written by JBacon 14/7/13 18


 Model: An explanation or description of how
something works that reflects reality well
enough that it can be used to make inferences
about the real world. We use models every
day…e.g.) London Underground map
 Data Mining uses a subset of the data called the Model
Set to build models. The Model Set includes:
◦ Training Set – used to build a set of models.
◦ Validation Set – used to choose the best model.
◦ Test Set – used to determine how the model
performs.
written by JBacon 14/7/13 19
Profiling and
prediction differ
only in the time
frames of the input
and target
variables.
Profiling: timing is
PAST
Prediction: timing is
FUTURE.
 Profiling (e.g. surveys)
◦ Uses data from the past to describe what happened in the past.
◦ Often based on demographic variables (location, age, gender, etc.)
e.g.) insurance premiums: a 17yr old male pays more for car
insurance than a 60yr old female.
 Prediction
◦ Goes one step further (than profiling). Prediction uses data from
the past to predict what is likely to happen in the future.
e.g.) In a profile of CD owners, there is no awareness of the
relationship between savings balances and CD ownership. But it is
likely that a high savings balance is a predictor of future CD
purchases.

written by JBacon 14/7/13 21


1. Translate business problem into Data Mining problem
2. Select appropriate data
3. Get to know the data
4. Create a model set
5. Fix problems with the data
6. Transform data to bring information to the surface.
7. Build models
8. Assess models
9. Deploy models
10. Assess results
11. Begin again

written by JBacon 14/7/13 22


written by JBacon 14/7/13 23
 …is a customer-centric culture and all the
resources to support it,
e.g.) data, data miners, data mining software
AND
 … a data mining infrastructure,
i.e.) the need for good information is ingrained in
the corporate culture, operational procedures are
designed with the need to gather good quality
data and the requirements for data mining shape
the design of the data warehouse.

But changing the culture is NOT easy!

written by JBacon 14/7/13 24


 When there is an ongoing need for data mining, it is
best done internally so that insights produced during
mining remain within the organisation (rather than
being outsourced!)

 Choosing software for the data mining environment


is important.

 However, the success of data mining depends more


on having good processes and good people (with
some technical data mining knowledge) rather than
particular software on their desktops!

written by JBacon 14/7/13 25


You should now know:
 Each of the 4 stages in the Virtuous Cycle of
data mining.
 What is meant by: hypothesis testing, models,
profiling and prediction.
 Why YOU need to learn some of the technical
details about the data mining techniques used
in data mining software as ideal data mining
environments such as the customer-centric
corporate culture described, rarely exists in
the real world!

written by JBacon 14/7/13 26


To be completed before Week 3 tutorial:
 Berry Michael JA and Linoff Gordon S
(2011). Data Mining Techniques. 3rd ed.
Wiley. pp11-68.

written by JBacon 14/7/13 27


written by JBacon 14/7/2013 1
 To address main issues concerning data,
e.g.) data preparation, data format expectations for
data mining, derived variables, outliers and missing
data.

 To explain the different types of data,


i.e.) Quantitative data versus Qualitative data
 To indicate what type of graphical displays are
appropriate for quantitative data and qualitative data.

 Within the context of SAS Enterprise Miner to explain


what is meant by: model roles and measurement
levels and to provide an overview of the initial nodes
that you will come across in SAS Enterprise Miner:
Input Data Source, Data Partition, etc.

written by JBacon 14/7/2013 2


In manufacturing, raw materials are transformed
into finished products; so too with data that is to
be used for data mining.
ECTL:
Extraction, Clean, Transform, Load

This is the general process for preparing data for


data mining.

written by JBacon 14/7/2013 3


 All data mining algorithms want their data in tabular
form – rows & columns as in a spreadsheet or
database table.

written by JBacon 14/7/2013 4


 A row is the unit of action.
 It is one instance (or record or case) that, by
understanding patterns at the row level, provides
useful insight.
e.g.) each row refers to a customer.

written by JBacon 14/7/2013 5


 The columns represent the data in each record and
describe aspects of the customer,
e.g.) credit card limit, final balance, minimum payment, etc

Thus each column is a variable ‘of interest’.


 Also, columns contain the results of calculations referred
to as derived variables,
e.g.) total no. of times the card has not been paid off in full

Derived variables are calculated columns not in the


original data, i.e.) combinations of columns,
summarisations, etc. Scores are derived variables and
will often condense a large number of potential variables
into a manageable few. Scores are assigned to
customers based on their usage patterns.
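A small pandas sketch of derived variables, using hypothetical column names (the module's SAS tools are not shown here): new columns are computed as combinations or summarisations of existing ones.

```python
# Hypothetical example: derived variables added as new columns.
import pandas as pd

cards = pd.DataFrame({
    "customer_id":   [1, 2, 3],
    "final_balance": [120.0, 0.0, 560.0],
    "credit_limit":  [1000.0, 500.0, 800.0],
    "months_not_paid_in_full": [2, 0, 7],   # count over the last year
})

# Derived variables: combinations / summarisations of existing columns.
cards["utilisation"]   = cards["final_balance"] / cards["credit_limit"]
cards["revolver_flag"] = (cards["months_not_paid_in_full"] >= 6).astype(int)
print(cards)
```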

written by JBacon 14/7/2013 6


Each row represents the customer and whatever might be useful for
data mining.

written by JBacon 14/7/2013 7


 However, sometimes information is embedded
in specific codes and in our own domain
knowledge, so could be extracted and
transformed into an appropriate format.

 Typical examples:
 features from dates and/or time
 features from telephone numbers, addresses,
product codes, identification numbers, etc.

written by JBacon 14/7/2013 8


 All data in a single table (rows and columns).

 Each row corresponds to an entity (customer), i.e.)


a customer’s record.

 Single value columns should be ignored.

 Columns with unique values (i.e. a different value


for every row) should be ignored.

 For predictive modelling, the target column should


be identified and all synonymous columns removed.
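The rules above can be sketched in pandas as follows; the DataFrame and its column names (including the target "defaulted") are hypothetical.

```python
# Sketch: apply the rules above to a (hypothetical) pandas DataFrame.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],      # unique for every row -> ignore
    "country":     ["UK", "UK", "UK", "UK"],  # single value          -> ignore
    "income":      [21000, 35000, 21000, 44000],
    "defaulted":   [0, 1, 0, 0],              # target column
})

single_valued = [c for c in df.columns if df[c].nunique() == 1]
all_unique    = [c for c in df.columns if df[c].nunique() == len(df)]

inputs = df.drop(columns=single_valued + all_unique + ["defaulted"])
target = df["defaulted"]
print(inputs.columns.tolist())   # ['income']
```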

written by JBacon 14/7/2013 9


Variables have important Model Roles in data mining:
 Target (or response) variable or y-variable: Used in
predictive modelling and represents the variable we want
to predict in the model.
(Sometimes referred to as dependent variable.)

 Input (or explanatory) variables or x-variables: these are


input into the model, and are believed to ‘explain’ the
behaviour of the y-variable.
(Sometimes referred to as independent variables.)

Note Columns not used in a particular data mining


analysis can be labelled as a rejected model role.
However there is also the opportunity in SAS Enterprise
Miner for a variable to have a model role ID as
sometimes features from ids, phone nos., dates,etc.
could be extracted into useful information.
written by JBacon 14/7/2013 10
 Determining the right ‘measure’ requires
understanding of what the data represents.
 In SAS Enterprise Miner, the following measures
for variables (columns of data) are used:

 Categorical variable or character variable or


class variable,
i.e.) binary, nominal or ordinal data

 Interval variable.

written by JBacon 14/7/2013 11


(Diagram: types of data. Quantitative data divides into discrete
data and interval data (or continuous data); Qualitative (or
categorical) data divides into binary, nominal and ordinal.)

written by JBacon 14/7/2013 12


 Qualitative data (or categorical data) are
categories, words, that are fundamentally non-
numeric.
 Class variables and character variables are
qualitative data. They have a well-defined set of
values or ‘category labels’ that they can take on.
e.g.) In the house dataset in the SAS trainer,
location was a categorical variable, with values:
1=NE, 2=NW, 3=SW, 4=SE.

Note, the categories can be represented by


numbers, but the numbers are just a label – it
doesn’t make sense to do any arithmetic on the
numbers!

written by JBacon 14/7/2013 13


 Nominal data have more than 2 categories,
with no implied order.
e.g.) location variable:
1=NE, 2=NW, 3=SW, 4=SE.
 Ordinal data have more than 2 categories,
with an implied order.
e.g.1) small, medium, large.
e.g.2) first, upper second, lower second, third,
non-hons.
 Binary data have only 2 categories.
e.g.1) male, female.
e.g.2) award loan, do not award loan.

written by JBacon 14/7/2013 14


 Quantitative data are numeric data. Numeric
variables measure ‘something of interest’ so
calculations, summations, summarisations make
sense on quantitative data.

Examples of different types of Quantitative data


 Interval data (or continuous data) are data that
are, theoretically, on a continuous scale but for
ease of use are recorded according to the level of
accuracy required, e.g.) time (mins? secs? )
Calculating an average on an interval variable
makes sense, e.g.) average income.
 Discrete data can take on only certain values,
e.g.) family size, number of times minimum
payment made, shoe size, etc.

written by JBacon 14/7/2013 15


Summarising Qualitative data
 Frequency (or relative frequency) distribution
 Bar chart
 Pie chart

Summarising Quantitative data


 Frequency (or relative frequency) distribution
 Histogram
Eg) charts shown: bar charts for a nominal variable (JOB) and a
binary variable (BAD); a histogram for an interval variable (LOAN).


 An outlier is a value that is unusual and lies outside
the expected values of the data,
e.g.) a director’s salary. Outliers are extreme values
that are not typical of the rest of the dataset.

 Missing data can occur for various reasons:


 empty values
 non-existent values
 incomplete data
 uncollected data.
 Dirty data (e.g. erroneous postcodes, etc.)
 Inconsistent values (e.g. different revisions)

written by JBacon 14/7/2013 19


The approaches are similar:
 Do nothing
 Filter the rows containing them
 Ignore the column
 Replace the outlying value(s)
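A brief pandas sketch of two of these approaches applied to an outlier, flagged here with the common 1.5 × IQR rule of thumb (a convention chosen for illustration, not prescribed by the lecture).

```python
# Sketch: flag an outlier with the 1.5 * IQR rule of thumb, then either
# filter the rows containing it or replace the outlying value.
import pandas as pd

salaries = pd.Series([21, 24, 26, 27, 29, 31, 33, 250])   # £k; 250 = director

q1, q3 = salaries.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (salaries < q1 - 1.5 * iqr) | (salaries > q3 + 1.5 * iqr)

filtered = salaries[~is_outlier]                          # filter the rows
replaced = salaries.mask(is_outlier, salaries.median())   # replace the value(s)
print(filtered.tolist())
print(replaced.tolist())
```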
 Data mining is largely concerned with
building models.

 A model is simply a set of rules that


connects a collection of inputs to a particular
target or outcome.

 Under the right circumstances, a model can


result in insight by providing an explanation
of how outcomes of particular interest, such
as placing an order or failing to pay a bill,
are related to and predicted by the available
facts.
written by JBacon 14/7/2013 21
 Data mining can be prescriptive
(sometimes called directed data
mining) or descriptive (sometimes
called undirected data mining).
 The distinction between prescriptive
and descriptive refers to the goal of
the data mining.
written by JBacon 14/7/2013 22
 The goal is to automate a decision-making
process by creating a model capable of
predicting / estimating the value of a
particular target variable.

 The results in prescriptive data mining will be


acted upon directly,
e.g.) someone will or will not be offered credit
or insurance.

written by JBacon 14/7/2013 23


 The goal is to gain increased understanding
of what is happening inside the data and
thereby in the wider world that the data reflects,
e.g.) a credit card company wants to try to
understand the spending habits of
undergraduates.

 With descriptive data mining the best model


may not be the one that gives the most accurate
predictions – often the insight gained through
building the model is the most important part!
 This undirected data mining attempts to find
similarities among groups of records without the
use of a particular target field.
written by JBacon 14/7/2013 24
 Examples of directed mining techniques
 regression,
 decision trees and
 neural networks.

 Examples of undirected mining techniques


 Cluster analysis
 Market basket analysis.

written by JBacon 14/7/2013 25


 Variable – a quantity that varies, such that it
can take on any one of a specified set of
values.
 Observations – the observed values of the
variables.
 Data set – a list of the values of one or more
variables, the observations.
 Raw data – the data in the form it has been
collected.

written by JBacon 14/7/2013 26


 Accesses SAS datasets.
 Automatically creates a metadata sample
(based on random sample of 2,000
observations from the dataset that is
identified in the Input Data Source node.)
 Automatically selects measurement levels
and model roles for each variable (but these
can be changed, if required).
 Displays summary statistics for interval and
class variables.

written by JBacon 14/7/2013 27


 In order to build an effective model, SAS
Enterprise Miner does not process all of the
original dataset in one go! Why?

 Instead, it divides the data into 3 subsets for


training, validation, and testing the model.
 The training data subset is used for
preliminary model fitting;
 the validation data subset is used to tune
model weights during estimation;
 the test data subset is used for model
assessment.
In the Data Partition node, by default:
40% training, 30% validation & 30% test
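The split itself is the Data Partition node's job in SAS Enterprise Miner; purely as an illustration of the same idea outside SAS, a scikit-learn sketch of a 40/30/30 split might look like this.

```python
# Sketch (outside SAS): a 40% / 30% / 30% split into training,
# validation and test subsets using scikit-learn.
from sklearn.model_selection import train_test_split

rows = list(range(100))                    # stand-in for 100 observations

train, rest = train_test_split(rows, train_size=0.4, random_state=1)
valid, test = train_test_split(rest, train_size=0.5, random_state=1)

print(len(train), len(valid), len(test))   # 40 30 30
```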
written by JBacon 14/7/2013 28
 The larger the dataset, the more likely it contains some
missing or outlying values.
 Depending on the model deployed, the effects of ‘flawed’
data can be trivial or dramatic,
e.g.) If your model is based on a decision tree, missing values
cause no harm, because decision trees handle missing
values directly. However, regression and neural network
models ignore observations that contain any missing
values.

 It is therefore advisable to impute missing values before


fitting a regression model or neural network. (Failure to
impute missing values for these models may result in a loss
of much data and produce inferior models.)
 The Replacement node imputes missing values.

written by JBacon 14/7/2013 29


 By default, SAS Enterprise Miner creates the
values for data replacement by examining a
random sample of the training subset:
 Interval variables have their missing values
replaced with the mean of their sample.

 Binary, nominal, and ordinal variables have


their missing values replaced with the most
commonly occurring level.
Note Some data stores use a special value,
such as 999, to encode their missing values!
(In this case, the encoded values need
replacing before imputation can occur!)
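The same default rules (mean for interval variables, most commonly occurring level for class variables, after decoding any special missing-value codes) can be sketched in pandas; this is an illustration, not the Replacement node itself, and the column names are hypothetical.

```python
# Pandas sketch of the replacement rules described above (not SAS itself).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [21000.0, np.nan, 28000.0, 999.0],  # 999 used as a missing code
    "region": ["NE", "NW", None, "NE"],
})

# First decode the special value into a real missing value...
df["income"] = df["income"].replace(999.0, np.nan)

# ...then impute: mean for an interval variable,
# most commonly occurring level for a class variable.
df["income"] = df["income"].fillna(df["income"].mean())
df["region"] = df["region"].fillna(df["region"].mode()[0])
print(df)
```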

written by JBacon 14/7/2013 30


You should now know:
 What the main issues are concerning data.
 The difference between Quantitative data and
Qualitative data, be able to explain what is meant by:
interval, discrete, binary, ordinal and nominal data.
 What type of graphical displays are appropriate for
quantitative data and qualitative data.
 What is meant by an outlier and the reasons why
missing data occur.
 How to deal with missing data and outliers.
 What is meant by: model roles and measurement levels
(of variables) within the SAS Enterprise Miner
environment.
 The role of the Input Data Source node, Data Partition
node and the Replacement node, within SAS Enterprise
Miner.

written by JBacon 14/7/2013 31


written by JBacon 14/7/13 1
 To explain that descriptive statistics generally measure:
the centre of a distribution (averages),
the spread of a distribution (measures of dispersion) or
the extent to which the shape of a distribution is not
symmetrical (skewness and kurtosis).

 To define various statistical terminology including:


population, sample, statistical inference, mean, median,
mode, percentile, decile, quartile, range, inter-quartile
range, variance, standard deviation, skewness, kurtosis.

 To introduce the concept of the normal distribution as a


symmetrically bell-shaped distribution and to compare it
with positively and negatively skewed distributions.

written by JBacon 14/7/13 2


 To know how to properly present
information and obtain reliable forecasts.

 To know how to draw conclusions about


populations based on sample information.

 To know how to improve processes.


written by JBacon 14/7/13 3
 For statisticians, data mining has a negative
connotation – one of searching for data to
support preconceived ideas! Statistics don’t lie
but liars use statistics!
 Statistics was developed as a discipline to help
scientists make sense of observations and
experiments, hence the scientific method.

 Problem has often been too little data for


statisticians! In comparison, Data Miners are
faced with too much data! Many of the
techniques & algorithms used are shared by
both statisticians and data miners.
written by JBacon 14/7/13 4
…differs from the standard statistical
approach in several areas:
 Data miners tend to ignore measurement
error in raw data.
 Data miners assume that there is more than
enough data.
 Data mining assumes dependency on time
everywhere.
 It can be hard to design experiments in the
business world.
 Data is truncated and censored.
written by JBacon 14/7/13 5
 Population (universe) – all possible observations
of a variable. The population consists of all
members of the group we wish to study, not just
those members we have actually observed!

 Sample - a subset of observations taken from the


population for detailed analysis. The sample needs
to be selected in such a way as to be
representative of the population as a whole to
avoid bias.

written by JBacon 14/7/13 6


Parameter – characteristic of the population,
e.g.) population mean,
population standard deviation.

Statistics – characteristic of the sample of


observations,
e.g.) sample mean, sample standard deviation.
Statistics are often used as an estimator of a
population parameter.

 In
Statistical Inference we are drawing
conclusions (inferring something) about the
population based on results from sample(s).

written by JBacon 14/7/13 7


(Diagram: Population – use parameters to summarise features;
Sample – use statistics to summarise features.
Inference on the population is made from the sample.)


written by JBacon 14/7/13 8
Generally, the whole area of statistics may
be
classified as:

 Descriptive statistics – describing data.

 Inferential
statistics– to enable us to
draw conclusions (infer) from the data
(sample) and make decisions (about the
population) based on these conclusions.
written by JBacon 14/7/13 9
 Qualitative data, such as products, channels,
regions, and descriptions are a main focus of
data mining.
 The data are often represented in a frequency
distribution and then represented graphically in:
 Bar Chart – bars show number of times different
values occur.
Note The length
of bar is in
proportion to
frequency.

written by JBacon 14/7/13 10


 Discrete Quantitative data are also represented in bar
charts (provided sensible number of ‘discrete’ values)
e.g.) Family size var displayed in SASTrainer.
 However, when the number of ‘discrete’ values is large
(10 or more) then grouping into sensible intervals
occurs.
 This is what happens with interval data – as the data
can (theoretically) take on any value on a continuous
scale, the data are grouped into sensible intervals. e.g.)
Income could be grouped as <£15K, £15-20K, £20-
25K, £25-30K, £30-35K, £40K+

 Histograms – look similar to a


bar chart but the bars lie next to
each other on horizontal axis.
Note The area of bar is in
proportion to frequency.

written by JBacon 14/7/13 11


Descriptive statistics attempt to describe a set
of data using a single summary statistic.
There are various descriptive statistics that
MEASURE:

 the centre of a distribution (averages).


 the spread of a distribution (measures of
dispersion).
 the extent to which the shape of a distribution
is not symmetrical (skewness).

written by JBacon 14/7/13 12


…measure the centre of a distribution of data.
There are 3 main averages:
 Mean – calculated by the sum of the observations
divided by the total number of observations.
(Common average but affected by outliers.)
 Median - the middle value when all observations
are listed in order.
 Mode - the most frequently occurring value.
(Not unique.)

written by JBacon 14/7/13 13


An average should convey an impression of the
centre of a distribution in a single figure.

Picking which average to use might depend on a


number of factors:
 The type of data we are dealing with
 The shape of the distribution,
 Whether the result will be used as the basis for
further statistical analysis, etc.

written by JBacon 14/7/13 14


In one month the total costs (to the
nearest £) of calls made by 23 male
mobile phone owners were:

17 17 14 16 15 24 12 20 17 17 13 21 15
14 14 20 21 9 15 22 19 27 19.

Calculate the average monthly cost,


using the: a) mean, b) median, c) mode.

written by JBacon 14/7/13 15


a) The sum of these costs:
Σx = 17 + 17 + 14 + …. + 19 = 398
Mean: x̄ = Σx / n = 398/23 = 17.3043, i.e.) £17.30
b) Arrange data from smallest to largest and select
the middle (12th) value.
9 12 13….17….24 27
Median = £17.
c) Most frequently occurring value is 17, so
mode = £17.
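The worked example can be checked with a few lines of Python using the standard statistics module:

```python
# Checking the worked example with Python's statistics module.
import statistics

costs = [17, 17, 14, 16, 15, 24, 12, 20, 17, 17, 13, 21, 15,
         14, 14, 20, 21, 9, 15, 22, 19, 27, 19]

print(round(statistics.mean(costs), 4))   # 17.3043 -> £17.30
print(statistics.median(costs))           # 17      -> £17
print(statistics.mode(costs))             # 17      -> £17
```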

written by JBacon 14/7/13 16


Table 1
Phone call cost (£)   Frequency
9                     1
12                    1
13                    1
14                    3
15                    3
16                    1
17                    4
19                    2
20                    2
21                    2
22                    1
24                    1
27                    1
Total                 23

 Note: A frequency distribution simply records each unique data
value and its frequency. However, too many discrete values (as in
this case) doesn’t prove useful!
Table 2
Phone call cost (£)   Relative Frequency
0-10                  1/23 = 0.04 (or 4%)
11-15                 8/23 = 0.36 (or 36%)
16-20                 9/23 = 0.39 (or 39%)
21-25                 4/23 = 0.17 (or 17%)
26-30                 1/23 = 0.04 (or 4%)
Total                 1.00 (i.e. 100%)

 A relative frequency distribution records the frequency relative to
the total (i.e. a proportion or percentage).
 A grouped frequency table groups the data into sensible class
intervals (but at a cost of the precision of data being displayed!)
Call cost 0-10 11-15 16-20 21-25 26-30
Frequency 1 8 9 4 1
Midpoint 5 13 18 23 28
 In
grouped frequency distributions, the
midpoint is assumed to be the best
representation of a particular interval.
Hence descriptive statistics are
estimated using the midpoints.
Mean = Σ fᵢMᵢ / Σ fᵢ
= (1×5 + 8×13 + 9×18 + 4×23 + 1×28) / 23
= 391/23 = £17

Median is in £16-20 interval.


Modal interval is £16-20.
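A short Python/pandas sketch of the grouped frequency table and the midpoint estimate of the mean; the bin edges are chosen to mirror the intervals used above.

```python
# Sketch: the grouped frequency table and the midpoint estimate of the mean.
import pandas as pd

costs = [17, 17, 14, 16, 15, 24, 12, 20, 17, 17, 13, 21, 15,
         14, 14, 20, 21, 9, 15, 22, 19, 27, 19]

bins = [0, 10, 15, 20, 25, 30]               # mirrors 0-10, 11-15, ..., 26-30
freq = pd.cut(pd.Series(costs), bins=bins).value_counts().sort_index()
midpoints = [5, 13, 18, 23, 28]

grouped_mean = sum(f * m for f, m in zip(freq, midpoints)) / freq.sum()
print(freq.tolist())                          # [1, 8, 9, 4, 1]
print(round(grouped_mean, 2))                 # 17.0
```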

written by JBacon 14/7/13 20


…measure the spread or variation in a dataset.
 The main measures of dispersion are:

 Range - the difference between the maximum and


the minimum value in the data. (Common measure
of dispersion but relies on only 2 data values so
could be affected by outliers.)

 Percentiles, Deciles and Quartiles.

 Variance and Standard deviation.

written by JBacon 14/7/13 21


If the distribution runs from minimum to maximum and the
distribution is divided into:
 100 sections, with equal numbers of observations in
each section – each section represents a percentile.
The 1st percentile, P1, is 1% of the way along from the
start of the distribution. The 100th percentile, P100,
is at the end of the data set.

 10 sections, with equal numbers of observations in


each section – each section represents a decile.
The 1st decile, D1, is 10% of the way along from the
start of the distribution, etc.

 4 sections, with equal numbers of observations in each


section – each section represents a quartile.
The 1st quartile, Q1, is 25% of the way along from the
start of the distribution. The 2nd quartile, Q2, is median.

written by JBacon 14/7/13 22


 Thevariance measures how closely the
observations are spread about the mean, and
takes every data value into account.
• The difference between a given observation
and the mean of the sample is called its
deviation.
• The variance is defined as the average of the
squared deviations.

 The standard deviation is the square root of


the variance, and is a more appropriate
measure of dispersion to use as its units are
the same as the original data (not squared
units like the variance!)

written by JBacon 14/7/13 23


The total costs of the calls made by 23
male mobile phone owners, in one month,
was shown previously in Table 1.

Calculate the:
a) range,
b) inter-quartile range,
c) sample variance and sample standard
deviation.

written by JBacon 14/7/13 24


a) Range is max – min = 27-9 = £18, i.e. [£9, £27]

b) Lower quartile, Q1 is represented by the 25th


percentile (i.e. quarter of the way along from
start of distribution).
Thus, Q1 is the 0.25 x 23 = 5.75th data value in
distribution, i.e.) from Table 1, Q1 is 14
Upper quartile, Q3 is represented by the 75th
percentile (i.e. three quarters of the way along
from start of distribution).
Thus, Q3 is the 0.75 x 23 = 17.25th data value
in distribution, i.e.) from Table 1, Q3 is 20.
Hence, inter-quartile range is Q3-Q1 = 20-14
= £6, i.e. [£14, £20]
written by JBacon 14/7/13 25
 To calculate the variance and
standard deviation, construct a table
with the squared deviations from the
mean.
 Recall: mean = £17.3043

written by JBacon 14/7/13 26


xi      fi    Deviation from mean    fi × (deviation squared)
              (xi – 17.3043)
9       1     -8.3043                1 × 68.89
:       :     :                      :
14      3     -3.3043                3 × 10.89
27      1     9.6957                 1 × 94.09
Total   23                           374.87
c) Sample Variance = average of squared deviations
= Σ fᵢ(xᵢ – x̄)² / (Σ fᵢ – 1)
= 374.87 / (23 – 1) = 17.04 (squared £)
But variance is in squared units!

Hence, Sample standard deviation = √variance = £4.13

Note In calculations of the sample variance, the denominator
has 1 subtracted from its total as an adjustment for a sample
(of 23 males) being considered. 23 does not represent the
entire population!
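Again this can be checked in Python; statistics.variance and statistics.stdev use the (n − 1) denominator described in the note.

```python
# Checking the result: statistics.variance / stdev use the (n - 1) denominator.
import statistics

costs = [17, 17, 14, 16, 15, 24, 12, 20, 17, 17, 13, 21, 15,
         14, 14, 20, 21, 9, 15, 22, 19, 27, 19]

print(round(statistics.variance(costs), 2))   # 17.04  (squared £)
print(round(statistics.stdev(costs), 2))      # 4.13
```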
written by JBacon 14/7/13 28
 The frequencies of different values in a
population will vary. The description of how
values are distributed in the population is called
a distribution.

 Graphically, the horizontal axis (the x-axis)


represents the range of values taken on and the
vertical axis (the y-axis) represents the number
of observations at each value, i.e.) the
frequency.

written by JBacon 14/7/13 29


written by JBacon 14/7/13 30
As more and more samples are taken from a
population, the distribution of the averages of
the samples follows the normal distribution.
The sample averages (i.e. sample statistics)
come arbitrarily close to the average of the
population (i.e. population parameter).

Symmetrical distribution.
The normal distribution is a symmetrically
bell-shaped distribution, about its mean,
where: mean=median=mode.

written by JBacon 14/7/13 31


 Measurements of
natural phenomena
often follow the normal
distribution.

 The normal distribution


is described by:
its mean (an average)
and its standard
deviation (a measure of
dispersion).
 Skewed distributions
are not symmetrical
about their mean. Their
peak is at the mode.
 In a negatively skewed
distribution,
mode > median > mean

 Ina positively skewed


distribution,
mean > median > mode
 The coefficient of skewness shows the
tendency of the dataset values to “bunch” at
one end of its distribution.
 The higher the value, the greater the
skewness. Negative skewness yields –ve
values, positive skewness yields + ve values.
 The coefficient of kurtosis is a measure of the
degree of peakedness/flatness of a distribution
(relative to a normal distribution).
 Data sets with high kurtosis have a distinct
peak near the mean, decline rather rapidly and
have heavy tails. Data sets with low kurtosis
have a flat top near the mean, rather than a
sharp peak.

written by JBacon 14/7/13 34


You should now know:
 What is meant by descriptive statistics.
 The difference between…
 …a population and a sample,
 …a (population) parameter and a (sample)
statistic,
 …the mean, mode and median,
 …the range and inter-quartile range,
 …the variance and the standard deviation,
 …the normal distribution and a skewed
distribution,
 …a positively skewed distribution and a
negatively skewed distribution,
 …skewness and kurtosis.

written by JBacon 14/7/13 35


 Berry Michael JA and Linoff Gordon S (2011). Data
Mining Techniques. 3rd ed. Wiley. pp101-104, 144-149.

 Data analysis including mean and standard deviation


http://www.robertniles.com/stats/dataanly.shtml

 Very simple factsheets on frequency tables and averages


http://www.bbc.co.uk/schools/ks3bitesize/maths/handling_data
/measures_average/activity.shtml
http://www.gcse.com/maths/averages.htm

written by JBacon 17/9/14 36


 To discuss the normal distribution in more detail,
focussing upon its properties.
 To explain what is meant by hypothesis testing
and to introduce concepts such as the z-score,
test statistic, p-value, q-value and the idea of
confidence intervals.
 To further develop exploratory data analysis
using box and whisker plots (boxplots).
 As more and more samples are taken from a
population, the distribution of the averages of
the samples follows the normal distribution.
The average of the samples comes arbitrarily
close to the average of the entire population.

 Normal distribution is described by the mean


(average count) and the standard deviation
(clustering around the mean)
 Some normal
curves are tall and
skinny (low
standard deviation).

 Others are short


and fat (high
standard deviation).
 The standard deviation is more appropriate to
use than the variance because it is expressed in
the same units as the observations (rather than
in terms of those units squared.)

 This allows the standard deviation itself to be


used as a unit of measurement.

 The z-score is an observation’s distance from


the mean measured in standard deviations.
 Using the normal distribution, the z-score can be
converted to a probability or confidence interval.
The z-scores are used to compare the
significance of measurements taken from
different situations.

 Approx 68% of data lies within 1 sd of the mean.


 Approx 95% of data lies within 2 sd of the mean.
 Nearly 100% of data lies within 3 sd of the mean.

 The z-scores are used to compare the


significance of measurements taken from
different situations.
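A small Python sketch of a z-score and its conversion to a probability via the standard normal CDF; the mean and standard deviation plugged in are the male phone-cost figures from the earlier example, used purely for illustration.

```python
# Sketch: a z-score and its conversion to a probability via the normal CDF.
from math import erf, sqrt

def z_score(x, mean, sd):
    return (x - mean) / sd

def normal_cdf(z):
    """P(Z <= z) for the standard normal distribution."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# e.g. a £25 monthly bill against the male figures (mean £17.30, sd £4.13)
z = z_score(25, 17.30, 4.13)
print(round(z, 2))                  # about 1.86 sd above the mean
print(round(1 - normal_cdf(z), 3))  # P(bill > £25) is roughly 0.03
```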
Statistical inference uses sample data and
statistical procedures to:

 Estimate an unknown population parameter


using a sample statistic with a specified level of
certainty (point estimates and confidence
intervals).

 Testa hypothesis using sample statistic(s) and


making a decision either to support or to reject a
statement about a parameter value.
 Hypothesis testing is a top-down approach that
attempts to substantiate or disprove
preconceived ideas.
 Hypothesis testing is what statisticians spend
their lives doing!
 An hypothesis is a proposed explanation
whose validity can be tested, by analysing
data, for example, an initial assumption may
be:
H0: no difference in average monthly mobile
call costs between males and females
 The Null Hypothesis, H0 assumes that
differences among observations are due simply
to chance.
 Layperson asks, “Are these %’s different?”
 Statistician asks, “What is the probability that
these two values are really the same?”

 The Alternative Hypothesis is denoted by H1.


 Note In the H0 we assume the variables are
independent, i.e.) there is no relationship.
(Remember: Innocent until proven guilty!)
 Is good for both statisticians and data miners.

 Goal for both is to demonstrate results that


work, hence discounting the null hypothesis.

 The less reliance on chance the better!


 Point estimates are single values summarising
some aspect of the data, e.g.) mean, proportion,
etc.

 Confidence intervals provide a range of values


around an estimate. The width of the interval
depends on the sample size and sample standard
deviation. Usually 95%CIs are reported.
“The CI is a measure of only one thing, the
statistical dispersion of the result. Assuming
that everything else remains the same, it
measures the amount of inaccuracy introduced
by the process of sampling.”
Berry & Linoff (2004)
 Various hypothesis tests exist,
e.g.) t test (for comparing means).

 Hypothesis tests are used to look for evidence


that a null hypothesis is not true (using a test
statistic and a p-value).

 Measure of error is attached based on sample


size and sample standard deviation.

 Generally, hypotheses are rejected at the 5%


significance level, i.e.) 95% confidence.
1.State good ideas and aims (hypotheses).
2.Determine what data would allow these
hypotheses to be tested and locate data.
3.Prepare data for analysis.
4.Build models based on the data.
5.Evaluate models to confirm or reject initial
hypothesis (null hypothesis H0).
 During step 4, a test statistic and a p-value
are computed.

 The test statistic is a measure used to


determine how close the sample statistic(s) is
to the hypothesised population parameter(s).
Different test statistics exist for different
hypothesis tests, eg) t-score, logit, etc.
 The null hypothesis H0 can be quantified.
 The p-value is the probability that the null
hypothesis is true. When H0 is true, nothing is
really happening; differences are due to chance.
 Much of statistics is devoted to determining
bounds for the p-value. Statistical software will
calculate a p-value to assist in deciding whether
to reject H0 or not.
 Confidence, the reverse of a p-value, is called
the q-value.
 If the p-value = 5% then…the q-value
(confidence) is 95%.
A p-value less than 0.05 shows that the
sample indicates that there is evidence that
H0 should be rejected. Hence:
compute test statistic and p-value:
if p > 0.05 – Do not reject H0 (no evidence that H0 is not true)
if p < 0.05 – Reject H0 at 5% sig. level (evidence that H0 is not true)
…one difference between data miners and
statisticians…

 Data miners are often working with sufficiently


large amounts of data that make it unnecessary to
worry about the mechanics of calculating the
probability of something being due to chance.
In one month the total costs (to the
nearest £) of the calls made by 23 female
mobile phone owners were:
14 5 15 6 17 10 22 10 12 17 13 29 7
27 33 16 30 9 15 7 33 28 21.
Calculate the averages and measures of
dispersion of phone call costs made by
females.
 Mean The sum of these costs:
Σx = 14 + 5 + 15 + …. + 21 = 396
Mean: x̄ = Σx / n = 396/23 = 17.2174, i.e.) £17.22

 Median Arrange data from smallest to largest and


select the middle (12th) value, hence median = £15
i.e.) 5 6 7….15….30 33 33

 Mode There are 5 modes (as each of these values


occur twice). Thus,
modes are: £7, £10, £15, £17, £33.
 Range is 33-5 = £28, i.e. [£5, £33]

 From the table with the squared deviations from the mean:
Sample Variance = Σ fᵢ(xᵢ – x̄)² / (Σ fᵢ – 1)
= 1876.094 / 22 = 85.277 (£ squared)

 Standard deviation = √variance = £9.23
xi      fi    Deviation from mean            fi × (deviation squared)
              (xi – 17.2174), correct to 4dp
5       1     -12.2174                       1 × 149.265
6       1     -11.2174                       1 × 125.830
7       2     -10.2174                       2 × 104.395
:       :     :                              :
33      2     15.7826                        2 × 249.091
Total   23                                   1876.094
Phone call costs    Mean      Standard deviation
Male                £17.30    £4.13
Female              £17.22    £9.23
The average monthly call costs (approx £17)
are similar for males and females.
However there is more than twice as much
variation in the costs of phone calls made by
females than males.
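For illustration, the null hypothesis stated earlier ("no difference in average monthly mobile call costs between males and females") could be tested on these two samples with a two-sample t-test; Welch's version is used here as one reasonable choice, since the lecture does not prescribe a particular test.

```python
# Sketch: Welch two-sample t-test of H0 "no difference in average
# monthly call costs between males and females".
from scipy import stats

male = [17, 17, 14, 16, 15, 24, 12, 20, 17, 17, 13, 21, 15,
        14, 14, 20, 21, 9, 15, 22, 19, 27, 19]
female = [14, 5, 15, 6, 17, 10, 22, 10, 12, 17, 13, 29, 7,
          27, 33, 16, 30, 9, 15, 7, 33, 28, 21]

t_stat, p_value = stats.ttest_ind(male, female, equal_var=False)
print(round(t_stat, 3), round(p_value, 3))
# If p > 0.05, do not reject H0 at the 5% significance level.
```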
 In a boxplot the middle half of the values in a
distribution is represented by a box, i.e.) the
‘spread’ of the box is the inter-quartile range.

 A line inside the box represents the median.

 The ‘spread’ between the smallest & largest data
values is represented by straight lines
called ‘the whiskers’, i.e.) the range.
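A minimal matplotlib sketch producing side-by-side boxplots of the two call-cost samples used earlier (illustrative only):

```python
# Sketch: side-by-side boxplots of the male and female monthly call costs.
import matplotlib.pyplot as plt

male = [17, 17, 14, 16, 15, 24, 12, 20, 17, 17, 13, 21, 15,
        14, 14, 20, 21, 9, 15, 22, 19, 27, 19]
female = [14, 5, 15, 6, 17, 10, 22, 10, 12, 17, 13, 29, 7,
          27, 33, 16, 30, 9, 15, 7, 33, 28, 21]

plt.boxplot([male, female])
plt.xticks([1, 2], ["Male", "Female"])
plt.ylabel("Monthly call cost (£)")
plt.show()   # the female box and whiskers are visibly wider: more dispersion
```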
You should now:
 Know what is meant by the normal distribution and be
able to state the properties of the normal distribution.
 Be able to explain what is meant by hypothesis
testing.
 Know what is meant by…
 … a z-score,
 … a test statistic,
 … a p-value,
 …a q-value and the idea of confidence.
 Recognise and understand the role of graphical
displays such as stem & leaf diagrams and boxplots
used in comparisons of data sets.
 Ref the PROC UNIVARIATE output obtained
from the SAS Trainer and make sure you
understand the stem & leaf diagram and boxplot
displayed.

 Background reading.

 Complete labwork in preparation for Week 7


labtest.
 To consolidate work on the normal distribution.
 To introduce different sampling methods.
 To introduce the concept of time series.
 To introduce the modelling that will be focussed
upon (after the lab test).
At a supermarket, the money spent by male
and female shoppers (per trip) is summarised by
the descriptive statistics shown below:
                    Mean    Standard deviation
Male shoppers       £50     £10
Female shoppers     £40     £15

Assuming the data are normally distributed,


compare the spending patterns of the male and
female shoppers.
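One simple way to answer the exercise, using the "approx 95% of data lies within 2 sd of the mean" rule from earlier; a short sketch:

```python
# Sketch: apply the "about 95% within 2 standard deviations" rule.
groups = {"Male shoppers": (50, 10), "Female shoppers": (40, 15)}  # (mean, sd) in £

for name, (mean, sd) in groups.items():
    low, high = mean - 2 * sd, mean + 2 * sd
    print(f"{name}: roughly 95% of trips between £{low} and £{high}")
# Male shoppers:   roughly 95% of trips between £30 and £70
# Female shoppers: roughly 95% of trips between £10 and £70
```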
There are various sampling methods,
including:
 Simple random sampling
 Stratified sampling
 Cluster sampling
 Quota sampling
 User defined sampling
 When a sample is chosen
randomly, every member of the
population has an equal chance of
being in the sample.

 But when the population is humans, this can be


difficult to achieve!
Example
 People who fill out registration cards may not be
representative of the population of product owners.
 People whose medical claims appear in an
insurance company database may not be
representative of population of sick people.
…is superior to simple random sampling
because it reduces sampling error.
 When the population has a number of
distinct categories (strata) which need to
be considered in the sample, a sufficient
random sample is taken from each
stratum.
 A stratum (plural strata) is a subset of the
population that share at least one common
characteristic, e.g.) male/female, age
group, students/non-students, etc.
 Typically, strata should be chosen to have:
 Means which differ substantially from each other,
 Variances which are different from one another
and are lower than the overall variance.
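A small pandas sketch of a proportional stratified sample, drawing the same fraction at random from each stratum; the population and its "gender" stratum are hypothetical.

```python
# Sketch: a proportional stratified sample - the same fraction drawn at
# random from each stratum of a hypothetical population.
import pandas as pd

population = pd.DataFrame({
    "gender": ["M"] * 600 + ["F"] * 400,   # two strata
    "spend":  range(1000),
})

sample = population.groupby("gender").sample(frac=0.1, random_state=1)
print(sample["gender"].value_counts())     # 60 M, 40 F: proportions preserved
```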
…sometimes it is cheaper to cluster
in some way, by selecting respondents from
certain areas only.
 Cluster sampling is an example of two-stage
sampling:
i. In the 1st stage a sample of areas is chosen.
ii. In the 2nd stage a random sample of respondents
within those areas is selected.
 Cluster sampling generally increases the
variability of sample estimates above that of
simple random sampling, depending on how the
clusters differ between themselves.
The population is first segmented into
mutually exclusive sub-groups, as in
stratified sampling, but then judgement is
used to select the subjects from each sub-
group based on a specified proportion,
e.g.) a market researcher may be told to
interview 200 females of 30-45yrs old.
 Hence the selection is a non-random sample!
Problem: possible bias, as not everyone gets a
chance of selection!
 In some situations, even a combination of previously
defined sampling methods might not provide a
sample that is representative of the intended (target)
population, from the data miner’s viewpoint,
e.g.) an intended stratified sample with urban/rural
strata might have an under-representation of rural
subjects.
Hence the data miner would weight the proportion of rural
subjects appropriately in the analysis to compensate,
i.e. ‘user-defined’!
 Histograms describe a single moment in time.
 Data mining is often concerned with what is
happening over time.
 In Time Series Analysis we choose an
appropriate time frame to consider the data.
 A time series represents a variable (Y)
observed across time (horizontal axis).
 Time increment: years, quarters, months,
weeks or days, etc.
 Usually, the points in the graph are connected
by straight lines (i.e. line graph) making it
easier to detect any existing patterns.
There are 4 main components in a time
series:
 Trend
 Seasonality (or seasonal variation)
 Cyclical activity (or cyclical variation)
 Irregular activity
 The purpose of time series analysis is to
describe a particular dataset by estimating the
various components that make up the time
series.
 The Trend is the long-term movement in the time
series, i.e.) a steady increase or decrease.
 If the rate of change in the variable Y from one
time period to the next is relatively constant, the
trend is a linear trend and can be determined
using simple linear regression,
Trend = a + bt, where b is gradient of line,
a is intercept on vertical axis.
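As an illustration of estimating such a linear trend, the sketch below fits Trend = a + bt by least squares using numpy; the monthly sales series is invented purely for demonstration.

```python
import numpy as np

# Hypothetical monthly sales figures (steady growth plus noise)
sales = np.array([102, 108, 105, 113, 118, 121, 119, 127, 130, 128, 136, 140])
t = np.arange(1, len(sales) + 1)          # time index: month 1, 2, 3, ...

# Least-squares fit of a straight line: Trend = a + b*t
b, a = np.polyfit(t, sales, 1)            # polyfit returns the highest power first
print(f"Trend = {a:.2f} + {b:.2f} t")

trend = a + b * t                         # fitted trend values
detrended = sales - trend                 # what remains: seasonality + irregular
```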
 Seasonality refers to periodic increases or
decreases that occur within a calendar year of
a time series.
 Predictable because they occur every year!
Examples:
 More fireworks bought in October & November.
 Increase in flower sales at Valentine’s day and
Mother’s day.
 Cyclical activity describes a gradual cyclical
movement about the trend; it is generally
attributable to business and economic
conditions, for example:
 A peak of a cycle occurs at the height of an
expansion (prosperity) period.
 The low point (trough) of each cycle usually
represents an economic recession or
depression.
 Cyclical periods typically range from 2 to 10
years.
 Irregular activity consists of what is ‘left over’
after accounting for the effect of any trend,
seasonality or cyclical activity.
 The Irregular activity component measures
the random movement in the time series and
represents the effect introduced by
unpredictable rare events, e.g.) company
strike.
 An extremely large irregular component can
be caused by a measurement error in the
variable. (Such an outlier should always be
checked to ensure its accuracy).
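Putting the components together, a rough hand-rolled decomposition can be sketched as follows: fit the trend by least squares, average the detrended values by quarter to estimate seasonality, and treat whatever remains as irregular (together with any cyclical movement). The quarterly series below is made up, and dedicated time-series routines would normally be preferred over this simplified sketch.

```python
import numpy as np

# Hypothetical quarterly sales over 4 years (16 quarters)
y = np.array([20, 32, 26, 40, 24, 36, 30, 44, 28, 41, 34, 49, 32, 45, 38, 53], float)
t = np.arange(len(y))
period = 4                                   # 4 quarters per year

# 1) Trend: straight line fitted by least squares
b, a = np.polyfit(t, y, 1)
trend = a + b * t

# 2) Seasonality: average detrended value for each quarter of the year
detrended = y - trend
seasonal_means = np.array([detrended[q::period].mean() for q in range(period)])
seasonal = np.tile(seasonal_means, len(y) // period)

# 3) Irregular (plus any cyclical movement): what is left over
irregular = y - trend - seasonal

print("trend slope per quarter:", round(b, 2))
print("seasonal pattern:", np.round(seasonal_means, 2))
```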
 From the Date variable (model role: ID)
information can be extracted to aid in time
series analysis.
 Time Series charts are useful, but they also have
limitations; they cannot tell us whether the
changes over time are expected or
unexpected.
 We could look at a segment of the data, say a
day at a time asking: “Is it possible that the
differences seen on each day are strictly due
to chance?” (null hypothesis).
You should now:
 Assuming the data are normally distributed, be able to
compare 2 datasets (using means and standard
deviations).
 Be more familiar with the relevance of the properties
of the normal distribution.
 Be able to define the different sampling methods.
 Know what is meant by a time series and appreciate
its various components
 Appreciate how variables having a model role as id
may be useful in time series modelling.
 Be ready to start some modelling in SAS Enterprise
Miner (after your lab test).
written by JBacon 7/11/14 1
 To introduce the need for analysis on bivariate
data.

 To introduce correlation analysis and simple


linear regression analysis, in detail.

 To explain the need for non-linear regression


and multiple regression when simple linear
regression analysis is inappropriate.

written by JBacon 7/11/14 2


 Bivariate data consist of observed values of 2
variables.

 The analysis of bivariate data involves data


mining techniques such as correlation
analysis and regression analysis.

 The results of this sort of analysis have


affected many aspects of business
considerably.

written by JBacon 7/11/14 3


 E.g.1) The establishment of the relationship
between smoking and health problems
transformed the tobacco industry.

 E.g.2) The analysis of survival rates of micro-


organisms and temperature was crucial to the
setting of appropriate refrigeration levels by food
retailers.

 E.g.3) Marketing strategies of many organisations


are often based on the analysis of consumer
expenditure in relation to age or income.

written by JBacon 7/11/14 4


1. When dealing with bivariate data, the possibility of
an association is explored by plotting a scatterplot
of one variable against the other.

2. The strength of this association can then be


assessed by calculating Pearson’s correlation
coefficient for interval data.

3. If the scatterplot suggests a possible association


then we can use least squares regression to fit
this model to the data set.
written by JBacon 7/11/14 5
 Correlation analysis allows us to assess whether
there is a connection between 2 variables and if
so, how strong that connection is.

 If correlation analysis tells us there is a


connection, we can use regression analysis to
identify the exact form of the relationship.

written by JBacon 7/11/14 6


 Bivariate analysis assumption: Variable Y
depends upon variable X.

 Correlation analysis is a way of investigating


whether 2 variables are connected with each
other (correlated).
 A scatter diagram can give us a ‘visual feel‘ for
the association between 2 variables but it
doesn’t measure the strength of the connection!
 Hence the need to calculate a correlation
coefficient.

written by JBacon 7/11/14 7


 Suitable for assessing the strength of the
connection between quantitative variables
(whose values are interval).

 It is a ratio; it compares the co-ordinated


scatter to the total scatter. The co-ordinated
scatter is the extent to which the observed
values of one variable X, vary ‘in step with’ a
second variable Y (the covariance measures
the degree of co-ordinated scatter).

written by JBacon 7/11/14 8


written by JBacon 7/11/14 9


 Pearson’s Product Moment correlation coefficient,
r, is the covariance of the x and y values divided
by the product of the two standard deviations.
r = covariance(x, y) / (sx × sy)

Note r can be positive or negative and has a range


between -1 and +1 (inclusive).

written by JBacon 7/11/14 10


Correlation measures the strength of a linear relationship
between two variables.
A value of r near 1.0 indicates a strong positive relationship.
A value of r near -1.0 indicates a strong negative relationship.
A value of r ≈ 0 indicates no linear relationship between x and
y, but the true form of the relationship may still be non-linear.

written by JBacon 7/11/14 11
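To see the definition in action, the short sketch below computes r from the covariance and the two standard deviations and checks it against numpy's built-in corrcoef; the x and y values are made up for illustration.

```python
import numpy as np

# Made-up bivariate data
x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0])
y = np.array([1.5, 3.0, 4.5, 5.5, 7.0, 8.5])

# r = covariance(x, y) / (sx * sy), using sample (n - 1) versions throughout
cov_xy = np.cov(x, y, ddof=1)[0, 1]
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print(round(r, 4), round(np.corrcoef(x, y)[0, 1], 4))  # the two values agree
```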


written by JBacon 7/11/14 12


written by JBacon 7/11/14 13
written by JBacon 7/11/14 14
written by JBacon 7/11/14 15
❖ Regression analysis is a directed learning data
mining technique.
❖ But it is susceptible to missing values and
skewed distributions of the data.
❖3 types of regression will be considered:


➢ Simple linear regression (lecture 7)
➢ Multiple regression (lecture 8)
➢ Logistic regression (lecture 8)

written by JBacon 7/11/14 16


 Used for numeric (interval) data.
 Based on ONLY 2 variables:
1 dependent variable Y and
1 independent variable X
e.g.) ref CUSTDET1 data set:
Is there a linear relationship between total dining
(Y) and age (X)?
 Simple linear regression analysis predicts
values by fitting a linear equation to available
data. When the data is plotted on a scatterplot,
linear regression draws the ‘line of best fit’
through the data.

written by JBacon 7/11/14 17


written by JBacon 7/11/14 18


 In time series, if there is
an overall linear trend,
simple linear regression
modelling can be used
to estimate the trend
line.

 When SLR is applied to


time series, the
independent variable is
time.
A large manufacturing firm has designed a training
program that is supposed to increase the productivity of
employees. The personnel manager decides to examine
this claim by analysing the results from the first group of
20 employees that attended the course.
The dataset for the % change in productivity (y)
measured against a range of production values (x) is
shown in the table.

Employee Number   Production, x   % change in productivity, y
1                 47              4.2
2                 71              8.1
3                 64              6.8
4                 35              4.3
5                 43              5.0
6                 60              7.5
7                 38              4.7
8                 59              5.9
9                 67              6.9
10                56              5.7
11                67              5.7
12                57              5.4
13                69              7.5
14                38              3.8
15                54              5.9
16                76              6.3
17                53              5.7
18                40              4.0
19                47              5.2
20                23              2.2

written by JBacon 7/11/14 20
On the vertical axis is the % change in productivity, i.e.) the
dependent variable (y).
On the horizontal axis is the production - this provides the
basis for calculating the value of the dependent variable and
is called the independent variable (x).
From the scatterplot, as production increases, the % change
in productivity also increases, i.e.) there is a positive,
reasonably linear relationship.
[Figure: scatterplot of % change in productivity (y, roughly 2 to 8)
against production (x, roughly 20 to 80)]
written by JBacon 7/11/14 21
Note Scatterplots are a good way of identifying
outliers (extreme observations that are atypical of the
rest of the dataset) as may be seen from the above
illustration.
written by JBacon 7/11/14 22
Pearson correlation coefficient of %change in
productivity and production = 0.890
p-Value = 0.000

Interpretation of the correlation coefficient: r = 0.890


The sign is positive, which indicates that as production
increases, so does the % change in productivity.
The magnitude is close to 1, indicating a strong,
reasonably linear relationship.

written by JBacon 7/11/14 23


The hypotheses being tested are:
H0: population correlation coefficient = 0
H1: population correlation coefficient ≠ 0

Interpretation of the p-value:


p-value = 0.000 < 0.05 so there is evidence to reject H0
Hence there is a significant correlation between
production and % change in productivity.

written by JBacon 7/11/14 24


When the line of best fit is fitted to the scatterplot, not
all data points lie on the fitted line!
In this case, there is an
error called a residual
between the data y-value
and the value of the line at
each data point.
This concept of error can
be measured using a
variety of methods,
including the coefficient of
determination.
written by JBacon 7/11/14 25
The regression equation is:
%change = 0.6712 + 0.09152 × production
(intercept a = 0.6712, gradient b = 0.09152)

R-Sq = 0.7920 (coefficient of determination)

Interpretation: 79.2% of the variation in the % change in
productivity can be explained by the variation in production.
Conversely, this implies that 20.8% of the sample variability in
the % change in productivity is due to factors other than
production and is not explained by the regression line.

written by JBacon 7/11/14 26
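The quoted figures can be reproduced directly from the 20 data pairs in the table. The sketch below does so with numpy (this is not the SAS output itself, just an independent check); the rounded results agree with r = 0.890, intercept 0.6712, gradient 0.09152 and R-Sq 0.792.

```python
import numpy as np

# Production (x) and % change in productivity (y) for the 20 employees
x = np.array([47, 71, 64, 35, 43, 60, 38, 59, 67, 56,
              67, 57, 69, 38, 54, 76, 53, 40, 47, 23], float)
y = np.array([4.2, 8.1, 6.8, 4.3, 5.0, 7.5, 4.7, 5.9, 6.9, 5.7,
              5.7, 5.4, 7.5, 3.8, 5.9, 6.3, 5.7, 4.0, 5.2, 2.2])

r = np.corrcoef(x, y)[0, 1]              # Pearson correlation coefficient
b, a = np.polyfit(x, y, 1)               # gradient and intercept of the line of best fit

print(f"r = {r:.3f}, R-Sq = {r**2:.4f}")
print(f"%change = {a:.4f} + {b:.5f} production")
```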


 When there is no correlation or minimal correlation
between the 2 variables.
 When the scatterplot illustrates a curved relationship
(rather than a straight line relationship). Hence a
non-linear regression model is fitted.
 When Y is best defined by a number of independent
variables X1, X2, etc (rather than just one). Hence
the need for multiple regression.
Note Multiple regression will be considered in
IMAT3603 but non-linear regression is beyond the
scope of this module (it is studied at postgraduate
level).
written by JBacon 7/11/14 27
You should now know:
 What is meant by bivariate data.
 What is meant by covariance.
 What is meant by the coefficient of
determination.
 The difference between correlation analysis
and regression analysis.
 That the Simple Linear Regression (SLR)
model is defined by the gradient and the y-
intercept of the ‘line of best fit’.
 Under what circumstances, non-linear
regression and/or multiple regression is more
appropriate than SLR.

written by JBacon 7/11/14 29


Simple Linear Regression applet
http://science.kennesaw.edu/~plaval/applets/LRegressio
n.html

Basic information about correlation


http://www.bbc.co.uk/scotland/learning/bitesize/standard
/maths_ii/relationships/scatter_diagrams_rev2.shtml
http://www.socialresearchmethods.net/kb/statcorr.php

Basic information about SLR


http://www.le.ac.uk/bl/gat/virtualfc/Stats/regression/reg
r1.html

written by JBacon 7/11/14 30


written by JBacon 7/11/14 1
 To introduce the basic concepts of multiple
regression and logistic regression.

 To interpret regression output generated in


SAS Enterprise Miner.

written by JBacon 7/11/14 2


 Describes the relationship between 1 numeric
dependent variable and 2 or more
independent variables.

e.g.) CUSTDET1:
Does total dining (dependent variable) depend
upon:
➢ 1 independent variable?
If so, simple linear regression (SLR).
➢ 2 or more independent variables?
If so, multiple regression.
➢ All of the independent variables?
If so, multiple regression.
written by JBacon 7/11/14 3
Y = a + b1X1 + b2X2 + b3X3 + …
where Y is the target variable,
a is the intercept with the vertical y-axis,
and b1, b2, b3, … are the gradients with respect to the
independent variables X1, X2, X3, … respectively.
written by JBacon 7/11/14 4


❖ But how do we know whether all of the
‘independent’ variables are needed, or just 1 or
2 or 3 or 4 or 5…?
Answer: using selection methods.
❖ What selection methods are used for regression
analysis?
Answer: backward, forward, stepwise.
❖ And, what general criteria are used when
making this selection?
Answer: the 5% significance level.
written by JBacon 7/11/14 5


➢ Forward selection – starts with the null model
(which has no input ‘independent’ X variables),
then adds specific X variables (the ‘effects’) that
are significantly associated with the target Y
variable until none of the remaining effects can be
added (based on the ‘entry’ significance level).

➢ Backward deletion – starts with the full model


(which has all of the input ‘independent’ X
variables) then removes specific X variables
(called ‘effects’) that are not significantly
associated with the target Y variable until no other
effect in the model should be removed (based on
the ‘stay’ significance level.)
written by JBacon 7/11/14 6
➢ Stepwise – starts with the null model and then
adds specific X variables (the ‘effects’) that are
significantly associated with the target Y variable.
However, after an effect is added to the model,
stepwise may remove any effect that is already in
the model that, now, is no longer needed.

written by JBacon 7/11/14 7
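The following sketch shows the idea behind forward selection on made-up data, using ordinary least squares p-values and a 5% entry level. It is only an outline of the logic, not the exact algorithm used by SAS Enterprise Miner, and it assumes numpy, pandas and statsmodels are available.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Made-up data: y really depends on x1 and x2 only; x3 is pure noise
n = 200
X = pd.DataFrame({"x1": rng.normal(size=n),
                  "x2": rng.normal(size=n),
                  "x3": rng.normal(size=n)})
y = 2.0 + 1.5 * X["x1"] - 0.8 * X["x2"] + rng.normal(scale=0.5, size=n)

selected, remaining, entry_level = [], list(X.columns), 0.05
while remaining:
    # p-value each candidate would have if added to the current model
    pvals = {}
    for cand in remaining:
        model = sm.OLS(y, sm.add_constant(X[selected + [cand]])).fit()
        pvals[cand] = model.pvalues[cand]
    best = min(pvals, key=pvals.get)
    if pvals[best] >= entry_level:       # nothing left that is significant
        break
    selected.append(best)
    remaining.remove(best)

print("selected effects:", selected)     # expected: ['x1', 'x2']
```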




 Describes the relationship between a categorical
target variable and a set of independent variables.

 The target variable can be binary or ordinal.

 The independent variables can be categorical or


interval.

 Hence, a different approach is needed…


…using probability and odds ratios.

written by JBacon 7/11/14 9


A mathematical transformation called the logit transformation
is used to accommodate the different nature of the target
variable. The logit transformation generates probabilities that
are used to predict the different levels of the binary or ordinal
target variable, using mathematical functions called logarithms
and odds ratios.
Note The mathematical detail will not be discussed in this
module.

written by JBacon 7/11/14 10


logit p = a + b1X1 + b2X2 + b3X3 + …
where the target variable is logit p (p is the probability of the
target event),
a is the intercept with the vertical y-axis,
and b1, b2, b3, … are the gradients with respect to the
independent variables X1, X2, X3, … respectively.
Note The nature of the logit transformation of p is non-linear.

written by JBacon 7/11/14 11
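For a flavour of what the logit transformation does (the mathematical detail is outside this module), here is a tiny Python sketch. The coefficients and input values are invented, purely to show how a fitted logistic model is turned back into a probability between 0 and 1.

```python
import math

def logit(p):
    """Log-odds of a probability p (0 < p < 1)."""
    return math.log(p / (1 - p))

def inverse_logit(z):
    """Turn log-odds back into a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))

# Hypothetical fitted model: logit p = a + b1*X1 + b2*X2
a, b1, b2 = -2.0, 0.8, 0.05
x1, x2 = 1.0, 20.0                      # one customer's input values

z = a + b1 * x1 + b2 * x2               # linear on the logit scale...
p = inverse_logit(z)                    # ...non-linear on the probability scale
print(f"logit p = {z:.2f}, predicted probability = {p:.3f}")
print(f"round trip check: {logit(p):.2f}")
```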
Multiple Regression (ignore significance tests)
http://davidmlane.com/hyperstat/B123219.html

Logistic regression (introduction & formula only)


http://www.dtreg.com/logistic.htm

written by JBacon 7/11/14 12


written by JBacon 5/11/14 1
 To discuss the appeal of neural networks and to explain what is
meant by an (artificial) neural network.

 To illustrate the structure of a neural network and to explain what is


meant by: the inputs, bias, transfer function, combination function,
activation function and the outputs.

 To give a flavour of the three most commonly used transfer


functions.

 To explain the characteristics of feed-forward networks and back


propagation.

 To introduce the Multilayer Perceptron (MLP) as this is the default


architecture used in neural network modelling in SAS Enterprise
Miner.
➢To discuss the effect of overtraining and other network
considerations.

written by JBacon 5/11/14 2


 A class of powerful, general-purpose tools readily
applied to:
Prediction, Classification and Clustering.
 The most powerful neural networks are the biological
kind, i.e.) the human brain.
 The human brain makes it possible for people to
generalise from experience;
… in comparison to computers that are best at
following pre-determined instructions over and over
again.
 Thus there is a difference
between the ‘learning’ of the human
brain and computers!

written by JBacon 5/11/14 3


 Neural Networks bridge the gap (between the
human brain and computers) by modelling, on a
digital computer, the neural connections in
human brains.

 When used in well-defined


domains, a neural network’s
ability to generalise and learn
from data “mimics”, the human’s
ability to learn from experience.

written by JBacon 5/11/14 4


 Drawback:
Training a neural network results in internal
weights being distributed throughout the network
but these weights provide no more insight into
why the solution is valid than dissecting a
human brain explains our thought processes!
 Perhaps one day, sophisticated techniques for
probing neural networks may help provide some
explanation!

written by JBacon 5/11/14 5


 1930s and 1940s: work on the functioning of biological
neurons (before digital computers!)
 1943: A research paper by the neurophysiologist McCulloch
(with the logician Pitts) explained how biological neurons work – this new modelling
concept provided inspiration for the field of Artificial
Intelligence.
 1950s: when digital computers first became available,
computer scientists implemented models called
perceptrons based on the work of McCulloch (1943).
However these simple networks had some deficiencies!
 1970s: study of neural network implementations slowed
down as computers were not very powerful!

written by JBacon 5/11/14 6


 1980s: Hopfield (1982) revived interest in neural networks, and
back propagation (popularised by Rumelhart, Hinton & Williams
in 1986) provided a better way of training a neural network.

This sparked a renaissance in neural network research:


◦ Research moved from labs into the commercial world,
being used to solve operational problems.
◦ Computing power became more readily available.
◦ Statisticians took advantage of computers to extend the
capabilities of statistical methods, e.g.) logistic
regression was represented as neural network. So
researchers became more comfortable with neural nets!
◦ As relevant operational data was more accessible,
useful applications (expert systems) emerged.

written by JBacon 5/11/14 7


Problem: Reduce costs by targeting a direct
mailing to the group of people most likely to
respond.

Inputs: Various personal data,


e.g.) income, employment category, magazine
readership, etc.

Output: A probability of response.

written by JBacon 5/11/14 8


Problem: determine risk of granting a loan to an
applicant.

Inputs: Number of credit cards, expenses as


a fraction of income, marital status, etc.

Output: A credit rating.

written by JBacon 5/11/14 9


Problem: Predict the future survival status of a
company.

Inputs: Asset ratios such as Total


Assets/Total Liabilities, etc.

Output: Survive, fail or survive in distressed


condition.

written by JBacon 5/11/14 10


written by JBacon 5/11/14 11
1. Automated appraisals could help real estate
agents better match prospective buyers to
prospective homes, improving productivity.
2. To set up Web pages where prospective
buyers could describe the homes that they
wanted – and get immediate feedback on how
much their dream home costs.
 Thus, the neural network mimics an appraiser
who estimates the market value of a house
based on features of the property. The
appraiser is not applying a set formula, but
balancing her experience and knowledge of the
sale prices of similar houses. Human
knowledge is not static!

written by JBacon 5/11/14 12


 The appraiser is an example of a human
expert in a well-defined domain.

 Houses are described by a fixed set of


standard features taken into account by the
expert and turned into an appraisal value.

 A neural network takes


specific inputs: e.g.) house features
and turns them into
a specific output: appraised value.

written by JBacon 5/11/14 13


• A Neural Network (Expert System) is like a
black box that knows how to process
inputs to create a useful output.
• But the calculation(s) are quite complex
and can be difficult to understand!
1. Inputs are well understood, e.g.) in appraisal example,
the inputs are well-defined because there is extensive
sharing of information about the housing market among
estate agents, and housing descriptions need to be
standardised (because of mortgages).

2. Output is well understood, e.g.) in appraisal example, the


output is well-defined because it is a specific monetary
value.

3.Experience is available, e.g.) in appraisal example, there


is a wealth of experience in the form of previous sales for
training the network how to value a house. Note Neural
Networks are only as good as the training set used to
generate it. The resulting neural net is static and must be
updated with more recent examples and retraining for it to
stay relevant!

written by JBacon 5/11/14 15


 Determine a set of features that affect house
prices, e.g.) living space, size of garage, age of
house, etc work for houses in a single
geographical area.
 Then, extend the appraisal example to handle
houses in many neighbourhoods, e.g.) input
variables would then include: postcodes,
neighbourhood demographics, neighbourhood
quality of life indicators such as schools, etc.
 Training the neural network model involves
presenting known examples (data from previous
house sales) to the network so that it can learn
how to calculate the house prices.

written by JBacon 5/11/14 16


…is actually the process of adjusting weights inside the
network to arrive at the best combination of weights for
making the desired predictions.
 The network starts with a random set of weights, so it
initially performs very poorly.

 However, by reprocessing the training set over and over


and adjusting the internal weights each time to reduce
the overall error, the network gradually does a better
and better job of approximating the target values in the
training set. When the approximations no longer
improve, the network stops training. This type of training
is known as supervised training.

 Thus, training is a process of iterating through the


training set to adjust the weights – each iteration is
called a generation. Once the network has been trained,
the performance of each generation must be measured
on the validation set.
written by JBacon 5/11/14 17
The diagram
illustrates
the unit and the main
features of an
artificial neuron:
 Inputs (including
constant input
called bias)
 activation function

 output.

written by JBacon 5/11/14 18


 Inputs – each input has its own weight.
 The bias has a weight and is included in the
combination function. The bias (set to 1) acts as a
global offset that helps the network better
understand patterns.
 The activation function consists of:
1. The combination function – which combines all the
inputs into a single value, usually as a weighted
summation.
2. The transfer function – which calculates the output
value from the result of the combination function.
 The output is usually a value between -1 and +1.

written by JBacon 5/11/14 19


 Linear function (output range
from -1 to +1) is the least
interesting!
i.e.) a neural network consisting
only of units with linear
transfer functions and a
weighted sum combination
function is really just doing a
linear regression.

 Sigmoid functions are


S-shaped functions, of which
the most usual are: the logistic
function (output range 0 to 1)
and the hyperbolic tangent
(output range -1 to +1).

 The power and complexity of


neural networks arise from
their non-linear behaviour.
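To ground the terminology, the sketch below builds a single artificial unit in Python: a weighted-sum combination function (including the bias input fixed at 1) followed by one of the three transfer functions just described. The weights and inputs are invented for illustration.

```python
import math

def combination(inputs, weights, bias_weight):
    """Combination function: weighted sum of the inputs plus the bias (input fixed at 1)."""
    return sum(x * w for x, w in zip(inputs, weights)) + 1.0 * bias_weight

# The three common transfer functions
def linear(z):   return z                          # simply passes the combined value through
def logistic(z): return 1 / (1 + math.exp(-z))     # S-shaped, output range 0 to 1
def tanh(z):     return math.tanh(z)               # S-shaped, output range -1 to +1

# One unit applied to a made-up input pattern
inputs  = [0.2, -0.5, 0.9]
weights = [0.4, 0.1, -0.7]
z = combination(inputs, weights, bias_weight=0.3)

for fn in (linear, logistic, tanh):
    print(fn.__name__, round(fn(z), 3))
```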
 Each unit in the input
layer is connected to
exactly one source field.
Input layer maps values
into a reasonable range.

 Units in the hidden layer


are fully connected to all
units in the input layer,
and they multiply the
value of each input by its
weight, add these up and
then apply transfer
function

 Output layer is fully


connected to all units in
the hidden layer.
Appraised value of this
property is $176,228.

written by JBacon 5/11/14 21
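The layered structure just described amounts to a couple of matrix operations. The sketch below pushes one invented property's normalised features through a tiny network with one hidden layer; the weights are random placeholders, so the output it prints is meaningless except as a demonstration of the flow from inputs to appraised value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Input layer: 4 features already mapped into a reasonable range (0 to 1)
features = np.array([0.7, 0.2, 0.5, 0.9])     # e.g. living space, garage size, age, plot size

# Hidden layer: 3 units, each fully connected to all inputs
W_hidden = rng.normal(size=(3, 4))            # placeholder weights
b_hidden = rng.normal(size=3)
hidden = np.tanh(W_hidden @ features + b_hidden)   # weighted sum, then transfer function

# Output layer: 1 unit fully connected to the hidden units
W_out = rng.normal(size=(1, 3))
b_out = rng.normal(size=1)
output = W_out @ hidden + b_out               # linear output for a price-style target

print("network output (to be rescaled into a price):", output[0])
```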


…are the simplest and most useful type of
network for directed data mining.
 There is a 1-way flow through the network
from the inputs to the output(s).
Questions to address re: feed forward networks
➢ What are the units and how do they behave?
i.e.) structure? activation function?
➢ How are the units connected together?
i.e.) what is the topology of network?
➢ How does the network learn to recognise
patterns?
i.e.) how is the network trained?
written by JBacon 5/11/14 22
written by JBacon 5/11/14 23
written by JBacon 5/11/14 24
Example: A department
store wants to predict
the likelihood that
customers will be
purchasing products
from various depts.,
i.e.) women’s apparel,
furniture, and
entertainment.
 The model will be used
to plan promotions and
direct mailings.
After feeding the inputs for a customer into the neural
net, the network calculates 3 outputs. But given all
these outputs, how can the dept. store determine the
right promotion(s) to offer the customer?
Some common methods include:
 Take the dept. corresponding to the output with the
maximum value.
 Take all depts. corresponding to the outputs that
exceed some threshold value.
Note There is no right answer that always works – in
practice, you try several of these possibilities on the
test set in order to determine which works best in a
particular situation.
written by JBacon 5/11/14 26
 At the heart of back propagation (popularised in the
mid-1980s) are the following 3 steps:
1. The network gets a training example and, using
the existing weights in the network, it calculates
the output(s).
2. Back propagation then calculates the error by
taking the difference between the calculated
result and the expected (actual result).
3. The error is fed back through the network and
the weights are adjusted to minimise the error –
hence the name back propagation because the
errors are sent back through the network.

Note These days, there are alternative training


algorithms available.
written by JBacon 5/11/14 27
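A minimal sketch of those three steps for a single logistic unit is given below; real tools (including SAS Enterprise Miner) use more sophisticated training algorithms, and the training data here are made up. Each pass through the loop calculates the outputs, measures the error against the targets, and feeds an adjustment back into the weights to reduce it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up training set: two inputs, binary target (roughly "x1 + x2 > 1")
X = rng.random((100, 2))
y = (X.sum(axis=1) > 1.0).astype(float)

weights = rng.normal(scale=0.1, size=2)     # start from small random weights
bias = 0.0
learning_rate = 0.5

for generation in range(200):               # each pass through the training set
    z = X @ weights + bias                  # step 1: combination function...
    output = 1 / (1 + np.exp(-z))           #         ...then logistic transfer function
    error = output - y                      # step 2: difference from the target values
    # step 3: feed the error back and adjust the weights to reduce it
    grad = error * output * (1 - output)
    weights -= learning_rate * (X.T @ grad) / len(y)
    bias -= learning_rate * np.mean(grad)

accuracy = np.mean((output > 0.5) == (y == 1))
print(f"training accuracy after {generation + 1} generations: {accuracy:.2f}")
```

In practice the error on a separate validation set would also be monitored, so that training can stop before the network starts to memorise the training set (the overtraining discussed below).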
 Given enough data, enough hidden units, and
enough training time, a MLP with just one hidden
layer can learn to approximate virtually any
function to any degree of accuracy.
 The Neural Network node in SAS Enterprise
Miner supports many variations of MLP.
e.g.) you can add direct connections between the
inputs and outputs, or you can cut the default
connections and add new connections to bypass
one or more hidden layers.
 The MLP is the most popular form of neural
network architecture and are known as:
universal approximators.
written by JBacon 5/11/14 28
A multilayer perceptron:
 has any number of inputs.
 has one or more hidden layers with any number of
units.
 uses linear combination functions in the hidden
and output layers.
 uses sigmoid activation functions in the hidden
layers.
 has any number of outputs with any activation
function.
 has connections between the input layer and the
first hidden layer, between the hidden layers, and
between the last hidden layer and the output layer.

written by JBacon 5/11/14 29


 Getting the best from a neural network takes
some effort!
 Biggest decision is the number of units in the
hidden layer.
i.e.) the more units, the more patterns the network
can recognise – suggesting a very large hidden
layer but there is a drawback…
 The network might end up memorising the training
set instead of generalising from it,
i.e.) the network is overtrained.
 Overfitting is a major concern with networks using
customer data!

written by JBacon 5/11/14 30


Answer:
When the network performs very well on
the training set but does much worse on
the validation set, this is an indication
that the neural network has just
memorised the training set!

written by JBacon 5/11/14 31


 The size of the training set must be sufficiently
large to cover the full range of values for all
features that the network might encounter,
e.g.) In the real estate appraisal example, including
inexpensive houses and expensive houses, big
houses and little houses, etc.

 The learning rate. Initially, the learning rate should
be set high to make large adjustments to the
weights. Then, as training proceeds, the learning
rate should decrease to fine-tune the network.

written by JBacon 5/11/14 32


 Neural networks do not produce easily
understood rules that explain how they arrive
at a given result.

 Sensitivity analysis is used to aid the


understanding of the relative importance of
inputs into the network but sensitivity
analysis will not be discussed in this module.

written by JBacon 5/11/14 33


You should now know:
 What is meant by a neural network and be able to
give some examples of neural network
applications.
 What is meant by:
bias, a transfer function, combination function,
and an activation function.
 The three most commonly used transfer functions,
and the main differences between them.
 What is meant by a feed-forward network.
 What is involved in the training of a neural network
when using back propagation.
 What is meant by the Multilayer Perceptron (MLP).
 That SAS Enterprise Miner can easily handle
neural network modelling.
written by JBacon 5/11/14 34
 Background reading – Berry & Linoff (2011).
Data Mining Techniques. 3rd ed. pp281-320.

written by JBacon 5/11/14 35
