chap1_DMBI_Jan_April2022 (1)

The document discusses the concepts and techniques of data mining, emphasizing its importance in extracting valuable patterns from large datasets. It covers various data mining functionalities, including classification, clustering, and association analysis, as well as the integration of data mining systems with databases. Additionally, it highlights the challenges of identifying interesting patterns and the need for effective data preprocessing and evaluation methods.

Lingma Acheson

Department of Computer and Information Science, IUPUI


• 1.1 Motivation: Why data mining?
• 1.2 What is data mining?
• 1.3 Data Mining: On what kind of data?
• 1.4 Data mining functionality: What kinds of Patterns Can Be Mined?
• 1.5 Are all the patterns interesting?
• 1.6 Classification of data mining systems
• 1.7 Data Mining Task Primitives
• 1.8 Integration of data mining system with a DB and DW System
• 1.9 Major issues in data mining
Data Mining: Concepts and Techniques

• The Explosive Growth of Data: from terabytes (1000^4 bytes) to yottabytes (1000^8 bytes)
– Data collection and data availability
• Automated data collection tools, database systems, web
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: bioinformatics, scientific simulation, medical research …
• Society and everyone: news, digital cameras, …
• Data rich but information poor!
– What do those data mean?
– How to analyze data?

• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
– Data mining: a misnomer?
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

• Data analysis and decision support
– Market analysis and management
• Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
– Risk analysis and management
• Forecasting, customer retention, improved underwriting, quality control, competitive analysis
– Fraud detection and detection of unusual patterns (outliers)
• Other Applications
– Text mining (news group, email, documents) and Web

• Where does the data come from? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, surveys …
• Target marketing
– Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
• E.g. Most customers with an income level of 60k – 80k and food expenses of $600 – $800 a month live in that area
– Determine customer purchasing patterns over time
• E.g. Customers who are between 20 and 29 years old, with an income of 20k – 29k, usually buy this type of CD player
• Cross-market analysis: find associations/correlations between product sales, and predict based on such associations
– E.g. Customers who buy computer A usually buy software B

• Customer requirement analysis
– Identify the best products for different customers
– Predict what factors will attract new customers
• Provision of summary information
– Multidimensional summary reports
• E.g. Summarize all transactions of the first quarter from three different branches
Summarize all transactions of last year from a particular branch
Summarize all transactions of a particular product
– Statistical summary information
• E.g. What is the average age for customers who buy product A?

• Fraud detection
– Find outliers of unusual transactions
• Financial planning
– Summarize and compare the resources and spending

• Learning the application domain
– relevant prior knowledge and goals of application
• Identifying a target data set: data selection
• Data processing
– Data cleaning (remove noise and inconsistent data)
– Data integration (multiple data sources may be combined)
– Data selection (data relevant to the analysis task are retrieved from the database)
– Data transformation (data transformed or consolidated into forms appropriate for mining)
(Done with data preprocessing)
– Data mining (an essential process where intelligent methods are applied to extract data patterns)
– Pattern evaluation (identify the truly interesting patterns)
– Knowledge presentation (mined knowledge is presented to the user with visualization or representation techniques)
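
The four preprocessing steps listed above can be sketched as small functions chained together. This is only an illustration under stated assumptions: the record fields, the two source lists, and the helper names (`clean`, `integrate`, `select`, `transform`) are hypothetical and do not come from the slides, and min-max scaling is just one possible transformation.

```python
# Hypothetical toy records from two sources; field names and values are illustrative.
store_a = [
    {"customer": "c1", "age": 34, "spend": 120.0},
    {"customer": "c2", "age": None, "spend": 80.0},   # missing value
    {"customer": "c3", "age": 51, "spend": -5.0},     # inconsistent value
]
store_b = [
    {"customer": "c4", "age": 27, "spend": 300.0},
    {"customer": "c5", "age": 45, "spend": 60.0},
]


def clean(records):
    """Data cleaning: drop records with missing or inconsistent fields."""
    return [r for r in records if r["age"] is not None and r["spend"] >= 0]


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, min_spend):
    """Data selection: keep only the records relevant to the analysis task."""
    return [r for r in records if r["spend"] >= min_spend]


def transform(records):
    """Data transformation: min-max scale 'spend' into the range [0, 1]."""
    lo = min(r["spend"] for r in records)
    hi = max(r["spend"] for r in records)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all values are equal
    return [{**r, "spend_scaled": (r["spend"] - lo) / span} for r in records]


# Chain the steps in the order given on the slide.
prepared = transform(select(integrate(clean(store_a), clean(store_b)), min_spend=50.0))
```

The output of `transform` is what a mining step would consume; pattern evaluation and presentation would follow it.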
[Figure: layered pyramid with increasing potential to support business decisions toward the top. Layers from top to bottom, with the associated role: Decision Making (End User); Data Presentation, Visualization Techniques (Business Analyst); Data Mining, Information Discovery (Data Analyst); Data Exploration: Statistical Summary, Querying, and Reporting; Data Preprocessing/Integration, Data Warehouses (DBA); Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems]

• Database, data warehouse, WWW or other information repository (store data)
• Database or data warehouse server (fetch and combine data)
• Knowledge base (turn data into meaningful groups according to domain knowledge)
• Data mining engine (perform mining tasks)
• Pattern evaluation module (find interesting patterns)
• User interface (interact with the user)

[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Information Science, Visualization, and Other Disciplines]

• Not every “data mining system” performs true data mining
– Machine learning systems, statistical analysis (small amounts of data)
– Database systems (information retrieval, deductive querying, …)

• Database-oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Object-Relational Databases
– Temporal Databases, Sequence Databases, Time-Series databases
– Spatial Databases and Spatiotemporal Databases
– Text databases and Multimedia databases
– Heterogeneous Databases and Legacy Databases
– Data Streams
– The World-Wide Web

• DBMS – database management system, contains a collection of interrelated databases
e.g. Faculty database, student database, publications database
• Each database contains a collection of tables and functions to manage and access the data.
e.g. student_bio, student_graduation, student_parking
• Each table contains columns and rows, with columns as attributes of data and rows as records.
• Tables can be used to represent the relationships between or among multiple tables.

• With a relational query language, e.g. SQL, we will be able to find answers to questions such as:
– How many items were sold last year?
– Who has earned commissions higher than 10%?
– What were the total sales of Dell laptops last month?
• When data mining is applied to relational databases, we can search for trends or data patterns.
• Relational databases are one of the most commonly available and rich information repositories, and thus are a major data form in our study.
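
Questions like the ones above map onto aggregate SQL queries. The sketch below uses Python's standard `sqlite3` module with an in-memory database. The `sales` table, its columns, and the sample rows are hypothetical, and a fixed year stands in for "last year" to keep the example reproducible.

```python
import sqlite3

# In-memory database with a hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (item TEXT, brand TEXT, year INTEGER, quantity INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        ("laptop", "Dell", 2021, 3, 2700.0),
        ("laptop", "Dell", 2022, 1, 950.0),
        ("printer", "HP", 2021, 2, 300.0),
    ],
)

# "How many items were sold last year?" (2021 is assumed to be "last year")
items_2021 = conn.execute(
    "SELECT SUM(quantity) FROM sales WHERE year = ?", (2021,)
).fetchone()[0]

# "What were the total sales of Dell laptops?" (over all years in this toy table)
dell_total = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE brand = ? AND item = ?", ("Dell", "laptop")
).fetchone()[0]

conn.close()
```

Queries of this kind answer a question the user already knows how to ask; data mining on the same tables looks for trends or patterns that were not specified in advance.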

• A repository of information collected from multiple sources, stored under a unified schema, and that usually resides at a single site.
• Constructed via a process of data cleaning, data integration, data transformation, data loading and periodic data refreshing.

• Data are organized around major subjects, e.g. customer, item, supplier and activity.
• Provide information from a historical perspective (e.g. from the past 5 – 10 years)
• Typically summarized to a higher level (e.g. a summary of the transactions per item type for each store)
• Users can perform drill-down or roll-up operations to view the data at different degrees of summarization
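
Roll-up and drill-down amount to aggregating the same facts at coarser or finer levels. A minimal sketch, assuming a hypothetical list of `(store, item_type, item, amount)` rows and a helper `summarize` that is not part of any warehouse product:

```python
from collections import defaultdict

# Hypothetical fact rows: (store, item_type, item, amount)
facts = [
    ("store1", "computer", "laptop", 900.0),
    ("store1", "computer", "desktop", 700.0),
    ("store1", "printer", "inkjet", 150.0),
    ("store2", "computer", "laptop", 1000.0),
]


def summarize(rows, key_indexes):
    """Total the amount column, grouped by the chosen key columns."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in key_indexes)
        totals[key] += row[3]
    return dict(totals)


by_store_and_type = summarize(facts, (0, 1))  # summary per item type for each store
by_store = summarize(facts, (0,))             # roll-up: coarser level
by_item = summarize(facts, (0, 1, 2))         # drill-down: finer level
```

A real data warehouse would typically precompute such summaries in a data cube rather than rescanning the facts on each request.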

• Consists of a file where each record represents a transaction
• A transaction typically includes a unique transaction ID and a list of the items making up the transaction.
• Either stored in a flat file or unfolded into relational tables
• Easy to identify items that are frequently sold together
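
To illustrate why this layout makes co-purchased items easy to spot, the sketch below counts how often each pair of items appears in the same transaction. The transaction IDs and item lists are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactional data: transaction ID -> list of items.
transactions = {
    "T100": ["milk", "bread", "eggs"],
    "T200": ["milk", "bread"],
    "T300": ["bread", "butter"],
    "T400": ["milk", "bread", "butter"],
}

# Count every unordered pair of distinct items that shares a transaction.
pair_counts = Counter()
for items in transactions.values():
    for pair in combinations(sorted(set(items)), 2):
        pair_counts[pair] += 1

top_pair, top_count = pair_counts.most_common(1)[0]
```

Counting all pairs directly is fine for a toy example; on large data sets a frequent-itemset algorithm would prune the candidates instead.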

• Concept/Class Description: Characterization and Discrimination
– Data can be associated with classes or concepts.
• E.g. classes of items – computers, printers, …
concepts of customers – bigSpenders, budgetSpenders, …
• How to describe these items or concepts?
– Descriptions can be derived via
• Data characterization – summarizing the general characteristics of a target class of data.
– E.g. summarizing the characteristics of customers who spend more than $1,000 a year at AllElectronics. The result can be a general profile of the customers, such as 40 – 50 years old, employed, with excellent credit ratings.
• Data discrimination – comparing the target class with one or a set of comparative classes
– E.g. Compare the general features of software products whose sales increased by 10% in the last year with those whose sales decreased by 30% during the same period
• Or both of the above

• Mining Frequent Patterns, Associations and Correlations
– Frequent itemset: a set of items that frequently appear together in a transactional data set (e.g. milk and bread)
– Frequent subsequence: a pattern that customers tend to purchase product A, followed by a purchase of product B
– Association Analysis: find frequent patterns
• E.g. a sample analysis result – an association rule:
buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]
(If a customer buys a computer, there is a 50% chance that she will buy software. 1% of all of the transactions under analysis showed that computer and software are purchased together.)
• Association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold.
– Correlation Analysis: additional analysis to find statistical correlations between associated pairs
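
Support and confidence for a rule such as buys(X, “computer”) => buys(X, “software”) can be computed directly from transaction counts. The following is a small sketch. The five transactions and both thresholds are made up for illustration, so the resulting numbers differ from the 1% and 50% quoted in the slide.

```python
# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
    {"software"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


rule_support = support({"computer", "software"}, transactions)          # 2 of 5
rule_confidence = confidence({"computer"}, {"software"}, transactions)  # 2 of 3

# Assumed thresholds; a rule is kept only if it satisfies both.
min_support, min_confidence = 0.3, 0.6
is_kept = rule_support >= min_support and rule_confidence >= min_confidence
```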

• Classification and Prediction
– Classification
• The process of finding a model that describes and distinguishes the data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown.
• The derived model is based on the analysis of a set of training data (data objects whose class label is known).
• The model can be represented in classification (IF-THEN) rules, decision trees, neural networks, etc.
– Prediction
• Predict missing or unavailable numerical data values
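
As a toy illustration of classification, the sketch below derives a single IF-THEN rule from labeled training data and applies it to an object whose class is unknown. The income values and the single-threshold learner are assumptions made for the example. The class names reuse the bigSpender/budgetSpender concepts mentioned earlier, and real classifiers such as decision trees generalize this idea to many attributes.

```python
# Hypothetical training data: (income in thousands, known class label).
training = [
    (20, "budgetSpender"), (25, "budgetSpender"), (40, "budgetSpender"),
    (60, "bigSpender"), (75, "bigSpender"), (90, "bigSpender"),
]


def classify(income, threshold):
    """IF income > threshold THEN bigSpender ELSE budgetSpender."""
    return "bigSpender" if income > threshold else "budgetSpender"


def learn_threshold(examples):
    """Pick the midpoint threshold that misclassifies the fewest training examples."""
    values = sorted({x for x, _ in examples})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(classify(x, threshold) != label for x, label in examples)

    return min(candidates, key=errors)


threshold = learn_threshold(training)   # model derived from the training data
prediction = classify(70, threshold)    # class of an object with unknown label
```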

• Cluster Analysis
– Class label is unknown: group data to form new classes
– Clusters of objects are formed based on the principle of maximizing intra-class similarity and minimizing interclass similarity
• E.g. Identify homogeneous subpopulations of customers. These clusters may represent individual target groups for marketing.
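
One common way to form such clusters is k-means, which the slides do not name; it is used here only as an example technique. The sketch groups hypothetical one-dimensional spending values around two centers, and the starting centers are arbitrary assumptions.

```python
def kmeans_1d(values, centers, iterations=10):
    """Basic k-means on 1-D data: assign each value to its nearest center, then re-average."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical monthly spending of six customers.
spend = [10, 12, 11, 95, 100, 105]
centers, clusters = kmeans_1d(spend, centers=[0, 50])
```

No class labels are supplied. The two groups emerge from the data alone, which is the difference from classification.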

• Outlier Analysis
– Data that do not comply with the general behavior or model.
– Outliers are usually discarded as noise or exceptions.
– Useful for fraud detection.
• E.g. Detect purchases of extremely large amounts

• Evolution Analysis
– Describes and models regularities or trends for objects whose behavior changes over time.
• E.g. Identify stock evolution regularities for overall stocks and for the stocks of particular companies.
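
For the outlier analysis example above (detecting purchases of extremely large amounts), one simple heuristic is the modified z-score based on the median absolute deviation. This is an assumed choice for illustration, not a method given in the slides. The purchase amounts are hypothetical, and 3.5 is a commonly used cutoff.

```python
import statistics

# Hypothetical purchase amounts; one is extremely large.
amounts = [20.0, 35.0, 25.0, 30.0, 40.0, 28.0, 5000.0]


def find_outliers(values, threshold=3.5):
    """Return values whose modified z-score (median/MAD based) exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread around the median, so nothing can be flagged this way
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]


outliers = find_outliers(amounts)
```

The median is used instead of the mean because a single extreme purchase would otherwise distort the baseline it is compared against.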

• Data mining may generate thousands of patterns: not all of them are interesting
• A pattern is interesting if it is
– easily understood by humans
– valid on new or test data with some degree of certainty
– potentially useful
– novel
– validates some hypothesis that a user seeks to confirm
• An interesting pattern represents knowledge!

• Objective measures
– Based on statistics and structures of patterns, e.g., support, confidence, etc. (Rules that do not satisfy a threshold are considered uninteresting.)
• Subjective measures
– Reflect the needs and interests of a particular user.
• E.g. A marketing manager is only interested in characteristics of customers who shop frequently.
– Based on user’s belief in the data.
• E.g. Patterns are interesting if they are unexpected, or can be used for strategic planning, etc.
• Objective and subjective measures need to be combined.

• Find all the interesting patterns: Completeness
– Unrealistic and inefficient
– User-provided constraints and interestingness measures should be used
• Search for only interesting patterns: An optimization problem
– Highly desirable
– No need to search through the generated patterns to identify truly interesting ones.
– Measures can be used to rank the discovered patterns according to their interestingness.
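
Filtering by thresholds and ranking by a measure can be sketched in a few lines. The rules, their support and confidence values, the thresholds, and the choice to rank by confidence first are all assumptions made for illustration.

```python
# Hypothetical mined rules: (rule, support, confidence)
rules = [
    ("computer => software", 0.01, 0.50),
    ("milk => bread", 0.20, 0.80),
    ("printer => paper", 0.05, 0.30),
    ("laptop => mouse", 0.08, 0.65),
]


def interesting(rules, min_support, min_confidence):
    """Keep rules meeting both thresholds, ranked by confidence, then support."""
    kept = [r for r in rules if r[1] >= min_support and r[2] >= min_confidence]
    return sorted(kept, key=lambda r: (r[2], r[1]), reverse=True)


ranked = interesting(rules, min_support=0.02, min_confidence=0.5)
```

This post-filters rules that were already generated. Pushing the same constraints into the mining step itself is what the "search for only interesting patterns" bullet describes.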

[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Information Science, Visualization, and Other Disciplines]
• Database
– Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
• Knowledge
– Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
– Multiple/integrated functions and mining at multiple levels
• Techniques utilized
– Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
• Applications adapted
– Retail, telecommunication, banking, fraud analysis, bio-data

• How to construct a data mining query?
– The primitives allow the user to interactively communicate with the data mining system during discovery to direct the mining process, or examine the findings

– The primitives specify:
(1) The set of task-relevant data – which portion of the database is to be used
– Database or data warehouse name
– Database tables or data warehouse cubes
– Condition for data selection
– Relevant attributes or dimensions
– Data grouping criteria

– The primitives specify:
(2) The kind of knowledge to be mined – what DM functions are to be performed
– Characterization
– Discrimination
– Association
– Classification/prediction
– Clustering
– Outlier analysis
– Other data mining tasks

(3) The background knowledge to be used – what domain knowledge, concept hierarchies, etc.
(4) Interestingness measures and thresholds – support, confidence, etc.
(5) Visualization methods – what form to display the result, e.g. …

• DMQL – Data Mining Query Language
– Designed to incorporate these primitives
– Allow user to interact with DM systems
– Providing a standardized language like SQL

An Example Query in DMQL
[Figure: an example DMQL query whose clauses are annotated with the numbers of the primitives they specify: (1), (2), (3), and (5)]

• Automated vs. query-driven?
– Finding all the patterns autonomously in a database? Unrealistic, because the patterns could be too many but uninteresting
• Data mining should be an interactive process
– User directs what to be mined
• Users must be provided with a set of primitives to be used to communicate with the data mining system
• Incorporating these primitives in a data mining query language
– More flexible user interaction
– Foundation for design of graphical user interface
– Standardization of data mining industry and practice

• No coupling
– Flat file processing, no utilization of any functions of a DB/DW system
– Not recommended
• Loose coupling
– Fetching data from DB/DW
– Does not explore data structures and query optimization methods provided by DB/DW system
– Difficult to achieve high scalability and good performance with large data sets
• Semi-tight
– Efficient implementations of a few essential data mining primitives in a DB/DW system are provided, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some stat functions
– Enhanced DM performance
• Tight
– DM is smoothly integrated into a DB/DW system, mining query is optimized based on mining query analysis, data structures, indexing, query processing methods of a DB/DW system

• Mining methodology and User interaction
– Mining different kinds of knowledge
• DM should cover a wide spectrum of data analysis and knowledge discovery tasks
• Enable the database to be used in different ways
• Require the development of numerous data mining techniques
– Interactive mining of knowledge at multiple levels of abstraction
• Difficult to know exactly what will be discovered
• Allow users to focus the search, refine data mining requests
– Incorporation of background knowledge
• Guide the discovery process
• Allow discovered patterns to be expressed in concise terms and different levels of abstraction
– Data mining query languages and ad hoc data mining
• High-level query languages need to be developed
– Presentation and visualization of results
• Knowledge should be easily understood and directly usable
• High level languages, visual representations or other expressive forms
• Require the DM system to adopt the above techniques
– Handling noisy or incomplete data
• Require data cleaning methods and data analysis methods that can handle noise
– Pattern evaluation – the interestingness problem
• How to develop techniques to assess the interestingness of discovered patterns, especially with subjective measures based on user beliefs or expectations

• Performance Issues
– Efficiency and scalability
• Huge amount of data
• Running time must be predictable and acceptable
– Parallel, distributed and incremental mining algorithms
• Divide the data into partitions and process them in parallel
• Incorporate database updates without having to mine the entire data again from scratch
• Diversity of Database Types
– Other databases that contain complex data objects, multimedia data, spatial data, etc.
– Expect to have different DM systems for different kinds of data
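
The incremental mining idea listed under performance issues (incorporating updates without mining everything again from scratch) can be sketched with running counts. The class name `IncrementalItemCounter` and the sample transactions are hypothetical. Real incremental algorithms maintain richer structures than single-item counts.

```python
from collections import Counter


class IncrementalItemCounter:
    """Keeps item supports current as new transactions arrive, without rescanning old ones."""

    def __init__(self):
        self.counts = Counter()
        self.n_transactions = 0

    def update(self, new_transactions):
        """Fold a batch of new transactions into the running counts."""
        for items in new_transactions:
            self.counts.update(set(items))  # count each item once per transaction
            self.n_transactions += 1

    def support(self, item):
        """Fraction of all transactions seen so far that contain the item."""
        if self.n_transactions == 0:
            return 0.0
        return self.counts[item] / self.n_transactions


counter = IncrementalItemCounter()
counter.update([["milk", "bread"], ["milk"]])               # initial load
counter.update([["bread", "eggs"], ["milk", "eggs"]])       # later database update
milk_support = counter.support("milk")
```

Because the counts are additive, the same structure also suits partitioned data: each partition can be counted separately and the counters summed.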
