
Introduction To Data Mining

Uploaded by

nagarajan
Copyright
© © All Rights Reserved

scenarios by analyzing

patterns in data.
• Data is everywhere: in spreadsheets, on social media platforms, in
product reviews and feedback. In the current information age it is
created at blinding speed and, when analyzed correctly, can be a
company’s most valuable asset.
• “To grow your business, even to grow in your life, sometimes
all you need to do is analysis.”
• Data is raw information, and analysis of data is the
systematic process of interpreting and transforming that data
into meaningful insights.
• In a data-driven world, analysis involves applying
statistical, mathematical, or computational techniques to
extract patterns, trends, and correlations from datasets.
• Data analysis is the process of inspecting, cleaning,
transforming, and modeling data to discover useful
information, draw conclusions, and support decision-
making.
• It involves the application of various techniques and
tools to extract meaningful insights from raw data,
helping in understanding patterns, trends, and
relationships within a dataset.
• Data and analysis together form the backbone of evidence-
based decision-making, enabling organizations and
individuals to understand complex phenomena, predict
outcomes, and derive actionable conclusions for improved
outcomes and efficiency.
Why Is Data Analysis Important?
• Data analysis is crucial for informed decision-making,
revealing patterns, trends, and insights within datasets.
• It enhances strategic planning, identifies opportunities and
challenges, improves efficiency, and fosters a deeper
understanding of complex phenomena across various
industries and fields.
• Informed Decision-Making: Analysis of data provides a basis for informed
decision-making by offering insights into past performance, current trends,
and potential future outcomes.
• Business Intelligence: Analyzed data helps organizations gain a competitive
edge by identifying market trends, customer preferences, and areas for
improvement.
• Problem Solving: It aids in identifying and solving problems within a system
or process by revealing patterns or anomalies that require attention.
• Performance Evaluation: Analysis of data enables the assessment of
performance metrics, allowing organizations to measure success, identify
areas for improvement, and set realistic goals.
• Risk Management: Understanding patterns in data helps in predicting and
managing risks, allowing organizations to mitigate potential challenges.
• Optimizing Processes: Data analysis identifies inefficiencies in processes,
allowing for optimization and cost reduction.
Introduction to Data Mining
• Data mining is a technology that blends traditional data
analysis methods with sophisticated algorithms for
processing large volumes of data.
• It has also opened up stimulating opportunities for exploring
and analyzing new types of data and for analyzing old types
of data in new ways.
• Business: Point-of-sale data collection technologies (bar code
scanners, radio frequency identification (RFID), and smart card
technology) have allowed retailers to collect up-to-the-minute
data about customer purchases at the checkout counters of
their stores.
• Retailers can utilize this information, along with other business-critical
data such as Web logs from e-commerce Web sites and customer
service records from call centers, to help them better
understand the needs of their customers and make more informed
business decisions.
• Data mining techniques can be used to support a wide range of
business intelligence applications such as customer profiling, targeted
marketing, work flow management, store layout, and fraud detection.
• It can also help retailers answer important business questions such as
“Who are the most profitable customers?” “What products can be
cross-sold or up-sold?” and “What is the revenue outlook of the
company for next year?”
• Data mining is the process of automatically discovering useful
information in large data repositories.
• Data mining techniques are deployed to scour large databases in
order to find novel and useful patterns that might otherwise
remain unknown.
• Not all information discovery tasks are considered to be data
mining.
• For example, looking up individual records using a database
management system or finding particular Web pages via a query
to an Internet search engine are tasks related to the area of
information retrieval.
[Diagram: Data Mining and Knowledge Discovery]
Data Mining and Knowledge Discovery
• Data mining is an integral part of knowledge discovery in databases
(KDD), which is the overall process of converting raw data into useful
information, as shown in the diagram.
• This process consists of a series of transformation steps, from data
preprocessing to postprocessing of data mining results.
KDD Process in Data Mining
• In the context of computer science, “Data Mining” can be referred to as
knowledge mining from data, knowledge extraction, data/pattern analysis,
data archaeology, and data dredging.
• Data mining, also known as Knowledge Discovery in Databases, refers to the
nontrivial extraction of implicit, previously unknown, and potentially useful
information from data stored in databases.
• The purpose of data mining is to extract useful information from large datasets
and use it to make predictions or support better decision-making.
• Nowadays, data mining is used almost everywhere that large amounts of
data are stored and processed.
Examples: the banking sector, market analysis, and network intrusion detection.
KDD Process
• KDD (Knowledge Discovery in Databases) is a process that involves the
extraction of useful, previously unknown, and potentially valuable information
from large datasets.
• The KDD process is iterative: it usually takes multiple passes through the
steps below to extract accurate knowledge from the data. The KDD process
includes the following steps:
Data Cleaning
• Data cleaning is the removal of noisy and irrelevant data from the collection.
It includes:
• Handling missing values.
• Cleaning noisy data, where noise is a random error or variance.
• Cleaning with data discrepancy detection and data transformation tools.
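As a rough illustration of the first two cleaning tasks, here is a minimal sketch. It assumes missing values are marked as None and that noise can be reduced with a sliding-window median; both choices, the function names, and the readings are illustrative assumptions, not a prescribed method.

```python
from statistics import median

def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def smooth_median(values, window=3):
    """Reduce random noise with a sliding-window median (window should be odd)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed

# Made-up readings: two missing values and one obvious spike (95.0).
readings = [10.0, None, 12.0, 95.0, 11.0, None, 13.0]
filled = fill_missing(readings)
cleaned = smooth_median(filled)
```

The median is used for filling because a single spike would distort a mean-based fill; other strategies (dropping the record, interpolation) may suit other data better.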
Data Integration
• Data integration combines heterogeneous data from
multiple sources into a common source (a data
warehouse). It can be carried out with data migration
tools, data synchronization tools, and the ETL
(Extract, Transform, Load) process.
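A small sketch of the idea, assuming two hypothetical sources that describe the same customers under different field names. The sources, field names, and the integrate function are invented for illustration.

```python
# Two hypothetical sources with different schemas for the same customers.
crm_rows = [
    {"cust_id": 1, "name": "Asha"},
    {"cust_id": 2, "name": "Ravi"},
]
sales_rows = [
    {"customer": 1, "amount": 250.0},
    {"customer": 1, "amount": 100.0},
    {"customer": 3, "amount": 75.0},
]

def integrate(crm, sales):
    """Combine both sources into one record per customer id."""
    merged = {}
    for row in crm:
        merged[row["cust_id"]] = {"id": row["cust_id"], "name": row["name"], "total": 0.0}
    for row in sales:
        # A customer seen only in the sales source gets a record with no name.
        record = merged.setdefault(
            row["customer"], {"id": row["customer"], "name": None, "total": 0.0}
        )
        record["total"] += row["amount"]
    return [merged[key] for key in sorted(merged)]

warehouse = integrate(crm_rows, sales_rows)
```

Customer 3 appears only in the sales source, so its name stays unknown; resolving such gaps is part of what makes integration harder than a simple join.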
Data Selection
• Data selection is the process in which the data
relevant to the analysis is identified and retrieved from
the data collection. Neural networks, decision trees,
clustering, and regression methods can be used for
this.
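In its simplest form, selection is a filter over records plus a choice of attributes. The sketch below shows only that simple form, not the model-based methods named above; the records and the select helper are hypothetical.

```python
# Hypothetical customer records; the field names are illustrative only.
records = [
    {"id": 1, "age": 34, "city": "Chennai", "spend": 1200.0, "notes": "call back"},
    {"id": 2, "age": 51, "city": "Mumbai", "spend": 300.0, "notes": ""},
    {"id": 3, "age": 27, "city": "Chennai", "spend": 950.0, "notes": "vip"},
]

def select(records, attributes, predicate):
    """Keep records that satisfy predicate, projected onto the chosen attributes."""
    return [{a: r[a] for a in attributes} for r in records if predicate(r)]

# Keep only Chennai customers and only the attributes needed for the analysis.
relevant = select(records, ["id", "age", "spend"], lambda r: r["city"] == "Chennai")
```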
Data Transformation
• Data transformation is the process of transforming data
into the form required by the mining procedure. It is a
two-step process:
• Data mapping: assigning elements from the source base to the
destination to capture transformations.
• Code generation: creating the actual transformation program.
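The two steps can be sketched as a mapping table and a function built from it. This is a toy version under stated assumptions: the field names, the conversions, and build_transform are all hypothetical.

```python
# Step 1, data mapping: source field -> (destination field, conversion).
GRADE_RANK = {"low": 0, "medium": 1, "high": 2}
mapping = {
    "height_cm": ("height_m", lambda cm: cm / 100),
    "grade": ("grade_rank", GRADE_RANK.__getitem__),
}

# Step 2, code generation: build the transformation program from the mapping.
def build_transform(mapping):
    def transform(record):
        return {
            dest: convert(record[src])
            for src, (dest, convert) in mapping.items()
        }
    return transform

transform = build_transform(mapping)
row = transform({"height_cm": 180, "grade": "high"})
```

Keeping the mapping as data means a new source field only needs a new mapping entry, not a change to the transformation code.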
Data Mining
• Data mining is the application of techniques to extract
potentially useful patterns. It transforms task-relevant data into
patterns and decides the purpose of the model, using classification or
characterization.
Pattern Evaluation
• Pattern evaluation identifies the truly interesting
patterns that represent knowledge, based on given
measures. It finds an interestingness score for each
pattern and uses summarization and visualization to
make the data understandable to the user.
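One common pair of interestingness measures, used here only as an example, is the support and confidence of an association rule. The sketch below scores two candidate rules on a made-up set of transactions; the items and rules are illustrative.

```python
# Made-up market basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Candidate rules as (antecedent, consequent) pairs, ranked by confidence.
rules = [({"milk"}, {"butter"}), ({"bread"}, {"milk"})]
ranked = sorted(rules, key=lambda r: confidence(r[0], r[1], transactions), reverse=True)
```

Here the rule bread -> milk ranks first because two of the three bread transactions also contain milk, while only one of the three milk transactions contains butter.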
Knowledge Representation
• This involves presenting the results in a way that is
meaningful and can be used to make decisions.
Advantages of KDD

• Improves decision-making: KDD provides valuable insights and
knowledge that can help organizations make better decisions.
• Increased efficiency: KDD automates repetitive and time-consuming
tasks and makes the data ready for analysis, which saves time and
money.
• Better customer service: KDD helps organizations gain a better
understanding of their customers’ needs and preferences, which can
help them provide better customer service.
• Fraud detection: KDD can be used to detect fraudulent activities by
identifying patterns and anomalies in the data that may indicate fraud.
• Predictive modeling: KDD can be used to build predictive models that
can forecast future trends and patterns.
Disadvantages of KDD
• Privacy concerns: KDD can raise privacy concerns as it involves collecting and
analyzing large amounts of data, which can include sensitive information about
individuals.
• Complexity: KDD can be a complex process that requires specialized skills and
knowledge to implement and interpret the results.
• Unintended consequences: KDD can lead to unintended consequences, such as bias
or discrimination, if the data or models are not properly understood or used.
• Data quality: The KDD process depends heavily on the quality of the data; if the
data is not accurate or consistent, the results can be misleading.
• High cost: KDD can be an expensive process, requiring significant investments in
hardware, software, and personnel.
• Overfitting: KDD process can lead to overfitting, which is a common problem in
machine learning where a model learns the detail and noise in the training data to
the extent that it negatively impacts the performance of the model on new unseen
data.
Difference between KDD and Data Mining
Parameter | KDD | Data Mining
Definition | KDD refers to a process of identifying valid, novel, potentially useful, and ultimately understandable patterns and relationships in data. | Data mining refers to a process of extracting useful and valuable information or patterns from large data sets.
Objective | To find useful knowledge from data. | To extract useful information from data.
Techniques used | Data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge representation and visualization. | Association rules, classification, clustering, regression, decision trees, neural networks, and dimensionality reduction.
Output | Structured information, such as rules and models, that can be used to make decisions or predictions. | Patterns, associations, or insights that can be used to improve decision-making or understanding.
Focus | Focus is on the discovery of useful knowledge, rather than simply finding patterns in data. | Focus is on the discovery of patterns or relationships in data.
Role of domain expertise | Domain expertise is important in KDD, as it helps in defining the goals of the process, choosing appropriate data, and interpreting the results. | Domain expertise is less critical in data mining, as the algorithms are designed to identify patterns without relying on prior knowledge.
Attributes and Objects
• A data set can often be viewed as a collection of data objects.
• Other names for a data object are record, point, vector, pattern, event,
case, sample, observation, or entity.
• In turn, data objects are described by a number of attributes that capture
the basic characteristics of an object, such as the mass of a physical object
or the time at which an event occurred.
• Other names for an attribute are variable, characteristic, field,
feature, or dimension.
• Example (student information): Often, a data set is a file, in which
the objects are records (or rows) in the file and each field (or column)
corresponds to an attribute.
• For example, consider a data set that consists of
student information. Each row corresponds to a student and each
column is an attribute that describes some aspect of a student, such
as grade point average (GPA) or identification number (ID).
• Although record-based data sets are common, either in flat files or relational
database systems, there are other important types of data sets and systems
for storing data. We first consider attributes.
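A record-based data set of this kind can be sketched as a list of objects with named attributes. The Student class, its fields, and all values below are made up for illustration and are not taken from the original table.

```python
from dataclasses import dataclass, fields

@dataclass
class Student:
    """One data object (a row); each field is an attribute (a column)."""
    student_id: int
    year: str
    gpa: float

# Hypothetical rows.
students = [
    Student(1001, "Senior", 3.24),
    Student(1002, "Sophomore", 3.51),
    Student(1003, "Freshman", 3.62),
]

attribute_names = [f.name for f in fields(Student)]  # the columns
gpa_column = [s.gpa for s in students]               # one attribute across all objects
```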
Attributes and Measurement
An attribute is a property or characteristic of an object that may vary, either
from one object to another or from one time to another.
For example, eye color varies from person to person, while the temperature of
an object varies over time. Note that eye color is a symbolic attribute with a
small number of possible values {brown, black, blue, green etc.}, while
temperature is a numerical attribute with a potentially unlimited number of
values. At the most basic level, attributes are not about numbers or symbols.
However, to discuss and more precisely analyze the characteristics of objects,
we assign numbers or symbols to them.
To do this in a well-defined way, we need a measurement scale.
• A measurement scale is a rule (function) that associates a numerical or symbolic
value with an attribute of an object.
• Formally, the process of measurement is the application of a measurement scale
to associate a value with a particular attribute of a specific object.
• While this may seem a bit abstract, we engage in the process of measurement
all the time. For instance, we step on a bathroom scale to determine our weight,
we classify someone as male or female, or we count the number of chairs in a
room to see if there will be enough to seat all the people coming to a meeting.
• In all these cases, the “physical value” of an attribute of an object is mapped to
a numerical or symbolic value.
• With this background, we can now discuss the type of an attribute, a concept
that is important in determining if a particular data analysis technique is
consistent with a specific type of attribute.
Type of Attributes
• It should be apparent from the previous discussion that the
properties of an attribute need not be the same as the properties
of the values used to measure it.
• In other words, the values used to represent an attribute may
have properties that are not properties of the attribute itself, and
vice versa. This is illustrated with two examples.
• Employee Age and ID Number. Two attributes that might be
associated with an employee are ID and age (in years).
• Both of these attributes can be represented as integers. However,
while it is reasonable to talk about the average age of an
employee, it makes no sense to talk about the average employee
ID.
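The point can be seen in a few lines: because both attributes are stored as integers, nothing stops a program from averaging either one. The employee records below are invented for illustration.

```python
from statistics import mean

# Hypothetical employees; both attributes are stored as integers.
employees = [
    {"id": 1001, "age": 34},
    {"id": 1002, "age": 29},
    {"id": 1007, "age": 45},
]

mean_age = mean(e["age"] for e in employees)  # a meaningful summary
mean_id = mean(e["id"] for e in employees)    # computes fine, but means nothing

# The only meaningful question about IDs is whether they are distinct.
ids_distinct = len({e["id"] for e in employees}) == len(employees)
```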
Different Types of
Attributes
• A useful (and simple) way to specify the type of an attribute is to identify the
properties of numbers that correspond to underlying properties of the attribute.
• For example, an attribute such as length has many of the properties of numbers.
• It makes sense to compare and order objects by length, as well as to talk about
the differences and ratios of length.
The following properties (operations) of numbers are typically used to describe
attributes.
1. Distinctness: = and ≠
2. Order: <, ≤, >, and ≥
3. Addition: + and −
4. Multiplication: * and /
Given these properties, we can define four types of attributes:
1. Nominal,
2. Ordinal,
3. Interval, and
4. Ratio.
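Each type is defined by the operations listed above that are meaningful for
it: nominal attributes support only distinctness, ordinal attributes add
order, interval attributes add addition and subtraction, and ratio
attributes add multiplication and division. These operations in turn
determine which statistical operations are valid for each type.
Each attribute type possesses all of the properties and operations of the
attribute types above it.
Consequently, any property or operation that is valid for nominal, ordinal,
and interval attributes is also valid for ratio attributes.
• In other words, the definition of the attribute types is cumulative.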
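The cumulative structure can be sketched by ordering the four types and recording the lowest type at which each operation becomes meaningful. The names AttributeType, REQUIRED_LEVEL, and is_valid are illustrative, not standard library names.

```python
from enum import IntEnum

class AttributeType(IntEnum):
    """Ordered so that each type includes the operations of the ones below it."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4

# Lowest attribute type for which each operation is meaningful.
REQUIRED_LEVEL = {
    "==": AttributeType.NOMINAL, "!=": AttributeType.NOMINAL,
    "<": AttributeType.ORDINAL, "<=": AttributeType.ORDINAL,
    ">": AttributeType.ORDINAL, ">=": AttributeType.ORDINAL,
    "+": AttributeType.INTERVAL, "-": AttributeType.INTERVAL,
    "*": AttributeType.RATIO, "/": AttributeType.RATIO,
}

def is_valid(op, attr_type):
    """True if op is meaningful for an attribute of the given type."""
    return attr_type >= REQUIRED_LEVEL[op]

# Employee ID is nominal: equality is fine, ordering is not.
id_checks = (is_valid("==", AttributeType.NOMINAL), is_valid("<", AttributeType.NOMINAL))
```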
Nominal and ordinal attributes are collectively referred to as
categorical or qualitative attributes.
As the name suggests, qualitative attributes, such as employee
ID, lack most of the properties of numbers.
Even if they are represented by numbers, i.e., integers, they
should be treated more like symbols.
The remaining two types of attributes, interval and ratio, are
collectively referred to as quantitative or numeric attributes.
Quantitative Attributes
Quantitative attributes are represented by numbers and have most of the
properties of numbers.
Note that quantitative attributes can be integer-valued or continuous.
• The types of attributes can also be described in terms of transformations
that do not change the meaning of an attribute.
• Indeed, S. Smith Stevens, the psychologist who originally defined the types
of attributes described above, defined them in terms of these
permissible transformations.
For example, the meaning of a length attribute is unchanged if it is measured
in meters instead of feet.
• Indeed, the only aspect of employees that we want to capture
with the ID attribute is that they are distinct.
• Consequently, the only valid operation for employee IDs is to
test whether they are equal.
• There is no hint of this limitation, however, when integers are
used to represent the employee ID attribute.
• For the age attribute, the properties of the integers used to
represent age are very much the properties of the attribute.
• Even so, the correspondence is not complete since, for
example, ages have a maximum, while integers do not.
• The statistical operations that make sense for a particular type of
attribute are those that will yield the same results when the attribute is
transformed using a transformation that preserves the attribute’s
meaning.
• To illustrate, the average length of a set of objects is different when
measured in meters rather than in feet, but both averages represent the
same length.
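A quick numerical check of this statement, using made-up lengths and an approximate conversion factor:

```python
import math
from statistics import mean

FEET_PER_METER = 3.28084  # approximate conversion factor

lengths_m = [1.0, 2.5, 4.0]  # made-up lengths in meters
lengths_ft = [x * FEET_PER_METER for x in lengths_m]

mean_m = mean(lengths_m)
mean_ft = mean(lengths_ft)

# The two averages are different numbers...
numbers_differ = mean_m != mean_ft
# ...but converting the meter average to feet gives the feet average,
# so both describe the same physical length.
same_length = math.isclose(mean_ft, mean_m * FEET_PER_METER)
```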
Temperature Scales

• Temperature provides a good illustration of some of the concepts that have been
described.
• First, temperature can be either an interval or a ratio attribute, depending on its
measurement scale.
• When measured on the Kelvin scale, a temperature of 2° is, in a physically
meaningful way, twice that of a temperature of 1°.
• This is not true when temperature is measured on either the Celsius or Fahrenheit
scales, because, physically, a temperature of 1° Fahrenheit (Celsius) is not much
different than a temperature of 2° Fahrenheit (Celsius).
• The problem is that the zero points of the Fahrenheit and Celsius scales are, in a
physical sense, arbitrary, and therefore, the ratio of two Celsius or Fahrenheit
temperatures is not physically meaningful.
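The effect of the arbitrary zero point shows up directly in a small calculation. The two temperatures are arbitrary examples, and the conversion uses the standard offset of 273.15.

```python
import math

def celsius_to_kelvin(c):
    """Shift the zero point from the Celsius origin to absolute zero."""
    return c + 273.15

t1_c, t2_c = 10.0, 20.0
t1_k, t2_k = celsius_to_kelvin(t1_c), celsius_to_kelvin(t2_c)

ratio_c = t2_c / t1_c  # 2.0, but not physically meaningful
ratio_k = t2_k / t1_k  # about 1.035, the physically meaningful ratio

# Differences survive the change of zero point, which is why Celsius
# still qualifies as an interval scale.
diffs_agree = math.isclose(t2_c - t1_c, t2_k - t1_k)
```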
Describing Attributes by the Number of Values
• An independent way of distinguishing between attributes is by the number
of values they can take.
Discrete
• A discrete attribute has a finite or countably infinite set of values. Such
attributes can be categorical, such as zip codes or ID numbers, or numeric,
such as counts.
• Discrete attributes are often represented using integer variables. Binary
attributes are a special case of discrete attributes and assume only two
values, e.g., true/false, yes/no, male/female, or 0/1.
• Binary attributes are often represented as Boolean variables, or as integer
variables that only take the values 0 or 1.
Continuous Attribute
• A continuous attribute is one whose values are real numbers.
• Examples include attributes such as temperature, height, or weight. Continuous
attributes are typically represented as floating-point variables.
• Practically, real values can only be measured and represented with limited precision.
• In theory, any of the measurement scale types—nominal, ordinal, interval, and ratio—
could be combined with any of the types based on the number of attribute values—
binary, discrete, and continuous.
• However, some combinations occur only infrequently or do not make much sense.
• For instance, it is difficult to think of a realistic data set that contains a continuous
binary attribute.
• Typically, nominal and ordinal attributes are binary or discrete, while interval and ratio
attributes are continuous.
• However, count attributes, which are discrete, are also ratio attributes.
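As a rough sketch, a column of values can be labeled by the number and kind of values it holds. The rule below is a simple heuristic chosen for illustration (any float means continuous, at most two distinct values means binary); it is an assumption, not a standard test.

```python
def value_kind(values):
    """Heuristically label a column as 'binary', 'discrete', or 'continuous'."""
    if any(isinstance(v, float) for v in values):
        return "continuous"
    if len(set(values)) <= 2:
        return "binary"
    return "discrete"

kinds = {
    "is_member": value_kind([True, False, True]),               # binary
    "zip_code": value_kind(["600001", "600002", "600003"]),     # discrete, categorical
    "visit_count": value_kind([0, 3, 7, 3]),                    # discrete, numeric
    "temperature": value_kind([36.6, 37.1, 38.0]),              # continuous
}
```

The label says nothing about the measurement scale: zip_code and visit_count are both discrete, yet the first is nominal and the second is a ratio attribute.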
Types of Data Sets

• There are many types of data sets, and as the field of data
mining develops and matures, a greater variety of data sets
becomes available for analysis.
• In this section, we describe some of the most common types.
• For convenience, we have grouped the types of data sets into
three groups: record data, graph based data, and ordered
data.
• These categories do not cover all possibilities, and other
groupings are certainly possible.
General Characteristics of Data Sets
• Dimensionality
• Sparsity
• Resolution
• Record Data
• Transaction or Market Basket Data
• The Data Matrix
