Introduction to Data Mining_125604
Introduction to Data Mining_125604
Online analytical processing (OLAP) is an analysis technique with functionalities such as summarization,
consolidation, and aggregation- and the ability to view information from different angles.
Online analytical processing (OLAP) is a software technology you can use to analyze business data from
different points of view.
Although OLAP tools support multidimensional analysis and decision-making, additional data analysis
tools are required for in-depth analysis.
For example, data mining tools that provide data classification, clustering, outlier/anomaly detection,
and the characterization of changes in data over time.
Many people treat data mining as a synonym for another popularly used term, knowledge discovery
from data, or KDD, while others view data mining as merely an essential step in the process of
knowledge discovery.
Steps 1 through 4 are different forms of data preprocessing, where data are prepared for mining. The
data mining step may interact with the user or a knowledge base.
The interesting patterns are presented to the user and may be stored as new knowledge in the
knowledge base.
The preceding view shows data mining as one step in the knowledge discovery process, albeit an
essential one because it uncovers hidden patterns for evaluation.
However, in industry, media, and the research milieu, the term data mining is often used to refer to the
entire knowledge discovery process (perhaps because the term is shorter than knowledge discovery
from data).
Data Mining
Data mining is the process of discovering interesting patterns and knowledge from large amounts of
data.
The data sources can include databases, data warehouses, the Web, other information repositories, or
data that are streamed into the system dynamically.
It is a collection of techniques for efficient automated discovery of previously unknown, valid, novel,
useful and understandable patterns in large databases. The patterns must be actionable so they may be
used in an enterprise’s decision making.
• Data mining is a powerful tool that helps to determine the relationships and patterns within
data.
• However, it does not work by itself and does not eliminate the requirement for understanding
data, analytical methods and to know business.
• Data mining extracts hidden information from the data, but it is not able to assess the value of
the information.
• One should know the important patterns and relationships to work with data over time.
• In addition to discovering new patterns, data mining can also highlight other empirical
observations that are not instantly visible through simple observation.
• It is important to note that the relationships or patterns predicted through data mining are not
necessarily causes for an action or behavior.
• Database data,
• Data warehouse data, and
• Transactional data
• Time-related or sequence data (e.g., historical records, stock exchange data, and time-series and
biological sequence data)
• Data streams (e.g., video surveillance and sensor data, which are continuously transmitted),
• Spatial data (e.g., maps),
• Engineering design data (e.g., the design of buildings, system components, or integrated
circuits),
• Hypertext and multimedia data (including text, image, video, and audio data),
• Graph and networked data (e.g., social and information networks), and
• The Web (a huge, widely distributed information repository made available by the Internet).
• Banks are able to assess the credit worthiness of their customers by mining a customer’s
historical records of business transactions.
• So, credit agencies and banks collect a lot of customer’ behavioural data from many sources.
• This information is used to predict the chances of a customer paying back a loan.
Market segmentation
Fraud detection
• Fraud detection is a very challenging task because it’s difficult to define the characteristics for
detection of fraud.
• Fraud detection can be performed by analyzing the patterns or relationships that deviate from an
expected norm.
• With the help of data mining, we can mine the data of an organization to know about outliers as
they may be possible locations for fraud.
Better marketing
• Usually all online sale web portals provide recommendations to their users based on their
previous purchase choices, and purchases made by customers of similar profile.
• Such recommendations are generated through data mining and help to achieve more sales with
better marketing of their products.
• For example, amazon.com uses associations and provides recommendations to customers on the
basis of past purchases and what other customers are purchasing.
• To take another example, a shoe store can use data mining to identify the right shoes to stock in
the right store on the basis of shoe sizes of the customers in the region.
Trend analysis
• In a large company, not all trends are always visible to the management.
• It is then useful to use data mining software that will help identify trends.
• Trends may be long term trends, cyclic trends or seasonal trends.
• Market basket analysis is useful in designing store layouts or in deciding which items to put on
sale.
• It aims to find what the customers buy and what they buy together.
Customer churn
• If an organization knows its customers better then it can retain them longer.
• In businesses like telecommunications, companies very hard to retain their good customers and
to perhaps persuade good customers of their competitors to switch to them.
• In such an environment, businesses want to rate customers (whether they are worthwhile to be
retained), why customers switch over, and what makes customers loyal.
• With the help of this information some businesses may wish to get rid of customers that cost
more than they are worth, e.g., credit card holders that don’t use the card; bank customers with
very small amounts of money in their accounts.
Website design
• A web site is effective only if the visitors easily find what they are looking for.
• Data mining can help discover the affinity of visitors to pages and the site layout may be
modified based on data generated from their web clicks.
• The main focus of the first phase of a data mining process is to understand the requirements and
objectives of such a project.
• Once the project has been specified, it can be formulated as a data mining problem.
• After this, a preliminary implementation plan can be developed.
• Let us consider a business problem such as ‘How can I sell more of my product to customers?’
• This business problem can be translated into a data mining problem such as ‘Which customers
are most likely to buy the product?’
• A model that predicts the frequent customers of a product must be built on the previous records
of customers’ data.
• Before building the model, the data must be assembled that consists of relationships between
customers who have purchased the product and customers who have not purchased the
product.
• The attributes of customers might include age, years of residence, number of children,
owners/renters, and so on.
• The next phase of the data mining process starts with the data collection.
• In this phase, data is collected from the available sources
• In order to make data collection proper, some important activities such as data loading and data
integration are performed.
• After this, the data is analyzed closely to determine whether the data will address the business
problem or not.
• Therefore, additional data can be added or removed to solve the problem effectively.
• At this stage missing data is also identified.
• For example, if we require the AGE attribute for a record then column DATE_OF_ BIRTH can be
changed to AGE. We can also consider another example in which average income can be inserted
if the value of column INCOME is null.
• Moreover, new computed attributes can be added in the data in order to obtain better focused
information. For example, a new attribute such as ‘Number of Times Amount Purchased Exceeds
Kshs. 5000 in a 12-month time period.’ can be created instead of using the purchase amount.
Modeling phase
• In this phase, different data mining algorithms are applied to build models.
• Appropriate data mining algorithms are selected and applied on given data to achieve the
objectives of proposed solution.
Evaluation phase
• In the evaluation phase, the model results are evaluated to determine whether it satisfies the
originally stated business goal or not.
• For this the given data is divided into training and testing datasets.
• The models are trained on training data and tested on testing data.
• If the accuracy of models on testing data is not adequate then one goes back to the previous
phases to fine tune those areas that may be the reasons for low accuracy.
• Having achieved a satisfactory level of accuracy, the process shifts to the deployment phase
Deployment phase
• In the deployment phase, insights and valuable information derived from data need to be
presented in such a way that stakeholders can use it when they want to.
• On the basis of requirements of the project, the deployment phase can be simple (just creating a
report) or complex (requiring further iterative data mining processing).
• In this phase, Dashboards or Graphical User Interfaces are built to solve all the requirements of
stakeholders.
Data mining can be classified into four major techniques as given below.
• Predictive modeling
• Database segmentation
• Link analysis
• Deviation detection
Predictive modeling
Database segmentation
• Database segmentation is based on the concept of clustering of data and it falls under
unsupervised learning, where data is not labeled.
• This data is segmented into groups or clusters based on its features or attributes.
• Segmentation is creating a group of similar records that share a number of properties.
• Applications of database segmentation include customer segmentation, customer churn, direct
marketing, and cross-selling.
Link analysis
Link analysis aims to establish links, called associations, between the individual record, or sets of records,
in a database. There are three specializations of link analysis.
• Associations discovery
• Sequential pattern discovery
• Similar time sequence discovery
• Associations discovery locates items that imply the presence of other items in the same event.
There are association rules which are used to define association.
o For example, ‘when a customer rents property for more than two years and is more than 25
years old, in 40% of cases, the customer will buy a property. This association happens in 35%
of all customers who rent properties.
• Sequential pattern discovery finds patterns between events such that the presence of one set of
items is followed by another set of items in a database of events over a period of time. For example,
this approach can be used to understand long-term customer buying behavior.
• Time sequence discovery is used to determine whether links exist between two sets of data that are
time-dependent.
o For example, within three months of buying property, new homeowners will purchase goods
such as cookers, freezers, and washing machines.
o Applications of link analysis include market basket analysis, recommendation systems, direct
marketing, and stock price movement.
Deviation detection
• Deviation detection is a relatively new technique in terms of commercially available data mining
tools.
• It is based on identifying the outliers in the database, which indicates deviation from some
previously known expectations and norms.
• This operation can be performed using statistics and visualization techniques.
• Applications of deviation detection include fraud detection in the use of credit cards and insurance
claims, quality control, and defects tracing