Seminar Data Mining
CONTENTS
Introduction
Data Mining Overview
  Data
  Information
  Knowledge
How does data mining work?
Elements of Data Mining
Types of Data Mining Techniques
  Artificial neural networks:
  Genetic algorithms:
  Decision trees:
  Nearest neighbor method:
Data Mining Issues
  Data Quality
  Interoperability
  Mission Creep
  Privacy
Data Mining Uses
  Automated prediction of trends and behaviors
  Automated discovery of previously unknown patterns
Limitations
Data Mining Products
Applications
Conclusion
References
DATA MINING
Introduction
Data mining is the process of extracting patterns from large data sets by combining methods from
statistics and artificial intelligence with database management. It is becoming an
increasingly important tool for transforming data into information, and it is currently used in a wide
range of profiling practices, such as marketing, surveillance, fraud detection, and scientific
discovery. Data mining consists of more than collecting and managing data; it also includes
analysis and prediction.
Data mining is often carried out only on samples of data. The mining process will be ineffective if
the samples are not a good representation of the larger body of data. Data mining cannot discover
patterns that may be present in the larger body of data if those patterns are not present in the
sample being "mined". Conversely, the discovery of a particular pattern in a particular sample does not
necessarily mean that the same pattern is present in the larger body of data from which that sample was
drawn. An important part of the process is therefore the verification and validation of patterns on other
samples of data.
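The validation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields (`weekend`, `bought_special`), the 50/50 split, and the tolerance of 0.1 are hypothetical choices, not part of any particular data mining product.

```python
import random

def split_sample(records, holdout_fraction=0.5, seed=0):
    """Shuffle a copy of the records and split it into (mining, validation) samples."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]

def pattern_rate(records, condition, outcome):
    """Fraction of the records satisfying `condition` that also satisfy `outcome`."""
    matching = [r for r in records if condition(r)]
    if not matching:
        return 0.0
    return sum(1 for r in matching if outcome(r)) / len(matching)

def is_validated(rate_mined, rate_validation, tolerance=0.1):
    """Accept a pattern only if it holds about equally well on the other sample."""
    return abs(rate_mined - rate_validation) <= tolerance

# Hypothetical data: did a customer visiting on a weekend buy the daily special?
records = [{"weekend": i % 2 == 0, "bought_special": i % 4 == 0} for i in range(200)]

mined, validation = split_sample(records)
is_weekend = lambda r: r["weekend"]
bought = lambda r: r["bought_special"]

rate_mined = pattern_rate(mined, is_weekend, bought)
rate_validation = pattern_rate(validation, is_weekend, bought)
print(rate_mined, rate_validation, is_validated(rate_mined, rate_validation))
```

A pattern that shows up only in the mined sample would fail the `is_validated` check and should be treated as unconfirmed.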
Data
Data are any facts, numbers, or text that can be processed by a computer. Today, organizations
are accumulating vast and growing amounts of data in different formats and different databases.
This includes:
operational or transactional data, such as sales, cost, inventory, payroll, and accounting
nonoperational data, such as industry sales, forecast data, and macroeconomic data
metadata: data about the data itself, such as logical database design or data dictionary
definitions.
Information
The patterns, associations, or relationships among all this data can provide information. For
example, analysis of retail point-of-sale transaction data can yield information on which products
are selling and when.
Knowledge
Information can be converted into knowledge about historical patterns and future trends. For
example, summary information on retail supermarket sales can be analyzed in light of
promotional efforts to provide knowledge of consumer buying behavior. Thus, a manufacturer or
retailer could determine which items are most susceptible to promotional efforts.
While large-scale information technology has been evolving separate transaction and analytical
systems, data mining provides the link between the two. Data mining software analyzes
relationships and patterns in stored transaction data based on open-ended user queries. Several
types of analytical software are available: statistical, machine learning, and neural networks.
Generally, any of four types of relationships are sought:
Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant
chain could mine customer purchase data to determine when customers visit and what they
typically order. This information could be used to increase traffic by having daily specials.
Clusters: Data items are grouped according to logical relationships or consumer preferences. For
example, data can be mined to identify market segments or consumer affinities.
Associations: Data can be mined to identify associations, that is, items or events that tend to
occur together.
Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an
outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a
consumer's purchase of sleeping bags and hiking shoes.
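As a rough illustration of how such a relationship might be quantified, the sketch below computes the support and confidence of the rule "sleeping bag and hiking shoes, therefore backpack". The transactions are invented for the example, and the sketch ignores the order of purchases, so it is closer to association mining than to true sequential pattern mining.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base

# Hypothetical purchase histories for an outdoor equipment retailer.
transactions = [
    {"sleeping bag", "hiking shoes", "backpack"},
    {"sleeping bag", "hiking shoes", "backpack"},
    {"sleeping bag", "hiking shoes"},
    {"hiking shoes", "water bottle"},
    {"backpack"},
]

rule_support = support(transactions, {"sleeping bag", "hiking shoes"})
rule_confidence = confidence(transactions, {"sleeping bag", "hiking shoes"}, {"backpack"})
print(rule_support, rule_confidence)
```

Here the confidence estimates the likelihood of a backpack purchase given the other two purchases, using only the sample at hand.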
Genetic algorithms:
Optimization techniques that use processes such as genetic combination, mutation, and natural
selection in a design based on the concepts of natural evolution.
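A toy example may make the three processes concrete. The sketch below assumes a deliberately simple problem (maximize the number of 1s in a bit string); the population size, mutation rate, and the choice to carry the best individual over unchanged are illustrative assumptions, not a recommended configuration.

```python
import random

def fitness(bits):
    """Toy objective: the number of 1s in the bit string."""
    return sum(bits)

def evolve(pop_size=20, length=16, generations=30, mutation_rate=0.05, seed=1):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    initial_best = max(fitness(ind) for ind in population)
    for _ in range(generations):
        # Natural selection: only the fitter half of the population may reproduce.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        # Carry the best individual over unchanged so the best fitness never decreases.
        children = [parents[0][:]]
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randint(1, length - 1)
            # Genetic combination: the child takes a prefix from one parent, the rest from the other.
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    return initial_best, best

initial_best, best = evolve()
print(initial_best, fitness(best), best)
```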
Decision trees:
Tree-shaped structures that represent sets of decisions. These decisions generate rules for the
classification of a dataset. Specific decision tree methods include Classification and Regression
Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID). Both
are decision tree techniques used for classification of a dataset. They provide a set of rules that
you can apply to a new (unclassified) dataset to predict which records will have a given outcome.
CART segments a dataset by creating two-way splits, while CHAID segments using chi-square tests
to create multi-way splits. CART typically requires less data preparation than CHAID.
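To show what a single two-way split looks like, here is a minimal sketch in the style of CART. It assumes Gini impurity as the splitting criterion, which is the usual choice for CART classification but is not stated in the text above. The ages and responses are made up, and a real tree would apply the same search recursively to each side of the split.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the group is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_binary_split(values, labels):
    """Return (threshold, weighted_gini) for the best split of the form value <= threshold."""
    best = (None, float("inf"))
    distinct = sorted(set(values))
    # Candidate thresholds are the midpoints between neighboring distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: customer age and whether the customer responded to an offer.
ages = [22, 25, 30, 41, 47, 52]
responded = ["no", "no", "no", "yes", "yes", "yes"]

threshold, score = best_binary_split(ages, responded)
print(f"rule: age <= {threshold} -> 'no', otherwise 'yes' (weighted Gini {score})")
```

The resulting rule can then be applied to new, unclassified records, which is the sense in which a decision tree "generates rules".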
Data Quality
Data quality is a multifaceted issue that represents one of the biggest challenges for data mining.
Data quality refers to the accuracy and completeness of the data. Data quality can also be affected
by the structure and consistency of the data being analyzed. The presence of duplicate records, the
lack of data standards, the timeliness of updates, and human error can significantly impact the
effectiveness of the more complex data mining techniques, which are sensitive to subtle
differences that may exist in the data. To improve data quality, it is sometimes necessary to
“clean” the data, which can involve the removal of duplicate records, normalizing the values used
to represent information in the database (e.g., ensuring that “no” is represented as a 0 throughout
the database, and not sometimes as a 0, sometimes as an N, etc.), accounting for missing data
points, removing unneeded data fields, identifying anomalous data points (e.g., an individual
whose age is shown as 142 years), and standardizing data formats (e.g., changing dates so they all
follow the MM/DD/YYYY format).
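The cleaning steps listed above might look roughly like the following sketch. The field names (`name`, `age`, `opted_in`, `joined`), the accepted date formats, and the age limit of 120 are assumptions made for the example; a real project would derive them from its own data dictionary.

```python
from datetime import datetime

NO_VALUES = {"0", "n", "no"}
YES_VALUES = {"1", "y", "yes"}
# Input formats assumed for this example; every date is rewritten as MM/DD/YYYY.
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%d.%m.%Y")

def normalize_flag(value):
    """Map the many spellings of yes/no onto 1/0; return None if unrecognized."""
    text = str(value).strip().lower()
    if text in NO_VALUES:
        return 0
    if text in YES_VALUES:
        return 1
    return None

def normalize_date(text):
    """Rewrite a date in any accepted format as MM/DD/YYYY; return None if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    return None

def clean(records, max_age=120):
    seen, cleaned = set(), []
    for rec in records:
        row = {
            "name": rec["name"].strip(),
            "age": rec["age"],
            "opted_in": normalize_flag(rec["opted_in"]),
            "joined": normalize_date(rec["joined"]),
        }
        # Flag anomalous values rather than silently dropping them.
        row["age_suspect"] = not (0 <= row["age"] <= max_age)
        # Records that are identical after normalization count as duplicates.
        key = (row["name"].lower(), row["age"], row["opted_in"], row["joined"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned

raw = [
    {"name": "Ann Lee", "age": 34, "opted_in": "N", "joined": "2009-03-15"},
    {"name": "ann lee ", "age": 34, "opted_in": 0, "joined": "03/15/2009"},
    {"name": "Bo Chen", "age": 142, "opted_in": "yes", "joined": "07.01.2010"},
]

cleaned = clean(raw)
for row in cleaned:
    print(row)
```

Note that the two "Ann Lee" records only become recognizable as duplicates after the values have been normalized, which is why normalization comes before duplicate removal in this sketch.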
Interoperability
Related to data quality is the issue of interoperability of different databases and data mining
software. Interoperability refers to the ability of a computer system and/or data to work with other
systems or data using common standards or processes. Interoperability is a critical part of the
larger efforts to improve interagency collaboration and information sharing through e-government
and homeland security initiatives. For data mining, interoperability of databases and software is
important to enable the search and analysis of multiple databases simultaneously, and to help
ensure the compatibility of data mining activities of different agencies. Data mining projects that
are trying to take advantage of existing legacy databases or that are initiating first-time
collaborative efforts with other agencies or levels of government (e.g., police departments in
different states) may experience interoperability problems. Similarly, as agencies move forward
with the creation of new databases and information sharing efforts, they will need to address
interoperability issues during their planning stages to better ensure the effectiveness of their data
mining projects.
Mission Creep
Mission creep refers to the use of data for purposes other than those for which the data was
originally collected. This can occur regardless of whether the data was provided voluntarily by the
individual or was collected through other means.
Efforts to fight terrorism can, at times, take on an acute sense of urgency. This urgency can create
pressure on both data holders and officials who access the data. To leave an available resource
unused may appear to some as being negligent. Data holders may feel obligated to make any
information available that could be used to prevent a future attack or track a known terrorist.
Similarly, government officials responsible for ensuring the safety of others may be pressured to
use and/or combine existing databases to identify potential threats. Unlike physical searches, or
the detention of individuals, accessing information for purposes other than originally intended
may appear to be a victimless or harmless exercise. However, such information use can lead to
unintended outcomes and produce misleading results.
One of the primary reasons for misleading results is inaccurate data. All data collection efforts
suffer from accuracy concerns to some degree. Ensuring the accuracy of information can require costly
protocols that may not be cost-effective if the data is not of inherently high economic value. In
well-managed data mining projects, the original data collecting organization is likely to be aware
of the data’s limitations and account for these limitations accordingly. However, such awareness
may not be communicated or heeded when data is used for other purposes. For example, the
accuracy of information collected through a shopper’s club card may suffer for a variety of
reasons, including the lack of identity authentication when a card is issued, cashiers using their
own cards for customers who do not have one, and/or customers who use multiple cards. For the
purposes of marketing to consumers, the impact of these inaccuracies on the individual is
negligible. However, if a government agency were to use that information to target individuals based on
food purchases associated with particular religious observances, an outcome based on
inaccurate information could be, at the least, a waste of resources by the government agency and
an unpleasant experience for the misidentified individual.
Privacy
Concerns about privacy focus both on the actual projects proposed and on the
potential for data mining applications to be expanded beyond their original purposes (mission
creep). For example, some experts suggest that anti-terrorism data mining applications might also
be useful for combating other types of crime. There is some disagreement over how
privacy concerns should be addressed. Some observers suggest that technical solutions are
adequate. In contrast, some privacy advocates argue in favor of creating clearer policies and
exercising stronger oversight.
Limitations
While data mining products can be very powerful tools, they are not self-sufficient applications.
To be successful, data mining requires skilled technical and analytical specialists who can
structure the analysis and interpret the output that is created. Consequently, the limitations of data
mining are primarily data- or personnel-related, rather than technology-related.
Although data mining can help reveal patterns and relationships, it does not tell the user the value
or significance of these patterns. These types of determinations must be made by the user.
Similarly, the validity of the patterns discovered is dependent on how they compare to “real
world” circumstances. For example, to assess the validity of a data mining application designed to
identify potential terrorist suspects in a large pool of individuals, the user may test the model
using data that includes information about known terrorists. However, while such a test may reaffirm
a particular profile, it does not necessarily mean that the application will identify a suspect whose
behavior significantly deviates from the original model.
Another limitation of data mining is that while it can identify connections between behaviors
and/or variables, it does not necessarily identify a causal relationship. For example, an application
may identify that a pattern of behavior, such as the propensity to purchase airline tickets
shortly before the flight is scheduled to depart, is related to characteristics such as income, level
of education, and Internet use. However, that does not necessarily indicate that the ticket
purchasing behavior is caused by one or more of these variables. In fact, the individual’s behavior
could be affected by some additional variable(s) such as occupation (the need to make trips on
short notice), family status (a sick relative needing care), or a hobby (taking advantage of last
minute discounts to visit new destinations).
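The ticket-purchasing example can be illustrated with invented numbers. In the sketch below, late purchasing looks related to income when all records are pooled, but the relationship disappears once the records are separated by a hypothetical additional variable, `travels_for_work`. All counts are fabricated purely to demonstrate the effect.

```python
def make_group(n, late_count, travels_for_work, high_income):
    """Build n records of which the first `late_count` bought their ticket late."""
    return [
        {"travels_for_work": travels_for_work, "high_income": high_income, "late": i < late_count}
        for i in range(n)
    ]

def late_rate(records, **filters):
    """Share of late purchasers among the records matching all keyword filters."""
    subset = [r for r in records if all(r[k] == v for k, v in filters.items())]
    return sum(r["late"] for r in subset) / len(subset)

# Hypothetical population: people who travel for work buy late half of the time,
# regardless of income; everyone else buys late one time in ten.
records = (
    make_group(80, 40, travels_for_work=True, high_income=True)
    + make_group(20, 10, travels_for_work=True, high_income=False)
    + make_group(20, 2, travels_for_work=False, high_income=True)
    + make_group(80, 8, travels_for_work=False, high_income=False)
)

# Pooled data: income appears to be related to late purchasing.
pooled_high = late_rate(records, high_income=True)
pooled_low = late_rate(records, high_income=False)
print("pooled:", pooled_high, pooled_low)

# Within each occupation group the apparent effect of income vanishes.
for travels in (True, False):
    high = late_rate(records, travels_for_work=travels, high_income=True)
    low = late_rate(records, travels_for_work=travels, high_income=False)
    print("travels_for_work =", travels, ":", high, low)
```

A mining tool looking only at income and purchase timing would report the pooled relationship; deciding whether it is causal still requires an analyst who knows which other variables to check.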
Applications
Banking: loan/credit card approval
o Predict good customers based on old customers.
Customer relationship management:
o Identify those who are likely to leave for a competitor.
Targeted marketing:
Conclusion
Generally, data mining, sometimes called data or knowledge discovery, is the process of
analyzing data from different perspectives and summarizing it into useful information -
information that can be used to increase revenue, cut costs, or both. In the new millennium,
competitive enterprises will be mining their data with sophisticated data mining tools to find and
attract the best customers, to improve and enhance their product offerings, to maximize operating
efficiency and to cut costs and improve customer satisfaction. With time and resources in short
supply, data mining software will help enterprises maximize resources to remain competitive. With
the advancement and deployment of sophisticated data mining tools, computers can do more of the
thinking, bringing knowledge to our desktops.
References