Discussion Questions BA
Define data mining. Why are there many names and definitions for data mining?
Data mining is the process through which previously unknown
patterns in data are discovered. Another definition would be a process
that uses statistical, mathematical, artificial intelligence, and machine
learning techniques to extract and identify useful information and
subsequent knowledge from large databases. This includes most types
of automated data analysis. A third definition: Data mining is the process
of finding mathematical patterns from (usually) large sets of data; these
can be rules, affinities, correlations, trends, or prediction models.
Data mining has many definitions because the term has been stretched
by some software vendors to include most forms of data analysis, in
order to increase sales by capitalizing on the popularity of data mining.
What are the main reasons for the recent popularity of data mining?
Following are some of the most pronounced reasons:
More intense competition at the global scale, driven by customers' ever-changing needs and wants in an increasingly saturated marketplace.
General recognition of the untapped value hidden in large data
sources.
Consolidation and integration of database records, which enables a
single view of customers, vendors, transactions, etc.
Consolidation of databases and other data repositories into a single
location in the form of a data warehouse.
The exponential increase in data processing and storage technologies.
Significant reduction in the cost of hardware and software for data
storage and processing.
Movement toward the de-massification (conversion of information
resources into nonphysical form) of business practices.
Discuss what an organization should consider before making a decision to purchase data mining
software.
Technically speaking, data mining is a process that uses statistical,
mathematical, and artificial intelligence techniques to extract and identify
useful information and subsequent knowledge (or patterns) from large sets
of data. Before making a decision to purchase data mining software,
organizations should consider the standard criteria for investing in
any major software: a cost/benefit analysis, people with the expertise
to use the software and perform the analyses, the availability of
historical data, and a business need for the data mining software.
Discuss the main data mining methods. What are the fundamental differences among them?
Prediction is the act of telling about the future. It differs from simple guessing by taking
into account the experiences, opinions, and other relevant information in conducting the
task of foretelling. A term that is commonly associated with prediction is forecasting.
Even though many believe that these two terms are synonymous, there is a subtle but
critical difference between the two. Whereas prediction is largely experience and
opinion based, forecasting is data and model based. That is, in order of increasing
reliability, one might list the relevant terms as guessing, predicting, and
forecasting, respectively. In data mining terminology, prediction and
forecasting are used synonymously, and the term prediction is used as the common
representation of the act.
Classification: analyzing the historical behavior of groups of entities with similar
characteristics, to predict the future behavior of a new entity from its similarity to those
groups
Clustering: finding groups of entities with similar characteristics
Association: establishing relationships among items that occur together
Sequence discovery: finding time-based associations
Visualization: presenting results obtained through one or more of the other methods
Regression: a statistical estimation technique based on fitting a curve defined by a
mathematical equation of known type but unknown parameters to existing data
Forecasting: estimating a future data value based on past data values.
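As an illustration of the association method listed above, its core computation (the support and confidence of co-occurring items) can be sketched in a few lines of Python. The transactions below are made-up market baskets, not data from the text:

```python
# Toy transaction data (hypothetical market baskets).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) over the transactions."""
    return support(antecedent | consequent) / support(antecedent)

# {bread, milk} appears in 2 of 4 transactions.
print(support({"bread", "milk"}))                 # 0.5
# 2 of the 3 bread transactions also contain milk.
print(round(confidence({"bread"}, {"milk"}), 2))  # 0.67
```

A real association-mining algorithm such as Apriori applies the same two measures while pruning the search over candidate itemsets.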
What are the main data mining application areas? Discuss the commonalities of these areas that
make them a prospect for data mining studies.
Applications are listed near the beginning of this section (pp. 204-206):
CRM, banking, retailing and logistics, manufacturing and production,
brokerage, insurance, computer hardware and software, government,
travel, healthcare, medicine, entertainment, homeland security, and
sports.
The commonalities are the need for predictions and forecasting for
planning purposes and to support decision making.
Why do we need a standardized data mining process? What are the most commonly used data
mining processes?
In order to systematically carry out data mining projects, a general
process is usually followed. Similar to other information systems
initiatives, a data mining project must follow a systematic project
management process to be successful. Several data mining processes
have been proposed: CRISP-DM, SEMMA, and KDD.
Discuss the differences between the two most commonly used data mining processes.
The main difference between CRISP-DM and SEMMA is that CRISP-DM
takes a more comprehensive approach to data mining projects, including
understanding of the business and the relevant data, whereas SEMMA
implicitly assumes that the data mining project's goals and objectives,
along with the appropriate data sources, have been identified and
understood.
10 Why do we need data preprocessing? What are the main tasks and relevant techniques used in
data preprocessing?
Data preprocessing is essential to any successful data mining study.
Good data leads to good information; good information leads to good
decisions. Data preprocessing includes four main steps (listed in Table 5.4
on page 211):
data consolidation: access, collect, select and filter data
data cleaning: handle missing data, reduce noise, fix errors
data transformation: normalize the data, aggregate data, construct new
attributes
data reduction: reduce number of attributes and records; balance skewed
data
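The cleaning, transformation, and reduction steps above can be sketched with plain Python on a toy (already consolidated) record set; the field names and values are hypothetical:

```python
# Hypothetical consolidated records; one has a missing value.
raw = [
    {"age": 25, "income": 50_000},
    {"age": None, "income": 64_000},   # missing age
    {"age": 40, "income": 90_000},
]

# Data cleaning: impute missing ages with the mean of the known values.
known = [r["age"] for r in raw if r["age"] is not None]
mean_age = sum(known) / len(known)
cleaned = [{**r, "age": r["age"] if r["age"] is not None else mean_age}
           for r in raw]

# Data transformation: min-max normalize income to the [0, 1] range.
incomes = [r["income"] for r in cleaned]
lo, hi = min(incomes), max(incomes)
transformed = [{**r, "income": (r["income"] - lo) / (hi - lo)}
               for r in cleaned]

# Data reduction: keep only the attributes the analysis needs.
reduced = [{"age": r["age"]} for r in transformed]

print(transformed[1]["income"])  # 0.35
```

Real projects would do the same operations with a data-manipulation library, but the logic of each step is the same.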
11 Discuss the reasoning behind the assessment of classification models.
The model-building step also encompasses the assessment and
comparative analysis of the various models built. Because there is not a
universally known best method or algorithm for a data mining task, one
should use a variety of viable model types along with a well-defined
experimentation and assessment strategy to identify the best method for
a given purpose.
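One common well-defined assessment strategy is k-fold cross-validation. A minimal sketch, using made-up one-dimensional data and two deliberately simple stand-in "models" (a majority-class baseline and a learned threshold rule):

```python
# Toy labeled data: label is 1 when the feature exceeds 5 (hypothetical).
data = [(x, 1 if x > 5 else 0) for x in range(10)]

def majority_model(train):
    # Baseline: always predict the most common training label.
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def threshold_model(train):
    # Pick the cutoff that best separates the training labels.
    best = max(range(11), key=lambda t: sum((x > t) == y for x, y in train))
    return lambda x: 1 if x > best else 0

def cross_val_accuracy(make_model, data, k=5):
    # Split into k folds; train on k-1, test on the held-out fold.
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [d for j, f in enumerate(folds) if j != i for d in f]
        model = make_model(train)
        scores.append(sum(model(x) == y for x, y in test) / len(test))
    return sum(scores) / k

print(cross_val_accuracy(majority_model, data))
print(cross_val_accuracy(threshold_model, data))
```

Running both candidates through the identical cross-validation loop is what makes the comparison fair: each model is scored only on data it never saw during training.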
12 What is the main difference between classification and clustering? Explain using concrete
examples.
Classification learns patterns from past data (a set of information, such as
traits, variables, and features, on characteristics of previously labeled
items, objects, or events) in order to place new instances (with unknown
labels) into their respective groups or classes. The objective of
classification is to analyze the historical data stored in a database and
automatically generate a model that can predict future behavior.
Classifying customer types as likely to buy or not buy is an example.
Clustering, in contrast, works on unlabeled data: it finds natural groupings
of entities with similar characteristics (for example, segmenting customers
into previously unknown market segments) without any predefined classes
to predict.
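The contrast can be made concrete with a small sketch on one-dimensional toy data: classification needs labeled examples, while clustering discovers groups without any labels. The numbers and labels below are invented for illustration:

```python
# Classification: a 1-nearest-neighbor rule learned from labeled examples.
labeled = [(1.0, "no-buy"), (2.0, "no-buy"), (8.0, "buy"), (9.0, "buy")]

def classify(x):
    return min(labeled, key=lambda p: abs(p[0] - x))[1]

print(classify(8.5))  # buy

# Clustering: naive 2-means on the same points, ignoring labels entirely.
points = [1.0, 2.0, 8.0, 9.0]
centers = [points[0], points[-1]]           # naive initialization
for _ in range(5):                           # a few refinement passes
    groups = [[], []]
    for p in points:
        groups[abs(p - centers[0]) > abs(p - centers[1])].append(p)
    centers = [sum(g) / len(g) for g in groups]

print(centers)  # [1.5, 8.5]
```

The classifier could only be built because the labels "buy"/"no-buy" were supplied; the clustering step recovers the same two groups purely from the values' similarity.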
2.
What is a data warehouse and what are its benefits? Why is Web
accessibility important with a data warehouse?
A data warehouse can be defined (Section 5.2) as a pool of data produced to support
decision making. This focuses on the essentials, leaving out characteristics that may vary
from one DW to another but are not essential to the basic concept.
The same paragraph gives another definition: a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decision-making
process. This definition adds more specifics, but in every case appropriately: it is hard, if
not impossible, to conceive of a data warehouse that would not be subject-oriented,
integrated, etc.
The benefits of a data warehouse are that it provides decision-making information,
organized in a way that facilitates the types of access required for that purpose and
supported by a wide range of software designed to work with it.
Web accessibility of a data warehouse is important because many analysis
applications are Web-based, because users often access data over the Web (or over an
intranet using the same tools) and because data from the Web may feed the DW.
(The first part of this question is essentially the same as Review Question 1 of
Section 5.2. It would be redundant to assign that question if this one is to be answered as
well.)
3.
For a data mart to replace a data warehouse, it must make the DW unnecessary. This would
mean that all the analyses for which the DW would be used can instead be satisfied by a
DM (or perhaps a combination of several DMs). If this is so, it can be much less expensive,
in terms of development and computer resources, to use multiple DMs (let alone one DM!)
instead of an overall DW.
In other situations, a data mart can be used for some analyses which would in its
absence use the DW, but not all of them. For those, the smaller DM is more efficient;
quite possibly, efficient enough to justify the cost of having a DM in addition to a DW.
Here the DM complements the DW.
4.
Discuss the major drivers and benefits of data warehousing to end users.
Major drivers include:
5.
6.
Describe how data integration can lead to higher levels of data quality.
A question involving the word "higher" (or any other comparative, for that matter)
requires asking "higher than what?" In this case, we can take it to mean higher than
we would have for the same data, but without a formal data integration process.
Without a data integration process to combine data in a planned and structured
manner, data might be combined incorrectly. That could lead to misunderstood data (a
measurement in meters taken as being in feet) and to inconsistent data (data from one
source applying to calendar months, data from another to four-week or five-week fiscal
months). These are aspects of low-quality data which can be avoided, or at least reduced, by
data integration.
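A minimal sketch of how a structured integration step avoids the unit mismatch described above; the source records and field names are hypothetical:

```python
# Each feed declares its unit; integration converts everything to a
# canonical unit (meters) instead of combining raw numbers blindly.
TO_METERS = {"m": 1.0, "ft": 0.3048}

source_a = [{"length": 12.0, "unit": "m"}]
source_b = [{"length": 40.0, "unit": "ft"}]   # would be misread as 40 m
                                              # without explicit conversion
integrated = [
    {"length_m": rec["length"] * TO_METERS[rec["unit"]]}
    for rec in source_a + source_b
]
print(round(integrated[1]["length_m"], 3))  # 12.192
```

The same pattern (declare the source convention, convert on the way in) handles the fiscal-versus-calendar-month inconsistency as well.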
7.
What are some of the most popular text mining software tools?
1. ClearForest offers text analysis and visualization tools.
2. IBM Intelligent Miner Data Mining Suite, now fully integrated into IBM's
InfoSphere Warehouse software, includes data and text mining tools.
3. Megaputer Text Analyst offers semantic analysis of free-form text,
summarization, clustering, navigation, and natural language retrieval with
search dynamic refocusing.
4. SAS Text Miner provides a rich suite of text processing and analysis
tools.
5. SPSS Text Mining for Clementine extracts key concepts, sentiments,
and relationships from call-center notes, blogs, e-mails, and other
unstructured data and converts it to a structured format for predictive
modeling.
6. The Statistica Text Mining engine provides easy-to-use text mining
functionality with exceptional visualization capabilities.
7. VantagePoint provides a variety of interactive graphical views and
analysis tools with powerful capabilities to discover knowledge from text
databases.
8. The WordStat analysis module from Provalis Research analyzes textual
information such as responses to open-ended questions, interviews, etc.
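One basic operation that all of these tools automate is extracting frequent terms from free-form text. A minimal standard-library sketch (the stop-word list and sample sentence are illustrative, not from the chapter):

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real tools use much larger ones.
STOP = {"the", "a", "of", "and", "to", "is"}

text = "The miner extracts terms and the miner counts terms of interest."
tokens = re.findall(r"[a-z]+", text.lower())
counts = Counter(t for t in tokens if t not in STOP)

print(counts.most_common(2))  # [('miner', 2), ('terms', 2)]
```

Commercial tools layer stemming, semantic analysis, and visualization on top, but term extraction of this kind is the common starting point.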
4. Why do you think most of the text mining tools are offered by statistics companies?
Students should mention that many of the capabilities of data mining
apply to text mining. Since statistics companies offer data mining tools,
offering text mining is a natural business extension.
5. What do you think are the pros and cons of choosing a free text mining tool over a commercial
tool?
6. Define Web structure mining, and differentiate it from Web content mining.
Web structure mining is the process of extracting useful information from
the links embedded in Web documents. Web content mining, in contrast,
extracts useful information from the contents of the Web pages themselves
(the text, images, and other media they contain) rather than from the link
structure that connects them.
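The raw material of Web structure mining, the links themselves, can be extracted with the standard library; the sample page below is invented for illustration:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags, i.e. the edges of the
    page's link graph."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

page = ('<html><body><a href="/about">About</a> '
        '<a href="http://example.com">Ex</a></body></html>')
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/about', 'http://example.com']
```

Collecting such edges across many pages yields the link graph on which structure-mining analyses (hub/authority scoring, community detection, and so on) operate.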