Chapter IX
Data Mining and Business Intelligence
ABSTRACT
Most businesses generate, are surrounded by, and are even overwhelmed by data, much of it never used to its full potential for gaining insights into one's own business, customers, competition, and overall business environment. By using a technique known as data mining, it is possible to extract critical and useful patterns, associations, relationships, and, ultimately, useful knowledge from the raw data available to businesses. This chapter explores data mining and its benefits and capabilities as a key tool for obtaining vital business intelligence information. The chapter includes an overview of data mining, followed by its evolution, methods, technologies, applications, and future.
Copyright 2004, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.
INTRODUCTION
One aspect of our technological society is clear: there is a large amount of data but a shortage of information. Every day, enormous amounts of information are generated from all sectors: business, education, the scientific community, the World Wide Web, and the many off-line and online data sources readily available. From all of this, which represents a sizable repository of human data and information, it is necessary and desirable to generate worthwhile and usable knowledge. As a result, the field of data mining and knowledge discovery in databases (KDD) has grown in leaps and bounds and has shown great potential for the future (Han & Kamber, 2001). Data mining is not a single technique or technology but, rather, a group of related methods and methodologies directed towards the finding and automatic extraction of patterns, associations, changes, anomalies, and significant structures from data (Grossman, 1998). Data mining is emerging as a key technology that enables businesses to select, filter, screen, and correlate data automatically. The term evokes the image of finding patterns and meaning in data, suggesting the mining of nuggets of knowledge and insight from a mass of data. The findings can then be applied to a variety of applications and purposes, including marketing, risk analysis and management, fraud detection and management, and customer relationship management (CRM). With the considerable amount of information being generated and made available, the effective use of data-mining methods and techniques can help to uncover various trends, patterns, inferences, and other relations from the data, which can then be analyzed and further refined. These can then be studied to bring out meaningful information that can be used to come to important conclusions, improve marketing and CRM efforts, and predict future behavior and trends (Han & Kamber, 2001).
Transformation: The data is transformed; overlays, such as demographic overlays, may be added, and the data is made usable and navigable.
Data mining: This stage is concerned with the extraction of patterns from the data.
Interpretation and evaluation: The patterns identified by the system are interpreted into knowledge that can then be used to support human decision-making, e.g., prediction and classification tasks, summarizing the contents of a database, or explaining observed phenomena (Han & Kamber, 2001).
Data mining is a field heavily influenced by traditional statistical techniques, and most data-mining methods reveal a strong foundation of statistical and data analysis methods. Some of the traditional data-mining techniques include classification, clustering, outlier analysis, sequential patterns, time series analysis, prediction, regression, link analysis (associations), and multidimensional methods including online analytical processing (OLAP). These can be categorized into a series of data-mining techniques, which are classified and illustrated in Table 1 (Goebel & Gruenwald, 1999). In addition, the broad field of data mining includes not only statistical techniques, but also various related technologies and techniques, including data warehousing, and many software packages and languages that have been developed for the purpose of mining data. Some of these packages and languages include: DBMiner, IBM Intelligent Miner, SAS Enterprise Miner, SGI MineSet, Clementine, MS/SQLServer 2000, BlueMartini, MineIt, DigiMine, and MS OLEDB for Data Mining (Goebel & Gruenwald, 1999). Data warehousing complements data mining in that data stored in a data warehouse is organized in a form that makes it suitable for analysis using data-mining methods. A data warehouse is a central repository for the data that an enterprise's various business systems collect. Typically, a data warehouse is housed on an enterprise server. Data from various online transaction processing (OLTP) applications and other sources are extracted and organized in the data warehouse database for use by analytical applications, user queries, and data-mining operations. Data warehousing focuses on the capture of data from diverse sources for useful analysis and access. A data mart, by contrast, emphasizes the point of view of the end-user or knowledge worker who needs access to specialized, but often local, databases (Delmater & Hancock, 2001; Han & Kamber, 2001).
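The link analysis (association) technique mentioned above can be illustrated with a small sketch. The function and the basket data below are hypothetical and not drawn from any of the packages named; they simply count item pairs that co-occur across transactions and keep those meeting a minimum support threshold, which is the core idea behind association analysis.

```python
from itertools import combinations
from collections import Counter

def pair_associations(transactions, min_support=2):
    """Count co-occurring item pairs across transactions and keep those
    meeting a minimum support threshold (a simplified association analysis)."""
    counts = Counter()
    for items in transactions:
        # Consider each unordered pair of distinct items in the transaction
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical point-of-sale baskets
baskets = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "eggs", "milk"},
]
rules = pair_associations(baskets, min_support=3)
```

In a real system, each surviving pair would then be scored (e.g., by confidence) before being reported to the analyst.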
Clustering: Partition a set into classes, whereby items with similar characteristics are grouped together.
OLAP: OLAP tools enable users to analyze different dimensions of multidimensional data; for example, they provide time series and trend analysis views.
Model visualization: Making discovered knowledge easily understood using charts, plots, histograms, and other visual means.
Exploratory data analysis: Explores a data set without a strong dependence on assumptions or models; the goal is to identify patterns in an exploratory manner.
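As a rough illustration of the clustering entry above, the following sketch partitions one-dimensional values into groups with a minimal k-means loop. The data and function name are hypothetical; production systems would use a full multidimensional implementation.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means sketch for one-dimensional data: assign each value
    to the nearest center, then move each center to the mean of its values."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Recompute each center; keep the old one if its cluster is empty
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical customer ages falling into two natural groups
centers, clusters = kmeans_1d([21, 23, 25, 61, 64, 66], centers=[30, 50])
```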
The origins of data mining can be traced to three areas of learning and research: statistics, machine learning, and artificial intelligence (AI). The first foundation of data mining is in statistics, which underlies most of the technologies on which data mining is built. Many of the classic areas of statistics, such as regression analysis, standard distributions, standard deviation and variance, discriminant analysis, and cluster analysis, are the building blocks on which the more advanced statistical techniques of data mining are based (Delmater & Hancock, 2001; Fayyad, Piatesky-Shapiro, & Smith, 1996; Han & Kamber, 2001). Another major area of influence is AI. This area, which derives its power from heuristics rather than statistics, attempts to apply human-thought-like processing to statistical problems. Because AI requires significant computer processing power, it did not become practical until the 1980s, when more powerful computers began to be offered at affordable prices. There were a number of important AI-based applications, such as query optimization modules for Relational Database Management Systems (RDBMS), and AI was an area of much research interest (Delmater & Hancock, 2001; Fayyad, Piatesky-Shapiro, & Smith, 1996; Han & Kamber, 2001). Finally, there is machine learning, which can be thought of as a combination of statistics and artificial intelligence. While AI did not enjoy much commercial success, many AI techniques were largely adapted for use in machine learning. Machine learning could be considered a next step in the evolution of AI, because its strength lies in blending AI heuristics with advanced statistical analyses. Among the capabilities implemented in machine learning is the ability to have a computer program learn about the data it is studying, i.e., a program can make different kinds of decisions based on the characteristics of the studied data.
For instance, based on the data set being analyzed, basic statistics are used for fundamental problems, and more advanced AI heuristics and algorithms are used to examine more complex data (Delmater & Hancock, 2001; Fayyad, Piatesky-Shapiro, & Smith, 1996; Han & Kamber, 2001). Data mining, in many ways, is the application of machine learning techniques to business applications. Probably best described as a combination of historical and recent developments in statistics, AI, and machine learning, its purpose is to study data and find the hidden trends or patterns within it. Data mining is finding increasing acceptance in both the scientific and business communities, meeting the need to analyze large amounts of data and discover trends that would not be found using other, more traditional means (Delmater
& Hancock, 2001; Fayyad, Piatesky-Shapiro, & Smith, 1996; Han & Kamber, 2001). Other areas that have influenced the field of data mining include developments in database systems, visualization techniques and technologies, and advanced techniques including neural networks. Databases have evolved from flat files to sophisticated repositories of information, with complex forms of storing, arranging, and retrieving data. The evolution of database technologies from relational databases to more intricate forms, such as data warehouses and data marts, has helped to make data mining a reality. Developments in visualization have also influenced certain areas of data mining; in particular, visual and spatial data mining have come of age due to the work being done in those areas. Many of the applications for which data mining is being used employ advanced artificial intelligence and related technologies, including neural networks, pattern recognition, information retrieval, and advanced statistical analyses. From this discussion of the theoretical and computer science origins of data mining, it is useful to now look at a classification of data-mining systems that can provide some insight into how data-mining systems and technologies have evolved (Delmater & Hancock, 2001; Fayyad, Piatesky-Shapiro, & Smith, 1996; Han & Kamber, 2001).
First generation: stand-alone systems supporting one or more algorithms.
Second generation: systems supporting multiple algorithms together with data management, including database and data warehouse systems.
Third generation: systems combining data management with predictive modeling.
Fourth generation: systems extending data mining to mobile and ubiquitous computing.
The market for business intelligence, which encompasses data mining, is estimated to increase from $3.6 billion in 2000 to $11.9 billion in 2005. The growth in the CRM analytic application market is expected to approach 54.1% per year through 2003. In addition, data-mining projects are expected to grow by more than 300% by the year 2002. By 2003, more than 90% of consumer-based industries with an e-commerce orientation will utilize some kind of data-mining model. As mentioned previously, the field of data mining is very broad, and many methods and technologies have become dominant in the field. Not only have there been developments in the traditional areas of data mining, but other areas have been identified as especially important future trends in the field.
The trends that focus on data mining from complex types of data include Web mining, text mining, distributed data mining, hypertext/hypermedia mining, ubiquitous data mining, as well as multimedia, visual, spatial, and time series/sequential data mining. These are examined in detail in the upcoming sections. The techniques and methods that are highlighted include constraint-based and phenomenal data mining. In addition, two areas that have become extremely important are bioinformatics and DNA analysis, and the work being done in support of customer relationship management (CRM).
WEB MINING
Web mining is one of the most promising areas in data mining, because the Internet and World Wide Web are dynamic sources of information. Web mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web (Etzioni, 1996). The main tasks that comprise Web mining include retrieving Web documents, selection and processing of Web information, pattern discovery in sites and across sites, and analysis of the patterns found (Garofalis, 1999; Han, Zaiane, Chee, & Chiang, 2000; Kosala & Blockeel, 2000). Web mining can be categorized into three separate areas: Web Content Mining, Web Structure Mining, and Web Usage Mining. Web content mining is the process of extracting knowledge from the content of documents or their descriptions. This includes the mining of Web text documents, which is a form of resource discovery based on the indexing of concepts, sometimes using agent-based technology. Web structure mining is the process of inferring knowledge from the links and organization in the World Wide Web. Finally, Web usage mining, also known as Web Log Mining, is the process of extracting interesting patterns in Web access logs and other Web usage information (Borges & Levene, 1999; Kosala & Blockeel, 2000; Madria, 1999). Web mining is closely related to both information retrieval (IR) and information extraction (IE). Web mining is sometimes regarded as an intelligent form of information retrieval, and IE is associated with the extraction of information from Web documents (Pazienza, 1997). Aside from the three types mentioned above, there are different approaches to handling these problems, including those with emphasis on databases and the use of intelligent software agents. Web Content Mining is concerned with the discovery of new information and knowledge from Web-based data, documents, and pages. Because the
Web contains so many different kinds of information, including text, graphics, audio, video, and hypertext links, the mining of Web content is closely related to the field of hypermedia and multimedia data mining. However, in this case, the focus is on information found mainly on the World Wide Web. Web content mining is a process that goes beyond the task of extracting keywords. Some approaches have involved restructuring the document content in a representation that could be better used by machines; one such approach is to use wrappers to map documents to some data model. According to Kosala and Blockeel (2000), there are two main approaches to Web content mining: an information retrieval view and a database view. The information retrieval view is designed to work with both unstructured documents (free text such as news stories) and semistructured documents (with both HTML and hyperlinked data), and attempts to identify patterns and models based on an analysis of the documents, using such techniques as clustering, classification, finding text patterns, and extraction rules. A number of studies have been conducted in these and related areas, such as clustering, categorization, computational linguistics, exploratory analysis, and text patterns. Many of these studies are closely related to, and employ the techniques of, text mining (Billsus & Pazzani, 1999; Frank, Paynter, Witten, Gutwin, & Nevill-Manning, 1998; Nahm & Mooney, 2000). The other main approach, content mining of semistructured documents, uses many of the same techniques used for unstructured documents, but with the added complexity and challenge of analyzing documents containing a variety of media elements. For this area, it is frequently desirable to take a database view, with the Web site being analyzed as the database.
Here, hypertext documents are the main information to be analyzed, and the goal is to transform the data found in the Web site into a form that enables better management and querying of the information (Crimmins, 1999; Shavlik & Eliassi-Rad, 1998). Some of the applications of this kind of Web content mining include the discovery of a schema for Web databases and the building of structural summaries of data. There are also applications that focus on the design of languages that provide better querying of databases that contain Web-based data. Researchers have developed many Web-oriented query languages that attempt to extend standard database query languages such as SQL to collect data from the Web. WebLog is a logic-based query language for restructuring extracted information from Web information sources. WebSQL provides a framework that supports a large class of data-restructuring operations. In addition, WebSQL combines structured queries, based on the organization of hypertext documents, and content queries, based on information retrieval techniques.
The TSIMMIS system (Chawathe et al., 1994) extracts data from heterogeneous and semistructured information sources and correlates them to generate an integrated database representation of the extracted information (Han, Fu, Wang, Koperski, & Zaiane, 1996; Maarek & Ben Shaul, 1996; Mendelzon, Mihaila, & Milo, 1996; Merialdo, Atzeni, & Mecca, 1997). Others focus on the building and management of multilevel or multilayered databases, suggesting a multilevel database approach to organizing Web-based information. The main idea behind this method is that the lowest level of the database contains primitive semistructured information stored in various Web repositories, such as hypertext documents. At the higher level(s), metadata or generalizations are extracted from the lower levels and organized in structured collections such as relational or object-oriented databases. As an example, the ARANEUS system extracts relevant information from hypertext documents and integrates it into higher-level derived Web hypertexts that are generalized as database views. Khosla, Kuhn, and Soparkar (1996) and King and Novak (1996) have done research in this area. Web Structure Mining has as its goal mining knowledge from the structure of Web sites rather than from the text and data on the pages themselves. More specifically, it examines the structures that exist between documents on a Web site, such as hyperlinks and other linkages. For instance, links pointing to a document indicate the popularity of the document, while links coming out of a document indicate the richness, or perhaps the variety, of topics covered in the document. The PageRank (Brin & Page, 1998) and CLEVER (Chakrabarti et al., 1999) methods take advantage of the information conveyed by links to find pertinent Web pages. Counters of hyperlinks, into and out of documents, retrace the structure of the Web artifacts summarized.
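The link-based idea behind PageRank can be sketched in a few lines: each page repeatedly passes a damped share of its score along its outgoing links, so pages with many incoming links from well-ranked pages accumulate a high score. The three-page graph below is hypothetical, and this is only a toy power-iteration sketch, not the published algorithm in full.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Tiny PageRank sketch: repeatedly redistribute each page's score
    along its outgoing links (power iteration on a toy link graph)."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new[target] += share
            else:  # dangling page: spread its score evenly over all pages
                for p in pages:
                    new[p] += damping * rank[page] / len(pages)
        rank = new
    return rank

# Hypothetical three-page site: A and C both link to B
ranks = pagerank({"A": ["B"], "B": ["A", "C"], "C": ["B"]})
```

As expected, page B, with two incoming links, ends up with the highest score, while A and C, which are symmetric, tie.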
The concept of examining the structure of Web sites in order to gain additional insight and knowledge is closely related to the areas of social network and citation analysis. The idea is to model the linkages and structures of the Web using the concepts of social network analysis. There are also a number of algorithms that have been employed to model the structure of the Web, and have been put to practical use in determining the relevance of Web sites and pages. Other uses for these include the categorization of Web pages and the identification of communities existing on the Web. Some of these include PageRank and HITS (Pirolli, Pitkow, & Rao, 1996; Spertus, 1997). Web Usage Mining is yet another major area in the broad spectrum of Web mining. Rather than looking at the content pages or the underlying structure, Web usage mining is focused on Web-user behavior or, more
specifically, modeling and predicting how a user will use and interact with the Web. In general, this form of mining examines secondary data, or the data derived from the interactions of users (Chen, Park, & Yu, 1996). For instance, Web servers record and accumulate data about user interactions whenever requests for resources are received. Analyzing the Web access logs of different Web sites can enable a better understanding of user behavior and the Web structure, thereby improving the design of this collection of resources. The sources of Web usage data can be divided into three main categories: client level, server level, and proxy level. Client-level data is typically data collected by the Web browser itself running on a client machine, or by Java applets or JavaScript programs running in the browser. This is in contrast to server-level data, which is probably the most widely used of these three data sources. Server-level data is data gathered from Web servers, including server logs, as well as logs that record cookie data and query data. Finally, proxy-level data, which is in the form of proxy traces, can provide information on the browsing habits of users sharing the same proxy server (Srivastava, Cooley, Deshpande, & Tan, 2000). There are two main thrusts in Web usage mining: General Access Pattern Tracking and Customized Usage Tracking. General access pattern tracking analyzes Web logs in order to better understand access patterns and trends; these analyses can shed light on the structure and grouping of resource providers. Applying data-mining techniques to access logs can unveil interesting access patterns that can be used to restructure sites more effectively, place advertising better, and target ads to specific users. Customized usage tracking analyzes individual trends; its purpose is to customize Web sites to users.
The information displayed, the depth of the site structure, and the format of the resources can all be dynamically customized for each user over time, based on their patterns of access. It is important to point out that the success of such applications depends on what and how much valid and reliable knowledge one can discover from usage logs and other sources. It may be useful to incorporate information not only from Web servers, but also from customized scripts written for certain sites (Kosala & Blockeel, 2000). In general, the mining of Web usage data can be divided into two main approaches: analyzing the log data directly, or, alternately, mapping the data into relational tables. In the first case, some special preprocessing is required, and in the second, it is necessary to adapt and encode the information into a form that can be entered into the database. In either case, it is important to ensure the accuracy and definition of users and sessions given the influence of caching and proxy servers (Kosala & Blockeel, 2000).
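The session-identification step described above can be sketched with a common heuristic: requests from the same user separated by more than an idle timeout (often 30 minutes) are treated as separate sessions. The log entries and function below are hypothetical; real preprocessing must also contend with caching and proxy effects, as noted in the text.

```python
from datetime import datetime, timedelta

def sessionize(log_entries, timeout_minutes=30):
    """Group (user, timestamp, url) log entries into sessions: a new session
    starts when the same user is idle longer than the timeout."""
    sessions = {}
    last_seen = {}
    for user, ts, url in sorted(log_entries, key=lambda e: e[1]):
        if user not in last_seen or ts - last_seen[user] > timedelta(minutes=timeout_minutes):
            sessions.setdefault(user, []).append([])  # open a new session
        sessions[user][-1].append(url)
        last_seen[user] = ts
    return sessions

# Hypothetical entries for one visitor
t = datetime(2004, 1, 5, 9, 0)
entries = [
    ("u1", t, "/home"),
    ("u1", t + timedelta(minutes=5), "/products"),
    ("u1", t + timedelta(hours=2), "/home"),  # long gap -> new session
]
visits = sessionize(entries)
```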
In the case of Web usage applications, they can also be categorized as impersonalized or personalized. In the first case, the goal is to examine general user navigational patterns, in order to understand how users move through and use the site. The second looks more from the perspective of individual users, asking what their preferences and needs are, so as to begin developing a profile for each user. As a result, webmasters and site designers, with this knowledge in hand, can better structure and tailor their site to the needs of users, personalize the site for certain types of users, and learn more about the characteristics of the site's users. Srivastava, Cooley, Deshpande, and Tan (2000) have produced a taxonomy of Web-mining applications and have categorized them into the following types:

Personalization. The goal here is to produce a more individualized experience for a Web visitor, which includes making recommendations about other pages to visit based on the pages he or she has visited previously. In order to personalize recommended pages, part of the analysis is to cluster users who have similar access patterns and then develop a group of possible recommended pages to visit.

System Improvement. Performance and speed have always been an important factor in computing systems, and through Web usage data it is possible to improve system performance by creating policies and using such methods as load balancing, Web caching, and network transmission. The role of security is also important, and an analysis of usage patterns can be used to detect illegal intrusion and other security problems.

Site Modification. It is also possible to modify aspects of a site based on user patterns and behavior. After a detailed analysis of a user's activities on a site, it is possible to make design changes and structural modifications that enhance the user's satisfaction and the site's usability. In one interesting study, the structure of a Web site was changed, automatically, based on patterns analyzed from usage logs; this adaptive Web site project was described by Perkowitz and Etzioni (1998, 1999).

Business Intelligence. Another important application of Web usage mining is the ability to mine for marketing intelligence information. Buchner and Mulvenna (1998) used a data hypercube to consolidate Web usage data together with marketing data in order to obtain insights with regard to e-commerce. They identified certain areas in the customer relationship life cycle that were supported by analyses of Web usage information: customer attraction and retention, cross sales, and departure of customers. A number of commercial products on the market aid in collecting and analyzing Web log data for business intelligence purposes.

Usage Characterization. There is a close relationship between data mining of Web usage data and Web usage characterization research. Usage characterization is focused more on such topics as interactions with the browser interface, navigational strategies, the occurrence of certain types of activities, and models of Web usage. Studies in this area include Arlitt and Williamson (1997), Catledge and Pitkow (1995), and Doorenbos, Etzioni, and Weld (1996).
Three major components of the Web usage mining process are preprocessing, pattern discovery, and pattern analysis. The preprocessing component adapts the data to a form more suitable for pattern analysis and Web usage mining. This involves taking raw log data and converting it into usable (but not yet analyzed) information. In the case of Web usage data, it is necessary to take the raw log information and begin by identifying users, followed by the identification of the users' sessions. Often it is important to have not only the Web server log data, but also data on the content of the pages being accessed, so that it is easier to determine the exact kind of content to which the links point (Perkowitz & Etzioni, 1995; Srivastava, Cooley, Deshpande, & Tan, 2000). Pattern discovery includes such analyses as clustering, classification, sequential pattern analysis, descriptive statistics, and dependency modeling. While most of these should be familiar to those who understand statistical and analysis methods, a couple may be new to some. Sequential pattern analysis attempts to identify patterns that form a sequence; for example, certain types of data items in session data may be followed by certain other specific kinds of data. An analysis of this data can provide insight into the patterns present in the Web visits of certain kinds of customers, and would make it easier to target advertising and other promotions to the customers who would most appreciate them. Dependency modeling attempts to determine whether there are any dependencies between the variables in the Web usage data. This could help to identify, for example, whether there are different stages that a customer goes through while using an e-commerce site (such as browsing, product search, and purchase) on the way to becoming a regular customer. Pattern analysis has as its objective the filtering out of rules and patterns that are deemed uninteresting and, therefore, will be excluded from further
analysis. This step is necessary to avoid excessive time and effort spent on patterns that may not yield productive results. Yet another area that has been gaining interest is agent-based approaches. Agents are intelligent software components that crawl through the Net and collect useful information, moving through systems much as a worm does, though to constructive rather than destructive ends. Generally, agent-based Web-mining systems can be placed into three main categories: information categorization and filtering, intelligent search agents, and personal agents.

Information Filtering/Categorization agents try to automatically retrieve, filter, and categorize discovered information by using various information retrieval techniques. Agents that can be classified into this category include HyPursuit (Weiss et al., 1996) and Bookmark Organizer (BO). HyPursuit clusters together hierarchies of hypertext documents, and structures an information space by using semantic information embedded in link structures as well as document content. The BO system uses both hierarchical clustering methods and user interaction techniques to organize a collection of Web documents based on conceptual information.

Intelligent Search Agents search the Internet for relevant information and use characteristics of a particular domain to organize and interpret the discovered information. Some of the better known include ParaSite and FAQ-Finder. These agents rely either on domain-specific information about particular types of documents or on models of the information sources to retrieve and interpret documents. Other agents, such as ShopBot and Internet Learning Agent (ILA), attempt to interact with and learn the structure of unfamiliar information sources. ShopBot retrieves product information from a variety of vendor sites using only general information about the product domain. ILA, on the other hand, learns models of various information sources and translates these into its own internal concept hierarchy.
Personalized Web Agents try to obtain or learn user preferences and discover Web information sources that correspond to these preferences, and possibly those of other individuals with similar interests, using collaborative filtering. Systems in this class include Netperceptions, WebWatcher (Armstrong, Freitag, Joachims, & Mitchell, 1995), and Syskill and Webert (Pazzani, Muramatsu, & Billsus, 1996).
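The collaborative-filtering idea these personalized agents rely on can be sketched simply: score pages the target user has not seen by how often users with overlapping histories visited them. The page-visit data and function below are hypothetical illustrations, not the algorithm of any system named above.

```python
def recommend(prefs, target, top_n=2):
    """Minimal collaborative-filtering sketch: score unseen pages for the
    target user by counting how often users who share at least one page
    with the target also visited them."""
    seen = prefs[target]
    scores = {}
    for user, pages in prefs.items():
        if user == target or not (pages & seen):
            continue  # skip the target and users with no overlap
        for page in pages - seen:
            scores[page] = scores.get(page, 0) + 1
    # Highest score first; break ties alphabetically
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_n]

# Hypothetical page-visit sets
visits = {
    "alice": {"/news", "/sports"},
    "bob": {"/news", "/sports", "/weather"},
    "carol": {"/sports", "/weather", "/finance"},
}
suggestions = recommend(visits, "alice")
```

Real systems weight neighbors by similarity rather than counting them equally, but the structure of the computation is the same.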
As a related area, it would be useful to examine knowledge discovery from discussion groups and online chats. Online discussions can be a rich source of knowledge, since many people who are active in online chats are experts in their fields, and several researchers have made good progress in this area. The Butterfly system at MIT is a conversation-finding agent that aims to help Internet Relay Chat (IRC) users find desired groups; it uses a natural language query language and a highly interactive user interface. One study on Yenta (Foner, 1997) used a privacy-safe referral mechanism to discover clusters of interest among people on the Internet, and built user profiles by examining users' email and Usenet messages. Resnick (1994) discussed how to handle Internet information within large groups. Another development is IBM's Sankha, a browsing tool for online chat that demonstrates a new online clustering algorithm for detecting new topics in newsgroups. The idea behind Sankha is based on another pioneering IBM project called Quest (Agarwal et al., 1996).
Benefits of TDM
It is important to differentiate between TDM and information access (or information retrieval, as it is better known). The goal of information access is
to help users find documents that satisfy their information needs (Baeza-Yates & Ribeiro-Neto, 1999). The goal is one of homing in on what is currently of interest to the user. Text mining, by contrast, focuses on how to use a body of textual information as a large knowledge base from which one can extract new, never-before-encountered information (Craven et al., 1998). Even so, the results of certain types of text processing can yield tools that indirectly aid the information access process. Examples include text clustering to create thematic overviews of text collections (Rennison, 1994; Wise et al., 1995), automatically generating term associations to aid in query expansion (Voorhees, 1994; Xu & Croft, 1996), and using co-citation analysis to find general topics within a collection or identify central Web pages (Hearst, 1999; Kleinberg, 1998; Larson, 1996). Aside from providing tools to aid the standard information access process, text data mining can contribute systems supplemented with tools for exploratory data analysis. One example is the LINDI project, which investigated how researchers can use large text collections to discover new, important information, and how to build software systems that support this process. The LINDI interface lets users build and reuse sequences of query operations via a drag-and-drop interface, so the same sequence of actions can be repeated for different queries. The system maintains several types of history, including the history of commands issued, strategies employed, and hypotheses tested (Hearst, 1999). The user interface provides a mechanism for recording and modifying sequences of actions, including facilities that refer to metadata structure, allowing, for example, query terms to be expanded by terms one level above or below them in a subject hierarchy.
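The term-association idea mentioned above can be illustrated with simple co-occurrence counting: terms that frequently appear in the same documents as a query term become candidate expansions. The corpus below is a toy assumption, and real systems such as those cited use statistical association measures rather than raw counts.

```python
from collections import Counter
from itertools import combinations

# Toy corpus; each document is a short string (illustrative only).
docs = [
    "data mining finds patterns in data",
    "text mining extracts knowledge from documents",
    "mining large databases reveals patterns",
]

def cooccurrence(docs):
    """Count, for every pair of terms, how many documents contain both."""
    pairs = Counter()
    for doc in docs:
        terms = set(doc.split())
        for a, b in combinations(sorted(terms), 2):
            pairs[(a, b)] += 1
    return pairs

def expand(query_term, pairs, n=2):
    """Return the n terms that most often co-occur with query_term."""
    assoc = Counter()
    for (a, b), count in pairs.items():
        if a == query_term:
            assoc[b] += count
        elif b == query_term:
            assoc[a] += count
    return [t for t, _ in assoc.most_common(n)]

pairs = cooccurrence(docs)
print(expand("mining", pairs))
```

Here "patterns" co-occurs with "mining" in two documents, so it is the strongest expansion candidate.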
Thus, the emphasis of this system is to help automate the tedious parts of the text manipulation process and to combine text analysis with human-guided decision making. One area that is closely related to TDM is corpus-based computational linguistics. This field is concerned with computing statistics over large text collections in order to discover useful patterns. These patterns are used to develop algorithms for various sub-problems within natural language processing, such as part-of-speech tagging, word sense disambiguation, and bilingual dictionary creation. However, these tend to serve the specific needs of computational linguistics and are not applicable to a broader audience (Hearst, 1999).
Text Categorization
Some researchers have suggested that text categorization should be considered TDM. Text categorization is a condensation of the specific content of a document into one (or more) of a set of predefined labels. It does not discover new information; rather, it summarizes something that is already known. However, there are two recent areas of inquiry that make use of text categorization and seem to be more closely related to text mining. One area uses text category labels to find unexpected patterns among text articles (Dagan, Feldman, & Hirsh, 1996; Feldman, Klosgen, & Zilberstein, 1997); the main goal is to compare distributions of category assignments within subsets of the document collection. Another effort is the DARPA Topic Detection and Tracking initiative. This effort included the Online New Event Detection task, whose input is a stream of news stories in chronological order and whose output is a yes/no decision for each story, indicating whether the story is the first reference to a newly occurring event (Hearst, 1999).
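The category-assignment step underlying such work is often implemented with a naive Bayes classifier. The sketch below is a minimal, generic version with an invented training set, not the actual method of the systems cited.

```python
import math
from collections import Counter, defaultdict

# Tiny labeled corpus; labels and texts are illustrative only.
train = [
    ("sports", "the team won the game"),
    ("sports", "a great match and a win"),
    ("finance", "the market fell on earnings"),
    ("finance", "stocks rose after the earnings report"),
]

# Count words per label and documents per label.
word_counts = defaultdict(Counter)
label_counts = Counter()
vocab = set()
for label, text in train:
    label_counts[label] += 1
    for w in text.split():
        word_counts[label][w] += 1
        vocab.add(w)

def classify(text):
    """Pick the label maximizing log P(label) + sum of log P(word|label),
    with add-one (Laplace) smoothing for unseen words."""
    best, best_lp = None, float("-inf")
    for label in label_counts:
        lp = math.log(label_counts[label] / sum(label_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.split():
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(classify("earnings report for stocks"))
```

Even with two training documents per label, the word likelihoods are enough to assign the query to the "finance" category.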
Methods of TDM
Some of the major methods of text data mining include feature extraction, clustering, and categorization. Feature extraction, which is the mining of text within a document, attempts to find significant and important vocabulary within a natural language text document. This involves techniques such as pattern matching and heuristics focused on lexical and part-of-speech information. An effective feature extraction system is able not only to pull out relevant terms and words, but also to do more advanced processing, including resolving the ambiguity of variants, that is, avoiding the confusion of words that are spelled the same but mean different things. For instance, a system would ideally be able to distinguish whether a word is being used as the name of a city or as part of a person's name. From document-level analysis, it is possible to move on to examining collections of documents. The methods used to do this include clustering and classification. Clustering is the process of grouping documents with similar contents into dynamically generated clusters. Text categorization, by contrast, is a bit more involved: samples of documents fitting predetermined themes or categories are fed into a trainer, which generates a categorization schema. When the documents to be analyzed are fed into the categorizer, which incorporates the schema previously produced, it assigns documents to categories based on the taxonomy previously provided. These features are incorporated into programs such as IBM's Intelligent Miner for Text (Dorre, Gerstl, & Seiffert, 1999).
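The document-level pipeline described here, feature extraction followed by clustering, can be sketched as TF-IDF weighting plus a single-pass grouping step. The corpus, weighting, and greedy threshold rule below are illustrative simplifications; products such as Intelligent Miner for Text use considerably more sophisticated machinery.

```python
import math
from collections import Counter

# Toy document collection (illustrative only).
docs = [
    "guitar amp and guitar strings",
    "bass amp and bass strings",
    "stock market report",
    "daily market and stock prices",
]

def tfidf(docs):
    """Term frequency-inverse document frequency vectors, one per document.
    Terms appearing in every document carry no information and are dropped."""
    n = len(docs)
    bags = [Counter(d.split()) for d in docs]
    df = Counter()
    for bag in bags:
        for term in bag:
            df[term] += 1
    return [
        {t: c * math.log(n / df[t]) for t, c in bag.items() if df[t] < n}
        for bag in bags
    ]

def cosine(a, b):
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(vectors, threshold=0.05):
    """Greedy single-pass clustering: join a document to the first cluster
    whose seed document is similar enough, else start a new cluster."""
    clusters = []
    for i, v in enumerate(vectors):
        for c in clusters:
            if cosine(vectors[c[0]], v) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

print(cluster(tfidf(docs)))
```

The two music-gear documents share "amp" and "strings" and fall into one cluster; the two market documents share "stock" and "market" and form the other.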
This global model combines the results of the separate analyses. The global model produced can become incorrect or ambiguous, especially if the data in different locations has different features or characteristics. This problem is especially critical when the data at distributed sites is heterogeneous rather than homogeneous; such heterogeneous data sets are known as vertically partitioned datasets. Kargupta et al. (2000) proposed the collective data mining (CDM) approach, which handles vertically partitioned datasets by using the notion of orthonormal basis functions and computing the basis coefficients to generate the global model of the data.
computing device offer many challenges. For example, UDM introduces additional costs due to communication, computation, security, and other factors. So, one of the objectives of UDM is to mine data while minimizing the cost of ubiquitous presence. Human-computer interaction is another challenging aspect of UDM. Visualizing patterns such as classifiers, clusters, and associations on portable devices is usually difficult, and the small display areas pose serious challenges to interactive data-mining environments. Data management in a mobile environment is also a challenging issue. Moreover, the sociological and psychological aspects of integrating data-mining technology into our lifestyle are yet to be explored. The key issues to consider, according to Kargupta and Joshi (2001), include:
- theories of UDM;
- advanced algorithms for mobile and distributed applications;
- data management issues;
- mark-up languages and other data representation techniques;
- integration with database applications for mobile environments;
- architectural issues (architecture, control, security, and communication);
- specialized mobile devices for UDM;
- software agents and UDM (agent-based approaches, agent interaction, cooperation, collaboration, negotiation, and organizational behavior);
- applications of UDM (in business, science, engineering, medicine, and other disciplines);
- location management issues in UDM; and
- technology for Web-based applications of UDM.
there are other kinds of hypertext/hypermedia data sources that are not found on the Web. Examples include the information found in online catalogues, digital libraries, online information databases, and the like. In addition to the traditional forms of hypertext and hypermedia, together with the associated hyperlink structures, there are also inter-document structures that exist on the Web, such as the directories employed by services like Yahoo! (www.yahoo.com) or the Open Directory project (http://dmoz.org). These taxonomies of topics and subtopics are linked together to form a large network or hierarchical tree of topics and associated links and pages.
Some of the important data-mining techniques used for hypertext and hypermedia data mining include classification (supervised learning), clustering (unsupervised learning), semi-supervised learning, and social network analysis. In the case of classification, or supervised learning, the process starts by reviewing training data in which items are marked as belonging to a certain class or group. This data is the basis on which the algorithm is trained. One application of classification is in Web topic directories, which can group similar-sounding or similarly spelled terms into appropriate categories so that searches will not bring up inappropriate sites and pages. The use of classification can also yield searches based not only on keywords, but also on category and classification attributes. Methods used for classification include naive Bayes classification, parameter smoothing, dependence modeling, and maximum entropy (Chakrabarti, 2000). Unsupervised learning, or clustering, differs from classification in that classification involves the use of training data, whereas clustering is concerned with creating hierarchies of documents based on similarity and organizing the documents according to that hierarchy.
Intuitively, this would result in more similar documents being placed on the leaf levels of the hierarchy, with less similar sets of documents placed higher up, closer to the root of the tree. Techniques that have been used for unsupervised learning include k-means clustering, agglomerative clustering, random projections, and latent semantic indexing. Semi-supervised learning and social network analysis are other methods that are important to hypermedia-based data mining. Semi-supervised learning addresses the case where there are both labeled and unlabeled documents, and there is a need to learn from both types. Social network analysis is applicable because the Web can be considered a social network; it examines networks formed through collaborative association, whether between friends, academics doing research or serving on committees, or between papers
through references and citations. Graph distances and various aspects of connectivity come into play when working in the area of social networks (Larson, 1996; Mizruchi, Mariolis, Schwartz, & Mintz, 1986). Other research conducted in the area of hypertext data mining includes work on distributed hypertext resource discovery (Chakrabarti, van den Berg, & Dom, 1999).
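The graph distances mentioned above reduce, in the simplest case, to breadth-first search over a citation or link graph. The toy graph below is invented for illustration.

```python
from collections import deque

# Hypothetical citation/link graph: node -> nodes it is connected to.
graph = {
    "paperA": ["paperB", "paperC"],
    "paperB": ["paperA", "paperD"],
    "paperC": ["paperA"],
    "paperD": ["paperB", "paperE"],
    "paperE": ["paperD"],
}

def graph_distance(start, goal):
    """Breadth-first search: number of citation hops between two nodes."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two nodes are not connected

print(graph_distance("paperC", "paperE"))
```

In a real Web or co-citation network the same search yields the "degrees of separation" between pages, authors, or papers.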
collaborative visual data exploration and model building; metrics for evaluation of visual data-mining methods; generic system architectures and prototypes for visual data mining; and methods for visualizing semantic content.
Pictures and diagrams are also often used, mostly for psychological reasons: harnessing our ability to reason visually with the elements of a diagram in order to assist our more purely logical or analytical thought processes. Thus, a visual-reasoning approach to data mining and machine learning promises to overcome some of the difficulties experienced in comprehending the information encoded in data sets and in the models derived by other quantitative data-mining methods (Han & Kamber, 2001).
does require users to concentrate on watching patterns, which can become monotonous. By representing data as a stream of audio, however, it is possible to transform patterns into sound and music and listen to pitch, rhythm, and melody in order to identify anything interesting or unusual. It is possible not only to summarize melodies, based on the approximate patterns that repeatedly occur in a segment, but also to summarize style, based on tone, tempo, or the major musical instruments played (Han & Kamber, 2001; Zaiane, Han, & Zhu, 2000).
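A minimal form of this sonification idea maps each data value onto a pitch so that outliers become audible. The mapping below uses a MIDI-style note range; the data series and the choice of range are arbitrary assumptions for illustration.

```python
# Map each data point to a pitch; an analyst would then render these as
# MIDI notes or tones. The note range and input series are invented.
def sonify(values, low_note=48, high_note=84):
    """Linearly map numeric values onto a MIDI-style pitch range."""
    lo, hi = min(values), max(values)
    span = hi - lo or 1  # avoid division by zero for a flat series
    return [round(low_note + (v - lo) / span * (high_note - low_note))
            for v in values]

daily_sales = [120, 125, 130, 128, 300, 126, 131]  # spike on day 5
pitches = sonify(daily_sales)
print(pitches)
```

Played back, six days hover around the same low notes while the day-5 spike jumps to the top of the range, which is exactly the kind of anomaly a listener picks out without staring at a chart.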
warehouse include the difficulties of integrating data from heterogeneous sources and of applying online analytical processing, which is not only relatively fast but also offers some flexibility. In general, spatial data cubes, which are components of spatial data warehouses, are designed with three types of dimensions and two types of measures. The three types of dimensions are the nonspatial dimension (data that is nonspatial in nature), the spatial-to-nonspatial dimension (the primitive level is spatial but its higher-level generalization is nonspatial), and the spatial-to-spatial dimension (both primitive and higher levels are spatial). In terms of measures, spatial data cubes use both numerical measures (numbers only) and spatial measures (pointers to spatial objects) (Stefanovic, Han, & Koperski, 2000; Zhou, Truffet, & Han, 1999). Aside from the implementation of data warehouses for spatial data, there is also the issue of the analyses that can be done on the data, such as association analysis, clustering methods, and the mining of raster databases. There have been a number of studies conducted on spatial data mining (Bedard, Merrett, & Han, 2001; Han, Kamber, & Tung, 1998; Han, Koperski, & Stefanovic, 1997; Han, Stefanovic, & Koperski, 1998; Koperski, Adikary, & Han, 1996; Koperski & Han, 1995; Koperski, Han, & Marchisio, 1999; Koperski, Han, & Stefanovic, 1998; Tung, Hou, & Han, 2001).
other, while subsequence matching attempts to find those patterns that are similar to a specified, given sequence. Sequential-pattern mining has as its focus the identification of sequences that occur frequently in a time series or sequence of data. This is particularly useful in the analysis of customers, where certain buying patterns can be identified, for example, the likely follow-up purchase after a customer buys a certain electronics item or computer. Periodicity analysis attempts to analyze the data from the perspective of identifying patterns that repeat or recur in a time series. This form of data-mining analysis can be categorized as full periodic, partial periodic, or cyclic periodic. In general, full periodicity is the situation where all of the data points in time contribute to the behavior of the series. This is in contrast to partial periodicity, where only certain points in time contribute to series behavior. Finally, cyclic periodicity relates to sets of events that occur periodically (Han, Dong, & Yin, 1999; Han & Kamber, 2001; Han, Pei et al., 2000; Kim, Lam, & Han, 2000; Pei, Han, Pinto et al., 2001; Pei, Tung, & Han, 2001).
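Full or partial periodicity of the kind described can be detected, in the simplest case, by computing the sample autocorrelation at each candidate lag and picking the strongest. The weekly toy series below is invented for illustration.

```python
# Detect the dominant period in a series via sample autocorrelation.
def autocorr(series, lag):
    """Autocorrelation of the series with a lagged copy of itself."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    den = sum((x - mean) ** 2 for x in series)
    return num / den

def dominant_period(series, max_lag=None):
    """Return the lag (> 0) with the highest autocorrelation."""
    max_lag = max_lag or len(series) // 2
    return max(range(1, max_lag + 1), key=lambda lag: autocorr(series, lag))

# A toy weekly sales pattern (weekend spike) repeated four times.
week = [10, 12, 11, 13, 15, 30, 28]
series = week * 4
print(dominant_period(series))
```

The autocorrelation peaks at lag 7, recovering the weekly cycle; in a retail series this is the kind of regularity that periodicity analysis surfaces automatically.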
conducted within the framework of an ad hoc, query-driven system, data constraints can be specified in a form similar to that of a SQL query.
Dimension/level constraints. Because much of the information being mined is in the form of a database or multidimensional data warehouse, it is possible to specify constraints on the levels or dimensions to be included in the current query.
Interestingness constraints. It is also useful to determine what ranges of a particular variable or measure are considered particularly interesting and should be included in the query.
Rule constraints. It is also important to specify the specific rules that should be applied for a particular data-mining query or application.
One application of the constraint-based approach is the Online Analytical Mining (OLAM) architecture developed by Han, Lakshmanan, and Ng (1999), which is designed to support the multidimensional, constraint-based mining of databases and data warehouses. In short, constraint-based data mining is a developing area that allows the use of guiding constraints, which should make for better data mining. A number of studies have been conducted in this area: Cheung, Hwang, Fu, and Han (2000); Lakshmanan, Ng, Han, and Pang (1999); Lu, Feng, and Han (2001); Pei and Han (2000); Pei, Han, and Lakshmanan (2001); Pei, Han, and Mao (2000); Tung, Han, Lakshmanan, and Ng (2001); Wang, He, and Han (2000); and Wang, Zhou, and Han (2000).
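The flavor of constraint-based mining can be sketched as filtering candidate association rules against interestingness thresholds and a rule constraint. This is only a simple illustration of the idea; OLAM and the systems cited push such constraints deep into the mining process itself rather than filtering afterwards, and the rules below are invented.

```python
# Toy association rules: (antecedent, consequent, support, confidence).
rules = [
    ({"printer"}, {"paper"}, 0.30, 0.90),
    ({"laptop"}, {"mouse"}, 0.25, 0.60),
    ({"paper"}, {"pens"}, 0.05, 0.80),
    ({"laptop", "dock"}, {"monitor"}, 0.10, 0.85),
]

def mine_with_constraints(rules, min_support, min_confidence, must_contain=None):
    """Keep only rules meeting interestingness thresholds and an optional
    rule constraint requiring an item to appear in the antecedent."""
    out = []
    for ante, cons, sup, conf in rules:
        if sup < min_support or conf < min_confidence:
            continue  # interestingness constraint
        if must_contain and must_contain not in ante:
            continue  # rule constraint
        out.append((sorted(ante), sorted(cons)))
    return out

print(mine_with_constraints(rules, 0.10, 0.80, must_contain="laptop"))
```

Only the {laptop, dock} -> {monitor} rule survives both the thresholds and the requirement that "laptop" appear in the antecedent.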
doing the data mining. Part of the challenge in creating such a knowledge base involves the coding of common sense into a database, which has proved to be a difficult problem so far (Lyons & Tseytin, 1998).
etc.), and competitive analysis on competitors and market directions (competitive intelligence, CI). Another possible analysis could group customers into classes and develop class-based pricing procedures.
potential customers. Automatic personalization and recommender-system technologies have become critical tools in this arena because they help tailor the site's interaction with a visitor to his or her needs and interests (Nakagawa, Luo, Mobasher, & Dai, 2001). The current challenge in electronic commerce is to develop ways of gaining a deep understanding of the behavior of customers based on data that is, at least in part, anonymous (Mobasher, Dai, Luo, Sun, & Zhu). While most of the research in personalization is directed toward e-commerce functions, personalization concepts can be applied to any Web browsing activity. Mobasher, one of the most recognized researchers on this topic, defines Web personalization as any action that tailors the Web experience to a particular user, or set of users (Mobasher, Cooley, & Srivastava, 2000). Web personalization can be described as any action that makes the Web experience of a user personalized to the user's taste or preferences. The experience can be something as casual as browsing the Web or as significant (economically) as trading stocks or purchasing a car. The actions can range from simply making the presentation more pleasing to an individual, to anticipating the needs of the user and providing the right information, to performing a set of routine bookkeeping functions automatically (Mobasher, 1999). User preferences may be obtained explicitly, or by passive observation of users over time as they interact with the system (Mobasher, 1999). The target audience of a personalized experience is the group of visitors whose members will all see the same content. Traditional Web sites deliver the same content regardless of the visitor's identity; their target is the whole population of the Web. Personal portal sites, such as MyYahoo! and MyMSN, allow users to build a personalized view of their content; the target here is the individual visitor.
Personalization involves an application that computes a result, thereby actively modifying the end-user interaction. A main goal of personalization is to deliver some piece of content (for example, an ad, a product, or a piece of information) that the end-user finds so interesting that the session lasts at least one more click. The more times the end-user clicks, the longer the average session lasts; longer session lengths imply happier end-users, and happier end-users help achieve business goals (Rosenberg, 2001). The ultimate objectives are to own a piece of the customer's mindshare and to provide customized services to each customer according to his or her personal preferences, whether expressed or inferred. All this must be done while protecting customers' privacy and giving them a sense of power and control over the information they provide (Charlet, 1998).
The bursting of the so-called IT bubble has put vastly increased pressure on Internet companies to make a profit quickly. Imagine if, in a brick-and-mortar store, it were possible to observe which products a customer picks up and examines and which ones he or she just passes by. With that information, the store could make valuable marketing recommendations. In the online world, such data can be collected. Personalization techniques are generally seen as the true differentiator between brick-and-mortar businesses and the online world, and as a key to the continued growth and success of the Internet. This same ability may also serve as a limitation in the future, as the public becomes more concerned about personal privacy and the ethics of sites that collect personal information (Drogan & Hsu, 2003).
Musician's Friend
Musician's Friend (www.musiciansfriend.com), a subsidiary of Guitar Center, Inc., is part of the world's largest direct marketer of music gear. Musician's Friend features more than 24,000 products in its mail-order catalogs and on its Web site, including guitars, keyboards, amplifiers, and percussion instruments, as well as recording, mixing, lighting, and DJ gear. In 1999, Musician's Friend realized that both its e-commerce and catalog sales were underperforming. It had vast amounts of customer and product data but was not leveraging this information in any intelligent or productive way. The company sought a solution that would increase its e-commerce and catalog revenues through a better understanding of its customer and product data interactions and the ability to leverage this knowledge to generate greater demand. To meet these objectives, Musician's Friend decided to implement Web personalization technology. The company felt it could personalize the shopper's experience and at the same time gain a better understanding of the vast and complex relationships between products, customers, and promotions; successful implementation would mean more customers, more customer loyalty, and increased revenue. Musician's Friend chose Net Perceptions technology (www.netperceptions.com). This technology did more than make recommendations based simply on the shopper's preferences for the Web site: it combined preference information with knowledge about product relationships, profit margins, overstock conditions, and more. Musician's Friend also leveraged personalization technology to help its catalog business, as the merchandising staff quickly noticed that the same technology could help determine which of the many thousands of products available on the Web site to feature in its catalog promotions. The results were impressive: in 2000, catalog sales increased by 32% while Internet sales increased by 170%.
According to Eric Meadows, Director of Internet for the company, "We have been able to implement several enhancements to our site as a direct result of the Net Perceptions solution, including using data on the items customers return to refine and increase the effectiveness of the additional product suggestions the site recommends" (www.netperceptions.com). Net Perceptions' personalization solutions helped Musician's Friend generate a substantial year-over-year increase in items per order: in other words, intelligently generating greater customer demand (Drogan & Hsu, 2003).
J.Crew
J.Crew is one of the clothing industry's most recognized retailers, with hundreds of stores around the world and a catalog on thousands of doorsteps every new season. J.Crew is a merchandising-driven company, which means its goal is to get customers exactly what they want as easily as possible. Dave Towers, Vice President of e-Commerce Operations, explains: "As a multichannel retailer, our business is divided between our retail stores, our catalog, and our growing business on the Internet." J.Crew understood the operational cost reductions that could be achieved by migrating customers from the print catalog to www.j.crew.com. To accommodate all of its Internet customers, J.Crew built an e-commerce infrastructure that consistently supports about 7,000 simultaneous users and generates up to $100,000 per hour in revenue during peak times. J.Crew realized early on that personalization technology would be a critical area of focus if it was to succeed in e-commerce. As Mr. Towers put it, "A lot of our business is driven by our ability to present the right apparel to the right customer, whether it's pants, shirts or sweaters, and then up-sell the complementary items that round out a customer's purchase." J.Crew's personalization technology has allowed it to refine the commerce experience for Internet shoppers, and the company has definitely taken notice of the advantages that personalization has brought to its e-commerce site. The expanded capabilities delivered by personalization have given J.Crew a notable increase in up-sells, or units per transaction (UPTs), thanks to the ability to cross-sell items based on customers' actions on the site. Towers explains: "We can present a customer buying a shirt with a nice pair of pants that go with it, and present that recommendation at the right moment in the transaction."
The combination of scenarios and personalization enables the company to know more about a customer's preferences and spending habits and allows it to make implicit yet effective recommendations. Clearly, J.Crew is the type of e-commerce site that can directly benefit from personalization technology. With its business model and the right technology implementation, J.Crew is one company that has been able to make very effective and profitable use of the Internet (Drogan & Hsu, 2003).
Half.com
Half.com (www.half.com), an eBay company, offers consumers a fixed-price online marketplace to buy and sell new, overstocked, and used products at discount prices. Unlike auctions, where the selling price is based on bidding, the seller sets the price for an item at the time it is listed. The site currently lists a wide variety of merchandise, including books, CDs, movies, video games, computers, consumer electronics, sporting goods, and trading cards. Half.com determined that to increase customer satisfaction as well as company profits, personalization technology would have to be implemented. It was decided that product recommendations would be presented at numerous locations on the site, including the product detail, add-to-wish-list, add-to-cart, and thank-you pages; each point of promotion would include three to five personalized product recommendations. In addition, the site would generate personalized, targeted emails. For example, Half.com would send customers personalized email with product recommendations relevant to their prior purchases, and would send personalized emails to attempt to reactivate customers who had not made a purchase in more than six months. Half.com decided to try out Net Perceptions technology (www.netperceptions.com) to meet these needs. As a proof of concept, Net Perceptions and Half.com performed a 15-week effectiveness study of Net Perceptions' recommendation technology to see if a positive business benefit could be demonstrated to justify the cost of the product and the implementation. For the study, visitors were randomly split into groups upon entering the Half.com site: 80% of visitors were placed in a test group, which received the recommendations, and the remaining 20% were placed in a control group, which did not. The results of this test showed Half.com the business benefits of personalization technology.
The highlights were:
- Normalized sales were 5.2% greater in the test group than in the control group.
- Visitor-to-buyer conversion was 3.8% greater in the test group.
- Average spending per account per day was 1.1% greater in the test group.
- In the email campaign, 7% of the personalized emails generated a site visit, compared to 5% of the non-personalized emails.
- When personalized emails were sent to inactive customers (no purchase in six months), 28% of them proceeded to the site and actually made a purchase.
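The study's headline figures are relative lifts of the test group over the control group, which reduce to simple arithmetic. The per-visitor sales inputs below are hypothetical stand-ins chosen to reproduce the kind of numbers reported; only the 7% and 5% email visit rates come from the text.

```python
# Reproduce the kind of lift arithmetic behind such A/B study figures.
def percent_lift(test_value, control_value):
    """Relative improvement of the test group over the control group."""
    return (test_value - control_value) / control_value * 100

# Hypothetical per-visitor normalized sales for each group.
control_sales = 10.00
test_sales = 10.52
print(round(percent_lift(test_sales, control_sales), 1))  # 5.2

# Email campaign: personalized vs. non-personalized visit rates.
print(round(percent_lift(0.07, 0.05), 1))  # 40.0
```

Note that the modest-sounding 7% versus 5% email figures amount to a 40% relative lift in visits, which is why personalized campaigns are judged on relative rather than absolute differences.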
bioinformaticists are working on the prediction of the biological functions of genes and proteins based on structural data (Chalifa-Caspi, Prilusky, & Lancet, 1998). In recent years, many new databases storing biological information have appeared. While this is a positive development, many scientists complain that it is increasingly difficult to find useful information in the resulting labyrinth of data, largely because the information is scattered over a growing number of heterogeneous resources. One solution is to integrate the scattered information in new types of Web resources, whose principal benefit is that they enable the user to quickly obtain an idea of the current knowledge gathered about a particular subject. For instance, the Chromosome 17 Database stores all available information about human chromosome 17, including all the genes, markers, and other genomic features; in the near future, this integration concept will be expanded to the whole human genome. Another example of integration is the GeneCards encyclopedia (Rebhan, Chalifa-Caspi, Prilusky, & Lancet, 1997). This resource contains data about human genes, their products, and the diseases in which they are involved. What is special about it is that it contains only selected information that has been automatically extracted from a variety of heterogeneous databases, much as in data mining. In addition, this resource offers advanced user navigation guidance that leads the user rapidly to the desired information. Since data mining offers the ability to discover patterns and relationships in large amounts of data, it seems ideally suited to the analysis of DNA. DNA is essentially a sequence or chain of four main components called nucleotides; a group of several hundred nucleotides in a certain sequence is called a gene, and there are about 100,000 genes that make up the human genome.
Aside from the task of integrating databases of biological information noted above, another important application is comparison and similarity search on DNA sequences. This is useful in the study of genetically linked diseases, as it allows researchers to compare and contrast the gene sequences of normal and diseased tissues and attempt to determine which sequences are found in the diseased but not in the normal tissues. The analyses applied to biological sequence data will, of course, differ from those used for numerical data. In addition, since a disease may be caused not by a single gene but by a group of genes interacting together, a form of association analysis can be used to examine the relationships and interactions among the various genes associated with a certain genetic disease or condition. Another
application might be to use path analysis to study the genes that come into play during the different stages of a disease, and so gain some insight into which genes are key at which points in its course. This may enable the targeting of drugs to treat the conditions existing during the various stages of a disease. Yet another use of data mining and related technologies is the display of genes and biological structures using advanced visualization techniques, which allows scientists to study and analyze genetic information in ways that may bring out insights and discoveries not possible with more traditional forms of data display and analysis. A number of projects are being conducted in this area, whether in the areas discussed above or in the analysis of microarray data and related topics. Among the centers doing research in this area are the European Bioinformatics Institute (EBI) in Cambridge, UK, and the Weizmann Institute of Science in Israel.
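The sequence-comparison idea described above can be sketched in a few lines. This is a minimal illustration rather than a real bioinformatics method: it scores similarity by the overlap of short subsequences (k-mers), and the gene fragments shown are entirely hypothetical.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings (k-mers) of a DNA sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard similarity of two sequences' k-mer sets, from 0.0 to 1.0."""
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / len(ka | kb)

# Hypothetical gene fragments from normal and diseased tissue
normal = "ATGCGTACGTTAGC"
diseased = "ATGCGTACCTTAGC"  # single-nucleotide difference

print(similarity(normal, normal))    # identical sequences score 1.0
print(similarity(normal, diseased))  # lower score flags the variant region
```

A real comparison of normal and diseased tissue would use alignment algorithms and statistical significance testing, but the principle is the same: sequences that share fewer subsequences score lower, pointing analysts toward the regions that differ.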
products to a certain customer profile. What are the buying habits of a certain segment of customers? Finally, customer service can be enhanced by examining patterns of customer purchases, finding customer needs that have not been fulfilled, and routing customer inquiries more effectively. Customer retention is another issue that can be analyzed using data mining. Of the customers a firm currently has, what percentage will eventually leave and go to another provider? What are the reasons for leaving, and what are the characteristics of customers who are likely to leave? With this information, a firm has the opportunity to address these issues and perhaps improve retention of these customers (Dyche, 2001; Greenberg, 2001; Hancock & Delmater, 2001; Swift, 2000).
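The retention questions above can be made concrete with a small sketch. The customer records and fields here are invented for illustration; the code simply computes the churn rate and profiles the customers who left:

```python
# Hypothetical customer records: (tenure in months, support calls last quarter, churned?)
customers = [
    (24, 1, False), (3, 5, True), (36, 0, False),
    (6, 4, True), (18, 2, False), (2, 6, True),
]

# What percentage of customers eventually left?
churn_rate = sum(1 for _, _, churned in customers if churned) / len(customers)

# What characterizes the customers who left?
leavers = [(t, c) for t, c, churned in customers if churned]
avg_tenure = sum(t for t, _ in leavers) / len(leavers)
avg_calls = sum(c for _, c in leavers) / len(leavers)

print(f"churn rate: {churn_rate:.0%}")  # prints "churn rate: 50%"
print(f"leavers average {avg_tenure:.1f} months tenure, {avg_calls:.1f} support calls")
```

Even this toy profile hints that short-tenure, high-contact customers are the ones leaving; a real data-mining tool would surface such patterns, with statistical support, across millions of records.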
Future directions for the field include: Integrated Data Mining Systems, Invisible Data Mining, Mining of Dynamic Data, and End-User Data Mining.
Invisible Data Mining. The concept of invisible data mining is to make data mining as unobtrusive and transparent to the user as possible, hence the term invisible.
End-User Data Mining. Many of the available data-mining tools and methods are complex to use, not only in the techniques and theories involved, but also in the operation of many of the available data-mining software packages. Many of these are designed for experts and scientists well versed in advanced analytical techniques rather than for end-users such as marketing professionals, managers, and engineers. These professional end-users, who could benefit a great deal from the power of data-mining analyses, are often unable to do so because of the complexity of the process; they would be best served by the development of simpler, easier-to-use tools and packages with straightforward procedures, intuitive user interfaces, and better overall usability. In other words, designing systems that can be more easily used by non-experts would raise the level of use in the business and scientific communities and increase the awareness and development of this highly promising field.
REFERENCES
Agrawal, R., Arning, A., Bollinger, T., Mehta, M., Shafer, J., & Srikant, R. (1996). The Quest data mining system. In Proceedings of the 2nd International Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon (August).
Arlitt, M., & Williamson, C. (1997). Internet Web servers: Workload characterization and performance implications. IEEE/ACM Transactions on Networking, 5(5).
Armstrong, R., Freitag, D., Joachims, T., & Mitchell, T. (1995). WebWatcher: A learning apprentice for the World Wide Web. In Proceedings of the AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, Stanford, California (March).
Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern Information Retrieval. Boston, MA: Addison-Wesley Longman.
Bedard, T., & Han, J. (2001). Fundamentals of geospatial data warehousing for geographic knowledge discovery. In H. Miller & J. Han (Eds.), Geographic Data Mining and Knowledge Discovery. London: Taylor and Francis.
Beeferman, D. (1998). Lexical discovery with an enriched semantic network. In Proceedings of the ACL/COLING Workshop on Applications of WordNet in Natural Language Processing Systems (pp. 358-364).
Billsus, D., & Pazzani, M. (1999). A hybrid user model for news story classification. In Proceedings of the Seventh International Conference on User Modeling, Banff, Canada (June 20-24).