Data Mining and Data Warehousing
A PAPER PRESENTATION AT
TECHNO CARNIVAL- 2006
FROM
SUBMITTED BY:
PRADEEP BHANAWAT T.E. (C.S.E.)
[email protected]
Definition:
Data mining is the process of finding correlations or patterns among fields in large relational databases. It has also been defined as the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions (Simoudis, 1996).
Handling of different types of data
Efficiency and scalability of algorithms
Usefulness, certainty, and expressiveness of results
Expression of various kinds of mining results
Interactive mining of knowledge at multiple levels
Mining information from different sources of data
Retail/Marketing:
Performing basket analysis: finding which items customers tend to purchase together. This knowledge can improve stocking, store layout strategies, and promotions.
Sales forecasting: examining time-based patterns helps retailers make stocking decisions. If a customer purchases an item today, when are they likely to purchase a complementary item?
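A minimal sketch of basket analysis, assuming each transaction is simply a list of item names. The sample baskets and the function name pair_counts are made up for illustration and do not come from the paper. The code counts how often each pair of items appears in the same basket and reports the share of baskets that contain the most frequent pair.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so (a, b) and (b, a) count as the same pair.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical transactions, one list of items per customer visit.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "cereal"],
]

counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
support = top_count / len(baskets)  # share of baskets containing the pair
print(top_pair, top_count, support)
```

Counting pairs directly is only practical for small item catalogues; larger data sets usually call for a dedicated frequent-itemset algorithm.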
Telecommunication:
Call detail record analysis: telecommunication companies accumulate detailed call records. By identifying customer segments with similar use patterns, the companies can develop attractive pricing and feature promotions.
Customer loyalty: some customers repeatedly switch providers, or churn, to take advantage of attractive incentives offered by competing companies. The companies can use data mining to identify the characteristics of customers who are likely to remain loyal once they switch, thus enabling the companies to target their spending on the customers who will produce the most profit.
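The idea of grouping customers by similar use patterns can be sketched as follows. This is a rule-based illustration only, not a clustering algorithm: the record layout, the segment labels, and both thresholds are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical call detail records: (customer_id, minutes, is_international).
records = [
    ("c1", 50.0, False),
    ("c1", 30.0, False),
    ("c2", 2.0, False),
    ("c2", 3.5, False),
    ("c3", 45.0, True),
    ("c3", 50.0, True),
]


def usage_profiles(records):
    """Sum total and international minutes per customer."""
    profiles = defaultdict(lambda: {"total": 0.0, "international": 0.0})
    for customer_id, minutes, is_international in records:
        profiles[customer_id]["total"] += minutes
        if is_international:
            profiles[customer_id]["international"] += minutes
    return dict(profiles)


def segment(profile, heavy_minutes=60.0, intl_share=0.5):
    """Assign a coarse segment label; both thresholds are illustrative."""
    total = profile["total"]
    share = profile["international"] / total if total else 0.0
    if share >= intl_share:
        return "international"
    if total >= heavy_minutes:
        return "heavy domestic"
    return "light"


segments = {cid: segment(p) for cid, p in usage_profiles(records).items()}
print(segments)
```

In practice the segments would be discovered from the data rather than fixed in advance, but the aggregation step from raw call records to per-customer profiles looks much the same.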
Data Warehouse Characteristics:
1. Subject-oriented: the warehouse is organized around the major subjects of the enterprise rather than the major application areas. This is reflected in the need to store decision-support data rather than application-oriented data.
2. Integrated: the source data come together from different enterprise-wide application systems. The source data is often inconsistent, so the integrated data source must be made consistent to present a unified view of the data to the users.
3. Time-variant: the data in the warehouse is only accurate and valid at some point in time or over some time interval. The time-variance of the data warehouse is also shown in the extended time that the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots.
4. Non-volatile: data is not updated in real time but is refreshed from the operational systems on a regular basis. New data is always added as a supplement to the database rather than as a replacement. The database continually absorbs this new data, incrementally integrating it with the previous data.
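The time-variant and non-volatile properties can be illustrated with a small append-only store. This is a sketch under stated assumptions: the class name SnapshotStore, its methods, and the sample customer data are hypothetical and only show the idea of adding dated snapshots instead of updating rows in place.

```python
from datetime import date


class SnapshotStore:
    """Append-only store: each refresh adds rows stamped with a load date."""

    def __init__(self):
        self._rows = []  # (load_date, key, value); never updated in place

    def refresh(self, load_date, source_rows):
        """Add the current state of the source as a new snapshot."""
        for key, value in source_rows.items():
            self._rows.append((load_date, key, value))

    def as_of(self, when):
        """Return the latest known value per key on or before `when`."""
        result = {}
        for load_date, key, value in sorted(self._rows, key=lambda r: r[0]):
            if load_date <= when:
                result[key] = value  # later snapshots overwrite earlier ones
        return result


store = SnapshotStore()
store.refresh(date(2006, 1, 1), {"cust-1": "Pune"})
store.refresh(date(2006, 2, 1), {"cust-1": "Mumbai"})
print(store.as_of(date(2006, 1, 15)))
print(store.as_of(date(2006, 2, 15)))
```

Because earlier snapshots are never overwritten, the store can still answer what was known in January after the February refresh has been loaded.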
Data Warehouse Architecture:
[Figure: data warehouse architecture, showing operational data sources 1 to n, the operational data store (ODS), load manager, warehouse manager, query manager, DBMS, meta-data, detailed data, lightly summarized data, archive/backup data, data mining, and end-user access tools.]
Main Components:
Operational data sources: data for the warehouse is supplied from mainframe operational data held in first-generation hierarchical and network databases, departmental data held in proprietary file systems, private data held on workstations and private servers, and external systems such as the Internet, commercially available databases, or databases associated with an organization's suppliers or customers.
Operational data store (ODS): a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse.
Load manager: also called the frontend component, it performs all the operations associated with the extraction and loading of data into the warehouse. These operations include simple transformations of the data to prepare it for entry into the warehouse.
Warehouse manager: performs all the operations associated with the management of the data in the warehouse. These operations include analysis of data to ensure consistency, transformation and merging of source data, creation of indexes and views, generation of denormalizations and aggregations, and archiving and backing up data.
Query manager: also called the backend component, it performs all the operations associated with the management of user queries. These operations include directing queries to the appropriate tables and scheduling the execution of queries.
Detailed data, lightly and highly summarized data
Archive/backup data
Meta-data
End-user access tools: can be categorized into five main groups: data reporting and query tools, application development tools, executive information system (EIS) tools, online analytical processing (OLAP) tools, and data mining tools.
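The two query manager duties named above, directing queries to the appropriate tables and scheduling their execution, can be sketched as follows. The table names, the notion of a query "grain", and the cost ordering are all assumptions made for this illustration.

```python
# Hypothetical table names, one per level of summarization.
TABLES_BY_GRAIN = {
    "year": "sales_highly_summarized",
    "month": "sales_lightly_summarized",
    "day": "sales_detailed",
}

# Assumed relative cost: summarized tables are smaller and cheaper to scan.
COST_BY_GRAIN = {"year": 0, "month": 1, "day": 2}


def route_query(grain):
    """Pick the table that can answer a query at the requested grain."""
    try:
        return TABLES_BY_GRAIN[grain]
    except KeyError:
        raise ValueError(f"unsupported grain: {grain!r}") from None


def schedule(queries):
    """Order queries so that cheap, summarized ones run first."""
    return sorted(queries, key=lambda q: COST_BY_GRAIN[q["grain"]])


queries = [
    {"name": "daily audit", "grain": "day"},
    {"name": "annual report", "grain": "year"},
    {"name": "monthly trend", "grain": "month"},
]
for query in schedule(queries):
    print(query["name"], "->", route_query(query["grain"]))
```

Routing coarse questions to summarized tables is one reason the warehouse keeps several levels of summarization alongside the detailed data.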
The warehouse data ranges from detailed to summarized, contains metadata, and offers many views of the data. It is subject-oriented and time-variant.
Data Flows
Inflow: the processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse.
Upflow: the processes associated with adding value to the data in the warehouse through summarizing, packaging, and distribution of the data.
Downflow: the processes associated with archiving and backing up the data in the warehouse.
Outflow: the processes associated with making the data available to the end-users.
Meta-flow: the processes associated with the management of the meta-data.
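Three of these flows can be sketched as plain functions. This is a minimal illustration, assuming rows are dictionaries with a store name and an amount; the cleansing rule, the field names, and the in-memory archive are hypothetical.

```python
def inflow(source_rows):
    """Extract, cleanse, and load: drop incomplete rows, normalize names."""
    cleaned = []
    for row in source_rows:
        if row.get("amount") is None:
            continue  # cleansing rule assumed for this sketch
        cleaned.append({
            "store": row["store"].strip().upper(),
            "amount": float(row["amount"]),
        })
    return cleaned


def upflow(detailed_rows):
    """Add value by summarizing detailed rows into one total per store."""
    totals = {}
    for row in detailed_rows:
        totals[row["store"]] = totals.get(row["store"], 0.0) + row["amount"]
    return totals


def downflow(detailed_rows, archive):
    """Archive a copy of the detailed rows (here, an in-memory list)."""
    archive.extend(dict(row) for row in detailed_rows)
    return len(archive)


# Hypothetical source data with inconsistent formatting and a missing value.
source = [
    {"store": " pune ", "amount": "10.5"},
    {"store": "Pune", "amount": 4.5},
    {"store": "delhi", "amount": None},
    {"store": "Delhi", "amount": 7},
]
warehouse = inflow(source)
summary = upflow(warehouse)
archive = []
archived = downflow(warehouse, archive)
print(summary, archived)
```

Outflow and meta-flow are left out here because they depend on the access tools and the metadata catalogue in use.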
Date Dimension: Date key, Transaction data, Day of month, Month of year, Year
BookSales fact table
Foreign Keys: Date key, Bookstore key, Book key, Clerk code key
Summary Data: Units, Sales, Discounts
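The fact table and date dimension listed above can be tried out with Python's built-in sqlite3 module. This is a sketch under stated assumptions: the snake_case column names, the column types, and the sample rows are invented for the example, and only the date dimension columns needed for the query are modeled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (
    date_key      INTEGER PRIMARY KEY,
    day_of_month  INTEGER,
    month_of_year INTEGER,
    year          INTEGER
);
CREATE TABLE book_sales (
    date_key       INTEGER REFERENCES date_dim(date_key),
    bookstore_key  INTEGER,
    book_key       INTEGER,
    clerk_code_key INTEGER,
    units          INTEGER,
    sales          REAL,
    discounts      REAL
);
""")

# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO date_dim VALUES (?, ?, ?, ?)",
    [(1, 14, 1, 2006), (2, 15, 1, 2006), (3, 3, 2, 2006)],
)
conn.executemany(
    "INSERT INTO book_sales VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 10, 100, 7, 2, 40.0, 0.0),
        (2, 10, 101, 7, 1, 25.0, 5.0),
        (3, 11, 100, 8, 3, 60.0, 0.0),
    ],
)

# Join the fact table to the date dimension and summarize by month.
rows = conn.execute("""
    SELECT d.year, d.month_of_year, SUM(f.units), SUM(f.sales)
    FROM book_sales AS f
    JOIN date_dim AS d ON d.date_key = f.date_key
    GROUP BY d.year, d.month_of_year
    ORDER BY d.year, d.month_of_year
""").fetchall()
print(rows)
```

The fact table holds only keys and numeric measures, so any question about time is answered by joining to the date dimension and grouping on its attributes.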
The potential benefits of data warehousing are:
High returns on investment
Substantial competitive advantage
Increased productivity of corporate decision-makers
More cost-effective decision making
Better enterprise intelligence
Enhanced customer service
Better asset/liability management
Business process reengineering
Decision Support Systems (DSS): Decision support systems ideally present information in graphical and tabular form and give the user the ability to drill down on selected information. Drilling down exposes increased detail and more data manipulation options.
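Drilling down can be pictured as moving from a summary view to the detail behind one selected entry. The sketch below assumes sales rows of the form (year, month, amount); the data and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales rows: (year, month, amount).
sales = [
    (2005, 12, 100.0),
    (2006, 1, 40.0),
    (2006, 1, 25.0),
    (2006, 2, 60.0),
]


def totals_by_year(rows):
    """Top-level view: one total per year."""
    totals = defaultdict(float)
    for year, _month, amount in rows:
        totals[year] += amount
    return dict(totals)


def drill_down(rows, year):
    """Detail view for one selected year: one total per month."""
    totals = defaultdict(float)
    for row_year, month, amount in rows:
        if row_year == year:
            totals[month] += amount
    return dict(totals)


print(totals_by_year(sales))
print(drill_down(sales, 2006))
```

A real DSS would render both views as charts or tables, but the underlying step is the same regrouping at a finer level of detail.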
Conclusion:
Data warehousing provides the means to change raw data into information for making effective business decisions; the emphasis is on information, not data. The data warehouse is the hub for decision-support data. Data mining is a useful tool with multiple algorithms that can be tuned for specific tasks. It can benefit business, medicine, and science. More efficient algorithms are needed to speed up the data mining process.