Data Warehousing and Data Mining Techniques

Data Warehousing

Introduction

Data warehouses are databases that store and maintain analytical data separately from transaction-oriented databases for the purpose of decision support. They provide access to data (years' worth of data) for complex analysis, knowledge discovery, and decision making through ad hoc and canned queries¹. A data warehouse is therefore characterized as a subject-oriented, integrated, non-volatile, time-variant collection of data in support of management's decisions.

* Subject-oriented, as the warehouse is organized around the major subjects of the enterprise (such as customers, products, and sales) rather than the major application areas (such as customer invoicing, stock control, and product sales). This is reflected in the need to store decision-support data rather than application-oriented data.
* Integrated, because of the coming together of source data from different enterprise-wide application systems. The source data is often inconsistent, using, for example, different formats. The integrated data source must be made consistent to present a unified view of the data to the users.
* Time-variant, because time-variance is shown in the extended period over which the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots.
* Nonvolatile, as the data is not updated in real time but is refreshed from operational systems on a regular basis. New data is always added as a supplement to the database, rather than a replacement. The database continually absorbs this new data, incrementally integrating it with the previous data.

Generally, data warehouses provide storage, functionality, and responsiveness to queries beyond the capabilities of transaction-oriented databases. Notice that a data warehouse refers to a collection of information as well as a supporting system. Different types of applications are supported, including OLAP and data mining applications. OLAP (online analytical processing) is a term used to describe the analysis of complex data from the data warehouse. In the hands of skilled knowledge workers, OLAP tools enable quick and straightforward querying of the analytical data stored in data warehouses and data marts (analytical databases similar to data warehouses but with a defined, narrow scope). Hence, we can also describe data warehousing more generally as a collection of decision support technologies aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions.

¹ Canned queries refer to a priori defined queries with parameters that may recur with high frequency.

Benefits of Data Warehousing

The successful implementation of a data warehouse can bring major benefits to an organization, including:

* Potential high returns on investment. An organization must commit a huge amount of resources to ensure the successful implementation of a data warehouse; however, successful data warehouse projects deliver a high return on investment.
* Competitive advantage. The huge returns on investment for those companies that have successfully implemented a data warehouse are evidence of the enormous competitive advantage that accompanies this technology. The competitive advantage is gained by allowing decision makers access to data that can reveal previously unavailable, unknown, and untapped information on, for example, customers, trends, and demands.
* Increased productivity of corporate decision makers.
Data warehousing improves the productivity of corporate decision makers by creating an integrated database of consistent, subject-oriented, historical data. It integrates data from multiple incompatible systems into a form that provides one consistent view of the organization. By transforming data into meaningful information, a data warehouse allows corporate decision makers to perform more substantive, accurate, and consistent analysis.

Online Transaction Processing (OLTP) and Data Warehousing

Traditional databases support online transaction processing (OLTP), which includes insertions, updates, and deletions, while also supporting information query requirements. Traditional relational databases are optimized to process queries that may touch a small part of the database and transactions that deal with insertions or updates of a few tuples per relation. Thus, they cannot be optimized for OLAP or data mining. By contrast, data warehouses are designed precisely to support efficient extraction, processing, and presentation for analytic and decision-making purposes. In comparison to traditional databases, data warehouses generally contain very large amounts of data from multiple sources that may include databases from different data models and sometimes files acquired from independent systems and platforms.

Compared to transactional databases, data warehouses are nonvolatile. This means the data is not updated in real time but is refreshed from operational systems on a regular basis. New data is always added as a supplement to the database, rather than a replacement. The database continually absorbs this new data, incrementally integrating it with the previous data. Warehouse insertions are handled by the warehouse's ETL (extract, transform, load) process, which does a large amount of preprocessing.

An organization will normally have a number of different OLTP systems for business processes such as inventory control, customer invoicing, and point-of-sale. These systems generate operational data that is detailed, current, and subject to change. In contrast, an organization will normally have a single data warehouse, which holds data that is historical, detailed, and summarized to various levels and rarely subject to change (other than being supplemented with new data). The data warehouse is designed to support relatively low numbers of transactions that are unpredictable in nature and require answers to queries that are ad hoc, unstructured, and heuristic. The warehouse data is organized according to the requirements of potential queries and supports the analytical requirements of a lower number of users.

Although OLTP systems and data warehouses have different characteristics and are built with different purposes in mind, these systems are closely related, in that the OLTP systems provide the source data for the warehouse. A major problem of this relationship is that the data held by the OLTP systems can be inconsistent, fragmented, and subject to change, containing duplicate or missing entries. As such, the operational data must be "cleaned up" before it can be used in the data warehouse.

Data Warehouse Architecture

Figure 1 gives an overview of the conceptual structure of a data warehouse. It shows the entire data warehousing process, which includes possible cleaning and reformatting of data before loading it into the warehouse. This process is handled by tools known as ETL (extraction, transformation, and loading) tools.
At the back end of the process, OLAP, data mining, and DSS may generate new relevant information such as rules (or additional metadata); this information is shown in Figure 1 as going back as additional data inputs into the warehouse. The figure also shows that data sources may include databases as well as other data inputs.

Figure 1 Overview of the general architecture of a data warehouse.

Different tools and technologies are associated with building and managing a data warehouse. A general overview of these tools is provided below.

Extraction, Transformation, and Loading (ETL)

One of the most commonly cited benefits associated with enterprise data warehouses (EDW) is that these centralized systems provide an integrated, enterprise-wide view of corporate data. However, achieving this valuable view of data can be very complex and time-consuming. The data destined for an EDW must first be extracted from one or more data sources, transformed into a form that is easy to analyze and consistent with data already in the warehouse, and then finally loaded into the EDW. This entire process is referred to as the extraction, transformation, and loading (ETL) process and is a critical process in any data warehouse project.

Extraction

The extraction step targets one or more data sources for the EDW; these sources typically include OLTP databases but can also include sources such as personal databases and spreadsheets, enterprise resource planning (ERP) files, and web usage log files. The data sources are normally internal but can also include external sources, such as the systems used by suppliers and/or customers. The extraction step normally copies the extracted data to temporary storage referred to as the operational data store (ODS) or staging area (SA). Additional issues associated with the extraction step include establishing the frequency for data extractions from each source system to the EDW, monitoring any modifications to the source systems to ensure that the extraction process remains valid, and monitoring any changes in the performance or availability of source systems, which may have an impact on the extraction process.

Transformation

The transformation step applies a series of rules or functions to the extracted data, which determines how the data will be used for analysis and can involve transformations such as data summations, data encoding, data merging, data splitting, data calculations, and creation of surrogate keys. The output from the transformations is data that is clean (checked for validity) and consistent with the data already held in the warehouse and, furthermore, is in a form that is ready for analysis by users of the warehouse. Recognizing erroneous and incomplete data is difficult to automate; hence, data cleaning is an involved and complex process that has been identified as the largest labor-demanding component of data warehouse construction. As data managers in the organization discover that their data is being cleaned for input into the warehouse, they will likely want to upgrade their data with the cleaned data. The process of returning cleaned data to the source is called backflushing.
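To make the flow of the ETL steps described above more concrete, the following is a minimal Python sketch of an extract-transform-load pass over a single hypothetical CSV source. The file name, field names, and cleaning rules are illustrative assumptions, not part of the text; a real ETL tool would handle many sources, scheduling, and far richer validation.

```python
import csv
import sqlite3

def extract(path):
    """Extract raw rows from a source file into the staging area (here, just a list)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Apply simple cleaning and encoding rules so the data is consistent and analysis-ready."""
    cleaned = []
    for row in rows:
        # Skip incomplete records (a crude stand-in for real validity checks).
        if not row.get("customer_id") or not row.get("amount"):
            continue
        cleaned.append({
            "customer_id": row["customer_id"].strip(),
            "region": row.get("region", "UNKNOWN").strip().upper(),  # encode region consistently
            "amount": round(float(row["amount"]), 2),                # normalize numeric format
        })
    return cleaned

def load(rows, db_path="warehouse.db"):
    """Load the transformed rows into a warehouse table as a periodic, supplement-only refresh."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS sales_fact (
                       customer_id TEXT NOT NULL,
                       region      TEXT NOT NULL,
                       amount      REAL NOT NULL)""")
    con.executemany("INSERT INTO sales_fact VALUES (:customer_id, :region, :amount)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("daily_sales_export.csv")))
```

Note that new rows are only ever appended, mirroring the nonvolatile, supplement-rather-than-replace behavior described earlier.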
Loading

The loading of the data into the warehouse can occur after all transformations have taken place or as part of the transformation processing. As the data loads into the warehouse, additional constraints defined in the database schema, as well as in triggers activated upon data loading, will be applied (such as uniqueness, referential integrity, and mandatory fields), which also contribute to the overall data quality performance of the ETL process.

Data Warehouse Metadata

The major purpose of metadata is to show the pathway back to where the data began, so that the warehouse administrators know the history of any item in the warehouse. However, the problem is that metadata has several functions within the warehouse that relate to the processes associated with data transformation and loading, data warehouse management, and query generation.

The metadata associated with data transformation and loading must describe the source data and any changes that were made to the data. For example, for each source field there should be a unique identifier, original field name, source data type, and original location including the system and object name, along with the destination data type and destination table name. If the field is subject to any transformations, from a simple field type change to a complex set of procedures and functions, this should also be recorded.

The query manager generates additional metadata about the queries that are run, which can be used to generate a history of all the queries and a query profile for each user, group of users, or the data warehouse. There is also metadata associated with the users of queries that includes, for example, information describing what the term "price" or "customer" means in a particular database and whether the meaning has changed over time.

Administration and Management Tools

A data warehouse requires tools to support the administration and management of such a complex environment. These tools must be capable of supporting the following tasks:

* monitoring data loading from multiple sources;
* data quality and integrity checks;
* managing and updating metadata;
* monitoring database performance to ensure efficient query response times and resource utilization;
* auditing data warehouse usage to provide user chargeback information;
* replicating, subsetting, and distributing data;
* maintaining efficient data storage management;
* purging data;
* archiving and backing up data;
* implementing recovery following failure;
* security management.

Data Modeling for Data Warehouses

A standard, normalized, relational database model is completely inappropriate to the requirements of a data warehouse. An entirely different modeling technique, called a dimensional database model (data cubes), is needed for data warehouses.

A standard spreadsheet is a two-dimensional matrix. One example would be a spreadsheet of regional sales by product for a particular time period. Products could be shown as rows, with columns comprising sales revenues for each region (Figure 2 shows this two-dimensional organization). Adding a time dimension, such as an organization's fiscal quarters, would produce a three-dimensional matrix, which could be represented using a data cube.

Figure 2 A two-dimensional matrix model.

Figure 3 shows a three-dimensional data cube that organizes product sales data by fiscal quarters and sales regions. Each cell could contain data for a specific product, specific fiscal quarter, and specific region.
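As a rough illustration of the structures in Figures 2 and 3, the short pandas sketch below builds the two-dimensional product-by-region matrix and then adds fiscal quarter as a third dimension. The column names and revenue figures are invented for illustration only and are not taken from the figures.

```python
import pandas as pd

# A flat table of sales facts: one row per (product, region, quarter) observation.
sales = pd.DataFrame({
    "product": ["P123", "P123", "P124", "P124", "P125", "P125"],
    "region":  ["Reg1", "Reg2", "Reg1", "Reg3", "Reg2", "Reg3"],
    "quarter": ["Q1",   "Q1",   "Q1",   "Q2",   "Q2",   "Q2"],
    "revenue": [100.0,  80.0,   55.0,   70.0,   90.0,   40.0],
})

# Two-dimensional matrix (as in Figure 2): products as rows, regions as columns.
matrix_2d = sales.pivot_table(index="product", columns="region",
                              values="revenue", aggfunc="sum")

# Three-dimensional cube (as in Figure 3): each cell holds the revenue for a
# specific (product, quarter, region) combination.
cube_3d = sales.groupby(["product", "quarter", "region"])["revenue"].sum()

print(matrix_2d)
print(cube_3d)
```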
By including additional dimensions, a data hypercube could be produced, although more than three dimensions cannot be easily visualized or graphically presented. The data can be queried directly in any combination of dimensions, thus bypassing complex database queries. Tools exist for viewing data according to the user's choice of dimensions.

Figure 3 A three-dimensional data cube model.

Changing from one dimensional hierarchy (orientation) to another is easily accomplished in a data cube with a technique called pivoting (also called rotation). In this technique, the data cube can be thought of as rotating to show a different orientation of the axes. For example, you might pivot the data cube to show regional sales revenues as rows, the fiscal quarter revenue totals as columns, and the company's products in the third dimension (Figure 4). Hence, this technique is equivalent to having a regional sales table for each product separately, where each table shows quarterly sales for that product region by region.

The term slice is used to refer to a two-dimensional view of a three- or higher-dimensional cube. The Product vs. Region 2-D view shown in Figure 2 is a slice of the 3-D cube shown in Figure 3. The popular term "slice and dice" implies a systematic reduction of a body of data into smaller chunks or views so that the information is made visible from multiple angles or viewpoints.

Figure 4 Pivoted version of the data cube from Figure 3.

Multidimensional models lend themselves readily to hierarchical views in what is known as roll-up display and drill-down display. A roll-up display moves up the hierarchy, grouping into larger units along a dimension (for example, summing weekly data by quarter or by year). A drill-down display provides the opposite capability, furnishing a finer-grained view, perhaps disaggregating country sales by region and then regional sales by subregion and also breaking up products by styles.

The multidimensional model (also called the dimensional model) involves two types of tables: dimension tables and fact tables. A dimension table consists of tuples of attributes of the dimension. A fact table can be thought of as having tuples, one per recorded fact. A fact table contains historical transactions, which could be a lot of records. Dimensions describe facts. Figure 5 shows an example of a fact table that can be viewed from the perspective of multiple dimension tables.

Two common multidimensional schemas are the star schema and the snowflake schema. The star schema consists of a fact table with a single table for each dimension (Figure 5).

Figure 5 A star schema with fact and dimensional tables.
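To make the star schema concrete, here is a minimal, self-contained sketch (using Python's built-in sqlite3 module) of a fact table joined to two dimension tables, followed by a roll-up query that sums revenue by subregion and quarter. The table and column names are illustrative assumptions loosely modeled on Figure 5, not a prescribed design.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Dimension tables describe the facts.
    CREATE TABLE product (prod_no INTEGER PRIMARY KEY, prod_name TEXT, prod_style TEXT);
    CREATE TABLE region  (region TEXT PRIMARY KEY, subregion TEXT);
    -- The fact table holds one row per recorded business result.
    CREATE TABLE business_results (
        prod_no INTEGER REFERENCES product(prod_no),
        region  TEXT    REFERENCES region(region),
        quarter TEXT,
        sales_revenue REAL
    );
    INSERT INTO product VALUES (123, 'Widget', 'Standard'), (124, 'Gadget', 'Deluxe');
    INSERT INTO region  VALUES ('Reg1', 'North'), ('Reg2', 'South');
    INSERT INTO business_results VALUES
        (123, 'Reg1', 'Q1', 100.0), (123, 'Reg2', 'Q1', 80.0),
        (124, 'Reg1', 'Q2', 55.0),  (124, 'Reg2', 'Q2', 70.0);
""")

# Roll-up: aggregate the detailed facts upward along the region and time dimensions.
for row in con.execute("""
        SELECT r.subregion, f.quarter, SUM(f.sales_revenue)
        FROM business_results AS f JOIN region AS r ON f.region = r.region
        GROUP BY r.subregion, f.quarter"""):
    print(row)
```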
The snowflake schema is a variation on the star schema in which the dimension tables are organized into a hierarchy by normalizing them (Figure 6).

Figure 6 A snowflake schema.

Data Mart

A data mart is a database that contains a subset of corporate data to support the analytical requirements of a particular business unit (such as the Sales department) or to support users who share the same requirement to analyze a particular business process (such as property sales). There are many reasons for creating a data mart, including:

* To provide data in a form that matches the collective view of the data by a group of users in a department or a group of users interested in a particular business process.
* To improve end-user response time due to the reduction in the volume of data to be accessed.
* Data marts normally use less data, so the data ETL process is less complex, and hence implementing and setting up a data mart is simpler compared with establishing an EDW.
* The cost of implementing data marts (in time, money, and resources) is normally less than that required to establish an EDW.
* The potential users of a data mart are more clearly defined and can be more easily targeted to obtain support for a data mart project rather than an EDW project.

If an organization has a large volume of relational data, it may consider creating some data marts for specific business needs. For example, the accounts department may create a data mart to maintain balance sheets and prepare customer account statements, while the marketing department may create another data mart for optimizing advertising campaigns. Most large organizations use a combination of data lakes², warehouses, and marts in their storage infrastructure. Typically, all data is ingested into a data lake and then loaded into different warehouses and marts for assorted use cases. However, not every organization may require that level of scale.

² A data lake is a centralized repository that allows you to store any data (structured, semi-structured, and unstructured) at any scale.

Data Mining

Introduction

Simply storing information in a data warehouse does not provide the benefits that an organization is seeking. To realize the value of a data warehouse, it is necessary to extract the knowledge hidden within the warehouse. However, as the amount and complexity of the data in a data warehouse grows, it becomes increasingly difficult, if not impossible, for business analysts to identify trends and relationships in the data using simple query and reporting tools. Data mining is used in searching data for unanticipated new knowledge; it discovers information within data warehouses that queries and reports cannot effectively reveal. Hence, data mining can be defined as the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions. Data mining can provide huge paybacks for companies that have made a significant investment in data warehousing.

Data mining is used in a wide range of industries. Table 1 lists examples of applications of data mining in retail/marketing, banking, insurance, and medicine.

Industry / Example applications

Retail/marketing
* Identifying buying patterns of customers
* Finding associations among customer demographic characteristics
* Market basket analysis

Banking
* Detecting patterns of fraudulent credit card use
* Determining credit card spending by customer groups

Insurance
* Claim analysis
* Predicting which customers will buy new policies

Medicine
* Identifying successful medical therapies for different illnesses

Table 1 Examples of data mining applications.

Data Mining Techniques

There are four main operations associated with data mining techniques: predictive modeling, database segmentation, link analysis, and deviation detection. Although any of the four major operations can be used for implementing any of the business applications listed in Table 2, there are certain recognized associations between the applications and the corresponding operations. For example, direct marketing strategies are normally implemented using the database segmentation operation, and fraud detection could be implemented by any of the four operations. Further, many applications work particularly well when several operations are used. For example, a common approach to customer profiling is to segment the database first and then apply predictive modeling to the resultant data segments.

Operations / Associated data mining techniques

Predictive modeling
* Classification

Database segmentation
* Demographic clustering
* Neural clustering

Link analysis
* Association discovery
* Sequential pattern discovery
* Similar time sequence discovery

Deviation detection
* Statistics
* Visualization

Table 2 Data mining operations and associated techniques.

Predictive Modeling

This approach uses generalizations of the "real world" and the ability to fit new data into a general framework. Predictive modeling can be used to analyze an existing database to determine some essential characteristics (model) about the data set. The model is developed using a supervised learning approach, which has two phases: training and testing. Training builds a model using a large sample of historical data called a training set, and testing involves trying out the model on new, previously unseen data to determine its accuracy and physical performance characteristics. Applications of predictive modeling include customer retention management, credit approval, cross-selling, and direct marketing.

Database Segmentation

The aim of database segmentation is to partition a database into an unknown number of segments, or clusters, of similar records, that is, records that share a number of properties and so are considered to be homogeneous. (Segments have high internal homogeneity and high external heterogeneity.) This approach uses unsupervised learning to discover homogeneous subpopulations in a database to improve the accuracy of the profiles. Applications of database segmentation include customer profiling, direct marketing, and cross-selling.

Link Analysis

Link analysis aims to establish links, called associations, between the individual records, or sets of records, in a database. There are three specializations of link analysis: associations discovery, sequential pattern discovery, and similar time sequence discovery. Associations discovery finds items that imply the presence of other items in the same event. These affinities between items are represented by association rules. For example, "When a customer rents property for more than two years and is more than 25 years old, in 40% of cases, the customer will buy a property. This association happens in 35% of all customers who rent properties."
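The 40% and 35% figures in the example above correspond to what association-rule mining calls confidence and support. The following is a minimal, pure-Python sketch (with made-up customer records) showing how those two measures might be computed for a rule "rents more than two years and older than 25 => buys a property"; the data and field names are hypothetical.

```python
# Each record describes a hypothetical customer who rents property.
customers = [
    {"years_renting": 3, "age": 30, "bought_property": True},
    {"years_renting": 4, "age": 28, "bought_property": False},
    {"years_renting": 1, "age": 40, "bought_property": False},
    {"years_renting": 5, "age": 35, "bought_property": True},
    {"years_renting": 2, "age": 22, "bought_property": False},
]

# Antecedent of the rule: rents for more than two years AND is older than 25.
antecedent = [c for c in customers if c["years_renting"] > 2 and c["age"] > 25]
# Antecedent and consequent together: those customers who also bought a property.
both = [c for c in antecedent if c["bought_property"]]

# Confidence: of the customers matching the antecedent, how many also bought?
confidence = len(both) / len(antecedent)
# Support: how many of all (renting) customers does the whole rule apply to?
support = len(both) / len(customers)

print(f"confidence = {confidence:.0%}, support = {support:.0%}")
```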
Sequential pattern discovery finds patterns between events such that the presence of one set of items is followed by another set of items in a database of events over a period of time. For example, this approach can be used to understand long-term customer buying behavior. Similar time sequence discovery is used, for example, in the discovery of links between two sets of data that are time-dependent, and is based on the degree of similarity between the patterns that both time series demonstrate. For example, within three months of buying property, new home owners will purchase goods such as stoves, refrigerators, and washing machines.

Deviation Detection

Deviation detection is a relatively new technique in terms of commercially available data mining tools. However, it is often a source of true discovery, because it identifies outliers, which express deviation from some previously known expectation and norm. This operation can be performed using statistics and visualization techniques or as a by-product of data mining. For example, linear regression facilitates the identification of outliers in data, and modern visualization techniques display summaries and graphical representations that make deviations easy to detect. Applications of deviation detection include fraud detection in the use of credit cards and insurance claims, quality control, and defects tracing.
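As a simple illustration of statistics-based deviation detection, the sketch below flags values that lie far from the mean of a sample. The two-standard-deviation threshold and the sample charge amounts are arbitrary illustrative choices, not something prescribed by the text; real deviation detection would typically use more robust statistics or model-based residuals.

```python
from statistics import mean, stdev

# Hypothetical daily credit card charges for one customer; the last value is suspicious.
charges = [42.0, 38.5, 45.2, 40.1, 39.8, 44.3, 41.7, 950.0]

mu = mean(charges)
sigma = stdev(charges)

# Flag any charge more than two standard deviations from the mean as a deviation (outlier).
outliers = [x for x in charges if abs(x - mu) > 2 * sigma]
print(outliers)  # expected to flag the 950.0 charge
```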
