Data Warehousing and Data Mining Techniques

Data Warehousing

Introduction

Data warehouses are databases that store and maintain analytical data separately from transaction-oriented databases for the purpose of decision support. They provide access to data (years' worth of data) for complex analysis, knowledge discovery, and decision making through ad hoc and canned queries¹. A data warehouse is therefore characterized as a subject-oriented, integrated, non-volatile, time-variant collection of data in support of management's decisions.

* Subject-oriented, as the warehouse is organized around the major subjects of the enterprise (such as customers, products, and sales) rather than the major application areas (such as customer invoicing, stock control, and product sales). This is reflected in the need to store decision-support data rather than application-oriented data.
* Integrated, because of the coming together of source data from different enterprise-wide application systems. The source data is often inconsistent, using, for example, different formats. The integrated data source must be made consistent to present a unified view of the data to the users.
* Time-variant, because time-variance is shown in the extended period over which the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots.
* Nonvolatile, as the data is not updated in real time but is refreshed from operational systems on a regular basis. New data is always added as a supplement to the database, rather than a replacement. The database continually absorbs this new data, incrementally integrating it with the previous data.

Generally, data warehouses provide storage, functionality, and responsiveness to queries beyond the capabilities of transaction-oriented databases. Notice that a data warehouse refers to a collection of information as well as a supporting system. Different types of applications are supported, including OLAP and data mining applications. OLAP (online analytical processing) is a term used to describe the analysis of complex data from the data warehouse. In the hands of skilled knowledge workers, OLAP tools enable quick and straightforward querying of the analytical data stored in data warehouses and data marts (analytical databases similar to data warehouses but with a defined, narrow scope). Hence, we can also describe data warehousing more generally as a collection of decision support technologies aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions.

¹ Canned queries refer to a priori defined queries with parameters that may recur with high frequency.

Benefits of Data Warehousing

The successful implementation of a data warehouse can bring major benefits to an organization, including:

* Potential high returns on investment. An organization must commit a huge amount of resources to ensure the successful implementation of a data warehouse; however, successful data warehouse projects deliver a high return on investment.
* Competitive advantage. The huge returns on investment for those companies that have successfully implemented a data warehouse are evidence of the enormous competitive advantage that accompanies this technology. The competitive advantage is gained by allowing decision makers access to data that can reveal previously unavailable, unknown, and untapped information on, for example, customers, trends, and demands.
* Increased productivity of corporate decision makers.
Data warehousing improves the productivity of corporate decision makers by creating an integrated database of consistent, subject-oriented, historical data. It integrates data from multiple incompatible systems into a form that provides one consistent view of the organization. By transforming data into meaningful information, a data warehouse allows corporate decision makers to perform more substantive, accurate, and consistent analysis.

Online Transaction Processing (OLTP) and Data Warehousing

Traditional databases support online transaction processing (OLTP), which includes insertions, updates, and deletions, while also supporting information query requirements. Traditional relational databases are optimized to process queries that may touch a small part of the database and transactions that deal with insertions or updates of a few tuples per relation. Thus, they cannot be optimized for OLAP or data mining. By contrast, data warehouses are designed precisely to support efficient extraction, processing, and presentation for analytic and decision-making purposes. In comparison to traditional databases, data warehouses generally contain very large amounts of data from multiple sources that may include databases from different data models and sometimes files acquired from independent systems and platforms.

Compared to transactional databases, data warehouses are nonvolatile. This means the data is not updated in real time but is refreshed from operational systems on a regular basis. New data is always added as a supplement to the database, rather than a replacement. The database continually absorbs this new data, incrementally integrating it with the previous data. Warehouse insertions are handled by the warehouse's ETL (extract, transform, load) process, which does a large amount of preprocessing.

An organization will normally have a number of different OLTP systems for business processes such as inventory control, customer invoicing, and point-of-sale. These systems generate operational data that is detailed, current, and subject to change. In contrast, an organization will normally have a single data warehouse, which holds data that is historical, detailed, and summarized to various levels and rarely subject to change (other than being supplemented with new data). The data warehouse is designed to support relatively low numbers of transactions that are unpredictable in nature and require answers to queries that are ad hoc, unstructured, and heuristic. The warehouse data is organized according to the requirements of potential queries and supports the analytical requirements of a lower number of users.

Although OLTP systems and data warehouses have different characteristics and are built with different purposes in mind, these systems are closely related, in that the OLTP systems provide the source data for the warehouse. A major problem of this relationship is that the data held by the OLTP systems can be inconsistent, fragmented, and subject to change, containing duplicate or missing entries. As such, the operational data must be "cleaned up" before it can be used in the data warehouse.

Data Warehouse Architecture

Figure 1 gives an overview of the conceptual structure of a data warehouse. It shows the entire data warehousing process, which includes possible cleaning and reformatting of data before loading it into the warehouse. This process is handled by tools known as ETL (extraction, transformation, and loading) tools.
At the back end of the process, OLAP, data mining, and DSS may generate new relevant information such as rules (or additional metadata); this information is shown in Figure 1 as going back as additional data inputs into the warehouse. The figure also shows that data sources may include databases as well as other data inputs.

Figure 1 Overview of the general architecture of a data warehouse.

Different tools and technologies are associated with building and managing a data warehouse. A general overview of these tools is provided below.

Extraction, Transformation, and Loading (ETL)

One of the most commonly cited benefits associated with enterprise data warehouses (EDW) is that these centralized systems provide an integrated, enterprise-wide view of corporate data. However, achieving this valuable view of data can be very complex and time-consuming. The data destined for an EDW must first be extracted from one or more data sources, transformed into a form that is easy to analyze and consistent with data already in the warehouse, and then finally loaded into the EDW. This entire process is referred to as the extraction, transformation, and loading (ETL) process and is a critical process in any data warehouse project.

Extraction

The extraction step targets one or more data sources for the EDW; these sources typically include OLTP databases but can also include sources such as personal databases and spreadsheets, enterprise resource planning (ERP) files, and web usage log files. The data sources are normally internal but can also include external sources, such as the systems used by suppliers and/or customers. The extraction step normally copies the extracted data to temporary storage referred to as the operational data store (ODS) or staging area (SA). Additional issues associated with the extraction step include establishing the frequency for data extractions from each source system to the EDW, monitoring any modifications to the source systems to ensure that the extraction process remains valid, and monitoring any changes in the performance or availability of source systems, which may have an impact on the extraction process.

Transformation

The transformation step applies a series of rules or functions to the extracted data, which determines how the data will be used for analysis and can involve transformations such as data summations, data encoding, data merging, data splitting, data calculations, and creation of surrogate keys. The output from the transformations is data that is clean (checked for validity) and consistent with the data already held in the warehouse and, furthermore, is in a form that is ready for analysis by users of the warehouse. Recognizing erroneous and incomplete data is difficult to automate; hence, data cleaning is an involved and complex process that has been identified as the largest labor-demanding component of data warehouse construction. As data managers in the organization discover that their data is being cleaned for input into the warehouse, they will likely want to upgrade their data with the cleaned data. The process of returning cleaned data to the source is called backflushing.
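To make the flow of the ETL steps described above more concrete, the following is a minimal Python sketch of an extract-transform-load pass over a single hypothetical CSV source. The file name, field names, and cleaning rules are illustrative assumptions, not part of the text; a real ETL tool would handle many sources, scheduling, and far richer validation.

```python
import csv
import sqlite3

def extract(path):
    """Extract raw rows from a source file into the staging area (here, just a list)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Apply simple cleaning and encoding rules so the data is consistent and analysis-ready."""
    cleaned = []
    for row in rows:
        # Skip incomplete records (a crude stand-in for real validity checks).
        if not row.get("customer_id") or not row.get("amount"):
            continue
        cleaned.append({
            "customer_id": row["customer_id"].strip(),
            "region": row.get("region", "UNKNOWN").strip().upper(),  # encode region consistently
            "amount": round(float(row["amount"]), 2),                # normalize numeric format
        })
    return cleaned

def load(rows, db_path="warehouse.db"):
    """Load the transformed rows into a warehouse table as a periodic, supplement-only refresh."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS sales_fact (
                       customer_id TEXT NOT NULL,
                       region      TEXT NOT NULL,
                       amount      REAL NOT NULL)""")
    con.executemany("INSERT INTO sales_fact VALUES (:customer_id, :region, :amount)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("daily_sales_export.csv")))
```

Note that new rows are only ever appended, mirroring the nonvolatile, supplement-rather-than-replace behavior described earlier.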
Loading

The loading of the data into the warehouse can occur after all transformations have taken place or as part of the transformation processing. As the data loads into the warehouse, additional constraints defined in the database schema, as well as in triggers activated upon data loading, will be applied (such as uniqueness, referential integrity, and mandatory fields), which also contribute to the overall data quality performance of the ETL process.

Data Warehouse Metadata

The major purpose of metadata is to show the pathway back to where the data began, so that the warehouse administrators know the history of any item in the warehouse. However, the problem is that metadata has several functions within the warehouse that relate to the processes associated with data transformation and loading, data warehouse management, and query generation.

The metadata associated with data transformation and loading must describe the source data and any changes that were made to the data. For example, for each source field there should be a unique identifier, original field name, source data type, and original location including the system and object name, along with the destination data type and destination table name. If the field is subject to any transformations, from a simple field type change to a complex set of procedures and functions, this should also be recorded.

The query manager generates additional metadata about the queries that are run, which can be used to generate a history of all the queries and a query profile for each user, group of users, or the data warehouse. There is also metadata associated with the users of queries that includes, for example, information describing what the term "price" or "customer" means in a particular database and whether the meaning has changed over time.

Administration and Management Tools

A data warehouse requires tools to support the administration and management of such a complex environment. These tools must be capable of supporting the following tasks:

* monitoring data loading from multiple sources;
* data quality and integrity checks;
* managing and updating metadata;
* monitoring database performance to ensure efficient query response times and resource utilization;
* auditing data warehouse usage to provide user chargeback information;
* replicating, subsetting, and distributing data;
* maintaining efficient data storage management;
* purging data;
* archiving and backing up data;
* implementing recovery following failure;
* security management.

Data Modeling for Data Warehouses

A standard, normalized, relational database model is completely inappropriate to the requirements of a data warehouse. An entirely different modeling technique, called a dimensional database model (data cubes), is needed for data warehouses.

A standard spreadsheet is a two-dimensional matrix. One example would be a spreadsheet of regional sales by product for a particular time period. Products could be shown as rows, with columns comprising sales revenues for each region (Figure 2 shows this two-dimensional organization). Adding a time dimension, such as an organization's fiscal quarters, would produce a three-dimensional matrix, which could be represented using a data cube.

Figure 2 A two-dimensional matrix model.

Figure 3 shows a three-dimensional data cube that organizes product sales data by fiscal quarters and sales regions. Each cell could contain data for a specific product, specific fiscal quarter, and specific region.
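As a rough illustration of the structures in Figures 2 and 3, the short pandas sketch below builds the two-dimensional product-by-region matrix and then adds fiscal quarter as a third dimension. The column names and revenue figures are invented for illustration only and are not taken from the figures.

```python
import pandas as pd

# A flat table of sales facts: one row per (product, region, quarter) observation.
sales = pd.DataFrame({
    "product": ["P123", "P123", "P124", "P124", "P125", "P125"],
    "region":  ["Reg1", "Reg2", "Reg1", "Reg3", "Reg2", "Reg3"],
    "quarter": ["Q1",   "Q1",   "Q1",   "Q2",   "Q2",   "Q2"],
    "revenue": [100.0,  80.0,   55.0,   70.0,   90.0,   40.0],
})

# Two-dimensional matrix (as in Figure 2): products as rows, regions as columns.
matrix_2d = sales.pivot_table(index="product", columns="region",
                              values="revenue", aggfunc="sum")

# Three-dimensional cube (as in Figure 3): each cell holds the revenue for a
# specific (product, quarter, region) combination.
cube_3d = sales.groupby(["product", "quarter", "region"])["revenue"].sum()

print(matrix_2d)
print(cube_3d)
```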
By including additional dimensions, a data hypercube could be produced, although more than three dimensions cannot be easily visualized or graphically presented. The data can be queried directly in any combination of dimensions, thus bypassing complex database queries. Tools exist for viewing data according to the user's choice of dimensions.

Figure 3 A three-dimensional data cube model.

Changing from one dimensional hierarchy (orientation) to another is easily accomplished in a data cube with a technique called pivoting (also called rotation). In this technique, the data cube can be thought of as rotating to show a different orientation of the axes. For example, you might pivot the data cube to show regional sales revenues as rows, the fiscal quarter revenue totals as columns, and the company's products in the third dimension (Figure 4). Hence, this technique is equivalent to having a regional sales table for each product separately, where each table shows quarterly sales for that product region by region.

The term slice is used to refer to a two-dimensional view of a three- or higher-dimensional cube. The Product vs. Region 2-D view shown in Figure 2 is a slice of the 3-D cube shown in Figure 3. The popular term "slice and dice" implies a systematic reduction of a body of data into smaller chunks or views so that the information is made visible from multiple angles or viewpoints.

Figure 4 Pivoted version of the data cube from Figure 3.

Multidimensional models lend themselves readily to hierarchical views in what is known as roll-up display and drill-down display. A roll-up display moves up the hierarchy, grouping into larger units along a dimension (for example, summing weekly data by quarter or by year). A drill-down display provides the opposite capability, furnishing a finer-grained view, perhaps disaggregating country sales by region and then regional sales by subregion and also breaking up products by styles.

The multidimensional model (also called the dimensional model) involves two types of tables: dimension tables and fact tables. A dimension table consists of tuples of attributes of the dimension. A fact table can be thought of as having tuples, one per recorded fact. A fact table contains historical transactions, which could be a lot of records. Dimensions describe facts. Figure 5 shows an example of a fact table that can be viewed from the perspective of multiple dimension tables.

Two common multidimensional schemas are the star schema and the snowflake schema. The star schema consists of a fact table with a single table for each dimension (Figure 5).

Figure 5 A star schema with fact and dimensional tables.
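To make the star schema concrete, here is a minimal, self-contained sketch (using Python's built-in sqlite3 module) of a fact table joined to two dimension tables, followed by a roll-up query that sums revenue by subregion and quarter. The table and column names are illustrative assumptions loosely modeled on Figure 5, not a prescribed design.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Dimension tables describe the facts.
    CREATE TABLE product (prod_no INTEGER PRIMARY KEY, prod_name TEXT, prod_style TEXT);
    CREATE TABLE region  (region TEXT PRIMARY KEY, subregion TEXT);
    -- The fact table holds one row per recorded business result.
    CREATE TABLE business_results (
        prod_no INTEGER REFERENCES product(prod_no),
        region  TEXT    REFERENCES region(region),
        quarter TEXT,
        sales_revenue REAL
    );
    INSERT INTO product VALUES (123, 'Widget', 'Standard'), (124, 'Gadget', 'Deluxe');
    INSERT INTO region  VALUES ('Reg1', 'North'), ('Reg2', 'South');
    INSERT INTO business_results VALUES
        (123, 'Reg1', 'Q1', 100.0), (123, 'Reg2', 'Q1', 80.0),
        (124, 'Reg1', 'Q2', 55.0),  (124, 'Reg2', 'Q2', 70.0);
""")

# Roll-up: aggregate the detailed facts upward along the region and time dimensions.
for row in con.execute("""
        SELECT r.subregion, f.quarter, SUM(f.sales_revenue)
        FROM business_results AS f JOIN region AS r ON f.region = r.region
        GROUP BY r.subregion, f.quarter"""):
    print(row)
```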
The snowflake schema is a variation on the star schema in which the dimension tables are organized into a hierarchy by normalizing them (Figure 6).

Figure 6 A snowflake schema.

Data Mart

A data mart is a database that contains a subset of corporate data to support the analytical requirements of a particular business unit (such as the Sales department) or to support users who share the same requirement to analyze a particular business process (such as property sales). There are many reasons for creating a data mart, including:

* To provide data in a form that matches the collective view of the data by a group of users in a department or a group of users interested in a particular business process.
* To improve end-user response time due to the reduction in the volume of data to be accessed.
* Data marts normally use less data, so the data ETL process is less complex, and hence implementing and setting up a data mart is simpler compared with establishing an EDW.
* The cost of implementing data marts (in time, money, and resources) is normally less than that required to establish an EDW.
* The potential users of a data mart are more clearly defined and can be more easily targeted to obtain support for a data mart project rather than an EDW project.

If an organization has a large volume of relational data, it may consider creating some data marts for specific business needs. For example, the accounts department may create a data mart to maintain balance sheets and prepare customer account statements, while the marketing department may create another data mart for optimizing advertising campaigns. Most large organizations use a combination of data lakes², warehouses, and marts in their storage infrastructure. Typically, all data is ingested into a data lake and then loaded into different warehouses and marts for assorted use cases. However, not every organization may require that level of scale.

² A data lake is a centralized repository that allows you to store any data (structured, semi-structured, and unstructured) at any scale.

Data Mining

Introduction

Simply storing information in a data warehouse does not provide the benefits that an organization is seeking. To realize the value of a data warehouse, it is necessary to extract the knowledge hidden within the warehouse. However, as the amount and complexity of the data in a data warehouse grows, it becomes increasingly difficult, if not impossible, for business analysts to identify trends and relationships in the data using simple query and reporting tools. Data mining is used in searching data for unanticipated new knowledge; it discovers information within data warehouses that queries and reports cannot effectively reveal. Hence, data mining can be defined as the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions. Data mining can provide huge paybacks for companies that have made a significant investment in data warehousing.

Data mining is used in a wide range of industries. Table 1 lists examples of applications of data mining in retail/marketing, banking, insurance, and medicine.

Industry / Example applications

Retail/marketing
* Identifying buying patterns of customers
* Finding associations among customer demographic characteristics
* Market basket analysis

Banking
* Detecting patterns of fraudulent credit card use
* Determining credit card spending by customer groups

Insurance
* Claim analysis
* Predicting which customers will buy new policies

Medicine
* Identifying successful medical therapies for different illnesses

Table 1 Examples of data mining applications.

Data Mining Techniques

There are four main operations associated with data mining techniques: predictive modeling, database segmentation, link analysis, and deviation detection. Although any of the four major operations can be used for implementing any of the business applications listed in Table 2, there are certain recognized associations between the applications and the corresponding operations. For example, direct marketing strategies are normally implemented using the database segmentation operation, and fraud detection could be implemented by any of the four operations. Further, many applications work particularly well when several operations are used. For example, a common approach to customer profiling is to segment the database first and then apply predictive modeling to the resultant data segments.

Operations / Associated data mining techniques

Predictive modeling
* Classification

Database segmentation
* Demographic clustering
* Neural clustering

Link analysis
* Association discovery
* Sequential pattern discovery
* Similar time sequence discovery

Deviation detection
* Statistics
* Visualization

Table 2 Data mining operations and associated techniques.

Predictive Modeling

This approach uses generalizations of the "real world" and the ability to fit new data into a general framework. Predictive modeling can be used to analyze an existing database to determine some essential characteristics (model) about the data set. The model is developed using a supervised learning approach, which has two phases: training and testing. Training builds a model using a large sample of historical data called a training set, and testing involves trying out the model on new, previously unseen data to determine its accuracy and physical performance characteristics. Applications of predictive modeling include customer retention management, credit approval, cross-selling, and direct marketing.

Database Segmentation

The aim of database segmentation is to partition a database into an unknown number of segments, or clusters, of similar records, that is, records that share a number of properties and so are considered to be homogeneous. (Segments have high internal homogeneity and high external heterogeneity.) This approach uses unsupervised learning to discover homogeneous subpopulations in a database to improve the accuracy of the profiles. Applications of database segmentation include customer profiling, direct marketing, and cross-selling.

Link Analysis

Link analysis aims to establish links, called associations, between the individual records, or sets of records, in a database. There are three specializations of link analysis: associations discovery, sequential pattern discovery, and similar time sequence discovery. Associations discovery finds items that imply the presence of other items in the same event. These affinities between items are represented by association rules. For example, "When a customer rents property for more than two years and is more than 25 years old, in 40% of cases, the customer will buy a property. This association happens in 35% of all customers who rent properties."
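The 40% and 35% figures in the example above correspond to what association-rule mining calls confidence and support. The following is a minimal, pure-Python sketch (with made-up customer records) showing how those two measures might be computed for a rule "rents more than two years and older than 25 => buys a property"; the data and field names are hypothetical.

```python
# Each record describes a hypothetical customer who rents property.
customers = [
    {"years_renting": 3, "age": 30, "bought_property": True},
    {"years_renting": 4, "age": 28, "bought_property": False},
    {"years_renting": 1, "age": 40, "bought_property": False},
    {"years_renting": 5, "age": 35, "bought_property": True},
    {"years_renting": 2, "age": 22, "bought_property": False},
]

# Antecedent of the rule: rents for more than two years AND is older than 25.
antecedent = [c for c in customers if c["years_renting"] > 2 and c["age"] > 25]
# Antecedent and consequent together: those customers who also bought a property.
both = [c for c in antecedent if c["bought_property"]]

# Confidence: of the customers matching the antecedent, how many also bought?
confidence = len(both) / len(antecedent)
# Support: how many of all (renting) customers does the whole rule apply to?
support = len(both) / len(customers)

print(f"confidence = {confidence:.0%}, support = {support:.0%}")
```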
Sequential pattern discovery finds patterns between events such that the presence of one set of items is followed by another set of items in a database of events over a period of time. For example, this approach can be used to understand long-term customer buying behavior. Similar time sequence discovery is used, for example, in the discovery of links between two sets of data that are time-dependent, and is based on the degree of similarity between the patterns that both time series demonstrate. For example, within three months of buying property, new home owners will purchase goods such as stoves, refrigerators, and washing machines.

Deviation Detection

Deviation detection is a relatively new technique in terms of commercially available data mining tools. However, it is often a source of true discovery, because it identifies outliers, which express deviation from some previously known expectation and norm. This operation can be performed using statistics and visualization techniques or as a by-product of data mining. For example, linear regression facilitates the identification of outliers in data, and modern visualization techniques display summaries and graphical representations that make deviations easy to detect. Applications of deviation detection include fraud detection in the use of credit cards and insurance claims, quality control, and defects tracing.
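As a simple illustration of statistics-based deviation detection, the sketch below flags values that lie far from the mean of a sample. The two-standard-deviation threshold and the sample charge amounts are arbitrary illustrative choices, not something prescribed by the text; real deviation detection would typically use more robust statistics or model-based residuals.

```python
from statistics import mean, stdev

# Hypothetical daily credit card charges for one customer; the last value is suspicious.
charges = [42.0, 38.5, 45.2, 40.1, 39.8, 44.3, 41.7, 950.0]

mu = mean(charges)
sigma = stdev(charges)

# Flag any charge more than two standard deviations from the mean as a deviation (outlier).
outliers = [x for x in charges if abs(x - mu) > 2 * sigma]
print(outliers)  # expected to flag the 950.0 charge
```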
