Advanced Concepts in Databases, Unit 5, by Arun Pratap Singh
The document discusses data warehouses. It defines a data warehouse as a subject-oriented, integrated, time-variant and non-volatile collection of data that supports management decision making. Data warehouses consolidate data from multiple sources to support analysis and are separate from operational databases which contain current transactional data. The document outlines the key features and applications of data warehouses including financial services, banking, retail and more.
UNIT : V

DESIGN OF DATA WAREHOUSE : The term "Data Warehouse" was first coined by Bill Inmon in 1990. He defined a data warehouse as a subject-oriented, integrated, time-variant and non-volatile collection of data that supports the decision-making process of analysts in an organization. The operational database undergoes day-to-day transactions, which cause frequent changes to the data on a daily basis. If a business executive later wants to analyse previous feedback on some data, such as a product, supplier or consumer, no such data will be available for analysis, because the earlier values have been overwritten by transactions. Data warehouses provide generalized and consolidated data in a multidimensional view. Along with this generalized and consolidated view of data, data warehouses also provide Online Analytical Processing (OLAP) tools. These tools support interactive and effective analysis of data in a multidimensional space. This analysis results in data generalization and data mining. Data mining functions such as association, clustering, classification and prediction can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. That is why the data warehouse has become an important platform for data analysis and online analytical processing.

Understanding Data Warehouse - The data warehouse is a database that is kept separate from the organization's operational database. There is no frequent updating done in a data warehouse. A data warehouse possesses consolidated historical data, which helps the organization analyse its business. A data warehouse helps executives organize, understand and use their data to take strategic decisions. Data warehouse systems help in the integration of a diversity of application systems and allow analysis of consolidated historical data.

Definition : A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data that supports management's decision-making process.

Why a Data Warehouse Is Separated from Operational Databases - The reasons why data warehouses are kept separate from operational databases are as follows: The operational database is constructed for well-known tasks and workloads such as searching particular records, indexing etc., whereas data warehouse queries are often complex and present a general form of data.
Operational databases support the concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure the robustness and consistency of the database. Operational database queries allow read and modify operations, while an OLAP query needs only read-only access to stored data. An operational database maintains current data; a data warehouse, on the other hand, maintains historical data.

Data Warehouse Features - The key features of a data warehouse, namely Subject Oriented, Integrated, Non-volatile and Time-Variant, are discussed below:

Subject Oriented - The data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations. These subjects can be products, customers, suppliers, sales, revenue etc. The data warehouse does not focus on the ongoing operations; rather, it focuses on modelling and analysis of data for decision making.

Integrated - A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files etc. This integration enhances the effective analysis of data.

Time-Variant - The data in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view.

Non-volatile - Non-volatile means that previous data is not removed when new data is added. The data warehouse is kept separate from the operational database, so frequent changes in the operational database are not reflected in the data warehouse.

Note: A data warehouse does not require transaction processing, recovery and concurrency control, because it is physically stored separately from the operational database.

Data Warehouse Applications - As discussed before, a data warehouse helps business executives organize, analyse and use their data for decision making. A data warehouse serves as a core part of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are widely used in the following fields: financial services, banking services, consumer goods, retail sectors, and controlled manufacturing.

Data Warehouse Types - Information processing, analytical processing and data mining are the three types of data warehouse applications, discussed below:
Information Processing - A data warehouse allows us to process the information stored in it. The information can be processed by means of querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs.

Analytical Processing - A data warehouse supports analytical processing of the information stored in it. The data can be analyzed by means of basic OLAP operations, including slice-and-dice, drill-down, drill-up (roll-up), and pivoting.

Data Mining - Data mining supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, and performing classification and prediction. These mining results can be presented using visualization tools.
SN | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | Involves historical processing of information. | Involves day-to-day processing.
2 | Used by knowledge workers such as executives, managers and analysts. | Used by clerks, DBAs, or database professionals.
3 | Used to analyse the business. | Used to run the business.
4 | Focuses on information out. | Focuses on data in.
5 | Based on the Star, Snowflake and Fact Constellation schemas. | Based on the Entity-Relationship model.
6 | Subject oriented. | Application oriented.
7 | Contains historical data. | Contains current data.
8 | Provides summarized and consolidated data. | Provides primitive and highly detailed data.
9 | Provides a summarized and multidimensional view of data. | Provides a detailed and flat relational view of data.
10 | Number of users is in the hundreds. | Number of users is in the thousands.
11 | Number of records accessed is in the millions. | Number of records accessed is in the tens.
12 | Database size is from 100 GB to TB. | Database size is from 100 MB to GB.
13 | Highly flexible. | Provides high performance.
What is Data Warehousing? Data warehousing is the process of constructing and using the data warehouse. The data warehouse is constructed by integrating data from multiple heterogeneous sources. It supports analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration and data consolidation.

Using Data Warehouse Information - There are decision-support technologies available which help utilize the data warehouse. These technologies help executives use the warehouse quickly and effectively. They can gather the data, analyse it, and take decisions based on the information in the warehouse. The information gathered from the warehouse can be used in any of the following domains:

Tuning production strategies - Product strategies can be well tuned by repositioning the products and managing product portfolios by comparing sales quarterly or yearly.

Customer analysis - Customer analysis is done by analyzing the customer's buying preferences, buying time, budget cycles etc.

Operations analysis - Data warehousing also helps in customer relationship management and in making environmental corrections. The information also allows us to analyse business operations.

In computing, a data warehouse (DW, DWH), or an enterprise data warehouse (EDW), is a database used for reporting and data analysis. Integrating data from one or more disparate sources creates a central repository of data, a data warehouse (DW). Data warehouses store current and historical data and are used for creating trending reports for senior management, such as annual and quarterly comparisons. The data stored in the warehouse is uploaded from the operational systems (such as marketing, sales, etc.). The data may pass through an operational data store for additional operations before it is used in the DW for reporting.
Data warehouses support business decisions by collecting, consolidating, and organizing data for reporting and analysis with tools such as online analytical processing (OLAP) and data mining. Although data warehouses are built on relational database technology, the design of a data warehouse database differs substantially from the design of an online transaction processing system (OLTP) database.
Data Warehouses, OLTP, OLAP, and Data Mining
A relational database is designed for a specific purpose. Because the purpose of a data warehouse differs from that of an OLTP, the design characteristics of a relational database that supports a data warehouse differ from the design characteristics of an OLTP database.
A Data Warehouse Supports OLTP- A data warehouse supports an OLTP system by providing a place for the OLTP database to offload data as it accumulates, and by providing services that would complicate and degrade OLTP operations if they were performed in the OLTP database. Without a data warehouse to hold historical information, data is archived to static media such as magnetic tape, or allowed to accumulate in the OLTP database. If data is simply archived for preservation, it is not available or organized for use by analysts and decision makers. If data is allowed to accumulate in the OLTP so it can be used for analysis, the
OLTP database continues to grow in size and requires more indexes to service analytical and report queries. These queries access and process large portions of the continually growing historical data and add a substantial load to the database. The large indexes needed to support these queries also tax the OLTP transactions with additional index maintenance. These queries can also be complicated to develop due to the typically complex OLTP database schema. A data warehouse offloads the historical data from the OLTP, allowing the OLTP to operate at peak transaction efficiency. High-volume analytical and reporting queries are handled by the data warehouse and do not load the OLTP, which does not need additional indexes for their support. As data is moved to the data warehouse, it is also reorganized and consolidated so that analytical queries are simpler and more efficient.
OLAP is a Data Warehouse Tool- Online analytical processing (OLAP) is a technology designed to provide superior performance for ad hoc business intelligence queries. OLAP is designed to operate efficiently with data organized in accordance with the common dimensional model used in data warehouses. A data warehouse provides a multidimensional view of data in an intuitive model designed to match the types of queries posed by analysts and decision makers. OLAP organizes data warehouse data into multidimensional cubes based on this dimensional model, and then preprocesses these cubes to provide maximum performance for queries that summarize data in various ways. For example, a query that requests the total sales income and quantity sold for a range of products in a specific geographical region for a specific time period can typically be answered in a few seconds or less, regardless of how many hundreds of millions of rows of data are stored in the data warehouse database.

Data warehouse database | OLTP database
Designed for analysis of business measures by categories and attributes | Designed for real-time business operations
Optimized for bulk loads and large, complex, unpredictable queries that access many rows per table | Optimized for a common set of transactions, usually adding or retrieving a single row at a time per table
Loaded with consistent, valid data; requires no real-time validation | Optimized for validation of incoming data during transactions; uses validation data tables
Supports few concurrent users relative to OLTP | Supports thousands of concurrent users
OLAP is not designed to store large volumes of text or binary data, nor is it designed to support high-volume update transactions. The inherent stability and consistency of historical data in a data warehouse enables OLAP to provide its remarkable performance in rapidly summarizing information for analytical queries. In SQL Server 2000, Analysis Services provides tools for developing OLAP applications and a server specifically designed to service OLAP queries.
Data Warehouse Tools and Utilities Functions - The functions of data warehouse tools and utilities are as follows:

Data Extraction - Data extraction involves gathering data from multiple heterogeneous sources.

Data Cleaning - Data cleaning involves finding and correcting the errors in data.
Data Transformation - Data transformation involves converting data from the legacy format to the warehouse format.

Data Loading - Data loading involves sorting, summarizing, consolidating, checking integrity, and building indices and partitions.

Refreshing - Refreshing involves updating from the data sources to the warehouse.

Note: Data cleaning and data transformation are important steps in improving the quality of data and of data mining results.
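The interplay of these tool functions can be pictured with a small Python/pandas sketch. It is only an illustration: the source files (legacy_orders.csv, crm_customers.json), the column names, and the cleaning rules are hypothetical and not part of these notes.

```python
import pandas as pd

# Data Extraction: gather data from multiple heterogeneous sources (hypothetical files)
orders = pd.read_csv("legacy_orders.csv")        # flat-file source (assumed)
customers = pd.read_json("crm_customers.json")   # CRM export (assumed)

# Data Cleaning: find and correct errors in the extracted data
orders = orders.drop_duplicates()
orders["amount"] = pd.to_numeric(orders["amount"], errors="coerce")  # bad values become NaN
orders = orders.dropna(subset=["amount", "customer_id"])             # discard unusable rows

# Data Transformation: convert the legacy format to the warehouse format
orders["order_date"] = pd.to_datetime(orders["order_date"])
orders["month"] = orders["order_date"].dt.to_period("M").astype(str)

# Data Loading: sort, summarize and consolidate into a warehouse table
fact_sales = (orders.merge(customers, on="customer_id")
                    .groupby(["month", "region"], as_index=False)["amount"].sum()
                    .sort_values(["month", "region"]))

# Refreshing: periodically append newly arrived source rows to the warehouse table
def refresh(warehouse: pd.DataFrame, new_rows: pd.DataFrame) -> pd.DataFrame:
    return pd.concat([warehouse, new_rows], ignore_index=True)
```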
Data Warehouse : A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data that supports management's decision-making process. Let us explore this definition of a data warehouse.

Subject Oriented - The data warehouse is subject oriented because it provides information around a subject rather than the organization's ongoing operations. These subjects can be products, customers, suppliers, sales, revenue etc. The data warehouse does not focus on the ongoing operations; rather, it focuses on modelling and analysis of data for decision making.

Integrated - A data warehouse is constructed by integrating data from heterogeneous sources such as relational databases, flat files etc. This integration enhances the effective analysis of data.

Time-Variant - The data in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view.

Non-volatile - Non-volatile means that previous data is not removed when new data is added. The data warehouse is kept separate from the operational database, so frequent changes in the operational database are not reflected in the data warehouse.

Metadata - Metadata is simply defined as data about data. Data that is used to describe other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, metadata is the summarized data that leads us to the detailed data. In terms of the data warehouse, we can define metadata as follows: Metadata is a road map to the data warehouse. Metadata in a data warehouse defines the warehouse objects. Metadata acts as a directory; this directory helps the decision support system locate the contents of the data warehouse.

Metadata Repository : The metadata repository is an integral part of a data warehouse system. The metadata repository contains the following metadata:

Business metadata - This metadata contains data ownership information, business definitions and changing policies.

Operational metadata - This metadata includes the currency of data and data lineage. Currency of data means whether the data is active, archived or purged. Lineage of data means the history of the data migrated and the transformations applied to it.
Data for mapping from the operational environment to the data warehouse - This metadata includes the source databases and their contents, data extraction, data partitioning, cleaning, transformation rules, and data refresh and purging rules.

The algorithms for summarization - This includes dimension algorithms, data on granularity, aggregation, summarizing etc.

Data cube : A data cube helps us represent data in multiple dimensions. The data cube is defined by dimensions and facts. The dimensions are the entities with respect to which an enterprise keeps its records.

Illustration of a data cube - Suppose a company wants to keep track of sales records with the help of a sales data warehouse with respect to time, item, branch and location. These dimensions allow it to keep track of monthly sales and of the branch at which the items were sold. There is a table associated with each dimension, known as a dimension table. The dimension table further describes the dimension. For example, the "item" dimension table may have attributes such as item_name, item_type and item_brand. The following table represents a 2-D view of the sales data for a company with respect to the time, item and location dimensions.
Here in this 2-D table we have records with respect to time and item only. The sales for New Delhi are shown with respect to the time and item dimensions, according to the type of item sold. If we want to view the sales data with one more dimension, say the location dimension, the 3-D view of the sales data with respect to time, item, and location is shown in the table below:
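Since the 2-D and 3-D tables referred to here appear as figures in the original notes, the following pandas sketch reproduces the same idea with invented sales figures; only the dimension names (time, item, location) and example members such as New Delhi, Mobile and Modem come from the text.

```python
import pandas as pd

# Hypothetical sales records along the time, item and location dimensions
sales = pd.DataFrame({
    "time":       ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":       ["Mobile", "Modem", "Mobile", "Modem", "Mobile", "Modem"],
    "location":   ["New Delhi", "New Delhi", "New Delhi", "New Delhi", "Mumbai", "Mumbai"],
    "units_sold": [605, 825, 680, 952, 300, 410],
})

# 2-D view: sales for one location (New Delhi) with respect to time and item only
view_2d = (sales[sales["location"] == "New Delhi"]
           .pivot_table(index="time", columns="item", values="units_sold", aggfunc="sum"))

# 3-D view: add the location dimension; each location contributes one 2-D slice of the cube
cube_3d = sales.pivot_table(index="time", columns=["location", "item"],
                            values="units_sold", aggfunc="sum")

print(view_2d)
print(cube_3d)
```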
The above 3-D table can be represented as a 3-D data cube, as shown in the following figure:
DATA MART : A data mart contains a subset of organization-wide data. This subset of data is valuable to a specific group in an organization. In other words, a data mart contains only the data that is specific to a particular group. For example, the marketing data mart may contain only data related to items, customers and sales. Data marts are confined to subjects.
Points to remember about data marts: Windows-based or Unix/Linux-based servers are used to implement data marts; they are implemented on low-cost servers. The implementation cycle of a data mart is measured in short periods of time, i.e. in weeks rather than months or years. The life cycle of a data mart may become complex in the long run if its planning and design are not organization-wide. Data marts are small in size. Data marts are customized by department. The source of a data mart is a departmentally structured data warehouse. Data marts are flexible.

Graphical representation of a data mart:
A data mart is the access layer of the data warehouse environment that is used to get data out to the users. The data mart is a subset of the data warehouse that is usually oriented to a specific business line or team. Data marts are small slices of the data warehouse. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department. In some deployments, each department or business unit is considered the owner of its data mart including all the hardware, software and data. [1] This enables each department to use, manipulate and develop their data any way they see fit; without altering information inside other data marts or the data warehouse. In other deployments where conformed dimensions are used, this business unit ownership will not hold true for shared dimensions like customer, product, etc.
The reasons why organizations build data warehouses and data marts are that the information in the database is not organized in a way that makes it easy to find what is needed, and that complicated queries might take a long time to answer, since the database systems are designed to process millions of transactions per day. Transactional databases are designed to be updated; data warehouses and marts, however, are read-only. Data warehouses are designed to access large groups of related records. Data marts improve end-user response time by allowing users to access the specific type of data they need to view most often, providing the data in a way that supports the collective view of a group of users. A data mart is basically a condensed and more focused version of a data warehouse that reflects the regulations and process specifications of each business unit within an organization. Each data mart is dedicated to a specific business function or region. This subset of data may span many or all of an enterprise's functional subject areas. It is common for multiple data marts to be used in order to serve the needs of each individual business unit (different data marts can be used to obtain specific information for various enterprise departments, such as accounting, marketing, sales, etc.).
Reasons for creating a data mart :
Easy access to frequently needed data
Creates a collective view for a group of users
Improves end-user response time
Ease of creation
Lower cost than implementing a full data warehouse
Potential users are more clearly defined than in a full data warehouse
Contains only business-essential data and is less cluttered
DEPENDENT DATA MART : According to the Inmon school of data warehousing, a dependent data mart is a logical subset (view) or a physical subset (extract) of a larger data warehouse, isolated for one of the following reasons:
A need for a special data model or schema: e.g., to restructure for OLAP.
Performance: to offload the data mart to a separate computer for greater efficiency, or to obviate the need to manage that workload on the centralized data warehouse.
Security: to separate an authorized data subset selectively.
Expediency: to bypass the data governance and authorizations required to incorporate a new application on the enterprise data warehouse.
Proving ground: to demonstrate the viability and ROI (return on investment) potential of an application prior to migrating it to the enterprise data warehouse.
Politics: a coping strategy for IT (information technology) in situations where a user group has more influence than funding or is not a good citizen on the centralized data warehouse.
Politics: a coping strategy for consumers of data in situations where a data warehouse team is unable to create a usable data warehouse.
According to the Inmon school of data warehousing, tradeoffs inherent with data marts include limited scalability, duplication of data, data inconsistency with other silos of information, and inability to leverage enterprise sources of data. The alternative school of data warehousing is that of Ralph Kimball. In his view, a data warehouse is nothing more than the union of all the data marts. This view helps to reduce costs and provides fast development, but can create an inconsistent data warehouse, especially in large organizations. Therefore, Kimball's approach is more suitable for small-to-medium corporations.
Virtual Warehouse : The view over an operational database is known as a virtual warehouse. It is easy to build a virtual warehouse, but building one requires excess capacity on the operational database servers.
PROCESS FLOW IN DATA WAREHOUSE : There are four major processes that build a data warehouse:
Extract and load the data.
Clean and transform the data.
Back up and archive the data.
Manage queries and direct them to the appropriate data sources.
Extract and Load Process - Data extraction takes data from the source systems. Data load takes the extracted data and loads it into the data warehouse. Note: Before loading the data into the data warehouse, the information extracted from the external sources must be reconstructed. Points to remember during the extract and load process:
Controlling the process
When to initiate the extract
Loading the data

CONTROLLING THE PROCESS - Controlling the process involves determining when to start data extraction and running consistency checks on the data. The controlling process ensures that the tools, logic modules and programs are executed in the correct sequence and at the correct time.

WHEN TO INITIATE EXTRACT - Data needs to be in a consistent state when it is extracted, i.e. the data warehouse should represent a single, consistent version of the information to the user. For example, in a customer-profiling data warehouse in the telecommunication sector, it is illogical to merge the list of customers at 8 pm on Wednesday from a customer database with the customer subscription events up to 8 pm on Tuesday. This would mean that we are finding customers for whom there are no associated subscriptions. (A small sketch of such a check appears after this subsection.)

LOADING THE DATA - After extracting the data, it is loaded into a temporary data store, where it is cleaned up and made consistent. Note: Consistency checks are executed only when all data sources have been loaded into the temporary data store.

Clean and Transform Process - Once data is extracted and loaded into the temporary data store, it is time to perform cleaning and transformation. The steps involved in cleaning and transforming are: clean and transform the loaded data into a structure; partition the data; aggregation.

CLEAN AND TRANSFORM THE LOADED DATA INTO A STRUCTURE - This will speed up the queries. It can be done in the following ways: Make sure data is consistent within itself. Make sure data is consistent with other data within the same data source. Make sure data is consistent with data in other source systems. Make sure data is consistent with data already in the warehouse.
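A minimal sketch of the consistency check mentioned under "when to initiate extract" is given below; the source names and timestamps are invented to mirror the Wednesday/Tuesday example, and a real warehouse would use the extraction metadata of its own sources.

```python
from datetime import datetime

# Hypothetical extraction cut-off times reported by each source system
extract_times = {
    "customer_db":         datetime(2024, 1, 10, 20, 0),  # Wednesday, 8 pm
    "subscription_events": datetime(2024, 1, 9, 20, 0),   # Tuesday, 8 pm
}

def consistent(times: dict) -> bool:
    """All sources must be extracted up to the same point in time before loading."""
    cutoff = min(times.values())
    return all(t == cutoff for t in times.values())

if not consistent(extract_times):
    # Postpone the load (or re-extract) instead of merging Wednesday's customers
    # with subscription events that stop a day earlier.
    print("Sources are not at a consistent point in time; postponing the load.")
```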
Transforming involves converting the source data into a structure. Structuring the data increases query performance and decreases operational cost. Information in the data warehouse must be transformed to support the performance requirements of the business as well as the ongoing operational cost.

PARTITION THE DATA - Partitioning optimizes hardware performance and simplifies the management of the data warehouse. Here we partition each fact table into multiple separate partitions.

AGGREGATION - Aggregation is required to speed up common queries. Aggregation relies on the fact that most common queries will analyse a subset or an aggregation of the detailed data. (A small sketch of partitioning and aggregation appears after this subsection.)

Backup and Archive the Data - In order to recover the data in the event of data loss, software failure or hardware failure, the data needs to be backed up on a regular basis. Archiving involves removing the old data from the system in a format that allows it to be quickly restored whenever required. For example, in a retail sales analysis data warehouse, it may be required to keep data for 3 years, with the latest 6 months of data kept online. In this kind of scenario there is often a requirement to be able to do month-on-month comparisons for this year and last year. In this case some data needs to be restored from the archive.

Query Management Process - This process performs the following functions: it manages the queries; it speeds up query execution; it directs the queries to the most effective data sources; it ensures that all system sources are used in the most effective way; it monitors actual query profiles. Information from this process is used by the warehouse management process to determine which aggregations to generate. This process does not generally operate during the regular load of information into the data warehouse.
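The partition and aggregation steps can be pictured with the short pandas sketch below; the fact table, its columns and the chosen partitioning key (month) are hypothetical.

```python
import pandas as pd

# Hypothetical fact table in the temporary data store
fact = pd.DataFrame({
    "month": ["2024-01", "2024-01", "2024-02", "2024-02"],
    "item":  ["Mobile", "Modem", "Mobile", "Modem"],
    "units": [120, 80, 150, 95],
})

# PARTITION THE DATA: split the fact table into separate per-month partitions
partitions = {month: rows for month, rows in fact.groupby("month")}

# AGGREGATION: precompute the summaries that common queries ask for,
# so that they need not scan the detailed rows every time
monthly_totals = fact.groupby("month", as_index=False)["units"].sum()
item_totals = fact.groupby("item", as_index=False)["units"].sum()
```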
THREE-TIER DATA WAREHOUSE ARCHITECTURE : Generally, data warehouses adopt a three-tier architecture. The three tiers of the data warehouse architecture are as follows:

Bottom Tier - The bottom tier of the architecture is the data warehouse database server. It is the relational database system. We use back-end tools and utilities to feed data into the bottom tier; these back-end tools and utilities perform the extract, clean, load and refresh functions.
Middle Tier - In the middle tier we have the OLAP server. The OLAP server can be implemented in either of the following ways:
o By relational OLAP (ROLAP), which is an extended relational database management system. ROLAP maps operations on multidimensional data to standard relational operations.
o By the multidimensional OLAP (MOLAP) model, which directly implements multidimensional data and operations.

Top Tier - This tier is the front-end client layer. This layer holds the query tools, reporting tools, analysis tools and data mining tools. The following diagram explains the three-tier architecture of the data warehouse:
OLAP : Introduction - An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It allows managers and analysts to gain insight into the information through fast, consistent, interactive access to information. In this chapter we will discuss the types of OLAP, the operations on OLAP, and the difference between OLAP and statistical databases and OLTP.

Types of OLAP Servers - We have four types of OLAP servers, listed below:
Relational OLAP (ROLAP)
Multidimensional OLAP (MOLAP)
Hybrid OLAP (HOLAP)
Specialized SQL servers
Relational OLAP (ROLAP) - Relational OLAP servers are placed between the relational back-end server and the client front-end tools. To store and manage warehouse data, relational OLAP uses a relational or extended-relational DBMS. ROLAP includes the following: implementation of aggregation navigation logic; optimization for each DBMS back end; additional tools and services.

Multidimensional OLAP (MOLAP) - Multidimensional OLAP (MOLAP) uses array-based multidimensional storage engines for multidimensional views of data. With multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore, many MOLAP servers use two levels of data storage representation to handle dense and sparse data sets (a small illustrative sketch of this idea follows below).

Hybrid OLAP (HOLAP) - The hybrid OLAP technique is a combination of both ROLAP and MOLAP. It offers both the higher scalability of ROLAP and the faster computation of MOLAP. A HOLAP server allows storing large volumes of detailed data; the aggregations are stored separately in a MOLAP store.
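One way to picture the "two levels of data storage representation" mentioned for MOLAP is the contrast below between a dense array store and a sparse store that keeps only the filled cells; this is purely an illustration with invented values, not how any particular MOLAP server is implemented.

```python
import numpy as np

# Dense sub-cube: almost every (time, item) cell holds a value, so a plain array is compact
dense_subcube = np.array([[605, 825],
                          [680, 952]])   # rows: Q1, Q2; columns: Mobile, Modem (invented values)

# Sparse sub-cube: most (location, item) cells are empty, so store only the occupied ones
sparse_subcube = {("Toronto", "Modem"): 14, ("Vancouver", "Mobile"): 5}

def lookup(location: str, item: str) -> int:
    # Combinations with no recorded sales simply return zero
    return sparse_subcube.get((location, item), 0)
```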
Specialized SQL Servers - Specialized SQL servers provide advanced query language and query processing support for SQL queries over star and snowflake schemas in a read-only environment.

OLAP Operations - Since the OLAP server is based on a multidimensional view of data, we will discuss the OLAP operations on multidimensional data. Here is the list of OLAP operations:
Roll-up
Drill-down
Slice and dice
Pivot (rotate)
ROLL-UP - This operation performs aggregation on a data cube in either of the following ways: by climbing up a concept hierarchy for a dimension, or by dimension reduction. Consider the following diagram showing the roll-up operation.
The roll-up operation shown is performed by climbing up the concept hierarchy for the dimension location. Initially the concept hierarchy was "street < city < province < country". On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country; the data is grouped into countries rather than cities. When roll-up is performed by dimension reduction, one or more dimensions are removed from the data cube.

DRILL-DOWN - The drill-down operation is the reverse of roll-up. It is performed in either of the following ways:
By stepping down a concept hierarchy for a dimension.
By introducing a new dimension.
Consider the following diagram showing the drill-down operation:
The drill-down operation shown is performed by stepping down a concept hierarchy for the dimension time. Initially the concept hierarchy was "day < month < quarter < year". On drilling down, the time dimension is descended from the level of quarter to the level of month. When drill-down is performed by introducing a new dimension, one or more dimensions are added to the data cube. Drill-down navigates from less detailed data to highly detailed data.
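These two operations can be imitated in pandas with group-by aggregations over the concept hierarchies described above (city < country for roll-up, quarter down to month for drill-down); the sales figures below are invented.

```python
import pandas as pd

sales = pd.DataFrame({
    "city":    ["Toronto", "Vancouver", "New Delhi", "Mumbai"],
    "country": ["Canada",  "Canada",    "India",     "India"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "month":   ["Jan", "Feb", "Mar", "Apr"],
    "units":   [605, 825, 680, 952],
})

# ROLL-UP on location: climb the hierarchy city < country, so city rows are aggregated into countries
rollup = sales.groupby("country", as_index=False)["units"].sum()

# DRILL-DOWN on time: step down the hierarchy from quarter to month for more detailed data
drilldown = sales.groupby(["quarter", "month"], as_index=False)["units"].sum()
```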
SLICE - The slice operation performs a selection on one dimension of a given cube, giving us a new sub-cube. Consider the following diagram showing the slice operation.
Here the slice operation is performed for the dimension time using the criterion time = "Q1". It forms a new sub-cube by selecting on that single dimension.

DICE - The dice operation performs a selection on two or more dimensions of a given cube, giving us a new sub-cube. Consider the following diagram showing the dice operation:
The dice operation shown selects a sub-cube based on the following selection criteria, which involve three dimensions: (location = "Toronto" or "Vancouver"), (time = "Q1" or "Q2"), and (item = "Mobile" or "Modem").

PIVOT - The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative presentation of the data. Consider the following diagram showing the pivot operation.
In this pivot example, the item and location axes of the 2-D slice are rotated.
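The slice, dice and pivot operations can likewise be imitated on a small pandas cube; the selection criteria below mirror the ones quoted above (time = "Q1"; location Toronto or Vancouver; item Mobile or Modem), while the numeric values are invented.

```python
import pandas as pd

cube = pd.DataFrame({
    "time":     ["Q1", "Q1", "Q2", "Q2"],
    "location": ["Toronto", "Vancouver", "Toronto", "Vancouver"],
    "item":     ["Mobile", "Modem", "Mobile", "Modem"],
    "units":    [605, 825, 680, 952],
})

# SLICE: select on a single dimension (time = "Q1") to obtain a sub-cube
slice_q1 = cube[cube["time"] == "Q1"]

# DICE: select on two or more dimensions at once
dice = cube[cube["location"].isin(["Toronto", "Vancouver"])
            & cube["time"].isin(["Q1", "Q2"])
            & cube["item"].isin(["Mobile", "Modem"])]

# PIVOT (rotate): swap the item and location axes of the 2-D slice
view = slice_q1.pivot_table(index="item", columns="location", values="units", aggfunc="sum")
pivoted = view.T   # the same data, presented with the axes rotated
```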
OLAP vs OLTP

SN | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | Involves historical processing of information. | Involves day-to-day processing.
2 | Used by knowledge workers such as executives, managers and analysts. | Used by clerks, DBAs, or database professionals.
3 | Used to analyse the business. | Used to run the business.
4 | Focuses on information out. | Focuses on data in.
5 | Based on the Star, Snowflake and Fact Constellation schemas. | Based on the Entity-Relationship model.
6 | Subject oriented. | Application oriented.
7 | Contains historical data. | Contains current data.
8 | Provides summarized and consolidated data. | Provides primitive and highly detailed data.
9 | Provides a summarized and multidimensional view of data. | Provides a detailed and flat relational view of data.
10 | Number of users is in the hundreds. | Number of users is in the thousands.
11 | Number of records accessed is in the millions. | Number of records accessed is in the tens.
12 | Database size is from 100 GB to TB. | Database size is from 100 MB to GB.
13 | Highly flexible. | Provides high performance.
CONCEPTUAL MODELING OF DATA WAREHOUSES : Dimensional modeling is a technique for conceptualizing and visualizing data models as a set of measures that are described by common aspects of the business. Dimensional modeling has two basic concepts:

Facts: A fact is a collection of related data items, consisting of measures. A fact is a focus of interest for the decision-making process. Measures are continuously valued attributes that describe facts.
A fact is a business measure.

Dimension: A dimension is the parameter over which we want to perform the analysis of facts, the parameter that gives meaning to a measure. For example, the number of customers is a fact, and we may perform its analysis over the time dimension.

Dimensional modeling has also emerged as the only coherent architecture for building distributed data warehouse systems. If we come up with more complex questions for our warehouse that involve three or more dimensions, this is where the multidimensional database plays a significant role in analysis. Dimensions are categories by which summarized data can be viewed. Cubes are data processing units composed of fact tables and dimensions from the data warehouse.

Multi-Dimensional Modeling - Multidimensional database technology has come a long way since its inception more than 30 years ago. It has recently begun to reach the mass market, with major vendors now delivering multidimensional engines along with their relational database offerings, often at no extra cost. Multidimensional technology has also made significant gains in scalability and maturity. The multidimensional data model emerged for use when the objective is to analyze rather than to perform online transactions. The multidimensional model is based on three key concepts: modeling business rules; cubes and measures; dimensions.

Multidimensional database technology is a key factor in the interactive analysis of large amounts of data for decision-making purposes. A multidimensional data model can also be introduced on top of relational elements: dimensions are modeled as dimension relations and queried with languages similar to structured query language, although such languages cannot treat all dimensions and measures symmetrically. The definition of a multidimensional schema describes multiple levels along a dimension, and there is at least one key attribute in each level that is included in the keys of the star schema in relational database systems. Multidimensional databases enable end users to model data in a multidimensional environment. This is a real product strength, as it provides the fastest, most flexible method to process multidimensional requests.
The principal characteristic of a dimensional model is a set of detailed business facts surrounded by multiple dimensions that describe those facts. When realized in a database, the schema for a dimensional model contains a central fact table and multiple dimension tables. A dimensional model may produce a star schema or a snowflake schema. The schema is a logical description of the entire database. The schema includes the name and description of records of all record types, including all associated data items and aggregates. Like a database, the data warehouse also requires a schema. A database uses the relational model, while the data warehouse uses the star, snowflake and fact constellation schemas. In this chapter we will discuss the schemas used in a data warehouse.

STAR SCHEMA : In a star schema, each dimension is represented with only one dimension table. This dimension table contains the set of attributes. In the following diagram we show the sales data of a company with respect to the four dimensions, namely time, item, branch and location.
There is a fact table at the centre. This fact table contains the keys to each of the four dimensions. The fact table also contains the attributes (measures) dollars_sold and units_sold. Note: Each dimension has only one dimension table, and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia, so the entries for such cities may cause data redundancy along the attributes province_or_state and country.
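A star schema of this kind can be sketched with pandas DataFrames standing in for the tables; the attribute names follow the text above, while the row values and the join query are invented for illustration.

```python
import pandas as pd

# Dimension tables (one table per dimension)
location_dim = pd.DataFrame({
    "location_key": [1, 2],
    "street": ["Main St", "Douglas St"],
    "city": ["Vancouver", "Victoria"],
    "province_or_state": ["British Columbia", "British Columbia"],
    "country": ["Canada", "Canada"],
})
item_dim = pd.DataFrame({"item_key": [10, 11], "item_name": ["Mobile", "Modem"],
                         "item_type": ["phone", "network"], "item_brand": ["A", "B"]})
time_dim = pd.DataFrame({"time_key": [100, 101], "quarter": ["Q1", "Q2"]})
branch_dim = pd.DataFrame({"branch_key": [7], "branch_name": ["B1"]})

# Fact table at the centre: keys to the four dimensions plus the measures dollars_sold and units_sold
sales_fact = pd.DataFrame({
    "time_key": [100, 101], "item_key": [10, 11], "branch_key": [7, 7], "location_key": [1, 2],
    "dollars_sold": [2500.0, 3100.0], "units_sold": [25, 31],
})

# A typical star-schema query joins the fact table to its dimension tables and sums the measures
report = (sales_fact
          .merge(location_dim, on="location_key")
          .merge(item_dim, on="item_key")
          .groupby(["country", "item_name"], as_index=False)[["dollars_sold", "units_sold"]].sum())
```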
What is a star schema? The star schema architecture is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a centre. The centre of the star consists of the fact table, and the points of the star are the dimension tables. Usually the fact tables in a star schema are in third normal form (3NF), whereas dimension tables are de-normalized. Despite being the simplest architecture, the star schema is the most commonly used nowadays and is recommended by Oracle.
Fact Tables
A fact table typically has two types of columns: foreign keys to dimension tables, and measures, which contain numeric facts. A fact table can contain fact data at the detail level or at an aggregated level.
Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that categorizes data. If
a dimension does not have hierarchies and levels, it is called a flat dimension or a list. The primary keys of each of the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help describe the dimensional value. They are normally descriptive, textual values. Dimension tables are generally much smaller in size than fact tables.
Typical fact tables store data about sales, while dimension tables store data about geographic regions (markets, cities), clients, products, times, and channels.
The main characteristics of a star schema:
-> easy-to-understand schema
-> small number of tables to join
-> de-normalization; redundant data means the tables can become large

SNOWFLAKE SCHEMA : In a snowflake schema, some dimension tables are normalized. The normalization splits the data up into additional tables. Unlike the star schema, the dimension tables in a snowflake schema are normalized; for example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.
The item dimension table therefore now contains the attributes item_key, item_name, type, brand, and supplier_key.
The supplier key is linked to the supplier dimension table. The supplier dimension table contains the attributes supplier_key and supplier_type. Note: Due to normalization, redundancy in the snowflake schema is reduced; it therefore becomes easier to maintain and saves storage space.

FACT CONSTELLATION SCHEMA :
What is a fact constellation schema? For each star schema it is possible to construct a fact constellation schema (for example by splitting the original star schema into several star schemas, each of which describes facts at another level of the dimension hierarchies). The fact constellation architecture contains multiple fact tables that share many dimension tables.
The main shortcoming of the fact constellation schema is its more complicated design, because many variants for particular kinds of aggregation must be considered and selected. Moreover, the dimension tables are still large. In a fact constellation there are multiple fact tables. This schema is also known as a galaxy schema. In the following diagram we have two fact tables, namely sales and shipping.
The sales fact table is the same as that in the star schema.
The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location and to_location. The shipping fact table also contains two measures, namely dollars sold and units sold. It is also possible for dimension tables to be shared between fact tables. For example, the time, item and location dimension tables are shared between the sales and shipping fact tables.
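The sharing of dimension tables between the sales and shipping fact tables can be pictured as follows; the key names follow the text above, and the values are invented.

```python
import pandas as pd

# Dimension tables shared by both fact tables
time_dim = pd.DataFrame({"time_key": [1, 2], "quarter": ["Q1", "Q2"]})
item_dim = pd.DataFrame({"item_key": [10], "item_name": ["Mobile"]})
location_dim = pd.DataFrame({"location_key": [5], "city": ["Toronto"]})

# Sales fact table (as in the star schema)
sales_fact = pd.DataFrame({"time_key": [1], "item_key": [10], "location_key": [5],
                           "dollars_sold": [2500.0], "units_sold": [25]})

# Shipping fact table with its own dimensions (shipper, from/to location) plus the shared ones
shipping_fact = pd.DataFrame({"time_key": [2], "item_key": [10], "shipper_key": [99],
                              "from_location": [5], "to_location": [5],
                              "dollars_sold": [400.0], "units_sold": [4]})

# Because the dimensions are shared, both fact tables can be analysed against the same hierarchy
sales_by_quarter = (sales_fact.merge(time_dim, on="time_key")
                              .groupby("quarter")["units_sold"].sum())
shipping_by_quarter = (shipping_fact.merge(time_dim, on="time_key")
                                    .groupby("quarter")["units_sold"].sum())
```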
DATA MINING : Data mining is defined as extracting information from huge sets of data. In other words, data mining is mining knowledge from data. Knowledge Discovery and Data Mining (KDD) is an interdisciplinary area focusing upon methodologies for extracting useful knowledge from data. The ongoing rapid growth of online data due to the Internet and the widespread use of databases have created an immense need for KDD methodologies. The challenge of extracting knowledge from data draws upon research in statistics, databases, pattern recognition, machine learning, data visualization, optimization, and high-performance computing to deliver advanced business intelligence and web discovery solutions.

Introduction - There is a huge amount of data available in the information industry. This data is of no use until it is converted into useful information. Analysing this huge amount of data and extracting useful information from it is necessary. The extraction of information is not the only process we need to perform; it also involves other processes such as data cleaning, data integration, data transformation, data mining, pattern evaluation and data presentation. Once all these processes are over, we are in a position to use this information in many applications such as fraud detection, market analysis, production control, science exploration etc.

What is Data Mining - Data mining is defined as extracting information from huge sets of data. In other words, data mining is mining knowledge from data. This information can be used for any of the following applications:
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration

Need of Data Mining - The reasons are listed below: In the field of information technology we have huge amounts of data available that need to be turned into useful information. This information can further be used for various applications such as market analysis, fraud detection, customer retention, production control, science exploration etc.

Data Mining Applications - Here is the list of applications of data mining:
Market Analysis and Management
Corporate Analysis & Risk Management
Fraud Detection

Data mining deals with what kind of patterns can be mined. On the basis of the kind of data to be mined, there are two kinds of functions involved in data mining, listed below:
Descriptive
Classification and Prediction
Classification Criteria:
Classification according to the kind of databases mined
Classification according to the kind of knowledge mined
Classification according to the kinds of techniques utilized
Classification according to the applications adapted

CLASSIFICATION ACCORDING TO KIND OF DATABASES MINED - We can classify a data mining system according to the kind of databases mined. Database systems can be classified according to different criteria such as data models, types of data etc., and the data mining system can be classified accordingly. For example, if we classify the database according to the data model, then we may have a relational, transactional, object-relational, or data warehouse mining system.
CLASSIFICATION ACCORDING TO KIND OF KNOWLEDGE MINED - We can classify a data mining system according to the kind of knowledge mined. This means data mining systems are classified on the basis of functionalities such as:
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis

CLASSIFICATION ACCORDING TO KINDS OF TECHNIQUES UTILIZED - We can classify a data mining system according to the kind of techniques used. We can describe these techniques according to the degree of user interaction involved or the methods of analysis employed.

CLASSIFICATION ACCORDING TO APPLICATIONS ADAPTED - We can classify a data mining system according to the applications adapted. These applications are as follows:
Finance
Telecommunications
DNA
Stock Markets
E-mail
DATA MINING FUNCTIONALITIES :
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis
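As a concrete taste of one of these functionalities, the sketch below performs a very small piece of association analysis by counting item pairs that occur together in market-basket transactions; the transactions and the support threshold are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]

min_support = 2  # an item pair must appear in at least this many transactions

# Count how often each pair of items occurs together (a simple association analysis)
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_support}
print(frequent_pairs)  # here every pair (bread/milk, bread/butter, butter/milk) appears twice
```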
DATA MINING SYSTEM CATEGORIZATION AND ITS ISSUES : Introduction - There is a large variety of data mining systems available. A data mining system may integrate techniques from the following:
Spatial Data Analysis
Information Retrieval
Pattern Recognition
Image Analysis
Signal Processing
Computer Graphics
Web Technology
Business
Bioinformatics

Data Mining System Classification - A data mining system can be classified according to the following criteria:
Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines
Data mining is an interdisciplinary field, the confluence of a set of disciplines , including database systems, statistics, machine learning, visualization, and information science. Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high performance computing. Depending on the kinds of data to be mined or on the given data mining application, the data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, or psychology. Because of the diversity of disciplines contributing to data mining, data mining research is expected to generate a large variety of data mining systems. Therefore, it is necessary to provide a clear classification of data mining systems. Such a classification may help potential users distinguish data mining systems and identify those that best match their needs. Data mining systems can be categorized according to various criteria, as follows. Classification according to kind of databases mined Classification according to kind of knowledge mined Classification according to kinds of techniques utilized Classification according to applications adapted
CLASSIFICATION ACCORDING TO KIND OF DATABASES MINED : We can classify a data mining system according to the kind of databases mined. Database systems can be classified according to different criteria such as data models, types of data etc., and the data mining system can be classified accordingly. For example, if we classify the database according to the data model, then we may have a relational, transactional, object-relational, or data warehouse mining system. A data mining system can be classified according to the kinds of databases mined. Database systems themselves can be classified according to different criteria (such as data models, or the types of data or applications involved), each of which may require its own data mining technique. Data mining systems can therefore be classified accordingly. For instance, if classifying according to data models, we may have a relational, transactional, object-oriented, object-relational, or data warehouse mining system. If classifying according to the special types of data handled, we may have a spatial, time-series, text, or multimedia data mining system, or a World Wide Web mining system. Other system types include heterogeneous data mining systems and legacy data mining systems.
CLASSIFICATION ACCORDING TO KIND OF KNOWLEDGE MINED : We can classify a data mining system according to the kind of knowledge mined. This means data mining systems are classified on the basis of functionalities such as: Characterization, Discrimination, Association and Correlation Analysis, Classification, Prediction, Clustering, Outlier Analysis, and Evolution Analysis. Data mining systems can be categorized according to the kinds of knowledge they mine, i.e., based on data mining functionalities, such as characterization, discrimination, association, classification, clustering, trend and evolution analysis, deviation analysis, similarity analysis, etc. A comprehensive data mining system usually provides multiple and/or integrated data mining functionalities.
Moreover, data mining systems can also be distinguished based on the granularity or levels of abstraction of the knowledge mined, including generalized knowledge (at a high level of abstraction), primitive-level knowledge (at a raw data level), or knowledge at multiple levels (considering several levels of abstraction). An advanced data mining system should facilitate the discovery of knowledge at multiple levels of abstraction.

CLASSIFICATION ACCORDING TO KINDS OF TECHNIQUES UTILIZED : We can classify a data mining system according to the kind of techniques used. We can describe these techniques according to the degree of user interaction involved or the methods of analysis employed. Data mining systems can also be categorized according to the underlying data mining techniques employed. These techniques can be described according to the degree of user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven systems), or the methods of data analysis employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on). A sophisticated data mining system will often adopt multiple data mining techniques or work out an effective, integrated technique which combines the merits of a few individual approaches.

CLASSIFICATION ACCORDING TO APPLICATIONS ADAPTED : We can classify a data mining system according to the applications adapted. These applications are as follows: Finance, Telecommunications, DNA, Stock Markets, E-mail.
ISSUES IN DATA MINING : Introduction - Data mining is not that easy. The algorithms used are very complex, and the data is not available in one place; it needs to be integrated from various heterogeneous data sources. These factors also create some issues. Here we will discuss the major issues regarding:
Mining Methodology and User Interaction
Performance Issues
Diverse Data Types Issues
The following diagram describes the major issues.
Mining Methodology and User Interaction Issues - These refer to the following kinds of issues:

Mining different kinds of knowledge in databases - The needs of different users are not the same, and different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.

Interactive mining of knowledge at multiple levels of abstraction - The data mining process needs to be interactive because interactivity allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.

Incorporation of background knowledge - Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction.

Data mining query languages and ad hoc data mining - A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.

Presentation and visualization of data mining results - Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable by the users.

Handling noisy or incomplete data - Data cleaning methods are required that can handle the noise and incomplete objects while mining the data regularities. If data cleaning methods are not present, the accuracy of the discovered patterns will be poor.

Pattern evaluation - This refers to the interestingness of the discovered patterns. Discovered patterns may be uninteresting if they merely represent common knowledge or lack novelty.
Performance Issues - These refer to the following issues:

Efficiency and scalability of data mining algorithms - In order to effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.

Parallel, distributed, and incremental mining algorithms - Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel; the results from the partitions are then merged. Incremental algorithms update databases without having to mine the data again from scratch. (A small sketch of this partition-and-merge pattern follows this list.)

Diverse Data Types Issues

Handling of relational and complex types of data - The database may contain complex data objects, multimedia data objects, spatial data, temporal data etc. It is not possible for one system to mine all these kinds of data.

Mining information from heterogeneous databases and global information systems - The data is available from different data sources on a LAN or WAN. These data sources may be structured, semi-structured or unstructured. Therefore mining knowledge from them adds challenges to data mining.
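The partition-process-merge pattern mentioned for parallel mining algorithms can be sketched with the standard multiprocessing module; the "mining" done per partition here is only a toy item count, invented to keep the example self-contained.

```python
from collections import Counter
from multiprocessing import Pool

def mine_partition(rows):
    """Mine one partition independently (here: just count item occurrences)."""
    counts = Counter()
    for basket in rows:
        counts.update(basket)
    return counts

if __name__ == "__main__":
    data = [["bread", "milk"], ["bread", "butter"], ["milk", "butter"], ["bread", "milk"]]
    partitions = [data[:2], data[2:]]           # divide the data into partitions

    with Pool(processes=2) as pool:             # process each partition in parallel
        partial_results = pool.map(mine_partition, partitions)

    merged = sum(partial_results, Counter())    # merge the per-partition results
    print(merged)
```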
OTHER ISSUES IN DATA MINING : Some of these issues are addressed below. Note that these issues are not exclusive and are not ordered in any way. Security and social issues: Security is an important issue with any data collection that is shared and/or is intended to be used for strategic decision-making. In addition, when data is collected for customer profiling, user behavior understanding, correlating personal data with other information, etc., large amounts of sensitive and private information about individuals or companies is gathered and stored. This becomes controversial given the confidential nature of some of this data and the potential illegal access to the information. Moreover, data mining could disclose new implicit knowledge about individuals or groups that could be against privacy policies, especially if there is potential dissemination of discovered information. Another issue that arises from this concern is the appropriate use of data mining. Due to the value of data, databases of all sorts of content are regularly sold, and because of the competitive advantage that can be attained from implicit knowledge discovered, some important information could be withheld, while other information could be widely distributed and used without control.
User interface issues: The knowledge discovered by data mining tools is useful as long as it is interesting, and above all understandable by the user. Good data visualization eases the interpretation of data mining results, as well as helps users better understand their needs. Many data exploratory analysis tasks are significantly facilitated by the ability to see data in an appropriate visual presentation. There are many visualization ideas and proposals for effective data graphical presentation. However, there is still much research to accomplish in order to obtain good visualization tools for large datasets that could be used to display and manipulate mined
knowledge. The major issues related to user interfaces and visualization are "screen real estate", information rendering, and interaction. Interactivity with the data and data mining results is crucial since it provides means for the user to focus and refine the mining tasks, as well as to picture the discovered knowledge from different angles and at different conceptual levels.
Mining methodology issues: These issues pertain to the data mining approaches applied and their limitations. Topics such as the versatility of the mining approaches, the diversity of data available, the dimensionality of the domain, the broad analysis needs (when known), the assessment of the knowledge discovered, the exploitation of background knowledge and metadata, and the control and handling of noise in data are all examples that can dictate mining methodology choices. For instance, it is often desirable to have different data mining methods available since different approaches may perform differently depending upon the data at hand. Moreover, different approaches may suit and solve users' needs differently. Most algorithms assume the data to be noise-free. This is of course a strong assumption. Most datasets contain exceptions, invalid or incomplete information, etc., which may complicate, if not obscure, the analysis process and in many cases compromise the accuracy of the results. As a consequence, data preprocessing (data cleaning and transformation) becomes vital. It is often seen as lost time, but data cleaning, as time-consuming and frustrating as it may be, is one of the most important phases in the knowledge discovery process. Data mining techniques should be able to handle noise in data or incomplete information. More than the size of the data, the size of the search space is even more decisive for data mining techniques. The size of the search space often depends upon the number of dimensions in the domain space. The search space usually grows exponentially when the number of dimensions increases. This is known as the curse of dimensionality. This "curse" affects the performance of some data mining approaches so badly that it is becoming one of the most urgent issues to solve.
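The curse of dimensionality just described can be demonstrated with a short experiment. This is only a sketch with made-up random data: as the number of dimensions grows, the nearest and farthest random points from a query point become almost equally far away, so distance-based methods lose their discriminating power.

# Minimal sketch: the contrast between nearest and farthest neighbours shrinks as dimensions grow.
import math
import random

def relative_contrast(dims, n_points=200):
    random.seed(0)
    query = [random.random() for _ in range(dims)]
    dists = [math.dist(query, [random.random() for _ in range(dims)]) for _ in range(n_points)]
    return (max(dists) - min(dists)) / min(dists)

for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 2))   # the ratio keeps shrinking as d grows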
Performance issues: Many artificial intelligence and statistical methods exist for data analysis and interpretation. However, these methods were often not designed for the very large data sets data mining is dealing with today. Terabyte sizes are common. This raises the issues of scalability and efficiency of the data mining methods when processing considerably large data. Algorithms with exponential and even medium-order polynomial complexity cannot be of practical use for data mining. Linear algorithms are usually the norm. In the same vein, sampling can be used for mining instead of the whole dataset. However, concerns such as completeness and choice of samples may arise. Other topics in the issue of performance are incremental updating and parallel programming. There is no doubt that parallelism can help solve the size problem if the dataset can be subdivided and the results can be merged later. Incremental updating is important for merging results from parallel mining, or updating data mining results when new data becomes available without having to re-analyze the complete dataset.
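Sampling, mentioned above as a way of coping with very large datasets, can be as simple as drawing a random subset and mining that instead. A minimal sketch (the dataset, sample size and statistic are invented purely for illustration):

# Minimal sketch: estimate a statistic from a 1% simple random sample instead of the full data.
import random

full_dataset = list(range(1_000_000))            # stand-in for a dataset too large to mine directly
random.seed(42)
sample = random.sample(full_dataset, k=10_000)   # 1% simple random sample

print(sum(sample) / len(sample))                 # close to the true mean of 499999.5

The completeness concern raised above shows up here directly: rare patterns may simply not appear in the sample.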
Data source issues: There are many issues related to the data sources: some are practical, such as the diversity of data types, while others are philosophical, like the data glut problem. We
certainly have an excess of data since we already have more data than we can handle and we are still collecting data at an even higher rate. If the spread of database management systems has helped increase the gathering of information, the advent of data mining is certainly encouraging more data harvesting. The current practice is to collect as much data as possible now and process it, or try to process it, later. The concern is whether we are collecting the right data at the appropriate amount, whether we know what we want to do with it, and whether we distinguish between what data is important and what data is insignificant. Regarding the practical issues related to data sources, there is the subject of heterogeneous databases and the focus on diverse complex data types. We are storing different types of data in a variety of repositories. It is difficult to expect a data mining system to effectively and efficiently achieve good mining results on all kinds of data and sources. Different kinds of data and sources may require distinct algorithms and methodologies. Currently, there is a focus on relational databases and data warehouses, but other approaches need to be pioneered for other specific complex data types. A versatile data mining tool, for all sorts of data, may not be realistic. Moreover, the proliferation of heterogeneous data sources, at structural and semantic levels, poses important challenges not only to the database community but also to the data mining community.
DATA PROCESSING :
What is the need for Data Processing? To get the required information from a huge, incomplete, noisy and inconsistent set of data, it is necessary to use data processing.
Steps in Data Processing:
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Data Summarization
What is Data Cleaning? Data cleaning is a procedure to clean the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
What is Data Integration? Integrating multiple databases, data cubes, or files is called data integration.
What is Data Transformation? Data transformation operations, such as normalization and aggregation, are additional data preprocessing procedures that contribute toward the success of the mining process.
What is Data Reduction? Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
What is Data Summarization? It is the process of representing the collected data in an accurate and compact way without losing any information; it also involves deriving information from the collected data. Ex: display the data as a graph and compute the mean, median, mode, etc.
How to Clean Data?
Handling missing values:
Ignore the tuple
Fill in the missing value manually
Use a global constant to fill in the missing value
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same class as the given tuple
Use the most probable value to fill in the missing value
Handling noisy data:
Binning: Binning methods smooth a sorted data value by consulting its neighborhood.
Regression: Data can be smoothed by fitting the data to a function, such as with regression.
Clustering: Outliers may be detected by clustering, where similar values are organized into groups, or clusters.
Data Integration : Data integration combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files. Issues that arise during data integration include schema integration and object matching; redundancy is another important issue.
Data Transformation : Data transformation can be achieved in the following ways:
Smoothing: which works to remove noise from the data
Aggregation: where summary or aggregation operations are applied to the data. For example, daily sales data may be aggregated so as to compute weekly and annual totals.
Generalization of the data: where low-level or primitive (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country.
Normalization: where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.
Attribute construction: where new attributes are constructed and added from the given set of attributes to help the mining process.
(A short sketch of some of these cleaning and transformation steps follows the list of reduction techniques below.)
Data Reduction techniques
These are the techniques that can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.
1) Data cube aggregation
2) Attribute subset selection
3) Dimensionality reduction
4) Numerosity reduction
5) Discretization and concept hierarchy generation
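The cleaning and transformation steps listed above (filling missing values with the attribute mean, smoothing by bin means, and min-max normalization) can be sketched in a few lines. The column of values and the bin size are invented purely for illustration:

# Minimal sketch of three preprocessing steps on one numeric attribute.
values = [4.0, None, 8.0, 9.0, None, 15.0, 21.0, 24.0, 25.0]

# 1) Data cleaning: fill missing values with the attribute mean.
known = [v for v in values if v is not None]
mean = sum(known) / len(known)
cleaned = [v if v is not None else mean for v in values]

# 2) Smoothing by bin means: sort, split into bins of 3, replace each value by its bin mean.
sorted_vals = sorted(cleaned)
smoothed = []
for i in range(0, len(sorted_vals), 3):
    bin_ = sorted_vals[i:i + 3]
    smoothed.extend([sum(bin_) / len(bin_)] * len(bin_))

# 3) Normalization: min-max scaling to the range 0.0 to 1.0.
lo, hi = min(cleaned), max(cleaned)
normalized = [(v - lo) / (hi - lo) for v in cleaned]

print(cleaned)
print(smoothed)
print(normalized)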
DATA REDUCTION :
What is Data Reduction? Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results. Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.
Data Reduction techniques
These are the techniques that can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data.
1) Data cube aggregation
2) Attribute subset selection
3) Dimensionality reduction
4) Numerosity reduction
5) Discretization and concept hierarchy generation
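Data cube aggregation, the first technique in the list, simply replaces detailed records with summaries at a coarser level. A minimal sketch, with invented daily sales figures, rolls daily sales up into monthly totals:

# Minimal sketch: reduce daily sales records to monthly totals (data cube aggregation).
from collections import defaultdict

daily_sales = [
    ("2010-01-03", 200.0), ("2010-01-17", 350.0),
    ("2010-02-05", 120.0), ("2010-02-20", 410.0),
]

monthly_totals = defaultdict(float)
for date, amount in daily_sales:
    monthly_totals[date[:7]] += amount          # roll up from day level to year-month level

print(dict(monthly_totals))                     # {'2010-01': 550.0, '2010-02': 530.0}

The reduced data set (two monthly rows instead of many daily rows) answers the same monthly-sales questions while being far smaller.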
DATA MINING STATISTICS :
DATA MINING TECHNIQUES : Many different data mining, query model, processing model, and data collection techniques are available. Which one do you use to mine your data, and which one can you use in combination with your existing software and infrastructure? Examine different data mining and analytics techniques and solutions, and learn how to build them using existing software and installations. Explore the different data mining tools that are available, and learn how to determine whether the size and complexity of your information might result in processing and storage complexities, and what to do.
This overview provides a description of some of the most common data mining algorithms in use today. We have broken the discussion into two sections, each with a specific theme:
Classical Techniques: Statistics, Neighborhoods and Clustering
Next Generation Techniques: Trees, Networks and Rules
I. Classical Techniques: Statistics, Neighborhoods and Clustering
1.1. The Classics
These two sections have been broken up based on when the data mining technique was developed and when it became technically mature enough to be used for business, especially for aiding in the optimization of customer relationship management systems. Thus this section contains descriptions of techniques that have classically been used for decades; the next section covers techniques that have only been widely used since the early 1980s. This section should help the user to understand the rough differences between the techniques and give at least enough information to be well armed enough not to be baffled by the vendors of different data mining tools. The main techniques that we will discuss here are the ones that are used 99.9% of the time on existing business problems. There are certainly many others, as well as proprietary techniques from particular vendors - but in general the industry is converging to those techniques that work consistently and are understandable and explainable.
1.2. Statistics
By strict definition "statistics" or statistical techniques are not data mining. They were being used long before the term data mining was coined to apply to business applications. However, statistical techniques are driven by the data and are used to discover patterns and build predictive models. And from the user's perspective you will be faced with a conscious choice when solving a "data mining" problem as to whether you wish to attack it with statistical methods or other data mining techniques. For this reason it is important to have some idea of how statistical techniques work and how they can be applied.
What is different between statistics and data mining? I flew the Boston to Newark shuttle recently and sat next to a professor from one of the Boston-area universities. He was going to present the drosophila (fruit fly) genetic makeup to a pharmaceutical company in New Jersey. He had compiled the world's largest database on the genetic makeup of the fruit fly and had made it available to other researchers on the internet through Java applications accessing a larger relational database. He explained to me that they were now not only storing the information on the flies but also doing "data mining", adding as an aside "which seems to be very important these days, whatever that is". I mentioned that I had written a book on the subject and he was interested in knowing what the difference was between "data mining" and statistics. There was no easy answer. The techniques used in data mining, when successful, are successful for precisely the same reasons that statistical techniques are successful (e.g. clean data, a well defined target to predict
and good validation to avoid overfitting). And for the most part the techniques are used in the same places for the same types of problems (prediction, classification, discovery). In fact some of the techniques that are classically defined as "data mining", such as CART and CHAID, arose from statisticians. So what is the difference? Why aren't we as excited about "statistics" as we are about data mining? There are several reasons. The first is that the classic data mining techniques such as CART, neural networks and nearest neighbor tend to be more robust both to messier real-world data and to being used by less expert users. But that is not the only reason. The other reason is that the time is right. Because of the use of computers for closed-loop business data storage and generation, there now exist large quantities of data available to users. If there were no data, there would be no interest in mining it. Likewise the fact that computer hardware has dramatically upped the ante by several orders of magnitude in storing and processing the data makes some of the most powerful data mining techniques feasible today.
1.3. Nearest Neighbor
Clustering and the nearest neighbor prediction technique are among the oldest techniques used in data mining. Most people have an intuition that they understand what clustering is - namely that like records are grouped or clustered together. Nearest neighbor is a prediction technique that is quite similar to clustering - its essence is that, in order to predict the value for one record, you look for records with similar predictor values in the historical database and use the prediction value from the record that is nearest to the unclassified record.
A simple example of clustering
A simple example of clustering would be the clustering that most people perform when they do the laundry - grouping the permanent press, dry cleaning, whites and brightly colored clothes is important because they have similar characteristics. And it turns out they have important attributes in common about the way they behave (and can be ruined) in the wash. To cluster your laundry most of your decisions are relatively straightforward. There are of course difficult decisions to be made about which cluster your white shirt with red stripes goes into (since it is mostly white but has some color and is permanent press). When clustering is used in business the clusters are often much more dynamic - even changing weekly to monthly - and many more of the decisions concerning which cluster a record falls into can be difficult.
A simple example of nearest neighbor
A simple example of the nearest neighbor prediction algorithm is to look at the people in your neighborhood (in this case those people that are in fact geographically near to you). You may notice that, in general, you all have somewhat similar incomes. Thus if your neighbor has an income greater than $100,000, chances are good that you too have a high income. Certainly the chances that you have a high income are greater when all of your neighbors have incomes over $100,000 than if all of your neighbors have incomes of $20,000. Within your neighborhood there may still be a wide variety of incomes possible among even your closest neighbors, but if you had to predict someone's income based only on knowing their neighbors, your best chance of being right would be to predict the incomes of the neighbors who live closest to the unknown person.
The nearest neighbor prediction algorithm works in very much the same way, except that nearness in a database may consist of a variety of factors, not just where the person lives. It may, for instance, be far more important to know which school someone attended and what degree they attained when predicting income. The better definition of "near" might in fact be other people that you graduated from college with rather than the people that you live next to. Nearest neighbor techniques are among the easiest to use and understand because they work in a way similar to the way that people think - by detecting closely matching examples. They also perform quite well in terms of automation, as many of the algorithms are robust with respect to dirty data and missing data. Lastly, they are particularly adept at performing complex ROI calculations because the predictions are made at a local level where business simulations could be performed in order to optimize ROI. As they enjoy similar levels of accuracy compared to other techniques, the measures of accuracy such as lift are as good as from any other.
How to use Nearest Neighbor for Prediction
One of the essential elements underlying the concept of clustering is that one particular object (whether it be a car, a food or a customer) can be closer to another object than can some third object. It is interesting that most people have an innate sense of ordering placed on a variety of different objects. Most people would agree that an apple is closer to an orange than it is to a tomato and that a Toyota Corolla is closer to a Honda Civic than to a Porsche. This sense of ordering on many different objects helps us place them in time and space and to make sense of the world. It is what allows us to build clusters - both in databases on computers as well as in our daily lives. This definition of nearness that seems to be ubiquitous also allows us to make predictions. The nearest neighbor prediction algorithm, simply stated, is: objects that are near to each other will have similar prediction values as well. Thus if you know the prediction value of one of the objects you can predict it for its nearest neighbors.
Where has the nearest neighbor technique been used in business?
One of the classic places that nearest neighbor has been used for prediction has been in text retrieval. The problem to be solved in text retrieval is one where the end user defines a document (e.g. a Wall Street Journal article, a technical conference paper, etc.) that is interesting to them and they solicit the system to find more documents like this one, effectively defining a target of: this is the interesting document or this is not interesting. The prediction problem is that only a very few of the documents in the database actually have values for this prediction field (namely only the documents that the reader has had a chance to look at so far). The nearest neighbor technique is used to find other documents that share important characteristics with those documents that have been marked as interesting.
Using nearest neighbor for stock market data
As with almost all prediction algorithms, nearest neighbor can be used in a variety of places. Its successful use is mostly dependent on the pre-formatting of the data so that nearness can be calculated and individual records can be defined. In the text retrieval example this was not too difficult - the objects were documents. This is not always as easy as it is for text retrieval.
Consider what it might be like in a time series problem - say for predicting the stock market. In
this case the input data is just a long series of stock prices over time, without any particular record that could be considered to be an object. The value to be predicted is just the next value of the stock price. The way that this problem is solved, for both nearest neighbor techniques and some other types of prediction algorithms, is to create training records by taking, for instance, 10 consecutive stock prices and using the first 9 as predictor values and the 10th as the prediction value. Doing things this way, if you had 100 data points in your time series you could create 10 different training records. You could create even more training records than 10 by creating a new record starting at every data point. For instance, you could take the first 10 data points and create a record. Then you could take the 10 consecutive data points starting at the second data point, then the 10 consecutive data points starting at the third data point. Even though some of the data points would overlap from one record to the next, the prediction value would always be different. In our example of 100 initial data points, 91 different training records could be created this way as opposed to the 10 training records created via the other method (a short sketch of this construction appears after the next paragraph).
Why voting is better - K Nearest Neighbors
One of the improvements that is usually made to the basic nearest neighbor algorithm is to take a vote from the K nearest neighbors rather than just relying on the sole nearest neighbor to the unclassified record. In Figure 1.4 we can see that unclassified example C has a nearest neighbor that is a defaulter and yet is surrounded almost exclusively by records that are good credit risks. In this case the nearest neighbor to record C is probably an outlier - which may be incorrect data or some non-repeatable idiosyncrasy. In either case it is more than likely that C is a non-defaulter yet would be predicted to be a defaulter if the sole nearest neighbor were used for the prediction.
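Before continuing with the voting idea, here is a minimal sketch of the sliding-window construction of training records described in the stock market example above. The price list is a stand-in; only the window length of 10 is taken from the text:

# Minimal sketch: turn a price series into training records of 9 predictors plus 1 prediction value.
def make_training_records(prices, window=10):
    records = []
    for start in range(len(prices) - window + 1):
        chunk = prices[start:start + window]
        records.append((chunk[:-1], chunk[-1]))   # (predictor values, prediction value)
    return records

prices = list(range(100))                         # stand-in for 100 consecutive stock prices
print(len(make_training_records(prices)))         # 91 overlapping training records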
Figure 1.4 The nearest neighbors are shown graphically for three unclassified records: A, B, and C. In cases like these a vote of the 9 or 15 nearest neighbors would provide a better prediction accuracy for the system than would just the single nearest neighbor. Usually this is accomplished
by simply taking the majority or plurality of predictions from the K nearest neighbors if the prediction column is binary or categorical, or by taking the average value of the prediction column from the K nearest neighbors.
How can the nearest neighbor tell you how confident it is in the prediction?
Another important aspect of any system that is used to make predictions is that the user be provided with not only the prediction but also some sense of the confidence in that prediction (e.g. the prediction is defaulter with the chance of being correct 60% of the time). The nearest neighbor algorithm provides this confidence information in a number of ways:
The distance to the nearest neighbor provides a level of confidence. If the neighbor is very close or an exact match then there is much higher confidence in the prediction than if the nearest record is a great distance from the unclassified record.
The degree of homogeneity amongst the predictions within the K nearest neighbors can also be used. If all the nearest neighbors make the same prediction then there is much higher confidence in the prediction than if half the records made one prediction and the other half made another prediction.
(A small sketch of this voting and confidence scheme appears after the comparison with clustering below.)
1.4. Clustering
Clustering for Clarity
Clustering is the method by which like records are grouped together. Usually this is done to give the end user a high-level view of what is going on in the database. Clustering is sometimes used to mean segmentation - which most marketing people will tell you is useful for coming up with a bird's-eye view of the business. Two of these clustering systems are the PRIZM system from Claritas Corporation and MicroVision from Equifax Corporation. These companies have grouped the population by demographic information into segments that they believe are useful for direct marketing and sales. To build these groupings they use information such as income, age, occupation, housing and race collected in the US Census. Then they assign memorable nicknames to the clusters. Some examples are shown in Table 1.2.
Name | Income | Age | Education | Vendor
Blue Blood Estates | Wealthy | 35-54 | College | Claritas Prizm
Shotguns and Pickups | Middle | 35-64 | High School | Claritas Prizm
Southside City | Poor | Mix | Grade School | Claritas Prizm
Living Off the Land | Middle-Poor | School Age Families | Low | Equifax MicroVision
University USA | Very low | Young - Mix | Medium to High | Equifax MicroVision
Sunset Years | Medium | Seniors | Medium | Equifax MicroVision
Table 1.2 Some Commercially Available Cluster Tags
This clustering information is then used by the end user to tag the customers in their database. Once this is done the business user can get a quick high-level view of what is happening within the cluster. Once the business user has worked with these codes for some time they also begin to build intuitions about how these different customer clusters will react to the marketing offers particular to their business. For instance, some of these clusters may relate to their business and some of them may not. But given that their competition may well be using these same clusters to structure their business and marketing offers, it is important to be aware of how your customer base behaves in regard to these clusters.
Finding the ones that don't fit in - Clustering for Outliers
Sometimes clustering is performed not so much to keep records together as to make it easier to see when one record sticks out from the rest. For instance: Most wine distributors selling inexpensive wine in Missouri that ship a certain volume of product produce a certain level of profit. There is a cluster of stores that can be formed with these characteristics. One store stands out, however, as producing significantly lower profit. On closer examination it turns out that the distributor was delivering product to, but not collecting payment from, one of their customers. A sale on men's suits is being held in all branches of a department store in southern California. All stores with these characteristics have seen at least a 100% jump in revenue since the start of the sale except one. It turns out that this store had, unlike the others, advertised via radio rather than television.
How is clustering like the nearest neighbor technique?
The nearest neighbor algorithm is basically a refinement of clustering in the sense that they both use distance in some feature space to create either structure in the data or predictions. The nearest neighbor algorithm is a refinement since part of the algorithm usually is a way of automatically determining the weighting of the importance of the predictors and how the distance will be measured within the feature space. Clustering is one special case of this where the importance of each predictor is considered to be equivalent.
Nearest Neighbor | Clustering
Used for prediction as well as consolidation. | Used mostly for consolidating data into a high-level view and general grouping of records into like behaviors.
Space is defined by the problem to be solved (supervised learning). | Space is defined as default n-dimensional space, or is defined by the user, or is a predefined space driven by past experience (unsupervised learning).
Generally only uses distance metrics to determine nearness. | Can use other metrics besides distance to determine nearness of two records - for example linking two points together.
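To tie the nearest neighbor discussion together, here is a minimal sketch of the K nearest neighbor voting and confidence ideas described above. The records, attributes and the choice of plain Euclidean distance are assumptions made for illustration only (in practice the attributes would normally be scaled first):

# Minimal sketch: predict by majority vote of the K nearest neighbors and report a naive confidence.
import math
from collections import Counter

def knn_predict(records, query, k=3):
    # records: list of (feature_vector, class_label); query: feature vector to classify.
    neighbors = sorted(records, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    label, count = votes.most_common(1)[0]
    confidence = count / k          # homogeneity of the vote as a rough confidence measure
    return label, confidence

records = [
    ([25, 30000], "defaulter"), ([27, 32000], "defaulter"),
    ([45, 90000], "good"), ([50, 95000], "good"), ([48, 88000], "good"),
]
print(knn_predict(records, query=[46, 91000], k=3))   # ('good', 1.0)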
II. Next Generation Techniques: Trees, Networks and Rules
2.1. The Next Generation
The data mining techniques in this section represent the most often used techniques that have been developed over the last two decades of research. They also represent the vast majority of the techniques that are being spoken about when data mining is mentioned in the popular press. These techniques can be used either for discovering new information within large databases or for building predictive models. Though the older decision tree techniques such as CHAID are currently widely used, newer techniques such as CART are gaining wider acceptance.
Decision trees
Related to most of the other techniques (primarily classification and prediction), the decision tree can be used either as a part of the selection criteria, or to support the use and selection of specific data within the overall structure. Within the decision tree, you start with a simple question that has two (or sometimes more) answers. Each answer leads to a further question to help classify or identify the data so that it can be categorized, or so that a prediction can be made based on each answer. Figure 4 shows an example where you can classify an incoming error condition.
Figure 4. Decision tree
Decision trees are often used with classification systems to attribute type information, and with predictive systems, where different predictions might be based on past historical experience that helps drive the structure of the decision tree and the output.
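A decision tree of the kind shown in Figure 4 is just a cascade of questions, each answer leading to the next question or to a final class. The sketch below is a hand-written stand-in for such a tree; the error attributes and resulting categories are invented and are not taken from the figure:

# Minimal sketch: a hand-coded decision tree that classifies an incoming error condition.
def classify_error(error):
    # Each if-statement plays the role of one node (question) in the tree.
    if error["severity"] == "critical":
        if error["component"] == "database":
            return "page the DBA"
        return "page the on-call engineer"
    if error["recurring"]:
        return "open a ticket"
    return "log and ignore"

print(classify_error({"severity": "critical", "component": "database", "recurring": False}))
print(classify_error({"severity": "warning", "component": "web", "recurring": True}))

Tree-learning algorithms such as CART or CHAID derive the questions and their order automatically from training data rather than having them written by hand.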
Neural Networks : What is a Neural Network? When data mining algorithms are talked about these days, most of the time people are talking about either decision trees or neural networks. Of the two, neural networks have probably been of greater interest through the formative stages of data mining technology. As we will see, neural networks do have disadvantages that can be limiting in their ease of use and ease of deployment, but they also have some significant advantages. Foremost among these advantages are their highly accurate predictive models, which can be applied across a large number of different types of problems.
What is a rule? In rule induction systems the rule itself is of the simple form "if this and this and this then this". For example, a rule that a supermarket might find in the data collected from its scanners would be: if pickles are purchased then ketchup is purchased. Or:
If paper plates then plastic forks
If dip then potato chips
If salsa then tortilla chips
In order for the rules to be useful, two pieces of information must be supplied as well as the actual rule:
Accuracy - How often is the rule correct?
Coverage - How often does the rule apply?
Just because the pattern in the database is expressed as a rule does not mean that it is true all the time. Thus, just like in other data mining algorithms, it is important to recognize and make explicit the uncertainty in the rule. This is what the accuracy of the rule means. The coverage of the rule has to do with how much of the database the rule covers or applies to. Examples of these two measures for a variety of rules are shown in Table 2.2. In some cases accuracy is called the confidence of the rule and coverage is called the support. Accuracy and coverage appear to be the preferred ways of naming these two measurements.
Rule | Accuracy | Coverage
If breakfast cereal purchased then milk purchased. | 85% | 20%
If bread purchased then swiss cheese purchased. | 15% | 6%
If 42 years old and purchased pretzels and purchased dry roasted peanuts then beer will be purchased. | 95% | 0.01%
Table 2.2 Examples of Rule Accuracy and Coverage
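Accuracy (confidence) and coverage (support) for a rule such as "if pickles then ketchup" can be computed directly from transaction data. The tiny basket data below is invented, and coverage is taken, as in the text, to be how often the rule applies (i.e. how often the "if" part occurs):

# Minimal sketch: coverage and accuracy of one association rule over a list of market baskets.
baskets = [
    {"pickles", "ketchup", "bread"},
    {"pickles", "milk"},
    {"ketchup", "chips"},
    {"pickles", "ketchup"},
    {"bread", "milk"},
]

def rule_stats(baskets, antecedent, consequent):
    applies = [b for b in baskets if antecedent in b]
    correct = [b for b in applies if consequent in b]
    coverage = len(applies) / len(baskets)     # how often the rule applies
    accuracy = len(correct) / len(applies)     # how often it is then correct
    return coverage, accuracy

print(rule_stats(baskets, "pickles", "ketchup"))   # coverage 0.6, accuracy about 0.67 on this made-up data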
SOME IMPORTANT QUESTIONS
Q.1 Discuss in detail the architecture of data warehouse.
Ans : The technical architecture of data warehouses is somewhat similar to that of other systems, but it does have some special characteristics. There are two broad categories of data warehouse architecture - the single-layer architecture and the N-layer architecture. The difference between them is the number of middleware layers between the operational systems and the analytical tools. The data warehouse architecture described here is a high-level architecture, and the parts mentioned in it are full-bodied systems and not system parts.
Components of Data Warehouse Architecture
Source Data Component
1. Production Data
2. Internal Data
3. Archived Data
4. External Data
Data Staging Component
1. Data Extraction
2. Data Transformation
3. Data Loading
Data Storage Component
Information Delivery Component
Metadata Component
Management and Control Component
Data Warehouses can be architected in many different ways, depending on the specific needs of a business. The model shown below is the "hub-and-spokes" Data Warehousing architecture that is popular in many organizations. In short, data is moved from databases used in operational systems into a data warehouse staging area, then into a data warehouse and finally into a set of conformed data marts. Data is copied from one database to another using a technology called ETL (Extract, Transform, Load).
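As an illustration of the ETL step just mentioned, the sketch below extracts rows from a hypothetical operational table, transforms them, and loads them into a warehouse staging table. The table names, columns and the use of SQLite in-memory databases are assumptions for illustration, not details from the notes:

# Minimal ETL sketch: extract from an operational source, transform, load into a staging table.
import sqlite3

def extract(conn):
    return conn.execute("SELECT order_id, amount, order_date FROM orders").fetchall()

def transform(rows):
    # Example transformation: keep only the year-month, a typical time-variant roll-up.
    return [(order_id, amount, order_date[:7]) for order_id, amount, order_date in rows]

def load(conn, rows):
    conn.execute("CREATE TABLE IF NOT EXISTS stg_orders (order_id, amount, order_month)")
    conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", rows)

source = sqlite3.connect(":memory:")            # stands in for the operational database
source.execute("CREATE TABLE orders (order_id, amount, order_date)")
source.execute("INSERT INTO orders VALUES (1, 99.5, '2014-03-12')")

staging = sqlite3.connect(":memory:")           # stands in for the warehouse staging area
load(staging, transform(extract(source)))
print(staging.execute("SELECT * FROM stg_orders").fetchall())   # [(1, 99.5, '2014-03')]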
Typical Data Warehousing Environment
THREE-TIER DATA WAREHOUSE ARCHITECTURE : Generally, data warehouses adopt a three-tier architecture. Following are the three tiers of the data warehouse architecture:
Bottom Tier - The bottom tier of the architecture is the data warehouse database server. It is the relational database system. We use back-end tools and utilities to feed data into the bottom tier; these back-end tools and utilities perform the Extract, Clean, Load, and Refresh functions.
Middle Tier - In the middle tier we have the OLAP server. The OLAP server can be implemented in either of the following ways:
o By relational OLAP (ROLAP), which is an extended relational database management system. ROLAP maps operations on multidimensional data to standard relational operations.
o By the multidimensional OLAP (MOLAP) model, which directly implements multidimensional data and operations.
Top Tier - This tier is the front-end client layer. This layer holds the query tools, reporting tools, analysis tools and data mining tools.
The following diagram explains the three-tier architecture of a data warehouse:
Q.2 Write short notes on any four of the following :
(1) Distributed data marts
(2) Data mining techniques
Ans :
(1) Distributed data marts : The data mart could be at a different location from the data warehouse, so we should ensure that the LAN or WAN has the capacity to handle the data volumes being transferred within the data mart load process.
Since the data marts are smaller, they can be placed on smaller distributed machines, allowing users to break away from massively powered central machines and still handle the processing of reports.
(2) Data mining techniques : Explained above.
Q.3 Write in brief about Bayesian classifiers.
Ans :
Introduction
Bayesian classification is based on Bayes' theorem. Bayesian classifiers are statistical classifiers. Bayesian classifiers are able to predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
Bayes' Theorem
Bayes' theorem is named after Thomas Bayes. There are two types of probability involved:
Posterior probability, P(H|X)
Prior probability, P(H)
where X is a data tuple and H is some hypothesis. According to Bayes' theorem,
P(H|X) = P(X|H) P(H) / P(X)
(A small numerical example of this formula is given at the end of this answer.)
Bayesian Belief Network
Bayesian belief networks specify joint conditional probability distributions. They are also known as belief networks, Bayesian networks or probabilistic networks. A Bayesian belief network allows class conditional independencies to be defined between subsets of variables. It provides a graphical model of causal relationships on which learning can be performed. We can use a trained Bayesian network for classification.
There are two components that define a Bayesian belief network:
Directed acyclic graph
A set of conditional probability tables
Directed Acyclic Graph
Each node in the directed acyclic graph represents a random variable.
These variables may be discrete or continuous valued. These variables may correspond to actual attributes given in the data.
Directed Acyclic Graph Representation
The following diagram shows a directed acyclic graph for six Boolean variables.
The arcs in the diagram allow representation of causal knowledge. For example, lung cancer is influenced by a person's family history of lung cancer, as well as by whether or not the person is a smoker. It is worth noting that the variable PositiveXRay is independent of whether the patient has a family history of lung cancer or is a smoker, given that we know the patient has lung cancer.
Conditional Probability Table Representation
The conditional probability table for the values of the variable LungCancer (LC) shows each possible combination of the values of its parent nodes, FamilyHistory (FH) and Smoker (S).
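As a small numerical illustration of Bayes' theorem as stated above, the sketch below computes a posterior probability; the prior and conditional probabilities are invented and do not come from the notes:

# Minimal sketch: posterior probability P(H|X) = P(X|H) * P(H) / P(X).
p_h = 0.01           # prior P(H), e.g. probability that a patient has lung cancer
p_x_given_h = 0.90   # P(X|H), e.g. probability of a positive X-ray given lung cancer
p_x_given_not_h = 0.05

p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)   # total probability of the evidence X
posterior = p_x_given_h * p_h / p_x
print(round(posterior, 3))   # about 0.154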
Q.4 What is the role of information visualization in data mining?
Ans : Data mining provides many useful results but is difficult to implement. The choice of data mining technique is not easy and expertise in the domain of interest is required. If one could travel over the data set of interest, much as a plane flies over a landscape with the occupants identifying points of interest, the task of data mining would be much simpler. Just as population centers are noted and isolated communities are identified in a landscape, so clusters of data instances and
isolated instances might be identified in a data set. Identification would be natural and understood by all. At the moment no such general-purpose visualization techniques exist for multi-dimensional data. The visualization techniques available are either crude or limited to particular domains of interest. They are used in an exploratory way and require confirmation by other more formal data mining techniques to be certain about what is revealed. Visualization of data to make information more accessible has been used for centuries. The work of Tufte provides a comprehensive review of some of the better approaches and examples from the past. The interactive nature of computers and the ability of a screen display to change dynamically have led to the development of new visualization techniques. Researchers in computer graphics are particularly active in developing new visualizations. These researchers have adopted the term Visualization to describe representations of various situations in a broad way. Physical problems such as volume and flow analysis have prompted researchers to develop a rich set of paradigms for visualization of their application areas [Niel96 p.97]. The term Information Visualization has been adopted as a more specific description of the visualization of data that are not necessarily representations of physical systems which have their inherent semantics embedded in three-dimensional space. Consider a multi-dimensional data set of US census data on individuals. Each individual represents an entity instance and each entity instance has a number of attributes. In a relational database a row of data in a table would be equivalent to an entity instance. Each column in that row would contain a value equivalent to an attribute value of that entity instance. The data set is multidimensional; the number of attributes is equal to the number of dimensions. The attributes in the chosen example are occupation, age, gender, income, marital status, level of education and birthplace. The data is categorical because the values of the attributes for each instance may only be chosen from certain categories. For example, gender may only take a value from the categories male or female. Information Visualization is concerned with multi-dimensional data that may be less structured than data sets grounded in some physical system. Physical systems have inherent semantics in a three-dimensional space. Multi-dimensional data, in contrast, may have some dimensions containing values that fall into categories instead of being continuous over a range. This is the case for the data collected in many fields of study.