Need of Two Types of Data: Information
DATA
1. Items about things, events, activities, and transactions
2. Unorganized, i.e., do not convey any specific meaning
3. Numeric, alphanumeric, figures, sounds, or images
INFORMATION
1. Organized data, meaningful data, results
KNOWLEDGE
1. Data items or information that convey understanding, experience, accumulated learning, and expertise
2. Application of data and information in making a decision
A data warehouse (DW) combines data management and data analysis.
Goal: to integrate enterprise-wide corporate data into a single repository from which users can easily run queries.
Def :- A DW is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process (Inmon)
1. Subject Oriented
   1. Organized around major subjects, such as customer, product, sales.
   2. Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
   3. Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
2. Integrated
   1. Constructed by integrating multiple, heterogeneous data sources
      1.1. relational databases, flat files, on-line transaction records
   2. Data cleaning and data integration techniques are applied.
      2.1. Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources. E.g., hotel price: currency, tax, breakfast covered, etc.
      2.2. When data is moved to the warehouse, it is converted.
3. Time Variant
   1. The data in the warehouse is only accurate and valid at some point in time or over some time interval. The time-variance of the data warehouse is also shown in the extended time that the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots.
   2. The time horizon for the data warehouse is significantly longer than that of operational systems.
   3. Operational database: current value data.
      3.1. Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
   4. Every key structure in the data warehouse contains an element of time, explicitly or implicitly, but the key of operational data may or may not contain a time element.
4. Non Volatile
   1. A physically separate store of data transformed from the operational environment.
   2. Operational update of data does not occur in the data warehouse environment.
      - Does not require transaction processing, recovery, and concurrency control mechanisms
      - Requires only two operations in data accessing: initial loading of data and access of data.
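The integrated, time-variant, and non-volatile properties can be illustrated with a minimal sketch. It is not taken from any particular product: the class names, the exchange rates, and the tax handling are all hypothetical, chosen only to mirror the hotel-price example above.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical exchange rates to a common currency; illustrative values only.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1}


@dataclass(frozen=True)  # frozen: a loaded row is never updated in place
class PriceSnapshot:
    hotel: str
    snapshot_date: date  # explicit time element in the key
    price_usd: float     # one currency, tax always included


class Warehouse:
    """Append-only store: only initial loading and read access are offered."""

    def __init__(self):
        self._rows = []

    def load(self, hotel, snapshot_date, price, currency, tax_included, tax_rate):
        # Integration step: unify tax treatment and currency before storing.
        gross = price if tax_included else price * (1 + tax_rate)
        price_usd = round(gross * RATES_TO_USD[currency], 2)
        self._rows.append(PriceSnapshot(hotel, snapshot_date, price_usd))

    def history(self, hotel):
        # Read access: the series of snapshots for one hotel, oldest first.
        rows = [r for r in self._rows if r.hotel == hotel]
        return sorted(rows, key=lambda r: r.snapshot_date)


wh = Warehouse()
wh.load("Hotel A", date(2024, 1, 1), 120.0, "USD", True, 0.10)
wh.load("Hotel A", date(2023, 1, 1), 100.0, "EUR", False, 0.10)
for snap in wh.history("Hotel A"):
    print(snap.snapshot_date, snap.price_usd)
```

There is deliberately no update or delete method: each load adds a new snapshot, so the history of a hotel's price is preserved rather than overwritten.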
Conceptual modelling of DW
1. Represent facts and their properties
2. Connect the temporal dimension to facts
3. Represent objects, capture their properties and the associations among them
4. Record the associations between objects and facts
5. Distinguish dimensions and categorize them into hierarchies
Characteristics of DW
1. Subject oriented
2. Summarized
3. Client/server
4. Integrated
5. Not normalized
6. Time-variant (time series)
7. Metadata
8. Nonvolatile
9. Web based (relational/multi-dimensional)
Generic DW Architectures
Three-tier architecture
1. Data acquisition software (back-end)
2. The data warehouse that contains the data & software
3. Client (front-end) software that allows users to access and analyze data from the warehouse
Two-tier architecture (sometimes there is only one tier?)
7. The major integration issue is how to synchronize the various types of meta-data used throughout the data warehouse. The challenge is to synchronize meta-data between different products from different vendors using different meta-data stores.
8. Two major standards for meta-data and modeling in the areas of data warehousing and component-based development: MDC (Meta Data Coalition) and OMG (Object Management Group)
Data mart :- A subset of a data warehouse that supports the requirements of a particular department or business function
1) The characteristics that differentiate data marts and data warehouses include:
   a) a data mart focuses on only the requirements of users associated with one department or business function
   b) data marts do not normally contain detailed operational data, unlike data warehouses
   c) as data marts contain less data compared with data warehouses, data marts are more easily understood and navigated
A data mart is a departmental data warehouse that stores only relevant data
Dependent data mart :- A subset that is created directly from a data warehouse
Independent data mart
Reasons for creating a data mart :
1) To give users access to the data they need to analyze most often
2) To provide data in a form that matches the collective view of the data by a group of users in a department or business function
3) To improve end-user response time due to the reduction in the volume of data to be accessed
4) To provide appropriately structured data as dictated by the requirements of end-user access tools
5) Data marts normally use less data, so tasks such as data cleansing, loading, transformation, and integration are far easier, and hence implementing and setting up a data mart is simpler than establishing a corporate data warehouse
6) The cost of implementing data marts is normally less than that required to establish a data warehouse
7) The potential users of a data mart are more clearly defined and can be more easily targeted to obtain support for a data mart project rather than a corporate data warehouse project
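A dependent data mart, as defined above, can be sketched as a summarized subset drawn directly from the warehouse. The example below is only an illustration using Python's built-in sqlite3 module; the table names, columns, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical warehouse table holding detailed data for all departments.
conn.execute(
    "CREATE TABLE warehouse_sales ("
    "sale_id INTEGER, department TEXT, product TEXT, sale_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?, ?, ?)",
    [
        (1, "marketing", "widget", "2023-01-05", 10.0),
        (2, "finance", "audit", "2023-01-06", 99.0),
        (3, "marketing", "widget", "2023-02-01", 15.0),
        (4, "marketing", "gadget", "2023-02-03", 20.0),
    ],
)

# Dependent data mart: one department only, summarized by product and month,
# so it holds less data and no detailed operational rows.
conn.execute(
    """
    CREATE TABLE marketing_mart AS
    SELECT product,
           substr(sale_date, 1, 7) AS month,
           SUM(amount) AS total_amount
    FROM warehouse_sales
    WHERE department = 'marketing'
    GROUP BY product, substr(sale_date, 1, 7)
    """
)

mart_rows = conn.execute(
    "SELECT product, month, total_amount FROM marketing_mart ORDER BY product, month"
).fetchall()
for row in mart_rows:
    print(row)
```

Queries from the marketing users then run against the smaller `marketing_mart` table instead of the full warehouse, which is the response-time argument made in reason 3.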
Data Integration and the Extraction, Transformation, and Load (ETL) Process
Data integration :- Integration that comprises three major processes: data access, data federation, and change capture.
Enterprise application integration (EAI) :- A technology that provides a vehicle for pushing data from source systems into a data warehouse
Enterprise information integration (EII) :- An evolving tool space that promises real-time data integration from a variety of sources
Service-oriented architecture (SOA) :- A new way of integrating information systems
Extraction, transformation, and load (ETL) process
ETL
Issues affecting the purchase of an ETL tool
1. Data transformation tools are expensive
2. Data transformation tools may have a long learning curve
Important criteria in selecting an ETL tool
1. Ability to read from and write to an unlimited number of data sources/architectures
2. Automatic capturing and delivery of metadata
3. A history of conforming to open standards
4. An easy-to-use interface for the developer and the functional user
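The three ETL stages can be sketched in a few lines with the Python standard library. This is a toy illustration, not a substitute for an ETL tool: the CSV content, the cleaning rules, and the `sales` table are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source extract with inconsistent formatting and a missing value.
SOURCE_CSV = """customer,country,amount
 Alice ,us,10.50
Bob,DE,
carol,de,7.25
"""


def extract(text):
    """Extraction: read raw records from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: cleanse and standardize; reject rows without an amount."""
    clean = []
    for row in rows:
        if not row["amount"].strip():
            continue
        clean.append(
            (
                row["customer"].strip().title(),
                row["country"].strip().upper(),
                float(row["amount"]),
            )
        )
    return clean


def load(conn, rows):
    """Load: write the cleansed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, country TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute(
    "SELECT customer, country, amount FROM sales ORDER BY customer"
).fetchall()
print(loaded)
```

Keeping the three stages as separate functions makes it easy to swap the source or the target without touching the cleaning rules.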
Benefits of DW
Direct benefits of a data warehouse
1. Allows end users to perform extensive analysis
2. Allows a consolidated view of corporate data
3. Better and more timely information
4. Enhanced system performance
5. Simplification of data access
5.
   ii) Provides an integrated, flexible architecture to support analytic data structures
b) Data mart approach (bottom-up)
   i) Goal: to deliver business value quickly by deploying multidimensional data marts, which are later organized into a DW
Data cube :- A 2D, 3D, or higher-dimensional object in which each dimension of the data represents a measure of interest
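A small data cube can be sketched as a mapping from dimension coordinates to an aggregated value. The sketch below assumes three hypothetical dimensions (product, region, year) and a sales amount stored in each cell; the data and function names are illustrative only.

```python
from collections import defaultdict

# Hypothetical sales facts: (product, region, year, amount).
facts = [
    ("widget", "north", 2023, 10.0),
    ("widget", "south", 2023, 5.0),
    ("gadget", "north", 2023, 7.0),
    ("widget", "north", 2024, 3.0),
]
DIMENSIONS = ("product", "region", "year")


def build_cube(rows):
    """Sum the amount for every (product, region, year) cell."""
    cube = defaultdict(float)
    for *coords, amount in rows:
        cube[tuple(coords)] += amount
    return cube


def roll_up(cube, keep):
    """Aggregate away every dimension that is not named in `keep`."""
    idx = [DIMENSIONS.index(name) for name in keep]
    out = defaultdict(float)
    for coords, amount in cube.items():
        out[tuple(coords[i] for i in idx)] += amount
    return dict(out)


cube = build_cube(facts)
by_product = roll_up(cube, ["product"])
by_region_year = roll_up(cube, ["region", "year"])
print(by_product)
print(by_region_year)
```

Rolling up over different subsets of the dimensions gives the 2D and 1D views of the same 3D cube.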
2. Dimension tables, which surround the fact table and are linked to it via foreign keys.
3. Dimension tables contain attributes that describe the data contained within the fact table.
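The relationship between a fact table and its dimension tables can be sketched with Python's built-in sqlite3 module. The schema below is a hypothetical example (the table names, columns, and rows are not from the source): a central sales fact table carries foreign keys to a product dimension and a date dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: descriptive attributes.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);

    -- Fact table: numeric measures plus foreign keys to each dimension.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        units INTEGER,
        revenue REAL
    );

    INSERT INTO dim_product VALUES (1, 'widget', 'hardware'), (2, 'manual', 'books');
    INSERT INTO dim_date VALUES (10, 2023, 'Q1'), (11, 2023, 'Q2');
    INSERT INTO fact_sales VALUES (1, 10, 2, 20.0), (1, 11, 1, 10.0), (2, 10, 5, 25.0);
    """
)

# Join the fact table to its dimensions and aggregate by descriptive attributes.
rows = conn.execute(
    """
    SELECT p.category, d.quarter, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.quarter
    ORDER BY p.category, d.quarter
    """
).fetchall()
for row in rows:
    print(row)
```

The fact table stays narrow (keys and measures), while the attributes used for grouping and filtering live in the dimension tables.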
5. Adaptability must be built in from the start
6. The project must be managed by both IT and business professionals (a business-supplier relationship must be developed)
7. Only load data that have been cleansed and are of high quality
8. Do not overlook training requirements
9. Be politically aware
Risks in Implementing DW
1. No mission or objective
2. Quality of source data unknown
3. Skills not in place
4. Inadequate budget
5. Lack of supporting software
6. Source data not understood
7. Weak sponsor
8. Users not computer literate
9. Political problems or turf wars
10. Unrealistic user expectations
11. Architectural and design risks
12. Scope creep and changing requirements
13. Vendors out of control
14. Multiple platforms
15. Key people leaving the project
16. Loss of the sponsor
17. Too much new technology
18. Having to fix an operational system
19. Geographically distributed environment
20. Team geography and language culture