
Data Warehouse Schema

Star schema: A data modeling technique used to map multidimensional decision-support data into a relational database. It creates the near equivalent of a multidimensional database schema from the existing relational database, and has four components: facts, dimensions, attributes, and attribute hierarchies.
Starflake schema: A hybrid structure that contains a mixture of star (denormalized) and snowflake (normalized) schemas. It allows dimensions to be present in both forms to cater for different query requirements.
Snowflake schema: A variant of the star schema in which the dimension tables do not contain denormalized data; each dimension is normalized into separate related tables.
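To illustrate the difference, here is a minimal sketch (hypothetical table and column names) of the same product dimension in star form and in snowflake form; only the snowflake version splits the category attributes into their own table.

    -- Star form: one denormalized dimension table.
    CREATE TABLE dim_product_star (
        product_key   INTEGER PRIMARY KEY,
        product_name  VARCHAR(50),
        category_name VARCHAR(50)      -- category attributes repeated on every product row
    );

    -- Snowflake form: the same dimension normalized into two tables.
    CREATE TABLE dim_category (
        category_key  INTEGER PRIMARY KEY,
        category_name VARCHAR(50)
    );

    CREATE TABLE dim_product_snow (
        product_key   INTEGER PRIMARY KEY,
        product_name  VARCHAR(50),
        category_key  INTEGER REFERENCES dim_category (category_key)
    );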

CODD RULES
Rule 0: The system must qualify as relational, as a database, and as a management system. For a system to qualify as a relational database management system (RDBMS), that system must use its relational facilities (exclusively) to manage the database.
Rule 1: The information rule: All information in the database is to be represented in only one way, namely by values in column positions within rows of tables.
Rule 2: The guaranteed access rule: All data must be accessible. This rule is essentially a restatement of the fundamental requirement for primary keys: every individual scalar value in the database must be logically addressable by specifying the name of the containing table, the name of the containing column, and the primary key value of the containing row.
Rule 3: Systematic treatment of null values: The DBMS must allow each field to remain null (or empty). Specifically, it must support a representation of "missing information and inapplicable information" that is systematic, distinct from all regular values (for example, "distinct from zero or any other number" in the case of numeric values), and independent of data type. It is also implied that such representations must be manipulated by the DBMS in a systematic way.
Rule 4: Active online catalog based on the relational model: The system must support an online, inline, relational catalog that is accessible to authorized users by means of their regular query language. That is, users must be able to access the database's structure (catalog) using the same query language that they use to access the database's data.
Rule 5: The comprehensive data sublanguage rule: The system must support at least one relational language that (1) has a linear syntax, (2) can be used both interactively and within application programs, and (3) supports data definition operations (including view definitions), data manipulation operations (update as well as retrieval), security and integrity constraints, and transaction management operations (begin, commit, and rollback).
Rule 6: The view updating rule: All views that are theoretically updatable must be updatable by the system.
Rule 7: High-level insert, update, and delete: The system must support set-at-a-time insert, update, and delete operators. This means that data can be retrieved from a relational database in sets constructed of data from multiple rows and/or multiple tables. Insert, update, and delete operations should be supported for any retrievable set, rather than just for a single row in a single table.
Rule 8: Physical data independence: Changes to the physical level (how the data is stored, whether in arrays or linked lists, etc.) must not require a change to an application based on the structure.
Rule 9: Logical data independence: Changes to the logical level (tables, columns, rows, and so on) must not require a change to an application based on the structure. Logical data independence is more difficult to achieve than physical data independence.
Rule 10: Integrity independence: Integrity constraints must be specified separately from application programs and stored in the catalog. It must be possible to change such constraints as and when appropriate without unnecessarily affecting existing applications.
Rule 11: Distribution independence: The distribution of portions of the database to various locations should be invisible to users of the database. Existing applications should continue to operate successfully (1) when a distributed version of the DBMS is first introduced, and (2) when existing distributed data are redistributed around the system.
Rule 12: The nonsubversion rule: If the system provides a low-level (record-at-a-time) interface, then that interface cannot be used to subvert the system, for example by bypassing a relational security or integrity constraint.
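As a small, hedged illustration of how a couple of these rules surface in SQL (hypothetical EMPLOYEE table; not part of Codd's own formulation), a nullable column reflects Rule 3 and a single set-based statement reflects Rule 7:

    -- Rule 3: NULL is a systematic marker for missing/inapplicable data,
    -- distinct from 0 or an empty string.
    CREATE TABLE employee (
        emp_id     INTEGER PRIMARY KEY,   -- Rule 2: every value addressable by table, column, key
        dept       VARCHAR(30),
        commission DECIMAL(9,2)           -- left NULL when a commission is inapplicable
    );

    -- Rule 7: one set-at-a-time statement acts on every qualifying row, not row by row.
    UPDATE employee
    SET commission = commission * 1.10
    WHERE dept = 'SALES' AND commission IS NOT NULL;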

Fact table/Summary table


In data warehousing, a fact table consists of the measurements, metrics or facts of a business process. It is often located at the center of a star schema or a snowflake schema, surrounded by dimension tables. Fact tables provide the (usually) additive values that act as independent variables by which dimensional attributes are analyzed. Fact tables are often defined by their grain. The grain of a fact table represents the most atomic level by which the facts may be defined. The grain of a SALES fact table might be stated as "Sales volume by Day by Product by Store". Each record in this fact table is therefore uniquely defined by a day, product and store. Other dimensions might be members of this fact table (such as location/region) but these add nothing to the uniqueness of the fact records. These "affiliate dimensions" allow for additional slices of the independent facts but generally provide insights at a higher level of aggregation (a region contains many stores).
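A minimal sketch of a SALES fact table at the grain just described (hypothetical names): the composite primary key enforces the "by Day by Product by Store" grain, while an affiliate dimension such as region adds a slice without adding to uniqueness.

    CREATE TABLE fact_sales (
        date_key     INTEGER NOT NULL,   -- grain dimension: day
        product_key  INTEGER NOT NULL,   -- grain dimension: product
        store_key    INTEGER NOT NULL,   -- grain dimension: store
        region_key   INTEGER,            -- affiliate dimension: higher-level slice only
        sales_volume DECIMAL(12,2),      -- the additive fact
        PRIMARY KEY (date_key, product_key, store_key)
    );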

Measure types
Additive: measures that can be added across all dimensions.
Non-additive: measures that cannot be added across any dimension.
Semi-additive: measures that can be added across some dimensions and not across others.
A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often instead called summary tables). Special care must be taken when handling ratios and percentages. One good design rule[1] is never to store percentages or ratios in the fact table but to calculate them in the data access tool: store only the numerator and denominator in the fact table; these can be aggregated, and the aggregated values can then be used to calculate the ratio or percentage in the data access tool. In the real world, it is possible to have a fact table that contains no measures or facts at all. Such tables are called "factless fact tables" or "junction tables"; they can, for example, be used to model many-to-many relationships or to capture events.[1]
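A hedged sketch of the numerator/denominator rule, assuming a hypothetical FACT_RETURNS table: the two additive measures are aggregated first and the ratio is computed only at query time.

    SELECT product_key,
           SUM(returned_units)                                    AS returned_units,
           SUM(sold_units)                                        AS sold_units,
           SUM(returned_units) * 1.0 / NULLIF(SUM(sold_units), 0) AS return_rate  -- ratio computed after aggregation
    FROM fact_returns
    GROUP BY product_key;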

Types of fact tables


There are three fundamental measurement events that characterize fact tables[2] (transactional, periodic snapshot, and accumulating snapshot); a fourth, sub-transactional, type is sometimes described as well.
Transactional: A transactional fact table is the most basic and fundamental. Its grain is usually specified as "one row per line in a transaction", e.g. every line on a receipt. Typically a transactional fact table holds data at the most detailed level, causing it to have a great number of dimensions associated with it.
Periodic snapshot: The periodic snapshot, as the name implies, takes a "picture of the moment", where the moment could be any defined period of time, e.g. a performance summary of a salesman over the previous month. A periodic snapshot table depends on the transactional table, as it needs the detailed data held in the transactional fact table in order to deliver the chosen performance output.
Accumulating snapshot: This type of fact table is used to show the activity of a process that has a well-defined beginning and end, e.g. the processing of an order. An order moves through specific steps until it is fully processed; as steps towards fulfilling the order are completed, the associated row in the fact table is updated. An accumulating snapshot table often has multiple date columns, each representing a milestone in the process. It is therefore important to have an entry in the associated date dimension that represents an unknown date, as many of the milestone dates are unknown when the row is created.
Sub-transactional: The sub-transactional table is used to store facts that represent events at a more detailed level than the transactional table. These events, known as Griffith facts after their inventor, are typically events that occur during the processing of a transaction, for example the "change returned" amount from a vending machine.
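A minimal sketch of an accumulating-snapshot fact table for order processing (hypothetical names): one row per order, one date key per milestone, and an UPDATE as each milestone is reached. A key of 0 stands in for the "unknown date" member of the date dimension.

    CREATE TABLE fact_order_fulfilment (
        order_key          INTEGER PRIMARY KEY,
        ordered_date_key   INTEGER NOT NULL,
        shipped_date_key   INTEGER NOT NULL DEFAULT 0,   -- 0 = 'unknown date' row in the date dimension
        delivered_date_key INTEGER NOT NULL DEFAULT 0,
        order_amount       DECIMAL(12,2)
    );

    -- When the shipping milestone is completed, the existing row is updated in place.
    UPDATE fact_order_fulfilment
    SET shipped_date_key = 20240107                      -- surrogate key of the shipment date (illustrative)
    WHERE order_key = 42;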

Steps in designing a fact table


1. Identify a business process for analysis (like sales).
2. Identify measures or facts (sales dollar) by asking questions like "What number of XX are relevant for the business process?", replacing XX with various options that make sense within the context of the business.
3. Identify dimensions for the facts (product dimension, location dimension, time dimension, organization dimension) by asking questions that make sense within the context of the business, like "Analyse by XX", where XX is replaced with the subject to test.
4. List the columns that describe each dimension (region name, branch name, business unit name).
5. Determine the lowest level (granularity) of summary in the fact table (e.g. sales dollars).
A compact worked sketch of these steps is shown below.
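Applying the five steps to the running sales example (all names hypothetical), the outcome might look like this sketch:

    -- Step 1: business process = sales.
    -- Step 2: measure          = sales_dollars.
    -- Step 3: dimensions       = product, location, time, organization.
    -- Step 4: dimension columns, e.g. for organization: region_name, branch_name, business_unit_name.
    CREATE TABLE dim_organization (
        org_key            INTEGER PRIMARY KEY,
        region_name        VARCHAR(50),
        branch_name        VARCHAR(50),
        business_unit_name VARCHAR(50)
    );

    -- Step 5: granularity = sales dollars per product, per location, per day, per organization unit.
    CREATE TABLE fact_sales_dollars (
        product_key   INTEGER NOT NULL,
        location_key  INTEGER NOT NULL,
        date_key      INTEGER NOT NULL,
        org_key       INTEGER NOT NULL REFERENCES dim_organization (org_key),
        sales_dollars DECIMAL(12,2),
        PRIMARY KEY (product_key, location_key, date_key, org_key)
    );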

SIMON MODEL OF DECISION MAKING

Intelligence phase


o Reality is examined
o The problem is identified and defined
  - Scan the environment to identify problem situations or opportunities
o Identify organizational goals and objectives
o Determine whether they are being met
o Explicitly define the problem
  - Classify the problem
  - Decompose into sub-problems
  - Is it my problem? (ownership)
  - Can I solve it?
Outcome: Problem statement

Design phase
o A representative model is constructed
o The model is validated and evaluation criteria are set
o Generating, developing, and analyzing possible courses of action
Includes:
o Understanding the problem
o Testing solutions for feasibility
o A model is constructed, tested, and validated
Modeling:
o Conceptualization of the problem
o Abstraction to quantitative and/or qualitative forms

Choice phase
o Includes a proposed solution to the model
o If reasonable, move on to the next phase
o Search, evaluation, and recommending an appropriate solution to the model
o A specific set of values for the decision variables in a selected alternative

The problem is considered solved after the recommended solution to the model is successfully implemented.
o Search approaches:
  - Analytical techniques
  - Algorithms (optimization)
  - Blind and heuristic search techniques

Implementation phase
o Solution to the original problem
o "There is nothing more difficult to carry out, nor more doubtful of success, nor more dangerous to handle, than to initiate a new order of things." (Machiavelli, 1500s)
*** The Introduction of a Change ***
Important issues:
o Resistance to change

o Degree of top management support
o Users' roles and involvement in system development
o User training
Failure: return to the modeling process
Often backtrack / cycle throughout the process

Expert system
An expert system is a computer system that emulates the decision-making ability of a human expert. It is a computer program designed to hold the accumulated knowledge of one or more domain experts. Expert systems are designed to solve complex problems by reasoning about knowledge, like an expert, and not by following the procedure of a developer, as is the case in conventional programming. The first expert systems were created in the 1970s and then proliferated in the 1980s. Expert systems were among the first truly successful forms of artificial intelligence software.

Components of an Expert System
An expert system has a unique structure, different from traditional programs.
1. Fixed: the inference engine. It is the part of the system that chooses which facts and rules to apply when trying to solve the user's query.
2. Variable: the knowledge base. It is the collection of facts and rules which describe all the knowledge about the problem domain.
3. A dialog/user interface. The ability to conduct a conversation with users was later called "conversational". This is the part of the system which takes in the user's query in a readable form and passes it to the inference engine, then displays the results to the user.

Software architecture
The rule base or knowledge base: In expert system technology, the knowledge base is expressed with natural-language rules of the form IF ... THEN ..., for example: "IF it is living THEN it is mortal"; "IF his age = known THEN his year of birth = date of today - his age in years". This formulation has the advantage of speaking in everyday language, which is very rare in computer science. Rules express the knowledge to be exploited by the expert system. There exist other formulations of rules, which are not in everyday language and are understandable only to computer scientists; each rule style is adapted to an engine style. The whole problem of expert systems is to collect this knowledge from the experts.
The inference engine: The inference engine is a computer program designed to produce reasoning on rules. In order to produce reasoning, it is based on logic. There are several kinds of logic: propositional logic, predicates of order 1 or more, epistemic logic, modal logic, temporal logic, fuzzy logic, etc. Except for propositional logic, all are complex and can only be understood by mathematicians, logicians or computer scientists. Propositional logic is the basic human logic that is expressed in syllogisms; an expert system that uses this logic is also called a zeroth-order expert system. With logic, the engine is able to generate new information from the knowledge contained in the rule base and the data to be processed.
The engine has two ways to run: batch or conversational. In batch, the expert system has all the necessary data to process from the beginning; for the user, the program works like a classical program, in that he provides data and receives results immediately, and the reasoning is invisible. The conversational method becomes necessary when the developer knows he cannot ask the user for all the necessary data at the start, the problem being too complex.
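As a hedged, zeroth-order illustration (hypothetical FACTS and RULES tables; real inference engines are far richer than this), a single forward-chaining step can even be expressed set-wise in SQL: join the known facts to the rule conditions and read off the new conclusions.

    CREATE TABLE facts (fact VARCHAR(100) PRIMARY KEY);
    CREATE TABLE rules (if_fact VARCHAR(100), then_fact VARCHAR(100));

    INSERT INTO facts VALUES ('it is living');
    INSERT INTO rules VALUES ('it is living', 'it is mortal');

    -- One inference step: every conclusion whose condition is already a known fact.
    -- An engine would add these derived facts and repeat until nothing new appears.
    SELECT r.then_fact AS derived_fact
    FROM rules r
    JOIN facts f ON f.fact = r.if_fact;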

Advantages
Expert systems offer many advantages for users when compared to traditional programs, because they operate like a human brain.
Quick availability and opportunity to program itself: As the rule base is in everyday language, an expert system can be written much faster than a conventional program, by users or experts, bypassing professional developers and avoiding the need to explain the subject.
Ability to exploit a considerable amount of knowledge: The expert system uses a rule base, unlike conventional programs, which means that the volume of knowledge to program is not a major concern.
Reliability: The reliability of an expert system is the same as the reliability of a database, i.e. good, and higher than that of a classical program.
Scalability: Evolving an expert system means adding, modifying or deleting rules. Since the rules are written in plain language, it is easy to identify those to be removed or modified.
Pedagogy: Engines that are run by a true logic are able to explain to the user in plain language why they ask a question and how they arrived at each deduction. In doing so, they show the knowledge of the expert contained in the expert system.
Preservation and improvement of knowledge: Valuable knowledge can disappear with the death, resignation or retirement of an expert; recorded in an expert system, it is preserved. Developing an expert system means interviewing an expert and making the system aware of their knowledge, and in doing so the system reflects and enhances it.
New areas neglected by conventional computing: When automating a vast body of knowledge, the developer may meet a classic problem, the "combinatorial explosion" (commonly known as "information overload"), that greatly complicates the work and results in a complex and time-consuming program.

Disadvantages
Limited domain; systems are not always up to date and don't learn; no common sense; experts are needed to set up and maintain the system.

Application field
Expert systems address areas where combinatorics is enormous: highly interactive or conversational applications (IVR, voice servers, chatterbots); fault diagnosis and medical diagnosis; decision support in complex systems; process control; interactive user guides; educational and tutorial software; logic simulation of machines or systems; knowledge management; constantly changing software.

Examples of applications
Expert systems are designed to facilitate tasks in fields such as accounting, medicine, process control, financial services, production, and human resources.

Knowledge engineering
The building, maintaining and development of expert systems is known as knowledge engineering. Knowledge engineering is a "discipline that involves integrating knowledge into computer systems in order to solve complex problems normally requiring a high level of human expertise". There are generally three individuals having an interaction with an expert system. Primary among these is the end-user, the individual who uses the system for its problem-solving assistance. In the construction and maintenance of the system there are two other roles: the problem-domain expert, who builds the system and supplies the knowledge base, and the knowledge engineer, who assists the experts in determining the representation of their knowledge.

Knowledge Assets
Knowledge is less tangible and depends on human cognition and awareness. There are several types of knowledge: 'knowing' a fact is little different from 'information', but 'knowing' a skill, or 'knowing' that something might affect market conditions, is something that, despite the attempts of knowledge engineers to codify such knowledge, has an important human dimension. It is some combination of context sensing, personal memory and cognitive processes. Measuring the knowledge asset therefore means putting a value on people, both as individuals and, more importantly, on their collective capability, and on other factors such as the embedded intelligence in an organization's computer systems.

Knowledge Storage
Knowledge repositories provide what might be termed the long-term memory of organizational knowledge management systems. Knowledge repository tools form the basis for storing and retrieving vast quantities of business intelligence or previous information, which can subsequently be used to form the basis for future predictions. These technologies and tools contribute to the effective codification, storage and archiving of knowledge, while also focusing attention on other important aspects of the knowledge management process such as the quality, quantity, accessibility and representation of the knowledge being stored.

a) Knowledge Storage
Data warehouses are the main component of KM infrastructure. Organizations store data in a number of databases. The data warehousing process extracts data captured by multiple business applications and organizes it in a way that provides meaningful knowledge to the business, which can be accessed for future reference. For example, data warehouses could act as a central storage area for an organization's transaction data. Data warehouses differ from traditional transaction databases in that they are designed to support decision-making, data processing and analysis rather than simply to capture transaction data efficiently.
Knowledge warehouses are another type of data warehouse, aimed more at providing qualitative data than the kind of quantitative data typical of data warehouses. Knowledge warehouses store the knowledge generated from a wide range of sources including data warehouses, work processes, news articles, external databases, web pages and people (documents, etc.). Thus, knowledge warehouses are likely to be virtual warehouses in which knowledge is dispersed across a number of servers.
Databases and knowledge bases can be distinguished by the type and characteristics of the data stored. While data in a database has to be represented in explicit form (generally speaking, information can only be extracted as it is stored in the system), knowledge-based systems support the generation of knowledge that does not explicitly exist in the database.
Data marts are specific database systems on a much smaller scale, representing a structured, searchable database system organized according to the users' needs.
A data repository is a database used primarily as an information storage facility, with minimal analysis or querying functionality.
Content and document management systems represent the convergence of full-text retrieval, document management, and publishing applications. They support the unstructured data management requirements of knowledge management (KM) initiatives through a process that involves capture, storage, access, selection, and document publication.

b) Knowledge Organization Technologies
Knowledge organization technologies allow better access to knowledge resources within the organization and facilitate knowledge retrieval.

Topic maps are an advanced solution to the problem of structuring, storing and representing knowledge within a corporation. The topic map is an established ISO standard developed to address the problem of coherently representing relations between topics (or ideas) and associating those topics with actual documents (topic occurrences).
Skill maps are an extension of topic maps, creating new structures for storing information about employees, their knowledge and their skills. They are created by copying specified topic map objects and adding individual modifications, thereby providing mechanisms to enhance searching of knowledge repositories that can take into consideration the state of each employee's knowledge and skills.
Controlled vocabularies enable the creation of information, its archiving for future use and its communication to others and to computer systems. Not only should there be a common language and vocabulary, but there also has to be a common categorization or classification, i.e. a description of the relationships between words.

Knowledge Management
Knowledge utilization is one of four types of activities integral to knowledge management (KM), the other three being knowledge creation, knowledge retention and knowledge transfer. Successful knowledge utilization is necessary because the knowledge gained must be applied in order for the organization to close its knowledge gaps and to meet organizational goals and objectives. The knowledge cycle consists of three defined phases: knowledge creation, diffusion and utilization. Knowledge utilization in this cycle is viewed as something different from knowledge diffusion: utilization tends to be viewed as a more linear process, while diffusion is viewed as non-linear. Diffusion refers more to the pushing of knowledge throughout an organization, while the utilization cycle involves interventions and decisions about what knowledge should be utilized and how best to do so. A further influence on knowledge utilization lies in its two stages, reception and cognition. Reception is important to ensure that those in the organization are able to receive (through transfer, creation or diffusion) the knowledge critical to closing knowledge gaps. Cognition is important to ensure that those in the organization are able to understand and utilize the knowledge, without which innovation would not be possible.

THE KNOWLEDGE CYCLE
The knowledge cycle consists of at least three components, also known as interrelated subfields of study: knowledge creation, knowledge diffusion, and knowledge utilization. Knowledge utilization means "practical use", "conceptual use" or "adaptive use". With hindsight, some attributes differentiate the subfields amid their overlapping attributes and mixed terminology.

A. Knowledge creation
Knowledge creation may stem from five differing kinds of formalized research effort: basic research, applied research, summative evaluations, formative evaluations, and action research. Basic research focuses on knowledge building among researchers, while applied research focuses on knowledge building specific to practice areas. Program evaluation forms of research look at the outputs and outcomes of programs as well as the processes that could be improved to make ongoing programs and projects more effective. Action research is generally linked to the organization within which it occurs and addresses that organization's specific problems.
Products of knowledge creation efforts include research findings, demonstration results, program evaluation findings and general-purpose statistics. Knowledge creation may also encompass technology or tangible prototypical products deemed worthy of transfer or mass manufacture. Researchers who study how scientific knowledge is created represent the disciplines of sociology of knowledge, intellectual history, the history and philosophy of science, and the sociology and psychology of science.

B. Knowledge diffusion
Researchers focusing on knowledge diffusion study the communication channels used to disseminate innovations, rates of adoption, earliness of knowing about an innovation, innovativeness of members of a social system, opinion leadership, who interacts with whom in diffusion networks, and the consequences of an innovation. A meta-analysis of several hundred diffusion studies has resulted in a set of propositions for each significant study area. Disciplines studying knowledge diffusion include communications research, information science, library science and the sociology of science. Nine major disciplines are regarded as making significant contributions: anthropology; education; sociology (early, rural, medical, general); public health; communications; marketing; and geography.

C. Knowledge utilization
Researchers of knowledge utilization seek to measure information pickup, processing, and application. Information pickup means the process of retrieving or receiving information, whether from a data bank, a library shelf, a consultation session, or other means. Information processing involves understanding the information, testing it for validity and reliability, testing it against one's own intuition and assumptions, and transforming it into a usable form. Testing does not necessarily refer to formal experimental models; it may involve cognitive procedures. The application part of the knowledge utilization process may include rejection of the information as well as acceptance. Researchers in this area of study consider the results from diffusion research and technology transfer findings along with their studies of planned change, determining factors of use, and decision-making or problem-solving uses by policymakers, administrators, and practitioners. Products of knowledge utilization may include the models, factors, strategies and processes found most predictive of generating use. Utilization may focus on bringing about planned change in individuals, organizations, or societies; it may also focus on practical use, perceptual use, adaptive use, selective use, premature use, rejected use (i.e. deliberate non-use), discontinued use, and misuse. Researchers focusing on utilization include those affiliated with disciplines or areas of study such as industrial psychology, motivational psychology, the psychology of thought processes, organizational theory, management theory, and social, political and communications theory.

DATA MANIPULATION LANGUAGE


A data manipulation language (DML) is a family of syntax elements, similar to a computer programming language, used for inserting, deleting and updating data in a database. Performing read-only queries of data is sometimes also considered a component of DML. A popular data manipulation language is that of Structured Query Language (SQL), which is used to retrieve and manipulate data in a relational database. Other forms of DML are those used by IMS/DLI and by CODASYL databases such as IDMS, among others.
Data manipulation language comprises the SQL data change statements,[2] which modify stored data but not the schema or database objects. Manipulation of persistent database objects (e.g. tables or stored procedures) via the SQL schema statements,[2] rather than of the data stored within them, is considered to be part of a separate data definition language. In SQL the two categories are similar in their detailed syntax, data types, expressions, etc., but distinct in their overall function.[2]
Data manipulation languages have their functional capability organized by the initial word in a statement, which is almost always a verb. In the case of SQL, these verbs are:
SELECT ... FROM ... WHERE ...
INSERT INTO ... VALUES ...
UPDATE ... SET ... WHERE ...
DELETE FROM ... WHERE ...
The purely read-only SELECT query statement is classed with the 'SQL-data' statements and so is considered by the standard to be outside of DML; the SELECT ... INTO form is considered to be DML because it manipulates data. In common practice, though, this distinction is not made and SELECT is widely considered to be part of DML.
Most SQL database implementations extend their SQL capabilities by providing imperative, i.e. procedural, languages; examples are Oracle's PL/SQL and DB2's SQL PL. Data manipulation languages tend to have many different flavors and capabilities between database vendors. A number of standards have been established for SQL by ANSI, but vendors still provide their own extensions to the standard while not implementing the entire standard.
Data manipulation languages are divided into two types: procedural and declarative. Each SQL DML statement is a declarative command; the individual SQL statements are declarative, as opposed to imperative, in that they describe the program's purpose rather than the procedure for accomplishing it. Data manipulation languages were initially only used within computer programs, but with the advent of SQL they have come to be used interactively by database administrators.
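A brief, hedged sketch of the four DML verbs in action against a hypothetical CUSTOMER table (the CREATE TABLE is DDL, shown only to make the example self-contained):

    CREATE TABLE customer (
        cust_id INTEGER PRIMARY KEY,
        name    VARCHAR(50),
        city    VARCHAR(30)
    );

    INSERT INTO customer (cust_id, name, city) VALUES (1, 'Asha', 'Pune');  -- add a row
    SELECT name, city FROM customer WHERE city = 'Pune';                    -- read rows
    UPDATE customer SET city = 'Mumbai' WHERE cust_id = 1;                  -- change rows
    DELETE FROM customer WHERE cust_id = 1;                                 -- remove rows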

Knowledge Discovery Process

Knowledge discovery is a concept from the field of computer science that describes the process of automatically searching large volumes of data for patterns that can be considered knowledge about the data. It is often described as deriving knowledge from the input data. This complex topic can be categorized according to (1) what kind of data is searched, and (2) in what form the result of the search is represented. Knowledge discovery developed out of the data mining domain and is closely related to it, both in terms of methodology and terminology.
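One simple, hedged illustration of searching data for patterns, expressed in SQL over a hypothetical ORDER_LINE table: counting which pairs of products frequently occur in the same order, which is the counting step behind basic association-rule (market-basket) mining.

    SELECT a.product_id AS product_a,
           b.product_id AS product_b,
           COUNT(*)     AS together_count
    FROM order_line a
    JOIN order_line b
      ON a.order_id = b.order_id
     AND a.product_id < b.product_id     -- count each unordered pair once
    GROUP BY a.product_id, b.product_id
    HAVING COUNT(*) >= 10                -- keep only patterns with enough support
    ORDER BY together_count DESC;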
