Chapter Summaries
Chapter 1: Why Databases
Ubiquity and Pervasiveness of Data
Data is inescapable, pervasive, and persistent; it exists from birth to death.
Individuals continuously generate and consume a lot of data throughout their lives.
It starts with birth certificates and extends to death certificates, highlighting the lifelong
data generation process.
Importance of Databases
Databases are the optimal solution for storing and managing data effectively.
Databases make data persistent, shareable, and secure, addressing the challenges posed
by the sheer volume of generated data.
Without databases, businesses are unable to store and retrieve huge collections of data efficiently.
Databases are the solution to efficiently process, store, and retrieve vast amounts of
data for timely decision-making.
Data- raw facts (facts that have not been processed to reveal their meaning)
Metadata- the data characteristics and the set of relationships that link the data found
within the database.
Types of Databases
Workgroup database- a multiuser database that supports a relatively small number of users
Centralized database – database that supports data located at a single site
Distributed database – database that supports data distributed across several different sites
Cloud database- Database that is created and maintained using cloud data services
General purpose database- contains a wide variety of data used in different disciplines
Analytical database- stores historical data and business metrics used exclusively for tactical or
strategic decision making.
Online analytical processing (OLAP): a set of tools that work together to provide an advanced data
analysis environment for retrieving, processing, and modeling data from the data warehouse
Structured data: raw data formatted to facilitate storage, use, and the generation of information.
Database design- activities that focus on the design of the database structure that
will be used to store and manage end-user data.
Structural independence- exists when you can change the file structure without affecting the
applications’ ability to access the data
Physical data format- How computers must work with the data
Data redundancy: occurs when the same data is stored unnecessarily at different places
Data anomalies caused by redundancy:
Update anomalies
Insertion anomalies
Deletion anomalies
Database system environment- an organization of components that define and regulate the
collection, storage, management, and use of data within the database environment.
Disadvantages of database systems:
o Increased costs
o Management complexities
o Maintaining currency
o Vendor dependence
o Needs frequent upgrades
Chapter 2: Data Models
Data model: a simple representation, usually graphical, of a more complex real-world data
structure
Entity- a person, place, thing or event about which data will be collected and stored
Relationships – describe associations among entities. Designers usually use shorthand notations
to represent one-to-many, many-to-many, and one-to-one relationships [1:M, M:N (or *..*), and
1:1 (or 1..1), respectively]
Constraints- restrictions placed on data. They ensure data integrity and are expressed
in the form of rules.
Business rule: a brief, precise, and unambiguous description of a policy, procedure, or principle
within a specific organization
The main sources of business rules are company managers, policy makers, department managers,
and written documentation such as the company’s procedures, manuals, and standards.
The quest for a better data management model led to the development of several models
Hierarchical model – developed in the 1960s to manage large amounts of complex data
Network model: represents complex data relationships more effectively than the hierarchical model,
improves database performance, and imposes a database standard
Schema- is the conceptual organization of the entire database as viewed by the database
administrator.
Subschema – defines the portion of the database “seen” by the application programs
that actually produce the desired information from the data within the
database.
Schema data definition language- enables the database administrator to define the schema
components.
A data manipulation language- defines the environment in which data can be managed; it is
used to work with the data within the database.
Inheritance- is the ability of the object within the class hierarchy to inherit the attributes and
methods of the classes above it.
UML class diagrams- used to represent data and their relationships within the larger UML
object-oriented system modeling language
Big Data: the movement to find new and better ways to manage large amounts of web- and
sensor-generated data and derive business insight from it, while simultaneously providing high
performance and scalability at a reasonable cost.
Velocity- it is the speed with which data grows and the need to process this data quickly in order
to generate information and insight.
Variety- refers to the fact that the data being collected comes in multiple different formats.
Big Data technologies – Hadoop- a Java-based, open-source, high-speed, fault-tolerant
distributed storage and computational framework
Table - also called a relation because the relational model’s creator, E. F. Codd, used the two terms as
synonyms.
Characteristics of relational tables
Determination is the state in which knowing the value of one attribute makes it possible to determine the
value of another.
Full functional dependence – a functional dependency in which the entire collection of attributes in
the determinant is necessary for the relationship.
Types of Keys
Super key – a key that can uniquely identify any row in a table
Candidate key- a minimal super key, i.e., a super key that contains no unnecessary attributes
Primary key- a candidate key selected to uniquely identify all other attribute values in a given row; cannot
contain null entries
Foreign key- an attribute or combination of attributes in one table whose values must either match the
primary key in another table or be null
Entity integrity- is the condition in which each row in a table has its own unique identity
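To make the key types above concrete, here is a minimal SQL sketch; the VENDOR and PRODUCT tables and their columns are invented for illustration:
-- VENDOR: V_CODE is the primary key (entity integrity: unique, no nulls)
CREATE TABLE VENDOR (
    V_CODE INTEGER     PRIMARY KEY,
    V_NAME VARCHAR(35) NOT NULL
);
-- PRODUCT: P_CODE is the primary key; V_CODE is a foreign key whose values
-- must match an existing VENDOR.V_CODE value or be null (referential integrity)
CREATE TABLE PRODUCT (
    P_CODE     VARCHAR(10) PRIMARY KEY,
    P_DESCRIPT VARCHAR(35) NOT NULL,
    V_CODE     INTEGER,
    FOREIGN KEY (V_CODE) REFERENCES VENDOR (V_CODE)
);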
The entity relationship (ER) model develops a conceptual design for the database. It also provides a
simple and easy-to-understand view of the data.
Types of attributes
Existence dependence
Existence dependence – occurs when an entity can exist in the database only when it is associated with another related entity occurrence
Existence independence – occurs when an entity can exist apart from all of its related entities. Such an entity is also
called a strong or regular entity.
Relationship strength
Relationships are the glue that holds the tables together. They are used to connect related information
between tables.
Relationship strength is based on how the primary key of a related entity is defined.
A weak, or non-identifying, relationship exists if the primary key of the related entity does not contain a
primary key component of the parent entity
A strong, or identifying, relationship exists when the primary key of the related entity contains the
primary key component of the parent entity
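A hedged SQL sketch of the two relationship strengths, using invented COURSE and SECTION tables; in the strong (identifying) case the related entity’s primary key contains the parent’s primary key component:
CREATE TABLE COURSE (
    COURSE_ID INTEGER PRIMARY KEY,
    TITLE     VARCHAR(50) NOT NULL
);
-- Weak (non-identifying): SECTION_ID alone is the primary key; COURSE_ID is only a foreign key
CREATE TABLE SECTION_WEAK (
    SECTION_ID INTEGER PRIMARY KEY,
    COURSE_ID  INTEGER REFERENCES COURSE (COURSE_ID)
);
-- Strong (identifying): the primary key includes the parent's primary key component
CREATE TABLE SECTION_STRONG (
    COURSE_ID  INTEGER REFERENCES COURSE (COURSE_ID),
    SECTION_NO INTEGER,
    PRIMARY KEY (COURSE_ID, SECTION_NO)
);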
Relationship participation
Optional relationship- one entity occurrence does not require a corresponding entity occurrence in a
particular relationship.
Mandatory relationship- one entity occurrence requires the corresponding entity occurrence in a
particular relationship.
Relationship Degree
Ternary relationship: a relationship type in which three entities are associated.
Binary relationship: occurs when two entities are associated in a relationship
Unary relationship: one in which a relationship exists between occurrences of the same entity
set.
Recursive relationship- one in which a relationship can exist between occurrences of the same
entity set.
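A minimal sketch of a unary (recursive) relationship, assuming a hypothetical EMPLOYEE table in which an employee may report to another employee:
CREATE TABLE EMPLOYEE (
    EMP_ID     INTEGER PRIMARY KEY,
    EMP_NAME   VARCHAR(40) NOT NULL,
    MANAGER_ID INTEGER,
    -- the foreign key refers back to the same entity set (recursive relationship)
    FOREIGN KEY (MANAGER_ID) REFERENCES EMPLOYEE (EMP_ID)
);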
Database design Challenges
Design standards- the database design must conform to design standards
Processing speed- the processing speed must be higher to minimize access time
Information requirements
Chapter 5: Advanced Data Modeling
Extended entity relationship model
Result of adding more semantic constructs to original entity relationship (ER) model
Diagram using this model is called an EER diagram (EERD)
Entity Supertypes and Subtypes
Entity supertype: -Generic entity type related to one or more entity subtypes
Specialization Hierarchy
Disjoint subtypes – Also known as non-overlapping subtypes – Subtypes that contain unique
subset of supertype entity set
Overlapping subtypes – Subtypes that contain nonunique subsets of supertype entity set
Completeness Constraint
Specifies whether entity supertype occurrence must be a member of at least one subtype
Partial completeness – Symbolized by a circle over a single line – Some supertype occurrences
may not be members of any subtype
Total completeness – Symbolized by a circle over a double line – Every supertype occurrence
must be member of at least one subtype
Specialization and Generalization
Specialization – top-down process of identifying lower-level, more specific entity subtypes from a
higher-level entity supertype
Generalization – bottom-up process of identifying a higher-level, more generic entity supertype
from lower-level entity subtypes
Entity Clusters
“Virtual” entity type used to represent multiple entities and relationships in ERD
Considered “virtual” or “abstract” because it is not actually an entity in final ERD
Temporary entity used to represent multiple entities and relationships
Eliminate undesirable consequences – Avoid display of attributes when entity clusters are used
Entity Integrity: Selecting Primary Keys
Primary key most important characteristic of an entity – Single attribute or some combination of
attributes
Primary key’s function is to guarantee entity integrity
Primary keys and foreign keys work together to implement relationships
Properly selecting primary key has direct bearing on efficiency and effectiveness
Natural Keys and Primary Keys
Natural key is a real-world identifier used to uniquely identify real-world objects – Familiar to
end users and forms part of their day-to-day business vocabulary
Primary Key Guidelines
Attribute that uniquely identifies entity instances in an entity set – Could also be combination of
attributes
Main function is to uniquely identify an entity instance or row within a table
Guarantee entity integrity, not to “describe” the entity
Primary keys and foreign keys implement relationships among entities – Behind the scenes,
hidden from user
When to Use Composite Primary Keys
Normalization
Normal forms
• Used while designing a new database structure or when adding to an existing structure
• Improves the existing data structure and creates an appropriate design for the database
relations
Repeating group: a group of multiple entries of the same type that can exist for any single key
attribute occurrence
• The relation or table or report given (called a relation from now on) should have NO
repeating groups and should identify any multi-valued attributes (these will be broken out later
after the normalization process)
• Normalizing the relation will help to reduce or eliminate possible data redundancies
and anomalies
• 1NF is the first step in the normalization process … it requires three (3) steps:
Step 1: Eliminate (fill in) any Repeating Groups (nulls) and identify any multi-valued
attributes in given relation
Step 2: Identify the Primary Key (PK) or composite PK by analyzing the existing data
• Attribute(s) chosen must uniquely identify each row in the given relation
Step 3: Identify all dependencies (including partial and transitive dependencies) in the relation
• Conversion to 2NF occurs only when the 1NF has a composite primary key
• If the 1NF has a single-attribute primary key, then the table is automatically in 2NF
• A table is in 2NF when it is in 1NF and contains no partial dependencies
• The data anomalies created by an unnormalized table organization are easily eliminated once the
relation is in 2NF and higher normal forms
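To make the 1NF-to-2NF conversion concrete, here is a hedged sketch using an invented relation ENROLLMENT(STU_ID, CLASS_ID, STU_NAME, GRADE) with composite primary key (STU_ID, CLASS_ID) and a partial dependency (STU_NAME depends only on STU_ID); splitting the relation removes the partial dependency:
-- 2NF decomposition: move the partially dependent attribute to its own table
CREATE TABLE STUDENT (
    STU_ID   INTEGER PRIMARY KEY,
    STU_NAME VARCHAR(40) NOT NULL
);
CREATE TABLE ENROLLMENT (
    STU_ID   INTEGER REFERENCES STUDENT (STU_ID),
    CLASS_ID INTEGER,
    GRADE    CHAR(1),
    PRIMARY KEY (STU_ID, CLASS_ID)   -- GRADE now depends on the whole key
);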
Denormalization
• Conflicts are often resolved through compromises that may include denormalization
• Defects of unnormalized tables:
Common SQL data types
Character: CHAR(L) – fixed-length character data; VARCHAR(L) or VARCHAR2(L) – variable-length
character data
Date: DATE – stores date values
SQL Constraints
SQL Indexes
Composite index: an index based on two or more attributes
Comparison Operators
Arithmetic operators
The Rules of Precedence: establish the order in which computations are completed
Perform, in order:
Operations within parentheses
Power operations
Multiplications and divisions
Additions and subtractions
Special Operators
BETWEEN – checks whether an attribute value is within a range
IS NULL – checks whether an attribute value is null
LIKE – checks whether an attribute value matches a given string pattern
IN – checks whether an attribute value matches any value within a value list
EXISTS – checks whether a subquery returns any rows
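A hedged query sketch combining comparison, arithmetic, and special operators; the PRODUCT columns used here (P_PRICE, P_QOH) are assumptions for illustration:
SELECT P_CODE,
       P_PRICE * P_QOH AS TOTAL_VALUE      -- arithmetic, following the rules of precedence
FROM   PRODUCT
WHERE  P_PRICE > 50                        -- comparison operator
  AND  P_CODE LIKE 'AB%'                   -- special operator: string pattern match
  AND  P_QOH BETWEEN 10 AND 100            -- special operator: range check
  AND  V_CODE IS NOT NULL;                 -- special operator: null check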
Advanced Data Definition Commands
Some RDBMSs do not permit changes to data types unless column is empty
Syntax – ALTER TABLE tablename MODIFY (columnname datatype);
Adding a column
o Use ALTER and ADD
Do not include the NOT NULL clause for new column
Dropping a column
o Use ALTER and DROP
Some RDBMSs impose restrictions on the deletion of an attribute
• Before a new RDBMS can be used, the database structure and the tables that will hold the
end-user data must be created
• The database schema- Logical group of database objects—such as tables and indexes—that
are related to each other
SQL constraints
• FOREIGN KEY
• NOT NULL
• UNIQUE
• DEFAULT
• CHECK
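A minimal sketch showing the listed constraints in a single table definition; the table and column names are invented:
CREATE TABLE REGION (
    REGION_ID INTEGER PRIMARY KEY
);
CREATE TABLE CUSTOMER (
    CUS_CODE    INTEGER PRIMARY KEY,
    CUS_NAME    VARCHAR(40) NOT NULL,                      -- NOT NULL constraint
    CUS_EMAIL   VARCHAR(60) UNIQUE,                        -- UNIQUE constraint
    CUS_BALANCE NUMERIC(9,2) DEFAULT 0.00,                 -- DEFAULT constraint
    CUS_TYPE    CHAR(1) CHECK (CUS_TYPE IN ('R','W')),     -- CHECK constraint
    REGION_ID   INTEGER,
    FOREIGN KEY (REGION_ID) REFERENCES REGION (REGION_ID)  -- FOREIGN KEY constraint
);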
• CREATE TABLE tablename AS (subquery): rapidly creates a new table based on selected columns and
rows of an existing table
• Automatically copies all of the data rows returned by the subquery
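A hedged Oracle-style example of creating a table from a subquery (other RDBMSs use slightly different syntax, such as SELECT INTO); it assumes the hypothetical PRODUCT table sketched earlier:
CREATE TABLE SUPPLIED_PRODUCT AS
    SELECT P_CODE, P_DESCRIPT
    FROM   PRODUCT
    WHERE  V_CODE IS NOT NULL;   -- only the rows returned by the subquery are copied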
• SQL indexes
CREATE INDEX improves the efficiency of searches; CREATE UNIQUE INDEX also prevents duplicate
column values
DROP INDEX deletes an index.
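Hedged index examples on the hypothetical PRODUCT table (DROP INDEX syntax varies slightly by RDBMS):
CREATE INDEX PROD_DESCRIPT_NDX ON PRODUCT (P_DESCRIPT);        -- speeds searches on P_DESCRIPT
CREATE UNIQUE INDEX PROD_CODE_UX ON PRODUCT (P_CODE);          -- also prevents duplicate values
CREATE INDEX PROD_VEND_NDX ON PRODUCT (V_CODE, P_DESCRIPT);    -- composite index on two columns
DROP INDEX PROD_DESCRIPT_NDX;                                  -- removes an index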
All changes in the table structure are made by using the ALTER TABLE command followed by a
keyword that produces the specific change you want to make
ADD, MODIFY, and DROP
Changing a column’s data type
o Use ALTER and MODIFY
Changing a column’s data characteristics
If the column to be changed already contains data, you can make changes in the column’s
characteristics if those changes do not alter the data type
Adding a column
You can alter an existing table by adding one or more columns
Be careful not to include the NOT NULL clause for the new column
Adding primary key, foreign key, and check constraints
Primary key syntax:
ALTER TABLE VENDOR
ADD PRIMARY KEY (columnname);
Dropping a column
Syntax:
ALTER TABLE tablename
DROP COLUMN columnname;
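A hedged sequence of ALTER TABLE statements (Oracle-style syntax; the column P_SALECODE is invented) matching the changes described above:
ALTER TABLE PRODUCT ADD (P_SALECODE CHAR(1));        -- add a column (no NOT NULL clause)
ALTER TABLE PRODUCT MODIFY (P_SALECODE CHAR(2));     -- change the column's characteristics
ALTER TABLE PRODUCT DROP COLUMN P_SALECODE;          -- drop the column
ALTER TABLE PRODUCT ADD CHECK (V_CODE > 0);          -- add a CHECK constraint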
• Add multiple rows to a table, using another table as the source, at the same time
• Syntax:
INSERT INTO target_tablename
SELECT source_columnlist
FROM source_tablename;
COMMIT [WORK] – permanently saves all changes made during the current transaction
• UPDATE syntax:
UPDATE tablename
SET columnname = expression [, columnname = expression]
[WHERE conditionlist ];
The ROLLBACK command is used to restore the database to its previous condition: ROLLBACK;
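A hedged transaction sketch tying UPDATE, COMMIT, and ROLLBACK together on the hypothetical PRODUCT table (the vendor code 21344 is invented):
UPDATE PRODUCT
SET    P_DESCRIPT = 'Updated description'
WHERE  V_CODE = 21344;
COMMIT;        -- permanently saves the change

UPDATE PRODUCT
SET    P_DESCRIPT = 'Oops';   -- an accidental update with no WHERE clause
ROLLBACK;      -- restores the table to its state at the last COMMIT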
• CREATE VIEW statement: data definition command that stores the subquery specification in the data
dictionary
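A minimal CREATE VIEW sketch over the hypothetical PRODUCT table:
CREATE VIEW SUPPLIED_PRODUCTS AS
    SELECT P_CODE, P_DESCRIPT, V_CODE
    FROM   PRODUCT
    WHERE  V_CODE IS NOT NULL;   -- the stored subquery specification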
Updating views
• Batch update routine: pools multiple transactions into a single batch to update a master table
field in a single operation
Sequences
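The notes give no detail under this heading; as a hedged sketch, an Oracle-style sequence and its use (building on the hypothetical CUSTOMER table above) might look like this:
CREATE SEQUENCE CUS_CODE_SEQ START WITH 20010 INCREMENT BY 1 NOCACHE;
INSERT INTO CUSTOMER (CUS_CODE, CUS_NAME)
VALUES (CUS_CODE_SEQ.NEXTVAL, 'Example Customer');   -- NEXTVAL generates the next unique value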
Procedural SQL
Performs a conditional or looping operation by isolating critical code and making all application
programs call the shared code
Contains standard SQL statements and procedural extensions that is stored and executed at
the DBMS server
• Use and storage of procedural code and SQL statements within the database
• Stored procedures
• PL/SQL functions
Triggers
Procedural SQL code automatically invoked by RDBMS when given data manipulation event
occurs
Parts of a trigger definition
- Triggering action: PL/SQL code enclosed between the BEGIN and END keywords
• Actions depend on the type of DML statement that fires the trigger
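A hedged PL/SQL trigger sketch (Oracle-style); the PRODUCT columns P_QOH, P_MIN, and P_REORDER are invented for this example:
CREATE OR REPLACE TRIGGER TRG_PRODUCT_REORDER
BEFORE INSERT OR UPDATE OF P_QOH ON PRODUCT   -- the triggering DML event
FOR EACH ROW                                  -- row-level trigger
BEGIN
  -- triggering action: PL/SQL code between BEGIN and END
  IF :NEW.P_QOH <= :NEW.P_MIN THEN
    :NEW.P_REORDER := 1;                      -- flag the product for reordering
  ELSE
    :NEW.P_REORDER := 0;
  END IF;
END;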
Stored Procedures
Advantages – substantially reduce network traffic and increase performance; reduce code
duplication through code isolation and code sharing
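A hedged stored procedure sketch (Oracle PL/SQL), again assuming the hypothetical PRODUCT table; it could be run with EXEC PRC_VENDOR_COUNT(21344); in SQL*Plus:
CREATE OR REPLACE PROCEDURE PRC_VENDOR_COUNT (W_VENDOR IN NUMBER)
AS
  W_TOTAL NUMBER;
BEGIN
  -- count the products supplied by one vendor and display the result
  SELECT COUNT(*) INTO W_TOTAL
  FROM   PRODUCT
  WHERE  V_CODE = W_VENDOR;
  DBMS_OUTPUT.PUT_LINE('Products for vendor ' || W_VENDOR || ': ' || W_TOTAL);
END;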
Cursor: special construct used to hold data rows returned by a SQL query
• Implicit cursor: automatically created when SQL statement returns only one value
• Explicit cursor: holds the output of a SQL statement that may return two or more rows
• Syntax: CURSOR cursor_name IS select-query;
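A hedged explicit-cursor sketch in PL/SQL, looping over rows of the hypothetical PRODUCT table:
DECLARE
  CURSOR PROD_CURSOR IS
    SELECT P_CODE, P_DESCRIPT FROM PRODUCT;   -- may return two or more rows
  W_CODE     PRODUCT.P_CODE%TYPE;
  W_DESCRIPT PRODUCT.P_DESCRIPT%TYPE;
BEGIN
  OPEN PROD_CURSOR;
  LOOP
    FETCH PROD_CURSOR INTO W_CODE, W_DESCRIPT;
    EXIT WHEN PROD_CURSOR%NOTFOUND;           -- cursor attribute: no more rows to fetch
    DBMS_OUTPUT.PUT_LINE(W_CODE || ': ' || W_DESCRIPT);
  END LOOP;
  CLOSE PROD_CURSOR;
END;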
• Stored functions: cannot be invoked from SQL statements unless the function follows some very
specific compliance rules
Embedded SQL
• Run-time mismatch
• Processing mismatch
• Data type mismatch - Data types provided by SQL might not match data types used in different host
languages
• Standard syntax to identify embedded SQL code within the host language
• Communication area used to exchange status and error information between SQL and host
language
Chapter 9: Database Design
The Information System
A database is a carefully designed collection of facts within a larger system called an information
system.
The information system collects, stores, transforms, and retrieves data, helping to manage both
data and information.
People, hardware, software, databases, application programs, and procedures are all part of a
complete information system.
Systems analysis establishes the need for an information system, while systems development is
the process of creating it.
Information systems today should be aligned with strategic business goals and integrated with
the company’s wider information systems architecture.
Applications within the system turn data into useful information for decision-making through
reports, tabulations, and graphics.
The performance of an information system depends on database design, application design, and
administrative procedures.
Database design should focus on creating complete, normalized, and integrated models that are
flexible and scalable over time.
Procedures for database development are applicable across different types of information
systems, but the scale may vary.
The Systems Development Life Cycle (SDLC) helps understand the activities required to develop
and maintain information systems.
Different methodologies like Unified Modeling Language (UML), Rapid Application Development
(RAD), and Agile Software Development offer alternative approaches but work within the same
framework.
The Systems Development Life Cycle
The Systems Development Life Cycle (SDLC) guides the history of an information system.
SDLC offers a comprehensive view for designing and developing databases and applications.
The traditional SDLC includes five phases: planning, analysis, detailed systems design,
implementation, and maintenance.
Planning phase involves assessing the company's objectives and considering whether to
continue, modify, or replace the existing system.
Feasibility study in planning phase addresses technical, cost, and operational aspects of the new
system.
Analysis phase examines user requirements, existing systems, and creates a logical system
design.
Detailed Systems Design phase completes the design process and plans for system conversion
and training.
Implementation phase involves installing hardware, software, and application programs, testing,
debugging, and training end-users.
Maintenance phase handles changes and updates to the system, including corrective, adaptive,
and perfective maintenance activities.
CASE tools aid in software engineering, making systems more structured, documented, and
standardized, thus extending their operational life.
• The Database Life Cycle (DBLC) comprises six phases: database initial study, database design,
implementation and loading, testing and evaluation, operation, and maintenance and evolution.
• The initial study phase involves analyzing the company situation, defining problems and
constraints, defining objectives, and setting scope and boundaries.
• Analyzing the company situation includes understanding the company's operational
components, structure, and mission.
• Defining problems involves collecting information on how the existing system functions and
identifying areas of inefficiency or failure.
• Defining objectives aims to address major problems identified during the initial study, focusing
on creating efficient solutions.
• Setting scope and boundaries determines the extent of the design according to operational
requirements and external constraints like time, budget, and existing hardware and software.
Database Design
• The second phase of the Database Life Cycle (DBLC) focuses on database design.
• This phase ensures the final product meets user and system requirements.
• Two views of data are considered: the business view and the designer's view.
• Database design is an iterative process with three essential stages: conceptual, logical, and
physical design.
• Most designs and implementations are based on the relational model.
• The design process includes selecting the DBMS, creating the database, and loading or
converting the data.
• Implementation may involve installing the DBMS, creating the database, and loading data.
• Cloud-based database services may incur additional costs for data loading due to network traffic
charges.
Testing and Evaluation
• In the implementation phase, decisions from the design phase are put into action for integrity,
security, performance, and recoverability.
• The DBA tests and fine-tunes the database, ensuring its performance aligns with expectations,
often in conjunction with application programming.
• Database testing ensures data integrity, security, and adherence to management policies.
• Testing also addresses broader security concerns like physical security, password security,
access rights, audit trails, data encryption, and diskless workstations.
• Database performance is evaluated based on various factors such as hardware, software
environment, data characteristics, and configuration parameters.
• Evaluation includes broader system tests, integration issues, deployment plans, user training,
and finalizing system documentation.
• Backup and recovery plans are tested to protect against data loss, often employing fault-tolerant
components and automated backup functions.
• Recovery procedures involve restoring the database from backups following hardware or
software failures, with recovery processes varying based on the extent of the failure.
• Testing, evaluation, and modification iteratively continue until the system is certified for
operational use.
Maintenance and Evolution
Routine maintenance tasks are crucial for database administrators.
Periodic maintenance activities include:
Preventive maintenance (backup)
Corrective maintenance (recovery)
Adaptive maintenance (performance enhancement, adding entities and attributes)
Assignment and maintenance of access permissions
Generating database access statistics for audits and performance monitoring
Conducting periodic security audits
Creating system usage summaries for internal billing or budgeting purposes
Conceptual Design
• Database design is the second phase of the DBLC
• It comprises conceptual, logical, and physical design stages
• Conceptual design is aimed at a software- and hardware-independent database design
Conceptual Design:
• Goal: Develop database independent of software and physical details
• Output: Conceptual data model describing entities, attributes, relationships, constraints
• Utilizes data modeling for abstract database structure representing real-world objects
• Flexibility needed for future hardware and database model choices
Minimal Data Rule:
• All needed data must be in the model, all data in the model must be needed
• Focus on future data needs to ensure flexibility and endurance of investment
Steps in Conceptual Design:
• Data analysis and requirements
• Entity relationship modeling and normalization
• Data model verification
• Distributed database design
Data Analysis and Requirements
Information Needs:
Business Rules:
Standards Enforcement:
Relationship definition
Attribute, primary key, and foreign key definition
Entity normalization
ER diagram completion
Iterative Process:
Verification Process:
Iterative Nature:
DBMS Software Selection:
Tools such as query by example (QBE), data dictionaries, and security features influence the
selection of the DBMS.
Logical Design:
Steps include mapping entities, relationships, and constraints; validating model; ensuring
normalization.
Top-down approach: Identifies data sets, defines data elements, suitable for larger, complex
systems.
Bottom-up approach: Identifies data elements, groups them into sets, suitable for smaller
databases.
Centralized design: All decisions made centrally by small group, suitable for small-scale
problems.
Decentralized design: Complex projects divided into modules, each designed independently,
suitable for large, distributed systems.
Transactions can involve single or multiple SQL statements, each constituting a database
request.
Even read-only transactions, like SELECT queries, are considered transactions because
they access the database.
The text provides an example transaction involving sales, showing how different tables
are affected and how the database state changes.
Successful transactions are finalized with a COMMIT statement, ensuring changes are
permanently saved.
Incomplete transactions due to system failures can leave the database in an inconsistent
state.
Sophisticated DBMSs support transaction management to handle such failures and roll
back to a consistent state.
Users are responsible for ensuring that transactions accurately represent real-world
events to maintain database integrity.
DBMSs can enforce integrity rules like primary key constraints automatically to validate
transactions and prevent errors.
Transaction Properties
Atomicity: Ensures all parts of a transaction are treated as a single unit of work. If any part fails,
the entire transaction is aborted.
Consistency: Ensures that the database moves from one consistent state to another; when a
transaction completes, all integrity constraints must be satisfied.
Isolation: Prevents concurrent transactions from accessing data used by another transaction
until it completes, ensuring data integrity.
Durability: Guarantees that once changes are committed, they cannot be lost, even in the event
of system failure.
Serializability: Ensures that the results of concurrent transaction execution are consistent with
those of a serial execution, crucial in multiuser and distributed databases.
Transaction Management with SQL: ANSI standards govern SQL transactions, supported by
COMMIT and ROLLBACK statements.
Transaction Log
Maintains a record of all transactions updating the database, crucial for recovery.
Stores transaction details like operation type, object names, before-and-after values, and
transaction boundaries.
Used by the DBMS for recovery in cases of ROLLBACK, abnormal termination, or system failure.
Concurrency Control
Aims to preserve isolation property to prevent issues like lost updates, uncommitted data, and
inconsistent retrievals.
Problems like lost updates occur when concurrent transactions overwrite each other's changes.
Uncommitted data arises when a transaction accesses data rolled back by another transaction.
Inconsistent retrievals happen when a transaction reads data before and after updates by other
transactions, leading to erroneous results.
• Database transactions require the database to be in a consistent state before and after
transaction execution
• Temporary inconsistency is tolerated during transaction execution, especially when multiple
tables and rows are updated.
• The scheduler, a special component of the DBMS, determines the order of operations
within concurrent transactions to ensure serializability and isolation.
• It achieves this by using concurrency control algorithms like locking or time stamping.
• The scheduler also ensures efficient utilization of CPU and storage resources by
interleaving transaction operations effectively.
• Without scheduling, transactions would be executed on a first-come, first-served basis,
leading to inefficiencies.
Data Isolation
• The scheduler ensures that transactions do not update the same data simultaneously,
thus preventing conflicts.
• Various conflict scenarios are discussed, highlighting the importance of proper
scheduling methods.
Lock Granularity
• Lock granularity refers to the level at which locks are applied, such as database, table,
page, row, or field.
• Different levels have varying degrees of restrictiveness and efficiency, with page-level
locks being most commonly used in multiuser DBMSs.
Database-Level Lock
Locks the entire database, suitable for batch processes but not for multiuser DBMSs due to slow
data access.
Table-Level Lock
Locks entire tables, causing traffic jams in multiuser environments and delaying transactions
unnecessarily
Page-Level Lock: Locks entire disk pages, allowing concurrent access to different parts of the
same table.
Row-Level Lock: Locks a single row, allowing concurrent access to different rows of the same
table; less restrictive than a page-level lock but with higher overhead.
Field-Level Lock: Allows concurrent access to different fields within the same row, but rarely
implemented due to high overhead and the practicality of row-level locks.
In wait/die, an older transaction waits for a younger one to complete; a younger transaction
requesting a lock held by an older one is rolled back (dies).
In wound/wait, an older transaction preempts (wounds) a younger one, while a younger
transaction must wait for an older one to complete.
ANSI SQL standard defines isolation levels: Read Uncommitted, Read Committed, Repeatable
Read, Serializable.
Each level controls what data transactions can see during execution.
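A hedged example using the ANSI SQL syntax for setting an isolation level (support and default levels vary by DBMS):
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;   -- this transaction sees only committed data
-- the other ANSI levels: READ UNCOMMITTED, REPEATABLE READ, SERIALIZABLE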
Hardware/software failures, human-caused incidents, and natural disasters can cause critical
errors.
Recovery techniques also apply to system and database after critical errors occur.
Transaction Recovery- Transaction recovery involves restoring a database to a consistent state after
a failure.
Write-Ahead-Log Protocol: Ensures transaction logs are written before database data is updated.
Redundant Transaction Logs: Multiple copies of transaction logs prevent data loss due to disk
failures.
Database Checkpoints:
Operations where updated buffers are written to disk, ensuring database and log synchronization.
Deferred Write Technique: Transaction operations update transaction log first, then database after
commit.
Time Stamping Methods: Assigns unique time stamps to transactions to resolve conflicts.
Optimistic Methods: Assumes most transactions don't conflict, updates private copies of data.
Database Recovery: Restores database to consistent state after critical events like hardware errors.
Chapter 11: Database Performance Tuning
and Query Optimization
Database Performance-Tuning Concepts
– Set of activities and procedures designed to reduce response time of database system
• All factors must be checked to ensure that each one operates at its optimum level and has
sufficient resources to minimize occurrence of bottlenecks
• Data files
• Table space or file group is logical grouping of several data files that store data with similar
characteristics
• Breaking down (parsing) query into smaller units and transforming original SQL query into
slightly different version of original SQL code
– Fully equivalent
– More efficient
• Optimized query will almost always execute faster than original query
• Indexes
– Facilitate searching, sorting, and using aggregate functions as well as join operations
• More efficient to use index to access table than to scan all rows in table sequentially
Optimizer Choices
• Rule-based optimizer
– Uses set of preset rules and points to determine best approach to execute query
• Cost-based optimizer
– Uses algorithms based on statistics about the objects being accessed to determine the best
approach to execute the query
• Most SQL performance optimization techniques are DBMS-specific and are rarely portable
Index Selectivity
Conditional Expressions
Query Formulation
• Includes global tasks such as managing DBMS processes in primary memory and structures in
physical storage
• DBMS performance tuning at server end focuses on setting parameters used for:
– Data cache
– SQL cache
– Sort cache
– Optimizer mode
• Distributed database management system (DDBMS) – governs storage and processing of logically
related data over interconnected computer systems in which both data and processing functions
are distributed among several sites
• Centralized database required that corporate data be stored in a single central site
• Dynamic business environment and centralized database’s shortcomings spawned a demand for
applications based on data access from different sources at multiple locations (PDAs for
example)
• Distributed processing
– Database’s logical processing is shared among two or more physically independent sites
– For example, the data input/output (I/O), data selection, and data validation might be
performed on one computer, and a report based on that data might be created on
another computer
• Distributed database
– Stores logically related database over two or more physically independent sites
• Advantages include:
– Growth facilitation: New sites can be added to the network without affecting the
operations of other sites.
– Improved communications: Because local sites are smaller and located closer to
customers
– User-friendly interface
– Processor independence: end user is able to access any available copy of the data, and
an end user’s request is processed by any processor at the data location.
DDBMS Disadvantages
• Disadvantages include:
– Security
– Lack of standards
– Increased storage requirements: Multiple copies of data are required at different sites
• Application interface: interact with the end user, application programs, and other DBMSs
• Formatting: to prepare the data for presentation to the end user or to an application program
• Backup and recovery: to ensure the availability and recoverability of DB in case of a failure
• DB administration
• Concurrency control: to manage simultaneous data access and to ensure data consistency
• Transaction management: to ensure that the data moves from one consistent state to another
• Must handle all necessary functions imposed by distribution of data and processing
– Computer workstations
– Network hardware and software
– Communications media
• Transaction processor (TP): software component found in each computer that requests data; it
receives and processes the application’s data requests
• Data processor (DP): software component residing on each computer that stores and retrieves
data located at that site
• Single-site processing, single-site data (SPSD): all processing is done on a single CPU or host
computer (mainframe, midrange, or PC), and processing cannot be done on the end user’s side of
the system
• Multiple-site processing, single-site data (MPSD): multiple processes run concurrently on
different computers but access a single data processor (DP)
• MPSD scenario requires network file server running conventional applications that are accessed
through LAN
• Many multiuser accounting applications, running under personal computer network, fit such a
description
• Multiple-site processing, multiple-site data (MPMD): fully distributed database management
system with support for multiple data processors and transaction processors at multiple sites
• Homogeneous DDBMSs – integrate multiple instances of the same DBMS over a network
• Heterogeneous DDBMSs – integrate different types of DBMSs over a network
• Features include:
– Distribution transparency
– Transaction transparency
– Failure transparency
– Performance transparency
– Heterogeneity transparency
Transaction Transparency
• Ensures database transactions will maintain distributed database’s integrity and consistency
• Ensures transaction completed only when all database sites involved complete their part
• Remote request: single SQL statement accesses data from single remote database
• Distributed transaction: requests data from several different remote sites on network
Performance Transparency
• Objective of query optimization routine is to minimize total cost associated with execution of
request
– Communication cost
– CPU time cost
• Must provide:
– Data fragmentation
– Data replication
– Data allocation
Data Fragmentation
• Information about data fragmentation is stored in distributed data catalog (DDC), from which it
is accessed by TP to process user requests
Data Replication
• Replication scenarios: fully replicated, partially replicated, and unreplicated databases
• Most DDBMSs are able to handle the partially replicated database well
Data Allocation
• Data distribution over computer network is achieved through data partition, data replication, or
combination of both
• Allocation strategies: centralized, partitioned, and replicated data allocation
• Can be used to implement a DBMS in which client is the TP and server is the DP
• Client/server advantages
• Allow end user to use microcomputer’s GUI, thereby improving functionality and
simplicity
• Numerous data analysis and query tools exist to facilitate interaction with DBMSs
available in PC market
• Client/server disadvantages
• Different platforms (LANs, operating systems, and so on) are often difficult to
manage
• An increase in number of users and processing sites often paves the way for security
problems
• Increases demand for people with broad knowledge of computers and software
• Data analysis provides information about short-term tactical evaluations and strategies
Business Intelligence
• Comprehensive, cohesive, integrated tools and processes
• Multiple tools from different vendors can be integrated into a single BI framework
• Other benefits
– Integrating architecture
• Personal analytics
– Database schema
– Database size
• Usually a read-only database optimized for data analysis and query processing
• Data mart: typically lower cost and shorter implementation time than a data warehouse
Star Schemas
• Data-modeling technique
• Easily implemented model for multidimensional data analysis while preserving relational
structures
Facts: numeric values or measurements that represent a specific business aspect or activity
Attributes: used to search, filter, and classify facts
Attribute Hierarchies: provide a top-down data organization used for aggregation and
drill-down/roll-up data analysis
• Two purposes: aggregation and drill-down/roll-up data analysis
• In the star schema, the fact table primary key is formed by combining the foreign keys pointing
to the dimension tables
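A hedged star schema sketch: one fact table whose primary key combines the foreign keys to two invented dimension tables:
CREATE TABLE DIM_TIME (
    TIME_ID  INTEGER PRIMARY KEY,
    MONTH_NO INTEGER,
    YEAR_NO  INTEGER
);
CREATE TABLE DIM_PRODUCT (
    PROD_ID   INTEGER PRIMARY KEY,
    PROD_DESC VARCHAR(40)
);
CREATE TABLE FACT_SALES (
    TIME_ID   INTEGER REFERENCES DIM_TIME (TIME_ID),
    PROD_ID   INTEGER REFERENCES DIM_PRODUCT (PROD_ID),
    SALES_AMT NUMERIC(12,2),             -- the fact: a numeric measurement
    PRIMARY KEY (TIME_ID, PROD_ID)       -- formed by combining the foreign keys
);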
Data Analytics
• Subset of BI functionality
– Explanatory analytics
– Predictive analytics
Data Mining
– Analyze data
– Uncover problems or opportunities hidden in data relationships
– Guided
– Automated
Predictive Analytics
• Employs mathematical and statistical algorithms, neural networks, artificial intelligence, and
other advanced modeling tools
– Access to many different kinds of DBMSs, flat files, and internal and external data
sources
OLAP Architecture
– OLAP graphical user interface (GUI)
– Analytical processing logic
– Data-processing logic
Chapter 14: Big Data Analytics and NoSQL
Big Data-Refers to data with volume, velocity, and variety that challenges traditional database
management.
Stream Processing: Analyzing data as it enters the system to decide what to keep and discard.
Feedback Loop Processing: Analyzing data to produce actionable results, focusing on both
inputs and outputs.
Variety: Refers to the diverse formats and structures of data in Big Data.
Structured Data: Organized to fit a predefined data model, typical in relational databases.
Unstructured Data: Not organized to fit into a predefined data model, includes various formats
like text, images, and videos.
Variability: Refers to changes in the meaning of data over time, relevant in sentiment analysis.
Veracity: Concerns the trustworthiness of data and the accuracy of information generated from
it.
Value: Relates to the meaningful insights derived from analyzed data that can impact
organizational behavior.
Hadoop: A Java-based framework for distributed storage and processing of large datasets.
HDFS (Hadoop Distributed File System): Designed for high volume, write-once, read-many
access, streaming, and fault tolerance.
HDFS Assumptions: Large file sizes, write-once model, streaming access, fault tolerance through
replication.
HDFS Nodes: Client nodes, NameNode, and DataNodes manage data storage and retrieval.
MapReduce: a computational framework for processing large data sets
o Divides tasks into smaller subtasks, processes them concurrently, and combines the results.
Map Function: Sorts and filters data into key-value pairs, performed by a mapper program.
Reduce Function: Summarizes key-value pairs with the same key into a single result, performed
by a reducer program.
MapReduce Implementation: Pushes copies of the program to nodes containing data instead of
transferring data to a central node.
Hadoop Structure: Composed of a job tracker (JobTracker) and task trackers (TaskTrackers).
Hadoop Workflow: Client node submits MapReduce job to job tracker, which communicates
with name node to locate data nodes.
Job tracker determines available task trackers, assigns tasks, and manages failures.
Hadoop Ecosystem: Collection of related applications around Hadoop for easier use and
accessibility.
Tools like Hive and Pig simplify creating MapReduce jobs, especially for users without
extensive programming skills.
Hive:
Data warehousing system on HDFS with HiveQL language for ad hoc queries, processed
into MapReduce jobs.
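A hedged HiveQL sketch (table and column names invented); Hive translates queries like this into MapReduce jobs over HDFS:
CREATE TABLE web_logs (ip STRING, url STRING, hits INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

SELECT url, SUM(hits) AS total_hits
FROM web_logs
GROUP BY url;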
Batch Processing: Runs from start to finish without user interaction, often used for tasks
requiring extended time or system resources.
NoSQL: a broad category of non-relational database technologies developed to handle Big Data.
Key-Value Databases:
The simplest NoSQL model, storing data as key-value pairs organized into buckets.
Only supports basic operations such as get, store, and delete.
Document Databases:
Store data as tagged documents in key-value pairs, allowing more structured data
storage.
Schema-less but rely on tags for querying.
Column-Oriented Databases:
Store data by column instead of by row, suitable for systems requiring queries over few
columns but many rows.
Can refer to traditional relational databases or NoSQL databases like Cassandra and
HBase.
In NoSQL column-oriented databases, data is organized into key-value pairs in which the value is
a set of columns that can vary by row.
Supports super columns grouping logically related columns together.
Graph Databases:
Graph databases store data based on graph theory, focusing on relationships between nodes.
They excel in relationship-rich environments like social networks, logistics, and identity
management.
Queries in graph databases are called traversals, focusing on relationships between nodes.
NewSQL Databases:
Support SQL and distributed clusters like NoSQL, but use in-memory storage, impacting
durability.
Data Analytics: includes explanatory analytics (explaining the past and present) and predictive
analytics (forecasting the future).
Predictive analytics uses advanced statistical tools to predict future outcomes accurately.
Predictive analytics models are used in customer relationships, fraud detection, marketing, etc.
Data Mining:
Uses algorithms like neural networks, decision trees, and regression to create predictive models.