Study Guide Math
Update Operations
Update operation terminology note: in practice there are two different uses of the term “update
operation”:
A) Update operation as a collective term for insert, delete, and modify operations
B) Update operation as another name for the modify operation
Modify operation - Used for changing the existing data in the relation
Update Anomalies
Anomalies in relations that contain redundant (unnecessarily repeating) data, caused by update
operations
• Insertion anomaly - occurs when inserting data about one real-world entity requires inserting
data about another real-world entity
• Deletion anomaly - occurs when deletion of data about a real-world entity forces deletion of
data about another real-world entity
• Modification anomaly - occurs when, in order to modify one real-world value, the same
modification has to be made multiple times
Functional Dependencies
Functional dependency - occurs when the value of one (or more) column(s) in each record of a relation
uniquely determines the value of another column in that same record of the relation
For example, A → B means that the value of column A determines the value of column B:
ClientID → ClientName
Partial functional dependency - occurs when a column of a relation is functionally determined by a separate component of a composite primary key rather than by the entire key
• Only composite primary keys have separate components, while single-column primary keys do
not have separate components
• Hence, partial functional dependency can occur only in cases when a relation has a composite
primary key
Full key functional dependency - occurs when a primary key functionally determines the column of a
relation and no separate component of the primary key partially determines the same column
• If a relation has a single component (non-composite) primary key, the primary key fully
functionally determines all the other columns of a relation
• If a relation has a composite key, and portions of the key partially determine columns of a
relation, then the primary key does not fully functionally determine the partially determined
columns
Transitive functional dependency - occurs when nonkey columns functionally determine other nonkey
columns of a relation
A nonkey column is a column in a relation that is neither a primary key column nor a candidate key column
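As an illustrative sketch (the table and column names below are hypothetical, not from the source), the following relation exhibits all three kinds of dependencies:

-- Hypothetical relation; the primary key is the composite (StudentID, CourseID)
CREATE TABLE ENROLLMENT (
  StudentID   CHAR(6),
  CourseID    CHAR(6),
  StudentName VARCHAR(40),
  Grade       CHAR(1),
  AdvisorID   CHAR(6),
  AdvisorName VARCHAR(40),
  PRIMARY KEY (StudentID, CourseID)
);
-- Full key dependency:   (StudentID, CourseID) -> Grade
-- Partial dependency:    StudentID -> StudentName   (a component of the key determines a nonkey column)
-- Transitive dependency: AdvisorID -> AdvisorName   (a nonkey column determines another nonkey column)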
Normalization
A process used to improve the design of relational databases
The normalization process involves examining each table and verifying if it satisfies a particular normal
form
If a table satisfies a particular normal form, then the next step is to verify if that relation satisfies the next
higher normal form
If a table does not satisfy a particular normal form, actions are taken to convert the table into a set of
tables that satisfy the particular normal form
Normal form - term representing a set of particular conditions (whose purpose is reducing data
redundancy) that a table has to satisfy
From a lower to a higher normal form, these conditions are increasingly stricter and leave less possibility
for redundant data
There are several normal forms, the most fundamental of which are: First normal form (1NF), Second normal
form (2NF), and Third normal form (3NF).
First Normal Form (1NF) - A table is in 1NF if each row is unique and no column in any row contains
multiple values
• 1NF states that each value in each column of a table must be a single value from the domain of
the column
• Every relational table is, by definition, in 1NF
• Related multivalued columns - columns in a table that refer to the same real-world concept
(entity) and can have multiple values per record
• Normalizing to 1NF involves eliminating groups of related multi-valued columns
Second Normal Form (2NF) - A table is in 2NF if it is in 1NF and if it does not contain partial functional
dependencies
• If a relation has a single-column primary key, then there is no possibility of partial functional
dependencies
• Such a relation is automatically in 2NF and it does not have to be normalized to 2NF
• If a relation with a composite primary key has partial dependencies, then it is not in 2NF, and it
has to be normalized to 2NF
• Normalization of a relation to 2NF creates additional relations for each set of partial
dependencies in a relation
• The primary key of the additional relation is the portion of the primary key that functionally
determines the columns in the original relation
• The columns that were partially determined in the original relation are part of the additional
table
• The original table remains after the process of normalizing to 2NF, but it no longer contains the
partially dependent columns
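As a sketch continuing the hypothetical ENROLLMENT example above (all names are assumptions for illustration), normalizing to 2NF moves the partially dependent columns into an additional relation keyed by the determining portion of the primary key:

-- Additional relation for the partial dependency StudentID -> StudentName, AdvisorID, AdvisorName
CREATE TABLE STUDENT (
  StudentID   CHAR(6) PRIMARY KEY,
  StudentName VARCHAR(40),
  AdvisorID   CHAR(6),
  AdvisorName VARCHAR(40)
);
-- The original relation remains, but without the partially dependent columns
CREATE TABLE ENROLLMENT_2NF (
  StudentID CHAR(6),
  CourseID  CHAR(6),
  Grade     CHAR(1),
  PRIMARY KEY (StudentID, CourseID)
);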
Third Normal Form (3NF) - A table is in 3NF if it is in 2NF and if it does not contain transitive functional
dependencies
• Normalization of a relation to 3NF creates additional relations for each set of transitive
dependencies in a relation.
o The primary key of the additional relation is the nonkey column (or columns) that
functionally determined the nonkey columns in the original relation
o The nonkey columns that were transitively determined in the original relation are part of
the additional table.
• The original table remains after normalizing to 3NF, but it no longer contains the transitively
dependent columns
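A corresponding sketch for 3NF (same hypothetical names): the transitive dependency AdvisorID -> AdvisorName is moved into its own relation:

-- Additional relation for the transitive dependency
CREATE TABLE ADVISOR (
  AdvisorID   CHAR(6) PRIMARY KEY,
  AdvisorName VARCHAR(40)
);
-- The original relation remains, but without the transitively dependent column
CREATE TABLE STUDENT_3NF (
  StudentID   CHAR(6) PRIMARY KEY,
  StudentName VARCHAR(40),
  AdvisorID   CHAR(6)
);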
Normalization Exceptions
In general, database relations are normalized to 3NF in order to eliminate unnecessary data redundancy
and avoid update anomalies
However, normalization to 3NF should be done judiciously and pragmatically, which may in some cases
call for deliberately not normalizing certain relations to 3NF
Denormalization
Reversing the effect of normalization by joining normalized relations into a relation that is not
normalized, in order to improve query performance
Because normalization spreads data that previously resided in fewer relations across more relations, queries on a normalized database may require more joins and therefore run more slowly; this is what motivates denormalization
Denormalization is not a default procedure to be applied in all circumstances; instead, it should be used judiciously, after analyzing its costs and benefits
When faced with a non-normalized table, instead of identifying functional dependencies and going
through normalization to 2NF and 3NF, a designer can analyze the table and create an ER diagram based
on it (and subsequently map it into a relational schema)
• Even if a relation is in 3NF, additional opportunities for streamlining database content may still
exist
• Designer-added entities (tables) and designer-added keys can be used for additional streamlining
• Augmenting databases with designer-added tables and keys is not a default process that is to be
undertaken in all circumstances
• Instead, augmenting databases with designer-added tables and keys should be done judiciously,
after analyzing the pros and cons of each augmentation
Chapter 5 – SQL
SQL - Structured Query Language, used for:
• Creating databases
• Adding, modifying, and deleting database structures
• Inserting, deleting, and modifying records in databases
• Querying databases (data retrieval)
SQL became the standard language for querying data contained in a relational database. It can be used (with minor dialectical variations) with most relational DBMS software tools.
A semicolon follows each SQL statement and indicates the end of that statement.
SQL keywords, table names and column names in SQL commands are not case sensitive
Though usually broken across multiple lines for readability, an SQL statement can be written as one long statement on a single line of text.
SQL Command Categories
Data Definition Language (DDL) - Used to create and modify the structure of the database. Ex: CREATE,
ALTER, DROP
Data Manipulation Language (DML) - Used to insert, modify, delete, and retrieve data. Ex: INSERT INTO,
UPDATE, DELETE, SELECT
SQL Commands
Command - Description
CREATE TABLE - Used for creating and connecting relational tables
DROP TABLE - Used to remove a table from the database
INSERT INTO - Used to populate the created relations with data
SELECT - Used for the retrieval of data from the database relations; the most commonly issued SQL statement
Basic form:
SELECT <columns>
FROM <table>
Aggregate functions - SQL functions used within queries to calculate summary values across groups of rows:
• COUNT
• SUM
• AVG
• MIN
• MAX
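A brief sketch of these forms, assuming a hypothetical CUSTOMER table with CustID, CustName, and CustZip columns:

SELECT CustName, CustZip
FROM CUSTOMER;

SELECT COUNT(*)
FROM CUSTOMER;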
Nested Queries
A query that is used within another query (a query may contain another query, or even several queries)
The query that uses the nested query is referred to as an outer query
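An illustrative sketch, assuming hypothetical CUSTOMER and ORDERS tables that share a CustID column: the outer query uses the result returned by the nested (inner) query.

SELECT CustName
FROM CUSTOMER
WHERE CustID IN
  (SELECT CustID
   FROM ORDERS
   WHERE OrderTotal > 100);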
Alias
An alternative and usually shorter name that can be used anywhere within a query instead of the full
relation name
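A brief sketch of an alias (table and column names are hypothetical): the alias C is used in place of the full relation name CUSTOMER.

SELECT C.CustName, C.CustZip
FROM CUSTOMER C
WHERE C.CustZip = '60611';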
The differences between SQL dialects are minor, and a user of SQL in one RDBMS is able to switch to another RDBMS with
very little additional learning effort.
Set Operators
Standard set operators: union, intersection, and difference
Used to combine the results of two or more SELECT statements that are union compatible
Two sets of columns are union compatible if they contain the same number of columns, and if the data
types of the columns in one set match the data types of the columns in the other set
The first column in one set has a compatible data type with the data type of the first column in the other
set, the second column in one set has a compatible data type with the data type of the second column in
the other set, and so on.
The set operators can combine results from SELECT statements querying relations, views, or other
SELECT queries.
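A sketch of a set operator, assuming hypothetical union-compatible queries on CUSTOMER and EMPLOYEE tables that both have Name and City columns:

SELECT Name, City FROM CUSTOMER
UNION
SELECT Name, City FROM EMPLOYEE;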
A JOIN condition can connect a column from one table with a column from the other table as long as
those columns contain the same values.
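A minimal sketch of a join condition, assuming the hypothetical CUSTOMER and ORDERS tables used above, connected on their CustID columns:

SELECT C.CustName, O.OrderTotal
FROM CUSTOMER C JOIN ORDERS O ON C.CustID = O.CustID;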
Referential integrity constraint - Regulates the relationship between a table with a foreign key and a table with a primary key to which
the foreign key refers
Most RDBMS packages DO NOT implement assertions using CREATE ASSERTION statement.
Delete options:
• DELETE RESTRICT - does not allow deletion of instances/rows whose values are referred to in
another table
• DELETE CASCADE - makes it so that when a value or instance that is referred to is deleted, the
row(s) that referred to it in the other table are deleted as well
• DELETE SET-TO-NULL - makes it so that when a value or instance that is referred to is deleted,
the value(s) that referred to it in the other table are set to the null value
• DELETE SET-TO-DEFAULT - makes it so that when a value or instance that is referred to is deleted,
the value(s) that referred to it in the other table are assigned a default value
Update options:
• UPDATE RESTRICT - does not allow an update of a value that is referred to by values in another
table
• UPDATE CASCADE - makes it so that when you update a value that is referred to in another table,
the value(s) that referred to it are changed to match the updated value
• UPDATE SET-TO-NULL - makes it so that when a value or instance that is referred to is updated,
the value(s) that referred to it in the other table are set to the null value
• UPDATE SET-TO-DEFAULT - makes it so that when a value or instance that is referred to is updated,
the value(s) that referred to it in the other table are assigned a default value
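A sketch of how these options are typically declared as part of a foreign key (table names are hypothetical; the exact set of supported options varies by RDBMS):

CREATE TABLE CUSTOMER (
  CustID   CHAR(6) PRIMARY KEY,
  CustName VARCHAR(40)
);
CREATE TABLE ORDERS (
  OrderID    CHAR(8) PRIMARY KEY,
  CustID     CHAR(6),
  OrderTotal DECIMAL(10,2),
  FOREIGN KEY (CustID) REFERENCES CUSTOMER (CustID)
    ON DELETE CASCADE
    ON UPDATE CASCADE
);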
In many cases the logic of user-defined constraints is not implemented as a part of the database, but as a
part of the front-end database application
For the proper use of the database, it is important that user-defined constraints are implemented fully
Indexing
INDEX - Mechanism for increasing the speed of data search and data retrieval on relations with a large
number of records
The preceding examples provided a simplified conceptual illustration of the principles on which an index is
based
Instead of simply sorting on the indexed column and applying binary search, different contemporary
DBMS tools implement indexes using different logical and technical approaches, such as:
• Clustering indexes
• Hash indexes
• B+ trees
• etc.
Each of the available approaches has the same goal – increase the speed of search and retrieval on the
columns that are being indexed
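As an illustration, an index on the CustName column of the CUSTOMER relation could be created with a statement such as the following (the index name CustNameIndex is a hypothetical choice):

CREATE INDEX CustNameIndex ON CUSTOMER (CustName);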
Once this statement is executed, the effect is that the searches and retrievals involving the CustName
column in the relation CUSTOMER are faster
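The index can later be removed with a statement such as the following (exact DROP INDEX syntax varies slightly across RDBMSs):

DROP INDEX CustNameIndex;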
This statement drops the index, and the index is no longer used
Database Front-End
Provides access to the database for indirect use
In most cases, a portion of intended users (often a majority of the users) of the database lack the time
and/or expertise to engage in the direct use of the data in the database. It is not reasonable to expect
every person who needs to use the data from the database to write his or her own queries and other
statements.
Form - a mechanism that provides an end-user interface for entering data into, and retrieving data from, a
database relation or query
Report - a mechanism that displays data and calculations on data from database table(s) in a formatted
way, either on a screen or as a printed hard copy
In addition to the forms and reports, database front-end applications can include many other
components and functionalities, such as:
• menus
• charts
• graphs
• maps
• etc.
The choice of how many different components to use and to what extent is driven by the needs of the
end-users
A database can have multiple sets of front-end applications for different purposes or groups of end-users
Front-end applications can be accessible separately on their own or via an interface that allows the user
to choose the application they need.
Data Quality
The data in a database is considered of high quality if it correctly and non-ambiguously reflects the real-
world it is designed to represent
Data quality is commonly evaluated along the following dimensions:
• Accuracy - the extent to which data correctly reflects the real-world instances it is supposed to
depict
• Uniqueness - requires each real-world instance to be represented only once in the data
collection
o The uniqueness data quality problem is sometimes also referred to as data duplication
• Completeness - the degree to which all the required data is present in the data collection
• Consistency - the extent to which the data properly conforms to and matches up with the other
data
• Timeliness - the degree to which the data is aligned with the proper time window in its
representation of the real world
o Typically, timeliness refers to the “freshness” of the data
• Conformity - the extent to which the data conforms to its specified format
Preventive data quality actions - Actions taken to preclude data quality problems
Corrective data quality actions - Actions taken to correct the data quality problems
Transactions
A successful transaction changes the database from one consistent state to another, i.e. a state in which all data
integrity constraints are satisfied
Most real-world database transactions are formed by two or more database requests; a database request is the
equivalent of a single SQL statement in an application program or transaction
When you read from or update a database entry, you create a transaction.
When the end of a program is successfully reached, it is equivalent to the execution of a COMMIT
command.
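A minimal sketch of a transaction (the ACCOUNT table and its columns are hypothetical): both updates must succeed together, after which COMMIT makes them permanent; ROLLBACK would undo them.

-- Transfer 100 from account A-1 to account A-2
UPDATE ACCOUNT SET Balance = Balance - 100 WHERE AcctID = 'A-1';
UPDATE ACCOUNT SET Balance = Balance + 100 WHERE AcctID = 'A-2';
COMMIT;
-- If anything went wrong before the COMMIT, the changes could instead be undone with:
-- ROLLBACK;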
Improper or incomplete transactions can have a devastating effect on database integrity. Some DBMSs
provide means by which the user can define enforceable constraints; other integrity rules are enforced
automatically by the DBMS.
Transaction Properties
Atomicity - All operations of a transaction must be completed; if not, the transaction is aborted
Consistency - A transaction takes the database from one consistent state to another consistent state
Isolation - Data used during a transaction cannot be used by a second transaction until the first is completed
Durability - Once transaction changes are committed, they cannot be undone or lost, even in the event of a system failure
The information stored in the transaction log is used by the DBMS for a recovery requirement triggered
by a program's abnormal termination or a system failure such as a disk crash.
Concurrency Control
As long as two transactions, T1 and T2, access unrelated data, there is no conflict and the order of
execution is irrelevant to the final outcome. When concurrent transactions do access the same data, three main problems can occur:
• Lost updates
• Uncommitted data
• Inconsistent retrievals
Lost Updates
One of the three most common data integrity and consistency problems: occurs when two concurrent transactions update the same data element and one of the updates is lost, overwritten by the other transaction.
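A minimal sketch of a lost update, assuming a hypothetical PRODUCT table in which the QTY of product 'P1' starts at 40:

-- T1 reads QTY = 40 and intends to add 10
-- T2 reads QTY = 40 and intends to subtract 5
UPDATE PRODUCT SET QTY = 50 WHERE ProdID = 'P1';  -- T1 writes 50
UPDATE PRODUCT SET QTY = 35 WHERE ProdID = 'P1';  -- T2 writes 35, based on the stale value 40
-- Final QTY = 35: T1's update is lost (the correct result of both operations would be 45)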
Uncommitted Data
Occurs when two transactions are executed concurrently and the first transaction is rolled back after the second transaction has already accessed the uncommitted data
Inconsistent Retrievals
Occurs when the first transaction accesses data, a second transaction alters the data, and the first transaction accesses the same data again
The first transaction might read some data before they are changed and other data after they are changed, thus yielding
inconsistent results
A consistent database is one in which all data integrity constraints are satisfied.
The Scheduler
A special DBMS program whose purpose is to establish the order of operations within which concurrent transactions
are executed
Serializable schedule - Interleaved execution of transactions yields same results as serial execution
Lock manager - Responsible for assigning and policing the locks used by transactions
Lock Granularity
Indicates level of lock use
Page-level lock - An entire diskpage is locked (a diskpage, or page, is the equivalent of a diskblock)
Row-level lock - Allows concurrent transactions to access different rows of same table, even if rows are
located on same page
Field-level lock - Allows concurrent transactions to access the same row, as long as they use different fields
(attributes) within the row
Lock Types
Locks are required to prevent another transaction from reading inconsistent data.
Exclusive lock - Access is specifically reserved for the transaction that locked the object. Must be used when the
potential for conflict exists
Shared lock - Concurrent transactions are granted read access on the basis of a common lock. A shared lock
produces no conflict as long as all the concurrent transactions are read-only.
Two-phase locking (2PL) - protocol under which a transaction acquires and releases locks in two phases:
• Growing phase - the transaction acquires all required locks without unlocking any data
• Shrinking phase - the transaction releases all its locks and cannot obtain any new lock
Deadlocks
A condition that occurs when two transactions wait for each other to unlock data.
A deadlock is possible only if one (or both) of the transactions wants to obtain an exclusive lock on a data item.
If transaction operation cannot be completed, the transaction is aborted and changes to database are
rolled back.
Transaction Recovery
Write-ahead-log protocol: ensures transaction logs are written before data is updated
Redundant transaction logs: ensure physical disk failure will not impair ability to recover
If a transaction committed after the last checkpoint, the DBMS redoes the transaction using the “after” values
If a transaction had a ROLLBACK or was left active, nothing is done because no updates were made
Summary
Transaction: sequence of database operations that access database
A single user database system automatically ensures serializability and isolation of the database because
only one transaction is executed at a time.
The implicit beginning of a transaction is when the first SQL statement is encountered.
The rollback segment table space is used for transaction- recovery purposes.
Database recovery restores database from given state to previous consistent state
Database performance tuning – a set of activities and procedures designed to reduce response time of
database system
The database performance tuning activities can be divided into those taking place on the client side or
on the server side.
On the server side, the DBMS environment is configured to respond to clients’ requests as fast as possible,
achieving optimum use of existing resources; this is known as DBMS performance tuning
DBMS Architecture
All data in the database are stored in data files
Data files automatically expand in predefined increments known as extents and are grouped in file groups
or table spaces
Table space or file group - Logical grouping of several data files that store data with similar characteristics
Data cache or buffer cache: shared, reserved memory area that stores most recently accessed data
blocks in RAM
SQL cache or procedure cache: stores most recently executed SQL statements but also PL/SQL
procedures, including triggers and functions
Database Statistics
Measurements about database objects and available resources:
Tables, Indexes, Number of processors used, Processor speed and Temporary space available
Query Processing
DBMS processes queries in three phases:
Parsing - DBMS parses the query and chooses the most efficient access/execution plan
Execution - DBMS executes the SQL query using the chosen execution plan
Fetching - DBMS fetches the data and sends the result back to the client
Optimization is the central activity during the parsing phase in query processing.
Optimization transforms the original SQL query into a slightly different version of the original SQL code that is still fully
equivalent (the results are always the same as the original query) but more efficient (it will almost always
execute faster than the original query).
Query optimizer analyzes SQL query and finds most efficient way to access data
The system table space is used to store the data dictionary tables.
The SQL query is:
• Validated for syntax compliance
• Validated against the data dictionary (table and column names are correct; the user has proper access rights)
• Analyzed and decomposed into components
• Optimized
• Prepared for execution
Access plans are DBMS-specific. They translate the client’s SQL query into a series of complex I/O operations
and are required to read the data from the physical data files and generate the result set
DBMS checks if access plan already exists for query in SQL cache
If not, optimizer evaluates various plans, and the chosen plan is placed in SQL cache
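Many RDBMSs let the chosen access plan be inspected; for example (syntax varies by product), MySQL and PostgreSQL support EXPLAIN, while Oracle supports EXPLAIN PLAN FOR. Using the hypothetical CUSTOMER table from the earlier sketches:

EXPLAIN
SELECT CustName
FROM CUSTOMER
WHERE CustZip = '60611';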
It is generally more efficient to use an index to access a table than to scan all rows of the table sequentially (a full table scan)
Data sparsity - number of different values a column could possibly have. A measure that determines the
need for an index is the data sparsity of the column you want to index.
Optimizer Choices
Most DBMSs operate in one of two optimization modes.
Rule-based optimizer – Has preset rules and points. Rules assign a fixed cost to each operation
Cost-based optimizer - Algorithms based on statistics about objects being accessed. Adds up processing
cost, I/O costs, resource costs to derive total cost
The cost-based optimizer makes decisions based on existing statistics, but the statistics may be old, so it might choose less-
efficient execution plans
Optimizer hints - special instructions for the optimizer embedded in the SQL command text
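An illustrative sketch of a hint in Oracle’s dialect (the index name is hypothetical; hint syntax differs across DBMSs):

SELECT /*+ INDEX(CUSTOMER CustNameIndex) */ CustName
FROM CUSTOMER
WHERE CustName = 'Smith';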
SQL Performance Tuning
Evaluated from client perspective
Most current relational DBMSs perform automatic query optimization at the server end
Most SQL performance optimization techniques are DBMS-specific and rarely portable
Areas of SQL performance tuning on the client side include index selectivity, conditional expressions, and query formulation
Query formulation involves identifying what columns and computations are required
DBMS performance tuning at server end focuses on setting parameters used for:
• Data cache
• SQL cache
• Sort cache
• Optimizer mode
The data cache is where the data read from the database files are stored after the data have been read
or before the data are written to the database files.
General recommendations for physical storage and DBMS performance include:
• Use RAID (Redundant Array of Independent Disks) to provide balance between performance and
fault tolerance
• Minimize disk contention
• Put high-usage tables in their own table spaces
• Assign separate data files in separate storage volumes for indexes, system, high-usage tables
• Take advantage of table storage organizations in database
• Partition tables based on usage
• Use denormalized tables where appropriate
• Store computed and aggregate attributes in tables
Summary
Database performance tuning - Refers to activities to ensure query is processed in minimum amount of
time
SQL performance tuning - refers to activities on the client side to generate SQL code that returns the correct answer in the least amount of time
Database statistics refers to measurements gathered by the DBMS that describe a snapshot of database
objects’ characteristics
During query optimization, DBMS chooses: Indexes to use, how to perform join operations, table to use
first, etc.
SQL performance tuning deals with writing queries that make good use of statistics
Query formulation deals with translating business questions into specific SQL code
The Scheduler establishes the order of concurrent transaction operations before they are executed, and
the lock manager assigns and regulates the locks used by the transactions.