
Unit – 1 Relational Data Model

What is Data?
In simple words, data can be facts related to any object in consideration. For example, your name,
age, height, weight, etc. are some data related to you. A picture, image, file, pdf, etc. can also be
considered data.

Define Database:
A database is an organized collection of structured information, or data, typically stored
electronically in a computer system. A database is usually controlled by a database management
system (DBMS). The data can then be easily accessed, managed, modified, updated, controlled,
and organized.

ER model
o ER model stands for Entity-Relationship model. It is a high-level data model. This
model is used to define the data elements and the relationships among them for a specified system.
o It develops a conceptual design for the database. It also provides a simple and
easy-to-understand view of the data.
o In ER modeling, the database structure is portrayed as a diagram called an entity-
relationship diagram.

For example, suppose we design a school database. In this database, Student will be an entity
with attributes like name, id, age, address, etc. Address can be another entity with attributes
like city, street name, pin code, etc., and there will be a relationship between them.

Components of an ER Diagram

1. Entity:
An entity may be any object, class, person, or place. In an ER diagram, an entity is
represented as a rectangle.

Consider an organization as an example: manager, product, employee, department, etc. can be
taken as entities.

a. Weak Entity

An entity that depends on another entity is called a weak entity. A weak entity doesn't contain any
key attribute of its own. A weak entity is represented by a double rectangle.
2. Attribute
An attribute is used to describe a property of an entity. An ellipse is used to represent an attribute.

For example, id, age, contact number, name, etc. can be attributes of a student.

a. Key Attribute

The key attribute is used to represent the main characteristics of an entity. It represents a primary
key. The key attribute is represented by an ellipse with the text underlined.

b. Composite Attribute

An attribute that is composed of several other attributes is known as a composite attribute. The
composite attribute is represented by an ellipse, and its component attributes are drawn as ellipses connected to it.

c. Multivalued Attribute

An attribute that can have more than one value is known as a multivalued attribute.
A double oval is used to represent a multivalued attribute.

For example, a student can have more than one phone number.

d. Derived Attribute
An attribute that can be derived from other attributes is known as a derived attribute. It is
represented by a dashed ellipse.

For example, a person's age changes over time and can be derived from another attribute such as
date of birth.

3. Relationship
A relationship describes the association between entities. A diamond (rhombus) is used to represent a
relationship.

Types of relationship are as follows:

a. One-to-One Relationship

When a single instance of one entity is associated with a single instance of another entity, it is known as a one-to-one relationship.

For example, a female can marry one male, and a male can marry one female.

b. One-to-many relationship
When one instance of the entity on the left can be associated with more than one instance of the entity on the right,
it is known as a one-to-many relationship.

For example, a scientist can make many inventions, but each invention is made by only one specific scientist.

c. Many-to-one relationship

When more than one instance of the entity on the left can be associated with only one instance of the entity on the
right, it is known as a many-to-one relationship.

For example, a student enrolls in only one course, but a course can have many students.

d. Many-to-many relationship

When more than one instance of the entity on the left can be associated with more than one instance of the entity on
the right, it is known as a many-to-many relationship.

For example, an employee can be assigned to many projects, and a project can have many employees.

Participation Constraints
 Total Participation − Every entity in the entity set participates in the relationship. Total participation is
represented by double lines.
 Partial Participation − Not all entities in the entity set participate in the relationship. Partial participation is
represented by single lines.
Example (Total Participation)

Here,

 A double line between the entity set “Student” and the relationship set “Enrolled in” signifies total participation.
 It specifies that each student must be enrolled in at least one course.

Example (Partial Participation)

Here,

 A single line between the entity set “Course” and the relationship set “Enrolled in” signifies partial participation.
 It specifies that there might exist some courses for which no enrollments are made.

What is Relational Model?


Relational Model (RM) represents the database as a collection of relations. A relation is nothing but
a table of values. Every row in the table represents a collection of related data values. These rows in
the table denote a real-world entity or relationship.
The table name and column names are helpful to interpret the meaning of values in each row. The
data are represented as a set of relations. In the relational model, data are stored as tables. However,
the physical storage of the data is independent of the way the data are logically organized.
Relational Model concept
The relational model represents data as tables with columns and rows. Each row is known as a tuple.
Each column of a table has a name, called an attribute.

Domain: It contains a set of atomic values that an attribute can take.

 The domain of Marital Status has a set of possibilities: Married, Single, Divorced.
 The domain of Shift has the set of all possible days: {Mon, Tue, Wed…}.
 The domain of Salary is the set of all floating-point numbers greater than 0 and less than
200,000.
 The domain of First Name is the set of character strings that represents names of people.

Attribute: It contains the name of a column in a particular table. Each attribute Ai must have a
domain, dom(Ai)

Relational instance: In the relational database system, the relational instance is represented by a
finite set of tuples. Relation instances do not have duplicate tuples.

Relational schema: A relational schema contains the name of the relation and name of all
columns or attributes.

Relational key: A set of one or more attributes that can uniquely identify a row
in the relation.

Degree: The degree is the number of attributes in a table.

Example: STUDENT Relation

NAME ROLL_NO PHONE_NO ADDRESS AGE

Ram 14795 7305758992 Noida 24


Shyam 12839 9026288936 Delhi 35

Laxman 33289 8583287182 Gurugram 20

Mahesh 27857 7086819134 Ghaziabad 27

Ganesh 17282 9028913988 Delhi 40

o In the given table, NAME, ROLL_NO, PHONE_NO, ADDRESS, and AGE are the
attributes.
o The instance of schema STUDENT has 5 tuples.
o t3 = <Laxman, 33289, 8583287182, Gurugram, 20>

Properties of Relations

o The name of the relation is distinct from the names of all other relations.


o Each cell of the relation contains exactly one atomic (single) value.
o Each attribute has a distinct name.
o The order of attributes has no significance.
o Each tuple is distinct; there are no duplicate tuples.
o The order of tuples has no significance.

Terminology:
atomic value: each value in the domain is indivisible as far as the relational model is concerned

attribute: principal storage unit in a database

column: see attribute

degree: number of attributes in a table

domain: the original sets of atomic values used to model data; a set of acceptable values that a
column is allowed to contain

field: see attribute

file: see relation

record: contains fields that are related; see tuple

relation: a subset of the Cartesian product of a list of domains characterized by a name; the technical
term for table or file

row: see tuple


structured query language (SQL): the standard database access language

table: see relation

tuple: a technical term for row or record


Degree Of Relationship :

Denotes the number of entity types that participate in a relationship

Types:

1) Unary Relationship:

Exists when there is an association with only one entity.

E.g., PERSON is married to PERSON.

Here, PERSON is the entity and "is married to" is the relationship.

2) Binary Relationship:

Exists when there is an association between two entities.

E.g., PUBLISHER publishes BOOK.

3) Ternary Relationship:

Exists when there is an association among three entities.

E.g., TEACHER teaches SUBJECT to STUDENT.

Mapping from ER Model to Relational Model


After designing the ER diagram of a system, we need to convert it to a relational model, which can
directly be implemented by any RDBMS such as Oracle or MySQL. In this section we will discuss how to
convert an ER diagram to a relational model for different scenarios.
In data modelling terms, cardinality is how one table relates to another.
Relationship constraints:
1. Cardinality Ratio: the maximum number of relationship instances that an entity can
participate in.
Case 1: Binary Relationship with 1:1 cardinality with total participation of an entity
A person has 0 or 1 passport number and Passport is always owned by 1 person. So it is 1:1 cardinality
with full participation constraint from Passport.

First, convert each entity and relationship to tables. The Person table corresponds to the Person entity with
Per-Id as key. Similarly, the Passport table corresponds to the Passport entity with Pass-No as key. The Has table
represents the relationship between Person and Passport (which person has which passport), so it takes the
attribute Per-Id from Person and Pass-No from Passport.

Person Has Passport

Per-Id Other Person Attribute Per-Id Pass-No Pass-No Other Passport Attribute

PR1 – PR1 PS1 PS1 –

PR2 – PR2 PS2 PS2 –

PR3 –

Table 1
As we can see from Table 1, each Per-Id and Pass-No has only one entry in the Has table. So we can
merge all three tables into one, with the attributes shown in Table 2. Each Per-Id will be unique and not null,
so it will be the key. Pass-No can't be the key because, for some persons, it can be NULL.

Per-Id Other Person Attribute Pass-No Other Passport Attribute

Table 2
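As a rough DDL sketch (not part of the original example), the merged relation of Table 2 could be declared as follows; Person_Name and Issue_Date are hypothetical stand-ins for "Other Person Attribute" and "Other Passport Attribute":

CREATE TABLE PERSON (
    Per_Id       VARCHAR(10) PRIMARY KEY,  -- key of the merged relation
    Person_Name  VARCHAR(50),              -- hypothetical "Other Person Attribute"
    Pass_No      VARCHAR(10) UNIQUE,       -- NULL for a person without a passport
    Issue_Date   DATE                      -- hypothetical "Other Passport Attribute"
);
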
Case 2: Binary Relationship with 1:1 cardinality and partial participation of both entities
A male marries 0 or 1 female, and vice versa, so it is 1:1 cardinality with a partial participation
constraint from both sides. First, convert each entity and relationship to tables. The Male table corresponds to the
Male entity with M-Id as key. Similarly, the Female table corresponds to the Female entity with F-Id as key.
The Marry table represents the relationship between Male and Female (which male marries which female), so
it takes the attribute M-Id from Male and F-Id from Female.

Male Marry Female

M-Id Other Male Attribute M-Id F-Id F-Id Other Female Attribute

M1 – M1 F2 F1 –

M2 – M2 F1 F2 –

M3 – F3 –

Table 3
As we can see from Table 3, some males and some females do not marry. If we merged the three tables into one,
then for some M-Id the F-Id would be NULL (and vice versa), so there would be no attribute that is always
non-NULL to serve as the key. Therefore we can't merge all three tables into one; we convert them into two tables.
In Table 4, an M-Id who is married will have an F-Id associated; for others it will be NULL. Table 5 will have the
information of all females. Primary keys have been underlined.

M-Id Other Male Attribute F-Id

Table 4

F-Id Other Female Attribute


Table 5
Note: A binary relationship with 1:1 cardinality needs two tables if both entities participate only partially in
the relationship. If at least one entity has total participation, a single table is sufficient.
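A hedged DDL sketch of the two resulting tables (Table 4 and Table 5); Male_Name and Female_Name are hypothetical stand-ins for the "other attribute" columns:

CREATE TABLE FEMALE (
    F_Id        VARCHAR(10) PRIMARY KEY,
    Female_Name VARCHAR(50)                -- hypothetical "Other Female Attribute"
);

CREATE TABLE MALE (
    M_Id      VARCHAR(10) PRIMARY KEY,
    Male_Name VARCHAR(50),                 -- hypothetical "Other Male Attribute"
    F_Id      VARCHAR(10) UNIQUE,          -- NULL for an unmarried male
    FOREIGN KEY (F_Id) REFERENCES FEMALE (F_Id)
);
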
Case 3: Binary Relationship with n:1 cardinality

In this scenario, every student can enroll in only one elective course, but an elective course can have
more than one student. First, convert each entity and relationship to tables. The Student table
corresponds to the Student entity with S-Id as key. Similarly, the Elective_Course table corresponds to the
Elective_Course entity with E-Id as key. The Enrolls table represents the relationship between Student and
Elective_Course (which student enrolls in which course), so it takes the attribute S-Id from Student and
E-Id from Elective_Course.

Student Enrolls Elective_Course

S-Id Other Student Attribute S-Id E-Id E-Id Other Elective Course Attribute

S1 – S1 E1 E1 –

S2 – S2 E2 E2 –

S3 – S3 E1 E3 –

S4 – S4 E1

Table 6
As we can see from Table 6, S-Id does not repeat in the Enrolls table, so it can be considered the key of the
Enrolls table. Since the keys of the Student and Enrolls tables are the same, we can merge them into a single table. The
resultant tables are shown in Table 7 and Table 8. Primary keys have been underlined.

S-Id Other Student Attribute E-Id


Table 7

E-Id Other Elective Course Attribute

Table 8
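A hedged DDL sketch of Table 7 and Table 8; Student_Name and Course_Name are hypothetical stand-ins for the "other attribute" columns:

CREATE TABLE ELECTIVE_COURSE (
    E_Id        VARCHAR(10) PRIMARY KEY,
    Course_Name VARCHAR(50)                -- hypothetical "Other Elective Course Attribute"
);

CREATE TABLE STUDENT (
    S_Id         VARCHAR(10) PRIMARY KEY,
    Student_Name VARCHAR(50),              -- hypothetical "Other Student Attribute"
    E_Id         VARCHAR(10),              -- many students may reference the same course
    FOREIGN KEY (E_Id) REFERENCES ELECTIVE_COURSE (E_Id)
);
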
Case 4: Binary Relationship with m:n cardinality

In this scenario, every student can enroll in more than one compulsory course, and a compulsory course
can have more than one student. First, convert each entity and relationship to tables. The Student table
corresponds to the Student entity with S-Id as key. Similarly, the Compulsory_Courses table corresponds to the
Compulsory_Courses entity with C-Id as key. The Enrolls table represents the relationship between Student
and Compulsory_Courses (which student enrolls in which course), so it takes the attribute S-Id from
Student and C-Id from Compulsory_Courses.

Student Enrolls Compulsory_Courses

S-Id Other Student Attribute S-Id C-Id C-Id Other Compulsory Course Attribute

S1 – S1 C1 C1 –

S2 – S1 C2 C2 –

S3 – S3 C1 C3 –

S4 – S4 C3 C4 –

S4 C2

S3 C3

Table 9
As we can see from Table 9, S-Id and C-Id both repeat in the Enrolls table, but their combination is
unique, so it can be considered the key of the Enrolls table. Since all the tables' keys are different, the tables can't be
merged. Primary keys of all tables have been underlined.
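A hedged DDL sketch of the three tables of Case 4, with the Enrolls table carrying a composite primary key (Student_Name and Course_Name are hypothetical stand-ins):

CREATE TABLE STUDENT (
    S_Id         VARCHAR(10) PRIMARY KEY,
    Student_Name VARCHAR(50)
);

CREATE TABLE COMPULSORY_COURSES (
    C_Id        VARCHAR(10) PRIMARY KEY,
    Course_Name VARCHAR(50)
);

CREATE TABLE ENROLLS (
    S_Id VARCHAR(10),
    C_Id VARCHAR(10),
    PRIMARY KEY (S_Id, C_Id),              -- the combination is the key
    FOREIGN KEY (S_Id) REFERENCES STUDENT (S_Id),
    FOREIGN KEY (C_Id) REFERENCES COMPULSORY_COURSES (C_Id)
);
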
Case 5: Binary Relationship with weak entity

In this scenario, an employee can have many dependents, and one dependent can depend on only one
employee. A dependent does not have any existence without an employee (e.g., you as a child can be a
dependent of your father in his company). So Dependent is a weak entity, and its participation will always
be total. A weak entity does not have a key of its own, so its key will be the combination of the key of its
identifying entity (E-Id of Employee in this case) and its partial key (D-Name).
First, convert each entity and relationship to tables. The Employee table corresponds to the Employee entity
with E-Id as key. Similarly, the Dependents table corresponds to the Dependent entity with the pair D-Name
and E-Id as key. The Has table represents the relationship between Employee and Dependents (which employee
has which dependents), so it takes the attribute E-Id from Employee and D-Name from Dependents.

Employee Has Dependents

E-Id Other Employee Attribute E-Id D-Name D-Name E-Id Other Dependents Attribute

E1 – E1 RAM RAM E1 –

E2 – E1 SRINI SRINI E1 –

E3 – E2 RAM RAM E2 –

E3 ASHISH ASHISH E3 –

Table 10
As we can see from Table 10, the pair (E-Id, D-Name) is the key for the Has table as well as the Dependents table, so
we can merge these two into one. The resultant tables are shown in Tables 11 and 12. Primary keys of all
tables have been underlined.
E-Id Other Employee Attribute

Table 11

D-Name E-Id Other Dependents Attribute

Table 12
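A hedged DDL sketch of Tables 11 and 12, showing the weak entity's composite key (Employee_Name and Dep_Age are hypothetical stand-ins for the "other attribute" columns):

CREATE TABLE EMPLOYEE (
    E_Id          VARCHAR(10) PRIMARY KEY,
    Employee_Name VARCHAR(50)
);

CREATE TABLE DEPENDENTS (
    D_Name  VARCHAR(50),
    E_Id    VARCHAR(10),
    Dep_Age INT,                           -- hypothetical "Other Dependents Attribute"
    PRIMARY KEY (E_Id, D_Name),            -- identifying entity's key + partial key
    FOREIGN KEY (E_Id) REFERENCES EMPLOYEE (E_Id)
);
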
The ER model, when conceptualized into diagrams, gives a good overview of entities and relationships, which is easy to
understand. ER diagrams can be mapped to a relational schema, that is, it is possible to create a relational schema
using an ER diagram. We cannot import all the ER constraints into the relational model, but an approximate schema can
be generated.
There are several processes and algorithms available to convert ER diagrams into relational schemas. Some of
them are automated and some are manual. Here we focus on mapping the diagram contents to
relational basics.
ER diagrams mainly comprise −

 Entity and its attributes


 Relationship, which is association among entities.

Mapping Entity
An entity is a real-world object with some attributes.

Mapping Process (Algorithm)


 Create table for each entity.
 Entity's attributes should become fields of tables with their respective data types.
 Declare primary key.

Mapping Relationship
A relationship is an association among entities.
Mapping Process
 Create table for a relationship.
 Add the primary keys of all participating Entities as fields of table with their respective data types.
 If relationship has any attribute, add each attribute as field of table.
 Declare a primary key composing all the primary keys of participating entities.
 Declare all foreign key constraints.

Mapping Weak Entity Sets


A weak entity set is one which does not have any primary key associated with it.

Mapping Process
 Create table for weak entity set.
 Add all its attributes to table as field.
 Add the primary key of identifying entity set.
 Declare all foreign key constraints.

Mapping Hierarchical Entities


ER specialization or generalization comes in the form of hierarchical entity sets.
Mapping Process
 Create tables for all higher-level entities.
 Create tables for lower-level entities.
 Add primary keys of higher-level entities in the table of lower-level entities.
 In lower-level tables, add all other attributes of lower-level entities.
 Declare primary key of higher-level table and the primary key for lower-level table.
 Declare foreign key constraints.

Generalization, Specialization and Aggregation in ER Model


Generalization –
Generalization is the process of extracting common properties from a set of entities and creating
a generalized entity from them. It is a bottom-up approach in which two or more entities can be
generalized to a higher-level entity if they have some attributes in common. For example,
STUDENT and FACULTY can be generalized to a higher-level entity called PERSON as
shown in Figure 1. In this case, common attributes like P_NAME and P_ADD become part of the
higher entity (PERSON), and specialized attributes like S_FEE become part of the specialized
entity (STUDENT).
Specialization –
In specialization, an entity is divided into sub-entities based on its characteristics. It is a top-
down approach where a higher-level entity is specialized into two or more lower-level entities.
For example, the EMPLOYEE entity in an employee management system can be specialized into
DEVELOPER, TESTER, etc. as shown in Figure 2. In this case, common attributes like
E_NAME and E_SAL become part of the higher entity (EMPLOYEE), and specialized attributes
like TES_TYPE become part of the specialized entity (TESTER).
Aggregation –
An ER diagram is not capable of representing a relationship between an entity and a
relationship, which may be required in some scenarios. In those cases, a relationship with its
corresponding entities is aggregated into a higher-level entity. Aggregation is an abstraction
through which we can represent relationships as higher-level entity sets.
For example, an employee working on a project may require some machinery. So, a REQUIRE
relationship is needed between the relationship WORKS_FOR and the entity MACHINERY. Using
aggregation, the WORKS_FOR relationship with its entities EMPLOYEE and PROJECT is
aggregated into a single entity, and the relationship REQUIRE is created between the aggregated entity
and MACHINERY.
Relational Algebra
Relational database systems are expected to be equipped with a query language that can assist its users to query
the database instances. There are two kinds of query languages − relational algebra and relational calculus.

Relational Algebra
Relational algebra is a procedural query language, which takes relations as input and generates relations as
output. Relational algebra mainly provides the theoretical foundation for relational databases and SQL.

It uses operators to perform queries. An operator can be either unary or binary. They accept relations as their
input and yield relations as their output. Relational algebra is performed recursively on a relation and
intermediate results are also considered relations.
Relational algebra is a procedural query language that works on the relational model. The purpose of a query
language is to retrieve data from the database or perform various operations such as insert, update, and delete on
the data. Saying that relational algebra is a procedural query language means that it tells both what data
is to be retrieved and how it is to be retrieved.

On the other hand, relational calculus is a non-procedural query language, which means it tells what data
is to be retrieved but doesn't tell how to retrieve it.

The fundamental operations of relational algebra are as follows −

 Select
 Project
 Union
 Set difference
 Cartesian product
 Rename
We will discuss all these operations in the following sections.

Select Operation (σ)


It selects tuples that satisfy the given predicate from a relation.
Notation − σp(r)
Where σ stands for the selection predicate and r stands for the relation. p is a propositional logic formula which may use
connectors like and, or, and not. These terms may use relational operators like =, ≠, ≥, <, >, ≤.
For example −
σsubject = "database"(Books)
Output − Selects tuples from books where subject is 'database'.
σsubject = "database" and price = "450"(Books)
Output − Selects tuples from books where subject is 'database' and 'price' is 450.
σsubject = "database" and price = "450" or year > "2010"(Books)
Output − Selects tuples from books where subject is 'database' and 'price' is 450 or those books published after
2010.
Selection is used to select the required tuples of a relation.
For the relation R given below,
σ (c>3)R
will select the tuples which have c greater than 3.
Note: the selection operator only selects the required tuples but does not display their attributes. For displaying
chosen attributes, the projection operator is used.
For the selected tuples above, we also need to use projection to display them.
PBM :
R
(A B C)
----------
1 2 4
2 2 3
3 2 3
4 3 4

π (σ (c>3)R ) will show following tuples.

A B C
-------
1 2 4
4 3 4
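
For reference, the same selection-then-projection can be written in SQL, assuming a table R with columns A, B and C as above (add DISTINCT if strict set semantics are wanted):

SELECT A, B, C
FROM R
WHERE C > 3;
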
Project Operation (∏)
It projects column(s) that satisfy a given predicate.
Notation − ∏A1, A2, ..., An (r)

Where A1, A2, ..., An are attribute names of relation r.

Duplicate rows are automatically eliminated, as relation is a set.


For example −
∏subject, author (Books)
Selects and projects columns named as subject and author from the relation Books.
Projection is used to project required column data from a relation.
Example :
R
(A B C)
----------
1 2 4
2 2 3
3 2 3
4 3 4
π (B,C) (R)
B C
-----
2 4
2 3
3 4
Note: By Default projection removes duplicate data.

Union Operation (∪)


The union operation in relational algebra is the same as the union operation in set theory; the only constraint is that,
for the union of two relations, both relations must have the same set of attributes.

It performs binary union between two given relations and is defined as −


r ∪ s = { t | t ∈ r or t ∈ s}
Notation − r U s
Where r and s are either database relations or relation result set (temporary relation).
For a union operation to be valid, the following conditions must hold −

 r, and s must have the same number of attributes.


 Attribute domains must be compatible.
 Duplicate tuples are automatically eliminated.
∏ author (Books) ∪ ∏ author (Articles)
Output − Projects the names of the authors who have either written a book or an article or both.
Syntax
∏regno(R1) ∪ ∏regno(R2)
It displays all the regno of R1 and R2.
Example
Consider two tables R1 and R2 −
Table R1 is as follows −

Regno Branch Section

1 CSE A

2 ECE B

3 MECH B

4 CIVIL A

5 CSE B

Table R2 is as follows −

Regno Branch Section

1 CIVIL A

2 CSE A

3 ECE B

To display all the regno of R1 and R2, use the following command −
∏regno(R1) ∪ ∏regno(R2)
Output
Regno
1
2
3
4
5

Union
Union combines two different results obtained by queries into a single result in the form of a table.
However, the two results must be union compatible (same number and types of columns) if union is to be applied to them.
Union removes all duplicates, if any, from the data and only displays distinct values. If duplicate values are required
in the resultant data, then UNION ALL is used.
An example of union is −
Select Student_Name from Art_Students
UNION
Select Student_Name from Dance_Students
This will display the names of all the students in the tables Art_Students and Dance_Students, i.e., John,
Mary, Damon, and Matt.

Intersection
The intersection operator gives the common data values between the two data sets that are intersected.
The two data sets that are intersected should be similar for the intersection operator to work.
Intersection also removes all duplicates before displaying the result.
An example of intersection is −
Select Student_Name from Art_Students
INTERSECT
Select Student_Name from Dance_Students
This will display the names of the students who appear in both the table Art_Students and the table Dance_Students,
i.e., all the students that have taken both art and dance classes. Those are Mary and Damon in this
example.

Intersection Operator (∩)


The intersection operator is denoted by the ∩ symbol, and it is used to select common rows (tuples) from two
tables (relations).
Let's say we have two relations R1 and R2, both having the same columns, and we want to select all those
tuples (rows) that are present in both relations; in that case we can apply the intersection operation on
these two relations: R1 ∩ R2.

Note: Only those rows that are present in both the tables will appear in the result set.

Syntax of Intersection Operator (∩)

table_name1 ∩ table_name2
Intersection Operator (∩) Example
Consider the following two tables, COURSE and STUDENT.
Table 1: COURSE

Course_Id Student_Name Student_Id


--------- ------------ ----------
C101 Aditya S901
C104 Aditya S901
C106 Steve S911
C109 Paul S921
C115 Lucy S931
Table 2: STUDENT

Student_Id Student_Name Student_Age


------------ ---------- -----------
S901 Aditya 19
S911 Steve 18
S921 Paul 19
S931 Lucy 17
S941 Carl 16
S951 Rick 18
Query:

∏ Student_Name (COURSE) ∩ ∏ Student_Name (STUDENT)


Output:

Student_Name
------------
Aditya
Steve
Paul
Lucy

Set Difference (−)


Set difference in relational algebra is the same as the set difference operation in set theory, with the constraint
that both relations should have the same set of attributes.

The result of set difference operation is tuples, which are present in one relation but are not in the second relation.
Notation − r − s
Finds all the tuples that are present in r but not in s.
∏ author (Books) − ∏ author (Articles)
Output − Provides the name of authors who have written books but not articles.
Minus (−): Minus on two relations R1 and R2 can only be computed if R1 and R2 are union
compatible. The minus operator, when applied on two relations as R1 − R2, will give a relation with
tuples which are in R1 but not in R2. Syntax:
Relation1 - Relation2
To find persons who are students but not employees, we can use the minus operator like this:
STUDENT – EMPLOYEE
Table 1: EMPLOYEE
EMP_NO NAME ADDRESS PHONE AGE

1 RAM DELHI 9455123451 18

5 NARESH HISAR 9782918192 22

6 SWETA RANCHI 9852617621 21

4 SURESH DELHI 9156768971 18

Table 2: STUDENT
ROLL_NO NAME ADDRESS PHONE AGE

1 RAM DELHI 9455123451 18

2 RAMESH GURGAON 9652431543 18

3 SUJIT ROHTAK 9156253131 20

4 SURESH DELHI 9156768971 18

RESULT:
ROLL_NO NAME ADDRESS PHONE AGE

2 RAMESH GURGAON 9652431543 18

3 SUJIT ROHTAK 9156253131 20

Set difference
The set difference operator takes two sets and returns the values that are in the first set but not in the
second set.
An example of set difference is −
Select Student_Name from Art_Students
MINUS
Select Student_Name from Dance_Students
This will display the names of all the students in the table Art_Students but not in the table Dance_Students, i.e.,
the students who are taking art classes but not dance classes.

Cartesian Product (Χ)


Combines information of two different relations into one.
Notation − r Χ s
Where r and s are relations and their output will be defined as −
r Χ s = { q t | q ∈ r and t ∈ s}
σauthor = 'tutorialspoint'(Books Χ Articles)
Output − Yields a relation, which shows all the books and articles written by tutorialspoint.
The cross product of two relations, say A and B, written A X B, results in a relation with all the
attributes of A followed by all the attributes of B. Each record of A is paired with every record of B,
as shown in the example below.
A B
(Name Age Sex ) (Id Course)
------------------ -------------
Ram 14 M 1 DS
Sona 15 F 2 DBMS
kim 20 M

AXB
Name Age Sex Id Course
---------------------------------
Ram 14 M 1 DS
Ram 14 M 2 DBMS
Sona 15 F 1 DS
Sona 15 F 2 DBMS
Kim 20 M 1 DS
Kim 20 M 2 DBMS
Note: if A has ‘n’ tuples and B has ‘m’ tuples then A X B will have ‘n*m’ tuples.
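
In SQL, the same Cartesian product can be written with CROSS JOIN (or a comma-separated FROM list), assuming tables A and B as above:

SELECT *
FROM A CROSS JOIN B;
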

Rename Operation (ρ)


Rename is a unary operation used for renaming attributes of a relation.
ρ (a/b)R will rename the attribute ‘b’ of relation by ‘a’.
The results of relational algebra are also relations but without any name. The rename operation allows us to
rename the output relation. 'rename' operation is denoted with small Greek letter rho ρ.
Notation − ρ x (E)
Where the result of expression E is saved with name of x.
Natural Join (⋈)
Natural join is a binary operator. A natural join between two relations results in the set of all
combinations of tuples in which the values of the common attributes are equal.
Let us see below example

Emp Dep
(Name Id Dept_name ) (Dept_name Manager)
------------------------ ---------------------
A 120 IT Sale Y
B 125 HR Prod Z
C 110 Sale IT A
D 111 IT

Emp ⋈ Dep

Name Id Dept_name Manager


-------------------------------
A 120 IT A
C 110 Sale Y
D 111 IT A
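
A hedged SQL equivalent, assuming tables Emp and Dep share only the Dept_name column and the dialect supports NATURAL JOIN (MySQL, Oracle and PostgreSQL do):

SELECT *
FROM Emp NATURAL JOIN Dep;

-- the same result with an explicit inner-join condition:
SELECT e.Name, e.Id, e.Dept_name, d.Manager
FROM Emp e JOIN Dep d ON e.Dept_name = d.Dept_name;
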

Conditional Join
Conditional join works similarly to natural join. In a natural join, the default condition is equality between the
common attributes, while in a conditional join we can specify any condition, such as greater than, less
than, or not equal.
Let us see below example
R S
(ID Sex Marks) (ID Sex Marks)
------------------ --------------------
1 F 45 10 M 20
2 F 55 11 M 22
3 F 60 12 M 59
Join between R And S with condition R.marks >= S.marks

R.ID R.Sex R.Marks S.ID S.Sex S.Marks


-----------------------------------------------
1 F 45 10 M 20
1 F 45 11 M 22
2 F 55 10 M 20
2 F 55 11 M 22
3 F 60 10 M 20
3 F 60 11 M 22
3 F 60 12 M 59
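
The same conditional (theta) join can be sketched in SQL, assuming tables R and S with a Marks column as above:

SELECT *
FROM R JOIN S ON R.Marks >= S.Marks;
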
The relations used to understand extended operators are STUDENT, STUDENT_SPORTS,
ALL_SPORTS and EMPLOYEE which are shown in Table 1, Table 2, Table 3 and Table 4
respectively.
STUDENT
ROLL_NO NAME ADDRESS PHONE AGE

1 RAM DELHI 9455123451 18

2 RAMESH GURGAON 9652431543 18

3 SUJIT ROHTAK 9156253131 20

4 SURESH DELHI 9156768971 18

Table 1
STUDENT_SPORTS
ROLL_NO SPORTS

1 Badminton

2 Cricket

2 Badminton

4 Badminton

Table 2
ALL_SPORTS
SPORTS

Badminton

Cricket

Table 3
EMPLOYEE
EMP_NO NAME ADDRESS PHONE AGE

1 RAM DELHI 9455123451 18

5 NARESH HISAR 9782918192 22

6 SWETA RANCHI 9852617621 21

4 SURESH DELHI 9156768971 18

Table 4
Intersection (∩): Intersection on two relations R1 and R2 can only be computed if R1 and R2
are union compatible (These two relation should have same number of attributes and
corresponding attributes in two relations have same domain). Intersection operator when
applied on two relations as R1∩R2 will give a relation with tuples which are in R1 as well as
R2. Syntax:
Relation1 ∩ Relation2
Example: Find a person who is student as well as employee- STUDENT ∩ EMPLOYEE
In terms of basic operators (set difference):
STUDENT ∩ EMPLOYEE = STUDENT − (STUDENT − EMPLOYEE)
RESULT:
ROLL_NO NAME ADDRESS PHONE AGE

1 RAM DELHI 9455123451 18

4 SURESH DELHI 9156768971 18

Conditional Join (⋈c): A conditional join is used when you want to join two or more relations
based on some condition. Example: select students whose ROLL_NO is greater than the
EMP_NO of employees:
STUDENT ⋈c (STUDENT.ROLL_NO > EMPLOYEE.EMP_NO) EMPLOYEE
In terms of basic operators (cross product and selection) :
σ (STUDENT.ROLL_NO>EMPLOYEE.EMP_NO) (STUDENT×EMPLOYEE)
RESULT:
ROLL_NO NAME ADDRESS PHONE AGE EMP_NO NAME ADDRESS PHONE AGE

2 RAMESH GURGAON 9652431543 18 1 RAM DELHI 9455123451 18

3 SUJIT ROHTAK 9156253131 20 1 RAM DELHI 9455123451 18

4 SURESH DELHI 9156768971 18 1 RAM DELHI 9455123451 18
Equijoin (⋈): Equijoin is a special case of conditional join where only an equality condition
holds between a pair of attributes. As the values of the two attributes are equal in the result of an equijoin,
only one of the pair of attributes appears in the result.
Example: select students whose ROLL_NO is equal to the EMP_NO of employees:
STUDENT ⋈ (STUDENT.ROLL_NO = EMPLOYEE.EMP_NO) EMPLOYEE
In terms of basic operators (cross product, selection and projection):
∏(STUDENT.ROLL_NO, STUDENT.NAME, STUDENT.ADDRESS, STUDENT.PHONE, STUDENT.AGE, EMPLOYEE.NAME, EMPLOYEE.ADDRESS, EMPLOYEE.PHONE, EMPLOYEE.AGE) (σ
(STUDENT.ROLL_NO=EMPLOYEE.EMP_NO) (STUDENT×EMPLOYEE))
RESULT:

ROLL_NO NAME ADDRESS PHONE AGE NAME ADDRESS PHONE AGE

1 RAM DELHI 9455123451 18 RAM DELHI 9455123451 18

4 SURESH DELHI 9156768971 18 SURESH DELHI 9156768971 18

Natural Join (⋈): It is a special case of equijoin in which the equality condition holds on all
attributes that have the same name in relations R and S (the relations on which the join operation is
applied). While applying a natural join on two relations, there is no need to write the equality
condition explicitly. A natural join will also return the common attributes only once, as their values
will be the same in the resulting relation.
Example: select students whose ROLL_NO is equal to the ROLL_NO of STUDENT_SPORTS:
STUDENT ⋈ STUDENT_SPORTS
In terms of basic operators (cross product, selection and projection):
∏(STUDENT.ROLL_NO, STUDENT.NAME, STUDENT.ADDRESS, STUDENT.PHONE, STUDENT.AGE, STUDENT_SPORTS.SPORTS) (σ (STUDENT.ROLL_NO=STUDENT_SPORTS.ROLL_NO)
(STUDENT×STUDENT_SPORTS))
RESULT:
ROLL_NO NAME ADDRESS PHONE AGE SPORTS

1 RAM DELHI 9455123451 18 Badminton

2 RAMESH GURGAON 9652431543 18 Cricket

2 RAMESH GURGAON 9652431543 18 Badminton

4 SURESH DELHI 9156768971 18 Badminton

Natural join is by default an inner join because tuples which do not satisfy the join condition
do not appear in the result set. For example, the tuple having ROLL_NO 3 in STUDENT does not
match any tuple in STUDENT_SPORTS, so it is not part of the result set.
Left Outer Join (⟕): When applying a join on two relations R and S, the tuples of R or S that do
not satisfy the join condition do not appear in the result set. A left outer join, however, gives
all tuples of R in the result set. The tuples of R which do not satisfy the join condition will have
NULL values for the attributes of S.
Example: select students whose ROLL_NO is greater than the EMP_NO of employees, together with the details
of the other students as well:
STUDENT ⟕ (STUDENT.ROLL_NO > EMPLOYEE.EMP_NO) EMPLOYEE
RESULT:
ROLL_NO NAME ADDRESS PHONE AGE EMP_NO NAME ADDRESS PHONE AGE

2 RAMESH GURGAON 9652431543 18 1 RAM DELHI 9455123451 18

3 SUJIT ROHTAK 9156253131 20 1 RAM DELHI 9455123451 18

4 SURESH DELHI 9156768971 18 1 RAM DELHI 9455123451 18

1 RAM DELHI 9455123451 18 NULL NULL NULL NULL NULL

Right Outer Join (⟖): When applying a join on two relations R and S, the tuples of R or S that do
not satisfy the join condition do not appear in the result set. A right outer join, however,
gives all tuples of S in the result set. The tuples of S which do not satisfy the join condition will
have NULL values for the attributes of R.
Example: select students whose ROLL_NO is greater than the EMP_NO of employees, together with the
details of the other employees as well:
STUDENT ⟖ (STUDENT.ROLL_NO > EMPLOYEE.EMP_NO) EMPLOYEE
RESULT:
ROLL_NO NAME ADDRESS PHONE AGE EMP_NO NAME ADDRESS PHONE AGE

2 RAMESH GURGAON 9652431543 18 1 RAM DELHI 9455123451 18

3 SUJIT ROHTAK 9156253131 20 1 RAM DELHI 9455123451 18

4 SURESH DELHI 9156768971 18 1 RAM DELHI 9455123451 18

NULL NULL NULL NULL NULL 5 NARESH HISAR 9782918192 22

NULL NULL NULL NULL NULL 6 SWETA RANCHI 9852617621 21

NULL NULL NULL NULL NULL 4 SURESH DELHI 9156768971 18
Full Outer Join (⟗): When applying a join on two relations R and S, the tuples of R or S that do
not satisfy the join condition do not appear in the result set. A full outer join, however, gives all
tuples of S and all tuples of R in the result set. The tuples of S which do not satisfy the join
condition will have NULL values for the attributes of R, and vice versa.
Example: select students whose ROLL_NO is greater than the EMP_NO of employees, together with the details
of the other employees as well as the other students:
STUDENT ⟗ (STUDENT.ROLL_NO > EMPLOYEE.EMP_NO) EMPLOYEE
RESULT:
ROLL_NO NAME ADDRESS PHONE AGE EMP_NO NAME ADDRESS PHONE AGE

2 RAMESH GURGAON 9652431543 18 1 RAM DELHI 9455123451 18

3 SUJIT ROHTAK 9156253131 20 1 RAM DELHI 9455123451 18

4 SURESH DELHI 9156768971 18 1 RAM DELHI 9455123451 18

NULL NULL NULL NULL NULL 5 NARESH HISAR 9782918192 22

NULL NULL NULL NULL NULL 6 SWETA RANCHI 9852617621 21

NULL NULL NULL NULL NULL 4 SURESH DELHI 9156768971 18

1 RAM DELHI 9455123451 18 NULL NULL NULL NULL NULL

Division Operator (÷): The division operator A÷B can be applied if and only if:
 The attributes of B are a proper subset of the attributes of A.
 The relation returned by the division operator will have attributes = (all attributes of A −
all attributes of B).
 The relation returned by the division operator will contain those tuples from relation A
which are associated with every tuple of B.
Consider the relations STUDENT_SPORTS and ALL_SPORTS given in Table 2 and Table 3
above.
Applying the division operator as
STUDENT_SPORTS ÷ ALL_SPORTS
 The operation is valid, as the attributes of ALL_SPORTS are a proper subset of the attributes
of STUDENT_SPORTS.
 The resulting relation will have attributes {ROLL_NO, SPORTS} −
{SPORTS} = {ROLL_NO}.
 The tuples in the resulting relation will be those ROLL_NO values which are associated with
all of B's tuples {Badminton, Cricket}. ROLL_NO 1 and 4 are associated with Badminton
only. ROLL_NO 2 is associated with all tuples of B. So the resulting relation will be:
ROLL_NO

2
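
Relational division has no direct SQL operator; a common formulation (a sketch, assuming tables STUDENT_SPORTS(ROLL_NO, SPORTS) and ALL_SPORTS(SPORTS) as above) uses a double NOT EXISTS:

SELECT DISTINCT ss.ROLL_NO
FROM STUDENT_SPORTS ss
WHERE NOT EXISTS (
    SELECT a.SPORTS
    FROM ALL_SPORTS a
    WHERE NOT EXISTS (
        SELECT 1
        FROM STUDENT_SPORTS ss2
        WHERE ss2.ROLL_NO = ss.ROLL_NO
          AND ss2.SPORTS = a.SPORTS
    )
);

-- returns the roll numbers enrolled in every sport of ALL_SPORTS (here, ROLL_NO 2)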

Different Types of SQL JOINs


Here are the different types of JOINs in SQL (a short example follows the list):

 (INNER) JOIN: Returns records that have matching values in both tables
 LEFT (OUTER) JOIN: Returns all records from the left table, and the matched records from
the right table
 RIGHT (OUTER) JOIN: Returns all records from the right table, and the matched records
from the left table
 FULL (OUTER) JOIN: Returns all records when there is a match in either left or right table
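
For instance, a left outer join over the STUDENT and STUDENT_SPORTS tables used earlier could be sketched as:

SELECT s.ROLL_NO, s.NAME, sp.SPORTS
FROM STUDENT s
LEFT JOIN STUDENT_SPORTS sp ON s.ROLL_NO = sp.ROLL_NO;
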

Relational Calculus
In contrast to Relational Algebra, Relational Calculus is a non-procedural query language, that is, it tells what to
do but never explains how to do it.
Relational calculus exists in two forms −

Tuple Relational Calculus (TRC)


Filtering variable ranges over tuples
Notation − {T | Condition}
Returns all tuples T that satisfies a condition.
For example −
{ T.name | Author(T) AND T.article = 'database' }
Output − Returns tuples with 'name' from Author who has written article on 'database'.
TRC can be quantified. We can use Existential (∃) and Universal Quantifiers (∀).
For example −
{ R| ∃T ∈ Authors(T.article='database' AND R.name=T.name)}
Output − The above query will yield the same result as the previous one.

Domain Relational Calculus (DRC)


In DRC, the filtering variable uses the domain of attributes instead of entire tuple values (as done in TRC,
mentioned above).
Notation −
{ a1, a2, a3, ..., an | P(a1, a2, a3, ..., an) }

Where a1, a2, ..., an are attributes and P stands for a formula built using inner attributes.
For example −
{ <article, page, subject> | <article, page, subject> ∈ TutorialsPoint ∧ subject = 'database' }
Output − Yields Article, Page, and Subject from the relation TutorialsPoint, where subject is database.
Just like TRC, DRC can also be written using existential and universal quantifiers. DRC also involves relational
operators.
The expression power of Tuple Relation Calculus and Domain Relation Calculus is equivalent to Relational
Algebra.

SQL
o SQL stands for Structured Query Language. It is used for storing and managing data in a relational database
management system (RDBMS).
o It is a standard language for Relational Database System. It enables a user to create, read, update and delete relational
databases and tables.
o All the RDBMS like MySQL, Informix, Oracle, MS Access and SQL Server use SQL as their standard database
language.
o SQL allows users to query the database in a number of ways, using English-like statements.

Rules:
SQL follows the following rules:

o Structured Query Language is not case-sensitive. Generally, keywords of SQL are written in uppercase.
o SQL statements are not tied to text lines; a single SQL statement can be written on one or multiple text lines.
o Using SQL statements, you can perform most of the actions in a database.
o SQL is based on tuple relational calculus and relational algebra.

SQL process:
o When an SQL command is executed on any RDBMS, the system figures out the best way to carry out the
request, and the SQL engine determines how to interpret the task.
o Various components are included in this process. These components can be the optimization engine, the query engine,
the query dispatcher, the classic query engine, etc.
o All non-SQL queries are handled by the classic query engine, but the SQL query engine won't handle logical files.
SQL is a language to operate databases; it includes database creation, deletion, fetching rows, modifying rows,
etc. SQL is an ANSI (American National Standards Institute) standard language, but there are many different
versions of the SQL language.

Purpose of the Query Optimizer


The optimizer attempts to generate the most optimal execution plan for a SQL statement.

The optimizer chooses the plan with the lowest cost among all considered candidate plans. The optimizer uses available
statistics to calculate cost. For a specific query in a given environment, the cost computation accounts for factors of query
execution such as I/O, CPU, and communication.

For example, a query might request information about employees who are managers. If the optimizer statistics indicate
that 80% of employees are managers, then the optimizer may decide that a full table scan is most efficient. However, if
statistics indicate that very few employees are managers, then reading an index followed by a table access by rowid may be
more efficient than a full table scan.

Because the database has many internal statistics and tools at its disposal, the optimizer is usually in a better position than
the user to determine the optimal method of statement execution. For this reason, all SQL statements use the optimizer.

What is SQL?
SQL is Structured Query Language, which is a computer language for storing, manipulating and retrieving data
stored in a relational database.
SQL is the standard language for relational database systems. All the Relational Database Management Systems
(RDBMS) like MySQL, MS Access, Oracle, Sybase, Informix, Postgres and SQL Server use SQL as their standard
database language.
Also, they are using different dialects, such as −

 MS SQL Server using T-SQL,


 Oracle using PL/SQL,
 MS Access version of SQL is called JET SQL (native format) etc.

Why SQL?
SQL is widely popular because it offers the following advantages −
 Allows users to access data in the relational database management systems.
 Allows users to describe the data.
 Allows users to define the data in a database and manipulate that data.
 Allows to embed within other languages using SQL modules, libraries & pre-compilers.
 Allows users to create and drop databases and tables.
 Allows users to create view, stored procedure, functions in a database.
 Allows users to set permissions on tables, procedures and views.

A Brief History of SQL


 1970 − Dr. Edgar F. "Ted" Codd of IBM is known as the father of relational databases. He described a
relational model for databases.
 1974 − Structured Query Language appeared.
 1978 − IBM worked to develop Codd's ideas and released a product named System/R.
 1986 − IBM developed the first prototype of a relational database, which was standardized by ANSI. The first
commercial relational database was released by Relational Software, which later came to be known as Oracle.

SQL Process
When you are executing an SQL command for any RDBMS, the system determines the best way to carry out
your request and SQL engine figures out how to interpret the task.
There are various components included in this process.
These components are −

 Query Dispatcher
 Optimization Engines
 Classic Query Engine
 SQL Query Engine, etc.
A classic query engine handles all the non-SQL queries, but a SQL query engine won't handle logical files.
SQL Commands
The standard SQL commands to interact with relational databases are CREATE, SELECT, INSERT, UPDATE,
DELETE and DROP. These commands can be classified into the following groups based on their nature −

Types of SQL Commands


There are five types of SQL commands: DDL, DML, DCL, TCL, and DQL.
1. Data Definition Language (DDL)
o DDL changes the structure of the table like creating a table, deleting a table, altering a table, etc.
o All DDL commands are auto-committed, which means they permanently save all changes in the
database.

Here are some commands that come under DDL:

o CREATE
o ALTER
o DROP
o TRUNCATE

a. CREATE It is used to create a new table in the database.

Syntax:

1. CREATE TABLE TABLE_NAME (COLUMN_NAME DATATYPES[,....]);

Example:

1. CREATE TABLE EMPLOYEE(Name VARCHAR2(20), Email VARCHAR2(100), DOB DATE);

b. DROP: It is used to delete both the structure and record stored in the table.

Syntax
1. DROP TABLE table_name;

Example

1. DROP TABLE EMPLOYEE;

c. ALTER: It is used to alter the structure of the database. This change could be either to modify the
characteristics of an existing attribute or probably to add a new attribute.

Syntax:

To add a new column in the table

1. ALTER TABLE table_name ADD column_name COLUMN-definition;

To modify existing column in the table:

1. ALTER TABLE table_name MODIFY(column_definitions....);

EXAMPLE

1. ALTER TABLE STU_DETAILS ADD(ADDRESS VARCHAR2(20));


2. ALTER TABLE STU_DETAILS MODIFY (NAME VARCHAR2(20));

d. TRUNCATE: It is used to delete all the rows from the table and free the space containing the table.

Syntax:

1. TRUNCATE TABLE table_name;

Example:

1. TRUNCATE TABLE EMPLOYEE;


2. Data Manipulation Language
o DML commands are used to modify the database. They are responsible for all forms of changes in the database.
o DML commands are not auto-committed, which means the changes they make are not saved permanently in the
database until committed. They can be rolled back.

Here are some commands that come under DML:

o INSERT
o UPDATE
o DELETE

a. INSERT: The INSERT statement is a SQL query. It is used to insert data into a row of a table.

Syntax:
1. INSERT INTO TABLE_NAME
2. (col1, col2, col3,.... col N)
3. VALUES (value1, value2, value3, .... valueN);

Or

1. INSERT INTO TABLE_NAME


2. VALUES (value1, value2, value3, .... valueN);

For example:

1. INSERT INTO javatpoint (Author, Subject) VALUES ("Sonoo", "DBMS");

b. UPDATE: This command is used to update or modify the value of a column in the table.

Syntax:

1. UPDATE table_name SET [column_name1 = value1, ... column_nameN = valueN] [WHERE CONDITION];

For example:

1. UPDATE students
2. SET User_Name = 'Sonoo'
3. WHERE Student_Id = '3'

c. DELETE: It is used to remove one or more rows from a table.

Syntax:

1. DELETE FROM table_name [WHERE condition];

For example:

1. DELETE FROM javatpoint


2. WHERE Author="Sonoo";

3. Data Control Language
DCL commands are used to grant and take back authority from any database user.

Here are some commands that come under DCL:

o Grant
o Revoke

a. Grant: It is used to give user access privileges to a database.

Example
1. GRANT SELECT, UPDATE ON MY_TABLE TO SOME_USER, ANOTHER_USER;

b. Revoke: It is used to take back permissions from the user.

Example

1. REVOKE SELECT, UPDATE ON MY_TABLE FROM USER1, USER2;


4. Transaction Control Language
TCL commands can be used only together with DML commands like INSERT, DELETE and UPDATE.

DDL operations are automatically committed in the database, which is why TCL commands cannot be used while
creating tables or dropping them.

Here are some commands that come under TCL:

o COMMIT
o ROLLBACK
o SAVEPOINT

a. Commit: Commit command is used to save all the transactions to the database.

Syntax:

1. COMMIT;

Example:

1. DELETE FROM CUSTOMERS


2. WHERE AGE = 25;
3. COMMIT;

b. Rollback: Rollback command is used to undo transactions that have not already been saved to the
database.

Syntax:

1. ROLLBACK;

Example:

1. DELETE FROM CUSTOMERS


2. WHERE AGE = 25;
3. ROLLBACK;

c. SAVEPOINT: It is used to roll the transaction back to a certain point without rolling back the entire
transaction.

Syntax:
1. SAVEPOINT SAVEPOINT_NAME;
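
A hedged usage sketch, reusing the CUSTOMERS table from the COMMIT/ROLLBACK examples (the savepoint name SP1 and the column list are illustrative):

INSERT INTO CUSTOMERS (ID, NAME, AGE) VALUES (7, 'RAVI', 25);
SAVEPOINT SP1;
DELETE FROM CUSTOMERS WHERE AGE = 25;
ROLLBACK TO SAVEPOINT SP1;   -- undoes the DELETE but keeps the INSERT
COMMIT;
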
5. Data Query Language
DQL is used to fetch the data from the database.

It uses only one command:

o SELECT

a. SELECT: This is the same as the projection operation of relational algebra. It is used to select the
attribute based on the condition described by WHERE clause.

Syntax:

1. SELECT expressions
2. FROM TABLES
3. WHERE conditions;

For example:

1. SELECT emp_name
2. FROM employee
3. WHERE age > 20;

Transaction Control Language :

1. COMMIT
COMMIT in SQL is a transaction control command that is used to permanently save the
changes made by the transaction in the tables/database. The database cannot regain its previous
state after the execution of COMMIT.
Example: Consider the following STAFF table with records:
STAFF

sql>
SELECT *
FROM Staff
WHERE Allowance = 400;

sql> COMMIT;
Output:

So, the SELECT statement produced the output consisting of three rows.
2. ROLLBACK
ROLLBACK in SQL is a transaction control command that is used to undo the transactions
that have not been saved in the database. The command can only be used to undo changes
made since the last COMMIT.
Example: Consider the following STAFF table with records:
STAFF

sql>
SELECT *
FROM Staff
WHERE ALLOWANCE = 400;

sql> ROLLBACK;
Output:

So, the SELECT statement produced the same output with the ROLLBACK command.

Difference between COMMIT and ROLLBACK


1. COMMIT permanently saves the changes made by the current transaction, whereas ROLLBACK undoes the changes made by the current transaction.

2. The transaction cannot undo changes after COMMIT has executed, whereas the transaction reaches its previous state after ROLLBACK.

3. COMMIT is applied when the transaction is successful, whereas ROLLBACK occurs when the transaction is aborted.

Char vs Varchar
The basic difference between Char and Varchar is that: char stores only fixed-length character
string data types whereas varchar stores variable-length string where an upper limit of length is
specified.
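
A short illustrative sketch (the table and column names are hypothetical):

CREATE TABLE DEMO (
    Code CHAR(5),      -- fixed length: always stored as 5 characters, padded with spaces
    Name VARCHAR(50)   -- variable length: stores only the characters supplied, up to 50
);
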

Prime Attribute and Non Prime Attribute :


An attribute that is not part of any candidate key is known as non-prime attribute. An attribute that is a part of
one of the candidate keys is known as prime attribute.


DBMS Keys
o Keys play an important role in the relational database.
o It is used to uniquely identify any record or row of data from the table. It is also used to establish and
identify relationships between tables.

For example: In Student table, ID is used as a key because it is unique for each student. In PERSON
table, passport_number, license_number, SSN are keys since they are unique for each person.

Types of key:
1. Primary key
o It is the key chosen to identify one and only one instance of an entity uniquely. An entity can contain
multiple keys, as we saw in the PERSON table. The key which is most suitable from that list becomes the
primary key.
o In the EMPLOYEE table, ID can be the primary key since it is unique for each employee. In the EMPLOYEE
table, we can even select License_Number and Passport_Number as primary keys since they are also
unique.
o For each entity, the primary key selection is based on requirements and the developer's judgment.

2. Candidate key
o A candidate key is an attribute or set of attributes that can uniquely identify a tuple.
o Except for the primary key, the remaining attributes are considered a candidate key. The candidate keys
are as strong as the primary key.

For example: In the EMPLOYEE table, id is best suited for the primary key. The rest of the attributes, like
SSN, Passport_Number, License_Number, etc., are considered a candidate key.
3. Super Key
Super key is an attribute set that can uniquely identify a tuple. A super key is a superset of a candidate
key.

For example: In the above EMPLOYEE table, for (EMPLOYEE_ID, EMPLOYEE_NAME), the names of two
employees can be the same, but their EMPLOYEE_ID can't be the same. Hence, this combination can also
be a key.

The super keys would be EMPLOYEE_ID, (EMPLOYEE_ID, EMPLOYEE_NAME), etc.

4. Foreign key
o Foreign keys are the column of the table used to point to the primary key of another table.
o Every employee works in a specific department in a company, and employee and department are two
different entities. So we can't store the department's information in the employee table. That's why we link
these two tables through the primary key of one table.
o We add the primary key of the DEPARTMENT table, Department_Id, as a new attribute in the EMPLOYEE
table.
o In the EMPLOYEE table, Department_Id is the foreign key, and both the tables are related.
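
A hedged DDL sketch of this EMPLOYEE/DEPARTMENT link (the non-key column names are illustrative):

CREATE TABLE DEPARTMENT (
    Department_Id   INT PRIMARY KEY,
    Department_Name VARCHAR(50)
);

CREATE TABLE EMPLOYEE (
    Employee_Id   INT PRIMARY KEY,
    Employee_Name VARCHAR(50),
    Department_Id INT,
    FOREIGN KEY (Department_Id) REFERENCES DEPARTMENT (Department_Id)
);
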
5. Alternate key
There may be one or more attributes or a combination of attributes that uniquely identify each tuple in
a relation. These attributes or combinations of the attributes are called the candidate keys. One key is
chosen as the primary key from these candidate keys, and the remaining candidate key, if it exists, is
termed the alternate key. In other words, the total number of the alternate keys is the total number of
candidate keys minus the primary key. The alternate key may or may not exist. If there is only one
candidate key in a relation, it does not have an alternate key.

For example, employee relation has two attributes, Employee_Id and PAN_No, that act as candidate
keys. In this relation, Employee_Id is chosen as the primary key, so the other candidate key, PAN_No,
acts as the Alternate key.

6. Composite key
Whenever a primary key consists of more than one attribute, it is known as a composite key. This key is
also known as Concatenated Key.

For example, in employee relations, we assume that an employee may be assigned multiple roles, and
an employee may work on multiple projects simultaneously. So the primary key will be composed of all
three attributes, namely Emp_ID, Emp_role, and Proj_ID in combination. So these attributes act as a
composite key since the primary key comprises more than one attribute.
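
A hedged sketch of such a composite key, using the attributes named above (the table name EMPLOYEE_ROLE is illustrative):

CREATE TABLE EMPLOYEE_ROLE (
    Emp_ID   INT,
    Emp_role VARCHAR(30),
    Proj_ID  INT,
    PRIMARY KEY (Emp_ID, Emp_role, Proj_ID)   -- composite (concatenated) key
);
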
7. Artificial key
The key created using arbitrarily assigned data are known as artificial keys. These keys are created when
a primary key is large and complex and has no relationship with many other relations. The data values of
the artificial keys are usually numbered in a serial order.

For example, the primary key, which is composed of Emp_ID, Emp_role, and Proj_ID, is large in
employee relations. So it would be better to add a new virtual attribute to identify each tuple in the
relation uniquely.

Normalization in DBMS: 1NF, 2NF, 3NF and BCNF in Database

Normalization is a process of organizing the data in a database to avoid data redundancy, insertion
anomalies, update anomalies and deletion anomalies. Let's discuss anomalies first, and then we will discuss
normal forms with examples.

Anomalies in DBMS
There are three types of anomalies that occur when the database is not normalized. These are – Insertion,
update and deletion anomaly. Let’s take an example to understand this.
Example: Suppose a manufacturing company stores the employee details in a table named employee that
has four attributes: emp_id for storing the employee's id, emp_name for storing the employee's name,
emp_address for storing the employee's address and emp_dept for storing the details of the department in
which the employee works. At some point in time the table looks like this:

emp_id emp_name emp_address emp_dept

101 Rick Delhi D001

101 Rick Delhi D002

123 Maggie Agra D890

166 Glenn Chennai D900

166 Glenn Chennai D004

The above table is not normalized. We will see the problems that we face when a table is not normalized.

Update anomaly: In the above table we have two rows for employee Rick as he belongs to two
departments of the company. If we want to update the address of Rick then we have to update the same in
two rows or the data will become inconsistent. If somehow, the correct address gets updated in one
department but not in other then as per the database, Rick would be having two different addresses, which
is not correct and would lead to inconsistent data.

Insert anomaly: Suppose a new employee joins the company, who is under training and currently not
assigned to any department then we would not be able to insert the data into the table if emp_dept field
doesn’t allow nulls.

Delete anomaly: Suppose that at some point in time the company closes the department D890. Then deleting the
rows that have emp_dept as D890 would also delete the information of employee Maggie, since she
is assigned only to this department.

To overcome these anomalies we need to normalize the data. In the next section we will discuss about
normalization.
Normalization
Here are the most commonly used normal forms:

 First normal form (1NF)
 Second normal form (2NF)
 Third normal form (3NF)
 Boyce & Codd normal form (BCNF)

First normal form (1NF)


As per the rule of first normal form, an attribute (column) of a table cannot hold multiple values. It should
hold only atomic values.

Example: Suppose a company wants to store the names and contact details of its employees. It creates a
table that looks like this:

emp_id emp_name emp_address emp_mobile

101 Herschel New Delhi 8912312390

102 Jon Kanpur 8812121212, 9900012222

103 Ron Chennai 7778881212

104 Lester Bangalore 9990000123, 8123450987

Two employees (Jon & Lester) have two mobile numbers, so the company stored them in the same
field, as you can see in the table above.

This table is not in 1NF: the rule says "each attribute of a table must have atomic (single) values", and the
emp_mobile values for employees Jon & Lester violate that rule.

To make the table comply with 1NF we should have the data like this:
emp_id emp_name emp_address emp_mobile

101 Herschel New Delhi 8912312390

102 Jon Kanpur 8812121212

102 Jon Kanpur 9900012222

103 Ron Chennai 7778881212

104 Lester Bangalore 9990000123

104 Lester Bangalore 8123450987
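
A minimal SQL sketch of the 1NF table above (column types are assumed): each mobile number goes into its own row, so every column holds only atomic values.

CREATE TABLE employee_contact (
  emp_id      NUMBER,
  emp_name    VARCHAR2(30),
  emp_address VARCHAR2(50),
  emp_mobile  VARCHAR2(15),
  PRIMARY KEY (emp_id, emp_mobile)   -- one row per employee per mobile number
);

INSERT INTO employee_contact VALUES (102, 'Jon', 'Kanpur', '8812121212');
INSERT INTO employee_contact VALUES (102, 'Jon', 'Kanpur', '9900012222');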

Second normal form (2NF)


A table is said to be in 2NF if both the following conditions hold:

 Table is in 1NF (First normal form)


 No non-prime attribute is dependent on the proper subset of any candidate key of table.

An attribute that is not part of any candidate key is known as non-prime attribute.

Example: Suppose a school wants to store the data of teachers and the subjects they teach. They create a
table that looks like this. Since a teacher can teach more than one subject, the table can have multiple
rows for the same teacher.

teacher_id Subject teacher_age


111 Maths 38

111 Physics 38

222 Biology 38

333 Physics 40

333 Chemistry 40

Candidate Keys: {teacher_id, subject}


Non prime attribute: teacher_age

The table is in 1NF because each attribute has atomic values. However, it is not in 2NF because the non-prime
attribute teacher_age is dependent on teacher_id alone, which is a proper subset of the candidate key.
This violates the rule for 2NF, which says "no non-prime attribute is dependent on the proper subset
of any candidate key of the table".

To make the table comply with 2NF we can break it into two tables like this:
teacher_details table:

teacher_id teacher_age

111 38

222 38

333 40

teacher_subject table:
teacher_id subject

111 Maths

111 Physics

222 Biology

333 Physics

333 Chemistry

Now the tables comply with Second normal form (2NF).
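
The decomposition can be sketched in SQL as follows (column types are assumed):

CREATE TABLE teacher_details (
  teacher_id  NUMBER PRIMARY KEY,     -- teacher_age now depends on the whole key
  teacher_age NUMBER
);

CREATE TABLE teacher_subject (
  teacher_id NUMBER REFERENCES teacher_details (teacher_id),
  subject    VARCHAR2(30),
  PRIMARY KEY (teacher_id, subject)   -- the original candidate key
);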

Third Normal form (3NF)


A table design is said to be in 3NF if both the following conditions hold:

 Table must be in 2NF


 Transitive functional dependency of non-prime attribute on any super key should be
removed.

An attribute that is not part of any candidate key is known as non-prime attribute.

In other words 3NF can be explained like this: A table is in 3NF if it is in 2NF and for each functional
dependency X-> Y at least one of the following conditions hold:

 X is a super key of table


 Y is a prime attribute of table

An attribute that is a part of one of the candidate keys is known as prime attribute.

Example: Suppose a company wants to store the complete address of each employee, they create a table
named employee_details that looks like this:
emp_id emp_name emp_zip emp_state emp_city emp_district

1001 John 282005 UP Agra Dayal Bagh

1002 Ajeet 222008 TN Chennai M-City

1006 Lora 282007 TN Chennai Urrapakkam

1101 Lilly 292008 UK Pauri Bhagwan

1201 Steve 222999 MP Gwalior Ratan

Super keys: {emp_id}, {emp_id, emp_name}, {emp_id, emp_name, emp_zip}…so on


Candidate Keys: {emp_id}
Non-prime attributes: all attributes except emp_id are non-prime as they are not part of any candidate
keys.

Here, emp_state, emp_city & emp_district are dependent on emp_zip, and emp_zip is dependent on emp_id.
That makes the non-prime attributes (emp_state, emp_city & emp_district) transitively dependent on the super
key (emp_id). This violates the rule of 3NF.

To make this table comply with 3NF we have to break the table into two tables to remove the transitive
dependency:

employee table:

emp_id emp_name emp_zip

1001 John 282005


1002 Ajeet 222008

1006 Lora 282007

1101 Lilly 292008

1201 Steve 222999

employee_zip table:

emp_zip emp_state emp_city emp_district

282005 UP Agra Dayal Bagh

222008 TN Chennai M-City

282007 TN Chennai Urrapakkam

292008 UK Pauri Bhagwan

222999 MP Gwalior Ratan
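
A possible SQL sketch of this 3NF decomposition (column types are assumed):

CREATE TABLE employee_zip (
  emp_zip      VARCHAR2(10) PRIMARY KEY,   -- state, city and district depend only on the zip
  emp_state    VARCHAR2(30),
  emp_city     VARCHAR2(30),
  emp_district VARCHAR2(30)
);

CREATE TABLE employee (
  emp_id   NUMBER PRIMARY KEY,
  emp_name VARCHAR2(30),
  emp_zip  VARCHAR2(10) REFERENCES employee_zip (emp_zip)
);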

Boyce Codd normal form (BCNF)


It is an advanced version of 3NF, which is why it is also referred to as 3.5NF. BCNF is stricter than 3NF. A table
complies with BCNF if it is in 3NF and, for every functional dependency X->Y, X is a super
key of the table.
Example: Suppose there is a company wherein employees work in more than one department. They
store the data like this:

emp_id emp_nationality emp_dept dept_type dept_no_of_emp

1001 Austrian Production and planning D001 200

1001 Austrian Stores D001 250

1002 American design and technical support D134 100

1002 American Purchasing department D134 600

Functional dependencies in the table above:


emp_id -> emp_nationality
emp_dept -> {dept_type, dept_no_of_emp}

Candidate key: {emp_id, emp_dept}

The table is not in BCNF as neither emp_id nor emp_dept alone is a key.

To make the table comply with BCNF we can break the table into three tables like this:
emp_nationality table:

emp_id emp_nationality

1001 Austrian

1002 American

emp_dept table:
emp_dept dept_type dept_no_of_emp

Production and planning D001 200

Stores D001 250

design and technical support D134 100

Purchasing department D134 600

emp_dept_mapping table:

emp_id emp_dept

1001 Production and planning

1001 Stores

1002 design and technical support

1002 Purchasing department

Functional dependencies:
emp_id -> emp_nationality
emp_dept -> {dept_type, dept_no_of_emp}

Candidate keys:
For first table: emp_id
For second table: emp_dept
For third table: {emp_id, emp_dept}

This is now in BCNF as, in both the functional dependencies, the left-hand side is a key.

Functional dependency in DBMS


The attributes of a table are said to be dependent on each other when an attribute of a table uniquely
identifies another attribute of the same table.

For example: Suppose we have a student table with attributes: Stu_Id, Stu_Name, Stu_Age. Here Stu_Id
attribute uniquely identifies the Stu_Name attribute of student table because if we know the student id we
can tell the student name associated with it. This is known as functional dependency and can be written as
Stu_Id->Stu_Name or in words we can say Stu_Name is functionally dependent on Stu_Id.

Formally:
If column A of a table uniquely identifies column B of the same table, then it can be represented as A -> B
(attribute B is functionally dependent on attribute A).

DBMS Schema
Definition of schema: The design of a database is called the schema. A schema is of three types: physical
schema, logical schema and view schema.

For example: In the following diagram, we have a schema that shows the relationship between three
tables: Course, Student and Section. The diagram only shows the design of the database, it doesn’t show
the data present in those tables. Schema is only a structural view(design) of a database as shown in the
diagram below.

The design of a database at the physical level is called the physical schema; how the data is stored in blocks of
storage is described at this level.

The design of a database at the logical level is called the logical schema. Programmers and database administrators
work at this level. At this level data can be described as certain types of data records stored in particular data
structures; however, internal details such as the implementation of the data structures are hidden at this level
(they are available at the physical level).

The design of a database at the view level is called the view schema. This generally describes end-user interaction
with database systems.

DBMS Instance
Definition of instance: The data stored in a database at a particular moment of time is called an instance of the
database. The database schema defines the variable declarations in the tables that belong to a particular database;
the value of these variables at a moment in time is called the instance of that database.
Unit 2
Distributed Databases, Active Database and Open Database
Connectivity
Distributed databases:
A distributed database is a type of database that combines contributions from a common database with
information captured by local computers. In this type of database system, the data is not kept in one place
but is distributed across various sites or organizations.

Types of Distributed Databases


Distributed databases can be broadly classified into homogeneous and heterogeneous distributed database
environments, each with further sub-divisions, as shown in the following illustration.

Homogeneous Distributed Databases


In a homogeneous distributed database, all the sites use identical DBMS and operating systems. Its properties are

 The sites use very similar software.
 The sites use identical DBMS or DBMS from the same vendor.
 Each site is aware of all other sites and cooperates with other sites to process user requests.
 The database is accessed through a single interface as if it is a single database.

Types of Homogeneous Distributed Database


There are two types of homogeneous distributed database −
 Autonomous − Each database is independent that functions on its own. They are integrated by a
controlling application and use message passing to share data updates.
 Non-autonomous − Data is distributed across the homogeneous nodes and a central or master DBMS co-
ordinates data updates across the sites.

Heterogeneous Distributed Databases


In a heterogeneous distributed database, different sites have different operating systems, DBMS products and data
models. Its properties are −
 Different sites use dissimilar schemas and software.
 The system may be composed of a variety of DBMSs like relational, network, hierarchical or object
oriented.
 Query processing is complex due to dissimilar schemas.
 Transaction processing is complex due to dissimilar software.
 A site may not be aware of other sites and so there is limited co-operation in processing user requests.

Types of Heterogeneous Distributed Databases


 Federated − The heterogeneous database systems are independent in nature and integrated together so that
they function as a single database system.
 Un-federated − The database systems employ a central coordinating module through which the databases
are accessed.

Distributed DBMS Architectures


DDBMS architectures are generally developed depending on three parameters −
 Distribution − It states the physical distribution of data across the different sites.
 Autonomy − It indicates the distribution of control of the database system and the degree to which each
constituent DBMS can operate independently.
 Heterogeneity − It refers to the uniformity or dissimilarity of the data models, system components and
databases.

Architectural Models
Some of the common architectural models are −

 Client - Server Architecture for DDBMS


 Peer - to - Peer Architecture for DDBMS
 Multi - DBMS Architecture

Client - Server Architecture for DDBMS


This is a two-level architecture where the functionality is divided into servers and clients. The server functions
primarily encompass data management, query processing, optimization and transaction management. Client
functions include mainly user interface. However, they have some functions like consistency checking and
transaction management.
The two different client-server architectures are −

 Single Server Multiple Client


 Multiple Server Multiple Client (shown in the following diagram)
Peer- to-Peer Architecture for DDBMS
In these systems, each peer acts both as a client and a server for imparting database services. The peers share their
resources with other peers and co-ordinate their activities.
This architecture generally has four levels of schemas −
 Global Conceptual Schema − Depicts the global logical view of data.
 Local Conceptual Schema − Depicts logical data organization at each site.
 Local Internal Schema − Depicts physical data organization at each site.
 External Schema − Depicts user view of data.
Multi - DBMS Architectures
This is an integrated database system formed by a collection of two or more autonomous database systems.
Multi-DBMS can be expressed through six levels of schemas −
 Multi-database View Level − Depicts multiple user views comprising of subsets of the integrated
distributed database.
 Multi-database Conceptual Level − Depicts integrated multi-database that comprises of global logical
multi-database structure definitions.
 Multi-database Internal Level − Depicts the data distribution across different sites and multi-database to
local data mapping.
 Local database View Level − Depicts public view of local data.
 Local database Conceptual Level − Depicts local data organization at each site.
 Local database Internal Level − Depicts physical data organization at each site.
There are two design alternatives for multi-DBMS −

 Model with multi-database conceptual level.


 Model without multi-database conceptual level.
Design Alternatives
The distribution design alternatives for the tables in a DDBMS are as follows −

 Non-replicated and non-fragmented


 Fully replicated
 Partially replicated
 Fragmented
 Mixed

Non-replicated & Non-fragmented


In this design alternative, different tables are placed at different sites. Data is placed in close proximity
to the site where it is used most. It is most suitable for database systems where the percentage of queries that need
to join information in tables placed at different sites is low. If an appropriate distribution strategy is adopted, then
this design alternative helps to reduce the communication cost during data processing.

Fully Replicated
In this design alternative, at each site, one copy of all the database tables is stored. Since, each site has its own
copy of the entire database, queries are very fast requiring negligible communication cost. On the contrary, the
massive redundancy in data requires huge cost during update operations. Hence, this is suitable for systems where
a large number of queries is required to be handled whereas the number of database updates is low.

Partially Replicated
Copies of tables or portions of tables are stored at different sites. The distribution of the tables is done in
accordance with the frequency of access. This takes into consideration the fact that the frequency of accessing the
tables varies considerably from site to site. The number of copies of the tables (or portions) depends on how
frequently the access queries execute and on the sites which generate those queries.
Fragmented
In this design, a table is divided into two or more pieces referred to as fragments or partitions, and each fragment
can be stored at different sites. This considers the fact that it seldom happens that all data stored in a table is
required at a given site. Moreover, fragmentation increases parallelism and provides better disaster recovery. Here,
there is only one copy of each fragment in the system, i.e. no redundant data.
The three fragmentation techniques are −

 Vertical fragmentation
 Horizontal fragmentation
 Hybrid fragmentation

Mixed Distribution
This is a combination of fragmentation and partial replications. Here, the tables are initially fragmented in any
form (horizontal or vertical), and then these fragments are partially replicated across the different sites according
to the frequency of accessing the fragments.

Distributed Data Storage

A distributed database is basically a database that is not limited to one system; it is spread over different
sites, i.e., over multiple computers or a network of computers. A distributed database system is located
on various sites that don't share physical components. This may be required when a particular database
needs to be accessed by various users globally. It needs to be managed such that, to the users, it looks
like one single database.
Distributed Data Storage :
There are 2 ways in which data can be stored on different sites. These are:
1.Replication –
In this approach, the entire relation is stored redundantly at 2 or more sites. If the entire database is
available at all sites, it is a fully redundant database. Hence, in replication, systems maintain copies of
data.
This is advantageous as it increases the availability of data at different sites. Also, now query requests
can be processed in parallel.
However, it has certain disadvantages as well. Data needs to be constantly updated: any change made
at one site needs to be recorded at every site where that relation is stored, or else it may lead to inconsistency.
This is a lot of overhead. Also, concurrency control becomes much more complex as concurrent access
now needs to be checked over a number of sites.
2.Fragmentation –
In this approach, the relations are fragmented (i.e., they’re divided into smaller parts) and each of the
fragments is stored in different sites where they’re required. It must be made sure that the fragments are
such that they can be used to reconstruct the original relation (i.e, there isn’t any loss of data).
Fragmentation is advantageous as it doesn't create copies of data, so consistency is not a problem.

Fragmentation of relations can be done in two ways:

 Horizontal fragmentation – Splitting by rows –


The relation is fragmented into groups of tuples so that each tuple is assigned to at least one
fragment.
 Vertical fragmentation – Splitting by columns –
The schema of the relation is divided into smaller schemas. Each fragment must contain a
common candidate key so as to ensure lossless join.
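
For illustration, here is a hedged SQL sketch of both fragmentation styles for an assumed EMPLOYEE(emp_id, emp_name, emp_salary, branch) relation; table and column names are made up for the example:

-- Horizontal fragmentation: split by rows, e.g. one fragment per branch site.
CREATE TABLE employee_delhi AS
  SELECT * FROM employee WHERE branch = 'Delhi';
CREATE TABLE employee_chennai AS
  SELECT * FROM employee WHERE branch = 'Chennai';

-- Vertical fragmentation: split by columns; every fragment keeps the
-- candidate key emp_id so the original relation can be rebuilt by a join.
CREATE TABLE employee_public AS
  SELECT emp_id, emp_name FROM employee;
CREATE TABLE employee_payroll AS
  SELECT emp_id, emp_salary FROM employee;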

Flat & Nested Distributed Transactions


Introduction:
A transaction is a series of object operations that must be done in an ACID-compliant manner.
Example: banking transactions.
 Atomicity –
The transaction is completed entirely or not at all; it may either commit or abort.
 Consistency –
It refers to the transition from one consistent state to another.
 Isolation –
It is carried out separately from other transactions.
 Durability –
Once completed, it is long lasting.
Transactions – Commands :
 Begin–
initiate a new transaction.
 Commit–
End a transaction and the changes made during the transaction are saved. Also, it allows other
transactions to see the modifications you’ve made.
 Abort–
End a transaction and all changes made during the transaction will be undone.
Various roles are allocated to running a transaction successfully :
 Client–
The transactions are issued by the clients.
 Coordinator–
The execution of the entire transaction is controlled by it (handles Begin, commit & abort).
 Server–
Every component that accesses or modifies a resource is subject to transaction control. The
coordinator must be known by the transactional server. The transactional server registers its
participation in a transaction with the coordinator.
A flat or nested transaction that accesses objects handled by different servers is referred to as a
distributed transaction.
When a distributed transaction reaches its end, in order to maintain the atomicity property of the
transaction , it is mandatory that all of the servers involved in the transaction either commit the
transaction or abort it.
To do this, one of the servers takes on the job of coordinator, which entails ensuring that the same
outcome is achieved across all servers.
The method by which the coordinator accomplishes this is determined by the protocol selected. The
most widely used protocol is the ‘two-phase commit protocol.’ This protocol enables the servers to
communicate with one another in order to come to a joint decision on whether to commit or abort the
complete transaction.
Flat & Nested Distributed Transactions:
If a client transaction calls actions on multiple servers, it is said to be distributed. Distributed
transactions can be structured in two different ways:
1. Flat transactions
2. Nested transactions
FLAT TRANSACTIONS:
A flat transaction has a single initiating point (Begin) and a single end point (Commit or Abort). They are
usually very simple and are generally used for short activities rather than larger ones.
A client makes requests to multiple servers in a flat transaction. Transaction T, for example, is a flat
transaction that performs operations on objects in servers X, Y, and Z.
Before moving on to the next request, a flat client transaction completes the previous one. As a result,
each transaction visits the server object in order.
A transaction can only wait for one object at a time when servers utilize locking.

Flat Transaction

Limitations of a flat Transaction :


 All work is lost in the event of a crash.
 Only one DBMS may be used at a time.
 No partial rollback is possible.
NESTED TRANSACTIONS :
A transaction that includes other transactions within its initiating point and end point is known as a
nested transaction. So the nesting of transactions is done inside a transaction. The nested transactions
here are called sub-transactions.
The top-level transaction in a nested transaction can open sub-transactions, and each sub-transaction
can open more sub-transactions down to any depth of nesting.
A client’s transaction T opens up two sub-transactions, T1 and T2, which access objects on servers X
and Y, as shown in the diagram below.
T1.1, T1.2, T2.1, and T2.2, which access the objects on the servers M, N, and P, are opened by the sub-
transactions T1 and T2.

Nested Transaction

Concurrent execution of the sub-transactions that are at the same level is allowed in the nested
transaction strategy. Here, in the above diagram, T1 and T2 invoke objects on different servers and
hence they can run in parallel and are therefore concurrent.
T1.1, T1.2, T2.1, and T2.2 are four sub-transactions. These sub-transactions can also run in parallel.
Consider a distributed transaction (T) in which a customer transfers :
 Rs. 105 from account A to account C and
 Subsequently, Rs. 205 from account B to account D.
It can be viewed/ thought of as :
Transaction T :
Start
Transfer Rs 105 from A to C :
Deduct Rs 105 from A (withdraw from A) & Add Rs 105 to C (deposit to C)
Transfer Rs 205 from B to D :
Deduct Rs 205 from B (withdraw from B) & Add Rs 205 to D (deposit to D)
End
Assuming :
1. Account A is on server X
2. Account B is on server Y,and
3. Accounts C and D are on server Z.
The transaction T involves four requests – 2 for deposits and 2 for withdrawals. Now they can be treated
as sub transactions (T1, T2, T3, T4) of the transaction T.
As shown in the figure below, transaction T is designed as a set of four nested transactions : T1, T2, T3
and T4.
Advantage:
The performance is higher than a single transaction in which four operations are invoked one after the
other in sequence.

Nested Transaction

So, the Transaction T may be divided into sub-transactions as :


//Start the transaction
T = openTransaction
//T1
openSubtransaction
a.withdraw(105);
//T2
openSubtransaction
b.withdraw(205);
//T3
openSubtransaction
c.deposit(105);
//T4
openSubtransaction
d.deposit(205);
//End the transaction
closeTransaction
Role of coordinator :
When the distributed transaction commits, the servers that are involved in the transaction execution must,
for proper coordination, be able to communicate with one another.
When a client initiates a transaction, an "openTransaction" request is sent to any coordinator server. The
contacted coordinator carries out the "openTransaction" and returns the transaction identifier to the
client.
Distributed transaction identifiers must be unique within the distributed system.
A simple way is to generate a TID that contains two parts – the "server identifier" (for example, the IP address) of
the server that created it and a number unique to that server.
The coordinator who initiated the transaction becomes the distributed transaction’s coordinator and has
the responsibility of either aborting it or committing it.
Every server that manages an object accessed by a transaction is a participant in the transaction &
provides an object we call the participant. The participants are responsible for working together with
the coordinator to complete the commit process.
Each time, the coordinator records the new participant in the participant list. Each participant knows
the coordinator and the coordinator knows all the participants. This enables them to collect the information
that will be needed at commit time and hence work in coordination.

Distributed Transactions Concepts

What Are Distributed Transactions?


A distributed transaction includes one or more statements that, individually or as a group, update data
on two or more distinct nodes of a distributed database. For example, assume the database
configuration depicted in Figure 4-1:
Figure 4-1 Distributed System

The following distributed transaction executed by SCOTT updates the local SALES database, the
remote HQ database, and the remote MAINT database:
UPDATE scott.dept@hq.acme.com
SET loc = 'REDWOOD SHORES'
WHERE deptno = 10;
UPDATE scott.emp
SET deptno = 11
WHERE deptno = 10;
UPDATE scott.bldg@maint.acme.com
SET room = 1225
WHERE room = 1163;
COMMIT;

Note:

If all statements of a transaction reference only a single remote node, then the transaction is
remote, not distributed.

The section contains the following topics:

 Supported Types of Distributed Transactions


 Session Trees for Distributed Transactions
 Two-Phase Commit Mechanism

Supported Types of Distributed Transactions

This section describes permissible operations in distributed transactions:

 DML and DDL Transactions


 Transaction Control Statements

DML and DDL Transactions

The following list describes DML and DDL operations supported in a distributed transaction:
 CREATE TABLE AS SELECT
 DELETE
 INSERT (default and direct load)
 LOCK TABLE
 SELECT
 SELECT FOR UPDATE

You can execute DML and DDL statements in parallel, and INSERT direct load statements serially,
but note the following restrictions:

 All remote operations must be SELECT statements.


 These statements must not be clauses in another distributed transaction.
 If the table referenced in the table_expression_clause of an INSERT, UPDATE, or DELETE
statement is remote, then execution is serial rather than parallel.
 You cannot perform remote operations after issuing parallel DML/DDL or direct load INSERT.
 If the transaction begins using XA or OCI(oracle call interface), it executes serially.
 No loopback operations can be performed on the transaction originating the parallel operation.
For example, you cannot reference a remote object that is actually a synonym for a local object.
 If you perform a distributed operation other than a SELECT in the transaction, no DML is
parallelized.

Transaction Control Statements

The following list describes supported transaction control statements:

 COMMIT
 ROLLBACK
 SAVEPOINT

Savepoint in SQL
o Savepoint is a command in SQL that is used with the rollback command.
o It is a command in Transaction Control Language that is used to mark the transaction in a table.
o Consider that you are making a very long table and you want to roll back only to a certain position in the table;
this can be achieved using a savepoint.
o If you make a transaction in a table, you can mark the transaction with a certain name and, later on, if you
want to roll back to that point, you can do it easily by using that savepoint's name.
o Savepoint is helpful when we want to roll back only a small part of a table and not the whole table. In
simple words, we can say a savepoint is a bookmark in SQL, as the example below shows.
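
A small illustrative sketch (the class table and its rows are assumed for the example):

CREATE TABLE class (id NUMBER PRIMARY KEY, name VARCHAR2(30));

INSERT INTO class VALUES (1, 'Rahul');
SAVEPOINT a;                  -- bookmark after the first row

INSERT INTO class VALUES (2, 'Sana');
SAVEPOINT b;                  -- bookmark after the second row

ROLLBACK TO a;                -- undoes everything done after savepoint a (the second insert)
COMMIT;                       -- only the first row is made permanent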

Session Trees for Distributed Transactions

Oracle8i defines a session tree of all nodes participating in a distributed transaction. A session tree is
a hierarchical model of the transaction that describes the relationships among the nodes that are
involved. Each node plays a role in the transaction. For example, the node that originates the
transaction is the global coordinator, and the node in charge of initiating a commit or rollback is called
the commit point site.
Two-Phase Commit Mechanism

Unlike a transaction on a local database, a distributed transaction involves altering data on multiple
databases. Consequently, distributed transaction processing is more complicated, because Oracle must
coordinate the committing or rolling back of the changes in a transaction as a self-contained unit. In
other words, the entire transaction commits, or the entire transactions rolls back.

Oracle ensures the integrity of data in a distributed transaction using the two-phase commit
mechanism. In the prepare phase, the initiating node in the transaction asks the other participating
nodes to promise to commit or roll back the transaction. During the commit phase, the initiating node
asks all participating nodes to commit the transaction; if this outcome is not possible, then all nodes
are asked to roll back.

Session Trees for Distributed Transactions


As the statements in a distributed transaction are issued, Oracle8i defines a session tree of all nodes
participating in the transaction. A session tree is a hierarchical model that describes the relationships
among sessions and their roles. Figure 4-2 illustrates a session tree:

Figure 4-2 Example of a Session Tree

All nodes participating in the session tree of a distributed transaction assume one or more of the
following roles:

client – A node that references information in a database belonging to a different node.
database server – A node that receives a request for information from another node.
global coordinator – The node that originates the distributed transaction.
local coordinator – A node that is forced to reference data on other nodes to complete its part of the transaction.
commit point site – The node that commits or rolls back the transaction as instructed by the global coordinator.

The role a node plays in a distributed transaction is determined by:

 Whether the transaction is local or remote


 The commit point strength of the node ("Commit Point Site")
 Whether all requested data is available at a node, or whether other nodes need to be referenced
to complete the transaction
 Whether the node is read-only

Clients

A node acts as a client when it references information from another node's database. The referenced
node is a database server. In Figure 4-2, the node SALES is a client of the nodes that host the
WAREHOUSE and FINANCE databases.

Database Servers

A database server is a node that hosts a database from which a client requests data.

In Figure 4-2, an application at the SALES node initiates a distributed transaction that accesses data
from the WAREHOUSE and FINANCE nodes. Therefore, SALES.ACME.COM has the role of client
node, and WAREHOUSE and FINANCE are both database servers. In this example, SALES is a
database server and a client because the application also requests a change to the SALES database.

Local Coordinators

A node that must reference data on other nodes to complete its part in the distributed transaction is
called a local coordinator. In Figure 4-2, SALES is a local coordinator because it coordinates the
nodes it directly references: WAREHOUSE and FINANCE. SALES also happens to be the global
coordinator because it coordinates all the nodes involved in the transaction.

A local coordinator is responsible for coordinating the transaction among the nodes it communicates
directly with by:

 Receiving and relaying transaction status information to and from those nodes.
 Passing queries to those nodes.
 Receiving queries from those nodes and passing them on to other nodes.
 Returning the results of queries to the nodes that initiated them.

Global Coordinator

The node where the distributed transaction originates is called the global coordinator. The database
application issuing the distributed transaction is directly connected to the node acting as the global
coordinator. For example, in Figure 4-2, the transaction issued at the node SALES references
information from the database servers WAREHOUSE and FINANCE. Therefore,
SALES.ACME.COM is the global coordinator of this distributed transaction.

The global coordinator becomes the parent or root of the session tree. The global coordinator performs
the following operations during a distributed transaction:

 Sends all of the distributed transaction's SQL statements, remote procedure calls, etc. to the
directly referenced nodes, thus forming the session tree.
 Instructs all directly referenced nodes other than the commit point site to prepare the
transaction.
 Instructs the commit point site to initiate the global commit of the transaction if all nodes
prepare successfully.
 Instructs all nodes to initiate a global rollback of the transaction if there is an abort response.
Commit Point Site

The job of the commit point site is to initiate a commit or roll back operation as instructed by the
global coordinator. The system administrator always designates one node to be the commit point
site in the session tree by assigning all nodes a commit point strength. The node selected as commit
point site should be the node that stores the most critical data.

Figure 4-3 illustrates an example of distributed system, with SALES serving as the commit point site:

Figure 4-3 Commit Point Site

The commit point site is distinct from all other nodes involved in a distributed transaction in these
ways:

 The commit point site never enters the prepared state. Consequently, if the commit point site
stores the most critical data, this data never remains in-doubt, even if a failure occurs. In failure
situations, failed nodes remain in a prepared state, holding necessary locks on data until in-
doubt transactions are resolved.
 The commit point site commits before the other nodes involved in the transaction. In effect, the
outcome of a distributed transaction at the commit point site determines whether the transaction
at all nodes is committed or rolled back: the other nodes follow the lead of the commit point
site. The global coordinator ensures that all nodes complete the transaction in the same manner
as the commit point site.

How a Distributed Transaction Commits

A distributed transaction is considered committed after all non-commit point sites are prepared, and
the transaction has been actually committed at the commit point site. The online redo log at the commit
point site is updated as soon as the distributed transaction is committed at this node.

Because the commit point log contains a record of the commit, the transaction is considered committed
even though some participating nodes may still be only in the prepared state and the transaction not
yet actually committed at these nodes. In the same way, a distributed transaction is
considered not committed if the commit has not been logged at the commit point site.

Commit Point Strength

Every database server must be assigned a commit point strength. If a database server is referenced in
a distributed transaction, the value of its commit point strength determines which role it plays in the
two-phase commit. Specifically, the commit point strength determines whether a given node is the
commit point site in the distributed transaction and thus commits before all of the other nodes. This
value is specified using the initialization parameter COMMIT_POINT_STRENGTH.
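
As a hedged illustration (the values below are made up), each database sets this static parameter in its parameter file, and the current value can be inspected from SQL*Plus:

-- in the parameter file of each instance, for example:
--   COMMIT_POINT_STRENGTH = 200    -- on the node holding the most critical data
--   COMMIT_POINT_STRENGTH = 100    -- on the other nodes
-- checking the current value from SQL*Plus:
SHOW PARAMETER commit_point_strength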

How Oracle Determines the Commit Point Site

The commit point site, which is determined at the beginning of the prepare phase, is selected only
from the nodes participating in the transaction. The following sequence of events occurs:

1. Of the nodes directly referenced by the global coordinator, Oracle selects the node with the
highest commit point strength as the commit point site.

2. The initially-selected node determines if any of the nodes from which it has to obtain
information for this transaction has a higher commit point strength.

3. Either the node with the highest commit point strength directly referenced in the transaction or
one of its servers with a higher commit point strength becomes the commit point site.

4. After the final commit point site has been determined, the global coordinator sends prepare
responses to all nodes participating in the transaction.

Figure 4-4 shows in a sample session tree the commit point strengths of each node (in parentheses)
and shows the node chosen as the commit point site:

Figure 4-4 Commit Point Strengths and Determination of the Commit Point Site

The following conditions apply when determining the commit point site:

 A read-only node cannot be the commit point site.


 If multiple nodes directly referenced by the global coordinator have the same commit point
strength, then Oracle designates one of these as the commit point site.
 If a distributed transaction ends with a rollback, then the prepare and commit phases are not
needed. Consequently, Oracle never determines a commit point site. Instead, the global
coordinator sends a ROLLBACK statement to all nodes and ends the processing of the
distributed transaction.
As Figure 4-4 illustrates, the commit point site and the global coordinator can be different nodes of
the session tree. The commit point strength of each node is communicated to the coordinators when
the initial connections are made. The coordinators retain the commit point strengths of each node they
are in direct communication with so that commit point sites can be efficiently selected during two-
phase commits. Therefore, it is not necessary for the commit point strength to be exchanged between
a coordinator and a node each time a commit occurs.

Two-Phase Commit Mechanism


All participating nodes in a distributed transaction should perform the same action: they should either
all commit or all perform a rollback of the transaction. Oracle8i automatically controls and monitors
the commit or rollback of a distributed transaction and maintains the integrity of the global
database (the collection of databases participating in the transaction) using the two-phase commit
mechanism. This mechanism is completely transparent, requiring no programming on the part of the
user or application developer.

The commit mechanism has the following distinct phases, which Oracle performs automatically
whenever a user commits a distributed transaction:

prepare phase – The initiating node, called the global coordinator, asks participating nodes other than the commit
point site to promise to commit or roll back the transaction, even if there is a failure. If any node cannot
prepare, the transaction is rolled back.
commit phase – If all participants respond to the coordinator that they are prepared, then the coordinator asks the
commit point site to commit. After it commits, the coordinator asks all other nodes to commit the
transaction.
forget phase – The global coordinator forgets about the transaction.

This section contains the following topics:

 Prepare Phase
 Commit Phase
 Forget Phase

Prepare Phase
The first phase in committing a distributed transaction is the prepare phase. In this phase, Oracle does
not actually commit or roll back the transaction. Instead, all nodes referenced in a distributed
transaction (except the commit point site, described in the "Commit Point Site") are told to prepare to
commit. By preparing, a node:

 Records information in the online redo logs so that it can subsequently either commit or roll
back the transaction, regardless of intervening failures.
 Places a distributed lock on modified tables, which prevents reads.

When a node responds to the global coordinator that it is prepared to commit, the prepared
node promises to either commit or roll back the transaction later--but does not make a unilateral
decision on whether to commit or roll back the transaction. The promise means that if an instance
failure occurs at this point, the node can use the redo records in the online log to recover the database
back to the prepare phase.

Note:

Queries that start after a node has prepared cannot access the associated locked data until
all phases complete.

Types of Responses in the Prepare Phase

When a node is told to prepare, it can respond in the following ways:

prepared – Data on the node has been modified by a statement in the distributed transaction, and the node has
successfully prepared.
read-only – No data on the node has been, or can be, modified (only queried), so no preparation is necessary.
abort – The node cannot successfully prepare.
Prepared Response

When a node has successfully prepared, it issues a prepared message. The message indicates that the
node has records of the changes in the online log, so it is prepared either to commit or perform a
rollback. The message also guarantees that locks held for the transaction can survive a failure.

Read-Only Response

When a node is asked to prepare, and the SQL statements affecting the database do not change the
node's data, the node responds with a read-only message. The message indicates that the node will not
participate in the commit phase.

There are three cases in which all or part of a distributed transaction is read-only:

Partially read-only – Conditions: any of the following occurs: only queries are issued at one or more nodes; no
data is changed; changes are rolled back due to triggers firing or constraint violations. Consequence: the
read-only nodes recognize their status when asked to prepare. They give their local coordinators a read-only
response. Thus, the commit phase completes faster because Oracle eliminates read-only nodes from subsequent
processing.

Completely read-only with prepare phase – Conditions: all of the following occur: no data changes; the
transaction is not started with a SET TRANSACTION READ ONLY statement. Consequence: all nodes recognize
that they are read-only during the prepare phase, so no commit phase is required. The global coordinator, not
knowing whether all nodes are read-only, must still perform the prepare phase.

Completely read-only without two-phase commit – Conditions: all of the following occur: no data changes; the
transaction is started with a SET TRANSACTION READ ONLY statement. Consequence: only queries are allowed
in the transaction, so the global coordinator does not have to perform two-phase commit. Changes by other
transactions do not degrade global transaction-level read consistency because of global SCN coordination among
nodes. The transaction does not use rollback segments.

Note that if a distributed transaction is set to read-only, then it does not use rollback segments. If many
users connect to the database and their transactions are not set to READ ONLY, then they allocate
rollback space even if they are only performing queries.
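
For example, a read-only distributed query could be started as sketched below (the database link name is assumed, reusing the earlier HQ example):

SET TRANSACTION READ ONLY;
SELECT d.dname, e.ename
  FROM scott.dept@hq.acme.com d, scott.emp e   -- remote and local tables
 WHERE d.deptno = e.deptno;
COMMIT;   -- ends the read-only transaction; no two-phase commit is needed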

Abort Response

When a node cannot successfully prepare, it performs the following actions:

1. Releases resources currently held by the transaction and rolls back the local portion of the
transaction.

2. Responds to the node that referenced it in the distributed transaction with an abort message.

These actions then propagate to the other nodes involved in the distributed transaction so that they can
roll back the transaction and guarantee the integrity of the data in the global database. This response
enforces the primary rule of a distributed transaction: all nodes involved in the transaction either all
commit or all roll back the transaction at the same logical time.

Steps in the Prepare Phase

To complete the prepare phase, each node excluding the commit point site performs the following
steps:

1. The node requests that its descendants, that is, the nodes subsequently referenced, prepare to
commit.
2. The node checks to see whether the transaction changes data on itself or its descendants. If there
is no change to the data, then the node skips the remaining steps and returns a read-only
response.

3. The node allocates the resources it needs to commit the transaction if data is changed.

4. The node saves redo records corresponding to changes made by the transaction to its online
redo log.

5. The node guarantees that locks held for the transaction are able to survive a failure.

6. The node responds to the initiating node with a prepared response or, if its attempt or the attempt
of one of its descendants to prepare was unsuccessful, with an abort response.

These actions guarantee that the node can subsequently commit or roll back the transaction on the
node. The prepared nodes then wait until a COMMIT or ROLLBACK request is received from the
global coordinator.

After the nodes are prepared, the distributed transaction is said to be in-doubt . It retains in-doubt
status until all changes are either committed or rolled back.

Commit Phase

The second phase in committing a distributed transaction is the commit phase. Before this phase
occurs, all nodes other than the commit point site referenced in the distributed transaction have
guaranteed that they are prepared, that is, they have the necessary resources to commit the transaction.

Steps in the Commit Phase

The commit phase consists of the following steps:

1. The global coordinator instructs the commit point site to commit.

2. The commit point site commits.

3. The commit point site informs the global coordinator that it has committed.

4. The global and local coordinators send a message to all nodes instructing them to commit the
transaction.

5. At each node, Oracle8i commits the local portion of the distributed transaction and releases
locks.

6. At each node, Oracle8i records an additional redo entry in the local redo log, indicating that the
transaction has committed.

7. The participating nodes notify the global coordinator that they have committed.

When the commit phase is complete, the data on all nodes of the distributed system is consistent with
one another.
Guaranteeing Global Database Consistency

Each committed transaction has an associated system change number (SCN) to uniquely identify the
changes made by the SQL statements within that transaction. The SCN functions as an internal Oracle
timestamp that uniquely identifies a committed version of the database.

In a distributed system, the SCNs of communicating nodes are coordinated when all of the following
actions occur:

 A connection occurs using the path described by one or more database links.
 A distributed SQL statement executes.
 A distributed transaction commits.

Among other benefits, the coordination of SCNs among the nodes of a distributed system ensures
global read-consistency at both the statement and transaction level. If necessary, global time-based
recovery can also be completed.

During the prepare phase, Oracle8i determines the highest SCN at all nodes involved in the
transaction. The transaction then commits with the high SCN at the commit point site. The commit
SCN is then sent to all prepared nodes with the commit decision.

Forget Phase

After the participating nodes notify the commit point site that they have committed, the commit point
site can forget about the transaction. The following steps occur:

1. After receiving notice from the global coordinator that all nodes have committed, the commit
point site erases status information about this transaction.

2. The commit point site informs the global coordinator that it has erased the status information.

3. The global coordinator erases its own information about the transaction.

In-Doubt Transactions
The two-phase commit mechanism ensures that all nodes either commit or perform a rollback together.
What happens if any of the three phases fails because of a system or network error? The transaction
becomes in-doubt.

Distributed transactions can become in-doubt in the following ways:

 A server machine running Oracle software crashes.


 A network connection between two or more Oracle databases involved in distributed processing
is disconnected.
 An unhandled software error occurs.

The RECO (recovery) process automatically resolves in-doubt transactions when the machine,
network, or software problem is resolved. Until RECO can resolve the transaction, the data is locked
for both reads and writes. Oracle blocks reads because it cannot determine which version of the data
to display for a query.
This section contains the following topics:

 Automatic Resolution of In-Doubt Transactions


 Manual Resolution of In-Doubt Transactions
 Relevance of System Change Numbers for In-Doubt Transactions

Automatic Resolution of In-Doubt Transactions

In the majority of cases, Oracle resolves the in-doubt transaction automatically. Assume that there are
two nodes, LOCAL and REMOTE, in the following scenarios. The local node is the commit point
site. User SCOTT connects to LOCAL and executes and commits a distributed transaction that updates
LOCAL and REMOTE.

Failure During the Prepare Phase

Figure 4-5 illustrates the sequence of events when there is a failure during the prepare phase of a
distributed transaction:

Figure 4-5 Failure During Prepare Phase

The following steps occur:

1. Scott connects to LOCAL and executes a distributed transaction.

2. The global coordinator, which in this example is also the commit point site, requests all
databases other than the commit point site to promise to commit or roll back when told to do
so.

3. The REMOTE database crashes before issuing the prepare response back to LOCAL.

4. The transaction is ultimately rolled back on each database by the RECO process when the
remote site is restored.

Failure During the Commit Phase

Figure 4-6 illustrates the sequence of events when there is a failure during the commit phase of a
distributed transaction:
Figure 4-6 Failure During Commit Phase

The following steps occur:

1. Scott connects to LOCAL and executes a distributed transaction.

2. The global coordinator, which in this case is also the commit point site, requests all databases
other than the commit point site to promise to commit or roll back when told to do so.

3. The commit point site receives a prepare message from REMOTE saying that it will commit.

4. The commit point site commits the transaction locally, then sends a commit message to
REMOTE asking it to commit.

5. The REMOTE database receives the commit message, but cannot respond because of a network
failure.

6. The transaction is ultimately committed on the remote database by the RECO process after the
network is restored.

Manual Resolution of In-Doubt Transactions

You should only need to resolve an in-doubt transaction in the following cases:

 The in-doubt transaction has locks on critical data or rollback segments.


 The cause of the machine, network, or software failure cannot be repaired quickly.

Resolution of in-doubt transactions can be complicated. The procedure requires that you do the
following:

 Identify the transaction identification number for the in-doubt transaction.


 Query the DBA_2PC_PENDING and DBA_2PC_NEIGHBORS views to determine whether
the databases involved in the transaction have committed.
 If necessary, force a commit using the COMMIT FORCE statement or a rollback using the
ROLLBACK FORCE statement.
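
A hedged sketch of these DBA steps is shown below; the transaction identifier '1.21.17' is a made-up placeholder:

-- 1. Identify the in-doubt transaction and its state:
SELECT local_tran_id, global_tran_id, state
  FROM dba_2pc_pending;

-- 2. Check the other databases involved in the transaction:
SELECT local_tran_id, database, interface
  FROM dba_2pc_neighbors;

-- 3. Force the local outcome to match the decision taken elsewhere:
COMMIT FORCE '1.21.17';
-- or, if the transaction was rolled back at the other sites:
-- ROLLBACK FORCE '1.21.17';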

Relevance of System Change Numbers for In-Doubt Transactions

A system change number (SCN) is an internal timestamp for a committed version of the database. The
Oracle database server uses the SCN clock value to guarantee transaction consistency. For example,
when a user commits a transaction, Oracle records an SCN for this commit in the online redo log.

Oracle uses SCNs to coordinate distributed transactions among different databases. For example,
Oracle uses SCNs in the following way:

1. An application establishes a connection using a database link.

2. The distributed transaction commits with the highest global SCN among all the databases
involved.

3. The commit global SCN is sent to all databases involved in the transaction.

SCNs are important for distributed transactions because they function as a synchronized commit
timestamp of a transaction--even if the transaction fails. If a transaction becomes in-doubt, an
administrator can use this SCN to coordinate changes made to the global database. The global SCN
for the transaction commit can also be used to identify the transaction later, for example, in distributed
recovery.

Distributed Transaction Processing: Case Study


In this scenario, a company has separate Oracle8i database servers, SALES.ACME.COM and
WAREHOUSE.ACME.COM. As users insert sales records into the SALES database, associated
records are being updated at the WAREHOUSE database.

This case study of distributed processing illustrates:

 The definition of a session tree


 How a commit point site is determined
 When prepare messages are sent
 When a transaction actually commits
 What information is stored locally about the transaction

Stage 1: Client Application Issues DML Statements

At the Sales department, a salesperson uses SQL*Plus to enter a sales order and then commit it. The
application issues a number of SQL statements to enter the order into the SALES database and update
the inventory in the WAREHOUSE database:
CONNECT scott/tiger@sales.acme.com ...;
INSERT INTO orders ...;
UPDATE inventory@warehouse.acme.com ...;
INSERT INTO orders ...;
UPDATE inventory@warehouse.acme.com ...;
COMMIT;

These SQL statements are part of a single distributed transaction, guaranteeing that all issued SQL
statements succeed or fail as a unit. Treating the statements as a unit prevents the possibility of an
order being placed and then inventory not being updated to reflect the order. In effect, the transaction
guarantees the consistency of data in the global database.
As each of the SQL statements in the transaction executes, the session tree is defined, as shown
in Figure 4-7.

Figure 4-7 Defining the Session Tree

Note the following aspects of the transaction:

 An order entry application running with the SALES database initiates the transaction.
Therefore, SALES.ACME.COM is the global coordinator for the distributed transaction.
 The order entry application inserts a new sales record into the SALES database and updates the
inventory at the warehouse. Therefore, the nodes SALES.ACME.COM and
WAREHOUSE.ACME.COM are both database servers.
 Because SALES.ACME.COM updates the inventory, it is a client of
WAREHOUSE.ACME.COM.

This stage completes the definition of the session tree for this distributed transaction. Each node in the
tree has acquired the necessary data locks to execute the SQL statements that reference local data.
These locks remain even after the SQL statements have been executed until the two-phase commit is
completed.

Stage 2: Oracle Determines Commit Point Site

Oracle determines the commit point site immediately following the COMMIT statement.
SALES.ACME.COM, the global coordinator, is determined to be the commit point site, as shown
in Figure 4-8.
Figure 4-8 Determining the Commit Point Site

Stage 3: Global Coordinator Sends Prepare Response

The prepare stage involves the following steps:

1. After Oracle determines the commit point site, the global coordinator sends the prepare message
to all directly referenced nodes of the session tree, excluding the commit point site. In this
example, WAREHOUSE.ACME.COM is the only node asked to prepare.

2. WAREHOUSE.ACME.COM tries to prepare. If a node can guarantee that it can commit the
locally dependent part of the transaction and can record the commit information in its local redo
log, then the node can successfully prepare. In this example, only WAREHOUSE.ACME.COM
receives a prepare message because SALES.ACME.COM is the commit point site.

3. WAREHOUSE.ACME.COM responds to SALES.ACME.COM with a prepared message.

As each node prepares, it sends a message back to the node that asked it to prepare. Depending on the
responses, one of the following can happen:

 If any of the nodes asked to prepare respond with an abort message to the global coordinator,
then the global coordinator tells all nodes to roll back the transaction, and the operation is
completed.
 If all nodes asked to prepare respond with a prepared or a read-only message to the global
coordinator, that is, they have successfully prepared, then the global coordinator asks the
commit point site to commit the transaction.

Figure 4-9 Sending and Acknowledging the Prepare Message


Stage 4: Commit Point Site Commits

The committing of the transaction by the commit point site involves the following steps:

1. SALES.ACME.COM, receiving acknowledgment that WAREHOUSE.ACME.COM is prepared, instructs the commit point site to commit the transaction.

2. The commit point site now commits the transaction locally and records this fact in its local redo
log.

Even if WAREHOUSE.ACME.COM has not yet committed, the outcome of this transaction is pre-
determined. In other words, the transaction will be committed at all nodes even if a given node's ability
to commit is delayed.

Stage 5: Commit Point Site Informs Global Coordinator of Commit

This stage involves the following steps:

1. The commit point site tells the global coordinator that the transaction has committed. Because
the commit point site and global coordinator are the same node in this example, no operation is
required. The commit point site knows that the transaction is committed because it recorded
this fact in its online log.

2. The global coordinator confirms that the transaction has been committed on all other nodes
involved in the distributed transaction.

Stage 6: Global and Local Coordinators Tell All Nodes to Commit

The committing of the transaction by all the nodes in the transaction involves the following steps:

1. After the global coordinator has been informed of the commit at the commit point site, it tells
all other directly referenced nodes to commit.

2. In turn, any local coordinators instruct their servers to commit, and so on.

3. Each node, including the global coordinator, commits the transaction and records appropriate
redo log entries locally. As each node commits, the resource locks that were being held locally
for that transaction are released.

In Figure 4-10, SALES.ACME.COM, which is both the commit point site and the global coordinator,
has already committed the transaction locally. SALES now instructs WAREHOUSE.ACME.COM to
commit the transaction.
Figure 4-10 Instructing Nodes to Commit

Stage 7: Global Coordinator and Commit Point Site Complete the Commit

The completion of the commit of the transaction occurs in the following steps:

1. After all referenced nodes and the global coordinator have committed the transaction, the global
coordinator informs the commit point site of this fact.

2. The commit point site, which has been waiting for this message, erases the status information
about this distributed transaction.

3. The commit point site informs the global coordinator that it is finished. In other words, the
commit point site forgets about committing the distributed transaction. This action is
permissible because all nodes involved in the two-phase commit have committed the
transaction successfully, so they will never have to determine its status in the future.

4. The global coordinator finalizes the transaction by forgetting about the transaction itself.

After the completion of the COMMIT phase, the distributed transaction is itself complete. The
steps described above are accomplished automatically and in a fraction of a second.

Query Processing in Distributed DBMS


Query processing in a distributed database management system requires the transmission of data between the computers in a network. A distribution strategy for a query is the ordering of data transmissions and of local data processing in a database system. Generally, a query in a distributed DBMS requires data from multiple sites, and this need for data from different sites means that data must be transmitted, which incurs communication costs. Query processing in a distributed DBMS therefore differs from query processing in a centralized DBMS because of this cost of transferring data over the network. The transmission cost is low when sites are connected through high-speed networks and is quite significant in other networks.

1. Costs (Transfer of data) of Distributed Query processing :


In distributed query processing, the data transfer cost is the cost of transferring intermediate files to other sites for processing, as well as the cost of transferring the final result file to the site where the result is required. Suppose a user sends a query to site S1 that requires data both from S1 itself and from another site S2. There are three strategies to process this query:
1. We can transfer the data from S2 to S1 and then process the query at S1.
2. We can transfer the data from S1 to S2 and then process the query at S2.
3. We can transfer the data from both S1 and S2 to a third site S3 and then process the query there.
The choice depends on various factors, such as the size of the relations and of the result, the communication cost between the different sites, and the site at which the result will be used.
Commonly, the data transfer cost is calculated in terms of the size of the messages. It can be calculated using the following formula:
Data transfer cost = C * Size
where C is the cost per byte of transferring data and Size is the number of bytes transmitted.
Example: Consider the following tables EMPLOYEE and DEPARTMENT.
Site1: EMPLOYEE
EID NAME SALARY DID

EID- 10 bytes
SALARY- 20 bytes
DID- 10 bytes
Name- 20 bytes
Total records- 1000
Record Size- 60 bytes
Site2: DEPARTMENT
DID DNAME

DID- 10 bytes
DName- 20 bytes
Total records- 50
Record Size- 30 bytes
Example : Find the name of employees and their department names. Also, find the amount of data
transfer to execute this query when the query is submitted to Site 3.
Answer : The query is submitted at Site 3, and neither of the two relations, EMPLOYEE and DEPARTMENT, is available at Site 3. To execute this query, we have three strategies:
1. Transfer both tables, EMPLOYEE and DEPARTMENT, to Site 3 and join them there. The total cost is 1000 * 60 + 50 * 30 = 60,000 + 1,500 = 61,500 bytes.
2. Transfer the EMPLOYEE table to Site 2, join at Site 2, and then transfer the result to Site 3. The result contains 1000 tuples with NAME and DNAME (20 + 20 = 40 bytes each), so the total cost is 1000 * 60 + 1000 * 40 = 60,000 + 40,000 = 100,000 bytes.
3. Transfer the DEPARTMENT table to Site 1, join at Site 1, and then transfer the result to Site 3. The total cost is 50 * 30 + 1000 * 40 = 1,500 + 40,000 = 41,500 bytes.
Now, if the optimisation criterion is to reduce the amount of data transferred, strategy 3 is the best choice.
2. Using Semi join in Distributed Query processing :
The semi-join operation is used in distributed query processing to reduce the number of tuples in a table before transmitting it to another site. This reduction in the number of tuples reduces the number and total size of the transmissions, which ultimately reduces the total cost of data transfer. Suppose we have two tables R1 and R2 at sites S1 and S2 respectively. We forward the joining column of one table, say R1, to the site where the other table, R2, is located; this column is joined with R2 at that site. The decision whether to reduce R1 or R2 can only be made after comparing the advantages of reducing R1 with those of reducing R2. Thus, the semi-join is a well-organized way to reduce the transfer of data in distributed query processing.
Example : Find the amount of data transferred to execute the same query given in the above example using the semi-join operation.
Answer : The following strategy can be used to execute the query.
1. Project the required attributes of the EMPLOYEE table at Site 1 and then transfer them to Site 3. We transfer only NAME and DID of EMPLOYEE (20 + 10 = 30 bytes per tuple), so the size is 30 * 1000 = 30,000 bytes.
2. Transfer the DEPARTMENT table to Site 3 and join the projected attributes of EMPLOYEE with this table. The size of the DEPARTMENT table is 30 * 50 = 1,500 bytes.
Applying the above scheme, the amount of data transferred to execute the query will be 30,000 + 1,500 = 31,500 bytes, which is considerably less than shipping the entire EMPLOYEE relation.
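A rough SQL sketch of this semi-join strategy is shown below; the table name EMP_REDUCED is assumed for the projected copy, and in practice the distributed query processor would ship it to Site 3 automatically.

-- Step 1, at Site 1: keep only the columns needed for the join and the result
CREATE TABLE emp_reduced AS
  SELECT NAME, DID FROM EMPLOYEE;

-- Step 2, at Site 3: join the shipped projection with DEPARTMENT
SELECT E.NAME, D.DNAME
FROM   emp_reduced E, DEPARTMENT D
WHERE  E.DID = D.DID;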

Distributed Transaction Management


Definition
Distributed transaction management deals with the problems of always providing a consistent
distributed database in the presence of a large number of transactions (local and global) and failures
(communication link and/or site failures). This is accomplished through (i) distributed commit protocols
that guarantee atomicity property; (ii) distributed concurrency control techniques to ensure consistency
and isolation properties; and (iii) distributed recovery methods to preserve consistency and durability
when failures occur.

1) Distributed Commit Protocols :


In a local database system, for committing a transaction, the transaction manager has to only convey
the decision to commit to the recovery manager. However, in a distributed system, the transaction
manager should convey the decision to commit to all the servers in the various sites where the
transaction is being executed and uniformly enforce the decision. When processing is complete at each site, that site reaches the partially committed transaction state and waits for all the other sites to reach their partially committed states. When the coordinator receives the message that all the sites are ready to commit, it starts to commit. In a distributed system, either all sites commit or none of them does.
The different distributed commit protocols are −

 One-phase commit
 Two-phase commit
 Three-phase commit

Distributed One-phase Commit


Distributed one-phase commit is the simplest commit protocol. Let us consider that there is a controlling
site and a number of slave sites where the transaction is being executed. The steps in distributed
one-phase commit are −
 After each slave has locally completed its transaction, it sends a “DONE” message to the
controlling site.
 The slaves wait for “Commit” or “Abort” message from the controlling site. This waiting time is
called window of vulnerability.
 When the controlling site receives “DONE” message from each slave, it makes a decision to
commit or abort. This is called the commit point. Then, it sends this message to all the slaves.
 On receiving this message, a slave either commits or aborts and then sends an
acknowledgement message to the controlling site.

Distributed Two-phase Commit


Distributed two-phase commit reduces the vulnerability of one-phase commit protocols. The steps
performed in the two phases are as follows −
Phase 1: Prepare Phase
 After each slave has locally completed its transaction, it sends a “DONE” message to the
controlling site. When the controlling site has received “DONE” message from all slaves, it sends
a “Prepare” message to the slaves.
 The slaves vote on whether they still want to commit or not. If a slave wants to commit, it sends
a “Ready” message.
 A slave that does not want to commit sends a “Not Ready” message. This may happen when the
slave has conflicting concurrent transactions or there is a timeout.
Phase 2: Commit/Abort Phase
 After the controlling site has received “Ready” message from all the slaves −
o The controlling site sends a “Global Commit” message to the slaves.

o The slaves apply the transaction and send a “Commit ACK” message to the controlling
site.
o When the controlling site receives “Commit ACK” message from all the slaves, it considers
the transaction as committed.
 After the controlling site has received the first “Not Ready” message from any slave −
o The controlling site sends a “Global Abort” message to the slaves.

o The slaves abort the transaction and send an “Abort ACK” message to the controlling site.

o When the controlling site receives “Abort ACK” message from all the slaves, it considers
the transaction as aborted.

Distributed Three-phase Commit


The steps in distributed three-phase commit are as follows −
Phase 1: Prepare Phase
The steps are same as in distributed two-phase commit.
Phase 2: Prepare to Commit Phase

 The controlling site issues an “Enter Prepared State” broadcast message.


 The slave sites vote “OK” in response.
Phase 3: Commit / Abort Phase
The steps are same as two-phase commit except that “Commit ACK”/”Abort ACK” message is not
required.

2)Distributed DBMS - Controlling Concurrency

Concurrency controlling techniques ensure that multiple transactions are executed simultaneously
while maintaining the ACID properties of the transactions and serializability in the schedules.
In this section, we will study the various approaches for concurrency control.
Locking Based Concurrency Control Protocols
Locking-based concurrency control protocols use the concept of locking data items. A lock is a variable
associated with a data item that determines whether read/write operations can be performed on that
data item. Generally, a lock compatibility matrix is used which states whether a data item can be locked
by two transactions at the same time.
Locking-based concurrency control systems can use either one-phase or two-phase locking protocols.

One-phase Locking Protocol


In this method, each transaction locks an item before use and releases the lock as soon as it has
finished using it. This locking method provides for maximum concurrency but does not always enforce
serializability.

Two-phase Locking Protocol


In this method, all locking operations precede the first lock-release or unlock operation. The transaction comprises two phases. In the first phase, a transaction only acquires the locks it needs and does not release any lock. This is called the expanding or growing phase. In the second phase, the transaction releases its locks and cannot request any new locks. This is called the shrinking phase.
Every transaction that follows the two-phase locking protocol is guaranteed to be serializable. However, this approach provides low parallelism between two conflicting transactions.
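As an informal illustration of the two phases (the ACCOUNTS table and the values are assumed), a transaction following strict two-phase locking acquires every lock it needs before doing its work and releases them all only when it ends:

-- growing phase: acquire row locks on every item the transaction will touch
SELECT balance FROM accounts WHERE acc_no = 101 FOR UPDATE;
SELECT balance FROM accounts WHERE acc_no = 202 FOR UPDATE;

UPDATE accounts SET balance = balance - 500 WHERE acc_no = 101;
UPDATE accounts SET balance = balance + 500 WHERE acc_no = 202;

-- shrinking phase: all locks are released together when the transaction commits
COMMIT;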

Timestamp Concurrency Control Algorithms


Timestamp-based concurrency control algorithms use a transaction’s timestamp to coordinate
concurrent access to a data item to ensure serializability. A timestamp is a unique identifier given by
DBMS to a transaction that represents the transaction’s start time.
These algorithms ensure that transactions commit in the order dictated by their timestamps. An older
transaction should commit before a younger transaction, since the older transaction enters the system
before the younger one.
Timestamp-based concurrency control techniques generate serializable schedules such that the
equivalent serial schedule is arranged in order of the age of the participating transactions.
Some timestamp-based concurrency control algorithms are −

 Basic timestamp ordering algorithm.


 Conservative timestamp ordering algorithm.
 Multiversion algorithm based upon timestamp ordering.
Timestamp-based ordering follows three rules to enforce serializability −
 Access Rule − When two transactions try to access the same data item simultaneously, for
conflicting operations, priority is given to the older transaction. This causes the younger
transaction to wait for the older transaction to commit first.
 Late Transaction Rule − If a younger transaction has written a data item, then an older
transaction is not allowed to read or write that data item. This rule prevents the older transaction
from committing after the younger transaction has already committed.
 Younger Transaction Rule − A younger transaction can read or write a data item that has
already been written by an older transaction.

Optimistic Concurrency Control Algorithm


In systems with low conflict rates, the task of validating every transaction for serializability may lower
performance. In these cases, the test for serializability is postponed to just before commit. Since the
conflict rate is low, the probability of aborting transactions which are not serializable is also low. This
approach is called optimistic concurrency control technique.
In this approach, a transaction’s life cycle is divided into the following three phases −
 Execution Phase − A transaction fetches data items to memory and performs operations upon
them.
 Validation Phase − A transaction performs checks to ensure that committing its changes to the
database passes serializability test.
 Commit Phase − A transaction writes back modified data item in memory to the disk.
This algorithm uses three rules to enforce serializability in validation phase −
Rule 1 − Given two transactions Ti and Tj, if Ti is reading the data item which Tj is writing, then Ti’s execution phase cannot overlap with Tj’s commit phase. Tj can commit only after Ti has finished execution.
Rule 2 − Given two transactions Ti and Tj, if Ti is writing the data item that Tj is reading, then Ti’s commit phase cannot overlap with Tj’s execution phase. Tj can start executing only after Ti has already committed.
Rule 3 − Given two transactions Ti and Tj, if Ti is writing the data item which Tj is also writing, then Ti’s commit phase cannot overlap with Tj’s commit phase. Tj can start to commit only after Ti has already committed.

Concurrency Control in Distributed Systems


In this section, we will see how the above techniques are implemented in a distributed database system.

Distributed Two-phase Locking Algorithm


The basic principle of distributed two-phase locking is same as the basic two-phase locking protocol.
However, in a distributed system there are sites designated as lock managers. A lock manager controls
lock acquisition requests from transaction monitors. In order to enforce co-ordination between the lock
managers in various sites, at least one site is given the authority to see all transactions and detect lock
conflicts.
Depending upon the number of sites that can detect lock conflicts, distributed two-phase locking
approaches can be of three types −
 Centralized two-phase locking − In this approach, one site is designated as the central lock
manager. All the sites in the environment know the location of the central lock manager and
obtain lock from it during transactions.
 Primary copy two-phase locking − In this approach, a number of sites are designated as lock
control centers. Each of these sites has the responsibility of managing a defined set of locks. All
the sites know which lock control center is responsible for managing lock of which data
table/fragment item.
 Distributed two-phase locking − In this approach, there are a number of lock managers, where
each lock manager controls locks of data items stored at its local site. The location of the lock
manager is based upon data distribution and replication.

Distributed Timestamp Concurrency Control


In a centralized system, timestamp of any transaction is determined by the physical clock reading. But,
in a distributed system, any site’s local physical/logical clock readings cannot be used as global
timestamps, since they are not globally unique. So, a timestamp comprises a combination of the site ID
and that site’s clock reading.
For implementing timestamp ordering algorithms, each site has a scheduler that maintains a separate
queue for each transaction manager. During transaction, a transaction manager sends a lock request
to the site’s scheduler. The scheduler puts the request to the corresponding queue in increasing
timestamp order. Requests are processed from the front of the queues in the order of their timestamps,
i.e. the oldest first.

Conflict Graphs
Another method is to create conflict graphs. For this transaction classes are defined. A transaction class
contains two set of data items called read set and write set. A transaction belongs to a particular class
if the transaction’s read set is a subset of the class’ read set and the transaction’s write set is a subset
of the class’ write set. In the read phase, each transaction issues its read requests for the data items in
its read set. In the write phase, each transaction issues its write requests.
A conflict graph is created for the classes to which active transactions belong. This contains a set of
vertical, horizontal, and diagonal edges. A vertical edge connects two nodes within a class and denotes
conflicts within the class. A horizontal edge connects two nodes across two classes and denotes a
write-write conflict among different classes. A diagonal edge connects two nodes across two classes
and denotes a write-read or a read-write conflict among two classes.
The conflict graphs are analyzed to ascertain whether two transactions within the same class or across
two different classes can be run in parallel.

Distributed Optimistic Concurrency Control Algorithm


Distributed optimistic concurrency control algorithm extends optimistic concurrency control algorithm.
For this extension, two rules are applied −
Rule 1 − According to this rule, a transaction must be validated locally at all sites when it executes. If a
transaction is found to be invalid at any site, it is aborted. Local validation guarantees that the
transaction maintains serializability at the sites where it has been executed. After a transaction passes
local validation test, it is globally validated.
Rule 2 − According to this rule, after a transaction passes local validation test, it should be globally
validated. Global validation ensures that if two conflicting transactions run together at more than one
site, they should commit in the same relative order at all the sites they run together. This may require a
transaction to wait for the other conflicting transaction, after validation before commit. This requirement
makes the algorithm less optimistic since a transaction may not be able to commit as soon as it is
validated at a site.

3)Distributed Recovery Methods :


In order to recover from database failures, database management systems resort to a number of recovery management techniques. In this section, we will study the different approaches for database recovery.
The typical strategies for database recovery are −
 In case of soft failures that result in inconsistency of database, recovery strategy includes
transaction undo or rollback. However, sometimes, transaction redo may also be adopted to
recover to a consistent state of the transaction.
 In case of hard failures resulting in extensive damage to database, recovery strategies
encompass restoring a past copy of the database from archival backup. A more current state of
the database is obtained through redoing operations of committed transactions from transaction
log.

Recovery from Power Failure


Power failure causes loss of information in the non-persistent memory. When power is restored, the
operating system and the database management system restart. Recovery manager initiates recovery
from the transaction logs.
In case of immediate update mode, the recovery manager takes the following actions −
 Transactions which are in active list and failed list are undone and written on the abort list.
 Transactions which are in before-commit list are redone.
 No action is taken for transactions in commit or abort lists.
In case of deferred update mode, the recovery manager takes the following actions −
 Transactions which are in the active list and failed list are written onto the abort list. No undo
operations are required since the changes have not been written to the disk yet.
 Transactions which are in before-commit list are redone.
 No action is taken for transactions in commit or abort lists.

Recovery from Disk Failure


A disk failure or hard crash causes a total database loss. To recover from this hard crash, a new disk
is prepared, then the operating system is restored, and finally the database is recovered using the
database backup and transaction log. The recovery method is same for both immediate and deferred
update modes.
The recovery manager takes the following actions −
 The transactions in the commit list and before-commit list are redone and written onto the commit
list in the transaction log.
 The transactions in the active list and failed list are undone and written onto the abort list in the
transaction log.

Checkpointing
Checkpoint is a point of time at which a record is written onto the database from the buffers. As a
consequence, in case of a system crash, the recovery manager does not have to redo the transactions
that have been committed before checkpoint. Periodical checkpointing shortens the recovery process.
The two types of checkpointing techniques are −

 Consistent checkpointing
 Fuzzy checkpointing

Consistent Checkpointing
Consistent checkpointing creates a consistent image of the database at checkpoint. During recovery,
only those transactions which are on the right side of the last checkpoint are undone or redone. The
transactions to the left side of the last consistent checkpoint are already committed and needn’t be
processed again. The actions taken for checkpointing are −

 The active transactions are suspended temporarily.


 All changes in main-memory buffers are written onto the disk.
 A “checkpoint” record is written in the transaction log.
 The transaction log is written to the disk.
 The suspended transactions are resumed.
If in step 4, the transaction log is archived as well, then this checkpointing aids in recovery from disk
failures and power failures, otherwise it aids recovery from only power failures.
Fuzzy Checkpointing
In fuzzy checkpointing, at the time of checkpoint, all the active transactions are written in the log. In
case of power failure, the recovery manager processes only those transactions that were active during
checkpoint and later. The transactions that have been committed before checkpoint are written to the
disk and hence need not be redone.

Example of Checkpointing
Let us consider that the time of checkpointing is tcheck and the time of the system crash is tfail.
Let there be four transactions Ta, Tb, Tc and Td such that −
 Ta commits before checkpoint.
 Tb starts before checkpoint and commits before system crash.
 Tc starts after checkpoint and commits before system crash.
 Td starts after checkpoint and was active at the time of system crash.
The situation is depicted in the following diagram −

The actions that are taken by the recovery manager are −

 Nothing is done with Ta.


 Transaction redo is performed for Tb and Tc.
 Transaction undo is performed for Td.

Transaction Recovery Using UNDO / REDO


Transaction recovery is done to eliminate the adverse effects of faulty transactions rather than to
recover from a failure. Faulty transactions include all transactions that have changed the database into
undesired state and the transactions that have used values written by the faulty transactions.
Transaction recovery in these cases is a two-step process −
 UNDO all faulty transactions and transactions that may be affected by the faulty transactions.
 REDO all transactions that are not faulty but have been undone due to the faulty transactions.
Steps for the UNDO operation are −
 If the faulty transaction has done INSERT, the recovery manager deletes the data item(s)
inserted.
 If the faulty transaction has done DELETE, the recovery manager inserts the deleted data item(s)
from the log.
 If the faulty transaction has done UPDATE, the recovery manager eliminates the value by writing
the before-update value from the log.
Steps for the REDO operation are −
 If the transaction has done INSERT, the recovery manager generates an insert from the log.
 If the transaction has done DELETE, the recovery manager generates a delete from the log.
 If the transaction has done UPDATE, the recovery manager generates an update from the log.
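As a small illustration of these steps (the table name and the values are assumed, and the before-images would really come from the log), the UNDO of a faulty transaction might translate into statements such as:

-- UNDO of an INSERT made by the faulty transaction: delete the inserted row
DELETE FROM accounts WHERE acc_no = 999;

-- UNDO of an UPDATE: write back the before-update value recorded in the log
UPDATE accounts SET balance = 1000 WHERE acc_no = 101;

The corresponding REDO operations simply re-apply the after-images of non-faulty transactions from the log in the same way.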

event-condition-action rule (ECA rule)


An event-condition-action rule (ECA rule) is the method underlying event-driven computing, in which
actions are triggered by events, given the existence of specific conditions.

Events with significance to the system are identified within an event-driven program. An event could
be some user action, a transmission of sensor data or a message from some other program or system,
among an almost infinite number of other possibilities. The ECA rule specifies how events drive the
desired program responses. When an event with significance for the system occurs, the conditions are
checked for or evaluated; if the conditions exist or meet pre-established criteria, the appropriate action
is executed.

ECA rules originated in active databases and have since been used in areas
including personalization, big data management and business process automation. The model is being
explored for M2M (machine-to-machine) networking, Internet of Things (IoT), cognitive
computing and the Semantic Web.

Eg : ATM process

Event – ATM card insertion and PIN entry

Condition – check whether the entered password (PIN) matches the stored one

Action – if the password matches, carry out the transaction
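In an active database, an ECA rule of this kind can be written as a trigger. The sketch below is only illustrative; the CARDS and ATM_TRANSACTIONS tables, their columns, and the idea of storing the PIN directly are all assumptions made for the example.

CREATE OR REPLACE TRIGGER atm_pin_check
BEFORE INSERT ON atm_transactions          -- Event: a transaction is attempted
FOR EACH ROW
DECLARE
  v_pin cards.pin%TYPE;
BEGIN
  SELECT pin INTO v_pin
  FROM   cards
  WHERE  card_no = :NEW.card_no;           -- Condition: does the entered PIN match?
  IF v_pin <> :NEW.entered_pin THEN
    RAISE_APPLICATION_ERROR(-20001, 'PIN does not match');   -- Action: reject
  END IF;
END;
/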

Active Databases
An active database is a database consisting of a set of triggers. These databases are very difficult to maintain because of the complexity that arises in understanding the effect of these triggers. In such a database, the DBMS first verifies whether the particular trigger specified in the statement that modifies the database is activated or not, prior to executing the statement.
If the trigger is active, the DBMS executes the condition part, and executes the action part only if the specified condition evaluates to true. It is possible for a single statement to activate more than one trigger.
In such a situation, the DBMS processes the triggers in an arbitrary order. The execution of the action part of a trigger may activate other triggers, or even the same trigger that initiated this action. A trigger that activates itself is called a ‘recursive trigger’. The DBMS executes such chains of triggers in some pre-defined manner, but this makes the overall behaviour harder to understand.

Features of Active Database:


1. It possesses all the concepts of a conventional database, i.e. data modelling facilities, query language, etc.
2. It supports all the functions of a traditional database like data definition, data manipulation, storage management, etc.
3. It supports the definition and management of ECA rules.
4. It detects event occurrences.
5. It must be able to evaluate conditions and to execute actions.
6. This means that it has to implement rule execution.
Advantages :
1. Enhances traditional database functionalities with powerful rule processing capabilities.
2. Enable a uniform and centralized description of the business rules relevant to the information
system.
3. Avoids redundancy of checking and repair operations.
4. Suitable platform for building large and efficient knowledge base and expert systems.

A trigger is a procedure which is automatically invoked by the DBMS in response to changes to the
database, and is specified by the database administrator (DBA). A database with a set of associated
triggers is generally called an active database.

Parts of trigger
A triggers description contains three parts, which are as follows −

 Event − An event is a change to the database which activates the trigger.

 Condition − A query that is run when the trigger is activated is called as a condition.
 Action −A procedure which is executed when the trigger is activated and its condition is true.

Use of trigger
Triggers may be used for any of the following reasons −

 To implement any complex business rule, that cannot be implemented using integrity constraints.

 Triggers can be used to audit processes, for example, to keep track of changes made to a table.

 Trigger is used to perform automatic action when another concerned action takes place.

Types of triggers
The different types of triggers are explained below −

 Statement-level trigger − It is fired only once for a DML statement, irrespective of the number of rows affected by the statement. Statement-level triggers are the default type of trigger (a small sketch appears after this list).

 Before-triggers − At the time of defining a trigger we can specify whether the trigger is to be fired
before a command like INSERT, DELETE, or UPDATE is executed or after the command is
executed. Before triggers are automatically used to check the validity of data before the action is
performed. For instance, we can use before trigger to prevent deletion of rows if deletion should
not be allowed in a given case.

 After-triggers − It is used after the triggering action is completed. For example, if the trigger is
associated with the INSERT command then it is fired after the row is inserted into the table.

 Row-level triggers − It is fired for each row that is affected by DML command. For example, if
an UPDATE command updates 150 rows then a row-level trigger is fired 150 times whereas a
statement-level trigger is fired only for once.
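To make the statement-level / row-level distinction concrete, the following sketch (the CHANGE_LOG table is assumed) is a statement-level trigger: because it has no FOR EACH ROW clause, it fires once per UPDATE statement regardless of how many rows that statement changes.

CREATE OR REPLACE TRIGGER log_emp_changes
AFTER UPDATE ON employee            -- no FOR EACH ROW, so statement-level
BEGIN
  INSERT INTO change_log (table_name, changed_on)
  VALUES ('EMPLOYEE', SYSDATE);
END;
/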

Create database trigger


To create a database trigger, we use the CREATE TRIGGER command. The details to be given at the
time of creating a trigger are as follows −

 Name of the trigger.


 Table to be associated with.
 When trigger is to be fired: before or after.
 Command that invokes the trigger- UPDATE, DELETE, or INSERT.
 Whether row-level triggers or not.
 Condition to filter rows.
 PL/SQL block is to be executed when trigger is fired.
The syntax to create database trigger is as follows −
CREATE [OR REPLACE] TRIGGER trigger_name
{BEFORE | AFTER}
{DELETE | INSERT | UPDATE [OF columns]} ON table_name
[REFERENCING [OLD AS old] [NEW AS new]]
[FOR EACH ROW [WHEN (condition)]]
BEGIN
   PL/SQL block
END;
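For example, a minimal row-level audit trigger might look like the following; the EMPLOYEE and SALARY_AUDIT tables and their columns are assumed for illustration.

CREATE OR REPLACE TRIGGER emp_salary_audit
AFTER UPDATE OF salary ON employee
FOR EACH ROW
BEGIN
  INSERT INTO salary_audit (eid, old_salary, new_salary, changed_on)
  VALUES (:OLD.eid, :OLD.salary, :NEW.salary, SYSDATE);
END;
/

Because the trigger is declared FOR EACH ROW, the :OLD and :NEW qualifiers give access to the before and after values of each updated row.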

Design and Implementation Issues for Active Databases


We discuss some additional issues concerning how rules are designed and implemented. The first issue concerns activation, deactivation, and grouping of rules. In addition to creating rules, an active database system should allow users to activate, deactivate, and drop rules by referring to their rule names. A deactivated rule will not be triggered by the triggering event. This feature allows users to selectively deactivate rules for certain periods of time when they are not needed. The activate command will make the rule active again. The drop command deletes the rule from the system. Another option is to group rules into named rule sets, so the whole set of rules can be activated, deactivated, or dropped. It is also useful to have a command that can trigger a rule or rule set via an explicit PROCESS RULES command issued by the user.

The second issue concerns whether the triggered action should be executed before, after, instead of, or concurrently with the triggering event. A before trigger executes the trigger before executing the event that caused the trigger. It can be used in applications such as checking for constraint violations. An after trigger executes the trigger after executing the event, and it can be used in applications such as maintaining derived data and monitoring for specific events and conditions. An instead of trigger executes the trigger instead of executing the event, and it can be used in applications such as executing corresponding updates on base relations in response to an event that is an update of a view.

A related issue is whether the action being executed should be considered as a separate transaction or whether it should be part of the same transaction that triggered the rule. We will try to categorize the various options. It is important to note that not all options may be available for a particular active database system. In fact, most commercial systems are limited to one or two of the options that we will now discuss.

Let us assume that the triggering event occurs as part of a transaction execution. We should first consider the various options for how the triggering event is related to the evaluation of the rule’s condition. The rule condition evaluation is also known as rule consideration, since the action is to be executed only after considering whether the condition evaluates to true or false. There are three main possibilities for rule consideration:

1. Immediate consideration. The condition is evaluated as part of the same transaction as the triggering event, and is evaluated immediately. This case can be further categorized into three options:

Evaluate the condition before executing the triggering event.

Evaluate the condition after executing the triggering event.

Evaluate the condition instead of executing the triggering event.

2. Deferred consideration. The condition is evaluated at the end of the transaction that included the triggering event. In this case, there could be many triggered rules waiting to have their conditions evaluated.

3. Detached consideration. The condition is evaluated as a separate transaction, spawned from the triggering transaction.

The next set of options concerns the relationship between evaluating the rule condition and executing the rule action. Here, again, three options are possible: immediate, deferred, or detached execution. Most active systems use the first option; that is, as soon as the condition is evaluated, if it returns true, the action is immediately executed.
The Oracle system uses the immediate consideration model, but it allows the user to specify for each rule whether the before or after option is to be used with immediate condition evaluation. It also uses the immediate execution model. The STARBURST system uses the deferred consideration option, meaning that all rules triggered by a transaction wait until the triggering transaction reaches its end and issues its COMMIT WORK command before the rule conditions are evaluated.

Another issue concerning active database rules is the distinction between row-level rules and statement-level rules. Because SQL update statements (which act as triggering events) can specify a set of tuples, one has to distinguish between whether the rule should be considered once for the whole statement or whether it should be considered separately for each row (that is, tuple) affected by the statement. The SQL-99 standard and the Oracle system allow the user to choose which of the options is to be used for each rule, whereas STARBURST uses statement-level semantics only.

One of the difficulties that may have limited the widespread use of active rules, in spite of their potential to simplify database and software development, is that there are no easy-to-use techniques for designing, writing, and verifying rules. For example, it is quite difficult to verify that a set of rules is consistent, meaning that two or more rules in the set do not contradict one another. It is also difficult to guarantee termination of a set of rules under all circumstances.

An example to illustrate the termination problem for active rules.

R1: CREATE TRIGGER T1
    AFTER INSERT ON TABLE1 FOR EACH ROW
    UPDATE TABLE2
    SET Attribute1 = ... ;

R2: CREATE TRIGGER T2
    AFTER UPDATE OF Attribute1 ON TABLE2 FOR EACH ROW
    INSERT INTO TABLE1 VALUES ( ... );

To illustrate the termination problem briefly, consider the two rules shown above. Here, rule R1 is triggered by an INSERT event on TABLE1 and its action includes an update event on Attribute1 of TABLE2. However, rule R2’s triggering event is an UPDATE event on Attribute1 of TABLE2, and its action includes an INSERT event on TABLE1. In this example, it is easy to see that these two rules can trigger one another indefinitely, leading to non-termination. However, if dozens of rules are written, it is very difficult to determine whether termination is guaranteed or not.

If active rules are to reach their potential, it is necessary to develop tools for the design, debugging, and monitoring
of active rules that can help users design and debug their rules.

Microsoft Open Database Connectivity (ODBC)

What is ODBC?
Open Database Connectivity (ODBC) is an open standard Application Programming Interface (API) for accessing a database. In 1992, Microsoft partnered with Simba to build the world’s first ODBC driver, SIMBA.DLL, and standards-based data access was born. By using ODBC statements in a program, you can access files in a number of different common databases. In addition to the ODBC software, a separate module or driver is needed for each database to be accessed.

ODBC History
Microsoft introduced the ODBC standard in 1992. ODBC was a standard designed to unify access to SQL databases. Following the success of ODBC, Microsoft introduced OLE DB (Object Linking and Embedding, Database), which was to be a broader data access standard. OLE DB was a data access standard that went beyond just SQL databases and extended to any data source that could deliver data in tabular format. Microsoft’s plan was that OLE DB would supplant ODBC as the most common data access standard. More recently, Microsoft introduced the ADO (ActiveX Data Objects) data access standard. ADO was supposed to go further than OLE DB, in that ADO was more object oriented. However, even with Microsoft’s very significant attempts to replace the ODBC standard with what were felt to be “better” alternatives, ODBC has continued to be the de facto data access standard for SQL data sources. In fact, today the ODBC standard is more common than OLE DB and ADO because ODBC is widely supported (including support from Oracle and IBM) and is a cross-platform data access standard. Today, the most common data access standards for SQL data sources continue to be ODBC and JDBC, and it is very likely that standards like OLE DB and ADO will fade away over time.

ODBC Overview

ODBC has become the de facto standard for standards-based data access in both relational and non-relational database management systems (DBMS). Simba worked closely with Microsoft to co-develop the ODBC standard back in the early 1990s. The ODBC standard enables maximum interoperability, thereby enabling application developers to write a single application to access data sources from different vendors. ODBC is based on the Call-Level Interface (CLI) specifications from the Open Group and ISO/IEC (International Organization for Standardization / International Electrotechnical Commission) for database APIs and uses Structured Query Language (SQL) as its database access language.

ODBC Architecture

The architecture of ODBC-based data connectivity is as follows:


ODBC Enabled Application

This is any ODBC compliant application, such as Microsoft Excel, Tableau, Crystal Reports, Microsoft Power BI,
or similar application (Spreadsheet, Word processor, Data Access & Retrievable Tool, etc.). The ODBC enabled
application performs processing by passing SQL Statements to and receiving results from the ODBC Driver
Manager.

ODBC Driver Manager


The ODBC Driver Manager loads and unloads ODBC drivers on behalf of an application. The Windows platform comes with a default Driver Manager, while non-Windows platforms have the choice of using an open source ODBC Driver Manager such as unixODBC or iODBC (independent ODBC). The ODBC Driver Manager processes ODBC function calls, or passes them to an ODBC driver, and resolves ODBC version conflicts.

ODBC Driver

The ODBC driver processes ODBC function calls, submits SQL requests to a specific data source, and returns results to the application. The ODBC driver may also modify an application’s request so that the request conforms to the syntax supported by the associated database. A framework to easily build an ODBC driver is available from Simba Technologies, as are ODBC drivers for many data sources, such as Salesforce, MongoDB, Spark and more. The Simba SDK is available in C++, Java and C# and supports building drivers for Windows, OSX and many *nix distributions.

Data Source

A data source is simply the source of the data. It can be a file, a particular database on a DBMS, or even a live data
feed. The data might be located on the same computer as the program, or on another computer somewhere on a
network.
Unit – 3 XML Databases
Structured, Semi structured, and Unstructured Data – XML Hierarchical Data Model – XML
Documents – Document Type Definition – XML Schema – XML Documents and Databases – XML
Querying – XPath – XQuery

Structured Data Vs Unstructured Data Vs Semi-Structured Data
We can classify data as structured data, semi-structured data, or unstructured data. Structured
data resides in predefined formats and models, Unstructured data is stored in its natural format until
it’s extracted for analysis, and Semi-structured data basically is a mix of both structured and
unstructured data.

In this section, we are going to cover data, types of data, Structured Vs Unstructured Data, and suitable data stores.

What Is Data?

 Data is a set of facts such as descriptions, observations, and numbers used in decision making.
 We can classify data as structured, unstructured, or semi-structured data.
1) Structured Data

 Structured data is generally tabular data that is represented by columns and rows in a database.
 Databases that hold tables in this form are called relational databases.
 The mathematical term “relation” refers to a formed set of data held as a table.
 In structured data, every row in a table has the same set of columns.
 SQL (Structured Query Language) is the programming language used for structured data.

2) Semi-structured Data

 Semi-structured data is information that does not reside in a relational database (structured data) but still has some structure to it.
 Semi-structured data includes documents held in JavaScript Object Notation (JSON) format. It also includes key-value stores and graph databases.
3) Unstructured Data

 Unstructured data is information that either is not organized in a pre-defined manner or does not have a pre-defined data model.
 Unstructured information is typically text-heavy but may contain data such as numbers, dates, and facts as well.
 Videos, audio, and binary data files might not have a specific structure. They are classified as unstructured data.

Characteristics Of Structured (Relational) and Unstructured (Non-Relational) Data


Relational Data

 Relational databases provide undoubtedly the most well-understood model for holding data.
 The simplest structure of columns and tables makes them very easy to use initially, but the
inflexible structure can cause some problems.
 We can communicate with relational databases using Structured Query Language (SQL).
 SQL allows the joining of tables using a few lines of code, with a structure most beginner
employees can learn very fast.
 Examples of relational databases:
o MySQL
o PostgreSQL
o Db2

Non-Relational Data

 Non-relational databases permit us to store data in a format that more closely meets the original
structure.
 A non-relational database is a database that does not use the tabular schema of columns and
rows found in most traditional database systems.
 It uses a storage model that is enhanced for the specific requirements of the type of data being
stored.
 In a non-relational database the data may be stored as JSON documents, as simple key/value
pairs, or as a graph consisting of edges and vertices.
 Examples of non-relational databases:
o Redis
o JanusGraph
o MongoDB
o RabbitMQ

Document Data Stores


 A document data store manages a set of named string fields and object data values in an entity referred to as a document.
 These data stores generally store data in the form of JSON documents.

Columnar Data Stores

 A columnar or column-family data store organizes data into rows and columns. The columns are divided into groups known as column families.
 Each column family consists of a set of columns that are logically related and are generally
retrieved or manipulated as a unit.
 Within a column family, rows can be sparse and new columns can be added dynamically.

Key/Value Data Stores

 A key/value store is actually a large hash table.


 We associate each data value with a unique key, and the key/value store uses this key to store the data by applying a suitable hashing function.
 The hashing function is chosen to provide an even distribution of hashed keys across the data storage.
 Key/value stores are highly suitable for applications performing simple lookups using the value of a key, or through a range of keys.

Graph Data Stores

 A graph data store handles two types of information: nodes and edges.
 Nodes represent entities, and edges represent the relationships between these entities.
 The aim of a graph data store is to allow an application to efficiently perform queries that traverse the network of nodes and edges and to inspect the relationships between entities.

Time series data stores

 Time series data is a set of values organized by time, and a time-series data store is optimized for this type of data.
 Time series data stores must support a very large number of writes, as they generally collect large
amounts of data in real-time from a huge number of sources.

Object data stores

 Object data stores are suited for storing and retrieving large binary objects or blobs such as audio and video streams, images, text files, large application documents and data objects, and virtual machine disk images.
 An object consists of some metadata, stored data, and a unique ID for access to the object.
 For example, a file such as Flights.csv (a comma-separated values file) can be stored as an object.

External index data stores

 External index data stores give the ability to search for information held in other data services and
stores.
 An external index acts as a secondary index for any data store. It can provide real-time access to
indexes and can be used to index massive volumes of data.
Structured Vs Unstructured Data
1) Defined Vs Undefined Data

 Structured data is data with a clearly defined format and structure.


 Structured data lives in columns and rows and it can be mapped into pre-defined fields.
 Unstructured data does not have a predefined data format.

2)Quantitative Vs Qualitative Data

 Structured data is generally quantitative data; it usually consists of hard numbers or things that can be counted.
 Methods for analysis include classification, regression, and clustering of data.
 Unstructured data is generally categorized as qualitative data, and cannot be analyzed and
processed using conventional tools and methods.
 Understanding qualitative data requires advanced analytics techniques like data
stacking and data mining.

3) Storage In Data Warehouses Vs Data Lakes

 Structured data is generally stored in data warehouses.


 Unstructured data is stored in data lakes.
 Unstructured data requires more storage space, while structured data requires less
storage space.

4) Ease Of Analysis

 Structured data is easy to search, both for algorithms and for humans.
 Unstructured data is more difficult to search and requires processing to become understandable.
In the context of Big Data, we deal with very large amounts of data and its processing. Because the amount of data is so large, data is broadly classified into three categories on the basis of how it is organized, namely Structured, Semi-Structured and Unstructured Data.
On the basis of the level of organization of the data, we can find some more differences between these three types of data, which are as follows.

1. Level of organizing
   o Structured Data: as the name suggests, this type of data is well organized, and hence the level of organization is the highest.
   o Semi Structured Data: the data is organized only up to some extent and the rest is not organized, so the level of organization is less than that of structured data but higher than that of unstructured data.
   o Unstructured Data: the data is fully unorganized, and hence the level of organization is the lowest.

2. Means of data organization
   o Structured Data: organized by means of a relational database.
   o Semi Structured Data: partially organized by means of XML/RDF (Extensible Markup Language / Resource Description Framework).
   o Unstructured Data: based on simple character and binary data.

3. Transaction management
   o Structured Data: transaction management and concurrency control are present, so it is mostly preferred for multi-user, multitasking processing.
   o Semi Structured Data: transactions are not supported by default but may be adapted from a DBMS; concurrency control is not present.
   o Unstructured Data: no transaction management and no concurrency control are present.

4. Versioning
   o Structured Data: since structured data is supported by relational databases, versioning is done over tuples, rows, and tables.
   o Semi Structured Data: versioning is done only over tuples or graphs, as only a partial form of database support is available.
   o Unstructured Data: versioning is possible only on the data as a whole, as there is no database support at all.

5. Flexibility and scalability
   o Structured Data: based on a relational database schema, so it is schema dependent and less flexible as well as less scalable.
   o Semi Structured Data: more flexible than structured data but less flexible and scalable than unstructured data.
   o Unstructured Data: no dependency on any schema, so it is the most flexible and scalable of the three.

6. Performance
   o Structured Data: structured queries allow complex joins, so its performance is the highest among the three.
   o Semi Structured Data: only queries over anonymous nodes are possible, so performance is lower than that of structured data but higher than that of unstructured data.
   o Unstructured Data: only textual queries are possible, so performance is lower than that of both structured and semi-structured data.

XML Hierarchical (Tree) Data Model

We now introduce the data model used in XML. The basic object in XML is the XML document.
Two main structuring concepts are used to construct an XML document: elements and attributes.
It is important to note that the term attribute in XML is not used in the same manner as is
customary in database terminology, but rather as it is used in document description languages such
as HTML and SGML. Attributes in XML provide additional information that describes elements,
as we will see. There are additional concepts in XML, such as entities, identifiers, and references,
but first we concentrate on describing elements and attributes to show the essence of the XML
model.

Figure 12.3 shows an example of an XML element called <Projects>. As in HTML, elements are identified in a document by their start tag and end tag. The tag names are enclosed between angled brackets < ... >, and end tags are further identified by a slash, </ ... >.
<?xml version="1.0" standalone="yes"?>
<Projects>
  <Project>
    <Name>ProductX</Name>
    <Number>1</Number>
    <Location>Bellaire</Location>
    <Dept_no>5</Dept_no>
    <Worker>
      <Ssn>123456789</Ssn>
      <Last_name>Smith</Last_name>
      <Hours>32.5</Hours>
    </Worker>
    <Worker>
      <Ssn>453453453</Ssn>
      <First_name>Joyce</First_name>
      <Hours>20.0</Hours>
    </Worker>
  </Project>
  <Project>
    <Name>ProductY</Name>
    <Number>2</Number>
    <Location>Sugarland</Location>
    <Dept_no>5</Dept_no>
    <Worker>
      <Ssn>123456789</Ssn>
      <Hours>7.5</Hours>
    </Worker>
    <Worker>
      <Ssn>453453453</Ssn>
      <Hours>20.0</Hours>
    </Worker>
    <Worker>
      <Ssn>333445555</Ssn>
      <Hours>10.0</Hours>
    </Worker>
  </Project>
  ...
</Projects>

Figure 12.3 A complex XML element called <Projects>

Complex elements are constructed from other elements hierarchically, whereas simple
elements contain data values. A major difference between XML and HTML is that XML tag
names are defined to describe the meaning of the data elements in the document, rather than to
describe how the text is to be displayed. This makes it possible to process the data elements in the
XML document automatically by computer programs. Also, the XML tag (element) names can be
defined in another document, known as the schema document, to give a semantic meaning to the
tag names that can be exchanged among multiple users. In HTML, all tag names are predefined
and fixed; that is why they are not extendible.

It is straightforward to see the correspondence between the XML textual representation shown in
Figure 12.3 and the tree structure shown in Figure 12.1. In the tree representation, internal nodes
represent complex elements, whereas leaf nodes represent simple elements. That is why the XML
model is called a tree model or a hierarchical model. In Figure 12.3, the simple elements are the
ones with the tag names <Name>, <Number>, <Location>, <Dept_no>, <Ssn>, <Last_name>,
<First_name>, and <Hours>. The complex elements are the ones with the tag
names <Projects>, <Project>, and <Worker>. In general, there is no limit on the levels of nesting of
elements.

It is possible to characterize three main types of XML documents:

Data-centric XML documents. These documents have many small data items that follow a
specific structure and hence may be extracted from a structured database. They are formatted as
XML documents in order to exchange them over or display them on the Web. These usually follow
a predefined schema that defines the tag names.

Document-centric XML documents. These are documents with large amounts of text, such as
news articles or books. There are few or no structured data elements in these documents.

Hybrid XML documents. These documents may have parts that contain structured data and
other parts that are predominantly textual or unstructured. They may or may not have a predefined
schema.

XML documents that do not follow a predefined schema of element names and corresponding
tree structure are known as schemaless XML documents. It is important to note that datacentric
XML documents can be considered either as semistructured data or as structured data as defined
in Section 12.1. If an XML document conforms to a predefined XML schema or DTD (see Section
12.3), then the document can be considered as structured data. On the other hand, XML allows
documents that do not conform to any schema; these would be considered as semistructured
data and are schemaless XML documents. When the value of the standalone attribute in an XML
document is yes, as in the first line in Figure 12.3, the document is standalone and schemaless.

XML attributes are generally used in a manner similar to how they are used in HTML (see Figure
12.2), namely, to describe properties and characteristics of the elements (tags) within which they
appear. It is also possible to use XML attributes to hold the values of simple data elements;
however, this is generally not recommended. An exception to this rule is in cases that need
to reference another element in another part of the XML document. To do this, it is common to
use attribute values in one element as the references. This resembles the concept of foreign keys
in relational databases, and is a way to get around the strict hierarchical model that the XML tree
model implies. We discuss XML attributes further in Section 12.3 when we discuss XML schema
and DTD.
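As a minimal sketch of this idea (the element and attribute names here are hypothetical and do not appear in Figure 12.3), an ID-typed attribute on one element can be referenced by an IDREF-typed attribute of another element, much like a primary key referenced by a foreign key:

<!ATTLIST Employee EmpId ID #REQUIRED>
<!ATTLIST Worker emp IDREF #REQUIRED>

<Employee EmpId="e123"> <Last_name>Smith</Last_name> </Employee>
<Worker emp="e123"> <Hours>32.5</Hours> </Worker>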

XML Documents, DTD, and XML Schema


1. Well-Formed and Valid XML Documents and XML DTD

In Figure 12.3, we saw what a simple XML document may look like. An XML document is well
formed if it follows a few conditions. In particular, it must start with an XML declaration to
indicate the version of XML being used as well as any other relevant attributes, as shown in the
first line in Figure 12.3. It must also follow the syntactic guidelines of the tree data model. This
means that there should be a single root element, and every element must include a matching pair
of start and end tags within the start and end tags of the parent element. This ensures that the nested
elements specify a well-formed tree structure.

A well-formed XML document is syntactically correct. This allows it to be processed by generic


processors that traverse the document and create an internal tree representation. A standard model
with an associated set of API (application programming interface) functions
called DOM (Document Object Model) allows programs to manipulate the resulting tree
representation corresponding to a well-formed XML document. However, the whole document
must be parsed beforehand when using DOM in order to convert the document to that standard
DOM internal data structure representation. Another API called SAX (Simple API for XML)
allows processing of XML documents on the fly by notifying the processing program through
callbacks whenever a start or end tag is encountered. This makes it easier to process large
documents and allows for processing of so-called streaming XML documents, where the
processing program can process the tags as they are encountered. This is also known as event-
based processing.

A well-formed XML document can be schemaless; that is, it can have any tag names for the
elements within the document. In this case, there is no predefined set of elements (tag names) that
a program processing the document knows to expect. This gives the document creator the freedom
to specify new elements, but limits the possibilities for automatically interpreting the meaning or
semantics of the elements within the document.

A stronger criterion is for an XML document to be valid. In this case, the document must be well
formed, and it must follow a particular schema. That is, the element names used in the start and
end tag pairs must follow the structure specified in a separate XML DTD (Document Type
Definition) file or XML schema file. We first discuss XML DTD here, and then we give an
overview of XML schema in Section 12.3.2. Figure 12.4 shows a simple XML DTD file, which
specifies the elements (tag names) and their nested structures. Any valid documents conforming
to this DTD should follow the specified structure. A special syntax exists for specifying DTD files,
as illustrated in Figure 12.4. First, a name is given to the root tag of the document, which is
called Projects in the first line in Figure 12.4. Then the elements and their nested structure are
specified.
<!DOCTYPE Projects [
   <!ELEMENT Projects (Project+)>
   <!ELEMENT Project (Name, Number, Location, Dept_no?, Workers)>
   <!ATTLIST Project
      ProjId ID #REQUIRED>
   <!ELEMENT Name (#PCDATA)>
   <!ELEMENT Number (#PCDATA)>
   <!ELEMENT Location (#PCDATA)>
   <!ELEMENT Dept_no (#PCDATA)>
   <!ELEMENT Workers (Worker*)>
   <!ELEMENT Worker (Ssn, Last_name?, First_name?, Hours)>
   <!ELEMENT Ssn (#PCDATA)>
   <!ELEMENT Last_name (#PCDATA)>
   <!ELEMENT First_name (#PCDATA)>
   <!ELEMENT Hours (#PCDATA)>
]>

Figure 12.4 An XML DTD file called Projects
When specifying elements, the following notation is used:

A * following the element name means that the element can be repeated zero or more times in
the document. This kind of element is known as an optional multivalued (repeating) element.

A + following the element name means that the element can be repeated one or more times in
the document. This kind of element is a required multivalued (repeating) element.

A ? following the element name means that the element can be repeated zero or one times. This
kind is an optional single-valued (nonrepeating) element.

An element appearing without any of the preceding three symbols must appear exactly once in
the document. This kind is a required single-valued (nonrepeating) element.

The type of the element is specified via parentheses following the element. If the parentheses
include names of other elements, these latter elements are the children of the element in the tree
structure. If the parentheses include the keyword #PCDATA or one of the other data types available
in XML DTD, the element is a leaf node. PCDATA stands for parsed character data, which is
roughly similar to a string data type.

The list of attributes that can appear within an element can also be specified

via the keyword !ATTLIST. In Figure 12.4, the Project element has an attribute ProjId. If the type of
an attribute is ID, then it can be referenced from another attribute whose type is IDREF within
another element. Notice that attributes can also be used to hold the values of simple data elements
of type #PCDATA.

Parentheses can be nested when specifying elements.


A bar symbol ( e1 | e2 ) specifies that either e1 or e2 can appear in the document.
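For instance, the following hypothetical declaration (not part of the Projects DTD of Figure 12.4) combines these notations: Name must appear exactly once, Phone may appear at most once, Email may repeat zero or more times, and either Fax or Pager must appear:

<!ELEMENT Contact (Name, Phone?, Email*, (Fax | Pager))>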

We can see that the tree structure in Figure 12.1 and the XML document in Figure 12.3 conform
to the XML DTD in Figure 12.4. To require that an XML document be checked for conformance
to a DTD, we must specify this in the declaration of the document. For example, we could change
the first line in Figure 12.3 to the following:

<?xml version="1.0" standalone="no"?>
<!DOCTYPE Projects SYSTEM "proj.dtd">

When the value of the standalone attribute in an XML document is “no”, the document needs to be
checked against a separate DTD document or XML schema document (see below). The DTD file
shown in Figure 12.4 should be stored in the same file system as the XML document, and should
be given the file name proj.dtd. Alternatively, we could include the DTD document text at the
beginning of the XML document itself to allow the checking.

Although XML DTD is quite adequate for specifying tree structures with required, optional, and
repeating elements, and with various types of attributes, it has several limitations. First, the data
types in DTD are not very general. Second, DTD has its own special syntax and thus requires
specialized processors. It would be advantageous to specify XML schema documents using the
syntax rules of XML itself so that the same processors used for XML documents could process
XML schema descriptions. Third, all DTD elements are always forced to follow the specified
ordering of the document, so unordered elements are not permitted. These drawbacks led to the
development of XML schema, a more general but also more complex language for specifying the
structure and elements of XML documents.
2. XML Schema

The XML schema language is a standard for specifying the structure of XML documents. It uses
the same syntax rules as regular XML documents, so that the same processors can be used on both.
To distinguish the two types of documents, we will use the term XML instance document or XML
document for a regular XML document, and XML schema document for a document that specifies
an XML schema. Figure 12.5 shows an XML schema document corresponding to
the COMPANY database shown in Figures 3.5 and 7.2. Although it is unlikely that we would want
to display the whole database as a single document, there have been proposals to store data
in native XML format as an alternative to storing the data in relational databases. The schema in
Figure 12.5 would serve the purpose of specifying the structure of the COMPANY database if it
were stored in a native XML system. We discuss this topic further in Section 12.4.

As with XML DTD, XML schema is based on the tree data model, with elements and attributes as
the main structuring concepts. However, it borrows additional concepts from database and object
models, such as keys, references, and identifiers. Here we describe the features of XML schema
in a step-by-step manner, referring to the sample XML schema document in Figure 12.5 for
illustration. We introduce and describe some of the schema concepts in the order in which they are
used in Figure 12.5.

Figure 12.5 An XML schema file called company.

<?xml version="1.0" encoding="UTF-8" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:annotation>
    <xsd:documentation xml:lang="en">Company Schema (Element Approach) - Prepared by Babak Hojabri</xsd:documentation>
  </xsd:annotation>
  <xsd:element name="company">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="department" type="Department" minOccurs="0" maxOccurs="unbounded" />
        <xsd:element name="employee" type="Employee" minOccurs="0" maxOccurs="unbounded">
          <xsd:unique name="dependentNameUnique">
            <xsd:selector xpath="employeeDependent" />
            <xsd:field xpath="dependentName" />
          </xsd:unique>
        </xsd:element>
        <xsd:element name="project" type="Project" minOccurs="0" maxOccurs="unbounded" />
      </xsd:sequence>
    </xsd:complexType>
    <xsd:unique name="departmentNameUnique">
      <xsd:selector xpath="department" />
      <xsd:field xpath="departmentName" />
    </xsd:unique>
    <xsd:unique name="projectNameUnique">
      <xsd:selector xpath="project" />
      <xsd:field xpath="projectName" />
    </xsd:unique>
    <xsd:key name="projectNumberKey">
      <xsd:selector xpath="project" />
      <xsd:field xpath="projectNumber" />
    </xsd:key>
    <xsd:key name="departmentNumberKey">
      <xsd:selector xpath="department" />
      <xsd:field xpath="departmentNumber" />
    </xsd:key>
    <xsd:key name="employeeSSNKey">
      <xsd:selector xpath="employee" />
      <xsd:field xpath="employeeSSN" />
    </xsd:key>
    <xsd:keyref name="departmentManagerSSNKeyRef" refer="employeeSSNKey">
      <xsd:selector xpath="department" />
      <xsd:field xpath="departmentManagerSSN" />
    </xsd:keyref>
    <xsd:keyref name="employeeDepartmentNumberKeyRef" refer="departmentNumberKey">
      <xsd:selector xpath="employee" />
      <xsd:field xpath="employeeDepartmentNumber" />
    </xsd:keyref>
    <xsd:keyref name="employeeSupervisorSSNKeyRef" refer="employeeSSNKey">
      <xsd:selector xpath="employee" />
      <xsd:field xpath="employeeSupervisorSSN" />
    </xsd:keyref>
    <xsd:keyref name="projectDepartmentNumberKeyRef" refer="departmentNumberKey">
      <xsd:selector xpath="project" />
      <xsd:field xpath="projectDepartmentNumber" />
    </xsd:keyref>
    <xsd:keyref name="projectWorkerSSNKeyRef" refer="employeeSSNKey">
      <xsd:selector xpath="project/projectWorker" />
      <xsd:field xpath="SSN" />
    </xsd:keyref>
    <xsd:keyref name="employeeWorksOnProjectNumberKeyRef" refer="projectNumberKey">
      <xsd:selector xpath="employee/employeeWorksOn" />
      <xsd:field xpath="projectNumber" />
    </xsd:keyref>
  </xsd:element>
  <xsd:complexType name="Department">
    <xsd:sequence>
      <xsd:element name="departmentName" type="xsd:string" />
      <xsd:element name="departmentNumber" type="xsd:string" />
      <xsd:element name="departmentManagerSSN" type="xsd:string" />
      <xsd:element name="departmentManagerStartDate" type="xsd:date" />
      <xsd:element name="departmentLocation" type="xsd:string" minOccurs="0" maxOccurs="unbounded" />
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="Employee">
    <xsd:sequence>
      <xsd:element name="employeeName" type="Name" />
      <xsd:element name="employeeSSN" type="xsd:string" />
      <xsd:element name="employeeSex" type="xsd:string" />
      <xsd:element name="employeeSalary" type="xsd:unsignedInt" />
      <xsd:element name="employeeBirthDate" type="xsd:date" />
      <xsd:element name="employeeDepartmentNumber" type="xsd:string" />
      <xsd:element name="employeeSupervisorSSN" type="xsd:string" />
      <xsd:element name="employeeAddress" type="Address" />
      <xsd:element name="employeeWorksOn" type="WorksOn" minOccurs="1" maxOccurs="unbounded" />
      <xsd:element name="employeeDependent" type="Dependent" minOccurs="0" maxOccurs="unbounded" />
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="Project">
    <xsd:sequence>
      <xsd:element name="projectName" type="xsd:string" />
      <xsd:element name="projectNumber" type="xsd:string" />
      <xsd:element name="projectLocation" type="xsd:string" />
      <xsd:element name="projectDepartmentNumber" type="xsd:string" />
      <xsd:element name="projectWorker" type="Worker" minOccurs="1" maxOccurs="unbounded" />
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="Dependent">
    <xsd:sequence>
      <xsd:element name="dependentName" type="xsd:string" />
      <xsd:element name="dependentSex" type="xsd:string" />
      <xsd:element name="dependentBirthDate" type="xsd:date" />
      <xsd:element name="dependentRelationship" type="xsd:string" />
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="Address">
    <xsd:sequence>
      <xsd:element name="number" type="xsd:string" />
      <xsd:element name="street" type="xsd:string" />
      <xsd:element name="city" type="xsd:string" />
      <xsd:element name="state" type="xsd:string" />
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="Name">
    <xsd:sequence>
      <xsd:element name="firstName" type="xsd:string" />
      <xsd:element name="middleName" type="xsd:string" />
      <xsd:element name="lastName" type="xsd:string" />
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="Worker">
    <xsd:sequence>
      <xsd:element name="SSN" type="xsd:string" />
      <xsd:element name="hours" type="xsd:float" />
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="WorksOn">
    <xsd:sequence>
      <xsd:element name="projectNumber" type="xsd:string" />
      <xsd:element name="hours" type="xsd:float" />
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>

1. Schema descriptions and XML namespaces. It is necessary to identify the specific set of XML
schema language elements (tags) being used by specifying a file stored at a Web site location.
The second line in Figure 12.5 specifies the file used in this example, which
is http://www.w3.org/2001/XMLSchema. This is a commonly used standard for XML schema
commands. Each such definition is called an XML namespace, because it defines the set of
commands (names) that can be used. The file name is assigned to the variable xsd (XML schema
description) using the attribute xmlns (XML namespace), and this variable is used as a prefix to all
XML schema commands (tag names). For example, in Figure 12.5, when we
write xsd:element or xsd:sequence, we are referring to the definitions of the element and sequence tags
as defined in the file http://www.w3.org/2001/XMLSchema.

2. Annotations, documentation, and language used. The next couple of lines in Figure 12.5
illustrate the XML schema elements (tags) xsd:annotation and xsd:documentation, which are used
for providing comments and other descriptions in the XML document. The attribute xml:lang of
the xsd:documentation element specifies the language being used, where en stands for the English
language.

3. Elements and types. Next, we specify the root element of our XML schema. In XML schema,
the name attribute of the xsd:element tag specifies the element name, which is called company for
the root element in our example (see Figure 12.5). The structure of the company root element can
then be specified, which in our example is xsd:complexType. This is further specified to be a
sequence of departments, employees, and projects using the xsd:sequence structure of XML
schema. It is important to note here that this is not the only way to specify an XML schema for
the COMPANY database. We will discuss other options in Section 12.6.

4. First-level elements in the COMPANY database. Next, we specify the three first-level elements
under the company root element in Figure 12.5. These elements are named employee, department,
and project, and each is specified in an xsd:element tag. Notice that if a tag has only attributes and
no further subelements or data within it, it can be ended with the slash symbol (/>) directly
instead of having a separate matching end tag. These are called empty elements; examples are
the xsd:element elements named department and project in Figure 12.5.

5. Specifying element type and minimum and maximum occurrences. In
XML schema, the attributes type, minOccurs, and maxOccurs in the xsd:element tag specify the type
and multiplicity of each element in any document that conforms to the schema specifications. If
we specify a type attribute in an xsd:element, the structure of the element must be described
separately, typically using the xsd:complexType element of XML schema. This is illustrated by
the employee, department, and project elements in Figure 12.5. On the other hand, if no type attribute
is specified, the element structure can be defined directly following the tag, as illustrated by
the company root element in Figure 12.5. The minOccurs and maxOccurs attributes are used for
specifying lower and upper bounds on the number of occurrences of an element in any XML document
that conforms to the schema specifications. If they are not specified, the default is exactly one
occurrence. These serve a similar role to the *, +, and ? symbols of XML DTD.
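As an illustrative sketch of this correspondence (the element names here are hypothetical and not taken from Figure 12.5):

<!-- like Email* in a DTD: optional and repeating -->
<xsd:element name="email" type="xsd:string" minOccurs="0" maxOccurs="unbounded" />
<!-- like Phone? in a DTD: optional, at most one occurrence -->
<xsd:element name="phone" type="xsd:string" minOccurs="0" maxOccurs="1" />
<!-- like Hours+ in a DTD: required and repeating -->
<xsd:element name="hours" type="xsd:float" minOccurs="1" maxOccurs="unbounded" />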

6. Specifying keys. In XML schema, it is possible to specify constraints that correspond to unique
and primary key constraints in a relational database (see Section 3.2.2), as well as foreign key (or
referential integrity) constraints (see Section 3.2.4). The xsd:unique tag specifies elements that
correspond to unique attributes in a relational database. We can give each such uniqueness
constraint a name, and we must specify xsd:selector and xsd:field tags for it to identify the element
type that contains the unique element and the element name within it that is unique via
the xpath attribute. This is illustrated by the departmentNameUnique and projectNameUnique elements
in Figure 12.5. For specifying primary keys, the tag xsd:key is used instead of xsd:unique, as
illustrated by the projectNumberKey, departmentNumberKey, and employeeSSNKey elements in Figure
12.5. For specifying foreign keys, the tag xsd:keyref is used, as illustrated by the
six xsd:keyref elements in Figure 12.5. When specifying a foreign key, the attribute refer of
the xsd:keyref tag specifies the referenced primary key, whereas the
tags xsd:selector and xsd:field specify the referencing element type and foreign key (see Figure 12.5).
7. Specifying the structures of complex elements via complex types.
The next part of our example specifies the structures of the complex
elements Department, Employee, Project, and Dependent, using the tag xsd:complexType (see Figure
12.5). We specify each of these as a sequence of subelements corresponding to the
database attributes of each entity type (see Figure 3.7) by using
the xsd:sequence and xsd:element tags of XML schema. Each element is given a name and type
via the attributes name and type of xsd:element. We can also
specify minOccurs and maxOccurs attributes if we need to change the default of exactly one
occurrence. For (optional) database attributes where null is allowed, we need to specify minOccurs
= 0, whereas for multivalued database attributes we need to specify maxOccurs = "unbounded" on
the corresponding element. Notice that if we were not going to specify any key constraints, we
could have embedded the subelements within the parent element definitions directly without
having to specify complex types. However, when unique, primary key, and foreign key constraints
need to be specified, we must define complex types to specify the element structures.

8. Composite (compound) attributes. Composite attributes from Figure 7.2 are also specified as
complex types in Figure 12.5, as illustrated by the Address, Name, Worker, and WorksOn complex
types. These could have been directly embedded within their parent elements.

This example illustrates some of the main features of XML schema. There are other features, but
they are beyond the scope of our presentation. In the next section, we discuss the different
approaches to creating XML documents from relational databases and storing XML documents.

Storing and Extracting XML Documents from Databases

Several approaches to organizing the contents of XML documents to facilitate their subsequent
querying and retrieval have been proposed. The following are the most common approaches:

Using a DBMS to store the documents as text. A relational or object DBMS can be used to
store whole XML documents as text fields within the DBMS records or objects. This approach can
be used if the DBMS has a special module for document processing, and would work for storing
schemaless and documentcentric XML documents.

Using a DBMS to store the document contents as data elements. This approach would work
for storing a collection of documents that follow a specific XML DTD or XML schema. Because
all the documents have the same structure, one can design a relational (or object) database to store
the leaf-level data elements within the XML documents. This approach would require mapping
algorithms to design a database schema that is compatible with the XML document structure as
specified in the XML schema or DTD and to recreate the XML documents from the stored data.
These algorithms can be implemented either as an internal DBMS module or as separate
middleware that is not part of the DBMS.

Designing a specialized system for storing native XML data. A new type of database system
based on the hierarchical (tree) model could be designed and implemented. Such systems are being
called Native XML DBMSs. The system would include specialized indexing and querying
techniques, and would work for all types of XML documents. It could also include data
compression techniques to reduce the size of the documents for storage. Tamino by Software AG
and the Dynamic Application Platform of eXcelon are two popular products that offer native XML
DBMS capability. Oracle also offers a native XML storage option.

Creating or publishing customized XML documents from preexisting relational


databases. Because there are enormous amounts of data already stored in relational databases,
parts of this data may need to be formatted as documents for exchanging or displaying over the
Web. This approach would use a separate middleware software layer to handle the conversions
needed between the XML documents and the relational database. Section 12.6 discusses this
approach, in which datacentric XML documents are extracted from existing databases, in more
detail. In particular, we show how tree structured documents can be created from graph-structured
databases. Section 12.6.2 discusses the problem of cycles and how to deal with it.

All of these approaches have received considerable attention. We focus on the fourth approach in
Section 12.6, because it gives a good conceptual understanding of the differences between the
XML tree data model and the traditional database models based on flat files (relational model) and
graph representations (ER model). But first we give an overview of XML query languages in
Section 12.5.

Extracting XML Documents from Relational Databases

1. Creating Hierarchical XML Views over Flat or Graph-Based Data
2. Breaking Cycles to Convert Graphs into Trees
3. Other Steps for Extracting XML Documents from Databases

1. Creating Hierarchical XML Views over Flat or Graph-Based Data
This section discusses the representational issues that arise when converting data from a
database system into XML documents. As we have discussed, XML uses a hierarchical (tree)
model to represent documents. The database systems with the most widespread use follow the
flat relational data model. When we add referential integrity constraints, a relational schema can
be considered to be a graph structure (for example, see Figure 3.7). Similarly, the ER model
represents data using graph-like structures (for example, see Figure 7.2). We saw in Chapter 9
that there are straightforward mappings between the ER and relational models, so we can
conceptually represent a relational database schema using the corresponding ER schema.
Although we will use the ER model in our discussion and examples to clarify the conceptual
differences between tree and graph models, the same issues apply to converting relational data
to XML.

We will use the simplified UNIVERSITY ER schema shown in Figure 12.8 to illustrate our
discussion. Suppose that an application needs to extract XML documents for student, course, and
grade information from the UNIVERSITY database. The data needed for these documents is
contained in the database attributes of the entity
types COURSE, SECTION, and STUDENT from Figure 12.8, and the relationships S-S and C-
S between them. In general, most documents extracted from a database will only use a subset
of the attributes, entity types, and relationships in the database. In this example, the subset of
the database that is needed is shown in Figure 12.9.

At least three possible document hierarchies can be extracted from the database subset in Figure
12.9. First, we can choose COURSE as the root, as illustrated in Figure 12.10. Here, each course
entity has the set of its sections as subelements, and each section has its students as
subelements. We can see one consequence of modeling the information in a hierarchical tree
structure. If a student has taken multiple sections, that student’s information will appear multiple
times in the document— once under each section. A possible simplified XML schema for this view
is shown in Figure 12.11. The Grade database attribute in the S-S relationship is migrated to
the STUDENT element. This is because STUDENT becomes a child of SECTION in this hierarchy, so
each STUDENT element under a specific SECTION element can have a specific grade in that
section. In this document hierarchy, a student taking more than one section will have several
replicas, one under each section, and each replica will have the specific grade given in that
particular section.
Figure 12.11

XML schema document with course as the root.

<xsd:element name="root">
  <xsd:sequence>
    <xsd:element name="course" minOccurs="0" maxOccurs="unbounded">
      <xsd:sequence>
        <xsd:element name="cname" type="xsd:string" />
        <xsd:element name="cnumber" type="xsd:unsignedInt" />
        <xsd:element name="section" minOccurs="0" maxOccurs="unbounded">
          <xsd:sequence>
            <xsd:element name="secnumber" type="xsd:unsignedInt" />
            <xsd:element name="year" type="xsd:string" />
            <xsd:element name="quarter" type="xsd:string" />
            <xsd:element name="student" minOccurs="0" maxOccurs="unbounded">
              <xsd:sequence>
                <xsd:element name="ssn" type="xsd:string" />
                <xsd:element name="sname" type="xsd:string" />
                <xsd:element name="class" type="xsd:string" />
                <xsd:element name="grade" type="xsd:string" />
              </xsd:sequence>
            </xsd:element>
          </xsd:sequence>
        </xsd:element>
      </xsd:sequence>
    </xsd:element>
  </xsd:sequence>
</xsd:element>

In the second hierarchical document view, we can choose STUDENT as root (Figure 12.12). In this
hierarchical view, each student has a set of sections as its child elements, and each section is
related to one course as its child, because the relationship between SECTION and COURSE is N:1.
Thus, we can merge the COURSE and SECTION elements in this view, as shown in Figure 12.12. In
addition, the GRADE database attribute can be migrated to the SECTION element. In this
hierarchy, the combined COURSE/SECTION information is replicated under each student who
completed the section. A possible simplified XML schema for this view is shown in Figure 12.13.

<xsd:element name="root">
  <xsd:sequence>
    <xsd:element name="student" minOccurs="0" maxOccurs="unbounded">
      <xsd:sequence>
        <xsd:element name="ssn" type="xsd:string" />
        <xsd:element name="sname" type="xsd:string" />
        <xsd:element name="class" type="xsd:string" />
        <xsd:element name="section" minOccurs="0" maxOccurs="unbounded">
          <xsd:sequence>
            <xsd:element name="secnumber" type="xsd:unsignedInt" />
            <xsd:element name="year" type="xsd:string" />
            <xsd:element name="quarter" type="xsd:string" />
            <xsd:element name="cnumber" type="xsd:unsignedInt" />
            <xsd:element name="cname" type="xsd:string" />
            <xsd:element name="grade" type="xsd:string" />
          </xsd:sequence>
        </xsd:element>
      </xsd:sequence>
    </xsd:element>
  </xsd:sequence>
</xsd:element>

Figure 12.13 XML schema document with student as the root.

The third possible way is to choose SECTION as the root, as shown in Figure 12.14. Similar to the
second hierarchical view, the COURSE information can be merged into the SECTION element.
The GRADE database attribute can be migrated to the STUDENT element. As we can see, even in
this simple example, there can be numerous hierarchical document views, each corresponding
to a different root and a different XML document structure.

2. Breaking Cycles to Convert Graphs into Trees

In the previous examples, the subset of the database of interest had no cycles. It is possible to
have a more complex subset with one or more cycles, indicating multiple relationships among
the entities. In this case, it is more difficult to decide how to create the document hierarchies.
Additional duplication of entities may be needed to represent the multiple relationships. We will
illustrate this with an example using the ER schema in Figure 12.8.
Suppose that we need the information in all the entity types and relationships in Figure 12.8 for
a particular XML document, with STUDENT as the root element. Figure 12.15 illustrates how a
possible hierarchical tree structure can be created for this document. First, we get a lattice
with STUDENT as the root, as shown in Figure 12.15(a). This is not a tree structure because of the
cycles. One way to break the cycles is to replicate the entity types involved in the cycles. First, we
replicate INSTRUCTOR as shown in Figure 12.15(b), calling the replica to the right INSTRUCTOR1.
The INSTRUCTOR replica on the left represents the relationship between instructors and the
sections they teach, whereas the INSTRUCTOR1 replica on the right represents the relationship
between instructors and the department each works in. After this, we still have the cycle
involving COURSE, so we can replicate COURSE in a similar manner, leading to the hierarchy
shown in Figure 12.15(c). The COURSE1 replica to the left represents the relationship between
courses and their sections, whereas the COURSE replica to the right represents the relationship
between courses and the department that offers each course.

In Figure 12.15(c), we have converted the initial graph to a hierarchy. We can do further merging
if desired (as in our previous example) before creating the final hierarchy and the corresponding
XML schema structure.

3. Other Steps for Extracting XML Documents from Databases


In addition to creating the appropriate XML hierarchy and corresponding XML schema document,
several other steps are needed to extract a particular XML document from a database:

It is necessary to create the correct query in SQL to extract the desired information for the XML
document.

Once the query is executed, its result must be restructured from the flat relational form to the
XML tree structure.

The query can be customized to select either a single object or multiple objects into the
document. For example, in the view in Figure 12.13, the query can select a single student entity
and create a document corresponding to that single student, or it may select several—or even
all—of the students and create a document with multiple students.

XML Languages

There have been several proposals for XML query languages, and two query language standards
have emerged. The first is XPath, which provides language constructs for specifying path
expressions to identify certain nodes (elements) or attributes within an XML document that match
specific patterns. The second is XQuery, which is a more general query language. XQuery uses
XPath expressions but has additional constructs. We give an overview of each of these languages
in this section. Then we discuss some additional languages and protocols related to XML in Section 12.5.3.

1. XPath: Specifying Path Expressions in XML

An XPath expression generally returns a sequence of items that satisfy a certain pattern as
specified by the expression. These items are either values (from leaf nodes) or elements or
attributes. The most common type of XPath expression returns a collection of element or attribute
nodes that satisfy certain patterns specified in the expression. The names in the XPath expression
are node names in the XML document tree that are either tag (element) names or attribute names,
possibly with additional qualifier conditions to further restrict the nodes that satisfy the pattern.
Two main separators are used when specifying a path: single slash (/) and double slash (//). A
single slash before a tag specifies that the tag must appear as a direct child of the previous (parent)
tag, whereas a double slash specifies that the tag can appear as a descendant of the previous tag at
any level. Let us look at some examples of XPath as shown in Figure 12.6.
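Figure 12.6 itself is not reproduced in these notes; the expressions below are illustrative sketches, written against the COMPANY schema of Figure 12.5, that match the descriptions discussed in the following paragraphs:

/company
/company/department
//employee[employeeSalary > 70000]/employeeName
/company/employee[employeeSalary > 70000]/employeeName
/company/project/projectWorker[hours > 20.0]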

The first XPath expression in Figure 12.6 returns the company root node and all its descendant
nodes, which means that it returns the whole XML document. We should note that it is customary
to include the file name in the XPath query. This allows us to specify any local file name or even
any path name that specifies a file on the Web. For example, if the COMPANY XML document is
stored at the location www.company.com/info.xml (the document file used in the XQuery example later in
this section), then the first XPath expression can be prefixed with doc(www.company.com/info.xml).
This prefix would also be included in the other examples of XPath expressions.

The second example in Figure 12.6 returns all department nodes (elements) and their descendant
subtrees. Note that the nodes (elements) in an XML document are ordered, so the XPath result that
returns multiple nodes will do so in the same order in which the nodes are ordered in the document
tree.

The third XPath expression in Figure 12.6 illustrates the use of //, which is convenient to use if we
do not know the full path name we are searching for, but do know the name of some tags of interest
within the XML document. This is particularly useful for schemaless XML documents or for
documents with many nested levels of nodes.

The expression returns all employeeName nodes that are direct children of an employee node, such
that the employee node has another child element employeeSalary whose value is greater than 70000.
This illustrates the use of qualifier conditions, which restrict the nodes selected by
the XPath expression to those that satisfy the condition. XPath has a number of comparison
operations for use in qualifier conditions, including standard arithmetic, string, and set comparison
operations.

The fourth XPath expression in Figure 12.6 should return the same result as the previous one,
except that we specified the full path name in this example. The fifth expression in Figure 12.6
returns all projectWorker nodes and their descendant nodes that are children under a
path /company/project and have a child node hours with a value greater than 20.0 hours.
When we need to include attributes in an XPath expression, the attribute name is prefixed by the
@ symbol to distinguish it from element (tag) names. It is also possible to use the wildcard symbol
*, which stands for any element, as in the following example, which retrieves all elements that are
child elements of the root, regardless of their element type. When wildcards are used, the result
can be a sequence of different types of items.

/company/*
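For the attribute case, a possible expression (assuming a Project element carries the ProjId attribute declared in the DTD of Figure 12.4; the attribute value shown is made up) would be:

/Projects/Project[@ProjId = "P1"]/Name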

The examples above illustrate simple XPath expressions, where we can only move down in the
tree structure from a given node. A more general model for path expressions has been proposed.
In this model, it is possible to move in multiple directions from the current node in the path
expression. These are known as the axes of an XPath expression. Our examples above used
only three of these axes: child of the current node (/), descendant or self at any level of the current
node (//), and attribute of the current node (@). Other axes include parent, ancestor (at any level),
previous sibling (any node at same level to the left in the tree), and next sibling (any node at the
same level to the right in the tree). These axes allow for more complex path expressions.
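For example (illustrative sketches over the COMPANY document structure of Figure 12.5), the parent and ancestor axes can be written explicitly as follows:

//hours/parent::projectWorker
//projectNumber/ancestor::employee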

The main restriction of XPath path expressions is that the path that specifies the pattern also
specifies the items to be retrieved. Hence, it is difficult to specify certain conditions on the pattern
while separately specifying which result items should be retrieved. The XQuery language separates
these two concerns, and provides more powerful constructs for specifying queries.

2. XQuery: Specifying Queries in XML

XPath allows us to write expressions that select items from a tree-structured


XML document. XQuery permits the specification of more general queries on one or more XML
documents. The typical form of a query in XQuery is known as a FLWR expression, which stands
for the four main clauses of XQuery and has the following form:

FOR <variable bindings to individual nodes (elements)>

LET <variable bindings to collections of nodes (elements)>

WHERE <qualifier conditions>

RETURN <query result specification>


There can be zero or more instances of the FOR clause, as well as of the LET clause in a
single XQuery. The WHERE clause is optional, but can appear at most once, and
the RETURN clause must appear exactly once. Let us illustrate these clauses with the following
simple example of an XQuery.

LET $d := doc("www.company.com/info.xml")
FOR $x IN $d/company/project[projectNumber = 5]/projectWorker,
    $y IN $d/company/employee
WHERE $x/hours gt 20.0 AND $y/employeeSSN = $x/SSN
RETURN <res> { $y/employeeName/firstName, $y/employeeName/lastName, $x/hours } </res>

Variables are prefixed with the $ sign. In the above example, $d, $x, and $y are variables.

The LET clause assigns a variable to a particular expression for the rest of the query. In this
example, $d is assigned to the document file name. It is possible to have a query that refers to
multiple documents by assigning multiple variables in this way.

The FOR clause assigns a variable to range over each of the individual items in a sequence. In
our example, the sequences are specified by path expressions. The $x variable ranges over
elements that satisfy the path expression $d/company/project[projectNumber =
5]/projectWorker. The $y variable ranges over elements that satisfy the path
expression $d/company/employee. Hence, $x ranges over projectWorker elements,
whereas $y ranges over employee elements.

The WHERE clause specifies additional conditions on the selection of items. In this example,
the first condition selects only those projectWorker elements that satisfy the condition (hours gt
20.0). The second condition specifies a join condition that combines an employee with
a projectWorker only if they have the same ssn value.

Finally, the RETURN clause specifies which elements or attributes should be retrieved from the
items that satisfy the query conditions. In this example, it will return a sequence of elements each
containing <firstName, lastName, hours> for employees who work more than 20 hours per week on
project number 5.

Figure 12.7 includes some additional examples of queries in XQuery that can be specified on
XML instance documents that follow the XML schema document in Figure 12.5. The first query
retrieves the first and last names of employees who earn more than $70,000. The variable $x is
bound to each employeeName element that is a child of an employee element, but only
for employee elements that satisfy the quali-fier that their employeeSalary value is greater than
$70,000. The result retrieves the firstName and lastName child elements of the
selected employeeName elements. The second query is an alternative way of retrieving the same
elements retrieved by the first query.

The third query illustrates how a join operation can be performed by using more than one variable.
Here, the $x variable is bound to each projectWorker element that is a child of project number 5,
whereas the $y variable is bound to each employee element. The join condition
matches ssn values in order to retrieve the employee names. Notice that this is an alternative way
of specifying the same query in our earlier example, but without the LET clause.
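Since Figure 12.7 is not reproduced in these notes, the following FLWR expressions are illustrative sketches of the three queries just described; the document location and element names are assumed from Figure 12.5 and the earlier example:

(: 1. First and last names of employees who earn more than 70000 :)
FOR $x IN doc("www.company.com/info.xml")/company/employee[employeeSalary gt 70000]/employeeName
RETURN <res> { $x/firstName, $x/lastName } </res>

(: 2. An alternative formulation of the same query :)
FOR $x IN doc("www.company.com/info.xml")/company/employee
WHERE $x/employeeSalary gt 70000
RETURN <res> { $x/employeeName/firstName, $x/employeeName/lastName } </res>

(: 3. A join: names of employees who work more than 20 hours on project number 5 :)
FOR $x IN doc("www.company.com/info.xml")/company/project[projectNumber = 5]/projectWorker,
    $y IN doc("www.company.com/info.xml")/company/employee
WHERE $x/hours gt 20.0 AND $y/employeeSSN = $x/SSN
RETURN <res> { $y/employeeName } </res>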

XQuery has very powerful constructs to specify complex queries. In particular, it can specify
universal and existential quantifiers in the conditions of a query, aggregate functions, ordering of
query results, selection based on position in a sequence, and even conditional branching. Hence,
in some ways, it qualifies as a full-fledged programming language.
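As one further illustrative sketch (again assuming the same document and element names), ordering and an aggregate function can be combined in a single FLWR expression:

FOR $d IN doc("www.company.com/info.xml")/company/department
LET $e := doc("www.company.com/info.xml")/company/employee[employeeDepartmentNumber = $d/departmentNumber]
ORDER BY $d/departmentName
RETURN <dept> { $d/departmentName } <headcount> { count($e) } </headcount> </dept>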

This concludes our brief introduction to XQuery. The interested reader is referred
to www.w3.org, which contains documents describing the latest standards related to XML
and XQuery. The next section briefly discusses some additional languages and protocols related to
XML.

3. Other Languages and Protocols Related to XML


There are several other languages and protocols related to XML technology. The long-term goal
of these and other languages and protocols is to provide the technology for realization of the
Semantic Web, where all information in the Web can be intelligently located and processed.

The Extensible Stylesheet Language (XSL) can be used to define how a document should be
rendered for display by a Web browser.

The Extensible Stylesheet Language for Transformations (XSLT) can be used to transform one
structure into a different structure. Hence, it can convert documents from one form to another.

The Web Services Description Language (WSDL) allows for the description of Web Services
in XML. This makes the Web Service available to users and programs over the Web.

The Simple Object Access Protocol (SOAP) is a platform-independent and programming-language-independent protocol for messaging and remote procedure calls.

The Resource Description Framework (RDF) provides languages and tools for exchanging and
processing of meta-data (schema) descriptions and specifications over the Web.

Extra Contents: Key Points on XML Documents, XML Schema, XML Query, XML Database
XML Documents :
An XML document is a basic unit of XML information composed of elements and other markup in an
orderly package. An XML document can contain a wide variety of data: for example, a database of
numbers, numbers representing a molecular structure, or a mathematical equation.

XML Document Example


A simple document is shown in the following example −
<?xml version = "1.0"?>
<contact-info>
<name>Tanmay Patil</name>
<company>TutorialsPoint</company>
<phone>(011) 123-4567</phone>
</contact-info>
An XML document has two main parts: the document prolog and the document elements.
Document Prolog Section
Document Prolog comes at the top of the document, before the root element. This section contains −

 XML declaration
 Document type declaration

Document Elements Section


Document Elements are the building blocks of XML. These divide the document into a hierarchy of
sections, each serving a specific purpose. You can separate a document into multiple sections so that
they can be rendered differently, or used by a search engine. The elements can be containers, with a
combination of text and other elements.

XML Document Type Definition :

The XML Document Type Definition, commonly known as DTD, is a way to describe the XML language
precisely. DTDs check the vocabulary and the validity of the structure of XML documents against the
grammatical rules of the appropriate XML language.
An XML DTD can either be specified inside the document, or it can be kept in a separate document
and then linked to it.

Syntax
Basic syntax of a DTD is as follows −
<!DOCTYPE element DTD identifier
[
declaration1
declaration2
........
]>
In the above syntax,
 The DTD starts with <!DOCTYPE delimiter.
 An element tells the parser to parse the document from the specified root element.
 DTD identifier is an identifier for the document type definition, which may be the path to a file on
the system or the URL of a file on the internet. If the DTD points to an external path, it is
called an External Subset.
 The square brackets [ ] enclose an optional list of entity declarations called Internal Subset.
Internal DTD
A DTD is referred to as an internal DTD if the elements are declared within the XML file. To use an
internal DTD, the standalone attribute in the XML declaration must be set to yes. This means the
declaration works independently of an external source.
Syntax
Following is the syntax of internal DTD −
<!DOCTYPE root-element [element-declarations]>
where root-element is the name of root element and element-declarations is where you declare the
elements.
Example
Following is a simple example of internal DTD −
<?xml version = "1.0" encoding = "UTF-8" standalone = "yes" ?>
<!DOCTYPE address [
<!ELEMENT address (name,company,phone)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT company (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
]>

<address>
<name>Tanmay Patil</name>
<company>TutorialsPoint</company>
<phone>(011) 123-4567</phone>
</address>
Let us go through the above code −
Start Declaration − Begin the XML declaration with the following statement.
<?xml version = "1.0" encoding = "UTF-8" standalone = "yes" ?>
DTD − Immediately after the XML header, the document type declaration follows, commonly referred
to as the DOCTYPE −
<!DOCTYPE address [
The DOCTYPE declaration has an exclamation mark (!) at the start of the element name. The
DOCTYPE informs the parser that a DTD is associated with this XML document.
DTD Body − The DOCTYPE declaration is followed by body of the DTD, where you declare elements,
attributes, entities, and notations.
<!ELEMENT address (name,company,phone)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT company (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
Several elements are declared here that make up the vocabulary of the <address> document.
<!ELEMENT name (#PCDATA)> defines the element name to be of type "#PCDATA". Here #PCDATA
means parse-able text data.
End Declaration − Finally, the declaration section of the DTD is closed using a closing bracket and a
closing angle bracket (]>). This effectively ends the definition, and thereafter, the XML document follows
immediately.
Rules
 The document type declaration must appear at the start of the document (preceded only by the
XML header) − it is not permitted anywhere else within the document.
 Similar to the DOCTYPE declaration, the element declarations must start with an exclamation
mark.
 The Name in the document type declaration must match the element type of the root element.

External DTD
In an external DTD, elements are declared outside the XML file. They are accessed by specifying the
system attributes, which may be either a legal .dtd file or a valid URL. To use an external
DTD, the standalone attribute in the XML declaration must be set to no. This means the declaration
includes information from an external source.
Syntax
Following is the syntax for external DTD −
<!DOCTYPE root-element SYSTEM "file-name">
where file-name is the file with .dtd extension.
Example
The following example shows external DTD usage −
<?xml version = "1.0" encoding = "UTF-8" standalone = "no" ?>
<!DOCTYPE address SYSTEM "address.dtd">
<address>
<name>Tanmay Patil</name>
<company>TutorialsPoint</company>
<phone>(011) 123-4567</phone>
</address>
The content of the DTD file address.dtd is as shown −
<!ELEMENT address (name,company,phone)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT company (#PCDATA)>
<!ELEMENT phone (#PCDATA)>

Types
You can refer to an external DTD by using either system identifiers or public identifiers.

System Identifiers
A system identifier enables you to specify the location of an external file containing DTD declarations.
Syntax is as follows −
<!DOCTYPE name SYSTEM "address.dtd" [...]>
As you can see, it contains keyword SYSTEM and a URI reference pointing to the location of the
document.
Public Identifiers
Public identifiers provide a mechanism to locate DTD resources and is written as follows −
<!DOCTYPE name PUBLIC "-//Beginning XML//DTD Address Example//EN">
As you can see, it begins with keyword PUBLIC, followed by a specialized identifier. Public identifiers
are used to identify an entry in a catalog. Public identifiers can follow any format, however, a commonly
used format is called Formal Public Identifiers, or FPIs.

XML Schema :
XML Schema is commonly known as XML Schema Definition (XSD). It is used to describe and
validate the structure and the content of XML data. XML schema defines the elements, attributes and
data types. Schema element supports Namespaces. It is similar to a database schema that describes
the data in a database.

Syntax
You declare a schema in your XML document by referencing the XML Schema namespace, as shown in the example below.
Example
The following example shows how to use schema −
<?xml version = "1.0" encoding = "UTF-8"?>
<xs:schema xmlns:xs = "http://www.w3.org/2001/XMLSchema">
<xs:element name = "contact">
<xs:complexType>
<xs:sequence>
<xs:element name = "name" type = "xs:string" />
<xs:element name = "company" type = "xs:string" />
<xs:element name = "phone" type = "xs:int" />
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
The basic idea behind XML Schemas is that they describe the legitimate format that an XML document
can take.

Elements
As we saw in the XML - Elements chapter, elements are the building blocks of XML document. An
element can be defined within an XSD as follows −
<xs:element name = "x" type = "y"/>

Definition Types
You can define XML schema elements in the following ways −
Simple Type
Simple type element is used only in the context of the text. Some of the predefined simple types are:
xs:integer, xs:boolean, xs:string, xs:date. For example −
<xs:element name = "phone_number" type = "xs:int" />
Complex Type
A complex type is a container for other element definitions. This allows you to specify which child
elements an element can contain and to provide some structure within your XML documents. For
example −
<xs:element name = "Address">
<xs:complexType>
<xs:sequence>
<xs:element name = "name" type = "xs:string" />
<xs:element name = "company" type = "xs:string" />
<xs:element name = "phone" type = "xs:int" />
</xs:sequence>
</xs:complexType>
</xs:element>
In the above example, Address element consists of child elements. This is a container for
other <xs:element> definitions, that allows to build a simple hierarchy of elements in the XML
document.
Global Types
With the global type, you can define a single type in your document, which can be used by all other
references. For example, suppose you want to generalize the person and company for different
addresses of the company. In such case, you can define a general type as follows −
<xs:element name = "AddressType">
<xs:complexType>
<xs:sequence>
<xs:element name = "name" type = "xs:string" />
<xs:element name = "company" type = "xs:string" />
</xs:sequence>
</xs:complexType>
</xs:element>
Now let us use this type in our example as follows −
<xs:element name = "Address1">
<xs:complexType>
<xs:sequence>
<xs:element name = "address" type = "AddressType" />
<xs:element name = "phone1" type = "xs:int" />
</xs:sequence>
</xs:complexType>
</xs:element>

<xs:element name = "Address2">


<xs:complexType>
<xs:sequence>
<xs:element name = "address" type = "AddressType" />
<xs:element name = "phone2" type = "xs:int" />
</xs:sequence>
</xs:complexType>
</xs:element>
Instead of having to define the name and the company twice (once for Address1 and once
for Address2), we now have a single definition. This makes maintenance simpler, i.e., if you decide to
add "Postcode" elements to the address, you need to add them at just one place.

Attributes
Attributes in XSD provide extra information within an element. Attributes have name and type property
as shown below −
<xs:attribute name = "x" type = "y"/>
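For example, a hypothetical contact element could carry a required id attribute; note that in XSD the attribute declaration comes after the content model inside the complex type:

<xs:element name = "contact">
   <xs:complexType>
      <xs:sequence>
         <xs:element name = "name" type = "xs:string" />
      </xs:sequence>
      <xs:attribute name = "id" type = "xs:string" use = "required" />
   </xs:complexType>
</xs:element>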

XML Documents and Databases:


Different approaches for storing XML documents are as given below:

 An RDBMS or an object-oriented database management system can be used to store whole XML
documents in the form of text.
 A tree model is useful for storing the data elements located at the leaf level of the tree
structure.
 Large amounts of data can be stored in relational or object-oriented databases; a middleware
layer is used to manage the communication between the XML documents and the relational database.

XML Query:
XML query is based on two methods.

1. Xpath:

 Xpath is a syntax for defining parts or elements of XML documents.


 Xpath is used to navigate between elements and attributes in the XML documents.
 Xpath uses path expressions to select nodes in XML documents.
Example:
In this example, the tree structure of employee is represented in the form of XML document.

<Employee>
  <Name>
    <FirstName>Brian</FirstName>
    <LastName>Lara</LastName>
  </Name>
  <Contact>
    <Mobile>9800000000</Mobile>
    <Landline>020222222</Landline>
  </Contact>
  <Address>
    <City>Pune</City>
    <Street>Tilak road</Street>
    <ZipCode>4110</ZipCode>
  </Address>
</Employee>
In the above example, a (/) is used as </Employee> where Employee is a root node and
Name, Contact and Address are descendant nodes.
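For instance, a few illustrative XPath expressions over this document (these expressions are examples added for clarity, not part of the original notes):

/Employee/Name/FirstName (selects the FirstName element starting from the root)
//Mobile (selects every Mobile element anywhere in the document)
/Employee/Address/City/text() (selects the text value "Pune")
/Employee/Contact/* (selects all child elements of Contact)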

2. XQuery:

 XQuery is a query and functional programming language. XQuery provides a facility to extract and manipulate data from XML documents or any data source, such as a relational database.
 XQuery defines the FLWR expression, which supports iteration and binding of variables to intermediate results.
FLWR is an abbreviation of FOR, LET, WHERE, RETURN, which are explained as follows:
FOR: Creates a sequence of nodes.
LET: Binds a sequence to a variable.
WHERE: Filters the nodes on a condition.
RETURN: Specifies the query result.

XQuery comparisons:
The two methods for XQuery comparisons are as follows:

1. General comparisons: =, !=, <, <=, >, >=

Example:
In this example, the expression (query) returns true if any price returned by the path expression has a value greater than or equal to 15000.
$TVStore//TV/price >= 15000

2. Value comparisons: eq, ne, lt, le, gt, ge

Example:
In this example, the expression (query) returns true only if exactly one price is returned by the expression and its value is equal to 15000.
$TVStore//TV/price eq 15000

Example:
Let us take an example to understand how to write an XQuery.

for $x in doc("student.xml")/student_information
where $x/marks > 700
return <res> { $x/name } </res>

The query in the above example returns the names of the students whose marks are greater than 700.

XML Database:
An XML database is used to store a huge amount of information in XML format. As the use of XML is
increasing in every field, it is required to have a secure place to store XML documents. The data
stored in the database can be queried using XQuery, serialized, and exported into a desired format.

XML Database Types


There are two major types of XML databases −

 XML- enabled
 Native XML (NXD)
XML - Enabled Database
An XML-enabled database is nothing but an extension provided for the conversion of XML documents. It is a relational database, where data is stored in tables consisting of rows and columns. The tables contain sets of records, which in turn consist of fields.
Native XML Database
A native XML database is based on containers rather than a table format. It can store a large amount of XML documents and data, and is queried using XPath expressions.
A native XML database has an advantage over an XML-enabled database: it is more capable of storing, querying and maintaining XML documents than an XML-enabled database.
Example
Following example demonstrates XML database −
<?xml version = "1.0"?>
<contact-info>
<contact1>
<name>Tanmay Patil</name>
<company>TutorialsPoint</company>
<phone>(011) 123-4567</phone>
</contact1>

<contact2>
<name>Manisha Patil</name>
<company>TutorialsPoint</company>
<phone>(011) 789-4567</phone>
</contact2>
</contact-info>
Here, a table of contacts is created that holds the records of contacts (contact1 and contact2), which in
turn consist of three elements − name, company and phone.
UNIT – 4
NOSQL DATABASES AND BIG DATA STORAGE SYSTEMS

NoSQL – Categories of NoSQL Systems – CAP Theorem – Document-Based NoSQL Systems and MongoDB – MongoDB Data
Model – MongoDB Distributed Systems Characteristics – NoSQL Key-Value Stores – DynamoDB Overview – Voldemort Key-
Value Distributed Data Store – Wide Column NoSQL Systems – Hbase Data Model – Hbase Crud Operations – Hbase Storage and
Distributed System Concepts – NoSQL Graph Databases and Neo4j – Cypher Query Language of Neo4j – Big Data – MapReduce
– Hadoop – YARN.

What is NoSQL?
NoSQL Database is a non-relational Data Management System, that does not require a fixed schema. It
avoids joins, and is easy to scale. The major purpose of using a NoSQL database is for distributed data
stores with humongous data storage needs. NoSQL is used for Big data and real-time web apps. For
example, companies like Twitter, Facebook and Google collect terabytes of user data every single day.
NoSQL stands for "Not Only SQL" or "Not SQL." Though a better term would be "NoREL",
NoSQL caught on. Carlo Strozzi introduced the NoSQL concept in 1998.

A traditional RDBMS uses SQL syntax to store and retrieve data for further insights. Instead, a NoSQL
database system encompasses a wide range of database technologies that can store structured, semi-
structured, unstructured and polymorphic data.
OLAP:
Online analytical processing (OLAP) is a system for performing multi-dimensional analysis at high speeds on large volumes of data.
Typically, this data is from a data warehouse, data mart or some other centralized data store.

Categories of NoSQL:
Here are the four main types of NoSQL databases:

 Document databases
 Key-value stores
 Column-oriented databases
 Graph databases

Document Databases
A document database stores data in JSON, BSON, or XML documents (not Word documents or Google docs, of course).
In a document database, documents can be nested. Particular elements can be indexed for faster querying.

Documents can be stored and retrieved in a form that is much closer to the data objects used in applications, which means
less translation is required to use the data in an application. SQL data must often be assembled and disassembled when
moving back and forth between applications and storage.

Document databases are popular with developers because they have the flexibility to rework their document structures
as needed to suit their application, shaping their data structures as their application requirements change over time. This
flexibility speeds development because in effect data becomes like code and is under the control of developers. In SQL
databases, intervention by database administrators may be required to change the structure of a database.

The most widely adopted document databases are usually implemented with a scale-out architecture, providing a clear
path to scalability of both data volumes and traffic.

Use cases include ecommerce platforms, trading platforms, and mobile app development across industries.

Comparing MongoDB vs PostgreSQL offers a detailed analysis of MongoDB, the leading NoSQL database, and
PostgreSQL, one of the most popular SQL databases.

Key-Value Stores
The simplest type of NoSQL database is a key-value store . Every data element in the database is stored as a key value
pair consisting of an attribute name (or "key") and a value. In a sense, a key-value store is like a relational database with
only two columns: the key or attribute name (such as state) and the value (such as Alaska).

Use cases include shopping carts, user preferences, and user profiles.

Column-Oriented Databases
While a relational database stores data in rows and reads data row by row, a column store is organized as a set of columns.
This means that when you want to run analytics on a small number of columns, you can read those columns directly
without consuming memory with the unwanted data. Columns are often of the same type and benefit from more efficient
compression, making reads even faster. Columnar databases can quickly aggregate the value of a given column (adding
up the total sales for the year, for example). Use cases include analytics.

Unfortunately there is no free lunch, which means that while columnar databases are great for analytics, the way in which
they write data makes it very difficult for them to be strongly consistent as writes of all the columns require multiple
write events on disk. Relational databases don't suffer from this problem as row data is written contiguously to disk.

Graph Databases
A graph database focuses on the relationship between data elements. Each element is stored as a node (such as a person
in a social media graph). The connections between elements are called links or relationships. In a graph database,
connections are first-class elements of the database, stored directly. In relational databases, links are implied, using data
to express the relationships.

A graph database is optimized to capture and search the connections between data elements, overcoming the overhead
associated with JOINing multiple tables in SQL.

Very few real-world business systems can survive solely on graph queries. As a result graph databases are usually run
alongside other more traditional databases.

Use cases include fraud detection, social networks, and knowledge graphs.

The CAP Theorem in DBMS


The CAP theorem, originally introduced as the CAP principle, can be used to explain some of the competing
requirements in a distributed system with replication. It is a tool used to make system designers aware of the
trade-offs while designing networked shared-data systems.
The three letters in CAP refer to three desirable properties of distributed systems with replicated
data: consistency (among replicated copies), availability (of the system for read and write operations),
and partition tolerance (in the face of the nodes in the system being partitioned by a network fault).
The CAP theorem states that it is not possible to guarantee all three of the desirable properties – consistency,
availability, and partition tolerance at the same time in a distributed system with data replication.
The theorem states that networked shared-data systems can only strongly support two of the following three
properties:

 Consistency–
Consistency means that the nodes will have the same copies of a replicated data item visible for
various transactions. A guarantee that every node in a distributed cluster returns the same, most
recent, successful write. Consistency refers to every client having the same view of the data. There
are various types of consistency models. Consistency in CAP refers to sequential consistency, a
very strong form of consistency.

 Availability–
Availability means that each read or write request for a data item will either be processed
successfully or will receive a message that the operation cannot be completed. Every non-failing
node returns a response for all read and write requests in a reasonable amount of time. The key
word here is every. To be available, every node (on either side of a network partition) must be able
to respond in a reasonable amount of time.

 Partition Tolerant–
Partition tolerance means that the system can continue operating if the network connecting the
nodes has a fault that results in two or more partitions, where the nodes in each partition can only
communicate among each other. That means, the system continues to function and upholds its
consistency guarantees in spite of network partitions. Network partitions are a fact of life.
Distributed systems guaranteeing partition tolerance can gracefully recover from partitions once
the partition heals.

The use of the word consistency in CAP and its use in ACID do not refer to the same identical concept.

In CAP, the term consistency refers to the consistency of the values in different copies of the same data item
in a replicated distributed system. In ACID, it refers to the fact that a transaction will not violate the integrity
constraints specified on the database schema.

The CAP theorem categorizes systems into three categories:

CP (Consistent and Partition Tolerant) database: A CP database delivers consistency and partition
tolerance at the expense of availability. When a partition occurs between any two nodes, the system
has to shut down the non-consistent node (i.e., make it unavailable) until the partition is resolved.

Partition refers to a communication break between nodes within a distributed system. Meaning,
if a node cannot receive any messages from another node in the system, there is a partition
between the two nodes. Partition could have been because of network failure, server crash, or
any other reason.

AP (Available and Partition Tolerant) database: An AP database delivers availability and partition
tolerance at the expense of consistency. When a partition occurs, all nodes remain available, but
those at the wrong end of a partition might return an older version of data than others. When the
partition is resolved, AP databases typically resync the nodes to repair all inconsistencies in the
system.

CA (Consistent and Available) database: A CA database delivers consistency and availability in the
absence of any network partition. Often single-node DB servers are categorized as CA systems, since
they do not need to deal with partition tolerance.

In any networked shared-data system or distributed system, partition tolerance is a must. Network
partitions and dropped messages are a fact of life and must be handled appropriately. Consequently,
system designers must choose between consistency and availability.

Databases can thus be classified as CP, AP or CA according to the CAP theorem. System designers must
take the CAP theorem into consideration while designing or choosing distributed storage, as one of C
and A needs to be sacrificed for the other when a partition occurs.

Document based NoSQL

Document databases are considered to be non-relational (or NoSQL) databases. Instead of storing data
in fixed rows and columns, document databases use flexible documents. Document databases are the
most popular alternative to tabular, relational databases.

Understanding Document-Based MongoDB NoSQL

RDBMS and enterprise scalability, performance and flexibility

RDBMS evolved out of strong roots in mathematics, such as relational and set theory. Some of its key
aspects are schema validation, normalized data to avoid duplication, atomicity, locking, concurrency,
high availability and one version of the truth.
While these aspects have a lot of benefits in terms of data storage and retrieval, they can impact
enterprise scalability, performance and flexibility. Let us consider a typical purchase order example.
In the RDBMS world we would have two tables, an order header table and a line item table, with a
one-to-many relationship between them.

Consider that we need to store a huge number of purchase orders and we start partitioning. One way to
partition is to have the orderheader table in one DB instance and the lineitem information in another.
If you want to insert or update order information, you need to update both tables atomically, and you
need a transaction manager to ensure atomicity. If you want to scale this further in terms of
processing and data storage, you can only increase hard disk space and RAM.

The easy way to achieve scaling in an RDBMS is vertical scaling.

While horizontal scaling refers to adding additional nodes, vertical scaling describes adding more
power to your current machines. For instance, if your server requires more processing power, vertical
scaling would mean upgrading the CPUs.

Let us consider another situation: because of a change in our business we added a new column called
linedesc to the lineitem table, and imagine that this application was running in production. Once we
deploy this change, we need to bring down the server for some time for the change to take effect.

Achieving enterprise scalability, performance and flexibility

The fundamental requirements of modern enterprise systems are:

1. flexibility in terms of scaling the database, so that multiple instances of the database can process
the information in parallel
2. flexibility in terms of changes to the database, so that changes can be absorbed without long server
downtimes
3. an application/middle tier that does not have to handle the object-relational impedance mismatch –
can we get away with it using techniques like JSON (JavaScript Object Notation)?

Let us go back to our purchase order example and relax some of the aspects of RDBMS such as
normalization (to avoid joining lots of rows) and atomicity, and see if we can achieve some of the
above objectives.

Below is an example of how we can store the purchase order (there are other, better ways of storing
the information).

orderheader: {

  orderdescription: "krishna's orders",

  date: "sat jul 24 2010 19:47:11 gmt-0700 (pdt)",

  lineitems: [
    { linename: "pendrive", quantity: "5" },
    { linename: "harddisk", quantity: "10" }
  ]
}

If you notice carefully, the purchase order is stored in a JSON-document-like structure. You will also
notice that we don't need multiple tables, relationships or normalization, and hence there is no need
to join. Since the schema qualifiers are within the document, there is no table definition.

You can store them as a collection of objects/documents. Hypothetically, if we need to store several
million purchase orders, we can chunk them into groups and store them in several instances.

If you want to retrieve purchase orders based on specific criteria, for example all the purchase orders
in which one of the line items is a "pendrive", we can ask all the individual instances to retrieve
them in parallel based on the same criteria, and one of them can consolidate the list and return the
information to the client. This is the concept of horizontal scaling.
Because there is no separate table schema and the schema definition is included in the JSON object, we
can change the document structure and store and retrieve it with just a change in the application
layer. This does not need a database restart.

Finally, since the object structure is JSON, we can present it directly to the web tier or a mobile
device and they will render it.

NoSQL is a classification of databases designed with the above aspects in mind.

MongoDB: document-based NoSQL

MongoDB is a document-based NoSQL database that uses some of the above techniques to store and
retrieve data. There are a few NoSQL databases that are ordered key-value based, like Redis and
Cassandra, which also take these approaches but are much simpler.

To give an RDBMS analogy, collections in MongoDB are similar to tables and documents are similar to
rows. Internally, MongoDB stores the information as binary serializable JSON objects called BSON.
MongoDB supports a JavaScript-style query syntax to retrieve BSON (Binary JSON) objects.
A typical document looks as below:

post = {
  author: "hergé",
  date: new Date(),
  text: "destination moon",
  tags: ["comic", "adventure"]
}

> db.posts.save(post)

> db.posts.find()
{
  _id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
  author: "hergé",
  date: "sat jul 24 2010 19:47:11 gmt-0700 (pdt)",
  text: "destination moon",
  tags: ["comic", "adventure"]
}

In MongoDB, atomicity is guaranteed within a document. If you have to achieve atomicity outside of
the document, it has to be managed at the application level. Below is an example.

Many to many:

products: {
  _id: ObjectId("10"),
  name: "destinationmoon",
  category_ids: [ObjectId("20"), ObjectId("30")]
}

categories: {
  _id: ObjectId("20"),
  name: "adventure"
}

// all products for a given category

> db.products.find({ category_ids: ObjectId("20") })

// all categories for a given product

product = db.products.findOne({ _id: some_id })

> db.categories.find({ _id: { $in: product.category_ids } })


In a typical stack that uses MongoDB, it makes a lot of sense to use a JavaScript-based framework; a
common choice is the Express/Node.js/MongoDB stack.

MongoDB also supports sharding, which enables parallel processing and horizontal scaling.

Typical use cases for MongoDB include event logging, real-time analytics, content management and
e-commerce. Use cases where it is not a good fit are transaction-based banking systems and
non-real-time data warehousing.

Data Model Design


Effective data models support your application needs. The key consideration for the structure of your documents is the
decision to embed or to use references.

Embedded Data Models


With MongoDB, you may embed related data in a single structure or document. These schemas are generally known as
"denormalized" models, and take advantage of MongoDB's rich documents.

Embedded data models allow applications to store related pieces of information in the same database record. As a result,
applications may need to issue fewer queries and updates to complete common operations.

In general, use embedded data models when:

 you have "contains" relationships between entities. See Model One-to-One Relationships with Embedded Documents.
 you have one-to-many relationships between entities. In these relationships the "many" or child documents always appear
with or are viewed in the context of the "one" or parent documents. See Model One-to-Many Relationships with Embedded
Documents.

In general, embedding provides better performance for read operations, as well as the ability to request and retrieve
related data in a single database operation. Embedded data models make it possible to update related data in a single
atomic write operation.

To access data within embedded documents, use dot notation to "reach into" the embedded documents. See the
MongoDB documentation on querying arrays and querying embedded documents for more examples on accessing data
in arrays and embedded documents.
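
As a minimal sketch (the collection and field names here are hypothetical, not from these notes), an embedded document and a dot-notation query might look like this:

// one document embeds the related address data
db.users.insertOne({
  name: "alice",
  address: { city: "Pune", street: "Tilak road" }
})

// dot notation "reaches into" the embedded document
db.users.find({ "address.city": "Pune" })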

Embedded Data Model and Document Size Limit

Documents in MongoDB must be smaller than the maximum BSON document size.

For bulk binary data, consider GridFS.

Normalized Data Models

Normalized data models describe relationships using references between documents.

In general, use normalized data models:

 when embedding would result in duplication of data but would not provide sufficient read performance
advantages to outweigh the implications of the duplication.
 to represent more complex many-to-many relationships.
 to model large hierarchical data sets.

To join collections, MongoDB provides the aggregation stages:


 $lookup
 $graphLookup

MongoDB also provides referencing to join data across collections.
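
As an illustrative sketch (the collection and field names below are hypothetical), a $lookup stage joins referenced documents at query time:

db.orders.aggregate([
  {
    $lookup: {
      from: "customers",          // collection to join with
      localField: "customer_id",  // reference field stored in orders
      foreignField: "_id",        // matching field in customers
      as: "customer_info"         // output array field holding the joined documents
    }
  }
])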

For examples of normalized data models and of various tree models, refer to the MongoDB data modeling documentation.

MongoDB’s top five technical features:

1. Ad-hoc queries for optimized, real-time analytics


In SQL, an ad hoc query is a loosely typed command/query whose value depends upon some variable. Each time the
command is executed, the result may differ, depending on the value of the variable. It cannot be predetermined
and is usually built dynamically at run time.

When designing the schema of a database, it is impossible to know in advance all the queries that will be performed by
end users. An ad hoc query is a short-lived command whose value depends on a variable. Each time an ad hoc query is
executed, the result may be different, depending on the variables in question.

Optimizing the way in which ad-hoc queries are handled can make a significant difference at scale, when thousands to
millions of variables may need to be considered. This is why MongoDB, a document-oriented, flexible schema database,
stands apart as the cloud database platform of choice for enterprise applications that require real-time analytics. With ad-
hoc query support that allows developers to update ad-hoc queries in real time, the improvement in performance can be
game-changing.

MongoDB supports field queries, range queries, and regular expression searches. Queries can return specific fields and
also account for user-defined functions. This is made possible because MongoDB indexes BSON documents and uses
the MongoDB Query Language (MQL).

2. Indexing appropriately for better query executions


In our experience, the number one issue that many technical support teams fail to address with their users is indexing.
Done right, indexes are intended to improve search speed and performance. A failure to properly define appropriate
indices can and usually will lead to a myriad of accessibility issues, such as problems with query execution and load
balancing.

Without the right indices, a database is forced to scan documents one by one to identify the ones that match the query
statement. But if an appropriate index exists for each query, user requests can be optimally executed by the server.
MongoDB offers a broad range of indices and features with language-specific sort orders that support complex access
patterns to datasets.
Notably, MongoDB indices can be created on demand to accommodate real-time, ever-changing query patterns and
application requirements. They can also be declared on any field within any of your documents, including those nested
within arrays.
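
A brief sketch of index creation (the collection and fields are hypothetical):

db.products.createIndex({ name: 1 })                 // single-field index
db.products.createIndex({ category: 1, price: -1 })  // compound index for filter-and-sort queries
db.products.createIndex({ "reviews.rating": 1 })     // index on a field nested within an array of documents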

3. Replication for better data availability and stability


When your data only resides in a single database, it is exposed to multiple potential points of failure, such as a server
crash, service interruptions, or even good old hardware failure. Any of these events would make accessing your data
nearly impossible.

Replication allows you to sidestep these vulnerabilities by deploying multiple servers for disaster recovery and backup.
Horizontal scaling across multiple servers that house the same data (or shards of that same data) means greatly increased
data availability and stability. Naturally, replication also helps with load balancing. When multiple users access the same
data, the load can be distributed evenly across servers.

In MongoDB, replica sets are employed for this purpose. A primary server or node accepts all write operations and
applies those same operations across secondary servers, replicating the data. If the primary server should ever experience
a critical failure, any one of the secondary servers can be elected to become the new primary node. And if the former
primary node comes back online, it does so as a secondary server for the new primary node.
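
As a rough sketch (assuming three mongod instances already started with --replSet rs0; host names are placeholders), a replica set can be initiated from the shell as follows:

rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "db1.example.net:27017" },
    { _id: 1, host: "db2.example.net:27017" },
    { _id: 2, host: "db3.example.net:27017" }
  ]
})

rs.status()   // shows which member is currently the primary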

4. Sharding
When dealing with particularly large datasets, sharding—the process of splitting larger datasets across multiple
distributed collections, or “shards”—helps the database distribute and better execute what might otherwise be
problematic and cumbersome queries. Without sharding, scaling a growing web application with millions of daily users
is nearly impossible.

Like replication via replication sets, sharding in MongoDB allows for much greater horizontal scalability. Horizontal
scaling means that each shard in every cluster houses a portion of the dataset in question, essentially functioning as a
separate database. The collection of distributed server shards forms a single, comprehensive database much better suited
to handling the needs of a popular, growing application with zero downtime.

Zero downtime deployment is a deployment method where your website or application is never down or in an
unstable state during the deployment process. To achieve this the web server doesn't start serving the changed code
until the entire deployment process is complete.

All operations in a sharding environment are handled through a lightweight process called mongos. Mongos can direct
queries to the correct shard based on the shard key. Naturally, proper sharding also contributes significantly to better
load balancing.
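
A minimal sketch of enabling sharding from the shell, assuming a cluster with config servers, shards and a mongos router already running (database, collection and key names are hypothetical):

sh.enableSharding("shop")                                      // enable sharding for the database
sh.shardCollection("shop.orders", { customer_id: "hashed" })   // hashed shard key spreads writes across shards
sh.status()                                                    // view how chunks are distributed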

5. Load balancing
At the end of the day, optimal load balancing remains one of the holy grails of large-scale database management for
growing enterprise applications. Properly distributing millions of client requests to hundreds or thousands of servers can
lead to a noticeable (and much appreciated) difference in performance.

Fortunately, via horizontal scaling features like replication and sharding, MongoDB supports large-scale load balancing.
The platform can handle multiple concurrent read and write requests for the same data with best-in-class concurrency
control and locking protocols that ensure data consistency. There's no need to add an external load balancer:
MongoDB ensures that each and every user has a consistent view of, and a quality experience with, the data they need to access.

NoSQL database types explained: Key-value store

What is a key-value store?

This specific type of NoSQL database uses the key-value method and represents a collection of numerous key-value
pairs. The keys are unique identifiers for the values. The values can be any type of object -- a number or a string, or even
another key-value pair in which case the structure of the database grows more complex.

Unlike relational databases, key-value databases do not have a specified structure. Relational databases store data in
tables where each column has an assigned data type. Key-value databases are a collection of key-value pairs that are
stored as individual records and do not have a predefined data structure. The key can be anything, but seeing that it is
the only way of retrieving the value associated with it, naming the keys should be done strategically.

Key names can range from as simple as numbering to specific descriptions of the value that is about to follow. A key-
value database can be thought of as a dictionary or a directory. Dictionaries have words as keys and their meanings as
values.
Phonebooks have names of people as keys and their phone numbers as values. Just like key-value stores, unless you
know the name of the person whose number you need, you will not be able to find the right number.

The features of key-value database


The key-value store is one of the least complex types of NoSQL databases. This is precisely what makes this model so
attractive. It uses very simple functions to store, get and remove data.
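
For illustration, the basic operations look like this in Redis (one popular key-value store; the key names and values here are hypothetical):

SET session:1234 "theme=dark;cart=3" (store a value under a key)
GET session:1234 (retrieve the value by its key)
EXPIRE session:1234 3600 (let the session expire after an hour)
DEL session:1234 (remove the key-value pair)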

Apart from those main functions, key-value store databases do not have a query language. The data has no type and is
determined by the requirements of the application used to process the data.

A very useful feature is built-in redundancy improving the reliability of this database type.

Use cases of key-value databases


The choice of which database an organization should apply depends purely on its users and their needs. However, some
of the most common use cases of key-value databases are to record sessions in applications that require logins.
In this case, the data about each session -- period from login to logoff -- is recorded in a key-value store. Sessions are
marked with identifiers and all data recorded about each session -- themes, profiles, targeted offers, etc. -- is sorted under
the appropriate identifier.

With an increasing variety of data types and cheap storage options, organizations started stepping away from
relational databases and looking into nonrelational (NoSQL) databases for workloads like these.

Another more specific use case yet similar to the previous one is a shopping cart where e-commerce websites can record
data pertaining to individual shopping sessions. Relational databases are better to use with payment transaction records;
however, session records prior to payment are probably better off in a key-value store. We know that more people fill
their shopping carts and subsequently change their mind about buying the selected items than those who proceed to
payment. Why fill a relational database with all this data when there is a more efficient and more reliable solution?

A key-value store will be quick to record and get data simultaneously. Also, with its built-in redundancy, it ensures that
no item from a cart gets lost. The scalability of key-value stores comes in handy in peak seasons around holidays or
during sales and special promotions because there is usually a sharp increase in sales and an even greater increase in
traffic on the website. The scalability of the key-value store will make sure that the increased load on the database does
not result in performance issues.

Advantages of key-value databases


It is worth pointing out that different database types exist to serve different purposes. This sometimes makes the choice
of the right type of database to use obvious. While key-value databases may be limited in what they can do, they are
often the right choice for the following reasons:

 Simplicity. As mentioned above, key value databases are quite simple to use. The straightforward commands
and the absence of data types make work easier for programmers. With this feature data can assume any type,
or even multiple types, when needed.

 Speed. This simplicity makes key value databases quick to respond, provided that the rest of the environment
around it is well-built and optimized.

 Scalability. This is a beloved advantage of NoSQL databases over relational databases in general, and key-
value stores in particular. Unlike relational databases, which are only scalable vertically, key-value stores are
also infinitely scalable horizontally.

 Easy to move. The absence of a query language means that the database can be easily moved between
different systems without having to change the architecture.

 Reliability. Built-in redundancy comes in handy to cover for a lost storage node where duplicated data comes
in place of what's been lost.
Disadvantages of key-value databases
Not all key-value databases are the same, but some of the general drawbacks include the following:

 Simplicity. The list of advantages and disadvantages demonstrates that everything is relative, and that what
generally comes as an advantage can also be a disadvantage. This further proves that you have to consider
your needs and options carefully before choosing a database to use. The fact that key-value stores are not
complex also means that they are not refined. There is no language nor straightforward means that would
allow you to query the database with anything else other than the key.

 No query language. Without a unified query language to use, queries from one database may not be
transportable into a different key-value database.

 Values can't be filtered. The database sees values as blobs so it cannot make much sense of what they
contain. When there is a request placed, whole values are returned -- rather than a specific piece of
information -- and when they get updated, the whole value needs to be updated.

Popular key-value databases


If you want to rely on recommendations and follow in the footsteps of your peers, you likely won't make a mistake by
choosing one of the following key-value databases:

 Amazon DynamoDB. DynamoDB is a database trusted by many large-scale users and users in general. It is
fully managed and reliable, with built-in backup and security options. It is able to endure high loads and
handle trillions of requests daily. These are just some of the many features supporting the reputation of
DynamoDB, apart from its famous name.

 Aerospike. This is a real-time platform facilitating billions of transactions. It reduces the server footprint by
80% and enables high performance of real-time applications.

 Redis. Redis is an open source key-value database. With keys containing lists, hashes, strings and sets, Redis
is known as a data structure server.

The list goes on, and includes many strong competitors. Key-value databases serve a specific purpose, and they have
features that can add value to some but impose limitations on others. For this reason, you should always carefully assess
your requirements and the purpose of your data before you settle for a database. Once that is done, you can start looking
into your options and ensure that your database allows you to collect and make the most of your data without
compromising performance.
DynamoDB – Overview
DynamoDB allows users to create databases capable of storing and retrieving any amount of data, and serving any
amount of traffic. It automatically distributes data and traffic over servers to dynamically manage each customer's
requests, and also maintains fast performance.

What is DynamoDB?

DynamoDB is a hosted NoSQL database offered by Amazon Web Services (AWS). It offers:
 reliable performance even as it scales;
 a managed experience, so you won't be SSH-ing (SSH or Secure Shell is a network communication protocol that
enables two computers to communicate) into servers to upgrade the crypto libraries;
 a small, simple API allowing for simple key-value access as well as more advanced query patterns.

DynamoDB is a particularly good fit for the following use cases:

Applications with large amounts of data and strict latency requirements. As your amount of data scales, JOINs and
advanced SQL operations can slow down your queries. With DynamoDB, your queries have predictable latency up to
any size, including over 100 TBs!
Serverless applications using AWS Lambda. AWS Lambda provides auto-scaling, stateless, ephemeral compute in
response to event triggers. DynamoDB is accessible via an HTTP API and performs authentication & authorization via
IAM roles, making it a perfect fit for building Serverless applications.
Data sets with simple, known access patterns. If you're generating recommendations and serving them to users,
DynamoDB's simple key-value access patterns make it a fast, reliable choice.
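
As a small sketch using the AWS CLI (the table, attribute and item values below are hypothetical):

# create a simple key-value style table
aws dynamodb create-table \
    --table-name Sessions \
    --attribute-definitions AttributeName=SessionId,AttributeType=S \
    --key-schema AttributeName=SessionId,KeyType=HASH \
    --billing-mode PAY_PER_REQUEST

# write and read a single item by its primary key
aws dynamodb put-item --table-name Sessions \
    --item '{"SessionId": {"S": "abc123"}, "User": {"S": "alice"}}'

aws dynamodb get-item --table-name Sessions \
    --key '{"SessionId": {"S": "abc123"}}'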

DynamoDB vs. RDBMS

DynamoDB uses a NoSQL model, which means it uses a non-relational system. The following table highlights the
differences between DynamoDB and RDBMS −

Common Tasks: RDBMS vs. DynamoDB

 Connect to the source − An RDBMS uses a persistent connection and SQL commands; DynamoDB uses HTTP requests and API operations.
 Create a table − In an RDBMS, tables are the fundamental structures and must be defined up front; DynamoDB only uses primary keys, with no schema on creation, and can use various data sources.
 Get table info − In an RDBMS, all table info remains accessible; in DynamoDB, only the primary keys are revealed.
 Load table data − An RDBMS uses rows made of columns; DynamoDB uses items made of attributes in tables.
 Read table data − An RDBMS uses SELECT statements and filtering statements; DynamoDB uses GetItem, Query, and Scan.
 Manage indexes − An RDBMS uses standard indexes created through SQL statements, with modifications occurring automatically on table changes; DynamoDB uses secondary indexes to achieve the same function and requires specifications (partition key and sort key).
 Modify table data − An RDBMS uses an UPDATE statement; DynamoDB uses an UpdateItem operation.
 Delete table data − An RDBMS uses a DELETE statement; DynamoDB uses a DeleteItem operation.
 Delete a table − An RDBMS uses a DROP TABLE statement; DynamoDB uses a DeleteTable operation.

Advantages
The two main advantages of DynamoDB are scalability and flexibility. It does not force the use of a particular data
source and structure, allowing users to work with virtually anything, but in a uniform way.

Its design also supports a wide range of use from lighter tasks and operations to demanding enterprise functionality. It
also allows simple use of multiple languages: Ruby, Java, Python, C#, Erlang, PHP, and Perl.

Limitations
DynamoDB does suffer from certain limitations, however, these limitations do not necessarily create huge problems or
hinder solid development.

You can review them from the following points −

 Capacity Unit Sizes − A read capacity unit is a single consistent read per second for items no larger than 4KB.
A write capacity unit is a single write per second for items no bigger than 1KB.

 Provisioned Throughput Min/Max − All tables and global secondary indices have a minimum of one read and
one write capacity unit. Maximums depend on region. In the US, 40K read and write remains the cap per table
(80K per account), and other regions have a cap of 10K per table with a 20K account cap.
A data cap (bandwidth cap) is a service provider-imposed limit on the amount of data transferred by
a user account at a specified level of throughput over a given time period, for a specified fee. The
term applies to both home Internet service and mobile data plans.

Data caps are usually imposed as a maximum allowed amount of data in a month for an agreed-upon
charge. As a rule, when the user exceeds that limit, they are charged at a higher rate for further data
use.

 Provisioned Throughput Increase and Decrease − You can increase this as often as needed, but decreases
remain limited to no more than four times daily per table.

 Table Size and Quantity Per Account − Table sizes have no limits, but accounts have a 256 table limit unless
you request a higher cap.

 Secondary Indexes Per Table − Five local and five global are permitted.

 Projected Secondary Index Attributes Per Table − DynamoDB allows 20 attributes.

 Partition Key Length and Values − Their minimum length sits at 1 byte, and maximum at 2048 bytes, however,
DynamoDB places no limit on values.

 Sort Key Length and Values − Its minimum length stands at 1 byte, and maximum at 1024 bytes, with no limit
for values unless its table uses a local secondary index.

 Table and Secondary Index Names − Names must conform to a minimum of 3 characters in length, and a
maximum of 255. They use the following characters: A-Z, a-z, 0-9, "_", "-", and ".".

 Attribute Names − One character remains the minimum, and 64KB the maximum, with exceptions for keys and
certain attributes.

 Reserved Words − DynamoDB does not prevent the use of reserved words as names.

 Expression Length − Expression strings have a 4KB limit. Attribute expressions have a 255-byte limit.
Substitution variables of an expression have a 2MB limit.

Voldemort Key-Value Distributed Data Store


Voldemort is a distributed data store that was designed as a key-value store used by LinkedIn for highly-scalable storage. It is
named after the fictional Harry Potter villain Lord Voldemort.

Voldemort is a distributed key-value storage system

 Data is automatically replicated over multiple servers.


 Data is automatically partitioned so each server contains only a subset of the total data
 Provides tunable consistency (strict quorum or eventual consistency)
 Server failure is handled transparently
 Pluggable Storage Engines -- BDB-JE, MySQL, Read-Only
 Pluggable serialization -- Protocol Buffers, Thrift, Avro and Java Serialization
 Data items are versioned to maximize data integrity in failure scenarios without compromising availability of the system
 Each node is independent of other nodes with no central point of failure or coordination
 Good single node performance: you can expect 10-20k operations per second depending on the machines, the network, the
disk system, and the data replication factor
 Support for pluggable data placement strategies to support things like distribution across data centers that are geographically
far apart.

It is used at LinkedIn by numerous critical services powering a large portion of the site.
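
As a rough sketch of how the Java client is typically used (class names follow the Voldemort quickstart; the bootstrap URL, store name and values are placeholders, and the store must already be defined in the cluster configuration):

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class VoldemortExample {
    public static void main(String[] args) {
        // bootstrap against a running Voldemort node (placeholder URL)
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));

        // obtain a client for a store named "test"
        StoreClient<String, String> client = factory.getStoreClient("test");

        // put, get and update a versioned value
        client.put("some_key", "some_value");
        Versioned<String> value = client.get("some_key");
        value.setObject("some_other_value");
        client.put("some_key", value);
    }
}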

Comparison to relational databases

Voldemort is not a relational database; it does not attempt to satisfy arbitrary relations while satisfying ACID properties. Nor is it
an object database that attempts to transparently map object reference graphs. Nor does it introduce a new abstraction such as
document-orientation. It is basically just a big, distributed, persistent, fault-tolerant hash table. For applications that can use an O/R
mapper like active-record or hibernate this will provide horizontal scalability and much higher availability but at great loss of
convenience. For large applications under internet-type scalability pressure, a system may likely consist of a number of functionally
partitioned services or APIs, which may manage storage resources across multiple data centers using storage systems which may
themselves be horizontally partitioned. For applications in this space, arbitrary in-database joins are already impossible since all the
data is not available in any single database. A typical pattern is to introduce a caching layer which will require hashtable semantics
anyway. For these applications Voldemort offers a number of advantages:

 Voldemort combines in memory caching with the storage system so that a separate caching tier is not required (instead the
storage system itself is just fast)
 Unlike MySQL replication, both reads and writes scale horizontally
 Data partitioning is transparent, and allows for cluster expansion without rebalancing all data
 Data replication and placement is decided by a simple API to be able to accommodate a wide range of application specific
strategies
 The storage layer is completely mockable so development and unit testing can be done against a throw-away in-memory
storage system without needing a real cluster (or even a real storage system) for simple testing

Wide Column NoSQL Systems

A wide-column database is a NoSQL database that organizes data storage into flexible columns that can be spread
across multiple servers or database nodes, using multi-dimensional mapping to reference data by column, row, and
timestamp.

What are wide-column stores?

Wide-column stores use the typical tables, columns, and rows, but unlike relational databases (RDBs),
column formatting and names can vary from row to row inside the same table, and each column is stored
separately on disk.
Columnar databases store each column in a separate file. One file stores only the key column, another
only the first name, another the ZIP, and so on. Each column in a row is governed by auto-indexing
(each functions almost as an index), which means that a scanned/queried column's offset corresponds to
the other column offsets of that row in their respective files.

Traditional row-oriented storage gives you the best performance when querying multiple columns of a
single row. Of course, relational databases are structured around columns that hold very specific
information, upholding that specificity for each entry. For instance, let’s take a Customer table. Column
values contain Customer names, addresses, and contact info. Every Customer has the same format.

Column families are different. They give you automatic vertical partitioning; storage is both column-
based and organized by less restrictive attributes. RDB tables are also restricted to row-based storage and
deal with tuple storage in rows, accounting for all attributes before moving forward; e.g., tuple 1 attribute
1, tuple 1 attribute 2, and so on — then tuple 2 attribute 1, tuple 2 attribute 2, and so on — in that order.
The opposite is columnar storage, which is why we use the term column families.

Note: some columnar systems also have the option for horizontal partitions at default of, say, 6 million
rows. When it’s time to run a scan, this eliminates the need to partition during the actual query. Set up your
system to sort its horizontal partitions at default based on the most commonly used columns. This minimizes
the number of extents containing the values you are looking for.

One useful option if offered — InfiniDB is one example that does — is to automatically create horizontal
partitions based on the most recent queries. This eliminates the impact of much older queries that are no
longer crucial.

A wide-column store (or extensible record store) is a type of NoSQL database. It uses tables, rows, and columns, but unlike a relational
database, the names and format of the columns can vary from row to row in the same table. A wide-column store can be interpreted as a
two-dimensional key–value store.

Wide-column stores versus columnar databases


Wide-column stores such as Bigtable and Apache Cassandra are not column stores in the original sense of the term, since
their two-level structures do not use a columnar data layout. In genuine column stores, a columnar data layout is adopted
such that each column is stored separately on disk. Wide-column stores do often support the notion of column
families that are stored separately. However, each such column family typically contains multiple columns that are used
together, similar to traditional relational database tables. Within a given column family, all data is stored in a row-by-
row fashion, such that the columns for a given row are stored together, rather than each column being stored separately.

Wide-column stores that support column families are also known as column family databases.

Notable wide-column stores


Notable wide-column stores include:

 Apache Accumulo
 Apache Cassandra
 Apache HBase
 Bigtable
 DataStax Enterprise
 DataStax Astra DB
 Hypertable
 Azure Tables
 Scylla (database)

What Is a Wide Column Database?

Wide column databases, or column family databases, refer to a category of NoSQL databases that work well for storing
enormous amounts of data that can be collected. Its architecture uses persistent, sparse matrix, multi-dimensional
mapping (row-value, column-value, and timestamp) in a tabular format meant for massive scalability (over and above
the petabyte scale). Column family stores do not follow the relational model, and they aren’t optimized for joins.

Good wide column database use cases include:

 Sensor Logs [Internet of Things (IoT)]


 User preferences
 Geographic information
 Reporting systems
 Time series data
 Logging and other write heavy applications

Wide column databases are not the preferred choice for applications with ad-hoc query patterns, high level aggregations
and changing database requirements. This type of data store does not keep good data lineage.
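
For instance, the sensor-log use case above might be modelled in Cassandra with a schema like the following (a hypothetical CQL sketch, not taken from these notes):

CREATE TABLE sensor_logs (
    sensor_id text,           -- partition (row) key: groups all readings of one sensor
    reading_time timestamp,   -- clustering column: orders readings within the partition
    temperature double,
    humidity double,
    PRIMARY KEY (sensor_id, reading_time)
);

-- data extraction by row key
SELECT temperature, humidity FROM sensor_logs WHERE sensor_id = 'sensor-42';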

Other Definitions of Wide Column Databases Include:

 “A multidimensional nested sorted map of maps, where data is stored in cells of columns and grouped into column
families.” (Akshay Pore)
 “Scalability and high availability without compromising performance.” (Apache)
 Database management systems that organize related facts into columns. (Forbes)
 “Databases [that] are similar to key-value but allow a very large number of columns. They are well suited for
analyzing huge data sets, and Cassandra is the best known.” (IBM)
 A store that groups data into columns, allowing for an infinite number of them. (Temple University)
 A store with data as rows and columns, like a RDBMS, but able to handle more ambiguous and complex data
types, including unformatted text and imagery. (Michelle Knight)

Businesses Use Wide Column Databases to Handle:

 High volume of data


 Extreme write speeds with relatively less velocity reads
 Data extraction by columns using row keys

Hbase Data Model

What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-source project and
is horizontally scalable.

HBase is a data model that is similar to Google’s big table designed to provide quick random access to huge amounts
of structured data. It leverages the fault tolerance provided by the Hadoop File System (HDFS).

It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System.

One can store the data in HDFS either directly or through HBase. Data consumer reads/accesses the data in HDFS
randomly using HBase. HBase sits on top of the Hadoop File System and provides read and write access.

HBase and HDFS

 HDFS is a distributed file system suitable for storing large files; HBase is a database built on top of HDFS.
 HDFS does not support fast individual record lookups; HBase provides fast lookups for larger tables.
 HDFS provides high latency batch processing; HBase provides low latency access to single rows from billions of records (random access).
 HDFS provides only sequential access of data; HBase internally uses hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.

Storage Mechanism in HBase


HBase is a column-oriented database and the tables in it are sorted by row. The table schema defines only column
families, which are the key-value pairs. A table can have multiple column families and each column family can have
any number of columns. Subsequent column values are stored contiguously on the disk. Each cell value of the table
has a timestamp. In short, in an HBase:

 Table is a collection of rows.

 Row is a collection of column families.

 Column family is a collection of columns.

 Column is a collection of key value pairs.

Given below is an example schema of a table in HBase, where each row has a rowid and four column families, each containing columns col1, col2 and col3:

Rowid | Column Family 1 (col1, col2, col3) | Column Family 2 (col1, col2, col3) | Column Family 3 (col1, col2, col3) | Column Family 4 (col1, col2, col3)
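
Since this unit also covers HBase CRUD operations, here is a minimal sketch of the corresponding HBase shell commands (the table name, column families and values are hypothetical):

create 'employee', 'personal', 'professional'           # table with two column families
put 'employee', 'row1', 'personal:name', 'raju'          # insert/update a cell value
put 'employee', 'row1', 'professional:designation', 'manager'
get 'employee', 'row1'                                   # read one row by its rowkey
scan 'employee'                                          # read all rows in the table
delete 'employee', 'row1', 'personal:name'               # delete a single cell
disable 'employee'                                       # a table must be disabled before it can be dropped
drop 'employee'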

Column Oriented and Row Oriented

Column-oriented databases are those that store data tables as sections of columns of data, rather than as rows of data.
Shortly, they will have column families.

 A row-oriented database is suitable for Online Transaction Processing (OLTP); a column-oriented database is suitable for Online Analytical Processing (OLAP).
 Row-oriented databases are designed for a small number of rows and columns; column-oriented databases are designed for huge tables.

HBase and RDBMS

 HBase is schema-less; it doesn't have the concept of a fixed column schema and defines only column families. An RDBMS is governed by its schema, which describes the whole structure of its tables.
 HBase is built for wide tables and is horizontally scalable. An RDBMS is thin, built for small tables, and hard to scale.
 There are no transactions in HBase. An RDBMS is transactional.
 HBase has de-normalized data. An RDBMS has normalized data.
 HBase is good for semi-structured as well as structured data. An RDBMS is good for structured data.

Features of HBase

 HBase is linearly scalable.

 It has automatic failure support.

 It provides consistent read and writes.

 It integrates with Hadoop, both as a source and a destination.

 It has easy java API for client.

 It provides data replication across clusters.


Where to Use HBase

 Apache HBase is used to have random, real-time read/write access to Big Data.

 It hosts very large tables on top of clusters of commodity hardware.

 Apache HBase is a non-relational database modeled after Google's Bigtable. Bigtable acts up on Google File
System, likewise Apache HBase works on top of Hadoop and HDFS.

Applications of HBase

 It is used whenever there is a need to write heavy applications.

 HBase is used whenever we need to provide fast random access to available data.

 Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.

HBase – Overview of Architecture and Data Model


In this article, we will briefly look at the capabilities of HBase, compare it against technologies that we are already
familiar with and look at the underlying architecture. In the upcoming parts, we will explore the core data model and
features that enable it to store and manage semi-structured data.

Introduction
HBase is a column-oriented database that’s an open-source implementation of Google’s Big Table storage architecture.
It can manage structured and semi-structured data and has some built-in features such as scalability, versioning,
compression and garbage collection.
Since it uses write-ahead logging and distributed configuration, it can provide fault tolerance and quick recovery from
individual server failures. HBase is built on top of Hadoop/HDFS, and the data stored in HBase can be manipulated using
Hadoop's MapReduce capabilities.

Let's now take a look at how HBase (a column-oriented database) is different from some other data structures and
concepts that we are familiar with, starting with row-oriented vs. column-oriented data stores. In a row-oriented
data store, a row is a unit of data that is read or written together. In a column-oriented data store, the data in
a column is stored together and hence quickly retrieved.
Row-oriented data stores –

 Data is stored and retrieved one row at a time and hence could read unnecessary data if only some of the
data in a row is required.
 Easy to read and write records
 Well suited for OLTP systems
 Not efficient in performing operations applicable to the entire dataset and hence aggregation is an
expensive operation
 Typical compression mechanisms provide less effective results than those on column-oriented data stores

Column-oriented data stores –

 Data is stored and retrieved in columns and hence can read only relevant data if only some data is required
 Read and Write are typically slower operations
 Well suited for OLAP systems
 Can efficiently perform operations applicable to the entire dataset and hence enables aggregation over
many rows and columns
 Permits high compression rates due to few distinct values in columns

Relational Databases vs. HBase


When talking of data stores, we first think of Relational Databases with structured data storage and a sophisticated query
engine. However, a Relational Database incurs a big penalty to improve performance as the data size increases. HBase,
on the other hand, is designed from the ground up to provide scalability and partitioning to enable efficient data structure
serialization, storage and retrieval. Broadly, the differences between a Relational Database and HBase are:

Relational Database –

 Is Based on a Fixed Schema


 Is a Row-oriented datastore
 Is designed to store Normalized Data
 Contains thin tables
 Has no built-in support for partitioning.

HBase –

 Is Schema-less
 Is a Column-oriented datastore
 Is designed to store Denormalized Data
 Contains wide and sparsely populated tables
 Supports Automatic Partitioning

HDFS vs. HBase


HDFS is a distributed file system that is well suited for storing large files. It’s designed to support batch processing of
data but doesn’t provide fast individual record lookups. HBase is built on top of HDFS and is designed to provide access
to single rows of data in large tables. Overall, the differences between HDFS and HBase are

HDFS –

 Is suited for high-latency batch processing operations


 Data is primarily accessed through MapReduce
 Is designed for batch processing and hence doesn’t have a concept of random reads/writes

HBase –

 Is built for Low Latency operations


 Provides access to single rows from billions of records
 Data is accessed through shell commands, Client APIs in Java, REST, Avro or Thrift

HBase Architecture
The HBase Physical Architecture consists of servers in a Master-Slave relationship as shown below. Typically, the HBase
cluster has one Master node, called HMaster and multiple Region Servers called HRegionServer. Each Region Server
contains multiple Regions – HRegions.

Just like in a Relational Database, data in HBase is stored in Tables and these Tables are stored in Regions. When a Table
becomes too big, the Table is partitioned into multiple Regions. These Regions are assigned to Region Servers across
the cluster. Each Region Server hosts roughly the same number of Regions.
The HMaster in HBase is responsible for

 Performing Administration
 Managing and Monitoring the Cluster
 Assigning Regions to the Region Servers
 Controlling the Load Balancing and Failover

On the other hand, the HRegionServers perform the following work

 Hosting and managing Regions


 Splitting the Regions automatically
 Handling the read/write requests
 Communicating with the Clients directly
Each Region Server contains a Write-Ahead Log (called HLog) and multiple Regions. Each Region in turn is made up
of a MemStore and multiple StoreFiles (HFile). The data lives in these StoreFiles in the form of Column Families
(explained below). The MemStore holds in-memory modifications to the Store (data).

The mapping of Regions to Region Server is kept in a system table called .META. When trying to read or write data
from HBase, the clients read the required Region information from the .META table and directly communicate with the
appropriate Region Server. Each Region is identified by its start key (inclusive) and end key (exclusive).

HBase Data Model


The Data Model in HBase is designed to accommodate semi-structured data that could vary in field size, data type and
columns. Additionally, the layout of the data model makes it easier to partition the data and distribute it across the cluster.
The Data Model in HBase is made of different logical components such as Tables, Rows, Column Families, Columns,
Cells and Versions.

Tables – HBase Tables are logical collections of rows stored in separate partitions called Regions. Every Region is served by
exactly one Region Server.
Rows – A row is one instance of data in a table and is identified by a rowkey. Rowkeys are unique in a Table and are
always treated as a byte[].
Column Families – Data in a row are grouped together as Column Families. Each Column Family has one or more Columns,
and the Columns in a family are stored together in a low-level storage file known as an HFile. Column Families form the
basic unit of physical storage to which certain HBase features like compression are applied. Hence it is important that
proper care be taken when designing the Column Families in a table.
As an example, consider Customer and Sales Column Families. The Customer Column Family is made up of 2 columns –
Name and City, whereas the Sales Column Family is made up of 2 columns – Product and Amount.

Columns – A Column Family is made of one or more columns. A Column is identified by a Column Qualifier that
consists of the Column Family name concatenated with the Column name using a colon – example:
columnfamily:columnname. There can be multiple Columns within a Column Family and Rows within a table can have
varied number of Columns.
Cell – A Cell stores data and is essentially a unique combination of rowkey, Column Family and the Column (Column
Qualifier). The data stored in a Cell is called its value and the data type is always treated as byte[].
Version – The data stored in a cell is versioned and versions of data are identified by the timestamp. The number of
versions of data retained in a column family is configurable and this value by default is 3.

Hbase Crud Operations

HBase - Client API


This chapter describes the Java client API for HBase that is used to
perform CRUD (Create, Retrieve, Update, Delete) operations on HBase tables. HBase is written in Java and has a native
Java API. Therefore it provides programmatic access to the Data Manipulation Language (DML).

Class HBase Configuration

Adds HBase configuration files to a Configuration. This class belongs to the org.apache.hadoop.hbase package.

Methods and description


S.No. Methods and Description

1
static org.apache.hadoop.conf.Configuration create()

This method creates a Configuration with HBase resources.

Class HTable

HTable is an HBase internal class that represents an HBase table. It is an implementation of Table that is used to
communicate with a single HBase table. This class belongs to the org.apache.hadoop.hbase.client package.

Constructors
S.No. Constructors and Description
1
HTable()

2
HTable(TableName tableName, ClusterConnection connection, ExecutorService pool)

Using this constructor, you can create an object to access an HBase table.

Methods and description


S.No. Methods and Description

1
void close()

Releases all the resources of the HTable.

2
void delete(Delete delete)

Deletes the specified cells/row.

3
boolean exists(Get get)

Using this method, you can test the existence of columns in the table, as specified by Get.

4
Result get(Get get)

Retrieves certain cells from a given row.

5
org.apache.hadoop.conf.Configuration getConfiguration()

Returns the Configuration object used by this instance.

6
TableName getName()

Returns the table name instance of this table.

7
HTableDescriptor getTableDescriptor()

Returns the table descriptor for this table.

8
byte[] getTableName()

Returns the name of this table.


9
void put(Put put)

Using this method, you can insert data into the table.
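
Putting HBaseConfiguration and HTable together with the Put class described in the next section, the following is a minimal insert sketch. It assumes a table named "customer" with a "Customer" column family already exists, and it uses the older HTable(Configuration, String) constructor and Put.add(...) method; newer HBase releases replace these with the Connection/Table API and addColumn(...), so treat this as illustrative rather than version-exact.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class InsertData {
   public static void main(String[] args) throws IOException {
      // Build a Configuration that picks up the HBase resources (hbase-site.xml etc.)
      Configuration config = HBaseConfiguration.create();

      // Open a handle to the existing "customer" table (the table name is assumed for this example)
      HTable table = new HTable(config, "customer");

      // Create a Put for rowkey "row1" and add a value under Customer:Name
      Put p = new Put(Bytes.toBytes("row1"));
      p.add(Bytes.toBytes("Customer"), Bytes.toBytes("Name"), Bytes.toBytes("Raju"));

      // Insert the row and release the table's resources
      table.put(p);
      table.close();
   }
}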

Class Put

This class is used to perform Put operations for a single row. It belongs to
the org.apache.hadoop.hbase.client package.

Constructors
S.No. Constructors and Description

1
Put(byte[] row)

Using this constructor, you can create a Put operation for the specified row.

2
Put(byte[] rowArray, int rowOffset, int rowLength)

Using this constructor, you can make a copy of the passed-in row key to keep local.

3
Put(byte[] rowArray, int rowOffset, int rowLength, long ts)

Using this constructor, you can make a copy of the passed-in row key to keep local.

4
Put(byte[] row, long ts)

Using this constructor, we can create a Put operation for the specified row, using a given timestamp.

Methods
S.No. Methods and Description

1
Put add(byte[] family, byte[] qualifier, byte[] value)

Adds the specified column and value to this Put operation.

2
Put add(byte[] family, byte[] qualifier, long ts, byte[] value)

Adds the specified column and value, with the specified timestamp as its version to this Put operation.
3
Put add(byte[] family, ByteBuffer qualifier, long ts, ByteBuffer value)

Adds the specified column and value, with the specified timestamp as its version to this Put operation.


Class Get

This class is used to perform Get operations on a single row. This class belongs to
the org.apache.hadoop.hbase.client package.

Constructor
S.No. Constructor and Description

1
Get(byte[] row)

Using this constructor, you can create a Get operation for the specified row.

2 Get(Get get)

Methods
S.No. Methods and Description

1
Get addColumn(byte[] family, byte[] qualifier)

Retrieves the column from the specific family with the specified qualifier.

2
Get addFamily(byte[] family)

Retrieves all columns from the specified family.
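
As a companion sketch to the Get methods above (and the Result class described further below), the following reads back the value written in the earlier Put example. The table and column names are the same assumed ones; the HTable constructor form again depends on the HBase version.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RetrieveData {
   public static void main(String[] args) throws IOException {
      Configuration config = HBaseConfiguration.create();
      HTable table = new HTable(config, "customer");

      // Build a Get for rowkey "row1", restricted to the Customer:Name column
      Get g = new Get(Bytes.toBytes("row1"));
      g.addColumn(Bytes.toBytes("Customer"), Bytes.toBytes("Name"));

      // Execute the Get; the returned Result holds the matching cells
      Result result = table.get(g);
      byte[] value = result.getValue(Bytes.toBytes("Customer"), Bytes.toBytes("Name"));
      System.out.println("Name: " + Bytes.toString(value));

      table.close();
   }
}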

Class Delete

This class is used to perform Delete operations on a single row. To delete an entire row, instantiate a Delete object with
the row to delete. This class belongs to the org.apache.hadoop.hbase.client package.
Constructor
S.No. Constructor and Description

1
Delete(byte[] row)

Creates a Delete operation for the specified row.

2
Delete(byte[] rowArray, int rowOffset, int rowLength)

Creates a Delete operation for the specified row and timestamp.

3
Delete(byte[] rowArray, int rowOffset, int rowLength, long ts)

Creates a Delete operation for the specified row and timestamp.

4
Delete(byte[] row, long timestamp)

Creates a Delete operation for the specified row and timestamp.

Methods
S.No. Methods and Description

1
Delete addColumn(byte[] family, byte[] qualifier)

Deletes the latest version of the specified column.

2
Delete addColumns(byte[] family, byte[] qualifier, long timestamp)

Deletes all versions of the specified column with a timestamp less than or equal to the specified
timestamp.

3
Delete addFamily(byte[] family)

Deletes all versions of all columns of the specified family.

4
Delete addFamily(byte[] family, long timestamp)

Deletes all columns of the specified family with a timestamp less than or equal to the specified
timestamp.
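
A short sketch using the Delete methods above, under the same assumptions as the previous examples (a "customer" table with Customer and Sales column families):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteData {
   public static void main(String[] args) throws IOException {
      Configuration config = HBaseConfiguration.create();
      HTable table = new HTable(config, "customer");

      // Delete the latest version of Customer:City and all columns of the Sales family for rowkey "row1"
      Delete d = new Delete(Bytes.toBytes("row1"));
      d.addColumn(Bytes.toBytes("Customer"), Bytes.toBytes("City"));
      d.addFamily(Bytes.toBytes("Sales"));

      table.delete(d);
      table.close();
   }
}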
Class Result

This class is used to get a single row result of a Get or a Scan query.

Constructors
S.No. Constructors

1
Result()

Using this constructor, you can create an empty Result with no KeyValue payload; it returns null if you
call rawCells().

Methods
S.No. Methods and Description

1
byte[] getValue(byte[] family, byte[] qualifier)

This method is used to get the latest version of the specified column.

2
byte[] getRow()

This method is used to retrieve the row key that corresponds to the row from which this Result was
created.

Hbase Storage and Distributed System Concepts


Auto-Sharding

The basic unit of scalability and load balancing in HBase is called a region. Regions are essentially contiguous ranges of
rows stored together. They are dynamically split by the system when they become too large. Alternatively, they may also
be merged to reduce their number and required storage files. Each region is served by exactly one region server, and
each of these servers can serve many regions at any time.
Splitting and serving regions can be thought of as autosharding, as offered by other systems. The regions allow for fast
recovery when a server fails, and for fine-grained load balancing, since they can be moved between servers when the server
currently serving a region comes under load pressure, or if that server becomes unavailable because of a failure or
because it is being decommissioned.

Splitting is also very fast—close to instantaneous—because the split regions simply read from the original storage files
until a compaction rewrites them into separate ones asynchronously.

Storage API
The API offers operations to create and delete tables and column families. In addition, it has functions to change the
table and column family metadata, such as compression or block sizes. Furthermore, there are the usual operations for
clients to create or delete values as well as retrieving them with a given row key.
A scan API allows you to efficiently iterate over ranges of rows and to limit which columns are returned or how many
versions of each cell are retrieved. You can match columns using filters and select versions using time ranges, specifying
start and end times. A short scan sketch follows below.
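
A minimal scan sketch, continuing the assumptions from the earlier examples (a "customer" table with a Customer column family); the Scan(startRow, stopRow) constructor used here is the classic form and may be expressed differently in newer HBase releases.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanTable {
   public static void main(String[] args) throws IOException {
      Configuration config = HBaseConfiguration.create();
      HTable table = new HTable(config, "customer");

      // Iterate over the row range [row1, row9) and return only the Customer:Name column
      Scan scan = new Scan(Bytes.toBytes("row1"), Bytes.toBytes("row9"));
      scan.addColumn(Bytes.toBytes("Customer"), Bytes.toBytes("Name"));

      ResultScanner scanner = table.getScanner(scan);
      try {
         for (Result result : scanner) {
            System.out.println(Bytes.toString(result.getRow()) + " -> "
                  + Bytes.toString(result.getValue(Bytes.toBytes("Customer"), Bytes.toBytes("Name"))));
         }
      } finally {
         scanner.close();
         table.close();
      }
   }
}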

Implementation (Architecture)
The data is stored in store files, called HFiles, which are persistent and ordered immutable maps from keys to values.
Internally, the files are sequences of blocks with a block index stored at the end. The index is loaded when the HFile is
opened and kept in memory. The default block size is 64 KB but can be configured differently if required. The store files
provide an API to access specific values as well as to scan ranges of values given a start and end key.
The store files are typically saved in the Hadoop Distributed File System (HDFS), which provides a scalable, persistent,
replicated storage layer for HBase. It guarantees that data is never lost by writing the changes across a configurable
number of physical servers. When data is updated, it is first written to a commit log, called a write-ahead log (WAL) in
HBase, and then stored in the in-memory memstore. Once the data in memory has exceeded a given maximum value, it
is flushed as an HFile to disk.
There are three major components to HBase: the client library, one master server, and many region servers. The region
servers can be added or removed while the system is up and running to accommodate changing workloads. The master
is responsible for assigning regions to region servers and uses Apache ZooKeeper, a reliable, highly available, persistent
and distributed coordination service, to facilitate that task.

The master server is also responsible for handling load balancing of regions across region servers, to unload busy
servers and move regions to less occupied ones. The master is not part of the actual data storage or retrieval path. It
negotiates load balancing and maintains the state of the cluster, but never provides any data services to either the region
servers or the clients, and is therefore lightly loaded in practice. In addition, it takes care of schema changes and other
metadata operations, such as creation of tables and column families.
Region servers are responsible for all read and write requests for all regions they serve, and also split regions that have
exceeded the configured region size thresholds. Clients communicate directly with them to handle all data-related
operations.

NoSQL Graph Databases and Neo4j

What is a NoSQL Graph Database?

The NoSQL graph database is a technology for data management designed to handle very large sets of structured,
semi-structured or unstructured data. The semantic graph database (also known as RDF triplestore) is a type of
NoSQL graph database that is capable of integrating heterogeneous data from many sources and making links
between datasets. It focuses on the relationships between entities and is able to infer new knowledge out of existing
information.

NoSQL stands for 'not only SQL'. A NoSQL graph database helps organizations access, integrate and analyze data from various
sources, thus supporting their big data and social media analytics.

NoSQL Graph Database Vs. Relational Database

The traditional approach to data management, the relational database, was developed in the 1970s to help enterprises
store structured information. The relational database needs its schema (the definition how data is organized and how the
relations are associated) to be defined before any new information is added.
Today, however, mobile, social and Internet of Things (IoT) data is everywhere, with unstructured real-time data piling
up by the minute. Apart from handling a massive amount of data of all kind, the NoSQL graph database does not need
its schema re-defined before adding new data.

This makes the graph database much more flexible, dynamic and lower-cost in integrating new data sources than
relational databases.

Compared to the moderate data velocity from one or a few locations that relational databases handle, NoSQL graph databases
are able to store, retrieve, integrate and analyze high-velocity data coming from many locations, e.g. Facebook links.

Semantically Rich NoSQL Graph Database

The semantic graph database is a type of NoSQL graph database that is capable of integrating heterogeneous data from
many sources and making links between datasets.

The semantic graph database, also referred to as an RDF triplestore, focuses on the relationships between entities and is
able to infer new knowledge out of existing information. It is a powerful tool to use in relationship-centered analytics
and knowledge discovery.

In addition, the capability to handle massive datasets and the schema-less approach support the NoSQL semantic graph
database usage in real-time big data analytics.

 In relational databases, the need to have the schemas defined before adding new information restricts data integration from
new sources because the whole schema needs to be changed anew.
 Because the schema-less NoSQL semantic graph database does not need its schema changed every time a new data source is
added, enterprises can integrate data with less effort and cost.

The semantic graph database stands out from the other types of graph databases with its ability to additionally support
rich semantic data schema, the so-called ontologies.
The semantic NoSQL graph database gets the best of both worlds: on the one hand, data is flexible because it does not
depend on the schema. On the other hand, ontologies give the semantic graph database the freedom and ability to build
logical models any way organizations find it useful for their applications, without having to change the data.

The Benefits of the Semantic Graph Database

Apart from rich semantic models, semantic graph databases use the globally developed W3C standards for representing
data on the Web. The use of standard practices makes data integration, exchange and mapping to other datasets easier
and lowers the risk of vendor lock-in while working with a graph database.

One of those standards is the Uniform Resource Identifier (URI), a kind of unique ID for all things linked so that we can
distinguish between them or know that one thing from one dataset is the same as another in a different dataset. The use
of URIs not only reduces costs in integrating data from disparate sources, it also makes data publishing and sharing easier
with mapping to Linked (Open) Data.

Ontotext’s GraphDB is able to use inference, that is, to infer new links out of existing explicit statements in the RDF
triplestore. Inference enriches the graph database by creating new knowledge and gives organizations the ability to see
all their data highly interlinked. Thus, enterprises have more insights at hand to use in their decision-making processes.

NoSQL Graph Database Use Cases

Apart from representing proprietary enterprise data in a linked and meaningful way, the NoSQL graph database makes
content management and personalization easier, due to its cost-effective way of integrating and combining huge sets of
data.

As we all know, a graph is a pictorial representation of data in the form of nodes and relationships, where the relationships are
represented by edges. A graph database is a type of database used to represent data in the form of a graph. It has
three components: nodes, relationships, and properties. These components are used to model the data. The concept of
a Graph Database is based on the theory of graphs. It was introduced in the year 2000. Graph databases are commonly classified
as NoSQL databases because data is stored using nodes, relationships and properties instead of traditional tables. A graph
database is very useful for heavily interconnected data. Here the relationships between data are given priority, and therefore
the relationships can be easily visualized. They are flexible, as new data can be added without hampering the old data.
They are useful in the fields of social networking, fraud detection, AI knowledge graphs, etc.
The description of components are as follows:

 Nodes: represent the objects or instances. They are equivalent to rows in a database. A node basically acts
as a vertex in a graph. Nodes are grouped by applying a label to each member.
 Relationships: They are basically the edges in the graph. They have a specific direction and type, and they form
patterns in the data. They basically establish relationships between nodes.
 Properties: They are the information associated with the nodes.
Some examples of Graph Database software are Neo4j, Oracle NoSQL DB, GraphBase, etc., of which Neo4j is
the most popular one.

In traditional databases, the relationships between data are not stored explicitly. But in the case of a Graph Database, the
relationships between data are prioritized. Nowadays, data is mostly interconnected, with one data item connected to another
directly or indirectly. Since the concept of this database is based on graph theory, it is flexible and works very fast for
associative data. Data items are often interconnected, which also helps to establish further relationships. It
is fast in the querying part as well, because with the help of relationships we can quickly find the desired nodes.
Join operations are not required in this database, which reduces the cost. The relationships and properties are stored as
first-class entities in a Graph Database.

Graph databases also allow organizations to connect their data with external sources. Since organizations handle a
huge amount of data, it often becomes cumbersome to store it in the form of tables. For instance, if an organization
wants to find a record that is connected with a record in another table, a join operation must first be performed
between the tables, and then the data is searched row by row. The Graph database solves this big problem: it
stores the relationships and properties along with the data. So if the organization needs to search for particular data,
the nodes can be found with the help of relationships and properties, without joining and without traversing row by
row. Thus the search for nodes does not depend on the amount of data.

Types of Graph Databases:


 Property Graphs: These graphs are used for querying and analyzing data by modelling the relationships
among the data. They comprise vertices that hold information about a particular subject and edges that
denote the relationships. The vertices and edges have additional attributes called properties.
 RDF Graphs: RDF stands for Resource Description Framework. These graphs focus more on data integration. They
are used to represent complex data with well-defined semantics. Each statement is represented by three elements: two
vertices and an edge, reflecting the subject, object and predicate of a sentence. Every vertex and edge is
identified by a URI (Uniform Resource Identifier).
When to Use Graph Database?
 Graph databases should be used for heavily interconnected data.
 It should be used when the amount of data is large and relationships are present.
 It can be used to represent the cohesive picture of the data.
How Graph and Graph Databases Work?
Graph databases provide graph models. They allow users to perform traversal queries since the data is connected. Graph
algorithms are also applied to find patterns, paths and other relationships, thus enabling deeper analysis of the data. The
algorithms help to explore neighboring nodes, cluster vertices, and analyze relationships and patterns. Countless
joins are not required in this kind of database.

Example of Graph Database:


 Recommendation engines in e-commerce use graph databases to provide customers with accurate
recommendations and updates about new products, thus increasing sales and satisfying customers' desires.
 Social media companies use graph databases to find the "friends of friends" or products that a user's
friends like, and send suggestions to the user accordingly.
 Graph databases play a major role in fraud detection. Users can create a graph from the transactions between
entities and store other important information. Once created, running a simple query will help to identify
the fraud.
Advantages of Graph Database:
 A potential advantage of a Graph Database is that it can establish relationships with external sources as well.
 No joins are required since the relationships are already specified.
 Querying depends on the concrete relationships and not on the amount of data.
 It is flexible and agile.
 It is easy to manage the data in terms of a graph.
Disadvantages of Graph Database:
 For complex relationships, search speed often becomes slower.
 The query language is platform dependent.
 They are inappropriate for transactional data.
 They have a smaller user base.
Future of Graph Database:
A Graph Database is an excellent tool for storing data, but it cannot completely replace the traditional database. This database
deals with typical sets of interconnected data. Although Graph Databases are still in a developmental phase, they are becoming
important as businesses and organizations use big data, and graph databases help in complex analysis. Thus these databases
have become a must for today's needs and tomorrow's success.

Neo4j is the world's leading open source Graph Database, developed using Java technology. It is highly scalable
and schema-free (NoSQL).

Used by: Walmart, eBay, NASA, Microsoft, IBM

What is a Graph Database?


A graph is a pictorial representation of a set of objects where some pairs of objects are connected by links. It is composed
of two elements - nodes (vertices) and relationships (edges).

A graph database is a database used to model the data in the form of a graph. Here, the nodes of a graph depict the entities
while the relationships depict the associations between these nodes.

Popular Graph Databases


Neo4j is a popular Graph Database. Other Graph Databases are Oracle NoSQL Database, OrientDB, HyperGraphDB,
GraphBase, InfiniteGraph, and AllegroGraph.

Why Graph Databases?


Nowadays, much of the data exists in the form of relationships between different objects and, more often than not, the
relationships between the data are more valuable than the data itself.

Relational databases store highly structured data in records that all share the same structure, so they are well suited for
storing structured data, but they do not store the relationships between the data.

Unlike other databases, graph databases store relationships and connections as first-class entities.

The data model for graph databases is simpler compared to other databases and, they can be used with OLTP systems.
They provide features like transactional integrity and operational availability.

RDBMS Vs Graph Database


Following is the table which compares Relational databases and Graph databases.

Sr.No RDBMS Graph Database

1 Tables Graphs

2 Rows Nodes

3 Columns and Data Properties and its values

4 Constraints Relationships

5 Joins Traversal

Advantages of Neo4j

Following are the advantages of Neo4j.


 Flexible data model − Neo4j provides a flexible simple and yet powerful data model, which can be easily
changed according to the applications and industries.

 Real-time insights − Neo4j provides results based on real-time data.

 High availability − Neo4j is highly available for large enterprise real-time applications with transactional
guarantees.

 Connected and semi-structured data − Using Neo4j, you can easily represent connected and semi-structured
data.

 Easy retrieval − Using Neo4j, you can not only represent but also easily retrieve (traverse/navigate) connected
data faster when compared to other databases.

 Cypher query language − Neo4j provides a declarative query language to represent the graph visually, using
an ascii-art syntax. The commands of this language are in human readable format and very easy to learn.

 No joins − Using Neo4j, it does NOT require complex joins to retrieve connected/related data as it is very easy
to retrieve its adjacent node or relationship details without joins or indexes.

Features of Neo4j

Following are the notable features of Neo4j −

 Data model (flexible schema) − Neo4j follows a data model named native property graph model. Here, the
graph contains nodes (entities) and these nodes are connected with each other (depicted by relationships). Nodes
and relationships store data in key-value pairs known as properties.

In Neo4j, there is no need to follow a fixed schema. You can add or remove properties as per requirement. It
also provides schema constraints.

 ACID properties − Neo4j supports full ACID (Atomicity, Consistency, Isolation, and Durability) rules.

 Scalability and reliability − You can scale the database by increasing the number of reads/writes and the data
volume without affecting the query processing speed or data integrity. Neo4j also provides support
for replication for data safety and reliability.

 Cypher Query Language − Neo4j provides a powerful declarative query language known as Cypher. It uses
ASCII-art for depicting graphs. Cypher is easy to learn and can be used to create and retrieve relations between
data without using the complex queries like Joins.

 Built-in web application − Neo4j provides a built-in Neo4j Browser web application. Using this, you can create
and query your graph data.

 Drivers − Neo4j can work with −

o REST (Representational State Transfer) API to work with programming languages and frameworks such as Java,
Spring, Scala, etc.

o JavaScript to work with UI MVC (Model-View-Controller) frameworks such as Node.js.

o Two kinds of Java APIs (application program interfaces): the Cypher API and the Native Java API to
develop Java applications. In addition to these, you can also work with other databases such as
MongoDB, Cassandra, etc.

 Indexing − Neo4j supports Indexes by using Apache Lucene.

Neo4j Property Graph Data Model

Neo4j Graph Database follows the Property Graph Model to store and manage its data.

Following are the key features of Property Graph Model −

 The model represents data in Nodes, Relationships and Properties

 Properties are key-value pairs

 Nodes are represented using circles and Relationships are represented using arrows

 Relationships have directions: Unidirectional and Bidirectional

 Each Relationship contains "Start Node" or "From Node" and "To Node" or "End Node"

 Both Nodes and Relationships contain properties

 Relationships connect nodes

In Property Graph Data Model, relationships should be directional. If we try to create relationships without direction,
then it will throw an error message.

In Neo4j too, relationships should be directional. If we try to create relationships without direction, then Neo4j will
throw an error message saying that "Relationships should be directional".

Neo4j Graph Database stores all of its data in Nodes and Relationships. We need neither an additional RDBMS
database nor any SQL database to store Neo4j data. It stores its data in terms of graphs in its native format.

Neo4j uses Native GPE (Graph Processing Engine) to work with its Native graph storage format.

The main building blocks of Graph DB Data Model are −

 Nodes

 Relationships

 Properties

Following is a simple example of a Property Graph.


Here, we have represented Nodes using Circles. Relationships are represented using Arrows. Relationships are
directional. We can represent Node's data in terms of Properties (key-value pairs). In this example, we have represented
each Node's Id property within the Node's Circle.

Neo4j Query Cypher Language


Neo4j has its own query language called Cypher. It is similar in spirit to SQL, but remember one
thing: Neo4j does not work with tables, rows or columns; it deals with nodes. It is more satisfying to see the data
in a graph format rather than in a table format.

Example: A Neo4j Cypher statement compared to SQL

MATCH (G:Company { name:"GeeksforGeeks" })

RETURN G

This Cypher statement will return the "Company" node where the "name" property is GeeksforGeeks. Here
"G" works like a variable that holds the data your Cypher query asks for and then returns it. It will
be clearer if you know SQL. Below, the same query is written in SQL.

SELECT * FROM Company WHERE name = "GeeksforGeeks";

Neo4j is a NoSQL database, but it is also very effective for data that would traditionally be kept in a relational database;
it simply does not use the SQL language.

ASCII-Art Syntax: Neo4j uses ASCII-art to describe patterns.

(X)-[:GeeksforGeeks]->(Y)

In Neo4j, nodes are represented by "( )".

 The relationship is represented by " -> ".

 The kind of relationship between the nodes is represented by " [ ] ", like [:GeeksforGeeks].
The description above helps decode the given ASCII-art syntax, (X)-[:GeeksforGeeks]->(Y): here X and Y are
nodes and the relationship from X to Y is of kind "GeeksforGeeks".
Defining Data: The points below will help you grasp the concepts of the Cypher language.

 Neo4j deals with nodes, and nodes carry labels such as "Person", "Employee" or "Employer", anything
that defines the kind of entity the value field represents.
 Neo4j also has properties like "name", "employee_id", "phone_number"; basically, they give us
information about the nodes.
 Neo4j's relationships can also contain properties, but this is not mandatory.
 A relationship in Neo4j captures a situation such as "X works for GeeksforGeeks": here X and
"GeeksforGeeks" are the nodes and the relationship is "works for". In Cypher it is written as:
(X)-[:WORK]->(GeeksforGeeks).

Note: In the query below, Company is the node's label and name is a property of the node.

MATCH (G:Company { name:"GeeksforGeeks" })

RETURN G
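
The same Cypher statements can be executed from a Java program. The following is a minimal sketch assuming the official Neo4j Java driver (package org.neo4j.driver, 4.x class names) is on the classpath and a Neo4j instance is reachable at the placeholder URI and credentials shown.

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;

public class CypherFromJava {
   public static void main(String[] args) {
      // Connection details are placeholders; adjust them for your own Neo4j instance
      try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
               AuthTokens.basic("neo4j", "password"));
           Session session = driver.session()) {

         // Create the (X)-[:WORK]->(GeeksforGeeks) pattern described above
         session.run("MERGE (x:Person {name:'X'})-[:WORK]->(g:Company {name:'GeeksforGeeks'})");

         // Run the MATCH ... RETURN query and print the company name
         Result result = session.run("MATCH (g:Company {name:'GeeksforGeeks'}) RETURN g.name AS name");
         while (result.hasNext()) {
            System.out.println(result.next().get("name").asString());
         }
      }
   }
}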

What is Big Data?


Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large
size and complexity that none of the traditional data management tools can store or process it efficiently.

What is an Example of Big Data?


Following are some of the Big Data examples-

The New York Stock Exchange is an example of Big Data that generates about one terabyte of new trade data per day.

Social Media
Statistics show that 500+ terabytes of new data are ingested into the databases of the social media
site Facebook every day. This data is mainly generated through photo and video uploads, message
exchanges, comments, etc.

A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands
of flights per day, data generation reaches many petabytes.

Data Growth over the years (a yottabyte is 2^80 bytes).

Characteristics Of Big Data


Big data can be described by the following characteristics:

 Volume
 Variety
 Velocity
 Variability
 Veracity

(i) Volume – The name Big Data itself is related to a size which is enormous. Size of data plays a very crucial role in
determining value out of data. Also, whether a particular data can actually be considered as a Big Data or not, is dependent
upon the volume of data. Hence, ‘Volume’ is one characteristic which needs to be considered while dealing with Big
Data solutions.

(ii) Variety – The next aspect of Big Data is its variety.

Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. During earlier days,
spreadsheets and databases were the only sources of data considered by most of the applications. Nowadays, data in the
form of emails, photos, videos, monitoring devices, PDFs, audio, etc. are also being considered in the analysis
applications. This variety of unstructured data poses certain issues for storage, mining and analyzing data.

(iii) Velocity – The term ‘velocity’ refers to the speed of generation of data. How fast the data is generated and processed
to meet the demands, determines real potential in the data.

Big Data Velocity deals with the speed at which data flows in from sources like business processes, application logs,
networks, and social media sites, sensors, Mobile devices, etc. The flow of data is massive and continuous.

(iv) Variability – This refers to the inconsistency which can be shown by the data at times, thus hampering the process
of being able to handle and manage the data effectively.
(v) Veracity – Veracity refers to the quality of data. Because data comes from so many different sources, it’s difficult to
link, match, cleanse and transform data across systems. Businesses need to connect and correlate relationships,
hierarchies and multiple data linkages. Otherwise, their data can quickly spiral out of control.

Advantages Of Big Data Processing


Ability to process Big Data in DBMS brings in multiple benefits, such as-

 Businesses can utilize outside intelligence while taking decisions

Access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their
business strategies.

 Improved customer service


Traditional customer feedback systems are getting replaced by new systems designed with Big Data technologies. In
these new systems, Big Data and natural language processing technologies are being used to read and evaluate consumer
responses.

 Early identification of risk to the product/services, if any


 Better operational efficiency

Big Data technologies can be used for creating a staging area or landing zone for new data before identifying what data
should be moved to the data warehouse. In addition, such integration of Big Data technologies and data warehouse helps
an organization to offload infrequently accessed data.

Summary

 Big Data definition: Big Data is a term used to describe a collection of data that is huge in size and yet growing
exponentially with time.
 Big Data analytics examples include stock exchanges, social media sites, jet engines, etc.
 Big Data could be 1) Structured, 2) Unstructured, 3) Semi-structured
 Volume, Variety, Velocity, Variability and Veracity are a few Big Data characteristics
 Improved customer service, better operational efficiency and better decision making are a few advantages of Big Data

MapReduce
MapReduce is a programming model for writing applications that can process Big Data in parallel on multiple nodes.
MapReduce provides analytical capabilities for analyzing huge volumes of complex data.

Why MapReduce?

Traditional enterprise systems normally have a centralized server to store and process data. The following illustration
depicts a schematic view of a traditional enterprise system. The traditional model is certainly not suitable for processing huge
volumes of scalable data, which cannot be accommodated by standard database servers. Moreover, the centralized system
creates too much of a bottleneck while processing multiple files simultaneously.

Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a task into small parts
and assigns them to many computers. Later, the results are collected at one place and integrated to form the result
dataset.
How MapReduce Works?

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

 The Map task takes a set of data and converts it into another set of data, where individual elements are broken
down into tuples (key-value pairs).

 The Reduce task takes the output from the Map as an input and combines those data tuples (key-value pairs) into
a smaller set of tuples.

The reduce task is always performed after the map job.

Let us now take a close look at each of the phases and try to understand their significance.

 Input Phase − Here we have a Record Reader that translates each record in an input file and sends the parsed
data to the mapper in the form of key-value pairs.

 Map − Map is a user-defined function, which takes a series of key-value pairs and processes each one of them
to generate zero or more key-value pairs.
 Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.

 Combiner − A combiner is a type of local Reducer that groups similar data from the map phase into identifiable
sets. It takes the intermediate keys from the mapper as input and applies a user-defined code to aggregate the
values in a small scope of one mapper. It is not a part of the main MapReduce algorithm; it is optional.

 Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-value
pairs onto the local machine, where the Reducer is running. The individual key-value pairs are sorted by key
into a larger data list. The data list groups the equivalent keys together so that their values can be iterated easily
in the Reducer task.

 Reducer − The Reducer takes the grouped key-value paired data as input and runs a Reducer function on each
one of them. Here, the data can be aggregated, filtered, and combined in a number of ways, and it requires a
wide range of processing. Once the execution is over, it gives zero or more key-value pairs to the final step.

 Output Phase − In the output phase, we have an output formatter that translates the final key-value pairs from
the Reducer function and writes them onto a file using a record writer.

Let us try to understand the two tasks, Map & Reduce, with the help of a small diagram −

MapReduce-Example

Let us take a real-world example to comprehend the power of MapReduce. Twitter receives around 500 million tweets
per day, which is nearly 3000 tweets per second. The following illustration shows how Twitter manages its tweets with
the help of MapReduce.
As shown in the illustration, the MapReduce algorithm performs the following actions −

 Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.

 Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as key-value pairs.

 Count − Generates a token counter per word.

 Aggregate Counters − Prepares an aggregate of similar counter values into small manageable units.
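
The tweet pipeline above has the same shape as the classic word-count job, which is the canonical illustration of the Map and Reduce tasks. A sketch against the Hadoop MapReduce Java API is shown below; the input and output paths are supplied on the command line and are purely illustrative.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

   // Map task: tokenize each input record and emit (word, 1) pairs
   public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
      private final static IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
         StringTokenizer itr = new StringTokenizer(value.toString());
         while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
         }
      }
   }

   // Reduce task: sum the counts for each word
   public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
         int sum = 0;
         for (IntWritable val : values) {
            sum += val.get();
         }
         context.write(key, new IntWritable(sum));
      }
   }

   public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = Job.getInstance(conf, "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class);   // optional local aggregation, like the Combiner described above
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory (illustrative)
      FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory (illustrative)
      System.exit(job.waitForCompletion(true) ? 0 : 1);
   }
}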

What is Hadoop
Hadoop is an open source framework from Apache and is used to store, process and analyze data that is very huge in
volume. Hadoop is written in Java and is not OLAP (online analytical processing); it is used for batch/offline
processing. It is being used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled
up just by adding nodes to the cluster.

Modules of Hadoop

1. HDFS: Hadoop Distributed File System. Google published its GFS paper and HDFS was developed on the basis of it.
Files are broken into blocks and stored on nodes across the distributed architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and cluster management.
3. MapReduce: This is a framework that helps Java programs do parallel computation on data using key-value pairs.
The Map task takes input data and converts it into a data set that can be computed as key-value pairs.
The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result.

4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other Hadoop modules.

Hadoop Architecture

The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS (Hadoop Distributed File
System). The MapReduce engine can be MapReduce/MR1 or YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node runs the JobTracker and
NameNode, whereas each slave node runs a DataNode and a TaskTracker.

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It follows a master/slave
architecture, consisting of a single NameNode that performs the role of master and multiple DataNodes
that perform the role of slaves.

Both NameNode and DataNode are capable enough to run on commodity machines. The Java language is used to develop
HDFS. So any machine that supports Java language can easily run the NameNode and DataNode software.

NameNode
o It is the single master server that exists in the HDFS cluster.
o As it is a single node, it can become a single point of failure.
o It manages the file system namespace by executing operations such as opening, renaming and closing files.
o It simplifies the architecture of the system.
DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the NameNode.
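
As a small client-side sketch of how the NameNode and DataNodes described above are used, the following writes a file through the Hadoop FileSystem Java API; the path is purely illustrative and the configuration is assumed to point at a running HDFS cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
   public static void main(String[] args) throws Exception {
      // Picks up core-site.xml / hdfs-site.xml from the classpath
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);

      // The NameNode resolves the path and allocates blocks;
      // the actual bytes are streamed to DataNodes
      Path file = new Path("/user/student/notes.txt");   // illustrative path
      try (FSDataOutputStream out = fs.create(file, true)) {
         out.writeUTF("Hello HDFS");
      }

      // Ask the NameNode for the file's metadata, e.g. its replication factor
      System.out.println("Replication: " + fs.getFileStatus(file).getReplication());
      fs.close();
   }
}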

Job Tracker
o The role of the Job Tracker is to accept MapReduce jobs from the client and process the data by using the NameNode.
o In response, NameNode provides metadata to Job Tracker.

Task Tracker
o It works as a slave node for Job Tracker.
o It receives the task and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.

MapReduce Layer

MapReduce comes into play when the client application submits a MapReduce job to the Job Tracker. In
response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes a TaskTracker fails or times
out; in such a case, that part of the job is rescheduled.

Advantages of Hadoop

o Fast: In HDFS the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools
to process the data are often on the same servers, thus reducing the processing time. Hadoop is able to process terabytes
of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
o Cost effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective
compared to a traditional relational database management system.
o Resilient to failure: HDFS can replicate data over the network, so if one node goes down or some other network
failure happens, Hadoop takes another copy of the data and uses it. Normally, data is replicated three times, but the
replication factor is configurable.

History of Hadoop

Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper,
published by Google.
Let's focus on the history of Hadoop in the following steps: -

o In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It is an open source web
crawler software project.

o While working on Apache Nutch, they were dealing with big data. Storing that data was very costly, which became a
problem for the project. This problem became one of the important reasons for the
emergence of Hadoop.
o In 2003, Google introduced a file system known as GFS (Google File System). It is a proprietary distributed file
system developed to provide efficient access to data.
o In 2004, Google released a white paper on MapReduce. This technique simplifies data processing on large
clusters.
o In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File
System). This file system also included MapReduce.
o In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting
introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's
first version, 0.1.0, was released in this year.
o Doug Cutting named his project Hadoop after his son's toy elephant.
o In 2007, Yahoo was running two clusters of 1000 machines.
o In 2008, Hadoop became the fastest system to sort 1 terabyte of data on a 900 node cluster within 209 seconds.
o In 2013, Hadoop 2.2 was released.
o In 2017, Hadoop 3.0 was released.

Hadoop YARN Architecture


YARN stands for "Yet Another Resource Negotiator". It was introduced in Hadoop 2.0 to remove the
bottleneck on the Job Tracker that was present in Hadoop 1.0. YARN was described as a "Redesigned Resource
Manager" at the time of its launch, but it has now evolved to be known as a large-scale distributed operating
system used for Big Data processing.
The YARN architecture basically separates the resource management layer from the processing layer. In Hadoop 2.0,
the responsibility of the Hadoop 1.0 Job Tracker is split between the Resource Manager and the Application Master.
YARN also allows different data processing engines like graph processing, interactive processing, stream
processing as well as batch processing to run and process data stored in HDFS (Hadoop Distributed File
System) thus making the system much more efficient. Through its various components, it can dynamically
allocate various resources and schedule the application processing. For large volume data processing, it is
quite necessary to manage the available resources properly so that every application can leverage them.

YARN Features: YARN gained popularity because of the following features-


 Scalability: The scheduler in Resource manager of YARN architecture allows Hadoop to extend
and manage thousands of nodes and clusters.
 Compatibility: YARN supports the existing MapReduce applications without disruption, thus
making it compatible with Hadoop 1.0 as well.
 Cluster Utilization: YARN supports dynamic utilization of the cluster in Hadoop, which
enables optimized cluster utilization.
 Multi-tenancy: It allows multiple engine access thus giving organizations a benefit of multi-
tenancy.
Hadoop YARN Architecture

The main components of YARN architecture include:

 Client: It submits map-reduce jobs.


 Resource Manager: It is the master daemon of YARN and is responsible for resource assignment
and management among all the applications. Whenever it receives a processing request, it forwards
it to the corresponding node manager and allocates resources for the completion of the request
accordingly. It has two major components:
 Scheduler: It performs scheduling based on the submitted applications and the available
resources. It is a pure scheduler, meaning it does not perform other tasks such as monitoring
or tracking and does not guarantee a restart if a task fails. The YARN scheduler supports
plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.
 Application manager: It is responsible for accepting the application and negotiating the
first container from the resource manager. It also restarts the Application Master
container if a task fails.
 Node Manager: It takes care of an individual node in the Hadoop cluster and manages the applications and
workflow on that particular node. Its primary job is to keep up with the Resource Manager. It
registers with the Resource Manager and sends heartbeats with the health status of the node. It
monitors resource usage, performs log management and also kills containers based on directions
from the resource manager. It is also responsible for creating the container process and starting it on
request of the Application Master.
 Application Master: An application is a single job submitted to a framework. The application
master is responsible for negotiating resources with the resource manager, tracking the status and
monitoring progress of a single application. The application master requests the container from the
node manager by sending a Container Launch Context(CLC) which includes everything an
application needs to run. Once the application is started, it sends the health report to the resource
manager from time-to-time.
 Container: It is a collection of physical resources such as RAM, CPU cores and disk on a single
node. The containers are invoked by Container Launch Context(CLC) which is a record that
contains information such as environment variables, security tokens, dependencies etc.
Application workflow in Hadoop YARN:

1. The client submits an application
2. The Resource Manager allocates a container to start the Application Master
3. The Application Master registers itself with the Resource Manager
4. The Application Master negotiates containers from the Resource Manager
5. The Application Master notifies the Node Manager to launch the containers
6. Application code is executed in the container
7. The client contacts the Resource Manager/Application Master to monitor the application's status
8. Once the processing is complete, the Application Master un-registers with the Resource Manager

HADOOP Extra Content :

Hadoop is an Apache open source framework written in java that allows distributed processing of large datasets across
clusters of computers using simple programming models. The Hadoop framework application works in an environment
that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from
single server to thousands of machines, each offering local computation and storage.

Hadoop Architecture

At its core, Hadoop has two major layers namely −

 Processing/Computation layer (MapReduce), and

 Storage layer (Hadoop Distributed File System).

MapReduce

MapReduce is a parallel programming model for writing distributed applications devised at Google for efficient
processing of large amounts of data (multi-terabyte data-sets), on large clusters (thousands of nodes) of commodity
hardware in a reliable, fault-tolerant manner. The MapReduce program runs on Hadoop which is an Apache open-
source framework.

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file
system that is designed to run on commodity hardware. It has many similarities with existing distributed file systems.
However, the differences from other distributed file systems are significant. It is highly fault-tolerant and is designed to
be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications
having large datasets.

Apart from the above-mentioned two core components, Hadoop framework also includes the following two modules −

 Hadoop Common − These are Java libraries and utilities required by other Hadoop modules.

 Hadoop YARN − This is a framework for job scheduling and cluster resource management.

How Does Hadoop Work?

It is quite expensive to build bigger servers with heavy configurations that handle large-scale processing. As an
alternative, you can tie together many single-CPU commodity computers into a single functional distributed system;
practically, the clustered machines can read the dataset in parallel and provide a much higher throughput. Moreover,
it is cheaper than one high-end server. So this is the first motivational factor behind using Hadoop: it runs across
clustered, low-cost machines.

Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs

 Data is initially divided into directories and files. Files are divided into uniform sized blocks of 128M and 64M
(preferably 128M).

 These files are then distributed across various cluster nodes for further processing.

 HDFS, being on top of the local file system, supervises the processing.

 Blocks are replicated for handling hardware failure.

 Checking that the code was executed successfully.

 Performing the sort that takes place between the map and reduce stages.

 Sending the sorted data to a certain computer.

 Writing the debugging logs for each job.

Advantages of Hadoop
 The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically
distributes the data and work across the machines, in turn utilizing the underlying parallelism of the CPU
cores.

 Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA), rather Hadoop library
itself has been designed to detect and handle failures at the application layer.

 Servers can be added or removed from the cluster dynamically and Hadoop continues to operate without
interruption.

 Another big advantage of Hadoop is that apart from being open source, it is compatible on all the platforms since
it is Java based.

Hadoop was created by Doug Cutting and Mike Cafarella in 2005. It was originally developed to support distribution
for the Nutch search engine project. Doug, who was working at Yahoo! at the time and is now Chief Architect of
Cloudera, named the project after his son's toy elephant.
UNIT 5
DATABASE SECURITY
Database Security Issues – Discretionary Access Control Based on Granting and Revoking Privileges – Mandatory Access Control and
Role-Based Access Control for Multilevel Security –SQL Injection – Statistical Database Security – Flow Control – Encryption and Public
Key Infrastructures – Preserving Data Privacy – Challenges to Maintaining Database Security – Database Survivability – Oracle Label-
Based Security.

10 Common Database Security Issues


Databases are very attractive targets for hackers because they contain valuable and sensitive information. This can range from
financial or intellectual property to corporate data and personal user data. Cybercriminals can profit by breaching the servers
of companies and damaging the databases in the process. Thus, database security testing is a must.

There are numerous incidents where hackers have targeted companies dealing with personal customer details. Equifax,
Facebook, Yahoo, Apple, Gmail, Slack, and eBay data breaches were in the news in the past few years, just to name a few.
Such rampant activity has raised the need for cybersecurity software and web app testing, which aim to protect the data that
people share with online businesses. Applying these measures helps deny hackers access to the records and
documents held in online databases. Also, complying with GDPR will help a lot on the way to strengthening user data
protection.

Here’s a list of top 10 vulnerabilities that are commonly found in the database-driven systems and our tips for how to eliminate
them.

No Security Testing Before Deployment


One of the most common causes of database weaknesses is negligence at the deployment stage of the development process.
Although functional testing is conducted to ensure good performance, this type of test cannot show you whether the database is
doing something it is not supposed to. Thus, it is important to test website security with different types of tests before
complete deployment.

Poor Encryption and Data Breaches Come Together


You might consider the database a backend part of your set-up and focus more on eliminating Internet-borne threats. It
does not really work that way. There are network interfaces within databases that can easily be tracked by hackers if
your software security is poor. To avoid such situations, it is important to use TLS- or SSL-encrypted communication
channels.

Feeble Cybersecurity Software = Broken Database


Case in point, the Equifax data breach. Company representatives admitted that 147 million consumers’ data was compromised,
so the consequences are huge. This case has proven how important cybersecurity software is to defend one’s database.
Unfortunately, either due to a lack of resources or time, most businesses don’t bother to conduct user data security testing and
do not provide regular patches for their systems, thus, leaving them susceptible to data leaks.

Stolen Database Backups


There are two kinds of threats to your databases: external and internal. There are cases when companies struggle with internal
threats even more than with external ones. Business owners can never be 100% sure of their employees' loyalty, no matter what
computer security software they use and how responsible the employees seem to be. Anybody who has access to sensitive data can steal
it and sell it to third-party organizations for profit. However, there are ways to reduce the risk: encrypt database archives,
implement strict security standards, apply fines in case of violations, use cybersecurity software, and
continuously increase your teams' awareness via corporate meetings and personal consulting.
Flaws in Features as a Database Security Issue
Databases can be hacked through flaws in their features. Hackers can break in using legitimate credentials and compel the
system to run arbitrary code. Although it sounds complex, access is actually gained through basic flaws inherent
to the features. The database can be protected from third-party access by security testing. Also, the simpler its functional
structure, the better the chances of ensuring good protection for each database feature.

Weak and Complex DB Infrastructure


Hackers do not generally take control over the entire database in one go. They opt for playing a Hopscotch game where they
find a particular weakness within the infrastructure and use it to their advantage. They launch a string of attacks until they
finally reach the backend. Security software is not capable of fully protecting your system from such manipulations. Even if
you pay attention to the specific feature flaws, it’s important not to leave the overall database infrastructure too complex. When
it’s complex, there are chances you will forget or neglect to check and fix its weaknesses. Thus, it is important that every
department maintains the same amount of control and segregates systems to decentralize focus and reduce possible risks.

Limitless Administration Access = Poor Data Protection


Smart division of duties between administrators and users ensures that access is limited to experienced teams. This way,
users who are not involved in the database administration process will have more difficulty if they try to steal any
data. If you can limit the number of user accounts, it is even better, because hackers will face more problems in gaining control
over the database as well. This applies to any type of business, but it is especially common in the financial industry. Thus,
it is good not only to care about who has access to the sensitive data but also to perform banking software testing before
releasing it.

Test Website Security to Avoid SQL Injections


This is a major roadblock on the way to database protection. Injections attack the applications, and database administrators
are forced to clean up the mess of malicious code and variables that are inserted into the strings. Web application security
testing and firewall implementation are the best options to protect web-facing databases. While this is a big problem for
online businesses, it is not one of the major mobile security challenges, which is an advantage for owners who only have
a mobile version of their application.

Inadequate Key Management


It’s good if you encrypt sensitive data but it’s also important that you pay attention to who exactly has access to the keys. Since
the keys are often stored on somebody’s hard drive, it is obviously an easy target for whoever wants to steal them. If you leave
such important software security tools unguarded, be aware that this makes your system vulnerable to attack.

Irregularities in Databases
It is inconsistencies that lead to vulnerabilities. Test website security and assure data protection on a regular basis. In case
any discrepancies are found, they have to be fixed as soon as possible. Your developers should be aware of any threat that might affect the
database. Though this is not easy work, with proper tracking the information can be kept secure.

In spite of being aware of the need for security testing, numerous businesses still fail to implement it. Fatal mistakes usually
appear during the development stages, but also during app integration or while patching and updating the database.
Cybercriminals take advantage of these failures to make a profit and, as a result, your business is at risk.

Discretionary Access Control Based on Granting and Revoking Privileges


The typical method of enforcing discretionary access control in a database system is based on the granting and revoking
of privileges. Let us consider privileges in the context of a relational DBMS. In particular, we will discuss a system of
privileges somewhat similar to the one originally developed for the SQL language (see Chapters 4 and 5). Many current
relational DBMSs use some variation of this technique. The main idea is to include statements in the query language that
allow the DBA and selected users to grant and revoke privileges.

1. Types of Discretionary Privileges


In SQL2 and later versions, the concept of an authorization identifier is used to refer, roughly speaking, to a user account (or
group of user accounts). For simplicity, we will use the words user or account interchangeably in place
of authorization identifier. The DBMS must provide selective access to each relation in the database based on specific
accounts. Operations may also be controlled; thus, having an account does not necessarily entitle the account holder to all the
functionality provided by the DBMS. Informally, there are two levels for assigning privileges to use the database system:
The account level. At this level, the DBA specifies the particular privileges that each account holds independently of
the relations in the database.
The relation (or table) level. At this level, the DBA can control the privilege to access each individual relation or view in the
database. Privileges at the relation level typically include the SELECT privilege on R (the capability to retrieve tuples from R),
the MODIFY privileges on R (the capability to INSERT, DELETE, and UPDATE tuples of R), and the REFERENCES privilege on R.
The REFERENCES privilege gives the account the capability to reference (or refer to) a relation R when specifying
integrity constraints, such as foreign keys. This privilege can also be restricted to specific attributes of R.
Notice that to create a view, the account must have the SELECT privilege on all relations involved in the view definition in
order to specify the query that corresponds to the view.

2. Specifying Privileges through the Use of Views


The mechanism of views is an important discretionary authorization mechanism in its own right. For example, if the
owner A of a relation R wants another account B to be able to retrieve only some fields of R, then A can create a
view V of R that includes only those attributes and then grant SELECT on V to B. The same applies to limiting B to retrieving
only certain tuples of R; a view V can be created by defining the view by means of a query that selects only those tuples
from R that A wants to allow B to access.
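As a minimal SQL sketch of this idea (the attribute names a, b, and c are hypothetical; assume A wants B to see only columns a and b, and only the tuples with c = 'public'):

CREATE VIEW V AS
    SELECT a, b                -- only the attributes A wants B to see
    FROM R
    WHERE c = 'public';        -- only the tuples A wants B to see

GRANT SELECT ON V TO B;        -- B can query V but has no privilege on R itself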

3. Revoking of Privileges
In some cases it is desirable to grant a privilege to a user temporarily. For example, the owner of a relation may want to grant
the SELECT privilege to a user for a specific task and then revoke that privilege once the task is completed. Hence, a
mechanism for revoking privileges is needed. In SQL a REVOKE command is included for the purpose of canceling
privileges.
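For example, using the relation R and account B from above, a temporary grant and its later cancellation could be written as:

GRANT SELECT ON R TO B;        -- granted for the duration of the task
REVOKE SELECT ON R FROM B;     -- canceled once the task is completed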

4. Propagation of Privileges Using the GRANT OPTION


Whenever the owner A of a relation R grants a privilege on R to another account B, the privilege can be given to B
with or without the GRANT OPTION. If the GRANT OPTION is given, this means that B can also grant that privilege on R to
other accounts. Suppose that B is given the GRANT OPTION by A and that B then grants the privilege on R to a third
account C, also with the GRANT OPTION. In this way, privileges on R can propagate to other accounts without the
knowledge of the owner of R. If the owner account A now revokes the privilege granted to B, all the privileges
that B propagated based on that privilege should automatically be revoked by the system.
It is possible for a user to receive a certain privilege from two or more sources. For example, A4 may receive a
certain UPDATE R privilege from both A2 and A3. In such a case, if A2 revokes this privilege from A4, A4 will still continue
to have the privilege by virtue of having been granted it from A3. If A3 later revokes the privilege from A4, A4 totally loses
the privilege. Hence, a DBMS that allows propagation of privileges must keep track of how all the privileges were granted so
that revoking of privileges can be done correctly and completely.

5. An Example to Illustrate Granting and Revoking of Privileges


Suppose that the DBA creates four accounts—A1, A2, A3, and A4—and wants only A1 to be able to create base relations. To
do this, the DBA must issue the following GRANT command in SQL:

GRANT CREATETAB TO A1;

The CREATETAB (create table) privilege gives account A1 the capability to create new database tables (base relations) and
is hence an account privilege. This privilege was part of earlier versions of SQL but is now left to each individual system
implementation to define.

In SQL2 the same effect can be accomplished by having the DBA issue a CREATE SCHEMA command, as follows:

CREATE SCHEMA EXAMPLE AUTHORIZATION A1;

User account A1 can now create tables under the schema called EXAMPLE. To continue our example, suppose
that A1 creates the two base relations EMPLOYEE and DEPARTMENT shown in Figure 24.1; A1 is then the owner of these
two relations and hence has all the relation privileges on each of them.
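As a rough sketch only (Figure 24.1 is not reproduced here, so the column lists below are assumed, simplified attributes), the two relations created by A1 under the EXAMPLE schema might look like this:

CREATE TABLE DEPARTMENT (
    Dnumber INT PRIMARY KEY,           -- assumed attributes
    Dname   VARCHAR(30)
);

CREATE TABLE EMPLOYEE (
    Ssn     CHAR(9) PRIMARY KEY,       -- assumed attributes
    Name    VARCHAR(30),
    Bdate   DATE,
    Address VARCHAR(50),
    Salary  DECIMAL(10,2),
    Dno     INT REFERENCES DEPARTMENT(Dnumber)
);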

Next, suppose that account A1 wants to grant to account A2 the privilege to insert and delete tuples in both of these relations.
However, A1 does not want A2 to be able to propagate these privileges to additional accounts. A1 can issue the following
command:

GRANT INSERT, DELETE ON EMPLOYEE, DEPARTMENT TO A2;

Notice that the owner account A1 of a relation automatically has the GRANT OPTION, allowing it to grant privileges on the
relation to other accounts. However, account A2 cannot grant INSERT and DELETE privileges on
the EMPLOYEE and DEPARTMENT tables because A2 was not given the GRANT OPTION in the preceding command.

Next, suppose that A1 wants to allow account A3 to retrieve information from either of the two tables and also to be able to
propagate the SELECT privilege to other accounts. A1 can issue the following command:

GRANT SELECT ON EMPLOYEE, DEPARTMENT TO A3 WITH GRANT OPTION;

The clause WITH GRANT OPTION means that A3 can now propagate the privilege to other accounts by using GRANT. For
example, A3 can grant the SELECT privilege on the EMPLOYEE relation to A4 by issuing the following command:
GRANT SELECT ON EMPLOYEE TO A4;
Notice that A4 cannot propagate the SELECT privilege to other accounts because the GRANT OPTION was not given to A4.
Now suppose that A1 decides to revoke the SELECT privilege on the EMPLOYEE relation from A3; A1 then can issue this
command:
REVOKE SELECT ON EMPLOYEE FROM A3;
The DBMS must now revoke the SELECT privilege on EMPLOYEE from A3, and it must also automatically
revoke the SELECT privilege on EMPLOYEE from A4. This is because A3 granted that privilege to A4, but A3 does not have
the privilege any more.
Next, suppose that A1 wants to give back to A3 a limited capability to SELECT from the EMPLOYEE relation and wants to
allow A3 to be able to propagate the privilege. The limitation is to retrieve only the Name, Bdate, and Address attributes and
only for the tuples with Dno = 5. A1 then can create the following view:
CREATE VIEW A3EMPLOYEE AS
SELECT Name, Bdate, Address
FROM EMPLOYEE
WHERE Dno = 5;
After the view is created, A1 can grant SELECT on the view A3EMPLOYEE to A3 as follows:
GRANT SELECT ON A3EMPLOYEE TO A3 WITH GRANT OPTION;
Finally, suppose that A1 wants to allow A4 to update only the Salary attribute of EMPLOYEE; A1 can then issue the
following command:
GRANT UPDATE ON EMPLOYEE (Salary) TO A4;
The UPDATE and INSERT privileges can specify particular attributes that may be updated or inserted in a relation. Other
privileges (SELECT, DELETE) are not attribute specific, because this specificity can easily be controlled by creating the
appropriate views that include only the desired attributes and granting the corresponding privileges on the views. However,
because updating views is not always possible (see Chapter 5), the UPDATE and INSERT privileges are given the option to
specify the particular attributes of a base relation that may be updated.

6. Specifying Limits on Propagation of Privileges


Techniques to limit the propagation of privileges have been developed, although they have not yet been implemented in most
DBMSs and are not a part of SQL. Limiting horizontal propagation to an integer number i means that an account B given
the GRANT OPTION can grant the privilege to at most i other accounts.
Vertical propagation is more complicated; it limits the depth of the granting of privileges. Granting a privilege with a vertical
propagation of zero is equivalent to granting the privilege with no GRANT OPTION. If account A grants a privilege to
account B with the vertical propagation set to an integer number j > 0, this means that the account B has the GRANT
OPTION on that privilege, but B can grant the privilege to other accounts only with a vertical propagation less than j. In effect,
vertical propagation limits the sequence of GRANT OPTIONS that can be given from one account to the next based on a single
original grant of the privilege.
We briefly illustrate horizontal and vertical propagation limits—which are not available currently in SQL or other relational
systems—with an example. Suppose that A1 grants SELECT to A2 on the EMPLOYEE relation with horizontal propagation
equal to 1 and vertical propagation equal to 2. A2 can then grant SELECT to at most one account because the horizontal
propagation limitation is set to 1. Additionally, A2 cannot grant the privilege to another account except with vertical
propagation set to 0 (no GRANT OPTION) or 1; this is because A2 must reduce the vertical propagation by at least 1 when
passing the privilege to others. In addition, the horizontal propagation must be less than or equal to the originally granted
horizontal propagation. For example, if account A grants a privilege to account B with the horizontal propagation set to an integer
number j > 0, this means that B can grant the privilege to other accounts only with a horizontal propagation less than or equal
to j. As this example shows, horizontal and vertical propagation techniques are designed to limit the depth and breadth of
propagation of privileges.

Mandatory Access Control and Role-Based Access Control for Multilevel Security

The discretionary access control technique of granting and revoking privileges on relations has traditionally been the main
security mechanism for relational database systems. This is an all-or-nothing method: A user either has or does not have a
certain privilege. In many applications, an additional security policy is needed that classifies data and users based on security
classes. This approach, known as mandatory access control (MAC), would typically be combined with the discretionary
access control mechanisms. It is important to note that most commercial DBMSs currently provide mechanisms only for
discretionary access control. However, the need for multilevel security exists in government, military, and intelligence
applications, as well as in many industrial and corporate applications. Some DBMS vendors—for example, Oracle—have
released special versions of their RDBMSs that incorporate mandatory access control for government use.
Typical security classes are top secret (TS), secret (S), confidential (C), and unclassified (U), where TS is the highest level
and U the lowest. Other more complex security classification schemes exist, in which the security classes are organized in a
lattice. For simplicity, we will use the system with four security classification levels, where TS ≥ S ≥ C ≥ U, to illustrate our
discussion. The commonly used model for multilevel security, known as the Bell-LaPadula model, classifies
each subject (user, account, program) and object (relation, tuple, column, view, operation) into one of the security
classifications TS, S, C, or U. We will refer to the clearance (classification) of a subject S as class(S) and to
the classification of an object O as class(O). Two restrictions are enforced on data access based on the subject/object
classifications:
1. A subject S is not allowed read access to an object O unless class(S) ≥ class(O). This is known as the simple security
property.
2. A subject S is not allowed to write an object O unless class(S) ≤ class(O). This is known as the star property (or *-
property).
The first restriction is intuitive and enforces the obvious rule that no subject can read an object whose security classification is
higher than the subject’s security clearance. The second restriction is less intuitive. It prohibits a subject from writing an object
at a lower security classification than the subject’s security clearance. Violation of this rule would allow information to flow
from higher to lower classifications, which violates a basic tenet of multilevel security. For example, a user (subject) with TS
clearance may make a copy of an object with classification TS and then write it back as a new object with classification U,
thus making it visible throughout the system.
To incorporate multilevel security notions into the relational database model, it is common to consider attribute values and
tuples as data objects. Hence, each attribute A is associated with a classification attribute C in the schema, and each attribute
value in a tuple is associated with a corresponding security classification. In addition, in some models, a tuple
classification attribute TC is added to the relation attributes to provide a classification for each tuple as a whole. The model
we describe here is known as the multilevel model, because it allows classifications at multiple security levels. A multilevel
relation schema R with n attributes would be represented as:
R(A1, C1, A2, C2, ..., An, Cn, TC)
where each Ci represents the classification attribute associated with attribute Ai.
The value of the tuple classification attribute TC in each tuple t is the highest of all attribute classification values Ci within t;
it provides a general classification for the tuple as a whole, whereas each attribute classification Ci provides a finer security
classification for the individual attribute value within the tuple.
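Purely as an illustrative sketch (a real multilevel DBMS maintains the classification attributes internally rather than as ordinary user-defined columns), the EMPLOYEE relation used in the example below could be pictured as:

CREATE TABLE EMPLOYEE (
    Name              VARCHAR(30),     -- apparent key
    C_Name            CHAR(2),         -- classification of Name (U, C, S, or TS)
    Salary            DECIMAL(10,2),
    C_Salary          CHAR(2),         -- classification of Salary
    Job_performance   VARCHAR(20),     -- classified attribute
    C_Job_performance CHAR(2),         -- classification of Job_performance
    TC                CHAR(2)          -- tuple classification: highest of the Ci in the tuple
);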
The apparent key of a multilevel relation is the set of attributes that would have formed the primary key in a regular (single-
level) relation. A multilevel relation will appear to contain different data to subjects (users) with different clearance levels. In
some cases, it is possible to store a single tuple in the relation at a higher classification level and produce the corresponding
tuples at a lower-level classification through a process known as filtering. In other cases, it is necessary to store two or more
tuples at different classification levels with the same value for the apparent key.
This leads to the concept of polyinstantiation, where several tuples can have the same apparent key value but have different
attribute values for users at different clearance levels.
We illustrate these concepts with the simple example of a multilevel relation shown in Figure 24.2(a), where we display the
classification attribute values next to each attribute’s value. Assume that the Name attribute is the apparent key, and consider
the query SELECT * FROM EMPLOYEE. A user with security clearance S would see the same relation shown in Figure
24.2(a), since all tuple classifications are less than or equal to S. However, a user with security clearance C would not be
allowed to see the values for Salary of ‘Brown’ and Job_performance of ‘Smith’, since they have higher classification. The
tuples would be filtered to appear as shown in Figure 24.2(b), with Salary and Job_performance appearing as null. For a user
with security clearance U, the filtering allows only the Name attribute of ‘Smith’ to appear, with all the other

attributes appearing as null (Figure 24.2(c)). Thus, filtering introduces null values for attribute values whose security
classification is higher than the user’s security clearance.
In general, the entity integrity rule for multilevel relations states that all attributes that are members of the apparent key must
not be null and must have the same security classification within each individual tuple. Additionally, all other attribute values
in the tuple must have a security classification greater than or equal to that of the apparent key. This constraint ensures that a
user can see the key if the user is permitted to see any part of the tuple. Other integrity rules, called null
integrity and interinstance integrity, informally ensure that if a tuple value at some security level can be filtered (derived)
from a higher-classified tuple, then it is sufficient to store the higher-classified tuple in the multilevel relation.
To illustrate polyinstantiation further, suppose that a user with security clearance C tries to update the value
of Job_performance of ‘Smith’ in Figure 24.2 to ‘Excellent’; this corresponds to the following SQL update being submitted
by that user:
UPDATE EMPLOYEE

SET Job_performance = ‘Excellent’

WHERE Name = ‘Smith’;


Since the view provided to users with security clearance C (see Figure 24.2(b)) permits such an update, the system should not
reject it; otherwise, the user could infer that some nonnull value exists for the Job_performance attribute of ‘Smith’ rather than
the null value that appears. This is an example of inferring information through what is known as a covert channel, which
should not be permitted in highly secure systems. However, the user should not be allowed to overwrite the existing value
of Job_performance at the higher classification level. The solution is to create a polyinstantiation for the ‘Smith’ tuple at the
lower classification level C, as shown in Figure 24.2(d). This is necessary since the new tuple cannot be filtered from the
existing tuple at classification S.
The basic update operations of the relational model (INSERT, DELETE, UPDATE) must be modified to handle this and
similar situations, but this aspect of the problem is outside the scope of our presentation. We refer the interested reader to the
Selected Bibliography at the end of this chapter for further details.
1. Comparing Discretionary Access Control and Mandatory Access Control
Discretionary access control (DAC) policies are characterized by a high degree of flexibility, which makes them suitable for a
large variety of application domains. The main drawback of DAC models is their vulnerability to malicious attacks, such as
Trojan horses embedded in application programs. The reason is that discretionary authorization models do not impose any
control on how information is propagated and used once it has been accessed by users authorized to do so. By contrast,
mandatory policies ensure a high degree of protection—in a way, they prevent any illegal flow of information. Therefore, they
are suitable for military and high security types of applications, which require a higher degree of protection. However,
mandatory policies have the drawback of being too rigid in that they require a strict classification of subjects and objects into
security levels, and therefore they are applicable to few environments. In many practical situations, discretionary policies are
preferred because they offer a better tradeoff between security and applicability.

2. Role-Based Access Control


Role-based access control (RBAC) emerged rapidly in the 1990s as a proven technology for managing and enforcing security
in large-scale enterprise-wide systems. Its basic notion is that privileges and other permissions are associated with
organizational roles, rather than individual users. Individual users are then assigned to appropriate roles. Roles can be created
using the CREATE ROLE and DESTROY ROLE commands. The GRANT and REVOKE commands discussed in Section
24.2 can then be used to assign and revoke privileges from roles, as well as for individual users when needed. For example, a
company may have roles such as sales account manager, purchasing agent, mailroom clerk, department manager, and so on.
Multiple individuals can be assigned to each role. Security privileges that are common to a role are granted to the role name,
and any individual assigned to this role would automatically have those privileges granted.
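A minimal SQL-style sketch of this idea (the table CUSTOMER_ACCOUNTS and the user names alice and bob are hypothetical, and exact role syntax varies between DBMSs):

CREATE ROLE sales_account_manager;

-- Privileges common to the role are granted to the role name, not to individuals.
GRANT SELECT, UPDATE ON CUSTOMER_ACCOUNTS TO sales_account_manager;

-- Individual users assigned to the role automatically receive its privileges.
GRANT sales_account_manager TO alice;
GRANT sales_account_manager TO bob;

-- Removing a user from the role withdraws the inherited privileges in one step.
REVOKE sales_account_manager FROM bob;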
RBAC can be used with traditional discretionary and mandatory access controls; it ensures that only authorized users in their
specified roles are given access to certain data or resources. Users create sessions during which they may activate a subset of
roles to which they belong. Each session can be assigned to several roles, but it maps to one user or a single subject only. Many
DBMSs have allowed the concept of roles, where privileges can be assigned to roles.
Separation of duties is another important requirement in various commercial DBMSs. It is needed to prevent one user from
doing work that requires the involvement of two or more people, thus preventing collusion. One method in which separation
of duties can be successfully implemented is with mutual exclusion of roles. Two roles are said to be mutually exclusive if
both the roles cannot be used simultaneously by the user. Mutual exclusion of roles can be categorized into two types,
namely authorization time exclusion (static) and runtime exclusion (dynamic). In authorization time exclusion, two roles that
have been specified as mutually exclusive cannot be part of a user’s authorization at the same time. In runtime exclusion, both
these roles can be authorized to one user but cannot be activated by the user at the same time. Another variation in mutual
exclusion of roles is that of complete and partial exclusion.
The role hierarchy in RBAC is a natural way to organize roles to reflect the organization’s lines of authority and
responsibility. By convention, junior roles at the bottom are connected to progressively senior roles as one moves up the
hierarchy. The hierarchic diagrams are partial orders, so they are reflexive, transitive, and antisymmetric. In other words, if a
user has one role, the user automatically has roles lower in the hierarchy. Defining a role hierarchy involves choosing the type
of hierarchy and the roles, and then implementing the hierarchy by granting roles to other roles. Role hierarchy can be
implemented in the following manner:
GRANT ROLE full_time TO employee_type1
GRANT ROLE intern TO employee_type2
The above are examples of granting the roles full_time and intern to two types of employees.
Another issue related to security is identity management. Identity refers to a unique name of an individual person. Since the
legal names of persons are not necessarily unique, the identity of a person must include sufficient additional information to
make the complete name unique. Authorizing this identity and managing the schema of these identities is called Identity
Management. Identity Management addresses how organizations can effectively authenticate people and manage their access
to confidential information. It has become more visible as a business requirement across all industries affecting organizations
of all sizes. Identity Management administrators constantly need to satisfy application owners while keeping expenditures
under control and increasing IT efficiency.
Another important consideration in RBAC systems is the possible temporal constraints that may exist on roles, such as the
time and duration of role activations, and timed triggering of a role by an activation of another role. Using an RBAC model is
a highly desirable goal for addressing the key security requirements of Web-based applications. Roles can be assigned to
workflow tasks so that a user with any of the roles related to a task may be authorized to execute it and may play a certain role
only for a certain duration.
RBAC models have several desirable features, such as flexibility, policy neutrality, better support for security management
and administration, and other aspects that make them attractive candidates for developing secure Web-based applications.
These features are lacking in DAC and MAC models. In addition, RBAC models include the capabilities available in traditional
DAC and MAC policies. Furthermore, an RBAC model provides mechanisms for addressing the security issues related to the
execution of tasks and workflows, and for specifying user-defined and organization-specific policies. Easier deployment over
the Internet has been another reason for the success of RBAC models.

3. Label-Based Security and Row-Level Access Control


Many commercial DBMSs currently use the concept of row-level access control, where sophisticated access control rules can
be implemented by considering the data row by row. In row-level access control, each data row is given a label, which is used
to store information about data sensitivity. Row-level access control provides finer granularity of data security by allowing the
permissions to be set for each row and not just for the table or column. Initially, the user is given a default session label by the
database administrator. Levels correspond to a hierarchy of data sensitivity to exposure or corruption, with the goal of
maintaining privacy or security. Labels are used to prevent unauthorized users from viewing or altering certain data. A user
having a low authorization level, usually represented by a low number, is denied access to data having a higher-level number.
If no such label is given to a row, a row label is automatically assigned to it depending upon the user’s session label.
A policy defined by an administrator is called a Label Security policy. Whenever data affected by the policy is accessed or
queried through an application, the policy is automatically invoked. When a policy is implemented, a new column is added to
each row in the schema. The added column contains the label for each row that reflects the sensitivity of the row as per the
policy. Similar to MAC, where each user has a security clearance, each user has an identity in label-based security. This user’s
identity is compared to the label assigned to each row to determine whether the user has access to view the contents of that
row. However, the user can write the label value himself, within certain restrictions and guidelines for that specific row. This
label can be set to a value that is between the user’s current session label and the user’s minimum level. The DBA has the
privilege to set an initial default row label.
The Label Security requirements are applied on top of the DAC requirements for each user. Hence, the user must satisfy the
DAC requirements and then the label security requirements to access a row. The DAC requirements make sure that the user is
legally authorized to carry on that operation on the schema. In most applications, only some of the tables need label-based
security. For the majority of the application tables, the protection provided by DAC is sufficient.
Security policies are generally created by managers and human resources personnel. The policies are high-level, technology
neutral, and relate to risks. Policies are a result of management instructions to specify organizational procedures, guiding
principles, and courses of action that are considered to be expedient, prudent, or advantageous. Policies are typically
accompanied by a definition of penalties and countermeasures if the policy is transgressed. These policies are then interpreted
and converted to a set of label-oriented policies by the Label Security administrator, who defines the security labels for
data and authorizations for users; these labels and authorizations govern access to specified protected objects.
Suppose a user has SELECT privileges on a table. When the user executes a SELECT statement on that table, Label Security
will automatically evaluate each row returned by the query to determine whether the user has rights to view the data. For
example, if the user has a sensitivity of 20, then the user can view all rows having a security level of 20 or lower. The level
determines the sensitivity of the information contained in a row; the more sensitive the row, the higher its security label value.
Such Label Security can be configured to perform security checks on UPDATE, DELETE, and INSERT statements as well.
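The sketch below only imitates the behavior described above with plain SQL and is not the syntax of any particular label-security product; the ACCOUNTS table, the numeric labels, the role clerk_role, and the function CURRENT_USER_LEVEL() (standing in for the session label a real product maintains per user) are all assumptions for illustration:

-- Each row carries a numeric sensitivity label; higher numbers are more sensitive.
CREATE TABLE ACCOUNTS (
    Account_id INT PRIMARY KEY,
    Balance    DECIMAL(12,2),
    Row_label  INT                     -- e.g. 10 = public, 20 = internal, 30 = restricted
);

-- Users query through a view that hides rows labeled above their clearance.
CREATE VIEW ACCOUNTS_VISIBLE AS
    SELECT Account_id, Balance
    FROM ACCOUNTS
    WHERE Row_label <= CURRENT_USER_LEVEL();

GRANT SELECT ON ACCOUNTS_VISIBLE TO clerk_role;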

4. XML Access Control


With the worldwide use of XML in commercial and scientific applications, efforts are under way to develop security standards.
Among these efforts are digital signatures and encryption standards for XML. The XML Signature Syntax and Processing
specification describes an XML syntax for representing the associations between cryptographic signatures and XML
documents or other electronic resources. The specification also includes procedures for computing and verifying XML
signatures. An XML digital signature differs from other protocols for message signing, such as PGP (Pretty Good Privacy—
a confidentiality and authentication service that can be used for electronic mail and file storage applications), in its support for
signing only specific portions of the XML tree rather than the complete document. Additionally, the XML signature
specification defines mechanisms for countersigning and transformations—so-called canonicalization to ensure that two
instances of the same text produce the same digest for signing even if their representations differ slightly, for example, in
typographic white space.
The XML Encryption Syntax and Processing specification defines XML vocabulary and processing rules for protecting
confidentiality of XML documents in whole or in part and of non-XML data as well. The encrypted content and additional
processing information for the recipient are represented in well-formed XML so that the result can be further processed using
XML tools. In contrast to other commonly used technologies for confidentiality such as SSL (Secure Sockets Layer—a leading
Internet security protocol), and virtual private networks, XML encryption also applies to parts of documents and to documents
in persistent storage.

5. Access Control Policies for E-Commerce and the Web


Electronic commerce (e-commerce) environments are characterized by any transactions that are done electronically. They
require elaborate access control policies that go beyond traditional DBMSs. In conventional database environments, access
control is usually performed using a set of authorizations stated by security officers or users according to some security policies.
Such a simple paradigm is not well suited for a dynamic environment like e-commerce. Furthermore, in an e-commerce
environment the resources to be protected are not only traditional data but also knowledge and experience. Such peculiarities
call for more flexibility in specifying access control policies. The access control mechanism must be flexible enough to support
a wide spectrum of heterogeneous protection objects.
A second related requirement is the support for content-based access control. Content-based access control allows one to
express access control policies that take the protection object content into account. In order to support content-based access
control, access control policies must allow inclusion of conditions based on the object content.
A third requirement is related to the heterogeneity of subjects, which requires access control policies based on user
characteristics and qualifications rather than on specific and individual characteristics (for example, user IDs). A possible
solution, to better take into account user profiles in the formulation of access control policies, is to support the notion of
credentials. A credential is a set of properties concerning a user that are relevant for security purposes (for example, age or
position or role within an organization). For instance, by using credentials, one can simply formulate policies such as Only
permanent staff with five or more years of service can access documents related to the internals of the system.
It is believed that XML will play a key role in access control for e-commerce applications because XML is
becoming the common representation language for document interchange over the Web, and is also becoming the language
for e-commerce. Thus, on the one hand there is the need to make XML representations secure, by providing access control
mechanisms specifically tailored to the protection of XML documents. On the other hand, access control information (that is,
access control policies and user credentials) can be expressed using XML itself. The Directory Services Markup
Language (DSML) is a representation of directory service information in XML syntax. It provides a foundation for a standard
for communicating with the directory services that will be responsible for providing and authenticating user credentials. The
uniform presentation of both protection objects and access control policies can be applied to policies and credentials
themselves. For instance, some credential properties (such as the user name) may be accessible to everyone, whereas other
properties may be visible only to a restricted class of users. Additionally, the use of an XML-based language for specifying
credentials and access control policies facilitates secure credential submission and export of access control policies.

SQL Injection
SQL injection is a code injection technique that might destroy your database.

SQL injection is one of the most common web hacking techniques.

SQL injection is the placement of malicious code in SQL statements, via web page input.

SQL in Web Pages


SQL injection usually occurs when you ask a user for input, like their username/userid, and instead of a name/id, the user gives you
an SQL statement that you will unknowingly run on your database.

Look at the following example which creates a SELECT statement by adding a variable (txtUserId) to a select string. The variable is
fetched from user input (getRequestString):

Example
txtUserId=getRequestString("UserId");
txtSQL = "SELECT * FROM Users WHERE UserId = " + txtUserId;

The rest of this chapter describes the potential dangers of using user input in SQL statements.
SQL Injection Based on 1=1 is Always True
Look at the example above again. The original purpose of the code was to create an SQL statement to select a user, with a given
user id.

If there is nothing to prevent a user from entering "wrong" input, the user can enter some "smart" input like this:

UserId: 105 OR 1=1

Then, the SQL statement will look like this:

SELECT * FROM Users WHERE UserId = 105 OR 1=1;

The SQL above is valid and will return ALL rows from the "Users" table, since OR 1=1 is always TRUE.

Does the example above look dangerous? What if the "Users" table contains names and passwords?

The SQL statement above is much the same as this:

SELECT UserId, Name, Password FROM Users WHERE UserId = 105 or 1=1;

A hacker might get access to all the user names and passwords in a database, by simply inserting 105 OR 1=1 into the input field.

SQL Injection Based on ""="" is Always True


Here is an example of a user login on a web site:

Username: John Doe
Password: myPass

Example
uName=getRequestString("username");
uPass=getRequestString("userpassword");
sql = 'SELECT * FROM Users WHERE Name ="' + uName + '" AND Pass ="' + uPass + '"'

Result
SELECT * FROM Users WHERE Name ="John Doe" AND Pass ="myPass"

A hacker might get access to user names and passwords in a database by simply inserting " OR ""=" into the user name or password
text box:

UserName: " or ""="
Password: " or ""="
The code at the server will create a valid SQL statement like this:

Result
SELECT * FROM Users WHERE Name ="" or ""="" AND Pass ="" or ""=""

The SQL above is valid and will return all rows from the "Users" table, since OR ""="" is always TRUE.

SQL Injection Based on Batched SQL Statements


Most databases support batched SQL statements.

A batch of SQL statements is a group of two or more SQL statements, separated by semicolons.

The SQL statement below will return all rows from the "Users" table, then delete the "Suppliers" table.

Example
SELECT * FROM Users; DROP TABLE Suppliers

Look at the following example:

Example
txtUserId = getRequestString("UserId");
txtSQL = "SELECT * FROM Users WHERE UserId = " + txtUserId;

And the following input:

User id: 105; DROP TABLE Suppliers

The valid SQL statement would look like this:

Result
SELECT * FROM Users WHERE UserId = 105; DROP TABLE Suppliers;

Use SQL Parameters for Protection


To protect a web site from SQL injection, you can use SQL parameters.

SQL parameters are values that are added to an SQL query at execution time, in a controlled manner.

ASP.NET Razor Example


txtUserId = getRequestString("UserId");
txtSQL = "SELECT * FROM Users WHERE UserId = @0";
db.Execute(txtSQL, txtUserId);

Note that parameters are represented in the SQL statement by a @ marker.


The SQL engine checks each parameter to ensure that it is correct for its column, and the parameter values are treated literally,
not as part of the SQL to be executed.

Another Example
txtNam = getRequestString("CustomerName");
txtAdd = getRequestString("Address");
txtCit = getRequestString("City");
txtSQL = "INSERT INTO Customers (CustomerName, Address, City) VALUES (@0, @1, @2)";
db.Execute(txtSQL, txtNam, txtAdd, txtCit);

Examples
The following example shows how to build a parameterized query in ASP.NET:

SELECT STATEMENT IN ASP.NET:

txtUserId = getRequestString("UserId");
sql = "SELECT * FROM Customers WHERE CustomerId = @0";
command = new SqlCommand(sql);
command.Parameters.AddWithValue("@0", txtUserId);
command.ExecuteReader();
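The same idea can also be seen at the SQL level itself. As a minimal sketch, assuming a MySQL-style server and the Users table from the examples above, a server-side prepared statement binds the user-supplied value as data rather than as SQL text:

-- The ? placeholder is bound at execution time, so input such as
-- "105 OR 1=1" or "105; DROP TABLE Suppliers" is treated as a single
-- UserId value, not as additional SQL.
PREPARE get_user FROM 'SELECT * FROM Users WHERE UserId = ?';
SET @uid = 105;
EXECUTE get_user USING @uid;
DEALLOCATE PREPARE get_user;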

Statistical Database Security


Certain databases may contain confidential or secret data about individuals, such as Aadhaar numbers or PAN card
numbers, and such databases should not be accessible to attackers; they must be protected from unauthorized access.
A database that contains details about a large population is called a statistical database, and it is used mainly to produce
statistics on various populations. Users are allowed to retrieve only certain statistical information about the population, such as
averages for a particular state or district, along with sums, counts, maximums, minimums, standard deviations, and so on.
It is the responsibility of ethical hackers and DBAs to monitor statistical database security: statistical users are not permitted to
access individual data, such as the income, phone number, or debit card number of a specific person, because statistical
database security techniques prohibit the retrieval of individual data. It is also the responsibility of the DBMS to
provide confidentiality of data about individuals.
Statistical Queries:
Queries that allow only aggregate functions such as COUNT, SUM, MIN, MAX, AVERAGE, and STANDARD
DEVIATION are called statistical queries. Statistical queries are mainly used to compute population statistics and, in
companies and industries, to report over employee databases.
Example –
Consider the following examples of statistical queries, where EMP_SALARY is a confidential relation that contains the
income of each employee of the company.
Query-1:
SELECT COUNT(*)
FROM EMP_SALARY
WHERE Emp_department = '3';
Query-2:
SELECT AVG(income)
FROM EMP_SALARY
WHERE Emp_id = '2';
Here, the WHERE condition can be manipulated by an attacker, and there is a chance of accessing the income of an individual
employee, or other confidential data, if the attacker knows the id or name of a particular employee.
The possibility of accessing individual information through statistical queries is reduced by using the following
measures –
1. Partitioning of the database – The records of the database must not be kept as one undivided population;
they must be divided into groups of some minimum size according to the confidentiality of the records.
The advantage of partitioning is that queries can refer to any complete group or set of groups,
but queries cannot access subsets of records within a group. So an attacker can, at most, learn
aggregate facts about whole groups, which are far less revealing than individual records.
2. No statistical queries are permitted whenever the number of tuples in the population specified by the selection
condition falls below some threshold (see the sketch after this list).
3. Prohibit sequences of queries that refer repeatedly to the same population of tuples.
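A minimal sketch of the threshold idea from measure 2, assuming the EMP_SALARY relation above and an illustrative threshold of 10 tuples; the aggregate is returned only when the selected population is large enough:

-- Returns the average income only if at least 10 employees match the condition;
-- otherwise no row is produced, so no individual can be isolated.
SELECT AVG(income) AS avg_income
FROM EMP_SALARY
WHERE Emp_department = '3'
HAVING COUNT(*) >= 10;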

Flow Control:
Measures of Control
The measures of control can be broadly divided into the following categories −
 Access Control − Access control includes security mechanisms in a database management system to protect against
unauthorized access. A user can gain access to the database after clearing the login process through only valid user
accounts. Each user account is password protected.
 Flow Control − Distributed systems encompass a lot of data flow from one site to another and also within a site. Flow
control prevents data from being transferred in such a way that it can be accessed by unauthorized agents. A flow
policy lists out the channels through which information can flow. It also defines security classes for data as well as
transactions.
 Data Encryption − Data encryption refers to coding data when sensitive data is to be communicated over public
channels. Even if an unauthorized agent gains access of the data, he cannot understand it since it is in an
incomprehensible format.

Encryption and Public Key Infrastructures


The most distinct feature of Public Key Infrastructure (PKI) is that it uses a pair of keys to achieve the underlying security
service. The key pair comprises a private key and a public key.
Since the public keys are in open domain, they are likely to be abused. It is, thus, necessary to establish and maintain some
kind of trusted infrastructure to manage these keys.

Key Management
It goes without saying that the security of any cryptosystem depends upon how securely its keys are managed. Without secure
procedures for the handling of cryptographic keys, the benefits of the use of strong cryptographic schemes are potentially
lost.
It is observed that cryptographic schemes are rarely compromised through weaknesses in their design. However, they are
often compromised through poor key management.
There are some important aspects of key management which are as follows −
 Cryptographic keys are nothing but special pieces of data. Key management refers to the secure administration of
cryptographic keys.
 Key management deals with the entire key lifecycle, from generation and distribution through storage and use to eventual revocation and destruction.
 There are two specific requirements of key management for public key cryptography.
o Secrecy of private keys. Throughout the key lifecycle, secret keys must remain secret from all parties except
the owner and those authorized to use them.
o Assurance of public keys. In public key cryptography, the public keys are in open domain and seen as public
pieces of data. By default there are no assurances of whether a public key is correct, with whom it can be
associated, or what it can be used for. Thus key management of public keys needs to focus much more
explicitly on assurance of purpose of public keys.
The most crucial requirement, assurance of public keys, can be achieved through the public-key infrastructure (PKI), a key
management system for supporting public-key cryptography.

Public Key Infrastructure (PKI)


PKI provides assurance of public keys. It provides the identification of public keys and their distribution. The anatomy of a PKI
comprises the following components.

 Public Key Certificate, commonly referred to as ‘digital certificate’.


 Private Key tokens.
 Certification Authority.
 Registration Authority.
 Certificate Management System.

Digital Certificate
For analogy, a certificate can be considered as the ID card issued to a person. People use ID cards such as a driver's license or
passport to prove their identity. A digital certificate does the same basic thing in the electronic world, but with one difference.
Digital certificates are not only issued to people; they can be issued to computers, software packages, or anything else that
needs to prove its identity in the electronic world.
 Digital certificates are based on the ITU (International Telecommunication Union) standard X.509, which defines a
standard certificate format for public key certificates and certification validation. Hence digital certificates are
sometimes also referred to as X.509 certificates.
The public key pertaining to the user client is stored in the digital certificate by the Certification Authority (CA), along with
other relevant information such as client information, expiration date, usage, and issuer.
 CA digitally signs this entire information and includes digital signature in the certificate.
 Anyone who needs assurance about the public key and associated information of a client carries out the signature
validation process using the CA’s public key. Successful validation assures that the public key given in the certificate
belongs to the person whose details are given in the certificate.
The process of obtaining a digital certificate by a person or entity is straightforward: the CA accepts an application from a client
to certify his public key and, after duly verifying the identity of the client, issues a digital certificate to that client.

Certifying Authority (CA)


As discussed above, the CA issues certificates to clients and assists other users in verifying those certificates. The CA takes
responsibility for correctly identifying the client asking for a certificate to be issued, ensures that the
information contained within the certificate is correct, and digitally signs it.

Key Functions of CA
The key functions of a CA are as follows −
 Generating key pairs − The CA may generate a key pair independently or jointly with the client.
 Issuing digital certificates − The CA could be thought of as the PKI equivalent of a passport agency − the CA issues
a certificate after the client provides the credentials to confirm his identity. The CA then signs the certificate to prevent
modification of the details contained in the certificate.
 Publishing Certificates − The CA needs to publish certificates so that users can find them. There are two ways of
achieving this. One is to publish certificates in the equivalent of an electronic telephone directory. The other is to send
your certificate out to those people you think might need it by one means or another.
 Verifying Certificates − The CA makes its public key available in the environment to assist verification of its signature
on clients’ digital certificates.
 Revocation of Certificates − At times, the CA revokes a certificate it has issued due to some reason, such as compromise of
the private key by the user or loss of trust in the client. After revocation, the CA maintains a list of all revoked certificates that
is available to the environment.

Classes of Certificates
There are four typical classes of certificate −
 Class 1 − These certificates can be easily acquired by supplying an email address.
 Class 2 − These certificates require additional personal information to be supplied.
 Class 3 − These certificates can only be purchased after checks have been made about the requestor’s identity.
 Class 4 − They may be used by governments and financial organizations needing very high levels of trust.

Registration Authority (RA)


A CA may use a third-party Registration Authority (RA) to perform the necessary checks on the person or company requesting
the certificate to confirm their identity. The RA may appear to the client as a CA, but they do not actually sign the certificate
that is issued.

Certificate Management System (CMS)


It is the management system through which certificates are published, temporarily or permanently suspended, renewed, or
revoked. Certificate management systems do not normally delete certificates because it may be necessary to prove their status
at a point in time, perhaps for legal reasons. A CA along with associated RA runs certificate management systems to be able
to track their responsibilities and liabilities.

Private Key Tokens


While the public key of a client is stored in the certificate, the associated secret private key can be stored on the key owner’s
computer. This method is generally not adopted: if an attacker gains access to the computer, he can easily gain access to the
private key. For this reason, a private key is stored on a secure removable storage token, access to which is protected through a
password.
Different vendors often use different and sometimes proprietary storage formats for storing keys. For example, Entrust uses
the proprietary .epf format, while Verisign, GlobalSign, and Baltimore use the standard .p12 format.

A p12 file contains a digital certificate that uses PKCS#12 (Public Key Cryptography Standard #12) encryption. It is
used as a portable format for transferring personal private keys and other sensitive information. P12 files are used by various
security and encryption programs.

Hierarchy of CA
With vast networks and requirements of global communications, it is practically not feasible to have only one trusted CA
from whom all users obtain their certificates. Secondly, having only one CA may lead to difficulties if that CA is
compromised.
In such cases, the hierarchical certification model is of interest since it allows public key certificates to be used in environments
where two communicating parties do not have trust relationships with the same CA.
 The root CA is at the top of the CA hierarchy and the root CA's certificate is a self-signed certificate.
 The CAs, which are directly subordinate to the root CA (For example, CA1 and CA2) have CA certificates that are
signed by the root CA.
 The CAs under the subordinate CAs in the hierarchy (For example, CA5 and CA6) have their CA certificates signed
by the higher-level subordinate CAs.
Certificate authority (CA) hierarchies are reflected in certificate chains. A certificate chain traces a path of certificates from
a branch in the hierarchy to the root of the hierarchy.
The following illustration shows a CA hierarchy with a certificate chain leading from an entity certificate through two
subordinate CA certificates (CA6 and CA3) to the CA certificate for the root CA.

Verifying a certificate chain is the process of ensuring that a specific certificate chain is valid, correctly signed, and
trustworthy. The following procedure verifies a certificate chain, beginning with the certificate that is presented for
authentication −
 A client whose authenticity is being verified supplies his certificate, generally along with the chain of certificates up
to Root CA.
 The verifier takes the certificate and validates it using the public key of the issuer. The issuer’s public key is found in
the issuer’s certificate, which sits next to the client’s certificate in the chain.
 If the higher-level CA that signed the issuer’s certificate is trusted by the verifier, verification is successful and
stops here.
 Otherwise, the issuer’s certificate is verified in the same manner as the client’s certificate in the steps above. This
process continues until either a trusted CA is found along the way or the chain ends at the Root CA.
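The chain walk described above can be modelled in a few lines of code. The sketch below is purely conceptual and assumes toy helpers: the Certificate class, the signature_is_valid placeholder and the trusted-CA set are hypothetical stand-ins, and a real verifier would rely on X.509 parsing and cryptographic signature checks from an established library instead.

# Conceptual sketch of walking a certificate chain (illustrative only).
class Certificate:
    def __init__(self, subject, issuer, public_key, signature):
        self.subject = subject        # entity the certificate identifies
        self.issuer = issuer          # CA that signed this certificate
        self.public_key = public_key  # subject's public key
        self.signature = signature    # issuer's signature over the contents

def signature_is_valid(cert, issuer_public_key):
    # Placeholder for a real cryptographic signature check.
    return cert.signature == ("signed-by", issuer_public_key)

def verify_chain(chain, trusted_cas):
    # chain[0] is the client certificate, chain[-1] is the root CA certificate.
    for cert, issuer_cert in zip(chain, chain[1:]):
        # Validate each certificate with the public key of its issuer.
        if not signature_is_valid(cert, issuer_cert.public_key):
            return False
        # Stop as soon as a trusted CA is reached in the chain.
        if issuer_cert.subject in trusted_cas:
            return True
    # Otherwise the walk must end at a trusted, self-signed root.
    root = chain[-1]
    return root.subject in trusted_cas and signature_is_valid(root, root.public_key)

# Tiny demo with a two-level hierarchy (root CA -> client).
root = Certificate("RootCA", "RootCA", "root-pub", ("signed-by", "root-pub"))
client = Certificate("client", "RootCA", "client-pub", ("signed-by", "root-pub"))
print(verify_chain([client, root], trusted_cas={"RootCA"}))  # True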
Preserving Data Privacy
Abstract

Incredible amounts of data are being generated by various organizations such as hospitals, banks, e-commerce, retail and supply
chain, etc., by virtue of digital technology. Not only humans but also machines contribute data, in the form of closed-circuit
television streams, website logs, etc. Tons of data are generated every minute by social media and smartphones. The
voluminous data generated from these sources can be processed and analyzed to support decision making. However, data
analytics is prone to privacy violations. One application of data analytics is recommendation systems, widely
used by e-commerce sites such as Amazon and Flipkart for suggesting products to customers based on their buying habits, which can lead to
inference attacks. Although data analytics is useful in decision making, it can lead to serious privacy concerns. Hence privacy-preserving
data analytics has become very important. This paper examines various privacy threats, privacy preservation techniques
and models with their limitations, and also proposes a data lake based modernistic privacy preservation technique to handle privacy
preservation in unstructured data.
Introduction

There is an exponential growth in the volume and variety of data due to the diverse applications of computers in all domain areas.
This growth has been achieved because of the affordable availability of computing technology, storage, and network connectivity. The
large-scale data, which also includes person-specific private and sensitive data such as gender, zip code, disease, caste, shopping
cart, religion, etc., is being stored in the public domain. The data holder can release this data to a third-party data analyst to gain
deeper insights and identify hidden patterns which are useful in making important decisions that may help in improving
businesses, providing value-added services to customers, prediction, forecasting and recommendation. One of the prominent
applications of data analytics is recommendation systems, widely used by e-commerce sites such as Amazon and Flipkart for
suggesting products to customers based on their buying habits. Facebook suggests friends, places to visit and even movie
recommendations based on our interests. However, releasing user activity data may lead to inference attacks, such as identifying gender
based on user activity. We have studied a number of privacy preserving techniques which are being employed to protect
against privacy threats. Each of these techniques has its own merits and demerits. This paper explores the merits and demerits
of each of these techniques and also describes the research challenges in the area of privacy preservation. There always exists
a trade-off between data utility and privacy. This paper also proposes a data lake based modernistic privacy preservation
technique to handle privacy preservation in unstructured data with maximum data utility.
Privacy threats in data analytics

Privacy is the ability of an individual to determine what data can be shared and to employ access control. If the data is in the public
domain, it is a threat to individual privacy, since the data is held by a data holder. A data holder can be a social networking
application, website, mobile app, e-commerce site, bank, hospital, etc. It is the responsibility of the data holder to ensure
the privacy of the users’ data. Apart from the data held in the public domain, users themselves, knowingly or unknowingly, contribute to
data leakage. For example, most mobile apps seek access to our contacts, files, camera, etc., and without reading the
privacy statement we agree to all terms and conditions, thereby contributing to data leakage.

Hence there is a need to educate smartphone users regarding privacy and privacy threats. Some of the key privacy threats
include (1) surveillance; (2) disclosure; (3) discrimination; (4) personal embarrassment and abuse.

Surveillance
Many organizations, including retail, e-commerce, etc., study their customers’ buying habits and try to come up with various
offers and value-added services. Based on opinion data and sentiment analysis, social media sites provide
recommendations of new friends, places to visit, people to follow, etc. This is possible only when they continuously monitor
their customers’ transactions. This is a serious privacy threat, as no individual accepts surveillance.

Disclosure
Consider a hospital holding patients’ data which include (zip, gender, age, disease). The data holder releases this data to a
third party for analysis after anonymizing sensitive person-specific data so that the person cannot be identified. The third-party
data analyst can map this information with freely available external data sources, such as census data, and identify a person
suffering from some disorder. This is how private information of a person can be disclosed, which is considered a serious
privacy breach.

Discrimination
Discrimination is the bias or inequality which can happen when some private information of a person is disclosed. For instance,
statistical analysis of electoral results proved that people of one community were completely against the party which formed
the government. The government can then neglect that community or be biased against it.

Personal embarrassment and abuse


Whenever some private information of a person is disclosed, it can even lead to personal embarrassment or abuse. For example,
suppose a person was privately undergoing medication for a specific problem and was buying medicines on a regular basis
from a medical shop. As part of its regular business model, the medical shop may send reminders and offers related to
these medicines over the phone. If a family member notices this, it can lead to personal embarrassment and even abuse.

Data analytics activity can affect data privacy. Many countries are enforcing privacy preservation laws. Lack of awareness is
also a major reason for privacy attacks; for example, many smartphone users are not aware of the information that is taken
from their phones by many apps. Previous research shows that only 17% of smartphone users are aware of privacy threats.
Privacy preservation methods

Many privacy preserving techniques have been developed, but most of them are based on anonymization of data. The list of privacy
preservation techniques is given below.

 K anonymity
 L diversity
 T closeness
 Randomization
 Data distribution
 Cryptographic techniques
 Multidimensional Sensitivity Based Anonymization (MDSBA).
K anonymity
Anonymization is the process of modifying data before it is given out for data analytics, so that de-identification is not possible
and any attempt to de-identify the data by mapping it with external data sources yields at least K indistinguishable records.
K anonymity is prone to two attacks, namely the homogeneity attack and the background knowledge attack. Some of
the algorithms applied to ensure anonymization include Incognito and Mondrian. K anonymity is applied on the patient data
shown in Table 1. The table shows the data before anonymization.

Table 1 Patient data, before anonymization


From: Privacy preservation techniques in big data analytics: a survey
Sno   Zip     Age   Disease
1     57677   29    Cardiac problem
2     57602   22    Cardiac problem
3     57678   27    Cardiac problem
4     57905   43    Skin allergy
5     57909   52    Cardiac problem
6     57906   47    Cancer
7     57605   30    Cardiac problem
8     57673   36    Cancer
9     57607   32    Cancer

The K anonymity algorithm is applied with a k value of 3 to ensure 3 indistinguishable records when an attempt is made to identify
a particular person’s data. K anonymity is applied on two attributes, viz. Zip and Age, shown in Table 1. The result of
applying anonymization on the Zip and Age attributes is shown in Table 2.

Table 2 After applying anonymization on Zip and age


From: Privacy preservation techniques in big data analytics: a survey
Sno   Zip     Age    Disease
1     576**   2*     Cardiac problem
2     576**   2*     Cardiac problem
3     576**   2*     Cardiac problem
4     5790*   > 40   Skin allergy
5     5790*   > 40   Cardiac problem
6     5790*   > 40   Cancer
7     576**   3*     Cardiac problem
8     576**   3*     Cancer
9     576**   3*     Cancer

The above technique uses generalization to achieve anonymization. If we know that John is 27 years old and lives
in zip code 57677, we can still conclude that John has a cardiac problem even after anonymization, as shown in Table 2. This is
called a homogeneity attack. Similarly, if John is 36 years old and it is known that John does not have cancer, then
John must have a cardiac problem. This is called a background knowledge attack. K anonymity can be achieved either
by generalization or by suppression. K anonymity can be optimized if minimal generalization can be done without huge
data loss. Identity disclosure is the major privacy threat, and protection against it cannot be guaranteed by K anonymity. Personalized privacy
is the most important aspect of individual privacy.
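A minimal sketch of checking the k-anonymity property of a generalized table such as Table 2 is given below; the record encoding is a hypothetical illustration and only the Python standard library is used.

from collections import Counter

# Generalized records from Table 2: (Zip, Age) are the quasi-identifiers.
records = [
    ("576**", "2*"), ("576**", "2*"), ("576**", "2*"),
    ("5790*", "> 40"), ("5790*", "> 40"), ("5790*", "> 40"),
    ("576**", "3*"), ("576**", "3*"), ("576**", "3*"),
]

def is_k_anonymous(quasi_identifiers, k):
    # True if every combination of quasi-identifier values occurs at least k times.
    counts = Counter(quasi_identifiers)
    return all(count >= k for count in counts.values())

print(is_k_anonymous(records, k=3))  # True: each equivalence class has 3 records

Note that the check passes even though every record in the 576**/2* class shares the same disease, which is exactly why k-anonymity alone does not stop the homogeneity attack.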
L diversity
To address the homogeneity attack, another technique called L diversity has been proposed. As per L diversity, there must be L
well-represented values for the sensitive attribute (disease) in each equivalence class.

Implementing L diversity is not always possible because of the variety of data. L diversity is also prone to the skewness attack:
when the overall distribution of data is skewed into a few equivalence classes, attribute disclosure cannot be ensured. For example,
if all records are distributed into only three equivalence classes, then the semantic closeness of these values may lead to
attribute disclosure. L diversity may also lead to a similarity attack. From Table 3 it can be noticed that if we know that John is
27 years old and lives in zip 57677, then John is definitely in the low-income group, because the salaries of all three persons in
the 576** class are low compared to the others in the table. This is called a similarity attack.

Table 3 L diversity privacy preservation technique


From: Privacy preservation techniques in big data analytics: a survey
Sno   Zip     Age    Salary   Disease
1     576**   2*     5k       Cardiac problem
2     576**   2*     6k       Cardiac problem
3     576**   2*     7k       Cardiac problem
4     5790*   > 40   20k      Skin allergy
5     5790*   > 40   22k      Cardiac problem
6     5790*   > 40   24k      Cancer
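A minimal l-diversity check over the Table 3 data (hypothetical encoding, standard library only) also illustrates the limitation just described: the 576**/2* class is formally 3-diverse on salary because it holds three distinct values, yet every one of them is low, so the similarity attack still succeeds.

from collections import defaultdict

# Hypothetical encoding of Table 3: (quasi-identifier class, sensitive salary value).
records = [
    (("576**", "2*"), "5k"), (("576**", "2*"), "6k"), (("576**", "2*"), "7k"),
    (("5790*", "> 40"), "20k"), (("5790*", "> 40"), "22k"), (("5790*", "> 40"), "24k"),
]

def is_l_diverse(rows, l):
    # True if every equivalence class holds at least l distinct sensitive values.
    classes = defaultdict(set)
    for quasi, sensitive in rows:
        classes[quasi].add(sensitive)
    return all(len(values) >= l for values in classes.values())

print(is_l_diverse(records, l=3))  # True, yet all salaries in the first class are low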

T closeness
Another improvement over L diversity is the T closeness measure: an equivalence class is said to have T
closeness if the distance between the distribution of the sensitive attribute in the class and the distribution of
the attribute in the whole table is no more than a threshold T, and a table has T closeness if all its equivalence
classes have T closeness. T closeness can be calculated on every attribute with respect to the sensitive attribute.

From Table 4 it can be observed that even if we know John is 27 years old, it is still difficult to estimate whether
John has a cardiac problem or whether he is in the low-income group. T closeness may ensure protection against attribute
disclosure, but implementing T closeness may not give a proper distribution of data every time.

Table 4 T closeness privacy preservation technique


From: Privacy preservation techniques in big data analytics: a survey
Sno   Zip     Age    Salary   Disease
1     576**   2*     5k       Cardiac problem
2     576**   2*     16k      Cancer
3     576**   2*     9k       Skin allergy
4     5790*   > 40   20k      Skin allergy
5     5790*   > 40   42k      Cardiac problem
6     5790*   > 40   8k       Flu
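One way to make the T closeness test concrete is sketched below. It measures the gap between each class's sensitive-value distribution and the overall distribution with a simple total variation distance; the formal definition uses the Earth Mover's Distance, so this is only an illustrative stand-in with hypothetical helper names and data.

from collections import Counter, defaultdict

def distribution(values):
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def total_variation(p, q):
    # Simplified distance between two distributions (stand-in for Earth Mover's Distance).
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

def satisfies_t_closeness(rows, t):
    # rows: (equivalence class, sensitive value) pairs.
    overall = distribution([s for _, s in rows])
    classes = defaultdict(list)
    for eq_class, sensitive in rows:
        classes[eq_class].append(sensitive)
    return all(total_variation(distribution(vals), overall) <= t
               for vals in classes.values())

rows = [("576**/2*", "Cardiac problem"), ("576**/2*", "Cancer"), ("576**/2*", "Skin allergy"),
        ("5790*/>40", "Skin allergy"), ("5790*/>40", "Cardiac problem"), ("5790*/>40", "Flu")]
print(satisfies_t_closeness(rows, t=0.5))  # True: both classes stay close to the overall mix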

Randomization technique
Randomization is the process of adding noise to the data, which is generally done using a probability distribution.
Randomization is applied in surveys, sentiment analysis, etc. Randomization does not need knowledge of the other
records in the data and can be applied at data collection and pre-processing time. There is no anonymization
overhead in randomization. However, applying randomization on large datasets is problematic because of time
complexity and loss of data utility, which has been shown in our experiment described below.

We loaded 10k records from an employee database into the Hadoop Distributed File System and processed
them by executing a MapReduce job, with the aim of classifying the employees based on their salary and age groups.
To apply randomization, we added noise in the form of 5k randomly generated records, making a database of 15k
records, and the following observations were made after running the MapReduce job (a simple noise-addition sketch
follows the list):

 More mappers and reducers were used as the data volume increased.
 Results before and after randomization were significantly different.
 Some records which are outliers remained unaffected by randomization and are vulnerable to adversary attack.
 Privacy preservation at the cost of data utility is not appreciated; hence randomization may not be suitable for
privacy preservation, especially against attribute disclosure.
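A minimal noise-addition sketch (hypothetical salary figures, untuned noise scale) shows the basic idea and why utility can suffer: aggregates stay roughly stable, but individual values drift between runs.

import random

def randomize(values, scale):
    # Perturb numeric values with zero-mean Gaussian noise so individual values
    # are masked while aggregate statistics stay roughly stable.
    return [v + random.gauss(0, scale) for v in values]

salaries = [5000, 6000, 7000, 20000, 22000, 24000]   # hypothetical figures
noisy = randomize(salaries, scale=500)
print(sum(salaries) / len(salaries))  # true mean
print(sum(noisy) / len(noisy))        # perturbed mean: close, but not identical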
Data distribution technique
In this technique, the data is distributed across many sites. Distribution of the data can be done in two ways:

1. Horizontal distribution of data
2. Vertical distribution of data

Horizontal distribution When data is distributed across many sites with the same attributes, the distribution is
said to be horizontal distribution, as described in Fig. 1.

Fig. 1 Distribution of sales data across different sites

Horizontal distribution of data can be applied only when some aggregate functions or operations are to be applied
on the data without actually sharing the data. For example, if a retail store wants to analyse its sales across
various branches, it may employ analytics that performs computations on aggregate data. However, as
part of data analysis the data holder may need to share the data with a third-party analyst, which may lead to privacy
breach. Classification and clustering algorithms can be applied on distributed data, but this does not ensure privacy.
If the data is distributed across different sites which belong to different organizations, then the results of aggregate
functions may help one party in deducing the data held by the other parties. In such situations all
participating sites are expected to be honest with each other.
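The aggregate-only style of analysis over horizontally distributed data can be sketched as follows. The branch figures and helper names are hypothetical; the point is that each site exposes only a local summary, never its raw sales rows.

# Each branch (site) holds rows with the same schema and shares only local aggregates.
site_a_sales = [120.0, 75.5, 310.0]
site_b_sales = [99.9, 45.0]
site_c_sales = [500.0, 20.0, 80.0, 60.0]

def local_summary(rows):
    return {"count": len(rows), "total": sum(rows)}

summaries = [local_summary(s) for s in (site_a_sales, site_b_sales, site_c_sales)]
overall_total = sum(s["total"] for s in summaries)
overall_count = sum(s["count"] for s in summaries)
print("average sale:", overall_total / overall_count)

Even here, as noted above, a dishonest site could combine the published aggregates with its own data to narrow down what the other sites hold.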

Vertical distribution of data When person-specific information is distributed across different sites under the
custodianship of different organizations, the distribution is called vertical distribution, as shown in Fig. 2. For
example, in crime investigations, the police would like to know the details of a particular criminal, including
health, professional, financial and personal information. All of this information may not be available at one site. Such a
distribution, where each site holds a few attributes of a person, is called vertical distribution. When some
analytics has to be done, the data has to be pooled from all these sites, and there is a vulnerability to privacy breach.

Fig. 2 Vertical distribution of person specific data


In order to perform data analytics on vertically distributed data, where the attributes are distributed across different sites under
the custodianship of different parties, it is highly difficult to ensure privacy if the datasets are shared. For example, as part of a police
investigation, the investigating officer may want to obtain information about the accused from his employer, health
department and bank to gain more insight into the character of the person. In this process some of the personal and sensitive
information of the accused may be disclosed to the investigating officer, leading to personal embarrassment or abuse.
Anonymization cannot be applied when entire records are not needed for analytics. Distribution of data on its own will not ensure privacy
preservation, but it closely overlaps with cryptographic techniques.

Cryptographic techniques
The data holder may encrypt the data before releasing it for analytics. However, encrypting large-scale data using conventional
encryption techniques is highly difficult and must be applied only at data collection time. Differential privacy techniques
have already been applied where some aggregate computations on the data are done without actually sharing the inputs. For
example, if x and y are two data items, then a function F(x, y) is computed to gain some aggregate information from both
x and y without actually sharing x and y. This can be applied when x and y are held by different parties, as in the case of
vertical distribution. However, if the data is at a single location under the custodianship of a single organization, then this approach
cannot be employed. Another similar technique, secure multiparty computation, has been used but has proved
inadequate for privacy preservation. Data utility will be low if encryption is applied during data analytics. Thus encryption is
not only difficult to implement but also reduces data utility.
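One standard building block for computing such a function F(x, y) without revealing x or y is additive secret sharing, as used in secure multiparty computation. The toy sketch below (hypothetical values and modulus) computes a sum purely from shares; it is an illustration of the idea, not the specific protocol discussed above.

import random

MOD = 2**61 - 1  # arbitrary large modulus for the toy example

def share(value):
    # Split a private value into two random additive shares modulo MOD.
    r = random.randrange(MOD)
    return r, (value - r) % MOD

x, y = 42, 58          # private inputs held by two different parties
x1, x2 = share(x)
y1, y2 = share(y)

# Each party adds only the shares it holds; the raw inputs are never exchanged.
partial_1 = (x1 + y1) % MOD
partial_2 = (x2 + y2) % MOD
print("F(x, y) = x + y =", (partial_1 + partial_2) % MOD)  # 100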

Multidimensional Sensitivity Based Anonymization (MDSBA)


Bottom-up generalization and top-down generalization are the conventional methods of anonymization, applied
on well-represented structured data records. However, applying the same on large-scale data sets is very difficult, leading to
issues of scalability and information loss. Multidimensional Sensitivity Based Anonymization is an improved version of
anonymization that has proved to be more effective than conventional anonymization techniques.

Multidimensional Sensitivity Based Anonymization is an improved anonymization technique that can be applied on
large data sets with reduced loss of information and predefined quasi-identifiers. As part of this technique, the Apache
MapReduce framework is used to handle large data sets. In the conventional Hadoop Distributed File System, the data is
divided into blocks of either 64 MB or 128 MB each and distributed across different nodes without considering the data
inside the blocks. As part of the Multidimensional Sensitivity Based Anonymization technique, the data is split into different bags
based on the probability distribution of the quasi-identifiers by making use of filters in the Apache Pig scripting language.

Multidimensional Sensitivity Based Anonymization makes use of bottom-up generalization, but on a set of attributes with
certain class values, where the class represents a sensitive attribute. Data distribution is made more effective when compared to
the conventional method of blocks. Data anonymization was done using four quasi-identifiers using Apache Pig.

Since the data is vertically partitioned into different groups, it can protect against the background knowledge attack if each bag
contains only a few attributes. This method also makes it difficult to map the data with external sources to disclose any
person-specific information.

In this method, the implementation was done using Apache Pig. Apache Pig is a scripting language, hence the development effort
is less. However, the code efficiency of Apache Pig is relatively lower than that of a native MapReduce job, because ultimately every
Apache Pig script is converted into a MapReduce job. Multidimensional Sensitivity Based Anonymization is more
appropriate for large-scale data, but only when the data is at rest; it cannot be applied to streaming data.
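The bag-splitting step can be pictured with the conceptual sketch below. It is only a Python stand-in for the Apache Pig filters the technique actually uses: the field names, the prefix-based bag key and the sample records are all hypothetical.

from collections import defaultdict

# Records are grouped into bags keyed by a quasi-identifier so that each bag can
# later be generalized (bottom up) against its own class values.
records = [
    {"zip": "57677", "age": 29, "class": "Cardiac problem"},
    {"zip": "57602", "age": 22, "class": "Cardiac problem"},
    {"zip": "57905", "age": 43, "class": "Skin allergy"},
    {"zip": "57906", "age": 47, "class": "Cancer"},
]

def split_into_bags(rows, quasi_identifier):
    bags = defaultdict(list)
    for row in rows:
        bags[row[quasi_identifier][:3]].append(row)  # coarse prefix as the bag key
    return bags

for key, bag in split_into_bags(records, "zip").items():
    print(key, len(bag), "records")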
Analysis

Various privacy preservation techniques have been studied with respect to features including the type of data supported, data utility, attribute
preservation and complexity. The comparison of the various privacy preservation techniques is shown in Table 5.

Table 5 Comparison of privacy preservation techniques


From: Privacy preservation techniques in big data analytics: a survey
Features                                Anonymization   Cryptographic   Data distribution   Randomization   MDSBA
                                                        techniques      techniques
Suitability for unstructured data       No              No              No                  No              Yes
Attribute preservation                  No              No              No                  Yes             Yes
Damage to data utility                  No              No              Yes                 No              Yes
Very complex to apply                   No              Yes             Yes                 Yes             Yes
Accuracy of results of data analytics   No              Yes             No                  No              No

Results and discussions

As part of a systematic literature review, it has been observed that all existing mechanisms of privacy preservation address
structured data. More than 80% of the data being generated today is unstructured. As such, there is a need to address the
following challenges.

1. Develop concrete solutions to protect privacy in both structured and unstructured data.
2. Develop scalable and robust techniques to handle large-scale heterogeneous data sets.
3. Allow data to stay in its native form, without the need for transformation, while data analytics is carried out with privacy preservation ensured.
4. Develop new techniques, apart from anonymization, to protect against key privacy threats, including identity disclosure, discrimination and surveillance.
5. Maximize data utility while ensuring data privacy.

Conclusion

No concrete solution for unstructured data has been developed yet. Conventional data mining algorithms can be applied to
classification and clustering problems, but they cannot be used for privacy preservation, especially when dealing with person-specific
information. Machine learning and soft computing techniques can be used to develop new and more appropriate solutions to
privacy problems, including identity disclosure, which can lead to personal embarrassment and abuse.

There is a strong need for law enforcement by the governments of all countries to ensure individual privacy. The European Union is
making an attempt to enforce a privacy preservation law. Apart from technological solutions, there is a strong need to create
awareness among people regarding privacy hazards so they can safeguard themselves from privacy breaches. One of the serious
privacy threats is the smartphone: a lot of personal information in the form of contacts, messages, chats and files is being accessed
by many apps running on our smartphones without our knowledge. Most of the time people do not even read the privacy
statement before installing an app. Hence there is a strong need to educate people on the various vulnerabilities which can
contribute to the leakage of private information.

We propose a novel privacy preservation model based on the Data Lake concept to hold a variety of data from diverse sources. A data
lake is a repository that holds data from diverse sources in their raw format. Data ingestion from a variety of sources can be done
using Apache Flume, and an intelligent algorithm based on machine learning can be applied to identify sensitive attributes
dynamically. The algorithm is trained with existing data sets with known sensitive attributes, and rigorous training of the
model helps in predicting the sensitive attributes in a given data set. The accuracy of the model can be improved by adding
more layers of training, leading to deep learning techniques. Advanced computing frameworks like Apache Spark, a distributed
massively parallel computing engine with in-memory processing, can be used to implement the privacy preserving algorithms
and ensure very fast processing. The proposed model is shown in Fig. 3.

Fig. 3

A Novel privacy preservation model based on vertical distribution and tokenization


Data analytics is done on data collected from various sources. If an e-commerce site would like to perform data analytics,
it needs transactional data, website logs and customers’ opinions from its social media pages. A data lake is used to collect
data from the different sources. Apache Flume is used to ingest data from social media sites and website logs into the Hadoop Distributed
File System (HDFS). Using Sqoop, relational data can be loaded into HDFS.

In the data lake the data can remain in its native form, which is either structured or unstructured. When the data has to be processed,
it can be transformed into Hive tables. A Hadoop MapReduce job using machine learning can be executed on the data to
classify the sensitive attributes. The data can then be vertically distributed to separate the sensitive attributes from the rest of the data,
and tokenization can be applied to map the vertically distributed parts back to each other. The data without any sensitive attributes can be published for
data analytics.
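A minimal sketch of the vertical split plus tokenization step in the proposed model is given below. The attribute names, the choice of sensitive attributes and the use of a random UUID as the linking token are illustrative assumptions rather than the exact implementation.

import uuid

def split_record(record, sensitive_attrs):
    # Move sensitive attributes to a separate store and link the two parts only
    # through a random token, so the published part carries no direct identifiers.
    token = uuid.uuid4().hex
    sensitive = {k: v for k, v in record.items() if k in sensitive_attrs}
    public = {k: v for k, v in record.items() if k not in sensitive_attrs}
    sensitive["token"] = public["token"] = token
    return public, sensitive

record = {"name": "John", "zip": "57677", "age": 27, "disease": "Cardiac problem"}
public_part, private_part = split_record(record, sensitive_attrs={"name", "disease"})
print(public_part)   # safe to publish for analytics
print(private_part)  # kept in the restricted store, joined back via the token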
Abbreviations

CCTV: closed-circuit television
MDSBA: Multidimensional Sensitivity Based Anonymization

Challenges to Maintaining Database Security


Given the vast increase in the volume and speed of threats to databases and other information assets, research efforts need to
be devoted to the following issues: data quality, intellectual property rights, and database survivability.
Let’s discuss them one by one.
1. Data quality –
 The database community needs techniques and organizational solutions to assess and attest the quality of data.
These techniques may include simple mechanisms such as quality stamps that are posted on websites. We also need
techniques that provide more effective integrity semantics verification tools for the assessment of data quality,
based on techniques such as record linkage.
 We also need application-level recovery techniques to automatically repair incorrect data.
 The ETL (extract, transform, and load) tools widely used for loading data into data warehouses are presently
grappling with these issues.
2. Intellectual property rights –
As the use of the Internet and intranets increases day by day, the legal and informational aspects of data are becoming major
concerns for many organizations. To address these concerns, watermarking techniques are used, which help protect content
from unauthorized duplication and distribution by giving provable ownership of the content.
Traditionally these techniques depend upon the availability of a large domain within which the object can be altered while
retaining its essential or important properties.
However, research is needed to assess the robustness of such techniques and to study and investigate different
approaches aimed at preventing intellectual property rights violations.
3. Database survivability –
Database systems need to operate and continue their functions, even with reduced capabilities, despite disruptive events
such as information warfare attacks.
A DBMS, in addition to making every effort to prevent an attack and to detect one in the event of occurrence, should be
able to do the following:
 Confinement:
Take immediate action to eliminate the attacker’s access to the system and to isolate or contain the
problem to prevent further spread.
 Damage assessment:
Determine the extent of the problem, including failed functions and corrupted data.
 Recovery:
Recover corrupted or lost data and repair or reinstall failed functions to re-establish a normal level of operation.
 Reconfiguration:
Reconfigure to allow operation to continue in a degraded mode while recovery proceeds.
 Fault treatment:
To the extent possible, identify the weaknesses exploited in the attack and take steps to prevent a recurrence.

Database Survivability
Survivability of a system is the capability of a system to fulfill its mission in a timely manner in the
presence of attacks, failures, or accidents.
Survivability of Systems in the Cloud
Survivability is the ability of a system to adapt to and recover from a serious failure, or more generally the ability to retain service
functionality in the face of threats. This could relate to small local events, such as equipment failures, from which the system reconfigures
itself essentially automatically over a time scale of seconds to minutes. Survivability could also relate to major events, such as
a large natural disaster or a capable attack, on a time scale of hours to days or even longer. Another important aspect of
survivability is robustness: while survivability is about coping with the impact of events, robustness is about reducing
that impact in the first place. Assigning probabilities to potential dangers is difficult because of uncertainty. In addition, there
are no effective measures to actually assess the performance of the Internet. Because of these and other issues, dependability
is based on statistical measures of historical outages, faults, and failures.
At every level of the interconnection system in the Internet, there is little global information available, and what is available is
incomplete and of unknown accuracy. Specifically, there are no maps of physical connections, traffic, and interconnections
between ASes. There are a number of reasons for this lack of information. One is the physical complexity of the network fibers
around the world, which change from time to time as well. Another reason is that probes have limited paths and will only
reveal something about the path between two points in the Internet at the time of the probe. A security threat also exists,
because if the physical aspect of the Internet is mapped, it could be potentially dangerous material in the hands of certain
individuals and groups. Some groups have a commercial incentive for encouraging Internet anonymity and not having the
networks mapped out. Another reason is that networks lack motivation to gather such information because it does not seem to
serve them directly while being costly. Finally, there are no metrics for a network as a whole, and stakeholders must look
closely at the idiosyncrasies of the specific subsystems in use .
Oracle Label-Based Security
Oracle Label Security (OLS) is a security option for the Oracle Enterprise Edition database and mediates access to data rows by
comparing labels attached to data rows in application tables (sensitivity labels) and a set of user labels (clearance labels).

The need for more sophisticated controls on access to sensitive data is becoming increasingly important as
organizations address emerging security requirements around data consolidation, privacy and compliance.
Maintaining separate databases for highly sensitive customer data is costly and creates unnecessary
administrative overhead. However, consolidating databases sometimes means combining sensitive financial,
HR, medical or project data from multiple locations into a single database for reduced costs, easier
management and better scalability. Oracle Label Security provides the ability to tag data with a data label
or a data classification to allow the database to inherently know what data a user or role is authorized for,
and allows data from different sources to be combined in the same table as a larger data set without
compromising security.

Access to sensitive data is controlled by comparing the data label with the requesting user’s label or access
clearance. A user label or access clearance can be thought of as an extension to standard database privileges
and roles. Oracle Label Security is centrally enforced within the database, below the application layer,
providing strong security and eliminating the need for complicated application views.

What is Oracle Label Security?


Oracle Label Security (OLS) is a security option for the Oracle Enterprise Edition database and mediates
access to data rows by comparing labels attached to data rows in application tables (sensitivity labels) and a
set of user labels (clearance labels).



Who should consider Oracle Label Security?
Sensitivity labels are used in some form in virtually every industry. These industries include health care, law
enforcement, energy, retail, national security and defense industries. Examples of
label use include:

 Separating individual branch store, franchisee, or region data

 Financial companies with customers that span multiple countries with strong government privacy controls

 Consolidating and securing sensitive R&D projects

 Minimizing access to individual health care records

 Protecting HR data from different divisions

 Securing classified data for Government and Defense use

 Complying with U.S. State Department’s International Traffic in Arms (ITAR) regulations

 Supporting multiple customers in a multi-tenant SaaS application

 Restricting data processing, tracking consent, and handling right-to-erasure requests under the EU GDPR

What can Oracle Label Security do for my security needs?
Oracle Label Security can be used to label data and restrict access with a high degree
of granularity. This is particularly useful when multiple organizations, companies or
users share a single application. Sensitivity labels can be used to restrict application
users to a subset of data within an organization, without having to change the
application. Data privacy is important to consumers and stringent regulatory measures
continue to be enacted. Oracle Label Security can be used to implement privacy
policies on data, restricting access to only those who have a need-to-know.

COMPONENTS AND FEATURES



What are the main components of Oracle Label Security?
Label Security provides row level data access controls for application users. This is
called Label Security because each user and each data record have an associated
security label.

The user label consists of three components – a level, zero or more compartments and zero or more groups.
This label is assigned as part of the user authorization and is not modifiable by the user.

Session labels also consist of the same three components and may differ from the user label based on the
session that was established by the user. For example, if the user has a Top Secret level component but
logged in from a Secret workstation, the session label level would be Secret.

Data security labels have the same three components as the user and session labels. The three label
components are level, compartment and group:

 Levels indicate the sensitivity level of the data and the authorization for a user to access sensitive data.
The user (and session) level must be equal to or greater than the data level to access that record.


 Data can be part of zero or more compartments. The user/session label must have
every compartment that the record data has for the user to successfully retrieve the
record. For example, if the data label compartments are A, B and C – the session
label must at least contain A, B and C to access that data record.

 Data can have zero or more groups in the group component. The user/session label
needs to have at least one group that matches a data record’s group(s) to access the
data record. For example, if the data record had Boston, Chicago and New York for
groups, then the session label needs only Boston (or one of the other 2 groups) to
access the data.

 Protected objects are tables with labeled records.

 Label Security policies are a combination of User labels, Data labels and protected objects.
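The read-access rule implied by these three components can be summarised in a short sketch. The numeric level codes, the label values and the helper name below are hypothetical illustrations, not the actual OLS implementation.

# Session label must dominate the data label: level, compartments and groups.
LEVELS = {"PUBLIC": 10, "CONFIDENTIAL": 20, "SECRET": 30, "TOP SECRET": 40}

def can_read(session, data):
    level_ok = LEVELS[session["level"]] >= LEVELS[data["level"]]
    compartments_ok = set(data["compartments"]) <= set(session["compartments"])
    # At least one group must match; data with no groups imposes no group restriction.
    groups_ok = not data["groups"] or bool(set(session["groups"]) & set(data["groups"]))
    return level_ok and compartments_ok and groups_ok

session_label = {"level": "SECRET", "compartments": {"A", "B", "C"}, "groups": {"Boston"}}
data_label = {"level": "CONFIDENTIAL", "compartments": {"A", "B"},
              "groups": {"Boston", "Chicago", "New York"}}
print(can_read(session_label, data_label))  # True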

Does Oracle Label Security provide column-level access control?
No, Oracle Label Security is not column-aware.

A column-sensitive Virtual Private Database (VPD) policy can determine access to a specific column by
evaluating OLS user labels. An example of this type of OLS and VPD integration is available as a white
paper on the OLS OTN webpage.

A VPD policy can be written so that it only becomes active when a certain column (the 'sensitive' column)
is part of a SQL statement against a protected table. With the 'column sensitivity' switch on, VPD either
returns only those rows that include information in the sensitive column the user is allowed to see, or it
returns all rows, with all cells in the sensitive column being empty except those values the user is allowed
to see.

Can I base Secure Application Roles on Oracle Label Security?
Yes, the procedure, which determines if the 'set role' command is executed, can
evaluate OLS user labels. In this case, the OLS policy does not need to be applied to
a table, since row labels are not part of this solution. An example of this can be found
as a white paper on the OLS OTN webpage.



What are Trusted Stored Program Units?
Stored procedures, functions and packages execute with the system and object
privileges (DAC) of the definer. If the invoker is a user with OLS user clearances
(labels), the procedure executes with a combination of the definer's DAC privileges
and the invoker's security clearances.

Trusted stored procedures are procedures that are either granted the OLS privilege
'FULL' or 'READ'. When a trusted stored program unit is carried out, the policy
privileges in force are a union of the invoking user's privileges and the program unit's
privileges.

Are there any administrative tools available for Oracle Label Security?

Beginning with Oracle Database 11gR1, the functionality of Oracle Policy Manager
(and most other security related administration tools) has been made available in
Oracle Enterprise Manager Cloud Control, enabling administrators to create and
manage OLS policies in a modern, convenient and integrated environment.



DEPLOYMENT AND ADMINISTRATION

Where can I find Oracle Label Security?


Oracle Label Security is an option that is part of Oracle Database Enterprise Edition.
Oracle Label Security is installed as part of the database and just needs to be enabled.

Should I use Oracle Label Security to protect all my tables?
The traditional Oracle discretionary access control (DAC) object privileges SELECT, INSERT, UPDATE,
and DELETE, combined with database roles and stored procedures, are sufficient for most tables.
Furthermore, before applying OLS to your sensitive tables, some considerations need to be taken into
account; they are described in a white paper titled Oracle Label Security – Multi-Level Security
Implementation found on the OLS OTN webpage.

Are there any guidelines for using Oracle Label Security and defining sensitivity labels?
Yes, a comprehensive Label Security Administrator's Guide is available online. In addition, examples are
available in a white paper and under technical resources on the Oracle Technology Network, which walk
you through a list of recommended implementation guidelines. In most cases, the security mechanisms
provided at no cost with the Oracle Enterprise Edition (system and object privileges, database roles,
Secure Application Roles) will be sufficient to address security requirements. Oracle Label Security
should be considered when security is required at the individual row level.

Can Oracle Label Security policies and user labels (clearances) be stored centrally in Oracle Identity Management?
Not only can your database users be stored and managed centrally in Oracle Identity
Management using Enterprise User Security, but Oracle Label Security policies and
user clearances can be stored and managed in Oracle Identity Management as well.
This greatly simplifies policy management in distributed environments and enables
security administrators to automatically attach user clearances to all centrally
managed users.

How can I maintain the performance of my applications after applying Label Security access control policies?
As a best practice:

 Only apply sensitivity labels to those tables that really need protection. When multiple tables are joined
to retrieve sensitive data, look for the driving table.

 Do not apply OLS policies to schemas.

 Usually, there is only a small set of different data classification labels; if the table is mostly used for
READ operations, try building a bitmap index over the (hidden) OLS column, and add this index to the
existing indexes on that table.

 Review the Oracle Label Security whitepaper available in the product OTN
webpage as it contains a thorough discussion of performance considerations with
Oracle Label Security.

Can I use Oracle Label Security with Oracle Database Vault, Real Application Security and Data Redaction?

Yes. Oracle Label Security can provide user labels to be used as factors within Oracle Database Vault,
and security labels can be assigned to Real Application Security users. It also integrates with Oracle
Advanced Security Data Redaction, enabling security clearances to be used in data redaction policies.
