Database Practices
What is Data?
In simple words, data can be facts related to any object in consideration. For example, your name,
age, height, weight, etc. are some data related to you. A picture, image, file, pdf, etc. can also be
considered data.
Define Database:
A database is an organized collection of structured information, or data, typically stored
electronically in a computer system. A database is usually controlled by a database management
system (DBMS). The data can then be easily accessed, managed, modified, updated, controlled,
and organized.
ER model
o ER model stands for Entity-Relationship model. It is a high-level data model used to define the
data elements and the relationships for a specified system.
o It develops a conceptual design for the database and provides a simple, easy-to-understand
view of the data.
o In ER modeling, the database structure is portrayed as a diagram called an entity-
relationship diagram.
For example, Suppose we design a school database. In this database, the student will be an entity
with attributes like address, name, id, age, etc. The address can be another entity with attributes
like city, street name, pin code, etc and there will be a relationship between them.
Components of ER Diagram
1. Entity:
An entity may be any object, class, person or place. In the ER diagram, an entity is represented
as a rectangle.
a. Weak Entity
An entity that depends on another entity is called a weak entity. A weak entity does not contain any
key attribute of its own and is represented by a double rectangle.
2. Attribute
An attribute is used to describe a property of an entity. An ellipse is used to represent an attribute.
For example, id, age, contact number, name, etc. can be attributes of a student.
a. Key Attribute
The key attribute is used to represent the main characteristics of an entity. It represents a primary
key. The key attribute is represented by an ellipse with the text underlined.
b. Composite Attribute
An attribute that is composed of several other attributes is known as a composite attribute. The
composite attribute is represented by an ellipse, and its component attributes are shown as ellipses connected to it.
c. Multivalued Attribute
An attribute that can hold more than one value is known as a multivalued attribute.
A double oval is used to represent a multivalued attribute.
For example, a student can have more than one phone number.
d. Derived Attribute
An attribute that can be derived from other attributes is known as a derived attribute. It is
represented by a dashed ellipse.
For example, A person's age changes over time and can be derived from another attribute like
Date of birth.
3. Relationship
A relationship describes an association between entities. A diamond (rhombus) is used to represent a
relationship.
a. One-to-One Relationship
When a single instance of one entity is associated with a single instance of another entity, it is known as a one-to-one relationship.
For example, a female can marry one male, and a male can marry one female.
b. One-to-many relationship
When a single instance of the entity on the left is associated with more than one instance of the entity on the right,
it is known as a one-to-many relationship.
For example, a scientist can invent many inventions, but each invention is made by only one specific scientist.
c. Many-to-one relationship
When more than one instance of the entity on the left is associated with only one instance of the entity on the right,
it is known as a many-to-one relationship.
For example, a student enrolls in only one course, but a course can have many students.
d. Many-to-many relationship
When more than one instance of the entity on the left is associated with more than one instance of the entity on the right,
it is known as a many-to-many relationship.
For example, an employee can be assigned to many projects, and a project can have many employees.
Participation Constraints
Total Participation − Each entity is involved in the relationship. Total participation is
represented by double lines.
Partial participation − Not all entities are involved in the relationship. Partial participation is
represented by single lines.
Example – Total Participation
Here,
Double line between the entity set “Student” and relationship set “Enrolled in” signifies total participation.
It specifies that each student must be enrolled in at least one course.
Example – Partial Participation
Here,
Single line between the entity set “Course” and relationship set “Enrolled in” signifies partial participation.
It specifies that there might exist some courses for which no enrollments are made.
The domain of an attribute is the set of values it is permitted to take. For example:
The domain of Marital Status has a set of possibilities: Married, Single, Divorced.
The domain of Shift has the set of all possible days: {Mon, Tue, Wed…}.
The domain of Salary is the set of all floating-point numbers greater than 0 and less than
200,000.
The domain of First Name is the set of character strings that represents names of people.
Attribute: It contains the name of a column in a particular table. Each attribute Ai must have a
domain, dom(Ai)
Relational instance: In the relational database system, the relational instance is represented by a
finite set of tuples. Relation instances do not have duplicate tuples.
Relational schema: A relational schema contains the name of the relation and name of all
columns or attributes.
Relational key: A relational key is a set of one or more attributes that can uniquely identify a row
in the relation.
o In the given table, NAME, ROLL_NO, PHONE_NO, ADDRESS, and AGE are the
attributes.
o The instance of schema STUDENT has 5 tuples.
o t3 = <Laxman, 33289, 8583287182, Gurugram, 20>
Properties of Relations
Keys :
atomic value: each value in the domain is indivisible as far as the relational model is concerned
attribute: principal storage unit in a database
domain: the original sets of atomic values used to model data; a set of acceptable values that a
column is allowed to contain
file:see relation
relation: a subset of the Cartesian product of a list of domains characterized by a name; the technical
term for table or file
table:see relation
Types:
1) Unary Relationship:
E.g., PERSON is married to PERSON.
PERSON is the entity; "is married to" is the relationship.
2) Binary Relationship: a relationship between two entity sets (e.g., the Person–Passport "Has" relationship discussed below).
3) Ternary Relationship: a relationship among three entity sets.
Case 1: Binary relationship with 1:1 cardinality and total participation of one entity
First convert each entity and relationship to tables. The Person table corresponds to the Person entity with
Per-Id as its key. Similarly, the Passport table corresponds to the Passport entity with Pass-No as its key. The Has
table represents the relationship between Person and Passport (which person has which passport), so it takes
attribute Per-Id from Person and Pass-No from Passport.
Table 1
As we can see from Table 1, each Per-Id and Pass-No has only one entry in the Has table. So we can
merge all three tables into one, with the attributes shown in Table 2. Each Per-Id will be unique and not NULL,
so it will be the key. Pass-No can't be the key because for some persons it can be NULL.
Table 2
Case 2: Binary Relationship with 1:1 cardinality and partial participation of both entities
A male marries 0 or 1 female and vice versa as well. So it is 1:1 cardinality with partial participation
constraint from both. First Convert each entity and relationship to tables. Male table corresponds to
Male Entity with key as M-Id. Similarly Female table corresponds to Female Entity with key as F-Id.
Marry Table represents relationship between Male and Female (Which Male marries which female). So
it will take attribute M-Id from Male and F-Id from Female.
M-Id Other Male Attribute M-Id F-Id F-Id Other Female Attribute
M1 – M1 F2 F1 –
M2 – M2 F1 F2 –
M3 – F3 –
Table 3
As we can see from Table 3, some males and some females do not marry. If we merge 3 tables into 1,
for some M-Id, F-Id will be NULL. So there is no attribute which is always not NULL. So we can’t
merge all three tables into 1. We can convert into 2 tables. In table 4, M-Id who are married will have
F-Id associated. For others, it will be NULL. Table 5 will have information of all females. Primary
Keys have been underlined.
Table 4
Case 3: Binary relationship with n:1 cardinality
In this scenario, every student can enroll in only one elective course, but an elective course can have
more than one student. First convert each entity and relationship to tables. The Student table
corresponds to the Student entity with S-Id as key. Similarly, the Elective_Course table corresponds to the
Elective_Course entity with E-Id as key. The Enrolls table represents the relationship between Student and
Elective_Course (which student enrolls in which course), so it takes attribute S-Id from Student and
E-Id from Elective_Course.
S-Id Other Student Attribute S-Id E-Id E-Id Other Elective Course Attribute
S1 – S1 E1 E1 –
S2 – S2 E2 E2 –
S3 – S3 E1 E3 –
S4 – S4 E1
Table 6
As we can see from Table 6, S-Id does not repeat in the Enrolls table, so it can be considered the key of
the Enrolls table. Since the Student table and the Enrolls table have the same key, we can merge them into a
single table. The resultant tables are shown in Table 7 and Table 8. Primary keys have been underlined.
Table 8
Case 4: Binary Relationship with m: n cardinality
In this scenario, every student can enroll in more than 1 compulsory course and for a compulsory course
there can be more than 1 student. First Convert each entity and relationship to tables. Student table
corresponds to Student Entity with key as S-Id. Similarly Compulsory_Courses table corresponds to
Compulsory Courses Entity with key as C-Id. Enrolls Table represents relationship between Student
and Compulsory_Courses (which student enrolls in which course). So it will take attribute S-Id from
Student and C-Id from Compulsory_Courses.
S-Id Other Student Attribute S-Id C-Id C-Id Other Compulsory Course Attribute
S1 – S1 C1 C1 –
S2 – S1 C2 C2 –
S3 – S3 C1 C3 –
S4 – S4 C3 C4 –
S4 C2
S3 C3
Table 9
As we can see from Table 9, both S-Id and C-Id repeat in the Enrolls table, but their combination is
unique, so the combination can be considered the key of the Enrolls table. All the tables' keys are different,
so these tables can't be merged. Primary keys of all tables have been underlined.
Case 5: Binary Relationship with weak entity
In this scenario, an employee can have many dependents and one dependent can depend on one
employee. A dependent does not have any existence without an employee (e.g., you as a child can be
a dependent of your father in his company). So it will be a weak entity, and its participation will always
be total. A weak entity does not have a key of its own, so its key will be the combination of the key of its
identifying entity (E-Id of Employee in this case) and its partial key (D-Name).
First Convert each entity and relationship to tables. Employee table corresponds to Employee Entity
with key as E-Id. Similarly Dependents table corresponds to Dependent Entity with key as D-Name
and E-Id. The Has table represents the relationship between Employee and Dependents (which employee has
which dependents), so it takes attribute E-Id from Employee and D-Name from Dependents.
E-Id Other Employee Attribute E-Id D-Name D-Name E-Id Other Dependents Attribute
E1 – E1 RAM RAM E1 –
E2 – E1 SRINI SRINI E1 –
E3 – E2 RAM RAM E2 –
E3 ASHISH ASHISH E3 –
Table 10
As we can see from Table 10, (E-Id, D-Name) is the key for the Has table as well as the Dependents table,
so we can merge these two into one. The resultant tables are shown in Tables 11 and 12. Primary keys of all
tables have been underlined.
E-Id Other Employee Attribute
Table 11
Table 12
The ER model, when conceptualized into diagrams, gives a good overview of entity relationships, which is easier to
understand. ER diagrams can be mapped to a relational schema, that is, it is possible to create a relational schema
from an ER diagram. We cannot import all the ER constraints into the relational model, but an approximate schema
can be generated.
There are several processes and algorithms available to convert ER diagrams into relational schemas. Some of
them are automated and some of them are manual. Here we focus on mapping diagram contents to
relational basics.
ER diagrams mainly comprise entities, attributes, and relationships.
Mapping Entity
An entity is a real-world object with some attributes.
Mapping Relationship
A relationship is an association among entities.
Mapping Process
Create table for a relationship.
Add the primary keys of all participating Entities as fields of table with their respective data types.
If relationship has any attribute, add each attribute as field of table.
Declare a primary key composing all the primary keys of participating entities.
Declare all foreign key constraints.
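As a hedged illustration of this process, the sketch below maps a hypothetical many-to-many relationship ENROLLS between STUDENT and COURSE entities (all table and column names here are assumptions, not from the source):
CREATE TABLE ENROLLS (
    S_ID  INT,                                   -- primary key of the participating STUDENT entity
    C_ID  INT,                                   -- primary key of the participating COURSE entity
    GRADE VARCHAR(2),                            -- an attribute of the relationship itself, if any
    PRIMARY KEY (S_ID, C_ID),                    -- composed of the participating entities' primary keys
    FOREIGN KEY (S_ID) REFERENCES STUDENT(S_ID),
    FOREIGN KEY (C_ID) REFERENCES COURSE(C_ID)
);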
Mapping Weak Entity Sets
Mapping Process
Create table for weak entity set.
Add all its attributes to table as field.
Add the primary key of identifying entity set.
Declare all foreign key constraints.
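A similar hedged sketch for a weak entity set, assuming a DEPENDENT weak entity identified by an EMPLOYEE owner (names are illustrative):
CREATE TABLE DEPENDENT (
    E_ID     INT,                                -- primary key of the identifying entity EMPLOYEE
    D_NAME   VARCHAR(50),                        -- partial key of the weak entity
    RELATION VARCHAR(20),                        -- other attribute of the weak entity
    PRIMARY KEY (E_ID, D_NAME),                  -- identifying entity's key plus partial key
    FOREIGN KEY (E_ID) REFERENCES EMPLOYEE(E_ID)
);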
Relational Algebra
Relational algebra is a procedural query language, which takes relations as input and generates relations as
output. Relational algebra mainly provides the theoretical foundation for relational databases and SQL.
It uses operators to perform queries. An operator can be either unary or binary. They accept relations as their
input and yield relations as their output. Relational algebra is performed recursively on a relation and
intermediate results are also considered relations.
Relational algebra is a procedural query language that works on the relational model. The purpose of a query
language is to retrieve data from a database or perform various operations such as insert, update and delete on
the data. When we say that relational algebra is a procedural query language, it means that it specifies both what data
is to be retrieved and how it is to be retrieved.
On the other hand, relational calculus is a non-procedural query language, which means it tells what data
is to be retrieved but doesn't tell how to retrieve it.
Select
Project
Union
Set difference
Cartesian product
Rename
We will discuss all these operations in the following sections.
A B C
-------
1 2 4
4 3 4
Project Operation (∏)
It projects column(s) that satisfy a given predicate.
Notation − ∏A1, A2, ..., An (r), where A1, A2, ..., An are attribute names of relation r.
Table R1 is as follows −
Regno Branch Section
1 CSE A
2 ECE B
3 MECH B
4 CIVIL A
5 CSE B
Table R2 is as follows −
1 CIVIL A
2 CSE A
3 ECE B
To display all the regno of R1 and R2, use the following command −
∏regno(R1) ∪ ∏regno(R2)
Output −
Regno
1
2
3
4
5
Union
Union combines two different results obtained by queries into a single result in the form of a table.
However, the two results must be union compatible (same number of columns with matching data types) if union is
to be applied on them. Union removes all duplicates, if any, from the data and only displays distinct values.
If duplicate values are required in the resultant data, then UNION ALL is used.
An example of union is −
Select Student_Name from Art_Students
UNION
Select Student_Name from Dance_Students
This will display the names of all the students in the table Art_Students and Dance_Students i.e John,
Mary, Damon and Matt.
Intersection
The intersection operator gives the common data values between the two data sets that are intersected.
The two data sets that are intersected should be union compatible for the intersection operator to work.
Intersection also removes all duplicates before displaying the result.
An example of intersection is −
Select Student_Name from Art_Students
INTERSECT
Select Student_Name from Dance_Students
This will display the names of the students who appear in both the Art_Students table and the Dance_Students
table, i.e. all the students that have taken both art and dance classes. Those are Mary and Damon in this
example.
Note: Only those rows that are present in both the tables will appear in the result set.
Syntax: table_name1 ∩ table_name2
Intersection Operator (∩) Example
Let's take the same example that we have taken above.
Table 1: COURSE
Student_Name
------------
Aditya
Steve
Paul
Lucy
The result of set difference operation is tuples, which are present in one relation but are not in the second relation.
Notation − r − s
Finds all the tuples that are present in r but not in s.
∏ author (Books) − ∏ author (Articles)
Output − Provides the name of authors who have written books but not articles.
Minus (-): Minus on two relations R1 and R2 can only be computed if R1 and R2 are union
compatible. Minus operator when applied on two relations as R1-R2 will give a relation with
tuples which are in R1 but not in R2. Syntax:
Relation1 - Relation2
Example: To find persons who are students but not employees, we can use the minus operator as follows:
STUDENT – EMPLOYEE
Table 1: EMPLOYEE
EMP_NO NAME ADDRESS PHONE AGE
Table 2: STUDENT
ROLL_NO NAME ADDRESS PHONE AGE
RESULT:
ROLL_NO NAME ADDRESS PHONE AGE
Set difference
The set difference operator takes two sets and returns the values that are in the first set but not in the
second set.
An example of set difference is −
Select Student_Name from Art_Students
MINUS
Select Student_Name from Dance_Students
This will display the names of all the students in table Art_Students but not in table Dance_Students i.e
the students who are taking art classes but not dance classes.
A × B
Name Age Sex Id Course
---------------------------------
Ram 14 M 1 DS
Ram 14 M 2 DBMS
Sona 15 F 1 DS
Sona 15 F 2 DBMS
Kim 20 M 1 DS
Kim 20 M 2 DBMS
Note: if A has ‘n’ tuples and B has ‘m’ tuples then A X B will have ‘n*m’ tuples.
Emp Dep
(Name Id Dept_name ) (Dept_name Manager)
------------------------ ---------------------
A 120 IT Sale Y
B 125 HR Prod Z
C 110 Sale IT A
D 111 IT
Emp ⋈ Dep
Conditional Join
Conditional join works similarly to natural join. In natural join, the default condition is equality between the
common attributes, while in conditional join we can specify any condition, such as greater than, less than, or
not equal.
Let us see the example below.
R S
(ID Sex Marks) (ID Sex Marks)
------------------ --------------------
1 F 45 10 M 20
2 F 55 11 M 22
3 F 60 12 M 59
Join between R And S with condition R.marks >= S.marks
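In SQL, this conditional (theta) join could be sketched as follows, assuming R and S exist as tables with the columns shown above:
SELECT *
FROM R
JOIN S ON R.Marks >= S.Marks;   -- any comparison operator may be used in the join condition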
Table 1
STUDENT_SPORTS
ROLL_NO SPORTS
1 Badminton
2 Cricket
2 Badminton
4 Badminton
Table 2
ALL_SPORTS
SPORTS
Badminton
Cricket
Table 3
EMPLOYEE
EMP_NO NAME ADDRESS PHONE AGE
Table 4
STUDENT
ROLL_NO NAME ADDRESS PHONE AGE
Intersection (∩): Intersection on two relations R1 and R2 can only be computed if R1 and R2
are union compatible (These two relation should have same number of attributes and
corresponding attributes in two relations have same domain). Intersection operator when
applied on two relations as R1∩R2 will give a relation with tuples which are in R1 as well as
R2. Syntax:
Relation1 ∩ Relation2
Example: Find a person who is student as well as employee- STUDENT ∩ EMPLOYEE
In terms of basic operators (set difference):
STUDENT ∩ EMPLOYEE = STUDENT − (STUDENT − EMPLOYEE)
RESULT:
ROLL_NO NAME ADDRESS PHONE AGE
Conditional Join(⋈c): Conditional Join is used when you want to join two or more relation
based on some conditions. Example: Select students whose ROLL_NO is greater than
EMP_NO of employees
STUDENT ⋈c (STUDENT.ROLL_NO > EMPLOYEE.EMP_NO) EMPLOYEE
In terms of basic operators (cross product and selection) :
σ (STUDENT.ROLL_NO>EMPLOYEE.EMP_NO) (STUDENT×EMPLOYEE)
RESULT:
ROLL_NO NAME ADDRESS PHONE AGE EMP_NO NAME ADDRESS PHONE AGE
1 RAM DELHI 9455123451 18 RAM DELHI 9455123451 18
Natural Join (⋈): It is a special case of equijoin in which the equality condition holds on all
attributes that have the same name in relations R and S (the relations on which the join operation is
applied). While applying natural join on two relations, there is no need to write the equality
condition explicitly. Natural join also returns the common attributes only once, as their values
will be the same in the resulting relation.
Example: Select students whose ROLL_NO is equal to ROLL_NO of STUDENT_SPORTS as:
STUDENT⋈STUDENT_SPORTS
In terms of basic operators (cross product, selection and projection) :
∏(STUDENT.ROLL_NO, STUDENT.NAME, STUDENT.ADDRESS, STUDENT.PHONE, STUDENT.AGE, STUDENT_SPORTS.SPORTS) (σ(STUDENT.ROLL_NO = STUDENT_SPORTS.ROLL_NO) (STUDENT × STUDENT_SPORTS))
RESULT:
ROLL_NO NAME ADDRESS PHONE AGE SPORTS
Natural join is by default an inner join, because tuples which do not satisfy the join condition do not
appear in the result set. E.g., the tuple having ROLL_NO 3 in STUDENT does not match any tuple in
STUDENT_SPORTS, so it is not part of the result set.
Left Outer Join (⟕): When applying a join on two relations R and S, tuples of R or S that do not satisfy
the join condition do not appear in the result set. A left outer join, however, gives all tuples of R in the
result set. The tuples of R which do not satisfy the join condition will have NULL values for the
attributes of S.
Example:Select students whose ROLL_NO is greater than EMP_NO of employees and details
of other students as well
STUDENT ⟕(STUDENT.ROLL_NO > EMPLOYEE.EMP_NO) EMPLOYEE
RESULT
ROLL_NO NAME ADDRESS PHONE AGE EMP_NO NAME ADDRESS PHONE AGE
Right Outer Join (⟖): When applying a join on two relations R and S, tuples of R or S that do not satisfy
the join condition do not appear in the result set. A right outer join, however, gives all tuples of S in
the result set. The tuples of S which do not satisfy the join condition will have NULL values for the
attributes of R.
Example: Select students whose ROLL_NO is greater than EMP_NO of employees and
details of other Employees as well
STUDENT ⟖(STUDENT.ROLL_NO > EMPLOYEE.EMP_NO) EMPLOYEE
RESULT:
ROLL_NO NAME ADDRESS PHONE AGE EMP_NO NAME ADDRESS PHONE AGE
NULL NULL NULL NULL NULL 5 NARESH HISAR 9782918192 22
NULL NULL NULL NULL NULL 4 SURESH DELHI 9156768971 18
Full Outer Join (⟗): When applying a join on two relations R and S, tuples of R or S that do not satisfy
the join condition do not appear in the result set. A full outer join, however, gives all tuples of R and all
tuples of S in the result set. The tuples of S which do not satisfy the join condition will have NULL values
for the attributes of R, and vice versa.
Example:Select students whose ROLL_NO is greater than EMP_NO of employees and details
of other Employees as well and other Students as well
STUDENT ⟗(STUDENT.ROLL_NO > EMPLOYEE.EMP_NO) EMPLOYEE
RESULT:
ROLL_NO NAME ADDRESS PHONE AGE EMP_NO NAME ADDRESS PHONE AGE
NULL NULL NULL NULL NULL 5 NARESH HISAR 9782918192 22
NULL NULL NULL NULL NULL 4 SURESH DELHI 9156768971 18
1 RAM DELHI 9455123451 18 NULL NULL NULL NULL NULL
Division Operator (÷): The division operator A ÷ B can be applied if and only if:
The attributes of B are a proper subset of the attributes of A.
The relation returned by the division operator will have attributes = (all attributes of A −
all attributes of B).
The relation returned by the division operator will contain those tuples from relation A
which are associated with every tuple of B.
Consider the relations STUDENT_SPORTS and ALL_SPORTS given in Table 1 and Table 2
above.
To apply division operator as
STUDENT_SPORTS÷ ALL_SPORTS
The operation is valid as the attributes of ALL_SPORTS are a proper subset of the attributes
of STUDENT_SPORTS.
The attributes in the resulting relation will be {ROLL_NO, SPORTS} −
{SPORTS} = {ROLL_NO}.
The tuples in the resulting relation will be those ROLL_NOs which are associated with
all tuples of ALL_SPORTS, i.e. {Badminton, Cricket}. ROLL_NO 1 and 4 are associated with Badminton
only, while ROLL_NO 2 is associated with all tuples of ALL_SPORTS. So the resulting relation will be:
ROLL_NO
2
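SQL has no direct division operator; a common sketch uses a double NOT EXISTS. The query below (an illustration, not from the source) returns the ROLL_NOs in STUDENT_SPORTS that are associated with every sport in ALL_SPORTS:
SELECT DISTINCT s.ROLL_NO
FROM STUDENT_SPORTS s
WHERE NOT EXISTS (
    SELECT a.SPORTS
    FROM ALL_SPORTS a
    WHERE NOT EXISTS (
        SELECT 1
        FROM STUDENT_SPORTS s2
        WHERE s2.ROLL_NO = s.ROLL_NO
          AND s2.SPORTS = a.SPORTS
    )
);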
(INNER) JOIN: Returns records that have matching values in both tables
LEFT (OUTER) JOIN: Returns all records from the left table, and the matched records from
the right table
RIGHT (OUTER) JOIN: Returns all records from the right table, and the matched records
from the left table
FULL (OUTER) JOIN: Returns all records when there is a match in either left or right table
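The following hedged sketch shows the four join types on the STUDENT and EMPLOYEE tables used above (joining on NAME purely for illustration):
SELECT s.ROLL_NO, s.NAME, e.EMP_NO
FROM STUDENT s INNER JOIN EMPLOYEE e ON s.NAME = e.NAME;       -- only matching rows

SELECT s.ROLL_NO, s.NAME, e.EMP_NO
FROM STUDENT s LEFT OUTER JOIN EMPLOYEE e ON s.NAME = e.NAME;  -- all students, NULLs for unmatched employees

SELECT s.ROLL_NO, s.NAME, e.EMP_NO
FROM STUDENT s RIGHT OUTER JOIN EMPLOYEE e ON s.NAME = e.NAME; -- all employees, NULLs for unmatched students

SELECT s.ROLL_NO, s.NAME, e.EMP_NO
FROM STUDENT s FULL OUTER JOIN EMPLOYEE e ON s.NAME = e.NAME;  -- all rows from both sides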
Relational Calculus
In contrast to Relational Algebra, Relational Calculus is a non-procedural query language, that is, it tells what to
do but never explains how to do it.
Relational calculus exists in two forms − Tuple Relational Calculus (TRC) and Domain Relational Calculus (DRC).
In DRC, the filtering variable uses the domain of attributes: {a1, a2, ..., an | P(a1, a2, ..., an)},
where a1, a2, ..., an are attributes and P stands for formulae built using inner attributes.
For example −
{< article, page, subject > | ∈ TutorialsPoint ∧ subject = 'database'}
Output − Yields Article, Page, and Subject from the relation TutorialsPoint, where subject is database.
Just like TRC, DRC can also be written using existential and universal quantifiers. DRC also involves relational
operators.
The expression power of Tuple Relation Calculus and Domain Relation Calculus is equivalent to Relational
Algebra.
SQL
o SQL stands for Structured Query Language. It is used for storing and managing data in a relational database
management system (RDBMS).
o It is a standard language for Relational Database System. It enables a user to create, read, update and delete relational
databases and tables.
o All the RDBMS like MySQL, Informix, Oracle, MS Access and SQL Server use SQL as their standard database
language.
o SQL allows users to query the database in a number of ways, using English-like statements.
Rules:
SQL follows the following rules:
o Structured Query Language is not case sensitive. Generally, keywords of SQL are written in uppercase.
o Statements of SQL are not dependent on text lines. We can place a single SQL statement on one or multiple text lines.
o Using the SQL statements, you can perform most of the actions in a database.
o SQL depends on tuple relational calculus and relational algebra.
SQL process:
o When an SQL command is executed for any RDBMS, the system figures out the best way to carry out the
request, and the SQL engine determines how to interpret the task.
o Various components are included in this process. These components can be the optimization engine, query engine,
query dispatcher, classic query engine, etc.
o All the non-SQL queries are handled by the classic query engine, but the SQL query engine won't handle logical files.
SQL is a language to operate databases; it includes database creation, deletion, fetching rows, modifying rows,
etc. SQL is an ANSI (American National Standards Institute) standard language, but there are many different
versions of the SQL language.
The optimizer chooses the plan with the lowest cost among all considered candidate plans. The optimizer uses available
statistics to calculate cost. For a specific query in a given environment, the cost computation accounts for factors of query
execution such as I/O, CPU, and communication.
For example, a query might request information about employees who are managers. If the optimizer statistics indicate
that 80% of employees are managers, then the optimizer may decide that a full table scan is most efficient. However, if
statistics indicate that very few employees are managers, then reading an index followed by a table access by rowid may be
more efficient than a full table scan.
Because the database has many internal statistics and tools at its disposal, the optimizer is usually in a better position than
the user to determine the optimal method of statement execution. For this reason, all SQL statements use the optimizer.
What is SQL?
SQL is Structured Query Language, which is a computer language for storing, manipulating and retrieving data
stored in a relational database.
SQL is the standard language for Relational Database System. All the Relational Database Management Systems
(RDBMS) like MySQL, MS Access, Oracle, Sybase, Informix, Postgres and SQL Server use SQL as their standard
database language.
Also, they use different dialects, such as T-SQL in MS SQL Server and PL/SQL in Oracle.
Why SQL?
SQL is widely popular because it offers the following advantages −
Allows users to access data in the relational database management systems.
Allows users to describe the data.
Allows users to define the data in a database and manipulate that data.
Allows to embed within other languages using SQL modules, libraries & pre-compilers.
Allows users to create and drop databases and tables.
Allows users to create view, stored procedure, functions in a database.
Allows users to set permissions on tables, procedures and views.
SQL Process
When you are executing an SQL command for any RDBMS, the system determines the best way to carry out
your request and SQL engine figures out how to interpret the task.
There are various components included in this process.
These components are −
Query Dispatcher
Optimization Engines
Classic Query Engine
SQL Query Engine, etc.
A classic query engine handles all the non-SQL queries, but a SQL query engine won't handle logical files.
Following is a simple diagram showing the SQL Architecture −
SQL Commands
The standard SQL commands to interact with relational databases are CREATE, SELECT, INSERT, UPDATE,
DELETE and DROP. These commands can be classified into the following groups based on their nature −
1. Data Definition Language (DDL)
o CREATE
o ALTER
o DROP
o TRUNCATE
a. CREATE: It is used to create a new table in the database.
Syntax:
Example:
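For instance (a sketch; the EMPLOYEE table and its columns are assumptions, not from the source):
CREATE TABLE table_name (column1 datatype, column2 datatype, ...);   -- general syntax
CREATE TABLE EMPLOYEE (EMP_ID INT, NAME VARCHAR(50), AGE INT);       -- example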
b. DROP: It is used to delete both the structure and record stored in the table.
Syntax
1. DROP TABLE table_name;
Example
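For example (using the EMPLOYEE table assumed in the CREATE sketch above):
DROP TABLE EMPLOYEE;   -- removes both the table structure and its records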
c. ALTER: It is used to alter the structure of the database. This change could be either to modify the
characteristics of an existing attribute or to add a new attribute.
Syntax:
EXAMPLE
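A hedged sketch of the syntax and an example (column names assumed; MODIFY is MySQL/Oracle syntax, while some systems use ALTER COLUMN instead):
ALTER TABLE table_name ADD column_name datatype;  -- general form for adding an attribute
ALTER TABLE EMPLOYEE ADD EMAIL VARCHAR(100);      -- add a new attribute
ALTER TABLE EMPLOYEE MODIFY NAME VARCHAR(80);     -- change the characteristics of an existing attribute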
d. TRUNCATE: It is used to delete all the rows from the table and free the space containing the table.
Syntax:
Example:
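A minimal illustration (the EMPLOYEE table is an assumption):
TRUNCATE TABLE table_name;   -- general syntax
TRUNCATE TABLE EMPLOYEE;     -- deletes all rows and frees the space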
2. Data Manipulation Language (DML)
o INSERT
o UPDATE
o DELETE
a. INSERT: The INSERT statement is a SQL query used to insert data into a row of a table.
Syntax:
1. INSERT INTO TABLE_NAME
2. (col1, col2, col3,.... col N)
3. VALUES (value1, value2, value3, .... valueN);
Or
For example:
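The alternative form hinted at by "Or", and a small example (the EMPLOYEE table and its values are assumptions):
INSERT INTO TABLE_NAME VALUES (value1, value2, value3, .... valueN);  -- without listing columns
INSERT INTO EMPLOYEE (EMP_ID, NAME, AGE) VALUES (1, 'RAM', 18);       -- example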
b. UPDATE: This command is used to update or modify the value of a column in the table.
Syntax:
For example:
1. UPDATE students
2. SET User_Name = 'Sonoo'
3. WHERE Student_Id = '3'
c. DELETE: It is used to remove one or more rows from a table.
Syntax:
For example:
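A hedged sketch of the syntax and an example, reusing the students table from the UPDATE example above:
DELETE FROM table_name WHERE condition;      -- general syntax
DELETE FROM students WHERE Student_Id = '3'; -- example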
3. Data Control Language (DCL)
o Grant
o Revoke
a. Grant: It is used to give user access privileges to a database.
Example
1. GRANT SELECT, UPDATE ON MY_TABLE TO SOME_USER, ANOTHER_USER;
b. Revoke: It is used to take back permissions from the user.
Example
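A parallel REVOKE example (illustrative, mirroring the GRANT statement above):
REVOKE SELECT, UPDATE ON MY_TABLE FROM SOME_USER, ANOTHER_USER;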
4. Transaction Control Language (TCL)
TCL commands can only be used with DML commands such as INSERT, UPDATE and DELETE. They cannot be used
while creating or dropping tables, because those DDL operations are automatically committed in the database.
o COMMIT
o ROLLBACK
o SAVEPOINT
a. Commit: Commit command is used to save all the transactions to the database.
Syntax:
1. COMMIT;
Example:
b. Rollback: Rollback command is used to undo transactions that have not already been saved to the
database.
Syntax:
1. ROLLBACK;
Example:
c. SAVEPOINT: It is used to roll the transaction back to a certain point without rolling back the entire
transaction.
Syntax:
1. SAVEPOINT SAVEPOINT_NAME;
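A small hedged sketch tying the three TCL commands together (the EMPLOYEE table and values are assumptions):
INSERT INTO EMPLOYEE (EMP_ID, NAME, AGE) VALUES (2, 'SHYAM', 21);
SAVEPOINT after_insert;                        -- mark this point in the transaction
UPDATE EMPLOYEE SET AGE = 22 WHERE EMP_ID = 2;
ROLLBACK TO SAVEPOINT after_insert;            -- undoes the UPDATE but keeps the INSERT
COMMIT;                                        -- permanently saves the remaining changes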
5. Data Query Language
DQL is used to fetch the data from the database.
o SELECT
a. SELECT: This is the same as the projection operation of relational algebra. It is used to select the
attribute based on the condition described by WHERE clause.
Syntax:
1. SELECT expressions
2. FROM TABLES
3. WHERE conditions;
For example:
1. SELECT emp_name
2. FROM employee
3. WHERE age > 20;
1. COMMIT-
COMMIT in SQL is a transaction control language command that is used to permanently save the
changes done in the transaction to tables/databases. The database cannot regain its previous
state after the execution of a COMMIT.
Example: Consider the following STAFF table with records:
STAFF
sql>
SELECT *
FROM Staff
WHERE Allowance = 400;
sql> COMMIT;
Output:
So, the SELECT statement produced the output consisting of three rows.
2. ROLLBACK
ROLLBACK in SQL is a transaction control language command that is used to undo the transactions
that have not been saved in the database. The command can only be used to undo changes
since the last COMMIT.
Example: Consider the following STAFF table with records:
STAFF
sql>
SELECT *
FROM STAFF
WHERE ALLOWANCE = 400;
sql> ROLLBACK;
Output:
So, the SELECT statement produced the same output with the ROLLBACK command.
COMMIT vs ROLLBACK:
1. COMMIT permanently saves the changes made by the current transaction, whereas ROLLBACK undoes the changes made by the current transaction.
2. The transaction cannot undo changes after COMMIT has executed, whereas the transaction reaches its previous state after ROLLBACK.
Char vs Varchar
The basic difference between Char and Varchar is that: char stores only fixed-length character
string data types whereas varchar stores variable-length string where an upper limit of length is
specified.
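A small illustration (the CODES table is an assumption): CHAR pads values to the declared length, while VARCHAR stores only the characters inserted, up to the limit.
CREATE TABLE CODES (
    FIXED_CODE CHAR(10),     -- 'AB' is stored padded to 10 characters: 'AB        '
    VAR_CODE   VARCHAR(10)   -- 'AB' is stored as just 2 characters, with 10 as the upper limit
);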
DBMS Keys
o Keys play an important role in the relational database.
o Keys are used to uniquely identify any record or row of data in a table. They are also used to establish and
identify relationships between tables.
For example: In Student table, ID is used as a key because it is unique for each student. In PERSON
table, passport_number, license_number, SSN are keys since they are unique for each person.
Types of key:
1. Primary key
o It is the first key used to identify one and only one instance of an entity uniquely. An entity can contain
multiple keys, as we saw in the PERSON table. The key which is most suitable from those lists becomes a
primary key.
o In the EMPLOYEE table, ID can be the primary key since it is unique for each employee. In the EMPLOYEE
table, we can even select License_Number and Passport_Number as primary keys since they are also
unique.
o For each entity, the primary key selection is based on requirements and developers.
2. Candidate key
o A candidate key is an attribute or set of attributes that can uniquely identify a tuple.
o Except for the primary key, the remaining attributes are considered a candidate key. The candidate keys
are as strong as the primary key.
For example: In the EMPLOYEE table, id is best suited for the primary key. The rest of the attributes, like
SSN, Passport_Number, License_Number, etc., are considered a candidate key.
3. Super Key
Super key is an attribute set that can uniquely identify a tuple. A super key is a superset of a candidate
key.
For example: In the above EMPLOYEE table, consider (EMPLOYEE_ID, EMPLOYEE_NAME): the names of two
employees can be the same, but their EMPLOYEE_ID can't be the same. Hence, this combination can also
be a key.
4. Foreign key
o Foreign keys are the column of the table used to point to the primary key of another table.
o Every employee works in a specific department in a company, and employee and department are two
different entities. So we can't store the department's information in the employee table. That's why we link
these two tables through the primary key of one table.
o We add the primary key of the DEPARTMENT table, Department_Id, as a new attribute in the EMPLOYEE
table.
o In the EMPLOYEE table, Department_Id is the foreign key, and both the tables are related.
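A hedged sketch of the EMPLOYEE–DEPARTMENT link described above (column types and extra columns are assumptions):
CREATE TABLE DEPARTMENT (
    Department_Id INT PRIMARY KEY,
    Dept_Name     VARCHAR(50)
);
CREATE TABLE EMPLOYEE (
    ID            INT PRIMARY KEY,   -- primary key of EMPLOYEE
    Name          VARCHAR(50),
    Department_Id INT,               -- foreign key pointing to DEPARTMENT
    FOREIGN KEY (Department_Id) REFERENCES DEPARTMENT(Department_Id)
);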
5. Alternate key
There may be one or more attributes or a combination of attributes that uniquely identify each tuple in
a relation. These attributes or combinations of the attributes are called the candidate keys. One key is
chosen as the primary key from these candidate keys, and the remaining candidate key, if it exists, is
termed the alternate key. In other words, the total number of the alternate keys is the total number of
candidate keys minus the primary key. The alternate key may or may not exist. If there is only one
candidate key in a relation, it does not have an alternate key.
For example, employee relation has two attributes, Employee_Id and PAN_No, that act as candidate
keys. In this relation, Employee_Id is chosen as the primary key, so the other candidate key, PAN_No,
acts as the Alternate key.
6. Composite key
Whenever a primary key consists of more than one attribute, it is known as a composite key. This key is
also known as Concatenated Key.
For example, in employee relations, we assume that an employee may be assigned multiple roles, and
an employee may work on multiple projects simultaneously. So the primary key will be composed of all
three attributes, namely Emp_ID, Emp_role, and Proj_ID in combination. So these attributes act as a
composite key since the primary key comprises more than one attribute.
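A minimal sketch of such a composite key (the table name and column types are assumptions):
CREATE TABLE EMPLOYEE_ASSIGNMENT (
    Emp_ID   INT,
    Emp_role VARCHAR(30),
    Proj_ID  INT,
    PRIMARY KEY (Emp_ID, Emp_role, Proj_ID)   -- composite (concatenated) primary key
);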
7. Artificial key
The key created using arbitrarily assigned data are known as artificial keys. These keys are created when
a primary key is large and complex and has no relationship with many other relations. The data values of
the artificial keys are usually numbered in a serial order.
For example, the primary key, which is composed of Emp_ID, Emp_role, and Proj_ID, is large in
employee relations. So it would be better to add a new virtual attribute to identify each tuple in the
relation uniquely.
Normalization is a process of organizing the data in database to avoid data redundancy, insertion
anomaly, update anomaly & deletion anomaly. Let’s discuss about anomalies first then we will discuss
normal forms with examples.
Anomalies in DBMS
There are three types of anomalies that occur when the database is not normalized. These are – Insertion,
update and deletion anomaly. Let’s take an example to understand this.
Example: Suppose a manufacturing company stores the employee details in a table named employee that
has four attributes: emp_id for storing employee’s id, emp_name for storing employee’s name,
emp_address for storing employee’s address and emp_dept for storing the department details in which the
employee works. At some point of time the table looks like this:
The above table is not normalized. We will see the problems that we face when a table is not normalized.
Update anomaly: In the above table we have two rows for employee Rick as he belongs to two
departments of the company. If we want to update the address of Rick then we have to update the same in
two rows or the data will become inconsistent. If somehow, the correct address gets updated in one
department but not in other then as per the database, Rick would be having two different addresses, which
is not correct and would lead to inconsistent data.
Insert anomaly: Suppose a new employee joins the company, who is under training and currently not
assigned to any department then we would not be able to insert the data into the table if emp_dept field
doesn’t allow nulls.
Delete anomaly: Suppose, if at a point of time the company closes the department D890 then deleting the
rows that are having emp_dept as D890 would also delete the information of employee Maggie since she
is assigned only to this department.
To overcome these anomalies we need to normalize the data. In the next section we will discuss about
normalization.
Normalization
Here are the most commonly used normal forms:
First normal form (1NF)
Second normal form (2NF)
Third normal form (3NF)
Boyce & Codd normal form (BCNF)
First Normal Form (1NF): A table is in 1NF if each attribute of the table has only atomic (single) values.
Example: Suppose a company wants to store the names and contact details of its employees. It creates a
table that looks like this:
(employee table with columns emp_id, emp_name, emp_address and emp_mobile; Jon and Lester each have two mobile numbers stored in the emp_mobile field)
Two employees (Jon & Lester) have two mobile numbers each, so the company stored them in the same
field, as you can see in the table above.
This table is not in 1NF as the rule says "each attribute of a table must have atomic (single) values"; the
emp_mobile values for employees Jon & Lester violate that rule.
To make the table comply with 1NF, we should have the data like this:
emp_id emp_name emp_address emp_mobile
Second Normal Form (2NF): A table is in 2NF if it is in 1NF and no non-prime attribute is dependent on a proper subset of any candidate key of the table.
An attribute that is not part of any candidate key is known as a non-prime attribute.
Example: Suppose a school wants to store the data of teachers and the subjects they teach. They create a
table that looks like this (since a teacher can teach more than one subject, the table can have multiple
rows for the same teacher):
teacher_id subject teacher_age
111 Physics 38
222 Biology 38
333 Physics 40
333 Chemistry 40
The table is in 1NF because each attribute has atomic values. However, it is not in 2NF because the
non-prime attribute teacher_age is dependent on teacher_id alone, which is a proper subset of the candidate key.
This violates the rule for 2NF as the rule says "no non-prime attribute is dependent on the proper subset
of any candidate key of the table".
To make the table comply with 2NF we can break it into two tables like this:
teacher_details table:
teacher_id teacher_age
111 38
222 38
333 40
teacher_subject table:
teacher_id subject
111 Maths
111 Physics
222 Biology
333 Physics
333 Chemistry
Third Normal Form (3NF): A table is in 3NF if it is in 2NF and no non-prime attribute is transitively dependent on any super key of the table.
An attribute that is not part of any candidate key is known as a non-prime attribute.
In other words, 3NF can be explained like this: a table is in 3NF if it is in 2NF and, for each functional
dependency X -> Y, at least one of the following conditions holds:
X is a super key of the table, or
Y is a prime attribute (each element of Y is part of some candidate key).
An attribute that is a part of one of the candidate keys is known as prime attribute.
Example: Suppose a company wants to store the complete address of each employee, they create a table
named employee_details that looks like this:
emp_id emp_name emp_zip emp_state emp_city emp_district
Here, emp_state, emp_city & emp_district are dependent on emp_zip, and emp_zip is dependent on emp_id.
That makes the non-prime attributes (emp_state, emp_city & emp_district) transitively dependent on the super
key (emp_id). This violates the rule of 3NF.
To make this table comply with 3NF, we have to break it into two tables to remove the transitive
dependency:
employee table:
employee_zip table:
Boyce Codd Normal Form (BCNF): BCNF is a stricter version of 3NF. A table is in BCNF if it is in 3NF and, for every functional dependency X -> Y, X is a super key of the table.
Example: Consider an employee table with attributes emp_id, emp_nationality, emp_dept, dept_type and dept_no_of_emp, with the functional dependencies listed below. The table is not in BCNF as neither emp_id nor emp_dept alone is a key.
To make the table comply with BCNF we can break the table in three tables like this:
emp_nationality table:
emp_id emp_nationality
1001 Austrian
1002 American
emp_dept table:
emp_dept dept_type dept_no_of_emp
emp_dept_mapping table:
emp_id emp_dept
1001 Stores
Functional dependencies:
emp_id -> emp_nationality
emp_dept -> {dept_type, dept_no_of_emp}
Candidate keys:
For first table: emp_id
For second table: emp_dept
For third table: {emp_id, emp_dept}
This is now in BCNF, as in both the functional dependencies the left-hand side is a key.
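A hedged sketch of the three BCNF tables with the candidate keys listed above (column types are assumptions):
CREATE TABLE emp_nationality (
    emp_id          INT PRIMARY KEY,
    emp_nationality VARCHAR(30)
);
CREATE TABLE emp_dept (
    emp_dept       VARCHAR(50) PRIMARY KEY,
    dept_type      VARCHAR(10),
    dept_no_of_emp INT
);
CREATE TABLE emp_dept_mapping (
    emp_id   INT REFERENCES emp_nationality(emp_id),
    emp_dept VARCHAR(50) REFERENCES emp_dept(emp_dept),
    PRIMARY KEY (emp_id, emp_dept)
);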
For example: Suppose we have a student table with attributes: Stu_Id, Stu_Name, Stu_Age. Here Stu_Id
attribute uniquely identifies the Stu_Name attribute of student table because if we know the student id we
can tell the student name associated with it. This is known as functional dependency and can be written as
Stu_Id->Stu_Name or in words we can say Stu_Name is functionally dependent on Stu_Id.
Formally:
If column A of a table uniquely identifies column B of the same table, then it can be represented as A -> B
(attribute B is functionally dependent on attribute A).
DBMS Schema
Definition of schema: Design of a database is called the schema. Schema is of three types: Physical
schema, logical schema and view schema.
For example: In the following diagram, we have a schema that shows the relationship between three
tables: Course, Student and Section. The diagram only shows the design of the database, it doesn’t show
the data present in those tables. Schema is only a structural view(design) of a database as shown in the
diagram below.
The design of a database at the physical level is called the physical schema; how the data is stored in blocks of
storage is described at this level.
The design of a database at the logical level is called the logical schema; programmers and database administrators
work at this level. At this level, data can be described as certain types of data records stored in data
structures; however, internal details such as the implementation of the data structures are hidden at this level
(they are available at the physical level).
Design of database at view level is called view schema. This generally describes end user interaction with
database systems.
DBMS Instance
Definition of instance: The data stored in database at a particular moment of time is called instance of
database. Database schema defines the variable declarations in tables that belong to a particular database;
the value of these variables at a moment of time is called the instance of that database.
Unit 2
Distributed Databases,Active Database and Open Database
Connectivity
Distributed databases:
A distributed database is a type of database in which data comes from a common database as well as
information captured by local computers. In this type of database system, the data is not kept in one place;
it is distributed across various organizations.
Architectural Models
Some of the common architectural models are −
Fully Replicated
In this design alternative, at each site, one copy of all the database tables is stored. Since, each site has its own
copy of the entire database, queries are very fast requiring negligible communication cost. On the contrary, the
massive redundancy in data requires huge cost during update operations. Hence, this is suitable for systems where
a large number of queries is required to be handled whereas the number of database updates is low.
Partially Replicated
Copies of tables or portions of tables are stored at different sites. The distribution of the tables is done in
accordance with the frequency of access. This takes into consideration the fact that the frequency of accessing the
tables varies considerably from site to site. The number of copies of the tables (or portions) depends on how
frequently the access queries execute and the site which generates the access queries.
Fragmented
In this design, a table is divided into two or more pieces referred to as fragments or partitions, and each fragment
can be stored at different sites. This considers the fact that it seldom happens that all data stored in a table is
required at a given site. Moreover, fragmentation increases parallelism and provides better disaster recovery. Here,
there is only one copy of each fragment in the system, i.e. no redundant data.
The three fragmentation techniques are −
Vertical fragmentation
Horizontal fragmentation
Hybrid fragmentation
Mixed Distribution
This is a combination of fragmentation and partial replications. Here, the tables are initially fragmented in any
form (horizontal or vertical), and then these fragments are partially replicated across the different sites according
to the frequency of accessing the fragments.
A distributed database is basically a database that is not limited to one system, it is spread over different
sites, i.e, on multiple computers or over a network of computers. A distributed database system is located
on various sites that don’t share physical components. This may be required when a particular database
needs to be accessed by various users globally. It needs to be managed such that for the users it looks
like one single database.
Distributed Data Storage :
There are 2 ways in which data can be stored on different sites. These are:
1.Replication –
In this approach, the entire relation is stored redundantly at 2 or more sites. If the entire database is
available at all sites, it is a fully redundant database. Hence, in replication, systems maintain copies of
data.
This is advantageous as it increases the availability of data at different sites. Also, now query requests
can be processed in parallel.
However, it has certain disadvantages as well. Data needs to be constantly updated. Any change made
at one site needs to be recorded at every site that relation is stored or else it may lead to inconsistency.
This is a lot of overhead. Also, concurrency control becomes way more complex as concurrent access
now needs to be checked over a number of sites.
2.Fragmentation –
In this approach, the relations are fragmented (i.e., they’re divided into smaller parts) and each of the
fragments is stored in different sites where they’re required. It must be made sure that the fragments are
such that they can be used to reconstruct the original relation (i.e, there isn’t any loss of data).
Fragmentation is advantageous as it doesn't create copies of data, so consistency is not a problem.
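A hedged sketch of horizontal fragmentation (the EMPLOYEE relation and the BRANCH column are assumptions): each fragment holds the rows needed at one site, and the original relation can be rebuilt by a union of the fragments.
CREATE TABLE EMPLOYEE_DELHI  AS SELECT * FROM EMPLOYEE WHERE BRANCH = 'DELHI';
CREATE TABLE EMPLOYEE_OTHERS AS SELECT * FROM EMPLOYEE WHERE BRANCH <> 'DELHI';
-- Reconstruction of the original relation:
-- SELECT * FROM EMPLOYEE_DELHI UNION ALL SELECT * FROM EMPLOYEE_OTHERS;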
Flat Transaction
Nested Transaction
In the nested transaction strategy, concurrent execution of sub-transactions that are at the same level is allowed.
For example, T1 and T2 invoke objects on different servers and hence can run in parallel; they are therefore
concurrent.
T1.1, T1.2, T2.1, and T2.2 are four sub-transactions. These sub-transactions can also run in parallel.
Consider a distributed transaction (T) in which a customer transfers :
Rs. 105 from account A to account C and
Subsequently, Rs. 205 from account B to account D.
It can be viewed/ thought of as :
Transaction T :
Start
Transfer Rs 105 from A to C:
Deduct Rs 105 from A (withdraw from A) & add Rs 105 to C (deposit to C)
Transfer Rs 205 from B to D:
Deduct Rs 205 from B (withdraw from B) & add Rs 205 to D (deposit to D)
End
Assuming :
1. Account A is on server X
2. Account B is on server Y,and
3. Accounts C and D are on server Z.
The transaction T involves four requests – 2 for deposits and 2 for withdrawals. Now they can be treated
as sub transactions (T1, T2, T3, T4) of the transaction T.
As shown in the figure below, transaction T is designed as a set of four nested transactions : T1, T2, T3
and T4.
Advantage:
The performance is higher than a single transaction in which four operations are invoked one after the
other in sequence.
Nested Transaction
The following distributed transaction executed by SCOTT updates the local SALES database, the
remote HQ database, and the remote MAINT database:
UPDATE scott.dept@hq.us.acme.com
SET loc = 'REDWOOD SHORES'
WHERE deptno = 10;
UPDATE scott.emp
SET deptno = 11
WHERE deptno = 10;
UPDATE scott.bldg@maint.us.acme.com
SET room = 1225
WHERE room = 1163;
COMMIT;
Note:
If all statements of a transaction reference only a single remote node, then the transaction is
remote, not distributed.
The following list describes DML and DDL operations supported in a distributed transaction:
CREATE TABLE AS SELECT
DELETE
INSERT (default and direct load)
LOCK TABLE
SELECT
SELECT FOR UPDATE
You can execute DML and DDL statements in parallel, and INSERT direct load statements serially.
The following transaction control statements are also supported in a distributed transaction:
COMMIT
ROLLBACK
SAVEPOINT
Savepoint in SQL
o Savepoint is a command in SQL that is used with the rollback command.
o It is a command in Transaction Control Language that is used to mark the transaction in a table.
o Consider you are making a very long table, and you want to roll back only to a certain position in a table
then; this can be achieved using the savepoint.
o If you made a transaction in a table, you could mark the transaction as a certain name, and later on, if you
want to roll back to that point, you can do it easily by using the transaction's name.
o Savepoint is helpful when we want to roll back only a small part of a table and not the whole table. In
simple words, we can say savepoint is a bookmark in SQL.
Oracle8i defines a session tree of all nodes participating in a distributed transaction. A session tree is
a hierarchical model of the transaction that describes the relationships among the nodes that are
involved. Each node plays a role in the transaction. For example, the node that originates the
transaction is the global coordinator, and the node in charge of initiating a commit or rollback is called
the commit point site.
Two-Phase Commit Mechanism
Unlike a transaction on a local database, a distributed transaction involves altering data on multiple
databases. Consequently, distributed transaction processing is more complicated, because Oracle must
coordinate the committing or rolling back of the changes in a transaction as a self-contained unit. In
other words, either the entire transaction commits, or the entire transaction rolls back.
Oracle ensures the integrity of data in a distributed transaction using the two-phase commit
mechanism. In the prepare phase, the initiating node in the transaction asks the other participating
nodes to promise to commit or roll back the transaction. During the commit phase, the initiating node
asks all participating nodes to commit the transaction; if this outcome is not possible, then all nodes
are asked to roll back.
All nodes participating in the session tree of a distributed transaction assume one or more of the
following roles:
Clients
A node acts as a client when it references information from another node's database. The referenced
node is a database server. In Figure 4-2, the node SALES is a client of the nodes that host the
WAREHOUSE and FINANCE databases.
Database Servers
A database server is a node that hosts a database from which a client requests data.
In Figure 4-2, an application at the SALES node initiates a distributed transaction that accesses data
from the WAREHOUSE and FINANCE nodes. Therefore, SALES.ACME.COM has the role of client
node, and WAREHOUSE and FINANCE are both database servers. In this example, SALES is a
database server and a client because the application also requests a change to the SALES database.
Local Coordinators
A node that must reference data on other nodes to complete its part in the distributed transaction is
called a local coordinator. In Figure 4-2, SALES is a local coordinator because it coordinates the
nodes it directly references: WAREHOUSE and FINANCE. SALES also happens to be the global
coordinator because it coordinates all the nodes involved in the transaction.
A local coordinator is responsible for coordinating the transaction among the nodes it communicates
directly with by:
Receiving and relaying transaction status information to and from those nodes.
Passing queries to those nodes.
Receiving queries from those nodes and passing them on to other nodes.
Returning the results of queries to the nodes that initiated them.
Global Coordinator
The node where the distributed transaction originates is called the global coordinator. The database
application issuing the distributed transaction is directly connected to the node acting as the global
coordinator. For example, in Figure 4-2, the transaction issued at the node SALES references
information from the database servers WAREHOUSE and FINANCE. Therefore,
SALES.ACME.COM is the global coordinator of this distributed transaction.
The global coordinator becomes the parent or root of the session tree. The global coordinator performs
the following operations during a distributed transaction:
Sends all of the distributed transaction's SQL statements, remote procedure calls, etc. to the
directly referenced nodes, thus forming the session tree.
Instructs all directly referenced nodes other than the commit point site to prepare the
transaction.
Instructs the commit point site to initiate the global commit of the transaction if all nodes
prepare successfully.
Instructs all nodes to initiate a global rollback of the transaction if there is an abort response.
Commit Point Site
The job of the commit point site is to initiate a commit or roll back operation as instructed by the
global coordinator. The system administrator always designates one node to be the commit point
site in the session tree by assigning all nodes a commit point strength. The node selected as commit
point site should be the node that stores the most critical data.
Figure 4-3 illustrates an example of distributed system, with SALES serving as the commit point site:
The commit point site is distinct from all other nodes involved in a distributed transaction in these
ways:
The commit point site never enters the prepared state. Consequently, if the commit point site
stores the most critical data, this data never remains in-doubt, even if a failure occurs. In failure
situations, failed nodes remain in a prepared state, holding necessary locks on data until in-
doubt transactions are resolved.
The commit point site commits before the other nodes involved in the transaction. In effect, the
outcome of a distributed transaction at the commit point site determines whether the transaction
at all nodes is committed or rolled back: the other nodes follow the lead of the commit point
site. The global coordinator ensures that all nodes complete the transaction in the same manner
as the commit point site.
A distributed transaction is considered committed after all non-commit point sites are prepared, and
the transaction has been actually committed at the commit point site. The online redo log at the commit
point site is updated as soon as the distributed transaction is committed at this node.
Because the commit point log contains a record of the commit, the transaction is considered committed
even though some participating nodes may still be only in the prepared state and the transaction not
yet actually committed at these nodes. In the same way, a distributed transaction is
considered not committed if the commit has not been logged at the commit point site.
Every database server must be assigned a commit point strength. If a database server is referenced in
a distributed transaction, the value of its commit point strength determines which role it plays in the
two-phase commit. Specifically, the commit point strength determines whether a given node is the
commit point site in the distributed transaction and thus commits before all of the other nodes. This
value is specified using the initialization parameter COMMIT_POINT_STRENGTH.
The commit point site, which is determined at the beginning of the prepare phase, is selected only
from the nodes participating in the transaction. The following sequence of events occurs:
1. Of the nodes directly referenced by the global coordinator, Oracle selects the node with the
highest commit point strength as the commit point site.
2. The initially-selected node determines if any of the nodes from which it has to obtain
information for this transaction has a higher commit point strength.
3. Either the node with the highest commit point strength directly referenced in the transaction or
one of its servers with a higher commit point strength becomes the commit point site.
4. After the final commit point site has been determined, the global coordinator sends prepare
responses to all nodes participating in the transaction.
Figure 4-4 shows a sample session tree with the commit point strengths of each node (in parentheses)
and the node chosen as the commit point site:
Figure 4-4 Commit Point Strengths and Determination of the Commit Point Site
The following conditions apply when determining the commit point site:
The commit mechanism has the following distinct phases, which Oracle performs automatically
whenever a user commits a distributed transaction:
Prepare phase − The initiating node, called the global coordinator, asks participating nodes other than the commit point site to promise to commit or roll back the transaction, even if there is a failure. If any node cannot prepare, the transaction is rolled back.
Commit phase − If all participants respond to the coordinator that they are prepared, then the coordinator asks the commit point site to commit. After it commits, the coordinator asks all other nodes to commit the transaction.
Forget phase − The global coordinator forgets about the transaction.
Prepare Phase
Commit Phase
Forget Phase
Prepare Phase
The first phase in committing a distributed transaction is the prepare phase. In this phase, Oracle does
not actually commit or roll back the transaction. Instead, all nodes referenced in a distributed
transaction (except the commit point site, described in the "Commit Point Site") are told to prepare to
commit. By preparing, a node:
Records information in the online redo logs so that it can subsequently either commit or roll
back the transaction, regardless of intervening failures.
Places a distributed lock on modified tables, which prevents reads.
When a node responds to the global coordinator that it is prepared to commit, the prepared
node promises to either commit or roll back the transaction later--but does not make a unilateral
decision on whether to commit or roll back the transaction. The promise means that if an instance
failure occurs at this point, the node can use the redo records in the online log to recover the database
back to the prepare phase.
Note:
Queries that start after a node has prepared cannot access the associated locked data until
all phases complete.
When told to prepare, a node can respond in one of the following ways:
Prepared − Data on the node has been modified by a statement in the distributed transaction, and the node has successfully prepared.
Read-only − No data on the node has been, or can be, modified (only queried), so no preparation is necessary.
Abort − The node cannot successfully prepare.
Prepared Response
When a node has successfully prepared, it issues a prepared message. The message indicates that the
node has records of the changes in the online log, so it is prepared either to commit or perform a
rollback. The message also guarantees that locks held for the transaction can survive a failure.
Read-Only Response
When a node is asked to prepare, and the SQL statements affecting the database do not change the
node's data, the node responds with a read-only message. The message indicates that the node will not
participate in the commit phase.
There are three cases in which all or part of a distributed transaction is read-only:

Partially read-only − Conditions: any of the following occurs: only queries are issued at one or more nodes, or no data is changed. Consequence: the read-only nodes recognize their status when asked to prepare. They give their local coordinators a read-only response. Thus, the commit phase completes faster because Oracle eliminates read-only nodes from subsequent processing.

Completely read-only with prepare phase − Conditions: all of the following occur: no data changes, and the transaction is not started with a SET TRANSACTION READ ONLY statement. Consequence: all nodes recognize that they are read-only during the prepare phase, so no commit phase is required. The global coordinator, not knowing whether all nodes are read-only, must still perform the prepare phase.

Completely read-only without two-phase commit − Conditions: all of the following occur: no data changes, and the transaction is started with a SET TRANSACTION READ ONLY statement. Consequence: only queries are allowed in the transaction, so the global coordinator does not have to perform two-phase commit. Changes by other transactions do not degrade global transaction-level read consistency because of global SCN coordination among nodes. The transaction does not use rollback segments.
Note that if a distributed transaction is set to read-only, then it does not use rollback segments. If many
users connect to the database and their transactions are not set to READ ONLY, then they allocate
rollback space even if they are only performing queries.
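For illustration, a minimal sketch of the third case above (the table names and the warehouse.acme.com database link are assumptions used only for this example):

SET TRANSACTION READ ONLY;
-- Only queries are allowed in this transaction; no two-phase commit and no rollback
-- (undo) space is needed, even though a remote node is referenced.
SELECT o.order_total, i.quantity_on_hand
  FROM orders o, inventory@warehouse.acme.com i
 WHERE o.part_id = i.part_id;
COMMIT;  -- ends the read-only transaction; no data was changed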
Abort Response
When a node cannot successfully prepare, it performs the following actions:
1. Releases resources currently held by the transaction and rolls back the local portion of the
transaction.
2. Responds to the node that referenced it in the distributed transaction with an abort message.
These actions then propagate to the other nodes involved in the distributed transaction so that they can
roll back the transaction and guarantee the integrity of the data in the global database. This response
enforces the primary rule of a distributed transaction: all nodes involved in the transaction either all
commit or all roll back the transaction at the same logical time.
To complete the prepare phase, each node excluding the commit point site performs the following
steps:
1. The node requests that its descendants, that is, the nodes subsequently referenced, prepare to
commit.
2. The node checks to see whether the transaction changes data on itself or its descendants. If there
is no change to the data, then the node skips the remaining steps and returns a read-only
response
3. The node allocates the resources it needs to commit the transaction if data is changed.
4. The node saves redo records corresponding to changes made by the transaction to its online
redo log.
5. The node guarantees that locks held for the transaction are able to survive a failure.
6. The node responds to the initiating node with a prepared response or, if its attempt or the attempt
of one of its descendants to prepare was unsuccessful, with an abort response.
These actions guarantee that the node can subsequently commit or roll back the transaction on the
node. The prepared nodes then wait until a COMMIT or ROLLBACK request is received from the
global coordinator.
After the nodes are prepared, the distributed transaction is said to be in-doubt . It retains in-doubt
status until all changes are either committed or rolled back.
Commit Phase
The second phase in committing a distributed transaction is the commit phase. Before this phase
occurs, all nodes other than the commit point site referenced in the distributed transaction have
guaranteed that they are prepared, that is, they have the necessary resources to commit the transaction.
The commit phase consists of the following steps:
1. The global coordinator instructs the commit point site to commit.
2. The commit point site commits.
3. The commit point site informs the global coordinator that it has committed.
4. The global and local coordinators send a message to all nodes instructing them to commit the
transaction.
5. At each node, Oracle8i commits the local portion of the distributed transaction and releases
locks.
6. At each node, Oracle8i records an additional redo entry in the local redo log, indicating that the
transaction has committed.
7. The participating nodes notify the global coordinator that they have committed.
When the commit phase is complete, the data on all nodes of the distributed system is consistent with
one another.
Guaranteeing Global Database Consistency
Each committed transaction has an associated system change number (SCN) to uniquely identify the
changes made by the SQL statements within that transaction. The SCN functions as an internal Oracle
timestamp that uniquely identifies a committed version of the database.
In a distributed system, the SCNs of communicating nodes are coordinated when all of the following
actions occur:
A connection occurs using the path described by one or more database links.
A distributed SQL statement executes.
A distributed transaction commits.
Among other benefits, the coordination of SCNs among the nodes of a distributed system ensures
global read-consistency at both the statement and transaction level. If necessary, global time-based
recovery can also be completed.
During the prepare phase, Oracle8i determines the highest SCN at all nodes involved in the
transaction. The transaction then commits with the high SCN at the commit point site. The commit
SCN is then sent to all prepared nodes with the commit decision.
Forget Phase
After the participating nodes notify the commit point site that they have committed, the commit point
site can forget about the transaction. The following steps occur:
1. After receiving notice from the global coordinator that all nodes have committed, the commit
point site erases status information about this transaction.
2. The commit point site informs the global coordinator that it has erased the status information.
3. The global coordinator erases its own information about the transaction.
In-Doubt Transactions
The two-phase commit mechanism ensures that all nodes either commit or perform a rollback together.
What happens if any of the three phases fails because of a system or network error? The transaction
becomes in-doubt.
The RECO (recovery) process automatically resolves in-doubt transactions when the machine,
network, or software problem is resolved. Until RECO can resolve the transaction, the data is locked
for both reads and writes. Oracle blocks reads because it cannot determine which version of the data
to display for a query.
In the majority of cases, Oracle resolves the in-doubt transaction automatically. Assume that there are
two nodes, LOCAL and REMOTE, in the following scenarios. The local node is the commit point
site. User SCOTT connects to LOCAL and executes and commits a distributed transaction that updates
LOCAL and REMOTE.
Figure 4-5 illustrates the sequence of events when there is a failure during the prepare phase of a
distributed transaction:
2. The global coordinator, which in this example is also the commit point site, requests all
databases other than the commit point site to promise to commit or roll back when told to do
so.
3. The REMOTE database crashes before issuing the prepare response back to LOCAL.
4. The transaction is ultimately rolled back on each database by the RECO process when the
remote site is restored.
Figure 4-6 illustrates the sequence of events when there is a failure during the commit phase of a
distributed transaction:
Figure 4-6 Failure During Commit Phase
2. The global coordinator, which in this case is also the commit point site, requests all databases
other than the commit point site to promise to commit or roll back when told to do so.
3. The commit point site receives a prepare message from REMOTE saying that it will commit.
4. The commit point site commits the transaction locally, then sends a commit message to
REMOTE asking it to commit.
5. The REMOTE database receives the commit message, but cannot respond because of a network
failure.
6. The transaction is ultimately committed on the remote database by the RECO process after the
network is restored.
You should only need to resolve an in-doubt transaction in the following cases:
Resolution of in-doubt transactions can be complicated. The procedure requires that you do the
following:
A system change number (SCN) is an internal timestamp for a committed version of the database. The
Oracle database server uses the SCN clock value to guarantee transaction consistency. For example,
when a user commits a transaction, Oracle records an SCN for this commit in the online redo log.
Oracle uses SCNs to coordinate distributed transactions among different databases. For example,
Oracle uses SCNs in the following way:
1. An application establishes a connection using a database link.
2. The distributed transaction commits with the highest global SCN among all the databases
involved.
3. The commit global SCN is sent to all databases involved in the transaction.
SCNs are important for distributed transactions because they function as a synchronized commit
timestamp of a transaction--even if the transaction fails. If a transaction becomes in-doubt, an
administrator can use this SCN to coordinate changes made to the global database. The global SCN
for the transaction commit can also be used to identify the transaction later, for example, in distributed
recovery.
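As an aside, the current value of the SCN clock can be observed with a simple query; a sketch only, assuming access to the V$DATABASE view (or to the DBMS_FLASHBACK package) on a reasonably recent Oracle release:

-- Current SCN from the dynamic performance view:
SELECT current_scn FROM v$database;

-- The same value through the DBMS_FLASHBACK package:
SELECT dbms_flashback.get_system_change_number FROM dual;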
At the Sales department, a salesperson uses SQL*Plus to enter a sales order and then commit it. The
application issues a number of SQL statements to enter the order into the SALES database and update
the inventory in the WAREHOUSE database:
CONNECT scott/tiger@sales.acme.com ...;
INSERT INTO orders ...;
UPDATE inventory@warehouse.acme.com ...;
INSERT INTO orders ...;
UPDATE inventory@warehouse.acme.com ...;
COMMIT;
These SQL statements are part of a single distributed transaction, guaranteeing that all issued SQL
statements succeed or fail as a unit. Treating the statements as a unit prevents the possibility of an
order being placed and then inventory not being updated to reflect the order. In effect, the transaction
guarantees the consistency of data in the global database.
As each of the SQL statements in the transaction executes, the session tree is defined, as shown
in Figure 4-7.
An order entry application running with the SALES database initiates the transaction.
Therefore, SALES.ACME.COM is the global coordinator for the distributed transaction.
The order entry application inserts a new sales record into the SALES database and updates the
inventory at the warehouse. Therefore, the nodes SALES.ACME.COM and
WAREHOUSE.ACME.COM are both database servers.
Because SALES.ACME.COM updates the inventory, it is a client of
WAREHOUSE.ACME.COM.
This stage completes the definition of the session tree for this distributed transaction. Each node in the
tree has acquired the necessary data locks to execute the SQL statements that reference local data.
These locks remain even after the SQL statements have been executed until the two-phase commit is
completed.
Oracle determines the commit point site immediately following the COMMIT statement.
SALES.ACME.COM, the global coordinator, is determined to be the commit point site, as shown
in Figure 4-8.
Figure 4-8 Determining the Commit Point Site
1. After Oracle determines the commit point site, the global coordinator sends the prepare message
to all directly referenced nodes of the session tree, excluding the commit point site. In this
example, WAREHOUSE.ACME.COM is the only node asked to prepare.
2. WAREHOUSE.ACME.COM tries to prepare. If a node can guarantee that it can commit the
locally dependent part of the transaction and can record the commit information in its local redo
log, then the node can successfully prepare. In this example, only WAREHOUSE.ACME.COM
receives a prepare message because SALES.ACME.COM is the commit point site.
As each node prepares, it sends a message back to the node that asked it to prepare. Depending on the
responses, one of the following can happen:
If any of the nodes asked to prepare respond with an abort message to the global coordinator,
then the global coordinator tells all nodes to roll back the transaction, and the operation is
completed.
If all nodes asked to prepare respond with a prepared or a read-only message to the global
coordinator, that is, they have successfully prepared, then the global coordinator asks the
commit point site to commit the transaction.
The committing of the transaction by the commit point site involves the following steps:
1. The global coordinator instructs the commit point site to commit.
2. The commit point site now commits the transaction locally and records this fact in its local redo
log.
Even if WAREHOUSE.ACME.COM has not yet committed, the outcome of this transaction is pre-
determined. In other words, the transaction will be committed at all nodes even if a given node's ability
to commit is delayed.
1. The commit point site tells the global coordinator that the transaction has committed. Because
the commit point site and global coordinator are the same node in this example, no operation is
required. The commit point site knows that the transaction is committed because it recorded
this fact in its online log.
2. The global coordinator confirms that the transaction has been committed on all other nodes
involved in the distributed transaction.
The committing of the transaction by all the nodes in the transaction involves the following steps:
1. After the global coordinator has been informed of the commit at the commit point site, it tells
all other directly referenced nodes to commit.
2. In turn, any local coordinators instruct their servers to commit, and so on.
3. Each node, including the global coordinator, commits the transaction and records appropriate
redo log entries locally. As each node commits, the resource locks that were being held locally
for that transaction are released.
In Figure 4-10, SALES.ACME.COM, which is both the commit point site and the global coordinator,
has already committed the transaction locally. SALES now instructs WAREHOUSE.ACME.COM to
commit the transaction.
Figure 4-10 Instructing Nodes to Commit
Stage 7: Global Coordinator and Commit Point Site Complete the Commit
The completion of the commit of the transaction occurs in the following steps:
1. After all referenced nodes and the global coordinator have committed the transaction, the global
coordinator informs the commit point site of this fact.
2. The commit point site, which has been waiting for this message, erases the status information
about this distributed transaction.
3. The commit point site informs the global coordinator that it is finished. In other words, the
commit point site forgets about committing the distributed transaction. This action is
permissible because all nodes involved in the two-phase commit have committed the
transaction successfully, so they will never have to determine its status in the future.
4. The global coordinator finalizes the transaction by forgetting about the transaction itself.
After the completion of the COMMIT phase, the distributed transaction is itself complete. The
steps described above are accomplished automatically and in a fraction of a second.
Distributed Query Processing
1. Cost of data transfer :
Consider the relation EMPLOYEE stored at Site 1 and the relation DEPARTMENT stored at Site 2:
Site 1: EMPLOYEE
EID NAME DID SALARY
EID- 10 bytes
SALARY- 20 bytes
DID- 10 bytes
Name- 20 bytes
Total records- 1000
Record Size- 60 bytes
Site 2: DEPARTMENT
DID DNAME
DID- 10 bytes
DName- 20 bytes
Total records- 50
Record Size- 30 bytes
Example : Find the names of employees and their department names. Also, find the amount of data
transferred to execute this query when the query is submitted to Site 3.
Answer : The query is submitted at Site 3, and neither of the two relations, EMPLOYEE and
DEPARTMENT, is available at Site 3. So, to execute this query, we have three strategies:
1. Transfer both tables, EMPLOYEE and DEPARTMENT, to SITE 3 and then join them there. The total cost is 1000 * 60 + 50 * 30 = 60,000 + 1,500 = 61,500 bytes.
2. Transfer the EMPLOYEE table to SITE 2, join it at SITE 2, and then transfer the result to SITE 3. The total cost is 60 * 1000 + 60 * 1000 = 120,000 bytes, since we have to transfer the EMPLOYEE table (1,000 records of 60 bytes each) to SITE 2 and then 1,000 result tuples (taken as 60 bytes each) from SITE 2 to SITE 3.
3. Transfer the DEPARTMENT table to SITE 1, join it at SITE 1, and then transfer the result to SITE 3. The total cost is 30 * 50 + 60 * 1000 = 61,500 bytes, since we have to transfer 1,000 result tuples having NAME and DNAME (taken as 60 bytes each) from SITE 1 to SITE 3.
Now, if the optimisation criterion is to minimise the amount of data transferred, we can choose either
strategy 1 or strategy 3 from the above.
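For reference, the query being costed above is simply the join of the two relations on DID. A sketch in SQL (the database link names site1 and site2 are assumptions used only for illustration):

SELECT e.Name, d.DName
  FROM EMPLOYEE@site1 e
  JOIN DEPARTMENT@site2 d ON e.DID = d.DID;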
2. Using Semi join in Distributed Query processing :
The semi-join operation is used in distributed query processing to reduce the number of tuples in a table
before transmitting it to another site. This reduction in the number of tuples reduces the number and
total size of transmissions, which ultimately reduces the total cost of data transfer. Let us say that we
have two tables R1 and R2 on sites S1 and S2. We forward the joining column of one table, say R1, to
the site where the other table, say R2, is located, and join it with R2 at that site. The decision whether
to reduce R1 or R2 can only be made after comparing the advantages of reducing R1 with those of
reducing R2. Thus, semi-join is a well-organized solution for reducing the transfer of data in
distributed query processing.
Example : Find the amount of data transferred to execute the same query given in the above
example using semi-join operation.
Answer : The following strategy can be used to execute the query.
1. Project the required attributes of the EMPLOYEE table at site 1 and then transfer them to site 3. For this, we transfer NAME and DID of EMPLOYEE, so the size is (20 + 10) * 1000 = 30,000 bytes.
2. Transfer the DEPARTMENT table to site 3 and join the projected attributes of EMPLOYEE with this table. The size of the DEPARTMENT table is 30 * 50 = 1,500 bytes.
Applying the above scheme, the amount of data transferred to execute the query is 30,000 + 1,500
= 31,500 bytes, which is less than the 61,500 bytes required by the best strategy without the semi-join.
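A rough sketch of this reduction strategy in SQL, executed at Site 3 (the database link names site1 and site2 and the working table emp_proj are assumptions for illustration only):

-- Step 1: bring only the projected columns of EMPLOYEE (NAME, DID) from Site 1.
CREATE TABLE emp_proj AS
  SELECT Name, DID FROM EMPLOYEE@site1;

-- Step 2: bring DEPARTMENT from Site 2 and join it with the reduced projection.
SELECT e.Name, d.DName
  FROM emp_proj e
  JOIN DEPARTMENT@site2 d ON e.DID = d.DID;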
The commonly used commit protocols in a distributed DBMS are −
One-phase commit
Two-phase commit
Three-phase commit
o The slaves apply the transaction and send a “Commit ACK” message to the controlling
site.
o When the controlling site receives “Commit ACK” message from all the slaves, it considers
the transaction as committed.
After the controlling site has received the first “Not Ready” message from any slave −
o The controlling site sends a “Global Abort” message to the slaves.
o The slaves abort the transaction and send an “Abort ACK” message to the controlling site.
o When the controlling site receives “Abort ACK” message from all the slaves, it considers
the transaction as aborted.
Concurrency controlling techniques ensure that multiple transactions are executed simultaneously
while maintaining the ACID properties of the transactions and serializability in the schedules.
In this chapter, we will study the various approaches for concurrency control.
Locking Based Concurrency Control Protocols
Locking-based concurrency control protocols use the concept of locking data items. A lock is a variable
associated with a data item that determines whether read/write operations can be performed on that
data item. Generally, a lock compatibility matrix is used which states whether a data item can be locked
by two transactions at the same time.
Locking-based concurrency control systems can use either one-phase or two-phase locking protocols.
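For illustration, the two basic lock modes can also be requested explicitly in SQL; a minimal sketch (the accounts table and column names are assumptions):

-- Shared (read) lock: other transactions can still read, but not modify, the table.
LOCK TABLE accounts IN SHARE MODE;

-- Exclusive (write) lock: other transactions cannot lock or modify the table until commit/rollback.
LOCK TABLE accounts IN EXCLUSIVE MODE;

-- Row-level locking: lock only the rows being read, prior to updating them.
SELECT balance FROM accounts WHERE acc_no = 1001 FOR UPDATE;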
Conflict Graphs
Another method is to create conflict graphs. For this, transaction classes are defined. A transaction class
contains two sets of data items called the read set and the write set. A transaction belongs to a particular class
if the transaction's read set is a subset of the class's read set and the transaction's write set is a subset
of the class's write set. In the read phase, each transaction issues its read requests for the data items in
its read set. In the write phase, each transaction issues its write requests.
A conflict graph is created for the classes to which active transactions belong. This contains a set of
vertical, horizontal, and diagonal edges. A vertical edge connects two nodes within a class and denotes
conflicts within the class. A horizontal edge connects two nodes across two classes and denotes a
write-write conflict among different classes. A diagonal edge connects two nodes across two classes
and denotes a write-read or a read-write conflict among two classes.
The conflict graphs are analyzed to ascertain whether two transactions within the same class or across
two different classes can be run in parallel.
Checkpointing
A checkpoint is a point in time at which a record is written onto the database from the buffers. As a
consequence, in case of a system crash, the recovery manager does not have to redo the transactions
that were committed before the checkpoint. Periodic checkpointing shortens the recovery process.
The two types of checkpointing techniques are −
Consistent checkpointing
Fuzzy checkpointing
Consistent Checkpointing
Consistent checkpointing creates a consistent image of the database at checkpoint. During recovery,
only those transactions which are on the right side of the last checkpoint are undone or redone. The
transactions to the left side of the last consistent checkpoint are already committed and needn’t be
processed again. The actions taken for checkpointing are −
Example of Checkpointing
Let us consider a system in which the time of checkpointing is tcheck and the time of the system crash is tfail.
Let there be four transactions Ta, Tb, Tc and Td such that −
Ta commits before checkpoint.
Tb starts before checkpoint and commits before system crash.
Tc starts after checkpoint and commits before system crash.
Td starts after checkpoint and was active at the time of system crash.
The situation is depicted in the following diagram − During recovery, no action is needed for Ta since it committed before the checkpoint; Tb and Tc are redone since they committed after the checkpoint but before the crash; and Td is undone since it was still active at the time of the crash.
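In Oracle-style systems, a checkpoint can also be forced manually; a minimal sketch (assuming ALTER SYSTEM privileges):

-- Force a complete checkpoint: dirty buffers are written out and the checkpoint is recorded.
ALTER SYSTEM CHECKPOINT;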
Events with significance to the system are identified within an event-driven program. An event could
be some user action, a transmission of sensor data or a message from some other program or system,
among an almost infinite number of other possibilities. The ECA rule specifies how events drive the
desired program responses. When an event with significance for the system occurs, the conditions are
checked for or evaluated; if the conditions exist or meet pre-established criteria, the appropriate action
is executed.
ECA rules originated in active databases and have since been used in areas
including personalization, big data management and business process automation. The model is being
explored for M2M (machine-to-machine) networking, Internet of Things (IoT), cognitive
computing and the Semantic Web.
E.g.: an ATM process
Active Databases
An active database is a database consisting of a set of triggers. These databases are very difficult to
maintain because of the complexity that arises in understanding the effect of these triggers. In such a
database, before executing a statement that modifies the database, the DBMS first verifies whether any
trigger specified for that statement is activated.
If the trigger is active, then the DBMS evaluates its condition part and executes the action part only if
the specified condition evaluates to true. It is possible for more than one trigger to be activated by a single
statement.
In such a situation, the DBMS processes the triggers in an arbitrary order. The execution of the action part of a
trigger may activate other triggers, or even the same trigger that initiated the action. A trigger that
activates itself is called a 'recursive trigger'. The DBMS executes such chains of triggers in
some pre-defined manner, but this makes the overall behaviour of the rules harder to understand.
A trigger is a procedure which is automatically invoked by the DBMS in response to changes to the
database, and is specified by the database administrator (DBA). A database with a set of associated
triggers is generally called an active database.
Parts of trigger
A trigger's description contains three parts, which are as follows −
Event − A change to the database that activates the trigger.
Condition − A query that is run when the trigger is activated.
Action − A procedure that is executed when the trigger is activated and its condition is true.
Use of trigger
Triggers may be used for any of the following reasons −
To implement any complex business rule that cannot be implemented using integrity constraints.
To audit changes. For example, to keep track of changes made to a table.
To perform an automatic action when another concerned action takes place.
Types of triggers
The different types of triggers are explained below −
Statement-level trigger − It is fired only once per DML statement, irrespective of the number of rows
affected by the statement. Statement-level triggers are the default type of trigger.
Before-triggers − At the time of defining a trigger, we can specify whether the trigger is to be fired
before a command like INSERT, DELETE, or UPDATE is executed or after the command is
executed. Before-triggers are typically used to check the validity of data before the action is
performed. For instance, we can use a before-trigger to prevent deletion of rows if deletion should
not be allowed in a given case.
After-triggers − It is fired after the triggering action is completed. For example, if the trigger is
associated with the INSERT command, then it is fired after the row is inserted into the table.
Row-level triggers − It is fired for each row that is affected by the DML command. For example, if
an UPDATE command updates 150 rows, then a row-level trigger is fired 150 times, whereas a
statement-level trigger is fired only once.
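As an illustration (Oracle-style PL/SQL syntax; the emp and emp_audit tables are assumptions, not objects defined in this text), a row-level after-trigger that audits salary changes could look roughly like this:

CREATE OR REPLACE TRIGGER emp_salary_audit
AFTER UPDATE OF salary ON emp              -- event: an UPDATE of the salary column
FOR EACH ROW                               -- row-level: fires once per affected row
WHEN (NEW.salary <> OLD.salary)            -- condition
BEGIN
  -- action: record the change in an audit table
  INSERT INTO emp_audit (empno, old_salary, new_salary, changed_on)
  VALUES (:OLD.empno, :OLD.salary, :NEW.salary, SYSDATE);
END;
/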
The second issue concerns whether the triggered action should be executed before, after, instead of, or concurrently
with the triggering event. A before trigger executes the trigger before executing the event that caused the trigger.
It can be used in applications such as checking for constraint violations. An after trigger executes the trigger after
executing the event, and it can be used in applications such as maintaining derived data and monitoring for specific
events and conditions. An instead of trigger executes the trigger instead of executing the event, and it can be used
in applications such as executing corresponding updates on base relations in response to an event that is an update
of a view.
A related issue is whether the action being executed should be considered as a separate transaction or whether it
should be part of the same transaction that triggered the rule. We will try to categorize the various options. It is
important to note that not all options may be available for a particular active database system. In fact, most
commercial systems are limited to one or two of the options that we will now discuss.
Let us assume that the triggering event occurs as part of a transaction execution. We should first consider the
various options for how the triggering event is related to the evaluation of the rule's condition. The rule condition
evaluation is also known as rule consideration, since the action is to be executed only after considering whether
the condition evaluates to true or false. There are three main possibilities for rule consideration:
1. Immediate consideration. The condition is evaluated as part of the same transaction as the triggering event, and
is evaluated immediately. This case can be further categorized into evaluating the condition before, after, or
instead of executing the triggering event.
2. Deferred consideration. The condition is evaluated at the end of the transaction that included the triggering
event. In this case, there could be many triggered rules waiting to have their conditions evaluated.
3. Detached consideration. The condition is evaluated as a separate transaction, spawned from the triggering
transaction.
The next set of options concerns the relationship between evaluating the rule condition and executing the rule
action. Here, again, three options are possible: immediate, deferred, or detached execution. Most active systems
use the first option. That is, as soon as the condition is evaluated, if it returns true, the action is immediately
executed.
The Oracle system uses the immediate consideration model, but it allows the user to specify for each rule whether
the before or after option is to be used with immediate condition evaluation. It also uses the immediate execution
model. The STARBURST system uses the deferred consideration option, meaning that all rules triggered by a
transaction wait until the triggering transaction reaches its end and issues its COMMIT WORK command before
the rule conditions are evaluated.
Another issue concerning active database rules is the distinction between row-level rules and statement-level rules.
Because SQL update statements (which act as triggering events) can specify a set of tuples, one has to distinguish
between whether the rule should be considered once for the whole statement or whether it should be considered
separately for each row (that is, tuple) affected by the statement. The SQL-99 standard and the Oracle system
allow the user to choose which of the options is to be used for each rule, whereas STARBURST uses statement-level
semantics only.
One of the difficulties that may have limited the widespread use of active rules, in spite of their potential to simplify
database and software development, is that there are no easy-to-use techniques for designing, writing, and verifying
rules. For example, it is quite difficult to verify that a set of rules is consistent, meaning that two or more rules in
the set do not contradict one another. It is also difficult to guarantee termination of a set of rules under all
circumstances.
To illustrate the termination problem briefly, consider the rules in Figure 26.4. Here, rule R1 is triggered by an
INSERT event on TABLE1 and its action includes an update event on Attribute1 of TABLE2. However, rule R2's
triggering event is an UPDATE event on Attribute1 of TABLE2, and its action includes an INSERT event on
TABLE1. In this example, it is easy to see that these two rules can trigger one another indefinitely, leading to
nontermination. However, if dozens of rules are written, it is very difficult to determine whether termination is
guaranteed or not.
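A rough sketch of what the two rules in Figure 26.4 might look like as Oracle-style triggers (TABLE1, TABLE2, and Attribute1 are the hypothetical names used in the figure; in practice the DBMS would abort such a cycle once a maximum trigger depth is reached):

CREATE OR REPLACE TRIGGER R1
AFTER INSERT ON TABLE1
FOR EACH ROW
BEGIN
  UPDATE TABLE2 SET Attribute1 = Attribute1 + 1;           -- this UPDATE activates R2
END;
/

CREATE OR REPLACE TRIGGER R2
AFTER UPDATE OF Attribute1 ON TABLE2
FOR EACH ROW
BEGIN
  INSERT INTO TABLE1 (Attribute1) VALUES (:NEW.Attribute1); -- this INSERT activates R1 again
END;
/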
If active rules are to reach their potential, it is necessary to develop tools for the design, debugging, and monitoring
of active rules that can help users design and debug their rules.
What is ODBC?
Open Database Connectivity (ODBC) is an open standard Application Programming Interface (API) for accessing
a database. In 1992, Microsoft partnered with Simba to build the world's first ODBC driver, SIMBA.DLL, and
standards-based data access was born. By using ODBC statements in a program, you can access files in a number
of different common databases. In addition to the ODBC software, a separate module or driver is needed for each
database to be accessed.
ODBC History
Microsoft introduced the ODBC standard in 1992. ODBC was a standard designed to unify access to SQL
databases. Following the success of ODBC, Microsoft introduced OLE DB (Object Linking and Embedding
Database), which was to be a broader data access standard. OLE DB was a data access standard that went beyond
just SQL databases and extended to any data source that could deliver data in tabular format. Microsoft's plan was
that OLE DB would supplant ODBC as the most common data access standard. More recently, Microsoft introduced
the ADO (ActiveX Data Objects) data access standard. ADO was supposed to go further than OLE DB, in that ADO
was more object oriented. However, even with Microsoft's very significant attempts to replace the ODBC standard
with what were felt to be “better” alternatives, ODBC has continued to be the de facto data access standard for SQL
data sources. In fact, today the ODBC standard is more common than OLE DB and ADO because ODBC is widely
supported (including support from Oracle and IBM) and is a cross-platform data access standard. Today, the most
common data access standards for SQL data sources continue to be ODBC and JDBC, and it is very likely that
standards like OLE DB and ADO will fade away over time.
ODBC Overview
ODBC has become the de-facto standard for standards-based data access in both relational and non-relational
database management systems (DBMS). Simba worked closely with Microsoft to co-develop the ODBC standard
back in the early 90’s. The ODBC standard enables maximum interoperability thereby enabling application
developers to write a single application to access data sources from different vendors. ODBC is based on the Call-
Level Interface (CLI) specifications from Open Group and ISO/IEC (International Organization for
Standardization/International Electrotechnical Commission)for database APIs and uses Structured Query
Language (SQL) as its database access language.
ODBC Architecture
Application
This is any ODBC-compliant application, such as Microsoft Excel, Tableau, Crystal Reports, Microsoft Power BI,
or a similar application (spreadsheet, word processor, data access and retrieval tool, etc.). The ODBC-enabled
application performs processing by passing SQL statements to, and receiving results from, the ODBC Driver
Manager.
ODBC Driver
The ODBC driver processes ODBC function calls, submits SQL requests to a specific data source and returns
results to the application. The ODBC driver may also modify an application’s request so that the request conforms
to syntax supported by the associated database. A framework to easily build ODBC drivers is available from
Simba Technologies, as are ODBC drivers for many data sources, such as Salesforce, MongoDB, Spark and more.
The Simba SDK is available in C++, Java and C# and supports building drivers for Windows, OSX and many *nix
distributions.
Data Source
A data source is simply the source of the data. It can be a file, a particular database on a DBMS, or even a live data
feed. The data might be located on the same computer as the program, or on another computer somewhere on a
network.
Unit – 3 XML Databases
Structured, Semi structured, and Unstructured Data – XML Hierarchical Data Model – XML
Documents – Document Type Definition – XML Schema – XML Documents and Databases – XML
Querying – XPath – XQuery
In this blog, we are going to cover Data, types of Data, and Structured Vs Unstructured Data, and
suitable Datastores.
What Is Data?
Data is a set of facts such as descriptions, observations, and numbers used in decision making.
We can classify data as structured, unstructured, or semi-structured data.
1) Structured Data
Structured data is generally tabular data that is represented by columns and rows in a database.
Databases that hold tables in this form are called relational databases.
The mathematical term “relation” refers to a set of data organized as a table.
In structured data, every row in a table has the same set of columns.
SQL (Structured Query Language) is the programming language used for structured data.
2) Semi-structured Data
Semi-structured data is information that does not reside in a relational database but still has some
structure to it.
Semi-structured data includes documents held in JavaScript Object Notation (JSON) format. It
also includes key-value stores and graph databases.
3) Unstructured Data
Unstructured data is information that is not organized in a pre-defined manner or does not
have a pre-defined data model.
Unstructured information is typically text-heavy, but may also contain data such as numbers, dates, and
facts.
Videos, audio, and binary data files might not have a specific structure. They are classified
as unstructured data.
Relational databases provide undoubtedly the most well-understood model for holding data.
The simple structure of columns and tables makes them very easy to use initially, but the
inflexible structure can cause some problems.
We can communicate with relational databases using Structured Query Language (SQL).
SQL allows the joining of tables using a few lines of code, with a structure most beginners
can learn very fast.
Examples of relational databases:
o MySQL
o PostgreSQL
o Db2
Non-Relational Data
Non-relational databases permit us to store data in a format that more closely meets the original
structure.
A non-relational database is a database that does not use the tabular schema of columns and
rows found in most traditional database systems.
It uses a storage model that is optimized for the specific requirements of the type of data being
stored.
In a non-relational database the data may be stored as JSON documents, as simple key/value
pairs, or as a graph consisting of edges and vertices.
Examples of non-relational databases:
o Redis
o JanusGraph
o MongoDB
o RabbitMQ
A columnar or column-family data store organizes data into rows and columns. The columns are
divided into groups known as column families.
Each column family consists of a set of columns that are logically related and are generally
retrieved or manipulated as a unit.
Within a column family, rows can be sparse and new columns can be added dynamically.
A graph data store manages two types of information: nodes and edges.
Nodes represent entities, and edges represent the relationships between these entities.
The aim of a graph data store is to allow an application to efficiently perform queries that traverse
the network of nodes and edges and to inspect the relationships between entities.
Time-series data is a set of values organized by time, and a time-series data store is optimized
for this type of data.
Time-series data stores must support a very large number of writes, as they generally collect large
amounts of data in real time from a huge number of sources.
Object data stores are suited for storing and retrieving large binary objects or blobs such as audio
and video streams, images, text files, large application documents and data objects, and virtual
machine disk images.
An object consists of some metadata, the stored data, and a unique ID for accessing the object.
External index data stores give the ability to search for information held in other data services and
stores.
An external index acts as a secondary index for any data store. It can provide real-time access to
indexes and can be used to index massive volumes of data.
Structured Vs Unstructured Data
1) Defined Vs Undefined Data
Structured data is generally quantitative data; it usually consists of hard numbers or things that
can be counted.
Methods for analysis include classification, regression, and clustering of data.
Unstructured data is generally categorized as qualitative data, and cannot be analyzed and
processed using conventional tools and methods.
Understanding qualitative data requires advanced analytics techniques like data
stacking and data mining.
4) Ease Of Analysis
Structured data is easy to search, both for algorithms and for humans.
Unstructured data is more difficult to search and requires processing to become understandable.
In the context of Big Data, we deal with very large amounts of data and its processing. Broadly, there
are three categories, defined on the basis of how the data is organized: structured, semi-structured, and
unstructured data.
On the basis of the level of organization of the data, some further differences can be identified between
these three types of data.
We now introduce the data model used in XML. The basic object in XML is the XML document.
Two main structuring concepts are used to construct an XML document: elements and attributes.
It is important to note that the term attribute in XML is not used in the same manner as is
customary in database terminology, but rather as it is used in document description languages such
as HTML and SGML. Attributes in XML provide additional information that describes elements,
as we will see. There are additional concepts in XML, such as entities, identifiers, and references,
but first we concentrate on describing elements and attributes to show the essence of the XML
model.
Figure 12.3 shows an example of an XML element called <Projects>. As in HTML, elements are
identified in a document by their start tag and end tag. The tag names are enclosed between angled
brackets < ... >, and end tags are further identified by a slash, </ ... >.
<?xml version="1.0" standalone="yes"?>
<Projects>
<Project>
<Name>ProductX</Name>
<Number>1</Number>
<Location>Bellaire</Location>
<Dept_no>5</Dept_no>
<Worker>
<Ssn>123456789</Ssn>
<Last_name>Smith</Last_name>
<Hours>32.5</Hours>
</Worker>
<Worker>
<Ssn>453453453</Ssn>
<First_name>Joyce</First_name>
<Hours>20.0</Hours>
</Worker>
</Project>
<Project>
<Name>ProductY</Name>
<Number>2</Number>
<Location>Sugarland</Location>
<Dept_no>5</Dept_no>
<Worker>
<Ssn>123456789</Ssn>
<Hours>7.5</Hours>
</Worker>
<Worker>
<Ssn>453453453</Ssn>
<Hours>20.0</Hours>
</Worker>
<Worker>
<Ssn>333445555</Ssn>
<Hours>10.0</Hours>
</Worker>
</Project>
...
</Projects>
Figure 12.3 A complex XML element called <Projects>
Complex elements are constructed from other elements hierarchically, whereas simple
elements contain data values. A major difference between XML and HTML is that XML tag
names are defined to describe the meaning of the data elements in the document, rather than to
describe how the text is to be displayed. This makes it possible to process the data elements in the
XML document automatically by computer programs. Also, the XML tag (element) names can be
defined in another document, known as the schema document, to give a semantic meaning to the
tag names that can be exchanged among multiple users. In HTML, all tag names are predefined
and fixed; that is why they are not extendible.
It is straightforward to see the correspondence between the XML textual representation shown in
Figure 12.3 and the tree structure shown in Figure 12.1. In the tree representation, internal nodes
represent complex elements, whereas leaf nodes rep-resent simple elements. That is why the XML
model is called a tree model or a hierarchical model. In Figure 12.3, the simple elements are the
ones with the tag names <Name>, <Number>, <Location>, <Dept_no>, <Ssn>, <Last_name>,
<First_name>, and <Hours>. The complex elements are the ones with the tag
names <Projects>, <Project>, and <Worker>. In general, there is no limit on the levels of nesting of
elements.
Data-centric XML documents. These documents have many small data items that follow a
specific structure and hence may be extracted from a structured database. They are formatted as
XML documents in order to exchange them over or display them on the Web. These usually follow
a predefined schema that defines the tag names.
Document-centric XML documents. These are documents with large amounts of text, such as
news articles or books. There are few or no structured data elements in these documents.
Hybrid XML documents. These documents may have parts that contain structured data and
other parts that are predominantly textual or unstructured. They may or may not have a predefined
schema.
XML documents that do not follow a predefined schema of element names and corresponding
tree structure are known as schemaless XML documents. It is important to note that data-centric
XML documents can be considered either as semistructured data or as structured data as defined
in Section 12.1. If an XML document conforms to a predefined XML schema or DTD (see Section
12.3), then the document can be considered as structured data. On the other hand, XML allows
documents that do not conform to any schema; these would be considered as semistructured
data and are schemaless XML documents. When the value of the standalone attribute in an XML
document is yes, as in the first line in Figure 12.3, the document is standalone and schemaless.
XML attributes are generally used in a manner similar to how they are used in HTML (see Figure
12.2), namely, to describe properties and characteristics of the elements (tags) within which they
appear. It is also possible to use XML attributes to hold the values of simple data elements;
however, this is generally not recommended. An exception to this rule is in cases that need
to reference another element in another part of the XML document. To do this, it is common to
use attribute values in one element as the references. This resembles the concept of foreign keys
in relational databases, and is a way to get around the strict hierarchical model that the XML tree
model implies. We discuss XML attributes further in Section 12.3 when we discuss XML schema
and DTD.
In Figure 12.3, we saw what a simple XML document may look like. An XML document is well
formed if it follows a few conditions. In particular, it must start with an XML declaration to
indicate the version of XML being used as well as any other relevant attributes, as shown in the
first line in Figure 12.3. It must also follow the syntactic guidelines of the tree data model. This
means that there should be a single root element, and every element must include a matching pair
of start and end tags within the start and end tags of the parent element. This ensures that the nested
elements specify a well-formed tree structure.
A well-formed XML document can be schemaless; that is, it can have any tag names for the
elements within the document. In this case, there is no predefined set of elements (tag names) that
a program processing the document knows to expect. This gives the document creator the freedom
to specify new elements, but limits the possibilities for automatically interpreting the meaning or
semantics of the elements within the document.
A stronger criterion is for an XML document to be valid. In this case, the document must be well
formed, and it must follow a particular schema. That is, the element names used in the start and
end tag pairs must follow the structure specified in a separate XML DTD (Document Type
Definition) file or XML schema file. We first discuss XML DTD here, and then we give an
overview of XML schema in Section 12.3.2. Figure 12.4 shows a simple XML DTD file, which
specifies the elements (tag names) and their nested structures. Any valid documents conforming
to this DTD should follow the specified structure. A special syntax exists for specifying DTD files,
as illustrated in Figure 12.4. First, a name is given to the root tag of the document, which is
called Projects in the first line in Figure 12.4. Then the elements and their nested structure are
specified.
<!DOCTYPE Projects [
<!ELEMENT Projects (Project+)>
<!ELEMENT Project (Name, Number, Location, Dept_no?, Worker*)>
<!ATTLIST Project ProjId ID #REQUIRED>
... (the remaining element declarations, such as <!ELEMENT Name (#PCDATA)>, follow the same pattern)
]>
Figure 12.4 An XML DTD file called Projects
When specifying elements, the following notation is used:
A * following the element name means that the element can be repeated zero or more times in
the document. This kind of element is known as an optional multivalued (repeating) element.
A + following the element name means that the element can be repeated one or more times in
the document. This kind of element is a required multivalued (repeating) element.
A ? following the element name means that the element can be repeated zero or one time. This
kind is an optional single-valued (nonrepeating) element.
An element appearing without any of the preceding three symbols must appear exactly once in
the document. This kind is a required single-valued (nonrepeating) element.
The type of the element is specified via parentheses following the element. If the parentheses
include names of other elements, these latter elements are the children of the element in the tree
structure. If the parentheses include the keyword #PCDATA or one of the other data types available
in XML DTD, the element is a leaf node. PCDATA stands for parsed character data, which is
roughly similar to a string data type.
The list of attributes that can appear within an element can also be specified
via the keyword !ATTLIST. In Figure 12.3, the Project element has an attribute ProjId. If the type of
an attribute is ID, then it can be referenced from another attribute whose type is IDREF within
another element. Notice that attributes can also be used to hold the values of simple data elements
of type #PCDATA.
We can see that the tree structure in Figure 12.1 and the XML document in Figure 12.3 conform
to the XML DTD in Figure 12.4. To require that an XML document be checked for conformance
to a DTD, we must specify this in the declaration of the document. For example, we could change
the first line in Figure 12.3 to the following:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE Projects SYSTEM "proj.dtd">
When the value of the standalone attribute in an XML document is “no”, the document needs to be
checked against a separate DTD document or XML schema document (see below). The DTD file
shown in Figure 12.4 should be stored in the same file system as the XML document, and should
be given the file name proj.dtd. Alternatively, we could include the DTD document text at the
beginning of the XML document itself to allow the checking.
Although XML DTD is quite adequate for specifying tree structures with required, optional, and
repeating elements, and with various types of attributes, it has several limitations. First, the data
types in DTD are not very general. Second, DTD has its own special syntax and thus requires
specialized processors. It would be advantageous to specify XML schema documents using the
syntax rules of XML itself so that the same processors used for XML documents could process
XML schema descriptions. Third, all DTD elements are always forced to follow the specified
ordering of the document, so unordered elements are not permitted. These drawbacks led to the
development of XML schema, a more general but also more complex language for specifying the
structure and elements of XML documents.
2. XML Schema
The XML schema language is a standard for specifying the structure of XML documents. It uses
the same syntax rules as regular XML documents, so that the same processors can be used on both.
To distinguish the two types of documents, we will use the term XML instance document or XML
document for a regular XML document, and XML schema document for a document that specifies
an XML schema. Figure 12.5 shows an XML schema document corresponding to
the COMPANY database shown in Figures 3.5 and 7.2. Although it is unlikely that we would want
to display the whole database as a single document, there have been proposals to store data
in native XML format as an alternative to storing the data in relational databases. The schema in
Figure 12.5 would serve the purpose of specifying the structure of the COMPANY database if it
were stored in a native XML system. We discuss this topic further in Section 12.4.
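As a rough illustration of this native-storage idea (a sketch only; the table name company_xml and the use of an Oracle-style XMLType column are assumptions, not something specified in this chapter), a database with XML support could hold each document in a single XML-typed column:

-- Store whole XML documents in an XMLType column.
CREATE TABLE company_xml (
  doc_id NUMBER PRIMARY KEY,
  doc    XMLTYPE
);

INSERT INTO company_xml (doc_id, doc)
VALUES (1, XMLTYPE('<company> ... </company>'));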
As with XML DTD, XML schema is based on the tree data model, with elements and attributes as
the main structuring concepts. However, it borrows additional concepts from database and object
models, such as keys, references, and identifiers. Here we describe the features of XML schema
in a step-by-step manner, referring to the sample XML schema document in Figure 12.5 for
illustration. We introduce and describe some of the schema concepts in the order in which they are
used in Figure 12.5.
Figure 12.5 An XML schema file called company (excerpt)
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<xsd:element name="company">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="department" type="Department" minOccurs="0" maxOccurs="unbounded" />
<xsd:element name="employee" type="Employee" minOccurs="0" maxOccurs="unbounded">
... (xsd:unique, xsd:key, and xsd:keyref constraint definitions appear here)
</xsd:element>
<xsd:element name="project" type="Project" minOccurs="0" maxOccurs="unbounded" />
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:complexType name="Department"> <xsd:sequence> ... </xsd:sequence> </xsd:complexType>
<xsd:complexType name="Employee"> <xsd:sequence> ... </xsd:sequence> </xsd:complexType>
<xsd:complexType name="Project"> <xsd:sequence> ... </xsd:sequence> </xsd:complexType>
<xsd:complexType name="Dependent"> <xsd:sequence> ... </xsd:sequence> </xsd:complexType>
</xsd:schema>
Schema descriptions and XML namespaces. It is necessary to identify the specific set of XML
schema language elements (tags) being used by specifying a file stored at a Web site location.
The second line in Figure 12.5 specifies the file used in this example, which
is http://www.w3.org/2001/XMLSchema. This is a commonly used standard for XML schema
commands. Each such definition is called an XML namespace, because it defines the set of
commands (names) that can be used. The file name is assigned to the variable xsd (XML schema
description) using the attribute xmlns (XML namespace), and this variable is used as a prefix to all
XML schema commands (tag names). For example, in Figure 12.5, when we
write xsd:element or xsd:sequence, we are referring to the definitions of the element and sequence tags
as defined in the file http://www.w3.org/2001/XMLSchema.
2. Annotations, documentation, and language used. The next couple of lines in Figure 12.5
illustrate the XML schema elements (tags) xsd:annotation and xsd:documentation, which are used
for providing comments and other descriptions in the XML document. The attribute xml:lang of
the xsd:documentation element specifies the language being used, where en stands for the English
language.
Elements and types. Next, we specify the root element of our XML schema. In XML schema,
the name attribute of the xsd:element tag specifies the element name, which is called company for
the root element in our example (see Figure 12.5). The structure of the company root element can
then be specified, which in our example is xsd:complexType. This is further specified to be a
sequence of departments, employees, and projects using the xsd:sequence structure of XML
schema. It is important to note here that this is not the only way to specify an XML schema for
the COMPANY database. We will discuss other options in Section 12.6.
First-level elements in the COMPANY database. Next, we specify the three first-level elements
under the company root element in Figure 12.5. These elements are named employee, department,
and project, and each is specified in an xsd:element tag. Notice that if a tag has only attributes and
no further subelements or data within it, it can be ended with the slash symbol (/>) directly
instead of having a separate matching end tag. These are called empty elements; examples are
the xsd:element elements named department and project in Figure 12.5.
5. Specifying element type and minimum and maximum occurrences. In XML schema, the attributes type, minOccurs, and maxOccurs in the xsd:element tag specify the type
and multiplicity of each element in any document that conforms to the schema specifications. If
we specify a type attribute in an xsd:element, the structure of the element must be described
separately, typically using the xsd:complexType element of XML schema. This is illustrated by
the employee, department, and project elements in Figure 12.5. On the other hand, if no type attribute
is specified, the element structure can be defined directly following the tag, as illustrated by
the company root element in Figure 12.5. The minOccurs and maxOccurs attributes are used for specifying
lower and upper bounds on the number of occurrences of an element in any XML document
that conforms to the schema specifications. If they are not specified, the default is exactly one
occurrence. These serve a similar role to the *, +, and ? symbols of XML DTD.
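For instance, the first-level department element in Figure 12.5 is declared roughly as follows (reconstructed from the surviving fragments of the figure):
<xsd:element name="department" type="Department" minOccurs="0" maxOccurs="unbounded" />
Here minOccurs="0" makes the element optional and maxOccurs="unbounded" lets it repeat any number of times, much like the * symbol in a DTD.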
6. Specifying keys. In XML schema, it is possible to specify constraints that correspond to unique
and primary key constraints in a relational database (see Section 3.2.2), as well as foreign key (or
referential integrity) constraints (see Section 3.2.4). The xsd:unique tag specifies elements that
correspond to unique attributes in a relational database. We can give each such uniqueness
constraint a name, and we must specify xsd:selector and xsd:field tags for it to identify the element
type that contains the unique element and the element name within it that is unique, via
the xpath attribute. This is illustrated by the departmentNameUnique and projectNameUnique elements
in Figure 12.5. For specifying primary keys, the tag xsd:key is used instead of xsd:unique, as
illustrated by the projectNumberKey, departmentNumberKey, and employeeSSNKey elements in Figure
12.5. For specifying foreign keys, the tag xsd:keyref is used, as illustrated by the
six xsd:keyref elements in Figure 12.5. When specifying a foreign key, the attribute refer of
the xsd:keyref tag specifies the referenced primary key, whereas the
tags xsd:selector and xsd:field specify the referencing element type and foreign key (see Figure 12.5).
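A sketch of how one such key and a referencing foreign key might be written (the selector and field xpath values here, such as departmentManagerSSN, are illustrative assumptions, not copied from Figure 12.5):
<xsd:key name="employeeSSNKey">
    <xsd:selector xpath="employee" />
    <xsd:field xpath="employeeSSN" />
</xsd:key>
<xsd:keyref name="departmentManagerSSNKeyRef" refer="employeeSSNKey">
    <xsd:selector xpath="department" />
    <xsd:field xpath="departmentManagerSSN" />
</xsd:keyref>
The xsd:selector names the element type that contains the constrained value, the xsd:field names the element holding that value, and the refer attribute points the foreign key back at the named key.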
7. Specifying the structures of complex elements via complex types.
The next part of our example specifies the structures of the complex
elements Department, Employee, Project, and Dependent, using the tag xsd:complexType (see Figure
12.5). We specify each of these as a sequence of subelements corresponding to the
database attributes of each entity type (see Figure 3.7) by using
the xsd:sequence and xsd:element tags of XML schema. Each element is given a name and type
via the attributes name and type of xsd:element. We can also
specify minOccurs and maxOccurs attributes if we need to change the default of exactly one
occurrence. For (optional) database attributes where null is allowed, we need to specify minOccurs
= 0, whereas for multivalued database attributes we need to specify maxOccurs = "unbounded" on
the corresponding element. Notice that if we were not going to specify any key constraints, we
could have embedded the subelements within the parent element definitions directly without
having to specify complex types. However, when unique, primary key, and foreign key constraints
need to be specified, we must define complex types to specify the element structures.
8. Composite (compound) attributes. Composite attributes from Figure 7.2 are also specified as
complex types in Figure 12.5, as illustrated by the Address, Name, Worker, and WorksOn complex
types. These could have been directly embedded within their parent elements.
This example illustrates some of the main features of XML schema. There are other features, but
they are beyond the scope of our presentation. In the next section, we discuss the different
approaches to creating XML documents from relational databases and storing XML documents.
Several approaches to organizing the contents of XML documents to facilitate their subsequent
querying and retrieval have been proposed. The following are the most common approaches:
Using a DBMS to store the documents as text. A relational or object DBMS can be used to
store whole XML documents as text fields within the DBMS records or objects. This approach can
be used if the DBMS has a special module for document processing, and would work for storing
schemaless and document-centric XML documents.
Using a DBMS to store the document contents as data elements. This approach would work
for storing a collection of documents that follow a specific XML DTD or XML schema. Because
all the documents have the same structure, one can design a relational (or object) database to store
the leaf-level data elements within the XML documents. This approach would require mapping
algorithms to design a database schema that is compatible with the XML document structure as
specified in the XML schema or DTD and to recreate the XML documents from the stored data.
These algorithms can be implemented either as an internal DBMS module or as separate
middleware that is not part of the DBMS.
Designing a specialized system for storing native XML data. A new type of database system
based on the hierarchical (tree) model could be designed and implemented. Such systems are being
called Native XML DBMSs. The system would include specialized indexing and querying
techniques, and would work for all types of XML documents. It could also include data
compression techniques to reduce the size of the documents for storage. Tamino by Software AG
and the Dynamic Application Platform of eXcelon are two popular products that offer native XML
DBMS capability. Oracle also offers a native XML storage option.
All of these approaches have received considerable attention. We focus on a fourth approach in
Section 12.6, namely creating or publishing customized XML documents from preexisting relational
databases, because it gives a good conceptual understanding of the differences between the
XML tree data model and the traditional database models based on flat files (relational model) and
graph representations (ER model). But first we give an overview of XML query languages in
Section 12.5.
We will use the simplified UNIVERSITY ER schema shown in Figure 12.8 to illustrate our
discussion. Suppose that an application needs to extract XML documents for student, course, and
grade information from the UNIVERSITY database. The data needed for these documents is
contained in the database attributes of the entity
types COURSE, SECTION, and STUDENT from Figure 12.8, and the relationships S-S and C-
S between them. In general, most documents extracted from a database will only use a subset
of the attributes, entity types, and relationships in the database. In this example, the subset of
the database that is needed is shown in Figure 12.9.
At least three possible document hierarchies can be extracted from the database subset in Figure
12.9. First, we can choose COURSE as the root, as illustrated in Figure 12.10. Here, each course
entity has the set of its sections as subelements, and each section has its students as
subelements. We can see one consequence of modeling the information in a hierarchical tree
structure. If a student has taken multiple sections, that student’s information will appear multiple
times in the document— once under each section. A possible simplified XML schema for this view
is shown in Figure 12.11. The Grade database attribute in the S-S relationship is migrated to
the STUDENT element. This is because STUDENT becomes a child of SECTION in this hierarchy, so
each STUDENT element under a specific SECTION element can have a specific grade in that
section. In this document hierarchy, a student taking more than one section will have several
replicas, one under each section, and each replica will have the specific grade given in that
particular section.
Figure 12.11 A simplified XML schema document for the hierarchical view with COURSE as the root (not reproduced in full here).
In the second hierarchical document view, we can choose STUDENT as root (Figure 12.12). In this
hierarchical view, each student has a set of sections as its child elements, and each section is
related to one course as its child, because the relationship between SECTION and COURSE is N:1.
Thus, we can merge the COURSE and SECTION elements in this view, as shown in Figure 12.12. In
addition, the GRADE database attribute can be migrated to the SECTION element. In this
hierarchy, the combined COURSE/SECTION information is replicated under each student who
completed the section. A possible simplified XML schema for this view is shown in Figure 12.13.
The third possible way is to choose SECTION as the root, as shown in Figure 12.14. Similar to the
second hierarchical view, the COURSE information can be merged into the SECTION element.
The GRADE database attribute can be migrated to the STUDENT element. As we can see, even in
this simple example, there can be numerous hierarchical document views, each corresponding
to a different root and a different XML document structure.
In the previous examples, the subset of the database of interest had no cycles. It is possible to
have a more complex subset with one or more cycles, indicating multiple relationships among
the entities. In this case, it is more difficult to decide how to create the document hierarchies.
Additional duplication of entities may be needed to represent the multiple relationships. We will
illustrate this with an example using the ER schema in Figure 12.8.
Suppose that we need the information in all the entity types and relationships in Figure 12.8 for
a particular XML document, with STUDENT as the root element. Figure 12.15 illustrates how a
possible hierarchical tree structure can be created for this document. First, we get a lattice
with STUDENT as the root, as shown in Figure 12.15(a). This is not a tree structure because of the
cycles. One way to break the cycles is to replicate the entity types involved in the cycles. First, we
replicate INSTRUCTOR as shown in Figure 12.15(b), calling the replica to the right INSTRUCTOR1.
The INSTRUCTOR replica on the left represents the relationship between instructors and the
sections they teach, whereas the INSTRUCTOR1 replica on the right represents the relationship
between instructors and the department each works in. After this, we still have the cycle
involving COURSE, so we can replicate COURSE in a similar manner, leading to the hierarchy
shown in Figure 12.15(c). The COURSE1 replica to the left represents the relationship between
courses and their sections, whereas the COURSE replica to the right represents the relationship
between courses and the department that offers each course.
In Figure 12.15(c), we have converted the initial graph to a hierarchy. We can do further merging
if desired (as in our previous example) before creating the final hierarchy and the corresponding
XML schema structure.
It is necessary to create the correct query in SQL to extract the desired information for the XML
document.
Once the query is executed, its result must be restructured from the flat relational form to the
XML tree structure.
The query can be customized to select either a single object or multiple objects into the
document. For example, in the view in Figure 12.13, the query can select a single student entity
and create a document corresponding to that single student, or it may select several—or even
all—of the students and create a document with multiple students.
XML Languages
There have been several proposals for XML query languages, and two query language standards
have emerged. The first is XPath, which provides language constructs for specifying path
expressions to identify certain nodes (elements) or attributes within an XML document that match
specific patterns. The second is XQuery, which is a more general query language. XQuery uses
XPath expressions but has additional constructs. We give an overview of each of these languages
in this section. Then we discuss some additional languages related to XML in Section 12.5.3.
An XPath expression generally returns a sequence of items that satisfy a certain pattern as
specified by the expression. These items are either values (from leaf nodes) or elements or
attributes. The most common type of XPath expression returns a collection of element or attribute
nodes that satisfy certain patterns specified in the expression. The names in the XPath expression
are node names in the XML document tree that are either tag (element) names or attribute names,
possibly with additional qualifier conditions to further restrict the nodes that satisfy the pattern.
Two main separators are used when specifying a path: single slash (/) and double slash (//). A
single slash before a tag specifies that the tag must appear as a direct child of the previous (parent)
tag, whereas a double slash specifies that the tag can appear as a descendant of the previous tag at
any level. Let us look at some examples of XPath as shown in Figure 12.6.
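Figure 12.6 itself is not reproduced here; based on the walkthrough that follows, the five expressions are roughly as below (a reconstruction, so the exact text of the figure may differ):
1. /company
2. /company/department
3. //employee [employeeSalary gt 70000]/employeeName
4. /company/employee [employeeSalary gt 70000]/employeeName
5. /company/project/projectWorker [hours gt 20.0]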
The first XPath expression in Figure 12.6 returns the company root node and all its descendant
nodes, which means that it returns the whole XML document. We should note that it is customary
to include the file name in the XPath query. This allows us to specify any local file name or even
any path name that specifies a file on the Web. For example, if the COMPANY XML document is
stored at the location www.company.com/info.xml, the first expression can be written as
doc(www.company.com/info.xml)/company.
This prefix would also be included in the other examples of XPath expressions.
The second example in Figure 12.6 returns all department nodes (elements) and their descendant
subtrees. Note that the nodes (elements) in an XML document are ordered, so the XPath result that
returns multiple nodes will do so in the same order in which the nodes are ordered in the document
tree.
The third XPath expression in Figure 12.6 illustrates the use of //, which is convenient to use if we
do not know the full path name we are searching for, but do know the name of some tags of interest
within the XML document. This is particularly useful for schemaless XML documents or for
documents with many nested levels of nodes.
The expression returns all employeeName nodes that are direct children of an employee node, such
that the employee node has another child element employeeSalary whose value is greater than 70000.
This illustrates the use of qualifier conditions, which restrict the nodes selected by
the XPath expression to those that satisfy the condition. XPath has a number of comparison
operations for use in qualifier conditions, including standard arithmetic, string, and set comparison
operations.
The fourth XPath expression in Figure 12.6 should return the same result as the previous one,
except that we specified the full path name in this example. The fifth expression in Figure 12.6
returns all projectWorker nodes and their descendant nodes that are children under a
path /company/project and have a child node hours with a value greater than 20.0 hours.
When we need to include attributes in an XPath expression, the attribute name is prefixed by the
@ symbol to distinguish it from element (tag) names. It is also possible to use the wildcard symbol
*, which stands for any element, as in the following example, which retrieves all elements that are
child elements of the root, regardless of their element type. When wildcards are used, the result
can be a sequence of different types of items.
/company/*
The examples above illustrate simple XPath expressions, where we can only move down in the
tree structure from a given node. A more general model for path expressions has been proposed.
In this model, it is possible to move in multiple directions from the current node in the path
expression. These are known as the axes of an XPath expression. Our examples above used
only three of these axes: child of the current node (/), descendant or self at any level of the current
node (//), and attribute of the current node (@). Other axes include parent, ancestor (at any level),
previous sibling (any node at same level to the left in the tree), and next sibling (any node at the
same level to the right in the tree). These axes allow for more complex path expressions.
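As a short illustration of axis syntax (using the element names of the COMPANY examples; these are our own sample expressions):
//employeeName/parent::employee
//department/ancestor::company
//employee/preceding-sibling::employee
//employee/following-sibling::employee
The first expression moves from an employeeName node up to its employee parent, the second selects the company ancestor of a department node, and the last two select sibling employee nodes to the left and right of the current node in document order.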
The main restriction of XPath path expressions is that the path that specifies the pattern also
specifies the items to be retrieved. Hence, it is difficult to specify certain conditions on the pattern
while separately specifying which result items should be retrieved. The XQuery language separates
these two concerns, and provides more powerful constructs for specifying queries.
LET $d := doc(www.company.com/info.xml)
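(: Only the LET clause of this sample query survives above; the remaining clauses, reconstructed
   from the explanation that follows and the element names of Figure 12.5, are roughly: :)
FOR $x IN $d/company/project[projectNumber = 5]/projectWorker,
    $y IN $d/company/employee
WHERE $x/hours gt 20.0 AND $y/ssn = $x/ssn
RETURN <res> { $y/employeeName/firstName, $y/employeeName/lastName, $x/hours } </res>
(: Keywords are shown uppercase to match the text; an actual XQuery processor expects them in
   lowercase, e.g. let, for, where, return. :)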
Variables are prefixed with the $ sign. In the above example, $d, $x, and $y are variables.
The LET clause assigns a variable to a particular expression for the rest of the query. In this
example, $d is assigned to the document file name. It is possible to have a query that refers to
multiple documents by assigning multiple variables in this way.
The FOR clause assigns a variable to range over each of the individual items in a sequence. In
our example, the sequences are specified by path expressions. The $x variable ranges over
elements that satisfy the path expression $d/company/project[projectNumber =
5]/projectWorker. The $y variable ranges over elements that satisfy the path
expression $d/company/employee. Hence, $x ranges over projectWorker elements,
whereas $y ranges over employee elements.
The WHERE clause specifies additional conditions on the selection of items. In this example,
the first condition selects only those projectWorker elements that satisfy the condition (hours gt
20.0). The second condition specifies a join condition that combines an employee with
a projectWorker only if they have the same ssn value.
Finally, the RETURN clause specifies which elements or attributes should be retrieved from the
items that satisfy the query conditions. In this example, it will return a sequence of elements each
containing <firstName, lastName, hours> for employees who work more than 20 hours per week on
project number 5.
Figure 12.7 includes some additional examples of queries in XQuery that can be specified on
XML instance documents that follow the XML schema document in Figure 12.5. The first query
retrieves the first and last names of employees who earn more than $70,000. The variable $x is
bound to each employeeName element that is a child of an employee element, but only
for employee elements that satisfy the qualifier that their employeeSalary value is greater than
$70,000. The result retrieves the firstName and lastName child elements of the
selected employeeName elements. The second query is an alternative way of retrieving the same
elements retrieved by the first query.
The third query illustrates how a join operation can be performed by using more than one variable.
Here, the $x variable is bound to each projectWorker element that is a child of project number 5,
whereas the $y variable is bound to each employee element. The join condition
matches ssn values in order to retrieve the employee names. Notice that this is an alternative way
of specifying the same query in our earlier example, but without the LET clause.
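Since Figure 12.7 is not reproduced here, the following is a hedged sketch of what the first and third queries might look like, written in standard lowercase XQuery with the element names of Figure 12.5 (the figure's exact text may differ):
for $x in doc("www.company.com/info.xml")//employee[employeeSalary gt 70000]/employeeName
return <res> { $x/firstName, $x/lastName } </res>

for $x in doc("www.company.com/info.xml")/company/project[projectNumber = 5]/projectWorker,
    $y in doc("www.company.com/info.xml")/company/employee
where $y/ssn = $x/ssn
return <res> { $y/employeeName/firstName, $y/employeeName/lastName } </res>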
XQuery has very powerful constructs to specify complex queries. In particular, it can specify
universal and existential quantifiers in the conditions of a query, aggregate functions, ordering of
query results, selection based on position in a sequence, and even conditional branching. Hence,
in some ways, it qualifies as a full-fledged programming language.
This concludes our brief introduction to XQuery. The interested reader is referred
to www.w3.org, which contains documents describing the latest standards related to XML
and XQuery. The next section briefly discusses some additional languages and protocols related to
XML.
The Extensible Stylesheet Language (XSL) can be used to define how a document should be
rendered for display by a Web browser.
The Extensible Stylesheet Language for Transformations (XSLT) can be used to transform one
structure into a different structure. Hence, it can convert documents from one form to another.
The Web Services Description Language (WSDL) allows for the description of Web Services
in XML. This makes the Web Service available to users and programs over the Web.
The Resource Description Framework (RDF) provides languages and tools for exchanging and
processing metadata (schema) descriptions and specifications over the Web.
XML declaration
Document type declaration
The XML Document Type Declaration, commonly known as DTD, is a way to describe XML language
precisely. DTDs check vocabulary and validity of the structure of XML documents against grammatical
rules of appropriate XML language.
An XML DTD can be either specified inside the document, or it can be kept in a separate document
and then linked separately.
Syntax
Basic syntax of a DTD is as follows −
<!DOCTYPE element DTD identifier
[
declaration1
declaration2
........
]>
In the above syntax,
The DTD starts with the <!DOCTYPE delimiter.
The element name tells the parser to parse the document starting from the specified root element.
DTD identifier is an identifier for the document type definition, which may be the path to a file on
the system or URL to a file on the internet. If the DTD is pointing to external path, it is
called External Subset.
The square brackets [ ] enclose an optional list of entity declarations called Internal Subset.
Internal DTD
A DTD is referred to as an internal DTD if the elements are declared within the XML file. To refer to it as
an internal DTD, the standalone attribute in the XML declaration must be set to yes. This means the
declaration works independently of an external source.
Syntax
Following is the syntax of internal DTD −
<!DOCTYPE root-element [element-declarations]>
where root-element is the name of root element and element-declarations is where you declare the
elements.
Example
Following is a simple example of internal DTD −
<?xml version = "1.0" encoding = "UTF-8" standalone = "yes" ?>
<!DOCTYPE address [
<!ELEMENT address (name,company,phone)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT company (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
]>
<address>
<name>Tanmay Patil</name>
<company>TutorialsPoint</company>
<phone>(011) 123-4567</phone>
</address>
Let us go through the above code −
Start Declaration − Begin the XML declaration with the following statement.
<?xml version = "1.0" encoding = "UTF-8" standalone = "yes" ?>
DTD − Immediately after the XML header, the document type declaration follows, commonly referred
to as the DOCTYPE −
<!DOCTYPE address [
The DOCTYPE declaration has an exclamation mark (!) at the start of the element name. The
DOCTYPE informs the parser that a DTD is associated with this XML document.
DTD Body − The DOCTYPE declaration is followed by body of the DTD, where you declare elements,
attributes, entities, and notations.
<!ELEMENT address (name,company,phone)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT company (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
Several elements are declared here that make up the vocabulary of the <address> document.
<!ELEMENT name (#PCDATA)> defines the element name to be of type "#PCDATA". Here #PCDATA
means parse-able text data.
End Declaration − Finally, the declaration section of the DTD is closed using a closing bracket and a
closing angle bracket (]>). This effectively ends the definition, and thereafter, the XML document follows
immediately.
Rules
The document type declaration must appear at the start of the document (preceded only by the
XML header) − it is not permitted anywhere else within the document.
Similar to the DOCTYPE declaration, the element declarations must start with an exclamation
mark.
The Name in the document type declaration must match the element type of the root element.
External DTD
In an external DTD, elements are declared outside the XML file. They are accessed by specifying the
system attribute, which may be either a legal .dtd file path or a valid URL. To refer to it as an external
DTD, the standalone attribute in the XML declaration must be set to no. This means the declaration
includes information from an external source.
Syntax
Following is the syntax for external DTD −
<!DOCTYPE root-element SYSTEM "file-name">
where file-name is the file with .dtd extension.
Example
The following example shows external DTD usage −
<?xml version = "1.0" encoding = "UTF-8" standalone = "no" ?>
<!DOCTYPE address SYSTEM "address.dtd">
<address>
<name>Tanmay Patil</name>
<company>TutorialsPoint</company>
<phone>(011) 123-4567</phone>
</address>
The content of the DTD file address.dtd is as shown −
<!ELEMENT address (name,company,phone)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT company (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
Types
You can refer to an external DTD by using either system identifiers or public identifiers.
System Identifiers
A system identifier enables you to specify the location of an external file containing DTD declarations.
Syntax is as follows −
<!DOCTYPE name SYSTEM "address.dtd" [...]>
As you can see, it contains keyword SYSTEM and a URI reference pointing to the location of the
document.
Public Identifiers
Public identifiers provide a mechanism to locate DTD resources and are written as follows −
<!DOCTYPE name PUBLIC "-//Beginning XML//DTD Address Example//EN">
As you can see, it begins with keyword PUBLIC, followed by a specialized identifier. Public identifiers
are used to identify an entry in a catalog. Public identifiers can follow any format; however, a commonly
used format is called Formal Public Identifiers, or FPIs.
XML Schema :
XML Schema is commonly known as XML Schema Definition (XSD). It is used to describe and
validate the structure and the content of XML data. XML schema defines the elements, attributes and
data types. Schema element supports Namespaces. It is similar to a database schema that describes
the data in a database.
Syntax
You declare a schema in your XML document using the xs:schema root element, as shown in the example below.
Example
The following example shows how to use schema −
<?xml version = "1.0" encoding = "UTF-8"?>
<xs:schema xmlns:xs = "http://www.w3.org/2001/XMLSchema">
<xs:element name = "contact">
<xs:complexType>
<xs:sequence>
<xs:element name = "name" type = "xs:string" />
<xs:element name = "company" type = "xs:string" />
<xs:element name = "phone" type = "xs:int" />
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
The basic idea behind XML Schemas is that they describe the legitimate format that an XML document
can take.
Elements
As we saw in the XML - Elements chapter, elements are the building blocks of XML document. An
element can be defined within an XSD as follows −
<xs:element name = "x" type = "y"/>
Definition Types
You can define XML schema elements in the following ways −
Simple Type
A simple type element contains only text; it cannot contain other elements or attributes. Some of the predefined simple types are:
xs:integer, xs:boolean, xs:string, xs:date. For example −
<xs:element name = "phone_number" type = "xs:int" />
Complex Type
A complex type is a container for other element definitions. This allows you to specify which child
elements an element can contain and to provide some structure within your XML documents. For
example −
<xs:element name = "Address">
<xs:complexType>
<xs:sequence>
<xs:element name = "name" type = "xs:string" />
<xs:element name = "company" type = "xs:string" />
<xs:element name = "phone" type = "xs:int" />
</xs:sequence>
</xs:complexType>
</xs:element>
In the above example, the Address element consists of child elements. It is a container for
other <xs:element> definitions, which allows you to build a simple hierarchy of elements in the XML
document.
Global Types
With the global type, you can define a single type in your document, which can be used by all other
references. For example, suppose you want to generalize the person and company for different
addresses of the company. In such case, you can define a general type as follows −
<xs:element name = "AddressType">
<xs:complexType>
<xs:sequence>
<xs:element name = "name" type = "xs:string" />
<xs:element name = "company" type = "xs:string" />
</xs:sequence>
</xs:complexType>
</xs:element>
Now let us use this type in our example as follows −
<xs:element name = "Address1">
<xs:complexType>
<xs:sequence>
<xs:element name = "address" type = "AddressType" />
<xs:element name = "phone1" type = "xs:int" />
</xs:sequence>
</xs:complexType>
</xs:element>
Attributes
Attributes in XSD provide extra information within an element. Attributes have name and type property
as shown below −
<xs:attribute name = "x" type = "y"/>
XML Query:
XML query is based on two methods.
1. XPath:
XPath uses path expressions to select nodes or values in an XML document. Consider the following document:
<Employee>
   <Name>
      <FirstName>Brian</FirstName>
      <LastName>Lara</LastName>
   </Name>
   <Contact>
      <Mobile>9800000000</Mobile>
      <Landline>020222222</Landline>
   </Contact>
   <Address>
      <City>Pune</City>
      <Street>Tilak road</Street>
      <ZipCode>4110</ZipCode>
   </Address>
</Employee>
In the above example, Employee is the root node and Name, Contact, and Address are its descendant
nodes. A path expression starts from the root with a (/); for instance, /Employee/Address/City selects the City element.
2. XQuery:
XQuery is a query language that uses XPath expressions to extract and combine data from XML documents.
XQuery comparisons:
The two methods for XQuery comparisons are as follows:
General comparisons: =, !=, <, <=, >, >=
Value comparisons: eq, ne, lt, le, gt, ge
Example:
Let us take an example to understand how to write an XML query; a sketch is shown below.
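A minimal sketch, assuming the Employee document shown above is saved in a file named employee.xml (the file name and the query itself are our own illustration):
for $e in doc("employee.xml")/Employee
where $e/Address/City eq "Pune"
return $e/Name/FirstName
The where clause uses a value comparison (eq); the query returns the FirstName element of every employee whose address city is Pune.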
XML Database :
An XML database is used to store a huge amount of information in the XML format. As the use of XML is
increasing in every field, it is required to have a secured place to store the XML documents. The data
stored in the database can be queried using XQuery, serialized, and exported into a desired format.
XML databases are broadly of two types:
XML-enabled
Native XML (NXD)
XML - Enabled Database
An XML-enabled database is essentially a relational database extended to convert XML documents to and
from its own structures. Data is stored in tables consisting of rows and columns. The tables contain sets
of records, which in turn consist of fields.
Native XML Database
A native XML database is based on containers rather than a table format. It can store large amounts of
XML documents and data, and it is queried using XPath expressions.
A native XML database has an advantage over an XML-enabled database: it is more capable of storing,
querying, and maintaining XML documents.
Example
Following example demonstrates XML database −
<?xml version = "1.0"?>
<contact-info>
<contact1>
<name>Tanmay Patil</name>
<company>TutorialsPoint</company>
<phone>(011) 123-4567</phone>
</contact1>
<contact2>
<name>Manisha Patil</name>
<company>TutorialsPoint</company>
<phone>(011) 789-4567</phone>
</contact2>
</contact-info>
Here, a table of contacts is created that holds the records of contacts (contact1 and contact2), each of which
in turn consists of three elements − name, company, and phone.
UNIT – 4
NOSQL DATABASES AND BIG DATA STORAGE SYSTEMS
NoSQL – Categories of NoSQL Systems – CAP Theorem – Document-Based NoSQL Systems and MongoDB – MongoDB Data
Model – MongoDB Distributed Systems Characteristics – NoSQL Key-Value Stores – DynamoDB Overview – Voldemort Key-
Value Distributed Data Store – Wide Column NoSQL Systems – Hbase Data Model – Hbase Crud Operations – Hbase Storage and
Distributed System Concepts – NoSQL Graph Databases and Neo4j – Cypher Query Language of Neo4j – Big Data – MapReduce
– Hadoop – YARN.
What is NoSQL?
NoSQL Database is a non-relational Data Management System, that does not require a fixed schema. It
avoids joins, and is easy to scale. The major purpose of using a NoSQL database is for distributed data
stores with humongous data storage needs. NoSQL is used for Big data and real-time web apps. For
example, companies like Twitter, Facebook and Google collect terabytes of user data every single day.
NoSQL database stands for “Not Only SQL” or “Not SQL.” Though a better term would be “NoREL”,
NoSQL caught on. Carlo Strozzi introduced the NoSQL concept in 1998.
Traditional RDBMSs use SQL syntax to store and retrieve data for further insights. In contrast, a NoSQL
database system encompasses a wide range of database technologies that can store structured, semi-
structured, unstructured, and polymorphic data.
OLAP:
Online analytical processing (OLAP) is a system for performing multi-dimensional analysis at high speeds on large volumes of data.
Typically, this data is from a data warehouse, data mart or some other centralized data store.
Categories of NoSQL:
Here are the four main types of NoSQL databases:
Document databases
Key-value stores
Column-oriented databases
Graph databases
Document Databases
A document database stores data in JSON, BSON , or XML documents (not Word documents or Google docs, of course).
In a document database, documents can be nested. Particular elements can be indexed for faster querying.
Documents can be stored and retrieved in a form that is much closer to the data objects used in applications, which means
less translation is required to use the data in an application. SQL data must often be assembled and disassembled when
moving back and forth between applications and storage.
Document databases are popular with developers because they have the flexibility to rework their document structures
as needed to suit their application, shaping their data structures as their application requirements change over time. This
flexibility speeds development because in effect data becomes like code and is under the control of developers. In SQL
databases, intervention by database administrators may be required to change the structure of a database.
The most widely adopted document databases are usually implemented with a scale-out architecture, providing a clear
path to scalability of both data volumes and traffic.
Use cases include ecommerce platforms, trading platforms, and mobile app development across industries.
Comparing MongoDB vs PostgreSQL offers a detailed analysis of MongoDB, the leading NoSQL database, and
PostgreSQL, one of the most popular SQL databases.
Key-Value Stores
The simplest type of NoSQL database is a key-value store . Every data element in the database is stored as a key value
pair consisting of an attribute name (or "key") and a value. In a sense, a key-value store is like a relational database with
only two columns: the key or attribute name (such as state) and the value (such as Alaska).
Use cases include shopping carts, user preferences, and user profiles.
Column-Oriented Databases
While a relational database stores data in rows and reads data row by row, a column store is organized as a set of columns.
This means that when you want to run analytics on a small number of columns, you can read those columns directly
without consuming memory with the unwanted data. Columns are often of the same type and benefit from more efficient
compression, making reads even faster. Columnar databases can quickly aggregate the value of a given column (adding
up the total sales for the year, for example). Use cases include analytics.
Unfortunately there is no free lunch, which means that while columnar databases are great for analytics, the way in which
they write data makes it very difficult for them to be strongly consistent as writes of all the columns require multiple
write events on disk. Relational databases don't suffer from this problem as row data is written contiguously to disk.
Graph Databases
A graph database focuses on the relationship between data elements. Each element is stored as a node (such as a person
in a social media graph). The connections between elements are called links or relationships. In a graph database,
connections are first-class elements of the database, stored directly. In relational databases, links are implied, using data
to express the relationships.
A graph database is optimized to capture and search the connections between data elements, overcoming the overhead
associated with JOINing multiple tables in SQL.
Very few real-world business systems can survive solely on graph queries. As a result graph databases are usually run
alongside other more traditional databases.
Use cases include fraud detection, social networks, and knowledge graphs.
Consistency–
Consistency means that the nodes will have the same copies of a replicated data item visible for
various transactions. A guarantee that every node in a distributed cluster returns the same, most
recent, successful write. Consistency refers to every client having the same view of the data. There
are various types of consistency models. Consistency in CAP refers to sequential consistency, a
very strong form of consistency.
Availability–
Availability means that each read or write request for a data item will either be processed
successfully or will receive a message that the operation cannot be completed. Every non-failing
node returns a response for all read and write requests in a reasonable amount of time. The key
word here is every. To be available, every node (on either side of a network partition) must be able
to respond in a reasonable amount of time.
Partition Tolerant–
Partition tolerance means that the system can continue operating if the network connecting the
nodes has a fault that results in two or more partitions, where the nodes in each partition can only
communicate among each other. That means, the system continues to function and upholds its
consistency guarantees in spite of network partitions. Network partitions are a fact of life.
Distributed systems guaranteeing partition tolerance can gracefully recover from partitions once
the partition heals.
The use of the word consistency in CAP and its use in ACID do not refer to the same concept.
In CAP, the term consistency refers to the consistency of the values in different copies of the same data item
in a replicated distributed system. In ACID, it refers to the fact that a transaction will not violate the integrity
constraints specified on the database schema.
Partition refers to a communication break between nodes within a distributed system. Meaning,
if a node cannot receive any messages from another node in the system, there is a partition
between the two nodes. Partition could have been because of network failure, server crash, or
any other reason.
Databases are commonly classified according to which two of the three CAP properties they favor.
System designers must take the CAP theorem into consideration while designing or choosing distributed
storage, because when a network partition occurs, one of consistency or availability has to be sacrificed.
Document databases are considered to be non-relational (or NoSQL) databases. Instead of storing data
in fixed rows and columns, document databases use flexible documents. Document databases are the
most popular alternative to tabular, relational databases.
Consider that we need to store a huge number of purchase orders and we start partitioning. One of
the ways to partition is to keep the orderheader table in one DB instance and the lineitem information in
another. If you want to insert or update order information, you need to update both tables
atomically, and you need a transaction manager to ensure atomicity. If you want to scale this
further in terms of processing and data storage, you can only increase hard disk space and RAM.
While horizontal scaling refers to adding additional nodes, vertical scaling describes adding more
power to your current machines. For instance, if your server requires more processing power, vertical
scaling would mean upgrading the CPUs.
Let us consider another situation: because of a change in our business we added a new column to
the lineitem table called linedesc, and imagine that this application was running in production. Once
we deploy this change, we need to bring down the server for some time for the change to take effect.
What we would like, then, is:
1. Flexibility in scaling the database so that multiple instances of the database can process
the information in parallel
2. Flexibility so that changes to the database can be absorbed without long server
downtimes
3. An application/middle tier that does not have to handle the object-relational impedance mismatch – can we get
away with it using techniques like JSON (JavaScript Object Notation)?
Let us go back to our purchase order example, relax some of the aspects of RDBMS such as
normalization (to avoid joins of lots of rows) and atomicity, and see if we can achieve some of the above
objectives.
Below is an example of how we can store the purchase order (there are other, better ways of storing
the information). In outline:
orderheader: { ... },
lineitems: [ ... ]
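A minimal sketch of what such a document might look like (the field names and values here are our own illustration, not from the original example):
{
    orderheader: { orderid: 1001, orderdate: "2024-01-15", customer: "acme corp" },
    lineitems: [
        { item: "pendrive", qty: 2, price: 450 },
        { item: "mouse", qty: 1, price: 300 }
    ]
}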
If you notice carefully, the purchase order is stored in a JSON-document-like structure. You also
notice that we don't need multiple tables, relationships, or normalization, and hence there is no need to
join. And since the schema qualifiers are within the document, there is no table definition.
You can store them as a collection of objects/documents. Hypothetically, if we need to store several
million purchase orders, we can chunk them into groups and store them in several instances.
If you want to retrieve purchase orders based on specific criteria, for example all the purchase orders
in which one of the line items is a "pendrive", we can ask all the individual instances to retrieve in
"parallel" based on the same criteria, and one of them can consolidate the list and return the
information to the client. This is the concept of horizontal scaling.
Because there is no separate table schema and the schema definition is included in the JSON
object, we can change the document structure and store and retrieve it with just a change in the application
layer. This does not need a database restart.
Finally, since the object structure is JSON, we can directly present it to the web tier or a mobile device and
they will render it.
NoSQL is a classification of databases designed with the above aspects in mind.
To give an RDBMS analogy, collections in MongoDB are similar to tables, and documents are similar
to rows. Internally, MongoDB stores the information as binary serializable JSON objects called BSON.
MongoDB supports a JavaScript-style query syntax to retrieve BSON (binary JSON) objects.
A typical document looks as below:
> post = {
    author: "hergé",
    date: new Date(),
    text: "destination moon",
    tags: ["comic", "adventure"] }
> db.posts.save(post)
------------
> db.posts.find()
{
    _id: ObjectId("4c4ba5c0672c685e5e8aabf3"),
    author: "hergé",
    text: "destination moon",
    tags: ["comic", "adventure"]
}
In MongoDB, atomicity is guaranteed within a document. If you have to achieve atomicity outside of
the document, it has to be managed at the application level. Below is an example.
Many to many:
products: {
    _id: ObjectId("10"),
    name: "destinationmoon",
    category_ids: [ObjectId("20"), ObjectId("30")] }
categories: {
    _id: ObjectId("20"),
    name: "adventure" }
> product = db.products.findOne({ _id: some_id })
> db.categories.find({
    _id: { $in: product.category_ids }
})
In a typical stack that uses MongoDB, it makes a lot of sense to use a JavaScript-based framework; a
commonly used web stack is Express/Node.js/MongoDB.
MongoDB also supports sharding, which enables parallel processing/horizontal scaling. For
more details on how a typical big data system handles parallel processing/horizontal scaling, refer to
Ricky Ho's writing on the topic.
Typical use cases for MongoDB include event logging, real-time analytics, content management, and
ecommerce. Use cases where it is not a good fit are transaction-based banking systems and non-real-time
data warehousing.
Embedded data models allow applications to store related pieces of information in the same database record. As a result,
applications may need to issue fewer queries and updates to complete common operations.
you have "contains" relationships between entities. See Model One-to-One Relationships with Embedded Documents.
you have one-to-many relationships between entities. In these relationships the "many" or child documents always appear
with or are viewed in the context of the "one" or parent documents. See Model One-to-Many Relationships with Embedded
Documents.
In general, embedding provides better performance for read operations, as well as the ability to request and retrieve
related data in a single database operation. Embedded data models make it possible to update related data in a single
atomic write operation.
To access data within embedded documents, use dot notation to "reach into" the embedded documents. See query for
data in arrays and query data in embedded documents for more examples on accessing data in arrays and embedded
documents.
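For example (the collection and field names here are our own illustration):
// Match on a field inside an embedded document using dot notation
db.contacts.find({ "address.city": "Pune" })

// Match on a value inside an array field
db.posts.find({ tags: "comic" })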
Documents in MongoDB must be smaller than the maximum BSON document size.
In general, use references (normalized data models) instead when:
embedding would result in duplication of data but would not provide sufficient read performance
advantages to outweigh the implications of the duplication.
you need to represent more complex many-to-many relationships.
you need to model large hierarchical data sets.
When designing the schema of a database, it is impossible to know in advance all the queries that will be performed by
end users. An ad hoc query is a short-lived command whose value depends on a variable. Each time an ad hoc query is
executed, the result may be different, depending on the variables in question.
Optimizing the way in which ad-hoc queries are handled can make a significant difference at scale, when thousands to
millions of variables may need to be considered. This is why MongoDB, a document-oriented, flexible schema database,
stands apart as the cloud database platform of choice for enterprise applications that require real-time analytics. With ad-
hoc query support that allows developers to update ad-hoc queries in real time, the improvement in performance can be
game-changing.
MongoDB supports field queries, range queries, and regular expression searches. Queries can return specific fields and
also account for user-defined functions. This is made possible because MongoDB indexes BSON documents and uses
the MongoDB Query Language (MQL).
Without the right indices, a database is forced to scan documents one by one to identify the ones that match the query
statement. But if an appropriate index exists for each query, user requests can be optimally executed by the server.
MongoDB offers a broad range of indices and features with language-specific sort orders that support complex access
patterns to datasets.
Notably, MongoDB indices can be created on demand to accommodate real-time, ever-changing query patterns and
application requirements. They can also be declared on any field within any of your documents, including those nested
within arrays.
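As a brief sketch (the collection and field names are our own), indexes can be declared on top-level fields, on fields inside embedded documents, and on array fields:
// Single-field index
db.contacts.createIndex({ name: 1 })

// Index on a field nested within an embedded document
db.contacts.createIndex({ "address.city": 1 })

// Multikey index on an array field
db.posts.createIndex({ tags: 1 })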
Replication allows you to sidestep the vulnerabilities of a single point of failure by deploying multiple servers for disaster recovery and backup.
Horizontal scaling across multiple servers that house the same data (or shards of that same data) means greatly increased
data availability and stability. Naturally, replication also helps with load balancing. When multiple users access the same
data, the load can be distributed evenly across servers.
In MongoDB, replica sets are employed for this purpose. A primary server or node accepts all write operations and
applies those same operations across secondary servers, replicating the data. If the primary server should ever experience
a critical failure, any one of the secondary servers can be elected to become the new primary node. And if the former
primary node comes back online, it does so as a secondary server for the new primary node.
4. Sharding
When dealing with particularly large datasets, sharding—the process of splitting larger datasets across multiple
distributed collections, or “shards”—helps the database distribute and better execute what might otherwise be
problematic and cumbersome queries. Without sharding, scaling a growing web application with millions of daily users
is nearly impossible.
Like replication via replication sets, sharding in MongoDB allows for much greater horizontal scalability. Horizontal
scaling means that each shard in every cluster houses a portion of the dataset in question, essentially functioning as a
separate database. The collection of distributed server shards forms a single, comprehensive database much better suited
to handling the needs of a popular, growing application with zero downtime.
Zero downtime deployment is a deployment method where your website or application is never down or in an
unstable state during the deployment process. To achieve this the web server doesn't start serving the changed code
until the entire deployment process is complete.
All operations in a sharding environment are handled through a lightweight process called mongos. Mongos can direct
queries to the correct shard based on the shard key. Naturally, proper sharding also contributes significantly to better
load balancing.
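A minimal sketch of how sharding is enabled from the mongo shell, with requests routed through mongos (the database, collection, and shard key names are assumptions for illustration):
// Allow the database to be sharded
sh.enableSharding("salesdb")

// Shard the collection on a hashed key so documents spread evenly across shards
sh.shardCollection("salesdb.orders", { customerId: "hashed" })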
5. Load balancing
At the end of the day, optimal load balancing remains one of the holy grails of large-scale database management for
growing enterprise applications. Properly distributing millions of client requests to hundreds or thousands of servers can
lead to a noticeable (and much appreciated) difference in performance.
Fortunately, via horizontal scaling features like replication and sharding, MongoDB supports large-scale load balancing.
The platform can handle multiple concurrent read and write requests for the same data with best-in-class concurrency
control and locking protocols that ensure data consistency. There’s no need to add an external load balancer—MongoDB
ensures that each and every user has a consistent view and quality experience with the data they need to access.
This specific type of NoSQL database uses the key-value method and represents a collection of numerous key-value
pairs. The keys are unique identifiers for the values. The values can be any type of object -- a number or a string, or even
another key-value pair in which case the structure of the database grows more complex.
Unlike relational databases, key-value databases do not have a specified structure. Relational databases store data in
tables where each column has an assigned data type. Key-value databases are a collection of key-value pairs that are
stored as individual records and do not have a predefined data structure. The key can be anything, but seeing that it is
the only way of retrieving the value associated with it, naming the keys should be done strategically.
Key names can range from as simple as numbering to specific descriptions of the value that is about to follow. A key-
value database can be thought of as a dictionary or a directory. Dictionaries have words as keys and their meanings as
values.
Phonebooks have names of people as keys and their phone numbers as values. Just like key-value stores, unless you
know the name of the person whose number you need, you will not be able to find the right number.
Apart from basic operations such as get, put, and delete, key-value store databases do not have a query language. The data has
no declared type; its interpretation is determined by the requirements of the application used to process it.
A very useful feature is built-in redundancy improving the reliability of this database type.
With an increasing variety of data types and cheap storage options, we started stepping away from relational databases and looking into
nonrelational (NoSQL) databases.
A common and more specific use case is a shopping cart, where e-commerce websites can record
data pertaining to individual shopping sessions. Relational databases are better to use with payment transaction records;
however, session records prior to payment are probably better off in a key-value store. We know that more people fill
their shopping carts and subsequently change their mind about buying the selected items than those who proceed to
payment. Why fill a relational database with all this data when there is a more efficient and more reliable solution?
A key-value store will be quick to record and get data simultaneously. Also, with its built-in redundancy, it ensures that
no item from a cart gets lost. The scalability of key-value stores comes in handy in peak seasons around holidays or
during sales and special promotions because there is usually a sharp increase in sales and an even greater increase in
traffic on the website. The scalability of the key-value store will make sure that the increased load on the database does
not result in performance issues.
Advantages of key-value databases
Simplicity. As mentioned above, key-value databases are quite simple to use. The straightforward commands
and the absence of data types make work easier for programmers. With this feature data can assume any type,
or even multiple types, when needed.
Speed. This simplicity makes key value databases quick to respond, provided that the rest of the environment
around it is well-built and optimized.
Scalability. This is a beloved advantage of NoSQL databases over relational databases in general, and key-
value stores in particular. Unlike relational databases, which are only scalable vertically, key-value stores are
also infinitely scalable horizontally.
Easy to move. The absence of a query language means that the database can be easily moved between
different systems without having to change the architecture.
Reliability. Built-in redundancy comes in handy to cover for a lost storage node where duplicated data comes
in place of what's been lost.
Disadvantages of key-value databases
Not all key-value databases are the same, but some of the general drawbacks include the following:
Simplicity. The list of advantages and disadvantages demonstrates that everything is relative, and that what
generally comes as an advantage can also be a disadvantage. This further proves that you have to consider
your needs and options carefully before choosing a database to use. The fact that key-value stores are not
complex also means that they are not refined. There is no language nor straightforward means that would
allow you to query the database with anything else other than the key.
No query language. Without a unified query language to use, queries from one database may not be
transportable into a different key-value database.
Values can't be filtered. The database sees values as blobs so it cannot make much sense of what they
contain. When there is a request placed, whole values are returned -- rather than a specific piece of
information -- and when they get updated, the whole value needs to be updated.
Popular key-value databases include the following:
Amazon DynamoDB. DynamoDB is a database trusted by many large-scale users and users in general. It is
fully managed and reliable, with built-in backup and security options. It is able to endure high loads and
handle trillions of requests daily. These are just some of the many features supporting the reputation of
DynamoDB, apart from its famous name.
Aerospike. This is a real-time platform facilitating billions of transactions. It reduces the server footprint by
80% and enables high performance of real-time applications.
Redis. Redis is an open source key-value database. With keys containing lists, hashes, strings and sets, Redis
is known as a data structure server.
The list goes on and includes many strong competitors. Key-value databases serve a specific purpose, and they have
features that can add value to some but impose limitations on others. For this reason, you should always carefully assess
your requirements and the purpose of your data before you settle for a database. Once that is done, you can start looking
into your options and ensure that your database allows you to collect and make the most of your data without
compromising performance.
DynamoDB – Overview
DynamoDB allows users to create databases capable of storing and retrieving any amount of data, and serving any
amount of traffic. It automatically distributes data and traffic over servers to dynamically manage each customer's
requests, and also maintains fast performance.
What is DynamoDB?
DynamoDB is a hosted NoSQL database offered by Amazon Web Services (AWS). It offers:
reliable performance even as it scales;
a managed experience, so you won't be SSH-ing (SSH or Secure Shell is a network communication protocol that
enables two computers to communicate) into servers to upgrade the crypto libraries;
a small, simple API allowing for simple key-value access as well as more advanced query patterns.
DynamoDB works especially well for the following use cases:
Applications with large amounts of data and strict latency requirements. As your amount of data scales, JOINs and
advanced SQL operations can slow down your queries. With DynamoDB, your queries have predictable latency up to
any size, including over 100 TBs!
Serverless applications using AWS Lambda. AWS Lambda provides auto-scaling, stateless, ephemeral compute in
response to event triggers. DynamoDB is accessible via an HTTP API and performs authentication & authorization via
IAM roles, making it a perfect fit for building Serverless applications.
Data sets with simple, known access patterns. If you're generating recommendations and serving them to users,
DynamoDB's simple key-value access patterns make it a fast, reliable choice.
DynamoDB uses a NoSQL model, which means it uses a non-relational system. The following table highlights the
differences between DynamoDB and RDBMS −
Connect to the Source – RDBMS: uses a persistent connection and SQL commands. DynamoDB: uses HTTP requests and API operations.
Create a Table – RDBMS: its fundamental structures are tables, which must be defined. DynamoDB: uses only primary keys and no schema on creation; it uses various data sources.
Get Table Info – RDBMS: all table info remains accessible. DynamoDB: only primary keys are revealed.
Load Table Data – RDBMS: uses rows made of columns. DynamoDB: in tables, it uses items made of attributes.
Read Table Data – RDBMS: uses SELECT statements and filtering statements. DynamoDB: uses GetItem, Query, and Scan.
Manage Indexes – RDBMS: uses standard indexes created through SQL statements; modifications to them occur automatically on table changes. DynamoDB: uses a secondary index to achieve the same function; it requires specifications (partition key and sort key).
Advantages
The two main advantages of DynamoDB are scalability and flexibility. It does not force the use of a particular data
source and structure, allowing users to work with virtually anything, but in a uniform way.
Its design also supports a wide range of uses, from lighter tasks and operations to demanding enterprise functionality. It
also allows simple use of multiple languages: Ruby, Java, Python, C#, Erlang, PHP, and Perl.
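As a rough illustration of the simple key-value style access described above, the sketch below uses the AWS SDK for Java v2 to put and get a single item. The table name Customer, its partition key CustomerId, and the region are assumptions for this example, not details taken from the text above, and credentials are assumed to be configured in the environment.

import java.util.HashMap;
import java.util.Map;

import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.GetItemRequest;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

public class DynamoDbSketch {
    public static void main(String[] args) {
        // Assumes a table named "Customer" with partition key "CustomerId" already exists.
        DynamoDbClient ddb = DynamoDbClient.builder().region(Region.US_EAST_1).build();

        // PutItem: write one item, expressed as attribute name -> AttributeValue.
        Map<String, AttributeValue> item = new HashMap<>();
        item.put("CustomerId", AttributeValue.builder().s("C-1001").build());
        item.put("Name", AttributeValue.builder().s("Alice").build());
        ddb.putItem(PutItemRequest.builder().tableName("Customer").item(item).build());

        // GetItem: read the item back by its primary key.
        Map<String, AttributeValue> key = new HashMap<>();
        key.put("CustomerId", AttributeValue.builder().s("C-1001").build());
        Map<String, AttributeValue> result =
                ddb.getItem(GetItemRequest.builder().tableName("Customer").key(key).build()).item();

        System.out.println("Name = " + result.get("Name").s());
        ddb.close();
    }
}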
Limitations
DynamoDB does suffer from certain limitations, however, these limitations do not necessarily create huge problems or
hinder solid development.
Capacity Unit Sizes − A read capacity unit is a single consistent read per second for items no larger than 4KB.
A write capacity unit is a single write per second for items no bigger than 1KB.
Provisioned Throughput Min/Max − All tables and global secondary indices have a minimum of one read and
one write capacity unit. Maximums depend on region. In the US, 40K read and write remains the cap per table
(80K per account), and other regions have a cap of 10K per table with a 20K account cap.
A data cap (bandwidth cap) is a service provider-imposed limit on the amount of data transferred by
a user account at a specified level of throughput over a given time period, for a specified fee. The
term applies to both home Internet service and mobile data plans.
Data caps are usually imposed as a maximum allowed amount of data in a month for an agreed-upon
charge. As a rule, when the user exceeds that limit, they are charged at a higher rate for further data
use.
Provisioned Throughput Increase and Decrease − You can increase this as often as needed, but decreases
remain limited to no more than four times daily per table.
Table Size and Quantity Per Account − Table sizes have no limits, but accounts have a 256 table limit unless
you request a higher cap.
Secondary Indexes Per Table − Five local and five global are permitted.
Partition Key Length and Values − Their minimum length sits at 1 byte, and maximum at 2048 bytes, however,
DynamoDB places no limit on values.
Sort Key Length and Values − Its minimum length stands at 1 byte, and maximum at 1024 bytes, with no limit
for values unless its table uses a local secondary index.
Table and Secondary Index Names − Names must conform to a minimum of 3 characters in length, and a
maximum of 255. They use the following characters: A-Z, a-z, 0-9, “_”, “-”, and “.”.
Attribute Names − One character remains the minimum, and 64KB the maximum, with exceptions for keys and
certain attributes.
Reserved Words − DynamoDB does not prevent the use of reserved words as names.
Expression Length − Expression strings have a 4KB limit. Attribute expressions have a 255-byte limit.
Substitution variables of an expression have a 2MB limit.
Project Voldemort
Voldemort is a distributed key-value storage system. It is used at LinkedIn by numerous critical services powering a large portion of the site.
Voldemort is not a relational database, it does not attempt to satisfy arbitrary relations while satisfying ACID properties. Nor is it
an object database that attempts to transparently map object reference graphs. Nor does it introduce a new abstraction such as
document-orientation. It is basically just a big, distributed, persistent, fault-tolerant hash table. For applications that can use an O/R
mapper like active-record or hibernate this will provide horizontal scalability and much higher availability but at great loss of
convenience. For large applications under internet-type scalability pressure, a system may likely consist of a number of functionally
partitioned services or APIs, which may manage storage resources across multiple data centers using storage systems which may
themselves be horizontally partitioned. For applications in this space, arbitrary in-database joins are already impossible since all the
data is not available in any single database. A typical pattern is to introduce a caching layer which will require hashtable semantics
anyway. For these applications Voldemort offers a number of advantages:
Voldemort combines in memory caching with the storage system so that a separate caching tier is not required (instead the
storage system itself is just fast)
Unlike MySQL replication, both reads and writes scale horizontally
Data partitioning is transparent, and allows for cluster expansion without rebalancing all data
Data replication and placement is decided by a simple API to be able to accommodate a wide range of application specific
strategies
The storage layer is completely mockable so development and unit testing can be done against a throw-away in-memory
storage system without needing a real cluster (or even a real storage system) for simple testing
A wide-column database is a NoSQL database that organizes data storage into flexible columns that can be spread
across multiple servers or database nodes, using multi-dimensional mapping to reference data by column, row, and
timestamp.
Wide-column stores use the typical tables, columns, and rows, but unlike relational databases (RDBs),
column names and formatting can vary from row to row inside the same table. And each column is stored
separately on disk.
Columnar databases store each column in a separate file. One file stores only the key column, the other
only the first name, the other the ZIP, and so on. Each column in a row is governed by auto-indexing — each functions almost as an index — which means that a scanned/queried column's offset corresponds to the other columns' offsets in that row in their respective files.
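A minimal, database-agnostic sketch of this layout difference: a row-oriented store keeps each record's fields together, while a column-oriented store keeps each column in its own array (or file), so a scan over one column never touches the others. The field names here are illustrative only.

import java.util.List;

public class LayoutSketch {
    // Row-oriented: each record keeps all of its fields together.
    record CustomerRow(String name, String zip, double balance) {}

    // Column-oriented: one list per column; position i in every
    // column belongs to the same logical row (the shared offset).
    static final class CustomerColumns {
        final List<String> names;
        final List<String> zips;
        final List<Double> balances;
        CustomerColumns(List<String> names, List<String> zips, List<Double> balances) {
            this.names = names; this.zips = zips; this.balances = balances;
        }
    }

    public static void main(String[] args) {
        List<CustomerRow> rows = List.of(
                new CustomerRow("Alice", "411001", 120.0),
                new CustomerRow("Bob",   "560001",  75.5));

        CustomerColumns cols = new CustomerColumns(
                List.of("Alice", "Bob"),
                List.of("411001", "560001"),
                List.of(120.0, 75.5));

        // Row store: summing one column still walks every whole record.
        double sumRows = rows.stream().mapToDouble(CustomerRow::balance).sum();

        // Column store: the same aggregation reads only the balance column.
        double sumCols = cols.balances.stream().mapToDouble(Double::doubleValue).sum();

        System.out.println(sumRows + " == " + sumCols);
    }
}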
Traditional row-oriented storage gives you the best performance when querying multiple columns of a
single row. Of course, relational databases are structured around columns that hold very specific
information, upholding that specificity for each entry. For instance, let’s take a Customer table. Column
values contain Customer names, addresses, and contact info. Every Customer has the same format.
Columnar families are different. They give you automatic vertical partitioning; storage is both column-
based and organized by less restrictive attributes. RDB tables are also restricted to row-based storage and
deal with tuple storage in rows, accounting for all attributes before moving forward; e.g., tuple 1 attribute
1, tuple 1 attribute 2, and so on — then tuple 2 attribute 1, tuple 2 attribute 2, and so on — in that order.
The opposite is columnar storage, which is why we use the term column families.
Note: some columnar systems also have the option for horizontal partitions at default of, say, 6 million
rows. When it’s time to run a scan, this eliminates the need to partition during the actual query. Set up your
system to sort its horizontal partitions at default based on the most commonly used columns. This minimizes
the number of extents containing the values you are looking for.
One useful option if offered — InfiniDB is one example that does — is to automatically create horizontal
partitions based on the most recent queries. This eliminates the impact of much older queries that are no
longer crucial.
A wide-column store (or extensible record store) is a type of NoSQL database. It uses tables, rows, and columns, but unlike a relational
database, the names and format of the columns can vary from row to row in the same table. A wide-column store can be interpreted as a
two-dimensional key–value store.
Wide-column stores that support column families are also known as column family databases.
Notable wide-column stores include:
Apache Accumulo
Apache Cassandra
Apache HBase
Bigtable
DataStax Enterprise
DataStax Astra DB
Hypertable
Azure Tables
Scylla (database)
Wide column databases, or column family databases, refer to a category of NoSQL databases that works well for storing
enormous amounts of data that can be collected. Its architecture uses persistent, sparse matrix, multi-dimensional
mapping (row-value, column-value, and timestamp) in a tabular format meant for massive scalability (over and above
the petabyte scale). Column family stores do not follow the relational model, and they aren’t optimized for joins.
Wide column databases are not the preferred choice for applications with ad-hoc query patterns, high level aggregations
and changing database requirements. This type of data store does not keep good data lineage.
Definitions of wide-column databases from various sources include:
“A multidimensional nested sorted map of maps, where data is stored in cells of columns and grouped into column
families.” (Akshay Pore)
“Scalability and high availability without compromising performance.” (Apache)
Database management systems that organize related facts into columns. (Forbes)
“Databases [that] are similar to key-value but allow a very large number of columns. They are well suited for
analyzing huge data sets, and Cassandra is the best known.” (IBM)
A store that groups data into columns and allows for an infinite number of them. (Temple University)
A store with data as rows and columns, like a RDBMS, but able to handle more ambiguous and complex data
types, including unformatted text and imagery. (Michelle Knight)
What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-source project and
is horizontally scalable.
HBase is a data model that is similar to Google's Bigtable, designed to provide quick random access to huge amounts
of structured data. It leverages the fault tolerance provided by the Hadoop File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumer reads/accesses the data in HDFS
randomly using HBase. HBase sits on top of the Hadoop File System and provides read and write access.
HDFS vs. HBase:
HDFS is a distributed file system suitable for storing large files. HBase is a database built on top of HDFS.
HDFS does not support fast individual record lookups. HBase provides fast lookups for larger tables.
HDFS provides high-latency batch processing. HBase provides low-latency access to single rows from billions of records (random access).
HDFS provides only sequential access to data. HBase internally uses hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.
Column-oriented databases are those that store data tables as sections of columns of data, rather than as rows of data. In short, they have column families.
Row-oriented databases are suitable for Online Transaction Processing (OLTP), whereas column-oriented databases are suitable for Online Analytical Processing (OLAP).
Row-oriented databases are designed for a small number of rows and columns, whereas column-oriented databases are designed for huge tables.
HBase vs. RDBMS:
HBase is schema-less; it doesn't have the concept of a fixed column schema and defines only column families. An RDBMS is governed by its schema, which describes the whole structure of its tables.
HBase is built for wide tables and is horizontally scalable. An RDBMS is thin and built for small tables, and is hard to scale.
HBase is good for semi-structured as well as structured data. An RDBMS is good for structured data.
Features of HBase
Apache HBase is used to have random, real-time read/write access to Big Data.
Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable works on top of the Google File System, Apache HBase works on top of Hadoop and HDFS.
Applications of HBase
HBase is used whenever we need to provide fast random access to available data.
Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
Introduction
HBase is a column-oriented database that’s an open-source implementation of Google’s Big Table storage architecture.
It can manage structured and semi-structured data and has some built-in features such as scalability, versioning,
compression and garbage collection.
Since it uses write-ahead logging and distributed configuration, it can provide fault tolerance and quick recovery from individual server failures. HBase is built on top of Hadoop/HDFS, and the data stored in HBase can be manipulated using Hadoop's MapReduce capabilities.
Let's now take a look at how HBase (a column-oriented database) is different from some other data structures and concepts that we are familiar with, starting with row-oriented vs. column-oriented data stores. As shown below, in a row-oriented
data store, a row is a unit of data that is read or written together. In a column-oriented data store, the data in a column is
stored together and hence quickly retrieved.
Row-oriented data stores –
Data is stored and retrieved one row at a time and hence could read unnecessary data if only some of the
data in a row is required.
Easy to read and write records
Well suited for OLTP systems
Not efficient in performing operations applicable to the entire dataset and hence aggregation is an
expensive operation
Typical compression mechanisms provide less effective results than those on column-oriented data stores
Column-oriented data stores –
Data is stored and retrieved in columns and hence can read only relevant data if only some data is required
Read and Write are typically slower operations
Well suited for OLAP systems
Can efficiently perform operations applicable to the entire dataset and hence enables aggregation over
many rows and columns
Permits high compression rates due to few distinct values in columns
Compared to a relational database, HBase –
Is Schema-less
Is a Column-oriented datastore
Is designed to store Denormalized Data
Contains wide and sparsely populated tables
Supports Automatic Partitioning
HBase Architecture
The HBase Physical Architecture consists of servers in a Master-Slave relationship as shown below. Typically, the HBase
cluster has one Master node, called HMaster and multiple Region Servers called HRegionServer. Each Region Server
contains multiple Regions – HRegions.
Just like in a Relational Database, data in HBase is stored in Tables and these Tables are stored in Regions. When a Table
becomes too big, the Table is partitioned into multiple Regions. These Regions are assigned to Region Servers across
the cluster. Each Region Server hosts roughly the same number of Regions.
The HMaster in the HBase is responsible for
Performing Administration
Managing and Monitoring the Cluster
Assigning Regions to the Region Servers
Controlling the Load Balancing and Failover
The mapping of Regions to Region Server is kept in a system table called .META. When trying to read or write data
from HBase, the clients read the required Region information from the .META table and directly communicate with the
appropriate Region Server. Each Region is identified by the start key (inclusive) and the end key (exclusive)
Tables – The HBase Tables are more like logical collection of rows stored in separate partitions called Regions. As
shown above, every Region is then served by exactly one Region Server. The figure above shows a representation of a
Table.
Rows – A row is one instance of data in a table and is identified by a rowkey. Rowkeys are unique in a Table and are
always treated as a byte[].
Column Families – Data in a row are grouped together as Column Families. Each Column Family has one or more Columns
and these Columns in a family are stored together in a low level storage file known as HFile. Column Families form the
basic unit of physical storage to which certain HBase features like compression are applied. Hence it’s important that
proper care be taken when designing Column Families in a table.
The table above shows Customer and Sales Column Families. The Customer Column Family is made up of 2 columns – Name and City, whereas the Sales Column Family is made up of 2 columns – Product and Amount.
Columns – A Column Family is made of one or more columns. A Column is identified by a Column Qualifier that
consists of the Column Family name concatenated with the Column name using a colon – example:
columnfamily:columnname. There can be multiple Columns within a Column Family and Rows within a table can have
varied number of Columns.
Cell – A Cell stores data and is essentially a unique combination of rowkey, Column Family and the Column (Column
Qualifier). The data stored in a Cell is called its value and the data type is always treated as byte[].
Version – The data stored in a cell is versioned and versions of data are identified by the timestamp. The number of
versions of data retained in a column family is configurable and this value by default is 3.
Class HBaseConfiguration
This class adds HBase configuration files to a Configuration. It belongs to the org.apache.hadoop.hbase package.
Methods
1. static org.apache.hadoop.conf.Configuration create() – Creates a Configuration with HBase resources.
Class HTable
HTable is an HBase internal class that represents an HBase table. It is an implementation of table that is used to
communicate with a single HBase table. This class belongs to the org.apache.hadoop.hbase.client package.
Constructors
1. HTable()
2. HTable(TableName tableName, ClusterConnection connection, ExecutorService pool) – Using this constructor, you can create an object to access an HBase table.
Methods
1. void close()
2. void delete(Delete delete)
3. boolean exists(Get get) – Using this method, you can test the existence of columns in the table, as specified by Get.
4. Result get(Get get)
5. org.apache.hadoop.conf.Configuration getConfiguration()
6. TableName getName()
7. HTableDescriptor getTableDescriptor()
8. byte[] getTableName()
9. void put(Put put) – Using this method, you can insert data into the table.
Class Put
This class is used to perform Put operations for a single row. It belongs to
the org.apache.hadoop.hbase.client package.
Constructors
1. Put(byte[] row) – Using this constructor, you can create a Put operation for the specified row.
2. Put(byte[] rowArray, int rowOffset, int rowLength) – Using this constructor, you can make a copy of the passed-in row key to keep local.
3. Put(byte[] rowArray, int rowOffset, int rowLength, long ts) – Using this constructor, you can make a copy of the passed-in row key to keep local.
4. Put(byte[] row, long ts) – Using this constructor, we can create a Put operation for the specified row, using a given timestamp.
Methods
1. Put add(byte[] family, byte[] qualifier, byte[] value)
2. Put add(byte[] family, byte[] qualifier, long ts, byte[] value) – Adds the specified column and value, with the specified timestamp as its version, to this Put operation.
3. Put add(byte[] family, ByteBuffer qualifier, long ts, ByteBuffer value) – Adds the specified column and value, with the specified timestamp as its version, to this Put operation.
Class Get
This class is used to perform Get operations on a single row. This class belongs to
the org.apache.hadoop.hbase.client package.
Constructors
1. Get(byte[] row) – Using this constructor, you can create a Get operation for the specified row.
2. Get(Get get)
Methods
1. Get addColumn(byte[] family, byte[] qualifier) – Retrieves the column from the specific family with the specified qualifier.
2. Get addFamily(byte[] family)
Class Delete
This class is used to perform Delete operations on a single row. To delete an entire row, instantiate a Delete object with
the row to delete. This class belongs to the org.apache.hadoop.hbase.client package.
Constructors
1. Delete(byte[] row)
2. Delete(byte[] rowArray, int rowOffset, int rowLength)
3. Delete(byte[] rowArray, int rowOffset, int rowLength, long ts)
4. Delete(byte[] row, long timestamp)
Methods
1. Delete addColumn(byte[] family, byte[] qualifier)
2. Delete addColumns(byte[] family, byte[] qualifier, long timestamp) – Deletes all versions of the specified column with a timestamp less than or equal to the specified timestamp.
3. Delete addFamily(byte[] family)
4. Delete addFamily(byte[] family, long timestamp) – Deletes all columns of the specified family with a timestamp less than or equal to the specified timestamp.
Class Result
This class is used to get a single row result of a Get or a Scan query.
Constructors
1. Result() – Using this constructor, you can create an empty Result with no KeyValue payload; it returns null if you call rawCells().
Methods
1. byte[] getValue(byte[] family, byte[] qualifier) – This method is used to get the latest version of the specified column.
2. byte[] getRow() – This method is used to retrieve the row key that corresponds to the row from which this Result was created.
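Putting the classes above together, the sketch below writes, reads, and deletes one row. It uses the Connection/Table interfaces of newer HBase client versions (where addColumn plays the role of the add methods listed above, and Table replaces direct use of HTable); the table name customer and the Customer/Sales column families are assumptions carried over from the earlier example, and the cluster configuration (hbase-site.xml) is expected to be on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCrudSketch {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath.
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("customer"))) {

            // Put: one row with cells in two column families.
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("Customer"), Bytes.toBytes("Name"), Bytes.toBytes("Alice"));
            put.addColumn(Bytes.toBytes("Sales"), Bytes.toBytes("Product"), Bytes.toBytes("Chair"));
            table.put(put);

            // Get: fetch the latest version of one column of that row.
            Get get = new Get(Bytes.toBytes("row-1"));
            get.addColumn(Bytes.toBytes("Customer"), Bytes.toBytes("Name"));
            Result result = table.get(get);
            System.out.println("Name = " + Bytes.toString(
                    result.getValue(Bytes.toBytes("Customer"), Bytes.toBytes("Name"))));

            // Delete: remove one column, leaving the rest of the row in place.
            Delete delete = new Delete(Bytes.toBytes("row-1"));
            delete.addColumn(Bytes.toBytes("Sales"), Bytes.toBytes("Product"));
            table.delete(delete);
        }
    }
}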
The basic unit of scalability and load balancing in HBase is called a region. Regions are essentially contiguous ranges of
rows stored together. They are dynamically split by the system when they become too large. Alternatively, they may also
be merged to reduce their number and required storage files. Each region is served by exactly one region server, and
each of these servers can serve many regions at any time.
Splitting and serving regions can be thought of as autosharding, as offered by other systems. The regions allow for fast
recovery when a server fails, and fine-grained load balancing since they can be moved between servers when the load of
the server currently serving the region is under pressure, or if that server becomes unavailable because of a failure or
because it is being decommissioned.
Splitting is also very fast—close to instantaneous—because the split regions simply read from the original storage files
until a compaction rewrites them into separate ones asynchronously.
1.3.3 Storage API
The API offers operations to create and delete tables and column families. In addition, it has functions to change the
table and column family metadata, such as compression or block sizes. Furthermore, there are the usual operations for
clients to create or delete values as well as retrieving them with a given row key.
A scan API allows you to efficiently iterate over ranges of rows and be able to limit which columns are returned or the
number of versions of each cell. You can match columns using filters and select versions using time ranges, specifying
start and end times.
1.3.4 Implementation (Architecture)
The data is stored in store files, called HFiles, which are persistent and ordered immutable maps from keys to values.
Internally, the files are sequences of blocks with a block index stored at the end. The index is loaded when the HFile is
opened and kept in memory. The default block size is 64 KB but can be configured differently if required. The store files
provide an API to access specific values as well as to scan ranges of values given a start and end key.
The store files are typically saved in the Hadoop Distributed File System (HDFS), which provides a scalable, persistent,
replicated storage layer for HBase. It guarantees that data is never lost by writing the changes across a configurable
number of physical servers. When data is updated it is first written to a commit log, called a write-ahead log (WAL) in
HBase, and then stored in the in-memory memstore. Once the data in memory has exceeded a given maximum value, it
is flushed as an HFile to disk.
There are three major components to HBase: the client library, one master server, and many region servers. The region
servers can be added or removed while the system is up and running to accommodate changing workloads. The master
is responsible for assigning regions to region servers and uses Apache ZooKeeper, a reliable, highly available, persistent
and distributed coordination service, to facilitate that task.
The master server is also responsible for handling load balancing of regions across region servers, to unload busy
servers and move regions to less occupied ones. The master is not part of the actual data storage or retrieval path. It
negotiates load balancing and maintains the state of the cluster, but never provides any data services to either the region
servers or the clients, and is therefore lightly loaded in practice. In addition, it takes care of schema changes and other
metadata operations, such as creation of tables and column families.
Region servers are responsible for all read and write requests for all regions they serve, and also split regions that have
exceeded the configured region size thresholds. Clients communicate directly with them to handle all data-related
operations.
The NoSQL graph database is a technology for data management designed to handle very large sets of structured,
semi-structured or unstructured data. The semantic graph database (also known as RDF triplestore) is a type of
NoSQL graph database that is capable of integrating heterogeneous data from many sources and making links
between datasets. It focuses on the relationships between entities and is able to infer new knowledge out of existing
information.
The NoSQL (‘not only SQL’) graph database is a technology for data management designed to handle very large sets of
structured, semi-structured or unstructured data. It helps organizations access, integrate and analyze data from various
sources, thus helping them with their big data and social media analytics.
The traditional approach to data management, the relational database, was developed in the 1970s to help enterprises
store structured information. The relational database needs its schema (the definition of how data is organized and how the
relations are associated) to be defined before any new information is added.
Today, however, mobile, social and Internet of Things (IoT) data is everywhere, with unstructured real-time data piling
up by the minute. Apart from handling massive amounts of data of all kinds, the NoSQL graph database does not need
its schema re-defined before adding new data.
This makes the graph database much more flexible, dynamic and lower-cost in integrating new data sources than
relational databases.
Compared to the moderate data velocity from one or few locations of the relational databases, NoSQL graph databases
are able to store, retrieve, integrate and analyze high-velocity data coming from many locations, e.g., Facebook.
The semantic graph database is a type of NoSQL graph database that is capable of integrating heterogeneous data from
many sources and making links between datasets.
The semantic graph database, also referred to as an RDF triplestore, focuses on the relationships between entities and is
able to infer new knowledge out of existing information. It is a powerful tool to use in relationship-centered analytics
and knowledge discovery.
In addition, the capability to handle massive datasets and the schema-less approach support the NoSQL semantic graph
database usage in real-time big data analytics.
In relational databases, the need to have the schemas defined before adding new information restricts data integration from
new sources because the whole schema needs to be changed anew.
With the schema-less NoSQL semantic graph database, there is no need to change schemas every time a new data source is about to be added, so enterprises integrate data with less effort and cost.
The semantic graph database stands out from the other types of graph databases with its ability to additionally support
rich semantic data schema, the so-called ontologies.
The semantic NoSQL graph database gets the best of both worlds: on the one hand, data is flexible because it does not
depend on the schema. On the other hand, ontologies give the semantic graph database the freedom and ability to build
logical models any way organizations find it useful for their applications, without having to change the data.
Apart from rich semantic models, semantic graph databases use the globally developed W3C standards for representing
data on the Web. The use of standard practices makes data integration, exchange and mapping to other datasets easier
and lowers the risk of vendor lock-in while working with a graph database.
One of those standards is the Uniform Resource Identifier (URI), a kind of unique ID for all things linked so that we can
distinguish between them or know that one thing from one dataset is the same as another in a different dataset. The use
of URIs not only reduces costs in integrating data from disparate sources, it also makes data publishing and sharing easier
with mapping to Linked (Open) Data.
Ontotext’s GraphDB is able to use inference, that is, to infer new links out of existing explicit statements in the RDF
triplestore. Inference enriches the graph database by creating new knowledge and gives organizations the ability to see
all their data highly interlinked. Thus, enterprises have more insights at hand to use in their decision-making processes.
Apart from representing proprietary enterprise data in a linked and meaningful way, the NoSQL graph database makes
content management and personalization easier, due to its cost-effective way of integrating and combining huge sets of
data.
As we all know, a graph is a pictorial representation of data in the form of nodes and relationships, which are
represented by edges. A graph database is a type of database used to represent the data in the form of a graph. It has
three components: nodes, relationships, and properties. These components are used to model the data. The concept of
a Graph Database is based on the theory of graphs. It was introduced around the year 2000. Graph databases are commonly classified as NoSQL databases because data is stored using nodes, relationships and properties instead of traditional tables. A graph
database is very useful for heavily interconnected data. Here relationships between data are given priority and therefore
the relationships can be easily visualized. They are flexible as new data can be added without hampering the old ones.
They are useful in the fields of social networking, fraud detection, AI Knowledge graphs etc.
The description of components are as follows:
Nodes: represent the objects or instances. They are equivalent to a row in a relational database. The node basically acts
as a vertex in a graph. The nodes are grouped by applying a label to each member.
Relationships: They are basically the edges in the graph. They have a specific direction, type and form
patterns of the data. They basically establish relationship between nodes.
Properties: They are the information associated with the nodes.
Some examples of graph database software are Neo4j, Oracle NoSQL DB, GraphBase, etc., of which Neo4j is
the most popular one.
In traditional databases, the relationships between data are not established. But in the case of a Graph Database, the
relationships between data are prioritized. Nowadays data is mostly interconnected, with one piece of data connected
directly or indirectly. Since the concept of this database is based on graph theory, it is flexible and works very fast for
associative data. Often data are interconnected to one another which also helps to establish further relationships. It
works fast in the querying part as well because with the help of relationships we can quickly find the desired nodes.
Join operations are not required in this database, which reduces the cost. The relationships and properties are stored as
first-class entities in Graph Database.
Graph databases allow organizations to connect the data with external sources as well. Since organizations require a
huge amount of data, it often becomes cumbersome to store data in the form of tables. For instance, if the organization wants to find a particular piece of data that is connected with data in another table, a join operation is first performed between the tables, and then the search for the data is done row by row. A graph database solves this problem: it stores the relationships and properties along with the data. So if the organization needs to search for a particular piece of data, the nodes can be found with the help of relationships and properties, without joining tables or traversing row by row. Thus the searching of nodes is not dependent on the amount of data.
Neo4j is the world's leading open source Graph Database which is developed using Java technology. It is highly scalable
and schema free (NoSQL).
Used by: Walmart, eBay, NASA, Microsoft, IBM
Graph database is a database used to model the data in the form of graph. In here, the nodes of a graph depict the entities
while the relationships depict the association of these nodes.
Relational databases store highly structured data, with several records storing the same type of data, and they do not store the relationships between the data.
Unlike other databases, graph databases store relationships and connections as first-class entities.
The data model for graph databases is simpler compared to other databases and, they can be used with OLTP systems.
They provide features like transactional integrity and operational availability.
RDBMS – Graph Database
1. Tables – Graphs
2. Rows – Nodes
3. Constraints – Relationships
4. Joins – Traversal
Advantages of Neo4j
High availability − Neo4j is highly available for large enterprise real-time applications with transactional
guarantees.
Connected and semi-structured data − Using Neo4j, you can easily represent connected and semi-structured
data.
Easy retrieval − Using Neo4j, you can not only represent but also easily retrieve (traverse/navigate) connected
data faster when compared to other databases.
Cypher query language − Neo4j provides a declarative query language to represent the graph visually, using
an ascii-art syntax. The commands of this language are in human readable format and very easy to learn.
No joins − Using Neo4j, it does NOT require complex joins to retrieve connected/related data as it is very easy
to retrieve its adjacent node or relationship details without joins or indexes.
Features of Neo4j
Data model (flexible schema) − Neo4j follows a data model named native property graph model. Here, the
graph contains nodes (entities) and these nodes are connected with each other (depicted by relationships). Nodes
and relationships store data in key-value pairs known as properties.
In Neo4j, there is no need to follow a fixed schema. You can add or remove properties as per requirement. It
also provides schema constraints.
ACID properties − Neo4j supports full ACID (Atomicity, Consistency, Isolation, and Durability) rules.
Scalability and reliability − You can scale the database by increasing the number of reads/writes, and the
volume without affecting the query processing speed and data integrity. Neo4j also provides support
for replication for data safety and reliability.
Cypher Query Language − Neo4j provides a powerful declarative query language known as Cypher. It uses
ASCII-art for depicting graphs. Cypher is easy to learn and can be used to create and retrieve relations between
data without using the complex queries like Joins.
Built-in web application − Neo4j provides a built-in Neo4j Browser web application. Using this, you can create
and query your graph data.
Drivers and APIs − Neo4j provides:
o a REST (representational state transfer) API to work with programming languages such as Java;
o two kinds of Java API (application program interface), the Cypher API and the Native Java API, to develop Java applications. In addition to these, you can also work with other databases such as MongoDB, Cassandra, etc.
Neo4j Graph Database follows the Property Graph Model to store and manage its data.
Nodes are represented using circles and Relationships are represented using arrows
Each Relationship contains "Start Node" or "From Node" and "To Node" or "End Node"
In the Property Graph Data Model, and therefore in Neo4j, relationships must be directional. If we try to create a relationship without a direction, Neo4j will throw an error message saying that relationships should be directional.
Neo4j Graph Database stores all of its data in Nodes and Relationships. We neither need any additional RDBMS
Database nor any SQL database to store Neo4j database data. It stores its data in terms of Graphs in its native format.
Neo4j uses Native GPE (Graph Processing Engine) to work with its Native graph storage format.
The building blocks of the Neo4j property graph data model are:
Nodes
Relationships
Properties
MATCH (G:Company {name: "GeeksforGeeks"})
RETURN G
This Cypher statement will return the "Company" node whose "name" property is "GeeksforGeeks". Here "G" works like a variable that holds the data your Cypher query matches and then returns. (Here "Company" is the node's label and "name" is a property of the node.) If you know SQL, the same query could be written against a relational table, but Neo4j is a NoSQL database and does not use the SQL language.
Neo4j deals with nodes, and nodes carry labels such as "Person", "Employee" or "Employer" – anything that describes the type of entity the node represents.
Neo4j nodes also have properties like "name", "employee_id" or "phone_number" that give us information about the node.
Neo4j's relationships can also contain properties, but this is not mandatory.
A relationship in Neo4j captures a situation such as "X works for GeeksforGeeks": here X and GeeksforGeeks are nodes and "works for" is the relationship. In Cypher's ASCII-art syntax, a relationship between two nodes X and Y is written as (X)-[:RELATIONSHIP_TYPE]->(Y), so this example becomes:
(X)-[:WORKS_FOR]->(GeeksforGeeks)
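As a small illustration of running Cypher like the statement above from an application, the sketch below uses the official Neo4j Java driver (4.x API). The bolt URI, the credentials, and the Company node used here are assumptions for the example, not values taken from the text.

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Record;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class Neo4jSketch {
    public static void main(String[] args) {
        // Connection details are assumptions; adjust them to your own instance.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            // Create (or reuse) a node with a label and a property.
            session.run("MERGE (c:Company {name: $name})",
                    Values.parameters("name", "GeeksforGeeks"));

            // Match it back and read the property.
            Result result = session.run(
                    "MATCH (g:Company {name: $name}) RETURN g.name AS name",
                    Values.parameters("name", "GeeksforGeeks"));
            while (result.hasNext()) {
                Record record = result.next();
                System.out.println("Found company: " + record.get("name").asString());
            }
        }
    }
}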
Examples of Big Data
Stock Exchange: The New York Stock Exchange is an example of Big Data; it generates about one terabyte of new trade data per day.
Social Media
The statistic shows that 500+terabytes of new data get ingested into the databases of social media
site Facebook, every day. This data is mainly generated in terms of photo and video uploads, message
exchanges, putting comments etc.
Jet Engine: A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousand
flights per day, generation of data reaches up to many Petabytes.
Characteristics of Big Data:
Volume
Variety
Velocity
Variability
Veracity
(i) Volume – The name Big Data itself is related to a size which is enormous. Size of data plays a very crucial role in
determining value out of data. Also, whether a particular data can actually be considered as a Big Data or not, is dependent
upon the volume of data. Hence, ‘Volume’ is one characteristic which needs to be considered while dealing with Big
Data solutions.
(ii) Variety – Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. During earlier days,
spreadsheets and databases were the only sources of data considered by most of the applications. Nowadays, data in the
form of emails, photos, videos, monitoring devices, PDFs, audio, etc. are also being considered in the analysis
applications. This variety of unstructured data poses certain issues for storage, mining and analyzing data.
(iii) Velocity – The term ‘velocity’ refers to the speed of generation of data. How fast the data is generated and processed
to meet the demands, determines real potential in the data.
Big Data Velocity deals with the speed at which data flows in from sources like business processes, application logs,
networks, and social media sites, sensors, Mobile devices, etc. The flow of data is massive and continuous.
(iv) Variability – This refers to the inconsistency which can be shown by the data at times, thus hampering the process
of being able to handle and manage the data effectively.
(v) Veracity – Veracity refers to the quality of data. Because data comes from so many different sources, it's difficult to
link, match, cleanse and transform data across systems. Businesses need to connect and correlate relationships,
hierarchies and multiple data linkages. Otherwise, their data can quickly spiral out of control.
Access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their
business strategies.
Big Data technologies can be used for creating a staging area or landing zone for new data before identifying what data
should be moved to the data warehouse. In addition, such integration of Big Data technologies and data warehouse helps
an organization to offload infrequently accessed data.
Summary
Big Data definition: Big Data is a term used to describe a collection of data that is huge in size and yet growing exponentially with time.
Big Data analytics examples includes stock exchanges, social media sites, jet engines, etc.
Big Data could be 1) Structured, 2) Unstructured, 3) Semi-structured
Volume, Variety, Velocity, Variability, and Veracity are a few Big Data characteristics
Improved customer service, better operational efficiency, and better decision making are a few advantages of Big Data
MapReduce
MapReduce is a programming model for writing applications that can process Big Data in parallel on multiple nodes.
MapReduce provides analytical capabilities for analyzing huge volumes of complex data.
Why MapReduce?
Traditional Enterprise Systems normally have a centralized server to store and process data. The following illustration
depicts a schematic view of a traditional enterprise system. The traditional model is certainly not suitable for processing huge volumes of scalable data, which cannot be accommodated by standard database servers. Moreover, the centralized system creates too much of a bottleneck while processing multiple files simultaneously.
Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a task into small parts
and assigns them to many computers. Later, the results are collected at one place and integrated to form the result
dataset.
How MapReduce Works?
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The Map task takes a set of data and converts it into another set of data, where individual elements are broken
down into tuples (key-value pairs).
The Reduce task takes the output from the Map as an input and combines those data tuples (key-value pairs) into
a smaller set of tuples.
Let us now take a close look at each of the phases and try to understand their significance.
Input Phase − Here we have a Record Reader that translates each record in an input file and sends the parsed
data to the mapper in the form of key-value pairs.
Map − Map is a user-defined function, which takes a series of key-value pairs and processes each one of them
to generate zero or more key-value pairs.
Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
Combiner − A combiner is a type of local Reducer that groups similar data from the map phase into identifiable
sets. It takes the intermediate keys from the mapper as input and applies a user-defined code to aggregate the
values in a small scope of one mapper. It is not a part of the main MapReduce algorithm; it is optional.
Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-value
pairs onto the local machine, where the Reducer is running. The individual key-value pairs are sorted by key
into a larger data list. The data list groups the equivalent keys together so that their values can be iterated easily
in the Reducer task.
Reducer − The Reducer takes the grouped key-value paired data as input and runs a Reducer function on each
one of them. Here, the data can be aggregated, filtered, and combined in a number of ways, and it requires a
wide range of processing. Once the execution is over, it gives zero or more key-value pairs to the final step.
Output Phase − In the output phase, we have an output formatter that translates the final key-value pairs from
the Reducer function and writes them onto a file using a record writer.
Let us try to understand the two tasks, Map & Reduce, with the help of a small diagram −
MapReduce-Example
Let us take a real-world example to comprehend the power of MapReduce. Twitter receives around 500 million tweets
per day, which is nearly 3000 tweets per second. The following illustration shows how Twitter manages its tweets with
the help of MapReduce.
As shown in the illustration, the MapReduce algorithm performs the following actions −
Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as key-value pairs.
Aggregate Counters − Prepares an aggregate of similar counter values into small manageable units.
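A concrete example of the Map and Reduce tasks described above is the classic word count job. The sketch below is the standard Hadoop (Java) formulation: the mapper tokenizes each line and emits (word, 1) pairs, the reducer sums the counts per word, and the same reducer class is reused as an optional combiner. Input and output paths are taken from the command line; class and path names are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation (the combiner)
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}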
What is Hadoop
Hadoop is an open source framework from Apache and is used to store, process and analyze data which is very huge in volume. Hadoop is written in Java and is not OLAP (online analytical processing). It is used for batch/offline processing. It is being used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled
up just by adding nodes in the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its paper GFS and on the basis of that HDFS was
developed. It states that the files will be broken into blocks and stored in nodes over the distributed architecture.
2. Yarn: Yet Another Resource Negotiator is used for job scheduling and managing the cluster.
3. Map Reduce: This is a framework which helps Java programs to do the parallel computation on data using key
value pairs. The Map task takes input data and converts it into a data set which can be computed in key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other Hadoop modules.
Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS (Hadoop Distributed File
System). The MapReduce engine can be MapReduce/MR1 or YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes Job Tracker, Task
Tracker, NameNode, and DataNode whereas the slave node includes DataNode and TaskTracker.
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It contains a master/slave
architecture. This architecture consists of a single NameNode that performs the role of master, and multiple DataNodes that perform the role of slaves.
Both NameNode and DataNode are capable enough to run on commodity machines. The Java language is used to develop
HDFS. So any machine that supports Java language can easily run the NameNode and DataNode software.
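Since any Java program can act as an HDFS client, here is a minimal sketch using Hadoop's FileSystem API to write a file into HDFS and read it back. The fs.defaultFS address and the path used here are assumptions for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Address of the NameNode; an assumption for this example.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/sample.txt");

        // The client gets block locations from the NameNode, then streams data to DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("hello hdfs");
        }

        // Reading works the same way: metadata from the NameNode, bytes from the DataNodes.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}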
NameNode
o It is a single master server exist in the HDFS cluster.
o As it is a single node, it may become a single point of failure.
o It manages the file system namespace by executing an operation like the opening, renaming and closing the files.
o It simplifies the architecture of the system.
DataNode
o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of the DataNode to serve read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the NameNode.
Job Tracker
o The role of Job Tracker is to accept the MapReduce jobs from client and process the data by using NameNode.
o In response, NameNode provides metadata to Job Tracker.
Task Tracker
o It works as a slave node for Job Tracker.
o It receives tasks and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.
MapReduce Layer
The MapReduce comes into existence when the client application submits the MapReduce job to Job Tracker. In
response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes, a TaskTracker fails or times out. In such a case, that part of the job is rescheduled.
Advantages of Hadoop
o Fast: In HDFS, the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools to process the data are often on the same servers, thus reducing the processing time. It is able to process terabytes of data in minutes and petabytes in hours.
o Scalable: Hadoop cluster can be extended by just adding nodes in the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective compared to a traditional relational database management system.
o Resilient to failure: HDFS has the property of replicating data over the network, so if one node is down or some other network failure happens, Hadoop takes the other copy of the data and uses it. Normally, data is replicated three times, but the replication factor is configurable.
History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper,
published by Google.
Let's focus on the history of Hadoop in the following steps: -
o In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It is an open source web
crawler software project.
o While working on Apache Nutch, they were dealing with big data. Storing that data turned out to be very costly, which became a major drawback of the project. This problem became one of the important reasons for the emergence of Hadoop.
o In 2003, Google introduced a file system known as GFS (Google file system). It is a proprietary distributed file
system developed to provide efficient access to data.
o In 2004, Google released a white paper on Map Reduce. This technique simplifies the data processing on large
clusters.
o In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File
System). This file system also includes Map reduce.
o In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this year.
o Doug Cutting named his project Hadoop after his son's toy elephant.
o In 2007, Yahoo ran two clusters of 1000 machines.
o In 2008, Hadoop became the fastest system to sort 1 terabyte of data on a 900 node cluster within 209 seconds.
o In 2013, Hadoop 2.2 was released.
o In 2017, Hadoop 3.0 was released.
Hadoop is an Apache open source framework written in java that allows distributed processing of large datasets across
clusters of computers using simple programming models. The Hadoop framework application works in an environment
that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from
single server to thousands of machines, each offering local computation and storage.
Hadoop Architecture
MapReduce
MapReduce is a parallel programming model for writing distributed applications devised at Google for efficient
processing of large amounts of data (multi-terabyte data-sets), on large clusters (thousands of nodes) of commodity
hardware in a reliable, fault-tolerant manner. The MapReduce program runs on Hadoop which is an Apache open-
source framework.
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file
system that is designed to run on commodity hardware. It has many similarities with existing distributed file systems.
However, the differences from other distributed file systems are significant. It is highly fault-tolerant and is designed to
be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications
having large datasets.
Apart from the above-mentioned two core components, Hadoop framework also includes the following two modules −
Hadoop Common − These are Java libraries and utilities required by other Hadoop modules.
Hadoop YARN − This is a framework for job scheduling and cluster resource management.
It is quite expensive to build bigger servers with heavy configurations that handle large scale processing, but as an
alternative, you can tie together many commodity computers with single-CPU, as a single functional distributed system
and practically, the clustered machines can read the dataset in parallel and provide a much higher throughput. Moreover,
it is cheaper than one high-end server. So this is the first motivational factor behind using Hadoop that it runs across
clustered and low-cost machines.
Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs
−
Data is initially divided into directories and files. Files are divided into uniform sized blocks of 128M and 64M
(preferably 128M).
These files are then distributed across various cluster nodes for further processing.
HDFS, being on top of the local file system, supervises the processing.
Performing the sort that takes place between the map and reduce stages.
Advantages of Hadoop
The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines and, in turn, utilizes the underlying parallelism of the CPU
cores.
Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA), rather Hadoop library
itself has been designed to detect and handle failures at the application layer.
Servers can be added or removed from the cluster dynamically and Hadoop continues to operate without
interruption.
Another big advantage of Hadoop is that apart from being open source, it is compatible with all platforms since it is Java based.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. It was originally developed to support distribution
for the Nutch search engine project. Doug, who was working at Yahoo! at the time and is now Chief Architect of
Cloudera, named the project after his son's toy elephant.
UNIT 5
DATABASE SECURITY
Database Security Issues – Discretionary Access Control Based on Granting and Revoking Privileges – Mandatory Access Control and
Role-Based Access Control for Multilevel Security –SQL Injection – Statistical Database Security – Flow Control – Encryption and Public
Key Infrastructures – Preserving Data Privacy – Challenges to Maintaining Database Security – Database Survivability – Oracle Label-
Based Security.
There are numerous incidents where hackers have targeted companies dealing with personal customer details. Equifax,
Facebook, Yahoo, Apple, Gmail, Slack, and eBay data breaches were in the news in the past few years, just to name a few.
Such rampant activities raised the need for cybersecurity software and web app testing which aims to protect the data that
people share with online businesses. If these measures are applied, the hackers will be denied all access to the records and
documents available on the online databases. Also, complying with GDPR will help a lot on the way to strengthening user data
protection.
Here's a list of vulnerabilities that are commonly found in database-driven systems, and tips for how to eliminate them.
Irregularities in Databases
It is inconsistencies that lead to vulnerabilities. Test website security and ensure data protection on a regular basis. In case any discrepancies are found, they have to be fixed as soon as possible. Your developers should be aware of any threat that might affect the database. Though this is not easy work, through proper tracking the information can be kept secure.
Lack of Security Testing
In spite of being aware of the need for security testing, numerous businesses still fail to implement it. Fatal mistakes usually appear during the development stages, but also during app integration or while patching and updating the database. Cybercriminals take advantage of these failures to make a profit and, as a result, your business is at risk of being compromised.
3. Revoking of Privileges
In some cases it is desirable to grant a privilege to a user temporarily. For example, the owner of a relation may want to grant
the SELECT privilege to a user for a specific task and then revoke that privilege once the task is completed. Hence, a
mechanism for revoking privileges is needed. In SQL a REVOKE command is included for the purpose of canceling
privileges.
The CREATETAB (create table) privilege gives account A1 the capability to create new database tables (base relations) and is hence an account privilege. This privilege was part of earlier versions of SQL but is now left to each individual system implementation to define.
In SQL2 the same effect can be accomplished by having the DBA issue a CREATE SCHEMA command, as follows:
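A minimal sketch of such a command, using the schema name EXAMPLE referred to in the next paragraph, would be:
CREATE SCHEMA EXAMPLE AUTHORIZATION A1;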
User account A1 can now create tables under the schema called EXAMPLE. To continue our example, suppose that A1 creates the two base relations EMPLOYEE and DEPARTMENT shown in Figure 24.1; A1 is then the owner of these two relations and hence has all the relation privileges on each of them.
Next, suppose that account A1 wants to grant to account A2 the privilege to insert and delete tuples in both of these relations. However, A1 does not want A2 to be able to propagate these privileges to additional accounts. A1 can issue the following command:
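For illustration, the command could be written as follows (shown here as one GRANT per table, and deliberately without the GRANT OPTION so that A2 cannot propagate the privileges):
GRANT INSERT, DELETE ON EMPLOYEE TO A2;
GRANT INSERT, DELETE ON DEPARTMENT TO A2;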
Notice that the owner account A1 of a relation automatically has the GRANT OPTION, allowing it to grant privileges on the
relation to other accounts. However, account A2 cannot grant INSERT and DELETE privileges on
the EMPLOYEE and DEPARTMENT tables because A2 was not given the GRANT OPTION in the preceding command.
Next, suppose that A1 wants to allow account A3 to retrieve information from either of the two tables and also to be able to
propagate the SELECT privilege to other accounts. A1 can issue the following command:
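A possible form of this command is:
GRANT SELECT ON EMPLOYEE TO A3 WITH GRANT OPTION;
GRANT SELECT ON DEPARTMENT TO A3 WITH GRANT OPTION;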
The clause WITH GRANT OPTION means that A3 can now propagate the privilege to other accounts by using GRANT. For
example, A3 can grant the SELECT privilege on the EMPLOYEE relation to A4 by issuing the following command:
GRANT SELECT ON EMPLOYEE TO A4;
Notice that A4 cannot propagate the SELECT privilege to other accounts because the GRANT OPTION was not given to A4.
Now suppose that A1 decides to revoke the SELECT privilege on the EMPLOYEE relation from A3; A1 then can issue this
command:
REVOKE SELECT ON EMPLOYEE FROM A3;
The DBMS must now revoke the SELECT privilege on EMPLOYEE from A3, and it must also automatically
revoke the SELECT privilege on EMPLOYEE from A4. This is because A3 granted that privilege to A4, but A3 does not have
the privilege any more.
Next, suppose that A1 wants to give back to A3 a limited capability to SELECT from the EMPLOYEE relation and wants to
allow A3 to be able to propagate the privilege. The limitation is to retrieve only the Name, Bdate, and Address attributes and
only for the tuples with Dno = 5. A1 then can create the following view:
CREATE VIEW A3EMPLOYEE AS
SELECT Name, Bdate, Address
FROM EMPLOYEE
WHERE Dno = 5;
After the view is created, A1 can grant SELECT on the view A3EMPLOYEE to A3 as follows:
GRANT SELECT ON A3EMPLOYEE TO A3 WITH GRANT OPTION;
Finally, suppose that A1 wants to allow A4 to update only the Salary attribute of EMPLOYEE; A1 can then issue the
following command:
GRANT UPDATE ON EMPLOYEE (Salary) TO A4;
The UPDATE and INSERT privileges can specify particular attributes that may be updated or inserted in a relation. Other privileges (SELECT, DELETE) are not attribute specific, because this specificity can easily be controlled by creating the appropriate views that include only the desired attributes and granting the corresponding privileges on the views. However, because updating views is not always possible (see Chapter 5), the UPDATE and INSERT privileges are given the option to specify the particular attributes of a base relation that may be updated.
The discretionary access control technique of granting and revoking privileges on relations has traditionally been the main
security mechanism for relational database systems. This is an all-or-nothing method: A user either has or does not have a
certain privilege. In many applications, an additional security policy is needed that classifies data and users based on security
classes. This approach, known as mandatory access control (MAC), would typically be combined with the discretionary access control mechanisms. It is important to note that most commercial DBMSs currently provide mechanisms only for
discretionary access control. However, the need for multilevel security exists in government, military, and intelligence
applications, as well as in many industrial and corporate applications. Some DBMS vendors—for example, Oracle—have
released special versions of their RDBMSs that incorporate mandatory access control for government use.
Typical security classes are top secret (TS), secret (S), confidential (C), and unclassified (U), where TS is the highest level
and U the lowest. Other more complex security classification schemes exist, in which the security classes are organized in a
lattice. For simplicity, we will use the system with four security classification levels, where TS ≥ S ≥ C ≥ U, to illustrate our
discussion. The commonly used model for multilevel security, known as the Bell-LaPadula model, classifies
each subject (user, account, program) and object (relation, tuple, column, view, operation) into one of the security
classifications TS, S, C, or U. We will refer to the clearance (classification) of a subject S as class(S) and to
the classification of an object O as class(O). Two restrictions are enforced on data access based on the subject/object
classifications:
1. A subject S is not allowed read access to an object O unless class(S) ≥ class(O). This is known as the simple security
property.
2. A subject S is not allowed to write an object O unless class(S) ≤ class(O). This is known as the star property (or *-
property).
The first restriction is intuitive and enforces the obvious rule that no subject can read an object whose security classification is
higher than the subject’s security clearance. The second restriction is less intuitive. It prohibits a subject from writing an object
at a lower security classification than the subject’s security clearance. Violation of this rule would allow information to flow
from higher to lower classifications, which violates a basic tenet of multilevel security. For example, a user (subject) with TS
clearance may make a copy of an object with classification TS and then write it back as a new object with classification U,
thus making it visible throughout the system.
To incorporate multilevel security notions into the relational database model, it is common to consider attribute values and
tuples as data objects. Hence, each attribute A is associated with a classification attribute C in the schema, and each attribute
value in a tuple is associated with a corresponding security classification. In addition, in some models, a tuple
classification attribute TC is added to the relation attributes to provide a classification for each tuple as a whole. The model
we describe here is known as the multilevel model, because it allows classifications at multiple security levels. A multilevel
relation schema R with n attributes would be represented as:
R(A1, C1, A2, C2, ..., An, Cn, TC)
where each Ci represents the classification attribute associated with attribute Ai.
The value of the tuple classification attribute TC in each tuple t is the highest of all attribute classification values Ci within t; it provides a general classification for the tuple itself, whereas each attribute classification Ci provides a finer security classification for each attribute value within the tuple.
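For instance, using the EMPLOYEE relation discussed below, with attributes Name, Salary, and Job_performance, the multilevel schema could be written as:
EMPLOYEE(Name, C1, Salary, C2, Job_performance, C3, TC)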
The apparent key of a multilevel relation is the set of attributes that would have formed the primary key in a regular (single-
level) relation. A multilevel relation will appear to contain different data to subjects (users) with different clearance levels. In
some cases, it is possible to store a single tuple in the relation at a higher classification level and produce the corresponding
tuples at a lower-level classification through a process known as filtering. In other cases, it is necessary to store two or more
tuples at different classification levels with the same value for the apparent key.
This leads to the concept of polyinstantiation, where several tuples can have the same apparent key value but have different
attribute values for users at different clearance levels.
We illustrate these concepts with the simple example of a multilevel relation shown in Figure 24.2(a), where we display the
classification attribute values next to each attribute’s value. Assume that the Name attribute is the apparent key, and consider
the query SELECT * FROM EMPLOYEE. A user with security clearance S would see the same relation shown in Figure
24.2(a), since all tuple classifications are less than or equal to S. However, a user with security clearance C would not be
allowed to see the values for Salary of ‘Brown’ and Job_performance of ‘Smith’, since they have higher classification. The
tuples would be filtered to appear as shown in Figure 24.2(b), with Salary and Job_performance appearing as null. For a user
with security clearance U, the filtering allows only the Name attribute of ‘Smith’ to appear, with all the other
attributes appearing as null (Figure 24.2(c)). Thus, filtering introduces null values for attribute values whose security
classification is higher than the user’s security clearance.
In general, the entity integrity rule for multilevel relations states that all attributes that are members of the apparent key must
not be null and must have the same security classification within each individual tuple. Additionally, all other attribute values
in the tuple must have a security classification greater than or equal to that of the apparent key. This constraint ensures that a
user can see the key if the user is permitted to see any part of the tuple. Other integrity rules, called null
integrity and interinstance integrity, informally ensure that if a tuple value at some security level can be filtered (derived)
from a higher-classified tuple, then it is sufficient to store the higher-classified tuple in the multilevel relation.
To illustrate polyinstantiation further, suppose that a user with security clearance C tries to update the value
of Job_performance of ‘Smith’ in Figure 24.2 to ‘Excellent’; this corresponds to the following SQL update being submitted
by that user:
UPDATE EMPLOYEE
SET Job_performance = 'Excellent'
WHERE Name = 'Smith';
SQL Injection
SQL injection is a code injection technique that might destroy your database.
SQL injection is the placement of malicious code in SQL statements, via web page input.
Look at the following example which creates a SELECT statement by adding a variable (txtUserId) to a select string. The variable is
fetched from user input (getRequestString):
Example
txtUserId = getRequestString("UserId");
txtSQL = "SELECT * FROM Users WHERE UserId = " + txtUserId;
The rest of this chapter describes the potential dangers of using user input in SQL statements.
SQL Injection Based on 1=1 is Always True
Look at the example above again. The original purpose of the code was to create an SQL statement to select a user, with a given
user id.
If there is nothing to prevent a user from entering "wrong" input, the user can enter some "smart" input like this:
UserId: 105 OR 1=1
Then, the SQL statement will look like this:
SELECT * FROM Users WHERE UserId = 105 OR 1=1;
The SQL above is valid and will return ALL rows from the "Users" table, since OR 1=1 is always TRUE.
Does the example above look dangerous? What if the "Users" table contains names and passwords?
SELECT UserId, Name, Password FROM Users WHERE UserId = 105 or 1=1;
A hacker might get access to all the user names and passwords in a database, by simply inserting 105 OR 1=1 into the input field.
Here is an example of a user login on a web site:
Username: John Doe
Password: myPass
Example
uName = getRequestString("username");
uPass = getRequestString("userpassword");
sql = 'SELECT * FROM Users WHERE Name ="' + uName + '" AND Pass ="' + uPass + '"'
Result
SELECT * FROM Users WHERE Name ="John Doe" AND Pass ="myPass"
A hacker might get access to user names and passwords in a database by simply inserting " OR ""=" into the user name or password
text box:
UserName: " OR ""="
Password: " OR ""="
The code at the server will create a valid SQL statement like this:
Result
SELECT * FROM Users WHERE Name ="" or ""="" AND Pass ="" or ""=""
The SQL above is valid and will return all rows from the "Users" table, since OR ""="" is always TRUE.
A batch of SQL statements is a group of two or more SQL statements, separated by semicolons.
The SQL statement below will return all rows from the "Users" table, then delete the "Suppliers" table.
Example
SELECT * FROM Users; DROP TABLE Suppliers
Example
txtUserId = getRequestString("UserId");
txtSQL = "SELECT * FROM Users WHERE UserId = " + txtUserId;
If the following input is entered:
User id: 105; DROP TABLE Suppliers
the server will build the following valid SQL:
SELECT * FROM Users WHERE UserId = 105; DROP TABLE Suppliers;
SQL parameters are values that are added to an SQL query at execution time, in a controlled manner.
Another Example
txtNam = getRequestString("CustomerName");
txtAdd = getRequestString("Address");
txtCit = getRequestString("City");
txtSQL = "INSERT INTO Customers(CustomerName,Address,City) VALUES(@0,@1,@2)";
db.Execute(txtSQL, txtNam, txtAdd, txtCit);
Examples
The following example shows how to build a parameterized query in a common web language:
txtUserId = getRequestString("UserId");
sql = "SELECT * FROM Customers WHERE CustomerId = @0";
command = new SqlCommand(sql);
command.Parameters.AddWithValue("@0", txtUserId);
command.ExecuteReader();
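The same idea can be sketched directly in SQL with a prepared statement. The following is a minimal MySQL-style example; the table and variable names are illustrative:
-- The parameter value is bound at execution time and never concatenated into the SQL text.
SET @userId = '105';
PREPARE stmt FROM 'SELECT * FROM Customers WHERE CustomerId = ?';
EXECUTE stmt USING @userId;
DEALLOCATE PREPARE stmt;
Because the value of @userId is treated strictly as data, input such as 105 OR 1=1 cannot change the structure of the query.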
Flow Control:
Measures of Control
The measures of control can be broadly divided into the following categories −
Access Control − Access control includes security mechanisms in a database management system to protect against unauthorized access. A user can gain access to the database only after clearing the login process through valid user accounts. Each user account is password protected. A minimal example of creating an account and granting privileges is sketched after this list.
Flow Control − Distributed systems encompass a lot of data flow from one site to another and also within a site. Flow
control prevents data from being transferred in such a way that it can be accessed by unauthorized agents. A flow
policy lists out the channels through which information can flow. It also defines security classes for data as well as
transactions.
Data Encryption − Data encryption refers to coding data when sensitive data is to be communicated over public channels. Even if an unauthorized agent gains access to the data, they cannot understand it since it is in an incomprehensible format.
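As a minimal sketch of the access control measure described in the list above (the account, table, and password shown are illustrative, and the syntax is Oracle-style):
-- create a password-protected account
CREATE USER analyst1 IDENTIFIED BY StrongPassword1;
-- allow the account to log in and grant only the privileges it needs
GRANT CREATE SESSION TO analyst1;
GRANT SELECT ON EMPLOYEE TO analyst1;
-- access can later be withdrawn
REVOKE SELECT ON EMPLOYEE FROM analyst1;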
Key Management
It goes without saying that the security of any cryptosystem depends upon how securely its keys are managed. Without secure
procedures for the handling of cryptographic keys, the benefits of the use of strong cryptographic schemes are potentially
lost.
It is observed that cryptographic schemes are rarely compromised through weaknesses in their design. However, they are
often compromised through poor key management.
There are some important aspects of key management which are as follows −
Cryptographic keys are nothing but special pieces of data. Key management refers to the secure administration of
cryptographic keys.
Key management deals with entire key lifecycle as depicted in the following illustration −
There are two specific requirements of key management for public key cryptography.
o Secrecy of private keys. Throughout the key lifecycle, secret keys must remain secret from all parties except
those who are owner and are authorized to use them.
o Assurance of public keys. In public key cryptography, the public keys are in open domain and seen as public
pieces of data. By default there are no assurances of whether a public key is correct, with whom it can be
associated, or what it can be used for. Thus key management of public keys needs to focus much more
explicitly on assurance of purpose of public keys.
The most crucial requirement of ‘assurance of public key’ can be achieved through a public-key infrastructure (PKI), a key management system for supporting public-key cryptography.
Digital Certificate
By analogy, a certificate can be considered as the ID card issued to a person. People use ID cards such as a driver's license or passport to prove their identity. A digital certificate does the same basic thing in the electronic world, but with one difference. Digital certificates are not only issued to people; they can be issued to computers, software packages, or anything else that needs to prove its identity in the electronic world.
Digital certificates are based on the ITU(International Telecommunication Union) standard X.509 which defines a
standard certificate format for public key certificates and certification validation. Hence digital certificates are
sometimes also referred to as X.509 certificates.
The public key pertaining to the user client is stored in digital certificates by the Certification Authority (CA), along with other relevant information such as client information, expiration date, usage, issuer, etc.
The CA digitally signs this entire information and includes the digital signature in the certificate.
Anyone who needs assurance about the public key and associated information of a client carries out the signature validation process using the CA’s public key. Successful validation assures that the public key given in the certificate belongs to the person whose details are given in the certificate.
The process of obtaining Digital Certificate by a person/entity is depicted in the following illustration.
As shown in the illustration, the CA accepts the application from a client to certify his public key. The CA, after duly verifying
identity of client, issues a digital certificate to that client.
Key Functions of CA
The key functions of a CA are as follows −
Generating key pairs − The CA may generate a key pair independently or jointly with the client.
Issuing digital certificates − The CA could be thought of as the PKI equivalent of a passport agency − the CA issues
a certificate after client provides the credentials to confirm his identity. The CA then signs the certificate to prevent
modification of the details contained in the certificate.
Publishing Certificates − The CA needs to publish certificates so that users can find them. There are two ways of achieving this. One is to publish certificates in the equivalent of an electronic telephone directory. The other is to send your certificate out to those people you think might need it by one means or another.
Verifying Certificates − The CA makes its public key available in the environment to assist verification of its signature on clients’ digital certificates.
Revocation of Certificates − At times, a CA revokes a certificate it has issued due to some reason, such as compromise of the private key by the user or loss of trust in the client. After revocation, the CA maintains the list of all revoked certificates, which is available to the environment.
Classes of Certificates
There are four typical classes of certificate −
Class 1 − These certificates can be easily acquired by supplying an email address.
Class 2 − These certificates require additional personal information to be supplied.
Class 3 − These certificates can only be purchased after checks have been made about the requestor’s identity.
Class 4 − They may be used by governments and financial organizations needing very high levels of trust.
Hierarchy of CA
With vast networks and requirements of global communications, it is practically not feasible to have only one trusted CA from whom all users obtain their certificates. Secondly, availability of only one CA may lead to difficulties if that CA is compromised.
In such case, the hierarchical certification model is of interest since it allows public key certificates to be used in environments
where two communicating parties do not have trust relationships with the same CA.
The root CA is at the top of the CA hierarchy and the root CA's certificate is a self-signed certificate.
The CAs, which are directly subordinate to the root CA (For example, CA1 and CA2) have CA certificates that are
signed by the root CA.
The CAs under the subordinate CAs in the hierarchy (For example, CA5 and CA6) have their CA certificates signed
by the higher-level subordinate CAs.
Certificate authority (CA) hierarchies are reflected in certificate chains. A certificate chain traces a path of certificates from
a branch in the hierarchy to the root of the hierarchy.
The following illustration shows a CA hierarchy with a certificate chain leading from an entity certificate through two
subordinate CA certificates (CA6 and CA3) to the CA certificate for the root CA.
Verifying a certificate chain is the process of ensuring that a specific certificate chain is valid, correctly signed, and
trustworthy. The following procedure verifies a certificate chain, beginning with the certificate that is presented for
authentication −
A client whose authenticity is being verified supplies his certificate, generally along with the chain of certificates up
to Root CA.
The verifier takes the certificate and validates it using the public key of the issuer. The issuer’s public key is found in the issuer’s certificate, which is next to the client’s certificate in the chain.
Now, if the higher CA who has signed the issuer’s certificate is trusted by the verifier, verification is successful and stops here.
Else, the issuer's certificate is verified in the same manner as done for the client in the above steps. This process continues until either a trusted CA is found along the way or the Root CA is reached.
Preserving Data Privacy
Abstract
Incredible amounts of data are being generated by various organizations like hospitals, banks, e-commerce, retail and supply chain, etc. by virtue of digital technology. Not only humans but machines also contribute to data in the form of closed-circuit television streaming, website logs, etc. Tons of data are generated every minute by social media and smartphones. The voluminous data generated from these various sources can be processed and analyzed to support decision making. However, data analytics is prone to privacy violations. One of the applications of data analytics is recommendation systems, which are widely used by e-commerce sites like Amazon and Flipkart for suggesting products to customers based on their buying habits, leading to inference attacks. Although data analytics is useful in decision making, it can lead to serious privacy concerns. Hence privacy-preserving data analytics has become very important. This paper examines various privacy threats and privacy preservation techniques and models with their limitations, and also proposes a data lake based modernistic privacy preservation technique to handle privacy preservation in unstructured data.
Introduction
There is an exponential growth in the volume and variety of data due to the diverse applications of computers in all domain areas. This growth has been achieved due to the affordable availability of computing technology, storage, and network connectivity. The large-scale data, which also includes person-specific private and sensitive data like gender, zip code, disease, caste, shopping cart, religion, etc., is being stored in the public domain. The data holder can release this data to a third-party data analyst to gain deeper insights and identify hidden patterns which are useful in making important decisions that may help in improving businesses, providing value-added services to customers, prediction, forecasting and recommendation. One of the prominent applications of data analytics is recommendation systems, which are widely used by e-commerce sites like Amazon and Flipkart for suggesting products to customers based on their buying habits. Facebook suggests friends, places to visit, and even movie recommendations based on our interests. However, releasing user activity data may lead to inference attacks, such as identifying gender based on user activity. We have studied a number of privacy preserving techniques which are being employed to protect against privacy threats. Each of these techniques has its own merits and demerits. This paper explores the merits and demerits of each of these techniques and also describes the research challenges in the area of privacy preservation. There always exists a trade-off between data utility and privacy. This paper also proposes a data lake based modernistic privacy preservation technique to handle privacy preservation in unstructured data with maximum data utility.
Privacy threats in data analytics
Privacy is the ability of an individual to determine what data can be shared, and to employ access control. If the data is in the public domain then it is a threat to individual privacy, as the data is held by a data holder. The data holder can be a social networking application, website, mobile app, e-commerce site, bank, hospital, etc. It is the responsibility of the data holder to ensure the privacy of the users' data. Apart from the data held in the public domain, knowingly or unknowingly, users themselves contribute to data leakage. For example, most mobile apps seek access to our contacts, files, camera, etc., and without reading the privacy statement we agree to all terms and conditions, thereby contributing to data leakage.
Hence there is a need to educate smartphone users regarding privacy and privacy threats. Some of the key privacy threats include (1) Surveillance; (2) Disclosure; (3) Discrimination; (4) Personal embarrassment and abuse.
Surveillance
Many organizations, including retail, e-commerce, etc., study their customers' buying habits and try to come up with various offers and value-added services. Based on opinion data and sentiment analysis, social media sites provide recommendations of new friends, places to visit, people to follow, etc. This is possible only when they continuously monitor their customers' transactions. This is a serious privacy threat, as no individual accepts surveillance.
Disclosure
Consider a hospital holding patients' data which includes (Zip, gender, age, disease). The data holder has released the data to a third party for analysis after anonymizing sensitive person-specific data so that the person cannot be identified. The third-party data analyst can map this information with freely available external data sources like census data and can identify a person suffering from some disorder. This is how private information of a person can be disclosed, which is considered to be a serious privacy breach.
Discrimination
Discrimination is the bias or inequality which can happen when some private information of a person is disclosed. For instance, statistical analysis of electoral results proved that people of one community were completely against the party which formed the government. Now the government can neglect that community or be biased against them.
Data analytics activity will affect data privacy. Many countries are enforcing privacy preservation laws. Lack of awareness is also a major reason for privacy attacks. For example, many smartphone users are not aware of the information that is stolen from their phones by many apps. Previous research shows only 17% of smartphone users are aware of privacy threats.
Privacy preservation methods
Many Privacy preserving techniques were developed, but most of them are based on anonymization of data. The list of privacy
preservation techniques is given below.
K anonymity
L diversity
T closeness
Randomization
Data distribution
Cryptographic techniques
Multidimensional Sensitivity Based Anonymization (MDSBA).
K anonymity
Anonymization is the process of modifying data before it is given for data analytics, so that de-identification is not possible and any attempt to de-identify by mapping the anonymized data with external data sources will lead to K indistinguishable records. K anonymity is prone to two attacks, namely the homogeneity attack and the background knowledge attack. Some of the algorithms applied to ensure anonymization include Incognito and Mondrian. K anonymity is applied on the patient data shown in Table 1. The table shows data before anonymization.
Table 1 (excerpt): patient data before anonymization
No.   Zip     Age   Disease
6     57906   47    Cancer
8     57673   36    Cancer
9     57607   32    Cancer
K anonymity algorithm is applied with k value as 3 to ensure 3 indistinguishable records when an attempt is made to identify
a particular person’s data. K anonymity is applied on the two attributes viz. Zip and age shown in Table 1. The result of
applying anonymization on Zip and age attributes is shown in Table 2.
Table 2 (excerpt): patient data after anonymizing Zip and Age
No.   Zip     Age   Disease
8     576**   3*    Cancer
9     576**   3*    Cancer
The above technique has used generalization to achieve anonymization. Suppose we know that John is 27 years old and lives in the 57677 zip code; then we can conclude that John has a cardiac problem even after anonymization, as shown in Table 2, because every record in his equivalence class shares the same disease value. This is called a homogeneity attack. If, instead, John is 36 years old and it is known that John does not have cancer, then definitely John must have a cardiac problem. This is called a background knowledge attack. Achieving K anonymity can be done either by using generalization or suppression. K anonymity can be optimized if minimal generalization can be done without huge data loss. Identity disclosure is the major privacy threat, which cannot be guaranteed against by K anonymity. Personalized privacy is the most important aspect of individual privacy.
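A rough SQL sketch of this kind of generalization is shown below, assuming a hypothetical Patients table with a character Zip column and a numeric Age column, and Oracle-style string functions; the masked values correspond to the 576**/3* entries of Table 2:
SELECT SUBSTR(Zip, 1, 3) || '**' AS Zip,           -- keep only the first three digits of the zip code
       SUBSTR(TO_CHAR(Age), 1, 1) || '*' AS Age,   -- keep only the first digit of the age
       Disease
FROM Patients;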
L diversity
To address homogeneity attack, another technique called L diversity has been proposed. As per L diversity there must be L
well represented values for the sensitive attribute (disease) in each equivalence class.
Implementing L diversity is not always possible because of the variety of data. L diversity is also prone to the skewness attack: when the overall distribution of data is skewed into a few equivalence classes, attribute disclosure cannot be prevented. For example, if all the records are distributed into only three equivalence classes, then the semantic closeness of these values may lead to attribute disclosure. L diversity may also lead to the similarity attack. From Table 3 it can be noticed that if we know that John is 27 years old and lives in the 57677 zip, then definitely John is in the low income group, because the salaries of all three persons in the 576** zip are low compared to the others in the table. This is called a similarity attack.
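The L-diversity requirement can be checked with a simple aggregate query. A hedged sketch over the same hypothetical anonymized table, with L = 3, could look like this:
-- equivalence classes (same generalized Zip and Age) with fewer than 3 distinct diseases violate 3-diversity
SELECT Zip, Age, COUNT(DISTINCT Disease) AS distinct_diseases
FROM AnonymizedPatients
GROUP BY Zip, Age
HAVING COUNT(DISTINCT Disease) < 3;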
T closeness
Another improvement to L diversity is the T closeness measure, where an equivalence class is considered to have ‘T closeness’ if the distance between the distribution of the sensitive attribute in the class and the distribution of the attribute in the whole table is no more than a threshold t, and all equivalence classes are required to have T closeness. T closeness can be calculated on every attribute with respect to the sensitive attribute.
From Table 4 it can be observed that even if we know John is 27 years old, it will still be difficult to estimate whether John has a cardiac problem and whether he is in the low income group. T closeness may ensure protection against attribute disclosure, but implementing T closeness may not give a proper distribution of data every time.
Randomization technique
Randomization is the process of adding noise to the data, which is generally done using a probability distribution. Randomization is applied in surveys, sentiment analysis, etc. Randomization does not need knowledge of other records in the data. It can be applied at data collection and pre-processing time, and there is no anonymization overhead in randomization. However, applying randomization on large datasets is impractical because of time complexity and loss of data utility, which has been shown in our experiment described below.
We loaded 10k records from an employee database into the Hadoop Distributed File System and processed them by executing a MapReduce job. We experimented with classifying the employees based on their salary and age groups. In order to apply randomization, we added noise in the form of 5k randomly generated records to make a database of 15k records, and the following observations were made after running the MapReduce job.
More mappers and reducers were used as the data volume increased.
Results before and after randomization were significantly different.
Some of the records which are outliers remained unaffected by randomization and are vulnerable to adversary attack.
Privacy preservation at the cost of data utility is not appreciated, and hence randomization may not be suitable for privacy preservation, especially against attribute disclosure.
Data distribution technique
In this technique, the data is distributed across many sites. Distribution of the data can be done in two ways:
Horizontal distribution When data is distributed across many sites with the same attributes, the distribution is said to be horizontal distribution, as described in Fig. 1.
Fig. 1: Distribution of sales data across different sites
Horizontal distribution of data can be applied only when some aggregate functions or operations are to be applied on the data without actually sharing the data. For example, if a retail store wants to analyse its sales across various branches, it may employ analytics which do computations on aggregate data. However, as part of data analysis the data holder may need to share the data with a third-party analyst, which may lead to privacy breach. Classification and clustering algorithms can be applied on distributed data, but this does not ensure privacy. If the data is distributed across different sites which belong to different organizations, then the results of aggregate functions may help one party in detecting the data held by the other parties. In such situations we expect all participating sites to be honest with each other.
Vertical distribution of data When person-specific information is distributed across different sites under the custodianship of different organizations, the distribution is called vertical distribution, as shown in Fig. 2. For example, in crime investigations, the police officials would like to know the details of a particular criminal, which include health, profession, financial, personal details, etc. All this information may not be available at one site. Such a distribution is called vertical distribution, where each site holds a subset of the attributes of a person. When some analytics has to be done, data has to be pooled in from all these sites, and there is a vulnerability of privacy breach.
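A minimal sketch of vertical distribution, assuming hypothetical tables and a random token used to link the fragments, might look like the following:
-- site 1: quasi-identifiers only
CREATE TABLE Person_Public (
    Token INT PRIMARY KEY,
    Zip   CHAR(5),
    Age   INT
);
-- site 2: sensitive, person-specific attributes
CREATE TABLE Person_Sensitive (
    Token   INT PRIMARY KEY,
    Name    VARCHAR(50),
    Disease VARCHAR(40)
);
-- analytics on site 1 never sees names or diseases; only authorized parties can join the fragments on Token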
Cryptographic techniques
The data holder may encrypt the data before releasing the same for analytics. But encrypting large scale data using conventional
encryption techniques is highly difficult and must be applied only during data collection time. Differential privacy techniques
have already been applied where some aggregate computations on the data are done without actually sharing the inputs. For
example, if x and y are two data items then a function F(x, y) will be computed to gain some aggregate information from both
x and y without actually sharing x and y. This can be applied when x and y are held by different parties, as in the case of vertical distribution. However, if the data is at a single location under the custodianship of a single organization, then differential privacy cannot be employed. Another similar technique called secure multiparty computation has been used but has proved to be inadequate for privacy preservation. Data utility will be low if encryption is applied during data analytics. Thus encryption is not only difficult to implement but also reduces data utility.
Multidimensional Sensitivity Based Anonymization is an improved anonymization technique that can be applied on large data sets with reduced loss of information and predefined quasi-identifiers. As part of this technique, the Apache MapReduce framework has been used to handle large data sets. In the conventional Hadoop Distributed File System, the data is divided into blocks of either 64 MB or 128 MB each and distributed across different nodes without considering the data inside the blocks. As part of the Multidimensional Sensitivity Based Anonymization technique, the data is split into different bags based on the probability distribution of the quasi-identifiers by making use of filters in the Apache Pig scripting language. Multidimensional Sensitivity Based Anonymization makes use of bottom-up generalization, but on a set of attributes with certain class values, where the class represents a sensitive attribute. Data distribution was made more effective when compared to the conventional method of blocks. Data anonymization was done using four quasi-identifiers using Apache Pig.
Since the data is vertically partitioned into different groups, it can protect from background knowledge attack if the bag contains
only few attributes. This method also makes it difficult to map the data with external sources to disclose any person specific
information.
In this method, the implementation was done using Apache Pig. Apache Pig is a scripting language, hence the development effort is less. However, the code efficiency of Apache Pig is relatively low when compared to a MapReduce job, because ultimately every Apache Pig script has to be converted into a MapReduce job. Multidimensional Sensitivity Based Anonymization is more appropriate for large-scale data, but only when the data is at rest. Multidimensional Sensitivity Based Anonymization cannot be applied to streaming data.
Analysis
Various privacy preservation techniques have been studied with respect to features including type of data, data utility, attribute preservation and complexity. The comparison of various privacy preservation techniques is shown in Table 5.
As part of a systematic literature review, it has been observed that all existing mechanisms of privacy preservation deal with structured data, while more than 80% of the data being generated today is unstructured. As such, there is a need to address the following challenges.
i. Develop concrete solutions to protect privacy in both structured and unstructured data.
ii. Develop scalable and robust techniques to handle large-scale heterogeneous data sets.
iii. Allow data to stay in its native form without the need for transformation, while data analytics is carried out with privacy preservation ensured.
iv. Develop new techniques apart from anonymization to ensure protection against key privacy threats, which include identity disclosure, discrimination, surveillance, etc.
v. Maximize data utility while ensuring data privacy.
Conclusion
No concrete solution for unstructured data has been developed yet. Conventional data mining algorithms can be applied to classification and clustering problems but cannot be used for privacy preservation, especially when dealing with person-specific information. Machine learning and soft computing techniques can be used to develop new and more appropriate solutions to privacy problems, which include identity disclosure that can lead to personal embarrassment and abuse.
There is a strong need for law enforcement by the governments of all countries to ensure individual privacy. The European Union is making an attempt to enforce a privacy preservation law. Apart from technological solutions, there is a strong need to create awareness among people regarding privacy hazards to safeguard themselves from privacy breaches. One of the serious privacy threats is the smart phone. A lot of personal information in the form of contacts, messages, chats and files is being accessed by many apps running on our smart phones without our knowledge. Most of the time people do not even read the privacy statement before installing an app. Hence there is a strong need to educate people on the various vulnerabilities which can contribute to leakage of private information.
We propose a novel privacy preservation model based on the Data Lake concept to hold a variety of data from diverse sources. A data lake is a repository that holds data from diverse sources in its raw format. Data ingestion from a variety of sources can be done using Apache Flume, and an intelligent algorithm based on machine learning can be applied to identify sensitive attributes dynamically. The algorithm will be trained with existing data sets with known sensitive attributes, and rigorous training of the model will help in predicting the sensitive attributes in a given data set. The accuracy of the model can be improved by adding more layers of training, leading to deep learning techniques. Advanced computing frameworks like Apache Spark, a distributed, massively parallel computing engine with in-memory processing, can be used to implement the privacy-preserving algorithms and ensure very fast processing. The proposed model is shown in Fig. 3.
Fig. 3: Proposed data lake based privacy preservation model
In the data lake, the data can remain in its native form, whether structured or unstructured. When data has to be processed, it can be transformed into Hive tables. A Hadoop MapReduce job using machine learning can be executed on the data to classify the sensitive attributes. The data can be vertically distributed to separate the sensitive attributes from the rest of the data, and tokenization can be applied to map the vertically distributed data. The data without any sensitive attributes can then be published for data analytics.
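As an illustration of this transformation step, raw files in the data lake could be exposed as a Hive table roughly as follows; the path and column names are hypothetical:
CREATE EXTERNAL TABLE patients_raw (
    name    STRING,
    zip     STRING,
    age     INT,
    disease STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/datalake/raw/patients';
-- a MapReduce or Spark job can then read this table to classify and separate the sensitive attributes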
Abbreviations
CCTV: closed circuit television
MDSBA: Multidimensional Sensitivity Based Anonymization
Oracle Label Security
The need for more sophisticated controls on access to sensitive data is becoming increasingly important as organizations address emerging security requirements around data consolidation, privacy and compliance. Maintaining separate databases for highly sensitive customer data is costly and creates unnecessary administrative overhead. However, consolidating databases sometimes means combining sensitive financial, HR, medical or project data from multiple locations into a single database for reduced costs, easier management and better scalability. Oracle Label Security provides the ability to tag data with a data label or a data classification so that the database inherently knows what data a user or role is authorized for, and it allows data from different sources to be combined in the same table as a larger data set without compromising security.
Access to sensitive data is controlled by comparing the data label with the requesting user’s label or access clearance. A user label or access clearance can be thought of as an extension to standard database privileges and roles. Oracle Label Security is centrally enforced within the database, below the application layer, providing strong security and eliminating the need for complicated application views.
Typical use cases include:
Financial companies with customers that span multiple countries with strong government privacy controls
Complying with the U.S. State Department’s International Traffic in Arms Regulations (ITAR)
Restricting data processing, tracking consent and handling right-to-erasure requests under the EU GDPR
The user label consists of three components – a level, zero or more compartments and zero or more groups. This label is assigned as part of the user authorization and is not modifiable by the user.
The session label also consists of the same three components and may differ from the user label based on the session that was established by the user. For example, if the user has a Top Secret level component but logged in from a Secret workstation, the session label level would be Secret.
Data security labels have the same three components as the user and session labels.
Levels indicate the sensitivity level of the data and the authorization for a user to access sensitive data. The user (and session) level must be equal to or greater than the data level to access that record.
Data can have zero or more groups in the group component. The user/session label
needs to have at least one group that matches a data record’s group(s) to access the
data record. For example, if the data record had Boston, Chicago and New York for
groups, then the session label needs only Boston (or one of the other 2 groups) to
access the data.
Label Security policies are a combination of User labels, Data labels and protected objects.
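A very rough sketch of defining and applying such a policy with the Oracle Label Security PL/SQL packages is shown below. The policy, level, table, and user names are illustrative, and the exact procedures and parameters should be verified against the OLS documentation:
BEGIN
  -- create the policy and the hidden label column it manages
  SA_SYSDBA.CREATE_POLICY(policy_name => 'HR_POLICY', column_name => 'HR_LABEL');
  -- define sensitivity levels (a higher number means more sensitive)
  SA_COMPONENTS.CREATE_LEVEL('HR_POLICY', 2000, 'S',  'SECRET');
  SA_COMPONENTS.CREATE_LEVEL('HR_POLICY', 3000, 'TS', 'TOP_SECRET');
  -- protect a table with the policy
  SA_POLICY_ADMIN.APPLY_TABLE_POLICY(policy_name => 'HR_POLICY',
                                     schema_name => 'HR',
                                     table_name  => 'EMPLOYEES');
  -- authorize a user for a range of levels
  SA_USER_ADMIN.SET_LEVELS(policy_name => 'HR_POLICY',
                           user_name   => 'ALICE',
                           max_level   => 'TS',
                           min_level   => 'S');
END;
/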
A VPD policy can be written so that it only becomes active when a certain column
(the 'sensitive' column) is part of a SQL statement against a protected table. With
'column sensitivity' switched on, VPD either returns only those rows that include
information in the sensitive column the user is allowed to see, or it returns all rows,
with all cells in the sensitive column being empty, except those values the user is
allowed to see.
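A hedged sketch of registering such a column-sensitive VPD policy with the DBMS_RLS package is given below; the schema, table, and policy function names are illustrative, and SALARY_PREDICATE is assumed to be a function that returns the row-filtering predicate:
BEGIN
  DBMS_RLS.ADD_POLICY(
    object_schema         => 'HR',
    object_name           => 'EMPLOYEES',
    policy_name           => 'SALARY_VPD',
    function_schema       => 'SEC_ADMIN',
    policy_function       => 'SALARY_PREDICATE',
    statement_types       => 'SELECT',
    sec_relevant_cols     => 'SALARY',
    sec_relevant_cols_opt => DBMS_RLS.ALL_ROWS);  -- return all rows, with SALARY shown as NULL where not allowed
END;
/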
Trusted stored procedures are procedures that are granted either the OLS privilege 'FULL' or 'READ'. When a trusted stored program unit is executed, the policy privileges in force are a union of the invoking user's privileges and the program unit's privileges.
Beginning with Oracle Database 11gR1, the functionality of Oracle Policy Manager
(and most other security related administration tools) has been made available in
Oracle Enterprise Manager Cloud Control, enabling administrators to create and
manage OLS policies in a modern, convenient and integrated environment.
Only apply sensitivity labels to those tables that really need protection. When multiple tables are joined to retrieve sensitive data, look for the driving table.
Usually, there is only a small set of different data classification labels; if the table is mostly used for READ operations, try building a bitmap index over the (hidden) OLS column, and add this index to the existing indexes on that table.
Review the Oracle Label Security whitepaper available in the product OTN
webpage as it contains a thorough discussion of performance considerations with
Oracle Label Security.
Oracle Label Security can provide user labels to be used as factors within Oracle Database Vault, and security labels can be assigned to Real Application Security users. It also integrates with Oracle Advanced Security Data Redaction, enabling security clearances to be used in data redaction policies.