
DATA STRUCTURE AND ALGORITHM

Unit I

Arrays and sequential representations – ordered lists – Stacks and Queues – Evaluation of
Expressions – Multiple Stacks and Queues – Singly Linked List – Linked Stacks and
Queues – Polynomial addition.

Data Structure Introduction:

The name "data structure" itself indicates organizing data in memory. There are many ways
of organizing data in memory, and we have already seen one of them: the array in the C
language. An array is a collection of memory elements in which data is stored sequentially,
i.e., one after another; in other words, an array stores its elements in a contiguous manner.
There are also other ways to organize data in memory. To structure data in memory, a
number of organizations were proposed, and these are known as Abstract Data Types
(ADTs). An abstract data type is a set of rules.

Types of Data Structures

There are two types of data structures:

o Primitive data structure
o Non-primitive data structure

Primitive Data Structure

The primitive data structures are the primitive data types. The int, char, float, double, and
pointer are primitive data structures that can hold a single value.

Non-Primitive Data Structure

The non-primitive data structure is divided into two types:

 Linear data structure
 Non-linear data structure

Linear Data Structure

The arrangement of data in a sequential manner is known as a linear data structure. The
data structures used for this purpose are arrays, linked lists, stacks, and queues. In these
data structures, one element is connected to only one other element in a linear form.

Non-Linear Data Structure

When one element is connected to 'n' other elements, the structure is known as a non-linear
data structure. The best examples are trees and graphs. In this case, the elements are not
arranged in a sequential manner.

Data structures can also be classified as:

 Static data structure: It is a type of data structure where the size is allocated
at compile time. Therefore, the maximum size is fixed.
 Dynamic data structure: It is a type of data structure where the size is
allocated at run time. Therefore, the maximum size is flexible.

Major Operations

The major or common operations that can be performed on data structures are:

o Searching: We can search for any element in a data structure.
o Sorting: We can sort the elements of a data structure in either ascending or
descending order.
o Insertion: We can insert a new element into a data structure.
o Updation: We can update an element, i.e., replace it with another element.
o Deletion: We can perform the delete operation to remove an element from the
data structure.

Advantages of Data Structures

The following are the advantages of a data structure:

o Efficiency: If the choice of a data structure for implementing a particular ADT
is proper, it makes the program very efficient in terms of time and space.
o Reusability: The data structure provides reusability; multiple client programs
can use the same data structure.
o Abstraction: The data structure specified by an ADT also provides a level of
abstraction. The client cannot see the internal working of the data structure,
so it does not have to worry about the implementation; the client sees only
the interface.

Arrays and Sequential Representation

Definition

o Arrays are defined as a collection of similar types of data items stored at contiguous
memory locations.
o Arrays are a derived data type in the C programming language which can store
primitive types of data such as int, char, double, float, etc.
o An array is the simplest data structure, where each data element can be randomly
accessed by using its index number.
o For example, if we want to store the marks of a student in 6 subjects, we don't need
to define a different variable for the marks in each subject. Instead, we can define an
array which stores the marks for each subject at contiguous memory locations.

The array marks[10] defines the marks of the student in 10 different subjects, where each
subject's marks are located at a particular subscript in the array, i.e. marks[0] denotes the
marks in the first subject, marks[1] denotes the marks in the 2nd subject, and so on.

Properties of the Array

1. Each element is of the same data type and carries the same size, e.g. int = 4 bytes.
2. Elements of the array are stored at contiguous memory locations, where the first
element is stored at the smallest memory location.
3. Elements of the array can be randomly accessed, since we can calculate the address
of each element of the array from the given base address and the size of a data element.

Advantages of Array

o An array provides a single name for a group of variables of the same type; therefore,
it is easy to remember the names of all the elements of an array.
o Traversing an array is a very simple process; we just need to increment the base
address of the array in order to visit each element one by one.
o Any element in the array can be directly accessed by using its index.

Memory Allocation of the Array

As we have mentioned, all the data elements of an array are stored at contiguous locations
in the main memory. The name of the array represents the base address, or the address of
the first element, in the main memory. Each element of the array is referenced by a proper
index.

The indexing of the array can be defined in three ways.

1. 0 (zero-based indexing): The first element of the array will be arr[0].
2. 1 (one-based indexing): The first element of the array will be arr[1].
3. n (n-based indexing): The first index of the array can be any arbitrary number.

In the following image, we have shown the memory allocation of an array arr of size 5. The
array follows the 0-based indexing approach. The base address of the array is the 100th
byte. This will be the address of arr[0]. Here, the size of int is 4 bytes, therefore each
element will take 4 bytes in the memory.

In 0-based indexing, if the size of an array is n, then the maximum index number an
element can have is n-1. However, it will be n if we use 1-based indexing.

Accessing Elements of an Array

To access any random element of an array, we need the following information:

1. Base address of the array.
2. Size of an element in bytes.
3. Which type of indexing the array follows.

The address of any element of a 1D array can be calculated by using the following formula:

Byte address of element A[i] = base address + size * ( i - first index )
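As a quick sanity check of this formula, the short C program below (a minimal sketch; the
array arr and the pointer arithmetic are illustrative, not part of the original text) compares
the address computed from the formula with the address the compiler actually produces.

#include <stdio.h>

int main(void)
{
    int arr[5] = {10, 20, 30, 40, 50};

    /* base address = address of the first element */
    char *base = (char *)&arr[0];

    for (int i = 0; i < 5; i++) {
        /* Byte address of A[i] = base address + size * (i - first index); first index is 0 */
        char *computed = base + sizeof(int) * (i - 0);
        printf("arr[%d]: computed %p, actual %p\n", i, (void *)computed, (void *)&arr[i]);
    }
    return 0;
}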
2D Array

A 2D array can be defined as an array of arrays. The 2D array is organized as a matrix,
which can be represented as a collection of rows and columns. 2D arrays are created to
implement a relational-database-like data structure. They provide ease of holding a bulk of
data at once, which can be passed to any number of functions wherever required.

How to declare a 2D Array

The syntax for declaring a two dimensional array is very similar to that of a one dimensional
array, and is given as follows.

1. int arr[max_rows][max_columns];

It produces a data structure which looks like the following.

Above image shows the two dimensional array; the elements are organized in the form of
rows and columns. The first element of the first row is represented by a[0][0], where the
number shown in the first index is the number of that row while the number shown in the
second index is the number of the column.

How do we access data in a 2D array

Similar to one dimensional arrays, we can access the individual cells in a 2D array by using
the indices of the cells. There are two indices attached to a particular cell: one is its row
number while the other is its column number. We can store the value held in any particular
cell of a 2D array in some variable x by using the following syntax.

1. int x = a[i][j];

where i and j are the row and column number of the cell respectively.

We can assign each cell of a 2D array to 0 by using the following code:

1. for ( int i=0; i<n ;i++)
2. {
3.     for (int j=0; j<n; j++)
4.     {
5.         a[i][j] = 0;
6.     }
7. }

Initializing 2D Arrays

We know that when we declare and initialize a one dimensional array in C programming
simultaneously, we don't need to specify the size of the array. However, this will not work
with 2D arrays; we will have to define at least the second dimension of the array.

The syntax to declare and initialize a 2D array is given as follows.

1. int arr[2][2] = {0,1,2,3};

The number of elements that can be present in a 2D array will always be equal to
(number of rows * number of columns).

Mapping 2D array to 1D array

When it comes to mapping a 2 dimensional array, most of us might wonder why this mapping
is required. 2D arrays exist from the user's point of view: they are created to implement a
relational database table lookalike data structure. In computer memory, however, the storage
technique for a 2D array is similar to that of a one dimensional array.

The size of a two dimensional array is equal to the product of the number of rows and the
number of columns present in the array. We need to map the two dimensional array to a one
dimensional array in order to store it in memory.

A 3 X 3 two dimensional array is shown in the following image. This array needs to be mapped
to a one dimensional array in order to store it in memory.

There are two main techniques of storing 2D array elements in memory:

1. Row Major ordering

In row major ordering, all the rows of the 2D array are stored in memory contiguously.
Considering the array shown in the above image, its memory allocation according to row
major order is shown as follows.

First, the 1st row of the array is stored in memory completely, then the 2nd row of the array
is stored completely, and so on up to the last row.

2. Column Major ordering

According to column major ordering, all the columns of the 2D array are stored in memory
contiguously. The memory allocation of the array shown in the above image is given as
follows.

First, the 1st column of the array is stored in memory completely, then the 2nd column of
the array is stored completely, and so on up to the last column.

Calculating the Address of a random element of a 2D array

Since there are two different techniques of storing a two dimensional array in memory, there
are two different formulas to calculate the address of a random element of the 2D array.

By Row Major Order

If the array is declared as a[m][n], where m is the number of rows and n is the number of
columns, then the address of an element a[i][j] of the array stored in row major order is
calculated as:

1. Address(a[i][j]) = B.A. + (i * n + j) * size

where B.A. is the base address, i.e. the address of the first element of the array a[0][0].

Example:

1. a[10...30, 55...75], base address of the array (BA) = 0, size of an element = 4 bytes.
2. Find the location of a[15][68].
3. The number of columns is n = 75 - 55 + 1 = 21.
4. Address(a[15][68]) = 0 +
5. ((15 - 10) x 21 + (68 - 55)) x 4
6. = (5 x 21 + 13) x 4
7. = 118 x 4
8. = 472 answer

By Column Major Order

If the array is declared as a[m][n], where m is the number of rows and n is the number of
columns, then the address of an element a[i][j] of the array stored in column major order is
calculated as:

1. Address(a[i][j]) = ((j * m) + i) * Size + BA

where BA is the base address of the array.

Example:

A[-5 ... +20][20 ... 70], BA = 1020, size of an element = 8 bytes. Find the location of A[0][30].
The number of rows is m = 20 - (-5) + 1 = 26.

Address(A[0][30]) = ((30 - 20) x 26 + (0 - (-5))) x 8 + 1020 = 265 x 8 + 1020 = 3140 bytes
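Both formulas can be checked with a small C sketch like the one below (the helper names
and the example bounds are illustrative choices); it computes the byte address of an element
under each ordering for an array indexed from arbitrary lower bounds, following the formulas
above.

#include <stdio.h>

/* Row major:    address = BA + ((i - lr) * ncols + (j - lc)) * size
   Column major: address = BA + ((j - lc) * nrows + (i - lr)) * size
   lr/lc are the first row/column index, nrows/ncols the number of rows/columns. */
long row_major(long ba, int i, int j, int lr, int lc, int ncols, int size)
{
    return ba + ((long)(i - lr) * ncols + (j - lc)) * size;
}

long col_major(long ba, int i, int j, int lr, int lc, int nrows, int size)
{
    return ba + ((long)(j - lc) * nrows + (i - lr)) * size;
}

int main(void)
{
    /* a[10...30, 55...75], BA = 0, 4-byte elements: 21 rows x 21 columns */
    printf("row major a[15][68] -> %ld\n", row_major(0, 15, 68, 10, 55, 21, 4));

    /* A[-5...20, 20...70], BA = 1020, 8-byte elements: 26 rows x 51 columns */
    printf("col major A[0][30]  -> %ld\n", col_major(1020, 0, 30, -5, 20, 26, 8));
    return 0;
}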

Ordered list

An ordered list is a list in which the order of the items is significant. However, the items in
an ordered list are not necessarily sorted. Consequently, it is possible to change the order of
the items and still have a valid ordered list.

Consider a list of the titles of the chapters in this book. The order of the items in the list
corresponds to the order in which they appear in the book. However, since the chapter titles
are not sorted alphabetically, we cannot consider the list to be sorted. Since it is possible to
change the order of the chapters in the book, we must be able to do the same with the items
of the list. As a result, we may insert an item into an ordered list at any position.

A searchable container is a container that supports the following additional operations:

Insert
- used to put objects into the container;
withdraw
- used to remove objects from the container;
find
- used to locate objects in the container;
isMember
- used to test whether a given object instance is in the container.

Stack

A real-world stack allows operations at one end only. For example, we can place or remove
a card or plate from the top of the stack only. At any given time, we can only access the top
element of a stack.

This feature makes it a LIFO data structure. LIFO stands for Last-In-First-Out. Here, the
element which is placed (inserted or added) last is accessed first. In stack terminology, the
insertion operation is called the PUSH operation and the removal operation is called the
POP operation.

Stack Representation

The following diagram depicts a stack and its operations −

A stack can be implemented by means of an array, structure, pointer, or linked list. A stack
can either be of a fixed size or it may have a sense of dynamic resizing. Here, we are going
to implement the stack using arrays, which makes it a fixed-size stack implementation.

Basic Operations

Stack operations may involve initializing the stack, using it, and then de-initializing it. Apart
from these basics, a stack is used for the following two primary operations −

 push() − Pushing (storing) an element on the stack.
 pop() − Removing (accessing) an element from the stack.

To use a stack efficiently, we also need to check the status of the stack. For this purpose,
the following functionality is added to stacks −

 peek() − get the top data element of the stack, without removing it.
 isFull() − check if the stack is full.
 isEmpty() − check if the stack is empty.

At all times, we maintain a pointer to the last PUSHed data on the stack. The top pointer
provides the top value of the stack without actually removing it.

First we should learn about the procedures that support the stack functions −

peek()

Algorithm of the peek() function −

begin procedure peek
   return stack[top]
end procedure

isfull()

Algorithm of the isfull() function −

begin procedure isfull
   if top equals to MAXSIZE
      return true
   else
      return false
   endif
end procedure

isempty()

Algorithm of the isempty() function −

begin procedure isempty
   if top less than 1
      return true
   else
      return false
   endif
end procedure

The implementation of the isempty() function in the C programming language is slightly
different. We initialize top at -1, as array indexing starts from 0. So we check whether top is
below zero, i.e. -1, to determine if the stack is empty.

Push Operation

The process of putting a new data element onto the stack is known as a Push Operation.
The push operation involves a series of steps −

 Step 1 − Check if the stack is full.
 Step 2 − If the stack is full, produce an error and exit.
 Step 3 − If the stack is not full, increment top to point to the next empty space.
 Step 4 − Add the data element to the stack location where top is pointing.
 Step 5 − Return success.

If a linked list is used to implement the stack, then in step 3 we need to allocate space
dynamically.

Algorithm for PUSH Operation

A simple algorithm for the push operation can be derived as follows −

begin procedure push: stack, data
   if stack is full
      return null
   endif
   top ← top + 1
   stack[top] ← data
end procedure

Pop Operation

Accessing the content while removing it from the stack is known as a Pop Operation. In an
array implementation of the pop() operation, the data element is not actually removed;
instead, top is decremented to a lower position in the stack to point to the next value. But in
the linked-list implementation, pop() actually removes the data element and deallocates
memory space.

A pop operation may involve the following steps −

 Step 1 − Check if the stack is empty.
 Step 2 − If the stack is empty, produce an error and exit.
 Step 3 − If the stack is not empty, access the data element at which top is pointing.
 Step 4 − Decrease the value of top by 1.
 Step 5 − Return success.

Algorithm for Pop Operation

A simple algorithm for the pop operation can be derived as follows −

begin procedure pop: stack
   if stack is empty
      return null
   endif
   data ← stack[top]
   top ← top - 1
   return data
end procedure
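Putting the push and pop algorithms above together, a minimal array-based stack in C might
look like the sketch below (MAXSIZE, the stack array and the helper names are illustrative
choices, not fixed by the text).

#include <stdio.h>

#define MAXSIZE 8

int stack[MAXSIZE];
int top = -1;               /* -1 means the stack is empty */

int isempty(void) { return top == -1; }
int isfull(void)  { return top == MAXSIZE - 1; }
int peek(void)    { return stack[top]; }

/* Steps 1-2: check for overflow; Steps 3-5: increment top and store the data */
int push(int data)
{
    if (isfull())
        return 0;           /* overflow */
    stack[++top] = data;
    return 1;
}

/* Steps 1-2: check for underflow; Steps 3-5: read the top element and decrement top */
int pop(int *data)
{
    if (isempty())
        return 0;           /* underflow */
    *data = stack[top--];
    return 1;
}

int main(void)
{
    int x;
    push(3); push(5); push(9);
    while (pop(&x))
        printf("%d ", x);   /* prints 9 5 3 (LIFO order) */
    printf("\n");
    return 0;
}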


Queue

A queue is an abstract data structure, somewhat similar to a stack. Unlike a stack, a queue
is open at both its ends. One end is always used to insert data (enqueue) and the other is
used to remove data (dequeue). A queue follows the First-In-First-Out methodology, i.e., the
data item stored first will be accessed first.

Queue Representation

As we now understand, in a queue we access both ends for different reasons. The following
diagram tries to explain the queue representation as a data structure −

Basic Operations

Queue operations may involve initializing or defining the queue, utilizing it, and then
completely erasing it from the memory. Here we shall try to understand the basic operations
associated with queues −

 enqueue() − add (store) an item to the queue.
 dequeue() − remove (access) an item from the queue.

A few more functions are required to make the above-mentioned queue operations efficient.
These are −

 peek() − Gets the element at the front of the queue without removing it.
 isfull() − Checks if the queue is full.
 isempty() − Checks if the queue is empty.

In a queue, we always dequeue (or access) the data pointed to by the front pointer, and
while enqueuing (or storing) data in the queue we take the help of the rear pointer.

peek()

This function helps to see the data at the front of the queue. The algorithm of the peek()
function is as follows −

Algorithm

begin procedure peek
   return queue[front]
end procedure

isfull()

As we are using a single dimension array to implement the queue, we just check whether
the rear pointer has reached MAXSIZE to determine that the queue is full. In case we
maintain the queue in a circular linked list, the algorithm will differ. Algorithm of the isfull()
function −

Algorithm

begin procedure isfull
   if rear equals to MAXSIZE
      return true
   else
      return false
   endif
end procedure

isempty()

Algorithm of the isempty() function −

Algorithm

begin procedure isempty
   if front is less than MIN OR front is greater than rear
      return true
   else
      return false
   endif
end procedure

If the value of front is less than MIN or 0, it tells us that the queue is not yet initialized and
hence empty.

Enqueue Operation

Queues maintain two data pointers, front and rear. Therefore, queue operations are
comparatively more difficult to implement than those of stacks.

The following steps should be taken to enqueue (insert) data into a queue −

 Step 1 − Check if the queue is full.
 Step 2 − If the queue is full, produce an overflow error and exit.
 Step 3 − If the queue is not full, increment the rear pointer to point to the next
empty space.
 Step 4 − Add the data element to the queue location where rear is pointing.
 Step 5 − Return success.

Sometimes, we also check whether a queue is initialized or not, to handle any unforeseen
situations.

Algorithm for enqueue operation

procedure enqueue(data)
   if queue is full
      return overflow
   endif
   rear ← rear + 1
   queue[rear] ← data
   return true
end procedure

Dequeue Operation

Accessing data from the queue is a process of two tasks − access the data where front is
pointing, and remove the data after access. The following steps are taken to perform the
dequeue operation −

 Step 1 − Check if the queue is empty.
 Step 2 − If the queue is empty, produce an underflow error and exit.
 Step 3 − If the queue is not empty, access the data where front is pointing.
 Step 4 − Increment the front pointer to point to the next available data element.
 Step 5 − Return success.

Algorithm for dequeue operation

procedure dequeue
   if queue is empty
      return underflow
   end if
   data = queue[front]
   front ← front + 1
   return true
end procedure
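As a companion to these two algorithms, here is a minimal array-based queue in C (a sketch
under the same simplifying assumption as the pseudocode: front and rear only move
forward, so freed slots are not reused; MAXSIZE and the identifiers are illustrative).

#include <stdio.h>

#define MAXSIZE 8

int queue[MAXSIZE];
int front = 0;              /* index of the first stored element */
int rear  = -1;             /* index of the last stored element  */

int isempty(void) { return front > rear; }
int isfull(void)  { return rear == MAXSIZE - 1; }
int peek(void)    { return queue[front]; }

/* enqueue: fail on overflow, otherwise advance rear and store the data */
int enqueue(int data)
{
    if (isfull())
        return 0;
    queue[++rear] = data;
    return 1;
}

/* dequeue: fail on underflow, otherwise read at front and advance front */
int dequeue(int *data)
{
    if (isempty())
        return 0;
    *data = queue[front++];
    return 1;
}

int main(void)
{
    int x;
    enqueue(3); enqueue(5); enqueue(9);
    while (dequeue(&x))
        printf("%d ", x);   /* prints 3 5 9 (FIFO order) */
    printf("\n");
    return 0;
}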
Evaluation of Expression

The way we write an arithmetic expression is known as a notation. An arithmetic expression
can be written in three different but equivalent notations, i.e., without changing the essence
or output of the expression. These notations are −

 Infix Notation
 Prefix (Polish) Notation
 Postfix (Reverse-Polish) Notation

These notations are named according to how they use the operator in the expression. We
shall learn the same here in this chapter.

Infix Notation

We write expressions in infix notation, e.g. a - b + c, where operators are used in-between
operands. It is easy for us humans to read, write, and speak in infix notation, but the same
does not go well with computing devices. An algorithm to process infix notation could be
difficult and costly in terms of time and space consumption.

Prefix Notation

In this notation, the operator is prefixed to the operands, i.e. the operator is written ahead of
the operands. For example, +ab. This is equivalent to its infix notation a + b. Prefix notation
is also known as Polish Notation.

Postfix Notation

This notation style is known as Reversed Polish Notation. In this notation style, the operator
is postfixed to the operands, i.e., the operator is written after the operands. For example,
ab+. This is equivalent to its infix notation a + b.

The following table briefly tries to show the difference in all three notations −

Sr.No.   Infix Notation        Prefix Notation   Postfix Notation
1        a + b                 +ab               ab+
2        (a + b) ∗ c           ∗+abc             ab+c∗
3        a ∗ (b + c)           ∗a+bc             abc+∗
4        a / b + c / d         +/ab/cd           ab/cd/+
5        (a + b) ∗ (c + d)     ∗+ab+cd           ab+cd+∗
6        ((a + b) ∗ c) - d     -∗+abcd           ab+c∗d-

Parsing Expressions

As we have discussed, it is not a very efficient way to design an algorithm or program to
parse infix notations. Instead, these infix notations are first converted into either postfix or
prefix notations and then computed.

To parse any arithmetic expression, we need to take care of operator precedence and
associativity as well.

Precedence

When an operand is in between two different operators, which operator will take the
operand first is decided by the precedence of one operator over the others. For example −

a + b * c  →  a + (b * c)

As the multiplication operation has precedence over addition, b * c will be evaluated first. A
table of operator precedence is provided later.

Associativity

Associativity describes the rule applied when operators with the same precedence appear
in an expression. For example, in the expression a + b − c, both + and − have the same
precedence; which part of the expression will be evaluated first is then determined by the
associativity of those operators. Here, both + and − are left associative, so the expression
will be evaluated as (a + b) − c.

Precedence and associativity determine the order of evaluation of an expression. The
following is an operator precedence and associativity table (highest to lowest) −

Sr.No.   Operator                                Precedence       Associativity
1        Exponentiation ^                        Highest          Right Associative
2        Multiplication ( ∗ ) & Division ( / )   Second Highest   Left Associative
3        Addition ( + ) & Subtraction ( − )      Lowest           Left Associative

The above table shows the default behavior of the operators. At any point of time in
expression evaluation, the order can be altered by using parentheses. For example −

In a + b * c, the expression part b * c will be evaluated first, as multiplication has
precedence over addition. We use parentheses if we want a + b to be evaluated first,
like (a + b) * c.

Postfix Evaluation Algorithm

Step 1 − scan the expression from left to right
Step 2 − if it is an operand, push it to the stack
Step 3 − if it is an operator, pull operands from the stack and perform the operation
Step 4 − store the output of step 3 back onto the stack
Step 5 − scan the expression until all operands are consumed
Step 6 − pop the stack and the result is the final answer

A postfix expression is a collection of operators and operands in which the operator is
placed after the operands. That means, in a postfix expression the operator follows the
operands.

A postfix expression has the following general structure...

Operand1 Operand2 Operator

Postfix Expression Evaluation using Stack Data Structure

A postfix expression can be evaluated using the Stack data structure. To evaluate a postfix
expression using the Stack data structure we can use the following steps... Read all the
symbols one by one from left to right in the given postfix expression.

1. If the symbol read is an operand, then push it onto the Stack.
2. If the symbol read is an operator (+, -, *, / etc.), then perform TWO pop operations
and store the two popped operands in two different variables (operand1 and operand2).
Then perform the operation indicated by the symbol using operand1 and operand2 and
push the result back onto the Stack.
3. Finally, perform a pop operation and display the popped value as the final result.

Example: Consider the following expression...
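Following the steps above, a compact C sketch of postfix evaluation is shown below; it
assumes single-digit operands and the four operators + - * /, with no error handling, purely
for illustration.

#include <stdio.h>
#include <ctype.h>

int stack[64];
int top = -1;

void push(int v)  { stack[++top] = v; }
int  pop(void)    { return stack[top--]; }

/* Evaluate a postfix expression with single-digit operands, e.g. "231*+9-" */
int eval_postfix(const char *expr)
{
    for (; *expr != '\0'; expr++) {
        if (isdigit((unsigned char)*expr)) {
            push(*expr - '0');              /* operand: push its value */
        } else {
            int op2 = pop();                /* operator: pop two operands ...   */
            int op1 = pop();
            switch (*expr) {                /* ... apply it and push the result */
            case '+': push(op1 + op2); break;
            case '-': push(op1 - op2); break;
            case '*': push(op1 * op2); break;
            case '/': push(op1 / op2); break;
            }
        }
    }
    return pop();                           /* final result is on top of the stack */
}

int main(void)
{
    /* "231*+9-" is the postfix form of 2 + 3 * 1 - 9, which evaluates to -4 */
    printf("%d\n", eval_postfix("231*+9-"));
    return 0;
}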

Multiple Stacks and Queues:

Multiple Stacks:

The following pictures show two ways to keep two stacks in one array:

1. Non-fixed size of the stacks:

 Stack 1 expands from the 0th element to the right
 Stack 2 expands from the 12th element to the left
 As long as the values of Top 1 and Top 2 are not next to each other, the array has free
elements for input of data
 When both stacks are full, Top 1 and Top 2 will be next to each other
 There is no fixed boundary between Stack 1 and Stack 2
 Elements –1 and –2 are used to store the information needed to manipulate the stacks
(the subscripts for Top 1 and Top 2)

2. Fixed size of the stacks:

 Stack 1 expands from the 0th element to the right
 Stack 2 expands from the 6th element to the left
 As long as the value of Top 1 is less than 6 and greater than 0, Stack 1 has free
elements for input of data in the array
 As long as the value of Top 2 is less than 11 and greater than 5, Stack 2 has free
elements for input of data in the array
 When the value of Top 1 is 5, Stack 1 is full
 When the value of Top 2 is 10, Stack 2 is full
 Elements –1 and –2 are used to store the size of Stack 1 and the subscript of the array for
Top 1 needed to manipulate Stack 1
 Elements –3 and –4 are used to store the size of Stack 2 and the subscript of the array for
Top 2 needed to manipulate Stack 2

procedure ADD (i, X)        //add element X to the i'th stack, 1 <= i <= n//
   if T(i) = B(i + 1) then call STACK-FULL (i)
   T(i) ← T(i) + 1
   V(T(i)) ← X              //add X to the i'th stack//
end ADD

procedure DELETE (i, X)     //delete topmost element of stack i//
   if T(i) = B(i) then call STACK-EMPTY(i)
   X ← V(T(i))
   T(i) ← T(i) - 1
end DELETE

Multiple Queues:

The following pictures show two ways to keep two queues in one array:

1. Non-fixed size of the queues:

 Queue 1 expands from the 0th element to the right and circles back to the 0th element
 Queue 2 expands from the 8th element to the left and circles back to the 8th element
 There is a temporary boundary between Queue 1 and Queue 2; as long as there are free
elements in the array, the boundary can shift
 Free elements could be anywhere in the queue, such as before the front, after the rear,
and between the front and rear of the queue
 Queue 1's and Queue 2's sizes can change if necessary. When Queue 1 is full and
Queue 2 has free space, Queue 1 can increase its size to use that free space from
Queue 2, and the same goes for Queue 2
 Elements –1, –2, and –3 are used to store the size of Queue 1, the front of Queue 1,
and the data count for Queue 1 needed to manipulate Queue 1
 Elements –4, –5, and –6 are used to store the size of Queue 2, the front of Queue 2,
and the data count for Queue 2 needed to manipulate Queue 2
 Insert data into Queue 1: Q1Rear = (Q1Front + Q1count) % Q1Size
 Insert data into Queue 2: Q2Rear = (Q2Front + Q2count) % Q2Size + Q1Size
 Delete data from Queue 1: Q1Front = (Q1Front + 1) % Q1Size
 Delete data from Queue 2: Q2Front = (Q2Front + 1) % Q2Size + Q1Size

2. Fixed size of the queues:

 Queue 1 expands from the 0th element to the 4th element and circles back to the 0th element
 Queue 2 expands from the 8th element to the 5th element and circles back to the 8th element
 The boundary between Queue 1 and Queue 2 is fixed
 Free elements could be anywhere in the queue, such as before the front, after the rear,
and between the front and rear of the queue
 Elements –1, –2, and –3 are used to store the size of Queue 1, the front of Queue 1,
and the data count for Queue 1 needed to manipulate Queue 1
 Elements –4, –5, and –6 are used to store the size of Queue 2, the front of Queue 2,
and the data count for Queue 2 needed to manipulate Queue 2
 Insert data into Queue 1: Q1Rear = (Q1Front + Q1count) % Q1Size
 Insert data into Queue 2: Q2Rear = (Q2Front + Q2count) % Q2Size + Q1Size
 Delete data from Queue 1: Q1Front = (Q1Front + 1) % Q1Size
 Delete data from Queue 2: Q2Front = (Q2Front + 1) % Q2Size + Q1Size
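For the simple two-stack case, the ADD/DELETE idea above can be expressed in C. The
sketch below (the array size and the names top1/top2, push1/pop1 and so on are
illustrative, not from the text) grows stack 1 from the left end and stack 2 from the right end,
declaring overflow only when the two tops meet.

#include <stdio.h>

#define SIZE 12

int V[SIZE];
int top1 = -1;        /* stack 1 grows from index 0 to the right      */
int top2 = SIZE;      /* stack 2 grows from index SIZE-1 to the left  */

/* the array is full only when the two tops are next to each other */
int push1(int x)
{
    if (top1 + 1 == top2) return 0;     /* STACK-FULL  */
    V[++top1] = x;
    return 1;
}

int push2(int x)
{
    if (top1 + 1 == top2) return 0;     /* STACK-FULL  */
    V[--top2] = x;
    return 1;
}

int pop1(int *x)
{
    if (top1 == -1) return 0;           /* STACK-EMPTY */
    *x = V[top1--];
    return 1;
}

int pop2(int *x)
{
    if (top2 == SIZE) return 0;         /* STACK-EMPTY */
    *x = V[top2++];
    return 1;
}

int main(void)
{
    int x;
    push1(1); push1(2);                 /* stack 1 holds 1, 2              */
    push2(9); push2(8);                 /* stack 2 holds 9, 8 (from right) */
    pop1(&x); printf("%d\n", x);        /* 2 */
    pop2(&x); printf("%d\n", x);        /* 8 */
    return 0;
}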
A linked list is a sequence of data structures which are connected together via links. A
linked list is a sequence of links which contain items. Each link contains a connection to
another link. The linked list is the second most-used data structure after the array. The
following are the important terms needed to understand the concept of a linked list.

 Link − Each link of a linked list can store a data item called an element.
 Next − Each link of a linked list contains a link to the next link called Next.
 LinkedList − A linked list contains the connection link to the first link, called First.

Linked List Representation

A linked list can be visualized as a chain of nodes, where every node points to the next
node.

As per the above illustration, the following are the important points to be considered.

 A linked list contains a link element called first.
 Each link carries a data field(s) and a link field called next.
 Each link is linked with its next link using its next link.
 The last link carries a link set to null to mark the end of the list.

Types of Linked List

Following are the various types of linked list.

 Simple Linked List − Item navigation is forward only.
 Doubly Linked List − Items can be navigated forward and backward.
 Circular Linked List − The last item contains a link to the first element as next, and
the first element has a link to the last element as previous.

Basic Operations

Following are the basic operations supported by a list.

 Insertion − Adds an element at the beginning of the list.
 Deletion − Deletes an element at the beginning of the list.
 Display − Displays the complete list.
 Search − Searches for an element using the given key.
 Delete − Deletes an element using the given key.

Insertion Operation

Adding a new node to a linked list is a more-than-one-step activity. We shall learn this with
diagrams here. First, create a node using the same structure and find the location where it
has to be inserted.

Imagine that we are inserting a node B (NewNode) between A (LeftNode) and C
(RightNode). Then point B.next to C −

NewNode.next −> RightNode;

It should look like this −

Now, the next node on the left should point to the new node.

LeftNode.next −> NewNode;

What is Single Linked List?

Simply, a list is a sequence of data, and a linked list is a sequence of data linked with each
other.

The formal definition of a single linked list is as follows...

A single linked list is a sequence of elements in which every element has a link to its next
element in the sequence.

In any single linked list, the individual element is called a "Node". Every "Node" contains two
fields, the data field and the next field. The data field is used to store the actual value of the
node and the next field is used to store the address of the next node in the sequence.

The graphical representation of a node in a single linked list is as follows...

Example

Operations on Single Linked List

The following operations are performed on a Single Linked List:

 Insertion
 Deletion
 Display

Before we implement the actual operations, we first need to set up an empty list. Perform
the following steps before implementing the actual operations.

 Step 1 - Include all the header files which are used in the program.
 Step 2 - Declare all the user defined functions.
 Step 3 - Define a Node structure with two members, data and next.
 Step 4 - Define a Node pointer 'head' and set it to NULL.
 Step 5 - Implement the main method by displaying an operations menu and making
suitable function calls in the main method to perform the user-selected operation.

Insertion

In a single linked list, the insertion operation can be performed in three ways. They are as
follows...

1. Inserting At Beginning of the list
2. Inserting At End of the list
3. Inserting At Specific location in the list

Inserting At Beginning of the list

We can use the following steps to insert a new node at the beginning of the single linked list...

 Step 1 - Create a newNode with the given value.
 Step 2 - Check whether the list is Empty (head == NULL).
 Step 3 - If it is Empty then, set newNode→next = NULL and head = newNode.
 Step 4 - If it is Not Empty then, set newNode→next = head and head = newNode.

Inserting At End of the list

We can use the following steps to insert a new node at the end of the single linked list...

 Step 1 - Create a newNode with the given value and newNode → next as NULL.
 Step 2 - Check whether the list is Empty (head == NULL).
 Step 3 - If it is Empty then, set head = newNode.
 Step 4 - If it is Not Empty then, define a node pointer temp and initialize it with head.
 Step 5 - Keep moving temp to its next node until it reaches the last node in the list
(until temp → next is equal to NULL).
 Step 6 - Set temp → next = newNode.

Inserting At Specific location in the list (After a Node)

We can use the following steps to insert a new node after a given node in the single linked list...

 Step 1 - Create a newNode with the given value.
 Step 2 - Check whether the list is Empty (head == NULL).
 Step 3 - If it is Empty then, set newNode → next = NULL and head = newNode.
 Step 4 - If it is Not Empty then, define a node pointer temp and initialize it with head.
 Step 5 - Keep moving temp to its next node until it reaches the node after which we
want to insert the newNode (until temp → data is equal to location, where location is
the node value after which we want to insert the newNode).
 Step 6 - Every time, check whether temp has reached the last node. If it has reached
the last node then display 'Given node is not found in the list!!! Insertion not
possible!!!' and terminate the function. Otherwise move temp to the next node.
 Step 7 - Finally, set 'newNode → next = temp → next' and 'temp → next = newNode'.

Deletion

In a single linked list, the deletion operation can be performed in three ways. They are as
follows...

1. Deleting from Beginning of the list
2. Deleting from End of the list
3. Deleting a Specific Node

Deleting from Beginning of the list

We can use the following steps to delete a node from the beginning of the single linked list...

 Step 1 - Check whether the list is Empty (head == NULL).
 Step 2 - If it is Empty then, display 'List is Empty!!! Deletion is not possible' and
terminate the function.
 Step 3 - If it is Not Empty then, define a Node pointer 'temp' and initialize it with head.
 Step 4 - Check whether the list has only one node (temp → next == NULL).
 Step 5 - If it is TRUE then set head = NULL and delete temp (setting the Empty list
conditions).
 Step 6 - If it is FALSE then set head = temp → next, and delete temp.

Deleting from End of the list

We can use the following steps to delete a node from the end of the single linked list...

 Step 1 - Check whether the list is Empty (head == NULL).
 Step 2 - If it is Empty then, display 'List is Empty!!! Deletion is not possible' and
terminate the function.
 Step 3 - If it is Not Empty then, define two Node pointers 'temp1' and 'temp2' and
initialize 'temp1' with head.
 Step 4 - Check whether the list has only one Node (temp1 → next == NULL).
 Step 5 - If it is TRUE, then set head = NULL, delete temp1, and terminate the
function. (Setting the Empty list condition)
 Step 6 - If it is FALSE, then set 'temp2 = temp1' and move temp1 to its next node.
Repeat the same until temp1 reaches the last node in the list (until temp1 →
next == NULL).
 Step 7 - Finally, set temp2 → next = NULL and delete temp1.

Deleting a Specific Node from the list

We can use the following steps to delete a specific node from the single linked list...

 Step 1 - Check whether the list is Empty (head == NULL).
 Step 2 - If it is Empty then, display 'List is Empty!!! Deletion is not possible' and
terminate the function.
 Step 3 - If it is Not Empty then, define two Node pointers 'temp1' and 'temp2' and
initialize 'temp1' with head.
 Step 4 - Keep moving temp1 until it reaches the exact node to be deleted or the last
node. And every time set 'temp2 = temp1' before moving 'temp1' to its next node.
 Step 5 - If it has reached the last node then display 'Given node not found in the list!
Deletion not possible!!!' and terminate the function.
 Step 6 - If it has reached the exact node which we want to delete, then check whether
the list has only one node or not.
 Step 7 - If the list has only one node and that is the node to be deleted, then
set head = NULL and delete temp1 (free(temp1)).
 Step 8 - If the list contains multiple nodes, then check whether temp1 is the first node
in the list (temp1 == head).
 Step 9 - If temp1 is the first node then move the head to the next node (head = head →
next) and delete temp1.
 Step 10 - If temp1 is not the first node then check whether it is the last node in the list
(temp1 → next == NULL).
 Step 11 - If temp1 is the last node then set temp2 → next = NULL and
delete temp1 (free(temp1)).
 Step 12 - If temp1 is not the first node and not the last node then set temp2 → next =
temp1 → next and delete temp1 (free(temp1)).

Displaying a Single Linked List

We can use the following steps to display the elements of a single linked list...

 Step 1 - Check whether the list is Empty (head == NULL).
 Step 2 - If it is Empty then, display 'List is Empty!!!' and terminate the function.
 Step 3 - If it is Not Empty then, define a Node pointer 'temp' and initialize it with head.
 Step 4 - Keep displaying temp → data with an arrow (--->) until temp reaches the last
node.
 Step 5 - Finally, display temp → data with an arrow pointing to NULL (temp → data --->
NULL).
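The steps above translate fairly directly into C. The sketch below (struct and function names
are illustrative) implements insertion at the beginning, deletion from the beginning, and
display for a single linked list.

#include <stdio.h>
#include <stdlib.h>

struct Node {
    int data;
    struct Node *next;
};

struct Node *head = NULL;

/* Inserting at beginning: new node points to the old head, then becomes the head */
void insertAtBeginning(int value)
{
    struct Node *newNode = malloc(sizeof(struct Node));
    newNode->data = value;
    newNode->next = head;          /* works for both empty and non-empty lists */
    head = newNode;
}

/* Deleting from beginning: move head to the second node and free the old first node */
void deleteFromBeginning(void)
{
    if (head == NULL) {
        printf("List is Empty!!! Deletion is not possible\n");
        return;
    }
    struct Node *temp = head;
    head = head->next;
    free(temp);
}

/* Display: walk the chain, printing data ---> data ---> ... ---> NULL */
void display(void)
{
    if (head == NULL) {
        printf("List is Empty!!!\n");
        return;
    }
    for (struct Node *temp = head; temp != NULL; temp = temp->next)
        printf("%d ---> ", temp->data);
    printf("NULL\n");
}

int main(void)
{
    insertAtBeginning(30);
    insertAtBeginning(20);
    insertAtBeginning(10);
    display();                     /* 10 ---> 20 ---> 30 ---> NULL */
    deleteFromBeginning();
    display();                     /* 20 ---> 30 ---> NULL */
    return 0;
}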
Linked Stack and Queue:

Linked Stack

Instead of using an array, we can also use a linked list to implement a stack. A linked list
allocates memory dynamically. However, the time complexity in both scenarios is the same
for all the operations, i.e. push, pop and peek.

In the linked list implementation of a stack, the nodes are maintained non-contiguously in
memory. Each node contains a pointer to its immediate successor node in the stack. The
stack is said to overflow if the space left in the memory heap is not enough to create a node.

The top-most node in the stack always contains null in its address field. Let's discuss the
way in which each operation is performed in the linked list implementation of a stack.

Adding a node to the stack (Push operation)

Adding a node to the stack is referred to as the push operation. Pushing an element onto a
stack in the linked list implementation is different from that of an array implementation. In
order to push an element onto the stack, the following steps are involved.

1. Create a node first and allocate memory to it.
2. If the list is empty then the item is to be pushed as the start node of the list. This
includes assigning a value to the data part of the node and assigning null to the address
part of the node.
3. If there are already some nodes in the list, then we have to add the new element at
the beginning of the list (so as not to violate the property of the stack). For this purpose,
assign the address of the starting element to the address field of the new node and
make the new node the starting node of the list.

Deleting a node from the stack (POP operation)

Deleting a node from the top of the stack is referred to as the pop operation. Deleting a
node from the linked list implementation of a stack is different from that in the array
implementation. In order to pop an element from the stack, we need to follow these steps:

1. Check for the underflow condition: The underflow condition occurs when we try to pop
from an already empty stack. The stack will be empty if the head pointer of the list
points to null.
2. Adjust the head pointer accordingly: In a stack, the elements are popped only from
one end; therefore, the value stored in the head pointer must be deleted and the node
must be freed. The next node of the head node now becomes the head node.

Display the nodes (Traversing)

Displaying all the nodes of a stack requires traversing all the nodes of the linked list
organized in the form of a stack. For this purpose, we need to follow these steps:

1. Copy the head pointer into a temporary pointer.
2. Move the temporary pointer through all the nodes of the list and print the value field
attached to every node.
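A minimal C sketch of this linked stack follows (the head pointer plays the role of top; the
names are illustrative, not prescribed by the text).

#include <stdio.h>
#include <stdlib.h>

struct Node {
    int data;
    struct Node *next;
};

struct Node *top = NULL;      /* head of the list = top of the stack */

/* Push: create a node and make it the new starting node of the list */
void push(int value)
{
    struct Node *newNode = malloc(sizeof(struct Node));
    if (newNode == NULL) {
        printf("Stack overflow: no heap space left\n");
        return;
    }
    newNode->data = value;
    newNode->next = top;      /* null when the list was empty */
    top = newNode;
}

/* Pop: check underflow, then unlink and free the first node */
int pop(int *value)
{
    if (top == NULL)
        return 0;             /* underflow */
    struct Node *temp = top;
    *value = temp->data;
    top = top->next;
    free(temp);
    return 1;
}

int main(void)
{
    int x;
    push(1); push(2); push(3);
    while (pop(&x))
        printf("%d ", x);     /* 3 2 1 */
    printf("\n");
    return 0;
}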

Linked Queue

In a linked queue, each node of the queue consists of two parts, i.e. the data part and the
link part. Each element of the queue points to its immediate next element in memory.

In the linked queue, two pointers are maintained in memory, i.e. the front pointer and the
rear pointer. The front pointer contains the address of the starting element of the queue,
while the rear pointer contains the address of the last element of the queue.

Insertions and deletions are performed at the rear and front ends respectively. If front and
rear are both NULL, it indicates that the queue is empty.

The linked representation of a queue is shown in the following figure.

Operation on Linked Queue

There are two basic operations which can be implemented on linked queues. The
operations are Insertion and Deletion.

Insert operation

The insert operation appends the queue by adding an element to the end of the queue. The
new element will be the last element of the queue.

Firstly, allocate the memory for the new node ptr by using the following statement.

There can be two scenarios of inserting this new node ptr into the linked queue.

In the first scenario, we insert an element into an empty queue. In this case, the condition
front = NULL becomes true. Now, the new element will be added as the only element of the
queue, and the next pointer of both the front and rear pointers will point to NULL.

Delete operation

The deletion operation removes the element that was inserted first among all the queue
elements. Firstly, we need to check whether the list is empty or not. The condition front ==
NULL becomes true if the list is empty; in this case, we simply write underflow on the
console and exit. Otherwise, we delete the element that is pointed to by the pointer front.
For this purpose, copy the node pointed to by the front pointer into the pointer ptr. Now, shift
the front pointer to point to its next node and free the node pointed to by ptr.
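A compact C sketch of the linked queue insert and delete operations described above (ptr,
front, and rear follow the text; everything else is an illustrative choice):

#include <stdio.h>
#include <stdlib.h>

struct Node {
    int data;
    struct Node *next;
};

struct Node *front = NULL;
struct Node *rear  = NULL;

/* Insert: allocate ptr, then either start a new queue or link it after rear */
void insert(int value)
{
    struct Node *ptr = malloc(sizeof(struct Node));
    ptr->data = value;
    ptr->next = NULL;
    if (front == NULL) {        /* first scenario: empty queue */
        front = rear = ptr;
    } else {                    /* otherwise: append after the current rear */
        rear->next = ptr;
        rear = ptr;
    }
}

/* Delete: check underflow, then advance front and free the old first node */
void delete_front(void)
{
    if (front == NULL) {
        printf("underflow\n");
        return;
    }
    struct Node *ptr = front;
    front = front->next;
    if (front == NULL)
        rear = NULL;            /* queue became empty */
    free(ptr);
}

int main(void)
{
    insert(1); insert(2); insert(3);
    delete_front();             /* removes 1, the element inserted first */
    printf("front is now %d\n", front->data);   /* 2 */
    return 0;
}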

Unit II

Trees – Binary tree representations – Tree Traversal – Threaded Binary Trees – Binary Tree
Representation of Trees – Graphs and Representations – Traversals, Connected
Components and Spanning Trees – Shortest Paths and Transitive Closure – Activity
Networks – Topological Sort and Critical Paths.

Trees: Non-Linear data structure

A data structure is said to be linear if its elements form a sequence or a linear list. The
linear data structures that we have studied, like arrays, stacks, queues and linked lists,
organize data in linear order. A data structure is said to be non-linear if its elements form a
hierarchical classification, where data items appear at various levels.

Trees and Graphs are widely used non-linear data structures. Tree and graph structures
represent hierarchical relationships between individual data elements. Graphs are nothing but
trees with certain restrictions removed.

Trees represent a special case of more general structures known as graphs. In a graph,
there are no restrictions on the number of links that can enter or leave a node, and cycles may
be present in the graph. The figure shows a tree and a non-tree.

A tree is a popular data structure used in a wide range of applications. A tree data structure
can be defined as follows...

A tree is a non-linear data structure which organizes data in a hierarchical structure, and this
is a recursive definition.

A tree data structure can also be defined as follows...

A tree is a finite set of one or more nodes such that:
there is a specially designated node called the root, and the remaining nodes are partitioned
into n>=0 disjoint sets T1, ..., Tn, where each of these sets is a tree. We call T1, ..., Tn the
subtrees of the root.

A tree is a hierarchical collection of nodes. One of the nodes, known as the root, is at the top
of the hierarchy. Each node can have at most one link coming into it. The node where the link
originates is called the parent node. The root node has no parent. The links leaving a node
(any number of links are allowed) point to child nodes. Trees are recursive structures. Each
child node is itself the root of a subtree. At the bottom of the tree are leaf nodes, which have
no children.

Advantages of trees

Trees are so useful and frequently used because they have some very serious advantages:
Trees reflect structural relationships in the data
Trees are used to represent hierarchies
Trees provide efficient insertion and searching
Trees are very flexible data structures, allowing sub-trees to be moved around with minimum effort

Introduction: Terminology

In a Tree, every individual element is called a Node. A node in a tree data structure stores the
actual data of that particular element and links to the next elements in the hierarchical structure.

Example

1. Root
In a tree data structure, the first node is called the Root Node. Every tree must have a root node.
We can say that the root node is the origin of the tree data structure. In any tree, there must be
only one root node. We never have multiple root nodes in a tree. In the above tree, A is the Root
node.

2. Edge
In a tree data structure, the connecting link between any two nodes is called an EDGE. In a tree
with 'N' number of nodes there will be a maximum of 'N-1' edges.

3. Parent
In a tree data structure, the node which is a predecessor of any node is called a PARENT NODE.
In simple words, the node which has a branch from it to any other node is called a parent node.
A parent node can also be defined as "the node which has child / children". e.g., Parent (A,B,C,D).

4. Child
In a tree data structure, the node which is a descendant of any node is called a CHILD node. In
simple words, the node which has a link from its parent node is called a child node. In a tree, any
parent node can have any number of child nodes. In a tree, all the nodes except the root are child
nodes. e.g., Children of D are (H, I, J).

5. Siblings
In a tree data structure, nodes which belong to the same parent are called SIBLINGS. In simple
words, the nodes with the same parent are called sibling nodes. Ex: Siblings (B, C, D).

6. Leaf
In a tree data structure, the node which does not have a child (or a node with degree zero) is
called a LEAF node. In simple words, a leaf is a node with no child. In a tree data structure, the
leaf nodes are also called External Nodes. An external node is also a node with no child. In a
tree, a leaf node is also called a 'Terminal' node. Ex: (K, L, F, G, M, I, J).

7. Internal Nodes
In a tree data structure, the node which has at least one child is called an INTERNAL node. In
simple words, an internal node is a node with at least one child. In a tree data structure, nodes
other than leaf nodes are called Internal Nodes. The root node is also said to be an Internal Node
if the tree has more than one node. Internal nodes are also called 'Non-Terminal' nodes.
Ex: B, C, D, E, H.

8. Degree
In a tree data structure, the total number of children of a node (or the number of subtrees of a
node) is called the DEGREE of that node. In simple words, the degree of a node is the total
number of children it has. The highest degree of a node among all the nodes in a tree is called
the 'Degree of the Tree'.

9. Level
In a tree data structure, the root node is said to be at Level 0, the children of the root node are at
Level 1, and the children of the nodes which are at Level 1 will be at Level 2, and so on... In
simple words, in a tree each step from top to bottom is called a Level, and the level count starts
with '0' and is incremented by one at each level (step). Some authors start the root level with 1.

10. Height
In a tree data structure, the total number of edges from a leaf node to a particular node in the
longest path is called the HEIGHT of that node. In a tree, the height of the root node is said to be
the height of the tree. In a tree, the height of all leaf nodes is '0'.

11. Depth
In a tree data structure, the total number of edges from the root node to a particular node is
called the DEPTH of that node. In a tree, the total number of edges from the root node to a leaf
node in the longest path is said to be the depth of the tree. In simple words, the highest depth of
any leaf node in a tree is said to be the depth of that tree. In a tree, the depth of the root node
is '0'.

12. Path
In a tree data structure, the sequence of nodes and edges from one node to another node is
called the PATH between those two nodes. The length of a path is the total number of nodes in
that path. In the below example the path A - B - E - J has length 4.

13. Sub Tree
In a tree data structure, each child of a node recursively forms a subtree. Every child node forms
a subtree on its parent node.

Tree Representations

A tree data structure can be represented in two methods. Those methods are as follows...
1. List Representation
2. Left Child - Right Sibling Representation
Consider the following tree...

1. List Representation
In this representation, we use two types of nodes: one for representing a node with data and
another for representing only references. We start with a node with data for the root node of the
tree. Then it is linked to an internal node through a reference node and is linked to any other
node directly. This process repeats for all the nodes in the tree.
The above tree example can be represented using List representation as follows...

Fig: List representation of above Tree

Fig: Possible node structure for a tree of degree k

2. Left Child - Right Sibling Representation
In this representation, we use a list with one type of node which consists of three fields, namely
the Data field, the Left child reference field and the Right sibling reference field. The Data field
stores the actual value of a node, the left reference field stores the address of the left child and
the right reference field stores the address of the right sibling node. The graphical representation
of that node is as follows...

In this representation, every node's data field stores the actual value of that node. If that node
has a left child, then the left reference field stores the address of that left child node; otherwise
that field stores NULL. If that node has a right sibling then the right reference field stores the
address of the right sibling node; otherwise that field stores NULL. The above tree example can
be represented using the Left Child - Right Sibling representation as follows...

Representation as a Degree-Two Tree

To obtain the degree-two tree representation of a tree, rotate the right-sibling pointers in the left
child-right sibling tree clockwise by 45 degrees. In a degree-two representation, the two children
of a node are referred to as the left and right children.
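As a small illustrative sketch (field and helper names are my own, not from the text), a left
child-right sibling node can be declared in C as follows; the same three-field node serves
every tree, whatever its degree.

#include <stdio.h>
#include <stdlib.h>

/* One node of the left child - right sibling representation:
   data, a pointer to the first (left) child, and a pointer to the next sibling. */
struct tree_node {
    int data;
    struct tree_node *left_child;
    struct tree_node *right_sibling;
};

/* Convenience constructor for a node with no children yet (hypothetical helper). */
struct tree_node *make_node(int data)
{
    struct tree_node *n = malloc(sizeof(struct tree_node));
    n->data = data;
    n->left_child = NULL;
    n->right_sibling = NULL;
    return n;
}

int main(void)
{
    /* root A with children B and C: B is A's left child, C is B's right sibling */
    struct tree_node *A = make_node('A');
    struct tree_node *B = make_node('B');
    struct tree_node *C = make_node('C');
    A->left_child = B;
    B->right_sibling = C;
    printf("%c's first child is %c, whose sibling is %c\n",
           A->data, A->left_child->data, A->left_child->right_sibling->data);
    return 0;
}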

3. Extended Binary Tree


A binary tree can be converted into Full Binary tree by adding dummy nodes to existing nodes
Binary Trees wherever required.
In a normal tree, every node can have any number of children. Binary tree is a special The full binary tree obtained by adding dummy nodes to a binary tree is called as Extended
type of tree data structure in which every node can have a maximum of 2 children. One is known Binary Tree.
as left child and the other is known as right child.
~A tree in which every node can have a maximum of two children is called as Binary Tree. Abstract Data Type
~In a binary tree, every node can have either 0 children or 1 child or 2 children but not more than Definition: A binary tree is a finite set of nodes that is either empty or consists of a root and two
2 children. Example disjoint binary trees called left subtree and right subtree.
ADT contains specification for the binary tree ADT.
Structure Binary_Tree (abbreviated BinTree) is
objects: a finite set of nodes either empty or consisting of a root node, left Binary_Tree, and right
Binary_Tree.
Functions:
for all bt, bt1, bt2 ∈ BinTree, item ∈ element
BinTree Create()::= creates an empty binary tree
Boolean IsEmpty(bt)::= if (bt == empty binary tree) return TRUE else return FALSE
BinTree MakeBT(bt1, item, bt2)::= return a binary tree whose left subtree is bt1, whose right
subtree is bt2, and whose root node contains the data item
BinTree Lchild(bt)::= if (IsEmpty(bt)) return error else return the left subtree of bt
element Data(bt)::= if (IsEmpty(bt)) return error else return the data in the root node of bt
BinTree Rchild(bt)::= if (IsEmpty(bt)) return error else return the right subtree of bt

Differences between A Tree and A Binary Tree

• The subtrees of a binary tree are ordered; those of a tree are not ordered.

The above two trees are different when viewed as binary trees, but the same when viewed as trees.

Properties of Binary Trees
1. Maximum Number of Nodes in a Binary Tree
The maximum number of nodes on level i of a binary tree is 2^(i-1), i >= 1.
The maximum number of nodes in a binary tree of depth k is 2^k - 1, k >= 1.

Proof By Induction:

Induction Base: The root is the only node on level i = 1. Hence, the maximum number of nodes on
level i = 1 is 2^(i-1) = 2^0 = 1.
Induction Hypothesis: Let i be an arbitrary positive integer greater than 1. Assume that the maximum
number of nodes on level i-1 is 2^(i-2).
Induction Step: The maximum number of nodes on level i-1 is 2^(i-2) by the induction hypothesis.
Since each node in a binary tree has a maximum degree of 2, the maximum number of nodes on
level i is two times the maximum number of nodes on level i-1, or 2^(i-1).
The maximum number of nodes in a binary tree of depth k is the sum over all the levels:
2^0 + 2^1 + ... + 2^(k-1) = 2^k - 1.

2. Relation between number of leaf nodes and degree-2 nodes: For any nonempty binary tree,
T, if n0 is the number of leaf nodes and n2 the number of nodes of degree 2, then n0 = n2 + 1.
PROOF: Let n and B denote the total number of nodes and branches in T. Let n0, n1, n2 represent
the nodes with zero children, a single child, and two children respectively.
Then n = n0 + n1 + n2, and since every node except the root has exactly one branch leading into it,
B + 1 = n and B = n1 + 2n2, so n1 + 2n2 + 1 = n.
Hence n1 + 2n2 + 1 = n0 + n1 + n2, which gives n0 = n2 + 1.

3. A full binary tree of depth k is a binary tree of depth k having 2^k - 1 nodes, k >= 0.
A binary tree with n nodes and depth k is complete iff its nodes correspond to the nodes numbered
from 1 to n in the full binary tree of depth k.
Binary Tree Representation
A binary tree data structure is represented using two methods. Those methods are
1) Array Representation
2) Linked List Representation

1) Array Representation: In array representation of a binary tree, we use a one dimensional array
(1-D array) to represent the binary tree. To represent a binary tree of depth 'n' using array
representation, we need a one dimensional array with a maximum size of 2^n - 1 (the maximum
number of nodes in a binary tree of depth n).
If a complete binary tree with n nodes (depth = ⌊log2 n⌋ + 1) is represented sequentially, then for any
node with index i, 1 <= i <= n, we have: a) parent(i) is at i/2 if i != 1; if i = 1, i is at the root and has no
parent. b) left_child(i) is at 2i if 2i <= n; if 2i > n, then i has no left child. c) right_child(i) is at 2i + 1
if 2i + 1 <= n; if 2i + 1 > n, then i has no right child.

2) Linked Representation: We use a linked list to represent a binary tree. In a linked list, every
node consists of three fields: the first field stores the left child address, the second stores the actual
data, and the third stores the right child address. In this linked list representation, a node has the
following structure...

typedef struct node *tree_pointer;
typedef struct node
{
int data;
tree_pointer left_child, right_child;
};

Binary Tree Traversals
When we want to display a binary tree, we need to follow some order in which all the nodes of
that binary tree must be displayed. In any binary tree, the displaying order of nodes depends on the
traversal method. The displaying (or) visiting order of nodes in a binary tree is called Binary Tree
Traversal.
There are three types of binary tree traversals.
1) In - Order Traversal 2) Pre - Order Traversal 3) Post - Order Traversal
• 1. In - Order Traversal ( leftChild - root - rightChild ): I - D - J - B - F - A - G - K - C - H
• 2. Pre - Order Traversal ( root - leftChild - rightChild ): A - B - D - I - J - F - C - G - K - H
• 3. Post - Order Traversal ( leftChild - rightChild - root ): I - J - D - F - B - K - G - H - C - A

1. In - Order Traversal ( leftChild - root - rightChild )
In In-Order traversal, the root node is visited between the left child and right child. In this traversal,
the left child node is visited first, then the root node is visited and later we go for visiting the right
child node. This in-order traversal is applicable for every root node of all subtrees in the tree. This
is performed recursively for all nodes in the tree. In the above example of binary tree, first we try
to visit the left child of root node 'A', but A's left child is a root node for a left subtree, so we try to visit
its (B's) left child 'D' and again D is a root for the subtree with nodes D, I and J. So we try to visit its
left child 'I' and it is the left most child. So first we visit 'I', then go for its root node 'D' and later
we visit D's right child 'J'. With this we have completed the left part of node B. Then visit 'B' and
next B's right child 'F' is visited. With this we have completed the left part of node A. Then visit root
node 'A'. With this we have completed the left and root parts of node A. Then we go for the right part of
node A. In the right of A again there is a subtree with root C. So go for the left child of C and again it
is a subtree with root G. But G does not have a left part so we visit 'G' and then visit G's right child
'K'. With this we have completed the left part of node C. Then visit root node 'C' and next visit C's
right child 'H' which is the right most child in the tree, so we stop the process. That means here we
have visited in the order of I - D - J - B - F - A - G - K - C - H using In-Order Traversal.
In-Order Traversal for the above example of binary tree is
I - D - J - B - F - A - G - K - C - H

Algorithm
Until all nodes are traversed −
Step 1 − Recursively traverse left subtree.
Step 2 − Visit root node.
Step 3 − Recursively traverse right subtree.
void inorder(tree_pointer ptr) /* inorder tree traversal, recursive */
{
if (ptr) {
inorder(ptr->left_child);
printf("%d", ptr->data);
inorder(ptr->right_child);
}
}

2. Pre - Order Traversal ( root - leftChild - rightChild )
In Pre-Order traversal, the root node is visited before the left child and right child nodes. In this
traversal, the root node is visited first, then its left child and later its right child. This pre-order
traversal is applicable for every root node of all subtrees in the tree. In the above example of
binary tree, first we visit root node 'A' then visit its left child 'B' which is a root for D and F. So
we visit B's left child 'D' and again D is a root for I and J. So we visit D's left child 'I' which is the
left most child. So next we go for visiting D's right child 'J'. With this we have completed the root,
left and right parts of node D and the root and left parts of node B. Next visit B's right child 'F'. With this
we have completed the root and left parts of node A. So we go for A's right child 'C' which is a root
node for G and H. After visiting C, we go for its left child 'G' which is a root for node K. So next
we visit the left of G, but it does not have a left child, so we go for G's right child 'K'. With this we have
completed node C's root and left parts. Next visit C's right child 'H' which is the right most child
in the tree. So we stop the process. That means here we have visited in the order of A - B - D - I - J - F -
C - G - K - H using Pre-Order Traversal.

Algorithm
Until all nodes are traversed −
Step 1 − Visit root node.
Step 2 − Recursively traverse left subtree.
Step 3 − Recursively traverse right subtree.
void preorder(tree_pointer ptr) /* preorder tree traversal, recursive */
{
if (ptr) {
printf("%d", ptr->data);
preorder(ptr->left_child);
preorder(ptr->right_child);
}
}

3. Post - Order Traversal ( leftChild - rightChild - root )
In Post-Order traversal, the root node is visited after the left child and right child. In this traversal,
the left child node is visited first, then its right child and then its root node. This is recursively
performed until the right most node is visited. Here we have visited in the order of I - J - D - F - B
- K - G - H - C - A using Post-Order Traversal.

Algorithm
Until all nodes are traversed −
Step 1 − Recursively traverse left subtree.
Step 2 − Recursively traverse right subtree.
Step 3 − Visit root node.
void postorder(tree_pointer ptr) /* postorder tree traversal, recursive */
{
if (ptr) {
postorder(ptr->left_child);
postorder(ptr->right_child);
printf("%d", ptr->data);
}
}
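To tie the three traversals together, the following small, self-contained sketch builds a tiny tree by hand and prints the three orders. It reuses the node layout shown above; the helper new_node() and the three-node sample tree are illustrative additions, not part of the notes.

#include <stdio.h>
#include <stdlib.h>

typedef struct node *tree_pointer;
typedef struct node {
    int data;
    tree_pointer left_child, right_child;
} node;

/* Small helper (not in the notes) that allocates and fills one node. */
tree_pointer new_node(int data, tree_pointer left, tree_pointer right)
{
    tree_pointer p = malloc(sizeof(node));
    p->data = data;
    p->left_child = left;
    p->right_child = right;
    return p;
}

void inorder(tree_pointer ptr)
{
    if (ptr) { inorder(ptr->left_child); printf("%d ", ptr->data); inorder(ptr->right_child); }
}

void preorder(tree_pointer ptr)
{
    if (ptr) { printf("%d ", ptr->data); preorder(ptr->left_child); preorder(ptr->right_child); }
}

void postorder(tree_pointer ptr)
{
    if (ptr) { postorder(ptr->left_child); postorder(ptr->right_child); printf("%d ", ptr->data); }
}

int main(void)
{
    /* a small sample tree: root 1 with left child 2 and right child 3 */
    tree_pointer root = new_node(1, new_node(2, NULL, NULL), new_node(3, NULL, NULL));
    inorder(root);   printf("\n");   /* prints: 2 1 3 */
    preorder(root);  printf("\n");   /* prints: 1 2 3 */
    postorder(root); printf("\n");   /* prints: 2 3 1 */
    return 0;
}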
Binary Search Trees

Binary Search Tree Representation
A binary search tree exhibits a special behavior. A node's left child must have a value less than its
parent's value and the node's right child must have a value greater than its parent's value.
We're going to implement the tree using node objects and connecting them through references.

Definition: A binary search tree (BST) is a binary tree. It may be empty. If it is not empty, then
all nodes follow the below mentioned properties −
keys in a nonempty left subtree (right subtree) are smaller (larger) than the key in the root
of the subtree;
the left sub-tree and right sub-tree can be defined as −
left_subtree (keys) ≤ node (key) ≤ right_subtree (keys)

Fig: Example Binary Search Trees

Graph
A graph G = (V,E) is composed of:
V: set of vertices
E: set of edges connecting the vertices in V
• An edge e = (u,v) is a pair of vertices
Example:
V = {a,b,c,d,e}
E = {(a,b),(a,c),(a,d),(b,e),(c,d),(c,e),(d,e)}

Graph Terminology
Undirected Graph:
An undirected graph is one in which the pair of vertices in an edge is unordered,
(v0, v1) = (v1, v0)
Directed Graph:
A directed graph is one in which each edge is a directed pair of vertices,
<v0, v1> != <v1, v0>
Complete Graph:
A complete graph is a graph that has the maximum number of edges. For an undirected graph with
n vertices, the maximum number of edges is n(n-1)/2; for a directed graph with n vertices, the
maximum number of edges is n(n-1).
Subgraph:
A subgraph of G is a graph G' such that V(G') is a subset of V(G) and E(G') is a
subset of E(G).

Adjacent and Incident:
If (v0, v1) is an edge in an undirected graph,
– v0 and v1 are adjacent
– The edge (v0, v1) is incident on vertices v0 and v1
If <v0, v1> is an edge in a directed graph
– v0 is adjacent to v1, and v1 is adjacent from v0
– The edge <v0, v1> is incident on v0 and v1

Multigraph:
In a multigraph, there can be more than one edge from vertex P to
vertex Q. In a simple graph there is at most one.

Graph with self edge or graph with feedback loops:
A self loop is an edge that connects a vertex to itself. In some graphs it makes sense to allow
self-loops; in some it doesn't.

Path:
A path from vertex vp to vertex vq in a graph G is a sequence of vertices vp,
vi1, vi2, ..., vin, vq, such that (vp, vi1), (vi1, vi2), ..., (vin, vq) are edges in an
undirected graph.
The length of a path is the number of edges on it.
Simple Path and Cycle:
A simple path is a path in which all vertices, except possibly the first and the
last, are distinct.
A cycle is a simple path in which the first and the last vertices are the same.
In an undirected graph G, two vertices, v0 and v1, are connected if there is a
path in G from v0 to v1.
An undirected graph is connected if, for every pair of distinct vertices vi, vj,
there is a path from vi to vj.

Degree
The degree of a vertex is the number of edges incident to that vertex.
For a directed graph,
– the in-degree of a vertex v is the number of edges that have v as the head
– the out-degree of a vertex v is the number of edges that have v as the tail
– if di is the degree of a vertex i in a graph G with n vertices and e edges, the
number of edges is e = (d1 + d2 + ... + dn) / 2

Example:

ADT for Graph
Graph ADT is
Data structures: a nonempty set of vertices and a set of undirected
edges, where each edge is a pair of vertices
Functions: for all graph ∈ Graph, v, v1 and v2 ∈ Vertices
• Graph Create()::= return an empty graph
• Graph InsertVertex(graph, v)::= return a graph with v inserted. v has no incident edge.
• Graph InsertEdge(graph, v1, v2)::= return a graph with a new edge between v1 and v2
• Graph DeleteVertex(graph, v)::= return a graph in which v and all edges incident to it are removed
• Graph DeleteEdge(graph, v1, v2)::= return a graph in which the edge (v1, v2) is removed
• Boolean IsEmpty(graph)::= if (graph == empty graph) return TRUE else return FALSE
• List Adjacent(graph, v)::= return a list of all vertices that are adjacent to v

Graph Representations
Graph can be represented in the following ways:
a) Adjacency Matrix
b) Adjacency Lists
c) Adjacency Multilists

a) Adjacency Matrix
Let G = (V,E) be a graph with n vertices.
The adjacency matrix of G is a two-dimensional n by n array, say adj_mat.
If the edge (vi, vj) is in E(G), adj_mat[i][j] = 1.
If there is no such edge in E(G), adj_mat[i][j] = 0.
The adjacency matrix for an undirected graph is symmetric; the adjacency matrix for a digraph
need not be symmetric.
Examples for Adjacency Matrix:

Merits of Adjacency Matrix
From the adjacency matrix, it is easy to determine whether two vertices are connected.
The degree of a vertex is the sum of the entries in its row.
For a digraph, the row sum is the out_degree, while the column sum is the in_degree.

b) Adjacency Lists
Each row in the adjacency matrix is represented as an adjacency list.

Interesting Operations
• degree of a vertex in an undirected graph: the number of nodes in its adjacency list
• number of edges in a graph: determined in O(n+e)
• out-degree of a vertex in a directed graph: the number of nodes in its adjacency list
• in-degree of a vertex in a directed graph: traverse the whole data structure

c) Adjacency Multilists
An edge in an undirected graph is represented by two nodes in adjacency list representation.
Adjacency Multilists – lists in which nodes may be shared among several lists (an edge is shared by
two different paths).

Example for Adjacency Multilists
Orthogonal representation for graph G3
Lists: vertex 0: M1->M2->M3, vertex 1: M1->M4->M5
vertex 2: M2->M4->M6, vertex 3: M3->M5->M6
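As a concrete counterpart to the adjacency matrix and adjacency list descriptions above, here is a minimal sketch that builds adjacency lists for a small undirected graph. The node type, the sample edges and the insert_edge() helper are illustrative choices, not the exact representation prescribed by the notes.

#include <stdio.h>
#include <stdlib.h>

#define N 4                        /* number of vertices in the sample graph */

typedef struct adj_node {
    int vertex;
    struct adj_node *next;
} adj_node;

adj_node *adj[N];                  /* adj[v] = head of v's adjacency list */

/* Adds w to the front of v's list; called twice for an undirected edge. */
void add_to_list(int v, int w)
{
    adj_node *p = malloc(sizeof(adj_node));
    p->vertex = w;
    p->next = adj[v];
    adj[v] = p;
}

void insert_edge(int v, int w)
{
    add_to_list(v, w);
    add_to_list(w, v);
}

int main(void)
{
    /* edges of a small undirected graph on vertices 0..3 */
    insert_edge(0, 1); insert_edge(0, 2); insert_edge(1, 3); insert_edge(2, 3);

    for (int v = 0; v < N; v++) {          /* print each adjacency list */
        printf("vertex %d:", v);
        for (adj_node *p = adj[v]; p; p = p->next)
            printf(" -> %d", p->vertex);
        printf("\n");
    }
    return 0;
}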
Some Graph Operations
The following are some graph operations:
a) Traversal: Given G=(V,E) and vertex v, find all w ∈ V such that w connects v.
– Depth First Search (DFS): preorder tree traversal
– Breadth First Search (BFS): level order tree traversal
b) Spanning Trees
c) Connected Components

Graph G and its adjacency lists

depth first search: v0, v1, v3, v7, v4, v5, v2, v6
breadth first search: v0, v1, v2, v3, v4, v5, v6, v7

Depth First Search

Depth First Search (DFS) algorithm traverses a graph in a depthward motion
and uses a stack to remember to get the next vertex to start a search, when a
dead end occurs in any iteration.

As in the example given above, the DFS algorithm traverses from A to B to C to D first, then to E,
then to F and lastly to G. It employs the following rules.
• Rule 1 − Visit the adjacent unvisited vertex. Mark it as visited. Display it. Push it in a stack.
• Rule 2 − If no adjacent vertex is found, pop up a vertex from the stack. (It will pop up all the
vertices from the stack which do not have adjacent vertices.)
• Rule 3 − Repeat Rule 1 and Rule 2 until the stack is empty.

Step Traversal Description
1 Initialize the stack.
2 Mark S as visited and put it onto the stack. Explore any unvisited adjacent node from S. We have
three nodes and we can pick any of them. For this example, we shall take the nodes in an
alphabetical order.
3 Mark A as visited and put it onto the stack. Explore any unvisited adjacent node from A. Both S
and D are adjacent to A but we are concerned for unvisited nodes only.
4 Visit D and mark it as visited and put it onto the stack. Here, we have B and C nodes, which are
adjacent to D and both are unvisited. However, we shall again choose in an alphabetical order.
5 We choose B, mark it as visited and put it onto the stack. Here B does not have any unvisited
adjacent node. So, we pop B from the stack.
6 We check the stack top to return to the previous node and check if it has any unvisited nodes.
Here, we find D to be on the top of the stack.
7 The only unvisited adjacent node from D is C now. So we visit C, mark it as visited and put it
onto the stack.

As C does not have any unvisited adjacent node, we keep popping the stack until we find a node
that has an unvisited adjacent node. In this case, there's none and we keep popping until the stack
is empty.

Breadth First Search
Breadth First Search (BFS) algorithm traverses a graph in a breadthward motion and uses a
queue to remember to get the next vertex to start a search, when a dead end occurs in any
iteration.

As in the example given above, the BFS algorithm traverses from A to B to E to F
first then to C and G lastly to D. It employs the following rules.
Rule 1 − Visit the adjacent unvisited vertex. Mark it as visited. Display it. Insert it in
a queue.

Rule 2 − If no adjacent vertex is found, remove the first vertex from the queue.

Rule 3 − Repeat Rule 1 and Rule 2 until the queue is empty.


Step Traversal Description
1 Initialize the queue.
2 We start from visiting S (starting node), and mark it as visited.
3 We then see an unvisited adjacent node from S. In this example, we have three nodes but
alphabetically we choose A, mark it as visited and enqueue it.
4 Next, the unvisited adjacent node from S is B. We mark it as visited and enqueue it.
5 Next, the unvisited adjacent node from S is C. We mark it as visited and enqueue it.
6 Now, S is left with no unvisited adjacent nodes. So, we dequeue and find A.
7 From A we have D as an unvisited adjacent node. We mark it as visited and enqueue it.

At this stage, we are left with no unmarked (unvisited) nodes. But as per the
algorithm we keep on dequeuing in order to get all unvisited nodes. When the
queue gets emptied, the program is over.

Spanning Trees
When graph G is connected, a depth first or breadth first search starting at any
vertex will visit all vertices in G.
A spanning tree is any tree that consists solely of edges in G and that includes
all the vertices.
E(G): T (tree edges) + N (nontree edges), where T: set of edges used during search,
N: set of remaining edges.

Examples of Spanning Tree

Either dfs or bfs can be used to create a spanning tree


– When dfs is used, the resulting spanning tree is known as a depth first
spanning tree
– When bfs is used, the resulting spanning tree is known as a breadth first
spanning tree
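The DFS and BFS walkthroughs above can be condensed into code. The sketch below uses an adjacency matrix, the recursive call stack for DFS and a simple array-based queue for BFS; the sample graph, array sizes and function names are illustrative assumptions, not taken from the figures in the notes.

#include <stdio.h>

#define N 5

int g[N][N] = {                 /* adjacency matrix of a small undirected graph */
    {0,1,1,0,0},
    {1,0,0,1,0},
    {1,0,0,1,1},
    {0,1,1,0,0},
    {0,0,1,0,0}
};
int visited[N];

void dfs(int v)                 /* depth first: recursion plays the role of the stack */
{
    visited[v] = 1;
    printf("%d ", v);
    for (int w = 0; w < N; w++)
        if (g[v][w] && !visited[w])
            dfs(w);
}

void bfs(int start)             /* breadth first: visit in the order of a FIFO queue */
{
    int queue[N], front = 0, rear = 0;
    visited[start] = 1;
    queue[rear++] = start;
    while (front < rear) {
        int v = queue[front++];
        printf("%d ", v);
        for (int w = 0; w < N; w++)
            if (g[v][w] && !visited[w]) {
                visited[w] = 1;
                queue[rear++] = w;
            }
    }
}

int main(void)
{
    dfs(0); printf("\n");                        /* prints: 0 1 3 2 4 */
    for (int i = 0; i < N; i++) visited[i] = 0;  /* reset marks before BFS */
    bfs(0); printf("\n");                        /* prints: 0 1 2 3 4 */
    return 0;
}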

While adding a nontree edge into any spanning tree, this will create a cycle.
DFS VS BFS Spanning Tree

A spanning tree is a minimal subgraph, G', of G such that V(G') = V(G) and G' is
connected.
Any connected graph with n vertices must have at least n-1 edges.
A biconnected graph is a connected graph that has no articulation points.
biconnected component: a maximal biconnected subgraph H (no subgraph that is both
biconnected and properly contains H).
Minimum Cost Spanning Tree
The cost of a spanning tree of a weighted undirected graph is the sum of the
costs of the edges in the spanning tree.

– Kruskal
– Prim
– Sollin

Kruskal’s Algorithm
Build a minimum cost spanning tree T by adding edges to T one at a time
Select the edges for inclusion in T in nondecreasing order of the cost
An edge is added to T if it does not form a cycle
Since G is connected and has n > 0 vertices, exactly n-1 edges will be selected
Kruskal’s algorithm
1. Sort all the edges in non-decreasing order of their weight.
2. Pick the smallest edge. Check if it forms a cycle with the spanning
tree formed so far. If cycle is not formed, include this edge. Else,
discard it.
3. Repeat step#2 until there are (V-1) edges in the spanning tree.

Pseudocode for Kruskal’s Algorithm


Kruskal(G, V, E)
{
T= {};
while(T contains less than n-1 edges && E is not empty)
{
choose a least cost edge (v,w) from E;
delete (v,w) from E;
if ((v,w) does not create a cycle in T)
add (v,w) to T
else
discard (v,w);
}
if (T contains fewer than n-1 edges)
printf("No spanning tree\n");
}
Examples for Kruskal’s Algorithm
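The pseudocode above leaves "check if it forms a cycle" abstract. One common way to implement that test is a union-find (disjoint-set) structure: an edge (v, w) closes a cycle exactly when v and w already have the same root. The sketch below is an illustrative implementation choice, not the method prescribed by the notes.

#include <stdio.h>

#define MAX_V 100

int parent[MAX_V];              /* parent[i] == i means i is the root of its set */

void uf_init(int n)
{
    for (int i = 0; i < n; i++)
        parent[i] = i;
}

int find(int v)                 /* walk up to the root of v's set */
{
    while (parent[v] != v)
        v = parent[v];
    return v;
}

/* Returns 1 and merges the two sets if edge (v, w) can be added without a cycle;
   returns 0 if v and w are already connected (the edge would close a cycle). */
int union_if_acyclic(int v, int w)
{
    int rv = find(v), rw = find(w);
    if (rv == rw) return 0;     /* same component: discard the edge */
    parent[rv] = rw;            /* different components: merge them */
    return 1;
}

int main(void)
{
    uf_init(4);
    printf("%d\n", union_if_acyclic(0, 1));   /* 1: edge accepted            */
    printf("%d\n", union_if_acyclic(1, 2));   /* 1: edge accepted            */
    printf("%d\n", union_if_acyclic(0, 2));   /* 0: edge would form a cycle  */
    return 0;
}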

Pseudocode of Prim’s algorithm


Prims(G,V,E)
{
T={};
TV={0};
while (T contains fewer than n-1 edges)
{
let (u,v) be a least cost edge such that u is in TV and v is not in TV; if (there is no such edge) break;
add v to TV;
add (u,v) to T;
}
if (T contains fewer than n-1 edges)
printf(“No spanning tree\n”);
}

Prim’s Algorithm
Prim's algorithm to find minimum cost spanning tree (as Kruskal's algorithm)
uses the greedy approach. Prim's algorithm shares a similarity with the
shortest path first algorithms.
Prim's algorithm, in contrast with Kruskal's algorithm, treats the nodes as a
single tree and keeps on adding new nodes to the spanning tree from the given
graph.
To contrast with Kruskal's algorithm and to understand Prim's algorithm better,
we shall use the same example −
Steps of Prim's Algorithm: The following are the main 3 steps of the Prim's
Algorithm:
1. Begin with any vertex which you think would be suitable and add it to the tree.
2. Find an edge that connects any vertex in the tree to any vertex that is not in the tree. Note that
we don't have to form cycles.
3. Stop when n - 1 edges have been added to the tree.
Single Source Shortest Paths

#define MAX_VERTICES 6
int cost[][MAX_VERTICES] =
   {{   0,   50,   10, 1000,   45, 1000},
    {1000,    0,   15, 1000,   10, 1000},
    {  20, 1000,    0,   15, 1000, 1000},
    {1000,   20, 1000,    0,   35, 1000},
    {1000, 1000,   30, 1000,    0, 1000},
    {1000, 1000, 1000,    3, 1000,    0}};
int distance[MAX_VERTICES];
short int found[MAX_VERTICES];
int n = MAX_VERTICES;

void shortestpath(int v, int cost[][MAX_VERTICES], int distance[], int n,
                  short int found[])
{
  int i, u, w;
  for (i = 0; i < n; i++)
  {
    found[i] = FALSE;
    distance[i] = cost[v][i];
  }
  found[v] = TRUE;
  distance[v] = 0;
  for (i = 0; i < n-2; i++)
  { /* determine n-1 paths from v */
    u = choose(distance, n, found);
    found[u] = TRUE;
    for (w = 0; w < n; w++)
      if (!found[w])
        if (distance[u] + cost[u][w] < distance[w])
          distance[w] = distance[u] + cost[u][w];
  }
}
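The shortestpath() function above calls choose(), which the notes do not list. A minimal version that is consistent with how it is used (return the not-yet-finalised vertex with the smallest tentative distance) might look like this; the exact original may differ.

#include <limits.h>   /* for INT_MAX */

/* Returns the index u, 0 <= u < n, with found[u] == FALSE and minimum
   distance[u]. Assumes at least one such vertex exists when it is called. */
int choose(int distance[], int n, short int found[])
{
    int i, min = INT_MAX, minpos = -1;
    for (i = 0; i < n; i++)
        if (!found[i] && distance[i] < min) {
            min = distance[i];
            minpos = i;
        }
    return minpos;
}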

All Pairs Shortest Paths


All pairs shortest path algorithm finds the shortest paths between all pairs of
vertices.

Solution 1

Solution 2
(the absence of an edge is represented by a sufficiently large number, and paths are
restricted to intermediate vertices with an index <= k)

Fig: Graph with negative cycle
Algorithm for All Pairs Shortest Paths
void allcosts(int cost[][MAX_VERTICES], int distance[][MAX_VERTICES], int n)
{
int i, j, k;
for (i=0; i<n; i++)
for (j=0; j<n; j++) distance[i][j] = cost[i][j];
for (k=0; k<n; k++)
for (i=0; i<n; i++)
for (j=0; j<n; j++)
if (distance[i][k]+distance[k][j] < distance[i][j])
distance[i][j]= distance[i][k]+distance[k][j];
}

Example
Directed graph and its cost matrix

Transitive Closure
Goal: given a graph with unweighted edges, determine if there is a path
from i to j for all i and j.
(1) Requiring positive path (> 0) lengths gives the transitive closure matrix.
(2) Requiring nonnegative path lengths gives the reflexive transitive closure matrix.
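The notes describe transitive closure only in words. A sketch in the same triple-loop style as allcosts() above is given below; it assumes a 0/1 adjacency matrix and reuses the MAX_VERTICES constant from the earlier code, and the names adj, connect and transitive_closure are illustrative.

/* connect[i][j] becomes 1 if some path from i to j exists in the graph given
   by the 0/1 adjacency matrix adj[][]; a sketch, in the style of allcosts(). */
void transitive_closure(int adj[][MAX_VERTICES], int connect[][MAX_VERTICES], int n)
{
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            connect[i][j] = adj[i][j];        /* start from the direct edges */
    for (k = 0; k < n; k++)                   /* allow k as an intermediate vertex */
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                if (connect[i][k] && connect[k][j])
                    connect[i][j] = 1;
}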
Unit III
Algorithms – Priority Queues - Heaps – Heap Sort – Merge Sort – Quick Sort –
Binary Search – Finding the Maximum and Minimum.

Algorithm
 An algorithm is a step-by-step procedure to solve a problem in a finite
number of steps.
 Branching and repetition are included in the steps of an algorithm.
 This branching and repetition depend on the problem for which Algorithm is
developed.
 All the steps of Algorithm during the definition should be written in a
human-understandable language which does not depend on any
programming language.
 we can choose any programming language to implement the Algorithm.
 Pseudocode and flow chart are popular ways to represent an algorithm.

An algorithm must satisfy the following criteria:

1. Input: An algorithm should have zero or more, but a finite, number of inputs. We can also say
that input is essential for any algorithm before starting: input should be given to it initially,
before the Algorithm begins.
2. Output: An algorithm must give at least one required result from the given set of input values.
These output values are known as the solution to a problem.
3. Definiteness: Each step must be clear, unambiguous, and precisely defined.
4. Finiteness: Finiteness means the Algorithm should terminate after a finite number of steps.
Also, each step should be finished in a finite amount of time.
5. Effectiveness: Each step of the Algorithm must be feasible, i.e., it should be practically
possible to perform the action. Every Algorithm is generally expected to be effective.

Divide and Conquer

Divide and Conquer is one of the best-known general algorithm design
techniques. It works according to the following general plan:

• Given a function to compute on 'n' inputs, the divide-and-conquer
strategy suggests splitting the inputs into 'k' distinct subsets, 1 < k <= n,
yielding 'k' sub problems.
• These sub problems must be solved, and then a method must be found
to combine sub solutions into a solution of the whole.
• If the sub problems are still relatively large, then the divide-and-conquer
strategy can possibly be reapplied.
• Often the sub problems resulting from a divide-and-conquer design are
of the same type as the original problem. For those cases the
reapplication of the divide-and-conquer principle is naturally expressed
by a recursive algorithm.

A typical case with k = 2 is diagrammatically shown below.

In the above specification,
• Initially DAndC(P) is invoked, where 'P' is the problem to be solved.
• Small(P) is a Boolean-valued function that determines whether the
input size is small enough that the answer can be computed without
splitting. If this is so, the function 'S' is invoked. Otherwise, the problem P
is divided into smaller sub problems. These sub problems P1, P2, ..., Pk are
solved by recursive application of DAndC.
• Combine is a function that determines the solution to P using the
solutions to the 'k' sub problems.

Binary Search

• Problem definition: Let ai, 1 ≤ i ≤ n, be a list of elements that are sorted in non-decreasing
order. The problem is to find whether a given element x is present in the list or not. If x is
present we have to determine a value j (the element's position) such that aj = x. If x is not in
the list, then j is set to zero.
• Solution: Let P = (n, ai…al, x) denote an arbitrary instance of the search problem, where n
is the number of elements in the list, ai…al is the list of elements and x is the key element
to be searched for in the given list. Binary search on the list is done as follows:
• Step 1: Pick an index q in the middle range [i, l], i.e. q = [(n + 1)/2], and compare x with aq.
• Step 2: if x = aq, i.e. the key element is equal to the mid element, the problem is immediately
solved.
• Step 3: if x < aq, x has to be searched for only in the sub-list ai, ai+1, ……, aq-1.
Therefore, the problem reduces to (q-i, ai…aq-1, x).
• Step 4: if x > aq, x has to be searched for only in the sub-list aq+1, ..., al. Therefore the problem
reduces to (l-i, aq+1…al, x).
• For the above solution procedure, the Algorithm can be implemented as a recursive or
non-recursive algorithm.
Recursive binary search algorithm
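The recursive algorithm itself appears only as a figure in the original notes. A minimal C sketch of the scheme just described (1-based indexing, 0 returned when x is absent); the function name BinSrch and the sample list are illustrative:

#include <stdio.h>

/* Searches for x in the sorted array a[low..high] (1-based, as in the notes).
   Returns the position of x, or 0 if x is not present. */
int BinSrch(int a[], int low, int high, int x)
{
    if (low > high) return 0;              /* empty range: x not found       */
    int mid = (low + high) / 2;
    if (x == a[mid]) return mid;           /* key equals the middle element  */
    if (x < a[mid])  return BinSrch(a, low, mid - 1, x);   /* search left half  */
    return BinSrch(a, mid + 1, high, x);                   /* search right half */
}

int main(void)
{
    int a[] = {0, 3, 8, 12, 17, 25, 31};   /* a[1..6] holds the sorted list */
    printf("%d\n", BinSrch(a, 1, 6, 17));  /* prints 4 */
    printf("%d\n", BinSrch(a, 1, 6, 5));   /* prints 0 (not found) */
    return 0;
}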
Iterative binary search:
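The iterative version is likewise shown only as a figure; a sketch under the same conventions as the recursive sketch above (1-based array, 0 when x is absent):

/* Iterative binary search over a[1..n]; returns the position of x, or 0. */
int IterBinSrch(int a[], int n, int x)
{
    int low = 1, high = n;
    while (low <= high) {
        int mid = (low + high) / 2;
        if (x == a[mid]) return mid;
        if (x < a[mid])  high = mid - 1;   /* continue in the left half  */
        else             low = mid + 1;    /* continue in the right half */
    }
    return 0;   /* not found */
}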

Finding the maximum and minimum

Problem statement: Given a list of n elements, the problem is to find the maximum and
minimum items.
StraightMaxMin: A simple and straight forward algorithm to achieve this is given below.

Explanation:
StraightMaxMin requires 2(n-1) comparisons in the best, average and worst cases.
By realizing that the comparison a[i] > max is false most of the time, an improvement in the
algorithm can be made. Hence we can replace the contents of the for loop by:
If (a[i] > Max) then Max = a[i]; Else if (a[i] < min) then min = a[i];
On the average a[i] is greater than max half the time. So, the average number of comparisons
is 3n/2 - 1.

Algorithm based on Divide and Conquer strategy
Let P = (n, a[i], ..., a[j]) denote an arbitrary instance of the problem. Here 'n' is the number of
elements in the list (a[i], ..., a[j]) and we are interested in finding the maximum and minimum of
the list. If the list has more than 2 elements, P has to be divided into smaller instances.
For example, we might divide 'P' into the 2 instances P1 = ([n/2], a[1], ..., a[n/2]) and
P2 = (n - [n/2], a[[n/2]+1], ..., a[n]).
After having divided 'P' into 2 smaller sub problems, we can solve them by recursively invoking
the same divide-and-conquer algorithm.
Algorithm:

Compared with the straight forward method (2n-2 comparisons), this method saves 25% in comparisons.

Space Complexity
Compared to the straight forward method, the MaxMin method requires extra
stack space for i, j, max, min, max1 and min1. Given n elements there will be
[log2 n] + 1 levels of recursion and we need to save seven values for each
recursive call (6 + 1 for the return address).
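The recursive MaxMin algorithm referenced by the "Algorithm:" placeholder above appears in the original as a figure. A sketch of the divide-and-conquer scheme just described, with the parameter names i, j, max, min, max1, min1 following the discussion; the sample data in main() is illustrative:

#include <stdio.h>

/* Recursively finds the maximum and minimum of a[i..j]. */
void MaxMin(int a[], int i, int j, int *max, int *min)
{
    if (i == j) {                            /* one element                  */
        *max = *min = a[i];
    } else if (i == j - 1) {                 /* two elements: one comparison */
        if (a[i] < a[j]) { *max = a[j]; *min = a[i]; }
        else             { *max = a[i]; *min = a[j]; }
    } else {
        int mid = (i + j) / 2, max1, min1;
        MaxMin(a, i, mid, max, min);         /* solve the left half          */
        MaxMin(a, mid + 1, j, &max1, &min1); /* solve the right half         */
        if (max1 > *max) *max = max1;        /* combine the two answers      */
        if (min1 < *min) *min = min1;
    }
}

int main(void)
{
    int a[] = {22, 13, -5, -8, 15, 60, 17, 31, 47}, max, min;
    MaxMin(a, 0, 8, &max, &min);
    printf("max = %d, min = %d\n", max, min);   /* max = 60, min = -8 */
    return 0;
}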
Merge sort is a perfect example of a successful application of the divide-and
conquer technique. It sorts a given array A [O ... n - 1] by dividing it into two
halves A [0 .. \n/2]-1] and A [ ⎝n/2] .. n-1], sorting each of them recursively, and
then merging the two smaller sorted arrays into a single sorted one.

The merging of two sorted arrays can be done as follows.


 Two pointers (array indices) are initialized to point to the first elements of
the arrays being merged.
 The elements pointed to are compared, and the smaller of them is added
to a new array being constructed
 After that, the index of the smaller element is incremented to point to its
immediate successor in the array it was copied from. This operation is
repeated until one of the two given arrays is exhausted, and then the
remaining elements of the other array are copied to the end of the new
array.
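A compact C sketch of the merging procedure just described and the overall mergesort; the helper names and the temporary-array approach are illustrative choices, not the algorithm figure from the notes:

#include <stdio.h>
#include <string.h>

/* Merges the two sorted halves a[low..mid] and a[mid+1..high] through a
   temporary array, following the bullet points above. */
static void merge(int a[], int low, int mid, int high)
{
    int tmp[high - low + 1];
    int i = low, j = mid + 1, k = 0;
    while (i <= mid && j <= high)            /* copy the smaller front element */
        tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i <= mid)  tmp[k++] = a[i++];     /* copy whatever is left over     */
    while (j <= high) tmp[k++] = a[j++];
    memcpy(&a[low], tmp, k * sizeof(int));
}

/* Sorts a[low..high] by splitting, sorting each half recursively and merging. */
void merge_sort(int a[], int low, int high)
{
    if (low < high) {
        int mid = (low + high) / 2;
        merge_sort(a, low, mid);
        merge_sort(a, mid + 1, high);
        merge(a, low, mid, high);
    }
}

int main(void)
{
    int a[] = {8, 3, 2, 9, 7, 1, 5, 4};      /* the list used in the example below */
    merge_sort(a, 0, 7);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]);   /* 1 2 3 4 5 7 8 9 */
    printf("\n");
    return 0;
}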

Example:
The operation of the algorithm on the list 8, 3, 2, 9, 7, 1, 5, 4 is illustrated in the figure.
Merge Sort
 Mergesort will never degrade to O (n2)
 Another advantage of mergesort over quicksort and heapsort is its
stability. (A sorting algorithm is said to be stable if two objects with
equal keys appear in the same order in sorted output as they appear in
the input array to be sorted.)
Limitations:
The principal shortcoming of mergesort is the linear amount [O(n) ] of extra
storage the algorithm requires. Though merging can be done in-place, the
resulting algorithm is quite complicated and of theoretical interest only.

Variations of merge sort


1. The algorithm can be implemented bottom up by merging pairs of the array‟s
elements, then merging the sorted pairs, and so on. (If n is not a power of 2,
only slight bookkeeping complications arise.) This avoids the time and space
Analysis overhead of using a stack to handle recursive calls.
Here the basic operation is key comparison. As merge sort execution does not 2. We can divide a list to be sorted in more than two parts, sort each
depend on the order of the data, best case and average case runtime are the recursively, and then merge them together. This scheme, which is particularly
same as worst case runtime. useful for sorting files residing on secondary memory devices, is called multiway
Worst case: During key comparison, neither of the two arrays becomes empty mergesort.
before the Quick sort
other one contains just one element leads to the worst case of merge sort. Quicksort is the other important sorting algorithm that is based on the divide-and-conquer
Assuming for approach. Unlike mergesort, which divides its input elements according to their position in the
simplicity that total number of elements n is a power of 2, the recurrence array, quicksort divides (or partitions) them according to their value.
relation for the A partition is an arrangement of the array‘s elements so that all the elements to the left of some
number of key comparisons C(n) is element A[s] are less than or equal to A[s], and all the elements to the right of A[s] are greater
than or equal to it:

where, Cmerge(n) is the number of key comparisons made during the merging
stage.
Obviously, after a partition is achieved, A[s] will be in its final position in the sorted array,
Let us analyze Cmerge(n), the number of key comparisons performed during the
and we can continue sorting the two subarrays to the left and the right of A[s] independently (e.g.,
merging stage. At each step, exactly one comparison is made, after which the by the same method).
total number of elements in the two arrays still needing to be processed is In quick sort, the entire work happens in the division stage, with no work required to combine the
reduced by 1. In the worst case, neither of the two arrays becomes empty before solutions to the sub problems.
the other one contains just one element (e.g., smaller elements may come from
the alternating arrays).Therefore, for the worst case, Cmerge(n) = n –1.
Now,

Solving the recurrence equation using master theorem:


Here a = 2, b = 2, f(n) = n, d = 1. Therefore 2 = 2^1, so case 2 holds in the Master
Theorem:
Cworst(n) = Θ(n^d log n) = Θ(n^1 log n) = Θ(n log n). Therefore Cworst(n) = Θ(n log n).
Advantages:
 Number of comparisons performed is nearly optimal.
 For large n, the number of comparisons made by this algorithm in the
average case turns out to be about 0.25n less and hence is also in Θ(n log Partitioning
n). We start by selecting a pivot—an element with respect to whose value we are going to divide
the subarray. There are several different strategies for selecting a pivot. We use
the sophisticated method suggested by C.A.R. Hoare, the prominent British computer scientist
who invented quicksort.
Select the subarray‘s first element: p = A[l].Now scan the subarray from both ends, comparing
the subarray‘s elements to the pivot.
The left-to-right scan, denoted below by index pointer i, starts with the second element. Since
we want elements smaller than the pivot to be in the left part of the subarray, this scan skips over
elements that are smaller than the pivot and stops upon encountering the first element greater than
or equal to the pivot.
The right-to-left scan, denoted below by index pointer j, starts with the last element of the
subarray. Since we want elements larger than the pivot to be in the right part of the subarray, this
scan skips over elements that are larger than the pivot and stops on encountering the first
element smaller than or equal to the pivot.

After both scans stop, three situations may arise, depending on whether or not
the scanning indices have crossed.
 If scanning indices i and j have not crossed, i.e., i< j, we simply exchange
A[i] and A[j ] and resume the scans by incrementing I and decrementing j,
respectively:

If the scanning indices have crossed over, i.e., i>


j, we will have partitioned the subarray after exchanging the pivot with
A[j]:

If the scanning indices stop while pointing to the same element, i.e., i = j, the
value they are pointing to must be equal to p. Thus, we have the subarray
partitioned, with the split position s = i = j :
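A sketch of Hoare-style partitioning as described above, with an explicit bound check in place of a sentinel; this illustrates the idea and is not the exact algorithm figure from the notes. The helper names and the sample array are illustrative:

#include <stdio.h>

static void swap(int *x, int *y) { int t = *x; *x = *y; *y = t; }

/* Partitions a[l..r] around the pivot p = a[l] using the two opposite scans
   described above, and returns the split position. */
int hoare_partition(int a[], int l, int r)
{
    int p = a[l];
    int i = l, j = r + 1;
    while (1) {
        do { i++; } while (i <= r && a[i] < p);  /* left-to-right scan */
        do { j--; } while (a[j] > p);            /* right-to-left scan */
        if (i >= j) break;                       /* scans have crossed */
        swap(&a[i], &a[j]);
    }
    swap(&a[l], &a[j]);                          /* put the pivot in its final place */
    return j;
}

void quicksort(int a[], int l, int r)
{
    if (l < r) {
        int s = hoare_partition(a, l, r);
        quicksort(a, l, s - 1);     /* a[s] is already in its final position */
        quicksort(a, s + 1, r);
    }
}

int main(void)
{
    int a[] = {5, 3, 1, 9, 8, 2, 4, 7};
    quicksort(a, 0, 7);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]);   /* 1 2 3 4 5 7 8 9 */
    printf("\n");
    return 0;
}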
Analysis Switching to insertion sort on very small subarrays (between 5 and 15
Best Case -Here the basic operation is key comparison. Number of key elements for most computer systems) or not sorting small subarrays at all and
comparisons made before a partition is achieved is n + 1 if the scanning indices finishing the algorithm with insertion sort applied to the entire nearly sorted
cross over and n if they coincide. If all the splits happen in the middle of array
corresponding subarrays, we will have the best case. The number of key Modifications of the partitioning algorithm such as the three-way partition into
comparisons in the best case satisfies the recurrence, segments smaller than, equal to, and larger than the pivot

According to the Master Theorem, Cbest(n) ∈ Θ(n log2 n); solving it exactly for n = 2^k yields
Cbest(n) = n log2 n.

Limitations: 1. It is not stable. 2. It requires a stack to store parameters of subarrays that are yet
to be sorted. 3. Performance on randomly ordered arrays is known to be sensitive not only to the
implementation details of the algorithm but also to both computer architecture and data type.
Worst Case – In the worst case, all the splits will be skewed to the extreme: one
of the two subarrays will be empty, and the size of the other will be just 1 less
than the size of the subarray being partitioned. This unfortunate situation will Unit - 4
happen, in particular, for increasing arrays. Indeed, if A[0..n − 1] is a strictly Greedy Method : The General Method – Optimal Storage on Tapes – Knapsack Problem – Job
increasing array and we use A[0] as the pivot, the left-to-right scan will stop on Sequencing with Deadlines – Optimal Merge Patterns.
A[1] while the right-to-left scan will go all the way to reach A[0], indicating the
split at position 0:So, after making n + 1 comparisons to get to this partition GENERAL METHOD
and exchanging the pivot A[0] with itself, the algorithm will be left with the
strictly increasing array A[1..n − 1] to sort. This sorting of strictly increasing Greedy Method
arrays of diminishing sizes will continue until the last one A[n−2.. n−1] has Greedy is the most straight forward design technique. Most of the problems have n
been processed. The total number of key comparisons made will be equal to inputs and require us to obtain a subset that satisfies some constraints. Any subset that
satisfies these constraints is called a feasible solution. We need to find a feasible solution
that either maximizes or minimizes the objective function. A feasible solution that does
this is called an optimal solution.
Average Case - Let Cavg(n) be the average number of key comparisons made by The greedy method is a simple strategy of progressively building up a solution, one
element at a time, by choosing the best possible element at each stage. At each stage, a
quicksort on a randomly ordered array of size n. A partition can happen in any
decision is made regarding whether or not a particular input is in an optimal solution.
position s (0 ≤ s ≤ n−1) after n+1comparisons are made to achieve the partition. This is done by considering the inputs in an order determined by some selection
After the partition, the left and right subarrays will have s and n − 1− s procedure. If the inclusion of the next input, into the partially constructed optimal
elements, respectively. Assuming that the partition split can happen in each solution will result in an infeasible solution then this input is not added to the partial
position s with the same probability 1/n, we get the following recurrence solution. The selection procedure itself is based on some optimization measure. Several
relation: optimization measures are plausible for a given problem. Most of them, however, will
result in algorithms that generate sub-optimal solutions. This version of greedy
technique is called subset paradigm. Some problems like Knapsack, Job sequencing with
deadlines and minimum cost spanning trees are based on subset paradigm.
For the problems that make decisions by considering the inputs in some order, each
decision is made using an optimization criterion that can be computed using decisions
Its solution, which is much trickier than the worst- and best-case analyses, already made. This version of greedy method is ordering paradigm. Some problems like
turns out to be optimal storage on tapes, optimal merge patterns and single source shortest path are
based on ordering paradigm.

CONTROL ABSTRACTION
Thus, on the average, quicksort makes only 39% more comparisons than in the
Algorithm Greedy (a, n)
best case. Moreover, its innermost loop is so efficient that it usually runs faster // a(1 : n) contains the „n‟ inputs
than mergesort on randomly ordered arrays of nontrivial sizes. This certainly {
justifies the name given to the algorithm by its inventor. solution := ; // initialize the solution to empty for i:=1 to n do
{
Variations: Because of quicksort‟s importance, there have been persistent x := select (a);
efforts over the years to refine the basic algorithm. Among several if feasible (solution, x) then
improvements discovered by researchers are: solution := Union (Solution, x);
Better pivot selection methods such as randomized quicksort that uses a }
return solution;
random element or the median-of-three method that uses the median of the
}
leftmost, rightmost, and the middle element of the array
Procedure Greedy describes the essential way that a greedy based algorithm will look,
once a particular problem is chosen and the functions select, feasible and union are
properly implemented.
The function select selects an input from „a‟, removes it and assigns its value to „x‟. Feasible is a
Boolean valued function, which determines if „x‟ can be included into the solution vector. The
function Union combines „x‟ with solution and updates the objective function.

KNAPSACK PROBLEM
Let us apply the greedy method to solve the knapsack problem. We are given „n‟ objects
and a knapsack. The object „i‟ has a weight wi and the knapsack has a capacity „m‟. If a
fraction xi, 0 < xi < 1 of object i is placed into the knapsack then a profit of pi xi is
earned. The objective is to fill the knapsack that maximizes the total profit earned.
Since the knapsack capacity is „m‟, we require the total weight of all chosen objects to be at
most „m‟. The problem is stated as:

Example
Let n = 3, (l1, l2, l3) = (5, 10, 3). Then find the optimal ordering?
The profits and weights are positive numbers. Solution:
Algorithm There are n! = 6 possible orderings. They are:
If the objects are already been sorted into non-increasing order of p[i] / w[i] then the
algorithm given below obtains solutions corresponding to this strategy.

Algorithm GreedyKnapsack (m, n)


// P[1 : n] and w[1 : n] contain the profits and weights respectively of
// Objects ordered so that p[i] / w[i] > p[i + 1] / w[i + 1].
// m is the knapsack size and x[1: n] is the solution vector.
{
for i := 1 to n do x[i] := 0.0 // initialize x U := m;
for i := 1 to n do
{
if (w(i) > U) then break;
x [i] := 1.0; U := U – w[i];
}
if (i < n) then x[i] := U / w[i];
}
Running time:
The objects are to be sorted into non-decreasing order of pi / wi ratio. But if we disregard
the time to initially sort the objects, the algorithm requires only O(n) time.
Example:
Consider the following instance of the knapsack problem: n = 3, m = 20, (p1, p2, p3) = (25, 24,
15) and (w1, w2, w3) = (18, 15, 10).
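Applying the GreedyKnapsack strategy above to this instance (after re-ordering the objects so that p[i]/w[i] is non-increasing: 24/15, 15/10, 25/18) gives x = (1, 0.5, 0) in that order and a total profit of 31.5. The small sketch below reproduces the computation; the 1-based arrays and variable names mirror the pseudocode and are illustrative:

#include <stdio.h>

int main(void)
{
    /* Example instance from the notes, with the objects re-ordered so that
       p[i]/w[i] is non-increasing, as the algorithm requires. Index 0 unused. */
    double p[] = {0, 24, 15, 25};
    double w[] = {0, 15, 10, 18};
    double x[] = {0, 0, 0, 0};
    int n = 3, i;
    double U = 20.0;                       /* remaining knapsack capacity m */

    for (i = 1; i <= n; i++) {
        if (w[i] > U) break;               /* object no longer fits whole   */
        x[i] = 1.0;
        U -= w[i];
    }
    if (i <= n) x[i] = U / w[i];           /* take a fraction of object i   */

    double profit = 0.0;
    for (i = 1; i <= n; i++) profit += p[i] * x[i];
    printf("x = (%.2f, %.2f, %.2f), profit = %.2f\n", x[1], x[2], x[3], profit);
    /* Expected output: x = (1.00, 0.50, 0.00), profit = 31.50 */
    return 0;
}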

OPTIMAL STORAGE ON TAPES


There are „n‟ programs that are to be stored on a computer tape of length „L‟. Each
program „i‟ is of length li, 1 ≤ i ≤ n. All the programs can be stored on the tape if and
only if the sum of the lengths of the programs is at most „L‟.
We shall assume that whenever a program is to be retrieved from this tape, the tape is
initially positioned at the front. If the programs are stored in the order i = i 1, i2, . . . . .
, in, the time tJ needed to retrieve program iJ is proportional to
Algorithm:
The algorithm for assigning programs to tapes is as follows:
Algorithm Store (n, m)
// n is the number of programs and m the number of tapes
{
j := 0; // next tape to store on for i :=1 to n do
{
Print („append program‟, i, „to permutation for tape‟, j); j := (j + 1) mod m;
}
}
On any given tape, the programs are stored in non-decreasing order of their lengths.

JOB SEQUENCING WITH DEADLINES


When we are given a set of „n‟ jobs. Associated with each Job i, deadline di > 0 and
profit Pi > 0. For any job „i‟ the profit pi is earned iff the job is completed by its deadline.
Only one machine is available for processing jobs. An optimal solution is the feasible
solution with maximum profit.
Sort the jobs in „j‟ ordered by their deadlines. The array d [1 : n] is used to store the
deadlines of the order of their p-values. The set of jobs j [1 : k] such that j [r], 1 ≤ r ≤ k
are the jobs in „j‟ and d (j [1]) ≤ d (j[2]) ≤ . . . ≤ d (j[k]). To test whether J U {i} is
feasible, we have just to insert i into J preserving the deadline ordering and then verify
that d [J[r]] ≤ r, 1 ≤ r ≤ k+1.
Example:
Let n = 4, (P1, P2, P3, P4) = (100, 10, 15, 27) and (d1, d2, d3, d4) = (2, 1, 2, 1). The
feasible solutions and their values are:

OPTIMAL MERGE PATTERNS
Given 'n' sorted files, there are many ways to pair wise merge them into a single sorted
file. As different pairings require different amounts of computing time, we want to
determine an optimal (i.e., one requiring the fewest comparisons) way to pair wise
merge 'n' sorted files together. This type of merging is called a 2-way merge pattern.
To merge an n-record file and an m-record file requires possibly n + m record moves;
the obvious choice is, at each step, to merge the two smallest files together. The
two-way merge patterns can be represented by binary merge trees.

Algorithm to Generate Two-way Merge Tree:

struct treenode
{
struct treenode * lchild;
int weight;      /* assumed field: the merge-tree construction needs the file length here */
struct treenode * rchild;
};
Example 1:
Suppose we are having three sorted files X1, X2 and X3 of length 30, 20, and 10 records each.
Merging of the files can be carried out as follows:
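Before the figure, it may help to see the greedy rule executed directly: repeatedly merge the two smallest files and sum the record moves. The array-based sketch below (an illustrative alternative to building the binary merge tree) computes the cost for Example 1, giving 30 + 60 = 90 record moves:

#include <stdio.h>

int main(void)
{
    int size[10] = {30, 20, 10};           /* lengths of the files X1, X2, X3 */
    int n = 3, total = 0;

    while (n > 1) {
        int a = 0, b = 1, i;
        if (size[b] < size[a]) { a = 1; b = 0; }
        for (i = 2; i < n; i++) {          /* find the two smallest files */
            if (size[i] < size[a]) { b = a; a = i; }
            else if (size[i] < size[b]) b = i;
        }
        int merged = size[a] + size[b];
        total += merged;                   /* cost of this merge step */
        size[a] = merged;                  /* replace the two files by the merged one */
        size[b] = size[n - 1];
        n--;
    }
    printf("total record moves = %d\n", total);   /* 30 + 60 = 90 for (30, 20, 10) */
    return 0;
}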

Example 2:
Given five files (X1, X2, X3, X4, X5) with sizes (20, 30, 10, 5, 30). Apply greedy rule to
find optimal way of pair wise merging to give an optimal solution using binary merge
tree representation.
Solution:

Unit V
Back tracking: The General Method – The 8-Queens Problem – Sum of Subsets – Graph Coloring.
Merge X4 and X3 to get 15 record moves. Call this Z1.

Backtracking
Some problems can be solved, by exhaustive search. The exhaustive-search technique
suggests generating all candidate solutions and then identifying the one (or the ones) with a
desired property.
Backtracking is a more intelligent variation of this approach. The principal idea is to
Merge Z1 and X1 to get 35 record moves. Call this Z2.
construct solutions one component at a time and evaluate such partially constructed
candidates as follows. If a partially constructed solution can be developed further without
violating the problem‘s constraints, it is done by taking the first remaining legitimate option
for the next component. If there is no legitimate option for the next component, no
alternatives for any remaining component need to be considered. In this case, the algorithm
backtracks to replace the last component of the partially constructed solution with its next
option.
It is convenient to implement this kind of processing by constructing a tree of choices being
made, called the state-space tree. Its root represents an initial state before the search for a
solution begins. The nodes of the first level in the tree represent the choices made for the first
component of a solution; the nodes of the second level represent the choices for the second
component, and soon. A node in a state-space tree is said to be promising if it corresponds to
a partially constructed solution that may still lead to a complete solution; otherwise, it is
called non-promising. Leaves represent either non-promising dead ends or complete
solutions found by the algorithm.
In the majority of cases, a state space tree for a backtracking algorithm is constructed in the
manner of depth-first search. If the current node is promising, its child is generated by adding
the first remaining legitimate option for the next component of a solution, and the processing
moves to this child. If the current node turns out to be non-promising, the algorithm
backtracks to the node‘s parent to consider the next possible option for its last component; if
there is no such option, it backtracks one more level up the tree, and so on. Finally, if the
algorithm reaches a complete solution to the problem, it either stops (if just one solution is
required) or continues searching for other possible solutions.

General method
General Algorithm (Recursive)

General Algorithm (Iterative)


General Algorithm for backtracking

Figure: State-space tree of solving the four-queens problem by backtracking. × denotes an


unsuccessful attempt to place a queen in the indicated column. The numbers above the nodes indicate
the order in which the nodes are generated.
If other solutions need to be found, the algorithm can simply resume its operations at the leaf at
which it stopped. Alternatively, we can use the board‘s symmetry for this purpose.
Finally, it should be pointed out that a single solution to the n-queens problem for any n ≥ 4 can
be found in linear time.
Note: The algorithm NQueens() is not in the syllabus. It is given here for interested learners. The
N-Queens problem algorithm is referred from textbook T2.
The problem is to place n queens on an n × n chessboard so that no two queens attack each other
by being in the same row or in the same column or on the same diagonal.
So let us consider the four-queens problem and solve it by the backtracking technique. Since each of the
four queens has to be placed in its own row, all we need to do is to assign a column for each queen on the
board presented in figure.

We start with the empty board and then place queen 1 in the first possible position of its row, which is
in column 1 of row 1. Then we place queen 2, after trying unsuccessfully columns 1 and 2, in the first
acceptable position for it, which is square (2, 3), the square in row 2 and column 3. This proves to be a
dead end because there is no acceptable position for queen 3. So, the algorithm backtracks and puts
queen 2 in the next possible position at (2, 4). Then queen 3 is placed at (3, 2), which proves to be
another dead end. The algorithm then backtracks all the way to queen 1 and moves it to (1, 2). Queen 2
then goes to (2, 4), queen 3 to(3, 1), and queen 4 to (4, 3), which is a solution to the problem. The state-
space tree of this search is shown in figure.
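The NQueens() algorithm itself is outside the syllabus and appears in the textbook only as a figure; the sketch below reproduces the idea for the four-queens walkthrough above. Here col[i] stores the column of the queen in row i, and place() is the "promising" test (no shared column or diagonal); the names and board size are illustrative.

#include <stdio.h>
#include <stdlib.h>

#define N 4                 /* board size; 4 reproduces the walkthrough above */

int col[N + 1];             /* col[i] = column of the queen in row i (1-based) */

/* Returns 1 if a queen can be placed in row k, column c without attacking the
   queens already placed in rows 1..k-1 (same column or same diagonal). */
int place(int k, int c)
{
    for (int i = 1; i < k; i++)
        if (col[i] == c || abs(col[i] - c) == abs(i - k))
            return 0;
    return 1;
}

void nqueens(int k)
{
    if (k > N) {                             /* all rows filled: a solution */
        for (int i = 1; i <= N; i++) printf("(%d,%d) ", i, col[i]);
        printf("\n");
        return;
    }
    for (int c = 1; c <= N; c++)             /* try every column in row k */
        if (place(k, c)) {
            col[k] = c;
            nqueens(k + 1);                  /* extend; returning here = backtracking */
        }
}

int main(void) { nqueens(1); return 0; }
/* For N = 4 this prints the two solutions, including (1,2) (2,4) (3,1) (4,3)
   found in the walkthrough above. */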
We record the value of s, the sum of these numbers, in the node. If s is equal to d, we have a solution to
the problem. We can either report this result and stop or, if all the solutions need to be found, continue by
backtracking to the node‘s parent. If s is not equal to d, we can terminate the node as non-promising if
either of the following two inequalities holds:
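The two inequalities appear as a figure in the original. Based on the surrounding description (and assuming the elements are sorted so that a1 < a2 < . . . < an), they are presumably:

s + a(i+1) > d      (even the smallest unconsidered element pushes the sum past d), or
s + a(i+1) + . . . + a(n) < d      (even adding all the remaining elements cannot reach d).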

Example: Apply backtracking to solve the following instance of the subset sum problem: A
= {1, 3, 4, 5} and d = 11.

Graph coloring

Sum of subsets problem


Problem definition: Find a subset of a given set A = {a1, . . . , an} of n positive integers whose
sum is equal to a given positive integer d.
For example, for A = {1, 2, 5, 6, 8} and d = 9, there are two solutions: {1, 2, 6} and {1, 8}. Of
course, some instances of this problem may have no solutions.
It is convenient to sort the set‘s elements in increasing order. So, we will assume that
a1< a2< . . . < an.
The state-space tree can be constructed as a binary tree like that in Figure shown below for the
instance A = {3, 5, 6, 7} and d = 15.
The number inside a node is the sum of the elements already included in the subsets represented by the
node. The inequality below a leaf indicates the reason for its termination.

The root of the tree represents the starting point, with no decisions about the given elements made
as yet. Its left and right children represent, respectively, inclusion and exclusion of a 1 in a set
being sought.
Similarly, going to the left from a node of the first level corresponds to inclusion of a 2 while
going to the right corresponds to its exclusion, and so on. Thus, a path from the root to a node on
the ith level of the tree indicates which of the first in numbers have been included in the subsets
represented by that node.
UNIT -1

CLOUD COMPUTING
Introduction

Cloud computing is a type of computing that relies on shared computing resources rather than
having local servers or personal devices to handle applications.

Definition by NIST Cloud Computing

The National Institute of Stands and Technology(NIST) has a more comprehensive definition
of cloud computing. It describes cloud computing as "a model for enabling ubiquitous,
convenient, on-demand network access to a shared pool of configurable computing resources
(e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and
released with minimal management effort or service provider interaction."

• Ability / space where you store your data ,process it and can access anywhere from the
world
• As a Metaphor for the internet.

 Cloud computing is :
• Storing data /Applications on remote servers
• Processing Data / Applications from servers
Analysis
• Accessing Data / Applications via internet

What is a cloud service??

 Cloud computing is taking services and moving them outside an organization's firewall.
Applications, storage and other services are accessed via the Web. The services are
delivered and used over the Internet and are paid for by the cloud customer on an as-
needed or pay-per-use business model.

Service: This term in cloud computing is the concept of being able to use reusable, fine-grained
components across a vendor’s network.

 Iaas,Paas,Saas,Daas,Naas,Caas are some of the services Provided by different providers

1.1 Characteristics (OR) Features of Cloud Environments:

According to the NIST, all true cloud environments have five key characteristics:


1. On-demand self-service: This means that cloud customers can sign up for, pay for and v. Standards There are currently no standards to convert a centralized database into
start using cloud resources very quickly on their own without help from a sales agent. a cloud solution.
2. Broad network access: Customers access cloud services via the Internet. iii. Synchronizationallows content to be refreshed across multiple devices.
3. Resource pooling: Many different customers (individuals, organizations or different Ex:
departments within an organization) all use the same servers, storage or other computing Google docs
resources. Data base services (DaaS): it avoids the complexity and cost of running your own database.
4. Rapid elasticity or expansion: Cloud customers can easily scale their use of resources up Benefits:
or down as their needs change. i.Ease of use :don’t have to worry about buying, installing, and maintaining hardware
5. Measured service: Customers pay for the amount of resources they use in a given period for the database as there is no servers to provision and no redundant systems to worry..
of time rather than paying for hardware or software upfront. (Note that in a private cloud, ii. Power The database isn’t housed locally, but that doesn’t mean that it is not
this measured service usually involves some form of charge backs where IT keeps track functional and effective. Depending on your vendor, you can get custom data
of how many resources different departments within an organization are using.) validation to ensure accurate information. You can create and manage the database
with ease.
1.2 Applications: iii. Integration The database can be integrated with your other services to provide
more value and power. For instance, you can tie it in with calendars, email, and
i) Storage:cloud keeps many copies of storage. Using these copies of resources, it extracts people to make your work more powerful.
another resource if anyone of the resources fails. iv. Management because large databases benefit from constant pruning and
optimization, typically there are expensive resources dedicated to this task. With
ii. Database: are repositories for information with links within the information that help making some DaaS offerings, this management can be provided as part of the service for
the data searchable. much less expense. The provider will often use offshore labor pools to take
Advantage of lower labor costs there. So it’s possible that you are using the service
Advantages:
in Chicago, the physical servers are in Washington state, and the database
administrator is in the Philippines.
i. Improved availability:If there is a fault in one database system, it will only affect
one fragment of the information, not the entire database.  MS SQL and Oracle are two biggest players of DaaS providers.
ii. Improved performance: Data is located near the site with the greatest demand and the MS SQL:
database systems are parallelized, which allows the load to be balanced among the
 Microsoft SQL server data services (SSDS),SSDS based on SQL server, announced
servers.
cloud extension of SQL server tool, in 2008 which is similar to Amazon’s simple
iii. Price It is less expensive to create a network of smaller computers with the power
database (schema –free data storage, SOAP or REST APIs and a pay-as-you-go payment
of one large one.
system.
iv. Flexibility : Systems can be changed and modified without harm to the entire
database.  Variation is first, one of the main selling points of SSDS is that it integrates with
Disadvantages: Microsoft’s sync Framework which is a .NET library for synchronizing dissimilar data
i. Complexity Database administrators have extra work to do to maintain the sources.
system.  Microsoft wants SSDS to work as a data hub, synchronizing data on multiple devices so
ii. Labor costs With that added complexity comes the need for more workers on the they can be accessed offline.
payroll.
Core concepts in SSDS:
iii. Security Database fragments must be secured and so must the sites housing the
fragments.
i. Authority  both a billing unit and a collection of containers
iv. Integrity It may be difficult to maintain the integrity of the database if it is too
ii. Container  collection of entities and is what you search within.
complex or changes too quickly.


iii. Entity  property bag of name and value pairs 1.3 Cloud Components:

Three components of a cloud computing are :

Oracle: • Clients
• Data center
It introduces three services to provide database services to cloud users. Customers can license • Distributed servers

a. Oracle Database 11g i. Clients:


b. Oracle fusion Middleware
c. Oracle enterprise Manager • Clients are the devices that the end users interact with to manage their information on the
cloud.
 AWS EC2-Amazon web services Elastic Compute cloud • Clients are of three categories :

Oracle delivered a set of free Amazon Machine Images (AMIs) to its customers so they could a. Mobile: mobile devices including PDAs/smart phones like a blackberry, windows, iphone.
quickly and efficiently deploy Oracle’s database solutions.
Developers can take advantage of the provisioning and automated software deployment b. Thin: are comps that don’t have internal hard drives then display the info but rather let server
to rapidly build applications using Oracle’s popular development tools such as Oracle do all the work.
Application Express, Oracle Developer, Oracle Enterprise Pack for Eclipse, and Oracle
Workshop for Web Logic. Additionally, Oracle Unbreakable Linux Support and AWS c. Thick: is a regular comp, using web browser like Firefox/Internet Explorer to connect to the
1.3 Cloud Components:

The three components of a cloud computing system are:
• Clients
• Data center
• Distributed servers

i. Clients:
• Clients are the devices that the end users interact with to manage their information on the cloud.
• Clients are of three categories:
a. Mobile: mobile devices, including PDAs and smartphones such as a BlackBerry, Windows Mobile phone, or iPhone.
b. Thin: computers that do not have internal hard drives; they only display the information and let the server do all the work.
c. Thick: a regular computer that uses a web browser such as Firefox or Internet Explorer to connect to the cloud.

Thin vs Thick - points of comparison:
i. Price and effect on the environment
ii. Lower hardware costs
iii. Lower IT costs
iv. Security
v. Data security
vi. Less power consumption
vii. Ease of repair or replacement
viii. Less noise

ii. Data Center:
• It is the collection of servers where the application to which you subscribe is housed.

iii. Distributed Servers:
• Servers are in geographically disparate locations but act as if they are humming away right next to each other.
• This gives the service provider more flexibility in options and security.
EX: Amazon has its cloud solution spread all over the world; if one site fails, the service can still be accessed through another site.
• If the cloud needs more hardware, the provider does not need to cram more servers into one server room - they can add them at another site and make them part of the cloud.

1.4 Benefits and Limitations of Cloud Computing

The advantage of cloud computing is twofold. It provides a form of file backup, and it allows working on the same document from several devices of various types (PC, tablet, or smartphone), whether from one place or while traveling.

Cloud computing simplifies usage by overcoming the constraints of traditional computing tools (installation and updating of software, storage, data portability, and so on). Cloud computing also provides more elasticity and agility, because it allows faster access to IT resources (server, storage, or bandwidth) via a simple web portal and thus without investing in additional hardware.

Consumers and organizations have many different reasons for choosing to use cloud computing services. They might include the following:
• Convenience
• Scalability
• Low costs
• Security
• Anytime, anywhere access
• High availability

Limitations / Disadvantages:

a) Downtime: Since cloud computing systems are internet-based, service outages are always an unfortunate possibility and can occur for any reason.
Best practices for minimizing planned downtime in a cloud environment (a short multi-AZ sketch follows this list):
i. Design services with high availability and disaster recovery in mind. Leverage the multiple availability zones provided by cloud vendors in your infrastructure.
ii. If your services have a low tolerance for failure, consider multi-region deployments with automated failover to ensure the best business continuity possible.
iii. Define and implement a disaster recovery plan in line with your business objectives that provides the lowest possible recovery time objective (RTO) and recovery point objective (RPO).
iv. Consider implementing dedicated connectivity such as AWS Direct Connect, Azure ExpressRoute, or Google Cloud's Dedicated Interconnect or Partner Interconnect. These services provide a dedicated network connection between you and the cloud service point of presence. This can reduce exposure to the risk of business interruption from the public internet.
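As a concrete illustration of the first practice, the sketch below launches one small instance in each of two availability zones of the same region using boto3. The AMI ID, zone names, and instance type are placeholders, not recommendations.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    for zone in ("us-east-1a", "us-east-1b"):           # two availability zones in the same region
        ec2.run_instances(
            ImageId="ami-0123456789abcdef0",            # placeholder AMI ID
            InstanceType="t3.micro",
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": zone},       # pin each copy to a different zone
        )

If one zone suffers an outage, the copy running in the other zone keeps serving traffic, which is exactly the availability argument made above.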
b) Security and Privacy: Consider the example of Code Spaces and the hacking of its AWS EC2 console, which led to data deletion and the eventual shutdown of the company. Its dependence on remote cloud-based infrastructure meant taking on the risks of outsourcing everything.
Best practices for minimizing security and privacy risks (a small policy sketch follows this list):
• Understand the shared responsibility model of your cloud provider.
• Implement security at every level of your deployment.
• Know who is supposed to have access to each resource and service, and limit access to least privilege.
• Make sure your team's skills are up to the task: solid security skills for your cloud teams are one of the best ways to mitigate security and privacy concerns in the cloud.
• Take a risk-based approach to securing assets used in the cloud.
• Extend security to the device.
• Implement multi-factor authentication for all accounts accessing sensitive data or systems.
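One way to back the last practice with policy rather than convention is an IAM rule that denies actions unless the caller signed in with multi-factor authentication. The sketch below creates such a policy with boto3; the policy name is a placeholder, and the statement follows the commonly used aws:MultiFactorAuthPresent condition pattern, shown here purely as an illustration.

    import json
    import boto3

    iam = boto3.client("iam")
    deny_without_mfa = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyAllUnlessSignedInWithMFA",
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {"BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}},
        }],
    }
    iam.create_policy(
        PolicyName="require-mfa-example",               # placeholder policy name
        PolicyDocument=json.dumps(deny_without_mfa),
    )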
c) Vulnerability to Attack: Even the best teams suffer severe attacks and security breaches from time to time.
Best practices to help you reduce cloud attacks:
• Make security a core aspect of all IT operations.
• Keep ALL your teams up to date with cloud security best practices.
• Ensure security policies and procedures are regularly checked and reviewed.
• Proactively classify information and apply access control.
• Use cloud services such as AWS Inspector, AWS CloudWatch, AWS CloudTrail, and AWS Config to automate compliance controls.
• Prevent data exfiltration.
• Integrate prevention and response strategies into security operations.
• Discover rogue projects with audits.
• Remove password access from accounts that do not need to log in to services.
• Review and rotate access keys and access credentials.
• Follow security blogs and announcements to be aware of known attacks.
• Apply security best practices for any open source software that you are using.

d) Limited control and flexibility: Since the cloud infrastructure is entirely owned, managed and monitored by the service provider, it transfers minimal control over to the customer. To varying degrees (depending on the particular service), cloud users may find they have less control over the function and execution of services within a cloud-hosted infrastructure. A cloud provider's end-user license agreement (EULA) and management policies might impose limits on what customers can do with their deployments. Customers retain control of their applications, data, and services, but may not have the same level of control over their backend infrastructure.
Best practices for maintaining control and flexibility:
• Consider using a cloud provider partner to help with implementing, running, and supporting cloud services.
• Understanding your responsibilities and the responsibilities of the cloud vendor in the shared responsibility model will reduce the chance of omission or error.
• Make time to understand your cloud service provider's basic level of support. Will this service level meet your support requirements? Most cloud providers offer additional support tiers over and above the basic support for an additional cost.
• Make sure you understand the service level agreement (SLA) concerning the infrastructure and services that you are going to use, and how that will impact your agreements with your customers.

e) Vendor Lock-In: Organizations may find it difficult to migrate their services from one vendor to another. Differences between vendor platforms may create difficulties in migrating from one cloud platform to another, which could equate to additional costs and configuration complexities.
Best practices to decrease dependency:
• Design with cloud architecture best practices in mind. All cloud services provide the opportunity to improve availability and performance, decouple layers, and reduce performance bottlenecks. If you have built your services using cloud architecture best practices, you are less likely to have issues porting from one cloud platform to another.
• Properly understanding what your vendors are selling can help avoid lock-in challenges.
• Employing a multi-cloud strategy is another way to avoid vendor lock-in. While this may add both development and operational complexity to your deployments, it does not have to be a deal breaker. Training can help prepare teams to architect and select best-fit services and technologies.
• Build in flexibility as a matter of strategy when designing applications, to ensure portability now and in the future.

f) Cost Savings: Adopting cloud solutions on a small scale and for short-term projects can be perceived as being expensive.
Best practices to reduce costs (a short billing-alert sketch follows this list):
• Try not to over-provision; instead, look into using auto-scaling services.
• Scale DOWN as well as UP.
• Pre-pay if you have a known minimum usage.
• Stop your instances when they are not being used.
• Create alerts to track cloud spending.
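A minimal spending alert can be implemented as a CloudWatch alarm on the account's estimated charges. The sketch below assumes billing metrics are enabled for the account and that an SNS topic for notifications already exists; the threshold, topic ARN, and alarm name are placeholders. AWS publishes the EstimatedCharges metric in the us-east-1 region.

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # billing metrics live in us-east-1
    cloudwatch.put_metric_alarm(
        AlarmName="monthly-spend-over-100-usd",
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=21600,                        # evaluate every six hours
        EvaluationPeriods=1,
        Threshold=100.0,                     # placeholder monthly budget in USD
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # placeholder topic ARN
    )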
1.5 Architecture

Let us have a look into cloud computing and see what it is made of. Cloud computing comprises two components: the front end and the back end. The front end is the client part of the cloud computing system. It comprises the interfaces and applications that are required to access the cloud computing platform.

A central server administers the system, monitoring traffic and client demands to ensure everything runs smoothly. It follows a set of rules called protocols and uses a special kind of software called MIDDLEWARE. Middleware allows networked computers to communicate with each other. Most of the time, servers don't run at full capacity. That means there is unused processing power going to waste. It is possible to fool a physical server into thinking it is actually multiple servers, each running with its own independent operating system. The technique is called server virtualization. By maximizing the output of individual servers, server virtualization reduces the need for more physical machines.
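On a single Linux host, the effect of server virtualization is easy to observe with the libvirt bindings: one physical machine reports several guest "servers", each with its own operating system. This is a minimal sketch assuming a KVM/libvirt host with the libvirt-python package installed.

    import libvirt

    conn = libvirt.open("qemu:///system")      # connect to the local hypervisor
    for dom in conn.listAllDomains():          # each domain is one virtual server on this host
        state = "running" if dom.isActive() else "stopped"
        print(dom.name(), state)
    conn.close()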

While the back end refers to the cloud itself, it comprises the resources that are required for cloud computing services. It consists of virtual machines, servers, data storage, security mechanisms, etc. It is under the provider's control.

Cloud computing distributes the file system so that it spreads over multiple hard disks and machines. Data is never stored in one place only, and in case one unit fails another will take over automatically. The user's disk space is allocated on the distributed file system, while another important component is the algorithm for resource allocation. Cloud computing is a strong distributed environment, and it heavily depends upon strong algorithms.

Evolution of Cloud Computing

Cloud computing is all about renting computing services. This idea first came in the 1950s. In making cloud computing what it is today, five technologies played a vital role. These are distributed systems and their peripherals, virtualization, Web 2.0, service orientation, and utility computing.

• Distributed Systems:
It is a composition of multiple independent systems, but all of them are depicted as a single entity to the users. The purpose of distributed systems is to share resources and also use them effectively and efficiently. Distributed systems possess characteristics such as scalability, concurrency, continuous availability, heterogeneity, and independence in failures. But the main problem with this system was that all the systems were required to be present at the same geographical location. Thus, to solve this problem, distributed computing led to three more types of computing: mainframe computing, cluster computing, and grid computing.

• Mainframe computing:
Mainframes, which first came into existence in 1951, are highly powerful and reliable computing machines. These are responsible for handling large data such as massive input-output operations. Even today these are used for bulk processing tasks such as online transactions. These systems have almost no downtime and high fault tolerance. After distributed computing, these increased the processing capabilities of the system. But these were very expensive. To reduce this cost, cluster computing came as an alternative to mainframe technology.

• Cluster computing:
In the 1980s, cluster computing came as an alternative to mainframe computing. Each machine in the cluster was connected to the others by a network with high bandwidth. These were way cheaper than mainframe systems and were equally capable of high computations. Also, new nodes could easily be added to the cluster if required. Thus, the problem of cost was solved to some extent, but the problem related to geographical restrictions still pertained. To solve this, the concept of grid computing was introduced.

• Grid computing:
In the 1990s, the concept of grid computing was introduced. It means that different systems were placed at entirely different geographical locations and all were connected via the internet. These systems belonged to different organizations, and thus the grid consisted of heterogeneous nodes. Although it solved some problems, new problems emerged as the distance between the nodes increased. The main problem encountered was the low availability of high-bandwidth connectivity, and with it other network-associated issues. Thus, cloud computing is often referred to as the "successor of grid computing".

• Virtualization:
It was introduced nearly 40 years back. It refers to the process of creating a virtual layer over the hardware which allows the user to run multiple instances simultaneously on the hardware. It is a key technology used in cloud computing. It is the base on which major cloud computing services such as Amazon EC2 and VMware vCloud work. Hardware virtualization is still one of the most common types of virtualization.
• Web 2.0:
It is the interface through which the cloud computing services interact with the clients. It is because of Web 2.0 that we have interactive and dynamic web pages. It also increases flexibility among web pages. Popular examples of Web 2.0 include Google Maps, Facebook, Twitter, etc. Needless to say, social media is possible because of this technology. It gained major popularity in 2004.

• Service orientation:
It acts as a reference model for cloud computing. It supports low-cost, flexible, and evolvable applications. Two important concepts were introduced in this computing model: Quality of Service (QoS), which also includes the SLA (Service Level Agreement), and Software as a Service (SaaS).

• Utility computing:
It is a computing model that defines service provisioning techniques for services such as compute, along with other major services such as storage and infrastructure, which are provisioned on a pay-per-use basis.

Virtualization and Cloud Computing

The main enabling technology for cloud computing is VIRTUALIZATION. Virtualization is the partitioning of a single physical server into multiple logical servers. Once the physical server is divided, each logical server behaves like a physical server and can run an operating system and applications independently. Many popular companies like VMware and Microsoft provide virtualization services, where instead of using your personal PC for storage and computation, you use their virtual server. They are fast, cost-effective and less time consuming.

For software developers and testers virtualization comes in very handy, as it allows developers to write code that runs in many different environments and, more importantly, to test that code.

Virtualization is mainly used for three main purposes:
1) Network Virtualization
2) Server Virtualization
3) Storage Virtualization

Network Virtualization: It is a method of combining the available resources in a network by splitting up the available bandwidth into channels, each of which is independent from the others and can be assigned to a specific server or device in real time.

Storage Virtualization: It is the pooling of physical storage from multiple network storage devices into what appears to be a single storage device that is managed from a central console. Storage virtualization is commonly used in storage area networks (SANs).

Server Virtualization: Server virtualization is the masking of server resources like processors, RAM, operating system, etc., from server users. The intention of server virtualization is to increase resource sharing and reduce the burden and complexity of computation for users.

Virtualization is the key to unlocking the cloud system; what makes virtualization so important for the cloud is that it decouples the software from the hardware. For example, PCs can use virtual memory to borrow extra memory from the hard disk. Usually the hard disk has a lot more space than memory. Although virtual disks are slower than real memory, if managed properly the substitution works perfectly. Likewise, there is software which can imitate an entire computer, which means one computer can perform the functions of 20 computers.

1.6 Classification of Cloud Variants:

i. Service Model Based
ii. Deployment Model Based

1.6.1 Service Model Based / Service Models / Types of Models

Cloud computing services are divided into three classes, according to the abstraction level of the capability provided and the service model of providers, namely:

1. Infrastructure as a Service (IaaS)
2. Platform as a Service (PaaS) and
3. Software as a Service (SaaS)

These abstraction levels can also be viewed as a layered architecture where services of a higher layer can be composed from services of the underlying layer. The reference model explains the role of each layer in an integrated architecture. A core middleware manages physical resources and the VMs deployed on top of them; in addition, it provides the required features (e.g., accounting and billing) to offer multi-tenant pay-as-you-go services.
2) Server Virtualization

K.NIKHILA Page 12 K.NIKHILA Page 13


UNIT -1 UNIT -1

Cloud development environments are built on top of infrastructure services to offer application PLATFORM AS A SERVICE
development and deployment capabilities; in this level, various programming models, libraries,
APIs, and mashup editors enable the creation of a range of business, Web, and scientific In addition to infrastructure-oriented clouds that provide raw computing and storage services,
applications. Once deployed in the cloud, these applications can be consumed by end users. another approach is to offer a higher level of abstraction to make a cloud easily programmable,
known as Platform as a Service (PaaS).
INFRASTRUCTURE AS A SERVICE
A cloud platform offers an environment on which developers create and deploy
Offering virtualized resources (computation, storage, and communication) on demand is known applications and do not necessarily need to know how many processors or how much memory
as Infrastructure as a Service (IaaS). that applications will be using. In addition, multiple programming models and specialized
services (e.g., data access, authentication, and payments) are offered as building blocks to new
applications.
Google App Engine, an example of Platform as a Service, offers a scalable environment for
developing and hosting Web applications, which should be written in specific programming
languages such as Python or Java, and use the services‘ own proprietary structured object data
store. Building blocks include an in-memory object cache (mem cache), mail service, instant
messaging service (XMPP), an image manipulation service, and integration with Google
Accounts authentication service. Software as a Service Applications reside on the top of the
cloud stack. Services provided by this layer can be accessed by end users through Web portals.
Therefore, consumers are increasingly shifting from locally installed computer programs to on-
line software services that offer the same functionally. Traditional desktop applications such as
word processing and spreadsheet can now be accessed as a service in the Web. This model of
delivering applications, known as Software as a Service (F), alleviates the burden of software
maintenance for customers and simplifies development and testing for providers.
Salesforce.com, which relies on the SaaS model, offers business productivity applications
(CRM) that reside completely on their servers, allowing customers to customize and access
applications on demand.

FIGURE 1.3. The cloud computing stack.

A cloud infrastructure enables on-demand provisioning of servers running several choices of


operating systems and a customized software stack. Infrastructure services are considered to be
the bottom layer of cloud computing systems.
 Amazon Web Services mainly offers IaaS, which in the case of its EC2 service means
offering VMs with a software stack that can be customized similar to how an ordinary
physical server would be customized.

 Users are given privileges to perform numerous activities to the server, such as: starting
and stopping it, customizing it by installing software packages, attaching virtual disks to
it, and configuring access permissions and firewalls rules.

K.NIKHILA Page 14 K.NIKHILA Page 15


UNIT -1 UNIT -1

availability zones in the same region. Regions, in turn, ―are geographically dispersed and will
be in separate geographic areas or countries.

User Interfaces And Access To Servers:

A public IaaS provider must provide multiple access means to its cloud, thus catering for
various users and their preferences. Different types of user interfaces (UI) provide different
levels of abstraction, the most common being graphical user interfaces (GUI), command-line
tools (CLI), and Web service (WS) APIs. GUIs are preferred by end users who need to launch,
customize, and monitor a few virtual servers and do not necessary need to repeat the process
several times.

Advance Reservation Of Capacity:

Advance reservations allow users to request for an IaaS provider to reserve resources for a
specific time frame in the future, thus ensuring that cloud resources will be available at that
time. Amazon Reserved Instances is a form of advance reservation of capacity, allowing users
to pay a fixed amount of money in advance to guarantee resource availability at anytime during
c. Infrastructure as a Service (IaaS) or Hardware as a Service (HaaS)
an agreed period and then paying a discounted hourly rate when resources are in use.
Automatic Scaling And Load Balancing:
INFRASTRUCTURE AS A SERVICE PROVIDERS
It allow users to set conditions for when they want their applications to scale up and down,
Public Infrastructure as a Service providers commonly offer virtual servers containing
based on application specific metrics such as transactions per second, number of simultaneous
one or more CPUs, running several choices of operating systems and a customized software
stack. users, request latency, and so forth. When the number of virtual servers is increased by
automatic scaling, incoming traffic must be automatically distributed among the available
FEATURES servers. This activity enables applications to promptly respond to traffic increase while also
The most relevant features are: achieving greater fault tolerance.
i. Geographic distribution of data centers;
ii. Variety of user interfaces and APIs to access the system; Service-Level Agreement:

a. Specialized components and services that aid particular applications (e.g., Service-level agreements (SLAs) are offered by IaaS providers to express their commitment to
load- balancers, firewalls); delivery of a certain QoS. To customers it serves as a warranty. An SLA usually include
b. Choice of virtualization platform and operating systems; and availability and performance guarantees. HYPERVISOR AND OPERATING SYSTEM
c. Different billing methods and period (e.g., prepaid vs. postpaid, hourly vs. CHOICE: IaaS offerings have been based on heavily customized open-source Xen
monthly). deployments. IaaS providers needed expertise in Linux, networking, virtualization, metering,
resource management, and many other low-level aspects to successfully deploy and maintain
Geographic Presence: their cloud offerings.

Availability zones are ―distinct locations that are engineered to be insulated from failures in
other availability zones and provide inexpensive, low-latency network connectivity to other

K.NIKHILA Page 16 K.NIKHILA Page 17


UNIT -1 UNIT -1

PaaS Providers include only public clouds, only private clouds or a combination of both public and private
clouds.
Public Platform as a Service providers commonly offer a development and deployment
environment that allow users to create and run their applications with little or no concern to d. Community Cloud: Here, computing resources are provided for a community and
low-level details of the platform. organizations.

FEATURES 1.7 Infrastructure of Cloud Computing


 Cloud infrastructure means the hardware and software components.
Programming Models, Languages, and Frameworks. Programming models made available by
 These components are server, storage, and networking and virtualization software.
IaaS providers define how users can express their applications using higher levels of abstraction
 These components are required to support the computing requirements of a cloud computing
and efficiently run them on the cloud platform and recover it in case of crashes, as well as to
model.
store user data.

Persistence Options. A persistence layer is essential to allow applications to record their state Components of Cloud infrastructure
and recover it in case of crashes, as well as to store user data.

1.6.2 Deployment Model Based/Types of CC / Cloud Delivery Models:

Cloud computing can be divided into several sub-categories depending on the physical location
of the computing resources and who can access those resources.

a. Public cloud vendors offer their computing services to anyone in the general public. They
maintain large data centers full of computing hardware, and their customers share access to that
hardware.
a) Hypervisor
b. Private cloud is a cloud environment set aside for the exclusive use of one organization. Some  Hypervisor is a firmware or low-level program. It acts as a Virtual Machine Manager.
large enterprises choose to keep some data and applications in a private cloud for security  It enables to share a physical instance of cloud resources between several customers.
reasons, and some are required to use private clouds in order to comply with various regulations. b) Management Software
 Management software assists to maintain and configure the infrastructure.
Organizations have two different options for the location of a private cloud: they can set up a
c) Deployment Software
private cloud in their own data centers or they can use a hosted private cloud service. With a  Deployment software assists to deploy and integrate the application on the cloud.
hosted private cloud, a public cloud vendor agrees to set aside certain computing resources and
d) Network
allow only one customer to use those resources.  Network is the key component of the cloud infrastructure.
 It enables to connect cloud services over the Internet.
c. Hybrid cloud is a combination of both a public and private cloud with some level of
 The customer can customize the network route and protocol i.e possible to deliver network as a
integration between the two. For example, in a practice called "cloud bursting" a company may
utility over the Internet.
run Web servers in its own private cloud most of the time and use a public cloud service for
e) Server
additional capacity during times of peak use.
 The server assists to compute the resource sharing and offers other services like resource
allocation and de-allocation, monitoring the resources, provides the security etc.
A multi-cloud environment is similar to a hybrid cloud because the customer is using more than
one cloud service. However, a multi-cloud environment does not necessarily have integration 6) Storage
among the various cloud services, the way a hybrid cloud does. A multi-cloud environment can

K.NIKHILA Page 18 K.NIKHILA Page 19


UNIT -1 UNIT -1

 Cloud keeps many copies of storage. Using these copies of resources, it extracts another  Grid computing enables collaboration between enterprises to carry out distributed
resource if any one of the resources fails. computing jobs using interconnected computers spread across multiple locations running
Intranets and the Cloud: Intranets are customarily used within an organization and are not independently
accessible publicly. That is, a web server is maintained in-house and company information is  Utility computing provides web services such as computing, storage space, and
maintained on it that others within the organization can access. However, now intranets are being applications to users at a low cost through the virtualization of several backend servers.
maintained on the cloud. Utility computing has laid the foundation for today’s cloud computing
 To access the company’s private, in-house information, users have to log on to the  Distributed computing landscape connects ubiquitous networks and connected devices
intranet by going to a secure public web site. enabling peer-to-peer computing. Examples of such cloud infrastructure are ATMs, and
intranets/ workgroups
 There are two main components in client/server computing: servers and thin or light
clients. Grid Computing Vs Cloud Computing
 The servers house the applications your organization needs to run, and the thin
When we switch on the fan or any electric device, we are less concern about the power supply
Clients—who do not have hard drives—display the results.
from where it comes and how it is generated. The power supply or electricity that we receives at
Hypervisor Applications
our home travels through a chain of network, which includes power stations, transformers, power
 Applications like VMware or Microsoft’s Hyper-V allow you to virtualize your servers
lines and transmission stations. These components together make a ‘Power Grid’. Likewise,
so
‘Grid Computing’ is an infrastructure that links computing resources such as PCs, servers,
that multiple virtual servers can run on one physical server.
 These sorts of solutions provide the tools to supply a virtualized set of hardware to the workstations and storage elements and provides the mechanism required to access them.
guest operating system. They also make it possible to install different operating systems
on the same machine. For example, you may need Windows Vista to run one application, Grid Computing is a middle ware to co-ordinate disparate IT resources across a network,
while another application requires Linux. It’s easy to set up the server to run both allowing them to function as whole. It is more often used in scientific research and in universities
operating systems. for educational purpose. For example, a group of architect students working on a different
 Thin clients use an application program to communicate with an application server. project requires a specific designing tool and a software for designing purpose but only couple of
Most of the processing is done down on the server, and sent back to the client. them got access to this designing tool, the problem is how they can make this tool available to
There is some debate about where to draw the line when talking about thin clients. rest of the students. To make available for other students they will put this designing tool on
Some thin clients require an application program or a web browser to communicate with campus network, now the grid will connect all these computers in campus network and allow
the server. However, others require no add-on applications at all. This is sort of a discussion of student to use designing tool required for their project from anywhere. Cloud computing and
semantics, because the real issue is whether the work is being done on the server and transmitted Grid computing is often confused, though there functions are almost similar there approach for
back to the thin client. their functionality is different. Let see how they operate-
1.8. Cloud computing techniques
Cloud Computing Grid Computing
Some traditional computing techniques that have helped enterprises achieve additional
computing and storage capabilities, while meeting customer demands using shared  Cloud computing works more as a  Grid computing uses the available
physical resources, are: service provider for utilizing computer resource and interconnected computer
resource systems to accomplish a common goal
 Cluster computing connects different computers in a single location via LAN to work as
a single computer. Improves the combined performance of the organization which owns  Grid computing is a decentralized
it  Cloud computing is a centralized model model, where the computation could
occur over many administrative model

K.NIKHILA Page 20 K.NIKHILA Page 21


UNIT -1 UNIT -1

 A grid is a collection of computers be up and running in days or weeks.


 Cloud is a collection of computers which is owned by a multiple parties in
usually owned by a single party. multiple locations and connected  Utility computing users want to be in  In cloud computing, provider is in
 together so that users can share the control of the geographical location of complete control of cloud computing
combined power of resources the infrastructure services and infrastructure

 Cloud offers more services all most all  Utility computing is more favorable  Cloud computing is great and easy to
the services like web hosting, DB (Data  Grid provides limited services when performance and selection use when the selection infrastructure
Base) support and much more infrastructure is critical and performance is not critical

 Cloud computing is typically provided  Utility computing is a good choice for  Cloud computing is a good choice for
 Grid computing federates the resources
within a single organization (eg : less resource demanding high resource demanding
located within different organization.
Amazon)
 Utility computing refers to a business  Cloud computing refers to the
model underlying IT architecture
Utility Computing Vs Cloud Computing

In our previous conversation in “Grid Computing” we have seen how electricity is supplied to
1.9 Security concerns for Cloud Computing
our house, also we do know that to keep electricity supply we have to pay the bill. Utility
Computing is just like that, we use electricity at home as per our requirement and pay the bill While using cloud computing, the major issue that concerns the users is about its security.
accordingly likewise you will use the services for the computing and pay as per the use this is One concern is that cloud providers themselves may have access to customer’s. unencrypted
known as ‘Utility computing’. Utility computing is a good source for small scale usage, it can be
done in any server environment and requires Cloud Computing. data- whether it’s on disk, in memory or transmitted over the network. Some countries
government may decide to search through data without necessarily notifying the data owner,
Utility computing is the process of providing service through an on-demand, pay per use billing depending on where the data resides, which is not appreciated and is considered as a privacy
method. The customer or client has access to a virtually unlimited supply of computing solutions breach (Example Prism Program by USA).
over a virtual private network or over the internet, which can be sourced and used whenever it’s To provide security for systems, networks and data cloud computing service providers have
required. Based on the concept of utility computing , grid computing, cloud computing and joined hands with TCG (Trusted Computing Group) which is non-profit organization which
managed IT services are based. regularly releases a set of specifications to secure hardware, create self-encrypting drives and
improve network security. It protects the data from root kits and malware.
Through utility computing small businesses with limited budget can easily use software like As computing has expanded to different devices like hard disk drives and mobile phones, TCG
CRM (Customer Relationship Management) without investing heavily on infrastructure to has extended the security measures to include these devices. It provides ability to create a unified
maintain their clientele base. data protection policy across all clouds.

Utility Computing Cloud Computing Some of the trusted cloud services are Amazon, Box.net, Gmail and many others.
 Utility computing refers to the ability to  Cloud Computing also works like utility
charge the offered services, and charge computing, you pay only for what you 1.10 Privacy Concern & Cloud Computing
customers for exact usage use but Cloud Computing might be
Privacy presents a strong barrier for users to adapt into Cloud Computing systems
cheaper, as such, Cloud based app can

K.NIKHILA Page 22 K.NIKHILA Page 23


UNIT -1 UNIT -1

There are certain measures which can improve privacy in cloud computing.

1. The administrative staff of the cloud computing service could theoretically monitor the
data moving in memory before it is stored in disk. To keep the confidentiality of a data,
administrative and legal controls should prevent this from happening.
2. The other way for increasing the privacy is to keep the data encrypted at the cloud storage
site, preventing unauthorized access through the internet; even cloud vendor can’t access
the data either.
ii) Full Virtualization
 Full virtualization is a technique in which a complete installation of one machine is run
on another. The result is a system in which all software running on the server is within a
virtual machine.
 In a fully virtualized deployment, the software running on the server is displayed on the
clients.
 Virtualization is relevant to cloud computing because it is one of the ways in which you
will access services on the cloud. That is, the remote datacenter may be delivering your
services in a fully virtualized format.
 In order for full virtualization to be possible, it was necessary for specific hardware
combinations to be used. It wasn’t until 2005 that the introduction of the AMD- CHALLENGES AND RISKS
Virtualization(AMD-V) and Intel Virtualization Technology (IVT) extensions made it easier to
go fully virtualized. Despite the initial success and popularity of the cloud computing paradigm and the extensive
availability of providers and tools, a significant number of challenges and risks are inherent to
Full virtualization has been successful for several purposes: this new model of computing. Providers, developers, and end users must consider these
i) Sharing a computer system among multiple users challenges and risks to take good advantage of cloud computing. Issues to be faced include user
ii) Isolating users from each other and from the control program privacy, data security, data lock- in, availability of service, disaster recovery, performance,
iii) Emulating hardware on another machine scalability, energy- efficiency, and programmability. Security, Privacy, and Trust: Security and
iii) Para virtualization privacy affect the entire cloud computing stack, since there is a massive use of third- party
Para virtualization allows multiple operating systems to run on a single hardware device at services and infrastructures that are used to host important data or to perform critical operations.
the same time by more efficiently using system resources, like processors and memory. In this scenario, the trust toward providers is fundament al to ensure the desired level of privacy
In full virtualization, the entire system is emulated (BIOS, drive, and so on), but in for applications hosted in the cloud. | 62 Legal and regulatory issues also need attention. When
para virtualization, its management module operates with an operating system that has data are moved into the Cloud, providers may choose to locate them anywhere on the planet. The
been adjusted to work in a virtual machine. Para virtualization typically runs better than the full physical location of data centers determines the set of laws that can be applied to the
virtualization model, simply because in a fully virtualized deployment, all elements management of data. For example, specific cryptography techniques could not be used because
must be emulated. they are not allowed in some countries. Similarly, country laws can impose that sensitive data,
such as patient health records, are to be stored within national borders. Data Lock- In and
Standardization: A major concern of cloud computing users is about having their data locked- in
by a certain provider. Users may want to move data and applications out from a provider that
does not meet their requirements. However, in their current form, cloud computing
infrastructures and platforms do not employ standard methods of storing user data and
applications. Consequently, they do not interoperate and user data are not portable. The answer

K.NIKHILA Page 24 K.NIKHILA Page 25


UNIT -1 UNIT -1

to this concern is standardization. In this direction, there are efforts to create open standard s for cooling system, thus costing USD 2.6 million per year. Besides the monetary cost, data centers
cloud computing. The Cloud Computing Interopera bility Forum (CCIF) was formed by significantly impact the environment in terms of CO2 emissions from the cooling systems
organizations such as Intel, Sun, and Cisco in order to “enable a global cloud computing
ecosystem whereby organizations are able to seamlessly work together for the purposes for wider Issues in cloud:
industry adoption of cloud computing technology.” The development of the Unified Cloud
Interface (UCI) by CCIF aims at creating a standard programmatic point of access to an entire The Eucalyptus : framework was one of the first open- source projects to focus on building IaaS
cloud infrastructure. In the hardware virtualization sphere, the Open Virtual Format (OVF) aims clouds. It has been developed with the intent of providing an open- source implement a tion
at facilitating packing and distribution of software to be run on VMs so that virtual appliances nearly identical in functionality to Amazon Web Services APIs. Eucalyptus provides the
can be made portable—that is, seamlessly run on hypervisor of different vendors. Availability, following features: Linux- based controller with administration Web portal; EC2- compatible
Fault- Tolerance, and Disaster Recovery: It is expected that users will have certain expectations (SOAP, Query) and S3- compatible (SOAP, REST) CLI and Web portal interfaces; Xen, KVM,
about the service level to be provided once their applications are moved to the cloud. These and VMWare backends; Amazon EBS- compatible virtual storage devices; interface to the
expectations include availability of the service, its overall performance, and what measures are to Amazon EC2 public cloud; virtual networks.
be taken when something goes wrong in the system or its component s. In summary, users seek
for a warranty before they can comfortably move their business to the cloud. SLAs, which Nimbus3: The Nimbus toolkit is built on top of the Globus framework. Nimbus provides most
include QoS requirements, must be ideally set up between customers and cloud computing features in common with other open- source VI managers, such as an EC2- compatible front- end
providers to act as warranty. An SLA specifies the details of the service to be provided, including API, support to Xen, and a backend interface to Amazon EC2. However, it distinguishes from
availability and performance guarantees. Additionally, metrics must be agreed upon by all others by providing a Globus Web Services Resource Framework (WSRF) interface. It also
parties, and penalties for violating the expectations must also be approved. Resource provides a backend service, named Pilot, which spawns VMs on clusters manage d by a local
Management and Energy- Efficiency: One important challenge faced by providers of cloud resource manager (LRM) such as PBS and SGE.
computing services is the efficient manage m e n t of virtualized resource pools. Physical
resources such as CPU cores, disk space, and network bandwidth must be sliced and shared Open Nebula: Open Nebula is one of the most feature- rich open- source VI managers. It was
among virtual machines running potentially heterogeneous workloads. The multidimensional initially conceived to manage local virtual infrastructure, but has also included remote interfaces
nature of virtual machines complicates the activity of finding a good mapping of VMs onto that make it viable to build public clouds. Altogether r, four programming APIs are available:
available physical hosts while maximizing user | 63 utility. Dimensions to be considered include: XML-RPC and libvirt for local interaction; a subset of EC2 (Query) APIs and the Open Nebula
number of CPUs, amount of memory, size of virtual disks, and network bandwidth. Dynamic Cloud API (OCA) for public access. Open Nebula provides the following features: Linux- based
VM mapping policies may leverage the ability to suspend, migrate, and resume VMs as an easy controller; CLI, XML-RPC, EC2- compatible Query and OCA interfaces; Xen, KVM, and
way of preempting low- priority allocations in favor of higher- priority ones. Migration of VMs VMware backend; interface to public clouds (Amazon EC2, Elastic Hosts); virtual networks;
also brings additional challenges such as detecting when to initiate a migration, which VM to dynamic resource allocation; advance reservation of capacity.
migrate, and where to migrate. In addition, policies may take advantage of live migration of
CASE STUDY
virtual machines to relocate data center load without significantly disrupting running services. In
this case, an additional concern is the tradeoff between the negative impact of a live migration on
The Eucalyptus :
the performance and stability of a service and the benefits to be achieved with that migration.
Another challenge concerns the outstanding amount of data to be managed in various VM framework was one of the first open- source projects to focus on building IaaS clouds. It has
manage m e n t activities. Such data amount is a result of particular abilities of virtual machines, been developed with the intent of providing an open- source implement a tion nearly identical in
including the ability of traveling through space (i.e., migration) and time (i.e., check pointing and functionality to Amazon Web Services APIs. Eucalyptus provides the following features: Linux-
rewinding), operations that may be required in load balancing, backup, and recovery scenarios. based controller with administration Web portal; EC2- compatible (SOAP, Query) and S3-
In addition, dynamic provisioning of new VMs and replicating existing VMs require efficient compatible (SOAP, REST) CLI and Web portal interfaces; Xen, KVM, and VMWare backends;
mechanisms to make VM block storage devices (e.g., image files) quickly available at selected Amazon EBS- compatible virtual storage devices; interface to the Amazon EC2 public cloud;
hosts. Data centers consume r large amounts of electricity. According to a data published by virtual networks.
HP[4], 100 server racks can consume 1.3MWof power and another 1.3 MW are required by the

K.NIKHILA Page 26 K.NIKHILA Page 27


UNIT -1 UNIT -1

Nimbus3 : The Nimbus toolkit is built on top of the Globus framework. Nimbus provides most Looking to the success of Cloud Computing in e-mail services and communication .The second
features in common with other open- source VI manage rs, such as an EC2- compatible front- strategic move of Royal Mail Group, was to migrating from physical servers to virtual servers,
end API, support to Xen, and a backend interface to Amazon EC2. However, it distinguishes up to 400 servers to create a private cloud based on Microsoft hyper V. This would give a fresh
from others by providing a Globus Web Services Resource Framework (WSRF) interface. It also look and additional space to their employees desktop and also provides latest modern exchange
provides a backend service, named Pilot, which spawns VMs on clusters manage d by a local environment.
resource manager (LRM) such as PBS and SGE.
The hyper V project by RMG’s (Royal Mail Group) is estimated to save around 1.8 million
Open Nebula: pound for them in future and will increase the efficiency of the organization’s internal IT system.

Open Nebula is one of the most feature- rich open- source VI managers. It was initially Case study -2
conceived to manage local virtual infrastructure, but has also included remote interfaces that XYZ is a startup IT organization that develops and sells s/w the org gets a new website
make it viable to build public clouds. Altogether, four programming APIs are available: XML- development project that needs a web server, application server and a database server. The
RPC and libvirt for local interaction; a subset of EC2 (Query) APIs and the OpenNebula Cloud org has hired 30 employees for this web development project.
API (OCA) for public access. OpenNebula provides the following features: Linux- based Constraints :
controller; CLI, XML-RPC, EC2- compatible Query and OCA interfaces; Xen, KVM, and Acquiring renting space for new servers
VMware backend; interface to public clouds (Amazon EC2, ElasticHosts); virtual networks; Buying new high end servers
dynamic resource allocation; advance reservation of capacity. Hiring new IT staff for infrastructure management
Buying licensed OS and other s/w required for development
i) Case-Study of Cloud Computing- Royal Mail Solution :Public cloud IaaS
Team leader :
 Subject of Case-Study: Using Cloud Computing for effective communication among
staff. 1. Creates an ac
 Reason for using Cloud Computing: Reducing the cost made after communication for 2. Choose an VM image from image repository or create a new image
28,000 employees and to provide advance features and interface of e-mail services to 3. Specify no.of VM’s
their employees. 4. Choose VM type
5. Set necessary configurations for VM
Royal mail group, a postal service in U.K, is the only government organization in U.K that 6. After VM launched ,provide IP address of VM to prog team
serves over 24 million customers through its 12000 post offices and 3000 separate processing 7. Access VM and start development
sites. Its logistics systems and parcel-force worldwide handles around 404 million parcel a year.
Case study -2
And to do this they need an effective communicative medium. They have recognized the
advantage of Cloud Computing and implemented it to their system. It has shown an outstanding Case study -3
performance in inter-communication.
XYZ firm gets more revenue ,grows and hence buys some IT infrastructuire.However it
Before moving on to Cloud system, the organization was struggling with the out-of-date continues to use public IaaS cloud for its development work
software, and due to which the operational efficiency was getting compromised. As soon as the
organization switched on to Cloud System, 28000 employees were supplied with their new Now the firm gets a new project that involves sensitive data that restricts the firm to use a
collaboration suite, giving them access to tools such as instant messaging and presence public cloud .hence this org is in need of setting up the required infrastructure in its own
awareness. The employees got more storage place than on local server. The employees became premise.
much more productive. Constraints:

Infrastructure cost

K.NIKHILA Page 28 K.NIKHILA Page 29


UNIT -1 UNIT -1

Infrastructure optimization AWS is Amazon's cloud web hosting platform which offers fast, flexible, reliable and cost-
effective solutions. It offers a service in the form of building block which can be used to create
Power consumption and deploy any kind of application in the cloud. It is the most popular as it was the first to enter
Data center management the cloud computing space.
Features:
Additional expenditure on infrastructure operation with lesser productivity  Easy sign-up process
 Fast Deployments
Solution : Private IaaS cloud
 Allows easy management of add or remove capacity
Moving to private cloud is :  Access to effectively limitless capacity
 Centralized Billing and management
Moving to private cloud  Offers Hybrid Capabilities and per hour billing
Download link:https://aws.amazon.com/
IT managed  self-service

Physical  virtual 2) Microsoft Azure

Manual management  automated management

Dedicated  shared

Explanation:

1.Setup cloud infrastructure

2. Setup self-service portal or dashboard


Azure is a cloud computing platform which is launched by Microsoft in February 2010. This
3. Test the cloud environment through self-service open source and flexible cloud platform which helps in development, data storage, service
4. Get VM’s management & hosting solutions.
Features:
5. Use VM’s to develop and test applications  Windows Azure offers the most effective solution for your data needs
 Provides scalability, flexibility, and cost-effectiveness
6. Manage cloud environment  Offers consistency across clouds with familiar tools and resources
 Allow you to scale your IT resources up and down according to your business needs
Cloud Computing Service Provider Companies in 2019
Download link:https://azure.microsoft.com/en-in/
1) Amazon Web Services

K.NIKHILA Page 30 K.NIKHILA Page 31


UNIT -1 UNIT -1

3) Google Cloud Platform
Google Cloud is a set of solutions and products which includes GCP & G Suite. It helps you to solve all kinds of business challenges with ease.
Features:
 Allows you to scale with open, flexible technology
 Solve issues with accessible AI & data analytics
 Eliminates the need for installing costly servers
 Allows you to transform your business with a full suite of cloud-based services
Download link: https://cloud.google.com/

4) VMware
VMware is a comprehensive cloud management platform. It helps you to manage a hybrid environment running anything from traditional to container workloads. The tools also allow you to maximize the profits of your organization.
Features:
 Enterprise-ready Hybrid Cloud Management Platform
 Offers private & public clouds
 Comprehensive reporting and analytics which improve the capacity for forecasting & planning
 Offers additional integrations with 3rd parties, custom applications, and tools
 Provides flexible, agile services
Download link: https://www.vmware.com/in/cloud-services/infrastructure.html

Oracle Cloud
Oracle Cloud offers innovative and integrated cloud services. It helps you to build, deploy, and manage workloads in the cloud or on premises. Oracle Cloud also helps companies to transform their business and reduce complexity.
Features:
 Oracle offers more options for where and how you make your journey to the cloud
 Oracle helps you realize the importance of modern technologies including artificial intelligence, chatbots, machine learning, and more
 Offers next-generation mission-critical data management in the cloud
 Oracle provides better visibility into unsanctioned apps and protects against sophisticated cyber attacks
Download link: https://www.oracle.com/cloud/

5) IBM Cloud
IBM Cloud is a full-stack cloud platform which spans public, private and hybrid environments. It is built with a robust suite of advanced and AI tools.
Features:
 IBM Cloud offers infrastructure as a service (IaaS), software as a service (SaaS) and platform as a service (PaaS)
 IBM Cloud is used to build pioneering solutions which help you gain value for your business
 It offers high-performing cloud communications and services for your IT environment
Download link: https://www.ibm.com/cloud/

Tips for selecting a Cloud Service Provider
There is no single "best" cloud service; you need to choose the cloud service that is "best" for your project. The following checklist will help:
 Is your desired region supported?
 Cost for the service and your budget


 For an outsourcing company, customer/client preference of service provider needs to be factored in
 Cost involved in training employees on the Cloud Service Platform
 Customer support
 The provider should have a successful track record of stability/uptime/reliability
 Reviews of the company

Here is a list of the Top 21 Cloud Service Providers for quick reference:
 Amazon Web Services
 Alibaba Cloud
 Microsoft Azure
 Google Cloud Platform
 VMware
 Rackspace
 Salesforce
 Oracle Cloud
 Verizon Cloud
 Navisite
 IBM Cloud
 OpenNebula
 Pivotal
 DigitalOcean
 CloudSigma
 Dell Cloud
 LiquidWeb
 LimeStone
 MassiveGrid
 Quadranet
 Kamatera

Eucalyptus
• Eucalyptus is an acronym for Elastic Utility Computing Architecture for Linking Your Programs To Useful Systems.
• Eucalyptus is a paid and open-source computer software for building Amazon Web Services (AWS)-compatible private and hybrid cloud computing environments, originally developed by the company Eucalyptus Systems.
• Eucalyptus enables pooling compute, storage, and network resources that can be dynamically scaled up or down as application workloads change.

Eucalyptus has six components:
1. The Cloud Controller (CLC) is a Java program that offers EC2-compatible interfaces, as well as a web interface to the outside world.
• In addition to handling incoming requests, the CLC acts as the administrative interface for cloud management and performs high-level resource scheduling and system accounting.
• The CLC accepts user API requests from command-line interfaces like euca2ools or GUI-based tools like the Eucalyptus User Console and manages the underlying compute, storage, and network resources.
• Only one CLC can exist per cloud and it handles authentication, accounting, reporting, and quota management.
2. Walrus, also written in Java, is the Eucalyptus equivalent to AWS Simple Storage Service (S3).


• Walrus offers persistent storage to all of the virtual machines in the Eucalyptus cloud and can be used as a simple HTTP put/get storage-as-a-service solution.
• There are no data type restrictions for Walrus, and it can contain images (i.e., the building blocks used to launch virtual machines), volume snapshots (i.e., point-in-time copies), and application data. Only one Walrus can exist per cloud.
3. The Cluster Controller (CC) is written in C and acts as the front end for a cluster within a Eucalyptus cloud and communicates with the Storage Controller and Node Controller.
• It manages instance (i.e., virtual machine) execution and Service Level Agreements (SLAs) per cluster.
4. The Storage Controller (SC) is written in Java and is the Eucalyptus equivalent to AWS EBS. It communicates with the Cluster Controller and Node Controller and manages Eucalyptus block volumes and snapshots to the instances within its specific cluster.
• If an instance requires writing persistent data to memory outside of the cluster, it would need to write to Walrus, which is available to any instance in any cluster.
5. The Node Controller (NC) is written in C and hosts the virtual machine instances and manages the virtual network endpoints.
• It downloads and caches images from Walrus as well as creates and caches instances.
• While there is no theoretical limit to the number of Node Controllers per cluster, performance limits do exist.
6. The VMware Broker is an optional component that provides an AWS-compatible interface for VMware environments and physically runs on the Cluster Controller.
• The VMware Broker overlays existing ESX/ESXi hosts and transforms Eucalyptus Machine Images (EMIs) to VMware virtual disks.
• The VMware Broker mediates interactions between the Cluster Controller and VMware and can connect directly to either ESX/ESXi hosts or to vCenter Server.

Nimbus
• Nimbus is a set of open source tools that together provide an "Infrastructure-as-a-Service" (IaaS) cloud computing solution.
• Its mission is to evolve the infrastructure with emphasis on the needs of science, but many non-scientific use cases are supported as well.
• Nimbus allows a client to lease remote resources by deploying virtual machines (VMs) on those resources and configuring them to represent an environment desired by the user.
• It was formerly known as the "Virtual Workspace Service" (VWS), but the "workspace service" is technically just one of the components in the software.
• Nimbus is a toolkit that, once installed on a cluster, provides an infrastructure-as-a-service cloud to its clients via WSRF-based or Amazon EC2 WSDL web service APIs.
• Nimbus is free and open-source software, subject to the requirements of the Apache License, version 2.
• Nimbus supports both the Xen and KVM hypervisors and the Portable Batch System and Oracle Grid Engine virtual machine schedulers.
• It allows deployment of self-configured virtual clusters via contextualization.
• It is configurable with respect to scheduling, networking leases, and usage accounting.
• Nimbus is comprised of two products:
Nimbus Infrastructure
Nimbus Platform
• Nimbus Infrastructure is an open source EC2/S3-compatible Infrastructure-as-a-Service implementation specifically targeting features of interest to the scientific community, such as support for proxy credentials, batch schedulers, best-effort allocations and others.
• Nimbus Platform is an integrated set of tools, operating in a multi-cloud environment, that deliver the power and versatility of infrastructure clouds to scientific users. Nimbus Platform allows you to reliably deploy, scale, and manage cloud resources.

System Architecture & Design

• The design of Nimbus consists of a number of components based on web service technology:
1. Workspace service
• Allows clients to manage and administer VMs by providing two interfaces:
• A) One interface is based on the Web Services Resource Framework (WSRF)
• B) The other is based on EC2 WSDL
2. Workspace resource manager
• Implements VM instance creation on a site and its management.
3. Workspace pilot
• Provides virtualization without significant changes to the site configurations.
4. Workspace control
• Implements VM instance management such as start, stop and pause VM. It also provides image management, sets up networks and provides IP assignment.
5. Context broker
• Allows clients to coordinate large virtual cluster launches automatically and repeatedly.
6. Workspace client
• A complex client that provides full access to the workspace service functionality.
7. Cloud client
• A simpler client providing access to selected functionalities in the workspace service.
8. Storage service
• Cumulus is a web service providing users with storage capabilities to store images and works in conjunction with GridFTP.

Open Nebula
• OpenNebula is an open source cloud computing platform for managing heterogeneous distributed data centre infrastructures.
• It manages a data centre's virtual infrastructure to build private, public and hybrid implementations of IaaS.
• Two primary uses of the OpenNebula platform are:
1. Data center virtualization
• Many of our users use OpenNebula to manage data center virtualization, consolidate servers, and integrate existing IT assets for computing, storage, and networking.
• In this deployment model, OpenNebula directly integrates with hypervisors (like KVM, Xen or VMware ESX) and has complete control over virtual and physical resources, providing advanced features for capacity management, resource optimization, high availability and business continuity.
• Some of these users also enjoy OpenNebula's cloud management and provisioning features when they additionally want to federate data centers, implement cloud bursting, or offer self-service portals for users.
2. Cloud infrastructure solutions
• We also have users that use OpenNebula to provide a multitenant, cloud-like provisioning layer on top of an existing infrastructure management solution (like VMware vCenter).
• These users are looking for provisioning, elasticity and multi-tenancy cloud features like virtual data centers provisioning, datacenter federation or hybrid cloud computing to
connect in-house infrastructures with public clouds, while the infrastructure is managed by already familiar tools for infrastructure management and operation.

Image Repository: Any storage medium for the VM images (usually a high-performing SAN).
Cluster Storage: OpenNebula supports multiple back-ends (e.g. LVM for fast cloning).
VM Directory: The home of the VM in the cluster node.
 Stores checkpoints, description files and VM disks
 Actual operations over the VM directory depend on the storage medium
 Should be shared for live-migrations
 You can go on without a shared FS and use the SSH back-end
Master node: A single gateway or front-end machine, sometimes also called the master node, is responsible for queuing, scheduling and submitting jobs to the machines in the cluster. It runs several other OpenNebula services mentioned below:
 Provides an interface to the user to submit virtual machines and monitor their status.
 Manages and monitors all virtual machines running on different nodes in the cluster.
 It hosts the virtual machine repository and also runs a transfer service to manage the transfer of virtual machine images to the concerned worker nodes.
 Provides an easy-to-use mechanism to set up virtual networks in the cloud.
 Finally, the front-end allows you to add new machines to your cluster.
Worker node: The other machines in the cluster, known as 'worker nodes', provide raw computing power for processing the jobs submitted to the cluster. The worker nodes in an
OpenNebula cluster are machines that deploy a virtualisation hypervisor, such as VMware, Xen
or KVM.
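As a rough, generic sketch of the master/worker split described above (this is not OpenNebula's actual API; the class, worker names and scheduling policy are made-up assumptions), a front-end can be thought of as a queue of VM requests that get placed onto hypervisor-running workers:

```python
# Generic sketch of a master node queuing, scheduling and placing VM requests.
# NOT OpenNebula's real interface; names and the "least-loaded" policy are illustrative only.
from collections import deque

class MasterNode:
    def __init__(self, workers):
        self.workers = workers            # e.g. ["worker-01", "worker-02"]
        self.queue = deque()              # submitted-but-not-yet-placed VM requests
        self.placements = {}              # vm_name -> worker it runs on

    def submit(self, vm_name):
        self.queue.append(vm_name)        # user-facing interface: submit a VM

    def schedule(self):
        while self.queue:
            vm = self.queue.popleft()
            worker = min(self.workers,    # naive policy: pick the least-loaded worker
                         key=lambda w: sum(1 for p in self.placements.values() if p == w))
            self.placements[vm] = worker  # a transfer service would now copy the VM image
            print(f"{vm} -> {worker}")

master = MasterNode(["worker-01", "worker-02"])
master.submit("web-vm")
master.submit("db-vm")
master.schedule()
```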

CloudSim

• CloudSim is a framework for modeling and simulation of cloud computing


infrastructures and services.

• Originally built primarily at the Cloud Computing and Distributed Systems (CLOUDS)
Laboratory, The University of Melbourne, Australia, CloudSim has become one of the
most popular open source cloud simulators in the research and academia.

• CloudSim is completely written in Java.

• By using CloudSim, developers can focus on specific systems design issues that they
want to investigate, without getting concerned about details related to cloud-based
infrastructures and services.

• CloudSim is a simulation tool that allows cloud developers to test the performance of
their provisioning policies in a repeatable and controllable environment, free of cost.

• It helps tune the bottlenecks before real-world deployment.

• It is a simulator; hence, it doesn’t run any actual software.

• It can be defined as ‘running a model of an environment in a model of hardware’, where


technology-specific details are abstracted.

• CloudSim is a library for the simulation of cloud scenarios.

• It provides essential classes for describing data centres, computational resources, virtual
machines, applications, users, and policies for the management of various parts of the
system such as scheduling and provisioning.

• It can be used as a building block for a simulated cloud environment and can add new
policies for scheduling, load balancing and new scenarios.

• It is flexible enough to be used as a library that allows you to add a desired scenario by
writing a Java program.

Features of Cloudsim
Architecture of CloudSim


• User Interface: This layer provides the interaction between user and the simulator.

• The CloudSim Core simulation engine provides support for modeling and simulation of
virtualized Cloud-based data center environments including queuing and processing of
events, creation of cloud system entities (like data center, host, virtual machines, brokers,
services, etc.) communication between components and management of the simulation
clock.

• The User Code layer exposes basic entities such as the number of machines, their specifications, etc., as well as applications, VMs, number of users, application types and scheduling policies.

• The User Code layer is a custom layer where the user writes their own code to redefine the characteristics of the simulated environment as per their new research findings.

• Network Layer: This layer of CloudSim is responsible for making communication possible between different layers. This layer also identifies how resources in the cloud environment are placed and managed.

• Cloud Resources: This layer includes different main resources like datacenters, cloud
coordinator (ensures that different resources of the cloud can work in a collaborative
way) in the cloud environment

• Cloud Services: This layer includes the different services provided to the user of cloud services. The various cloud services include Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).

1. Cloud computing services
1.1 Infrastructure as a service - IaaS
AWS supports everything you need to build and run Windows applications including Active Directory, .NET, System Center, Microsoft SQL Server, Visual Studio, and the first and only fully managed native-Windows file system available in the cloud with FSx for Windows File Server.
The AWS advantage for Windows over the next largest cloud provider:
 2x more Windows Server instances
 2x more regions with multiple availability zones
 7x fewer downtime hours in 2018*
 2x higher performance for SQL Server on Windows
 5x more services offering encryption
AWS offers the best cloud for Windows, and it is the right cloud platform for running Windows-based applications. Windows on Amazon EC2 enables you to increase or decrease capacity within minutes.
i. Broader and Deeper Functionality
ii. Greater Reliability
iii. More Security Capabilities
iv. Faster Performance
v. Lower Costs
vi. More Migration Experience
Popular AWS services for Windows workloads:
i. SQL Server on Amazon EC2
ii. Amazon Relational Database Service
iii. Amazon FSx for Windows File Server
iv. AWS Directory Service
v. AWS License Manager

Service-level agreement (SLA)
A service-level agreement (SLA) is a contract between a service provider and its internal or external customers that documents what services the provider will furnish and defines the service standards the provider is obligated to meet.
UNIT III

CLOUD STORAGE
3.1.1 Overview
The Basics
Cloud storage is nothing but storing our data with a cloud service provider rather than on a local system; as with other cloud services, we can access the data stored on the cloud via an Internet link. Cloud storage has a number of advantages over traditional data storage. If we store our data on a cloud, we can get at it from any location that has Internet access.
At the most rudimentary level, a cloud storage system just needs one data server connected to the Internet. A subscriber copies files to the server over the Internet, which then records the data. When a client wants to retrieve the data, he or she accesses the data server with a web-based interface, and the server then either sends the files back to the client or allows the client to access and manipulate the data itself.
Cloud storage systems utilize dozens or hundreds of data servers. Because servers require maintenance or repair, it is necessary to store the saved data on multiple machines, providing redundancy. Without that redundancy, cloud storage systems couldn't assure clients that they could access their information at any given time. Most systems store the same data on servers using different power supplies. That way, clients can still access their data even if a power supply fails.
b. Storage as a Service
The term Storage as a Service (another Software as a Service, or SaaS, acronym) means that a third-party provider rents space on their storage to end users who lack the budget or capital budget to pay for it on their own. It is also ideal when technical personnel are not available or have inadequate knowledge to implement and maintain that storage infrastructure. Storage service providers are nothing new, but given the complexity of current backup, replication, and disaster recovery needs, the service has become popular, especially among small and medium-sized businesses. Storage is rented from the provider using a cost-per-gigabyte-stored or cost-per-data-transferred model. The end user doesn't have to pay for infrastructure; they simply pay for how much they transfer and save on the provider's servers.
A customer uses client software to specify the backup set and then transfers data across a WAN. When data loss occurs, the customer can retrieve the lost data from the service provider.
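To make the pricing model concrete, here is a minimal sketch of a per-gigabyte bill; the rates and usage figures below are invented assumptions, not any provider's real prices:

```python
# Illustrative only: rates are made-up assumptions for the cost-per-GB model described above.
def monthly_storage_cost(gb_stored, gb_transferred,
                         rate_per_gb_stored=0.10, rate_per_gb_transferred=0.05):
    """Estimate a monthly bill: pay for what is stored plus what is transferred."""
    return gb_stored * rate_per_gb_stored + gb_transferred * rate_per_gb_transferred

# Example: 500 GB kept on the provider's servers, 120 GB moved over the WAN this month.
print(monthly_storage_cost(500, 120))   # 50.0 + 6.0 = 56.0 (currency units)
```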
c. Providers
There are hundreds of cloud storage providers on the Web, and more seem to be added each day. Not only are there general-purpose storage providers, but there are some that are very specialized in what they store.
 Google Docs allows users to upload documents, spreadsheets, and presentations to

Google's data servers. Those files can then be edited using a Google application.
 Web email providers like Gmail, Hotmail, and Yahoo! Mail store email messages on their own servers. Users can access their email from computers and other devices connected to the Internet.
 Flickr and Picasa host millions of digital photographs. Users can create their own online photo albums.
 YouTube hosts millions of user-uploaded video files.
 Hostmonster and GoDaddy store files and data for many client web sites.
 Facebook and MySpace are social networking sites and allow members to post pictures and other content. That content is stored on the company's servers.
 MediaMax and Strongspace offer storage space for any kind of digital data.
d. Security
To secure data, most systems use a combination of techniques:
i. Encryption A complex algorithm is used to encode information. To decode the encrypted files, a user needs the encryption key. While it's possible to crack encrypted information, it's very difficult and most hackers don't have access to the amount of computer processing power they would need to crack the code.
ii. Authentication processes This requires a user to create a name and password.
iii. Authorization practices The client lists the people who are authorized to access information stored on the cloud system. Many corporations have multiple levels of authorization. For example, a front-line employee might have limited access to data stored on the cloud and the head of the IT department might have complete and free access to everything.
e. Reliability
Most cloud storage providers try to address the reliability concern through redundancy, but the possibility still exists that the system could crash and leave clients with no way to access their saved data.
Advantages
 Cloud storage is becoming an increasingly attractive solution for organizations. That's because with cloud storage, data resides on the Web, located across storage systems rather than at a designated corporate hosting site. Cloud storage providers balance server loads and move data among various datacenters, ensuring that information is stored close to where it is used.
 Storing data on the cloud is advantageous, because it allows us to protect our data in case there's a disaster. We may have backup files of our critical information, but if there is a fire or a hurricane wipes out our organization, having the backups stored locally doesn't help.
 Amazon S3 is the best-known storage solution, but other vendors might be better for large enterprises. For instance, those who offer service level agreements and direct access to customer support are critical for a business moving storage to a service provider.

 A lot of companies take the "appetizer" approach, testing one or two services to see how well they mesh with their existing IT systems. It's important to make sure the services will provide what we need before we commit too much to the cloud.
3.1.2 Cloud Storage Providers
Amazon and Nirvanix are the current industry top storage providers.
a. Amazon Simple Storage Service (S3)
 The best-known cloud storage service is Amazon's Simple Storage Service (S3), which launched in 2006.
 Amazon S3 is designed to make web-scale computing easier for developers. Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the Web. It gives any developer access to the same highly scalable data storage infrastructure that Amazon uses to run its own global network of websites. The service aims to maximize benefits of scale and to pass those benefits on to developers.
Amazon S3 is intentionally built with a minimal feature set that includes the following functionality:
 Write, read, and delete objects containing from 1 byte to 5 gigabytes of data each. The number of objects that can be stored is unlimited.
 Each object is stored and retrieved via a unique developer-assigned key.
 Objects can be made private or public, and rights can be assigned to specific users.
 Uses standards-based REST and SOAP interfaces designed to work with any Internet-development toolkit.
Design Requirements
Amazon built S3 to fulfill the following design requirements:
 Scalable Amazon S3 can scale in terms of storage, request rate, and users to support an unlimited number of web-scale applications.
 Reliable Store data durably, with 99.99 percent availability. Amazon says it does not allow any downtime.
 Fast Amazon S3 was designed to be fast enough to support high-performance applications. Server-side latency must be insignificant relative to Internet latency. Any performance bottlenecks can be fixed by simply adding nodes to the system.
 Inexpensive Amazon S3 is built from inexpensive commodity hardware components. As a result, frequent node failure is the norm and must not affect the overall system. It must be hardware-agnostic, so that savings can be captured as Amazon continues to drive down infrastructure costs.
 Simple Building highly scalable, reliable, fast, and inexpensive storage is difficult. Doing so in a way that makes it easy to use for any application anywhere is more difficult. Amazon S3 must do both.
Design Principles
Amazon used the following principles of distributed system design to meet Amazon S3 requirements:
 Decentralization It uses fully decentralized techniques to remove scaling bottlenecks and single points of failure.
 Autonomy The system is designed such that individual components can make decisions
based on local information.
 Local responsibility Each individual component is responsible for achieving its
consistency; this is never the burden of its peers.
 Controlled concurrency Operations are designed such that no or limited concurrency
control is required.
 Failure toleration The system considers the failure of components to be a normal
mode of operation and continues operation with no or minimal interruption.
 Controlled parallelism Abstractions used in the system are of such granularity that
parallelism can be used to improve performance and robustness of recovery or the introduction
of new nodes. Buckets and objects are created, listed, and retrieved using either a REST-style or SOAP
 Small, well-understood building blocks Do not try to provide a single service that does everything for everyone, but instead build small components that can be used as building blocks for other services.
 Symmetry Nodes in the system are identical in terms of functionality, and require no or minimal node-specific configuration to function.
 Simplicity The system should be made as simple as possible, but no simpler.
How S3 Works
S3 stores arbitrary objects at up to 5GB in size, and each is accompanied by up to 2KB of metadata. Objects are organized by buckets. Each bucket is owned by an AWS account and the buckets are identified by a unique, user-assigned key.
Buckets and objects are created, listed, and retrieved using either a REST-style or SOAP interface. Objects can also be retrieved using the HTTP GET interface or via BitTorrent. An access control list restricts who can access the data in each bucket. Bucket names and keys are formulated so that they can be accessed using HTTP. Requests are authorized using an access control list associated with each bucket and object, for instance:
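As a rough illustration of the bucket/object/key model (not part of the original notes), the sketch below uses the boto3 Python SDK; the bucket name and key are made-up placeholders and the bucket is assumed to already exist:

```python
# Minimal sketch of storing and retrieving an S3 object by bucket + key using boto3.
# "my-example-bucket" and "reports/2019/summary.txt" are hypothetical names.
import boto3

s3 = boto3.client("s3")  # credentials are taken from the environment / AWS config

# Store an object: the bucket is the container, the key identifies the object.
s3.put_object(
    Bucket="my-example-bucket",
    Key="reports/2019/summary.txt",
    Body=b"quarterly numbers...",
)

# Retrieve the same object by bucket + key.
response = s3.get_object(Bucket="my-example-bucket", Key="reports/2019/summary.txt")
print(response["Body"].read().decode())

# Access is restricted per object/bucket; here we (re)apply the default private ACL.
s3.put_object_acl(Bucket="my-example-bucket", Key="reports/2019/summary.txt", ACL="private")
```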
Delivery Network (SDN), powered by the Nirvanix Internet Media File System (IMFS). The SDN
intelligently stores, delivers, and processes storage requests in the best network location,
providing the best user experience in the marketplace.
Benefits of CloudNAS: The benefits of cloud network attached storage (CloudNAS) include

 Cost savings of 80–90 percent over managing traditional storage solutions


 Elimination of large capital expenditures while enabling 100 percent storage utilization
 Encrypted offsite storage that integrates into existing archive and backup processes
 Built-in data disaster recovery and automated data replication on up to three
geographically dispersed storage nodes for a 100% SLA
 Immediate availability to data in seconds, versus hours or days on offline tape.
c.Google Bigtable Datastore:
 Datastore In cloud computing, it’s important to have a database that is capable of
handling numerous users on an on-demand basis. To serve that market, Google
introduced its Bigtable. Google started working on it in 2004 and finally went public with
it in April 2008. Bigtable was developed with very high speed, flexibility,
and extremely high scalability in mind. A Bigtable database can be petabytes in
size and span thousands of distributed servers. Bigtable is available to developers as
part of the Google App Engine, their cloud computing platform.
 Google describes Bigtable as a fast and extremely scalable DBMS. This allows Bigtable to scale across thousands of commodity servers that can collectively store petabytes of data. Each table in Bigtable is a multidimensional sparse map. That is, the table is made up of rows and columns, and each cell has a timestamp. Multiple versions of a cell can exist, each with a different timestamp. With this stamping, we can select certain versions of a web page, or delete cells that are older than a given date and time.
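The "multidimensional sparse map" idea can be pictured with a plain dictionary keyed by (row, column, timestamp). This is only an illustrative sketch, not the real Bigtable API:

```python
# Illustrative model of a sparse, versioned table: (row, column, timestamp) -> value.
table = {}

def put(row, column, value, timestamp):
    table[(row, column, timestamp)] = value

def latest(row, column):
    """Return the newest version of a cell, or None if the cell is absent (sparse)."""
    versions = [(ts, v) for (r, c, ts), v in table.items() if r == row and c == column]
    return max(versions)[1] if versions else None

def drop_older_than(cutoff):
    """Delete cell versions older than a given timestamp."""
    for key in [k for k in table if k[2] < cutoff]:
        del table[key]

put("com.example/index.html", "contents", "<html>v1</html>", timestamp=100)
put("com.example/index.html", "contents", "<html>v2</html>", timestamp=200)
print(latest("com.example/index.html", "contents"))  # -> "<html>v2</html>"
drop_older_than(150)                                  # removes the timestamp-100 version
```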
d. MobileMe
 It is Apple's solution that delivers push email, push contacts, and push calendars from the MobileMe service in the cloud to native applications on iPhone, iPod touch, Macs, and PCs.
 It provides a suite of ad-free web applications that deliver a desktop-like experience through any browser.
e. Live Mesh
 It is Microsoft's "software plus services" platform and experience that enables PCs and other devices to be aware of each other through the internet, enabling individuals and organizations to manage, access and share their files and applications on the web.
Components:
 A platform that defines and models a user's digital relationships among devices, data, applications, and people—made available to developers through an open data model and protocols.
 A cloud service providing an implementation of the platform hosted in Microsoft

datacenters.
 Software, a client implementation of the platform that enables local applications to
run offline and interact seamlessly with the cloud.
 A platform experience that exposes the key benefits of the platform for bringing together
a user’s devices, files and applications, and social graph, with news feeds across all of
these.
Standards
 Standards make the World Wide Web go around, and by extension, they are important to
cloud computing. Standards are what make it possible to connect to the cloud and what
make it possible to develop and deliver content.
3.2.1 Applications
A cloud application is the software architecture that the cloud uses to eliminate the need to install and run applications on the client computer. There are many applications that can run, but there needs to be a standard way to connect between the client and the cloud.
a. Communication
HTTP
To get a web page from our cloud provider, we will likely be using the Hypertext Transfer Protocol (HTTP) as the communication mechanism to transfer data between the cloud and our organization. HTTP is a stateless protocol. This is beneficial because hosts do not need to retain information about users between requests, but this forces web developers to use alternative methods for maintaining users' states. HTTP is the language that the cloud and our computers use to communicate.
XMPP
The Extensible Messaging and Presence Protocol (XMPP) is being talked about as the next big thing for cloud computing.
The Problem with Polling
When we want to sync services between two servers, the most common means is to have the client ping the host at regular intervals. This is known as polling. This is generally how we check our email. Every so often, we ping our email server to see if we got any new messages. It's also how the APIs for most web services work.
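As a hedged sketch of what polling looks like in practice (the endpoint URL and interval below are made-up assumptions):

```python
# A bare-bones polling loop: the client asks the server for updates on a fixed interval
# instead of being pushed new data. The URL is a hypothetical placeholder.
import time
import urllib.request

POLL_INTERVAL_SECONDS = 60
INBOX_URL = "https://mail.example.com/api/new-messages"   # hypothetical endpoint

def poll_once():
    with urllib.request.urlopen(INBOX_URL) as response:
        return response.read()           # whatever the service returns (e.g. JSON)

while True:
    messages = poll_once()               # most of these calls return "nothing new"
    if messages:
        print("new data:", messages)
    time.sleep(POLL_INTERVAL_SECONDS)    # wasted round-trips between real updates
```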

b. Security
SSL is the standard security technology for establishing an encrypted link between a web server and a browser. This ensures that data passed between the browser and the web server stays private. Creating an SSL connection on a web server requires an SSL certificate. When our cloud provider starts an SSL session, they are prompted to complete a number of questions about the identity of their company and web site. The cloud provider's computers then generate two cryptographic keys—a public key and a private key.
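A minimal sketch of the client side of this, using Python's standard library (the host "example.com" is just a placeholder): the client opens a TLS connection, the library verifies the server's certificate, and the certificate is the public part of the key pair described above.

```python
# Open a TLS connection and inspect the server certificate that secures the session.
import socket
import ssl

context = ssl.create_default_context()           # verifies the certificate chain + hostname

with socket.create_connection(("example.com", 443)) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname="example.com") as tls_sock:
        print("negotiated protocol:", tls_sock.version())      # e.g. TLSv1.3
        cert = tls_sock.getpeercert()
        print("certificate subject:", cert.get("subject"))
        print("valid until:", cert.get("notAfter"))
```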

3.2.2 Client
a. HTML
 The W3C is the organization that is charged with designing and maintaining HTML to improve its usability and functionality. When you click on a link in a web page, you are accessing HTML code in the form of a hyperlink, which then takes you to another page.
How HTML works
i. HTML is a series of short codes, called tags, typed into a text file which is created by web page design software.
ii. This text is saved as an HTML file and viewed through a browser.
iii. The browser reads the file and translates the text into the form the author wanted you to see.
 Writing HTML can be done using a number of methods, with either a simple text editor or a powerful graphical editor.
 Tags are seen like normal text but in <angle brackets>. Tags are what allow things like tables and images to appear in a web page.
 Different tags perform different functions. Tags cannot be seen through the browser, but they affect how the browser behaves.
b. DHTML
There are four parts to DHTML:
DOM: The Document Object Model allows you to access a web page and make changes with DHTML. The DOM specifies every part of a web page and provides consistent naming conventions, allowing you to access your web pages and change their properties.
Scripts: The common scripting languages in DHTML are JavaScript and ActiveX. Scripts are used to control the objects specified in the DOM.
CSS: Cascading Style Sheets are used to control the look and feel of a web page; style sheets list the colors and fonts of text, the background colors and images, and the placement of objects on the page. Using scripting and the DOM you can change the style of various elements.
XHTML: There is nothing unique about XHTML, but it is important because there are more things working from it than just the browser.
DHTML features
The four main features are:
i. Changing the tags and properties
ii. Real-time positioning
iii. Dynamic fonts
iv. Data binding
3.2.3 Infrastructure
Infrastructure is a way to deliver virtualization to our cloud computing solution.
a. Virtualization
Whenever something new happens in the world of computing, competitors duke it out to have their implementation be the standard. Virtualization is somewhat different, and major players worked together to develop a standard.

VMware, AMD, BEA Systems, BMC Software, Broadcom, Cisco, Computer Associates International, Dell, Emulex, HP, IBM, Intel, Mellanox, Novell, QLogic, and Red Hat all worked together to advance open virtualization standards. VMware says that it will provide its partners with access to VMware ESX Server source code and interfaces under a new program called VMware Community Source. This program is designed to help partners influence the direction of VMware ESX Server through a collaborative development model and shared governance process.
These initiatives are intended to benefit end users by:
i. Expanding virtualization solutions The availability of open-standard virtualization interfaces and the collaborative nature of VMware Community Source are intended to accelerate the availability of new virtualization solutions.
ii. Expanded interoperability and supportability Standard interfaces for hypervisors are expected to enable interoperability for customers with heterogeneous virtualized environments.
iii. Accelerated availability of new virtualization-aware technologies Vendors across the technology stack can optimize existing technologies and introduce new technologies for running in virtual environments.
Open Hypervisor Standards
Hypervisors are the foundational component of virtual infrastructure and enable computer system partitioning. An open-standard hypervisor framework can benefit customers by enabling innovation across an ecosystem of interoperable virtualization vendors and solutions.
VMware contributed an existing framework of interfaces, called Virtual Machine Hypervisor Interfaces (VMHI), based on its virtualization products to facilitate the development of these standards in an industry-neutral manner.
Community Source
The Community Source program provides industry partners with an opportunity to access VMware ESX Server source code under a royalty-free license. Partners can contribute shared code or create binary modules to spur and extend interoperable and integrated virtualization solutions. The idea is to combine the best of both the traditional commercial and open-source development models. Community members can participate and influence the governance of VMware ESX Server through an architecture board.
b. OVF
As the result of VMware and its industry partners' efforts, a standard has already been developed called the Open Virtualization Format (OVF). OVF describes how virtual appliances can be packaged in a vendor-neutral format to be run on any hypervisor. It is a platform-independent, extensible, and open specification for the packaging and distribution of virtual appliances composed of one or more virtual machines.
VMware developed a standard with these features:
 Optimized for distribution
 Enables the portability and distribution of virtual appliances
 Supports industry-standard content verification and integrity checking
 Provides a basic scheme for the management of software licensing
 A simple, automated user experience
 Enables a robust and user-friendly approach to streamlining the installation process
 Validates the entire package and confidently determines whether each virtual machine should be installed
 Verifies compatibility with the local virtual hardware
 Portable virtual machine packaging
 Enables platform-specific enhancements to be captured

 Supports the full range of virtual hard disk formats used for virtual machines today, and is extensible to deal with future formats that are developed
 Captures virtual machine properties concisely and accurately
 Vendor and platform independent
 Does not rely on the use of a specific host platform, virtualization platform, or guest operating system
 Extensible
 Designed to be extended as the industry moves forward with virtual appliance technology
 Localizable
 Supports user-visible descriptions in multiple locales
 Supports localization of the interactive processes during installation of an appliance
 Allows a single packaged appliance to serve multiple market opportunities
3.2.4 Service
 A web service, as defined by the World Wide Web Consortium (W3C), "is a software system designed to support interoperable machine-to-machine interaction over a network" that may be accessed by other cloud computing components. Web services are often web APIs that can be accessed over a network, like the Internet, and executed on a remote system that hosts the requested services.
a. Data
Data can be stored and served up with a number of mechanisms; two of the most popular are JSON and XML.
JSON
JSON is short for JavaScript Object Notation and is a lightweight computer data interchange format. It is used for transmitting structured data over a network connection in a process called serialization. It is often used as an alternative to XML.
JSON Basics
JSON is based on a subset of JavaScript and is normally used with that language. However, JSON is considered to be a language-independent format, and code for parsing and generating JSON data is available for several programming languages. This makes it a good replacement for XML when JavaScript is involved with the exchange of data, like AJAX.
XML vs. JSON
JSON should be used instead of XML when JavaScript is sending or receiving data. The reason for this is that when we use XML in JavaScript, we have to write scripts or use libraries to handle the DOM objects to extract the data we need. However, in JSON, the object is already an object, so no extra work needs to be done.
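To make the contrast concrete, here is a small sketch (using only Python's standard library; the record values are invented) that serializes the same data both ways:

```python
# The same record serialized as JSON and as XML.
import json
import xml.etree.ElementTree as ET

record = {"name": "Asha", "course": "Cloud Computing", "credits": 4}

# JSON: the text maps one-to-one onto a native object, so no extra extraction code is needed.
as_json = json.dumps(record)
print(as_json)                          # {"name": "Asha", "course": "Cloud Computing", "credits": 4}
print(json.loads(as_json)["course"])    # direct field access after parsing

# XML: the same data needs element-by-element handling to get values back out.
root = ET.Element("student")
for key, value in record.items():
    ET.SubElement(root, key).text = str(value)
as_xml = ET.tostring(root, encoding="unicode")
print(as_xml)                           # <student><name>Asha</name>...</student>
print(ET.fromstring(as_xml).find("course").text)
```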
XML
Extensible Markup Language (XML) is a standard, self-describing way of encoding text and data so that content can be accessed with very little human interaction and exchanged across a wide variety of hardware, operating systems, and applications. XML provides a standardized way to represent text and data in a format that can be used across platforms. It can also be used with a wide range of development tools and utilities.
HTML vs XML
 Separation of form and content HTML uses tags to define the appearance of text, while XML tags define the structure and the content of the data. Individual applications will be specified by the application or associated style sheet.
 XML is extensible Tags can be defined by the developer for a specific application, while HTML's tags are defined by the W3C.

Benefits of XML include:
i. Self-describing data XML does not require relational schemata, file description tables, external data type definitions, and so forth. Also, while HTML only ensures the correct presentation of the data, XML also guarantees that the data is usable.
ii. Database integration XML documents can contain any type of data—from text and numbers to multimedia objects to active formats like Java.
iii. No reprogramming if modifications are made Documents and web sites can be changed with XSL style sheets, without having to reprogram the data.
iv. One-server view of data XML is exceptionally ideal for cloud computing, because data spread across multiple servers looks as if it is stored on one server.
v. Open and extensible XML's structure allows us to add other elements if we need them. We can easily adapt our system as our business changes.
vi. Future-proof The W3C has endorsed XML as an industry standard, and it is supported by all leading software providers. It has already become an industry standard in fields like healthcare.
vii. Contains machine-readable context information Tags, attributes, and element structure provide the context for interpreting the meaning of content, which opens up possibilities for development.
Content vs. presentation XML tags describe the meaning of the object, not its presentation. That is, XML describes the content of a document, and the application presents it as described.
b. Web Services
Web services describe how data is transferred from the cloud to the client.
REST
Representational state transfer (REST) is a way of getting information content from a web site by reading a designated web page that contains an XML file that describes and includes the desired content.
For instance, REST could be used by our cloud provider to provide updated subscription information. Every so often, the provider could prepare a web page that includes content and XML statements that are described in the code. Subscribers only need to know the uniform resource locator (URL) for the page where the XML file is located, read it with a web browser, understand the content using XML information, and display it appropriately.
REST is similar in function to the Simple Object Access Protocol (SOAP), but is easier to use. SOAP requires writing or using a data server program and a client program (to request the data). However, SOAP offers more capability. For instance, if we were to provide syndicated content from our cloud to subscribing web sites, those subscribers might need to use SOAP, which allows greater program interaction between the client and the server.
Benefits
REST offers the following benefits:
 It gives better response time and reduced server load due to its support for the caching of representations.
 Server scalability is improved by reducing the need to maintain session state.
 A single browser can access any application and any resource, so less client-side software needs to be written.
 A separate resource discovery mechanism is not needed, due to the use of hyperlinks in representations.

 Better long-term compatibility and evolvability characteristics exist than in RPC. This is due to:
• The ability of documents, like HTML, to evolve with both forward- and backward-
compatibility.
• Resources can add support for new content types as they are defined, without
eliminating support for older content types.
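As a hedged sketch of the REST style described above (the feed URL and the element names inside it are made-up assumptions), the subscriber simply issues an HTTP GET on a known URL and parses the XML representation it receives:

```python
# Fetch a designated URL with a plain HTTP GET and read the XML document it returns.
import urllib.request
import xml.etree.ElementTree as ET

SUBSCRIPTION_URL = "https://cloud-provider.example.com/subscriptions/feed.xml"  # hypothetical

with urllib.request.urlopen(SUBSCRIPTION_URL) as response:
    document = response.read()              # the representation of the resource, here XML text

root = ET.fromstring(document)
for item in root.findall("subscription"):   # assumes <subscription> elements in the feed
    print(item.findtext("name"), item.findtext("status"))
```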
SOAP
 Simple Object Access Protocol (SOAP) is a way for a program running in one kind of operating
system (such as Windows Vista) to communicate with a program in the same or another kind of
an operating system (such as Linux) by using HTTP and XML as the tools to exchange
information.
 Procedure Calls Often, remote procedure calls (RPC) are used between objects like DCOM or
CORBA, but HTTP was not designed for this use. RPC is a compatibility problem, because
firewall and proxy servers will block this type of traffic. Because web protocols already are
installed and available for use by the major operating systems, HTTP and XML provide an easy
solution to the problem of how programs running under different operating systems in a
network can communicate with each other.
 SOAP describes exactly how to encode an HTTP header and an XML file so that a program on
one computer can call a program in another computer and pass it information. It also explains
how a called program can return a response.
SOAP was developed by Microsoft, DevelopMentor, and Userland Software.
 One of the advantages of SOAP is that program calls are more likely to get through
firewalls that normally screen out requests for those applications. Because HTTP
requests are normally allowed through firewalls, programs using SOAP can
communicate with programs anywhere.
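Because a SOAP call is just XML carried over HTTP, it passes through the same ports that firewalls already allow. The sketch below is a hedged illustration of that idea; the endpoint, namespace and GetQuote operation are invented placeholders, not a real service:

```python
# A minimal SOAP 1.1 call: an XML envelope POSTed over HTTP.
import urllib.request

envelope = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetQuote xmlns="http://example.com/stock">
      <symbol>ACME</symbol>
    </GetQuote>
  </soap:Body>
</soap:Envelope>"""

request = urllib.request.Request(
    "https://example.com/stockservice",          # hypothetical service URL
    data=envelope.encode("utf-8"),
    headers={
        "Content-Type": "text/xml; charset=utf-8",
        "SOAPAction": "http://example.com/stock/GetQuote",
    },
)

with urllib.request.urlopen(request) as response:
    print(response.read().decode())              # the response is another XML envelope
```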

Standards are extremely important, and something that we take for granted
these days. For instance, it’s nothing for us to email Microsoft Word documents
back and forth and expect them to work on our computers.

Software as a Service
SaaS (Software as a Service) is an application hosted on a remote server and accessed through the Internet.
An easy way to think of SaaS is the web-based email service offered by such companies as Microsoft (Hotmail), Google (Gmail), and Yahoo! (Yahoo Mail). Each mail service meets the basic criteria: the vendor (Microsoft, Yahoo, and so on) hosts all of the programs and data in a central location, providing end users with access to the data and software, which is accessed across the World Wide Web.
SaaS can be divided into two major categories:
• Line of business services These are business solutions offered to companies and enterprises. They are sold via a subscription service. Applications covered under this category include business processes, like supply-chain management applications, customer relations applications, and similar business-oriented tools.
• Customer-oriented services These services are offered to the general public on a subscription basis. More often than not, however, they are offered for free and supported by advertising. Examples in this category include the aforementioned web mail services, online gaming, and consumer banking, among others.
Advantages
• There's a faster time to value and improved productivity, when compared to the long implementation cycles and failure rate of enterprise software.
• There are lower software licensing costs.
• SaaS offerings feature the biggest cost savings over installed software by eliminating the need for enterprises to install and maintain hardware, pay labor costs, and maintain the applications.
• SaaS can be used to avoid the custom development cycles to get applications to the organization quickly.
• SaaS vendors typically have very meticulous security audits.
SaaS vendors allow companies to have the most current version of an application as possible. This allows the organization to spend their development dollars on new innovation in their industry, rather than supporting old versions of applications.
SaaS, on the other hand, has no licensing. Rather than buying the application, you pay for it through the use of a subscription, and you only pay for what you use. If you stop using the application, you stop paying.
3.3.2 Vendor Advantages
 SaaS is an advantage to vendors also, and the financial benefit is the top one—vendors get a constant stream of income, often more than with the traditional software licensing setup. Additionally, through SaaS, vendors can fend off piracy concerns and unlicensed use of software.
 Vendors also benefit more as more subscribers come online. They have a huge investment in physical space, hardware, technology staff, and process development.
The more these resources are used to capacity, the more the provider can clear as margin.
Virtualization Benefits
 Virtualization makes it easy to move to a SaaS system. One of the main reasons it is easier for independent software vendors (ISVs) to adopt SaaS is the growth of virtualization. The growing popularity of some SaaS vendors using Amazon's EC2 cloud platform and the overall popularity of virtualized platforms help with the development of SaaS.
3.3.3 Companies Offering SaaS
Intuit
 QuickBooks has been around for years as a conventional application for tracking business accounting. With the addition of QuickBooks Online, accounting has moved to the cloud.
QuickBooks Overview
 QuickBooks Online (www.qboe.com) gives small business owners the ability to access their financial data whether they are at work, home, or on the road. Intuit Inc. says the offering also gives users a high level of security because data is stored on firewall-protected servers and protected via automatic data backups.
 There is also no need to hassle with technology—software upgrades are included at no extra charge.
 For companies that are growing, QuickBooks Online Plus offers advanced features such as automatic billing and time tracking, as well as the ability to share information with employees in multiple locations.
 QuickBooks Online features include:
• The ability to access financial data anytime and from anywhere. QuickBooks Online is accessible to users 24 hours a day, seven days a week.
• Automated online banking. Download bank and credit card transactions automatically every night, so it's easy to keep data up to date.
• Reliable automatic data backup. Financial data is automatically backed up every day and is stored on Intuit's firewall-protected servers, which are monitored to keep critical business information safe and secure. QuickBooks Online also supports 128-bit Secure Sockets Layer (SSL) encryption.
• No software to buy, install, or maintain and no network required. The software is hosted online, so small business users never have to worry about installing new software or upgrades. QuickBooks Online remembers customer, product, and vendor information, so users don't have to re-enter data.
• Easy accounts receivable and accounts payable. Invoice customers and track customer payments. Create an invoice with the click of a button. Apply specific credits to invoices or apply a single-customer payment to multiple jobs or invoices. Receive bills and enter them into QuickBooks Online with the expected due date.
• Write and print checks. Enter information in the onscreen check form and print checks.
Google
 Google's SaaS offerings include Google Apps and Google Apps Premier Edition. Google Apps, launched as a free service in August 2006, is a suite of applications that includes Gmail webmail services, Google Calendar shared calendaring, Google Talk instant messaging and Voice over IP, and the Start Page feature for creating a customizable home page on a specific domain.
 Google also offers Google Docs and Spreadsheets for all levels of Google Apps. Additionally, Google Apps supports Gmail for mobile on BlackBerry handheld devices.
Google Apps Premier Edition has the following unique features:
• Per-user storage of 10GBs Offers about 100 times the storage of the average corporate mailbox.
• APIs for business integration APIs for data migration, user provisioning, single sign-on, and mail gateways enable businesses to further customize the service for unique environments.
• Uptime of 99.9 percent Service level agreements for high availability of Gmail, with Google monitoring and crediting customers if service levels are not met.
• Advertising optional Advertising is turned off by default, but businesses can choose to include Google's relevant target-based ads if desired.
• Low fee Simple annual fee of $50 per user account per year makes it practical to offer these

applications to select users in the organization.
Microsoft
Microsoft Office Live Small Business offers features including Store Manager, an e-commerce tool to help small businesses easily sell products on their own web site and on eBay; and E-mail Marketing beta, to make sending email newsletters and promotions simple and affordable.
The following features are available in Microsoft Office Live Small Business:
• Store Manager is a hosted e-commerce service that enables users to easily sell products on their own web site and on eBay.
• Custom domain name and business email is available to all customers for free for one year. Private domain name registration is included to help customers protect their contact information from spammers. Business email now includes 100 company-branded accounts, each with 5GB of storage.
• Web design capabilities, including the ability to customize the entire page, as well as the header, footer, navigation, page layouts, and more.
• Support for Firefox 2.0 means Office Live Small Business tools and features are now compatible with Macs.
• A simplified sign-up process allows small business owners to get started quickly. Users do not have to choose a domain name at sign-up or enter their credit card information.
• Domain flexibility allows businesses to obtain their domain name through any provider and redirect it to Office Live Small Business. In addition, customers may purchase additional domain names.
• Synchronization with Microsoft Office Outlook provides customers with access to vital business information such as their Office Live Small Business email, contacts, and calendars, both online and offline.
• E-mail Marketing beta enables users to stay connected to current customers and introduce themselves to new ones by sending regular email newsletters, promotions, and updates.
IBM
Big Blue—IBM offers its own SaaS solution under the name "Blue Cloud."
Blue Cloud is a series of cloud computing offerings that will allow corporate datacenters to operate more like the Internet by enabling computing across a distributed, globally accessible fabric of resources, rather than on local machines or remote server farms.
Blue Cloud is based on open standards and open-source software supported by IBM software, systems technology, and services. IBM's Blue Cloud development is supported by more than 200 IBM Internet-scale researchers worldwide and targets clients who want to explore the extreme scale of cloud computing infrastructures.

Software plus Services
Software plus Services takes the notion of Software as a Service (SaaS) and complements it with packaged software. Here are some of the ways in which it can help the client organization.
3.4.1 Overview
• User experience: Browsers have limitations as to just how rich the user experience can be. Combining client software that provides the features we want with the ability of the Internet to deliver those experiences gives us the best of both worlds.
• Working offline: Not having to always work online gives us the flexibility to do our work, but without the limitations of the system being unusable. By connecting occasionally and synching data, we get a good solution for road warriors and telecommuters who don't have the same bandwidth or can't always be connected.
• Privacy worries: No matter how we use the cloud, privacy is a major concern. With Software plus Services, we can keep the most sensitive data housed on-site, while less

sensitive data can be moved to the cloud.
• Marketing: Software plus Services gives vendors a chance to keep their names in front of clients. Since it's so easy to move from vendor to vendor, providing a part software/part-Internet solution makes it easier to sell our product to a client.
• Power: More efficiency is realized by running software locally and synching to the cloud as needed.
• Flexibility: Vendors can offer software in different sizes and shapes—whether onsite or hosted. This gives customers an opportunity to have the right-sized solution.
Software plus Services offerings that prevalent companies have:
Vendors
i. Microsoft: Microsoft offers Dynamics CRM, Microsoft Outlook, Windows Azure, and the Azure Services Platform. Windows Azure is a collection of cloud-based services, including Live Framework, .NET Services, SQL Services, CRM Services, SharePoint Services, and Windows Azure Foundation Services for compute, storage, and management.
ii. Adobe: Adobe Integrated Runtime (AIR) brings Flash, ActionScript, and MXML/Flex to the PC. Using AIR, vendors can build desktop applications that access the cloud.
iii. Salesforce.com: Salesforce.com's AppExchange is a set of APIs that vendors can use to create desktop applications to access Salesforce data and run on the desktop of an end user.
iv. Apple: Apple offers a number of cloud-enabled features for its iPhone/iPod touch. Not only does it come with an integrated Safari web browser, but they also offer a software development kit (SDK) that allows software to be created for the iPhone/iPod touch. Vendors can build their own applications, and on-the-go users can access cloud offerings with those applications.
v. Google: Google's mobile platform is called "Android" and helps vendors build software for mobile phones. Google also offers its Google Apps and the Google Chrome browser, which also installs Google Gears software on the desktop. This allows offline and online solutions.
3.4.2 Mobile Device Integration
How is mobile device integration done, and how does Microsoft Online provide this?
A key component of Software plus Services is the ability to work in the cloud from a mobile device.
Google Android:
A broad alliance of leading technology and wireless companies joined forces to develop Android, an open and comprehensive platform for mobile devices. Google Inc., T-Mobile, HTC, Qualcomm, Motorola, and others collaborated on the development of Android through the Open Handset Alliance, a multinational alliance of technology and mobile industry leaders.
Open Handset Alliance:
 Thirty-four companies have formed the Open Handset Alliance, which aims to develop technologies that will significantly lower the cost of developing and distributing mobile devices and services. The Android platform is the first step in this direction—a fully integrated mobile "software stack" that consists of an operating system, middleware, and a user-friendly interface and applications. This alliance includes major companies like:
• Google (www.google.com)
• HTC (www.htc.com)
• Intel (www.intel.com)
• LG (www.lge.com)
• Marvell (www.marvell.com)
• Motorola (www.motorola.com)
• NMS Communications (www.nmscommunications.com)
• NTT DoCoMo Inc. (www.nttdocomo.com)
• Qualcomm (www.qualcomm.com)
• Samsung (www.samsung.com), etc.
3.4.3 Providers
The following development solutions may be considered for creating our own Software plus Services deployments.
a. Adobe AIR
Adobe Systems offers its Adobe Integrated Runtime (AIR), formerly code-named Apollo. Adobe AIR is a cross-operating-system application runtime that allows developers to use HTML/CSS, AJAX, Adobe Flash, and Adobe Flex to extend rich Internet applications (RIAs) to the desktop.
 For its popular iPhone and iPod touch devices, Apple offers its iPhone Software

Development Kit (SDK), as well as enterprise features such as support for Microsoft Exchange ActiveSync to provide secure, over-the-air push email, contacts, and calendars, remote wipe, and the addition of Cisco IPsec VPN for encrypted access to private corporate networks.
App Store:
The iPhone software contains the App Store, an application that lets users browse, search, purchase, and wirelessly download third-party applications directly onto their iPhone or iPod touch. The App Store enables developers to reach every iPhone and iPod touch user. Developers set the price for their applications (including free) and retain 70 percent of all sales revenues. Users can download free applications at no charge to either the user or developer, or purchase priced applications with just one click. Enterprise customers can create a secure, private page on the App Store accessible only by their employees.
3.4.4 Microsoft Online:
Microsoft provides Software plus Services offerings, integrating some of its most popular and prevalent offerings, like Exchange. Not only does Microsoft's Software plus Services offering allow a functional way to serve our organization, but it also provides a means to function on the cloud in a simple way.
Hybrid Model
With Microsoft services like Exchange Online, SharePoint Online, and CRM 4.0, organizations big and small have more choices in how they access and manage enterprise services, from entirely web-based to entirely on-premise solutions, and anywhere in between. Having a variety of solutions to choose from gives customers the mobility and flexibility they need to meet constantly evolving business needs. To meet this demand, Microsoft is moving toward a hybrid strategy of Software plus Services, the goal of which is to empower customers and partners with richer applications, more choices, and greater opportunity through a combination of on-premise software, partner-hosted software, and Microsoft-hosted software. As part of this strategy, Microsoft expanded its Microsoft Online Services, which includes Exchange Online and SharePoint Online, to organizations of all sizes. With services like Microsoft Online Services and Microsoft Dynamics CRM 4.0, organizations will have the flexibility required to address their business needs.
Exchange Online and SharePoint Online
Exchange Online and SharePoint Online are two examples of how partners can extend their reach, grow their revenues, and increase the number of sales in a Microsoft-hosted scenario. In September 2007, Microsoft initially announced the worldwide availability of Microsoft Online Services—which includes Exchange Online, SharePoint Online, Office Communications Online, and Office Live Meeting—to organizations with more than 5,000 users. The extension of these services to small and mid-sized businesses is appealing to partners in the managed services space because they see it as an opportunity to deliver additional services and customer value on top of Microsoft-hosted Exchange Online or SharePoint Online. Microsoft Online Services opens the door for partners to deliver reliable business services such as desktop and mobile email, calendaring and contacts, instant messaging, audio and video conferencing, and shared workspaces—all of which will help increase their revenue stream and grow their businesses.
Microsoft Dynamics CRM 4.0:
Microsoft Dynamics CRM 4.0, released in December 2007, provides a key aspect of Microsoft's Software plus Services strategy. The unique advantages of the new Microsoft Dynamics CRM 4.0, which can be delivered on-premise or on-demand as a hosted solution, make Microsoft Dynamics CRM an option for solution providers who want to rapidly offer a solution that meets customer needs and maximizes their potential to grow their own business through additional services.


UNIT IV
4.1 Developing Applications
In cloud computing we can develop our own applications to cater to the needs of our business. A simple example is developing an app using Android to meet our business needs using Google App Engine and deploying it in the App Store. Similarly, we may use Intuit's QuickBase, which allows us to develop financial-based cloud apps.
4.1.1 Google
To develop an app on the cloud, the Google App Engine is the perfect tool to use to make this dream become reality. In essence, we will write a bit of code in Python, tweak some HTML code, and then we have our app built, and it only takes a few minutes. Using Google App Engine we can develop our applications without worrying about buying servers, load balancers, or DNS tables. Salesforce.com struck up a strategic alliance with Google with the availability of Force.com for Google App Engine. Force.com for Google App Engine is a set of tools and services to enable developer success with application development in the cloud. The offering brings together Force.com and Google App Engine, enabling the creation of entirely new web and business applications. Force.com for Google App Engine builds on the relationship between Salesforce.com and Google, spanning philanthropy, business applications, social networks, and cloud computing.
a) Google Gears
Another development tool that Google offers is Google Gears, an open-source technology for creating offline web applications. This browser extension was made available in its early stages so that the development community could test its capabilities and limitations and help Google improve upon it. Google's long-term hope is that Google Gears can help the industry as a whole move toward a single standard for offline capabilities that all developers can use.
Gears provides three key features:
• A local server, to cache and serve application resources (HTML, JavaScript, images, etc.) without needing to contact a server
• A database, to store and access data from within the browser
• A worker thread pool, to make web applications more responsive by performing expensive operations in the background
4.1.2 Microsoft
Microsoft's Azure Services Platform is a tool provided for developers who want to write applications that are going to run partially or entirely in a remote datacenter. The Azure Services Platform (Azure) is an Internet-scale cloud services platform hosted in Microsoft datacenters, which provides an operating system and a set of developer services that can be used individually or together. Azure can be used to build new applications to run from the cloud or to enhance existing applications with cloud-based capabilities, and it forms the foundation of all Microsoft's cloud offerings. Its open architecture gives developers the choice to build web applications, applications running on connected devices, PCs, or servers, or hybrid solutions offering the best of online and on-premises.
Microsoft also offers cloud applications ready for consumption by customers, such as Windows Live, Microsoft Dynamics, and other Microsoft Online Services for business such as Microsoft Exchange Online and SharePoint Online. The Azure Services Platform lets developers provide their own unique customer offerings by offering the foundational components of compute, storage, and building-block services to author and compose applications in the cloud. Azure utilizes several other Microsoft services as part of its platform, known as the Live Mesh platform.
a) Live Services:
Live Services is a set of building blocks within the Azure Services Platform that is used to handle user data and application resources. Live Services provides developers with a way to build social applications and experiences across a range of digital devices that can connect with one of the largest audiences on the Web.
b) Microsoft SQL Services:
Microsoft SQL Services extends the capabilities of Microsoft SQL Server into the cloud as a web-based, distributed relational database. It provides web services that enable relational queries, search, and data synchronization with mobile users, remote offices, and business partners.
c) Microsoft .NET Services:
• Microsoft .NET Services is a tool for developing loosely coupled cloud-based applications. .NET Services includes access control to help secure applications, a service bus for communicating across applications and services, and hosted workflow execution. These hosted services allow the creation of applications that span from on-premises environments to the cloud.
• Microsoft SharePoint Services and Dynamics CRM Services are used to allow developers to collaborate and build strong customer relationships. Using tools like Visual Studio, developers can build applications that utilize SharePoint and CRM capabilities.
d) Microsoft Azure Design
Azure is designed in several layers, with different things going on under the hood.
Layer Zero
Layer Zero is Microsoft's Global Foundational Service (GFS). GFS is akin to the hardware abstraction layer (HAL) in Windows. It is the most basic level of the software that interfaces directly with the servers.
Layer One
Layer One is the base Azure operating system. It used to be code-named "Red Dog," and was designed by a team of operating system experts at Microsoft. Red Dog is the technology that networks and manages the Windows Server 2008 machines that form the Microsoft-hosted cloud.
Red Dog is made up of four pillars:
• Storage (a file system)
• The fabric controller, which is a management system for deploying and provisioning
• Virtualized computation/VM
• Development environment, which allows developers to emulate Red Dog on their desktops
Layer Two
Layer Two provides the building blocks that run on Azure. These services are the aforementioned Live Mesh platform. Developers build on top of these lower-level services when building cloud apps. SharePoint Services and CRM Services are not the same as SharePoint Online and CRM Online; they are just the platform basics that do not include user interface elements.
Layer Three
At Layer Three exist the Azure-hosted applications. Some of the applications developed by Microsoft include SharePoint Online, Exchange Online, and Dynamics CRM Online. Third parties will create other applications.
4.1.3 Intuit QuickBase:
Intuit Inc.'s QuickBase launched its new QuickBase Business Consultant Program. The program allows members to use their expertise to create unique business applications tailored specifically to the industries they serve—without technical expertise or coding. This helps members expand their reach into industries formerly served only by IT experts. Using QuickBase, program members will be able to easily build new on-demand business applications from scratch or customize one of 200 available templates and resell them to their clients.
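To ground the Google App Engine discussion in Section 4.1.1: although the text above describes writing the application in Python, App Engine also provides a Java runtime, and a minimal request handler of the kind the section describes might look like the following sketch. The class name and the greeting are illustrative, not taken from the text; App Engine itself supplies the servers, load balancing, and DNS once the app is deployed with its descriptor files.
// HelloCloudServlet.java - a minimal handler for the App Engine Java runtime (illustrative).
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class HelloCloudServlet extends HttpServlet {
    @Override
    public void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // App Engine routes the HTTP request here; no server provisioning is needed by the developer.
        resp.setContentType("text/plain");
        resp.getWriter().println("Hello from the cloud");
    }
}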
4.2 HADOOP
What?
Using the solution provided by Google, Doug Cutting and his team developed an open-source project called HADOOP.
Why?
Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel with others. In short, Hadoop is used to develop applications that can perform complete statistical analysis on huge amounts of data.
Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. The Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Hadoop Ecosystem
Introduction: The Hadoop Ecosystem is a platform or a suite which provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop, i.e. HDFS, MapReduce, YARN, and Hadoop Common. Most of the tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as absorption, analysis, storage and maintenance of data, etc.
Following are the components that collectively form a Hadoop ecosystem:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLlib: Machine Learning algorithm libraries
• Solr, Lucene: Searching and Indexing
• Zookeeper: Managing cluster
• Oozie: Job Scheduling
All these toolkits or components revolve around one term, i.e. data. That is the beauty of Hadoop: it revolves around data and hence makes its synthesis easier.
HDFS:
• HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes, thereby maintaining the metadata in the form of log files.
• HDFS consists of two core components, i.e.
1. Name Node
2. Data Node
• The Name Node is the prime node which contains metadata (data about data), requiring comparatively fewer resources than the data nodes that store the actual data. These data nodes are commodity hardware in the distributed environment. Undoubtedly, this makes Hadoop cost effective.
• HDFS maintains all the coordination between the clusters and hardware, thus working at the heart of the system.
YARN:
• Yet Another Resource Negotiator, as the name implies, YARN is the component which helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
• YARN consists of three major components, i.e.
1. Resource Manager
2. Node Manager
3. Application Manager
• The Resource Manager has the privilege of allocating resources for the applications in a system, whereas Node Managers work on the allocation of resources such as CPU, memory, and bandwidth per machine and later acknowledge the Resource Manager. The Application Manager works as an interface between the Resource Manager and Node Manager and performs negotiations as per the requirement of the two.
MapReduce:
• By making use of distributed and parallel algorithms, MapReduce makes it possible to carry over the processing's logic and helps to write applications which transform big data sets into a manageable one.
• MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of data, thereby organizing it in the form of groups. Map generates a key-value pair based result which is later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
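To make the Map() and Reduce() division of work described above concrete, a minimal word-count mapper and reducer written against the Hadoop Java MapReduce API might look like the following sketch; the class names are illustrative and not part of the original notes.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): tokenize each input line and emit a (word, 1) key-value pair per token.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // e.g. (Bear, 1)
            }
        }
    }
}

// Reduce(): aggregate the list of 1s for each word, e.g. (Bear, [1,1]) becomes (Bear, 2).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}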

Pig:
• Pig was basically developed by Yahoo, and works on Pig Latin, a query-based language similar to SQL.
• It is a platform for structuring the data flow and for processing and analyzing huge data sets.
• Pig does the work of executing commands, and in the background all the activities of MapReduce are taken care of. After the processing, Pig stores the result in HDFS.
• The Pig Latin language is specially designed for this framework and runs on Pig Runtime, just the way Java runs on the JVM.
• Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop Ecosystem.
Hive:
• With the help of SQL methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
• It is highly scalable, as it allows both real-time and batch processing. Also, all the SQL data types are supported by Hive, making query processing easier.
• Similar to other query-processing frameworks, HIVE comes with two components: JDBC Drivers and the HIVE Command Line.
• JDBC, along with ODBC drivers, works on establishing the data storage permissions and connection, whereas the HIVE Command Line helps in the processing of queries.
Mahout:
• Mahout allows machine learnability to a system or application. Machine learning, as the name suggests, helps the system to develop itself based on some patterns, user/environmental interaction, or on the basis of algorithms.
• It provides various libraries or functionalities such as collaborative filtering, clustering, and classification, which are nothing but concepts of machine learning. It allows invoking algorithms as per our need with the help of its own libraries.
Apache Spark:
• It is a platform that handles all the process-consumptive tasks like batch processing, interactive or iterative real-time processing, graph conversions, and visualization, etc.
• It consumes in-memory resources, thus being faster than the prior in terms of optimization.
• Spark is best suited for real-time data, whereas Hadoop is best suited for structured data or batch processing; hence both are used in most companies interchangeably.
Apache HBase:
• It is a NoSQL database which supports all kinds of data and is thus capable of handling anything in a Hadoop database. It provides the capabilities of Google's BigTable, and is thus able to work on big data sets effectively.
• At times when we need to search or retrieve the occurrences of something small in a huge database, the request must be processed within a short, quick span of time. At such times, HBase comes in handy as it gives us a tolerant way of storing limited data.
Other Components: Apart from all of these, there are some other components too that carry out a huge task in order to make Hadoop capable of processing large datasets. They are as follows:
• Solr, Lucene: These are the two services that perform the task of searching and indexing with the help of some Java libraries. Lucene in particular is based on Java and allows a spell-check mechanism as well. However, Lucene is driven by Solr.
• Zookeeper: There was a huge issue of management of coordination and synchronization among the resources or the components of Hadoop, which often resulted in inconsistency. Zookeeper overcame all these problems by performing synchronization, inter-component based communication, grouping, and maintenance.
• Oozie: Oozie simply performs the task of a scheduler, thus scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e. Oozie workflow and Oozie coordinator jobs. Oozie workflow jobs are those that need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are those that are triggered when some data or external stimulus is given to it.
Hadoop Architecture:
At its core, Hadoop has two major layers namely −
• Processing/Computation layer (MapReduce), and
• Storage layer (Hadoop Distributed File System).
How Does Hadoop Work?
It is quite expensive to build bigger servers with heavy configurations that handle large scale processing, but as an alternative, you can tie together many commodity computers with single CPUs as a single functional distributed system, and practically, the clustered machines can read the dataset in parallel and provide a much higher throughput. Moreover, it is cheaper than one high-end server. So this is the first motivational factor behind using Hadoop: it runs across clustered and low-cost machines.
Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs −
• Data is initially divided into directories and files. Files are divided into uniform sized blocks of 128M and 64M (preferably 128M).
• These files are then distributed across various cluster nodes for further processing.
• HDFS, being on top of the local file system, supervises the processing.
• Blocks are replicated for handling hardware failure.
• Checking that the code was executed successfully.
• Performing the sort that takes place between the map and reduce stages.
• Sending the sorted data to a certain computer.
• Writing the debugging logs for each job.
Advantages of Hadoop
• The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines and, in turn, utilizes the underlying parallelism of the CPU cores.
• Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA); rather, the Hadoop library itself has been designed to detect and handle failures at the application layer.
• Servers can be added or removed from the cluster dynamically and Hadoop continues to operate without interruption.
• Another big advantage of Hadoop is that apart from being open source, it is compatible on all platforms since it is Java based.
• Hadoop is supported by the GNU/Linux platform and its flavors. Therefore, we have to install a Linux operating system for setting up the Hadoop environment. In case you have an OS other than Linux, you can install VirtualBox software and have Linux inside the VirtualBox.
MapReduce
• MapReduce is a parallel programming model for writing distributed applications, devised at Google for efficient processing of large amounts of data (multi-terabyte data-sets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The MapReduce program runs on Hadoop, which is an Apache open-source framework.
• MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.
What is MapReduce?
• MapReduce is a processing technique and a program model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Secondly, the reduce task takes the output from a map as an input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.
• The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to use the MapReduce model.
The Algorithm
• Generally the MapReduce paradigm is based on sending the computer to where the data resides!
• A MapReduce program executes in three stages, namely the map stage, shuffle stage, and reduce stage.
o Map stage − The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
o Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in the HDFS.
• During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
• The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
• Most of the computing takes place on nodes with data on local disks, which reduces the network traffic.
• After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.


Inputs and Outputs (Java Perspective)
• The MapReduce framework operates on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
• The key and value classes should be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework. The input and output types of a MapReduce job are: (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output).
        Input            Output
Map     <k1, v1>         list(<k2, v2>)
Reduce  <k2, list(v2)>   list(<k3, v3>)
Terminology
• PayLoad − Applications implement the Map and the Reduce functions, and form the core of the job.
• Mapper − The Mapper maps the input key/value pairs to a set of intermediate key/value pairs.
• NameNode − Node that manages the Hadoop Distributed File System (HDFS).
• DataNode − Node where data is presented in advance before any processing takes place.
• Master Node − Node where the JobTracker runs and which accepts job requests from clients.
• Slave Node − Node where the Map and Reduce program runs.
• Job Tracker − Schedules jobs and tracks the assigned jobs to the Task Tracker.
• Task Tracker − Tracks the task and reports status to the JobTracker.
• Job − A program that is an execution of a Mapper and Reducer across a dataset.
• Task − An execution of a Mapper or a Reducer on a slice of data.
• Task Attempt − A particular instance of an attempt to execute a task on a Slave Node.
MapReduce Tutorial: A Word Count Example of MapReduce
Let us understand how MapReduce works by taking an example where I have a text file called example.txt whose contents are as follows:
Dear, Bear, River, Car, Car, River, Deer, Car and Bear
Now, suppose we have to perform a word count on example.txt using MapReduce. So, we will be finding the unique words and the number of occurrences of those unique words.
• First, we divide the input into three splits as shown in the figure. This will distribute the work among all the map nodes.
• Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to each of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is that every word, in itself, will occur once.
• Now, a list of key-value pairs will be created where the key is nothing but the individual words and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs – Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
• After the mapper phase, a partition process takes place where sorting and shuffling happen so that all the tuples with the same key are sent to the corresponding reducer.
• So, after the sorting and shuffling phase, each reducer will have a unique key and a list of values corresponding to that very key. For example, Bear, [1,1]; Car, [1,1,1].., etc.
• Now, each Reducer counts the values which are present in that list of values. As shown in the figure, the reducer gets a list of values which is [1,1] for the key Bear. Then, it counts the number of ones in the very list and gives the final output as – Bear, 2.
• Finally, all the output key/value pairs are then collected and written in the output file.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system that is designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. It is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high throughput access to application data and is suitable for applications having large datasets.
Apart from the above-mentioned two core components, the Hadoop framework also includes the following two modules −
• Hadoop Common − these are Java libraries and utilities required by other Hadoop modules.
• Hadoop YARN − this is a framework for job scheduling and cluster resource management.
The Hadoop File System was developed using distributed file system design. It is run on commodity hardware. Unlike other distributed systems, HDFS is highly fault tolerant and designed using low-cost hardware.
HDFS holds a very large amount of data and provides easier access. To store such huge data, the files are stored across multiple machines. These files are stored in redundant fashion to rescue the system from possible data losses in case of failure. HDFS also makes applications available for parallel processing.
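Tying the word-count walkthrough above to runnable code, a minimal driver that wires a mapper and reducer (such as the WordCountMapper and WordCountReducer sketched earlier) into a job might look like the following sketch; the input and output paths are illustrative, not part of the original notes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");           // job name is arbitrary
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);                // Map stage
        job.setCombinerClass(WordCountReducer.class);             // optional local aggregation
        job.setReducerClass(WordCountReducer.class);              // Reduce stage
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. /input/example.txt
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // e.g. /output/wordcount
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}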
Features of HDFS
i) Fault Tolerance
The fault tolerance in Hadoop HDFS is the working strength of a system in unfavorable conditions. It is highly fault-tolerant. The Hadoop framework divides data into blocks. After that, it creates multiple copies of the blocks on different machines in the cluster. So, when any machine in the cluster goes down, a client can easily access their data from another machine which contains the same copy of the data blocks.
ii) High Availability
Hadoop HDFS is a highly available file system. In HDFS, data gets replicated among the nodes in the Hadoop cluster by creating a replica of the blocks on the other slaves present in the HDFS cluster. So, whenever a user wants to access this data, they can access their data from the slaves which contain its blocks. At the time of unfavorable situations like the failure of a node, a user can easily access their data from the other nodes, because duplicate copies of the blocks are present on the other nodes in the HDFS cluster.
iii) High Reliability
HDFS provides reliable data storage. It can store data in the range of 100s of petabytes. HDFS stores data reliably on a cluster. It divides the data into blocks. The Hadoop framework stores these blocks on nodes present in the HDFS cluster. HDFS stores data reliably by creating a replica of each and every block present in the cluster, and hence provides a fault tolerance facility. If the node in the cluster containing data goes down, then a user can easily access that data from the other nodes. HDFS by default creates 3 replicas of each block containing data present in the nodes. So, data is quickly available to the users, and the user does not face the problem of data loss. Thus, HDFS is highly reliable.
iv) Replication
Data replication is a unique feature of HDFS. Replication solves the problem of data loss in unfavorable conditions like hardware failure, crashing of nodes, etc. HDFS maintains the process of replication at regular intervals of time. HDFS also keeps creating replicas of user data on different machines present in the cluster. So, when any node goes down, the user can access the data from other machines. Thus, there is no possibility of losing user data.
v) Scalability
Hadoop HDFS stores data on multiple nodes in the cluster. So, whenever requirements increase, you can scale the cluster. Two scalability mechanisms are available in HDFS: vertical and horizontal scalability.
HDFS Architecture
Given below is the architecture of a Hadoop File System.
HDFS follows the master-slave architecture and it has the following elements.
a) Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. It is software that can be run on commodity hardware. The system having the namenode acts as the master server and it does the following tasks −
• Manages the file system namespace.
• Regulates clients' access to files.
• It also executes file system operations such as renaming, closing, and opening files and directories.
b) Datanode
The datanode is commodity hardware having the GNU/Linux operating system and datanode software. For every node (commodity hardware/system) in a cluster, there will be a datanode. These nodes manage the data storage of their system.
• Datanodes perform read-write operations on the file systems, as per client request.
• They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
c) Block
Generally the user data is stored in the files of HDFS. The file in a file system will be divided into one or more segments and/or stored in individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a Block. The default block size is 64MB, but it can be increased as per the need by changing the HDFS configuration.
Goals of HDFS
i) Fault detection and recovery − since HDFS includes a large number of commodity hardware, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.
ii) Huge datasets − HDFS should have hundreds of nodes per cluster to manage the applications having huge datasets.
iii) Hardware at data − A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, it reduces the network traffic and increases the throughput.
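As a small illustration of the namenode/datanode/block discussion above, the Hadoop FileSystem API can be asked where the blocks of a file actually live. The following is only a sketch: the path is illustrative, and it assumes a reachable HDFS configuration on the classpath.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);                // talks to the namenode
        Path file = new Path("/user/data/example.txt");      // illustrative path
        FileStatus status = fs.getFileStatus(file);
        // For each block, the namenode reports the datanodes holding a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset() + " hosts: "
                    + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}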
UNIT V  It is possible to disable verification of checksums by passing false to the setVerify Checksum() method on

HADOOP I/O FileSystem, before using the open() method to read a file. The same effect is possible from the shell by using

 Hadoop comes with a set of primitives for data I/O and the techniques that are more general than Hadoop, the -ignoreCrc option with the -get or the equivalent -copyToLocal command. This feature is useful if you

such as data integrity and compression, but deserve special consideration when dealing with multi-terabyte have a corrupt file that you want to inspect so you can decide what to do with it. For example, you might

datasets. want to see whether it can be salvaged before you delete it.

 Others are Hadoop tools or APIs that form the building blocks for developing distributed system, such as
5.1.2 LocalFileSystem
serialization frameworks and on-disk data structures
5.1. Data integrity
The Hadoop LocalFileSystem performs client-side checksumming. This means that when you write a file
The usual way of detecting corrupted data is by computing a checksum for the data when it first enters the system,
and again whenever it is transmitted across a channel that is unreliable and hence capable of corrupting the data. The called filename, the filesystem client transparently creates a hidden file, .filename.crc, in the same directory
data is deemed to be corrupt if the newly generated checksum doesn’t exactly match the original. This technique
containing the checksums for each chunk of the file. Like HDFS, the chunk size is controlled by the
doesn’t offer any way to fix the data—merely error detection. A commonly used error-detecting code is CRC-32
(cyclic redundancy check), which computes a 32-bit integer checksum for input of any size. io.bytes.per.checksum property, which defaults to 512 bytes. The chunk size is stored as metadata in the .crc file, so
the file can be read back correctly even if the setting for the chunk size has changed. Checksums are verified when
5.1.1 Data Integrity in HDFS
the file is read, and if an error is detected, LocalFileSystem throws a ChecksumException.
 HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A
Checksums are fairly cheap to compute (in Java, they are implemented in native code), typically adding a few
separate checksum is created for every io.bytes.per.checksum bytes of data. The default is 512 bytes, and since
percent overhead to the time to read or write a file. For most pay for data integrity. It is, however, possible to disable
a CRC-32 checksum is 4 bytes long, the storage overhead is less than 1%.
checksums: typically when the underlying filesystem supports checksums natively. This is accomplished by using
 Datanodes are responsible for verifying the data they receive before storing the data and its checksum. This
RawLocalFileSystem in place of Local FileSystem. To do this globally in an application, it suffices to remap the
applies to data that they receive from clients and from other datanodes during replication. A client writing
implementation for file URIs by setting the property fs.file.impl to the value
data sends it to a pipeline of datanodes and the last datanode in the pipeline verifies the checksum. If it
org.apache.hadoop.fs.RawLocalFileSystem. Alternatively, you can directly create a Raw LocalFileSystem instance,
detects an error, the client receives a ChecksumException, a subclass of IOException, which it should handle
which may be useful if you want to disable checksum verification for only some reads;
in an application-specific manner, by retrying the operation, for example.
For example:
 When clients read data from datanodes, they verify checksums as well, comparing them with the ones stored Configuration conf= ...
at the datanode. Each datanode keeps a persistent log of checksum verifications, so it knows the last time each FileSystemfs = new RawLocalFileSystem();
of its blocks was verified. When a client successfully verifies a block, it tells the datanode, which updates its fs.initialize(null, conf);
log. Keeping statistics such as these is valuable in detecting bad disks.
 Aside from block verification on client reads, each datanode runs a DataBlockScanner in a background thread 5.1.3 ChecksumFileSystem
that periodically verifies all the blocks stored on the datanode. This is to guard against corruption due to “bit LocalFileSystem uses ChecksumFileSystem to do its work, and this class makes it easy to add checksumming
rot” in the physical storage media. See “Datanode block scanner” for details on how to access the scanner to other (nonchecksummed) filesystems, as Checksum FileSystem is just a wrapper around FileSystem. The general
reports. idiom is as follows:
 Since HDFS stores replicas of blocks, it can “heal” corrupted blocks by copying one of the good replicas to FileSystemrawFs= ...
produce a new, uncorrupt replica. The way this works is that if a client detects an error when reading a block, FileSystemchecksummedFs = new ChecksumFileSystem(rawFs);

it reports the bad block and the datanode it was trying to read from to the namenode before throwing a The underlying filesystem is called the raw filesystem, and may be retrieved using the getRawFileSystem()
ChecksumException. The namenode marks the block replica as corrupt, so it doesn’t direct clients to it, or try method on checksumFileSystem. ChecksumFileSystem has a few more useful methods for working with checksums,
such as getChecksumFile() for getting the path of a checksum file for any file. Check the documentation for the
to copy this replica to another datanode. It then schedules a copy of the block to be replicated on another others.
datanode, so its replication factor is back at the expected level. Once this has happened, the corrupt replica is If an error is detected by ChecksumFileSystem when reading a file, it will call its reportChecksumFailure ()
method. The default implementation does nothing, but LocalFileSystem moves the offending file and its checksum
deleted. to a side directory on the same device called bad_files. Administrators should periodically check for these bad files
and take action on them.
5.2 Compression CompressionOutputStream out = codec.createOutputStream(System.out);
IOUtils.copyBytes(System.in, out, 4096, false);
out.finish();
Inferring CompressionCodecsusing CompressionCodecFactory

 If you are reading a compressed file, you can normally infer the codec to use by looking at its filename extension. A
file ending in .gzcan be read with GzipCodec, and so on.
 CompressionCodecFactoryprovides a way of mapping a filename extension to a compressionCodecusing its
getCodec() method, which takes a Path object for the file in question.
 All of the tools listed in Table 4-1 give some control over this trade-off at compression time by offering nine different  Following example shows an application that uses this feature to decompress files.
options String uri = args[0];
-1 means optimize for speed and Configuration conf = new Configuration();
FileSystemfs = FileSystem.get(URI.create(uri), conf);
-9 means optimize for space Path inputPath = new Path(uri);
e.g :--- gzip-1 file CompressionCodecFactory factory = new CompressionCodecFactory(conf);
CompressionCodec codec = factory.getCodec(inputPath);
The different tools have very different compression characteristics.Both gzip and ZIPare general-purpose if (codec == null)
{
compressors, and sit in the middle of the space/time trade-off.
System.err.println("No codec found for " + uri);
 Bzip2compresses more effectively than gzipor ZIP, but is slower. System.exit(1);
}
 LZOoptimizes for speed. It is faster than gzipand ZIP, but compresses slightly less effectively
String outputUri =
CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
5.2.1 Codecs InputStream in = null;
 A codec is the implementation of a compression-decompression algorithm OutputStream out = null;
try {
in = codec.createInputStream(fs.open(inputPath));
out = fs.create(new Path(outputUri));
IOUtils.copyBytes(in, out, conf);
}
Finally
{
 The LZO libraries are GPL-licensed and may not be included in Apache distributions, so for this reason the IOUtils.closeStream(in);
IOUtils.closeStream(out);
Hadoopcodecs must be downloaded separately from http://code.google.com/p/hadoop-gpl-compression/ }
Native libraries

Compressing and decompressing streams with CompressionCodec  For performance, it is preferable to use a native library for compression and decompression. For example, in
 CompressionCodechas two methods that allow you to easily compress or decompress data. one test, using the native gziplibraries reduced decompression times by up to 50%and compression times by
 To compress data being written to an output stream, use the createOutputStream(OutputStreamout) method to around 10%(compared to the built-in Java implementation).
create a CompressionOutputStream to which you write your uncompressed data to have it written in  Hadoopcomes with prebuilt native compression libraries for 32-and 64-bit Linux, which you can find in the
compressed form to the underlying stream. lib/native directory.
 To decompress data begin read from an input stream, call createIntputStream(InputStreamin) to obtain a  By default Hadooplooks for native libraries for the platform it is running on, and loads them automatically if
CompressionInputStream, which allows you to read uncompressed data from the underlying stream. they are found.
String codecClassname = args[0];
Class<?>codecClass = Class.forName(codecClassname);
Configuration conf = new Configuration();
CompressionCodec codec = (CompressionCodec)
ReflectionUtils.newInstance(codecClass, conf);
5.2.3 Using Compression in MapReduce

If your input files are compressed, they will be automatically decompressed as they are read by MapReduce,
using the filename extension to determine the codec to use.

For Example...

Native libraries –CodecPool

 If you are using a native library and you are doing a lot of compression or decompression in your application,
consider using CodecPool, which allows you to reuse compressors and decompressors, thereby amortizing the
cost of creating these objects.

String codecClassname = args[0];


Class<?>codecClass = Class.forName(codecClassname);
Configuration conf = new Configuration();
CompressionCodec codec = (CompressionCodec)
ReflectionUtils.newInstance(codecClass, conf); Compressing map output
Compressor compressor = null;
try  Even if your Map Reduce application reads and writes uncompressed data, it may
{ benefit from compressing the intermediate output of the map phase.
compressor = CodecPool.getCompressor(codec);  Since the map output is written to disk and transferred across the network to the reducer nodes,
CompressionOutputStream out = by using a fast compressor such as LZO, you can get performance gains simply because the
codec.createOutputStream(System.out, compressor); volume of data to transfer is reduced.
IOUtils.copyBytes(System.in, out, 4096, false);  Here are the lines to add to enable gzipmap output compression in your job:
out.finish();
}
finally
{
CodecPool.returnCompressor(compressor);
}
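The "Compressing map output" discussion nearby says "Here are the lines to add to enable gzip map output compression in your job", but the lines themselves were lost with the original figure. A sketch of what such a configuration typically looks like, using the classic Hadoop 1.x-era property names, is shown below; it is an assumption standing in for the missing figure, not a copy of it.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;

Configuration conf = new Configuration();
// Compress the intermediate map output before it is spilled to disk and
// shuffled across the network to the reducer nodes.
conf.setBoolean("mapred.compress.map.output", true);
conf.setClass("mapred.map.output.compression.codec",
        GzipCodec.class, CompressionCodec.class);
// This Configuration is then used to construct the job as usual.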

5.2.2 Compression and Input Splits

 When considering how to compress data that will be processed by MapReduce, it is important to understand 5.3 Serialization
whether the compression format supports splitting.
 Consider an uncompressed file stored in HDFS whose size is 1GB. With a HDFS block size of 64MB, the file  Serialization is the process of turning structured objects into a byte stream for transmission
will be stored as 16 blocks, and a Map Reduce job using this file as input will create 16 input splits, each over a network or for writing to persistent storage. Deserialization is the process of turning a byte stream back into a
processed independently as input to a separate map task. series of structured objects.
 Imagine now the file is a gzip-compressed file whose compressed size is 1GB. As before, HDFS will store the
file as 16 blocks. However, creating a split for each block won’t work since it is impossible to start reading at  In Hadoop, interprocess communication between nodes in the system is implemented using remote
an arbitrary point in the gzipstream, and therefore impossible for a map task to read its split independently of procedure calls(RPCs). The RPC protocol uses serialization to render the message into a binary stream to be
the others. sent to the remote node, which then deserializes the binary stream into the original message.
 In this case, Map Reduce will do the right thing, and not try to split the gzippedfile.This will work, but at the In general, it is desirable that an RPC serialization format is:
expense of locality. A single map will process the 16 HDFS blocks, most of which will not be local to the
map. Also, with fewer maps, the job is less granular, and so may take longer to run.  Compact: A compact format makes the best use of network bandwidth

 Fast: Interprocess communication forms the backbone for a distributed system, so it is essential
that there is as little performance overhead as possible for the serialization and deserialization process.

 Extensible: Protocols change over time to meet new requirements, so it should be straightforward to evolve
the protocol in a controlled manner for clients and servers.

 Interoperable : For some systems, it is desirable to be able to support clients that are written in
different languages to the server. Text
 Text is a Writable for UTF-8 sequences. It can be thought of as the Writable equivalent of java.lang.String.
5.3.1 Writable Interface  The Text class uses an int to store the number of bytes in the string encoding, so the maximum value is 2
GB. Furthermore, Text uses standard UTF -8, which makes it potentially easier to interpoperate with other
 The Writable interface defines two methods: one for writing its state to a DataOutput binary stream, and one tools that understand UTF-8.
for reading its state from a DataInput binary stream.  The Text class has several features.
 We will use IntWritable, a wrapper for a Java int. We can create one and set its value using the set() method:  Indexing
 Unicode
IntWritable writable = new IntWritable();  Iteration
writable.set(163);  Mutability
 To examine the serialized form of the IntWritable, we write a small helper method that wraps a  Resorting to String
java.io.ByteArrayOutputStream in a java.io.DataOutputStream to capture the bytes in the serialized stream
ByteArrayOutputStream out = new ByteArrayOutputStream(); Indexing
DataOutputStreamdataOut = new DataOutputStream(out);  Indexing for the Text class is in terms of position in the encoded byte sequence, not the Unicode character in the
writable.write(dataOut); string, or the Java char code unit. For ASCII String, these three concepts of index position coincide.
dataOut.close();
return out.toByteArray();
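For the reverse direction, a small helper that turns those bytes back into a Writable might look like the following sketch. It mirrors the serialize helper above and is not taken from the original text; it simply wraps the bytes in a java.io.DataInputStream and lets the Writable repopulate its fields.
public static byte[] deserialize(Writable writable, byte[] bytes) throws IOException {
    // The Writable reads its state back from the DataInput stream.
    ByteArrayInputStream in = new ByteArrayInputStream(bytes);
    DataInputStream dataIn = new DataInputStream(in);
    writable.readFields(dataIn);
    dataIn.close();
    return bytes;
}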
Text also has a find() method, which is analogous to String’s indexOf()
5.2.2 Writable Classes

 Hadoop comes with a large selection of Writable classes in the org.apache.hadoop.io package. They form the
class hierarchy shown in Figure 4-1.
Unicode
 When we start using characters that are encoded with more than a single byte, the differences between Text
and String become clear. Consider the Unicode characters shown in Table 4-7 All but the last character in the
table, U+10400, can be expressed using a single Java
char.

Iteration
Iterating over the Unicode characters in Text is complicated by the use of byte offsets for indexing, since
Writable Class you can’t just increment the index.
 The idiom for iteration is a little obscure: turn the Text object into a java.nio.ByteBuffer. Then repeatedly
 Writable wrappers for Java primitives call the bytesToCodePoint() static method on Text with the buffer. This method extracts the next code point as an
 There are Writable wrappers for all the Java primitive types except short and char. intand updates the position in the buffer.
 All have a get() and a set() method for retrieving and storing the wrapped value. For Example...
public class TextIterator
{
public static void main(String[] args)
{
Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");
ByteBufferbuf = ByteBuffer.wrap(t.getBytes(), 0, t.getLength());
Iteration. 102 | Chapter 4: Hadoop I/O intcp;
while (buf.hasRemaining() && (cp = Text.bytesToCodePoint(buf)) != -1)
{
System.out.println(Integer.toHexString(cp));
}
}
}
Writing a SequenceFile
Mutability
 To create a SequenceFile, use one of its createWriter() static methods, which returns a
Another difference with String is that Text is mutable. You can reuse a Text instance by calling on of the set() SequenceFile.Writerinstance.
methods on it.
For Example...  The keys and values stored in a SequenceFiledo not necessarily need to be Writable. Any types that can be
serialized and deserializedby a Serialization may be used.
Text t = new Text("hadoop");  Once you have a SequenceFile.Writer, you then write key-value pairs, using the append() method. Then
t.set("pig"); when you’ve finished you call the close() method (SequenceFile.Writerimplements java.io.Closeable)
assertThat(t.getLength(), is(3));
assertThat(t.getBytes().length, is(3)); For example...
IntWritable key = new IntWritable();
Restoring to String Text value = new Text();
SequenceFile.Writer writer = null;
 Text doesn’t have as rich an API for manipulating strings as java.lang.String , so in many cases you need to try { writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
convert the Text object to a String. for (int i = 0; i < 100; i++) { key.set(100 - i); value.set(DATA[i % DATA.length]);
System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
Null Writable writer.append(key, value); } } finally { IOUtils.closeStream(writer);

 NullWritable is a special type of Writable, as it has a zero -length serialization. No bytes are written to , or
read from , the stream. It is used as a placeholder. Reading a SequenceFile

 For example, in MapReduce , a key or a value can be declared as a NullWritable when you don’t need to use Reading sequence files from beginning to end is a matter of creating an instance of SequenceFile.Reader, and
that position-it effectively stores a constant empty value. iterating over records by repeatedlyinvoking one of the next() methods.
If you are using Writable types, you can use the next() method that takes a key and a value argument, and
 NullWritable can also be useful as a key in SequenceFile when you want to store a list of values, as opposed to reads the next key and value in the stream into these variables:
key-value pairs. It is an immutable singleton: the instance can be retrieved by calling NullWritable.get().
For example... public static void main(String[] args) throws IOException
5.2.4 Serialization Frameworks {
o Although most Map Reduce programs use Writable key and value types, this isn’t mandated by the Map String uri = args[0];
Reduce API. In fact, any types can be used, the only requirement is that there be a mechanism that translates to Configuration conf = new Configuration();
and from a binary representation of each type. FileSystemfs = FileSystem.get(URI.create(uri), conf);
 To support this, Hadoophas an API for pluggable serialization frameworks. A serialization framework is Path path = new Path(uri);
represented by an implementation of Serialization. WritableSerialization,for example, is the implementation of
Serialization for Writable types. SequenceFile.Reader reader = null;
 Although making it convenient to be able to use standard Java types in Map Reduce programs, like Integer or try
String, Java Object Serialization is not as efficient as Writable, so it’s not worth making this trade-off. {
5.4 File-Based data structure reader = new SequenceFile.Reader(fs, path, conf);
 For some applications, you need a specialized data structure to hold your data. For MapReduce-based Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
processing, putting each blob of binary data into its own file doesn’t scale, so Hadoopdeveloped a number of Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
higher-level containers for these situations. long position = reader.getPosition();
while (reader.next(key, value)) { String syncSeen = reader.syncSeen() ? "*" : "";
 Higher-level containers
System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, value);
o SequenceFile
position = reader.getPosition(); // beginning of next record
o MapFile
}
o
}
5.4.1 SequenceFile
Finally
 Imagine a logfile, where each log record is a new line of text. If you want to logbinary types, plain text isn’t a
{
suitable format.
IOUtils.closeStream(reader);
 Hadoop’sSequenceFileclass fits the bill in this situation, providing a persistent data structure for binary key- }
value pairs.To use it as a logfileformat, you would choose a key, such as timestamp represented by a
LongWritable, and the value is Writable that represents the quantity being logged.
 SequenceFilealso work well as containers for smaller files. HDFS and Map Reduce are optimized for large
files, so packing files into a SequenceFilemakes storing and processing the smaller files more efficient.
Reading a SequenceFile
 Reading sequence files from beginning to end is a matter of creating an instance of SequenceFile.Reader and iterating over records by repeatedly invoking one of the next() methods. If you are using Writable types, you can use the next() method that takes a key and a value argument, and reads the next key and value in the stream into these variables:

public static void main(String[] args) throws IOException {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);

    SequenceFile.Reader reader = null;
    try {
        reader = new SequenceFile.Reader(fs, path, conf);
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        long position = reader.getPosition();
        while (reader.next(key, value)) {
            String syncSeen = reader.syncSeen() ? "*" : "";
            System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, value);
            position = reader.getPosition(); // beginning of next record
        }
    } finally {
        IOUtils.closeStream(reader);
    }
}
5.4.2 MapFile
 A MapFile is a sorted SequenceFile with an index to permit lookups by key. A MapFile can be thought of as a persistent form of java.util.Map (although it doesn't implement this interface), which is able to grow beyond the size of a Map that is kept in memory.
Writing a MapFile
 Writing a MapFile is similar to writing a SequenceFile. You create an instance of MapFile.Writer, then call the append() method to add entries in order. Keys must be instances of WritableComparable, and values must be Writable.

For example:

String uri = args[0];
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
IntWritable key = new IntWritable();
Text value = new Text();
MapFile.Writer writer = null;
try {
    writer = new MapFile.Writer(conf, fs, uri, key.getClass(), value.getClass());
    for (int i = 0; i < 1024; i++) {
        key.set(i + 1);
        value.set(DATA[i % DATA.length]);   // DATA is a String[] of sample values defined elsewhere
        writer.append(key, value);
    }
} finally {
    IOUtils.closeStream(writer);
}
Reading a MapFile
 Iterating through the entries in order in a MapFile is similar to the procedure for a SequenceFile. You create a MapFile.Reader, then call the next() method until it returns false, signifying that no entry was read because the end of the file was reached:

public boolean next(WritableComparable key, Writable val) throws IOException

 A random-access lookup for a given key can be performed with the get() method:

public Writable get(WritableComparable key, Writable val) throws IOException

 The return value is used to determine whether an entry was found in the MapFile. If it's null, then no value exists for the given key. If the key was found, then the value for that key is read into val, as well as being returned from the method call.
 For this operation, the MapFile.Reader reads the index file into memory. A very large MapFile's index can take up a lot of memory. Rather than re-indexing to change the index interval, it is possible to load only a fraction of the index keys into memory when reading the MapFile by setting the io.map.index.skip property.
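As a small usage sketch (the key 496 is arbitrary, and fs, conf, and uri are assumed to be set up as in the writing example above), a single lookup might look like:

MapFile.Reader reader = new MapFile.Reader(fs, uri, conf);
Text value = new Text();
// get() returns null if the key is absent; otherwise it fills in value and returns it.
Writable entry = reader.get(new IntWritable(496), value);
if (entry == null) {
    System.out.println("key not found");
} else {
    System.out.println(value);
}
reader.close();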
Converting a SequenceFile to a MapFile
 One way of looking at a MapFile is as an indexed and sorted SequenceFile. So it's quite natural to want to be able to convert a SequenceFile into a MapFile.

For example:

// map is the Path of the MapFile directory; mapData is its data file (map/data)
SequenceFile.Reader reader = new SequenceFile.Reader(fs, mapData, conf);
Class keyClass = reader.getKeyClass();
Class valueClass = reader.getValueClass();
reader.close();

// Create the map file index file
long entries = MapFile.fix(fs, map, keyClass, valueClass, false, conf);
System.out.printf("Created MapFile %s with %d entries\n", map, entries);

Reading Data from a Hadoop URL

One of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL object to open a stream to read the data from. The general idiom is:

InputStream in = null;
try {
    in = new URL("hdfs://host/path").openStream();
    // process in
} finally {
    IOUtils.closeStream(in);
}

There's a little bit more work required to make Java recognize Hadoop's hdfs URL scheme. This is achieved by calling the setURLStreamHandlerFactory() method on URL with an instance of FsUrlStreamHandlerFactory. This method can be called only once per JVM, so it is typically executed in a static block. This limitation means that if some other part of your program — perhaps a third-party component outside your control — sets a URLStreamHandlerFactory, you won't be able to use this approach for reading data from Hadoop. The next section discusses an alternative.

Example 3-1 shows a program for displaying files from Hadoop filesystems on standard output, like the Unix cat command.

Example 3-1. Displaying files from a Hadoop filesystem on standard output using a URLStreamHandler
public class URLCat {

    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
 We make use of the handy IOUtils class that comes with Hadoop for closing the stream in the finally clause, and also for copying bytes between the input stream and the output stream (System.out, in this case). The last two arguments to the copyBytes() method are the buffer size used for copying and whether to close the streams when the copy is complete. We close the input stream ourselves, and System.out doesn't need to be closed.

 Here's a sample run:

% export HADOOP_CLASSPATH=hadoop-examples.jar
% hadoop URLCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.

Read Operation In HDFS

Data read request is served by HDFS, NameNode, and DataNode. Let's call the reader a 'client'. The diagram below depicts the file read operation in Hadoop.
1. A client initiates the read request by calling the 'open()' method of the FileSystem object; it is an object of type DistributedFileSystem.
2. This object connects to the namenode using RPC and gets metadata information such as the locations of the blocks of the file. Please note that these addresses are of the first few blocks of the file.
3. In response to this metadata request, the addresses of the DataNodes having a copy of that block are returned.
4. Once the addresses of the DataNodes are received, an object of type FSDataInputStream is returned to the client. FSDataInputStream contains DFSInputStream, which takes care of interactions with the DataNodes and the NameNode. In step 4 shown in the above diagram, the client invokes the 'read()' method, which causes DFSInputStream to establish a connection with the first DataNode holding the first block of the file.
5. Data is read in the form of streams wherein the client invokes the 'read()' method repeatedly. This read() process continues till it reaches the end of the block.
6. Once the end of a block is reached, DFSInputStream closes the connection and moves on to locate the next DataNode for the next block.
7. Once the client has finished reading, it calls the close() method. (A minimal sketch of this read path using the FileSystem API is shown below.)
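The steps above can be driven directly from the FileSystem API. The following is a hedged sketch (the class name FileSystemCat and the example URI are illustrative, not from the original notes) of opening a file with open(), which returns the FSDataInputStream described in step 4, and streaming it to standard output:

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];                          // e.g. hdfs://localhost/user/tom/quangle.txt
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));               // returns an FSDataInputStream (step 4 above)
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}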
Write Operation In HDFS

In this section, we see how data is written into HDFS through files.

1. A client initiates the write operation by calling the 'create()' method of the DistributedFileSystem object, which creates a new file - Step no. 1 in the above diagram.
2. The DistributedFileSystem object connects to the NameNode using an RPC call and initiates new file creation. However, this file create operation does not associate any blocks with the file. It is the responsibility of the NameNode to verify that the file (which is being created) does not already exist and that the client has correct permissions to create a new file. If the file already exists or the client does not have sufficient permission to create a new file, then an IOException is thrown to the client. Otherwise, the operation succeeds and a new record for the file is created by the NameNode.
3. Once a new record in the NameNode is created, an object of type FSDataOutputStream is returned to the client. The client uses it to write data into HDFS. The data write method is invoked (step 3 in the diagram).
4. FSDataOutputStream contains a DFSOutputStream object which looks after communication with the DataNodes and the NameNode. While the client continues writing data, DFSOutputStream continues creating packets with this data. These packets are enqueued into a queue called the DataQueue.
5. There is one more component called DataStreamer which consumes this DataQueue. The DataStreamer also asks the NameNode for allocation of new blocks, thereby picking desirable DataNodes to be used for replication.
6. Now, the process of replication starts by creating a pipeline using DataNodes. In our case, we have chosen a replication level of 3 and hence there are 3 DataNodes in the pipeline.
7. The DataStreamer pours packets into the first DataNode in the pipeline.
8. Every DataNode in the pipeline stores the packets received by it and forwards the same to the second DataNode in the pipeline.
9. Another queue, the 'Ack Queue', is maintained by DFSOutputStream to store packets which are waiting for acknowledgment from DataNodes.
10. Once acknowledgment for a packet in the queue is received from all DataNodes in the pipeline, it is removed from the 'Ack Queue'. In the event of any DataNode failure, packets from this queue are used to reinitiate the operation.
11. After the client is done with writing data, it calls a close() method (Step 9 in the diagram). The call to close() results in flushing the remaining data packets to the pipeline, followed by waiting for acknowledgment.
12. Once a final acknowledgment is received, the NameNode is contacted to tell it that the file write operation is complete. (A sketch of this write path using the FileSystem API follows below.)
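As a hedged sketch (the destination path and the text written are invented for illustration), the same create-and-write flow can be driven from the FileSystem API; create() returns the FSDataOutputStream mentioned in step 3:

// assumes the usual imports from org.apache.hadoop.conf, org.apache.hadoop.fs and org.apache.hadoop.io
String uri = "hdfs://localhost/user/tom/output.txt";   // hypothetical destination
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);

FSDataOutputStream out = null;
try {
    out = fs.create(new Path(uri));        // step 1: create() on the filesystem object
    out.writeBytes("hello, hdfs\n");       // data is buffered into packets and streamed to the pipeline
} finally {
    IOUtils.closeStream(out);              // close() flushes the remaining packets (steps 11-12)
}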
Access HDFS using JAVA API

In order to interact with Hadoop's filesystem programmatically, Hadoop provides multiple Java classes. The package org.apache.hadoop.fs contains classes useful in the manipulation of a file in Hadoop's filesystem. These operations include open, read, write, and close. In fact, the file API for Hadoop is generic and can be extended to interact with filesystems other than HDFS.

Reading a file from HDFS, programmatically

A java.net.URL object is used for reading the contents of a file. To begin with, we need to make Java recognize Hadoop's hdfs URL scheme. This is done by calling the setURLStreamHandlerFactory method on URL and passing it an instance of FsUrlStreamHandlerFactory. This method needs to be executed only once per JVM, hence it is enclosed in a static block.

An example code is:

public class URLCat {

    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

This code opens and reads the contents of a file. The path of this file on HDFS is passed to the program as a command-line argument.

Access HDFS Using COMMAND-LINE INTERFACE

This is one of the simplest ways to interact with HDFS. The command-line interface has support for filesystem operations like reading a file, creating directories, moving files, deleting data, and listing directories.

We can run '$HADOOP_HOME/bin/hdfs dfs -help' to get detailed help on every command. Here, 'dfs' is a shell command of HDFS which supports multiple subcommands.

Some of the widely used commands are listed below along with some details of each one.

1. Copy a file from the local filesystem to HDFS

$HADOOP_HOME/bin/hdfs dfs -copyFromLocal temp.txt /

This command copies the file temp.txt from the local filesystem to HDFS.
2. We can list files present in a directory using -ls

$HADOOP_HOME/bin/hdfs dfs -ls /

We can see the file 'temp.txt' (copied earlier) being listed under the '/' directory.

3. Command to copy a file to the local filesystem from HDFS

$HADOOP_HOME/bin/hdfs dfs -copyToLocal /temp.txt

We can see temp.txt copied to the local filesystem.

4. Command to create a new directory

$HADOOP_HOME/bin/hdfs dfs -mkdir /mydirectory

Starting HDFS

Initially you have to format the configured HDFS file system, open the namenode (HDFS server), and execute the following command.

$ hadoop namenode -format

After formatting the HDFS, start the distributed file system. The following command will start the namenode as well as the data nodes as a cluster.

$ start-dfs.sh

Listing Files in HDFS

After loading the information in the server, we can find the list of files in a directory, or the status of a file, using 'ls'. Given below is the syntax of ls; you can pass a directory or a filename as an argument.

$ $HADOOP_HOME/bin/hadoop fs -ls <args>

Inserting Data into HDFS

Assume we have data in a file called file.txt in the local system which ought to be saved in the HDFS file system. Follow the steps given below to insert the required file into the Hadoop file system.

Step 1

You have to create an input directory.

$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input

Step 2

Transfer and store a data file from the local system to the Hadoop file system using the put command.

$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input

Step 3

You can verify the file using the ls command.

$ $HADOOP_HOME/bin/hadoop fs -ls /user/input

Retrieving Data from HDFS

Assume we have a file in HDFS called outfile. Given below is a simple demonstration for retrieving the required file from the Hadoop file system.

Step 1

Initially, view the data from HDFS using the cat command.

$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile

Step 2

Get the file from HDFS to the local file system using the get command.

$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/

Shutting Down the HDFS

You can shut down the HDFS by using the following command.

$ stop-dfs.sh
1. Create a directory in HDFS at given path(s).

Usage:
hadoop fs -mkdir <paths>
Example:
hadoop fs -mkdir /user/saurzcode/dir1 /user/saurzcode/dir2

2. List the contents of a directory.

Usage:
hadoop fs -ls <args>
Example:
hadoop fs -ls /user/saurzcode

3. Upload and download a file in HDFS.

Upload:
hadoop fs -put:
Copy single src file, or multiple src files, from the local file system to the Hadoop data file system.
Usage:
hadoop fs -put <localsrc> ... <HDFS_dest_Path>
Example:
hadoop fs -put /home/saurzcode/Samplefile.txt /user/saurzcode/dir3/

Download:
hadoop fs -get:
Copies/downloads files to the local file system.
Usage:
hadoop fs -get <hdfs_src> <localdst>
Example:
hadoop fs -get /user/saurzcode/dir3/Samplefile.txt /home/

4. See contents of a file

Same as the Unix cat command:
Usage:
hadoop fs -cat <path[filename]>
Example:
hadoop fs -cat /user/saurzcode/dir1/abc.txt

5. Copy a file from source to destination

This command allows multiple sources as well, in which case the destination must be a directory.
Usage:
hadoop fs -cp <source> <dest>
Example:
hadoop fs -cp /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2

6. Copy a file from/to the local file system to/from HDFS

copyFromLocal
Usage:
hadoop fs -copyFromLocal <localsrc> URI
Example:
hadoop fs -copyFromLocal /home/saurzcode/abc.txt /user/saurzcode/abc.txt
Similar to the put command, except that the source is restricted to a local file reference.

copyToLocal
Usage:
hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Similar to the get command, except that the destination is restricted to a local file reference.

7. Move file from source to destination.

Note: moving files across filesystems is not permitted.
Usage:
hadoop fs -mv <src> <dest>
Example:
hadoop fs -mv /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2

8. Remove a file or directory in HDFS.

Removes the files specified as argument. Deletes a directory only when it is empty.
Usage:
hadoop fs -rm <arg>
Example:
hadoop fs -rm /user/saurzcode/dir1/abc.txt

Recursive version of delete:
Usage:
hadoop fs -rmr <arg>
Example:
hadoop fs -rmr /user/saurzcode/

9. Display last few lines of a file.

Similar to the tail command in Unix.
Usage:
hadoop fs -tail <path[filename]>
Example:
hadoop fs -tail /user/saurzcode/dir1/abc.txt

10. Display the aggregate length of a file.

Usage:
hadoop fs -du <path>
Example:
hadoop fs -du /user/saurzcode/dir1/abc.txt