Data Structures and Algorithms (2 Files Merged)
Static data structure: a data structure whose size is allocated at compile time; therefore, the maximum size is fixed.
Dynamic data structure: a data structure whose size is allocated at run time; therefore, the maximum size is flexible.
Major Operations
The major or the common operations that can be performed on the data structures
are:
Primitive Data Structure
The primitive data structures are the primitive data types. int, char, float, double, and pointer are primitive data structures that can hold a single value.

Non-Primitive Data Structure

Advantages of Data Structures
The following are the advantages of a data structure:
o Efficiency: If the choice of a data structure for implementing a particular ADT is proper, it makes the program very efficient in terms of time and space.
o Reusability: A data structure can be reused, meaning that multiple client programs can use the same data structure.
o Abstraction: The data structure specified by an ADT also provides a level of abstraction. The client cannot see the internal working of the data structure, so it does not have to worry about the implementation part; the client can only see the interface.

Arrays and Sequential Representation
Definition
o Arrays are defined as a collection of similar types of data items stored at contiguous memory locations.
o Arrays are a derived data type in the C programming language which can store primitive types of data such as int, char, double, float, etc.
o The array is the simplest data structure, where each data element can be randomly accessed by using its index number.
o For example, if we want to store the marks of a student in 6 subjects, we do not need to define a different variable for the marks in each subject. Instead, we can define an array which stores the marks for each subject at contiguous memory locations.

The array marks[10] defines the marks of the student in 10 different subjects, where each subject's marks are located at a particular subscript in the array, i.e. marks[0] denotes the marks in the first subject, marks[1] denotes the marks in the second subject, and so on.

Memory Allocation of the Array
As mentioned above, all the data elements of an array are stored at contiguous locations in the main memory. The name of the array represents the base address, i.e. the address of the first element in the main memory. Each element of the array is accessed by a proper index.

The indexing of the array can be defined in three ways:
1. Zero-based indexing: the first element of the array is arr[0].
2. One-based indexing: the first element of the array is arr[1].
3. n-based indexing: the first element of the array can reside at any arbitrary index number.

In the following image, we have shown the memory allocation of an array arr of size 5. The array follows the 0-based indexing approach. The base address of the array is the 100th byte; this is the address of arr[0]. Here, the size of int is 4 bytes, therefore each element takes 4 bytes in memory.
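As a small illustration of this contiguous layout, the following C sketch (the marks values are arbitrary, and the printed addresses will of course differ on a real machine) shows that the address of marks[i] is always the base address plus i times the element size.

#include <stdio.h>

int main(void)
{
    int marks[5] = {90, 75, 83, 68, 95};   /* stored at contiguous locations */
    int i;

    for (i = 0; i < 5; i++) {
        /* &marks[i] is always (char *)marks + i * sizeof(int) */
        printf("marks[%d] = %d stored at %p (offset %ld bytes)\n",
               i, marks[i], (void *)&marks[i],
               (long)((char *)&marks[i] - (char *)marks));
    }
    return 0;
}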
How to declare a 2D Array
The syntax of declaring a two-dimensional array is very similar to that of a one-dimensional array, given as follows.

int arr[max_rows][max_columns];

However, it produces a data structure which looks like the following.

An element of the 2D array is accessed as

int x = a[i][j];

where i and j are the row and column numbers of the cell respectively.

We can assign each cell of a 2D array to 0 by using the following code:

for (int i = 0; i < n; i++)
{
    for (int j = 0; j < n; j++)
    {
        a[i][j] = 0;
    }
}
Initializing 2D Arrays
We know that, when we declare and initialize one dimensional array in C
programming simultaneously, we don't need to specify the size of the
array. However this will not work with 2D arrays. We will have to define
at least the second dimension of the array.
There are two main techniques of storing 2D array elements into memory.

1. Row Major ordering
In row major ordering, all the rows of the 2D array are stored into memory contiguously: first the 1st row of the array is stored into memory completely, then the 2nd row, and so on till the last row.
The address of an element a[i][j] is calculated as

Address(a[i][j]) = B.A. + (i * n + j) * size

where B.A. is the base address, i.e. the address of the first element of the array a[0][0], and n is the number of columns.

Example:
a[10...30, 55...75], base address of the array (BA) = 0, size of an element = 4 bytes. Find the location of a[15][68].
Here the number of columns per row is 75 - 55 + 1 = 21, so
Address(a[15][68]) = 0 + ((15 - 10) x (75 - 55 + 1) + (68 - 55)) x 4
                   = (5 x 21 + 13) x 4
                   = 118 x 4
                   = 472

2. Column Major ordering
In column major ordering, the columns are stored contiguously: first the 1st column of the array is stored into memory completely, then the 2nd column, and so on till the last column.
If the array is declared as a[m][n], where m is the number of rows and n is the number of columns, then the address of an element a[i][j] stored in column major order is calculated as

Address(a[i][j]) = B.A. + (j * m + i) * size

Example:
A[-5 ... +20][20 ... 70], BA = 1020, size of an element = 8 bytes. Find the location of a[0][30].
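To make the two formulas concrete, here is a small C sketch that computes the byte address of a[i][j] for arrays whose index ranges do not start at 0. The function names and parameter layout are my own; the bounds used in main are the ones from the two examples above.

#include <stdio.h>

/* base: address of the first element; lr/lc: lower bounds of the row and
   column indices; cols/rows: total number of columns/rows; size: element size */
long row_major_addr(long base, int i, int j, int lr, int lc, int cols, int size)
{
    return base + ((long)(i - lr) * cols + (j - lc)) * size;
}

long col_major_addr(long base, int i, int j, int lr, int lc, int rows, int size)
{
    return base + ((long)(j - lc) * rows + (i - lr)) * size;
}

int main(void)
{
    /* a[10..30, 55..75], BA = 0, size = 4: 21 columns per row */
    printf("row major    a[15][68] = %ld\n",
           row_major_addr(0, 15, 68, 10, 55, 75 - 55 + 1, 4));      /* 472  */

    /* A[-5..20, 20..70], BA = 1020, size = 8: 26 rows per column */
    printf("column major A[0][30]  = %ld\n",
           col_major_addr(1020, 0, 30, -5, 20, 20 - (-5) + 1, 8));  /* 3140 */
    return 0;
}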
Ordered list
An ordered list is a list in which the order of the items is significant. However, the items in an ordered list are not necessarily sorted. Consequently, it is possible to change the order of items and still have a valid ordered list.
Consider a list of the titles of the chapters in this book. The order of the items in the list corresponds to the order in which they appear in the book. However, since the chapter titles are not sorted alphabetically, we cannot consider the list to be sorted. Since it is possible to change the order of the chapters in the book, we must be able to do the same with the items of the list. As a result, we may insert an item into an ordered list at any position.
A searchable container is a container that supports the following additional operations:
isMember − used to test whether a given object instance is in the container.

Stack
A real-world stack allows operations at one end only. For example, we can place or remove a card or plate from the top of the stack only. At any given time, we can only access the top element of a stack.
This feature makes it a LIFO data structure. LIFO stands for Last-In-First-Out. Here, the element which is placed (inserted or added) last is accessed first. In stack terminology, the insertion operation is called PUSH and the removal operation is called POP.
A stack can be implemented by means of an array, structure, pointer, or linked list. A stack can either be of a fixed size or have a sense of dynamic resizing. Here, we are going to implement the stack using arrays, which makes it a fixed-size stack implementation.

Basic Operations
Stack operations may involve initializing the stack, using it and then de-initializing it. Apart from these basic things, a stack is used for the following two primary operations:
push() − pushing (storing) an element on the stack.
pop() − removing (accessing) an element from the stack.
Algorithm for dequeue operation
procedure dequeue
   if queue is empty
      return underflow
   end if
   data = queue[front]
   front ← front + 1
   return true
end procedure
Algorithm for enqueue operation
procedure enqueue(data)
   if queue is full
      return overflow
   end if
   rear ← rear + 1
   queue[rear] ← data
   return true
end procedure

Dequeue Operation
Accessing data from the queue is a process of two tasks − access the data where front is pointing and remove the data after access. The following steps are taken to perform the dequeue operation (a C sketch follows these steps) −
Step 1 − Check if the queue is empty.
Step 2 − If the queue is empty, produce underflow error and exit.
Step 3 − If the queue is not empty, access the data where front is pointing.
Step 4 − Increment the front pointer to point to the next available data element.
Step 5 − Return success.
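A minimal C sketch of the enqueue and dequeue procedures above for a fixed-size linear queue. The names queue, front, rear and MAX are assumptions of this sketch, not identifiers from the original notes.

#include <stdio.h>

#define MAX 100

int queue[MAX];
int front = 0;        /* index of the first element             */
int rear  = -1;       /* index of the last element, -1 if none  */

int enqueue(int data)
{
    if (rear == MAX - 1)             /* queue is full: overflow   */
        return 0;
    rear = rear + 1;
    queue[rear] = data;
    return 1;
}

int dequeue(int *data)
{
    if (front > rear)                /* queue is empty: underflow */
        return 0;
    *data = queue[front];
    front = front + 1;
    return 1;
}

int main(void)
{
    int x;
    enqueue(10);
    enqueue(20);
    while (dequeue(&x))
        printf("%d\n", x);           /* prints 10 then 20 */
    return 0;
}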
Evaluation of Expression
The way to write an arithmetic expression is known as a notation. An arithmetic expression can be written in three different but equivalent notations, i.e., without changing the essence or output of the expression. These notations are −
Infix Notation
Prefix (Polish) Notation
Postfix (Reverse-Polish) Notation
These notations are named according to how they use the operator in the expression. We shall learn the same here in this chapter.

Infix Notation
We write expressions in infix notation, e.g. a - b + c, where operators are used in between operands. It is easy for humans to read, write, and speak in infix notation, but the same does not go well with computing devices. An algorithm to process infix notation could be difficult and costly in terms of time and space consumption.

Prefix Notation
In this notation, the operator is prefixed to the operands, i.e. the operator is written ahead of the operands. For example, +ab. This is equivalent to its infix notation a + b. Prefix notation is also known as Polish Notation.

Postfix Notation
This notation style is known as Reversed Polish Notation. In this notation style, the operator is postfixed to the operands, i.e. the operator is written after the operands. For example, ab+. This is equivalent to its infix notation a + b.

The following table briefly shows the difference in all three notations −

Sr.No.  Infix Notation   Prefix Notation   Postfix Notation
1       a + b            +ab               ab+
2       (a + b) ∗ c      ∗+abc             ab+c∗
3       a ∗ (b + c)      ∗a+bc             abc+∗

Precedence
For example, a + b * c is evaluated as a + (b * c). As the multiplication operation has precedence over addition, b * c will be evaluated first. A table of operator precedence is provided later.

Associativity
Associativity describes the rule where operators with the same precedence appear in an expression. For example, in the expression a + b − c, both + and − have the same precedence; which part of the expression is evaluated first is determined by the associativity of those operators. Here, both + and − are left associative, so the expression is evaluated as (a + b) − c.
Precedence and associativity determine the order of evaluation of an expression. Following is an operator precedence and associativity table (highest to lowest) −

Sr.No.  Operator                                Precedence       Associativity
1       Exponentiation ( ^ )                    Highest          Right Associative
2       Multiplication ( ∗ ) & Division ( / )   Second Highest   Left Associative
3       Addition ( + ) & Subtraction ( − )      Lowest           Left Associative
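Tying these notations back to the stack operations described earlier, the sketch below evaluates a postfix expression of single-digit operands using an array stack. The expression string and the function name are illustrative assumptions, not part of the notes.

#include <stdio.h>
#include <ctype.h>

/* Evaluate a postfix expression containing single-digit operands. */
int eval_postfix(const char *exp)
{
    int stack[100], top = -1;
    int a, b;

    for (; *exp != '\0'; exp++) {
        if (isdigit((unsigned char)*exp)) {
            stack[++top] = *exp - '0';      /* push operand      */
        } else {
            b = stack[top--];               /* pop right operand */
            a = stack[top--];               /* pop left operand  */
            switch (*exp) {
            case '+': stack[++top] = a + b; break;
            case '-': stack[++top] = a - b; break;
            case '*': stack[++top] = a * b; break;
            case '/': stack[++top] = a / b; break;
            }
        }
    }
    return stack[top];
}

int main(void)
{
    /* "231*+9-" is the postfix form of 2 + 3 * 1 - 9 = -4 */
    printf("%d\n", eval_postfix("231*+9-"));
    return 0;
}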
Example
Two stacks in a single array:
Stack 1 expands from the 0th element to the right.
Stack 2 expands from the 6th element to the left.
As long as the value of Top 1 is less than 6 and greater than 0, Stack 1 has free elements to input data in the array.
As long as the value of Top 2 is less than 11 and greater than 5, Stack 2 has free elements to input data in the array.
When the value of Top 1 is 5, Stack 1 is full.
When the value of Top 2 is 10, Stack 2 is full.
Elements –1 and –2 are used to store the size of Stack 1 and the subscript of the array for Top 1, needed to manipulate Stack 1.
Elements –3 and –4 are used to store the size of Stack 2 and the subscript of the array for Top 2, needed to manipulate Stack 2.

procedure ADD(i, X)            //add element X to the i'th stack, 1 ≤ i ≤ n//
   if T(i) = B(i + 1) then call STACK_FULL(i)
   T(i) ← T(i) + 1
   V(T(i)) ← X                 //add X to the i'th stack//
end ADD
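A C sketch of the special case n = 2, i.e. two stacks growing toward each other in one array. The array size and the function names are assumptions of this sketch.

#include <stdio.h>

#define SIZE 12

int v[SIZE];
int top1 = -1;        /* Stack 1 grows to the right from index 0       */
int top2 = SIZE;      /* Stack 2 grows to the left from index SIZE - 1 */

int push1(int x)
{
    if (top1 + 1 == top2)        /* the two tops meet: no free element */
        return 0;
    v[++top1] = x;
    return 1;
}

int push2(int x)
{
    if (top2 - 1 == top1)
        return 0;
    v[--top2] = x;
    return 1;
}

int pop1(int *x) { if (top1 < 0)     return 0; *x = v[top1--]; return 1; }
int pop2(int *x) { if (top2 >= SIZE) return 0; *x = v[top2++]; return 1; }

int main(void)
{
    int x;
    push1(1); push1(2);             /* Stack 1 holds 1, 2              */
    push2(9);                       /* Stack 2 holds 9 (at index 11)   */
    pop1(&x); printf("%d\n", x);    /* 2 */
    pop2(&x); printf("%d\n", x);    /* 9 */
    return 0;
}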
1. Variable size of the queue:
Queue 1 expands from the 0th element to the right and circles back to the 0th element.
Queue 2 expands from the 8th element to the left and circles back to the 8th element.
There is a temporary boundary between Queue 1 and Queue 2; as long as the array has free elements, the boundary can shift.
Free elements could be anywhere in the queue, such as before the front, after the rear, and between front and rear.
Queue 1's and Queue 2's sizes can be changed if necessary. When Queue 1 is full and Queue 2 has free space, Queue 1 can increase its size to use that free space from Queue 2, and the same applies to Queue 2.
Elements –1, –2, and –3 are used to store the size of Queue 1, the front of Queue 1, and the data count for Queue 1, needed to manipulate Queue 1.
Elements –4, –5, and –6 are used to store the size of Queue 2, the front of Queue 2, and the data count for Queue 2, needed to manipulate Queue 2.
Inserting data into Queue 1: Q1Rear = (Q1Front + Q1count) % Q1Size
Inserting data into Queue 2: Q2Rear = (Q2Front + Q2count) % Q2Size + Q1Size
Deleting data from Queue 1: Q1Front = (Q1Front + 1) % Q1Size
Deleting data from Queue 2: Q2Front = (Q2Front + 1) % Q2Size + Q1Size

2. Fixed size of the queue:
As per the above illustration, the following are the important points to be considered.
Queue 1 expands from the 0th element to the 4th element and circles back to the 0th element.
Queue 2 expands from the 8th element to the 5th element and circles back to the 8th element.
The boundary between Queue 1 and Queue 2 is fixed.
Free elements could be anywhere in the queue, such as before the front, after the rear, and between front and rear.
Elements –1, –2, and –3 are used to store the size of Queue 1, the front of Queue 1, and the data count for Queue 1, needed to manipulate Queue 1.
Elements –4, –5, and –6 are used to store the size of Queue 2, the front of Queue 2, and the data count for Queue 2, needed to manipulate Queue 2.
Inserting data into Queue 1: Q1Rear = (Q1Front + Q1count) % Q1Size
Inserting data into Queue 2: Q2Rear = (Q2Front + Q2count) % Q2Size + Q1Size
Deleting data from Queue 1: Q1Front = (Q1Front + 1) % Q1Size
Deleting data from Queue 2: Q2Front = (Q2Front + 1) % Q2Size + Q1Size

Linked List
A linked list contains a link element called first.
Each link carries a data field(s) and a link field called next.
Each link is linked with its next link using its next link.
The last link carries a link as null to mark the end of the list.

Types of Linked List
Following are the various types of linked list.
Simple Linked List − item navigation is forward only.
Doubly Linked List − items can be navigated forward and backward.
Circular Linked List − the last item contains a link to the first element as next, and the first element has a link to the last element as previous.
Basic Operations
Following are the basic operations supported by a list.
Insertion − Adds an element at the beginning of the list.
Deletion − Deletes an element at the beginning of the list.
Display − Displays the complete list.
Search − Searches an element using the given key.
Delete − Deletes an element using the given key.
A linked list is a sequence of data structures which are connected together via links. A linked list is a sequence of links which contain items; each link contains a connection to another link. The linked list is the second most-used data structure after the array. Following are the important terms to understand the concept of a linked list.
Link − Each link of a linked list can store a data item called an element.

Insertion Operation
Adding a new node to a linked list is a more-than-one-step activity. We shall learn this with diagrams here. First, create a node using the same structure and find the location where it has to be inserted.
Imagine that we are inserting a node B (NewNode) between A (LeftNode) and C (RightNode). Then point B.next to C −
NewNode.next −> RightNode;
It should look like this −
Before we implement the actual operations, we first need to set up an empty list. Perform the following steps before implementing the actual operations.
Step 1 - Include all the header files which are used in the program.
Step 2 - Declare all the user defined functions.
Step 3 - Define a Node structure with two members data and next
Step 4 - Define a Node pointer 'head' and set it to NULL.
Step 5 - Implement the main method by displaying operations menu and make suitable
function calls in the main method to perform user selected operation.
What is a Single Linked List?
Simply, a list is a sequence of data, and a linked list is a sequence of data linked with each other.
The formal definition of a single linked list is as follows...
A single linked list is a sequence of elements in which every element has a link to its next element in the sequence.
In any single linked list, the individual element is called a "Node". Every "Node" contains two fields, the data field and the next field. The data field is used to store the actual value of the node and the next field is used to store the address of the next node in the sequence.
The graphical representation of a node in a single linked list is as follows...

Insertion
In a single linked list, the insertion operation can be performed in three ways. They are as follows...
1. Inserting at the beginning of the list
2. Inserting at the end of the list
3. Inserting at a specific location in the list

Inserting at the Beginning of the list
We can use the following steps to insert a new node at the beginning of the single linked list (a C sketch follows these steps)...
Step 1 - Create a newNode with the given value.
Step 2 - Check whether the list is Empty (head == NULL).
Step 3 - If it is Empty then, set newNode→next = NULL and head = newNode.
Step 4 - If it is Not Empty then, set newNode→next = head and head = newNode.
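A minimal C sketch of these four steps; the struct layout and the function name are my own choices, not identifiers from the notes.

#include <stdio.h>
#include <stdlib.h>

struct Node {
    int value;                  /* data field                 */
    struct Node *next;          /* address of the next node   */
};

struct Node *head = NULL;       /* empty list                 */

void insertAtBeginning(int value)
{
    /* Step 1 - create a newNode with the given value */
    struct Node *newNode = malloc(sizeof(struct Node));
    newNode->value = value;

    if (head == NULL) {         /* Steps 2-3 - list is empty  */
        newNode->next = NULL;
        head = newNode;
    } else {                    /* Step 4 - list is not empty */
        newNode->next = head;
        head = newNode;
    }
}

int main(void)
{
    insertAtBeginning(30);
    insertAtBeginning(20);
    insertAtBeginning(10);
    for (struct Node *p = head; p != NULL; p = p->next)
        printf("%d ", p->value);    /* prints 10 20 30 */
    printf("\n");
    return 0;
}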
Inserting at the End of the list
We can use the following steps to insert a new node at the end of the single linked list...
Step 1 - Create a newNode with the given value and newNode → next as NULL.
Step 2 - Check whether the list is Empty (head == NULL).
Step 3 - If it is Empty then, set head = newNode.
Step 4 - If it is Not Empty then, define a node pointer temp and initialize it with head.
Step 5 - Keep moving temp to its next node until it reaches the last node in the list (until temp → next is equal to NULL).
Step 6 - Set temp → next = newNode.

Inserting at a Specific location in the list (After a Node)
We can use the following steps to insert a new node after a given node in the single linked list...

Deleting from the End of the list
We can use the following steps to delete a node from the end of the single linked list...
Step 1 - Check whether the list is Empty (head == NULL).
Step 2 - If it is Empty then, display 'List is Empty!!! Deletion is not possible' and terminate the function.
Step 3 - If it is Not Empty then, define two Node pointers 'temp1' and 'temp2' and initialize 'temp1' with head.
Step 4 - Check whether the list has only one Node (temp1 → next == NULL).
Step 5 - If it is TRUE, then set head = NULL and delete temp1, and terminate the function. (Setting the Empty list condition.)
Step 6 - If it is FALSE, then set 'temp2 = temp1' and move temp1 to its next node. Repeat the same until temp1 reaches the last node in the list (until temp1 → next == NULL).
Step 7 - Finally, set temp2 → next = NULL and delete temp1.
Adding a node to the stack is referred to as a push operation. Pushing an element onto a stack in the linked list implementation is different from that of an array implementation. In order to push an element onto the stack, the following steps are involved.
1. Create a node first and allocate memory to it.

Deleting a node from the stack (POP operation)
1. Check for the underflow condition: The underflow condition occurs when we try to pop from an already empty stack. The stack will be empty if the head pointer of the list points to null.
2. Adjust the head pointer accordingly: In a stack, elements are popped only from one end; therefore, the value stored in the head pointer must be deleted and the node must be freed. The next node of the head node now becomes the head node.
Display the nodes (Traversing)
Displaying all the nodes of a stack needs traversing all the nodes of the linked list organized in the form of a stack. For this purpose, we follow these steps.
1. Copy the head pointer into a temporary pointer.
2. Move the temporary pointer through all the nodes of the list and print the value field attached to every node.

Deletion in a linked queue
The deletion operation removes the element that was inserted first among all the queue elements. Firstly, we need to check whether the list is empty or not. The condition front == NULL becomes true if the list is empty; in this case, we simply write underflow on the console and exit. Otherwise, we delete the element that is pointed to by the pointer front. For this purpose, copy the node pointed to by the front pointer into the pointer ptr. Now shift the front pointer to point to its next node and free the node pointed to by ptr.
Linked Queue
In a linked queue, each node of the queue consists of two parts, i.e. a data part and a link part. Each element of the queue points to its immediate next element in the memory.
In the linked queue, there are two pointers maintained in the memory, i.e. the front pointer and the rear pointer. The front pointer contains the address of the starting element of the queue, while the rear pointer contains the address of the last element of the queue.
Insertions and deletions are performed at the rear and front end respectively. If front and rear are both NULL, it indicates that the queue is empty.
The linked representation of a queue is shown in the following figure.

Unit II
Trees – Binary tree representations – Tree Traversal – Threaded Binary Trees – Binary Tree Representation of Trees – Graphs and Representations – Traversals, Connected Components and Spanning Trees – Shortest Paths and Transitive Closure – Activity Networks – Topological Sort and Critical Paths.

Trees: Non-Linear data structure
A data structure is said to be linear if its elements form a sequence or a linear list. The linear data structures that we have studied, like arrays, stacks, queues and linked lists, organize data in linear order. A data structure is said to be non-linear if its elements form a hierarchical classification where data items appear at various levels.
Trees and Graphs are widely used non-linear data structures. Tree and graph structures
represent hierarchical relationship between individual data elements. Graphs are nothing but
trees with certain restrictions removed.
Trees represent a special case of more general structures known as graphs. In a graph, there are no restrictions on the number of links that can enter or leave a node, and cycles may be present in the graph. The figure shows a tree and a non-tree.
Advantages of trees
Trees are so useful and frequently used because they have some very serious advantages:
Trees reflect structural relationships in the data
Trees are used to represent hierarchies
Trees provide efficient insertion and searching
Trees are very flexible, allowing sub-trees to be moved around with minimum effort
Introduction: Tree Terminology
In a tree, every individual element is called a Node. A node in a tree data structure stores the actual data of that particular element and links to the next elements in the hierarchical structure.
Example
1. Root
In a tree data structure, the first node is called as Root Node. Every tree must have root node. We
can say that root node is the origin of tree data structure. In any tree, there must be only one root
node. We never have multiple root nodes in a tree. In above tree, A is a Root node
2. Edge
In a tree data structure, the connecting link between any two nodes is called as EDGE. In a tree
with 'N' number of nodes there will be a maximum of 'N-1' number of edges.
3. Parent
In a tree data structure, the node which is predecessor of any node is called as PARENT NODE.
In simple words, the node which has branch from it to any other node is called as parent node.
Parent node can also be defined as "The node which has child / children". e.g., Parent (A,B,C,D).
4. Child
In a tree data structure, the node which is descendant of any node is called as CHILD Node. In
simple words, the node which has a link from its parent node is called as child node. In a tree, any
parent node can have any number of child nodes. In a tree, all the nodes except root are child
nodes. e.g., Children of D are (H, I,J).
5. Siblings
In a tree data structure, nodes which belong to same Parent are called as SIBLINGS. In simple
words, the nodes with same parent are called as Sibling nodes. Ex: Siblings (B,C, D)
6. Leaf
In a tree data structure, a node which does not have a child (or a node with degree zero) is called a LEAF node. In simple words, a leaf is a node with no child. In a tree data structure, leaf nodes are also called External Nodes or 'Terminal' nodes. Ex: (K, L, F, G, M, I, J)
7. Internal Nodes
In a tree data structure, the node which has atleast one child is called as INTERNAL Node. In
simple words, an internal node is a node with atleast one child. In a tree data structure, nodes
other than leaf nodes are called as Internal Nodes. The root node is also said to be Internal Node
if the tree has more than one node. Internal nodes are also called as 'Non-Terminal' nodes.
Ex:B,C,D,E,H
8. Degree
In a tree data structure, the total number of children of a node (or the number of subtrees of a node) is called the DEGREE of that node. In simple words, the degree of a node is the total number of children it has. The highest degree of a node among all the nodes in a tree is called the 'Degree of the Tree'.

Tree Representations
A tree data structure can be represented in two methods. Those methods are as follows...
1. List Representation
2. Left Child - Right Sibling Representation
Consider the following tree...
9. Level
In a tree data structure, the root node is said to be at Level 0, the children of the root node are at Level 1, and the children of the nodes which are at Level 1 are at Level 2, and so on... In simple words, in a tree each step from top to bottom is called a Level, and the Level count starts with '0' and is incremented by one at each level (step). Some authors start the root level with 1.

10. Height
In a tree data structure, the total number of edges from a leaf node to a particular node in the longest path is called the HEIGHT of that node. In a tree, the height of the root node is said to be the height of the tree. In a tree, the height of all leaf nodes is '0'.

11. Depth
In a tree data structure, the total number of edges from the root node to a particular node is called the DEPTH of that node. In a tree, the total number of edges from the root node to a leaf node in the longest path is said to be the Depth of the tree. In simple words, the highest depth of any leaf node in a tree is said to be the depth of that tree. In a tree, the depth of the root node is '0'.

12. Path
In a tree data structure, the sequence of nodes and edges from one node to another node is called the PATH between those two nodes. The length of a path is the total number of nodes in that path. In the example below, the path A - B - E - J has length 4.

13. Subtree
In a tree data structure, each child of a node forms a subtree recursively. Every child node forms a subtree on its parent node.

1. List Representation
In this representation, we use two types of nodes: one for representing a node with data and another for representing only references. We start with a node with data for the root node of the tree. Then it is linked to an internal node through a reference node and is linked to any other node directly. This process repeats for all the nodes in the tree.
The above tree example can be represented using List representation as follows...
Fig: List representation of the above tree

2. Linked Representation
We use a linked list to represent a binary tree. In this linked representation, every node consists of three fields: the first field stores the left child address, the second stores the actual data and the third stores the right child address. In this linked list representation, a node has the following structure...
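In C, such a three-field node can be declared as below. The field names match the traversal routines later in this section; using int for the data field is an assumption of this sketch.

struct node {
    struct node *left_child;   /* address of the left child  */
    int          data;         /* actual data of the node    */
    struct node *right_child;  /* address of the right child */
};
typedef struct node *tree_pointer;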
Above two trees are different when viewed as binary trees. But same when viewed as trees.
Properties of Binary Trees
1. Maximum Number of Nodes in a Binary Tree
The maximum number of nodes on level i of a binary tree is 2^(i-1), i >= 1.
The maximum number of nodes in a binary tree of depth k is 2^k - 1, k >= 1.
1. In-Order Traversal ( leftChild - root - rightChild )
In In-Order traversal, the root node is visited between the left child and the right child. In this traversal, the left child node is visited first, then the root node, and later the right child node. This in-order traversal is applicable for every root node of all subtrees in the tree and is performed recursively for all nodes in the tree. In the above example of a binary tree, first we try to visit the left child of root node 'A', but A's left child 'B' is itself a root node for the left subtree, so we try to visit its (B's) left child 'D', and again D is a root for the subtree with nodes D, I and J. So we try to visit its left child 'I', and it is the leftmost child. So first we visit 'I', then go to its root node 'D', and later we visit D's right child 'J'. With this we have completed the left part of node B. Then we visit 'B', and next B's right child 'F' is visited. With this we have completed the left part of node A. Then we visit root node 'A'. With this we have completed the left and root parts of node A. Then we go to the right part of node A. To the right of A there is again a subtree with root C, so we go to the left child of C, and again it is a subtree with root G. But G does not have a left part, so we visit 'G' and then visit G's right child K. With this we have completed the left part of node C. Then we visit root node 'C' and next visit C's right child 'H', which is the rightmost child in the tree, so we stop the process. That means here we have visited in the order I - D - J - B - F - A - G - K - C - H using In-Order Traversal.
In-Order Traversal for the above example of binary tree is -
I - D - J - B - F - A - G - K - C - H
Algorithm
Until all nodes are traversed −
Step 1 − Recursively traverse left subtree.
Step 2 − Visit root node.
Step 3 − Recursively traverse right subtree.

void inorder(tree_pointer ptr)    /* inorder tree traversal (recursive) */
{
    if (ptr) {
        inorder(ptr->left_child);
        printf("%d", ptr->data);
        inorder(ptr->right_child);
    }
}

2. Pre-Order Traversal ( root - leftChild - rightChild )
Algorithm
Until all nodes are traversed −
Step 1 − Visit root node.
Step 2 − Recursively traverse left subtree.
Step 3 − Recursively traverse right subtree.

void preorder(tree_pointer ptr)   /* preorder tree traversal (recursive) */
{
    if (ptr) {
        printf("%d", ptr->data);
        preorder(ptr->left_child);
        preorder(ptr->right_child);
    }
}

3. Post-Order Traversal ( leftChild - rightChild - root )
In Post-Order traversal, the root node is visited after the left child and the right child. In this traversal, the left child node is visited first, then its right child and then its root node. This is recursively performed until the rightmost node is visited. Here we have visited in the order I - J - D - F - B - K - G - H - C - A using Post-Order Traversal.
Algorithm
Until all nodes are traversed −
Step 1 − Recursively traverse left subtree.
Step 2 − Recursively traverse right subtree.
Step 3 − Visit root node.

void postorder(tree_pointer ptr)  /* postorder tree traversal (recursive) */
{
    if (ptr) {
        postorder(ptr->left_child);
        postorder(ptr->right_child);
        printf("%d", ptr->data);
    }
}
Binary Search Tree
We are going to implement the tree using node objects, connecting them through references.
Definition: A binary search tree (BST) is a binary tree. It may be empty. If it is not empty, then all of its nodes follow the properties mentioned below −
The keys in a nonempty left subtree (right subtree) are smaller (larger) than the key in the root of that subtree.
Thus, a BST divides all its sub-trees into two segments, the left sub-tree and the right sub-tree, and can be defined as −
left_subtree (keys) ≤ node (key) ≤ right_subtree (keys)
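A minimal recursive search sketch that follows this property, using the same node layout as the linked representation above; the function name is my own.

#include <stddef.h>

struct node {
    struct node *left_child;
    int          data;
    struct node *right_child;
};

/* Return the node whose key equals 'key', or NULL if it is not present. */
struct node *bst_search(struct node *root, int key)
{
    if (root == NULL || root->data == key)
        return root;
    if (key < root->data)
        return bst_search(root->left_child, key);   /* smaller keys are on the left */
    else
        return bst_search(root->right_child, key);  /* larger keys are on the right */
}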
Graph
A graph G = (V, E) is composed of:
V: a set of vertices
E: a set of edges connecting the vertices in V
• An edge e = (u, v) is a pair of vertices
Example:

Complete Graph:
A complete graph is a graph that has the maximum number of edges. For an undirected graph with n vertices, the maximum number of edges is n(n-1)/2; for a directed graph with n vertices, the maximum number of edges is n(n-1).

Adjacent and Incident:
If (v0, v1) is an edge in an undirected graph,
– v0 and v1 are adjacent
– The edge (v0, v1) is incident on vertices v0 and v1
If <v0, v1> is an edge in a directed graph,
– v0 is adjacent to v1, and v1 is adjacent from v0
– The edge <v0, v1> is incident on v0 and v1

Subgraph:
A subgraph of G is a graph G' such that V(G') is a subset of V(G) and E(G') is a subset of E(G).

Path:
A path from vertex vp to vertex vq in a graph G is a sequence of vertices vp, vi1, vi2, ..., vin, vq such that (vp, vi1), (vi1, vi2), ..., (vin, vq) are edges in an undirected graph.
The length of a path is the number of edges on it.

Simple Path and Cycle:
A simple path is a path in which all vertices, except possibly the first and the last, are distinct.
A cycle is a simple path in which the first and the last vertices are the same.
In an undirected graph G, two vertices v0 and v1 are connected if there is a path in G from v0 to v1.
An undirected graph is connected if, for every pair of distinct vertices vi, vj, there is a path from vi to vj.

Multigraph:
In a multigraph, there can be more than one edge from vertex P to vertex Q. In a simple graph there is at most one.

Graph with self edges (feedback loops):
A self loop is an edge that connects a vertex to itself. In some graphs it makes sense to allow self-loops; in some it doesn't.
Example:

Degree
The degree of a vertex is the number of edges incident to that vertex.
For a directed graph,
– the in-degree of a vertex v is the number of edges that have v as the head
– the out-degree of a vertex v is the number of edges that have v as the tail
– if di is the degree of vertex i in a graph G with n vertices and e edges, then the number of edges is e = (d0 + d1 + ... + dn-1) / 2
ADT for Graph
Graph ADT is
Data structures: a nonempty set of vertices and a set of undirected edges, where each edge is a pair of vertices
Functions: for all graph ∈ Graph, v, v1 and v2 ∈ Vertices
Graph Create() ::= return an empty graph
Graph InsertVertex(graph, v) ::= return a graph with v inserted; v has no incident edges
Graph InsertEdge(graph, v1, v2) ::= return a graph with a new edge between v1 and v2
Graph DeleteVertex(graph, v) ::= return a graph in which v and all edges incident to it are removed
Graph DeleteEdge(graph, v1, v2) ::= return a graph in which the edge (v1, v2) is removed
Boolean IsEmpty(graph) ::= if (graph == empty graph) return TRUE else return FALSE
List Adjacent(graph, v) ::= return a list of all vertices that are adjacent to v
Graph Representations
A graph can be represented in the following ways:
a) Adjacency Matrix
b) Adjacency Lists
c) Adjacency Multilists

a) Adjacency Matrix
Let G = (V, E) be a graph with n vertices.
The adjacency matrix of G is a two-dimensional n by n array, say adj_mat.
If the edge (vi, vj) is in E(G), adj_mat[i][j] = 1.
If there is no such edge in E(G), adj_mat[i][j] = 0.
The adjacency matrix for an undirected graph is symmetric; the adjacency matrix for a digraph need not be symmetric.
Examples for Adjacency Matrix:

Merits of Adjacency Matrix
From the adjacency matrix, it is easy to determine whether two vertices are connected.
The degree of a vertex i is the sum of the entries in row i: adj_mat[i][0] + adj_mat[i][1] + ... + adj_mat[i][n-1].
For a digraph, the row sum is the out-degree, while the column sum is the in-degree.
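A small C sketch that builds the adjacency matrix of an undirected graph and uses row sums to report vertex degrees; the edge list below is an arbitrary example, not one of the graphs from the notes.

#include <stdio.h>

#define N 4

int main(void)
{
    int adj_mat[N][N] = {0};
    int edges[][2] = { {0, 1}, {0, 2}, {1, 2}, {2, 3} };
    int e = sizeof(edges) / sizeof(edges[0]);
    int i, j, k;

    for (k = 0; k < e; k++) {                 /* undirected: symmetric matrix */
        adj_mat[edges[k][0]][edges[k][1]] = 1;
        adj_mat[edges[k][1]][edges[k][0]] = 1;
    }

    for (i = 0; i < N; i++) {
        int degree = 0;
        for (j = 0; j < N; j++)
            degree += adj_mat[i][j];          /* degree = sum of row i */
        printf("degree(%d) = %d\n", i, degree);
    }
    return 0;
}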
b) Adjacency Lists
Each row of the adjacency matrix is represented as an adjacency list.
Interesting operations:
degree of a vertex in an undirected graph: the number of nodes in its adjacency list
number of edges in a graph: determined in O(n+e)
out-degree of a vertex in a directed graph: the number of nodes in its adjacency list
in-degree of a vertex in a directed graph: requires traversing the whole data structure

c) Adjacency Multilists
An edge in an undirected graph is represented by two nodes in the adjacency list representation.
Adjacency multilists are lists in which nodes may be shared among several lists (an edge is shared by two different paths).
Example for Adjacency Multilists
Orthogonal representation for graph G3
Lists: vertex 0: M1->M2->M3, vertex 1: M1->M4->M5, vertex 2: M2->M4->M6, vertex 3: M3->M5->M6
Some Graph Operations
The following are some graph operations:
a) Traversal: Given G = (V, E) and a vertex v, find all w ∈ V such that w is connected to v.
– Depth First Search (DFS): analogous to a preorder tree traversal
– Breadth First Search (BFS): analogous to a level order tree traversal
b) Spanning Trees
c) Connected Components

Depth First Search (DFS)
As in the example given above, the DFS algorithm traverses from A to B to C to D first, then to E, then to F and lastly to G. It employs the following rules (a C sketch is given below).
Rule 1 − Visit the adjacent unvisited vertex. Mark it as visited. Display it. Push it onto a stack.
Rule 2 − If no adjacent vertex is found, pop a vertex from the stack. (It will pop all the vertices from the stack which do not have adjacent vertices.)
Rule 3 − Repeat Rule 1 and Rule 2 until the stack is empty.
Fig: Graph G and its adjacency lists (step-by-step DFS traversal illustration)
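A C sketch of DFS over an adjacency matrix. It uses recursion (the run-time stack) instead of the explicit stack in the rules above, and the sample graph is an assumption of this sketch.

#include <stdio.h>

#define N 5

int adj[N][N] = {
    {0, 1, 1, 0, 0},
    {1, 0, 0, 1, 0},
    {1, 0, 0, 1, 0},
    {0, 1, 1, 0, 1},
    {0, 0, 0, 1, 0},
};
int visited[N];

void dfs(int v)
{
    visited[v] = 1;
    printf("%d ", v);                     /* visit and display the vertex  */
    for (int w = 0; w < N; w++)
        if (adj[v][w] && !visited[w])     /* adjacent and not yet visited  */
            dfs(w);
}

int main(void)
{
    dfs(0);                               /* prints 0 1 3 2 4 */
    printf("\n");
    return 0;
}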
Breadth First Search (BFS)
Rule 2 − If no adjacent vertex is found, remove the first vertex from the queue.

Spanning Trees
While adding a nontree edge into any spanning tree, a cycle will be created.
DFS vs BFS Spanning Tree
Biconnected component: a maximal biconnected subgraph H (there is no subgraph that is both biconnected and properly contains H).
Minimum cost spanning tree algorithms:
– Kruskal
– Prim
– Sollin
Kruskal’s Algorithm
Build a minimum cost spanning tree T by adding edges to T one at a time
Select the edges for inclusion in T in nondecreasing order of the cost
An edge is added to T if it does not form a cycle
Since G is connected and has n > 0 vertices, exactly n-1 edges will be selected
Kruskal’s algorithm
1. Sort all the edges in non-decreasing order of their weight.
2. Pick the smallest edge. Check if it forms a cycle with the spanning
tree formed so far. If cycle is not formed, include this edge. Else,
discard it.
3. Repeat step#2 until there are (V-1) edges in the spanning tree.
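A compact C sketch of these three steps, using a union-find array to detect cycles. The edge list and weights are an arbitrary example, not data from the notes.

#include <stdio.h>
#include <stdlib.h>

#define V 4

struct edge { int u, v, w; };

static int parent[V];

static int find(int x) { return parent[x] == x ? x : (parent[x] = find(parent[x])); }

static int cmp(const void *a, const void *b)
{
    return ((const struct edge *)a)->w - ((const struct edge *)b)->w;
}

int main(void)
{
    struct edge e[] = { {0,1,10}, {0,2,6}, {0,3,5}, {1,3,15}, {2,3,4} };
    int m = sizeof(e) / sizeof(e[0]);
    int i, taken = 0, cost = 0;

    for (i = 0; i < V; i++) parent[i] = i;

    /* 1. Sort the edges in non-decreasing order of weight. */
    qsort(e, m, sizeof(struct edge), cmp);

    /* 2-3. Take the smallest edge that does not form a cycle, until V-1 edges are chosen. */
    for (i = 0; i < m && taken < V - 1; i++) {
        int ru = find(e[i].u), rv = find(e[i].v);
        if (ru != rv) {                     /* different components: no cycle */
            parent[ru] = rv;
            cost += e[i].w;
            taken++;
            printf("edge (%d,%d) weight %d\n", e[i].u, e[i].v, e[i].w);
        }
    }
    printf("total cost = %d\n", cost);      /* 19 for this example */
    return 0;
}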
Prim’s Algorithm
Prim's algorithm to find minimum cost spanning tree (as Kruskal's algorithm)
uses the greedy approach. Prim's algorithm shares a similarity with the
shortest path first algorithms.
Prim's algorithm, in contrast with Kruskal's algorithm, treats the nodes as a
single tree and keeps on adding new nodes to the spanning tree from the given
graph.
To contrast with Kruskal's algorithm and to understand Prim's algorithm better,
we shall use the same example −
Steps of Prim's Algorithm: The following are the main 3 steps of the Prim's
Algorithm:
Single Source Shortest Paths (Dijkstra's algorithm):

void shortestpath(int v, int cost[][MAX_VERTICES], int distance[], int n,
                  short int found[])
{
    /* determine the shortest paths from vertex v to all other vertices;
       choose() returns the not-yet-found vertex with the smallest distance */
    int i, u, w;
    for (i = 0; i < n; i++) {
        found[i] = FALSE;
        distance[i] = cost[v][i];
    }
    found[v] = TRUE;
    distance[v] = 0;
    for (i = 0; i < n - 2; i++) {           /* determine n-1 paths from v */
        u = choose(distance, n, found);
        found[u] = TRUE;
        for (w = 0; w < n; w++)
            if (!found[w])
                if (distance[u] + cost[u][w] < distance[w])
                    distance[w] = distance[u] + cost[u][w];
    }
}
Algorithm
An algorithm is a step-by-step procedure to solve a problem in a finite
number of steps.
Branching and repetition are included in the steps of an algorithm.
This branching and repetition depend on the problem for which Algorithm is
developed.
All the steps of an algorithm should, at definition time, be written in a human-understandable language which does not depend on any programming language; we can then choose any programming language to implement the algorithm.
Pseudocode and flow chart are popular ways to represent an algorithm.
Compared with the straightforward method (2n − 2 comparisons), this method saves 25% in comparisons.
Example:
The operation of the algorithm on the list 8, 3, 2, 9, 7, 1, 5, 4 is illustrated in the figure.
Space Complexity
Compared to the straightforward method, the MaxMin method requires extra stack space for i, j, max, min, max1 and min1. Given n elements there will be [log2 n] + 1 levels of recursion, and we need to save seven values for each recursive call (6, plus 1 for the return address).
Merge Sort
Mergesort will never degrade to O(n²).
Another advantage of mergesort over quicksort and heapsort is its
stability. (A sorting algorithm is said to be stable if two objects with
equal keys appear in the same order in sorted output as they appear in
the input array to be sorted.)
Limitations:
The principal shortcoming of mergesort is the linear amount [O(n) ] of extra
storage the algorithm requires. Though merging can be done in-place, the
resulting algorithm is quite complicated and of theoretical interest only.
C(n) = 2C(n/2) + Cmerge(n) for n > 1, C(1) = 0,
where Cmerge(n) is the number of key comparisons made during the merging stage.
Let us analyze Cmerge(n), the number of key comparisons performed during the merging stage. At each step, exactly one comparison is made, after which the total number of elements in the two arrays still needing to be processed is reduced by 1. In the worst case, neither of the two arrays becomes empty before the other one contains just one element (e.g., smaller elements may come from the two arrays alternately). Therefore, for the worst case, Cmerge(n) = n − 1, and the recurrence becomes
Cworst(n) = 2Cworst(n/2) + n − 1 for n > 1, Cworst(1) = 0,
whose solution (for n a power of 2) is Cworst(n) = n log2 n − n + 1.

Quicksort
Obviously, after a partition is achieved, A[s] will be in its final position in the sorted array, and we can continue sorting the two subarrays to the left and to the right of A[s] independently (e.g., by the same method). In quicksort, the entire work happens in the division stage, with no work required to combine the solutions to the subproblems.
After both scans stop, three situations may arise, depending on whether or not
the scanning indices have crossed.
If scanning indices i and j have not crossed, i.e., i< j, we simply exchange
A[i] and A[j ] and resume the scans by incrementing I and decrementing j,
respectively:
If the scanning indices stop while pointing to the same element, i.e., i = j, the
value they are pointing to must be equal to p. Thus, we have the subarray
partitioned, with the split position s = i = j :
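A C sketch of this scan-and-swap partitioning with the first element as pivot, plus the quicksort driver that uses it. The function names and the array contents in main are assumptions of this sketch.

#include <stdio.h>

static void swap(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Partition A[l..r] around the pivot p = A[l]; return the split position s. */
int partition(int A[], int l, int r)
{
    int p = A[l];
    int i = l, j = r + 1;
    do {
        do { i++; } while (i <= r && A[i] < p);   /* left-to-right scan  */
        do { j--; } while (A[j] > p);             /* right-to-left scan  */
        if (i < j)
            swap(&A[i], &A[j]);                   /* indices have not crossed */
    } while (i < j);
    swap(&A[l], &A[j]);                           /* place the pivot at position j */
    return j;
}

void quicksort(int A[], int l, int r)
{
    if (l < r) {
        int s = partition(A, l, r);
        quicksort(A, l, s - 1);
        quicksort(A, s + 1, r);
    }
}

int main(void)
{
    int A[] = {5, 3, 1, 9, 8, 2, 4, 7};
    int n = sizeof(A) / sizeof(A[0]);
    quicksort(A, 0, n - 1);
    for (int k = 0; k < n; k++) printf("%d ", A[k]);
    printf("\n");
    return 0;
}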
Analysis
Best Case − Here the basic operation is key comparison. The number of key comparisons made before a partition is achieved is n + 1 if the scanning indices cross over, and n if they coincide. If all the splits happen in the middle of the corresponding subarrays, we will have the best case. The number of key comparisons in the best case satisfies the recurrence
Cbest(n) = 2Cbest(n/2) + n for n > 1, Cbest(1) = 0.
Thus, on average, quicksort makes only 39% more comparisons than in the best case. Moreover, its innermost loop is so efficient that it usually runs faster than mergesort on randomly ordered arrays of nontrivial sizes. This certainly justifies the name given to the algorithm by its inventor.

Variations: Because of quicksort's importance, there have been persistent efforts over the years to refine the basic algorithm. Among several improvements discovered by researchers are:
Better pivot selection methods, such as randomized quicksort that uses a random element, or the median-of-three method that uses the median of the leftmost, rightmost, and middle elements of the array
Switching to insertion sort on very small subarrays (between 5 and 15 elements for most computer systems), or not sorting small subarrays at all and finishing the algorithm with insertion sort applied to the entire nearly sorted array
Modifications of the partitioning algorithm, such as the three-way partition into segments smaller than, equal to, and larger than the pivot

CONTROL ABSTRACTION
Algorithm Greedy(a, n)
// a(1 : n) contains the n inputs
{
    solution := Ø;                    // initialize the solution to empty
    for i := 1 to n do
    {
        x := Select(a);
        if Feasible(solution, x) then
            solution := Union(solution, x);
    }
    return solution;
}
Procedure Greedy describes the essential way that a greedy based algorithm will look,
once a particular problem is chosen and the functions select, feasible and union are
properly implemented.
The function Select selects an input from 'a', removes it and assigns its value to 'x'. Feasible is a Boolean-valued function which determines whether 'x' can be included in the solution vector. The function Union combines 'x' with the solution and updates the objective function.
Example (optimal storage on tapes)
Let n = 3, (l1, l2, l3) = (5, 10, 3). Find the optimal ordering.
Solution:
There are n! = 6 possible orderings. They are:

KNAPSACK PROBLEM
Let us apply the greedy method to solve the knapsack problem. We are given 'n' objects and a knapsack. Object 'i' has a weight wi and the knapsack has a capacity 'm'. If a fraction xi, 0 ≤ xi ≤ 1, of object i is placed into the knapsack, then a profit of pi xi is earned. The objective is to fill the knapsack so as to maximize the total profit earned. Since the knapsack capacity is 'm', we require the total weight of all chosen objects to be at most 'm'. The problem is stated as:

maximize   p1 x1 + p2 x2 + ... + pn xn
subject to w1 x1 + w2 x2 + ... + wn xn ≤ m
           0 ≤ xi ≤ 1, 1 ≤ i ≤ n

The profits and weights are positive numbers.

Algorithm
If the objects have already been sorted into non-increasing order of p[i] / w[i], then the algorithm given below obtains solutions corresponding to this strategy (a C sketch of the strategy follows).
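A C sketch of this greedy strategy for the fractional knapsack. The sample profits, weights and capacity are arbitrary, and the items are assumed to be given already sorted by p[i]/w[i] (ratios 1.6, 1.5, 1.39 below).

#include <stdio.h>

/* Items are assumed to be sorted in non-increasing order of p[i]/w[i]. */
double greedy_knapsack(double p[], double w[], double x[], int n, double m)
{
    double remaining = m, profit = 0.0;
    int i;

    for (i = 0; i < n; i++) x[i] = 0.0;

    for (i = 0; i < n && remaining > 0.0; i++) {
        if (w[i] <= remaining) {
            x[i] = 1.0;                    /* take the whole object   */
            remaining -= w[i];
        } else {
            x[i] = remaining / w[i];       /* take a fraction of it   */
            remaining = 0.0;
        }
        profit += p[i] * x[i];
    }
    return profit;
}

int main(void)
{
    double p[] = {24.0, 15.0, 25.0};       /* profits                 */
    double w[] = {15.0, 10.0, 18.0};       /* weights                 */
    double x[3];
    double profit = greedy_knapsack(p, w, x, 3, 20.0);
    printf("x = (%.2f, %.2f, %.2f), profit = %.2f\n",
           x[0], x[1], x[2], profit);      /* x = (1.00, 0.50, 0.00), profit = 31.50 */
    return 0;
}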
struct treenode
{
    struct treenode *lchild;   /* left subtree                             */
    int weight;                /* length of the (merged) file at this node */
    struct treenode *rchild;   /* right subtree                            */
};
Example 1:
Suppose we are having three sorted files X1, X2 and X3 of length 30, 20, and 10 records each.
Merging of the files can be carried out as follows:
Example 2:
Given five files (X1, X2, X3, X4, X5) with sizes (20, 30, 10, 5, 30). Apply greedy rule to
find optimal way of pair wise merging to give an optimal solution using binary merge
tree representation.
Solution:
Merge X4 and X3 to get 15 record moves. Call this Z1.
Merge Z1 and X1 to get 35 record moves. Call this Z2.

Unit V
Backtracking: The General Method – The 8-Queens Problem – Sum of Subsets – Graph Coloring.

Backtracking
Some problems can be solved by exhaustive search. The exhaustive-search technique suggests generating all candidate solutions and then identifying the one (or the ones) with a desired property.
Backtracking is a more intelligent variation of this approach. The principal idea is to construct solutions one component at a time and to evaluate such partially constructed candidates as follows. If a partially constructed solution can be developed further without violating the problem's constraints, it is done by taking the first remaining legitimate option for the next component. If there is no legitimate option for the next component, no alternatives for any remaining component need to be considered. In this case, the algorithm backtracks to replace the last component of the partially constructed solution with its next option.
It is convenient to implement this kind of processing by constructing a tree of choices being
made, called the state-space tree. Its root represents an initial state before the search for a
solution begins. The nodes of the first level in the tree represent the choices made for the first
component of a solution; the nodes of the second level represent the choices for the second
component, and soon. A node in a state-space tree is said to be promising if it corresponds to
a partially constructed solution that may still lead to a complete solution; otherwise, it is
called non-promising. Leaves represent either non-promising dead ends or complete
solutions found by the algorithm.
In the majority of cases, a state space tree for a backtracking algorithm is constructed in the
manner of depth-first search. If the current node is promising, its child is generated by adding
the first remaining legitimate option for the next component of a solution, and the processing
moves to this child. If the current node turns out to be non-promising, the algorithm
backtracks to the node‘s parent to consider the next possible option for its last component; if
there is no such option, it backtracks one more level up the tree, and so on. Finally, if the
algorithm reaches a complete solution to the problem, it either stops (if just one solution is
required) or continues searching for other possible solutions.
General method
General Algorithm (Recursive)
We start with the empty board and then place queen 1 in the first possible position of its row, which is
in column 1 of row 1. Then we place queen 2, after trying unsuccessfully columns 1 and 2, in the first
acceptable position for it, which is square (2, 3), the square in row 2 and column 3. This proves to be a
dead end because there is no acceptable position for queen 3. So, the algorithm backtracks and puts
queen 2 in the next possible position at (2, 4). Then queen 3 is placed at (3, 2), which proves to be
another dead end. The algorithm then backtracks all the way to queen 1 and moves it to (1, 2). Queen 2
then goes to (2, 4), queen 3 to(3, 1), and queen 4 to (4, 3), which is a solution to the problem. The state-
space tree of this search is shown in figure.
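A C sketch of this backtracking search for the n-queens problem; with N = 4 and the same column order it finds the solution (1,2), (2,4), (3,1), (4,3) from the walkthrough above. The helper name place() and the 1-based indexing are my own choices.

#include <stdio.h>
#include <stdlib.h>

#define N 4

int col[N + 1];   /* col[row] = column of the queen placed in 'row' (1-based) */

/* Can a queen be placed at (row, c) without attacking the earlier queens? */
int place(int row, int c)
{
    for (int r = 1; r < row; r++)
        if (col[r] == c || abs(col[r] - c) == abs(r - row))
            return 0;            /* same column or same diagonal           */
    return 1;
}

int solve(int row)
{
    if (row > N)                 /* all queens placed: a complete solution */
        return 1;
    for (int c = 1; c <= N; c++) {
        if (place(row, c)) {
            col[row] = c;        /* extend the partial solution            */
            if (solve(row + 1))
                return 1;        /* stop at the first solution             */
        }
    }
    return 0;                    /* dead end: backtrack to the caller      */
}

int main(void)
{
    if (solve(1))
        for (int r = 1; r <= N; r++)
            printf("queen %d -> (%d, %d)\n", r, r, col[r]);
    return 0;
}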
We record the value of s, the sum of these numbers, in the node. If s is equal to d, we have a solution to the problem. We can either report this result and stop or, if all the solutions need to be found, continue by backtracking to the node's parent. If s is not equal to d, we can terminate the node as non-promising if either of the following two inequalities holds:
s + x(i+1) > d          (the sum s is too large even after adding the smallest remaining number)
s + x(i+1) + ... + x(n) < d   (the sum s is too small even after adding all the remaining numbers)
Example: Apply backtracking to solve the following instance of the subset sum problem: A = {1, 3, 4, 5} and d = 11.
The root of the tree represents the starting point, with no decisions about the given elements made as yet. Its left and right children represent, respectively, the inclusion and exclusion of 1 in the set being sought. Similarly, going to the left from a node of the first level corresponds to inclusion of 2, while going to the right corresponds to its exclusion, and so on. Thus, a path from the root to a node on the ith level of the tree indicates which of the first i numbers have been included in the subsets represented by that node.

Graph Coloring
UNIT -1
CLOUD COMPUTING
Introduction
Cloud computing is a type of computing that relies on shared computing resources rather than
having local servers or personal devices to handle applications.
The National Institute of Standards and Technology (NIST) has a more comprehensive definition
of cloud computing. It describes cloud computing as "a model for enabling ubiquitous,
convenient, on-demand network access to a shared pool of configurable computing resources
(e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and
released with minimal management effort or service provider interaction."
• The ability/space where you store your data, process it, and can access it from anywhere in the world.
• A metaphor for the Internet.
Cloud computing is:
• Storing data/applications on remote servers
• Processing data/applications from those servers
• Accessing data/applications via the Internet
Cloud computing is taking services and moving them outside an organization's firewall.
Applications, storage and other services are accessed via the Web. The services are
delivered and used over the Internet and are paid for by the cloud customer on an as-
needed or pay-per-use business model.
Service: This term in cloud computing is the concept of being able to use reusable, fine-grained
components across a vendor’s network.
According to the NIST, all true cloud environments have five key characteristics:
1. On-demand self-service: This means that cloud customers can sign up for, pay for and start using cloud resources very quickly on their own, without help from a sales agent.
2. Broad network access: Customers access cloud services via the Internet.
3. Resource pooling: Many different customers (individuals, organizations or different departments within an organization) all use the same servers, storage or other computing resources.
4. Rapid elasticity or expansion: Cloud customers can easily scale their use of resources up or down as their needs change.
5. Measured service: Customers pay for the amount of resources they use in a given period of time rather than paying for hardware or software upfront. (Note that in a private cloud, this measured service usually involves some form of chargeback, where IT keeps track of how many resources different departments within an organization are using.)
1.2 Applications:

i) Storage: The cloud keeps many copies of storage. Using these copies of resources, it extracts another resource if any one of the resources fails.

ii. Database: Databases are repositories for information, with links within the information that help make the data searchable.
Advantages:
i. Improved availability: If there is a fault in one database system, it will only affect one fragment of the information, not the entire database.
ii. Improved performance: Data is located near the site with the greatest demand and the database systems are parallelized, which allows the load to be balanced among the servers.
iii. Price: It is less expensive to create a network of smaller computers with the power of one large one.
iv. Flexibility: Systems can be changed and modified without harm to the entire database.
Disadvantages:
i. Complexity: Database administrators have extra work to do to maintain the system.
ii. Labor costs: With that added complexity comes the need for more workers on the payroll.
iii. Security: Database fragments must be secured, and so must the sites housing the fragments.
iv. Integrity: It may be difficult to maintain the integrity of the database if it is too complex or changes too quickly.
v. Standards: There are currently no standards to convert a centralized database into a cloud solution.

iii. Synchronization: allows content to be refreshed across multiple devices. Ex: Google Docs

Database services (DaaS): DaaS avoids the complexity and cost of running your own database.
Benefits:
i. Ease of use: You don't have to worry about buying, installing, and maintaining hardware for the database, as there are no servers to provision and no redundant systems to worry about.
ii. Power: The database isn't housed locally, but that doesn't mean it is not functional and effective. Depending on your vendor, you can get custom data validation to ensure accurate information. You can create and manage the database with ease.
iii. Integration: The database can be integrated with your other services to provide more value and power. For instance, you can tie it in with calendars, email, and people to make your work more powerful.
iv. Management: Because large databases benefit from constant pruning and optimization, there are typically expensive resources dedicated to this task. With some DaaS offerings, this management can be provided as part of the service for much less expense. The provider will often use offshore labor pools to take advantage of lower labor costs there, so it's possible that you are using the service in Chicago, the physical servers are in Washington state, and the database administrator is in the Philippines.

MS SQL and Oracle are the two biggest DaaS providers.

MS SQL:
Microsoft SQL Server Data Services (SSDS), based on SQL Server, was announced as a cloud extension of the SQL Server tool in 2008; it is similar to Amazon's SimpleDB (schema-free data storage, SOAP or REST APIs and a pay-as-you-go payment system).
One of the main selling points of SSDS is that it integrates with Microsoft's Sync Framework, which is a .NET library for synchronizing dissimilar data sources. Microsoft wants SSDS to work as a data hub, synchronizing data on multiple devices so they can be accessed offline.
Core concepts in SSDS:
i. Authority: both a billing unit and a collection of containers
ii. Container: a collection of entities and is what you search within
iii. Entity: a property bag of name and value pairs
Oracle:
It introduces three services to provide database services to cloud users. Customers can license Oracle software to run in the Amazon Web Services (AWS) cloud. Oracle delivered a set of free Amazon Machine Images (AMIs) to its customers so they could quickly and efficiently deploy Oracle's database solutions. Developers can take advantage of the provisioning and automated software deployment to rapidly build applications using Oracle's popular development tools such as Oracle Application Express, Oracle Developer, Oracle Enterprise Pack for Eclipse, and Oracle Workshop for WebLogic. Additionally, Oracle Unbreakable Linux Support and AWS Premium Support are available for Oracle Enterprise Linux on EC2, providing seamless customer support.

"Providing choice is the foundation of Oracle's strategy to enable customers to become more productive and lower their IT costs - whether it's choice of hardware, operating system, or on-demand computing - and extending this to the Cloud environment is a natural evolution," said Robert Shimp, vice president of Oracle Global Technology Business Unit. "We are pleased to partner with Amazon Web Services to provide our customers enterprise-class Cloud solutions, using familiar Oracle software on which their businesses depend."

Additionally, Oracle also introduced a secure cloud-based backup solution. Oracle Secure Backup Cloud Module, based on Oracle's premier tape backup management software, Oracle Secure Backup, enables customers to use the Amazon Simple Storage Service (Amazon S3) as their database backup destination. Cloud-based backups offer reliability and virtually unlimited capacity, available on demand and with no up-front capital expenditure. The Oracle Secure Backup Cloud Module also enables encrypted data backups to help ensure complete privacy in the cloud environment. It is fully integrated with Oracle Recovery Manager and Oracle Enterprise Manager, providing users with familiar interfaces for cloud-based backups.

For customers with an ongoing need to quickly move very large volumes of data into or out of the AWS cloud, Amazon allows the creation of network peering connections.

1.3 Cloud Components:
• Clients
• Data center
• Distributed servers

i. Clients:
a. Mobile: mobile devices including PDAs and smartphones such as a BlackBerry, Windows Mobile phone or iPhone.
b. Thin: computers that do not have internal hard drives; they let the server do all the work and then display the information (a small thin-client sketch is given at the end of this section).
c. Thick: a regular computer, using a web browser like Firefox or Internet Explorer to connect to the cloud.

Thin vs Thick - advantages of thin clients:
i. Price and effect on the environment
ii. Lower hardware costs
iii. Lower IT costs
iv. Security
v. Data security
vi. Less power consumption
vii. Ease of repair or replacement
viii. Less noise

ii. Data Center:
• It is the collection of servers where the application to which you subscribe is housed.

iii. Distributed Servers:
• Servers are in geographically disparate locations but act as if they're humming away right next to each other.
• This gives the service provider more flexibility in options and security.
Ex:
• Amazon has its cloud solution spread all over the world; if one site failed, the service would still be accessible through another site.
• If the cloud needs more hardware, the provider need not add more servers in the safe room; they can add them at another site and make it part of the cloud.
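The thin-client idea above can be shown in a few lines of Python: the client stores nothing and computes nothing, it only asks a server to do the work and displays the result. This is a hedged illustration only; the URL is a placeholder, not a real service.

# A thin client: no local storage or processing, it only sends the request
# and displays whatever the server computed. The endpoint is hypothetical.
import json
import urllib.request

SERVER = "http://example.com/api/report"   # placeholder endpoint

def show_report(month):
    # All the work (queries, aggregation, formatting data) happens server-side.
    with urllib.request.urlopen(f"{SERVER}?month={month}") as resp:
        data = json.loads(resp.read().decode("utf-8"))
    for row in data.get("rows", []):
        print(row)

if __name__ == "__main__":
    show_report("2020-05")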
1.4 Benefits and Limitations of Cloud Computing

The advantage of cloud computing is twofold. It serves as a form of file backup, and it allows the same document to be worked on from several types of device (PC, tablet or smartphone), whether by one person at a desk or by someone travelling.

Cloud computing simplifies usage by overcoming the constraints of traditional computer tools (installation and updating of software, storage, data portability, and so on). It also provides more elasticity and agility, because it allows faster access to IT resources (servers, storage or bandwidth) via a simple web portal and thus without investing in additional hardware.

Consumers and organizations have many different reasons for choosing to use cloud computing services. They might include the following:
Convenience
Scalability
Low costs
Security
Anytime, anywhere access
High availability

Limitations / Disadvantages:

a) Downtime: Since cloud computing systems are internet-based, service outages are always an unfortunate possibility and can occur for any reason.
Best practices for minimizing planned downtime in a cloud environment:
Design services with high availability and disaster recovery in mind. Leverage the multi-availability zones provided by cloud vendors in your infrastructure.
If your services have a low tolerance for failure, consider multi-region deployments with automated failover to ensure the best business continuity possible.
Define and implement a disaster recovery plan in line with your business objectives that provides the lowest possible recovery time objective (RTO) and recovery point objective (RPO).
Consider implementing dedicated connectivity such as AWS Direct Connect, Azure ExpressRoute, or Google Cloud's Dedicated Interconnect or Partner Interconnect. These services provide a dedicated network connection between you and the cloud service point of presence, which can reduce exposure to the risk of business interruption from the public internet.
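As a rough illustration of the multi-region failover idea in the list above, the sketch below probes a primary endpoint and falls back to a secondary one. Both URLs are placeholders; a real deployment would normally use DNS failover or a managed load balancer rather than a script like this.

# Minimal health-check failover between two regions (placeholder URLs).
import urllib.request

ENDPOINTS = [
    "https://app.region-a.example.com/health",   # primary region
    "https://app.region-b.example.com/health",   # secondary region
]

def first_healthy(endpoints, timeout=2):
    """Return the first endpoint that answers its health check, else None."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except OSError:
            continue   # unreachable or timed out, try the next region
    return None

active = first_healthy(ENDPOINTS)
print("Routing traffic to:", active or "no healthy region (trigger DR plan)")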
b) Security and Privacy: An example is Code Spaces and the hacking of their AWS EC2 console, which led to data deletion and the eventual shutdown of the company. Their dependence on remote cloud-based infrastructure meant taking on the risks of outsourcing everything.
Best practices for minimizing security and privacy risks:
Understand the shared responsibility model of your cloud provider.
Implement security at every level of your deployment.
Know who is supposed to have access to each resource and service, and limit access to least privilege.
Make sure your teams' skills are up to the task: solid security skills for your cloud teams are one of the best ways to mitigate security and privacy concerns in the cloud.
Take a risk-based approach to securing assets used in the cloud.
Extend security to the device.
Implement multi-factor authentication for all accounts accessing sensitive data or systems.

c) Vulnerability to Attack: Even the best teams suffer severe attacks and security breaches from time to time.
Best practices to help you reduce cloud attacks:
Make security a core aspect of all IT operations.
Keep ALL your teams up to date with cloud security best practices.
Ensure security policies and procedures are regularly checked and reviewed.
Proactively classify information and apply access control.
Use cloud services such as AWS Inspector, AWS CloudWatch, AWS CloudTrail, and AWS Config to automate compliance controls.
Prevent data exfiltration.
Integrate prevention and response strategies into security operations.
Discover rogue projects with audits.
Remove password access from accounts that do not need to log in to services.
Review and rotate access keys and access credentials.
Follow security blogs and announcements to be aware of known attacks.
Apply security best practices for any open source software that you are using.
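"Limit access to least privilege" and "apply access control" from the lists above can be pictured with a tiny role-to-permission check. The roles and permission names below are made up for illustration and are not tied to any real cloud IAM service.

# Toy least-privilege check: each role gets only the permissions it needs.
PERMISSIONS = {
    "developer": {"instances:read", "logs:read"},
    "operator":  {"instances:read", "instances:restart"},
    "auditor":   {"logs:read"},
}

def is_allowed(role, action):
    # Deny by default: anything not explicitly granted is refused.
    return action in PERMISSIONS.get(role, set())

for role, action in [("developer", "instances:restart"),
                     ("operator", "instances:restart")]:
    verdict = "allowed" if is_allowed(role, action) else "denied"
    print(f"{role} -> {action}: {verdict}")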
d) Limited control and flexibility: Since the cloud infrastructure is entirely owned, managed and monitored by the service provider, it transfers minimal control over to the customer. To varying degrees (depending on the particular service), cloud users may find they have less control over the function and execution of services within a cloud-hosted infrastructure. A cloud provider's end-user license agreement (EULA) and management policies might impose limits on what customers can do with their deployments. Customers retain control of their applications, data, and services, but may not have the same level of control over their backend infrastructure.
Best practices for maintaining control and flexibility:
Consider using a cloud provider partner to help with implementing, running, and supporting cloud services.
Understanding your responsibilities and the responsibilities of the cloud vendor in the shared responsibility model will reduce the chance of omission or error.
Make time to understand your cloud service provider's basic level of support. Will this service level meet your support requirements? Most cloud providers offer additional support tiers over and above the basic support for an additional cost.
Make sure you understand the service level agreement (SLA) concerning the infrastructure and services that you're going to use, and how that will impact your agreements with your customers.

e) Vendor Lock-In: Organizations may find it difficult to migrate their services from one vendor to another. Differences between vendor platforms may create difficulties in migrating from one cloud platform to another, which could equate to additional costs and configuration complexities.
Best practices to decrease dependency:
Design with cloud architecture best practices in mind. All cloud services provide the opportunity to improve availability and performance, decouple layers, and reduce performance bottlenecks. If you have built your services using cloud architecture best practices, you are less likely to have issues porting from one cloud platform to another.
Properly understanding what your vendors are selling can help avoid lock-in challenges.
Employing a multi-cloud strategy is another way to avoid vendor lock-in. While this may add both development and operational complexity to your deployments, it doesn't have to be a deal breaker. Training can help prepare teams to architect and select best-fit services and technologies.
Build in flexibility as a matter of strategy when designing applications to ensure portability now and in the future.

f) Cost Savings: Adopting cloud solutions on a small scale and for short-term projects can be perceived as being expensive.
Best practices to reduce costs:
Try not to over-provision; instead, look into using auto-scaling services.
Scale DOWN as well as UP.
Pre-pay if you have a known minimum usage.
Stop your instances when they are not being used.
Create alerts to track cloud spending.
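To see why "stop your instances when they are not being used" matters, here is a back-of-the-envelope calculation. The hourly rate is an assumed example figure, not any provider's quoted price.

# Rough monthly cost comparison: running VMs 24x7 versus only during working
# hours (assumed 10 hours/day, 22 days/month). $0.10/hour is a made-up rate.
HOURLY_RATE = 0.10          # assumed price per instance-hour (USD)
INSTANCES = 5

always_on_hours = 24 * 30
working_hours = 10 * 22

always_on_cost = INSTANCES * always_on_hours * HOURLY_RATE
scheduled_cost = INSTANCES * working_hours * HOURLY_RATE

print(f"Always on:         ${always_on_cost:.2f}/month")
print(f"Stopped off-hours: ${scheduled_cost:.2f}/month")
print(f"Savings:           ${always_on_cost - scheduled_cost:.2f}/month")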
1.5 Architecture

Let's have a look at cloud computing and see what it is made of. Cloud computing comprises two components: the front end and the back end. The front end is the client part of the cloud computing system. It comprises the interfaces and applications that are required to access the cloud computing platform.

A central server administers the system, monitoring traffic and client demands to ensure everything runs smoothly. It follows a set of rules called protocols and uses a special kind of software called MIDDLEWARE. Middleware allows networked computers to communicate with each other (a tiny sketch of such request dispatching is given at the end of this section). Most of the time, servers don't run at full capacity. That means there's unused processing power going to waste. It's possible to fool a physical server into thinking it's actually multiple servers, each running with its own independent operating system. The technique is called server virtualization. By maximizing the output of individual servers, server virtualization reduces the need for more physical machines.

The back end refers to the cloud itself; it comprises the resources that are required for cloud computing services. It consists of virtual machines, servers, data storage, security mechanisms, etc. It is under the provider's control.

Cloud computing distributes the file system over multiple hard disks and machines. Data is never stored in one place only, and in case one unit fails another will take over automatically. The user disk space is allocated on the distributed file system, while another important component is the algorithm for resource allocation. Cloud computing is a strong distributed environment and it heavily depends upon strong algorithms.
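The "central server" role described in the architecture section above can be pictured as a very small dispatcher that hands incoming client requests to back-end servers in turn. This is only a conceptual sketch, not real middleware; the server names are placeholders.

# Conceptual sketch of a central server distributing client requests
# round-robin across back-end servers (no real networking involved).
import itertools

class CentralServer:
    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)   # simple round-robin rule

    def handle(self, request):
        backend = next(self._cycle)               # pick the next back-end server
        print(f"request {request!r} -> {backend}")
        return backend

dispatcher = CentralServer(["backend-1", "backend-2", "backend-3"])
for i in range(5):
    dispatcher.handle(f"client-request-{i}")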
Evolution of Cloud Computing

Cloud computing is all about renting computing services. This idea first came in the 1950s. In making cloud computing what it is today, five technologies played a vital role: distributed systems and their peripherals, virtualization, Web 2.0, service orientation, and utility computing.

Distributed Systems:
A distributed system is a composition of multiple independent systems, but all of them are depicted as a single entity to the users. The purpose of distributed systems is to share resources and use them effectively and efficiently. Distributed systems possess characteristics such as scalability, concurrency, continuous availability, heterogeneity, and independence in failures. But the main problem with this approach was that all the systems were required to be present at the same geographical location. To solve this problem, distributed computing led to three more types of computing: mainframe computing, cluster computing, and grid computing.

Mainframe computing:
Mainframes, which first came into existence in 1951, are highly powerful and reliable computing machines. They are responsible for handling large data, such as massive input-output operations. Even today they are used for bulk processing tasks such as online transactions. These systems have almost no downtime and high fault tolerance. After distributed computing, they increased the processing capabilities of the system, but they were very expensive. To reduce this cost, cluster computing came as an alternative to mainframe technology.

Cluster computing:
In the 1980s, cluster computing came as an alternative to mainframe computing. Each machine in the cluster was connected to the others by a network with high bandwidth. Clusters were far cheaper than mainframe systems and equally capable of high computation. Also, new nodes could easily be added to the cluster if required. Thus, the problem of cost was solved to some extent, but the problem of geographical restriction still remained. To solve this, the concept of grid computing was introduced.

Grid computing:
In the 1990s, the concept of grid computing was introduced. It means that different systems are placed at entirely different geographical locations and all are connected via the internet. These systems belonged to different organizations, and thus the grid consisted of heterogeneous nodes. Although it solved some problems, new problems emerged as the distance between the nodes increased. The main problem encountered was the low availability of high-bandwidth connectivity, along with other network-related issues. Thus, cloud computing is often referred to as the "successor of grid computing".

Virtualization:
It was introduced nearly 40 years back. It refers to the process of creating a virtual layer over the hardware which allows the user to run multiple instances simultaneously on the hardware. It is a key technology used in cloud computing, and it is the base on which major cloud computing services such as Amazon EC2, VMware vCloud, etc., work. Hardware virtualization is still one of the most common types of virtualization.

Web 2.0:
It is the interface through which the cloud computing services interact with the clients. It is because of Web 2.0 that we have interactive and dynamic web pages. It also increases flexibility among web pages. Popular examples of Web 2.0 include Google Maps, Facebook, Twitter, etc. Needless to say, social media is possible only because of this technology. It gained major popularity in 2004.

Service orientation:
It acts as a reference model for cloud computing. It supports low-cost, flexible, and evolvable applications. Two important concepts were introduced in this computing model: Quality of Service (QoS), which also includes the SLA (Service Level Agreement), and Software as a Service (SaaS).

Utility computing:
It is a computing model that defines service provisioning techniques for services such as compute services, along with other major services such as storage and infrastructure, which are provisioned on a pay-per-use basis.
Virtualization and Cloud Computing

The main enabling technology for cloud computing is VIRTUALIZATION. Virtualization is the partitioning of a single physical server into multiple logical servers. Once the physical server is divided, each logical server behaves like a physical server and can run an operating system and applications independently. Many popular companies like VMware and Microsoft provide virtualization services, where instead of using your personal PC for storage and computation you use their virtual server. They are fast, cost-effective and less time consuming.

For software developers and testers, virtualization comes in very handy, as it allows developers to write code that runs in many different environments and, more importantly, to test that code.

Virtualization is mainly used for three main purposes:
1) Network Virtualization
2) Server Virtualization
3) Storage Virtualization

Network Virtualization: It is a method of combining the available resources in a network by splitting the available bandwidth into channels, each of which is independent of the others and can be assigned to a specific server or device in real time.

Storage Virtualization: It is the pooling of physical storage from multiple network storage devices into what appears to be a single storage device that is managed from a central console. Storage virtualization is commonly used in storage area networks (SANs).

Server Virtualization: Server virtualization is the masking of server resources such as processors, RAM and operating system from server users. The intention of server virtualization is to increase resource sharing and reduce the burden and complexity of computation for users.

Virtualization is the key to unlocking the cloud system. What makes virtualization so important for the cloud is that it decouples the software from the hardware. For example, PCs can use virtual memory to borrow extra memory from the hard disk. Usually the hard disk has a lot more space than memory. Although virtual disks are slower than real memory, if managed properly the substitution works perfectly. Likewise, there is software which can imitate an entire computer, which means one computer can perform the functions of 20 computers.
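A crude way to picture "partitioning a single physical server into multiple logical servers" is simple resource bookkeeping: each logical server is carved out of the host's CPU and RAM until the host is full. This is only an illustration of the idea, not how a hypervisor actually works.

# Toy resource accounting for carving logical servers out of one physical host.
class PhysicalServer:
    def __init__(self, cpus, ram_gb):
        self.free_cpus = cpus
        self.free_ram = ram_gb
        self.logical_servers = []

    def create_logical_server(self, name, cpus, ram_gb):
        # Refuse the request if the host does not have enough spare capacity.
        if cpus > self.free_cpus or ram_gb > self.free_ram:
            print(f"cannot place {name}: not enough capacity")
            return False
        self.free_cpus -= cpus
        self.free_ram -= ram_gb
        self.logical_servers.append(name)
        print(f"{name} placed ({cpus} vCPU, {ram_gb} GB); "
              f"{self.free_cpus} vCPU / {self.free_ram} GB left")
        return True

host = PhysicalServer(cpus=16, ram_gb=64)
host.create_logical_server("web-vm", 4, 8)
host.create_logical_server("db-vm", 8, 32)
host.create_logical_server("big-vm", 8, 32)   # rejected: host is nearly full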
1.6 Classification of Cloud Variants:
i. Service Model Based
ii. Deployment Model Based

1.6.1 Service Model Based / Service Models / Types of Models

Cloud computing services are divided into three classes, according to the abstraction level of the capability provided and the service model of providers, namely:
1. Infrastructure as a Service (IaaS)
2. Platform as a Service (PaaS) and
3. Software as a Service (SaaS)

These abstraction levels can also be viewed as a layered architecture where services of a higher layer can be composed from services of the underlying layer. The reference model explains the role of each layer in an integrated architecture. A core middleware manages physical resources and the VMs deployed on top of them; in addition, it provides the required features (e.g., accounting and billing) to offer multi-tenant pay-as-you-go services. Cloud development environments are built on top of infrastructure services to offer application development and deployment capabilities; at this level, various programming models, libraries, APIs, and mashup editors enable the creation of a range of business, Web, and scientific applications. Once deployed in the cloud, these applications can be consumed by end users.

INFRASTRUCTURE AS A SERVICE
Offering virtualized resources (computation, storage, and communication) on demand is known as Infrastructure as a Service (IaaS), also referred to as Hardware as a Service (HaaS).

PLATFORM AS A SERVICE
In addition to infrastructure-oriented clouds that provide raw computing and storage services, another approach is to offer a higher level of abstraction to make a cloud easily programmable, known as Platform as a Service (PaaS). A cloud platform offers an environment on which developers create and deploy applications and do not necessarily need to know how many processors or how much memory the applications will be using. In addition, multiple programming models and specialized services (e.g., data access, authentication, and payments) are offered as building blocks for new applications.
Google App Engine, an example of Platform as a Service, offers a scalable environment for developing and hosting Web applications, which should be written in specific programming languages such as Python or Java and use the service's own proprietary structured object data store. Building blocks include an in-memory object cache (memcache), a mail service, an instant messaging service (XMPP), an image manipulation service, and integration with the Google Accounts authentication service (a small example of the kind of code deployed on such a platform is given at the end of this section).

SOFTWARE AS A SERVICE
Applications reside at the top of the cloud stack. Services provided by this layer can be accessed by end users through Web portals. Therefore, consumers are increasingly shifting from locally installed computer programs to online software services that offer the same functionality. Traditional desktop applications such as word processing and spreadsheets can now be accessed as a service on the Web. This model of delivering applications, known as Software as a Service (SaaS), alleviates the burden of software maintenance for customers and simplifies development and testing for providers. Salesforce.com, which relies on the SaaS model, offers business productivity applications (CRM) that reside completely on their servers, allowing customers to customize and access applications on demand.
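The PaaS paragraph above can be made concrete with the kind of small web handler a developer hands over to such a platform. The sketch below is a generic Python WSGI application shown only to illustrate "write the application, let the platform run it"; it is not App Engine-specific code, and the local server at the bottom is just for testing.

# A minimal WSGI web application: the developer writes only this handler;
# a PaaS platform supplies the servers, scaling and HTTP front end.
def application(environ, start_response):
    body = b"Hello from a platform-managed application!\n"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]

if __name__ == "__main__":
    # Local test only; on a PaaS this serving layer is provided by the platform.
    from wsgiref.simple_server import make_server
    with make_server("", 8000, application) as httpd:
        print("Serving on http://localhost:8000 ...")
        httpd.serve_forever()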
INFRASTRUCTURE AS A SERVICE PROVIDERS
Public Infrastructure as a Service providers commonly offer virtual servers containing one or more CPUs, running several choices of operating systems and a customized software stack.

FEATURES
The most relevant features are:
i. Geographic distribution of data centers;
ii. Variety of user interfaces and APIs to access the system;
iii. Specialized components and services that aid particular applications (e.g., load balancers, firewalls);
iv. Choice of virtualization platform and operating systems; and
v. Different billing methods and periods (e.g., prepaid vs. postpaid, hourly vs. monthly).

Geographic Presence:
Availability zones are "distinct locations that are engineered to be insulated from failures in other availability zones and provide inexpensive, low-latency network connectivity to other availability zones in the same region." Regions, in turn, "are geographically dispersed and will be in separate geographic areas or countries."

A public IaaS provider must provide multiple access means to its cloud, thus catering for various users and their preferences. Different types of user interfaces (UI) provide different levels of abstraction, the most common being graphical user interfaces (GUI), command-line tools (CLI), and Web service (WS) APIs. GUIs are preferred by end users who need to launch, customize, and monitor a few virtual servers and do not necessarily need to repeat the process several times. Users are given privileges to perform numerous activities on the server, such as starting and stopping it, customizing it by installing software packages, attaching virtual disks to it, and configuring access permissions and firewall rules.

Advance reservations allow users to request that an IaaS provider reserve resources for a specific time frame in the future, thus ensuring that cloud resources will be available at that time. Amazon Reserved Instances is a form of advance reservation of capacity, allowing users to pay a fixed amount of money in advance to guarantee resource availability at any time during an agreed period, and then pay a discounted hourly rate when resources are in use.

Automatic Scaling and Load Balancing:
Automatic scaling allows users to set conditions for when they want their applications to scale up and down, based on application-specific metrics such as transactions per second, number of simultaneous users, request latency, and so forth (a small example of such a scaling rule is given at the end of this section). When the number of virtual servers is increased by automatic scaling, incoming traffic must be automatically distributed among the available servers. This activity enables applications to promptly respond to traffic increases while also achieving greater fault tolerance.

Service-Level Agreement:
Service-level agreements (SLAs) are offered by IaaS providers to express their commitment to delivery of a certain QoS. To customers it serves as a warranty. An SLA usually includes availability and performance guarantees.

HYPERVISOR AND OPERATING SYSTEM CHOICE:
IaaS offerings have been based on heavily customized open-source Xen deployments. IaaS providers needed expertise in Linux, networking, virtualization, metering, resource management, and many other low-level aspects to successfully deploy and maintain their cloud offerings.
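The automatic scaling described earlier in this section boils down to a rule of the form "add servers above one threshold, remove them below another". The thresholds, fleet sizes and sample workload below are arbitrary example values, not provider defaults.

# Toy auto-scaling rule driven by transactions per second (TPS).
MIN_SERVERS, MAX_SERVERS = 2, 10
SCALE_UP_TPS, SCALE_DOWN_TPS = 100, 30   # per-server thresholds (examples)

def desired_servers(current, tps):
    per_server = tps / current
    if per_server > SCALE_UP_TPS and current < MAX_SERVERS:
        return current + 1        # scale up: servers are overloaded
    if per_server < SCALE_DOWN_TPS and current > MIN_SERVERS:
        return current - 1        # scale down: capacity is idle
    return current                # within bounds: keep the fleet as it is

servers = 2
for tps in [150, 400, 600, 450, 90, 40]:
    servers = desired_servers(servers, tps)
    print(f"load={tps:4d} TPS -> run {servers} servers")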
PaaS PROVIDERS
Public Platform as a Service providers commonly offer a development and deployment environment that allows users to create and run their applications with little or no concern for the low-level details of the platform.
Persistence Options: A persistence layer is essential to allow applications to record their state and recover it in case of crashes, as well as to store user data.

Deployment Model Based
Cloud computing can be divided into several sub-categories depending on the physical location of the computing resources and who can access those resources.

a. Public cloud: vendors offer their computing services to anyone in the general public. They maintain large data centers full of computing hardware, and their customers share access to that hardware.

b. Private cloud: a cloud environment set aside for the exclusive use of one organization. Some large enterprises choose to keep some data and applications in a private cloud for security reasons, and some are required to use private clouds in order to comply with various regulations. Organizations have two different options for the location of a private cloud: they can set up a private cloud in their own data centers, or they can use a hosted private cloud service. With a hosted private cloud, a public cloud vendor agrees to set aside certain computing resources and allow only one customer to use those resources.

c. Hybrid cloud: a combination of both a public and a private cloud with some level of integration between the two. For example, in a practice called "cloud bursting" a company may run Web servers in its own private cloud most of the time and use a public cloud service for additional capacity during times of peak use.
A multi-cloud environment is similar to a hybrid cloud because the customer is using more than one cloud service. However, a multi-cloud environment does not necessarily have integration among the various cloud services, the way a hybrid cloud does. A multi-cloud environment can include only public clouds, only private clouds, or a combination of both public and private clouds.

d. Community Cloud: here, computing resources are provided for a community of organizations.

Components of Cloud infrastructure
a) Hypervisor: a firmware or low-level program that acts as a Virtual Machine Manager. It enables sharing a physical instance of cloud resources between several customers.
b) Management Software: assists in maintaining and configuring the infrastructure.
c) Deployment Software: assists in deploying and integrating the application on the cloud.
d) Network: the key component of the cloud infrastructure. It enables cloud services to be connected over the Internet. The customer can customize the network route and protocol, i.e., it is possible to deliver the network as a utility over the Internet.
e) Server: assists in computing resource sharing and offers other services such as resource allocation and de-allocation, monitoring of resources, and security.
f) Storage: the cloud keeps many copies of storage. Using these copies, it switches to another resource if any one of the resources fails (a small sketch of this replica fallback follows).
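The storage component above relies on keeping several copies and switching to another copy when one fails. The sketch below shows that idea with plain Python dictionaries standing in for storage replicas, so the example is self-contained.

# Read a value from the first healthy replica; dicts stand in for storage nodes.
replicas = [
    {"status": "down", "data": {}},                        # failed copy
    {"status": "up",   "data": {"report.txt": "Q1 data"}}  # healthy copy
]

def read(key):
    for i, node in enumerate(replicas):
        if node["status"] != "up":
            continue                       # skip failed replicas
        if key in node["data"]:
            print(f"served '{key}' from replica {i}")
            return node["data"][key]
    raise IOError(f"{key} unavailable on all replicas")

print(read("report.txt"))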
Intranets and the Cloud: Intranets are customarily used within an organization and are not accessible publicly. That is, a web server is maintained in-house and company information is maintained on it that others within the organization can access. However, intranets are now also being maintained on the cloud. To access the company's private, in-house information, users have to log on to the intranet by going to a secure public web site.

There are two main components in client/server computing: servers and thin or light clients. The servers house the applications your organization needs to run, and the thin clients, which do not have hard drives, display the results.

Hypervisor Applications:
Applications like VMware or Microsoft's Hyper-V allow you to virtualize your servers so that multiple virtual servers can run on one physical server. These sorts of solutions provide the tools to supply a virtualized set of hardware to the guest operating system. They also make it possible to install different operating systems on the same machine. For example, you may need Windows Vista to run one application, while another application requires Linux. It's easy to set up the server to run both operating systems.

Thin clients use an application program to communicate with an application server. Most of the processing is done on the server and sent back to the client. There is some debate about where to draw the line when talking about thin clients. Some thin clients require an application program or a web browser to communicate with the server; others require no add-on applications at all. This is largely a discussion of semantics, because the real issue is whether the work is being done on the server and transmitted back to the thin client.
1.8 Cloud computing techniques

Some traditional computing techniques that have helped enterprises achieve additional computing and storage capabilities, while meeting customer demands using shared physical resources, are:

Cluster computing connects different computers in a single location via LAN to work as a single computer. It improves the combined performance of the organization which owns it.
Grid computing enables collaboration between enterprises to carry out distributed computing jobs using interconnected computers spread across multiple locations, running independently.
Utility computing provides web services such as computing, storage space, and applications to users at a low cost through the virtualization of several backend servers. Utility computing has laid the foundation for today's cloud computing.
Distributed computing connects ubiquitous networks and connected devices, enabling peer-to-peer computing. Examples of such infrastructure are ATMs and intranets/workgroups.

Grid Computing Vs Cloud Computing

When we switch on a fan or any electric device, we are less concerned about the power supply, where it comes from and how it is generated. The power supply or electricity that we receive at home travels through a chain of networks, which includes power stations, transformers, power lines and transmission stations. These components together make a 'power grid'. Likewise, 'grid computing' is an infrastructure that links computing resources such as PCs, servers, workstations and storage elements and provides the mechanism required to access them.

Grid computing is a middleware to coordinate disparate IT resources across a network, allowing them to function as a whole. It is more often used in scientific research and in universities for educational purposes. For example, a group of architecture students working on different projects requires a specific designing tool and design software, but only a couple of them have access to this tool. The problem is how to make this tool available to the rest of the students. To do so, they put the designing tool on the campus network; the grid then connects all the computers in the campus network and allows the students to use the tool required for their projects from anywhere. Cloud computing and grid computing are often confused; though their functions are almost similar, their approaches are different. Let us see how they operate:

• Cloud computing works more as a service provider for utilizing computer resources, whereas grid computing uses the available resources and interconnected computer systems to accomplish a common goal.
• Cloud computing is a centralized model, whereas grid computing is a decentralized model, where the computation could occur over many administrative domains.
• Cloud offers almost all services, such as web hosting, DB (database) support and much more, whereas grid provides limited services.
• Cloud computing is typically provided within a single organization (e.g., Amazon), whereas grid computing federates resources located within different organizations.
Utility Computing Vs Cloud Computing

In our previous discussion of grid computing we saw how electricity is supplied to our house; we also know that to keep the electricity supply we have to pay the bill. Utility computing is just like that: we use electricity at home as per our requirement and pay the bill accordingly, and likewise you use computing services and pay as per use. This is known as 'utility computing'. Utility computing is a good source for small-scale usage; it can be done in any server environment and requires cloud computing.

Utility computing is the process of providing service through an on-demand, pay-per-use billing method. The customer or client has access to a virtually unlimited supply of computing solutions over a virtual private network or over the internet, which can be sourced and used whenever required. Grid computing, cloud computing and managed IT services are based on the concept of utility computing.

Through utility computing, small businesses with limited budgets can easily use software like CRM (Customer Relationship Management) without investing heavily in infrastructure to maintain their client base.

• Utility computing refers to the ability to charge for the offered services and bill customers for exact usage, whereas cloud computing also works like utility computing (you pay only for what you use) but might be cheaper.
• Utility computing is more favorable when performance and selection of infrastructure are critical, whereas cloud computing is great and easy to use when the selected infrastructure and performance are not critical.
• Utility computing is a good choice for less resource-demanding workloads, whereas cloud computing is a good choice for highly resource-demanding workloads.
• Utility computing refers to a business model, whereas cloud computing refers to the underlying IT architecture.

1.9 Security concerns for Cloud Computing

While using cloud computing, the major issue that concerns users is security. One concern is that cloud providers themselves may have access to customers' unencrypted data, whether it is on disk, in memory or transmitted over the network. Some countries' governments may decide to search through data without necessarily notifying the data owner, depending on where the data resides; this is not appreciated and is considered a privacy breach (for example, the PRISM program by the USA).

To provide security for systems, networks and data, cloud computing service providers have joined hands with the TCG (Trusted Computing Group), a non-profit organization which regularly releases a set of specifications to secure hardware, create self-encrypting drives and improve network security. It protects the data from rootkits and malware. As computing has expanded to different devices such as hard disk drives and mobile phones, TCG has extended the security measures to include these devices. It provides the ability to create a unified data protection policy across all clouds.

Some of the trusted cloud services are Amazon, Box.net, Gmail and many others.

1.10 Privacy Concern & Cloud Computing

Privacy presents a strong barrier for users to adopt cloud computing systems. There are certain measures which can improve privacy in cloud computing.
1. The administrative staff of the cloud computing service could theoretically monitor the data moving in memory before it is stored on disk. To keep the confidentiality of data, administrative and legal controls should prevent this from happening.
2. The other way of increasing privacy is to keep the data encrypted at the cloud storage site, preventing unauthorized access through the internet; even the cloud vendor can't access the data.
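Measure 2 above can be implemented by encrypting on the client before upload, so the vendor only ever stores ciphertext. The sketch below assumes the third-party cryptography package is installed; the upload step is left as a comment because it depends on the provider's SDK.

# Client-side encryption before uploading to cloud storage.
# Requires the 'cryptography' package (pip install cryptography).
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # keep this key on-premises, never in the cloud
cipher = Fernet(key)

plaintext = b"patient record #1234: confidential"
ciphertext = cipher.encrypt(plaintext)

# Only 'ciphertext' would be sent to the cloud storage service here
# (upload call omitted; it depends on the provider's SDK).

# Later, after downloading the ciphertext back, the owner decrypts locally.
assert cipher.decrypt(ciphertext) == plaintext
print("round trip OK; the cloud only ever saw", len(ciphertext), "opaque bytes")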
ii) Full Virtualization
Full virtualization is a technique in which a complete installation of one machine is run on another. The result is a system in which all software running on the server is within a virtual machine. In a fully virtualized deployment, the software running on the server is displayed on the clients.
Virtualization is relevant to cloud computing because it is one of the ways in which you will access services on the cloud. That is, the remote data center may be delivering your services in a fully virtualized format.
For full virtualization to be practical, specific hardware combinations had to be used. It wasn't until 2005 that the introduction of the AMD Virtualization (AMD-V) and Intel Virtualization Technology (IVT) extensions made it easier to go fully virtualized (a small check for these CPU flags is sketched after the para virtualization subsection below).
Full virtualization has been successful for several purposes:
i) Sharing a computer system among multiple users
ii) Isolating users from each other and from the control program
iii) Emulating hardware on another machine

iii) Para virtualization
Para virtualization allows multiple operating systems to run on a single hardware device at the same time by using system resources, such as processors and memory, more efficiently. In full virtualization, the entire system is emulated (BIOS, drives, and so on), but in para virtualization the management module operates with an operating system that has been adjusted to work in a virtual machine. Para virtualization typically performs better than the full virtualization model, simply because in a fully virtualized deployment all elements must be emulated.
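Since the text above mentions AMD-V and Intel VT as prerequisites for efficient full virtualization, here is a small Linux-only check for the corresponding CPU flags (vmx for Intel, svm for AMD) in /proc/cpuinfo. On other operating systems the file does not exist and the sketch simply says so.

# Check whether the CPU advertises hardware virtualization support (Linux only).
# Intel VT-x appears as the 'vmx' flag, AMD-V as the 'svm' flag in /proc/cpuinfo.
from pathlib import Path

def hardware_virt_flags():
    cpuinfo = Path("/proc/cpuinfo")
    if not cpuinfo.exists():
        return None                       # not Linux; cannot tell this way
    tokens = set(cpuinfo.read_text().split())
    return sorted(f for f in ("vmx", "svm") if f in tokens)

flags = hardware_virt_flags()
if flags is None:
    print("No /proc/cpuinfo here; use your platform's own tools.")
elif flags:
    print("Hardware virtualization extensions found:", ", ".join(flags))
else:
    print("No vmx/svm flag: full virtualization will rely on software emulation.")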
CHALLENGES AND RISKS

Despite the initial success and popularity of the cloud computing paradigm and the extensive availability of providers and tools, a significant number of challenges and risks are inherent to this new model of computing. Providers, developers, and end users must consider these challenges and risks to take good advantage of cloud computing. Issues to be faced include user privacy, data security, data lock-in, availability of service, disaster recovery, performance, scalability, energy efficiency, and programmability.

Security, Privacy, and Trust: Security and privacy affect the entire cloud computing stack, since there is a massive use of third-party services and infrastructures that are used to host important data or to perform critical operations. In this scenario, trust toward providers is fundamental to ensure the desired level of privacy for applications hosted in the cloud. Legal and regulatory issues also need attention. When data are moved into the cloud, providers may choose to locate them anywhere on the planet. The physical location of data centers determines the set of laws that can be applied to the management of data. For example, specific cryptography techniques could not be used because they are not allowed in some countries. Similarly, country laws can impose that sensitive data, such as patient health records, are to be stored within national borders.

Data Lock-In and Standardization: A major concern of cloud computing users is having their data locked in by a certain provider. Users may want to move data and applications out from a provider that does not meet their requirements. However, in their current form, cloud computing infrastructures and platforms do not employ standard methods of storing user data and applications. Consequently, they do not interoperate and user data are not portable. The answer to this concern is standardization. In this direction, there are efforts to create open standards for cloud computing. The Cloud Computing Interoperability Forum (CCIF) was formed by organizations such as Intel, Sun, and Cisco in order to "enable a global cloud computing ecosystem whereby organizations are able to seamlessly work together for the purposes of wider industry adoption of cloud computing technology." The development of the Unified Cloud Interface (UCI) by CCIF aims at creating a standard programmatic point of access to an entire cloud infrastructure. In the hardware virtualization sphere, the Open Virtualization Format (OVF) aims at facilitating packing and distribution of software to be run on VMs so that virtual appliances can be made portable, that is, seamlessly run on hypervisors of different vendors.

Availability, Fault-Tolerance, and Disaster Recovery: It is expected that users will have certain expectations about the service level to be provided once their applications are moved to the cloud. These expectations include availability of the service, its overall performance, and what measures are to be taken when something goes wrong in the system or its components. In summary, users seek a warranty before they can comfortably move their business to the cloud. SLAs, which include QoS requirements, must ideally be set up between customers and cloud computing providers to act as a warranty. An SLA specifies the details of the service to be provided, including availability and performance guarantees. Additionally, metrics must be agreed upon by all parties, and penalties for violating the expectations must also be approved.

Resource Management and Energy Efficiency: One important challenge faced by providers of cloud computing services is the efficient management of virtualized resource pools. Physical resources such as CPU cores, disk space, and network bandwidth must be sliced and shared among virtual machines running potentially heterogeneous workloads. The multidimensional nature of virtual machines complicates the activity of finding a good mapping of VMs onto available physical hosts while maximizing user utility. Dimensions to be considered include number of CPUs, amount of memory, size of virtual disks, and network bandwidth. Dynamic VM mapping policies may leverage the ability to suspend, migrate, and resume VMs as an easy way of preempting low-priority allocations in favor of higher-priority ones. Migration of VMs also brings additional challenges, such as detecting when to initiate a migration, which VM to migrate, and where to migrate. In addition, policies may take advantage of live migration of virtual machines to relocate data center load without significantly disrupting running services. In this case, an additional concern is the tradeoff between the negative impact of a live migration on the performance and stability of a service and the benefits to be achieved with that migration.

Another challenge concerns the outstanding amount of data to be managed in various VM management activities. Such data amounts are a result of particular abilities of virtual machines, including the ability of traveling through space (i.e., migration) and time (i.e., checkpointing and rewinding), operations that may be required in load balancing, backup, and recovery scenarios. In addition, dynamic provisioning of new VMs and replicating existing VMs require efficient mechanisms to make VM block storage devices (e.g., image files) quickly available at selected hosts.

Data centers consume large amounts of electricity. According to data published by HP [4], 100 server racks can consume 1.3 MW of power and another 1.3 MW are required by the cooling system, thus costing USD 2.6 million per year. Besides the monetary cost, data centers significantly impact the environment in terms of CO2 emissions from the cooling systems.

Issues in cloud:

Eucalyptus: The Eucalyptus framework was one of the first open-source projects to focus on building IaaS clouds. It has been developed with the intent of providing an open-source implementation nearly identical in functionality to the Amazon Web Services APIs. Eucalyptus provides the following features: Linux-based controller with administration Web portal; EC2-compatible (SOAP, Query) and S3-compatible (SOAP, REST) CLI and Web portal interfaces; Xen, KVM, and VMware backends; Amazon EBS-compatible virtual storage devices; interface to the Amazon EC2 public cloud; virtual networks.

Nimbus: The Nimbus toolkit is built on top of the Globus framework. Nimbus provides most features in common with other open-source VI managers, such as an EC2-compatible front-end API, support for Xen, and a backend interface to Amazon EC2. However, it distinguishes itself from others by providing a Globus Web Services Resource Framework (WSRF) interface. It also provides a backend service, named Pilot, which spawns VMs on clusters managed by a local resource manager (LRM) such as PBS and SGE.

OpenNebula: OpenNebula is one of the most feature-rich open-source VI managers. It was initially conceived to manage local virtual infrastructure, but has also included remote interfaces that make it viable to build public clouds. Altogether, four programming APIs are available: XML-RPC and libvirt for local interaction, and a subset of the EC2 (Query) APIs and the OpenNebula Cloud API (OCA) for public access. OpenNebula provides the following features: Linux-based controller; CLI, XML-RPC, EC2-compatible Query and OCA interfaces; Xen, KVM, and VMware backends; interface to public clouds (Amazon EC2, ElasticHosts); virtual networks; dynamic resource allocation; advance reservation of capacity.
CASE STUDY

i) Case study of cloud computing: Royal Mail
Subject of the case study: Using cloud computing for effective communication among staff.
Reason for using cloud computing: Reducing the cost of communication for 28,000 employees and providing advanced features and interfaces of e-mail services to the employees.

Royal Mail Group, a postal service in the U.K., is the only government organization in the U.K. that serves over 24 million customers through its 12,000 post offices and 3,000 separate processing sites. Its logistics systems and Parcelforce Worldwide handle around 404 million parcels a year. To do this they need an effective communication medium. They recognized the advantage of cloud computing and implemented it in their system, and it has shown outstanding performance in inter-communication.

Before moving to the cloud system, the organization was struggling with out-of-date software, due to which operational efficiency was being compromised. As soon as the organization switched to the cloud system, 28,000 employees were supplied with their new collaboration suite, giving them access to tools such as instant messaging and presence awareness. The employees got more storage space than on the local server and became much more productive.

Looking at the success of cloud computing in e-mail services and communication, the second strategic move of Royal Mail Group was to migrate from physical servers to virtual servers: up to 400 servers were consolidated to create a private cloud based on Microsoft Hyper-V. This gives a fresh look and additional space to the employees' desktops and also provides a modern Exchange environment.

The Hyper-V project by RMG (Royal Mail Group) is estimated to save around 1.8 million pounds in the future and will increase the efficiency of the organization's internal IT systems.

Case study 2
XYZ is a startup IT organization that develops and sells software. The organization gets a new website development project that needs a web server, an application server and a database server. The organization has hired 30 employees for this web development project.
Constraints:
Acquiring or renting space for new servers
Buying new high-end servers
Hiring new IT staff for infrastructure management
Buying a licensed OS and other software required for development
Solution: Public cloud IaaS
Team leader:
1. Creates an account
2. Chooses a VM image from the image repository or creates a new image
3. Specifies the number of VMs
4. Chooses the VM type
5. Sets the necessary configurations for the VMs
6. After the VMs are launched, provides the IP addresses of the VMs to the programming team
7. Accesses the VMs and starts development
(A minimal sketch of steps 2-5 using a public-cloud SDK is given after case study 3.)

Case study 3
The XYZ firm gets more revenue and grows, and hence buys some IT infrastructure. However, it continues to use the public IaaS cloud for its development work. Now the firm gets a new project that involves sensitive data, which restricts the firm from using a public cloud; hence the organization needs to set up the required infrastructure on its own premises.
Constraints:
Infrastructure cost
Infrastructure optimization
Power consumption
Data center management
Additional expenditure on infrastructure operation with lower productivity
Solution: Private IaaS cloud
Explanation: Moving to a private cloud means:
IT-managed self-service
Dedicated or shared infrastructure
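Steps 2-5 of the team leader's checklist in case study 2 map directly onto a public-cloud SDK call. The sketch below uses Amazon's boto3 library as one possible example; the AMI ID, instance type and count are placeholders, and valid AWS credentials are assumed to be configured beforehand.

# Launching the development VMs for case study 2 with boto3 (illustrative only).
# Assumes: pip install boto3, and AWS credentials/region already configured.
import boto3

ec2 = boto3.client("ec2")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder image chosen from the repository
    InstanceType="t2.micro",           # the chosen VM type
    MinCount=3,                        # number of VMs for the project team
    MaxCount=3,
)

# Hand the addresses to the programming team once the instances are up (step 6).
for instance in response["Instances"]:
    print(instance["InstanceId"], instance.get("PrivateIpAddress"))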
AWS (Amazon Web Services)
AWS is Amazon's cloud web hosting platform, which offers fast, flexible, reliable and cost-effective solutions. It offers services in the form of building blocks which can be used to create and deploy any kind of application in the cloud. It is the most popular, as it was the first to enter the cloud computing space.
Features:
Easy sign-up process
Fast deployments
Allows easy management of adding or removing capacity
Access to effectively limitless capacity
Centralized billing and management
Offers hybrid capabilities and per-hour billing
Download link: https://aws.amazon.com/

Google Cloud
Google Cloud is a set of solutions and products which includes GCP & G Suite. It helps you to solve all kinds of business challenges with ease.
Features:
Allows you to scale with open, flexible technology
Solve issues with accessible AI & data analytics
Eliminates the need for installing costly servers
Allows you to transform your business with a full suite of cloud-based services
Download link: https://cloud.google.com/

4) VMware
VMware is a comprehensive cloud management platform. It helps you to manage a hybrid environment running anything from traditional to container workloads. The tools also allow you to maximize the profits of your organization.
Features:
Enterprise-ready hybrid cloud management platform
Offers private & public clouds
Comprehensive reporting and analytics which improve the capacity for forecasting & planning
Offers additional integrations with 3rd-party and custom applications and tools
Provides flexible, agile services
Download link: https://www.vmware.com/in/cloud-services/infrastructure.html

Oracle Cloud
Oracle Cloud offers innovative and integrated cloud services. It helps you to build, deploy, and manage workloads in the cloud or on premises. Oracle Cloud also helps companies to transform their business and reduce complexity.
Features:
Oracle offers more options for where and how you make your journey to the cloud
Oracle helps you realize the importance of modern technologies including artificial intelligence, chatbots, machine learning, and more
Offers next-generation mission-critical data management in the cloud
Oracle provides better visibility of unsanctioned apps and protects against sophisticated cyber attacks
Download link: https://www.oracle.com/cloud/

5) IBM Cloud
IBM Cloud is a full-stack cloud platform which spans public, private and hybrid environments. It is built with a robust suite of advanced and AI tools.
Features:
IBM Cloud offers infrastructure as a service (IaaS), software as a service (SaaS) and platform as a service (PaaS)
IBM Cloud is used to build pioneering solutions which help you to gain value for your business
It offers high-performing cloud communications and services in your IT environment
Download link: https://www.ibm.com/cloud/

Tips for selecting a Cloud Service Provider
The "best" cloud service cannot be defined in general; you need to choose the cloud service that is best for your project. The following checklist will help:
Is your desired region supported?
Cost of the service and your budget
Eucalyptus
• Eucalyptus is a paid and open-source computer software for building Amazon Web Services (AWS)-compatible private and hybrid cloud computing environments, originally developed by the company Eucalyptus Systems.
• Eucalyptus enables pooling compute, storage, and network resources that can be dynamically scaled up or down as application workloads change.

Eucalyptus has six components:

1. The Cloud Controller (CLC) is a Java program that offers EC2-compatible interfaces, as well as a web interface to the outside world.
• In addition to handling incoming requests, the CLC acts as the administrative interface for cloud management and performs high-level resource scheduling and system accounting.
• The CLC accepts user API requests from command-line interfaces like euca2ools or GUI-based tools like the Eucalyptus User Console and manages the underlying compute, storage, and network resources.
• Only one CLC can exist per cloud, and it handles authentication, accounting, reporting, and quota management.
2. Walrus, also written in Java, is the Eucalyptus equivalent of the AWS Simple Storage Service (S3).
• Walrus offers persistent storage to all of the virtual machines in the Eucalyptus cloud and can be used as a simple HTTP put/get storage-as-a-service solution.
• There are no data type restrictions for Walrus, and it can contain images (i.e., the building blocks used to launch virtual machines), volume snapshots (i.e., point-in-time copies), and application data. Only one Walrus can exist per cloud.

3. The Cluster Controller (CC) is written in C and acts as the front end for a cluster within a Eucalyptus cloud; it communicates with the Storage Controller and Node Controller.
• It manages instance (i.e., virtual machine) execution and Service Level Agreements (SLAs) per cluster.

4. The Storage Controller (SC) is written in Java and is the Eucalyptus equivalent of AWS EBS. It communicates with the Cluster Controller and Node Controller and manages Eucalyptus block volumes and snapshots for the instances within its specific cluster.
• If an instance requires writing persistent data to storage outside of the cluster, it would need to write to Walrus, which is available to any instance in any cluster.

5. The Node Controller (NC) is written in C; it hosts the virtual machine instances and manages the virtual network endpoints.
• It downloads and caches images from Walrus as well as creates and caches instances.
• While there is no theoretical limit to the number of Node Controllers per cluster, performance limits do exist.

6. The VMware Broker is an optional component that provides an AWS-compatible interface for VMware environments and physically runs on the Cluster Controller.
• The VMware Broker overlays existing ESX/ESXi hosts and transforms Eucalyptus Machine Images (EMIs) into VMware virtual disks.
• The VMware Broker mediates interactions between the Cluster Controller and VMware and can connect directly to either ESX/ESXi hosts or to vCenter Server.

Nimbus
• Nimbus's mission is to evolve the infrastructure with emphasis on the needs of science, but many non-scientific use cases are supported as well.
• Nimbus allows a client to lease remote resources by deploying virtual machines (VMs) on those resources and configuring them to represent an environment desired by the user.
• It was formerly known as the "Virtual Workspace Service" (VWS), but the "workspace service" is technically just one of the components in the software.
• Nimbus is a toolkit that, once installed on a cluster, provides an infrastructure-as-a-service cloud to its clients via WSRF-based or Amazon EC2 WSDL web service APIs.
• Nimbus is free and open-source software, subject to the requirements of the Apache License, version 2.
• Nimbus supports both the Xen and KVM hypervisors and the virtual machine schedulers Portable Batch System and Oracle Grid Engine.
• It allows deployment of self-configured virtual clusters via contextualization.
• It is configurable with respect to scheduling, network leases, and usage accounting.
• Nimbus comprises two products:
Nimbus Infrastructure
Nimbus Platform
• Nimbus Infrastructure is an open-source EC2/S3-compatible Infrastructure-as-a-Service implementation specifically targeting features of interest to the scientific community, such as support for proxy credentials, batch schedulers, best-effort allocations and others.
• Nimbus Platform is an integrated set of tools, operating in a multi-cloud environment, that delivers the power and versatility of infrastructure clouds to scientific users. Nimbus Platform allows you to reliably deploy, scale, and manage cloud resources.

System Architecture & Design
1. Workspace service
• Allows clients to manage and administer VMs by providing two interfaces:
A) One interface is based on the Web Services Resource Framework (WSRF).
B) The other is based on EC2 WSDL.
2. Workspace resource manager
• Implements VM instance creation on a site and its management.
3. Workspace pilot
• Provides virtualization without significantly altering the site configuration.
4. Workspace control
• Implements VM instance management such as starting, stopping and pausing a VM. It also provides image management, sets up networks and provides IP assignment.
5. Context broker
• Allows clients to coordinate large virtual cluster launches automatically and repeatedly.
6. Workspace client
• A complex client that provides full access to the workspace service functionality.
7. Cloud client
8. Storage service
• Cumulus is a web service providing users with storage capabilities to store images; it works in conjunction with GridFTP.

Open Nebula
• OpenNebula is an open-source cloud computing platform for managing heterogeneous distributed data centre infrastructures.
• Many of its users use OpenNebula to manage data center virtualization, consolidate servers, and integrate existing IT assets for computing, storage, and networking.
• In this deployment model, OpenNebula directly integrates with hypervisors (like KVM, Xen or VMware ESX) and has complete control over virtual and physical resources, providing advanced features for capacity management, resource optimization, high availability and business continuity.
• Some of these users also enjoy OpenNebula's cloud management and provisioning features when they additionally want to federate data centers, implement cloud bursting, or offer self-service portals for users.

Cloud infrastructure solutions
• There are also users that use OpenNebula to provide a multi-tenant, cloud-like provisioning layer on top of an existing infrastructure management solution (like VMware vCenter).
• These users are looking for provisioning, elasticity and multi-tenancy cloud features such as virtual data center provisioning, data center federation or hybrid cloud computing to connect in-house infrastructures with public clouds, while the infrastructure is managed by already familiar tools for infrastructure management and operation.
Image Repository: Any storage medium for the VM images (usually a high performing SAN).
Master node: A single gateway or front-end machine, sometimes also called the master node, is
Cluster Storage : OpenNebula supports multiple back-ends (e.g. LVM for fast cloning)
responsible for queuing, scheduling and submitting jobs to the machines in the cluster. It runs
VM Directory: The home of the VM in the cluster node several other OpenNebula services mentioned below:
Worker node: The other machines in the cluster, known as ‘worker nodes’, provide raw
computing power for processing the jobs submitted to the cluster. The worker nodes in an
OpenNebula cluster are machines that deploy a virtualisation hypervisor, such as VMware, Xen
or KVM.
CloudSim
• Originally built primarily at the Cloud Computing and Distributed Systems (CLOUDS)
Laboratory, The University of Melbourne, Australia, CloudSim has become one of the
most popular open source cloud simulators in research and academia.
• By using CloudSim, developers can focus on specific systems design issues that they
want to investigate, without getting concerned about details related to cloud-based
infrastructures and services.
• CloudSim is a simulation tool that allows cloud developers to test the performance of
their provisioning policies in a repeatable and controllable environment, free of cost.
• It provides essential classes for describing data centres, computational resources, virtual
machines, applications, users, and policies for the management of various parts of the
system such as scheduling and provisioning.
• It can be used as a building block for a simulated cloud environment and can add new
policies for scheduling, load balancing and new scenarios.
• It is flexible enough to be used as a library that allows you to add a desired scenario by
writing a Java program.
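As a concrete illustration of the points above, the following is a minimal, hedged sketch of a CloudSim scenario in Java, assuming the CloudSim 3.x API (org.cloudbus.cloudsim): one datacenter with a single host, one broker, one VM and one cloudlet. All names and parameter values (MIPS, RAM, costs) are illustrative placeholders, not taken from these notes.

// Minimal CloudSim sketch: one datacenter, one broker, one VM, one cloudlet.
import java.util.*;
import org.cloudbus.cloudsim.*;
import org.cloudbus.cloudsim.core.CloudSim;
import org.cloudbus.cloudsim.provisioners.*;

public class MinimalCloudSimExample {
    public static void main(String[] args) throws Exception {
        CloudSim.init(1, Calendar.getInstance(), false);      // 1 cloud user, no trace events

        // One host with a single processing element (PE)
        List<Pe> peList = new ArrayList<>();
        peList.add(new Pe(0, new PeProvisionerSimple(1000))); // 1000 MIPS
        List<Host> hostList = new ArrayList<>();
        hostList.add(new Host(0, new RamProvisionerSimple(2048), new BwProvisionerSimple(10000),
                1_000_000, peList, new VmSchedulerTimeShared(peList)));

        DatacenterCharacteristics ch = new DatacenterCharacteristics(
                "x86", "Linux", "Xen", hostList, 10.0, 3.0, 0.05, 0.001, 0.0);
        new Datacenter("Datacenter_0", ch, new VmAllocationPolicySimple(hostList),
                new LinkedList<Storage>(), 0);

        DatacenterBroker broker = new DatacenterBroker("Broker_0");

        // One VM and one cloudlet submitted through the broker
        Vm vm = new Vm(0, broker.getId(), 500, 1, 512, 1000, 10000, "Xen",
                new CloudletSchedulerTimeShared());
        UtilizationModel full = new UtilizationModelFull();
        Cloudlet cl = new Cloudlet(0, 40000, 1, 300, 300, full, full, full);
        cl.setUserId(broker.getId());

        broker.submitVmList(Collections.singletonList(vm));
        broker.submitCloudletList(Collections.singletonList(cl));

        CloudSim.startSimulation();
        CloudSim.stopSimulation();

        for (Cloudlet c : broker.getCloudletReceivedList()) {
            System.out.println("Cloudlet " + c.getCloudletId() + " finished at " + c.getFinishTime());
        }
    }
}

New scheduling, load balancing, or provisioning policies would be plugged in by swapping the scheduler and allocation-policy classes used above, which is exactly the kind of experiment the simulator is meant for.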
Features of CloudSim
Architecture of CloudSim
• User Interface: This layer provides the interaction between the user and the simulator.
• The CloudSim Core simulation engine provides support for modeling and simulation of
virtualized Cloud-based data center environments including queuing and processing of
events, creation of cloud system entities (like data center, host, virtual machines, brokers,
services, etc.) communication between components and management of the simulation
clock.
• The User Code layer exposes basic entities such as the number of machines, their specifications, etc., as well as applications, VMs, number of users, application types and scheduling policies.
• The User Code layer is a custom layer where the user writes their own code to redefine the characteristics of the simulated environment as per their new research findings.
• Cloud Resources: This layer includes the main resources, like datacenters and the cloud coordinator (which ensures that different resources of the cloud can work in a collaborative way), in the cloud environment.
• Cloud Services: This layer includes the different services provided to the user of cloud services. The various services of clouds include Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS).
A fully managed, native Windows file system is available in the cloud with FSx for Windows File Server.
The AWS advantage for Windows over the next largest cloud provider
2x More Windows Server instances
2x more regions with multiple availability zones
7x fewer downtime hours in 2018*
2x higher performance for SQL Server on Windows
5x more services offering encryption
AWS offers the best cloud for Windows, and it is the right cloud platform for running Windows-based applications.
Windows on Amazon EC2 enables you to increase or decrease capacity within minutes.
i. Broader and Deeper Functionality
ii. Greater Reliability
iii. More Security Capabilities
iv. Faster Performance
v. Lower Costs
vi. More Migration Experience
Popular AWS services for Windows workloads
i. SQL Server on Amazon EC2
ii. Amazon Relational Database Service
iii. Amazon FSx for Windows File Server
iv. AWS Directory Service
v. AWS License Manager
UNIT III
The Basics
Cloud storage is nothing but storing our data with a cloud service provider rather than on a local system; as with other cloud services, we can access the data stored on the cloud via an Internet link. Cloud storage has a number of advantages over traditional data storage. If we store our data on a cloud, we can get at it from any location that has Internet access.
At the most rudimentary level, a cloud storage system just needs one data server connected to the Internet. A subscriber copies files to the server over the Internet, which then records the data. When a client wants to retrieve the data, he or she accesses the data server with a web-based interface, and the server then either sends the files back to the client or allows the client to access and manipulate the files on the server itself.
The term Storage as a Service (another Software as a Service, or SaaS, acronym) means that a third-party provider rents space on their storage to end users who lack the budget or capital budget to pay for it on their own. It is also ideal when technical personnel are not available or have inadequate knowledge to implement and maintain that storage infrastructure. Storage service providers are nothing new, but given the complexity of current backup, replication, and disaster recovery needs, the service has become popular, especially among small and medium-sized businesses. Storage is rented from the provider using a cost-per-gigabyte-stored or cost-per-data-transferred model. The end user doesn't have to pay for infrastructure; they simply pay for how much they transfer and save on the provider's servers.
A customer uses client software to specify the backup set and then transfers data across a WAN. When data loss occurs, the customer can retrieve the lost data from the service provider.
Cloud storage systems utilize dozens or hundreds of data servers. Because servers require maintenance or repair, it is necessary to store the saved data on multiple machines, providing redundancy. Without that redundancy, cloud storage systems couldn't assure clients that they could access their information at any given time. Most systems store the same data on servers using different power supplies. That way, clients can still access their data even if a power supply fails.
c. Providers
There are hundreds of cloud storage providers on the Web, and more seem to be added each day. Not only are there general-purpose storage providers, but there are some that are very specialized in what they store.
Google Docs allows users to upload documents, spreadsheets, and presentations to Google's data servers. Those files can then be edited using a Google application.
Web email providers like Gmail, Hotmail, and Yahoo! Mail store email messages on their own servers. Users can access their email from computers and other devices connected to the Internet.
Flickr and Picasa host millions of digital photographs. Users can create their own online photo albums.
YouTube hosts millions of user-uploaded video files.
Hostmonster and GoDaddy store files and data for many client web sites.
Facebook and MySpace are social networking sites and allow members to post pictures and other content. That content is stored on the company's servers.
MediaMax and Strongspace offer storage space for any kind of digital data.
d. Security:
To secure data, most systems use a combination of techniques:
i. Encryption: A complex algorithm is used to encode information. To decode the encrypted files, a user needs the encryption key. While it's possible to crack encrypted information, it's very difficult and most hackers don't have access to the amount of computer processing power they would need to crack the code.
ii. Authentication processes: This requires a user to create a name and password.
iii. Authorization practices: The client lists the people who are authorized to access information stored on the cloud system. Many corporations have multiple levels of authorization. For example, a front-line employee might have limited access to data stored on the cloud and the head of the IT department might have complete and free access to everything.
e. Reliability
Most cloud storage providers try to address the reliability concern through redundancy, but the possibility still exists that the system could crash and leave clients with no way to access their saved data.
Advantages
Cloud storage is becoming an increasingly attractive solution for organizations. That's because with cloud storage, data resides on the Web, located across storage systems rather than at a designated corporate hosting site. Cloud storage providers balance server loads and move data among various datacenters, ensuring that information is stored close to where it is used.
Storing data on the cloud is advantageous, because it allows us to protect our data in case there's a disaster. We may have backup files of our critical information, but if there is a fire or a hurricane wipes out our organization, having the backups stored locally doesn't help.
Amazon S3 is the best-known storage solution, but other vendors might be better for large enterprises. For instance, those who offer service level agreements and direct access to customer support are critical for a business moving storage to a service provider.
A lot of companies take the "appetizer" approach, testing one or two services to see how well they mesh with their existing IT systems. It's important to make sure the services will provide what we need before we commit too much to the cloud.
3.1.2 Cloud Storage Providers
Amazon and Nirvanix are the current industry top storage providers.
a. Amazon Simple Storage Service (S3)
The best-known cloud storage service is Amazon's Simple Storage Service (S3), which launched in 2006. Amazon S3 is designed to make web-scale computing easier for developers. Amazon S3 provides a simple web services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the Web. It gives any developer access to the same highly scalable data storage infrastructure that Amazon uses to run its own global network of websites. The service aims to maximize benefits of scale and to pass those benefits on to developers.
Amazon S3 is intentionally built with a minimal feature set that includes the following functionality:
Write, read, and delete objects containing from 1 byte to 5 gigabytes of data each. The number of objects that can be stored is unlimited.
Each object is stored and retrieved via a unique developer-assigned key.
Objects can be made private or public, and rights can be assigned to specific users.
Uses standards-based REST and SOAP interfaces designed to work with any Internet-development toolkit.
Design Requirements
Amazon built S3 to fulfill the following design requirements:
Scalable: Amazon S3 can scale in terms of storage, request rate, and users to support an unlimited number of web-scale applications.
Reliable: Store data durably, with 99.99 percent availability. Amazon says it does not allow any downtime.
Fast: Amazon S3 was designed to be fast enough to support high-performance applications. Server-side latency must be insignificant relative to Internet latency. Any performance bottlenecks can be fixed by simply adding nodes to the system.
Inexpensive: Amazon S3 is built from inexpensive commodity hardware components. As a result, frequent node failure is the norm and must not affect the overall system. It must be hardware-agnostic, so that savings can be captured as Amazon continues to drive down infrastructure costs.
Simple: Building highly scalable, reliable, fast, and inexpensive storage is difficult. Doing so in a way that makes it easy to use for any application anywhere is more difficult. Amazon S3 must do both.
Design Principles
Amazon used the following principles of distributed system design to meet Amazon S3 requirements:
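To make the minimal feature set described above concrete, here is a hedged sketch (not from these notes) of writing, reading, and deleting an object with a developer-assigned key, assuming the AWS SDK for Java v1 is available; the bucket and key names are hypothetical placeholders.

// Write, read, and delete an object identified by a developer-assigned key.
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3KeyValueSketch {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient(); // credentials/region from the environment

        String bucket = "example-notes-bucket";              // hypothetical bucket name
        String key = "unit3/hello.txt";                      // developer-assigned key

        s3.putObject(bucket, key, "Hello, Amazon S3");       // write
        String body = s3.getObjectAsString(bucket, key);     // read
        System.out.println(body);
        s3.deleteObject(bucket, key);                        // delete
    }
}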
scale across thousands of commodity servers that can collectively store petabytes of data. Each table in Bigtable is a multidimensional sparse map. That is, the table is made up of rows and columns, and each cell has a timestamp. Multiple versions of a cell can exist, each with a different timestamp. With this stamping, we can select certain versions of a web page, or delete cells that are older than a given date and time.
It is Apple's solution that delivers push email, push contacts, and push calendars from the MobileMe service in the cloud to native applications on iPhone, iPod touch, Macs, and PCs. It provides a suite of ad-free web applications that deliver a desktop-like experience through any browser.
e. Live Mesh:
It is Microsoft's "software plus services" platform and experience that enables PCs and other devices to be aware of each other through the Internet, enabling individuals and organizations to manage, access, and share their files and applications on the web.
It has the following components:
A platform that defines and models a user’s digital relationships among devices,
data, applications, and people—made available to developers through an open data
model and protocols.
A cloud service providing an implementation of the platform hosted in Microsoft
datacenters.
Software, a client implementation of the platform that enables local applications to
run offline and interact seamlessly with the cloud.
A platform experience that exposes the key benefits of the platform for bringing together
a user’s devices, files and applications, and social graph, with news feeds across all of
these.
Standards
Standards make the World Wide Web go around, and by extension, they are important to
cloud computing. Standards are what make it possible to connect to the cloud and what
make it possible to develop and deliver content.
3.2.1 Applications
A cloud application is the software architecture that the cloud uses to eliminate the need to install and run on the client computer. There are many applications that can run, but there needs to be a standard way to connect between the client and the cloud.
a. Communication: HTTP
To get a web page from our cloud provider, we will likely be using the Hypertext Transfer Protocol (HTTP) as the computing mechanism to transfer data between the cloud and our organization. HTTP is a stateless protocol. This is beneficial because hosts do not need to retain information about users between requests, but it forces web developers to use alternative methods for maintaining users' states. HTTP is the language that the cloud and our computers use to communicate.
The Problem with Polling: When we want to sync services between two servers, the most common means is to have the client ping the host at regular intervals. This is known as polling. This is generally how we check our email. Every so often, we ping our email server to see if we got any new messages. It's also how the APIs for most web services work.
The Extensible Messaging and Presence Protocol (XMPP) is being talked about as the next big thing for cloud computing.
b. Security: SSL
SSL is the standard security technology for establishing an encrypted link between a web server and a browser. This ensures that data passed between the browser and the web server stays private. To create an SSL connection on a web server requires an SSL certificate. When our cloud provider starts an SSL session, they are prompted to complete a number of questions about the identity of their company and web site. The cloud provider's computers then generate two cryptographic keys: a public key and a private key.
VMware, AMD, BEA Systems, BMC Software, Broadcom, Cisco, Computer Associates International, Dell, Emulex, HP, IBM, Intel, Mellanox, Novell, QLogic, and Red Hat all worked together to advance open virtualization standards. VMware says that it will provide its partners with access to VMware ESX Server source code and interfaces under a new program called VMware Community Source. This program is designed to help partners influence the direction of VMware ESX Server through a collaborative development model and shared governance process. Community members can participate in and influence the governance of VMware ESX Server through an architecture board.
These initiatives are intended to benefit end users by:
i. Expanding virtualization solutions: The availability of open-standard virtualization interfaces and the collaborative nature of VMware Community Source are intended to accelerate the availability of new virtualization solutions.
ii. Expanded interoperability and supportability: Standard interfaces for hypervisors are expected to enable interoperability for customers with heterogeneous virtualized environments.
iii. Accelerated availability of new virtualization-aware technologies: Vendors across the technology stack can optimize existing technologies and introduce new technologies for running in virtual environments.
Open Hypervisor Standards
Hypervisors are the foundational component of virtual infrastructure and enable computer system partitioning. An open-standard hypervisor framework can benefit customers by enabling innovation across an ecosystem of interoperable virtualization vendors and solutions.
b. OVF
As the result of VMware and its industry partners' efforts, a standard has already been developed called the Open Virtualization Format (OVF). OVF describes how virtual appliances can be packaged in a vendor-neutral format to be run on any hypervisor. It is a platform-independent, extensible, and open specification for the packaging and distribution of virtual appliances composed of one or more virtual machines.
VMware developed a standard with these features:
Optimized for distribution
Enables the portability and distribution of virtual appliances
Supports industry-standard content verification and integrity checking
Provides a basic scheme for the management of software licensing
A simple, automated user experience
Enables a robust and user-friendly approach to streamlining the installation process
Validates the entire package and confidently determines whether each virtual machine should be installed
Verifies compatibility with the local virtual hardware
Portable virtual machine packaging
Enables platform-specific enhancements to be captured
Supports the full range of virtual hard disk formats used for virtual machines today, and is extensible to deal with future formats that are developed
Captures virtual machine properties concisely and accurately
Vendor and platform independent
Does not rely on the use of a specific host platform, virtualization platform, or guest operating system
Extensible
Designed to be extended as the industry moves forward with virtual appliance technology
Localizable
Supports user-visible descriptions in multiple locales
Supports localization of the interactive processes during installation of an appliance
Allows a single packaged appliance to serve multiple market opportunities
3.2.4 Service
A web service, as defined by the World Wide Web Consortium (W3C), “is a software system
designed to support interoperable machine-to-machine interaction over a network” that may
be accessed by other cloud computing components. Web services are often web APIs that can be accessed over a network, like the Internet, and executed on a remote system that hosts the requested services.
a. Data
Data can be stirred and served up with a number of mechanisms; two of the most popular are JSON and XML.
JSON
JSON is short for JavaScript Object Notation and is a lightweight computer data interchange format. It is used for transmitting structured data over a network connection in a process called serialization. It is often used as an alternative to XML.
JSON Basics: JSON is based on a subset of JavaScript and is normally used with that language. However, JSON is considered to be a language-independent format, and code for parsing and generating JSON data is available for several programming languages. This makes it a good replacement for XML when JavaScript is involved with the exchange of data, as in AJAX.
XML vs. JSON: JSON should be used instead of XML when JavaScript is sending or receiving data. The reason for this is that when we use XML in JavaScript, we have to write scripts or use libraries to handle the DOM objects to extract the data we need. However, in JSON, the object is already an object, so no extra work needs to be done.
XML
Extensible Markup Language (XML) is a standard, self-describing way of encoding text and data so that content can be accessed with very little human interaction and exchanged across a wide variety of hardware, operating systems, and applications. XML provides a standardized way to represent text and data in a format that can be used across platforms. It can also be used with a wide range of development tools and utilities.
HTML vs XML
Separation of form and content: HTML uses tags to define the appearance of text, while XML tags define the structure and the content of the data. The presentation of the data is specified by the application or an associated style sheet.
XML is extensible: Tags can be defined by the developer for specific applications, while HTML's tags are defined by the W3C.
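To make the JSON side of this comparison concrete, the following is a small hedged sketch in Java, assuming the Jackson databind library is on the classpath; the field names are illustrative placeholders only.

// Serialize a simple structure to JSON text and parse it back.
import java.util.LinkedHashMap;
import java.util.Map;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonRoundTrip {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        Map<String, Object> record = new LinkedHashMap<>();
        record.put("service", "storage");
        record.put("plan", "basic");
        record.put("active", true);

        String json = mapper.writeValueAsString(record);   // serialization
        System.out.println(json);                           // {"service":"storage","plan":"basic","active":true}

        Map<?, ?> parsed = mapper.readValue(json, Map.class); // the parsed result is already an object
        System.out.println(parsed.get("plan"));
    }
}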
Benefits of XML include:
i. Self-describing data: XML does not require relational schemata, file description tables, external data type definitions, and so forth. Also, while HTML only ensures the correct presentation of the data, XML also guarantees that the data is usable.
ii. Database integration: XML documents can contain any type of data, from text and numbers to multimedia objects to active formats like Java.
iii. No reprogramming if modifications are made: Documents and web sites can be changed with XSL style sheets, without having to reprogram the data.
iv. One-server view of data: XML is exceptionally well suited for cloud computing, because data spread across multiple servers looks as if it is stored on one server.
v. Open and extensible: XML's structure allows us to add other elements if we need them. We can easily adapt our system as our business changes.
vi. Future-proof: The W3C has endorsed XML as an industry standard, and it is supported by all leading software providers. It has already become the industry standard in fields like healthcare.
vii. Contains machine-readable context information: Tags, attributes, and element structure provide the context for interpreting the meaning of content, which opens up possibilities for development.
Content vs. presentation: XML tags describe the meaning of the object, not its presentation. The application (or an associated style sheet) determines the look and feel of the document and presents it as described.
b. Web Services
Web services describe how data is transferred from the cloud to the client.
REST
Representational state transfer (REST) is a way of getting information content from a web site by reading a designated web page that contains an XML file that describes and includes the desired content.
For instance, REST could be used by our cloud provider to provide updated subscription information. Every so often, the provider could prepare a web page that includes content and XML statements that are described in the code. Subscribers only need to know the uniform resource locator (URL) for the page where the XML file is located, read it with a web browser, understand the content using XML information, and display it appropriately.
REST is similar in function to the Simple Object Access Protocol (SOAP), but is easier to use. SOAP requires writing or using a data server program and a client program (to request the data). However, SOAP offers more capability. For instance, if we were to provide syndicated content from our cloud to subscribing web sites, those subscribers might need to use SOAP, which allows greater program interaction between the client and the server.
Benefits
REST offers the following benefits:
It gives better response time and reduced server load due to its support for the caching of representations.
Server scalability is improved by reducing the need to maintain session state.
A single browser can access any application and any resource, so less client-side software needs to be written.
A separate resource discovery mechanism is not needed, due to the use of hyperlinks in representations.
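As a small illustration of the REST style just described, here is a hedged Java sketch using only the JDK's HttpURLConnection to fetch a representation of a resource by its URL; the URL itself is a hypothetical placeholder.

// Fetch an XML representation of a resource with a plain HTTP GET.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestGetSketch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://example.com/subscriptions/42"); // hypothetical resource URI
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/xml");       // ask for an XML representation

        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);                            // the XML describing the resource
            }
        } finally {
            conn.disconnect();
        }
    }
}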
Standards are extremely important, and something that we take for granted
these days. For instance, it’s nothing for us to email Microsoft Word documents
back and forth and expect them to work on our computers.
basic criteria: the vendor (Microsoft, Yahoo, and so on) hosts all of the programs and data in a
central location, providing end users with access to the data and software, which is accessed
across the World Wide Web.
SaaS can be divided into two major categories:
• Line of business services: These are business solutions offered to companies and enterprises. They are sold via a subscription service. Applications covered under this category include business processes, like supply-chain management applications, customer relations applications, and similar business-oriented tools.
• Customer-oriented services: These services are offered to the general public on a subscription basis. More often than not, however, they are offered for free and supported by advertising. Examples in this category include the aforementioned web mail services, online gaming, and consumer banking, among others.
3.3.2 Vendor Advantages
SaaS is an advantage to vendors also, and the financial benefit is the top one: vendors get a constant stream of income, often more than with the traditional software licensing setup. Additionally, through SaaS, vendors can fend off piracy concerns and unlicensed use of software.
Vendors also benefit more as more subscribers come online. They have a huge investment in physical space, hardware, technology staff, and process development. The more these resources are used to capacity, the more the provider can clear as margin.
Virtualization Benefits
Virtualization makes it easy to move to a SaaS system. One of the main reasons it is easier for independent software vendors (ISVs) to adopt SaaS is the growth of virtualization. The growing popularity of some SaaS vendors using Amazon's EC2 cloud platform and the overall popularity of virtualized platforms help with the development of SaaS.
3.3.3 Companies Offering SaaS
Intuit
QuickBooks has been around for years as a conventional application for tracking business accounting. With the addition of QuickBooks Online, accounting has moved to the cloud.
QuickBooks Overview: QuickBooks Online (www.qboe.com) gives small business owners the ability to access their financial data whether they are at work, home, or on the road. Intuit Inc. says the offering also gives users a high level of security because data is stored on firewall-protected servers and protected via automatic data backups. There is also no need to hassle with technology: software upgrades are included at no extra charge.
For companies that are growing, QuickBooks Online Plus offers advanced features such as automatic billing and time tracking, as well as the ability to share information with employees in multiple locations.
QuickBooks Online features include:
• The ability to access financial data anytime and from anywhere. QuickBooks Online is accessible to users 24 hours a day, seven days a week.
• Automated online banking. Download bank and credit card transactions automatically every night, so it's easy to keep data up to date.
• Reliable automatic data backup. Financial data is automatically backed up every day and is stored on Intuit's firewall-protected servers, which are monitored to keep critical business information safe and secure. QuickBooks Online also supports 128-bit Secure Sockets Layer (SSL) encryption.
• No software to buy, install, or maintain and no network required. The software is hosted online, so small business users never have to worry about installing new software or upgrades. QuickBooks Online remembers customer, product, and vendor information, so users don't have to re-enter data.
• Easy accounts receivable and accounts payable. Invoice customers and track customer payments. Create an invoice with the click of a button. Apply specific credits to invoices or apply a single-customer payment to multiple jobs or invoices. Receive bills and enter them into QuickBooks Online with the expected due date.
• Write and print checks. Enter information in the onscreen check form and print checks.
Google
Google's SaaS offerings include Google Apps and Google Apps Premier Edition. Google Apps, launched as a free service in August 2006, is a suite of applications that includes Gmail webmail services, Google Calendar shared calendaring, Google Talk instant messaging and Voice over IP, and the Start Page feature for creating a customizable home page on a specific domain. Google also offers Google Docs and Spreadsheets for all levels of Google Apps. Additionally, Google Apps supports Gmail for mobile on BlackBerry handheld devices.
Google Apps Premier Edition has the following unique features:
• Per-user storage of 10GB: Offers about 100 times the storage of the average corporate mailbox.
• APIs for business integration: APIs for data migration, user provisioning, single sign-on, and mail gateways enable businesses to further customize the service for unique environments.
• Uptime of 99.9 percent: Service level agreements for high availability of Gmail, with Google monitoring and crediting customers if service levels are not met.
• Advertising optional: Advertising is turned off by default, but businesses can choose to include Google's relevant target-based ads if desired.
• Low fee: A simple annual fee of $50 per user account per year makes it practical to offer these applications to select users in the organization.
Microsoft
Microsoft Office Live Small Business offers features including Store Manager, an e-commerce tool to help small businesses easily sell products on their own web site and on eBay; and E-mail Marketing beta, to make sending email newsletters and promotions simple and affordable.
The following features are available in Microsoft Office Live Small Business:
• Store Manager is a hosted e-commerce service that enables users to easily sell products on their own web site and on eBay.
• Custom domain name and business email is available to all customers for free for one year. Private domain name registration is included to help customers protect their contact information from spammers. Business email now includes 100 company-branded accounts, each with 5GB of storage.
• Web design capabilities, including the ability to customize the entire page, as well as the header, footer, navigation, page layouts, and more.
• Support for Firefox 2.0 means Office Live Small Business tools and features are now compatible with Macs.
• A simplified sign-up process allows small business owners to get started quickly. Users do not have to choose a domain name at sign-up or enter their credit card information.
• Domain flexibility allows businesses to obtain their domain name through any provider and redirect it to Office Live Small Business. In addition, customers may purchase additional domain names.
• Synchronization with Microsoft Office Outlook provides customers with access to vital business information such as their Office Live Small Business email, contacts, and calendars, both online and offline.
• E-mail Marketing beta enables users to stay connected to current customers and introduce themselves to new ones by sending regular email newsletters, promotions, and updates.
IBM
Big Blue, IBM, offers its own SaaS solution under the name "Blue Cloud." Blue Cloud is a series of cloud computing offerings that will allow corporate datacenters to operate more like the Internet by enabling computing across a distributed, globally accessible fabric of resources, rather than on local machines or remote server farms. Blue Cloud is based on open standards and open-source software supported by IBM software, systems technology, and services. IBM's Blue Cloud development is supported by more than 200 IBM Internet-scale researchers worldwide and targets clients who want to explore the extreme scale of cloud computing infrastructures.
Software plus Services
Software plus Services takes the notion of Software as a Service (SaaS) further to complement packaged software. Here are some of the ways in which it can help the client organization.
3.4.1 Overview
• User experience: Browsers have limitations as to just how rich the user experience can be. Combining client software that provides the features we want with the ability of the Internet to deliver those experiences gives us the best of both worlds.
• Working offline: Not having to always work online gives us the flexibility to do our work, but without the limitations of the system being unusable. By connecting occasionally and synching data, we get a good solution for road warriors and telecommuters who don't have the same bandwidth or can't always be connected.
• Privacy worries: No matter how we use the cloud, privacy is a major concern. With Software plus Services, we can keep the most sensitive data housed on-site, while less sensitive data can be kept in the cloud.
Development Kit (SDK) as well as enterprise features such as support for Microsoft Exchange ActiveSync to provide secure, over-the-air push email, contacts, and calendars as well as remote wipe, and the addition of Cisco IPsec VPN for encrypted access to private corporate networks.
App Store:
The iPhone software contains the App Store, an application that lets users browse, search, purchase, and wirelessly download third-party applications directly onto their iPhone or iPod touch. The App Store enables developers to reach every iPhone and iPod touch user. Developers set the price for their applications (including free) and retain 70 percent of all sales revenues. Users can download free applications at no charge to either the user or developer, or purchase priced applications with just one click. Enterprise customers can create a secure, private page on the App Store accessible only by their employees.
3.4.4 Microsoft Online:
Microsoft provides Software plus Services offerings, integrating some of its most popular and prevalent offerings, like Exchange. Not only does Microsoft's Software plus Services offering allow a functional way to serve our organization, but it also provides a means to function on the cloud in a simple way.
Hybrid Model
With Microsoft services like Exchange Online, SharePoint Online, and CRM 4.0, organizations big and small have more choices in how they access and manage enterprise solutions, from entirely web-based to entirely on-premise, and anywhere in between. Having a variety of solutions to choose from gives customers the mobility and flexibility they need to meet constantly evolving business needs. To meet this demand, Microsoft is moving toward a hybrid strategy of Software plus Services, the goal of which is to empower customers and partners with richer applications, more choices, and greater opportunity through a combination of on-premise software, partner-hosted software, and Microsoft-hosted software. As part of this strategy, Microsoft expanded its Microsoft Online Services, which includes Exchange Online and SharePoint Online, to organizations of all sizes. With services like Microsoft Online Services and Microsoft Dynamics CRM 4.0, organizations will have the flexibility required to address their business needs.
Exchange Online and SharePoint Online
Exchange Online and SharePoint Online are two examples of how partners can extend their reach, grow their revenues, and increase the number of sales in a Microsoft-hosted scenario. In September 2007, Microsoft initially announced the worldwide availability of Microsoft Online Services (which includes Exchange Online, SharePoint Online, Office Communications Online, and Office Live Meeting) to organizations with more than 5,000 users. The extension of these services to small and mid-sized businesses is appealing to partners in the managed services space because they see it as an opportunity to deliver additional services and customer value on top of Microsoft-hosted Exchange Online or SharePoint Online. Microsoft Online Services opens the door for partners to deliver reliable business services such as desktop and mobile email, calendaring and contacts, instant messaging, audio and video conferencing, and shared workspaces, all of which will help increase their revenue stream and grow their businesses.
Microsoft Dynamics CRM 4.0:
Microsoft Dynamics CRM 4.0, released in December of 2007, provides a key aspect of Microsoft's Software plus Services strategy. The unique advantages of the new Microsoft Dynamics CRM 4.0, which can be delivered on-premise or on-demand as a hosted solution, make Microsoft Dynamics CRM an option for solution providers who want to rapidly offer a solution that meets customer needs and maximizes their potential to grow their own business through additional services.
Intuit Inc.'s QuickBase launched its new QuickBase Business Consultant Program. The program allows members to use their expertise to create unique business applications tailored specifically to the industries they serve, without technical expertise or coding. This helps members expand their reach into industries formerly served only by IT experts. Using QuickBase, program members will be able to easily build new on-demand business applications from scratch or customize one of 200 available templates and resell them to their clients.
d) Microsoft Azure Design
Azure is designed in several layers, with different things going on under the hood.
Layer Zero
Layer Zero is Microsoft's Global Foundational Service (GFS). GFS is akin to the hardware abstraction layer (HAL) in Windows. It is the most basic level of the software that interfaces directly with the servers.
Layer One
Layer One is the base Azure operating system. It used to be code-named "Red Dog," and was designed by a team of operating system experts at Microsoft. Red Dog is the technology that networks and manages the Windows Server 2008 machines that form the
4.2 HADOOP
What??
Using the solution provided by Google, Doug Cutting and his team developed an open source project called HADOOP.
Why??
Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel with others. In short, Hadoop is used to develop applications that could perform complete statistical analysis on huge amounts of data.
Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. The Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Hadoop Ecosystem
Introduction: Hadoop Ecosystem is a platform or a suite which provides various services to solve the big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop, i.e. HDFS, MapReduce, YARN, and Hadoop Common. Most of the tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as absorption, analysis, storage and maintenance of data etc.
Following are the components that collectively form a Hadoop ecosystem:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming based Data Processing
Spark: In-Memory data processing
PIG, HIVE: Query based processing of data services
HBase: NoSQL Database
Mahout, Spark MLLib: Machine Learning algorithm libraries
Solr, Lucene: Searching and Indexing
Zookeeper: Managing cluster
Oozie: Job Scheduling
All these toolkits or components revolve around one term, i.e. data. That's the beauty of Hadoop: it revolves around data and hence makes its synthesis easier.
HDFS:
HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes and thereby maintaining the metadata in the form of log files.
HDFS consists of two core components, i.e.
1. Name Node
2. Data Node
Name Node is the prime node which contains metadata (data about data), requiring comparatively fewer resources than the data nodes that store the actual data. These data nodes are commodity hardware in the distributed environment, undoubtedly making Hadoop cost effective.
HDFS maintains all the coordination between the clusters and hardware, thus working at the heart of the system.
YARN:
Yet Another Resource Negotiator, as the name implies, YARN is the one who helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components, i.e.
1. Resource Manager
2. Node Manager
3. Application Manager
The Resource Manager has the privilege of allocating resources for the applications in the system, whereas Node Managers work on the allocation of resources such as CPU, memory, and bandwidth per machine and later on acknowledge the Resource Manager. The Application Manager works as an interface between the Resource Manager and Node Manager and performs negotiations as per the requirement of the two.
MapReduce:
By making use of distributed and parallel algorithms, MapReduce makes it possible to carry over the processing's logic and helps to write applications which transform big data sets into a manageable one.
MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of data and thereby organizes them in the form of groups. Map generates a key-value pair based result which is later on processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
Pig:
It's a platform that handles all the process consumptive tasks like batch processing, interactive or iterative real-time processing, graph conversions, and visualization, etc.
Hadoop Architecture:
At its core, Hadoop has two major layers, namely:
Processing/Computation layer (MapReduce), and
Storage layer (Hadoop Distributed File System).
How Does Hadoop Work?
It is quite expensive to build bigger servers with heavy configurations that handle large scale processing, but as an alternative, you can tie together many commodity computers with a single CPU each, as a single functional distributed system; practically, the clustered machines can read the dataset in parallel and provide a much higher throughput. Moreover, it is cheaper than one high-end server. So this is the first motivational factor behind using Hadoop: it runs across clustered and low-cost machines.
Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs:
Data is initially divided into directories and files. Files are divided into uniform sized blocks of 128M and 64M (preferably 128M).
These files are then distributed across various cluster nodes for further processing.
HDFS, being on top of the local file system, supervises the processing.
Blocks are replicated for handling hardware failure.
Checking that the code was executed successfully.
Performing the sort that takes place between the map and reduce stages.
Sending the sorted data to a certain computer.
Writing the debugging logs for each job.
Advantages of Hadoop
The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines and, in turn, utilizes the underlying parallelism of the CPU cores.
Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA); rather, the Hadoop library itself has been designed to detect and handle failures at the application layer.
Servers can be added or removed from the cluster dynamically and Hadoop continues to operate without interruption.
Another big advantage of Hadoop is that apart from being open source, it is compatible on all the platforms since it is Java based.
Hadoop is supported by the GNU/Linux platform and its flavors. Therefore, we have to install a Linux operating system for setting up the Hadoop environment. In case you have an OS other than Linux, you can install Virtualbox software and have Linux inside the Virtualbox.
MapReduce
MapReduce is a parallel programming model for writing distributed applications devised at Google for efficient processing of large amounts of data (multi-terabyte data-sets), on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. The MapReduce program runs on Hadoop, which is an Apache open-source framework.
MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner.
What is MapReduce?
MapReduce is a processing technique and a program model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Secondly, the reduce task takes the output from a map as an input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the application to run over hundreds or thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to use the MapReduce model.
The Algorithm
Generally the MapReduce paradigm is based on sending the computer to where the data resides!
A MapReduce program executes in three stages, namely map stage, shuffle stage, and reduce stage.
o Map stage − The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
o Reduce stage − This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in the HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster.
The framework manages all the details of data-passing such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks, which reduces the network traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result, and sends it back to the Hadoop server.
Reduce: <k2, list(v2)> → list(<k3, v3>)
MapReduce Tutorial: A Word Count Example of MapReduce
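Below is a hedged sketch of the classic word-count job using the org.apache.hadoop.mapreduce API; it is illustrative rather than taken from these notes. Map() emits <word, 1> pairs and Reduce() sums them, matching the Map()/Reduce() roles described earlier.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map(): emit <word, 1> for every word in the input line
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce(): sum the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

It would typically be packaged as a JAR and run with something like hadoop jar wordcount.jar WordCount <input dir> <output dir>, where the output directory must not already exist.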
HDFS provides reliable data storage. It can store data in the range of hundreds of petabytes. HDFS stores data reliably on a cluster: it divides the data into blocks, and the Hadoop framework stores these blocks on nodes present in the HDFS cluster. HDFS stores data reliably by creating a replica of each and every block present in the cluster, and hence provides a fault tolerance facility. If the node in the cluster containing data goes down, then a user can easily access that data from the other nodes. HDFS by default creates 3 replicas of each block containing data present in the nodes. So, data is quickly available to the users, and the user does not face the problem of data loss. Thus, HDFS is highly reliable.
HDFS follows the master-slave architecture and it has the following elements.
a) Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. It is software that can be run on commodity hardware. The system having the namenode acts as the master server and it does the following tasks:
Manages the file system namespace.
Regulates clients' access to files.
It also executes file system operations such as renaming, closing, and opening files and directories.
b) Datanode
The datanode is a commodity hardware having the GNU/Linux
operating system and datanode software. For every node (Commodity
hardware/System) in a cluster, there will be a datanode. These nodes
manage the data storage of their system.
Datanodes perform read-write operations on the file systems,
as per client request.
They also perform operations such as block creation, deletion,
and replication according to the instructions of the namenode.
c) Block
Generally the user data is stored in the files of HDFS. The file in a
file system will be divided into one or more segments and/or stored
in individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64MB, but it can be increased as needed by changing the HDFS configuration.
Goals of HDFS
i) Fault detection and recovery − Since HDFS includes a large number of commodity hardware, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.
ii) Huge datasets − HDFS should have hundreds of nodes per cluster to manage the applications having huge datasets.
iii) Hardware at data − A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, it reduces the network traffic and increases the throughput.
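As a hedged illustration of how a client talks to the namenode/datanode architecture just described, the sketch below uses Hadoop's Java FileSystem API to write a small file to HDFS and read it back; the namenode URI and path are hypothetical placeholders.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf); // hypothetical namenode
        Path file = new Path("/user/example/notes.txt");                            // hypothetical path

        // Write: the client asks the namenode for block locations; the data itself goes to datanodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("stored in HDFS blocks with replication");
        }

        // Read: blocks are fetched from the datanodes and streamed back to the client.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}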
UNIT V It is possible to disable verification of checksums by passing false to the setVerify Checksum() method on
HADOOP I/O FileSystem, before using the open() method to read a file. The same effect is possible from the shell by using
Hadoop comes with a set of primitives for data I/O and the techniques that are more general than Hadoop, the -ignoreCrc option with the -get or the equivalent -copyToLocal command. This feature is useful if you
such as data integrity and compression, but deserve special consideration when dealing with multi-terabyte have a corrupt file that you want to inspect so you can decide what to do with it. For example, you might
datasets. want to see whether it can be salvaged before you delete it.
Others are Hadoop tools or APIs that form the building blocks for developing distributed system, such as
5.1.2 LocalFileSystem
serialization frameworks and on-disk data structures
5.1. Data integrity
The Hadoop LocalFileSystem performs client-side checksumming. This means that when you write a file
The usual way of detecting corrupted data is by computing a checksum for the data when it first enters the system,
and again whenever it is transmitted across a channel that is unreliable and hence capable of corrupting the data. The data is deemed to be corrupt if the newly generated checksum doesn't exactly match the original. This technique doesn't offer any way to fix the data; it only detects errors. A commonly used error-detecting code is CRC-32 (cyclic redundancy check), which computes a 32-bit integer checksum for input of any size.

5.1.1 Data Integrity in HDFS
HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A separate checksum is created for every io.bytes.per.checksum bytes of data. The default is 512 bytes, and since a CRC-32 checksum is 4 bytes long, the storage overhead is less than 1%.
Datanodes are responsible for verifying the data they receive before storing the data and its checksum. This applies to data that they receive from clients and from other datanodes during replication. A client writing data sends it to a pipeline of datanodes, and the last datanode in the pipeline verifies the checksum. If it detects an error, the client receives a ChecksumException, a subclass of IOException, which it should handle in an application-specific manner, by retrying the operation, for example.
When clients read data from datanodes, they verify checksums as well, comparing them with the ones stored at the datanode. Each datanode keeps a persistent log of checksum verifications, so it knows the last time each of its blocks was verified. When a client successfully verifies a block, it tells the datanode, which updates its log. Keeping statistics such as these is valuable in detecting bad disks.
Aside from block verification on client reads, each datanode runs a DataBlockScanner in a background thread that periodically verifies all the blocks stored on the datanode. This guards against corruption due to "bit rot" in the physical storage media. See "Datanode block scanner" for details on how to access the scanner reports.
Since HDFS stores replicas of blocks, it can "heal" corrupted blocks by copying one of the good replicas to produce a new, uncorrupt replica. The way this works is that if a client detects an error when reading a block, it reports the bad block and the datanode it was trying to read from to the namenode before throwing a ChecksumException. The namenode marks the block replica as corrupt, so it doesn't direct clients to it or try to copy this replica to another datanode. It then schedules a copy of the block to be replicated on another datanode, so its replication factor is back at the expected level. Once this has happened, the corrupt replica is deleted.

Hadoop's LocalFileSystem performs client-side checksumming: when you write a file called filename, the filesystem client transparently creates a hidden file, .filename.crc, in the same directory containing the checksums for each chunk of the file. As in HDFS, the chunk size is controlled by the io.bytes.per.checksum property, which defaults to 512 bytes. The chunk size is stored as metadata in the .crc file, so the file can be read back correctly even if the setting for the chunk size has changed. Checksums are verified when the file is read, and if an error is detected, LocalFileSystem throws a ChecksumException.
Checksums are fairly cheap to compute (in Java, they are implemented in native code), typically adding a few percent overhead to the time to read or write a file. For most applications, this is an acceptable price to pay for data integrity. It is, however, possible to disable checksums, typically when the underlying filesystem supports checksums natively. This is accomplished by using RawLocalFileSystem in place of LocalFileSystem. To do this globally in an application, it suffices to remap the implementation for file URIs by setting the property fs.file.impl to the value org.apache.hadoop.fs.RawLocalFileSystem. Alternatively, you can directly create a RawLocalFileSystem instance, which may be useful if you want to disable checksum verification for only some reads. For example:

Configuration conf = ...
FileSystem fs = new RawLocalFileSystem();
fs.initialize(null, conf);

5.1.3 ChecksumFileSystem
LocalFileSystem uses ChecksumFileSystem to do its work, and this class makes it easy to add checksumming to other (non-checksummed) filesystems, as ChecksumFileSystem is just a wrapper around FileSystem. The general idiom is as follows:

FileSystem rawFs = ...
FileSystem checksummedFs = new ChecksumFileSystem(rawFs);

The underlying filesystem is called the raw filesystem, and may be retrieved using the getRawFileSystem() method on ChecksumFileSystem. ChecksumFileSystem has a few more useful methods for working with checksums, such as getChecksumFile() for getting the path of a checksum file for any file. Check the documentation for the others.
If an error is detected by ChecksumFileSystem when reading a file, it will call its reportChecksumFailure() method. The default implementation does nothing, but LocalFileSystem moves the offending file and its checksum to a side directory on the same device called bad_files. Administrators should periodically check for these bad files and take action on them.
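The following is a minimal sketch of how a client might observe this behaviour: it reads a file through the checksummed LocalFileSystem and reacts to a ChecksumException. The class name and file path are illustrative, not part of the Hadoop API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ChecksumException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ChecksumReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        LocalFileSystem fs = FileSystem.getLocal(conf);   // client-side checksumming enabled
        Path file = new Path("/tmp/part-00000");          // illustrative path; expects a sibling .part-00000.crc file
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.copyBytes(in, System.out, 4096, false);
        } catch (ChecksumException e) {
            // Raised when the data read back does not match the stored CRC-32 checksums;
            // LocalFileSystem also moves the offending file into the bad_files side directory.
            System.err.println("Checksum error reading " + file);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}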
5.2 Compression

All of the tools listed in Table 4-1 give some control over this trade-off at compression time by offering nine different options: -1 means optimize for speed and -9 means optimize for space, e.g. gzip -1 file.
The different tools have very different compression characteristics. Both gzip and ZIP are general-purpose compressors and sit in the middle of the space/time trade-off. Bzip2 compresses more effectively than gzip or ZIP, but is slower. LZO optimizes for speed: it is faster than gzip and ZIP, but compresses slightly less effectively.

5.2.1 Codecs
A codec is the implementation of a compression-decompression algorithm. The LZO libraries are GPL-licensed and may not be included in Apache distributions, so the Hadoop codecs must be downloaded separately from http://code.google.com/p/hadoop-gpl-compression/

Compressing and decompressing streams with CompressionCodec
CompressionCodec has two methods that allow you to easily compress or decompress data.
To compress data being written to an output stream, use the createOutputStream(OutputStream out) method to create a CompressionOutputStream, to which you write your uncompressed data to have it written in compressed form to the underlying stream.
To decompress data being read from an input stream, call createInputStream(InputStream in) to obtain a CompressionInputStream, which allows you to read uncompressed data from the underlying stream.
For example, the following program compresses data read from standard input and writes it to standard output:

String codecClassname = args[0];
Class<?> codecClass = Class.forName(codecClassname);
Configuration conf = new Configuration();
CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
CompressionOutputStream out = codec.createOutputStream(System.out);
IOUtils.copyBytes(System.in, out, 4096, false);
out.finish();

Inferring CompressionCodecs using CompressionCodecFactory
If you are reading a compressed file, you can normally infer the codec to use by looking at its filename extension. A file ending in .gz can be read with GzipCodec, and so on.
CompressionCodecFactory provides a way of mapping a filename extension to a CompressionCodec using its getCodec() method, which takes a Path object for the file in question. The following example shows an application that uses this feature to decompress files:

String uri = args[0];
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri), conf);
Path inputPath = new Path(uri);
CompressionCodecFactory factory = new CompressionCodecFactory(conf);
CompressionCodec codec = factory.getCodec(inputPath);
if (codec == null) {
    System.err.println("No codec found for " + uri);
    System.exit(1);
}
String outputUri =
    CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());
InputStream in = null;
OutputStream out = null;
try {
    in = codec.createInputStream(fs.open(inputPath));
    out = fs.create(new Path(outputUri));
    IOUtils.copyBytes(in, out, conf);
} finally {
    IOUtils.closeStream(in);
    IOUtils.closeStream(out);
}

Native libraries
For performance, it is preferable to use a native library for compression and decompression. For example, in one test, using the native gzip libraries reduced decompression times by up to 50% and compression times by around 10% (compared to the built-in Java implementation).
Hadoop comes with prebuilt native compression libraries for 32- and 64-bit Linux, which you can find in the lib/native directory. By default, Hadoop looks for native libraries for the platform it is running on and loads them automatically if they are found.
If you are using a native library and you are doing a lot of compression or decompression in your application, consider using CodecPool, which allows you to reuse compressors and decompressors, thereby amortizing the cost of creating these objects.

5.2.3 Using Compression in MapReduce
If your input files are compressed, they will be automatically decompressed as they are read by MapReduce, using the filename extension to determine the codec to use. For example:
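The original example is not reproduced here; as a hedged sketch (using the new org.apache.hadoop.mapreduce API, where job is the Job object configured in the driver), output compression for a job can be enabled like this:

// Classes: org.apache.hadoop.mapreduce.lib.output.FileOutputFormat and
// org.apache.hadoop.io.compress.GzipCodec; the rest of the driver setup is omitted.
FileOutputFormat.setCompressOutput(job, true);                    // compress the job's output files
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);  // use the gzip codec for the output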
When considering how to compress data that will be processed by MapReduce, it is important to understand whether the compression format supports splitting.
Consider an uncompressed file stored in HDFS whose size is 1 GB. With an HDFS block size of 64 MB, the file will be stored as 16 blocks, and a MapReduce job using this file as input will create 16 input splits, each processed independently as input to a separate map task.
Imagine now that the file is a gzip-compressed file whose compressed size is 1 GB. As before, HDFS will store the file as 16 blocks. However, creating a split for each block won't work, since it is impossible to start reading at an arbitrary point in the gzip stream, and therefore impossible for a map task to read its split independently of the others.
In this case, MapReduce will do the right thing and not try to split the gzipped file. This will work, but at the expense of locality: a single map will process the 16 HDFS blocks, most of which will not be local to the map. Also, with fewer maps, the job is less granular and so may take longer to run.

5.3 Serialization
Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage. Deserialization is the process of turning a byte stream back into a series of structured objects.
In Hadoop, interprocess communication between nodes in the system is implemented using remote procedure calls (RPCs). The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message.
In general, it is desirable that an RPC serialization format is:
Compact: A compact format makes the best use of network bandwidth.
Fast: Interprocess communication forms the backbone of a distributed system, so it is essential that there is as little performance overhead as possible for the serialization and deserialization process.
Extensible: Protocols change over time to meet new requirements, so it should be straightforward to evolve the protocol in a controlled manner for clients and servers.
Interoperable: For some systems, it is desirable to be able to support clients that are written in different languages to the server.
5.3.1 Writable Interface
The Writable interface defines two methods: one for writing its state to a DataOutput binary stream, and one for reading its state from a DataInput binary stream.
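For reference, the interface itself (in the org.apache.hadoop.io package) is small; DataOutput and DataInput are from java.io:

public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}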
We will use IntWritable, a wrapper for a Java int. We can create one and set its value using the set() method:

IntWritable writable = new IntWritable();
writable.set(163);

To examine the serialized form of the IntWritable, we write a small helper method that wraps a java.io.ByteArrayOutputStream in a java.io.DataOutputStream to capture the bytes in the serialized stream:

public static byte[] serialize(Writable writable) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DataOutputStream dataOut = new DataOutputStream(out);
    writable.write(dataOut);
    dataOut.close();
    return out.toByteArray();
}
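To complete the round trip, here is a companion sketch mirroring serialize() above (the assertThat/is assertions assume the JUnit/Hamcrest style used elsewhere in these notes):

public static byte[] deserialize(Writable writable, byte[] bytes) throws IOException {
    ByteArrayInputStream in = new ByteArrayInputStream(bytes);
    DataInputStream dataIn = new DataInputStream(in);
    writable.readFields(dataIn);   // the second Writable method: read state from a DataInput
    dataIn.close();
    return bytes;
}

// Usage: round-trip the IntWritable set to 163 above.
byte[] bytes = serialize(writable);
assertThat(bytes.length, is(4));          // an int serializes to 4 bytes
IntWritable newWritable = new IntWritable();
deserialize(newWritable, bytes);
assertThat(newWritable.get(), is(163));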
5.2.2 Writable Classes
Hadoop comes with a large selection of Writable classes in the org.apache.hadoop.io package. They form the class hierarchy shown in Figure 4-1.

Writable wrappers for Java primitives
There are Writable wrappers for all the Java primitive types except short and char. All have a get() and a set() method for retrieving and storing the wrapped value.

Text
Text is a Writable for UTF-8 sequences. It can be thought of as the Writable equivalent of java.lang.String. The Text class uses an int to store the number of bytes in the string encoding, so the maximum value is 2 GB. Furthermore, Text uses standard UTF-8, which makes it potentially easier to interoperate with other tools that understand UTF-8.
The Text class has several notable features: indexing, Unicode handling, iteration, mutability, and resorting to String.

Indexing
Indexing for the Text class is in terms of position in the encoded byte sequence, not the Unicode character in the string or the Java char code unit. For ASCII strings, these three concepts of index position coincide. Notice that charAt() returns an int representing a Unicode code point, unlike the String variant that returns a char. Text also has a find() method, which is analogous to String's indexOf().

Unicode
When we start using characters that are encoded with more than a single byte, the differences between Text and String become clear. Consider the Unicode characters shown in Table 4-7. All but the last character in the table, U+10400, can be expressed using a single Java char.

Iteration
Iterating over the Unicode characters in Text is complicated by the use of byte offsets for indexing, since you can't just increment the index. The idiom for iteration is a little obscure: turn the Text object into a java.nio.ByteBuffer, then repeatedly call the bytesToCodePoint() static method on Text with the buffer. This method extracts the next code point as an int and updates the position in the buffer. For example:

public class TextIterator {
    public static void main(String[] args) {
        Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");
        ByteBuffer buf = ByteBuffer.wrap(t.getBytes(), 0, t.getLength());
        int cp;
        while (buf.hasRemaining() && (cp = Text.bytesToCodePoint(buf)) != -1) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
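The indexing and find() behaviour described above can be made concrete with a few assertions. This is a hedged sketch in the same assertThat style; the expected values reflect the UTF-8 encoding of "hadoop":

Text t = new Text("hadoop");
assertThat(t.getLength(), is(6));             // number of bytes in the UTF-8 encoding
assertThat(t.getBytes().length, is(6));
assertThat(t.charAt(2), is((int) 'd'));       // charAt() returns an int code point
assertThat("Out of bounds", t.charAt(100), is(-1));
assertThat(t.find("do"), is(2));              // find() is analogous to String's indexOf()
assertThat("No match", t.find("pig"), is(-1));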
Mutability
Another difference from String is that Text is mutable. You can reuse a Text instance by calling one of the set() methods on it. For example:

Text t = new Text("hadoop");
t.set("pig");
assertThat(t.getLength(), is(3));
assertThat(t.getBytes().length, is(3));

Resorting to String
Text doesn't have as rich an API for manipulating strings as java.lang.String, so in many cases you need to convert the Text object to a String.

NullWritable
NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes are written to, or read from, the stream. It is used as a placeholder. For example, in MapReduce, a key or a value can be declared as a NullWritable when you don't need to use that position; it effectively stores a constant empty value.
NullWritable can also be useful as a key in a SequenceFile when you want to store a list of values, as opposed to key-value pairs. It is an immutable singleton: the instance can be retrieved by calling NullWritable.get(). For example:
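The original example is not shown; as a hedged sketch (the path and values are made up, and the SequenceFile.Writer API is described in the next section), a value-only SequenceFile keyed by NullWritable could be written like this:

// NullWritable as the key: only the values carry information.
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, new Path("/tmp/values.seq"),
        NullWritable.class, Text.class);
try {
    writer.append(NullWritable.get(), new Text("first value"));
    writer.append(NullWritable.get(), new Text("second value"));
} finally {
    IOUtils.closeStream(writer);
}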
5.2.4 Serialization Frameworks
Although most MapReduce programs use Writable key and value types, this isn't mandated by the MapReduce API. In fact, any types can be used; the only requirement is that there be a mechanism that translates to and from a binary representation of each type.
To support this, Hadoop has an API for pluggable serialization frameworks. A serialization framework is represented by an implementation of Serialization. WritableSerialization, for example, is the implementation of Serialization for Writable types.
Although it would be convenient to be able to use standard Java types, such as Integer or String, in MapReduce programs, Java Object Serialization is not as efficient as Writable, so it's not worth making this trade-off.

5.4 File-Based Data Structures
For some applications, you need a specialized data structure to hold your data. For MapReduce-based processing, putting each blob of binary data into its own file doesn't scale, so Hadoop developed a number of higher-level containers for these situations:
o SequenceFile
o MapFile

5.4.1 SequenceFile
Imagine a logfile where each log record is a new line of text. If you want to log binary types, plain text isn't a suitable format. Hadoop's SequenceFile class fits the bill in this situation, providing a persistent data structure for binary key-value pairs. To use it as a logfile format, you would choose a key, such as a timestamp represented by a LongWritable, and the value is a Writable that represents the quantity being logged.
SequenceFiles also work well as containers for smaller files. HDFS and MapReduce are optimized for large files, so packing files into a SequenceFile makes storing and processing the smaller files more efficient.

Writing a SequenceFile
To create a SequenceFile, use one of its createWriter() static methods, which returns a SequenceFile.Writer instance. The keys and values stored in a SequenceFile do not necessarily need to be Writable; any types that can be serialized and deserialized by a Serialization may be used.
Once you have a SequenceFile.Writer, you write key-value pairs using the append() method. When you've finished, you call the close() method (SequenceFile.Writer implements java.io.Closeable). For example:

IntWritable key = new IntWritable();
Text value = new Text();
SequenceFile.Writer writer = null;
try {
    writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
    for (int i = 0; i < 100; i++) {
        key.set(100 - i);
        value.set(DATA[i % DATA.length]);   // DATA is an array of sample strings defined elsewhere
        System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
        writer.append(key, value);
    }
} finally {
    IOUtils.closeStream(writer);
}

Reading a SequenceFile
Reading sequence files from beginning to end is a matter of creating an instance of SequenceFile.Reader and iterating over records by repeatedly invoking one of the next() methods. If you are using Writable types, you can use the next() method that takes a key and a value argument, and reads the next key and value in the stream into these variables:

public static void main(String[] args) throws IOException {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);
    SequenceFile.Reader reader = null;
    try {
        reader = new SequenceFile.Reader(fs, path, conf);
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        long position = reader.getPosition();
        while (reader.next(key, value)) {
            String syncSeen = reader.syncSeen() ? "*" : "";
            System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, value);
            position = reader.getPosition(); // beginning of next record
        }
    } finally {
        IOUtils.closeStream(reader);
    }
}
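As a brief, hedged aside on the reader API used above: positions obtained from getPosition() can be used to re-position the reader later (the offsets here are illustrative).

reader.seek(position);     // position must be a record boundary, such as a value saved from getPosition()
reader.next(key, value);   // reads the record that starts at that offset
reader.sync(position);     // alternatively, move to the next sync point after the given offset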
5.4.2 MapFile
A MapFile is a sorted SequenceFile with an index to permit lookups by key. A MapFile can be thought of as a persistent form of java.util.Map (although it doesn't implement this interface), which is able to grow beyond the size of a Map that is kept in memory.

Writing a MapFile
Writing a MapFile is similar to writing a SequenceFile: you create an instance of MapFile.Writer, then call the append() method to add entries in order. Keys must be instances of WritableComparable, and values must be Writable. (A sketch of a writer and a lookup follows at the end of this subsection.)

Reading a MapFile
A random-access lookup can be performed by calling the get() method on MapFile.Reader. The return value is used to determine if an entry was found in the MapFile: if it is null, then no value exists for the given key; if the key was found, then the value for that key is read into val, as well as being returned from the method call.
For this operation, MapFile.Reader reads the index file into memory. A very large MapFile's index can take up a lot of memory. Rather than reindexing to change the index interval, it is possible to load only a fraction of the index keys into memory when reading the MapFile by setting the io.map.index.skip property.
One way of looking at a MapFile is as an indexed and sorted SequenceFile, so it's quite natural to want to be able to convert a SequenceFile into a MapFile. For example:

// map is the MapFile directory; mapData is the data file within it (map/data)
SequenceFile.Reader reader = new SequenceFile.Reader(fs, mapData, conf);
Class keyClass = reader.getKeyClass();
Class valueClass = reader.getValueClass();
reader.close();

// Create the map file index file
long entries = MapFile.fix(fs, map, keyClass, valueClass, false, conf);
System.out.printf("Created MapFile %s with %d entries\n", map, entries);
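Here is the sketch promised above: a minimal, hedged example of writing a MapFile and looking up a key with MapFile.Reader.get(). The directory name and data are illustrative, and fs and conf are assumed to be set up as in the earlier examples.

IntWritable key = new IntWritable();
Text value = new Text();
MapFile.Writer writer = null;
try {
    writer = new MapFile.Writer(conf, fs, "/tmp/numbers.map", key.getClass(), value.getClass());
    for (int i = 0; i < 1024; i++) {
        key.set(i + 1);                      // keys must be appended in sorted order
        value.set("entry " + (i + 1));
        writer.append(key, value);
    }
} finally {
    IOUtils.closeStream(writer);
}

// Random-access lookup: get() returns null if the key is absent,
// otherwise the value is read into val and also returned.
MapFile.Reader reader = new MapFile.Reader(fs, "/tmp/numbers.map", conf);
Text val = new Text();
Writable entry = reader.get(new IntWritable(496), val);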
In the URLCat program (whose usage is shown below), we make use of the handy IOUtils class that comes with Hadoop for closing the stream in the finally clause, and also for copying bytes between the input stream and the output stream (System.out, in this case). The last two arguments to the copyBytes() method are the buffer size used for copying and whether to close the streams when the copy is complete. We close the input stream ourselves, and System.out doesn't need to be closed.
% export HADOOP_CLASSPATH=hadoop-examples.jar
% hadoop URLCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
1. A client initiates a read request by calling the 'open()' method of the FileSystem object; it is an object of type DistributedFileSystem.
2. This object connects to the namenode using RPC and gets metadata information such as the locations of the blocks of the file. Note that these addresses are of the first few blocks of the file.
3. In response to this metadata request, the addresses of the DataNodes having a copy of each block are returned.
4. Once the addresses of the DataNodes are received, an object of type FSDataInputStream is returned to the client. FSDataInputStream contains DFSInputStream, which takes care of interactions with the DataNodes and the NameNode. In step 4, the client invokes the 'read()' method, which causes DFSInputStream to establish a connection with the first DataNode holding the first block of the file.
5. Data is read in the form of streams, with the client invoking the 'read()' method repeatedly. This process continues until it reaches the end of the block.
6. Once the end of a block is reached, DFSInputStream closes the connection and moves on to locate the next DataNode for the next block.
7. Once the client is done with reading, it calls the close() method.
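The steps above correspond to only a few lines of client code. A minimal hedged sketch (the class name FileSystemCat and the URI are illustrative):

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];                                    // e.g. hdfs://localhost/user/tom/quangle.txt
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);   // DistributedFileSystem for hdfs: URIs
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));                         // step 1: open()
            IOUtils.copyBytes(in, System.out, 4096, false);      // steps 4-6: repeated read()s
        } finally {
            IOUtils.closeStream(in);                             // step 7: close()
        }
    }
}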
Write Operation in HDFS
Here we look at how data is written into HDFS through files.

Access HDFS using Java API
In order to interact with Hadoop's filesystem programmatically, Hadoop provides multiple Java classes. The package org.apache.hadoop.fs contains classes useful for manipulating a file in Hadoop's filesystem. These operations include open, read, write, and close. In fact, the file API for Hadoop is generic and can be extended to interact with filesystems other than HDFS.
An object of java.net.URL can be used for reading the contents of a file. To begin with, we need to make Java recognize Hadoop's hdfs URL scheme. This is done by calling the setURLStreamHandlerFactory method on the URL class, passing it an instance of FsUrlStreamHandlerFactory. This method needs to be executed only once per JVM, hence it is enclosed in a static block.
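A minimal sketch of such a program is shown below. The class name URLCat matches the command shown earlier in these notes; treat the details as illustrative.

import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
    static {
        // Register Hadoop's handler for the hdfs:// scheme; may only be called once per JVM.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();   // e.g. hdfs://localhost/user/tom/quangle.txt
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}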
Command to list the files in the HDFS root directory
$HADOOP_HOME/bin/hdfs dfs -ls /
We can see a file 'temp.txt' (copied earlier) being listed under the '/' directory.

3. Command to copy a file to the local filesystem from HDFS
$HADOOP_HOME/bin/hdfs dfs -copyToLocal /temp.txt
We can see temp.txt copied to the local filesystem.

4. Command to create a new directory
$HADOOP_HOME/bin/hdfs dfs -mkdir /mydirectory

Starting HDFS
Initially you have to format the configured HDFS file system, open the namenode (HDFS server), and execute the following command.
$ hadoop namenode -format
After formatting the HDFS, start the distributed file system. The following command will start the namenode as well as the datanodes as a cluster.
$ start-dfs.sh

Inserting data into HDFS
Assume we have data in a file called file.txt in the local system which is to be saved in the HDFS file system. Follow the steps given below to insert the required file into the Hadoop file system.
Step 1: Create an input directory in HDFS.
Step 2: Transfer and store the data file from the local system to the Hadoop file system using the put command.
$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input
Step 3: Verify the file (for example, with the ls command shown earlier).

Retrieving data from HDFS
Assume we have a file in HDFS called outfile. Given below is a simple demonstration for retrieving the required file from the Hadoop file system.
Step 1: Initially, view the data from HDFS using the cat command.
Step 2: Get the file from HDFS to the local file system using the get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/

Shutting Down the HDFS
You can shut down the HDFS by using the following command.
$ stop-dfs.sh

Usage:
hadoop fs -copyFromLocal <localsrc> URI