DS Syllabus Introduction (Reference)

The document introduces Hadoop and MapReduce, describing Hadoop as a framework for storing and processing huge datasets, with MapReduce providing a programming model to process data stored in Hadoop in a parallel and distributed manner. It also discusses HDFS as the distributed file system used by Hadoop to partition and store data across multiple machines, with important components including blocks, the NameNode, and DataNodes. Finally, YARN is introduced as the resource management framework used in newer Hadoop versions to schedule applications running on Hadoop clusters.

Uploaded by

Joel Shibbi

DISTRIBUTED SYSTEMS

SYLLABUS
Subject Code: CSE- 3251
2022

Dr. P.B.Shanthi
Asst. Prof (Sel. Grade)
CSE Dept., MIT, Manipal
Cabin: No. 16 in Faculty Room 2
Mob: 9901659271
Text Books

(1) Distributed Systems – Maarten van Steen and Andrew S. Tanenbaum, 3rd edition, Version 3.01/3.02, 2017

(2) Hadoop: The Definitive Guide, Tom White, 4th Edition, O'Reilly.

Reference Books:
(3) Distributed Systems - Coulouris G., Dollimore J., and Kindberg T., Pearson, 4th edition, 2009.

(4) Hadoop in Action - Chuck Lam, Manning Publications Co.

(5) Ajay D. Kshemkalyani and Mukesh Singhal, Distributed Computing: Principles, Algorithms, and Systems, Cambridge University Press; Reissue edition, March 2011.

(6) Mei-Ling Liu, Distributed Computing: Principles and Applications, Pearson Education, Inc. New Delhi. 2004.
Syllabus
Syllabus Continued…..
Module – 1 Teaching Hours

Introduction: 6 Hours
What is a distributed system? Design Goals.
Architecture:
Architectural Styles, Middleware Organization, System Architecture,
Example Architectures
 
Text 1: Ch 1: 1.1, 1.2
Text 1: Ch 2
 
What is a distributed system?

• A DS is one in which components located at networked computers communicate and coordinate their actions only by passing messages.

In other words…..

• A DS consists of a collection of autonomous computers, connected through a network and distributed middleware, which enables the computers to coordinate their activities and to share the resources of the system, so that users perceive the system as a single coherent system.
Characteristics of Distributed Systems
• Concurrency
Multiple activities are executed at the same time.
E.g. a server may create two threads that run concurrently to service two client requests.
• No global clock
Programs cooperate not through any shared idea of time but by exchanging messages.
Thus there is no single global notion of time, and the only communication is by sending messages through the network.
Clocks can be synchronized against UTC (Coordinated Universal Time), but only to limited accuracy.
• Independent failures
Each component of a distributed system can fail independently, leaving the others still running.
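The concurrency example above (a server using two threads to service two client requests) can be sketched as follows. This is a minimal illustration, not a real server: `handle_request`, `serve`, and the client names are hypothetical, and the "request" is only a placeholder that records which thread did the work.

```python
import threading

def handle_request(client_id, results):
    # Stand-in for real request handling: record which thread served the client.
    results[client_id] = threading.current_thread().name

def serve(client_ids):
    # One thread per client request, so the requests are serviced concurrently.
    results = {}
    threads = [
        threading.Thread(target=handle_request, args=(cid, results), name=f"worker-{cid}")
        for cid in client_ids
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every request has been handled
    return results

served = serve(["client-1", "client-2"])
print(served)
```

Each thread writes to a different key, so no coordination is needed here. Shared data that several threads update would need a lock.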
EXAMPLES OF DISTRIBUTED SYSTEMS:

• Internet/World-Wide Web
• Local Area Network and Intranet
• Mobile Computing and Ubiquitous Computing
• Database Management System
• Automatic Teller Machine Network

 
CHALLENGES as Design Goals……

• Heterogeneity
• Openness
• Security
• Scalability
• Failure handling
• Concurrency
• Transparency
• Heterogeneity applies to
• Networks - The Internet consists of many different sorts of network; their differences are masked by the fact that all of the computers attached to them use the Internet protocols to communicate with one another.

• Computer hardware - Data types such as integers may be represented in different ways on different sorts of hardware.

• Operating systems - Not all operating systems provide the same application programming interface to the Internet protocols. For example, the calls for exchanging messages in UNIX are different from the calls in Windows.

• Programming languages - Different programming languages use different representations for characters and data structures such as arrays and records. These differences must be addressed if programs written in different languages are to be able to communicate with one another.

• Implementations by different developers - Programs written by different developers cannot communicate with one another unless they use common standards, for example, for network communication and the representation of primitive data items and data structures in messages.

• Middleware is a software layer that provides a programming abstraction and masks the heterogeneity of the underlying networks, hardware, operating systems, and programming languages.
• It provides a standard mode of representation.
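To make the hardware point concrete, the sketch below shows the same integer laid out in two byte orders, and how agreeing on one standard representation removes the ambiguity. It uses Python's standard `struct` module. The helper names `encode` and `decode` are illustrative assumptions, not part of any particular middleware.

```python
import struct

value = 1027  # 0x00000403

# The same 32-bit unsigned integer on two kinds of hardware.
big_endian = struct.pack(">I", value)     # most significant byte first
little_endian = struct.pack("<I", value)  # least significant byte first

def encode(n):
    # Agreed standard representation: network byte order ("!").
    return struct.pack("!I", n)

def decode(data):
    # The receiver decodes with the same agreed format, whatever its hardware.
    return struct.unpack("!I", data)[0]

# Reading big-endian bytes as little-endian yields a different number.
misread = struct.unpack("<I", big_endian)[0]

print(big_endian, little_endian, misread, decode(encode(value)))
```

Without the agreed format, the receiver in this sketch would silently read a wrong value, which is the kind of difference middleware is meant to mask.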
• Openness is concerned with extensions and improvements of distributed systems. An open system can interact with services from other open systems, irrespective of the underlying environment: it has well-defined interfaces, supports portability of applications, and interoperates easily.

• Hardware level: addition of computers to the network.

• Software level: introducing new services, reimplementing old ones, and enabling application programs to share resources.
• Operating System:
Security – 3 Components

Security for information resources has 3 components:

• Confidentiality (protection against disclosure to unauthorized individuals)
In a distributed system, clients send requests to access data managed by servers and resources in the network:
• Doctors requesting records from hospitals
• Users purchasing products through electronic commerce
• Integrity (protection against alteration or corruption)
Security is required for:
• Concealing the contents of messages: security and privacy
• Identifying a remote user or other agent correctly (authentication)
• Availability (protection against interference with access to the resources, i.e. systems and data are usable when needed)
New challenges:
• Denial of service attacks
• Security of mobile code
Scalability

• A system is described as scalable if it remains effective when there is a significant increase in the number of resources and the number of users.

The design of a scalable DS presents the following challenges:

• Controlling the cost of physical resources
• Controlling performance loss
• Preventing software resources from running out
• Avoiding performance bottlenecks
Failure Handling
• In a DS, failure is partial: some components fail while others continue to function.
Techniques for dealing with failures:
• Detecting failures (e.g. checksums)
• Masking failures (e.g. retransmitting messages)
• Tolerating failures (e.g. a web browser reports an error when it cannot contact a web server)
• Recovery from failures
• Redundancy
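Two of the techniques above, detecting a failure with a checksum and masking it by retransmitting, can be combined in a small sketch. Everything here is illustrative: the frame layout (payload plus a 4-byte CRC32), `send_with_retry`, and the deliberately unreliable `make_flaky_channel` are assumptions made for the example, not a real protocol.

```python
import zlib

def make_frame(payload: bytes) -> bytes:
    # Append a 4-byte CRC32 so the receiver can detect corruption.
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check_frame(frame: bytes):
    # Return the payload if the checksum matches, otherwise None.
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == received else None

def send_with_retry(payload, channel, max_attempts=3):
    # Mask transient corruption by retransmitting the frame.
    for attempt in range(1, max_attempts + 1):
        result = check_frame(channel(make_frame(payload)))
        if result is not None:
            return result, attempt
    # The failure could not be masked, so it is reported to the caller.
    raise ConnectionError(f"giving up after {max_attempts} attempts")

def make_flaky_channel(corrupt_first_n):
    # Hypothetical channel that flips the first byte of the first n frames.
    state = {"calls": 0}
    def channel(frame):
        state["calls"] += 1
        if state["calls"] <= corrupt_first_n:
            return bytes([frame[0] ^ 0xFF]) + frame[1:]
        return frame
    return channel

data, attempts = send_with_retry(b"hello", make_flaky_channel(corrupt_first_n=1))
print(data, attempts)
```

When every attempt is corrupted, the sketch raises an error instead of hiding the problem, which corresponds to tolerating a failure by reporting it.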
Concurrency
• Resource sharing is the main motivation for a DS.

• These resources are shared by clients.

• Several clients may attempt to access a shared resource at the same time.

• If a server handles one client request at a time, throughput is limited.

• Servers therefore allow multiple client requests to be processed concurrently.

• The integrity of the system may be violated if concurrent updates are not coordinated:
• Lost updates
• Inconsistent analysis
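A lost update happens when two clients read the same old value and each writes back its own result, so one write overwrites the other. The sketch below coordinates the updates with a lock. The `Account` class and `run_deposits` helper are hypothetical names chosen for illustration.

```python
import threading

class Account:
    def __init__(self, balance=0):
        self.balance = balance
        self._lock = threading.Lock()

    def deposit(self, amount):
        # The read-modify-write sequence runs as one unit, so no update is lost.
        with self._lock:
            current = self.balance
            self.balance = current + amount

def run_deposits(account, n_threads=4, deposits_each=1000):
    def worker():
        for _ in range(deposits_each):
            account.deposit(1)
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return account.balance

final_balance = run_deposits(Account())
print(final_balance)
```

Without the lock, two threads could both read the same `current` value, and one of the deposits would disappear.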
Transparency
• Access transparency
• Location transparency
• Concurrency transparency
• Replication transparency
• Failure transparency
• Mobility transparency
• Performance transparency
• Scaling transparency
• The most important transparencies are access and location transparency (together called network transparency)
SYSTEM MODELS
• Architectural Model
-> It is concerned with the placement of the parts of a system and the relationships between them.
-> It describes how the parts interact with one another and how they are mapped onto the network.
E.g. client-server and peer-to-peer models

• Fundamental Model
-> It is concerned with design issues, difficulties and threats, and how to resolve them so that the system fulfils its tasks correctly, reliably and securely.
Fundamental model…
• Since there is no global time in a DS, all communication between processes is achieved by means of messages.
• Message communication over a computer network can be affected by the following issues:
-> Delays
-> A variety of failures
-> Vulnerability to security attacks
Module -2 Teaching Hours

6 Hours
Meet Hadoop:
Data!, Data Storage and Analysis, Querying All Your Data, Beyond Batch, Comparison with Other
Systems, A Brief History of Apache Hadoop
Map Reduce:
A Weather Dataset, Analyzing the Data with Hadoop, Scaling Out, Hadoop Streaming
The Hadoop Distributed Filesystem:
The Design of HDFS, HDFS Concepts, The Command-Line Interface, Hadoop Filesystems, Data
Flow
YARN:
Anatomy of a YARN Application Run, YARN Compared to MapReduce 1, Scheduling in YARN
How MapReduce Works:
Anatomy of a MapReduce Job Run, Failures, Shuffle and Sort, Task Execution
 
Text 2: Ch 1, Ch 2, Ch 3, Ch 4, Ch 7
 
Hadoop and Map Reduce

• Hadoop is a framework for storing and processing huge data sets.

• Basically, Hadoop can be divided into two parts: processing and storage.

• MapReduce is a programming model that allows you to process huge data sets stored in Hadoop.

• When you install Hadoop on a cluster, you get MapReduce as a service, in which you can write programs that perform computations on the data in a parallel and distributed fashion.
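The MapReduce programming model can be illustrated with a small in-memory simulation. This is not the Hadoop API: the function names, the "year,temperature" record format, and the sample readings are all assumptions made for the sketch, loosely following the weather-dataset theme of the syllabus. It finds the maximum temperature per year.

```python
from collections import defaultdict

def map_phase(record):
    # Map: turn one input record into (key, value) pairs.
    year, temperature = record.split(",")
    yield year, int(temperature)

def shuffle(pairs):
    # Shuffle: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(year, temperatures):
    # Reduce: collapse the values for one key into a single result.
    return year, max(temperatures)

def run_job(records):
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))

# Made-up sample readings in "year,temperature" form.
records = ["1949,111", "1950,0", "1949,78", "1950,22", "1950,-11"]
result = run_job(records)
print(result)
```

Each call to `map_phase` depends only on its own record, and each call to `reduce_phase` only on its own key. That independence is what lets a cluster run many of them in parallel on different machines.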
When data can potentially outgrow the storage capacity of a single machine, it must be partitioned across a number of separate machines for storage or processing. This is achieved using a distributed file system.

Important components in HDFS Architecture are:


• Blocks
• Name Node
• Data Nodes
HDFS Architecture:
HDFS Objectives

• Store vast amounts of data, in the range of terabytes or petabytes, by spreading the data across a number of machines in the cluster.

• Store data reliably and in a fault-tolerant manner by replicating data to cope with the loss of individual machines in the cluster.

• Process the data locally by moving the computation to the data nodes instead of bringing data from the data nodes to a computation server.
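The first two objectives, spreading data across machines and replicating it, can be sketched as splitting a file into fixed-size blocks and assigning each block to several data nodes. The block size (128 MB) and replication factor (3) used below match common Hadoop defaults, but both are configurable. The round-robin placement and the node names `dn1` to `dn4` are simplifying assumptions, not HDFS's actual placement policy.

```python
def split_into_blocks(file_size, block_size):
    # Return (offset, length) pairs; the last block may be smaller.
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(num_blocks, data_nodes, replication):
    # Hypothetical round-robin placement: each block goes to `replication` distinct nodes.
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            data_nodes[(block_id + i) % len(data_nodes)] for i in range(replication)
        ]
    return placement

MB = 1024 * 1024
blocks = split_into_blocks(file_size=300 * MB, block_size=128 * MB)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], replication=3)
print(len(blocks), placement)
```

In this sketch, the table mapping each block to its nodes is the kind of metadata the Name Node keeps, while the Data Nodes hold the block contents.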
Mom told Sam
An apple a day keeps a doctor away!
One day
Sam thought of “drinking” the apple

So, he used a [image] to cut the [image] and a [image] to make juice.
Next Day
Sam applied his invention to all the fruits he could find in the fruit basket

Simple!!!
18 Years Later
Sam got his first job with juice making giants, for his talent in making juice
But!
• Now, it’s not just one basket but a whole container of fruits

• Also, he has to make juice of different fruits separately

• And, Sam has just ONE [image] and ONE [image]
Sam & MapReduce
Sam implemented a parallel version of his innovation
Module -3
Coordination: 10 Hours

Clock synchronization, Logical clocks, Mutual exclusion, Election algorithms


 

Text 1: Ch 6: 6.1, 6.2, 6.3, 6.4

Module – 4
8 Hours
Communication:
Foundations, Remote procedure call, Message-oriented communication,
Multicast communication
 
Text 1: Ch 4
Why synchronization? - Example
• Airline reservation system

• Server A receives a client request to purchase the last ticket on flight ABC 123.

• Server A timestamps the purchase using its local clock (9h:15m:32.45s) and logs it. It replies OK to the client.

• That was the last seat. Server A sends a message to Server B saying “flight full.”

• B enters “Flight ABC 123 full” plus its local clock value (which reads 9h:10m:10.11s) into its log.

• Server C queries A’s and B’s logs. It is confused that a client purchased a ticket after the flight became full.
• C may then execute incorrect or unfair actions.
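One way to avoid the confusion in the airline example is to order events with logical clocks (a Module 3 topic) instead of unsynchronized local clocks. The sketch below is a minimal Lamport clock, assuming the usual rules: increment on every local event and on send, and on receive jump past the timestamp carried by the message. The class and variable names are illustrative.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        # A local event advances the clock by one.
        self.time += 1
        return self.time

    def send(self):
        # Sending is an event; the returned timestamp travels with the message.
        return self.tick()

    def receive(self, message_time):
        # The receiver moves past both its own time and the sender's timestamp.
        self.time = max(self.time, message_time) + 1
        return self.time

server_a = LamportClock()
server_b = LamportClock()

purchase_ts = server_a.tick()            # A logs the ticket purchase
message_ts = server_a.send()             # A sends "flight full" to B
full_ts = server_b.receive(message_ts)   # B logs "flight full"

print(purchase_ts, message_ts, full_ts)
```

Because B's log entry now carries a larger timestamp than A's purchase, Server C sees the purchase before "flight full", whatever the wall clocks say.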
Mutual Exclusion

• How can concurrency and collaboration among multiple processes be maintained?

• In many cases, processes will need to access the same resources simultaneously.

• To prevent concurrent accesses from corrupting a resource or making it inconsistent, solutions are needed that grant processes mutually exclusive access.

• Mutual exclusion refers to the requirement of ensuring that no two processes are in their critical section at the same time.

• A critical section refers to a period of time during which a process accesses a shared resource.
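A simple way to grant mutually exclusive access in a distributed setting is a centralized scheme: one coordinator process grants permission to enter the critical section and queues the other requests. The sketch below simulates that idea in a single program. The `Coordinator` class and the process names are hypothetical, and real message passing and failure handling are left out.

```python
from collections import deque

class Coordinator:
    def __init__(self):
        self.holder = None       # process currently in its critical section
        self.waiting = deque()   # processes waiting for permission, in arrival order

    def request(self, process_id):
        # Grant at once if the resource is free, otherwise queue the request.
        if self.holder is None:
            self.holder = process_id
            return True
        self.waiting.append(process_id)
        return False

    def release(self, process_id):
        # Only the current holder may release; the next waiter is then granted access.
        if self.holder != process_id:
            raise ValueError("only the current holder can release")
        self.holder = self.waiting.popleft() if self.waiting else None
        return self.holder

coordinator = Coordinator()
granted_p1 = coordinator.request("P1")    # resource free, so P1 enters
granted_p2 = coordinator.request("P2")    # P1 is inside, so P2 is queued
next_holder = coordinator.release("P1")   # P1 leaves and P2 is granted access
print(granted_p1, granted_p2, next_holder)
```

At any moment `holder` names at most one process, which is exactly the mutual exclusion requirement. The cost is that the coordinator becomes a single point of failure.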
Election Algorithms - Algorithms for electing a coordinator

• Many distributed algorithms require one process to act as coordinator or initiator.

• If all processes are exactly the same, with no distinguishing characteristics, there is no way to select one of them as coordinator, so each process is assumed to have a unique number.

• Election algorithms are therefore used to locate the process with the highest process number and designate it as coordinator.

“The goal of an election algorithm is to ensure that when an election starts, it concludes with all processes agreeing on who the new coordinator is to be.”
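The idea of locating the live process with the highest number can be sketched in the style of the bully algorithm. This is a simplified, single-program simulation: real elections exchange messages and use timeouts to detect crashed processes, whereas here the set `alive` is simply given. The function name and the process numbers are assumptions for illustration.

```python
def bully_election(initiator, alive):
    """Return the coordinator chosen when `initiator` starts an election.

    `alive` is the set of process numbers that currently respond to messages.
    """
    if initiator not in alive:
        raise ValueError("a crashed process cannot start an election")
    candidate = initiator
    while True:
        # The candidate sends an election message to every higher-numbered process.
        responders = sorted(p for p in alive if p > candidate)
        if not responders:
            # Nobody higher answered, so the candidate wins and announces itself.
            return candidate
        # A higher-numbered process answered and takes over the election.
        candidate = responders[0]

# Processes 0..7 exist; process 7 (the old coordinator) has crashed.
alive = {0, 1, 2, 3, 4, 5, 6}
coordinator = bully_election(initiator=4, alive=alive)
print(coordinator)
```

Whichever live process starts the election, the result is the same highest-numbered live process, so all processes end up agreeing on the coordinator.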
Characteristics of interprocess communication:
• Message passing between a pair of processes is supported by two message communication operations: send and receive.

• For one process to communicate with another, one process sends a message (a sequence of bytes) to a destination and another process at the destination receives the message.

• This activity involves the communication of data from the sending process to the receiving process and may involve the synchronization of the two processes.

The characteristics of interprocess communication include…..

A. Synchronous and asynchronous communication
B. Message destination
C. Reliability
D. Ordering
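The send and receive operations can be tried out with a pair of connected sockets from Python's standard `socket` module. For brevity, both ends live in one program, standing in for two processes. The sketch assumes the message is small enough to fit in the operating system's socket buffer, and `exchange` is a hypothetical helper name.

```python
import socket

def exchange(message: bytes) -> bytes:
    # socketpair() returns two connected sockets, standing in for two processes.
    sender, receiver = socket.socketpair()
    try:
        # Send: the bytes are buffered by the OS, so the call returns
        # before the receiver has read them.
        sender.sendall(message)
        sender.shutdown(socket.SHUT_WR)  # signal that no more data will follow
        # Receive: recv() blocks until data arrives; b"" means the sender is done.
        chunks = []
        while True:
            chunk = receiver.recv(1024)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        sender.close()
        receiver.close()

received = exchange(b"hello from process A")
print(received)
```

Here the send behaves asynchronously (it returns once the data is buffered) while the receive is blocking, which is one common combination of the synchronous and asynchronous options listed above.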
Group Communication:

• The pairwise exchange of messages is not the best model for communication from one process to a group of other processes.

• Communication from one process to a group of other processes is achieved by a multicast operation.

• This is an operation that sends a single message from one process to each of the members of a group of processes in such a way that the membership of the group is transparent to the sender.
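The key property of multicast, that the sender issues one operation and does not need to know who the members are, can be shown with an in-memory sketch. This is only a simulation with Python lists as inboxes; the `Group` class and member names are hypothetical, and no network multicast is involved.

```python
class Group:
    def __init__(self):
        self._members = {}  # member name -> inbox (list of delivered messages)

    def join(self, name):
        # A new member gets an empty inbox and starts receiving later messages.
        self._members[name] = []
        return self._members[name]

    def leave(self, name):
        self._members.pop(name, None)

    def multicast(self, message):
        # One call by the sender; delivery to each current member happens here.
        for inbox in self._members.values():
            inbox.append(message)
        return len(self._members)

group = Group()
inbox_a = group.join("A")
inbox_b = group.join("B")
delivered = group.multicast("update-1")  # reaches A and B
group.leave("B")
group.multicast("update-2")              # reaches only A
print(delivered, inbox_a, inbox_b)
```

The sender's code is identical before and after B leaves, which is what membership being transparent to the sender means.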
extra reference….
[Figure: unicast vs. multicast delivery from a server through a router]
Module-5

Naming: 8 Hours

Names, Identifiers and Addresses, Flat naming, Structured naming


 
Text 1: Ch 5: 5.1, 5.2, 5.3

Module-6
Consistency & Replication: 10 Hours

Introduction, Data-centric consistency models, Client-centric consistency


models, Replica management, Consistency protocols
 
Text 1: Ch 7: 7.1 – 7.5
The role of names and name services
• Resources are accessed using an identifier or reference
• A name is a human-readable value (usually a string) that can be resolved to an identifier or address
• Internet domain name, file pathname, process number

• E.g. /etc/passwd, http://www.cdk3.net/

• Resource names are resolved by name services
Composed naming domains used to access a resource from a URL
Figure 9.1
URL: http://www.cdk3.net:8888/WebExamples/earth.html
-> DNS lookup
Resource ID (IP number, port number, pathname): 138.37.88.61, 8888, WebExamples/earth.html
-> ARP lookup
(Ethernet) network address: 2:60:8c:2:b0:5a
Web server: socket, file
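The first resolution step in the figure, from URL to resource ID, can be reproduced with Python's standard `urllib.parse`. The URL is written here as http://www.cdk3.net:8888/WebExamples/earth.html. The dictionary `dns_table` is a stand-in for a real DNS lookup and holds only the address shown in the figure, so no network access is needed.

```python
from urllib.parse import urlsplit

url = "http://www.cdk3.net:8888/WebExamples/earth.html"
parts = urlsplit(url)

host = parts.hostname          # domain name, resolved by DNS
port = parts.port              # identifies the server's socket on that host
path = parts.path.lstrip("/")  # resolved by the web server's file system

# Stand-in for the DNS lookup, using the IP number shown in the figure.
dns_table = {"www.cdk3.net": "138.37.88.61"}

resource_id = (dns_table[host], port, path)
print(resource_id)
```

Each part of the URL belongs to a different naming domain, and a different service resolves it: DNS for the host name, the host's transport layer for the port, and the server's file system for the pathname.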
Consistency & Replication

• What kind of things do we replicate in a distributed system?


• Data
• Servers

• Why do we replicate things?


• To increase
1. Reliability
2. Performance

• What is the main problem in providing replication?


• Keeping replicas consistent!
Motivation
• Make copies of services on multiple sites to improve:
• Reliability (by redundancy)
• If the primary crashes, a standby still works
• Performance
• Increase processing power
• Reduce communication delays
• Performance is important when a DS needs to scale in numbers and geographical area.
• Scalability
• Prevent overloading a single server (size scalability)
• Avoid communication latencies (geographic scale)
• However, updates are more complex
- propagation to all replicas?
- keeping replicas consistent!
Why Replication?

• Two primary reasons for replicating data:
• Reliability
• Performance

In terms of reliability….
-> Data are replicated to increase the reliability of the system.
-> If one replica crashes, the system can continue by switching to one of the other replicas.
-> Maintaining multiple copies gives better protection against corrupted data.
Performance

-> When an increasing number of processes need to access data managed by a single server, that server delays the processes or tasks.

-> By replicating the data, we can divide the work.

-> Placing a copy of the data close to the processes using it decreases the time to access that data.

Problem with Replication

-> Multiple copies lead to consistency problems.

-> A modification has to be carried out on all copies to maintain consistency.

-> Exactly when and how those modifications need to be carried out determines the price of replication.
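The consistency problem can be seen in a tiny in-memory model. In the sketch below, `write` updates every replica immediately, while `write_local` followed by `propagate` updates one replica first and the others later, leaving a window in which the replicas disagree. The class and method names are hypothetical and do not represent any particular consistency protocol.

```python
class ReplicatedStore:
    def __init__(self, n_replicas):
        self.replicas = [{} for _ in range(n_replicas)]

    def write(self, key, value):
        # Eager update: the modification is carried out on all copies at once.
        for replica in self.replicas:
            replica[key] = value

    def write_local(self, key, value, replica_index):
        # Lazy update: only one copy changes now; the others become stale.
        self.replicas[replica_index][key] = value

    def propagate(self, source_index):
        # Later, push the source replica's contents to every other copy.
        source = self.replicas[source_index]
        for replica in self.replicas:
            replica.update(source)

    def read(self, key, replica_index):
        return self.replicas[replica_index].get(key)

    def is_consistent(self):
        return all(r == self.replicas[0] for r in self.replicas)

store = ReplicatedStore(3)
store.write("seat-12A", "sold")
consistent_after_eager = store.is_consistent()

store.write_local("seat-12A", "free", replica_index=0)
consistent_before_propagation = store.is_consistent()
stale_read = store.read("seat-12A", replica_index=2)

store.propagate(source_index=0)
consistent_after_propagation = store.is_consistent()
print(consistent_after_eager, consistent_before_propagation, stale_read,
      consistent_after_propagation)
```

The eager variant pays on every write, the lazy variant allows stale reads for a while. That trade-off is the price of replication that the choice of when and how to update determines.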
