DS Syllabus Introduction (Reference)
SYLLABUS
Subject Code: CSE-3251
2022
Dr. P.B.Shanthi
Asst. Prof (Sel. Grade)
CSE Dept., MIT, Manipal
Cabin: No. 16 in Faculty Room 2
Mob: 9901659271
Text Books
(1) Distributed Systems – Maarten van Steen and Andrew S. Tanenbaum, 3rd edition, Version 3.01/3.02, 2017
(2) Hadoop: The Definitive Guide, Tom White, 4th edition, O'Reilly.
Reference Books:
(3) Distributed Systems - Coulouris G., Dollimore J., and Kindberg T., Pearson, 4th edition, 2009.
(5) Ajay D. Kshemkalyani and Mukesh Singhal, Distributed Computing: Principles, Algorithms, and Systems.
(6) Mei-Ling Liu, Distributed Computing: Principles and Applications, Pearson Education, Inc., New Delhi, 2004.
Module – 1 (Teaching Hours: 6)
Introduction:
What is a distributed system? Design Goals.
Text 1: Ch 1: 1.1, 1.2
Architecture:
Architectural Styles, Middleware Organization, System Architecture, Example Architectures
Text 1: Ch 2
What is a distributed system?
Examples of distributed systems:
• Internet/World-Wide Web
• Local Area Network and Intranet
• Mobile Computing and Ubiquitous Computing
• Database Management System
• Automatic Teller Machine Network
Challenges as Design Goals
• Heterogeneity
• Openness
• Security
• Scalability
• Failure handling
• Concurrency
• Transparency
• Heterogeneity applies to:
• Networks - The Internet consists of many different sorts of network; their differences are masked by the fact that all of the computers attached to them use the Internet protocols to communicate with one another.
• Computer hardware - Data types such as integers may be represented in different ways on different sorts of hardware.
• Operating systems - Operating systems do not all provide the same application programming interface to the Internet protocols. For example, the calls for exchanging messages in UNIX are different from the calls in Windows.
• Programming languages - Different programming languages use different representations for characters and for data structures such as arrays and records. These differences must be addressed if programs written in different languages are to communicate with one another.
• Implementations by different developers - Programs written by different developers cannot communicate with one another unless they use common standards, for example for network communication and for the representation of primitive data items and data structures in messages.
Middleware is a software layer that provides a programming abstraction and masks the heterogeneity of the underlying networks, hardware, operating systems, and programming languages.
It provides a standard mode of representation.
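To make the hardware point concrete, the sketch below shows the same integer laid out in two byte orders, and how an agreed wire format removes the ambiguity. It is only an illustration: the value 1025 and the choice of a 32-bit unsigned integer are assumptions, and real middleware uses much richer data representations than a single packed integer.

```python
import struct

value = 1025  # assumed example value, not from the slides

# The same 32-bit integer as stored by little-endian and big-endian hardware.
little_endian = struct.pack("<I", value)
big_endian = struct.pack(">I", value)

# An agreed wire format (here network byte order, which is big-endian)
# lets any receiver decode the bytes regardless of its own hardware.
wire = struct.pack("!I", value)
decoded = struct.unpack("!I", wire)[0]

print(little_endian.hex(), big_endian.hex(), decoded)
```

Both sides only need to agree on the format string `"!I"`; neither needs to know the other's native byte order.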
• Openness is concerned with extensions and improvements of distributed systems. An open system is able to interact with services from other open systems, irrespective of the underlying environment: it has well-defined interfaces, supports portability of applications, and interoperates easily.
• Software level: introducing new services, reimplementing old ones, and enabling application programs to share resources.
• Operating System:
Security – 3 Components
• Fundamental Model
-> Fundamental models are concerned with the design issues, difficulties and threats that distributed systems face, and with how to resolve them so that the systems fulfil their tasks correctly, reliably and securely.
Fundamental model…
• Since there is no global time in a distributed system, all communication between processes is achieved by means of messages.
• Message communication over a computer network can be affected by the following issues:
-> Delays
-> A variety of failures
-> Vulnerability to security attacks
Module – 2 (Teaching Hours: 6)
Meet Hadoop:
Data!, Data Storage and Analysis, Querying All Your Data, Beyond Batch, Comparison with Other Systems, A Brief History of Apache Hadoop
MapReduce:
A Weather Dataset, Analyzing the Data with Hadoop, Scaling Out, Hadoop Streaming
The Hadoop Distributed Filesystem:
The Design of HDFS, HDFS Concepts, The Command-Line Interface, Hadoop Filesystems, Data Flow
YARN:
Anatomy of a YARN Application Run, YARN Compared to MapReduce 1, Scheduling in YARN
How MapReduce Works:
Anatomy of a MapReduce Job Run, Failures, Shuffle and Sort, Task Execution
Text 2: Ch 1, Ch 2, Ch 3, Ch 4, Ch 7
Hadoop and MapReduce
• Hadoop is a framework for storing and processing huge data sets.
• Hadoop can be divided into two parts: processing and storage.
• MapReduce is a programming model for processing the huge data sets stored in Hadoop.
• When you install Hadoop on a cluster, you get MapReduce as a service: you can write programs that perform computations on the data in a parallel and distributed fashion.
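The programming model can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce steps using word counting, not the Hadoop API; the function names and the two input lines are assumptions made for the example. On a real cluster the map and reduce calls would run in parallel on different machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

lines = ["big data big cluster", "data in the cluster"]  # assumed input
counts = word_count(lines)
print(counts)
```

Because each `map_phase` call looks at one line only and each `reduce_phase` call looks at one key only, both steps can be spread across machines without coordination between them.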
When data outgrows the storage capacity of a single machine, it must be partitioned across a number of separate machines for storage or processing. This is achieved using a distributed file system.
… to make juice.
Next day: Sam applied his invention to all the fruits he could find in the fruit basket. Simple!
18 years later: Sam got his first job with a juice-making giant, for his talent in making juice.
But now it is not just one basket but a whole container of fruits.
Module – 4 (Teaching Hours: 8)
Communication:
Foundations, Remote procedure call, Message-oriented communication, Multicast communication
Text 1: Ch 4
Why synchronization? Example
• Airline reservation system
• That was the last seat. Server A sends a message to Server B saying "flight full."
• B enters "Flight ABC 123 full" plus its local clock value (which reads 9h:10m:10.11s) into its log.
• Server C queries A's and B's logs. It is confused, because the logs suggest that a client purchased a ticket after the flight became full.
• As a result, C may execute incorrect or unfair actions.
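The confusion can be reproduced with a small sketch. The timestamp for B comes from the example above; the timestamp for A's sale is a hypothetical value (the slides do not give it), chosen so that A's clock runs ahead of B's. The helper `hms` and the log layout are also assumptions.

```python
def hms(h, m, s):
    """Convert a clock reading to seconds since midnight."""
    return h * 3600 + m * 60 + s

# Hypothetical: A's local time for the sale is not given in the slides.
# A's clock is assumed to run ahead of B's.
log_a = [(hms(9, 15, 32.45), "A", "last seat sold")]
# From the example: B logs "flight full" afterwards, at its own local time.
log_b = [(hms(9, 10, 10.11), "B", "Flight ABC 123 full")]

# Server C merges both logs and orders the events by timestamp.
merged = sorted(log_a + log_b)
for ts, server, event in merged:
    print(f"{ts:9.2f}  {server}  {event}")

# Real order: the sale, then "full". The timestamp order says the opposite.
apparent_order = [event for _, _, event in merged]
```

Ordering by unsynchronized local clocks puts "Flight ABC 123 full" before the sale, which is exactly what misleads Server C.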
Mutual Exclusion
• In many cases, processes will need to access the same resources at the same time.
• Mutual exclusion is the requirement that no two processes are in their critical sections at the same time.
• A critical section is a period of time during which a process accesses a shared resource.
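The idea of a critical section can be sketched with threads on a single machine. This is only a local stand-in: in a distributed system there is no shared lock object, so the same guarantee has to be built from message exchanges. The seat-booking scenario and all names are assumptions for illustration.

```python
import threading

seats_left = 1            # shared resource
sold_to = []              # clients that obtained a seat
lock = threading.Lock()   # guards the critical section

def book(client):
    global seats_left
    with lock:                    # enter the critical section
        if seats_left > 0:        # check and update happen as one unit
            seats_left -= 1
            sold_to.append(client)
    # the lock is released on leaving the with-block

threads = [threading.Thread(target=book, args=(f"client-{i}",)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sold_to, seats_left)
```

Without the lock, two threads could both see `seats_left > 0` before either decrements it, and the last seat would be sold twice.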
Election Algorithms - Algorithms for electing a coordinator
If all processes are exactly the same, with no distinguishing characteristics, there is no way to select one of them as coordinator.
So election algorithms are used to locate the process with the highest process number and designate it as coordinator.
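The outcome of such an election can be sketched as follows. This shows only the selection rule (highest process number among the live processes); a real election algorithm reaches the same result by exchanging messages, which is not modelled here. The function name and the `alive` map are assumptions.

```python
def elect_coordinator(process_ids, alive):
    """Return the highest-numbered process that is currently alive."""
    candidates = [pid for pid in process_ids if alive.get(pid, False)]
    if not candidates:
        raise RuntimeError("no live process to elect")
    return max(candidates)

process_ids = [1, 2, 3, 4, 5]
# Assumed scenario: process 5, the previous coordinator, has crashed.
alive = {1: True, 2: True, 3: True, 4: True, 5: False}

coordinator = elect_coordinator(process_ids, alive)
print(coordinator)
```

The rule only works because process numbers are unique, which is the distinguishing characteristic the text says is required.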
• For one process to communicate with another, one process sends a message (a sequence of bytes) to a destination, and another process at the destination receives the message.
• This activity involves the communication of data from the sending process to the receiving process and may involve the synchronization of the two processes.
• Multicast is an operation that sends a single message from one process to each of the members of a group of processes, in such a way that the membership of the group is transparent to the sender.
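A minimal in-process sketch of the multicast idea follows. The `Group` class is hypothetical and uses queues in one program instead of a network; its only purpose is to show that the sender issues a single send and never names the individual members.

```python
import queue

class Group:
    """Hypothetical process group: each member has an inbox queue."""

    def __init__(self):
        self._inboxes = {}

    def join(self, name):
        self._inboxes[name] = queue.Queue()
        return self._inboxes[name]

    def multicast(self, message):
        # One send operation; the group, not the sender, fans the message out.
        for inbox in self._inboxes.values():
            inbox.put(message)

group = Group()
inbox_p1 = group.join("p1")
inbox_p2 = group.join("p2")
inbox_p3 = group.join("p3")

group.multicast(b"flight full")   # the sender does not name any member
received = [inbox.get_nowait() for inbox in (inbox_p1, inbox_p2, inbox_p3)]
print(received)
```

Members can join or leave without any change to the sending code, which is what membership transparency means in practice.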
[Figure: unicast versus multicast delivery from a server through a router]
Module – 5 (Teaching Hours: 8)
Naming
Module – 6 (Teaching Hours: 10)
Consistency & Replication
Figure 9.1: Composed naming domains used to access a resource from a URL
URL: http://www.cdk3.net:8888/WebExamples/earth.html
-> DNS lookup gives the resource ID (IP number, port number, pathname)
-> ARP lookup on the IP number gives the network address 2:60:8c:2:b0:5a
-> On the web server, the port number identifies the socket and the pathname identifies the file
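The first step of Figure 9.1, splitting the URL into the parts that each naming domain resolves, can be sketched with the standard library. No DNS or ARP lookup is performed here, since that would need network access; the variable names are assumptions.

```python
from urllib.parse import urlsplit

url = "http://www.cdk3.net:8888/WebExamples/earth.html"
parts = urlsplit(url)

host = parts.hostname   # domain name, to be resolved by a DNS lookup
port = parts.port       # identifies the server's socket on that host
path = parts.path       # identifies the file on the web server

print(host, port, path)
```

Each component is resolved in a different naming domain: the host name by DNS, the resulting IP number by ARP, and the pathname by the web server's file system.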
Consistency & Replication
Reliability
-> Data are replicated to increase the reliability of the system.
-> If a replica crashes, the system can continue by switching to one of the other replicas.
-> Maintaining multiple copies gives better protection against corrupted data.
Performance
-> Single server: when an increasing number of processes need to access data managed by a single server, their tasks are delayed.
-> Placing a copy of the data in the proximity of the processes using it decreases the time to access that data.
Exactly when and how modifications to the copies need to be carried out determines the price of replication.
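The trade-off can be sketched with a toy store. `ReplicatedStore` is a hypothetical class, not a real library: reads are served by a preferred (nearby) replica with failover to the others, while every write has to reach every live copy. The counter `update_messages` is an assumed, very rough measure of the price of replication.

```python
class ReplicatedStore:
    """Hypothetical store that keeps a full copy of the data on every replica."""

    def __init__(self, replica_names):
        self.replicas = {name: {} for name in replica_names}
        self.crashed = set()
        self.update_messages = 0   # rough measure of the price of replication

    def write(self, key, value):
        # Every live replica must be updated to keep the copies consistent.
        for name, data in self.replicas.items():
            if name not in self.crashed:
                data[key] = value
                self.update_messages += 1

    def read(self, key, preferred):
        # Try the preferred (nearby) replica first, then fall back to others.
        order = [preferred] + [n for n in self.replicas if n != preferred]
        for name in order:
            if name not in self.crashed:
                return name, self.replicas[name][key]
        raise RuntimeError("all replicas are down")

store = ReplicatedStore(["r1", "r2", "r3"])
store.write("seat-count", 0)
store.crashed.add("r1")            # the nearby replica crashes
served_by, value = store.read("seat-count", preferred="r1")
print(served_by, value, store.update_messages)
```

The read still succeeds after the crash, but the single write cost three updates: more replicas mean better reliability and read performance, and more work on every modification.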