PDC Notes Complete - Updated
Computing
Introduction to Early Computers
The first computers, such as ENIAC and UNIVAC, emerged in the 1940s and 1950s. These machines were
massive, room-sized structures that relied on vacuum tubes for processing. Compared to modern technology,
their processing capabilities were highly limited.
Evolution to Mainframes
By the mid-20th century, computers transitioned into mainframes. A significant milestone was IBM's
System/360, introduced in the 1960s. These mainframe computers, although still large, had significantly
improved processing power and capabilities.
Personal Computers Era
The late 20th century witnessed the rise of personal computers, making computing more accessible to
individuals. Companies like Apple and Microsoft played pivotal roles in popularizing personal computing.
The introduction of user-friendly interfaces and affordable hardware revolutionized the industry. Key
developments included the Apple II (1977) and IBM PC (1981), which marked the beginning of widespread
computer usage.
Moore's Law
Moore's Law refers to the observation made by Gordon Moore in 1965 that the number of transistors on a
microchip doubles approximately every two years, leading to an exponential increase in computing power
and a reduction in relative cost. This exponential growth in processing power has:
· Driven advancements in computer technology, making devices smaller, faster, and more efficient.
· Influenced technological innovation across various devices.
· Revolutionized modern life, from smartphones to supercomputers.
Serial Computation
Traditionally, software has been designed for serial computations, which run on a single computer with a
single Central Processing Unit (CPU). These computations involve breaking a problem into a set of discrete
instructions that are executed sequentially, meaning only one instruction is processed at any given time.
Introduction to Parallel Computing
Parallel computing is a type of computation where multiple processes are carried out simultaneously to solve
a problem faster. It works by breaking a problem into smaller, independent or interdependent parts that can
be processed concurrently. Each of these parts is further divided into a sequence of instructions, which are
then executed simultaneously across multiple CPUs or processors. This approach enhances performance,
reduces execution time, and optimizes resource utilization, making it essential for handling large-scale
computations in fields such as scientific research, data processing, and artificial intelligence.
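A minimal Java sketch of this idea, splitting an array sum into independent chunks processed by a pool of worker threads (the array contents and chunk scheme here are illustrative):

import java.util.concurrent.*;

public class ParallelSum {
    public static void main(String[] args) throws Exception {
        int[] data = new int[1_000_000];
        java.util.Arrays.fill(data, 1);

        int parts = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        int chunk = data.length / parts;

        // Break the problem into independent parts processed concurrently.
        java.util.List<Future<Long>> results = new java.util.ArrayList<>();
        for (int p = 0; p < parts; p++) {
            final int lo = p * chunk;
            final int hi = (p == parts - 1) ? data.length : lo + chunk;
            results.add(pool.submit(() -> {
                long sum = 0;
                for (int i = lo; i < hi; i++) sum += data[i];
                return sum;
            }));
        }

        long total = 0;
        for (Future<Long> f : results) total += f.get(); // combine partial results
        pool.shutdown();
        System.out.println("Sum = " + total);
    }
}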
SIMD (Single Instruction Multiple Data): A single control unit (CU) with multiple processing elements
(PEs). CU fetches and decodes an instruction, broadcasting control signals to all PEs. All PEs execute the
same instruction synchronously on different data sets.
MISD (Multiple Instructions Single Data): A rare architecture, limited by its data throughput, in which multiple functional units perform different operations on the same data stream.
MIMD (Multiple Instructions Multiple Data): Machines using MIMD have a number of processor
cores that function asynchronously and independently. At any time, different processors may be executing
different instructions on different pieces of data.
Communication in Parallel Programs
In parallel computing, communication between tasks is essential for data sharing and synchronization.
Efficient communication strategies impact performance by minimizing delays and optimizing resource
utilization.
1. Importance of Communication: How tasks share data and synchronize largely determines a parallel program's performance.
2. Scope of Communication:
· Point-to-Point Communication: Direct data transfer between two tasks.
· Collective Communication: Involves all tasks within a communication group, such as broadcasting
or gathering data.
3. Distributed Memory Systems: Distributed memory systems consist of multiple processors, each
with its own private memory. Processors communicate by passing messages over a network. This
design can scale more effectively than shared memory systems, as each processor operates
independently, and the network can handle communication between them.
Advantages:
· Scalable memory with increasing CPUs.
· Each CPU can access its local memory quickly.
Disadvantages:
· Programmers must manage data communication.
· Difficult to map global memory structures.
4. Hybrid Systems: Hybrid systems combine elements of shared and distributed memory architectures.
They typically feature nodes that use shared memory, interconnected by a distributed memory
network. Each node operates as a shared memory system, while communication between nodes
follows a distributed memory model. Within a node, tasks can communicate quickly using shared
memory, while inter-node communication uses message passing.
Advantages:
· Improved scalability.
Disadvantages:
· Increased programming complexity.
LECTURE 3
RISC vs. CISC Processors
Reduced Instruction Set Computer (RISC) is a type of computer architecture that utilizes a small, highly
optimized set of instructions. Unlike Complex Instruction Set Computer (CISC) architectures, which have a
large number of specialized instructions, RISC systems focus on executing a limited number of simple
instructions efficiently. This design philosophy leads to faster execution speeds, as RISC processors can
complete most instructions in a single clock cycle. Additionally, RISC architectures emphasize a uniform
instruction format, a large number of general-purpose registers, and a load/store approach, where memory
access is limited to specific instructions. These characteristics enhance performance, power efficiency, and
parallel execution, making RISC processors widely used in modern computing, especially in mobile and
embedded systems.
Advantages of RISC:
· Faster processing due to simple instruction decoding.
· Lower power consumption.
· High efficiency in portable devices.
Disadvantages of RISC:
· Requires more instructions for complex tasks.
· Higher memory usage.
Complex Instruction Set Computer (CISC) is a type of computer architecture that utilizes a large and
diverse set of instructions, allowing a single instruction to perform multiple low-level operations, such as
memory access, arithmetic computations, and complex addressing modes. Unlike Reduced Instruction Set
Computer (RISC) architectures, which focus on executing simple instructions quickly, CISC architectures
aim to reduce the number of instructions per program by using more complex and multi-step instructions.
This approach helps in minimizing the need for multiple instructions to perform a task, reducing the number
of memory accesses and instruction fetches. However, CISC processors often require multiple clock cycles
to execute a single instruction, making them potentially slower than RISC processors for certain operations.
Despite this, CISC remains widely used, especially in x86-based systems, where backward compatibility with a large existing software base is a key advantage.
Advantages of CISC:
· Reduced code size due to complex instructions.
· More memory-efficient.
· Long-established with broad software support.
Disadvantages of CISC:
· Slower execution due to complex instruction decoding.
· Higher power consumption.
Comparison of RISC vs. CISC
| Feature | RISC | CISC |
|---|---|---|
| Focus | Software | Hardware |
| Control Unit | Hardwired | Hardwired & Microprogrammed |
| Transistor Usage | More registers | Storing complex instructions |
| Instruction Size | Fixed | Variable |
| Execution Type | Register-to-register operations | REG-to-REG, REG-to-MEM, MEM-to-MEM |
| Register Usage | More registers required | Fewer registers required |
| Code Size | Large | Small |
| Execution Speed | Single cycle per instruction | Multiple cycles per instruction |
| Addressing Modes | Simple | Complex |
| Pipelining | Highly pipelined | Less pipelined |
| Power Consumption | Low | High |
| RAM Requirement | More | Less |
LECTURE 7
Amdahl’s Law is a principle in computer architecture that defines the potential speedup of a system
when improving a specific part of it, particularly through parallel computing.
S = T1 / T2 represents speedup, where:
· T1 is the time taken by machine 1 (slower machine).
· T2 is the time taken by machine 2 (faster machine).
· Since T2 is smaller than T1, speedup S quantifies how much faster machine 2 is compared
to machine 1.
Amdahl’s Law and Parallel Processing
Gene Amdahl, in 1967, highlighted the limitations of improving performance by adding more
processors. He pointed out that:
· Speedup is constrained by the portion of the task that cannot be parallelized.
· If a fraction of the computation must still be executed sequentially, the overall performance
gain will be limited, even if other parts run in parallel.
This result says that, no matter how much one type of operation in a system is improved,
the overall performance is inherently limited by the operations that are unaffected by the
improvement. For example, the best speedup that could be obtained in a parallel computing system
with p processors is p.
However, if 10% of a program cannot be executed in parallel, the overall speedup when using
the parallel machine is at most 1/α = 1/0.1=10, even if an infinite number of processors were
available.
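In its standard form, with serial fraction α and p processors, Amdahl's law can be written as:

S(p) = 1 / (α + (1 − α)/p), and as p → ∞, S(p) → 1/α

For α = 0.1 this limit is 10, matching the example above; with only p = 10 processors the speedup is 1 / (0.1 + 0.9/10) ≈ 5.3.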
· Any system designer or programmer should therefore concentrate on making the common case fast.
· Operations that occur most often account for the largest fraction of execution time, so improving them has the biggest impact on overall performance.
· One major criticism of Amdahl's law is that it emphasizes the wrong aspect of the performance potential of parallel-computing systems: purchasers of parallel systems typically want to solve larger problems within the available time, not fixed-size problems faster.
· Following this line of argument leads to a "scaled" or "fixed-time" version of Amdahl's law, which compares the parallel execution time Tp with the time required to execute the equivalent sequential version of the application program, T1, using the speedup Sp = T1/Tp.
· With the fixed-time interpretation, however, the assumption is that the sequential version may never actually be run: the single processor may not have a large enough memory, for example, or the time required to execute the sequential version would be unreasonably long.
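This fixed-time version is commonly stated as the Gustafson-Barsis scaled speedup, where α is the serial fraction measured on the parallel machine:

S_scaled(p) = α + p(1 − α)

For example, with α = 0.1 and p = 100, the scaled speedup is 0.1 + 100 × 0.9 = 90.1, far more optimistic than the fixed-size Amdahl bound.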
LECTURE 8: Introduction to Pipelining
Pipelining is a technique in computer architecture that overlaps the execution of multiple instruction
stages to boost performance. It allows a processor to work on different instructions simultaneously,
improving efficiency. Pipelining divides the instruction execution process into smaller stages where
each stage processes a different instruction at the same time, enabling parallel execution. This
increases instruction throughput, meaning more instructions are completed in less time, and allows
the CPU to handle multiple tasks in a single clock cycle.
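To quantify this, assuming an ideal pipeline with no stalls: with a k-stage pipeline and n instructions, the first instruction takes k cycles and one instruction completes every cycle thereafter, so:

Time (pipelined) = k + (n − 1) cycles, versus Time (non-pipelined) = n × k cycles
Speedup = (n × k) / (k + n − 1), which approaches k as n grows large

For example, 1000 instructions on a 5-stage pipeline need 5 + 999 = 1004 cycles instead of 5000, a speedup of about 4.98.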
Types of Pipelining
1. Hardware Pipelining: Hardware pipelining is a method used in computers to speed up
processing by dividing tasks into smaller steps and working on different steps at the same time.
There are different types of pipelining:
· Instruction pipelining breaks instructions into steps like fetch, decode, and execute, so the
processor can handle many instructions at once. For example, the 5-stage RISC pipeline.
· Arithmetic pipelining helps with solving complex math problems, like floating-point
operations, using special units like FPUs.
· Data pipelining moves data quickly between memory and input/output devices, often using
tools like DMA controllers.
2. Software Pipelining: Software pipelining involves compiler optimizations to rearrange
instructions for better performance without hardware changes.
· Loop Unrolling: Reduces loop overhead by combining iterations (see the sketch after this list). Example: a 10-iteration loop runs 2 iterations per cycle.
· VLIW Pipelining: Schedules multiple operations in one instruction. Example: DSP
processors.
· Speculative Execution: Executes instructions before conditions are confirmed to avoid
stalls. Example: Branch prediction.
· Software Parallelism: Breaks tasks into threads for multi-core processors. Example:
OpenMP, CUDA.
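A minimal sketch of loop unrolling, shown here by hand in Java (the array, values, and unroll factor are illustrative; a compiler would do this transformation automatically):

public class Unroll {
    public static void main(String[] args) {
        int[] a = new int[10];

        // Original loop: one element per pass, with a loop test every iteration.
        for (int i = 0; i < a.length; i++) a[i] = i * 2;

        // Unrolled by a factor of 2: half as many loop tests and branches.
        // (Assumes a.length is divisible by 2; otherwise a cleanup loop is needed.)
        for (int i = 0; i < a.length; i += 2) {
            a[i] = i * 2;
            a[i + 1] = (i + 1) * 2;
        }
        System.out.println(java.util.Arrays.toString(a));
    }
}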
Tradeoff Between Cost and Performance
Pipelining boosts performance but increases costs due to hardware complexity, power usage, and
design challenges. Balancing cost and performance depends on the system’s needs.
| Scenario | High Performance | Low Cost |
|---|---|---|
| Simple CPU (e.g., embedded systems) | Few pipeline stages, lower complexity | Reduced hardware cost, lower power usage |
| High-Speed CPUs (e.g., gaming, AI) | Deep pipelining, out-of-order execution | Expensive fabrication, higher power |
Stages in Pipelining
1. Instruction Fetch (IF)
The Instruction Fetch stage is responsible for retrieving the instruction from memory, which may be
located in RAM or cache. During this stage, the Program Counter (PC) is updated to point to the
next instruction in sequence. However, this stage can face issues such as control hazards due to
branch instructions, which can disrupt the instruction flow. To address these challenges,
optimizations like instruction prefetching and branch prediction are commonly employed to
improve performance and maintain a steady pipeline flow.
2. Instruction Decode (ID)
In the Instruction Decode stage, the fetched instruction is decoded to determine the type of
operation and identify the required operands. This stage also generates control signals that direct
other parts of the CPU, particularly the execution unit. To enhance efficiency and reduce stalls
caused by data hazards, techniques such as register renaming and out-of-order execution are often
applied.
3. Execute (EX)
The Execute stage is where the actual operation takes place. This may involve arithmetic or logical
calculations using the Arithmetic Logic Unit (ALU), or it may involve evaluating branch
conditions. To ensure efficient processing, this stage can benefit from optimizations like operand
forwarding, which minimizes delays in data availability, and the use of multiple execution units to
allow parallel execution of instructions.
4. Memory Access (MEM)
During the Memory Access stage, the processor handles load and store operations, which involve
reading from or writing to memory. This stage is crucial for data retrieval and storage. Performance
enhancements at this stage include the use of cache memory and multi-level caching systems, which
significantly reduce memory access time and improve overall throughput.
5. Write Back (WB)
The final stage, Write Back, involves storing the results of computations into the processor’s
registers so that they can be used in subsequent instructions. To optimize this process, techniques
like register forwarding and write buffers are utilized, which help to reduce write delays and support
faster access to updated data by future instructions.
Characteristics of Pipelining
· Multiple Instructions: Different instructions are processed in parallel.
· Segmented Cycle: Instruction cycle is split into stages (e.g., IF, ID, EX).
· Operation Overlap: Stages work simultaneously on different instructions.
· Increased Throughput: More instructions completed per unit time.
· Reduced Execution Time: Instructions complete faster once the pipeline is full.
· Pipeline Depth: More stages improve throughput but increase complexity.
· Hazards: Structural, data, and control hazards can cause stalls.
· Speedup: Ideally equals the number of stages, but hazards reduce this.
· Dependency Management: Forwarding and branch prediction handle dependencies.
· Latency vs. Throughput: Throughput improves, but individual instruction latency remains.
Hazards in Pipelining
1. Data Hazards:
o Read After Write (RAW): Instruction needs a result not yet written.
o Write After Read (WAR): Later instruction writes before earlier reads.
o Write After Write (WAW): Two instructions write to the same register out of
order.
o Solution: Register renaming, operand forwarding.
2. Structural Hazards: Occur when hardware resources (e.g., memory) are insufficient to serve concurrent pipeline stages.
3. Control Hazards: Caused by branch instructions disrupting the instruction flow.
o Solutions: Branch prediction, delay slots, speculative execution.
Register Renaming
Register renaming eliminates data hazards by assigning different physical registers to the same
logical register. This prevents conflicts and is used in out-of-order and superscalar processors.
Shared Memory Systems
In shared memory systems, all processors access a single global memory.
Hardware Mechanisms
· Memory Bus: A high-bandwidth bus connects processors to centralized DRAM, but contention
can slow performance.
· Cache Coherence: Protocols ensure all processors see consistent memory values.
Synchronization Challenges
· Race Conditions: Locks, semaphores, and monitors are needed to prevent conflicts when
multiple processors access the same data.
· False Sharing: When processors modify different variables in the same cache line, performance
degrades due to unnecessary invalidations.
· Memory Bandwidth: Limited bandwidth can bottleneck processor performance.
Advantages
· Simpler Programming: Developers work with a single memory space, making coding easier.
· Faster Communication: Data sharing within shared memory is quick.
· Hardware-Managed Coherence: Cache coherence is handled automatically.
Disadvantages
· Scalability Issues: Bus contention limits the number of processors.
· Limited Memory Capacity: Total memory is constrained by the system.
· Single Point of Failure: A memory failure can affect the entire system.
Applications
· Multiprocessor systems for scientific simulations.
· Operating systems for efficient task management.
Distributed Memory Systems
Hardware Mechanisms
· Processors: Each node has its own CPU and local DRAM.
· Network Interface Controller (NIC): Facilitates message-based communication.
· Interconnection Network: Connects processors using topologies like mesh or torus.
Advantages
· High Scalability: Adding more processors is straightforward.
· Larger Memory Capacity: Aggregates memory across nodes.
· Fault Tolerance: A single processor failure doesn’t crash the system.
Disadvantages
· Complex Programming: Explicit message passing requires careful design.
· Slower Communication: Network latency increases data transfer time.
· Cache Management Complexity: No hardware coherence, increasing software complexity.
Applications
· Compute clusters and supercomputers built from networked nodes.
Memory Consistency Issues
· Read Consistency: Processors may see different values for the same memory location.
· Write Propagation: Updates by one processor may not be visible to others.
· Serialization: Writes may occur in an undefined order, causing errors.
Cache Coherence Problems
· Inconsistent Data: One processor updates a variable in its cache, but others read an outdated
value from memory.
· False Sharing: Two processors modify different variables in the same cache line, causing
unnecessary invalidations and performance loss.
1. Directory-Based Protocols: A directory keeps track of which caches hold a copy of each memory block. On a write, the directory sends invalidation or update messages only to those caches, avoiding broadcasts and scaling better to larger systems.
2. Snooping Protocols: Snooping protocols work by having each cache monitor, or "snoop on," a
shared communication bus to detect changes made by other caches. When a cache updates a
memory block, it broadcasts the change or an invalidation signal to all other caches. This ensures
that no stale data is used. Snooping protocols are typically employed in smaller systems due to their
simplicity and reliance on a single shared bus for communication.
Example Issues
· Inconsistency:
o CPU 1 and CPU 2 load X = 10 into their caches.
o CPU 1 updates X = 20, but CPU 2 still sees X = 10 without coherence.
· False Sharing:
o Two CPUs modify different variables in the same cache line, triggering unnecessary
invalidations.
MESI States
| State | Meaning |
|---|---|
| Modified (M) | Cache block is modified, differs from main memory, and is unique to this cache. Must be written back before replacement. |
| Exclusive (E) | Cache block is clean, matches main memory, and exists only in this cache. |
| Shared (S) | Cache block matches main memory and exists in multiple caches. Reads are allowed, but writes require invalidation. |
| Invalid (I) | Cache block is invalid and cannot be used. |
MESI Operations
· Read Operation:
o M or E: Read from cache.
o S: Read from cache; other caches may have copies.
o I: Cache miss; fetch from memory or another cache.
· Write Operation:
o M: Write to cache; no other copies exist.
o E: Write to cache, transition to M.
o S: Invalidate other caches, write, transition to M.
o I: Fetch data, invalidate others, write, transition to M.
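A toy Java sketch of the write transitions listed above (state names follow the table; bus actions are reduced to strings, and a real protocol would also react to snooped bus events from other caches):

enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }

class CacheLine {
    State state = State.INVALID;

    // Local processor writes this line; returns the bus action required.
    String write() {
        switch (state) {
            case MODIFIED:              // already dirty and unique: no bus traffic
                return "none";
            case EXCLUSIVE:             // clean and unique: silently upgrade
                state = State.MODIFIED;
                return "none";
            case SHARED:                // others hold copies: invalidate them first
                state = State.MODIFIED;
                return "bus-invalidate";
            default:                    // INVALID: fetch, invalidate others, then write
                state = State.MODIFIED;
                return "bus-read-for-ownership";
        }
    }
}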
State Transitions
A line moves between states in response to local reads and writes and to snooped bus events: a read miss loads the block in E (no other copies) or S (copies exist); a write in S or I broadcasts an invalidation and moves the line to M; a snooped read of an M line forces a write-back and a transition to S.
Example Scenario
1. CPU 1 reads X: cache miss, block loaded in state E.
2. CPU 2 reads X: both caches now hold X in state S.
3. CPU 1 writes X = 20: CPU 2's copy is invalidated (I), and CPU 1's line becomes M.
4. CPU 2 reads X again: CPU 1 supplies the modified block (writing it back), and both copies become S.
Advantages of MESI
· Maintains cache coherence, so all processors observe consistent values.
· The Exclusive state lets a processor write a clean, unshared line without broadcasting an invalidation, reducing bus traffic.

Synchronous Distributed Systems
A synchronous distributed system is one in which the following bounds are known:
1. Each step of a process is completed within known lower and upper time bounds.
2. Messages transmitted over communication channels are guaranteed to be delivered within a known maximum time.
3. Each process has a local clock whose drift from real time is bounded by a known limit.
Advantages:
One of the main benefits of synchronous systems is that timeouts can be used reliably to detect
process failures. Since the timing of processes and messages is predictable, the system can be
designed with precise failure detection and recovery mechanisms. Such systems are feasible when
processor cycles and network capacity are guaranteed and consistent.
Example:
An example of a synchronous system would be a tightly controlled environment where both
computing and networking resources are dedicated and reserved, such as real-time industrial control
systems or embedded systems in aerospace applications.
Asynchronous Distributed Systems
An asynchronous distributed system makes no assumptions about process execution speeds, message transmission delays, or clock drift rates.
Examples:
The Internet is a classic example of an asynchronous system. Server loads, message transmission
times (such as those in FTP transfers or email delivery), and user interactions vary widely, making
timing unpredictable.
Challenges:
Asynchronous systems pose several design challenges. For instance, tasks like multimedia
streaming that rely on meeting deadlines become difficult to manage reliably. Additionally, because
of unpredictable delays, users often multitask—like browsing other tabs—while waiting for
responses.
In practice, most real-world distributed systems are asynchronous. This is due to the shared nature
of processors and networks, where resource contention and variable conditions prevent the system
from maintaining fixed timing guarantees.
The Happened-Before Relation
The happened-before relation (denoted →) orders events in a distributed system using three rules:
1. Same Process: If a and b are in the same process, and a occurs before b, then a → b.
2. Message Send/Receive: If a is the sending of a message and b is the receipt of that message,
then a → b.
3. Transitivity: If a → b and b → c, then a → c.
Leslie Lamport (1978) introduced the concept of logical clocks to define a partial ordering of events
in a distributed system. Each process pi maintains a logical clock Li, which is a monotonically
increasing counter.
Logical clocks help determine the happened-before relation: if e → e′, then L(e) < L(e′).
LC1 (Local Event Rule): Before executing any event at process pi, increment its clock: Li := Li + 1.
LC2 (Message Rule): When pi sends a message m, it piggybacks the timestamp t = Li. On receiving (m, t), the receiver pj sets Lj := max(Lj, t) and then applies LC1 before timestamping the receive event.
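As a rough illustration, LC1 and LC2 can be packed into a small thread-safe Java class (class and method names are illustrative):

import java.util.concurrent.atomic.AtomicLong;

public class LamportClock {
    private final AtomicLong clock = new AtomicLong(0);

    // LC1: increment before any local or send event.
    public long tick() {
        return clock.incrementAndGet();
    }

    // LC2: on receive, merge with the piggybacked timestamp, then tick.
    public long onReceive(long t) {
        clock.updateAndGet(cur -> Math.max(cur, t));
        return tick();
    }

    public long time() { return clock.get(); }
}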
VECTOR CLOCKS
Vector clocks were developed by Mattern (1989) and Fidge (1991). They solve a problem with Lamport clocks: L(e) < L(e′) does not imply that e causally precedes e′. Each event in a distributed system is given a vector timestamp: an array of integers with one element per process in the system.
Example:
· V = (3, 4, 5)
· V′ = (3, 6, 5)
→ V < V′ because all components of V are ≤ V′ and at least one is strictly less.
Important conclusion: if an event e happens before event e′ (e → e′), then the vector clock of e is less than that of e′:
e → e′ ⇒ V(e) < V(e′)
Unlike Lamport clocks, the converse also holds, so vector clocks capture causality exactly.
Merge operation: when process pi receives a message carrying vector timestamp t, it updates Vi[j] := max(Vi[j], t[j]) for every j, then increments its own entry Vi[i] before timestamping the receive event.
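A compact sketch of a vector clock for n processes, following the rules above (names are illustrative):

import java.util.Arrays;

public class VectorClock {
    private final int[] v;
    private final int self;   // index of the owning process

    public VectorClock(int n, int self) {
        this.v = new int[n];
        this.self = self;
    }

    // Increment own entry before a local or send event.
    public synchronized int[] tick() {
        v[self]++;
        return v.clone();      // timestamp to piggyback on a message
    }

    // Merge a received timestamp, then tick for the receive event.
    public synchronized int[] onReceive(int[] t) {
        for (int j = 0; j < v.length; j++) v[j] = Math.max(v[j], t[j]);
        return tick();
    }

    // True if a happened before b: a <= b component-wise, one entry strictly less.
    public static boolean happenedBefore(int[] a, int[] b) {
        boolean strict = false;
        for (int j = 0; j < a.length; j++) {
            if (a[j] > b[j]) return false;
            if (a[j] < b[j]) strict = true;
        }
        return strict;
    }
}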
1. FIFO Ordering: Messages sent from one process to another are delivered in the order in which they were sent.
Use case: Useful when message order from a specific sender must be preserved, like updates or logs.
2. Non-FIFO Ordering: There is no guarantee that messages sent by one process to another will be received in the same order. Example: If Process A sends M1 then M2 to Process B, B might receive M2 before M1.
Use case: Suitable when message order is not critical, or when application-level logic handles
reordering.
3. Causal Ordering: If one message causally affects another (e.g., a message is sent in response
to a previously received one), then all processes must see these messages in the same causal order
Example:
1. A sends M1 to B.
2. B receives M1 and sends M2 to C.
3. Then, C must receive M1 before M2 because M2 was caused by M1.
Use case: Essential in collaborative applications (like Google Docs) where operations depend on
prior ones.
Global State of A Distributed System: The global state of a distributed system refers to a
complete snapshot of the system at a particular point in time. It includes the state of each process
and the state of the communication channels between them. A distributed system is made up of:
1. Local state of each process: This might include variables, current tasks, program counters, etc.
2. State of each communication channel: This is the set of messages that have been sent but not yet received; they are "in transit".
In centralized systems, you can look at memory or process tables to know the system state. But in
distributed systems, there’s no central control, no global clock, and no shared memory. Each
process has only partial knowledge of what’s happening. So, capturing the global state helps with:
· checkpointing and failure recovery,
· detecting stable properties such as deadlock or termination, and
· debugging and monitoring distributed applications.
The challenge: The main difficulty in defining or capturing a global state is the lack of a global
clock. If you try to record states at "the same time," what does "same time" even mean when each
process has its own local clock and no process knows exactly what the others are doing at that
moment?
The solution: Algorithms like the Chandy-Lamport Snapshot Algorithm solve this problem. They
allow the system to record a consistent global state even while processes continue to run and
exchange messages. This consistency means the snapshot could have occurred in a real execution of
the system, even if not all states were recorded at the same real-time instant.
Chandy-Lamport Snapshot
Basic Assumptions
· The system is made up of multiple processes that communicate through unidirectional FIFO
channels (i.e., messages arrive in the order they were sent).
· There is no shared memory, and processes do not have a global clock.
· Any process can initiate the snapshot at any time.
Suppose one process (say, P0) initiates the snapshot. The algorithm consists of these key steps:
1. P0 records its own local state, immediately sends a special marker message on every outgoing channel, and starts recording messages arriving on each incoming channel.
2. When a process receives a marker for the first time, it records its own state, records the state of that incoming channel as empty, sends markers on all of its outgoing channels, and starts recording on its other incoming channels.
3. When a process receives a marker on a channel it is already recording, it stops recording that channel and saves, as the channel’s state, the messages that arrived since recording began.
4. The snapshot is complete once every process has received a marker on all of its incoming channels.
This method guarantees that the snapshot reflects a consistent view of the system. It avoids
situations like recording a message received by one process but not sent by the other, because the
algorithm ensures that such messages are captured as "in-transit" in the channel state.
Time is crucial in distributed systems. It's used for practical purposes like timestamping transactions
and for understanding how events happen across different systems. However, physical clocks in
computers aren't perfectly accurate and can't always be synchronized exactly. This topic covers the
problems of clock synchronization, introduces algorithms to minimize these errors, and explains
logical clocks like vector clocks, which help in event ordering in distributed systems.
Physical clocks in computers use oscillating crystals to keep time. These oscillations are counted
and converted into a software clock Ci(t) using this formula:
Ci(t) = α·Hi(t) + β
Where:
· Hi(t): hardware clock of node i
· α: scaling factor
· β: offset factor
However, these clocks are prone to:
· Clock skew: The difference between clocks at a single moment
· Clock drift: Gradual divergence due to hardware differences or environmental factors like
temperature
Typical quartz clocks drift about 10^-6 seconds per second (1 second in ~11.6 days). High-precision clocks are more stable, with drifts around 10^-7 or 10^-8.
To achieve better accuracy, computers can sync with external sources like International Atomic Time (TAI). TAI is based on atomic oscillators and is extremely precise (drift of about 1 part in 10^13).
Because the Earth's rotation (astronomical time) varies, Coordinated Universal Time (UTC)
combines TAI with leap seconds to stay in sync. UTC is distributed via:
· External synchronization: sync with a trusted source like UTC within a bound D.
· Internal synchronization: ensure all system clocks are within D of each other.
1. Cristian’s Algorithm
How it works:
1. The client sends a request to a time server asking for the current time.
2. The server replies with its current time t.
3. The client notes the round-trip time (RTT) = time it took to send the request and receive the
reply.
4. It estimates the current server time as: t + RTT/2.
(This assumes the delay is symmetric: the same time to go and to come back.)
Example: If the client sends a request at 10:00:00.000, receives the reply at 10:00:00.020 (RTT = 20 ms), and the reply carries server time t = 10:00:00.500, it sets its clock to 10:00:00.500 + 10 ms = 10:00:00.510.
Pros:
· Simple, and accurate enough when round-trip times are short and roughly symmetric.
Cons:
· Relies on a single time server (a single point of failure).
· Inaccurate when network delays are asymmetric or highly variable.
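A minimal sketch of this estimate in Java (the server address and the getServerTime helper are hypothetical placeholders for an actual network exchange):

public class CristianClient {
    // Hypothetical helper: asks the time server for its current time in millis.
    static long getServerTime(String host) {
        // ... send request to host, parse reply ... (placeholder)
        return 0L;
    }

    public static void main(String[] args) {
        long t0 = System.currentTimeMillis();            // send time
        long serverTime = getServerTime("time.example"); // server's reply t
        long t1 = System.currentTimeMillis();            // receive time
        long rtt = t1 - t0;

        // Cristian's estimate: assume the reply took half the round trip.
        long estimatedNow = serverTime + rtt / 2;
        System.out.println("Estimated server time: " + estimatedNow);
    }
}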
2. Berkeley Algorithm
How it works:
1. One node (called the coordinator) polls other nodes for their local time.
2. All nodes reply with their time.
3. Coordinator:
o Discards outliers (very different times).
o Averages the remaining times.
o Calculates how much each clock should adjust (either forward or backward).
4. Sends adjustment values to all nodes.
Important: Unlike Cristian’s method, there is no central time source. It’s a peer-based
approach.
Pros:
· No external time source is needed, so it works in isolated networks.
· Discarding outliers makes it robust to individual faulty clocks.
Cons:
· Needs a coordinator.
· Not suitable for environments with high delay variability.
3. Network Time Protocol (NTP)
How it works: NTP organizes time servers into a hierarchy of strata, with stratum-1 servers attached directly to reference clocks (e.g., atomic clocks or GPS receivers). A client exchanges timestamped messages with a server and uses the four send/receive times of a request-reply pair to estimate both its clock offset and the network delay, filtering results over multiple exchanges.
Formula (simplified): with client send/receive times T1, T4 and server receive/send times T2, T3:
offset θ = ((T2 − T1) + (T3 − T4)) / 2, delay δ = (T4 − T1) − (T3 − T2)
Pros:
· Scales to the Internet and typically achieves accuracy of a few milliseconds (better on LANs).
Cons:
· Complex algorithm.
· Slight delay depending on network conditions.
Middleware
Middleware acts like an operating system for distributed systems, providing a unified interface across multiple machines.
Purpose of Middleware
It supports heterogeneous computers and networks while presenting a single-system view to users and applications.
Middleware Architecture
Middleware provides a consistent programming model for servers and distributed applications. Common models include remote procedure calls (RPC), distributed objects (remote method invocation), and message-oriented communication such as message queues and publish/subscribe.
How It Works
· Middleware hides network communication details (e.g., sending invocation requests and
replies).
· Programmers work with high-level abstractions rather than low-level network protocols.
3. Openness
Open distributed systems allow extensions through new services or reimplementations, enabling
resource sharing.
· Requirements:
o Publish key software interfaces using an Interface Definition Language (IDL) to specify function
names, parameters, and return types.
o Define service semantics (what services do) informally using natural language, as IDLs only
capture syntax.
· Benefits:
o Systems can integrate components from different vendors or OS.
o Example: Web caching where browsers allow customizable cache policies (size, document storage,
consistency checks).
· Challenge: Ensuring precise semantic specifications to avoid misinterpretations.
4. Security
Security in distributed systems involves three components:
· Confidentiality: Protecting data from unauthorized access.
· Integrity: Preventing unauthorized data alteration.
· Availability: Ensuring access to resources despite interference.
Challenges:
· Data Transmission: Sending sensitive data (e.g., patient records, credit card numbers) over
networks. Solution: Use encryption techniques.
· User Authentication: Verifying the identity of users sending messages. Solution: Biometric
techniques or verification codes (e.g., via cell phones).
· Internal Threats: Firewalls protect against external threats but not misuse within an
intranet.
5. Scalability
A scalable system remains effective as the number of resources and users grows. Distributed
systems must scale in three dimensions:
· Size: Support more users and resources.
· Geographical: Handle users and resources spread far apart.
· Administrative: Manage systems spanning multiple organizations.
Scalability Problems:
· Size: Centralized services (e.g., single servers for sensitive data like medical records)
become bottlenecks as users grow. Replicating servers may compromise security.
· Geographical: LAN-based systems use synchronous communication (microsecond delays),
but WANs face millisecond delays, complicating interactive applications. WANs rely on
unreliable, point-to-point communication, unlike LANs’ reliable broadcast-based systems.
Example: Locating services in WANs requires special location services, unlike simple LAN
broadcasting.
· Administrative: Conflicting policies on resource usage, management, and security across
organizations. Security measures needed to protect both the system and new domains from
malicious attacks (e.g., restricting access, securing downloaded code like applets). Non-technical issues (e.g., organizational politics) make administrative scalability the hardest.
Scaling Techniques
To address scalability challenges, distributed systems use three main techniques:
1. Hiding Communication Latencies: Improve geographical scalability by reducing the impact of
network delays.
· Use asynchronous communication to allow the requesting application to perform other
tasks while waiting for a reply.
· Handle replies via interrupts or separate threads to complete requests.
· Example: A client application continues processing while awaiting a server response.
2. Distribution: Goal is to split components into smaller parts and spread them across the system.
· Example: Domain Name System (DNS):
o DNS maps domain names (e.g., www.amazon.com) to IP addresses.
o Organized hierarchically into zones, each managed by a single name server.
o Name resolution (e.g., nl.vu.cs.flits) involves querying servers for each zone, reducing
load on any single server.
· Benefit: Hierarchical structures scale better (O(log n) access time) than linear ones.
3. Replication
· Goal: Create multiple copies of data or services to distribute load and improve availability.
· Example: Caching web content closer to users to reduce server load and latency.
· Challenge: Ensuring consistency across replicas (e.g., using cache coherence protocols).
CORBA’s Common Data Representation defined in CORBA 2.0, is a standard format used to
support the data types involved in CORBA remote method invocations, including both arguments
and return values. It supports a wide range of types, such as primitive types like short (16-bit), long
(32-bit), unsigned short, unsigned long, float (32-bit), double (64-bit), char, boolean, octet (8-bit),
and a special type called "any" that can represent any data type. Additionally, it accommodates
composite types such as arrays, structs, unions, and sequences.
CDR comes with several notable features. It supports both big-endian and little-endian byte orderings, with the sender specifying the ordering in the message. Primitive values are aligned according to their size (for example, a 4-byte long is placed at byte indices divisible by 4) for efficiency. Floating-point values adhere to IEEE standards, ensuring consistency across platforms.
are encoded using a mutually agreed code set, such as ASCII or Unicode. Furthermore, composite
types are serialized in a defined sequence; for instance, strings are encoded by first specifying their
length, followed by the character data.
Example: A Person struct {name: "Smith", place: "London", year: 1984} is marshalled as:
| Index | Value |
|---|---|
| 0–3 | 5 (length of "Smith") |
| 4–11 | "Smith___" (padded) |
| 12–15 | 6 (length of "London") |
| 16–23 | "London__" (padded) |
| 24–27 | 1984 (unsigned long) |
CDR does not include type information in messages; sender and receiver must know the data order
and types.
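A rough illustration of this layout using Java's ByteBuffer (this mimics the CDR example above and is not a real CORBA implementation; the padding scheme is simplified):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class CdrLikeMarshal {
    // Write a string CDR-style: 4-byte length, characters, then zero padding
    // up to the next 4-byte boundary.
    static void putString(ByteBuffer buf, String s) {
        buf.putInt(s.length());
        buf.put(s.getBytes());
        int pad = (4 - s.length() % 4) % 4;
        for (int i = 0; i < pad; i++) buf.put((byte) 0);
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64).order(ByteOrder.BIG_ENDIAN);
        putString(buf, "Smith");   // indices 0-3 length, 4-11 chars + padding
        putString(buf, "London");  // indices 12-15 length, 16-23 chars + padding
        buf.putInt(1984);          // indices 24-27 the year
        System.out.println("Marshalled " + buf.position() + " bytes");
    }
}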
2. Java Object Serialization
Java serialization is a process that flattens and represents Java objects or entire object trees for the
purpose of transmission or storage. It is specifically limited to use within Java applications. One of
its key features is that it includes type information in the serialized form, which allows for accurate
reconstruction of the object during deserialization. Additionally, it supports complex object graphs,
including the handling of object references. Java serialization is commonly used in scenarios such
as sending Java objects over a network or saving them to disk for later retrieval.
3. XML (Extensible Markup Language)
A textual format for representing structured data, originally designed for web documents but now also used in web services. It represents data textually, resulting in larger sizes compared to binary formats. It includes type information, often referencing external namespaces for type definitions. It is used for exchanging data between clients and servers in web services.
Issues in Marshalling Design
1. Who Performs Marshalling/Unmarshalling?
In CORBA and Java, middleware automatically handles marshalling/unmarshalling, relieving programmers of the task. For XML, software libraries are available, though manual encoding is possible but error-prone due to complex data representations.
2. Compactness:
Binary formats (CORBA, Java) are compact, marshalling data into efficient byte sequences.
Textual formats (XML) are less compact, as text representations (e.g., "4560" vs. binary
4560) require more bytes.
3. Type Information:
o CORBA CDR: Excludes type information, assuming sender and receiver share
knowledge of data structure.
o Java Serialization: Embeds all type information in the serialized form.
o XML: Includes type information, often via external namespace references.
4. General Use: Beyond RMI/RPC, external data representation is used for storing data in files
or transmitting structured documents.
CORBA’s Common Data Representation (CDR) in Detail
· Primitive Types:
o Values are aligned based on size (e.g., 4-byte values start at indices divisible by 4).
o Supports both endian orderings, with the sender’s ordering specified in the message.
· Composite Types:
o Serialized in a defined order (e.g., struct fields are marshalled sequentially).
o Strings include length (unsigned long) followed by characters, padded with zeros for
consistency.
· No Pointers: CDR represents data structures without pointers, ensuring portability.
· Comparison: Similar to Sun XDR (used in Sun NFS), which also omits type information in
messages.
Marshalling in CORBA
· Automation: Marshalling/unmarshalling operations are generated from CORBA Interface
Definition Language (IDL) specifications.
· Example IDL for the Person struct:
struct Person {
string name;
string place;
unsigned long year;
};
The CORBA interface compiler generates marshalling code for remote method arguments and
results.
The Serializable interface (from java.io package) enables serialization without requiring additional
methods.
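For the running Person example, a serializable class can be as simple as the following sketch (the serialVersionUID value here is arbitrary):

import java.io.Serializable;

public class Person implements Serializable {
    private static final long serialVersionUID = 1L; // manually set version number

    private String name;
    private String place;
    private int year;

    public Person(String name, String place, int year) {
        this.name = name;
        this.place = place;
        this.year = year;
    }
}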
Serialization Process
· Class Information: Includes the class name and a version number (set manually or computed as
a hash of class name, instance variables, methods, and interfaces).
· Object References: If an object references other objects, all referenced objects are serialized to
ensure references can be fulfilled upon deserialization.
· Handles: References are serialized as unique handles (sequential positive integers). Each object
is written once, and subsequent occurrences use the handle.
· Recursive Serialization: It writes class information, followed by types and names of instance
variables. If instance variables belong to new classes, their class information is included,
continuing recursively.
· Primitive Types: Written in a portable binary format using ObjectOutputStream.
· Strings and Characters: Written using writeUTF method in UTF-8 format:
o ASCII characters use 1 byte.
o Unicode characters use multiple bytes.
o Strings are preceded by their byte length.
Example: Serializing a Person Object
For Person p = new Person("Smith", "London", 1984):
· Simplified Serialized Form:
o Class information: Person class name, version number.
o Fields:
name: Length (5), "Smith" (padded).
place: Length (6), "London" (padded).
year: 1984 (int).
· Note: Actual form includes type markers and handles (e.g., h0, h1), omitted in simplified
examples.
Serialization and Deserialization Code
· Serialize:
ObjectOutputStream out = new ObjectOutputStream(...);
out.writeObject(p); // Serializes Person object
· Deserialize:
ObjectInputStream in = new ObjectInputStream(...);
Person p = (Person) in.readObject(); // Reconstructs Person object
· Customization: Programmers can define custom read and write methods. Variables marked
transient (e.g., file handles, sockets) are not serialized.
Remote Method Invocation (RMI)
Remote Method Invocation (RMI) evolved in the 1990s as an extension of the object-oriented programming model. RMI enables an object in one Java Virtual Machine (JVM) to invoke methods on an object located in another JVM. It builds upon the concept of local method invocation (LMI) but allows method calls to operate across network boundaries, thereby facilitating distributed object-oriented computing.
Request-Reply Protocols
The request-reply protocol is designed to facilitate typical client-server interactions. In its most
common form, communication is synchronous, meaning the client process sends a request and then
blocks until a response is received from the server. This mechanism also provides reliability
because the reply from the server serves as an implicit acknowledgment of the client’s request.
However, there are cases where asynchronous communication is preferred. In asynchronous
request-reply communication, the client sends a request and continues processing without waiting
for an immediate response. This is useful when replies can be retrieved later, allowing for more
efficient resource usage in certain applications.
In terms of implementation, request-reply interactions can be constructed using Java’s API for UDP
datagrams. However, many modern systems prefer TCP streams for reliability and ease of use.
Protocols built over UDP datagrams help avoid the overhead associated with TCP, such as
connection establishment and flow control, which are unnecessary in many remote invocations
involving small data exchanges.
Core Operations in Request-Reply Protocol
The request-reply model is typically implemented using three communication primitives: doOperation, getRequest, and sendReply (a sketch of all three over UDP follows this list).
o The doOperation function is used by the client to initiate a remote operation. It takes as input
the server reference, an identifier for the operation, and the necessary arguments, all of which
are marshalled into a byte array. After sending the request, doOperation waits for the server’s
response and then unmarshals the result from the reply byte array.
o On the server side, the getRequest function is responsible for receiving client requests.
o Once the server completes the requested operation, it uses the sendReply function to send the
reply back to the client. The client’s doOperation then resumes execution upon receiving the
reply.
o To ensure proper pairing of requests and replies, each message is assigned a unique identifier.
This identifier consists of a requestId and a sender identifier (such as an IP address and port
number). The requestId is a sequential integer generated by the client, while the sender
identifier ensures uniqueness across the distributed system. These identifiers help the client
verify that the received reply corresponds to the correct request and is not an outdated or
duplicate message. The server copies the requestId from the request message into the reply,
allowing the client to perform this verification.
o When using datagram-based protocols like UDP, the reply from the server acts as an implicit
acknowledgment of the client’s request. If delivery guarantees are required, the protocol must
ensure that the reply reliably reaches the client, potentially using retransmission strategies if
necessary.
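A bare-bones Java sketch of these primitives over UDP (the port, message format, and string marshalling are simplified assumptions; a real protocol adds requestIds, duplicate filtering, and retransmission):

import java.net.*;
import java.nio.charset.StandardCharsets;

public class RequestReply {
    static final int SERVER_PORT = 9000; // illustrative port

    // Client side: doOperation sends a request and blocks until the reply arrives.
    static String doOperation(InetAddress server, String request) throws Exception {
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setSoTimeout(2000); // timeout so a lost reply doesn't block forever
            byte[] out = request.getBytes(StandardCharsets.UTF_8);
            socket.send(new DatagramPacket(out, out.length, server, SERVER_PORT));

            byte[] in = new byte[1024];
            DatagramPacket reply = new DatagramPacket(in, in.length);
            socket.receive(reply); // reply is the implicit acknowledgment
            return new String(reply.getData(), 0, reply.getLength(), StandardCharsets.UTF_8);
        }
    }

    // Server side: getRequest receives a request, sendReply answers it.
    public static void main(String[] args) throws Exception {
        try (DatagramSocket socket = new DatagramSocket(SERVER_PORT)) {
            while (true) {
                byte[] in = new byte[1024];
                DatagramPacket request = new DatagramPacket(in, in.length);
                socket.receive(request);                                 // getRequest

                String op = new String(request.getData(), 0, request.getLength(),
                        StandardCharsets.UTF_8);
                byte[] out = ("done: " + op).getBytes(StandardCharsets.UTF_8);
                socket.send(new DatagramPacket(out, out.length,          // sendReply
                        request.getAddress(), request.getPort()));
            }
        }
    }
}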
To handle these limitations, especially in the doOperation primitive, timeouts are used while
waiting for a reply. If a timeout occurs, the action taken depends on the guarantees provided by the
protocol.
Timeout Handling
When a client sends a request, it waits a bounded time (the timeout) for the reply. If that time passes without a reply, the system can do a few things:
· Immediate Failure: The client is told the operation failed. But this is risky because maybe
the server did the job, but the reply was lost.
· Resend the Request: The client sends the same request again. It keeps trying until it either
gets a reply or gives up and shows an error.
Duplicate Request Management
Because of resending, the server might get the same request more than once. This can happen if the
server is slow and the client thinks it didn't respond or if the client resends the request before the
server finishes processing.
To avoid doing the same task multiple times:
· Each request has a unique ID.
· If the server hasn’t replied yet, it just finishes the current work and replies once.
· If the server already replied, and gets the same request again:
o If the task is safe to repeat (idempotent), it can do it again.
o If not, the server sends the old reply from memory instead of redoing it.
Idempotent Operations
An idempotent operation produces the same result no matter how many times it is performed. These
operations simplify duplicate handling because re-execution has no adverse effect.
Examples:
· Adding an item to a set is idempotent.
· Appending to a list is not idempotent.
Servers with only idempotent operations don’t need elaborate mechanisms to avoid re-execution.
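The difference is easy to see in Java by replaying the same "request" twice:

import java.util.*;

public class Idempotence {
    public static void main(String[] args) {
        Set<String> set = new HashSet<>();
        set.add("item");
        set.add("item");                // duplicate request: no effect
        System.out.println(set.size()); // 1 -> idempotent

        List<String> list = new ArrayList<>();
        list.add("item");
        list.add("item");                // duplicate request changes the result
        System.out.println(list.size()); // 2 -> not idempotent
    }
}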
Using History to Handle Lost Replies
To avoid re-executing non-idempotent operations, servers may maintain a history of past requests.
Each entry in this history typically includes the request ID, the corresponding reply message, and
the client ID. When a duplicate request is received, the server can refer to this history and resend the
previously stored reply instead of processing the request again.
However, maintaining such a history comes with memory overhead. To manage this, servers often
treat a new request from a client as an implicit acknowledgment of the previous reply. This allows
the server to safely discard the last stored reply for that client. Despite this strategy, the history can
still grow, especially when many clients are connected or if a client terminates without sending
another request (thus never acknowledging the previous reply). As a result, servers commonly
discard old messages after a predetermined period to prevent unbounded memory usage.
Exchange Protocol Styles
Three exchange styles define how messages are passed between client and server in the presence of
failures:
· Request (R) Protocol:
In the Request (R) protocol, the client sends a single request without expecting any confirmation or
reply. This approach is suitable in scenarios where no result is needed and confirmation is not
critical. It is a very lightweight communication method but comes with the drawback of being
vulnerable to failures due to the lack of response or acknowledgment mechanisms.
· Request-Reply (RR) Protocol:
The Request-Reply (RR) protocol is commonly used in client-server interactions. In this model, the
server's reply serves as an implicit acknowledgment of the client’s request. Furthermore, subsequent
requests from the client can act as acknowledgments for previous replies. To ensure reliability, this
protocol may involve mechanisms such as retransmissions, duplicate detection, and maintaining a history of interactions.
· Request-Reply-Acknowledge (RRA) Protocol:
In the Request-Reply-Acknowledge (RRA) protocol, the client additionally acknowledges the server’s reply. This acknowledgment allows the server to discard the corresponding entry from its history, bounding the memory needed for duplicate handling.
Call semantics in Remote Procedure Call (RPC) systems define how the system handles procedure
execution in the presence of communication failures. These semantics determine how reliably a
remote call is perceived by the client, especially under failure conditions such as lost messages or
crashes.
Maybe Semantics: Under maybe semantics, the remote procedure may execute once or not at all.
There is no fault tolerance mechanism, making this approach susceptible to omission or crash
failures. It is generally suitable for applications where occasional failed calls can be tolerated and do
not critically affect the system's correctness.
At-Least-Once Semantics: With at-least-once semantics, the client retransmits the request until it receives a reply, so the remote procedure may execute one or more times. Because duplicates are not filtered, this is safe only for idempotent operations.
At-Most-Once Semantics: At-most-once semantics ensures that the remote procedure is executed
no more than once. This is achieved through comprehensive fault-tolerance techniques such as
retries combined with duplicate request filtering. It provides both reliability and protection against
duplicating side effects, making it a preferred approach in systems requiring consistency. An
example of a system implementing this semantic is Sun RPC.
Handling Failures
Remote procedure calls are more susceptible to failure than local calls. Therefore, client
applications must be capable of recovering from communication issues, crashes, or timeouts. This
necessitates exception handling and timeout mechanisms to maintain robustness.
Latency Considerations
RPC latency is significantly higher than that of local function calls. Because of this, programs
should minimize the number of remote interactions. The designers of Argus proposed that a remote
call should be abortable without affecting the server, implying that the server must be capable of
rolling back the effects of an incomplete call.