HPC Computer Engg Sem 8 Notes
Unit 1:
Q.1) Describe the scope of parallel computing. What are applications of parallel
computing?
Parallel computing tackles problems that can be broken down into smaller, independent (or
loosely coupled) tasks that can be solved simultaneously. This approach leverages multiple
processing units (cores, processors, or even entire computers) working in concert to achieve
faster execution times compared to traditional serial computing where tasks are executed one
after another.
● Problem size and complexity: Parallel computing is particularly suited for large-scale,
computationally intensive problems that would take an unreasonable amount of time to
solve on a single processor. As the problem size increases, the potential speedup from
parallelization becomes more significant.
● Task decomposition: The problem needs to be divisible into subtasks that can be
executed concurrently with minimal overhead for communication and coordination
between processing units.
● Scalability: Ideally, parallel computing should exhibit good scalability, meaning the
performance gain should increase as you add more processing units. However,
achieving perfect scalability can be challenging due to factors like communication
overhead and synchronization requirements.
Parallel computing permeates various scientific and engineering domains due to its ability to
handle complex simulations and data analysis; prominent examples (scientific simulations, big
data analytics, multimedia, and server workloads) are covered in the application sections later in
these notes.
Cache Coherence Protocols:
● These protocols define how processors communicate and coordinate cache updates to
maintain a consistent view of shared data across their private caches.
● They involve states (modified, shared, exclusive, etc.) for cache lines indicating the copy
status.
● Transitions between states occur based on read/write operations and communication
between caches or a directory.
Common Approaches:
● Snooping-based protocols: each cache monitors (snoops on) a shared bus and updates or
invalidates its copies when other processors read or write a line.
● Directory-based protocols: a directory tracks which caches hold each memory block and
coordinates updates, which scales better to larger processor counts.
Benefits:
● All processors observe a consistent, up-to-date view of shared data, which is essential for
correct shared-memory programs.
Challenges:
● Coherence traffic and state bookkeeping add overhead, and this overhead grows with the
number of processors and caches.
Real-world Example:
Imagine two processors working on a shared document. Cache coherence ensures that both
processors always see the latest version of the document, regardless of which processor made
the last edit. This prevents inconsistencies, such as one processor seeing an outdated version
while the other has the latest changes.
Q.4) Explain Store-and-Forward & packet routing with its communication cost.
Store-and-forward is a fundamental technique used in packet routing within networks like the
internet. It ensures reliable data transmission by acting like a digital post office for data packets.
Here's how it works:
1. Receiving: When a packet arrives at a router (the network device responsible for
forwarding packets), it's entirely received and stored in a temporary buffer memory.
2. Error Checking: The router performs error checks on the packet, typically using
techniques like Cyclic Redundancy Check (CRC) to detect any data corruption during
transmission.
3. Routing Decision: Based on the destination address within the packet header, the
router consults its routing table to determine the next hop (the next router) on the path
towards the final destination.
4. Forwarding: If the error check passes and the next hop is determined, the router
forwards the entire packet out the appropriate outgoing link.
5. Buffer Management: If the buffer is full due to network congestion, the router might
employ strategies like queuing or packet dropping to manage the incoming data flow.
Communication Cost of Store-and-Forward (a rough cost model follows this list):
● Latency: A delay is introduced at each router because the entire packet must be received
and processed before it can be forwarded. This cumulative per-hop delay across multiple
routers can impact real-time applications.
● Buffer Management: Routers need additional memory to store packets temporarily.
Buffer overflow can lead to packet drops, reducing overall network efficiency.
● Processing Overhead: Error checking and routing table lookups add to the processing
workload at each router.
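Using notation common in parallel-computing textbooks (an assumed cost model, not something
defined elsewhere in these notes), let t_s be the startup time, t_h the per-hop routing time, t_w
the per-word transfer time, m the message size in words, and l the number of links the packet
traverses. The store-and-forward communication cost is then approximately:
t_comm = t_s + (m * t_w + t_h) * l ≈ t_s + m * t_w * l (since t_h is usually small compared with m * t_w)
so the cost grows with the product of the message size and the number of hops.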
Advantages of Store-and-Forward:
● Reliable Delivery: It minimizes the risk of corrupted data reaching the destination by
discarding packets with errors.
● Congestion Control: Routers can implement buffer management techniques to prevent
network congestion.
● Flexibility: Store-and-forward works with various network protocols and data types.
Scientific and Engineering Simulations:
● Complex simulations in fields like physics, chemistry, and engineering leverage multiple
cores to perform intensive calculations faster. This allows for more accurate and detailed
modeling of real-world phenomena.
● Examples: weather forecasting, climate modeling, molecular dynamics simulations for
drug discovery.
Big Data Analytics and Machine Learning:
● Processing massive datasets often involves parallel tasks like data filtering, sorting, and
aggregations. Multi-core architectures accelerate these operations, enabling faster
analysis and insights extraction.
● Machine learning algorithms, especially deep learning models with many layers and
parameters, benefit from parallel processing on multiple cores for faster training and
inference.
● Examples: large-scale genomics analysis, financial data analysis, training complex
image recognition models.
Multimedia Applications:
● Video editing, encoding, and decoding are computationally demanding tasks that can be
significantly accelerated by multi-core processors. This allows for faster rendering,
real-time editing of high-resolution videos, and smoother playback experiences.
● 3D graphics rendering in games and animation software utilizes multiple cores to handle
complex lighting effects, object interactions, and high-resolution textures, leading to
more immersive visuals.
Server and Database Applications:
● Handling high volumes of user requests on web servers benefits from parallel processing
capabilities. Multiple cores can efficiently handle concurrent user connections and
database queries, improving responsiveness and scalability.
● Database management systems can leverage multi-core architecture for faster data
processing, indexing, and complex data manipulation tasks.
N-Wide Superscalar Processors:
● N-Wide: This refers to the number of instructions the processor can issue per clock cycle,
which matches the number of parallel execution units within a single core. A wider design
(higher N) allows more instructions to execute in parallel; common examples are 2-wide
(dual-issue) and 4-wide (quad-issue) designs. Issue width is a property of one core and is
distinct from the number of cores on a chip.
● Superscalar Execution: The processor employs sophisticated hardware mechanisms
to achieve superscalar execution. It involves:
○ Instruction Fetch: The processor fetches multiple instructions from memory in a
single clock cycle, typically exceeding the number of execution units (N).
○ Instruction Decode and Dispatch: A decoder unit analyzes the fetched
instructions and identifies independent ones suitable for parallel execution. These
are then dispatched to available execution units.
○ Out-of-Order Execution: To maximize utilization of execution units, the
processor might execute instructions out of their program order if earlier
instructions have dependencies that haven't been resolved yet. This requires
careful instruction scheduling and data dependency checks to ensure correct
program execution.
○ Retirement: Once an instruction finishes execution, it's retired, and its results are
written back to the register file or memory.
Challenges:
● Complexity: Designing and managing the hardware for instruction fetching, decoding,
scheduling, and out-of-order execution adds complexity to the processor architecture.
● Limited Benefits for Serial Programs: Programs with inherent dependencies between
instructions might not see significant performance improvement with N-wide superscalar
architecture.
● Diminishing Returns: As the issue width (N) increases, the benefits of additional
execution units diminish because typical programs expose only limited instruction-level
parallelism and managing complex instruction dependencies becomes harder.
Server and Database Applications:
● Handling high volumes of user requests on web servers benefits from parallel processing
capabilities. Parallel programming allows for efficient handling of concurrent user
connections and database queries, improving responsiveness and scalability.
● Database management systems can leverage parallel programming for faster data
processing, indexing, and complex data manipulation tasks.
Other Applications:
● Signal processing: Parallel programming can accelerate tasks like image and audio
processing, filtering, and analysis.
● Cryptography: Breaking encryption codes or implementing complex cryptographic
algorithms can benefit from parallel processing techniques.
● Bioinformatics: Analyzing large genetic datasets for research purposes can be
significantly faster with parallel programming.
● Financial modeling: Complex financial simulations and risk assessments can be
performed much faster using parallel programming techniques.
Q.8) Explain (with suitable diagram): SIMD, MIMD & SIMT architecture.
SIMD (Single Instruction, Multiple Data):
● Concept: A single instruction stream is broadcast to multiple processing elements (PEs);
each PE applies the same operation, in lockstep, to its own data element. Well suited to
regular, array-style computations.
Diagram:
+--------------------+
| Instruction Fetch | (Single Instruction)
+--------------------+
|
v
+---------+---------+---------+---------+
| PE 0 | PE 1 | PE 2 | PE 3 |
| Data 0 | Data 1 | Data 2 | Data 3 | (Multiple Data Elements)
+---------+---------+---------+---------+
MIMD (Multiple Instruction, Multiple Data):
● Concept: Each processor executes its own instruction stream on its own data stream,
independently of the others. This is the most general model and covers multi-core CPUs
and clusters of machines.
Diagram:
+---------+---------+---------+---------+
| Instr 0 | Instr 1 | Instr 2 | Instr 3 | (Multiple Instruction Streams)
+---------+---------+---------+---------+
| PE 0 | PE 1 | PE 2 | PE 3 |
| Data 0 | Data 1 | Data 2 | Data 3 | (Multiple Data Streams)
+---------+---------+---------+---------+
SIMT (Single Instruction, Multiple Threads):
● Concept: Similar to SIMD, executes a single instruction on multiple data streams, but
with more flexibility.
● Data Streams: Data elements can have varying structures and complex dependencies.
● Processing: Threads within a processing element can diverge from the main instruction
stream based on specific conditions within their data. This allows for some level of
conditional execution within the overall SIMT model.
● Applications: Often used in graphics processing units (GPUs) for tasks like image
processing, scientific simulations with some level of data branching, and real-time ray
tracing.
Diagram:
+--------------------+
| Instruction Fetch | (Single Instruction)
+--------------------+
|
v
+---------+---------+---------+---------+
| PE 0 | PE 1 | PE 2 | PE 3 |
| Thread | Thread | Thread | Thread | (Multiple Threads)
|0 |1 |2 |3 |
+---------+---------+---------+---------+
| | | |
v v v v
+-----+ +-----+ +-----+ +-----+ +-----+ (Data with Potential Variations)
| D00 | | D10 | | D20 | | D30 | | ... |
+-----+ +-----+ +-----+ +-----+ +-----+
| | | |
v v v v
+-----+ +-----+ +-----+ +-----+ +-----+
| D01 | | D11 | | D21 | | D31 | | ... |
+-----+ +-----+ +-----+ +-----+ +-----+
Key Differences:
● SIMD: all processing elements execute the same instruction in strict lockstep on regular
data; no divergence is possible.
● SIMT: threads share one instruction stream but may diverge on data-dependent branches,
trading some efficiency for flexibility (the GPU model).
● MIMD: each processor runs its own independent instruction stream on its own data,
offering the greatest generality.
Q.9) Explain the impact of Memory Latency & Memory Bandwidth on system
performance.
Memory Latency:
● Concept: Refers to the time it takes for the processor to access data from main memory
after it issues a request. It's essentially the waiting time for data retrieval.
● Impact:
○ Increased latency leads to performance degradation. The processor stalls while
waiting for data, hindering its ability to execute instructions efficiently. This is
particularly significant for tasks that require frequent memory access.
○ Cache plays a vital role: Modern processors employ caches (high-speed
memory closer to the processor) to mitigate high memory latency. Frequently
accessed data is stored in the cache, reducing reliance on main memory and
improving overall performance.
Memory Bandwidth:
● Concept: Refers to the rate at which data can be transferred between the processor and
main memory. It's analogous to the width of a data pipeline, determining how much data
can flow through in a given time unit.
● Impact:
○ Limited bandwidth can bottleneck performance, especially when dealing with
large datasets or applications that require significant data movement between
memory and the processor.
○ High bandwidth enables faster data transfer, improving performance for tasks
that involve heavy data processing or frequent memory access patterns.
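A rough way to quantify the combined effect of memory latency and caching is the average
memory access time (AMAT); the numbers below are illustrative assumptions, not figures from
these notes:
AMAT = cache hit time + miss rate × miss penalty
     = 1 ns + 0.10 × 100 ns = 11 ns
Even a 10% miss rate leaves the average access time roughly an order of magnitude above the
cache hit time, which is why both low latency and high cache hit rates matter.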
The cost of passing a message between processors has several components (a simplified
cost model is given after this list):
● Startup Time: This includes the time spent on both the sending and receiving
processors to prepare the message for transmission and handle routing overhead.
● Data Transfer Time: The actual time it takes to transfer the data across the network
connection between processors. This depends on the data size and network bandwidth.
● Network Topology: The physical layout of the network (e.g., mesh, hypercube) can
influence communication costs due to varying path lengths between processors.
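A commonly used simplified model for a single point-to-point transfer (again using assumed
textbook notation, with per-hop terms ignored) is:
t_comm = t_s + t_w * m
i.e., a fixed startup cost t_s plus a term proportional to the message size m — which is why the
strategies below favour fewer, larger messages.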
Minimizing message passing costs is crucial for optimal performance in parallel computing.
Strategies include:
● Reducing message frequency: Sending fewer, larger messages is more efficient than
sending numerous small ones.
● Data locality: Organizing data such that frequently accessed data resides on the same
processor or nearby processors minimizes communication needs.
● Overlapping communication and computation: When possible, processors can
perform computations while data is being transferred, improving overall efficiency.
1. Uniform Memory Access (UMA):
● Concept: In UMA architecture, all processors share a single memory space and have
equal access time to any memory location regardless of the processor's physical
location.
● Diagram:
+-------+ +-------+ +-------+
| CPU 0 | | CPU 1 | | CPU 2 |
+-------+ +-------+ +-------+
| | |
v v v
+--------------------+
| Shared Memory |
+--------------------+
● Benefits: A simple, uniform programming model; any processor can access any data with
the same latency, which simplifies data placement and load balancing.
2. Non-Uniform Memory Access (NUMA):
● Concept: In NUMA architecture, each processor has its own local memory that it can
access faster than non-local memory (memory associated with another processor).
Accessing non-local memory involves additional communication overhead.
● Diagram:
+------------------+ +------------------+ +------------------+
| CPU 0 | | CPU 1 | | CPU 2 |
| + Local Memory 0 | | + Local Memory 1 | | + Local Memory 2 |
+------------------+ +------------------+ +------------------+
| | |
v v v
+----------------------------------------------------------+
| Interconnect (slower access to remote/non-local memory)   |
+----------------------------------------------------------+
● Benefits: Better scalability than UMA, since most accesses go to fast local memory and
contention on a single shared memory is avoided; performance then depends on keeping
data close to the processor that uses it.
Q.12) Write a short note on: (i) Dataflow Models, (ii) Demand Driven Computation, (iii)
Cache Memory
(i) Dataflow Models:
Dataflow models describe how data flows through a system and how computations are triggered
by the availability of data. They are particularly useful in parallel and distributed computing
environments. Common dataflow models include:
● Batch Sequential: Processes data in large batches, one after another. Simple to
implement but not suitable for problems with inherent dependencies between tasks.
● Streaming (or Pipelined): Processes data in a continuous flow, breaking it down into
smaller chunks and applying operations as they arrive. Enables high parallelism but can
be more complex to manage.
(ii) Demand-Driven Computation:
This approach focuses on calculating only the information that is actually needed. It's useful for
avoiding unnecessary computations and optimizing resource usage. In the context of dataflow
models, demand for data or computation is triggered by the arrival of specific data elements.
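A minimal Python sketch of demand-driven evaluation using a generator (the function and
variable names are made up for illustration): nothing is computed until a consumer actually asks
for a value.

def expensive_results(items):
    # No work happens here yet; a value is computed only when it is demanded.
    for item in items:
        yield item * item                       # stand-in for an expensive computation

results = expensive_results(range(1_000_000))
first_three = [next(results) for _ in range(3)] # only 3 of the 1,000,000 values are computed
print(first_three)                              # [0, 1, 4]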
(iii) Cache Memory:
Cache memory is a small, high-speed memory located closer to the processor than main
memory. It stores frequently accessed data or instructions to reduce the average time it takes to
access data. This significantly improves performance because accessing cache memory is
much faster than accessing main memory.
● Cache Hierarchy: Modern systems often employ multiple levels of cache (L1, L2, L3)
with varying sizes and access times.
● Cache Coherence: In multiprocessor systems, cache coherence protocols ensure that
all processors have a consistent view of the data stored in main memory.
Multi-threading Advantages:
● Improved Performance for Parallel Tasks: When a program can be broken down into
independent or loosely coupled subtasks, multi-threading allows multiple threads to
execute these tasks concurrently. This can lead to significant performance gains
compared to pre-fetching, which is primarily focused on fetching data in anticipation of
future needs.
● Flexibility for Unpredictable Workloads: Multi-threading works well for workloads with
unpredictable data access patterns or dependencies that arise during runtime. Threads
can adapt to changing conditions and synchronize execution as needed. Pre-fetching is
less flexible as it relies on static predictions of future data requirements, which might not
always be accurate.
Q.14) Explain memory hierarchy and thread organization (summarized).
Memory Hierarchy:
● Organizes storage in levels of increasing capacity and decreasing speed: registers →
cache (L1/L2/L3) → main memory → disk.
● Effective use of the memory hierarchy reduces main memory accesses, improving speed.
Thread Organization:
● A program is divided into threads that the operating system schedules across the
available cores; threads may share data and must synchronize access to it.
● Thread organization allows parallel processing of independent tasks within a program.
Key considerations:
● Data locality: Keeping frequently accessed data closer to the CPU (registers, cache)
improves performance.
● Thread workload: Choose the right number of threads to match available processing
cores and task complexity.
Q.15) Explain control structure of parallel platforms in detail.
The control structure of parallel platforms refers to how tasks are coordinated and executed on
multiple processors within a computer system. Here's a concise breakdown of the key aspects:
1. Levels of Parallelism:
● Instruction-level (within a single core), data-level (the same operation on many data
elements), and task/thread-level (independent tasks or threads running on different cores
or nodes).
2. Communication Models:
● Shared address space: processors communicate implicitly by reading and writing shared
memory, using synchronization primitives to coordinate.
● Message passing: processors have private memories and communicate explicitly by
sending and receiving messages (e.g., MPI).
3. Thread Management:
● Scheduling: Determines how threads are allocated processing resources (CPU cores)
and the order in which they are executed. (e.g., round-robin scheduling)
● Synchronization: Ensures threads cooperate and access shared data safely to prevent
race conditions and data corruption. (e.g., mutexes, semaphores)
4. Programming Paradigms:
● Fork-join: A parent process creates child processes (or threads) that execute tasks in
parallel and then rejoin the parent process upon completion.
● Data parallelism: The same operation is applied to different data elements concurrently.
● Task parallelism: Different tasks within a program are executed concurrently.
● Memory Latency: Refers to the time it takes for the processor to access data from main
memory after it issues a request. It's essentially the waiting time for data retrieval.
● Memory Bandwidth: Represents the rate at which data can be transferred between the
processor and main memory. It's analogous to the width of a data pipeline, determining
how much data can flow through in a given time unit.
In simpler terms: latency is how long you wait for the first piece of data to arrive, while
bandwidth is how much data can arrive per unit of time once the transfer is under way.
Q.17) Explain basic working principle of: (i) Superscalar processor, (ii) VLIW processor
(i) Superscalar Processor:
Working principle: The processor fetches several instructions per clock cycle, uses hardware
logic to detect independent instructions at runtime, and dispatches them to multiple execution
units so that more than one instruction can complete per cycle.
Benefits:
● Higher instruction throughput on existing sequential code, because the hardware extracts
instruction-level parallelism dynamically without requiring recompilation.
Challenges:
● Complexity: Designing and managing the hardware for instruction fetching, decoding,
scheduling, and out-of-order execution adds complexity to the processor architecture.
● Limited Benefits for Serial Programs: Programs with inherent dependencies between
instructions might not see significant performance improvement with superscalar
architecture.
● Diminishing Returns: As the issue width (number of execution units) increases, the
benefits of additional units diminish because typical programs expose only limited
instruction-level parallelism and the hardware for dependency checking and scheduling
grows more complex.
(ii) VLIW Processor:
Working principle: The compiler, rather than the hardware, identifies independent operations
and packs them into one very long instruction word; at runtime, each slot of that word is issued
to a fixed functional unit in the same cycle.
Benefits:
● Simpler, lower-power hardware, since dependency analysis and instruction scheduling are
performed at compile time.
Challenges:
● Heavy reliance on compiler quality; slots are filled with no-ops when enough independent
operations cannot be found, and code scheduled for one VLIW width may not run
efficiently on a different implementation.
Q.18) Explain Superscalar execution in terms of horizontal waste and vertical waste with
example.
Horizontal Waste:
● Concept: Refers to a situation where not all execution units within the superscalar
processor are utilized in a particular clock cycle. This happens when there aren't enough
independent instructions available to fill all the execution units.
● Analogy: Imagine a restaurant kitchen with multiple chefs (execution units). Horizontal
waste occurs when there aren't enough orders (instructions) to keep all chefs busy in a
particular cycle. Some chefs stand idle, even though there's cooking capacity.
● Example: Consider a 4-way superscalar processor with four execution units (e.g., two
ALUs, a multiplier, and a memory access unit). In a given cycle, the processor can find
only three independent instructions. Three are dispatched to execution units, but the
fourth unit has nothing to execute and sits idle for that cycle. This idle issue slot is
horizontal waste: execution capacity exists, but there is no independent instruction to fill it.
Vertical Waste:
● Concept: Occurs when a clock cycle is wasted entirely because no instruction can be
completed. This can happen due to dependencies between instructions or limitations in
the processor's design.
● Analogy: Vertical waste is like a stall in the kitchen workflow. Perhaps a chef (execution
unit) is waiting for ingredients (data) from another chef who hasn't finished their task yet
(dependency). The entire kitchen (processor) stalls until the ingredient (data) becomes
available.
● Example: In the same 4-way processor, suppose an instruction needs the result of a
long-latency operation (for example, a load that misses in the cache). Until that result is
available, the dependent instructions cannot issue, and entire cycles may pass in which
none of the execution units completes useful work. Those completely idle cycles are
vertical waste.
Minimizing Waste:
● Instruction Fetch: Superscalar processors often fetch more instructions than available
execution units to increase the chance of finding independent instructions.
● Out-of-Order Execution: Advanced techniques like out-of-order execution allow the
processor to reorder instructions and execute them even if earlier instructions haven't
finished, as long as there are no dependencies. This helps reduce vertical waste.
● Compiler Optimizations: Compilers can play a role in optimizing code to improve
instruction-level parallelism and reduce dependencies, leading to less horizontal and
vertical waste.
Unit 2:
Q.1) Explain: (i) granularity, (ii) concurrency and (iii) dependency graph.
(i) Granularity:
● Concept: Refers to the size and complexity of tasks that are broken down for parallel
execution.
● Impact:
○ Fine-grained: Smaller, simpler tasks offer more opportunities for parallelism but
can introduce more overhead in managing them.
○ Coarse-grained: Larger, more complex tasks reduce management overhead but
might limit parallelism if tasks lack sufficient internal independence.
(ii) Concurrency:
● Concept: The ability of a system to execute multiple tasks seemingly at the same
time. This doesn't necessarily mean true simultaneous execution; it can involve rapid
switching between tasks.
● Importance: Enables efficient utilization of multiple processing cores or resources to
improve overall performance for problems that can be broken down into independent or
loosely coupled subtasks.
Degree of Concurrency (DoC): the maximum number of tasks that can be executed
concurrently at any point during the program's execution.
● Impact:
○ A higher DoC indicates the program can potentially utilize more processing
resources simultaneously, leading to faster execution.
○ However, the DoC is not always equal to the total number of tasks in the program
due to dependencies between tasks.
● Factors Affecting DoC: the granularity of the task decomposition (finer decomposition
can raise the DoC) and the dependencies among tasks (long chains of dependent tasks
lower it).
Decomposition:
● Concept: The process of breaking down a large, complex problem into smaller, more
manageable sub-problems that can be executed concurrently. This is the foundation for
achieving parallelism in a program.
● Benefits:
○ Enables efficient utilization of multiple processors or resources.
○ Simplifies problem-solving by focusing on smaller, independent units.
○ Improves code readability and maintainability.
Tasks:
● Concept: The individual units of work created after decomposing a problem. Tasks
represent the smallest elements that can be executed independently (or with minimal
dependencies) in a parallel program.
● Characteristics:
○ Can be of varying sizes and complexity depending on the problem and
decomposition strategy.
○ May require communication and data exchange with other tasks for overall
program execution.
Dependency Graph:
● Concept: A directed acyclic graph in which nodes represent tasks and directed edges
represent dependencies; an edge from task A to task B means B can start only after A
finishes.
● Use: It shows which tasks can run concurrently and, via its critical (longest) path, bounds
the best achievable parallel execution time.
1. Amdahl's Law: This law sets a theoretical limit on the speedup achievable through
parallelization. It states that the overall speedup is limited by the fraction of the program that is
inherently sequential. Even with infinite processing resources, the sequential portion of the
algorithm will limit the overall improvement.
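Amdahl's Law is usually written as follows, where p is the fraction of the program that can be
parallelized and N is the number of processors:
Speedup(N) = 1 / ((1 − p) + p / N)
For example, with p = 0.9 the speedup can never exceed 1 / (1 − 0.9) = 10, no matter how large
N becomes.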
3. Limited Scalability: Not all algorithms scale perfectly with increasing processing cores. As
the number of cores grows, the communication and synchronization overhead can become
significant, potentially outweighing the benefits of parallelism.
4. Algorithm Suitability: Not all algorithms can be effectively parallelized. Some algorithms
have inherent dependencies between steps that make it difficult to break them down into
independent tasks suitable for concurrent execution.
Q.5) Explain with suitable examples: (a) Recursive decomposition, (b) Exploratory
decomposition, (c) Data decomposition
(a) Recursive Decomposition:
● Concept: Breaks down a problem into smaller, self-similar subproblems of the same
type. The process is repeated recursively until the subproblems become simple enough
to be solved directly. This is a natural approach for problems that can be divided into
smaller versions of themselves with the same structure.
● Example:
Consider the problem of sorting a list of numbers using the Merge Sort algorithm.
1. Base Case: If the list has only one element, it's already sorted (nothing to do).
2. Recursive Step: Divide the list into roughly equal halves.
○ Recursively sort the first half.
○ Recursively sort the second half.
○ Merge the two sorted halves into a single sorted list.
Here, each subproblem (sorting half the list) is a smaller version of the original problem (sorting
the entire list). Recursive decomposition allows efficient use of multiple processors as each
subproblem can be sorted concurrently.
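A minimal Python sketch of the Merge Sort decomposition described above (written sequentially
for clarity; the two recursive calls operate on independent halves, which is exactly what a parallel
runtime could execute concurrently):

def merge(left, right):
    # Combine two already-sorted lists into one sorted list.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

def merge_sort(nums):
    if len(nums) <= 1:                  # base case: already sorted
        return nums
    mid = len(nums) // 2
    left = merge_sort(nums[:mid])       # independent subproblem 1
    right = merge_sort(nums[mid:])      # independent subproblem 2 (could run in parallel)
    return merge(left, right)           # combine step

print(merge_sort([4, 9, 1, 7, 8, 11, 2, 12]))   # [1, 2, 4, 7, 8, 9, 11, 12]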
(b) Exploratory Decomposition:
● Concept: Involves breaking down a problem into subproblems based on exploring the
search space of possible solutions. This technique is often used for problems where the
solution path is not entirely known in advance, and different subproblems might lead to
different solutions.
● Example:
Consider searching for the best route between two cities. The space of candidate routes can be
split into subproblems (for instance, routes passing through different intermediate cities), and
each processor explores a different portion of the search space.
This approach allows parallel exploration of multiple promising paths to find the best route
(solution) faster.
(c) Data Decomposition:
● Concept: Focuses on dividing the data associated with a problem into smaller chunks.
These chunks can then be processed independently or with minimal communication
across processors. This technique is effective for problems where the same operation
needs to be applied to different parts of the data.
● Example:
Consider processing a large image and applying a filter (e.g., blur) to each pixel.
1. Data Partitioning: Divide the image into smaller tiles (subproblems) containing a subset
of pixels.
2. Parallel Processing: Assign each tile to a different processor, which independently
applies the blur filter to its assigned pixels.
3. Result Aggregation: Once all tiles are processed, combine the filtered tiles back into
the final filtered image.
Data decomposition allows for efficient parallel processing of the image data, where each
processor can independently apply the filter to its assigned tile.
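A sketch of the tiling idea in Python, under the simplifying assumptions that the image is a
nested list of pixel values and the "filter" is a trivial brightness adjustment standing in for a real
blur:

from multiprocessing import Pool

def filter_tile(tile):
    # Stand-in for a real image filter: brighten every pixel in this tile.
    return [[min(255, px + 20) for px in row] for row in tile]

if __name__ == "__main__":
    image = [[(r * c) % 256 for c in range(512)] for r in range(512)]   # synthetic image
    tiles = [image[i:i + 128] for i in range(0, len(image), 128)]       # 1. data partitioning
    with Pool(processes=4) as pool:
        filtered_tiles = pool.map(filter_tile, tiles)                   # 2. parallel processing
    filtered_image = [row for tile in filtered_tiles for row in tile]   # 3. result aggregation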
1. Block Distribution:
● Concept: Divides the data into equally sized contiguous blocks along a single
dimension. This approach is simple to implement and works well for problems where
operations are independent across different data blocks.
2. Cyclic Distribution:
● Concept: Divides the data into equally sized chunks and distributes them cyclically
across processors. This technique ensures a more balanced distribution of workload
compared to block distribution, especially when data items have varying processing
times.
● Example: Imagine processing a large log file where each line represents an event.
Cyclic distribution ensures each processor receives a mix of potentially short and long
log entries, leading to better load balancing compared to assigning entire blocks that
might have skewed processing times.
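A small Python sketch contrasting the two schemes, assuming 12 data items and 3 processors
(illustrative numbers only):

N, P = 12, 3   # 12 data items, 3 processors

block  = {p: [i for i in range(N) if i // (N // P) == p] for p in range(P)}
cyclic = {p: [i for i in range(N) if i % P == p] for p in range(P)}

print(block)   # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7], 2: [8, 9, 10, 11]}  contiguous blocks
print(cyclic)  # {0: [0, 3, 6, 9], 1: [1, 4, 7, 10], 2: [2, 5, 8, 11]}  round-robin assignment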
3. Scatter Decomposition:
● Concept: Distributes specific data elements based on a key associated with each
element. This technique is useful when operations depend on specific data values rather
than their position in the original data set.
● Example: Consider a database where customer information needs to be processed
based on location. Scatter decomposition can distribute customer records to processors
based on their city or region, allowing processors to efficiently handle queries or
operations specific to those locations.
4. Hashing:
● Concept: Uses a hash function to map each data element to a specific processor. This
technique is useful for situations where the workload associated with each data element
is unpredictable or the data needs to be grouped based on certain attributes.
● Example: Imagine processing a large collection of social media posts and analyzing
sentiment. Hashing can map each post to a processor based on the dominant sentiment
(positive, negative, neutral) expressed in the text. This allows processors to efficiently
analyze posts with similar sentiment.
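A minimal sketch of hash-based mapping in Python (CRC32 is used purely for illustration; any
reasonably uniform hash of the chosen key would do):

import zlib

P = 4   # number of processors

def owner(key):
    # Map a data element to a processor by hashing its key.
    return zlib.crc32(key.encode("utf-8")) % P

posts = ["great product!", "terrible service", "it was okay"]
for post in posts:
    print(post, "-> processor", owner(post))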
Task Characteristics:
● Task Generation:
○ Static: Tasks are pre-defined and known in advance, allowing for efficient
allocation and scheduling. (e.g., matrix multiplication with fixed matrix sizes)
○ Dynamic: Tasks are generated during runtime based on the program's execution
or data encountered. (e.g., search algorithms exploring a dynamic search space)
● Task Granularity:
○ Fine-grained: Smaller, more focused tasks offer more parallelism but can
introduce overhead in managing them. (e.g., individual arithmetic operations)
○ Coarse-grained: Larger, more complex tasks reduce management overhead but
might limit parallelism if tasks lack internal independence. (e.g., processing an
entire image file)
● Data Association:
○ Independent: Tasks operate on independent data sets, allowing for true parallel
execution without communication.
○ Shared Data Access: Tasks might need to access or modify shared data,
requiring synchronization mechanisms to avoid conflicts.
Interaction Characteristics:
Q.8) Differentiate between static and dynamic mapping techniques for load balancing.
Key drawbacks compared: static mapping can lead to load imbalance if task characteristics or
the workload change at runtime, while dynamic mapping may introduce additional complexity
and overhead in managing task assignments during execution.
Load balancing is crucial for ensuring efficient utilization of processing resources in parallel
computing. Mapping techniques define how tasks are assigned to processors to achieve this
balance. Here's a breakdown of some common mapping techniques:
Static Mapping:
● Assigns tasks to processors before execution begins, based on prior knowledge of task
sizes and data (for example, block or cyclic distributions of data or loop iterations). It has
very low runtime overhead but cannot adapt if the actual workload turns out to be uneven.
Dynamic Mapping:
● Assigns tasks to processors during program execution. This approach is more flexible
and adapts to changing workloads or unforeseen variations in task characteristics.
However, it can introduce additional overhead for runtime task analysis and assignment.
● Work Stealing: Idle processors (or threads) can "steal" work from overloaded
processors, promoting better load balancing.
● Task Queues: Tasks are placed in a central queue, and any available processor picks
up the next task for execution. This approach simplifies load balancing but can introduce
overhead for managing the central queue.
● Adaptive Load Balancing: Monitors system performance and dynamically adjusts task
assignments based on processor load and communication patterns. Requires
sophisticated algorithms to analyze runtime behavior.
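A minimal Python sketch of the task-queue style of dynamic mapping, in which idle worker
threads repeatedly pull the next task from a central queue (the task contents and the number of
workers are illustrative):

import queue
import threading

tasks = queue.Queue()
for t in range(20):
    tasks.put(t)                        # central pool of pending tasks

def worker():
    while True:
        try:
            task = tasks.get_nowait()   # any idle worker grabs the next available task
        except queue.Empty:
            return                      # no work left
        # ... process `task` here ...
        tasks.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()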
Techniques for Reducing Interaction Overheads:
1. Maximizing Data Locality:
● Concept: This strategy aims to keep the data a task needs close to the processor that
will use it. This reduces the need for remote data access and communication across the
network.
● Techniques:
○ Data partitioning and distribution: Distribute data strategically across
processor memory based on access patterns.
○ Loop transformations: Reorganize loops to improve data reuse and minimize
redundant data access within a processor's cache.
2. Minimizing Volume of Data Exchange:
● Concept: Focuses on reducing the amount of data that needs to be exchanged between
processors during communication.
● Techniques:
○ Sending only relevant data: Identify and transmit only the specific data
elements required by a task, avoiding unnecessary data transfer.
○ Data compression: Compress data before transmission to reduce the
communication overhead, especially for large datasets.
3. Minimizing Frequency of Interactions:
● Concept: Aims to reduce the number of times processors need to communicate with
each other.
● Techniques:
○ Bulk data transfers: Aggregate multiple data elements into a single message for
transmission, reducing the overhead of individual message exchanges.
○ Asynchronous communication: Allow processors to continue working while
communication happens in the background, improving overall utilization.
4. Minimizing Contention and Hot Spots:
● Concept: Addresses situations where multiple processors compete for access to shared
resources (e.g., memory, communication channels). Contention creates bottlenecks and
slows down communication.
● Techniques:
○ Locking strategies: Use efficient locking mechanisms to prevent data
inconsistencies during concurrent access but minimize the time processors
spend waiting to acquire locks.
○ Load balancing: Ensure tasks are evenly distributed across processors to
prevent overloading specific resources and creating communication bottlenecks.
5. Overlapping Computation with Communication:
● Concept: While communication occurs, try to keep processors busy with other
independent computations that don't require data from other processors.
● Techniques:
○ Pipelining: Break down tasks into smaller stages and execute them concurrently
on different processors, overlapping communication with computation steps.
○ Multithreading: Utilize multiple threads within a processor. While one thread
communicates, another can perform independent computations.
3. Communication Patterns:
● Explicit Communication:
○ Processors explicitly exchange messages or signals to coordinate task
assignment.
○ Examples: Work stealing with message passing, centralized queue updates.
● Implicit Communication:
○ Processors infer task availability and workload based on actions or events
without explicit messages.
○ Can reduce communication overhead but might require more complex algorithms
for coordination.
○ Examples: Observing processor load through hardware counters or shared
memory access patterns.
4. Frequency of Re-mapping:
● Refers to how often tasks are reassigned to processors: frequent re-mapping adapts
quickly to changing workloads but adds scheduling and communication overhead, while
infrequent re-mapping is cheaper but may leave load imbalances uncorrected for longer.
Here's a breakdown of some common parallel algorithm models with illustrative examples:
1. Data-Parallel Model:
● Concept: This model focuses on applying the same operation concurrently to different
data items. Tasks are typically independent, requiring minimal communication or
synchronization.
● Example: Consider performing a mathematical operation (e.g., addition, multiplication)
on all elements of a large array. Each processor can be assigned a portion of the array
and independently perform the operation on its assigned elements.
2. Task-Parallel Model:
● Concept: This model breaks down a problem into smaller, independent tasks that can
be executed concurrently. Tasks might have different functionalities but don't require
frequent communication or data sharing.
● Example: Imagine processing a large collection of images and applying filters (e.g.,
resize, grayscale conversion) to each image. Each processor can be assigned a
separate image and independently apply the desired filter.
3. Work Pool Model:
● Concept: This model uses a central pool of tasks that workers (processors or threads)
can access and execute dynamically. Tasks are typically independent and don't require
specific ordering.
● Example: Consider processing a queue of customer orders in an e-commerce system.
Each order represents a task, and worker threads can pick up tasks from the central
queue, process them (e.g., validate payment, prepare shipment), and update the order
status.
4. Master-Slave Model:
● Concept: This model involves a master process that coordinates the execution of tasks
on slave processes (workers). The master distributes tasks, manages communication,
and collects results.
● Example: Imagine performing a scientific simulation that requires calculations across
different spatial or temporal segments. The master process can divide the simulation
domain into smaller sub-domains, distribute them as tasks to slave processes, and
collect the partial results to assemble the final simulation outcome.
5. Pipeline (Producer-Consumer) Model:
● Concept: This model involves a producer that generates data, a consumer that
processes the data, and potentially intermediate stages (filters, transformers) that
perform specific operations on the data stream. Stages operate concurrently, with one
stage producing data for the next stage in the pipeline.
● Example: Consider processing a video stream. A producer can continuously read video
frames, a filter stage might convert them to a different format, and finally, a consumer
could display the processed frames on the screen. Each stage operates concurrently,
forming a processing pipeline.
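A minimal Python sketch of the pipeline idea using generators (the stage names and the
"frames" are placeholders; a real system would read, convert, and display actual video frames):

def producer():
    for frame_id in range(5):            # stand-in for reading video frames
        yield f"frame-{frame_id}"

def grayscale_filter(frames):
    for frame in frames:                 # stand-in for a format/colour conversion stage
        yield frame + " (grayscale)"

def consumer(frames):
    for frame in frames:
        print("displaying", frame)       # stand-in for rendering to the screen

consumer(grayscale_filter(producer()))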
6. Hybrid Models:
● Concept: Many real-world parallel algorithms combine aspects of different models. This
allows for efficient execution by leveraging the strengths of each model for specific
sub-problems within the larger application.
● Example: A scientific computing application might use a master-slave model to
distribute large calculations across processors, while each slave process utilizes a
data-parallel model to perform computations on smaller data chunks within its assigned
task.
Q.13) Draw the task-dependency graph for finding the minimum number in the sequence
{4, 9, 1, 7, 8, 11, 2, 12} where each node in the tree represents the task of finding the
minimum of a pair of numbers. Compare this with serial version of finding the minimum
number from an array.
Task-dependency graphs are a way to visualize the dependencies between tasks in a parallel
computation. In this case, we can represent finding the minimum number in a sequence as a
series of pairwise comparisons.
Here's the task-dependency graph for finding the minimum number in the sequence
{4, 9, 1, 7, 8, 11, 2, 12} (each node computes the minimum of its two inputs; nodes on the same
level are independent of each other):
 4   9     1   7     8   11     2   12
  \ /       \ /       \ /        \ /
 min=4     min=1     min=8      min=2
    \        /           \        /
     \      /             \      /
      min=1                 min=2
           \                /
            \              /
             min=1 (final result)
Now, let's compare this with the serial version of finding the minimum number from an array. In
the serial version, we iterate through the array once, keeping track of the minimum number
encountered so far. Here's how it would look:
Step 1: min = 4
Step 2: min = 4
Step 3: min = 1
Step 4: min = 1
Step 5: min = 1
Step 6: min = 1
Step 7: min = 1
Step 8: min = 1
In the serial version, we perform a single linear scan through the array, comparing each element
with the running minimum found so far. Each step depends on the result of the previous step, so
the 7 comparisons form one sequential chain.
Comparing the two approaches: in the parallel (tree) version, the comparisons within each level
are independent and can be performed simultaneously; only successive levels depend on each
other, so the minimum is found in ⌈log₂ 8⌉ = 3 parallel steps. The serial version needs
n − 1 = 7 comparisons executed one after another in a single pass over the array.
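A small Python sketch contrasting the two approaches (sequential code; in the tree version the
comparisons within each level are the ones that could run in parallel):

def tree_min(values):
    # Pairwise reduction: each level's comparisons are mutually independent,
    # so a parallel machine could perform a whole level in one step.
    level = list(values)
    while len(level) > 1:
        level = [min(level[i:i + 2]) for i in range(0, len(level), 2)]
    return level[0]

def serial_min(values):
    current = values[0]
    for v in values[1:]:                 # one comparison per element, each dependent on the last
        current = min(current, v)
    return current

data = [4, 9, 1, 7, 8, 11, 2, 12]
print(tree_min(data), serial_min(data))  # 1 1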
Q.14) Give the characteristics of GPUs and various applications of GPU processing.
Characteristics of GPUs:
● Highly Parallel Architecture: GPUs are designed for massive parallelism, containing
thousands of cores compared to a CPU's limited number of cores. This allows them to
efficiently handle tasks involving a large number of independent calculations.
● Focus on Memory Bandwidth: GPUs prioritize high memory bandwidth to move data
quickly between cores and memory. This is crucial for processing large datasets that
don't fit entirely in the processor cache.
● Specialized Instruction Sets: GPUs have instruction sets optimized for specific tasks
like graphics processing and manipulating large data vectors. While less versatile than
CPUs, they excel at these specialized operations.
● Limited Control Flow: GPUs are less efficient at handling complex branching and
control flow logic compared to CPUs. They are better suited for problems with
predictable execution patterns.
● Lower Clock Speeds: Individual GPU cores typically have lower clock speeds than
CPU cores. However, the sheer number of cores often compensates for this in terms of
overall processing power for suitable workloads.