
Programming Massively Parallel Processors

Wen-mei Hwu, David Kirk, Izzat El Hajj

A Hands-on Approach

WEEK 1: Introduction

Dr. Rachad Atat

Copyright © 2022 Elsevier


The Evolution of Computing Power

• Early Demands for Speed:


• Applications like weather forecasting, engineering simulations, and airline
reservations required more speed and memory for better performance.
• New technologies, such as deep learning, pushed computing limits further,
driving rapid advancements over the past 50 years.
• The Single CPU Era (1980s-1990s):
• Single CPU computers were standard, with increasing speed and power allowing
applications to improve and offer more features.
• By 2003, further increases in clock speed caused excessive heat and energy
consumption, driving a shift toward multi-core processors.



Processor Trends

[Chart: transistor count per chip (thousands), log scale, 1970-2020.]
Source: M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten (1970-2010); K. Rupp (2010-2017).

Moore’s “Law” predicted that the number of transistors per unit area would double every 18-24 months.
Processor Trends

[Chart: transistor count (thousands) and frequency (MHz), log scale, 1970-2020. Same source as above.]

Processor frequency (clock rate) followed the same trend, because smaller transistors can be switched faster… until around 2005.
Processor Trends

[Chart: transistor count (thousands), frequency (MHz), and typical power (Watts), log scale, 1970-2020. Same source as above.]

Around 2005, frequency stopped increasing due to the power wall
(increasing frequency further would make the chip too hot to cool feasibly).
Multiple Choice Question

https://strawpoll.live/
Pin Code: 693299

Which of the following statements best describes Moore's Law?

A) The number of transistors on a microchip doubles approximately every two years, while the cost of computers remains the same.
B) The processing power of a microchip doubles every year, leading to a proportional increase in energy consumption.
C) The number of microchips in a computer doubles every two years, significantly increasing the system's overall speed.
D) The speed of a microprocessor doubles every two years, while the size of the chip remains constant.


The Shift to Multi-Core Processors and Parallel Programming

• Introduction of Multi-Core Processors:


• Multi-core processors became necessary to maintain performance improvements
without the drawbacks of single CPUs.
• Software needed to adapt by running different parts of a program on multiple
cores simultaneously.
• Parallel Programming:
• Traditionally, software ran sequentially, one step at a time, which worked well for
single CPUs.
• With modern multi-core processors, software must be written for parallel
execution to achieve expected speed improvements.
• Parallel programming is now essential, not just for high-performance computing
but for many modern applications.



Dual Paths in Microprocessor Design Since 2003

• Path 1: Multicore Processors


• Designed to run traditional, sequential programs faster.
• Started with just 2 cores, now up to 288 cores in recent Intel server processors.
• Example: the ARM-based Ampere server processor with 128 cores.
• Each core supports multiple threads for enhanced performance.
• Path 2: Many-Thread Processors
• Focused on handling large numbers of tasks simultaneously.
• Example: NVIDIA Tesla A100 GPU, handling tens of thousands of threads.
• Leads in performance for computation-heavy tasks, such as floating-point operations.
• A100 GPU: Up to 312 trillion operations per second at 16-bit precision.



The Growing Performance Gap

• Performance Difference:
• The gap between multicore CPUs and many-thread GPUs has significantly
widened.
• Developers increasingly shift heavy computational tasks to GPUs for better
performance.
• Impact on Applications:
• Power of parallel processing enables the creation of groundbreaking
applications, such as deep learning.
• Parallel programming is ideal for tasks that can be broken down and executed
across many threads efficiently.



Multiple Choice Question

https://strawpoll.live/
PIN Code: 442259

Which of the following best describes the function of a multicore processor?
• A) It executes a single instruction at a time.
• B) It allows a single processor to handle multiple tasks concurrently.
• C) It reduces power consumption by decreasing the clock speed.
• D) It increases memory capacity by integrating more RAM.



Different Design Philosophies of CPUs and GPUs

• CPU Design: Latency-Oriented Approach


• Focus: Minimize the time to complete each task.
• Key Features:
• Low-Latency Arithmetic Units: Fast calculations.
• Sophisticated Operand Delivery Logic: Quick data delivery.
• Large On-Chip Caches: Store frequently used data for quick access.
• Branch Prediction Logic: Predicts next steps to reduce delays.
• Outcome: Fast execution of sequential tasks but uses more chip area and
power.



GPU Design and Throughput-Oriented Approach

• GPU Design: Throughput-Oriented Approach


• Focus: Handle many tasks simultaneously.
• Key Features:
• Floating-Point Calculations: Essential for rendering graphics.
• High Memory Access Throughput: Operates at about 10x the memory
bandwidth of CPUs.
• Simpler Arithmetic Units & More Memory Channels: Allows more processing
power within the same chip area and power budget.
• Outcome: High performance in tasks that can be done in parallel.

• Summary:
• CPUs: Excel at making individual tasks fast, ideal for sequential tasks.
• GPUs: Excel at handling many tasks at once, ideal for parallel tasks like rendering
graphics.
• Conclusion: GPUs are much faster at tasks that benefit from parallel processing,
which explains their higher peak performance.



Design Approaches

Latency-Oriented Design: Minimize the time it takes to perform a single task.

Throughput-Oriented Design: Maximize the number of tasks that can be performed in a given time frame.



Understanding GPU and CPU Roles in Computing
•GPU Design: Throughput-Oriented
•Designed to run many tasks (threads) in parallel.
•Efficient task-switching when some tasks are waiting.
•Small caches consolidate requests from many threads, so accesses to the same data do not all go to the slower main memory.
•Purpose: Ideal for applications with lots of parallel work.

•CPU Design: Latency-Oriented


•Handles tasks sequentially, focusing on completing each task quickly.
•Purpose: Best for tasks that require quick, one-after-the-other completion.

• Hybrid Computing:
• CPUs: Handle sequential parts of a program.
• GPUs: Manage heavy, parallel workloads.
• Example: CUDA by NVIDIA (introduced in 2007) allows CPUs and GPUs to work
together (see the sketch after this list).
• Factors for Processor Choice:
• Installed Base: Over 1 billion CUDA-enabled GPUs in use, making GPUs an attractive
option for developers.
• Practicality: GPUs have enabled powerful computing in compact devices such as MRI
machines, making high-performance computing more accessible.
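
To make the hybrid model concrete, below is a minimal CUDA vector-addition sketch (illustrative, not from the slides; the kernel name vecAdd and the sizes are assumptions). The CPU (host) sets up the data and launches the kernel; the GPU (device) executes the data-parallel loop body across many threads.

#include <cuda_runtime.h>
#include <cstdio>

// Device code: each GPU thread adds one pair of elements in parallel.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // guard: the grid may have extra threads
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified memory keeps the sketch short; explicit cudaMemcpy also works.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }  // host: sequential setup

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // enough blocks to cover n
    vecAdd<<<blocks, threadsPerBlock>>>(a, b, c, n);           // device: parallel work
    cudaDeviceSynchronize();                                   // host waits for the GPU

    printf("c[0] = %.1f\n", c[0]);   // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}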



Processor Design Approaches

[Diagram: a CPU die with a few large ALUs, a large control unit, and a large cache, versus a GPU die packed with many small ALUs.]

CPU: Latency-Oriented Design
• A few powerful ALUs: reduced operation latency.
• Large caches: convert long-latency memory accesses into short-latency cache accesses.
• Sophisticated control: branch prediction to reduce control hazards; data forwarding to reduce data hazards.
• Modest multithreading to hide short latencies.
• High clock frequency.

GPU: Throughput-Oriented Design
• Many small ALUs: long latency but high throughput; heavily pipelined for further throughput.
• Small caches: more chip area dedicated to computation.
• Simple control: more chip area dedicated to computation.
• Massive number of threads to hide the very high latency!
• Moderate clock frequency.



Multiple Choice Question

https://strawpoll.live/
PIN Code: 871721

Which of the following is a major challenge when designing software for multicore processors?

• A) Ensuring the software can run on older single-core processors.
• B) Dividing tasks into parallel threads without causing data dependency issues.
• C) Increasing the clock speed of each core.
• D) Reducing the physical size of the processor.



Evolution of GPU Computing

•Pre-2006 Challenges:
•GPGPU: General-purpose computing on GPUs required complex tricks using graphics APIs such as OpenGL or Direct3D.
•Limited and not widely adopted despite innovative research.

•Post-2007 Breakthrough:

•CUDA Introduction:
•Simplified GPU programming with new hardware and software features.
•Allowed developers to use familiar programming languages like C/C++.
•Opened up a wide range of applications for GPUs.

•Beyond GPUs:
•Other accelerators like FPGAs are also used for specific tasks.
•Techniques discussed for GPUs can apply to other accelerators.



Growth of GPU Computing

[Charts omitted. Sources: Jensen Huang, GTC’15 Keynote; Jensen Huang, GTC’19 Keynote.]


GPU Market Sector Breakdown

Revenue from Different Market Sectors (Source: NVIDIA, Q2 FY19):
• Gaming: 58%
• Datacenter: 24%
• Professional Visualization: 9%
• Automotive: 5%
• Other: 4%

[A second chart breaks down the Datacenter sector. Source: NVIDIA, 2018.]



Importance of Massively Parallel Programming
•Purpose: Ensures applications can keep getting faster with new hardware.
•GPU Acceleration: Over 100x faster than a single CPU core.
•Data Parallelism: Possible 1,000x speedup with minimal effort.

•Why Faster Applications Matter:

•Future Applications: Tasks considered supercomputing today will be common tomorrow.


•Examples:
•Biology: Deeper molecular-level studies via computer simulations.
•Video/Audio Processing: Advancements like 3D imaging and video enhancement.

•User Interfaces:
•High-resolution touch screens evolving to 3D displays, VR, and advanced controls (e.g., voice, computer vision).

• Gaming and Beyond:


• Gaming: Realistic simulations (e.g., car damage) enhance immersion.
• Digital Twins: digital counterparts of real-world objects, requiring significant computing power.



Revolutionizing Applications with Faster Computing

•Deep Learning Breakthrough:


•Past Challenges: Neural networks required too much labeled data and
computing power.
•Key Enablers:
•Internet: Provides vast amounts of labeled data.
•GPUs: Deliver a huge boost in computing speed.

•Impact on Technology:
•Since 2012: Rapid adoption in computer vision and natural language
processing.
•Enabled Technologies: Self-driving cars, home assistant devices.

•Handling Complex Data:


•Parallel Processing: Essential for managing and processing massive amounts of
data efficiently.
•Simplified Approach: Our goal is to make data management techniques accessible
with practical code examples and hands-on exercises using CUDA.



Understanding Speedup in Parallel Computing

•Defining Speedup:
•Formula: Speedup of System A over System B = Time(System B) / Time(System A)
•Example: If System A takes 10 s and System B takes 200 s, then Speedup = 200/10 = 20x.

•Challenges in Achieving High Speedup:


•Memory Access: Memory speed often limits the speedup.
•Optimization: Requires fine-tuning to maximize parallelization and manage memory
efficiently.

•CPU vs. GPU:


•Program "Pit" (CPU): Handles parts of the program that aren't easily parallelizable.
•Program "Flesh" (GPU): Optimized for parallelizable tasks.
•CUDA: Expands the portion of a program that can be efficiently parallelized on GPUs.



Amdahl’s Law
• Parallelization Limits: Speedup depends on the portion of the program that can
be parallelized: Speedup = 1 / ((1 − p) + p/s), where p is the parallelizable
fraction and s is the speedup achieved on the parallel part.
• Example: if 30% is parallelizable (p = 0.3), the maximum speedup (as s → ∞) is
1 / (1 − 0.3) = 1 / 0.7 ≈ 1.43x.
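
A small host-side helper (an illustrative sketch, not from the slides; the function name amdahl is an assumption) makes the formula easy to experiment with:

#include <cstdio>

// Amdahl's Law: overall speedup for parallelizable fraction p
// when the parallel part is sped up by a factor s.
double amdahl(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}

int main() {
    printf("p=0.30, s=100  : %.3fx\n", amdahl(0.30, 100.0));  // ~1.42x
    printf("p=0.30, s->inf : %.3fx\n", amdahl(0.30, 1e12));   // ~1.43x
    printf("p=0.90, s=1000 : %.3fx\n", amdahl(0.90, 1000.0)); // ~9.91x
    return 0;
}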



What is the reduction in total execution time due to parallelization?

• If the new execution time is 70% of the original execution time, the reduction
in total execution time is 100% − 70% = 30%.



What is the reduction in total execution time due to parallelization?

In the previous example, the new execution time was 1/S of the original execution
time. With S = 1/0.7 ≈ 1.43, the new execution time is 1/S ≈ 70% of the original.

This means that the new execution time is about 30% shorter than the original
execution time.
Multiple Choice Question
•Given that 99% of the execution time is spent in the parallel portion, P=0.99, use
Amdahl's Law to find the overall speedup when the parallel portion is sped up by 100x.
https://strawpoll.live/ PIN Code: 216974



Short Answer Question

•Calculate the new execution time if the original execution time is 100 s.



Multiple Choice Question

https://strawpoll.live/
PIN Code: 331799

Which factor does NOT directly affect the speedup gained from parallel
execution on a multicore processor?
• A) The number of cores available.
• B) The parallelizability of the task.
• C) The cache size of each core.
• D) The complexity of the instruction set.



Why is Parallel Programming Hard?

•1. Designing Efficient Parallel Algorithms:


•Splitting tasks ≠ Faster execution.
•Some parallel algorithms can add complexity and extra work.

•2. Memory Issues:


•Memory Latency: Slow memory access can bottleneck performance.
•Optimization is complex but crucial.

•3. Data Variability:


•Uneven data sizes cause unbalanced workloads.
•Effective data management is essential for steady performance.

•4. Synchronization Overhead:


•Communication Delays: Processors often need to wait and sync.
•Strategies needed to minimize waiting times and enhance efficiency.



Parallel Programming Languages and Models

• 1. OpenMP
• What: For shared memory systems.
• How: Uses directives for parallelization; compiler handles execution.
• Advantages: Simplifies parallel coding; portable across systems.
• Limitations: Requires basic parallel programming knowledge; may need tools like
CUDA for more control.
• 2. MPI (Message Passing Interface)
• What: For systems with separate memory (clusters).
• How: Data is manually divided; uses message passing.
• Advantages: Ideal for large clusters; supports tens of thousands of nodes.
• Limitations: Complex to port; often combined with CUDA for GPU systems.



Parallel Programming Languages and Models

• 3. CUDA
• What: For programming NVIDIA GPUs.
• How: Provides fine control over GPU resources.
• Advantages: Simplifies GPU programming; excellent for parallel processing.
• Limitations: For large computing clusters, MPI is still needed.
• 4. OpenCL
• What: A standardized model for various processors.
• How: Uses APIs and language extensions for parallelism.
• Advantages: Broad compatibility across processors.
• Limitations: Performance tuning may be needed for specific processors.
• Summary:
• OpenMP: Simplifies shared memory system coding.
• MPI: Best for large, separate memory clusters.
• CUDA: Detailed control for NVIDIA GPUs.
• OpenCL: Versatile, works across processors.



Parallel Computing Pitfall

• Consider an application where:


• The sequential execution time is t = 100 s
• The fraction of execution that is parallelizable is p = 90%
• The speedup achieved on the parallelizable part is s = 1000×

• What is the overall speedup of the application?


• Calculate the Parallelizable and Sequential Portions:
• Parallelizable Portion: p × t = 0.9×100 s=90 s
• Non-Parallelizable Portion: (1-p) × t =(1−0.9)×100 s=10 s
• Calculate the Execution Time of the Parallelizable Portion After Speedup:
• The parallelizable portion is sped up by a factor of 1000, so its execution
time becomes: 90 s / 1000 = 0.09 s



Parallel Computing Pitfall

• Calculate the Total Execution Time After Parallelization:

• Add the non-parallelizable portion (which remains unchanged) to the
parallelizable portion after the speedup: 10 s + 0.09 s = 10.09 s

• Calculate the Overall Speedup:

• The overall speedup is the ratio of the original execution time to the new
execution time: Speedup = 100 s / 10.09 s ≈ 9.9×
• The pitfall: despite a 1000× speedup on the parallelizable part, the overall
speedup is under 10×, because the unchanged 10 s sequential portion dominates.
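
The same numbers can be checked step by step in code (a minimal sketch using the slide's values; the variable names are illustrative):

#include <cstdio>

int main() {
    // Values from the pitfall example above.
    double t = 100.0;   // original (sequential) execution time, in seconds
    double p = 0.9;     // parallelizable fraction
    double s = 1000.0;  // speedup on the parallelizable part

    double t_seq = (1.0 - p) * t;  // 10 s, unchanged by parallelization
    double t_par = (p * t) / s;    // 90 s / 1000 = 0.09 s
    double t_new = t_seq + t_par;  // 10.09 s
    printf("new time = %.2f s, overall speedup = %.2fx\n",
           t_new, t / t_new);      // ~9.91x
    return 0;
}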



Amdahl’s Law

• In general, if an application has:

• t: sequential execution time
• p: fraction of execution that is parallelizable
• s: speedup achieved on the parallelizable part

• What is the overall speedup of the application?
• The new execution time is (1 − p)·t + (p·t)/s, so the overall speedup is
Speedup = t / ((1 − p)·t + (p·t)/s) = 1 / ((1 − p) + p/s)



Amdahl’s Law Implications

• The maximum speedup of a parallel program is limited by the fraction of


execution that is parallelizable

• e.g., if p is 90%, speedup < 10×

• Fortunately, for many real applications, p > 99% especially for large datasets,
and speedups >100× are attainable



Multiple Choice Question

https://strawpoll.live/
PIN Code: 548421

In the context of parallel computing, which of the following is true about Amdahl's Law?

•A) It predicts the maximum speedup achievable with a given number of processors.
•B) It measures the efficiency of cache usage in a multicore processor.
•C) It determines the heat dissipation requirements for multicore
processors.
•D) It is used to optimize sequential programs for single-core
processors.

