Week 1 Csc447
A Hands-on Approach
WEEK 1 Introduction
[Figure: microprocessor trend data, 1970-2020. Transistor count (thousands), log scale. Source: M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten (1970-2010); K. Rupp (2010-2017).]
[Figure: microprocessor trend data, 1970-2020. Transistor count (thousands) and frequency (MHz), log scale. Source: M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten (1970-2010); K. Rupp (2010-2017).]
[Figure: microprocessor trend data, 1970-2020. Transistor count (thousands), frequency (MHz), and typical power (Watts), log scale. Source: M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, C. Batten (1970-2010); K. Rupp (2010-2017).]
https://strawpoll.live/
Pin Code: 693299
D) The speed of a microprocessor doubles every two years, while the size
of the chip remains constant.
• Performance Difference:
• The gap between multicore CPUs and many-thread GPUs has significantly
widened.
• Developers increasingly shift heavy computational tasks to GPUs for better
performance.
• Impact on Applications:
• The power of parallel processing enables the creation of groundbreaking
applications, such as deep learning.
• Parallel programming is ideal for tasks that can be broken down and executed
across many threads efficiently.
https://strawpoll.live/
PIN Code: 442259
• Summary:
• CPUs: Excel at making individual tasks fast, ideal for sequential tasks.
• GPUs: Excel at handling many tasks at once, ideal for parallel tasks like rendering
graphics.
• Conclusion: GPUs are much faster at tasks that benefit from parallel processing,
which explains their higher peak performance.
• Hybrid Computing:
• CPUs: Handle sequential parts of a program.
• GPUs: Manage heavy, parallel workloads.
• Example: CUDA by NVIDIA (introduced in 2007) allows CPUs and GPUs to work
together.
• Factors for Processor Choice:
• Installed Base: Over 1 billion CUDA-enabled GPUs in use, making GPUs an attractive
option for developers.
• Practicality: GPUs have enabled powerful computing in compact devices like MRIs,
making high-performance computing more accessible.
Cache
https://strawpoll.live/
PIN Code: 871721
•Pre-2006 Challenges:
•GPGPU: General-purpose computing on GPUs required complex tricks using graphics APIs such as OpenGL or Direct3D.
•Limited and not widely adopted despite innovative research.
•Post-2007 Breakthrough:
•CUDA Introduction:
•Simplified GPU programming with new hardware and software features.
•Allowed developers to use familiar programming languages like C/C++.
•Opened up a wide range of applications for GPUs.
•Beyond GPUs:
•Other accelerators like FPGAs are also used for specific tasks.
•Techniques discussed for GPUs can apply to other accelerators.
[Figure: pie chart, breakdown by sector. Visible slices: Other 4%, Automotive 5%, Professional Visualization 9%; remaining share: Datacenter.]
•User Interfaces:
•High-resolution touch screens evolving toward 3D displays, VR, and advanced controls (e.g., voice, computer vision).
•Impact on Technology:
•Since 2012: Rapid adoption in computer vision and natural language
processing.
•Enabled Technologies: Self-driving cars, home assistant devices.
•Defining Speedup:
•Formula: Speedup of A over B = Time(System B) / Time(System A)
•Example: If System A takes 10 s and System B takes 200 s, then Speedup = 200/10 = 20×.
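The speedup definition above can be sketched as a small helper (a minimal sketch; the function name `speedup` is our own):

```python
def speedup(time_b: float, time_a: float) -> float:
    """Speedup of system A over system B: how many times faster A is."""
    return time_b / time_a

# System A runs in 10 s, system B in 200 s.
print(speedup(200.0, 10.0))  # → 20.0
```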
• If the new execution time is 70% of the original execution time, the
reduction in total execution time is 100% − 70% = 30%.
In the previous example, the new execution time was (100/S)% of the original
execution time, about 70.27% of it.
This means that the new execution time is about 29.73% shorter than the original
execution time.
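The relationship between a speedup S and the reduction in execution time can be checked numerically (a small sketch; the function names are our own):

```python
def new_time_fraction(s: float) -> float:
    """Fraction of the original execution time remaining after a speedup of s."""
    return 1.0 / s

def reduction_percent(s: float) -> float:
    """Percent reduction in total execution time given a speedup of s."""
    return (1.0 - new_time_fraction(s)) * 100.0

# New time is 70% of the original, so the speedup is 1/0.7 ≈ 1.43
# and the execution time is reduced by 30%.
print(round(reduction_percent(1.0 / 0.7), 2))  # → 30.0
```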
Copyright © 2022 Elsevier
Multiple Choice Question
•Given that 99% of the execution time is spent in the parallel portion, P=0.99, use
Amdahl's Law to find the overall speedup when the parallel portion is sped up by 100x.
https://strawpoll.live/ PIN Code: 216974
•Calculate the new execution time if the original execution time is 100 s.
https://strawpoll.live/
PIN Code: 331799
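One way to check the arithmetic for both poll questions, assuming the standard form of Amdahl's Law, S = 1 / ((1 − P) + P/s):

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the work is sped up by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)

overall = amdahl_speedup(p=0.99, s=100.0)
print(round(overall, 2))          # → 50.25 (overall speedup)
print(round(100.0 / overall, 2))  # → 1.99 (new time in s, for a 100 s original)
```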
Which factor does NOT directly affect the speedup gained from parallel
execution on a multicore processor?
• A) The number of cores available.
• B) The parallelizability of the task.
• C) The cache size of each core.
• D) The complexity of the instruction set.
• 1. OpenMP
• What: For shared memory systems.
• How: Uses directives for parallelization; compiler handles execution.
• Advantages: Simplifies parallel coding; portable across systems.
• Limitations: Requires basic parallel programming knowledge; may need tools like
CUDA for more control.
• 2. MPI (Message Passing Interface)
• What: For systems with separate memory (clusters).
• How: Data is manually divided; uses message passing.
• Advantages: Ideal for large clusters; supports tens of thousands of nodes.
• Limitations: Complex to port; often combined with CUDA for GPU systems.
• 3. CUDA
• What: For programming NVIDIA GPUs.
• How: Provides fine control over GPU resources.
• Advantages: Simplifies GPU programming; excellent for parallel processing.
• Limitations: For large computing clusters, MPI is still needed.
• 4. OpenCL
• What: A standardized model for various processors.
• How: Uses APIs and language extensions for parallelism.
• Advantages: Broad compatibility across processors.
• Limitations: Performance tuning may be needed for specific processors.
• Summary:
• OpenMP: Simplifies shared memory system coding.
• MPI: Best for large, separate memory clusters.
• CUDA: Detailed control for NVIDIA GPUs.
• OpenCL: Versatile, works across processors.
• Fortunately, for many real applications p > 99%, especially for large datasets,
and speedups of more than 100× are attainable.
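How sensitive the attainable speedup is to p can be illustrated with a quick sweep (our own sketch, again assuming Amdahl's Law S = 1 / ((1 − p) + p/s), here with the parallel portion sped up by s = 100):

```python
def amdahl(p: float, s: float = 100.0) -> float:
    """Overall speedup when a fraction p of the work is sped up by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# Even a small serial fraction caps the overall speedup.
for p in (0.90, 0.99, 0.999):
    print(f"p = {p:.3f}: overall speedup = {amdahl(p):.1f}x")
# prints roughly 9.2x, 50.3x, 91.0x
```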
https://strawpoll.live/
PIN Code: 548421