CUDA - Introduction
CUDA − Compute Unified Device Architecture − is an extension of the C programming language and an API model for parallel computing created by Nvidia. Programs written using CUDA harness the power of the GPU, thus increasing computing performance.
Gordon Moore of Intel famously observed that the number of transistors on a chip doubles roughly every two years − an observation now known as Moore's law. For decades this translated into ever-rising clock frequencies, but that trend has ended. Now that the clock frequency of a single core has reached its saturation point (you will not find a single-core CPU running at, say, 5 GHz, even a couple of years from now), the paradigm has shifted to multi-core and many-
core processors.
In this chapter, we will study how parallelism is achieved in CPUs. This chapter is an essential foundation for studying GPUs, as it helps in understanding the key differences between GPUs and CPUs.
Following are the five essential steps required for an instruction to finish −
Instruction fetch (IF)
Instruction decode (ID)
Instruction execute (Ex)
Memory access (Mem)
Register write-back (WB)
There are multiple ways to achieve parallelism in the CPU. First, we will discuss ILP (Instruction Level Parallelism), also known as pipelining.
Pipelining
A CPU pipeline is a series of stages through which a CPU can handle multiple instructions in parallel per clock. A basic CPU with no ILP executes the above five steps sequentially − in fact, every CPU does: it first fetches the instruction, decodes it, executes it, then accesses the RAM, and finally writes back to the registers. Thus, it needs at least five CPU cycles to execute an instruction. During this process, there are parts of the chip that sit idle, waiting for the current instruction to finish. This is highly inefficient, and it is exactly what instruction pipelining tries to address. Now, in one clock cycle, many steps of different instructions execute in parallel. Thus the name, Instruction Level Parallelism.
The following timing diagram will help you understand how Instruction Level Parallelism works −
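(This is the standard five-stage pipeline chart, reconstructed here in text form − each row is an instruction, each column a clock cycle; the stage names match the five steps listed above.)

Cycle    1    2    3    4    5    6    7    8    9
I1       IF   ID   Ex   Mem  WB
I2            IF   ID   Ex   Mem  WB
I3                 IF   ID   Ex   Mem  WB
I4                      IF   ID   Ex   Mem  WB
I5                           IF   ID   Ex   Mem  WB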
Using instruction pipelining, the instruction throughput has increased: several instructions are now in flight in every clock cycle. Without ILP, those parts of the chip would have been sitting idle.
In a pipelined chip, the instruction throughput has increased. Initially, one instruction completed only after every 5 cycles. Now, assuming each step takes 1 cycle, we get a completed instruction at the end of each cycle, from the 5th cycle onwards.
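To quantify the gain: with each stage taking 1 cycle, N instructions need 5N cycles on a non-pipelined chip, but only N + 4 cycles on a pipelined one (4 cycles to fill the pipeline, then one completed instruction per cycle). The speedup therefore approaches 5 for large N.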
Note that in a non-pipelined chip, where it is assumed that the next instruction begins only when the
current has finished, there is no data hazard. But since such is not the case with a pipelined chip, hazards
may arise. Consider the situation below −
I1 − ADD 1 to R5
I2 − COPY R5 to R6
Now, in a pipelined processor, I1 starts at t1 and finishes at t5; I2 starts at t2 and finishes at t6. 1 is added to R5 at t5 (in the WB stage), but the second instruction reads the value of R5 in its second step (at time t3). Thus, it fetches the stale value rather than the updated one, and this presents a data hazard.
Modern compilers translate high-level code into low-level code, and schedule instructions so as to take care of such hazards.
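To make the hazard concrete, here is a minimal C-style sketch (the names r5 through r9 mirror the registers above and are purely illustrative): the second statement of the dependent pair reads r5 before the addition has been written back, and one software remedy is to schedule independent work in between −

// A minimal sketch of the read-after-write hazard described above,
// assuming a simple five-stage pipeline with no forwarding.
void hazard_example(void) {
    int r5 = 0, r6, r7, r8 = 2, r9 = 3;

    // Dependent pair: I2 reads r5 before I1's result is written back.
    r5 = r5 + 1;   // I1: result written back in the WB stage (t5)
    r6 = r5;       // I2: reads r5 in its ID stage (t3) - a data hazard

    // One software remedy: the compiler schedules an independent
    // instruction in between, so the register read happens after
    // the write-back has completed.
    r5 = r5 + 1;
    r7 = r8 + r9;  // independent work fills the pipeline bubble
    r6 = r5;       // now reads the updated value of r5
    (void)r6; (void)r7;
}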
Superscalar
ILP can also be achieved with a superscalar architecture. The primary difference between a superscalar and a plain pipelined processor (a superscalar processor is also pipelined) is that the former uses multiple execution units on the same chip to achieve ILP, whereas the latter divides a single execution unit into multiple stages to do that. This means that in a superscalar, several instructions can simultaneously be in the same stage of the execution cycle − something that is not possible in a simple pipelined chip. Superscalar microprocessors can thus execute two or more instructions at the same time, and they typically have at least 2 ALUs.
Superscalar processors can dispatch multiple instructions in the same clock cycle, that is, multiple instructions can be started in the same clock cycle. If you look at the pipelined architecture above, you can observe that at any clock cycle, only one instruction is dispatched. This is not the case with superscalars. But there is still only one program counter: although multiple in-flight instructions are tracked, this is still just one process.
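As a sketch (the function and variable names are illustrative, and whether instructions actually pair up depends on the specific core), the additions in the first function are mutually independent, so a superscalar core with two ALUs can dispatch them in the same cycle; the second function is a dependency chain that must execute one step at a time −

// Independent operations: a superscalar core can issue these together.
int sum_independent(int a, int b, int c, int d) {
    int x = a + b;   // can go to ALU 0
    int y = c + d;   // can go to ALU 1 in the same cycle
    return x + y;
}

// Dependent chain: each step needs the previous result, so these
// instructions cannot be dispatched in the same cycle.
int sum_dependent(int a, int b, int c, int d) {
    int x = a + b;
    x = x + c;       // depends on the previous x
    x = x + d;       // depends on the previous x again
    return x;
}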
Take the Intel i7 for instance. The processor boasts 4 independent cores, each implementing the full x86 ISA. Each core is hyper-threaded, exposing two hardware threads.
Hyper-threading is a proprietary Intel technology by which the operating system sees a single physical core as two virtual cores, increasing the number of hardware instructions in the pipeline (note that not all operating systems support HT, and Intel recommends that in such cases, HT be disabled). So, the Intel i7 has a total of 8 hardware threads.
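A quick way to check this on a POSIX system (a minimal sketch; sysconf with _SC_NPROCESSORS_ONLN reports the logical processors the OS currently sees) −

#include <stdio.h>
#include <unistd.h>

int main(void) {
    // On a hyper-threaded 4-core i7, this typically prints 8.
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("Hardware threads visible to the OS: %ld\n", n);
    return 0;
}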
SMT
HT is just a technology to utilize a processor core better. Much of the time, a processor core is using only a fraction of its resources to execute instructions. What HT does is take a few more CPU registers and execute more instructions on the part of the core that is sitting idle. Thus, one core now appears as two cores. Bear in mind that they are not completely independent: if both the ‘cores’ need to access the same CPU resource, one of them ends up waiting. That is the reason why we cannot replace a dual-core CPU with a hyper-threaded, single-core CPU − a dual-core CPU has truly independent, out-of-order cores, each with its own resources. Also note that HT is Intel’s implementation of SMT (Simultaneous Multithreading); SPARC has a different implementation of SMT, with identical goals.
In the typical illustration of SMT, a pink box represents a single CPU core, and the RAM contains instructions of 4 different programs, indicated by different colors. The CPU implements SMT using a technology similar to hyper-threading, and hence is able to run instructions of two different programs (red and yellow) simultaneously. White boxes represent pipeline stalls.
So, there are multi-core CPUs. One thing to notice is that they are designed to speed up sequential programs. A CPU is very good when it comes to executing a single instruction on a single datum, but not so much when it comes to processing a large chunk of data. A CPU has a larger instruction set than a GPU, a more complex ALU, better branch prediction logic, and more sophisticated caching and pipelining schemes. Its instruction cycles are also a lot faster.
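To preview where this is heading, here is a minimal CUDA sketch of the kind of data-parallel work a GPU excels at − the classic elementwise addition of two large arrays, with one thread per element (the kernel name and sizes are illustrative, not from this text) −

#include <cstdio>
#include <cuda_runtime.h>

// Each thread handles exactly one element: the data-parallel style
// in which a GPU outperforms a CPU on large chunks of data.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;            // one million elements
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Unified (managed) memory keeps the sketch short.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);      // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}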