Unit 1
PARALLEL COMPUTING
Parallel computing is a computing architecture in which multiple processors execute an application or computation simultaneously.
1. Motivation
Development of parallel software has traditionally been thought of as time and effort intensive. This can be largely attributed to the inherent complexity of specifying and coordinating concurrent tasks, and to a lack of portable algorithms, standardized environments, and software development toolkits. If it takes two years to develop a parallel application, during which time the underlying hardware and/or software platform becomes obsolete, the development effort is clearly wasted. However, there are some
unmistakable trends in hardware design, which indicate that uniprocessor (or implicitly parallel)
architectures may not be able to sustain the rate of realizable performance increments in the future.
This is a result of lack of implicit parallelism as well as other bottlenecks such as the datapath and the
memory. At the same time, standardized hardware interfaces have reduced the turnaround time from
the development of a microprocessor to a parallel machine based on the microprocessor.
Moore's Law states that circuit complexity doubles every eighteen months. This empirical relationship
has been amazingly resilient over the years both for microprocessors as well as for DRAMs. By relating
component density and increases in die-size to the computing power of a device, Moore's law has been
extrapolated to state that the amount of computing power available at a given cost doubles
approximately every 18 months.
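Expressed as a formula (an extrapolation of the 18-month doubling, not part of Moore's original statement), the computing power available at a given cost after t years is roughly

    P(t) \approx P_0 \cdot 2^{t/1.5}

so that, for example, over a three-year span the available computing power roughly quadruples.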
➢ Memory/disk Speed Argument: While clock rates of high-end processors have increased at roughly 40% per year over the past decade, DRAM access times have only improved at roughly 10% per year over this interval. Coupled with increases in instructions executed per clock cycle, this gap between processor speed and memory presents a tremendous performance bottleneck. The overall performance of the memory system is determined by the fraction of the total memory requests that can be satisfied from the cache. Parallel platforms typically yield better memory system performance because they provide (i) larger aggregate caches, and (ii) higher aggregate bandwidth to the memory system (both typically linear in the number of processors). A small worked example follows this list.
➢ Data Communication Argument: As the networking infrastructure evolves, the vision of
using the Internet as one large heterogeneous parallel/distributed computing environment has
begun to take shape. Many applications lend themselves naturally to such computing paradigms.
Some of the most impressive applications of massively parallel computing have been in the
context of wide-area distributed platforms.
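As the worked example promised above (the latencies are illustrative, not measurements): if a fraction h of memory requests hit a cache with latency t_c and the remainder go to DRAM with latency t_m, the average access time is

    t_{avg} = h \cdot t_c + (1 - h) \cdot t_m

With t_c = 1 ns, t_m = 100 ns, and h = 0.9, t_avg = 0.9(1) + 0.1(100) = 10.9 ns; raising the hit fraction to 0.98, as a larger aggregate cache may, cuts this to 2.98 ns. This is why larger aggregate caches translate directly into better memory system performance.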
Parallel computing has made a tremendous impact on a variety of areas ranging from computational
simulations for scientific and engineering applications to commercial applications in data mining and
transaction processing. The cost benefits of parallelism coupled with the performance requirements of
applications present compelling arguments in favor of parallel computing. We present a small sample
of the diverse applications of parallel computing.
➢ Applications in Engineering and Design: Parallel computing has traditionally been employed
with great success in the design of airfoils (optimizing lift, drag, stability), internal combustion
engines (optimizing charge distribution, burn), high-speed circuits (layouts for delays and
capacitive and inductive effects), and structures (optimizing structural integrity, design
parameters, cost, etc.), among others. More recently, design of microelectromechanical and
nanoelectromechanical systems (MEMS and NEMS) has attracted significant attention. This
presents formidable challenges for geometric modeling, mathematical modeling, and algorithm
development, all in the context of parallel computers.
➢ Scientific Applications: The past few years have seen a revolution in high performance
scientific computing applications. The sequencing of the human genome by the International
Human Genome Sequencing Consortium and Celera, Inc. has opened exciting new frontiers in
bioinformatics. Functional and structural characterization of genes and proteins holds the
promise of understanding and fundamentally influencing biological processes. Analyzing
biological sequences with a view to developing new drugs and cures for diseases and medical
conditions requires innovative algorithms as well as large-scale computational power.
➢ Applications in Computer Systems: As computer systems become more pervasive and
computation spreads over the network, parallel processing issues become engrained into a
variety of applications. In computer security, intrusion detection is an outstanding challenge. In
the case of network intrusion detection, data is collected at distributed sites and must be
analyzed rapidly for signaling intrusion. The infeasibility of collecting this data at a central
location for analysis requires effective parallel and distributed algorithms. In the area of
cryptography, some of the most spectacular applications of Internet-based parallel computing
have focused on factoring extremely large integers.
Computers can be divided into the following major groups according to Flynn's classification:
➢ Single Instruction, Single Data (SISD): a conventional uniprocessor that executes one instruction stream on one data stream.
➢ Single Instruction, Multiple Data (SIMD): a single instruction stream operates simultaneously on multiple data streams, as in vector and array processors.
➢ Multiple Instruction, Single Data (MISD): multiple instruction streams operate on a single data stream; this class is rarely realized in practice.
➢ Multiple Instruction, Multiple Data (MIMD): multiple autonomous processors execute different instruction streams on different data streams, as in multicore processors and clusters.
5. Multi-Core Processors
A multicore processor is an integrated circuit that has two or more processor cores attached for
enhanced performance and reduced power consumption. These processors also enable more efficient
simultaneous processing of multiple tasks, as with parallel processing and multithreading. A dual-core setup is similar to having two separate processors installed on one computer; however, because the two cores are attached to the same socket, the connection between them is faster. The use of multicore processors is one approach to boosting processor performance without exceeding the practical limits of semiconductor design and fabrication. Using multiple cores also helps keep operation safe in areas such as heat generation.
➢ Multicore processors working concept: The heart of every processor is an execution engine,
also known as a core. The core is designed to process instructions and data according to the
direction of software programs in the computer's memory. Over the years, designers found that
every new processor design had limits. Numerous technologies were developed to accelerate
performance, including the following ones: (ref: https://www.techtarget.com/searchdatacenter/definition/multi-core-processor)
o Clock speed. One approach was to make the processor's clock faster. The clock is the
"drumbeat" used to synchronize the processing of instructions and data through the
processing engine. Clock speeds have accelerated from several megahertz to several
gigahertz (GHz) today. However, transistors use up power with each clock tick. As a
result, clock speeds have nearly reached their limits given current semiconductor
fabrication and heat management techniques. Figure 1 depicts the architecture of a multicore processor.
o Hyper-threading. Another approach involved the handling of multiple instruction
threads. Intel calls this hyper-threading. With hyper-threading, processor cores are
designed to handle two separate instruction threads at the same time. When properly
enabled and supported by both the computer's firmware and operating system (OS),
hyper-threading techniques enable one physical core to function as two logical cores.
Still, the processor only possesses a single physical core. The logical abstraction of the
physical processor added little real performance to the processor other than to help
streamline the behavior of multiple simultaneous applications running on the computer.
o More chips. The next step was to add processor chips -- or dies -- to the processor
package, which is the physical device that plugs into the motherboard. A dual-core
processor includes two separate processor cores. A quad-core processor includes four
separate cores. Today's multicore processors can easily include 12, 24 or even more
processor cores. The multicore approach is almost identical to the use of multiprocessor
motherboards, which have two or four separate processor sockets. The effect is the same.
Today's high processor performance comes from products that combine fast clock speeds with multiple hyper-threaded cores.
However, multicore chips have several issues to consider. First, the addition of more processor cores
doesn't automatically improve computer performance. The OS and applications must direct software
program instructions to recognize and use the multiple cores, in parallel, by directing various threads to different cores within the processor package. Some software applications may need
to be refactored to support and use multicore processor platforms. Otherwise, only the default first
processor core is used, and any additional cores are unused or idle.
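As a minimal illustration of this point (a sketch using POSIX threads; the cap of 16 threads is arbitrary), only software that actually creates multiple threads gives the OS something to schedule onto the extra cores:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg)
{
    /* each thread is independently schedulable on any available core */
    printf("worker %ld running\n", (long)arg);
    return NULL;
}

int main(void)
{
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);  /* cores visible to the OS */
    long n = ncores < 16 ? ncores : 16;
    pthread_t tid[16];

    for (long i = 0; i < n; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (long i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
    return 0;
}

A single-threaded program, by contrast, runs entirely on one core no matter how many are present.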
Second, the performance benefit of additional cores is not a direct multiple. That is, adding a second core does not double the processor's performance, nor does a quad-core processor multiply performance by a factor of four. This happens because of the shared elements of the
processor, such as access to internal memory or caches, external buses and computer system memory.
The benefit of multiple cores can be substantial, but there are practical limits. Still, the acceleration is
typically better than a traditional multiprocessor system because the coupling between cores in the same
package is tighter and there are shorter distances and fewer components between cores.
Consider the analogy of cars on a road. Each car might be a processor, but each car must share the
common roads and traffic limitations. More cars can transport more people and goods in a given time,
but more cars also cause congestion and other problems.
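The notes do not name it, but the standard formalization of this non-linear scaling is Amdahl's law: if a fraction f of a program's work can be spread across n cores while the rest remains serial, the overall speedup is

    S(n) = \frac{1}{(1 - f) + f/n}

For example, with f = 0.9 on a quad-core processor, S(4) = 1/(0.1 + 0.9/4) \approx 3.08, not 4.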
➢ Types of multicore processors: Different multicore processors often have different numbers
of cores. For example, a quad-core processor has four cores. The number of cores is usually a
power of two. (ref: https://insights.sei.cmu.edu/blog/multicore-processing/)
➢ Core types:
o Homogeneous (symmetric) cores. All of the cores in a homogeneous multicore
processor are of the same type; typically, the core processing units are general-purpose
central processing units that run a single multicore operating system.
o Heterogeneous (asymmetric) cores. Heterogeneous multicore processors have a mix of core types that often run different operating systems and include graphics processing units.
➢ Number and level of caches. Multicore processors vary in terms of their instruction and data
caches, which are relatively small and fast pools of local memory.
➢ How cores are interconnected. Multicore processors also vary in terms of their bus
architectures.
➢ Isolation. The amount, typically minimal, of in-chip support for the spatial and temporal
isolation of cores:
o Physical isolation ensures that different cores cannot access the same physical hardware
(e.g., memory locations such as caches and RAM).
o Temporal isolation ensures that the execution of software on one core does not impact
the temporal behavior of software running on another core.
➢ Homogeneous Multicore Processor: Figure 2 notionally shows the architecture of a
system in which 14 software applications are allocated by a single host operating system to the
cores in a homogeneous quad-core processor. In this architecture, there are three levels of cache,
which are progressively larger but slower: L1 (consisting of an instruction cache and a data
cache), L2, and L3. Note that the L1 and L2 caches are local to a single core, whereas L3 is
shared among all four cores.
➢ Pros of Multicore Processing: By using multicore processors, architects can decrease the
number of embedded computers. By allocating applications to different cores, multicore
processing increases the intrinsic support for actual (as opposed to virtual) parallel processing, within individual software applications and across multiple applications. Multicore processing can
increase performance by running multiple applications concurrently. Allocating software to
multiple cores increases reliability and robustness (i.e., fault and failure tolerance) by limiting
fault and/or failure propagation from software on one core to software on another.
➢ Cons of Multicore Processing: Shared Resources. Cores on the same processor share both
processor-internal resources (L3 cache, system bus, memory controller, I/O controllers, and
interconnects) and processor-external resources (main memory, I/O devices, and networks).
These shared resources imply (1) the existence of single points of failure, (2) that two applications running on the same core can interfere with each other, and (3) that software running on one core can impact software running on another core (i.e., interference can violate spatial and temporal isolation, because multicore support for isolation is limited). The diagram below uses red to illustrate six of these shared resources.
Concurrency Defects. Cores execute concurrently, creating the potential for concurrency
defects including deadlock, livelock, starvation, suspension, (data) race conditions, priority
inversion, order violations, and atomicity violations. Note that these are essentially the same
types of concurrency defects that can occur when software is allocated to multiple threads on a
single core.
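A minimal sketch of one such defect, a data race, together with a standard fix (OpenMP is used here for brevity; the counter and iteration count are illustrative):

#include <stdio.h>

/* compile: gcc -fopenmp race.c */
int main(void)
{
    long counter = 0;

    /* RACE: many threads perform an unsynchronized read-modify-write on counter */
    #pragma omp parallel for
    for (long i = 0; i < 1000000; i++)
        counter++;
    printf("racy total:    %ld\n", counter);  /* often less than 1000000 */

    counter = 0;
    /* FIX: reduction gives each thread a private copy, combined safely at the end */
    #pragma omp parallel for reduction(+:counter)
    for (long i = 0; i < 1000000; i++)
        counter++;
    printf("correct total: %ld\n", counter);  /* always 1000000 */

    return 0;
}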
Non-determinism. Multicore processing increases non-determinism. For example, I/O
Interrupts have top-level hardware priority (also a problem with single core processors).
Multicore processing is also subject to lock thrashing, which stems from excessive lock conflicts
due to simultaneous access of kernel services by different cores (resulting in decreased
concurrency and performance). The resulting non-deterministic behavior can be unpredictable,
can cause related faults and failures, and can make testing more difficult (e.g., running the same
test multiple times may not yield the same test result).
Analysis Difficulty. The real concurrency due to multicore processing requires different
memory consistency models than virtual interleaved concurrency. It also breaks traditional
analysis approaches for work on single core processors. The analysis of maximum time limits
is harder and may be overly conservative. Although interference analysis becomes more
complex as the number of cores-per-processor increases, overly-restricting the core number
may not provide adequate performance.
➢ Shared Memory: In a shared-memory system, multiple processors operate independently while they all access the same memory. Any change to a variable stored in the memory is visible to all processors: at any given moment they all see the same picture of the variables stored in memory, and they can directly address and access the same logical memory locations regardless of where the physical memory actually exists. Figure 4 shows a shared memory example. (ref: https://help.rc.ufl.edu/doc/Memory:_Shared_vs_Distributed)
Uniform Memory Access (UMA):
o Most commonly represented today by Symmetric Multiprocessor (SMP) machines
o Identical processors
o Equal access and access times to memory
o Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means if one
processor updates a location in shared memory, all the other processors know about the
update. Cache coherency is accomplished at the hardware level.
Figure 5. NUMA (Non-Uniform Memory Access)
➢ Distributed Memory: In the hardware sense, distributed memory refers to systems in which processors can access another processor's memory only through a network. In the software sense, it means each processor can directly see only its local memory and must communicate over the network to access the memory of the other processors. Figure 6 illustrates the distributed memory architecture.
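A minimal sketch of this style in C may help. The notes do not prescribe a library; MPI is used here purely as an illustration of explicit message passing between processors that cannot address each other's memory:

#include <stdio.h>
#include <mpi.h>

/* run with, e.g.: mpicc msg.c && mpirun -np 2 ./a.out */
int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;  /* exists only in process 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* process 1 cannot read process 0's memory; the value must travel over the network */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}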
OpenMP is a standard parallel programming API for shared memory environments, for programs written in C, C++, or Fortran. It consists of a set of compiler directives with a "lightweight" syntax, library routines, and environment variables that influence run-time behavior. OpenMP is governed by the OpenMP Architecture Review Board (OpenMP ARB) and is defined jointly by several hardware and software vendors.
➢ Use of OpenMP: OpenMP has received considerable attention in the past decade and is
considered by many to be an ideal solution for parallel programming because it has unique
advantages as a mainstream directive-based programming model.
First of all, OpenMP provides a cross-platform, cross-compiler solution. It supports lots of
platforms such as Linux, macOS, and Windows. Mainstream compilers including GCC,
LLVM/Clang, and the Intel Fortran and C/C++ compilers provide good OpenMP support. Also, with the rapid development of OpenMP, many researchers and computer vendors are constantly exploring how to optimize the execution efficiency of OpenMP programs, and they continue to propose improvements to existing compilers or to develop new ones. What's more, OpenMP is a standard specification: all compilers that support it implement the same set of standards, so there are no portability issues.
Secondly, OpenMP makes it convenient and flexible to modify the number of threads, which addresses the scalability problem posed by growing CPU core counts. In the multi-core era, the number of threads needs to change according to the number of CPU cores, and OpenMP has irreplaceable advantages in this regard.
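A minimal sketch of this flexibility (the choice of one thread per core is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* match the thread count to however many cores this machine has */
    omp_set_num_threads(omp_get_num_procs());

    #pragma omp parallel
    {
        #pragma omp single
        printf("running with %d threads\n", omp_get_num_threads());
    }
    return 0;
}

The same program therefore scales from a dual-core laptop to a many-core server without code changes; the thread count can also be set externally via the OMP_NUM_THREADS environment variable.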
Thirdly, creating threads with OpenMP is convenient and relatively easy because it does not require an entry function: code within the same function can be decomposed into multiple threads for execution, and a for loop can be decomposed into multiple threads for execution. Without OpenMP, when a thread is created through the operating system API, the code in a function must be manually split into multiple thread entry functions.
To sum up, OpenMP has irreplaceable advantages in parallel programming. More and more new directives are being added to achieve more functionality, and they play an important role on many different platforms. Figure 7 illustrates the OpenMP solution stack.
➢ Installation of OpenMP
Installation Steps on Linux Systems
Install gcc compiler: sudo apt-get install build-essential
Install OpenMP library: sudo apt-get install libomp-dev
Installation Steps on Windows Systems – Windows not recommended
For a more detailed OpenMP installation guide with sample code, follow the links below:
Option1: https://www.geeksforgeeks.org/openmp-introduction-with-installation-guide/
Option2: https://www.youtube.com/watch?v=5cVU4MKsvqU
A minimal OpenMP "Hello World" program in C (compile with gcc -fopenmp):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* fork a team of threads; each thread prints its own id */
    #pragma omp parallel
    {
        printf("Hello World from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}
Ref: https://passlab.github.io/OpenMPProgrammingBook/openmp_c/2_Syntax.html
Note: For more programming concepts the readers are instructed to go through the following
link. (https://www.openmp.org/wp-content/uploads/Intro_To_OpenMP_Mattson.pdf).
Assignments will be given during the lecture hours.
The following are to be self-studied to set up the programming environment, using the material provided in the links above:
➢ Getting Started with OpenMP:
o Introduction to parallel programming
o Hello world and how threads work
➢ The Core features of OpenMP
o Creating Threads (the Pi program)
o Parallel Loops (making the Pi program simple; a sketch follows this list)
➢ Working with OpenMP
o Synchronize single masters and stuff
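As a concrete starting point for the Pi program mentioned above, here is a sketch in its parallel-loop form (adapted from the approach in the Mattson tutorial linked earlier; the step count is arbitrary):

#include <stdio.h>

/* Approximate pi by midpoint integration of 4/(1+x^2) over [0,1].
   compile: gcc -fopenmp pi.c -o pi */
int main(void)
{
    const long num_steps = 100000000;
    const double step = 1.0 / (double)num_steps;
    double sum = 0.0;

    /* reduction(+:sum) gives each thread a private partial sum, added together at the end */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    printf("pi ~= %.15f\n", step * sum);
    return 0;
}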
8. SIMD: Single instruction, multiple data (SIMD) is a form of parallel execution in which the
same operation is performed on multiple data elements independently in hardware vector
processing units (VPU), also called SIMD units. The addition of two vectors to form a third
vector is a SIMD operation. Many processors have SIMD (vector) units that can simultaneously perform the same operation on 2, 4, 8 or more data elements (by a single SIMD unit).
Loops without loop-carried backward dependency (or with dependency preserved using ordered
simd) are candidates for vectorization by the compiler for execution with SIMD units. In
addition, with state-of-the-art vectorization technology and declare simd construct extensions
for function vectorization in the OpenMP 4.5 specification, loops with function calls can be
vectorized as well. The basic idea is that a scalar function call in a loop can be replaced by a
vector version of the function, and the loop can be vectorized simultaneously by combining a
loop vectorization (simd directive on the loop) and a function vectorization (declare simd
directive on the function).
A simd construct states that SIMD operations be performed on the data within the loop. A
number of clauses are available to provide data-sharing attributes (private, linear, reduction and
lastprivate). Other clauses provide vector length preference/restrictions (simdlen / safelen), loop
fusion (collapse), and data alignment (aligned).
The declare simd directive designates that a vector version of the function should also be
constructed for execution within loops that contain the function and have a simd directive.
Clauses provide argument specifications (linear, uniform, and aligned), a requested vector
length (simdlen), and designate whether the function is always/never called conditionally in a loop (inbranch / notinbranch). The latter is for optimizing performance.
Also, the simd construct has been combined with the worksharing loop constructs (for simd and
do simd) to enable simultaneous thread execution in different SIMD units.
➢ Simd and declare simd:
The following example illustrates the basic use of the simd construct to assure the compiler that
the loop can be vectorized:
Example SIMD.1.c
void star(double *a, double *b, double *c, int n, int *ioff)
{
    int i;
    #pragma omp simd
    for (i = 0; i < n; i++)
        a[i] *= b[i] * c[i + *ioff];
}
Ref: https://www.ibm.com/docs/en/xl-c-and-cpp-linux/16.1.0?topic=pdop-pragma-omp-simd
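For the function-vectorization case described above, a companion sketch in the same style (modeled on the patterns in the OpenMP examples; the function add1 and the uniform argument fact are illustrative):

#pragma omp declare simd uniform(fact)
double add1(double a, double b, double fact)
{
    /* the compiler also emits a vector version of this function */
    return a + b + fact;
}

void work(double *a, double *b, int n)
{
    int i;
    #pragma omp simd
    for (i = 0; i < n; i++)
        a[i] = add1(a[i], b[i], 1.0);  /* the call vectorizes along with the loop */
}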
9. Vector Processing
A vector processor is basically a central processing unit that can execute a complete vector input with a single instruction. More specifically, it is a complete unit of hardware resources that executes a sequential set of similar data items in memory using a single instruction.
Unlike scalar processors, which operate on a single pair of data at a time, a vector processor operates on multiple pairs of data. One can convert scalar code into vector code; this conversion process is known as vectorization. So, vector processing allows operation on multiple data elements with the help of a single instruction. These instructions are called single instruction multiple data (SIMD) or vector instructions. Modern CPUs make use of vector processing because it is more advantageous than scalar processing. Let us now move further to understand how the vector processor functions.
Let us now understand the overall operation performed by the vector computer.
As it has several functional pipes thus it can execute the instructions over the operands. We know that
both data and instructions are present in the memory at the desired memory location. So, the instruction
processing unit i.e., IPU fetches the instruction from the memory.
Once the instruction is fetched, the IPU determines whether the fetched instruction is scalar or vector in nature. If it is scalar, the instruction is transferred to the scalar register and further scalar processing is performed.
When the instruction is vector in nature, it is fed to the vector instruction controller. This controller first decodes the vector instruction and then determines the addresses of the vector operands present in the memory.
Then it gives a signal to the vector access controller about the demand of the respective operand. This
vector access controller then fetches the desired operand from the memory. Once the operand is fetched
then it is provided to the instruction register so that it can be processed at the vector processor.
At times when multiple vector instructions are present, then the vector instruction controller provides
the multiple vector instructions to the task system. And in case the task system shows that the vector
task is very long then the processor divides the task into subvectors.
These subvectors are fed to the vector processor that makes use of several pipelines in order to execute
the instruction over the operand fetched from the memory at the same time.
The various vector instructions are scheduled by the vector instruction controller.
A vector is defined as an ordered, one-dimensional array of data items. A vector V of length n can be represented as a row vector by V = [V1 V2 V3 · · · Vn]. If the data items are listed in a column, it may be represented as a column vector. For a processor with multiple ALUs, it is possible to operate on multiple data elements in parallel using a single instruction. Such instructions are called single-instruction multiple-data (SIMD) instructions; they are also called vector instructions.
A vector add instruction, of a form such as Add.V Vi, Vj, Vk, computes L sums using the elements in vector registers Vj and Vk, and places the resulting sums in vector register Vi. Similar instructions are used to perform other arithmetic operations.
Vector load/store instructions (of a form such as Load.V and Store.V) transfer multiple data elements between a vector register and the memory. A computer capable of vector processing eliminates the overhead associated with the time it takes to fetch and execute the instructions in a program loop. It allows operations to be specified with a single vector instruction of a form such as C(1:n) = A(1:n) + B(1:n). The vector instruction includes the initial address of the operands, the length of the vectors, and the operation to be performed, all in one composite instruction.
This is essentially a three-address instruction, with three fields specifying the base addresses of the operands and an additional field that gives the length of the data items in the vectors. This assumes that the vector operands reside in memory. It is also possible to design the processor with a large number of registers and store all operands in registers prior to the addition operation; in that case, the base address and length in the vector instruction specify a group of CPU registers. In a source program written in a high-level language, loops that operate on arrays of integers or floating-point numbers are vectorizable if the operations performed in each pass are independent of the other passes.
A vectorizing compiler can recognize such loops and generate vector instructions if they are not too complex. Using vector instructions reduces the number of instructions that need to be executed and enables the operations to be performed in parallel on multiple ALUs. (ref: https://www.codingninjas.com/studio/library/vector-processing)
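A short sketch of the distinction (illustrative functions; whether a given compiler actually vectorizes them also depends on optimization flags such as -O3):

/* vectorizable: every pass is independent of the others */
void vec_ok(float *a, const float *b, const float *c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* not vectorizable as written: each pass reads the previous pass's result */
void vec_not(float *a, int n)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + 1.0f;
}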
The classification of vector processors relies on how vectors are formed as well as on how vector instructions access their operands. On these criteria, vector processing is classified as follows:
➢ Register-to-register architecture: vector operands and results are held in vector registers, and vector instructions operate directly on those registers.
➢ Memory-to-memory architecture: vector operands and results are fetched from, and stored directly back to, main memory.
o So, from the above discussion, we can conclude that register-to-register architecture is better than memory-to-memory architecture because it offers a reduction in vector access time. (ref: https://electronicsdesk.com/vector-processor.html)
GPU stands for graphics processing unit. GPUs were originally designed to accelerate computer graphics workloads, particularly 3D graphics. While they are still used for graphics and video rendering, GPU parallel computing is now applied in a much wider range of applications. GPU parallel computing is the ability to perform several tasks at once: it enables GPUs to break complex problems into thousands or millions of separate tasks and work them out all at once, instead of one-by-one as a CPU must. (ref: https://people.duke.edu/~ccc14/sta-663/CUDAPython.html)
This parallel computing ability is what makes GPUs so valuable, and so flexible across such a wide range of applications.
https://www.youtube.com/watch?v=-P28LKWTzrI&t=93s
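Although the linked material demonstrates GPU programming with CUDA Python, the same data-parallel style can be sketched in C using OpenMP's target-offload directives, tying back to the OpenMP section above (a sketch only; it assumes a compiler and runtime built with GPU offload support, and it falls back to the host otherwise):

#include <stdio.h>

#define N 1000000

/* compile, e.g.: gcc -fopenmp vecadd.c (GPU execution requires offload support) */
static float a[N], b[N], c[N];

int main(void)
{
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    /* copy a and b to the device, compute c across many GPU threads, copy c back */
    #pragma omp target teams distribute parallel for map(to: a, b) map(from: c)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[0] = %f\n", c[0]);  /* expect 3.000000 */
    return 0;
}

This mirrors the idea described above: each loop iteration is an independent small task that the GPU can schedule in parallel.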