GPU Computing
SUGGESTED ACTIVITIES:
1. Debugging Lab
2. Performance Lab
3. Launching Nsight
4. Running Performance Analysis
5. Understanding Metrics
6. NVIDIA Visual Profiler
7. Matrix Transpose Optimization
8. Reduction Optimization
A single GPU device consists of multiple Processor Clusters (PCs) that contain
multiple Streaming Multiprocessors (SMs). Each SM accommodates a level-1
instruction cache shared by its associated cores. Typically, each SM uses a
dedicated level-1 cache and a shared level-2 cache before pulling data from
global GDDR5 (or GDDR6 in newer GPU models) memory. This architecture is
designed to tolerate memory latency.
Compared to a CPU, a GPU works with fewer, and relatively small, memory
cache layers. The reason is that a GPU dedicates more transistors to
computation, so it is less sensitive to how long it takes to retrieve data from
memory. The potential memory access ‘latency’ is masked as long as the GPU
has enough computations at hand to keep it busy.
CUDA:
CUDA stands for Compute Unified Device Architecture. It is a parallel
computing platform and API (Application Programming Interface) model
developed by NVIDIA that lets programs use the Graphics Processing Unit
(GPU). It allows computations to be performed in parallel while providing
substantial speedups.
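For instance, a minimal sketch of this parallel model is shown below; the kernel name addVectors, the array size, and the use of unified (managed) memory are illustrative choices rather than part of any official example:

#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one pair of elements; the grid as a whole covers the array.
__global__ void addVectors(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified (managed) memory keeps the sketch short; explicit
    // cudaMalloc/cudaMemcpy would work equally well.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    addVectors<<<blocks, threads>>>(a, b, c, n);   // parallel launch
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);                   // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}

Each of the roughly one million additions is handled by its own thread, which is the sense in which CUDA performs computations in parallel.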
COMPONENTS:
The CUDA Toolkit includes libraries, debugging and optimization tools, a
compiler, documentation, and a runtime library to deploy your applications. It
has components that support deep learning, linear algebra, signal processing,
and parallel algorithms.
CUDA Architecture:
Thread Organization
Memory Hierarchy
There are several levels of memory on the GPU device, each with distinct
read and write characteristics. Every primitive thread has access to private
‘local’ memory as well as registers. This ‘local’ memory is really a misnomer;
the memory is private to the thread, but is not stored local to the thread’s
registers but rather off-chip in the global GDDR memory available on the
graphics card. Every thread in a thread block also has access to a unified ‘shared
memory’, shared among all threads for the life of that thread block. Finally, all
threads have read/write access to ‘global memory’, which is located off-chip in
the main GDDR memory module; it therefore has the largest capacity but is
the most costly to interact with. There are also read-only ‘constant’ and
‘texture’ memories, located in the same place as the global memory.
The global, constant, and texture memories are optimized for different memory
usage models. Global memory is not cached, though memory transactions may
be ‘coalesced’ to hide the high memory access latency. These coalescing rules
and behaviors depend on the particular device used. The read-only
constant memory resides in the same location as global memory, but this
memory may be cached. On a cache hit, regardless of the number of threads
reading, the access time is that of a register access for each address being read.
The read-only texture memory also resides in the same location as global
memory and is also cached. Texture memory differs from constant memory in
that its caching policy specifically exploits 2D spatial locality. This stems from the
use of ‘textures’ in 3D graphics: the 2D images used to ‘texture’ the surfaces of
3D polygons are read frequently and benefit from caching the texture
spatially.
Each memory segment in the CUDA memory hierarchy has a distinct scope:
registers and local memory are unique to a thread, shared memory is unique to
a block, and global, constant, and texture memories exist across all blocks.
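To make these scopes concrete, the following is a hedged sketch (the kernel name blur3, the constant scaleFactor, and the 256-thread block size are assumptions made for illustration) in which each variable lives in one of the memory spaces described above:

#include <cuda_runtime.h>

// Read-only value placed in constant memory (cached, visible to all blocks).
__constant__ float scaleFactor;

__global__ void blur3(const float *in, float *out, int n) {
    // 'tile' lives in on-chip shared memory: one copy per thread block
    // (sized for a 256-thread block plus a 2-element halo).
    __shared__ float tile[258];

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // index into global memory
    float x = (i < n) ? in[i] : 0.0f;                // 'x' sits in a register

    tile[threadIdx.x + 1] = x;                       // stage data in shared memory
    if (threadIdx.x == 0)
        tile[0] = (i > 0) ? in[i - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[blockDim.x + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;
    __syncthreads();                                 // block-wide barrier

    if (i < n)                                       // result goes back to global memory
        out[i] = scaleFactor *
                 (tile[threadIdx.x] + tile[threadIdx.x + 1] + tile[threadIdx.x + 2]);
}

int main() {
    const int n = 1024;
    float h = 1.0f / 3.0f;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; i++) in[i] = (float)i;
    cudaMemcpyToSymbol(scaleFactor, &h, sizeof(float));  // fill constant memory
    blur3<<<n / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out);
    return 0;
}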
Multiprocessors
CUDA capable GPUs are constructed with the “Tesla” architecture. CUDA
applications may be run on any card which supports this architecture, but each
GPU device may have different specifications, and therefore a slightly different
set of supported features and a different number of available computational
resources. When a kernel is invoked, each thread block executes on a
‘multiprocessor’. This multiprocessor contains the resources to support a certain
number of threads, including a set of scalar processor cores, special function
units, a multithreaded instruction unit, and on-chip shared memory.
UNIT II
Multi GPU:
Model parallelism
Model parallelism is a method you can use when your model parameters are too
large for your memory constraints. Using this method, you split your model
training processes across multiple GPUs and perform each process in parallel
or in series. Model parallelism uses the same dataset for each portion of your
model and requires synchronizing data between the splits.
Data parallelism
Data parallelism is a method that uses duplicates of your model across GPUs.
This method is useful when the batch size used by your model is too large to fit
on a single machine, or when you want to speed up the training process. With
data parallelism, each copy of your model is trained on a subset of your dataset
simultaneously. Once done, the results of the models are combined and training
continues as normal.
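As a rough illustration of data parallelism at the CUDA level (rather than through a framework such as PyTorch or TensorFlow), the sketch below splits an array evenly across all visible GPUs and runs the same kernel on each slice; the kernel name scale and the assumption that the array divides evenly are illustrative choices:

#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount == 0) return 0;

    const int n = 1 << 22;
    const int chunk = n / deviceCount;          // equal slice per GPU

    float *host = new float[n];
    for (int i = 0; i < n; i++) host[i] = 1.0f;

    float **dev = new float*[deviceCount];
    for (int d = 0; d < deviceCount; d++) {
        cudaSetDevice(d);                       // following calls target GPU d
        cudaMalloc(&dev[d], chunk * sizeof(float));
        cudaMemcpyAsync(dev[d], host + d * chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice);
        scale<<<(chunk + 255) / 256, 256>>>(dev[d], chunk, 2.0f);
        cudaMemcpyAsync(host + d * chunk, dev[d], chunk * sizeof(float),
                        cudaMemcpyDeviceToHost);
    }
    for (int d = 0; d < deviceCount; d++) {     // wait for every GPU to finish
        cudaSetDevice(d);
        cudaDeviceSynchronize();
        cudaFree(dev[d]);
    }
    delete[] dev;
    delete[] host;
    return 0;
}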
PyTorch Multi GPU
PyTorch is an open source scientific computing framework based on Python.
You can use it to train machine learning models using tensor computations and
GPUs.
GPU Server
GPU servers are servers that incorporate GPUs in combination with one or more
CPUs. When workloads are assigned to these servers, the CPUs act as a central
management hub for the GPUs, distributing tasks and collecting outputs as
available.
GPU Cluster
GPU clusters are computing clusters with nodes that contain one or more GPUs.
These clusters can be formed from duplicates of the same GPU (homogeneous)
or from different GPUs (heterogeneous). Each node in a cluster is connected via
an interconnect to enable the transmission of data.
Kubernetes with GPUs
Kubernetes is an open source platform you can use to orchestrate and automate
container deployments. This platform offers support for the use of GPUs in
clusters to enable workload acceleration, including for deep learning.
When using GPUs with Kubernetes, you can deploy heterogeneous clusters and
specify your resources, such as memory requirements. You can also monitor
these clusters to ensure reliable performance and optimize GPU
utilization.
UNIT III
Lack of Abstraction
CUDA generally exposes quite a bit about the way that the hardware works to
the programmer. In this regard, it is often compared to assembly language for
parallel programming. Explicit concepts, such as thread arrays, map directly to
the way the hardware groups computational units and local memory.
The memory model requires explicit data movement between the host processor
and the device. CUDA makes explicit use of a hierarchy of memory address
spaces, each of which obeys different sharing rules. This is more low-level
detail than the typical application or scientific programmer has to deal with
when programming in C, C++, or Fortran. It also raises significant concerns
with regard to portability and maintainability of code.
The second challenge is that the programming model for CUDA is one of data
parallelism rather than task parallelism. When divvying up work across the
nodes of a cluster, HPC programmers are used to looking for and exploiting
parallelism at a certain level. The amount of data to assign to each node depends
on a number of factors including computational speed of the node, the available
memory, the bandwidth and latency of the interconnect, and the frequency with
which results need to be shared with other processes.
Since processors are fast and network bandwidth is relatively scarce, the
balance is typically to put quite a bit of data on each compute node and to move
data as infrequently as possible. CUDA invites the programmer to think about
parallelism of a completely different order, encouraging the developer to break
the problem apart into units that are much smaller. This is often many orders of
magnitude more parallelism than was previously expressed in the code and
requires reasoning somewhat differently about the computations themselves.
NVIDIA GPUs are external accelerators attached to the host system via the PCI
bus. Each GPU has its own onboard memory as well as smaller memories
attached to each of its compute elements. While there are now mechanisms to
address regions of host memory from device kernels, such access is slower
than accessing device memory.
CUDA programs therefore use a model in which code running on the host
processor prepares and explicitly dispatches work to the GPU, pauses for the
GPU to complete that work, then reads the resulting data back from the device.
Both the units of code representing computational kernels and the associated
data on which these computational kernels will execute are dispatched to the
device. The data is moved, in whole or part, to the device over the PCI bus for
execution. As results are produced, they need to be moved, just as explicitly,
back from the device to the host computer’s main memory.
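A minimal sketch of this dispatch pattern is shown below; the kernel name square and the problem size are illustrative only:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void square(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];
}

int main() {
    const int n = 1 << 16;
    size_t bytes = n * sizeof(float);

    float *h = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) h[i] = (float)i;

    float *d;
    cudaMalloc(&d, bytes);                               // allocate device memory
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);     // move data over the PCI bus

    square<<<(n + 255) / 256, 256>>>(d, n);              // dispatch work to the GPU
    cudaDeviceSynchronize();                             // pause until the GPU finishes

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);     // read results back explicitly
    printf("h[3] = %f\n", h[3]);                         // expect 9.0

    cudaFree(d);
    free(h);
    return 0;
}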
GPUs are complex in how they run code and manage data. Understanding how
your code runs across cores, streaming multiprocessors (SMs) and how the tens
to thousands of threads are organized into blocks, lanes, and warps can be
confusing.
GPUs are widely used for parallel computations and to accelerate the computation
of neural networks. The most famous interface that allows developers to program
using the GPU is CUDA, created by NVIDIA.
Background
In a classic set-up, an input is processed by the CPU one instruction at a time to
generate an output. But how do we process multiple instructions at the same
time? This is what we will try to understand in this article.
Terminology
Parallelism
Remember that in almost every process there are some instructions which must
be performed sequentially and others that can be computed simultaneously in
parallel.
When you talk about parallelism, you should remember that there are two types
of parallelism: data parallelism, where the same operation is applied to many
data elements at once, and task parallelism, where different, independent tasks
are executed at the same time.
Graphics processing units (GPUs) can perform complex actions in a short period
of time. Their power comes from the quantity of operations executed
simultaneously, as long as those operations remain simple and broadly similar.
The game industry was the launching market for GPUs; NVIDIA later broadened
their reach through the CUDA platform. The notoriety of GPUs has increased
even more now that developers are able to run multiple computing actions using
a few lines of code.
GPU:
• thousands of cores
CPU:
• few cores
In the next articles, we are going to write code that uses parallel programming.
However, we must first know what the structure of CUDA-based code is; there
are a few simple steps to follow.
In such an environment we will call Host Code the code that runs on the CPU
and Device Code the code that runs on the GPU.
There are two types of errors for a CUDA kernel launch: synchronous errors
and asynchronous errors.
A synchronous error happens when the host thread knows the kernel is illegal or
invalid. For example, when the thread block size or grid size is too large, a
synchronous error results immediately after the kernel launch call, and this
error can be captured by CUDA runtime error-capturing API calls, such
as cudaGetLastError, right after the kernel launch call.
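A small sketch of this error-checking pattern follows; the oversized block size of 4096 threads is chosen deliberately to trigger a synchronous launch error, and the kernel name dummy is an assumption made for illustration:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy() {}

int main() {
    // Deliberately exceed the per-block thread limit (1024 on current devices)
    // so the launch itself is invalid.
    dummy<<<1, 4096>>>();

    // Synchronous launch error: reported immediately by cudaGetLastError.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch error: %s\n", cudaGetErrorString(err));

    // Asynchronous errors (raised while a kernel runs) only surface later,
    // e.g. from the next synchronizing call. Here the kernel never ran,
    // so no execution error is expected.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("execution error: %s\n", cudaGetErrorString(err));
    return 0;
}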
A sticky error is not recoverable, meaning subsequent CUDA runtime API calls
will always return the same error. The CUDA context is corrupted and remains
so unless the application host process is terminated. For example, when a kernel
tries to access an invalid memory address during kernel execution, it results in a
sticky error which will be captured and returned by all subsequent CUDA
runtime API calls.
#include <stdio.h>
#include <cuda_runtime.h>

// Checks the result of a CUDA runtime call and reports any failure.
// Example usage:
// handleCudaError(cudaMalloc((void**)&dev_a, size));
int handleCudaError(cudaError_t err) {
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 103;
    }
    return 0;
}
1) Warp-level Synchronization: The fact that each thread now has its own
Program Counter (PC) implies a future possibility of removing this feature.
2) Block-level Synchronization (Synchronization Inside a Single GPU):
Block-level synchronization corresponds to the thread block in the
programming model. According to CUDA’s programming guide [1], its
function is the same as the classical synchronization primitive __syncthreads().
3) Grid-level Synchronization (Single GPU Synchronization): Starting from
CUDA 9.0, NVIDIA introduced the grid group for grid-level synchronization.
Grid-level synchronization is a method to synchronize a single GPU. In order to
use a grid group, the cudaLaunchCooperativeKernel() API call is necessary, in
contrast to the traditional kernel launch (<<<>>>); a short sketch follows this list.
4) Multi-Grid Level Synchronization (Multi-GPU Synchronization): CUDA
9.0 also introduced the concept of the multi-grid group. This group is initialized by
a kernel launch API, cudaLaunchCooperativeKernelMultiDevice().
Synchronizing this group can act as a way to do multi-GPU synchronization in a
single node.
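Below is a minimal sketch of a cooperative (grid-group) launch as referred to in item 3; the kernel name stepKernel and the problem size are illustrative, a GPU with compute capability 6.0 or higher is assumed, and the grid must be small enough to be co-resident on the device:

#include <cuda_runtime.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Every thread increments its element, then the whole grid synchronizes
// before any thread reads a value written by another block.
__global__ void stepKernel(float *data, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
    grid.sync();                      // grid-level (single-GPU) barrier
    if (i == 0)
        data[0] += data[n - 1];       // safe: the write above is now visible
}

int main() {
    int n = 1024;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    void *args[] = { &d, &n };
    dim3 block(256), grid((n + 255) / 256);
    // Cooperative launch is required for grid.sync(); the traditional
    // <<<>>> syntax cannot be used for a grid group.
    cudaLaunchCooperativeKernel((void*)stepKernel, grid, block, args, 0, 0);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}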
B. Non-primitive Synchronization
1) Software Barrier for Synchronization: Li et al. [16] researched fine-grained
synchronization. Beyond that, Xiao et al. [5] introduced a software device-level
synchronization. The authors limit the number of blocks per SM to only one in
order to avoid deadlocks. Sorensen et al. extended this work by adding an
automatic occupancy discovery protocol to discover active warps [4].
2) Implicit Barrier for Synchronization: Before the introduction of grid-level
synchronization, the typical way to introduce a device-wide barrier to a
program was to use several kernels in a single CUDA stream. A stream is a
logical queue that enforces an execution order on the CUDA kernels in the
stream, i.e. the kernels and data movement commands are executed in the
order by which they appeared in the stream (see the sketch after this list). For
example, many DL frameworks, e.g., Chainer [3], use this method to enforce
execution order.
3) Multi-GPU Synchronization: The common way to do multi-GPU
synchronization is to synchronize CPU threads orchestrating the GPUs. The
basic idea is to use one CPU thread per device (or one MPI rank per device).
Additionally, with the help of the GPUDirect CUDA technology, it is also
possible to implement multi-GPU software barriers using GPUDirect APIs.
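The following sketch illustrates the implicit barrier of item 2: two kernels placed in the same (default) stream, where the second cannot start until the first has finished; the kernel names kernelA and kernelB are assumptions:

#include <cuda_runtime.h>

__global__ void kernelA(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = (float)i;
}

__global__ void kernelB(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    // Both kernels go into the default stream; the stream guarantees that
    // kernelB only starts after every block of kernelA has completed, which
    // acts as an implicit device-wide barrier between them.
    kernelA<<<(n + 255) / 256, 256>>>(d, n);
    kernelB<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}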
Since we are concerned in this paper with studying general and intrinsic barrier
methods, we do not discuss manually implemented barriers, including
software barriers and manual GPUDirect-based implementations.
UNIT IV
OpenCL:
Although OpenCL programs can be compiled and linked into binary objects
using conventional off-line compilation methodology, OpenCL also supports
run-time compilation enabling OpenCL programs to run natively on the target
hardware, even on platforms unavailable to the original software developer.
Run-time compilation eliminates dependencies on instruction sets, allowing
hardware vendors to make significant changes to instruction sets, drivers, and
supporting libraries, from one hardware generation to the next. Applications that
make use of the run-time compilation features of OpenCL will automatically
take advantage of the latest hardware and software features of the target device
without any need for recompilation of the main application itself.
The OpenCL programming model abstracts CPUs, GPUs, and other accelerators
as “devices” that contain one or more “compute units” (e.g., cores) composed of
one or more SIMD “processing elements” (PEs) that execute instructions in
lock-step. OpenCL defines four types of memory systems that devices may
incorporate: a large high-latency “global” memory, a small low-latency read-only
“constant” memory, a shared “local” memory accessible from multiple PEs
within the same compute unit, and a “private” memory or device registers
accessible within each PE. Local memory may be implemented using either
high-latency global memory, or may be implemented with fast on-chip SRAM
or shared register file. Applications can query device attributes to determine the
properties of the available compute units and memory systems, using them
accordingly.
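A hedged sketch of both ideas, run-time compilation and the four address spaces, is given below in plain C; the kernel name scale and the overall program structure are illustrative assumptions rather than a canonical OpenCL example:

#include <stdio.h>
#include <CL/cl.h>

/* Kernel source illustrating the four OpenCL address spaces. */
static const char *src =
    "__kernel void scale(__global float *data,                 \n"
    "                    __constant float *factor,             \n"
    "                    __local float *scratch) {             \n"
    "    int gid = get_global_id(0);                           \n"
    "    int lid = get_local_id(0);                            \n"
    "    float x = data[gid];       /* private (register) */   \n"
    "    scratch[lid] = x;          /* local: shared in group */\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);                         \n"
    "    data[gid] = scratch[lid] * factor[0];                 \n"
    "}                                                         \n";

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

    /* Run-time compilation: the kernel source is compiled here, on the
       user's machine, for whatever device was found above. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    err = clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    if (err != CL_SUCCESS) {
        printf("build failed: %d\n", err);
        return 1;
    }
    cl_kernel kernel = clCreateKernel(prog, "scale", &err);
    printf("kernel compiled at run time: %s\n",
           err == CL_SUCCESS ? "ok" : "failed");

    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseContext(ctx);
    return 0;
}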
Multi-core CPUs
Multi-core CPUs make use of large caches to hide main memory latency. Many
CPUs also incorporate small-scale single-instruction multiple-data (SIMD)
arithmetic units to boost the performance of dense arithmetic and multimedia
workloads. These SIMD units are not directly exposed by conventional
programming languages like C and Fortran, so their use requires calling
vectorized subroutine libraries or proprietary vector intrinsic functions, or relies
on trial-and-error source-level restructuring and autovectorizing compilers.
AMD, Apple, and IBM provide
OpenCL implementations that target multi-core CPUs, and support the use of
SIMD instruction set extensions such as x86 SSE and Power/VMX. The current
CPU implementations for x86 processors often make best use of SSE when
OpenCL kernels are written with explicit use of float4 types. CPU
implementations often map all memory spaces onto the same hardware cache,
so a kernel that makes explicit use of constant and local memory spaces may
actually incur more overhead than a simple kernel that only uses global memory
references.
IBM has released an OpenCL toolkit supporting both the Cell and Power
processors on the Linux platform. The IBM OpenCL implementation supports
the embedded profile for the Cell SPUs, and uses software techniques to smooth
over some of the architectural differences between the Cell SPUs and
conventional CPUs. On the Cell processor, global memory accesses perform
best when operands are a multiple of 16 bytes, e.g. an OpenCL float4 type. The
use of larger vector types such as float16 enables the compiler to unroll loops,
further increasing performance. The 256 kB Cell SPU local store is shared
among the program text and the OpenCL “local” and “private” variables. This
places practical limits on the size of work-groups since private data storage is
required for each work-item. The Cell DMA engine performs most effectively
with the use of double buffering strategies combined with calls
to async_workgroup_copy() to load data from global memory into local store.
Although GPUs are powerful computing devices in their own right, they must
currently be managed by the host CPUs. GPUs are typically attached to the host
by a PCI-Express bus, and in most cases have their own independent on-board
memory system. In order to exchange input and output data with the GPU, the
host CPU schedules DMA transfers between the host and GPU memory
systems. OpenCL provides APIs for CPU-directed data transfers between
independent host and GPU memory systems. Recent GPUs are capable of direct
access to host memory over PCI-e, and in some cases may allow their on-board
memory to be mapped into the host address space, providing the necessary hardware
support for zero-copy access to data that are read or written only once during
kernel execution. At the present time, OpenCL does not include mechanisms for
zero-copy memory access, though it could be provided as an extension or as
part of a future version.
In the MDH method, the potential at each grid point is computed as a sum over
all atoms, where α is a prefactor that accounts for the system of units and
solution dielectric values, atom j is located at rj and has partial charge qj and
size σj, and the pairwise distance is rij = |rj − ri|. The potential at each grid point
is effectively the sum of all atomic potential contributions in the molecule. The
MDH method is inherently data-parallel when decomposed over grid points,
since they are computed independently and there are no output conflicts.
Host Devices:
A host is any hardware device that has the capability of permitting access to a
network via a user interface, specialized software, network address, protocol
stack, or any other means. Some examples include, but are not limited
to, computers, personal electronic devices, thin clients, and multi-functional
devices.
Types:
Host devices are devices that are physically plugged into the host,
including SCSI (for example tapes, disks, changers), PCI (for example
NICs, GPUs, and HBAs), and USB (for example mice, cameras, and disks).
Hosts use various protocols to communicate, including TCP and User Datagram
Protocol (UDP). On a TCP/IP network, each host has a host number that,
together with a network identity, forms its unique IP address. In the Open
Systems Interconnection (OSI) model, protocols in the transport layer, also
known as Layer 4, are responsible for communication between hosts.
Types of IT hosts
The term host is used in several other areas within information technology (IT),
carrying a slightly different meaning depending on the context.
Web host
For companies or individuals with a website, a host is a web server that stores
and transmits the data for one or more websites. Host can also refer to the
service provider that leases this infrastructure, which is known as hosting.
Cloud host
A cloud host is based on cloud computing technologies that enable a number of
servers to act as one system in which website performance can be guaranteed by
multiple machines. It often includes a network of servers pulling from different
data centers in different locations.
Cloud hosts operate as a service that enables clients to buy as much of the
service as they need. Cloud hosting is an alternative to hosting a website on a
single server. Cloud hosting can be considered both infrastructure as a service
(IaaS) and platform as a service (PaaS). Using a public cloud model, a public
network transmits data that is physically stored on shared virtual servers that
make up the cloud resource.
Virtual host
The term virtual host has two uses. One refers to the technology used to run
multiple domains or applications on a single physical server. The second refers
to companies that sell virtual infrastructure services.
Remote host
In this context, users access a remote host in a different physical location using
a private network or the internet. This process provides users with remote
access. Examples include servers that users can log in to remotely or a host
computer for a remote desktop.
Hosts connect to other hosts and servers in their local network.
Host virtual machine
This refers to the hardware -- or the physical server -- that provides the
computing resources to support virtual machines (VMs). This process is also
known as server virtualization.
Hostname
A hostname is a plaintext name identifying a host in a given domain. On a local
area network (LAN), a server's hostname might be a nickname like mailserver1.
On the internet, a hostname makes up part of a web address and has three parts:
1. subdomain
2. domain name
3. top-level domain
In other contexts, a host can also be a device or program that provides services
to some smaller or less-capable device or program.
UNIT V
Algorithms of GPU
Parallel Patterns:
Convolution:
Convolution is a mathematical operation that describes a rule for how to
combine two functions or pieces of information to form a third function. The
feature map (or input data) and the kernel are combined to form a transformed
feature map. The convolution algorithm is often interpreted as a filter, where the
kernel filters the feature map for certain information. A kernel, for example,
might filter for edges and discard other information. The inverse of the
convolution operation is called deconvolution.
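As an illustration, here is a hedged 1D convolution kernel in CUDA (the text above describes the general 2D case for feature maps; the kernel name convolve1D, the 5-tap filter, and zero padding at the borders are assumptions made for brevity):

#include <cuda_runtime.h>

#define KERNEL_WIDTH 5

// Small filter stored in constant memory; every thread reads the same values.
__constant__ float filter[KERNEL_WIDTH];

// Each output element is the weighted sum of its neighbourhood in the input
// (zero padding at the borders), which is exactly the convolution rule above.
__global__ void convolve1D(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float sum = 0.0f;
    int radius = KERNEL_WIDTH / 2;
    for (int k = -radius; k <= radius; k++) {
        int idx = i + k;
        if (idx >= 0 && idx < n)
            sum += in[idx] * filter[k + radius];
    }
    out[i] = sum;
}

int main() {
    const int n = 1 << 20;
    float h_filter[KERNEL_WIDTH] = {0.1f, 0.2f, 0.4f, 0.2f, 0.1f};
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; i++) in[i] = 1.0f;
    cudaMemcpyToSymbol(filter, h_filter, sizeof(h_filter));
    convolve1D<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out);
    return 0;
}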
Prefix Sum:
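The detailed notes for this pattern are not reproduced here. As a minimal sketch, the following single-block inclusive scan (a Hillis–Steele-style prefix sum; the kernel name inclusiveScan and the 256-element size are chosen only for illustration) shows the basic idea; a full-scale scan would additionally combine per-block results:

#include <cstdio>
#include <cuda_runtime.h>

#define N 256   // one thread block; larger inputs need a multi-block scheme

// Inclusive prefix sum (scan) within a single block: after each pass with a
// given offset, element i holds the sum of a window twice as wide as before.
__global__ void inclusiveScan(const int *in, int *out) {
    __shared__ int temp[N];
    int tid = threadIdx.x;
    temp[tid] = in[tid];
    __syncthreads();

    for (int offset = 1; offset < N; offset *= 2) {
        int val = (tid >= offset) ? temp[tid - offset] : 0;
        __syncthreads();            // all reads finish before any write
        temp[tid] += val;
        __syncthreads();
    }
    out[tid] = temp[tid];
}

int main() {
    int h_in[N], h_out[N];
    for (int i = 0; i < N; i++) h_in[i] = 1;   // all ones -> prefix sums 1..N

    int *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(int));
    cudaMalloc(&d_out, N * sizeof(int));
    cudaMemcpy(d_in, h_in, N * sizeof(int), cudaMemcpyHostToDevice);

    inclusiveScan<<<1, N>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);

    printf("out[0]=%d out[%d]=%d\n", h_out[0], N - 1, h_out[N - 1]);  // 1 and 256
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}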
Sparse Matrix Multiplication:
A sparse matrix is a matrix in which most of the elements are zero. Rather than
storing the whole matrix, it is better to store only the data of the elements which
are non-zero. For example, suppose there is a matrix A of which most of the
elements are 0; storing only its non-zero elements reduces the space needed and
takes less time to access the elements. Sparse matrix multiplication is used to
obtain better time complexity and less space. To store the non-zero elements we
can use two representations: an array (triplet) representation, which stores each
element by its row index, column index, and value and, since arrays store
elements by index, makes it fast to access elements; and a linked-list
representation, in which insertion and deletion operations are easier.
• A sparse matrix is a type of 2-D matrix in which most of the elements are
zero. Storing and multiplying such matrices in full takes a lot of memory
space and time, and it is also difficult to perform further operations: the
multiplication of full matrices has high time complexity, and large space is
needed for storing the three matrices (the two inputs and the result). So, to
multiply these matrices with less time and space, we store only the
non-zero elements of each matrix, multiply them, and obtain the resultant
matrix in the same reduced form; this is Sparse Matrix Multiplication.
• We can also store the resultant multiplication matrix in the same
representation as above.
In the example used in the code below, only 5 elements of the first matrix are
non-zero, so we just need to store these 5 elements in memory: for each one we
store the location (row and column) of the non-zero element and its value.
Step 1: First, we take the transpose of the second matrix. The transpose of a
matrix converts all the rows into columns and the columns into rows. Since we
store only the non-zero elements, this simply means swapping the row and
column index of each stored element.
Step 2: Take x as a row of the first matrix and y as a row of the transpose of the
second matrix. The resultant element at (x, y) is the sum of the products of those
stored values whose column in the first matrix matches the column in the
transpose of the second matrix.
Step 3: Let x=0 and y=0. In the first matrix, two stored values have row 0, and
one in the second matrix. If their columns are also the same, we multiply both
and look for another pair. In this example, in the first matrix the value at (0, 2) is
18 and in the second matrix the value at (0, 2) is 5, so we multiply both and look
for another pair; as we can see, there is no other pair in which the columns match.
Step 4: Follow step 3 for every 0<=x<row and 0<=y<column; each non-zero sum
becomes an element of the resultant matrix.
A complete C++ program for sparse matrix multiplication using this representation:

#include <iostream>
#include <algorithm>
using namespace std;

// Triplet representation: only the non-zero elements are stored.
struct matrix {
    int row;
    int col;
    int value;
};

// Order stored elements by row, then by column.
bool compareMatrix(const matrix &x, const matrix &y) {
    if (x.row != y.row)
        return x.row < y.row;
    return x.col < y.col;
}

void printMatrix(matrix a[], int n) {
    for (int i = 0; i < n; i++)
        cout << a[i].row << " " << a[i].col << " " << a[i].value << endl;
}

// Transpose: swap the row and column index of every stored element,
// then re-sort so the elements are ordered by row again.
void transposeMatrix(matrix a[], matrix b[], int n) {
    for (int i = 0; i < n; i++) {
        b[i].row = a[i].col;
        b[i].col = a[i].row;
        b[i].value = a[i].value;
    }
    sort(b, b + n, compareMatrix);
    printMatrix(b, n);
}

// Multiply a (n1 stored elements) by the transpose of b (n2 stored elements).
// For each pair of rows (one from a, one from transposeB), accumulate the
// products of the elements whose column indices match (steps 2-4 above).
void multiplyMatrix(matrix a[], matrix transposeB[], matrix resultant[],
                    int n1, int n2) {
    int k = 0;                               // number of non-zero results
    int i = 0;
    while (i < n1) {
        int a1 = a[i].row;                   // current row of a
        int j = 0;
        while (j < n2) {
            int b1 = transposeB[j].row;      // current row of b's transpose
            int temp = 0;
            int tempi = i;
            while (tempi < n1 && a[tempi].row == a1) {
                int tempj = j;
                while (tempj < n2 && transposeB[tempj].row == b1) {
                    if (a[tempi].col == transposeB[tempj].col)
                        temp += a[tempi].value * transposeB[tempj].value;
                    tempj++;
                }
                tempi++;
            }
            if (temp != 0) {                 // store only non-zero results
                resultant[k].row = a1;
                resultant[k].col = b1;
                resultant[k].value = temp;
                k++;
                temp = 0;
            }
            while (j < n2 && transposeB[j].row == b1)   // next row of transposeB
                j++;
        }
        while (i < n1 && a[i].row == a1)     // next row of a
            i++;
    }
    printMatrix(resultant, k);
}

int main() {
    int n1 = 5, n2 = 5;
    matrix a[] = {
        {0, 1, 5},
        {0, 2, 18},
        {2, 1, 14},
        {3, 1, 15},
        {3, 3, 4}
    };
    matrix b[] = {
        {0, 1, 5},
        {1, 2, 20},
        {2, 0, 8},
        {3, 1, 15},
        {3, 3, 24}
    };
    matrix transposeb[5];
    matrix resultant[100];

    transposeMatrix(b, transposeb, n2);
    multiplyMatrix(a, transposeb, resultant, n1, n2);
    return 0;
}
Output:
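With the program above and the sample matrices given in main(), the expected output is the sorted transpose of the second matrix followed by the non-zero elements of the product:

0 2 8
1 0 5
1 3 15
2 1 20
3 3 24
0 0 144
0 2 100
2 2 280
3 1 60
3 2 300
3 3 96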
Conclusion
Sparse matrix multiplication is a way to perform the multiplication of two
matrices with lower time complexity. Since most elements of the resultant
matrix are also 0, it is better to store the resultant matrix in the same way. Thus,
we can save a lot of space by storing just the non-zero elements.