Chapter 4 Notes
This chapter covers GPU architectures, with comparisons to more traditional vector processors and SIMD in multimedia processors. The book goes
heavy on the nVidia architecture and approach to programming, specifically Pascal and CUDA. There are
other programming approaches available for these operations, the most common of which is OpenCL.
OpenCL runs on many different GPU architectures from multiple vendors (even Intel integrated graphics
have basic support), but nVidia's CUDA has a much higher rate of adoption. This is due to several reasons: CUDA is better optimized for nVidia hardware, which is the fastest base platform for developers; CUDA was first on the scene and built up a significant lead over OpenCL among early adopters; and nVidia runs a large and active developer support program. Therefore, my examples and additional material will continue to focus on nVidia/CUDA.
Also, before you deep dive into modern architectures, I wanted to provide a few brief notes about
computing on the GPU. The book implies that this is a relatively new phenomenon (and it is, in the
history of computer science as a whole), but it is not nearly as new as the book presents it. Before GPUs
offered explicit support for general-purpose computing on graphics processing units (GPGPU), programmers could still take advantage of the primitive programming options available on early GPUs to do some forms of acceleration. However, as there was no way of submitting anything resembling complex code, programmers had to determine how to represent the data they wished to calculate in a way that corresponded to the processing capabilities of these GPUs. For data that could be represented as graphical data, generally implying that you'd be using 8- or 16-bit colors as your data values, programmers could manipulate the data by lighting it in various fashions, producing altered graphical data that could then be examined for the end results. This process was extremely awkward in its limitations and implementation challenges, but it did work(ish). GPGPU support in hardware and development tools provided the means to submit functional code to the GPU, with some support for branching/conditionals and the ability to directly perform the desired math rather than having to translate data to a graphic and back again.
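For contrast, here is a minimal sketch of what that direct approach looks like in CUDA today. This is my own illustrative example (a standard SAXPY kernel), not code from the book, and the use of unified memory is a simplification for brevity:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread computes one element directly: no encoding the data as
    // colors, no lighting tricks, just the arithmetic we actually want.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                  // guard: the grid may be larger than n
            y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));  // unified memory, for brevity
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, x, y);  // 256 threads per block
        cudaDeviceSynchronize();                         // wait for the GPU

        printf("y[0] = %f\n", y[0]);                     // expect 5.0
        cudaFree(x);
        cudaFree(y);
        return 0;
    }

Note that the kernel even contains a conditional (the bounds check), which is exactly the kind of thing the pre-GPGPU approach could not express.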
Another general note is on the overall speed increases possible with a GPU versus a CPU in various
operations. It is generally perceived that, for operations a GPU can handle, it is faster than a CPU by entire orders of magnitude due to the massively parallel structure of the hardware. While a top-end desktop CPU might have 24 cores, a GPU with similar market positioning could have over 10,000 programmable shaders (10,240 in a 3080 Ti, for example, plus the Tensor cores and ray-tracing cores that can also do specialized math). However, other experts contend that if the same level of optimization were applied to CPU programming, this difference would be largely offset by the overhead of shuttling data between the GPU and the CPU/main memory, as well as by restrictions in GPU programming such as very slow branch processing. The obvious
brute-force power of a GPU in solving a problem may be equaled or surpassed by a more clever
implementation on a CPU.
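To make the branch issue concrete: nVidia GPUs execute threads in groups of 32 (called warps) in lockstep, so when threads within one warp take different branches, the hardware runs each path one after the other with part of the warp idle. Here is a hypothetical kernel of my own to illustrate (not taken from the book):

    // Even and odd threads in the same warp take different branches, so the
    // warp executes both paths serially with half its threads masked off
    // each time, roughly doubling the time spent in this region.
    __global__ void divergent(int n, float *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0)
            out[i] = sinf((float)i);  // path A: runs first, odd threads idle
        else
            out[i] = cosf((float)i);  // path B: runs second, even threads idle
    }

On a CPU the same branch costs at most a mispredicted jump; on the GPU, divergence within a warp serializes the work.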
A good example of such a clever CPU implementation appeared in a research article last spring, and an overview can be seen here:
https://www.sciencedaily.com/releases/2020/03/200305135041.htm. The field in question is deep learning, an AI technique in which a model is trained on a large amount of representative data to correctly perform complex tasks, such as image recognition. Such a model may involve (hundreds of) millions of parameters in a network that somewhat mimics the operations of the human brain in its connections and operation; the connected pieces of code are in fact called neurons. In any case,
these models generally use GPUs to process the training data, as the training computations are highly parallelizable and benefit from a significant speed boost. However, some researchers realized that training every neuron on every piece of data was not required, and that by using a tuned search algorithm, they could massively reduce the number of operations the training process requires. The end result is that in their testing system, doing this problem on a (very high-end) CPU takes ~1/3 the time of the GPU version of the training. It is possible that similarly optimized CPU approaches to other tasks traditionally handled on the GPU may cause everyone to rethink, again, the balance between the CPU and GPU.
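As a toy illustration of the general idea of selective updates, consider the sketch below. This is my own illustration of the concept, not the researchers' actual algorithm (which used a far more sophisticated search), and names like top_k and train_step are invented for the example; it is plain host-side C++ to emphasize that the work stays on the CPU.

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // Find the indices of the k most strongly activated neurons.
    std::vector<int> top_k(const std::vector<float> &act, int k) {
        std::vector<int> idx(act.size());
        std::iota(idx.begin(), idx.end(), 0);  // 0, 1, 2, ...
        std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                          [&](int a, int b) { return act[a] > act[b]; });
        idx.resize(k);
        return idx;
    }

    // Toy training step: update only the k most active neurons for this
    // sample instead of all of them.
    void train_step(std::vector<float> &weights, const std::vector<float> &act,
                    float grad, float lr, int k) {
        for (int i : top_k(act, k))  // touch k neurons rather than every one
            weights[i] -= lr * grad * act[i];
    }

With k much smaller than the layer size, the CPU performs a small fraction of the arithmetic that the brute-force approach would, which is the kind of trade the article describes.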
I've been looking at a variety of videos on YouTube to provide some supplementary information and interaction beyond what is available in the book. These videos either look at a specific nVidia graphics architecture, Ampere being the most recent incarnation, or go a bit deeper into CUDA programming to provide a more interactive example that helps to highlight how you can move from traditional coding approaches to the CUDA version.
For the videos, you may either watch the brief intro to CUDA coding from one of the nVidia developers
here: https://www.youtube.com/watch?v=kIyCq6awClM and then move on into the architectural
videos, or start with the architecture and go back to the programming. For the architecture videos, I’d
suggest you start with the shorter one here: https://www.youtube.com/watch?v=BgzTmaDb8Pk and
then move on into the Ampere deep-dive here: https://www.youtube.com/watch?v=AmNL2Cg2OO8
For both of the architecture videos, you will have an easier time understanding them if you refer to the terminology chart in the chapter. I believe it was on page 314, but my online access to the book expired and I don't have the printed copy at home, so I'm going off memory. It should be close to that page if not exactly there; it's in section 4.4 of the book. We'll back into this on Wednesday by talking about the basics of SIMD, the types of operations it can and can't handle, and how its performance compares to more traditional approaches.