Advanced Computer Architecture Big Picture Summary 6


Revised 5/6/2024 - This article is not standalone and is not a complete treatment of its topic. It is meant to cap a detailed reading of a similarly titled set of technical papers (see the course calendar).

Multiprocessing

Maximizing productive (computational) traffic between a processor and memory is a central system performance concern. Multiprocessing adds an extra challenge to that concern: it must deliver a proportionate increase in overall performance as we add more processors to a computation. It seems reasonable to expect two processors to double the performance, three to triple it, and so on. Performance should ideally scale up linearly with the number of processors. This concern is called scalability. It is vital because it is the reason we pay more money for machines with more processors. Unfortunately, ideal scaling is not easy to get. Amdahl's Law tells us that it only happens when we have done a perfect job of parallelizing the work (processing in parallel 100% of the time).
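
As a quick illustration, the minimal Python sketch below evaluates Amdahl's Law, speedup = 1 / ((1 - f) + f / p), for a workload with parallel fraction f run on p processors. The fractions and processor counts are made-up values chosen only to show how anything short of 100% parallelization caps the achievable speedup.

```python
# Minimal sketch of Amdahl's Law: f is the parallel fraction of the work,
# p is the processor count. Values below are illustrative, not measured.

def amdahl_speedup(f: float, p: int) -> float:
    """Speedup = 1 / ((1 - f) + f / p)."""
    return 1.0 / ((1.0 - f) + f / p)

for f in (1.0, 0.95, 0.75):           # parallel fraction of the work
    for p in (2, 4, 16, 64):          # processor counts
        print(f"f={f:4.2f} p={p:3d}  speedup={amdahl_speedup(f, p):6.2f}")

# Only f = 1.0 (perfect parallelization) scales linearly; at f = 0.95 the
# speedup saturates near 20 no matter how many processors are added.
```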

Challenges to Scalable Multiprocessing

Multiprocessing raises issues that can have adverse effects on scaling performance. It tends to cause extra traffic on all the interconnecting paths between memory and processors. The overhead traffic rises as the number of processors increases and, at its worst, can ruin scalability. For example, designers resort to cache memory to boost performance, but a multiprocessor generates extra traffic to keep the caches coherent. This overhead traffic is needed to maintain consistent values across private processor caches holding the same memory blocks. It gets worse as more processors connect, to the extent that it could seriously reduce multiprocessing returns. A clever cache protocol may help decrease overhead coherence traffic but cannot eliminate it.
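
To make the growth concrete, here is a rough back-of-the-envelope model, not a real protocol simulation: in an invalidation-based coherence scheme, each write to a block cached by s other processors triggers on the order of s invalidation messages. The write count and sharing fraction below are assumptions for illustration only.

```python
# Crude model of coherence overhead: every write to a shared block costs
# roughly one invalidation message per other processor caching that block.
# shared_writes and sharing_fraction are hypothetical illustrative values.

def coherence_messages(processors: int, shared_writes: int, sharing_fraction: float) -> int:
    """Estimate invalidation traffic for writes to widely shared blocks."""
    sharers = int(sharing_fraction * (processors - 1))
    return shared_writes * sharers

for p in (4, 8, 16, 32, 64):
    print(p, coherence_messages(p, shared_writes=1_000_000, sharing_fraction=0.5))

# The message count grows with the processor count even though the amount of
# useful computation has not changed.
```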

Even if designers restrain architectural overheads such as those discussed above, there is still a legitimate increase in overhead traffic due to computation-related communication among processors. Some of it may be due to poor parallelization, but most stems from workload characteristics. For example, some workloads may cause unusually high cache traffic, which, in turn, raises the parasitic part needed to maintain coherence. This workload communication overhead also rises with the number of processors. It may reach a point where it diminishes multiprocessing returns or even reverses the gains from parallelism in some cases. The overhead varies with workload type. Different architectures work better for some workloads than others, and optimizing a multiprocessor for some workload types may not work for all workloads of interest.

Communication overhead, therefore, always remains a concern. It is one of the main challenges for scalable multiprocessing. A good part of the concern is finding better tradeoffs between computation and communication. To close, note the following. We fully expect interprocessor communication in a parallel workload to rise with processor count. That would not be a problem as long as it scales without diminishing the linear scaling of performance; it is the rate at which it increases that causes concern. There are also other challenges to scalability, such as balancing loads among processors, which pose comparable difficulties. We focused on cache coherence to illustrate the main points.
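
A toy model helps show when rising communication overhead can reverse parallel gains. In the sketch below, the per-processor communication cost constant is an assumption chosen only to make the effect visible, not a measurement of any real machine.

```python
# Hedged sketch: ideal parallel time 1/p plus a communication overhead term
# that grows with processor count p. comm_per_proc is an assumed constant.

def speedup_with_comm(p: int, comm_per_proc: float = 0.02) -> float:
    """Speedup when per-processor communication cost is added to ideal 1/p time."""
    time = 1.0 / p + comm_per_proc * p
    return 1.0 / time

for p in (1, 2, 4, 8, 16, 32, 64):
    print(f"p={p:3d}  speedup={speedup_with_comm(p):5.2f}")

# With these made-up numbers, speedup peaks around p = 7 and then falls:
# past that point the rising communication overhead reverses the gains
# from adding processors.
```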

A Multiprocessor of a Different Kind

A modern GPU (graphics processing unit) is an example of a massively parallel multiprocessor. Its architectural roots date to machines such as the MasPar series of the 1990s and the MPP of 1980-82. This design connects an array of an enormous number of tiny processors, on the order of hundreds or thousands, to a high-bandwidth shared memory. Each array processing element (PE) may be very limited in power compared to a conventional CPU, but many more may be available. Interconnections emphasize fast, low-latency communications with minimal overheads. Parallelism occurs at the level of tiny tasks and is thus considered fine-grained. The multiprocessor may rely on a conventional (host) processor to provide a familiar front-end operating, programming, and I/O environment.

An early example was the MPP (Massively Parallel Processor) that NASA helped develop to process satellite images at high data rates, a job involving small repetitive operations on massive pixel data. Workloads were heavy on processing data with high regularity and plenty of opportunity for parallelism.

The architecture is a good fit for workloads consisting of numerous relatively independent small calculations, as in the MPP. These repetitive computations lend themselves to some form of SIMD, which was the case for the MPP and later the MasPar and the early CM (Connection Machine) series from Thinking Machines. That fit was presumably the original reason for using this architecture class to accelerate graphics.
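
For a feel of why such workloads parallelize so well, here is a tiny illustrative sketch (the 8x8 image and the threshold value are made up): the same small operation is applied to every pixel, and no iteration depends on another, so each pixel could in principle go to its own processing element.

```python
# Illustrative fine-grained, SIMD-friendly workload: the same tiny operation
# applied independently to every pixel of a small synthetic image.

image = [[(x * 7 + y * 13) % 256 for x in range(8)] for y in range(8)]

# Conceptually, each pixel below could be handed to its own processing
# element; there are no dependences between iterations.
thresholded = [[255 if pixel > 128 else 0 for pixel in row] for row in image]

for row in thresholded:
    print(row)
```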

Graphics workloads perform numerous operations on basic geometric shapes with plenty of room for parallelism. Operations are data-intensive and call for processing that is much simpler, but more responsive, than what a conventional CPU core provides. They require less control logic. As a result, a graphics processor may pack many more processing elements into a standard package as long as it can keep the instruction fetch-decode and control overheads low.

The modern GPU is a collection of SIMD-like arrays of PEs on a chip with shared memory. Each array is a mini (tightly coupled, in classic terms) multiprocessor that holds a block of connected PEs. This organization results in a scalable array. It also allows for multilevel cache support, with L2 shared by the multiprocessor units and L1 shared by a block of PEs. A modern GPU is characterized by how it handles data parallelism. It manages threads in a heavily threaded workload en masse (collectively, in bulk), unlike the individually managed threads of a traditional OS. Specialized hardware in the GPU drives and synchronizes blocks of threads to flow massive amounts of data through the PEs. This processing style is called SIMT (single instruction, multiple threads).


[Figure] Processing elements (PEs) organized in a one-level vs. a two-level array. In the scalable setup, the number of PEs may increase by adding multiprocessor blocks (for 40 PEs), by expanding the number of PEs in a block to 12 (for 48 PEs), or by both (for 60 PEs). Interconnections scale up with relatively little effort.
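
The SIMT style described above can be mimicked, very loosely, in plain sequential code. In the toy sketch below (block and thread counts are arbitrary choices), threads are launched in blocks, every thread runs the same code, and a thread's (block, thread) position selects which data element it touches; real GPU hardware schedules these blocks in bulk rather than in a Python loop.

```python
# Toy, purely sequential simulation of the SIMT idea: same instructions for
# every thread, different data element per thread, threads grouped in blocks.

def kernel(block_id: int, thread_id: int, threads_per_block: int, data: list) -> None:
    i = block_id * threads_per_block + thread_id   # global element index
    if i < len(data):                              # guard for the ragged tail
        data[i] = data[i] * 2.0                    # same operation, different data

data = [float(i) for i in range(10)]
threads_per_block = 4
blocks = (len(data) + threads_per_block - 1) // threads_per_block

for b in range(blocks):
    for t in range(threads_per_block):
        kernel(b, t, threads_per_block, data)

print(data)   # every element doubled by the bulk-launched "threads"
```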

It is, perhaps, worth noting that the data-intensive, coordinated execution of graphics primitives may be reminiscent of classic vector processors. That is not the case with a modern GPU, however. Vector machines used one opcode to pipeline a sequence of data items through a single processing element, typically an ALU at the time. A basic scenario would not have needed redundant functional units. The key idea is that the data items line up in special vector registers to pipeline quickly without data hazard stalls (except possibly for the first item). The number of operands that fit in a vector register therefore determines the number of times to perform the operation specified by one opcode. Compared to scalar execution, both vector and SIMD reduce instruction counts. The main difference is that a vector processor can dramatically cut the dynamic instruction bandwidth (i.e., the runtime number of fetch-decodes), depending on the depth of the vector registers. In that respect, a GPU is similar to vectorization.
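
A back-of-the-envelope comparison makes the dynamic instruction-bandwidth point concrete. The array size and vector-register length below are illustrative choices, not figures from any particular machine.

```python
# Adding two arrays element by element: dynamic add instructions fetched on a
# scalar machine vs. a vector machine with 64-element vector registers.

elements = 4096
vector_length = 64                                  # depth of the vector registers

scalar_adds_fetched = elements                      # one fetched add per element
vector_adds_fetched = elements // vector_length     # one fetched vector add per 64 elements

print(scalar_adds_fetched, vector_adds_fetched)     # 4096 vs. 64 dynamic instructions
```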

The GPU

The GPU originated in graphics accelerators with unimpressive static (fixed-function) graphics pipelines that sped up a few operations. It stepped up through a series of designs rather quickly, from having some programmable stages to being fully programmable. The main driving force, oddly, was the demand for better-quality gaming. Its major processing elements (PEs) grew from a fixed pixel-shading function to a programmable universal shader for all graphical elements. Programming support was still limited to graphics APIs. A look into the beginnings of the modern GPU can help reveal its distinctive architectural features.

The Modern GPU: Beginnings

In 2006, Nvidia was the first to ship a fully programmable unified pipeline, in the GeForce 8800. The physical stages from the old static pipeline mapped logically onto a new, dynamic one based on all-purpose PEs. It was also the first to have standard interfaces for general-purpose programming based on the C language and later C++, with support for parallel programming via a proprietary model (CUDA) or OpenCL. Previously, programmers had to go through the awkward step of making their tasks look like graphics operations. It did not include a cache (the microarchitectural detail, invisible to programmers, that buffers memory requests); the next big iteration in 2009 (Fermi) would get one. This milestone microarchitecture (code-named Tesla, realized in the G80 chip) became a foundation for what came after.

Later, PEs would get general integer and floating-point ALUs. They gradually turned into better general-purpose calculation engines called streaming or stream processors (i.e., optimized for fast, vector-like data flow). In contrast, a conventional CPU relies on complex instruction control flow and large caches to increase its effective bandwidth [see the linked Nvidia white paper].


[Figure] Stream processing compared to other execution models (simplified to show the difference). It works by moving data through execution units via parallel threads composed of instruction blocks. Parallelism occurs at the thread level (TLP), not at the data level (DLP) as in SIMD and vector processing (where it is controlled directly by instructions). The data vector size, fixed in classic vector and SIMD, is programmable in GPU-based stream processing.

AMD, the other major GPU company, followed with a comparable architecture in 2007 (the Radeon HD 2000 series), increasing the number of PEs from 160 in the G80 to 320 at the top of the line. (Note that AMD counts its 5-instruction-wide superscalar ALU as 5 PEs to arrive at a higher PE count.) Support for general-purpose programming on the AMD devices was lagging at the time.

As a historical footnote, we must recognize the Xenos graphics processor from ATI (later a part of AMD), which was the first to have some unified PEs in 2005 (limited to vertex and pixel shading). It was a joint design with Microsoft, custom-made for the Xbox 360 gaming console, and supported only Microsoft graphics API programming. It was the precursor of the Radeon HD 2000.

Recent Trends

Two trends opened up GPUs for general scientific workloads. First, the GPU steadily changed from its graphics accelerator roots into a flexible, programmable processor. Second, a steady supply of transistors (predicted or motivated by Moore's Law) helped processing elements grow in number and power. So did on-chip memory and logic supporting massive threading. More transistors also allowed engineers to add other types of PEs, transforming the GPU into a heterogeneous architecture. The new PEs initially focused on complex graphical functions, such as ray tracing, but soon started to target other specialized workloads, such as those from AI. The GPU grew substantially in computing power as a result.

Nowadays, supercomputers commonly utilize GPUs. Applications include weather forecasting, financial risk, animation/modeling, and machine learning, to name a few [see the 2017 Nvidia applications catalog]. In late 2021, the first exascale supercomputer started shipping to Oak Ridge National Laboratory in the USA, featuring advanced GPUs from AMD. It was the first to break the exascale performance barrier in 2022, earning a top spot in the well-regarded Top500 list [see HPCwire news articles]. It is worth noting that each custom GPU package, four of which equip a node, should have 16384 processing elements spread across two chips. The array unit of the multi-cabinet MPP had the same number of processing elements back in 1982.

These trends might make one wonder about the MIMD prospects of the GPU. As of this writing, GPUs remain highly optimized for rigid, heavily multithreaded, SIMD-like operation. Consequently, workloads with massive data-level parallelism that best fit the GPU model will benefit remarkably.