Advanced Computer Architecture Big Picture Summary 7

Updated Monday May 13, 2024 8:21 PM GMT+3

Revised 5/12/2024 - This article is not standalone, nor is it a complete treatment of its topic. It is meant to cap a detailed reading of a similarly titled set of technical papers (see the course calendar).

High Performance

Studying typical instruction sequences is a big part of computer architecture. It has been so ever since Amdahl published his seminal work, in which he argued against the then-emerging view that simply adding more processors would deliver proportional gains in performance. He based his argument on a study of workloads typical of that time. His perspective was rooted in delivering value to users: reasonable returns on their investments in costly, supposedly high-performance hardware. He showed that there was more to high performance than throwing extra hardware at the problem. More importantly, his methods proved invaluable in the long run.
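
That argument is usually summarized by the speedup formula that now carries Amdahl's name: if a fraction p of a program benefits from a speedup factor n, the serial remainder bounds the overall gain. The short Python sketch below (an illustration for this summary, not part of the readings) computes that bound for a few cases.

    # Amdahl's law: overall speedup when a fraction p of the work is
    # accelerated by a factor n; the serial remainder (1 - p) bounds the gain.
    def amdahl_speedup(p, n):
        return 1.0 / ((1.0 - p) + p / n)

    # Even with 1024 processors, a 5% serial fraction caps speedup near 20x.
    for p in (0.50, 0.90, 0.95):
        print(f"p={p:.2f}: 64 procs -> {amdahl_speedup(p, 64):5.1f}x, "
              f"1024 procs -> {amdahl_speedup(p, 1024):6.1f}x")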

Setting Standards: The Benchmarks

Examining instruction sequences is essential for understanding performance issues and developing suitable solutions (i.e., building the hardware accordingly). Some of those sequences may be common, but most stem from the specifics of a workload: they depend on the types of programs users are expected to run frequently, and they may interact differently with different hardware. Identifying the workload and code patterns relevant to performance expectations is crucial for building specialized machines such as supercomputers. The same holds for computers designed to run games, for example.

It is convenient, at times, to examine execution traces rather than the static code. A run trace can extract the essential characteristics of an interesting aspect of an instruction stream. In some cases, that would be the dependencies in the stream; in others, it could be memory access patterns. The main advantage is that a trace can reveal the dynamic behavior of instructions. In particular, it can expose interactions with the hardware that may have profound effects on performance. It can also help identify the most frequently executed parts of the code, which helps optimize the actual user experience when the traces come from user programs.
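
As a small illustration of the idea (the trace format and the summarize_trace helper below are hypothetical; real tracing tools use their own formats), a trace can be reduced to a profile of the hottest program counters and the instruction mix:

    from collections import Counter

    # Hypothetical textual trace: one instruction per line, "PC OPCODE [ADDRESS]",
    # e.g. "0x4005d0 load 0x7ffe0010". Real trace formats vary by tool.
    def summarize_trace(lines, top=5):
        pcs, ops = Counter(), Counter()
        for line in lines:
            fields = line.split()
            if len(fields) >= 2:
                pcs[fields[0]] += 1   # executions per program counter (hot code)
                ops[fields[1]] += 1   # opcode classes: loads, stores, alu, ...
        return pcs.most_common(top), ops

    trace = ["0x4005d0 load 0x7ffe0010", "0x4005d4 alu",
             "0x4005d8 store 0x7ffe0018", "0x4005d0 load 0x7ffe0020"]
    hot_pcs, mix = summarize_trace(trace)
    print("hottest PCs:", hot_pcs)
    print("instruction mix:", mix)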

Benchmarking computers is about gauging their performance. Proper benchmarking should be concerned with generating the instruction streams that characterize performance realistically. Good benchmark programs can guide the design and help tune the performance of valuable workloads. Historically, and unfortunately for users, the process was routinely abused to misrepresent performance for various reasons. It remains true, however, that from a user's view the best benchmark code relies on the real programs they are likely to run, using whole programs or carefully extracted representative parts. Maintaining such benchmark programs is not easy. They must change to keep up with technology and evolving needs while continuing to reflect expected user performance realistically.
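
As a concrete example of how suite results are usually reduced to a single figure, SPEC-style suites report the geometric mean of per-program runtime ratios against a reference machine. The sketch below uses illustrative numbers only:

    import math

    # Each ratio is reference runtime / measured runtime for one benchmark;
    # the overall score is the geometric mean of the ratios (SPEC-style).
    def suite_score(ref_times, measured_times):
        ratios = [r / m for r, m in zip(ref_times, measured_times)]
        return math.exp(sum(math.log(x) for x in ratios) / len(ratios))

    # Illustrative numbers: three programs from a hypothetical suite.
    print(f"{suite_score([100.0, 250.0, 80.0], [20.0, 60.0, 10.0]):.2f}")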

It is worth noting that benchmark programs, even at their best, should not be regarded as conclusive indicators of the performance likely to be experienced by users. They can, however, give a much better idea of performance than the peak theoretical figures quoted by manufacturers.

Modern High-Performance Architecture

Although single-thread/core performance is still relevant, even critical for some applications (as of 2023), high performance generally implies some form of parallel processing or multiprocessing. Patterson and Hennessy's famous undergraduate textbook gives a solid grounding on the subject (see the Background Review sidebar on this page).

A look at the top500.org list of supercomputers reveals that the most dominant architecture is the cluster [see Links]. It is a form of loosely coupled message-passing system. Essentially, clusters are independent computers connected by networking technology, where node-to-node communication latencies are high relative to processing and memory access. Therefore, to achieve high performance on parallel programs, a cluster has to rely on interconnection networks that combine high speed with low latency. Some employ standard or commercially available networks to reduce costs. Those that use off-the-shelf parts, such as mainstream processors and networks, are sometimes called commodity clusters. The top high-performance ones are called superclusters and rely on specialized or custom networks. The three newcomers to the top ten spots in 2022, including the first exascale system at number one, used a clustered architecture (HPE Cray EX).
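
Parallel programs on such systems are typically written against a message-passing interface such as MPI. The fragment below is a minimal two-rank ping-pong using the mpi4py binding (a sketch, assuming mpi4py is available and the job is launched with at least two ranks); the measured round trip is dominated by exactly the network latency discussed above.

    # Minimal MPI ping-pong; run with e.g.: mpiexec -n 2 python pingpong.py
    from mpi4py import MPI
    import time

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    msg = bytearray(8)                    # tiny payload: the exchange is latency-bound

    if rank == 0:
        start = time.perf_counter()
        comm.send(msg, dest=1, tag=0)     # ping
        comm.recv(source=1, tag=1)        # pong
        rtt = time.perf_counter() - start
        print(f"round-trip time: {rtt * 1e6:.1f} us")
    elif rank == 1:
        comm.recv(source=0, tag=0)
        comm.send(msg, dest=0, tag=1)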

Interestingly, a modern cluster supercomputer typically combines the major multiprocessing architectures in many ways. A so-called fat node in a cluster, a term originally coined for nodes with lots of memory, is now increasingly a classic tightly coupled shared-memory multiprocessor. It could rely on one or more multicore packages, each carrying up to 128 conventional or mainstream cores (as of 2023, from AMD). Custom nodes may use RISC-based or other simplified processors to pack even more cores into a, perhaps green, MIMD package (e.g., the PEZY-SC2 from 2017 had 2048 lightweight cores, while the Ampere CPU series, as of 2023, packed up to 192 ARM-based cores). Alternatively, a fat node may instead rely mainly on a massively parallel GPU, a modern take on old SIMD array processors, or combine that with a high core count processor, making the node effectively a supercomputer on its own by past standards. For example, a node in the Frontier exascale supercomputer consists of a 64-core AMD multicore coupled to four GPUs and 512 GB of shared RAM [see node image in Links]. Custom high-speed interconnection networks connect the multicore, the GPUs, and the shared memory.

Fat nodes from recent systems are well suited to memory-intensive or compute-intensive workloads, or both, depending on the resources they pack. They are attractive because a substantial portion of a parallel computation can fit within a single node, so the work spreads across fewer nodes. This configuration helps reduce the typically high overheads of cluster-wide communication, which works well for the scalability of a range of valuable workloads. Modern fat nodes are vital in turning networking-based clusters into the supercomputers of recent times.

A clustered computer architecture depends heavily on the network that connects its nodes to reach high performance on large parallel loads. Commodity clusters may get away with standard networking. Superclusters, however, must rely on solutions with very low latency to improve the scalability of parallel programs. Modern networking technologies address this sensitivity of cluster-wide performance to the network, and most of the advances were driven largely by the needs of supercomputing. The top systems come with custom networks designed to maximize the scalability of the two main types of workloads, independent and parallel. Whether software can utilize all that power is another issue. Even if we address the challenges of parallelizing and balancing loads, workloads don't always cooperate. Scalability will likely remain a challenge in some cases.
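
A toy scaling model (an assumption of this summary, not something from the readings) makes that sensitivity concrete: once a fixed per-step communication cost rivals the shrinking per-node compute time, speedup flattens no matter how many nodes are added.

    # Toy model: parallel time = compute_time / nodes + steps * per_step_latency;
    # the fixed communication term eventually dominates and speedup flattens.
    def speedup(compute_time, nodes, steps, per_step_latency):
        return compute_time / (compute_time / nodes + steps * per_step_latency)

    for lat in (1e-6, 1e-4):              # 1 us vs 100 us of latency per step
        scaling = [f"{speedup(100.0, n, 1000, lat):.0f}x" for n in (16, 256, 4096)]
        print(f"per-step latency {lat:.0e} s -> 16/256/4096 nodes:", scaling)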

High-Performance Applications

The number one influence on every aspect of high-performance machines is, perhaps, the applications they are supposed to run. These applications involve large, intensive computations, sometimes massive datasets, or both. Applications in the field of astrophysics seem to get all the public and mainstream media attention (who is not fascinated by black holes?). However, other, perhaps less known, applications greatly influence the day-to-day lives of most people worldwide. Examples include local weather prediction and applications in materials science and biomolecular modeling.

High-performance systems are built and configured to fit the needs of their users and the applications they intend to run. As part of the delivery and installation process, they typically undergo lengthy testing to tune the performance of significant user workloads. The largest systems must cater to an extensive range of applications with diverse, often competing, needs, which poses extra challenges for system builders. One of the most significant is keeping those wildly expensive systems busy, at both the workload and system levels, to justify their high running costs.

Increased demand for more complex high-performance applications was, and continues to be, the main driving force behind the explosive growth in performance and capability. In 2023, the top supercomputer listed by top500.org was rated 35 times faster than the number one of only a decade earlier (1194 petaflops vs. 33.86 in 2013); on the 2023 list, the older system would occupy position sixteen. Understanding the needs and characteristics of high-performance applications is therefore vital. It is a must for insight into the decisions and tradeoffs that go into the design and attainment of high performance.

Green High Performance

Although not covered by the reading set, this aspect is, perhaps, the most crucial for modern supercomputing. As these systems grow in computing power, so do their energy needs. Left unchecked, electric power consumption explodes with increasing computing power, reaching tens of megawatts (MW). As a reference point, the first exascale system to make it onto the top500.org list in 2022 delivered one exaflops of performance at about 20 MW, which was quite an achievement. The energy requirements are not limited to driving the electronics and the other components of these massive systems. Ironically, considerable amounts are required to control and dissipate the large amounts of thermal energy they generate. The budgets associated with handling the excessive heat and its environmental implications are a significant part of the cost of those systems. Unfortunately, top computing power and top energy efficiency don't go hand in hand. An interesting exercise is to check where the top two green systems (Green500) fall in the top500.org list. The interview in the Links section is a light, informative read on this crucial aspect of modern high performance at the top tier.
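
A rough back-of-the-envelope calculation using the figures above puts that achievement at about 50 GFLOPS per watt:

    # Efficiency implied by roughly 1 exaFLOPS delivered at about 20 MW.
    flops, watts = 1e18, 20e6
    print(f"{flops / watts / 1e9:.0f} GFLOPS per watt")   # ~50 GFLOPS/W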

Finally

The preceding discussion focused somewhat narrowly on processing, which may mislead some readers. One should note, in conclusion, that high performance is much more involved: it concerns all aspects of the computing system. For example, memory and I/O bandwidth are just as critical for high performance. It also requires a careful review of all constituents of the computations in a target workload, most obviously the algorithms. For a less obvious example, some workloads may tolerate less precision than the 64-bit norm of high-performance applications, e.g., half-precision or 16-bit floating point; reduced precision may then be a reasonable way to get higher performance.
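
As a tiny illustration (using NumPy; not part of the readings), halving or quartering the width of floating-point data cuts the bytes that must be stored and moved by the same factor, which is one source of the extra performance when a workload tolerates it:

    import numpy as np

    # Same element count, different precision: fp16 moves a quarter of the bytes of fp64.
    n = 1_000_000
    for dtype in (np.float64, np.float32, np.float16):
        a = np.ones(n, dtype=dtype)
        print(f"{np.dtype(dtype).name}: {a.nbytes / 1e6:.1f} MB")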

Note also that an architecture suited to workloads with specific features, such as mainstream cloud applications, may not deliver the high performance expected by other, more specialized applications. A system may seem to provide vast resources on paper. What makes the difference, however, is how those resources are organized and which aspects of performance are prioritized. There is no one magic template for high performance.