Advanced Computer Architecture Big Picture Summary 3

Updated Monday October 9, 2023 2:41 PM GMT+3

Revised 9/22/2023 - This article is not standalone and is not a complete treatment of its topics. It's meant to cap a detailed reading of a similarly titled set of technical papers.

The Front End

By the late 1970s, designers had come to a strong realization that the front end of a computer needed simplifying. The front end includes the parts responsible for generating and handling program instructions before execution. It is conventionally composed of software and hardware parts: a software compiler generates machine instructions from a program written in a high-level language, and hardware decodes the instructions to extract operations and operands. The most significant part of the front end is the instruction and operand access model, including program memory and machine operands (CPU registers). It is the programming face the machine presents to its users, and higher-level programming systems build on this interface.

Peeling the layers: a modern computer hides the machine that performs the calculations behind many layers, all of which we may throw away. We can feed inputs directly via switches and read results, one instruction at a time, via LEDs. The bare machine would be fully operable, but that would not be productive (to say the least).

Programming the Machine

A rich machine-level interface based on a wide selection of flexible operations and operands makes life easier for the assembly programmer. However, adding more capable instructions with complex behaviors and creative coding schemes results in costly decoding. Moore had predicted (perhaps demanded) a steady rise in the transistors available to designers, an insight that lived long past its original horizon thanks to advances in semiconductor technology, driven by believers who made Moore's observations a law. Costly decoding, however, is not a wise place to spend a transistor budget better invested in execution logic and in fast memories placed close to the functional units, where they help the execution logic run at full speed. Besides, compilers tended to avoid some fancy but problematic instructions because they ran inefficiently or were difficult to generate code for. It makes little sense to waste transistors on instructions with poor utility.

The RISC Approach

The RISC (reduced instruction set computer) approach to machine instructions enabled faster decoding and facilitated fast execution. Only the most basic and frequently used instructions, with simplified execution profiles, were selected for implementation in hardware. The basic instructions could synthesize the more complex operations that other designs supported with complex machine instructions on less efficient hardware. Similarly, the addressing modes of operands were limited to the bare essentials. Typically, an assembler provides pseudoinstructions that simulate complex instructions to make a low-level programmer's life easier. The assembler translates these soft instructions into efficient patterns of machine instructions. Pseudoinstructions present a richer interface than the real machine, while the hardware works with an instruction set reduced in both size and behavior.
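
To make the idea concrete, here is a minimal Python sketch (an illustration, not a real assembler) of how an assembler might expand the RISC-V li (load immediate) pseudoinstruction into the base lui/addi pair, compensating for addi's sign-extended 12-bit immediate:

    # Expand "li rd, imm" (32-bit) into real RV32I instructions.
    def expand_li(rd, imm):
        lo = imm & 0xFFF                # low 12 bits, for addi
        hi = (imm >> 12) & 0xFFFFF      # high 20 bits, for lui
        if lo >= 0x800:                 # addi sign-extends its immediate;
            hi = (hi + 1) & 0xFFFFF     # bump the upper part to compensate
            lo -= 0x1000
        if hi == 0:                     # small value: a single addi suffices
            return [f"addi x{rd}, x0, {lo}"]
        return [f"lui x{rd}, 0x{hi:x}", f"addi x{rd}, x{rd}, {lo}"]

    print(expand_li(5, 0x12345678))
    # -> ['lui x5, 0x12345', 'addi x5, x5, 1656']

The single assembly-level li thus costs up to two machine instructions; the real machine never sees the richer form.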

A reduced instruction set resulted in fewer and simpler bit encoding schemes, especially compared to designs that retained a large mix of simplified and complex instructions. Reduced instructions led to simpler decoding hardware, execution paths, and control, which, in turn, resulted in faster programs. Coherently designed instruction sets based only on elementary operations and addressing modes simplified compilers and helped produce more efficient machine code. Significantly, they made automating compiler production easier, lowering the cost of new architectures and leading to faster adoption by high-level programmers.

RV32I (RISC-V core integer) instruction encodings, a modern ISA. The fixed-size instructions come in only six formats with highly regularized field positions. The first two show examples of a register-register add and its short-immediate variant. Like MIPS I, loads share the I-type format with short-immediate operand instructions (12 bits in RISC-V); stores use the S-type. The odd immediate fields in the B- and J-type formats illustrate how an ISA can help simplify the hardware (look it up).
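
As a small illustration of how regular these encodings are, the following Python sketch packs the R-type add from the figure out of its fixed field positions (the helper function is hypothetical; the field positions and values follow the published RV32I spec):

    # Pack the six R-type fields into a 32-bit RV32I instruction word.
    def encode_r_type(funct7, rs2, rs1, funct3, rd, opcode):
        return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) \
             | (funct3 << 12) | (rd << 7) | opcode

    # add x3, x1, x2: opcode OP (0b0110011), funct3 = 0, funct7 = 0
    word = encode_r_type(0, 2, 1, 0, 3, 0b0110011)
    print(hex(word))  # -> 0x2081b3

Decoding is the same handful of shifts and masks in reverse, which is exactly why the hardware stays simple.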

RISC quickly coalesced into consistent rules, design choices, and guidelines for building processors. Previously, in the 1950s-70s, processors were created by ambitious engineers and scientists working in a young, largely unexplored field for competing businesses that had to market and sell products. As a result, the families of processors later known as CISC (complex instruction set computer) were an odd yet impressive collection of designs, tricks, and compromises. Nevertheless, the pioneers pretty much figured out the basics; later work, for the most part, pushed and refined their ideas. It is perhaps a mistake to view CISC and RISC as competing design methodologies. RISC, more appropriately, may be regarded as a natural progression in processor design after a few decades of valuable experience.

The 68000 (Motorola, 1979) in the original Apple Macintosh (1984) was a popular CISC. A carefully designed instruction format laid out encoding patterns that supported extensive combinations of opcodes, function specifiers, operands, indexing, and versatile addressing options. Instructions were one to five 2-byte words long (2-10 bytes). The required first word specified an opcode and other information about the instruction, including its length; up to four extension words followed it. Later models stretched the format (up to 11 words) to fit advanced features while maintaining binary compatibility within the line.

RISC, unfortunately, tended to create longer sequences of instructions from its simpler building blocks, which placed higher burdens on memory and its interconnections to processors, driving up implementation costs at first. The burdens of larger static code (i.e., as produced by a compiler) were both static (occupying more memory) and dynamic (increasing traffic between memory and processor). A more subtle one was poor locality in a memory hierarchy with very little or no fast caching. Fortunately, advances in integrated circuit technology at the time, known as VLSI (very large-scale integration), quickly addressed those concerns. The tedious instruction sequences needed to accomplish routine tasks were less of a problem in the long run as programming in assembly declined: compilers that could generate better code than most programmers, and the rising complexity of applications, made high-level programming the norm. What mattered in the end, from a user's perspective, was that code ran faster on the new hardware. RISC ended up in servers and high-performance professional workstations from the late 1980s through the 1990s. Properly sized caches eventually alleviated the code size concerns.

A perhaps less celebrated effect of RISC at the time became critical in the long run: simpler RISC processors are naturally power-efficient.

MIPS R2000/3000: a Classic

The original MIPS processor (mid-1980s), while neither alone nor the first, was a very significant RISC design in terms of its long-term influence on modern processors [see the IBM RISC pioneer link]. It was the first RISC to achieve wide commercial success, and it remains a showcase for a purist interpretation of RISC design principles. The design legacy of that processor lives on in the more recent RISC-V, a modern open ISA specification.

Beyond the history, the first MIPS designs (the first-generation ISA) famously provided a classic example of rethinking the front end to make the hardware faster.

The efficiency of the five-stage pipeline, a signature feature of the design, was a direct result of simplifying the instruction set. One might say that the design objective of the instruction set was to run efficiently on the pipeline. The following features allowed all instructions to fit the five rigid stages in exactly four clock cycles:

The severe reduction of instruction set behavior and variations resulted in a fast pipeline without interlocks (stages do not wait on each other; more broadly, in this context, an interlock is hardware that delays an instruction pending a prior result). It is in the name, an acronym for Microprocessor without Interlocked Pipeline Stages. The downside was that some details and quirks of the pipeline were programmer-visible. Famously, the load word instruction got its data word from memory after the ALU stage of the following instruction, necessitating a delay slot (i.e., the next cycle must either be filled with an independent instruction or left empty). The non-interlocked hardware did nothing to stop the following instruction from using the data word too early; the software had to intervene to prevent this pipeline hazard. Later designs interlocked the load result so that it could not be used before it was ready, reducing reliance on the compiler a bit.
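
A toy Python checker (illustrative only, not real tooling) shows what MIPS I software had to watch for: the instruction in a load's delay slot must not read the loaded register.

    # Flag uses of a loaded register in the load's delay slot (MIPS I rule).
    def check_load_delay(instrs):
        """instrs: (op, dest, sources) tuples in program order."""
        hazards = []
        for i in range(len(instrs) - 1):
            op, dest, _ = instrs[i]
            nxt_op, _, nxt_srcs = instrs[i + 1]
            if op == "lw" and dest in nxt_srcs:
                hazards.append((i + 1, nxt_op, dest))
        return hazards

    prog = [
        ("lw",  "r4", ["r29"]),       # load a word into r4
        ("add", "r5", ["r4", "r6"]),  # reads r4 one cycle too early
    ]
    print(check_load_delay(prog))     # -> [(1, 'add', 'r4')]

The fix was to move an independent instruction into the slot or insert a nop; interlocked later designs made the stall automatic.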

As expected from RISC, the fixed 4-byte instruction size yielded a larger static code size relative to an average of 3 bytes per instruction for x86, according to Dominic Sweetman in the definitive MIPS reference See MIPS Run. Performance, however, was decisively better, which is where RISC made its case at the time: the R3000, at 25 MHz, produced a SPECint92 rating of 16.1, compared to 8.35 for an Intel 80386DX at 33 MHz from the same era (the late 1980s).
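
Normalized per clock, those figures work out to 16.1/25 ≈ 0.64 SPECint92 per MHz for the R3000 versus 8.35/33 ≈ 0.25 for the 386DX, roughly a 2.5x advantage cycle for cycle, despite the larger code.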

Simplifying ILP

Rethinking the front end can also greatly reduce the complexity of the hardware that exploits instruction-level parallelism (ILP). An instruction can be encoded to pack multiple RISC-like operations instead of just one. For this approach to work, a compiler must carefully select which operations to encode in an instruction: they must be independent and able to run in parallel on the hardware. In other words, this design couples the compiler to the machine in much the same way as a hardware decoder. Removing the parts that deal with instruction dependencies from the hardware reduces its complexity; the software takes over functions that would otherwise consume large numbers of transistors, which can lead to significant power savings. This approach to ISA design results in longer sequences of bits to encode an instruction, hence the name VLIW (very long instruction word) computer. A compiler can use compaction (a technique for turning sequential code into parallel code) to cram more information into an instruction, i.e., to increase encoding efficiency. As a plus, VLIW should ideally result in a smaller static code size. TRACE, a late-1980s implementation from Multiflow, was an instructive example of VLIW's performance promise and of the software-vs-hardware complexity tradeoffs. Nowadays, the memory footprint of code is no longer the concern it once was; designers can count on the availability of plenty of cheap, fast memory.

A compiler can rearrange operations from the RISC instructions (left) into a compact sequence of VLIW words. Long, constant-length words may have empty slots due to operation dependencies. Holes in the encoding reduce its efficiency.
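
The following toy Python sketch (a greedy list-scheduling illustration, not a production algorithm) captures the essence of compaction: each operation lands in the earliest instruction word that respects its dependencies and has a free slot.

    # Pack sequential operations into fixed-width VLIW words.
    def compact(ops, width):
        """ops: (dest, sources) tuples in program order; width: slots/word."""
        words, placed_in = [], {}
        for i, (dest, srcs) in enumerate(ops):
            earliest = 0  # strictly after every op we depend on
            for j in range(i):
                d, s = ops[j]
                if d in srcs or d == dest or dest in s:  # RAW, WAW, WAR
                    earliest = max(earliest, placed_in[j] + 1)
            w = earliest
            while w < len(words) and len(words[w]) >= width:
                w += 1                    # word full; spill to the next
            while w >= len(words):
                words.append([])          # may create empty-slot words
            words[w].append(i)
            placed_in[i] = w
        return words

    ops = [("r1", ["r0"]), ("r2", ["r0"]), ("r3", ["r1", "r2"])]
    print(compact(ops, width=2))  # -> [[0, 1], [2]]: last word half empty

The half-empty final word is exactly the kind of encoding hole the figure describes.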

Ironically, the very close tie between the software and the machine's processing engine, the main strength of VLIW, is often cited as its main drawback. VLIW, by design, breaks binary compatibility with any hardware refresh, i.e., software must be recompiled, which leads to poor user adoption. As a result, VLIW was not a commercial success. The second main drawback was technical: encoding density, and consequently static code size, depends on how much parallelism the compiler can expose. Independent operations pack densely into instruction words; otherwise, they must spread across more words, leading to sections with poor utilization of parallel resources and less productive fetch-decode cycles. Researchers developed ways around these problems, including creative encoding, but the added complexity further reduced VLIW's appeal for general-purpose use.

An intriguing solution was the Crusoe processor (Transmeta, 2000). Rather than expose a general-purpose VLIW ISA, it implemented a VLIW execution engine wrapped in a software layer that emulated the x86 ISA (the designers called it Code Morphing). The software wrapper ran x86 binaries by converting them to internal VLIW instructions that ran efficiently on highly simplified hardware. The design effectively decoupled the hardware from the ISA, allowing it to evolve freely and thus addressing the main concern with VLIW. Crusoe and its successor were technically significant. While the parent company and its products disappeared, the intellectual property ended up with CPU giants Intel and Nvidia.
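
At the heart of this kind of dynamic binary translation is a translate-once, cache, and reuse loop. The Python sketch below illustrates only the idea; the guest program, the translate callback, and all names are hypothetical stand-ins, not Transmeta's actual scheme.

    # Translate-and-cache main loop of a toy dynamic binary translator.
    def run(guest_pc, guest_code, translate):
        cache = {}  # guest block address -> translated native block
        while guest_pc in guest_code:
            if guest_pc not in cache:                  # translate once...
                cache[guest_pc] = translate(guest_code[guest_pc])
            guest_pc = cache[guest_pc]()               # ...then just reuse
        # Hot blocks pay the translation cost once, then run at native speed.

    guest_code = {0: "inc; jmp 4", 4: "dec; halt"}     # fake guest blocks
    translate = lambda block: (lambda: 4) if "jmp" in block else (lambda: None)
    run(0, guest_code, translate)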

The long four-slot instruction format of the Crusoe processor reflects four parallel hardware units.

In Summary

Simplifying the front end, the instruction set and the access model behind it, repeatedly proved one of the most effective ways to speed up the machine underneath. RISC pared instructions and addressing modes down to the essentials and let software synthesize the rest; VLIW went further, moving dependency analysis into the compiler. In each case, the gain came from relocating work across the software/hardware line rather than eliminating it.

A map of the main abstraction layers in front of the parts that perform computations: the line between software and hardware is not, in reality, as rigidly drawn as it may appear (by design). The figure shows where different front-end pieces may fit rather than any single setup.

Breaking a system into pieces hidden by abstractions with well-defined interfaces allows machine builders to control/manage complexity and grow the system without hurting stability.

Recent Trends

It is perhaps a mistake to underestimate what microarchitectural innovations alone can achieve. While RISC enjoyed significant performance advantages for some time, mainstream x86-based designs eventually caught up with and surpassed the fastest RISC chips. Design and production costs matter less when economies of scale are on your side, and they drop sharply at huge volumes, which was the case with Intel. RISC eventually picked up complex superscalar out-of-order hardware to compete [see Links for Ditzel comments]. Abundant, cheap RAM and cache long ago addressed concerns about code size in memory and the related locality issues. In 2022-23, some high-end consumer computers offered more on-chip memory in the form of cache than expensive workstations of the early 1990s had RAM. Power consumption, however, was where hardware designed to run simplified instruction sets paid off in the long run, and x86 could not compete (as of this writing).

Nowadays, RISC processors, mainly based on ARM designs, are a significant part of the power-sensitive everyday mobile experience for hundreds of millions of people globally. The mobile chips worked in hand-held devices at first; they were extremely power efficient but lacked the processing power for anything beyond smartphones and tablets. In recent years, some brands of popular portable personal computers moved to enhanced versions of mobile RISC processors. The new chips were competitive in performance with mainstream x86 processors while offering longer battery life in the laptop form factor.

On the higher end of the computing spectrum, the ARM-based Fugaku was the top supercomputer for three years until 2022, according to the TOP500.org list. In the meantime, interest continues to grow in RISC-V as a platform for developing next-generation power-optimized HPC processors. [See: European Processor Initiative, Barcelona Supercomputing Center, and CHIPS Alliance.] In 2017, the HPC-class PEZY-SC2 processor used MIPS-based parts (P6600). The moves to RISC in HPC are part of a push toward green supercomputing.