The Processor
Early processors had to deal with instruction sequences generated by compilers: programs that convert a logical specification of operations and operands (in FORTRAN, for example) into binary machine code. A compiler constructs code according to conventions specific to a machine. The machine code, in essence, is one plausible scheduling of machine instructions (hence, perhaps, the term program) that results in a correct execution. A compiler writer would apply sensible optimizations as they saw fit, so different compilers could produce different but logically equivalent sequences from the same source program.
Growing Processing Power
From the viewpoint of a processor, code from various sources made up a stream of instructions to decode and execute. The operands processed by some of those instructions formed a stream of data. Advanced processors worked on multiple data streams to speed up vector and matrix math. The streams of instructions and data came from machine programs written to specs and executed in ways that closely matched how the early processors organized their internal hardware resources.
As demand for faster processing grew rapidly, early designers started to tinker with the order of instructions under execution to improve performance, diverging more and more from compiler-generated schedules. It became clear that the submitted machine code had to change to keep up with the hardware. For that to happen, users had to upgrade their compilers and regenerate their code. Programs written in assembly language, a substantial part of the codebase at the time, had to change in ways that required programmers to understand increasingly complex hardware. Beyond the hassle, the combination of increased user costs, the limited compiler technology of the time, commercial concerns, and intense competition made improving the hardware while keeping user machine code intact an appealing option, if not a necessity.
A related early challenge was filling execution cycles to keep a processor busy (fully utilized). Conditionals often took a while to resolve before the branch could finally execute. If a processor could guess the right direction to pursue, it could avoid a potentially costly stall; a wrong guess could cost even more. Conditional branches were frequent, so there could be a net improvement if the guesses were correct often enough. A similar situation arose when a processor had to wait for a data dependency to resolve or for memory to respond. The guesses could come from a compiler, the processor, or a combination of the two. The clear advantages of processor-based guessing are code transparency and improved effectiveness: the software layer does not need to know, and the hardware can guess the direction of a branch based on its recent behavior, which varies from run to run.
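To make the idea concrete, here is a minimal sketch of the kind of dynamic guessing hardware performs, written in C. It models the classic two-bit saturating counter per branch, indexed by the branch address; the table size, names, and addresses are illustrative and not taken from any particular processor.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative 2-bit saturating-counter predictor (not from any real core).
   Counter values 0,1 predict "not taken"; 2,3 predict "taken". */
#define TABLE_SIZE 1024
static uint8_t counters[TABLE_SIZE];   /* all start at 0: strongly not taken */

static bool predict(uint64_t branch_pc) {
    return counters[branch_pc % TABLE_SIZE] >= 2;
}

/* Called once the branch resolves: nudge the counter toward the observed
   outcome so the next guess reflects recent behavior. */
static void update(uint64_t branch_pc, bool taken) {
    uint8_t *c = &counters[branch_pc % TABLE_SIZE];
    if (taken && *c < 3) (*c)++;
    else if (!taken && *c > 0) (*c)--;
}

int main(void) {
    /* A loop branch taken 9 times out of 10: after brief training,
       the predictor guesses "taken" for it. */
    for (int trip = 0; trip < 3; ++trip)
        for (int i = 0; i < 10; ++i)
            update(0x400123, i < 9);
    return predict(0x400123) ? 0 : 1;
}
```

Because the guesses depend only on behavior the hardware observes at run time, old binaries benefit without any change to the code.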
Directions: A Seminal Survey
Since the early 1960s, designers have developed a family of techniques, based on a wide range of tricks, to speed up the execution of streams of instructions and data. Rau and Fisher surveyed 30 years of that work in 1992. Some techniques created alternate internal instruction schedules that could execute more efficiently than those originating from compilers. The hardware changes were transparent, and old binaries could continue to run unmodified. Others involved the compiler as a tradeoff to simplify the hardware and, therefore, make it faster. In that case, the compiler may have to perform complex analysis and modeling of the code and its runtime behavior. Unsurprisingly, designs that required programs to change with the hardware proved less attractive to users, who would have to pay for new code every time they updated the hardware.
Hiding the Hardware
While a compiler with intimate knowledge of the hardware could produce optimal instruction scheduling and register allocation for operands, some optimizations were more effective when performed at run time. Those techniques fit naturally with designs that favored hiding enhancements in the hardware. Hiding the changes also eased binary compatibility concerns.
Soon, some processors had to deal with another internal stream of instructions that could better utilize new enhancements while maintaining stable, relatively simple instruction specs for programmers to target. More importantly, programs already written to those specs continued to run unchanged, with a boost in performance. There were, however, two main disadvantages. First, substantial enhancement efforts often yielded modest returns for users in real-world programs; they had to rely on cumulative generational improvement. Second, hardware complexity occasionally got out of hand, forcing drastic redesigns. The approach nevertheless led to some of the most commercially successful and broadly used processors (the x86 line from Intel and AMD in microcomputers and, from the 1970s, the VAX line from Digital in minicomputers).
Thus, a modern processor may routinely process two instruction streams with distinct characteristics: the first is compiler-scheduled; the second is generated dynamically inside the processor during execution.
Simplifying the Hardware
Meanwhile, designs that simplified the hardware to increase performance continued to develop and find significant applications. The most successful were RISC-based. A RISC processor dealt with a smaller set of essential, highly regular instructions and addressing modes. As a result, instructions could be executed quickly with low front-end overheads. Other designs explicitly advised the hardware on how to process a user program. In some of those designs, the machine instructions declared which operations could run in parallel, removing substantial complexity from the hardware (a sketch of the idea follows below). More bits were required to encode the extra information in unusually long instruction words, giving rise to the term very long instruction word computer (VLIW; more here in Simplifying ILP). Both approaches relied on innovations in instruction design that a compiler could use to get higher performance from simpler and faster hardware.
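As an illustration of the VLIW idea, the sketch below models, in C, an instruction word holding several independent operation slots that the compiler fills. The slot count, field names, and opcodes are hypothetical, not those of any real VLIW machine.

```c
#include <stdint.h>

/* Hypothetical 4-slot VLIW word: the compiler guarantees the slots are
   independent, so the hardware can issue them all in the same cycle. */
enum opcode { OP_NOP, OP_ADD, OP_MUL, OP_LOAD, OP_BRANCH };

struct slot {
    enum opcode op;
    uint8_t dest, src1, src2;          /* register numbers */
};

struct vliw_word {
    struct slot slots[4];              /* e.g., two ALU ops, one memory op, one branch */
};

/* A compiler-scheduled bundle: the operations touch disjoint registers, so no
   dependency checking is needed at run time. Slots the compiler cannot fill
   are padded with NOPs, one cost of pushing scheduling into software. */
static const struct vliw_word bundle = {
    .slots = {
        { OP_ADD,  1, 2, 3 },          /* r1 = r2 + r3 */
        { OP_MUL,  4, 5, 6 },          /* r4 = r5 * r6 */
        { OP_LOAD, 7, 8, 0 },          /* r7 = mem[r8] */
        { OP_NOP,  0, 0, 0 },
    },
};

int main(void) {
    return bundle.slots[0].op == OP_ADD ? 0 : 1;
}
```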
The VLIW designs proved less appealing and seemed to go nowhere. RISC processors, in the meantime, were popular in scientific and engineering workstations: they were faster than mainstream processors initially and remained so for a while. They were most successful commercially in professional graphics workstations used for 3D rendering, whose users tended to be less sensitive to legacy binaries. In the long run, the RISC approach proved valuable not for its raw performance but as a naturally power-efficient design platform. Today, the adoption of RISC is rising because of the good performance-to-power ratios that RISC designs offer.
In Summary
Managing dependencies among operations, operands, and decisions is critical. They are the main obstacle to speeding up processing: they waste execution cycles and reduce the utilization of machine resources. They arise from the restrictions imposed on the order in which data is processed, from competition over registers and functional units, or from waiting for decisions to resolve. Some originate in the requests made by programs when they specify the logic required to solve a problem; the code produced by a compiler may help remove some of those. Others go back to the limitations imposed by the machine (its resources and how the designers chose to organize them). A hardware-aware compiler may assist in addressing some of those as well, but the hardware has to deal with the rest dynamically (from run to run). Alternatively, the compiler may remove all the dependencies it can so the hardware can focus on execution. In all cases, the machine must ensure a correct execution of the program's logic.
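As a small illustration in plain C (not tied to any particular machine), the snippet below shows a true (read-after-write) dependency that forces ordering, next to two independent statements that a compiler or the hardware is free to reorder or execute in parallel.

```c
/* Illustrative only: dependencies in straight-line code. */
int example(int a, int b, int c) {
    int x = a * b;     /* (1) produces x                              */
    int y = x + c;     /* (2) reads x: true (RAW) dependency on (1)   */

    int p = a + 7;     /* (3) independent of (1) and (2)...           */
    int q = b - 3;     /* (4) ...and of (3): these two can be         */
                       /*     reordered or issued in the same cycle   */
    return y + p + q;
}

int main(void) {
    return example(2, 3, 4) == 19 ? 0 : 1;
}
```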
Effective management of dependencies is the key to efficient processing. Most major architectures and execution tricks were motivated mainly by the need to deal with that concern.
Moreover, the machine has to maintain two logical binary states (on top of a raw physical one, implemented perhaps primarily via transistors; more here in From Physical Device to Code). One of those states is public, presented to the machine programmers; the machine keeps the other hidden internally and manages it in private. Programmers manipulate the external state (they write programs). The programmer state is composed of a memory model (address conventions and ranges), hardware operands (machine registers), and branch instructions that selectively overwrite the PC (alter execution flow). The hidden state is often more complex. The term architectural state refers to the programmer state; the hidden one is called the microarchitectural state.
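A minimal sketch of the distinction, with entirely illustrative fields (no real processor is being described): the architectural state is what an instruction set manual documents, while the microarchitectural state holds internal bookkeeping such as caches, predictor tables, and in-flight speculative results.

```c
#include <stdint.h>

/* Architectural (programmer-visible) state: documented in the instruction set manual. */
struct arch_state {
    uint64_t regs[32];                /* general-purpose registers  */
    uint64_t pc;                      /* program counter            */
    uint64_t flags;                   /* condition codes, mode bits */
};

/* Microarchitectural (hidden) state: managed privately by the hardware.
   Field names and sizes are made up for illustration. */
struct uarch_state {
    uint8_t  branch_counters[1024];   /* branch-predictor table          */
    uint64_t cache_tags[512];         /* which memory lines are cached   */
    uint64_t reorder_buffer[128];     /* in-flight, speculative results  */
};

int main(void) {
    /* The hidden state is typically far larger and more complex. */
    return sizeof(struct arch_state) < sizeof(struct uarch_state) ? 0 : 1;
}
```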
A security concern arises when information from the hidden state shows through, or leaks into, the programmer state (i.e., becomes observable to programs).
The Modern CPU
Modern computers come in many forms. They range from small hand-held devices to massive clusters of nodes, each a powerful computer in its own right. A typical computer will have many autonomous processors, called cores, inside a tiny chip package utilizing billions of transistors. In 2023, a high-end personal computer may have up to 16 full cores, or up to 24 in a mixed-core configuration, in that space.
Each core in a modern CPU is a powerful general-purpose processor with its own control. Like its monolithic predecessors, a core routinely implements advanced processing techniques. It typically supports a form of SIMD instructions to speed up processing sets/vectors of data items in crucial applications. The cores share a fast cache memory in addition to their own private caches. Cores, private caches, and the shared cache are coupled tightly by a high-throughput interconnect on the same chip. The cache memories maximize the flow of bits between the cores and the typically off-package main system memory. A core, in essence, is what used to be termed a CPU in early single-processor computers; today, the industry refers to the multicore chip package as the CPU.
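To illustrate what SIMD support looks like from software, here is a small example in C using the x86 AVX intrinsics (the build flag and the assumption that the length is a multiple of eight are illustrative): one instruction operates on eight floats at a time instead of one.

```c
#include <immintrin.h>   /* x86 AVX intrinsics; build with e.g. -mavx */

/* Add two float arrays. n is assumed to be a multiple of 8 to keep the
   sketch short; real code would also handle the leftover elements. */
void add_arrays(const float *a, const float *b, float *out, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);   /* load 8 floats          */
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb);    /* 8 additions in one op  */
        _mm256_storeu_ps(&out[i], vc);        /* store 8 results        */
    }
}
```

Compilers can often generate the same vector instructions automatically from a plain scalar loop; the intrinsics simply make the data-parallel shape of the work explicit.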
In addition to conventional cores, a typical CPU chip package contains a multilevel cache and perhaps a graphics processor. Some, as of 2022-23, may add a tensor math processor to speed up AI workloads. [See Links.] A multicore package may also include other components of a computer system, giving rise to the term system-on-a-chip (SoC). Mobile and embedded systems use power-optimized SoC designs. A multicore package is effectively a MIMD computer. Its core processors may work separately on independent workloads or process the same one in parallel. They typically run a multi-tasked workload of numerous threads from many programs under OS control, providing true concurrency.
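A brief sketch of what MIMD-style use of the cores looks like from a program, using POSIX threads in C; the thread count and the work function are stand-ins, and the OS is free to schedule each thread on a different core.

```c
#include <pthread.h>
#include <stdio.h>

/* Each thread runs its own instruction stream on (potentially) its own core. */
static void *work(void *arg) {
    long id = (long)arg;
    long sum = 0;
    for (long i = 0; i < 1000000; ++i)        /* stand-in for a real workload */
        sum += i % (id + 2);
    printf("thread %ld done (sum=%ld)\n", id, sum);
    return NULL;
}

int main(void) {
    pthread_t threads[4];
    for (long t = 0; t < 4; ++t)
        pthread_create(&threads[t], NULL, work, (void *)t);
    for (int t = 0; t < 4; ++t)
        pthread_join(threads[t], NULL);
    return 0;
}
```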
Recent Trends
As of this writing, innovation in processors seems to be driven by three main trends: deep power efficiency, expanded demand from specialized workloads, and advances in shrinking integrated circuits. None of these trends is new. Recently, however, they have been changing how processing itself is perceived. For example, demand at all levels for more power efficiency, once a primarily mobile concern, is reshaping how desktop performance is judged, and established designs are changing in significant ways as a result. In the following, we look briefly at the effects of each trend as of this writing.
More Power Efficiency
The conventional multicores from Intel, so dominant in the business and consumer space, have been shifting since 2021 toward a heterogeneous core architecture, a design choice once mainly confined to power-sensitive devices such as smartphones and tablets. The cores in this design are not identical, so a multicore can provide a mix of high-performance and power-efficient cores: the powerful ones deliver performance when needed, while the power-efficient cores handle less demanding tasks. The new CPUs target higher performance levels than the smartphone pioneers, and users of these devices increasingly expect more performance at deeper power savings.
More Special Workloads
Support for specialized workloads has been growing in mainstream processors since the 1990s. In those days, it involved instruction set extensions focused on audio/video applications; the changes added SIMD instructions to speed up those workloads. Today, the growing demands of real-time security and artificial intelligence applications, especially various forms of machine learning, are again driving instruction set extensions. The specialized performance demands of those workloads also motivate other architectural changes, ranging from reworked functional units (matrix multiplication support in 12th-gen Intel Core processors, for example) to specialized tensor math cores. The trend is also significant at the high-performance end: supercomputers increasingly rely on mainstream parts and technologies rather than expensive custom ones to reduce costs.
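For context, the kernel those matrix units accelerate is, at heart, the classic triple loop below (a plain C sketch; production libraries and tensor hardware restructure it heavily for caches, SIMD, and parallelism).

```c
/* Naive C = A * B for n x n matrices: the textbook form of the kernel
   that SIMD extensions and tensor units are built to speed up. */
void matmul(int n, const float *A, const float *B, float *C) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < n; ++k)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
}
```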
Smaller Electronics
The trend toward smaller integrated circuits (ICs) continues, with significant consequences for the previous two. Smaller circuits consume less power and can accommodate more logic economically. Historically, ICs got smaller through a mixture of shrinking components (chiefly transistors), creative layering of circuits in the vertical plane, and other miniaturization improvements. Shrinking the transistors was the primary indicator of that progress. Traditionally, the industry reported transistor size as a 2D measurement of gate width (recently in nanometers). The metric suited the mainly planar (2D) transistors of the time and correlated well with the resulting transistor counts and densities. As transistors gradually took on vertical structure, the 2D gate width related less and less to the actual increase in transistor count. The industry kept the old metric for marketing reasons, but the numbers lost their original meaning and were no longer comparable; the nm figures quoted by manufacturers nowadays are mainly marketing terms.
In 2011, Intel hit a milestone with the first 3-dimensional transistor to go into production, offering better switching in a smaller planar (2D) footprint. It had been under development for a decade, since TSMC demonstrated the technology in 2002. Experts predicted in 2016 that shrinking transistors via physical gate length would stop by 2021 [see IEEE Spectrum 2016 article]. IBM and TSMC reported breakthroughs in IC miniaturization in 2021, both cramming more transistors even closer together. Either way, transistor densities continue to rise, perhaps nowadays more due to innovations in shaping and stacking transistors vertically. [See the 3D CMOS article.] In the meantime, a new transistor, representing the most substantial change to the original 3D structure from a decade ago, started to go into production in 2022, a milestone yielding yet another significant size reduction. Moore's Law seems very much in effect, with quite a ways to go.
Rethinking Speculative Execution
Speculative execution is a method to increase performance by executing code based on guesses about the outcomes of some instructions rather than waiting for them to resolve. A guess may turn out to be wrong, of course, and recovering from wrong speculation could be costly enough to slow execution on balance. In the long run, however, carefully implemented speculation proved an effective way to address a main performance bottleneck that Flynn had identified long ago. Successful implementations appeared in almost all modern processors and became an unquestioned constituent of performance until serious security issues came to light. Two attacks, colorfully named Spectre and Meltdown, were found in 2017 and went public in early 2018. Many more have been, and continue to be, discovered since.
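The widely published Spectre variant-1 example gives the flavor of the problem. The sketch below is a simplified C rendition (the array names and probe stride follow the conventional published illustration; it is not a working exploit). Under speculation, the body may run with an out-of-bounds x before the bounds check resolves, and the access into array2 leaves a cache footprint that an attacker can later recover by timing memory accesses.

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified Spectre variant-1 pattern (illustrative, not a working exploit). */
uint8_t array1[16];
uint8_t array2[256 * 4096];          /* probe array: one cache line per byte value */
size_t  array1_size = 16;
volatile uint8_t sink;               /* keeps the probe read from being optimized out */

void victim(size_t x) {
    if (x < array1_size) {            /* the bounds check the CPU may guess past   */
        uint8_t secret = array1[x];   /* speculatively reads out of bounds         */
        sink = array2[secret * 4096]; /* leaves a data-dependent cache footprint   */
    }                                 /* in the hidden (microarchitectural) state  */
}
```

The architectural results of the mispredicted path are discarded, but the microarchitectural traces (which lines are now cached) are not: exactly the hidden-to-public leak described earlier.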
Historically, the main concern in designing processors was speed. Old-school speed came at a significant power cost, at times too steep to accept, as in mobile devices. Speculative execution was one of those execution tricks born of need-for-speed thinking, and speed continued to be the priority for that part of the design even as processors became increasingly power efficient. Furthermore, monitoring and controlling a processor was historically limited by physical access; it is no wonder that the growth of malicious software closely followed that of networking and remote code execution. The processor was increasingly exposed, and although it took a while before it became the target of attacks, it was perhaps only a matter of time, with the spread of remote code execution over the Internet, before someone would expose such weaknesses.
Should the security issues be considered flaws in how designers implemented speculation for so long? Or were they forgotten consequences of tradeoffs that favored performance, made in an era when unauthorized access to code running on a processor was less likely, when an intruder needed some physical access to run code? Whatever the answer, processor designers need to address the security concerns now. They can dispense with speculation, patch up current implementations as they go, or rethink speculation for an era of relatively easy, covert, remote access to the processor from virtually anywhere. In addition to speed and power, they should perhaps add security as a chief factor in the tradeoffs behind the design of this crucial feature.