Advanced Computer Architecture

Big Picture Summary 5

Updated Friday November 20, 2020 8:28 PM GMT+3

Revised 11/19/2020 - This article is not standalone. It's meant to cap a detailed reading of a similarly titled set of technical papers (see the course calendar).

High Performance

Studying typical instruction sequences has been a central part of computer architecture ever since Amdahl published his seminal work, in which he argued for a particular way of building computers based on a study of the likely workloads of the time. His perspective was that of an engineer who has to deliver value to users.

Examining instruction sequences is essential for understanding performance issues and for developing suitable solutions (i.e., building accordingly). Some of those sequences may be common, but most are workload-specific: they depend on the types of programs expected to run frequently. Identifying performance-relevant patterns is especially useful for building specialized machines such as supercomputers (the same applies to machines designed to run games).
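
As a concrete illustration of this kind of workload characterization, the sketch below tallies an instruction mix from a textual trace. The trace format and the opcode categories are assumptions invented for this example, not a standard; a real study would use the actual ISA and trace format at hand.

from collections import Counter

# Hypothetical mapping from opcodes to broad categories; a real ISA and a
# real trace format would need their own tables and parsing rules.
CATEGORY = {
    "add": "alu", "sub": "alu", "mul": "alu",
    "ld": "memory", "st": "memory",
    "beq": "branch", "jmp": "branch",
}

def instruction_mix(trace_lines):
    """Count how often each instruction category appears in a trace."""
    counts = Counter()
    for line in trace_lines:
        opcode = line.split()[0]              # first token is the opcode
        counts[CATEGORY.get(opcode, "other")] += 1
    return counts

# Tiny made-up trace: one instruction per line, opcode first.
trace = ["ld r1, 0(r2)", "add r3, r1, r4", "st r3, 4(r2)", "beq r3, r0, L1"]
print(instruction_mix(trace))  # Counter({'memory': 2, 'alu': 1, 'branch': 1})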

Instruction traces are the next logical step. A trace captures the essential characteristics of an interesting aspect of the instruction stream: in some cases the dependencies the stream generates, in others its memory access patterns.
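
To make those two aspects concrete, here is a minimal sketch that scans a made-up trace for read-after-write register dependencies and for memory access strides. The trace representation is assumed purely for illustration.

def raw_dependencies(trace):
    """Report read-after-write dependencies: for each register an
    instruction reads, which earlier instruction last wrote it.
    Each trace entry is (destination_register, [source_registers])."""
    last_writer = {}
    deps = []
    for i, (dest, sources) in enumerate(trace):
        for src in sources:
            if src in last_writer:
                deps.append((last_writer[src], i))  # (producer, consumer)
        last_writer[dest] = i
    return deps

def access_strides(addresses):
    """Differences between consecutive memory addresses in a trace;
    a constant stride suggests a regular, prefetch-friendly pattern."""
    return [b - a for a, b in zip(addresses, addresses[1:])]

# Made-up examples.
trace = [("r1", []), ("r2", ["r1"]), ("r3", ["r1", "r2"])]
print(raw_dependencies(trace))           # [(0, 1), (0, 2), (1, 2)]
print(access_strides([0, 8, 16, 24]))    # [8, 8, 8] -- constant 8-byte stride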

Proper benchmarking is about generating the right instruction streams to gauge performance realistically. Good benchmark programs guide the design and help optimize for valuable workloads. Unfortunately, benchmarking has historically been abused to misrepresent performance for various reasons. For users, the best benchmarking relies on real programs they are likely to run, using whole programs or carefully extracted representative parts. Such benchmarks must evolve to keep up with changing needs and to reflect expected user performance realistically.
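
As a small illustration of the basic discipline, the sketch below times a workload with repeated runs and reports the median. The kernel here is a stand-in; a real benchmark would run whole programs that users actually care about, or carefully extracted regions of them.

import statistics
import time

def kernel(n=200_000):
    """Stand-in workload; a real benchmark would execute a representative
    program or an extracted region of one."""
    return sum(i * i for i in range(n))

def benchmark(fn, repeats=9):
    """Run fn several times and report the median wall-clock time, which is
    less sensitive to one-off interference than a single measurement."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return statistics.median(times)

print(f"median kernel time: {benchmark(kernel):.6f} s")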

Interestingly, a modern cluster supercomputer typically combines the major multiprocessing architectures in various ways. A so-called fat node in a cluster, originally named for having lots of memory, is now increasingly a classic tightly-coupled multicore with up to 100 conventional processors. Alternatively, a fat node may rely on a massively parallel GPU, or combine one with a high-core-count processor, effectively a supercomputer on its own. Such arrangements are attractive because we can fit a substantial computation in a single node, or spread it across fewer nodes, reducing the typically high overhead of cluster-wide communication.
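
As a rough back-of-envelope illustration of that trade-off, the toy model below assumes a fixed amount of perfectly parallel work plus a communication cost that grows with the number of nodes involved. All constants and units are invented; the point is only to show why packing the same core count into fewer, fatter nodes can pay off.

def estimated_time(total_work, cores_per_node, nodes, comm_cost_per_node=0.05):
    """Toy model: ideal parallel compute time plus a communication term
    that grows linearly with the number of participating nodes.
    The constants are arbitrary and for illustration only."""
    compute = total_work / (cores_per_node * nodes)
    communication = comm_cost_per_node * (nodes - 1)
    return compute + communication

work = 1000.0  # arbitrary work units

# Same total core count (128), packaged differently.
print(estimated_time(work, cores_per_node=8,  nodes=16))  # many thin nodes: ~8.56
print(estimated_time(work, cores_per_node=64, nodes=2))   # two fat nodes:   ~7.86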

Configuration flexibility helps scalability over a wide range of workloads. Whether software can utilize all that power is another issue. Even if we address the parallelization and operational challenges, workloads don't always cooperate. Workload scalability remains a challenge.
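
One way to see why workloads don't always cooperate is Amdahl's own argument: the serial fraction of a program bounds its speedup no matter how many processors are added. The sketch below simply evaluates that formula for an assumed serial fraction of 5%.

def amdahl_speedup(serial_fraction, processors):
    """Amdahl's law: speedup = 1 / (s + (1 - s) / p)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

# With even 5% serial work, 100 processors deliver well under 20x.
for p in (10, 100, 1000):
    print(p, round(amdahl_speedup(0.05, p), 1))
# 10 -> 6.9, 100 -> 16.8, 1000 -> 19.6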