Multiprocessors add a challenge to the goal of a successful cache, which is to maximize productive traffic between processor and memory. Designers must balance that goal against obtaining a proportionate increase in performance as more processors join the same computation. This concern is called scalability. Moreover, as we have seen, a multiprocessor generates overhead traffic on the various paths between memory and processors in order to maintain coherence across private caches holding the same blocks. This overhead further reduces the returns from multiprocessing.
Scalability is crucial for multiprocessors because it is the reason why we pay top money for machines with more processors!
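To make the idea of diminishing returns concrete, the following is a minimal sketch under assumptions that are not from the text: single-processor run time is normalized to 1, a fixed fraction of the work is serial, the rest divides evenly among p processors, and coherence and communication overhead grows linearly with each added processor. The function names and default parameter values are hypothetical and chosen only for illustration.

```python
def speedup(p, serial_fraction=0.05, overhead_per_processor=0.002):
    """Speedup on p processors relative to one processor.

    Assumed model (illustrative only): total single-processor time is 1,
    the parallel part divides evenly by p, and overhead traffic costs a
    fixed amount of time for every processor beyond the first.
    """
    parallel_time = (
        serial_fraction
        + (1.0 - serial_fraction) / p
        + overhead_per_processor * (p - 1)
    )
    return 1.0 / parallel_time


def efficiency(p, **kwargs):
    """Fraction of the ideal (linear) speedup actually obtained."""
    return speedup(p, **kwargs) / p


if __name__ == "__main__":
    for p in (1, 2, 4, 8, 16, 32, 64):
        print(f"p={p:3d}  speedup={speedup(p):6.2f}  efficiency={efficiency(p):5.2f}")
```

With these assumed parameters, efficiency falls steadily as processors are added, and speedup itself eventually declines once the overhead term outweighs the shrinking parallel term. A real machine would need measured values in place of the two assumed constants.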
Lamport clarified fundamental concerns that could only make the problem worse. The interprocessor traffic that results from enforcing the necessary sequential consistency requirement can also reduce the returns on adding more processors. The classic operating system techniques that have been suggested, however, may not be feasible in an environment based on traditional processors.
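Sequential consistency, in Lamport's definition, requires every execution to look as if all memory operations ran in one global order that respects each processor's program order. The sketch below illustrates this with a small two-thread example that is not taken from the text: each thread writes one variable and then reads the other. The operation encoding and all names (THREAD_0, THREAD_1, interleavings, run, sc_outcomes) are hypothetical.

```python
from itertools import combinations

# Each thread is a list of operations in program order.
# ("w", var, value) writes a value; ("r", var, reg) reads var into a register.
THREAD_0 = [("w", "x", 1), ("r", "y", "r0")]
THREAD_1 = [("w", "y", 1), ("r", "x", "r1")]


def interleavings(a, b):
    """Yield every merge of a and b that preserves each list's own order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        slot_set = set(slots)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slot_set else next(ib) for i in range(n)]


def run(schedule):
    """Execute one global order against a single shared memory."""
    memory = {"x": 0, "y": 0}
    registers = {}
    for kind, var, arg in schedule:
        if kind == "w":
            memory[var] = arg
        else:
            registers[arg] = memory[var]
    return registers["r0"], registers["r1"]


def sc_outcomes():
    """All (r0, r1) results reachable under sequential consistency."""
    return {run(s) for s in interleavings(THREAD_0, THREAD_1)}
```

Enumerating the six legal interleavings shows that both reads can never return 0 together. A processor that lets a read complete ahead of its own earlier, still-buffered write could produce that forbidden result, so hardware must stall or exchange extra messages to rule it out, which is one source of the traffic discussed above.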
Even if designers successfully address architectural overheads, there is still legitimate traffic resulting from computation-related communication among processors. Some of it may be due to poor parallelization, but most stems from workload characteristics. Interprocessor communication overhead therefore remains a concern for scalable multiprocessing.
The GPU is an example of a massively parallel (or fine-grained) multiprocessor. Its roots go back to machines such as the MasPar of the 1990s. This architecture connects a large number of tiny processors (called processing elements) to a high-bandwidth shared memory. The multiprocessor relies on a conventional (host) processor to provide a familiar operating, programming, and I/O front-end environment. It is most suited to workloads consisting of a large number of small computations, and so supports both a multiplicity of data and a multiplicity of instructions.
Repetitive small computations emphasize a multiplicity of data, leading to an affinity with SIMD, which was probably the original reason for using this architecture class to accelerate graphics. Such applications encourage limiting the capabilities of processing elements. Specialized elements tend to be smaller, so a GPU can provide more of them. Coordinated execution of many repetitive graphics primitives, involving lots of data and few low-overhead instructions, invites comparison with classic vector processors.
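The SIMD pattern described above can be sketched as one instruction applied to many data elements in lockstep. The example below is a simplified illustration rather than a model of any particular GPU: a Python list stands in for the processing elements' lanes, a Boolean mask stands in for lanes being switched off for an instruction, and the names simd_step and scale_and_clamp are hypothetical.

```python
def simd_step(op, lanes, mask=None):
    """Apply one instruction (op) to every active lane in lockstep.

    Inactive lanes keep their old value, which is how a SIMD machine can
    handle a data-dependent branch without separate instruction streams.
    """
    if mask is None:
        mask = [True] * len(lanes)
    return [op(v) if active else v for v, active in zip(lanes, mask)]


def scale_and_clamp(pixels, gain, limit):
    """Brighten every pixel, then clamp only those that exceed the limit."""
    # Instruction 1: all lanes multiply by the same gain.
    lanes = simd_step(lambda v: v * gain, pixels)
    # Instruction 2: only lanes above the limit are active.
    mask = [v > limit for v in lanes]
    return simd_step(lambda v: limit, lanes, mask)
```

For example, scale_and_clamp([10, 100, 200], 2, 255) scales all three values with the first instruction and then overwrites only the third with the limit. The processing elements never need their own instruction fetch or branch logic, which is consistent with keeping them small and numerous.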
A steady supply of transistors (predicted or motivated by Moore's law) supported a trend of processing elements growing both in number and in relative capability. That trend in turn drove general scientific workloads to grow in importance, making the MIMD aspects of the architecture more interesting.