Advanced Computer Architecture, Comparative Architectures Past Paper

Textbook: Hennessy, J. L. and Patterson, D. A. Computer Architecture: A Quantitative Approach. Morgan Kaufmann/Elsevier (3rd/4th/5th ed.; 5th ed. 2012)

Design goals and constraints

Early computers exploited Bit-Level Parallelism.

                     | superscalar | VLIW (compiler) | SIMD | multi-core | DSA
parallelism          | static or dynamic ILP, MLP | static ILP, MLP | DLP | TLP | custom
features             | instruction fetch, dynamic scheduling; prefetch; memory addr. alias; physical regs. rename | sw. speculate; var./fixed bundle: N ops., independent, same FU, disjoint regs., known mem. access | – | fine/coarse-grained vs. SMT | specialized
instr. count (IF/DE) | ↑, out-of-order | one VLI | var. | – | custom
branch handling      | dynamic branch pred. | limited | poor (predication) | per-core | custom
limitations          | fabrication; see below | tailored to one pipeline | data-level tasks only | Amdahl's Law | inflexible
hardware             | cost/area | – | vector regs. / FUs | pipeline regs. | custom
interconnect         | ↑ (in single core) | wide data bus | lanes | mesh / cache coherence | scratchpad
energy cost          | var. | – | ↓ | – | ✰
binary compatibility | – | – | □ (✓ VLA) | – | –
use cases            | CPU, general-purpose | embedded, GPU | ML, graphics | CPU, server, SoC | TPU, DSP

Limitations of Moore's law

Die stacking

Fundamentals of Computer Design

Common case first
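
"Make the common case fast" is quantified by Amdahl's Law. A minimal sketch (the function name is my own):

```python
def amdahl_speedup(f_enhanced: float, s_enhanced: float) -> float:
    """Overall speedup when a fraction f_enhanced of execution time
    is sped up by a factor of s_enhanced (Amdahl's Law)."""
    return 1.0 / ((1.0 - f_enhanced) + f_enhanced / s_enhanced)

# Speeding up 50% of execution time by 10x gives well under 2x overall:
print(amdahl_speedup(0.5, 10))   # 1 / (0.5 + 0.05) ~= 1.82
```

The same law bounds multi-core scaling (see "limitations" in the table above): the serial fraction caps speedup no matter how many cores are added.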

Energy, power vs. parallelism, performance

Instruction Set Architecture (ISA)

Predicated operations

Scalar pipeline

Multiple / Diversified pipelines

Deep pipeline and optimal pipeline depth
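
A toy model of the depth trade-off (the parameter names and whole-pipeline-flush assumption are my own simplifications; published optimal-depth studies use more detailed models): deeper pipelines shorten the cycle but lengthen the misprediction penalty.

```python
import math

def time_per_instruction(k, t_logic, t_latch, flush_rate):
    """Cycle time (t_logic/k + t_latch) times average cycles per
    instruction (1 + flush_rate * k), assuming the whole k-stage
    pipeline is flushed on each mispredicted branch."""
    return (t_logic / k + t_latch) * (1 + flush_rate * k)

def optimal_depth(t_logic, t_latch, flush_rate):
    """Minimising the expression above in k gives
    k_opt = sqrt(t_logic / (flush_rate * t_latch))."""
    return math.sqrt(t_logic / (flush_rate * t_latch))

# e.g. 10ns of logic, 0.5ns latch overhead, a flush every 20 instructions:
print(optimal_depth(10, 0.5, 0.05))   # 20.0 stages under this toy model
```

Both shallower and deeper pipelines do worse: time_per_instruction(10, ...) and time_per_instruction(40, ...) each exceed the value at k = 20.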

Branch prediction

benefits

how to avoid pipeline bubbles when complex predictors take multiple cycles?

Static

Dynamic

One-level predictor: saturating counter
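
A sketch of the 2-bit saturating counter (class name mine): a single mispredicted outcome nudges the state but does not flip a strongly-biased prediction.

```python
class TwoBitCounter:
    """Classic 2-bit saturating counter: states 0-1 predict not-taken,
    states 2-3 predict taken; taken increments, not-taken decrements."""
    def __init__(self, state=1):           # start weakly not-taken
        self.state = state
    def predict(self):
        return self.state >= 2             # True = predict taken
    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

c = TwoBitCounter()
for outcome in [True, True, False, True]:  # one not-taken blip...
    c.update(outcome)
print(c.predict())   # ...still predicts taken: True
```

This hysteresis is why the 2-bit counter mispredicts a steadily-taken loop branch only once per loop exit, not twice as a 1-bit scheme would.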

Two-level predictor: local/global history
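
One concrete global-history organisation is gshare: the PC is XORed with a global history register to index a pattern history table (PHT) of 2-bit counters. A sketch (table size and indexing details here are illustrative assumptions):

```python
class Gshare:
    """Two-level global-history predictor (gshare-style)."""
    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.ghr = 0                              # global history register
        self.pht = [1] * (1 << history_bits)      # weakly not-taken
    def _index(self, pc):
        return (pc ^ self.ghr) & ((1 << self.history_bits) - 1)
    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2
    def update(self, pc, taken):
        i = self._index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.history_bits) - 1)

# A loop branch with pattern T,T,T,NT gives each history its own counter,
# so after warm-up every outcome is predicted correctly:
g = Gshare()
hits = 0
for t in ([True, True, True, False] * 16):
    hits += (g.predict(0x40) == t)
    g.update(0x40, taken=t)
print(hits)   # 58 of 64: only warm-up iterations mispredict
```

A single saturating counter would mispredict every fourth outcome forever; the history-indexed table learns the repeating pattern.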

Tournament predictor

Branch Target Buffer (BTB)

Precise exceptions

Improvements for scalar pipeline

Superscalar pipeline

Exploits Instruction- and Memory-Level Parallelism.

OOO processors benefit from:

In-order vs out-of-order (dynamic)

Instruction fetch

Register rename
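
A minimal sketch of renaming with a register alias table (RAT) and a free list, in the style of a unified physical register file (all names are mine): every architectural destination gets a fresh physical register, removing WAR/WAW hazards and leaving only true RAW dependences.

```python
def rename(instrs, n_arch=4):
    """instrs: (dst, src1, src2) architectural-register triples in
    program order. Returns the physically-renamed triples."""
    rat = {f"r{i}": f"p{i}" for i in range(n_arch)}   # register alias table
    free = [f"p{i}" for i in range(n_arch, n_arch + 8)]
    out = []
    for dst, src1, src2 in instrs:
        s1, s2 = rat[src1], rat[src2]   # read sources via current mapping
        rat[dst] = free.pop(0)          # then allocate a fresh physical dest
        out.append((rat[dst], s1, s2))
    return out

# The WAW hazard on r1 disappears: the two writes target p4 and p5.
print(rename([("r1", "r2", "r3"), ("r1", "r1", "r2")]))
# [('p4', 'p2', 'p3'), ('p5', 'p4', 'p2')]
```

Reading sources before allocating the destination is what makes the self-referencing second instruction (r1 <- r1 + r2) come out right.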

Clustered data forwarding network

Hardware-based dynamic memory speculation

Load bypassing: Load/Store Queue (LQ/SQ).
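
A sketch of how a load consults the store queue before bypassing earlier stores (word-granularity addresses assumed; a real LSQ also handles partial overlaps and not-yet-computed addresses):

```python
def resolve_load(addr, store_queue, memory):
    """Search older stores youngest-first; forward a matching value
    (store-to-load forwarding), otherwise the load bypasses the
    pending stores and reads memory.
    store_queue: (addr, value) pairs in program order, oldest first."""
    for st_addr, st_val in reversed(store_queue):
        if st_addr == addr:
            return st_val              # store-to-load forwarding
    return memory.get(addr, 0)         # no alias: safe to bypass

mem = {0x100: 7}
sq = [(0x200, 1), (0x100, 42)]        # two stores not yet committed
print(resolve_load(0x100, sq, mem))   # 42: forwarded from the SQ
print(resolve_load(0x300, sq, mem))   # 0: bypasses, nothing in mem
```

The youngest-first search matters: the load must see the most recent older store to its address, not the oldest.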

Skip speculative loads if an L1 D-cache miss is predicted.

ReOrder Buffer (ROB)

To locate an operand, both the register file and the reorder buffer must be searched:

Unified Physical RF

ROB vs Unified Physical RF

Limitations

General

Software ILP (VLIW)

Exploits Instruction- and Memory-Level Parallelism via static scheduling.

vs. superscalar

Clustered architecture: see superscalar pipeline.

Local scheduling

Loop unrolling

Software pipelining with Rotating Register File (RRF)

Global scheduling

Conditional / Predicated operations

Memory reference speculation

Variable-length bundles of independent instructions

VLIW                                    | fixed-width bundle | var-len bundle
code density, i-cache size, instr fetch | no-ops | only start/end markers; ↑ i-cache hit rate
hardware                                | simpler; fixed SLOT-to-FU | + decoding; check before issue; + interconnect, muxes for op./operands to FU

Binary compatibility

Multi-threaded processors

Exploits Thread-Level Parallelism.

vs. multi-core

Coarse-grained MT

Fine-grained MT

SMT

partitioned/tagged vs. shared vs. duplicated resources

General

Memory hierarchy

Reference: Memory address calculation.

Direct-mapped vs. set/fully associative cache
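
The address split underlying both designs, as a sketch (the capacity and line-size numbers are just examples): raising associativity shrinks the index field and moves those bits into the tag.

```python
def split_address(addr, cache_bytes, block_bytes, ways):
    """Split an address into (tag, set index, block offset) for a cache
    of the given capacity, line size, and associativity."""
    n_sets = cache_bytes // (block_bytes * ways)
    offset = addr % block_bytes
    index = (addr // block_bytes) % n_sets
    tag = addr // (block_bytes * n_sets)
    return tag, index, offset

# 32 KiB cache, 64-byte lines: direct-mapped has 512 sets (9 index bits);
# the same cache 4-way has 128 sets, so two more bits join the tag.
print(split_address(0x12345, 32 * 1024, 64, ways=1))   # (2, 141, 5)
print(split_address(0x12345, 32 * 1024, 64, ways=4))   # (9, 13, 5)
```

A fully associative cache is the limit case: one set, no index bits, every lookup compares tags across all lines.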

Block replacement policy

Hit time calculation

Virtual addressing and caching

Cache: optimizing performance

Multi-level cache hierarchy

Inclusive vs. exclusive caches vs. NINE

Vector processors / SIMD

Exploits Data-Level Parallelism.

Potential advantages

Vector chaining (RaW) and tailgating (WaR)

Precise exceptions: see the scalar pipeline section.

ISA: vector length

Multi-core processors

Exploits Chip Multiprocessing.

vs. superscalar

Multi-banked caches

Cache coherence

invalidate vs. update

snoopy protocol
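
A sketch of the per-line state machine of an MSI invalidate protocol (simplified: no transient states, atomic bus; the event names are conventional but this encoding is my own):

```python
# Events: the local processor's reads/writes ("PrRd"/"PrWr") and
# snooped bus traffic from other cores ("BusRd"/"BusRdX").
MSI = {
    ("I", "PrRd"):   "S",   # read miss: fetch line shared
    ("I", "PrWr"):   "M",   # write miss: fetch exclusive (BusRdX)
    ("S", "PrWr"):   "M",   # upgrade: invalidate other copies
    ("S", "BusRdX"): "I",   # another core wants to write
    ("M", "BusRd"):  "S",   # another core reads: write back, share
    ("M", "BusRdX"): "I",   # another core writes: write back, invalidate
}

def next_state(state, event):
    return MSI.get((state, event), state)   # all other cases: stay put

# Core A writes a line, core B reads it, then core B writes it:
a, b = "I", "I"
a = next_state(a, "PrWr")                               # A: I -> M
a, b = next_state(a, "BusRd"), next_state(b, "PrRd")    # A: M -> S, B: I -> S
a, b = next_state(a, "BusRdX"), next_state(b, "PrWr")   # A: S -> I, B: S -> M
print(a, b)   # I M
```

The invariant the table enforces is single-writer / multiple-reader: at most one core in M, or any number in S, never both.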

directory protocol

general

Memory consistency

Sequential Consistency vs. Total Store Order

Store atomicity

False sharing: cache line/block granularity
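
Illustrative arithmetic, assuming 64-byte lines: two logically independent per-core counters packed into one line make every write by one core invalidate the other core's copy; padding each counter to a full line removes the conflict.

```python
LINE = 64   # assumed cache line size in bytes

def line_of(byte_offset):
    """Which cache line a given byte offset falls in."""
    return byte_offset // LINE

# Two 8-byte counters packed back-to-back share a line: false sharing.
print(line_of(0) == line_of(8))      # True

# Padding each counter out to LINE bytes separates them:
print(line_of(0) == line_of(LINE))   # False
```

Coherence operates at cache-line/block granularity, so the counters need not overlap to conflict; they only need to cohabit a line.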

On-chip interconnection network

General design

Specialised processors

Exploits Accelerator-Level Parallelism.

Heterogeneous/asymmetric vs homogeneous/symmetric

GPU

Domain-Specific Accelerators (DSA)