Advanced Computer Architecture, Comparative Architectures Past Paper
Textbook: Hennessy, J. and Patterson, D. (2012). Computer architecture: a quantitative approach. Elsevier (3rd/4th/5th ed.)
Trends
Design goals and constraints
- y2016p8q3 (a) target markets
- cost, size, time-to-market goals, power consumption, performance and
- the types of programs (parallelism forms, dataset and program characteristics, e.g. cache).
- predictable execution times, fault tolerance, security, compatibility requirements.
- more: fabrication and packaging technology (mask costs), i.e. transistor size and speed, power consumption limits, interconnect speed, number of metal layers, I/O speed and count (e.g. limits of off-chip/DRAM bandwidth); PCB design considerations, the size and cost of the overall product, etc.
Early computers exploit Bit-Level Parallelism.
|                      | superscalar | compiler VLIW | SIMD | multi-core | DSA |
|----------------------|---|---|---|---|---|
| parallelism          | static or dynamic ILP, MLP | static ILP, MLP | DLP | TLP | custom |
| features             | instruction fetch, dynamic prefetch, memory addr. alias, physical regs. rename | scheduling; sw. speculate; var./fixed bundle | N ops., independent, same FU, disjoint regs., known mem access | fine/coarse-grained vs. SMT | specialized |
| instr. count (IF/DE) | ↑ out-of-order | one VLI | ↓ | var. | custom |
| branch handling      | dynamic branch pred. | limited | poor (predicted) | per-core | custom |
| limitations          | fabrication, and below | tailor to a pipeline | data-level tasks | Amdahl’s Law | inflexible |
| hardware cost/area   | ↑ | ↓ | vector regs. / FUs | pipeline regs. | custom |
| interconnect         | ↑ (in single core) | ↓ | wide data bus, lane | mesh/cache coherence | scratchpad |
| energy cost          | ↑ | ↓ | ↓ | var. | ↓ ✰ |
| binary compatibility | ✓ | ✗ | □ (✓ VLA) | ✓ | ✗ |
| use cases            | CPU, general-purpose | embedded, GPU | ML, graphics | CPU, server, SoC | TPU, DSP |
- y2020p9q4 (a)
- major historical turning points; see more stages in the following sections
- y2007p8q3 (a-e)
- multi-processing becomes important as memory performance falls behind processor performance
- OS and compiler support for parallelism, limitation
Limitations of Moore's law
- y2011p7q4 (d)
- power wall, Amdahl's law (sequential dominates), parallel programming challenges (TLP, correctness), thread scheduling.
- Off-chip memory bandwidth, package pins, shared-memory communication models (bus vs. directory).
- On-chip interconnects (e.g. mesh, ring, crossbar) and their limitations.
- Process variations, temperature variations, aging and reliability.
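The Amdahl's-law limit mentioned above can be made concrete with a short sketch (the fraction and core counts below are illustrative, not from any paper):

```python
# Amdahl's law: if a fraction p of the work parallelises perfectly over
# n cores, the sequential remainder (1 - p) bounds the overall speedup.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)
```

Even with 95% of a program parallelised, 64 cores yield only about a 15x speedup, and the asymptotic limit is 1/(1 - p) = 20x however many cores are added.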
Die stacking
Fundamentals of Computer Design
Common case first
Energy, power vs. parallelism, performance
Instruction Set Architecture (ISA)
Predicated operations
Scalar pipeline
Multiple / Diversified pipelines
- allow different instructions without data dependencies to execute in parallel, hiding the latency of long-latency instructions (e.g. load/store).
- exploits ILP and improves IPC.
- y2010p8q5 (b)
Deep pipeline and optimal pipeline depth
- pros: allows the processor to be clocked at a higher frequency, increasing the number of instructions that can be executed in a given period of time.
- cons: the extra logic (area) required to support deeper pipelines and extract ILP consumes large amounts of power.
- minimise the critical path by balancing workload among the pipeline stages.
- CPI increases because of stall penalties and pipeline interruptions, i.e. operations that can't be done in a single cycle.
- branch predictors to keep the pipeline full (control hazards).
- memory access (e.g. L1 cache), which negates the benefits achieved.
- pipeline registers overhead, clock skew.
- limited ILP, atomic operations.
- y2014p8q3 (a.i), y2017p7q6 (b), y2010p8q5 (a)
- y2009p8q2 (b)
- branch predictor limitations for deep pipelines.
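The optimal-depth trade-off above can be sketched with a toy model (all parameter values are assumptions for illustration): splitting the logic over more stages shortens the cycle, but each stage adds latch/skew overhead, and hazard penalties grow with depth.

```python
import math

# Toy model of time per instruction as a function of pipeline depth k:
# cycle time shrinks as t_logic/k but pays a fixed t_latch per stage,
# while stall cycles (branch mispredicts etc.) grow with depth.
def time_per_instr(k, t_logic=20.0, t_latch=0.5, stalls_per_stage=0.02):
    cycle = t_logic / k + t_latch            # clock period
    cpi = 1.0 + stalls_per_stage * k         # deeper pipe -> bigger penalties
    return cycle * cpi

# minimising (t_logic/k + t_latch)(1 + b*k) gives k_opt = sqrt(t_logic/(t_latch*b))
k_opt = math.sqrt(20.0 / (0.5 * 0.02))
```

The quadratic-like shape means both very shallow and very deep pipelines lose: the model's optimum balances latch overhead against growing stall penalties.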
Branch prediction
benefits
how to avoid pipeline bubbles with complex (multi-cycle) predictors?
- y2014p8q3 (b.ii)
- have a valid fetch address generated every cycle.
- pair a complex predictor with a fast, less accurate one; refetch if their predictions differ.
- add next line and way info to each four-instruction fetch block within the instruction cache.
Static
- y2007p7q1 (g)
- compiler support for branch prediction, i.e. static heuristic and profile info.
- e.g. forward-not-taken and backwards-taken.
Dynamic
One-level predictor: saturating counter
Two-level predictor: local/global history
Tournament predictor
Branch Target Buffer (BTB)
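The one-level saturating-counter scheme above can be written as a small simulation (table size and PC indexing are assumptions):

```python
class TwoBitPredictor:
    """One-level predictor: a 2-bit saturating counter per table entry."""
    def __init__(self, entries=1024):
        self.counters = [1] * entries    # start weakly not-taken (range 0..3)
        self.entries = entries

    def predict(self, pc):
        return self.counters[pc % self.entries] >= 2    # True = taken

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

The hysteresis is the point: after one loop-exit not-taken outcome, a saturated counter still predicts taken, so a loop branch costs only one misprediction per exit rather than two.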
Precise exceptions
- all instructions before E (the instruction that caused the exception) must be completed, and
- all instructions after E must not have completed, including not modifying the architectural state.
- whether E should complete or not depends on the exception type.
- y2022p9q1 (a.iii)
- buffer pre-executed results allowing the effects of the second partially executed instruction to be undone,
- or to buffer results until we know all earlier operations have completed;
- another option: record which parts of the second instruction have executed and selectively replay;
- now upon restarting execution after the exception handler runs,
- the processor knows which operations (“beats” in Arm terminology) not to re-execute in the case of the second instruction.
- supported by Arm’s MVE/Helium ISA extension.
Improvements for scalar pipeline
- y2018p27q5 (a)
- micro-architectural techniques or elements:
- I/D-caches,
- branch prediction,
- multiple / diversified pipelines,
- skid buffer for stalling and replaying.
Superscalar pipeline
Exploits Instruction- and Memory-Level Parallelism.
- Instruction Fetch, DEcode, [Rename], [Dispatch], [Issue: static/dynamic scheduling], EXecute, Memory, Write Back.
- vs. In-order
- dispatch: stall for data hazards (false dependencies);
- issue: stall for structural hazards.
OOO processors benefit from:
- after instructions are fetched, decoded and renamed, they are dispatched into the issue queue, from which they can be scheduled out-of-order.
- issue: create a window into the dynamic instruction stream, from different basic blocks of the original program.
- pros: schedule instructions dynamically aided by speculation,
- constrained by little more than true data dependencies and functional unit availability.
- different instruction schedules depending on run-time info. and the actual state of the processor,
- branch prediction, react to data cache misses,
- exploit knowledge of load/store addresses, i.e. disambiguate memory addresses.
- avoids the need to stall when the result of a cache miss is needed.
- improved memory-level parallelism, tolerate longer cache access latencies.
- simplify the compiler (register allocator), without increasing static code size.
- allows code compiled for another pipeline to run efficiently.
- cons: more hardware cost,
- complexity of instruction fetch, memory speculation, and register renaming, cache access.
- large area, high power consumption, heat dissipation, and fabrication complexity.
- limitation: tasks with limited ILP (Amdahl's law bounds the benefit), competition for FUs.
- exploiting only ILP, without TLP or DLP.
- interconnect delay limits the state reachable per cycle and how well centralised memory-like structures scale.
In-order vs out-of-order (dynamic)
Instruction fetch
- y2011p8q3 (b)
- instruction cache and misalignment (spanning multiple cache lines), branch prediction.
- multiple basic blocks (path prediction and branch address cache, trace cache)
Register rename
- maps each architectural (or logical) register to the physical register allocated to its most recent destination write.
- Register Map Table (RMT), Free Physical Register List (FPRL);
- to remove false (or name) dependencies (anti: WAR, output: WAW).
- VLIW: Rotating Register File (RRF).
- increase the maximum number of instructions that could be in-flight simultaneously.
- vs. compile time,
- the latter is difficult due to the presence of short loops, control dependencies, or
- when the ISA only defines a very limited number of architectural registers.
- support speculative execution and precise exceptions.
- speculative execution benefits from the presence of a large number of physical registers to hold live variables from speculative execution paths.
- quickly return the processor to a particular state, either to implement precise interrupts or to recover from a mis-predicted branch.
- y2024p8q1 (a), y2019p8q3 (b)
- y2009p7q7 (a-d)
Clustered data forwarding network
- y2017p7q6 (c)
- partitioned issue windows;
- issue: inter-cluster communication.
- for VLIW, the same applies.
Hardware-based dynamic memory speculation
- memory-carried dependencies (aliases)
- do not issue a load instruction before a pending store to the same memory location has executed.
- or store-to-load forwarding; otherwise, the load can bypass the store, via load queue.
- irreversible store
- store instructions are executed in program order, via store queue,
- never speculatively before earlier branches have been resolved.
- ensures any exceptions caused by earlier instructions are handled.
- y2009p7q7 (e), y2024p8q1 (b)
Load bypassing: Load/Store Queue (LQ/SQ).
Skip speculative loads, if predicts L1 D-cache miss.
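The store-queue search behind store-to-load forwarding can be sketched like this (the structures are simplified; real SQs use CAMs and partial-overlap checks):

```python
# A load searches the store queue, youngest first, for a pending store to
# the same address; only if none matches may it bypass the stores and
# read memory (or the data cache) directly.
def load(addr, store_queue, memory):
    for st_addr, st_val in reversed(store_queue):
        if st_addr == addr:
            return st_val              # forward the pending store's data
    return memory.get(addr, 0)         # no alias: safe to bypass

memory = {0x100: 7}
store_queue = [(0x200, 1), (0x100, 42)]   # program order, not yet committed
```

Searching youngest-first matters: if two pending stores target the same address, the load must see the most recent one.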
ReOrder Buffer (ROB)
- holds both the instructions and data for in-flight instructions,
- for [precise exceptions], ensuring results are committed in order to prevent data hazards.
To search for operands between register file and the reorder buffer,
- If the operand is located in the reorder buffer, it represents the most recent value.
- The reorder buffer is searched and accessed in parallel with the register file.
- A second approach is to maintain a register mapping table,
- this records whether each register should be read from the reorder buffer or the register file.
- It also records the entry in the reorder buffer where the register can be found.
Unified Physical RF
- with two register mapping tables, and a simplified ROB (in-order instruction queue),
- the front-end future map represents the current potentially speculative state,
- the architectural register map maintains the user visible state (checkpoint).
- enhancement: checkpoint the register map per predicted branch (MIPS R10000).
- upon detecting a mis-predicted branch,
- the architectural register map is copied to the future register map and
- any instructions along the mis-predicted path are discarded.
- execution can then continue along the correct path with the correct register state.
ROB vs Unified Physical RF
Limitations
- y2013p8q3 (d), y2010p8q5 (d)
- bottleneck: fabrication of wires, interconnect, power consumption, and heat dissipation.
- y2015p7q5 (a)
- why modern architectures prefer multi-core over pushing superscalar techniques in a single thread.
General
- y2016p8q3 (b)
- give an assembly language program benefiting from superscalar techniques.
- y2013p7q5 (b)
- switching between two configurations.
Software ILP (VLIW)
Exploits Instruction- and Memory-Level Parallelism via static scheduling.
vs. superscalar
Clustered architecture: see superscalar pipeline.
Local scheduling
Loop unrolling
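Loop unrolling can be illustrated at source level (Python purely for illustration; a VLIW compiler performs this on machine code): unrolling by 4 with separate accumulators breaks the single loop-carried sum dependence into four independent chains and pays the loop-control overhead once per four elements.

```python
def dot_unrolled(a, b):
    # four independent partial sums: a wide machine can schedule the four
    # multiply-accumulates in parallel, since no accumulator depends on
    # another within an iteration
    n = len(a) - len(a) % 4
    s0 = s1 = s2 = s3 = 0
    for i in range(0, n, 4):
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
    total = s0 + s1 + s2 + s3
    for i in range(n, len(a)):   # epilogue for leftover elements
        total += a[i] * b[i]
    return total
```

The epilogue loop is the usual cost of unrolling when the trip count is not a multiple of the unroll factor.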
Software pipelining with Rotating Register File (RRF)
Global scheduling
Conditional / Predicated operations
Memory reference speculation
Variable-length bundles of independent instructions
| VLIW | fixed-width bundle | var-len bundle |
|---|---|---|
| code density, i-cache size, instr fetch | ↑ no-ops | ↓ only st., end |
| i-cache hit rate | ↓ | ↑ |
| fixed SLOT-to-FU | ✓ | ✗ |
| hardware | simpler | + decoding; check before issue; + interconnect, muxes for op, operands to FU |
Binary compatibility
Multi-threaded processors
Exploits Thread-Level Parallelism.
vs. multi-core
- y2015p7q5 (a)
- modern architecture: multi-core > single multi-threaded
Coarse-grained MT
Fine-grained MT
SMT
partition/tag vs. share vs. duplicated resources
General
Memory hierarchy
Reference: Memory address calculation.
Direct-mapped vs. set/fully associative cache
- y2014p7q5 (a)
- y2011p7q4 (a)
- reduce conflict misses,
- victim cache (exclusive cache),
- spatial memory layout: array merging, array padding.
Block replacement policy
Hit time calculation
Virtual addressing and caching
Cache: optimizing performance
- y2013p8q3 (b)
- trade-off between memory vs. computation
- spatial access order: loop interchange, cache blocking
- temporal locality: loop fusion and fission
- y2010p7q7 (a)
- y2014p7q5 (c)
- load/store buffer: load-store forwarding, store merging or coalescing
- y2018p8q2 (c)
- merging or coalescing write-buffer violates TSO?
- y2011p7q4 (c)
- (stride) prefetching hardware
- y2014p7q5 (b)
- non-blocking cache (out-of-order)
- avoid false sharing in multiprocessor systems
- private variables in the same cache line accessed by different threads/cores
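The effect of spatial access order (what loop interchange fixes) can be shown with a toy direct-mapped cache model; the sizes below are assumptions (64 lines of 64 bytes, 8-byte elements, row-major array layout):

```python
LINE, LINES = 64, 64   # assumed: 64-byte lines, 64-entry direct-mapped cache

def misses(addresses):
    cache = [None] * LINES          # per-index stored tag
    miss = 0
    for a in addresses:
        tag, idx = a // LINE // LINES, (a // LINE) % LINES
        if cache[idx] != tag:
            cache[idx] = tag
            miss += 1
    return miss

N = 256  # N x N array of 8-byte elements, row-major layout
row_major = [(i * N + j) * 8 for i in range(N) for j in range(N)]
col_major = [(i * N + j) * 8 for j in range(N) for i in range(N)]
```

Row-major traversal touches each 64-byte line 8 times, so only one access in eight misses; column-major traversal strides 2 KB between accesses, so in this configuration every access misses.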
Multi-level cache hierarchy
- y2008p7q5 (a)
- why multi-level? trade-off between hit time and number of hits (hit rate)
- L1, L2 examples
Inclusive vs. exclusive caches vs. NINE
Vector processors / SIMD
Exploits Data-Level Parallelism.
Potential advantages
Vector chaining (RaW) and tailgating (WaR)
Precise exceptions: please refer to the scalar pipeline section.
ISA: vector length
Multi-core processors
Exploits Chip Multiprocessing.
vs. superscalar
Multi-banked caches
Cache coherence
invalidate vs. update
snoopy protocol
directory protocol
- for each cache line, the directory stores the list of sharers and, when exclusive, the single owning processor.
- inclusive directory for non-inclusive caches
- y2012p8q3 (d), y2022p8q1 (c), y2022p8q1 (d)
- hierarchy of on-chip (clustered) caches, reducing storage overhead.
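A minimal sketch of the per-line directory state described above (the class and method names are hypothetical; real protocols track more states and handle races):

```python
# One directory entry per cache line: a sharer set plus at most one
# exclusive owner; a write request must invalidate all other sharers.
class DirectoryEntry:
    def __init__(self):
        self.sharers = set()
        self.owner = None            # core holding the line exclusively

    def read(self, core):
        self.owner = None            # any exclusive copy is downgraded
        self.sharers.add(core)

    def write(self, core):
        invalidate = self.sharers - {core}   # cores to send invalidations to
        self.sharers = {core}
        self.owner = core
        return invalidate
```

Unlike snooping, only the cores in the returned set receive invalidation messages, which is what lets directories scale beyond a shared bus.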
general
Memory consistency
Sequential Consistency vs. Total Store Order
Store atomicity
False sharing: cache line/block granularity
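False sharing can be sketched with a toy line-ownership model (64-byte lines assumed): two cores alternately writing distinct private counters that land in the same line bounce it between their caches, while padding one counter to the next line removes every invalidation.

```python
LINE = 64   # assumed cache line size in bytes

# Count invalidations under a single-owner model: a write by one core
# invalidates the other core's copy whenever both variables map to the
# same cache line, even though the variables themselves are distinct.
def invalidations(addr_by_core, write_order):
    owner = {}                        # line -> core holding it Modified
    inval = 0
    for core in write_order:
        line = addr_by_core[core] // LINE
        if owner.get(line, core) != core:
            inval += 1                # line must be stolen from the other core
        owner[line] = core
    return inval

alternating = [0, 1] * 1000
shared = invalidations({0: 0, 1: 8}, alternating)    # same line: bounces
padded = invalidations({0: 0, 1: 64}, alternating)   # next line: no bouncing
```

This is why per-thread data is usually padded or aligned to cache-line boundaries.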
On-chip interconnection network
General design
Specialised processors
Exploits Accelerator-Level Parallelism.
Heterogeneous/asymmetric vs homogeneous/symmetric
GPU
Domain-Specific Accelerators (DSA)