Advanced Computer Architecture, Comparative Architectures Past Paper
Textbook: Hennessy, J. and Patterson, D. (2012). Computer architecture: a quantitative approach. Elsevier (3rd/4th/5th ed.)
Trends
Design goals and constraints
- y2016p8q3 (a) target markets
- cost, size, time-to-market goals, power consumption, performance and
- the types of programs (parallelism forms, dataset and program characteristics, e.g. cache).
- predictable execution times, fault tolerance, security, compatibility requirements.
- more: fabrication and packaging technology (mask costs), i.e. transistor size and speed, power consumption limits, interconnect speed, number of metal layers, I/O speed and count (e.g. limits of off-chip/DRAM bandwidth); PCB design considerations, the size and cost of the overall product, etc.
Early computers exploit Bit-Level Parallelism.
|                      | superscalar | compiler VLIW | SIMD | multi-core | DSA |
|----------------------|---|---|---|---|---|
| parallelism          | static or dynamic ILP, MLP | static ILP, MLP | DLP | TLP | custom |
| features             | instruction fetch, dynamic prefetch, memory addr. alias, physical regs. rename | scheduling; sw. speculate; var./fixed bundle | N ops., independent, same FU, disjoint regs., known mem access | fine/coarse-grained vs. SMT | specialized |
| instr. count (IF/DE) | ↑ out-of-order | one VLI | ↓ | var. | custom |
| branch handling      | dynamic branch pred. | limited | poor (predicted) | per-core | custom |
| limitations          | fabrication, and below | tailor to a pipeline | data-level tasks | Amdahl’s Law | inflexible |
| hardware cost/area   | ↑ | ↓ | vector regs. / FUs | pipeline regs. | custom |
| interconnect         | ↑ (in single core) | ↓ | wide data bus, lane | mesh/cache coherence | scratchpad |
| energy cost          | ↑ | ↓ | ↓ | var. | ↓ ✰ |
| binary compatibility | ✓ | ✗ | □ (✓ VLA) | ✓ | ✗ |
| use cases            | CPU, general-purpose | embedded, GPU | ML, graphics | CPU, server, SoC | TPU, DSP |
- y2020p9q4 (a)
- major historical turning points; see more stages in the following sections
- y2007p8q3 (a-e)
- multi-processing becomes important as memory performance falls behind processor performance
- OS and compiler support for parallelism, limitation
Limitations of Moore's law
- y2011p7q4 (d)
- power wall, Amdahl's law (sequential dominates), parallel programming challenges (TLP, correctness), thread scheduling.
- Off-chip memory bandwidth, package pins, shared-memory communication models (bus vs. directory).
- On-chip interconnects (e.g. mesh, ring, crossbar) and their limitations.
- Process variations, temperature variations, aging and reliability.
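The Amdahl's-law limit mentioned above can be made concrete with a short sketch (the fraction and core counts below are illustrative, not from any paper):

```python
# Amdahl's law: if a fraction p of the work parallelises perfectly over
# n cores, the sequential remainder (1 - p) bounds the overall speedup.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)
```

Even with 95% of a program parallelised, 64 cores yield only about a 15x speedup, and the asymptotic limit is 1/(1 - p) = 20x however many cores are added.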
Die stacking
Fundamentals of Computer Design
Common case first
Energy, power vs. parallelism, performance
Instruction Set Architecture (ISA)
Predicated operations
Scalar pipeline
Multiple / Diversified pipelines
- allow different instructions without data dependencies to execute in parallel, hiding the latency of long-latency instructions (e.g. load/store).
- exploits ILP and improves IPC.
- y2010p8q5 (b)
Deep pipeline and optimal pipeline depth
- pros: allows the processor to be clocked at a higher frequency, increasing the number of instructions that can be executed in a given period of time.
- cons: the extra logic (area) required to support deeper pipelines and extract ILP consumes large amounts of power.
- minimise the critical path by balancing workload among the pipeline stages.
- CPI increases because of stall penalties and pipeline interruptions, i.e. operations that can't be done in a single cycle.
- branch predictors to keep the pipeline full (control hazards).
- memory access (e.g. L1 cache), which negates the benefits achieved.
- pipeline registers overhead, clock skew.
- limited ILP, atomic operations.
- y2014p8q3 (a.i), y2017p7q6 (b), y2010p8q5 (a)
- y2009p8q2 (b)
- branch predictor limitations for deep pipelines.
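The optimal-depth trade-off above can be sketched with a toy model (all parameter values are assumptions for illustration): splitting the logic over more stages shortens the cycle, but each stage adds latch/skew overhead, and hazard penalties grow with depth.

```python
import math

# Toy model of time per instruction as a function of pipeline depth k:
# cycle time shrinks as t_logic/k but pays a fixed t_latch per stage,
# while stall cycles (branch mispredicts etc.) grow with depth.
def time_per_instr(k, t_logic=20.0, t_latch=0.5, stalls_per_stage=0.02):
    cycle = t_logic / k + t_latch            # clock period
    cpi = 1.0 + stalls_per_stage * k         # deeper pipe -> bigger penalties
    return cycle * cpi

# minimising (t_logic/k + t_latch)(1 + b*k) gives k_opt = sqrt(t_logic/(t_latch*b))
k_opt = math.sqrt(20.0 / (0.5 * 0.02))
```

The quadratic-like shape means both very shallow and very deep pipelines lose: the model's optimum balances latch overhead against growing stall penalties.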
Branch prediction
benefits
how to avoid pipeline bubbles with complex (multi-cycle) predictors?
- y2014p8q3 (b.ii)
- have a valid fetch address generated every cycle.
- pair a complex predictor with a fast, less accurate one; refetch if their predictions differ.
- add next line and way info to each four-instruction fetch block within the instruction cache.
Static
- y2007p7q1 (g)
- compiler support for branch prediction, i.e. static heuristic and profile info.
- e.g. forward-not-taken and backwards-taken.
Dynamic
One-level predictor: saturating counter
Two-level predictor: local/global history
Tournament predictor
Branch Target Buffer (BTB)
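The one-level saturating-counter scheme above can be written as a small simulation (table size and PC indexing are assumptions):

```python
class TwoBitPredictor:
    """One-level predictor: a 2-bit saturating counter per table entry."""
    def __init__(self, entries=1024):
        self.counters = [1] * entries    # start weakly not-taken (range 0..3)
        self.entries = entries

    def predict(self, pc):
        return self.counters[pc % self.entries] >= 2    # True = taken

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

The hysteresis is the point: after one loop-exit not-taken outcome, a saturated counter still predicts taken, so a loop branch costs only one misprediction per exit rather than two.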
Precise exceptions
- all instructions before E (the instruction that caused the exception) must be completed, and
- all instructions after E must not have completed, including not modifying the architectural state.
- whether E should complete or not depends on the exception type.
- y2022p9q1 (a.iii)
- buffer pre-executed results allowing the effects of the second partially executed instruction to be undone,
- or to buffer results until we know all earlier operations have completed;
- another option: record which parts of the second instruction have executed and selectively replay;
- now upon restarting execution after the exception handler runs,
- the processor knows which operations (“beats” in Arm terminology) not to re-execute in the case of the second instruction.
- supported by Arm’s MVE/Helium ISA extension.
Improvements for scalar pipeline
- y2018p27q5 (a)
- micro-architectural techniques or elements:
- I/D-caches,
- branch prediction,
- multiple / diversified pipelines,
- skid buffer for stalling and replaying.
Superscalar pipeline
Exploits Instruction- and Memory-Level Parallelism.
- Instruction Fetch, DEcode, [Rename], [Dispatch], [Issue: static/dynamic scheduling], EXecute, Memory, Write Back.
- vs. In-order
- dispatch: stall for data hazards (false dependencies);
- issue: stall for structural hazards.
OOO processors benefit from:
- after instructions are fetched, decoded and renamed, they are dispatched into the issue queue, from which they can be scheduled out-of-order.
- issue: create a window into the dynamic instruction stream, from different basic blocks of the original program.
- pros: schedule instructions dynamically aided by speculation,
- constrained by little more than true data dependencies and functional unit availability.
- different instruction schedules depending on run-time info. and the actual state of the processor,
- branch prediction, react to data cache misses,
- exploit knowledge of load/store addresses, i.e. disambiguate memory addresses.
- avoids the need to stall when the result of a cache miss is needed.
- improved memory-level parallelism, tolerate longer cache access latencies.
- simplify the compiler (register allocator), without increasing static code size.
- allows code compiled for another pipeline to run efficiently.
- cons: more hardware cost,
- complexity of instruction fetch, memory speculation, and register renaming, cache access.
- large area, high power consumption, heat dissipation, and fabrication complexity.
- limitation: tasks with limited ILP (Amdahl's law bounds the benefit), competition for FUs.
- exploiting only ILP, without TLP or DLP.
- interconnect delay limits the state reachable per cycle and how well centralised memory-like structures scale.
In-order vs out-of-order (dynamic)
Instruction fetch
- y2011p8q3 (b)
- instruction cache and misalignment (spanning multiple cache lines), branch prediction.
- multiple basic blocks (path prediction and branch address cache, trace cache)
Register rename
- maps each architectural (or logical) register to the physical register allocated to its most recent destination write.
- Register Map Table (RMT), Free Physical Register List (FPRL);
- to remove false (or name) dependencies (anti: WAR, output: WAW).
- VLIW: Rotating Register File (RRF).
- increase the maximum number of instructions that could be in-flight simultaneously.
- vs. compile time,
- the latter is difficult due to the presence of short loops, control dependencies, or
- when the ISA only defines a very limited number of architectural registers.
- support speculative execution and precise exceptions.
- speculative execution benefits from the presence of a large number of physical registers to hold live variables from speculative execution paths.
- quickly return the processor to a particular state, either to implement precise interrupts or to recover from a mis-predicted branch.
- y2024p8q1 (a), y2019p8q3 (b)
- y2009p7q7 (a-d)
Clustered data forwarding network
- y2017p7q6 (c)
- partitioned issue windows;
- issue: inter-cluster communication.
- for VLIW, the same applies.
Hardware-based dynamic memory speculation
- memory-carried dependencies (aliases)
- do not issue a load instruction before a pending store to the same memory location has executed.
- or store-to-load forwarding; otherwise, the load can bypass the store, via load queue.
- irreversible store
- store instructions are executed in program order, via store queue,
- never speculatively before earlier branches have been resolved.
- ensures any exceptions caused by earlier instructions are handled.
- y2009p7q7 (e), y2024p8q1 (b)
Load bypassing: Load/Store Queue (LQ/SQ).
Skip speculative loads, if predicts L1 D-cache miss.
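The store-queue search behind store-to-load forwarding can be sketched like this (the structures are simplified; real SQs use CAMs and partial-overlap checks):

```python
# A load searches the store queue, youngest first, for a pending store to
# the same address; only if none matches may it bypass the stores and
# read memory (or the data cache) directly.
def load(addr, store_queue, memory):
    for st_addr, st_val in reversed(store_queue):
        if st_addr == addr:
            return st_val              # forward the pending store's data
    return memory.get(addr, 0)         # no alias: safe to bypass

memory = {0x100: 7}
store_queue = [(0x200, 1), (0x100, 42)]   # program order, not yet committed
```

Searching youngest-first matters: if two pending stores target the same address, the load must see the most recent one.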
ReOrder Buffer (ROB)
- holds both the instructions and data for in-flight instructions,
- for [precise exceptions], ensuring results are committed in order to prevent data hazards.
To search for operands between register file and the reorder buffer,
- If the operand is located in the reorder buffer, it represents the most recent value.
- The reorder buffer is searched and accessed in parallel with the register file.
- A second approach is to maintain a register mapping table,
- this records whether each register should be read from the reorder buffer or the register file.
- It also records the entry in the reorder buffer where the register can be found.
Unified Physical RF
- with two register mapping tables, and a simplified ROB (in-order instruction queue),
- the front-end future map represents the current potentially speculative state,
- the architectural register map maintains the user visible state (checkpoint).
- enhancement: checkpoint the register map per predicted branch (MIPS R10000).
- upon detecting a mis-predicted branch,
- the architectural register map is copied to the future register map and
- any instructions along the mis-predicted path are discarded.
- execution can then continue along the correct path with the correct register state.
ROB vs Unified Physical RF
Limitations
- y2013p8q3 (d), y2010p8q5 (d)
- bottleneck: fabrication of wires, interconnect, power consumption, and heat dissipation.
- y2015p7q5 (a)
- why modern architectures prefer multi-core over pushing superscalar techniques in a single thread.
General
- y2016p8q3 (b)
- give an assembly language program benefiting from superscalar techniques.
- y2013p7q5 (b)
- switching between two configurations.
Software ILP (VLIW)
Exploits Instruction- and Memory-Level Parallelism via static scheduling.
vs. superscalar
Clustered architecture: see superscalar pipeline.
Local scheduling
Loop unrolling
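Loop unrolling can be illustrated at source level (Python purely for illustration; a VLIW compiler performs this on machine code): unrolling by 4 with separate accumulators breaks the single loop-carried sum dependence into four independent chains and pays the loop-control overhead once per four elements.

```python
def dot_unrolled(a, b):
    # four independent partial sums: a wide machine can schedule the four
    # multiply-accumulates in parallel, since no accumulator depends on
    # another within an iteration
    n = len(a) - len(a) % 4
    s0 = s1 = s2 = s3 = 0
    for i in range(0, n, 4):
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
    total = s0 + s1 + s2 + s3
    for i in range(n, len(a)):   # epilogue for leftover elements
        total += a[i] * b[i]
    return total
```

The epilogue loop is the usual cost of unrolling when the trip count is not a multiple of the unroll factor.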
Software pipelining with Rotating Register File (RRF)
Global scheduling
Conditional / Predicated operations
Memory reference speculation
Variable-length bundles of independent instructions
| VLIW | fixed-width bundle | var-len bundle |
|---|---|---|
| code density, i-cache size, instr fetch | ↑ no-ops | ↓ only st., end |
| i-cache hit rate | ↓ | ↑ |
| fixed SLOT-to-FU | ✓ | ✗ |
| hardware | simpler | + decoding; check before issue; + interconnect, muxes for op, operands to FU |
Binary compatibility
Multi-threaded processors
Exploits Thread-Level Parallelism.
vs. multi-core
- y2015p7q5 (a)
- modern architecture: multi-core > single multi-threaded
Coarse-grained MT
Fine-grained MT
SMT
partition/tag vs. share vs. duplicated resources
General
Memory hierarchy
Reference: Memory address calculation.
Direct-mapped vs. set/fully associative cache
- y2014p7q5 (a)
- y2011p7q4 (a)
- reduce conflict misses,
- victim cache (exclusive cache),
- spatial memory layout: array merging, array padding.
Block replacement policy
Hit time calculation
Virtual addressing and caching
Cache: optimizing performance
- y2013p8q3 (b)
- trade-off between memory vs. computation
- spatial access order: loop interchange, cache blocking
- temporal locality: loop fusion and fission
- y2010p7q7 (a)
- y2014p7q5 (c)
- load/store buffer: load-store forwarding, store merging or coalescing
- y2018p8q2 (c)
- merging or coalescing write-buffer violates TSO?
- y2011p7q4 (c)
- (stride) prefetching hardware
- y2014p7q5 (b)
- non-blocking cache (out-of-order)
- avoid false sharing in multiprocessor systems
- private variables in the same cache line accessed by different threads/cores
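The effect of spatial access order (what loop interchange fixes) can be shown with a toy direct-mapped cache model; the sizes below are assumptions (64 lines of 64 bytes, 8-byte elements, row-major array layout):

```python
LINE, LINES = 64, 64   # assumed: 64-byte lines, 64-entry direct-mapped cache

def misses(addresses):
    cache = [None] * LINES          # per-index stored tag
    miss = 0
    for a in addresses:
        tag, idx = a // LINE // LINES, (a // LINE) % LINES
        if cache[idx] != tag:
            cache[idx] = tag
            miss += 1
    return miss

N = 256  # N x N array of 8-byte elements, row-major layout
row_major = [(i * N + j) * 8 for i in range(N) for j in range(N)]
col_major = [(i * N + j) * 8 for j in range(N) for i in range(N)]
```

Row-major traversal touches each 64-byte line 8 times, so only one access in eight misses; column-major traversal strides 2 KB between accesses, so in this configuration every access misses.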
Multi-level cache hierarchy
- y2008p7q5 (a)
- why multi-level? trade-off between hit time and number of hits (hit rate)
- L1, L2 examples
Inclusive vs. exclusive caches vs. NINE
Vector processors / SIMD
Exploits Data-Level Parallelism.
Potential advantages
Vector chaining (RaW) and tailgating (WaR)
Precise exceptions: please refer to the scalar pipeline section.
ISA: vector length
Multi-core processors
Exploits Chip Multiprocessing.
vs. superscalar
Multi-banked caches
Cache coherence
invalidate vs. update
snoopy protocol
directory protocol
- for each cache line, the directory stores the list of sharers and, when exclusive, the single owning processor.
- inclusive directory for non-inclusive caches
- y2012p8q3 (d), y2022p8q1 (c), y2022p8q1 (d)
- hierarchy of on-chip (clustered) caches, reducing storage overhead.
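A minimal sketch of the per-line directory state described above (the class and method names are hypothetical; real protocols track more states and handle races):

```python
# One directory entry per cache line: a sharer set plus at most one
# exclusive owner; a write request must invalidate all other sharers.
class DirectoryEntry:
    def __init__(self):
        self.sharers = set()
        self.owner = None            # core holding the line exclusively

    def read(self, core):
        self.owner = None            # any exclusive copy is downgraded
        self.sharers.add(core)

    def write(self, core):
        invalidate = self.sharers - {core}   # cores to send invalidations to
        self.sharers = {core}
        self.owner = core
        return invalidate
```

Unlike snooping, only the cores in the returned set receive invalidation messages, which is what lets directories scale beyond a shared bus.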
general
Memory consistency
Sequential Consistency vs. Total Store Order
Store atomicity
False sharing: cache line/block granularity
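False sharing can be sketched with a toy line-ownership model (64-byte lines assumed): two cores alternately writing distinct private counters that land in the same line bounce it between their caches, while padding one counter to the next line removes every invalidation.

```python
LINE = 64   # assumed cache line size in bytes

# Count invalidations under a single-owner model: a write by one core
# invalidates the other core's copy whenever both variables map to the
# same cache line, even though the variables themselves are distinct.
def invalidations(addr_by_core, write_order):
    owner = {}                        # line -> core holding it Modified
    inval = 0
    for core in write_order:
        line = addr_by_core[core] // LINE
        if owner.get(line, core) != core:
            inval += 1                # line must be stolen from the other core
        owner[line] = core
    return inval

alternating = [0, 1] * 1000
shared = invalidations({0: 0, 1: 8}, alternating)    # same line: bounces
padded = invalidations({0: 0, 1: 64}, alternating)   # next line: no bouncing
```

This is why per-thread data is usually padded or aligned to cache-line boundaries.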
On-chip interconnection network
General design
Specialised processors
Exploits Accelerator-Level Parallelism.
Heterogeneous/asymmetric vs homogeneous/symmetric
GPU
Domain-Specific Accelerators (DSA)