The microarchitecture of Intel, AMD and VIA CPUs

Table of Contents:
  1. Introduction to Intel Haswell and Broadwell Microarchitecture
  2. Pipeline Design and Instruction Processing
  3. Execution Units and Ports
  4. Multithreading and Resource Sharing
  5. Micro-op (μop) Cache and Instruction Fetch
  6. Stack Engine and Register Renaming
  7. Performance Bottlenecks and Optimization Strategies
  8. Advanced Instruction Sets: AVX2 and FMA
  9. Practical Use Cases and Software Implications
  10. Conclusion and Further Reading

Overview — Practical x86 microarchitecture for measurable performance

This concise overview distills empirical guidance on Intel, AMD and VIA x86 cores into a hands-on reference for engineers, compiler writers, and advanced students. Grounded in reproducible measurement, the material explains how front-end delivery (fetch/decode and the μop cache), out-of-order mechanics, execution ports, and shared core resources produce real performance effects, and how to design microbenchmarks and code changes that expose and remove the resulting bottlenecks.

Who this is for and expected level

Intended for experienced systems and performance programmers, compiler developers, HPC engineers, and advanced computer-architecture students. Prior familiarity with assembly, CPU pipelines, and hardware performance counters is helpful; the coverage emphasizes practical models and repeatable experiments rather than introductory theory.

Key learning outcomes

Readers will gain an actionable mental model and testable workflows for diagnosing and improving code performance on modern x86 implementations. You will learn to:

  • Reason about instruction delivery, decode pressure, and retirement to predict where cycles are consumed and why throughput stalls occur.
  • Identify when the μop cache or decode bandwidth — not execution latency — limits tight loops, and how to restructure code accordingly.
  • Interpret execution-port utilization and pipeline balance to reduce single-port contention across scalar and vector (AVX/AVX2/FMA) workloads.
  • Evaluate SMT (simultaneous multithreading) trade-offs empirically, spotting when shared buffers or ROB entries reduce scaling.
  • Build focused microbenchmarks and use hardware counters to isolate front-end stalls, back-end pressure, and memory-induced delays.
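
A minimal timing harness for the last item might look like the following C sketch. It assumes an x86-64 GCC/Clang toolchain; note that __rdtsc counts reference cycles rather than core clock cycles, so fix the clock frequency or compare variants as ratios, and collect port/stall counters separately with a profiler.

    /* Latency microbenchmark sketch: times a serial dependency chain of
       64-bit multiplies with the time-stamp counter. The multiplier
       constant and iteration count are arbitrary. */
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>   /* __rdtsc */

    int main(void) {
        const uint64_t iters = 100000000ull;
        uint64_t x = 3;
        uint64_t start = __rdtsc();
        for (uint64_t i = 0; i < iters; i++)
            x *= 0x9E3779B97F4A7C15ull;   /* each step depends on the last */
        uint64_t stop = __rdtsc();
        /* Printing x keeps the chain from being optimized away. */
        printf("ref cycles/iter: %.2f (x=%llu)\n",
               (double)(stop - start) / (double)iters,
               (unsigned long long)x);
        return 0;
    }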

Actionable measurement → optimization workflow

The material promotes an iterative cycle: form a compact model, measure with minimal microbenchmarks, apply surgical changes, and verify the results. Recommended steps include writing tiny kernels that isolate a single hypothesis, collecting retire/port/stall counters, and consulting instruction-to-port mappings to rebalance instruction mixes for sustained throughput.
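
For instance, to test the single hypothesis "this loop is latency-bound", compare a serial dependency chain against the same work split across two independent chains. A sketch in C, with arbitrary constants:

    /* If two independent chains run nearly twice as many multiplies per
       cycle as one serial chain, the serial loop was latency-bound; if
       not, suspect port pressure or the front end instead. */
    #include <stdint.h>

    uint64_t serial_chain(uint64_t x, uint64_t iters) {
        for (uint64_t i = 0; i < iters; i++)
            x *= 0x9E3779B97F4A7C15ull;    /* each step waits on the last */
        return x;
    }

    uint64_t two_chains(uint64_t x, uint64_t y, uint64_t iters) {
        for (uint64_t i = 0; i < iters; i++) {
            x *= 0x9E3779B97F4A7C15ull;    /* chain 1 */
            y *= 0xC2B2AE3D27D4EB4Full;    /* chain 2, independent of 1 */
        }
        return x ^ y;
    }

Each variant changes exactly one thing, so any measured difference can be attributed to instruction latency rather than to some other property of the loop.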

Practical techniques emphasized

  • Favor small, aligned loop bodies that exploit the μop cache; they often outperform larger unrolled loops that overload decode or retirement stages.
  • Profile and balance execution-port usage; redistribute arithmetic and memory ops to avoid saturating a single port.
  • Test thread placement and SMT empirically; some kernels benefit from extra hardware threads, while others suffer from shared resources or μop-cache contention.
  • Handle floating-point corner cases (denormals, rounding) deliberately; where exact subnormal semantics are unnecessary, flush-to-zero or MXCSR tweaks can reduce latency.
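
For the floating-point item, a minimal sketch of enabling flush-to-zero (FTZ) and denormals-are-zero (DAZ) through the SSE control/status register, using the standard intrinsics. Apply it only where exact subnormal semantics are known to be unnecessary, since it changes numerical results:

    /* Sets the FTZ and DAZ bits in MXCSR so subnormal results and inputs
       are treated as zero, avoiding the slow assist path many cores take
       for denormal arithmetic. The setting is per-thread. */
    #include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
    #include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

    void enable_ftz_daz(void) {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         /* results -> 0 */
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); /* inputs  -> 0 */
    }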

How complex mechanisms are taught simply

The guide separates conceptual models from processor-specific data: clear diagrams and timing tables build intuition, then microbenchmarks validate assumptions. This pragmatic approach helps you reason about x86 behavior generally before consulting numeric tables for particular processors.

Project ideas to build intuition

  • Measure SMT scaling for a compute-bound kernel and identify which shared structures (ROB, load/store buffers, ports) limit efficiency.
  • Create loop variants sized to fit or overflow the μop cache and compare decode versus retirement-limited performance.
  • Construct instruction sequences that stress a single port, then refactor to balance usage and observe throughput changes with counters.
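
As a starting point for the last project: on Haswell, for example, 64-bit integer multiply issues only on port 1, so even fully independent multiply chains serialize on that one port. A sketch:

    /* Four independent chains expose plenty of parallelism, yet every
       multiply competes for the same execution port (port 1 on Haswell),
       capping the loop near one multiply per cycle. Replacing some
       multiplies with adds or shifts spreads work across more ports. */
    #include <stdint.h>

    uint64_t stress_mul_port(uint64_t iters) {
        uint64_t a = 1, b = 2, c = 3, d = 4;
        for (uint64_t i = 0; i < iters; i++) {
            a *= 0x100000001B3ull;
            b *= 0x100000001B3ull;
            c *= 0x100000001B3ull;
            d *= 0x100000001B3ull;
        }
        return a ^ b ^ c ^ d;   /* keep all four chains live */
    }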

Quick glossary

  • μop (micro-operation): A decoded internal operation dispatched to the out-of-order execution engine; x86 instructions are decoded into one or more μops.
  • μop cache: Stores decoded μops to avoid repeated decode work in tight loops.
  • Execution ports: Logical paths to functional units; balanced port usage improves sustained throughput.
  • Reorder Buffer (ROB): Tracks in-flight instructions to ensure correct retirement after out-of-order execution.
  • AVX2 / FMA: Vector and fused multiply-add instruction sets that increase throughput if port contention is managed.
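
To make the last entry concrete, a one-function sketch using the 256-bit FMA intrinsic (requires hardware FMA support and, with GCC/Clang, compiling with -mfma):

    /* Fused multiply-add over eight floats: computes a*x + y with a
       single rounding. On Haswell, FMA executes on ports 0 and 1, so two
       independent FMAs can start each cycle if the surrounding code
       exposes enough parallelism. */
    #include <immintrin.h>

    __m256 axpy8(__m256 a, __m256 x, __m256 y) {
        return _mm256_fmadd_ps(a, x, y);
    }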

In keeping with Agner Fog's measurement-first approach, the emphasis is always on reproducible evidence and minimal, targeted changes that produce predictable speedups. If you need practical, measurement-driven strategies to tune real workloads on contemporary x86 CPUs, this overview highlights the models, counters, and workflows to get you started.


Author: Agner Fog