
Performance engineering can be deeply mysterious. Sometimes adding a single line of code makes a program run 2× faster. Behavior like this is hard to explain unless you understand processor microarchitecture and compiler optimizations.
In this video, I show how adding a single line of code to a slow-running program makes it run 2× faster. You’ll see how this one change helped the compiler arrange instructions in memory so the CPU could fetch them from its micro-op cache instead of decoding them every time, a huge win for hot loops.
On Intel processors, this micro-op cache is known as the Decoded Stream Buffer (DSB). It’s designed specifically to accelerate hot paths in your code by caching pre-decoded instructions, so the CPU can skip the expensive fetch/decode stages entirely. Understanding when and how the DSB kicks in is key to unlocking this kind of speedup.
If you’re curious about controlling the hardware and squeezing out every last ounce of performance, you should watch the video.
Along the way, we’ll cover:
- Measuring performance with Linux perf
- Using Top-Down Microarchitectural Analysis (TMA) to pinpoint hardware bottlenecks
- Understanding what the DSB is and when it’s used
- Forcing the compiler to take advantage of it with code alignment and profile-guided optimization
The result is a 2× faster loop and a set of techniques that you can use to debug and optimize your own loops.
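To give a feel for why code alignment can matter to the DSB, here is a small back-of-the-envelope sketch. It is not taken from the video. It assumes the micro-op cache tracks code in aligned 32-byte windows, which is how Intel documents it for several generations; other generations use a different size, so the window is a parameter. The addresses, the loop size, and the helper names `windows_spanned` and `padding_to_align` are made up for illustration.

```python
def windows_spanned(start_addr: int, size_bytes: int, window: int = 32) -> int:
    """Count the aligned `window`-byte regions touched by code that
    occupies the byte range [start_addr, start_addr + size_bytes)."""
    if size_bytes <= 0:
        return 0
    first = start_addr // window
    last = (start_addr + size_bytes - 1) // window
    return last - first + 1


def padding_to_align(start_addr: int, alignment: int = 32) -> int:
    """Bytes of padding needed to move start_addr up to the next
    multiple of `alignment` (0 if it is already aligned)."""
    return (-start_addr) % alignment


if __name__ == "__main__":
    loop_size = 30  # hypothetical size of a hot loop body, in bytes
    for start in (0x401000, 0x40101C):  # hypothetical load addresses
        print(
            f"loop at {start:#x}: spans {windows_spanned(start, loop_size)} "
            f"window(s), needs {padding_to_align(start)} byte(s) of padding"
        )
```

The same 30-byte loop fits in one window when it starts on a 32-byte boundary, but straddles two when it starts 28 bytes into a window. Whether that actually changes DSB behavior depends on the processor and the instruction mix, so treat this as intuition and confirm with measurements.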
Confessions of a Code Addict