An STM32 WFI bug

2023-07-17

I really like the STM32 series of microcontrollers. They’re generally quite reliable, the peripherals are well tested, and more often than not I can just grab one off the shelf and not think about it too much.

However, like every microcontroller, they do contain implementation bugs, so it’s always important to read the “Errata Sheet” (or in ST’s language, “Device Limitations”) when you’re using a part.

I appear to have hit an implementation bug in certain STM32 lines that is not listed in the errata sheet. I can’t find any specific description of this bug on the internet, so I’ve attempted to nail one down. Hopefully this will come up in the search results for someone who hits this in the future and save them some time.

The Symptoms

The CPU sometimes crashes or takes an unexpected code path. This will happen most often when the system is relatively lightly loaded.

If you power-cycle the device without an SWD debug probe attached, the firmware will work fine. But attaching an SWD debug probe (and running, say, openocd) causes things to misbehave.

The misbehavior for me has manifested in a variety of ways:

A good old-fashioned HardFault.
A branch to an unexpected place, such as activating a “panic if pointer is zero” path and printing a pointer value that is clearly non-zero.
Things that can look like stack corruption.

If you hit this bug, you are almost certainly running an RTOS, or have taken the time to add code to your firmware to enter the microcontroller’s “Sleep” mode on idle.

If the CPU crashes outright (with a HardFault or other fault) you will likely see the PRIMASK register set, which masks all interrupts with configurable priority.

The cause that I’m aware of

This may not be the only thing that tickles this bug, but here’s a situation where I can reproduce it:

Your debug software sets the DBG_SLEEP bit in the DBGMCU_CR register to try and keep the debug connection alive even while the processor is sleeping.
Your firmware executes a WFI instruction from flash with interrupts disabled (typically just after a cpsid i instruction or a manual write to PRIMASK).
This WFI instruction’s address is greater than 8 mod 16. That is, the last hex-digit of its address is a, c, or e.
There is a branch within a few instructions of the WFI.
There is some important instruction, such as a stack pointer fixup, between the WFI and branch.

For instance, this code causes the bug to manifest:

 800120c        wfi
 800120e        nop
 8001210        add sp, #68
 8001212        b.w 8005f6c

But this does not:

 800120c        wfi
 800120e        add sp, #68
 8001210        b.w 8005f6c

In this particular case, it appears to corrupt the update to the stack pointer. It’s hard to say for sure, because single-stepping causes the bug to go away. But this has manifested as program corruption which sometimes causes faults, and sometimes just causes unexpected control flow (by popping the wrong return address, say).
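The address condition from the list above is easy to check mechanically. Here is a minimal sketch; the function name wfi_address_is_suspect is made up for illustration, and the condition should be read as necessary rather than sufficient, since the other conditions in the list also have to hold.

```rust
/// Returns true if a WFI at `addr` meets the alignment condition described
/// above: its address is greater than 8 mod 16. Thumb instructions are
/// halfword-aligned, so in practice that means a last hex digit of a, c, or e.
///
/// This only checks the address. It says nothing about the DBG_SLEEP bit,
/// the interrupt mask, or the instructions that follow the WFI.
pub fn wfi_address_is_suspect(addr: u32) -> bool {
    addr % 16 > 8
}
```

For the listings above, wfi_address_is_suspect(0x0800_120c) returns true, while a WFI at an address ending in 8 or below would return false.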

The workaround

One possible workaround is to reconfigure your debugging software to not set the DBG_SLEEP bit in the DBGMCU_CR register. However, if you can change the firmware, there’s an alternative.
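Whether the debugger has actually set that bit can be checked from the firmware side. The following is a minimal sketch under stated assumptions: it assumes DBGMCU_CR is at 0xE0042004 with DBG_SLEEP in bit 0, which is the STM32L4 layout, so verify both against the reference manual for your part. The function names are hypothetical.

```rust
/// ASSUMPTION: address of DBGMCU_CR on the STM32L4. Check your part's
/// reference manual before relying on this.
const DBGMCU_CR: *const u32 = 0xE004_2004 as *const u32;

/// ASSUMPTION: DBG_SLEEP is bit 0 of DBGMCU_CR (STM32L4 layout).
const DBG_SLEEP: u32 = 1 << 0;

/// Pure helper: is the DBG_SLEEP bit set in a DBGMCU_CR value?
pub fn dbg_sleep_is_set(cr: u32) -> bool {
    cr & DBG_SLEEP != 0
}

/// Reads DBGMCU_CR from the hardware.
///
/// # Safety
/// Only call this on a target where the address above really is DBGMCU_CR.
#[allow(dead_code)]
pub unsafe fn read_dbgmcu_cr() -> u32 {
    unsafe { core::ptr::read_volatile(DBGMCU_CR) }
}
```

On target, something like dbg_sleep_is_set(unsafe { read_dbgmcu_cr() }) at startup can tell you whether the attached debug software has enabled debug-in-sleep, which is one of the conditions for the bug.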

It appears to be sufficient to insert an Instruction Barrier (isb) immediately after the WFI.

For instance, keeping all instruction addresses the exact same, this sequence crashes:

 800120c       wfi
 800120e       nop.w
 8001212       add sp, #68
 8001214       b.w 8005f70

…while this sequence works:

 800120c       wfi
 800120e       isb sy
 8001212       add sp, #68
 8001214       b.w 8005f70

So, in your idle loop or equivalent, change whatever code you use to generate a WFI instruction to also produce an ISB. For instance, in Rust using the cortex_m crate,

cortex_m::asm::wfi();
cortex_m::asm::isb();

or using assembly directly:

// asm! is unsafe to call, so it needs an unsafe block.
unsafe {
    core::arch::asm!("
        wfi
        isb
        ",
        options(nostack, nomem, preserves_flags),
    );
}
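To confirm the workaround made it into the built firmware, a disassembly listing can be scanned for any WFI that is not immediately followed by an ISB. This is a rough sketch, not a polished tool: it assumes the simplified listing format used in this post (hex address, then mnemonic, then operands), so real objdump output with a raw-bytes column would need a different parser. The function names are invented for illustration.

```rust
/// Parses one listing line of the form "<hex address> <mnemonic> [operands]".
/// Returns None for blank or unparseable lines.
fn parse_line(line: &str) -> Option<(u32, &str)> {
    let mut parts = line.split_whitespace();
    // Tolerate a trailing ':' after the address.
    let addr = u32::from_str_radix(parts.next()?.trim_end_matches(':'), 16).ok()?;
    let mnemonic = parts.next()?;
    Some((addr, mnemonic))
}

/// Returns the addresses of WFI instructions whose next instruction in the
/// listing is not an ISB.
pub fn wfi_without_isb(listing: &str) -> Vec<u32> {
    let insns: Vec<(u32, &str)> = listing.lines().filter_map(parse_line).collect();
    let mut missing = Vec::new();
    for (i, (addr, mnemonic)) in insns.iter().enumerate() {
        if !mnemonic.eq_ignore_ascii_case("wfi") {
            continue;
        }
        // A WFI at the very end of the listing counts as "not followed".
        let followed_by_isb = insns
            .get(i + 1)
            .map_or(false, |(_, next)| next.to_ascii_lowercase().starts_with("isb"));
        if !followed_by_isb {
            missing.push(*addr);
        }
    }
    missing
}
```

Run against the two listings above, this flags the WFI at 800120c in the nop.w version and reports nothing for the isb version.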

Analysis

Without taking more time to reverse engineer the behavior, I can’t say for sure what’s causing this – and since I don’t work at ST and don’t have access to the RTL, reverse-engineering it would be a slow process.

What I suspect is happening is:

ST has added special logic around the ARM Cortex-M core to keep clocks used by the debug unit active during sleep, when the clocks would otherwise halt.
The Cortex-M core is doing some kind of prefetching of instructions into the pipeline.
One of the clocks ST is leaving on during sleep is causing that prefetch/pipeline logic to continue progressing while the CPU is stopped.
When the CPU finally wakes back up, its instruction pipeline has been silently corrupted.

In firmware that doesn’t use WFI with interrupts disabled, we likely avoid the problem because (per the ARMv7-M architecture spec) entering an interrupt handler requires a flush and refetch of the pipeline contents equivalent to an ISB instruction. So for folks (including Hubris) using WFI while interrupts are on, waking from sleep mode will go directly into an ISR and implicitly perform the workaround described here. (I suspect that this is why this bug isn’t better understood – a lot of applications don’t ever hit it!)

Devices known to be affected

I have observed this behavior mostly on the STM32L4 series.

I believe I’ve seen it on the STM32G4 as well, but I haven’t worked with that processor enough to be sure (since this bug requires the stars to align and your code’s alignment to be just right to manifest).

Certain bug reports I found online while trying to figure this out have hinted that the issue also appears on:

I have so far not seen this on any M0+-based STM32, such as the STM32G0 or L0, which suggests that the higher pipeline complexity of the M3/M4 is involved in the problem. That, or I just haven’t managed to trigger it yet. That said, there’s a collection of bug reports in the Zephyr project on the STM32L0, which is an M0+ core, that sound related. They fixed it using unrelated changes like enabling other peripheral clocks and inserting DSB barriers; in my testing here that doesn’t reliably fix the problem, so that’s interesting.

#embedded #hardware
