An STM32 WFI bug
I really like the STM32 series of microcontrollers. They’re generally quite reliable, the peripherals are well tested, and more often than not I can just grab one off the shelf and not think about it too much.
However, like every microcontroller, they do contain implementation bugs, so it’s always important to read the “Errata Sheet” (or in ST’s language, “Device Limitations”) when you’re using a part.
I appear to have hit an implementation bug in certain STM32 lines that is not listed in the errata sheet. I can’t find any specific description of this bug on the internet, so I’ve attempted to nail one down. Hopefully this will come up in the search results for someone who hits this in the future and save them some time.
The Symptoms
The CPU sometimes crashes or takes an unexpected code path. This will happen most often when the system is relatively lightly loaded.
If you power-cycle the device without a SWD debug probe attached, the firmware will work fine. But attaching a SWD debug probe (and running, say, openocd) causes things to misbehave.
The misbehavior for me has manifested in a variety of ways:
- A good old-fashioned HardFault.
- A branch to an unexpected place, such as activating a “panic if pointer is zero” path and printing a pointer value that is clearly non-zero.
- Things that can look like stack corruption.
If you hit this bug, you are almost certainly running an RTOS, or have taken the time to add code to your firmware to enter the microcontroller’s “Sleep” mode on idle.
If the CPU crashes outright (with a HardFault or other fault) you will likely see the PRIMASK register set, which masks all interrupts and configurable-priority faults.
The cause that I’m aware of
This may not be the only thing that tickles this bug, but here’s a situation where I can reproduce it:
- Your debug software sets the DBG_SLEEP bit in the DBGMCU_CR register to try and keep the debug connection alive even while the processor is sleeping.
- Your firmware executes a WFI instruction from flash with interrupts disabled (typically just after a cpsid f instruction or a manual write to PRIMASK).
- This WFI instruction’s address is greater than 8 mod 16. That is, the last hex digit of its address is a, c, or e.
- There is a branch within a few instructions of the WFI.
- There is some important instruction, such as a stack pointer fixup, between the WFI and branch.
For instance, this code causes the bug to manifest:
800120c wfi
800120e nop
8001210 add sp, #68
8001212 b.w 8005f6c
But this does not:
800120c wfi
800120e add sp, #68
8001210 b.w 8005f6c
In this particular case, it appears to corrupt the update to the stack pointer. It’s hard to say for sure, because single-stepping causes the bug to go away. But this has manifested as program corruption which sometimes causes faults, and sometimes just causes unexpected control flow (by popping the wrong return address, say).
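The address and debug-register conditions above are easy to check mechanically. Here is a minimal sketch, assuming DBG_SLEEP is bit 0 of DBGMCU_CR (as on the STM32L4; check the reference manual for your part). The function names are made up for illustration. Matching the alignment condition does not mean the code will crash, since the other conditions have to hold too.

```rust
/// Assumed position of DBG_SLEEP within DBGMCU_CR (bit 0, as on the
/// STM32L4). This is an assumption; verify it for your part.
const DBG_SLEEP: u32 = 1 << 0;

/// Returns true if a raw DBGMCU_CR value has the DBG_SLEEP bit set.
fn dbg_sleep_is_set(dbgmcu_cr: u32) -> bool {
    dbgmcu_cr & DBG_SLEEP != 0
}

/// Returns true if an instruction address is greater than 8 mod 16.
/// Thumb instructions are halfword-aligned, so in practice this means
/// the last hex digit of the address is a, c, or e.
fn wfi_address_is_suspect(addr: u32) -> bool {
    addr % 16 > 8
}
```

For the listings above, wfi_address_is_suspect(0x0800_120c) is true, because the last hex digit is c.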
The workaround
One possible workaround is to reconfigure your debugging software to not set the DBG_SLEEP bit in the DBGMCU_CR register. However, if you can change the firmware, there’s an alternative.
It appears to be sufficient to insert an Instruction Synchronization Barrier (isb) immediately after the WFI.
For instance, keeping all instruction addresses the exact same, this sequence crashes:
800120c wfi
800120e nop.w
8001212 add sp, #68
8001214 b.w 8005f70
…while this sequence works:
800120c wfi
800120e isb sy
8001212 add sp, #68
8001214 b.w 8005f70
So, in your idle loop or equivalent, change whatever code you use to generate a WFI instruction to also produce an ISB. For instance, in Rust using the cortex_m crate,
cortex_m::asm::wfi();
cortex_m::asm::isb();
or using assembly directly:
unsafe { core::arch::asm!("wfi", "isb"); }
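One way to make sure the two instructions never get separated is to wrap them in a single helper and call that everywhere the firmware idles. This is only a sketch: idle_sleep is a hypothetical name, and the non-ARM fallback exists purely so the code also builds on a host machine.

```rust
/// Sleeps until the next wakeup event, then flushes the pipeline.
/// `idle_sleep` is a hypothetical helper, not part of any crate.
#[cfg(target_arch = "arm")]
#[inline(always)]
pub fn idle_sleep() {
    // WFI and ISB live in one asm block so the compiler cannot place
    // any other instruction between them.
    // Assumes a core that implements both instructions (e.g. Cortex-M3/M4).
    unsafe {
        core::arch::asm!("wfi", "isb", options(nostack, preserves_flags));
    }
}

/// Host fallback: there is no sleep instruction to issue here, so this
/// only emits a spin-loop hint and returns.
#[cfg(not(target_arch = "arm"))]
pub fn idle_sleep() {
    core::hint::spin_loop();
}
```

Putting both instructions in one asm block, rather than two separate calls, means no other instruction can be placed between the WFI and the ISB.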
Analysis
Without taking more time to reverse engineer the behavior, I can’t say for sure what’s causing this – and since I don’t work at ST and don’t have access to the RTL, reverse-engineering it would be a slow process.
What I suspect is happening is:
-
ST has added special logic around the ARM Cortex-M core to keep clocks used by the debug unit active during sleep, when the clocks would otherwise halt.
-
The Cortex-M core is doing some kind of prefetching of instructions into the pipeline.
-
One of the clocks ST is leaving on during sleep is causing that prefetch/pipeline logic to continue progressing while the CPU is stopped.
-
When the CPU finally wakes back up, its instruction pipeline has been silently corrupted.
In firmware that doesn’t use WFI with interrupts disabled, we likely avoid the problem because (per the ARMv7-M architecture spec) entering an interrupt handler requires a flush and refetch of the pipeline contents equivalent to an ISB instruction. So for folks (including Hubris) using WFI while interrupts are on, waking from sleep mode will go directly into an ISR and implicitly perform the workaround described here. (I suspect that this is why this bug isn’t better understood – a lot of applications don’t ever hit it!)
Devices known to be affected
I have observed this behavior mostly on the STM32L4 series.
I believe I’ve seen it on the STM32G4 as well, but I haven’t worked with that processor enough to be sure (since this bug requires the stars to align and your code’s alignment to be just right to manifest).
Certain bug reports I found online while trying to figure this out have hinted that the issue also appears on:
I have so far not seen this on any M0-based STM32, such as the STM32G0 or L0, which suggests that the higher pipeline complexity of the M3/M4 is involved in the problem. That, or I just haven’t managed to trigger it yet. There’s a collection of bug reports in the Zephyr project that sound related on STM32L0, which is an M0+ core. They fixed it using unrelated changes like enabling other peripheral clocks and inserting DSB barriers; in my testing here that doesn’t reliably fix the problem, so that’s interesting.