I really like the STM32 series of microcontrollers in general. They’re generally quite reliable, the peripherals are well tested, and more often than not I can just grab one off the shelf and not think about it too much.
However, like every microcontroller, they do contain implementation bugs, so it’s always important to read the “Errata Sheet” (or in ST’s language, “Device Limitations”) when you’re using a part.
I appear to have hit an implementation bug in certain STM32 lines that is not listed in the errata sheet. I can’t find any specific description of this bug on the internet, so I’ve attempted to nail one down. Hopefully this will come up in the search results for someone who hits this in the future and save them some time.
The CPU sometimes crashes or takes an unexpected code path. This will happen most often when the system is relatively lightly loaded.
If you power-cycle the device without a SWD debug probe attached, the firmware will work fine. But attaching a SWD debug probe and running debug software against it causes things to misbehave.
The misbehavior for me has manifested in a variety of ways:
- A good old-fashioned HardFault.
- A branch to an unexpected place, such as activating a “panic if pointer is zero” path and printing a pointer value that is clearly non-zero.
- Things that can look like stack corruption.
If you hit this bug, you are almost certainly running an RTOS, or have taken the time to add code to your firmware to enter the microcontroller’s “Sleep” mode on idle.
If the CPU crashes outright (with a HardFault or other fault), you will likely find the PRIMASK register set, disabling all interrupts and faults.
The cause that I’m aware of
This may not be the only thing that tickles this bug, but here’s a situation where I can reproduce it:
- Your debug software sets the DBG_SLEEP bit in the DBGMCU_CR register to try and keep the debug connection alive even while the processor is sleeping.
- Your firmware executes a WFI instruction from flash with interrupts disabled (typically just after a cpsid f instruction or a manual write to the corresponding mask register).
- The WFI instruction’s address is greater than 8 mod 16. That is, the last hex digit of its address is A, C, or E (Thumb instructions are halfword-aligned, so odd digits cannot occur).
- There is a branch within a few instructions of the WFI.
- There is some important instruction, such as a stack pointer fixup, between the WFI and the branch.
For instance, this code causes the bug to manifest:
8001210 add sp, #68
8001212 b.w 8005f6c
But this does not:
800120e add sp, #68
8001210 b.w 8005f6c
In this particular case, it appears to corrupt the update to the stack pointer. It’s hard to say for sure, because single-stepping causes the bug to go away. But this has manifested as program corruption which sometimes causes faults, and sometimes just causes unexpected control flow (by popping the wrong return address, say).
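The first three conditions listed above are mechanical enough to write down as a predicate. What follows is only a sketch for reasoning about a disassembly, not something the firmware itself needs. The function names are hypothetical, and the DBG_SLEEP position (bit 0 of DBGMCU_CR) is taken from the STM32L4 reference manual, so check it against the manual for any other part.

```rust
/// Assumed position of DBG_SLEEP in DBGMCU_CR (bit 0 on the STM32L4);
/// verify against the reference manual for your device.
const DBG_SLEEP: u32 = 1 << 0;

/// True if a WFI at `addr` sits in the part of a 16-byte line described
/// above, i.e. its address is greater than 8 mod 16.
pub fn wfi_address_at_risk(addr: u32) -> bool {
    addr % 16 > 8
}

/// True if a DBGMCU_CR value read from the target has DBG_SLEEP set.
pub fn dbg_sleep_enabled(dbgmcu_cr: u32) -> bool {
    dbgmcu_cr & DBG_SLEEP != 0
}

/// Combines the three checkable conditions. The remaining ones (a nearby
/// branch and an important instruction in between) need a look at the
/// disassembly and are not modeled here.
pub fn trigger_conditions_met(dbgmcu_cr: u32, wfi_addr: u32, interrupts_masked: bool) -> bool {
    dbg_sleep_enabled(dbgmcu_cr) && interrupts_masked && wfi_address_at_risk(wfi_addr)
}
```

A result of true only means the setup matches the situation described here; it does not prove that a given crash is this bug.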
One possible workaround is to reconfigure your debugging software to not set the DBG_SLEEP bit in the DBGMCU_CR register. However, if you can change the firmware, there’s an alternative.
It appears to be sufficient to insert an Instruction Synchronization Barrier (ISB) after the WFI.
For instance, keeping all instruction addresses the exact same, this sequence crashes:
8001212 add sp, #68
8001214 b.w 8005f70
…while this sequence works:
800120e isb sy
8001212 add sp, #68
8001214 b.w 8005f70
So, in your idle loop or equivalent, change whatever code you use to generate a WFI instruction to also produce an ISB. In Rust, for instance, this can be done either through a crate that wraps the two instructions or using assembly directly.
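Here is a minimal sketch of the assembly route, assuming a bare-metal Arm target and Rust’s core::arch::asm! macro. The names sleep_with_barrier and idle_once are made up for illustration, and the host fallback exists only so the snippet builds off-target. With the cortex-m crate, calling cortex_m::asm::wfi() followed by cortex_m::asm::isb() should emit the same two instructions, though the compiler is not obliged to keep separate calls adjacent.

```rust
/// Sleep until the next wake-up event, then flush the pipeline.
/// Hypothetical helper name; only the WFI + ISB pairing comes from the text.
#[cfg(all(target_arch = "arm", target_os = "none"))]
#[inline(always)]
pub fn sleep_with_barrier() {
    // SAFETY: WFI and ISB use no stack and do not alter the flags.
    unsafe {
        // One asm! block keeps the ISB directly after the WFI. Omitting
        // `nomem` makes the block act as a compiler-level memory barrier.
        core::arch::asm!("wfi", "isb sy", options(nostack, preserves_flags));
    }
}

/// Host fallback so the sketch compiles and can be exercised off-target.
#[cfg(not(all(target_arch = "arm", target_os = "none")))]
pub fn sleep_with_barrier() {
    core::hint::spin_loop();
}

/// One pass of a hypothetical idle loop; returns the updated wake count.
pub fn idle_once(wake_count: u32) -> u32 {
    sleep_with_barrier();
    wake_count.wrapping_add(1)
}
```

Putting both instructions in a single asm! block is a deliberate choice: it rules out the compiler scheduling anything between the WFI and the barrier.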
Without taking more time to reverse engineer the behavior, I can’t say for sure what’s causing this – and since I don’t work at ST and don’t have access to the RTL, reverse-engineering it would be a slow process.
What I suspect is happening is:
- ST has added special logic around the ARM Cortex-M core to keep clocks used by the debug unit active during sleep, when the clocks would otherwise halt.
- The Cortex-M core is doing some kind of prefetching of instructions into the pipeline.
- One of the clocks ST is leaving on during sleep is causing that prefetch/pipeline logic to continue progressing while the CPU is stopped.
- When the CPU finally wakes back up, its instruction pipeline has been silently corrupted.
In firmware that doesn’t use WFI with interrupts disabled, we likely avoid the problem because (per the ARMv7-M architecture spec) entering an interrupt handler requires a flush and refetch of the pipeline contents equivalent to an ISB instruction. So for folks (including Hubris) using WFI while interrupts are on, waking from sleep mode will go directly into an ISR and implicitly perform the workaround described here. (I suspect that this is why this bug isn’t better understood – a lot of applications don’t ever hit it!)
Devices known to be affected
I have observed this behavior mostly on the STM32L4 series.
I believe I’ve seen it on the STM32G4 as well, but I haven’t worked with that processor enough to be sure (since this bug requires the stars to align and your code’s alignment to be just right to manifest).
Certain bug reports I found online while trying to figure this out have hinted that the issue also appears on other parts.
I have so far not seen this on any M0-based STM32, such as the STM32G0 or L0, which suggests that the higher pipeline complexity of the M3/M4 is involved in the problem. That, or I just haven’t managed to trigger it yet. There’s a collection of bug reports in the Zephyr project that sound related on STM32L0, which is an M0+ core. They fixed it using unrelated changes like enabling other peripheral clocks and inserting DSB barriers; in my testing here that doesn’t reliably fix the problem, so that’s interesting.