My original development machine for Cimba was an old Intel Xeon E5 running Windows 10. However, it was not possible to switch to Windows 11 on it, not even with a TPM 2.0 chip installed. Windows being what it is, I took that as the impetus for switching to an AMD Threadripper and Arch Linux for the main development rig. I still wanted to have Windows support for Cimba, though, so I checked from time to time that it still worked on the Xeon as well.
I then set up a GitHub Actions workflow to automate building and testing Cimba there. Effectively, it runs meson build; meson test on the GitHub runners every time I push a change. The runners are specified as “latest” for both Ubuntu Linux and Windows. That presumably implies new hardware and the newest versions of the OS.
It was an unpleasant surprise to see that the Windows test suite on GitHub failed all
tests that involved coroutines, both directly as in test_coroutine.c and everything
built on top of it like test_process.c. Windows 11 promptly killed the executable as
soon as it tried to start a coroutine. The only error message given was a rather
terse error code ‘0xc0000005’, access violation.
This was quite a debugging challenge. As always, the Windows internals are not particularly well-documented, but two things soon became clear:
- Windows 11 has far stricter security measures to prevent stack-smashing exploits than older versions and will summarily kill any suspect process.
- This is supported by modern hardware features such as Intel Control-flow Enforcement Technology (CET), which monitors program execution and alerts the OS to any suspicious activity.
This is, of course, a very good thing, but it was not able to distinguish the legit Cimba stackful coroutines from a hacker attack that tries to gain control by manipulating the stack and the return addresses there. As described in the Cimba documentation, the coroutines work by creating individual stacks in heap memory. This fails when the CPU and OS are monitoring a “shadow stack” to continuously verify that the program’s own stack still matches the secret copy. In effect, we have to explain to the OS and CPU both that our cactus stack is a valid stack and that our manipulations of return pointers are harmless.
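To make the failure mode concrete, here is a toy model of the shadow-stack check in C. It is a sketch under simplifying assumptions, not a description of how the CPU or Windows actually implement it: the names shadow_stack, model_call, and model_ret are hypothetical, and the model tracks nothing but return addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHADOW_DEPTH 16u

/* Hypothetical toy model of a shadow stack: only return addresses are kept. */
struct shadow_stack {
    uintptr_t entries[SHADOW_DEPTH];
    size_t top;
};

/* Model of CALL: the return address is recorded on the shadow stack
 * in addition to being pushed on the ordinary stack. */
static void model_call(struct shadow_stack *ss, uintptr_t return_address)
{
    assert(ss->top < SHADOW_DEPTH);
    ss->entries[ss->top++] = return_address;
}

/* Model of RET: the address found on the ordinary stack is compared with
 * the shadow copy. Returns 1 if they match, 0 for a violation. */
static int model_ret(struct shadow_stack *ss, uintptr_t address_on_stack)
{
    if (ss->top == 0u) {
        return 0;   /* nothing recorded, so any return is suspect */
    }
    return ss->entries[--ss->top] == address_on_stack;
}
```

In this model, a call that records 0x1000 followed by a return to 0x1000 passes. A return to any other address is flagged, and that is roughly what a RET looks like right after the stack pointer has been switched to a coroutine stack the shadow copy knows nothing about.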
Basically, the relevant parts of the Cimba context switching code for Windows followed the outline given by Malte Skarupke in his 2013 blog post “Handmade Coroutines for Windows”. That still works for Windows 10, but Windows 11 on a modern AMD or Intel CPU, perhaps with a hypervisor in between, is a very different animal.
With significant assistance from both Google Gemini and Anthropic Claude together with human trial-and-error testing, I was able to find a set of code clarifications that was accepted. The complete code is in the Cimba repo, so I will only review the main points here.
- Tell it explicitly to expect stack shenanigans. Even if not using any Windows fiber library functions, call ConvertThreadToFiber(NULL); to state that here be dragons. For extra credit, wrap it in a test where the magic number ‘0x1e00’ means “not a fiber”:

      if (GetCurrentFiber() == (void*)0x1e00) {
          ConvertThreadToFiber(NULL);
      }

- Stack memory is different. Do not simply call malloc() or calloc() to allocate new stack memory. Windows 11 expects page-aligned memory allocated by VirtualAlloc() as valid stack space. After use, it needs to be freed by the matching VirtualFree(). Allocate and designate a guard page at the low end of the stack.

      /* Allocate memory suitable for a stack */
      unsigned char *cmi_coroutine_stack_alloc(const size_t size,
                                               unsigned char **base_p,
                                               unsigned char **limit_p)
      {
          void *raw = VirtualAlloc(NULL, size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
          cmb_assert_always(raw != NULL);

          DWORD old_protect;
          VirtualProtect(raw, 4096u, PAGE_READWRITE | PAGE_GUARD, &old_protect);

          /* The stack grows downwards, the base is at the top, less a few bytes for the OS */
          *base_p = raw + size - 16u;

          /* The bottom includes the guard page */
          *limit_p = raw + 4096u;

          return raw;
      }

      /* Free memory previously allocated for a stack */
      void cmi_coroutine_stack_free(unsigned char *stack)
      {
          int r = VirtualFree(stack, 0, MEM_RELEASE);
          cmb_assert_always(r != 0);
      }

- Besides StackBase and StackLimit, there is a third value to worry about in the Thread Information Block (TIB): the DeallocationStack at GS:[0x1478]. It contains the raw memory address returned by VirtualAlloc(). As the Wikipedia article on the TIB states: “Setting stack limits without setting DeallocationStack will probably cause odd behavior in SetThreadStackGuarantee. For example, it will overwrite the stack limits to wrong values.”

- In the actual context switch, the CPU becomes very afraid if it detects an inconsistent state between the stack parameters in the TIB, its CET Shadow Stack, and the actual memory accesses in progress. Wait until the very last moment before changing the TIB values. Change them as atomically as possible: load all three values for the new stack into scratch registers, then write them to the TIB with no interleaving stack access. Only then proceed with accessing the new stack.

- Do not use RET to jump to the new coroutine. The oldest trick in the hacker book is to overwrite the return address by abusing a buffer overflow on the stack and then let the program return to a hacker-selected address, potentially taking full control of the machine. Windows 11 and CET are understandably wary of anything that looks strange there. Instead, spell it out explicitly by popping the return address into a scratch register and then jumping to the address in the register:

      ;-------------------------------------------------------------------------------
      ; Callable function void *cmi_coroutine_context_switch(void **old,
      ;                                                      void **new,
      ;                                                      void *ret)
      ; Arguments:
      ;   void **old - RCX - address for storing current stack pointer
      ;   void **new - RDX - address for reading new stack pointer
      ;   void *ret  - R8  - return value passed from old to new context
      ; Scratch registers used:
      ;   R9, R10, R11, RAX
      ; Return value:
      ;   void * - RAX - whatever was given as the third argument
      ; Error handling:
      ;   None - the samurai returns victorious or not at all
      ;
      cmi_coroutine_context_switch:
          ; Push all callee-saved registers to current stack
          save_context
          ;
          ; Push the TIB DeallocationStack, StackLimit, and StackBase entries
          mov r9, [gs:0x1478]     ; DeallocationStack
          push r9
          mov r9, [gs:16]         ; StackLimit
          push r9
          mov r9, [gs:8]          ; StackBase
          push r9
          ;
          ; Store old stack pointer to address given as first argument RCX
          mov [rcx], rsp
          ;
          ; Load the new RSP from the second argument RDX into a scratch register
          mov r9, [rdx]
          ;
          ; Load new stack info into scratch registers for atomic TIB change
          mov r10, [r9]           ; New StackBase
          mov r11, [r9 + 8]       ; New StackLimit
          mov rax, [r9 + 16]      ; New DeallocationStack
          ;
          ; Write the new stack info to Windows TIB without touching the stack
          mov [gs:8], r10         ; Update StackBase
          mov [gs:16], r11        ; Update StackLimit
          mov [gs:0x1478], rax    ; Update DeallocationStack
          ;
          ; Done, safe to switch to the new stack, advancing past the used TIB entries
          mov rsp, r9
          add rsp, 24
          ;
          ; We are now in the new context, restore other registers from the new stack
          load_context
          ;
          ; Load whatever was in the third argument R8 as return value in RAX
          mov rax, r8
          ;
          ; Return to wherever the new context was transferring from earlier
          ; Note that we spell out the 'ret' as 'pop, jmp' for Intel CET reasons.
          pop r9
          jmp r9

- Tell gcc not to produce stack-checking code, here in meson.build:

      if host_machine.system() == 'windows'
          desired_flags += ['-fno-plt', '-fcf-protection=none', '-fno-stack-protector']
      endif
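The stack bookkeeping from the list above boils down to plain address arithmetic, which can be sketched in portable C without any Windows calls. This is an illustration only: struct stack_layout, round_up_to_page, layout_for, and layout_is_consistent are hypothetical names that are not part of Cimba, and the 4 KiB page size and the 16-byte reserve at the top are assumptions carried over from the allocation code shown earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096u  /* assumption: 4 KiB pages */
#define BASE_RESERVE 16u         /* assumption: bytes left free at the top */

/* Hypothetical record of the three values the TIB holds for one stack. */
struct stack_layout {
    uintptr_t base;     /* StackBase: top of the stack, which grows downwards */
    uintptr_t limit;    /* StackLimit: lowest usable address, above the guard page */
    uintptr_t dealloc;  /* DeallocationStack: raw start of the allocation */
};

/* Round a requested stack size up to a whole number of pages. */
static size_t round_up_to_page(size_t size)
{
    return (size + PAGE_SIZE_ASSUMED - 1u) & ~((size_t)PAGE_SIZE_ASSUMED - 1u);
}

/* Derive the three values from a page-aligned allocation of `size` bytes. */
static struct stack_layout layout_for(uintptr_t raw, size_t size)
{
    struct stack_layout l;

    assert(raw % PAGE_SIZE_ASSUMED == 0u);
    assert(size % PAGE_SIZE_ASSUMED == 0u);
    assert(size >= 2u * PAGE_SIZE_ASSUMED);  /* guard page plus one usable page */

    l.dealloc = raw;
    l.limit = raw + PAGE_SIZE_ASSUMED;
    l.base = raw + size - BASE_RESERVE;
    return l;
}

/* Sanity check: dealloc < limit < base, and the base is 16-byte aligned. */
static int layout_is_consistent(const struct stack_layout *l)
{
    return l->dealloc < l->limit && l->limit < l->base && l->base % 16u == 0u;
}
```

A caller would round the requested size with round_up_to_page() before allocating, pass the returned address to layout_for(), and store the three resulting values where the context switch expects to find them. Keeping the arithmetic in one place makes it easier to check that the three TIB entries always describe the same block of memory.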
Once all that was in place, Cimba started working again, also on the latest Windows runner. There may be steps in the above that could be omitted, or there could be additional steps needed under certain circumstances that I have not stumbled into yet. If so, please add a comment below.