Adventures with Windows 11

My original development machine for Cimba was an old Intel Xeon E5 running Windows 10. However, it was not possible to switch to Windows 11 on it, not even with a TPM 2.0 chip installed. Windows being what it is, I took that as the impetus for switching to an AMD Threadripper and Arch Linux for the main development rig. I still wanted to have Windows support for Cimba, though, so I checked from time to time that it still worked on the Xeon as well.

I then set up a GitHub Actions process to automate building and testing Cimba there. Effectively, it runs meson build; meson test on the GitHub runners every time I push a change. The runners are specified as “latest” both for Linux Ubuntu and Windows. That presumably implies new hardware and the newest versions of the OS.

It was an unpleasant surprise to see that the Windows test suite on GitHub failed all tests that involved coroutines, both directly, as in test_coroutine.c, and in everything built on top of it, like test_process.c. Windows 11 promptly killed the executable as soon as it tried to start a coroutine. The only error message given was a rather terse error code ‘0xc0000005’, access violation.

This was quite a debugging challenge. As always, the Windows internals are not particularly well-documented, but two things soon became clear:

- Windows 11 has far stricter security measures to prevent stack-smashing exploits than older versions and will summarily kill any suspect process.
- This is supported by modern hardware such as the Intel Control-flow Enforcement Technology (CET) that monitors program execution and alerts the OS to any suspicious activity.

This is, of course, a very good thing, but it was not able to distinguish the legit Cimba stackful coroutines from a hacker attack that tries to gain control by manipulating the stack and the return addresses there. As described in the Cimba documentation, the coroutines work by creating individual stacks in heap memory. This fails when the CPU and OS are monitoring a “shadow stack” to continuously verify that the program’s own stack still matches the secret copy. In effect, we have to explain to the OS and CPU both that our cactus stack is a valid stack and that our manipulations of return pointers are harmless.
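To make the shadow stack idea concrete, here is a toy model in C. It is only a sketch of the principle: the names toy_cpu, toy_call, and toy_ret are made up for this illustration, and real CET shadow stacks live in protected memory managed by the CPU and OS, not in an ordinary array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SHADOW_DEPTH 16

/* Toy model only: a "CPU" that keeps its own copy of every return address */
struct toy_cpu {
    uintptr_t shadow[SHADOW_DEPTH];
    size_t shadow_top;
};

/* CALL: the return address goes on the normal stack and on the shadow stack */
void toy_call(struct toy_cpu *cpu, uintptr_t *stack, size_t *sp, uintptr_t ret_addr)
{
    assert(cpu->shadow_top < SHADOW_DEPTH);
    stack[(*sp)++] = ret_addr;
    cpu->shadow[cpu->shadow_top++] = ret_addr;
}

/* RET: returns 1 if the two copies agree, 0 if the CPU would raise a fault */
int toy_ret(struct toy_cpu *cpu, const uintptr_t *stack, size_t *sp)
{
    assert(*sp > 0 && cpu->shadow_top > 0);
    uintptr_t from_stack = stack[--(*sp)];
    uintptr_t from_shadow = cpu->shadow[--cpu->shadow_top];
    return from_stack == from_shadow;
}
```

A call followed by a return on the same stack passes the check. A call on one stack followed by a return on a different, hand-prepared stack does not, which is exactly what a coroutine switch looks like from the outside.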

Basically, the relevant parts of the Cimba context switching code for Windows followed the outline given by Malte Skarupke in his 2013 blog post “Handmade Coroutines for Windows”. That still works for Windows 10, but Windows 11 on a modern AMD or Intel CPU, perhaps with a hypervisor in between, is a very different animal.

With significant assistance from both Google Gemini and Anthropic Claude together with human trial-and-error testing, I was able to find a set of code clarifications that was accepted. The complete code is in the Cimba repo, so I will only review the main points here.

Tell it explicitly to expect stack shenanigans. Even if not using any Windows fiber library functions, call ConvertThreadToFiber(NULL); to state that here be dragons. For extra credit, wrap it in a test where the magic number ‘0x1e00’ means “not a fiber”:
```
if (GetCurrentFiber() == (void*)0x1e00) {
     ConvertThreadToFiber(NULL);
}
```
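If the magic number feels too fragile, the Windows API also offers IsThreadAFiber() for the same question. The sketch below is an untested alternative, not what Cimba does: the helper name ensure_thread_is_fiber is hypothetical, and I am assuming that IsThreadAFiber() is an adequate replacement for the 0x1e00 comparison in this situation.

```c
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: returns 1 if the calling thread is (now) a fiber,
 * 0 if the conversion failed */
int ensure_thread_is_fiber(void)
{
#ifdef _WIN32
    if (IsThreadAFiber()) {
        return 1;   /* Already converted, nothing to do */
    }
    /* ConvertThreadToFiber returns NULL on failure */
    return ConvertThreadToFiber(NULL) != NULL;
#else
    /* Nothing to declare on other platforms */
    return 1;
#endif
}
```

Calling it once per thread before the first context switch should be enough, since a second call finds the thread already converted.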

Stack memory is different. Do not simply call malloc() or calloc() to allocate new stack memory. Windows 11 expects page-aligned memory allocated by VirtualAlloc() as valid stack space. After use, it needs to be freed by the matching VirtualFree(). Allocate one extra page and designate it as a guard page at the low end of the stack, where an overflow would hit first.

 unsigned char *cmi_coroutine_stack_alloc(const size_t size,
                                      unsigned char **base_p,
                                      unsigned char **limit_p)
 {
     const size_t pagesz = cmi_pagesize();
     unsigned char *raw = VirtualAlloc(NULL, size + pagesz, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
     cmb_assert_always(raw != NULL);
    
     /* Turn the lowest page into a guard page to catch stack overflow */
     DWORD old_protect;
     BOOL ok = VirtualProtect(raw, pagesz, PAGE_READWRITE | PAGE_GUARD, &old_protect);
     cmb_assert_always(ok != 0);
    
     /* The stack grows downwards; the base is at the top */
     *base_p = raw + size + pagesz;
    
     /* The limit is the lowest usable address, just above the guard page */
     *limit_p = raw + pagesz;
    
     return raw;
 }


 /* Free memory previously allocated for a stack */
 void cmi_coroutine_stack_free(unsigned char *stack)
 {
     int r = VirtualFree(stack, 0, MEM_RELEASE);
     cmb_assert_always(r != 0);
 }
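The address arithmetic is easy to get wrong by one page, so here is a small portable sketch that computes the same three values from plain integers, using the same layout as cmi_coroutine_stack_alloc() above. The struct stack_bounds and both function names are hypothetical and not part of Cimba. Rounding the requested size up to a whole number of pages is my own assumption about what a caller would want, not something the code above does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: the three values the TIB needs for one stack */
struct stack_bounds {
    uintptr_t base;     /* StackBase: one past the highest usable byte */
    uintptr_t limit;    /* StackLimit: lowest usable byte, above the guard page */
    uintptr_t dealloc;  /* DeallocationStack: address returned by the allocator */
};

/* Round a requested size up to a whole number of pages.
 * Assumes pagesz is a power of two. */
size_t round_up_to_page(size_t size, size_t pagesz)
{
    assert(pagesz != 0 && (pagesz & (pagesz - 1)) == 0);
    return (size + pagesz - 1) & ~(pagesz - 1);
}

/* Bounds for a block of (size + pagesz) bytes starting at raw,
 * where the lowest page is the guard page */
struct stack_bounds stack_bounds_from(uintptr_t raw, size_t size, size_t pagesz)
{
    struct stack_bounds b;
    b.dealloc = raw;
    b.limit = raw + pagesz;
    b.base = raw + pagesz + size;
    return b;
}
```

For a block at 0x10000 with a 16 KiB stack and 4 KiB pages, this gives a guard page at 0x10000, a limit of 0x11000, and a base of 0x15000.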

There is a third value to worry about in the Thread Information Block (TIB), the DeallocationStack at GS:0x1478. It contains the raw memory address returned by VirtualAlloc(). As the Wikipedia article states: “Setting stack limits without setting DeallocationStack will probably cause odd behavior in SetThreadStackGuarantee. For example, it will overwrite the stack limits to wrong values.”

In the actual context switch, the CPU becomes very afraid if it detects an inconsistent state between the stack parameters in the TIB, its CET Shadow Stack, and the actual memory accesses in progress. Wait until the very last moment before changing the TIB values. Change them atomically by loading all three values from the new coroutine's saved stack into scratch registers before writing them to the TIB with no interleaving stack access. Only then proceed with accessing the new stack.

Do not use RET to jump to the new coroutine. The oldest trick in the hacker book is to overwrite the return address by abusing a buffer overflow on the stack and then let the program return to a hacker-selected address, potentially taking full control of the machine. Windows 11 and CET are understandably wary of anything that looks strange there. Instead, spell it out explicitly by popping the return address into a scratch register and then jumping to the address in the register:

;-------------------------------------------------------------------------------
; Callable function void *cmi_coroutine_context_switch(void **old,
;                                                      void **new,
;                                                      void *ret)
; Arguments:
;   void **old - RCX - address for storing current stack pointer
;   void **new - RDX - address for reading new stack pointer
;   void *ret  - R8  - return value passed from old to new context
; Scratch registers used:
;   R9, R10, R11, RAX
; Return value:
;   void *     - RAX - whatever was given as the third argument
; Error handling:
;   None - the samurai returns victorious or not at all
;
cmi_coroutine_context_switch:
    ; Push all callee-saved registers to current stack
    save_context
    ;
    ; Push the TIB DeallocationStack, StackLimit, and StackBase entries
    mov r9, [gs:0x1478]    ; DeallocationStack
    push r9
    mov r9, [gs:16]        ; StackLimit
    push r9
    mov r9, [gs:8]         ; StackBase
    push r9
    ;
    ; Store old stack pointer to address given as first argument RCX
    mov [rcx], rsp
    ;
    ; Load the new RSP from the second argument RDX into a scratch register
    mov r9, [rdx]
    ;
    ; Load new stack info into scratch registers for atomic TIB change
    mov r10, [r9]           ; New StackBase
    mov r11, [r9 + 8]       ; New StackLimit
    mov rax, [r9 + 16]      ; New DeallocationStack
    ;
    ; Write the new stack info to Windows TIB without touching the stack
    mov [gs:8], r10         ; Update StackBase
    mov [gs:16], r11        ; Update StackLimit
    mov [gs:0x1478], rax    ; Update DeallocationStack
    ;
    ; Done, safe to switch to the new stack, advancing past the used TIB entries
    mov rsp, r9
    add rsp, 24
    ;
    ; We are now in the new context, restore other registers from the new stack
    load_context
    ;
    ; Load whatever was in the third argument R8 as return value in RAX
    mov rax, r8
    ;
    ; Return to wherever the new context was transferring from earlier
    ; Note that we spell out the 'ret' as 'pop, jmp' for Intel CET reasons.
    pop r9
    jmp r9
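The switch code expects the three TIB entries to sit on top of every suspended stack, with StackBase at the lowest address. A brand new coroutine has never been suspended, so something has to place them there before the first switch into it. The C sketch below shows one way to do that. It is an illustration under my own assumptions: struct saved_tib_frame and push_tib_frame are hypothetical names, the callee-saved registers handled by save_context and load_context are left out, and the actual Cimba initialization code may well look different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the top of a suspended stack, lowest address first */
struct saved_tib_frame {
    uint64_t stack_base;     /* read from [rsp]      */
    uint64_t stack_limit;    /* read from [rsp + 8]  */
    uint64_t dealloc_stack;  /* read from [rsp + 16] */
};

/* Mimic the three pushes in the switch code: write the entries just below
 * 'top' and return the resulting stack pointer */
uint64_t *push_tib_frame(uint64_t *top, uint64_t base,
                         uint64_t limit, uint64_t dealloc)
{
    assert(top != NULL);
    *--top = dealloc;   /* pushed first, ends up at the highest address */
    *--top = limit;
    *--top = base;      /* pushed last, ends up at the stack pointer */
    return top;
}
```

The returned pointer is what would be stored as the coroutine's saved stack pointer, so that the loads at offsets 0, 8, and 16 and the following add rsp, 24 find what they expect.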

Tell gcc not to produce stack-checking code, here in meson.build:

 if host_machine.system() == 'windows'
    desired_flags += ['-fno-plt',
                      '-fcf-protection=none',
                      '-fno-stack-protector']
 endif

Once all that was in place, Cimba started working again, also on the latest Windows runner. There may be steps in the above that could be omitted, or there could be additional steps needed under certain circumstances that I have not stumbled into yet. If so, please add a comment below.

Asbjørn M. Bonvik
