LSMs Jmp'ing on BPF Trampolines
Back in 2001, the Linux Security Module (LSM) subsystem made its way into the mainline kernel. Almost 10 years ago, in September 2014, the modern BPF virtual machine (VM) landed in the tree. In late 2019, KP Singh proposed a patchset that facilitates the creation of modules for the former that run on the latter - Kernel Runtime Security Instrumentation (KRSI) was born.
But how does kernel control flow transfer in and out of the VM at the security checkpoints? This question originally came up while developing a memory forensics tool for detecting BPF-based malware, and answering it quickly turned into a great learning experience about the internals of the two subsystems. In this post, we will work out that answer, and then use it to develop a memory forensics plugin that detects such hooks in memory images.
However, before we jump into assembly code, let’s briefly recap LSMs and BPF.
Linux Security Modules
By default, Linux implements discretionary access control. For example, the owner of a resource is free to grant others access to it.
$ chmod 444 .ssh/id_rsa
This is potentially problematic, e.g., on multi-user systems where users have different security clearances. To implement other access control policies, e.g., mandatory access control, organizations like the National Security Agency (NSA) had to maintain their own kernel patches.
At the 2001 Kernel Summit in San Jose, California, the NSA’s Peter Loscocco presented their Security-Enhanced Linux (SELinux), but the patchset was not merged. However, one year later at the Kernel Summit in Ottawa, Chris Wright presented the patch that would later become the Linux Security Module subsystem.
Citing its documentation:
The LSM framework includes security fields in kernel data structures and calls to hook functions at critical points in the kernel code to manage the security fields and to perform access control. It also adds functions for registering security modules. An interface /sys/kernel/security/lsm reports a comma separated list of security modules that are active on the system. link
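To make the registration part concrete, here is a minimal sketch of how a classical, statically built-in module hooks inode_unlink (hook signature and registration API roughly as of 5.x kernels; the module name "demo" and the callback body are made up for illustration):

#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/lsm_hooks.h>

/* Callback with the signature the inode_unlink hook expects. */
static int demo_inode_unlink(struct inode *dir, struct dentry *dentry)
{
        return 0; /* 0 == allow; a negative errno would deny the operation */
}

static struct security_hook_list demo_hooks[] __ro_after_init = {
        LSM_HOOK_INIT(inode_unlink, demo_inode_unlink),
};

static int __init demo_lsm_init(void)
{
        /* Hang our callback off the per-hook list in security_hook_heads. */
        security_add_hooks(demo_hooks, ARRAY_SIZE(demo_hooks), "demo");
        return 0;
}

DEFINE_LSM(demo) = {
        .name = "demo",
        .init = demo_lsm_init,
};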
The framework is intended to be generic enough to facilitate the enforcement of a wide range of security policies: one writes a kernel module that uses those two primitives. But writing kernel modules is hard, mistakes can have catastrophic consequences, and the binary blobs only run on the kernel they were compiled for - fortunately, there is…
Further reading: Linux Security Modules: General Security Support for the Linux Kernel
Modern BPF
BPF is a VM inside the Linux kernel. It is used for running programs in a sandboxed environment within the kernel context.
Programs can be written in C, Rust, or even high-level scripting languages. They are compiled to BPF bytecode, which can be dynamically loaded into the kernel, where it is statically verified before being compiled to native code. Running BPF programs is therefore safe, in the sense that a programming mistake won’t crash your kernel, and their overhead is low. Furthermore, type information included in modern kernels is used to relocate programs before loading, eliminating the need for compilation on the target system.
By now, the VM is used in many different kernel subsystems, like networking, tracing, security, cgroups or scheduling. In an effort to provide a safer kernel programming environment, it is actively being extended with new features.
Further reading: BPF and XDP Reference Guide
Kernel Runtime Security Instrumentation (KRSI)
This patchset added the option to implement the security callbacks as programs for the BPF VM. For illustration purposes, we are going to use a very simple security module that aims to detect and prevent two common malware behaviors. You can find the full source code here.
Fileless Executions
Using a remote code execution exploit, an attacker might be able to
compromise a process running on a victim’s machine. Downloading and
executing a second stage payload without touching the filesystem might
be desirable due to security measures preventing the creation of (executable)
files or to minimize forensic artifacts. While userland exec is a
well-known technique, it is much more convenient to use Linux’s memfd
API. To detect such events, we can write the following BPF program and
attach it to the bprm_creds_for_exec
security hook.
SEC("lsm/bprm_creds_for_exec")
int BPF_PROG(bprm_creds_for_exec , struct linux_binprm* bprm)
{
int nlink = 0;
char comm[BPF_MAX_COMM_LEN] = { 0 };
char path[BPF_MAX_PATH_LEN] = { 0 };
nlink = bprm->file->f_path.dentry->d_inode->__i_nlink; // [1]
bpf_d_path(&bprm->file->f_path, path, sizeof(path));
LOG_INFO("path=%s nlink=%d", path, nlink);
if (!nlink) {
bpf_get_current_comm(comm, sizeof(comm));
LOG_WARN("fileless execution (%s:%lu)", comm, bpf_ktime_get_boot_ns());
bpf_send_signal(SIGKILL); // [2]
return 1; // [3]
}
return 0;
}
This hook is called early during the exec system call and receives the file that the process wants to execute. At [1] we read the number of hard links to this file. If it is zero, we deny the execution [3] and queue a fatal signal for the process [2]. The signal is necessary because the hook runs before the syscall’s point of no return: an error returned here merely fails the execve call, whereas any error after that point would be fatal to the calling process anyway.
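Loading and attaching such a program from user space takes only a few lines of libbpf. A minimal sketch (the object file name mini_lsm.bpf.o is an assumption, and error handling is kept to a minimum):

#include <stdio.h>
#include <bpf/libbpf.h>

int main(void)
{
        struct bpf_object *obj;
        struct bpf_program *prog;

        /* Open the compiled object and load it; the verifier runs here. */
        obj = bpf_object__open_file("mini_lsm.bpf.o", NULL);
        if (!obj || bpf_object__load(obj)) {
                fprintf(stderr, "failed to open/load BPF object\n");
                return 1;
        }

        /* SEC("lsm/...") lets libbpf pick the LSM attach type automatically. */
        bpf_object__for_each_program(prog, obj) {
                if (!bpf_program__attach(prog)) {
                        fprintf(stderr, "failed to attach %s\n",
                                bpf_program__name(prog));
                        return 1;
                }
        }

        printf("attached; press enter to detach and exit\n");
        getchar();

        bpf_object__close(obj);
        return 0;
}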
Note: You can use the
memfd_exec
program to test this hook. It also allows you to experiment with the
differences between executing a script starting with #! and an ELF binary.
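If you prefer a self-contained illustration, the following sketch performs the same kind of fileless execution: it copies an on-disk ELF into an anonymous memfd, which has zero hard links, and executes it from there (this is our own toy example, not the linked memfd_exec program):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        if (argc < 2) {
                fprintf(stderr, "usage: %s /path/to/elf [args...]\n", argv[0]);
                return 1;
        }

        int in = open(argv[1], O_RDONLY);
        if (in < 0) {
                perror("open");
                return 1;
        }

        /* An anonymous, memory-backed file: its inode has i_nlink == 0. */
        int mfd = memfd_create("hax", MFD_CLOEXEC);
        if (mfd < 0) {
                perror("memfd_create");
                return 1;
        }

        struct stat st;
        if (fstat(in, &st) < 0) {
                perror("fstat");
                return 1;
        }

        /* Copy the binary into the memfd (assumes one call suffices). */
        if (sendfile(mfd, in, NULL, st.st_size) != st.st_size) {
                perror("sendfile");
                return 1;
        }

        /* Works for ELF binaries; for a #! script the close-on-exec fd
         * would make fexecve fail with ENOENT. */
        char *envp[] = { NULL };
        fexecve(mfd, &argv[1], envp);
        perror("fexecve"); /* only reached on error */
        return 1;
}

With the LSM program loaded, running this against any ELF should trigger the "fileless execution" warning and kill the process.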
Self-Deletion
Less sophisticated malware might simply try to go memory-resident by
deleting its executable on disk. We can write another BPF program
and attach it to the inode_unlink
security hook in an attempt to
prevent this behavior.
SEC("lsm/inode_unlink")
int BPF_PROG(inode_unlink, struct inode* inode_dir, struct dentry* dentry)
{
struct task_struct* current = NULL;
char comm[BPF_MAX_COMM_LEN] = { 0 };
const struct inode *exe_inode = NULL, *target_inode = NULL;
int i = 0;
target_inode = dentry->d_inode; // [1]
LOG_INFO("target_inode=0x%lx", target_inode);
for (i = 0,
current = bpf_get_current_task_btf(),
exe_inode = current->mm->exe_file->f_inode; // [2]
exe_inode && i < BPF_MAX_LOOP_SIZE;
i++,
current = BPF_CORE_READ(current, parent),
exe_inode = BPF_CORE_READ(current, mm, exe_file, f_inode))
{
bpf_probe_read_kernel(&comm, sizeof(comm), ¤t->comm);
LOG_INFO("exe_inode=0x%lx comm=%s", exe_inode, comm);
if (target_inode == exe_inode) { // [3]
bpf_get_current_comm(comm, sizeof(comm));
LOG_WARN("self-deletion (%s:%lu)", comm, bpf_ktime_get_boot_ns());
bpf_send_signal(SIGKILL);
return 1;
}
}
return 0;
}
We attach this program to the inode_unlink
hook, which is called each
time a process attempts to delete a file. The callback receives the
parent directory as well as the directory entry of the file that is to
be deleted. First, the dentry
is used to obtain the underlying inode
[1]. Then, the loop reads the inode of the file that the process (or
any of its ancestors) is executing [2].
Finally, by comparing the two [3] we can attempt to detect
self-deletions.
Note: This program is flawed in many ways. First, it breaks updates, e.g., when the package manager updates systemd’s executable. Furthermore, it is easy to bypass, e.g.,
- The parent deletes its child’s executable and exits. Then, the child gets reparented and deletes the parent.
- Schedule removal through another task, e.g., cron or systemd.
- Use the prctl syscall to set the exe file.
- Scripts that are run by an interpreter are unaffected.
- Using io_uring’s asynchronous unlink requests in combination with a dedicated kernel thread for processing them might (not tested) also be an option.
BPF Trampolines: The glue between C and BPF
Now that we got ourselves a small BPF-based security module to play with, we can examine how it works under the hood.
Reaching the tramp
Let’s start by looking at the part of the infrastructure that is statically compiled into the kernel. Figure 1 gives an overview of the code path leading up to our BPF program.
At selected places, the kernel calls functions starting with
security_
. For example, the vfs_unlink
function calls
security_inode_unlink
and aborts if it returns a nonzero value.
LSM hooks are meant to provide a higher level of abstraction than
system calls; thus it makes sense to place the gatekeeper call at
a choke point in the virtual file system (VFS) through which many
operations must pass, independently of their entry point into the
kernel and the type of object they are operating on.
Each of those call sites has its own member in the global
security_hook_heads
structure. The security_inode_unlink
function
uses its member to traverse a list of all registered callbacks, calling
them one by one. As soon as one of them returns a nonzero value it
aborts and propagates the value back to the caller.
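Stripped down, the dispatch in security/security.c looks roughly like this (a simplified sketch of the kernel code, with the call_int_hook macro expanded for readability and some shortcuts omitted):

/* Simplified sketch of the dispatch in security/security.c. */
int security_inode_unlink(struct inode *dir, struct dentry *dentry)
{
        struct security_hook_list *p;
        int rc = 0; /* default verdict if no callback is registered */

        /* Walk the list head dedicated to this hook ... */
        hlist_for_each_entry(p, &security_hook_heads.inode_unlink, list) {
                /* ... and call every registered callback in turn. */
                rc = p->hook.inode_unlink(dir, dentry);
                if (rc != 0)
                        break; /* the first nonzero verdict wins */
        }
        return rc;
}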
There are numerous ways to find out where the indirect calls
lead. For example, we can use trace-cmd
with the function graph
tracer to record a trace of the functions called during an unlink
system call.
$ touch /tmp/hax && sudo trace-cmd record -p function_graph \
--max-graph-depth 4 -g do_unlinkat -n "*interrupt*" -n "*irq*" \
-n "capable" -n "*__rcu*" -n "*_spin_*" -v -e "*irq*" -e "*sched*" \
-F /bin/rm /tmp/hax \
&& trace-cmd report trace.dat && sudo trace-cmd reset
...
rm-8004 [011] 17857.021621: funcgraph_entry: | security_inode_unlink() {
rm-8004 [011] 17857.021621: funcgraph_entry: 0.129 us | bpf_lsm_inode_unlink();
rm-8004 [011] 17857.021622: funcgraph_exit: 0.436 us | }
...
Alternatively, we can debug the kernel in a guest VM and set a breakpoint on the function we suspect to be called (here for the analogous inode_mkdir hook). Looking at the backtrace confirms the observation:
...
(remote) gef➤ b bpf_lsm_inode_mkdir
Breakpoint 2 at 0xffffffff81230d80: file ./include/linux/lsm_hook_defs.h, line 126.
(remote) gef➤ c
...
(remote) gef➤ bt
#0 bpf_lsm_inode_mkdir (dir=0xffff8880053f63a0, dentry=0xffff888005799480, mode=0x1ed) at ./include/linux/lsm_hook_defs.h:126
#1 0xffffffff814c56d0 in security_inode_mkdir (dir=0xffff8880053f63a0, dentry=0xffff888005799480, mode=0x1ed) at security/security.c:1298
#2 0xffffffff8130488e in vfs_mkdir (mnt_userns=0xffffffff82e4e920 <init_user_ns>, dir=0xffff8880053f63a0, dentry=0xffff888005799480, mode=0x1ed) at fs/namei.c:4029
#3 0xffffffff81304a02 in do_mkdirat (dfd=0xffffff9c, name=0xffff8880075b1000, mode=0x1ed) at fs/namei.c:4061
#4 0xffffffff81304b7d in __do_sys_mkdir (pathname=<optimized out>, mode=<optimized out>) at fs/namei.c:4081
...
However, as we can see, the function usually does absolutely nothing.
(remote) gef➤ x/5i $rip
=> 0xffffffff81230d80 <bpf_lsm_inode_mkdir>: endbr64
0xffffffff81230d84 <bpf_lsm_inode_mkdir+4>: nop DWORD PTR [rax+rax*1+0x0]
0xffffffff81230d89 <bpf_lsm_inode_mkdir+9>: xor eax,eax
0xffffffff81230d8b <bpf_lsm_inode_mkdir+11>: ret
It exists for the sole purpose of being ftrace attachable! You might
have spotted the nop
instruction at the very beginning: this one is
just reserving some space which allows the ftrace framework
to divert control flow by dynamically patching the kernel text.
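These stubs are not written by hand: kernel/bpf/bpf_lsm.c generates one for every LSM hook with a small macro, roughly as follows (paraphrased from the kernel sources):

/* Expand every hook declared in lsm_hook_defs.h into a noinline stub that
 * just returns its default value, e.g.
 * "noinline int bpf_lsm_inode_unlink(struct inode *dir,
 *                                    struct dentry *dentry) { return 0; }" */
#define LSM_HOOK(RET, DEFAULT, NAME, ...)	\
noinline RET bpf_lsm_##NAME(__VA_ARGS__)	\
{						\
	return DEFAULT;				\
}

#include <linux/lsm_hook_defs.h>
#undef LSM_HOOK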
We can see the patching machinery at work by placing a read-write
watchpoint on the nop’s address before attaching our program. It is hit multiple times during the attachment process, and when everything is
done, the nop
is replaced by a call to a BPF trampoline.
The Tramp
BPF trampolines are the architecture-dependent glue that connects native kernel functions to BPF programs. Currently only arm64 and x86 support the generation of BPF trampolines. Figures 2 and 3 show the stack frame and code of the generated trampoline, respectively.
Some values are burned into the trampoline upon generation; this
includes pointers to its metadata-holding struct bpf_tramp_image
as well as to the struct bpf_prog
of each program it dispatches to. Local
variables are stored in the trampoline’s stack frame; most importantly,
the read-only context ctx
received by the BPF programs lives on this
stack.
The execution of the trampoline begins by grabbing a percpu reference to its own bpf_tramp_image (__bpf_tramp_enter). This reference tells the kernel whether any CPU is still executing the image, so that an old image can be freed safely after the trampoline has been updated.
Now comes the part where the trampoline calls into the JIT-compiled BPF programs one by one, handing each a pointer to the stack-allocated context that holds the directory and the directory entry of the file to be deleted, as well as the return value of the previous BPF program. Around each program invocation, the trampoline enters a read-copy-update (RCU) read-side critical section and disables migration to other CPUs; optionally, it can also measure the program’s execution time.
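On the BPF side, that context is just an array of u64 slots, one per hook argument, and the BPF_PROG() macro used in our programs hides the unpacking. Hand-expanded, our inode_unlink program boils down to something like this (a sketch, not the literal macro output):

/* The trampoline passes a pointer to its on-stack argument array; the
 * macro-generated wrapper casts the u64 slots back to typed pointers
 * before running the body we wrote above. */
int inode_unlink(unsigned long long *ctx)
{
        struct inode *inode_dir = (struct inode *)ctx[0];
        struct dentry *dentry = (struct dentry *)ctx[1];

        /* ... the program body as written above ... */
        return 0;
}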
As soon as a program returns a nonzero value, the trampoline stops invoking
other programs and directly goes to its exit routine. Otherwise, it
calls the original function, i.e., bpf_lsm_inode_unlink
, for its
side effects and return value once all BPF programs are finished.
Upon exit, the trampoline drops the reference to its image and returns directly
into the security_inode_unlink function by skipping the stub’s stack frame
(an extra adjustment of the stack pointer discards the return address pointing
back into bpf_lsm_inode_unlink), propagating either the return value of the
last BPF program or that of the attached function.
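To summarize the control flow, here is a tiny user-space model of the dispatch logic; it is purely illustrative, the real trampoline is generated machine code and all names below are made up:

#include <stdio.h>

typedef int (*lsm_prog_fn)(void *ctx);

/* Stand-in for the static stub, e.g. bpf_lsm_inode_unlink(): returns 0. */
static int original_fn(void *ctx) { (void)ctx; return 0; }

static int prog_allow(void *ctx) { (void)ctx; return 0; }
static int prog_deny(void *ctx)  { (void)ctx; return 1; }

/* Model of the trampoline: run the attached programs in order, stop at the
 * first nonzero verdict, otherwise fall through to the original function.
 * The single return value is what security_inode_unlink() ends up seeing. */
static int trampoline(void *ctx, lsm_prog_fn *progs, int nprogs)
{
        int ret = 0;

        for (int i = 0; i < nprogs; i++) {
                ret = progs[i](ctx);
                if (ret != 0)
                        return ret; /* skip remaining progs and the original */
        }
        return original_fn(ctx);
}

int main(void)
{
        lsm_prog_fn progs[] = { prog_allow, prog_deny };

        printf("verdict: %d\n", trampoline(NULL, progs, 2));
        return 0;
}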
Aside: You probably already guessed that the utility of ftrace and BPF trampolines is not limited to realizing KRSI.
In fact, we already saw another use of the ftrace framework: it’s the
machinery that drives trace-cmd
. If you’d strace
it, you would see
that it’s doing most of its magic by reading and writing files in
tracefs, the user space interface of ftrace. Furthermore, ftrace can
also be used by kernel modules to install callbacks on kernel functions.
BPF trampolines, on the other hand, are used to jump into all kinds of BPF programs, not only LSM-related ones. Numerous flags control their generation, and the variant described above is just one combination of them. For instance, you could also generate trampolines that cannot skip the original function and simply return to it, or trampolines that call BPF programs after the original function has been executed by the trampoline.
Digging Through Memory Dumps
Now that we are equipped with some background knowledge, we can get back to the original question: given a memory dump of a system, can we reconstruct which BPF LSM hooks were active?
For this part, we’re going to use Volatility3 as a memory forensics framework and implement our feature as a plugin. You can find the source code here. However, the techniques are not tied to a specific framework in any way.
First, we have to find out which hooks are active. For that
purpose we can simply disassemble all the bpf_lsm_
stub functions
and check if their ftrace nops are replaced by a call. This gives us
the active hooks and the corresponding addresses of the executable
trampoline images.
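The check itself lives in the (Python) Volatility plugin, but its core is a few bytes of pattern matching. Sketched in C over a raw byte buffer (the helper and its layout assumptions, e.g. the patch site sitting right behind the 4-byte endbr64 seen in the disassembly above, are ours):

#include <stdbool.h>
#include <stdint.h>

/* Given the five bytes at a bpf_lsm_* stub's ftrace patch site (the slot
 * that holds the nop above) and the virtual address of that site, decide
 * whether the nop has been replaced by a call and, if so, recover the call
 * target, i.e. the address of the executable trampoline image. */
static bool ftrace_site_patched(const uint8_t insn[5], uint64_t site_addr,
                                uint64_t *target)
{
        if (insn[0] != 0xE8)    /* x86-64 "call rel32" opcode */
                return false;

        /* rel32 is a little-endian signed displacement, relative to the
         * address of the *next* instruction (site + 5). */
        int32_t rel32 = (int32_t)((uint32_t)insn[1]       |
                                  (uint32_t)insn[2] << 8  |
                                  (uint32_t)insn[3] << 16 |
                                  (uint32_t)insn[4] << 24);

        *target = site_addr + 5 + (int64_t)rel32;
        return true;
}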
Next, we can figure out which programs belong to a given image. There are at least two approaches that immediately come to mind: disassemble the trampolines and extract the compiled-in addresses of the program code and their corresponding metadata structs, or query higher-level abstractions like BPF link objects.
While the first approach is less susceptible to anti-forensics, it
depends on the details of trampoline code generation. Since we already
know which hooks are active, it is safe to look for the program information
in places that are easier to manipulate, like the link_idr, which
contains all BPF link objects in use. If we don’t find a corresponding
program there, it is considered suspicious, and we raise an alert.
Figure 4 gives an overview of how BPF link objects can be used
to match trampoline images to links. Iterating the link_idr
gives
us bpf_link
objects. The link type maps directly to
the type of the containing structure. Thus, if it indicates a tracing
link we can pivot to the outer struct and use its trampoline
member to
get the address of the executable trampoline image that the link’s
program is attached to.
Putting it all together, we can find all active BPF LSM hooks as well as all programs attached to them. You can find the full Volatility plugin code that implements the above approach in our BPF plugin suite on GitHub. Running it against a memory image of a system that uses our toy LSM correctly reports activity on two hooks.
./vol.py -f /io/dumps/mini_lsm_w_dummy.elf -v linux.bpf_lsm
...
LSM HOOK                      Nr. PROGS   IDs
bpf_lsm_bprm_creds_for_exec   2           16,18
bpf_lsm_inode_unlink          1           19
Note that we have added a second program to bprm_creds_for_exec
in order to cover the case where more than one program is attached to a
single hook. You can now use the
program IDs with other plugins like bpf_listprogs
to continue your
investigation. The memory image can be downloaded
here
and the symbols are provided
here.
Wrapup
In this post, we learned about LSMs, a cornerstone of building high-security Linux systems, and their programmability through modern BPF. On the way we met two core parts of Linux’s tracing infrastructure: ftrace and BPF trampolines. In the end, we leveraged this knowledge to build a memory forensics tool capable of detecting a subtle way in which malware might infect a system.
References
[1] “Building a Security Tracing Utility To Snoop Into the Linux Kernel.” https://lumontec.com/1-building-a-security-tracing
[2] “ChromeOS: Noexec File System Bypass Using Memfd.” https://bugs.chromium.org/p/chromium/issues/detail?id=916146
[3] “Commit: bpf: Introduce BPF Trampoline.” [Online]. Available: https://github.com/torvalds/linux/commit/fec56f5890d93fc2ed74166c397dc186b1c25951
[4] “eBPF: Block Linux Fileless Payload ‘Malware’ Execution With BPF LSM.” https://djalal.opendz.org/post/ebpf-block-linux-fileless-payload-execution-with-bpf-lsm/
[5] “Elixir: arch/x86 arch_prepare_bpf_trampoline.” [Online]. Available: https://elixir.bootlin.com/linux/latest/source/arch/x86/net/bpf_jit_comp.c#L2128
[6] “FOSDEM 2020: Kernel Runtime Security Instrumentation LSM+BPF=KRSI,” Jan. 02, 2020. [Online]. Available: https://archive.fosdem.org/2020/schedule/event/security_kernel_runtime_security_instrumentation/
[7] “Google Help: retpoline.” [Online]. Available: https://support.google.com/faqs/answer/7625886?hl=en
[8] “KP Singh’s Kernel Tree.” [Online]. Available: https://github.com/sinkap/linux-krsi/blob/patch/v1/examples/samples/bpf/lsm_detect_exec_unlink.c
[9] “KRSI PATCHv1.” [Online]. Available: https://lwn.net/ml/linux-kernel/20191220154208.15895-1-kpsingh@chromium.org/
[10] “KRSI PATCHv9 (final).” [Online]. Available: https://patchwork.kernel.org/project/linux-security-module/cover/20200329004356.27286-1-kpsingh@chromium.org/
[11] “KRSI RFCv1,” Oct. 09, 2019. [Online]. Available: https://lore.kernel.org/bpf/20190910115527.5235-1-kpsingh@chromium.org/#r
[12] “Linux Security Modules: General Security Support for the Linux Kernel”.
[13] “Linux Security Summit NA 2019: Kernel Runtime Security Instrumentation - KP Singh, Google,” Oct. 02, 2019. [Online]. Available: https://www.youtube.com/watch?v=2CZSSRfgAgQ
[14] “Linux Security Summit NA 2020: KRSI (BPF + LSM) - Updates and Progress - KP Singh, Google,” Jul. 01, 2020. [Online]. Available: https://lssna2020.sched.com/event/c74F/krsi-bpf-lsm-updates-and-progress-kp-singh-google
[15] “LPC 2020: BPF LSM (Updates + Progress),” Aug. 25, 2020. [Online]. Available: https://lpc.events/event/7/contributions/680/
[16] “LWN: bpf: add ambient BPF runtime context stored in current.” https://lwn.net/Articles/862539/
[17] “LWN: Enabling Non-Executable Memfds.” https://lwn.net/Articles/918106/
[18] “LWN: Impedance Matching for BPF and LSM,” Feb. 26, 2020. https://lwn.net/Articles/813261/
[19] “LWN: Kernel Runtime Security Instrumentation.” https://lwn.net/Articles/798157/
[20] “LWN: KRSI — The Other BPF Security Module,” Dec. 27, 2019. https://lwn.net/Articles/808048/
[21] “LWN: KRSI and Proprietary BPF Programs,” Jan. 17, 2020. https://lwn.net/Articles/809841/
[22] “Mitigating Attacks on a Supercomputer With KRSI.”
[23] “[PATCH bpf-next 0/4] Reduce overhead of LSMs with static calls.” [Online]. Available: https://lore.kernel.org/bpf/202301201137.93A66D1C76@keescook/T/#ma6a93c345ad38764bef97c18c982c11ab1cf0c0f
[24] “[PATCH v4 bpf-next 00/20] Introduce BPF trampoline.” [Online]. Available: https://lore.kernel.org/bpf/20191114185720.1641606-1-ast@kernel.org/#t
[25] “[RFC] security: replace indirect calls with static calls.” [Online]. Available: https://lore.kernel.org/bpf/20200820164753.3256899-1-jackmanb@chromium.org/
[26] “Static Calls in Linux 5.10.” https://blog.yossarian.net/2020/12/16/Static-calls-in-Linux-5-10
[27] “The Design and Implementation of Userland Exec”, [Online]. Available: https://grugq.github.io/docs/ul_exec.txt
[28] “Volatility3.” [Online]. Available: https://github.com/volatilityfoundation/volatility3