reverie/experimental/docs/reverie_v2_design_doc.md
Jason White 8e716edec1 Publish no-std design doc
Summary: Trying to publish more of our docs and materials.

Reviewed By: VladimirMakaev

Differential Revision: D44387507

fbshipit-source-id: 32cd324db73483ec3f0dd7c2c1b0817f559fa958
2023-03-25 16:12:24 -07:00


Reverie v2: in-guest no_std design doc

This is a design document on how to implement an in-guest syscall tracer that has extremely low overhead, is reliable, and works on all types of binaries.

Background

The fastest way to intercept system calls is by replacing the syscall instruction with a CALL/JMP instruction that calls a function that we inject into the guest process. This might sound rather easy to do, but there are many challenges with this approach. We've had two previous attempts at implementing an in-guest Reverie backend. They are described below along with their shortcomings.

Reverie Alpha

Before we wrote reverie-ptrace, the very first version of Reverie did binary rewriting to intercept syscalls.

Shortcomings:

  • Missed early syscalls because of LD_PRELOAD usage.
  • Not all syscalls could be patched because they're at the end of a basic block.
  • Conflicts between the plugin's glibc and the tracee's glibc.
  • Does not work on static binaries.
  • Does not work well with a chroot environment. Because the plugin is compiled as a DSO, it may not be accessible from inside of the chroot.

We also tried to have the plugin use musl libc instead of glibc, but we cannot mix two different libc implementations because they both have their own ideas of how to manage thread local storage. It is very difficult to reconcile this problem.

Reverie Sabre

Reverie Sabre uses SaBRe to do the binary rewriting. Since SaBRe is written in C, we wrote a Rust interface for it. SaBRe works by hijacking the dynamic loader and loading our plugin before any other DSOs.

Shortcomings:

  • While we can technically intercept early syscalls, we can't do it easily in practice. To do anything with the early syscalls, we can't use anything from glibc. That means no allocations, no thread local storage, etc.
  • Does not work on static binaries.
  • Does not patch syscalls that are JITed.
  • Does not work well with a chroot environment.
  • Clobbers the result of readlink /proc/self/exe because SaBRe replaces its own code with the tracee's.
  • We can get glibc mismatch errors. The tracee's binary and the plugin's DSO may have been compiled with different versions of glibc. This can lead to load-time errors.
  • The plugin can end up blowing through the stack because it reuses the same stack that the tracee uses.

The no_std implementation

Because many of the above problems are caused by fighting with glibc, we can leverage Rust's no_std mode to avoid it completely.

Plugin Loading

Let's assume we have a statically linked DSO for a plugin with a single exported function that is to be called instead of the syscall. How do we inject this into the address space of the tracee process?

The absolute best time to load this is immediately after a successful call to execve and the only way to do this reliably is with ptrace.

The way it shall work is as follows:

  1. From the tracer process, we spawn the child process and attach to it with ptrace. We then wait for a PTRACE_EVENT_EXEC to have the tracee stopped immediately after a successful call to execve. Note that we are not interested in intercepting all ptrace events, because that would introduce a large slowdown. By only intercepting exec events, the overhead of ptrace should be quite minimal.

  2. Now, to actually load our plugin's DSO into the tracee's address space, we need to inject mmap syscalls into the tracee. Instead of mapping the DSO in from disk (via open + mmap), its contents should be copied into a MAP_ANONYMOUS mapping using process_vm_writev. Effectively, the tracer acts as the loader and loads each of the DSO's PT_LOAD segments itself. By copying the contents of the DSO over rather than using open + mmap, we bypass the chroot problem.

    We need to do some ELF parsing to figure out where the PT_LOAD segments are, but this only needs to be done once inside the tracer and it should be very fast to do so.

    Ideally, we should map the DSO at fixed addresses to ensure determinism and make it easier to identify the plugin while debugging, but this is not required. However, it is required if we don't perform relocations.

  3. Perform relocations. Just like the dynamic linker, we need to do a pass to fix up addresses so that they point to the real place. While relocations are only necessary when compiling position-independent code (PIC), this is usually the default and it is best not to fight the build system.

  4. We need to keep track of the address of our callback function because we will need it during the patching phase.
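The ELF parsing mentioned in step 2 can be sketched as follows. This is a minimal illustration, assuming a little-endian ELF64 image held in memory (for example, the bytes baked in with include_bytes!()). The names `LoadSegment` and `load_segments` are hypothetical and not part of Reverie.

```rust
/// One PT_LOAD entry from an ELF64 program header table.
/// (Hypothetical type for illustration.)
#[derive(Debug, PartialEq, Eq)]
pub struct LoadSegment {
    pub flags: u32,  // p_flags (PF_X = 1, PF_W = 2, PF_R = 4)
    pub offset: u64, // p_offset: where the segment starts in the file
    pub vaddr: u64,  // p_vaddr: where it wants to live in memory
    pub filesz: u64, // p_filesz: bytes to copy from the file
    pub memsz: u64,  // p_memsz: bytes to reserve (the rest is zero-filled)
}

const PT_LOAD: u32 = 1;

/// Reads N bytes at `at`, or None if that would run past the end.
fn bytes_at<const N: usize>(image: &[u8], at: usize) -> Option<[u8; N]> {
    let mut out = [0u8; N];
    out.copy_from_slice(image.get(at..at.checked_add(N)?)?);
    Some(out)
}

fn u16_at(image: &[u8], at: usize) -> Option<u16> {
    bytes_at(image, at).map(u16::from_le_bytes)
}

fn u32_at(image: &[u8], at: usize) -> Option<u32> {
    bytes_at(image, at).map(u32::from_le_bytes)
}

fn u64_at(image: &[u8], at: usize) -> Option<u64> {
    bytes_at(image, at).map(u64::from_le_bytes)
}

/// Returns every PT_LOAD segment, or None if the image is not a
/// well-formed little-endian ELF64 file.
pub fn load_segments(image: &[u8]) -> Option<Vec<LoadSegment>> {
    // e_ident: magic, then EI_CLASS == ELFCLASS64 (2), EI_DATA == ELFDATA2LSB (1).
    if !image.starts_with(b"\x7fELF") || image.get(4) != Some(&2) || image.get(5) != Some(&1) {
        return None;
    }
    let phoff = u64_at(image, 0x20)? as usize; // e_phoff
    let phentsize = u16_at(image, 0x36)? as usize; // e_phentsize
    let phnum = u16_at(image, 0x38)? as usize; // e_phnum

    let mut segments = Vec::new();
    for i in 0..phnum {
        let ph = phoff.checked_add(i.checked_mul(phentsize)?)?;
        if u32_at(image, ph)? != PT_LOAD {
            continue;
        }
        segments.push(LoadSegment {
            flags: u32_at(image, ph + 4)?,
            offset: u64_at(image, ph + 8)?,
            vaddr: u64_at(image, ph + 16)?,
            filesz: u64_at(image, ph + 32)?,
            memsz: u64_at(image, ph + 40)?,
        });
    }
    Some(segments)
}
```

For each returned segment, the tracer would then inject an anonymous mmap of `memsz` bytes and copy `filesz` bytes starting at `offset` into it with process_vm_writev.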

Notes:

  • Injecting syscalls is done in the same way that reverie-ptrace currently does it:
    • Replace the instructions at the current instruction pointer with a syscall instruction.
    • Use ptrace to set the registers and step through the syscall instruction.
    • Finally, the original instructions that we overwrite with the syscall instruction get restored along with the original value of the instruction pointer.
  • All of the symbol and debug information of the plugin should still be available while debugging. However, we should confirm that the debugger will even try to find this in pages not backed by a file on disk.
  • We can use include_bytes!() to bake the plugin into the tracer executable.

Rationale:

  • The main downside of using ptrace here is that it makes it harder to debug the tracee with gdb. There can only be one ptracer at a time. There are a couple of ways we can work around this issue: (1) either implement a gdb server in the tracer to allow debugging, or (2) make it possible to detach from the tracee when debugging is desired. The first option is certainly more robust.
  • We can't just detach from the tracee immediately after execve because we need to do the same thing for every child process.
  • Using ptrace may also have other advantages, like observing thread and process exits more reliably than we can from within the tracee itself.
  • The overhead of ptrace should be quite minimal since we are only interested in intercepting PTRACE_EVENT_EXEC events. Thus, the tracee should only be in a stopped state once during its lifetime.

References:

  • Use the safeptrace crate to avoid shooting yourself in the foot with ptrace.

Patching

With the plugin DSO loaded into the address space of the tracee, we need to find all of the syscall instructions and replace them with a call to our callback function. This is easier said than done because there are tricky edge cases to handle here.

How to patch

Patching generally involves the following:

  1. Find syscall instructions in the .text section by simply searching for their byte pattern. (On x86-64, a syscall instruction is 2 bytes: 0x0f 0x05.) Note that even if we find this byte pattern, there's no guarantee that it is actually a syscall instruction. We need to disassemble further to find out for sure.
  2. Replace each syscall with a JMP instruction to our trampoline. Since a syscall is 2 bytes and our JMP instruction is 5 bytes, the JMP also overwrites neighboring instructions, so the trampoline must contain any instructions that were overwritten plus a call to our callback function. Moving code around like this isn't always valid because any relative addressing needs to be fixed up inside the trampoline. Thus, we need to disassemble the instructions that we have overwritten to figure out if they need to be fixed up or not.
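The two mechanical pieces of the steps above, the byte search and the 5-byte JMP encoding, can be sketched like this. It is only a rough illustration: the function names are hypothetical, and it deliberately leaves out the disassembly needed to confirm that a match is a real syscall instruction and to relocate the overwritten instructions.

```rust
/// Returns the offset of every 0x0f 0x05 byte pair in `text`.
/// These are only candidates: the pair may also occur inside the
/// immediate or displacement of some other instruction.
pub fn find_syscall_candidates(text: &[u8]) -> Vec<usize> {
    text.windows(2)
        .enumerate()
        .filter(|(_, w)| w[0] == 0x0f && w[1] == 0x05)
        .map(|(i, _)| i)
        .collect()
}

/// Encodes `jmp rel32` (opcode 0xe9) placed at address `from` and
/// targeting address `to`. The displacement is relative to the end of
/// the 5-byte instruction. Returns None if the target is out of the
/// +/- 2 GiB range that a rel32 can express.
pub fn encode_jmp_rel32(from: u64, to: u64) -> Option<[u8; 5]> {
    let next = from.checked_add(5)?;
    let disp = to as i128 - next as i128;
    if disp < i32::MIN as i128 || disp > i32::MAX as i128 {
        return None;
    }
    let d = (disp as i32).to_le_bytes();
    Some([0xe9, d[0], d[1], d[2], d[3]])
}
```

The range check is one reason trampolines need to be allocated within 2 GiB of the code being patched.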

Where to patch

There are two places where patching can be done, each with their own advantages and disadvantages.

From the tracer via ptrace

  • Advantages:
    • We can easily use parallelism to do the searching and patching.
    • We can guarantee that a cache of the patched instructions will be available because the tracer is guaranteed to be outside of any chroot jail.
  • Disadvantages:
    • Doing a fast search could be difficult because the data lives in the address space of the tracee process. We might be able to mmap the memory into the tracer process, but this isn't yet clear.
    • It is more tricky to allocate memory in the tracee for constructing trampolines.

From the tracee itself

  • Advantages:
    • To do patching, we can just have the ptracer jump to a function in our DSO that does all of the patching.
  • Disadvantages:
    • Since the DSO must use no_std, finding a disassembler that works with no_std could be tricky. (For x86-64, iced-x86 works with no_std.)
    • Parallelism will be very difficult to implement with no_std.

When to patch

Patching ultimately needs to be done on everything in the .text section. The executable itself has a .text section and so do all of the loaded DSOs. Thus, patching needs to happen in the following scenarios:

  • Immediately after execve (to patch the executable). Here, we're guaranteed that no other threads are running at the same time.
  • Immediately after a PT_LOAD segment is mapped into memory via mmap. There might be other threads at this point, but we can be reasonably certain that no other threads are accessing this just-mmaped segment.

Potential Optimizations

  • Cache the locations of the patched instructions for next time. We can store a mapping of BuildID -> PatchLocations. As long as the binary's BuildID is different upon subsequent rebuilds, this should work just fine. Note that shared libraries loaded at runtime have their own BuildID and can be cached separately. For shared libraries that are common to many binaries, this could lead to a big performance win.
  • We can use parallelism to simultaneously search through multiple PT_LOAD segments at a time.
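The BuildID -> PatchLocations mapping could look roughly like the sketch below. The `PatchCache` type is hypothetical. It assumes that locations are stored as offsets within the binary rather than absolute addresses, so that a cached entry stays valid wherever the binary is loaded, and it says nothing about how the cache would be persisted between runs.

```rust
use std::collections::HashMap;

/// In-memory cache from an ELF build ID to the offsets of the syscall
/// instructions that were patched. (Hypothetical type for illustration.)
#[derive(Default)]
pub struct PatchCache {
    by_build_id: HashMap<Vec<u8>, Vec<u64>>,
}

impl PatchCache {
    /// Returns the cached patch offsets for this build ID, if any.
    pub fn lookup(&self, build_id: &[u8]) -> Option<&[u64]> {
        self.by_build_id.get(build_id).map(|v| v.as_slice())
    }

    /// Records the patch offsets found for this build ID, replacing
    /// any earlier entry.
    pub fn record(&mut self, build_id: &[u8], offsets: Vec<u64>) {
        self.by_build_id.insert(build_id.to_vec(), offsets);
    }
}
```

On a cache hit, the search and most of the disassembly can be skipped and the tracer can go straight to writing the JMP instructions.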

References:

Catching unpatched syscalls

Patching does not work in 100% of cases, and there may be JITed code that is not patched, so we should have a fallback for these rare circumstances.

In Linux 5.11 and newer, there is a wonderful new way to intercept syscalls from within the process itself. It is called Syscall User Dispatch and is enabled with PR_SET_SYSCALL_USER_DISPATCH.

It works like this:

prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, start, length, selector);
  • start is the starting address where syscalls should not be intercepted.
  • length is the length of the memory region from start where syscalls shouldn't be intercepted.
  • selector is a pointer to a u8 that controls whether or not to enable the filter. The kernel looks at this memory address whenever a syscall instruction is trapped to determine if it should raise a SIGSYS. Thus, it allows super-fast toggling of the filter.

When the plugin is loaded, we can use rt_sigaction to set a signal handler for SIGSYS. When we receive this signal, we should look at siginfo_t.si_syscall to see which syscall was attempted.

The syscall arguments can be retrieved via the ucontext parameter to the signal handler. It has a uc_mcontext member that holds all of the registers.

From this signal handler, we can run our syscall callback to handle the syscall.

Finally, after the signal handler is set up, we call this prctl to turn on interception. Since we're excluding the memory range of our plugin and all syscalls outside of that range should have been replaced by a JMP instruction, this should catch only syscalls that we weren't able to patch.
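To make the roles of start, length, and selector concrete, here is a small model of the decision the kernel makes for each syscall once dispatch is on. It does not call prctl itself. The constant values are the ones from the Linux UAPI headers, while `DispatchConfig`, `Selector`, and `raises_sigsys` are hypothetical names used only for this sketch.

```rust
/// prctl option and argument values from <linux/prctl.h>.
pub const PR_SET_SYSCALL_USER_DISPATCH: i32 = 59;
pub const PR_SYS_DISPATCH_OFF: u64 = 0;
pub const PR_SYS_DISPATCH_ON: u64 = 1;

/// The two valid values of the selector byte
/// (SYSCALL_DISPATCH_FILTER_ALLOW and SYSCALL_DISPATCH_FILTER_BLOCK).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Selector {
    Allow = 0,
    Block = 1,
}

/// The excluded region passed to prctl. (Hypothetical type.)
pub struct DispatchConfig {
    pub start: u64,
    pub length: u64,
}

impl DispatchConfig {
    /// Models whether a syscall issued at instruction pointer `ip`
    /// is turned into a SIGSYS instead of being executed.
    pub fn raises_sigsys(&self, ip: u64, selector: Selector) -> bool {
        // Syscalls made from inside [start, start + length) always run.
        let in_excluded_region = ip >= self.start && ip - self.start < self.length;
        // Everywhere else, the selector byte decides.
        !in_excluded_region && selector == Selector::Block
    }
}
```

In this design the excluded region would be the plugin's own mapping, so the plugin can make real syscalls while the selector is set to block everything else.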

Notes:

  • We can use the magic linker variables __ehdr_start and _end within the plugin's DSO to figure out the section of memory to exclude from interception.
  • This is much better than using seccomp to trap syscalls because it doesn't rely on the plugin to be loaded to a fixed address in all child processes.

Thread-local Storage (TLS)

Thread-local storage is handled entirely by libc. Whenever a new thread is created, it allocates some new memory to use for thread local storage. The %fs register on x86-64 is dedicated to holding the base address of this special region of memory. When a thread needs to access local storage, it will use an address relative to the one in the %fs register.

Since our plugin is a static DSO, we can't have threads and the compiler won't generate any %fs-relative addresses. However, we still want our plugin to be able to store state on a per-thread basis. Since %fs shall be unique for each thread, we can use it as a key into a global hash table where our thread-local state is stored.

Notes:

  • The %fs register is not set right away. It doesn't get set until arch_prctl is called early on in the execution of the program. We need to intercept this call and adjust our hash table accordingly.
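A minimal sketch of the %fs-keyed table, including the re-keying needed when arch_prctl changes %fs, might look like this. It uses std's HashMap purely for brevity. The real plugin would need a no_std table that is safe to use from multiple threads, and `ThreadStateTable` is a hypothetical name. Reading the actual %fs base is left out, so the key is passed in by the caller.

```rust
use std::collections::HashMap;

/// Per-thread plugin state, keyed by the thread's %fs base address.
/// (Hypothetical type for illustration.)
pub struct ThreadStateTable<T> {
    by_fs_base: HashMap<u64, T>,
}

impl<T: Default> ThreadStateTable<T> {
    pub fn new() -> Self {
        ThreadStateTable { by_fs_base: HashMap::new() }
    }

    /// Returns the state for the thread with this %fs base, creating
    /// a default-initialized entry on first use.
    pub fn state_for(&mut self, fs_base: u64) -> &mut T {
        self.by_fs_base.entry(fs_base).or_insert_with(T::default)
    }

    /// Moves a thread's state to a new key. To be called when an
    /// intercepted arch_prctl changes the thread's %fs base.
    pub fn rekey(&mut self, old_fs_base: u64, new_fs_base: u64) {
        if let Some(state) = self.by_fs_base.remove(&old_fs_base) {
            self.by_fs_base.insert(new_fs_base, state);
        }
    }
}
```

Any state recorded before arch_prctl runs is stored under the thread's initial %fs base (assumed to be 0 in this sketch) and follows the thread to its new key once `rekey` is called.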

See Also:

The Stack

Similar to TLS, each thread has its own stack which gets allocated whenever a new thread is spawned. It is up to the application to choose the size of the stack space allocated for a thread. By default, it is only 2MB. Some threads may even have very small stack sizes. For example, the thread glibc spawns to manage timers only has a 16KB stack. Therefore, we cannot rely on the stack of the thread to be large enough for the plugin's needs. Instead, we need to create our own stack and switch to it for only the duration of the plugin's callback function.

Like with TLS, we can use another global hashmap that translates our %fs register into a Box<[u8]> where our stack is located. (Don't forget that the stack pointer decreases to grow the stack, so it should initially point to the last 8 bytes of the allocated memory.)

This can be implemented by using assembly code to change %rsp to point to the top of our stack. We need to store the previous value of %rsp on our new stack so that it can be restored when the callback returns.

This can be extremely tricky to get right as we need to be careful to not clobber any other registers. All registers should be exactly the same as they were before. On x86-64, the only two registers that are safe to clobber are %rcx and %r11.
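The allocation side of this can be sketched without any assembly. The helper names below are hypothetical. Note one deliberate difference from the text above: instead of pointing at the last 8 bytes, this sketch aligns the top of the stack down to 16 bytes, on the assumption that the callback is entered through a normal call under the x86-64 System V ABI. The register-preserving switch of %rsp itself is not shown.

```rust
/// Alignment assumed for the initial stack pointer (x86-64 System V ABI).
const STACK_ALIGN: usize = 16;

/// Allocates a zeroed region to serve as the plugin's private stack.
/// `size` is assumed to be much larger than STACK_ALIGN.
pub fn alloc_stack(size: usize) -> Box<[u8]> {
    vec![0u8; size].into_boxed_slice()
}

/// Computes the address %rsp should hold when the plugin stack is
/// first used: the end of the allocation, rounded down to STACK_ALIGN.
/// The stack then grows downward from there toward the base.
pub fn initial_stack_pointer(stack: &[u8]) -> usize {
    let end = stack.as_ptr() as usize + stack.len();
    end & !(STACK_ALIGN - 1)
}
```

The switching code would save the old %rsp just below this address before calling into the plugin, and pop it back afterwards.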

Enforcing no_std in the plugin

Since everything relies on the plugin not having any dependencies and no usage of libc, we need to be extra careful to ensure this.

The biggest requirement is that we only depend on no_std crates. As soon as we depend on a crate that uses libc, then the build system will think that our DSO should link with it too. This can be especially painful when a proc macro crate uses std at build-time, but not at runtime.

At the very least, our loader should check that our DSO has no interpreter (i.e., no usage of ld.so).
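The interpreter check amounts to making sure the DSO has no PT_INTERP program header. A minimal sketch, assuming a little-endian ELF64 image in memory and using the hypothetical name `has_interpreter`:

```rust
/// Program header type for the interpreter path (e.g. ld.so).
const PT_INTERP: u32 = 3;

/// Reads N bytes at `at`, or None if that would run past the end.
fn le_bytes<const N: usize>(image: &[u8], at: usize) -> Option<[u8; N]> {
    let mut out = [0u8; N];
    out.copy_from_slice(image.get(at..at.checked_add(N)?)?);
    Some(out)
}

/// Returns Some(true) if the ELF64 image requests an interpreter,
/// Some(false) if it does not, and None if the image is malformed.
pub fn has_interpreter(image: &[u8]) -> Option<bool> {
    // e_ident: magic, then EI_CLASS == ELFCLASS64 (2), EI_DATA == ELFDATA2LSB (1).
    if !image.starts_with(b"\x7fELF") || image.get(4) != Some(&2) || image.get(5) != Some(&1) {
        return None;
    }
    let phoff = u64::from_le_bytes(le_bytes(image, 0x20)?) as usize; // e_phoff
    let phentsize = u16::from_le_bytes(le_bytes(image, 0x36)?) as usize; // e_phentsize
    let phnum = u16::from_le_bytes(le_bytes(image, 0x38)?) as usize; // e_phnum

    for i in 0..phnum {
        let ph = phoff.checked_add(i.checked_mul(phentsize)?)?;
        // p_type is the first 4 bytes of each program header.
        if u32::from_le_bytes(le_bytes(image, ph)?) == PT_INTERP {
            return Some(true);
        }
    }
    Some(false)
}
```

The loader could refuse to inject any plugin for which this returns anything other than Some(false).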

The Allocator

For no_std to be useful, we need to have a global allocator. Without this, we cannot use Vec, Box, HashMap, etc. It would be incredibly restrictive.

We can't just use any off-the-shelf allocator, however. We need an allocator that satisfies the following restrictions:

  • No thread-local storage is used.
  • Does not use glibc in any way. Many allocators use libc's malloc to allocate the underlying memory.
  • No mutexes, only atomics.

There exist crates that implement a no_std allocator, but they usually rely on a pre-allocated pool of memory to operate on. There is no way to dynamically mmap in new memory pages for use. This is fine, however, because we can leverage the fact that 2TB of virtual memory does not necessarily map to 2TB of physical memory if the full range of the 2TB has not been touched. All we need to do is allocate a large pool of memory up front and use it for the lifetime of the tracee.
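As an illustration of an allocator that meets these restrictions, here is a sketch of the simplest possible one: a bump allocator that hands out offsets into a pre-reserved pool using a single atomic counter. `BumpPool` is a hypothetical name, and the sketch never frees memory and does not implement the GlobalAlloc trait. It also assumes the pool's base address is at least as aligned as any requested alignment (for example, page-aligned).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Lock-free bump allocator over a pool of `capacity` bytes.
/// Returns offsets from the pool's base rather than pointers.
/// (Hypothetical type for illustration.)
pub struct BumpPool {
    capacity: usize,
    next: AtomicUsize,
}

impl BumpPool {
    pub const fn new(capacity: usize) -> Self {
        BumpPool { capacity, next: AtomicUsize::new(0) }
    }

    /// Reserves `size` bytes aligned to `align` (a power of two).
    /// Returns the offset of the reservation, or None if `align` is
    /// invalid or the pool is exhausted.
    pub fn alloc(&self, size: usize, align: usize) -> Option<usize> {
        if !align.is_power_of_two() {
            return None;
        }
        let mut start = 0;
        // Compare-and-swap loop: retried automatically if another
        // thread bumps `next` in the meantime.
        self.next
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |cur| {
                start = cur.checked_add(align - 1)? & !(align - 1);
                let end = start.checked_add(size)?;
                if end > self.capacity { None } else { Some(end) }
            })
            .ok()?;
        Some(start)
    }
}
```

Because untouched pages of a large anonymous mapping consume no physical memory, the pool can be sized generously up front. A practical allocator would add some way to reuse freed memory on top of this.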