# Reverie v2: in-guest no_std design doc

This is a design document on how to implement an in-guest syscall tracer that
has extremely low overhead, is reliable, and works on all types of binaries.

# Background

The fastest way to intercept system calls is by replacing the syscall with a
CALL/JMP instruction that calls a function that we inject into the guest
process. This might sound rather easy to do, but there are many challenges with
this approach. We’ve had two previous attempts at implementing an in-guest
Reverie backend. They are described below along with their shortcomings.

## Reverie Alpha

Before we wrote `reverie-ptrace`, the very first version of Reverie did binary
rewriting to intercept syscalls.

Shortcomings:

 * Missed early syscalls because of `LD_PRELOAD` usage.
 * Not all syscalls could be patched because they’re at the end of a basic
   block.
 * Conflicts between the plugin’s glibc and the tracee’s glibc.
 * Does not work on static binaries.
 * Does not work well with a chroot environment. Because the plugin is compiled
   as a DSO, it may not be accessible from inside of the chroot.

We also tried to have the plugin use musl libc instead of glibc, but we cannot
mix two different libc implementations because they both have their own ideas of
how to manage thread local storage. It is _very_ difficult to reconcile this
problem.

## Reverie Sabre

Reverie Sabre uses [SaBRe](https://github.com/srg-imperial/SaBRe) to do the
binary rewriting. Since SaBRe is written in C, we wrote a Rust interface for it.
SaBRe works by hijacking the dynamic loader and loads our plugin first before
any other DSOs.

Shortcomings:

 * While we can technically intercept early syscalls, we can’t do it easily in
   practice. To do anything with the early syscalls, we can’t use anything from
   glibc. That means no allocations, no thread local storage, etc.
 * Does not work on static binaries.
 * Does not patch syscalls that are JITed.
 * Does not work well with a chroot environment.
 * Clobbers the result of `readlink /proc/self/exe` because SaBRe replaces its
   own code with the tracee’s.
 * We can get glibc mismatch errors. The tracee’s binary and the plugin’s DSO
   may have been compiled with different versions of glibc. This can lead to
   load-time errors.
 * The plugin can end up blowing through the stack because it reuses the same
   stack that the tracee uses.


# The no_std implementation

Because many of the above problems are caused by fighting with glibc, we can
leverage Rust’s no_std mode to avoid it completely.

## Plugin Loading

Let’s assume we have a statically linked DSO for a plugin with a single exported
function that is to be called instead of the syscall. How do we inject this into
the address space of the tracee process?

The absolute best time to load this is immediately after a successful call to
`execve` and the only way to do this reliably is with ptrace.

The way it shall work is as follows:

 1. From the tracer process, we **spawn the child process and attach to it with
    ptrace**. We then wait for a `PTRACE_EVENT_EXEC` to have the tracee stopped
    immediately after a successful call to `execve`. Note that we are _not_
    interested in intercepting all ptrace events, because that would introduce a
    large slowdown. By only intercepting exec events, the overhead of ptrace
    should be quite minimal.
 2. Now, to actually load our plugin’s DSO into the tracee’s address space, we
    need to inject calls to the tracee to run `mmap`. Instead of mapping it in
    from disk (via `open` + `mmap`), the DSO should be copied over into a
    `MAP_ANONYMOUS` mapping using `process_vm_writev`. Effectively, the tracer
    should be doing the loading of the DSO and loading each of its `PT_LOAD`
    segments. By copying the contents of the DSO over rather than using `open` +
    `mmap`, we bypass the chroot problem.

    We need to do some ELF parsing to figure out where the `PT_LOAD` segments
    are, but this only needs to be done once inside the tracer and it should be
    very fast to do so.

    Ideally, we should map the DSO at fixed addresses to ensure determinism and
    make it easier to identify the plugin while debugging, but this is not
    required. However, it is required if we don't perform relocations.

 3. Perform [relocations](https://en.wikipedia.org/wiki/Relocation_(computing)).
    Just like the dynamic linker, we need to do a pass to fix up addresses so
    that they point to the real place. While relocations are only necessary when
    compiling with `-pic`, this flag is usually the default and it is best not
    to fight the build system.
 4. We need to keep track of the address of our callback function because we
 will need it during the patching phase.

**Notes:**

 * Injecting syscalls is done in the same way that `reverie-ptrace` currently
   does it:
     * Replace the instructions at the current instruction pointer with a
       `syscall` instruction.
     * Use ptrace to set the registers and step through the syscall instruction.
     * Finally, the original instructions that we overwrite with the syscall
       instruction get restored along with the original value of the instruction
       pointer.
 * All of the symbol and debug information of the plugin should still be
   available while debugging. However, we should confirm that the debugger will
   even try to find this in pages not backed by a file on disk.
 * We can use `include_bytes!()` to bake the plugin into the tracer executable.

**Rational**:

 * The main downside of using `ptrace` here is that it makes it harder to debug
   the tracee with gdb. There can only be one ptracer at a time. There are a
   couple of ways we can work around this issue: (1) either implement a gdb
   server in the tracer to allow debugging, or (2) make it possible to detach
   from the tracee when debugging is desired. The first option is certainly more
   robust.
 * We can’t just detach from the tracee immediately after `execve` because we
   need to do the same thing for every child process.
 * Using ptrace may also have other advantages, like observing thread and process
   exits more reliably than we can from within the tracee itself.
 * The overhead of ptrace should be quite minimal since we are only interested
   in intercepting `PTRACE_EVENT_EXEC` events. Thus, the tracee should only be
   in a stopped state once during its lifetime.

**References**:

* Use the `safeptrace` crate to avoid shooting yourself in the foot with ptrace.

## Patching

With the plugin DSO loaded into the address space of the tracee, we need to find
all of the syscall instructions and replace them with a call to our callback
function. This is easier said than done because there are tricky edge cases to
handle here.


### How to patch

Patching generally involves the following:

 1. **Find** syscall instructions in the `.text` section by simply searching for
    its bit pattern. (On x86-64, a syscall instruction is 2 bytes represented by
    0x0f05.) Note that even if we find this bit pattern, there’s no guarantee
    that it is actually a syscall instruction. We need to disassemble further to
    find out for sure.
 2. **Replace** each syscall with a JMP instruction to our trampoline. Since a
    syscall is 2 bytes and our JMP instruction is 5 bytes, the trampoline
    contains any instructions that were overwritten plus a call to our callback
    function.  Moving code around like this isn’t valid because any relative
    addressing needs to be fixed up inside the trampoline. Thus, we need to
    disassemble the instructions that we have overwritten to figure out if they
    need to be fixed up or not.

### Where to patch

There are two places where patching can be done, each with their own advantages
and disadvantages.

#### From the tracer via ptrace

 * Advantages:
     * We can easily use parallelism to do the searching and patching.
     * We can guarantee that a cache of the patched instructions will be
       available because the tracer is guaranteed to be outside of any chroot
       jail.
 * Disadvantages:
     * Doing a fast search could be difficult because the data lives in the
       address space of the tracee process. We might be able to mmap the memory
       into the tracer process, but this isn’t yet clear.
     * It is more tricky to allocate memory in the tracee for constructing
       trampolines.

#### From the tracee itself

 * Advantages:
   * To do patching, we can just have the ptracer jump to a function in our DSO
     that does all of the patching.
 * Disadvantages:
   * Since the DSO must use no_std, finding a disassembler that works with
     no_std could be tricky. (For x86-64, iced-x86 works with `no_std`.)
   * Parallelism will be very difficult to implement with `no_std`.

### When to patch

Patching ultimately needs to be done on everything the `.text` section. The
executable itself has a `.text` section and so do all of the loaded DSOs. Thus,
patching needs to happen in the following scenarios:

 * Immediately after execve (to patch the executable). Here, we’re guaranteed
   that no other threads are running at the same time.
 * Immediately after a PT_LOAD segment is mapped into memory via mmap. There
   might be other threads at this point, but we can be reasonably certain that no
   other threads are accessing this just-mmaped segment.

### Potential Optimizations

 * Cache the locations of the patched instructions for next time. We can store a
   mapping of BuildID -> PatchLocations. As long as the binary’s BuildID is
   different upon subsequent rebuilds, this should work just fine. Note that
   shared libraries loaded at runtime have their own BuildID and can be cached
   separately. For shared libraries that are common to many binaries, this could
   lead to a big performance win.
 * We can use parallelism to simultaneously search through multiple PT_LOAD
   segments at a time.

### References:

 * [SaBRe’s rewriter implementation](https://github.com/srg-imperial/SaBRe/blob/05816ee066a7284bee8afd0e73eeb44455b254b4/arch/x86_64/rewriter.c)


## Catching unpatched syscalls

Patching does not always work in 100% of cases and there may be JITed code that
is not patched, so we should have a fallback in these rare circumstances.

In kernel versions newer than v5.11.0, there is a wonderful new way to intercept
syscalls from within the process itself. It is called
[PR_SET_SYSCALL_USER_DISPATCH](https://lwn.net/Articles/826313/).

It works like this:

```
prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, start, length, selector);
```

 * `start` is the starting address where syscalls _should not_ be intercepted.
 * `length` is the length of the memory region from `start` where syscalls
   shouldn’t be intercepted.
 * `selector` is a pointer to a `u8` that controls whether or not to enable the
   filter. The kernel looks at this memory address whenever a syscall
   instruction is trapped to determine if it should raise a `SIGSYS`. Thus, it
   allows super-fast toggling of the filter.

When the plugin is loaded, we can use `rt_sigaction` to set a signal handler for
`SIGSYS`. When we receive this signal, we should look at `siginfo_t.si_syscall`
to see which syscall was attempted.

The syscall arguments can be retrieved via the `ucontext` parameter to the
signal handler. It has a `uc_mcontext` member that holds all of the registers.

From this signal handler, we can run our syscall callback to handle the syscall.

Finally, after the signal handler is set up, we call this `prctl` to turn on
interception. Since we’re excluding the memory range of our plugin and all
syscalls outside of that range should have been replaced by a JMP instruction,
this should catch only syscalls that we weren’t able to patch.

**Notes**:

* We can use the magic linker variables `__ehdr_start` and `_end` within the
  plugin’s DSO to figure out the section of memory to exclude from interception.
* This is much better than using seccomp to trap syscalls because it doesn’t
  rely on the plugin to be loaded to a fixed address in all child processes.

## Thread-local Storage (TLS)

Thread-local storage is handled entirely by libc. Whenever a new thread is
created, it allocates some new memory to use for thread local storage. The `%fs`
register on x86-64 is a register that is dedicated to holding the offset to this
special region of memory. When a thread needs to access local storage, it will
use an address relative to the one in the `%fs` register.

Since our plugin is a static DSO, we can’t have threads and the compiler won’t
generate any `%fs`-relative addresses. However, we still want our plugin to be
able to store state on a per-thread basis. Since `%fs` shall be unique for each
thread, we can use it as a key into a global hash table where our thread-local
state is stored.

**Notes**:

 * The `%fs` register is not set right away. It doesn’t get set until
   `arch_prctl` is called early on in the execution of the program. We need to
   intercept this call and adjust our hash table accordingly.

**See Also**:

 * The ultimate guide on TLS: [https://www.akkadia.org/drepper/tls.pdf](https://www.akkadia.org/drepper/tls.pdf)


## The Stack

Similar to TLS, each thread has its own stack which gets allocated whenever a
new thread is spawned. It is up to the application to choose the size of the
stack space allocated for a thread. By default, it is only 2MB. Some threads may
even have very small stack sizes. For example, the thread glibc spawns to manage
timers only has a 16KB stack. Therefore, we cannot rely on the stack of the
thread to be large enough for the plugin’s needs. Instead, we need to create our
own stack and switch to it for only the duration of the plugin’s callback
function.

Like with TLS, we can use another global hashmap that translates our `%fs`
register into a `Box<[u8]>` where our stack is located. (Don’t forget that
the stack pointer decreases to grow the stack, so it should initially point to
the last 8 bytes of the allocated memory.)

This can be implemented by using assembly code to change `%rsp` to point to the
top of our stack. We need to store the old value of the previous stack on our
new stack.

This can be extremely tricky to get right as we need to be careful to not
clobber any other registers. All registers should be exactly the same as they
were before. On x86-64, the only two registers that are safe to clobber are
`%rcx` and `%r11`.

## Enforcing no_std in the plugin

Since everything relies on the plugin not having any dependencies and no usage
of libc, we need to be extra careful to ensure this.

The biggest requirement is that we only depend on no_std crates. As soon as we
depend on a crate that uses libc, then the build system will think that our DSO
should link with it too. This can be especially painful when a proc macro crate
uses std at build-time, but not at runtime.

At the very least, our loader should check that our DSO has no interpreter
(i.e., no usage of `ld.so`).

## The Allocator

For `no_std` to be useful, we need to have a global allocator. Without this, we
cannot use `Vec`, `Box`, `HashMap`, etc. It would be incredibly restrictive.

We can’t just use any off the shelf allocator, however. We need to have an
allocator that has the following restrictions:

 * No thread-local storage is used.
 * Does not use glibc in any way. Many allocators use libc’s `malloc` to allocate
   the underlying memory.
 * No mutexes, only atomics.

There exist crates that implement a `no_std` allocator, but they usually rely on
a pre-allocated pool of memory to operate on. There is no way to dynamically
mmap in new memory pages for use. This is fine, however, because we can leverage
the fact that 2TB of virtual memory does not necessarily map to 2TB of physical
memory if the full range of the 2TB has not been touched. All we need to do is
allocate a large pool of memory up front and use it for the lifetime of the
tracee.