# Reverie v2: in-guest no_std design doc
This is a design document on how to implement an in-guest syscall tracer that
has extremely low overhead, is reliable, and works on all types of binaries.
# Background
The fastest way to intercept system calls is by replacing the syscall with a
CALL/JMP instruction that calls a function that we inject into the guest
process. This might sound rather easy to do, but there are many challenges with
this approach. We've had two previous attempts at implementing an in-guest
Reverie backend. They are described below along with their shortcomings.
## Reverie Alpha
Before we wrote `reverie-ptrace`, the very first version of Reverie did binary
rewriting to intercept syscalls.
Shortcomings:
* Missed early syscalls because of `LD_PRELOAD` usage.
* Not all syscalls could be patched because they're at the end of a basic
block.
* Conflicts between the plugin's glibc and the tracee's glibc.
* Does not work on static binaries.
* Does not work well with a chroot environment. Because the plugin is compiled
as a DSO, it may not be accessible from inside of the chroot.
We also tried to have the plugin use musl libc instead of glibc, but we cannot
mix two different libc implementations because they both have their own ideas of
how to manage thread local storage. It is _very_ difficult to reconcile this
problem.
## Reverie Sabre
Reverie Sabre uses [SaBRe](https://github.com/srg-imperial/SaBRe) to do the
binary rewriting. Since SaBRe is written in C, we wrote a Rust interface for it.
SaBRe works by hijacking the dynamic loader and loads our plugin first before
any other DSOs.
Shortcomings:
* While we can technically intercept early syscalls, we can't do it easily in
practice. To do anything with the early syscalls, we can't use anything from
glibc. That means no allocations, no thread local storage, etc.
* Does not work on static binaries.
* Does not patch syscalls that are JITed.
* Does not work well with a chroot environment.
* Clobbers the result of `readlink /proc/self/exe` because SaBRe replaces its
own code with the tracee's.
* We can get glibc mismatch errors. The tracee's binary and the plugin's DSO
may have been compiled with different versions of glibc. This can lead to
load-time errors.
* The plugin can end up blowing through the stack because it reuses the same
stack that the tracee uses.
# The no_std implementation
Because many of the above problems are caused by fighting with glibc, we can
leverage Rust's `no_std` mode to avoid it completely.
## Plugin Loading
Let's assume we have a statically linked DSO for a plugin with a single exported
function that is to be called instead of the syscall. How do we inject this into
the address space of the tracee process?
The absolute best time to load this is immediately after a successful call to
`execve` and the only way to do this reliably is with ptrace.
The way it shall work is as follows:
1. From the tracer process, we **spawn the child process and attach to it with
ptrace**. We then wait for a `PTRACE_EVENT_EXEC` to have the tracee stopped
immediately after a successful call to `execve`. Note that we are _not_
interested in intercepting all ptrace events, because that would introduce a
large slowdown. By only intercepting exec events, the overhead of ptrace
should be quite minimal.
2. Now, to actually load our plugin's DSO into the tracee's address space, we
need to inject calls to the tracee to run `mmap`. Instead of mapping it in
from disk (via `open` + `mmap`), the DSO should be copied over into a
`MAP_ANONYMOUS` mapping using `process_vm_writev`. Effectively, the tracer
should be doing the loading of the DSO and loading each of its `PT_LOAD`
segments. By copying the contents of the DSO over rather than using `open` +
`mmap`, we bypass the chroot problem.
We need to do some ELF parsing to figure out where the `PT_LOAD` segments
are, but this only needs to be done once inside the tracer and it should be
very fast to do so.
Ideally, we should map the DSO at fixed addresses to ensure determinism and
make it easier to identify the plugin while debugging, but this is not
required. However, it is required if we don't perform relocations.
3. Perform [relocations](https://en.wikipedia.org/wiki/Relocation_(computing)).
Just like the dynamic linker, we need to do a pass to fix up addresses so
that they point to the real place. While relocations are only necessary when
compiling position-independent code (`-fPIC`, or `-C relocation-model=pic`
for rustc), this is usually the default and it is best not to fight the
build system.
4. We need to keep track of the address of our callback function because we
will need it during the patching phase.
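
The ELF parsing in step 2 can be sketched as follows. This is a minimal sketch, assuming a little-endian ELF64 image (for example one embedded with `include_bytes!()`) and a 64-bit tracer; `LoadSegment`, `load_segments`, and `le` are hypothetical names, not part of Reverie.

```rust
/// One loadable segment, as described by an ELF64 program header.
#[derive(Debug, PartialEq, Eq)]
pub struct LoadSegment {
    pub flags: u32,  // p_flags: PF_X = 1, PF_W = 2, PF_R = 4
    pub offset: u64, // p_offset: where the segment starts in the file
    pub vaddr: u64,  // p_vaddr: address relative to the load base
    pub filesz: u64, // p_filesz: bytes to copy from the file
    pub memsz: u64,  // p_memsz: bytes to map (the tail past filesz is zero)
}

const PT_LOAD: u32 = 1;

/// Reads a little-endian field of `len` (at most 8) bytes at offset `at`.
fn le(bytes: &[u8], at: usize, len: usize) -> Option<u64> {
    let field = bytes.get(at..at.checked_add(len)?)?;
    Some(field.iter().rev().fold(0u64, |acc, &b| (acc << 8) | u64::from(b)))
}

/// Returns the PT_LOAD segments of an ELF64 image, or `None` if it is
/// malformed or not a little-endian 64-bit ELF file.
pub fn load_segments(elf: &[u8]) -> Option<Vec<LoadSegment>> {
    // e_ident: magic, EI_CLASS == 2 (64-bit), EI_DATA == 1 (little-endian).
    if !elf.starts_with(b"\x7fELF") || le(elf, 4, 1)? != 2 || le(elf, 5, 1)? != 1 {
        return None;
    }
    let phoff = le(elf, 0x20, 8)? as usize; // e_phoff
    let phentsize = le(elf, 0x36, 2)? as usize; // e_phentsize
    let phnum = le(elf, 0x38, 2)? as usize; // e_phnum

    let mut segments = Vec::new();
    for i in 0..phnum {
        let ph = phoff.checked_add(i.checked_mul(phentsize)?)?;
        if le(elf, ph, 4)? != u64::from(PT_LOAD) {
            continue;
        }
        segments.push(LoadSegment {
            flags: le(elf, ph.checked_add(4)?, 4)? as u32,
            offset: le(elf, ph.checked_add(8)?, 8)?,
            vaddr: le(elf, ph.checked_add(16)?, 8)?,
            filesz: le(elf, ph.checked_add(32)?, 8)?,
            memsz: le(elf, ph.checked_add(40)?, 8)?,
        });
    }
    Some(segments)
}
```

For each returned segment, the tracer would map `memsz` bytes at `base + vaddr` in the tracee and copy `filesz` bytes starting at `offset` with `process_vm_writev`.
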
**Notes:**
* Injecting syscalls is done in the same way that `reverie-ptrace` currently
does it:
* Replace the instructions at the current instruction pointer with a
`syscall` instruction.
* Use ptrace to set the registers and step through the syscall instruction.
* Finally, the original instructions that we overwrite with the syscall
instruction get restored along with the original value of the instruction
pointer.
* All of the symbol and debug information of the plugin should still be
available while debugging. However, we should confirm that the debugger will
even try to find this in pages not backed by a file on disk.
* We can use `include_bytes!()` to bake the plugin into the tracer executable.
**Rationale**:
* The main downside of using `ptrace` here is that it makes it harder to debug
the tracee with gdb. There can only be one ptracer at a time. There are a
couple of ways we can work around this issue: (1) either implement a gdb
server in the tracer to allow debugging, or (2) make it possible to detach
from the tracee when debugging is desired. The first option is certainly more
robust.
* We can't just detach from the tracee immediately after `execve` because we
need to do the same thing for every child process.
* Using ptrace may also have other advantages, like observing thread and process
exits more reliably than we can from within the tracee itself.
* The overhead of ptrace should be quite minimal since we are only interested
in intercepting `PTRACE_EVENT_EXEC` events. Thus, the tracee should only be
in a stopped state once during its lifetime.
**References**:
* Use the `safeptrace` crate to avoid shooting yourself in the foot with ptrace.
## Patching
With the plugin DSO loaded into the address space of the tracee, we need to find
all of the syscall instructions and replace them with a call to our callback
function. This is easier said than done because there are tricky edge cases to
handle here.
### How to patch
Patching generally involves the following:
1. **Find** syscall instructions in the `.text` section by simply searching for
their byte pattern. (On x86-64, a syscall instruction is the 2 bytes
`0x0f 0x05`.) Note that even if we find this byte pattern, there's no
guarantee that it is actually a syscall instruction; the bytes could be part
of a longer instruction. We need to disassemble further to find out for sure.
2. **Replace** each syscall with a JMP instruction to our trampoline. Since a
syscall is 2 bytes and our JMP instruction is 5 bytes, the JMP also
overwrites neighboring instructions, so the trampoline contains any
instructions that were overwritten plus a call to our callback function.
Moving code around like this isn't always valid because any relative
addressing needs to be fixed up inside the trampoline. Thus, we need to
disassemble the instructions that we have overwritten to figure out if they
need to be fixed up or not.
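
The two mechanical parts of these steps, the byte search and the encoding of the 5-byte JMP, can be sketched as below. This is only an illustration: `syscall_candidates` and `encode_jmp_rel32` are hypothetical helpers, and the sketch does none of the disassembly or trampoline construction described above.

```rust
/// Offsets of every `0f 05` byte pair in `text`. These are only candidates:
/// each one still has to be confirmed by disassembling around it.
pub fn syscall_candidates(text: &[u8]) -> Vec<usize> {
    text.windows(2)
        .enumerate()
        .filter(|(_, w)| w[0] == 0x0f && w[1] == 0x05)
        .map(|(i, _)| i)
        .collect()
}

/// Encodes `jmp rel32` (opcode 0xE9) for a jump placed at address `site`
/// that lands on `target`. Returns `None` if the target is further away
/// than a signed 32-bit displacement can express.
pub fn encode_jmp_rel32(site: u64, target: u64) -> Option<[u8; 5]> {
    // The displacement is relative to the end of the 5-byte instruction.
    let next = site.wrapping_add(5);
    let disp = target.wrapping_sub(next) as i64;
    if disp < i64::from(i32::MIN) || disp > i64::from(i32::MAX) {
        return None;
    }
    let d = (disp as i32).to_le_bytes();
    Some([0xE9, d[0], d[1], d[2], d[3]])
}
```

The range check is the reason trampolines have to be allocated within 2 GiB of the code they patch.
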
### Where to patch
There are two places where patching can be done, each with their own advantages
and disadvantages.
#### From the tracer via ptrace
* Advantages:
* We can easily use parallelism to do the searching and patching.
* We can guarantee that a cache of the patched instructions will be
available because the tracer is guaranteed to be outside of any chroot
jail.
* Disadvantages:
* Doing a fast search could be difficult because the data lives in the
address space of the tracee process. We might be able to mmap the memory
into the tracer process, but this isn't yet clear.
* It is more tricky to allocate memory in the tracee for constructing
trampolines.
#### From the tracee itself
* Advantages:
* To do patching, we can just have the ptracer jump to a function in our DSO
that does all of the patching.
* Disadvantages:
* Since the DSO must use no_std, finding a disassembler that works with
no_std could be tricky. (For x86-64, iced-x86 works with `no_std`.)
* Parallelism will be very difficult to implement with `no_std`.
### When to patch
Patching ultimately needs to be done on everything in the `.text` section. The
executable itself has a `.text` section and so do all of the loaded DSOs. Thus,
patching needs to happen in the following scenarios:
* Immediately after execve (to patch the executable). Here, we're guaranteed
that no other threads are running at the same time.
* Immediately after a PT_LOAD segment is mapped into memory via mmap. There
might be other threads at this point, but we can be reasonably certain that no
other threads are accessing this just-mmaped segment.
### Potential Optimizations
* Cache the locations of the patched instructions for next time. We can store a
mapping of BuildID -> PatchLocations. As long as the binary's BuildID is
different upon subsequent rebuilds, this should work just fine. Note that
shared libraries loaded at runtime have their own BuildID and can be cached
separately. For shared libraries that are common to many binaries, this could
lead to a big performance win.
* We can use parallelism to simultaneously search through multiple PT_LOAD
segments at a time.
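
The BuildID cache could look roughly like this. It is a minimal in-memory sketch with hypothetical names (`PatchCache`, `get_or_scan`); it assumes locations are stored as file-relative offsets so they stay valid when a library is loaded at a different address, and it leaves out persisting the cache to disk.

```rust
use std::collections::HashMap;

/// Hypothetical cache: build ID bytes -> offsets of the patched syscalls.
#[derive(Default)]
pub struct PatchCache {
    by_build_id: HashMap<Vec<u8>, Vec<u64>>,
}

impl PatchCache {
    /// Returns the cached patch locations for `build_id`. On a miss, runs
    /// `scan` (the expensive search and disassembly) once and remembers
    /// its result for later lookups.
    pub fn get_or_scan<F>(&mut self, build_id: &[u8], scan: F) -> &[u64]
    where
        F: FnOnce() -> Vec<u64>,
    {
        self.by_build_id
            .entry(build_id.to_vec())
            .or_insert_with(scan)
            .as_slice()
    }
}
```
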
### References:
* [SaBRe's rewriter implementation](https://github.com/srg-imperial/SaBRe/blob/05816ee066a7284bee8afd0e73eeb44455b254b4/arch/x86_64/rewriter.c)
## Catching unpatched syscalls
Patching does not work in 100% of cases and there may be JITed code that is not
patched, so we should have a fallback in these rare circumstances.
In kernel versions v5.11 and newer, there is a wonderful new way to intercept
syscalls from within the process itself. It is called
[PR_SET_SYSCALL_USER_DISPATCH](https://lwn.net/Articles/826313/).
It works like this:
```
prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, start, length, selector);
```
* `start` is the starting address where syscalls _should not_ be intercepted.
* `length` is the length of the memory region from `start` where syscalls
shouldn't be intercepted.
* `selector` is a pointer to a `u8` that controls whether or not to enable the
filter. The kernel looks at this memory address whenever a syscall
instruction is trapped to determine if it should raise a `SIGSYS`. Thus, it
allows super-fast toggling of the filter.
When the plugin is loaded, we can use `rt_sigaction` to set a signal handler for
`SIGSYS`. When we receive this signal, we should look at `siginfo_t.si_syscall`
to see which syscall was attempted.
The syscall arguments can be retrieved via the `ucontext` parameter to the
signal handler. It has a `uc_mcontext` member that holds all of the registers.
From this signal handler, we can run our syscall callback to handle the syscall.
Finally, after the signal handler is set up, we call this `prctl` to turn on
interception. Since we're excluding the memory range of our plugin and all
syscalls outside of that range should have been replaced by a JMP instruction,
this should catch only syscalls that we weren't able to patch.
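
To make the roles of `start`, `length`, and `selector` concrete, here is a small model of the decision the kernel makes for each syscall once dispatch is on. It is a simplified sketch of the behavior described in the kernel documentation, not kernel code: `Outcome` and `dispatch_outcome` are hypothetical names, and the special case for the vDSO sigreturn trampoline is ignored.

```rust
/// Selector values from the kernel UAPI headers.
pub const SYSCALL_DISPATCH_FILTER_ALLOW: u8 = 0;
pub const SYSCALL_DISPATCH_FILTER_BLOCK: u8 = 1;

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// The syscall runs normally.
    Executed,
    /// The syscall is not run; `SIGSYS` is delivered to our handler.
    Dispatched,
    /// The selector holds an invalid value; the process is killed.
    Killed,
}

/// Models what happens to a syscall instruction at address `ip`.
pub fn dispatch_outcome(ip: u64, start: u64, length: u64, selector: u8) -> Outcome {
    // Syscalls issued from [start, start + length) are never intercepted.
    // For us, this range is the plugin itself.
    if ip.wrapping_sub(start) < length {
        return Outcome::Executed;
    }
    // Everywhere else, the byte behind the selector pointer decides.
    match selector {
        SYSCALL_DISPATCH_FILTER_ALLOW => Outcome::Executed,
        SYSCALL_DISPATCH_FILTER_BLOCK => Outcome::Dispatched,
        _ => Outcome::Killed,
    }
}
```

In this model the plugin keeps the selector at `SYSCALL_DISPATCH_FILTER_BLOCK` while the tracee's own code runs, so only unpatched syscalls outside the plugin reach the `SIGSYS` handler.
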
**Notes**:
* We can use the magic linker variables `__ehdr_start` and `_end` within the
plugin's DSO to figure out the section of memory to exclude from interception.
* This is much better than using seccomp to trap syscalls because it doesn't
rely on the plugin to be loaded to a fixed address in all child processes.
## Thread-local Storage (TLS)
Thread-local storage is handled entirely by libc. Whenever a new thread is
created, it allocates some new memory to use for thread local storage. On
x86-64, the `%fs` register is dedicated to holding the base address of this
special region of memory. When a thread needs to access local storage, it will
use an address relative to the one in the `%fs` register.
Since our plugin is a static DSO, we can't have threads and the compiler won't
generate any `%fs`-relative addresses. However, we still want our plugin to be
able to store state on a per-thread basis. Since `%fs` shall be unique for each
thread, we can use it as a key into a global hash table where our thread-local
state is stored.
**Notes**:
* The `%fs` register is not set right away. It doesn't get set until
`arch_prctl` is called early on in the execution of the program. We need to
intercept this call and adjust our hash table accordingly.
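
A minimal sketch of such a table is shown below. It uses `std`'s `HashMap` and takes `&mut self` purely for illustration; the real plugin would need a `no_std` map and some lock-free or otherwise synchronized access. `ThreadTable`, `state`, and `on_arch_prctl` are hypothetical names.

```rust
use std::collections::HashMap;

/// `arch_prctl` code that sets the `%fs` base on x86-64.
pub const ARCH_SET_FS: u64 = 0x1002;

/// Per-thread plugin state, keyed by the thread's `%fs` base address.
pub struct ThreadTable<T> {
    by_fs_base: HashMap<u64, T>,
}

impl<T: Default> ThreadTable<T> {
    pub fn new() -> Self {
        ThreadTable { by_fs_base: HashMap::new() }
    }

    /// State of the thread whose `%fs` base is `fs_base`, created on first use.
    pub fn state(&mut self, fs_base: u64) -> &mut T {
        self.by_fs_base.entry(fs_base).or_default()
    }

    /// Call when a thread is about to run `arch_prctl(code, addr)`. If the
    /// call changes `%fs`, the thread's state moves to the new key so that
    /// later lookups still find it.
    pub fn on_arch_prctl(&mut self, old_fs_base: u64, code: u64, addr: u64) {
        if code == ARCH_SET_FS {
            if let Some(state) = self.by_fs_base.remove(&old_fs_base) {
                self.by_fs_base.insert(addr, state);
            }
        }
    }
}
```

State recorded before `%fs` is set lives under key 0 in this sketch, which only works while the process is still single-threaded.
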
**See Also**:
* The ultimate guide on TLS: [https://www.akkadia.org/drepper/tls.pdf](https://www.akkadia.org/drepper/tls.pdf)
## The Stack
Similar to TLS, each thread has its own stack which gets allocated whenever a
new thread is spawned. It is up to the application to choose the size of the
stack space allocated for a thread. By default, it is only 2MB. Some threads may
even have very small stack sizes. For example, the thread glibc spawns to manage
timers only has a 16KB stack. Therefore, we cannot rely on the stack of the
thread to be large enough for the plugin's needs. Instead, we need to create our
own stack and switch to it for only the duration of the plugin's callback
function.
Like with TLS, we can use another global hashmap that translates our `%fs`
register into a `Box<[u8]>` where our stack is located. (Don't forget that
the stack pointer decreases to grow the stack, so it should initially point to
the last 8 bytes of the allocated memory.)
This can be implemented by using assembly code to change `%rsp` to point to the
top of our stack. We need to store the old value of the stack pointer on our
new stack so that it can be restored when the callback returns.
This can be extremely tricky to get right as we need to be careful to not
clobber any other registers. All registers should be exactly the same as they
were before. On x86-64, the only two registers that are safe to clobber are
`%rcx` and `%r11`.
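
The address arithmetic for the private stack (not the register switch itself, which needs assembly) can be sketched as follows. `prepare_stack` is a hypothetical helper. As an assumption beyond the text above, it aligns the initial stack pointer down to 16 bytes, which the System V x86-64 ABI expects at call sites, instead of using the very last 8 bytes of the buffer.

```rust
/// Picks the initial stack pointer for a private stack and saves the
/// tracee's old `%rsp` value there. Returns the address to load into
/// `%rsp`, or `None` if the buffer is too small.
pub fn prepare_stack(stack: &mut [u8], old_rsp: u64) -> Option<usize> {
    let base = stack.as_ptr() as usize;
    // The stack grows downwards, so start from the highest 16-byte
    // aligned address inside the buffer.
    let top = base.checked_add(stack.len())? & !0xf;
    // Reserve one aligned 16-byte slot; its first 8 bytes hold the saved
    // stack pointer so the assembly stub can restore it on the way out.
    let sp = top.checked_sub(16)?;
    if sp < base {
        return None;
    }
    let at = sp - base;
    stack[at..at + 8].copy_from_slice(&old_rsp.to_le_bytes());
    Some(sp)
}
```

On return from the callback, the stub would load the saved value from the slot at the returned address back into `%rsp`.
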
## Enforcing no_std in the plugin
Since everything relies on the plugin not having any dependencies and no usage
of libc, we need to be extra careful to ensure this.
The biggest requirement is that we only depend on no_std crates. As soon as we
depend on a crate that uses libc, then the build system will think that our DSO
should link with it too. This can be especially painful when a proc macro crate
uses std at build-time, but not at runtime.
At the very least, our loader should check that our DSO has no interpreter
(i.e., no usage of `ld.so`).
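
The interpreter check amounts to looking for a `PT_INTERP` program header. A minimal sketch, assuming a little-endian ELF64 image and a 64-bit tracer (`has_interpreter` and `le` are hypothetical names):

```rust
const PT_INTERP: u32 = 3;

/// Reads a little-endian field of `len` (at most 8) bytes at offset `at`.
fn le(bytes: &[u8], at: usize, len: usize) -> Option<u64> {
    let field = bytes.get(at..at.checked_add(len)?)?;
    Some(field.iter().rev().fold(0u64, |acc, &b| (acc << 8) | u64::from(b)))
}

/// Returns `Some(true)` if the image asks for a program interpreter such as
/// `ld.so`, `Some(false)` if it does not, and `None` if it is not a
/// well-formed little-endian ELF64 image.
pub fn has_interpreter(elf: &[u8]) -> Option<bool> {
    if !elf.starts_with(b"\x7fELF") || le(elf, 4, 1)? != 2 || le(elf, 5, 1)? != 1 {
        return None;
    }
    let phoff = le(elf, 0x20, 8)? as usize; // e_phoff
    let phentsize = le(elf, 0x36, 2)? as usize; // e_phentsize
    let phnum = le(elf, 0x38, 2)? as usize; // e_phnum
    for i in 0..phnum {
        let ph = phoff.checked_add(i.checked_mul(phentsize)?)?;
        if le(elf, ph, 4)? == u64::from(PT_INTERP) {
            return Some(true);
        }
    }
    Some(false)
}
```

The loader would refuse to inject any plugin for which this returns anything other than `Some(false)`.
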
## The Allocator
For `no_std` to be useful, we need to have a global allocator. Without this, we
cannot use `Vec`, `Box`, `HashMap`, etc. It would be incredibly restrictive.
We can't just use any off-the-shelf allocator, however. We need an allocator
that satisfies the following restrictions:
* No thread-local storage is used.
* Does not use glibc in any way. Many allocators use libc's `malloc` to allocate
the underlying memory.
* No mutexes, only atomics.
There exist crates that implement a `no_std` allocator, but they usually rely on
a pre-allocated pool of memory to operate on. There is no way to dynamically
mmap in new memory pages for use. This is fine, however, because we can leverage
the fact that 2TB of virtual memory does not necessarily map to 2TB of physical
memory if the full range of the 2TB has not been touched. All we need to do is
allocate a large pool of memory up front and use it for the lifetime of the
tracee.
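
As an illustration of "no mutexes, only atomics" over a pre-allocated pool, here is a minimal bump allocator sketch. It is not a recommendation of a specific design: `BumpPool` is a hypothetical name, it never frees memory, and it hands out raw addresses instead of implementing `GlobalAlloc`. The same atomics are available from `core::sync::atomic` in a `no_std` crate.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hands out addresses from one pre-reserved region of memory.
pub struct BumpPool {
    end: usize,
    next: AtomicUsize,
}

impl BumpPool {
    /// `start` and `len` describe a region reserved up front (for example
    /// a large anonymous mapping whose pages are only backed once touched).
    pub const fn new(start: usize, len: usize) -> Self {
        BumpPool { end: start + len, next: AtomicUsize::new(start) }
    }

    /// Returns the address of `size` bytes aligned to `align` (which must
    /// be a power of two), or `None` once the pool is exhausted.
    pub fn alloc(&self, size: usize, align: usize) -> Option<usize> {
        let mut cur = self.next.load(Ordering::Relaxed);
        loop {
            let aligned = cur.checked_add(align - 1)? & !(align - 1);
            let new_next = aligned.checked_add(size)?;
            if new_next > self.end {
                return None;
            }
            // Claim [aligned, new_next) unless another thread got there
            // first, in which case retry from the value it left behind.
            match self.next.compare_exchange_weak(
                cur,
                new_next,
                Ordering::Relaxed,
                Ordering::Relaxed,
            ) {
                Ok(_) => return Some(aligned),
                Err(seen) => cur = seen,
            }
        }
    }
}
```
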