mirror of
https://github.com/facebookexperimental/reverie.git
synced 2025-01-22 21:04:53 +00:00
# Reverie v2: in-guest no_std design doc

This is a design document on how to implement an in-guest syscall tracer that
has extremely low overhead, is reliable, and works on all types of binaries.

# Background

The fastest way to intercept system calls is by replacing the syscall with a
CALL/JMP instruction that calls a function that we inject into the guest
process. This might sound rather easy to do, but there are many challenges with
this approach. We’ve had two previous attempts at implementing an in-guest
Reverie backend. They are described below along with their shortcomings.

## Reverie Alpha

Before we wrote `reverie-ptrace`, the very first version of Reverie did binary
rewriting to intercept syscalls.

Shortcomings:

* Missed early syscalls because of `LD_PRELOAD` usage.
* Not all syscalls could be patched because some of them sit at the end of a
  basic block.
* Conflicts between the plugin’s glibc and the tracee’s glibc.
* Does not work on static binaries.
* Does not work well with a chroot environment. Because the plugin is compiled
  as a DSO, it may not be accessible from inside of the chroot.

We also tried to have the plugin use musl libc instead of glibc, but we cannot
mix two different libc implementations because they both have their own ideas of
how to manage thread local storage. It is _very_ difficult to reconcile this
problem.

## Reverie Sabre

Reverie Sabre uses [SaBRe](https://github.com/srg-imperial/SaBRe) to do the
binary rewriting. Since SaBRe is written in C, we wrote a Rust interface for it.
SaBRe works by hijacking the dynamic loader and loading our plugin before
any other DSOs.

Shortcomings:

* While we can technically intercept early syscalls, we can’t do it easily in
  practice. To do anything with the early syscalls, we can’t use anything from
  glibc. That means no allocations, no thread local storage, etc.
* Does not work on static binaries.
* Does not patch syscalls that are JITed.
* Does not work well with a chroot environment.
* Clobbers the result of `readlink /proc/self/exe` because SaBRe replaces its
  own code with the tracee’s.
* We can get glibc mismatch errors. The tracee’s binary and the plugin’s DSO
  may have been compiled with different versions of glibc. This can lead to
  load-time errors.
* The plugin can end up blowing through the stack because it reuses the same
  stack that the tracee uses.

# The no_std implementation

Because many of the above problems are caused by fighting with glibc, we can
leverage Rust’s `no_std` mode to avoid it completely.

## Plugin Loading

Let’s assume we have a statically linked DSO for a plugin with a single exported
function that is to be called instead of the syscall. How do we inject this into
the address space of the tracee process?

The absolute best time to load this is immediately after a successful call to
`execve` and the only way to do this reliably is with ptrace.

The way it shall work is as follows:

1. From the tracer process, we **spawn the child process and attach to it with
   ptrace**. We then wait for a `PTRACE_EVENT_EXEC` to have the tracee stopped
   immediately after a successful call to `execve`. Note that we are _not_
   interested in intercepting all ptrace events, because that would introduce a
   large slowdown. By only intercepting exec events, the overhead of ptrace
   should be quite minimal.
2. Now, to actually load our plugin’s DSO into the tracee’s address space, we
   need to inject `mmap` calls into the tracee. Instead of mapping it in
   from disk (via `open` + `mmap`), the DSO should be copied over into a
   `MAP_ANONYMOUS` mapping using `process_vm_writev`. Effectively, the tracer
   acts as the loader for the DSO and loads each of its `PT_LOAD`
   segments. By copying the contents of the DSO over rather than using `open` +
   `mmap`, we bypass the chroot problem.

   We need to do some ELF parsing to figure out where the `PT_LOAD` segments
   are, but this only needs to be done once inside the tracer and it should be
   very fast to do so.

   Ideally, we should map the DSO at fixed addresses to ensure determinism and
   make it easier to identify the plugin while debugging, but this is not
   required. A fixed address does become mandatory, however, if we don’t
   perform relocations.

3. Perform [relocations](https://en.wikipedia.org/wiki/Relocation_(computing)).
   Just like the dynamic linker, we need to do a pass to fix up addresses so
   that they point to where the DSO was actually loaded. While relocations are
   only necessary when compiling position-independent code (e.g., with
   `-fPIC`), this is usually the default and it is best not to fight the build
   system.
4. We need to keep track of the address of our callback function because we
   will need it during the patching phase.

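The ELF parsing mentioned in step 2 can be sketched as follows. This is a minimal illustration, assuming a little-endian ELF64 image held in a byte slice (for example, the bytes baked in with `include_bytes!()`). The names `LoadSegment`, `read_le`, and `load_segments` are hypothetical and not part of Reverie. The sketch only lists the `PT_LOAD` program headers; mapping and copying them with `mmap` and `process_vm_writev` is left out.

```rust
/// A loadable segment described by an ELF64 program header.
/// (Hypothetical type for illustration; not part of Reverie.)
#[derive(Debug, PartialEq)]
pub struct LoadSegment {
    pub offset: u64, // start of the segment's bytes within the file
    pub vaddr: u64,  // requested virtual address, before any load bias
    pub filesz: u64, // number of bytes to copy from the file
    pub memsz: u64,  // number of bytes to map; the tail past filesz is zero-filled
    pub flags: u32,  // PF_X = 1, PF_W = 2, PF_R = 4
}

const PT_LOAD: u64 = 1;

/// Reads a little-endian integer of `len` bytes (at most 8) at `off`.
fn read_le(bytes: &[u8], off: usize, len: usize) -> Option<u64> {
    let field = bytes.get(off..off.checked_add(len)?)?;
    Some(field.iter().rev().fold(0, |acc, &b| (acc << 8) | u64::from(b)))
}

/// Returns every PT_LOAD segment, or `None` if the image is malformed.
pub fn load_segments(elf: &[u8]) -> Option<Vec<LoadSegment>> {
    // Require the ELF magic, ELFCLASS64 (2), and ELFDATA2LSB (1).
    if elf.get(..6)? != b"\x7fELF\x02\x01" {
        return None;
    }
    let phoff = read_le(elf, 0x20, 8)? as usize; // e_phoff
    let phentsize = read_le(elf, 0x36, 2)? as usize; // e_phentsize
    let phnum = read_le(elf, 0x38, 2)? as usize; // e_phnum

    let mut segments = Vec::new();
    for i in 0..phnum {
        let ph = phoff.checked_add(i.checked_mul(phentsize)?)?;
        if read_le(elf, ph, 4)? != PT_LOAD {
            continue;
        }
        segments.push(LoadSegment {
            flags: read_le(elf, ph + 4, 4)? as u32,
            offset: read_le(elf, ph + 8, 8)?,
            vaddr: read_le(elf, ph + 16, 8)?,
            filesz: read_le(elf, ph + 32, 8)?,
            memsz: read_le(elf, ph + 40, 8)?,
        });
    }
    Some(segments)
}
```

Because the result depends only on the plugin image, the tracer could compute it once and reuse it for every tracee it loads the plugin into.
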
**Notes:**

* Injecting syscalls is done in the same way that `reverie-ptrace` currently
  does it:
  * Replace the instructions at the current instruction pointer with a
    `syscall` instruction.
  * Use ptrace to set the registers and step through the syscall instruction.
  * Finally, restore the original instructions that we overwrote with the
    syscall instruction, along with the original value of the instruction
    pointer.
* All of the symbol and debug information of the plugin should still be
  available while debugging. However, we should confirm that the debugger will
  even try to find this in pages not backed by a file on disk.
* We can use `include_bytes!()` to bake the plugin into the tracer executable.

**Rationale**:

* The main downside of using `ptrace` here is that it makes it harder to debug
  the tracee with gdb. There can only be one ptracer at a time. There are a
  couple of ways we can work around this issue: either (1) implement a gdb
  server in the tracer to allow debugging, or (2) make it possible to detach
  from the tracee when debugging is desired. The first option is certainly more
  robust.
* We can’t just detach from the tracee immediately after `execve` because we
  need to do the same thing for every child process.
* Using ptrace may also have other advantages, like observing thread and process
  exits more reliably than we can from within the tracee itself.
* The overhead of ptrace should be quite minimal since we are only interested
  in intercepting `PTRACE_EVENT_EXEC` events. Thus, the tracee should only be
  in a stopped state once per successful `execve`.

**References**:

* Use the `safeptrace` crate to avoid shooting yourself in the foot with ptrace.

## Patching

With the plugin DSO loaded into the address space of the tracee, we need to find
all of the syscall instructions and replace them with a call to our callback
function. This is easier said than done because there are tricky edge cases to
handle here.

### How to patch

Patching generally involves the following:

1. **Find** syscall instructions in the `.text` section by simply searching for
   their byte pattern. (On x86-64, a syscall instruction is the 2-byte sequence
   `0x0f 0x05`.) Note that even if we find this byte pattern, there’s no
   guarantee that it is actually a syscall instruction. We need to disassemble
   further to find out for sure.
2. **Replace** each syscall with a JMP instruction to our trampoline. Since a
   syscall is 2 bytes and our JMP instruction is 5 bytes, the JMP also
   overwrites at least part of the neighboring instructions. The trampoline
   therefore contains any instructions that were overwritten plus a call to our
   callback function. Moving code around like this isn’t automatically valid
   because any relative addressing needs to be fixed up inside the trampoline.
   Thus, we need to disassemble the instructions that we have overwritten to
   figure out if they need to be fixed up or not.

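The two steps above can be sketched in a few lines. This is only an illustration under stated assumptions: `syscall_candidates` and `encode_jmp_rel32` are hypothetical helper names, the scan returns candidates that still have to be confirmed by a disassembler, and relocating the overwritten instructions into the trampoline is not shown.

```rust
use std::convert::TryFrom;

/// Offsets of every `0f 05` byte pair in `text`.
/// These are only candidates: the bytes may be part of another instruction
/// or of embedded data, so each hit still needs to be confirmed.
pub fn syscall_candidates(text: &[u8]) -> Vec<usize> {
    text.windows(2)
        .enumerate()
        .filter(|(_, pair)| pair[0] == 0x0f && pair[1] == 0x05)
        .map(|(offset, _)| offset)
        .collect()
}

/// Encodes a 5-byte `JMP rel32` (opcode 0xE9) located at address `from`
/// that lands on address `to`. Returns `None` if the trampoline is more
/// than +/-2 GiB away and therefore not reachable with a 32-bit displacement.
pub fn encode_jmp_rel32(from: u64, to: u64) -> Option<[u8; 5]> {
    // The displacement is relative to the end of the 5-byte instruction.
    let disp = i128::from(to) - (i128::from(from) + 5);
    let disp = i32::try_from(disp).ok()?;
    let mut insn = [0xE9, 0, 0, 0, 0];
    insn[1..].copy_from_slice(&disp.to_le_bytes());
    Some(insn)
}
```

The `None` case matters in practice: a `rel32` jump can only reach targets within 2 GiB, so trampolines would have to be allocated close to the code they serve.
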
### Where to patch

There are two places where patching can be done, each with their own advantages
and disadvantages.

#### From the tracer via ptrace

* Advantages:
  * We can easily use parallelism to do the searching and patching.
  * We can guarantee that a cache of the patched instructions will be
    available because the tracer is guaranteed to be outside of any chroot
    jail.
* Disadvantages:
  * Doing a fast search could be difficult because the data lives in the
    address space of the tracee process. We might be able to mmap the memory
    into the tracer process, but this isn’t yet clear.
  * It is more tricky to allocate memory in the tracee for constructing
    trampolines.

#### From the tracee itself

* Advantages:
  * To do patching, we can just have the ptracer jump to a function in our DSO
    that does all of the patching.
* Disadvantages:
  * Since the DSO must use `no_std`, finding a disassembler that works with
    `no_std` could be tricky. (For x86-64, iced-x86 works with `no_std`.)
  * Parallelism will be very difficult to implement with `no_std`.

### When to patch

Patching ultimately needs to be done on every `.text` section. The
executable itself has a `.text` section and so do all of the loaded DSOs. Thus,
patching needs to happen in the following scenarios:

* Immediately after `execve` (to patch the executable). Here, we’re guaranteed
  that no other threads are running at the same time.
* Immediately after a `PT_LOAD` segment is mapped into memory via `mmap`. There
  might be other threads at this point, but we can be reasonably certain that no
  other threads are accessing this just-mmaped segment.

### Potential Optimizations

* Cache the locations of the patched instructions for next time. We can store a
  mapping of BuildID -> PatchLocations. As long as the binary’s BuildID is
  different upon subsequent rebuilds, this should work just fine. Note that
  shared libraries loaded at runtime have their own BuildID and can be cached
  separately. For shared libraries that are common to many binaries, this could
  lead to a big performance win.
* We can use parallelism to simultaneously search through multiple `PT_LOAD`
  segments at a time.

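The BuildID -> PatchLocations mapping could look roughly like this. It is a minimal in-memory sketch; `PatchCache` and `get_or_scan` are hypothetical names, and a real cache would presumably also be persisted to disk so that it survives across runs.

```rust
use std::collections::HashMap;

/// Maps a BuildID to the offsets of the syscall instructions that were
/// patched in that binary. (Hypothetical type for illustration.)
#[derive(Default)]
pub struct PatchCache {
    sites: HashMap<Vec<u8>, Vec<u64>>,
}

impl PatchCache {
    /// Returns the cached patch offsets for `build_id`. On a miss, runs the
    /// (expensive) `scan` closure once and remembers its result.
    pub fn get_or_scan(
        &mut self,
        build_id: &[u8],
        scan: impl FnOnce() -> Vec<u64>,
    ) -> &[u64] {
        let offsets: &Vec<u64> = self
            .sites
            .entry(build_id.to_vec())
            .or_insert_with(scan);
        offsets
    }
}
```

Keying on the BuildID rather than on the file path means a shared library is scanned once no matter how many executables load it.
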
### References:

* [SaBRe’s rewriter implementation](https://github.com/srg-imperial/SaBRe/blob/05816ee066a7284bee8afd0e73eeb44455b254b4/arch/x86_64/rewriter.c)

## Catching unpatched syscalls

Patching does not work in 100% of cases and there may be JITed code that
is not patched, so we should have a fallback in these rare circumstances.

In Linux kernel v5.11 and newer, there is a wonderful new way to intercept
syscalls from within the process itself. It is called
[PR_SET_SYSCALL_USER_DISPATCH](https://lwn.net/Articles/826313/).

It works like this:

```
prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON, start, length, selector);
```

* `start` is the starting address where syscalls _should not_ be intercepted.
* `length` is the length of the memory region from `start` where syscalls
  shouldn’t be intercepted.
* `selector` is a pointer to a `u8` that controls whether or not to enable the
  filter. The kernel looks at this memory address whenever a syscall
  instruction is trapped to determine if it should raise a `SIGSYS`. Thus, it
  allows super-fast toggling of the filter.

When the plugin is loaded, we can use `rt_sigaction` to set a signal handler for
`SIGSYS`. When we receive this signal, we should look at `siginfo_t.si_syscall`
to see which syscall was attempted.

The syscall arguments can be retrieved via the `ucontext` parameter to the
signal handler. It has a `uc_mcontext` member that holds all of the registers.

From this signal handler, we can run our syscall callback to handle the syscall.

Finally, after the signal handler is set up, we call this `prctl` to turn on
interception. Since we’re excluding the memory range of our plugin and all
syscalls outside of that range should have been replaced by a JMP instruction,
this should catch only syscalls that we weren’t able to patch.

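The fast toggling that `selector` allows can be illustrated with a small guard type. This is a sketch under assumptions: `AllowSyscalls` is a hypothetical helper, and the `AtomicU8` stands in for the byte whose address was passed to `prctl` as `selector`. The two constant values are the ones defined in `<linux/prctl.h>`. The `prctl` call itself and the `SIGSYS` handler are not shown.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

// Selector values understood by the kernel (from <linux/prctl.h>).
pub const SYSCALL_DISPATCH_FILTER_ALLOW: u8 = 0;
pub const SYSCALL_DISPATCH_FILTER_BLOCK: u8 = 1;

/// While this guard is alive, the selector byte is set to ALLOW, so syscalls
/// made outside the excluded range do not raise SIGSYS. Dropping the guard
/// restores whatever value was there before.
/// (Hypothetical helper; `selector` is assumed to be the byte registered
/// with PR_SET_SYSCALL_USER_DISPATCH.)
pub struct AllowSyscalls<'a> {
    selector: &'a AtomicU8,
    previous: u8,
}

impl<'a> AllowSyscalls<'a> {
    pub fn enter(selector: &'a AtomicU8) -> Self {
        let previous = selector.swap(SYSCALL_DISPATCH_FILTER_ALLOW, Ordering::SeqCst);
        Self { selector, previous }
    }
}

impl Drop for AllowSyscalls<'_> {
    fn drop(&mut self) {
        self.selector.store(self.previous, Ordering::SeqCst);
    }
}
```

Toggling is a single store to user memory, with no system call involved, which is what makes it cheap enough to do on every entry into and exit from plugin code.
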
**Notes**:

* We can use the magic linker variables `__ehdr_start` and `_end` within the
  plugin’s DSO to figure out the section of memory to exclude from interception.
* This is much better than using seccomp to trap syscalls because it doesn’t
  rely on the plugin to be loaded to a fixed address in all child processes.

## Thread-local Storage (TLS)

Thread-local storage is handled entirely by libc. Whenever a new thread is
created, it allocates some new memory to use for thread local storage. The `%fs`
register on x86-64 is dedicated to holding the base address of this
special region of memory. When a thread needs to access local storage, it will
use an address relative to the one in the `%fs` register.

Since our plugin is a static DSO, we can’t have threads and the compiler won’t
generate any `%fs`-relative addresses. However, we still want our plugin to be
able to store state on a per-thread basis. Since `%fs` shall be unique for each
thread, we can use it as a key into a global hash table where our thread-local
state is stored.

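One possible shape for such a table is a fixed-capacity, open-addressing map built only from atomics. The sketch below is an assumption-laden illustration, not the planned implementation: `FsTable`, `SLOTS`, and `slot` are hypothetical names, the per-thread state is reduced to a single `AtomicU64`, and reading the actual `%fs` base is not shown. It uses `std` for brevity, but only items that also exist in `core`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const SLOTS: usize = 64; // hypothetical capacity: max threads tracked

/// Lock-free table keyed by the thread's %fs base address.
pub struct FsTable {
    keys: [AtomicU64; SLOTS],   // 0 = empty, otherwise fs_base + 1
    values: [AtomicU64; SLOTS], // stand-in for real per-thread state
}

impl FsTable {
    pub const fn new() -> Self {
        const ZERO: AtomicU64 = AtomicU64::new(0);
        Self { keys: [ZERO; SLOTS], values: [ZERO; SLOTS] }
    }

    /// Finds or claims the slot for `fs_base`. Returns `None` when the table
    /// is full. An `fs_base` of 0 (before `arch_prctl` has run) is a valid key.
    pub fn slot(&self, fs_base: u64) -> Option<&AtomicU64> {
        let key = fs_base.checked_add(1)?; // shift so that 0 can mean "empty"
        let start = ((fs_base >> 6) as usize) % SLOTS;
        for probe in 0..SLOTS {
            let idx = (start + probe) % SLOTS;
            match self.keys[idx].compare_exchange(0, key, Ordering::AcqRel, Ordering::Acquire) {
                Ok(_) => return Some(&self.values[idx]), // claimed an empty slot
                Err(existing) if existing == key => return Some(&self.values[idx]),
                Err(_) => continue, // slot belongs to another thread; keep probing
            }
        }
        None
    }
}
```

Because `new` is a `const fn`, the table can live in a `static` and needs neither an allocator nor a mutex.
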
**Notes**:

* The `%fs` register is not set right away. It doesn’t get set until
  `arch_prctl` is called early on in the execution of the program. We need to
  intercept this call and adjust our hash table accordingly.

**See Also**:

* The ultimate guide on TLS: [https://www.akkadia.org/drepper/tls.pdf](https://www.akkadia.org/drepper/tls.pdf)

## The Stack

Similar to TLS, each thread has its own stack which gets allocated whenever a
new thread is spawned. It is up to the application to choose the size of the
stack space allocated for a thread. By default, it is only a few megabytes
(with glibc on x86-64, the `RLIMIT_STACK` soft limit, or 2MB if that limit is
unlimited). Some threads may even have very small stack sizes. For example, the
thread glibc spawns to manage timers only has a 16KB stack. Therefore, we
cannot rely on the stack of the thread to be large enough for the plugin’s
needs. Instead, we need to create our own stack and switch to it for only the
duration of the plugin’s callback function.

Like with TLS, we can use another global hashmap that translates our `%fs`
register into a `Box<[u8]>` where our stack is located. (Don’t forget that
the stack pointer decreases to grow the stack, so it should initially point to
the last 8 bytes of the allocated memory.)

This can be implemented by using assembly code to change `%rsp` to point to the
top of our stack. We need to save the previous stack pointer on our new stack
so that we can switch back when the callback returns.

This can be extremely tricky to get right as we need to be careful to not
clobber any other registers. All registers should be exactly the same as they
were before. On x86-64, the only two registers that are safe to clobber are
`%rcx` and `%r11`, because the `syscall` instruction itself overwrites them.

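Computing where the stack pointer should start can be sketched as follows. This is an illustration only: `PluginStack` is a hypothetical type, the allocation uses `std` for brevity, and the assembly that actually swaps `%rsp` is not shown. The sketch assumes the callback is entered with a `call`, so it returns the 16-byte-aligned top of the allocation; a design that wants `%rsp` to point at the last 8 bytes instead would subtract 8 from that value.

```rust
/// A separately allocated stack for the plugin's callback.
/// (Hypothetical type for illustration.)
pub struct PluginStack {
    mem: Box<[u8]>,
}

impl PluginStack {
    /// Allocates `size` bytes; `size` is assumed to be at least 16.
    pub fn new(size: usize) -> Self {
        Self { mem: vec![0u8; size].into_boxed_slice() }
    }

    /// Lowest and one-past-highest addresses of the allocation.
    pub fn range(&self) -> (usize, usize) {
        let low = self.mem.as_ptr() as usize;
        (low, low + self.mem.len())
    }

    /// Initial stack pointer. The stack grows toward lower addresses, so we
    /// start at the high end, rounded down to a 16-byte boundary.
    pub fn initial_sp(&self) -> usize {
        let (_, high) = self.range();
        high & !0xf
    }
}
```
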
## Enforcing no_std in the plugin

Since everything relies on the plugin not having any dependencies and no usage
of libc, we need to be extra careful to ensure this.

The biggest requirement is that we only depend on `no_std` crates. As soon as we
depend on a crate that uses libc, then the build system will think that our DSO
should link with it too. This can be especially painful when a proc macro crate
uses std at build-time, but not at runtime.

At the very least, our loader should check that our DSO has no interpreter
(i.e., no usage of `ld.so`).

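The interpreter check amounts to looking for a `PT_INTERP` program header. A minimal sketch, assuming a little-endian ELF64 image in memory; `has_interpreter` and `le_field` are hypothetical names:

```rust
const PT_INTERP: u64 = 3;

/// Reads a little-endian integer of `len` bytes (at most 8) at `off`.
fn le_field(bytes: &[u8], off: usize, len: usize) -> Option<u64> {
    let field = bytes.get(off..off.checked_add(len)?)?;
    Some(field.iter().rev().fold(0, |acc, &b| (acc << 8) | u64::from(b)))
}

/// Returns `Some(true)` if the image requests a dynamic loader such as
/// `ld.so`, `Some(false)` if it does not, and `None` if it is not a
/// well-formed little-endian ELF64 image.
pub fn has_interpreter(elf: &[u8]) -> Option<bool> {
    // Require the ELF magic, ELFCLASS64 (2), and ELFDATA2LSB (1).
    if elf.get(..6)? != b"\x7fELF\x02\x01" {
        return None;
    }
    let phoff = le_field(elf, 0x20, 8)? as usize; // e_phoff
    let phentsize = le_field(elf, 0x36, 2)? as usize; // e_phentsize
    let phnum = le_field(elf, 0x38, 2)? as usize; // e_phnum
    for i in 0..phnum {
        let ph = phoff.checked_add(i.checked_mul(phentsize)?)?;
        if le_field(elf, ph, 4)? == PT_INTERP {
            return Some(true);
        }
    }
    Some(false)
}
```

A loader could refuse to inject any plugin for which this returns anything other than `Some(false)`.
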
## The Allocator

For `no_std` to be useful, we need to have a global allocator. Without this, we
cannot use `Vec`, `Box`, `HashMap`, etc. It would be incredibly restrictive.

We can’t just use any off-the-shelf allocator, however. We need an allocator
that meets the following restrictions:

* No thread-local storage is used.
* Does not use glibc in any way. Many allocators use libc’s `malloc` to allocate
  the underlying memory.
* No mutexes, only atomics.

There exist crates that implement a `no_std` allocator, but they usually rely on
a pre-allocated pool of memory to operate on. They have no way to dynamically
mmap in new memory pages for use. This is fine, however, because we can leverage
the fact that 2TB of virtual memory does not necessarily map to 2TB of physical
memory if the full range of the 2TB has not been touched. All we need to do is
allocate a large pool of memory up front and use it for the lifetime of the
tracee.

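To make the restrictions concrete, here is a sketch of the simplest allocator that satisfies them: a bump allocator over a fixed pool that uses a single atomic offset. It is an assumption-heavy illustration, not a recommendation. `BumpAlloc` and `POOL_SIZE` are hypothetical names, the pool is a small in-struct array rather than a large `mmap`ed reservation, and memory is never reused after being freed. It uses `std` paths for brevity; the same items exist in `core`.

```rust
use std::alloc::{GlobalAlloc, Layout};
use std::cell::UnsafeCell;
use std::ptr;
use std::sync::atomic::{AtomicUsize, Ordering};

pub const POOL_SIZE: usize = 64 * 1024; // tiny stand-in for a large reservation

pub struct BumpAlloc {
    pool: UnsafeCell<[u8; POOL_SIZE]>,
    next: AtomicUsize, // offset of the first free byte in `pool`
}

// Safety: the pool is only handed out in disjoint pieces reserved via `next`.
unsafe impl Sync for BumpAlloc {}

impl BumpAlloc {
    pub const fn new() -> Self {
        Self { pool: UnsafeCell::new([0; POOL_SIZE]), next: AtomicUsize::new(0) }
    }
}

unsafe impl GlobalAlloc for BumpAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let base = self.pool.get() as *mut u8;
        let mut cur = self.next.load(Ordering::Relaxed);
        loop {
            // Round the absolute address up to the requested alignment.
            let addr = base as usize + cur;
            let aligned = match addr.checked_add(layout.align() - 1) {
                Some(a) => a & !(layout.align() - 1),
                None => return ptr::null_mut(),
            };
            let start = aligned - base as usize;
            let end = match start.checked_add(layout.size()) {
                Some(e) if e <= POOL_SIZE => e,
                _ => return ptr::null_mut(), // pool exhausted
            };
            // Reserve [start, end) with a CAS; retry if another thread won.
            match self.next.compare_exchange_weak(cur, end, Ordering::AcqRel, Ordering::Relaxed) {
                Ok(_) => return base.add(start),
                Err(seen) => cur = seen,
            }
        }
    }

    unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {
        // A bump allocator never reuses memory; freeing is a no-op.
    }
}
```

In a plugin, an instance of such a type would be registered with `#[global_allocator]`; a practical allocator would also need to recycle freed memory, which this sketch deliberately omits.
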