Document interrupts & snapshotting.

BUG=b:277651566
TEST=ran mdbook serve & verified the book looked as expected.

Change-Id: Id6e1723f99dbe428c009a7a03bc651c9b1ee4125
Reviewed-on: https://chromium-review.googlesource.com/c/crosvm/crosvm/+/4598283
Commit-Queue: Noah Gold <nkgold@google.com>
Reviewed-by: Daniel Verkamp <dverkamp@chromium.org>
Reviewed-by: Takaya Saeki <takayas@chromium.org>
Reviewed-by: Keiichi Watanabe <keiichiw@chromium.org>
Noah Gold 2023-06-07 15:21:26 -07:00 committed by crosvm LUCI
parent dc8d88fd80
commit 7eaa6e75c5
7 changed files with 217 additions and 2 deletions

@ -16,5 +16,6 @@ additional-js = ["mermaid.min.js", "mermaid-init.js"]
# Redirect previously used paths to updated locations.
[output.html.redirect]
"building_crosvm/chromiumos.html" = "../integration/chromeos.html"
"architecture.html" = "architecture/overview.html"
[output.linkcheck]

@ -27,7 +27,10 @@
- [Tracing](./tracing.md)
- [Integration](./integration/index.md)
- [ChromeOS](./integration/chromeos.md)
- [Architecture](./architecture.md)
- [Architecture](./architecture/index.md)
- [Overview](./architecture/overview.md)
- [Guest interrupts](./architecture/interrupts.md)
- [Snapshotting (highly experimental)](./architecture/snapshotting.md)
- [Hypervisors](./hypervisors.md)
- [Contribution Guide](./contributing/index.md)
- [Contributing to crosvm](./contributing/contributing.md)

@ -1 +0,0 @@
{{#include ../../../ARCHITECTURE.md}}

@ -0,0 +1,6 @@
# Architecture
This chapter explains the internal architecture of CrosVM for contributors.
- [Overview](./overview.md) - broad overview of CrosVM
- [Interrupts](./interrupts.md) - deep dive into interrupts

@ -0,0 +1,136 @@
# Interrupts (x86_64)
Interrupts are how devices request service from the guest drivers. This page explores the details of
interrupt routing from the perspective of CrosVM.
## Critical acronyms
This subject area uses *a lot* of acronyms:
- IRQ: Interrupt ReQuest
- ISR: Interrupt Service Routine
- EOI: End Of Interrupt
- MSI: Message Signaled Interrupts. In this document, synonymous with MSI-X.
- MSI-X: Message Signaled Interrupts, eXtended
- LAPIC: Local APIC
- APIC: Advanced Programmable Interrupt Controller (successor to the legacy PIC)
- IOAPIC: IO APIC (has physical interrupt lines, which it responds to by triggering an MSI directed
to the LAPIC).
- PIC: Programmable Interrupt Controller (the "legacy PIC" / Intel 8259 chip).
## Interrupts come in two flavors
Interrupts on `x86_64` in CrosVM come in two primary flavors: legacy and MSI-X. In this document,
MSI is used to refer to the concept of message signaled interrupts, but it always refers to
interrupts sent via MSI-X because that is what CrosVM uses.
### Legacy interrupts (INTx)
These interrupts are traditionally delivered via dedicated signal lines to PICs and/or the IOAPIC.
Older devices, especially those that are used during early boot, often rely on these types of
interrupts. These typically are the first 24 GSIs, and are serviced either by the PIC (during very
early boot), or by the IOAPIC (after it is activated & the PIC is switched off).
#### Background on EOI
The purpose of EOI is rooted in how legacy interrupt lines are shared. If two devices `D1` and `D2`
share a line `L`, `D2` has no guarantee that it will be serviced when `L` is asserted. After
receiving EOI, `D2` has to check whether it was serviced, and if it was not, to re-assert `L`. An
example of how this occurs is if `D2` requests service while `D1` is already being serviced. In that
case, the line has to be reasserted otherwise `D2` won't be serviced.
Because interrupt lines to the IOAPIC can be shared by multiple devices, EOI is critical for devices
to figure out whether they were serviced in response to sending the IRQ, or whether the IRQ needs to
be resent. The operating principles mean that sending extra EOIs to a legacy device is perfectly
safe, because they could be due to another device on the same line receiving service, and so devices
must be tolerant of such "extra" (from their perspective) EOIs.
These "extra" EOIs come from the fact that EOI is often a broadcast message that goes to all legacy
devices. Broadcast is required because interrupt lines can be routed through the two 8259 PICs via
cascade before they reach the CPU, so broadcasting to both PICs (and attached devices) is the only
way to ensure EOI reaches the device that was serviced.
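
The sharing problem described above can be modeled in a few lines. This is only an illustrative
sketch: `Device`, `SharedLine`, and their methods are hypothetical names and do not correspond to
any CrosVM type. It shows why a device that was not serviced must reassert the line after EOI, and
why an extra EOI is harmless.

```rust
/// One device attached to a shared, level-triggered line (illustrative model).
struct Device {
    name: &'static str,
    /// True while the device still needs service from its driver.
    needs_service: bool,
}

/// A shared line `L`. It stays asserted while any attached device drives it.
struct SharedLine {
    devices: Vec<Device>,
    asserted: bool,
}

impl SharedLine {
    fn new(names: &[&'static str]) -> Self {
        let devices = names
            .iter()
            .map(|&name| Device { name, needs_service: false })
            .collect();
        SharedLine { devices, asserted: false }
    }

    /// Device `idx` requests service by asserting the line.
    fn request(&mut self, idx: usize) {
        self.devices[idx].needs_service = true;
        self.asserted = true;
    }

    /// The ISR services one device, then EOI is broadcast to every device.
    /// Each device checks its own state; any device that was not serviced
    /// reasserts the line.
    fn service_then_eoi(&mut self, idx: usize) {
        self.devices[idx].needs_service = false;
        self.asserted = self.devices.iter().any(|d| d.needs_service);
    }

    /// Names of the devices that are still waiting for service.
    fn waiting(&self) -> Vec<&'static str> {
        self.devices
            .iter()
            .filter(|d| d.needs_service)
            .map(|d| d.name)
            .collect()
    }
}
```

If `D1` and `D2` both request service and only `D1` is serviced, the line remains asserted after
EOI because `D2` reasserts it. Calling `service_then_eoi` again for a device that is already idle
changes nothing, which mirrors the tolerance for "extra" EOIs.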
#### EOI in CrosVM
When the guest's ISR completes and signals EOI, the CrosVM irqchip implementation is responsible for
propagating EOI to the device backends. EOI is delivered to the devices via their
[resample event](https://crosvm.dev/doc/devices/struct.IrqLevelEvent.html). Devices are then
responsible for listening to that resample event, and checking their internal state to see if they
received service. If the device wasn't serviced, it must then reassert the IRQ.
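
A minimal sketch of the device-side logic follows. It assumes a simplified pair of flags in place
of the real `IrqLevelEvent`; `LevelEventSketch`, `Backend`, and `propagate_eoi` are hypothetical
names, and the real type's API is not reproduced here.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a level-triggered IRQ event pair.
#[derive(Clone, Default)]
struct LevelEventSketch {
    /// Device to irqchip: "I need service".
    trigger: Arc<AtomicBool>,
    /// Irqchip to device: "the guest signaled EOI".
    resample: Arc<AtomicBool>,
}

/// Irqchip side: the guest signaled EOI, so deassert and notify the device.
fn propagate_eoi(irq: &LevelEventSketch) {
    irq.trigger.store(false, Ordering::SeqCst);
    irq.resample.store(true, Ordering::SeqCst);
}

/// Device backend that tracks how much work the driver has not yet handled.
struct Backend {
    irq: LevelEventSketch,
    unserviced: u32,
}

impl Backend {
    /// New work arrived: record it and assert the IRQ.
    fn raise(&mut self) {
        self.unserviced += 1;
        self.irq.trigger.store(true, Ordering::SeqCst);
    }

    /// Called when the resample event fires. Returns true if the device
    /// found unserviced work and reasserted the IRQ.
    fn on_resample(&mut self) -> bool {
        // Consume the notification; do nothing if there was none.
        if !self.irq.resample.swap(false, Ordering::SeqCst) {
            return false;
        }
        if self.unserviced > 0 {
            self.irq.trigger.store(true, Ordering::SeqCst);
            return true;
        }
        false
    }
}
```

A real backend would block on the resample event in its worker loop rather than poll a flag; the
flags are used here only to keep the sketch self-contained.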
### MSIs
MSIs do not use dedicated signal lines; instead, they are "messages" which are sent on the system
bus. The LAPIC(s) receive these messages, and inject the interrupt into the VCPU (where injection
means: jump to ISR).
#### About EOI
EOI is not meaningful for MSIs because lines are *never* shared. No devices using MSI will listen
for the EOI event, and the irqchip will not signal it.
## The fundamental deception on x86_64: there are no legacy interrupts (usually)
After very early boot, the PIC is switched off and legacy interrupts somewhat cease to be legacy.
Instead of being handled by the PIC, legacy interrupts are handled by the IOAPIC, and all the IOAPIC
does is convert them into MSIs; in other words, from the perspective of CrosVM & the guest VCPUs,
after early boot, every interrupt is an MSI.
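
To make the conversion concrete, here is a rough sketch of how an IOAPIC pin could be translated
into an MSI. The types and the `pin_to_msi` function are hypothetical and heavily simplified (fixed
delivery mode, physical destination, edge trigger); only the `0xFEE00000` address base, the
destination ID in address bits 19:12, and the vector in data bits 7:0 come from the x86 MSI format.

```rust
/// One IOAPIC redirection table entry, reduced to the fields used here.
#[derive(Clone, Copy)]
struct RedirectionEntry {
    vector: u8,
    dest_apic_id: u8,
    masked: bool,
}

/// The message that is ultimately delivered to a LAPIC.
#[derive(Debug, PartialEq, Eq)]
struct Msi {
    address: u64,
    data: u32,
}

/// Base of the x86 MSI address window.
const MSI_ADDRESS_BASE: u64 = 0xFEE0_0000;

/// Translate an asserted IOAPIC pin into the MSI it produces, if any.
fn pin_to_msi(table: &[RedirectionEntry], pin: usize) -> Option<Msi> {
    let entry = table.get(pin)?;
    if entry.masked {
        // Masked pins produce no message.
        return None;
    }
    Some(Msi {
        // Destination APIC ID occupies address bits 19:12.
        address: MSI_ADDRESS_BASE | ((entry.dest_apic_id as u64) << 12),
        // With fixed delivery mode and edge trigger, data is just the vector.
        data: entry.vector as u32,
    })
}
```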
## Interrupt handling irqchip specifics
Each `IrqChip` can handle interrupts differently. Often these differences are because the underlying
hypervisors will have different interrupt features such as KVM's irqfds. Generally a hypervisor has
three choices for implementing an irqchip:
- Fully in kernel: all of the irqchip (LAPIC & IOAPIC) are implemented in the kernel portion of the
hypervisor.
- Split: the performance critical part of the irqchip (LAPIC) is implemented in the kernel, but the
IOAPIC is implemented by the VMM.
- Userspace: here, the entire irqchip is implemented in the VMM. This is generally slower and not
commonly used.
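
The three choices differ only in where each component lives. A small sketch makes this explicit;
`Host` and `IrqChipKind` are illustrative names, not the actual CrosVM types.

```rust
/// Where a component of the irqchip is implemented.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Host {
    Kernel,
    Vmm,
}

/// The three irqchip arrangements described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IrqChipKind {
    Kernel,
    Split,
    Userspace,
}

impl IrqChipKind {
    /// The LAPIC is performance critical, so it stays in the kernel unless
    /// the whole chip is in userspace.
    fn lapic(self) -> Host {
        match self {
            IrqChipKind::Kernel | IrqChipKind::Split => Host::Kernel,
            IrqChipKind::Userspace => Host::Vmm,
        }
    }

    /// The IOAPIC moves to the VMM in both the split and userspace cases.
    fn ioapic(self) -> Host {
        match self {
            IrqChipKind::Kernel => Host::Kernel,
            IrqChipKind::Split | IrqChipKind::Userspace => Host::Vmm,
        }
    }
}
```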
Below, we describe the rough flow for interrupts in virtio devices for each of the chip types. We
limit ourselves to virtio devices because these are the performance critical devices in CrosVM.
### Kernel mode IRQ chip (w/ irqfd support)
#### MSIs
1. Device wants service, so it signals an `Event` object.
1. The `Event` object is registered with the hypervisor, so the hypervisor immediately routes the
IRQ to a LAPIC so a VCPU can be interrupted.
1. The LAPIC interrupts the VCPU, which jumps to the kernel's ISR (interrupt service routine).
1. The ISR runs.
#### Legacy interrupts
These are handled similarly to MSIs, except the kernel mode IOAPIC is what initially picks up the
event, rather than the LAPIC.
### Split IRQ chip (w/ irqfd support)
This is the same as the kernel mode case.
### Split IRQ chip (no irqfd kernel support)
#### MSIs
1. Device wants service, so it signals an `Event` object.
1. The `Event` object is attached to the IrqChip in CrosVM. An interrupt handling thread wakes up
from the `Event` signal.
1. The IrqChip resets the `Event`.
1. The IrqChip asserts the interrupt to the LAPIC in the kernel via an ioctl (or equivalent).
1. The LAPIC interrupts the VCPU, which jumps to the kernel's ISR (interrupt service routine).
1. The ISR runs, and on completion sends EOI (end of interrupt). In CrosVM, this is called the
[resample event](https://crosvm.dev/doc/devices/struct.IrqLevelEvent.html).
1. EOI is sent.
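
The userspace part of this flow (the device signals, the handler thread wakes, the interrupt is
asserted to the LAPIC) can be sketched with standard library primitives. Everything here is a
stand-in: a channel plays the role of the `Event`, `FakeLapic` records injections instead of
issuing an ioctl, and the GSI-to-vector routes are made-up values.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;

/// Hypothetical message sent when a device backend signals its `Event`.
#[derive(Debug)]
struct IrqSignal {
    gsi: u32,
}

/// Stand-in for the in-kernel LAPIC: records vectors instead of calling an ioctl.
#[derive(Default)]
struct FakeLapic {
    injected: Vec<u8>,
}

/// Body of the interrupt handling thread.
fn irq_handler_thread(rx: Receiver<IrqSignal>, routes: Vec<(u32, u8)>) -> FakeLapic {
    let mut lapic = FakeLapic::default();
    // Receiving blocks until a device signals (the thread "wakes up"), and
    // consuming the message plays the role of resetting the `Event`.
    for signal in rx {
        // Look up the route for this GSI and assert it on the LAPIC.
        if let Some(&(_, vector)) = routes.iter().find(|(gsi, _)| *gsi == signal.gsi) {
            lapic.injected.push(vector);
        }
    }
    lapic
}

/// Signal the given GSIs from the "device" side and return what was injected.
fn deliver(gsis: &[u32]) -> Vec<u8> {
    let (tx, rx) = channel();
    // Assumed routing table for the sketch: GSI 24 -> 0x41, GSI 25 -> 0x42.
    let routes = vec![(24, 0x41), (25, 0x42)];
    let handler = thread::spawn(move || irq_handler_thread(rx, routes));
    for &gsi in gsis {
        tx.send(IrqSignal { gsi }).unwrap();
    }
    // Dropping the sender closes the channel and ends the handler loop.
    drop(tx);
    handler.join().unwrap().injected
}
```

Compared with the irqfd case, every interrupt takes an extra trip through a VMM thread before it
reaches the kernel, which is the cost this configuration pays.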
#### Legacy interrupts
This introduces an additional `Event` object in the interrupt path, since the IRQ pin itself is an
`Event`, and the MSI is also an `Event`. These interrupts are processed twice by the IRQ handler:
once as a legacy IOAPIC event, and a second time as an MSI.
### Userspace IRQ chip
This chip is not widely used in production. Contributions to fill in this section are welcome.

@ -0,0 +1 @@
{{#include ../../../../ARCHITECTURE.md}}

@ -0,0 +1,69 @@
# Architecture: Snapshotting
Snapshotting is a **highly experimental** `x86_64` only feature currently under development. It is
**not supported**, and it works with only a very limited set of devices. This page roughly summarizes
how the system works, and how device authors should think about it when writing new devices.
## The snapshot & restore sequence
The data required for a snapshot is stored in several places, including guest memory, and the
devices running on the host. To take an accurate snapshot, we need a point in time snapshot. Since
there is no way to fetch this state atomically, we have to freeze the guest (VCPUs) and the device
backends. Similarly, on restore we must freeze in the same way to prevent partially restored state
from being modified.
## Snapshotting a running VM
In code, this is implemented by
[vm_control::do_snapshot](https://crosvm.dev/doc/vm_control/fn.do_snapshot.html). We always freeze
the VCPUs first
([vm_control::VcpuSuspendGuard](https://crosvm.dev/doc/vm_control/struct.VcpuSuspendGuard.html)).
This is done so that we can flush all pending interrupts to the irqchip (LAPIC) without triggering
further activity from the driver (which could in turn trigger more device activity). With the VCPUs
frozen, we freeze devices
([vm_control::DeviceSleepGuard](https://crosvm.dev/doc/vm_control/struct.DeviceSleepGuard.html)).
From here, it's just a matter of serializing VCPU state, guest memory, and device state.
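
The ordering can be illustrated with RAII guards. This is a sketch only: `VcpuGuardSketch` and
`DeviceGuardSketch` are hypothetical stand-ins for `VcpuSuspendGuard` and `DeviceSleepGuard`, the
log strings are invented, and the resume order shown is simply what Rust's drop order produces in
this sketch rather than a statement about `do_snapshot`.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Shared log used to observe the order of operations.
type Log = Rc<RefCell<Vec<&'static str>>>;

/// Freezes the VCPUs on creation and resumes them on drop.
struct VcpuGuardSketch(Log);

impl VcpuGuardSketch {
    fn new(log: &Log) -> Self {
        log.borrow_mut().push("vcpus frozen");
        VcpuGuardSketch(Rc::clone(log))
    }
}

impl Drop for VcpuGuardSketch {
    fn drop(&mut self) {
        self.0.borrow_mut().push("vcpus resumed");
    }
}

/// Puts the devices to sleep on creation and wakes them on drop.
struct DeviceGuardSketch(Log);

impl DeviceGuardSketch {
    fn new(log: &Log) -> Self {
        log.borrow_mut().push("devices asleep");
        DeviceGuardSketch(Rc::clone(log))
    }
}

impl Drop for DeviceGuardSketch {
    fn drop(&mut self) {
        self.0.borrow_mut().push("devices awake");
    }
}

/// Freeze VCPUs first, then devices, then flush and serialize.
fn snapshot_sketch(log: &Log) {
    let _vcpus = VcpuGuardSketch::new(log);
    let _devices = DeviceGuardSketch::new(log);
    log.borrow_mut().push("irqs flushed");
    log.borrow_mut().push("state serialized");
    // Locals drop in reverse declaration order: devices wake, then VCPUs resume.
}
```

Using guards means the VM is unfrozen even if serialization returns early with an error.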
### A word about interrupts
Interrupts come in two primary flavors from the snapshotting perspective: legacy interrupts (e.g.
IOAPIC interrupt lines), and MSIs.
#### Legacy interrupts
These are a little tricky because they are allocated as part of device creation, and device creation
happens **before** we snapshot or restore. To avoid actually having to snapshot or restore the
`Event` object wiring for these interrupts, we rely on the fact that as long as the VM is created
with the right shape (e.g. devices), the interrupt `Event`s will be wired between the device & the
irqchip correctly. As part of restoring, we will set the routing table, which ensures that those
events map to the right GSIs in the hypervisor.
#### MSIs
These are much simpler, because of how MSIs are implemented in CrosVM. In `MsixConfig`, we save the
MSI routing information for every IRQ. At restore time, we just register these MSIs with the
hypervisor using the exact same mechanism that would be invoked on device activation (albeit
bypassing GSI allocation since we know from the saved state exactly which GSI must be used).
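
A rough sketch of the idea follows, assuming a toy hypervisor interface. `MsiRoute`,
`FakeHypervisor`, `activate`, and `restore` are hypothetical; they are not the `MsixConfig` API.

```rust
use std::collections::BTreeMap;

/// Saved routing information for one MSI.
#[derive(Clone, Debug, PartialEq, Eq)]
struct MsiRoute {
    gsi: u32,
    address: u64,
    data: u32,
}

/// Toy hypervisor: hands out GSIs and remembers registered routes.
struct FakeHypervisor {
    next_gsi: u32,
    routes: BTreeMap<u32, (u64, u32)>,
}

impl FakeHypervisor {
    fn new(first_free_gsi: u32) -> Self {
        FakeHypervisor { next_gsi: first_free_gsi, routes: BTreeMap::new() }
    }

    fn allocate_gsi(&mut self) -> u32 {
        let gsi = self.next_gsi;
        self.next_gsi += 1;
        gsi
    }

    fn register(&mut self, route: &MsiRoute) {
        self.routes.insert(route.gsi, (route.address, route.data));
    }
}

/// Device activation: allocate a GSI, then register the route.
fn activate(hv: &mut FakeHypervisor, address: u64, data: u32) -> MsiRoute {
    let route = MsiRoute { gsi: hv.allocate_gsi(), address, data };
    hv.register(&route);
    route
}

/// Restore: same registration path, but the GSI comes from the saved state.
fn restore(hv: &mut FakeHypervisor, saved: &[MsiRoute]) {
    for route in saved {
        hv.register(route);
    }
}
```

Restoring the saved routes into a fresh `FakeHypervisor` yields the same routing table as the
original activation, without ever calling `allocate_gsi`.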
#### Flushing IRQs to the irqchip
IRQs sometimes pass through multiple host `Event`s before reaching the hypervisor (or VCPU loop) for
injection. Rather than trying to snapshot the `Event` state, we freeze all interrupt sources
(devices) and flush all pending interrupts into the irqchip. This way, snapshotting the irqchip
state is sufficient to capture all pending interrupts.
## Restoring a VM in lieu of booting
Restoring onto a running VM is not supported, and may never be. Our preferred approach is to
instead create a new VM from a snapshot. This is why `vm_control::do_restore` can be invoked as part
of the VM creation process.
## Implications for device authors
New devices SHOULD be compatible with the `devices::Suspendable` trait, but MAY defer actual
implementation to the future. This trait's implementation defines how the device will sleep/wake,
and how its state will be saved & restored as part of snapshotting.
New virtio devices SHOULD implement the virtio device snapshot methods on
[VirtioDevice](https://crosvm.dev/doc/devices/virtio/virtio_device/trait.VirtioDevice.html):
`virtio_sleep`, `virtio_wake`, `virtio_snapshot`, and `virtio_restore`.
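
As a rough illustration of what these methods involve, here is a simplified trait and a toy device.
The trait below is **not** the real `devices::Suspendable` signature; `SuspendableSketch`, its
method signatures, and `CounterDevice` are assumptions made for this sketch.

```rust
/// Hypothetical, simplified version of a suspend/snapshot interface.
trait SuspendableSketch {
    /// Stop all background activity so state cannot change.
    fn sleep(&mut self);
    /// Resume background activity.
    fn wake(&mut self);
    /// Serialize the device state.
    fn snapshot(&self) -> Vec<u8>;
    /// Replace the device state with previously serialized state.
    fn restore(&mut self, data: &[u8]) -> Result<(), String>;
}

/// Toy device whose entire state is one counter.
struct CounterDevice {
    value: u32,
    asleep: bool,
}

impl SuspendableSketch for CounterDevice {
    fn sleep(&mut self) {
        self.asleep = true;
    }

    fn wake(&mut self) {
        self.asleep = false;
    }

    fn snapshot(&self) -> Vec<u8> {
        self.value.to_le_bytes().to_vec()
    }

    fn restore(&mut self, data: &[u8]) -> Result<(), String> {
        if data.len() != 4 {
            return Err(format!("expected 4 bytes, got {}", data.len()));
        }
        self.value = u32::from_le_bytes([data[0], data[1], data[2], data[3]]);
        Ok(())
    }
}
```

The point of the shape is that `snapshot` and `restore` only run while the device is asleep, so
the serialized state cannot change underneath them.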