diff --git a/docs/book/book.toml b/docs/book/book.toml
index 1c68f3a919..010fc86c3c 100644
--- a/docs/book/book.toml
+++ b/docs/book/book.toml
@@ -16,5 +16,6 @@ additional-js = ["mermaid.min.js", "mermaid-init.js"]
 # Redirect previously used paths to updated locations.
 [output.html.redirect]
 "building_crosvm/chromiumos.html" = "../integration/chromeos.html"
+"architecture.html" = "architecture/overview.html"
 
 [output.linkcheck]
diff --git a/docs/book/src/SUMMARY.md b/docs/book/src/SUMMARY.md
index 1762d767ea..c4e3936266 100644
--- a/docs/book/src/SUMMARY.md
+++ b/docs/book/src/SUMMARY.md
@@ -27,7 +27,10 @@
 - [Tracing](./tracing.md)
 - [Integration](./integration/index.md)
   - [ChromeOS](./integration/chromeos.md)
-- [Architecture](./architecture.md)
+- [Architecture](./architecture/index.md)
+  - [Overview](./architecture/overview.md)
+  - [Guest interrupts](./architecture/interrupts.md)
+  - [Snapshotting (highly experimental)](./architecture/snapshotting.md)
 - [Hypervisors](./hypervisors.md)
 - [Contribution Guide](./contributing/index.md)
   - [Contributing to crosvm](./contributing/contributing.md)
diff --git a/docs/book/src/architecture.md b/docs/book/src/architecture.md
deleted file mode 100644
index 4708086a40..0000000000
--- a/docs/book/src/architecture.md
+++ /dev/null
@@ -1 +0,0 @@
-{{#include ../../../ARCHITECTURE.md}}
diff --git a/docs/book/src/architecture/index.md b/docs/book/src/architecture/index.md
new file mode 100644
index 0000000000..7f8ba925a2
--- /dev/null
+++ b/docs/book/src/architecture/index.md
@@ -0,0 +1,6 @@
+# Architecture
+
+This chapter explains the internal architecture of CrosVM for contributors.
+
+- [Overview](./overview.md) - broad overview of CrosVM
+- [Interrupts](./interrupts.md) - deep dive into interrupts
diff --git a/docs/book/src/architecture/interrupts.md b/docs/book/src/architecture/interrupts.md
new file mode 100644
index 0000000000..1c1e384925
--- /dev/null
+++ b/docs/book/src/architecture/interrupts.md
@@ -0,0 +1,136 @@
+# Interrupts (x86_64)
+
+Interrupts are how devices request service from the guest drivers. This page explores the details of
+interrupt routing from the perspective of CrosVM.
+
+## Critical acronyms
+
+This subject area uses *a lot* of acronyms:
+
+- IRQ: Interrupt ReQuest
+- ISR: Interrupt Service Routine
+- EOI: End Of Interrupt
+- MSI: message signaled interrupts. In this document, synonymous with MSI-X.
+- MSI-X: message signaled interrupts - extended
+- LAPIC: local APIC
+- APIC: Advanced Programmable Interrupt Controller (successor to the legacy PIC)
+- IOAPIC: IO APIC (has physical interrupt lines, which it responds to by triggering an MSI directed
+  to the LAPIC).
+- PIC: Programmable Interrupt Controller (the "legacy PIC" / Intel 8259 chip).
+
+## Interrupts come in two flavors
+
+Interrupts on `x86_64` in CrosVM come in two primary flavors: legacy and MSI-X. In this document,
+MSI is used to refer to the concept of message signaled interrupts, but it always refers to
+interrupts sent via MSI-X because that is what CrosVM uses.
+
+### Legacy interrupts (INTx)
+
+These interrupts are traditionally delivered via dedicated signal lines to PICs and/or the IOAPIC.
+Older devices, especially those that are used during early boot, often rely on these types of
+interrupts. These typically are the first 24 GSIs, and are serviced either by the PIC (during very
+early boot), or by the IOAPIC (after it is activated & the PIC is switched off).
+
+#### Background on EOI
+
+The purpose of EOI is rooted in how legacy interrupt lines are shared. If two devices `D1` and `D2`
+share a line `L`, `D2` has no guarantee that it will be serviced when `L` is asserted. After
+receiving EOI, `D2` has to check whether it was serviced, and if it was not, to re-assert `L`. An
+example of how this occurs is if `D2` requests service while `D1` is already being serviced. In that
+case, the line has to be reasserted; otherwise `D2` won't be serviced.
+
+Because interrupt lines to the IOAPIC can be shared by multiple devices, EOI is critical for devices
+to figure out whether they were serviced in response to sending the IRQ, or whether the IRQ needs to
+be resent. The operating principles mean that sending extra EOIs to a legacy device is perfectly
+safe, because they could be due to another device on the same line receiving service, and so devices
+must be tolerant of such "extra" (from their perspective) EOIs.
+
+These "extra" EOIs come from the fact that EOI is often a broadcast message that goes to all legacy
+devices. Broadcast is required because interrupt lines can be routed through the two 8259 PICs via
+cascade before they reach the CPU; broadcast to both PICs (and attached devices) is the only way to
+ensure EOI reaches the device that was serviced.
+
+#### EOI in CrosVM
+
+When the guest's ISR completes and signals EOI, the CrosVM irqchip implementation is responsible for
+propagating EOI to the device backends. EOI is delivered to the devices via their
+[resample event](https://crosvm.dev/doc/devices/struct.IrqLevelEvent.html). Devices are then
+responsible for listening to that resample event, and checking their internal state to see if they
+received service. If the device wasn't serviced, it must then reassert the IRQ.
+
+### MSIs
+
+MSIs do not use dedicated signal lines; instead, they are "messages" which are sent on the system
+bus. The LAPIC(s) receive these messages, and inject the interrupt into the VCPU (where injection
+means: jump to ISR).
+
+#### About EOI
+
+EOI is not meaningful for MSIs because lines are *never* shared. No devices using MSI will listen
+for the EOI event, and the irqchip will not signal it.
+
+## The fundamental deception on x86_64: there are no legacy interrupts (usually)
+
+After very early boot, the PIC is switched off and legacy interrupts somewhat cease to be legacy.
+Instead of being handled by the PIC, legacy interrupts are handled by the IOAPIC, and all the IOAPIC
+does is convert them into MSIs; in other words, from the perspective of CrosVM & the guest VCPUs,
+after early boot, every interrupt is an MSI.
+
+## Interrupt handling irqchip specifics
+
+Each `IrqChip` can handle interrupts differently. Often these differences are because the underlying
+hypervisors will have different interrupt features such as KVM's irqfds. Generally a hypervisor has
+three choices for implementing an irqchip:
+
+- Fully in kernel: all of the irqchip (LAPIC & IOAPIC) is implemented in the kernel portion of the
+  hypervisor.
+- Split: the performance critical part of the irqchip (LAPIC) is implemented in the kernel, but the
+  IOAPIC is implemented by the VMM.
+- Userspace: here, the entire irqchip is implemented in the VMM. This is generally slower and not
+  commonly used.
+
+Below, we describe the rough flow for interrupts in virtio devices for each of the chip types. We
+limit ourselves to virtio devices because these are the performance critical devices in CrosVM.
+
+### Kernel mode IRQ chip (w/ irqfd support)
+
+#### MSIs
+
+1. Device wants service, so it signals an `Event` object.
+1. The `Event` object is registered with the hypervisor, so the hypervisor immediately routes the
+   IRQ to a LAPIC so a VCPU can be interrupted.
+1. The LAPIC interrupts the VCPU, which jumps to the kernel's ISR (interrupt service routine).
+1. The ISR runs.
+
+#### Legacy interrupts
+
+These are handled similarly to MSIs, except the kernel mode IOAPIC is what initially picks up the
+event, rather than the LAPIC.
+
+### Split IRQ chip (w/ irqfd support)
+
+This is the same as the kernel mode case.
+
+### Split IRQ chip (no irqfd kernel support)
+
+#### MSIs
+
+1. Device wants service, so it signals an `Event` object.
+1. The `Event` object is attached to the IrqChip in CrosVM. An interrupt handling thread wakes up
+   from the `Event` signal.
+1. The IrqChip resets the `Event`.
+1. The IrqChip asserts the interrupt to the LAPIC in the kernel via an ioctl (or equivalent).
+1. The LAPIC interrupts the VCPU, which jumps to the kernel's ISR (interrupt service routine).
+1. The ISR runs, and on completion sends EOI (end of interrupt). In CrosVM, this is called the
+   [resample event](https://crosvm.dev/doc/devices/struct.IrqLevelEvent.html).
+1. EOI is sent.
+
+#### Legacy interrupts
+
+This introduces an additional `Event` object in the interrupt path, since the IRQ pin itself is an
+`Event`, and the MSI is also an `Event`. These interrupts are processed twice by the IRQ handler:
+once as a legacy IOAPIC event, and a second time as an MSI.
+
+### Userspace IRQ chip
+
+This chip is not widely used in production. Contributions to fill in this section are welcome.
diff --git a/docs/book/src/architecture/overview.md b/docs/book/src/architecture/overview.md
new file mode 100644
index 0000000000..e455c66efb
--- /dev/null
+++ b/docs/book/src/architecture/overview.md
@@ -0,0 +1 @@
+{{#include ../../../../ARCHITECTURE.md}}
diff --git a/docs/book/src/architecture/snapshotting.md b/docs/book/src/architecture/snapshotting.md
new file mode 100644
index 0000000000..2d69d18474
--- /dev/null
+++ b/docs/book/src/architecture/snapshotting.md
@@ -0,0 +1,69 @@
+# Architecture: Snapshotting
+
+Snapshotting is a **highly experimental** `x86_64` only feature currently under development. It is
+100% **not supported** and only supports a very limited set of devices. This page roughly summarizes
+how the system works, and how device authors should think about it when writing new devices.
+
+## The snapshot & restore sequence
+
+The data required for a snapshot is stored in several places, including guest memory, and the
+devices running on the host. To take an accurate snapshot, we need a point-in-time snapshot. Since
+there is no way to fetch this state atomically, we have to freeze the guest (VCPUs) and the device
+backends. Similarly, on restore we must freeze in the same way to prevent partially restored state
+from being modified.
+
+## Snapshotting a running VM
+
+In code, this is implemented by
+[vm_control::do_snapshot](https://crosvm.dev/doc/vm_control/fn.do_snapshot.html). We always freeze
+the VCPUs first
+([vm_control::VcpuSuspendGuard](https://crosvm.dev/doc/vm_control/struct.VcpuSuspendGuard.html)).
+This is done so that we can flush all pending interrupts to the irqchip (LAPIC) without triggering
+further activity from the driver (which could in turn trigger more device activity). With the VCPUs
+frozen, we freeze devices
+([vm_control::DeviceSleepGuard](https://crosvm.dev/doc/vm_control/struct.DeviceSleepGuard.html)).
+From here, it's just a matter of serializing VCPU state, guest memory, and device state.
+
+### A word about interrupts
+
+Interrupts come in two primary flavors from the snapshotting perspective: legacy interrupts (e.g.
+IOAPIC interrupt lines), and MSIs.
+
+#### Legacy interrupts
+
+These are a little tricky because they are allocated as part of device creation, and device creation
+happens **before** we snapshot or restore. To avoid actually having to snapshot or restore the
+`Event` object wiring for these interrupts, we rely on the fact that as long as the VM is created
+with the right shape (e.g. devices), the interrupt `Event`s will be wired between the device & the
+irqchip correctly. As part of restoring, we will set the routing table, which ensures that those
+events map to the right GSIs in the hypervisor.
+
+#### MSIs
+
+These are much simpler, because of how MSIs are implemented in CrosVM. In `MsixConfig`, we save the
+MSI routing information for every IRQ. At restore time, we just register these MSIs with the
+hypervisor using the exact same mechanism that would be invoked on device activation (albeit
+bypassing GSI allocation since we know from the saved state exactly which GSI must be used).
+
+#### Flushing IRQs to the irqchip
+
+IRQs sometimes pass through multiple host `Event`s before reaching the hypervisor (or VCPU loop) for
+injection. Rather than trying to snapshot the `Event` state, we freeze all interrupt sources
+(devices) and flush all pending interrupts into the irqchip. This way, snapshotting the irqchip
+state is sufficient to capture all pending interrupts.
+
+## Restoring a VM in lieu of booting
+
+Restoring onto a running VM is not supported, and may never be. Our preferred approach is to
+instead create a new VM from a snapshot. This is why `vm_control::do_restore` can be invoked as part
+of the VM creation process.
+
+## Implications for device authors
+
+New devices SHOULD be compatible with the `devices::Suspendable` trait, but MAY defer actual
+implementation to the future. This trait's implementation defines how the device will sleep/wake,
+and how its state will be saved & restored as part of snapshotting.
+
+New virtio devices SHOULD implement the virtio device snapshot methods on
+[VirtioDevice](https://crosvm.dev/doc/devices/virtio/virtio_device/trait.VirtioDevice.html):
+`virtio_sleep`, `virtio_wake`, `virtio_snapshot`, and `virtio_restore`.
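
The sleep, snapshot, restore, and wake ordering that `snapshotting.md` asks device authors to respect can be illustrated with a small, self-contained Rust sketch. Everything named here is hypothetical and exists only for illustration: `SketchDevice`, `Counter`, and `snapshot_all` are stand-ins, and the method signatures are assumptions rather than those of `devices::Suspendable` or `VirtioDevice`. The sketch only shows that state is serialized while every device is frozen, and that a restore happens before the device is woken.

```rust
// Hypothetical stand-in for a suspendable device. This is NOT the real
// `devices::Suspendable` trait; names and signatures are assumptions.
trait SketchDevice {
    fn sleep(&mut self);
    fn wake(&mut self);
    fn snapshot(&self) -> Result<Vec<u8>, String>;
    fn restore(&mut self, data: &[u8]) -> Result<(), String>;
}

// Toy device whose entire state is a single counter.
#[derive(Debug, Default, PartialEq)]
struct Counter {
    value: u32,
    asleep: bool,
}

impl Counter {
    // Device activity is only permitted while the device is awake.
    fn tick(&mut self) -> Result<(), String> {
        if self.asleep {
            return Err("device is asleep".to_string());
        }
        self.value += 1;
        Ok(())
    }
}

impl SketchDevice for Counter {
    fn sleep(&mut self) {
        self.asleep = true;
    }

    fn wake(&mut self) {
        self.asleep = false;
    }

    // Refuse to serialize unless frozen, so the saved state cannot change
    // underneath the snapshot.
    fn snapshot(&self) -> Result<Vec<u8>, String> {
        if !self.asleep {
            return Err("snapshot requires a sleeping device".to_string());
        }
        Ok(self.value.to_le_bytes().to_vec())
    }

    // Refuse to restore unless frozen, so partially restored state is never
    // observed by a running device.
    fn restore(&mut self, data: &[u8]) -> Result<(), String> {
        if !self.asleep {
            return Err("restore requires a sleeping device".to_string());
        }
        if data.len() != 4 {
            return Err("bad snapshot length".to_string());
        }
        let mut bytes = [0u8; 4];
        bytes.copy_from_slice(data);
        self.value = u32::from_le_bytes(bytes);
        Ok(())
    }
}

// Freeze every device, serialize each one, then wake them all again whether
// or not serialization succeeded.
fn snapshot_all<D: SketchDevice>(devices: &mut [D]) -> Result<Vec<Vec<u8>>, String> {
    for d in devices.iter_mut() {
        d.sleep();
    }
    let result: Result<Vec<Vec<u8>>, String> =
        devices.iter().map(|d| d.snapshot()).collect();
    for d in devices.iter_mut() {
        d.wake();
    }
    result
}
```

A caller would pass a mutable slice of devices to `snapshot_all`, and later rebuild a device by calling `sleep`, `restore`, and `wake` in that order on a freshly created instance. Freezing the VCPUs first, as the page describes, is outside the scope of this sketch.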