mirror of
https://chromium.googlesource.com/crosvm/crosvm
synced 2024-11-25 05:03:05 +00:00
Document interrupts & snapshotting.
BUG=b:277651566
TEST=ran mdbook serve & verified the book looked as expected.
Change-Id: Id6e1723f99dbe428c009a7a03bc651c9b1ee4125
Reviewed-on: https://chromium-review.googlesource.com/c/crosvm/crosvm/+/4598283
Commit-Queue: Noah Gold <nkgold@google.com>
Reviewed-by: Daniel Verkamp <dverkamp@chromium.org>
Reviewed-by: Takaya Saeki <takayas@chromium.org>
Reviewed-by: Keiichi Watanabe <keiichiw@chromium.org>
parent
dc8d88fd80
commit
7eaa6e75c5
7 changed files with 217 additions and 2 deletions
|
@ -16,5 +16,6 @@ additional-js = ["mermaid.min.js", "mermaid-init.js"]
|
|||
# Redirect previously used paths to updated locations.
|
||||
[output.html.redirect]
|
||||
"building_crosvm/chromiumos.html" = "../integration/chromeos.html"
|
||||
"architecture.html" = "architecture/overview.html"
|
||||
|
||||
[output.linkcheck]
|
||||
|
|
|
@ -27,7 +27,10 @@
|
|||
- [Tracing](./tracing.md)
|
||||
- [Integration](./integration/index.md)
|
||||
- [ChromeOS](./integration/chromeos.md)
|
||||
- [Architecture](./architecture.md)
|
||||
- [Architecture](./architecture/index.md)
|
||||
- [Overview](./architecture/overview.md)
|
||||
- [Guest interrupts](./architecture/interrupts.md)
|
||||
- [Snapshotting (highly experimental)](./architecture/snapshotting.md)
|
||||
- [Hypervisors](./hypervisors.md)
|
||||
- [Contribution Guide](./contributing/index.md)
|
||||
- [Contributing to crosvm](./contributing/contributing.md)
|
||||
|
|
|
@ -1 +0,0 @@
|
|||
{{#include ../../../ARCHITECTURE.md}}
|
6
docs/book/src/architecture/index.md
Normal file
|
@ -0,0 +1,6 @@
|
|||
# Architecture
|
||||
|
||||
This chapter explains the internal architecture of CrosVM for contributors.
|
||||
|
||||
- [Overview](./overview.md) - broad overview of CrosVM
|
||||
- [Interrupts](./interrupts.md) - deep dive into interrupts
|
136
docs/book/src/architecture/interrupts.md
Normal file
|
@ -0,0 +1,136 @@
|
|||
# Interrupts (x86_64)
|
||||
|
||||
Interrupts are how devices request service from the guest drivers. This page explores the details of
|
||||
interrupt routing from the perspective of CrosVM.
|
||||
|
||||
## Critical acronyms
|
||||
|
||||
This subject area uses *a lot* of acronyms:
|
||||
|
||||
- IRQ: Interrupt ReQuest
|
||||
- ISR: Interrupt Service Routine
|
||||
- EOI: End Of Interrupt
|
||||
- MSI: message signaled interrupts. In this document, synonymous with MSI-X.
|
||||
- MSI-X: message signaled interrupts - extended
|
||||
- LAPIC: local APIC
|
||||
- APIC: Advanced Programmable Interrupt Controller (successor to the legacy PIC)
|
||||
- IOAPIC: IO APIC (has physical interrupt lines, which it responds to by triggering an MSI directed
|
||||
to the LAPIC).
|
||||
- PIC: Programmable Interrupt Controller (the "legacy PIC" / Intel 8259 chip).
|
||||
|
||||
## Interrupts come in two flavors
|
||||
|
||||
Interrupts on `x86_64` in CrosVM come in two primary flavors: legacy and MSI-X. In this document,
|
||||
MSI is used to refer to the concept of message signaled interrupts, but it always refers to
|
||||
interrupts sent via MSI-X because that is what CrosVM uses.
|
||||
|
||||
### Legacy interrupts (INTx)
|
||||
|
||||
These interrupts are traditionally delivered via dedicated signal lines to PICs and/or the IOAPIC.
|
||||
Older devices, especially those that are used during early boot, often rely on these types of
|
||||
interrupts. These typically are the first 24 GSIs, and are serviced either by the PIC (during very
|
||||
early boot), or by the IOAPIC (after it is activated & the PIC is switched off).
|
||||
|
||||
#### Background on EOI
|
||||
|
||||
The purpose of EOI is rooted in how legacy interrupt lines are shared. If two devices `D1` and `D2`
|
||||
share a line `L`, `D2` has no guarantee that it will be serviced when `L` is asserted. After
|
||||
receiving EOI, `D2` has to check whether it was serviced, and if it was not, to re-assert `L`. An
|
||||
example of how this occurs is if `D2` requests service while `D1` is already being serviced. In that
|
||||
case, the line has to be reasserted; otherwise `D2` won't be serviced.
|
||||
|
||||
Because interrupt lines to the IOAPIC can be shared by multiple devices, EOI is critical for devices
|
||||
to figure out whether they were serviced in response to sending the IRQ, or whether the IRQ needs to
|
||||
be resent. The operating principles mean that sending extra EOIs to a legacy device is perfectly
|
||||
safe, because they could be due to another device on the same line receiving service, and so devices
|
||||
must be tolerant of such "extra" (from their perspective) EOIs.
|
||||
|
||||
These "extra" EOIs come from the fact that EOI is often a broadcast message that goes to all legacy
|
||||
devices. Because interrupt lines can be routed through the two 8259 PICs via
|
||||
cascade before they reach the CPU, broadcast to both PICs (and attached devices) is the only way to
|
||||
ensure EOI reaches the device that was serviced.
|
||||
|
||||
#### EOI in CrosVM
|
||||
|
||||
When the guest's ISR completes and signals EOI, the CrosVM irqchip implementation is responsible for
|
||||
propagating EOI to the device backends. EOI is delivered to the devices via their
|
||||
[resample event](https://crosvm.dev/doc/devices/struct.IrqLevelEvent.html). Devices are then
|
||||
responsible for listening to that resample event, and checking their internal state to see if they
|
||||
received service. If the device wasn't serviced, it must then reassert the IRQ.
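As a rough illustration of the resample handling described above, the following sketch models two devices sharing one line. `SharedLineDevice` and `devices_reasserting` are hypothetical names invented for this example; they are not crosvm APIs, and the real device backends wait on an `IrqLevelEvent` instead of being called directly.

```rust
/// Hypothetical model of a level-triggered device on a shared interrupt line.
/// This is an illustration only, not a crosvm type.
struct SharedLineDevice {
    name: &'static str,
    /// True while the device has work the guest driver has not yet handled.
    needs_service: bool,
}

impl SharedLineDevice {
    /// Called when the resample (EOI) event fires. Returns true if the device
    /// was not serviced and therefore must re-assert the shared line.
    fn on_resample(&self) -> bool {
        self.needs_service
    }
}

/// Simulates one EOI broadcast: every device on the line is notified, and the
/// names of the devices that re-assert the line are returned.
fn devices_reasserting(devices: &[SharedLineDevice]) -> Vec<&'static str> {
    devices
        .iter()
        .filter(|d| d.on_resample())
        .map(|d| d.name)
        .collect()
}
```

In this model a device that was already serviced simply ignores the EOI, which is why "extra" EOIs are harmless.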
|
||||
|
||||
### MSIs
|
||||
|
||||
MSIs do not use dedicated signal lines; instead, they are "messages" which are sent on the system
|
||||
bus. The LAPIC(s) receive these messages, and inject the interrupt into the VCPU (where injection
|
||||
means: jump to ISR).
|
||||
|
||||
#### About EOI
|
||||
|
||||
EOI is not meaningful for MSIs because lines are *never* shared. No devices using MSI will listen
|
||||
for the EOI event, and the irqchip will not signal it.
|
||||
|
||||
## The fundamental deception on x86_64: there are no legacy interrupts (usually)
|
||||
|
||||
After very early boot, the PIC is switched off and legacy interrupts somewhat cease to be legacy.
|
||||
Instead of being handled by the PIC, legacy interrupts are handled by the IOAPIC, and all the IOAPIC
|
||||
does is convert them into MSIs; in other words, from the perspective of CrosVM & the guest VCPUs,
|
||||
after early boot, every interrupt is an MSI.
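To make the conversion concrete, here is a minimal sketch of turning an IOAPIC redirection table entry into an MSI address/data pair. The bit positions follow the commonly documented x86 layout as an assumption of this example; `MsiMessage` and `redirection_entry_to_msi` are hypothetical names, and the sketch deliberately ignores details such as the level-assert bit and the redirection hint. It is not the crosvm implementation.

```rust
/// Address/data pair of a message signaled interrupt (illustrative type).
#[derive(Debug, PartialEq, Eq)]
struct MsiMessage {
    address: u64,
    data: u32,
}

/// Converts a 64-bit IOAPIC redirection table entry into a simplified MSI.
/// Returns `None` when the entry is masked, since no interrupt is delivered.
fn redirection_entry_to_msi(entry: u64) -> Option<MsiMessage> {
    let vector = (entry & 0xff) as u32; // bits 0-7
    let delivery_mode = ((entry >> 8) & 0x7) as u32; // bits 8-10
    let dest_mode = (entry >> 11) & 0x1; // bit 11
    let trigger_mode = ((entry >> 15) & 0x1) as u32; // bit 15
    let masked = ((entry >> 16) & 0x1) == 1; // bit 16
    let dest_id = (entry >> 56) & 0xff; // bits 56-63

    if masked {
        return None;
    }

    // MSI addresses live in the 0xFEExxxxx window; the destination ID sits in
    // bits 12-19 and the destination mode in bit 2.
    let address = 0xfee0_0000u64 | (dest_id << 12) | (dest_mode << 2);
    // MSI data carries the vector, delivery mode, and trigger mode.
    let data = vector | (delivery_mode << 8) | (trigger_mode << 15);
    Some(MsiMessage { address, data })
}
```

For example, an unmasked, edge-triggered entry with vector `0x30` targeting destination ID 1 yields address `0xfee01000` and data `0x30` under these assumptions.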
|
||||
|
||||
## Interrupt handling: irqchip specifics
|
||||
|
||||
Each `IrqChip` can handle interrupts differently. Often these differences are because the underlying
|
||||
hypervisors will have different interrupt features such as KVM's irqfds. Generally a hypervisor has
|
||||
three choices for implementing an irqchip:
|
||||
|
||||
- Fully in kernel: all of the irqchip (LAPIC & IOAPIC) is implemented in the kernel portion of the
|
||||
hypervisor.
|
||||
- Split: the performance critical part of the irqchip (LAPIC) is implemented in the kernel, but the
|
||||
IOAPIC is implemented by the VMM.
|
||||
- Userspace: here, the entire irqchip is implemented in the VMM. This is generally slower and not
|
||||
commonly used.
|
||||
|
||||
Below, we describe the rough flow for interrupts in virtio devices for each of the chip types. We
|
||||
limit ourselves to virtio devices because these are the performance critical devices in CrosVM.
|
||||
|
||||
### Kernel mode IRQ chip (w/ irqfd support)
|
||||
|
||||
#### MSIs
|
||||
|
||||
1. Device wants service, so it signals an `Event` object.
|
||||
1. The `Event` object is registered with the hypervisor, so the hypervisor immediately routes the
|
||||
IRQ to a LAPIC so a VCPU can be interrupted.
|
||||
1. The LAPIC interrupts the VCPU, which jumps to the kernel's ISR (interrupt service routine).
|
||||
1. The ISR runs.
|
||||
|
||||
#### Legacy interrupts
|
||||
|
||||
These are handled similarly to MSIs, except the kernel mode IOAPIC is what initially picks up the
|
||||
event, rather than the LAPIC.
|
||||
|
||||
### Split IRQ chip (w/ irqfd support)
|
||||
|
||||
This is the same as the kernel mode case.
|
||||
|
||||
### Split IRQ chip (no irqfd kernel support)
|
||||
|
||||
#### MSIs
|
||||
|
||||
1. Device wants service, so it signals an `Event` object.
|
||||
1. The `Event` object is attached to the IrqChip in CrosVM. An interrupt handling thread wakes up
|
||||
from the `Event` signal.
|
||||
1. The IrqChip resets the `Event`.
|
||||
1. The IrqChip asserts the interrupt to the LAPIC in the kernel via an ioctl (or equivalent).
|
||||
1. The LAPIC interrupts the VCPU, which jumps to the kernel’s ISR (interrupt service routine).
|
||||
1. The ISR runs, and on completion sends EOI (end of interrupt). In CrosVM, this is called the
|
||||
[resample event](https://crosvm.dev/doc/devices/struct.IrqLevelEvent.html).
|
||||
1. EOI is sent.
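The userspace half of the steps above (wake on the `Event`, reset it, assert the interrupt to the LAPIC) can be sketched as follows. `MockEvent`, `MockLapic`, and `handle_irq_event` are hypothetical stand-ins written for this example: the real code uses crosvm's `Event` type and a hypervisor ioctl, neither of which is modeled here.

```rust
/// Hypothetical stand-in for an eventfd-like `Event` (not the crosvm type).
#[derive(Default)]
struct MockEvent {
    signaled: bool,
}

impl MockEvent {
    /// What the device does when it wants service.
    fn signal(&mut self) {
        self.signaled = true;
    }

    /// Clears the event and reports whether it had been signaled.
    fn reset(&mut self) -> bool {
        std::mem::replace(&mut self.signaled, false)
    }
}

/// Hypothetical stand-in for the in-kernel LAPIC. It records the injected
/// GSIs instead of issuing an ioctl.
#[derive(Default)]
struct MockLapic {
    injected: Vec<u32>,
}

impl MockLapic {
    fn assert_irq(&mut self, gsi: u32) {
        self.injected.push(gsi);
    }
}

/// One pass of the interrupt handling thread for a single (event, GSI) pair.
/// Returns true if an interrupt was forwarded to the LAPIC.
fn handle_irq_event(event: &mut MockEvent, gsi: u32, lapic: &mut MockLapic) -> bool {
    // Reset the event first so that a later signal is not lost.
    if !event.reset() {
        return false;
    }
    // Assert the interrupt to the (mock) LAPIC.
    lapic.assert_irq(gsi);
    true
}
```

Compared with the irqfd paths, every interrupt here costs an extra wakeup of a userspace thread before the hypervisor sees it.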
|
||||
|
||||
#### Legacy interrupts
|
||||
|
||||
This introduces an additional `Event` object in the interrupt path, since the IRQ pin itself is an
|
||||
`Event`, and the MSI is also an `Event`. These interrupts are processed twice by the IRQ handler:
|
||||
once as a legacy IOAPIC event, and a second time as an MSI.
|
||||
|
||||
### Userspace IRQ chip
|
||||
|
||||
This chip is not widely used in production. Contributions to fill in this section are welcome.
|
1
docs/book/src/architecture/overview.md
Normal file
|
@ -0,0 +1 @@
|
|||
{{#include ../../../../ARCHITECTURE.md}}
|
69
docs/book/src/architecture/snapshotting.md
Normal file
|
@ -0,0 +1,69 @@
|
|||
# Architecture: Snapshotting
|
||||
|
||||
Snapshotting is a **highly experimental** `x86_64` only feature currently under development. It is
|
||||
100% **not supported** and only supports a very limited set of devices. This page roughly summarizes
|
||||
how the system works, and how device authors should think about it when writing new devices.
|
||||
|
||||
## The snapshot & restore sequence
|
||||
|
||||
The data required for a snapshot is stored in several places, including guest memory, and the
|
||||
devices running on the host. An accurate snapshot must capture all of this state at a single point in time. Since
|
||||
there is no way to fetch this state atomically, we have to freeze the guest (VCPUs) and the device
|
||||
backends. Similarly, on restore we must freeze in the same way to prevent partially restored state
|
||||
from being modified.
|
||||
|
||||
## Snapshotting a running VM
|
||||
|
||||
In code, this is implemented by
|
||||
[vm_control::do_snapshot](https://crosvm.dev/doc/vm_control/fn.do_snapshot.html). We always freeze
|
||||
the VCPUs first
|
||||
([vm_control::VcpuSuspendGuard](https://crosvm.dev/doc/vm_control/struct.VcpuSuspendGuard.html)).
|
||||
This is done so that we can flush all pending interrupts to the irqchip (LAPIC) without triggering
|
||||
further activity from the driver (which could in turn trigger more device activity). With the VCPUs
|
||||
frozen, we freeze devices
|
||||
([vm_control::DeviceSleepGuard](https://crosvm.dev/doc/vm_control/struct.DeviceSleepGuard.html)).
|
||||
From here, it's just a matter of serializing VCPU state, guest memory, and device state.
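The ordering can be illustrated with a small sketch that uses RAII guards, loosely inspired by the guard types linked above. `VcpuGuard`, `DeviceGuard`, `Log`, and `snapshot_sketch` are hypothetical names for this example, and the resume order shown is simply what Rust's drop order produces in this sketch, not a statement about `vm_control::do_snapshot`.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Shared log used to record the order of operations in this sketch.
type Log = Rc<RefCell<Vec<&'static str>>>;

/// Hypothetical guard: suspends VCPUs on creation, resumes them on drop.
struct VcpuGuard {
    log: Log,
}

impl VcpuGuard {
    fn new(log: &Log) -> Self {
        log.borrow_mut().push("suspend vcpus");
        VcpuGuard { log: Rc::clone(log) }
    }
}

impl Drop for VcpuGuard {
    fn drop(&mut self) {
        self.log.borrow_mut().push("resume vcpus");
    }
}

/// Hypothetical guard: puts devices to sleep on creation, wakes them on drop.
struct DeviceGuard {
    log: Log,
}

impl DeviceGuard {
    fn new(log: &Log) -> Self {
        log.borrow_mut().push("sleep devices");
        DeviceGuard { log: Rc::clone(log) }
    }
}

impl Drop for DeviceGuard {
    fn drop(&mut self) {
        self.log.borrow_mut().push("wake devices");
    }
}

/// Freezes VCPUs first, then devices, flushes interrupts, and serializes.
fn snapshot_sketch(log: &Log) {
    let _vcpus = VcpuGuard::new(log);
    let _devices = DeviceGuard::new(log);
    log.borrow_mut().push("flush pending irqs");
    log.borrow_mut().push("serialize state");
    // Locals drop in reverse declaration order: devices wake, then VCPUs resume.
}
```

Using guards means the VM is unfrozen even if serialization returns early with an error.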
|
||||
|
||||
### A word about interrupts
|
||||
|
||||
Interrupts come in two primary flavors from the snapshotting perspective: legacy interrupts (e.g.
|
||||
IOAPIC interrupt lines), and MSIs.
|
||||
|
||||
#### Legacy interrupts
|
||||
|
||||
These are a little tricky because they are allocated as part of device creation, and device creation
|
||||
happens **before** we snapshot or restore. To avoid actually having to snapshot or restore the
|
||||
`Event` object wiring for these interrupts, we rely on the fact that as long as the VM is created
|
||||
with the right shape (e.g. devices), the interrupt `Event`s will be wired between the device & the
|
||||
irqchip correctly. As part of restoring, we will set the routing table, which ensures that those
|
||||
events map to the right GSIs in the hypervisor.
|
||||
|
||||
#### MSIs
|
||||
|
||||
These are much simpler, because of how MSIs are implemented in CrosVM. In `MsixConfig`, we save the
|
||||
MSI routing information for every IRQ. At restore time, we just register these MSIs with the
|
||||
hypervisor using the exact same mechanism that would be invoked on device activation (albeit
|
||||
bypassing GSI allocation since we know from the saved state exactly which GSI must be used).
|
||||
|
||||
#### Flushing IRQs to the irqchip
|
||||
|
||||
IRQs sometimes pass through multiple host `Event`s before reaching the hypervisor (or VCPU loop) for
|
||||
injection. Rather than trying to snapshot the `Event` state, we freeze all interrupt sources
|
||||
(devices) and flush all pending interrupts into the irqchip. This way, snapshotting the irqchip
|
||||
state is sufficient to capture all pending interrupts.
|
||||
|
||||
## Restoring a VM in lieu of booting
|
||||
|
||||
Restoring onto a running VM is not supported, and may never be. Our preferred approach is to
|
||||
instead create a new VM from a snapshot. This is why `vm_control::do_restore` can be invoked as part
|
||||
of the VM creation process.
|
||||
|
||||
## Implications for device authors
|
||||
|
||||
New devices SHOULD be compatible with the `devices::Suspendable` trait, but MAY defer actual
|
||||
implementation to the future. This trait's implementation defines how the device will sleep/wake,
|
||||
and how its state will be saved & restored as part of snapshotting.
|
||||
|
||||
New virtio devices SHOULD implement the virtio device snapshot methods on
|
||||
[VirtioDevice](https://crosvm.dev/doc/devices/virtio/virtio_device/trait.VirtioDevice.html):
|
||||
`virtio_sleep`, `virtio_wake`, `virtio_snapshot`, and `virtio_restore`.
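To give a feel for what a device author has to provide, here is a deliberately simplified sketch of a sleep/wake/snapshot/restore interface. `SuspendableSketch` and `CounterDevice` are hypothetical and invented for this example; the real `devices::Suspendable` trait and the `VirtioDevice` methods have their own signatures, which should be taken from the linked API documentation rather than from this sketch.

```rust
/// Hypothetical, simplified analogue of a suspendable device interface.
/// The real crosvm trait differs; this only illustrates the four duties.
trait SuspendableSketch {
    /// Stop all device activity (e.g. worker threads).
    fn sleep(&mut self);
    /// Resume device activity after a sleep.
    fn wake(&mut self);
    /// Serialize the device state. Only meaningful while asleep.
    fn snapshot(&self) -> Vec<u8>;
    /// Replace the device state with previously serialized state.
    fn restore(&mut self, data: &[u8]) -> Result<(), String>;
}

/// Toy device whose entire state is a single counter.
#[derive(Default)]
struct CounterDevice {
    count: u32,
    asleep: bool,
}

impl SuspendableSketch for CounterDevice {
    fn sleep(&mut self) {
        self.asleep = true;
    }

    fn wake(&mut self) {
        self.asleep = false;
    }

    fn snapshot(&self) -> Vec<u8> {
        self.count.to_le_bytes().to_vec()
    }

    fn restore(&mut self, data: &[u8]) -> Result<(), String> {
        if data.len() != 4 {
            return Err(format!("expected 4 bytes, got {}", data.len()));
        }
        let mut bytes = [0u8; 4];
        bytes.copy_from_slice(data);
        self.count = u32::from_le_bytes(bytes);
        Ok(())
    }
}
```

A typical round trip in this model is: sleep the device, take the snapshot, then restore it into a freshly created, sleeping device before waking it.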
|