linux: split out linux mod into multiple

At nearly 4k LOC, it's harder to maintain. This change only moves some things around without changing any code.

Input on symbol visibility is welcome. In reality it doesn't really matter whether a symbol is pub/pub(super)/pub(crate), as the mods themselves are private to the linux mod.

I plan to invest more in splitting things apart if possible (especially the main loop), but it's a start.

TEST=./tools/presubmit
BUG=n/a

Change-Id: I2792dd0acdb5627f1c9b5d0fb998c976c6fe5e15
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/platform/crosvm/+/3422266
Reviewed-by: Daniel Verkamp <dverkamp@chromium.org>
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Noah Gold <nkgold@google.com>
Reviewed-by: Anton Romanov <romanton@google.com>
Commit-Queue: Anton Romanov <romanton@google.com>
Auto-Submit: Anton Romanov <romanton@google.com>
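
The visibility point in the commit message can be illustrated with a small sketch. It is not code from this change: the module layout and the three function names below are hypothetical, chosen only to mirror a parent module with a private child. It shows that the parent can call all three functions regardless of the marker used, while the private child module keeps them unreachable from outside the parent.

```rust
// Hypothetical layout: a parent module with a private child module,
// loosely mirroring `linux` and `linux::jail_helpers`.
mod linux {
    // The child is private, so code outside `linux` cannot name it at all.
    mod jail_helpers {
        pub fn with_pub() -> &'static str {
            "pub"
        }

        pub(super) fn with_pub_super() -> &'static str {
            "pub(super)"
        }

        pub(crate) fn with_pub_crate() -> &'static str {
            "pub(crate)"
        }
    }

    // All three are callable from the parent, whichever marker was used.
    pub fn visible_from_parent() -> [&'static str; 3] {
        [
            jail_helpers::with_pub(),
            jail_helpers::with_pub_super(),
            jail_helpers::with_pub_crate(),
        ]
    }
}
```

Calling `linux::visible_from_parent()` from the crate root works, whereas `linux::jail_helpers::with_pub()` would be rejected by the compiler because `jail_helpers` itself is private.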
2024-11-24 12:34:31 +00:00 · 2022-01-28 00:18:11 +00:00 · 2022-01-28 00:18:11 +00:00 · 5acc0f52f5
commit 5acc0f52f5
parent 6afc5a7f10
7 changed files with 2318 additions and 2219 deletions
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -8,14 +8,14 @@ The principle characteristics of crosvm are:
 - Written in Rust for security and safety

 A typical session of crosvm starts in `main.rs` where command line parsing is done to build up a
-`Config` structure. The `Config` is used by `run_config` in `linux.rs` to setup and execute a VM.
-Broken down into rough steps:
+`Config` structure. The `Config` is used by `run_config` in `linux/mod.rs` to setup and execute a
+VM. Broken down into rough steps:

 1. Load the linux kernel from an ELF file.
 1. Create a handful of control sockets used by the virtual devices.
 1. Invoke the architecture specific VM builder `Arch::build_vm` (located in `x86_64/src/lib.rs` or
   `aarch64/src/lib.rs`).
-1. `Arch::build_vm` will itself invoke the provided `create_devices` function from `linux.rs`
+1. `Arch::build_vm` will itself invoke the provided `create_devices` function from `linux/mod.rs`
 1. `create_devices` creates every PCI device, including the virtio devices, that were configured in
   `Config`, along with matching [minijail] configs for each.
 1. `Arch::generate_pci_root`, using a list of every PCI device with optional `Minijail`, will
@@ -35,12 +35,12 @@ invalid.

 ## Sandboxing Policy

-Every sandbox is made with [minijail] and starts with `create_base_minijail` in `linux.rs` which set
-some very restrictive settings. Linux namespaces and seccomp filters are used extensively. Each
-seccomp policy can be found under `seccomp/{arch}/{device}.policy` and should start by
-`@include`-ing the `common_device.policy`. With the exception of architecture specific devices (such
-as `Pl030` on ARM or `I8042` on x86_64), every device will need a different policy for each
-supported architecture.
+Every sandbox is made with [minijail] and starts with `create_base_minijail` in
+`linux/jail_helpers.rs` which set some very restrictive settings. Linux namespaces and seccomp
+filters are used extensively. Each seccomp policy can be found under
+`seccomp/{arch}/{device}.policy` and should start by `@include`-ing the `common_device.policy`. With
+the exception of architecture specific devices (such as `Pl030` on ARM or `I8042` on x86_64), every
+device will need a different policy for each supported architecture.

 ## The VM Control Sockets

--- a/docs/book/src/appendix/minijail.md
+++ b/docs/book/src/appendix/minijail.md
@@ -8,8 +8,8 @@ The fact that minijail was written, maintained, and continuously tested by a pro
 team more than makes up for its being written in an memory unsafe language.

 The exact configuration of the sandbox varies by device, but they are mostly alike. See
-`create_base_minijail` from `linux.rs`. The set of security constraints explicitly used in crosvm
-are:
+`create_base_minijail` from `linux/jail_helpers.rs`. The set of security constraints explicitly used
+in crosvm are:

 - PID Namespace
  - Runs as init
--- a/src/linux/device_helpers.rs
+++ b/src/linux/device_helpers.rs
--- a/src/linux/gpu.rs
+++ b/src/linux/gpu.rs
@@ -0,0 +1,331 @@
+// Copyright 2017 The Chromium OS Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style license that can be
+// found in the LICENSE file.
+
+//! GPU related things
+//! depends on "gpu" feature
+use std::collections::HashSet;
+use std::env;
+
+use devices::virtio::vhost::user::vmm::Gpu as VhostUserGpu;
+use devices::virtio::GpuRenderServerParameters;
+
+use super::*;
+
+pub fn create_vhost_user_gpu_device(
+    cfg: &Config,
+    opt: &VhostUserOption,
+    host_tube: Tube,
+    device_tube: Tube,
+) -> DeviceResult {
+    // The crosvm gpu device expects us to connect the tube before it will accept a vhost-user
+    // connection.
+    let dev = VhostUserGpu::new(
+        virtio::base_features(cfg.protected_vm),
+        &opt.socket,
+        host_tube,
+        device_tube,
+    )
+    .context("failed to set up vhost-user gpu device")?;
+
+    Ok(VirtioDeviceStub {
+        dev: Box::new(dev),
+        // no sandbox here because virtqueue handling is exported to a different process.
+        jail: None,
+    })
+}
+
+pub fn gpu_jail(cfg: &Config, policy: &str) -> Result<Option<Minijail>> {
+    match simple_jail(cfg, policy)? {
+        Some(mut jail) => {
+            // Create a tmpfs in the device's root directory so that we can bind mount the
+            // dri directory into it.  The size=67108864 is size=64*1024*1024 or size=64MB.
+            jail.mount_with_data(
+                Path::new("none"),
+                Path::new("/"),
+                "tmpfs",
+                (libc::MS_NOSUID | libc::MS_NODEV | libc::MS_NOEXEC) as usize,
+                "size=67108864",
+            )?;
+
+            // Device nodes required for DRM.
+            let sys_dev_char_path = Path::new("/sys/dev/char");
+            jail.mount_bind(sys_dev_char_path, sys_dev_char_path, false)?;
+            let sys_devices_path = Path::new("/sys/devices");
+            jail.mount_bind(sys_devices_path, sys_devices_path, false)?;
+
+            let drm_dri_path = Path::new("/dev/dri");
+            if drm_dri_path.exists() {
+                jail.mount_bind(drm_dri_path, drm_dri_path, false)?;
+            }
+
+            // If the ARM specific devices exist on the host, bind mount them in.
+            let mali0_path = Path::new("/dev/mali0");
+            if mali0_path.exists() {
+                jail.mount_bind(mali0_path, mali0_path, true)?;
+            }
+
+            let pvr_sync_path = Path::new("/dev/pvr_sync");
+            if pvr_sync_path.exists() {
+                jail.mount_bind(pvr_sync_path, pvr_sync_path, true)?;
+            }
+
+            // If the udmabuf driver exists on the host, bind mount it in.
+            let udmabuf_path = Path::new("/dev/udmabuf");
+            if udmabuf_path.exists() {
+                jail.mount_bind(udmabuf_path, udmabuf_path, true)?;
+            }
+
+            // Libraries that are required when mesa drivers are dynamically loaded.
+            jail_mount_bind_if_exists(
+                &mut jail,
+                &[
+                    "/usr/lib",
+                    "/usr/lib64",
+                    "/lib",
+                    "/lib64",
+                    "/usr/share/drirc.d",
+                    "/usr/share/glvnd",
+                    "/usr/share/vulkan",
+                ],
+            )?;
+
+            // pvr driver requires read access to /proc/self/task/*/comm.
+            let proc_path = Path::new("/proc");
+            jail.mount(
+                proc_path,
+                proc_path,
+                "proc",
+                (libc::MS_NOSUID | libc::MS_NODEV | libc::MS_NOEXEC | libc::MS_RDONLY) as usize,
+            )?;
+
+            // To enable perfetto tracing, we need to give access to the perfetto service IPC
+            // endpoints.
+            let perfetto_path = Path::new("/run/perfetto");
+            if perfetto_path.exists() {
+                jail.mount_bind(perfetto_path, perfetto_path, true)?;
+            }
+
+            Ok(Some(jail))
+        }
+        None => Ok(None),
+    }
+}
+
+pub struct GpuCacheInfo<'a> {
+    directory: Option<&'a str>,
+    environment: Vec<(&'a str, &'a str)>,
+}
+
+pub fn get_gpu_cache_info<'a>(
+    cache_dir: Option<&'a String>,
+    cache_size: Option<&'a String>,
+    sandbox: bool,
+) -> GpuCacheInfo<'a> {
+    let mut dir = None;
+    let mut env = Vec::new();
+
+    if let Some(cache_dir) = cache_dir {
+        if !Path::new(cache_dir).exists() {
+            warn!("shader caching dir {} does not exist", cache_dir);
+            env.push(("MESA_GLSL_CACHE_DISABLE", "true"));
+        } else if cfg!(any(target_arch = "arm", target_arch = "aarch64")) && sandbox {
+            warn!("shader caching not yet supported on ARM with sandbox enabled");
+            env.push(("MESA_GLSL_CACHE_DISABLE", "true"));
+        } else {
+            dir = Some(cache_dir.as_str());
+
+            env.push(("MESA_GLSL_CACHE_DISABLE", "false"));
+            env.push(("MESA_GLSL_CACHE_DIR", cache_dir.as_str()));
+            if let Some(cache_size) = cache_size {
+                env.push(("MESA_GLSL_CACHE_MAX_SIZE", cache_size.as_str()));
+            }
+        }
+    }
+
+    GpuCacheInfo {
+        directory: dir,
+        environment: env,
+    }
+}
+
+pub fn create_gpu_device(
+    cfg: &Config,
+    exit_evt: &Event,
+    gpu_device_tube: Tube,
+    resource_bridges: Vec<Tube>,
+    wayland_socket_path: Option<&PathBuf>,
+    x_display: Option<String>,
+    render_server_fd: Option<SafeDescriptor>,
+    event_devices: Vec<EventDevice>,
+    map_request: Arc<Mutex<Option<ExternalMapping>>>,
+) -> DeviceResult {
+    let mut display_backends = vec![
+        virtio::DisplayBackend::X(x_display),
+        virtio::DisplayBackend::Stub,
+    ];
+
+    let wayland_socket_dirs = cfg
+        .wayland_socket_paths
+        .iter()
+        .map(|(_name, path)| path.parent())
+        .collect::<Option<Vec<_>>>()
+        .ok_or_else(|| anyhow!("wayland socket path has no parent or file name"))?;
+
+    if let Some(socket_path) = wayland_socket_path {
+        display_backends.insert(
+            0,
+            virtio::DisplayBackend::Wayland(Some(socket_path.to_owned())),
+        );
+    }
+
+    let dev = virtio::Gpu::new(
+        exit_evt.try_clone().context("failed to clone event")?,
+        Some(gpu_device_tube),
+        resource_bridges,
+        display_backends,
+        cfg.gpu_parameters.as_ref().unwrap(),
+        render_server_fd,
+        event_devices,
+        map_request,
+        cfg.sandbox,
+        virtio::base_features(cfg.protected_vm),
+        cfg.wayland_socket_paths.clone(),
+    );
+
+    let jail = match gpu_jail(cfg, "gpu_device")? {
+        Some(mut jail) => {
+            // Prepare GPU shader disk cache directory.
+            let (cache_dir, cache_size) = cfg
+                .gpu_parameters
+                .as_ref()
+                .map(|params| (params.cache_path.as_ref(), params.cache_size.as_ref()))
+                .unwrap();
+            let cache_info = get_gpu_cache_info(cache_dir, cache_size, cfg.sandbox);
+
+            if let Some(dir) = cache_info.directory {
+                jail.mount_bind(dir, dir, true)?;
+            }
+            for (key, val) in cache_info.environment {
+                env::set_var(key, val);
+            }
+
+            // Bind mount the wayland socket's directory into jail's root. This is necessary since
+            // each new wayland context must open() the socket. If the wayland socket is ever
+            // destroyed and remade in the same host directory, new connections will be possible
+            // without restarting the wayland device.
+            for dir in &wayland_socket_dirs {
+                jail.mount_bind(dir, dir, true)?;
+            }
+
+            add_current_user_to_jail(&mut jail)?;
+
+            Some(jail)
+        }
+        None => None,
+    };
+
+    Ok(VirtioDeviceStub {
+        dev: Box::new(dev),
+        jail,
+    })
+}
+
+pub fn get_gpu_render_server_environment(cache_info: &GpuCacheInfo) -> Result<Vec<String>> {
+    let mut env = Vec::new();
+
+    let mut cache_env_keys = HashSet::with_capacity(cache_info.environment.len());
+    for (key, val) in cache_info.environment.iter() {
+        env.push(format!("{}={}", key, val));
+        cache_env_keys.insert(*key);
+    }
+
+    for (key_os, val_os) in env::vars_os() {
+        // minijail should accept OsStr rather than str...
+        let into_string_err = |_| anyhow!("invalid environment key/val");
+        let key = key_os.into_string().map_err(into_string_err)?;
+        let val = val_os.into_string().map_err(into_string_err)?;
+
+        if !cache_env_keys.contains(key.as_str()) {
+            env.push(format!("{}={}", key, val));
+        }
+    }
+
+    Ok(env)
+}
+
+pub struct ScopedMinijail(pub Minijail);
+
+impl Drop for ScopedMinijail {
+    fn drop(&mut self) {
+        let _ = self.0.kill();
+    }
+}
+
+pub fn start_gpu_render_server(
+    cfg: &Config,
+    render_server_parameters: &GpuRenderServerParameters,
+) -> Result<(Minijail, SafeDescriptor)> {
+    let (server_socket, client_socket) =
+        UnixSeqpacket::pair().context("failed to create render server socket")?;
+
+    let mut env = None;
+    let jail = match gpu_jail(cfg, "gpu_render_server")? {
+        Some(mut jail) => {
+            let cache_info = get_gpu_cache_info(
+                render_server_parameters.cache_path.as_ref(),
+                render_server_parameters.cache_size.as_ref(),
+                cfg.sandbox,
+            );
+
+            if let Some(dir) = cache_info.directory {
+                jail.mount_bind(dir, dir, true)?;
+            }
+
+            if !cache_info.environment.is_empty() {
+                env = Some(get_gpu_render_server_environment(&cache_info)?);
+            }
+
+            // bind mount /dev/log for syslog
+            let log_path = Path::new("/dev/log");
+            if log_path.exists() {
+                jail.mount_bind(log_path, log_path, true)?;
+            }
+
+            // Run as root in the jail to keep capabilities after execve, which is needed for
+            // mounting to work.  All capabilities will be dropped afterwards.
+            add_current_user_as_root_to_jail(&mut jail)?;
+
+            jail
+        }
+        None => Minijail::new().context("failed to create jail")?,
+    };
+
+    let inheritable_fds = [
+        server_socket.as_raw_descriptor(),
+        libc::STDOUT_FILENO,
+        libc::STDERR_FILENO,
+    ];
+
+    let cmd = &render_server_parameters.path;
+    let cmd_str = cmd
+        .to_str()
+        .ok_or_else(|| anyhow!("invalid render server path"))?;
+    let fd_str = server_socket.as_raw_descriptor().to_string();
+    let args = [cmd_str, "--socket-fd", &fd_str];
+
+    let mut envp: Option<Vec<&str>> = None;
+    if let Some(ref env) = env {
+        envp = Some(env.iter().map(AsRef::as_ref).collect());
+    }
+
+    jail.run_command(minijail::Command::new_for_path(
+        cmd,
+        &inheritable_fds,
+        &args,
+        envp.as_deref(),
+    )?)
+    .context("failed to start gpu render server")?;
+
+    Ok((jail, SafeDescriptor::from(client_socket)))
+}
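
As a reading aid for `get_gpu_render_server_environment` above, here is a minimal standalone sketch of the same precedence rule: cache-related overrides are emitted first, and inherited variables are appended only if their key was not overridden. The function name `merge_environment` and its slice-based parameters are assumptions made for illustration; the real function reads from `env::vars_os()` and a `GpuCacheInfo`.

```rust
use std::collections::HashSet;

/// Merge explicit overrides with inherited variables into `KEY=VALUE` strings.
/// Overrides come first; inherited entries with an overridden key are skipped.
/// (Hypothetical helper, not part of crosvm.)
fn merge_environment(overrides: &[(&str, &str)], inherited: &[(String, String)]) -> Vec<String> {
    let mut env = Vec::new();

    // Remember which keys were overridden so inherited duplicates can be dropped.
    let mut override_keys = HashSet::with_capacity(overrides.len());
    for (key, val) in overrides {
        env.push(format!("{}={}", key, val));
        override_keys.insert(*key);
    }

    for (key, val) in inherited {
        if !override_keys.contains(key.as_str()) {
            env.push(format!("{}={}", key, val));
        }
    }

    env
}
```

With an override of `MESA_GLSL_CACHE_DISABLE=true` and an inherited environment that sets the same key to `false`, only the override survives, so the render server never sees two conflicting entries for one key.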
--- a/src/linux/jail_helpers.rs
+++ b/src/linux/jail_helpers.rs
@@ -0,0 +1,188 @@
+// Copyright 2017 The Chromium OS Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style license that can be
+// found in the LICENSE file.
+
+use std::path::{Path, PathBuf};
+use std::str;
+
+use libc::{self, c_ulong, gid_t, uid_t};
+
+use anyhow::{bail, Context, Result};
+use base::*;
+use minijail::{self, Minijail};
+
+use crate::Config;
+
+pub(super) struct SandboxConfig<'a> {
+    pub(super) limit_caps: bool,
+    pub(super) log_failures: bool,
+    pub(super) seccomp_policy: &'a Path,
+    pub(super) uid_map: Option<&'a str>,
+    pub(super) gid_map: Option<&'a str>,
+    pub(super) remount_mode: Option<c_ulong>,
+}
+
+pub(super) fn create_base_minijail(
+    root: &Path,
+    r_limit: Option<u64>,
+    config: Option<&SandboxConfig>,
+) -> Result<Minijail> {
+    // All child jails run in a new user namespace without any users mapped,
+    // they run as nobody unless otherwise configured.
+    let mut j = Minijail::new().context("failed to jail device")?;
+
+    if let Some(config) = config {
+        j.namespace_pids();
+        j.namespace_user();
+        j.namespace_user_disable_setgroups();
+        if config.limit_caps {
+            // Don't need any capabilities.
+            j.use_caps(0);
+        }
+        if let Some(uid_map) = config.uid_map {
+            j.uidmap(uid_map).context("error setting UID map")?;
+        }
+        if let Some(gid_map) = config.gid_map {
+            j.gidmap(gid_map).context("error setting GID map")?;
+        }
+        // Run in a new mount namespace.
+        j.namespace_vfs();
+
+        // Run in an empty network namespace.
+        j.namespace_net();
+
+        // Don't allow the device to gain new privileges.
+        j.no_new_privs();
+
+        // By default we'll prioritize using the pre-compiled .bpf over the .policy
+        // file (the .bpf is expected to be compiled using "trap" as the failure
+        // behavior instead of the default "kill" behavior).
+        // Refer to the code comment for the "seccomp-log-failures"
+        // command-line parameter for an explanation about why the |log_failures|
+        // flag forces the use of .policy files (and the build-time alternative to
+        // this run-time flag).
+        let bpf_policy_file = config.seccomp_policy.with_extension("bpf");
+        if bpf_policy_file.exists() && !config.log_failures {
+            j.parse_seccomp_program(&bpf_policy_file)
+                .context("failed to parse precompiled seccomp policy")?;
+        } else {
+            // Use TSYNC only for the side effect of it using SECCOMP_RET_TRAP,
+            // which will correctly kill the entire device process if a worker
+            // thread commits a seccomp violation.
+            j.set_seccomp_filter_tsync();
+            if config.log_failures {
+                j.log_seccomp_filter_failures();
+            }
+            j.parse_seccomp_filters(&config.seccomp_policy.with_extension("policy"))
+                .context("failed to parse seccomp policy")?;
+        }
+        j.use_seccomp_filter();
+        // Don't do init setup.
+        j.run_as_init();
+        // Set up requested remount mode instead of default MS_PRIVATE.
+        if let Some(mode) = config.remount_mode {
+            j.set_remount_mode(mode);
+        }
+    }
+
+    // Only pivot_root if we are not re-using the current root directory.
+    if root != Path::new("/") {
+        // It's safe to call `namespace_vfs` multiple times.
+        j.namespace_vfs();
+        j.enter_pivot_root(root)
+            .context("failed to pivot root device")?;
+    }
+
+    // Most devices don't need to open many fds.
+    let limit = if let Some(r) = r_limit { r } else { 1024u64 };
+    j.set_rlimit(libc::RLIMIT_NOFILE as i32, limit, limit)
+        .context("error setting max open files")?;
+
+    Ok(j)
+}
+
+pub(super) fn simple_jail(cfg: &Config, policy: &str) -> Result<Option<Minijail>> {
+    if cfg.sandbox {
+        let pivot_root: &str = option_env!("DEFAULT_PIVOT_ROOT").unwrap_or("/var/empty");
+        // A directory for a jailed device's pivot root.
+        let root_path = Path::new(pivot_root);
+        if !root_path.exists() {
+            bail!("{} doesn't exist, can't jail devices", pivot_root);
+        }
+        let policy_path: PathBuf = cfg.seccomp_policy_dir.join(policy);
+        let config = SandboxConfig {
+            limit_caps: true,
+            log_failures: cfg.seccomp_log_failures,
+            seccomp_policy: &policy_path,
+            uid_map: None,
+            gid_map: None,
+            remount_mode: None,
+        };
+        Ok(Some(create_base_minijail(root_path, None, Some(&config))?))
+    } else {
+        Ok(None)
+    }
+}
+
+/// Mirror-mount all the directories in `dirs` into `jail` on a best-effort basis.
+///
+/// This function will not return an error if any of the directories in `dirs` is missing.
+#[cfg(any(feature = "gpu", feature = "video-decoder", feature = "video-encoder"))]
+pub(super) fn jail_mount_bind_if_exists<P: AsRef<std::ffi::OsStr>>(
+    jail: &mut Minijail,
+    dirs: &[P],
+) -> Result<()> {
+    for dir in dirs {
+        let dir_path = Path::new(dir);
+        if dir_path.exists() {
+            jail.mount_bind(dir_path, dir_path, false)?;
+        }
+    }
+
+    Ok(())
+}
+
+#[derive(Copy, Clone)]
+#[cfg_attr(not(feature = "tpm"), allow(dead_code))]
+pub(super) struct Ids {
+    pub(super) uid: uid_t,
+    pub(super) gid: gid_t,
+}
+
+pub(super) fn add_current_user_as_root_to_jail(jail: &mut Minijail) -> Result<Ids> {
+    let crosvm_uid = geteuid();
+    let crosvm_gid = getegid();
+    jail.uidmap(&format!("0 {0} 1", crosvm_uid))
+        .context("error setting UID map")?;
+    jail.gidmap(&format!("0 {0} 1", crosvm_gid))
+        .context("error setting GID map")?;
+
+    Ok(Ids {
+        uid: crosvm_uid,
+        gid: crosvm_gid,
+    })
+}
+
+/// Set the uid/gid for the jailed process and give a basic id map. This is
+/// required for bind mounts to work.
+pub(super) fn add_current_user_to_jail(jail: &mut Minijail) -> Result<Ids> {
+    let crosvm_uid = geteuid();
+    let crosvm_gid = getegid();
+
+    jail.uidmap(&format!("{0} {0} 1", crosvm_uid))
+        .context("error setting UID map")?;
+    jail.gidmap(&format!("{0} {0} 1", crosvm_gid))
+        .context("error setting GID map")?;
+
+    if crosvm_uid != 0 {
+        jail.change_uid(crosvm_uid);
+    }
+    if crosvm_gid != 0 {
+        jail.change_gid(crosvm_gid);
+    }
+
+    Ok(Ids {
+        uid: crosvm_uid,
+        gid: crosvm_gid,
+    })
+}
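
The two helpers above differ only in the ID map string they hand to minijail. The sketch below isolates that difference. The function names `root_id_map` and `identity_id_map` are hypothetical; the string layout is the usual Linux user namespace map of "ID inside, ID outside, count".

```rust
/// Map a single outside ID to ID 0 (root) inside the jail, as
/// `add_current_user_as_root_to_jail` does. (Hypothetical helper.)
fn root_id_map(outside_id: u32) -> String {
    format!("0 {0} 1", outside_id)
}

/// Map a single ID to itself inside the jail, as
/// `add_current_user_to_jail` does. (Hypothetical helper.)
fn identity_id_map(id: u32) -> String {
    format!("{0} {0} 1", id)
}
```

For a process running as UID 1000, the first form yields `0 1000 1` (the process appears as root inside the jail) and the second yields `1000 1000 1` (the process keeps its own ID).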
--- a/src/linux/mod.rs
+++ b/src/linux/mod.rs
--- a/src/linux/vcpu.rs
+++ b/src/linux/vcpu.rs
@@ -0,0 +1,615 @@
+// Copyright 2017 The Chromium OS Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style license that can be
+// found in the LICENSE file.
+
+use std::fs::File;
+use std::io::prelude::*;
+use std::sync::{mpsc, Arc, Barrier};
+
+use std::thread;
+use std::thread::JoinHandle;
+
+use libc::{self, c_int};
+
+use anyhow::{Context, Result};
+use base::*;
+use devices::{self, IrqChip, VcpuRunState};
+use hypervisor::{Vcpu, VcpuExit, VcpuRunHandle};
+use vm_control::*;
+#[cfg(all(target_arch = "x86_64", feature = "gdb"))]
+use vm_memory::GuestMemory;
+
+use arch::{self, LinuxArch};
+
+#[cfg(any(target_arch = "arm", target_arch = "aarch64"))]
+use {
+    aarch64::AArch64 as Arch,
+    devices::IrqChipAArch64 as IrqChipArch,
+    hypervisor::{VcpuAArch64 as VcpuArch, VmAArch64 as VmArch},
+};
+#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
+use {
+    devices::IrqChipX86_64 as IrqChipArch,
+    hypervisor::{VcpuX86_64 as VcpuArch, VmX86_64 as VmArch},
+    x86_64::X8664arch as Arch,
+};
+
+use super::ExitState;
+
+pub fn setup_vcpu_signal_handler<T: Vcpu>(use_hypervisor_signals: bool) -> Result<()> {
+    if use_hypervisor_signals {
+        unsafe {
+            extern "C" fn handle_signal(_: c_int) {}
+            // Our signal handler does nothing and is trivially async signal safe.
+            register_rt_signal_handler(SIGRTMIN() + 0, handle_signal)
+                .context("error registering signal handler")?;
+        }
+        block_signal(SIGRTMIN() + 0).context("failed to block signal")?;
+    } else {
+        unsafe {
+            extern "C" fn handle_signal<T: Vcpu>(_: c_int) {
+                T::set_local_immediate_exit(true);
+            }
+            register_rt_signal_handler(SIGRTMIN() + 0, handle_signal::<T>)
+                .context("error registering signal handler")?;
+        }
+    }
+    Ok(())
+}
+
+// Sets up a vcpu and converts it into a runnable vcpu.
+pub fn runnable_vcpu<V>(
+    cpu_id: usize,
+    kvm_vcpu_id: usize,
+    vcpu: Option<V>,
+    vm: impl VmArch,
+    irq_chip: &mut dyn IrqChipArch,
+    vcpu_count: usize,
+    run_rt: bool,
+    vcpu_affinity: Vec<usize>,
+    no_smt: bool,
+    has_bios: bool,
+    use_hypervisor_signals: bool,
+    enable_per_vm_core_scheduling: bool,
+    host_cpu_topology: bool,
+    vcpu_cgroup_tasks_file: Option<File>,
+) -> Result<(V, VcpuRunHandle)>
+where
+    V: VcpuArch,
+{
+    let mut vcpu = match vcpu {
+        Some(v) => v,
+        None => {
+            // If vcpu is None, it means this arch/hypervisor requires create_vcpu to be called from
+            // the vcpu thread.
+            match vm
+                .create_vcpu(kvm_vcpu_id)
+                .context("failed to create vcpu")?
+                .downcast::<V>()
+            {
+                Ok(v) => *v,
+                Err(_) => panic!("VM created wrong type of VCPU"),
+            }
+        }
+    };
+
+    irq_chip
+        .add_vcpu(cpu_id, &vcpu)
+        .context("failed to add vcpu to irq chip")?;
+
+    if !vcpu_affinity.is_empty() {
+        if let Err(e) = set_cpu_affinity(vcpu_affinity) {
+            error!("Failed to set CPU affinity: {}", e);
+        }
+    }
+
+    Arch::configure_vcpu(
+        &vm,
+        vm.get_hypervisor(),
+        irq_chip,
+        &mut vcpu,
+        cpu_id,
+        vcpu_count,
+        has_bios,
+        no_smt,
+        host_cpu_topology,
+    )
+    .context("failed to configure vcpu")?;
+
+    if !enable_per_vm_core_scheduling {
+        // Do per-vCPU core scheduling by setting a unique cookie to each vCPU.
+        if let Err(e) = enable_core_scheduling() {
+            error!("Failed to enable core scheduling: {}", e);
+        }
+    }
+
+    // Move vcpu thread to cgroup
+    if let Some(mut f) = vcpu_cgroup_tasks_file {
+        f.write_all(base::gettid().to_string().as_bytes())
+            .context("failed to write vcpu tid to cgroup tasks")?;
+    }
+
+    if run_rt {
+        const DEFAULT_VCPU_RT_LEVEL: u16 = 6;
+        if let Err(e) = set_rt_prio_limit(u64::from(DEFAULT_VCPU_RT_LEVEL))
+            .and_then(|_| set_rt_round_robin(i32::from(DEFAULT_VCPU_RT_LEVEL)))
+        {
+            warn!("Failed to set vcpu to real time: {}", e);
+        }
+    }
+
+    if use_hypervisor_signals {
+        let mut v = get_blocked_signals().context("failed to retrieve signal mask for vcpu")?;
+        v.retain(|&x| x != SIGRTMIN() + 0);
+        vcpu.set_signal_mask(&v)
+            .context("failed to set the signal mask for vcpu")?;
+    }
+
+    let vcpu_run_handle = vcpu
+        .take_run_handle(Some(SIGRTMIN() + 0))
+        .context("failed to set thread id for vcpu")?;
+
+    Ok((vcpu, vcpu_run_handle))
+}
+
+#[cfg(all(target_arch = "x86_64", feature = "gdb"))]
+fn handle_debug_msg<V>(
+    cpu_id: usize,
+    vcpu: &V,
+    guest_mem: &GuestMemory,
+    d: VcpuDebug,
+    reply_tube: &mpsc::Sender<VcpuDebugStatusMessage>,
+) -> Result<()>
+where
+    V: VcpuArch + 'static,
+{
+    match d {
+        VcpuDebug::ReadRegs => {
+            let msg = VcpuDebugStatusMessage {
+                cpu: cpu_id as usize,
+                msg: VcpuDebugStatus::RegValues(
+                    Arch::debug_read_registers(vcpu as &V)
+                        .context("failed to handle a gdb ReadRegs command")?,
+                ),
+            };
+            reply_tube
+                .send(msg)
+                .context("failed to send a debug status to GDB thread")
+        }
+        VcpuDebug::WriteRegs(regs) => {
+            Arch::debug_write_registers(vcpu as &V, &regs)
+                .context("failed to handle a gdb WriteRegs command")?;
+            reply_tube
+                .send(VcpuDebugStatusMessage {
+                    cpu: cpu_id as usize,
+                    msg: VcpuDebugStatus::CommandComplete,
+                })
+                .context("failed to send a debug status to GDB thread")
+        }
+        VcpuDebug::ReadMem(vaddr, len) => {
+            let msg = VcpuDebugStatusMessage {
+                cpu: cpu_id as usize,
+                msg: VcpuDebugStatus::MemoryRegion(
+                    Arch::debug_read_memory(vcpu as &V, guest_mem, vaddr, len)
+                        .unwrap_or(Vec::new()),
+                ),
+            };
+            reply_tube
+                .send(msg)
+                .context("failed to send a debug status to GDB thread")
+        }
+        VcpuDebug::WriteMem(vaddr, buf) => {
+            Arch::debug_write_memory(vcpu as &V, guest_mem, vaddr, &buf)
+                .context("failed to handle a gdb WriteMem command")?;
+            reply_tube
+                .send(VcpuDebugStatusMessage {
+                    cpu: cpu_id as usize,
+                    msg: VcpuDebugStatus::CommandComplete,
+                })
+                .context("failed to send a debug status to GDB thread")
+        }
+        VcpuDebug::EnableSinglestep => {
+            Arch::debug_enable_singlestep(vcpu as &V)
+                .context("failed to handle a gdb EnableSingleStep command")?;
+            reply_tube
+                .send(VcpuDebugStatusMessage {
+                    cpu: cpu_id as usize,
+                    msg: VcpuDebugStatus::CommandComplete,
+                })
+                .context("failed to send a debug status to GDB thread")
+        }
+        VcpuDebug::SetHwBreakPoint(addrs) => {
+            Arch::debug_set_hw_breakpoints(vcpu as &V, &addrs)
+                .context("failed to handle a gdb SetHwBreakPoint command")?;
+            reply_tube
+                .send(VcpuDebugStatusMessage {
+                    cpu: cpu_id as usize,
+                    msg: VcpuDebugStatus::CommandComplete,
+                })
+                .context("failed to send a debug status to GDB thread")
+        }
+    }
+}
+
+fn vcpu_loop<V>(
+    mut run_mode: VmRunMode,
+    cpu_id: usize,
+    vcpu: V,
+    vcpu_run_handle: VcpuRunHandle,
+    irq_chip: Box<dyn IrqChipArch + 'static>,
+    run_rt: bool,
+    delay_rt: bool,
+    io_bus: devices::Bus,
+    mmio_bus: devices::Bus,
+    requires_pvclock_ctrl: bool,
+    from_main_tube: mpsc::Receiver<VcpuControl>,
+    use_hypervisor_signals: bool,
+    #[cfg(all(target_arch = "x86_64", feature = "gdb"))] to_gdb_tube: Option<
+        mpsc::Sender<VcpuDebugStatusMessage>,
+    >,
+    #[cfg(all(target_arch = "x86_64", feature = "gdb"))] guest_mem: GuestMemory,
+) -> ExitState
+where
+    V: VcpuArch + 'static,
+{
+    let mut interrupted_by_signal = false;
+
+    loop {
+        // Start by checking for messages to process and the run state of the CPU.
+        // An extra check here for Running so there isn't a need to call recv unless a
+        // message is likely to be ready because a signal was sent.
+        if interrupted_by_signal || run_mode != VmRunMode::Running {
+            'state_loop: loop {
+                // Tries to get a pending message without blocking first.
+                let msg = match from_main_tube.try_recv() {
+                    Ok(m) => m,
+                    Err(mpsc::TryRecvError::Empty) if run_mode == VmRunMode::Running => {
+                        // If the VM is running and no message is pending, the state won't
+                        // change.
+                        break 'state_loop;
+                    }
+                    Err(mpsc::TryRecvError::Empty) => {
+                        // If the VM is not running, wait until a message is ready.
+                        match from_main_tube.recv() {
+                            Ok(m) => m,
+                            Err(mpsc::RecvError) => {
+                                error!("Failed to read from main tube in vcpu");
+                                return ExitState::Crash;
+                            }
+                        }
+                    }
+                    Err(mpsc::TryRecvError::Disconnected) => {
+                        error!("Failed to read from main tube in vcpu");
+                        return ExitState::Crash;
+                    }
+                };
+
+                // Collect all pending messages.
+                let mut messages = vec![msg];
+                messages.append(&mut from_main_tube.try_iter().collect());
+
+                for msg in messages {
+                    match msg {
+                        VcpuControl::RunState(new_mode) => {
+                            run_mode = new_mode;
+                            match run_mode {
+                                VmRunMode::Running => break 'state_loop,
+                                VmRunMode::Suspending => {
+                                    // On KVM implementations that use a paravirtualized
+                                    // clock (e.g. x86), a flag must be set to indicate to
+                                    // the guest kernel that a vCPU was suspended. The guest
+                                    // kernel will use this flag to prevent the soft lockup
+                                    // detection from triggering when this vCPU resumes,
+                                    // which could happen days later in realtime.
+                                    if requires_pvclock_ctrl {
+                                        if let Err(e) = vcpu.pvclock_ctrl() {
+                                            error!(
+                                                "failed to tell hypervisor vcpu {} is suspending: {}",
+                                                cpu_id, e
+                                            );
+                                        }
+                                    }
+                                }
+                                VmRunMode::Breakpoint => {}
+                                VmRunMode::Exiting => return ExitState::Stop,
+                            }
+                        }
+                        #[cfg(all(target_arch = "x86_64", feature = "gdb"))]
+                        VcpuControl::Debug(d) => match &to_gdb_tube {
+                            Some(ref ch) => {
+                                if let Err(e) = handle_debug_msg(cpu_id, &vcpu, &guest_mem, d, ch) {
+                                    error!("Failed to handle gdb message: {}", e);
+                                }
+                            }
+                            None => {
+                                error!("VcpuControl::Debug received while GDB feature is disabled: {:?}", d);
+                            }
+                        },
+                        VcpuControl::MakeRT => {
+                            if run_rt && delay_rt {
+                                info!("Making vcpu {} RT\n", cpu_id);
+                                const DEFAULT_VCPU_RT_LEVEL: u16 = 6;
+                                if let Err(e) = set_rt_prio_limit(u64::from(DEFAULT_VCPU_RT_LEVEL))
+                                    .and_then(|_| {
+                                        set_rt_round_robin(i32::from(DEFAULT_VCPU_RT_LEVEL))
+                                    })
+                                {
+                                    warn!("Failed to set vcpu to real time: {}", e);
+                                }
+                            }
+                        }
+                    }
+                }
+            }
+        }
+
+        interrupted_by_signal = false;
+
+        // Vcpus may have run a HLT instruction, which puts them into a state other than
+        // VcpuRunState::Runnable. In that case, this call to wait_until_runnable blocks
+        // until either the irqchip receives an interrupt for this vcpu, or until the main
+        // thread kicks this vcpu as a result of some VmControl operation. In most IrqChip
+        // implementations HLT instructions do not make it to crosvm, and thus this is a
+        // no-op that always returns VcpuRunState::Runnable.
+        match irq_chip.wait_until_runnable(&vcpu) {
+            Ok(VcpuRunState::Runnable) => {}
+            Ok(VcpuRunState::Interrupted) => interrupted_by_signal = true,
+            Err(e) => error!(
+                "error waiting for vcpu {} to become runnable: {}",
+                cpu_id, e
+            ),
+        }
+
+        if !interrupted_by_signal {
+            match vcpu.run(&vcpu_run_handle) {
+                Ok(VcpuExit::IoIn { port, mut size }) => {
+                    let mut data = [0; 8];
+                    if size > data.len() {
+                        error!(
+                            "unsupported IoIn size of {} bytes at port {:#x}",
+                            size, port
+                        );
+                        size = data.len();
+                    }
+                    io_bus.read(port as u64, &mut data[..size]);
+                    if let Err(e) = vcpu.set_data(&data[..size]) {
+                        error!(
+                            "failed to set return data for IoIn at port {:#x}: {}",
+                            port, e
+                        );
+                    }
+                }
+                Ok(VcpuExit::IoOut {
+                    port,
+                    mut size,
+                    data,
+                }) => {
+                    if size > data.len() {
+                        error!(
+                            "unsupported IoOut size of {} bytes at port {:#x}",
+                            size, port
+                        );
+                        size = data.len();
+                    }
+                    io_bus.write(port as u64, &data[..size]);
+                }
+                Ok(VcpuExit::MmioRead { address, size }) => {
+                    let mut data = [0; 8];
+                    mmio_bus.read(address, &mut data[..size]);
+                    // Setting data for mmio cannot fail.
+                    let _ = vcpu.set_data(&data[..size]);
+                }
+                Ok(VcpuExit::MmioWrite {
+                    address,
+                    size,
+                    data,
+                }) => {
+                    mmio_bus.write(address, &data[..size]);
+                }
+                Ok(VcpuExit::IoapicEoi { vector }) => {
+                    if let Err(e) = irq_chip.broadcast_eoi(vector) {
+                        error!(
+                            "failed to broadcast eoi {} on vcpu {}: {}",
+                            vector, cpu_id, e
+                        );
+                    }
+                }
+                Ok(VcpuExit::IrqWindowOpen) => {}
+                Ok(VcpuExit::Hlt) => irq_chip.halted(cpu_id),
+                Ok(VcpuExit::Shutdown) => return ExitState::Stop,
+                Ok(VcpuExit::FailEntry {
+                    hardware_entry_failure_reason,
+                }) => {
+                    error!("vcpu hw run failure: {:#x}", hardware_entry_failure_reason);
+                    return ExitState::Crash;
+                }
+                Ok(VcpuExit::SystemEventShutdown) => {
+                    info!("system shutdown event on vcpu {}", cpu_id);
+                    return ExitState::Stop;
+                }
+                Ok(VcpuExit::SystemEventReset) => {
+                    info!("system reset event");
+                    return ExitState::Reset;
+                }
+                Ok(VcpuExit::SystemEventCrash) => {
+                    info!("system crash event on vcpu {}", cpu_id);
+                    return ExitState::Stop;
+                }
+                #[rustfmt::skip] Ok(VcpuExit::Debug { .. }) => {
+                    #[cfg(all(target_arch = "x86_64", feature = "gdb"))]
+                    {
+                        let msg = VcpuDebugStatusMessage {
+                            cpu: cpu_id,
+                            msg: VcpuDebugStatus::HitBreakPoint,
+                        };
+                        if let Some(ref ch) = to_gdb_tube {
+                            if let Err(e) = ch.send(msg) {
+                                error!("failed to notify breakpoint to GDB thread: {}", e);
+                                return ExitState::Crash;
+                            }
+                        }
+                        run_mode = VmRunMode::Breakpoint;
+                    }
+                }
+                Ok(r) => warn!("unexpected vcpu exit: {:?}", r),
+                Err(e) => match e.errno() {
+                    libc::EINTR => interrupted_by_signal = true,
+                    libc::EAGAIN => {}
+                    _ => {
+                        error!("vcpu hit unknown error: {}", e);
+                        return ExitState::Crash;
+                    }
+                },
+            }
+        }
+
+        if interrupted_by_signal {
+            if use_hypervisor_signals {
+                // Try to clear the signal that we use to kick VCPU if it is pending before
+                // attempting to handle pause requests.
+                if let Err(e) = clear_signal(SIGRTMIN() + 0) {
+                    error!("failed to clear pending signal: {}", e);
+                    return ExitState::Crash;
+                }
+            } else {
+                vcpu.set_immediate_exit(false);
+            }
+        }
+
+        if let Err(e) = irq_chip.inject_interrupts(&vcpu) {
+            error!("failed to inject interrupts for vcpu {}: {}", cpu_id, e);
+        }
+    }
+}
+
+pub fn run_vcpu<V>(
+    cpu_id: usize,
+    kvm_vcpu_id: usize,
+    vcpu: Option<V>,
+    vm: impl VmArch + 'static,
+    mut irq_chip: Box<dyn IrqChipArch + 'static>,
+    vcpu_count: usize,
+    run_rt: bool,
+    vcpu_affinity: Vec<usize>,
+    delay_rt: bool,
+    no_smt: bool,
+    start_barrier: Arc<Barrier>,
+    has_bios: bool,
+    mut io_bus: devices::Bus,
+    mut mmio_bus: devices::Bus,
+    exit_evt: Event,
+    reset_evt: Event,
+    crash_evt: Event,
+    requires_pvclock_ctrl: bool,
+    from_main_tube: mpsc::Receiver<VcpuControl>,
+    use_hypervisor_signals: bool,
+    #[cfg(all(target_arch = "x86_64", feature = "gdb"))] to_gdb_tube: Option<
+        mpsc::Sender<VcpuDebugStatusMessage>,
+    >,
+    enable_per_vm_core_scheduling: bool,
+    host_cpu_topology: bool,
+    vcpu_cgroup_tasks_file: Option<File>,
+) -> Result<JoinHandle<()>>
+where
+    V: VcpuArch + 'static,
+{
+    thread::Builder::new()
+        .name(format!("crosvm_vcpu{}", cpu_id))
+        .spawn(move || {
+            // The VCPU thread must trigger one of `exit_evt`, `reset_evt`, or `crash_evt` in
+            // all paths. A `ScopedEvent`'s Drop implementation ensures that the `exit_evt` will
+            // be sent if anything happens before we get to writing the final event.
+            let scoped_exit_evt = ScopedEvent::from(exit_evt);
+
+            #[cfg(all(target_arch = "x86_64", feature = "gdb"))]
+            let guest_mem = vm.get_memory().clone();
+            let runnable_vcpu = runnable_vcpu(
+                cpu_id,
+                kvm_vcpu_id,
+                vcpu,
+                vm,
+                irq_chip.as_mut(),
+                vcpu_count,
+                run_rt && !delay_rt,
+                vcpu_affinity,
+                no_smt,
+                has_bios,
+                use_hypervisor_signals,
+                enable_per_vm_core_scheduling,
+                host_cpu_topology,
+                vcpu_cgroup_tasks_file,
+            );
+
+            start_barrier.wait();
+
+            let (vcpu, vcpu_run_handle) = match runnable_vcpu {
+                Ok(v) => v,
+                Err(e) => {
+                    error!("failed to start vcpu {}: {:#}", cpu_id, e);
+                    return;
+                }
+            };
+
+            #[allow(unused_mut)]
+            let mut run_mode = VmRunMode::Running;
+            #[cfg(all(target_arch = "x86_64", feature = "gdb"))]
+            if to_gdb_tube.is_some() {
+                // Wait until a GDB client attaches
+                run_mode = VmRunMode::Breakpoint;
+            }
+
+            mmio_bus.set_access_id(cpu_id);
+            io_bus.set_access_id(cpu_id);
+
+            let exit_reason = vcpu_loop(
+                run_mode,
+                cpu_id,
+                vcpu,
+                vcpu_run_handle,
+                irq_chip,
+                run_rt,
+                delay_rt,
+                io_bus,
+                mmio_bus,
+                requires_pvclock_ctrl,
+                from_main_tube,
+                use_hypervisor_signals,
+                #[cfg(all(target_arch = "x86_64", feature = "gdb"))]
+                to_gdb_tube,
+                #[cfg(all(target_arch = "x86_64", feature = "gdb"))]
+                guest_mem,
+            );
+
+            let exit_evt = scoped_exit_evt.into();
+            let final_event = match exit_reason {
+                ExitState::Stop => exit_evt,
+                ExitState::Reset => reset_evt,
+                ExitState::Crash => crash_evt,
+            };
+            if let Err(e) = final_event.write(1) {
+                error!(
+                    "failed to send final event {:?} on vcpu {}: {}",
+                    final_event, cpu_id, e
+                )
+            }
+        })
+        .context("failed to spawn VCPU thread")
+}
+
+/// Signals all running VCPUs to vmexit, sends VcpuControl message to each VCPU tube, and tells
+/// `irq_chip` to stop blocking halted VCPUs. The channel message is sent first because both the
+/// signal and the irq_chip kick could cause the VCPU thread to continue through the VCPU run
+/// loop.
+pub fn kick_all_vcpus(
+    vcpu_handles: &[(JoinHandle<()>, mpsc::Sender<vm_control::VcpuControl>)],
+    irq_chip: &dyn IrqChip,
+    message: VcpuControl,
+) {
+    for (handle, tube) in vcpu_handles {
+        if let Err(e) = tube.send(message.clone()) {
+            error!("failed to send VcpuControl: {}", e);
+        }
+        let _ = handle.kill(SIGRTMIN() + 0);
+    }
+    irq_chip.kick_halted_vcpus();
+}
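
The message handling at the top of `vcpu_loop` polls the channel without blocking while the vCPU is running, and blocks on it only once the vCPU is paused. The sketch below isolates that pattern using only `std::sync::mpsc`. The `Mode` enum and the `next_mode` function are hypothetical stand-ins for `VmRunMode` and the `'state_loop` body, not crosvm APIs. One difference from the code above: this sketch leaves unread messages in the channel when it returns early, instead of collecting them into a `Vec` first.

```rust
use std::sync::mpsc;

// Hypothetical stand-in for VmRunMode; not a crosvm type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode {
    Running,
    Suspended,
    Exiting,
}

/// Processes pending control messages and returns the mode the caller should act on.
/// Returns `None` if every sender has been dropped.
fn next_mode(rx: &mpsc::Receiver<Mode>, mut mode: Mode) -> Option<Mode> {
    loop {
        // Try a non-blocking read first.
        let first = match rx.try_recv() {
            Ok(m) => m,
            // Running and nothing pending: the state cannot have changed.
            Err(mpsc::TryRecvError::Empty) if mode == Mode::Running => return Some(mode),
            // Not running: block until the controller sends something.
            Err(mpsc::TryRecvError::Empty) => rx.recv().ok()?,
            Err(mpsc::TryRecvError::Disconnected) => return None,
        };

        // Apply the first message, then any others that are already queued.
        for m in std::iter::once(first).chain(rx.try_iter()) {
            mode = m;
            match mode {
                Mode::Running | Mode::Exiting => return Some(mode),
                Mode::Suspended => {}
            }
        }
    }
}
```

A caller would invoke `next_mode(&rx, mode)` at the top of each loop iteration and stop when it returns `Some(Mode::Exiting)` or `None`.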