QEMU/KVM VM Execution Basics

This post walks through the code that makes a simple QEMU/KVM virtual machine run. When you execute one of the qemu-system-* commands, QEMU initializes a model of the machine that you asked for. Machines are compositions of devices and their interconnections through buses. Machines, Devices and Buses are central abstractions in the QEMU codebase and go by the names MachineClass, DeviceClass and BusClass respectively.

The entry point for the QEMU system emulator is the main function in vl.c. It is a rather large function that spends most of its energy parsing and handling command-line arguments. Our starting point of interest for understanding QEMU/KVM execution is the call to machine_run_board_init. machine_run_board_init does a few sanity checks, such as ensuring that the requested machine supports the requested processor type, and then calls the initialization function for the requested machine type via machine_class->init.
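
To make the dispatch concrete, here is a minimal sketch of the "sanity check, then call through the class's init pointer" pattern. It is not QEMU code: ToyMachine, ToyMachineClass, toy_run_board_init and the CPU type strings are hypothetical names invented for illustration, and the real machine_run_board_init does considerably more.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for MachineState and MachineClass. */
typedef struct ToyMachine ToyMachine;

typedef struct ToyMachineClass {
    const char *name;
    const char *const *valid_cpu_types; /* NULL-terminated list, or NULL for "any" */
    void (*init)(ToyMachine *machine);  /* board-specific initialization */
} ToyMachineClass;

struct ToyMachine {
    const ToyMachineClass *klass;
    const char *cpu_type;
    int initialized;
};

/* Returns 0 on success, -1 if the board does not support the CPU type. */
int toy_run_board_init(ToyMachine *machine)
{
    const ToyMachineClass *mc = machine->klass;

    assert(mc != NULL && mc->init != NULL);

    if (mc->valid_cpu_types != NULL) {
        int found = 0;
        for (size_t i = 0; mc->valid_cpu_types[i] != NULL; i++) {
            if (strcmp(mc->valid_cpu_types[i], machine->cpu_type) == 0) {
                found = 1;
                break;
            }
        }
        if (!found) {
            return -1; /* sanity check failed, init is never called */
        }
    }

    mc->init(machine); /* plays the role of machine_class->init(machine) */
    return 0;
}

/* A toy board whose init function only records that it ran. */
void toy_pc_init(ToyMachine *machine)
{
    machine->initialized = 1;
}

const char *const toy_pc_cpu_types[] = { "toy-x86_64-cpu", NULL };

const ToyMachineClass toy_pc_class = {
    .name = "toy-pc",
    .valid_cpu_types = toy_pc_cpu_types,
    .init = toy_pc_init,
};
```

Calling toy_run_board_init on a ToyMachine whose klass is &toy_pc_class and whose cpu_type is "toy-x86_64-cpu" runs toy_pc_init; any other cpu_type is rejected before the board code is reached.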

There are many machine models that come stock with QEMU. In this article, we will be using the default x86_64 machine model, which goes by the name pc-i440fx. Its initialization function is called pc_init1. To understand how this function gets mapped into the init member of the machine_class instance found in machine_run_board_init, have a look at the DEFINE_I440FX_MACHINE and DEFINE_PC_MACHINE macros:

#define DEFINE_I440FX_MACHINE(suffix, name, compatfn, optionfn) \
  static void pc_init_##suffix(MachineState *machine)           \
  {                                                             \
    void (*compat)(MachineState *m) = (compatfn);               \
    if (compat) {                                               \
        compat(machine);                                        \
    }                                                           \
    pc_init1(machine, TYPE_I440FX_PCI_HOST_BRIDGE,              \
              TYPE_I440FX_PCI_DEVICE);                          \
  }                                                             \
  DEFINE_PC_MACHINE(suffix, name, pc_init_##suffix, optionfn)

#define DEFINE_PC_MACHINE(suffix, namestr, initfn, optsfn)                  \
  static void pc_machine_##suffix##_class_init(ObjectClass *oc, void *data) \
  {                                                                         \
    MachineClass *mc = MACHINE_CLASS(oc);                                   \
    optsfn(mc);                                                             \
    mc->init = initfn;                                                      \
  }                                                                         \
  static const TypeInfo pc_machine_type_##suffix = {                        \
    .name       = namestr TYPE_MACHINE_SUFFIX,                              \
    .parent     = TYPE_PC_MACHINE,                                          \
    .class_init = pc_machine_##suffix##_class_init,                         \
  };                                                                        \
  static void pc_machine_init_##suffix(void)                                \
  {                                                                         \
      type_register(&pc_machine_type_##suffix);                             \
  }                                                                         \
  type_init(pc_machine_init_##suffix)

So we can see here that the init function for the i440fx is synthesized at compile time by the macro, but the real work is done by pc_init1, which the synthesized function calls.
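
The same trick can be shown at a much smaller scale. The sketch below is a hypothetical miniature of the two macros above: DEFINE_TOY_BOARD, ToyBoard and the other toy_ names are invented for illustration, and the QOM type registration step (type_register and type_init) is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyBoard {
    int compat_applied;
    int init_calls;
} ToyBoard;

typedef void (*ToyBoardFn)(ToyBoard *board);

typedef struct ToyBoardClass {
    const char *name;
    ToyBoardFn init;
} ToyBoardClass;

/* Shared worker, playing the role of pc_init1. */
void toy_board_init_common(ToyBoard *board)
{
    assert(board != NULL);
    board->init_calls++;
}

/*
 * Generates toy_init_<suffix>(), which runs an optional compat hook and then
 * the shared worker, plus a class descriptor whose init member points at the
 * generated function.
 */
#define DEFINE_TOY_BOARD(suffix, namestr, compatfn)      \
    void toy_init_##suffix(ToyBoard *board)              \
    {                                                    \
        ToyBoardFn compat = (compatfn);                  \
        if (compat) {                                    \
            compat(board);                               \
        }                                                \
        toy_board_init_common(board);                    \
    }                                                    \
    const ToyBoardClass toy_board_class_##suffix = {     \
        .name = namestr,                                 \
        .init = toy_init_##suffix,                       \
    }

/* A compat hook that only records that it ran. */
void toy_compat_v1(ToyBoard *board)
{
    board->compat_applied = 1;
}

/* Two board variants generated from the same macro. */
DEFINE_TOY_BOARD(v1, "toy-board-1.0", toy_compat_v1);
DEFINE_TOY_BOARD(v2, "toy-board-2.0", NULL);
```

Both toy_board_class_v1.init and toy_board_class_v2.init end up in toy_board_init_common; only the v1 variant runs the compat hook first, which mirrors how versioned machine types can share one worker function.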

vCPU Initialization

Our first point of focus in the pc_init1 function is the call to pc_cpus_init:

void pc_cpus_init(PCMachineState *pcms)
{
  int i;
  const CPUArchIdList *possible_cpus;
  MachineState *ms = MACHINE(pcms);
  MachineClass *mc = MACHINE_GET_CLASS(pcms);

  pcms->apic_id_limit = x86_cpu_apic_id_from_index(max_cpus - 1) + 1;
  possible_cpus = mc->possible_cpu_arch_ids(ms);
  for (i = 0; i < smp_cpus; i++) {
    pc_new_cpu(possible_cpus->cpus[i].type, possible_cpus->cpus[i].arch_id,
        &error_fatal);
  }
}

Here we can see the machine initialization code reading the SMP topology information provided by the user (either explicitly through the -smp argument of qemu-system-* or implicitly through defaults) to create the correct number of virtual CPUs (vCPUs). The pc_new_cpu function follows:

static void pc_new_cpu(const char *typename, int64_t apic_id, Error **errp)
{
    Object *cpu = NULL;
    Error *local_err = NULL;

    cpu = object_new(typename);

    object_property_set_uint(cpu, apic_id, "apic-id", &local_err);
    object_property_set_bool(cpu, true, "realized", &local_err);

    object_unref(cpu);
    error_propagate(errp, local_err);
}

Notice a few things. The CPU device is not special: it is just a regular QEMU device (qdev) created through the object_new factory with a typename. Notice also that we do not see an explicit call to a realize function here. This is because the code handles the CPU through the most generic object type, called Object. Objects support an arbitrary set of properties that come with getters and setters, so they are quite extensible. Here we focus on the setting of the boolean property “realized” on the cpu object in the code above.
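
The idea that setting a property can run code is easy to sketch in isolation. The following is a toy model, not the QOM implementation: ToyObject, toy_property_add_bool, toy_property_set_bool and toy_set_realized are hypothetical names, and error handling is omitted. The setter registered under the name "realized" performs the one-time realize step as a side effect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct ToyObject ToyObject;
typedef void (*ToyBoolSetter)(ToyObject *obj, bool value);

typedef struct ToyBoolProperty {
    const char *name;
    ToyBoolSetter set;
} ToyBoolProperty;

#define TOY_MAX_PROPS 4

struct ToyObject {
    ToyBoolProperty props[TOY_MAX_PROPS];
    size_t nprops;
    bool realized;
    int realize_calls;
};

/* Registers a named boolean property together with its setter. */
void toy_property_add_bool(ToyObject *obj, const char *name, ToyBoolSetter set)
{
    assert(obj->nprops < TOY_MAX_PROPS);
    obj->props[obj->nprops].name = name;
    obj->props[obj->nprops].set = set;
    obj->nprops++;
}

/* Looks the property up by name and runs its setter; false if not found. */
bool toy_property_set_bool(ToyObject *obj, bool value, const char *name)
{
    for (size_t i = 0; i < obj->nprops; i++) {
        if (strcmp(obj->props[i].name, name) == 0 && obj->props[i].set != NULL) {
            obj->props[i].set(obj, value);
            return true;
        }
    }
    return false;
}

/* Setter for "realized": the first false-to-true transition does the work. */
void toy_set_realized(ToyObject *obj, bool value)
{
    if (value && !obj->realized) {
        obj->realize_calls++; /* stand-in for calling the device's realize */
    }
    obj->realized = value;
}

/* Instance initializer: every toy device gets a "realized" property. */
void toy_device_init(ToyObject *obj)
{
    *obj = (ToyObject){ 0 };
    toy_property_add_bool(obj, "realized", toy_set_realized);
}
```

After toy_device_init, calling toy_property_set_bool(&obj, true, "realized") runs toy_set_realized, and repeating the call does not realize the object a second time.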

The CPU is a special type of object called a qdev. All qdev devices are initialized with a few basic properties through their initializer function, device_initfn:

 1  static void device_initfn(Object *obj)
 2  {
 3      DeviceState *dev = DEVICE(obj);
 4      ObjectClass *class;
 5      Property *prop;
 6
 7      if (qdev_hotplug) {
 8          dev->hotplugged = 1;
 9          qdev_hot_added = true;
10      }
11
12      dev->instance_id_alias = -1;
13      dev->realized = false;
14
15      object_property_add_bool(obj, "realized",
16                               device_get_realized, device_set_realized, NULL);
17      object_property_add_bool(obj, "hotpluggable",
18                               device_get_hotpluggable, NULL, NULL);
19      object_property_add_bool(obj, "hotplugged",
20                               device_get_hotplugged, NULL,
21                               &error_abort);
22
23      class = object_get_class(OBJECT(dev));
24      do {
25          for (prop = DEVICE_CLASS(class)->props; prop && prop->name; prop++) {
26              qdev_property_add_legacy(dev, prop, &error_abort);
27              qdev_property_add_static(dev, prop, &error_abort);
28          }
29          class = object_class_get_parent(class);
30      } while (class != object_class_by_name(TYPE_DEVICE));
31
32      object_property_add_link(OBJECT(dev), "parent_bus", TYPE_BUS,
33                               (Object **)&dev->parent_bus, NULL, 0,
34                               &error_abort);
35      QLIST_INIT(&dev->gpios);
36  }

The particular object property we are interested in is the “realized” property, which is added on line 15. Here we see that the setter function provided is device_set_realized. There is quite a bit going on in device_set_realized; the part we are interested in is the actual call that realizes the device, which happens at line 913 in the original source (the dc->realize call at the end of the excerpt below).

static void device_set_realized(Object *obj, bool value, Error **errp)
{
  DeviceState *dev = DEVICE(obj);
  DeviceClass *dc = DEVICE_GET_CLASS(dev);
  HotplugHandler *hotplug_ctrl;
  BusState *bus;
  Error *local_err = NULL;
  bool unattached_parent = false;
  static int unattached_count;

  if (dev->hotplugged && !dc->hotpluggable) {
    error_setg(errp, QERR_DEVICE_NO_HOTPLUG, object_get_typename(obj));
    return;
  }

  if (value && !dev->realized) {
    if (!check_only_migratable(obj, &local_err)) {
      goto fail;
    }

    if (!obj->parent) {
      gchar *name = g_strdup_printf("device[%d]", unattached_count++);

      object_property_add_child(container_get(qdev_get_machine(),
            "/unattached"),
          name, obj, &error_abort);
      unattached_parent = true;
      g_free(name);
    }

    hotplug_ctrl = qdev_get_hotplug_handler(dev);
    if (hotplug_ctrl) {
      hotplug_handler_pre_plug(hotplug_ctrl, dev, &local_err);
      if (local_err != NULL) {
        goto fail;
      }
    }

    if (dc->realize) {
      dc->realize(dev, &local_err);
    }

    // ...

Now the question arises: what does this realize function actually do? To find out, let’s first take a look at how the realize function of the x86 CPU is plumbed. This takes place in target/i386/cpu.c:

static void x86_cpu_common_class_init(ObjectClass *oc, void *data)
{
     X86CPUClass *xcc = X86_CPU_CLASS(oc);
     CPUClass *cc = CPU_CLASS(oc);
     DeviceClass *dc = DEVICE_CLASS(oc);

     xcc->parent_realize = dc->realize;
     xcc->parent_unrealize = dc->unrealize;
     dc->realize = x86_cpu_realizefn;
     dc->unrealize = x86_cpu_unrealizefn;
     dc->props = x86_cpu_properties;
     
     //...
}

Here we can see that the device class realize function points to x86_cpu_realizefn. Following x86_cpu_realizefn, we can see how vCPUs are actually created. QEMU can implement vCPUs in several ways. On Linux systems with processors that support hardware virtualization (the vast majority of processors found in workstations and servers these days), the common choice is KVM. KVM is a Linux kernel module that provides, among other things, highly efficient vCPUs for virtual machines by taking advantage of instructions in modern processors specifically designed to support efficient virtualization. KVM is the mechanism we will be looking at here.
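
Returning briefly to the class_init excerpt above: saving dc->realize into parent_realize before overwriting it is a save-and-replace pattern that lets a subclass extend, rather than discard, its parent's behavior. Here is a minimal sketch of that pattern. The names (ToyDev, ToyDevClass, ToyCpuClass and the toy_ functions) are hypothetical, and the ordering inside toy_cpu_realize is an assumption of the sketch, not a statement about what x86_cpu_realizefn does.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyDev ToyDev;
typedef void (*ToyRealize)(ToyDev *dev);

/* Base class: holds the realize hook, like DeviceClass. */
typedef struct ToyDevClass {
    ToyRealize realize;
} ToyDevClass;

/* Subclass: embeds the base class and remembers the parent's realize. */
typedef struct ToyCpuClass {
    ToyDevClass parent_class;
    ToyRealize parent_realize;
} ToyCpuClass;

struct ToyDev {
    ToyCpuClass *klass;
    int base_realized;
    int cpu_realized;
};

void toy_base_realize(ToyDev *dev)
{
    dev->base_realized++;
}

void toy_base_class_init(ToyDevClass *dc)
{
    dc->realize = toy_base_realize;
}

/* Subclass realize: do the CPU-specific work, then chain to the parent. */
void toy_cpu_realize(ToyDev *dev)
{
    dev->cpu_realized++;
    assert(dev->klass->parent_realize != NULL);
    dev->klass->parent_realize(dev);
}

/* Save the inherited hook first, then install the override. */
void toy_cpu_class_init(ToyCpuClass *cc)
{
    ToyDevClass *dc = &cc->parent_class;

    cc->parent_realize = dc->realize;
    dc->realize = toy_cpu_realize;
}
```

Once both class initializers have run, calling the class's realize pointer on a ToyDev executes toy_cpu_realize followed by toy_base_realize, so generic code that only knows about ToyDevClass still triggers the subclass logic.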

The code path that creates a KVM vCPU from QEMU is the following.

| target/i386/cpu.c          | x86_cpu_realizefn
| target/i386/cpu.c          | qemu_init_vcpu
| cpus.c                     | qemu_kvm_start_vcpu
| cpus.c                     | qemu_thread_create
| util/qemu-thread-posix.c   | pthread_create
| ~~>  cpus.c                | qemu_kvm_cpu_thread_fn     # passed as parameter to qemu_thread_create
|      cpus.c                | kvm_init_vcpu
|      cpus.c                | kvm_init_cpu_signals
|   -->cpus.c                | cpu_can_run
|   :  cpus.c                | kvm_cpu_exec
|   :  cpus.c                | qemu_wait_io_event
|   ---cpus.c                | cpu_can_run

Crossing into KVM

This article is a work in progress. Next up, I will cover the QEMU/KVM interaction through vCPUs and vmrun.