The shape of VM to come

Install a Virtual Machine using qemu

Install qemu

Get qemu:

git clone https://gitlab.com/qemu-project/qemu.git

Build and install qemu:

cd qemu
mkdir build
cd build
../configure --enable-slirp
make -j
sudo make install
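
To check that the freshly built binary is the one on your PATH (the version string will vary with the commit you cloned):

# Both should print the version of the build you just installed.
qemu-system-x86_64 --version
qemu-img --version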

Install a VM using qemu

Get a Debian 12.2.0 netinst image that we want to boot as a virtual machine:
wget https://www.debian.org/distrib/netinst/debian-12.2.0-amd64-netinst.iso

Create a disk image (qcow2 format) where the VM will store its data:
qemu-img create -f qcow2 mydisk.img 20G
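
You can confirm the image was created as expected:

# Format should be qcow2 with a 20G virtual size; the file on disk
# stays tiny until the guest actually writes data.
qemu-img info mydisk.img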

Install Debian on the VM with qemu:

qemu-system-x86_64 -boot d -cdrom debian-12.2.0-amd64-netinst.iso -m 4G \
-device e1000,netdev=net0,mac=52:54:00:12:34:56 -netdev user,id=net0,hostfwd=tcp::10022-:22 \
-hda mydisk.img -accel kvm

Follow all instructions from the installer and you're done. -accel kvm speeds up the installation considerably (from 1h30 to 20min in my case).
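
The hostfwd option above forwards host port 10022 to the guest's port 22, so once the guest runs an SSH server you can log in from the host. A minimal sketch, assuming you created a user named debian during installation:

# Reach the guest's sshd through the forwarded host port.
ssh -p 10022 debian@localhost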

Launch our new VM

Let's say we want to run Debian with 8 GB of RAM:
qemu-system-x86_64 -hda mydisk.img -m 8G -accel kvm

A VM can consume a lot of resources, which slows it down. We can lighten the load by disabling the graphical interface: open a terminal within the VM and run

sudo systemctl set-default multi-user.target
sudo reboot

Just in case, you can re-enable it with:

sudo systemctl set-default graphical.target
sudo reboot

Snapshot of an Image

Before going further, we can preserve mydisk.img by creating a snapshot on top of it, thanks to the qcow2 format:

qemu-img create -f qcow2 -b mydisk.img -F qcow2 snapshot.img

'mydisk.img' should not be modified anymore: since it is now a backing file, any change to it could corrupt the snapshots.
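
You can check the backing-file relationship at any time:

# snapshot.img should report mydisk.img as its backing file.
qemu-img info --backing-chain snapshot.img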

Set some architecture features using qemu

Using qemu, let's set our VM's hardware with 4 NUMA nodes, each with 4 CPUs and 4, 2, 1 and 1 GB of memory respectively:

qemu-system-x86_64 -hda snapshot.img -m 8G \
        -accel kvm \
        -smp cpus=16 \
        -object memory-backend-ram,size=4G,id=ram0 \
        -object memory-backend-ram,size=2G,id=ram1 \
        -object memory-backend-ram,size=1G,id=ram2 \
        -object memory-backend-ram,size=1G,id=ram3 \
        -numa node,nodeid=0,memdev=ram0,cpus=0-3 \
        -numa node,nodeid=1,memdev=ram1,cpus=4-7 \
        -numa node,nodeid=2,memdev=ram2,cpus=8-11 \
        -numa node,nodeid=3,memdev=ram3,cpus=12-15
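
From inside the guest, you can verify the topology (assuming the numactl package is installed):

# Should report 4 nodes with 4 CPUs each and 4G/2G/1G/1G of memory.
numactl --hardware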

Add an nvdimm node

qemu-system-x86_64 -hda img/snapshot.img -accel kvm \
        -device e1000,netdev=net0,mac=52:54:00:12:34:56 -netdev user,id=net0,hostfwd=tcp::10022-:22 \
        -machine pc,nvdimm=on \
        -m 8G,slots=1,maxmem=9G \
        -smp cpus=16 \
        -object memory-backend-ram,size=4G,id=ram0 \
        -object memory-backend-ram,size=2G,id=ram1 \
        -object memory-backend-ram,size=1G,id=ram2 \
        -object memory-backend-ram,size=1G,id=ram3 \
        -device nvdimm,id=nvdimm1,memdev=nvdimm1,unarmed=off,node=4 \
        -object memory-backend-file,id=nvdimm1,share=on,mem-path=img/nvdimm.img,size=1G \
        -numa node,nodeid=0,memdev=ram0,cpus=0-3 \
        -numa node,nodeid=1,memdev=ram1,cpus=4-7 \
        -numa node,nodeid=2,memdev=ram2,cpus=8-11 \
        -numa node,nodeid=3,memdev=ram3,cpus=12-15 \
        -numa node,nodeid=4

By running the command ndctl list -NRD, we can list the active and enabled nvdimm devices:

{
  "dimms":[
    {
      "dev":"nmem0",
      "id":"8680-56341200",
      "handle":1,
      "phys_id":0
    }
  ],
  "regions":[
    {
      "dev":"region0",
      "size":1073741824,
      "align":16777216,
      "available_size":0,
      "max_available_extent":0,
      "type":"pmem",
      "mappings":[
        {
          "dimm":"nmem0",
          "offset":0,
          "length":1073741824,
          "position":0
        }
      ],
      "persistence_domain":"unknown",
      "namespaces":[
        {
          "dev":"namespace0.0",
          "mode":"raw",
          "size":1073741824,
          "sector_size":512,
          "blockdev":"pmem0"
        }
      ]
    }
  ]
}

By default, namespaceX.Y (here namespace0.0) is set to raw mode, which means the nvdimm device acts as a memory disk without DAX support. We need to disable the namespace, create a new one in devdax mode, and finally reconfigure it as system RAM with the following commands:

sudo ndctl disable-namespace namespace0.0
sudo ndctl create-namespace -m devdax
sudo daxctl reconfigure-device -m system-ram all --force

Node 4 is now configured as devdax:

{
  "dimms":[
    {
      "dev":"nmem0",
      "id":"8680-56341200",
      "handle":1,
      "phys_id":0
    }
  ],
  "regions":[
    {
      "dev":"region0",
      "size":1073741824,
      "align":16777216,
      "available_size":0,
      "max_available_extent":0,
      "type":"pmem",
      "mappings":[
        {
          "dimm":"nmem0",
          "offset":0,
          "length":1073741824,
          "position":0
        }
      ],
      "persistence_domain":"unknown",
      "namespaces":[
        {
          "dev":"namespace0.0",
          "mode":"devdax",
          "map":"dev",
          "size":1054867456,
          "uuid":"ed8bb2a9-41fb-48e0-a0b2-7dbf0d9ca9ba",
          "chardev":"dax0.0",
          "align":2097152
        }
      ]
    }
  ]
}
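
After the reconfiguration, the new memory should show up as an extra NUMA node in the guest (again assuming numactl is installed):

# Node 4 should now appear, with its memory onlined as system RAM.
numactl --hardware
daxctl list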

CXL

To be safe, we work with a recent Linux kernel: 6.7.0-rc3+ in our case.

Persistent memory example

First we need a CXL host bridge (PCI eXpander Bridge, i.e., pxb-cxl, "cxl.1" here), then we attach a root port (cxl-rp, "root_port13" here), then a Type 3 device.
In this case it is a pmem device, so it needs two "memory-backend-file" objects, one for the memory ("pmem0" here) and one for its label storage area (LSA, i.e., "cxl-lsa0"). Finally we need a Fixed Memory Window (FMW, i.e., cxl-fmw) to map that memory in the host:

qemu-system-x86_64 -hda img/snapshot.img -accel kvm \
        -machine q35,nvdimm=on,cxl=on \
        -device e1000,netdev=net0,mac=52:54:00:12:34:56 \
        -netdev user,id=net0,hostfwd=tcp::10022-:22 \
        -m 4G,slots=8,maxmem=8G \
        -smp 4 \
        -object memory-backend-ram,size=4G,id=mem0 \
        -numa node,nodeid=0,cpus=0-3,memdev=mem0 \
        -object memory-backend-file,id=pmem0,share=on,mem-path=/tmp/cxltest.raw,size=256M \
        -object memory-backend-file,id=cxl-lsa0,share=on,mem-path=/tmp/lsa.raw,size=256M \
        -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
        -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
        -device cxl-type3,bus=root_port13,persistent-memdev=pmem0,lsa=cxl-lsa0,id=cxl-pmem0 \
        -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G
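
qemu normally creates these backing files itself; if it complains that they are missing or have the wrong size, you can pre-create them on the host before launching:

# Create the pmem and LSA backing files as 256M sparse files.
truncate -s 256M /tmp/cxltest.raw
truncate -s 256M /tmp/lsa.raw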

We need to create the region using cxl create-region and make it available as a NUMA node:

sudo cxl create-region -m -d decoder0.0 -t pmem mem0
sudo daxctl reconfigure-device -m system-ram dax0.0 --force
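
You can inspect the result from the guest:

# Show the newly created region and the dax device now exposed as system RAM.
cxl list -R
daxctl list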

Volatile memory example, with interleaving and switch

Let's build a VM with 2 sockets. Each socket has 2 CPUs, 2 CXL devices and 1 switch.
We need one PXB per socket, with 2 root ports (RP) each. A switch is installed on each socket: 1 upstream port and 2 downstream ports. The root ports that host a switch upstream port have to be attached on slot 0, so we give each socket its own chassis number to tell them apart. In this case the devices are volatile (vmem), so each needs a "memory-backend-ram" object (2 per socket). Finally we set 2 Fixed Memory Windows to map both memories in the host:

qemu-system-x86_64 -hda img/snapshot.img -accel kvm \
        -machine q35,nvdimm=on,cxl=on \
        -device e1000,netdev=net0,mac=52:54:00:12:34:56 \
        -netdev user,id=net0,hostfwd=tcp::10022-:22 \
        -m 2G,slots=8,maxmem=10G \
        -smp cpus=4,cores=2,sockets=2 \
        -object memory-backend-ram,size=1G,id=ram0 \
        -object memory-backend-ram,size=1G,id=ram1 \
        -object memory-backend-ram,id=cxl-mem0,share=on,size=256M \
        -object memory-backend-ram,id=cxl-mem1,share=on,size=256M \
        -object memory-backend-ram,id=cxl-mem2,share=on,size=256M \
        -object memory-backend-ram,id=cxl-mem3,share=on,size=256M \
        -numa node,nodeid=0,cpus=0-1,memdev=ram0 \
        -numa node,nodeid=1,cpus=2-3,memdev=ram1 \
        -device pxb-cxl,numa_node=0,bus_nr=24,bus=pcie.0,id=pxb-cxl.1 \
        -device pxb-cxl,numa_node=1,bus_nr=32,bus=pcie.0,id=pxb-cxl.2 \
        -device cxl-rp,port=0,bus=pxb-cxl.1,id=root_port1,chassis=0,slot=0 \
        -device cxl-rp,port=1,bus=pxb-cxl.1,id=root_port2,chassis=0,slot=1 \
        -device cxl-rp,port=2,bus=pxb-cxl.2,id=root_port3,chassis=1,slot=0 \
        -device cxl-rp,port=3,bus=pxb-cxl.2,id=root_port4,chassis=1,slot=2 \
        -device cxl-upstream,bus=root_port1,id=us0 \
        -device cxl-upstream,bus=root_port3,id=us1 \
        -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=3 \
        -device cxl-type3,bus=swport0,volatile-memdev=cxl-mem0,id=cxl-vmem0 \
        -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=4 \
        -device cxl-type3,bus=swport1,volatile-memdev=cxl-mem1,id=cxl-vmem1 \
        -device cxl-downstream,port=2,bus=us1,id=swport2,chassis=1,slot=5 \
        -device cxl-type3,bus=swport2,volatile-memdev=cxl-mem2,id=cxl-vmem2 \
        -device cxl-downstream,port=3,bus=us1,id=swport3,chassis=1,slot=6 \
        -device cxl-type3,bus=swport3,volatile-memdev=cxl-mem3,id=cxl-vmem3 \
        -M cxl-fmw.0.targets.0=pxb-cxl.1,cxl-fmw.0.size=4G,cxl-fmw.1.targets.0=pxb-cxl.2,cxl-fmw.1.size=4G

Here, we selected root_port1 and root_port3 to be plugged on slot 0 of chassis 0 and chassis 1 respectively. The bus_nr values of the PXBs may lead to error messages if they are already in use; just change them to other values.
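
Before using the cxl tool, you can check that the guest actually enumerated the devices (bus numbers and device names will differ on your setup):

# Type 3 devices show up as PCI memory devices of class 0502 ("CXL").
lspci | grep -i cxl
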
From the VM, list the CXL memory devices with cxl list -M:

[
  {
    "memdev":"mem1",
    "ram_size":268435456,
    "serial":0,
    "numa_node":1,
    "host":"0000:23:00.0"
  },
  {
    "memdev":"mem0",
    "ram_size":268435456,
    "serial":0,
    "numa_node":1,
    "host":"0000:24:00.0"
  },
  {
    "memdev":"mem2",
    "ram_size":268435456,
    "serial":0,
    "numa_node":0,
    "host":"0000:1b:00.0"
  },
  {
    "memdev":"mem3",
    "ram_size":268435456,
    "serial":0,
    "numa_node":0,
    "host":"0000:1c:00.0"
  }
]

We can list decoders available with cxl list -D:

[
  {
    "root decoders":[
      {
        "decoder":"decoder0.0",
        "size":4294967296,
        "interleave_ways":1,
        "max_available_extent":-17985175553,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1
      },
      {
        "decoder":"decoder0.1",
        "size":4294967296,
        "interleave_ways":1,
        "max_available_extent":-22280142849,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1
      }
    ]
  }
]

We assemble a CXL region with the cxl create-region command. We need to select the decoder under which the region will be created, along with the CXL devices it contains. Below, we first assemble mem1 and mem0, located under decoder0.1, with 2-way interleaving:

sudo cxl create-region -m -d decoder0.1 -t ram -w 2 mem1 mem0

Then we assemble mem2 and mem3 under decoder0.0, each with 1-way interleaving:

sudo cxl create-region -m -d decoder0.0 -t ram -w 1 mem2
sudo cxl create-region -m -d decoder0.0 -t ram -w 1 mem3

We can see they are now available with the command daxctl list:

[
  {
    "chardev":"dax1.0",
    "size":268435456,
    "target_node":3,
    "align":2097152,
    "mode":"system-ram"
  },
  {
    "chardev":"dax3.0",
    "size":268435456,
    "target_node":3,
    "align":2097152,
    "mode":"system-ram"
  },
  {
    "chardev":"dax0.0",
    "size":536870912,
    "target_node":2,
    "align":2097152,
    "mode":"system-ram"
  }
]

New DAX devices should appear under /sys/bus/dax/devices. By default, new NUMA nodes appear offline; run daxctl online-memory all to bring them online, as shown below.
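
Putting it together:

# Online all dax-backed memory, then check the resulting NUMA layout.
sudo daxctl online-memory all
numactl --hardware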

Both pmem and vmem

Let's build a VM with 4 sockets: one socket with only CPUs, one with a CXL pmem device, one with 2 CXL devices 2-way interleaved, and one with 2 CXL devices each 1-way interleaved:

qemu-system-x86_64 -hda img/snapshot.img -accel kvm \
        -machine q35,nvdimm=on,cxl=on \
        -device e1000,netdev=net0,mac=52:54:00:12:34:56 \
        -netdev user,id=net0,hostfwd=tcp::10022-:22 \
        -m 4G,slots=8,maxmem=10G \
        -smp cpus=8,cores=2,sockets=4 \
        -object memory-backend-ram,size=1G,id=ram0 \
        -object memory-backend-ram,size=1G,id=ram1 \
        -object memory-backend-ram,size=1G,id=ram2 \
        -object memory-backend-ram,size=1G,id=ram3 \
        -object memory-backend-ram,id=cxl-mem0,share=on,size=256M \
        -object memory-backend-ram,id=cxl-mem1,share=on,size=256M \
        -object memory-backend-ram,id=cxl-mem2,share=on,size=256M \
        -object memory-backend-ram,id=cxl-mem3,share=on,size=256M \
        -object memory-backend-file,id=cxl-mem4,share=on,mem-path=/tmp/cxltest.raw,size=256M \
        -object memory-backend-file,id=cxl-lsa4,share=on,mem-path=/tmp/lsa.raw,size=256M \
        -numa node,nodeid=0,cpus=0-1,memdev=ram0 \
        -numa node,nodeid=1,cpus=2-3,memdev=ram1 \
        -numa node,nodeid=2,cpus=4-5,memdev=ram2 \
        -numa node,nodeid=3,cpus=6-7,memdev=ram3 \
        -device pxb-cxl,numa_node=0,bus_nr=24,bus=pcie.0,id=pxb-cxl.1 \
        -device pxb-cxl,numa_node=1,bus_nr=32,bus=pcie.0,id=pxb-cxl.2 \
        -device pxb-cxl,numa_node=3,bus_nr=40,bus=pcie.0,id=pxb-cxl.3 \
        -device cxl-rp,port=0,bus=pxb-cxl.1,id=root_port1,chassis=0,slot=0 \
        -device cxl-rp,port=1,bus=pxb-cxl.1,id=root_port2,chassis=0,slot=3 \
        -device cxl-rp,port=2,bus=pxb-cxl.2,id=root_port3,chassis=1,slot=0 \
        -device cxl-rp,port=3,bus=pxb-cxl.2,id=root_port4,chassis=1,slot=5 \
        -device cxl-rp,port=0,bus=pxb-cxl.3,id=root_port5,chassis=2,slot=0 \
        -device cxl-upstream,bus=root_port1,id=us0 \
        -device cxl-upstream,bus=root_port3,id=us1 \
        -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=7 \
        -device cxl-type3,bus=swport0,volatile-memdev=cxl-mem0,id=cxl-vmem0 \
        -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=8 \
        -device cxl-type3,bus=swport1,volatile-memdev=cxl-mem1,id=cxl-vmem1 \
        -device cxl-downstream,port=2,bus=us1,id=swport2,chassis=1,slot=9 \
        -device cxl-type3,bus=swport2,volatile-memdev=cxl-mem2,id=cxl-vmem2 \
        -device cxl-downstream,port=3,bus=us1,id=swport3,chassis=1,slot=10 \
        -device cxl-type3,bus=swport3,volatile-memdev=cxl-mem3,id=cxl-vmem3 \
        -device cxl-type3,bus=root_port5,persistent-memdev=cxl-mem4,lsa=cxl-lsa4,id=cxl-pmem0 \
        -M cxl-fmw.0.targets.0=pxb-cxl.1,cxl-fmw.0.size=4G,cxl-fmw.1.targets.0=pxb-cxl.2,cxl-fmw.1.size=4G,cxl-fmw.2.targets.0=pxb-cxl.3,cxl-fmw.2.size=512M

TIP: How to identify which decoder corresponds to which device.
When listing with cxl list -Dv, look at the target id. Here decoder0.0 has id=24, which is the bus number attached to a node: in our previous qemu script, bus_nr=24 corresponds to numa_node=0.

"decoders:root0":[
      {
        "decoder":"decoder0.0",
        "size":4294967296,
        "interleave_ways":1,
        "max_available_extent":4294967296,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1,
        "targets":[
          {
            "target":"pci0000:18",
            "alias":"ACPI0016:02",
            "position":0,
            "id":24
          }
        ]
      },
      {
        "decoder":"decoder0.1",
        "size":4294967296,
        "interleave_ways":1,
        "max_available_extent":4294967296,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1,
        "targets":[
          {
            "target":"pci0000:20",
            "alias":"ACPI0016:01",
            "position":0,
            "id":32
          }
        ]
      },
      {
        "decoder":"decoder0.2",
        "size":536870912,
        "interleave_ways":1,
        "max_available_extent":536870912,
        "pmem_capable":true,
        "volatile_capable":true,
        "accelmem_capable":true,
        "nr_targets":1,
        "targets":[
          {
            "target":"pci0000:28",
            "alias":"ACPI0016:00",
            "position":0,
            "id":40
          }
        ]
      }
    ]

Let's select id 24. It is attached to decoder0.0. To identify which memory devices sit below that decoder, run cxl list -M:

[
  {
    "memdev":"mem0",
    "pmem_size":268435456,
    "serial":0,
    "numa_node":3,
    "host":"0000:29:00.0"
  },
  {
    "memdev":"mem1",
    "ram_size":268435456,
    "serial":0,
    "numa_node":0,
    "host":"0000:1b:00.0"
  },
  {
    "memdev":"mem4",
    "ram_size":268435456,
    "serial":0,
    "numa_node":0,
    "host":"0000:1c:00.0"
  },
  {
    "memdev":"mem3",
    "ram_size":268435456,
    "serial":0,
    "numa_node":1,
    "host":"0000:23:00.0"
  },
  {
    "memdev":"mem2",
    "ram_size":268435456,
    "serial":0,
    "numa_node":1,
    "host":"0000:24:00.0"
  }
]

We can see that mem1 and mem4 are located in numa_node 0.
So we can run sudo cxl create-region -m -t ram -d decoder0.0 -w 2 mem4 mem1 without doubting whether it is the right decoder with the right memory devices. Then we finalize the configuration with 1-way interleaving:

sudo cxl create-region -m -t ram -d decoder0.1 -w 1 mem3
sudo cxl create-region -m -t ram -d decoder0.1 -w 1 mem2

The region with persistent memory:

sudo cxl create-region -m -t pmem -d decoder0.2 mem0
sudo ndctl create-namespace -t pmem -m devdax -r region2 -f

And finally online all devices:

sudo daxctl online-memory all
sudo daxctl reconfigure-device -m system-ram dax2.0 --force

We observed error messages like failed to create namespace: No space left on device after running the namespace creation. To work around this issue, erase the files declared in the mem-path arguments (usually under /tmp/) of your qemu script and reboot the VM.
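
For the setup above, that means something like the following on the host (adjust the paths to whatever your qemu script uses):

# Remove the stale pmem/LSA backing files, then reboot the guest
# so the namespace creation starts from a clean state.
rm /tmp/cxltest.raw /tmp/lsa.raw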