Xen dom0 root filesystem over NFS Howto

Mirko Mariotti - Physics Department - University of Perugia
Riccardo M. Cefalà - Undergraduate Student - University of Perugia

Revision 2008-01-21 16:13:22

Introduction

"Dynamical Domains" is a project in which an application (called the manager) interacts with a batch system and runs a set of different kinds of domUs, according to given rules, on top of a pool of dom0s. This is useful when, for example, you have very different computational tasks that require completely different operating systems. Instead of compiling compatibility libraries, solving dependencies, etc., you run a dedicated virtual machine for each task. On many machines the dom0 root filesystem is over NFS. We do this because we want Xen hypervisors to run on as many machines as possible without modifying the machines' local hard disks, which may be used in some other way (the machine can be a personal desktop, a temporarily unused lab workstation, etc.). This Howto is part of the "Dynamical Domains" project and describes how we put a Xen dom0 root filesystem over NFS.

Installation overview

To have a working installation several things are needed:

The following figure shows the logical structure of the network where the dynamical domains will be deployed.

The dom0s will run on the Computer Science Laboratory workstations at the Department of Physics. We already had a DHCP/PXE environment with a server, pictured in the figure as infolab, offering a pre-existing LTSP system that boots the students' workstations via PXE. On a separate virtual machine we set up the boot server virtdom, which hosts the resources needed by the dom0s to boot. All the machines are on the same network segment, so we can use a single DHCP server for all of them. Moreover, this subnetwork is connected to the Physics Department LAN and to the Internet through the "weblab" gateway. Since the workstations are used by students and professors for their lessons and tasks, the configuration must be transparently switchable between the normal daily-use profile and the dom0 pool profile. Obviously this environment can be built using only one machine that hosts all the needed services. We added virtdom to keep the daily-use environment completely separated from the dom0 pool.

Our configuration

We made this configuration with the following distribution, software and hardware:

PXE-enabled NICs

The first thing needed to boot a diskless station is a way to get a kernel from a remote server over the LAN. Several methods are possible: PXE, Etherboot, gPXE, Netboot, etc. We are lucky, since all our NICs have a PXE stack. In order to boot over the network, only a few things need to be checked:

DHCP server configuration

The first interaction is between the PXE stack of the NIC and the DHCP server, infolab. The DHCP server stores all the information needed to let a diskless station boot. Besides all the network parameters, it offers to the machines the location of a boot server, the name of a PXE bootloader and, optionally, a configuration file or arguments to be passed to it. A typical DHCP configuration file for a netbooting machine follows:

[...]

option space PXE;
option PXE.mtftp-ip               code 1 = ip-address;
option PXE.mtftp-cport            code 2 = unsigned integer 16;
option PXE.mtftp-sport            code 3 = unsigned integer 16;
option PXE.mtftp-tmout            code 4 = unsigned integer 8;
option PXE.mtftp-delay            code 5 = unsigned integer 8;
option PXE.discovery-control      code 6 = unsigned integer 8;
option PXE.discovery-mcast-addr   code 7 = ip-address;

class "pxeclients"
{
        match if substring (option vendor-class-identifier, 0, 9) = "PXEClient";
        option vendor-class-identifier "PXEClient";
        vendor-option-space PXE;

        option PXE.mtftp-ip 0.0.0.0;
}

[...]

option option-150 code 150 = text;
option option-128 code 128 = string;
option option-129 code 129 = text;

[...]

## Terminals

host post01 {
    next-server          192.168.0.100;
    hardware ethernet    00:11:2f:a8:b4:d4;
    fixed-address        192.168.0.101;
    filename             "pxegrub";
    option option-150    "/grub/post01.conf";
}

[...]
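
With many workstations, writing one host block per machine by hand is repetitive. The following is a minimal sketch of how such blocks could be generated. dhcp_host_stanza is a hypothetical helper, not part of the DHCP server, and it assumes that every host uses the pxegrub bootloader and a menu file named /grub/<hostname>.conf, as in the post01 example above.

```shell
#!/bin/sh
# Sketch only: print a dhcpd.conf host block in the same shape as the
# post01 example. dhcp_host_stanza is a hypothetical helper name.
dhcp_host_stanza() {
    name="$1"        # host name, e.g. post01
    mac="$2"         # MAC address of the PXE-enabled NIC
    ip="$3"          # fixed address assigned to the host
    bootserver="$4"  # TFTP server offering pxegrub (next-server)

    cat <<EOF
host ${name} {
    next-server          ${bootserver};
    hardware ethernet    ${mac};
    fixed-address        ${ip};
    filename             "pxegrub";
    option option-150    "/grub/${name}.conf";
}
EOF
}
```

For instance, dhcp_host_stanza post01 00:11:2f:a8:b4:d4 192.168.0.101 192.168.0.100 prints a block equivalent to the one shown above; the output still has to be added to the DHCP configuration file by hand.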

As stated above, through the DHCP server we can pass to each workstation its own pxeGRUB configuration file. In this way the behaviour of each workstation can be specified simply by editing the default GRUB entry in its configuration file, which provides an easy way to choose how the workstation is used: it can continue with the normal LTSP boot from the same server (infolab), or ask a new server (virtdom), which serves (again via TFTP) a new GRUB configuration file that contains what to do next in order to boot as a dom0. Using two TFTP servers in this way is obviously redundant. Writing all the GRUB configuration to a single file simplifies the boot process, so that approach can be used if there are no particular needs. In our case, instead, the files are separated because our secondary TFTP server (on virtdom, which has more NICs) also serves dom0s on other LANs.

default 0
timeout 2
password xxxxx

title   Infolab terminal
        root (nd)
        kernel /kernel/ltsp-2.6.17.8 rw root=/dev/ram0
        initrd /initrd/ltsp-2.6.17.8.gz
title   On-demand virtual machines
        password xxxxx
        dhcp
        tftpserver 192.168.0.5
        root (nd)
        configfile /grub/post01.conf
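
Switching a workstation between the two profiles amounts to changing the "default" line of its menu file. A minimal sketch of how this could be scripted follows. set_grub_default is a hypothetical helper, and the location of the menu files on infolab is not stated in this Howto, so the path is passed as an argument.

```shell
#!/bin/sh
# Sketch only: change the default entry of a GRUB legacy menu file.
# set_grub_default is a hypothetical helper, not a GRUB command.
set_grub_default() {
    conf="$1"    # path to the per-workstation menu file
    entry="$2"   # zero-based index of the title to boot by default

    # Accept only a non-negative integer as entry index.
    case "$entry" in
        ''|*[!0-9]*) echo "invalid entry: $entry" >&2; return 1 ;;
    esac

    # Rewrite the "default N" line into a temporary file,
    # then replace the original file with it.
    tmp="${conf}.tmp.$$"
    sed "s/^default[[:space:]][[:space:]]*[0-9][0-9]*[[:space:]]*\$/default ${entry}/" "$conf" > "$tmp" \
        && mv "$tmp" "$conf"
}
```

With the menu above, entry 0 selects the LTSP terminal and entry 1 selects the on-demand virtual machines, so calling set_grub_default with 1 on the post01 menu file would make post01 boot as a dom0 at its next restart.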

Boot server configuration

Clients booting as a dom0 follow the configuration specified on the boot server. From now on, the resources needed by the clients during the boot process (kernel images, Xen hypervisors, GRUB config files and even the clients' root filesystem itself) are hosted on virtdom (which in our case is itself a virtual machine), acting as a boot server. To deliver those resources the system uses:

TFTP

The configuration file for tftpd on the virtdom boot server is shown here:

#Defaults for tftpd-hpa
RUN_DAEMON="yes"
OPTIONS="-l -s /tftpboot"

In this way /tftpboot is the base directory from which the resources are made available to the clients. In detail, the base directory contains the following directories:

GRUB

A typical GRUB configuration file hosted on virtdom and used in our environment looks like this:

#post-pIV-realtek - NFS root

default 0
timeout 2

title   Dom0 - NFS
        root (nd)
        kernel /hypervisors/xen-3.1.gz dom0_mem=32M
        module /kernels/kernel-2.6.20-xen-r6-pIV-rltk ip=dhcp root=/dev/nfs \
               nfsroot=192.168.0.5:/opt/ondemand/root

Every client finds its own configuration file in /tftpboot/grub/<hostname>.conf, which typically is a symbolic link to the proper config file, where we can specify the hypervisor, kernel image and root filesystem for each client. The right config file for each client is specified in the first pxeGRUB menu (see the section above).
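
The per-host symbolic links can be created with a few lines of shell. The following is a sketch under stated assumptions: link_host_config is a hypothetical helper, and the profile file name used in the usage note is only a guess derived from the comment at the top of the example above. The link is created as a relative one because tftpd is started with -s, so an absolute target would probably not resolve from inside the /tftpboot base directory.

```shell
#!/bin/sh
# Sketch only: point <hostname>.conf at a shared GRUB config file.
# link_host_config is a hypothetical helper name.
link_host_config() {
    grubdir="$1"   # directory holding the config files, e.g. /tftpboot/grub
    host="$2"      # client host name, e.g. post01
    profile="$3"   # file name of the shared config inside grubdir

    if [ ! -f "${grubdir}/${profile}" ]; then
        echo "missing config: ${grubdir}/${profile}" >&2
        return 1
    fi

    # -s: symbolic link, -f: replace an existing link,
    # -n: do not follow an existing link while replacing it.
    ln -sfn "$profile" "${grubdir}/${host}.conf"
}
```

For example, link_host_config /tftpboot/grub post01 post-pIV-realtek.conf would make post01 use a shared config file with that (assumed) name.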

NFS

The NFS configuration is a rather standard one. Here is the exports file:

#virtdom exports
/opt/ondemand/root              192.168.0.0/255.255.255.0(ro,no_root_squash,async)
/opt/ondemand/imgs              192.168.0.0/255.255.255.0(rw,no_root_squash,async)

The root filesystem tree is in /opt/ondemand/root. The filesystem is properly configured to run a Xen 3.1 environment. Note that the exported root filesystem is read-only, so it can be shared among the clients without being accidentally modified by one of them. Updates and edits to the configuration files can be done once for all clients through a chrooted environment on virtdom. Details about this can be found in the following sections.
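
Since the whole setup relies on the root tree being exported read-only, it can be worth checking this after editing the exports file. The following is a minimal sketch; export_is_readonly is a hypothetical helper, and it assumes the simple one-client-per-line layout of the exports file shown above.

```shell
#!/bin/sh
# Sketch only: succeed if the given path is exported with the "ro" option.
# export_is_readonly is a hypothetical helper name; it only understands
# lines of the form "<path> <client>(<opt>,<opt>,...)".
export_is_readonly() {
    exports="$1"   # exports file to inspect, e.g. /etc/exports
    path="$2"      # exported directory, e.g. /opt/ondemand/root

    # Take the first entry for this path and keep only the option list.
    opts=$(awk -v p="$path" '$1 == p { print $2; exit }' "$exports" \
        | sed 's/^[^(]*(\(.*\))$/\1/')

    # Surround the list with commas so that "ro" matches only as a whole option.
    case ",${opts}," in
        *,ro,*) return 0 ;;
        *)      return 1 ;;
    esac
}
```

With the exports file above, the check succeeds for /opt/ondemand/root and fails for /opt/ondemand/imgs, which is exported read-write.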

Clients' Configuration

Dom0 Kernels configuration

The kernel needs to be configured to allow root over NFS. Moreover, the kernel sources need to be patched to run as a Xen privileged domain. The following points show the kernel options that need to be enabled.

[...]

CONFIG_NET_VENDOR_3COM=y
CONFIG_VORTEX=y

[...]

[...]

CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y

[...]

[...]

CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
# CONFIG_NFS_V4 is not set
CONFIG_NFS_DIRECTIO=y
CONFIG_NFSD=y
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
# CONFIG_NFSD_V4 is not set
CONFIG_NFSD_TCP=y
CONFIG_ROOT_NFS=y
CONFIG_NFS_ACL_SUPPORT=y
CONFIG_NFS_COMMON=y

[...]

[...]

CONFIG_X86_XEN=y
# CONFIG_PCI_GOXEN_FE is not set
CONFIG_XEN_PCIDEV_FRONTEND=y
# CONFIG_XEN_PCIDEV_FE_DEBUG is not set
# CONFIG_NETXEN_NIC is not set
CONFIG_XEN=y
CONFIG_XEN_INTERFACE_VERSION=0x00030205
# XEN
CONFIG_XEN_PRIVILEGED_GUEST=y
# CONFIG_XEN_UNPRIVILEGED_GUEST is not set
CONFIG_XEN_PRIVCMD=y
CONFIG_XEN_XENBUS_DEV=y
CONFIG_XEN_BACKEND=y
CONFIG_XEN_BLKDEV_BACKEND=y
CONFIG_XEN_BLKDEV_TAP=y
CONFIG_XEN_NETDEV_BACKEND=y
# CONFIG_XEN_NETDEV_PIPELINED_TRANSMITTER is not set
CONFIG_XEN_NETDEV_LOOPBACK=m
CONFIG_XEN_PCIDEV_BACKEND=y
CONFIG_XEN_PCIDEV_BACKEND_VPCI=y
# CONFIG_XEN_PCIDEV_BACKEND_PASS is not set
# CONFIG_XEN_PCIDEV_BACKEND_SLOT is not set
# CONFIG_XEN_PCIDEV_BE_DEBUG is not set
CONFIG_XEN_TPMDEV_BACKEND=m
CONFIG_XEN_BLKDEV_FRONTEND=m
CONFIG_XEN_NETDEV_FRONTEND=m
CONFIG_XEN_SCRUB_PAGES=y
CONFIG_XEN_DISABLE_SERIAL=y
CONFIG_XEN_SYSFS=m
CONFIG_XEN_COMPAT_030002_AND_LATER=y
# CONFIG_XEN_COMPAT_030004_AND_LATER is not set
# CONFIG_XEN_COMPAT_LATEST_ONLY is not set
CONFIG_XEN_COMPAT=0x030002

[...]
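
Before building, the kernel configuration can be checked for the options that root over NFS and dom0 operation depend on. This is a minimal sketch: check_kernel_config is a hypothetical helper, and it only tests that each named option is built in (=y), since the drivers needed to mount the root filesystem cannot be loaded as modules from a filesystem that is not mounted yet.

```shell
#!/bin/sh
# Sketch only: report which of the given options are not set to "y"
# in a kernel .config file. check_kernel_config is a hypothetical helper.
check_kernel_config() {
    config="$1"   # path to the kernel .config file
    shift         # the remaining arguments are option names

    missing=0
    for opt in "$@"; do
        if ! grep -q "^${opt}=y\$" "$config"; then
            echo "missing: ${opt}"
            missing=1
        fi
    done
    return "$missing"
}
```

A possible call, assuming the sources are in /usr/src/linux, is check_kernel_config /usr/src/linux/.config CONFIG_IP_PNP CONFIG_IP_PNP_DHCP CONFIG_NFS_FS CONFIG_ROOT_NFS CONFIG_XEN CONFIG_XEN_PRIVILEGED_GUEST. The NIC driver option (CONFIG_VORTEX in the listing above) depends on the hardware and has to be added to the list accordingly.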

Clients' NFS root filesystem

The root filesystem we used is a clean Debian-etch with udev, python, bridge tools and xen tools installed from sources.

Some portions of the filesystem have to be writable by the clients (for example /var/run, /var/lock, etc.). For this reason each client creates in RAM, through a script executed at boot time, a directory tree containing the locations that must be writable. This tree is later mounted on a directory called /flash, which must exist on the common NFS root filesystem. In the common root filesystem tree, too, the files and directories that must be writable are moved to the /flash directory and replaced by symbolic links in their original locations. This keeps the symbolic links consistent both in the chrooted environment and in the one created at boot time by the clients. Indeed, in the chrooted environment the symbolic links point to the proper locations in /flash. Similarly, the clients mount their writable tree on /flash, so the common content of /flash is replaced with the tree built in RAM by each client.
Here are some examples of the resulting filesystem, edited as described. /opt/ondemand/root/ on virtdom contains the exported root tree.

virtdom:~# ls /opt/ondemand/root/
bin  boot  dev  etc  flash  home  initrd  lib  media  mnt 
opt  proc  root  sbin  srv  sys  tmp  usr  var

The real files are in /flash, replaced in their original locations by symbolic links.

As an example, the listings below show how, on virtdom, the folders in var/ are replaced by links while the real ones are kept in the flash/ directory.

virtdom:~# ls -l /opt/ondemand/root/flash/*
/opt/ondemand/root/flash/etc:
total 4
drwxr-xr-x 4 root root 4096 Nov 29 12:18 udev

/opt/ondemand/root/flash/var:
total 20
drwxr-xr-x 5 root root 4096 Nov 26 10:59 lib
drwxrwxrwt 2 root root 4096 Nov 16 13:07 lock
drwxr-xr-x 5 root root 4096 Nov 16 12:56 log
drwxr-xr-x 2 root root 4096 Nov 21 17:27 run
drwxrwxrwt 2 root root 4096 Nov 23 16:37 tmp

virtdom:~# ls -l /opt/ondemand/root/var/
[...]
drwxrwsr-x  2 root staff 4096 Oct 28  2006 local
lrwxrwxrwx  1 root root    15 Nov 16 18:37 lock -> /flash/var/lock
lrwxrwxrwx  1 root root    14 Nov 16 18:37 log -> /flash/var/log
drwxrwsr-x  2 root mail  4096 Oct 29 15:59 mail
drwxr-xr-x  2 root root  4096 Oct 29 15:59 opt
lrwxrwxrwx  1 root root    14 Nov 16 13:06 run -> /flash/var/run
drwxr-xr-x  3 root root  4096 Oct 29 16:01 spool
drwxr-xr-x  3 root root  4096 Nov 16 09:42 xen

On a generic client, the content of /var will be exactly the same as /opt/ondemand/root/var on the virtdom share:

post23:~# ls -l /var/
[...]
drwxrwsr-x  2 root staff 4096 Oct 28  2006 local
lrwxrwxrwx  1 root root    15 Nov 16 18:37 lock -> /flash/var/lock
lrwxrwxrwx  1 root root    14 Nov 16 18:37 log -> /flash/var/log
drwxrwsr-x  2 root mail  4096 Oct 29 15:59 mail
drwxr-xr-x  2 root root  4096 Oct 29 15:59 opt
lrwxrwxrwx  1 root root    14 Nov 16 13:06 run -> /flash/var/run
drwxr-xr-x  3 root root  4096 Oct 29 16:01 spool
drwxr-xr-x  3 root root  4096 Nov 16 09:42 xen

The /flash folder instead will contain the tree created in RAM at boot time:

post23:~# ls -al /flash/
total 4
drwxrwxrwt  4 root root   80 Jan 18 11:34 .
drwxr-xr-x 20 root root 4096 Nov 26 09:46 ..
drwxr-xr-x  4 root root   80 Jan 18 11:34 etc
drwxr-xr-x  7 root root  140 Jan 18 11:34 var

As a net result, the chrooted environment on virtdom and the client filesystem look exactly the same. The real difference is that from virtdom the tree is completely writable, while from the clients only the portion built in RAM and mounted on /flash is writable.
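
The relocation described in this section (move a writable directory under flash/ and leave a symbolic link behind) can be sketched as follows. relocate_to_flash is a hypothetical helper meant to be run on virtdom against the exported tree; it is an illustration of the procedure, not the script we actually used.

```shell
#!/bin/sh
# Sketch only: move <root>/<rel> to <root>/flash/<rel> and replace the
# original with a symbolic link to /flash/<rel>.
# relocate_to_flash is a hypothetical helper name.
relocate_to_flash() {
    root="$1"   # exported root tree, e.g. /opt/ondemand/root
    rel="$2"    # path relative to the root, e.g. var/run

    src="${root}/${rel}"
    dst="${root}/flash/${rel}"

    # Nothing to do if the path has already been replaced by a link.
    if [ -L "$src" ]; then
        return 0
    fi
    if [ ! -e "$src" ]; then
        echo "no such path: $src" >&2
        return 1
    fi

    mkdir -p "$(dirname "$dst")" || return 1
    mv "$src" "$dst" || return 1

    # The link target is absolute, so it resolves to the exported copy inside
    # the chroot and to the RAM-backed tree on a booted client.
    ln -s "/flash/${rel}" "$src"
}
```

Calling relocate_to_flash /opt/ondemand/root var/run, for example, would produce the "run -> /flash/var/run" link shown in the listings above.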

Custom init script

The clients can't access files on the NFS root filesystem in read/write mode, so how do we deal with programs that need to write something, for example in /var/run? To cope with this we need a mechanism that creates the needed parts of the root filesystem in RAM and then mounts them in read/write mode. The mount point is exactly the one the links point to in the chrooted environment (/flash). In this way, when the new writable filesystem is mounted, it hides the old content and the symbolic links remain valid. However, some files, like the udev rules in /flash/etc/udev, would be lost in the mounting, so we need to copy them. All this is achieved using an init script that is executed before any other init script.

#!/bin/sh
#
# rw filesystem initialization
#
#   Thu Nov 22 16:24:17 UTC 2007
#

#Temporary folder where the writable elements are going to be created
TMPBASE="/mnt/tmp"
#The location where the writable filesystem will be mounted
BASE="/flash"

DIRS="var \
    var/run \
    var/run/screen \
    var/log \
    var/lock \
    var/lib \
    var/lib/xend \
    var/lib/xenstored \
    var/lib/urandom \
    var/lib/dhcp3 \
    var/tmp \
    etc \
    etc/network \
    etc/network/run \
    etc/udev"

DEVBASE="/dev"

DEVDIRS="pts \
    shm"

WAIT=0.1


#generating devices files
mkdev() {
    mount -n -t tmpfs tmpfs ${DEVBASE}

    for i in $DEVDIRS ; do
        mkdir ${DEVBASE}/${i}
    done

    for i in `seq 1 6`; do
        mknod /dev/tty$i c 4 $i
    done

    mknod /dev/tty c 5 0
    mknod /dev/null c 1 3
    mknod /dev/console c 5 1    
    mknod /dev/random c 1 8
    mknod /dev/urandom c 1 9
}

#creates the filesystem in the temporary location
mktmp() {
    #the filesystem is created in ram
    mount -n -t tmpfs tmpfs ${TMPBASE}

    for i in $DIRS ; do
        mkdir ${TMPBASE}/${i}
        chmod +w ${TMPBASE}/${i}
    done
    rsync -a /flash/etc/udev ${TMPBASE}/etc
}

#at the end we move the newly created stuff in the proper location
domove() {
    mount --move ${TMPBASE} ${BASE}
}

#"local" commands
dolocal() {
    touch ${BASE}/var/log/dmesg
    echo "lo=lo" > ${BASE}/etc/network/run/ifstate
}

#what to do
DOLIST="mkdev \
    mktmp \
    domove \
    dolocal"

doit() {
    SN=`echo -n $0 | sed 's:^/.*/::'`

    for i in $DOLIST ; do
        echo -n "${SN}: "
        echo $i
        $i
    done;

    echo -n "${SN}: "
    echo "$WAIT Seconds..."
    sleep $WAIT
}

#do it!
doit

Obviously the script relies on the fact that real files in the root filesystem are replaced with consistent symbolic links as described above. We used the Debian init system to be sure that the script is executed before the udev scripts.
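
Whether the script really runs before udev can be checked by looking at the order of the start links in the boot runlevel directory. This is a minimal sketch: runs_before is a hypothetical helper, "rwfs" in the usage note is a made-up name for our script, and the check assumes the usual SNNname link naming, where the links are started in lexical order.

```shell
#!/bin/sh
# Sketch only: succeed if the start link of script $2 sorts before the
# start link of script $3 in runlevel directory $1.
# runs_before is a hypothetical helper name.
runs_before() {
    rcdir="$1"    # runlevel directory, e.g. /etc/rcS.d
    first="$2"    # script expected to run first
    second="$3"   # script expected to run later

    a=$(ls "$rcdir" | grep "^S[0-9][0-9]${first}\$" | head -n 1)
    b=$(ls "$rcdir" | grep "^S[0-9][0-9]${second}\$" | head -n 1)

    # Return 2 if either link is missing.
    if [ -z "$a" ] || [ -z "$b" ]; then
        return 2
    fi

    # Compare the two link names in plain byte order.
    earliest=$(printf '%s\n%s\n' "$a" "$b" | LC_ALL=C sort | head -n 1)
    [ "$earliest" = "$a" ]
}
```

For example, runs_before /etc/rcS.d rwfs udev succeeds only if the link for the (hypothetically named) rwfs script sorts before the udev one.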

Xen configuration tricks

Some changes to the default Xen configuration have to be made in order to make the whole thing work. The first thing to pay attention to is that the default Xen 3.x network configuration renames the physical Ethernet interface to peth0, creates a virtual interface named eth0 and assigns the dom0 IP to it (among many other things; see the network-bridge script for details). The NFS connection does not survive this operation, so a workaround is needed.

There are two possibilities:

We chose the first for simplicity. The standard Xen 3.1 network-bridge script already contains the needed code inside the op_start function. The original script:

op_start () {
[...]
create_bridge ${bridge}
if link_exists "$vdev"; then
        mac=`ip link show ${netdev} | grep 'link\/ether' | sed -e 's/.*ether \(..:..:..:..:..:..\).*/\1/'`
        preiftransfer ${netdev}
        transfer_addrs ${netdev} ${vdev}
        if ! ifdown ${netdev}; then
                # If ifdown fails, remember the IP details.
                get_ip_info ${netdev}
                ip link set ${netdev} down
                ip addr flush ${netdev}
        fi
        ip link set ${netdev} name ${pdev}
        ip link set ${vdev} name ${netdev}

        setup_bridge_port ${pdev}
        setup_bridge_port ${vif0}
        ip link set ${netdev} addr ${mac} arp on

        ip link set ${bridge} up
        add_to_bridge  ${bridge} ${vif0}
        add_to_bridge2 ${bridge} ${pdev}
        do_ifup ${netdev}

        if ! ifdown ${pdev}; then
                # If ifdown fails, remember the IP details.
                get_ip_info ${pdev}
                ip link set ${pdev} down
                ip addr flush ${pdev}
        fi

else
        ip link set ${bridge} arp on
        ip link set ${bridge} multicast on
        # old style without ${vdev}
        transfer_addrs  ${netdev} ${bridge}
        transfer_routes ${netdev} ${bridge}
        # Attach the real interface to the bridge.
        add_to_bridge ${bridge} ${netdev}
        ip addr flush ${netdev}
fi
[...]

has to be changed so that the code in the final else branch is always executed. We removed (or commented out) all the code between then and else, together with the if, else and fi lines themselves:

[...]
create_bridge ${bridge}
        ip link set ${bridge} arp on
        ip link set ${bridge} multicast on
        # old style without ${vdev}
        transfer_addrs  ${netdev} ${bridge}
        transfer_routes ${netdev} ${bridge}
        # Attach the real interface to the bridge.
        add_to_bridge ${bridge} ${netdev}
        ip addr flush ${netdev}
[...]

With this change, when xend creates its bridges it does not transfer the machine IP to the veth virtual interface; it keeps using the real interface, which preserves the root over NFS connection.

The whole boot sequence

The scheme below summarizes the boot process of a generic client and its interactions with the two servers.

Troubleshooting

This section collects some of the problems we ran into and the solutions we adopted.

Some old 3Com cards have problems with newer versions of the pxegrub bootloader. We resolved this by using an old version of pxegrub (taken from an old distribution), without further investigation.

VIA Rhine network cards have some problems with APIC, so it should be disabled by passing the right parameter to the kernel (noapic). However, some workstations kept ignoring the DHCP offers from infolab, causing the boot process to hang at DHCP configuration. This was because some BIOSes enable APIC features by default, no matter what kernel parameters are passed at boot. In that case APIC has to be disabled from the BIOS.

Some broken BIOSes are unable to bring the via-rhine chip back from power state D3, so PXE booting fails. The D3 power state has to be disabled in via-rhine.c in the kernel sources. Since kernel 2.6.18 there is a workaround that disables it by passing via-rhine.avoid_D3=1 to the kernel. More information about this problem can be found here: http://lkml.org/lkml/2004/9/17/220

This error was given while trying to start a virtual machine and was caused by the absence of /etc/udev/xen-backend.rules on the privileged domains.

Future Enhancements

We already use Xen in a VLAN environment, so we plan to integrate it with dom0 over NFS. We will also add the possibility to handle iSCSI and AoE virtual machines.

Xen0verNfsHowto (last edited 2008-12-02 21:16:24 by mirko)