Xen dom0 root filesystem over NFS Howto
Mirko Mariotti - Physics Department - University of Perugia
Riccardo M. Cefalà - Undergraduate Student - University of Perugia
Revision 2008-01-21 16:13:22
Introduction
"Dynamical Domains" is a project in which an application (called the manager) interacts with a batch system and runs a set of different kinds of domUs, according to given rules, on top of a pool of dom0s. This is useful when, for example, you have very different computational tasks that require completely different operating systems: instead of compiling compatibility libraries, solving dependencies, etc., you run a dedicated virtual machine for each task. On many of our machines the dom0 root filesystem is served over NFS. We do this because we want Xen hypervisors to run on as many machines as possible without modifying their local hard disks, which may be in use for other purposes (a personal desktop, a temporarily unused lab workstation, etc.). This Howto is part of the "Dynamical Domains" project and describes how we put a Xen dom0 root filesystem on NFS.
Installation overview
To have a working installation several things are needed:
- PXE-enabled NICs
- DHCP Server
- TFTP Server
- GRUB (pxegrub)
- NFS Server
The following figure shows the logical structure of the network where the dynamical domains will be deployed.
[Figure: logical structure of the network]
The dom0s will run on the Computer Science Laboratory workstations at the Department of Physics. We already had a DHCP/PXE environment with a server, shown in the figure as infolab, offering a pre-existing LTSP system that boots the students' workstations via PXE. On a separate virtual machine we set up the boot server virtdom, which hosts the resources needed by the dom0s to boot. All the machines are on the same network segment, so a single DHCP server can serve all of them. This subnetwork is also connected to the Physics Department LAN and to the Internet through the "weblab" gateway. Since the workstations are used by students and professors for their lessons and tasks, the configuration must be transparently switchable between the normal daily-use profile and the dom0 pool profile. This environment could of course be built using a single machine hosting all the needed services; we added virtdom to keep the daily-use environment completely separate from the dom0 pool.
Our configuration
We built this configuration with the following distributions, software and hardware:
- DHCP (ISC) and TFTP (tftp-hpa) servers on infolab (2x3GHz Xeon processors - Gentoo 2007.0)
- pxegrub taken from the infolab GRUB installation (except for some old 3Com NICs, for which we used a pxegrub taken from a precompiled Red Hat 7.2)
- TFTP (tftp-hpa) and NFS servers on virtdom (Xen domU - Debian Etch)
- Dom0 root filesystem is a Debian Etch
- Dom0 kernels are 2.6.20-xen-r6 taken from the Gentoo xen-sources package
PXE-enabled NICs
The first thing needed to boot a diskless station is a way to get a kernel from a remote server over the LAN. Several methods are possible: PXE, Etherboot, gPXE, Netboot, etc. We are lucky, since all our NICs have a PXE stack. In order to boot over the network only a few things need to be checked:
- The BIOS has to allow the execution of the NIC boot code (e.g. by enabling features like "Booting Other Devices", "LAN Boot Rom", etc.).
- The boot sequence must be properly configured.
- If the NIC has its own configuration program (e.g. 3Com), that program must allow network booting.
DHCP server configuration
The first interaction is between the PXE stack of the NIC and the DHCP server, infolab. The DHCP server stores all the information needed to let a diskless station boot. Besides the network parameters, it offers the machines the location of a boot server, the name of a PXE bootloader and, optionally, a configuration file or arguments to be passed to it. A typical DHCP configuration file for a netbooting machine follows:
[...]
option space PXE;
option PXE.mtftp-ip code 1 = ip-address;
option PXE.mtftp-cport code 2 = unsigned integer 16;
option PXE.mtftp-sport code 3 = unsigned integer 16;
option PXE.mtftp-tmout code 4 = unsigned integer 8;
option PXE.mtftp-delay code 5 = unsigned integer 8;
option PXE.discovery-control code 6 = unsigned integer 8;
option PXE.discovery-mcast-addr code 7 = ip-address;
class "pxeclients"
{
    match if substring (option vendor-class-identifier, 0, 9) = "PXEClient";
    option vendor-class-identifier "PXEClient";
    vendor-option-space PXE;
    option PXE.mtftp-ip 0.0.0.0;
}
[...]
option option-150 code 150 = text;
option option-128 code 128 = string;
option option-129 code 129 = text;
[...]
## Terminals
host post01 {
    next-server 192.168.0.100;
    hardware ethernet 00:11:2f:a8:b4:d4;
    fixed-address 192.168.0.101;
    filename "pxegrub";
    option option-150 "/grub/post01.conf";
}
[...]
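Since every workstation needs a host stanza of the same shape as the post01 one, the stanzas can be generated instead of typed by hand. The following is only a minimal sketch: the function name dhcp_host_entry and its argument order are hypothetical, and it assumes that option-150 is declared in dhcpd.conf as in the listing above.

```sh
#!/bin/sh
# Hypothetical helper: print a dhcpd.conf host stanza for one workstation.
# Arguments: host name, MAC address, fixed IP address, TFTP server address.
# The stanza layout copies the post01 example; it assumes option-150 has
# already been declared elsewhere in dhcpd.conf.
dhcp_host_entry() {
    name=$1
    mac=$2
    ip=$3
    tftp=$4
    cat <<EOF
host ${name} {
    next-server ${tftp};
    hardware ethernet ${mac};
    fixed-address ${ip};
    filename "pxegrub";
    option option-150 "/grub/${name}.conf";
}
EOF
}
```

For example, `dhcp_host_entry post02 00:11:2f:a8:b4:d5 192.168.0.102 192.168.0.100` prints a stanza for a made-up second workstation; the MAC and IP addresses here are illustrative values, not taken from our network.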
As stated above, the DHCP server can hand each workstation its own pxeGRUB configuration file. The behaviour of each workstation can then be selected simply by editing the default GRUB entry in its configuration file, which gives an easy way to choose how the workstation is used: it can continue with the normal LTSP boot from the same server (infolab), or it can contact a second server (virtdom), which serves (again via TFTP) a new GRUB configuration file describing what to do next in order to boot as a dom0. Using two TFTP servers in this way is clearly redundant. Putting all the GRUB configuration in a single file simplifies the boot process and is preferable when there are no particular needs. In our case the files are kept separate because our secondary TFTP server (on virtdom, which has more NICs) also serves dom0s on other LANs. The first pxeGRUB menu looks like this:
default 0
timeout 2
password xxxxx
title Infolab terminal
root (nd)
kernel /kernel/ltsp-2.6.17.8 rw root=/dev/ram0
initrd /initrd/ltsp-2.6.17.8.gz
title On-demand virtual machines
password xxxxx
dhcp
tftpserver 192.168.0.5
root (nd)
configfile /grub/post01.conf
Boot server configuration
Clients booting as a dom0 follow the configuration specified on the boot server. From this point on, the resources needed during boot (kernel images, Xen hypervisors, GRUB configuration files and the clients' root filesystem itself) are hosted on virtdom (in our case itself a virtual machine), which acts as the boot server. To deliver those resources the system uses:
TFTP for GRUB configuration files, Hypervisors and Kernel Images.
NFS for the root filesystem.
TFTP
The configuration file for tftpd on the virtdom boot server is:
#Defaults for tftpd-hpa
RUN_DAEMON="yes"
OPTIONS="-l -s /tftpboot"
With these options, /tftpboot is the base directory from which the resources are made available to the clients. In detail, it contains the following directories:
/tftpboot/grub
containing the GRUB configuration files.
/tftpboot/hypervisors
where the Xen Hypervisors are stored.
/tftpboot/kernels
and here the kernel images.
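The layout above can be created in one step. This is a minimal sketch; the function name make_tftp_tree is hypothetical, and the base directory defaults to /tftpboot only because that is the path passed to tftpd in our configuration.

```sh
#!/bin/sh
# Hypothetical helper: create the directory layout served by tftpd.
# Argument: base directory (defaults to /tftpboot, as used in this Howto).
make_tftp_tree() {
    base=${1:-/tftpboot}
    for d in grub hypervisors kernels; do
        # -p makes the call safe to repeat on an existing tree
        mkdir -p "${base}/${d}" || return 1
    done
}
```

Calling `make_tftp_tree` with no argument (as root) sets up /tftpboot; passing another path is convenient for trying the layout out in a scratch directory first.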
GRUB
A typical GRUB configuration file hosted on virtdom in our environment looks like this:
#post-pIV-realtek - NFS root
default 0
timeout 2
title Dom0 - NFS
root (nd)
kernel /hypervisors/xen-3.1.gz dom0_mem=32M
module /kernels/kernel-2.6.20-xen-r6-pIV-rltk ip=dhcp root=/dev/nfs \
nfsroot=192.168.0.5:/opt/ondemand/root
Every client finds its own configuration file at /tftpboot/grub/<hostname>.conf, which is typically a symbolic link to the proper configuration file, where we specify the hypervisor, kernel image and root filesystem for that client. The configuration file for each client is selected in the first pxeGRUB menu (see the section above).
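The per-host symbolic links can be maintained with a small helper. The sketch below is an illustration under assumptions: the function name link_host_config and the profile file name used in the example are hypothetical. It creates a relative link on purpose, since tftpd is started with -s /tftpboot and an absolute link target would be resolved inside that restricted root.

```sh
#!/bin/sh
# Hypothetical helper: make <grubdir>/<host>.conf a symbolic link to a
# shared profile stored in the same directory.
# Arguments: GRUB config directory, host name, profile file name.
link_host_config() {
    grubdir=$1
    host=$2
    profile=$3
    # Refuse to create a dangling link.
    if [ ! -f "${grubdir}/${profile}" ]; then
        echo "missing profile: ${grubdir}/${profile}" >&2
        return 1
    fi
    # Relative target, so the link also resolves inside the TFTP root.
    ln -sfn "${profile}" "${grubdir}/${host}.conf"
}
```

For example, `link_host_config /tftpboot/grub post01 pIV-realtek-nfs.conf` would point post01 at a profile named pIV-realtek-nfs.conf (an invented file name); switching a workstation to another profile is then a matter of running the helper again.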
NFS
The NFS configuration is a rather standard one. Here is the exports file:
#virtdom exports
/opt/ondemand/root 192.168.0.0/255.255.255.0(ro,no_root_squash,async)
/opt/ondemand/imgs 192.168.0.0/255.255.255.0(rw,no_root_squash,async)
The root filesystem tree lives in /opt/ondemand/root and is configured to run a Xen 3.1 environment. Note that the root filesystem is exported read-only, so it can be shared among the clients without being accidentally modified by one of them. Updates and edits to the configuration files can be made once for all clients through a chrooted environment on virtdom. Details can be found in the following sections.
Clients' Configuration
Dom0 Kernels configuration
The kernel needs to be configured to allow root over NFS. Moreover, the kernel sources need to be patched to run as a Xen privileged domain. The following points show the kernel options that need to be enabled.
- NIC drivers need to be built in (otherwise you'll need an initrd to load the module).
[...]
CONFIG_NET_VENDOR_3COM=y
CONFIG_VORTEX=y
[...]
- Kernel Level IP Autoconfiguration needs to be enabled.
[...]
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
[...]
- NFS support and root over NFS must be enabled.
[...]
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
# CONFIG_NFS_V4 is not set
CONFIG_NFS_DIRECTIO=y
CONFIG_NFSD=y
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
# CONFIG_NFSD_V4 is not set
CONFIG_NFSD_TCP=y
CONFIG_ROOT_NFS=y
CONFIG_NFS_ACL_SUPPORT=y
CONFIG_NFS_COMMON=y
[...]
- Xen dom0 configuration.
[...]
CONFIG_X86_XEN=y
# CONFIG_PCI_GOXEN_FE is not set
CONFIG_XEN_PCIDEV_FRONTEND=y
# CONFIG_XEN_PCIDEV_FE_DEBUG is not set
# CONFIG_NETXEN_NIC is not set
CONFIG_XEN=y
CONFIG_XEN_INTERFACE_VERSION=0x00030205
# XEN
CONFIG_XEN_PRIVILEGED_GUEST=y
# CONFIG_XEN_UNPRIVILEGED_GUEST is not set
CONFIG_XEN_PRIVCMD=y
CONFIG_XEN_XENBUS_DEV=y
CONFIG_XEN_BACKEND=y
CONFIG_XEN_BLKDEV_BACKEND=y
CONFIG_XEN_BLKDEV_TAP=y
CONFIG_XEN_NETDEV_BACKEND=y
# CONFIG_XEN_NETDEV_PIPELINED_TRANSMITTER is not set
CONFIG_XEN_NETDEV_LOOPBACK=m
CONFIG_XEN_PCIDEV_BACKEND=y
CONFIG_XEN_PCIDEV_BACKEND_VPCI=y
# CONFIG_XEN_PCIDEV_BACKEND_PASS is not set
# CONFIG_XEN_PCIDEV_BACKEND_SLOT is not set
# CONFIG_XEN_PCIDEV_BE_DEBUG is not set
CONFIG_XEN_TPMDEV_BACKEND=m
CONFIG_XEN_BLKDEV_FRONTEND=m
CONFIG_XEN_NETDEV_FRONTEND=m
CONFIG_XEN_SCRUB_PAGES=y
CONFIG_XEN_DISABLE_SERIAL=y
CONFIG_XEN_SYSFS=m
CONFIG_XEN_COMPAT_030002_AND_LATER=y
# CONFIG_XEN_COMPAT_030004_AND_LATER is not set
# CONFIG_XEN_COMPAT_LATEST_ONLY is not set
CONFIG_XEN_COMPAT=0x030002
[...]
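Before copying a kernel to the boot server it can be useful to check its .config against the options listed above. The sketch below is only a convenience check and is not part of our setup: the function name check_kconfig is hypothetical, it tests only a handful of the options shown, and it leaves out the NIC driver because that depends on the hardware.

```sh
#!/bin/sh
# Hypothetical helper: report which of the options needed for an NFS-root
# dom0 are not built in ("=y") in a kernel .config file.
# Argument: path to the .config file.
# Returns 0 when all checked options are built in, 1 otherwise.
check_kconfig() {
    config=$1
    missing=0
    for opt in CONFIG_IP_PNP CONFIG_IP_PNP_DHCP CONFIG_NFS_FS \
               CONFIG_ROOT_NFS CONFIG_XEN CONFIG_XEN_PRIVILEGED_GUEST; do
        # A module (=m) is not enough: there is no initrd to load it from.
        if ! grep -q "^${opt}=y\$" "${config}"; then
            echo "not built in: ${opt}"
            missing=1
        fi
    done
    return ${missing}
}
```

It would be called as `check_kconfig /usr/src/linux/.config`, where the path is just the usual location of the kernel sources and may differ on your system.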
Clients' NFS root filesystem
The root filesystem we used is a clean Debian-etch with udev, python, bridge tools and xen tools installed from sources.
Some portions of the filesystem have to be writable by the clients (for example /var/run, /var/lock, etc.). Each client therefore creates in RAM, through a script executed at boot time, a directory tree containing the locations that must be writable. This tree is then mounted on a directory called /flash, which must exist on the common NFS root filesystem. In the common root filesystem tree, the files and directories that must be writable are likewise moved under /flash and replaced by symbolic links in their original locations. This keeps the symbolic links consistent both in the chrooted environment and in the one created at boot time by the clients: in the chrooted environment the symbolic links point to the proper location in /flash, and on the clients the writable tree is mounted on /flash, so the common content of /flash is hidden by the tree built in RAM by each client.
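The relocation step described above (move a writable location under flash/ and leave a symbolic link behind) can be sketched as follows. This is not the procedure we actually ran, only an illustration: the function name relocate_to_flash is hypothetical and it is meant to be run on virtdom against the exported tree.

```sh
#!/bin/sh
# Hypothetical helper: move <root>/<path> to <root>/flash/<path> and
# replace it with a symbolic link to /flash/<path>.
# The link target is absolute, so it resolves both inside the chroot on
# the server and on a booted client.
relocate_to_flash() {
    root=$1    # exported tree, e.g. /opt/ondemand/root
    path=$2    # location relative to the tree, e.g. var/run
    # Refuse missing paths, paths that are already links, and
    # destinations that already exist under flash/.
    [ -e "${root}/${path}" ] || return 1
    [ ! -L "${root}/${path}" ] || return 1
    [ ! -e "${root}/flash/${path}" ] || return 1
    mkdir -p "${root}/flash/$(dirname "${path}")" || return 1
    mv "${root}/${path}" "${root}/flash/${path}" || return 1
    ln -s "/flash/${path}" "${root}/${path}"
}
```

For instance, `relocate_to_flash /opt/ondemand/root var/run` would produce the var/run link shown in the listings that follow.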
Here are some examples of the resulting filesystem, edited as described. /opt/ondemand/root/ on virtdom contains the exported root tree.
virtdom:~# ls /opt/ondemand/root/
bin boot dev etc flash home initrd lib media mnt opt proc root sbin srv sys tmp usr var
In /flash there are the real files, replaced in their original locations by symbolic links.
As an example, here is how on virtdom the folders in var/ are replaced by links while the real ones are kept in the flash/ directory.
virtdom:~# ls -l /opt/ondemand/root/flash/*
/opt/ondemand/root/flash/etc:
total 4
drwxr-xr-x 4 root root 4096 Nov 29 12:18 udev

/opt/ondemand/root/flash/var:
total 20
drwxr-xr-x 5 root root 4096 Nov 26 10:59 lib
drwxrwxrwt 2 root root 4096 Nov 16 13:07 lock
drwxr-xr-x 5 root root 4096 Nov 16 12:56 log
drwxr-xr-x 2 root root 4096 Nov 21 17:27 run
drwxrwxrwt 2 root root 4096 Nov 23 16:37 tmp
virtdom:~# ls -l /opt/ondemand/root/var/
[...]
drwxrwsr-x 2 root staff 4096 Oct 28 2006 local
lrwxrwxrwx 1 root root 15 Nov 16 18:37 lock -> /flash/var/lock
lrwxrwxrwx 1 root root 14 Nov 16 18:37 log -> /flash/var/log
drwxrwsr-x 2 root mail 4096 Oct 29 15:59 mail
drwxr-xr-x 2 root root 4096 Oct 29 15:59 opt
lrwxrwxrwx 1 root root 14 Nov 16 13:06 run -> /flash/var/run
drwxr-xr-x 3 root root 4096 Oct 29 16:01 spool
drwxr-xr-x 3 root root 4096 Nov 16 09:42 xen
On a generic client, the content of /var will be exactly the same as /opt/ondemand/root/var on the virtdom share:
post23:~# ls -l /var/
[...]
drwxrwsr-x 2 root staff 4096 Oct 28 2006 local
lrwxrwxrwx 1 root root 15 Nov 16 18:37 lock -> /flash/var/lock
lrwxrwxrwx 1 root root 14 Nov 16 18:37 log -> /flash/var/log
drwxrwsr-x 2 root mail 4096 Oct 29 15:59 mail
drwxr-xr-x 2 root root 4096 Oct 29 15:59 opt
lrwxrwxrwx 1 root root 14 Nov 16 13:06 run -> /flash/var/run
drwxr-xr-x 3 root root 4096 Oct 29 16:01 spool
drwxr-xr-x 3 root root 4096 Nov 16 09:42 xen
The /flash folder instead will contain the tree created in RAM at boot time:
post23:~# ls -al /flash/
total 4
drwxrwxrwt 4 root root 80 Jan 18 11:34 .
drwxr-xr-x 20 root root 4096 Nov 26 09:46 ..
drwxr-xr-x 4 root root 80 Jan 18 11:34 etc
drwxr-xr-x 7 root root 140 Jan 18 11:34 var
As a net result, the chrooted environment on virtdom and the client filesystem look exactly the same. The real difference is that on virtdom the tree is completely writable, whereas on the clients only the portion built in RAM and mounted on /flash is writable.
Custom init script
The clients cannot access the NFS root filesystem in read/write mode, so how do we deal with programs that need to write something, for example in /var/run? To cope with this we need a mechanism that creates the needed parts of the root filesystem in RAM and then mounts them in read/write mode. The mount point is exactly the one the symbolic links point to in the chrooted environment (/flash). In this way, when the new writable filesystem is mounted, it hides the old one and the symbolic links remain valid. However some files, like the udev rules in /flash/etc/udev, would be lost in the mounting, so we need to copy them. All this is achieved with an init script that is executed before any other init script.
#!/bin/sh
#
# rw filesystem initialization
#
# Thu Nov 22 16:24:17 UTC 2007
#
#Temporary folder where the writable elements are going to be created
TMPBASE="/mnt/tmp"
#The location where the writable filesystem will be mounted
BASE="/flash"
DIRS="var \
var/run \
var/run/screen \
var/log \
var/lock \
var/lib \
var/lib/xend \
var/lib/xenstored \
var/lib/urandom \
var/lib/dhcp3 \
var/tmp \
etc \
etc/network \
etc/network/run \
etc/udev"
DEVBASE="/dev"
DEVDIRS="pts \
shm"
WAIT=0.1
#generating devices files
mkdev() {
    mount -n -t tmpfs tmpfs ${DEVBASE}
    for i in $DEVDIRS ; do
        mkdir ${DEVBASE}/${i}
    done
    for i in `seq 1 6`; do
        mknod /dev/tty$i c 4 $i
    done
    mknod /dev/tty c 5 0
    mknod /dev/null c 1 3
    mknod /dev/console c 5 1
    mknod /dev/random c 1 8
    mknod /dev/urandom c 1 9
}
#creates the filesystem in the temporary location
mktmp() {
    #the filesystem is created in ram
    mount -n -t tmpfs tmpfs ${TMPBASE}
    for i in $DIRS ; do
        mkdir ${TMPBASE}/${i}
        chmod +w ${TMPBASE}/${i}
    done
    rsync -a /flash/etc/udev ${TMPBASE}/etc
}
#at the end we move the newly created stuff in the proper location
domove() {
    mount --move ${TMPBASE} ${BASE}
}
#"local" commands
dolocal() {
    touch ${BASE}/var/log/dmesg
    echo "lo=lo" > ${BASE}/etc/network/run/ifstate
}
#what to do
DOLIST="mkdev \
mktmp \
domove \
dolocal"
doit() {
    SN=`echo -n $0 | sed 's:^/.*/::'`
    for i in $DOLIST ; do
        echo -n "${SN}: "
        echo $i
        $i
    done;
    echo -n "${SN}: "
    echo "$WAIT Seconds..."
    sleep $WAIT
}
#do it!
doit
Obviously the script relies on the fact that real files in the root filesystem are replaced with consistent symbolic links as described above. We used the Debian init system to be sure that the script is executed before the udev scripts.
Xen configuration tricks
Some changes to the default Xen configuration are needed to make the whole thing work. The first thing to pay attention to is that the Xen 3.x default network configuration renames the physical ethernet interface to peth0, creates a virtual interface named eth0 and assigns the dom0 IP to it (among many other things; see the network-bridge script for details). The NFS connection does not survive this operation, so a workaround is needed.
There are two possibilities:
- Changing the way Xen uses networking returning to the way Xen 2.x did.
- Creating in RAM an environment with all the necessary tools to make the network setup even without the NFS connection and then restarting the NFS connection.
We chose the first for simplicity. The standard Xen 3.1 network-bridge script already contains the needed code inside the op_start function. The original script:
op_start () {
[...]
    create_bridge ${bridge}
    if link_exists "$vdev"; then
        mac=`ip link show ${netdev} | grep 'link\/ether' | sed -e 's/.*ether \(..:..:..:..:..:..\).*/\1/'`
        preiftransfer ${netdev}
        transfer_addrs ${netdev} ${vdev}
        if ! ifdown ${netdev}; then
            # If ifdown fails, remember the IP details.
            get_ip_info ${netdev}
            ip link set ${netdev} down
            ip addr flush ${netdev}
        fi
        ip link set ${netdev} name ${pdev}
        ip link set ${vdev} name ${netdev}
        setup_bridge_port ${pdev}
        setup_bridge_port ${vif0}
        ip link set ${netdev} addr ${mac} arp on
        ip link set ${bridge} up
        add_to_bridge ${bridge} ${vif0}
        add_to_bridge2 ${bridge} ${pdev}
        do_ifup ${netdev}
        if ! ifdown ${pdev}; then
            # If ifdown fails, remember the IP details.
            get_ip_info ${pdev}
            ip link set ${pdev} down
            ip addr flush ${pdev}
        fi
    else
        ip link set ${bridge} arp on
        ip link set ${bridge} multicast on
        # old style without ${vdev}
        transfer_addrs ${netdev} ${bridge}
        transfer_routes ${netdev} ${bridge}
        # Attach the real interface to the bridge.
        add_to_bridge ${bridge} ${netdev}
        ip addr flush ${netdev}
    fi
[...]
has to be changed so that the code in the else branch is always executed: we removed (or commented out) all the code between then and else, together with the if, else and fi lines themselves.
[...]
create_bridge ${bridge}
ip link set ${bridge} arp on
ip link set ${bridge} multicast on
# old style without ${vdev}
transfer_addrs ${netdev} ${bridge}
transfer_routes ${netdev} ${bridge}
# Attach the real interface to the bridge.
add_to_bridge ${bridge} ${netdev}
ip addr flush ${netdev}
[...]
This way, when xend creates its bridges, it does not transfer the machine's IP to the veth virtual interface; it keeps using the real interface, preserving the root over NFS connection.
The whole boot sequence
The scheme below summarizes the boot process of a generic client and its interactions with the two servers.
[Figure: boot sequence of a generic client and the two servers]
Troubleshooting
In this section we collect some of the problems we ran into and the solutions we adopted.
- 3Com cards pxeGRUB:
Some old 3Com cards have problems with newer pxegrub bootloaders. We worked around this with an old version of pxegrub (taken from an old distro), without further investigation.
- APIC enabled in BIOS:
VIA Rhine network cards have some problems with APIC, so it should be disabled by passing the right parameter to the kernel (noapic). However, some workstations kept ignoring the DHCP offers from infolab, causing the boot process to hang at DHCP configuration. This was because some BIOSes enable APIC features by default, no matter what kernel parameters are passed at boot. In that case APIC should be disabled in the BIOS.
- VIA Rhine rebooting problems:
Some broken BIOSes are unable to bring the via-rhine chip back from power state D3, so PXE booting fails. The D3 power state has to be disabled in via-rhine.c in the kernel sources. Since kernel 2.6.18 there is a workaround that allows disabling it by passing via-rhine.avoid_D3=1 to the kernel. More information about this problem can be found here: http://lkml.org/lkml/2004/9/17/220
- Error: Device 0 (vif) could not be connected. Hotplug script not working.
This error appeared when trying to start a virtual machine and was caused by the absence of /etc/udev/xen-backend.rules on the privileged domains.
Future Enhancements
We already use Xen in a VLAN environment, so we plan to integrate VLAN support into dom0 over NFS. We will also add the possibility of handling iSCSI and AoE virtual machines.
