There are several methods for constructing and managing small-scale,
heterogeneous ("Beowulf") computing clusters. Warewulf
is an end-to-end suite that facilitates the management of node images,
provisioning of nodes (including custom, per-node configuration overlays), and
control of TFTP, DHCP and NFS servers. Originally a suite of Perl scripts that
provided a front end to a database, Warewulf is now written in Go and uses a
simpler flat-file backing store. The server-control components rely exclusively
on the systemctl command in systemd, but could be trivially patched to
support runit as well.
After considering bringing Warewulf to Void, I concluded that most of its major utility can be realized with a much simpler approach. One of my requirements for a small cluster is that the master node run essentially the same image as the others, and this wouldn't be possible with Warewulf without some special accommodations. To satisfy this requirement, all nodes can mount a root filesystem image as the lower tree of an overlay filesystem that uses memory-backed tmpfs as its upper tree. The master node can mount the lower tree from a locally attached ZFS pool; all other nodes can mount the lower tree from an NFS export on the master. For home directories, the master will again mount and export local storage for the other nodes to access. The master then runs a PXE server that will provide kernels and an appropriate initramfs image to all others.
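The layering described above can be sketched with ordinary mount commands. This is only an illustration and not the hook shipped in this repository: the helper names overlay_opts and show_overlay_mounts and the scratch directory /run/rootfs/rw are assumptions made for the example. The commands are printed rather than executed, because real mounts require root and an initramfs environment.

```shell
#!/bin/sh
# Build the option string for an overlayfs mount.
# $1: read-only lower tree; $2: writable scratch directory (tmpfs-backed)
overlay_opts() {
    printf 'lowerdir=%s,upperdir=%s/upper,workdir=%s/work\n' "$1" "$2" "$2"
}

# Print (rather than run) the mounts an early-boot script might perform.
# $1: lower tree; $2: scratch directory (hypothetical); $3: final root target
show_overlay_mounts() {
    lower="$1"; scratch="$2"; target="$3"
    echo mount -t tmpfs tmpfs "$scratch"
    echo mkdir -p "$scratch/upper" "$scratch/work"
    echo mount -t overlay overlay -o "$(overlay_opts "$lower" "$scratch")" "$target"
}

# /run/rootfs/rw is an assumed scratch location; /new_root is where
# mkinitcpio expects the real root to be mounted.
show_overlay_mounts /run/rootfs/lower /run/rootfs/rw /new_root
```

The upper and work directories live on the same tmpfs because overlayfs requires them to share a filesystem; everything written there disappears at shutdown, which is what keeps the lower image unchanged.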
The most heavily customized node will be the master, because it should run services and use local filesystems that will not be used on other nodes. For other nodes, customization is generally limited to assigning unique hostnames to each, although more extensive per-node customization is possible. A simple override system allows the replacement of common files with customized variants early in the boot process. The replacement system provides all of the flexibility necessary to assign unique roles to individual nodes as needed.
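As a rough sketch of what such an override step could look like, the function below copies a per-node directory of replacement files over a target tree. The function name apply_overlay is hypothetical, and the actual scripts in the overlays subdirectory may work differently; the example only assumes a layout in which replacements for files in /etc live under /etc/overlays/<MAC address>.

```shell
#!/bin/sh
# Copy every file under a per-node overlay directory over the matching
# path in a target tree. Does nothing when the node has no overlay.
# $1: overlay directory (e.g. /etc/overlays/<mac>); $2: target (e.g. /etc)
apply_overlay() {
    src="$1"; dst="$2"
    [ -d "$src" ] || return 0
    mkdir -p "$dst"
    # "src/." copies the directory contents, preserving modes and links
    cp -a "$src/." "$dst/"
}

# Example invocation (interface name and paths are assumptions):
# mac="$(cat /sys/class/net/eth0/address)"
# apply_overlay "/etc/overlays/${mac}" /etc
```

Nodes without a matching directory keep the common files untouched, so only nodes that need a distinct role require any per-node configuration.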
The subdirectories of this repository provide sample configurations and scripts that should be deployed on the master node. Each directory contains a dedicated README that describes how to install and use the components therein. The subsystems that must be modified for clustering are:
- initcpio: I prefer the use of mkinitcpio to dracut both because mkinitcpio is generally simpler to configure and because dracut is increasingly hostile to systems that do not use systemd or attempt to include it in initramfs images. This subdirectory contains the pieces necessary to configure mkinitcpio to produce initramfs images for both the master node and the client nodes.
- overlays provides the components necessary to implement early-boot configuration overlays on a per-node basis.
- tftp provides instruction and a simple PXELINUX configuration that can be used to serve the client kernel and initramfs to diskless nodes.
These scripts and configuration overlays are intended to be added to a stock
Void Linux installation that contains the desired software and configuration
for all nodes in the cluster. Because the master node in my cluster runs atop a
ZFS pool, the master node was originally installed according to the
ZFSBootMenu guide
for booting Void on a UEFI system. After a base installation is configured and
booting, make sure mkinitcpio is installed and configured for use on the
system:
xbps-install -S mkinitcpio mkinitcpio-zfs
xbps-alternatives -s mkinitcpio
At this point, the initcpio configuration from this repository can replace
the default mkinitcpio.conf. The initramfs can be regenerated by running
xbps-reconfigure -f linuxX.Y
where X.Y should be replaced with whatever version describes the Void kernel
series currently installed.
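If the kernel that is currently running is also the one whose initramfs should be rebuilt, the package name can be derived from uname -r. The helper below is a small sketch under that assumption; the name kernel_series_pkg is hypothetical, and it expects a release string of the form X.Y.Z_N. It prints the command instead of running it.

```shell
#!/bin/sh
# Turn a kernel release such as "6.6.52_1" into a package name like "linux6.6".
kernel_series_pkg() {
    printf 'linux%s\n' "$(printf '%s\n' "$1" | cut -d. -f1,2)"
}

# Assumes the running kernel belongs to the series that should be rebuilt.
pkg="$(kernel_series_pkg "$(uname -r)")"
echo "xbps-reconfigure -f ${pkg}"
```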
NOTE: at this point, rebooting the system will result in a root filesystem that consists of a tmpfs overlay on top of the underlying ZFS filesystem. Subsequent configuration on top of the tmpfs overlay will be lost after system shutdown. One of three alternatives will be required to complete configuration of the master node:
- Finish all configuration before rebooting with the new initramfs;
- Temporarily disable the overlayfs hook in mkinitcpio.conf; or
- After rebooting, make sure to complete subsequent configuration while chrooted into the lower layer of the overlay (/run/rootfs/lower).
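Before making changes, it can help to confirm whether the running root is actually an overlay. The following sketch reads the mount table for that purpose; the function name root_fstype is an assumption for this example, and it relies on the kernel reporting overlayfs mounts with the type "overlay" in /proc/mounts.

```shell
#!/bin/sh
# Print the filesystem type of the last mount on "/" in a mounts table.
# $1: mounts file to read (defaults to /proc/mounts)
root_fstype() {
    awk '$2 == "/" { fstype = $3 } END { print fstype }' "${1:-/proc/mounts}"
}

if [ "$(root_fstype)" = "overlay" ]; then
    echo "root is an overlay: make persistent changes in /run/rootfs/lower"
else
    echo "root is not an overlay: changes are written directly"
fi
```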
Eventually, it will be necessary to modify the base installation when the
system has a tmpfs overlay mounted atop it. Because any changes to the upper
layer will be lost after a reboot, modifications must be made in the lower
layer. As configured by the default overlayfs hook, the lower level will be
mounted at /run/rootfs/lower. A straightforward way to manipulate the lower
layer is with the xchroot script that is provided by the xtools package:
xbps-install -S xtools
xchroot /run/rootfs/lower /bin/bash
Within the chroot, complete any configuration necessary (for example, a system
upgrade with xbps-install -Su), then exit the shell and refresh the mount on
the host:
mount -o remount /
Modifications to the lower root filesystem on the master node may trigger messages and I/O errors on clients that hold stale NFS handles to replaced files. After modifying the lower root filesystem on the master, it is often easiest to reboot the client nodes so that they have up-to-date views of the filesystem.
When the master installation is first adapted for cluster use, the image should
be "generalized" by removing configuration that is specific to the master. In
particular, /etc/fstab and, if it exists, /etc/zfs/zpool.cache should be
removed. Enter a chroot into the lower root filesystem and move master-specific
files into the /etc/overlays tree:
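# Work inside the lower layer so that changes persist
xchroot /run/rootfs/lower /bin/bash
# The overlay directory is named after the MAC address (eth0 here; adjust as needed)
macaddr="$(cat /sys/class/net/eth0/address)"
mkdir -p "/etc/overlays/${macaddr}/zfs"
mv /etc/fstab "/etc/overlays/${macaddr}"
# The pool cache may not exist on every system
[ -e /etc/zfs/zpool.cache ] && mv /etc/zfs/zpool.cache "/etc/overlays/${macaddr}/zfs"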
At this point, any other master-specific configuration files should be moved
from /etc (under the chroot) to /etc/overlays/${macaddr}. Make sure to
move, and not copy, these files to avoid leaving them accessible to client
nodes; when necessary, replace any moved files with suitable alternatives for
the client.
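One way to check that nothing was copied by mistake is to look for files in the overlay directory that still have a byte-identical twin in the base tree. The sketch below does this; the function name list_unmoved is hypothetical, and it can only detect identical copies, not files that were copied and then edited.

```shell
#!/bin/sh
# List files in an overlay directory that still have an identical copy in
# the base tree, which suggests they were copied rather than moved.
# $1: overlay directory; $2: base directory as an absolute path (e.g. /etc)
list_unmoved() {
    ( cd "$1" 2>/dev/null || exit 0
      find . -type f | while IFS= read -r f; do
          f="${f#./}"
          if [ -f "$2/$f" ] && cmp -s "$f" "$2/$f"; then
              printf '%s\n' "$f"
          fi
      done )
}

# Example invocation inside the chroot (macaddr as set earlier):
# list_unmoved "/etc/overlays/${macaddr}" /etc
```

An empty result does not prove the image is clean, but any path it prints is worth reviewing before clients mount the image.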
Note that moving /etc/fstab into the overlays tree, as recommended above,
will produce unbootable client nodes. Make sure to replace that file with a
basic version that will mount the home directory exported by the master:
cat > /etc/fstab <<EOF
tmpfs /tmp tmpfs defaults,nosuid,nodev 0 0
172.23.199.225:/home /home nfs4 rw,defaults,retrans=10,rsize=32768,wsize=32768 0 0
EOF
where 172.23.199.225 should be replaced with the IP address of the master
node. The default fstab in the lower-level root will be used by client
nodes, while the master relies on an overlay replacement.
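To avoid editing the address by hand, the same two entries can be produced by a small function that takes the master's address as an argument. This is only a convenience sketch; the name client_fstab is an assumption, and the mount options are copied unchanged from the example above.

```shell
#!/bin/sh
# Print a client fstab that mounts /home from the given master address.
# $1: IP address (or hostname) of the master node
client_fstab() {
    cat <<EOF
tmpfs /tmp tmpfs defaults,nosuid,nodev 0 0
${1}:/home /home nfs4 rw,defaults,retrans=10,rsize=32768,wsize=32768 0 0
EOF
}

# Prints to standard output; inside the chroot, redirect to /etc/fstab.
client_fstab 172.23.199.225
```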