New NBFT initramfs module #2620
base: master
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -33,7 +33,7 @@ touch %{buildroot}@SYSCONFDIR@/nvme/hostid | |
@UDEVRULESDIR@/65-persistent-net-nbft.rules | ||
@UDEVRULESDIR@/70-nvmf-autoconnect.rules | ||
@UDEVRULESDIR@/71-nvmf-netapp.rules | ||
-@DRACUTRILESDIR@/70-nvmf-autoconnect.conf | ||
+@DRACUTRULESDIR@/70-nvmf-autoconnect.conf | ||
@SYSTEMDDIR@/nvmf-connect@.service | ||
@SYSTEMDDIR@/nvmefc-boot-connections.service | ||
@SYSTEMDDIR@/nvmf-connect-nbft.service | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
#!/bin/bash | ||
|
||
if [[ "$1" == nbft* ]] && [[ "$2" == "up" ]]; then | ||
    systemctl start nvmf-connect-nbft.service | ||
fi |
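The hook's matching logic can be exercised without NetworkManager. The following is a minimal sketch, assuming (as the hook above does) that the interface name arrives as `$1` and the action as `$2`. The function names `nbft_hook_should_fire` and `nbft_hook_dry_run` are hypothetical and not part of this PR, and the sketch uses a POSIX `case` statement rather than the `[[ ... ]]` test in the hook.

```shell
# Hypothetical helper mirroring the hook's condition.
# $1 = interface name, $2 = dispatcher action.
nbft_hook_should_fire() {
    case "$1" in
        nbft*) [ "$2" = "up" ] ;;
        *) return 1 ;;
    esac
}

# Dry run: print the decision instead of calling systemctl.
nbft_hook_dry_run() {
    if nbft_hook_should_fire "$1" "$2"; then
        echo "start nvmf-connect-nbft.service"
    else
        echo "ignore"
    fi
}
```

Calling `nbft_hook_dry_run nbft0 up` reports that the service would be started, while `nbft_hook_dry_run eth0 up` and `nbft_hook_dry_run nbft0 down` are ignored.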
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
# Boot from NVMe over TCP (NBFT) | ||
# | ||
# For NVMe/TCP connections that provide namespaces containing rootfs | ||
# it is crucial to react to carrier events and reconnect any missing | ||
# NVMe/TCP connections as defined in the ACPI NBFT table. A custom | ||
# /usr/lib/NetworkManager/dispatcher.d/99-nvme-nbft-connect.sh hook | ||
# will respawn nvmf-connect-nbft.service on such occasions. | ||
|
||
[device-nbft-no-ignore-carrier] | ||
|
||
# only affects nbft0, nbft1, ... interfaces | ||
match-device=interface-name:nbft* | ||
|
||
# react on link up/down events | ||
ignore-carrier=no | ||
It would be more intuitive to add globbing support to NetworkManager's "Device List" format, and then simply set something like
Anyway, this is Fedora/RHEL specific. Not sure if other distributions ship the I'd prefer if you find some other way to set this up on Fedora. Perhaps simply by documenting that nbft devices should be exempted from the |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,125 @@ | ||
# The dracut 95nbft module | ||
|
||
Focused solely on providing the Boot from NVMe over TCP functionality, intended | ||
to replace parts of the existing `95nvmf` dracut module. At the moment this all | ||
depends on the recently added NetworkManager NBFT support, though the desire is | ||
to support more network management frameworks in the future. | ||
|
||
Related nvme-cli meson configure options: | ||
* `-Ddracut-module` (default=false) - enables the 95nbft dracut module | ||
* `-Ddracutmodulesdir` (default=`$prefix/lib/dracut/modules.d/`) | ||
* `-Dnetworkmanagerdir` (default=`$prefix/lib/NetworkManager/`) | ||
|
||
|
||
# The design | ||
|
||
(see [dracut.bootup(7)](https://man7.org/linux/man-pages/man7/dracut.bootup.7.html) | ||
for the overall boot process flow) | ||
|
||
The boot process looks roughly as follows: | ||
* `nbft-boot-pre.service` is run, creates udev network link files and tells | ||
dracut to activate networking | ||
* dracut runs `nm-initrd-generator` and starts the NetworkManager daemon | ||
I fail to see where this happens. Is it an |
||
* `systemd-udev-trigger.service` renames the network interfaces | ||
* `nm-wait-online-initrd.service` finishes, indicating networking is up and ready | ||
See my remark below about |
||
* `nbft-boot-connect.service` initiates actual NVMe connections | ||
* the dracut initqueue is waiting for specific block devices (rootfs) to appear | ||
|
||
So this, and the starting of NM above, are the only 2 things that dracut needs to perform, and therefore it's relatively easy to plug this scheme into mkosi or another initramfs generator, as long as that generator is based on systemd. Nice. Am I understanding correctly? |
||
Two major packages are responsible for this: the new nvme-cli dracut module and | ||
the added NBFT support in NetworkManager. | ||
|
||
## The new dracut 95nbft module | ||
Nit: nothing ages so quickly as the word "new" ... |
||
|
||
The dracut `module-setup.sh` only installs two systemd unit files sandwiched | ||
between specific dracut phases, nothing else. By default the module is always | ||
included in the initramfs. If _hostonly_ is requested, the system is tested for | ||
the presence of ACPI NBFT tables and the module is included only when they are | ||
found. | ||
|
||
The systemd unit files are only run when the ACPI NBFT tables are present and | ||
the `rd.nvmf.nonbft` kernel command line argument, which instructs the boot | ||
process to skip the NBFT machinery, was not provided. | ||
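The two conditions can be illustrated with a short shell sketch. This is not how the units in this PR express the check (the unit files are not shown here). It is a hypothetical helper, `nbft_units_enabled`, that takes the kernel command line as a string and the ACPI tables directory as a path, and assumes the argument appears as a bare `rd.nvmf.nonbft` word.

```shell
# Hypothetical helper: succeed only if the NBFT units should run.
# $1 = kernel command line string, $2 = directory holding ACPI tables.
nbft_units_enabled() {
    cmdline=$1
    tables_dir=$2
    # An explicit rd.nvmf.nonbft argument disables the NBFT machinery.
    for arg in $cmdline; do
        if [ "$arg" = "rd.nvmf.nonbft" ]; then
            return 1
        fi
    done
    # At least one NBFT table file must be present.
    for f in "$tables_dir"/NBFT*; do
        if [ -f "$f" ]; then
            return 0
        fi
    done
    return 1
}
```

On a running system this could be called as `nbft_units_enabled "$(cat /proc/cmdline)" /sys/firmware/acpi/tables`, the same tables path that `has_nbft()` in `module-setup.sh` inspects.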
|
||
## nbft-boot-pre.service | ||
|
||
Calls the nvme-cli nbft plugin to generate network link files for each interface | ||
found in all NBFT tables. Interface names take the form `nbftXhY`, where X is | ||
the ACPI NBFT table index (defaults to 0) and Y is the specified HFI index. | ||
In a typical scenario only `nbft0h1`, `nbft0h2`, `nbft1h1`, ... interfaces are | ||
present. However, the pre-OS driver is free to supply arbitrary indexes, | ||
which can lead to non-sequential interface names such as | ||
`nbft0h100` and `nbft99h123`. Compared to the ordering used by the old `95nvmf` | ||
dracut module, this naming scheme is geared towards (semi-)stable predictable | ||
network interface names. Keep in mind that the contents of the NBFT tables | ||
are generated from scratch upon every system start and are not always persistent | ||
between reboots. | ||
|
||
The network link files are then picked up by udev on trigger via | ||
`systemd-udev-trigger.service` to apply the new interface names. | ||
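To make the naming scheme concrete, here is a minimal sketch of how an `nbftXhY` name and a matching udev link file could be composed. The helper names are hypothetical, and matching on the MAC address is an assumption (the actual files are generated by the nvme-cli nbft plugin, whose output is not shown in this PR). The `[Match]`/`MACAddress=` and `[Link]`/`Name=` keys are standard `systemd.link` syntax.

```shell
# Hypothetical helper: compose the interface name.
# $1 = ACPI NBFT table index, $2 = HFI index.
nbft_ifname() {
    printf 'nbft%dh%d\n' "$1" "$2"
}

# Hypothetical helper: print a systemd.link file body that renames the
# NIC with the given MAC address (assumed match criterion).
# $1 = MAC address, $2 = table index, $3 = HFI index.
nbft_link_file() {
    printf '[Match]\nMACAddress=%s\n\n[Link]\nName=%s\n' \
        "$1" "$(nbft_ifname "$2" "$3")"
}
```

For example, `nbft_link_file 00:11:22:33:44:55 0 1` prints a link file that assigns the name `nbft0h1` to the interface with that (made-up) MAC address.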
|
||
For simplicity and for the time being this systemd unit replaces the traditional | ||
dracut cmdline hook and adds the `rd.neednet=1` `cmdline.d` argument. | ||
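As an illustration of what adding a `cmdline.d` argument amounts to, the sketch below writes `rd.neednet=1` into a drop-in file. The helper name and the file name `40-nbft.conf` are hypothetical, and the sketch assumes the usual dracut convention of reading `*.conf` files from `/etc/cmdline.d` inside the initramfs. What `nbft-boot-pre.service` actually does is not shown in this PR excerpt.

```shell
# Hypothetical helper: request networking via a cmdline.d drop-in.
# $1 = cmdline.d directory (assumed to be /etc/cmdline.d in the initramfs).
nbft_request_network() {
    mkdir -p "$1" && echo "rd.neednet=1" > "$1/40-nbft.conf"
}
```

Passing the directory as a parameter keeps the sketch safe to try against a scratch directory rather than a live system.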
|
||
What exactly do you mean with "traditional dracut cmdline hook" here? How can you replace it? |
||
## nm-initrd-generator NBFT support | ||
|
||
https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/2077 | ||
|
||
This seems to be the most important difference wrt Your approach is of course much more efficient, but at the cost of being compatible only with (a future version of) NM. A similar approach could be taken by wicked or other network management tools. But I wonder if there might be some middle ground, perhaps we can provide the HFI data in some format that any network management tool can easily convert? The "dracut command line" format is obviously very clumsy and simplistic. So we could keep |
||
Executed before the NetworkManager daemon starts, the added NBFT support parses | ||
the available ACPI NBFT tables and generates system connections. The connections | ||
reference interfaces by MAC address only, relying on udev to perform the actual | ||
interface renaming. | ||
|
||
The `nm-initrd-generator` doesn't link to `libnvme.so.1` but opens it through | ||
`dlopen()` at runtime. This allows for smaller hostonly initramfs images in case | ||
the NBFT tables are not present in the system. The library is pulled in | ||
indirectly through the dracut module's requirement of nvme-cli. The | ||
`rd.nvmf.nonbft` kernel command line argument is respected as well. | ||
|
||
## nbft-boot-connect.service | ||
|
||
Modprobes required modules (`nvme-fabrics`) first. | ||
|
||
Performs actual NVMe connections by calling `nvme connect-all --nbft`. The | ||
nvme-cli code has been modified to return a non-zero return code in case one | ||
or more SSNS records fail to connect (except those marked as _'unavailable'_ | ||
by the pre-OS driver), resulting in a service startup failure with a defined | ||
respawn interval of 10 seconds (TBD). This ensures multiple connection attempts | ||
while NetworkManager reacts to link events in the background and the dracut | ||
initqueue eagerly waits for new block devices to appear, to be scanned and | ||
mounted. Once the required block device appears, the wait cycle ends and the | ||
system continues booting, stopping any queued `nbft-boot-connect.service` | ||
respawns seamlessly. | ||
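The retry behaviour can be approximated in plain shell for experimentation. Note that this is a swapped-in technique: the PR relies on systemd respawning the failed service, not on a loop, and it stops retrying when the root device appears rather than after a fixed number of attempts. The function `retry_until_success` is hypothetical.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# $1 = max attempts, $2 = delay in seconds, remaining args = command to run.
retry_until_success() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        # A zero exit status means every required connection succeeded.
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

An invocation resembling the 10 second respawn described above would be `retry_until_success 30 10 nvme connect-all --nbft`, where the limit of 30 attempts is an arbitrary choice for the example.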
|
||
I see below that |
||
The difference from the old dracut `95nvmf` module is that the NVMe connection | ||
attempts are not driven by network link up events but have a fixed respawn | ||
interval. This may help in cases where the NIC is slow to | ||
initialize and reports link up, yet takes another 5+ seconds before it's fully | ||
able to send/receive packets. We've seen this issue with some 25Gb NICs.
Good point. Ideally we wouldn't react on "link up" events but on events that indicate an L3 connection. But I'm not sure if such events exist... (see below) Note that |
||
|
||
|
||
# The post-switchroot boot flow | ||
|
||
## nvmf-connect-nbft.service | ||
|
||
This unit is supposed to run once the `network-online.target` has been reached | ||
and calls `nvme connect-all --nbft` again. This ensures an additional connection | ||
attempt for records that failed to connect in the initramfs phase. As long as | ||
this call matches existing connections and skips SSNS records that have been | ||
already connected, in an ideal case this would result in a no-op. This is | ||
mostly a one-shot service run in NetworkManager based distros since the target | ||
typically stays reached until reboot. | ||
|
||
## NetworkManager dispatcher hooks | ||
|
||
The nvme-cli package installs a custom NetworkManager dispatcher service hook | ||
(`99-nvme-nbft-connect.sh`) that just restarts `nvmf-connect-nbft.service` on | ||
_link up_ events on `nbft*` interfaces. At the time the hook runs the interface | ||
in question has been fully configured by NetworkManager. This ensures further | ||
Hm. This basically describes the "L3 up" events I just thought we didn't have ... can't we just do this in the initrd as well? |
||
reconnection attempts in multipath scenarios where a network interface just came | ||
alive. This is designed as a secondary measure with the kernel nvme host driver | ||
connection recovery being the primary mechanism. | ||
|
||
In order to make link events work properly the `nbft*` interfaces need to be set | ||
not to ignore carrier events. This is done through a custom override snippet | ||
(`99-nvme-nbft-no-ignore-carrier.conf`) as some distributions may opt to follow | ||
legacy server networking behaviour (see the `NetworkManager-config-server` package). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
#!/usr/bin/bash | ||
|
||
has_nbft() { | ||
    local f found= | ||
    for f in /sys/firmware/acpi/tables/NBFT*; do | ||
        [ -f "$f" ] || continue | ||
        found=1 | ||
        break | ||
    done | ||
    [[ $found ]] | ||
} | ||
|
||
# called by dracut | ||
check() { | ||
    require_binaries nvme || return 1 | ||
|
||
    [[ $hostonly ]] || [[ $mount_needs ]] && { | ||
        if ! has_nbft; then | ||
            echo "No ACPI NBFT tables present in the system" | ||
            return 255 | ||
        fi | ||
    } | ||
    return 0 | ||
} | ||
|
||
# called by dracut | ||
depends() { | ||
    echo bash rootfs-block network | ||
    return 0 | ||
} | ||
|
||
# called by dracut | ||
installkernel() { | ||
    hostonly="" instmods nvme_tcp nvme_fabrics 8021q | ||
} | ||
|
||
# called by dracut | ||
install() { | ||
    inst_multiple nvme | ||
|
||
    # TODO: /etc/nvme/hostnqn | ||
|
||
    for i in \ | ||
        nbft-boot-pre.service \ | ||
        nbft-boot-connect.service; do | ||
        inst_simple "${moddir}/$i" "${systemdsystemunitdir}/$i" | ||
        $SYSTEMCTL -q --root "$initdir" enable $i | ||
    done | ||
} |
I'm confused about this config section. Will you implement support for this section, and the directives it contains, in NetworkManager? If not, how is this supposed to work?