
attach/detach command line interface changes broke nightly tests #2668

Open
igaw opened this issue Jan 21, 2025 · 27 comments

@igaw
Collaborator

igaw commented Jan 21, 2025

It appears #2665 breaks our nightly builds:

https://github.com/linux-nvme/nvme-cli/actions/runs/12878797474/job/35905289626#step:8:761

@ikegami-t could you have a look?

BTW, be careful running these tests locally. These tests do write to the local hardware, e.g. /dev/nvme0n1

@igaw igaw changed the title from "attach/detach command line interface changes brake nightly tests" to "attach/detach command line interface changes broke nightly tests" Jan 21, 2025
@igaw igaw added the bug label Jan 21, 2025
@ikegami-t
Contributor

ikegami-t commented Jan 21, 2025

Just confirmed the log. I am not sure why the device /dev/nvme1 failed the detach as being neither a chardev nor a blkdev, and why NS1 on /dev/nvme1 did not fail but NS2 did, but I don't think this is caused by an nvme-cli bug.

  1. Build and install nvme-cli: nvme-cli 2.11 built and tested
    https://github.com/linux-nvme/nvme-cli/actions/runs/12878797474/job/35905289626#step:6:134
  Configuration
    git version   : "1.11.1-41-ga9f667a"

nvme-cli 2.11

  Paths
    prefixdir         : /usr/local
  2. Run on-device tests: failed on detaching an NS, and it also seems NS2 failed to detach
    https://github.com/linux-nvme/nvme-cli/actions/runs/12878797474/job/35905289626#step:8:616
##### Creating 109
##### Attaching 109
##### Running IOs in 109
Testcase main ... /dev/nvme1 is not a block or character device
Usage: nvme detach-ns <device> [OPTIONS]

Detach the given namespace from the given controller; de-activates the given
namespace's ID. A namespace must be attached to a controller before IO
commands may be directed to that namespace.

Options:
  [  --verbose, -v ]                    --- Increase output verbosity
  [  --output-format=<FMT>, -o <FMT> ]  --- Output format: normal|json|binary
  [  --namespace-id=<NUM>, -n <NUM> ]   --- namespace to attach
  [  --controllers=<LIST>, -c <LIST> ]  --- optional comma-sep controller id
                                            list
  [  --timeout=<NUM>, -t <NUM> ]        --- timeout value, in milliseconds
FAIL
...
##### Detaching 1
#### Deleting 1
##### Detaching 2

teardown: ctrl: /dev/nvme1, ns1: /dev/nvme1n1, default_nsid: 1, flbas: 0
  3. Code: the error message is output because the device /dev/nvme1 is neither a chardev nor a blkdev (nvme-cli/nvme.c, line 367 in 946029c):
static int open_dev_direct(struct nvme_dev **devp, char *devstr, int flags, __u32 nsid)
{
...
	if (!is_chardev(dev) && !is_blkdev(dev)) {
		nvme_show_error("%s is not a block or character device", devstr);
		errno = ENODEV;
		err = -1;
		goto err_close;
	}

@ikegami-t
Contributor

ikegami-t commented Jan 21, 2025

Sorry, I just realized the error occurs on master, as tested by run-nightly-tests #88 https://github.com/linux-nvme/nvme-cli/actions/runs/12878797474.
Now I think the error is caused by the commit 565b30e changes, which switched from using the chardev /dev/nvme1 to using the blkdev /dev/nvme1n2 (or /dev/nvme1n1) instead, and then the error happened.
Also I think the blkdev is not actually detected by the fstat() in the driver, hence the error (but the device itself is probably okay, as it did not fail before the changes).
So can we add some delay before detaching the NS? (Also for attaching, though it seems attaching currently uses just the chardev, so no error happened.)

(Added)
It seems a validation could be added to the NS-creation test to wait until the blkdev is created.
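That wait could be sketched as a small polling helper; this is hypothetical code, not part of the actual test suite (`wait_for_blkdev` and its 0/1 return convention are just assumptions modeled on the suite's style):

```python
import os
import stat
import time

def wait_for_blkdev(path, timeout=5.0, interval=0.1):
    """Poll until `path` exists and is a block device, or time out.

    Returns 0 on success and 1 on timeout, matching the test
    suite's error-code convention.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if stat.S_ISBLK(os.stat(path).st_mode):
                return 0
        except FileNotFoundError:
            # Device node not created by the driver/udev yet; keep polling
            pass
        time.sleep(interval)
    return 1
```

An inotify-based wait (as discussed later in this thread) avoids the polling interval, but a loop like this is the simplest way to express the idea.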

@igaw
Collaborator Author

igaw commented Jan 21, 2025

As far as I understand the situation, before your change the test would open the char device /dev/nvme0, and now it opens /dev/nvme0n1 -- exactly what the intent of your change was. Not sure why this test worked in the past; I must be missing something.

Adding @keithbusch and @MaisenbacherD for awareness.

@ikegami-t
Contributor

ikegami-t commented Jan 21, 2025

@MaisenbacherD Can we add a check to the test function create_and_validate_ns() that waits until the blkdev /dev/nvme*n* is actually created? It seems the test failure is caused by the blkdev not being completely created by the host driver before the NS is detached.

@igaw
Collaborator Author

igaw commented Jan 21, 2025

I think we need something like this

        if err == 0:
            # enumerate new namespace block device
            self.nvme_reset_ctrl()
            # check if new namespace block device exists
            err = 0 if stat.S_ISBLK(os.stat(self.ns1).st_mode) else 1

for the create_and_validate_ns method as we have in attach_ns

@ikegami-t
Contributor

ikegami-t commented Jan 21, 2025

Thank you! Noted. By the way, can we also add the -v option to the NS management commands, to check the blkdev and dev print info outputs as below?

tokunori@tokunori-desktop:~/nvme-cli$ nvme-build detach-ns /dev/nvme0 -n 2 -v
[sudo] password for tokunori: 
blkdev: /dev/nvme0n2, fd: -1
dev: /dev/nvme0, fd: 3
warning: empty controller-id list will result in no actual change in namespace attachment
NVMe status: Invalid Command Opcode: A reserved coded value or an unsupported value in the command opcode field(0x1)
NS management and attachment not supported
tokunori@tokunori-desktop:~/nvme-cli$ 

@ikegami-t
Contributor

About the blkdev checking you mentioned: it is already done in attach_ns, so checking that the blkdev exists looks sufficient. The cause is still unknown, so let me consider it more later.

@ikegami-t
Contributor

By the way, do we need to execute the nvme_test suite as root, the same as nvme-cli? I failed to install nose2 after switching to root via 'sudo su', so I cannot execute nvme_test as root for now. If you have any advice, please let me know.

tokunori@tokunori-desktop:~/nvme-cli$ nose2 --verbose --start-dir tests nvme_create_max_ns_test
test_attach_detach_ns (nvme_create_max_ns_test.TestNVMeCreateMaxNS.test_attach_detach_ns)
Testcase main ... 
Using nvme binary 'nvme'

setup: ctrl: /dev/nvme0, ns1: /dev/nvme0n1, default_nsid: 1, flbas: 0

Usage: nvme id-ctrl <device> [OPTIONS]

Send an Identify Controller command to the given device and report
information about the specified controller in human-readable or binary
format. May also return vendor-specific controller attributes in hex-dump if
requested.

Options:
  [  --verbose, -v ]                    --- Increase output verbosity
  [  --output-format=<FMT>, -o <FMT> ]  --- Output format: normal|json|binary
  [  --vendor-specific, -V ]            --- dump binary vendor field
  [  --raw-binary, -b ]                 --- show identify in binary format
  [  --human-readable, -H ]             --- show identify in readable format
  [  --timeout=<NUM>, -t <NUM> ]        --- timeout value, in milliseconds
FAIL



======================================================================
FAIL: test_attach_detach_ns (nvme_create_max_ns_test.TestNVMeCreateMaxNS.test_attach_detach_ns)
Testcase main
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tokunori/nvme-cli/tests/nvme_create_max_ns_test.py", line 53, in setUp
    self.nsze = int(self.get_ncap() /
                    ^^^^^^^^^^^^^^^
  File "/home/tokunori/nvme-cli/tests/nvme_test.py", line 285, in get_ncap
    return int(self.get_id_ctrl_field_value("tnvmcap"))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tokunori/nvme-cli/tests/nvme_test.py", line 301, in get_id_ctrl_field_value
    self.assertEqual(err, 0, "ERROR : reading id-ctrl failed")
AssertionError: 1 != 0 : ERROR : reading id-ctrl failed

----------------------------------------------------------------------
Ran 1 test in 0.079s

FAILED (failures=1)

@igaw
Collaborator Author

igaw commented Jan 21, 2025

Bummer, this would have been too easy.

Yes, this test suite needs to be run as root and, as noted before, it will write to the physical disk /dev/nvme0n1. So be very careful here. This was also the reason these tests are disabled by default.

@ikegami-t
Contributor

Okay, I will retry executing as root later, since installing nose2 failed after I ran 'sudo su' earlier.

@ikegami-t
Contributor

ikegami-t commented Jan 22, 2025

Sorry, I could not find the cause of the error, as there appears to be no error in either nvme-cli or nvme_test, but I just pushed the fix PR #2671 to work around this issue by partially reverting the 565b30e changes that caused it.

@igaw
Collaborator Author

igaw commented Jan 23, 2025

I'd rather revert the change instead of stacking stuff on top of each other. I'll try to find some time to look into it. If I can't figure it out, I'll revert the offending change.

@ikegami-t
Contributor

Noted. A TP error was fixed by the PR #2673 changes, which also enable verbose debugging. But the cause has still not been found. Thank you.

@ikegami-t
Contributor

By the way, the TP create_and_validate_ns() function validates the nsid parameter passed from the caller via the id-ns command, but the NSID is selected by the controller via the create-ns command, so is this really a correct NS validation check? Note: the PR above also added blkdev checking for NS creation.
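If the test wanted to validate the NSID the controller actually allocated, one option would be to parse it from the create-ns command output. A hypothetical sketch (the "create-ns: Success, created nsid:<n>" message format and the helper name are assumptions, not confirmed against the test suite):

```python
import re

def parse_created_nsid(output):
    """Extract the controller-allocated NSID from nvme create-ns output.

    Assumes a "create-ns: Success, created nsid:<n>" success message;
    returns the NSID as an int, or None if the output cannot be parsed.
    """
    m = re.search(r"created nsid:\s*(\d+)", output)
    return int(m.group(1)) if m else None
```

The test could then compare this value against the nsid it intended to create, instead of trusting the caller-supplied parameter.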

@MaisenbacherD
Contributor

I will also try to allocate some time to look into this issue :)

@ikegami-t
Contributor

  1. Just confirmed the latest test result below.
    https://github.com/linux-nvme/nvme-cli/actions/runs/13002257549/job/36262990749
  2. For now, the test failed while checking /dev/nvme1n2 after the NS creation. (This check was added by commit c276f3d.)
ERROR: test_attach_detach_ns (nvme_create_max_ns_test.TestNVMeCreateMaxNS.test_attach_detach_ns)
Testcase main
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/nvme-cli/nvme-cli/tests/nvme_create_max_ns_test.py", line 86, in test_attach_detach_ns
    err = self.create_and_validate_ns(nsid,
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/nvme-cli/nvme-cli/tests/nvme_test.py", line 399, in create_and_validate_ns
    err = 0 if stat.S_ISBLK(os.stat(self.ctrl + "n"  + str(nsid)).st_mode) else 1
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/dev/nvme1n2'
  3. Unfortunately the verbose log output added by commit b9b34be is not recorded in the log file, and I am not sure of the cause, so I am not able to check the details.
  4. Anyway, the error was caused by the added commit c276f3d. Very sorry for the error, so I just pushed the PR Revert "tests: add blkdev checking after NS creation" #2680 to revert the commit. The NS is created by the NS create command, but the blkdev /dev/nvme1n2 is not created by the driver since it is not attached yet, so the added check was not correct.

@igaw
Collaborator Author

igaw commented Jan 28, 2025

No worries, this happens when working on code. The important thing is to fix it :)
Anyway, I've pushed the revert. Let's see if this solves the problem.

@igaw
Collaborator Author

igaw commented Jan 29, 2025

 ======================================================================
ERROR: test_attach_detach_ns (nvme_create_max_ns_test.TestNVMeCreateMaxNS.test_attach_detach_ns)
Testcase main
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/nvme-cli/nvme-cli/tests/nvme_create_max_ns_test.py", line 93, in test_attach_detach_ns
    self.assertEqual(self.attach_ns(self.ctrl_id, nsid), 0)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/__w/nvme-cli/nvme-cli/tests/nvme_test.py", line 414, in attach_ns
    err = 0 if stat.S_ISBLK(os.stat(self.ctrl + "n"  + str(nsid)).st_mode) else 1
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/dev/nvme1n2'

@igaw
Collaborator Author

igaw commented Jan 29, 2025

I think we want something like the below, completely untested:

import subprocess
import os
import inotify.adapters
import time

def attach_ns(self, ctrl_id, nsid):
    """ Wrapper for attaching the namespace.
        - Args:
            - ctrl_id : controller id to which namespace to be attached.
            - nsid : new namespace id.
        - Returns:
            - 0 on success, error code on failure.
    """
    attach_ns_cmd = f"{self.nvme_bin} attach-ns {self.ctrl} " + \
        f"--namespace-id={str(nsid)} --controllers={ctrl_id} --verbose"
    err = subprocess.call(attach_ns_cmd, shell=True, stdout=subprocess.DEVNULL)

    if err != 0:
        return err

    # Device path
    device_path = f"{self.ctrl}n{str(nsid)}"
    watch_dir = os.path.dirname(device_path) or "/dev"

    i = inotify.adapters.Inotify()
    i.add_watch(watch_dir)  # Setup inotify first

    # Check immediately if the device node already exists (avoids a race
    # with events that fired before the watch was set up); note a device
    # node is not a regular file, so os.path.isfile() must not be used here
    if os.path.exists(device_path):
        i.remove_watch(watch_dir)  # Cleanup inotify before returning
        return 0

    # Wait for inotify event
    success = False
    start_time = time.time()
    timeout = 5  # seconds

    for event in i.event_gen(yield_nones=False, timeout_s=timeout):
        (_, type_names, path, filename) = event
        if "IN_CREATE" in type_names and filename == os.path.basename(device_path):
            success = True
            break
        if time.time() - start_time > timeout:
            break

    i.remove_watch(watch_dir)  # Cleanup inotify
    return 0 if success else 1

@ikegami-t
Contributor

ikegami-t commented Jan 29, 2025

Looks good: it changes the TP to wait for the blkdev by watching for 5 seconds. Thank you.

(Added)
Just rechecked nvme-cli and the driver for the NS attachment, but the handling still seems the same for both the chardev and the blkdev.

@igaw
Collaborator Author

igaw commented Jan 29, 2025

This version here is slightly better and even tested. The pyinotify package is more common:

import subprocess
import os
import pyinotify
import stat
import time

class EventHandler(pyinotify.ProcessEvent):
    def __init__(self, target_file):
        self.target_file = target_file
        self.file_created = False

    def process_IN_CREATE(self, event):
        if event.pathname == self.target_file:
            self.file_created = True

def attach_ns(self, ctrl_id, nsid):
    """ Wrapper for attaching the namespace.
        - Args:
            - ctrl_id : controller id to which namespace to be attached.
            - nsid : new namespace id.
        - Returns:
            - 0 on success, error code on failure.
    """
    attach_ns_cmd = f"{self.nvme_bin} attach-ns {self.ctrl} " + \
        f"--namespace-id={str(nsid)} --controllers={ctrl_id} --verbose"
    err = subprocess.call(attach_ns_cmd, shell=True, stdout=subprocess.DEVNULL)

    if err != 0:
        return err

    # Device path
    device_path = f"{self.ctrl}n{str(nsid)}"
    watch_dir = os.path.dirname(device_path) or "/dev"

    # Start watching directory before checking file existence
    wm = pyinotify.WatchManager()
    handler = EventHandler(device_path)
    notifier = pyinotify.Notifier(wm, handler)
    wm.add_watch(watch_dir, pyinotify.IN_CREATE)

    # Correct check: See if the block device already exists
    if os.path.exists(device_path) and stat.S_ISBLK(os.stat(device_path).st_mode):
        return 0

    # Wait for the event or timeout
    start_time = time.time()
    timeout = 5  # seconds

    while time.time() - start_time < timeout:
        notifier.process_events()
        if handler.file_created and stat.S_ISBLK(os.stat(device_path).st_mode):
            return 0
        if notifier.check_events(timeout=500):  # 500ms
            notifier.read_events()

    return 1  # Timeout, device not created

@MaisenbacherD
Contributor

@igaw did you test this via nose2 or did you run the test code section by hand?
nose2 does not find the pyinotify lib after pip3 installing it (and reinstalling nose2).
Didn't have much time to look into this yet, will continue testing tomorrow. :)

@MaisenbacherD
Contributor

Sorry - just saw your PR and that you updated the container to install python3-pyinotify. This helps :)

@igaw
Collaborator Author

igaw commented Jan 30, 2025

No, I didn't run the whole test suite locally, I just factored out the function to see if it actually does anything useful. I am very carefully trying to avoid overwriting my local disk :)

@ikegami-t
Contributor

ikegami-t commented Jan 31, 2025

Unfortunately the test still failed with the latest TP, as below.
https://github.com/linux-nvme/nvme-cli/actions/runs/13072672459/job/36477555620

Run nose2 --verbose --start-dir tests \
  nvme_attach_detach_ns_test \
  nvme_compare_test \
...
  nvme_writeuncor_test \
  nvme_writezeros_test
  shell: sh -e {0}
test_attach_detach_ns (nvme_attach_detach_ns_test.TestNVMeAttachDetachNSCmd.test_attach_detach_ns)
Testcase main ... ok
test_nvme_compare (nvme_compare_test.TestNVMeCompareCmd.test_nvme_compare)
Testcase main ... skipped because: Optional NVM Command 'Compare' (NVMCMPS) not supported
test_copy (nvme_copy_test.TestNVMeCopy.test_copy)
Testcase main ... ok
test_attach_detach_ns (nvme_create_max_ns_test.TestNVMeCreateMaxNS.test_attach_detach_ns)
Testcase main ... FAIL
NVMe status: Namespace Already Attached: The controller is already attached to the namespace specified(0x4118)

The test failed because the NS was already attached, but I am not sure which NS number actually failed; from the log below it seems it still failed on NS2, the same as before.

##### Testing max_ns: 128
##### Creating 1
##### Attaching 1
##### Running IOs in 1
Running io: dd if=/dev/nvme1n1 of=/dev/null bs=512 count=1 > /dev/null 2>&1
Running io: dd if=/dev/zero of=/dev/nvme1n1 bs=512 count=1 > /dev/null 2>&1
##### Creating 2
##### Attaching 2

teardown: ctrl: /dev/nvme1, ns1: /dev/nvme1n1, default_nsid: 1, flbas: 0




Using nvme binary '/__w/nvme-cli/nvme-cli/.build-ci/nvme'

setup: ctrl: /dev/nvme1, ns1: /dev/nvme1n1, default_nsid: 1, flbas: 0

By the way, delete_all_ns() is executed before the NS creation by setUp(), but it looks like it does not check whether all NSs were actually deleted, so can we add a check of the NS deletion to make sure?

    def delete_all_ns(self):
        """ Wrapper for deleting all the namespaces.
            - Args:
                - None
            - Returns:
                - None
        """
        delete_ns_cmd = f"{self.nvme_bin} delete-ns {self.ctrl} " + \
            "--namespace-id=0xFFFFFFFF"
        self.assertEqual(self.exec_cmd(delete_ns_cmd), 0)
        list_ns_cmd = f"{self.nvme_bin} list-ns {self.ctrl} --all " + \
            "--output-format=json"
        proc = subprocess.Popen(list_ns_cmd,
                                shell=True,
                                stdout=subprocess.PIPE,
                                encoding='utf-8')
        self.assertEqual(proc.wait(), 0, "ERROR : nvme list-ns failed")
        json_output = json.loads(proc.stdout.read())
        self.assertEqual(len(json_output['nsid_list']), 0,
                         "ERROR : deleting all namespace failed")

(Added)
Note: also, about the NS-already-attached error, it seems the current error behavior changed from before, but I am not sure whether it was caused by the TP changes or not (or by the nvme-cli change to use the blkdev if an NSID is specified).

@ikegami-t
Contributor

Sorry, let me correct the below.

By the way, delete_all_ns() is executed before the NS creation by setUp(), but it looks like it does not check whether all NSs were actually deleted, so can we add a check of the NS deletion to make sure?

The delete_all_ns() function seems to correctly check whether the NSs were deleted, via the following part. So I am not sure why the NS-already-attached error happened.

        list_ns_cmd = f"{self.nvme_bin} list-ns {self.ctrl} --all " + \
            "--output-format=json"
        proc = subprocess.Popen(list_ns_cmd,
                                shell=True,
                                stdout=subprocess.PIPE,
                                encoding='utf-8')
        self.assertEqual(proc.wait(), 0, "ERROR : nvme list-ns failed")
        json_output = json.loads(proc.stdout.read())
        self.assertEqual(len(json_output['nsid_list']), 0,
                         "ERROR : deleting all namespace failed")

@ikegami-t
Contributor

ikegami-t commented Feb 1, 2025

There is a difference below in the NS attach command between before and after the blkdev changes of commit c276f3d.

| change | dev | nsid | ctrl |
| ------ | --- | ---- | ---- |
| before | /dev/nvme2 (internally used the same chardev: /dev/nvme2) | 2 | 1? (but anyway the same) |
| after  | /dev/nvme2 (but internally used the blkdev: /dev/nvme2n2) | 2 | 1? (but anyway the same) |

Then I just realized the cause of the issue, as below.
For the NS attach command, the blkdev /dev/nvme2n2 has been used since commit c276f3d, but the blkdev is not created yet because the NS is not attached to the device, so the command failed. (But I am not sure whether the NS-already-attached error itself is correct or not.)
The PR #2671 reverts the commit partially for the admin commands, but the blkdev is still used for the IO commands.

(Added)
The open_blkdev_direct() function falls back to opening the chardev if opening the blkdev failed, as below, but it seems this fallback does not work correctly either. (Unfortunately the print_info() debug output is currently not recorded in the TP log files, so the detailed behavior cannot be checked.)

static int open_blkdev_direct(char *dev, int flags, __u32 nsid)
{
	_cleanup_free_ char *blkdev = NULL;
	int fd = -1;

	if (is_nvme_dev(dev) && nsid) {
		if (asprintf(&blkdev, "%sn%d", dev, nsid) < 0)
			blkdev = NULL;
	}

	if (blkdev) {
		fd = open(blkdev, flags);
		print_info("blkdev: %s, fd: %d\n", blkdev, fd);
	}

	if (fd < 0) {
		fd = open(dev, flags);
		print_info("dev: %s, fd: %d\n", dev, fd);
	}

	return fd;
}
