
nvmeof Gateway fails to start up in brand new cluster #669

Open
madkiss opened this issue May 23, 2024 · 31 comments

@madkiss

madkiss commented May 23, 2024

I am trying to set up ceph-nvmeof 1.2.9 on Reef, on a fresh cluster installed a few hours ago with cephadm and deployed as per the documentation. nvmeof fails to come up; the log messages I see are:

May 23 12:55:14 ceph2 bash[119529]: 2024-05-23T12:55:14.333+0000 70f605860640 0 nvmeofgw void NVMeofGwMonitorClient::tick()
May 23 12:55:14 ceph2 bash[119529]: 2024-05-23T12:55:14.333+0000 70f605860640 0 nvmeofgw bool get_gw_state(const char*, const std::map<std::pair<std::__cxx11::basic_string, std::__cxx11::basic_string >, std::map<std::__cxx11::basic_string, NvmeGwState> >&, const NvmeGroupKey&, const NvmeGwId&, NvmeGwState&) can not find group (nvme,None) old map map: {}
May 23 12:55:14 ceph2 bash[119529]: 2024-05-23T12:55:14.333+0000 70f605860640 0 nvmeofgw void NVMeofGwMonitorClient::send_beacon() sending beacon as gid 24694 availability 0 osdmap_epoch 0 gwmap_epoch 0
May 23 12:55:14 ceph2 bash[76745]: debug 2024-05-23T12:55:14.333+0000 785f205e5700 0 can't decode unknown message type 2049 MSG_AUTH=17
May 23 12:55:14 ceph2 bash[119529]: 2024-05-23T12:55:14.333+0000 70f609868640 0 client.0 ms_handle_reset on v2:10.4.3.11:3300/0
May 23 12:55:14 ceph2 bash[119529]: 2024-05-23T12:55:14.333+0000 70f609868640 0 client.0 ms_handle_reset on v2:10.4.3.11:3300/0
May 23 12:55:14 ceph2 bash[119529]: 2024-05-23T12:55:14.337+0000 70f609868640 0 nvmeofgw virtual bool NVMeofGwMonitorClient::ms_dispatch2(ceph::ref_t&) got map type 4
May 23 12:55:14 ceph2 bash[119529]: 2024-05-23T12:55:14.337+0000 70f609868640 0 ms_deliver_dispatch: unhandled message 0x5e584cc24820 mon_map magic: 0 from mon.1 v2:10.4.3.11:3300/0

Another set of messages is:

May 23 12:57:26 ceph1 bash[146371]: 1: [v2:10.4.3.11:3300/0,v1:10.4.3.11:6789/0] mon.ceph2
May 23 12:57:26 ceph1 bash[146371]: 2: [v2:10.4.3.12:3300/0,v1:10.4.3.12:6789/0] mon.ceph3
May 23 12:57:26 ceph1 bash[146371]: -12> 2024-05-23T12:57:24.746+0000 73e68f1de640 0 nvmeofgw virtual bool NVMeofGwMonitorClient::ms_dispatch2(ceph::ref_t&) got map type 4
May 23 12:57:26 ceph1 bash[146371]: -11> 2024-05-23T12:57:24.746+0000 73e68f1de640 0 ms_deliver_dispatch: unhandled message 0x5757d2e9d380 mon_map magic: 0 from mon.0 v2:10.4.3.10:3300/0
May 23 12:57:26 ceph1 bash[146371]: -10> 2024-05-23T12:57:24.746+0000 73e68f1de640 10 monclient: handle_config config(2 keys)
May 23 12:57:26 ceph1 bash[146371]: -9> 2024-05-23T12:57:24.746+0000 73e68d9db640 4 set_mon_vals callback ignored cluster_network
May 23 12:57:26 ceph1 bash[146371]: -8> 2024-05-23T12:57:24.746+0000 73e68d9db640 4 set_mon_vals callback ignored container_image
May 23 12:57:26 ceph1 bash[146371]: -7> 2024-05-23T12:57:24.746+0000 73e68d9db640 4 nvmeofgw NVMeofGwMonitorClient::init()::<lambda()> nvmeof monc config notify callback
May 23 12:57:26 ceph1 bash[146371]: -6> 2024-05-23T12:57:25.654+0000 73e68d1da640 10 monclient: tick
May 23 12:57:26 ceph1 bash[146371]: -5> 2024-05-23T12:57:25.654+0000 73e68d1da640 10 monclient: _check_auth_tickets
May 23 12:57:26 ceph1 bash[146371]: -4> 2024-05-23T12:57:26.654+0000 73e68d1da640 10 monclient: tick
May 23 12:57:26 ceph1 bash[146371]: -3> 2024-05-23T12:57:26.654+0000 73e68d1da640 10 monclient: _check_auth_tickets
May 23 12:57:26 ceph1 bash[146371]: -2> 2024-05-23T12:57:26.742+0000 73e68b1d6640 0 nvmeofgw void NVMeofGwMonitorClient::tick()
May 23 12:57:26 ceph1 bash[146371]: -1> 2024-05-23T12:57:26.742+0000 73e68b1d6640 4 nvmeofgw void NVMeofGwMonitorClient::disconnect_panic() Triggering a panic upon disconnection from the monitor, elapsed 102, configured disconnect panic duration 100
May 23 12:57:26 ceph1 bash[146371]: 0> 2024-05-23T12:57:26.746+0000 73e68b1d6640 -1 *** Caught signal (Aborted) **
May 23 12:57:26 ceph1 bash[146371]: in thread 73e68b1d6640 thread_name:safe_timer

The cluster has a cluster network configured, and I saw some messages about that option not being changeable at runtime. I did add it to ceph.conf for the target, though, so that should be fine. Any help will be greatly appreciated. Thank you in advance.
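For reference, the addition was just the cluster network option in the target host's ceph.conf, roughly like this (the subnet below is only a placeholder, not my actual value):

[global]
cluster_network = 10.4.4.0/24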

@madkiss
Author

madkiss commented May 23, 2024

Some additional info I forgot to put in the first message: the target version is 1.2.9, and the Ceph version is ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable).

The setup is brand new with no previous configuration in place, and nothing unusually strange in the configuration either. Any help will be greatly appreciated. Thank you very much in advance again.

@alarmed-ground

I faced similar issues with v1.2.x on 18.2.2. I had to move to v1.0.0, and it functions as expected. To make cephadm pull v1.0.0:

ceph config set mgr mgr/cephadm/container_image_nvmeof quay.io/ceph/nvmeof:1.0.0

ceph orch apply nvmeof <pool-name> --placement=
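The configured image and the running daemon can be double-checked afterwards, for example:

ceph config get mgr mgr/cephadm/container_image_nvmeof
ceph orch ps --daemon-type nvmeof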

I am facing similar challenges after upgrading to 18.2.4 and am currently testing deployments.

@caroav
Collaborator

caroav commented Jul 30, 2024

Right now the GW cannot work without a special build of Ceph. The reason is that the GW depends on the new nvmeof Paxos service, which is part of ceph/ceph#54671.
In the .env file, we update the SHA to a Ceph CI build that contains that PR and should work with the GW. Hopefully that PR and the nvmeof Paxos service will soon be part of ongoing Ceph builds. Please make sure to work with the latest release, as updated here, and with the latest commit of the devel branch.
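Purely as an illustration of what that pinning looks like (the variable name below is hypothetical; check the actual .env in the ceph-nvmeof repository for the real one):

# hypothetical variable name, for illustration only
CEPH_SHA=<sha-of-a-ceph-ci-build-containing-PR-54671>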

@alarmed-ground

These are the commands I followed to set up ceph-nvmeof v1.0.0 on 18.2.2 and 18.2.4. I'm currently looking into deploying the latest versions of ceph-nvmeof:

ceph osd pool create nvmeof_pool01
rbd pool init nvmeof_pool01
rbd -p nvmeof_pool01 create nvme_image --size 50G
ceph config set mgr mgr/cephadm/container_image_nvmeof quay.io/ceph/nvmeof:1.0.0
ceph orch apply nvmeof nvmeof_pool01
alias nvmeof-cli='docker run -it quay.io/ceph/nvmeof-cli:1.0.0 --server-address <host-ip-where-nvmeof-service-is-running> --server-port 5500'
nvmeof-cli subsystem add --subsystem nqn.2016-06.io.spdk:ceph
nvmeof-cli namespace add --subsystem nqn.2016-06.io.spdk:ceph --rbd-pool nvmeof_pool01 --rbd-image nvme_image
ceph orch ps | grep nvme # This will give you the service name
nvmeof-cli listener add --subsystem nqn.2016-06.io.spdk:ceph --gateway-name client.<service-name-from-earlier-command> --traddr <host-ip-where-nvmeof-service-is-running> --trsvcid 4420
nvmeof-cli  host add --subsystem nqn.2016-06.io.spdk:ceph --host "*" # Allows connections to any host
nvmeof-cli subsystem list # lists subsystems
nvmeof-cli namespace list --subsystem nqn.2016-06.io.spdk:ceph # lists bdevs
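From the initiator side, assuming nvme-cli is installed and the discovery service is on the default port 8009, the subsystem can then be discovered and connected roughly like this (addresses are placeholders):

nvme discover -t tcp -a <host-ip-where-nvmeof-service-is-running> -s 8009
nvme connect -t tcp -n nqn.2016-06.io.spdk:ceph -a <host-ip-where-nvmeof-service-is-running> -s 4420
nvme list # the namespace should appear as a new block device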

@caroav
Collaborator

caroav commented Jul 30, 2024

Please use quay.io/ceph/nvmeof:1.2.16

@caroav
Collaborator

caroav commented Jul 30, 2024

and quay.io/ceph/nvmeof-cli:1.2.16
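To switch an already-deployed gateway over to these images, something along these lines should work (a sketch; adjust the service name to whatever ceph orch ls reports, here using the pool name from earlier in this thread):

ceph config set mgr mgr/cephadm/container_image_nvmeof quay.io/ceph/nvmeof:1.2.16
ceph orch redeploy nvmeof.nvmeof_pool01
alias nvmeof-cli='docker run -it quay.io/ceph/nvmeof-cli:1.2.16 --server-address <host-ip-where-nvmeof-service-is-running> --server-port 5500'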

@caroav
Collaborator

caroav commented Jul 30, 2024

The listener add command changed. I will update the documentation upstream soon, but meanwhile this is the right command:
listener add --subsystem nqn.2016-06.io.spdk:ceph --host-name HOST_NAME --traddr TRADDR --trsvcid 4420

@alarmed-ground

With 19.1.0(rc) (upgraded from 18.2.4), I have been able to deploy ceph-nvmeof v1.2.16 and add a subsystem. I wonder if the nvmeof-cli v1.2.16 usage for adding a namespace has changed as well, similar to listener?

root@test-Standard-PC-i440FX-PIIX-1996:/home/test# ceph orch ps | grep nvme
nvmeof.nvmeof_pool01.test-Standard-PC-i440FX-PIIX-1996.ojnsii  test-Standard-PC-i440FX-PIIX-1996  *:5500,4420,8009  running (2m)      2m ago   2m    44.4M        -  1.2.16   c8d40f5109eb  d95e2746f6d8
root@test-Standard-PC-i440FX-PIIX-1996:/home/test# alias nvmeof-cli='docker run -it quay.io/ceph/nvmeof-cli:1.2.16 --server-address <ip> --server-port 5500'
root@test-Standard-PC-i440FX-PIIX-1996:/home/test# nvmeof-cli subsystem add --subsystem nqn.2016-06.io.spdk:ceph
Adding subsystem nqn.2016-06.io.spdk:ceph: Successful
root@test-Standard-PC-i440FX-PIIX-1996:/home/test# nvmeof-cli namespace add --subsystem nqn.2016-06.io.spdk:ceph --rbd-pool nvmeof_pool01 --rbd-image nvme_image
Failure adding namespace to nqn.2016-06.io.spdk:ceph:
<_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception calling application: Chosen ANA group is 0"
        debug_error_string = "UNKNOWN:Error received from peer ipv4:<ip>:5500 {grpc_message:"Exception calling application: Chosen ANA group is 0", grpc_status:2, created_time:"2024-07-30T23:52:31.770067786+00:00"}"
>

With cephadm-deployed v18.2.4, when I tried deploying ceph-nvmeof v1.2.16, I encountered the following in the service status, which showed it associating with v19.0.0:

root@test-Standard-PC-i440FX-PIIX-1996:/home/test# systemctl status [email protected]_pool01.test-Standard-PC-i440FX-PIIX-1996.wnrzxa
● [email protected]_pool01.test-Standard-PC-i440FX-PIIX-1996.wnrzxa.service - Ceph nvmeof.nvmeof_pool01.test-Standard-PC-i440FX-PIIX-1996.wnrzxa for 7f3df55a-4e3b-11ef-a674-f94275ab1b57
     Loaded: loaded (/etc/systemd/system/[email protected]; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2024-07-31 08:23:52 AEST; 1s ago
   Main PID: 347496 (bash)
      Tasks: 10 (limit: 18618)
     Memory: 8.9M
     CGroup: /system.slice/system-ceph\x2d7f3df55a\x2d4e3b\x2d11ef\x2da674\x2df94275ab1b57.slice/[email protected]_pool01.test-Standard-PC-i440FX-PIIX-1996.wnrzxa.service
             ├─347496 /bin/bash /var/lib/ceph/7f3df55a-4e3b-11ef-a674-f94275ab1b57/nvmeof.nvmeof_pool01.test-Standard-PC-i440FX-PIIX-1996.wnrzxa/unit.run
             └─347514 /usr/bin/docker run --rm --ipc=host --stop-signal=SIGTERM --ulimit nofile=1048576 --net=host --init --name ceph-7f3df55a-4e3b-11ef-a674-f94275ab1b57-nvmeof-nvmeof_pool01-test-Standard-PC-i440FX-PIIX-1996-wnrzxa --p>

Jul 31 08:23:53 test-Standard-PC-i440FX-PIIX-1996 bash[347514]: [30-Jul-2024 22:23:53] INFO server.py:91 (7): Starting gateway client.nvmeof.nvmeof_pool01.test-Standard-PC-i440FX-PIIX-1996.wnrzxa
Jul 31 08:23:53 test-Standard-PC-i440FX-PIIX-1996 bash[347514]: [30-Jul-2024 22:23:53] INFO server.py:162 (7): Starting serve, monitor client version: ceph version 19.0.0-4672-g712d9957 (712d9957d9f2a12f0c34bc0475710fa23e01d609) squid (>
Jul 31 08:23:53 test-Standard-PC-i440FX-PIIX-1996 bash[347514]: [30-Jul-2024 22:23:53] INFO state.py:378 (7): nvmeof.None.state OMAP object already exists.
Jul 31 08:23:53 test-Standard-PC-i440FX-PIIX-1996 bash[347514]: [30-Jul-2024 22:23:53] INFO server.py:244 (7): Starting /usr/bin/ceph-nvmeof-monitor-client --gateway-name client.nvmeof.nvmeof_pool01.test-Standard-PC-i440FX-PIIX-1996.wnr>
Jul 31 08:23:53 test-Standard-PC-i440FX-PIIX-1996 bash[347514]: [30-Jul-2024 22:23:53] INFO server.py:248 (7): monitor client process id: 24

root@test-Standard-PC-i440FX-PIIX-1996:/home/test# nvmeof-cli subsystem add --subsystem nqn.2016-06.io.spdk:ceph
Failure adding subsystem nqn.2016-06.io.spdk:ceph:
<_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:<ip>:5500: Failed to connect to remote host: Connection refused"
        debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:<ip>:5500: Failed to connect to remote host: Connection refused {created_time:"2024-07-30T22:25:10.214187073+00:00", grpc_status:14}"
>

@alarmed-ground

alarmed-ground commented Jul 31, 2024

The listener add and host add to an existing subsystem are working as expected in v19.1.0(rc)

root@test-Standard-PC-i440FX-PIIX-1996:/home/test# nvmeof-cli listener add --subsystem nqn.2016-06.io.spdk:ceph --host-name test-Standard-PC-i440FX-PIIX-1996 --traddr <ip> --trsvcid 4420
Adding nqn.2016-06.io.spdk:ceph listener at <ip>:4420: Successful
root@test-Standard-PC-i440FX-PIIX-1996:/home/test# nvmeof-cli  host add --subsystem nqn.2016-06.io.spdk:ceph --host "*"
Allowing open host access to nqn.2016-06.io.spdk:ceph: Successful
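For completeness, the result can be checked with the corresponding list subcommands (assuming they exist in 1.2.16 alongside add, as they do for subsystem and namespace):

nvmeof-cli listener list --subsystem nqn.2016-06.io.spdk:ceph
nvmeof-cli host list --subsystem nqn.2016-06.io.spdk:ceph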

@caroav
Collaborator

caroav commented Jul 31, 2024

@Peratchi-Kannan what issues are you still having now? Were you able to add the namespace?

@alarmed-ground

@caroav I am not able to add a namespace to the subsystem. When I try to add one, it throws an exception saying Exception calling application: Chosen ANA group is 0

@xin3liang
Contributor

@caroav I am not able to add namespace to the subsystem. When I try to add namespace, it throws an exception saying Exception calling application: Chosen ANA group is 0

Hi @Peratchi-Kannan, I ran into the same issue when running the latest Ceph container image quay.ceph.io/ceph-ci/ceph:main. @caroav told me to build Ceph on top of PR 54671 but without the commit "nvmeof gw monitor: disable by default". You can try this image: quay.ceph.io/ceph-ci/ceph:ceph-nvmeof-mon-arm64-testin (ignore that the tag name contains arm64; it should be an x86-arch container image).
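In case it helps, one way to move an existing cephadm cluster onto that image is the orchestrator upgrade path (a sketch; this is a CI build, so test clusters only):

ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:ceph-nvmeof-mon-arm64-testin
ceph orch upgrade status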

@caroav
Collaborator

caroav commented Aug 5, 2024

@Peratchi-Kannan see the comment above from @xin3liang. We are planning to remove the "nvmeof gw monitor: disable by default" commit permanently from Ceph, but this is pending some cosmetic changes that we were asked to do. The changes are ongoing, so I really hope we can do that very soon. Meanwhile, you need to build as described in the last comment.

@alarmed-ground

Hi @xin3liang, the nvmeof service does not start when using the quay.ceph.io/ceph-ci/ceph:ceph-nvmeof-mon-arm64-testin image.

@xin3liang
Contributor

xin3liang commented Aug 13, 2024

Hi @xin3liang , The nvmeof service does not start when using quay.ceph.io/ceph-ci/ceph:ceph-nvmeof-mon-arm64-testin image.

Hi @Peratchi-Kannan, I have only verified the aarch64 image, not the x86 one.
FYI, here are my cephadm deployment record and steps: https://linaro.atlassian.net/browse/STOR-272

@xin3liang
Contributor

xin3liang commented Aug 21, 2024

Hi @Peratchi-Kannan, you could try this Ceph image: quay.ceph.io/ceph-ci/ceph:main-nvmeof with the latest nvmeof images: quay.io/barakda1/nvmeof:latest and quay.io/barakda1/nvmeof-cli:latest.
I see the main-nvmeof branch reverts the commit "nvmeof gw monitor: disable by default": https://github.com/ceph/ceph-ci/commits/main-nvmeof/
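A rough sequence for trying that combination on a cephadm test cluster might look like this (a sketch; the service name and address are placeholders based on earlier comments in this thread):

ceph orch upgrade start --image quay.ceph.io/ceph-ci/ceph:main-nvmeof
ceph config set mgr mgr/cephadm/container_image_nvmeof quay.io/barakda1/nvmeof:latest
ceph orch redeploy nvmeof.nvmeof_pool01
alias nvmeof-cli='docker run -it quay.io/barakda1/nvmeof-cli:latest --server-address <host-ip-where-nvmeof-service-is-running> --server-port 5500'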

@alarmed-ground

Hi @xin3liang ,

I confirm nvmeof works as expected with Ceph image: quay.ceph.io/ceph-ci/ceph:main-nvmeof , and nvmeof images: quay.io/barakda1/nvmeof:latest and quay.io/barakda1/nvmeof-cli:latest

Thanks

@RobertLukan

Hello everyone,

I am facing the same issue. I got nvmeof working under 18.2 but somehow it got broken, so I decided to upgrade to 19.2, which was just released. I am using version 1.3.2 and I get stuck when adding a namespace (Exception: chosen ANA group is 0), basically the same as @alarmed-ground has reported. I would like to stay on 19.2 without building from source, but get nvmeof working, even without the HA functionality for now. Any ideas how to do that?

@alarmed-ground

Hello everyone,

I am able to replicate @RobertLukan's situation. I am using quay.io/ceph/nvmeof:1.3.2 and quay.io/ceph/nvmeof-cli:1.3.2; I upgraded the cluster from v19.1.0(rc) to v19.2.0 for my testing.

@alarmed-ground

Hello everyone,

Just a quick update: I was able to add a new namespace from the WebUI instead of using the CLI command (v1.3.2).

I had one namespace before the upgrade and created one after the upgrade.

Interestingly, after adding the new namespace from the WebUI, nvmeof-cli namespace list --subsystem nqn.2016-06.io.spdk:ceph lists both namespaces and I am able to discover the subsystem from the client.

I am unable to connect to either namespace from the client, but I am able to discover the subsystem. The nvme-cli version on the client is 1.16.

@nathandragun

As others have stated, things are not working with Reef 18.2.4.

I've tried using the stock nvmeof 1.0.0 version, 1.2.16, and 1.2.17 and all fail to keep the gateway up and running. Since the 1.0.0 version is deprecated and not recommended for use, I won't provide details about that, but the run log is as follows:

[04-Oct-2024 02:32:34] INFO utils.py:258 (2): Initialize gateway log level to "INFO"
[04-Oct-2024 02:32:34] INFO utils.py:271 (2): Log files will be saved in /var/log/ceph/nvmeof-client.nvmeof.mypool.myserver.aewiyx, using rotation
[04-Oct-2024 02:32:34] INFO config.py:78 (2): Using NVMeoF gateway version 1.2.17
[04-Oct-2024 02:32:34] INFO config.py:81 (2): Configured SPDK version 24.01
[04-Oct-2024 02:32:34] INFO config.py:84 (2): Using vstart cluster version based on 18.2.4
[04-Oct-2024 02:32:34] INFO config.py:87 (2): NVMeoF gateway built on: 2024-07-30 15:47:38 UTC
[04-Oct-2024 02:32:34] INFO config.py:90 (2): NVMeoF gateway Git repository: https://github.com/ceph/ceph-nvmeof
[04-Oct-2024 02:32:34] INFO config.py:93 (2): NVMeoF gateway Git branch: tags/1.2.17
[04-Oct-2024 02:32:34] INFO config.py:96 (2): NVMeoF gateway Git commit: 887c7841f275a0cbc00eddb8a038cde3935b95ba
[04-Oct-2024 02:32:34] INFO config.py:102 (2): SPDK Git repository: https://github.com/ceph/spdk.git
[04-Oct-2024 02:32:34] INFO config.py:105 (2): SPDK Git branch: undefined
[04-Oct-2024 02:32:34] INFO config.py:108 (2): SPDK Git commit: a16bb032516da05ea2b7c38fd0ad18e8a7190440
[04-Oct-2024 02:32:34] INFO config.py:59 (2): Using configuration file /src/ceph-nvmeof.conf
[04-Oct-2024 02:32:34] INFO config.py:61 (2): ====================================== Configuration file content ======================================
[04-Oct-2024 02:32:34] INFO config.py:65 (2): # This file is generated by cephadm.
[04-Oct-2024 02:32:34] INFO config.py:65 (2): [gateway]
[04-Oct-2024 02:32:34] INFO config.py:65 (2): name = client.nvmeof.mypool.myserver.aewiyx
[04-Oct-2024 02:32:34] INFO config.py:65 (2): group = None
[04-Oct-2024 02:32:34] INFO config.py:65 (2): addr = 10.20.30.40
[04-Oct-2024 02:32:34] INFO config.py:65 (2): port = 5500
[04-Oct-2024 02:32:34] INFO config.py:65 (2): enable_auth = False
[04-Oct-2024 02:32:34] INFO config.py:65 (2): state_update_notify = True
[04-Oct-2024 02:32:34] INFO config.py:65 (2): state_update_interval_sec = 5
[04-Oct-2024 02:32:34] INFO config.py:65 (2): enable_prometheus_exporter = True
[04-Oct-2024 02:32:34] INFO config.py:65 (2): prometheus_exporter_ssl = False
[04-Oct-2024 02:32:34] INFO config.py:65 (2): prometheus_port = 10008
[04-Oct-2024 02:32:34] INFO config.py:65 (2):
[04-Oct-2024 02:32:34] INFO config.py:65 (2): [ceph]
[04-Oct-2024 02:32:34] INFO config.py:65 (2): pool = mypool
[04-Oct-2024 02:32:34] INFO config.py:65 (2): config_file = /etc/ceph/ceph.conf
[04-Oct-2024 02:32:34] INFO config.py:65 (2): id = nvmeof.mypool.myserver.aewiyx
[04-Oct-2024 02:32:34] INFO config.py:65 (2):
[04-Oct-2024 02:32:34] INFO config.py:65 (2): [mtls]
[04-Oct-2024 02:32:34] INFO config.py:65 (2): server_key = ./server.key
[04-Oct-2024 02:32:34] INFO config.py:65 (2): client_key = ./client.key
[04-Oct-2024 02:32:34] INFO config.py:65 (2): server_cert = ./server.crt
[04-Oct-2024 02:32:34] INFO config.py:65 (2): client_cert = ./client.crt
[04-Oct-2024 02:32:34] INFO config.py:65 (2):
[04-Oct-2024 02:32:34] INFO config.py:65 (2): [spdk]
[04-Oct-2024 02:32:34] INFO config.py:65 (2): tgt_path = /usr/local/bin/nvmf_tgt
[04-Oct-2024 02:32:34] INFO config.py:65 (2): rpc_socket = /var/tmp/spdk.sock
[04-Oct-2024 02:32:34] INFO config.py:65 (2): timeout = 60
[04-Oct-2024 02:32:34] INFO config.py:65 (2): log_level = WARN
[04-Oct-2024 02:32:34] INFO config.py:65 (2): conn_retries = 10
[04-Oct-2024 02:32:34] INFO config.py:65 (2): transports = tcp
[04-Oct-2024 02:32:34] INFO config.py:65 (2): transport_tcp_options = {"in_capsule_data_size": 8192, "max_io_qpairs_per_ctrlr": 7}
[04-Oct-2024 02:32:34] INFO config.py:65 (2): tgt_cmd_extra_args = --cpumask=0xFF
[04-Oct-2024 02:32:34] INFO config.py:66 (2): ========================================================================================================
[04-Oct-2024 02:32:34] INFO server.py:91 (2): Starting gateway client.nvmeof.mypool.myserver.aewiyx
[04-Oct-2024 02:32:34] INFO server.py:162 (2): Starting serve, monitor client version: ceph version 19.0.0-4996-g0ec90b1e (0ec90b1e61a7489b13d6d8432156a0417f35db7f) squid (dev)
[04-Oct-2024 02:32:35] INFO state.py:387 (2): nvmeof.None.state OMAP object already exists.
[04-Oct-2024 02:32:35] INFO server.py:252 (2): Starting /usr/bin/ceph-nvmeof-monitor-client --gateway-name client.nvmeof.mypool.myserver.aewiyx --gateway-address 10.20.30.40:5500 --gateway-pool mypool --gateway-group None --monitor-group-address 10.20.30.40:5499 -c /etc/ceph/ceph.conf -n client.nvmeof.mypool.myserver.aewiyx -k /etc/ceph/keyring
[04-Oct-2024 02:32:35] INFO server.py:256 (2): monitor client process id: 19
[04-Oct-2024 02:32:35] INFO server.py:151 (2): MonitorGroup server is listening on 10.20.30.40:5499 for group id
[04-Oct-2024 02:34:17] ERROR server.py:42 (2): GatewayServer: SIGCHLD received signum=17
[04-Oct-2024 02:34:17] ERROR server.py:46 (2): PID of terminated child process is 19
[04-Oct-2024 02:34:17] ERROR server.py:111 (2): GatewayServer exception occurred:
Traceback (most recent call last):
  File "/src/control/__main__.py", line 38, in <module>
    gateway.serve()
  File "/src/control/server.py", line 173, in serve
    self._start_monitor_client()
  File "/src/control/server.py", line 258, in _start_monitor_client
    self._wait_for_group_id()
  File "/src/control/server.py", line 152, in _wait_for_group_id
    self.monitor_event.wait()
  File "/usr/lib64/python3.9/threading.py", line 581, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib64/python3.9/threading.py", line 312, in wait
    waiter.acquire()
	  File "/src/control/server.py", line 55, in sigchld_handler
    raise SystemExit(f"Gateway subprocess terminated {pid=} {exit_code=}")
SystemExit: Gateway subprocess terminated pid=19 exit_code=-6
[04-Oct-2024 02:34:17] INFO server.py:448 (2): Aborting (client.nvmeof.mypool.myserver.aewiyx) pid 19...
[04-Oct-2024 02:34:17] INFO state.py:545 (2): Cleanup OMAP on exit (gateway-client.nvmeof.mypool.myserver.aewiyx)
[04-Oct-2024 02:34:17] INFO server.py:137 (2): Exiting the gateway process.

@RobertLukan

Has anyone found a combination that works? I tried 1.2.17, 1.1, 1.3.1, and 1.3.2 without success.

@caroav
Collaborator

caroav commented Oct 6, 2024

The nvmeof is not part of the official Ceph reef and squid branches. It was approved to be merged to main long after reef and squid were created, and it will be part of the next Ceph upstream release. For now, anyone who needs nvmeof to work with reef or squid can build Ceph from https://github.com/ceph/ceph-ci/tree/squid-nvmeof or https://github.com/ceph/ceph-ci/tree/reef-nvmeof.
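For anyone who has not built Ceph from source before, the usual build steps apply to those branches as well; a rough sketch (see the README in the tree for the full details, and swap in reef-nvmeof if needed):

git clone --recurse-submodules -b squid-nvmeof https://github.com/ceph/ceph-ci.git ceph
cd ceph
./install-deps.sh
./do_cmake.sh -DCMAKE_BUILD_TYPE=RelWithDebInfo
cd build && ninja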

@gurubert

gurubert commented Oct 6, 2024

The nvmeof is not a part of the official ceph reef and squid branches.

If this is the case, why do https://docs.ceph.com/en/reef/rbd/nvmeof-overview/ and https://docs.ceph.com/en/squid/rbd/nvmeof-overview/ exist?

The official Ceph documentation suggests that NVMe-oF has been working since version 18.

gurubert added a commit to HeinleinSupport/ceph that referenced this issue Oct 6, 2024
According to ceph/ceph-nvmeof#669 (comment)

"The nvmeof is not a part of the official ceph reef and squid branches."

It should be removed from the documentation as currently the docs
suggest that a working NVMe-oF gateway can be deployed easily with the
orchestrator.

Signed-off-by: Robert Sander <[email protected]>
@RobertLukan

I understand that the HA feature is not yet part of Ceph reef/squid, but I wonder why non-HA support is not part of it either? Notably, I managed to get it working with nvmeof version 1.0.0, but unfortunately the integration did not survive a reboot of the hosts.

@nathandragun

There is also the ability to deploy an NVMe-oF gateway in the current Reef dashboard, so there is definitely a disconnect as to what is production ready.

@caroav
Collaborator

caroav commented Oct 6, 2024

There is no non-HA mode; a single GW is also managed by the Ceph mon. I need to check the documentation, and we need to fix it if it is misleading. In any case, as I suggested, you can build Ceph from the branches I mentioned and get it working.

@nathandragun

The nvmeof is not a part of the official ceph reef and squid branches. It was approved to be merged to main long after that reef and squid were created. It will be a part of the next ceph upstream release. For now, anyone that needs the nvmeof to be working with reef or squid, you can build ceph from - https://github.com/ceph/ceph-ci/tree/squid-nvmeof , or https://github.com/ceph/ceph-ci/tree/reef-nvmeof.

@caroav, is there any word on whether this will be merged in a reef/squid update? I'm not familiar enough with how Ceph does feature and patching lifecycles.

@caroav
Collaborator

caroav commented Oct 6, 2024

is there any word on if this will merge with a reef/squid update? I'm not familiar enough with how Ceph does feature and patching lifecycles.
I don't think it will be a part of reef and squid. @oritwas @neha-ojha @athanatos can you share your view?

@athanatos

@caroav Which PRs need to be backported? The process would be that you backport the relevant PRs/commits and open PRs against squid/reef.

@gee456

gee456 commented Oct 9, 2024

The nvmeof is not a part of the official ceph reef and squid branches. It was approved to be merged to main long after that reef and squid were created. It will be a part of the next ceph upstream release. For now, anyone that needs the nvmeof to be working with reef or squid, you can build ceph from - https://github.com/ceph/ceph-ci/tree/squid-nvmeof , or https://github.com/ceph/ceph-ci/tree/reef-nvmeof.

I have tried to build these and they fail at Building CXX object src/librbd/CMakeFiles/rbd_api.dir/librbd.cc.o. Where can I go to troubleshoot this build issue? I can provide more details if needed, but I don't want to use this thread for that if I should be using some other resource. Here is one line of the error:
/root/rpmbuild/BUILD/ceph-19.1.0-1427-g73d4dbc2f9d/src/librbd/librbd.cc: In member function 'int librbd::RBD::open(librados::v14_2_0::IoCtx&, librbd::Image&, const char*, const char*)':
/root/rpmbuild/BUILD/ceph-19.1.0-1427-g73d4dbc2f9d/src/librbd/librbd.cc:525:5: error: expected primary-expression before ',' token
525 | tracepoint(librbd, open_image_enter, ictx, ictx->name.c_str(), ictx->id.c_str(), ictx->snap_name.c_str(), ictx->read_only);
| ^~~~~~~~~~
