test the NBE v2.0 prototype #48

Closed · 1 task done
cjdcordeiro opened this issue Mar 19, 2021 · 40 comments

@cjdcordeiro (Author) commented Mar 19, 2021

How to test

use https://swarm.nuvla.io

clone https://github.com/nuvlabox/deployment, and you'll find the docker-compose.test.yml file under the test folder. Use that one to install the NBE.
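A minimal sketch of that install, assuming the test compose file can be used on its own (repo URL and file name as described above; adjust flags to your setup):

git clone https://github.com/nuvlabox/deployment.git
cd deployment/test
# bring the NBE up under the usual "nuvlabox" project name
docker-compose -p nuvlabox -f docker-compose.test.yml up -d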

All v2.0-related development is in the clustering branches of the following repos: api-server, job-engine, ui, agent, system-manager, deployment, on-stop

What to test

Please provide your feedback on the following topics:

  • ability to update. Has it improved? Does it work from v1.x to v2?

  • deployment of apps into a cluster. Confirm that deployments are only listed under the managing NB they were deployed to, not under the actual NBs where they are running. Feel free to propose a solution for this

  • use of the Data Gateway. Try to use MQTT communication when the NB is a manager, a worker or a standalone node (see the sketch after this list)

  • test the NB actions, both the old ones and the new cluster operations

  • try fiddling with the NBs physically (reboot them, cut their network, etc.) and see if the cluster recovers by itself

  • needs: We need to decide on how to guide users migrating from v1 -> v2. #54
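For the Data Gateway bullet, a minimal sketch of an MQTT check from a host attached to the NB's shared network, assuming the broker is reachable as data-gateway on the default port 1883 (topic names such as cpu appear in the agent telemetry later in this thread):

# grab one telemetry sample published by the agent
mosquitto_sub -h data-gateway -p 1883 -t cpu -C 1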


@mebster (Contributor) commented Jun 1, 2021

Prior to installing v2, I downed the previous version (v1.16.2) with this result:

pi@raspberrypi:~ $ docker-compose -p nuvlabox down
Stopping nuvlabox_agent_1           ... done
Stopping vpn-client                 ... done
Stopping nuvlabox_network-manager_1 ... done
Stopping compute-api                ... done
Stopping nuvlabox_system-manager_1  ... done
Stopping datagateway                ... done
Stopping nuvlabox_security_1        ... done
Stopping nuvlabox_management-api_1  ... done
Stopping nuvlabox-job-engine-lite   ... done
Stopping nbmosquitto                ... done
WARNING: Found orphan containers (nuvlabox_peripheral-manager-usb_1, nuvlabox_peripheral-manager-network_1, nuvlabox_peripheral-manager-bluetooth_1) for this project. If you removed or renamed this service in your compose file, you can run this command with the --remove-orphans flag to clean it up.
Removing nuvlabox_agent_1           ... done
Removing vpn-client                 ... done
Removing nuvlabox_network-manager_1 ... done
Removing compute-api                ... done
Removing nuvlabox_system-manager_1  ... done
Removing datagateway                ... done
Removing nuvlabox_security_1        ... done
Removing nuvlabox_management-api_1  ... done
Removing nuvlabox-job-engine-lite   ... done
Removing nbmosquitto                ... done
Removing network nuvlabox_default
Removing network nuvlabox-shared-network
ERROR: rpc error: code = FailedPrecondition desc = network l5eey4u1wuwm3iesqghwtozlq is in use by task zidhj0dlhoipnf0nbmojhytwx

@cjdcordeiro (Author)

> Prior to installing v2, I downed the previous version (v1.16.2) with this result: […]

The WARNING is because you installed that NB with peripherals but then only "downed" the core components. Either pass the remaining compose files at "down" time, or add the --remove-orphans flag to the down command.
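A sketch of both options, assuming the peripheral compose files sit next to docker-compose.yml (file names as they appear in the update logs later in this thread):

# pass the same set of compose files that was used at install time
docker-compose -p nuvlabox -f docker-compose.yml -f docker-compose.usb.yml -f docker-compose.network.yml -f docker-compose.bluetooth.yml down
# or let compose clean up the leftover peripheral containers
docker-compose -p nuvlabox down --remove-orphans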

The ERROR is a known problem in v1. Sometimes you need to "down" twice because, during the shutdown, the system-manager tries to self-heal without knowing that an intended shutdown is in progress. v2 solves that.

@mebster (Contributor) commented Jun 1, 2021

We need to decide on how to guide users migrating from v1 -> v2.

Options include:

  1. Migrate the nuvlabox resource (and related resources)
  2. Put a warning saying that the migration can only be done by hand and not from the UI.

Other options?

@mebster (Contributor) commented Jun 1, 2021

Error when adding a new SSH key:

DockerException-Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.9/http/client.py", line 1253, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.9/http/client.py", line 1299, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.9/http/client.py", line 1248, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.9/http/client.py", line 1008, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.9/http/client.py", line 948, in send
    self.connect()
  File "/usr/local/lib/python3.9/site-packages/docker/transport/unixconn.py", line 43, in connect
    sock.connect(self.unix_socket)
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.9/site-packages/urllib3/util/retry.py", line 410, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.9/site-packages/urllib3/packages/six.py", line 734, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.9/http/client.py", line 1253, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.9/http/client.py", line 1299, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.9/http/client.py", line 1248, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.9/http/client.py", line 1008, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.9/http/client.py", line 948, in send
    self.connect()
  File "/usr/local/lib/python3.9/site-packages/docker/transport/unixconn.py", line 43, in connect
    sock.connect(self.unix_socket)
urllib3.exceptions.ProtocolError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/docker/api/client.py", line 214, in _retrieve_server_version
    return self.version(api_version=False)["ApiVersion"]
  File "/usr/local/lib/python3.9/site-packages/docker/api/daemon.py", line 181, in version
    return self._result(self._get(url), json=True)
  File "/usr/local/lib/python3.9/site-packages/docker/utils/decorators.py", line 46, in inner
    return f(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/docker/api/client.py", line 237, in _get
    return self.get(url, **self._set_request_timeout(kwargs))
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 555, in get
    return self.request('GET', url, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/nuvla/job/executor.py", line 71, in _process_jobs
    return_code = action_instance.do_work()
  File "/usr/local/lib/python3.9/site-packages/nuvla/job/actions/nuvlabox_add_ssh_key.py", line 54, in do_work
    return self.add_ssh_key()
  File "/usr/local/lib/python3.9/site-packages/nuvla/job/actions/nuvlabox_add_ssh_key.py", line 21, in add_ssh_key
    connector = NB.NuvlaBoxConnector(api=self.api, nuvlabox_id=nuvlabox_id, job=self.job)
  File "/usr/local/lib/python3.9/site-packages/nuvla/connector/nuvlabox_connector.py", line 24, in __init__
    self.local_docker_client = docker.from_env()
  File "/usr/local/lib/python3.9/site-packages/docker/client.py", line 96, in from_env
    return cls(
  File "/usr/local/lib/python3.9/site-packages/docker/client.py", line 45, in __init__
    self.api = APIClient(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/docker/api/client.py", line 197, in __init__
    self._version = self._retrieve_server_version()
  File "/usr/local/lib/python3.9/site-packages/docker/api/client.py", line 221, in _retrieve_server_version
    raise DockerException(
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

@cjdcordeiro (Author)

> Error when adding a new SSH key: […]

Was this in push or pull mode?

@mebster (Contributor) commented Jun 1, 2021

Push

@cjdcordeiro (Author)

OK, found it. This was a bug introduced in the job-engine and job-engine-lite, which has since been fixed, but the fix is still in a PR: nuvla/job-engine#188

Since it's a blocking bug, I've applied the fix to master already and released a new version of the job-engine. The fix has been applied to NBE v1.16.2 and v2.0.0, so all jobs in pull mode should work.

For jobs in push mode, this fix needs to be deployed in nuvla.io (wait for the next release).

@mebster (Contributor) commented Jun 2, 2021

My NB is reported as offline, but it isn't. Looking at the logs of nuvlabox_system-manager_1, I get this:

WARNING - Supervise.py/Supervise/find_nuvlabox_agent - Agent API is not ready yet. Trying again later

Is it related?

@mebster (Contributor) commented Jun 2, 2021

Update v2.0.0 -> v2.0.0 fails:

[NuvlaBox Engine update to 2.0.0] update ERROR: cannot proceed: original compose files are not available for rollback at /opt/nuvlabox

See details: job/07891fa5-4b9d-4853-ba1e-97b36042bfe5

The working directory is set to /opt/nuvlabox even though the installation was performed in /home/pi.

@mebster (Contributor) commented Jun 2, 2021

Update v2.0.0 -> v2.0.0 timed out

[NuvlaBox Engine update to 2.0.0]
WARNING: NuvlaBox Engine update failed. Rollback to previous version
ERROR: Update failed when executing Docker Compose command docker-compose --no-ansi --log-level ERROR -p nuvlabox -f docker-compose.bluetooth.yml -f docker-compose.gpu.yml -f docker-compose.modbus.yml -f docker-compose.yml -f docker-compose.network.yml -f docker-compose.usb.yml up --remove-orphans -d: --no-ansi option is deprecated and will be removed in future versions. Use `--ansi never` instead.
2.15.2: Pulling from nuvla/job-lite
Digest: sha256:7e2ced342c97ee69a57063ef64d473f057fbf54b0885cdcfbf20569f8f582041
Status: Downloaded newer image for nuvla/job-lite:2.15.2
Creating nuvlabox_peripheral-manager-modbus_1 ... 
Creating nuvlabox_peripheral-manager-gpu_1    ... 
Creating nuvlabox-on-stop                     ... 
Creating nuvlabox_security_1                  ... 
Creating nuvlabox_peripheral-manager-network_1 ... 
Creating nuvlabox-job-engine-lite              ... 
Creating nuvlabox_peripheral-manager-bluetooth_1 ... 
Creating nuvlabox_peripheral-manager-usb_1       ... 
Creating nuvlabox_peripheral-manager-network_1   ... done
Creating nuvlabox_peripheral-manager-bluetooth_1 ... done
Creating nuvlabox_security_1                     ... done
Creating nuvlabox_peripheral-manager-usb_1       ... done
Creating nuvlabox_peripheral-manager-modbus_1    ... done
Creating nuvlabox-job-engine-lite                ... done
Creating nuvlabox-on-stop                        ... done
Creating nuvlabox_peripheral-manager-gpu_1       ... done
Creating nuvlabox_system-manager_1               ... 

ERROR: for nuvlabox_system-manager_1  UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)

ERROR: for system-manager  UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)
An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).

ERROR: Rollback failed when executing Docker Compose command docker-compose -p nuvlabox -f docker-compose.bluetooth.yml -f docker-compose.gpu.yml -f docker-compose.modbus.yml -f docker-compose.yml -f docker-compose.network.yml -f docker-compose.usb.yml up -d: The NUVLABOX_SSH_PUB_KEY variable is not set. Defaulting to a blank string.
The Docker Engine you're using is running in swarm mode.

Compose does not use swarm mode to deploy services to multiple nodes in a swarm. All containers will be scheduled on the current node.

To deploy your application across the swarm, use `docker stack deploy`.

Creating nuvlabox_peripheral-manager-usb_1 ... 
Creating nuvlabox_peripheral-manager-gpu_1 ... 
Creating nuvlabox_peripheral-manager-network_1 ... 
Creating nuvlabox-job-engine-lite              ... 
Creating nuvlabox-on-stop                      ... 
Creating nuvlabox_security_1                   ... 
Creating nuvlabox_peripheral-manager-bluetooth_1 ... 
Creating nuvlabox_peripheral-manager-modbus_1    ... 
Creating nuvlabox_peripheral-manager-network_1   ... done
Creating nuvlabox_peripheral-manager-usb_1       ... done
Creating nuvlabox_security_1                     ... done
Creating nuvlabox_peripheral-manager-bluetooth_1 ... done
Creating nuvlabox_peripheral-manager-gpu_1       ... done
Creating nuvlabox_peripheral-manager-modbus_1    ... done
Creating nuvlabox-job-engine-lite                ... done
Creating nuvlabox-on-stop                        ... done
Creating nuvlabox_system-manager_1               ... 
Creating nuvlabox_system-manager_1               ... done
Creating compute-api                             ... 
Creating compute-api                             ... done
Creating nuvlabox_agent_1                        ... 

ERROR: for nuvlabox_agent_1  UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)

ERROR: for agent  UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)
An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).

Noticed that the VPN container didn't come back:

pi@raspberrypi:~ $ docker ps -a
CONTAINER ID   IMAGE                                          COMMAND                  CREATED          STATUS                        PORTS                                                 NAMES
b58881afe920   linuxserver/ffmpeg:latest                      "bash -xc 'curl -L -…"   5 minutes ago    Up 5 minutes                                                                        a02cc66a-2067-4d80-a5ed-9aaa7fa55c13_video-source.1.tybfp72kix8i5suxwpwwzgfd0
fd43658dd401   eclipse-mosquitto:1.6.12                       "sh -c 'sleep 10 && …"   5 minutes ago    Up 5 minutes                  1883/tcp                                              data-gateway.1.bxm90c0orz91ftcoewyk35spb
621a6f0f8823   aler9/rtsp-simple-server:latest                "/rtsp-simple-server"    5 minutes ago    Up 5 minutes                                                                        a02cc66a-2067-4d80-a5ed-9aaa7fa55c13_rtsp-server.1.rtqjgsaao63giozggru0xqj9a
f21b56cd3b84   sixsq/tensorflow-lite-object-detector:latest   "python3 object_dete…"   5 minutes ago    Up 5 minutes                  5000/tcp                                              a02cc66a-2067-4d80-a5ed-9aaa7fa55c13_object-detector.1.pi8q4ko1i2lehlpmxopic1ee5
425fd13cc7da   linuxserver/ffmpeg:latest                      "bash -xc 'curl -L -…"   5 minutes ago    Exited (1) 5 minutes ago                                                            a02cc66a-2067-4d80-a5ed-9aaa7fa55c13_video-source.1.3dcrr6ed4wbm4ajt3ib7uo1mc
379c955f9383   sixsq/rabbitmq-mqtt:latest                     "docker-entrypoint.s…"   5 minutes ago    Up 5 minutes                  4369/tcp, 5671-5672/tcp, 15671-15672/tcp, 25672/tcp   a02cc66a-2067-4d80-a5ed-9aaa7fa55c13_rabbitmq.1.azun8c2emuay1ouxkb1ezxlk2
b5e388d30723   eclipse-mosquitto:1.6.12                       "sh -c 'sleep 10 && …"   5 minutes ago    Created                                                                             data-gateway.1.lj2bxgpns06eweihm8upbbata
ad29e28ae9a2   nuvlabox/agent:2.0.1                           "./app.py"               7 minutes ago    Up 5 minutes (healthy)        127.0.0.1:5080->80/tcp                                nuvlabox_agent_1
33e5e3dca56a   nuvlabox/compute-api:1.1.1                     "./api.sh"               7 minutes ago    Up 7 minutes (healthy)        0.0.0.0:5000->5000/tcp                                compute-api
39e547b1587c   nuvlabox/system-manager:2.0.0                  "./run.py"               8 minutes ago    Up 7 minutes (healthy)        127.0.0.1:3636->3636/tcp                              nuvlabox_system-manager_1
57b9ffe03da7   nuvlabox/peripheral-manager-modbus:1.1.0       "./modbus.py"            8 minutes ago    Up 8 minutes                                                                        nuvlabox_peripheral-manager-modbus_1
8fb59bdd4310   nuvlabox/peripheral-manager-bluetooth:1.0.0    "python -u manager.py"   8 minutes ago    Up 8 minutes                                                                        nuvlabox_peripheral-manager-bluetooth_1
8ad6e3f36431   nuvlabox/on-stop:1.0.0                         "./run.py pause"         8 minutes ago    Up 8 minutes (Paused)                                                               nuvlabox-on-stop
873d664598f9   nuvlabox/security:1.0.3                        "./app.py"               8 minutes ago    Up 8 minutes                                                                        nuvlabox_security_1
355ff72d5b84   nuvla/job-lite:2.15.1                          "/app/pause.py"          8 minutes ago    Up 8 minutes (Paused)                                                               nuvlabox-job-engine-lite
071808741d43   nuvlabox/peripheral-manager-gpu:0.2.0          "./discovery.py"         8 minutes ago    Up 8 minutes                                                                        nuvlabox_peripheral-manager-gpu_1
1fd564fe7fb0   nuvlabox/peripheral-manager-network:1.0.1      "python manager.py"      8 minutes ago    Up 8 minutes                                                                        nuvlabox_peripheral-manager-network_1
3849dd1ca6d8   nuvlabox/peripheral-manager-usb:1.4.1          "./app.sh"               8 minutes ago    Up 5 minutes                                                                        nuvlabox_peripheral-manager-usb_1
2c4ab8f24a3f   nuvladev/on-stop:main                          "./run.py"               9 minutes ago    Exited (0) 8 minutes ago                                                            nuvlabox-on-stop-JZZHA-02-06-2021_141252
6d25a1ff0d57   linuxserver/ffmpeg:latest                      "bash -xc 'curl -L -…"   9 minutes ago    Exited (1) 5 minutes ago                                                            a02cc66a-2067-4d80-a5ed-9aaa7fa55c13_video-source.1.i6c8t13rka2wzml0j2lo2dexw
060fa01efa12   sixsq/tensorflow-lite-object-detector:latest   "python3 object_dete…"   11 minutes ago   Exited (137) 5 minutes ago                                                          a02cc66a-2067-4d80-a5ed-9aaa7fa55c13_object-detector.1.s8yvufpu8px1jxwi10p4jz2ow
8252914fa8b4   sixsq/rabbitmq-mqtt:latest                     "docker-entrypoint.s…"   11 minutes ago   Exited (0) 5 minutes ago                                                            a02cc66a-2067-4d80-a5ed-9aaa7fa55c13_rabbitmq.1.xve2sic322iqw2qgviioqn86i
716a816ef5e1   aler9/rtsp-simple-server:latest                "/rtsp-simple-server"    11 minutes ago   Exited (2) 5 minutes ago                                                            a02cc66a-2067-4d80-a5ed-9aaa7fa55c13_rtsp-server.1.tpf5lv9dpmoq50zrcwfrz22x9
98d1cc5ca0e2   linuxserver/ffmpeg:latest                      "bash -xc 'curl -L -…"   11 minutes ago   Exited (1) 9 minutes ago                                                            a02cc66a-2067-4d80-a5ed-9aaa7fa55c13_video-source.1.u4ungvo6w8dj2frgm1bo2u808
e4b3f387e69f   aler9/rtsp-simple-server:latest                "/rtsp-simple-server"    7 hours ago      Exited (2) 11 minutes ago                                                           a02cc66a-2067-4d80-a5ed-9aaa7fa55c13_rtsp-server.1.9jg8k25g3rg1fl5y9zpvhww87
326828f2d435   sixsq/rabbitmq-mqtt:latest                     "docker-entrypoint.s…"   7 hours ago      Exited (137) 11 minutes ago                                                         a02cc66a-2067-4d80-a5ed-9aaa7fa55c13_rabbitmq.1.txlp5klhit7ifbj5as1lvi74i
fd2f90f3a56c   sixsq/tensorflow-lite-object-detector:latest   "python3 object_dete…"   7 hours ago      Exited (137) 11 minutes ago                                                         a02cc66a-2067-4d80-a5ed-9aaa7fa55c13_object-detector.1.ky0y0i0ccha5bbqu4c2ij7tvz
0970222bf0fc   linuxserver/ffmpeg:latest                      "bash -xc 'curl -L -…"   7 hours ago      Exited (1) 7 hours ago                                                              a02cc66a-2067-4d80-a5ed-9aaa7fa55c13_video-source.1.w27y2wmc24kkvv35vvblqzgly
9f426047cb0e   aler9/rtsp-simple-server:latest                "/rtsp-simple-server"    7 hours ago      Exited (2) 7 hours ago                                                              a02cc66a-2067-4d80-a5ed-9aaa7fa55c13_rtsp-server.1.oujfvg8piwdjn5ip8jzg7e5we
2bef8fdfc571   sixsq/tensorflow-lite-object-detector:latest   "python3 object_dete…"   24 hours ago     Exited (255) 8 hours ago      5000/tcp                                              a02cc66a-2067-4d80-a5ed-9aaa7fa55c13_object-detector.1.qhq5la5fwdh30ma6fxe717clo
63901f746504   sixsq/rabbitmq-mqtt:latest                     "docker-entrypoint.s…"   24 hours ago     Exited (255) 8 hours ago      4369/tcp, 5671-5672/tcp, 15671-15672/tcp, 25672/tcp   a02cc66a-2067-4d80-a5ed-9aaa7fa55c13_rabbitmq.1.2jnplrv6ipcg4itj9aqs7thhw
b40d5803a1dd   sixsq/tensorflow-lite-object-detector:latest   "python3 object_dete…"   24 hours ago     Exited (255) 24 hours ago     5000/tcp                                              a02cc66a-2067-4d80-a5ed-9aaa7fa55c13_object-detector.1.9ot8chr1bhyn3mh85zxyn7fd9
0e4c508894cf   aler9/rtsp-simple-server:latest                "/rtsp-simple-server"    24 hours ago     Exited (255) 24 hours ago                                                           a02cc66a-2067-4d80-a5ed-9aaa7fa55c13_rtsp-server.1.yv9qw1t156olmgznh2bat9zen
72a08dc79fec   sixsq/rabbitmq-mqtt:latest                     "docker-entrypoint.s…"   28 hours ago     Exited (255) 24 hours ago     4369/tcp, 5671-5672/tcp, 15671-15672/tcp, 25672/tcp   a02cc66a-2067-4d80-a5ed-9aaa7fa55c13_rabbitmq.1.h9ko8gjeew7myyeftv2kk4tlo
ffb8ec41a278   linuxserver/ffmpeg:latest                      "bash -xc 'curl -L -…"   32 hours ago     Exited (255) 32 hours ago                                                           f207e43e-497f-488d-8328-39de0808568e_video-source.1.rswym58zxeuw7e6n69pjvkw54
93c103826138   aler9/rtsp-simple-server:latest                "/rtsp-simple-server"    32 hours ago     Exited (255) 32 hours ago                                                           f207e43e-497f-488d-8328-39de0808568e_rtsp-server.1.082wg84atlr6qz19jwd0rakbo
e4ce4e8fc30d   sixsq/tensorflow-lite-object-detector:latest   "python3 object_dete…"   32 hours ago     Exited (255) 32 hours ago     5000/tcp                                              f207e43e-497f-488d-8328-39de0808568e_object-detector.1.2mwyeqxohe154wirrvs9etu1k
1b9474bbc894   sixsq/rabbitmq-mqtt:latest                     "docker-entrypoint.s…"   32 hours ago     Exited (255) 32 hours ago     4369/tcp, 5671-5672/tcp, 15671-15672/tcp, 25672/tcp   f207e43e-497f-488d-8328-39de0808568e_rabbitmq.1.afzj72xmj897l0o8f3m17o45k
pi@raspberrypi:~ $
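A possible workaround for these read timeouts on a slow device, as the error output itself suggests (COMPOSE_HTTP_TIMEOUT is honoured by docker-compose v1; the value here is illustrative):

export COMPOSE_HTTP_TIMEOUT=300
# then re-run the same docker-compose up command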

@mebster (Contributor) commented Jun 2, 2021

After the above failure, the working directory in the nuvlabox page is set to /opt/nuvlabox/rollback. But this folder doesn't exist on the device.

@mebster (Contributor) commented Jun 3, 2021

Supervise the agent. On a shaky network, the agent fails to start if it can't reach Nuvla.

@cjdcordeiro (Author)

> Supervise the agent. On a shaky network, the agent fails to start if it can't reach Nuvla.

Fixed.

@cjdcordeiro (Author)

> Update v2.0.0 -> v2.0.0 fails:
>
> [NuvlaBox Engine update to 2.0.0] update ERROR: cannot proceed: original compose files are not available for rollback at /opt/nuvlabox
>
> See details: job/07891fa5-4b9d-4853-ba1e-97b36042bfe5
>
> The working directory is set to /opt/nuvlabox even though the installation was performed in /home/pi.

Hum, do you still have this NB? What's the ID?

@mebster (Contributor) commented Jun 4, 2021

With the latest agent:

pi@raspberrypi:~ $ docker logs nuvlabox_system-manager_1

[...]

WARNING - Supervise.py/Supervise/find_nuvlabox_agent - Agent API is not ready yet. Trying again later
Traceback (most recent call last):
  File "/opt/nuvlabox/./run.py", line 100, in <module>
    self_sup.healer()
  File "/opt/nuvlabox/system_manager/Supervise.py", line 824, in healer
    self.docker_client.start()
  File "/usr/local/lib/python3.9/site-packages/docker/client.py", line 221, in __getattr__
    raise AttributeError(' '.join(s))
AttributeError: 'DockerClient' object has no attribute 'start' In Docker SDK for Python 2.0, this method is now on the object APIClient. See the low-level API section of the documentation for more details.
INFO - Requirements.py/Requirements/check_docker_requirements - Running in Swarm mode
INFO - run.py/run/run_requirements_check - Successfully created status file
INFO - run.py/run/run_requirements_check - Directory /srv/nuvlabox/shared/.peripherals already exists
WARNING - Supervise.py/Supervise/find_nuvlabox_agent - Agent API is not ready yet. Trying again later
Traceback (most recent call last):
  File "/opt/nuvlabox/./run.py", line 100, in <module>
    self_sup.healer()
  File "/opt/nuvlabox/system_manager/Supervise.py", line 824, in healer
    self.docker_client.start()
  File "/usr/local/lib/python3.9/site-packages/docker/client.py", line 221, in __getattr__
    raise AttributeError(' '.join(s))
AttributeError: 'DockerClient' object has no attribute 'start' In Docker SDK for Python 2.0, this method is now on the object APIClient. See the low-level API section of the documentation for more details.
INFO - Requirements.py/Requirements/check_docker_requirements - Running in Swarm mode
INFO - run.py/run/run_requirements_check - Successfully created status file
INFO - run.py/run/run_requirements_check - Directory /srv/nuvlabox/shared/.peripherals already exists
WARNING - Supervise.py/Supervise/find_nuvlabox_agent - Agent API is not ready yet. Trying again later
WARNING - Supervise.py/Supervise/find_nuvlabox_agent - Agent API is not ready yet. Trying again later
WARNING - Supervise.py/Supervise/find_nuvlabox_agent - Agent API is not ready yet. Trying again later
WARNING - Supervise.py/Supervise/find_nuvlabox_agent - Agent API is not ready yet. Trying again later
WARNING - Supervise.py/Supervise/find_nuvlabox_agent - Agent API is not ready yet. Trying again later
WARNING - Supervise.py/Supervise/find_nuvlabox_agent - Agent API is not ready yet. Trying again later
WARNING - Supervise.py/Supervise/find_nuvlabox_agent - Agent API is not ready yet. Trying again later
WARNING - Supervise.py/Supervise/find_nuvlabox_agent - Agent API is not ready yet. Trying again later
WARNING - Supervise.py/Supervise/find_nuvlabox_agent - Agent API is not ready yet. Trying again later
WARNING - Supervise.py/Supervise/find_nuvlabox_agent - Agent API is not ready yet. Trying again later
WARNING - Supervise.py/Supervise/find_nuvlabox_agent - Agent API is not ready yet. Trying again later
WARNING - Supervise.py/Supervise/find_nuvlabox_agent - Agent API is not ready yet. Trying again later
WARNING - Supervise.py/Supervise/find_nuvlabox_agent - Agent API is not ready yet. Trying again later
WARNING - Supervise.py/Supervise/find_nuvlabox_agent - Agent API is not ready yet. Trying again later
WARNING - Supervise.py/Supervise/find_nuvlabox_agent - Agent API is not ready yet. Trying again later
WARNING - Supervise.py/Supervise/find_nuvlabox_agent - Agent API is not ready yet. Trying again later
pi@raspberrypi:~ $

@cjdcordeiro (Author)

I think your OS is bugging again with a similar Python issue as in the past... it's claiming we are using Python 2, but we aren't.

@mebster (Contributor) commented Jun 4, 2021

pi@raspberrypi:~ $ python
Python 2.7.16 (default, Oct 10 2019, 22:02:15)
[GCC 8.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.

@cjdcordeiro (Author)

That's your host, not the system-manager container.
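A quick way to confirm the interpreter inside the container, assuming the container name shown in the docker ps output earlier in this thread:

docker exec nuvlabox_system-manager_1 python --version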

@mebster (Contributor) commented Jun 4, 2021

With nuvladev/system-manager:master:

pi@raspberrypi:~ $ docker-compose -p nuvlabox -f docker-compose.network.yml -f docker-compose.usb.yml -f docker-compose.yml -f docker-compose.bluetooth.yml -f docker-compose.gpu.yml -f docker-compose.modbus.yml up -d
WARNING: The NUVLABOX_SSH_PUB_KEY variable is not set. Defaulting to a blank string.
WARNING: The Docker Engine you're using is running in swarm mode.

Compose does not use swarm mode to deploy services to multiple nodes in a swarm. All containers will be scheduled on the current node.

To deploy your application across the swarm, use `docker stack deploy`.

Pulling system-manager (nuvladev/system-manager:master)...
master: Pulling from nuvladev/system-manager
e160e00eb35d: Already exists
87e95228c5a0: Already exists
03e94a5d9023: Already exists
02da6b082b36: Already exists
544da0e82b4e: Already exists
c2d4568e58dd: Pull complete
ed7ead200b37: Pull complete
dd45ba5863ff: Pull complete
c0b8d99994ac: Pull complete
Digest: sha256:4f52288dc91654c1d3707ff29a6affd8dff3b33abe2e7522fe69829059dd004b
Status: Downloaded newer image for nuvladev/system-manager:master
nuvlabox-job-engine-lite is up-to-date
nuvlabox-on-stop is up-to-date
nuvlabox_peripheral-manager-modbus_1 is up-to-date
nuvlabox_peripheral-manager-gpu_1 is up-to-date
nuvlabox_security_1 is up-to-date
nuvlabox_peripheral-manager-bluetooth_1 is up-to-date
Recreating nuvlabox_system-manager_1 ...
nuvlabox_peripheral-manager-network_1 is up-to-date
nuvlabox_peripheral-manager-usb_1 is up-to-date
Recreating nuvlabox_system-manager_1 ... error

ERROR: for nuvlabox_system-manager_1  Cannot start service system-manager: driver failed programming external connectivity on endpoint nuvlabox_system-manager_1 (98e146b619d3bdfe18aecc8ca97d3c5ef3824645706850e2ad8d26bea75111b5): Bind for 127.0.0.1:3636 failed: port is already allocated

ERROR: for system-manager  Cannot start service system-manager: driver failed programming external connectivity on endpoint nuvlabox_system-manager_1 (98e146b619d3bdfe18aecc8ca97d3c5ef3824645706850e2ad8d26bea75111b5): Bind for 127.0.0.1:3636 failed: port is already allocated
ERROR: Encountered errors while bringing up the project.
pi@raspberrypi:~ $

@mebster (Contributor) commented Jun 4, 2021

Re-launching the command worked.

@cjdcordeiro (Author)

> Re-launching the command worked.

Fixed, then.

@cjdcordeiro (Author)

> Update v2.0.0 -> v2.0.0 fails: […]
>
> Hum, do you still have this NB? What's the ID?

@mebster

@mebster (Contributor) commented Jun 4, 2021

nuvlabox/89a2923e-8baf-4936-91ea-15d443d01da8

@cjdcordeiro (Author)

> nuvlabox/89a2923e-8baf-4936-91ea-15d443d01da8

working-dir is set to "/home/pi". Why do you say it is /opt/nuvlabox?

@mebster (Contributor) commented Jun 4, 2021

This is where I ran the docker-compose command from. The /opt/nuvlabox folder didn't exist, and I got an error saying the folder doesn't exist:

0377112a | nuvlabox_update | 2021-06-02T11:27:00.646Z | FAILED | 100 | 1 | [NuvlaBox Engine update to 2.0.0] update ERROR: cannot proceed: original compose files are not available for rollback at /opt/nuvlabox

@mebster (Contributor) commented Jun 7, 2021

Failed leaving cluster:

[NuvlaBox cluster action leave] Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/docker/api/client.py", line 268, in _raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.8/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localhost/v1.41/swarm/leave?force=True

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/nuvlabox/commands/nuvlabox-engine-cluster", line 64, in <module>
    leave()
  File "/opt/nuvlabox/commands/nuvlabox-engine-cluster", line 33, in leave
    docker_client.swarm.leave(force=True)
  File "/usr/local/lib/python3.8/site-packages/docker/models/swarm.py", line 135, in leave
    return self.client.api.leave_swarm(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/docker/utils/decorators.py", line 34, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/docker/api/swarm.py", line 282, in leave_swarm
    self._raise_for_status(response)
  File "/usr/local/lib/python3.8/site-packages/docker/api/client.py", line 270, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/usr/local/lib/python3.8/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 500 Server Error for http+docker://localhost/v1.41/swarm/leave?force=True: Internal Server Error ("context deadline exceeded")

@mebster (Contributor) commented Jun 7, 2021

And as a result:

Operational Status: DEGRADED
Notes
Unable to launch Data Gateway

@mebster (Contributor) commented Jun 7, 2021

Agent died:

pi@raspberrypi:~ $ docker ps -a
CONTAINER ID   IMAGE                                          COMMAND                  CREATED         STATUS                     PORTS                                                 NAMES
8c2d60a5174b   eclipse-mosquitto:1.6.12                       "/docker-entrypoint.…"   2 minutes ago   Created                                                                          data-gateway
7e9ebfe5846a   nuvlabox/agent:2.0.1                           "./app.py"               2 days ago      Exited (1) 2 minutes ago                                                         nuvlabox_agent_1
74eca3b259cb   nuvladev/system-manager:master                 "./run.py"               2 days ago      Up 18 hours (healthy)      127.0.0.1:3636->3636/tcp                              nuvlabox_system-manager_1
b3325ff910d2   nuvlabox/on-stop:1.0.0                         "./run.py"               2 days ago      Exited (0) 2 days ago                                                            nuvlabox-on-stop-BUZPD-04-06-2021_142407
6bb719572ddf   nuvlabox/vpn-client:1.0.0                      "./openvpn-client.sh"    2 days ago      Up 18 hours                                                                      vpn-client
4344b41b8b88   nuvlabox/compute-api:1.1.1                     "./api.sh"               2 days ago      Up 18 hours (healthy)      0.0.0.0:5000->5000/tcp                                compute-api
7e576438e6de   nuvlabox/on-stop:1.0.0                         "./run.py pause"         2 days ago      Up 18 hours (Paused)                                                             nuvlabox-on-stop
9b8387e15a2f   nuvlabox/peripheral-manager-gpu:0.2.0          "./discovery.py"         2 days ago      Up 18 hours                                                                      nuvlabox_peripheral-manager-gpu_1
d18bc85ca4d4   nuvlabox/security:1.0.3                        "./app.py"               2 days ago      Up 18 hours                                                                      nuvlabox_security_1
da73b90b79a7   nuvlabox/peripheral-manager-network:1.0.1      "python manager.py"      2 days ago      Up 40 seconds                                                                    nuvlabox_peripheral-manager-network_1
c65ecd07b0ed   nuvlabox/peripheral-manager-usb:1.4.1          "./app.sh"               2 days ago      Up 18 hours                                                                      nuvlabox_peripheral-manager-usb_1
a53c93fe211c   nuvla/job-lite:2.15.1                          "/app/pause.py"          2 days ago      Up 18 hours (Paused)                                                             nuvlabox-job-engine-lite
f85ebcfad721   nuvlabox/peripheral-manager-bluetooth:1.0.0    "python -u manager.py"   2 days ago      Up 18 hours                                                                      nuvlabox_peripheral-manager-bluetooth_1
66f2056900e2   nuvlabox/peripheral-manager-modbus:1.1.0       "./modbus.py"            2 days ago      Up About a minute                                                                nuvlabox_peripheral-manager-modbus_1
ffb8ec41a278   linuxserver/ffmpeg:latest                      "bash -xc 'curl -L -…"   6 days ago      Exited (255) 6 days ago                                                          f207e43e-497f-488d-8328-39de0808568e_video-source.1.rswym58zxeuw7e6n69pjvkw54
93c103826138   aler9/rtsp-simple-server:latest                "/rtsp-simple-server"    6 days ago      Exited (255) 6 days ago                                                          f207e43e-497f-488d-8328-39de0808568e_rtsp-server.1.082wg84atlr6qz19jwd0rakbo
e4ce4e8fc30d   sixsq/tensorflow-lite-object-detector:latest   "python3 object_dete…"   6 days ago      Exited (255) 6 days ago    5000/tcp                                              f207e43e-497f-488d-8328-39de0808568e_object-detector.1.2mwyeqxohe154wirrvs9etu1k
1b9474bbc894   sixsq/rabbitmq-mqtt:latest                     "docker-entrypoint.s…"   6 days ago      Exited (255) 6 days ago    4369/tcp, 5671-5672/tcp, 15671-15672/tcp, 25672/tcp   f207e43e-497f-488d-8328-39de0808568e_rabbitmq.1.afzj72xmj897l0o8f3m17o45k
pi@raspberrypi:~ $

and

pi@raspberrypi:~ $ docker logs nuvlabox_agent_1

[...]

Watching VPN credential in Nuvla...
Found VPN credential ID credential/5f94d823-d1a0-4ce4-835f-6b941c5f6c95
The NuvlaBox MQTT broker is not reachable...trying again later
Traceback (most recent call last):
  File "/opt/nuvlabox/agent/Telemetry.py", line 130, in send_mqtt
    self.mqtt_telemetry.connect(self.mqtt_broker_host, self.mqtt_broker_port, self.mqtt_broker_keep_alive)
  File "/usr/local/lib/python3.9/site-packages/paho/mqtt/client.py", line 941, in connect
    return self.reconnect()
  File "/usr/local/lib/python3.9/site-packages/paho/mqtt/client.py", line 1075, in reconnect
    sock = self._create_socket_connection()
  File "/usr/local/lib/python3.9/site-packages/paho/mqtt/client.py", line 3546, in _create_socket_connection
    return socket.create_connection(addr, source_address=source, timeout=self._keepalive)
  File "/usr/local/lib/python3.9/socket.py", line 822, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/usr/local/lib/python3.9/socket.py", line 953, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
Refresh status: {'resources': {'cpu': {'topic': 'cpu', 'raw-sample': '{"capacity": 4, "load": 1.29, "load-1": 1.17, "load-5": 1.22, "context-switches": 250914448, "interrupts": 124882705, "software-interrupts": 102936880, "system-calls": 0}', 'capacity': 4, 'load': 1.29, 'load-1': 1.17, 'load-5': 1.22, 'context-switches': 250914448, 'interrupts': 124882705, 'software-interrupts': 102936880, 'system-calls': 0}, 'ram': {'topic': 'ram', 'raw-sample': '{"capacity": 1867, "used": 725}', 'capacity': 1867, 'used': 725}, 'disks': [{'device': 'mmcblk0p2', 'capacity': 15, 'used': 8, 'topic': 'disks', 'raw-sample': '{"device": "mmcblk0p2", "capacity": 15, "used": 8}'}], 'net-stats': [{'interface': 'lo', 'bytes-transmitted': 66857185, 'bytes-received': 66857309}, {'interface': 'vethe471644', 'bytes-transmitted': 35293221, 'bytes-received': 105361309}, {'interface': 'veth33c3569', 'bytes-transmitted': 0, 'bytes-received': 0}, {'interface': 'vethbfe879b', 'bytes-transmitted': 12612461, 'bytes-received': 0}, {'interface': 'br-12fd990b4b5c', 'bytes-transmitted': 4426972298, 'bytes-received': 3467916144}, {'interface': 'vpn', 'bytes-transmitted': 27299733843, 'bytes-received': 541509917}, {'interface': 'veth06f0d08', 'bytes-transmitted': 105431990, 'bytes-received': 37200042}, {'interface': 'veth7ea16f0', 'bytes-transmitted': 12484687, 'bytes-received': 273872}, {'interface': 'docker_gwbridge', 'bytes-transmitted': 757693101, 'bytes-received': 28120381346}, {'interface': 'eth0', 'bytes-transmitted': 8957242254, 'bytes-received': 2731545025}, {'interface': 'veth6f5e5c7', 'bytes-transmitted': 2260357148, 'bytes-received': 2414564794}, {'interface': 'wlan0', 'bytes-transmitted': 0, 'bytes-received': 0}, {'interface': 'docker0', 'bytes-transmitted': 552269386, 'bytes-received': 334768642}, {'interface': 'veth5865ab2', 'bytes-transmitted': 12612611, 'bytes-received': 0}, {'interface': 'veth18b30bd', 'bytes-transmitted': 0, 'bytes-received': 0}, {'interface': 'veth6e5e65d', 'bytes-transmitted': 26973577, 'bytes-received': 14343938}, {'interface': 'vethc58f4ca', 'bytes-transmitted': 12612080, 'bytes-received': 621}]}, 'status': 'DEGRADED', 'status-notes': ['Unable to launch Data Gateway'], 'current-time': '2021-06-07T09:14:19Z', 'id': 'nuvlabox-status/65b2e016-cac5-4b62-9d2f-721ce010c515'}
Deleting the following attributes from NuvlaBox Status: nuvlabox-api-endpoint, inferred-location, node-id, cluster-id, cluster-managers, cluster-nodes, cluster-node-role, orchestrator, cluster-join-address
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/docker/api/client.py", line 268, in _raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.9/site-packages/requests/models.py", line 943, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: http+docker://localhost/v1.41/nodes?filters=%7B%22role%22%3A+%5B%22manager%22%5D%7D

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/nuvlabox/./app.py", line 232, in <module>
    infra.try_commission()
  File "/opt/nuvlabox/agent/Infrastructure.py", line 401, in try_commission
    commission_payload.update(self.get_cluster_info())
  File "/opt/nuvlabox/agent/Infrastructure.py", line 373, in get_cluster_info
    for manager in self.docker_client.nodes.list(filters={'role': 'manager'}):
  File "/usr/local/lib/python3.9/site-packages/docker/models/nodes.py", line 106, in list
    for n in self.client.api.nodes(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/docker/utils/decorators.py", line 34, in wrapper
    return f(self, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/docker/api/swarm.py", line 307, in nodes
    return self._result(self._get(url, params=params), True)
  File "/usr/local/lib/python3.9/site-packages/docker/api/client.py", line 274, in _result
    self._raise_for_status(response)
  File "/usr/local/lib/python3.9/site-packages/docker/api/client.py", line 270, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/usr/local/lib/python3.9/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.APIError: 500 Server Error for http+docker://localhost/v1.41/nodes?filters=%7B%22role%22%3A+%5B%22manager%22%5D%7D: Internal Server Error ("rpc error: code = Canceled desc = grpc: the client connection is closing")
pi@raspberrypi:~ $

@cjdcordeiro (Author)

> This is where I ran the docker-compose command from. The /opt/nuvlabox folder didn't exist, and I got an error saying the folder doesn't exist:
>
> 0377112a | nuvlabox_update | 2021-06-02T11:27:00.646Z | FAILED | 100 | 1 | [NuvlaBox Engine update to 2.0.0] update ERROR: cannot proceed: original compose files are not available for rollback at /opt/nuvlabox

Can you reproduce this? /opt/nuvlabox is the update folder inside a container; that folder always exists. The error is complaining about the YAML files not being there, but these should be copied automatically at update time. So the question is: given that working-dir = /home/pi...are you sure the original YAML files are still there?
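A quick sanity check on the device, assuming working-dir is /home/pi as reported (the updater expects to copy the original compose files from there):

# the original install files should still be present
ls /home/pi/docker-compose*.yml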

@cjdcordeiro (Author)

> Failed leaving cluster: […]

This is interesting. It is a Docker issue that can happen. The question is: what state did your NB get into? I see you've followed up with a DEGRADED state (which is normal when your host changes modes), but is it still DEGRADED?

@cjdcordeiro (Author)

Also, isn't the system-manager restarting the agent? Can you check both logs?

@mebster (Contributor) commented Jun 7, 2021

Here are the system-manager logs:

ERROR - Supervise.py/Supervise/healer - Cannot heal container nuvlabox_agent_1. Reason: 500 Server Error for http+docker://localhost/v1.41/containers/7e9ebfe5846a7f2334f170337a521974e095b4ab4a5d8365c764e2e75f1f04bb/restart?t=10: Internal Server Error ("Cannot restart container 7e9ebfe5846a7f2334f170337a521974e095b4ab4a5d8365c764e2e75f1f04bb: invalid cluster node while attaching to network")
WARNING - Supervise.py/Supervise/find_nuvlabox_agent - Agent API is not ready yet. Trying again later
INFO - Supervise.py/Supervise/healer - Container nuvlabox_agent_1 exited and is not restarting. Forcing restart
ERROR - Supervise.py/Supervise/healer - Cannot heal container nuvlabox_agent_1. Reason: 500 Server Error for http+docker://localhost/v1.41/containers/7e9ebfe5846a7f2334f170337a521974e095b4ab4a5d8365c764e2e75f1f04bb/restart?t=10: Internal Server Error ("Cannot restart container 7e9ebfe5846a7f2334f170337a521974e095b4ab4a5d8365c764e2e75f1f04bb: invalid cluster node while attaching to network")
WARNING - Supervise.py/Supervise/find_nuvlabox_agent - Agent API is not ready yet. Trying again later
INFO - Supervise.py/Supervise/healer - Container nuvlabox_agent_1 exited and is not restarting. Forcing restart
ERROR - Supervise.py/Supervise/healer - Cannot heal container nuvlabox_agent_1. Reason: 500 Server Error for http+docker://localhost/v1.41/containers/7e9ebfe5846a7f2334f170337a521974e095b4ab4a5d8365c764e2e75f1f04bb/restart?t=10: Internal Server Error ("Cannot restart container 7e9ebfe5846a7f2334f170337a521974e095b4ab4a5d8365c764e2e75f1f04bb: invalid cluster node while attaching to network")
pi@raspberrypi:~ $

The NB is in the same degraded state.

@mebster (Contributor) commented Jun 7, 2021

As for the working directory:

pi@raspberrypi:~ $ ls
Bookshelf  docker-compose.bluetooth.yml  docker-compose.gpu.yml  docker-compose.modbus.yml  docker-compose.network.yml  docker-compose.usb.yml  docker-compose.yml  libseccomp2_2.5.1-1_armhf.deb  nuvlabox-engine-5.zip  old
pi@raspberrypi:~ $
pi@raspberrypi:~ $
pi@raspberrypi:~ $
pi@raspberrypi:~ $ ls /opt/
containerd  pigpio  vc
pi@raspberrypi:~ $

@cjdcordeiro (Author)

> Here are the system-manager logs: […]
>
> The NB is in the same degraded state.

Can you try to pull and re-deploy nuvladev/system-manager:master? It would be nice if we could reproduce this issue (leaving the cluster)...
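A sketch of that pull/re-deploy, assuming the same compose files used at install time (add the peripheral files if you installed with them):

docker pull nuvladev/system-manager:master
# recreate just the system-manager service under the existing project
docker-compose -p nuvlabox -f docker-compose.yml up -d system-manager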

@mebster (Contributor) commented Jun 30, 2021

Attempting to join a cluster fails with this error:

Job nuvlabox_cluster_join_worker failed
Exception: Cluster join requires both a token and address: {'cluster-action': 'join-worker', 'nuvlabox-manager-status': {'parent': 'nuvlabox/6b9e4a19-671f-4fe3-be40-6108cc497f78', 'cluster-join-address': '192.168.1.235:2377', 'resource-type': 'nuvlabox-status', 'acl': {'edit-data': ['nuvlabox/6b9e4a19-671f-4fe3-be40-6108cc497f78'], 'view-meta': ['group/studio-koh', 'infrastructure-service/eb8e09c2-8387-4f6d-86a4-ff5ddf3d07d7', 'nuvlabox/6b9e4a19-671f-4fe3-be40-6108cc497f78', 'user/e53f6f2e-c831-4978-b055-b4ecae38bdda'], 'view-acl': ['group/studio-koh', 'infrastructure-service/eb8e09c2-8387-4f6d-86a4-ff5ddf3d07d7', 'user/e53f6f2e-c831-4978-b055-b4ecae38bdda'], 'view-data': ['group/studio-koh', 'infrastructure-service/eb8e09c2-8387-4f6d-86a4-ff5ddf3d07d7', 'nuvlabox/6b9e4a19-671f-4fe3-be40-6108cc497f78', 'user/e53f6f2e-c831-4978-b055-b4ecae38bdda'], 'edit-meta': ['nuvlabox/6b9e4a19-671f-4fe3-be40-6108cc497f78'], 'owners': ['group/nuvla-admin']}, 'id': 'nuvlabox-status/3c0b07b6-8288-426a-ba27-5cd5eb1c1bca', 'cluster-id': 'ufqrijrfestojyoebo7s3dyu0'}, 'token': ''}

@mebster (Contributor) commented Jun 30, 2021

This was "hydro-engine-9" joining "Studio KOH NuvlaBox Lionel test". Both are a single node cluster manager.

@mebster (Contributor) commented Jun 30, 2021

Sharing a NB that is a manager doesn't share the cluster it belongs to. I think it should.

@cjdcordeiro (Author)

This was "hydro-engine-9" joining "Studio KOH NuvlaBox Lionel test". Both are a single node cluster manager.

Well, this is server-side, and it basically just means you didn't "pass" a token for the NB to join. The clustering action wasn't even attempted.

This can happen for one of two reasons:

  1. there is no swarm-token credential for that manager NuvlaBox (just check the credentials to see if it is there; a quick check is sketched after this list)
  2. you pressed the cluster action button too fast, before the UI could fetch the respective token for the manager NB, so it sent a nil value to the job (if this happens, we need to disable the button click while the UI is fetching)
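A quick way to check cause 1 against the Nuvla API, assuming you already hold an authenticated session cookie in cookies.txt (the subtype value comes from the comment above; endpoint and filter syntax follow Nuvla's standard CIMI conventions):

# list swarm-token credentials visible to your session
curl -s -G -b cookies.txt 'https://nuvla.io/api/credential' \
  --data-urlencode 'filter=subtype="swarm-token"'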

@cjdcordeiro (Author)

> Sharing a NB that is a manager doesn't share the cluster it belongs to. I think it should.

Yes, I think so too. Please file it as a TODO feature ticket on the api-server.
