
Syncing latest changes from upstream main for ramen #220

Merged
merged 6 commits into from Mar 19, 2024

Conversation

df-build-team

PR containing the latest commits from upstream main branch

nirs added 6 commits March 18, 2024 08:26
This was added to make cluster start order more predictable, but it does
not work reliably, and makes testing harder.

Signed-off-by: Nir Soffer <[email protected]>
- Add a duration argument to the error addon for quicker testing.

- Replace example deployment with new sleep addon. The addon starts a
  sleep child process for testing termination of child processes (see
  the sketch after this list).

- Add additional cluster to check that all addons on all clusters are
  terminated on errors.

- Add multiple workers and multiple addons to test that all workers in
  all clusters are canceled after an error.

- Add logging in the addons for easier debugging.
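
For illustration, a minimal sketch of what such a sleep addon `start`
script could look like, assuming the structure suggested by the log
messages in the example runs below (the actual addon code may differ):

    #!/usr/bin/env python3
    # Hypothetical sketch of a sleep addon start script: it spawns a
    # child "sleep" process so termination of grandchildren can be
    # tested.
    import subprocess
    import sys

    def main():
        seconds = sys.argv[1] if len(sys.argv) > 1 else "10"
        print(f"Sleep addon waiting {seconds} seconds", flush=True)
        subprocess.run(["sleep", seconds], check=True)
        print("Sleep addon finished normally", flush=True)

    if __name__ == "__main__":
        main()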

Starting the environment starts this process tree:

    $ ps -f -o user,pid,ppid,pgid,args
    USER         PID    PPID    PGID COMMAND
    nsoffer   684655  684633  684655 bash
    nsoffer   900040  684655  900040  \_ drenv start envs/error.yaml -v
    nsoffer   900987  900040  900040      \_ python3 addons/sleep/start 10.0
    nsoffer   900993  900987  900040      |   \_ sleep 10.0
    nsoffer   900988  900040  900040      \_ python3 addons/sleep/start 10.0
    nsoffer   900992  900988  900040      |   \_ sleep 10.0
    nsoffer   900990  900040  900040      \_ python3 addons/sleep/start 10.0
    nsoffer   900994  900990  900040      |   \_ sleep 10.0
    nsoffer   900991  900040  900040      \_ python3 addons/sleep/start 10.0
    nsoffer   900995  900991  900040      |   \_ sleep 10.0
    nsoffer   901015  900040  900040      \_ python3 addons/sleep/start 10.0
    nsoffer   901019  901015  900040      |   \_ sleep 10.0
    nsoffer   901016  900040  900040      \_ python3 addons/sleep/start 10.0
    nsoffer   901020  901016  900040      |   \_ sleep 10.0
    nsoffer   901017  900040  900040      \_ python3 addons/error/start 10.0
    nsoffer   901018  900040  900040      \_ python3 addons/sleep/start 10.0
    nsoffer   901021  901018  900040          \_ sleep 10.0

Example run:

    $ drenv start envs/error.yaml -v
    2024-03-16 23:47:38,320 INFO    [error] Starting environment
    ...
    2024-03-16 23:48:25,171 ERROR   [c1/0] Cluster failed
    Traceback (most recent call last):
      ...
    drenv.commands.Error: Command failed:
       command: ('addons/error/start', '10.0')
       exitcode: 1
       error:
          Error addon failed

Since we did not terminate addons on errors, some addons and their child
processes were orphaned after drenv terminated:

    $ ps -f -o user,pid,ppid,pgid,args
    USER         PID    PPID    PGID COMMAND
    nsoffer   684655  684633  684655 bash
    nsoffer   901053  682980  900040 python3 addons/sleep/start 60
    nsoffer   901054  901053  900040  \_ sleep 60
    nsoffer   901052  682980  900040 python3 addons/sleep/start 60
    nsoffer   901055  901052  900040  \_ sleep 60
    nsoffer   901051  682980  900040 python3 addons/sleep/start 60
    nsoffer   901056  901051  900040  \_ sleep 60
    nsoffer   901050  682980  900040 python3 addons/sleep/start 60
    nsoffer   901057  901050  900040  \_ sleep 60

In the CI the GitHub runner terminates orphaned processes, but locally
they continue to run until they finish.

Signed-off-by: Nir Soffer <[email protected]>
`commands.run()` and `commands.watch()` raise a `shutdown.Started`
exception if they are called after `shutdown.start()` was called. This
will be used to perform a clean shutdown after errors.

To support this behavior, we separate the fast part (starting the child
process) from the slow part (communicating with the child process).
Starting the child process is done under the shutdown lock, so once
shutdown has started we know that no new child process can be created,
and that any child process created concurrently by another thread
already exists.
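
A minimal sketch of this pattern; the module layout, names, and
signatures below are illustrative assumptions, not the actual drenv API:

    import subprocess
    import threading

    _lock = threading.Lock()
    _started = False

    class Started(Exception):
        """Raised when a command is run after shutdown has started."""

    def start():
        """Start shutdown; no new child process can be created after
        this returns."""
        global _started
        with _lock:
            _started = True

    def _start_process(args):
        # The fast part: create the child process under the lock, so
        # once shutdown starts no new child can appear, and any child
        # created concurrently by another thread already exists.
        with _lock:
            if _started:
                raise Started(f"Shutting down, not running {args}")
            return subprocess.Popen(
                args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

    def run(*args):
        # The slow part: communicate with the child outside the lock.
        proc = _start_process(args)
        out, err = proc.communicate()
        if proc.returncode != 0:
            raise RuntimeError(f"Command {args} failed: {err.decode()}")
        return out.decode()

With this split, `shutdown.start()` only has to wait for the short
critical section; it never blocks on a slow child process.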

Signed-off-by: Nir Soffer <[email protected]>
Previously we terminated without any cleanup, leaving orphaned child
processes around. Now we terminate all child processes (including child
processes created by child processes) when terminating.

Shutting down cleanly in a multi-threaded, multi-process application is
not easy; a rough sketch of the flow is shown after this list.

- When starting the program, we ensure that drenv runs as a process
  group leader. This ensures that all child processes are part of the
  drenv process group. Child processes started by child processes are
  also created in the same process group.

- When starting, we register a SIGTERM handler. This ensures that when
  drenv is terminated, we use the same termination flow, terminating all
  child processes.

- When creating executors, we add them to a module-level executors
  list. We abort the worker if shutdown has started.

- When running a hook, we abort the worker if shutdown has started.

- When the first addon fails, we shut down the executor without waiting
  for running tasks, and cancel pending tasks. We don't log the error in
  the executor but let it propagate up to the new top-level error
  handlers in main(), where we log all fatal errors.

- When handling an error in the top-level error handler, we start
  shutdown, preventing any new child process from being created, and
  causing running tasks to abort. Then we shut down all executors to
  cancel all pending tasks.

- Finally we terminate the entire process group using SIGHUP. This
  terminates any child process created before shutdown was started.
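
The sketch below outlines this flow; `shutdown`, `executors`,
`start_environment()`, and the logging calls are illustrative
assumptions, not the actual drenv code:

    import logging
    import os
    import signal
    import sys

    def main():
        # Run as process group leader so every child (and grandchild)
        # shares the drenv process group.
        if os.getpid() != os.getpgid(0):
            os.setpgid(0, 0)

        # Reuse the same termination flow when drenv is terminated.
        signal.signal(signal.SIGTERM, lambda signo, frame: sys.exit(1))

        try:
            start_environment()  # runs addons via executors
        except BaseException:
            logging.exception("Command failed")
            shutdown.start()  # no new child processes from now on
            for executor in executors:
                # Cancel pending tasks without waiting for running ones.
                executor.shutdown(wait=False, cancel_futures=True)
            # Ignore SIGHUP in drenv itself, then terminate everything
            # still running in the process group.
            signal.signal(signal.SIGHUP, signal.SIG_IGN)
            os.killpg(os.getpgid(0), signal.SIGHUP)
            raise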

Example run using verbose logs to show the new debug logs:

    $ drenv start envs/error.yaml -v
    ...
    2024-03-17 02:34:58,639 DEBUG   [main] Add executor c1
    2024-03-17 02:34:58,639 INFO    [c1/0] Running addons/error/start
    2024-03-17 02:34:58,639 INFO    [c1/1] Running addons/sleep/start
    2024-03-17 02:34:58,640 INFO    [c1/2] Running addons/sleep/start
    2024-03-17 02:34:58,640 INFO    [c1/3] Running addons/sleep/start
    2024-03-17 02:34:58,711 DEBUG   [c1/0] Error addon will fail in 10.0 seconds
    2024-03-17 02:34:58,712 DEBUG   [c1/1] Sleep addon waiting 10.0 seconds
    2024-03-17 02:34:58,713 DEBUG   [c1/2] Sleep addon waiting 10.0 seconds
    2024-03-17 02:34:58,715 DEBUG   [c1/3] Sleep addon waiting 10.0 seconds
    2024-03-17 02:35:07,768 DEBUG   [c2/0] Sleep addon finished normally
    2024-03-17 02:35:07,768 DEBUG   [c2/1] Sleep addon finished normally
    2024-03-17 02:35:07,769 DEBUG   [c2/2] Sleep addon finished normally
    2024-03-17 02:35:07,772 DEBUG   [c2/3] Sleep addon finished normally
    2024-03-17 02:35:07,787 INFO    [c2/0] addons/sleep/start completed in 10.10 seconds
    2024-03-17 02:35:07,788 INFO    [c2/0] Running addons/sleep/start
    2024-03-17 02:35:07,790 INFO    [c2/1] addons/sleep/start completed in 10.10 seconds
    2024-03-17 02:35:07,790 INFO    [c2/1] Running addons/sleep/start
    2024-03-17 02:35:07,791 INFO    [c2/2] addons/sleep/start completed in 10.10 seconds
    2024-03-17 02:35:07,791 INFO    [c2/2] Running addons/sleep/start
    2024-03-17 02:35:07,794 INFO    [c2/3] addons/sleep/start completed in 10.10 seconds
    2024-03-17 02:35:07,794 INFO    [c2/3] Running addons/sleep/start
    2024-03-17 02:35:07,868 DEBUG   [c2/0] Sleep addon waiting 60 seconds
    2024-03-17 02:35:07,871 DEBUG   [c2/1] Sleep addon waiting 60 seconds
    2024-03-17 02:35:07,871 DEBUG   [c2/3] Sleep addon waiting 60 seconds
    2024-03-17 02:35:07,872 DEBUG   [c2/2] Sleep addon waiting 60 seconds
    2024-03-17 02:35:08,713 DEBUG   [c1/1] Sleep addon finished normally
    2024-03-17 02:35:08,714 DEBUG   [c1/2] Sleep addon finished normally
    2024-03-17 02:35:08,717 DEBUG   [c1/3] Sleep addon finished normally
    2024-03-17 02:35:08,734 INFO    [c1/2] addons/sleep/start completed in 10.09 seconds
    2024-03-17 02:35:08,734 INFO    [c1/2] Running addons/sleep/start
    2024-03-17 02:35:08,735 ERROR   Command failed
    Traceback (most recent call last):
      ...
    drenv.commands.Error: Command failed:
       command: ('addons/error/start', '10.0')
       exitcode: 1
       error:
          Error addon failed

    2024-03-17 02:35:08,736 DEBUG   [main] Starting shutdown
    2024-03-17 02:35:08,736 DEBUG   [main] Shutting down executor profiles
    2024-03-17 02:35:08,736 DEBUG   [main] Shutting down executor c2
    2024-03-17 02:35:08,737 DEBUG   [main] Shutting down executor c1
    2024-03-17 02:35:08,737 DEBUG   [main] Terminating process group 941645
    2024-03-17 02:35:08,737 INFO    [c1/3] addons/sleep/start completed in 10.10 seconds
    2024-03-17 02:35:08,737 DEBUG   [c1/3] Shutting down

The sleep addons and their child processes were terminated when drenv
terminated after the error.

Signed-off-by: Nir Soffer <[email protected]>
Copying the drenv e2e config does not belong to starting drenv, and it
always fails when drenv start fails:

    cp: cannot stat '/home/github/.config/drenv/rdr-rdr/config.yaml': No such file or directory

I think the reason is that `nick-fields/retry@v3` does not use `bash -e`
like the standard `run` step, which uses:

    shell: /usr/bin/bash -e {0}

Copying the config belongs to the e2e tests, so move it there.

Signed-off-by: Nir Soffer <[email protected]>
ramenctl config fails from time to time when:
- Waiting for policies to propagate to managed clusters
- Failing to access ramen webhook when adding a drcluster

The first issue may be a bug in ocm that we need to reproduce and
investigate locally.

The second issue may be a bug in the way we wait for ramen, but I think
there is no good way to wait until a webhook is available; it can fail
even if the container is running. This issue will be resolved soon by
removing the ramen webhooks. This is a temporary error that should be
resolved by retrying.

Until we resolve both issues, let's retry the config step on errors.

Signed-off-by: Nir Soffer <[email protected]>
@df-build-team df-build-team requested a review from a team March 19, 2024 08:07

openshift-ci bot commented Mar 19, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: df-build-team

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ShyamsundarR ShyamsundarR merged commit 51ef33a into main Mar 19, 2024
27 of 28 checks passed