forked from RamenDR/ramen
-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Syncing latest changes from upstream main for ramen #220
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This was added to make cluster start order more predictable, but it does not works reliably, and makes testing harder. Signed-off-by: Nir Soffer <[email protected]>
- Add duration argument for error addon for quicker testing. - Replace example deployment with new sleep addon. The addon starts a sleep child process for testing termination of child processes. - Add additional cluster to check that all addons on all clusters are terminated on errors. - Add multiple workers and multiple addons to test that all workers in all clusters are canceled after an error. - Add logging in the addons for easier debugging Starting the environment stats this process tree: $ ps -f -o user,pid,ppid,pgid,args USER PID PPID PGID COMMAND nsoffer 684655 684633 684655 bash nsoffer 900040 684655 900040 \_ drenv start envs/error.yaml -v nsoffer 900987 900040 900040 \_ python3 addons/sleep/start 10.0 nsoffer 900993 900987 900040 | \_ sleep 10.0 nsoffer 900988 900040 900040 \_ python3 addons/sleep/start 10.0 nsoffer 900992 900988 900040 | \_ sleep 10.0 nsoffer 900990 900040 900040 \_ python3 addons/sleep/start 10.0 nsoffer 900994 900990 900040 | \_ sleep 10.0 nsoffer 900991 900040 900040 \_ python3 addons/sleep/start 10.0 nsoffer 900995 900991 900040 | \_ sleep 10.0 nsoffer 901015 900040 900040 \_ python3 addons/sleep/start 10.0 nsoffer 901019 901015 900040 | \_ sleep 10.0 nsoffer 901016 900040 900040 \_ python3 addons/sleep/start 10.0 nsoffer 901020 901016 900040 | \_ sleep 10.0 nsoffer 901017 900040 900040 \_ python3 addons/error/start 10.0 nsoffer 901018 900040 900040 \_ python3 addons/sleep/start 10.0 nsoffer 901021 901018 900040 \_ sleep 10.0 Example run: $ drenv start envs/error.yaml -v 2024-03-16 23:47:38,320 INFO [error] Starting environment ... 2024-03-16 23:48:25,171 ERROR [c1/0] Cluster failed Traceback (most recent call last): ... drenv.commands.Error: Command failed: command: ('addons/error/start', '10.0') exitcode: 1 error: Error addon failed Since we did not terminate addons on errors, some addons and their child process were orphaned after drenv terminated: $ ps -f -o user,pid,ppid,pgid,args USER PID PPID PGID COMMAND nsoffer 684655 684633 684655 bash nsoffer 901053 682980 900040 python3 addons/sleep/start 60 nsoffer 901054 901053 900040 \_ sleep 60 nsoffer 901052 682980 900040 python3 addons/sleep/start 60 nsoffer 901055 901052 900040 \_ sleep 60 nsoffer 901051 682980 900040 python3 addons/sleep/start 60 nsoffer 901056 901051 900040 \_ sleep 60 nsoffer 901050 682980 900040 python3 addons/sleep/start 60 nsoffer 901057 901050 900040 \_ sleep 60 In the CI the github runner terminate orphaned processes, but locally they continue to run until they finish. Signed-off-by: Nir Soffer <[email protected]>
`commands.run()` and `commands.watch()` raise `shutdown.Started` exception if they are called after `shutdown.start()` was called. This will be used to perform clean shutdown after errors. To support this behavior, we separate the fast part starting the child process from the slow part communicating with the child process. Starting the child process is done under the shutdown lock, so starting shutdown, we know that no new child process can be created, and child processes created by other threads at the same time exist. Signed-off-by: Nir Soffer <[email protected]>
Previously we terminated without any cleanup leaving orphaned child processes around. Now we terminate all child processes (including child processes created by child processes when terminating. Shutting down cleanly in a multi-threaded multi-process application is not easy. - When staring the program, we ensure that drenv run as process group leader. This ensures that all child processes are part of the drenv process group. Child process started by child processes are also created in the same process group. - When starting, we register SIGTERM handler. This ensures that when drenv is terminated, we use the same termination flow, terminating all child processes. - When creating executor, we add them to a module executors list. We abort the worker if shutdown has started. - When running a hook, we abort the worker if shutdown has started. - When the first addon fails, we shutdown the executor without waiting for running tasks, and cancel pending tasks. We don't log the error in the executor but let it propagate up to the new top level error handlers in main(), where we log all fatal errors. - When handling an error in the top level error handler, we start shutdown, preventing any new child process to be created, and causing running tasks to abort. Then we shutdown all executors to cancel all pending tasks. - Finally we terminate the entire process group using SIGHUP. This terminates any child process created before shutdown was started. Example run using verbose logs to show the new debug logs: $ drenv start envs/error.yaml -v ... 2024-03-17 02:34:58,639 DEBUG [main] Add executor c1 2024-03-17 02:34:58,639 INFO [c1/0] Running addons/error/start 2024-03-17 02:34:58,639 INFO [c1/1] Running addons/sleep/start 2024-03-17 02:34:58,640 INFO [c1/2] Running addons/sleep/start 2024-03-17 02:34:58,640 INFO [c1/3] Running addons/sleep/start 2024-03-17 02:34:58,711 DEBUG [c1/0] Error addon will fail in 10.0 seconds 2024-03-17 02:34:58,712 DEBUG [c1/1] Sleep addon waiting 10.0 seconds 2024-03-17 02:34:58,713 DEBUG [c1/2] Sleep addon waiting 10.0 seconds 2024-03-17 02:34:58,715 DEBUG [c1/3] Sleep addon waiting 10.0 seconds 2024-03-17 02:35:07,768 DEBUG [c2/0] Sleep addon finished normally 2024-03-17 02:35:07,768 DEBUG [c2/1] Sleep addon finished normally 2024-03-17 02:35:07,769 DEBUG [c2/2] Sleep addon finished normally 2024-03-17 02:35:07,772 DEBUG [c2/3] Sleep addon finished normally 2024-03-17 02:35:07,787 INFO [c2/0] addons/sleep/start completed in 10.10 seconds 2024-03-17 02:35:07,788 INFO [c2/0] Running addons/sleep/start 2024-03-17 02:35:07,790 INFO [c2/1] addons/sleep/start completed in 10.10 seconds 2024-03-17 02:35:07,790 INFO [c2/1] Running addons/sleep/start 2024-03-17 02:35:07,791 INFO [c2/2] addons/sleep/start completed in 10.10 seconds 2024-03-17 02:35:07,791 INFO [c2/2] Running addons/sleep/start 2024-03-17 02:35:07,794 INFO [c2/3] addons/sleep/start completed in 10.10 seconds 2024-03-17 02:35:07,794 INFO [c2/3] Running addons/sleep/start 2024-03-17 02:35:07,868 DEBUG [c2/0] Sleep addon waiting 60 seconds 2024-03-17 02:35:07,871 DEBUG [c2/1] Sleep addon waiting 60 seconds 2024-03-17 02:35:07,871 DEBUG [c2/3] Sleep addon waiting 60 seconds 2024-03-17 02:35:07,872 DEBUG [c2/2] Sleep addon waiting 60 seconds 2024-03-17 02:35:08,713 DEBUG [c1/1] Sleep addon finished normally 2024-03-17 02:35:08,714 DEBUG [c1/2] Sleep addon finished normally 2024-03-17 02:35:08,717 DEBUG [c1/3] Sleep addon finished normally 2024-03-17 02:35:08,734 INFO [c1/2] addons/sleep/start completed in 10.09 seconds 2024-03-17 02:35:08,734 INFO [c1/2] Running addons/sleep/start 2024-03-17 02:35:08,735 ERROR Command failed Traceback (most recent call last): ... drenv.commands.Error: Command failed: command: ('addons/error/start', '10.0') exitcode: 1 error: Error addon failed 2024-03-17 02:35:08,736 DEBUG [main] Starting shutdown 2024-03-17 02:35:08,736 DEBUG [main] Shutting down executor profiles 2024-03-17 02:35:08,736 DEBUG [main] Shutting down executor c2 2024-03-17 02:35:08,737 DEBUG [main] Shutting down executor c1 2024-03-17 02:35:08,737 DEBUG [main] Terminating process group 941645 2024-03-17 02:35:08,737 INFO [c1/3] addons/sleep/start completed in 10.10 seconds 2024-03-17 02:35:08,737 DEBUG [c1/3] Shutting down The sleep addons and their child processes were terminated when drenv terminated after the error. Signed-off-by: Nir Soffer <[email protected]>
Copying the drevn e2e config does not belong to starting drenv, and it always fails when drenv start fails: cp: cannot stat '/home/github/.config/drenv/rdr-rdr/config.yaml': No such file or directory I think the reason is that the `nick-fields/retry@v3` does not use `bash -e` like the standard `run` step, using: shell: /usr/bin/bash -e {0} The copy config belongs to the e2e tests, move it there. Signed-off-by: Nir Soffer <[email protected]>
ramenctl config is fails from time to time when: - Waiting for policies to propagate to managed clusters - Failing to access ramen webhook when adding a drcluster The first issue may be a bug in ocm that we need to reproduce and investigate locally. The second issue may be a bug in the way we wait for ramen, but I think there is no good way to wait until a webhook is available, it can fail even if the container is running. This issue will be resolved soon by removing the ramen webhooks. This is a temporary error that should be resolved by retrying. Until we resolve both issues, lets retry the config step on errors. Signed-off-by: Nir Soffer <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: df-build-team The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
PR containing the latest commits from upstream main branch