Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with symbolic link created with postStart and starter.zip logic #31030

Closed
spbolton opened this issue Jan 2, 2025 · 3 comments · Fixed by #31109
Closed

Issue with symbolic link created with postStart and starter.zip logic #31030

spbolton opened this issue Jan 2, 2025 · 3 comments · Fixed by #31109
Assignees

Comments

@spbolton
Copy link
Contributor

spbolton commented Jan 2, 2025

Most of our stateful set deployment templates use the following

lifecycle:
      postStart:
        command:
          - /bin/bash
          - -c
          - if [ ! -d /data/shared/assets/.backup ]; then mkdir /data/shared/assets/.backup;
            fi; ln -s /data/shared/assets/.backup /data/local/dotsecure/backup;

There are multiple issues with this behaviour that need to be addressed.

The folder should be configurable in the App rather than a patch in the template in this way.

This seems to be an attempt to make sure that this folder persists the otherwise ephemeral storage in /data/local or to reduce the ephemeral disk usage on the container. A remapping at the k8s template level in a script in not the right way of addressing this location change. The "/data/local/dotsecure/backup" location is defined in com.dotmarketing.util.starter.ImportStarterUtil where it is hardcoded currently to DYNAMIC_CONTENT_PATH/dotsecure/backup without the option to override in the code. I can understand the need to possibly patch in to solve an immediate problem before code can be changed, but this should not be left in without a task to fix the source configurability issue. The result of not doing this is added complexity, confusion on the purpose, and further issues that are hard to diagnose as they are context sensitive such as what is found below.

The use of a symbolic link here has unexpected consequences for the way the code is currently accessing this folder causing exceptions to be thrown and the starter to fail.

ImportStarterUtil is doing some functionality in validateZipFile to delete existing temp files in the backup folder. This functionality can get confused when the folder is a symbolic link to another folder that may or may not exist. The existing code gets around some of the issues here due to the fact that the
ftempDir.mkdirs()" does not throw an exception if it could not create the folder but returns a true and false unlike the modern nio functions

The remapping of the target of this folder into /data/shared makes it common for all replicas and creates concurrency issues.

If multiple replicas are starting at the same time they will each run this functionality and can call deleteTempFiles at different times on the common folder causing a concurrency issue where the zip is no longer available and the instance fails.

We need to be aware of postStart lifecycle command concurrency

It may not be this issue in this case which is more likely related to the shared folder, but we need to be aware with postStart commands that the do not run to completion before the docker entrypoint runs, but they are started at the same time and run in parallel with the main entrypoint process. This means that if a postStart command is not immediate the application may start and may or may not not immediately see the changes the postStart command makes.

TODO

  1. Understand properly the purpose of the copy of the startup zip into a backup folder. When should this be removed, should it be stored in /data/shared and why. If it is meant to be a cache, why are we attempting to delete the temp files first anyway.
  2. Refactor ImportStarterUtil.validateZipFile to properly handle required logic making use of modern Java NIO where appropriate. Make sure that subfolders are created properly if needed and do not throw exceptions if they already exist whether a symbolic link or a regular folder. Ensure that the logic is safe assuming multiple replicas may be starting up at the same time.
  3. Remove the need for the postStart exec task either by changing the default location for this backup folder, removing the need for the folder, and or allowing for the path to be configured with a config property on the container.

Note: The concurrency issues with access from multiple replicas may go away when we only run the startup tasks in an single init container so we may not be so concerned about the concurrency here, but if possible it would be good to ensure the logic is thought about in a safe manner anyway to prevent any possible future issues and confusion.

Copy link

@dcolina dcolina moved this from In Progress to In Review in dotCMS - Product Planning Jan 13, 2025
@cobbg
Copy link

cobbg commented Jan 13, 2025

I know this is already in review and I have previously commented on the topic during one of our standup meetings. However, I wanted to post a comment here with some background information for historical purposes in case anyone is searching around for the answer in the future.

The reason this post-start command is in place is because dotCMS would attempt to archive and store assets (and any other backup items) into a directory within the container that was using ephemeral storage. If the environment's assets amounted to a large size, this would cause the pod to run out of storage. In some extreme cases, it could cause the underlying node to experience disk pressure and begin evicting pods as well.

In other words, the internal backup feature allowed customers to adversely affect their environments and possibly other customers' environments too. Therefore, we introduced this symlink so the backup directory would be stored on a persistent volume (EFS in this case) that did not impose any hard restrictions on file sizes or overall disk size usage. It is a hidden directory so the backup procedure itself wouldn't backup the directory and cause a weird loop effect where assets would be duplicated many times.

For more current versions of dotCMS, I do not think this is necessary at all anymore. Will previously informed me that the backup function no longer attempts to buffer and archive on the local filesystem; rather it will just stream the data instead. We should verify, but I think we can remove this altogether for recent versions of dotCMS. (Part of the verification would be verifying which dotCMS versions.)

dcolina added a commit that referenced this issue Jan 14, 2025
dcolina added a commit that referenced this issue Jan 14, 2025
@sfreudenthaler
Copy link
Contributor

sfreudenthaler commented Jan 14, 2025

@dcolina can you please create a spike and pull in for this sprint for sorting out previous versions and how we handle them.

I know this is already in review and I have previously commented on the topic during one of our standup meetings. However, I wanted to post a comment here with some background information for historical purposes in case anyone is searching around for the answer in the future.

The reason this post-start command is in place is because dotCMS would attempt to archive and store assets (and any other backup items) into a directory within the container that was using ephemeral storage. If the environment's assets amounted to a large size, this would cause the pod to run out of storage. In some extreme cases, it could cause the underlying node to experience disk pressure and begin evicting pods as well.

In other words, the internal backup feature allowed customers to adversely affect their environments and possibly other customers' environments too. Therefore, we introduced this symlink so the backup directory would be stored on a persistent volume (EFS in this case) that did not impose any hard restrictions on file sizes or overall disk size usage. It is a hidden directory so the backup procedure itself wouldn't backup the directory and cause a weird loop effect where assets would be duplicated many times.

For more current versions of dotCMS, I do not think this is necessary at all anymore. Will previously informed me that the backup function no longer attempts to buffer and archive on the local filesystem; rather it will just stream the data instead. We should verify, but I think we can remove this altogether for recent versions of dotCMS. (Part of the verification would be verifying which dotCMS versions.)

@sfreudenthaler here is the spike ticket #31123

github-merge-queue bot pushed a commit that referenced this issue Jan 15, 2025
…start (#31109)

## Proposed Changes
This pull request includes several updates to the backup directory
configuration and improvements to the `ImportStarterUtil` class for
handling file operations more robustly. The most significant changes are
the introduction of a configurable backup directory path, enhancements
to file handling in `ImportStarterUtil`, and updates to the
`dotmarketing-config.properties` file.

### Backup Directory Configuration:
* Added a method to retrieve a configurable backup directory path with a
default fallback.

### File Handling Improvements:
* Refactored the `deleteTempFiles` method to use Java NIO for better
file handling and safety.
* Enhanced the `validateZipFile` method to ensure proper setup using
Java NIO features and added concurrency safety.

## Fixes
#31030
@github-project-automation github-project-automation bot moved this from In Review to Done in dotCMS - Product Planning Jan 15, 2025
@dcolina dcolina moved this from Done to Internal QA in dotCMS - Product Planning Jan 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Internal QA
Development

Successfully merging a pull request may close this issue.

4 participants