-
Notifications
You must be signed in to change notification settings - Fork 467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue with symbolic link created with postStart and starter.zip logic #31030
Comments
I know this is already in review and I have previously commented on the topic during one of our standup meetings. However, I wanted to post a comment here with some background information for historical purposes in case anyone is searching around for the answer in the future. The reason this post-start command is in place is because dotCMS would attempt to archive and store assets (and any other backup items) into a directory within the container that was using ephemeral storage. If the environment's assets amounted to a large size, this would cause the pod to run out of storage. In some extreme cases, it could cause the underlying node to experience disk pressure and begin evicting pods as well. In other words, the internal backup feature allowed customers to adversely affect their environments and possibly other customers' environments too. Therefore, we introduced this symlink so the backup directory would be stored on a persistent volume (EFS in this case) that did not impose any hard restrictions on file sizes or overall disk size usage. It is a hidden directory so the backup procedure itself wouldn't backup the directory and cause a weird loop effect where assets would be duplicated many times. For more current versions of dotCMS, I do not think this is necessary at all anymore. Will previously informed me that the backup function no longer attempts to buffer and archive on the local filesystem; rather it will just stream the data instead. We should verify, but I think we can remove this altogether for recent versions of dotCMS. (Part of the verification would be verifying which dotCMS versions.) |
@dcolina can you please create a spike and pull in for this sprint for sorting out previous versions and how we handle them.
@sfreudenthaler here is the spike ticket #31123 |
…start (#31109) ## Proposed Changes This pull request includes several updates to the backup directory configuration and improvements to the `ImportStarterUtil` class for handling file operations more robustly. The most significant changes are the introduction of a configurable backup directory path, enhancements to file handling in `ImportStarterUtil`, and updates to the `dotmarketing-config.properties` file. ### Backup Directory Configuration: * Added a method to retrieve a configurable backup directory path with a default fallback. ### File Handling Improvements: * Refactored the `deleteTempFiles` method to use Java NIO for better file handling and safety. * Enhanced the `validateZipFile` method to ensure proper setup using Java NIO features and added concurrency safety. ## Fixes #31030
Most of our stateful set deployment templates use the following
There are multiple issues with this behaviour that need to be addressed.
The folder should be configurable in the App rather than a patch in the template in this way.
This seems to be an attempt to make sure that this folder persists the otherwise ephemeral storage in /data/local or to reduce the ephemeral disk usage on the container. A remapping at the k8s template level in a script in not the right way of addressing this location change. The "/data/local/dotsecure/backup" location is defined in com.dotmarketing.util.starter.ImportStarterUtil where it is hardcoded currently to DYNAMIC_CONTENT_PATH/dotsecure/backup without the option to override in the code. I can understand the need to possibly patch in to solve an immediate problem before code can be changed, but this should not be left in without a task to fix the source configurability issue. The result of not doing this is added complexity, confusion on the purpose, and further issues that are hard to diagnose as they are context sensitive such as what is found below.
The use of a symbolic link here has unexpected consequences for the way the code is currently accessing this folder causing exceptions to be thrown and the starter to fail.
ImportStarterUtil is doing some functionality in validateZipFile to delete existing temp files in the backup folder. This functionality can get confused when the folder is a symbolic link to another folder that may or may not exist. The existing code gets around some of the issues here due to the fact that the
ftempDir.mkdirs()" does not throw an exception if it could not create the folder but returns a true and false unlike the modern nio functions
The remapping of the target of this folder into /data/shared makes it common for all replicas and creates concurrency issues.
If multiple replicas are starting at the same time they will each run this functionality and can call deleteTempFiles at different times on the common folder causing a concurrency issue where the zip is no longer available and the instance fails.
We need to be aware of postStart lifecycle command concurrency
It may not be this issue in this case which is more likely related to the shared folder, but we need to be aware with postStart commands that the do not run to completion before the docker entrypoint runs, but they are started at the same time and run in parallel with the main entrypoint process. This means that if a postStart command is not immediate the application may start and may or may not not immediately see the changes the postStart command makes.
TODO
Note: The concurrency issues with access from multiple replicas may go away when we only run the startup tasks in an single init container so we may not be so concerned about the concurrency here, but if possible it would be good to ensure the logic is thought about in a safe manner anyway to prevent any possible future issues and confusion.
The text was updated successfully, but these errors were encountered: