Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce cron jobs to clean up the export cache (data/cache/export/) and temporary files/directories (data/tmp/) #8804

Open
wants to merge 62 commits into
base: develop
Choose a base branch
from

Conversation

Marishka17
Copy link
Contributor

@Marishka17 Marishka17 commented Dec 9, 2024

Motivation and context

Depends on #8721
PR introduces the following changes:

  • [Export datasets|backups] Cache cleaning moved into a separate cron job. A separate RQ job is no longer enqueued after each export process
  • [Export datasets|backups] Temporary directory is created inside the common CVAT tmp directory (data/tmp/) instead of project|task|job/id/tmp/export_cache/
  • [Export datasets] One tmp directory (instead of 2) is created during the export process
  • Added the exportcachecleanup management command to remove outdated project|task|job/id/tmp/export_cache/ directories
  • Added a cron job to clean up the data/tmp/ directory. The new setting TMP_FILE_OR_DIR_RETENTION_DAYS is used to determine whether a file or directory should be removed
  • Locks are now used when working with backups as they are used for export

Notes:

  • data/project|task|job/id/tmp/ is still used during uploading annotations/datasets, but this should be fixed in a separate PR.

Breaking changes:

  • Enqueuing RQ jobs to backup a project/task: A filename argument was removed from the background function signature. Previously enqueued jobs will fail.
  • Cleaning cache after exporting backups or datasets: Scheduled RQ jobs will fail due to cvat.apps.dataset_manager.views.clear_export_cache was moved and cvat.apps.engine.backup._clear_export_cache was deleted.

How has this been tested?

Checklist

  • I submit my changes into the develop branch
  • I have created a changelog fragment
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • I have linked related issues (see GitHub docs)
  • I have increased versions of npm packages if it is necessary
    (cvat-canvas,
    cvat-core,
    cvat-data and
    cvat-ui)

License

  • I submit my code changes under the same MIT License that covers the project.
    Feel free to contact the maintainers if that's a concern.

Summary by CodeRabbit

  • New Features

    • Introduced periodic tasks for automatic cleanup of export caches for projects, tasks, and jobs.
    • Added new functionality for managing and cleaning up export cache files.
  • Bug Fixes

    • Enhanced error handling for expired files in export processes.
  • Refactor

    • Restructured export and cleanup functionalities for improved clarity and maintainability.
    • Updated test methods for better focus on export and cleanup operations.
  • Documentation

    • Added deprecation warnings for outdated environment variables related to caching settings.
  • Chores

    • Expanded formatting script to include new files for consistent code style.

Copy link
Contributor

coderabbitai bot commented Dec 9, 2024

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The pull request introduces a comprehensive refactoring of the export and cache management system in the CVAT application. The changes span multiple files and focus on improving the modularity, error handling, and organization of export-related functionality. Key modifications include introducing new classes for file type management, restructuring export cache handling, adding periodic cleanup jobs for export caches, and enhancing the overall file management process across projects, tasks, and jobs.

Changes

File Change Summary
cvat/apps/dataset_manager/util.py Added ExportFileType enum, refactored filename parsing classes, introduced ExportCacheManager with new methods for file path generation
cvat/apps/dataset_manager/views.py Removed get_export_cache_dir, updated export function, renamed _retry_current_rq_job to public method
cvat/apps/engine/models.py Added abstract base classes _FileSystemRelatedModel and _Exportable, updated Project, Task, and Job models
cvat/apps/engine/cron.py New file for managing export cache cleanup, added clear_export_cache and cron_export_cache_cleanup functions
cvat/settings/base.py Added periodic jobs for export cache cleanup for projects, tasks, and jobs

Poem

🐰 Hop, hop, through export's maze,
Cleaning caches in clever ways,
Enum and manager, hand in paw,
Refactoring code without a flaw,
CVAT's rabbit leaps with glee! 🚀


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@Marishka17 Marishka17 force-pushed the mk/clear_cache_cron_job branch 2 times, most recently from b885dac to f20eadf Compare December 10, 2024 16:00
@Marishka17 Marishka17 force-pushed the mk/clear_cache_cron_job branch from f20eadf to 92f9d0b Compare December 10, 2024 16:09
@Marishka17
Copy link
Contributor Author

/check

@cvat-ai cvat-ai deleted a comment from github-actions bot Dec 10, 2024
Copy link
Contributor

github-actions bot commented Dec 10, 2024

❌ Some checks failed
📄 See logs here

@Marishka17 Marishka17 force-pushed the mk/clear_cache_cron_job branch from 14084b8 to e4c24ea Compare December 11, 2024 11:02
@Marishka17 Marishka17 changed the title Move export cache cleaning into cron jobs [WIP] Move export cache cleaning into cron jobs Dec 11, 2024
@Marishka17 Marishka17 changed the title [WIP] Move export cache cleaning into cron jobs [Dependent] Move export cache cleaning into cron jobs Dec 17, 2024
@Marishka17 Marishka17 changed the title [Dependent] Move export cache cleaning into cron jobs Move export cache cleaning into cron jobs Dec 20, 2024
@Marishka17
Copy link
Contributor Author

/check

Copy link
Contributor

github-actions bot commented Dec 20, 2024

✔️ All checks completed successfully
📄 See logs here

@Marishka17 Marishka17 force-pushed the mk/clear_cache_cron_job branch from ed20c3d to 476c688 Compare January 9, 2025 11:02
@Marishka17
Copy link
Contributor Author

Marishka17 commented Jan 9, 2025

@zhiltsov-max, @SpecLad, okay, I've made several changes related to working with tmp dirs (including the export process):

  1. The data/tmp/ directory is used now during the export process, and a specific prefix (export-<task|project|job>-...) is added to the directories created during the export process. I've initially included the resource type in a directory name to be able to use TTL_CONSTS. However, this is not used now due to the third point.
  2. in the previous implementation, 2 tmp dirs were created during export. This has been fixed.
  3. added a separate cron job to clear the data/tmp/ directory. This cron job iterates over all files and directories from data/tmp/ and deletes all items that were not accessed in the last N days (the default value is set to 2 weeks). This approach can guarantee that we won't have files/directories that should have been deleted but remained.

@Marishka17 Marishka17 changed the title [WIP] Move export cache cleaning into cron jobs [do not merge] Move export cache cleaning into cron jobs Jan 9, 2025
@zhiltsov-max
Copy link
Contributor

and deletes all items that were not accessed in the last N days (the default value is set to 2 weeks)

Do you know why any value other than 1 day may be needed? Is there anything potentially useful in that directory?

The data/tmp/ directory is used now during the export process

The only concern I have about this is that we're still using os.replace() to finalize export, and it just an alias of os.rename() on linux, which doesn't support moving different disks. Probably, it's not a bit problem so far, but it can fire if someone tries to use a different mount for data/tmp. They will need to change os.replace to shutil.move.

@Marishka17 Marishka17 force-pushed the mk/clear_cache_cron_job branch from 7da8a17 to 2c6794a Compare January 9, 2025 13:22
@Marishka17 Marishka17 force-pushed the mk/clear_cache_cron_job branch from 2c6794a to 716f13c Compare January 9, 2025 13:26
@Marishka17 Marishka17 changed the title [do not merge] Move export cache cleaning into cron jobs Introduce cron jobs to clean up the export cache (data/cache/export/) and temporary files/directories (data/tmp/) Jan 9, 2025
@Marishka17
Copy link
Contributor Author

Do you know why any value other than 1 day may be needed? Is there anything potentially useful in that directory?

No, I don't. I set this value to make sure I don't delete something we need too early.
As I can see, there are several cases when common tmp directory is used (not taking into account changes from this PR):


In version 2.25.0, CVAT changed the location where the export cache is stored.
To clean up the outdated cache, run the following command: `python manage.py exportcachecleanup`.

Copy link
Contributor

@zhiltsov-max zhiltsov-max Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enqueuing RQ jobs to backup a project/task: A filename argument was removed from the background function signature. Previously enqueued jobs will fail.

What do you think about adding a one-off migration script to avoid losing existing RQ jobs? Unlike cleanup jobs these are real user requests. It should probably just remove 1 parameter from the existing jobs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, we'll have to make this PR dependent on #8898

cvat/settings/base.py Outdated Show resolved Hide resolved
@@ -542,7 +559,7 @@ def _export_task(self, zip_obj, target_dir=None):
self._write_annotations(zip_obj, target_dir)
self._write_annotation_guide(zip_obj, target_dir)

def export_to(self, file, target_dir=None):
def export_to(self, file: str | ZipFile, target_dir: str | None = None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not an error, but such annotations in inherited functions tend to become outdated quickly. Typically it's better to just use annotations from the original function instead of duplicating them everywhere.

Comment on lines +1058 to +1059
# FUTURE-FIXME: there db_instance_id should be passed
db_instance: models.Project | models.Task,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not change this parameter if the function args are already changed in this PR?

progress = (i + 1) / objects_count
done = int(progress_bar_len * progress)
progress_bar = "#" * done + "-" * (progress_bar_len - done)
self.stdout.write(f"\rProgress: |{progress_bar}| {progress:.0%}", ending="")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not work on mac and windows. Why not use tqdm?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ask Roman about it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SpecLad, I saw you were saying it's not installed, but why not install it? Do you see any problems with that?

log_exception(logger)


def cleanup(thread_class_path: str) -> None:
Copy link
Contributor

@zhiltsov-max zhiltsov-max Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python manage.py runperiodicjob cron_export_cache_cleanup

TypeError: cleanup() missing 1 required positional argument: 'thread_class_path'

    seconds_left = rq_job.timeout - 60
AttributeError: 'NoneType' object has no attribute 'timeout'

Probably, default args should be passed.

Copy link
Contributor Author

@Marishka17 Marishka17 Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The runperiodicjob command was added this week. This command should support passing args rather than setting the default value for the cleanup function.

AttributeError: 'NoneType' object has no attribute 'timeout'

because this command should be executed only from the worker process (by the current design). If you think that it will be useful to allow running this command not only by worker process, please provide reasons.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a periodic job, and we have a command to run periodic jobs manually. Probably, it should support such execution or clarify why it doesn't.

Comment on lines +64 to +75
{{< tabpane lang="shell" >}}
{{< tab header="Docker" >}}
docker exec -it cvat_server python manage.py exportcachecleanup
{{< /tab >}}
{{< tab header="Kubernetes" >}}
cvat_backend_pod=$(kubectl get pods -l component=server -o 'jsonpath={.items[0].metadata.name}')
kubectl exec -it ${cvat_backend_pod} -- python manage.py exportcachecleanup
{{< /tab >}}
{{< tab header="Development" >}}
python manage.py exportcachecleanup
{{< /tab >}}
{{< /tabpane >}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{{< tabpane lang="shell" >}}
{{< tab header="Docker" >}}
docker exec -it cvat_server python manage.py exportcachecleanup
{{< /tab >}}
{{< tab header="Kubernetes" >}}
cvat_backend_pod=$(kubectl get pods -l component=server -o 'jsonpath={.items[0].metadata.name}')
kubectl exec -it ${cvat_backend_pod} -- python manage.py exportcachecleanup
{{< /tab >}}
{{< tab header="Development" >}}
python manage.py exportcachecleanup
{{< /tab >}}
{{< /tabpane >}}
<!--lint disable no-undefined-references-->
{{< tabpane lang="shell" >}}
{{< tab header="Docker" >}}
docker exec -it cvat_server python manage.py exportcachecleanup
{{< /tab >}}
{{< tab header="Kubernetes" >}}
cvat_backend_pod=$(kubectl get pods -l component=server -o 'jsonpath={.items[0].metadata.name}')
kubectl exec -it ${cvat_backend_pod} -- python manage.py exportcachecleanup
{{< /tab >}}
{{< tab header="Development" >}}
python manage.py exportcachecleanup
{{< /tab >}}
{{< /tabpane >}}
<!--lint enable no-undefined-references-->

It looks like there is some bug in remarklint.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants