Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[24.0] Script for deleting userless histories from database #18058

Closed

Conversation

jdavcs
Copy link
Member

@jdavcs jdavcs commented Apr 25, 2024

Ref #17725

This is a draft. The script needs to get db_url from config; batch_size and max_create_time to be passed as arguments.

Included: script and test (with infrastructure, partially borrowed from not-yet-merged #17662)

I've tested this manually + using the included test, on a postgresql database. I've also (partially) verified that the algorithm should be able to handle the data volume in main's database (including the history_dataset_association table, which was the main offender).

Here's what the test does:

  1. Initializes a new database, loads histories + all possible associated records
  2. Verifies all data counts
  3. Runs the script
  4. Verifies the new counts: histories + associations deleted + histories and associations that were not deleted.
  5. All data is deleted after the test (the db stays initialized), so the test can be rerun on the same database.

I've created a new directory under scripts for this and similar scripts.

What this script does and why

My first attempts, as discussed in #17725 were SQL-only. However, based on my testing, I think the size of the tables make that approach infeasible. The main issue is selecting the history ids to delete excluding the ones that are referred to from tables whose records we shouldn't be deleting (such as job, hda, etc.): that can be done via multiple clauses like where id not in (select history_id from another-table). Given the table size, this won't work.

But the operations are trivial - there's no need for any joins. Also, since this deals with records that are not used, we don't need locking (i.e., old histories only + we mark them as deleted and purged BEFORE selecting the set to delete). Therefore, we don't need the database for anything other than providing us with several (very large) lists of integers that we can then manipulate in memory as sets, which makes it fast enough to be doable.

The algorithm:

  1. Get min/max ids of histories to be deleted (we use this criteria: user is null, hid is 1, create_date is below max-cutoff)
  2. Mark histories (based on selection criteria outlined above) as deleted and purged to prevent their further usage.
  3. Retrieve ids of selected histories
  4. Retrieve ids of histories referred to from tables whose records we can't delete (e.g. job, hda, etc.)
  5. Take the set difference to get the histories to delete.
  6. Create a temporary table and load it with history ids to delete. (using copy instead of batch insert would be faster, but that requires elevated permissions, so I skipped it; insert should be sufficient)
  7. Delete records from association tables referring to histories in the tmp table
  8. Set history_id to null in galaxy_session records that refer to histories in the tmp table (we can't delete a record due to complex referential integrity, but we can set it to null).
  9. Delete histories based on ids from the tmp table
  10. Drop tmp table.

We do the above in batches with the size of a batch configurable. The reason for this is the hda table: a simple select history_id from history_dataset_association appears to be problematic (my process was repeatedly killed after an hour). If there is some postgresql or unix magic that can help with this, we don't need batches; but batches are simple and work fine.

Comments and harsh critique certainly most welcome!

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@jdavcs jdavcs force-pushed the 24.0_userless_histories branch from 0d68929 to 1555014 Compare April 25, 2024 18:42
Copy link
Member

@mvdbeek mvdbeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but maybe move this to lib/galaxy/model/ and create a normal entrypoint so the script becomes installable ? Then you can drop the shell script and the manual entrypoint, and it's one step less to create a celery task if we choose to do that.

@jdavcs
Copy link
Member Author

jdavcs commented Apr 29, 2024

Looks good, but maybe move this to lib/galaxy/model/ and create a normal entrypoint so the script becomes installable ? Then you can drop the shell script and the manual entrypoint, and it's one step less to create a celery task if we choose to do that.

You mean like the scripts in the data package (e.g. build_objects, load-objects, manage_db)? I can do that.

What's the criteria we use to decide when a script goes into the scripts directory, and when it goes into lib (and is made installable)?

@mvdbeek
Copy link
Member

mvdbeek commented Apr 29, 2024

I don't think we have one. If I had to create one now I would say things that are using lots of galaxy internals should probably go into the package they primarily deal with ? Further consideration would be if this might be used in an installed galaxy context, where the scripts aren't usually available.

@jdavcs
Copy link
Member Author

jdavcs commented Apr 29, 2024

I don't think we have one. If I had to create one now I would say things that are using lots of galaxy internals should probably go into the package they primarily deal with ? Further consideration would be if this might be used in an installed galaxy context, where the scripts aren't usually available.

Thanks! This makes sense.
Having them under lib also helps with testing: it is straightforward, unlike testing code in scripts.

@jdavcs
Copy link
Member Author

jdavcs commented May 1, 2024

Superseded by #18079 (that one targets the dev branch)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants