[24.0] Script for deleting userless histories from database #18058

jdavcs · 2024-04-25T18:24:38Z

This is a draft. The script needs to get db_url from config; batch_size and max_create_time to be passed as arguments.

Included: script and test (with infrastructure, partially borrowed from not-yet-merged #17662)

I've tested this manually + using the included test, on a postgresql database. I've also (partially) verified that the algorithm should be able to handle the data volume in main's database (including the history_dataset_association table, which was the main offender).

Here's what the test does:

Initializes a new database, loads histories + all possible associated records
Verifies all data counts
Runs the script
Verifies the new counts: histories + associations deleted + histories and associations that were not deleted.
All data is deleted after the test (the db stays initialized), so the test can be rerun on the same database.

I've created a new directory under scripts for this and similar scripts.

What this script does and why

My first attempts, as discussed in #17725 were SQL-only. However, based on my testing, I think the size of the tables make that approach infeasible. The main issue is selecting the history ids to delete excluding the ones that are referred to from tables whose records we shouldn't be deleting (such as job, hda, etc.): that can be done via multiple clauses like where id not in (select history_id from another-table). Given the table size, this won't work.

But the operations are trivial - there's no need for any joins. Also, since this deals with records that are not used, we don't need locking (i.e., old histories only + we mark them as deleted and purged BEFORE selecting the set to delete). Therefore, we don't need the database for anything other than providing us with several (very large) lists of integers that we can then manipulate in memory as sets, which makes it fast enough to be doable.

The algorithm:

Get min/max ids of histories to be deleted (we use this criteria: user is null, hid is 1, create_date is below max-cutoff)
Mark histories (based on selection criteria outlined above) as deleted and purged to prevent their further usage.
Retrieve ids of selected histories
Retrieve ids of histories referred to from tables whose records we can't delete (e.g. job, hda, etc.)
Take the set difference to get the histories to delete.
Create a temporary table and load it with history ids to delete. (using copy instead of batch insert would be faster, but that requires elevated permissions, so I skipped it; insert should be sufficient)
Delete records from association tables referring to histories in the tmp table
Set history_id to null in galaxy_session records that refer to histories in the tmp table (we can't delete a record due to complex referential integrity, but we can set it to null).
Delete histories based on ids from the tmp table
Drop tmp table.

We do the above in batches with the size of a batch configurable. The reason for this is the hda table: a simple select history_id from history_dataset_association appears to be problematic (my process was repeatedly killed after an hour). If there is some postgresql or unix magic that can help with this, we don't need batches; but batches are simple and work fine.

Comments and harsh critique certainly most welcome!

How to test the changes?

(Select all options that apply)

I've included appropriate automated tests.
This is a refactoring of components with existing test coverage.
Instructions for manual testing are as follows:
1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

mvdbeek

Looks good, but maybe move this to lib/galaxy/model/ and create a normal entrypoint so the script becomes installable ? Then you can drop the shell script and the manual entrypoint, and it's one step less to create a celery task if we choose to do that.

jdavcs · 2024-04-29T14:57:52Z

Looks good, but maybe move this to lib/galaxy/model/ and create a normal entrypoint so the script becomes installable ? Then you can drop the shell script and the manual entrypoint, and it's one step less to create a celery task if we choose to do that.

You mean like the scripts in the data package (e.g. build_objects, load-objects, manage_db)? I can do that.

What's the criteria we use to decide when a script goes into the scripts directory, and when it goes into lib (and is made installable)?

mvdbeek · 2024-04-29T15:32:47Z

I don't think we have one. If I had to create one now I would say things that are using lots of galaxy internals should probably go into the package they primarily deal with ? Further consideration would be if this might be used in an installed galaxy context, where the scripts aren't usually available.

jdavcs · 2024-04-29T15:42:32Z

I don't think we have one. If I had to create one now I would say things that are using lots of galaxy internals should probably go into the package they primarily deal with ? Further consideration would be if this might be used in an installed galaxy context, where the scripts aren't usually available.

Thanks! This makes sense.
Having them under lib also helps with testing: it is straightforward, unlike testing code in scripts.

jdavcs · 2024-05-01T22:20:01Z

Superseded by #18079 (that one targets the dev branch)

jdavcs added 2 commits April 25, 2024 13:15

Initial pass at script + test to clean userless/hid=1 histories

8bde367

Optimize history selection

2418678

jdavcs added kind/bug kind/feature area/admin area/database Galaxy's database or data access layer area/scripts labels Apr 25, 2024

jdavcs requested review from natefoo, hexylena and mvdbeek April 25, 2024 18:24

Fix linting

1555014

jdavcs force-pushed the 24.0_userless_histories branch from 0d68929 to 1555014 Compare April 25, 2024 18:42

mvdbeek reviewed Apr 29, 2024

View reviewed changes

jdavcs mentioned this pull request May 1, 2024

Script for deleting userless histories from database + testing + drop unused model testing code #18079

Merged

5 tasks

jdavcs closed this May 1, 2024

jdavcs mentioned this pull request May 29, 2024

Script for hard removal of unused histories #17725

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[24.0] Script for deleting userless histories from database #18058

[24.0] Script for deleting userless histories from database #18058

jdavcs commented Apr 25, 2024

mvdbeek left a comment

jdavcs commented Apr 29, 2024

mvdbeek commented Apr 29, 2024

jdavcs commented Apr 29, 2024

jdavcs commented May 1, 2024

[24.0] Script for deleting userless histories from database #18058

[24.0] Script for deleting userless histories from database #18058

Conversation

jdavcs commented Apr 25, 2024

What this script does and why

The algorithm:

How to test the changes?

License

mvdbeek left a comment

Choose a reason for hiding this comment

jdavcs commented Apr 29, 2024

mvdbeek commented Apr 29, 2024

jdavcs commented Apr 29, 2024

jdavcs commented May 1, 2024