Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: SODAR ingest command based on irods-pythonclient #168

Closed
wants to merge 32 commits into from

Conversation

sellth
Copy link
Contributor

@sellth sellth commented May 8, 2023

This adds a generic sodar ingest command which tries to mimic the base functionality of irsync, boosted by the power of using SODAR's API. Features include:

  • no dependency on icommands; uses iinit session file if present, asks for password otherwise
  • multiple source locations, single target collection
  • recursive file matching
  • creation of matched sub-folders in iRODS
  • destination either as iRODS path or SODAR LZ UUID
  • uses SODAR API for UUID destination & for querying LZ collections
  • computes local checksums if missing
  • triggers remote iRODS checksum computation
  • scriptable; all user input can be skipped
  • has a cool progress bar

Please try it out! Would appreciate lots of feedback.

I made this while learning Python, so please be patient. :-)

Copy link
Contributor

@Nicolai-vKuegelgen Nicolai-vKuegelgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good overall, a few comments inside the code.
Otherwise, I have some points that can probably be improved, though some of these are likely more of a personal wishlist than necessary changes:

  • Given that this provides a new interface to irods, it might be worth to implement an abstract base class that allows easier implementation of specific ingest commands (like SodarIngestFastq)
  • It looks to me like this can currently only ingest files for one collection / one sample? If so, then this will impact usability quite a bit.
  • I think a general ingest function should be able to filter the files that will be uploaded to Sodar somehow, i.e. by having the option to exclude or include only certain filetypes.
  • There are several useful looking flags in the abstract SnappyItransferCommandBase that are not implement here, some might be worth to add:
    --num-parallel-transfer; not sure if this is specific to the way files are transferred?
    --remote-dir-date & --remote-dir-pattern; addition of the upload date to the sub-collection path in irods/sodar is common practice and should be possible
    --validate-and-move; this & other sodar related functionalty could maybe be bundeled in one or another base class / MixIn ?

cubi_tk/sodar/ingest.py Show resolved Hide resolved
cubi_tk/sodar/ingest.py Show resolved Hide resolved
action="store_true",
help="Compute missing md5 checksums for source files.",
)
parser.add_argument(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should definitely be TRUE by default

Copy link
Contributor Author

@sellth sellth May 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I agree as this starts writing a bunch of files to your computer. I thought about creating temporary .md5 files and removing them during a cleanup step. Maybe that would be a better option.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I placed this comment unclearly. I meant the following line here (validating the checksums in irods)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit hesitant to introduce too much "magic", in this case assuming that the user always wants to compute remote checksums right away. For large files this takes a significant amount of time. There's also an upcoming SODAR feature which would make this flag mostly irrelevant: bihealth/sodar-server/issues/1634

cubi_tk/sodar/ingest.py Show resolved Hide resolved
cubi_tk/sodar/ingest.py Show resolved Hide resolved
finally:
irods_session.cleanup()

if self.args.collection is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if one wants to upload to multiple or all available collections?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I'm for keeping it simple and leave this kind of functionality to outside scripts or other, more specialised commands.

cubi_tk/sodar/ingest.py Outdated Show resolved Hide resolved
cubi_tk/sodar/ingest.py Outdated Show resolved Hide resolved
cubi_tk/sodar/ingest.py Outdated Show resolved Hide resolved
@sellth
Copy link
Contributor Author

sellth commented May 9, 2023

Thanks Nicolai for going through my scribbles. Let me address some of your points right now and some more tomorrow once I'm back at the office and had time to think it through.

Let me start with my guiding principles for this project:

  • keep it simple feature wise
  • make it agnostic to the special requirements of different assay types
  • remove outside dependencies (i. e. icommands, md5sum)

I guess, it's the KISS principle.

* Given that this provides a new interface to irods, it might be worth to implement an abstract base class that allows easier implementation of specific ingest commands (like SodarIngestFastq)

Would make sense.

* It looks to me like this can currently only ingest files for one collection / one sample? If so, then this will impact usability quite a bit.

Only one target location, yes. Otherwise additional logic would be needed to determine which files should go where. Following the KISS principle this should probably be done by an outside script (using --collection and -y parameters) or a wrapper class. The original rsync also only does one destination.

I think a general ingest function should be able to filter the files that will be uploaded to Sodar somehow, i.e. by having the option to exclude or include only certain filetypes.

File system globbing is already an option (i.e. source/*.csv), but I see some added value by dedicated --exclude and --include flags.

  --num-parallel-transfer; not sure if this is specific to the way files are transferred?

irods-pythonclient uses parallel transfer for PUTs of bigger files. Overall transfer speed is 40-60 MB/s, so not too shabby.

  --remote-dir-date & --remote-dir-pattern; addition of the upload date to the sub-collection path in irods/sodar is common practice and should be possible

I would argue this should be done to the local file structure and not by the ingest command. At least not a generic one. KISS principle.

  --validate-and-move; this & other sodar related functionalty could maybe be bundeled in one or another base class / MixIn ?

This is already implemented as cubi-tk sodar lz_move.

@sellth sellth force-pushed the ingest branch 4 times, most recently from 5f16614 to 321388a Compare May 10, 2023 11:10
@codecov
Copy link

codecov bot commented May 10, 2023

Codecov Report

Attention: 33 lines in your changes are missing coverage. Please review.

Comparison is base (79e85bf) 75.86% compared to head (3522371) 76.46%.
Report is 15 commits behind head on main.

❗ Current head 3522371 differs from pull request most recent head 96a93e7. Consider uploading reports for the commit 96a93e7 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #168      +/-   ##
==========================================
+ Coverage   75.86%   76.46%   +0.59%     
==========================================
  Files          94      102       +8     
  Lines        7648     8233     +585     
==========================================
+ Hits         5802     6295     +493     
- Misses       1846     1938      +92     
Files Coverage Δ
cubi_tk/sodar/__init__.py 100.00% <100.00%> (ø)
cubi_tk/irods_utils.py 98.82% <98.82%> (ø)
tests/test_irods_utils.py 98.43% <98.43%> (ø)
tests/test_sodar_ingest.py 99.14% <99.14%> (ø)
cubi_tk/sodar/ingest.py 80.89% <80.89%> (ø)

... and 17 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@sellth sellth added the enhancement New feature or request label May 11, 2023
@sellth sellth force-pushed the ingest branch 2 times, most recently from a62fd19 to d0c602e Compare May 17, 2023 18:18
@sellth sellth marked this pull request as ready for review May 19, 2023 13:07
@sellth sellth requested a review from mikkonie May 24, 2023 13:29
@sellth sellth marked this pull request as draft August 1, 2023 12:48
- now checks for .md5 file on disk
- loads hash if found
- computes hash if not found and stores on disk
- checks iRODS for existing file using hash
@sellth sellth closed this Nov 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants