-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: SODAR ingest command based on irods-pythonclient #168
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good overall, a few comments inside the code.
Otherwise, I have some points that can probably be improved, though some of these are likely more of a personal wishlist than necessary changes:
- Given that this provides a new interface to irods, it might be worth to implement an abstract base class that allows easier implementation of specific ingest commands (like SodarIngestFastq)
- It looks to me like this can currently only ingest files for one collection / one sample? If so, then this will impact usability quite a bit.
- I think a general ingest function should be able to filter the files that will be uploaded to Sodar somehow, i.e. by having the option to exclude or include only certain filetypes.
- There are several useful looking flags in the abstract
SnappyItransferCommandBase
that are not implement here, some might be worth to add:
--num-parallel-transfer; not sure if this is specific to the way files are transferred?
--remote-dir-date & --remote-dir-pattern; addition of the upload date to the sub-collection path in irods/sodar is common practice and should be possible
--validate-and-move; this & other sodar related functionalty could maybe be bundeled in one or another base class / MixIn ?
action="store_true", | ||
help="Compute missing md5 checksums for source files.", | ||
) | ||
parser.add_argument( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should definitely be TRUE by default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I agree as this starts writing a bunch of files to your computer. I thought about creating temporary .md5 files and removing them during a cleanup step. Maybe that would be a better option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I placed this comment unclearly. I meant the following line here (validating the checksums in irods)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit hesitant to introduce too much "magic", in this case assuming that the user always wants to compute remote checksums right away. For large files this takes a significant amount of time. There's also an upcoming SODAR feature which would make this flag mostly irrelevant: bihealth/sodar-server/issues/1634
finally: | ||
irods_session.cleanup() | ||
|
||
if self.args.collection is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if one wants to upload to multiple or all available collections?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I'm for keeping it simple and leave this kind of functionality to outside scripts or other, more specialised commands.
Thanks Nicolai for going through my scribbles. Let me address some of your points right now and some more tomorrow once I'm back at the office and had time to think it through. Let me start with my guiding principles for this project:
I guess, it's the KISS principle.
Would make sense.
Only one target location, yes. Otherwise additional logic would be needed to determine which files should go where. Following the KISS principle this should probably be done by an outside script (using --collection and -y parameters) or a wrapper class. The original
File system globbing is already an option (i.e.
irods-pythonclient uses parallel transfer for PUTs of bigger files. Overall transfer speed is 40-60 MB/s, so not too shabby.
I would argue this should be done to the local file structure and not by the ingest command. At least not a generic one. KISS principle.
This is already implemented as |
5f16614
to
321388a
Compare
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #168 +/- ##
==========================================
+ Coverage 75.86% 76.46% +0.59%
==========================================
Files 94 102 +8
Lines 7648 8233 +585
==========================================
+ Hits 5802 6295 +493
- Misses 1846 1938 +92
☔ View full report in Codecov by Sentry. |
a62fd19
to
d0c602e
Compare
- now checks for .md5 file on disk - loads hash if found - computes hash if not found and stores on disk - checks iRODS for existing file using hash
This adds a generic
sodar ingest
command which tries to mimic the base functionality ofirsync
, boosted by the power of using SODAR's API. Features include:iinit
session file if present, asks for password otherwisePlease try it out! Would appreciate lots of feedback.
I made this while learning Python, so please be patient. :-)