feat: Add generic SODAR ingest command #199

Merged 72 commits on Jan 9, 2024
Commits
95c76ab
feat: Common functions for interfacing with python-irodsclient
sellth Nov 7, 2023
6b72d9d
fix tests
sellth Nov 7, 2023
2636c8c
fix chksum test
sellth Nov 7, 2023
779d402
move all basic irods logic into new class iRODSCommon
sellth Nov 8, 2023
1f58f7e
adapt tests to new class
sellth Nov 8, 2023
da40f44
iRODSTransfer is now a child of iRODSCommon
sellth Nov 8, 2023
d8dbbbf
flake8
sellth Nov 8, 2023
064a1e6
custom irods_env_path support; pass along kwargs
sellth Nov 8, 2023
7837b2a
flake8
sellth Nov 8, 2023
32aaa5f
make internal variables private
sellth Nov 8, 2023
f36efff
cleanup
sellth Nov 8, 2023
523bc28
improved docstrings
sellth Nov 9, 2023
c62faf1
TransferJob computes file size during init
sellth Nov 17, 2023
f66cc42
flake8
sellth Nov 17, 2023
e2bda03
using attrs import instead of attr shadow
sellth Nov 17, 2023
9d480c9
show file counts in progress bar
sellth Nov 17, 2023
179aee5
add recursive put
sellth Nov 17, 2023
4b0e1d2
use proper context manager
sellth Nov 17, 2023
4bde914
add sync mode; check for existing files on remote
sellth Nov 23, 2023
538e9a8
do not clean up session too soon
sellth Nov 27, 2023
41c8e12
prepare TransferJob to also handle gets
sellth Nov 27, 2023
5df1d7e
add get method
sellth Nov 27, 2023
205c6e4
restructure session creation; check valid token every time
sellth Nov 28, 2023
27e54da
create irods sessions dynamically
sellth Nov 28, 2023
f1e5105
also check for timed out session token
sellth Nov 28, 2023
78e01a1
make session available as property
sellth Nov 28, 2023
e1625a8
linting
sellth Nov 28, 2023
1a58949
remove irods session context manager and multiple sessions
sellth Nov 28, 2023
d6f6300
improve logging during chksum
sellth Dec 8, 2023
cb5d10a
increase session timeout
sellth Dec 8, 2023
bab45a0
better error handling
sellth Dec 8, 2023
9daa75e
feat: generic ingest function based on irods pythonclient
sellth May 3, 2023
49c2c54
add -y option alias --yes
sellth May 10, 2023
33006a0
make flake8 happy
sellth May 10, 2023
1a5e0b4
extract iRODS functionality into separate file
sellth May 10, 2023
a6b8d5b
compute md5 always and generate temp file on upload
sellth May 10, 2023
70cb0d7
adjusted log levels
sellth May 11, 2023
4903537
add test for init_irods
sellth May 12, 2023
1fb4d4c
add test for get_irods_error
sellth May 12, 2023
a0ccf97
add test for transfer job builder
sellth May 12, 2023
06d0471
add iinit-like behaviour when asking for password
sellth May 17, 2023
b145301
test sodar ingest build file list
sellth May 19, 2023
ab7736f
removed print statements in favour of logger
sellth May 19, 2023
29ab738
make more clear that this is not about checksum mismatch
sellth May 19, 2023
ab0933f
re-add test for transfer job builder
sellth May 19, 2023
3c4004d
add smoke check test for sodar ingest
sellth May 23, 2023
64bf19b
remove unused code
sellth May 23, 2023
79e9b3f
isort
sellth May 23, 2023
721d69c
use more pathlib
sellth May 24, 2023
cb2dffd
more no-coverage regions
sellth May 24, 2023
d1a599b
check for missing API token
sellth May 25, 2023
9a6cb90
adjusted docstring
sellth May 25, 2023
db872b4
add documentation for sodar ingest command
sellth May 30, 2023
4e0ffee
check if file already exists in iRODS and skip
sellth May 31, 2023
2b24a8c
moved remote file check to wrapper
sellth May 31, 2023
48d33a5
Clean exit code when nothing to do.
sellth Jun 1, 2023
5650dd8
upgrade python-irodsclient version
sellth Jun 2, 2023
d3d0312
add support for exclude patterns
sellth Jul 14, 2023
31b6416
fix excludes
sellth Aug 2, 2023
35eea47
sort file list for output
sellth Aug 2, 2023
6b96f59
reworked hashsum logic
sellth Aug 2, 2023
040bf38
increased iRODS session timeout
sellth Aug 7, 2023
026d93a
don't re-create already existing collections
sellth Sep 26, 2023
7ee5eb4
re-write of command to use irods_common.py
sellth Dec 7, 2023
f179b49
fix infinite loop when only 1 collection is present
sellth Dec 7, 2023
4490dde
increase test coverage
sellth Dec 7, 2023
70908b7
accomodate final subcollection as target
sellth Dec 7, 2023
2fa772c
transfer jobs as tuple, not set
sellth Dec 8, 2023
e6e4308
better help text for --yes
sellth Dec 8, 2023
8ade629
Merge branch 'main' into ingest
sellth Jan 9, 2024
0e41791
describe more clearly what's happening
sellth Jan 9, 2024
8dde5b5
update docs
sellth Jan 9, 2024
5 changes: 5 additions & 0 deletions cubi_tk/sodar/__init__.py
@@ -37,6 +37,9 @@
    Upload external files to SODAR
    (defaults for fastq files).

``ingest``
    Upload arbitrary files to SODAR

``check-remote``
    Check if or which local files with md5 sums are already deposited in iRODs/Sodar

@@ -53,6 +56,7 @@
from .add_ped import setup_argparse as setup_argparse_add_ped
from .check_remote import setup_argparse as setup_argparse_check_remote
from .download_sheet import setup_argparse as setup_argparse_download_sheet
from .ingest import setup_argparse as setup_argparse_ingest
from .ingest_fastq import setup_argparse as setup_argparse_ingest_fastq
from .lz_create import setup_argparse as setup_argparse_lz_create
from .lz_list import setup_argparse as setup_argparse_lz_list
@@ -87,6 +91,7 @@ def setup_argparse(parser: argparse.ArgumentParser) -> None:
            "ingest-fastq", help="Upload external files to SODAR (defaults for fastq)"
        )
    )
    setup_argparse_ingest(subparsers.add_parser("ingest", help="Upload arbitrary files to SODAR"))
    setup_argparse_check_remote(
        subparsers.add_parser(
            "check-remote", help="Compare local files with md5 sum against SODAR/iRODS"
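
The hunk above registers the new sub-command by handing a sub-parser to a per-module `setup_argparse` function, which in turn stores the command's entry point as a hidden default. A minimal, self-contained sketch of that pattern follows. The names `setup_argparse_ingest_sketch`, `run_ingest`, `build_parser` and the program name `cubi-tk-sketch` are hypothetical stand-ins for illustration, not the project's actual functions.

```python
import argparse


def run_ingest(args: argparse.Namespace) -> int:
    """Hypothetical command body; the real command uploads files."""
    print(f"would ingest {args.sources} into {args.destination}")
    return 0


def setup_argparse_ingest_sketch(parser: argparse.ArgumentParser) -> None:
    """Hypothetical stand-in for a per-command ``setup_argparse`` function."""
    # The hidden option only exists to carry the callable as a default value.
    parser.add_argument(
        "--hidden-cmd", dest="sodar_cmd", default=run_ingest, help=argparse.SUPPRESS
    )
    parser.add_argument("sources", nargs="+", help="Files or directories to ingest.")
    parser.add_argument("destination", help="UUID or iRODS path of the landing zone.")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="cubi-tk-sketch")
    subparsers = parser.add_subparsers(dest="command")
    # Each sub-command gets its own sub-parser and configures it itself.
    setup_argparse_ingest_sketch(
        subparsers.add_parser("ingest", help="Upload arbitrary files to SODAR")
    )
    return parser


# The last positional is the destination; everything before it is a source.
args = build_parser().parse_args(["ingest", "a.txt", "b.txt", "lz-uuid"])
exit_code = args.sodar_cmd(args)
```

Storing the callable as an argument default keeps the dispatch code in the top-level parser generic: it only needs to call whatever `sodar_cmd` holds after parsing.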
297 changes: 297 additions & 0 deletions cubi_tk/sodar/ingest.py
@@ -0,0 +1,297 @@
"""``cubi-tk sodar ingest``: upload arbitrary files and folders into a specific SODAR landing zone collection"""

import argparse
import os
from pathlib import Path
import sys
import typing

import attrs
import logzero
from logzero import logger
from sodar_cli import api

from cubi_tk.irods_common import TransferJob, iRODSCommon, iRODSTransfer

from ..common import compute_md5_checksum, is_uuid, load_toml_config, sizeof_fmt

# for testing
logger.propagate = True

# no-frills logger
formatter = logzero.LogFormatter(fmt="%(message)s")
output_logger = logzero.setup_logger(formatter=formatter)


@attrs.frozen(auto_attribs=True)
class Config:
    """Configuration for the ingest command."""

    config: str = attrs.field(default=None)
    sodar_server_url: str = attrs.field(default=None)
    sodar_api_token: str = attrs.field(default=None, repr=lambda value: "***")  # type: ignore


class SodarIngest:
    """Implementation of sodar ingest command."""

    def __init__(self, args):
        # Command line arguments.
        self.args = args

        # Path to iRODS environment file
        self.irods_env_path = Path(Path.home(), ".irods", "irods_environment.json")
        if not self.irods_env_path.exists():
            logger.error("iRODS environment file is missing.")
            sys.exit(1)

        # Get SODAR API info
        toml_config = load_toml_config(Config())
        if toml_config:  # pragma: no cover
            config_url = toml_config.get("global", {}).get("sodar_server_url")
            if self.args.sodar_url == "https://sodar.bihealth.org/" and config_url:
                self.args.sodar_url = config_url
            if not self.args.sodar_api_token:
                self.args.sodar_api_token = toml_config.get("global", {}).get("sodar_api_token")
        if not self.args.sodar_api_token:
            logger.error("SODAR API token missing.")
            sys.exit(1)

    @classmethod
    def setup_argparse(cls, parser: argparse.ArgumentParser) -> None:
        parser.add_argument(
            "--hidden-cmd", dest="sodar_cmd", default=cls.run, help=argparse.SUPPRESS
        )
        group_sodar = parser.add_argument_group("SODAR-related")
        group_sodar.add_argument(
            "--sodar-url",
            default=os.environ.get("SODAR_URL", "https://sodar.bihealth.org/"),
            help="URL to SODAR, defaults to SODAR_URL environment variable or fallback to https://sodar.bihealth.org/",
        )
        group_sodar.add_argument(
            "--sodar-api-token",
            default=os.environ.get("SODAR_API_TOKEN", None),
            help="SODAR API token. Defaults to SODAR_API_TOKEN environment variable.",
        )
        parser.add_argument(
            "-r",
            "--recursive",
            default=False,
            action="store_true",
            help="Recursively match files in subdirectories. Creates iRODS sub-collections to match directory structure.",
        )
        parser.add_argument(
            "-s",
            "--sync",
            default=False,
            action="store_true",
            help="Skip upload of files already present in remote collection.",
        )
        parser.add_argument(
            "-e",
            "--exclude",
            nargs="+",
            default="",
            type=str,
            help="Exclude files by defining one or multiple glob-style patterns.",
        )
        parser.add_argument(
            "-K",
            "--remote-checksums",
            default=False,
            action="store_true",
            help="Trigger checksum computation on the iRODS side.",
        )
        parser.add_argument(
            "-y",
            "--yes",
            default=False,
            action="store_true",
            help="Don't ask for permission. Does not skip manual target collection selection.",
        )
        parser.add_argument(
            "--collection",
            type=str,
            help="Target iRODS collection. Skips manual target collection selection.",
        )
        parser.add_argument(
            "sources", help="One or multiple files/directories to ingest.", nargs="+"
        )
        parser.add_argument("destination", help="UUID or iRODS path of SODAR landing zone.")

    @classmethod
    def run(
        cls, args, _parser: argparse.ArgumentParser, _subparser: argparse.ArgumentParser
    ) -> typing.Optional[int]:
        """Entry point into the command."""
        return cls(args).execute()

    def execute(self):
        """Execute ingest."""
        # Retrieve iRODS path if destination is UUID
        if is_uuid(self.args.destination):
            try:
                lz_info = api.landingzone.retrieve(
                    sodar_url=self.args.sodar_url,
                    sodar_api_token=self.args.sodar_api_token,
                    landingzone_uuid=self.args.destination,
                )
            except Exception as e:  # pragma: no cover
                logger.error("Failed to retrieve landing zone information.")
                logger.exception(e)
                sys.exit(1)

            # TODO: Replace with status_locked check once implemented in sodar_cli
            if lz_info.status in ["ACTIVE", "FAILED"]:
                self.lz_irods_path = lz_info.irods_path
                logger.info(f"Target iRods path: {self.lz_irods_path}")
            else:
                logger.error("Target landing zone is not ACTIVE.")
                sys.exit(1)
        else:
            self.lz_irods_path = self.args.destination  # pragma: no cover

        # Build file list
        source_paths = self.build_file_list()
        if len(source_paths) == 0:
            logger.info("Nothing to do. Quitting.")
            sys.exit(0)

        # Initiate iRODS session
        irods_session = iRODSCommon().session

        # Query target collection
        logger.info("Querying landing zone collections…")
        collections = []
        try:
            with irods_session as i:
                coll = i.collections.get(self.lz_irods_path)
                for c in coll.subcollections:
                    collections.append(c.name)
        except Exception as e:  # pragma: no cover
            logger.error(
                f"Failed to query landing zone collections: {iRODSCommon().get_irods_error(e)}"
            )
            sys.exit(1)

        # Query user for target sub-collection
        if not collections:
            self.target_coll = self.lz_irods_path
            logger.info("No subcollections found. Moving on.")
        elif self.args.collection is None:
            user_input = ""
            input_valid = False
            input_message = "####################\nPlease choose target collection:\n"
            for index, item in enumerate(collections):
                input_message += f"{index+1}) {item}\n"
            input_message += "Select by number: "

            while not input_valid:
                user_input = input(input_message)
                if user_input.isdigit():
                    user_input = int(user_input)
                    if 0 < user_input <= len(collections):
                        input_valid = True

            self.target_coll = f"{self.lz_irods_path}/{collections[user_input - 1]}"

        elif self.args.collection in collections:
            self.target_coll = f"{self.lz_irods_path}/{self.args.collection}"
        else:  # pragma: no cover
            logger.error("Selected target collection does not exist in landing zone.")
            sys.exit(1)

        # Build transfer jobs and add missing md5 files
        jobs = self.build_jobs(source_paths)
        jobs = sorted(jobs, key=lambda x: x.path_local)

        # Final go from user & transfer
        itransfer = iRODSTransfer(jobs, ask=not self.args.yes)
        logger.info("Planning to transfer the following files:")
        for job in jobs:
            output_logger.info(job.path_local)
        logger.info(f"With a total size of {sizeof_fmt(itransfer.size)}")
        logger.info("Into this iRODS collection:")
        output_logger.info(f"{self.target_coll}/")

        if not self.args.yes:
            if not input("Is this OK? [y/N] ").lower().startswith("y"):  # pragma: no cover
                logger.info("Aborting at your request.")
                sys.exit(0)

        itransfer.put(recursive=self.args.recursive, sync=self.args.sync)
        logger.info("File transfer complete.")

        # Compute server-side checksums
        if self.args.remote_checksums:  # pragma: no cover
            logger.info("Computing server-side checksums.")
            itransfer.chksum()

    def build_file_list(self) -> typing.List[typing.Dict[str, Path]]:
        """
        Build list of source files to transfer.
        iRODS paths are relative to target collection.
        """

        source_paths = [Path(src) for src in self.args.sources]
        output_paths = list()

        for src in source_paths:
            try:
                abspath = src.resolve(strict=True)
            except FileNotFoundError:
                logger.warning(f"File not found: {src.name}")
                continue
            except RuntimeError:
                logger.warning(f"Symlink loop: {src.name}")
                continue

            excludes = self.args.exclude
            if src.is_dir():
                paths = abspath.glob("**/*" if self.args.recursive else "*")
                for p in paths:
                    if excludes and any([p.match(e) for e in excludes]):
                        continue
                    if p.is_file() and not p.suffix.lower() == ".md5":
                        output_paths.append({"spath": p, "ipath": p.relative_to(abspath)})
            else:
                if not any([src.match(e) for e in excludes if e]):
                    output_paths.append({"spath": src, "ipath": Path(src.name)})
        return output_paths

    def build_jobs(
        self, source_paths: typing.Iterable[typing.Dict[str, Path]]
    ) -> typing.Tuple[TransferJob, ...]:
        """Build file transfer jobs."""

        transfer_jobs = []

        for p in source_paths:
            path_remote = f"{self.target_coll}/{str(p['ipath'])}"
            md5_path = p["spath"].parent / (p["spath"].name + ".md5")

            if md5_path.exists():
                logger.info(f"Found md5 hash on disk for {p['spath']}")
            else:
                md5sum = compute_md5_checksum(p["spath"])
                with md5_path.open("w", encoding="utf-8") as f:
                    f.write(f"{md5sum} {p['spath'].name}")

            transfer_jobs.append(
                TransferJob(
                    path_local=str(p["spath"]),
                    path_remote=path_remote,
                )
            )

            transfer_jobs.append(
                TransferJob(
                    path_local=str(md5_path),
                    path_remote=path_remote + ".md5",
                )
            )

        return tuple(transfer_jobs)
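
`build_jobs` pairs every data file with an `.md5` sidecar file and creates the sidecar when it is missing. The following standalone sketch illustrates that idea using only the standard library. The helpers `md5_of_file` and `ensure_md5_sidecar` are hypothetical names (the PR itself uses `compute_md5_checksum` from `cubi_tk.common`), and the two-space separator follows the common `md5sum` output convention, which is an assumption here.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hypothetical helper: hash a file in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ensure_md5_sidecar(path: Path) -> Path:
    """Hypothetical helper: reuse an existing ``<name>.md5`` or write a new one."""
    md5_path = path.parent / (path.name + ".md5")
    if not md5_path.exists():
        # Assumed layout: "<hexdigest>  <file name>", as printed by md5sum.
        md5_path.write_text(f"{md5_of_file(path)}  {path.name}\n", encoding="utf-8")
    return md5_path


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "sample.txt"
    data.write_bytes(b"hello")
    sidecar = ensure_md5_sidecar(data)
    sidecar_name = sidecar.name
    sidecar_text = sidecar.read_text(encoding="utf-8")
```

Reusing an existing sidecar avoids re-hashing large files on repeated runs, at the cost of trusting that the file on disk still matches its recorded checksum.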


def setup_argparse(parser: argparse.ArgumentParser) -> None:
    """Setup argument parser for ``cubi-tk sodar ingest``."""
    return SodarIngest.setup_argparse(parser)
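
The file selection in `build_file_list` combines `Path.glob` (optionally recursive), glob-style exclude patterns matched with `Path.match`, and a rule that skips `.md5` files. A reduced sketch of that logic for a single directory, runnable on its own, might look as follows. The function name `list_files` and the sample file names are hypothetical, and the sorting is added here only to make the result deterministic.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List


def list_files(
    root: Path, recursive: bool, excludes: Iterable[str] = ()
) -> List[Dict[str, Path]]:
    """Hypothetical helper: collect files below ``root`` with paths relative to it."""
    root = root.resolve(strict=True)
    patterns = [e for e in excludes if e]
    found = []
    for p in sorted(root.glob("**/*" if recursive else "*")):
        # Relative patterns such as "*.log" are matched from the right.
        if any(p.match(e) for e in patterns):
            continue
        # Checksum sidecars are not listed as transfer sources themselves.
        if p.is_file() and p.suffix.lower() != ".md5":
            found.append({"spath": p, "ipath": p.relative_to(root)})
    return found


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "sub").mkdir()
    for name in ("a.txt", "a.txt.md5", "skip.log", "sub/b.txt"):
        (base / name).write_text("x")
    flat = [e["ipath"].as_posix() for e in list_files(base, recursive=False)]
    deep = [
        e["ipath"].as_posix()
        for e in list_files(base, recursive=True, excludes=["*.log"])
    ]
```

Keeping the path relative to the source directory (`ipath`) is what lets the remote side mirror the local directory layout as sub-collections.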
6 changes: 4 additions & 2 deletions docs_manual/index.rst
@@ -14,8 +14,9 @@ Manual

| :ref:`Creating ISA-tab files <man_isa_tpl>`
| :ref:`Annotating ISA-tab files <man_isa_tab>`
-| :ref:`Upload raw data to SODAR <man_ingest_fastq>`
-| :ref:`Upload raw data to SODAR <man_seasnap_itransfer_results>`
+| :ref:`Upload data to SODAR <man_sodar_ingest>`
+| :ref:`Upload fastq files to SODAR <man_ingest_fastq>`
+| :ref:`Upload results of the Seasnap pipeline to SODAR <man_seasnap_itransfer_results>`
| :ref:`Create a sample info file for Sea-snap <man_write_sample_info>`
| :ref:`Tools for archiving old projects <man_archive>`

@@ -51,6 +52,7 @@ Project Info

man_isa_tpl
man_isa_tab
man_sodar_ingest
man_ingest_fastq
man_itransfer_results
man_write_sample_info
2 changes: 1 addition & 1 deletion docs_manual/man_ingest_fastq.rst
@@ -1,7 +1,7 @@
.. _man_ingest_fastq:

===========================
-Manual for ``ingest-fastq``
+Manual for ``sodar ingest-fastq``
===========================

The ``cubi-tk sodar ingest-fastq`` command lets you upload raw data files to SODAR.