Create shelves files for pds4 #57

Draft
wants to merge 42 commits into
base: main

juzen2003
Collaborator

@juzen2003 juzen2003 commented Sep 23, 2024

Current status of creating shelves files for pds4:

  • Create files in checksums-* directory (pds4checksums.py)

    • Modification made:
      • update BUNDLENAME_REGEX
    • Example command:
      • python holdings_maintenance/pds4/pds4checksums.py --init /Volumes/rms-holdings/pds4-holdings/bundles/uranus_occs_earthbased
      • python holdings_maintenance/pds4/pds4checksums.py --init /Volumes/rms-holdings/pds4-holdings/metadata/uranus_occs_earthbased
      • python holdings_maintenance/pds4/pds4checksums.py --init /Volumes/rms-holdings/pds4-holdings/diagrams/uranus_occs_earthbased
    • Pending items:
      • Wait for the finalized directory structure for Cassini
  • Create files in _infoshelf-* directory (pds4infoshelf.py); the corresponding checksums files from the step above are required

    • Modification made:
      • properly import pds4checksums
    • Example command:
      • python holdings_maintenance/pds4/pds4infoshelf.py --init /Volumes/rms-holdings/pds4-holdings/bundles/uranus_occs_earthbased
      • python holdings_maintenance/pds4/pds4infoshelf.py --init /Volumes/rms-holdings/pds4-holdings/metadata/uranus_occs_earthbased
      • python holdings_maintenance/pds4/pds4infoshelf.py --init /Volumes/rms-holdings/pds4-holdings/diagrams/uranus_occs_earthbased
  • Create files in _indexshelf-metadata (pds4indexshelf.py)

    • Modification made:
      • Move BUNDLENAME_REGEX into the Pds3File & Pds4File classes, since the pattern differs between pds3 & pds4
      • Add IDX_EXT and LBL_EXT to Pds3File & Pds4File to replace the hard-coded '.tab' & '.lbl' in pdsfile.py
        • pds4 label extension is .xml and idx extension is .csv
        • pds3 label extension is .lbl and idx extension is .tab
    • Pending items:
      • Wait for label files (.xml) in metadata
  • Create files in _linkshelf-* directory (pds4linkshelf.py)

    • Modification made:
      • Remove .TXT from EXTS_WO_LABELS, since a .TXT file can have a label in pds4
      • Add logic to link a file to its corresponding label if the file is listed in that label's file_name tags.
      • Add logic to identify files such as errata.txt or checksum files that appear neither in a label nor in the csv. They are not part of the archive, so they don't have labels.
    • Example command:
      • python holdings_maintenance/pds4/pds4linkshelf.py --init /Volumes/rms-holdings/pds4-holdings/bundles/uranus_occs_earthbased
    • Pending items:
      • To create _linkshelf-metadata, wait for label files (.xml) in metadata
      • Wait for the finalized directory structure for Cassini
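The label-linking step listed under pds4linkshelf.py above can be illustrated with a minimal sketch. It is not the code in pds4linkshelf.py: the function names, the dictionary shapes, and the sample label are assumptions made for illustration. It reads only file_name elements, so a file name that also appears in the title element is not picked up twice.

```python
import xml.etree.ElementTree as ET


def file_names_in_label(label_xml):
    """Return the text of every <file_name> element in a PDS4 label string."""
    root = ET.fromstring(label_xml)
    names = []
    for elem in root.iter():
        # Drop any XML namespace prefix: '{uri}file_name' -> 'file_name'
        local_tag = elem.tag.rsplit('}', 1)[-1]
        if local_tag == 'file_name' and elem.text:
            names.append(elem.text.strip())
    return names


def link_files_to_labels(labels):
    """Map each listed file basename to the label basename that lists it.

    `labels` is a hypothetical dict of {label basename: label XML text}.
    """
    links = {}
    for label_name, label_xml in labels.items():
        for fname in file_names_in_label(label_xml):
            # Keep the first label that claims a file
            links.setdefault(fname, label_name)
    return links


# Hypothetical label; the title mentions the data file but must not be matched.
SAMPLE_LABEL = """<?xml version="1.0"?>
<Product_Observational xmlns="http://pds.nasa.gov/pds4/pds/v1">
  <Identification_Area>
    <title>Occultation data in example_data.tab</title>
  </Identification_Area>
  <File_Area_Observational>
    <File>
      <file_name>example_data.tab</file_name>
    </File>
  </File_Area_Observational>
</Product_Observational>
"""
```

Matching on the element's local tag name, rather than searching the label text with a regular expression, is one way to keep title text out of the results.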

Pending items:

  • Work on rules for pds4 archive (work in progress)
  • Once metadata labels are added, update pds4indexshelf.py to create _indexshelf-metadata for pds4
  • Once metadata labels are added, update pds4linkshelf.py to create _linkshelf-metadata for pds4
  • Test pds4 with use_shelves_only set to True
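For the IDX_EXT and LBL_EXT change listed under pds4indexshelf.py above, a minimal sketch of the idea follows. The class names carry a "Sketch" suffix because they are stand-ins, not the real Pds3File and Pds4File, and label_for_index is a hypothetical helper. Only the extension values come from the list above.

```python
class PdsFileSketch:
    """Hypothetical stand-in for the shared PdsFile logic."""

    IDX_EXT = None   # index table extension, set by each subclass
    LBL_EXT = None   # label extension, set by each subclass

    @classmethod
    def label_for_index(cls, index_basename):
        """Return the label basename that goes with an index file basename."""
        if not index_basename.endswith(cls.IDX_EXT):
            raise ValueError(f'{index_basename} does not end with {cls.IDX_EXT}')
        return index_basename[:-len(cls.IDX_EXT)] + cls.LBL_EXT


class Pds3FileSketch(PdsFileSketch):
    IDX_EXT = '.tab'
    LBL_EXT = '.lbl'


class Pds4FileSketch(PdsFileSketch):
    IDX_EXT = '.csv'
    LBL_EXT = '.xml'
```

Shared code can then refer to cls.IDX_EXT and cls.LBL_EXT instead of the literal '.tab' and '.lbl' strings.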

Note:

  • These directories are generated using full holdings and uploaded to Dropbox
  • We no longer bypass any directories; ring_models and _support are included when running the scripts.
Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/cassini_iss
Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/cassini_uvis_solarocc_beckerjarmak2023
Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/cassini_vims
Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/checksums-diagrams/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/checksums-metadata/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/cassini_iss
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/cassini_uvis_solarocc_beckerjarmak2023
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/cassini_vims
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-diagrams/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-metadata/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/cassini_iss
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/cassini_uvis_solarocc_beckerjarmak2023
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/cassini_vims
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/uranus_occs_earthbased

@juzen2003 juzen2003 marked this pull request as draft September 23, 2024 22:53
@rfrenchseti
Collaborator

The --volume argument really is supposed to be just volume so that you put the positional argument on the command line without specifying any flag in front of it.
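As an illustration of the positional form, here is a minimal argparse sketch. The --init flag appears in the example commands in this thread; the help strings, nargs='+', and the sample path are assumptions, and the real scripts likely define more options.

```python
import argparse


def build_parser():
    """Build a parser whose volume paths are positional, not behind --volume."""
    parser = argparse.ArgumentParser(
        description='Sketch of a holdings maintenance command line.')
    parser.add_argument('--init', action='store_true',
                        help='create the files from scratch (assumed meaning)')
    # No leading dashes in the name, so argparse treats it as positional
    parser.add_argument('volume', nargs='+',
                        help='one or more paths to bundle directories')
    return parser


# Hypothetical invocation mirroring the example commands in this thread
args = build_parser().parse_args(
    ['--init', '/path/to/pds4-holdings/bundles/uranus_occs_earthbased'])
```

With this definition the path is given directly after the script name or after --init, with no flag in front of it.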

@rfrenchseti
Collaborator

For metadata, you're just talking about the index files in uranus_occs_earthbased, right? Those are temporary files until Emilie's PDS4 index file generator is finished, at which point we will replace them with new files that also include the XML labels. There should never be files in the PDS4 archive that don't have associated labels.

@rfrenchseti
Collaborator

I don't see any reason to skip ring_models. We should generate link files for all directories. The REPAIRS and KNOWN_MISSING_LABELS variables define places where the PDS3 labels were badly written and refer to the wrong files, directories, have spelling errors, etc. We hope that this will basically never happen in PDS4 labels. In any case, these variables are PDS3-volume-specific, and should be empty when starting from scratch with PDS4 until we see a reason to put something in them.

@rfrenchseti
Collaborator

Before you get too far into pds4archives, we need to talk about how that's going to work. The PDS3 concept of archiving each volume separately does not work with PDS4 bundles, because, for example, the entire Cassini ISS Saturn mission is a single bundle (currently 116 PDS3 volumes), which would result in an enormous and impossible-to-use single archive file around a terabyte. This would be a good thing to discuss with the group at large. Of course how we split things into archives will also affect how we make checksum files for those archives.

@rfrenchseti
Collaborator

Thinking more about this, we also can't ignore the _support directories. Remember that PdsFile has two purposes - OPUS and Viewmaster. For OPUS, we ignore all sorts of things because they aren't data products. But Viewmaster has to be able to handle every single file in the holdings/pds4-holdings directories, no matter what. Random documentation, text files, labels, data files, whatever shows up. Because the user will be browsing the entire holdings while using Viewmaster, and every single file needs to show up in Viewmaster with size, data, type, and maybe some intelligent commentary. So everything has to be in the shelve files. Everything has to be in the archives. Everything has to be in the checksums. Everything (relevant) has to be in the links. And, equally important, we need lots of rules in Pds4File for the PDS4 bundles that we don't currently have because OPUS doesn't need them - associations between file types, etc.
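A minimal sketch of what "everything has to be in the shelve files" implies for a directory walk, assuming a hypothetical helper named collect_file_info. It is not taken from the maintenance scripts. The point is only that no directory, including _support, is filtered out.

```python
import os


def collect_file_info(root_dir):
    """Return {relative path: (size in bytes, modification time)} for every file.

    Nothing is skipped: _support, documentation, labels, and data files all
    end up in the result.
    """
    info = {}
    for dirpath, dirnames, filenames in os.walk(root_dir):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            abspath = os.path.join(dirpath, name)
            relpath = os.path.relpath(abspath, root_dir)
            stat = os.stat(abspath)
            info[relpath] = (stat.st_size, stat.st_mtime)
    return info
```

Any OPUS-specific filtering would then be applied by the consumer of this listing rather than while building it.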

checksums, infoshelf, and linkshelf files.
file is in that label's file_name tags. (line 323-336, pds4linkshelf.py)
files that don't exist in the label nor exist in the csv. They are not
part of the archive, so they don't have labels. (line 369-378 in
pds4linkshelf.py)
@juzen2003
Collaborator Author

Updated the latest status in the top comment (10/12/24)

Current pending items:

  • Fix volset_ and volname_ backward compatibility for pds3 (work in progress)
  • Work on rules for pds4 archive (work in progress)
  • Once the Cassini directory structure is finalized, update and run the scripts.
  • Once metadata labels are added, update pds4indexshelf.py to create _indexshelf-metadata for pds4
  • Once metadata labels are added, update pds4linkshelf.py to create _linkshelf-metadata for pds4
  • Test pds4 with use_shelves_only set to True

@juzen2003
Collaborator Author

The --volume argument really is supposed to be just volume so that you put the positional argument on the command line without specifying any flag in front of it.

Fixed

@juzen2003
Collaborator Author

I don't see any reason to skip ring_models. We should generate link files for all directories. The REPAIRS and KNOWN_MISSING_LABELS variables define places where the PDS3 labels were badly written and refer to the wrong files, directories, have spelling errors, etc. We hope that this will basically never happen in PDS4 labels. In any case, these variables are PDS3-volume-specific, and should be empty when starting from scratch with PDS4 until we see a reason to put something in them.

Fixed

@juzen2003
Collaborator Author

Thinking more about this, we also can't ignore the _support directories. Remember that PdsFile has two purposes - OPUS and Viewmaster. For OPUS, we ignore all sorts of things because they aren't data products. But Viewmaster has to be able to handle every single file in the holdings/pds4-holdings directories, no matter what. Random documentation, text files, labels, data files, whatever shows up. Because the user will be browsing the entire holdings while using Viewmaster, and every single file needs to show up in Viewmaster with size, data, type, and maybe some intelligent commentary. So everything has to be in the shelve files. Everything has to be in the archives. Everything has to be in the checksums. Everything (relevant) has to be in the links. And, equally important, we need lots of rules in Pds4File for the PDS4 bundles that we don't currently have because OPUS doesn't need them - associations between file types, etc.

Fixed

in the file_name tags and avoid capturing the file name in the title tag
of the label. This will prevent us from getting duplicated file name of
the LinkInfo object when the file name exists in the title tag.
- moving the intelligence to check if a file is in the file_name tag
  of a label. Now this step is done after checking whether the file
  is in the label_dict already.
- moving the intelligence to check if a file is in the collection csv
  files. Now this step is done right before raising an error when we
  can't find its corresponding label.
These two modifications can avoid unnecessary looping of linkinfo_dict
and collection_basename_dict.
when trying to parse each entry to get the basename of a file in the
archive.
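The check ordering described in these commit notes can be sketched as follows. Only the names label_dict, linkinfo_dict, and collection_basename_dict come from the notes; the shape of each dictionary, the find_label function, and the choice of LookupError are assumptions made for illustration.

```python
def find_label(basename, label_dict, linkinfo_dict, collection_basename_dict):
    """Return the label for `basename`, or None if the file is not archived.

    Assumed shapes (hypothetical):
      label_dict:               {file basename: label basename}
      linkinfo_dict:            {label basename: list of file_name tag values}
      collection_basename_dict: {file basename: collection csv that lists it}
    """
    # 1. Cheapest check first: a label is already known for this file.
    if basename in label_dict:
        return label_dict[basename]

    # 2. Only then scan the file_name tags gathered from each label.
    for label_name, linked_names in linkinfo_dict.items():
        if basename in linked_names:
            return label_name

    # 3. Last, just before failing, consult the collection csv contents.
    #    A file in neither a label nor a csv (errata.txt, checksum files)
    #    is not part of the archive and has no label.
    if basename not in collection_basename_dict:
        return None

    raise LookupError(f'no label found for archived file {basename}')
```

Files already present in label_dict return at step 1, so the two loops over the larger dictionaries run only for the remaining files.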
@juzen2003
Collaborator Author

Latest status update; the top comment has also been updated (10/22/24)

  • Update maintenance tools under holdings_maintenance/pds4 to generate checksums, infoshelf, and linkshelf for PDS4 bundles
  • These newly generated shelf files are uploaded to Dropbox
Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/cassini_iss
Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/cassini_uvis_solarocc_beckerjarmak2023
Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/cassini_vims
Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/checksums-diagrams/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/checksums-metadata/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/cassini_iss
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/cassini_uvis_solarocc_beckerjarmak2023
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/cassini_vims
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-diagrams/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-metadata/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/cassini_iss
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/cassini_uvis_solarocc_beckerjarmak2023
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/cassini_vims
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/uranus_occs_earthbased
