Storage service: refactor out the use of lsar in get_base_directory() #992

Open
5 tasks
alexwlchan opened this issue Nov 14, 2019 · 0 comments
Labels
Needs sponsorship; Status: refining (the issue needs additional details to ensure that requirements are clear)

Please describe the problem you'd like to be solved
Currently, the storage service uses lsar to find the base directory of a compressed AIP. The code is here: https://github.com/artefactual/archivematica-storage-service/blob/83d7b5a7da79c158cb99bd0e3426b92fdde0d3f0/storage_service/locations/models/package.py#L343-L361

This code is potentially inefficient: its memory use grows with the size of the AIP, both in the output from lsar and in the directories list. It might struggle on a very large AIP.

Not a bug per se, but potentially room for improvement.

Describe the solution you'd like to see implemented
We know an AIP will only be compressed in a handful of formats (because Archivematica created them!). Use the Python standard library to open the file directly, and iterate over the members, rather than loading it all into a big JSON string.

Pseudo-code:

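shortest_dir = None
for member in get_members_of_archive():
    # Keep the directory with the shortest path seen so far.
    if member.is_directory() and (
        shortest_dir is None or len(member) < len(shortest_dir)
    ):
        shortest_dir = member

return shortest_dir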

There's already some code to identify compression formats in utils.py.
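
To make the pseudo-code above more concrete, here is a minimal sketch for tar-based AIPs only, using `tarfile` from the standard library. The function name `get_base_directory_from_tar` is hypothetical and not part of the storage service. The sketch also assumes the archive stores explicit directory entries. Other formats, such as 7z, have no standard-library reader and would need separate handling, which is not shown.

```python
import tarfile


def get_base_directory_from_tar(archive_path):
    """Return the shortest directory path in a tar archive, or None.

    Hypothetical helper for illustration; assumes the archive contains
    explicit directory entries.
    """
    shortest_dir = None
    # "r:*" lets tarfile detect the compression (gzip, bz2, xz) itself.
    with tarfile.open(archive_path, mode="r:*") as archive:
        # Iterating a TarFile yields one TarInfo header at a time.
        for member in archive:
            if not member.isdir():
                continue
            name = member.name.rstrip("/")
            if shortest_dir is None or len(name) < len(shortest_dir):
                shortest_dir = name
    return shortest_dir
```

It could be called as `get_base_directory_from_tar("/path/to/aip.tar.gz")`, where the path is a placeholder. Whether this actually uses less memory than the lsar approach on a very large AIP would need measuring.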

Describe alternatives you've considered
None.

Additional context


For Artefactual use:

Before you close this issue, you must check off the following:

  • All pull requests related to this issue are properly linked
  • All pull requests related to this issue have been merged
  • A testing plan for this issue has been implemented and passed (testing plan information should be included in the issue body or comments)
  • Documentation regarding this issue has been written and merged
  • Details about this issue have been added to the release notes
@sallain added the "Needs sponsorship" and "Status: refining" labels on Feb 10, 2020