Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create manifest creation script. #55

Merged
merged 1 commit into from
Aug 15, 2024
Merged

Conversation

morriscb
Copy link
Collaborator

@morriscb morriscb commented Aug 5, 2024

Add function, ported from Jupyter notebook to create manifests given a set of data.

@morriscb morriscb changed the base branch from main to rc/0.2.0 August 5, 2024 19:19
@morriscb morriscb requested a review from danielsf August 5, 2024 19:19
Copy link
Collaborator

@danielsf danielsf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some cosmetic changes

"directory_listing": {}
}
for directory in os.walk(base_dir):
versions = []
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind changing this to something like version_list? The one character difference between version and versions might be confusing (channeling my inner Russell Owen here)

Base release manifest dictionary with all directories, links, paths,
and files populated.
"""
datasets = {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optionally: same comment as before regarding just calling this the plural of dataset. Maybe this could be dataset_lookup. The need is not as acute because datasets and data_set are a little more distinguishable.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to dataset_lookup and dataset.

ver_dict = ds_dict[data_kind]

data_dir = os.path.join(base_dir, ver_dict['relative_path'])
# print('---',data_dir)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove commented-out line

(along with commented print lines below)

return release


def populate_paths_and_urls(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give this function a name that more clearly indicates that this is for populating information at the level of directories/datasets, rather than individual files? I got a little tripped-up by the double implementation of bucket_prefix + relative_path between this function and populate_datasets. It took me a moment to realize this function was operating at a different level than the other.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to populate_directories

datasets[data_set][data_kind][tag][norm]['files'] = {}
datasets[data_set][data_kind][tag][norm]['files'][
ext] = {}
datasets[data_set][data_kind][tag][norm]['files'][
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than individually set 'version', 'url', 'file_hash' etc. in each section of this if/else block, can you define a file_stats dict that has each of these elements before entering the if/else block, and then the place file_stats under whatever chain of keys to need to based on the extension?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch.

from abc_atlas_access.abc_atlas_cache.utils import file_hash_from_path


BUCKET_PREFIX = 'https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These global variables do not appear to be used

@morriscb morriscb requested a review from danielsf August 9, 2024 17:19
Copy link
Collaborator

@danielsf danielsf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one comment on a docstring. This is good to merge, though.

@@ -280,7 +280,7 @@ def populate_datasets(
"--datasets_to_skip",
nargs='+',
type=str,
default=['releases', 'SEAAD', 'Zhuang-C57BL6J'],
default=['releases', 'SEA', 'Zhuang-C57BL6J'],
help="Skip a given project for all directories that start with the "
"given pattern. (e.g. SEAD will exclude all directories that "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the example in the docstring be SEAAD (two As), since that is what we call the actual dataset?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Pretty sure there are also directories with SEA-AD and we also want to skip those.

@morriscb
Copy link
Collaborator Author

Running a check of the manifest (downloading and verifying hashes). Once that's done I'll merge the changes unless there is anything else you'd like addressed.

Add manifest builder function.
Add doc strings.
Add json and h5 file manifest creation.
Hide mapmycells for now.
@morriscb morriscb merged commit 455d65f into rc/0.2.0 Aug 15, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants