
NWB search demo #85

Open

bendichter opened this issue Jul 16, 2024 · 3 comments

Comments

@bendichter

At NeuroDataReHack, one of the students wanted to identify sessions within the IBL dataset that contained electrodes in a specific brain region. That's doable with the DANDI API, remfile, and pynwb, but can take a very long time because it requires streaming and initializing each NWB file. I think it would be much faster to do it with LINDI, particularly since the metadata they needed was stored in the json file as base64. It would be great if we had a tutorial that demonstrated how to use LINDI in this way. I think it could reduce search time substantially and would be a cool use-case for LINDI.
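As background for such a tutorial: a LINDI `.nwb.lindi.json` stores small datasets inline as base64-encoded chunks in a kerchunk-style reference map, so metadata like electrode locations can be read without any byte-range requests to the underlying HDF5 file. Below is a minimal sketch of decoding one such inline chunk; the key layout and the `read_inline_chunk` helper are illustrative assumptions, not the exact LINDI schema.

```python
import base64

# Toy reference map mimicking (not exactly matching) LINDI's layout:
# small chunks are inlined as "base64:..." strings instead of S3 byte ranges.
refs = {
    "refs": {
        "general/extracellular_ephys/electrodes/location/0":
            "base64:" + base64.b64encode(b"MB\x00CA1\x00V3").decode("ascii"),
    }
}

def read_inline_chunk(refs: dict, key: str) -> bytes:
    """Decode a base64-inlined chunk from a kerchunk-style reference dict."""
    val = refs["refs"][key]
    assert val.startswith("base64:"), "chunk is not inlined"
    return base64.b64decode(val[len("base64:"):])

raw = read_inline_chunk(refs, "general/extracellular_ephys/electrodes/location/0")
# raw now holds the brain-region strings, with no HTTP requests at all
```

In the real file, larger datasets point at S3 byte ranges instead, which is why only the small, base64-inlined metadata is this cheap to query.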

@bendichter (Author)

@magland I'm trying to put together an example of a search over brain regions for electrophysiology. This works, but it takes about 15 minutes for me on the IBL dataset (to be fair, it's a very big dataset). Is there a faster way, or is this what you would recommend?

import lindi
from tqdm import tqdm
from dandi.dandiapi import DandiAPIClient

brain_area = "MB"
dandiset_id = "000409"
elec_loc_path = "general/extracellular_ephys/electrodes/location"

dandi_api_client = DandiAPIClient()
dandiset = dandi_api_client.get_dandiset(dandiset_id)

# Cache downloaded chunks on disk so repeated runs are faster
local_cache = lindi.LocalCache()

passing_assets = []
for asset in tqdm(list(dandiset.get_assets())):
    if not asset.path.endswith(".nwb"):
        continue
    lindi_url = f'https://lindi.neurosift.org/dandi/dandisets/{dandiset_id}/assets/{asset.identifier}/nwb.lindi.json'
    lindi_file = lindi.LindiH5pyFile.from_lindi_file(lindi_url, local_cache=local_cache)
    if elec_loc_path in lindi_file and brain_area in lindi_file[elec_loc_path]:
        passing_assets.append(asset)

@bendichter (Author)

bendichter commented Jul 18, 2024

For comparison, this took 18:30

from tqdm import tqdm
from dandi.dandiapi import DandiAPIClient
import remfile
import h5py

brain_area = "V3"
dandiset_id = "000409"
elec_loc_path = "general/extracellular_ephys/electrodes/location"

dandi_api_client = DandiAPIClient()
dandiset = dandi_api_client.get_dandiset(dandiset_id)

passing_assets = []
for asset in tqdm(list(dandiset.get_assets())):
    if not asset.path.endswith(".nwb"):
        continue
    s3_url = asset.get_content_url(follow_redirects=1, strip_query=True)
    rem_file = remfile.File(s3_url)
    h5_file = h5py.File(rem_file, "r")
    if elec_loc_path in h5_file and brain_area in h5_file[elec_loc_path]:
        passing_assets.append(asset)

Actually, I am surprised there isn't a bigger time difference.

@magland (Collaborator)

magland commented Jul 19, 2024

Yeah, it is surprising that the lindi method is not faster. The only part that takes any time is the download. Each file is only 1-2 MB. That should take less than a second per file, but it will depend on the network connection.

I tried it on a gh codespaces instance and it took 506 seconds (8:26 minutes) for 677 files (115 passing).

I then tried the remfile method -- I didn't run it to completion, but it was going at around 2.2 seconds per file. So slower, but not hugely.

Where lindi would provide a much bigger advantage, I am speculating, is when you load more information per file -- for example when using pynwb, which needs to read a lot more metadata.

Another possibility is to prepare a single large lindi file for the entire dandiset. Then it could be loaded much more efficiently.
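One hedged sketch of what such an aggregate could look like: merge each asset's reference map into a single dict, namespaced by asset id, so the whole-dandiset search becomes one download plus a dictionary scan. The `merge_lindi_refs` helper and the layout are illustrative assumptions, not an existing LINDI feature.

```python
def merge_lindi_refs(per_asset_refs: dict) -> dict:
    """Combine per-asset kerchunk-style reference dicts into one
    aggregate dict, namespacing keys by asset id.  Illustrative only."""
    merged = {"refs": {}}
    for asset_id, refs in per_asset_refs.items():
        for key, val in refs["refs"].items():
            merged["refs"][f"{asset_id}/{key}"] = val
    return merged

# Two toy assets with base64-inlined location data ("MB" and "V3")
per_asset = {
    "asset-1": {"refs": {"general/extracellular_ephys/electrodes/location/0": "base64:TUI="}},
    "asset-2": {"refs": {"general/extracellular_ephys/electrodes/location/0": "base64:VjM="}},
}
aggregate = merge_lindi_refs(per_asset)
# One dict now holds the metadata for every asset; the brain-region search
# becomes a single download followed by an in-memory scan of the keys.
```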

Yet another possibility is to cache the lindi files locally.
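A minimal sketch of that caching idea, with a stand-in `fetch` function in place of the real HTTP GET (in practice something like `requests.get(url).json()`); the function and cache layout here are assumptions, not part of lindi's API:

```python
import hashlib
import json
import tempfile
from pathlib import Path

# A throwaway directory for this sketch; real code would use a fixed
# location such as Path.home() / ".lindi_cache".
CACHE_DIR = Path(tempfile.mkdtemp())

def fetch_lindi_json(url: str, fetch) -> dict:
    """Return the lindi JSON for url, downloading it at most once and
    reusing the on-disk copy on subsequent calls."""
    cache_path = CACHE_DIR / (hashlib.sha1(url.encode()).hexdigest() + ".json")
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    data = fetch(url)
    cache_path.write_text(json.dumps(data))
    return data

calls = []
def fake_fetch(url):
    # Stand-in for the network download; counts how often it is hit
    calls.append(url)
    return {"refs": {}}

# Placeholder asset id, for illustration only
url = "https://lindi.neurosift.org/dandi/dandisets/000409/assets/some-asset/nwb.lindi.json"
first = fetch_lindi_json(url, fetch=fake_fetch)
second = fetch_lindi_json(url, fetch=fake_fetch)
# fake_fetch ran only once; the second call was served from disk.
```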
