Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poor performance of get_subjects() even when using indexing without metadata #940

Closed
wasciutto opened this issue Feb 5, 2023 · 10 comments · Fixed by #942
Closed

Poor performance of get_subjects() even when using indexing without metadata #940

wasciutto opened this issue Feb 5, 2023 · 10 comments · Fixed by #942

Comments

@wasciutto
Copy link
Contributor

I am puzzled by how slow get_subjects() is for me. It takes about 10 seconds to retrieve the subject ids for a 67 subject BIDS directory. This is after some improvement achieved by using a saved index file and not indexing metadata - two of the most common performance enhancement suggestions I see here.

I don't understand why gathering the subject IDs in such a small database wouldn't be almost instantaneous. Meanwhile, when I query for a single subject, the result is as quick as you'd expect.

Here is how I fetch my layout, using version 15.5:

bids_dir = "path/to/my/bids_dir"
database_path = "path/to/my/index"

indexer = BIDSLayoutIndexer(index_metadata=False)

BIDSLayout(
  bids_dir,
  indexer=indexer, database_path=database_path,
  reset_database=refresh
)

Is there some way I can configure the indexer to improve my performance here, or some other solution?

Thanks!

@effigies
Copy link
Collaborator

effigies commented Feb 6, 2023

We just merged a bug-fix to the indexer (#936). Would you be willing to install master and see if the issue remains? The bug was related to traversing ignored directories despite them being ignored.

pip install git+https://github.com/bids-standard/pybids.git

@wasciutto
Copy link
Contributor Author

I just tried and unfortunately get_subjects() is still taking the same amount of time.

@adelavega
Copy link
Collaborator

Can you elaborate on what you mean by "querying for a single subject"?

I do recall this being rather slow at times, so I'll take a look. My guess is that it has to do with taking the "unique" set of subjects, and it's doing too many pre-querying to do so.

@adelavega
Copy link
Collaborator

adelavega commented Feb 7, 2023

So on a random dataset I had, I couldn't quite replicate your issue, but I do see that it takes 2 orders of magnitudes more to do get_subjects vs get:

In [7]: %timeit layout.get(subject='1')
5.02 ms ± 439 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [8]: %timeit layout.get_subjects()
660 ms ± 2.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The reason for this is that get_subjects is not a special function, but rather a shortcut for the more general form: layout.get(return_type='id', target='subject')

I'll see if there's any quick fixes for this, but if not there is a major refactor coming up which will speed up pybids significantly: #863

@wasciutto
Copy link
Contributor Author

wasciutto commented Feb 7, 2023

Yup, so by querying a single subject, I mean the same thing as in your timing example - using layout.get(subject='123')

As far as replicating my issue, I believe your test showing a two order of magnitude difference in get_subjects() vs (get(subject='123')) captures exactly what I'm dealing with.

I appreciate you looking into this for me!

@wasciutto
Copy link
Contributor Author

Maybe there is some way for me to index on 'id'?

@adelavega
Copy link
Collaborator

Looks like it's slow because of this line:

ent_iter = (x.get_entities(metadata=metadata) for x in results)

Essentially its asking for the meta-data of every indexed file, to later filter by only those that have the target (in this case subject) in their metadata.

.get_entities is > 2 orders of magnitude slower than just checking .entities. I'm trying to figure out if just using the attribute would be reliable enough here.

@adelavega
Copy link
Collaborator

BTW, you are totally right that this can be instantaneous.

If you run this query, you can get it in 2ms:

from bids.layout import models
layout.session.query(models.Tag).filter_by(entity_name='subject').distinct().with_entities(models.Tag._value).all()

The problem is that layout.get is doing this in a very inefficient way, in order to piggy back on earlier logic, that supports filter files based on other filters. That is, if you want to be able to get all subjects that match a certain task, layout.get_subjects can do so, and that's what complicates the logic and in the end makes it slower.

I'm about to submit a patch that at least speeds it up 20x, by getting each file's entity using a faster method.

@wasciutto
Copy link
Contributor Author

This is excellent, thanks so much for diving into this, explaining the situation, and providing a patch.

I have been working to transition a mid-size application away from interacting with BIDS directories through the file system, and instead through pybids exclusively. However, in the use case of getting all subject ids (a very frequent case), a simple file traversal had been much faster than pybids. Speeding up this crucial operation will smooth out the user experience where we are using it, and make me much more enthusiastic about implementing pybids where we haven't yet.

@adelavega
Copy link
Collaborator

I'm glad. Like I said though, we are working on a major refactor that should make an even bigger difference, so stay tuned.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants