Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Info call makes unnecessary long listing for "directories" #656

Open
shcheklein opened this issue Jan 2, 2025 · 3 comments
Open

Info call makes unnecessary long listing for "directories" #656

shcheklein opened this issue Jan 2, 2025 · 3 comments

Comments

@shcheklein
Copy link
Contributor

When path is a "directory" (prefix) we might end up going into full listing operation to just get an info for a single path. We call ls and we don't pass any limits (and it seems GCSFS doesn't support max results atm?).

        out = await self._ls(path, **kwargs)  # Here we can go into a long listing since we don't limit it
        out0 = [o for o in out if o["name"].rstrip("/") == path]
        if out0:
            # exact hit
            return out0[0]
        elif out:
            # other stuff - must be a directory

Is it intentional for some reason?

On s3 and Azure, in similar place we list to check if at least a single key exists (we don't check for an exact hit).

Can we drop the exact hit check and pass max results into ls?

@martindurant
Copy link
Member

If I understand, you want there to be a max results argument to ls OR that there be a first-pass exact hit check?

On s3 and Azure, in similar place we list to check if at least a single key exists (we don't check for an exact hit).

It is totally reasonable that all three should have the same logic here, so if you can phrase this in code as a PR, I'd be happy to see it.

@shcheklein
Copy link
Contributor Author

thanks @martindurant !

I think s3 / Azure logic is just to have a check that at least some objects exist under the prefix. They don't do exact hit check AFAIU, and they don't list everything.

So, the proposed solution is to:

  • remove the exact check (not sure why it was needed tbh?)
  • add max results supports to list as minimum as possible (get_info should be a constant in terms of API calls - ideally a single call)

so if you can phrase this in code as a PR, I'd be happy to see it.

I'll try to get there. Intention of this issue was to ask if there were some reasons for the existing logic that I'm not aware / can't see.

@psobot
Copy link

psobot commented Jan 12, 2025

Just adding a +1 here - this also applies to exists calls, so .exists(directory) ends up fetching info for every blob with that prefix, rather than just checking if any blob with that prefix exists. This is pretty catastrophically expensive for large directories, both in time and memory.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants