Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make contents API scale #3609

Merged
merged 5 commits into from
Feb 21, 2024
Merged

Conversation

dbutenhof
Copy link
Member

PBENCH-1321

The /datasets/{id}/contents API includes into several unexpectedly expensive steps:

  1. Finding the tarball (by MD5 value) within the ARCHIVE tree using a glob
  2. Fully discovering all tarballs within the controller directory
  3. Unpacking the tarball into a cache directory using tar
  4. Building a "map" of the contents of the unpacked tarball subtree

This PR includes mitigations for all but the tar unpack step:

  1. Use the server.tarball-path metadata instead of searching the disk
  2. Only discover the target tarball rather than the entire controller
  3. Skip the "map" and evaluate the actual target path within the cache

Finding a tarball within our 30Tb ARCHIVE tree can take many minutes, while identifying the controller directory from the tarball path takes a fraction of a second.

Depending on the number of tarballs within a controller (some have many), full controller discovery has been observed to take half a minute; while populating only the target tarball takes a fraction of a second.

Building the map for a large tarball tree can take minutes, whereas discovery of the actual relative file path within the cache runs at native (Python) file system speeds.

@dbutenhof dbutenhof added bug Server Code Infrastructure Dashboard Of and relating to the Dashboard GUI API Of and relating to application programming interfaces to services and functions Operations Related to operation and monitoring of a service labels Feb 16, 2024
@dbutenhof dbutenhof requested a review from webbnh February 16, 2024 15:31
@dbutenhof dbutenhof self-assigned this Feb 16, 2024
@dbutenhof dbutenhof marked this pull request as draft February 16, 2024 16:32
@dbutenhof

This comment was marked as outdated.

@dbutenhof dbutenhof marked this pull request as ready for review February 16, 2024 23:16
webbnh
webbnh previously approved these changes Feb 19, 2024
Copy link
Member

@webbnh webbnh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good; however, I have the usual collection of nits, pointed questions, and suggestions.

Comment on lines 838 to 847
if target.is_dir():
uri = f"{origin}/contents/{link}"
type = CacheType.DIRECTORY
append_to = dir_list
elif target.is_file():
uri = f"{origin}/inventory/{link}"
type = CacheType.FILE
else:
uri = f"{origin}/inventory/{relative}"
type = CacheType.OTHER
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused: why is the uri sometimes built from link and sometimes built from relative?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For OTHER and BROKEN the URL points to the symlink since we can't (or won't) resolve it. For others, we return the resolved target so they get the real directory or file.

Copy link
Member

@webbnh webbnh Feb 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concur on the BROKEN case...but why is that appropriate for the OTHER case? (In the OTHER case, we successfully resolved the symlink to a directory entry accessible within the tarball via a relative path, so won't f"{origin}/inventory/{link}" and f"{origin}/inventory/{relative}" address the same file, with the latter just taking a more circuitous route, if the route is different at all?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's actually a comment for inventory rather than contents. We don't want to expose the real target of an OTHER file (e.g., a fifo or whatever), which is why we use OTHER. Yeah, it's possible that inventory will transparently resolve the link and return the fifo, which I suspect we really don't want. I'm not going to mess with that here.

Copy link
Member Author

@dbutenhof dbutenhof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some of the changes, including some new test links.

I also got an idea this afternoon (which is why I'm working way late today!) about improving the report generator performance by extending my find_dataset changes to an alternate "full discovery" by pulling all the dataset resource IDs and tarball-path values from SQL instead of iterdir-ing through the ARCHIVEs. Turns out that cuts cache discovery from nearly 2 hours to about 25 minutes.

Comment on lines 838 to 847
if target.is_dir():
uri = f"{origin}/contents/{link}"
type = CacheType.DIRECTORY
append_to = dir_list
elif target.is_file():
uri = f"{origin}/inventory/{link}"
type = CacheType.FILE
else:
uri = f"{origin}/inventory/{relative}"
type = CacheType.OTHER
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For OTHER and BROKEN the URL points to the symlink since we can't (or won't) resolve it. For others, we return the resolved target so they get the real directory or file.

webbnh
webbnh previously approved these changes Feb 20, 2024
Copy link
Member

@webbnh webbnh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updates look generally good. I've got follow-ups on two of my comments from the first pass (the rest are resolved -- thanks!), and I've got a few thoughts on your revisions. The foremost is that making CacheManager.full_discovery() into a wrapper seems like it causes more problems than it really solves. (The result is good, but the approach has some flaws.) The others are smaller bits, nits, and suggestions.

Comment on lines 838 to 847
if target.is_dir():
uri = f"{origin}/contents/{link}"
type = CacheType.DIRECTORY
append_to = dir_list
elif target.is_file():
uri = f"{origin}/inventory/{link}"
type = CacheType.FILE
else:
uri = f"{origin}/inventory/{relative}"
type = CacheType.OTHER
Copy link
Member

@webbnh webbnh Feb 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I concur on the BROKEN case...but why is that appropriate for the OTHER case? (In the OTHER case, we successfully resolved the symlink to a directory entry accessible within the tarball via a relative path, so won't f"{origin}/inventory/{link}" and f"{origin}/inventory/{relative}" address the same file, with the latter just taking a more circuitous route, if the route is different at all?)

PBENCH-1321

The `/datasets/{id}/contents` API includes into several unexpectedly expensive
steps:

1. Finding the tarball (by MD5 value) within the `ARCHIVE` tree using a `glob`
2. Fully discovering all tarballs within the controller directory
3. Unpacking the tarball into a cache directory using `tar`
4. Building a "map" of the contents of the unpacked tarball subtree

This PR includes mitigations for all but the `tar` unpack step:

1. Use the `server.tarball-path` metadata instead of searching the disk
2. Only discover the target tarball rather than the entire controller
3. Skip the "map" and evaluate the actual target path within the cache

Finding a tarball within our 30Tb `ARCHIVE` tree can take many minutes, while
identifying the controller directory from the tarball path takes a fraction of
a second.

Depending on the number of tarballs within a controller (some have many), full
controller discovery has been observed to take half a minute; while populating
only the target tarball takes a fraction of a second.

Building the map for a large tarball tree can take minutes, whereas discovery
of the actual relative file path within the cache runs at native (Python) file
system speeds.
Copy link
Member

@webbnh webbnh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

Are you contemplating acting on or responding to the two remaining open conversations (1, 2) or should I just close them?

Copy link
Member Author

@dbutenhof dbutenhof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently I have several responses written earlier this afternoon that didn't get posted. Huh. You can enjoy them at your leisure.

Comment on lines 838 to 847
if target.is_dir():
uri = f"{origin}/contents/{link}"
type = CacheType.DIRECTORY
append_to = dir_list
elif target.is_file():
uri = f"{origin}/inventory/{link}"
type = CacheType.FILE
else:
uri = f"{origin}/inventory/{relative}"
type = CacheType.OTHER
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's actually a comment for inventory rather than contents. We don't want to expose the real target of an OTHER file (e.g., a fifo or whatever), which is why we use OTHER. Yeah, it's possible that inventory will transparently resolve the link and return the fifo, which I suspect we really don't want. I'm not going to mess with that here.

@dbutenhof dbutenhof merged commit c0946eb into distributed-system-analysis:main Feb 21, 2024
4 checks passed
@dbutenhof dbutenhof deleted the content branch February 21, 2024 12:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Of and relating to application programming interfaces to services and functions bug Code Infrastructure Dashboard Of and relating to the Dashboard GUI Operations Related to operation and monitoring of a service Server
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants