-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make contents
API scale
#3609
Make contents
API scale
#3609
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good; however, I have the usual collection of nits, pointed questions, and suggestions.
lib/pbench/server/cache_manager.py
Outdated
if target.is_dir(): | ||
uri = f"{origin}/contents/{link}" | ||
type = CacheType.DIRECTORY | ||
append_to = dir_list | ||
elif target.is_file(): | ||
uri = f"{origin}/inventory/{link}" | ||
type = CacheType.FILE | ||
else: | ||
uri = f"{origin}/inventory/{relative}" | ||
type = CacheType.OTHER |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused: why is the uri
sometimes built from link
and sometimes built from relative
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For OTHER
and BROKEN
the URL points to the symlink since we can't (or won't) resolve it. For others, we return the resolved target so they get the real directory or file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I concur on the BROKEN
case...but why is that appropriate for the OTHER
case? (In the OTHER
case, we successfully resolved the symlink to a directory entry accessible within the tarball via a relative path, so won't f"{origin}/inventory/{link}"
and f"{origin}/inventory/{relative}"
address the same file, with the latter just taking a more circuitous route, if the route is different at all?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's actually a comment for inventory
rather than contents
. We don't want to expose the real target of an OTHER
file (e.g., a fifo or whatever), which is why we use OTHER
. Yeah, it's possible that inventory
will transparently resolve the link and return the fifo
, which I suspect we really don't want. I'm not going to mess with that here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made some of the changes, including some new test links.
I also got an idea this afternoon (which is why I'm working way late today!) about improving the report generator performance by extending my find_dataset
changes to an alternate "full discovery" by pulling all the dataset resource IDs and tarball-path
values from SQL instead of iterdir
-ing through the ARCHIVE
s. Turns out that cuts cache discovery from nearly 2 hours to about 25 minutes.
lib/pbench/server/cache_manager.py
Outdated
if target.is_dir(): | ||
uri = f"{origin}/contents/{link}" | ||
type = CacheType.DIRECTORY | ||
append_to = dir_list | ||
elif target.is_file(): | ||
uri = f"{origin}/inventory/{link}" | ||
type = CacheType.FILE | ||
else: | ||
uri = f"{origin}/inventory/{relative}" | ||
type = CacheType.OTHER |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For OTHER
and BROKEN
the URL points to the symlink since we can't (or won't) resolve it. For others, we return the resolved target so they get the real directory or file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The updates look generally good. I've got follow-ups on two of my comments from the first pass (the rest are resolved -- thanks!), and I've got a few thoughts on your revisions. The foremost is that making CacheManager.full_discovery()
into a wrapper seems like it causes more problems than it really solves. (The result is good, but the approach has some flaws.) The others are smaller bits, nits, and suggestions.
lib/pbench/server/cache_manager.py
Outdated
if target.is_dir(): | ||
uri = f"{origin}/contents/{link}" | ||
type = CacheType.DIRECTORY | ||
append_to = dir_list | ||
elif target.is_file(): | ||
uri = f"{origin}/inventory/{link}" | ||
type = CacheType.FILE | ||
else: | ||
uri = f"{origin}/inventory/{relative}" | ||
type = CacheType.OTHER |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I concur on the BROKEN
case...but why is that appropriate for the OTHER
case? (In the OTHER
case, we successfully resolved the symlink to a directory entry accessible within the tarball via a relative path, so won't f"{origin}/inventory/{link}"
and f"{origin}/inventory/{relative}"
address the same file, with the latter just taking a more circuitous route, if the route is different at all?)
PBENCH-1321 The `/datasets/{id}/contents` API includes into several unexpectedly expensive steps: 1. Finding the tarball (by MD5 value) within the `ARCHIVE` tree using a `glob` 2. Fully discovering all tarballs within the controller directory 3. Unpacking the tarball into a cache directory using `tar` 4. Building a "map" of the contents of the unpacked tarball subtree This PR includes mitigations for all but the `tar` unpack step: 1. Use the `server.tarball-path` metadata instead of searching the disk 2. Only discover the target tarball rather than the entire controller 3. Skip the "map" and evaluate the actual target path within the cache Finding a tarball within our 30Tb `ARCHIVE` tree can take many minutes, while identifying the controller directory from the tarball path takes a fraction of a second. Depending on the number of tarballs within a controller (some have many), full controller discovery has been observed to take half a minute; while populating only the target tarball takes a fraction of a second. Building the map for a large tarball tree can take minutes, whereas discovery of the actual relative file path within the cache runs at native (Python) file system speeds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently I have several responses written earlier this afternoon that didn't get posted. Huh. You can enjoy them at your leisure.
lib/pbench/server/cache_manager.py
Outdated
if target.is_dir(): | ||
uri = f"{origin}/contents/{link}" | ||
type = CacheType.DIRECTORY | ||
append_to = dir_list | ||
elif target.is_file(): | ||
uri = f"{origin}/inventory/{link}" | ||
type = CacheType.FILE | ||
else: | ||
uri = f"{origin}/inventory/{relative}" | ||
type = CacheType.OTHER |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's actually a comment for inventory
rather than contents
. We don't want to expose the real target of an OTHER
file (e.g., a fifo or whatever), which is why we use OTHER
. Yeah, it's possible that inventory
will transparently resolve the link and return the fifo
, which I suspect we really don't want. I'm not going to mess with that here.
PBENCH-1321
The
/datasets/{id}/contents
API includes into several unexpectedly expensive steps:ARCHIVE
tree using aglob
tar
This PR includes mitigations for all but the
tar
unpack step:server.tarball-path
metadata instead of searching the diskFinding a tarball within our 30Tb
ARCHIVE
tree can take many minutes, while identifying the controller directory from the tarball path takes a fraction of a second.Depending on the number of tarballs within a controller (some have many), full controller discovery has been observed to take half a minute; while populating only the target tarball takes a fraction of a second.
Building the map for a large tarball tree can take minutes, whereas discovery of the actual relative file path within the cache runs at native (Python) file system speeds.