MGMT-14453: Fix bugs in installercache #7205
base: master
Conversation
@paul-maidment: This pull request references MGMT-14453 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: paul-maidment. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##           master    #7205      +/-   ##
==========================================
+ Coverage   67.89%   67.95%    +0.05%
==========================================
  Files         298      300        +2
  Lines       40711    40981      +270
==========================================
+ Hits        27642    27848      +206
- Misses      10592    10640       +48
- Partials     2477     2493       +16
```
@paul-maidment Any chance you can provide some description of the installerCache now with your changes (what configuration it should get and why, what its main purpose is, how it operates, etc.)? I know it should function as a cache for openshift/installer binaries, but not much more than that.
I can certainly add a summary to the description with some of these details, thank you for the suggestion.

The installercache is essentially a directory in ephemeral storage where we store the openshift/installer binaries. It's used at the point where the manifests and install configuration are being prepared. The manifest generator makes a request to get a specific release from the cache and is given a 'hardlink' to the release. Once the manifest generator is done with the release, it calls `Cleanup` on it.

Typically, it could take between 30 and 90 seconds (this is an educated guess and we have another ticket in place to gather more evidence/stats) to free a release after use. The cache is meant to keep the most common releases around in ephemeral storage so that if another user wants to obtain the same release, we can save the 30 seconds or so required to download it and just return the one we have. Obviously the number of openshift versions available makes it prohibitive to store all of them, so we maintain a "least recently used" cache of releases.

The cache is configured using the following environment variables:

* INSTALLER_CACHE_CAPACITY_GIB
* INSTALLER_CACHE_MAX_RELEASE_SIZE_GIB (this could perhaps be improved or automated in the future)
* INSTALLER_CACHE_EVICTION_THRESHOLD_PERCENT
* INSTALLER_CACHE_RELEASE_FETCH_RETRY_INTERVAL_SECONDS

The last of these pertains to the retry mechanism. Because we are using hardlinks, an eviction of the original file does not immediately mean that the space will be freed. Only when the release is cleaned up after use will the space be reconciled. This interval should ideally be set to a duration that covers this length of time and nothing more.
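As a rough illustration, these variables might map onto an environment-driven struct like the sketch below. The struct and field names are hypothetical, and the envconfig-style tags are an assumption about the wiring, not the PR's actual code:

```go
// Hypothetical configuration struct; names and tags are illustrative only.
type InstallerCacheConfig struct {
	// Total capacity of the cache directory, in GiB.
	CapacityGiB uint `envconfig:"INSTALLER_CACHE_CAPACITY_GIB"`
	// Upper bound on the size of a single extracted release, in GiB.
	MaxReleaseSizeGiB uint `envconfig:"INSTALLER_CACHE_MAX_RELEASE_SIZE_GIB"`
	// Eviction begins once usage crosses this percentage of capacity.
	EvictionThresholdPercent uint `envconfig:"INSTALLER_CACHE_EVICTION_THRESHOLD_PERCENT"`
	// How long to wait before retrying a fetch that failed because
	// hard-link cleanup has not yet freed the space, in seconds.
	ReleaseFetchRetryIntervalSeconds uint `envconfig:"INSTALLER_CACHE_RELEASE_FETCH_RETRY_INTERVAL_SECONDS"`
}
```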
/retest
/test edge-subsystem-kubeapi-aws
/test okd-scos-e2e-aws-ovn
@paul-maidment there's no reason to constantly retest failing optional tests unless you think the specific failure is relevant to your changes. Looking at that test's history, it seems to be failing much more frequently than it's passing, so I wouldn't worry about it.
Still working on a full review, but one thing I'm noticing is that you have a lot of info-level log messages that I'm not sure are going to be super interesting once this is all up and running. I'd consider whether we will really need to see these every time and either remove some or change them to debug. I just don't want to clutter up the review with a bunch of comments saying "consider changing this to debug or removing it".
```go
BeforeEach(func() {
	// setup temp workdir
	workDir, err = os.MkdirTemp("", "bootstrap-ignition-update-test-")
	Expect(err).NotTo(HaveOccurred())
```
I think this likely belongs down under `installercache.New`, right?
Good catch!
```go
for {
	select {
	case <-ctx.Done():
		return nil, errors.Errorf("context cancelled or timed out while fetching release %s", releaseID)
```
Context has an `Err()` method that should be used for contexts that are closed. I'd add that here.
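A minimal sketch of that suggestion, assuming the surrounding function returns `(*Release, error)` and uses github.com/pkg/errors as the snippet above does:

```go
case <-ctx.Done():
	// ctx.Err() reports whether the context was cancelled or its deadline exceeded
	return nil, errors.Wrapf(ctx.Err(), "stopped waiting for release %s", releaseID)
```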
```go
if !isCapacityError {
	return nil, errors.Wrapf(err, "failed to get installer path for release %s", releaseID)
}
i.log.WithError(err).Errorf("insufficient installer cache capacity for release %s", releaseID)
```
Shouldn't we always be able to evict something? How could we end up in this situation?
This is not the case, and this scenario is very easy to reproduce:
- Set the installer cache to a capacity that is only enough for a single release.
- Fetch 2 different releases.
- The second release will hit this issue due to a pending hard link cleanup.

The capacity check performed here measures the amount of free space while accounting for hard links that have not yet been removed. An eviction removes release files, not hard links (unless we count the questionable "hard link pruning" functionality). A release that we have deleted may still be in use by the caller; until that release is freed by calling `Cleanup`, the actual space will not be freed.

By the time we reach this line we have done the following:
1: Checked the free space while accounting for hard links.
2: Determined that we do not have enough space in which to store a release.
3: Called evict and performed some eviction on the releases -- note that there may be no eviction candidates without hard links yet.
4: Returned to the top of the loop and checked the free space, accounting for hard links, again.
5: Determined that we still do not have enough space in which to store a release.
6: Detected that this is a 'blocking' issue and that unless we have the space, we cannot write.
7: Determined that eviction has already been called and should not be called again, as it will most likely not have an impact.

This is a "fail fast" operation to allow a retry to be attempted outside of the context of the mutex.
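Paraphrased as code, that flow looks roughly like the sketch below; the receiver, helper methods, and error value are hypothetical names, not the PR's actual identifiers:

```go
var errInsufficientCacheCapacity = errors.New("insufficient installer cache capacity")

// reserveSpace sketches steps 1-7 above: check, evict once, re-check, fail fast.
func (i *Installers) reserveSpace(neededBytes int64) error {
	evictionAlreadyPerformed := false
	for {
		// Free space is measured while counting not-yet-removed hard links,
		// so we never attempt a write that cannot physically fit.
		if i.freeSpaceAccountingForHardLinks() >= neededBytes {
			return nil
		}
		if evictionAlreadyPerformed {
			// A second eviction cannot free more space until in-use hard links
			// are cleaned up; fail fast so the caller can retry outside the mutex.
			return errInsufficientCacheCapacity
		}
		i.evict()
		evictionAlreadyPerformed = true
	}
}
```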
```go
// Installer cache directory will not exist prior to first extraction
i.log.WithError(err).Warnf("skipping capacity check as first extraction will trigger creation of directory")
```
Can we do this when the cache object is created so we don't have to check every time here?
You could argue that the release extraction code wants to retain responsibility for this, as it knows the exact file and directory permissions required. Ultimately `oc adm release extract` performs the actions. If a user were to manually run this command, they would not be expected to manage the permissions, as it would be assumed that `oc` had made the right choices when it created the directory/files. I wonder if we should keep the responsibility there?
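For comparison, the reviewer's alternative might look like this sketch (the constructor signature, field names, and the 0o755 mode are assumptions; `oc adm release extract` may choose different permissions):

```go
// New creates the cache directory once, at construction time,
// so Get never has to special-case the first extraction.
func New(cacheDir string, log logrus.FieldLogger) (*Installers, error) {
	// The permission mode here is an assumption for illustration.
	if err := os.MkdirAll(cacheDir, 0o755); err != nil {
		return nil, errors.Wrapf(err, "failed to create installer cache directory %s", cacheDir)
	}
	return &Installers{cacheDir: cacheDir, log: log}, nil
}
```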
```go
}
if evictionAlreadyPerformed {
	// We still don't have enough capacity after eviction, so we need to exit and retry
	return nil, &errorInsufficientCacheCapacity{
```
Why is this all in a loop if we're just going to exit on the second try? Would it not be simpler to just call `GetDiskUsage` again after evicting something?
Given the loop, I expected this to continue evicting releases until we had space for another one. Why not do that?

Edit after reading to the bottom: isn't my suggestion what you're actually doing in the loop in the evict function? Why do we need this case then?
The reason we use the loop here? To avoid clumsy repetition of the same code; a loop feels appropriate here.

The check in the first loop (the one that runs twice and no more) looks at the disk usage while accounting for hard links. This is to guarantee that we cover the case where we don't have enough disk space to perform even a single write. As I see it, we must guarantee not to perform this write, which is why we count the space including hard links.

The eviction loop, however, only accounts for the accumulated size of the files in the cache, without accounting for hard links. The reason for this is that we are evicting these files, not the hard links, and we should only base our eviction capacity calculations on what we are able to evict. If we were to check capacity including hard links during eviction, it would lead to 'over-eviction' when called repeatedly, as we are awaiting the deletion of hard links and that is beyond our control. In the worst case we could evict every cached release if we made this mistake.
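The two metrics can be made concrete with a small helper (hypothetical, not the PR's code) that walks the cache directory and computes both: total usage for the pre-write check, and the evictable size that skips files still referenced by hard links:

```go
import (
	"io/fs"
	"path/filepath"
	"syscall"
)

// cacheSizes returns the total size of all cached files (the hard-link-aware
// pre-write metric) and the size the cache could actually reclaim by evicting
// (only files with no outstanding hard links).
func cacheSizes(root string) (totalSize, evictableSize int64, err error) {
	err = filepath.WalkDir(root, func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil || d.IsDir() {
			return walkErr
		}
		info, statErr := d.Info()
		if statErr != nil {
			return statErr
		}
		totalSize += info.Size()
		// Nlink > 1 means a caller's hard link still references this release,
		// so deleting the cached copy would not free space yet. (Linux-specific.)
		if st, ok := info.Sys().(*syscall.Stat_t); ok && st.Nlink == 1 {
			evictableSize += info.Size()
		}
		return nil
	})
	return totalSize, evictableSize, err
}
```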
```go
	if finfo.info.ModTime().Unix() < grace {
		os.Remove(finfo.path)
	}
}
```
Just FYI, it's tough to review stuff like this where you both move and slightly change something. Given that this PR is already pretty big and the changes you made in this case are really a no-op, I think it would have been better to not change this at all. Now that you've done it, it's fine, but I just wanted to make you aware.
I will bear this in mind and try to keep things aligned so they 'diff' better in future. Thanks for raising this.
```go
// prefer files without hardlinks first because
// 1: files without hardlinks are likely to be least recently used anyway
// 2: files without hardlinks will immediately free up storage
queues := i.splitQueueOnHardLinks(files)
```
If we're using the actual space used on disk as our metric here, why consider evicting files with hardlinks at all? Removing one of those won't get us any space back.

The amount of extra engineering here is the reason I suggested we use the number of releases rather than the actual size for this. This extra complexity is bound to make this fragile, but I understand if we don't want to change directions now.
I think there could be some limited edge cases where pruning a release with a hard link could present a small advantage, but certainly nothing too significant. It's safe to include them, but in this case I tend to agree that we should just filter them out.
```diff
@@ -119,19 +145,162 @@ var _ = Describe("installer cache", func() {
 	}
 	Expect(l.Path).ShouldNot(BeEmpty())

-	time.Sleep(1 * time.Second)
+	time.Sleep(10 * time.Millisecond)
```
Why is this sleep necessary? Can you add a comment explaining it if it really is needed?
You are right, I just tested this function again and we don't need this for the assertions we are performing here. I think I was attempting to simulate a small amount of latency between fetching a release using `installers.Get(...)` and calling `Cleanup(...)` on the release. In practice, however, this is not needed for this simple, non-parallel test.
```go
// runParallelTest launches a batch of fetches from the installer cache in order to simulate multiple requests to the same node
// releases are automatically cleaned up as they are gathered
// returns the first error encountered or nil if no error encountered.
runParallelTest := func(params []launchParams) error {
```
I'm not sure this kind of test belongs in unit-tests.
At best if/when it fails it will give inconsistent results.
I totally understand if you want to write these tests for yourself, but I'm not as sure about having them here.
Curious what @rccrdpccl thinks.
The intent here is to have these as smoke tests for the happy path. The general idea is to have clear scenarios where, depending on the scenario, at least one of the parallel processes will fail, or where none will fail. All I expect in these cases is that any time there is a failure they will break the test so that we know there is a problem.

We are making sure that the installer cache is instantiated and sending multiple parallel requests; I think this simulates the environment in which the cache will run reasonably well. Obviously it's not a full 'end to end' context. If I am not mistaken, the "unit" under test is the `Get` function of the installer cache (if not the cache itself). This call has a mutex and we should ensure that it is being correctly used; a parallel test is an appropriate way to do this.

Steps have been taken to reduce the runtime of the code, and they do not hinder the speed of the unit tests; no tests are reported as 'slow'. I don't see this as the comprehensive solution to testing this and expect that a lot more will be done in QE. If we really want to push this further, we could do some e2e tests in openshift/release that push multiple parallel installation attempts.
> At best if/when it fails it will give inconsistent results.

How do you feel the results will be inconsistent? Can you point out an example in this case?
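As a rough illustration of the kind of parallel smoke test under discussion, a sketch like the one below would exercise the mutex in `Get`; the `cache.Get` and `Cleanup` signatures here are assumptions for illustration, not the PR's actual API:

```go
// Fetch and immediately release the same release ID from several goroutines;
// a locking or hard-link naming bug should surface as an error from one of them.
g, gCtx := errgroup.WithContext(context.Background())
for n := 0; n < 8; n++ {
	g.Go(func() error {
		release, err := cache.Get(gCtx, "4.17.0-x86_64") // assumed signature
		if err != nil {
			return err
		}
		return release.Cleanup(gCtx) // assumed signature
	})
}
Expect(g.Wait()).To(Succeed())
```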
```go
// 2: files without hardlinks will immediately free up storage
queues := i.splitQueueOnHardLinks(files)
for _, q := range queues {
	for i.shouldEvict(totalSize) {
```
Because we're evaluating this each loop, it feels very much like we should not need the loop in `Get` and should never hit the capacity error, right? Doesn't this effectively ensure we'll evict releases until we have enough space to download a new one? Or am I misunderstanding something?
Already discussed above, I think something has been misunderstood here.
…in the installer cache

This PR resolves multiple bugs within the installer cache; given the poor condition of the current cache, it makes sense to fix them in a single PR.

* https://issues.redhat.com/browse/MGMT-14452 Installer cache removes in-used cached image when out of space
* https://issues.redhat.com/browse/MGMT-14453 INSTALLER_CACHE_CAPACITY small value cause to assisted-service crash
* https://issues.redhat.com/browse/MGMT-14457 Installer cache - fails to install when running parallel with same version
* Additionally, the cache did not respect limits, so this has been addressed here.

Fixes: I have implemented fixes for each of the following issues.

* Mutex was ineffective as it was not instantiated correctly, leading to [MGMT-14452](https://issues.redhat.com/browse/MGMT-14452), [MGMT-14453](https://issues.redhat.com/browse/MGMT-14453).
* Naming convention for hardlinks changed to be UUID based to resolve [MGMT-14457](https://issues.redhat.com/browse/MGMT-14457).
* Any time we either extract or use a release, the modified time must be updated, not only for cached releases. This was causing premature pruning of hardlinks.
* LRU cache order updated to be based on microseconds instead of seconds.
* Eviction checks updated to consider the max release size and the cache threshold.
* We now check there is enough space before writing.
* During eviction, releases without hard links are evicted before releases with hard links.
@paul-maidment: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.