Fix - Cache miss CI fall back #3799
base: develop
Conversation
Codecov Report
@@             Coverage Diff              @@
##            develop    #3799      +/-   ##
=============================================
+ Coverage      23.15%   23.16%   +0.01%
  Complexity        26       26
=============================================
  Files           1866     1866
  Lines          35288    35288
  Branches        2782     2782
=============================================
+ Hits            8170     8175       +5
+ Misses         26807    26801       -6
- Partials         311      312       +1
        with:
          path: ~/.m2/repository
          key: ${{ github.run_id }}-${{ github.run_number }}-maven-cache
      - name: Maven artifacts creation # if for some reason there was a cache miss then create the maven artifacts
This would fall back to each test job running the entire build independently - are we sure that's preferable to simply failing the job?
Since you are protecting against cache-get failures, re-attempting the individual job would probably be faster?
I'd suggest introducing the cache retrieval timeout explained in https://github.com/actions/cache/blob/main/tips-and-workarounds.md#cache-segment-restore-timeout - or was that already in place?
The timeout was already in place (10 minutes). In that case I honestly don't know when the platform would become available again, so I preferred to build the artifacts rather than risk losing another 10 minutes on a retry of the cache retrieval (considering that building the artifacts takes roughly 10 minutes anyway).
Analyzing past workflow runs in this repo, I noticed that the cache retrieval exceeded the timeout only in very rare cases, so I think adding this artifact-build step would not be too heavy overall.
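For reference, a minimal sketch of how such a segment-restore timeout is typically configured on a restore step, using the SEGMENT_DOWNLOAD_TIMEOUT_MINS environment variable documented in the actions/cache tips-and-workarounds page. The step name, cache key layout, and the 10-minute value here are illustrative assumptions, not taken verbatim from this repository's workflow:

```yaml
# Illustrative excerpt of a test job (step name and cache key are assumptions).
# SEGMENT_DOWNLOAD_TIMEOUT_MINS aborts a hanging cache segment download after
# the given number of minutes, turning a stuck restore into a cache miss.
- name: Restore Maven artifacts cache
  uses: actions/cache/restore@v3
  env:
    SEGMENT_DOWNLOAD_TIMEOUT_MINS: 10
  with:
    path: ~/.m2/repository
    key: ${{ github.run_id }}-${{ github.run_number }}-maven-cache
```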
Brief description of the PR.
This PR introduces a fall-back strategy for cache-miss cases found in the CI process.
Related Issue
I've noticed some rare occasions where a job got stuck while retrieving the Maven artifacts cache and never managed to restore it. The cause of the problem is described here: https://github.com/actions/cache/blob/main/tips-and-workarounds.md#cache-segment-restore-timeout. Basically, the platform sometimes does not respond and the cache GitHub Action fails to retrieve the cache; to this end, a timeout has been put in place to prevent an overly long restore phase.
Description of the solution adopted
I've inserted a new step in the test jobs that checks whether the previous cache-restore step ended with a cache miss (when the above-mentioned timeout is reached the result is a cache miss, as stated in https://github.com/actions/cache/blob/main/tips-and-workarounds.md#cache-segment-restore-timeout); in that case the step builds the Maven artifacts. To implement this fall-back I used the restore-only variant of the cache action (actions/cache/restore@v3), which prevents the cache save that the full cache action would otherwise perform in its post-run phase.
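A minimal sketch of the step layout described above, assuming a restore step with an explicit id followed by a conditional build step; the step names and the Maven command line are illustrative, not the exact ones used in the kapua workflow:

```yaml
# Restore-only variant: no cache save is attempted in the post-run phase.
- name: Restore Maven artifacts cache
  id: maven-cache
  uses: actions/cache/restore@v3
  with:
    path: ~/.m2/repository
    key: ${{ github.run_id }}-${{ github.run_number }}-maven-cache

# Fall-back: rebuild the Maven artifacts only when the restore reported a miss
# (which is also the outcome when the segment-restore timeout is hit).
- name: Maven artifacts creation
  if: steps.maven-cache.outputs.cache-hit != 'true'
  run: mvn -B clean install -DskipTests
```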
Additionally, considering that a similar failure could happen when the build job saves the cache, I improved that job so that the workflow fails in the (rare) case where the cache cannot be saved. I decided to do so because, if this unlucky event happens, re-running the workflow and hoping the failure does not repeat seemed more time-efficient than having every test job waste time rebuilding the artifacts. I never encountered such an event in the eclipse/kapua workflows I analyzed, but I think it is possible.
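For illustration, one possible shape of that check in the build job, assuming an explicit save step followed by a lookup of the same key. The lookup-only input of actions/cache/restore (check that a key exists without downloading it) is assumed to be available in the action version in use, and all step names are hypothetical:

```yaml
- name: Save Maven artifacts cache
  uses: actions/cache/save@v3
  with:
    path: ~/.m2/repository
    key: ${{ github.run_id }}-${{ github.run_number }}-maven-cache

# Verify that the key can actually be found on the cache service.
- name: Check Maven artifacts cache was saved
  id: cache-check
  uses: actions/cache/restore@v3
  with:
    path: ~/.m2/repository
    key: ${{ github.run_id }}-${{ github.run_number }}-maven-cache
    lookup-only: true

# Fail fast so the whole workflow can simply be re-run instead of letting
# every test job rebuild the artifacts on its own.
- name: Fail if the cache was not saved
  if: steps.cache-check.outputs.cache-hit != 'true'
  run: |
    echo "Maven artifacts cache was not saved, please re-run the workflow." >&2
    exit 1
```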