From 0d64b6060b10f89a199b71bd7d7a82ec7442fdd2 Mon Sep 17 00:00:00 2001 From: Gerwin Klein Date: Mon, 17 Jun 2024 10:27:02 +1000 Subject: [PATCH] import RFC-14 See https://sel4.atlassian.net/browse/RFC-14 Signed-off-by: Gerwin Klein --- src/proposed/0140-budget-limit-threshold.md | 469 ++++++++++++++++++++ 1 file changed, 469 insertions(+) create mode 100644 src/proposed/0140-budget-limit-threshold.md diff --git a/src/proposed/0140-budget-limit-threshold.md b/src/proposed/0140-budget-limit-threshold.md new file mode 100644 index 0000000..37f5beb --- /dev/null +++ b/src/proposed/0140-budget-limit-threshold.md @@ -0,0 +1,469 @@ + + +# MCS: Budget limit thresholds on endpoints for SC Donation + +- Author: Mitchell Johnston +- Proposed: 2023-08-15 + +## Summary + +This is a two stage RFC that proposes to add a new `budget threshold' to +endpoints, which would only allow scheduling context donation to occur if the +SC's available budget exceeds the threshold. This would provide a mechanism to +prevent budget-expiry from occurring in passive servers. + +As a second optional stage, this threshold value could also restrict the budget +that a passive server is permitted to consume on a donated SC. We refer to this +as a `budget limit'. This would help bound priority inversion in a system as +well as reducing the trust that clients must place in passive servers they call. + +## Motivation + +### Thresholds + +This change regarding thresholds is motivated by a desire to prevent budget +expiry from occurring inside passive servers, improving worst-case performance +and response time analysis. Additionally, it would shift timeout exceptions into +being a true error case, rather than an unavoidable, but routine occurrence on +many systems. This is likely to simplify system design. + +Currently, there are almost no restrictions on the amount of budget required to +enter a passive server. This almost inevitably leads to budget expiry, resulting +in undesirable longer blocking times for other clients of the passive server. +This must also be accounted for in schedulability analysis of a system and is +particularly important for servers with clients of varying criticality. + +The kernel currently supports a mechanism to manage budget expiry, timeout +handlers. These allow an error-handling thread to intervene and resolve the +budget expiry. However, as they address the issue after it occurs, these have +some limitations. Timeout handlers can provide extra budget, however, it is not +preferable to reward clients with additional budget just because they are +blocking the rest of the system. Alternatively, the server can be reverted to a +previous state, however this wastes any work the server performed and can be +expensive or impractical, depending on the nature of the passive server. + +This RFC proposes a new mechanism, endpoint thresholds, which aim to prevent +budget expiry from occurring in passive servers, rather than reacting to it +after it occurs. + +### Budget Limits + +The second stage of budget limits is motivated by the current implicit +requirement that clients must completely trust passive servers that they call. +Under the current model, a server can consume as much budget as it desires on a +donated SC and is not forced to return it. A malfunctioning or malicious passive +server can therefore run indefinitely while also possessing a higher priority +than its clients, significantly impeding their ability to make progress. + +The impact of this extends to any thread with a priority between that of the +passive server and its clients, even threads that are not themselves clients of +the passive server. Enforcing stronger bounds on passive server SC consumption +would reduce the level of trust threads in a system need to have in one another. + +## Guide-level explanation + +### Thresholds Guide + +This change involves adding a new attribute to endpoint objects, known as a +threshold. By default, this threshold value would be set to zero and will have +no effect. Therefore, existing code will not need to be updated to accommodate +thresholds. + +If the threshold is set to a non-zero value, sending invocations performed on +the endpoint will only be permitted if they donate a scheduling context with +available budget greater than or equal to the threshold. Available budget is the +amount of budget ready for use in the SC (the sum of refills with a release time +greater than the current time). By setting the threshold to the WCET of the +passive server associated with the endpoint, this guarantees that the server +will always receive an SC with sufficient budget to complete its execution, +preventing budget expiry from occurring. This will also ensure that timeout +exceptions will only occur as a result of a true error, such as a misconfigured +threshold value or a fault in the server code. + +If a client has insufficient available budget in its SC, its refills will be +deferred and merged until its head refill has sufficient budget at some point in +the future. Once the thread resumes, the IPC invocation will transparently +continue. If a client's maximum budget is less than the threshold value, an +error will instead be returned to the client. + +The time required to check an SC's available budget and to defer and merge are +both bounded by the maximum number of refills in an SC. When designing a system +using an endpoint with a threshold, this will be an additional factor to +consider regarding the optimal choice for the maximum number of SC refills. + +The threshold value of an endpoint can be set via an object invocation, +performed indirectly through an invocation on the CNode, similar to the existing +Cancel Badged Sends invocation. However, we restrict the right to set the +endpoint threshold to only the original unbadged capability. + +Changes to the threshold value are propagated weakly, meaning that clients only +interact with the threshold value at the moment where they invoke the endpoint. +If a threshold value is increased, clients already enqueued on the endpoint are +unaffected, even if they now possess insufficient budget. Similarly, threads +which have had their budget deferred will not be woken up early, even if the +threshold value is reduced. As the threshold is intended to reflect the WCET of +the passive server, changing it would be a rare operation. Therefore, we +consider this weak propagation behaviour acceptable. + +Additionally, invocations which do not support scheduling context donation are +no longer permitted to invoke an endpoint with a non-zero threshold set. They +instead fail immediately as an invalid operation. This prevents a passive server +from performing a rendezvous with a client that did not donate an SC, which +would result in the server being unable to run, which causes similar issues to +budget expiry. Invoking a passive server without donating an SC is fundamentally +an invalid operation, so it is therefore desirable for the IPC to fail upon the +initial send, rather than causing an exception only after rendezvousing with the +server. + +As an optional extension, the new defer and merge behaviour of SC's can also be +made available to threads as an explicit system call. We have provisionally +called this YieldUntilBudget, denoting it as a variant of Yield. This would +allow threads to specify a desired quantity of budget, after which the kernel +will merge and defer refills until the thread's head refill exceeds the desired +budget. We do emphasise that this additional system call is not required for the +functionality of thresholds. However, it will essentially be implemented +regardless as part of thresholds and exposing it to users is an easy change. + +### Budget Limits Guide + +The optional budget limit change extends the meaning of an endpoint threshold +value to also restrict budget consumption by a passive server on a donated SC. +An additional bit is added to endpoint objects to toggle budget limit behaviour. +By default, budget limits are disabled and only the aforementioned threshold +behaviour applies. If the threshold is non-zero and budget limits are enabled, +then both threshold and budget limit behaviour will be in effect. A budget limit +has a no effect if the threshold is zero, regardless of the value of the budget +limit bit. + +If a passive server returns a donated SC having consumed less than the budget +limit (equal to the threshold value), then the budget limit has no effect. +However, if the server's consumption reaches the budget limit, the kernel will +forcibly return the SC to the caller, ensuring that the server’s consumption +does not exceed the limit. This will almost certainly leave the server in a +stuck state, so the server's timeout handler will be invoked to recover it. + +When an SC is donated over an endpoint with budget limits enabled, that SC is +marked as possessing a budget limit. An SC with a budget limit can only be +further donated over endpoints that also have budget limits set. + +During such a donation, the kernel must still determine whether the donated SC +has a sufficient budget (greater than the threshold). However, rather than +considering the total available budget in the SC, the kernel considers the +budget remaining until the existing budget limit would be reached. This allows +the kernel to guarantee that every server is able to use its full budget limit +allocation, while also being able to enforce the limit if any server exceeds its +allocated budget limit. The SC is only marked as no longer possessing a budget +limit once it is returned over the original budget limit endpoint. + +## Reference-level explanation + +### Thresholds Reference + +This RFC proposes to add a `ticks_t` value to endpoint objects as the +`threshold'. It is also viable to associate the threshold with endpoint +capabilities, this is discussed in the rationale and alternative section below. +For every sending IPC invocation, fastpath or slowpath, the kernel checks the +threshold value of the relevant object. If it is set to zero, this represents +threshold behaviour being disabled and the IPC continues, with no further +behaviour changes. + +However, if the threshold value is non-zero, then the kernel calculates the +available budget of the thread (released refills in the SC) and compares against +the budget required to pass the threshold. The intention is for the 'threshold' +value to represent the required runtime of a passive server. Therefore, the +budget to pass the threshold is the 'threshold' value plus twice the kernel +WCET, to account for the call and reply system calls. Additionally, the kernel +must account for time consumed by the client, but not yet charged to its SC. If +the available budget is sufficient, again, the IPC continues without further +behaviour changes. + +However, if the available budget is insufficient, alternate action needs to be +taken. At this point, the fastpath and slowpath diverges. If the available +budget check on the fastpath fails, the kernel switches over to the slowpath to +reduce complexity on the fastpath. + +Returning to the slowpath behaviour, after determining that the thread has +insufficient available budget, the thread's maximum budget is compared to the +threshold. If the maximum budget is insufficient, an error is returned to the +client. Otherwise, the client's usage is first charged to it, and then its SC's +refills are merged and deferred until the head refill is sufficient to pass the +threshold. This will mostly likely be at some point in the future, so the client +will be inserted into the release queue and the system call will be restarted +once the refill is released. + +### Budget Limits Reference + +Supporting budget limits requires multiple changes in addition to those required +for thresholds. Each endpoint object requires an additional bit to represent the +budget limit toggle, however this does not require an increase in the object +size. SC's also require additional fields to track whether they currently +possess a budget limit, along with a budget consumption field. Finally, reply +objects have an additional limit field added to track the budget limit currently +in effect. + +When a thread and SC (without a budget limit in effect) calls over an endpoint +with a budget limit set, the donated SC is marked as possessing a budget limit +and its budget consumption is reset to zero. The reply object's limit field is +then set to the threshold value. When setting the timer interrupt, the kernel +additionally compares the active SC's consumed budget field against the limit in +the reply object. This allows the kernel to preempt the thread if it exceeds the +budget limit. + +When a thread, where the SC has a budget limit in effect, calls over an +endpoint, the kernel only permits the operation if the endpoint has a budget +limit set.The remaining budget limit is then compared to the endpoint's +threshold value and the operation is only permitted if the remaining budget +exceeds the threshold value. The SC’s consumed budget field is not reset, but +the reply object’s limit field is set to the SC’s consumed budget plus the +endpoint threshold value. This allows the kernel to track and enforce the new +budget limit, while not affecting the information required to enforce the +original budget limit. + +If a thread with a budget limit enabled exceeds its budget limit, the kernel +preempts it and returns the SC one layer down to the caller. In a multi-level +call chain, other budget limits further down the chain remain in effect. + +This functionality requires additional overhead on the fastpath, in particular +reprogramming the timer interrupt. + +### Performance summary + +In this section we present a summary of performance results, checking the +increased overhead that would result from implementing these proposed changes. +Three broad cases were tested: + +- Baseline: The baseline MCS kernel, without any changes. + +- Threshold disabled: A kernel with our proposed changes, but tested over an + endpoint that does not enforce threshold or budget limit behaviour. + +- Threshold/Budget limit enabled: A kernel with the proposed changes, over an + endpoint that enforces threshold/budget limit behaviour. + +#### Thresholds only + +First, we compare successful IPC cost of thresholds only (no budget limits) +against the baseline kernel using the standard seL4bench IPC tests. + +These tests were performed with an SC with a single refill. + +| | **Fastpath** | **Fastpath overhead** | **Slowpath** | **Slowpath overhead** | +| ------------------- | ------------ | --------------------- | ------------ | --------------------- | +| Baseline | 269 (3) | N/A | 945 (12) | N/A | +| Thresholds disabled | 276 (4) | 3% | 959 (10) | 1% | +| Thresholds enabled | 304 (2) | 13% | 976 (12) | 3% | + +*IPC Call overhead from endpoint thresholds. Results are cycles, presented as: +mean (standard deviation)* + +Thresholds introduce a new case to IPC, whereby instead of succeeding +immediately, the thread's budget is deferred. We measured the cost of deferring +budget for SC's with various numbers of refills. + +| **Operation** | **Extra Refills** | **Fastpath** | **Slowpath** | +| --------------- | ----------------: | -----------: | -----------: | +| IPC call | N/A | 272 (2) | 962 (15) | +| IPC block | N/A | 852 (11) | 796 (17) | +| Threshold defer | 0 | 1150 (24) | 1015 (17) | +| Threshold defer | 10 | 1391 (18) | 1300 (26) | +| Threshold defer | 20 | 1665 (25) | 1545 (19) | +| Threshold defer | 30 | 1942 (30) | 1819 (20) | +| Threshold defer | 40 | 2174 (18) | 2079 (23) | +| Threshold defer | 50 | 2446 (24) | 2331 (29) | + +*IPC Call defer costs. Results are cycles, presented as: mean (standard +deviation)* + +#### With Budget Limits + +In this section, we consider the costs on a system that supports both thresholds +and budget limits. We compare the increase in overhead for both IPC Call and +ReplyRecv. + +First, overheads for IPC Call: + +| | **Fastpath** | **FP overhead** | **Slowpath** | **SP overhead** | +| -------------------- | -----------: | --------------: | -----------: | --------------: | +| Baseline | 269 (3) | N/A | 945 (12) | N/A | +| Thresholds disabled | 291 (2) | 8% | 1068 (14) | 13% | +| Budget limit enabled | 327 (0) | 22% | 1108 (18) | 17% | +| | | | | | + +*IPC Call overhead from budget limit thresholds. Results are cycles, presented +as: mean (standard deviation)* + +Next, the overheads for ReplyRecv: + +| | Fastpath | FP overhead | Slowpath | SP overhead | +| :------------------- | -------: | ----------: | --------: | ----------: | +| Baseline | 290 (0) | N/A | 972 (16) | N/A | +| Thresholds disabled | 298 (3) | 3% | 1131 (20) | 16% | +| Budget limit enabled | 352 (5) | 21% | 1208 (17) | 24% | + +*IPC ReplyRecv overhead from budget limit thresholds. Results are cycles, +presented as: mean (standard deviation)* + +### Other implementation details + +An example implementation and draft pull request, for clarity of changes, is +available here: + +- +- + +The core of this change is additional code on the IPC path that checks if a +calling client has sufficient budget compared to the threshold and if not, +deferring and merging until its head refill has sufficient budget. The size of +endpoint objects has been increased by 1 bit to make room to store the threshold +value, which is a `ticks_t` value. On a 64-bit system, this increases the size +from 16 bytes to 32 bytes. + +The calculation of available budget requires summing over all of an SC's +released refills. This is not a major issue, but does introduce a new kernel +time dependence on the refill size of an SC. The same is true with deferring and +merging refills. There already exists a performance trade-off regarding max +refill sizes, this is simply an additional factor for system designers to +consider. + +## Drawbacks + +The primary drawback is that this change will increase the IPC path cost, +imposing an additional cost on all IPC calls. This is true even for endpoints +with thresholds set to zero, as the kernel still needs to check whether it +should impose the threshold behaviour restrictions. However, while non-zero, +this checking cost is small. Where thresholds are enforced, there will be a +modest performance impact. However, as outlined above, budget limits will impose +a more significant performance cost. + +## Rationale and alternatives + +A key benefit is that this change will support better schedulability analysis +and reduced server complexity. The threshold change trades slightly slower +average-case performance for much improved worst-case performance. However, this +more predictable behaviour is of particular benefit for schedulability analysis. +In particular, this predictability is highly important for real-time systems +that the MCS kernel is targeted at. Further, this change also moves timeout +faults into a true error case, rather than an unavoidable consequence of passive +servers. + +The current model supports changes to the threshold value, however, changes are +propagated very weakly. As described above, budgets are only compared to the +threshold value at the point where a thread invokes an endpoint. Threads already +enqueued on the endpoint or waiting for sufficient budget are only affected by a +change to the threshold value when they invoke the endpoint again. + +We had considered stronger threshold propagation models, whereby all threads +enqueued on the endpoint or waiting for sufficient budget were affected by +changes to the threshold value immediately. Upon a change to the threshold +value, threads would be enqueued or dequeued from the endpoint to reflect the +change. However, this behaviour involved significant additional kernel +complexity, in particular requiring two additional release queues. Briefly, this +is because deferring and merging budget would not be possible, as a thread's +available budget must be preserved in case the threshold value is decreased. +This required significant additional infrastructure to track the threads and +their budgets, while also waiting for additional refill releases. Given that +changing an endpoint threshold is expected to be a rare operation, we do not +consider this compromise of kernel minimality worth the benefit. Therefore, we +implemented the weak propagation behaviour described above. + +Alternative associations for the threshold value were considered, namely the +passive server's TCB and endpoint capabilities, rather than objects. + +- Association with the server's TCB would add significant kernel complexity with + little benefit. Threshold values would need to be checked at the point of IPC + rendezvous, rather than at the point of entering the endpoint. This could + require endpoints to have two queues, one containing servers that ready to + receive and another with clients that are ready to send, but unable to + rendezvous because of insufficient budget. + +- Association with endpoint capabilities would allow different client's entry + points to have different thresholds. There is potentially a use for different + threshold value representing different operations into a passive server, with + correspondingly different WCET's. However, this would require increasing the + size of endpoint capabilities. Due to the nature of the capability system, + this would also require increasing the size of all capabilities. This is not + an insurmountable issue and may be the best design. + +### Budget limit call stack restriction + +For budget limits, we introduced the restriction that threads with a budget +limit in effect could only donate their SC over an endpoint with a budget limit. +Originally, we did consider a model that did not restrict how users could +construct call stacks, however there were significant downsides. We now +illustrate this with an example. + +Consider a client (C) that calls a server (S) over a budget limit endpoint. +Then, over endpoints without budget limits set, S donates its SC to s_1, s_1 to s_2 +and so on to s_n. We outline two system behaviour assumptions that we believe are +reasonable: + +- The original budget limit guarantee made from S to C should still be enforced + regardless of which thread the SC is bound to at the time. If this is not + enforced then the usefulness of the budget limit to guarantee scheduling + properties is greatly diminished. + +- Further, any thread that is skipped over by the returning SC needs to have its + timeout handler invoked, to restore them to a sane state. This is required for + the system to remain usable. Otherwise, some mechanism must exist to notify + some user-level error-handling thread of which server threads are stuck and + allow it to recover them. This is essentially the same concept as a + timeout/error handler regardless. + +Therefore, if the budget limit is exceeded by s_n, the reply stack needs to be +traversed until the original client is found, triggering timeout handlers at +each level as well. The result is that kernel execution time would be O(n) +dependent on the size of the call stack created, which we consider unreasonable. + +Further, S is making a guarantee to C that only a certain amount of budget will +be consumed on the donated SC, that being the semantic meaning of the budget +limit. It would be a poor design for S to then donate that SC to another server +(s_1) with no timing guarantees, as if s_1 consumes too much budget, S will also +be skipped over when the SC is returned to C. This means that S cannot guarantee +it will meet its own scheduling obligations. Therefore, we consider it a +reasonable design restriction to only allow S to donate its SC to servers that +also provide a budget limit guarantee. + +If S needs to donate its SC to a server (s_1) that cannot provide a budget limit +guarantee, we believe it is most sensible for S to not offer a budget limit +either, as S does not know how much budget will be required. Alternatively, we +believe adding a budget limit to s_1 would also be a sensible design pattern. + +## Prior art + +See also this [honours thesis], where this proposed change was discussed in more +detail. + +[honours thesis]: https://trustworthy.systems/publications/theses_public/22/Johnston%3Abe.abstract + +## Unresolved questions + +What is the best API for the new Yield system call? + +- The cleanest API would probably be to change `seL4_Yield()` to accept a single + parameter. If the parameter is zero, the existing Yield behaviour occurs, + otherwise, a non-zero value triggers the new defer and merge behaviour. + Unfortunately, this would be a breaking API change. However, the change + required would be fairly simple, replacing every `seL4_Yield()` with + `seL4_Yield(0)`. + +- Same as above, but make the true system call `seL4_YieldUntilBudget(time_t + val)`. This would allow `seL4_Yield()` to be made a macro for + `seL4_YieldUntilBudget(0)`. This seems to be a messier API, but maintains + backward API compatibility. It would however, break existing binary + compatibility. + +- Two true system calls, the existing `seL4_Yield()` and the new + `seL4_YieldUntilBudget`. This seems to be less preferable from an API + perspective (and potentially performance, though this has_n't been tested), but + also maintains backward compatibility. + +We restrict sending invocations on an endpoint with a threshold set to only those that permit SC donation. There are some possible extensions of this: + +- It is potentially reasonable to also restrict receiving invocations to only + those that allow SC donation? For a passive server, receiving in a manner that + does not permit SC donation is an invalid configuration. This change would + cause an explicit error upon invocation, rather than the server getting stuck + after IPC rendezvous. This is potentially desirable.