## RateLimiter Design The nosqlbench rate limiter is a hybrid design, combining ideas from well-known algorithms with a heavy dose of mechanical sympathy. The resulting implementation provides the following: 1. A basic design that can be explained in one page (this page!) 2. High throughput, compared to other rate limiters tested. 3. Graceful degradation with increasing concurrency. 4. Clearly defined behavioral semantics. 5. Efficient burst capability, for tunable catch-up rates. 6. Efficient calculation of wait time. ## Parameters **rate** - In simplest terms, users simply need to configure the *rate*. For example, `rate=12000` specifies an op rate of 12000 ops/second. **burst rate** - Additionally, users may specify a burst rate which can be used to recover unused time when a client is able to go faster than the strict limit. The burst rate is multiplied by the _op rate_ to arrive at the maximum rate when wait time is available to recover. For example, `rate=12000,1.1` specifies that a client may operate at 12000 ops/s _when it is caught up_, while allowing it to go at a rate of up to 13200 ops/s _when it is behind schedule_. ## Design Principles The core design of the rate limiter is based on the [token bucket](https://en.wikipedia.org/wiki/Token_bucket) algorithm as established in the telecom industry for rate metering. Additional refinements have been added to allow for flexible and reliable use on non-realtime systems. The unit of scheduling used in this design is the token, corresponding directly to a nanosecond of time. The scheduling time that is made available to callers is stored in a pool of tokens which is set to a configured size. The size of the token pool determines how many grants are allowed to be dispatched before the next one is forced to wait for available tokens. At some regular frequency, a filler thread adds tokens (nanoseconds of time to be distributed to waiting ops) to the pool. The callers which are waiting for these tokens consume a number of tokens serially. If the pool does not contain the requested number of tokens, then the caller is blocked using basic synchronization primitives. When the pool is filled any blocked callers are unblocked. The hybrid rate limiter tracks and accumulates both the passage of system time and the usage rate of this time as a measurement of progress. The delta between these two reference points in time captures a very simple and empirical value of imposed wait time. That is, the time which was allocated but which was not used always represents a slow down which is imposed by external factors. This manifests as slower response when considering the target rate to be equivalent to user load. ## Design Details In fact, there are three pools. The _active_ pool, the _bursting_ pool, and the _waiting_ pool. The active pool has a limited size based on the number of operations that are allowed to be granted concurrently. The bursting pool is sized according to the relative burst rate and the size of the active pool. For example, with an op rate of 1000 ops/s and a burst rate of 1.1, the active pool can be sized to 1E9 nanos (one second of nanos), and the burst pool can be sized to 1E8 (1/10 of that), thus yielding a combined pool size of 1E9 + 1E8, or 1100000000 ns. The waiting pool is where all extra tokens are held in reserve. It is unlimited except by the size of a long value. The size of the waiting pool is a direct measure of wait time in nanoseconds. Within the pools, tokens (time) are neither created nor destroyed. They are added by the filler based on the passage of time, and consumed by callers when they become available. In between these operations, the net sum of tokens is preserved. In short, when time deltas are observed in the system clock, this time is accumulated into the available scheduling time of the token pools. In this way, the token pool acts as a metered dispenser of scheduling time to waiting (or not) consumers. The filler thread adds tokens to the pool according to the system real-time clock, at some estimated but unreliable interval. The frequency of filling is set high enough to give a reliable perception of time passing smoothly, but low enough to avoid wasting too much thread time in calling overhead. (It is set to 1K/s by default). Each time filling occurs, the real-time clock is check-pointed, and the time delta is fed into the pool filling logic as explained below. ## Visual Explanation The diagram below explains the moving parts of the hybrid rate limiter. The arrows represent the flow of tokens (ns) as a form of scheduling currency. The top box shows an active token filler thread which polls the system clock and accumulates new time into the token pool. The bottom boxes represent concurrent readers of the token pool. These are typically independent threads which do a blocking read for tokens once they are ready to execute the rate-limited task. ![Hybrid Ratelimiter Schematic](hybrid_ratelimiter.png) In the middle, the passive component in this diagram is the token pool itself. When the token filler adds tokens, it never blocks. However, the token filler can cause any readers of the token pool to unblock so that they can acquire newly available tokens. When time is added to the token pool, the following steps are taken: 1) New tokens (based on measured time elapsed since the last fill) are added to the active pool until it is full. 2) Any extra tokens are added to the waiting pool. 3) If the waiting pool has any tokens, and there is room in the bursting pool, some tokens are moved from the waiting pool to the bursting pool according to how many will fit. When a caller asks for a number of tokens, the combined total from the active and burst pools is available to that caller. If the number of tokens needed is not yet available, then the caller will block until tokens are added. ## Bursting Logic Tokens in the waiting pool represent time that has not been claimed by a caller. Tokens accumulate in the waiting pool as a side-effect of continuous filling outpacing continuous draining, thus creating a backlog of operations. The pool sizes determine both the maximum instantaneously available operations as well as the rate at which unclaimed time can be back-filled back into the active or burst pools. ### Normalizing for Jitter Since it is not possible to schedule the filler thread to trigger on a strict and reliable schedule (as in a real-time system), the method of moving tokens from the waiting pool to the bursting pool must account for differences in timing. Thus, tokens which are activated for bursting are scaled according to the amount of time added in the last fill, relative to the maximum active pool. This means that a full pool fill will allow a full burst pool fill, presuming wait time is positive by that amount. It also means that the same effect can be achieved by ten consecutive fills of a tenth the time each. In effect, bursting is normalized to the passage of time along with the burst rate, with a maximum cap imposed when operations are unclaimed by callers. ## Mechanical Trade-offs In this implementation, it is relatively easy to explain how accuracy and performance trade-off. They are competing concerns. Consider these two extremes of an isochronous configuration: ### Slow Isochronous For example, the rate limiter could be configured for strict isochronous behavior by setting the active pool size to *one* op of nanos and the burst rate to 1.0, thus disabling bursting. If the op rate requested is 1 op/s, this configuration will work relatively well, although *any* caller which doesn't show up (or isn't already waiting) when the tokens become available will incur a waittime penalty. The odds of this are relatively low for a high-velocity client. ### Fast Isochronous However, if the op rate for this type of configuration is set to 1E8 operations per second, then the filler thread will be adding 100 ops worth of time when there is only *one* op worth of active pool space. This is due to the fact that filling can only occur at a maximal frequency which has been set to 1K fills/s on average. That will create artificial wait time, since the token consumers and producers would not have enough pool space to hold the tokens needed during fill. It is not possible on most systems to fill the pool at arbitrarily high fill frequencies. Thus, it is important for users to understand the limits of the machinery when using high rates. In most scenarios, these limits will not be onerous. ### Boundary Rules Taking these effects into account, the default configuration makes some reasonable trade-offs according to the rules below. These rules should work well for most rates below 50M ops/s. The net effect of these rules is to increase work bulking within the token pools as rates go higher. Trying to go above 50M ops/s while also forcing isochronous behavior will result in artificial wait-time. For this reason, the pool size itself is not user-configurable at this time. - The pool size will always be at least as big as two ops. This rule ensures that there is adequate buffer space for tokens when callers are accessing the token pools near the rate of the filler thread. If this were not ensured, then artificial wait time would be injected due to overflow error. - The pool size will always be at least as big as 1E6 nanos, or 1/1000 of a second. This rule ensures that the filler thread has a reasonably attainable update frequency which will prevent underflow in the active or burst pools. - The number of ops that can fit in the pool will determine how many ops can be dispatched between fills. For example, an op rate of 1E6 will mean that up to 1000 ops worth of tokens may be present between fills, and up to 1000 ops may be allowed to start at any time before the next fill. .1 ops/s : .2 seconds worth 1 ops/s : 2 seconds worth 100 ops/s : 2 seconds worth In practical terms, this means that rates slower than 1K ops/S will have their strictness controlled by the burst rate in general, and rates faster than 1K ops/S will automatically include some op bulking between fills. ## History A CAS-oriented method which compensated for RTC calling overhead was used previously. This method afforded very high performance, but it was difficult to reason about. This implementation replaces that previous version. Basic synchronization primitives (implicit locking via synchronized methods) performed surprisingly well -- well enough to discard the complexity of the previous implementation. Further, this version is much easier to study and reason about. ## New Challenges While the current implementation works well for most basic cases, high CPU contention has shown that it can become an artificial bottleneck. Based on observations on higher end systems with many cores running many threads and high target rates, it appears that the rate limiter becomes a resource blocker or forces too much thread management. Strategies for handling this should be considered: 1) Make callers able to pseudo-randomly (or not randomly) act as a token filler, such that active consumers can do some work stealing from the original token filler thread. 2) Analyze the timing and history of a high-contention scenario for weaknesses in the parameter adjustment rules above. 3) Add internal micro-batching at the consumer interface, such that contention cost is lower in general. 4) Partition the rate limiter into multiple slices.