docs updates

Jonathan Shook 2020-12-02 14:18:39 -06:00
parent 7a403252cc
commit ff6492763d

## RateLimiter Design
The nosqlbench rate limiter is a hybrid design, combining ideas from
well-known algorithms with a heavy dose of mechanical sympathy. The
resulting implementation provides the following:
1. A basic design that can be explained in one page (this page!)
2. High throughput, compared to other rate limiters tested.
3. Graceful degradation with increasing concurrency.
4. Clearly defined behavioral semantics.
5. Efficient burst capability, for tunable catch-up rates.
6. Efficient calculation of wait time.
## Parameters
**rate** - In the simplest terms, users need only configure the *rate*.
For example, `rate=12000` specifies an op rate of 12000 ops/second.

**burst rate** - Additionally, users may specify a burst rate which can be
used to recover unused time when a client is able to go faster than the
strict limit. The burst rate is multiplied by the _op rate_ to arrive at
the maximum rate allowed when wait time is available to recover. For
example, `rate=12000,1.1` specifies that a client may operate at
12000 ops/s _when it is caught up_, while allowing it to go at a rate of
up to 13200 ops/s _when it is behind schedule_.
## Design Principles
The core design of the rate limiter is based on
the [token bucket](https://en.wikipedia.org/wiki/Token_bucket) algorithm
as established in the telecom industry for rate metering. Additional
refinements have been added to allow for flexible and reliable use on
non-realtime systems.

The unit of scheduling used in this design is the token, corresponding
directly to a nanosecond of time. The scheduling time that is made
available to callers is stored in a pool of tokens which is set to a
configured size. The size of the token pool determines how many grants are
allowed to be dispatched before the next one is forced to wait for
available tokens.

At some regular frequency, a filler thread adds tokens (nanoseconds of
time to be distributed to waiting ops) to the pool. The callers which are
waiting for these tokens consume a number of tokens serially. If the pool
does not contain the requested number of tokens, then the caller is
blocked using basic synchronization primitives. When the pool is filled,
any blocked callers are unblocked.
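
To make the blocking behavior concrete, here is a minimal single-pool
sketch in Java. The class and method names are ours, not taken from the
nosqlbench source, and the real implementation adds the multi-pool
structure described below:

```java
/**
 * A minimal sketch of a blocking token pool. Tokens are nanoseconds of
 * scheduling time; callers block until enough tokens are available, and
 * the filler wakes all blocked callers whenever it adds time.
 */
public class SimpleTokenPool {
    private long tokens; // available scheduling time, in nanoseconds

    // Caller side: block until 'amount' tokens can be consumed.
    public synchronized void blockingTake(long amount) throws InterruptedException {
        while (tokens < amount) {
            wait(); // released by notifyAll() in fill(...)
        }
        tokens -= amount;
    }

    // Filler side: add elapsed time and unblock any waiting callers.
    public synchronized void fill(long nanos) {
        tokens += nanos;
        notifyAll();
    }
}
```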

The hybrid rate limiter tracks and accumulates both the passage of system
time and the usage of this time as a measurement of progress. The delta
between these two reference points captures a simple, empirical value of
imposed wait time.

That is, time which was allocated but not used always represents a
slowdown imposed by external factors. When the target rate is taken as a
proxy for user load, this manifests as slower response.
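
Stated as a simple identity (names ours, for illustration):
`wait_time = elapsed_time - consumed_time`, where `elapsed_time` is the
system time accumulated into the pools and `consumed_time` is the token
time actually claimed by callers.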
## Design Details
In fact, there are three pools: the _active_ pool, the _bursting_ pool,
and the _waiting_ pool. The active pool has a limited size based on the
number of operations that are allowed to be granted concurrently.

The bursting pool is sized according to the relative burst rate and the
size of the active pool. For example, with an op rate of 1000 ops/s and a
burst rate of 1.1, the active pool can be sized to 1E9 nanos (one second
of nanos), and the burst pool can be sized to 1E8 nanos (1/10 of that),
thus yielding a combined pool size of 1E9 + 1E8, or 1100000000 ns.

The waiting pool is where all extra tokens are held in reserve. It is
unlimited, except by the size of a long value. The size of the waiting
pool is a direct measure of wait time in nanoseconds.
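
Put in terms of the parameters, this example amounts to
`burst_pool_size = (burst_rate - 1) * active_pool_size`, or
`(1.1 - 1) * 1E9 = 1E8` nanos (names ours, for illustration).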

Within the pools, tokens (time) are neither created nor destroyed. They
are added by the filler based on the passage of time, and consumed by
callers when they become available. In between these operations, the net
sum of tokens is preserved. In short, when time deltas are observed in the
system clock, this time is accumulated into the available scheduling time
of the token pools. In this way, the token pool acts as a metered
dispenser of scheduling time to waiting (or not) consumers.

The filler thread adds tokens to the pool according to the system
real-time clock, at some estimated but unreliable interval. The frequency
of filling is set high enough to give a reliable perception of time
passing smoothly, but low enough to avoid wasting too much thread time on
calling overhead. (It is set to 1K fills/s by default.) Each time filling
occurs, the real-time clock is check-pointed, and the time delta is fed
into the pool-filling logic as explained below.
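
A sketch of that filler loop, reusing the hypothetical `SimpleTokenPool`
from above (the real filler distributes time across the three pools
described earlier):

```java
/** Polls the real-time clock and feeds measured deltas into the pool. */
public class TokenFiller implements Runnable {
    private final SimpleTokenPool pool;
    private long lastSeenAt = System.nanoTime();

    public TokenFiller(SimpleTokenPool pool) {
        this.pool = pool;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long now = System.nanoTime();
            pool.fill(now - lastSeenAt); // feed the measured delta to the pool
            lastSeenAt = now;            // checkpoint the real-time clock
            try {
                Thread.sleep(1);         // ~1K fills/s, best effort only
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // exit the loop
            }
        }
    }
}
```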
## Visual Explanation
The diagram below explains the moving parts of the hybrid rate limiter.
The arrows represent the flow of tokens (ns) as a form of scheduling
currency.

The top box shows an active token filler thread which polls the system
clock and accumulates new time into the token pool.

The bottom boxes represent concurrent readers of the token pool. These are
typically independent threads which do a blocking read for tokens once
they are ready to execute the rate-limited task.

![Hybrid Ratelimiter Schematic](hybrid_ratelimiter.png)

In the middle, the passive component in this diagram is the token pool
itself. When the token filler adds tokens, it never blocks. However, the
token filler can cause any readers of the token pool to unblock so that
they can acquire newly available tokens.

When time is added to the token pool, the following steps are taken:

1) New tokens (based on measured time elapsed since the last fill) are
   added to the active pool until it is full.
2) Any extra tokens are added to the waiting pool.
3) If the waiting pool has any tokens, and there is room in the bursting
   pool, some tokens are moved from the waiting pool to the bursting pool
   according to how many will fit.

When a caller asks for a number of tokens, the combined total from the
active and burst pools is available to that caller. If the number of
tokens needed is not yet available, then the caller will block until
tokens are added.
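
These steps might look roughly like the following Java sketch, with
hypothetical names; the scaling of step 3 by fill time is covered in the
next section and omitted here:

```java
/** Distributes newly measured time across active, waiting, and burst pools. */
public class ThreePoolSketch {
    private long active, burst, waiting;      // current token counts, in ns
    private final long activeSize, burstSize; // configured capacities

    public ThreePoolSketch(long activeSize, long burstSize) {
        this.activeSize = activeSize;
        this.burstSize = burstSize;
    }

    public synchronized void fill(long newTokens) {
        long toActive = Math.min(newTokens, activeSize - active);
        active += toActive;                   // 1) top up the active pool
        waiting += newTokens - toActive;      // 2) bank the overflow as wait time
        long toBurst = Math.min(waiting, burstSize - burst);
        burst += toBurst;                     // 3) move waiting tokens into burst,
        waiting -= toBurst;                   //    as many as will fit
        notifyAll();                          // wake any blocked takers
    }

    public synchronized void blockingTake(long amount) throws InterruptedException {
        while (active + burst < amount) {     // combined total is available
            wait();
        }
        long fromActive = Math.min(amount, active);
        active -= fromActive;
        burst -= amount - fromActive;
    }
}
```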
## Bursting Logic
Tokens in the waiting pool represent time that has not been claimed by a
caller. Tokens accumulate in the waiting pool as a side effect of
continuous filling outpacing continuous draining, thus creating a backlog
of operations.

The pool sizes determine both the maximum instantaneously available
operations as well as the rate at which unclaimed time can be back-filled
into the active or burst pools.
### Normalizing for Jitter
Since it is not possible to schedule the filler thread to trigger on a
strict and reliable schedule (as in a real-time system), the method of
moving tokens from the waiting pool to the bursting pool must account for
differences in timing. Thus, tokens which are activated for bursting are
scaled according to the amount of time added in the last fill, relative to
the maximum active pool. This means that a full pool fill will allow a
full burst pool fill, presuming wait time is positive by that amount. It
also means that the same effect can be achieved by ten consecutive fills
of a tenth the time each. In effect, bursting is normalized to the passage
of time along with the burst rate, with a maximum cap imposed when
operations are unclaimed by callers.
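
One plausible formalization of this scaling (names ours, not from the
source) is to cap each transfer at
`to_burst = min(waiting, burst_room, burst_size * fill_delta / active_size)`,
so that a fill of the full active pool size permits a full burst pool
refill, and ten fills of a tenth the size each permit the same in total.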
## Mechanical Trade-offs
In this implementation, it is relatively easy to explain how accuracy and
performance trade off; they are competing concerns. Consider these two
extremes of an isochronous configuration:
### Slow Isochronous
For example, the rate limiter could be configured for strict isochronous
behavior by setting the active pool size to *one* op worth of nanos and
the burst rate to 1.0, thus disabling bursting. If the op rate requested
is 1 op/s, this configuration will work relatively well, although *any*
caller which doesn't show up (or isn't already waiting) when the tokens
become available will incur a wait-time penalty. The odds of this are
relatively low for a high-velocity client.
### Fast Isochronous
However, if the op rate for this type of configuration is set to 1E8
operations per second, then the filler thread will be adding roughly 1E5
ops worth of time per fill when there is only *one* op worth of active
pool space. This is due to the fact that filling can only occur at a
maximal frequency which has been set to 1K fills/s on average. That will
create artificial wait time, since the token consumers and producers
would not have enough pool space to hold the tokens needed during fill.
It is not possible on most systems to fill the pool at arbitrarily high
fill frequencies. Thus, it is important for users to understand the
limits of the machinery when using high rates. In most scenarios, these
limits will not be onerous.
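
To make the arithmetic explicit: at 1K fills/s, each fill adds about
`1E9 ns / 1E3 fills = 1E6 ns` of token time, while at 1E8 ops/s each op is
only `10 ns` worth, so each fill carries `1E6 / 10 = 1E5` ops worth of
time toward a pool sized for a single op.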
### Boundary Rules
Taking these effects into account, the default configuration makes some
reasonable trade-offs according to the rules below. These rules should
work well for most rates below 50M ops/s. The net effect of these rules is
to increase work bulking within the token pools as rates go higher.
Trying to go above 50M ops/s while also forcing isochronous behavior will
result in artificial wait-time. For this reason, the pool size itself is
not user-configurable at this time.
- The pool size will always be at least as big as two ops. This rule
ensures that there is adequate buffer space for tokens when callers are
accessing the token pools near the rate of the filler thread. If this
were not ensured, then artificial wait time would be injected due to
overflow error.
- The pool size will always be at least as big as 1E6 nanos, or 1/1000 of
a second. This rule ensures that the filler thread has a reasonably
attainable update frequency which will prevent underflow in the active
or burst pools.
- The number of ops that can fit in the pool will determine how many ops
can be dispatched between fills. For example, an op rate of 1E6 will
mean that up to 1000 ops worth of tokens may be present between fills,
and up to 1000 ops may be allowed to start at any time before the next
fill.

For example, the resulting pool sizes at several rates:

- .1 ops/s : 20 seconds worth
- 1 ops/s : 2 seconds worth
- 100 ops/s : .02 seconds worth

In practical terms, this means that rates slower than 1K ops/s will have
their strictness controlled by the burst rate in general, while rates
faster than 1K ops/s will automatically include some op bulking between
fills.
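
The first two rules might be computed roughly as follows (hypothetical
names; the actual sizing logic may differ in detail):

```java
/** Applies the minimum-size rules: at least two ops and at least 1E6 ns. */
public final class PoolSizingSketch {
    static long poolSizeNanos(double opsPerSecond) {
        long nanosPerOp = (long) (1_000_000_000d / opsPerSecond);
        return Math.max(2L * nanosPerOp, 1_000_000L);
    }

    public static void main(String[] args) {
        // 1 op/s: the two-op rule governs -> 2E9 ns (2 seconds worth)
        System.out.println(poolSizeNanos(1d));
        // 1E6 ops/s: the 1E6-ns floor governs -> 1E6 ns (1000 ops fit)
        System.out.println(poolSizeNanos(1_000_000d));
    }
}
```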
## History
A CAS-oriented method which compensated for RTC calling overhead was used
previously. This method afforded very high performance, but it was
difficult to reason about.

This implementation replaces that previous version. Basic synchronization
primitives (implicit locking via synchronized methods) performed
surprisingly well -- well enough to discard the complexity of the previous
implementation.

Further, this version is much easier to study and reason about.
## New Challenges
While the current implementation works well for most basic cases, high CPU
contention has shown that it can become an artificial bottleneck. Based on
observations on higher-end systems with many cores running many threads
and high target rates, it appears that the rate limiter becomes a resource
blocker or forces too much thread management.

Strategies for handling this should be considered:

1) Make callers able to pseudo-randomly (or not randomly) act as a token
filler, such that active consumers can do some work stealing from the
original token filler thread.
2) Analyze the timing and history of a high-contention scenario for
weaknesses in the parameter adjustment rules above.
3) Add internal micro-batching at the consumer interface, such that
contention cost is lower in general.
4) Partition the rate limiter into multiple slices.