mirror of
https://github.com/nosqlbench/nosqlbench.git
synced 2025-01-13 09:22:05 -06:00
184 lines
9.0 KiB
Markdown
184 lines
9.0 KiB
Markdown
# Linearized Operations
|
|
|
|
NOTE: This is a sketch/work in progress and will not be suitable for earnest review until this notice is removed.
|
|
|
|
Thanks to Seb and Wei for helping this design along with their discussions along the way.
|
|
|
|
See https://github.com/nosqlbench/nosqlbench/issues/136
|
|
|
|
Presently, it is possible to stitch together rudimentary chained operations, as long as you already know how statement
|
|
sequences, bindings functions, and thread-local state work. This is a significant amount of knowledge to expect from a
|
|
user who simply wants to configure chained operations with internal dependencies.
|
|
|
|
The design changes needed to make this easy to express are non-trivial and cut across a few of the extant runtime
|
|
systems within nosqlbench. This design sketch will try to capture each of the requirements and approached sufficiently
|
|
for discussion and feedback.
|
|
|
|
# Sync and Async
|
|
|
|
## As it is: Sync vs Async
|
|
|
|
The current default mode (without `async=`) emulates a request-per-thread model, with operations being planned in a
|
|
deterministic sequence. In this mode, each thread dispatches operations from the sequence only after the previous one is
|
|
fully completed, even if there is no dependence between them. This is typical of many applications, even today, but not
|
|
all.
|
|
|
|
On the other end of the spectrum is the fully asynchronous dispatch mode enabled with the `async=` option. This uses a
|
|
completely different internal API to allow threads to juggle a number of operations. In contrast to the default mode,
|
|
the async mode dispatches operations eagerly as long as the user's selected concurrency level is not yet met. This means
|
|
that operations may overlap and also occur out of order with respect to the sequence.
|
|
|
|
Choosing between these modes is a hard choice that does not offer a uniform way of looking at operations. As well, it
|
|
also forces users to pick between two extremes of all request-per-thread or all asynchronous, which is becoming less
|
|
common in application designs, and at the very least does not rise to the level of expressivity of the toolchains that
|
|
most users have access to.
|
|
|
|
## As it should be: Async with Explicit Dependencies
|
|
|
|
* The user should be able to create explicit dependencies from one operation to another.
|
|
* Operations which are not dependent on other operations should be dispatched as soon as possible within the concurrency
|
|
limits of the workload.
|
|
* Operations with dependencies on other operations should only be dispatched if the upstream operations completed
|
|
successfully.
|
|
* Users should have clear expectations of how error handling will occur for individual operations as well
|
|
as chains of operations.
|
|
|
|
# Dependent Ops
|
|
|
|
We are using the phrase _dependent ops_ to capture the notions of data-flow dependency between ops (implying
|
|
linearization in ordering and isolation of input and output boundaries), successful execution, and data sharing within
|
|
an appropriate scope.
|
|
|
|
## As it is: Data Flow
|
|
|
|
Presently, you can store state within a thread local object map in order to share data between operations. This is using
|
|
the implied scope of "thread local" which works well with the "sequence per thread, request per thread" model. This
|
|
works because both the op sequence as well as the variable state used in binding functions are thread local.
|
|
|
|
However, it does not work well with the async mode, since there is no implied scope to tie the variable state to the op
|
|
sequence. There can be many operations within a thread operating on the same state even concurrently. This may appear to
|
|
function, but will create problems for users who are not aware of the limitation.
|
|
|
|
## As it should be: Data Flow
|
|
|
|
* Data flow between operations should be easily expressed with a standard configuration primitive which can work across
|
|
all driver types.
|
|
* The scope of data shared should be
|
|
|
|
The scope of a captured value should be clear to users
|
|
|
|
## As it is: Data Capture
|
|
|
|
Presently, the CQL driver has additional internal operators which allow for the capture of values. These decorator
|
|
behaviors allow for configured statements to do more than just dispatch an operation. However, they are not built upon
|
|
standard data capture and sharing operations which are implemented uniformly across driver types. This makes scope
|
|
management largely a matter of convention, which is ok for the first implementation (in the CQL driver) but not as a
|
|
building block for cross-driver behaviors.
|
|
|
|
# Injecting Operations
|
|
|
|
## As it is: Injecting Operations
|
|
|
|
Presently operations are derived from statement templates on a deterministic op sequence which is of a fixed length
|
|
known as the stride. This follows closely the pattern of assuming each operation comes from one distinct cycle and that
|
|
there is always a one-to-one relationship with cycles. This has carried some weight internally in how metrics for cycles
|
|
are derived, etc. There is presently no separate operational queue for statements except by modifying statements in the
|
|
existing sequence with side-effect binding assignment. It is difficult to reason about additional operations as
|
|
independent without decoupling these two into separate mechanisms.
|
|
|
|
## As it should be: Injecting Operations
|
|
|
|
|
|
|
|
## Seeding Context
|
|
|
|
# Diagrams
|
|
![idealized](idealized.svg)
|
|
|
|
## Op Flow
|
|
|
|
To track
|
|
|
|
|
|
Open concerns
|
|
|
|
- before: variable state was per-thread
|
|
- now: variable state is per opflow
|
|
- (opflow state is back-filled into thread local as the default implementation)
|
|
|
|
* gives scope for enumerating op flows, meaning you opflow 0... opflow (cycles/stride)
|
|
* 5 statements in sequence, stride=5,
|
|
|
|
- scoping for state
|
|
- implied data flow dependence vs explicit data flow dependence
|
|
- opflow retries vs op retries
|
|
|
|
discussion
|
|
|
|
```yaml
|
|
bindings:
|
|
yesterday: HashRange(0L,1234234L);
|
|
statements:
|
|
- s1-with-binding: select [userid*] from foobar.baz where day=23
|
|
- s2-with-binding: select [userid],[yesterday] from accounts where id={id} and timestamp>{yesterday}
|
|
- s3-with-dependency: select login_history from sessions where userid={[userid]}
|
|
- rogue-statement: select [yesterday] from ... <--- WARN USER because of explicit dependency below
|
|
- s4: select login_history from sessions where userid={[userid]} and timestamp>{yesterday}
|
|
- s5: select login_history from sessions where userid={[userid]} and timestamp>{[s2-with-binding/yesterday]}
|
|
```
|
|
|
|
## Dependency Indirection
|
|
|
|
## Error Handling and DataFlow Semantics
|
|
|
|
## Capture Syntax
|
|
|
|
Capturing of variables in statement templates will be signified with `[varname]`. This examples represents the simplest
|
|
case where the user just wants to capture a varaible. Thus the above is taken to mean:
|
|
|
|
- The scope of the captured variable is the OpFlow.
|
|
- The operation is required to succeed. Any other operation which depends on a `varname` value will be skipped and
|
|
counted as such.
|
|
- The captured type of `varname` is a single object, to be determined dynamically, with no type checking required.
|
|
- A field named `varname` is required to be present in the result set for the statement that included it.
|
|
- Exactly one value for `varname` is required to be present.
|
|
- Without other settings to relax sanity constraints, any other appearance of `[varname]` in another active statement
|
|
should yield a warning to the user.
|
|
|
|
All behavioral variations that diverge from the above will be signified within the capture syntax as a variation on the
|
|
above example.
|
|
|
|
## Inject Syntax
|
|
|
|
Similar to binding tokens used in statement templates like '{varname}', it is possible to inject captured variables into
|
|
statement templates with the `{[varname]}` syntax. This indicates that the user explicitly wants to pull a value
|
|
directly from the captured variable. It is necessary to indicate variable capture and variable injection distinctly from
|
|
each other, and this syntax supports that while remaining familiar to the bindings formats already supported.
|
|
|
|
The above syntax example represents the case where the user simply wants to refer to a variable of a given name. This is
|
|
the simplest case, and is taken to mean:
|
|
|
|
- The scope of the variable is not specified. The value may come from OpFlow, thread, global or any scope that is
|
|
available. By default, scopes should be consulted with the shortest-lived inner scopes first and widened only if
|
|
needed to find the variable.
|
|
- The variable must be defined in some available scope. By default, It is an error to refer to a variable for injection
|
|
that is not defined.
|
|
- The type of the variable is not checked on access. The type is presumed to be compatible with any assignments which
|
|
are made within whatever driver type is in use.
|
|
- The variable is assumed to be a single-valued type.
|
|
|
|
All behavioral variations that diverge from the above will be signified within the variable injection syntax as a
|
|
variation on the above syntax.
|
|
|
|
## Scenarios to Consider
|
|
|
|
basic scenario: user wants to capture each variable from one place
|
|
|
|
advanced scenarios:
|
|
- user wants to capture a named var from one or more places
|
|
- some ops may be required to complete successfully, others may not
|
|
- some ops may be required to produce a value
|
|
- some ops may be required to produce multiple values
|
|
|
|
|