# Linearized Operations

NOTE: This is a sketch/work in progress and will not be suitable for earnest review until this
notice is removed.

Thanks to Seb and Wei for helping this design along with their discussions along the way.

See https://github.com/nosqlbench/nosqlbench/issues/136

Presently, it is possible to stitch together rudimentary chained operations, as long as you already
know how statement sequences, binding functions, and thread-local state work. This is a significant
amount of knowledge to expect from a user who simply wants to configure chained operations with
internal dependencies.

The design changes needed to make this easier are non-trivial and cut across the runtime systems
within nosqlbench. This design sketch will try to capture each of the requirements and approaches
for discussion and feedback.

# Sync and Async

## As it is: Sync vs Async

The current default mode (without `async=`) emulates a request-per-thread model, with operations
being planned in a deterministic sequence. In this mode, each thread dispatches operations from the
sequence only after the previous one is fully completed, even if there is no dependence between
them. This is typical of many applications, even today, but not all.

On the other end of the spectrum is the fully asynchronous dispatch mode enabled with the `async=`
option. This uses a completely different internal API to allow threads to juggle a number of
operations. In contrast to the default mode, the async mode dispatches operations eagerly as long as
the user's selected concurrency level is not yet met. This means that operations may overlap and
also occur out of order with respect to the sequence.

Choosing between these modes is difficult, and neither offers a uniform way of looking at
operations. It also forces users to pick between two extremes, all request-per-thread or all
asynchronous. That split is becoming less common in application designs, and at the very least it
does not rise to the level of expressivity of the toolchains that most users have access to.

## As it should be: Async with Explicit Dependencies

* The user should be able to create explicit dependencies from one operation to another.
* Operations which are not dependent on other operations should be dispatched as soon as possible
  within the concurrency limits of the workload.
* Operations with dependencies on other operations should only be dispatched if the upstream
  operations completed successfully.
* Users should have clear expectations of how error handling will occur for individual operations as
  well as chains of operations.
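
The dispatch rules above can be sketched as a small planner. This is only an illustration, not
nosqlbench code: `plan_dispatch`, its parameters, and the notion of dispatching in "waves" are
assumptions made for the example. A real async runtime would dispatch each op as soon as a
concurrency slot frees up rather than in discrete waves.

```python
def plan_dispatch(ops, deps, failing=frozenset(), concurrency=2):
    """Plan dispatch waves for ops with explicit dependencies (hypothetical helper).

    ops:         op names in sequence order
    deps:        op name -> set of upstream op names it depends on
    failing:     op names which are simulated as failing when dispatched
    concurrency: maximum number of ops dispatched together
    Returns (waves, skipped).
    """
    succeeded, dead, waves, skipped = set(), set(), [], []
    pending = list(ops)
    while pending:
        # An op is skipped once any upstream op has failed or has itself been skipped.
        doomed = [op for op in pending if deps.get(op, set()) & dead]
        # An op is ready once all of its upstream ops have completed successfully.
        ready = [op for op in pending
                 if op not in doomed and deps.get(op, set()) <= succeeded][:concurrency]
        if not doomed and not ready:
            break  # unsatisfiable dependencies, e.g. a cycle or an unknown op name
        for op in doomed:
            dead.add(op)
            skipped.append(op)
        if ready:
            waves.append(ready)
        for op in ready:
            (dead if op in failing else succeeded).add(op)
        pending = [op for op in pending if op not in doomed and op not in ready]
    return waves, skipped


ops = ["a", "b", "c", "d"]
deps = {"c": {"a"}, "d": {"c"}}   # c depends on a, d depends on c
print(plan_dispatch(ops, deps))                  # a and b go out together, then c, then d
print(plan_dispatch(ops, deps, failing={"a"}))   # c and d are skipped after a fails
```

Independent ops (`a`, `b`) are dispatched eagerly, while the chain `a -> c -> d` stays linearized
and is skipped as a unit when its head fails.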

# Dependent Ops

We are using the phrase _dependent ops_ to capture the notions of data-flow dependency between ops
(implying linearization in ordering and isolation of input and output boundaries), successful
execution, and data sharing within an appropriate scope.

## As it is: Data Flow

Presently, you can store state within a thread-local object map in order to share data between
operations. This uses the implied scope of "thread local", which works well with the "sequence per
thread, request per thread" model. It works because both the op sequence and the variable state used
in binding functions are thread local.

However, it does not work well with the async mode, since there is no implied scope to tie the
variable state to the op sequence. There can be many operations within a thread operating on the
same state even concurrently. This may appear to function, but will create problems for users who
are not aware of the limitation.
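
The limitation can be shown with a deterministic simulation. This is a sketch under assumptions:
the `run` helper, the step tuples, and the flow names `A` and `B` are invented for illustration,
and a plain dict stands in for the thread-local object map.

```python
def run(steps, state_for):
    """Execute (flow, action, value) steps; state_for maps a flow id to its state dict."""
    reads = {}
    for flow, action, value in steps:
        state = state_for(flow)
        if action == "capture":
            state["userid"] = value
        else:  # "use": read back the value this flow expects to have captured
            reads[flow] = state["userid"]
    return reads


# Async mode may interleave two op sequences on a single thread like this.
steps = [
    ("A", "capture", "alice"),
    ("B", "capture", "bob"),
    ("A", "use", None),
    ("B", "use", None),
]

thread_local = {}                       # one map shared by everything on the thread
shared = run(steps, lambda flow: thread_local)

per_flow = {}                           # one map per op sequence instance
scoped = run(steps, lambda flow: per_flow.setdefault(flow, {}))

print(shared)  # flow A reads the value that flow B captured
print(scoped)  # each flow reads its own value
```

With the shared map, flow `A` silently observes `bob`. Nothing fails, which is why the problem is
easy to miss.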

## As it should be: Data Flow

* Data flow between operations should be easily expressed with a standard configuration primitive
  which can work across all driver types.
* The scope of data shared should be clear to users when configuring op templates, and in any
  diagnostic outputs from failed operations.

## As it is: Data Capture

Presently, the CQL driver has additional internal operators which allow for the capture of values.
These decorator behaviors allow for configured statements to do more than just dispatch an
operation. However, they are not built upon standard data capture and sharing operations which are
implemented uniformly across driver types. This makes scope management largely a matter of
convention, which is ok for the first implementation (in the CQL driver) but not as a building block
for cross-driver behaviors.

# Injecting Operations

## As it is: Injecting Operations

Presently, operations are derived from statement templates on a deterministic op sequence which is of
a fixed length known as the stride. This follows closely the pattern of assuming each operation
comes from one distinct cycle and that there is always a one-to-one relationship with cycles. This
has carried some weight internally in how metrics for cycles are derived, etc. There is presently no
separate operational queue for statements except by modifying statements in the existing sequence
with side-effect binding assignment. It is difficult to reason about additional operations as
independent without decoupling these two into separate mechanisms.

## As it should be: Injecting Operations

## Seeding Context

# Diagrams

![idealized](idealized.svg)

## Op Flow

To track

Open concerns

- before: variable state was per-thread
- now: variable state is per opflow
- (opflow state is back-filled into thread local as the default implementation)

* gives scope for enumerating op flows, meaning you have opflow 0 ... opflow (cycles/stride)
* 5 statements in sequence, stride=5,
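
As a rough illustration of the enumeration above, assuming cycles are numbered from 0 and each op
flow spans exactly one stride (both are assumptions, not settled design), a cycle number maps to an
op flow and a position within it as follows. The name `opflow_of` is hypothetical.

```python
def opflow_of(cycle, stride):
    """Return (op flow index, position within the op sequence) for a cycle number."""
    return divmod(cycle, stride)


for cycle in (0, 4, 5, 12):
    print(cycle, opflow_of(cycle, stride=5))
```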

- scoping for state
- implied data flow dependence vs explicit data flow dependence
- opflow retries vs op retries

discussion

```yaml
bindings:
    yesterday: HashRange(0L,1234234L);
statements:
    - s1-with-binding: select [userid*] from foobar.baz where day=23
    - s2-with-binding: select [userid],[yesterday] from accounts where id={id} and timestamp>{yesterday}
    - s3-with-dependency: select login_history from sessions where userid={[userid]}
    - rogue-statement: select [yesterday] from ...  # WARN USER because of explicit dependency below
    - s4: select login_history from sessions where userid={[userid]} and timestamp>{yesterday}
    - s5: select login_history from sessions where userid={[userid]} and timestamp>{[s2-with-binding/yesterday]}
```

## Dependency Indirection

## Error Handling and DataFlow Semantics

## Capture Syntax

Capturing of variables in statement templates will be signified with `[varname]`. This example
represents the simplest case where the user just wants to capture a variable. Thus the above is
taken to mean:

- The scope of the captured variable is the OpFlow.
- The operation is required to succeed. If it fails, any other operation which depends on a
  `varname` value will be skipped and counted as such.
- The captured type of `varname` is a single object, to be determined dynamically, with no type
  checking required.
- A field named `varname` is required to be present in the result set for the statement that
  included it.
- Exactly one value for `varname` is required to be present.
- Without other settings to relax sanity constraints, any other appearance of `[varname]` in another
  active statement should yield a warning to the user.

All behavioral variations that diverge from the above will be signified within the capture syntax as
a variation on the above example.
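
A minimal sketch of how capture points might be located, and how a duplicate capture could be
detected for the warning case. The regular expression, the function names, and the statement names
are assumptions for illustration. Only the plain `[varname]` form is handled; variations such as
`[userid*]` are deliberately ignored, and the `{[varname]}` form is excluded because it is not a
capture.

```python
import re

# Matches [varname] unless it is immediately preceded by "{", as in {[varname]}.
CAPTURE = re.compile(r"(?<!\{)\[(\w+)\]")


def capture_points(statements):
    """Map each captured variable name to the names of the statements that capture it."""
    found = {}
    for name, template in statements.items():
        for var in CAPTURE.findall(template):
            found.setdefault(var, []).append(name)
    return found


def duplicate_captures(statements):
    """Return only the variables captured by more than one statement (warning candidates)."""
    return {var: names for var, names in capture_points(statements).items() if len(names) > 1}


statements = {
    "s2": "select [userid],[yesterday] from accounts where id={id} and timestamp>{yesterday}",
    "s3": "select login_history from sessions where userid={[userid]}",
    "rogue": "select [yesterday] from accounts",
}
print(capture_points(statements))
print(duplicate_captures(statements))   # 'yesterday' is captured in two statements
```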

## Inject Syntax

Similar to binding tokens used in statement templates like `{varname}`, it is possible to inject
captured variables into statement templates with the `{[varname]}` syntax. This indicates that the
user explicitly wants to pull a value directly from the captured variable. It is necessary to
indicate variable capture and variable injection distinctly from each other, and this syntax
supports that while remaining familiar to the bindings formats already supported.

The above syntax example represents the case where the user simply wants to refer to a variable of a
given name. This is the simplest case, and is taken to mean:

- The scope of the variable is not specified. The value may come from OpFlow, thread, global or any
  scope that is available. By default, scopes should be consulted with the shortest-lived inner
  scopes first and widened only if needed to find the variable.
- The variable must be defined in some available scope. By default, it is an error to refer to a
  variable for injection that is not defined.
- The type of the variable is not checked on access. The type is presumed to be compatible with any
  assignments which are made within whatever driver type is in use.
- The variable is assumed to be a single-valued type.

All behavioral variations that diverge from the above will be signified within the variable
injection syntax as a variation on the above syntax.
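
The default lookup order and the undefined-variable error can be sketched as follows. This is an
illustration only: the `inject` function and its three scope arguments are hypothetical, and values
are rendered with `str()` purely for demonstration, where a real driver would bind them in its own
way.

```python
import re
from collections import ChainMap

# Matches the {[varname]} injection form; ordinary {varname} bindings are left untouched.
INJECT = re.compile(r"\{\[(\w+)\]\}")


def inject(template, opflow, thread, global_):
    """Replace each {[varname]} with its value, consulting the innermost scope first."""
    scopes = ChainMap(opflow, thread, global_)  # shortest-lived scope is consulted first

    def lookup(match):
        name = match.group(1)
        if name not in scopes:
            raise KeyError(f"captured variable '{name}' is not defined in any scope")
        return str(scopes[name])

    return INJECT.sub(lookup, template)


template = "select login_history from sessions where userid={[userid]} and timestamp>{yesterday}"
print(inject(template, {"userid": 42}, {"userid": 7}, {}))  # OpFlow value shadows the thread value
print(inject(template, {}, {"userid": 7}, {}))              # falls back to the thread scope
```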

## Scenarios to Consider

basic scenario: user wants to capture each variable from one place

advanced scenarios:

- user wants to capture a named var from one or more places
- some ops may be required to complete successfully, others may not
- some ops may be required to produce a value
- some ops may be required to produce multiple values

* The carrier of op state should enable the following programmatic constructions:
    * Metric measuring the service time of the op on failure
    * Metric measuring the service time of the op on success
    * Metric measuring the size of the op on success
    * Hooks for transforming or acting upon the op or cycle before the op executes
    * Hooks for transforming or acting upon the op or cycle after the op executes, regardless of
      result
    * Additional modifiers on the op, as in transformers.

* All op contextual actions should be presented as a function on the op type

* Completion Stages that support the op API should come from built-in template implementations that
  already include metrics options, logging support, etc.
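
One possible shape for such a carrier is sketched below. `OpContext`, its fields, and
`record_metrics` are hypothetical names invented for this example; they are not existing
nosqlbench types, and the plain dict of lists only stands in for a real metrics registry.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class OpContext:
    """Hypothetical carrier of op state with before/after hooks."""
    cycle: int
    op: Callable[[], Any]
    before: List[Callable[["OpContext"], None]] = field(default_factory=list)
    after: List[Callable[["OpContext"], None]] = field(default_factory=list)
    result: Any = None
    error: Optional[Exception] = None
    service_time_ns: int = 0

    def run(self):
        for hook in self.before:          # may transform or act upon the op or cycle
            hook(self)
        start = time.perf_counter_ns()
        try:
            self.result = self.op()
        except Exception as e:            # keep the failure on the carrier for later stages
            self.error = e
        finally:
            self.service_time_ns = time.perf_counter_ns() - start
        for hook in self.after:           # runs regardless of result
            hook(self)
        return self


metrics = {"success_ns": [], "failure_ns": [], "result_size": []}


def record_metrics(ctx):
    """After-hook: record service time by outcome, and result size on success."""
    if ctx.error is None:
        metrics["success_ns"].append(ctx.service_time_ns)
        metrics["result_size"].append(len(ctx.result))
    else:
        metrics["failure_ns"].append(ctx.service_time_ns)


ok = OpContext(cycle=0, op=lambda: ["row1", "row2"], after=[record_metrics]).run()
bad = OpContext(cycle=1, op=lambda: 1 / 0, after=[record_metrics]).run()
```

Keeping the hooks and timing on the carrier means every contextual action is a function of the op
context itself, which is the property the list above asks for.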