# Workload Synthesis

This document describes the background and essential details for understanding workload
synthesis as a potential NoSQLBench capability. Here, workload synthesis means constructing
a NoSQLBench workload from op templates that is representative of some recorded
or implied workload, as informed by a schema, recorded operations, or a data set. The goal is
to construct a template-based workload description that aligns with application access patterns
as closely as the provided data sources allow.

With the release of Apache Cassandra 4.0 imminent, the full query logging (FQL) capability it provides
offers a starting point for advanced workload characterization. However, FQL is only a starting
point for serious testing at scale. The reasons for this can be discussed in detail elsewhere,
but the main reason is that the test apparatus has inertia due to the weight of the data logs
and the operational cost of piping bulk data around. Mechanical sympathy suggests a different
representation that will maintain headroom in the testing apparatus, which directly translates to
simplicity in setup as well as accuracy in results.

Further, the capabilities needed to do workload synthesis are not unique to CQL. This is a general
type of approach that can be used for multiple drivers.

# Getting there from here

There are several operational scenarios and possibilities to consider. These can be thought of as incremental goals toward getting full workload synthesis into NoSQLBench:

1) Schema-driven Workload - Suppose a workload with no operations visibility

   This means taking an existing schema as the basis for some supposed set of operations. There are
   many possibilities to consider in terms of mapping schema to operations, but this plan only considers
   the most basic possible operations which can exercise a schema:

   - write data to a database given a schema - Construct insert operations using very basic and default
     value generation for the known row structure
   - read data from a database given a schema - Construct read operations using very basic and default
     value generation for the known identifier structure

   Users should be allowed to specify which keyspaces and/or tables should be included.

   Users will be allowed to specify the relative cardinality of values used at each level of identifier,
   and the type of value nesting.

   The source of the schema will be presumed to be a live system, with the workload generation being done
   on-the-fly, but with an option to save the workload description instead of running it. When the
   workload is run, the option to save it should also be available.

   The option to provide the schema as a local file (a database dump of describe keyspace, for example)
   should be provided as an enhancement.

   **PROS**

   This method can allow users to get started testing quickly on the data model that they've chosen
   with *nearly zero* effort.

   **CONS**

   This method only knows how to assume some valid operations from the user's provided schema. This means
   that the data model will be accurate, but it doesn't know how to construct data that is representative
   of the production data. It doesn't have the ability to emulate more sophisticated operational
   patterns or even inter-operational access patterns. It doesn't know how to construct inter-entity
   relationships.

   In essence, this is a bare-bones getting started method for users who just want
   to exercise a specific data model irrespective of other important factors. It should only be used
   for very basic testing or as a quick getting started workflow for more accurate workload definitions.
   Even so, this is still a serious improvement to the user workflow for getting started.
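
A minimal sketch of the schema-driven idea, written in Python purely for illustration. The `Table` class, the helper names, and the `app.users` example are hypothetical and not part of NoSQLBench; the `{column}` placeholders stand in for named values that a workload would bind at run time. It only shows that a write template follows from the row structure and a read template from the identifier structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    """Hypothetical, simplified view of one table taken from a schema."""
    keyspace: str
    name: str
    columns: tuple      # every column name, in declaration order
    key_columns: tuple  # the identifier (primary key) columns


def insert_template(t: Table) -> str:
    """Build a write op template covering the full known row structure."""
    cols = ", ".join(t.columns)
    binds = ", ".join("{" + c + "}" for c in t.columns)
    return f"insert into {t.keyspace}.{t.name} ({cols}) values ({binds});"


def select_template(t: Table) -> str:
    """Build a read op template addressed by the known identifier structure."""
    where = " and ".join(c + "={" + c + "}" for c in t.key_columns)
    return f"select * from {t.keyspace}.{t.name} where {where};"


# Illustrative table only; a real tool would read this from the live schema.
users = Table("app", "users", ("id", "region", "name"), ("id", "region"))
print(insert_template(users))
print(select_template(users))
```

Value generation for each placeholder, along with the relative cardinality per identifier level, would be layered on top of templates like these.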

2) Raw Replay - Replay raw operations with no schema visibility

   This is simply the ability to take a raw data file which includes operations and play it back via
   some protocol. This is a protocol-specific mechanism, since it requires integration with the
   driver level tools of a given protocol. As such it will only be available (to start) on a per-driver
   basis. For the purposes of this plan, assume this means "CQL".

   This level of workload generation may depend on the first "schema-driven workload" as in many cases,
   users must have access to DDL statements before running tests with DML statements. In some scenarios,
   this may not be required as the testing may be done in-situ against a system that is already populated.

   Further, a user should be able to provide a local schema, or point their test client at an existing
   system to gather the schema needed to replay from raw data in the absence of a schema.

   **PROS**

   This allows users to start sending operations to a target system that are facsimile copies of operations
   previously observed, with *some* effort.

   **CONS**

   Capturing logs and moving them around is an operational pain. This brings the challenge of managing
   big data directly into the workflow of testing at speed and scale.

   Specifically: Unless users can adapt the testing apparatus to scale better (in-situ) than the
   system under test, they will get invalid results. There is a basic principle at play here that
   requires a testing instrument to have more headroom than the thing being tested in order to avoid
   simply testing the test instrument itself. This is more difficult than users are generally prepared
   to deal with in practice, and navigating it successfully requires more diligence and investment
   than just accepting the tools as-is. This means that the results are often unreliable, as the
   perceived cost of doing it "right" is too high. This doesn't have to be the case.

   The raw statement replay is not able to take advantage of optimizations that applications of
   scale depend on for efficiency, such as prepared statements.

   Raw statement replay may depend on native-API level access to the source format. Particularly
   in the FQL form from Apache Cassandra 4.*, the native form depends on internal buffering formats
   which have no typed-ness or marshalling support for external consumers. Export formats can be
   provided, but the current built-in export format is textual and extremely inefficient for use
   by off-board tools.

   The amount of data captured is the amount of data available for replay.

   The data captured may not be representative across a whole system unless it is sampled across
   all clients and connections.

3) Workload Synthesis - Combine schema and operations visibility into a representative workload.

   This builds on the availability of schema details and operation logs to create a workload which
   is reasonably accurate to the workload that was observed in terms of data used in operations as
   well as relative frequency of operations. Incremental pattern analysis can be used to increase
   the level of awareness about obvious patterns as the toolkit evolves.

   The degree of realism in the emulated data set and operational patterns depends on a degree of deep analysis
   which will start out decidedly simple: relative rates of identifiable statement patterns, and simple
   statistical shaping of fields in operations.

   **PROS**

   This method allows users to achieve a highly representative workload that exactly reproduces the
   statement forms in their application, with a mix of operations which is representative, with
   a set of data which is representative, with *some* effort. Once this workload is synthesized,
   they can take that as a much more accurate starting point for experimentation. Changes from this point
   are not dependent on big data, but on a simple recipe and description that can be changed
   in a text file and immediately used again with different test conditions.

   This method allows the test client to run at full speed, using the more efficient and extremely
   portable procedural data generation methods of current state-of-the-art testing methods in nb.

   This workload description also serves as a fingerprint of the shape of data in recorded operations.

   **CONS**

   This requires an additional step of analysis and workload characterization. For the raw data collection,
   the same challenges associated with raw replay apply. This is mitigated by the option for
   users to run the workload analysis tool on system nodes where the data resides locally and then
   use the synthesized workload on the client with no need to move data around.

   As explained in the _Operations and Data_ section, this is an operation-centric approach to
   workload analysis. While only a minor caveat, this distinction may still be important with respect
   to dataset-centric approaches.

   **PRO & CON**

   The realism of the test depends directly on the quality of the analysis used to synthesize the
   workload, which will start with simple rules and then improve over time.
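
The "decidedly simple" first level of analysis described for workload synthesis can be sketched as follows. This is an illustration under stated assumptions, not the planned implementation: recorded operations are assumed to be already decoded into plain statement strings, literals are assumed to be single-quoted strings or plain numbers, and every name here (`to_pattern`, `pattern_ratios`, `field_cardinality`, the sample `log`) is hypothetical.

```python
import re
from collections import Counter, defaultdict

# Assumed literal forms for illustration: single-quoted strings and plain numbers.
_LITERAL = re.compile(r"'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b")


def to_pattern(statement: str) -> str:
    """Replace literal values with '?' so equal statement forms compare equal."""
    return _LITERAL.sub("?", statement.strip())


def pattern_ratios(statements):
    """Return {pattern: fraction of all recorded statements}."""
    counts = Counter(to_pattern(s) for s in statements)
    total = sum(counts.values())
    return {p: n / total for p, n in counts.items()}


def field_cardinality(statements):
    """Count distinct literal values seen at each field position of each pattern."""
    seen = defaultdict(lambda: defaultdict(set))
    for s in statements:
        for position, value in enumerate(_LITERAL.findall(s)):
            seen[to_pattern(s)][position].add(value)
    return {p: [len(f[i]) for i in sorted(f)] for p, f in seen.items()}


# Made-up sample standing in for a recorded operation log.
log = [
    "select * from app.users where id=17",
    "select * from app.users where id=42",
    "select * from app.users where id=17",
    "insert into app.users (id, name) values (99, 'zed')",
]

for pattern, ratio in pattern_ratios(log).items():
    print(f"{ratio:.2f}  {pattern}")
print(field_cardinality(log))
```

The ratios correspond to the relative rates of statement patterns, and the per-field counts are the simplest possible form of statistical shaping; richer analysis (value distributions, inter-operation patterns) would build on the same grouping.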

## Workflows

To break these down, we'll start with the full synthesis view using the following "toolchain" diagram:

![workload_synthesis](workload_synthesis.svg)

In this view, the tools are on the top row, the data (in-flight and otherwise) in the middle,
and the target systems on the bottom row. This diagram shows how the data flows between tools
and how it is manipulated at each step.

## Operations and Data

Data in fields of operations and data within a dataset are related, but not in a way that allows a user to fully understand one through the lens of the other, except in special cases.

This section contrasts different levels of connectedness, in simple terms, between operations
and the data that results from them.

First, let's start with an "affine" example. Assume you have a data set which was built from additive operations such as the following (pseudo-code):

    for all I in 0..1000
     for all names in A B C D E
      insert row (I, name), values ...

This is an incremental process wherein the data of the iterators will map exactly to the
data in the dataset. Now take an example like this:

    for all I in 0..1000
     for all names in A B C D E
      insert row ((I mod 37), name), values ...

In this case all we've done is reduce the cardinality of the effective row identifier. Yet, the operations are not limited to 37 unique operations. As a mathematician, you could still work out
the resulting data set. As a DBA, you would never want to be required to do so.

Let's take it to the next level.

    for all I in 0..1000
     for all names in A B C D E
      insert row ((now() mod 37), name), values...

In this case, we've introduced a form of indeterminacy which seems to make it impossible to predict
the resulting state of the dataset. This is actually much more trivial than what happens in practice
as soon as you start using UUIDs, for example.
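
The three loops above can be made concrete with a small Python sketch. It assumes that `0..1000` means 1000 values of `I`, that an insert to an existing row identifier overwrites that row, and it substitutes `time.time_ns()` for `now()`; the `replay` helper is hypothetical and exists only for this comparison.

```python
import time

NAMES = "ABCDE"


def replay(key_fn, n=1000):
    """Apply n * len(NAMES) inserts; return (op count, set of row identifiers)."""
    ops = 0
    rows = set()
    for i in range(n):
        for name in NAMES:
            rows.add((key_fn(i), name))  # a repeated identifier overwrites the row
            ops += 1
    return ops, rows


affine = replay(lambda i: i)                   # identifier is the iterator value
reduced = replay(lambda i: i % 37)             # identifier cardinality reduced
timed = replay(lambda i: time.time_ns() % 37)  # identifier depends on the clock

print(affine[0], len(affine[1]))    # 5000 5000
print(reduced[0], len(reduced[1]))  # 5000 185
print(timed[0], len(timed[1]))      # 5000 ops; the row count varies between runs
```

In all three cases the operation count is the same, while the number of rows left in the dataset ranges from identical to the op count, to a small fixed number, to a value that cannot be known without running the operations.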

Attempts have been made to restore the simplicity of using sequences as identifiers in distributed systems, yet no current implementation seems to have a solid solution without self-defeating
trade-offs in other places. Thus, we have to accept that the relationship between operations and dataset is _complicated_ in practice. This is merely one example of how this relationship gets weakened in practice.

The point of explaining this at this fundamental level of detail is to make it clear that we need to treat the data of operations and datasets as independent types of data.

To be precise, the data used within operations will be called **op data**.  In contrast, the term **dataset** will be taken to mean data as it resides within storage, distinct
from op data.