doc updates for docsite testing

Jonathan Shook
2023-01-30 22:17:28 -06:00
parent 9f6cecb156
commit 8fc5115a0c
10 changed files with 247 additions and 400 deletions

@@ -0,0 +1,42 @@
# Workload Specification
This directory contains the testable specification for workload definitions used by NoSQLBench.
All the content blocks in this section have been validated with the latest NoSQLBench build.
Usually, users will not need to delve too deeply into this section. It is useful as a detailed
guide for contributors and driver developers. If you are using a driver which leaves you
wondering what a good op template example looks like, then the driver needs better examples in
its documentation!
# Synopsis
There are two primary views of workload definitions that we care about:
1. The User View of **op templates**
   1. Op templates are simply the schematic recipes for building an operation once you know the
      cycle it is for.
   2. Op templates are provided by users in YAML or JSON or even directly via runtime API. The
      document which contains the op templates is called a workload template.
   3. Op templates can be provided with optional metadata which serves to label, group,
      parameterize or otherwise make the individual op templates more manageable.
   4. A variety of forms are supported which are self-evident, but which allow users to have
      some flexibility in how they structure their YAML, JSON, or runtime collections. **This
      specification is about how these various forms are allowed, and how they relate to a
      fully-qualified and de-normalized op template view.**
2. The Developer View of the ParsedOp API. This is the view of an op template which presents the
   developer with a very high-level toolkit for building op synthesis functions.
# Details
The documentation in this directory serves as a testable specification for all the above. It
shows specific examples of all the valid op template forms in both YAML and JSON, as well as how
the data is normalized to feed the developer's view of the ParsedOp API.
## Related Reading
If you want to understand the rest of this document, it is crucial that you have a working knowledge
of the standard YAML format and several examples from the current drivers. You can learn this from
the main documentation which demonstrates step-by-step how to build a workload. Reading further in
this document will be most useful for core NB developers, or advanced users who want to know all
the possible ways of building workloads.

@@ -0,0 +1,205 @@
# ParsedOp API
In the workload template examples, we show statements as being formed from a string value. This is a
specific type of statement form, although it is possible to provide structured op templates as well.
**The ParsedOp API is responsible for converting all valid op template forms into a consistent and
unambiguous model.** Thus, the rules for mapping the various forms to the command model must be
precise. Those rules are the substance of this specification.
## Op Synthesis
Executable operations are _created_ on the fly by NoSQLBench via a process called _Op Synthesis_.
This is done incrementally in stages. The following narrative describes this process in logical
stages. (The implementation may vary from this, but it explains the effects, nonetheless.)
Everything here happens **during** activity initialization, before the activity starts running
cycles:
1. *Template Variable Expansion* - If there are template variables, such as
   `TEMPLATE(name,defaultval)` or `<<name:defaultval>>`, then these are expanded
   according to their defaults and any overrides provided in the activity params. This is a macro
   substitution only, so the values are simply interposed into the character stream of the document.
2. *Jsonnet Evaluation* - If the source file was in jsonnet format (the extension was `.jsonnet`)
   then it is interpreted by sjsonnet, with all activity parameters available as external variables.
3. *Structural Normalization* - The workload template (yaml, json, or data structure) is loaded
   into memory and transformed into a standard format. This means taking various list and map
   forms at every level and converting them to a singular standard form in memory.
4. *Auto-Naming* - All elements which do not already have a name are assigned a simple name like
   `block2` or `op3`.
5. *Auto-Tagging* - All op templates are given standard tag values under reserved tag names:
   - **block**: the name of the block containing the op template. For example: `block2`.
   - **name**: the name of the op template, prefixed with the block value and `--`. For example,
     `block2--op1`.
6. *Property De-normalization* - Default values for all the standard op template properties are
   copied from the doc layer to the block layer unless the same-named key already exists there.
   Then the same method is applied from the block layer to the op template layer. **At this point,
   the op templates are effectively an ordered list of data structures, each containing all
   necessary details for use.**
7. *Tag Filtering* - The activity's `tags` param is used to filter all the op templates
   according to their tag map.
8. *Bind Points and Capture Points* - Each op template is now converted into a ParsedOp, which is
   a Swiss Army knife of op template introspection and function generation. It is the direct
   programmatic API that driver adapters use in subsequent steps.
   - Any string sequences with bind points like `this has a {bindpoint}` are automatically
     converted to a long -> string function.
   - Any direct references with no surrounding text like `{bindpoint}` are automatically
     converted to direct binding references.
   - Any other string form is cached as a static value.
   - The same process is applied to Lists and Maps, allowing structural templates which read
     like JSON with bind points in arbitrary places.
9. *Op Mapping* - Using the ParsedOp API, each op template is categorized by the active `driver`
   according to that driver's documented examples and type-matching rules. Once the op mapper
   determines what op type a user intended, it uses this information and the associated op
   fields to create an *Op Dispenser*.
10. *Op Sequencing* - The op dispensers are kept as an internal sequence, and installed into a
    [LUT](https://en.wikipedia.org/wiki/Lookup_table) according to their ratios and the specified
    (or default) sequencer. By default, round-robin with bucket exhaustion is used. The ratios
    specified are used directly in the LUT.
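
The Property De-normalization stage above can be pictured with a small Python sketch. This is illustrative only and is not the NoSQLBench implementation: the `denormalize` function, the `INHERITED` tuple, and the assumption that the doc has already been normalized into a `blocks` list of `ops` lists are all made up for this example. It assumes that values set at a lower layer win over defaults inherited from a higher layer.

```python
# Properties assumed (for this sketch only) to be inherited from upper layers.
INHERITED = ("bindings", "params", "tags")


def denormalize(doc):
    """Flatten a normalized doc into a list of self-contained op templates."""
    flat_ops = []
    for block in doc.get("blocks", []):
        for op in block.get("ops", []):
            flat = dict(op)
            for prop in INHERITED:
                merged = dict(doc.get(prop, {}))    # doc-level defaults first
                merged.update(block.get(prop, {}))  # block-level values override them
                merged.update(op.get(prop, {}))     # op-level values always win
                flat[prop] = merged
            flat_ops.append(flat)
    return flat_ops


doc = {
    "bindings": {"userid": "ToString()"},
    "params": {"prepared": True},
    "blocks": [
        {
            "name": "block1",
            "params": {"prepared": False},
            "ops": [{"name": "op1", "op": "select * from users where userid={userid}"}],
        }
    ],
}
flat_ops = denormalize(doc)
print(flat_ops[0])
```

After this step each entry carries its own bindings, params, and tags, so later stages never need to look back at the enclosing block or doc.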
When this is complete, you are left with an efficient lookup table which indexes into a set of
OpDispensers. The length of this lookup table is called the _sequence length_, and that value is
used, by default, to set the _stride_ for the activity. This stride determines the size of
per-thread cycle batching, effectively turning each sequence into a thread-safe set of
operations which are serialized, and thus suitable for testing linearized operations with
suitable dependency and error-handling mechanisms. (But wait, there's more!)
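
As a rough illustration of the lookup table described above, the following Python sketch builds a sequence from op ratios. It reflects one plausible reading of "round-robin with bucket exhaustion" (take one op from each non-empty bucket per pass until every bucket is empty); the function names and the `read`/`write` op names are hypothetical, and the real sequencer may differ in detail.

```python
def build_lut(ratios):
    """Build a lookup table by taking one op from each non-empty bucket per pass."""
    remaining = dict(ratios)  # op name -> uses left in its bucket
    lut = []
    while any(count > 0 for count in remaining.values()):
        for name in ratios:  # preserve the declared op order
            if remaining[name] > 0:
                lut.append(name)
                remaining[name] -= 1
    return lut


lut = build_lut({"read": 3, "write": 1})
sequence_length = len(lut)  # by default, this is also the stride


def op_for_cycle(cycle):
    """Select the op dispenser name for a cycle by indexing into the table."""
    return lut[cycle % sequence_length]


print(lut)
print([op_for_cycle(cycle) for cycle in range(4, 8)])
```

Because the table length equals the sum of the ratios, any run of `sequence_length` consecutive cycles that starts on a stride boundary covers the whole sequence exactly once.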
## Special Cases
Drivers are assigned to op templates individually, meaning you can specify the driver within an
op template without even assigning a default for the activity. Further, certain drivers are able to
fill in missing details for op templates, like the `stdout` driver which only requires bindings.
This means that there are distinct cases for configuration which are valid, and these are
checked at initialization time:
- A `driver` must be selected for each op template either directly or via activity params.
- If the whole workload template provided does not include actual op templates **AND** a
default driver is provided which can create synthetic op templates, it is given the raw
workload template, incomplete as it is, and asked to provide op templates which have all
normalization, naming, etc. already done. This is injected before the tag-filtering phase.
- In any case where an actual non-zero list of op templates is provided and tag filtering removes
them all, an error is thrown.
- If, after tag filtering, no op templates are in the active list, an error is thrown.
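
The checks above could be sketched in Python as follows. This is a simplified illustration, not the actual initialization code: `WorkloadInitError`, `select_active_ops`, and the exact-match tag filter are assumptions made for this example, and the synthetic op template case is left out.

```python
class WorkloadInitError(Exception):
    """Illustrative error type for initialization-time failures."""


def select_active_ops(op_templates, default_driver=None, tag_filter=None):
    """Apply the driver check, then tag filtering, then the empty-result check."""
    tag_filter = tag_filter or {}
    for op in op_templates:
        # A driver must come from the op template itself or from the activity params.
        if op.get("driver", default_driver) is None:
            raise WorkloadInitError(f"no driver selected for op template '{op['name']}'")
    # Simplified filter for this sketch: every requested tag must match exactly.
    active = [
        op for op in op_templates
        if all(op.get("tags", {}).get(key) == value for key, value in tag_filter.items())
    ]
    if not active:
        raise WorkloadInitError("no op templates remain after tag filtering")
    return active


ops = [
    {"name": "block1--op1", "driver": "stdout", "tags": {"block": "block1"}},
    {"name": "block2--op1", "tags": {"block": "block2"}},
]
active = select_active_ops(ops, default_driver="stdout", tag_filter={"block": "block2"})
print([op["name"] for op in active])
```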
# The ParsedOp
The components of a fully-parsed op template (AKA a ParsedOp) are:
## name
Each ParsedOp knows its name, which is simply the op template name that it was made from. This
is useful for diagnostics, logging, and metrics.
## description
Every named element of a workload may be given a description.
## tags
Every op template has tags, even if they are auto-assigned from the block and op template names.
If you assign explicit tags to an op template, the standard tags are still provided. Thus, it is
an error to directly provide a tag named `block` or `name`.
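
A minimal Python sketch of this tagging rule follows. The `auto_tag` helper and the `phase` tag are hypothetical and exist only to show how the standard tags are derived and why user-supplied `block` or `name` tags are rejected.

```python
RESERVED_TAGS = ("block", "name")


def auto_tag(block_name, op_name, user_tags=None):
    """Return the op template's tag map with the standard tags added."""
    tags = dict(user_tags or {})
    clashes = [tag for tag in RESERVED_TAGS if tag in tags]
    if clashes:
        raise ValueError(f"reserved tag names may not be set directly: {clashes}")
    tags["block"] = block_name
    tags["name"] = f"{block_name}--{op_name}"  # block value and '--' as the prefix
    return tags


tags = auto_tag("block2", "op1", {"phase": "main"})
print(tags)
```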
## bindings
Although bindings are usually defined as a workload template level property, they can also be
provided directly as an op field property.
## op fields
The **op** property of an op template or ParsedOp is the root of the op fields. This is a map of
specific fields specified by the user.
### static op fields
Some op fields are simply static values. Since these values are not generated per cycle, they are
kept separate as reference data. Knowing which fields are static and which are not makes it
possible for developers to optimize op synthesis.
### dynamic op fields
Other fields may be specified as recipes, with the actual value to be filled-in once the cycle
value is known. All such fields are known as _dynamic op fields_, and are provided to the op
dispenser as a long function, where the input is always the cycle value and the output is a
type-specific value as determined by the associated binding recipe.
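
To make the static/dynamic split concrete, here is a small Python sketch. It is not the ParsedOp API: `make_op_function` and the lambda stand-ins for binding recipes are assumptions for illustration, and the field names are borrowed loosely from examples elsewhere in these docs.

```python
def make_op_function(static_fields, dynamic_fields):
    """Return a function of the cycle value which yields a fully resolved field map."""

    def op_for(cycle):
        op = dict(static_fields)  # static values are reused as-is for every cycle
        for name, binding in dynamic_fields.items():
            op[name] = binding(cycle)  # dynamic values depend only on the cycle
        return op

    return op_for


op_for = make_op_function(
    static_fields={"cmdtype": "put", "namespace": "users"},
    dynamic_fields={
        "key": lambda cycle: cycle % 1000,       # stand-in for a binding recipe
        "body": lambda cycle: f"user-{cycle}",   # stand-in for a binding recipe
    },
)
print(op_for(1042))
```

Since the input is always the cycle value, calling the function twice with the same cycle yields the same field values.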
### bind points
This is how dynamic values are indicated. Each bind point in an op template results in some type of
procedural generation binding. These can be references to named bindings elsewhere in the
workload template, or they can be inline.
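
The following Python sketch shows one way a string field could be classified by its bind points. It is a simplified illustration under stated assumptions: the `compile_field` function, the `BIND_POINT` regex, and the returned `(kind, value)` tuples are invented for this example, only named references like `{bindpoint}` are handled, and inline bindings are not covered.

```python
import re

BIND_POINT = re.compile(r"\{(\w+)\}")


def compile_field(value, bindings):
    """Classify a string field as a direct reference, a string template, or a static."""
    whole = BIND_POINT.fullmatch(value)
    if whole:
        # No surrounding text: use the binding function directly, keeping its type.
        return ("reference", bindings[whole.group(1)])
    if BIND_POINT.search(value):
        def render(cycle):
            # Surrounding text: substitute each bind point, producing a string.
            return BIND_POINT.sub(lambda m: str(bindings[m.group(1)](cycle)), value)
        return ("template", render)
    return ("static", value)


bindings = {"bindpoint": lambda cycle: cycle * 2}  # stand-in for a binding recipe
kind, render = compile_field("this has a {bindpoint}", bindings)
print(kind, render(5))
```

A direct reference returns whatever type the binding produces, while a templated string always produces a string.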
### capture points
Names of result values to save, and the variable names they are to be saved as. The names represent
the name as it would be found in the native driver's API, such as the name `userid`
in `select userid from ...`. In string form statements, users can specify that the userid should be
saved as the thread-local variable named *userid* simply by tagging it
like `select [userid] from ...`. They can also specify that this value should be captured under a
different name with a variation like `select [userid as user_id] from ...`. This is the standard
variable capture syntax for any string-based statement form.
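
As an illustration of the capture syntax, this Python sketch extracts capture points from a string-form statement. The `parse_captures` function and the `CAPTURE` regex are assumptions for this example, as is the choice to return the statement with the brackets stripped so that it reads as a native query.

```python
import re

# Matches [name] and [name as alias]
CAPTURE = re.compile(r"\[(\w+)(?:\s+as\s+(\w+))?\]")


def parse_captures(statement):
    """Return the statement without capture markers, plus a map of name -> saved-as name."""
    captures = {}

    def strip_marker(match):
        source, alias = match.group(1), match.group(2)
        captures[source] = alias or source  # default to saving under the same name
        return source

    return CAPTURE.sub(strip_marker, statement), captures


cleaned, captures = parse_captures("select [userid as user_id] from db.users")
print(cleaned, captures)
```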
### params
A backwards-compatible feature called op params is still available. This is another root
property within an op template which can be used to accessorize op fields. By default, any op
field which is not explicitly rooted under the `op` property is put there anyway. This is also
true when there is an explicit `params` property. However, if the `op` property is provided, then
all non-reserved fields are given to the `params` property instead. If both the `op` and the
`params` properties are specified, then no non-reserved op fields are allowed outside of these
root values. Thus it is possible to still support params, but it is **highly** recommended that
new driver developers avoid using this field, and instead allow all fields to be automatically
anchored under the `op` property. This keeps configs terse and simple going forward.
Params may not be dynamic.
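
The routing rules above can be summarized in a short Python sketch. This is only a reading of the prose, not the real implementation: `route_fields`, the `STRUCTURAL` set, and the `stmt` and `timeout` field names are assumptions, and the sketch assumes `op` is given as a map.

```python
RESERVED = {"ratio", "driver", "space"}  # reserved by the runtime, per this spec
# Structural property names assumed for this sketch.
STRUCTURAL = {"name", "description", "tags", "bindings", "op", "params"}


def route_fields(template):
    """Split an op template into (op fields, params) following the rules above."""
    loose = {
        key: value for key, value in template.items()
        if key not in RESERVED and key not in STRUCTURAL
    }
    op = dict(template.get("op", {}))
    params = dict(template.get("params", {}))
    if "op" in template and "params" in template:
        if loose:
            raise ValueError(f"fields not allowed outside op and params: {sorted(loose)}")
    elif "op" in template:
        params.update(loose)  # explicit op: loose fields become params
    else:
        op.update(loose)      # default: loose fields are anchored under op
    return op, params


print(route_fields({"stmt": "select 1", "ratio": 2}))
print(route_fields({"op": {"stmt": "select 1"}, "timeout": 5}))
```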
# Mapping Rules
A ParsedOp does not necessarily describe a specific low-level operation to be performed by
a native driver. It *should* do so, but it is up to the user to provide a valid op template
according to the documented rules of op construction for that driver type. These rules should be
clearly documented by the driver developer as examples in markdown that is required for every
driver. With this documentation, users can use `nb5 help <driver>` to see exactly how
to create op templates for a given driver.
## String Form
Basic operations are made from a statement in some type of query language:
```yaml
ops:
  - stringform: select [userid] from db.users where user='{username}';
bindings:
  username: NumberNameToString()
```
# Reserved op fields
The property names `ratio`, `driver`, and `space` are considered reserved by the NoSQLBench runtime.
These are extracted and handled specially by the core runtime.
# Base OpDispenser fields
The BaseOpDispenser, which <s>is</s> will be required as the base implementation of any op
dispenser going forward, provides cross-cutting functionality. These include `start-timers`,
`stop-timers`, `instrument`, and likely will include more as future cross-driver functionality is
added. These fields will be considered reserved property names.
# Optimization
It should be noted that the op mapping process, where user intentions are mapped from op templates to
op dispensers, is not something that needs to be done quickly. This occurs at _initialization_
time. Instead, it is more important to focus on user experience factors, such as flexibility,
obviousness, robustness, correctness, and so on. Thus, priority of design factors in this part
of NB is placed more on clear and purposeful abstractions and less on optimizing for speed. The
clarity and detail which is conveyed by this layer to the driver developer will then enable
them to focus on building fast and correct op dispensers. These dispensers are also constructed
before the workload starts running, but are used at high speed while the workload is running.
In essence:
- Any initialization code which happens before or in the OpDispenser constructor should not be
concerned with careful performance optimization.
- Any code which occurs within the OpDispenser#apply method should be as lightweight as is
reasonable.
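
The division of labor between construction and `apply` might look like the following Python sketch. The real OpDispenser is a Java type; the `StringOpDispenser` class here is hypothetical and only mirrors the idea that parsing happens once in the constructor while `apply` does the minimum work per cycle.

```python
import re


class StringOpDispenser:
    """Hypothetical dispenser: parse once in the constructor, render per cycle in apply."""

    def __init__(self, template, bindings):
        # Initialization-time work: clarity matters more than speed here.
        parts = re.split(r"\{(\w+)\}", template)  # literal, name, literal, name, ...
        self.literals = parts[0::2]
        # An unknown binding name fails here, before any cycle runs.
        self.functions = [bindings[name] for name in parts[1::2]]

    def apply(self, cycle):
        # Cycle-time work: no parsing, only function calls and string joining.
        out = [self.literals[0]]
        for function, literal in zip(self.functions, self.literals[1:]):
            out.append(str(function(cycle)))
            out.append(literal)
        return "".join(out)


dispenser = StringOpDispenser(
    "select * from users where userid={userid}",
    {"userid": lambda cycle: cycle},  # stand-in for a binding recipe
)
print(dispenser.apply(7))
```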

@@ -1,99 +0,0 @@
# Workload Specification
This directory contains the testable specification for workload definitions used by NoSQLBench.
## Op Templates vs Developer API
There are two primary views of workload definitions that we care about:
1. The User View of **op templates**
   1. Op templates are simply the schematic recipes for building an operation.
   2. Op templates are provided by users in YAML or JSON or even directly via runtime API.
   3. Op templates can be provided with optional metadata which serves to label, group or
      otherwise make the individual op templates more manageable.
   4. A variety of forms are supported which are self-evident, but which allow users to have
      some flexibility in how they structure their YAML, JSON, or runtime collections.
2. The Developer View of the ParsedOp API -- All op templates, regardless of the form they are
   provided in, are processed into a normalized internal data structure.
   1. The detailed documentation for the ParsedOp API is in javadoc.
The documentation in this directory serves as a testable specification for all the above. It
shows specific examples of all the valid op template forms in both YAML and JSON, as well as how
the data is normalized to feed the developer's view of the ParsedOp API.
If you are a new user, it is recommended that you read the basic docs first before delving into
these specification-level docs too much. The intro docs show normative and simple ways to
specify workloads without worrying too much about all the possible forms.
## Templating Language
When users want to specify a set of operations to perform, they do so with the workload templating
format, which includes document level details, block level details, and op level details.
Specific reserved words like `block` or `ops` are used in tandem with nesting structure to
define all valid workload constructions. Because of this, workload definitions are
essentially data structures comprised of basic collection types and primitive values. Any on-disk
format which can be loaded as such can be a valid source of workload definitions.
- [SpecTest Formatting](spectest_formatting.md) - A primer on the example formats used here
- [Workload Structure](workload_structure.md) - Overall workload structure, keywords, nesting
features
- [Op Template Basics](op-template-basics.md) - Basic Details of op templating
- [Op Template Variations](op_template_variations.md) - Additional op template variants
and corner cases
- [Template Variables](template_variables.md) - Textual macros and default values
## ParsedOp API
After a workload template is loaded into an activity, it is presented to the driver in an API which
is suitable for building executable ops in the native driver.
- [ParsedOp API](parsed_op_api.md) - Defines the API which developers see after a workload is fully
loaded.
## Related Reading
If you want to understand the rest of this document, it is crucial that you have a working knowledge
of the standard YAML format and several examples from the current drivers. You can learn this from
the main documentation which demonstrates step-by-step how to build a workload. Reading further in
this document will be most useful for core NB developers, or advanced users who want to know all
the possible ways of building workloads.
## Op Mapping Stages
The process of loading a workload definition occurs in several discrete steps during a NoSQLBench
session:
1. The workload file is loaded.
2. Template variables from the activity parameters are interposed into the raw contents of the
file.
3. The file is deserialized from its native form into a raw data structure.
4. The raw data structure is transformed into a normalized data structure according to the Op
Template normalization rules.
5. Each op template is then denormalized as a self-contained data
structure, containing all the provided bindings, params, and tags from the upper layers of the
doc structure.
6. The data is provided to the ParsedOp API for use by the developer.
7. The DriverAdapter is loaded which understands the op fields provided in the op template.
8. The DriverAdapter uses its documented rules to determine which types of native driver operations
each op template is intended to represent. This is called **Op Mapping**.
9. The DriverAdapter (via the selected Op Mapper) uses the identified types to create dispensers of
native driver operations. This is called **Op Dispensing**.
10. The op dispensers are arranged into an indexed bank of op sources according to the specified
ratios and/or sequencing strategy. From this point on, NoSQLBench has the ability to
construct an operation for any given cycle at high speed.
These specifications are focused on steps 2-5. The DriverAdapter focuses on the developer's use of
the ParsedOp API, and as such is documented in javadoc primarily. Some details on the ParsedOp
API are shared here for basic awareness, but developers should look to the javadoc for the full
story.
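
Step 2 above, where template variables are interposed into the raw file contents, amounts to plain text substitution. The Python sketch below is illustrative only: the `expand_template_vars` function and both regular expressions are assumptions for this example, and it handles only the `TEMPLATE(name,default)` and `<<name:default>>` forms with a default value.

```python
import re

# Illustrative patterns for the two macro forms; not NoSQLBench's own parser.
TEMPLATE_CALL = re.compile(r"TEMPLATE\(\s*(\w+)\s*,\s*([^)]*?)\s*\)")
ANGLE_FORM = re.compile(r"<<\s*(\w+)\s*:\s*(.*?)\s*>>")


def expand_template_vars(text, params):
    """Replace each macro with the override from params, else with its default."""

    def substitute(match):
        name, default = match.group(1), match.group(2)
        return str(params.get(name, default))

    text = TEMPLATE_CALL.sub(substitute, text)
    return ANGLE_FORM.sub(substitute, text)


raw = "keyspace: TEMPLATE(keyspace,baselines)\nrf: <<rf:1>>"
expanded = expand_template_vars(raw, {"rf": "3"})
print(expanded)
```

Because this runs on the character stream before deserialization, the substituted values can change the structure of the document as well as its scalar values.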
## Mapping vs Running
It should be noted that the Op Mapping stage, where user intentions are mapped from op templates to
native operations, is not something that needs to be done quickly. This occurs at
_initialization_ time. Instead, it is more important to focus on user experience factors, such as
flexibility, obviousness, robustness, correctness, and so on. Thus, priority of design factors in
this part of NB is placed more on clear and purposeful abstractions and less on optimizing for
speed. The clarity and detail which is conveyed by this layer to the driver developer will then
enable them to focus on building fast and correct op dispensers. These dispensers are also
constructed before the workload starts running, but are used at high speed while the workload
is running.

@@ -1,301 +0,0 @@
# ParsedOp API
In the workload template examples, we show statements as being formed from a string value. This is a
specific type of statement form, although it is possible to provide structured op templates as well.
**The ParsedOp API is responsible for converting all valid op template forms into a consistent and
unambiguous model.** Thus, the rules for mapping the various forms to the command model must be
precise. Those rules are the substance of this specification.
## Op Synthesis
The method of turning an op template, some data generation functions, and some seed values into an
executable operation is called *Op Synthesis* in NoSQLBench. This is done in incremental stages:
1. During activity initialization, NoSQLBench parses the workload template and op templates
contained within. Each active op template (after filtering) is converted to a parsed command.
2. The NB driver uses the parsed command to guide the construction of an OpDispenser<T>. This is a
dispenser of operations that can be executed by the driver's Action implementation.
3. When it is time to create an actual operation to be executed, unique with its own procedurally
generated payload and settings, the OpDispenser<T> is invoked as a LongFunction<T>. The input
provided to this function is the cycle number of the operation. This is essentially a seed that
determines the content of all the dynamic fields in the operation.
This process is non-trivial in that it is an incremental creational pattern, where the resultant
object is contextual to some native API. The command API is intended to guide and enable op
synthesis without tying developers' hands.
## Command Fields
A command structure is intended to provide all the fields needed to fully realize a native
operation. Some of these fields will be constant, or *static* in the op template, expressed simply
as strings, numbers, lists or maps. These are parsed from the op template as such and are cached in
the command structure as statics.
Other fields are only prescribed as recipes. This comes in two parts: 1) The description for how to
create the value from a binding function, and 2) the binding point within the op template. Suppose
you have a string-based op template like this:
```yaml
ops:
  - op1: select * from users where userid={userid}
bindings:
  userid: ToString();
```
In this case, there is only one op in the list of ops, having a name `op1` and a string form op
template of `select * from users where userid={userid}`.
## Parsed Command Structure
Once an op template is parsed into a *parsed command*, it has the state shown in the data structure
schematic below:
```json
{
  "name": "some-map-name",
  "statics": {
    "s1": "v1",
    "s2": {
      "f1": "valfoo"
    }
  },
  "dynamics": {
    "d1": "NumberNameToString()"
  },
  "captures": {
    "resultprop1": "asname1"
  }
}
```
If either an **op** or **stmt** field is provided, then the same structure as above is used:
```json
{
  "name": "some-string-op",
  "statics": {
  },
  "dynamics": {
    "op": "select username from table where name userid={userid}"
  },
  "captures": {
  }
}
```
The parts of a parsed command structure are:
### command name
Each command knows its name, just like an op template does. This can be useful for diagnostics and
metric naming.
### static fields
The field names which are statically assigned and their values of any type. Since these values are
not generated per-op, they are kept separate as reference data. Knowing which fields are static and
which are not makes it possible for developers to optimize op synthesis.
### dynamic fields
Named binding points within the op template. These values will only be known for a given cycle.
### variable captures
Names of result values to save, and the variable names they are to be saved as. The names represent
the name as it would be found in the native driver's API, such as the name `userid`
in `select userid from ...`. In string form statements, users can specify that the userid should be
saved as the thread-local variable named *userid* simply by tagging it
like `select [userid] from ...`. They can also specify that this value should be captured under a
different name with a variation like `select [userid as user_id] from ...`. This is the standard
variable capture syntax for any string-based statement form.
# Resolved Command Structure
Once an op template has been parsed into a command structure, the runtime has everything it needs to
know in order to realize a specific set of field values, *given a cycle number*. Within a cycle, the
cycle number is effectively a seed value that drives the generation of all dynamic data for that
cycle.
However, this seed value is only known by the runtime once it is time to execute a specific cycle.
Thus, it is the developer's job to tell the NoSQLBench runtime how to map from the parsed structure
to a native type of executable operation suitable for execution with that driver.
# Interpretation
A command structure does not necessarily describe a specific low-level operation to be performed by
a native driver. It *should* do so, but it is up to the user to provide a valid op template
according to the documented rules of op construction for that driver type. These rules should be
clearly documented by the driver developer.
Once the command structure is provided, the driver takes over and maps the fields into an executable
op -- *almost*. In fact, the driver developer defines the ways that a command structure can be
turned into an executable operation. This is expressed as a *Function<CommandTemplate,T>* where T is
the type used in the native driver's API.
How a developer maps a structure like the above to an operation is up to them. The general rule of
thumb is to use the most obvious and familiar representation of an operation as it would appear to a
user. If this is CQL or SQL, then use that as the statement form. If it is GraphQL, use that.
In both of these cases, you have access to
## String Form
Basic operations are made from a statement in some type of query language:
```yaml
ops:
- stringform: select [userid] from db.users where user='{username}';
bindings:
username: NumberNameToString()
```
## Structured Form
Some operations can't be easily represented by a single statement. Some operations are built from a
set of fields which describe more about an operation than the basic statement form. These types of
operations are expressed to NoSQLBench in map or *object* form, where the fields within the op can
be specified independently.
```yaml
ops:
  - structured1:
      stmt: select * from db.users where user='{username}';
      prepared: true
      consistency_level: LOCAL_QUORUM
      bindings:
        username: NumberNameToString();
  - structured2:
      cmdtype: "put"
      namespace: users
      key: "{userkey}"
      body: "User42 was here"
      bindings:
        userkey: FixedValue(42)
```
In the first case, the op named *structured1* is provided as a string value within a map structure.
The *stmt* field is a reserved word (synonymous with op and operation). When you are reading an op
from the command API, these will be represented in exactly the same way as the stringform example
above.
In the second case, the op named *structured2* is provided as a map. Both of these examples would
make sense to a user, as they are fairly self-explanatory.
Op templates may specify an op as either a string or a map. No other types are allowed. However,
there are no restrictions placed on the elements below a map.
The driver developer should not have to parse all the possible structural forms that users can
provide. There should be one way to access all of these in a consistent and unambiguous API.
## Command Structure
Here is an example data structure which illustrates all the possible elements of a parsed command:
```json
{
  "statics": {
    "prepared": "true",
    "consistency_level": "LOCAL_QUORUM"
  }
}
```
Users provide a template form of an operation in each op template. This contains a sketch of what an
operation might look like, and includes the following optional parts:
- properties of the operation, whether meta (like a statement) or payload content
- the binding points where generated field values will be injected
- The names of values to be extracted from the result of a successful operation.
## Statement Forms
Sometimes operations are derived from a query language, and are thus self-contained in a string
form.
When mapping the template of an operation provided by users to an executable operation in some
native driver with specific values, you have to know
* The s
* The substance of the operation: The name and values of the fields that the user provides as part
of the operation
Command templates are the third layer of workload templating. As described in other spec documents,
the other layers are:
1. [Workload level templates](templated_workloads.md) - This specification covers the basics of a
workload template, including the valid properties and structure.
2. [Operation level templates](templated_operations.md) - This specification covers how operations
can be specified, including semantics and structure.
3. Command level templates, explained below. These are the detailed views of what goes into an op
template, parsed and structured in a way that allows for efficient use at runtime.
Users do not create command templates directly. Instead, these are the *parsed* form of op templates
as seen by the NB driver. The whole point of a command template is to provide crisp semantics and
structure about what a user is asking a driver to do. Command Template
Command templates are essentially schematics for an operation. They are a structural interpretation
of the content provided by users in op templates. Each op template provided can be converted into a
command template. In short, the op template is the form that users tend to edit in yaml or provided
as a data structure via scripting. **Command templates are the view of an op template as seen by an
NB driver.**
```
### Command Templates
Command templates are part of the workload API.
There exists a need to provide op templates to a myriad of runtime APIs,
and thus it has to be flexible enough to serve them all.
1. In some cases, an operation is based on a query language where the
query language itself encodes everything needed for specific operation.
SQL queries are like this. This is a nice simplification, but it is not
realistic for systems built on modern distributed principles.
2. In most cases, you have both an operation and some qualifying rules
about how the operation should be handled, such as consistency level.
Thus, there is a need to provide parameters which can decorate
operations.
3. In some cases, you have a payload for your operation which is not based
on a query language, but instead on an object with fields, or a verb
which determines what other fields are needed. This structure is better
described as a *command* than a *statement*.
4. Finally, you must support both of the latter cases together, where the
command or operation is defined in some pseudo-structured way, but it
also has *separately* a set of qualifying parameters which are
considered orthogonal, or at least separate from the meaning of the
operation itself.
To address the full set of these mapping requirements, a type has been
added to NB which provides a structured and pre-baked version of a
resolvable command -- the CommandTemplate.
This type provides a view to the driver builder of all the fields
specified by the user, whether as encoded as a string, such
as `select row from ...`, or by a set of properties such
as `{"verb":"get",
"id":"2343"}`. It also exposes the parameters separately if provided.
### Static vs Dynamic command fields
Further, for each field in the command template, the driver implementor
knows whether this was provided as a static value or one that can only be
realized for a specific cycle (seed data). Thus, it is possible for
advanced op mapping implementations to optimize the way that new
operations are synthesized for efficiency.
For example, if you know that you have a command which has no dynamic
fields in its command template, then it is possible to create a singleton
op template which can simply be re-used. A fully dynamic command template,
in contrast, may need to be realized dynamically for each cycle, given
that you don't know the value of the fields in the command until you know
the cycle value.
```