fixed silly typos in showcase section

Jonathan Shook 2020-04-06 22:06:00 -07:00
parent 6156b15a21
commit 9de8ef722c
8 changed files with 344 additions and 212 deletions


@@ -12,67 +12,91 @@ Some of the features discussed here are only for advanced testing scenarios.
## Hybrid Rate Limiting
Rate limiting is a complicated endeavor, if you want to do it well. The
basic rub is that going fast means you have to be less accurate, and
vice-versa. As such, rate limiting is a parasitic drain on any system. The
act of rate limiting itself poses a limit to the maximum rate, regardless
of the settings you pick. This occurs as a side-effect of forcing your
system to interact with some hardware notion of time passing, which takes
CPU cycles that could be going to the thing you are limiting.
This means that in practice, rate limiters are often very featureless.
It's daunting enough to need rate limiting, and asking for anything more
than that is often wishful thinking. Not so in NoSQLBench.
The rate limiter in NoSQLBench provides a comparable degree of performance
and accuracy to others found in the Java ecosystem, but it *also* has
advanced features:
- It allows a sliding scale between average rate limiting and strict rate
limiting, called _bursting_.
- It internally accumulates delay time, for C.O. friendly metrics which
are separately tracked for each and every operation.
- It is resettable and reconfigurable on the fly, including the bursting
rate.
- It provides its configured values in addition to performance data in
metrics, capturing your rate limiter settings as a simple matter of
metrics collection.
- It comes with advanced scripting helpers which allow you to read data
directly from histogram reservoirs, or control the reservoir window
programmatically.
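
For example, a rate limit with a burst ratio can be set as a single
activity parameter. The workload name and values below are illustrative
only:

```shell
# Target 5000 ops/s on average, with a 1.1x burst ratio that lets the
# workload catch up after transient slowdowns (illustrative values)
./nb run driver=cql workload=myworkload.yaml cyclerate=5000,1.1
```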
## Flexible Error Handling
An emergent facility in NoSQLBench is the way that errors are handled
within an activity. For example, with the CQL activity type, you are able
to route error handling for any of the known exception types. You can
count errors, you can log them. You can cause errored operations to
auto-retry if possible, up to a configurable number of tries.
This means that, as a user, you get to decide what your test is about. Is
it about measuring some nominal but anticipated level of errors due to
intentional over-saturation? If so, then count the errors, and look at
their histogram data for timing details within the available timeout.
Are you doing a basic stability test, where you want the test to error out
for even the slightest error? You can configure for that if you need to.
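
As a sketch of what this can look like for the cql driver, the following
routes different exception types to different handlers. The exact
parameter spelling and handler verbs vary by driver and version, so treat
these names as assumptions rather than a reference:

```shell
# Hypothetical error-routing spec: count write timeouts, retry
# overloaded errors up to the configured number of tries, and stop
# the activity on anything else
./nb run driver=cql workload=myworkload.yaml \
  errors=WriteTimeoutException:count,OverloadedException:retry,.*:stop
```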
## Cycle Logging
It is possible to record the result status of each and every cycle in a
NoSQLBench test run. If the results are mostly homogeneous, the RLE
encoding of the results will reduce the output file down to a small
fraction of the number of cycles. The errors are mapped to ordinals by
error type, and these ordinals are stored into a direct RLE-encoded log
file. For most testing where most of the results are simply success, this
file will be tiny. You can also convert the cycle log into textual form
for other testing and post-processing and vice-versa.
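
A sketch of this workflow, with option names given as best-effort
recollections of the built-in help rather than a verified reference:

```shell
# Record per-cycle result codes to an RLE-encoded binary log
./nb run driver=cql workload=myworkload.yaml cyclelog=myrun

# Convert the binary cycle log into a textual form for post-processing
./nb --export-cycle-log myrun
```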
## Op Sequencing
The way that operations are planned for execution in NoSQLBench is based
on a stable ordering that is configurable. The statement forms are mixed
together based on their relative ratios. The three schemes currently
supported are round-robin with exhaustion (bucket), duplicate in order
(concat), and a way to spread each statement out over the unit interval
(interval). These account for most configuration scenarios without users
having to micro-manage their statement templates.
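
For illustration, here is a minimal workload fragment with two statement
templates in a 3:1 ratio. The `seq` activity parameter (one of `bucket`,
`concat`, or `interval`) then selects the planning scheme. Statement
texts and names are placeholders:

```yaml
# Two templates mixed 3:1; run with seq=bucket, seq=concat, or seq=interval
statements:
  - read: select * from ks.tbl where key={key};
    ratio: 3
  - write: insert into ks.tbl (key,value) values ({key},{value});
    ratio: 1
```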
## Sync and Async
There are two distinct usage modes in NoSQLBench when it comes to
operation dispatch and thread management:
### Sync
Sync is the default form. In this mode, each thread reads its sequence and
dispatches one statement at a time, holding only one operation in flight
per thread. This is the mode you often use when you want to emulate an
application's request-per-thread model, as it implicitly linearizes the
order of operations within the computed sequence of statements.
### Async
In Async mode, each thread in an activity is responsible for juggling a
number of operations in-flight. This allows a NoSQLBench client to juggle
an arbitrarily high number of connections, limited primarily by how much
memory you have.
Internally, the Sync and Async modes have different code paths. It is
possible for an activity type to support one or both of these.
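
As a rough sketch of the difference from the command line, where the
`async` parameter name is driver-dependent and shown here as an
assumption:

```shell
# Sync: 50 threads, each with exactly one operation in flight
./nb run driver=cql workload=myworkload.yaml threads=50

# Async: 50 threads juggling up to 500 operations in flight overall
./nb run driver=cql workload=myworkload.yaml threads=50 async=500
```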


@@ -5,46 +5,72 @@ weight: 2
# Refined Core Concepts
The core concepts that NoSQLBench is built on have been scrutinized,
replaced, refined, and hardened through several years of use by users of
various needs and backgrounds.
This level of refinement is important when trying to find a way to express
common patterns in what is often a highly fragmented practice. Testing is
hard. Scale testing is hard. Distributed testing is hard. Combined, the
challenge of executing realistic tests is often quite daunting to all but
seasoned test engineers. To make this worse, existing tools have only
skirmished with this problem enough to make dents, but none has tackled
the lack of conceptual building blocks head-on.
This has to change. We need a set of conceptual building blocks that can
span across workloads and system types, and machinery to put these
concepts to use. This is why it is important to focus on finding a useful
and robust set of concepts to use as the foundation for the rest of the
toolkit to be built on. Finding these building blocks is often one of the
most difficult tasks in systems design. Once you find and validate a
useful set of concepts, everything else gets easier.
We feel that the success that we've already had using NoSQLBench has been
strongly tied to the core concepts. Some concepts used in NoSQLBench are
shared below for illustration, but this is by no means an exhaustive list.
### The Cycle
Cycles in NoSQLBench are whole numbers on a number line. Each operation in
a NoSQLBench scenario is derived from a single cycle. It's a long value,
and a seed. The cycle determines not only which statement is selected for
execution, but also what synthetic payload data will be attached to it.
Cycles are specified as a closed-open `[min,max)` interval, just like slices
in some languages. That is, the min value is included in the range, but
the max value is not. This means that you can stack slices using common
numeric reference points without overlaps or gaps. It means you can have
exact awareness of what data is in your dataset, even incrementally.
You can think of a cycle as a single-valued coordinate system for data
that lives adjacent to that number on the number line. In this way,
virtual dataset functions are ways of converting coordinates into data.
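
For example, two runs can tile the number line exactly, with no overlap
and no gap, because of the closed-open interval convention. The workload
name is a placeholder:

```shell
# The first run covers cycles 0..999999, the second 1000000..1999999
./nb run driver=cql workload=myworkload.yaml cycles=0..1000000
./nb run driver=cql workload=myworkload.yaml cycles=1000000..2000000
```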
### The Activity
An activity is a multi-threaded flywheel of statements in some sequence
and ratio. Activities run over the numbers in a cycle range. Each activity
has a driver type which determines the native protocol that it speaks.
### The Driver Type
A driver type is a high level driver for a protocol. It is like a
statement-aware cartridge that knows how to take a basic statement
template and turn it into an operation for an activity to execute within
the scenario.
### The Scenario
The scenario is a runtime session that holds the activities while they
run. A NoSQLBench scenario is responsible for aggregating global runtime
settings, metrics reporting channels, log files, and so on. All activities
run within a scenario, under the control of the scenario script.
### The Scenario Script
Each scenario is governed by a script that runs single-threaded,
asynchronously from activities, but in control of activities. If needed,
the scenario script is automatically created for the user, and the user
never knows it is there. If the user has advanced testing requirements,
then they may take advantage of the scripting capability when needed. When
the script exits, *AND* all activities are complete, then the scenario is
complete.


@@ -5,43 +5,49 @@ weight: 12
# High Fidelity Metrics
Since NoSQLBench has been built as a serious testing tool for all users,
some attention was necessary to the way metrics are used.
## Discrete Reservoirs
In NoSQLBench, we avoid the use of time-decaying metrics reservoirs.
Internally, we use HDR reservoirs with discrete time boundaries. This is
so that you can look at the min and max values and know that they apply
accurately to the whole sampling window.
## Metric Naming
All running activities have a symbolic alias that identifies them for the
purposes of automation and metrics. If you have multiple activities
running concurrently, they will have different names and will be
represented distinctly in the metrics flow.
## Precision and Units
By default, the internal HDR histogram reservoirs are kept at 4 digits of
precision. All timers are kept at nanosecond resolution.
## Metrics Reporting
Metrics can be reported via graphite as well as CSV, logs, HDR logs, and
HDR stats summary CSV files.
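
For example, a run might report to graphite and keep local HDR logs at
the same time. Flag spellings below follow the built-in help as best
remembered, so verify them with `nb --help`:

```shell
# Send metrics to a graphite endpoint and also write HDR histogram logs
./nb run driver=cql workload=myworkload.yaml \
  --report-graphite-to graphite.example.com:2003 \
  --log-histograms histodata.log
```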
## Coordinated Omission
The metrics naming and semantics in NoSQLBench are set up so that you can
have coordinated omission metrics when they are appropriate, but there are
no other changes when they are not. This means that the metric names and
meanings remain stable in any case.
Particularly, NoSQLBench avoids the term "latency" altogether as it is
often overused and thus prone to confusing people.
Instead, the terms `service time`, `wait time`, and `response time` are
used. These are abbreviated in metrics as `servicetime`, `waittime`, and
`responsetime`.
The `servicetime` metric is the only one which is always present. When a
rate limiter is used, then additionally `waittime` and `responsetime` are
reported.


@@ -5,18 +5,22 @@ weight: 10
# NoSQLBench Showcase
Since NoSQLBench is new on the scene in its current form, you may be
wondering why you would want to use it over any other tool. That is what
this section is all about.
You don't have to read all of this! It is here for those who want to know
the answer to the question "So, what's the big deal??" Just remember it is
here for later if you want to skip to the next section and get started
testing.
NoSQLBench can do nearly everything that other testing tools can do, and
more. It achieves this by focusing on a scalable user experience in
combination with a modular internal architecture.
NoSQLBench is a workload construction and simulation tool for scalable
systems testing. That is an entirely different scope of endeavor than most
other tools.
The pages in this section all speak to a selection of advanced
capabilities that are unique to NoSQLBench.

View File

@@ -5,18 +5,23 @@ weight: 11
# Modular Architecture
The internal architecture of NoSQLBench is modular throughout. Everything
from the scripting extensions to data generation is enumerated at compile
time into a service descriptor, and then discovered at runtime by the SPI
mechanism in Java.
This means that extending and customizing bundles and features is quite
manageable.
It also means that it is relatively easy to provide a suitable API for
multi-protocol support. In fact, there are several drivers available in
the current NoSQLBench distribution. You can list them out with
`nb --list-drivers`, and you can get help on how to use each of them with
`nb help <driver name>`.
This also is a way for us to encourage and empower other contributors to
help develop the capabilities and reach of NoSQLBench. By encouraging
others to help us build NoSQLBench modules and extensions, we can help
more users in the NoSQL community at large.


@@ -5,38 +5,46 @@ weight: 2
# Portable Workloads
All of the workloads that you can build with NoSQLBench are self-contained
in a workload file. This is a statement-oriented configuration file that
contains templates for the operations you want to run in a workload.
This defines part of an activity - the iterative flywheel part that is run
directly within an activity type. This file contains everything needed to
run a basic activity -- a set of statements in some ratio. It can be used
to start an activity, or as part of several activities within a scenario.
## Standard YAML Format
The format for describing statements in NoSQLBench is generic, but in a
particular way that is specialized around describing statements for a
workload. That means that you can use the same YAML format to describe a
workload for Kafka as you can for Apache Cassandra or DSE.
The YAML structure has been tailored to describing statements, their data
generation bindings, how they are grouped and selected, and the parameters
needed by drivers, like whether they should be prepared statements or not.
Further, the YAML format allows for defaults and overrides with a very
simple mechanism that reduces editing fatigue for frequent users.
You can also template document-wide macro parameters which are taken from
the command line just like any other parameter. This is a way of
templating a workload and making it multi-purpose or adjustable on the fly.
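
A compact sketch of such a workload, using the `TEMPLATE(name,default)`
macro form. All names here are placeholders:

```yaml
bindings:
  key: Mod(TEMPLATE(keycount,1000000)); ToString()
  value: Hash(); ToString()
statements:
  - insert into ks.tbl (key,value) values ({key},{value});
```

Running with `keycount=500000` on the command line would then override
the default without editing the file.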
## Experimentation Friendly
Because the workload YAML format is generic across driver types, it is
possible to ask one driver type to interpret the statements that are meant
for another. This isn't generally a good idea, but it becomes extremely
handy when you want to have a high level driver type like `stdout`
interpret the syntax of another driver like `cql`. When you do this, the
stdout activity type _plays_ the statements to your console as they would
be executed in CQL, data bindings and all.
This means you can empirically and directly demonstrate and verify access
patterns, data skew, and other dataset details before you change back to
cql mode and turn up the settings for a higher scale test. It takes away
the guesswork about what your test is actually doing, and it works for
all drivers.
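
A typical dry-run of this kind looks like the following, with the
workload name as a placeholder:

```shell
# Render ten cycles of a CQL workload to the console, bindings and all,
# without touching a database
./nb run driver=stdout workload=myworkload.yaml cycles=10
```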


@@ -5,68 +5,93 @@ weight: 3
# Scripting Environment
The ability to write open-ended testing simulations is provided in
NoSQLBench by means of a scripted runtime, where each scenario is driven
from a control script that can do anything the user wants.
## Dynamic Parameters
Some configuration parameters of activities are designed to be assignable
while a workload is running. This makes things like threads, rates, and
other workload dynamics adjustable in real time. The internal APIs work with the
scripting environment to expose these parameters directly to scenario
scripts. Drivers that are provided to NoSQLBench can also expose dynamic
parameters in the same way so that anything can be scripted dynamically
when needed.
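
A small scenario-script sketch of this, with the alias and values chosen
for illustration; the `activities.<alias>.<param>` access pattern is shown
as the general shape of the API rather than a verified reference:

```javascript
// Start an activity, let it warm up, then retune it while it runs
scenario.start("driver=cql workload=myworkload.yaml alias=main threads=10");
scenario.waitMillis(60000);
activities.main.threads = 20;           // scale threads up on the fly
activities.main.cyclerate = "8000,1.1"; // retune the rate limiter
```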
## Scripting Automatons
When a NoSQLBench scenario is running, it is under the control of a
single-threaded script. Each activity that is started by this script is
run within its own thread pool, simultaneously and asynchronously.
The control script has executive control of the activities, as well as
full visibility into the metrics that are provided by each activity. The
way these two parts of the runtime meet is through the service objects
which are installed into the scripting runtime. These service objects
provide a named access point for each running activity and its metrics.
This means that the scenario script can do something simple, like start
activities and wait for them to complete, OR, it can do something more
sophisticated like dynamically and iteratively scrutinize the metrics and
make real-time adjustments to the workload while it runs.
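
For instance, a control script might watch a service-time percentile and
back off the rate when it degrades. The metric path and method names below
are assumptions about the shape of the service objects, not a verified
API:

```javascript
// Sketch: sample p99 service time every 5s and throttle if it exceeds 50ms
scenario.start("driver=cql workload=myworkload.yaml alias=main cyclerate=10000");
for (var i = 0; i < 60; i++) {
  scenario.waitMillis(5000);
  var p99 = metrics.main.cycles_servicetime.getSnapshot().get99thPercentile();
  if (p99 > 50000000) { // 50ms, in nanoseconds
    activities.main.cyclerate = "5000,1.1";
  }
}
scenario.stop("main");
```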
## Analysis Methods
Scripting automatons that do feedback-oriented analysis of a target system
are called analysis methods in NoSQLBench. We have prototyped a couple of
these already, but there is nothing keeping the adventurous from coming up
with their own.
## Command Line Scripting
The command line has the form of basic test commands and parameters. These
commands get converted directly into scenario control script in the order
they appear. The user can choose whether to stay in high level executive
mode, with simple commands like `nb test-scenario ...`, or to drop down
directly into script design. They can look at the equivalent script for
any command line by running it with `--show-script`. If you take the
script that is dumped to console and run it, it will do exactly the same
thing as if you hadn't even looked at it and just ran basic commands on
the command line.
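
For example:

```shell
# Print the scenario script that this command line compiles to
./nb run driver=cql workload=myworkload.yaml cycles=1000 --show-script
```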
There are even ways to combine script fragments, full commands, and calls
to scripts on the command line. Since each variant is merely a way of
constructing scenario script, they all get composited together before the
scenario script is run.
New introductions to NoSQLBench should focus on the command line. Once a
user is familiar with this, it is up to them whether to tap into the
deeper functionality. If they don't need to know about scenario scripting,
then they shouldn't have to learn about it to be effective. This is what
we are calling a _scalable user experience_.
## Compared to DSLs
Other tools may claim that their DSL makes scenario "simulation" easier.
In practice, any DSL is generally dependent on a development tool to lay
the language out in front of a user in a fluent way. This means that DSLs
are almost always developer-targeted tools, and mostly useless for casual
users who don't want to break out an IDE.
One of the things a DSL proponent may tell you is that it tells you "all
the things you can do!". This is de facto the same as telling you "all
the things you can't do", because anything outside of that is not part of
the DSL. This is not a win for the user. For DSL-based systems, the user
has to use the DSL whether or not it enhances their creative control,
while in fact, most DSLs aren't rich enough to do much that is interesting
from a simulation perspective.
In NoSQLBench, we don't force the user to use the programming abstractions
except at a very surface level -- the CLI. It is up to the user whether or
not to open the secret access panel for the more advanced functionality. If
they decide to do this, we give them a commodity language (ECMAScript),
and we wire it into all the things they were already using. We don't take
away their creative freedom by telling them what they can't do. This way,
users can pick their level of investment and reward as best fits their
individual needs, as it should be.
## Scripting Extensions
Also mentioned under the section on modularity, it is relatively easy for
a developer to add their own scripting extensions into NoSQLBench as named
service objects.


@@ -5,71 +5,105 @@ weight: 1
# Virtual Datasets
The _Virtual Dataset_ capabilities within NoSQLBench allow you to generate
data on the fly. There are many reasons for using this technique in
testing, but it is often a topic that is overlooked or taken for granted.
## Industrial Strength
The algorithms used to generate data are based on advanced techniques in
the realm of variate sampling. The authors have gone to great lengths to
ensure that data generation is efficient and as close to O(1) in processing
time as possible.
For example...
One technique that is used to achieve this is to initialize and cache data
in high resolution look-up tables for distributions which may otherwise
perform differently depending on their respective density functions. The
existing Apache Commons Math libraries have been adapted into a set of
interpolated Inverse Cumulative Distribution sampling functions. This
means that you can use them all in the same place as you would a Uniform
distribution, and once initialized, they sample with identical overhead.
This means that by changing your test definition, you don't accidentally
change the behavior of your test client, only the data as intended.
## A Purpose-Built Tool
Many other testing systems avoid building a dataset generation component.
It's a tough problem to solve, so it's often just avoided. Instead, they
use libraries like "faker" or other sources of data which weren't designed
for testing at scale. Faker is well named, no pun intended. It was meant
as a vignette and wire-framing library, not a source of test data for
realistic results. If you are using a testing tool for scale testing and
relying on a faker variant, then you will almost certainly get invalid
results that do not represent how a system would perform in production.
The virtual dataset component of NoSQLBench is a library that was designed
for high scale and realistic data streams. It uses the limits of the data
types in the JVM to simulate high cardinality datasets which approximate
production data distributions for realistic and reproducible results.
## Deterministic
The data that is generated by the virtual dataset libraries is
deterministic. This means that for a given cycle in a test, the operation
that is synthesized for that cycle will be the same from one session to
the next. This is intentional. If you want to perturb the test data from
one session to the next, then you can most easily do it by simply
selecting a different set of cycles as your basis.
This means that if you find something interesting in a test run, you can
go back to it just by specifying the cycles in question. It also means
that you aren't losing comparative value between tests with additional
randomness thrown in. The data you generate will still look random to the
human eye, but that doesn't mean that it can't be reproducible.
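
For example, if something interesting happened around cycles
353000..354000 of a long run, re-running exactly those cycles regenerates
exactly the same operations and data. The workload name is a placeholder:

```shell
./nb run driver=cql workload=myworkload.yaml cycles=353000..354000
```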
## Statistically Shaped
All this means is that the values you use to tie your dataset together can
be specific to any distribution that is appropriate. You can ask for a
stream of floating point values 1 trillion values long, in any order. You
can use discrete or continuous distributions, with whatever distribution
parameters you need.
## Best of Both Worlds
Some might worry that fully synthetic testing data is not realistic
enough. The devil is in the details on these arguments, but suffice it to
say that you can pick the level of real data you use as seed data with
NoSQLBench. You don't have to choose between realism and agility. The
procedural data generation approach allows you to have all the benefits of
testing agility of low-entropy testing tools while retaining nearly all of
the benefits of real testing data.
For example, using the alias sampling method and a published US census
(public domain) list of names and surnames that occurred more than 100x, we
can provide extremely accurate samples of names according to the published
labels and weights. The alias method allows us to sample accurately in
O(1) time from the entire dataset by turning a large number of weights
into two uniform samples. You will simply not find a better way to sample
realistic (US) names than this. (If you do, please file an issue!)
Actually, any data set that you have in CSV form with a weight column can
also be used this way, so you're not strictly limited to US census data.
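
As a sketch, a binding can sample from such a weighted CSV directly. The
function name below is given as commonly found in the bundled function
library; treat it as an assumption and check the built-in help for the
exact signature:

```yaml
bindings:
  # Sample names proportionally to a weight column, alias-method style
  first_name: CSVFrequencySampler('data/firstnames.csv','Name')
```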
## Java Idiomatic Extension
The way that the virtual dataset component works allows Java developers to
write any extension to the data generation functions simply in the form of
Java 8 or newer Functional interfaces. As long as they include the
annotation processor and annotate their classes, they will show up in the
runtime and be available to any workload by their class name.
Additionally, annotation-based examples and annotation processing are
used to hoist function docs directly into the published docs that go along
with any version of NoSQLBench.
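
A minimal sketch of such an extension, assuming the virtdata annotation
processor is on the build path; the annotation's package varies by
version, so the import below is an assumption:

```java
import java.util.function.LongFunction;
import io.nosqlbench.virtdata.annotations.ThreadSafeMapper; // assumed package

// A custom binding function which deterministically derives a value
// from the cycle number
@ThreadSafeMapper
public class HexUser implements LongFunction<String> {
    @Override
    public String apply(long cycle) {
        return "user_" + Long.toHexString(cycle);
    }
}
```

Once compiled onto the classpath, it would be usable in bindings by class
name, like any bundled function.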
## Binding Recipes
It is possible to stitch data generation functions together directly in a
workload YAML. These are data-flow sketches of functions that can be
copied and pasted between workload descriptions to share or remix data
streams. This allows for the adventurous to build sophisticated virtual
datasets that emulate nuances of real datasets, but in a form that takes
up less space on the screen than this paragraph!
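
For example, a couple of binding recipes which chain functions left to
right, once per cycle; the function names are drawn from the bundled
function library as best remembered:

```yaml
bindings:
  # A sequential key space folded onto one million distinct values
  seq_key: Mod(1000000); ToString()
  # A pseudo-random but deterministic value derived from the cycle
  rand_value: Hash(); Mod(1000000000); ToString()
```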