Merge branch 'master' into releases

Jonathan Shook 2020-03-13 02:51:18 -05:00
commit fae0ea49e9
18 changed files with 205 additions and 100 deletions

View File

@@ -1,4 +1,4 @@
name: CI
name: build
on:
push:

View File

@@ -1,4 +1,4 @@
name: CI
name: release
on:
push:

View File

@@ -1,3 +1,5 @@
![maven build](https://github.com/nosqlbench/nosqlbench/workflows/CI/badge.svg)
This project combines upstream projects of engineblock and virtualdataset into one main project. More details on release practices and contributor guidelines are on the way.
# Status

View File

@@ -1,6 +1,6 @@
---
title: CLI Scripting
--------------------
---
# CLI Scripting

View File

@@ -0,0 +1,100 @@
---
title: Binding Concepts
weight: 10
---
NoSQLBench has a built-in library for the flexible management and expressive use of
procedural generation libraries. This section explains the core concepts
of this library, known as _Virtual Data Set_.
## Variates (Samples)
A numeric sample that is drawn from a distribution for the purpose
of simulation or analysis is called a *Variate*.
## Procedural Generation
Procedural generation is a category of algorithms and techniques which take
a set or stream of inputs and produce an output in a different form or structure.
While it may appear that procedural generation actually _generates_ data, no output
can come from a void. These techniques simply perturb a value in some stateful way,
or map a coordinate system to another representation. Sometimes, both techniques are
combined together.
## Uniform Variate
A variate (sample) drawn from a uniform (flat) distribution is what we are used
to seeing when we ask a system for a "random" value. These are often produced in
one of two very common forms: either a register full of bits, as with most hashing
functions, or a floating point value between 0.0 and 1.0 (the _unit
interval_).
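As a rough illustration of the two forms, the following Python sketch converts a 64-bit register value into a floating point value on the unit interval. The helper name `to_unit_interval` is hypothetical, and keeping the top 53 bits is one common convention rather than anything specific to NoSQLBench.

```python
import random

MANTISSA_BITS = 53  # a double can hold 53 significant bits exactly

def to_unit_interval(bits64):
    """Map a 64-bit register value onto [0.0, 1.0) using its top 53 bits."""
    return (bits64 >> (64 - MANTISSA_BITS)) / float(1 << MANTISSA_BITS)

# Form 1: a register full of bits.
register = random.getrandbits(64)

# Form 2: the same variate expressed as a value on the unit interval.
unit = to_unit_interval(register)
```

Both values carry the same sample; only the representation differs.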
Uniform variates are not really random. Without careful attention to API usage,
such random samples are not even unique from session to session. In many systems,
the programmer has to be very careful to seed the random generator or they will
get the same sequence of numbers every time they run their program. This turns out
to be a useful property, and the random number generators that behave this way are
usually called Pseudo-Random Number Generators, or PRNGs.
## Apparently Random Variates
Uniform variates produced by PRNGs are not actually random, even though they may
pass certain tests for randomness. The streams of values produced are nearly
always measurably random by some meaningful standard. However, they can be
used again in exactly the same way with the same initial seed.
## Deterministic Variates
If you intentionally avoid randomizing the initial seed for a PRNG (for example,
by not seeding it with the current timestamp), you gain a way to replay a sequence.
You can think of each initial seed as a _bank_ of values which you can go back
to at any time. However, when using stateful PRNGs as a way to provide these
variates, your results will be order dependent.
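The replay property and its order dependence can be sketched in Python, using the standard library's `random.Random` as a stand-in for any stateful PRNG. This is illustrative only and is not NoSQLBench code; the helper name `draw` is hypothetical.

```python
import random

def draw(seed, count):
    """Draw `count` uniform variates from a PRNG started at a fixed seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(count)]

# The same seed replays the same "bank" of values.
first_run = draw(seed=42, count=5)
second_run = draw(seed=42, count=5)

# A different seed selects a different bank.
other_bank = draw(seed=43, count=5)

# Order dependence: the third value can only be reached by
# drawing (and discarding) the first two.
rng = random.Random(42)
rng.random()
rng.random()
third = rng.random()
```

Here `first_run` and `second_run` are identical, and `third` matches `first_run[2]` only because the two earlier draws were made first.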
## Randomly Accessible Determinism
Instead of using a PRNG, it is possible to use a hash function. With a 64-bit
register, you have 2^64 (2^63 in practice due to available implementations) possible
values. If your hash function has high dispersion, then you will effectively
get the same result of apparent randomness as well as deterministic sequences, even
when you use simple sequences of inputs to your _random()_ function. This allows
you to access a random value in bucket 57, for example, and go back to it at any
time and in any order to get the same value again.
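A minimal sketch of this idea in Python follows. The mixing function is a SplitMix64-style finalizer chosen here only as an example of a high-dispersion 64-bit hash; it is an assumption for illustration, not the hash that NoSQLBench uses.

```python
MASK64 = (1 << 64) - 1

def mix64(x):
    """SplitMix64-style mixer: a stateless stand-in for a high-dispersion hash."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)

# Simple sequential inputs yield apparently random outputs.
forward = [mix64(i) for i in range(100)]

# Random access: visit the same buckets in reverse order.
backward = {i: mix64(i) for i in reversed(range(100))}

# Bucket 57 yields the same value no matter when or in what order it is read.
bucket_57 = mix64(57)
```

Because `mix64` keeps no state between calls, no earlier values need to be drawn before reading bucket 57.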
## Data Mapping Functions
The data mapping functions are the core building blocks of Virtual Data Set.
Data mapping functions are generally pure functions. This simply means that
a generator function will always provide the same result given the same input.
The parameters that you will see on some binding recipes are not representative
of volatile state. These parameters are initializer values which are part of a
function's definition. For example, a `Mod(5)` will always behave like a `Mod(5)`,
as a pure function. But a `Mod(7)` will behave differently than a `Mod(5)`, although
each function will always produce its own stable result for a given input.
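The distinction between initializer values and volatile state can be sketched in Python. The `Mod` class below is a hypothetical stand-in written for illustration; it is not the NoSQLBench implementation of `Mod`.

```python
class Mod:
    """A pure mapping function whose divisor is fixed when it is defined."""

    def __init__(self, modulo):
        # Initializer value: part of the function's definition, never mutated.
        self.modulo = modulo

    def __call__(self, value):
        # The result depends only on the input and the fixed initializer.
        return value % self.modulo

mod5 = Mod(5)
mod7 = Mod(7)

a = mod5(12)  # 12 % 5
b = mod7(12)  # 12 % 7
```

Calling `mod5(12)` any number of times returns the same result, while `mod7(12)` returns a different but equally stable result.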
## Combining RNGs and Data Mapping Functions
Because pure functions play such a key part in procedural generation techniques,
the terms "data mapping function", "data mapper" and "data mapping library" will
be more common in the library than "generator". Conceptually, mapping functions
do not generate anything. It makes more sense to think of mapping data from one
domain to another. Even so, the data that is yielded by mapping functions can
appear quite realistic.
Because good RNGs do generally contain internal state, they aren't purely
functional. This means that in some cases, namely those in which you need
random access to a virtual data set, hash functions make more sense. This
toolkit allows you to choose between the two in some cases. However, it
generally favors using hashing and pure-function approaches where possible. Even
the statistical curve simulations do this.
## Bindings Template
It is often useful to have a template that describes a set of generator
functions that can be reused across many threads or other application scopes. A
bindings template is a way to capture the requested generator functions for
re-use, with actual scope instantiation of the generator functions controlled by
the usage point. For example, in a JEE app, you may have a bindings template in
the application scope, and a set of actual bindings within each request (thread
scope).
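One way to picture the split between a shared template and per-scope bindings is the Python sketch below. All names here (`BindingsTemplate`, `resolve`, `Mod`, `bucket`) are hypothetical and chosen for illustration; they do not describe the actual Virtual Data Set API.

```python
class Mod:
    """Hypothetical pure mapping function used as a binding recipe."""

    def __init__(self, modulo):
        self.modulo = modulo

    def __call__(self, value):
        return value % self.modulo

class BindingsTemplate:
    """Captures the requested recipes once; instantiates them per scope."""

    def __init__(self, recipes):
        # Map of binding name -> zero-argument factory for the function.
        self._recipes = dict(recipes)

    def resolve(self):
        # Each call builds a fresh set of function instances for one scope.
        return {name: factory() for name, factory in self._recipes.items()}

# Shared, long-lived template (for example, application scope).
template = BindingsTemplate({"bucket": lambda: Mod(5)})

# Separate bindings for two usage points (for example, two threads).
bindings_a = template.resolve()
bindings_b = template.resolve()
```

Each scope gets its own function instances, yet both map the same input to the same output because the recipes are identical.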

View File

@@ -1,4 +1,8 @@
# CATEGORY collections
---
title: collections functions
weight: 40
---
## HashedLineToStringList
Creates a List\<String\> from a list of words in a file.

View File

@@ -1,4 +1,8 @@
# CATEGORY conversion
---
title: conversion functions
weight: 30
---
## DigestToByteBuffer
Computes the digest of the ByteBuffer on input and stores it in the output ByteBuffer. The digestTypes available are:

View File

@@ -1,4 +1,8 @@
# CATEGORY datetime
---
title: datetime functions
weight: 20
---
## DateTimeParser
This function will parse a String containing a formatted date time, yielding a DateTime object. If no arguments are provided, then the format is set to

View File

@@ -1,4 +1,8 @@
# CATEGORY diagnostics
---
title: diagnostic functions
weight: 40
---
## Show
Show diagnostic values for the thread-local variable map.

View File

@@ -1,4 +1,8 @@
# CATEGORY distributions
---
title: distribution functions
weight: 30
---
## Beta
@see [Wikipedia: Beta distribution](https://en.wikipedia.org/wiki/Beta_distribution) @see [Commons JavaDoc: BetaDistribution](https://commons.apache.org/proper/commons-statistics/commons-statistics-distribution/apidocs/org/apache/commons/statistics/distribution/BetaDistribution.html)

View File

@@ -1,4 +1,8 @@
# CATEGORY functional
---
title: utility functions
weight: 40
---
## IntFlow
Combine multiple IntUnaryOperators into a single function.

View File

@@ -1,4 +1,8 @@
# CATEGORY general
---
title: general functions
weight: 20
---
## Add
Adds a value to the input.

View File

@@ -1,4 +1,8 @@
# CATEGORY nulls
---
title: null functions
weight: 40
---
## NullIfCloseTo
Returns null if the input value is within range of the specified value.

View File

@@ -1,4 +1,8 @@
# CATEGORY premade
---
title: pre-made functions
weight: 20
---
## FirstNames
Return a pseudo-randomly sampled first name from the last US census data on first names occurring more than 100 times. Both male and female names are combined in this function.

View File

@@ -1,4 +1,8 @@
# CATEGORY state
---
title: state functions
weight: 30
---
## Clear
Clears the per-thread map which is used by the Expr function.

View File

@@ -0,0 +1,25 @@
---
title: Binding Functions
weight: 100
---
The functions which you can use to generate data in your workloads are
called *bindings*. They are injected into your statement templates by
name, just as you might do with named parameters in CQL statements.
These functions can be stitched together in small recipes. When you give
these mapping functions useful names in your workloads, they are called
bindings.
Here is an example:
```yaml
bindings:
numbers: NumberNameToString()
names: FirstNames()
```
These are two bindings that you can use in your workloads. The names on the left
are the _binding names_ and the functions on the right are the _binding recipes_.
Altogether, we just call them _bindings_.

View File

@@ -0,0 +1,25 @@
---
title: Using Bindings
weight: 15
---
The functions which you can use to generate data in your workloads are
mapped into your operations by name, just like you would do with a
prepared statement, for example.
These functions can be stitched together in small recipes. When you give
these mapping functions useful names in your workloads, they are called
bindings.
Here is an example:
```yaml
bindings:
numbers: NumberNameToString()
names: FirstNames()
```
These are two bindings that you can use in your workloads. The names on the left
are the _binding names_ and the functions on the right are the _binding recipes_.
Altogether, we just call them _bindings_.

View File

@@ -1,87 +0,0 @@
# Virtual Dataset Concepts
VirtData is a library for the flexible management and expressive use of
procedural generation libraries. It is a reincarnation of a previous project.
This version of the idea starts by focusing directly on usage aspects and
extension points rather than the big idea.
### Procedural Generation
Procedural generation is a general class of methods for taking a set of inputs
and modifying them in a predictable way to generate content which appears random
but is actually deterministic. For example, some games use procedural generation
to take a single value known as the "seed" to generate an apparently rich and
interesting world.
### Apparently Random RNGs
Sequences of values produced by RNGs (more properly called PRNGs) are not
actually random, even though they may pass certain tests for randomness. In
practice, the combination of these two properties is quite valuable for testing
and data synthesis. Having a stream of data that is measurably random by some
meaningful standard, but which is configurable and reusable allows tests to
be replayed, for example.
### Apparently Random Samples
Just as RNGs can appear random when they are not truly random, statistical distributions
which rely on them can also appear random. Uniform random number generators over
the unit interval [0,1.0) are a common input to virtual sampling methods. This
means that if you can configure the RNG stream that you feed into your virtual
sampling methods, you can simulate a repeatable sequence from a known
distribution.
### Data Mapping Functions
The data mapping functions are the core building block of virtdata. They are the
functional logic that powers all procedural generation. Data mapping functions
are generally pure functions. This simply means that a generator function will
always provide the same result given the same input. All top-level mapping
functions take a long value as their input, and produce a result based on
their parameterized type.
##### Combining RNGs and Data Mapping Functions
Because pure functions play such a key part in procedural generation techniques,
the terms "data mapping function", "data mapper" and "data mapping library" will
be more common in the library than "generator". Conceptually, mapping functions
do not generate anything. It makes more sense to think of mapping data from one
domain to another. Even so, the data that is yielded by mapping functions can
appear quite realistic.
Because good RNGs do generally contain internal state, they aren't purely
functional. This means that in some cases, namely those in which you need
random access to a virtual data set, hash functions make more sense. This
toolkit allows you to choose between the two in some cases. However, it
generally favors using hashing and pure-function approaches where possible. Even
the statistical curve simulations do this.
### Data Mapper Library
Data Mapping functions are packaged into libraries which can be loaded by the
virtdata-user component of the project. Each library has a name, a function
resolver, and a set of functions that can be instantiated via the function
resolver.
### Function Resolver
Each library must implement its own function resolver. This is because each
library may have a different way of naming, finding, creating or managing
function generator instances. For the user, the description of a generator is
simply a string. What the generator library does with it is
implementation-specific. This means that some generator libraries may simply
have constructor signatures as function specifiers, and others may go as far as
implementing their own DSL. The basic contract for a function resolver is that
you pass it a string describing what you want, and it provides a generator
function in return.
#### Bindings Template
It is often useful to have a template that describes a set of generator
functions that can be reused across many threads or other application scopes. A
bindings template is a way to capture the requested generator functions for
re-use, with actual scope instantiation of the generator functions controlled by
the usage point. For example, in a JEE app, you may have a bindings template in
the application scope, and a set of actual bindings within each request (thread
scope).