mirror of
https://github.com/nosqlbench/nosqlbench.git
synced 2025-01-26 15:36:33 -06:00
Merge branch 'master' into releases
This commit is contained in:
commit
fae0ea49e9
2
.github/workflows/build.yml
vendored
@ -1,4 +1,4 @@
name: CI
name: build

on:
  push:
2
.github/workflows/release.yml
vendored
@ -1,4 +1,4 @@
name: CI
name: release

on:
  push:
@ -1,3 +1,5 @@
![maven build](https://github.com/nosqlbench/nosqlbench/workflows/CI/badge.svg)

This project combines upstream projects of engineblock and virtualdataset into one main project. More details on release practices and contributor guidelines are on the way.

# Status
@ -1,6 +1,6 @@
---
title: CLI Scripting
--------------------
---

# CLI Scripting

@ -0,0 +1,100 @@
---
title: Binding Concepts
weight: 10
---

NoSQLBench has a built-in library for the flexible management and expressive use of
procedural generation libraries. This section explains the core concepts
of this library, known as _Virtual Data Set_.

## Variates (Samples)

A numeric sample that is drawn from a distribution for the purpose
of simulation or analysis is called a *Variate*.

## Procedural Generation

Procedural generation is a category of algorithms and techniques which take
a set or stream of inputs and produce an output in a different form or structure.
While it may appear that procedural generation actually _generates_ data, no output
can come from a void. These techniques simply perturb a value in some stateful way,
or map a coordinate system to another representation. Sometimes, both techniques are
combined.

## Uniform Variate

A variate (sample) drawn from a uniform (flat) distribution is what we are used
to seeing when we ask a system for a "random" value. These are often produced in
one of two very common forms: either a register full of bits, as with most hashing
functions, or a floating point value between 0.0 and 1.0 (this range is called the
_unit interval_).

Uniform variates are not really random. Without careful attention to API usage,
such random samples are not even unique from session to session. In many systems,
the programmer has to be very careful to seed the random generator or they will
get the same sequence of numbers every time they run their program. This turns out
to be a useful property, and the random number generators that behave this way are
usually called Pseudo-Random Number Generators, or PRNGs.

## Apparently Random Variates

Uniform variates produced by PRNGs are not actually random, even though they may
pass certain tests for randomness. The streams of values produced are nearly
always measurably random by some meaningful standard. However, they can be
used again in exactly the same way with the same initial seed.

## Deterministic Variates

If you intentionally avoid randomizing the initial seed for a PRNG (for example,
by seeding it with the current timestamp), then you gain a way to replay a sequence.
You can think of each initial seed as a _bank_ of values which you can go back
to at any time. However, when using stateful PRNGs as a way to provide these
variates, your results will be order dependent.

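The replay and order-dependence properties described above can be sketched in a few lines. This is only an illustration using Python's standard `random` module, not NoSQLBench code; the seed values and the helper name `draw` are arbitrary choices made for the example.

```python
import random


def draw(seed, count):
    """Return `count` unit-interval variates from a PRNG seeded with `seed`."""
    rng = random.Random(seed)  # a fixed seed selects the same "bank" every time
    return [rng.random() for _ in range(count)]


first_run = draw(seed=42, count=5)
second_run = draw(seed=42, count=5)
other_bank = draw(seed=43, count=5)

# Same seed, same sequence: the run can be replayed exactly.
assert first_run == second_run
# A different seed selects a different bank of values.
assert first_run != other_bank
# Order dependence: the third value is only reached by drawing the first two.
assert draw(seed=42, count=3)[2] == first_run[2]
```

Because the generator is stateful, there is no direct way in this sketch to ask for "the third value" without first producing the two before it.
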
## Randomly Accessible Determinism

Instead of using a PRNG, it is possible to use a hash function. With a 64-bit
register, you have 2^64 (2^63 in practice due to available implementations) possible
values. If your hash function has high dispersion, then you will effectively
get the same result of apparent randomness as well as deterministic sequences, even
when you use simple sequences of inputs to your _random()_ function. This allows
you to access a random value in bucket 57, for example, and go back to it at any
time and in any order to get the same value again.

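The "bucket 57" idea can be sketched as follows. SHA-256 is an assumption here, chosen only because it ships with Python; it is not necessarily the hash that NoSQLBench uses, and the function names are hypothetical.

```python
import hashlib


def hashed_variate(bucket: int) -> int:
    """Map an input value to a stable 64-bit value by hashing it."""
    digest = hashlib.sha256(bucket.to_bytes(8, "big", signed=True)).digest()
    return int.from_bytes(digest[:8], "big")  # keep the first 64 bits


def unit_interval(bucket: int) -> float:
    """Scale the hashed value into [0.0, 1.0) using its top 53 bits."""
    return (hashed_variate(bucket) >> 11) / 2**53


value_57 = hashed_variate(57)

# Visit other buckets in any order, then come back to bucket 57.
for bucket in (3, 1000, 12, 58):
    hashed_variate(bucket)

assert hashed_variate(57) == value_57
```

Unlike the seeded PRNG, no internal state is carried between calls, so the result for a given input does not depend on what was requested before it.
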
## Data Mapping Functions

The data mapping functions are the core building blocks of Virtual Data Set.
Data mapping functions are generally pure functions. This simply means that
a generator function will always provide the same result given the same input.
The parameters that you will see on some binding recipes are not representative
of volatile state. These parameters are initializer values which are part of a
function's definition. For example, a `Mod(5)` will always behave like a `Mod(5)`,
as a pure function. But a `Mod(7)` will behave differently than a `Mod(5)`, although
each function will always produce its own stable result for a given input.

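The distinction between initializer parameters and volatile state can be shown with a small closure. This mimics the idea behind `Mod(5)` and `Mod(7)`; it is a sketch, not the actual NoSQLBench implementation, and the lowercase `mod` helper is hypothetical.

```python
def mod(divisor: int):
    """Build a mapping function; `divisor` is fixed when the function is defined."""

    def apply(value: int) -> int:
        return value % divisor

    return apply


mod5 = mod(5)
mod7 = mod(7)

assert mod5(23) == 3  # 23 % 5
assert mod7(23) == 2  # 23 % 7
# Each function is pure: the same input always maps to the same output.
assert mod5(23) == mod5(23)
```
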
## Combining RNGs and Data Mapping Functions

Because pure functions play such a key part in procedural generation techniques,
the terms "data mapping function", "data mapper" and "data mapping library" will
be more common in the library than "generator". Conceptually, mapping functions
do not generate anything. It makes more sense to think of mapping data from one
domain to another. Even so, the data that is yielded by mapping functions can
appear quite realistic.

Because good RNGs do generally contain internal state, they aren't purely
functional. This means that in some cases, namely those in which you need
random access to a virtual data set, hash functions make more sense. This
toolkit allows you to choose between the two in some cases. However, it
generally favors using hashing and pure-function approaches where possible. Even
the statistical curve simulations do this.

## Bindings Template

It is often useful to have a template that describes a set of generator
functions that can be reused across many threads or other application scopes. A
bindings template is a way to capture the requested generator functions for
re-use, with actual scope instantiation of the generator functions controlled by
the usage point. For example, in a JEE app, you may have a bindings template in
the application scope, and a set of actual bindings within each request (thread
scope).
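
One way to picture the split between a shared template and per-thread bindings is sketched below. The `BindingsTemplate` class, its `resolve` method, and the binding names are all hypothetical and exist only for this illustration; they are not the NoSQLBench API.

```python
import threading


class BindingsTemplate:
    """Holds named recipes; creates fresh function instances on request."""

    def __init__(self, recipes):
        self._recipes = dict(recipes)  # name -> zero-argument factory

    def resolve(self):
        # Instantiate every mapping function for the calling scope.
        return {name: factory() for name, factory in self._recipes.items()}


def mod(divisor):
    return lambda value: value % divisor


# One template, shared at "application scope".
template = BindingsTemplate({"bucket": lambda: mod(5), "shard": lambda: mod(7)})

results = {}


def worker(worker_id):
    bindings = template.resolve()  # actual bindings, owned by this thread
    results[worker_id] = {name: fn(23) for name, fn in bindings.items()}


threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every thread sees the same mapping results, using its own instances.
assert all(r == {"bucket": 3, "shard": 2} for r in results.values())
```
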
@ -1,4 +1,8 @@
# CATEGORY collections
---
title: collections functions
weight: 40
---

## HashedLineToStringList

Creates a List\<String\> from a list of words in a file.
@ -1,4 +1,8 @@
# CATEGORY conversion
---
title: conversion functions
weight: 30
---

## DigestToByteBuffer

Computes the digest of the ByteBuffer on input and stores it in the output ByteBuffer. The digestTypes available are:
@ -1,4 +1,8 @@
# CATEGORY datetime
---
title: datetime functions
weight: 20
---

## DateTimeParser

This function will parse a String containing a formatted date time, yielding a DateTime object. If no arguments are provided, then the format is set to
@ -1,4 +1,8 @@
# CATEGORY diagnostics
---
title: diagnostic functions
weight: 40
---

## Show

Show diagnostic values for the thread-local variable map.
@ -1,4 +1,8 @@
# CATEGORY distributions
---
title: distribution functions
weight: 30
---

## Beta

@see [Wikipedia: Beta distribution](https://en.wikipedia.org/wiki/Beta_distribution) @see [Commons JavaDoc: BetaDistribution](https://commons.apache.org/proper/commons-statistics/commons-statistics-distribution/apidocs/org/apache/commons/statistics/distribution/BetaDistribution.html)
@ -1,4 +1,8 @@
# CATEGORY functional
---
title: utility functions
weight: 40
---

## IntFlow

Combine multiple IntUnaryOperators into a single function.
@ -1,4 +1,8 @@
# CATEGORY general
---
title: general functions
weight: 20
---

## Add

Adds a value to the input.
@ -1,4 +1,8 @@
# CATEGORY nulls
---
title: null functions
weight: 40
---

## NullIfCloseTo

Returns null if the input value is within range of the specified value.
@ -1,4 +1,8 @@
# CATEGORY premade
---
title: pre-made functions
weight: 20
---

## FirstNames

Return a pseudo-randomly sampled first name from the last US census data on first names occurring more than 100 times. Both male and female names are combined in this function.
@ -1,4 +1,8 @@
# CATEGORY state
---
title: state functions
weight: 30
---

## Clear

Clears the per-thread map which is used by the Expr function.
@ -0,0 +1,25 @@
---
title: Binding Functions
weight: 100
---

The functions which you can use to generate data in your workloads are
called *bindings*. They are injected into your statement templates by
name, just as you might do with named parameters in CQL statements.

These functions can be stitched together in small recipes. When you give
these mapping functions useful names in your workloads, they are called
bindings.

Here is an example:

```yaml
bindings:
  numbers: NumberNameToString()
  names: FirstNames()
```

These are two bindings that you can use in your workloads. The names on the left
are the _binding names_ and the functions on the right are the _binding recipes_.
Altogether, we just call them _bindings_.
@ -0,0 +1,25 @@
---
title: Using Bindings
weight: 15
---

The functions which you can use to generate data in your workloads are
mapped into your operations by name, just like you would do with a
prepared statement, for example.

These functions can be stitched together in small recipes. When you give
these mapping functions useful names in your workloads, they are called
bindings.

Here is an example:

```yaml
bindings:
  numbers: NumberNameToString()
  names: FirstNames()
```

These are two bindings that you can use in your workloads. The names on the left
are the _binding names_ and the functions on the right are the _binding recipes_.
Altogether, we just call them _bindings_.
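
The relationship between binding names, binding recipes, and an operation template can be sketched as below. The two Python functions are simplified stand-ins written for this example; they do not reproduce the behavior of the real `NumberNameToString()` or `FirstNames()` functions, and the `{name}` placeholder syntax, the statement text, and the tiny name list are assumptions.

```python
def number_name(cycle: int) -> str:
    """Stand-in recipe: spell out each digit of the input."""
    words = ["zero", "one", "two", "three", "four",
             "five", "six", "seven", "eight", "nine"]
    return " ".join(words[int(digit)] for digit in str(abs(cycle)))


def first_name(cycle: int) -> str:
    """Stand-in recipe: pick a name from a small fixed list."""
    names = ["Alice", "Bob", "Carol", "Dave"]
    return names[cycle % len(names)]


# Binding names on the left, binding recipes on the right.
bindings = {"numbers": number_name, "names": first_name}


def render(template: str, cycle: int) -> str:
    """Evaluate every recipe for this cycle and inject the results by name."""
    values = {name: recipe(cycle) for name, recipe in bindings.items()}
    return template.format(**values)


statement = "insert into demo (num, name) values ('{numbers}', '{names}')"
assert render(statement, 5) == "insert into demo (num, name) values ('five', 'Bob')"
```

Rendering the same statement with the same input value always yields the same text, which is the property the mapping-function approach is built on.
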
@ -1,87 +0,0 @@
# Virtual Dataset Concepts

VirtData is a library for the flexible management and expressive use of
procedural generation libraries. It is a reincarnation of a previous project.
This version of the idea starts by focusing directly on usage aspects and
extension points rather than the big idea.

### Procedural Generation

Procedural generation is a general class of methods for taking a set of inputs
and modifying them in a predictable way to generate content which appears random
but is actually deterministic. For example, some games use procedural generation
to take a single value known as the "seed" to generate an apparently rich and
interesting world.

### Apparently Random RNGs

Sequences of values produced by RNGs (more properly called PRNGs) are not
actually random, even though they may pass certain tests for randomness. In
practice, the combination of these two properties is quite valuable for testing
and data synthesis. Having a stream of data that is measurably random by some
meaningful standard, but which is configurable and reusable, allows tests to
be replayed, for example.

### Apparently Random Samples

Just as RNGs can appear random when they are not truly random, statistical distributions
which rely on them can also appear random. Uniform random number generators over
the unit interval [0,1.0) are a common input to virtual sampling methods. This
means that if you can configure the RNG stream that you feed into your virtual
sampling methods, you can simulate a repeatable sequence from a known
distribution.

### Data Mapping Functions

The data mapping functions are the core building block of virtdata. They are the
functional logic that powers all procedural generation. Data mapping functions
are generally pure functions. This simply means that a generator function will
always provide the same result given the same input. All top-level mapping
functions take a long value as their input, and produce a result based on
their parameterized type.

##### Combining RNGs and Data Mapping Functions

Because pure functions play such a key part in procedural generation techniques,
the terms "data mapping function", "data mapper" and "data mapping library" will
be more common in the library than "generator". Conceptually, mapping functions
do not generate anything. It makes more sense to think of mapping data from one
domain to another. Even so, the data that is yielded by mapping functions can
appear quite realistic.

Because good RNGs do generally contain internal state, they aren't purely
functional. This means that in some cases, namely those in which you need
random access to a virtual data set, hash functions make more sense. This
toolkit allows you to choose between the two in some cases. However, it
generally favors using hashing and pure-function approaches where possible. Even
the statistical curve simulations do this.

### Data Mapper Library

Data Mapping functions are packaged into libraries which can be loaded by the
virtdata-user component of the project. Each library has a name, a function
resolver, and a set of functions that can be instantiated via the function
resolver.

### Function Resolver

Each library must implement its own function resolver. This is because each
library may have a different way of naming, finding, creating or managing
function generator instances. For the user, the description of a generator is
simply a string. What the generator library does with it is
implementation-specific. This means that some generator libraries may simply
have constructor signatures as function specifiers, and others may go as far as
implementing their own DSL. The basic contract for a function resolver is that
you pass it a string describing what you want, and it provides a generator
function in return.

#### Bindings Template

It is often useful to have a template that describes a set of generator
functions that can be reused across many threads or other application scopes. A
bindings template is a way to capture the requested generator functions for
re-use, with actual scope instantiation of the generator functions controlled by
the usage point. For example, in a JEE app, you may have a bindings template in
the application scope, and a set of actual bindings within each request (thread
scope).