findmax enhancements and docs

2025-02-25 18:55:28 -06:00 · 2020-12-22 23:33:36 -06:00 · 2020-12-22 23:33:36 -06:00 · f0223501f8
commit f0223501f8
parent 41adf58840
2 changed files with 197 additions and 24 deletions
--- a/engine-cli/src/main/resources/findmax.md
+++ b/engine-cli/src/main/resources/findmax.md
@ -0,0 +1,152 @@
+Analysis Method: FindMax
+========================
+
+The findmax.js script can be used with any normal activity definition
+which uses the standard phase tagging scheme. It searches for the maximum
+throughput available which satisfy a basic SLA requirement. It does this
+by dynamically adjusting the workload while it runs, iteratively adjusting
+the performance goals against known results. In short, it does what we
+often do when we're searching for a local maxima in performance
+parameters.
+
+This particular analysis method is not open-ended in terms of its
+parameters. That is, it is designed (for now) to focus only on a set of
+known parameters: Namely the target throughput as the independent variable
+and the resulting response time as the dependent variable.
+
+Here is how the algorithm works.
+
+1. The baseline is set to zero, and the stepping size is set to a small
+   value.
+2. The stepping size is added to the baseline to determine the current
+   target rate.
+3. The workload is run for a period of time (10 seconds by default) and
+   the achieved throughput and service times are measured.
+4. Boundary conditions are checked to determine whether the settings for
+   this sampling window are acceptable: These conditions are:
+    1. The achieved throughput is high enough with respect to the target
+       rate (min 80% by default).
+    2. The achieved throughput is not lower than the best known result.
+    3. The achieved latency is low enough (p99<50ms).
+5. If any of the boundary conditions are not met, then target throughput
+   for the current iteration is deemed unusable and the search parameters
+   are adjusted. In this case, the baseline is check-pointed to that of
+   the highest valid result, and the step increases start again at the
+   base step size.
+6. If the conditions are met, then the incremental target rate is scaled
+   up by a factor. The incremental target rate is simply the base rate
+   plus the step size multiplied by some factor.
+7. If increasing the target by the minimum step size would set a target
+   rate that is known to be too high from a previous run, then the test
+   terminates.
+
+The findmax script actually runs multiple times, and takes the average
+result.
+
+### Example Output (start of run)
+
+      # This shows findmax just starting up, where the target rates are
+      # still pretty low.
+
+      >-- iteration 1 ---> targeting 100 ops_s (0+100) for 10s
+      LATENCY         PASS(OK)       [ 3.70ms p99 < max 50ms ]
+      OPRATE/TARGET   PASS(OK)       [ 102% of target > min 80% ] ( 102/100 )
+      OPRATE/BEST     PASS(OK)       [ 100% of best known > min 90% ] ( 102/102 )
+      ---> accepting iteration 1
+
+      >-- iteration 2 ---> targeting 200 ops_s (0+200) for 10s
+      LATENCY         PASS(OK)       [ 3.62ms p99 < max 50ms ]
+      OPRATE/TARGET   PASS(OK)       [ 100% of target > min 80% ] ( 199/200 )
+      OPRATE/BEST     PASS(OK)       [ 195% of best known > min 90% ] ( 199/102 )
+      ---> accepting iteration 2
+
+      >-- iteration 3 ---> targeting 400 ops_s (0+400) for 10s
+      LATENCY         PASS(OK)       [ 3.57ms p99 < max 50ms ]
+      OPRATE/TARGET   PASS(OK)       [ 100% of target > min 80% ] ( 399/400 )
+      OPRATE/BEST     PASS(OK)       [ 201% of best known > min 90% ] ( 399/199 )
+      ---> accepting iteration 3
+
+      >-- iteration 4 ---> targeting 800 ops_s (0+800) for 10s
+      LATENCY         PASS(OK)       [ 3.70ms p99 < max 50ms ]
+      OPRATE/TARGET   PASS(OK)       [ 99% of target > min 80% ] ( 796/800 )
+      OPRATE/BEST     PASS(OK)       [ 199% of best known > min 90% ] ( 796/399 )
+      ---> accepting iteration 4
+
+      >-- iteration 5 ---> targeting 1600 ops_s (0+1600) for 10s
+      LATENCY         PASS(OK)       [ 3.53ms p99 < max 50ms ]
+      OPRATE/TARGET   PASS(OK)       [ 99% of target > min 80% ] ( 1591/1600 )
+      OPRATE/BEST     PASS(OK)       [ 200% of best known > min 90% ] ( 1591/796 )
+      ---> accepting iteration 5
+
+      ...
+
+### Example Result (end of run)
+
+### Parameters
+
+#### sampling controls ####
+
+- `sample_time=10` - How long each sampling window lasts in seconds,
+  initially.
+- `sample_incr=1.33` - How to scale up sample_time each time the search
+  range is adjusted.
+- `sample_max=300` - The maximum sampling window size in seconds.
+
+The sample time determines how long each iteration lasts. As findmax runs,
+this value is increased according to the sample_incr value. This improves
+accuracy as test results start to converge, but it also makes the test
+take longer. Lower values of sample_incr or sample_max will limit the
+runtime of tests, but this will also limit accuracy.
+
+#### rate controls ####
+
+- `rate_base=0` - The initial base rate.
+- `rate_step=100` - The minimum step size to be added to the base rate.
+- `rate_incr=2` - The increase factor for the step size.
+
+By default, these parameters approximate the speed of binary search for
+each time the base parameter is check-pointed. The target rate is
+calculated as `rate_base + (rate_step * rate_incr ^ step)` where steps
+start at 0. For example, the initial iteration will thus use `0 +
+(100 * 2^0)` which is simply 100.) Based on this, the initial target rates
+will follow a pattern of 100,200,400,800, and so on. Once the highest
+successful target rate is found in this series, the base rate is
+check-pointed to that, and the search starts again with that new base rate
+with the same progression above stacking on top.
+
+- `min_stride=100` - A lock-avoiding optimization for rate limiting. This
+  uses a stride rate limiter with micro-batching internally to avoid
+  higher contention on many-cored client systems.
+
+- `averageof=2` - How many runs to average together for the final result.
+  Averaging multiple runs together ensures that some noise of local
+  effects (like background processes or compactions) are smoothed out for
+  a more realistic result.
+
+#### pass/fail conditions ####
+
+- `latency_cutoff=50.0` - The numeric value which is considered the
+  highest acceptable latency at the selected percentile.
+- `latency_pctile=0.99` - The selected percentile. The values shown here
+  for these two mean "50ms@p99".
+- `testrate_cutoff=0.8` - The minimum achieved throughput for an iteration
+  with respect to the target rate to be considered successful.
+- `bestrate_cutoff=0.9` - The minimum achieved throughput for an iteration
+  with respect to the known best result to be considered successful.
+
+## Workload Selection
+
+- `activity_type=cql` - The type of internal NB driver to use.
+- `yaml_file=cql-iot` - The name of the workload yaml.
+
+You can invoke findmax with any workload yaml which uses the standard
+naming scheme for phase control. This means that you have tagged each of
+your statements or statement blocks with the appropriate phase tags from
+schema, rampup, main, for example.
+
+- `schematags=phase:schema` - The tag filter for schema statements.
+  Findmax will run a schema phase with 1 thread by default.
+- `maintas=phase:main` - The tag filter for the main workload. This is the
+  workload that is started and run in the background for all of the
+  sampling windows.
+
--- a/nb/src/main/resources/scripts/auto/findmax.js
+++ b/nb/src/main/resources/scripts/auto/findmax.js
@ -21,7 +21,7 @@ if (params.cycles) {


 var sample_time = 10;    // start sampling at 10 second windows
-var sample_incr = 1.6181;  // sampling window is fibonacci
+var sample_incr = 1.333;  // + 1/3 time
 var sample_max = 300;    // 5 minutes is the max sampling window

 var rate_base = 0;
@ -36,6 +36,7 @@ var testrate_cutoff = 0.8;
 var bestrate_cutoff = 0.90;
 var latency_pctile = 0.99;

+
 var profile = "TEMPLATE(profile,default)";
 printf(" profile=%s\n",profile);

@ -44,12 +45,12 @@ if (profile == "default") {
    printf(" :NOTE: Consider using profile=fast or profile=accurate.\n");
 } else if (profile == "fast") {
    sample_time = 5;
-    sample_incr = 1.6181;
+    sample_incr = 1.2;
    sample_max = 60;
    averageof = 1;
 } else if (profile == "accurate") {
    sample_time = 10;
-    sample_incr = 2;
+    sample_incr = 1.6181; // fibonacci
    sample_max = 300;
    averageof = 3;
 }
@ -97,8 +98,8 @@ printf("    Schedule %s operations at a time per thread.\n", min_stride);
 printf("    Report the average result of running the findmax search algorithm %d times.\n",averageof);
 printf("    Reject iterations which fail to achieve %2.0f%% of the target rate.\n", testrate_cutoff*100);
 printf("    Reject iterations which fail to achieve %2.0f%% of the best rate.\n",bestrate_cutoff*100);
-printf("    Reject iterations which fail to achieve better than %dms response\n",latency_cutoff);
-printf("     at percentile p%f\n",latency_pctile*100);
+printf("    Reject iterations which fail to achieve better than %dms response\n", latency_cutoff);
+printf("     at percentile p%f\n", latency_pctile * 100);
 printf("```\n");

 if (latency_pctile > 1.0) {
@ -107,7 +108,8 @@ if (latency_pctile > 1.0) {

 var reporter_rate_base = scriptingmetrics.newGauge("findmax.params.rate_base", rate_base + 0.0);
 var reporter_achieved_rate = scriptingmetrics.newGauge("findmax.sampling.achieved_rate", 0.0);
-var reporter_target_rate = scriptingmetrics.newGauge("findmax.target.rate_base", 0.0);
+var reporter_base_rate = scriptingmetrics.newGauge("findmax.target.rate_base", 0.0);
+var reporter_target_rate = scriptingmetrics.newGauge("findmax.target.rate", 0.0);


 var activity_type = "TEMPLATE(activity_type,cql)";
@ -130,7 +132,7 @@ scenario.run(schema_activitydef);
 activitydef = params.withDefaults({
    type: activity_type,
    yaml: yaml_file,
-    threads: "50"
+    threads: "auto"
 });

 activitydef.alias = "findmax";
@ -159,8 +161,9 @@ function lowest_unacceptable_of(result1, result2) {
 }

 function as_pct(val) {
-    return (val * 100.0).toFixed(0)+"%";
+    return (val * 100.0).toFixed(0) + "%";
 }
+
 function as_pctile(val) {
    return "p" + (val * 100.0).toFixed(0);
 }
@ -168,16 +171,16 @@ function as_pctile(val) {
 scenario.start(activitydef);
 raise_to_min_stride(activities.findmax, min_stride);

-printf("\nwarming up client JIT for 10 seconds... \n");
-activities.findmax.striderate = "" + (1000.0 / activities.findmax.stride) + ":1.1:restart";
-scenario.waitMillis(10000);
-activities.findmax.striderate = "" + (100.0 / activities.findmax.stride) + ":1.1:restart";
-printf("commencing test\n");
+// printf("\nwarming up client JIT for 10 seconds... \n");
+// activities.findmax.striderate = "" + (1000.0 / activities.findmax.stride) + ":1.1:restart";
+// scenario.waitMillis(10000);
+// activities.findmax.striderate = "" + (100.0 / activities.findmax.stride) + ":1.1:restart";
+// printf("commencing test\n");

 function raise_to_min_stride(params, min_stride) {
    var multipler = (min_stride / params.stride);
    var newstride = (multipler * params.stride);
-    printf("increasing stride to %d ( %d x initial)\n",newstride, multipler);
+    printf("increasing stride to %d ( %d x initial)\n", newstride, multipler);
    params.stride = newstride.toFixed(0);
 }

@ -192,9 +195,16 @@ function testCycleFun(params) {

    var result = {};

+
+    reporter_target_rate.update(params.target_rate)
    // printf(" target rate is " + params.target_rate + " ops_s");
    activities.findmax.striderate = "" + (params.target_rate / activities.findmax.stride) + ":1.1:restart";

+    if (params.iteration == 1) {
+        printf("\n settling load at base for %ds before active sampling.\n", sample_time);
+        scenario.waitMillis(sample_time * 1000);
+    }
+
    precount = metrics.findmax.cycles.servicetime.count;

    var snapshot_reader = metrics.findmax.result.deltaReader;
@ -225,7 +235,15 @@ function highest_acceptable_of(results, iteration1, iteration2) {
    if (iteration1 == 0) {
        return iteration2;
    }
-    return results[iteration1].ops_per_second > results[iteration2].ops_per_second ? iteration1 : iteration2;
+    // printf("iteration1.target_rate=%s iteration2.target_rate=%s\n",""+results[iteration1].target_rate,""+results[iteration2].target_rate);
+    // printf("comparing iterations left: %d right: %d\n", iteration1, iteration2);
+    let selected = results[iteration1].target_rate > results[iteration2].target_rate ? iteration1 : iteration2;
+    // printf("selected %d\n", selected)
+    // printf("iteration1.ops_per_second=%f iteration2.ops_per_second=%f\n",results[iteration1].ops_per_second,results[iteration2].ops_per_second);
+    // printf("comparing iterations left: %d right: %d\n", iteration1, iteration2);
+    // let selected =  results[iteration1].ops_per_second > results[iteration2].ops_per_second ? iteration1 : iteration2;
+    // printf("selected %d\n", selected)
+    return selected;
 }

 function find_max() {
@ -239,7 +257,7 @@ function find_max() {
    var rejected_count = 0;

    var base = rate_base;
-    reporter_target_rate.update(base);
+    reporter_base_rate.update(base);
    var step = rate_step;

    var accepted_count = 0;
@ -315,26 +333,29 @@ function find_max() {
        if (failed_checks == 0) {
            accepted_count++;
            highest_acceptable_iteration = highest_acceptable_of(results, highest_acceptable_iteration, iteration);
-            printf(" ---> accepting iteration %d\n",iteration);
-
+            printf(" ---> accepting iteration %d (highest is now %d)\n", iteration, highest_acceptable_iteration);
        } else {
            rejected_count++;
-            printf(" !!!  rejecting iteration %d\n",iteration);
+            printf(" !!!  rejecting iteration %d\n", iteration);

+            // If the lowest unacceptable iteration was not already defined ...
            if (lowest_unacceptable_iteration == 0) {
                lowest_unacceptable_iteration = iteration;
-            } else {
-                // keep the more conservative target rate, if both this iteration and another one was not acceptable
-                lowest_unacceptable_iteration = result.target_rate < results[lowest_unacceptable_iteration].target_rate ?
-                    iteration : lowest_unacceptable_iteration;
            }
+            // keep the lower maximum target rate, if both this iteration and another one failed
+            if (result.target_rate < results[lowest_unacceptable_iteration].target_rate) {
+                lowest_unacceptable_iteration = iteration;
+                printf("setting lowest_unacceptable_iteration to %d\n", iteration)
+            }
+
            highest_acceptable_iteration = highest_acceptable_iteration == 0 ? iteration : highest_acceptable_iteration;

            var too_fast = results[lowest_unacceptable_iteration].target_rate;

+
            if (base + step >= too_fast && results[highest_acceptable_iteration].target_rate + rate_step < too_fast) {
                base = results[highest_acceptable_iteration].target_rate;
-                reporter_target_rate.update(base);
+                reporter_base_rate.update(base);
                step = rate_step;

                var search_summary = "[[ PASS " +