The old barrier implementation was very slow when running on a multi-socket
machine (pcmemtest issue 16).
The new implementation provides two options:
- when blocked, spin on a thread-local flag
- when blocked, execute a HLT instruction and wait for a NMI
The first option might be faster, but we need to measure it to find out. A
new boot command line option is provided to select between the two, with a
third setting that uses a mixture of the two.
Mostly we write and read large chunks of data which will make it likely
that the data is no longer in the cache when we come to verify it. But
this is not always true, and in any case, we shouldn't rely on it.