The old barrier implementation was very slow when running on a multi-socket
machine (pcmemtest issue 16).
The new implementation provides two options:
- when blocked, spin on a thread-local flag
- when blocked, execute an HLT instruction and wait for an NMI
The first option might be faster, but we need to measure it to find out. A
new boot command line option is provided to select between the two, with a
third setting that uses a mixture of the two.
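
As an illustration of the first option only, here is a minimal sketch of a
barrier in which each waiter spins on its own flag instead of a shared
counter. It is written with portable C11 atomics and is not the code in this
commit. The names spin_barrier_t, spin_barrier_init, spin_barrier_wait and
MAX_CPUS are hypothetical, and CPU numbers are assumed to run from 0 to
num_threads - 1. The HLT/NMI option needs privileged instructions and is not
shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_CPUS 8  /* hypothetical limit, for illustration only */

/* One flag per CPU, padded to its own cache line so that a spinning CPU
 * does not generate traffic on a line shared with other waiters. */
typedef struct {
    _Alignas(64) atomic_bool waiting;
} local_flag_t;

typedef struct {
    int          num_threads;
    atomic_int   count;            /* arrivals still outstanding */
    local_flag_t flags[MAX_CPUS];
} spin_barrier_t;

void spin_barrier_init(spin_barrier_t *b, int num_threads)
{
    assert(num_threads > 0 && num_threads <= MAX_CPUS);
    b->num_threads = num_threads;
    atomic_init(&b->count, num_threads);
    for (int i = 0; i < MAX_CPUS; i++) {
        atomic_init(&b->flags[i].waiting, false);
    }
}

void spin_barrier_wait(spin_barrier_t *b, int my_cpu)
{
    /* Arm our own flag before announcing our arrival. */
    atomic_store(&b->flags[my_cpu].waiting, true);

    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* Last to arrive: reset the count first, then release everyone. */
        atomic_store(&b->count, b->num_threads);
        for (int i = 0; i < b->num_threads; i++) {
            atomic_store(&b->flags[i].waiting, false);
        }
    } else {
        /* Blocked: spin on the flag that only this CPU reads. */
        while (atomic_load(&b->flags[my_cpu].waiting)) {
            /* busy-wait */
        }
    }
}
```

The count is reset before any flag is cleared, so a CPU that is released and
immediately re-enters the barrier sees a consistent state.
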
When using a legacy BIOS, the memory regions used by the BIOS are well
defined. This is not the case when using UEFI firmware. So include the
stack area in the BSS so the loader knows how much memory to allocate,
and check we have space to relocate the program to either low or high
memory.
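
The space check described above might look roughly like the following sketch.
It assumes a simple list of usable RAM ranges. The type mem_region_t and the
function can_relocate_to are hypothetical names, not taken from the source,
and image_size is assumed to cover code, data and BSS (which now includes the
stacks).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A usable RAM range [start, end). Hypothetical type for this sketch. */
typedef struct {
    uintptr_t start;
    uintptr_t end;
} mem_region_t;

/* Returns true if [dst, dst + image_size) lies entirely inside one usable
 * region. The comparison is arranged so that it cannot overflow. */
bool can_relocate_to(const mem_region_t *map, size_t num_regions,
                     uintptr_t dst, size_t image_size)
{
    for (size_t i = 0; i < num_regions; i++) {
        if (dst >= map[i].start && dst <= map[i].end
            && image_size <= map[i].end - dst) {
            return true;
        }
    }
    return false;
}
```

A caller would try a low-memory destination and a high-memory destination in
turn and skip relocation to whichever does not fit.
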
There are still some assumptions in the USB driver code that need to
be fixed.
Because we start the APs sequentially, it is unlikely that two of them will
be using the temporary startup stack at the same time, but we should guard
against it. Doing so allows us to remove the mutex around the restart of
each AP when relocating, which should improve test times.
After we relocate the program, we restart it. So there is no need to copy
over the old stack contents. This allows us to increase the maximum number
of APs without a run-time overhead. The maximum number of APs will still
be limited by the size of low memory.
The BSP only needs extra stack space during program initialisation. The APs
aren't running at that point, so by positioning the BSP stack above the AP
stacks, it can extend down into the AP stack space without causing any
problems.
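
The stack arrangement above can be sketched as simple address arithmetic. The
sizes and the names MAX_APS, AP_STACK_SIZE, BSP_STACK_SIZE, stack_top and
bsp_init_stack_space are assumptions made for illustration. The real values
and layout code may differ.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_APS        4        /* hypothetical, kept small for the example */
#define AP_STACK_SIZE  0x400u   /* hypothetical per-AP stack size */
#define BSP_STACK_SIZE 0x400u   /* hypothetical BSP stack size after init */

/* Layout from low to high addresses:
 *   AP 1 stack, AP 2 stack, ..., AP MAX_APS stack, BSP stack.
 * Stacks grow downwards, so the initial stack pointer of each CPU is the
 * top of its slot. CPU 0 is the BSP. */
uintptr_t stack_top(uintptr_t stacks_base, int cpu)
{
    if (cpu == 0) {
        return stacks_base + MAX_APS * AP_STACK_SIZE + BSP_STACK_SIZE;
    }
    return stacks_base + (uintptr_t)cpu * AP_STACK_SIZE;
}

/* While no AP is running, the BSP stack can grow down through every AP
 * slot, so this is the space available to it during initialisation. */
size_t bsp_init_stack_space(void)
{
    return MAX_APS * AP_STACK_SIZE + BSP_STACK_SIZE;
}
```

With this layout the BSP gets the whole stack area during initialisation but
only its own slot once the APs are started.
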