There is an unavoidable race between one core halting after decrementing
the barrier count and another core sending it the wakeup NMI. This can
only occur if the core sending the wakeup is running at many times the
speed of the core halting, but it has been observed on an Intel Icelake
mobile processor.
This reduces the number of instructions between decrementing the count
and halting in the halt wait case. Use the same code for the spin wait
case for consistency.
Using Interrupt On Completion is not robust, because the interrupt flag
is also set if a short transfer is detected. So we need to poll the
Active flag in the transfer descriptors.
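
A minimal sketch of the polling approach, not the driver's actual code. The
descriptor layout, the bit positions (TD_STATUS_ACTIVE, TD_STATUS_HALTED) and
the function name are hypothetical and chosen only for illustration; the real
positions depend on the host controller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status bits; real positions depend on the controller. */
#define TD_STATUS_ACTIVE  (1u << 7)
#define TD_STATUS_HALTED  (1u << 6)

/* Hypothetical, simplified transfer descriptor. */
typedef struct {
    volatile uint32_t status;   /* updated by the controller */
} transfer_desc_t;

typedef enum { TD_DONE, TD_ERROR, TD_TIMEOUT } td_result_t;

/*
 * Poll one descriptor until the controller clears its Active flag, giving
 * up after max_polls further reads. The interrupt flag is never consulted,
 * so a short transfer elsewhere cannot be mistaken for completion.
 */
td_result_t poll_transfer_desc(const transfer_desc_t *td, unsigned max_polls)
{
    assert(td != NULL);
    for (unsigned polls = 0; polls <= max_polls; polls++) {
        uint32_t status = td->status;
        if (!(status & TD_STATUS_ACTIVE)) {
            return (status & TD_STATUS_HALTED) ? TD_ERROR : TD_DONE;
        }
    }
    return TD_TIMEOUT;
}
```

A caller would apply this to each descriptor it needs to see retired, and
would normally bound the wait by elapsed time rather than by a poll count.
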
The XHCI device index and scratchpad buffers are mapped into high memory,
in order to conserve low memory. They need to be accessible in the virtual
address space to allow us to initialise them. After initialisation, only
the XHCI accesses them.
We also only need to access the data structures in the device private
workspace during initialisation, but keeping separate physical and virtual
addresses for these structures makes the code considerably more complex,
so for now, move these to low memory.
The old barrier implementation was very slow when running on a multi-socket
machine (pcmemtest issue 16).
The new implementation provides two options:
- when blocked, spin on a thread-local flag
- when blocked, execute a HLT instruction and wait for an NMI
The first option might be faster, but we need to measure it to find out. A
new boot command line option is provided to select between the two, with a
third setting that uses a mixture of the two.
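
The spin-wait option can be sketched roughly as follows, using C11 atomics.
This is an illustration under stated assumptions, not the implementation
itself: barrier_t, MAX_THREADS and the function names are hypothetical, and
the per-thread flags are kept in a shared array for brevity, whereas a real
implementation would place each flag in memory local to its thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_THREADS 8   /* hypothetical limit for this sketch */

typedef struct {
    int         num_threads;
    atomic_int  count;                  /* threads still to arrive */
    atomic_bool release[MAX_THREADS];   /* one wakeup flag per thread */
} barrier_t;

void barrier_init(barrier_t *b, int num_threads)
{
    assert(num_threads > 0 && num_threads <= MAX_THREADS);
    b->num_threads = num_threads;
    atomic_init(&b->count, num_threads);
    for (int i = 0; i < MAX_THREADS; i++) {
        atomic_init(&b->release[i], false);
    }
}

/* Register arrival. Returns true if the caller was the last to arrive,
 * in which case it has reset the count and released all the others. */
bool barrier_arrive(barrier_t *b, int my_id)
{
    assert(my_id >= 0 && my_id < b->num_threads);
    if (atomic_fetch_sub(&b->count, 1) == 1) {
        atomic_store(&b->count, b->num_threads);
        for (int i = 0; i < b->num_threads; i++) {
            if (i != my_id) atomic_store(&b->release[i], true);
        }
        return true;
    }
    return false;
}

void barrier_wait(barrier_t *b, int my_id)
{
    if (barrier_arrive(b, my_id)) return;
    /* Spin on this thread's own flag only. A halt-wait variant would
     * halt here and rely on the last arrival sending a wakeup NMI. */
    while (!atomic_load(&b->release[my_id])) { }
    atomic_store(&b->release[my_id], false);
}
```

Because each waiter polls only its own flag, blocked threads do not contend
with one another on a single shared location.
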
We only map the first 4GB of physical address space, so if the ACPI tables
or local APIC are located above 4GB, or are overlaid when we remap something
else (e.g. the video frame buffer), we need to map them to somewhere we can
access. The ACPI tables are only used during startup, but the local APIC
will be needed when we are running tests if we are saving power by halting
idle CPU cores and using an NMI to wake them up.
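
As a rough illustration of the "located above 4GB" check only (it does not
cover the case where a region is overlaid by another remapping), a test
along these lines could decide whether a region needs its own mapping. The
names MAPPED_LIMIT and needs_remap are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Only the first 4GB of physical address space is mapped. */
#define MAPPED_LIMIT (UINT64_C(1) << 32)

/* True if any byte of [phys_addr, phys_addr + size) lies at or above
 * the 4GB limit. Written so that the arithmetic cannot overflow. */
bool needs_remap(uint64_t phys_addr, uint64_t size)
{
    assert(size > 0);
    return phys_addr >= MAPPED_LIMIT || size > MAPPED_LIMIT - phys_addr;
}
```

For example, a 4KB region at the default local APIC address 0xFEE00000 lies
wholly below the limit, while the same region at 0x100000000 does not.
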
This is needed for subsequent changes. If we do ever get presented with
a frame buffer larger than 8192x8192 pixels, we'll need to think again
about how to manage it.
Low/full speed USB devices attached directly to the root hub must be
rerouted to a companion controller. We can't rely on the BIOS to do
this for us. This requires us to initialise the EHCI device before
initialising any of its companions.
This also allows us to support keyboards attached via a high speed
hub on a system with EHCI plus companions.
That optimisation occasionally caused a hang if the CPU sequencing
mode was reconfigured after testing had started. It didn't make a
significant difference to the startup delay, so just drop it.
Leave the APs running whilst the BSP repeats the dummy runs. This means
we need to bypass the barriers during a dummy run. The APs will wait at
the first barrier until the BSP starts the first real run.
Now that we halt CPU cores that are going to be idle for a lengthy period,
we don't need to try to save power in other ways. In any case, this was
not very effective.