Classic Montgomery reduction involves a single conditional subtraction
to ensure that the result is strictly less than the modulus.
When performing chains of Montgomery multiplications (potentially
interspersed with additions and subtractions), it can be useful to
work with values that are stored modulo some small multiple of the
modulus, thereby allowing some reductions to be elided. Each addition
and subtraction stage will increase this running multiple, and the
following multiplication stages can be used to reduce the running
multiple since the reduction carried out for multiplication products
is generally strong enough to absorb some additional bits in the
inputs. This approach is already used in the x25519 code, where
multiplication takes two 258-bit inputs and produces a 257-bit output.
Split out the conditional subtraction from bigint_montgomery() and
provide a separate bigint_montgomery_relaxed() for callers who do not
require immediate reduction to within the range of the modulus.
Modular exponentiation could potentially make use of relaxed
Montgomery multiplication, but this would require R>4N, i.e. that the
two most significant bits of the modulus be zero. For both RSA and
DHE, this would necessitate extending the modulus size by one element,
which would negate any speed increase from omitting the conditional
subtractions. We therefore retain the use of classic Montgomery
reduction for modular exponentiation, apart from the final conversion
out of Montgomery form.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Reduce the number of parameters passed to bigint_montgomery() by
calculating the inverse of the modulus modulo the element size on
demand. Cache the result, since Montgomery reduction will be used
repeatedly with the same modulus value.
In all currently supported algorithms, the modulus is a public value
(or a fixed value defined by specification) and so this non-constant
timing does not leak any private information.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The startup process is scheduled to run when the device is opened and
terminated (if still running) when the device is closed. It assumes
that the resource allocation performed in gve_open() has taken place,
and that the admin and transmit/receive data structure pointers are
therefore valid.
The process initialisation in gve_probe() erroneously calls
process_init() rather than process_init_stopped() and will therefore
schedule the startup process immediately, before the relevant
resources have been allocated.
This bug is masked in the typical use case of a Google Cloud instance
with a single NIC built with the config/cloud/gce.ipxe embedded
script, since the embedded script will immediately open the NIC (and
therefore allocate the required resources) before the scheduled
process is allowed to run for the first time. In a multi-NIC
instance, undefined behaviour will arise as soon as the startup
process for the second NIC is allowed to run.
Fix by using process_init_stopped() to avoid implicitly scheduling the
startup process during gve_probe().
Originally-fixed-by: Kal Cutter Conley <kalcutterc@nvidia.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
There is no further need for a standalone modular multiplication
primitive, since the only consumer is modular exponentiation (which
now uses Montgomery multiplication instead).
Remove the now obsolete bigint_mod_multiply().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Speed up modular exponentiation by using Montgomery reduction rather
than direct modular reduction.
Montgomery reduction in base 2^n requires the modulus to be coprime to
2^n, which would limit us to requiring that the modulus is an odd
number. Extend the implementation to include support for
exponentiation with even moduli via Garner's algorithm as described in
"Montgomery reduction with even modulus" (Koç, 1994).
Since almost all use cases for modular exponentation require a large
prime (and hence odd) modulus, the support for even moduli could
potentially be removed in future.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Montgomery reduction is substantially faster than direct reduction,
and is better suited for modular exponentiation operations.
Add bigint_montgomery() to perform the Montgomery reduction operation
(often referred to as "REDC"), along with some test vectors.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Montgomery reduction requires only the least significant element of an
inverse modulo 2^k, which in turn depends upon only the least
significant element of the invertend.
Use the inverse size (rather than the invertend size) as the effective
size for bigint_mod_invert(). This eliminates around 97% of the loop
iterations for a typical 2048-bit RSA modulus.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
With a slight modification to the algorithm to ignore bits of the
residue that can never contribute to the result, it is possible to
reuse the as-yet uncalculated portions of the inverse to hold the
residue. This removes the requirement for additional temporary
working space.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Direct modular reduction is expected to be used in situations where
there is no requirement to retain the original (unreduced) value.
Modify the API for bigint_reduce() to reduce the value in place,
(removing the separate result buffer), impose a constraint that the
modulus and value have the same size, and require the modulus to be
passed in writable memory (to allow for scaling in place). This
removes the requirement for additional temporary working space.
Reverse the order of arguments so that the constant input is first,
to match the usage pattern for bigint_add() et al.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Expose the effective carry (or borrow) out flag from big integer
addition and subtraction, and use this to elide an explicit bit test
when performing x25519 reduction.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add a dedicated bigint_msb_is_set() to reduce the amount of open
coding required in the common case of testing the sign of a two's
complement big integer.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
UEFI systems may choose not to connect drivers for local disk drives
when the boot policy is set to attempt a network boot. This may cause
the "sanboot" command to be unable to boot from a local drive, since
the relevant block device and filesystem drivers may not have been
connected.
Fix by ensuring that all available drivers are connected before
attempting to boot from an EFI block device.
Reported-by: Andrew Cottrell <andrew.cottrell@xtxmarkets.com>
Tested-by: Andrew Cottrell <andrew.cottrell@xtxmarkets.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
We currently require the variable CROSS (or CROSS_COMPILE) to be set
to specify the global cross-compilation prefix. This becomes
cumbersome when developing across multiple CPU architectures,
requiring frequent editing of build command lines and preventing
incompatible architectures from being built with a single command.
Allow a default cross-compilation prefix for each architecture to be
specified via the CROSS_COMPILE_<arch> variables. These may then be
provided as environment variables, e.g. using
export CROSS_COMPILE_arm32=arm-linux-gnu-
export CROSS_COMPILE_arm64=aarch64-linux-gnu-
export CROSS_COMPILE_loong64=loongarch64-linux-gnu-
export CROSS_COMPILE_riscv32=riscv64-linux-gnu-
export CROSS_COMPILE_riscv64=riscv64-linux-gnu-
This change requires some portions of the Makefile to be rearranged,
to allow for the fact that $(CROSS_COMPILE) may not have been set
until the build directory has been parsed to determine the CPU
architecture.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The seed CSR defined by the Zkr extension is accessible only in M-mode
by default. Older versions of OpenSBI (prior to version 1.4) do not
set mseccfg.sseed, with the result that attempts to access the seed
CSR from S-mode will raise an illegal instruction exception.
Add a facility for testing the accessibility of arbitrary CSRs, and
use it to check that the seed CSR is accessible before reporting the
seed CSR entropy source as being functional.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add basic support for running directly on top of SBI, with no UEFI
firmware present. Build as e.g.:
make CROSS=riscv64-linux-gnu- bin-riscv64/ipxe.sbi
The resulting binary can be tested in QEMU using e.g.:
qemu-system-riscv64 -M virt -cpu max -serial stdio \
-kernel bin-riscv64/ipxe.sbi
No drivers or executable binary formats are supported yet, but the
unit test suite may be run successfully.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Restructure the parsing of the build directory name from
bin[[-<arch>]-<platform>]
to
bin[-<arch>[-<platform>]]
and allow for a per-architecture default build platform.
For the sake of backwards compatibility, handle "bin-efi" as a special
case equivalent to "bin-i386-efi".
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The timer and entropy seed CSRs will, by design, return different
values each time they are read.
Add the missing volatile qualifiers on the inline assembly to prevent
gcc from assuming that repeated invocations may be elided.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The Zkr entropy source extension defines a potentially unprivileged
seed CSR that can be read to obtain 16 bits of entropy input, with a
mandated requirement that 256 entropy input bits read from the seed
CSR will contain at least 128 bits of min-entropy.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The Zicntr extension defines an unprivileged wall-clock time CSR that
roughly matches the behaviour of an invariant TSC on x86. The nominal
frequency of this timer may be read from the "timebase-frequency"
property of the CPU node in the device tree.
Add a timer source using RDTIME to provide implementations of udelay()
and currticks(), modelled on the existing RDTSC-based timer for x86.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
RISC-V seems to allow for direct discovery of CPU features only from
M-mode (e.g. by setting up a trap handler and then attempting to
access a CSR), with S-mode code expected to read the resulting
constructed ISA description from the device tree.
Add the ability to check for the presence of named extensions listed
in the "riscv,isa" property of the device tree node corresponding to
the boot hart.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow for the existence of platforms with no PCI bus by including the
PCI settings mechanism only if PCI bus support is included.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Running with flat physical addressing is a fairly common early boot
environment. Rename UACCESS_EFI to UACCESS_FLAT so that this code may
be reused in non-UEFI boot environments that also use flat physical
addressing.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add the ability to issue Supervisor Binary Interface (SBI) calls via
the ECALL instruction, and use the SBI DBCN extension to implement a
debug console.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Montgomery multiplication requires calculating the inverse of the
modulus modulo a larger power of two.
Add bigint_mod_invert() to calculate the inverse of any (odd) big
integer modulo an arbitrary power of two, using a lightly modified
version of the algorithm presented in "A New Algorithm for Inversion
mod p^k (Koç, 2017)".
The power of two is taken to be 2^k, where k is the number of bits
available in the big integer representation of the invertend. The
inverse modulo any smaller power of two may be obtained simply by
masking off the relevant bits in the inverse.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow scripts to read basic information from USB device descriptors
via the settings mechanism. For example:
echo USB vendor ID: ${usb/${busloc}.8.2}
echo USB device ID: ${usb/${busloc}.10.2}
echo USB manufacturer name: ${usb/${busloc}.14.0}
The general syntax is
usb/<bus:dev>.<offset>.<length>
where bus:dev is the USB bus:device address (as obtained via the
"usbscan" command, or from e.g. ${net0/busloc} for a USB network
device), and <offset> and <length> select the required portion of the
USB device descriptor.
Following the usage of SMBIOS settings tags, a <length> of zero may be
used to indicate that the byte at <offset> contains a USB string
descriptor index, and an <offset> of zero may be used to indicate that
the <length> contains a literal USB string descriptor index.
Since the byte at offset zero can never contain a string index, and a
literal string index can never be zero, the combination of both
<length> and <offset> being zero may be used to indicate that the
entire device descriptor is to be read as a raw hex dump.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Implement a "usbscan" command as a direct analogy of the existing
"pciscan" command, allowing scripts to iterate over all detected USB
devices.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Faster modular multiplication algorithms such as Montgomery
multiplication will still require the ability to perform a single
direct modular reduction.
Neaten up the implementation of direct reduction and split it out into
a separate bigint_reduce() function, complete with its own unit tests.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Every architecture uses the same implementation for bigint_is_set(),
and there is no reason to suspect that a future CPU architecture will
provide a more efficient way to implement this operation.
Simplify the code by providing a single architecture-independent
implementation of bigint_is_set().
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The big integer shift operations are misleadingly described as
rotations since the original x86 implementations are essentially
trivial loops around the relevant rotate-through-carry instruction.
The overall operation performed is a shift rather than a rotation.
Update the function names and descriptions to reflect this.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
An n-bit multiplication product may be added to up to two n-bit
integers without exceeding the range of a (2n)-bit integer:
(2^n - 1)*(2^n - 1) + (2^n - 1) + (2^n - 1) = 2^(2n) - 1
Exploit this to perform big integer multiplication in constant time
without requiring the caller to provide temporary carry space.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add support for building as a Linux userspace binary for AArch32.
This allows the self-test suite to be more easily run for the 32-bit
ARM code. For example:
make CROSS=arm-linux-gnu- bin-arm32-linux/tests.linux
qemu-arm -L /usr/arm-linux-gnu/sys-root/ \
./bin-arm32-linux/tests.linux
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Reading from PMCCNTR causes an undefined instruction exception when
running in PL0 (e.g. as a Linux userspace binary), unless the
PMUSERENR.EN bit is set.
Restructure profile_timestamp() for 32-bit ARM to perform an
availability check on the first invocation, with subsequent
invocations returning zero if PMCCNTR could not be enabled.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
All consumers of profile_timestamp() currently treat the value as an
unsigned long. Only the elapsed number of ticks is ever relevant: the
absolute value of the timestamp is not used. Profiling is used to
measure short durations that are generally fewer than a million CPU
cycles, for which an unsigned long is easily large enough.
Standardise the return type of profile_timestamp() as unsigned long
across all CPU architectures. This allows 32-bit architectures such
as i386 and riscv32 to omit all logic associated with retrieving the
upper 32 bits of the 64-bit hardware counter, which simplifies the
code and allows riscv32 and riscv64 to share the same implementation.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Big integer multiplication currently performs immediate carry
propagation from each step of the long multiplication, relying on the
fact that the overall result has a known maximum value to minimise the
number of carries performed without ever needing to explicitly check
against the result buffer size.
This is not a constant-time algorithm, since the number of carries
performed will be a function of the input values. We could make it
constant-time by always continuing to propagate the carry until
reaching the end of the result buffer, but this would introduce a
large number of redundant zero carries.
Require callers of bigint_multiply() to provide a temporary carry
storage buffer, of the same size as the result buffer. This allows
the carry-out from the accumulation of each double-element product to
be accumulated in the temporary carry space, and then added in via a
single call to bigint_add() after the multiplication is complete.
Since the structure of big integer multiplication is identical across
all current CPU architectures, provide a single shared implementation
of bigint_multiply(). The architecture-specific operation then
becomes the multiplication of two big integer elements and the
accumulation of the double-element product.
Note that any intermediate carry arising from accumulating the lower
half of the double-element product may be added to the upper half of
the double-element product without risk of overflow, since the result
of multiplying two n-bit integers can never have all n bits set in its
upper half. This simplifies the carry calculations for architectures
such as RISC-V and LoongArch64 that do not have a carry flag.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The admin queue API requires us to tell the device how many event
counters we have provided via the "configure device resources" admin
queue command. There is, of course, absolutely no documentation
indicating how many event counters actually need to be provided.
We require only two event counters: one for the transmit queue, one
for the receive queue. (The receive queue doesn't seem to actually
make any use of its event counter, but the "create receive queue"
admin queue command will fail if it doesn't have an available event
counter to choose.)
In the absence of any documentation, we currently make the assumption
that allocating and configuring 16 counters (i.e. one whole cacheline)
will be sufficient to allow for the use of two counters.
This assumption turns out to be incorrect. On larger instance types
(observed with a c3d-standard-16 instance in europe-west4-a), we find
that creating the transmit or receive queues will each fail with a
probability of around 50% with the "failed precondition" error code.
Experimentation suggests that even though the device has accepted our
"configure device resources" command indicating that we are providing
only 16 event counters, it will attempt to choose any of its potential
32 event counters (and will then fail since the event counter that it
unilaterally chose is outside of the agreed range).
Work around this firmware bug by always allocating the maximum number
of event counters supported by the device. (This requires deferring
the allocation of the event counters until after issuing the "describe
device" command.)
Signed-off-by: Michael Brown <mcb30@ipxe.org>
As of commit 79c0173 ("[build] Create util/genfsimg for building
filesystem-based images"), the EFI boot file name for each CPU
architecture is defined within the genfsimg script itself, rather than
being passed in as a Makefile parameter.
Remove the now-redundant Makefile definitions for EFI_BOOT_FILE.
Reported-by: Christian I. Nilsson <nikize@gmail.com>
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add support for building iPXE as a 64-bit or 32-bit RISC-V binary, for
either UEFI or Linux userspace platforms. For example:
# RISC-V 64-bit UEFI
make CROSS=riscv64-linux-gnu- bin-riscv64-efi/ipxe.efi
# RISC-V 32-bit UEFI
make CROSS=riscv64-linux-gnu- bin-riscv32-efi/ipxe.efi
# RISC-V 64-bit Linux
make CROSS=riscv64-linux-gnu- bin-riscv64-linux/tests.linux
qemu-riscv64 -L /usr/riscv64-linux-gnu/sys-root \
./bin-riscv64-linux/tests.linux
# RISC-V 32-bit Linux
make CROSS=riscv64-linux-gnu- SYSROOT=/usr/riscv32-linux-gnu/sys-root \
bin-riscv32-linux/tests.linux
qemu-riscv32 -L /usr/riscv32-linux-gnu/sys-root \
./bin-riscv32-linux/tests.linux
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The cross-compiler will typically use the appropriate sysroot
directory automatically. This may not work for toolchains where a
single cross-compiler is used to produce output for multiple CPU
variants (e.g. 32-bit and 64-bit RISC-V).
Add a SYSROOT=... parameter that may be used to specify the relevant
sysroot directory, e.g.
make CROSS=riscv64-linux-gnu- SYSROOT=/usr/riscv32-linux-gnu/sys-root \
bin-riscv32-linux/tests.linux
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The EDK2 header macros VA_START(), VA_ARG() etc produce build errors
on some CPU architectures (notably on 32-bit RISC-V, which is not yet
supported by EDK2).
Fix by using the standard variable argument list macros.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
For some 32-bit CPUs, we need to provide implementations of 64-bit
shifts as libgcc helper functions. Add test cases to cover these.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Define a cpu_halt() function which is architecture-specific but
platform-independent, and merge the multiple architecture-specific
implementations of the EFI cpu_nap() function into a single central
efi_cpu_nap() that uses cpu_halt() if applicable.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
The definitions of the setjmp() and longjmp() functions are common to
all architectures, with only the definition of the jump buffer
structure being architecture-specific.
Move the architecture-specific portions to bits/setjmp.h and provide a
common setjmp.h for the function definitions.
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Add the "--retain <N>" option to limit the number of retained old AMI
images (within the same family, architecture, and public visibility).
Signed-off-by: Michael Brown <mcb30@ipxe.org>
Allow for easier identification of images and snapshots created by the
aws-import script by adding tags for image family (e.g. "iPXE") and
architecture (e.g. "x86_64") to both.
Signed-off-by: Michael Brown <mcb30@ipxe.org>