Rather than having lots of `cpu.replace("0x", "").str_radix(...)`
calls around, move to a single, unit-tested function in lqos_utils
and use it repeatedly.
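For illustration only, the same idea in Python. The real helper is
Rust and lives in lqos_utils; the name `parse_cpu_hex` and its exact
behavior here are assumptions, not the actual API.

```python
def parse_cpu_hex(text: str) -> int:
    """Parse a CPU identifier such as "0x1f" or "1f" into an integer."""
    # Hypothetical helper: strip whitespace and an optional "0x" prefix.
    cleaned = text.strip().lower().removeprefix("0x")
    if not cleaned:
        raise ValueError(f"empty CPU identifier: {text!r}")
    # Base 16, matching the radix-parsing calls this replaces.
    return int(cleaned, 16)
```

Having one function means the prefix handling and error cases get
tested once instead of at every call site.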
This one was pretty funny. Any line that contained interfaceA in
ispConfig.example.py was transformed into an interfaceA statement.
I forgot to check for comments, so the comment on how to use
onAStick configuration *also* generated an interface statement.
It now just copies comments verbatim.
1) When calculating median latency, reject any entry that doesn't
have at least 5 data points. From local testing, 5 appears
to be the magic number (when combined with sampling time) that
ignores the "idle" traffic from CPEs, routers and long-poll
sessions on devices.
2) Filter out RTT 0 from best/worst reports.
3) Note that no data is discarded - it's just filtered for display.
This results in a much cleaner display of RTT times in the
reporting interface, giving a much better ability to "zero in"
on problem areas without being distracted by poor RTT - but
basically no traffic - hosts that are idle.
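A rough sketch of the display-side filtering described above. The
threshold of 5 comes from the note; the function names, the data
layout (host -> list of RTT samples) and the worst-first ordering
are assumptions made for the example.

```python
from statistics import median

MIN_SAMPLES = 5  # threshold taken from the note above

def median_rtt(samples):
    """Return the median RTT, or None if there are too few samples."""
    if len(samples) < MIN_SAMPLES:
        return None
    return median(samples)

def display_rows(hosts):
    """Build (host, median RTT) rows for display, worst first.

    Hosts with too few samples or an RTT of 0 are skipped. The input
    dict is never modified, so no data is discarded.
    """
    rows = []
    for host, samples in hosts.items():
        m = median_rtt(samples)
        if m is None or m == 0:
            continue
        rows.append((host, m))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```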
* Adds a new Rust program, `lqos_setup`.
* If no /etc/lqos.conf is found, prompts for interfaces and
creates a dual-interface XDP bridge setup.
* If no /opt/libreqos/src/ispConfig.py is found, prompts
for bandwidth and creates one (using the interfaces also)
* Same for ShapedDevices.csv and network.json
* If no webusers are found, prompts to make one.
* Adds build_dbpkg.sh
* Creates a new directory named `dist`
* Builds the Rust components in a portable mode.
* Creates a list of dependencies and DEBIAN directory
with control and postinst files.
* Handles PIP dependencies in postinst
* Calls the new `lqos_setup` program for final
configuration.
* Sets up the daemons in systemd and enables them.
In very brief testing, I had a working XDP bridge with
1 fake user and a total bandwidth limit configured and
working after running:
dpkg -i 1.4-1.dpkg
apt -f install
Could still use some tweaking.
The partial reload mechanism *really* doesn't work with OnAStick
configurations at present. There's a lot of work required to make
it function. In the meantime, warn the poor user that this
isn't going to work.
Affects ISSUE #129
OnAStick mode is still broken for partial updates. The
`addDeviceIPsToFilter` function was referencing circuit
information (the class_id) that wasn't present in the
partially reloaded data.
Changed shell call to call `add_ip_mapping` which accesses
the bus directly, saving a shell call.
Part of ISSUE #239 - does not fix it for "on a stick"
configurations.
Display the offending string as well as the general "Class id must
be in the format (major):(minor), e.g. 1:12" message.
May help with diagnosing #239 (since "Error: Class id..." is the
only visible error message).
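A minimal sketch of an error that carries the offending string. The
function name is hypothetical, and parsing both parts as hexadecimal
(as tc handles are written) is an assumption about the real parser.

```python
def parse_class_id(text: str) -> tuple[int, int]:
    """Parse "major:minor" and report the bad input on failure."""
    hint = "Class id must be in the format (major):(minor), e.g. 1:12"
    parts = text.strip().split(":")
    try:
        if len(parts) != 2:
            raise ValueError
        # Assumption: both halves are hexadecimal, as in tc handles.
        major, minor = int(parts[0], 16), int(parts[1], 16)
    except ValueError:
        # Include the offending string so the log shows what was rejected.
        raise ValueError(f"{hint}. Got: {text!r}") from None
    return major, minor
```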
ISSUE #204 : Running on versions prior to 3.10 will fail, due to
the use of `match` statements. Other parts of the script assume
a recent Python also, and the system as a whole expects a recent
version of Ubuntu.
`pythonCheck.py` polls `sys.version_info` to detect the in-use
version of Python. If the version is prior to 3.10, it bails out
with the message "LibreQoS requires Python 3.10 or greater".
This should help with outdated OS detection in general.
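Roughly what such a check looks like. This is a sketch, not the
contents of `pythonCheck.py`; the function names are made up, and
only the message text and the 3.10 threshold come from the note
above.

```python
import sys

MIN_VERSION = (3, 10)
MESSAGE = "LibreQoS requires Python 3.10 or greater"

def check_python(version_info=sys.version_info):
    """Return an error message if the interpreter is too old, else None."""
    if tuple(version_info[:2]) < MIN_VERSION:
        return MESSAGE
    return None

def main():
    problem = check_python()
    if problem:
        print(problem)
        sys.exit(1)
```

The check itself has to avoid `match` and other 3.10-only syntax,
otherwise an old interpreter fails with a syntax error before the
message is ever printed.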
Note that you will need to regenerate the webusers.toml file
after this.
Part of my motivation for this is that this file is not strictly
for governing web permissions in the future.
Previously, generated parent nodes aka generatedPNs (which hold ShapedDevices without a defined Parent Node) would reside on the last few cores of the CPU. A problem became apparent here where if an operator had more Top Level Parent Nodes defined in network.json than CPU cores, there were no CPU cores left to use for generatedPNs. With this change, generatedPNs are created for each CPU core. Additionally, to reduce verbosity of the console output, the warning "uploadMax of Circuit ID exceeded that of its parent node." has been changed to an info log.
IPv6 is a max of 40 characters, so I cut the allowed space to 42.
That gave room to have multi-second long observable RTTs in the
RTT segment.
Compile tested only. A slightly better approach might be to
display time in ns/us/ms/s.
Sometimes the watcher persists, sometimes it doesn't. This is
awful behavior. Work around by looping the watcher and returning
after it fires - restarting the watch process. This has the
added advantage of handling "file deleted" gracefully.
When ShapedDevices cannot be read, an empty set is loaded so
it's obvious on the web UI that there is an issue.
In testing, I've been through a bunch of "break shaped devices",
"unbreak shaped devices" and seen the data come and go correctly
now.
ISSUE #239
The Rust validator was running, and the very next line set the
result to success regardless of the result. Move the validation
initialization to the top.
ISSUE #240
separated items will no longer cause validation issues - excess
whitespace is automatically trimmed at the beginning and end
of comma separated entries before parsing.
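The trimming amounts to something like the following. The helper
name is hypothetical, and dropping empty entries (e.g. from a
trailing comma) is an assumption that goes beyond what the note
states.

```python
def split_entries(field: str) -> list[str]:
    """Split a comma separated field, trimming whitespace around each entry."""
    entries = []
    for item in field.split(","):
        item = item.strip()
        if item:  # assumption: silently skip empty entries
            entries.append(item)
    return entries
```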
First part of ISSUE #240
(Fixing the validation issue rather than the actual cause)
to let the server finish starting.
The delay is in the polling thread, not global - so it doesn't
cause a stall, or affect data access. The ringbuffer will be
slightly delayed in starting (and show zeroes until then).
Testing shows no more logged messages on reboot.
ISSUE #235
into running files to avoid service interruption. The script
detects if you are using systemd (with the default names) and
will restart the services at the end of the process - for a
very brief interruption rather than several minutes.
Also suppresses pushd/popd output.
Related to ISSUE #208
ISSUE #52
* Added file locking commands to the Python/Rust library.
* When LibreQoS.py starts, it checks that /var/run/libreqos.lock
does not exist. If it does, it checks that it contains a PID
and that PID is not still running with a python process.
* If the lock file exists and is valid, execution aborts.
* If the lock file exists and is invalid, it is cleaned.
* Cleans the lock on termination.
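A sketch of the lock logic in the list above, not the actual
library code. All names are hypothetical, the "is it a python
process" check assumes Linux's /proc, and the check-then-write is
not atomic, so two simultaneous starts could still race.

```python
import os

def read_lock_pid(path):
    """Return the PID stored in the lock file, or None if missing/unparseable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

def is_python_process(pid):
    """Linux-only check via /proc; False if the process is gone."""
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            return b"python" in f.read()
    except OSError:
        return False

def acquire_lock(path, pid=None, is_alive=is_python_process):
    """Return True if the lock was taken, False if a valid lock blocks us."""
    pid = os.getpid() if pid is None else pid
    existing = read_lock_pid(path)
    if existing is not None and is_alive(existing):
        return False  # valid lock: abort
    # Missing or stale lock: (re)write it with our PID.
    with open(path, "w") as f:
        f.write(str(pid))
    return True

def release_lock(path):
    """Remove the lock on termination; a missing file is not an error."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass
```

The `is_alive` parameter exists only so the logic can be exercised
without real processes.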
boot time. ISSUE #235 . Also relevant to ISSUE #209
* Discovered that the BOOT_TIME clock can fail if called
immediately after boot.
* Refactored time fetching functions into `lqos_utils` with
proper error wrapping.
* Adjusted unknown IP expiration to issue a bus response of
"not ready yet" if the boot time clock is not available.
* Adjusted unknown IP expiration to handle 5-minutes in the
past being a negative number.
* Adjusted queue collection to suggest that you run
LibreQoS.py if queues don't exist - and fail gracefully,
without causing a hitch.
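The "5 minutes in the past" case is the easy one to picture: right
after boot, time-since-boot minus five minutes goes below zero. A
small sketch of clamping it, with hypothetical names and nanosecond
units assumed:

```python
FIVE_MINUTES_NS = 5 * 60 * 1_000_000_000

def expiry_cutoff(now_since_boot_ns):
    """Return the cutoff timestamp for expiring unknown IPs.

    None means the boot clock was not available ("not ready yet").
    The result is clamped at zero so a freshly booted system does
    not produce a negative cutoff.
    """
    if now_since_boot_ns is None:
        return None
    return max(0, now_since_boot_ns - FIVE_MINUTES_NS)
```

In Rust the equivalent for unsigned values would be a saturating
subtraction; whether the actual fix uses that is not stated here.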
Remove the unused feature from notify. We actually moved notify
into a single crate (as opposed to all over the place) in
816ca7e651
As of this commit, build_rust runs without warnings.
* Corrected a lot of small issues like passing a string when a char
will be (marginally) faster.
* Cleaned up single-arm match statements for the much more compiler
friendly if let.
* Combined nested if statements.
* Cleaned all remaining unchecked unwrap() calls.
ISSUE #209
Replace "anyhow" with "thiserror". Add logging for all errors,
and only allow pass-through for errors that have already been
converted to a local error type and reported.
structure.
* Creates FileWatcher, in lqos_utils.
* Removes "notify" dependency from other crates.
FileWatcher is designed to watch a file. If the file doesn't exist,
then an optional callback is called - and the watcher waits,
periodically checking to see if the file has appeared yet. When the
file appears, another optional callback is executed.
Once the file exists, a `notify` system is started (it uses Linux's
`inotify` system internally) for that file. When the file changes,
the process sleeps briefly and then executes an `on_change` callback.
Further messages are then suppressed for a short period to avoid
duplicates.
All uses of notify have been updated to use this system. Errors are
handled cleanly, per ISSUE #209.
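To make the state machine concrete, here is a simplified stand-in.
It swaps `notify`/inotify for plain mtime polling, leaves out the
sleep and duplicate suppression, and the class and callback names
are invented for the example. Not firing `on_appear` when the file
already exists on the first poll is a choice made here, not
something the description specifies.

```python
import os

class PollingFileWatcher:
    """Watch one file by polling its mtime (a stand-in for inotify)."""

    def __init__(self, path, on_change, on_missing=None, on_appear=None):
        self.path = path
        self.on_change = on_change
        self.on_missing = on_missing
        self.on_appear = on_appear
        self._exists = None      # None until the first poll
        self._last_mtime = None

    def poll_once(self):
        try:
            mtime = os.stat(self.path).st_mtime_ns
        except FileNotFoundError:
            # Report "missing" once, then keep waiting quietly.
            if self._exists is not False and self.on_missing:
                self.on_missing()
            self._exists = False
            return
        was_missing = self._exists is False
        first_poll = self._exists is None
        self._exists = True
        if was_missing or first_poll:
            self._last_mtime = mtime
            if was_missing and self.on_appear:
                self.on_appear()
            return
        if mtime != self._last_mtime:
            self._last_mtime = mtime
            self.on_change()
```

A caller would invoke `poll_once()` in a loop with a sleep between
calls; deleting the file simply drops the watcher back into the
"waiting for it to appear" state.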
* ShapedDevices.csv may not exist on first run.
* This previously caused lqos_node_manager to emit a hard error
and not show any shaped devices at all.
To rectify this:
* ShapedDevices in-memory starts as an empty set.
* When the "watcher" spawns, if the file exists then ShapedDevices
is loaded.
* If the "watcher" can't find ShapedDevices, it sleeps periodically
looking for the file to be created. Once created, it loads it
and starts the change monitor.
ISSUE #209
Using the inode watcher on a file that doesn't exist fails, and
was previously failing silently! This would result in queue
mappings not updating when LibreQoS.py was executed - even though
the queueingStructure.json file became available.
* Replace "anyhow" with specific errors.
* Track and log each step of the file monitor process for
queueingStructure.json
* If the watcher cannot start because the file doesn't exist,
the watcher loop sleeps for 30 seconds at a time (to keep
load very low) and checks if the file exists yet. If it does,
it loads it and then commences watching.
file descriptor.
* Remove the Tokio timer system.
* Replace with Linux's timer fd system.
* Add a watchdog to alert if we've somehow overrun the timer.
* Replace Tokio timers in the bandwidth/throughput monitor with
Linux timer file descriptors API.
* Instead of spawning a Tokio process, spawn an independent
thread for the bandwidth monitor.
* Queue timing is now provided by Linux "timer file descriptors"
instead of Tokio timers.
* Added an atomic bool to track "we're going faster than we should"
(it's true while executing), skip cycles if we ran out of time and
issue a warning.
* Queue tracking is no longer async, but is locked to its very own
thread.
* Queue watcher is now more verbose about any issues it encounters.
* Queue structures with children will now correctly track all the
children, avoiding the blank queue data issue.
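The "we're going faster than we should" flag can be pictured like
this. Python has no atomic bool, so a non-blocking lock stands in
for it; the class name and the warning text are invented for the
sketch.

```python
import threading

class CycleGuard:
    """Skips a timer tick if the previous tick is still executing."""

    def __init__(self):
        self._running = threading.Lock()  # stands in for the atomic bool
        self.skipped = 0

    def tick(self, work):
        """Run work() unless a tick is in progress. Returns True if it ran."""
        if not self._running.acquire(blocking=False):
            self.skipped += 1
            print("warning: previous cycle still running, skipping this one")
            return False
        try:
            work()
        finally:
            # Always clear the flag, even if work() raised.
            self._running.release()
        return True
```

Skipping rather than queueing means an overrun costs one sample
instead of letting late cycles pile up behind each other.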
It was hilarious that I already missed the new "bridge"
section in my first attempt. Imagine what it is like for the
users?
Pithy notes:
I think this is an artifact of history, as a bool.
disable_rxvlan = true
disable_txvlan = true
There are a zillion other options in ethtool -h for
coalescing things, besides this.
disable_offload = [ "gso", "tso", "lro", "sg", "gro" ]
We have a lot of configuration stuff, written in several very
different styles. We have hidden knowledge (like port numbers)
buried elsewhere. We have overly wordy variable names, and no
clear separation of each concept. We need to keep some data
secure (passwords to the APIs), while other data needs to be common.
Ideally there would be more of a secrets file for secrets to
point to, on the security case.
Having one file to rule them all is not exactly the right way
forward, but parsing one file *format* might prove simpler.
Please, everyone, think about how best to express this;
I took a stab at it via this commit.
statistics, while lqtop still works.
1) Add warning and error logging to lqos_node_manager if any
part of the statistics gathering process fails.
2) (Hopefully temporarily) use the non-persistent bus client,
again logging any issues.
3) Improve the statistics gathering timer code.
deleting.
* Adjust the Python integration `delete_ip_mapping` function to
not require a secondary "upload" parameter - because the
Python code is unaware of whether there needs to be a
separation of the two at this point.
* Change ENOEXIST return code in BPF map delete to NOT be an
error - it indicates that there was nothing to do, rather
than something not working.
Affects ISSUE #206
* Add a blank node for [children] in the QueueNode parser.
* Add two unit tests to cover loading content with and without
[children] entries.
it returns readable error messages explaining where it encountered a problem.
* Adds the bus call to the Python-Rust bridge.
* Adjusts LibreQoS.py to call the new bridge code and alert if Rust can't
read the ShapedDevices.csv file.
* Fix the Python code to actually call `is_lqosd_alive()` instead of just
checking that it exists (`is_lqosd_alive`).
* Fix the os.exit command syntax.
* Cleanup the blocking Tokio/lqosd request handler to pass better messages.
* Catch the "file not found" and replace it with a nicer message.
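The `is_lqosd_alive` bug is a classic: a function object is always
truthy, so the check passed whether or not the daemon was up. A
tiny demonstration, using a stand-in for the real bridge function:

```python
def is_lqosd_alive():
    """Stand-in for the real bridge function; reports 'not alive' here."""
    return False

# Buggy form: tests the function object itself, which is always truthy.
buggy = bool(is_lqosd_alive)

# Fixed form: tests the result of actually calling it.
fixed = bool(is_lqosd_alive())
```

Here `buggy` is True even though the daemon is "down", while
`fixed` is False.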
I exclude .json from my rg searches, and when going through code
via things like "rg" - having the json show up is kind of painful.
use .txt for text, .json for .json, .whatever.gz for compressed.
The default total summary was the opposite of the table below,
and confusing.
Ideally the layout of the topmost bar should be:
LibreQos 'NOT CONNECTED' 'OTHER STATUS' 'down bps' 'up bps' 'pps' 'pps'
But my table-fu failed me.
By using the handy fmt macros this is feasible if the cell width
is known, via "{:>11}" for example.
However, my intuitive thought was that this was a "Constraint"
that you could apply to a Cell or Span, and it isn't. There are
multiple calls on the github for this. Until such a day, fmt
goes to 11.
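For reference, the padding trick looks like this. The commit uses
Rust's fmt macros; Python's format spec happens to accept the same
`>11` alignment syntax, so it is shown here in Python. The helper
name is made up.

```python
def cell(text: str, width: int = 11) -> str:
    """Right-align text in a fixed-width cell (pads, never truncates)."""
    return f"{text:>{width}}"
```

Note this only pads: a value longer than the width is emitted
as-is and will push the layout out.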
Modern linuxes add fields all the time. On a kernel upgrade we
shouldn't crash just because there's a new json field.
Also there are some further optimizations to represent the kernel
structures themselves. Some fields can overflow which would lead
to some surprising behaviors in polling the json over time.
FIXME - always check for overflow elsewhere, in packets, overlimits
etc. Could that be a trait?
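A sketch of the overflow concern: if a counter wraps between two
polls, a naive subtraction goes wildly wrong, while modular
arithmetic still gives the right delta. The 32-bit default is an
assumption for illustration; the real field widths vary.

```python
def counter_delta(previous: int, current: int, bits: int = 32) -> int:
    """Difference between two readings of a counter that wraps at 2**bits."""
    # Modular subtraction stays correct across a single wrap-around.
    return (current - previous) % (1 << bits)
```

This only holds if the counter wraps at most once between samples,
which is another reason to keep the polling interval short.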
On systems that support it, `jemallocator` will replace the default
allocator for `lqosd` and `lqos_node_manager`. The Jem allocator
is a LOT more friendly to Rust's preferred vector allocation patterns,
and actually cares about your cache.
Enable "fat" Link-Time Optimization. This slows down compilation a lot,
but results in smaller binaries and permits the LLVM optimizer to
cross crate boundaries when inlining, optimizing tail calls and a
few other optimizations.
FIXME: We need, in general, to check for wrapping in long term runs
* FIXME: Need a test to ensure fq_codel is still parsing
* Still want a size before and after test.
Lastly...
Newer versions of the kernel now have newer options for fq_codel
such as ce_threshold. The present implementation will spam the log
on encountering a newer kernel and tc.
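One way to tolerate options like ce_threshold is to keep only the
fields the parser knows about and ignore the rest. A minimal
sketch; the dataclass and its three fields are an illustrative
subset, not the real structure.

```python
import json
from dataclasses import dataclass, fields

@dataclass
class FqCodelOptions:
    # Illustrative subset of fq_codel options; the real struct has more.
    limit: int = 0
    flows: int = 0
    quantum: int = 0

def parse_options(raw: str) -> FqCodelOptions:
    """Parse a JSON object, silently ignoring fields we don't know about."""
    data = json.loads(raw)
    known = {f.name for f in fields(FqCodelOptions)}
    return FqCodelOptions(**{k: v for k, v in data.items() if k in known})
```

The defaults also mean a field missing on an older kernel doesn't
break parsing either.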
Added a note for where `n_rows` gains the current terminal size.
`tui` turns out to consume the `Event::Resize` event, so it's
never received - instead, we have to calculate it from the
creation of the UI "chunks".
The rendering length turned out to be an artefact of forgetting
to remove a height constraint from a UI chunk that was tested
and never used. (It briefly contained some graph data).
* At any time, you can ask a BusClient if it is connected.
* If `lqtop` loses connectivity while running, it displays NOT
CONNECTED in red on the title bar.
* If `lqtop` can't reach the daemon on start, it bails out with
an error message.
The file locking is "smart": it checks to see if a lock is
valid before refusing to run (and updates the lock if
it can run anyway).
The locking mechanism will fail if you manually create
the lock file and dump random data into it that doesn't
readily convert to an i32.
Affects issue #54 and issue #52
* Add a new structure `FileLock` to `lqosd`.
* FileLock first checks /run/lqos/lqosd.lock. If it exists,
it opens it and attempts to read a PID from it. If that PID
is running and the associated name includes "lqosd", the
FileLock returns an error.
* If no lock exists, then a file is created in
/run/lqos/lqosd.lock containing the running PID.
* Includes Drop and signal termination support.
Since guaranteeing the execution of Drop traits on process
termination from signals is now "unsound" in the Rust specs,
provide an explicit clean-up path for `lqosd`, called when
a termination signal is processed. This removes the Unix
stream socket on termination.