This closes HTTP responses when no content reads are required. When
requests are made in streaming mode, ``requests`` does not know whether
the caller intends to read content from the streamed HTTP response
object later, so it holds the socket open.
Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
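A minimal sketch of the close-after-headers pattern described above,
using ``requests`` directly (the URL is illustrative):

    import requests

    # In streaming mode the body is not fetched eagerly, so the socket
    # stays open until the response is read or explicitly closed.
    response = requests.get("https://example.com", stream=True)
    try:
        status_code = response.status_code  # only headers are needed here
    finally:
        response.close()  # release the connection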
Add raw directives' source URLs to the list of links checked by
linkcheck. Along the way, refactor HyperlinkCollector by adding an
`add_uri` function.
Add a test for linkcheck on raw directives' source URLs.
Node.traverse() has been marked as deprecated since docutils-0.18, and
Node.findall() has been added as its successor. This applies a patch to
docutils-0.17 and older so that Node.findall() is available, and uses
it.
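A minimal sketch of such a compatibility patch (the real patch in
Sphinx may differ in detail):

    from docutils import nodes

    # On docutils 0.17 and older, provide Node.findall() by delegating
    # to the deprecated Node.traverse(), which returns a list.
    if not hasattr(nodes.Node, 'findall'):
        def findall(self, *args, **kwargs):
            return iter(self.traverse(*args, **kwargs))

        nodes.Node.findall = findall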
The `rewrite_github_anchor` approach makes some anchors valid, but it
also makes other kinds of anchors invalid. This disables the handler to
make the latter valid again (for the 4.1.x release line).
The linkcheck builder now integrates `linkcheck_warn_redirects` into
`linkcheck_allowed_redirects`. As a result, the builder emits a warning
when a redirection disallowed by `linkcheck_allowed_redirects` is
detected.
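For example, in conf.py (the patterns are illustrative):

    # Redirects matching a source/target pattern pair are allowed; any
    # other redirect is reported as a warning.
    linkcheck_allowed_redirects = {
        r'https://example\.org/.*': r'https://www\.example\.org/.*',
    }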
Add a new confval, `linkcheck_warn_redirects`, to emit a warning when a
hyperlink is redirected. It is useful for detecting unexpected redirects
under the warn-is-error mode.
To separate the hyperlink itself from its (intermediate) checking
state, this removes the `next_check` attribute from the Hyperlink
object and adds a new named tuple, `CheckRequest`. The idea was
rejected when rate limiting was first implemented (see #8467), but it
is a better way to split the large builder module into small components
and keep the linkcheck builder as a whole simple.
After this change, a `Hyperlink` object represents a hyperlink in the
document. It holds a URI and location info (docname and lineno).
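A sketch of the resulting shapes (field names follow this description;
the actual definitions may differ slightly):

    from typing import NamedTuple, Optional

    class Hyperlink(NamedTuple):
        uri: str
        docname: str
        lineno: Optional[int]

    class CheckRequest(NamedTuple):
        next_check: float  # earliest time at which the URI may be checked
        hyperlink: Optional[Hyperlink]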
Currently, linkcheck displays the status of hyperlinks, but it is hard
to find where a hyperlink is written because only the line number is
shown as the location of the link.
This displays the docname of the link too.
Opening and closing a file requires processing from the operating
system. Repeatedly opening and closing wastes system resources and
hinders buffering, causing a flush (disk I/O) after each write
operation.
Using a context manager ensures the logs are flushed to disk and files
are properly closed whether the program exits successfully or an
exception occurs. Compared to the previous implementation, a brutal
shutdown of the machine (e.g. the power cord being disconnected) could
now cause some log lines not to be written; that should not be an issue
in practice.
Files are now created and truncated when linkcheck has submitted the
links to check to the worker threads and is ready to process the
results, instead of when the builder is constructed. This keeps the
file operations closer to their use.
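A sketch of the pattern (the file names and the `write_linkstats`
helper are illustrative, not the builder's actual API):

    import json

    def write_linkstats(results):
        # Open the output files once, just before processing results;
        # the context manager flushes and closes them even when an
        # exception occurs, instead of reopening the files per write.
        with open('output.json', 'w') as json_file, \
                open('output.txt', 'w') as txt_file:
            for result in results:
                json_file.write(json.dumps(result) + '\n')
                txt_file.write('{uri}: {status}\n'.format(**result))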
These attributes were used to cache checked links and avoid issuing
another web request for the same URI.
Since 82ef497a8c, links are pre-processed to ensure uniqueness, so
caching the results of checked links is no longer useful.
So far, linkcheck has scanned all references and images in the
documents and checked them in parallel. As a result, some URLs could be
checked twice (or more) because of a race condition.
This collects the URLs via post-transforms and removes duplicate URLs
before checking their availability.
refs: #4303
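A minimal sketch of the deduplication step (names are illustrative):

    def collect_hyperlinks(references):
        # Keep only the first occurrence of each URI so that concurrent
        # workers never check the same URL twice.
        hyperlinks = {}
        for uri, docname, lineno in references:
            if uri not in hyperlinks:
                hyperlinks[uri] = (docname, lineno)
        return hyperlinks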
Linkcheck organizes the URLs to check in a PriorityQueue. The items are
tuples (priority, url, docname, lineno).
Tuples where the lineno is `None` are not comparable with tuples that
have an integer lineno, and PriorityQueue items must be comparable (see
https://bugs.python.org/issue31145).
Fixes an issue where a document contains two links to the same URL, one
with an integer line number and the other without line number metadata
(such as an image's :target: attribute).
Using 0 instead of None to represent "no line number" should not lead
to observable changes: the result logger only logs the line number when
it is truthy.
Closes #8565
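The incomparability is easy to reproduce; the tuples below mirror the
queue items:

    # Tuples compare element-wise: once the first three fields are
    # equal, Python compares the linenos, and None < int raises.
    a = (1, 'https://example.com', 'index', None)
    b = (1, 'https://example.com', 'index', 5)
    try:
        a < b
    except TypeError as exc:
        print(exc)  # '<' not supported between 'NoneType' and 'int'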
This method always returns False; it is dead code. The exception check
stopped working because the Requests library wraps SSL errors in a
`requests.exceptions.SSLError` and no longer raises an
`urllib3.exceptions.SSLError`. The first argument to that exception is
an `urllib3.exceptions.MaxRetryError`.
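For illustration, a request to a host with a broken certificate shows
the wrapping (the URL is only an example):

    import requests

    try:
        requests.get('https://expired.badssl.com/')
    except requests.exceptions.SSLError as exc:
        # The urllib3 error is wrapped rather than raised directly, so
        # a handler for urllib3.exceptions.SSLError never matches.
        print(type(exc.args[0]))  # urllib3.exceptions.MaxRetryError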
Keep imports alphabetically sorted and their order homogeneous across
Python source files.
The isort project has more features and is more active than the
flake8-import-order plugin.
Most issues caught were simply import ordering within the same module.
Where imports were purposefully placed out of order, they are tagged
with `isort:skip`.
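For example, an import that must stay below a path tweak can be tagged
(the module name is illustrative):

    import sys

    sys.path.insert(0, '.')  # must run before the local import below
    import mypackage  # isort:skip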
Some websites enter infinite redirect loops with HEAD requests. In this
case, the GET fallback is ignored because the exception is of type
TooManyRedirects, and the link is reported as broken.
This extends the except clause to retry with a GET request in such
scenarios.
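A sketch of the extended fallback (simplified from the builder's actual
retry logic):

    import requests

    def check_uri(uri):
        try:
            # Try a lightweight HEAD request first.
            response = requests.head(uri, allow_redirects=True)
            response.raise_for_status()
        except (requests.exceptions.HTTPError,
                requests.exceptions.TooManyRedirects):
            # Some servers reject HEAD or redirect it forever; retry
            # with GET before reporting the link as broken.
            response = requests.get(uri, stream=True)
            response.raise_for_status()
        return response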
* TST: linkcheck: make tests more flexible
* CLN: linkcheck: flake8, mypy
* REF: linkcheck: docpath->filename, write_jsonline->write_linkstat
* REF: linkcheck: remove redundant call to doc2path
* TST: linkcheck: show JSON obj structure in test
* REF: linkcheck: remove docname from JSON obj because it's redundant (use path2doc(filename) if necessary)
* TST: linkcheck: don't test row[info] output (see comments for examples)