To separate the hyperlink itself from its (intermediate) checking state,
this removes the `next_check` attribute from the Hyperlink object and adds
a new named tuple, `CheckRequest`. This idea was rejected when
rate-limiting was first implemented (see #8467), but it is a better way
to split the large builder module into small components and keep the
linkcheck builder as a whole simple.
After this change, a `Hyperlink` object represents a hyperlink in the
document. It has a URI and location info (docname and lineno).
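A minimal sketch of the resulting split, assuming both types are plain
named tuples. Only `next_check`, the URI, the docname and the lineno come
from the description above; the exact field names and the `hyperlink`
field of `CheckRequest` are assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class Hyperlink(NamedTuple):
    """A link as it appears in the document: no checking state."""
    uri: str
    docname: str
    lineno: Optional[int]


class CheckRequest(NamedTuple):
    """Intermediate checking state, kept apart from the link itself."""
    next_check: float      # earliest time at which the check may run
    hyperlink: Hyperlink   # assumed field name


link = Hyperlink("https://example.org/", "index", 12)
request = CheckRequest(next_check=0.0, hyperlink=link)
```

Keeping the scheduling data in its own tuple means a `Hyperlink` can be
created once and passed around unchanged while it waits to be checked.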
Currently, linkcheck displays the status of hyperlinks, but it is hard
to find where a hyperlink is written because only the line number is
shown as its location.
This change displays the docname of the link as well.
Opening and closing a file requires processing from the operating
system. Repeatedly opening and closing wastes system resources and
hinders buffering, causing a flush (disk I/O) after each write
operation.
Using a context manager ensures the logs are flushed to disk and files
are properly closed whether the program exits successfully or an
exception occurs. Compared to the previous implementation, a brutal
shutdown of the machine (e.g. power cord disconnected) could cause some
log lines not to be written. That should not be an issue in practice.
Files are now created and truncated once linkcheck has submitted the
links to check to the threads and is ready to process the results, instead
of when the builder is constructed. This keeps the file operations closer
to their use.
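The pattern can be sketched as follows. This is an illustration only: the
`write_results` helper, the `output.txt` file name and the line format are
assumptions, not the builder's actual code.

```python
import os
import tempfile


def write_results(outdir, results):
    """Write every result while the output file is opened exactly once."""
    path = os.path.join(outdir, "output.txt")
    # Opening in "w" mode creates or truncates the file right before use.
    with open(path, "w", encoding="utf-8") as fp:
        for docname, lineno, uri, status in results:
            fp.write(f"{docname}:{lineno}: [{status}] {uri}\n")
    # Leaving the block flushes and closes the file, even on an exception.
    return path


with tempfile.TemporaryDirectory() as outdir:
    path = write_results(outdir, [("index", 3, "https://example.org/", "broken")])
    with open(path, encoding="utf-8") as fp:
        written = fp.read()
```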
These attributes were used to cache checked links and avoid issuing
another web request to the same URI.
Since 82ef497a8c, links are pre-processed to ensure uniqueness, so
caching the results of checked links is no longer useful.
So far, linkcheck has scanned all references and images in the documents
and checked them in parallel. As a result, a race condition could cause
some URLs to be checked twice (or more).
This change collects the URLs via post-transforms and removes duplicates
before checking availability.
refs: #4303
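The de-duplication step alone might look like the sketch below. It uses
plain tuples instead of post-transforms, and `collect_unique` is a
hypothetical helper name.

```python
def collect_unique(found):
    """Keep the first occurrence of each URI, preserving discovery order."""
    unique = {}
    for uri, docname, lineno in found:
        unique.setdefault(uri, (uri, docname, lineno))
    return list(unique.values())


found = [
    ("https://example.org/", "index", 3),
    ("https://example.org/", "usage", 10),   # duplicate URI
    ("https://example.com/", "usage", 12),
]
to_check = collect_unique(found)
```

Because duplicates are dropped before any worker starts, no two threads
can race to check the same URL.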
Linkcheck organizes the URLs to check in a PriorityQueue. The items are
tuples (priority, url, docname, lineno).
Tuples where the lineno is `None` are not comparable with tuples that
have an integer lineno, and PriorityQueue items must be comparable (see
https://bugs.python.org/issue31145).
Fixes an issue when a document contains two links to the same URL, one
with an int line number and the other without line number metadata (such
as an image :target: attribute).
Using 0 instead of None to represent no line number should not lead to
observable changes: the result logger only logs the line number when it
is truthy.
Close #8565
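The failure can be reproduced with the standard library alone. In this
sketch the priority value and the `enqueue` helper are made up for the
demonstration; only the tuple layout follows the description above.

```python
import queue


def enqueue(linenos):
    """Queue one item per lineno for the same URL, then drain in order."""
    wqueue = queue.PriorityQueue()
    for lineno in linenos:
        wqueue.put((1, "https://example.org/", "index", lineno))
    return [wqueue.get()[3] for _ in linenos]


try:
    enqueue([3, None])
    comparable = True
except TypeError:
    # None and int cannot be ordered, so the tuple comparison fails.
    comparable = False

ordered = enqueue([3, 0])  # 0 stands in for "no line number"
```

The first three tuple elements are equal, so the comparison falls through
to the line number, which is where `None` and an `int` collide.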
This method always returns False, so it is dead code. The exception
check stopped working because the Requests library wraps SSL errors in a
`requests.exceptions.SSLError` and no longer throws a
`urllib3.exceptions.SSLError`. The first argument to that exception is
a `urllib3.exceptions.MaxRetryError`.
Keep imports alphabetically sorted and their order homogeneous across
Python source files.
The isort project has more features and is more active than the
flake8-import-order plugin.
Most issues caught were simply the ordering of imports from the same module.
Where imports were purposefully placed out of order, they are tagged with
isort:skip.
Some websites will enter infinite redirect loops with HEAD requests. In this
case, the GET fallback is ignored as the exception is of type TooManyRedirects
and the link is reported as broken.
This extends the except clause to retry with a GET request for such scenarios.
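A sketch of the extended fallback. To stay self-contained it uses local
stand-ins for the Requests exception classes and takes the HEAD and GET
functions as parameters. The assumption that the original clause caught
only `HTTPError` is mine, and `check` is a hypothetical helper.

```python
class HTTPError(Exception):
    """Stand-in for requests.exceptions.HTTPError."""


class TooManyRedirects(Exception):
    """Stand-in for requests.exceptions.TooManyRedirects."""


def check(url, head, get):
    """Try HEAD first; fall back to GET on HTTP errors or redirect loops."""
    try:
        return head(url)
    except (HTTPError, TooManyRedirects):
        # Listing TooManyRedirects here is what lets the GET retry happen.
        return get(url)


def looping_head(url):
    # Simulates a site that redirects HEAD requests forever.
    raise TooManyRedirects(url)


status = check("https://example.org/", looping_head, lambda url: 200)
```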
* TST: linkcheck: make tests more flexible
* CLN: linkcheck: flake8, mypy
* REF: linkcheck: docpath->filename, write_jsonline->write_linkstat
* REF: linkcheck: remove redundant call to doc2path
* TST: linkcheck: show JSON obj structure in test
* REF: linkcheck: remove docname from JSON obj because it's redundant (use path2doc(filename) if necessary)
* TST: linkcheck: don't test row[info] output (see comments for examples)
This commit changes the linkcheck builder to handle unregistered HTTP
status codes, such as the one defined in https://tools.ietf.org/html/rfc7538.
At the moment, Sphinx fails with an internal error. With this change,
a non-registered HTTP status code is treated in the same way as
code 0, i.e. an unknown code.
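One way to get this behaviour is a dictionary lookup with a default. The
table contents and the `describe_redirect` name below are hypothetical;
only the fallback to an "unknown" description reflects the change.

```python
# Hypothetical table of codes the builder knows how to describe.
KNOWN_REDIRECTS = {
    301: "permanently",
    302: "with Found",
    303: "with See Other",
    307: "temporarily",
}


def describe_redirect(code):
    """Return a description, treating unregistered codes like code 0."""
    return KNOWN_REDIRECTS.get(code, "with unknown code")


known = describe_redirect(301)
unknown = describe_redirect(308)   # defined by RFC 7538, not in the table
fallback = describe_redirect(0)
```

A plain `KNOWN_REDIRECTS[code]` lookup would raise `KeyError` for 308,
which is the kind of internal error the change avoids.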
In Python 3, the default encoding of source files is utf-8. The encoding
cookie is now unnecessary and redundant so remove it. For more details,
see the docs:
https://docs.python.org/3/howto/unicode.html#the-string-type
> The default encoding for Python source code is UTF-8, so you can
> simply include a Unicode character in a string literal ...
Includes a fix for the flake8 header checks to stop expecting an
encoding cookie.
In Python 3, io.open() is an alias of the builtin open(), and
codecs.open() is functionally equivalent. To reduce indirection, the
number of imports, and the number of patterns, always prefer the builtin.
https://docs.python.org/3/library/io.html#high-level-module-interface
> io.open()
>
> This is an alias for the builtin open() function.
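A quick check of the alias, followed by the preferred pattern. The file
name and content are arbitrary examples.

```python
import io
import os
import tempfile

# In Python 3 both names refer to the very same function object.
same_function = io.open is open

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.txt")
    # The builtin takes an encoding argument directly.
    with open(path, "w", encoding="utf-8") as fp:
        fp.write("héllo")
    with open(path, encoding="utf-8") as fp:
        text = fp.read()
```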
There are situations where the requested server replies with different content (in my particular case, HTTP 404) when there is no Accept header, possibly because its content negotiation treats the request as an API request instead of a browser request. This change adds a default Accept header, equal to the one my Firefox sends with its requests out of the box.
I stumbled upon this when checking a link to https://crates.io/crates/dredd-hooks. While
curl -i https://crates.io/crates/dredd-hooks
returns HTTP 404, the following results in the expected HTTP 200 response with an HTML body:
curl -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -i https://crates.io/crates/dredd-hooks
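In Python, setting such a default could look like the sketch below. It
uses the standard library's `urllib.request` only to stay self-contained,
which may differ from the HTTP client the builder uses, and
`build_request` is a hypothetical helper. No request is actually sent.

```python
import urllib.request

# Same value as in the curl command above.
DEFAULT_ACCEPT = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"


def build_request(url, extra_headers=None):
    """Build a request that always carries an Accept header."""
    headers = {"Accept": DEFAULT_ACCEPT}
    headers.update(extra_headers or {})   # callers may still override it
    return urllib.request.Request(url, headers=headers)


request = build_request("https://crates.io/crates/dredd-hooks")
```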
This allows builders to emit a final epilog message containing
information such as where resulting files can be found. This is only
emitted if the build was successful.
This allows us to remove this content from the 'make_mode' tool and
the legacy 'Makefile' and 'make.bat' templates. There's room for more
dramatic simplification of the former, but this will come later.
Signed-off-by: Stephen Finucane <stephen@that.guru>
To avoid needing to turn off anchor checking across the entire
documentation, allow skipping based on matching the anchor against a
regex.
Some sites/pages use JavaScript to perform anchor assignment in a
webpage, which would require rendering the page to determine whether
the anchor exists. Allow fine-grained control of whether the anchor is
checked based on pattern matching, until such time as the retrieved
URLs can be passed through an engine for deeper checking on the HTML
doctree.
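The matching itself can be sketched as below. The `ANCHORS_IGNORE` list,
its example patterns and the `should_check_anchor` helper are all
hypothetical and only illustrate the idea of skipping by regex.

```python
import re

# Hypothetical setting: anchors matching any of these regexes are not checked.
ANCHORS_IGNORE = [r"^!", r"^L\d+$"]


def should_check_anchor(anchor, patterns=ANCHORS_IGNORE):
    """Return False when the anchor matches one of the ignore patterns."""
    return not any(re.match(pattern, anchor) for pattern in patterns)


checked = should_check_anchor("installation")
skipped = should_check_anchor("!/usage")
```

With this approach the page itself is still fetched and checked; only the
lookup of the matching anchor inside it is skipped.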
Atlassian websites will return a 403 Forbidden access code when
queried, so add this to the list of codes that should trigger using a
GET request as a fallback.
This is useful because if you run linkcheck often, you are likely to see lots of transient network errors, which usually disappear if you simply try again.
- Use print function instead of print statement;
- Use new exception handling;
- Use in operator instead of has_key();
- Do not use tuple arguments in functions;
- Other miscellaneous improvements.
This is based on the output of `futurize --stage1`, with some
manual corrections.
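For illustration, the forms listed above look like this once converted.
The `report` function and the sample data are invented for the example;
the comments show the older spelling each line replaces.

```python
def report(status, uri_and_line):
    # Tuple parameters are not used: unpack inside the function body instead.
    uri, lineno = uri_and_line
    return "%s:%s [%s]" % (uri, lineno, status)


results = {"https://example.org/": "working"}

try:
    state = results["https://example.com/"]
except KeyError as exc:                   # was: except KeyError, exc
    state = "unchecked"

if "https://example.org/" in results:     # was: results.has_key(...)
    # was: print report(...)
    print(report(results["https://example.org/"], ("https://example.org/", 3)))
```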