Commit Graph

21 Commits

Author SHA1 Message Date
Sam
758e160862
FEATURE: explicitly ban outlier traffic sources in robots.txt (#11553)
Googlebot handles no-index headers very elegantly. It advises to leave as many routes as possible open and uses headers for high fidelity rules regarding indexes.

Discourse adds special `x-robot-tags` noindex headers to users, badges, groups, search and tag routes.

Following up on b52143feff we now have it so Googlebot gets special handling.

Rest of the crawlers get a far more aggressive disallow list to protect against excessive crawling.
2020-12-23 08:51:14 +11:00
Daniel Waterworth
721ee36425
Replace base_uri with base_path (#10879)
DEV: Replace instances of Discourse.base_uri with Discourse.base_path

This is clearer because the base_uri is actually just a path prefix. This continues the work started in 555f467.
2020-10-09 12:51:24 +01:00
Joshua Rosenfeld
1e6d125db6
FIX: Remove additional paths from robots.txt
admin path is protected by guardian and thus inaccessible to crawlers

Follow up to b52143f
2020-08-26 16:52:22 -04:00
Krzysztof Kotlarek
e0d9232259
FIX: use allowlist and blocklist terminology (#10209)
This is a PR of the renaming whitelist to allowlist and blacklist to the blocklist.
2020-07-27 10:23:54 +10:00
Joshua Rosenfeld
b52143feff
FIX: Remove paths from robots.txt in favor of noindex header
Google no longer supports the use of robots.txt to block indexing.
See https://support.google.com/webmasters/answer/6062608 and
https://support.google.com/webmasters/answer/93710

Previous commits have added the `noindex` header to appropriate pages,
now we need to remove the paths from robots.txt so the pages can be
crawled.

Follow up to:
13f229808a
b6765aac4b
676be3a853
07b728c5e5
c94e6a9a66
2020-06-25 13:55:06 -04:00
Osama Sayegh
6515ff19e5
FEATURE: Allow customization of robots.txt (#7884)
* FEATURE: Allow customization of robots.txt

This allows admins to customize/override the content of the robots.txt
file at /admin/customize/robots. That page is not linked to anywhere in
the UI -- admins have to manually type the URL to access that page.

* use Ember.computed.not

* Jeff feedback

* Feedback

* Remove unused import
2019-07-15 20:47:44 +03:00
Sam Saffron
30990006a9 DEV: enable frozen string literal on all files
This reduces chances of errors where consumers of strings mutate inputs
and reduces memory usage of the app.

Test suite passes now, but there may be some stuff left, so we will run
a few sites on a branch prior to merging
2019-05-13 09:31:32 +08:00
Sam
baa72d18f8 FIX: simplify so we ban all auth paths
previously plugins that have auth paths were not disallowed and robots
tend to call them
2018-08-16 19:16:47 +10:00
Sam Saffron
030e322a39 FEATURE: block top level /my/ routes 2018-06-12 19:47:45 +10:00
Robin Ward
3d7dbdedc0 FEATURE: An API to help sites build robots.txt files programatically
This is mainly useful for subfolder sites, who need to expose their
robots.txt contents to a parent site.
2018-04-16 15:43:20 -04:00
Régis Hanol
df7970a6f6 prefix the robots.txt rules with the directory when using subfolder 2018-04-11 22:05:02 +02:00
Sam
3a7b696703 FEATURE: allow for setting crawl delay per user agent
Also moved to default crawl delay bing so no more than a req every 5 seconds is allowed

New site settings:

"slow_down_crawler_user_agents" - list of crawlers that will be slowed down
"slow_down_crawler_rate" - how many seconds to wait between requests

Not enforced server side yet
2018-04-06 10:15:23 +10:00
Neil Lalonde
ced7e9a691 FEATURE: control which web crawlers can access using a whitelist or blacklist 2018-03-22 15:41:02 -04:00
Guo Xiang Tan
77d4c4d8dc Fix all the errors to get our tests green on Rails 5.1. 2017-09-25 13:48:58 +08:00
Régis Hanol
d75cc67d86 FIX: robots.txt should be accessible even when login is required 2015-10-15 11:42:41 +02:00
Sam
e5888cf090 PERF: avoid preloading json in cases where it is not needed
(uploads / avatars / non GET requests)
2015-05-20 17:12:16 +10:00
Neil Lalonde
a86b35c873 Remove the access_password site setting 2013-06-25 15:05:25 -04:00
Robin Ward
d2596c3c4c Remove unusued site_settings, show checkbox in UI for boolean values, remove restrict_access
boolean to avoid locking yourself out by setting access_password to empty string. Minor
UI tweaks.
2013-03-01 14:27:41 -05:00
Gosha Arinich
cafc75b238 remove trailing whitespaces ❤️ 2013-02-26 07:31:35 +03:00
Sam Saffron
1c12c91d0c forgot to skip a filter 2013-02-11 17:14:36 +11:00
Sam Saffron
c50a9e4d01 added support for disabling indexing by google using SiteSetting.allow_index_in_robots_txt = false 2013-02-11 11:02:57 +11:00