discourse/config
Osama Sayegh 7bd3986b21
FEATURE: Replace Crawl-delay directive with proper rate limiting (#15131)
We have a couple of site setting, `slow_down_crawler_user_agents` and `slow_down_crawler_rate`, that are meant to allow site owners to signal to specific crawlers that they're crawling the site too aggressively and that they should slow down.

When a crawler is added to the `slow_down_crawler_user_agents` setting, Discourse currently adds a `Crawl-delay` directive for that crawler in `/robots.txt`. Unfortunately, many crawlers don't support the `Crawl-delay` directive in `/robots.txt` which leaves the site owners no options if a crawler is crawling the site too aggressively.

This PR replaces the `Crawl-delay` directive with proper rate limiting for crawlers added to the `slow_down_crawler_user_agents` list. On every request made by a non-logged in user, Discourse will check the User Agent string and if it contains one of the values of the `slow_down_crawler_user_agents` list, Discourse will only allow 1 request every N seconds for that User Agent (N is the value of the `slow_down_crawler_rate` setting) and the rest of requests made within the same interval will get a 429 response. 

The `slow_down_crawler_user_agents` setting becomes quite dangerous with this PR since it could rate limit lots if not all of anonymous traffic if the setting is not used appropriately. So to protect against this scenario, we've added a couple of new validations to the setting when it's changed:

1) each value added to setting must 3 characters or longer
2) each value cannot be a substring of tokens found in popular browser User Agent. The current list of prohibited values is: apple, windows, linux, ubuntu, gecko, firefox, chrome, safari, applewebkit, webkit, mozilla, macintosh, khtml, intel, osx, os x, iphone, ipad and mac.
2021-11-30 12:55:25 +03:00
..
cloud/cloud66 DEV: enable frozen string literal on all files 2019-05-13 09:31:32 +08:00
environments FIX: remove 'crawl_images' site setting (#14646) 2021-10-19 17:12:29 +05:30
initializers DEV: Deprecate message bus site settings (#14465) 2021-11-11 11:12:25 -06:00
locales FEATURE: Replace Crawl-delay directive with proper rate limiting (#15131) 2021-11-30 12:55:25 +03:00
application.rb FIX: Check env for multisite config path even if config file exists (#14536) 2021-10-06 13:24:50 -05:00
boot.rb DEV: Remove deprecated bootsnap options (#11929) 2021-02-02 14:39:51 +01:00
cdn.yml.sample Initial release of Discourse 2013-02-05 14:16:51 -05:00
database.yml remove some hardcoded 'localhost's from dev environment (#14801) 2021-11-03 11:26:44 +08:00
deploy.rb.sample enough with the malloc limit, not needed 2016-05-25 21:09:07 +10:00
dev_defaults.yml FEATURE: Add post edits count to user activity (#13495) 2021-08-02 10:15:53 -04:00
discourse_defaults.conf FEATURE: Apply rate limits per user instead of IP for trusted users (#14706) 2021-11-17 23:27:30 +03:00
discourse.config.sample enough with the malloc limit, not needed 2016-05-25 21:09:07 +10:00
discourse.pill.sample Improve bluepill sample config. 2014-01-31 16:09:35 -05:00
environment.rb DEV: replace mailcatcher references with mailhog (#14500) 2021-10-05 15:48:06 +05:30
logrotate.conf Replace Clockwork with Sidetiq 2013-08-14 21:39:40 +02:00
multisite.yml.production-sample DEV: Remove db_id from sample multisite config. 2020-05-29 10:48:29 +08:00
nginx.global.conf Address @Supermathie's concerns in PR1430 2013-09-30 16:28:22 -04:00
nginx.sample.conf FEATURE: Optimize images before upload (#13432) 2021-06-23 12:31:12 -03:00
projections.json DEV: Use .hbr for raw template file extension (#8883) 2020-02-11 13:38:12 -06:00
puma.rb remove daemonize setting (#12232) 2021-03-01 16:42:50 +11:00
routes.rb FEATURE: Display pending posts on user’s page 2021-11-29 10:26:33 +01:00
sidekiq.yml FEATURE: introduce ultra_low priority queue 2019-01-17 14:53:19 +11:00
site_settings.yml DEV: Switch to using uppy uploads in composer by default (#15058) 2021-11-30 08:33:06 +10:00
spring.rb DEV: enable frozen string literal on all files 2019-05-13 09:31:32 +08:00
thin.yml.sample Add sample Capistrano deployment files 2013-05-02 19:53:37 -07:00
unicorn_launcher FIX: Increase timeout when trying to reload unicorn. 2018-12-04 13:43:14 +08:00
unicorn_upstart.conf enough with the malloc limit, not needed 2016-05-25 21:09:07 +10:00
unicorn.conf.rb Revert "DEV: suppress assets logs from qunit tests (#13871)" 2021-07-29 13:28:24 +08:00