discourse/app/controllers/robots_txt_controller.rb

# frozen_string_literal: true

class RobotsTxtController < ApplicationController
  layout false
  skip_before_action :preload_json, :check_xhr, :redirect_to_login_if_required

  OVERRIDDEN_HEADER = "# This robots.txt file has been customized at /admin/customize/robots\n"

  # NOTE: order is important!
  DISALLOWED_PATHS ||= %w{
    /auth/
    /assets/browser-update*.js
    /email/
    /session
    /session/
    /admin
    /admin/
    /user-api-key
    /user-api-key/
    /*?api_key*
    /*?*api_key*
  }

  def index
    # Work on a copy so that prepend below does not mutate the setting value itself.
    if (overridden = SiteSetting.overridden_robots_txt.dup).present?
      overridden.prepend(OVERRIDDEN_HEADER) if guardian.is_admin? && !is_api?
      render plain: overridden
      return
    end
    if SiteSetting.allow_index_in_robots_txt?
      @robots_info = self.class.fetch_default_robots_info
      render :index, content_type: 'text/plain'
    else
      render :no_index, content_type: 'text/plain'
    end
  end

  # If you are hosting Discourse in a subfolder, you will need to create your robots.txt
  # in the root of your web server with the appropriate paths. This method will return
  # JSON that can be used by a script to create a robots.txt that works well with your
  # existing site.
  def builder
    result = self.class.fetch_default_robots_info
    overridden = SiteSetting.overridden_robots_txt
    result[:overridden] = overridden if overridden.present?
    render json: result
  end

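  # Builds the default rules as a hash: a :header comment line plus an :agents
  # list whose entries have :name, :disallow and, for slowed crawlers, :delay.
  def self.fetch_default_robots_info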
    deny_paths = DISALLOWED_PATHS.map { |p| Discourse.base_uri + p }
    deny_all = [ "#{Discourse.base_uri}/" ]

    result = {
      header: "# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file",
      agents: []
    }

    if SiteSetting.whitelisted_crawler_user_agents.present?
      SiteSetting.whitelisted_crawler_user_agents.split('|').each do |agent|
        result[:agents] << { name: agent, disallow: deny_paths }
      end

      result[:agents] << { name: '*', disallow: deny_all }
    elsif SiteSetting.blacklisted_crawler_user_agents.present?
      result[:agents] << { name: '*', disallow: deny_paths }
      SiteSetting.blacklisted_crawler_user_agents.split('|').each do |agent|
        result[:agents] << { name: agent, disallow: deny_all }
      end
    else
      result[:agents] << { name: '*', disallow: deny_paths }
    end

    if SiteSetting.slow_down_crawler_user_agents.present?
      SiteSetting.slow_down_crawler_user_agents.split('|').each do |agent|
        result[:agents] << {
          name: agent,
          delay: SiteSetting.slow_down_crawler_rate,
          disallow: deny_paths
        }
      end
    end

    result
  end
end
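
The comment on `builder` says its JSON can be used by a script to create a robots.txt for a parent site. A minimal sketch of such a script follows. It is not part of Discourse: `render_robots_txt`, the `/forum` subfolder, the sample agents, and the use of a `Crawl-delay` line for the `delay` value are all assumptions made for illustration. Only the key names (`header`, `agents`, `name`, `disallow`, `delay`, `overridden`) come from the controller above.

```ruby
require "json"

# Hypothetical helper (not part of Discourse): turn a payload shaped like
# the builder JSON into robots.txt text.
def render_robots_txt(payload)
  # Mirror the index action: an admin override replaces the generated rules.
  return payload["overridden"] if payload["overridden"]

  lines = [payload["header"], ""]
  payload["agents"].each do |agent|
    lines << "User-agent: #{agent["name"]}"
    # Assumption: the :delay value is emitted as a Crawl-delay directive.
    lines << "Crawl-delay: #{agent["delay"]}" if agent["delay"]
    agent["disallow"].each { |path| lines << "Disallow: #{path}" }
    lines << ""
  end
  lines.join("\n")
end

# Sample payload in the shape built by fetch_default_robots_info,
# assuming a hypothetical "/forum" subfolder install.
sample = JSON.parse(<<~JSON)
  {
    "header": "# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file",
    "agents": [
      { "name": "*", "disallow": ["/forum/auth/", "/forum/admin"] },
      { "name": "bingbot", "delay": 60, "disallow": ["/forum/auth/", "/forum/admin"] }
    ]
  }
JSON

puts render_robots_txt(sample)
```

In a real deployment the payload would be fetched from the site rather than embedded, and the rendered rules would be merged into the parent site's existing robots.txt.
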
DEV: enable frozen string literal on all files This reduces chances of errors where consumers of strings mutate inputs and reduces memory usage of the app. Test suite passes now, but there may be some stuff left, so we will run a few sites on a branch prior to merging 2019-05-02 17:17:27 -05:00			`# frozen_string_literal: true`

added support for disabling indexing by google using SiteSetting.allow_index_in_robots_txt = false 2013-02-10 18:02:57 -06:00			`class RobotsTxtController < ApplicationController`
			`layout false`
Fix all the errors to get our tests green on Rails 5.1. 2017-08-30 23:06:56 -05:00			`skip_before_action :preload_json, :check_xhr, :redirect_to_login_if_required`
added support for disabling indexing by google using SiteSetting.allow_index_in_robots_txt = false 2013-02-10 18:02:57 -06:00
FEATURE: Allow customization of robots.txt (#7884) * FEATURE: Allow customization of robots.txt This allows admins to customize/override the content of the robots.txt file at /admin/customize/robots. That page is not linked to anywhere in the UI -- admins have to manually type the URL to access that page. * use Ember.computed.not * Jeff feedback * Feedback * Remove unused import 2019-07-15 12:47:44 -05:00			`OVERRIDDEN_HEADER = "# This robots.txt file has been customized at /admin/customize/robots\n"`

prefix the robots.txt rules with the directory when using subfolder 2018-04-11 15:05:02 -05:00			`# NOTE: order is important!`
			`DISALLOWED_PATHS ||= %w{`
FIX: simplify so we ban all auth paths previously plugins that have auth paths were not disallowed and robots tend to call them 2018-08-16 04:16:47 -05:00			`/auth/`
prefix the robots.txt rules with the directory when using subfolder 2018-04-11 15:05:02 -05:00			`/assets/browser-update*.js`
			`/email/`
			`/session`
			`/session/`
			`/admin`
			`/admin/`
			`/user-api-key`
			`/user-api-key/`
			`/*?api_key*`
			`/*?*api_key*`
			`}`

added support for disabling indexing by google using SiteSetting.allow_index_in_robots_txt = false 2013-02-10 18:02:57 -06:00			`def index`
FEATURE: Allow customization of robots.txt (#7884) * FEATURE: Allow customization of robots.txt This allows admins to customize/override the content of the robots.txt file at /admin/customize/robots. That page is not linked to anywhere in the UI -- admins have to manually type the URL to access that page. * use Ember.computed.not * Jeff feedback * Feedback * Remove unused import 2019-07-15 12:47:44 -05:00			`if (overridden = SiteSetting.overridden_robots_txt.dup).present?`
			`overridden.prepend(OVERRIDDEN_HEADER) if guardian.is_admin? && !is_api?`
			`render plain: overridden`
			`return`
			`end`
FEATURE: An API to help sites build robots.txt files programatically This is mainly useful for subfolder sites, who need to expose their robots.txt contents to a parent site. 2018-04-16 14:43:20 -05:00			`if SiteSetting.allow_index_in_robots_txt?`
FEATURE: Allow customization of robots.txt (#7884) * FEATURE: Allow customization of robots.txt This allows admins to customize/override the content of the robots.txt file at /admin/customize/robots. That page is not linked to anywhere in the UI -- admins have to manually type the URL to access that page. * use Ember.computed.not * Jeff feedback * Feedback * Remove unused import 2019-07-15 12:47:44 -05:00			`@robots_info = self.class.fetch_default_robots_info`
FEATURE: An API to help sites build robots.txt files programatically This is mainly useful for subfolder sites, who need to expose their robots.txt contents to a parent site. 2018-04-16 14:43:20 -05:00			`render :index, content_type: 'text/plain'`
			`else`
			`render :no_index, content_type: 'text/plain'`
			`end`
			`end`

			`# If you are hosting Discourse in a subfolder, you will need to create your robots.txt`
			`# in the root of your web server with the appropriate paths. This method will return`
			`# JSON that can be used by a script to create a robots.txt that works well with your`
			`# existing site.`
			`def builder`
FEATURE: Allow customization of robots.txt (#7884) * FEATURE: Allow customization of robots.txt This allows admins to customize/override the content of the robots.txt file at /admin/customize/robots. That page is not linked to anywhere in the UI -- admins have to manually type the URL to access that page. * use Ember.computed.not * Jeff feedback * Feedback * Remove unused import 2019-07-15 12:47:44 -05:00			`result = self.class.fetch_default_robots_info`
			`overridden = SiteSetting.overridden_robots_txt`
			`result[:overridden] = overridden if overridden.present?`
			`render json: result`
FEATURE: An API to help sites build robots.txt files programatically This is mainly useful for subfolder sites, who need to expose their robots.txt contents to a parent site. 2018-04-16 14:43:20 -05:00			`end`

FEATURE: Allow customization of robots.txt (#7884) * FEATURE: Allow customization of robots.txt This allows admins to customize/override the content of the robots.txt file at /admin/customize/robots. That page is not linked to anywhere in the UI -- admins have to manually type the URL to access that page. * use Ember.computed.not * Jeff feedback * Feedback * Remove unused import 2019-07-15 12:47:44 -05:00			`def self.fetch_default_robots_info`
FEATURE: An API to help sites build robots.txt files programatically This is mainly useful for subfolder sites, who need to expose their robots.txt contents to a parent site. 2018-04-16 14:43:20 -05:00			`deny_paths = DISALLOWED_PATHS.map { |p| Discourse.base_uri + p }`
			`deny_all = [ "#{Discourse.base_uri}/" ]`

			`result = {`
			`header: "# See http://www.robotstxt.org/robotstxt.html for documentation on how to use the robots.txt file",`
			`agents: []`
			`}`

			`if SiteSetting.whitelisted_crawler_user_agents.present?`
			`SiteSetting.whitelisted_crawler_user_agents.split('|').each do |agent|`
			`result[:agents] << { name: agent, disallow: deny_paths }`
			`end`

			`result[:agents] << { name: '*', disallow: deny_all }`
			`elsif SiteSetting.blacklisted_crawler_user_agents.present?`
			`result[:agents] << { name: '*', disallow: deny_paths }`
			`SiteSetting.blacklisted_crawler_user_agents.split('|').each do |agent|`
			`result[:agents] << { name: agent, disallow: deny_all }`
FEATURE: control which web crawlers can access using a whitelist or blacklist 2018-03-15 16:10:45 -05:00			`end`
			`else`
FEATURE: An API to help sites build robots.txt files programatically This is mainly useful for subfolder sites, who need to expose their robots.txt contents to a parent site. 2018-04-16 14:43:20 -05:00			`result[:agents] << { name: '*', disallow: deny_paths }`
FEATURE: control which web crawlers can access using a whitelist or blacklist 2018-03-15 16:10:45 -05:00			`end`

FEATURE: An API to help sites build robots.txt files programatically This is mainly useful for subfolder sites, who need to expose their robots.txt contents to a parent site. 2018-04-16 14:43:20 -05:00			`if SiteSetting.slow_down_crawler_user_agents.present?`
			`SiteSetting.slow_down_crawler_user_agents.split('|').each do |agent|`
			`result[:agents] << {`
			`name: agent,`
			`delay: SiteSetting.slow_down_crawler_rate,`
			`disallow: deny_paths`
			`}`
			`end`
			`end`

			`result`
added support for disabling indexing by google using SiteSetting.allow_index_in_robots_txt = false 2013-02-10 18:02:57 -06:00			`end`
			`end`
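
The `if`/`elsif` in `fetch_default_robots_info` means a non-empty whitelist takes precedence over the blacklist, and slow-down entries are appended independently of both. The sketch below restates that branching as a Rails-free function so the behaviour can be tried in isolation. `build_agents`, its keyword arguments, and the sample agent names are hypothetical; in the controller these values come from `SiteSetting`.

```ruby
# Hypothetical, Rails-free restatement of the agent selection logic.
# The pipe-separated strings stand in for the SiteSetting values.
def build_agents(deny_paths:, deny_all:, whitelist: "", blacklist: "", slow_down: "", rate: 60)
  agents = []

  if !whitelist.empty?
    # Whitelisted crawlers get the normal rules; everyone else is shut out.
    whitelist.split("|").each { |name| agents << { name: name, disallow: deny_paths } }
    agents << { name: "*", disallow: deny_all }
  elsif !blacklist.empty?
    # Everyone gets the normal rules; blacklisted crawlers are shut out.
    agents << { name: "*", disallow: deny_paths }
    blacklist.split("|").each { |name| agents << { name: name, disallow: deny_all } }
  else
    agents << { name: "*", disallow: deny_paths }
  end

  # Slow-down entries are added regardless of which branch ran above.
  slow_down.split("|").each do |name|
    agents << { name: name, delay: rate, disallow: deny_paths }
  end

  agents
end

# Both lists set: the blacklist is ignored because the whitelist is present.
agents = build_agents(
  deny_paths: ["/admin"],
  deny_all: ["/"],
  whitelist: "Googlebot|bingbot",
  blacklist: "badbot"
)
agents.each { |a| puts "#{a[:name]}: #{a[:disallow].join(", ")}" }
```

Because slow-down entries are appended separately, a crawler named in both the whitelist and the slow-down list ends up with two entries in the result.
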