# discourse/app/jobs/regular/crawl_topic_link.rb

require 'open-uri'
require 'nokogiri'
require 'excon'
require_dependency 'retrieve_title'
require_dependency 'topic_link'

module Jobs
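  # Finds a title for an external TopicLink that has not been crawled yet,
  # stores it, and stamps crawled_at so the link is not processed again.
  class CrawlTopicLink < Jobs::Base
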
    def execute(args)
      raise Discourse::InvalidParameters.new(:topic_link_id) unless args[:topic_link_id].present?

      topic_link = TopicLink.find_by(id: args[:topic_link_id], internal: false, crawled_at: nil)
      return if topic_link.blank?

      # Look for a topic embed for the URL. If it exists, use its title and don't crawl
      topic_embed = TopicEmbed.where(embed_url: topic_link.url).includes(:topic).references(:topic).first
      # topic could be deleted, so skip
      if topic_embed && topic_embed.topic
        # Truncate to 255 characters, matching the limit applied to crawled titles below
        TopicLink.where(id: topic_link.id).update_all(['title = ?, crawled_at = CURRENT_TIMESTAMP', topic_embed.topic.title[0..254]])
        return
      end

      begin
        crawled = false

        # Special case: Images
        # If the link is to an image, put the filename as the title
        if FileHelper.is_image?(topic_link.url)
          uri = URI(topic_link.url)
          filename = File.basename(uri.path)
          crawled = (TopicLink.where(id: topic_link.id).update_all(["title = ?, crawled_at = CURRENT_TIMESTAMP", filename]) == 1)
        end

        unless crawled
          # Fetch the beginning of the document to find the title
          title = RetrieveTitle.crawl(topic_link.url)
          if title.present?
            crawled = (TopicLink.where(id: topic_link.id).update_all(['title = ?, crawled_at = CURRENT_TIMESTAMP', title[0..254]]) == 1)
          end
        end
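      rescue StandardError
        # If there was a connection (or other recoverable) error, do nothing.
        # Rescuing StandardError rather than Exception lets signals and
        # interpreter-level errors propagate instead of being swallowed.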
      ensure
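        # Always stamp crawled_at, even on failure, so the link is not picked up again.
        # update_all is used so that after_commit does not execute again.
        TopicLink.where(id: topic_link.id).update_all('crawled_at = CURRENT_TIMESTAMP') unless crawled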
      end
    end

  end
end