discourse/app/services
Guo Xiang Tan cfd507822f
PERF: Improve quality of PostSearchData#raw_data. (#7275)
This commit fixes the following quality issues with `PostSearchData#raw_data`:

1. URLs are being tokenized, and links whose href and text content are
the same are being duplicated in the raw data.

`Post#cooked`:

```
<p><a href=\"https://meta.discourse.org/some.png\" class=\"onebox\" target=\"_blank\" rel=\"nofollow noopener\">https://meta.discourse.org/some.png</a></p>
```

`PostSearchData#raw_data` Before:

```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png discourse org/some png https://meta.discourse.org/some.png discourse org/some png
```

`PostSearchData#raw_data` After:

```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png meta discourse org
```

2. Lightbox content being included in search pollutes the
`PostSearchData#raw_data` unnecessarily.

From 28 March 2018 to 28 March 2019, searches for the term `image` on
`meta.discourse.org` had a click through rate of 2.1%. Non-lightboxed
images are not included in indexing for search, yet we were indexing
content within a lightbox. Also, searches for terms like `image` were
affected because we were using `Pasted image` as the filename for
uploads that were pasted.

`Post#cooked`

```
<p>Let me see how I can fix this image<br>\n<div class=\"lightbox-wrapper\"><a class=\"lightbox\" href=\"https://meta.discourse.org/some.png\" title=\"some.png\" rel=\"nofollow noopener\"><img src=\"https://meta.discourse.org/some.png\" width=\"275\" height=\"299\"><div class=\"meta\">\n<svg class=\"fa d-icon d-icon-far-image svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#far-image\"></use></svg><span class=\"filename\">some.png</span><span class=\"informations\">1750×2000</span><svg class=\"fa d-icon d-icon-discourse-expand svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#discourse-expand\"></use></svg>\n</div></a></div></p>
```

`PostSearchData#raw_data` Before:

```
This is a test topic 0 Uncategorized Let me see how I can fix this image some.png png https://meta.discourse.org/some.png discourse org/some png some.png png 1750×2000
```

`PostSearchData#raw_data` After:

```
This is a test topic 0 Uncategorized Let me see how I can fix this image
```

In terms of indexing performance, we now have to parse the given HTML
through nokogiri twice. However, performance is not a huge worry here,
since scrubbing a string of length 194170 takes only 30ms and the
indexing takes place in a background job.
2019-04-01 10:14:29 +08:00
..
spam_rule FEATURE: New 'Reviewable' model to make reviewable items generic 2019-03-28 12:45:10 -04:00
anonymous_shadow_creator.rb FEATURE: add more granular user option levels for email notifications (#7143) 2019-03-15 10:55:11 -04:00
badge_granter.rb FIX: prevent error when badge has already been awarded 2019-01-04 15:17:54 +01:00
color_scheme_revisor.rb Fix all the errors to get our tests green on Rails 5.1. 2017-09-25 13:48:58 +08:00
destroy_task.rb FIX: Allow rake destroy:topics to delete topics in sub-categories 2018-09-10 12:52:14 +01:00
group_action_logger.rb Make rubocop happy again. 2018-06-07 13:28:18 +08:00
group_mentions_updater.rb FIX: Skip validations when updating group mentions. 2017-04-04 14:13:18 +08:00
group_message.rb Add rubocop to our build. (#5004) 2017-07-28 10:20:09 +09:00
handle_chunk_upload.rb FEATURE: Support backup uploads/downloads directly to/from S3. 2018-10-15 09:43:31 +08:00
notification_emailer.rb FEATURE: add more granular user option levels for email notifications (#7143) 2019-03-15 10:55:11 -04:00
post_action_notifier.rb FIX: Liked notification consolidation has to account for user like frequency setting. 2019-01-17 14:33:23 +08:00
post_alerter.rb FEATURE: New 'Reviewable' model to make reviewable items generic 2019-03-28 12:45:10 -04:00
post_owner_changer.rb FEATURE: New 'Reviewable' model to make reviewable items generic 2019-03-28 12:45:10 -04:00
push_notification_pusher.rb FIX: Properly support defaults for upload site settings. 2019-03-13 16:36:57 +08:00
random_topic_selector.rb improve erraticly failing spec 2018-05-23 08:39:15 +10:00
search_indexer.rb PERF: Improve quality of PostSearchData#raw_data. (#7275) 2019-04-01 10:14:29 +08:00
site_settings_task.rb FEATURE: only export settings that changed via rake task 2018-10-08 11:54:52 +11:00
staff_action_logger.rb FIX: reset embedding settings when no embeddable host, log host changes (#7264) 2019-03-29 17:05:51 +01:00
topic_status_updater.rb Add a DiscourseEvent for when a topic is closed 2017-09-27 14:00:53 -04:00
topic_timestamp_changer.rb FIX: TopicTimestampChanger should not allow timestamps in the future. 2017-05-22 16:03:49 +08:00
tracked_topics_updater.rb Add rubocop to our build. (#5004) 2017-07-28 10:20:09 +09:00
trust_level_granter.rb REFACTOR: Track manual locked user levels separately from groups 2017-11-27 11:23:44 -05:00
user_action_manager.rb FEATURE: New 'Reviewable' model to make reviewable items generic 2019-03-28 12:45:10 -04:00
user_activator.rb FEATURE: forgot_password_strict setting also prevents reporting that an email address is taken during signup 2017-10-03 15:28:30 -04:00
user_anonymizer.rb FEATURE: add more granular user option levels for email notifications (#7143) 2019-03-15 10:55:11 -04:00
user_authenticator.rb FIX: apply automatic group rules when using social login providers 2018-05-23 02:26:07 +03:00
user_destroyer.rb FEATURE: New 'Reviewable' model to make reviewable items generic 2019-03-28 12:45:10 -04:00
user_merger.rb FEATURE: New 'Reviewable' model to make reviewable items generic 2019-03-28 12:45:10 -04:00
user_silencer.rb Update app/services/user_silencer.rb 2019-02-08 08:50:50 -05:00
user_updater.rb REFACTOR: Move redundant ignored user check into guardian (#7219) 2019-03-20 19:55:46 +00:00
username_changer.rb Update username only after successful user anonymization 2018-06-08 15:50:07 +02:00
username_checker_service.rb FIX: Check for group name availability should skip reserved usernames. 2018-08-01 11:09:33 +08:00
wildcard_domain_checker.rb FEATURE: allow multiple secrets for Discourse SSO provider 2018-10-15 16:03:53 +11:00
wildcard_url_checker.rb FEATURE: Allow wildcard in allowed_user_api_auth_redirects setting (#6779) 2019-02-26 17:03:20 +01:00
word_watcher.rb FEATURE: allow blocking emojis (#7011) 2019-02-15 20:55:48 +05:30