discourse/app/jobs/regular
Guo Xiang Tan cfd507822f
PERF: Improve quality of PostSearchData#raw_data. (#7275)
This commit fixes the following quality issues with `PostSearchData#raw_data`:

1. URLs are being tokenized, and links whose href and text are similar
are being duplicated in the raw data.

`Post#cooked`:

```
<p><a href=\"https://meta.discourse.org/some.png\" class=\"onebox\" target=\"_blank\" rel=\"nofollow noopener\">https://meta.discourse.org/some.png</a></p>
```

`PostSearchData#raw_data` Before:

```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png discourse org/some png https://meta.discourse.org/some.png discourse org/some png
```

`PostSearchData#raw_data` After:

```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png meta discourse org
```
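
As a rough illustration of the "After" output above, here is a minimal sketch that keeps each URL once and adds only its host tokens. It uses only Ruby's standard `uri` library; the helper names `index_tokens_for_url` and `searchable_link_text` are hypothetical and are not taken from the Discourse codebase, which may implement this differently.

```ruby
require "uri"

# Hypothetical helper: return the URL itself plus the dot-separated parts of
# its host, rather than tokenizing the path and file extension as well.
def index_tokens_for_url(url)
  host = URI.parse(url).host
  return [url] if host.nil?

  [url, *host.split(".")]
end

# Hypothetical helper: de-duplicate URLs first so that an anchor whose href
# and link text are identical contributes only one set of tokens.
def searchable_link_text(urls)
  urls.uniq.flat_map { |url| index_tokens_for_url(url) }.join(" ")
end

puts searchable_link_text([
  "https://meta.discourse.org/some.png", # href
  "https://meta.discourse.org/some.png"  # link text
])
# prints: https://meta.discourse.org/some.png meta discourse org
```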

2. Lightbox content being included in search pollutes the
`PostSearchData#raw_data` unnecessarily.

From 28 March 2018 to 28 March 2019, searches for the term `image` on
`meta.discourse.org` had a click-through rate of 2.1%. Non-lightboxed images are not included in indexing for search, yet we were indexing content within a lightbox. Also, searches for terms like `image` were affected because we were using `Pasted image` as the filename for
uploads that were pasted.

`Post#cooked`

```
<p>Let me see how I can fix this image<br>\n<div class=\"lightbox-wrapper\"><a class=\"lightbox\" href=\"https://meta.discourse.org/some.png\" title=\"some.png\" rel=\"nofollow noopener\"><img src=\"https://meta.discourse.org/some.png\" width=\"275\" height=\"299\"><div class=\"meta\">\n<svg class=\"fa d-icon d-icon-far-image svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#far-image\"></use></svg><span class=\"filename\">some.png</span><span class=\"informations\">1750×2000</span><svg class=\"fa d-icon d-icon-discourse-expand svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#discourse-expand\"></use></svg>\n</div></a></div></p>
```

`PostSearchData#raw_data` Before:

```
This is a test topic 0 Uncategorized Let me see how I can fix this image some.png png https://meta.discourse.org/some.png discourse org/some png some.png png 1750×2000
```

`PostSearchData#raw_data` After:

```
This is a test topic 0 Uncategorized Let me see how I can fix this image
```
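
The effect of dropping lightbox content can be sketched as follows. This is a simplified stand-in, not the commit's implementation: the commit parses the HTML with nokogiri, whereas this sketch uses only Ruby's standard `strscan` library and assumes well-formed markup in which the wrapper is written exactly as `<div class="lightbox-wrapper">`. The names `strip_lightboxes` and `searchable_text` are hypothetical.

```ruby
require "strscan"

LIGHTBOX_OPEN = /<div class="lightbox-wrapper">/

# Hypothetical helper: remove every lightbox wrapper <div>, including any
# nested <div>s, by counting opening and closing div tags.
def strip_lightboxes(html)
  scanner = StringScanner.new(html)
  output = +""

  until scanner.eos?
    chunk = scanner.scan_until(LIGHTBOX_OPEN)
    if chunk.nil?
      # No further lightbox: keep the remainder untouched.
      output << scanner.rest
      break
    end

    # Keep everything before the wrapper's opening tag.
    output << chunk.delete_suffix(scanner.matched)

    # Skip forward until the wrapper's matching </div> is consumed.
    depth = 1
    while depth > 0 && scanner.scan_until(/<div\b|<\/div>/)
      depth += scanner.matched == "</div>" ? -1 : 1
    end
  end

  output
end

# Hypothetical helper: strip the remaining tags and collapse whitespace.
def searchable_text(html)
  strip_lightboxes(html).gsub(/<[^>]*>/, " ").split.join(" ")
end

cooked = '<p>Let me see how I can fix this image<br>' \
  '<div class="lightbox-wrapper"><a class="lightbox" href="https://meta.discourse.org/some.png">' \
  '<img src="https://meta.discourse.org/some.png"><div class="meta">' \
  '<span class="filename">some.png</span></div></a></div></p>'

puts searchable_text(cooked)
# prints: Let me see how I can fix this image
```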

In terms of indexing performance, we now have to parse the given HTML
through nokogiri twice. However, performance is not a huge worry here, since a string of length 194170 takes only 30ms
to scrub and the indexing takes place in a background job.
2019-04-01 10:14:29 +08:00
admin_confirmation_email.rb SECURITY: Confirm new administrator accounts via email 2017-04-04 15:59:01 -04:00
anonymize_user.rb FIX: Remove user fields when anonymizing user 2018-09-07 00:02:56 +02:00
automatic_group_membership.rb FIX: automatic group membership when using SSO 2018-05-15 01:48:30 +02:00
backup_chunks_merger.rb FEATURE: Support backup uploads/downloads directly to/from S3. 2018-10-15 09:43:31 +08:00
bulk_grant_trust_level.rb FIX: grant trust level when bulk adding users to group 2017-03-06 14:39:53 +05:30
bulk_invite.rb Minor fixes to Jobs::BulkInvite. 2018-08-30 15:35:16 +08:00
bump_topic.rb FEATURE: Topic timer for bumping a topic in the future 2019-01-04 13:08:04 +00:00
confirm_sns_subscription.rb FIX: SES webhook wasn't parsing the message 2019-03-19 11:40:19 +01:00
crawl_topic_link.rb Rename FileHelper.is_image? -> FileHelper.is_supported_image?. 2018-09-12 09:22:28 +08:00
create_avatar_thumbnails.rb FIX: automatically correct bad avatars on access 2018-08-16 16:32:56 +10:00
create_backup.rb FIX: stop forking regular backup jobs 2017-12-21 09:00:48 +11:00
critical_user_email.rb PERF: Quit out of the email job quickly if disabled (#6423) 2018-10-01 01:15:45 +08:00
delete_topic.rb Add rubocop to our build. (#5004) 2017-07-28 10:20:09 +09:00
download_avatar_from_url.rb FIX: SSO avatar downloads were broken 2017-10-12 12:12:04 +05:30
download_backup_email.rb SECURITY: Don't pass email backup token to sidekiq as a parameter. 2017-12-18 11:25:22 +08:00
download_profile_background_from_url.rb FEATURE: add profile_background fields into SSO (#5701) 2018-05-07 10:03:26 +02:00
emit_web_hook_event.rb DEV: Update webhook event attributes even when an error raised 2019-03-21 20:45:21 +05:30
enable_bootstrap_mode.rb FIX: bootstrap mode should not amend setting that is not in default state 2016-05-04 16:46:46 +05:30
export_csv_file.rb Aadd 'secondary_emails' field in users export 2019-02-27 10:12:20 +01:00
feature_topic_users.rb correct bad error reporting. 2015-08-14 13:29:39 +10:00
invite_email.rb FIX: Don't try to send invite email when invite was deleted 2018-08-29 12:43:12 +02:00
invite_password_instructions_email.rb FEATURE: send set password instructions after invite redemption 2014-10-11 14:13:05 +05:30
notify_category_change.rb FIX: ensure PostAlerter is always run in sidekiq 2018-05-24 17:27:43 +02:00
notify_mailing_list_subscribers.rb FEATURE: Ignored user notification behaviour should be as a muted user (#7227) 2019-03-21 12:15:34 +01:00
notify_moved_posts.rb Add rubocop to our build. (#5004) 2017-07-28 10:20:09 +09:00
notify_reviewable.rb FEATURE: New 'Reviewable' model to make reviewable items generic 2019-03-28 12:45:10 -04:00
notify_tag_change.rb FIX: Notify on tag change. (#7119) 2019-03-12 18:09:34 +01:00
post_alert.rb FIX: Do not trigger post alerts for empty posts. (#7138) 2019-03-15 17:58:43 +01:00
process_email.rb FIX: Processing incoming email should be done in a background job. 2017-04-24 13:57:28 +08:00
process_post.rb PERF: Improve quality of PostSearchData#raw_data. (#7275) 2019-04-01 10:14:29 +08:00
process_sns_notification.rb FIX: SES webhook wasn't parsing the message 2019-03-19 11:40:19 +01:00
publish_topic_to_category.rb FEATURE: Shared Drafts 2018-03-20 17:15:26 -04:00
pull_hotlinked_images.rb FIX: Re-download hotlinked optimized images (#7249) 2019-03-27 21:31:12 +01:00
push_notification.rb Add rubocop to our build. (#5004) 2017-07-28 10:20:09 +09:00
rebake_custom_emoji_posts.rb FIX: properly escape name of custom emoji 2018-10-11 09:35:23 +02:00
retrieve_topic.rb Add rubocop to our build. (#5004) 2017-07-28 10:20:09 +09:00
run_heartbeat.rb FEATURE: prioritize sidekiq jobs 2016-04-07 12:56:43 +10:00
send_push_notification.rb FIX: handle missing users when sending push notifications 2018-05-17 12:53:19 +05:30
send_system_message.rb FIX: no notification was being sent when a post is hidden by community flags 2017-09-12 15:43:44 -04:00
suspicious_login.rb FEATURE: Suspicious logins report. (#6544) 2018-10-30 22:51:58 +00:00
toggle_topic_closed.rb FEATURE: New 'Reviewable' model to make reviewable items generic 2019-03-28 12:45:10 -04:00
topic_action_converter.rb FEATURE: New 'Reviewable' model to make reviewable items generic 2019-03-28 12:45:10 -04:00
topic_reminder.rb FEATURE: staff can set a timer to remind them about a topic 2017-05-16 14:49:50 -04:00
truncate_user_flag_stats.rb FEATURE: New 'Reviewable' model to make reviewable items generic 2019-03-28 12:45:10 -04:00
unpin_topic.rb FEATURE: make pin expiration mandatory 2015-07-29 16:34:21 +02:00
update_gravatar.rb FIX: don't update gravatar if the user has no email 2019-02-20 22:34:43 +01:00
update_group_mentions.rb FEATURE: Remap group mentions when group name has been changed. 2017-01-18 13:39:34 +08:00
update_s3_inventory.rb FEATURE: Use amazon s3 inventory to manage upload stats (#6867) 2019-02-01 10:10:48 +05:30
update_top_redirection.rb FIX: redirect to top wasn't working 2017-10-04 22:08:41 +02:00
update_username.rb PERF: Split loading of posts to speed up user renames 2018-07-24 11:57:04 +02:00
user_email.rb FIX: prevent sending multiple summary emails due to Sidekiq delays 2019-03-22 12:34:34 -04:00