mirror of https://github.com/discourse/discourse.git
synced 2024-12-02 13:39:36 -06:00
cfd507822f
This commit fixes the following quality issues with `PostSearchData#raw_data`:

1. URLs are being tokenized, and links whose href and link text are similar are being duplicated in the raw data.

`Post#cooked`:

```
<p><a href=\"https://meta.discourse.org/some.png\" class=\"onebox\" target=\"_blank\" rel=\"nofollow noopener\">https://meta.discourse.org/some.png</a></p>
```

`PostSearchData#raw_data` Before:

```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png discourse org/some png https://meta.discourse.org/some.png discourse org/some png
```

`PostSearchData#raw_data` After:

```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png meta discourse org
```

2. Lightbox content being included in search pollutes the `PostSearchData#raw_data` unnecessarily. From 28 March 2018 to 28 March 2019, searches for the term `image` on `meta.discourse.org` had a click-through rate of 2.1%. Non-lightboxed images are not included in indexing for search, yet we were indexing content within a lightbox. Also, search for terms like `image` was affected because we were using `Pasted image` as the filename for uploads that were pasted.

`Post#cooked`:

```
<p>Let me see how I can fix this image<br>\n<div class=\"lightbox-wrapper\"><a class=\"lightbox\" href=\"https://meta.discourse.org/some.png\" title=\"some.png\" rel=\"nofollow noopener\"><img src=\"https://meta.discourse.org/some.png\" width=\"275\" height=\"299\"><div class=\"meta\">\n<svg class=\"fa d-icon d-icon-far-image svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#far-image\"></use></svg><span class=\"filename\">some.png</span><span class=\"informations\">1750×2000</span><svg class=\"fa d-icon d-icon-discourse-expand svg-icon\" aria-hidden=\"true\"><use xlink:href=\"#discourse-expand\"></use></svg>\n</div></a></div></p>
```

`PostSearchData#raw_data` Before:

```
This is a test topic 0 Uncategorized Let me see how I can fix this image some.png png https://meta.discourse.org/some.png discourse org/some png some.png png 1750×2000
```

`PostSearchData#raw_data` After:

```
This is a test topic 0 Uncategorized Let me see how I can fix this image
```

In terms of indexing performance, we now have to parse the given HTML through nokogiri twice. However, performance is not a huge worry here, since scrubbing a string of length 194170 takes only 30ms and the indexing takes place in a background job.

| ||
---|---|---|
admin_confirmation_email.rb | ||
anonymize_user.rb | ||
automatic_group_membership.rb | ||
backup_chunks_merger.rb | ||
bulk_grant_trust_level.rb | ||
bulk_invite.rb | ||
bump_topic.rb | ||
confirm_sns_subscription.rb | ||
crawl_topic_link.rb | ||
create_avatar_thumbnails.rb | ||
create_backup.rb | ||
critical_user_email.rb | ||
delete_topic.rb | ||
download_avatar_from_url.rb | ||
download_backup_email.rb | ||
download_profile_background_from_url.rb | ||
emit_web_hook_event.rb | ||
enable_bootstrap_mode.rb | ||
export_csv_file.rb | ||
feature_topic_users.rb | ||
invite_email.rb | ||
invite_password_instructions_email.rb | ||
notify_category_change.rb | ||
notify_mailing_list_subscribers.rb | ||
notify_moved_posts.rb | ||
notify_reviewable.rb | ||
notify_tag_change.rb | ||
post_alert.rb | ||
process_email.rb | ||
process_post.rb | ||
process_sns_notification.rb | ||
publish_topic_to_category.rb | ||
pull_hotlinked_images.rb | ||
push_notification.rb | ||
rebake_custom_emoji_posts.rb | ||
retrieve_topic.rb | ||
run_heartbeat.rb | ||
send_push_notification.rb | ||
send_system_message.rb | ||
suspicious_login.rb | ||
toggle_topic_closed.rb | ||
topic_action_converter.rb | ||
topic_reminder.rb | ||
truncate_user_flag_stats.rb | ||
unpin_topic.rb | ||
update_gravatar.rb | ||
update_group_mentions.rb | ||
update_s3_inventory.rb | ||
update_top_redirection.rb | ||
update_username.rb | ||
user_email.rb |
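
The URL handling described in the commit message above (keep the full URL once, then add only the host's parts as extra search terms) can be illustrated with a small sketch. This is not the code from the commit: `link_search_text` and `links_search_text` are hypothetical helper names, and the sketch assumes that splitting the host on dots is enough to reproduce the "After" output for the example link. It uses only Ruby's standard `uri` library.

```ruby
require "uri"

# Hypothetical helper (not Discourse's implementation): returns the text a
# single link would contribute to the search data, i.e. the full URL
# followed by the dot-separated parts of its host.
def link_search_text(url)
  host = URI.parse(url).host
  return url if host.nil? # relative or host-less URLs are kept as-is

  [url, *host.split(".")].join(" ")
rescue URI::InvalidURIError
  url # unparsable input is passed through unchanged
end

# Hypothetical helper: index each distinct URL only once, so a link whose
# href and text are the same URL does not appear twice.
def links_search_text(urls)
  urls.uniq.map { |url| link_search_text(url) }.join(" ")
end

# The onebox example has the same URL as both href and link text.
puts links_search_text([
  "https://meta.discourse.org/some.png",
  "https://meta.discourse.org/some.png"
])
# prints "https://meta.discourse.org/some.png meta discourse org"
```

Deduplicating before tokenizing keeps the example at a single copy of the URL, which matches the shape of the "After" `raw_data` shown in the commit message.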