mirror of
https://github.com/discourse/discourse.git
synced 2025-02-25 18:55:32 -06:00
PERF: Improve quality of PostSearchData#raw_data. (#7275)
This commit fixes the following quality issues with `PostSearchData#raw_data`:

1. URLs are being tokenized, and links whose href and text are similar are being duplicated in the raw data.

`Post#cooked`:

```
<p><a href="https://meta.discourse.org/some.png" class="onebox" target="_blank" rel="nofollow noopener">https://meta.discourse.org/some.png</a></p>
```

`PostSearchData#raw_data` Before:

```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png discourse org/some png https://meta.discourse.org/some.png discourse org/some png
```

`PostSearchData#raw_data` After:

```
This is a test topic 0 Uncategorized https://meta.discourse.org/some.png meta discourse org
```

2. Lightbox content being included in search pollutes `PostSearchData#raw_data` unnecessarily. From 28 March 2018 to 28 March 2019, searches for the term `image` on `meta.discourse.org` had a click-through rate of 2.1%. Non-lightboxed images are not included in indexing for search, yet we were indexing content within a lightbox. Also, search for terms like `image` was affected because we were using `Pasted image` as the filename for uploads that were pasted.

`Post#cooked`:

```
<p>Let me see how I can fix this image<br>
<div class="lightbox-wrapper"><a class="lightbox" href="https://meta.discourse.org/some.png" title="some.png" rel="nofollow noopener"><img src="https://meta.discourse.org/some.png" width="275" height="299"><div class="meta">
<svg class="fa d-icon d-icon-far-image svg-icon" aria-hidden="true"><use xlink:href="#far-image"></use></svg><span class="filename">some.png</span><span class="informations">1750×2000</span><svg class="fa d-icon d-icon-discourse-expand svg-icon" aria-hidden="true"><use xlink:href="#discourse-expand"></use></svg>
</div></a></div></p>
```

`PostSearchData#raw_data` Before:

```
This is a test topic 0 Uncategorized Let me see how I can fix this image some.png png https://meta.discourse.org/some.png discourse org/some png some.png png 1750×2000
```

`PostSearchData#raw_data` After:

```
This is a test topic 0 Uncategorized Let me see how I can fix this image
```

In terms of indexing performance, we now have to parse the given HTML through nokogiri twice. However, performance is not a huge worry here: a string of length 194170 takes only 30ms to scrub, and the indexing takes place in a background job.
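
The first fix can be illustrated with a minimal Ruby sketch. This is not the code from this commit: `index_text_for_url` is a hypothetical helper, and splitting the host name on dots is an assumption chosen to reproduce the "After" output for the onebox example. The idea is to emit the URL once, followed by the parts of its host, rather than letting a tokenizer break the whole URL into overlapping fragments.

```ruby
require "uri"

# Hypothetical helper (an illustration, not Discourse's implementation):
# return the URL itself followed by the dot-separated parts of its host.
def index_text_for_url(url)
  host = URI.parse(url).host
  # Strings without a host (plain words, relative paths) are returned as-is.
  return url if host.nil?

  "#{url} #{host.split('.').join(' ')}"
rescue URI::InvalidURIError
  # Leave anything that does not parse as a URI untouched.
  url
end

puts index_text_for_url("https://meta.discourse.org/some.png")
```

With the example link, the URL appears once in the indexed text and the host contributes the separate terms `meta`, `discourse` and `org`, so a search for any of them can still match.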
@@ -61,7 +61,7 @@ describe SearchIndexer do

     scrubbed = scrub(html)

-    expect(scrubbed).to eq("Discourse 51%20PM Untitled design (21).jpg Untitled%20design%20(21) Untitled design (21).jpg 1280x1136 472 KB")
+    expect(scrubbed).to eq("Discourse 51%20PM")
   end

   it 'correctly indexes a post according to version' do
@@ -110,5 +110,41 @@ describe SearchIndexer do
         post.save!(validate: false)
       end.to_not change { PostSearchData.count }
     end
+
+    it "should not tokenize urls and duplicate title and href in <a>" do
+      post = Fabricate(:post, raw: <<~RAW)
+        https://meta.discourse.org/some.png
+      RAW
+
+      post.rebake!
+      post.reload
+      topic = post.topic
+
+      expect(post.post_search_data.raw_data).to eq(
+        "#{topic.title} #{topic.category.name} https://meta.discourse.org/some.png meta discourse org"
+      )
+    end
+
+    it 'should not include lightbox in search' do
+      Jobs.run_immediately!
+      SiteSetting.max_image_height = 2000
+      SiteSetting.crawl_images = true
+      FastImage.expects(:size).returns([1750, 2000])
+
+      src = "https://meta.discourse.org/some.png"
+
+      post = Fabricate(:post, raw: <<~RAW)
+        Let me see how I can fix this image
+        <img src="#{src}" width="275" height="299">
+      RAW
+
+      post.rebake!
+      post.reload
+      topic = post.topic
+
+      expect(post.post_search_data.raw_data).to eq(
+        "#{topic.title} #{topic.category.name} Let me see how I can fix this image"
+      )
+    end
   end
 end