FEATURE: Split up text segmentation for Chinese and Japanese.

* Chinese segmentation will continue to rely on cppjieba
* Japanese segmentation will use our port of TinySegmenter
* Korean currently does not rely on segmentation; it was dropped in c677877e4f
* SiteSetting.search_tokenize_chinese_japanese_korean has been split
into SiteSetting.search_tokenize_chinese and
SiteSetting.search_tokenize_japanese
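
The split described above can be sketched as two independent predicates, one per language, replacing the old combined CJK check. This is a minimal illustrative sketch: the `SiteSetting` stub below is hypothetical and stands in for Discourse's real site-setting machinery; only the setting and predicate names come from the commit.

```ruby
# Hypothetical stand-in for Discourse's SiteSetting; the real class is
# backed by the database and admin UI. Names mirror the commit message.
module SiteSetting
  class << self
    attr_accessor :search_tokenize_chinese, :search_tokenize_japanese
  end
  self.search_tokenize_chinese = false
  self.search_tokenize_japanese = true
end

module Search
  # The single segment_cjk? check is replaced by two per-language
  # predicates, each driven by its own site setting.
  def self.segment_chinese?
    !!SiteSetting.search_tokenize_chinese
  end

  def self.segment_japanese?
    !!SiteSetting.search_tokenize_japanese
  end
end

# A caller that previously tested one combined flag now checks both:
no_segmentation = !Search.segment_chinese? && !Search.segment_japanese?
puts no_segmentation
```

With Japanese tokenization enabled in this sketch, `no_segmentation` is false, so callers guarded by the combined condition still take the segmentation-aware path.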
Alan Guo Xiang Tan
2022-01-26 15:24:11 +08:00
parent 9ddd1f739e
commit 930f51e175
14 changed files with 406 additions and 72 deletions


@@ -87,7 +87,7 @@ class Search
blurb_length: @blurb_length
}
-        if post.post_search_data.version > SearchIndexer::MIN_POST_REINDEX_VERSION && !Search.segment_cjk?
+        if post.post_search_data.version > SearchIndexer::MIN_POST_REINDEX_VERSION && !Search.segment_chinese? && !Search.segment_japanese?
if SiteSetting.use_pg_headlines_for_excerpt
scrubbed_headline = post.headline.gsub(SCRUB_HEADLINE_REGEXP, '\1')
prefix_omission = scrubbed_headline.start_with?(post.leading_raw_data) ? '' : OMISSION