FEATURE: Split up text segmentation for Chinese and Japanese.

* Chinese segmentation will continue to rely on cppjieba
* Japanese segmentation will use our port of TinySegmenter
* Korean currently does not rely on segmentation; it was dropped in c677877e4f
* SiteSetting.search_tokenize_chinese_japanese_korean has been split
into SiteSetting.search_tokenize_chinese and
SiteSetting.search_tokenize_japanese
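
The split described above can be sketched as two independent predicates, one per language, replacing the old combined CJK check. This is a minimal illustrative sketch: the `SiteSetting` stub below is hypothetical and stands in for Discourse's real site-setting machinery; only the setting and predicate names come from the commit.

```ruby
# Hypothetical stand-in for Discourse's SiteSetting; the real class is
# backed by the database and admin UI. Names mirror the commit message.
module SiteSetting
  class << self
    attr_accessor :search_tokenize_chinese, :search_tokenize_japanese
  end
  self.search_tokenize_chinese = false
  self.search_tokenize_japanese = true
end

module Search
  # The single segment_cjk? check is replaced by two per-language
  # predicates, each driven by its own site setting.
  def self.segment_chinese?
    !!SiteSetting.search_tokenize_chinese
  end

  def self.segment_japanese?
    !!SiteSetting.search_tokenize_japanese
  end
end

# A caller that previously tested one combined flag now checks both:
no_segmentation = !Search.segment_chinese? && !Search.segment_japanese?
puts no_segmentation
```

With Japanese tokenization enabled in this sketch, `no_segmentation` is false, so callers guarded by the combined condition still take the segmentation-aware path.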
Alan Guo Xiang Tan
2022-01-26 15:24:11 +08:00
parent 9ddd1f739e
commit 930f51e175
14 changed files with 406 additions and 72 deletions


@@ -87,7 +87,7 @@ class Search
blurb_length: @blurb_length
}
-        if post.post_search_data.version > SearchIndexer::MIN_POST_REINDEX_VERSION && !Search.segment_cjk?
+        if post.post_search_data.version > SearchIndexer::MIN_POST_REINDEX_VERSION && !Search.segment_chinese? && !Search.segment_japanese?
if SiteSetting.use_pg_headlines_for_excerpt
scrubbed_headline = post.headline.gsub(SCRUB_HEADLINE_REGEXP, '\1')
prefix_omission = scrubbed_headline.start_with?(post.leading_raw_data) ? '' : OMISSION