Adding leading/trailing spaces to the Elasticsearch ngram tokenizer

Problem

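I am trying to generate ngram features with an Elasticsearch analyzer. In particular, I want the input to be padded with a leading and a trailing space. For example, if the input is "2 Quick Foxes", the ngram features with leading/trailing spaces would be: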

" 2 ", "2 Q", ....., "Fox", "oxe", "xes", "es "

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2 Quick Foxes"
}

Solution

You can add two pattern_replace character filters, one for the leading space and one for the trailing space:

PUT my-index-000001
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "my_tokenizer",
            "char_filter": [
              "leading_space",
              "trailing_space"
            ]
          }
        },
        "tokenizer": {
          "my_tokenizer": {
            "type": "ngram",
            "min_gram": 3,
            "max_gram": 3,
            "token_chars": [
              "letter",
              "digit",
              "whitespace"
            ]
          }
        },
        "char_filter": {
          "leading_space": {
            "type": "pattern_replace",
            "pattern": "(^.)",
            "replacement": " $1"
          },
          "trailing_space": {
            "type": "pattern_replace",
            "pattern": "(.$)",
            "replacement": "$1 "
          }
        }
      }
    }
  }
}

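Note that whitespace has been added to the token_chars of my_tokenizer. Without it, the tokenizer splits on spaces, so the padding never appears in any ngram and the above will not work.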
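To make the role of whitespace in token_chars concrete, here is a rough Python emulation of the analysis chain. It is not Elasticsearch code: the helper names pad_with_spaces and char_ngrams are made up for illustration, and the sketch assumes the ngram tokenizer splits the text into runs of characters that belong to token_chars and then emits every window of length n inside each run.

```python
import re
from itertools import groupby


def pad_with_spaces(text):
    """Hypothetical stand-in for the two pattern_replace char filters."""
    text = re.sub(r"(^.)", r" \1", text)  # leading_space: " $1"
    text = re.sub(r"(.$)", r"\1 ", text)  # trailing_space: "$1 "
    return text


def char_ngrams(text, n=3, token_chars=("letter", "digit")):
    """Hypothetical stand-in for an ngram tokenizer with min_gram == max_gram == n."""

    def keep(ch):
        # A character is kept if it belongs to one of the configured classes.
        return (
            ("letter" in token_chars and ch.isalpha())
            or ("digit" in token_chars and ch.isdigit())
            or ("whitespace" in token_chars and ch.isspace())
        )

    grams = []
    # Characters outside token_chars act as separators between runs.
    for kept, group in groupby(text, key=keep):
        if not kept:
            continue
        run = "".join(group)
        # Runs shorter than n produce no grams at all.
        grams.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return grams


padded = pad_with_spaces("2 Quick Foxes")

# Without "whitespace" the padding is discarded as a separator.
print(char_ngrams(padded))

# With "whitespace" the padded text is one run, so the spaces end up in the grams.
print(char_ngrams(padded, token_chars=("letter", "digit", "whitespace")))
```

Under these assumptions, the first call yields only the six grams taken from inside "Quick" and "Foxes", while the second starts with " 2 " and ends with "es ". To confirm the real behaviour, rerun the _analyze request from the question against the index created with the new settings.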
