Problem description
I'm trying to use an Elasticsearch analyzer to generate ngram features; in particular, I want to add leading/trailing spaces to the text. For example, if the text is "2 Quick Foxes", the ngram features with leading/trailing spaces would be:
" 2 ", "2 Q", ....., "Fox", "oxe", "xes", "es "
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}
POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2 Quick Foxes"
}
Solution
You can add two pattern replace character filters -- one for the leading space and one for the trailing space:
PUT my-index-000001
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "my_tokenizer",
            "char_filter": [
              "leading_space",
              "trailing_space"
            ]
          }
        },
        "tokenizer": {
          "my_tokenizer": {
            "type": "ngram",
            "min_gram": 3,
            "max_gram": 3,
            "token_chars": [
              "letter",
              "digit",
              "whitespace"
            ]
          }
        },
        "char_filter": {
          "leading_space": {
            "type": "pattern_replace",
            "pattern": "(^.)",
            "replacement": " $1"
          },
          "trailing_space": {
            "type": "pattern_replace",
            "pattern": "(.$)",
            "replacement": "$1 "
          }
        }
      }
    }
  }
}
Note the whitespace added to token_chars in my_tokenizer -- without it, the above will not work.
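To see why this produces the desired ngrams, here is a minimal Python sketch that mimics the pipeline outside Elasticsearch: the two pattern_replace char filters pad the text, and the ngram tokenizer then emits every 3-character window (since whitespace is in token_chars, the padded string stays one unbroken character stream). The analyze function is hypothetical, not part of any Elasticsearch client.

import re

def analyze(text, min_gram=3, max_gram=3):
    # Simulate the two pattern_replace char filters:
    #   leading_space:  pattern "(^.)" -> replacement " $1"
    #   trailing_space: pattern "(.$)" -> replacement "$1 "
    text = re.sub(r"(^.)", r" \1", text)
    text = re.sub(r"(.$)", r"\1 ", text)
    # Simulate the ngram tokenizer. Because token_chars includes
    # letter, digit, AND whitespace, no character splits the stream,
    # so we slide a window of each allowed size over the whole string.
    tokens = []
    for n in range(min_gram, max_gram + 1):
        for i in range(len(text) - n + 1):
            tokens.append(text[i:i + n])
    return tokens

print(analyze("2 Quick Foxes")[:3])   # [' 2 ', '2 Q', ' Qu']
print(analyze("2 Quick Foxes")[-3:])  # ['oxe', 'xes', 'es ']

If whitespace were left out of token_chars, the spaces would split the stream into separate tokens ("2", "Quick", "Foxes") and the padding added by the char filters would be discarded, which is exactly why the note above matters.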