Elasticsearch 6.8: match_phrase search does not work well with an n-gram tokenizer

Problem description

I am using the Elasticsearch n-gram tokenizer together with match_phrase for fuzzy matching. My index and test data are as follows:

DELETE /m8
PUT m8
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },"tokenizer": {
        "my_tokenizer": {
          "type": "ngram","min_gram": 1,"max_gram": 3,"custom_token_chars":"_."
        }
      }
    },"max_ngram_diff": 10
  },"mappings": {
    "table": {
      "properties": {
        "dataSourceId": {
          "type": "long"
        },"dataSourceType": {
          "type": "integer"
        },"dbname": {
          "type": "text","analyzer": "my_analyzer","fields": {
            "keyword": {
              "type": "keyword","ignore_above": 256
            }
          }
        }
      }
    }
  }
}


PUT /m8/table/1
{
  "dataSourceId":1,"dataSourceType":2,"dbname":"rm.rf"
}

PUT /m8/table/2
{
  "dataSourceId":1,"dbname":"rm_rf"
}
PUT /m8/table/3
{
  "dataSourceId":1,"dbname":"rmrf"
}

Checking with _analyze:

POST m8/_analyze
{
  "tokenizer": "my_tokenizer","text": "rm.rf"
}

_analyze result:

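{
  "tokens" : [
    {
      "token" : "r","start_offset" : 0,"end_offset" : 1,"type" : "word","position" : 0
    },{
      "token" : "rm","start_offset" : 0,"end_offset" : 2,"type" : "word","position" : 1
    },{
      "token" : "rm.","start_offset" : 0,"end_offset" : 3,"type" : "word","position" : 2
    },{
      "token" : "m","start_offset" : 1,"end_offset" : 2,"type" : "word","position" : 3
    },{
      "token" : "m.","start_offset" : 1,"end_offset" : 3,"type" : "word","position" : 4
    },{
      "token" : "m.r","start_offset" : 1,"end_offset" : 4,"type" : "word","position" : 5
    },{
      "token" : ".","start_offset" : 2,"end_offset" : 3,"type" : "word","position" : 6
    },{
      "token" : ".r","start_offset" : 2,"end_offset" : 4,"type" : "word","position" : 7
    },{
      "token" : ".rf","start_offset" : 2,"end_offset" : 5,"type" : "word","position" : 8
    },{
      "token" : "r","start_offset" : 3,"end_offset" : 4,"type" : "word","position" : 9
    },{
      "token" : "rf","start_offset" : 3,"end_offset" : 5,"type" : "word","position" : 10
    },{
      "token" : "f","start_offset" : 4,"end_offset" : 5,"type" : "word","position" : 11
    }
  ]
}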
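
The token order and positions above can be reproduced offline. The following is only a rough sketch of how the n-gram tokenizer appears to behave for this input, assuming no characters are treated as separators (as the output above suggests); `ngram_tokens` is a hypothetical helper, not an Elasticsearch API.

```python
def ngram_tokens(text, min_gram=1, max_gram=3):
    """Return (token, position) pairs in the order _analyze reports them.

    Simplified model: every character is kept, and each emitted gram
    gets the next position number.
    """
    tokens = []
    position = 0
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            end = start + size
            if end > len(text):
                break  # gram would run past the end of the text
            tokens.append((text[start:end], position))
            position += 1
    return tokens


doc_tokens = ngram_tokens("rm.rf")
for token, position in doc_tokens:
    print(position, repr(token))
```

Note that "m" ends up at position 3, not directly after "rm", because "rm." takes position 2.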

When I search for "rm", nothing is found:

GET /m8/table/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match_phrase": {
            "dbname": "rm"
          }
        }
      ]
    }
  }
}
{
  "took" : 2,"timed_out" : false,"_shards" : {
    "total" : 5,"successful" : 5,"skipped" : 0,"failed" : 0
  },"hits" : {
    "total" : 0,"max_score" : null,"hits" : [ ]
  }
}

But '.rf' can be found:

{
  "took" : 1,"hits" : {
    "total" : 1,"max_score" : 1.7260926,"hits" : [
      {
        "_index" : "m8","_type" : "table","_id" : "1","_score" : 1.7260926,"_source" : {
          "dataSourceId" : 1,"dataSourceType" : 2,"dbname" : "rm.rf"
        }
      }
    ]
  }
}

My question: why can't "rm" be found, even though _analyze produces these tokens?

Answer

  1. my_analyzer is also used at search time.

    "properties": {
      "dbname": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "my_analyzer"  // <==== If you don't provide a search analyzer, the one defined in "analyzer" is used at search time as well.
      }
    }
    
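  2. A match_phrase query matches phrases by taking the positions of the analyzed tokens into account. For example, searching for "Kal ho" matches documents whose analyzed text has "Kal" at position X and "ho" at position X + 1.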

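  3. When you search for "rm" (see #1), the query text is also analyzed with my_analyzer, so it becomes the n-grams "r", "rm" and "m" at positions 0, 1 and 2, and match_phrase then requires those tokens at consecutive positions in the document. In the indexed "rm.rf", "r" and "rm" sit at positions 0 and 1, but position 2 holds "rm." and "m" only appears at position 3, so the phrase does not match. ".rf" matches because it is the end of the indexed value, where its n-grams occupy consecutive positions (6 to 11).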
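
The effect of n-gram positions on a phrase query can be checked with a small simulation. This is a minimal sketch, assuming a phrase query with slop 0 simply requires every query token to appear in the document at the same relative position; `ngram_tokens` and `phrase_matches` are hypothetical helpers for illustration, not how Elasticsearch is implemented.

```python
def ngram_tokens(text, min_gram=1, max_gram=3):
    """(token, position) pairs, one position per emitted gram (simplified)."""
    tokens = []
    position = 0
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(text):
                break
            tokens.append((text[start:start + size], position))
            position += 1
    return tokens


def phrase_matches(doc_text, query_text):
    """Simplified match_phrase (slop 0) with the same analyzer on both sides."""
    query = ngram_tokens(query_text)
    if not query:
        return False
    # Map each indexed token to the set of positions where it occurs.
    doc_positions = {}
    for token, pos in ngram_tokens(doc_text):
        doc_positions.setdefault(token, set()).add(pos)
    first_token, first_pos = query[0]
    # Try every occurrence of the first query token as the phrase start.
    for anchor in doc_positions.get(first_token, ()):
        offset = anchor - first_pos
        if all(pos + offset in doc_positions.get(tok, ()) for tok, pos in query):
            return True
    return False


rm_found = phrase_matches("rm.rf", "rm")
dot_rf_found = phrase_matches("rm.rf", ".rf")
print(rm_found)      # False: "m" is at position 3, not 2
print(dot_rf_found)  # True: the grams of ".rf" line up at positions 6 to 11
```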

Solutions:

  1. Use the standard analyzer with a plain match query

    GET /m8/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "dbname": {
                  "query": "rm",
                  "analyzer": "standard" // <=========
                }
              }
            }
          ]
        }
      }
    }
    

    Alternatively, define it in the mapping and use a match query (rather than match_phrase)

    "properties": {
      "dbname": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "standard" //<==========
      }
    }
    
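
To see why a plain match query with a non-n-gram search analyzer finds the documents, here is a rough sketch. It assumes the standard analyzer leaves the query "rm" as the single term "rm", and that a match query hits any document whose indexed terms contain at least one query term; `ngram_terms` and `match` are hypothetical helpers, not Elasticsearch APIs.

```python
def ngram_terms(text, min_gram=1, max_gram=3):
    """All indexed n-grams of text; positions are ignored by a match query."""
    return {
        text[start:start + size]
        for start in range(len(text))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(text)
    }


# The three test documents from the question, keyed by _id.
docs = {1: "rm.rf", 2: "rm_rf", 3: "rmrf"}


def match(query_terms, docs):
    """Simplified match query: hit if any query term is an indexed term."""
    return [
        doc_id
        for doc_id, text in docs.items()
        if any(term in ngram_terms(text) for term in query_terms)
    ]


hits_short = match(["rm"], docs)
hits_long = match(["rmrf"], docs)
print(hits_short)  # [1, 2, 3]
print(hits_long)   # []
```

In this model a query term longer than max_gram (3 here) has no counterpart among the indexed terms, so this approach only suits short search strings unless max_gram is raised.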

Follow-up question: why use a match_phrase query with an n-gram tokenizer in the first place?
