Problem description
I am using the Elasticsearch N-gram tokenizer together with match_phrase to do fuzzy matching.
My index and test data are as follows:
DELETE /m8
PUT m8
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "my_tokenizer"
}
},"tokenizer": {
"my_tokenizer": {
"type": "ngram","min_gram": 1,"max_gram": 3,"custom_token_chars":"_."
}
}
},"max_ngram_diff": 10
},"mappings": {
"table": {
"properties": {
"dataSourceId": {
"type": "long"
},"dataSourceType": {
"type": "integer"
},"dbname": {
"type": "text","analyzer": "my_analyzer","fields": {
"keyword": {
"type": "keyword","ignore_above": 256
}
}
}
}
}
}
}
PUT /m8/table/1
{
"dataSourceId":1,"dataSourceType":2,"dbname":"rm.rf"
}
PUT /m8/table/2
{
"dataSourceId":1,"dbname":"rm_rf"
}
PUT /m8/table/3
{
"dataSourceId":1,"dbname":"rmrf"
}
Checking the tokenizer with _analyze:
POST m8/_analyze
{
"tokenizer": "my_tokenizer","text": "rm.rf"
}
_analyze result:
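{
"tokens" : [
{
"token" : "r","start_offset" : 0,"end_offset" : 1,"type" : "word","position" : 0
},{
"token" : "rm","start_offset" : 0,"end_offset" : 2,"type" : "word","position" : 1
},{
"token" : "rm.","start_offset" : 0,"end_offset" : 3,"type" : "word","position" : 2
},{
"token" : "m","start_offset" : 1,"end_offset" : 2,"type" : "word","position" : 3
},{
"token" : "m.","start_offset" : 1,"end_offset" : 3,"type" : "word","position" : 4
},{
"token" : "m.r","start_offset" : 1,"end_offset" : 4,"type" : "word","position" : 5
},{
"token" : ".","start_offset" : 2,"end_offset" : 3,"type" : "word","position" : 6
},{
"token" : ".r","start_offset" : 2,"end_offset" : 4,"type" : "word","position" : 7
},{
"token" : ".rf","start_offset" : 2,"end_offset" : 5,"type" : "word","position" : 8
},{
"token" : "r","start_offset" : 3,"end_offset" : 4,"type" : "word","position" : 9
},{
"token" : "rf","start_offset" : 3,"end_offset" : 5,"type" : "word","position" : 10
},{
"token" : "f","start_offset" : 4,"end_offset" : 5,"type" : "word","position" : 11
}
]
}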
When I search for "rm", nothing is found:
GET /m8/table/_search
{
"query": {
"bool": {
"must": [
{
"match_phrase": {
"dbname": "rm"
}
}
]
}
}
}
{
"took" : 2,"timed_out" : false,"_shards" : {
"total" : 5,"successful" : 5,"skipped" : 0,"failed" : 0
},"hits" : {
"total" : 0,"max_score" : null,"hits" : [ ]
}
}
But a search for '.rf' does find a document:
{
"took" : 1,"hits" : {
"total" : 1,"max_score" : 1.7260926,"hits" : [
{
"_index" : "m8","_type" : "table","_id" : "1","_score" : 1.7260926,"_source" : {
"dataSourceId" : 1,"dataSourceType" : 2,"dbname" : "rm.rf"
}
}
]
}
}
My question: why can't "rm" be found, even though _analyze splits the text into these tokens?
Solution
- my_analyzer is also used at search time. If you don't provide a search_analyzer, whatever you defined in analyzer is used at search time as well, so your mapping behaves as if it were:
"mappings": { "table": { "properties": { "dbname": { "type": "text","analyzer": "my_analyzer","search_analyzer": "my_analyzer" } } } }
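- The match_phrase query matches phrases by taking the positions of the analyzed tokens into account. For example, searching for "Kal ho" matches documents whose analyzed text has "Kal" at position X and "ho" at position X + 1.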
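- When you search for "rm", the query text is analyzed with my_analyzer too (as noted in the first point), which turns it into the n-grams "r", "rm" and "m" at positions 0, 1 and 2, and the phrase search runs on top of those tokens. In the indexed text "rm.rf", "r" and "rm" are at positions 0 and 1, but position 2 holds "rm." and "m" only appears at position 3, so the phrase does not match. ".rf" does match because its six n-grams appear at the consecutive positions 6 through 11. So the result is not what you expected.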
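The position mismatch can be reproduced outside Elasticsearch. What follows is a minimal Python sketch, not Elasticsearch's actual implementation. `ngram_tokens` and `phrase_matches` are hypothetical helpers. They assume that the tokenizer emits every gram of length 1 to 3 at each start offset with one position per gram (as the _analyze output in the question shows), and that match_phrase with the default slop of 0 needs the query tokens at consecutive positions in the same order.

```python
def ngram_tokens(text, min_gram=1, max_gram=3):
    """Return (token, position) pairs, mimicking the _analyze output in the question."""
    tokens = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(text):
                break
            # Each emitted gram gets the next position, like the ngram tokenizer output.
            tokens.append((text[start:start + size], len(tokens)))
    return tokens


def phrase_matches(doc_text, query_text):
    """True if the query's tokens occur in the document at consecutive positions."""
    doc = [tok for tok, _ in ngram_tokens(doc_text)]
    query = [tok for tok, _ in ngram_tokens(query_text)]
    return any(doc[i:i + len(query)] == query
               for i in range(len(doc) - len(query) + 1))


print(ngram_tokens("rm"))              # [('r', 0), ('rm', 1), ('m', 2)]
print(phrase_matches("rm.rf", "rm"))   # False
print(phrase_matches("rm.rf", ".rf"))  # True
```

Under these assumptions the sketch agrees with the search results in the question: "rm" fails because "m" does not directly follow "rm" in the indexed token stream, while the grams of ".rf" line up with positions 6 through 11.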
Possible fixes:
- Use the standard analyzer with a plain match query, by setting "analyzer": "standard" in the query itself:
GET /m8/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"dbname": {
"query": "rm","analyzer": "standard"
}
}
}
]
}
}
}
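Or define the search analyzer in the mapping, by setting "search_analyzer": "standard", and use a match query (instead of match_phrase):
"mappings": { "table": { "properties": { "dbname": { "type": "text","analyzer": "my_analyzer","search_analyzer": "standard" } } } }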
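To see why a single search-time term is enough, here is a small Python sketch of the idea, again only an approximation of what Elasticsearch does. `ngram_terms`, `simple_search_terms` and `match` are hypothetical helpers, and `simple_search_terms` is a crude stand-in for the standard analyzer (lowercase, split on whitespace), not its real tokenization rules.

```python
def ngram_terms(text, min_gram=1, max_gram=3):
    """Set of all n-grams of length min_gram..max_gram, as my_analyzer would index them."""
    return {text[i:i + n]
            for i in range(len(text))
            for n in range(min_gram, max_gram + 1)
            if i + n <= len(text)}


def simple_search_terms(query):
    """Crude stand-in for the standard analyzer: lowercase and split on whitespace."""
    return query.lower().split()


# The three test documents from the question, keyed by _id.
docs = {1: "rm.rf", 2: "rm_rf", 3: "rmrf"}


def match(query):
    """IDs of documents whose indexed n-grams contain at least one query term."""
    terms = simple_search_terms(query)
    return sorted(doc_id for doc_id, dbname in docs.items()
                  if any(term in ngram_terms(dbname) for term in terms))


print(match("rm"))    # [1, 2, 3]
print(match("xyz"))   # []
print(match("rmrf"))  # []
```

The last call illustrates a limitation of this approach under the sketch's assumptions: a search term longer than max_gram (3 here) is never present among the indexed grams, so it cannot match unless max_gram is raised or the query text is split differently.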
Follow-up question: why do you want to use a match_phrase query with an n-gram tokenizer in the first place?