How do I make shorter, closer token matches more relevant? (edge_ngram)

Problem description

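I am getting strange results from the edge_ngram tokenizer I use for autocomplete, and I am trying to work out how to make the results more relevant. I copied the example from the Elasticsearch documentation.

I have documents with the following descriptions: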

  • "Apples, raw, without skin"
  • "Apples, raw, golden delicious, with skin"
  • "APPLEBEE'S, chili"
  • "Babyfood, fruit, applesauce, junior"

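If I search for apple, "APPLEBEE'S, chili" scores higher than "Apples, raw, without skin".

If I search for apples, "Babyfood, fruit, applesauce, junior" scores higher than "Apples, raw, golden delicious, with skin".

In both cases I would expect the more relevant, shorter match to score higher (that is, when I search for apple or apples, results containing the word apples should score higher than APPLEBEE'S or applesauce).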

My index settings and mapping are:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": ["lowercase", "asciifolding"]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": ["letter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}

How can I make the more relevant results score higher?

Answer

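This happens because of the length of the matched field, called dl, in the newer BM25 scoring algorithm: the shorter the field, the higher the tf component of the score. You can see the details by adding the explain parameter to your query:

http://{{hostname}}:{{port}}/_search?explain=true

Your "APPLEBEE'S, chili" document has the shortest field, so it gets the higher score. This is the tf score for that document: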

{
    "value": 0.5344296,
    "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
    "details": [
        { "value": 1.0, "description": "freq, occurrences of term within document", "details": [] },
        { "value": 1.2, "description": "k1, term saturation parameter", "details": [] },
        { "value": 0.75, "description": "b, length normalization parameter", "details": [] },
        { "value": 11.0, "description": "dl, length of field", "details": [] },   ---> note this
        { "value": 17.333334, "description": "avgdl, average length of field", "details": [] }
    ]
}
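To make the effect of dl concrete, here is a minimal Python sketch that plugs the numbers from the explain output into the formula printed in its description. The `explain_tf` dict is a hand-copied stand-in for the fragment above, not a live response, and `explain_params` and `bm25_tf` are hypothetical helper names. The dl of 25 in the last step is an invented value, used only to show the direction of the change.

```python
# Hand-copied stand-in for the explain fragment above (assumption: not a live response).
explain_tf = {
    "value": 0.5344296,
    "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
    "details": [
        {"value": 1.0, "description": "freq, occurrences of term within document", "details": []},
        {"value": 1.2, "description": "k1, term saturation parameter", "details": []},
        {"value": 0.75, "description": "b, length normalization parameter", "details": []},
        {"value": 11.0, "description": "dl, length of field", "details": []},
        {"value": 17.333334, "description": "avgdl, average length of field", "details": []},
    ],
}


def explain_params(node):
    """Map the leading name of each detail (freq, k1, b, dl, avgdl) to its value."""
    return {d["description"].split(",")[0].strip(): d["value"] for d in node["details"]}


def bm25_tf(freq, k1, b, dl, avgdl):
    """tf factor, written exactly as in the explain description."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


params = explain_params(explain_tf)
recomputed = bm25_tf(**params)

# Same term frequency, but a longer field (hypothetical dl of 25 tokens).
longer = bm25_tf(**{**params, "dl": 25.0})

print(round(recomputed, 7), round(longer, 7))
```

With everything else equal, the longer field ends up with the smaller tf, which is why the short APPLEBEE'S document wins on the edge_ngram field.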

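Solution

You need to create another field that uses the english analyzer, as shown in the multi-fields example. That sub-field indexes stemmed whole words, so a search for apple matches Apples there but not APPLEBEE'S or applesauce, and the whole-word matches pick up the extra score. A complete example follows.

Index example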

{
    "settings": {
        "analysis": {
            "analyzer": {
                "autocomplete": {
                    "tokenizer": "autocomplete",
                    "filter": ["lowercase", "asciifolding"]
                },
                "autocomplete_search": {
                    "tokenizer": "lowercase"
                }
            },
            "tokenizer": {
                "autocomplete": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 20,
                    "token_chars": ["letter"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "autocomplete",
                "search_analyzer": "autocomplete_search",
                "fields": {
                    "english": {
                        "type": "text",
                        "analyzer": "english"
                    }
                }
            }
        }
    }
}

Then index the sample documents:

{
    "name" : "Apples,raw,without skin"
}
{
    "name" : "APPLEBEE'S,chili"
}
{
    "name" : "Babyfood,fruit,applesauce,junior"
}
{
    "name" : "Apples,golden delicious,with skin"
}
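To see why the edge_ngram field alone cannot separate these four documents, the sketch below approximates the `autocomplete` analyzer in plain Python. It is an approximation under stated assumptions: it only handles ASCII letters (the real tokenizer uses Unicode letter classes and the analyzer also applies asciifolding), and `edge_ngrams` is a hypothetical helper, not an Elasticsearch API.

```python
import re


def edge_ngrams(text, min_gram=2, max_gram=20):
    """Approximate the `autocomplete` analyzer: split on non-letters,
    lowercase, and emit each word's prefixes of length min_gram..max_gram."""
    tokens = []
    for word in re.findall(r"[A-Za-z]+", text):
        word = word.lower()
        # Words shorter than min_gram produce no tokens (empty range).
        for n in range(min_gram, min(len(word), max_gram) + 1):
            tokens.append(word[:n])
    return tokens


names = [
    "Apples,raw,without skin",
    "APPLEBEE'S,chili",
    "Babyfood,fruit,applesauce,junior",
    "Apples,golden delicious,with skin",
]

# Number of indexed tokens per document, i.e. the field length BM25 sees.
token_counts = {name: len(edge_ngrams(name)) for name in names}

# Documents whose token list contains the search term "apple".
matches_apple = [name for name in names if "apple" in edge_ngrams(name)]

for name in names:
    print(token_counts[name], name)
```

Under this approximation every one of the four names contains the token `apple`, so the term itself does not discriminate, and `APPLEBEE'S,chili` yields 11 tokens, which agrees with the `dl` of 11.0 in the explain output. Field length is then the deciding factor on the `name` field, and that is what the `name.english` sub-field is there to counteract.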

And the search query:

{
    "query": {
        "bool": {
            "should": [
                {
                    "multi_match": {
                        "query": "apple",
                        "fields": ["name.english", "name"]
                    }
                }
            ]
        }
    }
}
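If you want to try other search terms and inspect the scoring for each hit, the request body can be built programmatically. The following is a minimal sketch: `build_search_body` is a hypothetical helper name, and the field names are the ones from the mapping in this answer. It only builds the JSON body and does not talk to a cluster.

```python
import json


def build_search_body(term, explain=False):
    """Build the multi_match request body from this answer for an arbitrary term."""
    body = {
        "query": {
            "bool": {
                "should": [
                    {
                        "multi_match": {
                            "query": term,
                            "fields": ["name.english", "name"],
                        }
                    }
                ]
            }
        }
    }
    if explain:
        # Top-level "explain" asks Elasticsearch for a scoring breakdown per hit.
        body["explain"] = True
    return body


print(json.dumps(build_search_body("apples", explain=True), indent=2))
```

The printed JSON can be sent as the body of a `_search` request. With `explain` enabled, each hit carries an `_explanation` tree like the tf fragment shown earlier.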

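In the search results, the documents that contain apple as a whole word score higher:

"hits": [
    {
        "_index": "edgelow", "_type": "_doc", "_id": "1",
        "_score": 0.6747451,
        "_source": { "name": "Apples,raw,without skin" }
    },
    {
        "_index": "edgelow", "_type": "_doc", "_id": "4",
        "_score": 0.60996956,
        "_source": { "name": "Apples,golden delicious,with skin" }
    },
    {
        "_index": "edgelow", "_type": "_doc", "_id": "2",
        "_score": 0.12822598,
        "_source": { "name": "APPLEBEE'S,chili" }
    },
    {
        "_index": "edgelow", "_type": "_doc", "_id": "3",
        "_score": 0.09446116,
        "_source": { "name": "Babyfood,fruit,applesauce,junior" }
    }
]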
