NEST - Elasticsearch - Case-insensitive substring search that allows slashes

Problem description

Suppose I have indexed a document in Elasticsearch with these two keyword fields:

"lastModified": "02/03/2020"
"theText": "Hello there"

I want to support case-insensitive substring search on both fields.

So when I search "lastModified", the document should match any of these query strings:

"02"
"2/03"
"/0"
"2020"

And when I search "theText", the document should match any of these (note the changed case):

"helLO"
"lo there"
"the"

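You get the idea. I just need a simple case-insensitive substring search, nothing fuzzy or fancy. I have tried wildcards, regular expressions, escaping the slashes in "lastModified", and remapping / to _slash_, but I'm stuck. Wildcards work unless the input contains a slash. How can I make the wildcard approach handle slashes? Or is there a better way?

Edit

I would rather avoid the N-Gram route, because the text data can be a very long paragraph and that would produce a lot of grams :).

In summary, my preferred solution:

Does not need N-Grams (our text can be very long)
Is case-insensitive
Supports slashes in the input

Right now I'm using an ugly regular expression on the keyword fields. It works, but it feels silly.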

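Solution

You need an n-gram tokenizer for substring matching. Since you also want to keep the /, you need to add punctuation to token_chars.

Here is a working example with index data, mapping, search queries and search results.

Index mapping: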

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "filter": [
            "lowercase"
          ],
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit",
            "punctuation"
          ]
        }
      }
    },
    "max_ngram_diff": 10
  },
  "mappings": {
    "properties": {
      "lastModified": {
        "type": "text",
        "analyzer": "my_analyzer"
      },
      "theText": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

Index data:

{
  "theText": "Hello there"
}
{
  "lastModified": "02/03/2020"
}

Search query on the first field:

{
  "query": {
    "match": {
      "theText": "helLO"
    }
  }
}

Search result:

"hits": [
      {
        "_index": "67825121",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.970927,
        "_source": {
          "theText": "Hello there"
        }
      }
    ]

Search query on the second field:

{
  "query": {
    "match": {
      "lastModified": "2/03"
    }
  }
}

Search result:

"hits": [
      {
        "_index": "67825121",
        "_id": "1",
        "_score": 2.0497348,
        "_source": {
          "lastModified": "02/03/2020"
        }
      }
    ]

I also tried the other queries, and they returned the correct results for your use case.
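To make it easier to see why these match queries hit, here is a minimal Python sketch that approximates the analyzer above outside of Elasticsearch. It is not the Lucene implementation: the helper names (is_token_char, ngram_terms, match_or) are made up for illustration, and the character classes are approximated with Unicode general categories. It assumes the same analyzer runs at search time and that the match query uses its default OR operator.

```python
import unicodedata
from itertools import groupby


def is_token_char(ch):
    """Approximate token_chars = ["letter", "digit", "punctuation"]."""
    category = unicodedata.category(ch)
    # L* covers letters, Nd decimal digits, P* punctuation ('/' is Po).
    return category[0] in ("L", "P") or category == "Nd"


def ngram_terms(text, min_gram=2, max_gram=10):
    """Return the set of lowercased grams this sketch would index for text."""
    terms = set()
    # For the ASCII examples here, lowercasing first is equivalent to
    # applying the lowercase filter after the tokenizer.
    for keep, group in groupby(text.lower(), key=is_token_char):
        if not keep:
            continue  # whitespace and symbols split the input into chunks
        chunk = "".join(group)
        for start in range(len(chunk)):
            for size in range(min_gram, max_gram + 1):
                if start + size <= len(chunk):
                    terms.add(chunk[start:start + size])
    return terms


def match_or(query, indexed_value):
    """Mimic a match query with the default OR operator: any shared gram is a hit."""
    return bool(ngram_terms(query) & ngram_terms(indexed_value))


print(sorted(ngram_terms("02/03/2020"), key=len)[:5])
print(match_or("2/03", "02/03/2020"))
print(match_or("helLO", "Hello there"))
```

Under these assumptions, "2/03" is itself one of the indexed grams because the slash is kept as punctuation. The sketch also suggests a caveat: since the query text is split into grams too, a query can hit on a partial overlap (for example a shared two-character gram), so this behaves more loosely than a strict substring test.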

Update 1:

If no analyzer is specified, Elasticsearch uses the standard analyzer. Assuming the lastModified and theText fields are of type text, "02/03/2020" is tokenized into

{
  "tokens": [
    { "token": "02", "start_offset": 0, "end_offset": 2, "type": "<NUM>", "position": 0 },
    { "token": "03", "start_offset": 3, "end_offset": 5, "position": 1 },
    { "token": "2020", "start_offset": 6, "end_offset": 10, "position": 2 }
  ]
}

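Now, when you run a wildcard query against either of these fields, it is matched against the individual tokens shown above, not against the original string. Since no single token matches "2/03", the query returns an empty result.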
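The effect can be reproduced with a small Python sketch. The function names are hypothetical, the regex is only a rough stand-in for the standard analyzer (lowercase, split on non-word characters), and fnmatchcase is used as an approximation of wildcard matching against a single term.

```python
import re
from fnmatch import fnmatchcase


def standard_like_tokens(text):
    """Rough stand-in for the standard analyzer: lowercase, split on non-word chars."""
    return re.findall(r"\w+", text.lower())


def wildcard_hits(pattern, terms):
    """A wildcard pattern must match one whole indexed term to produce a hit."""
    return any(fnmatchcase(term, pattern) for term in terms)


analyzed_terms = standard_like_tokens("02/03/2020")  # what a text field indexes
keyword_terms = ["02/03/2020"]                       # what a keyword field indexes

print(analyzed_terms)
print(wildcard_hits("*2/03*", analyzed_terms))
print(wildcard_hits("*2/03*", keyword_terms))
```

In this sketch the slash never reaches the index of the analyzed field, so a pattern containing it cannot match any term, while the same pattern matches the single unanalyzed term.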

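It is better to run wildcard queries against a keyword field. If you have not explicitly defined an index mapping, append .keyword to both field names (note the ".keyword" after the field name in the queries below). That sub-field stores the whole value as a single, unanalyzed term instead of splitting it with the standard analyzer. Note that with the default dynamic mapping the .keyword sub-field has ignore_above set to 256, so longer values are not indexed in it.

Search query: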

{
  "query": {
    "wildcard": {
      "lastModified.keyword": {
        "value": "*2/03*"
      }
    }
  }
}

Search result:

"hits": [
      {
        "_index": "67825121",
        "_score": 1.0,
        "_source": {
          "lastModified": "02/03/2020"
        }
      }
    ]

Search query:

{
  "query": {
    "wildcard": {
      "theText.keyword": {
        "value": "*lo there*"
      }
    }
  }
}

Search result:

"hits": [
      {
        "_index": "67825121",
        "_source": {
          "theText": "Hello there"
        }
      }
    ]
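The wildcard queries above match terms in a keyword field exactly as they were indexed, so on their own they are case-sensitive. As a possible way to cover the case-insensitivity requirement, the sketch below builds the same request body with the wildcard query's case_insensitive parameter, which assumes Elasticsearch 7.10 or later. The helper names are hypothetical, and escaping *, ? and \ with a backslash is an assumption based on Lucene's wildcard syntax. The code only builds the JSON body and does not call a cluster.

```python
import json


def escape_wildcard(text):
    """Escape wildcard metacharacters so user input is matched literally."""
    # Escape the backslash first so later escapes are not doubled.
    return text.replace("\\", "\\\\").replace("*", "\\*").replace("?", "\\?")


def substring_query(field, text):
    """Build a wildcard request body for a case-insensitive 'contains' search."""
    return {
        "query": {
            "wildcard": {
                field: {
                    "value": "*" + escape_wildcard(text) + "*",
                    # Assumption: requires Elasticsearch 7.10 or later.
                    "case_insensitive": True,
                }
            }
        }
    }


print(json.dumps(substring_query("theText.keyword", "helLO"), indent=2))
print(json.dumps(substring_query("lastModified.keyword", "2/03"), indent=2))
```

The slash needs no escaping in a wildcard pattern. On older versions, an alternative would be a keyword field with a lowercase normalizer and lowercasing the input before building the pattern.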
