How to split a field into words with an ingest pipeline in Kibana

Problem description

I created the following ingest pipeline to split a field into words:

POST _ingest/pipeline/_simulate
{
    "pipeline": {
        "description": "String cutting processing",
        "processors": [
            {
                "split": {
                    "field": "foo",
                    "separator": "|"
                }
            }
        ]
    },
    "docs": [
        {
            "_source": {
                "foo": "apple|time"
            }
        }
    ]
}

But it splits the field into characters:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "foo" : [
            "a", "p", "p", "l", "e", "|", "t", "i", "m", "e"
          ]
        }
      }
    }
  ]
}

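If I replace the separator with a comma, the same pipeline splits the field into words:

POST _ingest/pipeline/_simulate
{
    "pipeline": {
        "description": "String cutting processing",
        "processors": [
            {
                "split": {
                    "field": "foo",
                    "separator": ","
                }
            }
        ]
    },
    "docs": [
        {
            "_source": {
                "foo": "apple,time"
            }
        }
    ]
}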

Then the output is:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_source" : {
          "foo" : [
            "apple", "time"
          ]
        }
      }
    }
  ]
}

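How can I split the field into words when the separator is "|"? My next question is how to apply this ingest pipeline to an existing index. I tried this solution, but it did not work for me.

Edit

Here is the whole pipeline, which splits the document field and assigns the two parts to two separate fields: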

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": """combined fields are text that contain  "|" to separate two fields""",
    "processors": [
      {
        "split": {
          "field": "dv_m",
          "separator": "|",
          "target_field": "dv_m_splited"
        }
      },
      {
        "set": {
          "field": "dv_metric_prod",
          "value": "{{dv_m_splited.1}}",
          "override": false
        }
      },
      {
        "set": {
          "field": "dv_metric_section",
          "value": "{{dv_m_splited.2}}",
          "override": false
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "dv_m": "amaze_inc|Understanding"
      }
    }
  ]
}

It produces this response:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_source" : {
          "dv_metric_prod" : "m",
          "dv_m_splited" : [
            "a", "m", "a", "z", "e", "_", "i", "n", "c", "|", "U", "n", "d", "e", "r", "s", "t", "a", "n", "d", "i", "n", "g"
          ],
          "dv_metric_section" : "a",
          "dv_m" : "amaze_inc|Understanding"
        },
        "_ingest" : {
          "timestamp" : "2021-08-02T08:33:58.2234143Z"
        }
      }
    }
  ]
}

If I set "separator": "\\|", I get this error instead:

{
  "docs" : [
    {
      "error" : {
        "root_cause" : [
          {
            "type" : "general_script_exception",
            "reason" : "Error running com.github.mustachejava.codes.DefaultMustache@776f8239"
          }
        ],
        "type" : "general_script_exception",
        "reason" : "Error running com.github.mustachejava.codes.DefaultMustache@776f8239",
        "caused_by" : {
          "type" : "mustache_exception",
          "reason" : "Failed to get value for dv_m_splited.2 @[query-template:1]",
          "caused_by" : {
            "type" : "mustache_exception",
            "reason" : "2 @[query-template:1]",
            "caused_by" : {
              "type" : "index_out_of_bounds_exception",
              "reason" : "2"
            }
          }
        }
      }
    }
  ]
}

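Solution

The fix is simple: escape the separator.

The separator field of the split processor is a regular expression, so special characters such as | must be escaped. Left unescaped, | is the alternation operator between two empty patterns, which matches the empty string at every position; that is why the field was split into single characters.

The backslash must itself be escaped inside the JSON string, so the regular expression \| is written as "\\|".

So your code is only missing the double escaping: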

POST _ingest/pipeline/_simulate
{
    "pipeline": {
        "description": "String cutting processing",
        "processors": [
            {
                "split": {
                    "field": "foo",
                    "separator": "\\|"
                }
            }
        ]
    },
    "docs": [
        {
            "_source": {
                "foo": "apple|time"
            }
        }
    ]
}
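
The index_out_of_bounds_exception shown in the edit appears only once the separator is escaped, which suggests that the split itself then works: the resulting array has two elements, and array indices in these Mustache templates start at 0, so {{dv_m_splited.2}} points past the end of the array. Below is a sketch of the edited pipeline with the indices shifted to 0 and 1. It assumes that dv_m always contains exactly one "|"; the field names are taken from the question.

```console
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Split dv_m on a literal pipe and copy both parts",
    "processors": [
      {
        "split": {
          "field": "dv_m",
          "separator": "\\|",
          "target_field": "dv_m_splited"
        }
      },
      {
        "set": {
          "field": "dv_metric_prod",
          "value": "{{dv_m_splited.0}}",
          "override": false
        }
      },
      {
        "set": {
          "field": "dv_metric_section",
          "value": "{{dv_m_splited.1}}",
          "override": false
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "dv_m": "amaze_inc|Understanding"
      }
    }
  ]
}
```

With this input, dv_metric_prod should come out as amaze_inc and dv_metric_section as Understanding. If you prefer to avoid backslashes altogether, the character class "[|]" should work as an equivalent separator, since | has no special meaning inside square brackets.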

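
The second part of the question, applying the pipeline to an existing index, could be handled roughly as follows. This is a minimal sketch under assumptions: split-dv-m is a hypothetical pipeline id and my-index a hypothetical index name, so substitute your own. The pipeline is stored first, and _update_by_query with the pipeline parameter then reprocesses the documents that are already in the index. ignore_missing is set so that documents without a dv_m field are skipped rather than failing.

```console
# Store the pipeline under a (hypothetical) id
PUT _ingest/pipeline/split-dv-m
{
  "description": "Split dv_m on a literal pipe",
  "processors": [
    {
      "split": {
        "field": "dv_m",
        "separator": "\\|",
        "target_field": "dv_m_splited",
        "ignore_missing": true
      }
    }
  ]
}

# Re-run every existing document of the (hypothetical) index through it
POST my-index/_update_by_query?pipeline=split-dv-m
```

This only touches documents that exist at the time of the call. For documents indexed later, the pipeline can be passed as ?pipeline=split-dv-m on the index request, or set as the index.default_pipeline setting of the index.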