Elasticsearch detects the wrong language

Problem description

I am using Elasticsearch's default language identification model (lang_ident_model_1).

I am running into problems detecting Arabic and English.

I am on version 7.13.2, and the requests are sent with curl to GET _ingest/pipeline/_simulate.
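For anyone who wants to reproduce this without curl, here is a minimal Python sketch of the same simulate call. The helper names `build_simulate_body` and `simulate` and the `http://localhost:9200` address are placeholders of mine, not part of the original setup, and the sketch uses POST, which the simulate endpoint accepts as well as GET. The optional `num_top_classes` setting of the inference processor asks for the N most likely languages, which may help show what the model considers the runner-up.

```python
import json
import urllib.request


def build_simulate_body(texts, model_id="lang_ident_model_1", num_top_classes=None):
    """Build the JSON body for the ingest simulate API, one doc per input text."""
    inference = {"model_id": model_id}
    if num_top_classes is not None:
        # Ask for the N most likely languages, not only the single best guess.
        inference["inference_config"] = {
            "classification": {"num_top_classes": num_top_classes}
        }
    return {
        "pipeline": {"processors": [{"inference": inference}]},
        "docs": [{"_source": {"text": t}} for t in texts],
    }


def simulate(body, base_url="http://localhost:9200"):
    """POST the body to _ingest/pipeline/_simulate and return the parsed reply.

    base_url is a placeholder; add authentication if the cluster requires it.
    """
    request = urllib.request.Request(
        f"{base_url}/_ingest/pipeline/_simulate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `simulate(build_simulate_body(["I am so happy today"], num_top_classes=3))` against a running cluster should return the same kind of reply as the ones quoted in this post.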

This text is incorrectly identified as SV (Swedish).

Request:

{
    "pipeline": {
        "processors": [
            {
                "inference": {
                    "model_id": "lang_ident_model_1"
                }
            }
        ]
    },
    "docs": [
        {
            "_source": {
                "text": "CNN تتجه أمريكا إلى أفضل الأوقات ، وأسوأ الأوقات في الصيف ، حيث يخفف الوعد الذي طال انتظاره "
            }
        }
    ]
}

Response:

{
    "docs": [
        {
            "doc": {
                "_index": "_index",
                "_type": "_doc",
                "_id": "_id",
                "_source": {
                    "text": "CNN تتجه أمريكا إلى أفضل الأوقات ، وأسوأ الأوقات في الصيف ، حيث يخفف الوعد الذي طال انتظاره ",
                    "ml": {
                        "inference": {
                            "prediction_score": 0.9711386959542202,
                            "model_id": "lang_ident_model_1",
                            "prediction_probability": 0.9711386959542202,
                            "predicted_value": "sv"
                        }
                    }
                },
                "_ingest": {
                    "timestamp": "2021-06-30T08:25:00.959013809Z"
                }
            }
        }
    ]
}

This text is correctly identified as AR (Arabic).

Request:

{
    "pipeline": {
        "processors": [
            {
                "inference": {
                    "model_id": "lang_ident_model_1"
                }
            }
        ]
    },
    "docs": [
        {
            "_source": {
                "text": "تتجه أمريكا إلى أفضل الأوقات ، وأسوأ الأوقات في الصيف ، حيث يخفف الوعد الذي طال انتظاره "
            }
        }
    ]
}

Response:

{
    "docs": [
        {
            "doc": {
                "_index": "_index",
                "_source": {
                    "text": "تتجه أمريكا إلى أفضل الأوقات ، وأسوأ الأوقات في الصيف ، حيث يخفف الوعد الذي طال انتظاره ",
                    "ml": {
                        "inference": {
                            "prediction_score": 0.9999964083151712,
                            "prediction_probability": 0.9999964083151712,
                            "predicted_value": "ar"
                        }
                    }
                },
                "_ingest": {
                    "timestamp": "2021-06-30T08:25:36.663997653Z"
                }
            }
        }
    ]
}

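The two texts differ only in the leading "CNN", so Arabic text that begins with Latin letters appears to be what trips up the model.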
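One way to check that suspicion on other inputs is to look at which scripts a text actually contains before sending it to the model. The sketch below is only a diagnostic aid under my own assumptions: it labels each letter by the first word of its Unicode character name (for example LATIN or ARABIC), which is a rough stand-in for a real script property, and the function names are made up for illustration. It does not change how lang_ident_model_1 behaves.

```python
import unicodedata


def script_counts(text):
    """Count letters by the first word of their Unicode name (e.g. LATIN, ARABIC)."""
    counts = {}
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits and punctuation
        script = unicodedata.name(ch, "UNKNOWN").split()[0]
        counts[script] = counts.get(script, 0) + 1
    return counts


def leading_script(text):
    """Return the script label of the first letter in the text, or None."""
    for ch in text:
        if ch.isalpha():
            return unicodedata.name(ch, "UNKNOWN").split()[0]
    return None


if __name__ == "__main__":
    sample = "CNN تتجه أمريكا"
    print(leading_script(sample), script_counts(sample))
```

A text whose leading script differs from its majority script, as in the first request above, would be a candidate for the misdetection described here.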

This English sentence is incorrectly identified as MG (Malagasy).

Request:

{
    "pipeline": {
        "processors": [
            {
                "inference": {
                    "model_id": "lang_ident_model_1"
                }
            }
        ]
    },
    "docs": [
        {
            "_source": {
                "text": "I am so happy today,I am also so sad today"
            }
        }
    ]
}

Response:

{
    "docs": [
        {
            "doc": {
                "_index": "_index",
                "_source": {
                    "text": "I am so happy today,I am also so sad today",
                    "ml": {
                        "inference": {
                            "prediction_score": 0.9639440826428515,
                            "prediction_probability": 0.9639440826428515,
                            "predicted_value": "mg"
                        }
                    }
                },
                "_ingest": {
                    "timestamp": "2021-06-30T08:52:17.64136234Z"
                }
            }
        }
    ]
}
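When comparing many inputs, it may be easier to reduce each reply to the text, the predicted language and its probability. The sketch below assumes the reply has the shape shown in the responses above; `summarize_predictions` is a hypothetical helper name. Note that both wrong predictions in this post come with a probability above 0.96, so filtering on `prediction_probability` alone would not have caught them.

```python
def summarize_predictions(response):
    """Return (text, predicted_value, prediction_probability) per simulated doc."""
    rows = []
    for entry in response.get("docs", []):
        source = entry["doc"]["_source"]
        inference = source["ml"]["inference"]
        rows.append(
            (
                source["text"],
                inference["predicted_value"],
                inference["prediction_probability"],
            )
        )
    return rows


if __name__ == "__main__":
    # Reply copied from the English example in this post, trimmed to the used fields.
    reply = {
        "docs": [
            {
                "doc": {
                    "_source": {
                        "text": "I am so happy today,I am also so sad today",
                        "ml": {
                            "inference": {
                                "prediction_probability": 0.9639440826428515,
                                "predicted_value": "mg",
                            }
                        },
                    }
                }
            }
        ]
    }
    for text, lang, prob in summarize_predictions(reply):
        print(f"{lang}\t{prob:.4f}\t{text}")
```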


elasticsearch language-detection