Detecting the language of different rows with langdetect

Problem description

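I want to detect the language of each row in a set of strings. I have a CSV file from which I take only one column ("Reescribe aquí / Rewrite here") and convert its values to strings.

What I want is to detect the language of each of these rows separately.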

Here is my code:

import pandas as pd
import re
from nltk.tokenize.treebank import TreebankWordDetokenizer
from langdetect import detect_langs


df1=pd.read_csv('TFG1.csv',encoding = 'utf8')


def find_all_words(words,sentence):
    all_words = re.findall(r'\w+',sentence)
    words_found = []
    for word in words:

        if word in all_words:
            words_found.append(word)
    return "Words found:",words_found.__len__()," The words are:",words_found

def detect_language(text):
    """Detect the language of the text function.
        Input:
            text: a string containing the text

        Output:
            lang: language ('es' for Spanish; 'en' for English)

        """

    res = detect_langs(text)
    for item in res:
        if item.lang == "es" or item.lang == "en":
            return item.lang
    return None

i=1

TreebankWordDetokenizer().detokenize(df1["Reescribe aquí / Rewrite here"])


for rows in [x.lower() for x in df1["Reescribe aquí / Rewrite here"]]:

    print(i,"-",rows,find_all_words(['sage','selection'],rows))
    print(detect_language(TreebankWordDetokenizer().detokenize(df1["Reescribe aquí / Rewrite here"])))


    i += 1

It returns this:

1 - el grupo sage dijo que todo esta bien ('Words found:',1,' The words are:',['sage'])
en
2 - sage group clarifies that the selection of vaccines is optimal ('Words found:',2,' The words are:',['sage','selection'])
en

As you can see, in both cases it reports the language as English ('en').
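
A likely cause, offered as a sketch and not a confirmed fix: inside the loop, detect_language is called on TreebankWordDetokenizer().detokenize(df1["Reescribe aquí / Rewrite here"]), which is the whole column joined into one string, not the current value of rows. Every iteration therefore classifies the same combined text and prints the same label. Passing the row itself should give one result per row. In the sketch below, detect_es_en, label_rows and the detect parameter are names introduced here for illustration; DetectorFactory.seed is the setting langdetect documents for making its otherwise non-deterministic results reproducible.

```python
def detect_es_en(text):
    """Return 'es' or 'en', whichever langdetect ranks first, else None."""
    # Imported lazily so that label_rows can be tried with any detector.
    from langdetect import DetectorFactory, detect_langs

    DetectorFactory.seed = 0  # make langdetect's results reproducible
    # detect_langs returns candidates ordered by probability, highest first.
    for candidate in detect_langs(text):
        if candidate.lang in ("es", "en"):
            return candidate.lang
    return None


def label_rows(rows, detect=detect_es_en):
    """Classify each row on its own and return (index, row, language) tuples."""
    results = []
    for i, row in enumerate(rows, start=1):
        # Pass the single row, not the whole column, to the detector.
        results.append((i, row, detect(row)))
    return results
```

With the DataFrame from the question, this could be called as label_rows(df1["Reescribe aquí / Rewrite here"].dropna().astype(str)). Short sentences give langdetect little to work with, so individual rows may still be misclassified; detect_langs also raises an exception on text with no usable characters, which this sketch does not handle.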


language-detection nlp nltk python