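Problem description
I want to detect the language of each row in a set of strings. I have a CSV file from which I take a single column ("Reescribe aquí / Rewrite here") and convert it to string format.
What I want, then, is to detect the language of each of these rows individually.
Here is my code: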
import pandas as pd
import re
from nltk.tokenize.treebank import TreebankWordDetokenizer
from langdetect import detect_langs

df1 = pd.read_csv('TFG1.csv', encoding='utf8')

def find_all_words(words, sentence):
    all_words = re.findall(r'\w+', sentence)
    words_found = []
    for word in words:
        if word in all_words:
            words_found.append(word)
    return "Words found:", words_found.__len__(), " The words are:", words_found

def detect_language(text):
    """Detect the language of the text function.
    Input:
        text: a string containing the text
    Output:
        lang: language ('es' for Spanish; 'en' for English)
    """
    res = detect_langs(text)
    for item in res:
        if item.lang == "es" or item.lang == "en":
            return item.lang
    return None

i = 1
TreebankWordDetokenizer().detokenize(df1["Reescribe aquí / Rewrite here"])
for rows in [x.lower() for x in df1["Reescribe aquí / Rewrite here"]]:
    print(i, "-", rows, find_all_words(['sage', 'selection'], rows))
    print(detect_language(TreebankWordDetokenizer().detokenize(df1["Reescribe aquí / Rewrite here"])))
    i += 1
This returns:
1 - el grupo sage dijo que todo esta bien ('Words found:',1,' The words are:',['sage'])
en
2 - sage group clarifies that the selection of vaccines is optimal ('Words found:',2,' The words are:',['sage','selection'])
en
As you can see, in both cases it reports the language as English ('en'), even though the first row is Spanish.
Solution
No verified solution has been posted for this problem yet.
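One likely cause can be read off the code in the question: inside the loop, detect_language is given the detokenized text of the whole column rather than the current row, so every iteration classifies the same combined text and gets the same answer. The sketch below passes each row to the detector on its own. The names detect_per_row and the detector parameter are hypothetical, introduced here only so the helper can be tried with a stand-in detector; by default it uses detect_langs from langdetect, as the question does. This is an untested suggestion, not a confirmed fix.

```python
def detect_language(text, detector=None):
    """Return 'es' or 'en' for one string, or None if neither is proposed.

    `detector` is a hypothetical hook: any callable that returns objects
    with a `.lang` attribute. By default langdetect's detect_langs is used.
    """
    if detector is None:
        # Third-party dependency, imported lazily (pip install langdetect).
        from langdetect import detect_langs as detector
    for candidate in detector(text):
        if candidate.lang in ("es", "en"):
            return candidate.lang
    return None


def detect_per_row(rows, detector=None):
    """Classify each row separately instead of the whole column at once."""
    return [(row, detect_language(row, detector)) for row in rows]


# Possible usage with the DataFrame from the question (assumes the column
# may contain missing values, hence dropna/astype):
#
#   column = df1["Reescribe aquí / Rewrite here"].dropna().astype(str)
#   for i, (row, lang) in enumerate(detect_per_row(column.str.lower()), start=1):
#       print(i, "-", row, lang)
```

Note that langdetect can be unreliable on very short strings and is not deterministic by default; setting DetectorFactory.seed (from langdetect import DetectorFactory) to a fixed value before detecting makes results repeatable.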