Using Python to find words from my own dictionary in each row

Problem description

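I'll try to be as direct as possible, but first I need to explain what I'm trying to do.

I have a CSV file with several columns (linked in the original post). I only want to select the second column (called "Reescribe aquí / Rewrite here") and search it for the words in a dictionary I created (also stored as a CSV file). From this I want two things:

1- Return how many words were found in each row.

2- Return which words were found in each row.

It would look like this (linked in the original post).

So far I have been able to get the second column without problems using the code below (note that I have also written a text preprocessing function), but I run into trouble with the second function when trying to find the words: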

import pandas as pd
import re
import spacy

df1=pd.read_csv('TFG1.csv',encoding = 'utf8')

language = 'en' 

if language=='es':
    nlp=spacy.load('es_core_news_sm')
elif language=='en':
    nlp=spacy.load('en_core_web_sm')


def process_text_spacy(text):
    """Process text function.
    Input:
        text: a string containing the text

    Output:
        text_clean: a list of words containing the processed text

    """

    tokens = nlp(text)

    # Get the words in lowercase, ignoring punctuation, whitespace and stop words.
    filtered_tokens = [tok.lower_ for tok in tokens if
                       not tok.is_punct and not tok.is_space and not nlp.vocab[tok.text].is_stop]

    return filtered_tokens


df1['cleaned_text']=df1['Reescribe aquí / Rewrite here'].apply(process_text_spacy)
print(df1["Reescribe aquí / Rewrite here"])


def find_all_words(words,sentence):
    all_words = re.findall(r'\w+',sentence)
    words_found = []
    for word in words:

        if word in all_words:
            words_found.append(word)
    return "Words found:",words_found.__len__()," The words are:",words_found

print(find_all_words(['sage','selection'],df1))

This is what I get:

First function (process_text_spacy):

0                el grupo sage dijo que todo esta bien
1    Sage group clarifies that the selection of vac...
Name: Reescribe aquí / Rewrite here, dtype: object

Second function (find_all_words):

TypeError: expected string or bytes-like object
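
The TypeError most likely comes from passing the whole DataFrame (df1) as the sentence argument: re.findall expects a single string, not a DataFrame. A minimal sketch of one possible fix is to have the function handle one sentence at a time and call it once per row. The sample rows below are hypothetical stand-ins for the CSV column, and lowercasing the text before matching is an assumption about the desired behaviour (so that "Sage" matches the dictionary entry "sage").

```python
import re


def find_all_words(words, sentence):
    """Return (count, matches) for the dictionary words present in one sentence."""
    # str() guards against non-string cells; lower() makes matching case-insensitive.
    tokens = set(re.findall(r'\w+', str(sentence).lower()))
    found = [word for word in words if word.lower() in tokens]
    return len(found), found


# Hypothetical sample rows standing in for the "Reescribe aquí / Rewrite here" column.
rows = [
    "el grupo sage dijo que todo esta bien",
    "The selection was announced by the Sage group",
]
dictionary = ['sage', 'selection']

# One result per row: how many dictionary words were found, and which ones.
results = [find_all_words(dictionary, row) for row in rows]
for count, found in results:
    print("Words found:", count, "The words are:", found)
```

With pandas, the same function could be applied to each row of the column, for example df1['Reescribe aquí / Rewrite here'].apply(lambda s: find_all_words(dictionary, s)), which yields one (count, words) tuple per row.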

