Delete rows from one dataframe based on conditions from another dataframe in Pandas (Python)

Problem description

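I have two Pandas dataframes in Python, each containing millions of rows. I want to remove rows from the first dataframe when they contain a word from the second dataframe, based on three conditions:

If the word appears at the beginning of the sentence in a row
If the word appears at the end of the sentence in a row
If the word appears in the middle of the sentence in a row (the exact word, not a substring)

Example:

First dataframe: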

This is the first sentence
Second this is another sentence
This is the third sentence forth
This is fifth sentence
This is fifth_sentence

Second dataframe:

Second
forth
fifth

Expected output:

This is the first sentence
This is fifth_sentence

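Note that I have millions of records in both dataframes. How can I process them and export the result in the most efficient way?

I tried the following, but it takes a very long time: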

import pandas as pd
import re

bad_words_file_data = pd.read_csv("words.txt",sep = ",",header = None)
sentences_file_data = pd.read_csv("setences.txt",sep = ".",header = None)

bad_words_index = []
for i in sentences_file_data.index:
    print("Processing Sentence:- ",i,"\n")
    single_sentence = sentences_file_data[0][i]
    for j in bad_words_file_data.index:
        word = bad_words_file_data[0][j]
        if single_sentence.endswith(word) or single_sentence.startswith(word) or word in single_sentence.split(" "):
            bad_words_index.append(i)
            break
            
sentences_file_data = sentences_file_data.drop(index=bad_words_index)
sentences_file_data.to_csv("filtered.txt",header = None,index = False)

Thanks

Solution

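You can use the numpy.where function to create a column named 'remove' that is set to 1 when any of the conditions you listed is met. First, create a list of the values in df2.

Condition 1: checks whether the cell value starts with any of the values in the list

Condition 2: same as above, but checks whether the cell value ends with any of the values in the list

Condition 3: splits each cell on spaces and checks whether any value from the split string is in your list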
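To make the three conditions concrete, here is a minimal sketch that applies them to a single string in plain Python. The helper name should_remove is hypothetical and is not part of the answer's code; it only mirrors the three checks described above.

```python
def should_remove(sentence, words):
    """Return True if any of the three conditions holds for `sentence`."""
    words = tuple(words)
    starts = sentence.startswith(words)    # condition 1: starts with a listed value
    ends = sentence.endswith(words)        # condition 2: ends with a listed value
    # condition 3: any space-separated token is exactly a listed value
    contains = any(token in words for token in sentence.split(" "))
    return starts or ends or contains


words = ["Second", "forth", "fifth"]
print(should_remove("This is fifth sentence", words))   # True
print(should_remove("This is fifth_sentence", words))   # False
```

Note that conditions 1 and 2 are prefix and suffix checks, so a sentence beginning with "Secondary" would also count as starting with "Second". If only exact words should match, condition 3 alone may be closer to what the question asks for.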

After that, you can create a new dataframe by filtering out the rows marked 1:

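# Imports
import pandas as pd
import numpy as np

# Get the unique values from df2 in a list
l = list(set(df2['col']))

# Column of sentences to test
c = df['col']

# Set conditions: starts with, ends with, or has a space-separated token in the list.
# index=c.index keeps the token table aligned with df when df has a non-default index.
cond = (c.str.startswith(tuple(l))
        | c.str.endswith(tuple(l))
        | pd.DataFrame(c.str.split(' ').tolist(), index=c.index).isin(l).any(axis=1))

# Assign 1 or 0
df['remove'] = np.where(cond, 1, 0)

# Create the filtered dataframe and drop the helper column
out = df[df['remove'] != 1].drop(['remove'], axis=1)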

Printing out gives:

                          col
0  This is the first sentence
4      This is fifth_sentence
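Since the question mentions millions of rows in both dataframes, a variant based on set lookups may be worth trying. The sketch below is an alternative, not the method used above: it treats a row as a match only when one of its space-separated tokens is exactly one of the words, which covers the start, end, and middle cases as whole words. It has not been benchmarked here, and the names bad_words and keep are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col": [
    "This is the first sentence",
    "Second this is another sentence",
    "This is the third sentence forth",
    "This is fifth sentence",
    "This is fifth_sentence",
]})
df2 = pd.DataFrame({"col": ["Second", "forth", "fifth"]})

# A set gives constant-time membership tests on average
bad_words = set(df2["col"])

# True for rows whose tokens share nothing with bad_words
keep = df["col"].map(lambda s: bad_words.isdisjoint(s.split(" ")))

out = df[keep]
print(out)
```

This avoids building a wide intermediate dataframe of split tokens, which may matter when sentences are long. Unlike startswith and endswith, it does not match prefixes or suffixes of longer words.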

References:

Pandas Row Select Where String Starts With Any Item In List

check if a columns contains any str from list

Dataframes used:

>>> df.to_dict()

{'col': {0: 'This is the first sentence',1: 'Second this is another sentence',2: 'This is the third sentence forth',3: 'This is fifth sentence',4: 'This is fifth_sentence'}}

>>> df2.to_dict()

{'col': {0: 'Second',1: 'forth',2: 'fifth'}}
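For anyone who wants to reproduce the example, the two dictionaries above can be turned back into dataframes. This is a small sketch that assumes the variable names df and df2 used in the answer:

```python
import pandas as pd

# Outer keys become column names, inner keys become the row index
df = pd.DataFrame({'col': {0: 'This is the first sentence',
                           1: 'Second this is another sentence',
                           2: 'This is the third sentence forth',
                           3: 'This is fifth sentence',
                           4: 'This is fifth_sentence'}})

df2 = pd.DataFrame({'col': {0: 'Second', 1: 'forth', 2: 'fifth'}})

print(df.shape, df2.shape)
```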
