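Problem description
I have two pandas DataFrames in Python, each containing millions of rows. I want to remove rows from the first DataFrame that contain a word from the second DataFrame, based on three conditions:
- the word appears at the beginning of the sentence in a row
- the word appears at the end of the sentence in a row
- the word appears in the middle of the sentence in a row (the exact word, not a substring)
Example:
First DataFrame: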
This is the first sentence
Second this is another sentence
This is the third sentence forth
This is fifth sentence
This is fifth_sentence
Second DataFrame:
Second
forth
fifth
Expected output:
This is the first sentence
This is fifth_sentence
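Note that both DataFrames hold millions of records. How can I process them and export the result in the most efficient way?
I tried the following, but it takes a very long time: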
import pandas as pd
import re
bad_words_file_data = pd.read_csv("words.txt",sep = ",",header = None)
sentences_file_data = pd.read_csv("setences.txt",sep = ".",header = None)
bad_words_index = []
for i in sentences_file_data.index:
    print("Processing Sentence:- ",i,"\n")
    single_sentence = sentences_file_data[0][i]
    for j in bad_words_file_data.index:
        word = bad_words_file_data[0][j]
        if single_sentence.endswith(word) or single_sentence.startswith(word) or word in single_sentence.split(" "):
            bad_words_index.append(i)
            break
sentences_file_data = sentences_file_data.drop(index=bad_words_index)
sentences_file_data.to_csv("filtered.txt",header = None,index = False)
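Thanks
Solution
You can use the numpy.where function to create a column named "remove" that is set to 1 whenever any of the conditions you listed is met. First, create a list of the values in df2. Then:
Condition 1: checks whether the cell value starts with any of the values in the list
Condition 2: same as above, but checks whether the cell value ends with any of the values in the list
Condition 3: splits each cell and checks whether any of the resulting tokens is in the list
After that, you can create the new DataFrame by filtering out the rows marked with 1: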
# Imports
import pandas as pd
import numpy as np
# Collect the unique values from df2 in a list
l = list(set(df2['col']))
# Build the three conditions (start, end, exact token match)
c = df['col']
cond = (c.str.startswith(tuple(l))
        | c.str.endswith(tuple(l))
        | pd.DataFrame(c.str.split(' ').tolist(), index=c.index).isin(l).any(axis=1))
# Mark rows to remove with 1, all others with 0
df['remove'] = np.where(cond, 1, 0)
# Keep the unmarked rows and drop the helper column
out = df[df['remove'] != 1].drop(['remove'], axis=1)
out
Prints:
col
0 This is the first sentence
4 This is fifth_sentence
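With millions of rows, expanding the split tokens into a wide DataFrame may use a lot of memory. The sketch below is an alternative, not part of the answer above. It assumes that words are separated by whitespace and that an exact, case-sensitive match on a whole word is what is wanted; under that assumption the start and end conditions are already covered by the token check. The helper name `drop_rows_with_words` is hypothetical.

```python
import pandas as pd


def drop_rows_with_words(df, words, col="col"):
    """Return the rows of df whose `col` shares no whitespace-separated token with `words`."""
    bad = set(words)  # set lookup is constant time per token
    # isdisjoint is True only when the sentence contains none of the bad words
    keep = df[col].map(lambda s: bad.isdisjoint(s.split()))
    return df[keep]


df = pd.DataFrame({"col": [
    "This is the first sentence",
    "Second this is another sentence",
    "This is the third sentence forth",
    "This is fifth sentence",
    "This is fifth_sentence",
]})
df2 = pd.DataFrame({"col": ["Second", "forth", "fifth"]})

out = drop_rows_with_words(df, df2["col"])
print(out)
```

Note that this differs from the prefix and suffix checks in one respect: a sentence starting with "Secondary" would be kept here, whereas `startswith` would flag it.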
References:
Pandas Row Select Where String Starts With Any Item In List
check if a columns contains any str from list
DataFrames used:
>>> df.to_dict()
{'col': {0: 'This is the first sentence',1: 'Second this is another sentence',2: 'This is the third sentence forth',3: 'This is fifth sentence',4: 'This is fifth_sentence'}}
>>> df2.to_dict()
{'col': {0: 'Second',1: 'forth',2: 'fifth'}}
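To reproduce the example, the dictionaries above can be passed back to the `pd.DataFrame` constructor. A minimal sketch:

```python
import pandas as pd

# Rebuild both frames from the dictionaries shown above
df = pd.DataFrame({'col': {0: 'This is the first sentence',
                           1: 'Second this is another sentence',
                           2: 'This is the third sentence forth',
                           3: 'This is fifth sentence',
                           4: 'This is fifth_sentence'}})
df2 = pd.DataFrame({'col': {0: 'Second', 1: 'forth', 2: 'fifth'}})

print(df.shape, df2.shape)
```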