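Problem description
I have a Python script that extracts data from a table in a Word file and converts it into a DataFrame of Arabic text. The problem is that when I display the DataFrame, every record appears twice, and I cannot remove the duplicate records.
Code: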
import pandas as pd
import docx

document = docx.Document(path)
table = document.tables[0]
data = []
for row_index, row in enumerate(table.rows):  # Loop through rows
    data.append([])  # Add a container list for each row.
    for col_index in range(13):  # Loop through columns
        cell_text = row.cells[col_index].paragraphs[0].text.encode('utf-8')
        cell_decode_text = cell_text.decode('utf-8')
        data[row_index].append(cell_decode_text)
df = pd.DataFrame(data)
df.columns = ["group", "person", "category", "source", "dds", "time", "date", "location", "text", "title", "date_export", "num_export", ""]
df.drop_duplicates()
df.head(20)
Result:
'date_export': {0: 'تاريخ الصادر',1: '',2: '2020/8/23',3: '2020/8/23',4: '2020/8/23',5: '2020/8/23',6: '2020/8/23',7: '2020/8/23',8: '2020/8/23',9: '2020/8/23',10: '2020/8/23',11: '2020/8/23',12: '2020/8/23'},'num_export': {0: 'رقم الصادر',1: 'رقم الصادر',2: '36015',3: '36015',4: '36016',5: '36016',6: '36017',7: '36017',8: '36018',9: '36018',10: '36019',11: '36019',12: '36020'},
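Solution
By default, drop_duplicates() returns a new DataFrame and leaves df unchanged, so the result of the bare call in your code is discarded. You have to set
df.drop_duplicates(inplace=True)
(or assign the result back with df = df.drop_duplicates()). Using the dataset provided, the example below shows how df.drop_duplicates(inplace=True)
does the job, as @Chinte also mentioned in their answer.
Before: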
>>> df
     date_export  num_export
0   تاريخ الصادر  رقم الصادر
1                 رقم الصادر
2      2020/8/23       36015
3      2020/8/23       36015
4      2020/8/23       36016
5      2020/8/23       36016
6      2020/8/23       36017
7      2020/8/23       36017
8      2020/8/23       36018
9      2020/8/23       36018
10     2020/8/23       36019
11     2020/8/23       36019
12     2020/8/23       36020
After:
>>> df.drop_duplicates(inplace=True)
>>> df
     date_export  num_export
0   تاريخ الصادر  رقم الصادر
1                 رقم الصادر
2      2020/8/23       36015
4      2020/8/23       36016
6      2020/8/23       36017
8      2020/8/23       36018
10     2020/8/23       36019
12     2020/8/23       36020
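
For anyone who wants to reproduce this without the Word file, here is a minimal sketch that uses a few rows copied from the question's output. It is only an illustration: the variable names (rows_after_bare_call, deduped, deduped_reset) are made up for this example, and the reset_index step is an optional extra that the original answer does not use. It shows that a bare drop_duplicates() call leaves df untouched, and that assigning the result back has the same effect as inplace=True.

```python
import pandas as pd

# Sample rows copied from the question's output (two of the thirteen columns).
df = pd.DataFrame({
    "date_export": ["تاريخ الصادر", "", "2020/8/23", "2020/8/23", "2020/8/23", "2020/8/23"],
    "num_export": ["رقم الصادر", "رقم الصادر", "36015", "36015", "36016", "36016"],
})

# A bare call returns a new DataFrame; the result is discarded and df is unchanged.
df.drop_duplicates()
rows_after_bare_call = len(df)

# Assigning the result keeps the de-duplicated rows (same effect as inplace=True).
deduped = df.drop_duplicates()

# Optional (not part of the original answer): renumber the index without gaps.
deduped_reset = deduped.reset_index(drop=True)

print(rows_after_bare_call)
print(deduped.index.tolist())
print(deduped_reset.index.tolist())
```

Note that the first two rows both survive, because they differ in date_export; only rows that match in every column are treated as duplicates.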