How do I remove duplicate records from a DataFrame?

Problem description

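I have a Python script that extracts data from a table in a Word file and converts it into a DataFrame. The text is Arabic. The problem is that when I display the DataFrame, each record appears twice, and I have not been able to remove the duplicates.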

Code

import pandas as pd
import docx

document = docx.Document(path)
table = document.tables[0]

data = []

for row_index,row in enumerate(table.rows): # Loop through rows
    data.append([]) # Add container list for each row.
    for col_index in range(13): # Loop through columns 
        # python-docx already returns str, so no encode/decode round trip is needed
        data[row_index].append(row.cells[col_index].paragraphs[0].text)

df = pd.DataFrame(data)
df.columns=["group","person","category","source","dds","time","date","location","text","title","date_export","num_export",""]
df.drop_duplicates()
df.head(20)

Result:

 'date_export': {0: 'تاريخ الصادر',1: '',2: '2020/8/23',3: '2020/8/23',4: '2020/8/23',5: '2020/8/23',6: '2020/8/23',7: '2020/8/23',8: '2020/8/23',9: '2020/8/23',10: '2020/8/23',11: '2020/8/23',12: '2020/8/23'},'num_export': {0: 'رقم الصادر',1: 'رقم الصادر',2: '36015',3: '36015',4: '36016',5: '36016',6: '36017',7: '36017',8: '36018',9: '36018',10: '36019',11: '36019',12: '36020'},

Solution

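By default, drop_duplicates() returns a new DataFrame and leaves df unchanged, so the result of your call is discarded. You have to set inplace=True:

df.drop_duplicates(inplace=True)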

Using the dataset you provided, the example below shows how df.drop_duplicates(inplace=True) gets the job done, as @Chinte also mentioned in their answer.

Before:

>>> df

    date_export     num_export
0   تاريخ الصادر    رقم الصادر
1       رقم الصادر
2   2020/8/23   36015
3   2020/8/23   36015
4   2020/8/23   36016
5   2020/8/23   36016
6   2020/8/23   36017
7   2020/8/23   36017
8   2020/8/23   36018
9   2020/8/23   36018
10  2020/8/23   36019
11  2020/8/23   36019
12  2020/8/23   36020

After:

>>> df.drop_duplicates(inplace=True)
>>> df

    date_export     num_export
0   تاريخ الصادر    رقم الصادر
1       رقم الصادر
2   2020/8/23   36015
4   2020/8/23   36016
6   2020/8/23   36017
8   2020/8/23   36018
10  2020/8/23   36019
12  2020/8/23   36020
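
If you would rather not modify df in place, reassigning the result should work just as well. The following is a minimal sketch, not the asker's actual script: it uses a small stand-in DataFrame built from a few of the values shown above, and the name deduped is only illustrative. It also resets the index, since drop_duplicates keeps the original row labels (0, 2, 4, ...).

```python
import pandas as pd

# Small stand-in for the question's data (a few values copied from the example).
df = pd.DataFrame({
    "date_export": ["2020/8/23"] * 5,
    "num_export": ["36015", "36015", "36016", "36016", "36017"],
})

# Without inplace=True the call returns a new DataFrame; df itself is untouched.
df.drop_duplicates()
print(len(df))  # df still has all 5 rows

# Reassigning the result has the same effect as inplace=True.
# reset_index(drop=True) renumbers the surviving rows from 0.
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)
```

By default only the first occurrence of each duplicated row is kept, which matches the "After" output above.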