Problem description

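I have a Python script that extracts data from a table in a Word file and converts it into a DataFrame containing Arabic text. The problem is that when I display the DataFrame, every record appears twice, and I cannot remove the duplicate records.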

Code:

import pandas as pd
import docx

document = docx.Document(path)
table = document.tables[0]

data = []

for row_index,row in enumerate(table.rows): # Loop through rows
    data.append([]) # Add container list for each row.
    for col_index in range(13): # Loop through columns 
        cell_text= row.cells[col_index].paragraphs[0].text.encode('utf-8')
        cell_decode_text = cell_text.decode('utf-8')
        data[row_index].append(cell_decode_text)

df = pd.DataFrame(data)
df.columns=["group","person","category","source","dds","time","date","location","text","title","date_export","num_export",""]
df.drop_duplicates()
df.head(20)

Result:

 'date_export': {0: 'تاريخ الصادر',1: '',2: '2020/8/23',3: '2020/8/23',4: '2020/8/23',5: '2020/8/23',6: '2020/8/23',7: '2020/8/23',8: '2020/8/23',9: '2020/8/23',10: '2020/8/23',11: '2020/8/23',12: '2020/8/23'},'num_export': {0: 'رقم الصادر',1: 'رقم الصادر',2: '36015',3: '36015',4: '36016',5: '36016',6: '36017',7: '36017',8: '36018',9: '36018',10: '36019',11: '36019',12: '36020'},

Solution

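By default, df.drop_duplicates() returns a new DataFrame and leaves df itself unchanged, so the result of your call is discarded. You have to set: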

df.drop_duplicates(inplace=True)
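To make the difference concrete, here is a minimal sketch on a tiny made-up frame (the four sample values are taken from the question's num_export column; the variable names are only illustrative). It shows that a bare call has no effect on df, while assigning the result back gives the de-duplicated rows, which is equivalent to passing inplace=True.

```python
import pandas as pd

# Small illustrative frame; values borrowed from the question's num_export column.
df = pd.DataFrame({"num_export": ["36015", "36015", "36016", "36016"]})

# A bare call builds a de-duplicated copy and throws it away; df is untouched.
df.drop_duplicates()
rows_after_plain_call = len(df)

# Assigning the returned frame keeps the result (same effect as inplace=True).
deduped = df.drop_duplicates()
rows_after_assignment = len(deduped)

print(rows_after_plain_call, rows_after_assignment)  # prints: 4 2
```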

Using the dataset you provided, the example below shows how df.drop_duplicates(inplace=True) does the job, as @Chinte also mentioned in their answer.

Before:

>>> df

    date_export     num_export
0   تاريخ الصادر    رقم الصادر
1       رقم الصادر
2   2020/8/23   36015
3   2020/8/23   36015
4   2020/8/23   36016
5   2020/8/23   36016
6   2020/8/23   36017
7   2020/8/23   36017
8   2020/8/23   36018
9   2020/8/23   36018
10  2020/8/23   36019
11  2020/8/23   36019
12  2020/8/23   36020

After:

>>> df.drop_duplicates(inplace=True)
>>> df

    date_export     num_export
0   تاريخ الصادر    رقم الصادر
1       رقم الصادر
2   2020/8/23   36015
4   2020/8/23   36016
6   2020/8/23   36017
8   2020/8/23   36018
10  2020/8/23   36019
12  2020/8/23   36020
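Note that the surviving rows keep their original labels (0, 1, 2, 4, 6, ...), so the index has gaps. If that matters, one possible follow-up is sketched below, rebuilding the two columns shown above from the question's output. The names clean and by_number are hypothetical, and the subset variant is only an assumption about what you might want: it treats rows as duplicates when num_export alone repeats.

```python
import pandas as pd

# The two columns from the question's output, rebuilt by hand.
df = pd.DataFrame({
    "date_export": ["تاريخ الصادر", ""] + ["2020/8/23"] * 11,
    "num_export": ["رقم الصادر", "رقم الصادر",
                   "36015", "36015", "36016", "36016", "36017", "36017",
                   "36018", "36018", "36019", "36019", "36020"],
})

# Drop fully identical rows, then renumber the index from 0 without gaps.
clean = df.drop_duplicates().reset_index(drop=True)

# Variant: compare only num_export, keeping the first row for each value.
by_number = df.drop_duplicates(subset=["num_export"], keep="first")

print(len(df), len(clean), len(by_number))  # prints: 13 8 7
```

With subset=["num_export"], the second header row (index 1) is also dropped because its num_export value repeats the one in row 0, even though its date_export is empty.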

arabic dataframe ms-word python
