Removing duplicates from a large image dataset

Problem description

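I am working with a training dataset of 127,000 images scraped from the internet.

I know it contains many duplicates, and I want to remove them to improve the performance of my deep learning model.

I have tried several approaches. Some did not work at all; others removed either too few images or too many.

The last one I tried was this: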

import hashlib
import os

import matplotlib.pyplot as plt
from matplotlib.image import imread
%matplotlib inline

def file_hash(filepath):
    # MD5 of the raw file bytes: only byte-identical files get the same hash
    with open(filepath, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

os.chdir('/content/train')

# Take the listing once so that the stored indexes map back to file names
file_list = os.listdir('.')  # listdir('.') = current directory
duplicates = []  # (index of duplicate, index of first file with the same hash)
hash_keys = dict()
for index, filename in enumerate(file_list):
    if os.path.isfile(filename):
        filehash = file_hash(filename)
        if filehash not in hash_keys:
            hash_keys[filehash] = index
        else:
            duplicates.append((index, hash_keys[filehash]))

# Show the first 30 pairs: first occurrence on the left, duplicate on the right
for file_indexes in duplicates[:30]:
    try:
        plt.subplot(121), plt.imshow(imread(file_list[file_indexes[1]]))
        plt.title(file_indexes[1]), plt.xticks([]), plt.yticks([])

        plt.subplot(122), plt.imshow(imread(file_list[file_indexes[0]]))
        plt.title(str(file_indexes[0]) + ' duplicate'), plt.xticks([]), plt.yticks([])
        plt.show()
    except OSError:
        continue

# Delete the later copy of each pair, keeping the first occurrence
for index in duplicates:
    os.remove(file_list[index[0]])

This method found 490 duplicates, but I estimate there are at least a few thousand.
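An MD5 of the file bytes matches only files that are byte-for-byte identical, so a copy that was resized, re-compressed, or saved with different metadata gets a different hash. That may be part of why the count is so low. The sketch below is a non-destructive variant of the same idea: it groups file names by hash instead of deleting anything, so each group can be reviewed first. The names `md5_of_file` and `group_exact_duplicates` are made up for this illustration.

```python
import hashlib
import os
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """MD5 hex digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_exact_duplicates(image_dir):
    """Return lists of file names whose contents are byte-identical.

    Only groups with more than one member are returned. Nothing is deleted.
    """
    groups = defaultdict(list)
    for name in sorted(os.listdir(image_dir)):
        path = os.path.join(image_dir, name)
        if os.path.isfile(path):
            groups[md5_of_file(path)].append(name)
    return [names for names in groups.values() if len(names) > 1]
```

Calling `group_exact_duplicates('/content/train')` would return one list per set of identical files. Keeping the first name in each list and removing the rest reproduces the behaviour of the loop above.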

I also tried imagededup with different methods and thresholds.

pip install imagededup

from imagededup.methods import DHash
method_object = DHash()
duplicates = method_object.find_duplicates_to_remove(image_dir='/content/train',max_distance_threshold=3)

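The last run found 23,919 duplicates; the count is usually between 20k and 35k, depending on the method and threshold. That is too many. Training the model after removing all of them gives worse results.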
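For context on what the threshold controls: for imagededup's hashing methods, `max_distance_threshold` is a Hamming distance between 64-bit image hashes, so a value of 3 accepts any pair whose hashes differ in at most 3 bits. The sketch below is a simplified pure-Python difference hash, not imagededup's actual implementation. It assumes each image has already been reduced to a grayscale grid of 8 rows by 9 columns (the resizing step is left out), and the function names are invented for this example. Listing candidate pairs together with their distances makes it possible to inspect a sample at each distance before deciding where to cut.

```python
def dhash_bits(gray_rows):
    """Difference hash of a small grayscale grid (a list of equal-length rows).

    Each bit is 1 when a pixel is brighter than its right-hand neighbour.
    A grid of 8 rows by 9 columns yields a 64-bit hash.
    """
    bits = 0
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which two integer hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def near_duplicate_pairs(hashes, max_distance):
    """Return (name_a, name_b, distance) for every pair within max_distance.

    `hashes` maps a file name to its integer hash. This compares all pairs,
    which is fine for a review sample but too slow for a full large dataset.
    """
    names = sorted(hashes)
    pairs = []
    for i, name_a in enumerate(names):
        for name_b in names[i + 1:]:
            distance = hamming_distance(hashes[name_a], hashes[name_b])
            if distance <= max_distance:
                pairs.append((name_a, name_b, distance))
    return pairs
```

Pairs at distance 0 or 1 are likely to be true duplicates, while pairs near the threshold may just be visually similar images. Reviewing a handful of pairs at each distance is one way to judge whether a threshold of 3 is too permissive for this dataset.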

Does anyone know a better way to remove duplicate images?


deep-learning duplicates image-preprocessing machine-learning python