Python's os.walk() visits all folders instead of only the given folder

Problem

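I want to use a simple script to collect all the images under a given folder and compare them to find duplicates.

Why reinvent the wheel when the first step of a solution already exists: Finding duplicate files and removing them

But it fails at that very first step, because it visits every folder on the given USB flash drive. I stripped out all the hashing and tried to build only the list of files, but even that runs forever and visits every file on the USB drive.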

from __future__ import print_function   # py2 compatibility
from collections import defaultdict
import hashlib
import os
import sys


folder_to_check = r"D:\FileCompareTest"  # raw string, so the backslash is never read as an escape

def check_for_duplicates(paths, hash=hashlib.sha1):
    hashes_by_size = defaultdict(list)  # dict of size_in_bytes: [full_path_to_file1, full_path_to_file2, ]
    hashes_on_1k = defaultdict(list)  # dict of (hash1k, size_in_bytes): [full_path_to_file1, ]
    hashes_full = {}   # dict of full_file_hash: full_path_to_file_string

    for path in paths:
        for dirpath, dirnames, filenames in os.walk(path):
            # get all files that have the same size - they are the collision candidates
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                try:
                    # if the target is a symlink (soft one), this will
                    # dereference it - change the value to the actual target file
                    full_path = os.path.realpath(full_path)
                    file_size = os.path.getsize(full_path)
                    hashes_by_size[file_size].append(full_path)
                except OSError:
                    # not accessible (permissions, etc) - pass on
                    continue




check_for_duplicates(folder_to_check)

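Instead of getting the hashes_by_size list within a few milliseconds, I am stuck in what looks like an endless loop, or the program exits hours later with every file on the USB drive in the list.

What am I not understanding about os.walk()?

Solution

You should call it like this: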

paths_to_check = []
paths_to_check.append(folder_to_check)
check_for_duplicates(paths_to_check)

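The way you call it, paths is a single string, so "for path in paths" iterates over its characters and os.walk() is run on each character instead of on the actual path. One of those characters is the backslash, and on Windows os.walk("\\") starts at the root of the current drive, which would explain why every folder on the USB drive gets visited.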
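As a minimal sketch of how to guard against this mistake, the function could accept either a single path or a list of paths. The helper names normalize_paths and files_by_size are hypothetical and not part of the original script, the sketch assumes Python 3 (os.PathLike does not exist in Python 2), and it skips the symlink resolution done in the question's code.

```python
import os
from collections import defaultdict


def normalize_paths(paths):
    """Return a list of paths, wrapping a single path in a list."""
    # A str is itself iterable, so "for path in paths" would otherwise
    # loop over its characters instead of over whole paths.
    if isinstance(paths, (str, bytes, os.PathLike)):
        return [paths]
    return list(paths)


def files_by_size(paths):
    """Group the files under the given folder(s) by size in bytes."""
    by_size = defaultdict(list)
    for path in normalize_paths(paths):
        for dirpath, dirnames, filenames in os.walk(path):
            for filename in filenames:
                full_path = os.path.join(dirpath, filename)
                try:
                    by_size[os.path.getsize(full_path)].append(full_path)
                except OSError:
                    # not accessible (permissions, etc.) - skip it
                    continue
    return by_size


# What the original call effectively iterated over:
print(list("D:\\FileCompareTest")[:4])  # ['D', ':', '\\', 'F']
```

With this guard, files_by_size(folder_to_check) and files_by_size([folder_to_check]) should walk the same folder. Whether to accept both forms or to require a list and fail loudly on a string is a design choice; the wrapping approach is simply the more forgiving of the two.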