Faster reimplementation for finding unique keys in Python

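Problem description

I have a code snippet that finds unique keys based on their values: every key s that is contained in another key D with the same value is discarded, and key D is kept.

Input: mydict = { 'C': 4,'A': 4,'B': 4,'CA': 3,'AB': 4,'BC': 4,'ABC': 3 }

Output: {'CA': 3,'ABC': 3}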

newdict = {}

for key1, value1 in mydict.items():
    for key2, value2 in mydict.items():
        if key1 in key2 and value1 == value2:
            if key1 in newdict:
                del newdict[key1]
            newdict[key2] = value2

print(newdict)

I would like an implementation in Python 3 that runs faster and uses less memory.

Can anyone help?

Solution

First, there is a bug in your implementation. Take mydict = {'AB': 1, 'A': 1} and assume the items are read in the order written:

Step 1, key1 = 'AB', value1 = 1:
- key2 = 'AB', value2 = 1: key1 in key2 and value1 == value2 holds, so newdict['AB'] = 1.
- key2 = 'A', value2 = 1: key1 in key2 is false, so newdict is still {'AB': 1}.

Step 2, key1 = 'A', value1 = 1:
- key2 = 'AB', value2 = 1: key1 in key2 and value1 == value2 holds. Since key1 is not in newdict, nothing is deleted. Once again, newdict['AB'] = 1.
- key2 = 'A', value2 = 1: key1 in key2 and value1 == value2 holds. Since key1 is not in newdict, nothing is deleted. Then newdict['A'] = 1.

Conclusion: newdict = {'AB': 1, 'A': 1}, but it should be {'AB': 1}.
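The walkthrough above can be reproduced directly. The sketch below wraps the loop from the question in a function (the name buggy_func is only for illustration, it is not in the original post) and runs it on the same mapping with the two possible insertion orders:

```python
def buggy_func(mydict):
    # Hypothetical wrapper around the loop from the question; the logic is unchanged.
    newdict = {}
    for key1, value1 in mydict.items():
        for key2, value2 in mydict.items():
            if key1 in key2 and value1 == value2:
                if key1 in newdict:
                    del newdict[key1]
                newdict[key2] = value2
    return newdict


# Same mapping, two insertion orders (dicts keep insertion order in Python 3.7+).
long_first = {'AB': 1, 'A': 1}
short_first = {'A': 1, 'AB': 1}

print(buggy_func(long_first))   # {'AB': 1, 'A': 1}  ('A' survives by mistake)
print(buggy_func(short_first))  # {'AB': 1}
```

The result depends on the order in which the keys happen to be stored, which is what the fix has to remove.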

To fix this, you have to sort the keys by increasing length, so that the shorter keys are guaranteed to be discarded:

def original_func(mydict):
    newdict = {}
    # Sort by increasing key length so that every key is visited
    # before any longer key that could contain it.
    items = sorted(mydict.items(), key=lambda item: len(item[0]))

    for key1, value1 in items:
        for key2, value2 in items:
            if key1 in key2 and value1 == value2:
                if key1 in newdict:
                    del newdict[key1]
                newdict[key2] = value2

    return newdict

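Second, the time complexity of the algorithm is O(n^2 * K), where n is the number of keys and K is the length of the longest key. I assume your real use case involves a much larger dictionary. You can improve on this by:

- comparing only keys that have the same value;
- comparing each key only against the superkeys found so far, rather than against all keys.

If you have V distinct values and, on average, S < n/V superkeys per value, the average time complexity is roughly O(V * S^2 * K). Unless you are in a degenerate case, this will be faster.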
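To make the saving concrete, here is a minimal sketch that only counts how many substring tests each strategy performs. The function names and the sample dictionary are made up for illustration, and the grouped variant is a simplified version of the idea (it also tests against superkeys of the same length), not the exact function used later in this answer:

```python
from collections import defaultdict


def count_naive_tests(mydict):
    # The double loop evaluates `key1 in key2` for every ordered pair of keys.
    return len(mydict) ** 2


def count_grouped_tests(mydict):
    # Group keys by value, then test each key only against the superkeys
    # already found for that value, visiting the longest keys first.
    keys_by_value = defaultdict(list)
    for key, value in mydict.items():
        keys_by_value[value].append(key)

    tests = 0
    for keys in keys_by_value.values():
        superkeys = []
        for key in sorted(keys, key=len, reverse=True):
            contained = False
            for sk in superkeys:
                tests += 1
                if key in sk:
                    contained = True
                    break  # one containing superkey is enough
            if not contained:
                superkeys.append(key)
    return tests


# Hypothetical sample data.
sample = {'A': 1, 'B': 1, 'AB': 1, 'ABC': 1, 'X': 2, 'XY': 2, 'Z': 3}
print(count_naive_tests(sample), count_grouped_tests(sample))  # 49 4
```

On such a small input the gap says little about wall-clock time, but it shows where the work disappears: keys with different values are never compared, and a key stops being tested as soon as one containing superkey is found.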

To test this, we build a large dictionary:

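from random import shuffle, randint

mydict = {}
s = list("ABCDEFGHIJ")
for _ in range(1000):
    # random key: the first 1 to 7 letters of a fresh shuffle
    shuffle(s)
    key = "".join(s[:randint(1, 7)])
    # random value between 1 and 7 (a repeated key is simply overwritten)
    mydict[key] = randint(1, 7)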

Then the function itself:

def new_func(mydict):
    # group the keys by value
    keys_by_value = {}
    for k, v in mydict.items():
        keys_by_value.setdefault(v, []).append(k)

    newdict = {}
    for v, ks in keys_by_value.items():
        # for each value, find the superkeys
        for sk in find_superkeys(ks):
            # and add the mapping superkey -> value
            newdict[sk] = v

    return newdict
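For readers who want to see what the grouping step produces, here is a small sketch on made-up data. It uses collections.defaultdict instead of setdefault, which is only a stylistic alternative and builds the same structure:

```python
from collections import defaultdict

# Hypothetical sample data.
sample = {'A': 1, 'AB': 1, 'X': 2, 'XY': 2, 'Z': 3}

# Map each value to the list of keys that carry it.
keys_by_value = defaultdict(list)
for key, value in sample.items():
    keys_by_value[value].append(key)

print(dict(keys_by_value))
# {1: ['A', 'AB'], 2: ['X', 'XY'], 3: ['Z']}
```

Each list can then be searched for superkeys on its own, since keys with different values never eliminate one another.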

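Of course, the core of the program is the find_superkeys function. There is no magic here, but you can group the keys by length, because two keys of the same length are either equal or neither contains the other (maybe someone knows a faster way to do this?):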

def find_superkeys(ks):
    # group keys by length (-len so that sorting yields decreasing length)
    keys_by_neg_len = {}
    for k in ks:
        keys_by_neg_len.setdefault(-len(k), []).append(k)

    superkeys = []
    for _, group in sorted(keys_by_neg_len.items()):
        cur = []
        for k in group:
            # keep k only if it is not contained in any superkey found so far;
            # going by decreasing length avoids useless tests like 'AB' in 'A'
            if all(k not in k2 for k2 in superkeys):
                cur.append(k)

        # here, cur contains the superkeys of the current length
        superkeys += cur

    return superkeys

And the benchmark:

import timeit

assert original_func(mydict) == new_func(mydict)
print(timeit.timeit(lambda: original_func(mydict), number=100))
# 1.6872997790005684
print(timeit.timeit(lambda: new_func(mydict), number=100))
# 0.15774182000041037
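The assert in the benchmark only checks that the two functions agree with each other. As an independent check, one could compare them against a brute-force statement of the rule. The sketch below assumes the intended rule is "drop a key if a different key with the same value contains it as a substring"; the name brute_force_superkeys and the sample data are made up for illustration, and the function is quadratic, so it is meant for testing only:

```python
def brute_force_superkeys(mydict):
    # Keep a key only if no *other* key with the same value contains it.
    return {
        k: v
        for k, v in mydict.items()
        if not any(
            k != k2 and v == v2 and k in k2
            for k2, v2 in mydict.items()
        )
    }


# Hypothetical sample data.
sample = {'A': 1, 'B': 1, 'AB': 1, 'ABC': 1, 'X': 2, 'XY': 2, 'Z': 3}
print(brute_force_superkeys(sample))
# {'ABC': 1, 'XY': 2, 'Z': 3}
```

Under that assumption, original_func(mydict) and new_func(mydict) should both compare equal to brute_force_superkeys(mydict).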
