Faster reimplementation for finding unique keys in Python

Problem description

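I have a code snippet that finds unique keys based on their values: for every key s that is contained in another key D with the same value, key s is discarded and key D is kept.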

Input: mydict = {'C': 4, 'A': 4, 'B': 4, 'CA': 3, 'AB': 4, 'BC': 4, 'ABC': 3}

Output: {'CA': 3, 'ABC': 3}

newdict = {}

for key1, value1 in mydict.items():
  for key2, value2 in mydict.items():
    if ((key1 in key2) and (value1 == value2)):
      if key1 in newdict:
        del newdict[key1]
      newdict[key2] = value2

print(newdict)

I would like a faster computation time and lower memory consumption in Python 3.

Can anyone help?

Solution

First, there is a bug in your implementation. Take mydict = {'AB': 1, 'A': 1} and assume the items are read in the order written:

  • First step: key1 = 'AB', value1 = 1

    • key2 = 'AB', value2 = 1: we have key1 in key2 and value1 == value2, so newdict['AB'] = 1.
    • key2 = 'A', value2 = 1: key1 in key2 is false, so newdict is still {'AB': 1}.
  • Second step: key1 = 'A', value1 = 1

    • key2 = 'AB', value2 = 1: we have key1 in key2 and value1 == value2. Since key1 is not in newdict, nothing is deleted. Once again, newdict['AB'] = 1.
    • key2 = 'A', value2 = 1: we have key1 in key2 and value1 == value2. Since key1 is not in newdict, nothing is deleted. Then newdict['A'] = 1.

Conclusion: newdict = {'AB': 1, 'A': 1}, but it should be {'AB': 1}.
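The walkthrough above can be reproduced directly. This is a minimal sketch: `question_loop` is a hypothetical name for a function that simply wraps the loop from the question, and it relies on dicts preserving insertion order (Python 3.7+).

```python
def question_loop(mydict):
    # hypothetical wrapper around the loop from the question, logic unchanged
    newdict = {}
    for key1, value1 in mydict.items():
        for key2, value2 in mydict.items():
            if key1 in key2 and value1 == value2:
                if key1 in newdict:
                    del newdict[key1]
                newdict[key2] = value2
    return newdict

# same items, different insertion order, different results
print(question_loop({'AB': 1, 'A': 1}))  # {'AB': 1, 'A': 1}
print(question_loop({'A': 1, 'AB': 1}))  # {'AB': 1}
```

The result depends on iteration order, which is exactly the symptom described in the two steps above.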

To fix this, you must sort the keys by increasing length, which ensures that the shorter keys are discarded:

def original_func(mydict):
    newdict = {}
    # sort by increasing key length
    items = sorted(mydict.items(), key=lambda i: len(i[0]))

    for key1, value1 in items:
        for key2, value2 in items:
            if key1 in key2 and value1 == value2:
                if key1 in newdict:
                    del newdict[key1]
                newdict[key2] = value2

    return newdict

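Second, the time complexity of the algorithm is O(n^2 * K), where K is the length of the longest key. I assume your real use case involves a much larger dictionary. You can improve on this by:

  1. comparing only keys that have the same value;
  2. comparing each key only against the superkeys found so far, rather than against all keys.

If you have V distinct values and an average of S < n/V superkeys sharing the same value, the average time complexity will be roughly O(V * S^2 * K). Unless you are in a degenerate case, this will be faster.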
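To make the difference concrete, the sketch below counts substring tests instead of measuring time. The function names and the sample dictionary are illustrative assumptions, and `count_grouped` is a simplified stand-in for the two improvements listed above, not the exact implementation used in this answer.

```python
from collections import defaultdict

def count_naive(mydict):
    # the double loop examines every ordered pair of keys
    return len(mydict) ** 2

def count_grouped(mydict):
    # improvement 1: only keys with the same value are compared
    groups = defaultdict(list)
    for k, v in mydict.items():
        groups[v].append(k)

    comparisons = 0
    for keys in groups.values():
        superkeys = []
        # improvement 2: longest keys first, compare against superkeys only
        for k in sorted(keys, key=len, reverse=True):
            contained = False
            for sk in superkeys:
                comparisons += 1
                if k in sk:
                    contained = True
                    break
            if not contained:
                superkeys.append(k)
    return comparisons

sample = {'A': 1, 'AB': 1, 'B': 2, 'BC': 2, 'ABC': 1, 'D': 2}
print(count_naive(sample))    # 36
print(count_grouped(sample))  # 4
```

On this small input the grouped approach performs 4 substring tests where the double loop visits 36 pairs; the gap should widen as the dictionary grows, provided the number of superkeys per value stays small.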

To test this, let's build a large dictionary:

from random import shuffle, randint

mydict = {}
s = list("ABCDEFGHIJ")
for i in range(1000):
    shuffle(s)
    key = "".join(s[:randint(1, 7)])
    mydict[key] = randint(1, 7)
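If you want the benchmark to be repeatable between runs, a seeded variant of the generator above might look like the following. The function name `make_test_dict` and its parameters are assumptions made for this sketch; it uses a private `random.Random` instance so the global random state is left untouched.

```python
from random import Random

def make_test_dict(n_draws=1000, seed=0):
    # hypothetical helper: same construction as above, but deterministic
    rng = Random(seed)
    letters = list("ABCDEFGHIJ")
    result = {}
    for _ in range(n_draws):
        rng.shuffle(letters)
        key = "".join(letters[:rng.randint(1, 7)])
        result[key] = rng.randint(1, 7)
    return result

# the same seed always yields the same dictionary
assert make_test_dict(seed=42) == make_test_dict(seed=42)
```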

Next, the function itself:

def new_func(mydict):
    # group the keys by value
    keys_by_value = {}
    for k, v in mydict.items():
        keys_by_value.setdefault(v, []).append(k)

    newdict = {}
    for v, ks in keys_by_value.items():
        # for each value, find the superkeys
        for sk in find_superkeys(ks):
            # and add the mapping superkey -> value
            newdict[sk] = v

    return newdict

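Of course, the core of the program is the find_superkeys function. There is no magic here, but you can group the keys by length, because two keys of the same length are either equal or neither contains the other (maybe someone knows a faster way to do this?):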

def find_superkeys(ks):
    # group keys by len (-len to sort by decreasing len)
    keys_by_neg_len = {}
    for k in ks:
        keys_by_neg_len.setdefault(-len(k), []).append(k)

    superkeys = []
    for _, same_len_keys in sorted(keys_by_neg_len.items()):
        cur = []
        for k in same_len_keys:
            # keep k if k is not contained in any of the superkeys;
            # by using a decreasing len, we avoid useless tests like 'AB' in 'A'
            if all(k not in k2 for k2 in superkeys):
                cur.append(k)

        # here, cur contains the superkeys of a given len
        superkeys += cur

    return superkeys
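For a quick feel of what this step returns, here is a compact variant of the same idea, written as a sketch. `find_superkeys_sorted` is a hypothetical name; it sorts once by decreasing length instead of grouping by length, and it assumes the keys are distinct, as dictionary keys always are.

```python
def find_superkeys_sorted(ks):
    # hypothetical compact variant: scan keys from longest to shortest
    superkeys = []
    for k in sorted(ks, key=len, reverse=True):
        # a key survives only if no already-kept key contains it
        if not any(k in sk for sk in superkeys):
            superkeys.append(k)
    return superkeys

print(find_superkeys_sorted(['C', 'A', 'B', 'AB', 'BC']))  # ['AB', 'BC']
```

Unlike the grouped version, this one also compares keys of equal length with each other. Those tests never succeed for distinct keys, so the result is the same, at the cost of a few extra comparisons.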

And the benchmark:

import timeit

assert original_func(mydict) == new_func(mydict)
print(timeit.timeit(lambda: original_func(mydict), number=100))
# 1.6872997790005684
print(timeit.timeit(lambda: new_func(mydict), number=100))
# 0.15774182000041037
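The assert in the benchmark only checks that the two implementations agree with each other. If you also want an independent check, a brute-force reference written straight from the problem statement can help. This is a sketch under the assumption that "contained in" means Python's substring test; `reference_unique_keys` is a hypothetical name.

```python
def reference_unique_keys(mydict):
    # keep a key unless a different key with the same value contains it
    return {
        k: v
        for k, v in mydict.items()
        if not any(k != other and v == w and k in other
                   for other, w in mydict.items())
    }

sample = {'AB': 1, 'A': 1, 'B': 2}
print(reference_unique_keys(sample))  # {'AB': 1, 'B': 2}
```

It is quadratic like the original, so it is only meant for comparing results on small or medium inputs, not for timing.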