Python difflib gives poor results

Problem description

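I am using Python's difflib to compute the difference between two plain-text English paragraphs.

The paragraphs are very similar: one has an extra leading sentence and an extra trailing sentence, and there are a few minor differences in individual characters.

Unfortunately, I get very poor results. A single character at the start of the diff seems to throw it off, and seemingly random characters end up scattered through the rest of the output.

Sites such as diffchecker.com have no trouble computing this diff. I also noticed that if I narrow the window I give difflib so that it skips the first sentence, it computes the diff correctly. Has anyone else run into this?

My code and the sample paragraphs are below. Many thanks.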

import difflib

s1 = "Ableton Live also supports Audio To MIDI, which converts audio samples into a sequence of MIDI notes using three different conversion methods including conversion to Melody, Harmony, or Rhythm. Once finished, Live will create a new MIDI track containing the fresh MIDI notes along with an instrument to play back the notes. Audio to midi conversion is not always 100% accurate and may require the artist or producer to manually adjust some notes.[14] See Fourier transform.Envelopes[edit]Almost all of the parameters in Live can be automated by envelopes which may be drawn either on clips, in which case they will be used in every performance of that clip, or on the entire arrangement. The most obvious examples are volume and track panning, but envelopes are also used in Live to control parameters of audio devices such as the root note of a resonator or a filter’s cutoff frequency. Clip envelopes may also be mapped to MIDI controls, which can also control parameters in real-time using sliders, faders and such. Using the global transport record function will also record changes made to these parameters, creating an envelope for them.User interface[edit]Much of Live’s interface comes from being designed for use in live performance, as well as for production.[15] There are few pop up messages or dialogs. Portions of the interface are hidden and shown based on arrows which may be clicked to show or hide a certain segment (e.g. to hide the instrument/effect list or to show or hide the help box)."
s2 = "Once finished, Live will create a new MIDI track containing the fresh MIDI notes along with an instrument to play back the notes. Audio to midi conversion is not always 100% accurate and may require the artist or producer to manually adjust some notes. [14] See Fourier transform . Envelopes[ edit ] Almost all of the parameters in Live can be automated by envelopes which may be drawn either on clips, in which case they will be used in every performance of that clip, or on the entire arrangement. The most obvious examples are volume and track panning, but envelopes are also used in Live to control parameters of audio devices such as the root note of a resonator or a filter’s cutoff frequency. Clip envelopes may also be mapped to MIDI controls, which can also control parameters in real-time using sliders, faders and such. Using the global transport record function will also record changes made to these parameters, creating an envelope for them. User interface[ edit ] Much of Live’s interface comes from being designed for use in live performance, as well as for production."

if __name__ == "__main__":
    res = [d for d in difflib.ndiff(s1, s2)]
    print(res)

Solution

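As the documentation says,

Compare a and b (lists of strings) ... return a Differ-style delta (a generator generating the delta lines).

ndiff() is meant for, e.g., comparing two files, given the lists of lines the files contain. It works much like the common Unix-style diff utilities.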
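
To see why that matters for two plain strings, here is a tiny illustration (the inputs are made up for this sketch). A str is itself a sequence of one-character strings, so when ndiff() is handed two strings it treats every character as a separate "line".

```python
import difflib

# Hypothetical toy inputs. Each character plays the role of one "line".
entries = list(difflib.ndiff("cat", "cart"))
for entry in entries:
    print(repr(entry))
```

Each entry is a two-character marker followed by a single character, which is largely why the result in the question reads like a stream of loose characters.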

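You are instead trying to compare two individual lines. difflib has no built-in way to pretty-print that, but it does supply the comparison machinery, and you can build whatever format you like on top of it. For example,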

d = difflib.SequenceMatcher(None, s1, s2, autojunk=False)
for op in d.get_opcodes():
    print(op)

which prints

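('delete', 0, 194, 0, 0)
('equal', 194, 446, 0, 252)
('insert', 446, 446, 252, 253)
('equal', 446, 472, 253, 279)
('insert', 472, 472, 279, 280)
('equal', 472, 473, 280, 281)
('insert', 473, 473, 281, 282)
('equal', 473, 483, 282, 292)
('insert', 483, 483, 292, 293)
('equal', 483, 487, 293, 297)
('insert', 487, 487, 297, 298)
('equal', 487, 488, 298, 299)
('insert', 488, 488, 299, 300)
('equal', 488, 1143, 300, 955)
('insert', 1143, 1143, 955, 956)
('equal', 1143, 1158, 956, 971)
('insert', 1158, 1158, 971, 972)
('equal', 1158, 1162, 972, 976)
('insert', 1162, 1162, 976, 977)
('equal', 1162, 1163, 977, 978)
('insert', 1163, 1163, 978, 979)
('equal', 1163, 1269, 979, 1085)
('delete', 1269, 1508, 1085, 1085)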
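
A note on the autojunk argument in the call above. As far as I understand the documented heuristic, when the second sequence has at least 200 items, SequenceMatcher by default treats any item that accounts for more than about 1% of it as "popular" and stops using it to find matches. In ordinary English text that covers most letters and the space, which may be part of why the character-level diff in the question falls apart. The sketch below uses made-up inputs to show the effect; bpopular is the attribute that records the ignored items.

```python
import difflib

# Hypothetical inputs: b is long enough (>= 200 items) for the heuristic to apply.
a = "x" + "ab" * 150
b = "ab" * 150

with_heuristic = difflib.SequenceMatcher(None, a, b)  # autojunk defaults to True
without_heuristic = difflib.SequenceMatcher(None, a, b, autojunk=False)

# Characters that were dropped from the match index by the heuristic.
print(sorted(with_heuristic.bpopular))
# Similarity with and without the heuristic.
print(with_heuristic.ratio(), without_heuristic.ratio())
```

With the heuristic disabled, the long common run is found again and the similarity ratio goes up accordingly.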

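See the documentation for exactly what these opcode tuples mean. They tersely describe what is needed to change s1 into s2. The long exact-match block is described by ('equal', 488, 1143, 300, 955), and indeed,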

>>> s1[488 : 1143] == s2[300 : 955]
True
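
One way to convince yourself that the opcodes really do describe how to turn the first string into the second is to replay them. In this sketch, apply_opcodes is a hypothetical helper of my own (not part of difflib), and the inputs are a short excerpt in the style of the sample paragraphs.

```python
import difflib


def apply_opcodes(a, b, opcodes):
    """Rebuild b by replaying the opcodes against a."""
    pieces = []
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == "equal":
            pieces.append(a[i1:i2])      # unchanged text is copied from a
        elif tag in ("insert", "replace"):
            pieces.append(b[j1:j2])      # new text comes from b
        # "delete" contributes nothing to the result
    return "".join(pieces)


a = "some notes.[14] See Fourier transform.Envelopes[edit]Almost all"
b = "some notes. [14] See Fourier transform . Envelopes[ edit ] Almost all"
ops = difflib.SequenceMatcher(None, a, b, autojunk=False).get_opcodes()
rebuilt = apply_opcodes(a, b, ops)
print(rebuilt == b)
```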

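Suggestion: instead, split each of your two inputs into sentences and treat each input as a sequence (for example, a list) of newline-terminated sentences. Then you can use ndiff() directly, the way it was intended to be used.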
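
A minimal sketch of that suggestion, assuming a deliberately naive sentence splitter. The regular expression and the helper name to_sentences are my own, and the sample strings are invented. Real text in which a sentence can end without a following space would need a more careful splitter.

```python
import difflib
import re


def to_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # ndiff() expects newline-terminated "lines".
    return [part + "\n" for part in parts if part]


# Hypothetical inputs: the first has an extra leading and trailing sentence.
first = "Intro sentence. Shared one. Shared two. Outro sentence."
second = "Shared one. Shared two."

diff = list(difflib.ndiff(to_sentences(first), to_sentences(second)))
print("".join(diff), end="")
```

The two extra sentences come out marked with "- " and the shared ones with two leading spaces, which is the line-oriented view ndiff() was designed to give.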

To make the other approach more concrete, consider this code:

import difflib

# s1 and s2 are the two paragraphs defined in the question.
d = difflib.SequenceMatcher(None, s1, s2, autojunk=False)
for op, i1, i2, j1, j2 in d.get_opcodes():
    print(">>> ", end="")
    if op == "equal":
        print(f"{i2-i1} characters the same at", f"{i1}:{i2} and {j1}:{j2}")
        print(s1[i1:i2])
    elif op == "delete":
        print(f"delete {i2-i1} characters at {i1}:{i2}")
        print(s1[i1:i2])
    elif op == "insert":
        print(f"insert {j2-j1} characters from {j1}:{j2}")
        print(s2[j1:j2])
    elif op == "replace":
        print(f"replace {i1}:{i2} with {j1}:{j2}")
        print(s1[i1:i2])
        print(s2[j1:j2])
    else:
        assert False, ("unknown op", repr(op))

produces this output:

>>> delete 194 characters at 0:194
Ableton Live also supports Audio To MIDI, which converts audio samples into a sequence of MIDI notes using three different conversion methods including conversion to Melody, Harmony, or Rhythm. 
>>> 252 characters the same at 194:446 and 0:252
Once finished, Live will create a new MIDI track containing the fresh MIDI notes along with an instrument to play back the notes. Audio to midi conversion is not always 100% accurate and may require the artist or producer to manually adjust some notes.
>>> insert 1 characters from 252:253
 
>>> 26 characters the same at 446:472 and 253:279
[14] See Fourier transform
>>> insert 1 characters from 279:280
 
>>> 1 characters the same at 472:473 and 280:281
.
>>> insert 1 characters from 281:282
 
>>> 10 characters the same at 473:483 and 282:292
Envelopes[
>>> insert 1 characters from 292:293
 
>>> 4 characters the same at 483:487 and 293:297
edit
>>> insert 1 characters from 297:298
 
>>> 1 characters the same at 487:488 and 298:299
]
>>> insert 1 characters from 299:300
 
>>> 655 characters the same at 488:1143 and 300:955
Almost all of the parameters in Live can be automated by envelopes which may be drawn either on clips, in which case they will be used in every performance of that clip, or on the entire arrangement. The most obvious examples are volume and track panning, but envelopes are also used in Live to control parameters of audio devices such as the root note of a resonator or a filter’s cutoff frequency. Clip envelopes may also be mapped to MIDI controls, which can also control parameters in real-time using sliders, faders and such. Using the global transport record function will also record changes made to these parameters, creating an envelope for them.
>>> insert 1 characters from 955:956
 
>>> 15 characters the same at 1143:1158 and 956:971
User interface[
>>> insert 1 characters from 971:972
 
>>> 4 characters the same at 1158:1162 and 972:976
edit
>>> insert 1 characters from 976:977
 
>>> 1 characters the same at 1162:1163 and 977:978
]
>>> insert 1 characters from 978:979
 
>>> 106 characters the same at 1163:1269 and 979:1085
Much of Live’s interface comes from being designed for use in live performance, as well as for production.
>>> delete 239 characters at 1269:1508
[15] There are few pop up messages or dialogs. Portions of the interface are hidden and shown based on arrows which may be clicked to show or hide a certain segment (e.g. to hide the instrument/effect list or to show or hide the help box).

You can edit that template to display the results in whatever way you like best.
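
For instance, here is one possible variation, sketched under my own assumptions: the function name inline_diff and the [-...-] and {+...+} markers are choices made for this example, not anything difflib provides. It folds the same opcodes into a single marked-up string.

```python
import difflib


def inline_diff(a, b):
    """Return one string with [-deleted-] and {+inserted+} markers."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.append(a[i1:i2])
        if op in ("delete", "replace"):
            out.append(f"[-{a[i1:i2]}-]")
        if op in ("insert", "replace"):
            out.append(f"{{+{b[j1:j2]}+}}")
    return "".join(out)


# Hypothetical short inputs in the style of the sample paragraphs.
marked = inline_diff("Intro. Same text.[1] End", "Same text. [1] End")
print(marked)
```

Because everything stays in one string, this form is easy to drop into a log message or to post-process into HTML.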