[New-bugs-announce] [issue46667] SequenceMatcher & autojunk - false negative

Jonathan report at bugs.python.org
Sun Feb 6 16:58:45 EST 2022


New submission from Jonathan <bugreports at lightpear.com>:

The following two strings are identical other than the text "UNIQUESTRING".
UNIQUESTRING is at the start of first and at the end of second.
Running the below gives the following output:


0.99830220713073
0.99830220713073
0.023769100169779286  # ratio

0.99830220713073
0.99830220713073
0.023769100169779286  # ratio

As you can see, Ratio is basically 0. Remove either of the UNIQUESTRING pieces and it goes up to 0.98 (correct)... Remove both and you get 1.0 (correct)


```
from difflib import SequenceMatcher

first = """
UNIQUESTRING
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum
"""


second = """

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum  UNIQUESTRING
"""

sm = SequenceMatcher(None, first, second, autojunk=False)
print(sm.real_quick_ratio())
print(sm.quick_ratio())
print(sm.ratio())

print()

sm2 = SequenceMatcher(None, second, first, autojunk=False)
print(sm2.real_quick_ratio())
print(sm2.quick_ratio())
print(sm2.ratio())

```

If I add `autojunk=False`, then I get a correct looking ratio (0.98...), however from my reading of the autojunk docs, UNIQUESTRING shouldn't be triggering it. Furthermore, looking in the code, as far as I can see autojunk is having no effect...

Autojunk considers these items to be "popular" in that string:
`{'n', 'p', 'a', 'h', 'e', 'u', 'I', 'r', 'k', 'g', 'y', 'm', 'c', 'd', 't', 'l', 'o', 's', ' ', 'i'}`

If I remove UNIQUESTRING from `first`, this is the autojunk popular set:
`{'c', 'p', 'a', 'u', 'r', 'm', 'k', 'g', 'I', 'd', ' ', 'o', 'h', 't', 'e', 'i', 'l', 's', 'y', 'n'}`

They're identical!

In both scenarios, `b2j` is also identical.

I don't pretend to understand what the module is doing in any detail, but this certainly seems like a false positive/negative.

Python 3.8.10

----------
components: Library (Lib)
messages: 412673
nosy: jonathan-lp
priority: normal
severity: normal
status: open
title: SequenceMatcher & autojunk - false negative
type: behavior
versions: Python 3.8

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue46667>
_______________________________________


More information about the New-bugs-announce mailing list