[issue35955] difflib reports incorrect location of mismatch
Tim Peters
report at bugs.python.org
Mon Feb 11 13:46:15 EST 2019
Tim Peters <tim at python.org> added the comment:
difflib generally synchs on the longest contiguous matching subsequence that doesn't contain a "junk" element. By default, `ndiff()`'s optional `charjunk` argument considers blanks and tabs to be junk characters.
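To see what the default treats as junk, here is a small sketch that calls the `difflib.IS_CHARACTER_JUNK` predicate directly. The `inspect` check is only there to confirm that it is the default for `charjunk`; the test characters are arbitrary.

```python
import inspect
from difflib import IS_CHARACTER_JUNK, ndiff

# ndiff() uses IS_CHARACTER_JUNK as the default for its charjunk argument.
default_charjunk = inspect.signature(ndiff).parameters["charjunk"].default
print(default_charjunk is IS_CHARACTER_JUNK)

# Blanks and tabs are junk; a newline and ordinary characters are not.
for ch in (" ", "\t", "\n", "x"):
    print(repr(ch), IS_CHARACTER_JUNK(ch))
```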
In the strings:
"drwxrwxr-x 2 2000 2000\n"
"drwxr-xr-x 2 2000 2000\n"
the longest matching substring not containing whitespace is "rwxr-x", of length 6, starting at index 4 in the first string and at index 1 in the second. So it's aligning the strings like so:
"drwxrwxr-x 2 2000 2000\n"
   "drwxr-xr-x 2 2000 2000\n"
     123456
That's why it wants to delete the 1:4 slice ("rwx") in the first string and insert "r-x" after the longest matching substring.
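The alignment can be checked with `SequenceMatcher` directly. This is a rough sketch that passes `IS_CHARACTER_JUNK` as the junk predicate, on the assumption that it mirrors what `ndiff()` does for intraline comparison by default. The variable names are illustrative only.

```python
from difflib import SequenceMatcher, IS_CHARACTER_JUNK

a = "drwxrwxr-x 2 2000 2000\n"
b = "drwxr-xr-x 2 2000 2000\n"

# Treat blanks and tabs as junk, as ndiff() does by default.
matcher = SequenceMatcher(IS_CHARACTER_JUNK, a, b)

# Longest matching block over the full range of both strings.
longest = matcher.find_longest_match(0, len(a), 0, len(b))
print(longest)  # Match(a=4, b=1, size=6)
print(repr(a[longest.a:longest.a + longest.size]))  # 'rwxr-x'

# Collect the non-equal opcodes: what gets deleted from a and inserted from b.
changes = [
    (tag, a[i1:i2], b[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]
print(changes)  # [('delete', 'rwx', ''), ('insert', '', 'r-x')]
```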
The default is aimed at improving results for human-readable text, like prose and Python code, where stuff between whitespace is often read "as a whole" (words, keywords, identifiers, ...).
For cases like this one, where character-by-character differences are important, it's often better to pass `charjunk=None`. Then the longest matching substring is "xr-x 2 2000 2000" at the tail end of both strings, and you get the output you're expecting.
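A minimal sketch of the difference, using the same two strings. With no junk predicate the longest match is the shared tail; its reported size counts the trailing newline as well. The exact `?` guide lines are left to the printout rather than asserted here, but with `charjunk=None` they should mark a single differing character.

```python
from difflib import SequenceMatcher, ndiff

a = "drwxrwxr-x 2 2000 2000\n"
b = "drwxr-xr-x 2 2000 2000\n"

# No junk at all: every character, including blanks, can anchor a match.
plain = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
print(plain)  # Match(a=6, b=6, size=17)
print(repr(a[plain.a:plain.a + plain.size]))

# ndiff() takes sequences of lines, so wrap each string in a list.
default_diff = list(ndiff([a], [b]))
char_diff = list(ndiff([a], [b], charjunk=None))

print("".join(default_diff))
print("".join(char_diff))
```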
----------
_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue35955>
_______________________________________