[issue35955] difflib reports incorrect location of mismatch

Mon Feb 11 13:46:15 EST 2019

Tim Peters <tim at python.org> added the comment:

difflib generally syncs on the longest contiguous matching subsequence that doesn't contain a "junk" element.  By default, `ndiff()`'s optional `charjunk` argument is `IS_CHARACTER_JUNK`, which treats blanks and tabs as junk characters.

In the strings:

"drwxrwxr-x 2 2000  2000\n"
"drwxr-xr-x 2 2000  2000\n"

the longest matching substring not containing whitespace is "rwxr-x", of length 6, starting at index 4 in the first string and at index 1 in the second.  So it's aligning the strings like so:

"drwxrwxr-x 2 2000  2000\n"
   "drwxr-xr-x 2 2000  2000\n"
     123456

That's why it wants to delete the 1:4 slice ("rwx") in the first string and insert "r-x" after the longest matching substring.
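
For anyone who wants to check that alignment, here is a minimal sketch using `SequenceMatcher` directly with the same junk rule.  The variable names are only illustrative; `IS_CHARACTER_JUNK` is the function `ndiff()` uses as its `charjunk` default.

```python
from difflib import IS_CHARACTER_JUNK, SequenceMatcher

a = "drwxrwxr-x 2 2000  2000\n"
b = "drwxr-xr-x 2 2000  2000\n"

# Same junk rule as ndiff()'s default charjunk: blanks and tabs are junk.
sm = SequenceMatcher(IS_CHARACTER_JUNK, a, b)

# Longest junk-free match over the full strings.
m = sm.find_longest_match(0, len(a), 0, len(b))
print(m, repr(a[m.a:m.a + m.size]))

# The non-"equal" opcodes are the edits derived from that synch point.
edits = [(tag, a[i1:i2], b[j1:j2])
         for tag, i1, i2, j1, j2 in sm.get_opcodes()
         if tag != "equal"]
print(edits)
```

The match should come back as a=4, b=1, size=6 ("rwxr-x"), and the only edits should be a delete of "rwx" and an insert of "r-x".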

The default is aimed at improving results for human-readable text, like prose and Python code, where stuff between whitespace is often read "as a whole" (words, keywords, identifiers, ...).

For cases like this one, where character-by-character differences are important, it's often better to pass `charjunk=None`.  Then the longest matching substring is "xr-x 2 2000  2000\n" (length 18) at the tail end of both strings, and you get the output you're expecting.
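
A small sketch comparing the two settings on the strings from this report; the names `old`, `new`, `default_diff` and `exact_diff` are just for illustration.

```python
from difflib import SequenceMatcher, ndiff

old = ["drwxrwxr-x 2 2000  2000\n"]
new = ["drwxr-xr-x 2 2000  2000\n"]

# With no junk rule, the long shared tail becomes the synch point.
tail = SequenceMatcher(None, old[0], new[0]).find_longest_match(
    0, len(old[0]), 0, len(new[0]))
print(tail)

# Default charjunk (blanks and tabs are junk) versus charjunk=None.
default_diff = list(ndiff(old, new))
exact_diff = list(ndiff(old, new, charjunk=None))

print("".join(default_diff), end="")
print("".join(exact_diff), end="")
```

Under the default, the "?" guide lines mark a "---" deletion and a "+++" insertion; with `charjunk=None` each guide line carries a single "^" under the one character that changed.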

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue35955>
_______________________________________