How to ignore white space changes using difflib?
Duncan Booth
duncan.booth at invalid.invalid
Wed Apr 8 13:16:13 EDT 2009
Grant Edwards <invalid at invalid> wrote:
> Apparently that "filtering out" characters doesn't mean that
> they're ignored when doing the comparison. (A bit of a "WTF?"
> if you ask me). After some more googling, it appears that I'm
> far from the first person who interpreted "filtered out" as
> "ignored when comparing lines". I'd submit a fix for the doc
> page, but you apparently have to be a lot smarter than me to
> figure out what "filters out" means in this context.
So far as I can see from looking at the code:
Once one block of lines has been identified as having been replaced by
another, the matcher can give you additional information by marking up
the changes within each line. However, it only makes sense to do that if the
lines are still somewhat similar.
'charjunk' is used to remove junk characters before scanning the lines
within a replacement block and the most similar lines (if they are
sufficiently similar) are then chosen for this extra step of comparing the
character changes within the line.
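To tie this back to the original question, here is a minimal sketch (my own illustration, not taken from the difflib source) of what that means for whitespace: passing a charjunk function does not hide a whitespace-only change, so if you really want such changes ignored, one option is to normalize the lines yourself before comparing. The helper normalize_ws is a hypothetical name, not part of difflib.

```python
import difflib

def normalize_ws(line):
    # Hypothetical helper (not part of difflib): collapse each run of
    # whitespace to a single space and strip both ends of the line.
    return ' '.join(line.split())

old = ['x = 1\n', 'y = 2\n']
new = ['x  =  1\n', 'y = 2\n']   # first line differs only in spacing

# charjunk only influences which lines get paired up for the intraline
# markup, so the whitespace-only change still shows up as a '-'/'+' pair.
differ = difflib.Differ(charjunk=difflib.IS_CHARACTER_JUNK)
raw_diff = list(differ.compare(old, new))
print(''.join(raw_diff))

# Comparing normalized copies instead: lines that differ only in
# whitespace become equal, so no line is marked as changed.
norm_diff = list(differ.compare([normalize_ws(l) for l in old],
                                [normalize_ws(l) for l in new]))
print('\n'.join(norm_diff))
```

The obvious cost of this approach is that the second diff shows the normalized text rather than the original lines, so it is only suitable when that is acceptable.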
Here's an example. If I do this:
>>> from difflib import Differ
>>> print(''.join(Differ().compare('one\ntwo\nthree\n'.splitlines(True),
...                                'one\nwot\ntoo\nthree\n'.splitlines(True))))
  one
- two
? -
+ wot
?   +
+ too
  three
The comparison detected that "two" was replaced by two lines, "wot" and "too".
It decided the first of these was the best match for the original line, so
it shows the character-level differences between the original and the first
replacement line.
>>> print(''.join(Differ(charjunk=lambda c: c == 'w')
...               .compare('one\ntwo\nthree\n'.splitlines(True),
...                        'one\nwot\ntoo\nthree\n'.splitlines(True))))
  one
+ wot
- two
?  ^
+ too
?  ^
  three
This time we told the system that we don't care about 'w' in either the
original or replacement text. That means instead of seeing which of "wot"
and "too" is closest to "two" it looks to see which of "ot" and "too" is
closest to "to". "ot" has two changes but "too" only has one, so this time
it does the detailed comparison between the original line and the second
replacement line.
N.B. The junk function is only used to decide which lines to use for the
detailed comparison: the original lines are still used for the comparison
itself.
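The choice between the two candidate lines can be reproduced outside Differ with a rough sketch like the one below. It assumes, from my reading of the code, that Differ scores each candidate with SequenceMatcher.ratio(), passing the original line as the first sequence and the candidate as the second. The names similarity and is_w are mine, for illustration only.

```python
from difflib import SequenceMatcher

def similarity(original, candidate, charjunk=None):
    # Assumed to mirror what Differ does internally: the original line is
    # sequence 1, the candidate replacement line is sequence 2.
    return SequenceMatcher(charjunk, original, candidate).ratio()

def is_w(c):
    # The junk function from the second example: treat 'w' as junk.
    return c == 'w'

original = 'two\n'
candidates = ['wot\n', 'too\n']

for cand in candidates:
    print('%r  no junk: %.2f  w as junk: %.2f' % (
        cand,
        similarity(original, cand),         # scores behind the first example
        similarity(original, cand, is_w)))  # scores behind the second example
```

Without a junk function both candidates should score the same, so the first one ("wot") is kept. With 'w' as junk, "wot" should score lower than "too", which matches the switch seen in the second diff.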