Using difflib to compare text ignoring whitespace differences
Gabriel Genellina
gagsl-py at yahoo.com.ar
Wed Dec 20 23:52:41 EST 2006
On 19 Dec, 11:53, Neilen Marais <nmar... at sun.ac.za> wrote:
> Hi
>
> I'm trying to compare some text to find differences other than whitespace.
> I seem to be misunderstanding something, since I can't even get a basic
> example to work:
>
> In [104]: d = difflib.Differ(charjunk=difflib.IS_CHARACTER_JUNK)
>
> In [105]: list(d.compare([' a'], ['a']))
> Out[105]: ['-  a', '+ a']
>
> Surely if whitespace characters are being ignored those two strings should
> be marked as identical? What am I doing wrong?
The docs for Differ are a bit terse and misleading.
compare() does a two-level matching: first, on a *line* level,
considering only the linejunk parameter. Then, for each pair of
similar lines found in the first stage, it does an intraline match
considering only the charjunk parameter.
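A rough sketch of why the two-line example in the question never reaches the intraline stage. It assumes CPython's difflib, where two lines count as "similar" only when their ratio() is above roughly 0.75; that cutoff is an implementation detail, not documented API. The helper name has_intraline_hints is made up for illustration.

```python
import difflib

d = difflib.Differ(charjunk=difflib.IS_CHARACTER_JUNK)

def has_intraline_hints(old_lines, new_lines):
    """Hypothetical helper: True if compare() emitted any '? ' guide line."""
    return any(line.startswith("? ") for line in d.compare(old_lines, new_lines))

# Similarity of each pair, measured the same way on a character level.
short_ratio = difflib.SequenceMatcher(
    difflib.IS_CHARACTER_JUNK, " a", "a").ratio()
long_ratio = difflib.SequenceMatcher(
    difflib.IS_CHARACTER_JUNK, " a larger line", "a longer line").ratio()

# The short pair is too dissimilar, so it is reported as a plain -/+ pair.
print(short_ratio, has_intraline_hints([" a"], ["a"]))
# The longer pair is similar enough to get the intraline treatment.
print(long_ratio, has_intraline_hints([" a larger line"], ["a longer line"]))
```

So with ' a' against 'a' only one of three characters matches, and charjunk is never consulted at all.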
Also note that junk != ignored: the algorithm tries to "find the longest
contiguous matching subsequence that contains no ``junk'' elements".
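To see that junk characters are still reported, here is a small sketch using SequenceMatcher directly on two strings. The normalize() function at the end is not part of difflib; it is just one possible workaround (an assumption on my part about what is wanted) for when whitespace should really be ignored.

```python
import difflib

a = " a larger line"
b = "a longer line"

# Character-level matcher with blanks and tabs declared as junk.
sm = difflib.SequenceMatcher(difflib.IS_CHARACTER_JUNK, a, b)
opcodes = sm.get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    # The leading blank still shows up as a 'delete', junk or not.
    print(tag, repr(a[i1:i2]), repr(b[j1:j2]))

def normalize(line):
    """Hypothetical helper: strip the ends and collapse whitespace runs."""
    return " ".join(line.split())

# After normalizing, the two lines from the question compare equal.
same = normalize(" a") == normalize("a")
print(same)
```

Marking a character as junk only changes where matches may start; feeding normalized lines to compare() is what makes whitespace-only differences disappear.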
Using a slightly longer text gets closer to what you want, I think:
import difflib

d = difflib.Differ(charjunk=difflib.IS_CHARACTER_JUNK)
for delta in d.compare(['   a larger line'], ['a longer line']):
    print(delta)

-    a larger line
? ---   ^^
+ a longer line
?    ^^
--
Gabriel Genellina