[Python-ideas] enhance filecmp to support text-and-universal-newline-mode file comparison

Fri Jun 19 18:20:20 CEST 2009

On Thu, 18 Jun 2009 11:04:34 -0300, zhong nanhai
<higerinbeijing-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org> wrote:

> We often use programs or scripts to generate useful output files, but
> these files may come from different OS platforms.
>
> Comparing them line by line may seem trivial, and we can use the
> filecmp module (the cmp function) to do so. But when we try to compare
> two text files from different platforms, e.g. Ubuntu and Windows,
> filecmp.cmp returns False even though the two files contain the same
> content. The reason is that each platform marks the end of a line
> differently: '\r\n' on Windows, '\n' on Unix, '\r' on classic Mac OS,
> etc.
>
> So is it a good idea to enhance filecmp to support universal-newline
> mode? If so, we could compare files from different operating systems,
> and filecmp.cmp would return True if they have the same content.

With aid from itertools.izip_longest, it's a one-line recipe:

py> print repr(open("one.txt","rb").read())
'hello\nworld!\nlast line\n'
py> print repr(open("two.txt","rb").read())
'hello\r\nworld!\r\nlast line\r\n'
py> import filecmp
py> filecmp.cmp("one.txt", "two.txt", False)
False
py> from itertools import izip_longest
py> f1 = open("one.txt", "rU")
py> f2 = open("two.txt", "rU")
py>
py> print all(line1==line2 for line1,line2 in izip_longest(f1,f2))
True
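
The session above is Python 2. On Python 3 the same recipe could look
roughly like the sketch below, assuming both files are in the same known
encoding. There izip_longest is spelled zip_longest, and text mode already
translates newlines when newline=None (the default), so the "U" flag is
not needed. The function name same_text and its encoding parameter are
made up for illustration.

```python
from itertools import zip_longest


def same_text(path1, path2, encoding="utf-8"):
    """Return True if both files hold the same lines, ignoring newline style.

    Hypothetical helper, not part of filecmp.
    """
    # newline=None turns on universal newlines: '\r\n' and '\r' are
    # translated to '\n' while reading.
    with open(path1, "r", encoding=encoding, newline=None) as f1, \
         open(path2, "r", encoding=encoding, newline=None) as f2:
        # zip_longest pads the shorter file with None, so a difference in
        # line count produces a (line, None) pair that compares unequal.
        return all(a == b for a, b in zip_longest(f1, f2))
```

Note that a file whose last line lacks a trailing newline still compares
unequal to one that has it, since the final lines differ.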

Currently filecmp considers both files as binary, not text; if they differ
in size they're considered different and the contents are not even read.
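
To make that behaviour concrete, here is a simplified sketch of a binary
comparison with a size shortcut. It is not the actual filecmp source (the
real module also looks at file type and modification time and caches
results); the name binary_equal and the bufsize default are assumptions
for illustration only.

```python
import os


def binary_equal(path1, path2, bufsize=8192):
    """Compare two files byte for byte, checking sizes first.

    Simplified illustration; not the real filecmp implementation.
    """
    # Different sizes can never be byte-identical, so return early
    # without opening either file.
    if os.path.getsize(path1) != os.path.getsize(path2):
        return False
    with open(path1, "rb") as f1, open(path2, "rb") as f2:
        while True:
            chunk1 = f1.read(bufsize)
            chunk2 = f2.read(bufsize)
            if chunk1 != chunk2:
                return False
            if not chunk1:
                # Both files reached end of file with no difference.
                return True
```

With one.txt and two.txt from the session above, the '\r\n' file is three
bytes longer, so the size test alone already reports a difference.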

If you want a generic text-mode file comparison, there are other factors
to consider in addition to line endings: character encoding, BOM,
character case, whitespace... All of those may be considered "irrelevant
differences" by some people. A generic text file comparison should take
all of them into account.
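
As a rough Python 3 sketch of how a few of those factors could become
options: the code below normalizes each line before comparing. All names
(normalized_lines, text_equal, ignore_case, strip_trailing_ws) are
hypothetical, and the choices made here (the "utf-8-sig" codec to skip a
UTF-8 BOM, casefold for case, trailing whitespace only) are just one
possible reading of "irrelevant differences".

```python
from itertools import zip_longest


def normalized_lines(path, encoding="utf-8-sig",
                     ignore_case=False, strip_trailing_ws=False):
    """Yield the lines of a text file after optional normalization."""
    # "utf-8-sig" decodes UTF-8 with or without a leading BOM;
    # newline=None reads '\r\n' and '\r' as '\n'.
    with open(path, "r", encoding=encoding, newline=None) as f:
        for line in f:
            # Drop the line terminator, so a missing final newline
            # is also treated as irrelevant.
            line = line.rstrip("\n")
            if strip_trailing_ws:
                line = line.rstrip()
            if ignore_case:
                line = line.casefold()
            yield line


def text_equal(path1, path2, **options):
    """Compare two text files line by line after normalization."""
    lines1 = normalized_lines(path1, **options)
    lines2 = normalized_lines(path2, **options)
    # None pads the shorter file; it never equals a str, even an empty one.
    return all(a == b for a, b in zip_longest(lines1, lines2))
```

For example, text_equal("one.txt", "two.txt", ignore_case=True) would
treat "Hello" and "hello" as the same line, while the default call only
ignores newline style, a UTF-8 BOM, and a missing final newline.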

-- 
Gabriel Genellina