enhance filecmp to support text-and-universal-newline-mode file comparison

Hello everyone,

We often use programs or scripts to generate useful output files, and these files may come from different OS platforms. Comparing them by reading them line by line seems a bit tedious, so we can use the filecmp module (its cmp function) to do it. But when we compare two text files from different platforms, e.g. Ubuntu and Windows, filecmp.cmp returns False even though the two files contain the same content. The reason is that platforms mark the end of a line differently: '\r\n' on Windows, '\n' on Unix, '\r' on classic Mac OS, etc.

So, is it a good idea to enhance filecmp to support universal-newline mode? If so, we could compare files from different operating systems, and filecmp.cmp would return True whenever they have the same content.

I hope everyone can give some advice about this idea.

Thanks in advance,
higer

On Thu, 18 Jun 2009 11:04:34 -0300, zhong nanhai <higerinbeijing-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> We often use programs or scripts to generate useful output files, and these files may come from different OS platforms.
> Comparing them by reading them line by line seems a bit tedious, so we can use the filecmp module (its cmp function) to do it. But when we compare two text files from different platforms, e.g. Ubuntu and Windows, filecmp.cmp returns False even though the two files contain the same content. The reason is that platforms mark the end of a line differently: '\r\n' on Windows, '\n' on Unix, '\r' on classic Mac OS, etc.
> So, is it a good idea to enhance filecmp to support universal-newline mode? If so, we could compare files from different operating systems, and filecmp.cmp would return True whenever they have the same content.

With aid from itertools.izip_longest, it's a one-line recipe:

py> print repr(open("one.txt","rb").read())
'hello\nworld!\nlast line\n'
py> print repr(open("two.txt","rb").read())
'hello\r\nworld!\r\nlast line\r\n'
py> import filecmp
py> filecmp.cmp("one.txt", "two.txt", False)
False
py> from itertools import izip_longest
py> f1 = open("one.txt", "rU")
py> f2 = open("two.txt", "rU")
py> print all(line1==line2 for line1,line2 in izip_longest(f1,f2))
True

Currently filecmp considers both files as binary, not text; if they differ in size they're considered different and the contents are not even read.

If you want a generic text-mode file comparison, there are other factors to consider in addition to line endings: character encoding, BOM, character case, whitespace... All of those may be considered "irrelevant differences" by some people. A generic text file comparison should take all of them into account.

-- 
Gabriel Genellina
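
For readers on Python 3, the same recipe might look roughly like the sketch below. There, izip_longest is spelled itertools.zip_longest, and text mode with newline=None (the default for open) already translates '\r\n' and '\r' to '\n' on reading. The function name text_files_equal and its encoding parameter are illustrative assumptions, not part of filecmp.

```python
from itertools import zip_longest


def text_files_equal(path1, path2, encoding="utf-8"):
    """Hypothetical helper: compare two text files line by line,
    ignoring differences in newline style."""
    # newline=None (the default) enables universal newlines on reading:
    # "\r\n" and "\r" both arrive as "\n".
    with open(path1, encoding=encoding, newline=None) as f1, \
         open(path2, encoding=encoding, newline=None) as f2:
        # zip_longest pads the shorter file with None, so a file with
        # extra lines never compares equal to a shorter one.
        return all(line1 == line2 for line1, line2 in zip_longest(f1, f2))
```

With the one.txt and two.txt files shown in the session above, text_files_equal("one.txt", "two.txt") returns True. This sketch only addresses line endings; the other factors mentioned in the reply (BOM, character case, whitespace) would need their own handling.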