Re: [Python-ideas] enhance filecmp to support text-and-universal-newline-mode file comparison

[Please keep the discussion in the list. Also, please avoid top posting (corrected below)]
On 6/20/09, Gabriel Genellina gagsl-py2@yahoo.com.ar wrote:
On Thu, 18 Jun 2009 11:04:34 -0300, zhong nanhai higerinbeijing-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org wrote:
So is it a good idea to enhance filecmp to support universal-newline mode? If so, we could compare files that come from different operating systems, and if they have the same content, filecmp.cmp would return True.
With aid from itertools.izip_longest, it's a one-line recipe:
py> print repr(open("one.txt","rb").read())
'hello\nworld!\nlast line\n'
py> print repr(open("two.txt","rb").read())
'hello\r\nworld!\r\nlast line\r\n'
py> import filecmp
py> filecmp.cmp("one.txt", "two.txt", False)
False
py> from itertools import izip_longest
py> f1 = open("one.txt", "rU")
py> f2 = open("two.txt", "rU")
py>
py> print all(line1==line2 for line1,line2 in izip_longest(f1,f2))
True
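For readers on Python 3, the same recipe might look roughly like the sketch below. It is an adaptation, not part of the original post: izip_longest is named zip_longest there, and text-mode open() translates newlines by default, so the "rU" mode is not needed. The names same_text and write_bytes, and the UTF-8 default encoding, are assumptions made for illustration.

```python
import os
import tempfile
from itertools import zip_longest


def same_text(path1, path2, encoding="utf-8"):
    """Compare two text files line by line, ignoring newline conventions."""
    # newline=None turns on universal-newline translation, so "\r\n" and
    # "\r" are both read back as "\n".
    with open(path1, encoding=encoding, newline=None) as f1, \
         open(path2, encoding=encoding, newline=None) as f2:
        # zip_longest pads the shorter file with None, so a file with extra
        # lines never compares equal; all() stops at the first mismatch.
        return all(a == b for a, b in zip_longest(f1, f2))


def write_bytes(directory, name, data):
    """Hypothetical helper: write raw bytes and return the file path."""
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(data)
    return path


with tempfile.TemporaryDirectory() as tmp:
    one = write_bytes(tmp, "one.txt", b"hello\nworld!\nlast line\n")
    two = write_bytes(tmp, "two.txt", b"hello\r\nworld!\r\nlast line\r\n")
    three = write_bytes(tmp, "three.txt", b"hello\nworld!\n")
    print(same_text(one, two))    # True: only the line endings differ
    print(same_text(one, three))  # False: three.txt is missing a line
```

Opening the files in a with block also closes them when the comparison ends early, which the interactive session above leaves to the interpreter.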
Currently filecmp considers both files as binary, not text; if they differ in size they're considered different and the contents are not even read.
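The behaviour described here can be pictured with a simplified sketch. This is not the actual filecmp source: binary_equal, the chunk size, and the write_bytes helper are hypothetical names chosen for illustration, and the real module also looks at file type and caches results.

```python
import os
import tempfile


def binary_equal(path1, path2, chunk_size=8192):
    """Byte-for-byte comparison that skips reading when sizes differ."""
    # Files of different size cannot be byte-identical, so return early
    # without opening either of them.
    if os.path.getsize(path1) != os.path.getsize(path2):
        return False
    with open(path1, "rb") as f1, open(path2, "rb") as f2:
        while True:
            block1 = f1.read(chunk_size)
            block2 = f2.read(chunk_size)
            if block1 != block2:
                return False
            if not block1:  # both files exhausted at the same point
                return True


def write_bytes(directory, name, data):
    """Hypothetical helper: write raw bytes and return the file path."""
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(data)
    return path


with tempfile.TemporaryDirectory() as tmp:
    one = write_bytes(tmp, "one.txt", b"hello\nworld!\nlast line\n")
    two = write_bytes(tmp, "two.txt", b"hello\r\nworld!\r\nlast line\r\n")
    # The CRLF file is three bytes longer, so the size check alone decides.
    print(binary_equal(one, two))  # False
    print(binary_equal(one, one))  # True
```

This is why converting a file between LF and CRLF line endings is enough to make a binary comparison fail: the extra "\r" bytes change the file size.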
If you want a generic text-mode file comparison, there are other factors to consider in addition to line endings: character encoding, BOM, character case, whitespace... All of those may be considered "irrelevant differences" by some people. A generic text file comparison should take all of them into account.
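As a rough illustration of how some of those factors could become options, here is a Python 3 sketch. Everything in it is an assumption rather than an existing filecmp feature: text_equal, normalize, and write_bytes are hypothetical names, the "utf-8-sig" default is chosen only because it skips a UTF-8 BOM when one is present, and treating runs of whitespace as a single space is just one possible meaning of "ignore whitespace".

```python
import os
import tempfile
from itertools import zip_longest


def normalize(line, ignore_case, ignore_whitespace):
    """Reduce a line to the form that should be compared."""
    line = line.rstrip("\n")  # newline style is already unified by open()
    if ignore_whitespace:
        line = " ".join(line.split())  # collapse runs, trim both ends
    if ignore_case:
        line = line.casefold()
    return line


def text_equal(path1, path2, encoding="utf-8-sig",
               ignore_case=False, ignore_whitespace=False):
    """Line-by-line text comparison with optional normalisation."""
    with open(path1, encoding=encoding, newline=None) as f1, \
         open(path2, encoding=encoding, newline=None) as f2:
        lines1 = (normalize(s, ignore_case, ignore_whitespace) for s in f1)
        lines2 = (normalize(s, ignore_case, ignore_whitespace) for s in f2)
        # None padding from zip_longest makes files of different length differ.
        return all(a == b for a, b in zip_longest(lines1, lines2))


def write_bytes(directory, name, data):
    """Hypothetical helper: write raw bytes and return the file path."""
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(data)
    return path


with tempfile.TemporaryDirectory() as tmp:
    plain = write_bytes(tmp, "plain.txt", b"Hello  World\n")
    bom = write_bytes(tmp, "bom.txt", b"\xef\xbb\xbfhello world\r\n")
    print(text_equal(plain, bom))  # False: case and spacing differ
    print(text_equal(plain, bom, ignore_case=True,
                     ignore_whitespace=True))  # True
```

One consequence of stripping the trailing "\n" before comparing is that a file whose last line lacks a final newline compares equal to one that has it; whether that counts as an irrelevant difference is itself a design decision.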
--- On Fri, 19-Jun-09, zhong nanhai higerinbeijing@gmail.com wrote:
Thanks for your suggestion. You are right, and there are a lot of things to consider if we want to make filecmp support text comparison. But I think we could start with a small feature enhancement, e.g. only universal-newline mode. I am not clear on how filecmp implements the file comparison, so please tell me more about that. If the filecmp source compares files by reading them line by line, then we could do some further comparisons when encountering a newline marker (meaning the end of a line).
You can see it yourself, in lib/filecmp.py in your Python installation. It does a binary comparison only -- and it does not read anything if file sizes differ. A text comparison should use a different algorithm; the code above already ignores end-of-line differences and stops as soon as two lines differ. One could enhance it to add support for the other options mentioned earlier.