[Python-ideas] enhance filecmp to support text-and-universal-newline-mode file comparison

Wed Jun 24 17:15:14 CEST 2009

[Please keep the discussion on the list. Also, please avoid top-posting (corrected below)]

> On 6/20/09, Gabriel Genellina <gagsl-py2 at yahoo.com.ar> wrote:
> > On Thu, 18 Jun 2009 11:04:34 -0300, zhong nanhai
> > <higerinbeijing-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org> wrote:
> >
> >> So is it a good idea to enhance filecmp to support
> >> universal-newline mode? If so, we could compare files from
> >> different operating systems, and if they have the same content,
> >> filecmp.cmp would return True.
> >
> > With aid from itertools.izip_longest, it's a one-line recipe:
> >
> > py> print repr(open("one.txt","rb").read())
> > 'hello\nworld!\nlast line\n'
> > py> print repr(open("two.txt","rb").read())
> > 'hello\r\nworld!\r\nlast line\r\n'
> > py> import filecmp
> > py> filecmp.cmp("one.txt", "two.txt", False)
> > False
> > py> from itertools import izip_longest
> > py> f1 = open("one.txt", "rU")
> > py> f2 = open("two.txt", "rU")
> > py>
> > py> print all(line1==line2 for line1,line2 in izip_longest(f1,f2))
> > True
> >
> > Currently filecmp considers both files as binary, not text; if they differ
> > in size they're considered different and the contents are not even read.
> >
> > If you want a generic text-mode file comparison, there are other factors
> > to consider in addition to line endings: character encoding, BOM,
> > character case, whitespace... All of those may be considered "irrelevant
> > differences" by some people. A generic text file comparison should take
> > all of them into account.

--- On Fri, 19 Jun 2009, zhong nanhai <higerinbeijing at gmail.com> wrote:

> Thanks for your suggestion.
> You are right that there are a lot of things to consider if we want to
> make filecmp support text comparison. But I think we can just do a
> small feature enhancement, e.g. only the universal-newline mode. I am
> not clear on how filecmp implements the file comparison, so could you
> tell me more about that?
> And if the filecmp source compares files by reading them line by line,
> then we can do some further comparisons when encountering a newline
> (that is, the end of a line).

You can see it yourself, in lib/filecmp.py in your Python installation.
It does a binary comparison only -- and it does not read anything if file sizes differ. A text comparison should use a different algorithm; the code above already ignores end-of-line differences and stops as soon as two lines differ. One could enhance it to add support for other options as mentioned earlier.
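
To make that concrete, here is a rough sketch of how the recipe could grow into a small function. It is only an illustration, not anything that exists in filecmp: the name text_cmp and the parameters encoding, ignore_case and strip_trailing_ws are made up for this example. It is written for Python 3, where izip_longest is spelled zip_longest and text mode uses universal newlines by default, so the "rU" flag is not needed.

```python
from itertools import zip_longest


def text_cmp(path1, path2, encoding="utf-8",
             ignore_case=False, strip_trailing_ws=False):
    """Return True if both text files contain the same lines.

    Hypothetical helper (not part of filecmp). Line-ending style is
    always ignored; the two keyword flags are optional extras.
    """

    def normalize(line):
        # zip_longest pads the shorter file with None; keep it as None so
        # a missing line never compares equal to a real one.
        if line is None:
            return None
        if strip_trailing_ws:
            # Note: this also removes the trailing "\n" itself.
            line = line.rstrip()
        if ignore_case:
            line = line.lower()
        return line

    # newline=None selects universal-newline mode: "\n", "\r\n" and "\r"
    # are all translated to "\n" while reading.
    with open(path1, encoding=encoding, newline=None) as f1, \
         open(path2, encoding=encoding, newline=None) as f2:
        # all() stops at the first pair of lines that differ.
        return all(normalize(a) == normalize(b)
                   for a, b in zip_longest(f1, f2))
```

With the two files shown earlier, text_cmp("one.txt", "two.txt") should return True where filecmp.cmp("one.txt", "two.txt", False) returns False. Unlike filecmp, this cannot use the file sizes as a shortcut, since files with different line endings differ in size even when their text is the same.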

-- 
Gabriel Genellina
