exclude binary files from os.walk

Alex Martelli aleaxit at yahoo.com
Thu Jan 27 10:09:57 EST 2005


rbt <rbt at athop1.ath.vt.edu> wrote:

> Grant Edwards wrote:
> > On 2005-01-26, rbt <rbt at athop1.ath.vt.edu> wrote:
> > 
> >>Is there an easy way to exclude binary files (I'm working on
> >>Windows XP) from the file list returned by os.walk()?
> > 
> > Sure, assuming you can provide a rigorous definition of 'binary
> > files'.  :)
> 
> non-ascii

The only way to tell for sure if a file contains only ASCII characters
is to read the whole file and check.  You _are_, however, using a very
strange definition of "binary".  A file of text in German, French or
Italian, for example, is likely to be one you'll define as "binary" --
just as soon as it contains a vowel with accent or diaeresis, for
example.  On the other hand, you want to consider "non-binary" a file
chock full of hardly-ever-used control characters, just because the
American Standard Code for Information Interchange happened to
standardize them once upon a time?  Most people's intuitive sense of
what "binary" means would rebel against both of these choices, I think:
calling a file "binary" because its contents are, say, the string
'El perro de aguas español.\n' (the n-with-tilde in "español"
disqualifies it from being ASCII), while considering "non-binary"
another whose contents are 32 bytes of 8 zero bits each (ASCII 'NUL'
characters).
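
To make the two examples concrete, here is a minimal sketch of the
"non-ascii" test applied to both byte strings.  It assumes a present-day
Python 3 interpreter and a Latin-1 encoding for the Spanish text; the
helper name is_ascii is hypothetical, chosen only for illustration.

```python
def is_ascii(data: bytes) -> bool:
    """Return True if every byte in data is in the 7-bit ASCII range (0-127)."""
    return all(byte < 128 for byte in data)

# Assumption: the Spanish text is stored as Latin-1, so the n-with-tilde is byte 0xF1.
spanish = 'El perro de aguas español.\n'.encode('latin-1')

# 32 bytes of 8 zero bits each, i.e. 32 ASCII NUL characters.
nuls = b'\x00' * 32

print(is_ascii(spanish))  # False: "binary" under the non-ascii definition
print(is_ascii(nuls))     # True: "non-binary" under the same definition
```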

In any case, since you need to open and read all the files to check them
for "being binary", either by your definition or whatever heuristics you
might prefer, you would not really ``exclude them from os.walk'', but
rather filter os.walk's results by these criteria.
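
One way such filtering might look is sketched below, under the
"non-ascii" definition and assuming Python 3.  The function names
is_ascii_file and walk_ascii_files, the chunk size, and the choice to
skip unreadable files are all assumptions made for illustration; swap
in whatever heuristic you prefer.

```python
import os

def is_ascii_file(path, chunk_size=64 * 1024):
    """Return True if the file at path contains only bytes in range 0-127.

    Reads in chunks and stops at the first non-ASCII byte, but a pure-ASCII
    file still has to be read to the end before we can be sure.
    """
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return True  # reached end of file without a non-ASCII byte
            if any(byte > 127 for byte in chunk):
                return False

def walk_ascii_files(top):
    """Yield the paths of files under top whose contents are pure ASCII."""
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if is_ascii_file(path):
                    yield path
            except OSError:
                # Assumption: files that cannot be opened or read are skipped.
                continue
```

Usage would be along the lines of "for path in walk_ascii_files('.'):
print(path)".  Note that os.walk itself is untouched; the filter is
applied to what it returns.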


Alex


