Finding nonprintable characters?

Steven Majewski sdm7g at Virginia.EDU
Tue Feb 19 14:35:40 EST 2002


On Tue, 19 Feb 2002, VanL wrote:
>
> I have a function
>
> isBinary(filehandle)
>
> that I'm not sure how to implement.  I've decided to define binary as
> containing characters above \x80.  But  what is the best way to do this?
>
> 1. iterate through xreadline, so the whole thing doesn't get loaded into
> memory?

I would use file.read( bytes ): if the file is binary, you probably
don't need to read the whole thing in to find that out. Most programs
I've seen that try to determine 'binaryness' only check the first N
bytes anyway. (I've seen some that want a certain percentage of
non-printing chars per block, not just a single out-of-range char.)
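As a rough illustration of the "only check the first N bytes" idea, here is a minimal sketch in present-day Python 3 (an assumption; the thread itself predates it). The name is_binary, the blocksize default of 1024, and the 0x7f cutoff are illustrative choices, not anything fixed by the discussion. It also assumes the file was opened in binary mode, so read() returns bytes and iterating over them yields integers.

```python
import io

def is_binary(fileobj, blocksize=1024):
    """Return True if the first `blocksize` bytes contain a byte above 0x7f.

    Assumes `fileobj` was opened in binary mode ("rb").
    """
    block = fileobj.read(blocksize)       # reads at most `blocksize` bytes
    return any(b > 0x7F for b in block)   # each b is an int in Python 3

# In-memory stand-ins for real files, just for demonstration.
text = io.BytesIO(b"plain ASCII text\n")
blob = io.BytesIO(b"\x89PNG\r\n\x1a\n")
print(is_binary(text), is_binary(blob))   # prints: False True
```

Because only the first block is read, a file whose high bytes start later than blocksize is reported as text; that is the trade-off of checking a prefix rather than the whole file.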

> 2. String searching?  If so, for what string?  Searching for anything
> greater  than \x7f?
>
> 3. Re searching?  for what class?
>

How about something like this, where value is your cutoff (say 0x7f):

  filter( lambda c: ord(c) > value, file.read( blocksize ) )

or, as you note, save the ord() call and compare against an octal or
hex string literal. If you want to use a list comprehension, it would
be something like:

  [ c for c in file.read( blocksize ) if c > '\x7f' ]

but a list comprehension gives you a list, while filter on a string
yields a string. If you want a ratio instead of a single out-of-range
char, divide the (float) length of the filtered value by the length of
the block you actually read (at most blocksize).
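The ratio idea could be sketched as follows in present-day Python 3, which is an assumption, since the snippets above are written for the Python of 2002. The names high_byte_ratio and limit are hypothetical. Two things differ in Python 3: iterating over bytes yields integers, so no ord() call is needed, and filter() returns an iterator rather than a string, so it has to be wrapped in bytes() to get a bytes object back.

```python
def high_byte_ratio(block, limit=0x7F):
    """Return the fraction of bytes in `block` whose value exceeds `limit`."""
    if not block:
        return 0.0                            # avoid dividing by zero on an empty read
    high = [b for b in block if b > limit]    # list of ints
    return len(high) / len(block)             # true division, no float() needed

block = b"abc\xff\xfe\x00\x01z"
print(high_byte_ratio(block))                      # prints: 0.25
print(bytes(filter(lambda b: b > 0x7F, block)))    # prints: b'\xff\xfe'
```

The divisor is len(block) rather than blocksize because a read near the end of a file can return fewer bytes than requested. What ratio counts as "binary" is left to the caller; the thread does not suggest a particular threshold.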


-- Steve Majewski





