Detecting Binary content in files

John Machin sjmachin at lexicon.net
Wed Apr 1 17:39:15 EDT 2009


On Apr 1, 4:59 pm, Dennis Lee Bieber <wlfr... at ix.netcom.com> wrote:
> On Tue, 31 Mar 2009 14:26:08 -0700 (PDT), ritu
> <ritu_bhandar... at yahoo.com> declaimed the following in
> gmane.comp.python.general:
>
>
>
> > if ( ( -B $filename ||
> >            $filename =~ /\.pdf$/ ) &&
> >          -s $filename > 0 ) {
> >         return(1);
> >     }
>
>         According to my old copy of the Camel, -B only reads the "first
> block" of the file. If the block contains a <NUL>, or if ~30% of the
> block contains bytes >127 or from some (undefined) set of control
> characters (that is, I expect it does not count <LF>, <CR>, <TAB>, <VT>,
> <FF>, maybe some others)... So...

Not sure whether this is meant to be rough pseudocode or an April 1
"jeu d'esprit" or ...

>
> def isbin(fid):
>         fin = open(fid, "r")

(1) mode = "rb" would be better: on Windows, text mode translates
"\r\n" to "\n" and stops reading at the first Ctrl-Z byte

>         block = fin.read(1024)  #what is the size of a "block" these days
>         binary = "\0" in block
>         if not binary:
>                 mrkrs = [b for b in block
>                                         if b > 127

(2) [assuming Python 2.x]
b is a str object; change 127 to "\x7f"

>                                                 or b in [ "\r", "\n", "\t" ]      ]       #add needed

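(3) surely you mean "b not in", and only for control characters
(b < " "); otherwise every printable byte gets counted as a marker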

(4) possible improvements on ["\r", etc etc] :
(4a) use tuple ("\r", etc etc)
(4b) use string "\r\n\t"
(you don't really want to build that list from scratch for each byte
tested, do you?)

>                 binary = (float(len(mrkrs)) / len(block)) > 0.30
>         fin.close()
>         return binary
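
Pulling points (1) to (4) together, a minimal sketch might look like
the following. It is not a port of Perl's -B. The names is_binary,
block_size and threshold are made up for illustration, the restriction
to control characters below 0x20 is an assumption about what was
intended, and an empty file is treated as not binary (the quoted code
would raise ZeroDivisionError there).

```python
def is_binary(path, block_size=1024, threshold=0.30):
    """Heuristic guess at whether a file is binary (a sketch, not Perl's -B)."""
    # Point (1): open in binary mode so no bytes are translated or dropped.
    with open(path, "rb") as fin:
        # bytearray yields ints when iterated, in Python 2.6+ and 3.x alike,
        # which sidesteps the str-versus-int comparison in point (2).
        block = bytearray(fin.read(block_size))

    if not block:
        return False  # assumption: an empty file is not binary
    if 0 in block:
        return True   # a NUL byte settles it

    # Point (4b): build the set of tolerated control characters once.
    allowed = bytearray(b"\r\n\t")  # add others as needed
    # Point (3): count high bytes, plus control bytes NOT in the allowed set.
    suspicious = [b for b in block
                  if b > 127 or (b < 32 and b not in allowed)]
    return float(len(suspicious)) / len(block) > threshold
```

Called as is_binary("some_file"), it returns True or False. The 1024
byte block and the 0.30 cutoff are simply carried over from the quoted
code, so adjust to taste.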

Cheers,
John



