Detecting Binary content in files

John Machin sjmachin at lexicon.net
Thu Apr 2 00:13:03 CEST 2009


On Apr 2, 8:39 am, John Machin <sjmac... at lexicon.net> wrote:
> On Apr 1, 4:59 pm, Dennis Lee Bieber <wlfr... at ix.netcom.com> wrote:
>
>
>
> > On Tue, 31 Mar 2009 14:26:08 -0700 (PDT), ritu
> > <ritu_bhandar... at yahoo.com> declaimed the following in
> > gmane.comp.python.general:
>
> > > if ( ( -B $filename ||
> > >            $filename =~ /\.pdf$/ ) &&
> > >          -s $filename > 0 ) {
> > >         return(1);
> > >     }
>
> >         According to my old copy of the Camel, -B only reads the "first
> > block" of the file. If the block contains a <NUL>, or if ~30% of the
> > block contains bytes >127 or from some (undefined) set of control
> > characters (that is, I expect it does not count <LF>, <CR>, <TAB>, <VT>,
> > <FF>, maybe some others)... So...
>
> Not sure whether this is meant to be rough pseudocode or an April 1
> "jeu d'esprit" or ...
>
>
>
> > def isbin(fid):
> >         fin = open(fid, "r")
>
> (1) mode = "rb" might be better
>
> >         block = fin.read(1024)  #what is the size of a "block" these days
> >         binary = "\0" in block
> >         if not binary:
> >                 mrkrs = [b for b in block
> >                                         if b > 127
>
> (2) [assuming Python 2.x]
> b is a str object; change 127 to "\x3f"

Gah ... it must be gamma rays from outer space! Trying again:

change 127 to "\x7f" (and actually "\x7e" would be a better choice)

>
> >                                                 or b in [ "\r", "\n", "\t" ]      ]       #add needed
>
> (3) surely you mean "b not in"

take 2:

surely you mean
   ... or b < "\x20" and b not in "\r\n\t"

and at that stage the idea of making a set of chars before entering the
loop has some attraction :-)
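
For what it's worth, here is a rough sketch of that idea with the corrections above folded in. It assumes Python 3 rather than the 2.x discussed in the thread, so reading in "rb" mode yields bytes and iterating over the block yields ints, not 1-char strings. The names TEXT_CONTROLS, SUSPECT and looks_binary, and the 0.30 threshold (taken from the "~30%" figure quoted from the Camel), are illustrative assumptions, not what Perl's -B actually does internally.

```python
# Sketch only: assumes Python 3 (bytes read in "rb" mode iterate as ints).
# TEXT_CONTROLS, SUSPECT, looks_binary and the 0.30 threshold are
# illustrative assumptions, not Perl's actual -B implementation.

TEXT_CONTROLS = b"\r\n\t"  # control chars treated as ordinary text

# Built once, before any loop over the data: every byte value above
# 0x7e, plus every control char below 0x20 that is not in TEXT_CONTROLS.
SUSPECT = frozenset(
    b for b in range(256)
    if b > 0x7E or (b < 0x20 and b not in TEXT_CONTROLS)
)


def looks_binary(block, threshold=0.30):
    """Guess whether a block of bytes is binary rather than text."""
    if not block:
        return False  # empty input: nothing to judge
    if 0 in block:
        return True   # a NUL byte settles it immediately
    suspect = sum(1 for b in block if b in SUSPECT)
    return suspect / len(block) > threshold


def isbin(path, blocksize=1024):
    """Apply the heuristic to the first blocksize bytes of a file."""
    with open(path, "rb") as fin:
        return looks_binary(fin.read(blocksize))
```

Building the set once keeps the per-byte test to a single membership lookup, and splitting looks_binary out from isbin means the heuristic can be tried on literal byte strings without touching the filesystem.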


