Detecting Binary content in files
John Machin
sjmachin at lexicon.net
Wed Apr 1 17:39:15 EDT 2009
On Apr 1, 4:59 pm, Dennis Lee Bieber <wlfr... at ix.netcom.com> wrote:
> On Tue, 31 Mar 2009 14:26:08 -0700 (PDT), ritu
> <ritu_bhandar... at yahoo.com> declaimed the following in
> gmane.comp.python.general:
>
>
>
> > if ( ( -B $filename ||
> >        $filename =~ /\.pdf$/ ) &&
> >      -s $filename > 0 ) {
> >     return(1);
> > }
>
> According to my old copy of the Camel, -B only reads the "first
> block" of the file. If the block contains a <NUL>, or if ~30% of the
> block contains bytes >127 or from some (undefined) set of control
> characters (that is, I expect it does not count <LF>, <CR>, <TAB>, <VT>,
> <FF>, maybe some others)... So...
Not sure whether this is meant to be rough pseudocode or an April 1
"jeu d'esprit" or ...
>
> def isbin(fid):
>     fin = open(fid, "r")
(1) mode = "rb" might be better
>     block = fin.read(1024) #what is the size of a "block" these days
>     binary = "\0" in block
>     if not binary:
>         mrkrs = [b for b in block
>                  if b > 127
(2) [assuming Python 2.x]
b is a str object; change 127 to "\x7f"
>                  or b in [ "\r", "\n", "\t" ] ] #add needed
(3) surely you mean "b not in"
(4) possible improvements on ["\r", etc etc] :
(4a) use tuple ("\r", etc etc)
(4b) use string "\r\n\t"
(you don't really want to build that list from scratch for each byte
tested, do you?)
>         binary = (float(len(mrkrs)) / len(block)) > 0.30
>     fin.close()
>     return binary
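For reference, here is one way the function might look with points (1) to (4) applied. It is a rough sketch, not a port of Perl's -B. It is written for Python 3, where iterating over a bytes object yields ints, so the 127 comparison works as written there. The blocksize and threshold parameters, the TEXT_CONTROL_BYTES name, and the decision to count only control bytes below 32 are my assumptions; the Camel description quoted above leaves the exact set of control characters undefined.

```python
# Sketch only: Python 3, with assumed names and an assumed control-byte range.
# Bytes that are control characters but normal in text; extend as needed
# (e.g. b"\f\v"). Built once, not once per byte tested.
TEXT_CONTROL_BYTES = frozenset(b"\r\n\t")


def isbin(path, blocksize=1024, threshold=0.30):
    """Guess whether a file is binary by inspecting its first block."""
    with open(path, "rb") as fin:          # binary mode, per point (1)
        block = fin.read(blocksize)
    if not block:
        # Assumption: an empty file is not binary, mirroring the
        # "-s $filename > 0" test in the original Perl.
        return False
    if b"\0" in block:
        return True
    # Count high-bit bytes and control bytes that are NOT in the
    # whitelist, per point (3).
    suspicious = sum(
        1 for b in block
        if b > 127 or (b < 32 and b not in TEXT_CONTROL_BYTES)
    )
    return suspicious / len(block) > threshold
```

Note that a UTF-8 or Latin-1 text file with many non-ASCII characters can cross the 30% threshold and be reported as binary, so treat the result as a guess.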
Cheers,
John