Python version of perl's "if (-T ..)" and "if (-B ...)"?

Fri Feb 12 09:45:17 EST 2010

Lloyd Zusman wrote:
> Perl has the following constructs to check whether a file is considered
> to contain "text" or "binary" data:
> 
> if (-T $filename) { print "file contains 'text' characters\n"; }
> if (-B $filename) { print "file contains 'binary' characters\n"; }
> 
> Is there already a Python analog to these? I'm happy to write them on
> my own if no such constructs currently exist, but before I start, I'd
> like to make sure that I'm not "re-inventing the wheel".
> 
> By the way, here's what the perl docs say about these constructs. I'm
> looking for something similar in Python:
> 
> ... The -T  and -B  switches work as follows. The first block or so
> ... of the file is examined for odd characters such as strange control
> ... codes or characters with the high bit set. If too many strange
> ... characters (>30%) are found, it's a -B file; otherwise it's a -T
> ... file. Also, any file containing null in the first block is
> ... considered a binary file. [ ... ]

While I agree with the others who have responded along the lines 
of "that's a hinky heuristic", it's not too hard to write an analog:

   import string

   def is_text(fname,
       # bytes counted as "text"; the file is read in binary mode
       chars=frozenset(string.printable.encode('ascii')),
       threshold=0.3, # largest tolerated fraction of non-text bytes
       portion=1024, # read a kilobyte to find out
       mode='rb',
       ):
     assert portion is None or portion > 0
     assert 0 < threshold < 1
     with open(fname, mode) as f:
       if portion is None:
         content = f.read() # the whole file
       else:
         content = f.read(int(portion))
     if not content:
       return True # empty file: nothing odd in it, and no dividing by zero
     # iterating over bytes yields ints, which is what chars holds
     odd = sum(1 for c in content if c not in chars)
     return odd / len(content) <= threshold

   def is_bin(*args, **kwargs):
     return not is_text(*args, **kwargs)

   for fname in (
       '/usr/bin/abiword',
       '/home/tkc/.bashrc',
       ):
     print(fname, is_text(fname))

It lets you tweak the set of characters considered "text", 
which defaults to string.printable; adjust the "text" chars and 
the file-reading mode accordingly if you're using Unicode text 
(perhaps inverting the logic to make it a "binary chars" set). 
You can also change the threshold from 0.3 (30%) to whatever you 
need, and test the entire file or a subset of it (this defaults 
to just reading the first kilobyte of the file, but if you pass 
None for the portion, it will read the whole thing, even if it's 
a terabyte file).
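
To make the inverted "binary chars" idea a bit more concrete, here is a 
rough sketch that counts "odd" bytes instead of "text" ones and also 
applies the null-byte rule from the Perl docs quoted above. The names 
looks_binary, file_looks_binary, TEXT_CONTROLS and ODD_BYTES are my own 
inventions, and the choice of which control codes count as harmless is 
an assumption on my part, not something Perl or the standard library 
defines:

```python
# Hypothetical helpers: a sketch of the "binary chars" variant,
# not Perl's exact algorithm.

# Control codes assumed to be harmless in text (tab, LF, CR, FF, BS).
TEXT_CONTROLS = frozenset(b"\t\n\r\f\b")

# "Odd" bytes: other control codes, DEL, and anything with the high bit set.
ODD_BYTES = frozenset(
    b for b in range(256)
    if (b < 32 and b not in TEXT_CONTROLS) or b >= 127
)

def looks_binary(data, threshold=0.3):
    """Return True if the bytes in `data` look like binary content."""
    if not data:
        return False          # assumption: call an empty sample text
    if b"\0" in data:
        return True           # any null byte means binary
    odd = sum(1 for b in data if b in ODD_BYTES)
    return odd / len(data) > threshold

def file_looks_binary(fname, portion=1024, threshold=0.3):
    """Apply looks_binary() to the first `portion` bytes of a file."""
    with open(fname, "rb") as f:
        return looks_binary(f.read(portion), threshold)
```

Because every high-bit byte counts as odd here, UTF-8 text that is 
mostly non-ASCII would be flagged as binary, so trim ODD_BYTES if 
that matters for your files.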

-tkc