Python version of perl's "if (-T ..)" and "if (-B ...)"?
Tim Chase
python.list at tim.thechases.com
Fri Feb 12 09:45:17 EST 2010
Lloyd Zusman wrote:
> Perl has the following constructs to check whether a file is considered
> to contain "text" or "binary" data:
>
> if (-T $filename) { print "file contains 'text' characters\n"; }
> if (-B $filename) { print "file contains 'binary' characters\n"; }
>
> Is there already a Python analog to these? I'm happy to write them on
> my own if no such constructs currently exist, but before I start, I'd
> like to make sure that I'm not "re-inventing the wheel".
>
> By the way, here's what the perl docs say about these constructs. I'm
> looking for something similar in Python:
>
> ... The -T and -B switches work as follows. The first block or so
> ... of the file is examined for odd characters such as strange control
> ... codes or characters with the high bit set. If too many strange
> ... characters (>30%) are found, it's a -B file; otherwise it's a -T
> ... file. Also, any file containing null in the first block is
> ... considered a binary file. [ ... ]
While I agree with the others who have responded along the lines
of "that's a hinky heuristic", it's not too hard to write an analog:

  # Python 2 code
  import string

  def is_text(fname,
      chars=set(string.printable),
      threshold=0.3,
      portion=1024, # read a kilobyte to find out
      mode='rb',
      ):
    assert portion is None or portion > 0
    assert 0 < threshold < 1
    f = open(fname, mode)
    try:
      if portion is None:
        content = f.read() # the whole file
      else:
        content = f.read(int(portion))
    finally:
      f.close()
    total = valid = 0
    for c in content:
      if c in chars:
        valid += 1
      total += 1
    if not total:
      return True # empty file: avoid dividing by zero
    return (float(valid)/total) > threshold

  def is_bin(*args, **kwargs):
    return not is_text(*args, **kwargs)

  for fname in (
      '/usr/bin/abiword',
      '/home/tkc/.bashrc',
      ):
    print fname, is_text(fname)

It lets you tweak the set of characters considered "text"
(defaulting to string.printable), but adjust the "text" chars and
the file-reading mode accordingly if you're using Unicode text
(perhaps inverting the logic to make it a "binary chars" set).
You can also change the threshold from 0.3 (30%) to whatever you
need, and test the entire file or a subset of it (this defaults
to just reading the first K of the file, but if you pass None for
the portion, it will read the whole thing, even if it's a TB file).
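
For anyone on Python 3, where a file opened in 'rb' mode yields
bytes whose elements are integers rather than one-character
strings, a rough sketch that sticks closer to the perldoc rule
quoted above (any null byte means binary, otherwise binary if more
than 30% of the sample is "odd") might look like this. The names
looks_binary and TEXT_BYTES are made up for illustration, the
choice of string.printable as the "text" set is carried over from
the default above, and treating an empty sample as text is an
assumption rather than anything the quoted docs promise:

```python
import string

# Assumption: "text" bytes are the ASCII encodings of string.printable.
TEXT_BYTES = frozenset(string.printable.encode("ascii"))


def looks_binary(data, threshold=0.3):
    """Guess whether a bytes sample is binary, following the quoted rule."""
    if not data:
        return False  # assumption: an empty sample counts as text
    if 0 in data:
        return True  # any null byte marks the sample as binary
    # Iterating over bytes yields integers in Python 3.
    odd = sum(1 for b in data if b not in TEXT_BYTES)
    return odd / len(data) > threshold


def is_text(fname, portion=1024, threshold=0.3):
    """Classify a file from its first `portion` bytes (all of it if None)."""
    with open(fname, "rb") as f:
        data = f.read() if portion is None else f.read(portion)
    return not looks_binary(data, threshold)
```

Here the threshold applies to the share of odd bytes, so
looks_binary(b"hello, world\n") is False, while
looks_binary(b"abc\x00def") is True because of the null byte.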
-tkc