Determine file type (binary or text)

Brian Lenihan brian_l at yahoo.com
Thu Aug 14 02:32:34 EDT 2003


Peter Hansen <peter at engcorp.com> wrote in message news:<3F3A8275.8B6EE8C4 at engcorp.com>...

> "Contains only printable characters" is probably a more useful definition
> of text in many cases.  I can't say off the top of my head exactly when
> either definition might be a problem....  wait, how about this one: in
> CVS, if you don't have a file that is effectively line-oriented, human
> readable information, you probably don't want to let it be treated as 
> "text" and stored as diffs.  In that situation, "contains primarily 
> printable characters organized in lines" is probably a more thorough,
> though less deterministic, definition.

We check for binary files in our CVS commitprep script like this:

look for the -kb arg
open the file in binary mode, read 4k from the file into buff and...

non_text = 0
for a in bytearray(buff):  # iterate over byte values 0-255
    # text is 8-13 (BS, TAB, LF, VT, FF, CR) and printable ASCII 32-126
    if a < 8 or 13 < a < 32 or a > 126:
        non_text += 1

If 10 percent of the characters are found to be non-text, we reject
the file if it was not committed with the -kb flag. Going the other
way, we print a warning if the file appears to be text but is being
checked in as a binary.
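
Put together, the whole check might look roughly like the sketch below.
This is not our actual commitprep script: the names looks_binary and
check_commit, the "reject"/"warn"/"ok" return values, and the choice to
treat an empty file as text are all made up for illustration, and I'm
assuming the threshold means "10 percent or more".

```python
SAMPLE_SIZE = 4096      # "read 4k from the file"
THRESHOLD_PERCENT = 10  # assumed to mean "10 percent or more"


def is_text_byte(a):
    # Same ranges as the loop above: 8-13 and 32-126 count as text.
    return 8 <= a <= 13 or 32 <= a <= 126


def looks_binary(path, sample_size=SAMPLE_SIZE, percent=THRESHOLD_PERCENT):
    """Guess whether a file is binary from its first sample_size bytes."""
    with open(path, "rb") as f:
        buff = bytearray(f.read(sample_size))
    if not buff:
        return False  # assumption: an empty file is treated as text
    non_text = sum(1 for a in buff if not is_text_byte(a))
    # Integer arithmetic avoids float rounding at the threshold.
    return non_text * 100 >= percent * len(buff)


def check_commit(path, has_kb_flag):
    """Return "reject", "warn" or "ok" (hypothetical result codes)."""
    binary = looks_binary(path)
    if binary and not has_kb_flag:
        return "reject"  # binary content committed without -kb
    if has_kb_flag and not binary:
        return "warn"    # looks like text but flagged as binary
    return "ok"
```

A hook would call check_commit(path, has_kb_flag) once per file and
abort the commit on "reject". Only sampling the first 4k keeps the
check cheap on large files, at the cost of missing binary data that
starts further in.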

We don't bother checking for charsets other than ASCII, because
localized files have to be checked in as binaries or bad things
(tm) happen.



