Need to know if a file has only ASCII characters

Steven D'Aprano steven at REMOVE.THIS.cybersource.com.au
Tue Jun 16 22:51:23 EDT 2009


On Tue, 16 Jun 2009 10:42:58 -0700, Scott David Daniels wrote:

> Dave Angel wrote:
>> Jorge wrote:
>>> Hi there,
>>> I'm making an application that reads third-party generated ASCII
>>> files, but sometimes the files are totally or partially corrupted
>>> and I need to know if it's an ASCII file with *nix line terminators.
>>> In Linux I can run the file command, but the application should run
>>> on Windows.
>>>
>>> Any help will be great.
>>>
>>> Thank you in advance.
>>>
>>>
>> So, which is the assignment:
>>   1) determine if a file has non-ASCII characters
>>   2) determine whether the line-endings are crlf or just lf
>> 
>> In the former case, look at translating the file contents to Unicode,
>> specifying ASCII as source.  If it fails, you have non-ASCII.  In the
>> latter case, investigate the 'U' attribute of the mode parameter in the
>> open() function.
>> 
>> You also need to ask yourself whether you're doing a validation of the
>> file, or doing a "best guess" like the file command.
>> 
>> 
> Also, realize that ASCII is a 7-bit code, with printing characters all
> greater than space, and very few people use delete ('\x7F'), so you can
> define a function to determine if a file contains only printing ASCII
> and a few control characters.  This one is False unless some ink would
> be printed.
> 
> Python 3.X:
>      def ascii_file(name, controls=b'\t\n'):
>          ctrls = set(controls + b' ')
>          with open(name, 'rb') as f:
>              chars = set(f.read())
>          return min(chars) >= min(ctrls) and ord('~') >= max(chars
>                                ) and min(chars - ctrls) > ord(' ')
> 
> Python 2.X:
>      def ascii_file(name, controls='\t\n'):
>          ctrls = set(controls + ' ')
>          with open(name, 'rb') as f:
>              chars = set(f.read())
>          return min(chars) >= min(ctrls) and '~' >= max(chars
>                                ) and min(chars - ctrls) > ' '
> 
> For potentially more performance (at least on 2.X), you could do min and
> max on the data read, and only do the set(data) if the min and max are
> OK.


You're suggesting that running through the entire data three times 
instead of once is an optimization? Boy, I'd hate to see what you 
consider a pessimation! *wink*

I think the best solution will probably be a lazy function which stops 
processing as soon as it hits a character that isn't printable ASCII.


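# Python 2.5, and untested
from __future__ import with_statement  # needed for "with" on 2.5

def ascii_file(name):
    from string import printable
    with open(name, 'rb') as f:
        c = f.read(1)
        while c:  # f.read(1) returns '' at end of file
            if c not in printable: return False
            c = f.read(1)
    return True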

This only reads the entire file if it needs to, and only walks the data 
once if it is ASCII.

In practice, you may actually get better performance by reading in a 
block at a time, rather than a byte at a time:

# Python 2.5, and still untested
from __future__ import with_statement  # needed for "with" on 2.5

def ascii_file(name, bs=65536):  # 64K default blocksize
    from string import printable
    with open(name, 'rb') as f:
        text = f.read(bs)
        while text:
            for c in text:
                if c not in printable: return False
            text = f.read(bs)
    return True
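
For anyone on Python 3, here is a rough sketch of the same block-at-a-time 
idea. It is not a drop-in replacement for the above: reading in 'rb' mode 
gives bytes, and iterating over bytes yields ints, so the allowed 
characters are kept as a set of byte values. The names ALLOWED and 
allowed are my own invention for illustration, and treating everything in 
string.printable as acceptable is an assumption carried over from the 
functions above.

```python
import string

# Byte values accepted as "ASCII text". string.printable covers digits,
# letters, punctuation and whitespace (including tab, LF and CR).
# ALLOWED is a hypothetical name used only for this sketch.
ALLOWED = frozenset(string.printable.encode('ascii'))

def ascii_file(name, bs=65536, allowed=ALLOWED):
    """Return True if every byte of the file is in `allowed`."""
    with open(name, 'rb') as f:
        while True:
            block = f.read(bs)
            if not block:
                # End of file reached without meeting a bad byte.
                return True
            # Iterating over a bytes object yields ints in Python 3,
            # so the block can be compared against the set directly.
            if not allowed.issuperset(block):
                return False
```

This stops at the first block containing a disallowed byte, so a corrupt 
file is rejected without reading the rest of it. If Windows line endings 
should also count as a failure, one option is to drop the carriage return 
from the allowed set, e.g. ascii_file(name, allowed=ALLOWED - {ord('\r')}). 
That is a strict reading of "*nix line terminators" and may or may not be 
what the original poster wants.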




-- 
Steven


