Need to know if a file has only ASCII characters

Scott David Daniels Scott.Daniels at Acm.Org
Tue Jun 16 13:42:58 EDT 2009


Dave Angel wrote:
> Jorge wrote:
>> Hi there,
>> I'm making an application that reads third-party generated ASCII files,
>> but sometimes the files are totally or partially corrupted, and I need
>> to know whether a file is ASCII with *nix line terminators.
>> On Linux I can run the file command, but the application should run on
>> Windows.
>>
>> Any help will be great.
>>
>> Thank you in advance.
>>
>>   
> So, which is the assignment:
>   1) determine if a file has non-ASCII characters
>   2) determine whether the line-endings are crlf or just lf
> 
> In the former case, look at translating the file contents to Unicode, 
> specifying ASCII as the source encoding.  If it fails, you have
> non-ASCII characters.
> In the latter case, investigate the 'U' (universal newlines) flag of the
> mode parameter in the open() function.
> 
> You also need to ask yourself whether you're doing a validation of the 
> file, or doing a "best guess" like the file command.
> 
> 
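A rough sketch of the two checks Dave describes, written for Python 3.X
and working on bytes already read from the file in binary mode.  The
names is_ascii and line_endings are made up for this example, and
counting terminators by hand is just one way to do it (an assumption,
not the only approach):

```python
def is_ascii(data):
    """Return True if every byte in data decodes as 7-bit ASCII."""
    try:
        data.decode('ascii')
    except UnicodeDecodeError:
        return False
    return True


def line_endings(data):
    """Classify the line terminators found in data (a bytes object)."""
    crlf = data.count(b'\r\n')
    lf = data.count(b'\n') - crlf      # bare LF, not part of a CRLF pair
    cr = data.count(b'\r') - crlf      # bare CR, not part of a CRLF pair
    if not (crlf or lf or cr):
        return 'none'
    if lf and not (crlf or cr):
        return 'lf'
    if crlf and not (lf or cr):
        return 'crlf'
    if cr and not (lf or crlf):
        return 'cr'
    return 'mixed'
```

Read the file with open(name, 'rb') so that Windows does not translate
the line endings before you get to look at them.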
Also, realize that ASCII is a 7-bit code whose printing characters are
all greater than space, and very few people use delete ('\x7F'), so you
can define a function to determine whether a file contains only printing
ASCII and a few control characters.  This one returns False unless some
ink would be printed.

Python 3.X:
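     def ascii_file(name, controls=b'\t\n'):
         # Allowed non-printing bytes: the given controls plus space.
         ctrls = set(controls + b' ')
         with open(name, 'rb') as f:
             chars = set(f.read())   # iterating bytes yields ints in 3.X
         inked = chars - ctrls       # bytes that must be printing ASCII
         # An empty or all-whitespace file prints no ink, so it is False.
         return (bool(inked) and min(inked) > ord(' ')
                             and ord('~') >= max(chars))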

Python 2.X:
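     def ascii_file(name, controls='\t\n'):
         # Allowed non-printing characters: the given controls plus space.
         ctrls = set(controls + ' ')
         with open(name, 'rb') as f:
             chars = set(f.read())   # one-character strings in 2.X
         inked = chars - ctrls       # characters that must be printing ASCII
         # An empty or all-whitespace file prints no ink, so it is False.
         return (bool(inked) and min(inked) > ' '
                             and '~' >= max(chars))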

For potentially more performance (at least on 2.X), you could do min
and max on the data read, and only do the set(data) if the min and
max are OK.
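
A minimal sketch of that shortcut for Python 3.X, assuming the whole
file fits in memory.  The name ascii_file_fast is made up for this
example, and whether it is actually faster on your files is something
you would have to measure:

```python
def ascii_file_fast(name, controls=b'\t\n'):
    """Like ascii_file, but build the set only if the range check passes."""
    ctrls = set(controls + b' ')
    with open(name, 'rb') as f:
        data = f.read()
    if not data:
        return False                   # empty file: no ink
    # Cheap range check first: reject out-of-range bytes early.
    if min(data) < min(ctrls) or max(data) > ord('~'):
        return False
    inked = set(data) - ctrls          # bytes that must be printing ASCII
    return bool(inked) and min(inked) > ord(' ')
```

The set is still needed at the end, because a stray control character
such as '\x0b' lies between tab and space and passes the range check.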

--Scott David Daniels
Scott.Daniels at Acm.Org


