Need to know if a file has only ASCII characters

norseman norseman at hughes.net
Tue Jun 16 14:12:14 EDT 2009


Scott David Daniels wrote:
> Dave Angel wrote:
>> Jorge wrote:
>>> Hi there,
>>> I'm making an application that reads ASCII files generated by a third
>>> party, but sometimes the files are totally or partially corrupted, and
>>> I need to know whether a file is ASCII with *nix line terminators.
>>> On Linux I can run the file command, but the application has to run on
>>> Windows.

You are looking for the \x0D (carriage return) \x0A (line feed) 
combination. If it is present, the file has Microsoft-style line endings; 
if only \x0A appears, it has *nix ones.  If you think high bits might be 
part of the corruption, filter each byte with byte & 0x7F (byte ANDed 
with hex 7F, or 127 decimal), then check for the \x0D \x0A combination.
Run the test on a known text file first.  Byte order (endianness) 
confuses many, but it applies only to multibyte numerics, not to a 
stream of single-byte characters.  Intel stores the least significant 
byte first (little-endian); SUN and the internet store the most 
significant byte first (big-endian).  Some coders write the 0D0A pair 
backwards, some don't.  You might want to test for both.
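
A minimal sketch of that check for Python 3, assuming the whole file fits 
in memory.  The name line_endings and the strip_high_bits flag are made up 
for illustration; read the file with open(name, 'rb') so Python does no 
newline translation of its own.

```python
def line_endings(data, strip_high_bits=False):
    """Count CR LF pairs, bare LFs and bare CRs in a bytes object."""
    if strip_high_bits:
        # Clear bit 7 of every byte (byte & 0x7F) before searching.
        data = bytes(b & 0x7F for b in data)
    crlf = data.count(b'\r\n')
    return {
        'crlf': crlf,                          # Microsoft-style endings
        'bare_lf': data.count(b'\n') - crlf,   # *nix-style endings
        'bare_cr': data.count(b'\r') - crlf,   # stray carriage returns
    }
```

A file with only *nix terminators should report zero for 'crlf' and 
'bare_cr'.  A reversed 0A0D pair shows up here as one bare LF plus one 
bare CR.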

(2^24)(2^16)(2^8)(2^0)   4 bytes, correct math order: big-endian
Intel stores them (2^0)(2^8)(2^16)(2^24): little-endian
SUN/Internet stores them in correct math order.
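
If you want to see the two orders for yourself, the standard struct module 
will show them.  This is only an illustration with an arbitrary value; the 
variable names are mine.

```python
import struct

value = 0x01020304

# '<' forces little-endian (Intel): least significant byte first.
little = struct.pack('<I', value)
# '>' forces big-endian (network order): most significant byte first.
big = struct.pack('>I', value)

# Text is a sequence of single bytes, so byte order does not touch it.
crlf = '\r\n'.encode('ascii')

print(little.hex(), big.hex(), crlf.hex())
```

The 4-byte integer comes out reversed between the two packings, while the 
CR LF pair is 0D 0A in both worlds.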


Python treats '\r\n' as the bytes 0D 0A and '\n\r' as 0A 0D on every 
platform, provided the file is opened in binary mode ('rb').

HTH

Steve
>>>
>>> Any help will be great.
>>>
>>> Thank you in advance.
>>>
>>>   
>> So, which is the assignment:
>>   1) determine if a file has non-ASCII characters
>>   2) determine whether the line-endings are crlf or just lf
>>
>> In the former case, look at translating the file contents to Unicode, 
>> specifying ASCII as source.  If it fails, you have non-ASCII.
>> In the latter case, investigate the 'U' (universal newlines) flag of 
>> the mode parameter in the open() function.
>>
>> You also need to ask yourself whether you're doing a validation of the 
>> file, or doing a "best guess" like the file command.
>>
>>
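
The decode-as-ASCII suggestion can be sketched as below.  The function 
name first_non_ascii is hypothetical, and it assumes the file contents 
were already read in binary mode into data.

```python
def first_non_ascii(data):
    """Return the offset of the first non-ASCII byte, or None if all ASCII."""
    try:
        data.decode('ascii')
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte that failed to decode.
        return exc.start
    return None
```

Note that this accepts every 7-bit value, control characters included, so 
it answers "is it ASCII" but not "is it printable text".
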
> Also, realize that ASCII is a 7-bit code, with printing characters all
> greater than space, and very few people use delete ('\x7F'), so you
> can define a function to determine if a file contains only printing
> ASCII and a few control characters.  This one is False unless some ink
> would be printed.
> 
> Python 3.X:
>     def ascii_file(name, controls=b'\t\n'):
>         ctrls = set(controls + b' ')
>         with open(name, 'rb') as f:
>             chars = set(f.read())  # set of byte values (ints)
>         inked = chars - ctrls      # empty if nothing would print
>         return (bool(inked) and min(chars) >= min(ctrls)
>                 and ord('~') >= max(chars) and min(inked) > ord(' '))
> 
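> Python 2.X:
>     # On 2.5, add: from __future__ import with_statement
>     def ascii_file(name, controls='\t\n'):
>         ctrls = set(controls + ' ')
>         with open(name, 'rb') as f:
>             chars = set(f.read())  # set of 1-character strings
>         inked = chars - ctrls      # empty if nothing would print
>         return (bool(inked) and min(chars) >= min(ctrls)
>                 and '~' >= max(chars) and min(inked) > ' ')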
> 
> For potentially more performance (at least on 2.X), you could do min
> and max on the data read, and only do the set(data) if the min and
> max are OK.
> 
> --Scott David Daniels
> Scott.Daniels at Acm.Org
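
The min/max shortcut suggested above might look like this on Python 3.  
This is a rough sketch, not Scott's code: the name ascii_bytes_ok is made 
up, and it takes the bytes already read from the file rather than a file 
name.

```python
def ascii_bytes_ok(data, controls=b'\t\n'):
    """True if data is printable ASCII plus the allowed controls,
    and at least one character would put ink on the page."""
    if not data:
        return False
    allowed = set(controls + b' ')
    # Cheap range test first; skip building the set if it already fails.
    if min(data) < min(allowed) or max(data) > ord('~'):
        return False
    inked = set(data) - allowed
    # Anything left must lie strictly above space.
    return bool(inked) and min(inked) > ord(' ')
```

Whether the early exit actually saves time depends on how often the files 
are bad, so measure before relying on it.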



