Need to know if a file has only ASCII characters

Tue Jun 16 19:07:06 EDT 2009

Scott David Daniels wrote:
> norseman wrote:
>> Scott David Daniels wrote:
>>> Dave Angel wrote:
>>>> Jorge wrote: ...
>>>>> I'm making an application that reads third-party generated ASCII files,
>>>>> but sometimes the files are totally or partially corrupted, and I
>>>>> need to know if it's an ASCII file with *nix line terminators.
>>>>> On Linux I can run the file command, but the application should run on
>>>>> Windows.
>> you are looking for a \x0D (the Carriage Return) \x0A (the Line feed) 
>> combination. If present you have Microsoft compatibility. If not you 
>> don't.  If you think High Bits might be part of the corruption, filter 
>> each byte with byte & \x7F  (byte AND'ed with hex 7F or 127 base 10)
>> then check for the \x0D \x0A combination.
> 
> Well  ASCII defines a \x0D as the return code, and \x0A as line feed.
> It is unix that is wrong, not Microsoft (don't get me wrong, I know
> Microsoft has often redefined what it likes invalidly).  If you
> open the file with 'U', Python will return lines w/o the \r character
> whether or not they started with it, equally well on both unix and
> Microsoft systems.  

Yep - but if you are on Microsoft systems you will usually need the \r.

Remove them and open the file in Notepad to see what I mean.
Wordpad handles the lack of \r OK. Handles larger files too.
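
For reference, a minimal sketch of the translation Scott describes, assuming current Python 3, where the 'U' mode flag has since been removed and the newline argument does the same job. The helper name read_text and the sample bytes are made up for illustration.

```python
import io

def read_text(data, newline=None):
    """Decode bytes as ASCII text through a TextIOWrapper.

    newline=None turns on universal newlines (CR LF and lone CR are
    returned as LF); newline='' leaves the terminators untouched.
    """
    wrapper = io.TextIOWrapper(io.BytesIO(data), encoding="ascii", newline=newline)
    return wrapper.read()

# Invented sample with mixed terminators: CR LF, LF, lone CR, LF.
sample = b"first\r\nsecond\nthird\rfourth\n"

translated = read_text(sample)              # universal newlines
untouched = read_text(sample, newline="")   # terminators preserved

print(repr(translated))
print(repr(untouched))
```

Reading with newline='' (or in binary mode) is the way to see which terminators the file really contains. Because the wrapper decodes as ASCII, a byte with the high bit set raises UnicodeDecodeError instead of passing through silently.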

> Many moons ago the high order bit was used as a
> parity bit, but few communication systems do that these days, so
> anything with the high bit set is likely corruption.
> 

OH?  How did one transfer binary files over the phone?
I used PIP or Kermit and it got there just fine, high bits and all. 
Mail and other so called "text only" programs CAN (but not necessarily 
do) use 7bit transfer protocols.  Can we say  MIME?   FTP transfers high 
bit just fine too.
Set protocols to 8,1 and none. (8bit, 1 stop, no parity)
As to how his 3rd party ASCII files are generated? He does not know, I 
do not know, we do not know (or care), so test before use.
Filter out the high bits, remove all control characters except cr,lf and 
perhaps keep the ff too, then test what's left.

                 ASCII
cr - carriage return       ^M    x0D   \r
lf - line feed             ^J    x0A   \n
ff - form feed (new page)  ^L    x0C   \f

>> .... Intel uses one order and the SUN and  the internet another.  The
>> BIG/Little ending confuses many. Intel reverses the order of multibyte
>> numerics.  Thus- Small machine has big ego or largest byte value last.
>> Big Ending.  Big machine has small ego.
>> Little Ending.  Some coders get the 0D0A backwards, some don't.  You 
>> might want to test both.
>> (2^32)(2^24)(2^16)(2^8)  4 bytes correct math order  little ending
>> Intel stores them (2^8)(2^16)(2^24)(2^32)   big ending
>> SUN/Internet stores them in correct math order.
>> Python will use \r\n (0D0A) and \n\r (0A0D) correctly.
> 
> This is the most confused summary of byte sex I've ever read.
> There is no such thing as "correct math order" (numbers are numbers).

"...numbers are numbers..."   Nope! Numbers represented as characters may 
be in ASCII but you should take a look at IBM mainframes. They use 
EBCDIC and the 'numbers' are different bit patterns.  Has anyone taken 
the time to read the IEEE floating point specs?  To an electronic 
calculating machine, internally everything is a bit. Bytes are a group 
of bits and the CPU structure determines what a given bit pattern is.
The computer has no notion of number, character or program instruction. 
It only knows what it is told.  Try this - set the next instruction 
(jump) to a data value and watch the machine try to execute it as a 
program instruction.  (I assume you can program in assembly. If not - 
don't tell because 'REAL programmers do assembly'. I think the last time 
I used it was 1980 or so. The program ran until the last of the hardware 
died and replacements could not be found. The client hired another to 
write for the new machines and closed shop shortly after.  I think the 
owner was tired and found an excuse to retire. :)
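
To make the "everything is a bit" point concrete, here is a small sketch, assuming Python 3. The four bytes are arbitrary, and cp037 is just one of the EBCDIC code pages that ships with Python, picked for illustration.

```python
import struct

raw = b"1234"   # four arbitrary bytes, chosen only for illustration

as_ascii = raw.decode("ascii")             # read as characters
as_le_int = struct.unpack("<I", raw)[0]    # read as a little-endian 32-bit integer
as_be_int = struct.unpack(">I", raw)[0]    # read as a big-endian 32-bit integer
as_float = struct.unpack(">f", raw)[0]     # read as an IEEE 754 single

# The digit characters themselves use different bit patterns in ASCII and EBCDIC.
ascii_digits = "0123456789".encode("ascii")
ebcdic_digits = "0123456789".encode("cp037")

print(as_ascii, hex(as_le_int), hex(as_be_int), as_float)
print(ascii_digits.hex(), ebcdic_digits.hex())
```

The same four bytes come back as text, as two different integers, or as a float, depending only on what the reader is told to expect.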

> The '\n\r' vs. '\r\n' has _nothing_ to do with little-endian vs.
> big-endian.  By the way, there are great arguments for each order,
> and no clear winner. 

I don't care. Not the point. Point is some people get it fouled up and 
cause others problems.  Test for both.  You will save yourself a great 
deal of trouble in the long run.
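
One way to test for both, sketched under the assumption that the file fits in memory and was read in binary mode (Python 3). The function name count_terminators and the labels are made up for this example.

```python
import re
from collections import Counter

# Two-byte pairs come first in the pattern, so a CR LF pair is never
# split into a lone CR followed by a lone LF.
_TERMINATOR = re.compile(rb"\r\n|\n\r|\r|\n")

_NAMES = {b"\r\n": "CRLF", b"\n\r": "LFCR", b"\r": "CR", b"\n": "LF"}

def count_terminators(data):
    """Tally each kind of line terminator, scanning left to right."""
    return Counter(_NAMES[m] for m in _TERMINATOR.findall(data))

print(count_terminators(b"one\r\ntwo\r\n"))   # ordinary CR LF file
print(count_terminators(b"one\n\rtwo\n\r"))   # reversed pairs
print(count_terminators(b"one\r\n\r\ntwo"))   # blank line, still only CR LF
```

Scanning left to right matters: a plain substring search for the reversed pair would report a false hit in the third sample, where the end of one CR LF meets the start of the next. A file that mixes bare LF with CR LF around blank lines stays ambiguous, so a mixed tally is best treated as a sign of damage.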

> Network order was defined for sending numbers
> across a wire, the idea was that you'd unpack them to native order
> as you pulled the data off the wire.

"... sending BINARY FORMATTED numbers..." (versus character - type'able)

Network order was defined to reduce machine time. Since the servers that 
worked day in and day out were SUN, SUN order won.
I haven't used EBCDIC in so long I really don't remember for sure but it 
seems to me they used SUN order before SUN was around.  Same for the 
VAX, I think.
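
For anyone who wants to see the byte orders side by side, a minimal sketch with Python's struct module follows (Python 3 assumed, value chosen arbitrarily). It also shows that a character stream such as cr,lf is a sequence of single bytes and is not reordered.

```python
import struct
import sys

value = 0x01020304   # arbitrary 32-bit number used only for illustration

little = struct.pack("<I", value)    # little-endian (Intel-style) layout
big = struct.pack(">I", value)       # big-endian layout
network = struct.pack("!I", value)   # network order, defined as big-endian

# Unpacking with the matching format recovers the number on any host.
round_trip = struct.unpack("!I", network)[0]

# Byte order applies to multi-byte numbers only; text is one byte at a time.
text = "line\r\n".encode("ascii")

print(sys.byteorder)                 # native order of the machine running this
print(little.hex(), big.hex(), network.hex())
print(round_trip == value, text)
```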

> 
> The '\n\r' vs. '\r\n' differences harken back to the days when they were
> format effectors (carriage return moved the carriage to the extreme
> left, line feed advanced the paper).  You needed both to properly
> position the print head.  

Yep.  There wasn't enough intelligence in the old printers to 'cook' the 
stream.

> ASCII uses the pair, and defined the effect
> of each.  

Actually the Teletype people defined most of the \x00 - \x1f concepts.
If I remember the trivia correctly - original teletype was 6 bit bytes. 
Bit pattern was neither ASCII nor EBCDIC. Both of those adopted the 
teletype control-character concept.

> As ASCII was being worked out, MIT even defined a "line
> starve" character to move up one line just as line feed went down one.
> The order of the format effectors most used was '\r\n' because the
> carriage return involved the most physical motion on many devices, and
> the vertical motion time of the line feed could happen while the
> carriage was moving.  

True.  My experiment with reversing the two instructions would sometimes 
cause the printer to malfunction.  One of my first 'black boxes' 
(filters) included instructions to see and correct the "wrong" pattern. 
  Then I had to modify it to allow pure binary to get 'pictures' on the 
dot matrix types.

> After that, you often added padding bytes 
> (typically ASCII NUL ('\x00') or DEL ('\x7F')) to allow the hardware
> time to finish before you did the spacing and printing.
> 

If I remember correctly:
ASCII NULL   x00      In my opinion, NULL should be none set :)
IBM NULL     x80      IBM card  80 Cols
Sperry-Rand  x90      S/R Card  90 Cols

Trivia question:
Why is a byte 8 bits?

Ans: people have 10 fingers and the hardware to handle morse code 
(single wire - serial transfers) needed timers.  1-start, 8 data, 1-stop 
makes it a count by ten.  Burroughs had 10 bits but counting by 12s just 
didn't come 'naturally'.
That was the best answer I've heard to date. In reality - who knows?

'...padding...'
I never did. Never had to. Printers I used had enough buffer to avoid 
that practice.  Thirty two character buffer seemed to be enough to 
disallow overflow.  Of course we were using 300 to 1200 BAUD and DTR 
(pin 19 in most cases) -OR- the RTS and CTS pair of wires to control 
flow since ^S/^Q could be a valid dot matrix byte(s). Same for hardwired 
PIP or Kermit transfers.

> --Scott David Daniels
> Scott.Daniels at Acm.Org
>