How to bypass Windows 'cooking' the I/O? (One more time, please) II

Mon Jul 7 04:03:10 EDT 2008

I know I saw the answer recently, as in since February '08, but I can't
re-find it. :(   I tried the mail archives and such and my own
collections but the piece I saw still eludes me.

Problem:  (sos=same old s...)  Microsoft insists the world work its way
even when the Microsoft way was proven wrong decades ago. In this case
it's (still) 'cooking' the writes even with 'rwb' and O_RDWR|O_BINARY in
(proper respective) use.

Specific:  python created and inspected binary file ends:
	   00460: 0D 1A        (this is correct)

	   after a write
			os.lseek(target, -1, 2)
			os.write(target,record)
	   the expected result would be:
	   00460: 0D 20 .....data bytes.... 1A

	   BUT I get:
	   00460: 20 .... data bytes... 1A

It is one byte off!!!  And the 0D has to be there. Signifies the end of
the header.

Same python program runs as expected in Linux.  Maybe because that's
where it was written?! :)

What I seek is the way to slap Microsoft up side the head and make it
work correctly.  OK, well, at least in this situation.

Note: Things like this justify Python implementers bypassing OS calls
(data fetch, data write) and using the BIOS direct. Remember, the CPU
understands bit patterns only. It has no comprehension of 'text',
'program', 'number', 'pointer', blah blah blah....  All that is totally
beyond its understanding. A given bit pattern means 'do that'. The CPU
is 100% binary.  Memory, storage and the rest is just bits-on, bits-off.
Patterns. Proper binary I/O is mandatory for the machine to function.

Anyway - if whoever mentioned the flags and such to 'override'
Microsoft's BS would re-send that piece I would be very appreciative.

Steve
norseman at hughes.net
==================================================================
Above is original request. I assume the answer I seek is by someone not 
receiving the list currently. I'm on and off myself so I can understand.

I received two replies to the original request. Both insist on trying 
to use TEXT modes to do BINARY work. Allow me to explain before hooting 
and hollering that I'm all wet.

Yes - 'rwb' is a typo. The compiler catches it every time I use it.
READ as in read only,
WRITE as in write only
but NOT READ/WRITE in syntax when using STREAM I/O.
One is to use r+ for rw  and w+ for wr and tack on the b to eliminate 
the default of 'cooking' the data.
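
A short sketch of those mode strings, assuming a current Python 3 interpreter (where a bad mode raises ValueError; the file name is made up):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes.bin")  # hypothetical name

# 'w+b': read/write, binary; the file is created or truncated first.
with open(path, "w+b") as f:
    f.write(b"0123456789")
    f.seek(0)
    after_w_plus = f.read()

# 'r+b': read/write, binary; the existing contents are kept.
with open(path, "r+b") as f:
    f.seek(-1, os.SEEK_END)   # last byte of the file
    f.write(b"X")             # overwrite it in place
    f.seek(0)
    after_r_plus = f.read()

# 'rwb' is not a valid mode; Python 3 rejects it with ValueError.
try:
    open(path, "rwb")
    rwb_rejected = False
except ValueError:
    rwb_rejected = True

print(after_w_plus, after_r_plus, rwb_rejected)
```

The point of choosing r+b over w+b for this job is that w+b throws away the existing file before the first seek.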

File.verb is defined as being --- well a STREAM I/O handler. The docs 
insist on calling it a file descriptor somehow. Or at least that is the 
way it reads.  Larry's 'fp' (file pointer) is a correct short form.

Let's start with Larry:

 > You may be the victim of buffering (not calling .flush() or .close()
 > to commit your write to disk).  Why aren't you using the file object
 > to do your seek and write?

     STREAM things take a great deal of overhead. I'm just reading,
      re-arranging and writing. (supposedly) Simple buffer stuff.
 >
 > Normal file I/O sequence:
 >
 > fp = open(target, 'wb')
 >
 > fp.seek(-1, 2)
 >
 > fp.write(record)
 >

   Except it doesn't do that in Windows. See below.

 > by going through os. methods instead of the file instance I think you
 > are accessing the file through 2 different I/O buffers.  I could be
 > all wrong here.
     Nope (on 2 diff...) and You are but I can see where you might think 
as you do.

Tim is my next victim:

 >>> f = open('x.bat','r+b')
 >>> s = f.read()
 >>> s
'sed -e "s/[ \\t]*$//" -e "/^$/d" %1\rhow about that\r\n'
 >>> f.seek(-1,2)
 >>> f.write('xxx\r\n')
 >>> f.close()
 >>> f = open('x.bat','rb')
 >>> t = f.read()
 >>> t
'sed -e "s/[ \\t]*$//" -e "/^$/d" %1\rhow about that\rxxx\r\n'
 >>>

     If you put that in a binary file, that file will never work again. 
If you don't believe me, try it on ntldr and see what happens.
(Tim - please don't. Your Window$ system won't boot if you do. 
Originally ALL STREAM I/O was supposed to 'cook' the stream for 'text' 
transmission. It has evolved, BUT.... legacy tends to linger.)

This is the 'see below' place:

		concerning seek and write:
let: 0123456789  be contents of a file. (offsets zero through nine)
   File size will be reported as 10 bytes
   If you seek to EOF and append ABC it will move to OFFSET 10 and put in
   ABC with a result of  0123456789ABC   and report a file size of 13.
   With offsets it is simple math, quick and efficient.
   So a move to filesize minus 3 lands on byte 10, which has a content of 9
This is correct and this is what Python on Linux does.
          (How many just got lost?  Remember, byte 10 is offset 9)
                    (or offset plus 1 is byte count)
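
The 0-9 plus ABC arithmetic above can be sketched as follows, assuming a current Python 3 interpreter and a made-up file name. The names byte_number and tenth_byte are only for illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "offsets.bin")  # hypothetical name
with open(path, "wb") as f:
    f.write(b"0123456789")             # offsets zero through nine

size_before = os.path.getsize(path)    # byte count of the original file

with open(path, "r+b") as f:
    eof_offset = f.seek(0, os.SEEK_END)  # seek to EOF, returns the offset
    f.write(b"ABC")                      # append at that offset

size_after = os.path.getsize(path)

with open(path, "rb") as f:
    contents = f.read()
    byte_number = size_after - 3       # counting bytes from 1
    f.seek(byte_number - 1)            # byte count minus 1 is the offset
    tenth_byte = f.read(1)

print(size_before, eof_offset, size_after, contents, tenth_byte)
```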

using ...seek(-1,2)
   (be it: os.lseek, file.seek, whatever - don't get dumb on me)

What Microsoft does is go to EOF and IF it is a hex-1A preceded by a
hex-0D it then backs over the hex-0D also. If the last character of the
file is a hex-1A and no EOL precedes it, it starts the next write on top of
hex-1A as it was told. Result: file is shifted left one byte.

using ...seek(0,2) (os.lseek, file.seek, whatever - staaayy....)

What Microsoft does is go to EOF and IF it is a hex-1A preceded by a
line terminator hex-0D it then backs over the hex-1A and starts writing 
there. If the last character of the file is a hex-1A and no EOL precedes 
it, it starts the next write right after the hex-1A, leaving the 1A in
the file.
Result: file data shifts one byte right for each record appended having 
a hex-1A terminator unless the 1A was preceded by a hex-0D. Fun Huh?

BOTH CASES occur whichever I/O system is chosen and used.
BOTH CASES are unacceptable behavior when the 'b' is added to the file 
use mode. Once in binary there is supposed to be no bullshit.
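
One way to check both cases on any given machine is sketched below. It is only a test harness, assuming a current Python 3 interpreter. The helper names write_at and expected, the sample file contents and the record are all made up for illustration. It compares what actually lands in the file against plain offset arithmetic and makes no claim about what any particular Windows version will print.

```python
import os
import tempfile

# O_BINARY is defined only on Windows; use 0 (no extra flag) elsewhere.
O_BINARY = getattr(os, "O_BINARY", 0)

def write_at(original, offset_from_end, record):
    """Write record at EOF + offset_from_end; return the resulting bytes."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as f:
            f.write(original)
        fd = os.open(path, os.O_RDWR | O_BINARY)
        try:
            os.lseek(fd, offset_from_end, os.SEEK_END)
            os.write(fd, record)
        finally:
            os.close(fd)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)

def expected(original, offset_from_end, record):
    """What plain offset arithmetic says the file should contain."""
    start = len(original) + offset_from_end
    return original[:start] + record + original[start + len(record):]

record = b"\x20AB\x1a"                     # hypothetical record
cases = {
    "0D before 1A": b"HDR\x0d\x1a",        # 1A preceded by 0D
    "no 0D before 1A": b"HDR\x20\x1a",     # 1A with no EOL before it
}

results = {}
for name, original in cases.items():
    for offset in (-1, 0):                 # seek(-1,2) and seek(0,2)
        got = write_at(original, offset, record)
        results[(name, offset)] = (got == expected(original, offset, record))
        print(name, offset, got, results[(name, offset)])
```

Any False in the last column means the bytes did not land where the offsets say they should.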

In the mid 1960's I was updating code written in the early 1950's and my 
0-9+ABC thing above was the rule then too. Over half a century of legacy.

Who's got the 'two by four' to pound Microsoft's nonsense switch into 
the off position?

OK - the hooting and hollering that I'm all wet can start. :)

Steve
norseman at hughes.net