Possible read()/readline() bug?

Thu Oct 23 02:14:49 EDT 2008

Steven D'Aprano wrote:
> On Wed, 22 Oct 2008 16:59:45 -0400, Terry Reedy wrote:
> 
>> Mike Kent wrote:
>>> Before I file a bug report against Python 2.5.2, I want to run this by
>>> the newsgroup to make sure I'm not [missing something].

>> Good idea ;-).  What you are missing is a rereading of the fine manual
>> to see what you missed the first time.  I recommend this *whenever* you
>> are having a vexing problem.
> 
> With respect Terry, I think what you have missed is the reason why the OP 
> thinks this is a bug.

I think not.  I read and responded carefully ;-)  I stand by my answer: 
the OP should read the doc and try buffering=0 (the third argument to 
open()) to see if that solves his problem.
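
A minimal sketch of what I mean, not the OP's actual code. The file 
contents, the 5-byte record size, and the helper name read_record are 
made up for illustration, and the second handle only stands in for the 
other process that rewrites the file:

```python
import os
import tempfile


def read_record(f, offset, size):
    """Seek to offset and read size bytes from an open binary file."""
    f.seek(offset)
    return f.read(size)


# Hypothetical fixed-width record file, created here only for the demo.
fd, path = tempfile.mkstemp()
os.write(fd, b"AAAA\nBBBB\n")
os.close(fd)

# Third argument 0 asks open() for an unbuffered file object.
reader = open(path, "rb", 0)
first = read_record(reader, 0, 5)

# Stand-in for another process overwriting the first record in place.
writer = open(path, "r+b", 0)
writer.write(b"ZZZZ\n")
writer.close()

# Same reader, no close/reopen: seek back and read the record again.
second = read_record(reader, 0, 5)

reader.close()
os.remove(path)
```

With buffering turned off, each read goes to the operating system, so 
the second read should return the rewritten record rather than a stale 
copy held by the file object.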

> He's not surprised that buffering is going on:

> "This indicates some sort of buffering and caching is going on."

If one reads the open() doc section on buffering, one will *know* that 
the reading is buffered and that this is very intentional, and that one 
can turn it off.

> but he thinks that the buffering should be discarded when you seek:
> 
> "It seems pretty clear to me that this is wrong.  If there is any
> caching going on, it should clearly be discarded if I do a seek.

I don't think Python has any control over this, certainly not in a 
platform independent way, and not after the file has been opened.

For normal sane file reading, discarding after every seek would be very 
wrong.  Buffering is an *optional* efficiency measure which normally is 
the right thing to do and so is the default but which can be disabled 
when it is not IF ONE READS THE DOC.

> Note
> that it's not just readline() that's returning me the wrong, cached
> data, as I've also tried this with read(), and I get the same
> results.  It's not acceptable that I have to close and reopen the file
> before every read when I'm doing random record access."

And he does not have to do such a thing.

> I think Mike has a point: if a cache is out of sync with the actual data, 
> then the cache needs to be thrown away. A bad cache is worse than no 
> cache at all.

Right.  I told him what to try.  If *that* does not work, he can report 
back.

Python is not doing the caching.  This is OS stuff.

> Surely dealing with files that are being actively changed by other 
> processes is hard.

Tail, which sequentially reads what one or more other processes 
sequentially write, works fine.
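
For the sequential case, a rough tail-like sketch in Python. This is 
not how tail itself is implemented; the function name read_appended and 
the demo file are assumptions for illustration, and the writer handle 
stands in for the other process:

```python
import os
import tempfile


def read_appended(f):
    """Return the complete lines added to f since the last call."""
    f.seek(f.tell())  # re-seek to the current spot to clear any EOF state
    lines = []
    while True:
        pos = f.tell()
        line = f.readline()
        if not line.endswith(b"\n"):
            # Nothing new, or only a partial line so far: rewind and stop.
            f.seek(pos)
            break
        lines.append(line)
    return lines


# Demo file; the writer handle stands in for another process.
fd, path = tempfile.mkstemp()
os.close(fd)
writer = open(path, "ab")
reader = open(path, "rb")

writer.write(b"one\n")
writer.flush()
first_batch = read_appended(reader)

writer.write(b"two\npart")  # one full line plus an unfinished one
writer.flush()
second_batch = read_appended(reader)

writer.close()
reader.close()
os.remove(path)
```

The reader only ever moves forward past complete lines, so it never 
needs to revisit data it has already consumed.  That is what makes this 
case easy compared with random access to records another process may 
rewrite.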

> I'm not sure that the solution is anything other than 
> "well, don't do that then".

Mixed random access is a different matter.  There is a reason DBMSes run 
file access through one process.

> How do other programming languages and Unix 
> tools behave? (Windows generally only allows a single process to read or 
> write to a file at once.)
> 
> Additionally, I wonder whether what Mike is seeing is some side-effect of 
> file-system caching. Perhaps the bytes written to the file by echo are 
> only written to disk when the file is closed? I don't know, I'm just 
> hypothesizing.

When echo closes, I expect the disk block will be flushed, which means 
added to the pool of blocks ready to be read or written when the disk 
driver gets CPU time and gets around to any particular block.  Depending 
on the file system and driver, blocks may get sorted by disk address to 
minimize inter-access seek times (the elevator algorithm).

Terry Jan Reedy