thread semantics for file objects - Python-Dev

Jeremy Hylton

March 17, 2005

8:29 p.m.

Are the thread semantics for file objects documented anywhere? I don't see anything in the library manual, which is where I expected to find it. It looks like read and write are atomic by virtue of fread and fwrite being atomic. I'm less sure what guarantees, if any, the other methods attempt to provide. For example, it looks like concurrent calls to writelines() will interleave entire lines, but not parts of lines. Concurrent calls to readlines() provide insane results, but I don't know if that's a bug or a feature. Specifically, if your file has a line that is longer than the internal buffer size SMALLCHUNK you're likely to get parts of that line chopped up into different lines in the resulting return values. If we can come up with intended semantics, I'd be willing to prepare a patch for the documentation. Jeremy


Aahz

March 2005

9:25 p.m.

On Thu, Mar 17, 2005, Jeremy Hylton wrote:

...

Are the thread semantics for file objects documented anywhere? I don't see anything in the library manual, which is where I expected to find it. It looks like read and write are atomic by virtue of fread and fwrite being atomic.

Uncle Timmy will no doubt agree with me: the semantics don't matter. NEVER, NEVER access the same file object from multiple threads, unless you're using a lock. And even using a lock is stupid. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "The joy of coding Python should be in seeing short, concise, readable classes that express a lot of action in a small amount of clear code -- not in reams of trivial code that bores the reader to death." --GvR

Jeremy Hylton

9:47 p.m.

On Thu, 17 Mar 2005 16:25:44 -0500, Aahz <aahz@pythoncraft.com> wrote:

...

On Thu, Mar 17, 2005, Jeremy Hylton wrote:

...
Are the thread semantics for file objects documented anywhere? I don't see anything in the library manual, which is where I expected to find it. It looks like read and write are atomic by virtue of fread and fwrite being atomic.

Uncle Timmy will no doubt agree with me: the semantics don't matter. NEVER, NEVER access the same file object from multiple threads, unless you're using a lock. And even using a lock is stupid.

I'm not looking for your permission or approval. I just want to know what semantics are intended. If the documentation wants to say that the semantics are undefined, that's okay, although I think we need to say more because some behavior has been provided by the implementation for a long time. Jeremy

Samuele Pedroni

10 p.m.

Jeremy Hylton wrote:

...

On Thu, 17 Mar 2005 16:25:44 -0500, Aahz <aahz@pythoncraft.com> wrote:

...
On Thu, Mar 17, 2005, Jeremy Hylton wrote:

...
Are the thread semantics for file objects documented anywhere? I don't see anything in the library manual, which is where I expected to find it. It looks like read and write are atomic by virtue of fread and fwrite being atomic.

Uncle Timmy will no doubt agree with me: the semantics don't matter. NEVER, NEVER access the same file object from multiple threads, unless you're using a lock. And even using a lock is stupid.

I'm not looking for your permission or approval. I just want to know what semantics are intended. If the documentation wants to say that the semantics are undefined, that's okay, although I think we need to say more because some behavior has been provided by the implementation for a long time.

I think this is left unspecified for example by Java too. I would be surprised if Jython would offer the same characteristics in this respect as CPython.

"Martin v. Löwis"

10:04 p.m.

Jeremy Hylton wrote:

...

...
...
Are the thread semantics for file objects documented anywhere? I don't see anything in the library manual, which is where I expected to find it. It looks like read and write are atomic by virtue of fread and fwrite being atomic.

Uncle Timmy will no doubt agree with me: the semantics don't matter. NEVER, NEVER access the same file object from multiple threads, unless you're using a lock. And even using a lock is stupid.

I'm not looking for your permission or approval.

Literally, the answer to your question is "no". In fact, Python does not specify *any* interleaving semantics for threads whatsoever. The only statement to this respect is """ Not all built-in functions that may block waiting for I/O allow other threads to run. (The most popular ones (\function{time.sleep()}, \method{\var{file}.read()}, \function{select.select()}) work as expected.) """ Of course, this says it works as expected, without saying what actually is expected.

...

I just want to know what semantics are intended.

But this is not what you've asked :-) Anyway, expected by whom? Aahz clearly expects that the semantics are unspecified, as he expects that nobody ever even attempts to read the same file from multiple threads.

...

If the documentation wants to say that the semantics are undefined, that's okay,

Formally, there is no need to say that something is undefined. Not defining anything is sufficient. So the semantics *is* undefined, whether the documentation "wants" to say that or not.

...

although I think we need to say more because some behavior has been provided by the implementation for a long time.

That immediately rings the Jython bell, and perhaps also the PyPy bell. So if you want to say something, just go ahead. Before I make the documentation want to say that, I would like to make it say more basic things first (e.g. that stores to variables are atomic). Regards, Martin

Jeremy Hylton

10:15 p.m.

On Thu, 17 Mar 2005 23:04:16 +0100, "Martin v. Löwis" <martin@v.loewis.de> wrote:

...

Jeremy Hylton wrote:

...
...
...
Are the thread semantics for file objects documented anywhere? I don't see anything in the library manual, which is where I expected to find it. It looks like read and write are atomic by virtue of fread and fwrite being atomic.

Uncle Timmy will no doubt agree with me: the semantics don't matter. NEVER, NEVER access the same file object from multiple threads, unless you're using a lock. And even using a lock is stupid.

I'm not looking for your permission or approval.

Literally, the answer to your question is "no". In fact, Python does not specify *any* interleaving semantics for threads whatsoever. The only statement to this respect is

I'm surprised that it does not, for example, guarantee that reads and writes are atomic, since CPython relies on fread and fwrite which are atomic. Also, there are other operations that go to the trouble of calling flockfile(). What's the point if we don't provide any guarantees? <0.6 wink>. If it is not part of the specified behavior, then I suppose it's a quality of implementation issue. Either way it would be helpful if the Python documentation said something, e.g. you can rely on readline() being threadsafe or you can't but the current CPython implementation happens to be. readline() seemed like an interesting case because readlines() doesn't have the same implementation and the behavior is different. So, as another example, you could ask whether readlines() has a bug or not. Jeremy

"Martin v. Löwis"

10:57 p.m.

Jeremy Hylton wrote:

...

...
...
...
...
Are the thread semantics for file objects documented anywhere? Literally, the answer to your question is "no". I'm surprised that it does not, for example, guarantee that reads and writes are atomic, since CPython relies on fread and fwrite which are atomic.

Where is the connection? Why would anything that CPython requires from the C library have any effect on Python's documentation? The only effect on Python documentation is that anybody writes it. Nobody cares, so nobody writes documentation. Remember, you were asking what behaviour is *documented*, not what behaviour is guaranteed by the implementation (in a specific version of the implementation).

...

Also, there are other operations that go to the trouble of calling flockfile(). What's the point if we don't provide any guarantees?

Because nobody cares about guarantees in the documentation. Instead, people care about observable behaviour. So if you get a crash due to a race condition, you care, you report a bug, the Python developer agrees it's a bug, and fixes it by adding synchronization. Nobody reported a bug to the Python documentation.

...

<0.6 wink>. If it is not part of the specified behavior, then I suppose it's a quality of implementation issue. Either way it would be helpful if the Python documentation said something, e.g. you can rely on readline() being threadsafe or you can't but the current CPython implementation happens to be.

It would be helpful to whom? To you? I doubt this, as you will be the one who writes the documentation :-)

...

readline() seemed like an interesting case because readlines() doesn't have the same implementation and the behavior is different. So, as another example, you could ask whether readlines() has a bug or not.

Nobody knows. It depends on the Python developer who reviews the bug report. Most likely, he considers it tricky and leaves it open for somebody else. If his name is Martin, he will find that this is not a bug (because it does not cause a crash, and does not contradict the documentation), and he will reclassify it as a wishlist item. If his name is Tim, and if he has a good day, he will fix it, and add a comment on floating point numbers. Regards, Martin

Jeremy Hylton

3:56 a.m.

On Thu, 17 Mar 2005 23:57:52 +0100, "Martin v. Löwis" <martin@v.loewis.de> wrote:

...

Remember, you were asking what behaviour is *documented*, not what behaviour is guaranteed by the implementation (in a specific version of the implementation).

Martin, I think you're trying to find more finesse in my question than I ever intended. I intended to ask -- hey, what are the semantics we intend in this case? Since the documentation doesn't say, we could improve it by capturing the intended semantics.

...

...
Also, there are other operations that go to the trouble of calling flockfile(). What's the point if we don't provide any guarantees?

Because nobody cares about guarantees in the documentation. Instead, people care about observable behaviour. So if you get a crash due to a race condition, you care, you report a bug, the Python developer agrees it's a bug, and fixes it by adding synchronization.

As Tim later reported, this wasn't to address a crash, but to appease a pig-headed developer :-). I'm surprised by your claim that whether something is a bug depends on the person who reviews it. In practice, this may be the case, but I've always been under the impression that there was rough consensus about what constituted a bug and what a feature. I'd certainly say it's a goal to strive for. It sounds like the weakest intended behavior we have is the one Tim reported: "provided the platform C stdio wasn't thread-braindead, then if you had N threads all simultaneously reading a file object containing B bytes, while nobody wrote to that file object, then the total number of bytes seen by all N threads would sum to B at the time they all saw EOF." It seems to me like a good idea to document this intended behavior somewhere. Jeremy

"Martin v. Löwis"

6:57 a.m.

Jeremy Hylton wrote:

...

It sounds like the weakest intended behavior we have is the one Tim reported: "provided the platform C stdio wasn't thread-braindead, then if you had N threads all simultaneously reading a file object containing B bytes, while nobody wrote to that file object, then the total number of bytes seen by all N threads would sum to B at the time they all saw EOF." It seems to me like a good idea to document this intended behavior somewhere.

The guarantee that "we" want to make is certainly stronger: if the threads all read from the same file, each will get a series of "chunks". The guarantee is that it is possible to combine the chunks in a way to get the original contents of the file (i.e. not only the sum of the bytes is correct, but also the contents). However, I see little value adding this specific guarantee to the documentation when so many other aspects of thread interleaving are unspecified. For example, if a thread reads a dictionary simultaneous to a write in another thread, and the read and the write deal with different keys, there is a guarantee that they won't affect each other. If they operate on the same key, the read either gets the old value, or the new value, but not both. And so on. Writing down all these properties does little good, IMO. This includes your proposed property of file reads: anybody reading your statement will think "of course it works this way - why even mention it". Regards, Martin
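The dictionary property described above can be observed with a small experiment. This is only a sketch, assuming a current CPython 3 interpreter; the function name `observe_shared_key` and the key names are made up for illustration, and nothing here is a documented guarantee of the language. One thread repeatedly stores to a different key and then rebinds the contested key once, while another thread keeps reading the contested key and records every value it sees.

```python
import threading


def observe_shared_key(iterations=10000):
    """Read one dict key while another thread rebinds it; return the values seen."""
    shared = {"key": "old", "other": 0}
    seen = set()
    done = threading.Event()

    def writer():
        for i in range(iterations):
            shared["other"] = i   # stores to a different key
        shared["key"] = "new"     # one store to the contested key
        done.set()

    def reader():
        while not done.is_set():
            seen.add(shared["key"])
        seen.add(shared["key"])   # final read, after the writer has finished

    threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    print(observe_shared_key())
```

The reader should only ever record the old value or the new value. An experiment like this shows what one implementation does; it does not turn that behaviour into a specification, which is the point under debate in this thread.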

Paul Moore

1:16 p.m.

On Fri, 18 Mar 2005 07:57:25 +0100, "Martin v. Löwis" <martin@v.loewis.de> wrote:

...

The guarantee that "we" want to make is certainly stronger: if the threads all read from the same file, each will get a series of "chunks". The guarantee is that it is possible to combine the chunks in a way to get the original contents of the file (i.e. not only the sum of the bytes is correct, but also the contents).

That would be a useful property to be able to rely on, certainly. (Although in practical terms, probably a lot less than people would *like* to see guaranteed :-))

...

However, I see little value adding this specific guarantee to the documentation when so many other aspects of thread interleaving are unspecified.

I'm not sure I agree. It's an improvement in the situation, so why not add it? It may even encourage others, when thinking about threading issues, to consider whether the documentation should guarantee anything - and if so, to add that guarantee. Over time, the documentation gets better at describing thread-related behaviour - and correspondingly, people get (somewhat) more confident that where the documentation doesn't guarantee things, it's because there is a good reason.

...

For example, if a thread reads a dictionary simultaneous to a write in another thread, and the read and the write deal with different keys, there is a guarantee that they won't affect each other. If they operate on the same key, the read either gets the old value, or the new value, but not both.

If this is a genuine guarantee, then let's document it! I asked about precisely this issue on python-list a long while ago, and no-one could provide me with a confident answer (I couldn't be sure myself, my head explodes when I try to understand thread-related code). The only confident answer I got was "you're safe if you use a lock", but taking that position to extremes results in massive levels of unnecessary serialisation.

...

Writing down all these properties does little good, IMO.

Not a huge amount of good, certainly. But no harm, and a little bit of direct good, and also some indirect good in terms of making it clear that the issue has been thought about. I suppose what I am saying is that there is a practical difference between "undefined" and "unknown", even if there isn't a theoretical one... Of course, there's an implied requirement here to confirm any documented guarantees in Jython, and IronPython, and PyPy, and... But given that none of these (yet) implement the full Python 2.4 language definition, as far as I am aware, it's probably not sensible to get too hung up on this fact (although confirming that a guarantee doesn't cause major implementation difficulties would be reasonable). Paul.

Jeremy Hylton

3:17 p.m.

On Fri, 18 Mar 2005 07:57:25 +0100, "Martin v. Löwis" <martin@v.loewis.de> wrote:

...

Writing down all these properties does little good, IMO. This includes your proposed property of file reads: anybody reading your statement will think "of course it works this way - why even mention it".

The things that are so obvious they don't need to be written down are often the most interesting things to write down. In fact, you started the thread by saying there were no guarantees whatsoever and chiding me for asking if there were any. But it seems there are some intended semantics that are stronger than what you would find in C or Perl. Hence, I don't think they would be obvious to anyone who comes to Python from one of those languages. I agree that the semantics of multi-threaded Python programs is an enormous domain and we're discussing a tiny corner of it. I agree that it would be quite challenging to get better documentation or specifications here. But I also think that every little bit helps. Jeremy

stelios xanthakis

2:21 a.m.

Jeremy Hylton wrote:

...

On Thu, 17 Mar 2005 16:25:44 -0500, Aahz <aahz@pythoncraft.com> wrote:

...
On Thu, Mar 17, 2005, Jeremy Hylton wrote:

...
Are the thread semantics for file objects documented anywhere? I don't see anything in the library manual, which is where I expected to find it. It looks like read and write are atomic by virtue of fread and fwrite being atomic.

Uncle Timmy will no doubt agree with me: the semantics don't matter. NEVER, NEVER access the same file object from multiple threads, unless you're using a lock. And even using a lock is stupid.

I just want to know what semantics are intended. If the documentation wants to say that the semantics are undefined, that's okay, although I think we need to say more because some behavior has been provided by the implementation for a long time.

I think that when two threads write to the same fd without synchronization, the result is not deterministic anyway. In the case they are reading from the same fd, even worse! (and therefore the input cannot be useful to any serious algorithm) Python (libc in fact) just guarantees that there will be no crashes and corruption of data if the read/write functions are reentered. But ensuring that readline/writeline/etc is atomic would probably be a waste of resources, protecting the input/output for a case where the data is as good as random noise anyway. So I think Aahz is right. Stelios

"Martin v. Löwis"

6:41 p.m.

stelios xanthakis wrote:

...

I think that when two threads write to the same fd without synchronization, the result is not deterministic anyway. In the case they are reading from the same fd, even worse! (and therefore the input cannot be useful to any serious algorithm)

Yes, but we are not talking about the same fd. Instead, we talk about the same FILE*. A thread-safe libc guarantees (AFAIK) that the data passed to fwrite are appended as a whole. This, in turn, means that the data passed to Python's file.write are also appended as a whole. I'm pretty sure this property also holds on Windows. Regards, Martin
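The "appended as a whole" property can be probed with a short experiment. This is a sketch under loud assumptions: it uses Python 3's `io` layer in binary mode rather than the 2005 `FILE*`-based file object, and it leans on the `io` documentation's statement that binary buffered objects protect their internal structures with a lock. The function name `concurrent_line_writes` is hypothetical. Each thread passes one complete line to a single `write()` call on a shared file object, and the result is checked for torn lines.

```python
import os
import tempfile
import threading


def concurrent_line_writes(n_threads=4, lines_per_thread=200):
    """Several threads share one binary file object; each write() is one full line."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:       # binary mode: a buffered writer
            def worker(tid):
                for i in range(lines_per_thread):
                    f.write(b"thread %d line %d\n" % (tid, i))

            threads = [threading.Thread(target=worker, args=(t,))
                       for t in range(n_threads)]
            for t in threads:
                t.start()
            for t in threads:
                t.join()
        with open(path, "rb") as f:
            return f.read().splitlines()
    finally:
        os.remove(path)


if __name__ == "__main__":
    lines = concurrent_line_writes()
    print(len(lines), "lines written")
```

The order of lines from different threads is not predictable; only the integrity of each individual line is being examined. Text-mode file objects in Python 3 are documented as not thread-safe, so the same experiment in text mode would not be a fair test.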

Tim Peters

10:13 p.m.

[Jeremy Hylton]

...

Are the thread semantics for file objects documented anywhere?

No. At base level, they're inherited from the C stdio implementation. Since the C standard doesn't even mention threads, that's all platform-dependent. POSIX defines thread semantics for file I/O, but fat lot of good that does you on Windows, etc.

...

I don't see anything in the library manual, which is where I expected to find it. It looks like read and write are atomic by virtue of fread and fwrite being atomic.

I wouldn't consider this as more than CPython implementation accidents in the cases it appears to apply. For example, in universal-newlines mode, are you sure f.read(n) always maps to exactly one fread() call?

...

I'm less sure what guarantees, if any, the other methods attempt to provide.

I don't believe they're _trying_ to provide anything specific.

...

For example, it looks like concurrent calls to writelines() will interleave entire lines, but not parts of lines. Concurrent calls to readlines() provide insane results, but I don't know if that's a bug or a feature. Specifically, if your file has a line that is longer than the internal buffer size SMALLCHUNK you're likely to get parts of that line chopped up into different lines in the resulting return values.

And you're _still_ not thinking "implementation accidents" <wink>?

...

If we can come up with intended semantics, I'd be willing to prepare a patch for the documentation.

I think Aahz was on target here: NEVER, NEVER access the same file object from multiple threads, unless you're using a lock. And here he went overboard: And even using a lock is stupid. ZODB's FileStorage is bristling with locks protecting multi-threaded access to file objects, therefore that can't be stupid. QED
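The "use a lock" advice can be sketched as follows. This is a minimal illustration and not how ZODB's FileStorage is actually written; the class `LockedFile`, its `write_record` method, and the `demo` helper are hypothetical names. The lock makes a multi-call operation (here, writing a three-line record) appear as one unit to other threads, which no per-call atomicity of `write()` could provide on its own.

```python
import io
import threading


class LockedFile:
    """Hypothetical helper: serialize access to one file object with a lock."""

    def __init__(self, fileobj):
        self._file = fileobj
        self._lock = threading.Lock()

    def write_record(self, lines):
        # Hold the lock across all writes so the record stays contiguous.
        with self._lock:
            for line in lines:
                self._file.write(line)


def demo(n_threads=4, records_each=50):
    """Write three-line records from several threads; return the resulting lines."""
    out = io.StringIO()
    locked = LockedFile(out)

    def worker(tid):
        for i in range(records_each):
            tag = "%d-%d-" % (tid, i)
            locked.write_record([tag + "a\n", tag + "b\n", tag + "c\n"])

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out.getvalue().splitlines()


if __name__ == "__main__":
    print(len(demo()), "lines, in contiguous records")
```

Every access to the underlying file object has to go through the same lock for this to work; a single unlocked caller elsewhere defeats it.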

Jeremy Hylton

10:22 p.m.

On Thu, 17 Mar 2005 17:13:05 -0500, Tim Peters <tim.peters@gmail.com> wrote:

...

[Jeremy Hylton]

...
Are the thread semantics for file objects documented anywhere?

No. At base level, they're inherited from the C stdio implementation. Since the C standard doesn't even mention threads, that's all platform-dependent. POSIX defines thread semantics for file I/O, but fat lot of good that does you on Windows, etc.

Fair enough. I didn't consider Windows at all or other non-POSIX platforms.

...

...
I don't see anything in the library manual, which is where I expected to find it. It looks like read and write are atomic by virtue of fread and fwrite being atomic.

I wouldn't consider this as more than CPython implementation accidents in the cases it appears to apply. For example, in universal-newlines mode, are you sure f.read(n) always maps to exactly one fread() call?

Universal newline reads and get_line() both lock the stream if the platform supports it. So I expect that they are atomic on those platforms. But it certainly seems safe to conclude this is a quality of implementation issue. Otherwise, why bother with the flockfile() at all, right? Or is there some correctness issue I'm not seeing that requires the locking for some basic safety in the implementation?

...

And even using a lock is stupid.

ZODB's FileStorage is bristling with locks protecting multi-threaded access to file objects, therefore that can't be stupid. QED

Using a lock seemed like a good idea there and still seems like a good idea now :-). jeremy

Tim Peters

11:14 p.m.

[Jeremy Hylton] ...

...

Universal newline reads and get_line() both lock the stream if the platform supports it. So I expect that they are atomic on those platforms.

Well, certainly not get_line(). That locks and unlocks the stream _inside_ an enclosing for-loop. Looks quite possible for different threads to read different parts of "the same line" if multiple threads are trying to do get_line() simultaneously. It releases the GIL inside the for-loop too, so other threads _can_ sneak in. We put a lot of work into speeding those getc()-in-a-loop functions. There was undocumented agreement at the time that they "should be" thread-safe in this sense: provided the platform C stdio wasn't thread-braindead, then if you had N threads all simultaneously reading a file object containing B bytes, while nobody wrote to that file object, then the total number of bytes seen by all N threads would sum to B at the time they all saw EOF. This was a much stronger guarantee than Perl provided at the time (and, for all I know, still provides), and we (at least I) wrote little test programs at the time demonstrating that the total number of bytes Perl saw in this case was unpredictable, while Python's did sum to B. Of course Perl didn't document any of this either, and in Pythonland it was clearly specific to the horrid tricks in CPython's fileobject.c.
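A test program of the kind described above might look like the following. It is a reconstruction under assumptions, not the original 2005 test: it targets Python 3, where binary file objects come from the `io` module instead of fileobject.c, and the names `total_bytes_read` and `demo` are invented here. N threads read chunks from one shared file object until EOF, and the per-thread byte counts are summed and compared with the file size B.

```python
import os
import tempfile
import threading


def total_bytes_read(path, n_threads=4, chunk_size=1000):
    """Let n_threads read one shared file object to EOF; return total bytes seen."""
    counts = [0] * n_threads            # one slot per thread, so no shared counter
    with open(path, "rb") as f:
        def reader(idx):
            while True:
                chunk = f.read(chunk_size)
                if not chunk:           # empty bytes means EOF
                    break
                counts[idx] += len(chunk)

        threads = [threading.Thread(target=reader, args=(i,))
                   for i in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return sum(counts)


def demo(size=1000003):
    """Create a temporary file of `size` random bytes and run the readers on it."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(os.urandom(size))
        return total_bytes_read(path), size
    finally:
        os.remove(path)


if __name__ == "__main__":
    total, size = demo()
    print("bytes seen:", total, "file size:", size)
```

This only checks the weak "sums to B" property. It says nothing about which thread receives which chunk, or whether a given line stays in one piece.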

...

But it certainly seems safe to conclude this is a quality of implementation issue.

Or a sheer pigheadedness-of-implementor issue <wink>.

...

Otherwise, why bother with the flockfile() at all, right? Or is there some correctness issue I'm not seeing that requires the locking for some basic safety in the implementation?

There are correctness issues, but we still ignore them; locking relieves, but doesn't solve, them. For example, C doesn't (and POSIX doesn't either!) define what happens if you mix reads with writes on a file opened for update unless a file-positioning operation (like seek) intervenes, and that's pretty easy for threads to run afoul of. Python does nothing to stop you from trying, and behavior if you do is truly all over the map across boxes. IIRC, one of the multi-threaded test programs I mentioned above provoked ugly death in the bowels of MS's I/O libraries when I threw an undisciplined writer thread into the mix too. This was reported to MS, and their response was "so don't do that -- it's undefined". Locking the stream at least cuts down the chance of that happening, although that's not the primary reason for it. Heck, we still have a years-open critical bug against segfaults when one thread tries to close a file object while another thread is reading from it, right?

...

...
...
And even using a lock is stupid.

...

...
ZODB's FileStorage is bristling with locks protecting multi-threaded access to file objects, therefore that can't be stupid. QED

...

Using a lock seemed like a good idea there and still seems like a good idea now :-).

Damn straight, and we're certain it has nothing to do with those large runs of NUL bytes that sometimes overwrite people's critical data for no reason at all <wink>.
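The rule about mixing reads and writes on a file opened for update, mentioned earlier in this message, can be illustrated in a single thread. This is a sketch assuming Python 3, whose `io` layer does its own position bookkeeping, so the explicit `seek()` calls below mirror the C stdio discipline rather than being strictly required there. The function names `read_then_write` and `demo` are hypothetical.

```python
import os
import tempfile


def read_then_write(path, n, payload):
    """Read n bytes, then overwrite what follows, seeking at each direction change."""
    with open(path, "r+b") as f:
        head = f.read(n)
        f.seek(f.tell())    # positioning call between the read and the write
        f.write(payload)
        f.seek(0)           # positioning call before switching back to reading
        return head, f.read()


def demo():
    """Run read_then_write on a small temporary file and return its results."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"0123456789")
        return read_then_write(path, 4, b"XY")
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

With several threads sharing the file object, the read, the seek, and the write would all need to sit inside one locked region; otherwise another thread can move the position between the steps.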

Aahz

10:27 p.m.

On Thu, Mar 17, 2005, Tim Peters wrote:

...

I think Aahz was on target here:

NEVER, NEVER access the same file object from multiple threads, unless you're using a lock.

And here he went overboard:

And even using a lock is stupid.

ZODB's FileStorage is bristling with locks protecting multi-threaded access to file objects, therefore that can't be stupid. QED

Heh. And how much time have you spent debugging race conditions and such? That's the thrust of my point, same as we tell people to avoid locks and use Queue instead. I know that my statement isn't absolutely true in the sense that it's possible to make code work that accesses external objects across threads. (Which is why I didn't garnish that part with emphasis.) But it's still stupid, 95-99% of the time. Actually, I did skip over one other counter-example: stdout is usually safe across threads provided one builds up a single string. Still not something to rely on. -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ "The joy of coding Python should be in seeing short, concise, readable classes that express a lot of action in a small amount of clear code -- not in reams of trivial code that bores the reader to death." --GvR
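The "use Queue instead" advice can be sketched like this. It assumes Python 3, where the `Queue` module of the time is spelled `queue`; the function name `log_via_queue` and the message format are made up for illustration. Producer threads never touch the file object. Each builds one complete string per message and puts it on a queue, and a single writer thread is the only code that calls `write()`.

```python
import io
import queue
import threading


def log_via_queue(out, n_producers=4, messages_each=100):
    """Funnel messages from several producer threads through one writer thread."""
    q = queue.Queue()
    sentinel = None                     # tells the writer thread to stop

    def writer():
        while True:
            item = q.get()
            if item is sentinel:
                break
            out.write(item)             # only this thread touches the file object

    def producer(pid):
        for i in range(messages_each):
            # Build the whole message as a single string before handing it off.
            q.put("producer %d message %d\n" % (pid, i))

    writer_thread = threading.Thread(target=writer)
    writer_thread.start()
    producers = [threading.Thread(target=producer, args=(p,))
                 for p in range(n_producers)]
    for t in producers:
        t.start()
    for t in producers:
        t.join()
    q.put(sentinel)                     # all producers are done; stop the writer
    writer_thread.join()


if __name__ == "__main__":
    buffer = io.StringIO()
    log_via_queue(buffer)
    print(len(buffer.getvalue().splitlines()), "messages written")
```

Because the queue is first-in, first-out and there is a single consumer, messages from any one producer reach the file in the order they were produced, while the interleaving between producers remains arbitrary.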
