[Python-Dev] IO module improvements

Sat Feb 6 12:43:08 CET 2010

Antoine Pitrou wrote:
>   
> What is the difference between "file handle" and a regular C file descriptor?
> Is it some Windows-specific thing?
> If so, then perhaps it deserves some Windows-specific attribute ("handle"?).
>   
At the moment it's Windows-specific, but it's not impossible that some 
other OSes also rely on native file handles (emulating C file 
descriptors only for compatibility).
I have indeed mirrored the fileno concept, with a "handle" argument for 
constructors and a handle() getter.
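
For illustration only, here is a minimal sketch of a getter that sits next to 
fileno(). The helper name native_handle is hypothetical. 
msvcrt.get_osfhandle() is the standard library call that maps a C file 
descriptor to its Windows handle; on other platforms the sketch simply 
assumes the descriptor is the native handle.

```python
import sys


def native_handle(stream):
    """Return the OS-level handle behind a stream (hypothetical helper)."""
    fd = stream.fileno()
    if sys.platform == "win32":
        # Windows: the C descriptor is a layer over a native HANDLE.
        import msvcrt
        return msvcrt.get_osfhandle(fd)
    # Assumption: elsewhere the descriptor is the native handle.
    return fd
```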

> On Fri, Feb 5, 2010 at 5:28 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:
>   
>> Pascal Chambon <pythoniks <at> gmail.com> writes:
>>     
>>> By the way, I'm having trouble with the "name" attribute of raw files,
>>> which can be string or integer (confusing), ambiguous if containing a
>>> relative path,
>>>       
>
> Why is it ambiguous? It sounds like you're using str() of the name and
> then can't tell whether the file is named e.g. '1' or whether it
> refers to file descriptor 1 (i.e. sys.stdout).
>
>   
As Jean-Paul mentioned, I find it confusing that it can be a 
relative path, and sometimes not a path at all. I'm pretty sure many 
programmers haven't even considered in their library code that it could 
be a non-string, and use concatenation etc. on it...
However, I guess there is so much history behind it that I'll have to 
conform to this semantic, putting all paths/filenos/handles in the same 
"name" property, and adding an "origin" property telling how to 
interpret the "name"...
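
A small sketch of the ambiguity, and of what an "origin" label could look 
like. The function describe_name and the labels "fileno" and "path" are 
hypothetical; the sketch cannot tell a descriptor from a native handle, 
since both would be integers.

```python
import os
import tempfile


def describe_name(stream):
    """Return an (origin, name) pair for a stream (hypothetical helper)."""
    name = stream.name
    if isinstance(name, int):
        # Stream was opened from a descriptor: name is not a path at all.
        return ("fileno", name)
    return ("path", name)


fd, path = tempfile.mkstemp()
with open(fd, "rb") as by_fd, open(path, "rb") as by_path:
    fd_origin = describe_name(by_fd)
    path_origin = describe_name(by_path)
os.remove(path)
```

Library code that does something like "Reading " + stream.name would fail 
on the first stream and work on the second.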

>> Methods too would deserve some auto-forwarding. If you want to bufferize 
>> a raw stream which also offers size(), times(), lock_file() and other 
>> methods, how can these be accessed from a top-level buffering/text 
>> stream ?
>>     
>
> I think it's a bad idea. If you forget to implement one of the standard IO
> methods (e.g. seek()), it will get forwarded to the raw stream, but with the
> wrong semantics (because it won't take buffering into account).
>
> It's better to require the implementor to do the forwarding explicitly if
> desired, IMO.
>   
The problem is, doing that forwarding is quite complicated. IO is a 
collection of "core tools for working with streams", but it's currently 
not flexible enough to let people customize them too...
For example, if I want to add a new series of methods to all standard 
streams, which simply forward calls to new raw stream features, what do 
I do? Monkey-patch base classes (RawFileIO, BufferedIOBase...)? Not 
a good pattern. Subclass 
FileIO+BufferedWriter+BufferedReader+BufferedRandom+TextIOWrapper? 
That's really redundant...
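
As a rough sketch of the kind of forwarding meant here, under the 
assumption that only unknown public attributes are forwarded: the classes 
SizedFileIO and ForwardingBufferedReader and the method size() are 
hypothetical. Because __getattr__ only runs when normal lookup fails, 
standard methods such as seek() still come from the buffered layer.

```python
import io
import os
import tempfile


class SizedFileIO(io.FileIO):
    """Raw stream with one extra, non-standard method (hypothetical)."""

    def size(self):
        return os.fstat(self.fileno()).st_size


class ForwardingBufferedReader(io.BufferedReader):
    """Buffered layer forwarding unknown public attributes to self.raw."""

    def __getattr__(self, name):
        # Reached only when normal attribute lookup has failed.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self.raw, name)


fd, path = tempfile.mkstemp()
os.write(fd, b"hello")
os.close(fd)
with ForwardingBufferedReader(SizedFileIO(path, "rb")) as reader:
    forwarded_size = reader.size()  # resolved on the raw stream
os.remove(path)
```

The same __getattr__ would have to be repeated on every buffered and text 
class, which is the redundancy complained about above.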

And there are flaws especially around BufferedRandom. This stream 
inherits from BufferedWriter and BufferedReader, and overrides some 
methods. How do I extend it? I'd want to reuse its methods, but then have 
it forward calls to MY buffered classes, not the original BufferedWriter 
or BufferedReader classes. Should I modify its __bases__ to edit the 
inheritance tree? Handy, but not a good pattern... I'm currently getting 
what I want with a triple inheritance (praying for the MRO to be as I 
expect), but it's really not straightforward.
Having BufferedRandom as an additional layer would slow down the system, 
but would allow its reuse with custom buffered writers and readers...
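
One possible shape of that triple inheritance, sketched against the 
pure-Python _pyio classes (the C classes in io are structured 
differently). MyWriter, MyReader and MyRandom are hypothetical names, and 
the base order shown is only one of the orders that could be tried.

```python
import _pyio


class MyWriter(_pyio.BufferedWriter):
    """Custom buffered writer (hypothetical, no overrides shown)."""


class MyReader(_pyio.BufferedReader):
    """Custom buffered reader (hypothetical, no overrides shown)."""


class MyRandom(_pyio.BufferedRandom, MyWriter, MyReader):
    """Reuses BufferedRandom while placing the custom classes in the MRO."""


# Inspect the linearization instead of trusting intuition.
mro_names = [cls.__name__ for cls in MyRandom.__mro__]
print(mro_names)
```

Even with the MRO in the expected order, any BufferedRandom method that 
calls BufferedWriter or BufferedReader by name rather than through 
super() would skip MyWriter and MyReader.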

>> - I feel thread-safety locking and stream status checking are 
>> currently overly complicated. All methods are filled with locking calls 
>> and CheckClosed() calls, which is both a performance loss (most io 
>> streams will have 3 such levels of locking, when 1 would suffice)
>>     
>
> FileIO objects don't have a lock, so there are 2 levels of locking at worst, not
> 3 (and, actually, TextIOWrapper doesn't have a lock either, although perhaps it
> should).
> As for the checkClosed() calls, they are probably cheap, especially if they
> bypass regular attribute lookup.
>   
CheckClosed calls are cheap, but they can easily be forgotten in one of 
the dozens of methods involved...
My own FileIO class alas needs locking because, for example, on Windows 
truncating a file means seeking + setting end of file + restoring the 
pointer.
And TextIOWrapper seems to deserve locks too. Maybe excerpts like this one 
really are thread-safe, but a long study would be required to make sure.

        if whence == 2: # seek relative to end of file
            if cookie != 0:
                raise IOError("can't do nonzero end-relative seeks")
            self.flush()
            position = self.buffer.seek(0, 2)
            self._set_decoded_chars('')
            self._snapshot = None
            if self._decoder:
                self._decoder.reset()
            return position
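
To illustrate why a multi-step truncate wants a lock at the raw level, 
here is a portable sketch of the seek + set end of file + restore 
sequence. LockedTruncateFile is a hypothetical wrapper, not the actual 
implementation; the lock only helps if every other method touching the 
file pointer takes it as well.

```python
import io
import threading


class LockedTruncateFile:
    """Hypothetical wrapper whose truncate() is built from three raw steps."""

    def __init__(self, raw):
        self._raw = raw
        self._lock = threading.Lock()

    def truncate(self, size):
        with self._lock:
            saved = self._raw.tell()      # remember the file pointer
            try:
                self._raw.seek(size)      # step 1: move to the new end
                self._raw.truncate()      # step 2: cut at current position
            finally:
                self._raw.seek(saved)     # step 3: restore the pointer
        return size
```

Without the lock, another thread reading between steps 1 and 3 would see 
the pointer at the wrong place.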

>   
>> Since we're in a mood of nesting streams anyway, why not simply 
>> add a "safety stream" on top of each stream chain returned by open()? 
>> That layer could gracefully handle mutex locking, CheckClosed() calls, 
>> and even, maybe, the attribute/method forwarding I mentioned above.
>>     
>
> It's an interesting idea, but it could also end up slower than the current
> situation.
> First because you are adding a level of indirection (i.e. additional method
> lookups and method calls).
> Second because currently the locks aren't always taken. For example, in
> BufferedIOReader, we needn't take the lock when the requested data is available
> in our buffer (the GIL already protects us). Having a separate "synchronizing"
> wrapper would forbid such micro-optimizations.
>
> If you want to experiment with this, you can use iobench (in the Tools
> directory) to measure file IO performance.
>
>   
My approach may well be slower, but the gains are so high 
in terms of maintainability and ease of use that I would definitely 
advocate it.
Typically, the micro-optimizations you mention can please heavy 
programs, but they turn the code into a minefield (maybe that's why they 
haven't been put into _pyio :p).
When the order of every instruction matters, when everything is carefully 
crafted so that the GIL is sufficient, I personally don't dare touch 
anything anymore...
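
A very rough sketch of what such a "safety stream" could look like, 
assuming a single outermost proxy that owns the only lock and the only 
closed check. SafeStream is a hypothetical name; the per-call wrapper is 
exactly the extra indirection whose cost would need measuring.

```python
import io
import threading


class SafeStream:
    """Hypothetical top layer: one lock, one closed check, forwarding."""

    def __init__(self, stream):
        self._stream = stream
        self._lock = threading.RLock()

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        attr = getattr(self._stream, name)
        if not callable(attr):
            return attr  # plain attributes such as .closed or .name

        def guarded(*args, **kwargs):
            with self._lock:
                if name != "close" and self._stream.closed:
                    raise ValueError("I/O operation on closed file")
                return attr(*args, **kwargs)

        return guarded


safe = SafeStream(io.BytesIO(b"abc"))
first_read = safe.read()
safe.close()
```

The inner layers could then drop their own locking and closed checks, at 
the price of losing shortcuts such as lock-free reads from the buffer.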

There is for sure an important trade-off between speed and robustness 
here, but I fear speed has won too much so far (and now that the main 
implementation is in C, it's getting really hard to understand).

Maybe I should take the latest _pyio version, and make a fork offering 
high-level flexibility and safety, for those who don't need such 
high performance?

>> - some semantic decisions of the current system are somehow dangerous. 
>> For example, flushing errors occurring on close are swallowed. It seems 
>> to me that it's of the utmost importance that the user be warned if the 
>> bytes he wrote disappeared before reaching the kernel ; shouldn't we 
>> decidedly enforce a "don't hide errors" everywhere in the io module ?
>>     
>
> It may be a bug. Can you report it, along with a script or test showcasing it?
>
> Regards
>
> Antoine.
>   
It seems to be a rather deliberate semantic (with comments like "#If flush() 
fails, just give up"), but yep, I'll file a bug to be sure.
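
A script along these lines could accompany such a report. FailingRaw and 
close_reports_error are hypothetical names, and the write failure is 
simulated; what the check returns depends on the interpreter version it 
is run on.

```python
import io


class FailingRaw(io.RawIOBase):
    """Raw stream whose writes always fail (simulated error)."""

    def writable(self):
        return True

    def write(self, data):
        raise OSError("simulated write failure")


def close_reports_error(stream):
    """Return True if close() raises, False if the error is swallowed."""
    try:
        stream.close()
    except OSError:
        return True
    return False


buffered = io.BufferedWriter(FailingRaw())
buffered.write(b"data")  # small write: stays in the buffer for now
reported = close_reports_error(buffered)
print("flush error reported on close:", reported)
```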

> I don't think this can be helped though -- I really don't want open()
> to be slowed down or complicated by an attempt to do path
> manipulation. If this matters to the app author they should use
> os.path.abspath() or os.path.realpath() or whatever before calling
> open().
>
>   
On second thought, having more precise "name" or "path" attributes might 
give users the impression that they can rely on them, whereas in fact the 
filesystem might have been modified a lot during the use of the stream 
(even on Windows, where files can actually be renamed/deleted while 
they're open)...
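
The suggestion quoted above, resolving the path before calling open(), 
amounts to something like the following sketch. open_absolute is a 
hypothetical helper; the resulting name still only reflects the path as 
it was at open time.

```python
import os


def open_absolute(path, mode="r"):
    """Open a file after resolving its path, so .name is absolute."""
    return open(os.path.abspath(path), mode)
```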

> AFAIK, they aren't simple indexes in windows, and that's  partly why
> even file descriptors cannot be safely passed between C runtimes on
> windows (whereas they can in most unices).
>
> David
>   
Yep, Windows file descriptors are actually emulated (with bugs...) on 
top of native file handles; that's why we can't rely on them for 
advanced stream operations.

Regards,
Pascal
