[Python-3000] Google Sprint Ideas

Mon Aug 21 07:41:11 CEST 2006

Guido van Rossum wrote:
> On 8/20/06, Talin <talin at acm.org> wrote:
>> Guido van Rossum wrote:
> How sure are you of all that? I always thought that these have about
> the same age, and that the main distinction is byte vs. char
> orientation. Also, the InputStreamReader class clearly sits on top of
> the InputStream class (but surprisingly recommends that for efficiency
> you do buffering on the reader side instead of on the stream side --
> should we consider this for Python too?). And FileReader is a subclass
> of InputStreamReader. (OK, further investigation does show that
> FileInputStream exists since JDK 1.0 while InputStreamReader exists
> since JDK 1.1. But there's much newer Java I/O in the "nio" package,
> and there's work going on for "nio2", JSR 203.)

Admittedly my Java knowledge is somewhat old - I spent 2 years 
programming Java in the ".com era" (2000 - 2001). I remember when the 
new reader classes came out in JDK 1.1. So "old" and "new" are somewhat 
relative here. From the point of view of JDK 1.5 they are probably 
indistinguishable as to age :)

>> For purposes of Python, it probably makes more sense to look at the .Net
>> System.IO.Stream. (As a general rule, the .Net classes are refactored
>> versions of the Java classes, which is both good and bad. It's best to
>> study both if one is looking for inspiration.)
> 
> Perhaps you can tell us more about that? I've used the Java I/O system
> sufficiently to have a feel for how it is actually used, which helps
> me find my way in the docs; but for .NET I fear that I would have to
> go on a sabbatical to make sense of it. And I don't have time for
> that.

Try this page. This will at least give you a start:

http://msdn2.microsoft.com/en-us/library/system.io.streamreader_members.aspx

Here's an excerpt from the "Read" method (reformatted by me):

StreamReader.Read() -- Reads the next character from the input stream 
and advances the character position by one character.

StreamReader.Read(Char[], Int32, Int32) -- Reads a maximum of count 
characters from the current stream into buffer, beginning at index.

>> Hmmm, apparently the .Net documentation *does* use the term 'layer' to
>> describe one stream wrapping another - which I still find strange. To my
>> mind, the term 'layer' can either describe a particular design stratum
>> within an architecture - such as the 'device layer' of an operating
>> system - or it can describe a portion of a document, such as a drawing
>> layer in a CAD program.
> 
> It's used whenever you could draw a diagram of several layers of
> software sitting on top of each other. Perhaps usually layers are
> bigger (like device layers) but I see nothing wrong with declaring
> that Python I/O consists of three layers.
> 
>> I don't normally think of a single instance of a
>> class wrapping another instance as constituting a "layer" - I usually
>> use the term "adapter" or "proxy" to describe that case.
>>
>> (OK, so I'm pedantic about naming. Now you know why one of my side
>> projects is writing an online programmer's thesaurus -- using
>> Python/TurboGears of course!)
> 
> Wouldn't it make more sense to contribute to wikipedia at this point?

Off topic :)

Seriously, though, what I am doing is very different from Wikipedia, and 
much more like WordNet - that is, I have a database that represents 
semantic relations between words, and an AJAX GUI that allows editing of 
those relationships. Mostly it works, but I still need a way for people 
to create accounts.

(Source browsable at http://www.viridia.org/hg/ if interested.)

>> >> Also, I notice that this proposal removes what I consider to be a nice
>> >> feature of Python, which is that you can take a plain file object and
>> >> iterate over the lines of the file -- it would require a separate line
>> >> buffering adapter to be created. I think I understand the reasoning
>> >> behind this - in a world with multiple text encodings, the definition of
>> >> "line" may not be so simple. However, I would assume that the "built-in"
>> >> streams would support the most basic, least-common-denominator encodings
>> >> for convenience.
>> >
>> > First time I noticed that. But perhaps it's the concept of "plain file
>> > object" that changed? My own hierarchy (which I arrived at without
>> > reading tomer's proposal) is something like this:
>> >
>> > (1) Basic level (implemented in C) -- open, close, read, write, seek,
>> > tell. Completely unbuffered, maps directly to system calls. Does
>> > binary I/O only.
>> >
>> > (2) Buffering. Implements the same API as (1) but adds buffering. This
>> > is what one normally uses for binary file I/O. It builds on (1), but
>> > can also be built on raw sockets instead. It adds an API to inquire
>> > about the amount of buffered data, a flush() method, and ways to
>> > change the buffer size.
>> >
>> > (3) Encoding and line endings. Implements a somewhat different API,
>> > for reading/writing text files; the API resembles Python 2's I/O
>> > library more. This is where readline() and next() giving the next line
>> > are implemented. It also does newline translation to/from the
>> > platform's native convention (CRLF or LF, or perhaps CR if anyone
>> > still cares about Mac OS <= 9) and Python's convention (always \n). I
>> > think I want to put these two features (encoding and line endings) in
>> > the same layer because they are both text related. Of course you can
>> > specify ASCII or Latin-1 to effectively disable the encoding part.
>> >
>> > Does this make more sense?
>>
>> I understood that much -- this is pretty much the way everyone does
>> things these days (our own custom stream library at work looks pretty
>> much like this too.)
> 
> So you have the buffering between the binary I/O and the text I/O too?

Theoretically, yes - you can plug in a buffer between them if you 
want. Our library doesn't do this by default, however (our needs are 
somewhat specialized).
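
To make the stacking concrete, here is a minimal sketch of the raw, buffered, and text layers plugged together by hand. It assumes the io module found in later Python 3 releases (FileIO, BufferedReader, TextIOWrapper), which is not something this thread had available; the helper names read_lines_through_layers and demo are made up for illustration.

```python
import io
import os
import tempfile


def read_lines_through_layers(path, encoding="utf-8"):
    """Stack raw -> buffered -> text by hand and return the decoded lines."""
    raw = io.FileIO(path, "r")            # layer 1: unbuffered, bytes only
    buffered = io.BufferedReader(raw)     # layer 2: same API plus a buffer
    # layer 3: decoding plus newline translation (newline=None maps CRLF to \n)
    text = io.TextIOWrapper(buffered, encoding=encoding, newline=None)
    with text:                            # closing the top layer closes the ones below
        return list(text)                 # iterating the text layer yields lines


def demo():
    """Write a small CRLF file, read it back through the three layers."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, "caf\u00e9\r\nsecond line\r\n".encode("utf-8"))
        os.close(fd)
        return read_lines_through_layers(path)
    finally:
        os.remove(path)


lines = demo()
print(lines)
```

Line iteration lives only on the top layer here, which matches the point that a "plain file object" in the layered sense no longer iterates over text lines by itself.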

>> The question I was wondering is, will the built-in 'file' function
>> return an object of level 3?
> 
> I am hoping to get rid of 'file' altogether. Instead, I want to go
> back to 'open'. Calling open() with a binary mode argument would
> return a layer 2 or layer 1 (if unbuffered) object; calling it with a
> text mode would return a layer 3 object. open() would grow additional
> keyword parameters to specify the encoding, the desired newline
> translation, and perhaps other aspects of the layering that might need
> control.
> 
> BTW in response to Alexander Belopolsky: yes, I would like to continue
> support for something like readinto() by layer 1 and maybe 2 (perhaps
> even more flexible, e.g. specifying a buffer and optional start and
> end indices). I don't think it makes sense for layer 3 since strings
> are immutable. I agree with Martin von Loewis that a readv() style API
> would be impractical (and I note that Alexander doesn't provide any
> use case beyond "it's more efficient").

Note that the .Net API in the example above supports this: 
Read(Char[], Int32, Int32) fills a caller-supplied buffer starting at a 
given index.
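
As a rough Python analogue of reading into a caller-supplied buffer with start and end indices, the sketch below slices a memoryview and hands the slice to readinto(). The helper name read_into_slice is hypothetical, and io.BytesIO merely stands in for a layer 1 or layer 2 binary stream; this is one possible shape for the idea, not a proposed API.

```python
import io


def read_into_slice(stream, buffer, start=0, end=None):
    """Fill buffer[start:end] from a binary stream; return the byte count read."""
    # Slicing a memoryview gives a writable window without copying the buffer.
    view = memoryview(buffer)[start:end]
    return stream.readinto(view)


source = io.BytesIO(b"abcdefgh")      # stand-in for a binary stream
target = bytearray(b"." * 10)         # caller-owned, mutable buffer
count = read_into_slice(source, target, start=2, end=6)
print(count, target)
```

Only the four bytes in the window are overwritten; the rest of the buffer is left alone. This works for bytes because bytearray is mutable, which is exactly what an immutable string cannot offer at the text layer.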

> A use case that I do think is important is reading encoded text data
> asynchronously from a socket. This might mean that layers 2 and 3 may
> have to be aware of the asynchronous (non-blocking or timeout-driven)
> nature of the I/O; reading from layer 3 should give as many characters
> as possible without blocking for I/O more than the specified timeout.
> We should also decide how asynchronous I/O calls report "no more data"
> -- exceptions are inefficient and cause clumsy code, but if we return
> "", how can we tell that apart from EOF? Perhaps we can use None to
> indicate "no more data available without blocking", continuing "" to
> indicate EOF. (The other way around makes just as much sense but would
> be a bigger break with Python's past than this particular issue is
> worth to me.)
>
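
The None-versus-empty convention discussed above can be sketched at the raw byte layer. This assumes a POSIX system and the io.FileIO class from later Python 3 releases, whose read() returns None when a non-blocking descriptor has no data ready and b"" at EOF; the function name poll_pipe is made up for illustration, and the text-layer behaviour under discussion is not covered.

```python
import io
import os


def poll_pipe():
    """Return what a non-blocking raw read yields in three situations."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(read_fd, False)       # reads must never wait for data
    reader = io.FileIO(read_fd, "r")      # raw layer; closes read_fd on close()
    try:
        nothing_yet = reader.read(100)    # writer open, nothing written yet
        os.write(write_fd, b"hello")
        some_data = reader.read(100)      # data is available now
        os.close(write_fd)                # writer gone: next read hits EOF
        write_fd = None
        at_eof = reader.read(100)
    finally:
        reader.close()
        if write_fd is not None:
            os.close(write_fd)
    return nothing_yet, some_data, at_eof


results = poll_pipe()
print(results)
```

The caller can tell "try again later" (None) from "the stream is finished" (an empty result) without catching an exception in either case.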