i first thought of focusing on the socket module, because it's the part that
bothers me most, but since people have expressed their thoughts on completely
revamping the IO stack, perhaps we should be open to adopting new ideas,
mainly from the java/.NET world (keeping the momentum from the previous post).

there is an inevitable issue of performance here, since it basically splits
what used to be "file" or "socket" into many layers... each adding additional
overhead, so many parts should be lowered to C.

if we look at java/.NET for guidance, they have come up with two concepts:
* stream - an arbitrary, usually sequential, byte data source
* readers and writers - the way data is encoded into/decoded from the stream.
we'll use the term "codec" for these readers and writers in general.

so "stream" is the "where" and "codec" is the "how", and the concept of
codecs is not limited to ASCII vs UTF-8. it can grow into fully-fledged
protocols.


- - - - - - -
Streams
- - - - - - -

streams provide an interface to data sources, like memory, files, pipes, or
sockets. the basic interface of all of these is

class Stream:
    def close(self)
    def read(self, count)
    def readall(self)
    def write(self, data)

and unlike today's files and sockets, when you read from a broken socket or
past the end of the file, you get EOFError.

read(x) guarantees to return x bytes, or raises EOFError otherwise (also
restoring the stream position). on the other hand, readall() makes no such
guarantee: it reads all the data up to EOF, and if you readall() from EOF,
you get "".
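to make the read()/readall() contract concrete, here's a minimal runnable
sketch (io.BytesIO stands in for a real data source, and the class name is
mine, not part of the proposal):

```python
import io

class SketchStream(object):
    """illustrative only: the proposed read()/readall() semantics"""
    def __init__(self, data=b""):
        self._buf = io.BytesIO(data)

    def read(self, count):
        pos = self._buf.tell()
        data = self._buf.read(count)
        if len(data) < count:
            self._buf.seek(pos)      # restore the stream position
            raise EOFError("asked for %d bytes, got %d" % (count, len(data)))
        return data

    def readall(self):
        return self._buf.read()      # everything up to EOF; b"" at EOF

s = SketchStream(b"hello")
assert s.read(4) == b"hell"
try:
    s.read(2)                        # only one byte left -> EOFError
except EOFError:
    pass
assert s.read(1) == b"o"             # position was restored by the failed read
assert s.readall() == b""            # readall() at EOF returns empty
```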
perhaps readall() should return all *available* data, not necessarily up to
EOF. for files, this is equivalent, but for sockets, readall would return all
the data that sits in the network stack. this could be a nice way to do
non-blocking IO.

and if we do that already, perhaps we should introduce async operations as a
built-in feature? .NET does (BeginRead, EndRead, etc.)
    def async_read(self, count, callback)
    def async_write(self, data, callback)
i'm not sure about these two, but it does seem like a good path to follow.

-----

another issue is the current class hierarchy: fileno, seek, and readline are
meaningless in many situations, yet they are considered part of the file
protocol (take a look at StringIO implementing isatty!).

these methods, which may be meaningless for several types of streams, must
not be part of the base Stream class. for example, only FileStream and
MemoryStream are seekable, so why have seek as part of the base Stream class?

-----

streams that don't rely on an operating-system resource would derive directly
from Stream. as examples of such streams, we can consider

class MemoryStream(Stream):
    # like today's StringIO
    # allows seeking

class RandomStream(Stream):
    # provider of random data

-----

on the other hand, streams that rely on operating-system resources, like files
or sockets, would derive from

class OSStream(Stream):
    def isatty(self)
    def fileno(self)  # for select()
    def dup(self)

and there are several examples of this kind:

FileStream is the entity that works with files, instead of the file/open
class of today. since files provide random access (seek/tell), this kind of
stream is "seekable" and "tellable".

class FileStream(OSStream):
    def __init__(self, filename, mode = "r")
    def seek(self, pos, offset = None)
    def tell(self)
    def set_size(self, size)
    def get_size(self)

although i prefer properties instead:
    position = property(tell, seek)
    size = property(get_size, set_size)
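as a rough illustration of the property idea, here's a sketch over plain
os-level file calls (the class, the simplified seek signature, and the
ignored mode argument are all mine; a real FileStream would be richer):

```python
import os
import tempfile

class FileStreamSketch(object):
    """illustrative: position/size exposed as properties over an fd"""
    def __init__(self, filename, mode="r"):
        # mode handling omitted in this sketch
        flags = os.O_RDWR | os.O_CREAT | getattr(os, "O_BINARY", 0)
        self.fd = os.open(filename, flags)

    def tell(self):
        return os.lseek(self.fd, 0, os.SEEK_CUR)
    def seek(self, pos):
        os.lseek(self.fd, pos, os.SEEK_SET)
    def get_size(self):
        return os.fstat(self.fd).st_size
    def set_size(self, size):
        os.ftruncate(self.fd, size)

    position = property(tell, seek)
    size = property(get_size, set_size)

    def write(self, data):
        os.write(self.fd, data)
    def read(self, count):
        return os.read(self.fd, count)
    def close(self):
        os.close(self.fd)

fd, fname = tempfile.mkstemp()
os.close(fd)                   # reopen it through the sketch
f = FileStreamSketch(fname)
f.write(b"hello world")
f.position = 0                 # seek via the property
assert f.read(5) == b"hello"
assert f.size == 11            # get_size via the property
f.size = 5                     # truncate via the property
assert f.get_size() == 5
f.close()
os.remove(fname)
```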
PipeStream represents a stream over a (simplex) pipe:

class PipeStream(OSStream):
    def get_mode(self)  # read or write

DuplexPipeStream is an abstraction layer that uses two simplex pipes
as a full-duplex stream:

class DuplexPipeStream(OSStream):
    def __init__(self, incoming, outgoing)

    @classmethod
    def open(cls):
        incoming, outgoing = os.pipe()
        return cls(incoming, outgoing)
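here's a runnable sketch of that class; note that os.pipe() returns the two
ends of a *single* pipe, so open() as written gives you a loopback (a real
duplex setup would pair two pipes between two processes), which is enough
for illustration:

```python
import os

class DuplexPipeStream(object):
    """sketch: two fds presented as one full-duplex stream"""
    def __init__(self, incoming, outgoing):
        self.incoming = incoming   # fd we read from
        self.outgoing = outgoing   # fd we write to

    @classmethod
    def open(cls):
        # os.pipe() is one pipe (read end, write end) -> loopback
        r, w = os.pipe()
        return cls(r, w)

    def read(self, count):
        data = os.read(self.incoming, count)
        if len(data) < count:
            raise EOFError("pipe closed")
        return data

    def write(self, data):
        os.write(self.outgoing, data)

    def close(self):
        os.close(self.incoming)
        os.close(self.outgoing)

p = DuplexPipeStream.open()
p.write(b"ping")
assert p.read(4) == b"ping"    # loopback through the single pipe
p.close()
```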
NetworkStream provides a stream over a socket. unlike files, sockets may
get quite complicated (options, accept, bind), so we keep the distinction:
* sockets are the underlying "physical resource"
* NetworkStream wraps them with a nice stream interface. for example, while
socket.recv(x) may return less than x bytes, networkstream.read(x) returns x
bytes.

we must keep this distinction because streams are *data sources*, and there's
no way to represent things like bind or accept in a data source. only client
(connected) sockets would be wrappable by NetworkStream. server sockets don't
provide data and hence have nothing to do with streams.

class NetworkStream(OSStream):
    def __init__(self, sock)
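the read-exactly-x-bytes behavior is just a loop over recv(); a sketch, with
a toy socket object standing in for a real connected socket:

```python
class NetworkStream(object):
    """sketch: read(x) loops over recv(), which may return less than x"""
    def __init__(self, sock):
        self.sock = sock

    def read(self, count):
        chunks = []
        remaining = count
        while remaining:
            data = self.sock.recv(remaining)
            if not data:
                raise EOFError("socket closed before %d bytes arrived" % count)
            chunks.append(data)
            remaining -= len(data)
        return b"".join(chunks)

    def write(self, data):
        self.sock.sendall(data)

class ChunkySocket(object):
    """toy socket whose recv() delivers at most 2 bytes at a time"""
    def __init__(self, data):
        self.data = data
    def recv(self, n):
        n = min(n, 2)              # pretend the network dribbles data
        chunk, self.data = self.data[:n], self.data[n:]
        return chunk
    def sendall(self, data):
        pass

ns = NetworkStream(ChunkySocket(b"hello"))
assert ns.read(5) == b"hello"      # recv() returned 2+2+1 bytes internally
```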
- - - - - - - - -
Special Streams
- - - - - - - - -

it will also be useful to have a way to duplicate a stream, like the unix tee
command does:

class TeeStream(Stream):
    def __init__(self, src_stream, dst_stream)

f1 = FileStream("c:\\blah")
f2 = FileStream("c:\\yaddah")
f1 = TeeStream(f1, f2)

f1.write("hello")

will write "hello" to f2 as well. that's useful for monitoring/debugging,
like echoing everything from a NetworkStream to a file, so you could debug
it easily.

-----

buffering is always *explicit* and implemented at the interpreter level,
rather than by libc, so it is consistent between all platforms and streams.
all streams are, by nature, *non-buffered* (the data is written as soon as
possible). buffering wraps an underlying stream, making it explicit:

class BufferedStream(Stream):
    def __init__(self, stream, bufsize)
    def flush(self)

(BufferedStream appears in .NET)

class LineBufferedStream(BufferedStream):
    def __init__(self, stream, flush_on = b"\n")

f = LineBufferedStream(FileStream("c:\\blah"))

where flush_on specifies the byte (or sequence of bytes?) to flush upon
writing. by default it would be on newline.


- - - - - - -
Codecs
- - - - - - -

as was said earlier, codecs define how data (or arbitrary objects) are
to be encoded into and decoded from a stream.

class StreamCodec:
    def __init__(self, stream)
    def write(self, ...)
    def read(self, ...)

for example, in order to serialize binary records into a file, you would use
class StructCodec(StreamCodec):
    def __init__(self, stream, format):
        StreamCodec.__init__(self, stream)
        self.format = format
    def write(self, *args):
        self.stream.write(struct.pack(self.format, *args))
    def read(self):
        size = struct.calcsize(self.format)
        data = self.stream.read(size)
        return struct.unpack(self.format, data)

(similar to BinaryReader/BinaryWriter in .NET)
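such a codec works as-is over any stream-like object; for instance, exercised
over an in-memory buffer (io.BytesIO standing in for MemoryStream):

```python
import io
import struct

class StreamCodec(object):
    def __init__(self, stream):
        self.stream = stream

class StructCodec(StreamCodec):
    """serialize fixed-format binary records into any stream"""
    def __init__(self, stream, format):
        StreamCodec.__init__(self, stream)
        self.format = format
    def write(self, *args):
        self.stream.write(struct.pack(self.format, *args))
    def read(self):
        size = struct.calcsize(self.format)
        data = self.stream.read(size)
        return struct.unpack(self.format, data)

buf = io.BytesIO()                  # stands in for MemoryStream
codec = StructCodec(buf, "<H4s")    # a record: uint16 + 4 raw bytes
codec.write(42, b"spam")
buf.seek(0)                         # rewind and read the record back
assert codec.read() == (42, b"spam")
```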
and for working with text, you would have

class TextCodec(StreamCodec):
    def __init__(self, stream, textcodec = "utf-8"):
        StreamCodec.__init__(self, stream)
        self.textcodec = textcodec
    def write(self, data):
        self.stream.write(data.encode(self.textcodec))
    def read(self, length):
        return self.stream.read(length).decode(self.textcodec)

    def __iter__(self)         # iterate by lines
    def readline(self)         # read the next line
    def writeline(self, data)  # write a line

as you can see, only the TextCodec adds the readline/writeline methods, as
they are meaningless to most binary formats. the stream itself has no notion
of a line.

<big drum roll> no more newline issues! </big drum roll>

the TextCodec will do the translation for you. all newlines are \n in python,
and are written to the underlying stream in a way that would please the
underlying platform.

so the "rb" and "wb" file modes will diminish, and instead you would wrap the
FileStream with a TextCodec. it's explicit, so you won't be able to corrupt
data accidentally.
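the translation itself is simple; a simplified sketch (it ignores lone \r
and the interaction with multi-byte encodings, which a real TextCodec would
have to handle):

```python
class NewlineTranslator(object):
    """sketch: \n inside python, the platform's convention on disk/wire"""
    def __init__(self, linesep):
        self.linesep = linesep         # e.g. "\r\n" on windows

    def encode_text(self, text):
        # outgoing: translate python newlines to the platform's
        return text.replace("\n", self.linesep)

    def decode_text(self, raw):
        # incoming: normalize back to \n
        return raw.replace(self.linesep, "\n")

t = NewlineTranslator("\r\n")          # pretend we're on windows
assert t.encode_text("a\nb\n") == "a\r\nb\r\n"
assert t.decode_text("a\r\nb\r\n") == "a\nb\n"
```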
-----

it's worth noting that in .NET (and perhaps java as well), they split the
TextCodec functionality into two parts, the TextReader and TextWriter
classes, which you initialize over a stream:

f = new FileStream("c:\\blah");
sr = new StreamReader(f, Encoding.UTF8);
sw = new StreamWriter(f, Encoding.UTF8);
sw.Write("hello");
f.Position = 0;
sr.Read(5);

but why separate the two? it could only cause problems, as you may initialize
them with different encodings, which leads to no good. under the guidelines
of this suggestion, it would be implemented this way:

f = TextCodec(FileStream("c:\\blah"), "utf-8")

which can of course be refactored into a function:

def textfile(filename, mode = "r", codec = "utf-8"):
    return TextCodec(FileStream(filename, mode), codec)

for line in textfile("c:\\blah"):
    print line

unlike today's file objects, FileStream objects don't know about lines, so
you can't iterate over a file directly. it's quite logical if you think
about it, as there's no meaning to iterating over a binary file by lines.
it's a feature of text files.

-----

many times, especially in network protocols, you need framing for
transferring frames/packets/messages over a stream. so a very useful
FramingCodec can be introduced:

class FramingCodec(StreamCodec):
    def write(self, data):
        self.stream.write(struct.pack("<L", len(data)))
        self.stream.write(data)
    def read(self):
        length, = struct.unpack("<L", self.stream.read(4))
        return self.stream.read(length)

once you set up such a connection, you are free of socket hassle:

conn = FramingCodec(NetworkStream(TcpClientSocket("host", 1234)))
conn.write("hello")
reply = conn.read()

and it can be extended by subclassing, for instance, to allow serializing
streams: you can write objects directly to the stream and get them on the
other side with ease:

class SerializingCodec(FramingCodec):
    def write(self, obj):
        FramingCodec.write(self, pickle.dumps(obj))
    def read(self):
        return pickle.loads(FramingCodec.read(self))

conn = SerializingCodec(NetworkStream(TcpClientSocket("host", 1234)))
conn.write([1, 2, {3: 4}])
person = conn.read()
print person.first_name

and it can serve as the basis for RPC protocols, or as a simple way to
transfer arbitrary objects (for example, database query results from a
server, etc.)

and since the codecs don't care what the underlying stream is, it can be a
FileStream as well, serializing objects to disk.

-----

many protocols can also be represented as codecs. textual protocols, like
HTTP or SMTP, can be easily implemented that way:

class HttpClientCodec(TextCodec):
    def __init__(self, stream):
        TextCodec.__init__(self, stream, textcodec = "ascii")

    def write(self, request, params, data = ""):
        self.writeline("%s %s" % (request, params))
        self.writeline()
        if data:
            self.writeline(data)

    def read(self):
        ...
        return response, header, data

    def do_get(self, filename):
        self.write("GET", filename)

    def do_post(self, filename, data):
        self.write("POST", filename, data)

class HttpServerCodec(TextCodec):
    ....

and then http clients and servers become rather simple:

# client
conn = HttpClientCodec(NetworkStream(TcpClientSocket("host", 8080)))
conn.do_get("/index.html")
response, header, data = conn.read()
if response == "200":
    print data

# server
s = TcpServerSocket(("", 8080))
client_sock = s.accept()
conn = HttpServerCodec(NetworkStream(client_sock))
request, params, data = conn.read()

if request == "GET":
    ...

you can write something like urllib in no time.

-----

it's worth noting that codecs are "stackable": you can chain them, thus
creating more complex codecs, for instance:

https_conn = HttpClientCodec(SslCodec(NetworkStream(...)))

and other crazy stuff can follow: imagine doing SSL authentication over
pipes, between two processes. why only sockets? yeah, it's crazy, but why
not?
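to show the stacking in a runnable form, here are the framing and
serializing codecs from above (spelling normalized to SerializingCodec),
stacked over an in-memory buffer instead of a socket:

```python
import io
import pickle
import struct

class StreamCodec(object):
    def __init__(self, stream):
        self.stream = stream

class FramingCodec(StreamCodec):
    """length-prefixed frames over any stream"""
    def write(self, data):
        self.stream.write(struct.pack("<L", len(data)))
        self.stream.write(data)
    def read(self):
        length, = struct.unpack("<L", self.stream.read(4))
        return self.stream.read(length)

class SerializingCodec(FramingCodec):
    """pickled objects inside frames - a codec stacked on a codec"""
    def write(self, obj):
        FramingCodec.write(self, pickle.dumps(obj))
    def read(self):
        return pickle.loads(FramingCodec.read(self))

buf = io.BytesIO()                   # stands in for a NetworkStream
conn = SerializingCodec(buf)
conn.write([1, 2, {3: 4}])
conn.write("hello")
buf.seek(0)                          # rewind: now we're "the other side"
assert conn.read() == [1, 2, {3: 4}]
assert conn.read() == "hello"
```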


- - - - - - -
Summary
- - - - - - -

to conclude this long post: streams are generic data providers (random,
files, sockets, in-memory), and codecs provide an abstraction layer over
streams, allowing sophisticated use cases (text, binary records, framing,
and even full protocols).

i've implemented some of these ideas in RPyC ( http://rpyc.wikispaces.com ),
in the Stream and Channel modules (i needed a uniform way of working with
pipes and sockets). of course i didn't go rewriting the whole io stack there,
but it shows real-life usage of this model.


-tomer