[Web-SIG] Use cases for file-like objects (was Re: Bill's comments
on WSGI draft 1.4)
py-web-sig at xhaus.com
Wed Sep 8 16:25:12 CEST 2004
[Phillip J. Eby]
>>> Instead of using 'fileno' as an extension attribute on the iterable,
>>> we'll add a 'wsgi.file_wrapper' key, usable as follows by an application:
>>> return environ['wsgi.file_wrapper'](something,blksize)
>>> The 'file_wrapper' may introspect "something" in order to do a
>>> fileno() check, or other "I know how to send this kind of object
>>> quickly" optimizations. It must return an iterable, that the
>>> application may return back to the server.
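To make the quoted protocol concrete, here is a minimal sketch of what
such a wrapper class might look like on the server side (the class name
and default block size are my own invention, and I'm using the iterator
protocol directly):

```python
class FileWrapper:
    """Hypothetical server-supplied 'wsgi.file_wrapper': wraps a
    file-like object in an iterable, and exposes the underlying
    fileno() when one exists, so the server can introspect it and
    attempt a fast-path transfer."""

    def __init__(self, filelike, blksize=8192):
        self.filelike = filelike
        self.blksize = blksize
        # Surface the descriptor for "I know how to send this
        # quickly" optimisations, if the wrapped object has one.
        if hasattr(filelike, 'fileno'):
            self.fileno = filelike.fileno

    def __iter__(self):
        return self

    def __next__(self):
        data = self.filelike.read(self.blksize)
        if data:
            return data
        raise StopIteration
```

The application would then simply return
environ['wsgi.file_wrapper'](something, blksize) as in the quote above.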
> [...] I should warn you that I'm thinking
> 'file_wrapper' was a bad idea, and that there's a better way to do all
> of this.
> As I understand them, the current use cases for file-like objects are:
> 1. sendfile(fileno()) for fast file-descriptor copying (Unix-like
> OSes only, and only single-thread synchronous servers like Apache 1.x
> or CGI)
Well, I see sendfile functionality as being much more widespread than
that. java.nio, for example, has excellent support for fast
"channel transfers" between file channels and other writable channel types.
This support goes right down to the level of allocating "direct
buffers", which use DMA to bypass the CPU when transferring the
bytestream to/from the destination channel. On OSes where such DMA
facilities are not supported, the exact same code still works, but just
isn't as fast.
For an excellent discussion of how these facilities work in java.nio,
and more importantly why they work and are high performance, I recommend
Ron Hitchens' comprehensive book "Java NIO".
And I'd be surprised if the .NET CLR doesn't soon develop such
functionality, if it isn't already supported.
> [other cases snipped]
> These are all very simple, one-line solutions (at least for 2.2+) and
> have the advantage of being explicit, and refusing the temptation to
> guess. The application is in total control of how the resource will
> be transmitted.
Well, I suppose the key question here is "should the application be in
total control of how the resource is transmitted"? Can we rely on all
WSGI applications behaving correctly across all server platforms? Should
the server not have some say in how the resource can be optimally
transmitted, in its environment?
> That leaves only use case 1, which is a fairly limited use case and
> isn't even applicable to most web servers written in Python, as most
> such servers are asynchronous and can't take advantage of the
> 'sendfile()' system call (which Python doesn't expose as an 'os'
> facility anyway).
A pity that CPython doesn't implement sendfile as a native C function
that is layered on top of a native OS implementation where available, or
a generic C implementation where not. The current lack of the call means
that people tend to implement their own sendfile in pure Python, and so
end up acquiring and releasing the GIL around every chunk sent.
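Just to illustrate the pattern I mean, a pure-Python stand-in might look
like this (a hypothetical helper, not anything in the stdlib; every
os.read/os.write pair is a GIL round-trip):

```python
import os

def py_sendfile(out_fd, in_fd, count, blksize=65536):
    """Pure-Python stand-in for a native sendfile: copies 'count'
    bytes from in_fd to out_fd in blksize chunks.  Each os.read and
    os.write releases and reacquires the GIL, which is exactly the
    per-chunk overhead a native implementation would avoid."""
    sent = 0
    while sent < count:
        chunk = os.read(in_fd, min(blksize, count - sent))
        if not chunk:          # premature EOF on the source
            break
        written = 0
        while written < len(chunk):
            written += os.write(out_fd, chunk[written:])
        sent += written
    return sent
```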
Also, I don't think we should restrict ourselves to thinking solely in
terms of single-threaded asynchronous architectures. When I think about
asynchronous, high-performance and high-throughput server architectures,
I tend to think in terms of hybrid asynchronous/threaded architectures,
of the type described by Welsh et al. in the excellent and readable
14-page overview paper "A Design Framework for Highly Concurrent
Systems" (highly recommended reading, for those who might be interested).
More details on Welsh's work can be obtained from his publications page.
Welsh describes the use of thread-pools of a fixed "width" to service
particular request types, with requests shunted between those (otherwise
isolated) thread pools using queues. For example, if the server hardware
is capable of processing 50 disk requests simultaneously, then the
"width" of the thread pool serving resources from disk should be 50: any
more is a waste, any less will underperform the theoretical maximum.
It is important to note that those 50 threads would be threads which
continually block while waiting for disk read completions. When the disk
I/O has completed, they could either "sendfile" the data back to the
client, or more likely pass it onto a dedicated thread-pool that does
nothing but transfer disk byte streams to client sockets. Meaning that
they need some way to record/represent the fact that the bytestream is
coming from a file.
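A rough sketch of what such queue-connected, fixed-width thread pools
might look like in Python (all names here are made up purely for
illustration of Welsh's scheme):

```python
import queue
import threading

def stage(width, inbox, outbox, handler):
    """Hypothetical SEDA-style stage: a fixed-width pool of threads
    that pull requests from 'inbox', run 'handler' on each, and shunt
    the results to 'outbox' (or drop them when outbox is None)."""
    def worker():
        while True:
            item = inbox.get()
            if item is None:        # shutdown sentinel
                inbox.put(None)     # let sibling workers see it too
                break
            result = handler(item)
            if outbox is not None:
                outbox.put(result)
    threads = [threading.Thread(target=worker) for _ in range(width)]
    for t in threads:
        t.start()
    return threads

# e.g. a 50-wide disk stage feeding a socket-transfer stage:
#   disk_q, net_q = queue.Queue(), queue.Queue()
#   stage(50, disk_q, net_q, read_from_disk)
#   stage(10, net_q, None, send_to_client)
```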
This file->socket transfer could also conceivably be done by a single
thread, which continually watches the readiness status of large sets of
both socket and file channels/descriptors, and transfers blocks
between them as appropriate. And "blocks" is the key word here. Data
comes from disks in fixed size chunks, the size of which are optimised
for maximum throughput at all levels of the OS. Many modern operating
systems come with specialised high-performance support for transferring
data from one channel/descriptor to another. Such support can radically
increase throughput on a server.
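Something like the following toy loop, say (sources are assumed to be
sockets and sinks file-like objects with a write method; purely
illustrative, with none of the error handling a real server would need):

```python
import select

def pump(pairs, blksize=8192):
    """Hypothetical single-threaded transfer loop: 'pairs' maps each
    readable source socket to the writable sink it feeds.  One select()
    call watches every source; each ready source has one block copied
    to its sink, and a source at EOF is dropped from the set."""
    while pairs:
        readable, _, _ = select.select(list(pairs), [], [])
        for src in readable:
            data = src.recv(blksize)
            if data:
                pairs[src].write(data)
            else:
                src.close()
                del pairs[src]
```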
So I suppose my real concern is that by relegating disk-originating byte
streams to being second-class citizens under WSGI, we might hinder the
portability of some highly-desirable server architectural approaches.
> Therefore, my current thinking is to relegate use case 1 to a WSGI
> extension, 'wsgi.fd_wrapper', which can be used like this (if the
> application is returning an object with a working 'fileno()' method):
> if 'wsgi.fd_wrapper' in environ:
> return environ['wsgi.fd_wrapper'](fd.fileno())
> # return a normal iterable
> In other words, 'wsgi.fd_wrapper' would be sort of like my earlier
> 'wsgi.file_wrapper', but it would be *optional* to implement and use.
> (Meaning it can be relegated to an application note, instead of having
> to be introduced in-line.)
Well, I suppose that makes sense too. After all, all of this talk
of "highly-concurrent" architectures doesn't really apply to Apache +
CGI/WSGI, for example.
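Still, for concreteness, an application using the proposed optional key
might look something like this (the 'wsgi.fd_wrapper' name is taken from
the quoted proposal; everything else here is invented for illustration):

```python
def make_file_app(path):
    """Build a WSGI application serving the file at 'path', preferring
    the optional 'wsgi.fd_wrapper' extension when the server offers it,
    and falling back to an ordinary block-at-a-time iterable."""
    def app(environ, start_response):
        f = open(path, 'rb')
        start_response('200 OK',
                       [('Content-Type', 'application/octet-stream')])
        wrapper = environ.get('wsgi.fd_wrapper')
        if wrapper is not None:
            # Server knows how to transfer a raw descriptor quickly.
            return wrapper(f.fileno())
        # Fallback: an iterable yielding fixed-size blocks until EOF.
        return iter(lambda: f.read(8192), b'')
    return app
```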
> For Alan's attempt to support Jython 2.1, he could write an 'iter'
> function or class and put it in __builtin__, so that programs written
> to this idiom would still work.
> After thinking about the 'file_wrapper' idea some more, I'm thinking
> that this way works better for everything but the issue of closing
> files. However, my example 'file_wrapper' class should maybe be
> included in the PEP under an application note about sending files and
> file-like objects.
Perhaps a "finalise" method might be appropriate?
Just thinking through some scenarios here:
What happens if the server is just about to start serving a
multi-megabyte PDF file back to a client socket, and then the client
closes the socket, i.e. the user cancelled their request? What should
the server do in that case? Should it continue to iterate through the
iterable right to the end, discarding the results? Or should it just
drop the iterable on the floor, to be sorted out by GC (and thus
potentially wasting file descriptors)? Or should it attempt to finalise
the iterable, so that all related resources are freed?
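One possible shape for such finalisation, sketched from the server's
side (the method name 'close' is just one candidate spelling of the
"finalise" hook; the try/finally guarantees it runs even when the
client disconnects mid-transfer):

```python
def serve(iterable, write):
    """Sketch of a server transmission loop that finalises the
    application's iterable no matter how the transfer ends: normally,
    or via an exception because the client hung up."""
    try:
        for block in iterable:
            write(block)        # may raise if the socket has gone away
    finally:
        # Give the application a chance to free files, locks, etc.
        close = getattr(iterable, 'close', None)
        if close is not None:
            close()
```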
Do these considerations also apply when the bytestream being
transferred is not "physical", i.e. not coming from a
file descriptor/channel? What if the bytestream is coming from an
iterable yielding several megabytes of Python strings, from a
page-rendering component, for example? How does the server tell the
application to stop, because the client is no longer interested? Does it
simply drop the iterable on the floor and forget about it?
Might the application have a need to know that the client aborted the
request, for example in E-commerce scenarios? If the application did
need to know, how could the server inform the application?