[Web-SIG] Proposal: Avoiding Serialization When Stacking Middleware
ianb at colorstudy.com
Wed Mar 7 04:43:43 CET 2007
Phillip J. Eby wrote:
> At 08:08 PM 3/6/2007 -0600, Ian Bicking wrote:
>> Posted here: http://wsgi.org/wsgi/Specifications/avoiding_serialization
>> Text copied below for discussion:
>> :Title: Avoiding Serialization When Stacking Middleware
>> :Author: Ian Bicking <ianb at colorstudy.com>
>> :Discussions-To: Python Web-SIG <web-sig at python.org>
>> :Status: Proposed
>> :Created: 06-03-2007
>> .. contents::
>> This proposal gives a strategy for avoiding unnecessary serialization
>> and deserialization of request and response bodies. It does so by
>> attaching attributes to ``wsgi.input`` and the ``app_iter``, as well as
>> a new environment key ``x-wsgiorg.want_parsed_response``.
>> Output-transforming middleware often has to parse the upstream content,
>> transform it, then serialize it back to a string for output. The
>> original output may have already been in the parsed form that the
>> middleware wanted. Or there may be more middleware that does similar
>> transformations on the same kind of objects.
> HTTP already includes a mechanism for specifying what types are accepted
> by a content consumer: the "Accept" header. You can always add other
> values to it to indicate the parsed values you can accept.
> Of course, this doesn't really work well with WSGI - you want the result
> to actually *be* WSGI... so you can use the WSGI way of doing this,
> which is to have a standard wrapper for the specific content type you
> want to use.
Yeah, using Accept is clever, but not really accurate, since if you
serialize the WSGI request to HTTP the addition no longer makes sense.
> The wrapper (as with the wsgi "file wrapper") simply puts a WSGI face on
> a non-WSGI result body, converting it to an iterator of strings, and
> holding other attributes known to the middleware or other application
That just calls for a series of ad hoc techniques, basically, where each
object type results in a new key in the environment and a new ad hoc
specification to be made (e.g., wsgi.file_wrapper takes a block size,
which is specific only to that case).
> This could be implemented as an environ key containing a mapping from
> types to wrapper functions. Middleware that wants a type just copies
> the mapping and overwrites any entries it cares about. Applications
> that want to return a non-serialized result just look up the type (using
> __mro__ order) to find an applicable wrapper.
OK, the dict would avoid multiple different kinds of keys, and
presumably they'd all have the same signature. Block size doesn't
really make any sense to me as a common parameter. Content type should
be a common parameter, as something like an lxml object can be
serialized as either XML or HTML. I don't think any response headers
are likely to affect the serialization... though with my specification
that remains an application concern, so it doesn't have to be resolved
in the specification.
I hadn't really thought about MRO, though generally I don't trust
inheritance to be meaningful anyway -- I feel like I'd be more likely to
switch on the type than test inheritance.
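
A minimal sketch of the mapping idea discussed above, to make the lookup concrete. Everything named here is an assumption for illustration: the environ key ``example.result_wrappers``, the ``Element`` and ``HtmlElement`` stand-in classes, and the ``(object, content_type)`` wrapper signature are hypothetical and not part of WSGI or of either proposal.

```python
WRAPPERS_KEY = "example.result_wrappers"  # hypothetical environ key


class Element:
    """Stand-in for a parsed document node (hypothetical)."""

    def __init__(self, text):
        self.text = text


class HtmlElement(Element):
    """Subclass used to show lookup through __mro__."""


def serializing_wrapper(element, content_type):
    # Default wrapper: turn the parsed object into an iterable of bytes.
    return [element.text.encode("utf-8")]


class ParsedBody:
    """Iterable of bytes that also keeps the parsed object reachable."""

    def __init__(self, element, content_type):
        self.parsed = element
        self.content_type = content_type

    def __iter__(self):
        # Serialization only happens if something actually iterates.
        yield self.parsed.text.encode("utf-8")


def find_wrapper(environ, obj):
    """Return the wrapper for the nearest class in type(obj).__mro__, or None."""
    wrappers = environ.get(WRAPPERS_KEY, {})
    for cls in type(obj).__mro__:
        if cls in wrappers:
            return wrappers[cls]
    return None


def middleware_environ(environ):
    """Copy the mapping and overwrite the entry this middleware cares about."""
    wrappers = dict(environ.get(WRAPPERS_KEY, {}))
    wrappers[Element] = ParsedBody
    new_environ = dict(environ)
    new_environ[WRAPPERS_KEY] = wrappers
    return new_environ
```

An application holding an ``HtmlElement`` would call ``find_wrapper(environ, obj)`` and return ``wrapper(obj, "text/html")`` as its ``app_iter``. The middleware can then read ``.parsed`` instead of reparsing, while unaware middleware still sees an ordinary iterable of byte strings.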
> Notice that this approach doesn't require any special protocol for these
> wrappers -- just WSGI. It's simpler to specify, and simpler to
> implement than what you propose, while addressing some of the open issues.
The specification isn't particularly long or complicated, IMHO. The
implementation is complicated mostly for reasons unrelated to the
specification -- any output-transforming middleware will be similarly
> Yes, it does have some problems with interface vs. implementation. ISTM
> that trying to solve that problem is effectively asking to revive or
> reinvent PEP 246, however. But we could explicitly allow the use of
> type names instead of the actual types.
When playing with implementation I used type names, and actually I
rather prefer them, but it's not always clear what name to use. For
instance, "lxml", "lxml.etree", "lxml.etree.Element", and
"lxml.etree._Element" all are reasonable names. Or "ElementTree",
"ElementTree.Element", "ElementTree._Element", "xml.etree",
"xml.etree.Element", and "xml.etree._Element". Or even something like
"IElement" could make sense in some context (e.g., what if you can
accept the overlapping interfaces of both lxml and ElementTree?)
At least the actual type object seems easy enough. OTOH, there are
actually cases when I'd like to say that I could accept a certain type
without having to import the type. E.g., if I wanted to do an XSLT
transformation, I *could* support several kinds of objects without
requiring any of them (e.g., lxml, 4DOM, and Genshi Markup).
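
A rough sketch of matching by type name, so that a consumer can say what it accepts without importing the type. The helper names ``type_names`` and ``accepts`` are hypothetical. The names are derived mechanically from ``__module__`` and ``__qualname__``, which picks one of the candidate spellings listed above and does not settle which spelling is the right one.

```python
from fractions import Fraction


def type_names(obj):
    """Fully-qualified names of every class in the object's MRO."""
    return [
        f"{cls.__module__}.{cls.__qualname__}"
        for cls in type(obj).__mro__
    ]


def accepts(obj, accepted_names):
    """True if any class in the object's MRO is named in accepted_names."""
    return any(name in accepted_names for name in type_names(obj))


# The consumer only lists strings; it never imports the types it names.
accepted = {"fractions.Fraction", "some_xml_lib.Element"}  # second name is made up
print(accepts(Fraction(1, 2), accepted))
print(accepts("plain string", accepted))
```

Because the check walks the MRO, listing a base class name also matches its subclasses, but classes registered only as virtual subclasses of an ABC would not match.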
>> The same things apply to the parsing of ``wsgi.input``, specifically
>> parsing form data. A similar strategy is presented to avoid
>> unnecessarily reparsing that data.
> I would rather offer an optional 'get_file_storage()' method or some
> such as a blessed WSGI extension, than have such an open-ended "get
> whatever you want from the input object" concept floating around. A
> strategy which reinvents half of PEP 246 (the *old* PEP 246, before it
> became almost as complicated as WSGI) seems like overkill to me.
I don't really understand what you are proposing. This part addresses
the same issues as presented in
I really don't *want* to write every wsgi.input to a temporary file just
because someone else *might* want to reparse the input. I'd much rather
do it lazily, as 99% of the time reparsing won't happen.
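
One way to sketch the lazy approach: parse the body once, and remember the result together with the ``wsgi.input`` object it came from. Nothing is buffered to disk up front, and a later consumer gets the already-parsed form instead of reparsing. The environ key ``example.parsed_form`` and the function ``get_form`` are hypothetical names, and this handles only URL-encoded bodies.

```python
import io
from urllib.parse import parse_qs

CACHE_KEY = "example.parsed_form"  # hypothetical environ key


def get_form(environ):
    """Parse a URL-encoded request body at most once per wsgi.input object."""
    source = environ["wsgi.input"]
    cached = environ.get(CACHE_KEY)
    # The identity check invalidates the cache if middleware swapped the input.
    if cached is not None and cached[1] is source:
        return cached[0]
    length = int(environ.get("CONTENT_LENGTH") or 0)
    raw = source.read(length)
    form = parse_qs(raw.decode("utf-8"))
    environ[CACHE_KEY] = (form, source)
    return form


body = b"a=1&b=2&a=3"
environ = {"wsgi.input": io.BytesIO(body), "CONTENT_LENGTH": str(len(body))}
first = get_form(environ)   # reads and parses the stream
second = get_form(environ)  # served from the cache, no second read
```

Storing the source object next to the parsed value is the design choice that matters here: the cache is only trusted while ``wsgi.input`` is still the same object that was parsed.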
>> Obviously the code is not simple, but this is the nature of WSGI
>> output-transforming middleware.
> Something I'd like to fix in WSGI 2.0, by getting rid of both
> "start_response" and "write", but that's a discussion for another time.
Yeah, that'd be nice, but another discussion for another time.
>> Other Possibilities
>> * You could simply parse everything every time.
>> * You could pass data through callbacks in the environment (but this can
>> break non-aware middleware).
>> * You can make custom methods and keys for each case.
>> * You can use something other than WSGI.
> And you can use the established WSGI method for adding semantics to a
> response, using a middleware-supplied wrapper. I think this is actually
> the best alternative.
I really don't understand the advantage.
> In truth, it could be as simple as using the class's fully-qualified
> name as an environ key (perhaps with a prefix or suffix), with the value
> being a wrapper for objects implementing that protocol. No
> x-foobar-wsgiorg-whatchamacallit cruft needed.
> And, it's lightweight enough of a concept to be expressed as a simple
> "best practice" design pattern.
Best practice is fine, though of course still needs to be documented, as
this is hardly a practice that people would naturally think about or
implement. But I don't really think that practice would be any simpler
or easier to describe if done completely. In fact, I think it would
take exactly the same amount of space to describe.
Ian Bicking | ianb at colorstudy.com | http://blog.ianbicking.org