Re: [Twisted-web] twisted.web.template output encoding

4 Jan 2012


On Jan 3, 2012, at 10:54 AM, exarkun@twistedmatrix.com wrote:
...
On 5 Dec 2011, 08:19 pm, glyph@twistedmatrix.com wrote:
...
Sorry it took me so long to get to this.  Hopefully it's still relevant 
;).
Heh.  Heh heh heh.  Heh.
So it goes ;-).
...
...
On Nov 26, 2011, at 11:52 AM, exarkun@twistedmatrix.com wrote:
...
Apart from various issues relating to the lack of patterns in 
twisted.web.template,
I had some trepidation about marking 
<http://twistedmatrix.com/trac/ticket/5040> as "closed" :).  What kind 
of issues came up with patterns?  Anything you feel needs fixing?
The approach facilitated by #5040 seems to result in much more 
boilerplate than the approach facilitated by Nevow's patterns.  The code 
for #4896 has many, many Elements.  An implementation using Nevow 
probably would have had far fewer, perhaps only one.
Which of these is better, I don't know.  I certainly got bored very 
early on in the #4896 work, though.
Well, if the approach on #5040 is way more verbose, what does it have in its favor?  Simplicity?  I must imagine that we can get both somehow.
...
...
...
the main difficulty is in handling non-ascii contents in the 
traceback.  Apart from any unicode that may show up in the source code 
being rendered (or, perhaps, eventually, the values of variables to be 
rendered - though for now I do not plan to implement this) the no-break 
space characters which are necessary to get traceback lines 
indented properly mean that there is always some non-ascii to include 
in the output.
Looking at the actual output now, these &nbsp; characters strike me as 
an accident of how browsers collapse different types of whitespace. 
They could be replaced with a <span style="width: 4em;" /> to avoid 
this problem for now, which is probably more expressive.
If I understood Jonathan's reply properly, it sounds like the &nbsp; 
hack is the best we've got.
I don't _want_ to read Jonathan's reply thoroughly enough to understand it, so I'll have to take your word for it.
...
...
...
twisted.web.template encodes its output using UTF-8, and this is not 
customizable.  Thus, using twisted.web.template, formatFailure's 
result will be a str containing UTF-8 encoded text.  Previously the 
result was a str containing only ASCII encoded text, with no-break 
space represented as `&nbsp;´.  Consequently, callers of 
`formatFailure´ will probably mishandle the result - the caller in 
`twisted.web.server´ does, at least, including the bytes in a page 
with a content type of "text/html".
The solutions that come to mind are all about removing this 
incompatible change and making it so `formatFailure´ can continue to 
return a str with ASCII-encoded text.
One solution is to add support for named entities or numeric character 
references to twisted.web.template.  Very likely this is a good idea 
regardless (Nevow supported these).
I think that this is probably a necessary feature regardless, 
eventually.  Did you end up filing a ticket for it?
Yep, this has been filed and is up for review (for weeks now ;): #5408.
Great, okay.
...
...
...
Another solution is to use a different encoding in 
`twisted.web.template´ - ASCII, with xmlcharrefreplace as the error 
handler.  This is tempting since it avoids an obtrusive non-ASCII 
support API (the way Nevow supports these is via `nevow.entities´, 
which must be used rather than normal Python unicode objects).
I like this idea, because it's so hard to get wrong even if you have 
other problems (missing charset, buggy proxies, overly aggressive 
encoding detection, etc).  We can still say it's UTF-8 but it will work 
anywhere ASCII will work :).
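
For illustration, here is a minimal sketch of what ASCII output with the xmlcharrefreplace error handler looks like. It uses only the standard codec machinery, not any twisted.web.template API, and it is written for Python 3 (where `str` is text and `bytes` is the encoded form). The sample text is made up.

```python
# Sample text containing no-break spaces (U+00A0), as used to indent
# traceback lines, plus one other non-ASCII character.
text = u"\u00a0\u00a0return caf\u00e9"

# Characters that ASCII cannot represent are replaced with numeric
# character references instead of raising UnicodeEncodeError.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")
print(ascii_bytes)  # b'&#160;&#160;return caf&#233;'

# The result is pure ASCII, so it decodes identically as ASCII or UTF-8.
assert ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")
```

Note that the error handler only rewrites characters the target encoding cannot represent. Markup characters such as `<` and `&` still have to be escaped separately.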
...
Perhaps another question is whether the encoding used by 
`twisted.web.template´ should be a parameter.  A related question 
raised might be whether `twisted.web.template´ should encode to bytes 
at all, or delegate the responsibility for that to code closer to a 
socket.
Personal experience looking at profiles of applications which serialize 
a lot of XML suggests to me that encoding and decoding text in Python 
is a huge chunk of CPU work and memory footprint; keeping the encoding 
in t.w.t provides an opportunity for a potentially important 
optimization which might not be possible if it were done closer to the 
socket.
For example, if we're generating a long table that generates 10MB of 
HTML, if this is encoded incrementally (even foregoing any smarter 
optimizations, like caching the encoded form of strings) then there's a 
small working set of encoded data which can be collected as the 
template renders, and by the time the final string is emitted by 
cStringIO.getvalue() or what have you, you're using 20-ish megabytes of 
heap to store your UTF-8 bytes (10 in the StringIO and 10 in the str). 
If you build this as a unicode string instead, you'll end up using 
50MB; 40MB for your unicode string, 10MB for the encoded bytes.  Part 
of this is just an implementation issue, but even if Python gets a 
smarter unicode representation, you still need more space, because you 
need to store the encoded and decoded representations concurrently.
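
A rough sketch of the incremental approach described above, using only the standard library and written for Python 3. `render_chunks` is a hypothetical stand-in for whatever produces pieces of text while a template renders. It is not part of twisted.web.template, and the real flattener may work quite differently.

```python
import codecs
import io


def render_chunks():
    # Hypothetical source of rendered text, yielded one small piece at a time.
    yield u"<table>"
    for i in range(3):
        yield u"<tr><td>row\u00a0%d</td></tr>" % i
    yield u"</table>"


def encode_incrementally(chunks, encoding="utf-8"):
    # Encode each piece as it arrives, so only one small text chunk is alive
    # at a time and the large accumulated buffer holds encoded bytes only.
    encoder = codecs.getincrementalencoder(encoding)()
    out = io.BytesIO()
    for chunk in chunks:
        out.write(encoder.encode(chunk))
    # Flush any state the encoder may still be holding.
    out.write(encoder.encode(u"", final=True))
    return out.getvalue()


body = encode_incrementally(render_chunks())
```

The complete document never exists as one large text string here. The only large objects are the byte buffer and the final bytes value returned by `getvalue()`.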
This all seems to suppose the non-existence of the 
twisted.web.template.flatten-style interface.  Doesn't that give you 
what's needed to do your incremental encoding outside of the flattener?
Hmmmmmm.  Okay, generating a couple of short encoded strings does leave one with a much shorter working set.  There should definitely be a lot more convenience functions in this area to just do the right thing in the various contexts one might want to flatten something (for which there are already a few tickets, such as <http://tm.tl/5395>).  As I recall you've spoken against the flatten() style interface because it makes error-handling somewhat more challenging, but if #5395 were fixed it could take care of those complexities internally.
...
...
It might be a while until I get around to implementing something smart 
in this area, but I'd prefer we have an interface that makes such 
optimizations possible without breaking compatibility.
...
As a work-around in `formatFailure´ I can decode the output of the 
flattener using UTF-8 and then re-encode it to avoid non-ASCII, but it 
seems like this should be solved in `twisted.web.template´ rather than 
over and over again in application code.
If this does end up happening in formatFailure or anywhere else, please 
(whoever does it) make sure to file a ticket to fix it; this should 
never be more than a temporary workaround.
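
The decode/re-encode workaround mentioned above could look roughly like the following. This is a sketch under assumptions, not the code in the #4896 branch: `utf8_to_ascii_markup` is a hypothetical helper name, the input bytes are made up, and the example is written for Python 3.

```python
def utf8_to_ascii_markup(utf8_bytes):
    """Re-encode UTF-8 markup as ASCII, using numeric character references.

    Hypothetical helper: decodes the flattener's UTF-8 output to text, then
    encodes it as ASCII with any non-ASCII characters replaced by &#NNN;.
    """
    return utf8_bytes.decode("utf-8").encode("ascii", "xmlcharrefreplace")


# Stand-in for flattener output: UTF-8 bytes containing no-break spaces.
flattened = u"<pre>\u00a0\u00a0x = 1</pre>".encode("utf-8")
print(utf8_to_ascii_markup(flattened))  # b'<pre>&#160;&#160;x = 1</pre>'
```

This costs a full decode and a second encode of the whole output, which is part of why it only makes sense as a temporary measure.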
Okay.  #4896 is still up for review, and the branch implementing it does 
use the decode/encode hack.  I'll file a ticket for fixing that if I 
ever get to merge the branch (someone review it please).
Why not just file the ticket now?  As you said before: "Heh.  Heh heh heh.  Heh."  It might be a while before sufficient review bandwidth becomes available.  (If history is any indicator, things will stall out between now and February, and March will be crazily active.)

-glyph
