On Fri, Sep 17, 2010 at 3:25 PM, Michael Foord <span dir="ltr"><<a href="mailto:fuzzyman@voidspace.org.uk">fuzzyman@voidspace.org.uk</a>></span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div class="im"> On 16/09/2010 23:05, Antoine Pitrou wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
On Thu, 16 Sep 2010 16:51:58 -0400<br>
"R. David Murray"<<a href="mailto:rdmurray@bitdance.com" target="_blank">rdmurray@bitdance.com</a>> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
What do we store in the model? We could say that the model is always<br>
text. But then we lose information about the original bytes message,<br>
and we can't reproduce it. For various reasons (mailman being a big one),<br>
this is not acceptable. So we could say that the model is always bytes.<br>
But we want access to (for example) the header values as text, so header<br>
lookup should take string keys and return string values[2].<br>
</blockquote>
Why can't you have both in a single class? If you create the class<br>
using a bytes source (a raw message sent by SMTP, for example), the<br>
class automatically parses and decodes it to unicode strings; if you<br>
create the class using a unicode source (the text body of the e-mail<br>
message and the list of recipients, for example), the class<br>
automatically creates the bytes representation.<br>
<br>
</blockquote></div>
I think something like this would be great for WSGI. Rather than focus on whether bytes *or* text should be used, use a higher level object that provides a bytes view, and (where possible/appropriate) a unicode view too.<br>
</blockquote></div><br>This is what WebOb does; e.g., there is only a bytes version of a POST body, plus a view on that body that does the decoding and encoding. If you don't touch something, it is never decoded or encoded. I only vaguely understand the specifics here, and I suspect the specifics matter, but the same approach seems applicable to email. An incoming message can contain a smattering of raw bytes, inline (RFC 2047) encoded words, other encoding declarations, and orthogonal systems like quoted-printable, and you don't want to touch any of that if you don't need to. Handling unicode objects implies you are normalizing the content, and that might have subtle effects you don't know about or don't want to know about, or the content might just not fit the unicode model (like a string with two character sets).<br>
<br>Note that WebOb does not have two views; it has only one view -- unicode viewing bytes. I'm not sure I could keep two views straight. I *think* Antoine is describing two possible canonical data types (unicode or bytes) and two views. That sounds hard.<br>
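<br>As a rough sketch of the single-view idea (this is not WebOb's actual API; the class BytesBackedBody and its attributes are hypothetical names made up for illustration), the bytes stay canonical and the text is only computed when someone asks for it:<br>

```python
class BytesBackedBody:
    """Hypothetical sketch: bytes are canonical, text is a lazy view on them."""

    def __init__(self, raw, charset="utf-8"):
        self.raw = raw            # canonical data, stored untouched
        self.charset = charset
        self.decode_count = 0     # only here to show when decoding happens

    @property
    def text(self):
        # Decode on access; nothing is cached, so raw stays the single
        # source of truth.
        self.decode_count += 1
        return self.raw.decode(self.charset)

    @text.setter
    def text(self, value):
        # Writing through the view encodes back into the canonical bytes.
        self.raw = value.encode(self.charset)


# The trailing byte is not valid UTF-8, but nothing decodes it here.
body = BytesBackedBody(b"caf\xc3\xa9 \xff")
untouched = body.raw

# A well-formed body can be read and rewritten through the text view.
ok = BytesBackedBody(b"caf\xc3\xa9")
decoded = ok.text
ok.text = "na\u00efve"
```

Passing raw along never decodes, so a message with bytes that don't fit the declared charset can still be forwarded unchanged; only code that reads text has to deal with decoding errors.<br>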
<br>-- <br>Ian Bicking | <a href="http://blog.ianbicking.org">http://blog.ianbicking.org</a><br>