[Python-Dev] Dropping bytes "support" in json

Fri Apr 10 04:26:22 CEST 2009

On Apr 9, 2009, at 8:07 AM, Steve Holden wrote:

> The real problem I came across in storing email in a relational database
> was the inability to store messages as Unicode. Some messages have a
> body in one encoding and an attachment in another, so the only ways to
> store the messages are either as a monolithic bytes string that gets
> parsed when the individual components are required or as a sequence of
> components in the database's preferred encoding (if you want to keep the
> original encoding most relational databases won't be able to help unless
> you store the components as bytes).
>
> All in all, as you might expect from a system that's been growing up
> since 1970 or so, it can be quite intractable.

There are really two ways to look at an email message.  It's either an  
unstructured blob of bytes, or it's a structured tree of objects.   
Those objects have headers and payload.  The payload can be of any  
type, though I think it generally breaks down into "strings" for
text/* types and bytes for anything else (not counting multiparts).
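
A minimal sketch of these two views, assuming the email package API found
in current Python 3 (EmailMessage and policy.default, both of which postdate
this thread). The sample message is made up for illustration.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Build a small multipart message so there is something to walk.
original = EmailMessage()
original["Subject"] = "demo"
original.set_content("héllo")                      # text/plain part
original.add_attachment(b"\x00\x01\x02",
                        maintype="application",
                        subtype="octet-stream")    # binary part

# View 1: the message as an unstructured blob of bytes.
blob = original.as_bytes()

# View 2: the same message as a structured tree of objects.
tree = message_from_bytes(blob, policy=policy.default)

leaves = []
for part in tree.walk():
    if part.is_multipart():
        continue  # containers carry no payload of their own
    # get_content() yields str for text/* parts and bytes otherwise.
    leaves.append((part.get_content_type(), part.get_content()))

for content_type, payload in leaves:
    print(content_type, type(payload).__name__)
```

With this API the text/plain leaf comes back as a str and the
application/octet-stream leaf as bytes, which matches the split described
above.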

The email package isn't a perfect mapping to this, which is something  
I want to improve.  That aside, I think storing a message in a  
database means storing some or all of the headers separately from the  
byte stream (or text?) of its payload.  That's for non-multipart  
types.  It would be more complicated to represent a message tree of  
course.

It does seem to make sense to think about headers as text header names  
and text header values.  Of course, header values can contain almost  
anything and there's an encoding to bring it back to 7-bit ASCII, but  
again, you really have two views of a header value.  Which you want  
really depends on your application.
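
To make the two views of a header value concrete, here is a small sketch
using email.header from the standard library. The subject text is an
arbitrary example; the 7-bit form is the RFC 2047 encoded-word
representation.

```python
from email.header import Header, decode_header, make_header

subject_text = "Grüße"  # the "text" view of the value

# The "wire" view: an RFC 2047 encoded word, pure 7-bit ASCII.
encoded = Header(subject_text, "utf-8").encode()
wire_bytes = encoded.encode("ascii")  # would raise if not 7-bit clean

# Going back: decode_header() splits the value into (bytes, charset)
# chunks, and make_header() reassembles them into text.
chunks = decode_header(encoded)
round_tripped = str(make_header(chunks))

print(encoded)
print(round_tripped)
```

Which of encoded or round_tripped an application wants to keep is exactly
the choice described above.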

Maybe you just care about the text of both the header name and value.   
In that case, I think you want the values as unicodes, and probably  
the headers as unicodes containing only ASCII.  So your table would be  
strings in both cases.  OTOH, maybe your application cares about the  
raw underlying encoded data, in which case the header names are  
probably still strings of ASCII-ish unicodes and the values are  
bytes.  It's this distinction (and I think the competing use cases)
that makes a true Python 3.x API for email more complicated.
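
One hedged way to serve both use cases in a database is to store each
header value twice, once as decoded text and once as the raw encoded bytes.
The table and column names below are hypothetical, and sqlite3 is used only
because it ships with Python; this is a sketch, not a recommended schema.

```python
import sqlite3
from email import message_from_bytes
from email.header import decode_header, make_header

RAW = (b"Subject: =?utf-8?b?R3LDvMOfZQ==?=\r\n"
       b"From: someone@example.com\r\n"
       b"\r\n"
       b"body\r\n")

# The default (compat32) policy leaves header values in their encoded form.
msg = message_from_bytes(RAW)

db = sqlite3.connect(":memory:")
# Hypothetical schema: name and decoded value as text, raw value as bytes.
db.execute("CREATE TABLE header (name TEXT, value_text TEXT, value_raw BLOB)")

for name, value in msg.items():
    text = str(make_header(decode_header(value)))   # decoded text view
    raw = value.encode("ascii")                     # encoded bytes view
    db.execute("INSERT INTO header VALUES (?, ?, ?)", (name, text, raw))

rows = db.execute(
    "SELECT name, value_text, value_raw FROM header ORDER BY rowid"
).fetchall()
for row in rows:
    print(row)
```

An application that only cares about text can ignore value_raw, while one
that needs the original encoding reads the bytes column instead.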

Thinking about this stuff makes me nostalgic for the sloppy happy days
of Python 2.x.

-Barry
