[Python-Dev] Dropping bytes "support" in json
Barry Warsaw
barry at python.org
Fri Apr 10 04:26:22 CEST 2009
On Apr 9, 2009, at 8:07 AM, Steve Holden wrote:
> The real problem I came across in storing email in a relational database
> was the inability to store messages as Unicode. Some messages have a
> body in one encoding and an attachment in another, so the only ways to
> store the messages are either as a monolithic bytes string that gets
> parsed when the individual components are required or as a sequence of
> components in the database's preferred encoding (if you want to keep the
> original encoding most relational databases won't be able to help unless
> you store the components as bytes).
>
> All in all, as you might expect from a system that's been growing up
> since 1970 or so, it can be quite intractable.
There are really two ways to look at an email message. It's either an
unstructured blob of bytes, or it's a structured tree of objects.
Those objects have headers and payload. The payload can be of any
type, though I think it generally breaks down into "strings" for text/*
types and bytes for anything else (not counting multiparts).
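As a rough illustration of the "tree of objects" view, here is a minimal sketch using the standard-library email package as it exists in present-day Python 3 (message_from_bytes postdates this message). The sample message RAW and the helper name describe are hypothetical, made up for the example. It walks the tree and yields str for text/* parts and bytes for everything else, skipping multipart containers.

```python
import email

# Hypothetical sample message: one text part and one binary part.
RAW = (
    b"From: sender@example.com\r\n"
    b"Subject: demo\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="XX"\r\n'
    b"\r\n"
    b"--XX\r\n"
    b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"caf=E9\r\n"
    b"--XX\r\n"
    b"Content-Type: application/octet-stream\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"AAEC\r\n"
    b"--XX--\r\n"
)


def describe(raw):
    """Return (content_type, payload) pairs for the leaf parts of a message.

    Text parts come back as str, everything else as bytes.
    """
    msg = email.message_from_bytes(raw)
    leaves = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no payload of their own
        data = part.get_payload(decode=True)  # transfer-decoded bytes
        if part.get_content_maintype() == "text":
            charset = part.get_content_charset() or "ascii"
            leaves.append((part.get_content_type(),
                           data.decode(charset, errors="replace")))
        else:
            leaves.append((part.get_content_type(), data))
    return leaves
```

Calling describe(RAW) gives one text/plain entry holding a str and one application/octet-stream entry holding bytes, which is the split described above.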
The email package isn't a perfect mapping to this, which is something
I want to improve. That aside, I think storing a message in a
database means storing some or all of the headers separately from the
byte stream (or text?) of its payload. That's for non-multipart
types. It would be more complicated to represent a message tree of
course.
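One way the non-multipart case might look, sketched with sqlite3 and the present-day email package. The table layout, the column names, and the store helper are all assumptions made for illustration, not a recommended schema: headers go into a text table, the payload into a BLOB column.

```python
import email
import sqlite3

# Hypothetical non-multipart sample message.
RAW = b"From: sender@example.com\r\nSubject: demo\r\n\r\nhello\r\n"


def store(conn, raw):
    """Store headers as text rows and the payload as a blob; return the id."""
    msg = email.message_from_bytes(raw)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS message "
        "(id INTEGER PRIMARY KEY, payload BLOB)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS header "
        "(message_id INTEGER, name TEXT, value TEXT)"
    )
    cur = conn.execute(
        "INSERT INTO message (payload) VALUES (?)",
        (msg.get_payload(decode=True),),  # bytes, stored as a BLOB
    )
    message_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO header VALUES (?, ?, ?)",
        [(message_id, name, str(value)) for name, value in msg.items()],
    )
    return message_id


conn = sqlite3.connect(":memory:")
message_id = store(conn, RAW)
headers = conn.execute(
    "SELECT name, value FROM header WHERE message_id = ? ORDER BY rowid",
    (message_id,),
).fetchall()
payload = conn.execute(
    "SELECT payload FROM message WHERE id = ?", (message_id,)
).fetchone()[0]
```

A message tree would need something more, for example a parent reference per part, which this sketch deliberately leaves out.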
It does seem to make sense to think about headers as text header names
and text header values. Of course, header values can contain almost
anything and there's an encoding to bring it back to 7-bit ASCII, but
again, you really have two views of a header value. Which you want
really depends on your application.
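The two views of a header value can be sketched with email.header from the standard library. The helper name two_views is hypothetical, and the sample value is an RFC 2047 encoded word chosen for the example.

```python
from email.header import decode_header, make_header


def two_views(raw_value):
    """Return (encoded 7-bit bytes, decoded text) for one header value."""
    encoded = raw_value.encode("ascii")          # what travels on the wire
    chunks = decode_header(raw_value)            # [(bytes, charset), ...]
    text = str(make_header(chunks))              # human-readable text
    return encoded, text


encoded, text = two_views("=?iso-8859-1?q?p=F6stal?=")
```

Here encoded stays pure ASCII bytes, while text is the str an application would show to a user.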
Maybe you just care about the text of both the header name and value.
In that case, I think you want the values as unicodes, and probably
the header names as unicodes containing only ASCII. So your table would
be strings in both cases. OTOH, maybe your application cares about the
raw underlying encoded data, in which case the header names are
probably still strings of ASCII-ish unicodes and the values are
bytes. It's this distinction (and I think the competing use cases)
that makes a true Python 3.x API for email more complicated.
Thinking about this stuff makes me nostalgic for the sloppy happy days
of Python 2.x.
-Barry