[Python-Dev] email package Bytes vs Unicode (was Re: Dropping bytes "support" in json)

Thu Apr 9 18:20:31 CEST 2009

Tony Nelson wrote:
> (email-sig added)
> 
> At 08:07 -0400 04/09/2009, Steve Holden wrote:
>> Barry Warsaw wrote:
>  ...
>>> This is an interesting question, and something I'm struggling with for
>>> the email package for 3.x.  It turns out to be pretty convenient to have
>>> both a bytes and a string API, both for input and output, but I think
>>> email really wants to be represented internally as bytes.  Maybe.  Or
>>> maybe just for content bodies and not headers, or maybe both.  Anyway,
>>> aside from that decision, I haven't come up with an elegant way to allow
>>> /output/ in both bytes and strings (input is I think theoretically
>>> easier by sniffing the arguments).
>>>
>> The real problem I came across in storing email in a relational database
>> was the inability to store messages as Unicode. Some messages have a
>> body in one encoding and an attachment in another, so the only ways to
>> store the messages are either as a monolithic bytes string that gets
>> parsed when the individual components are required or as a sequence of
>> components in the database's preferred encoding (if you want to keep the
>> original encoding most relational databases won't be able to help unless
>> you store the components as bytes).
>  ...
> 
> I found it confusing myself, and did it wrong for a while.  Now, I
> understand that messages come over the wire as bytes, either 7-bit US-ASCII
> or 8-bit whatever, and are parsed at the receiver.  I think of the database
> as a wire to the future, and store the data as bytes (a BLOB), letting the
> future receiver parse them as it did the first time, when I cleaned the
> message.  Data I care to query is extracted into fields (in UTF-8, what I
> usually use for char fields).  I have no need to store messages as Unicode,
> and they aren't Unicode anyway.  I have no need ever to flatten a message
> to Unicode, only to US-ASCII or, for messages (spam) that are corrupt, raw
> 8-bit data.
> 
> If you need the data from the message, by all means extract it and store it
> in whatever form is useful to the purpose of the database.  If you need the
> entire message, store it intact in the database, as the bytes it is.  Email
> isn't Unicode any more than a JPEG or other image types (often payloads in
> a message) are Unicode.
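
The scheme described above (raw bytes in a BLOB, queryable fields
extracted as text) might look roughly like the following. This is only a
sketch under assumptions that are not in the original message: it uses
sqlite3 rather than any particular server database, a modern Python 3
where email.message_from_bytes and email.policy exist, and a hypothetical
table layout and helper names (messages, store, load).

```python
import sqlite3
from email import message_from_bytes, policy

# A sample message: UTF-8 encoded-word subject, ISO-8859-1 8-bit body.
RAW = (
    b"From: sender@example.com\r\n"
    b"To: rcpt@example.com\r\n"
    b"Subject: =?utf-8?b?Q2Fmw6k=?=\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"caf\xe9\r\n"
)


def store(conn, raw):
    """Store the message untouched as a BLOB, plus decoded query fields."""
    msg = message_from_bytes(raw, policy=policy.default)
    cur = conn.execute(
        "INSERT INTO messages (sender, subject, raw) VALUES (?, ?, ?)",
        (str(msg["From"] or ""), str(msg["Subject"] or ""), raw),
    )
    return cur.lastrowid


def load(conn, rowid):
    """Return the original bytes so a future receiver can parse them again."""
    (raw,) = conn.execute(
        "SELECT raw FROM messages WHERE id = ?", (rowid,)
    ).fetchone()
    return bytes(raw)


# Hypothetical schema: text columns for querying, a BLOB for the message.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    "id INTEGER PRIMARY KEY, sender TEXT, subject TEXT, raw BLOB)"
)
rowid = store(conn, RAW)
```

The stored bytes never pass through a text encoding, so parts in
different charsets survive intact; only the extracted fields are decoded.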

This is all great, and I did quite quickly realize that the best
approach was to store the mails in their network byte-stream format as
bytes. The approach was negated in my own case because of PostgreSQL's
execrable BLOB-handling capabilities. I took a look at the escaping they
required, snorted with derision and gave it up as a bad job.

PostgreSQL strongly encourages you to store text in encoded columns.
Because an email message has no single encoding, that turns out to be a
most inconvenient storage type for it. Sadly, BLOBs are such a pain in
PostgreSQL that it's easier to store the messages in external files and
just use the relational database to index those files to retrieve
content, so that's what I ended up doing.
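
For illustration only, a minimal sketch of that files-plus-index
arrangement. None of the specifics come from my actual setup: sqlite3
stands in for the relational database, and the idx table, the archive and
find helpers, and the SHA-256 file naming are all hypothetical choices.

```python
import hashlib
import sqlite3
import tempfile
from email import message_from_bytes, policy
from pathlib import Path

# A sample message whose body contains a raw, undeclared 8-bit byte.
RAW = b"Subject: Hello\r\n\r\nbody with a raw 8-bit byte: \xff\r\n"


def archive(conn, root, raw):
    """Write the message bytes to a file and index it in the database."""
    # Naming the file after a hash of its content is an assumption made
    # here to get unique, stable file names.
    digest = hashlib.sha256(raw).hexdigest()
    path = Path(root) / (digest + ".eml")
    path.write_bytes(raw)
    msg = message_from_bytes(raw, policy=policy.default)
    conn.execute(
        "INSERT OR REPLACE INTO idx (path, subject) VALUES (?, ?)",
        (str(path), str(msg["Subject"] or "")),
    )
    return path


def find(conn, subject):
    """Look up files by an indexed field and return their raw bytes."""
    rows = conn.execute("SELECT path FROM idx WHERE subject = ?", (subject,))
    return [Path(p).read_bytes() for (p,) in rows]


root = tempfile.mkdtemp()
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE idx (path TEXT PRIMARY KEY, subject TEXT)")
path = archive(conn, root, RAW)
```

The database only ever sees decoded text fields and a file path; the
message itself stays on disk as the bytes that arrived.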

regards
 Steve

-- 
Steve Holden           +1 571 484 6266   +1 800 494 3119
Holden Web LLC                 http://www.holdenweb.com/
Watch PyCon on video now!          http://pycon.blip.tv/