[Python-Dev] email package status in 3.X

Mon Jun 21 23:52:08 CEST 2010

...
>> IOW, if you're producing output that has to go into another system
>> that doesn't take unicode, it doesn't matter how
>> theoretically-correct it would be for your app to process the data in
>> unicode form.  In that case, unicode is not a feature: it's a bug.
>>
> This is not always true.  If you read a webpage, chop it up so you get
> a list of words, create a histogram of word length, and then write the output as
> utf8 to a database.  Should you do all your intermediate string operations
> on utf8 encoded byte strings?  No, you should do them on unicode strings as
> otherwise you need to know about the details of how utf8 encodes characters.
> 

You'd still have problems in Unicode given stuff like å =~ å even though
u'\xe5' vs u'a\u030a' (those will look the same depending on your
Unicode system. IDLE shows them pretty much the same, T-Bird on Windosw
with my current font shows the second as 2 characters.)

I realize this was a toy example, but it does point out that Unicode
complicates the idea of 'equality' as well as the idea of 'what is a
character'. And just saying "decode it to Unicode" isn't really sufficient.

John
=:->