
On Mon, Jun 21, 2010 at 04:52:08PM -0500, John Arbash Meinel wrote:
...
IOW, if you're producing output that has to go into another system that doesn't take unicode, it doesn't matter how theoretically-correct it would be for your app to process the data in unicode form. In that case, unicode is not a feature: it's a bug.
This is not always true. Suppose you read a webpage, chop it up to get a list of words, create a histogram of word lengths, and then write the output as utf8 to a database. Should you do all your intermediate string operations on utf8-encoded byte strings? No, you should do them on unicode strings, as otherwise you need to know the details of how utf8 encodes characters.
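A minimal sketch of that pipeline, with the page fetch stubbed out by a hard-coded byte string (the function and variable names here are mine, not from any particular library):

    # -*- coding: utf-8 -*-
    from collections import Counter

    def word_length_histogram(page_bytes, encoding='utf-8'):
        # Decode at the input boundary; every intermediate operation
        # then works on a unicode string.
        text = page_bytes.decode(encoding)
        # len() now counts code points, so a multibyte character such
        # as 'e-acute' contributes 1 to a word's length, not 2.
        return Counter(len(word) for word in text.split())

    page = u'na\u00efve caf\u00e9 r\u00e9sum\u00e9'.encode('utf-8')
    histogram = word_length_histogram(page)
    # Encode only at the output boundary, e.g. just before handing
    # the rows to the database layer.
    rows = [(u'%d %d' % item).encode('utf-8') for item in histogram.items()]

Had the split and len() been done on the raw utf8 bytes, u'caf\u00e9' would count as five characters instead of four, because the accented letter occupies two bytes in utf8.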
You'd still have problems in Unicode given stuff like å =~ å, even though one is u'\xe5' and the other is u'a\u030a'. (Whether those look the same depends on how your system renders Unicode: IDLE shows them pretty much the same, while T-Bird on Windows with my current font shows the second as 2 characters.)
I realize this was a toy example, but it does point out that Unicode complicates the idea of 'equality' as well as the idea of 'what is a character'. And just saying "decode it to Unicode" isn't really sufficient.
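To see both problems in one place, here is what an interactive Python session shows for those two strings (the variable names are just for illustration):

    >>> precomposed = u'\xe5'    # a-ring as one precomposed code point
    >>> decomposed = u'a\u030a'  # 'a' plus U+030A COMBINING RING ABOVE
    >>> precomposed == decomposed
    False
    >>> len(precomposed), len(decomposed)
    (1, 2)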
Ah -- but if you're dealing with unicode objects you can use the unicodedata.normalize() function on them to come out with the right values. If you're using bytes, it's yet another case where you, the programmer, have to know which byte sequences represent combining characters in the particular encoding you're dealing with.
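For instance, normalizing both strings to the same form (NFC composes, NFD decomposes; either works as long as you pick one and apply it to both sides) makes the earlier comparison come out right:

    >>> import unicodedata
    >>> unicodedata.normalize('NFC', u'a\u030a') == u'\xe5'
    True
    >>> len(unicodedata.normalize('NFC', u'a\u030a'))
    1

-Toshio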