inserting Unicode character in dictionary - Python

Joe Strout joe at strout.net
Sun Oct 19 08:57:43 EDT 2008


On Oct 18, 2008, at 1:20 AM, Martin v. Löwis wrote:

>> Do you then have a proper UTF-8 string,
>> but the problem is that none of the standard Python library methods
>> know how to properly interpret UTF-8?
>
> There is (probably) no such thing as a "proper UTF-8 string" (in the
> sense in which you probably mean it).

To be clear, I mean a string that is valid UTF-8 (not all strings of  
bytes are, of course).
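
A minimal sketch of what "valid UTF-8" means in practice, written for Python 3, where `bytes` plays the role of the byte string discussed here. The helper name `is_valid_utf8` is made up for illustration and is not part of the standard library; it simply tries to decode and reports whether that worked.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# "L\u00f6wis" is "Löwis", written with an escape so the example does not
# depend on the editor's encoding.
utf8_bytes = "L\u00f6wis".encode("utf-8")   # b'L\xc3\xb6wis'
latin1_bytes = b"L\xf6wis"                  # the same name encoded as Latin-1

print(is_valid_utf8(utf8_bytes))    # True
print(is_valid_utf8(latin1_bytes))  # False
```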

> Python doesn't have a data type
> for "UTF-8 string". It only has a data type "byte string". It's up to
> the application whether it gets interpreted in a consistent manner.
> Libraries are (typically) encoding-agnostic, i.e. they work for UTF-8
> encoded strings the same way as for, say, Big-5 encoded strings.

Oi -- so if I ask for length, I get the number of bytes, not the  
number of characters.  If I slice and dice, I could end up splitting  
characters in half.  It is, as you say, just a string of bytes, not a  
string of characters.
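
To make the length and slicing point concrete, here is a small sketch in Python 3 syntax (an assumption on my part; in 2.x the plain `str` type behaves like `bytes` does here). The variable names are only illustrative.

```python
# Python 3 sketch: bytes is the byte string, str is the character string.
text = "L\u00f6wis"           # "Löwis", five characters
raw = text.encode("utf-8")    # b'L\xc3\xb6wis', six bytes

char_count = len(text)        # counts characters (code points)
byte_count = len(raw)         # counts bytes; the o-umlaut takes two

head = raw[:2]                # b'L\xc3': cuts the two-byte sequence in half
try:
    head.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

print(char_count, byte_count, split_ok)   # 5 6 False
print(text[:2] == "L\u00f6")              # True: slicing text keeps characters whole
```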

>> 4. In Python 3.0, this silliness goes away, because all strings are
>> Unicode by default.
>
> You still need to make sure that the editor's encoding and the
> declared encoding match.

Well, if no encoding is declared, it (quite sensibly) assumes
UTF-8, so for my purposes this boils down to using a UTF-8 editor --
which I always do anyway.  But do I still have to put a "u" before my
string literals in order to have them treated as characters rather than
bytes?

I'm hoping that the answer is "no" -- most string literals in a source  
file are text (which should be Unicode text, these days); a raw byte  
string would be the exceptional case, and I'd be happy to use the "r"  
prefix for those.
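
For reference, a short sketch of how literal prefixes behave in Python 3 as I understand it: an unprefixed literal is already Unicode text, so no "u" is needed; byte strings are marked with a "b" prefix; and the "r" prefix only turns off backslash-escape processing, leaving the result a text string. The variable names are just for illustration.

```python
plain = "caf\u00e9"         # unprefixed literal: a str of 4 characters
encoded = b"caf\xc3\xa9"    # b prefix: a bytes object of 5 bytes
raw_text = r"caf\u00e9"     # r prefix: still a str, backslash kept literally

print(type(plain).__name__, len(plain))         # str 4
print(type(encoded).__name__, len(encoded))     # bytes 5
print(type(raw_text).__name__, len(raw_text))   # str 9
print(plain.encode("utf-8") == encoded)         # True
```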

Best,
- Joe



