what is the unicode?

Dave Angel d at davea.name
Sat Jan 28 06:18:45 EST 2012


I'm guessing you're using Python 2.7 or something similar. Things are
much different in Python 3.x.

On 01/28/2012 02:47 AM, contro opinion wrote:
> as far as i know
>
> >>> u'中国'.encode('utf-8')
> '\xe4\xb8\xad\xe5\x9b\xbd'
>
> so,'\xe4\xb8\xad\xe5\x9b\xbd'  is the  utf-8  of  '中国'
No, it is the utf-8 encoding of the unicode string u'中国'. That unicode
string has two characters in it, which may take two bytes each or four
bytes each in memory, depending mainly on how your python was built (a
"narrow" or a "wide" build). So it's 4 bytes or 8.

The encoded version takes six bytes when encoded in utf-8. It happens
in this case that each of those unicode characters takes 3 bytes to
encode. In a utf-8 encoding, a character may take anywhere from one to
four bytes to represent.
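
To make the byte counts concrete, here is a small sketch. It is written
for Python 3 (an assumption on my part, since the session above is
Python 2.7), and the sample characters and variable names are mine,
picked only for illustration.

```python
# Python 3 sketch: count characters versus UTF-8 bytes.
text = '\u4e2d\u56fd'            # the same two characters as u'中国'
encoded = text.encode('utf-8')

print(len(text))                 # number of characters
print(len(encoded))              # number of bytes after encoding

# Illustrative samples: ASCII, accented Latin, CJK, and an emoji.
samples = ['A', '\u00e9', '\u4e2d', '\U0001f600']
utf8_lengths = [len(ch.encode('utf-8')) for ch in samples]
print(utf8_lengths)
```

The four samples take 1, 2, 3 and 4 bytes respectively, so the byte
count of an encoded string tells you little about its character count.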

> >>> u'中国'.encode('gbk')
> '\xd6\xd0\xb9\xfa'
> so,'\xd6\xd0\xb9\xfa' is the  utf-8  of  '中国'
>
(presumably the utf-8 above was just a typo on your part.)
No, it is the gbk encoding of the unicode string. In this case it takes
two bytes for each character. I don't know gbk well, but as far as I
know it uses one byte for ASCII characters and two bytes for the rest.
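
A quick way to see that the two byte strings are different encodings of
the same text is to decode each with its own codec. This is a Python 3
sketch (not the 2.7 session above), and the variable names are just for
illustration.

```python
# Python 3 sketch: one string, two encodings.
text = '\u4e2d\u56fd'

gbk_bytes = text.encode('gbk')
utf8_bytes = text.encode('utf-8')
print(gbk_bytes)
print(utf8_bytes)

# Decoding with the matching codec gives back the original string.
round_trip_ok = (gbk_bytes.decode('gbk') == text and
                 utf8_bytes.decode('utf-8') == text)

# Decoding the gbk bytes as utf-8 fails, because they are not
# a valid utf-8 sequence.
try:
    gbk_bytes.decode('utf-8')
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

print(round_trip_ok, wrong_codec_failed)
```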

> >>> u'中国'
> u'\u4e2d\u56fd'
>
> what is the meaning of u'\u4e2d\u56fd'?
> u'\u4e2d\u56fd'  =  \x4e2d\x56fd  ??
>
Here you can see the exact two unicode characters. The first character
has the code point 4e2d (in hex), and the second has the code point
56fd. If you were on a platform that didn't have the
fonts or keyboard layout for either of those characters, you could enter
the string as
u'\u4e2d\u56fd'
and it would be exactly equivalent to entering the literal with those
characters directly. For example, on my (English) keyboard, I have no
easy way to enter in those unicode characters; I have been copy/pasting
them between windows.
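
As for the \x4e2d\x56fd guess in the question, a short check shows why
that is not the same thing. This is a Python 3 sketch (so no u prefix
is needed), and the names are mine.

```python
# Python 3 sketch: \u escapes versus \x escapes.
escaped = '\u4e2d\u56fd'
literal = '中国'
same = (escaped == literal)      # the escape form is the same string

# \x takes exactly two hex digits, so '\x4e2d' is the character
# chr(0x4e), which is 'N', followed by the plain characters '2d'.
not_same = '\x4e2d\x56fd'

print(same)
print(not_same)
print(len(escaped), len(not_same))
```

So \u4e2d is one character, while \x4e2d is three.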

Do you know how to interpret those literal strings? The u outside the
quotes says the whole thing is a unicode string. That's a distinct type
from a byte string, and it almost always has to be converted to a byte
string before going out to the console, a file, or whatever. When you say
print mystring, if mystring is of type unicode, the unicode characters
are encoded according to some rules established by your console handler
(here's where I get pretty fuzzy), which it thinks will get them to the
console display correctly.
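
You can do by hand roughly what print does for you. The sketch below is
Python 3, and it is only an approximation of the idea: the utf-8
fallback and the errors='replace' policy are my choices for the
example, not something Python guarantees for your console.

```python
# Python 3 sketch: encode a string the way a console might need it.
import sys

text = '\u4e2d\u56fd'

# Fall back to utf-8 if the stream does not report an encoding
# (the fallback is an assumption made for this example).
console_encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'

# errors='replace' substitutes '?' for characters the console
# encoding cannot represent, instead of raising UnicodeEncodeError.
console_bytes = text.encode(console_encoding, errors='replace')

print(console_encoding)
print(console_bytes)
```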

Inside the unicode string literal, you can have regular characters or
escape sequences. For these two particular characters, Python's repr()
function chooses to use the escape sequences. The backslash identifies
it as an escape sequence. The u immediately after says that this
particular escape sequence is a four-digit hex representation. Those
four hex digits (0-9 and a-f) represent a two-byte number, which is the
ord() of the unicode character.
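
The relationship between the escape and ord() can be checked directly.
This sketch is Python 3, where ascii() plays roughly the role that
repr() plays for unicode strings in 2.7; the variable names are only
illustrative.

```python
# Python 3 sketch: the four hex digits are the character's ord().
ch = '\u4e2d'

code_point = ord(ch)                      # an ordinary integer
hex_digits = format(code_point, '04x')    # back to four hex digits
rebuilt = chr(int(hex_digits, 16))        # and back to the character

# ascii() shows non-ASCII characters as escape sequences.
escaped_repr = ascii('\u4e2d\u56fd')

print(code_point, hex_digits, rebuilt == ch)
print(escaped_repr)
```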

The whole concept of unicode is that it has enough code points that
nearly all characters of nearly all languages can be uniquely
represented. When you've got a unicode string, you can search it and
substring it, and be sure that every operation deals with characters,
and not some variable-length representation of characters.
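
Here is a sketch of the difference between working on characters and
working on an encoded form. It is Python 3, and the sample string is
made up for the example.

```python
# Python 3 sketch: indexing characters versus indexing bytes.
text = 'ab\u4e2d\u56fdcd'

idx = text.find('\u56fd')        # position counted in characters
sub = text[2:4]                  # exactly the two Chinese characters

raw = text.encode('utf-8')
byte_idx = raw.find('\u56fd'.encode('utf-8'))   # counted in bytes

# Slicing the bytes at the same numbers cuts a character in half.
chopped = raw[2:4]
try:
    chopped.decode('utf-8')
    chopped_is_valid = True
except UnicodeDecodeError:
    chopped_is_valid = False

print(idx, byte_idx, chopped_is_valid)
```

The same slice numbers give a clean substring on the string and an
undecodable fragment on the bytes.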

In Python 3.x, unicode is the default string type, and you have to use
b'xxx' notation to explicitly ask for bytes. Some things become much
simpler, and even more obvious in that environment.
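
For comparison, a minimal Python 3 sketch of the same two objects. The
names are mine, and this only shows the basic split between the str and
bytes types.

```python
# Python 3 sketch: str is unicode text, bytes is encoded data.
s = '\u4e2d\u56fd'                       # str, no u prefix needed
b = b'\xe4\xb8\xad\xe5\x9b\xbd'          # bytes, needs the b prefix

print(type(s).__name__, type(b).__name__)

# Conversion between the two is always explicit.
encoded_matches = (s.encode('utf-8') == b)
decoded_matches = (b.decode('utf-8') == s)

# The two types never compare equal, and mixing them is an error.
equal_without_decoding = (s == b)
try:
    s + b
    mixing_failed = False
except TypeError:
    mixing_failed = True

print(encoded_matches, decoded_matches, equal_without_decoding, mixing_failed)
```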




-- 

DaveA



