Multibyte Character Support for Python
Skip Montanaro
skip at pobox.com
Thu May 9 18:14:23 EDT 2002
Huaiyu> If a character is two bytes, what would len() report?
It depends on the type of the argument. For a Unicode object, len() reports
the number of characters. For a plain string, it reports the number of bytes:
>>> u"\N{GREEK CAPITAL LETTER ALPHA}"
u'\u0391'
>>> len(u"\N{GREEK CAPITAL LETTER ALPHA}")
1
>>> len(u"\N{GREEK CAPITAL LETTER ALPHA}".encode("utf-8"))
2
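The same distinction can be wrapped in a small helper. This is only a
sketch: char_and_byte_counts is a made-up name, not something from the
standard library, and it assumes the text is handed over as a Unicode
object rather than an already-encoded string.

```python
def char_and_byte_counts(text, encoding="utf-8"):
    """Return (characters in text, bytes in text once encoded)."""
    # len() on the Unicode object counts characters;
    # len() on the encoded result counts bytes.
    return len(text), len(text.encode(encoding))


alpha = u"\N{GREEK CAPITAL LETTER ALPHA}"
counts = char_and_byte_counts(alpha)  # one character, two UTF-8 bytes
```

Passing a different codec name as the second argument changes only the
second number in the pair.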
Huaiyu> Would it change depending on how the unicode is encoded?
Yes, depending on what you pass to len(). If it's a plain string, the
result definitely depends on the encoding. If it's the Unicode object
itself, it does not:
>>> u"a"
u'a'
>>> u"a".encode("utf-16")
'\xff\xfea\x00'
>>> u"a".encode("utf-8")
'a'
>>> len(u"a".encode("utf-16"))
4
>>> len(u"a".encode("utf-8"))
1
>>> len(u"a")
1
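To compare several encodings in one go, something like the following
should work. The name encoded_lengths is hypothetical, and the list of
codecs is just an illustrative choice. Note that plain "utf-16" prepends
a two-byte byte-order mark, which is why the length above is 4, while
"utf-16-le" fixes the byte order and adds no mark.

```python
def encoded_lengths(text, encodings=("utf-8", "utf-16", "utf-16-le")):
    """Map each encoding name to the byte length of text in that encoding."""
    return dict((enc, len(text.encode(enc))) for enc in encodings)


lengths = encoded_lengths(u"a")

# The character count of the Unicode object is the same
# no matter which encoding is used later.
n_chars = len(u"a")
```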
--
Skip Montanaro (skip at pobox.com - http://www.mojam.com/)
"Excellant Written and Communications Skills [required]" - post to chi.jobs