Multibyte Character Support for Python

Skip Montanaro skip at pobox.com
Thu May 9 18:14:23 EDT 2002


    Huaiyu> If a character is two bytes, what would len() report?  

It depends on the type of the argument.  For a Unicode object, len() returns
the number of characters.  For a plain string, it returns the number of bytes:

    >>> u"\N{GREEK CAPITAL LETTER ALPHA}"
    u'\u0391'
    >>> len(u"\N{GREEK CAPITAL LETTER ALPHA}")
    1
    >>> len(u"\N{GREEK CAPITAL LETTER ALPHA}".encode("utf-8"))
    2

    Huaiyu> Would it change depending on how the unicode is encoded?

Yes, depending on what you pass to len().  The length of a plain string
definitely depends on the encoding, while the length of the Unicode object
itself does not:

    >>> u"a"
    u'a'
    >>> u"a".encode("utf-16")
    '\xff\xfea\x00'
    >>> u"a".encode("utf-8")
    'a'
    >>> len(u"a".encode("utf-16"))
    4
    >>> len(u"a".encode("utf-8"))
    1
    >>> len(u"a")
    1
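
For anyone who wants to compare several encodings at once, here is a rough
sketch.  It assumes Python 3, where str is the Unicode type and encode()
returns bytes, so it is not the same interpreter as the sessions above.  The
helper name encoded_lengths is made up for illustration.  It also shows why
"utf-16" gives 4 bytes for a single character: that codec prepends a two-byte
byte-order mark, whereas "utf-16-le" does not.

```python
# Assumes Python 3: str holds Unicode text, bytes holds the encoded form.
def encoded_lengths(text, encodings=("utf-8", "utf-16", "utf-16-le")):
    """Return {encoding: byte count} for text (hypothetical helper)."""
    return {enc: len(text.encode(enc)) for enc in encodings}

alpha = "\N{GREEK CAPITAL LETTER ALPHA}"

print(len(alpha))              # number of characters, independent of encoding
print(encoded_lengths(alpha))  # number of bytes under each encoding
print(encoded_lengths("a"))
```

The character count stays at 1 in both cases.  Only the byte counts move as
the encoding changes.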

-- 
Skip Montanaro (skip at pobox.com - http://www.mojam.com/)
"Excellant Written and Communications Skills [required]" - post to chi.jobs




