[Tutor] How does len() compute length of a string in UTF-8, 16, and 32?

Cameron Simpson cs at cskk.id.au
Mon Aug 7 23:20:12 EDT 2017


On 07Aug2017 21:44, boB Stepp <robertvstepp at gmail.com> wrote:
>py3: s = 'Hello!'
>py3: len(s.encode("UTF-8"))
>6
>py3: len(s.encode("UTF-16"))
>14
>py3: len(s.encode("UTF-32"))
>28
>
>How is len() getting these values?  And I am sure it will turn out not
>to be a coincidence that 2 * (6 + 1) = 14 and 4 * (6 + 1) = 28.  Hmm

The result of str.encode is a bytes object containing the original text in the 
specified encoding, and len() of a bytes object counts bytes, not characters.
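
For example, with your string (a quick check in a Python 3 session):

    >>> s = 'Hello!'
    >>> len(s)              # counts code points in the str
    6
    >>> b = s.encode('utf-8')
    >>> type(b)
    <class 'bytes'>
    >>> len(b)              # counts bytes in the encoded result
    6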

Your sample string contains only ASCII characters, which encode 1-to-1 in 
UTF-8. So as you might expect, your UTF-8 encoding is 6 bytes. The others have 
a slight twist. Let's see:

    Python 3.6.1 (default, Apr 24 2017, 06:17:09)
    [GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> s = 'Hello!'
    >>> s.encode()
    b'Hello!'
    >>> s.encode('utf-16')
    b'\xff\xfeH\x00e\x00l\x00l\x00o\x00!\x00'
    >>> s.encode('utf-32')
    b'\xff\xfe\x00\x00H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00!\x00\x00\x00'

The utf-8 encoding (the default) is as you would expect.

The UTF-16 and UTF-32 encodings use 2-byte and 4-byte code units respectively, 
as you might expect (each of these ASCII characters fits in a single unit). 
Unlike the UTF-8 encoding, however, it is necessary to 
know whether these byte sequences are big-endian (most significant byte first) 
or little-endian (least significant byte first).

The machine I'm on here is writing little endian UTF-16 and UTF-32: the low 
byte of each code unit comes first ('H' is \x48\x00, not \x00\x48).
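
If you want a specific byte order you can ask for it explicitly with the 
"utf-16-le"/"utf-16-be" (and "utf-32-le"/"utf-32-be") codecs; these variants 
write no BOM at all, so the length is just 2 (or 4) bytes per character here:

    >>> 'Hello!'.encode('utf-16-le')
    b'H\x00e\x00l\x00l\x00o\x00!\x00'
    >>> 'Hello!'.encode('utf-16-be')
    b'\x00H\x00e\x00l\x00l\x00o\x00!'
    >>> len('Hello!'.encode('utf-16-le'))
    12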

As you note, the 16 and 32 byte counts are (6 + 1) times 2 or 4 respectively. 
This is because each encoding starts with a byte order mark (BOM), which 
occupies one extra code unit (2 bytes in UTF-16, 4 bytes in UTF-32) and 
indicates the endianness. For little endian data the UTF-16 BOM is \xff\xfe, 
as above; for big endian data it would be \xfe\xff.
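
You can check the arithmetic directly; the codecs module exposes the BOM byte 
strings:

    >>> import codecs
    >>> codecs.BOM_UTF16_LE
    b'\xff\xfe'
    >>> codecs.BOM_UTF32_LE
    b'\xff\xfe\x00\x00'
    >>> len(codecs.BOM_UTF16_LE) + 2 * len('Hello!')    # BOM + 2 bytes per character
    14
    >>> len(codecs.BOM_UTF32_LE) + 4 * len('Hello!')    # BOM + 4 bytes per character
    28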

Cheers,
Cameron Simpson <cs at cskk.id.au> (formerly cs at zip.com.au)
