[Tutor] How does len() compute length of a string in UTF-8, 16, and 32?
boB Stepp
robertvstepp at gmail.com
Tue Aug 8 23:30:53 EDT 2017
On Mon, Aug 7, 2017 at 10:20 PM, Cameron Simpson <cs at cskk.id.au> wrote:
> On 07Aug2017 21:44, boB Stepp <robertvstepp at gmail.com> wrote:
>>
>> py3: s = 'Hello!'
>> py3: len(s.encode("UTF-8"))
>> 6
>> py3: len(s.encode("UTF-16"))
>> 14
>> py3: len(s.encode("UTF-32"))
>> 28
>>
>> How is len() getting these values? And I am sure it will turn out not
>> to be a coincidence that 2 * (6 + 1) = 14 and 4 * (6 + 1) = 28. Hmm
>
>
> The result of str.encode is a bytes object with the specified encoding of
> the original text.
>
> Your sample string contains only ASCII characters, which encode 1-to-1 in
> UTF-8. So as you might expect, your UTF-8 encoding is 6 bytes. The others
> have a slight twist. Let's see:
>
> Python 3.6.1 (default, Apr 24 2017, 06:17:09)
> [GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> s = 'Hello!'
> >>> s.encode()
> b'Hello!'
> >>> s.encode('utf-16')
> b'\xff\xfeH\x00e\x00l\x00l\x00o\x00!\x00'
> >>> s.encode('utf-32')
> b'\xff\xfe\x00\x00H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00!\x00\x00\x00'
>
> The utf-8 encoding (the default) is as you would expect.
>
> The UTF-16 and UTF-32 encodings encode code points into 2 and 4 byte
> sequences as you might expect. Unlike the UTF-8 encoding, however, it is
> necessary to know whether these byte sequences are big-endian (most
> significant byte first) or little-endian (least significant byte first).
As I just posted in my response to Ben, I am missing something
probably quite basic in translating the bytes representation above
into "bytes".
> The machine I'm on here is writing little-endian UTF-16 and UTF-32.
>
> As you note, the 16 and 32 forms are (6 + 1) times 2 or 4 respectively. This
> is because each encoding has a leading byte order mark (the single code
> point U+FEFF) to indicate big- or little-endianness. For little-endian data
> that is \xff\xfe; for big-endian data it would be \xfe\xff.
The arithmetic, as I mentioned in my original post, is what I was
expecting in "bytes", but my current thinking is that if I take the
BOM you point out, "\xff\xfe", I read that as 4 hex digits, each
worth 16 bits, for a total of 64 bits or 8 bytes. What am I
misunderstanding here? Is "byte" being used here to mean something
other than 8 bits? I vaguely recall reading somewhere that "byte"
can mean different numbers of bits in different contexts.
And is len() actually counting "bytes" or something else for these encodings?
--
boB