[Tutor] How does len() compute length of a string in UTF-8, 16, and 32?

Cameron Simpson cs at cskk.id.au
Tue Aug 8 23:48:58 EDT 2017


On 08Aug2017 22:30, boB Stepp <robertvstepp at gmail.com> wrote:
>On Mon, Aug 7, 2017 at 10:20 PM, Cameron Simpson <cs at cskk.id.au> wrote:
>> On 07Aug2017 21:44, boB Stepp <robertvstepp at gmail.com> wrote:
>>> py3: s = 'Hello!'
>>> py3: len(s.encode("UTF-8"))
>>> 6
>>> py3: len(s.encode("UTF-16"))
>>> 14
>>> py3: len(s.encode("UTF-32"))
>>> 28
>>>
>>> How is len() getting these values?  And I am sure it will turn out not
>>> to be a coincidence that 2 * (6 + 1) = 14 and 4 * (6 + 1) = 28.  Hmm
>>
>> The result of str.encode is a bytes object with the specified encoding of
>> the original text.
>>
>> Your sample string contains only ASCII characters, which encode 1-to-1 in
>> UTF-8. So as you might expect, your UTF-8 encoding is 6 bytes. The others
>> have a slight twist. Let's see:
>>
>>    Python 3.6.1 (default, Apr 24 2017, 06:17:09)
>>    [GCC 4.2.1 Compatible Apple LLVM 7.3.0 (clang-703.0.31)] on darwin
>>    Type "help", "copyright", "credits" or "license" for more information.
>>    >>> s = 'Hello!'
>>    >>> s.encode()
>>    b'Hello!'
>>    >>> s.encode('utf-16')
>>    b'\xff\xfeH\x00e\x00l\x00l\x00o\x00!\x00'
>>    >>> s.encode('utf-32')
>>    b'\xff\xfe\x00\x00H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00!\x00\x00\x00'
[...]
>As I just posted in my response to Ben, I am missing something
>probably quite basic in translating the bytes representation above
>into "bytes".
>
>> The machine I'm on here is writing little endian UTF-16 and UTF-32.
>>
>> As you note, the UTF-16 and UTF-32 forms are (6 + 1) times 2 or 4 bytes
>> respectively. This is because each encoding has a leading byte order
>> marker to indicate big endianness or little endianness. For little endian
>> data that is \xff\xfe; for big endian data it would be \xfe\xff.
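
One way to see the BOM's contribution directly is to compare the generic
codecs, which prepend a BOM, against the explicit-endian codecs, which emit
none (a minimal sketch in the spirit of the session above; the byte counts
are the same on any platform):

    >>> s = 'Hello!'
    >>> len(s.encode('utf-16'))     # 2 byte BOM + 6 characters * 2 bytes
    14
    >>> len(s.encode('utf-16-le'))  # explicit endianness, so no BOM
    12
    >>> len(s.encode('utf-32'))     # 4 byte BOM + 6 characters * 4 bytes
    28
    >>> len(s.encode('utf-32-le'))
    24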
>
>The arithmetic as I mentioned in my original post is what I am
>expecting in "bytes", but my current thinking is that if I have for
>the BOM you point out "\xff\xfe", I translate that as 4 hex digits,
>each having 16 bits, for a total of 64 bits or 8 bytes.  What am I
>misunderstanding here?

A hex digit expresses 4 bits, not 16. "Hex"/"hexadecimal" is base 16, but 16 
is 2^4, so each hex digit holds just four bits. That means \xff\xfe is 4 hex 
digits = 16 bits = 2 bytes. So the BOM is 2 bytes long (\xff\xfe) in UTF-16 
and 4 bytes long (\xff\xfe\x00\x00) in UTF-32.
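
Python exposes those byte order markers as constants in the standard
library's codecs module, so you can check their sizes directly (a small
illustrative sketch):

    >>> import codecs
    >>> codecs.BOM_UTF16_LE     # little endian UTF-16 BOM
    b'\xff\xfe'
    >>> len(codecs.BOM_UTF16_LE)
    2
    >>> codecs.BOM_UTF32_LE     # little endian UTF-32 BOM
    b'\xff\xfe\x00\x00'
    >>> len(codecs.BOM_UTF32_LE)
    4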

>Is a definition of "byte" meaning something
>other than 8 bits here?  I vaguely recall reading somewhere that
>"byte" can mean different numbers of bits in different contexts.

There used to be machines with different "word" or "memory cell" sizes, such 
as 6 or 9 bits, and those units were still referred to as bytes. Such 
machines are pretty much defunct, and these days the word "byte" always means 
8 bits unless someone goes out of their way to say otherwise.

You'll find the RFCs talk about "octets" for this very reason: an octet is 
unambiguously a value consisting of 8 bits ("oct" meaning eight).

>And is len() actually counting "bytes" or something else for these encodings?

Just bytes, exactly as you expect.
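
For example, len() on the str counts characters while len() on the encoded
bytes object counts bytes, so the arithmetic from your original post falls
out directly (a minimal worked example, not from the original session):

    >>> s = 'Hello!'
    >>> len(s)                    # characters in the str
    6
    >>> len(s.encode('utf-16'))   # bytes: 2 byte BOM + 6 characters * 2
    14
    >>> 2 + 2 * len(s)
    14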

Cheers,
Cameron Simpson <cs at cskk.id.au> (formerly cs at zip.com.au)

