[Tutor] How does len() compute the length of a string in UTF-8, 16, and 32?

boB Stepp robertvstepp at gmail.com
Tue Aug 8 23:53:53 EDT 2017


On Tue, Aug 8, 2017 at 10:17 PM, boB Stepp <robertvstepp at gmail.com> wrote:
> On Mon, Aug 7, 2017 at 10:01 PM, Ben Finney <ben+python at benfinney.id.au> wrote:
>> boB Stepp <robertvstepp at gmail.com> writes:
>>
>>> How is len() getting these values?
>>

>
> It is translating the Unicode code points into bytes according to the
> specified encoding.  I know this.  I was reading some examples from a
> book and it was demonstrating the different lengths resulting from
> encoding into UTF-8, 16 and 32.  I was mildly surprised that len()
> even worked on these encoding results.  But for the life of me I can't
> figure out for UTF-16 and 32 how these lengths are determined.  For
> instance just looking at a single character:
>
> py3: h = 'h'
> py3: h16 = h.encode("UTF-16")
> py3: h16
> b'\xff\xfeh\x00'
> py3: len(h16)
> 4

This all makes perfect sense, arithmetic-wise, now that Matt has made
me realize my hex arithmetic was quite deficient!  The leading
"\xff\xfe" is the byte order mark, and the trailing "\x00" is the high
byte of the 16-bit code unit for 'h' (0x0068) stored little-endian
(NOT an EOL char), so the total is 2 + 2 = 4 bytes.
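
For anyone else following along, here is a minimal sanity check of the
byte counts.  (The "-LE" spellings are only there to show the payload
without the BOM; the output below was copied from a little-endian
machine, though the len() results are the same either way.)

py3: h = 'h'
py3: len(h.encode("UTF-8"))    # 1 byte for an ASCII character
1
py3: len(h.encode("UTF-16"))   # 2-byte BOM + one 2-byte code unit
4
py3: h.encode("UTF-16-LE")     # no BOM, just the code unit 0x0068
b'h\x00'
py3: len(h.encode("UTF-32"))   # 4-byte BOM + one 4-byte code unit
8
py3: h.encode("UTF-32-LE")     # no BOM, just the 4-byte code unit
b'h\x00\x00\x00'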


-- 
boB

