[Tutor] How does len() compute length of a string in UTF-8, 16, and 32?

Zachary Ware zachary.ware+pytut at gmail.com
Mon Aug 7 23:04:21 EDT 2017


On Mon, Aug 7, 2017 at 9:44 PM, boB Stepp <robertvstepp at gmail.com> wrote:
> py3: s = 'Hello!'
> py3: len(s.encode("UTF-8"))
> 6
> py3: len(s.encode("UTF-16"))
> 14
> py3: len(s.encode("UTF-32"))
> 28
>
> How is len() getting these values?  And I am sure it will turn out not
> to be a coincidence that 2 * (6 + 1) = 14 and 4 * (6 + 1) = 28.  Hmm

First, make sure you know exactly what is having its length checked.
In each of those cases you're not checking the length of the string
`s`; you're checking the length of the bytes object that
`s.encode(...)` returns, so len() is counting bytes, not characters
(try each of those lines without the 'len' part).
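
As a rough sketch of what that experiment shows (assuming CPython 3;
the variable names here are only illustrative), the "UTF-16" and
"UTF-32" codecs put a byte order mark (BOM) in front of the encoded
text, which would account for the "+ 1" in the arithmetic above:

```python
import codecs

s = "Hello!"

# len() on a str counts code points; len() on bytes counts bytes.
utf8 = s.encode("utf-8")
utf16 = s.encode("utf-16")
utf32 = s.encode("utf-32")

# The generic "utf-16" / "utf-32" codecs prepend a BOM in the
# platform's native byte order: 2 bytes for UTF-16, 4 for UTF-32.
has_bom16 = utf16.startswith(codecs.BOM_UTF16)
has_bom32 = utf32.startswith(codecs.BOM_UTF32)

# Naming the byte order explicitly ("-le" or "-be") leaves the BOM out.
utf16_le = s.encode("utf-16-le")
utf32_le = s.encode("utf-32-le")

print(len(s))                      # 6 code points
print(len(utf8))                   # 6  = 1 byte  * 6
print(len(utf16), len(utf16_le))   # 14 = 2 + 2 * 6, and 12 without BOM
print(len(utf32), len(utf32_le))   # 28 = 4 + 4 * 6, and 24 without BOM
print(has_bom16, has_bom32)        # True True
```

The one-byte-per-character result for UTF-8 holds only because every
character in 'Hello!' is ASCII; other characters take two to four
bytes each in UTF-8.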

Next, take a dive into the wonderful* world of Unicode:

https://nedbatchelder.com/text/unipain.html
https://www.youtube.com/watch?v=7m5JA3XaZ4k

Hope this helps,
-- 
Zach

