[Tutor] How does len() compute length of a string in UTF-8, 16, and 32?

Cameron Simpson cs at cskk.id.au
Thu Aug 10 22:34:06 EDT 2017


On 10Aug2017 20:40, boB Stepp <robertvstepp at gmail.com> wrote:
>> (By the way, it is nearly 14 years later, and PHP still believes that
>> the world is ASCII.)
>
>I thought you must surely be engaging in hyperbole, but at
>http://php.net/manual/en/xml.encoding.php I found:
>
>"The default source encoding used by PHP is ISO-8859-1."

This is, in some ways, much the same as Python 2's situation: a PHP string or a 
Python 2 str is effectively just an array of bytes, treated as a string-like 
lexical thing.

If you're working only in ASCII or _universally_ in some fixed 8-bit character 
set (e.g. ISO-8859-1 in Western Europe) you mostly get by if you don't look 
closely. PHP's "default source encoding" means that the various _character_ 
based routines in PHP (things that know about characters as letters, punctuation 
etc.) treat these strings as using ISO-8859-1 encoding. You can load UTF-8 into 
these strings and work that way too (there's a PHP global setting for the 
encoding).

Python 2 has a "unicode" type for proper Unicode strings.
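
For example, under Python 2 (if you still have it around) you can see the two 
types side by side; the two bytes below are just the UTF-8 encoding of 'é':

    # Python 2 only: str is bytes, unicode is text
    s = '\xc3\xa9'          # two bytes: the UTF-8 encoding of 'é'
    print len(s)            # 2 - it counts bytes, not characters
    u = s.decode('utf-8')   # decode the bytes into a unicode object
    print len(u)            # 1 - a single code point, U+00E9
    print type(s), type(u)  # <type 'str'> <type 'unicode'>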

In Python 3 str is Unicode text, and you use bytes for bytes. It is hugely 
better, because you don't need to concern yourself with what text encoding a 
str uses - it doesn't have one - it is Unicode. You only need to care about 
encodings when reading and writing data.
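
A small Python 3 example to make that concrete (the string is just made up for 
illustration):

    # Python 3: str is Unicode text, bytes is raw data
    s = 'héllo'
    print(len(s))                  # 5 - five code points
    b = s.encode('utf-8')          # encode only when you need bytes
    print(len(b))                  # 6 - 'é' takes two bytes in UTF-8
    print(b.decode('utf-8') == s)  # True - decoding gets the text back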

>> So long as your editor knows to save the file in UTF-8, it will Just
>> Work.
>
>So Python 3's default behavior for strings is to store them as UTF-8
>encodings in both RAM and files?

Not quite.

In memory, Python 3 strings are sequences of Unicode code points. The CPython 
internals pick an 8-, 16- or 32-bit storage mode for these based on the highest 
code point value in the string, as a space optimisation, but that is concealed 
at the language level. UTF-8 as a storage format would be nearly as compact, 
but has the disadvantage that you can't directly index the string (i.e. go to 
character "n") because UTF-8 uses a variable-length encoding for the various 
code points.
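
You can glimpse that optimisation indirectly with sys.getsizeof(); the exact 
sizes are CPython implementation details and will vary between versions, but 
len() always counts code points:

    import sys

    ascii_s = 'a' * 10             # code points < 256: 1 byte each internally
    wide_s = '\U0001F600' * 10     # emoji, code points > 0xFFFF: 4 bytes each
    print(len(ascii_s), len(wide_s))   # 10 10 - same length either way
    print(sys.getsizeof(ascii_s) < sys.getsizeof(wide_s))   # True on CPython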

In files, however, an encoding comes into play. The default encoding for text 
files is your locale's preferred encoding, which on most modern systems is 
'utf-8': Python will decode the file's bytes as UTF-8 when reading and will 
encode your str back into UTF-8 bytes when writing.

If you open a file in "binary" mode there's no encoding: you get bytes. But if 
you open it in text mode (no "b" in the open mode string) you get text, and you 
can specify the character encoding with the optional encoding= parameter to the 
open() call.
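
For example (the filename here is just a placeholder):

    # text mode: Python decodes/encodes for you using the given encoding
    with open('example.txt', 'w', encoding='utf-8') as f:
        f.write('héllo\n')

    with open('example.txt', 'r', encoding='utf-8') as f:
        print(f.read())     # a str: 'héllo\n'

    # binary mode: no encoding parameter, you get the raw bytes
    with open('example.txt', 'rb') as f:
        print(f.read())     # b'h\xc3\xa9llo\n'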

>No funny business anywhere?  Except
>perhaps in my Windows 7 cmd.exe and PowerShell, but that's not
>Python's fault.  Which makes me wonder, what is my editor's default
>encoding/decoding?  I will have to investigate!

On most UNIX platforms, most situations expect and use UTF-8. There are some 
complications because this needn't be the case, but most modern environments 
provide UTF-8 by default.
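
If you're curious what your own environment reports, you can ask Python 
directly:

    import locale
    import sys

    print(locale.getpreferredencoding())   # the locale encoding; open() uses
                                            # this by default in text mode
    print(sys.getfilesystemencoding())      # used for file names
    print(sys.stdout.encoding)              # used when printing to the terminal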

The situation in Windows is more complex for historic reasons. I believe Eryk 
Sun is the go-to guy for precise technical descriptions of the Windows 
situation. I'm not a Windows guy, but I gather modern Windows generally gives 
you a pretty clean UTF-8 environment in most situations.

Cheers,
Cameron Simpson <cs at cskk.id.au> (formerly cs at zip.com.au)

