[Tutor] How does len() compute length of a string in UTF-8, 16, and 32?
Cameron Simpson
cs at cskk.id.au
Thu Aug 10 22:34:06 EDT 2017
On 10Aug2017 20:40, boB Stepp <robertvstepp at gmail.com> wrote:
>> (By the way, it is nearly 14 years later, and PHP still believes that
>> the world is ASCII.)
>
>I thought you must surely be engaging in hyperbole, but at
>http://php.net/manual/en/xml.encoding.php I found:
>
>"The default source encoding used by PHP is ISO-8859-1."
This kind of amounts to Python 2's situation in some ways: a PHP string or
Python 2 str is effectively just an array of bytes, treated like a lexical
stringy thing.
If you're working only in ASCII or _universally_ in some fixed 8-bit character
set (eg ISO8859-1 in Western Europe) you mostly get by if you don't look
closely. PHP's "default source encoding" means that the various _character_
based routines in PHP (things that know about characters as letters, punctuation
etc) treat these strings as using ISO8859-1 encoding. You can load UTF-8 into
these strings and work that way too (there's a PHP global setting for the
encoding).
Python 2 has a "unicode" type for proper Unicode strings.
In Python 3 str is Unicode text, and you use bytes for bytes. It is hugely
better, because you don't need to concern yourself about what text encoding a
str is - it doesn't have one - it is Unicode. You only need to care when
reading and writing data.
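To make the str/bytes split concrete, and to come back to the len() question
in the subject line, here is a small sketch. The sample string is just an
illustration I picked; the "-le" codec names are used so that no byte order
mark is added to the encoded bytes.

```python
# One text string: 'a', e-acute, the euro sign, and an emoji outside the BMP.
text = "a\u00e9\u20ac\U0001F600"

# len() on a str counts code points; no encoding is involved.
n_code_points = len(text)  # 4

# Encoding turns text into bytes; len() on bytes counts bytes.
utf8 = text.encode("utf-8")       # 1 + 2 + 3 + 4 bytes
utf16 = text.encode("utf-16-le")  # 2 + 2 + 2 + 4 bytes (surrogate pair)
utf32 = text.encode("utf-32-le")  # 4 bytes for every code point

byte_lengths = (len(utf8), len(utf16), len(utf32))  # (10, 10, 16)

# Decoding turns bytes back into text.
round_trip = utf8.decode("utf-8")

print(n_code_points, byte_lengths, round_trip == text)
```

So len() never "computes the length in UTF-8": it counts code points for a
str, and it counts bytes only once you have encoded to bytes.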
>> So long as your editor knows to save the file in UTF-8, it will Just
>> Work.
>
>So Python 3's default behavior for strings is to store them as UTF-8
>encodings in both RAM and files?
Not quite.
In memory Python 3 strings are sequences of Unicode code points. The CPython
internals pick an 8 or 16 or 32 bit storage mode for these based on the highest
code point value in the string as a space optimisation decision, but that is
concealed at the language level. UTF-8 as a storage format is nearly as
compact, but has the disadvantage that you can't directly index the string
(i.e. go to character "n") because UTF-8 uses variable length encodings for the
various code points.
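A rough way to see both points, assuming CPython 3.3 or later (the exact
numbers from sys.getsizeof() are an implementation detail and will vary, so
only compare them with each other):

```python
import sys

# Three strings of the same length (1000 code points each).
ascii_s = "a" * 1000            # highest code point fits in 8 bits
bmp_s = "\u20ac" * 1000         # euro sign needs 16 bits
astral_s = "\U0001F600" * 1000  # emoji needs 32 bits

# On CPython the memory used grows with the widest code point present.
sizes = [sys.getsizeof(s) for s in (ascii_s, bmp_s, astral_s)]
print(sizes)

# Indexing a str goes straight to code point "n"...
mixed = "a\u20acb"
third_char = mixed[2]  # 'b'

# ...but indexing the UTF-8 bytes lands in the middle of the euro sign,
# which occupies bytes 1 to 3 of the encoded form.
third_byte = mixed.encode("utf-8")[2]  # 0x82, a continuation byte
print(third_char, hex(third_byte))
```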
In files however, the default encoding for text files is usually 'utf-8' (it is
taken from your locale settings, so it can differ, notably on Windows): Python
will read the file's bytes as UTF-8 data and will write Python string characters
in UTF-8 encoding when writing.
If you open a file in "binary" mode there's no encoding: you get bytes. But if
you open in text mode (no "b" in the open mode string) you get text, and you
can define the character encoding used as an optional parameter to the open()
function call.
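A minimal sketch of the difference, using a throwaway temporary file (the
file name and sample text are made up for the example):

```python
import os
import tempfile

text = "caf\u00e9"  # 4 code points; the last one is e-acute

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.txt")

    # Text mode, with the encoding stated explicitly.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode: no decoding, you get the raw bytes.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode with the right encoding gives the original text back.
    with open(path, encoding="utf-8") as f:
        as_utf8 = f.read()

    # Text mode with the wrong encoding "works" but gives mojibake.
    with open(path, encoding="latin-1") as f:
        as_latin1 = f.read()

print(raw, len(raw))              # 5 bytes
print(as_utf8, len(as_utf8))      # 4 code points
print(as_latin1, len(as_latin1))  # 5 code points, wrongly decoded
```

Naming the encoding in every open() call avoids depending on whatever the
platform default happens to be.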
>No funny business anywhere? Except
>perhaps in my Windows 7 cmd.exe and PowerShell, but that's not
>Python's fault. Which makes me wonder, what is my editor's default
>encoding/decoding? I will have to investigate!
On most UNIX platforms most situations expect and use UTF-8. There are some
complications because this needn't be the case, but most modern environments
provide UTF-8 by default.
The situation in Windows is more complex for historic reasons. I believe Eryk
Sun is the go-to guy for precise technical descriptions of the Windows
situation. I'm not a Windows guy, but I gather modern Windows generally gives
you a pretty clean UTF-8 environment in most situations.
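If you want to see what Python itself thinks your environment is using, a
quick check along these lines should do. This only reports Python's view; it
says nothing about what your editor does, and the labels in the dict are just
my own descriptions.

```python
import locale
import sys

report = {
    # Used by open() in text mode when no encoding= is given.
    "open() default": locale.getpreferredencoding(False),
    # Used when printing to the terminal (may be None if stdout is replaced).
    "stdout": getattr(sys.stdout, "encoding", None),
    # Used to convert file names to and from bytes.
    "filesystem": sys.getfilesystemencoding(),
    # Used by str.encode() and bytes.decode() with no argument.
    "str.encode() default": sys.getdefaultencoding(),
}

for name, encoding in report.items():
    print(f"{name}: {encoding}")
```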
Cheers,
Cameron Simpson <cs at cskk.id.au> (formerly cs at zip.com.au)