PEP 393 vs UTF-8 Everywhere

Chris Angelico rosuav at
Sat Jan 21 11:18:03 EST 2017

On Sun, Jan 22, 2017 at 2:56 AM, Jussi Piitulainen
<jussi.piitulainen at> wrote:
> Steve D'Aprano writes:
> [snip]
>> You could avoid that error by increasing the offset by the right
>> amount:
>> stuff = text[offset + len("ф".encode('utf-8')):]
>> which is awful. I believe that's what Go and Julia expect you to do.
> Julia provides a method to get the next index.
> let text = "ἐπὶ οἴνοπα πόντον", offset = 1
>     while offset <= endof(text)
>         print(text[offset], ".")
>         offset = nextind(text, offset)
>     end
>     println()
> end # prints: ἐ.π.ὶ. .ο.ἴ.ν.ο.π.α. .π.ό.ν.τ.ο.ν.

This implies that regular iteration isn't good enough, though.

Here's a function that creates a numbered list:

def print_list(items):
    width = len(str(len(items)))
    for idx, item in enumerate(items, 1):
        print("%*d: %s" % (width, idx, item))

In Python, this will happily accept anything that is iterable and has
a known length. Could be a list or tuple, obviously, but can also just
as easily be a dict view (keys or items), a range object, or... a
string. It's perfectly acceptable to enumerate the characters of a
string. And enumerate() itself is implemented entirely generically. If
you have to call nextind() to get the next character, you've made it
impossible to do any kind of generic operation on the text. You can't
do a windowed view by slicing while iterating, you can't have a "lag"
or "lead" value, you can't do any of those kinds of simple and obvious
index-based operations.
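
A rough sketch of what those index-based operations look like on a
Python 3.3+ string. The helper names (numbered_lines, with_lag,
windows) are made up for illustration; numbered_lines is just
print_list() returning its lines instead of printing them.

```python
def numbered_lines(items):
    """Like print_list(), but return the formatted lines.

    Needs only len() and iteration, so a str, list, range or dict
    view all work.
    """
    width = len(str(len(items)))
    return ["%*d: %s" % (width, idx, item)
            for idx, item in enumerate(items, 1)]


def with_lag(seq):
    """Pair each element with the one before it (None for the first).

    Relies on seq[i - 1] being a cheap, meaningful operation.
    """
    return [(seq[i - 1] if i else None, seq[i]) for i in range(len(seq))]


def windows(seq, size):
    """Return every contiguous slice of the given size."""
    return [seq[i:i + size] for i in range(len(seq) - size + 1)]
```

with_lag() and windows() need integer indexing and slicing, so they
accept a str, list or tuple without caring which one they got. With a
nextind()-style API, each of them would need a string-specific
version.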

Oh, and Python 3.3 wasn't the first programming language to use this
flexible string representation. Pike introduced an extremely similar
string representation back in 1998:

So yes, UTF-8 has its advantages. But it also has its costs, and for a
text processing language like Pike or Python, they significantly
outweigh the benefits.
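
To make the cost concrete, here is a small sketch that uses Python's
bytes type as a stand-in for a language whose strings are indexed by
UTF-8 byte offset. The function names are hypothetical; the point is
only the difference between "offset + 1" and "offset + encoded length".

```python
def rest_after_str(text, ch):
    """Code point indexing: skip the match by its length in characters."""
    offset = text.index(ch)
    return text[offset + len(ch):]


def rest_after_utf8(data, ch):
    """Byte indexing done right: skip the match by its encoded length."""
    needle = ch.encode("utf-8")
    offset = data.index(needle)
    return data[offset + len(needle):]


def naive_rest_after_utf8(data, ch):
    """Byte indexing done wrong: assume one character is one byte."""
    offset = data.index(ch.encode("utf-8"))
    return data[offset + 1:]
```

For "ф" (U+0444, two bytes in UTF-8), the naive version returns bytes
that start in the middle of the character, and decoding them as UTF-8
raises UnicodeDecodeError. The str version cannot hit that failure,
because every index is a character boundary.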

