Cult-like behaviour [was Re: Kindness]

Chris Angelico rosuav at
Sun Jul 15 09:18:41 EDT 2018

On Sun, Jul 15, 2018 at 9:17 PM, Marko Rauhamaa <marko at> wrote:
> Steven D'Aprano <steve+comp.lang.python at>:
>> - to have a language where text strings are a second-class
>>   data type, not available in the language itself, only in
>>   the libraries;
> Unicode code point strings *ought* to be a second-class data type. They
> were a valiant idea but in the end turned out to be a mistake.

So let's see. Suppose I go to a web site that asks me to type in a
title; I enter something and hit "Save". That title goes through
JavaScript, gets sent to the back-end API via AJAX and JSON, received
by a Python web app, and saved into a database. Later, it gets
retrieved from that database and displayed to me on the same web page,
where I click on it, and it gets put into an input field. I then
submit it using standard form fill-out (no JS), it is received by the
web app, and then gets sent to the API to become my stream
title. I look at my stream, and the title is the exact string that I
entered originally.

During this time, I consider that string to be text. Always text. But
if Unicode strings are second-class data, then that title changed from
being text (in the input box) to UTF-16 (in JS) to UTF-8 (in JSON) to
UTF-32 (in Python) to UTF-8 (in the database) to UTF-32 (retrieved
into Python later) to ASCII with "\uXXXX" escapes (being sent to the
web page) to text (in the input box). Then it gets converted to
URL-encoded UTF-8 (form submission), then UTF-8, and UTF-32 (retrieval
in Python), then UTF-8 (Twitch API), and finally back to text
(displayed on the screen).

Remind me how it's such a mistake to treat that string as text all the
way through?
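
A rough sketch of that round trip, assuming a made-up title and only the
standard library (the real pipeline involves a browser and a database,
which are not modelled here). Each hop uses a different representation,
and each one decodes back to the same text:

```python
import json
from urllib.parse import quote, unquote

# Hypothetical title; any text with non-ASCII characters would do.
title = "Caf\u00e9 \u2615 stream"

as_utf16 = title.encode("utf-16-le")  # roughly what a JS engine holds
as_json = json.dumps(title)           # ASCII with \uXXXX escapes by default
as_utf8 = title.encode("utf-8")       # a typical database or wire encoding
as_form = quote(title)                # percent-encoded UTF-8, as in a URL

# Every representation decodes back to the identical text string.
round_trips = [
    as_utf16.decode("utf-16-le"),
    json.loads(as_json),
    as_utf8.decode("utf-8"),
    unquote(as_form),
]
print(all(t == title for t in round_trips))
```

The application code only ever needs to think about `title`; the
encodings are details of each boundary.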

>> - to have a language where text characters are *literally*
>>   32-bit integers ("rune" is an alias to int32);
>>   (you can multiply a linefeed by a grave accent and get pi)
> Again, that has barely anything to do with the topic at hand. I don't
> think there's any unproblematic way to capture a true text character,
> period. Python3 certainly hasn't been able to capture it.

Python's Unicode type is an accurate representation of a Unicode text
string, just as Python's float type is an accurate representation of
IEEE 754 floating-point. Just as floats are not reals, so too is
Unicode not perfectly able to represent all human text, and has to
mess around with things like combining characters. It's not 100%
perfect (
point #11), but it's about as close as you'll ever get inside a

>>> That's the ten-billion-dollar question, isn't it?!
>> No. The real ten billion dollar question is how people in 2018 can
>> stick their head in the sand and take seriously the position that
>> Latin-1 (let alone ASCII) is enough for text strings.
> Here's the deal: text strings are irrelevant for most modern programming
> needs. Most software is middleware between the human and the terminal
> device. Carrying opaque octet strings from end to end is often the most
> correct and least problematic thing to do.

Uhh, so the human uses byte/octet strings? You can argue that the
terminal device is fundamentally byte-oriented, but if you do, I'm
going to dispute the use of the definite article, and say that *many*
terminal devices are byte-oriented as of today. There's no fundamental
reason for that to remain the case, and even today, we have
fundamentally text-oriented terminal devices. I know this because I
maintain one (okay, it's called a "MUD client" rather than a "terminal
device", but it's basically the same thing).

> On the other hand, Python3's code point strings mess things up for no
> added value. You still can't upcase or downcase strings.

Not entirely sure what the .upper() and .lower() methods do, then.
Case conversion of arbitrary text strings is hard, but Python
definitely gives you as good as you'll ever get without actually
stipulating, not just the language, but the context.
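
For illustration, a few calls to the str methods in question. The sample
words are my own picks, and the last line shows the kind of case where
context matters and Python gives the language-neutral answer:

```python
# Language-neutral case conversion on str.
upper_eszett = "stra\u00dfe".upper()  # German sharp s expands to "SS"
lower_greek = "\u0391\u0392".lower()  # Greek capital alpha, beta

# casefold() is intended for caseless comparison rather than display.
same_caseless = "Stra\u00dfe".casefold() == "STRASSE".casefold()

# Turkish would want a dotted capital I here; without knowing the
# language, Python returns the plain ASCII "I".
upper_i = "i".upper()

print(upper_eszett, lower_greek, same_caseless, upper_i)
```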

> You still can't sort strings.

Strings are intrinsically totally ordered in a mostly-sane way. If you
want anything more than that, you have to stipulate the language.
Python offers this in the 'locale' module, with strcoll and strxfrm.
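
A minimal sketch of both orderings. The helper name sort_for_locale is
hypothetical, and which locale names are available depends entirely on
the system, so treat the locale half as an assumption about your setup:

```python
import locale

words = ["zebra", "Apple", "\u00e9clair", "apple"]

# Default ordering: strings compare code point by code point.
by_code_point = sorted(words)
print(by_code_point)  # ['Apple', 'apple', 'zebra', 'éclair']


def sort_for_locale(items, name=""):
    """Sort by the collation rules of the named locale (hypothetical helper).

    name="" asks for the user's default locale; the previous LC_COLLATE
    setting is restored afterwards.
    """
    old = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        return sorted(items, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, old)
```

Something like sort_for_locale(words, "de_DE.UTF-8") would then sort by
German rules, provided that locale is installed.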

> You still can't perform random access on strings.

Say what?

> You still don't know how long your string is.

How long is a piece of string?

1) Do you count code points? len(x)
2) Do you count code units? len(x.encode("..."))
3) Do you count base characters, ignoring combining characters?
4) Do you count pixels of display width?
5) Do you count advancement (like pixels, but negative for RTL text)?

Two of them are easy. Two require font metrics (so they're the job of
a display engine). Only #3 is moderately hard, and you could do that
with a one-liner by checking the Unicode categories. But it isn't very
useful except to "prove" that Python sucks.
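
For the record, a sketch of the first three counts on a sample string of
my choosing ("café" spelled with a combining accent). The third uses the
Unicode general category, treating every "M" (mark) category as
combining, which is a simplification:

```python
import unicodedata

s = "cafe\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT

code_points = len(s)                           # 1) code points
utf8_units = len(s.encode("utf-8"))            # 2) UTF-8 code units (bytes)
utf16_units = len(s.encode("utf-16-le")) // 2  # 2) UTF-16 code units

# 3) base characters: skip anything whose category is Mn, Mc or Me
base_chars = sum(
    1 for ch in s if not unicodedata.category(ch).startswith("M")
)

print(code_points, utf8_units, utf16_units, base_chars)  # 5 6 5 4
```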

> You still don't know where you can break a string safely.

Impossible without language-based and font-based information. For
instance, in the string "python", you cannot break the string between
the "t" and the "h", because they are parts of one phonogram.
Splitting the string "اطلقي سرك" anywhere other than at the space will
result in the two halves displaying differently from the combined
whole, because of the way Arabic text is written. Python lets you
split the string between any two code points, a massive step up from
exposing UTF-8 or UTF-16 code units, so that's about as safe as it gets.
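
To make the code point versus code unit difference concrete, a small
sketch with an example word of my own. Slicing a str always yields valid
strings; slicing the encoded bytes can cut a character in half:

```python
s = "na\u00efve"  # U+00EF takes two bytes in UTF-8

# Splitting between code points: both halves are valid strings.
left, right = s[:3], s[3:]
rejoined = left + right

# Splitting between UTF-8 code units at the same index lands in the
# middle of the two-byte sequence.
data = s.encode("utf-8")
try:
    data[:3].decode("utf-8")
    byte_split_ok = True
except UnicodeDecodeError:
    byte_split_ok = False

print(rejoined == s, byte_split_ok)  # True False
```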

> You still don't know how to normalize a string.

You mean unicodedata.normalize? Yeah, you're right, I don't know how
to do it. I can never remember whether it's normalize(str, "NFC") or
normalize("NFC", str).
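
For anyone else who forgets: the form comes first. A quick sketch with a
sample character of my choosing:

```python
import unicodedata

decomposed = "e\u0301"  # "e" plus combining acute accent

# Signature is unicodedata.normalize(form, unistr).
composed = unicodedata.normalize("NFC", decomposed)
back_again = unicodedata.normalize("NFD", composed)

print(len(decomposed), len(composed), len(back_again))  # 2 1 2
```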

> You still don't know if two strings are equal or not.

Do an NFD or NFKD normalization on both strings, then compare.
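
As a sketch, with same_text being a hypothetical helper name rather than
anything in the standard library. Which form you pick decides whether
compatibility characters such as ligatures count as equal:

```python
import unicodedata


def same_text(a, b, form="NFD"):
    """Compare two strings after normalizing both (hypothetical helper)."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


raw_equal = "\u00e9" == "e\u0301"                      # different code points
canonical_equal = same_text("\u00e9", "e\u0301")       # same text under NFD
ligature_nfd = same_text("\ufb01", "fi")               # ligature kept by NFD
ligature_nfkd = same_text("\ufb01", "fi", form="NFKD") # folded by NFKD

print(raw_equal, canonical_equal, ligature_nfd, ligature_nfkd)
```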

> You still don't know how to concatenate strings.

Uhh.... s1 + s2?

I'm fairly sure you have no clue about Unicode or Python, but I'll
give you the benefit of the doubt and assume you're merely trolling.

