Unicode [was Re: Cult-like behaviour]
steve+comp.lang.python at pearwood.info
Mon Jul 16 06:26:23 EDT 2018
On Sun, 15 Jul 2018 17:39:55 -0700, Jim Lee wrote:
> On 07/15/18 17:18, Steven D'Aprano wrote:
>> On Sun, 15 Jul 2018 16:08:15 -0700, Jim Lee wrote:
>>> Python3 is intrinsically tied to Unicode for string handling.
>>> Therefore, the Python programmer is forced to deal with it (in all but
>>> trivial cases), rather than given a choice. So I don't understand how
>>> I can illustrate my point with Python code since Python won't let me
>>> deal with strings without also dealing with Unicode.
>> b"Look ma, a Python 2 style ASCII string."
> As I said, all but trivial cases.
> Do you consider separating Unicode strings from byte strings, having to
> decode and encode from one to the other,
If you use nothing but byte strings, you don't need to separate the non-
existent text strings from the byte strings, nor do you need to decode or
encode.
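To illustrate, here is a minimal sketch of bytes-only processing in Python 3. The sample data and variable names are made up for the example; the point is only that no str object ever appears.

```python
# Hypothetical input: a line of comma-separated ASCII data, kept as bytes.
line = b"spam,eggs,cheese\n"

# bytes objects support most of the familiar string methods directly.
fields = line.strip().split(b",")
shouted = [field.upper() for field in fields]
joined = b";".join(shouted)

# Nothing was decoded or encoded along the way.
print(joined)
```

This only holds while every argument is bytes: mixing in a str literal such as "," would raise a TypeError.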
> and knowing which
> functions/methods accept one, the other, or both as arguments,
That's certainly a real complication, if I may stretch the meaning of the
word "complication" beyond breaking point. Surely you are already having
to read the documentation of the function to learn what arguments it
takes, and what types they are (int or float, list or iterator, 'r' or
'a', etc). If someone can't deal with the question of "unicode or bytes"
as well, then perhaps they ought to consider a career change to something
less demanding, like politics.
If, as you insinuate, all your data is 100% ASCII, then you have nothing
to fear. Just treat decoding from and encoding to ASCII
as the equivalent of a cast or coercion, and you won't go wrong. (Of
course, in 2018, the number of applications that can truly say all their
data is pure ASCII is vanishingly small.)
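As a rough sketch of the "cast" idea, assuming the data really is pure ASCII (the sample byte strings are invented for illustration):

```python
# Assumption: this data is known to be 100% ASCII.
raw = b"Look ma, plain ASCII"

text = raw.decode("ascii")    # bytes -> str, the "cast" in one direction
back = text.encode("ascii")   # str -> bytes, the "cast" in the other
round_trip_ok = (back == raw)

# If the assumption is wrong, the codec fails loudly instead of guessing.
failed_loudly = False
try:
    b"caf\xc3\xa9".decode("ascii")   # contains non-ASCII bytes
except UnicodeDecodeError:
    failed_loudly = True

print(round_trip_ok, failed_loudly)
```

The strict ASCII codec is the useful part here: data that violates the assumption raises an exception rather than silently turning into garbage.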
Or use Latin-1, if you want to do the most simple-minded thing that you
can to make errors go away, without caring about correctness.
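A small sketch of why Latin-1 makes the errors go away, and why that is not the same as being correct. The sample word is an assumption chosen for the example.

```python
# Latin-1 maps each of the 256 byte values to a code point, so decoding
# never raises and the round trip is lossless.
every_byte = bytes(range(256))
as_text = every_byte.decode("latin-1")
lossless = (as_text.encode("latin-1") == every_byte)

# But "no error" is not "right answer": if the bytes were really UTF-8,
# the decoded text is wrong, and nothing warns you about it.
utf8_bytes = "caf\u00e9".encode("utf-8")   # the word "cafe" with e-acute
wrong = utf8_bytes.decode("latin-1")       # two junk characters replace the e-acute
right = utf8_bytes.decode("utf-8")

print(lossless, len(right), len(wrong))
```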
But the thing is, that complexity is *inherent in the domain*. You can
try to deal with it without Unicode, and as soon as you have users
expecting to use more than one code page, you're doomed.
> as "not dealing with Unicode"? I don't.
Frankly, I do.
Dealing with all the vagaries of human text *is* complicated, that's the
nature of the beast. Dealing with the complexities of Unicode can be as
complex as dealing with the complexities of floating point arithmetic.
(But neither of those is even in the same ballpark as dealing with the
complexities of *not* using Unicode: legacy code pages and encodings are
a nightmare to deal with.)
Nevertheless, just as casual users can go a very, very long way just
treating floats as the real numbers we learn about in school, trusting
that IEEE-754 semantics will keep their answers "close enough", so the
casual user can go a very long way ignoring the complexities of Unicode,
so long as they control their own data and know what it is.
If you don't know what your data is, then you're doomed, Unicode or no
Unicode. (If you don't think that's a problem, if you think that "just
treat text as octets" works, then people like you are the reason there is
so much mojibake in the world, screwing it up for the rest of us.)
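To make that concrete, here is a minimal sketch of what "just octets" hides. The choice of byte and of code pages is arbitrary, picked only for illustration; all three codecs ship with CPython.

```python
# One octet, three legacy code pages, three different characters.
octet = b"\xe9"
readings = {
    codec: octet.decode(codec)
    for codec in ("latin-1", "cp437", "koi8_r")
}

# Without knowing which encoding the sender used, there is no way to
# tell which reading is the intended one.
distinct_readings = len(set(readings.values()))

for codec, char in readings.items():
    print(codec, hex(ord(char)))
```

Guessing wrong among these is precisely how mojibake gets made.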
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson