[Python-ideas] Processing surrogates in

Chris Barker chris.barker at noaa.gov
Thu May 7 20:32:31 CEST 2015


My not-an-expert thoughts on these issues:

[NOTE: nested comments, so attribution may be totally confused]

> Why, oh why, do things have to be SO FU*****G COMPLICATED?

Two reasons:

1) human languages are complicated, and they all have their idiosyncrasies
-- some are inherently better suited to machine interpretation, but the
real killer is that we want to use multiple languages with one system --
that IS inherently very complicated.

2) legacy decisions and backward compatibility -- this is what makes it
impossible to "simply" come up with a single best way to do it (or a few
ways, anyway...)

> Surely 65536 (2-byte) encodings are enough to express all characters in
> all the languages in the world, plus all the special characters we need.

That was once thought true -- but it turns out it's not -- darn!

Though we do think that 4 bytes is plenty, and to some extent I'm confused
as to why there isn't more use of UCS-4 -- sure it wastes a lot of space,
but everything in computers (memory, cache, disk space, bandwidth) is orders
of magnitude larger/faster than it was when the Unicode discussion got
started. But people don't like inefficiency and, in fact, as the newer py3
Unicode objects show, we don't need to compromise on that.
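(A quick illustration of what I mean by the newer py3 objects: CPython's
strings, per PEP 393, pick 1, 2, or 4 bytes per character based on the
widest code point present. Exact byte counts are implementation details --
only the ordering matters here.)

```python
# CPython (PEP 393): a str is stored with 1, 2, or 4 bytes per character,
# depending on the widest code point it contains -- UCS-4 semantics
# without always paying UCS-4 storage.
import sys

ascii_s  = "a" * 100           # Latin-1 range: 1 byte per char
bmp_s    = "\u20ac" * 100      # Euro sign, in the BMP: 2 bytes per char
astral_s = "\U0001F600" * 100  # emoji, outside the BMP: 4 bytes per char

# All three are 100 characters long, but storage grows with the widest char:
print(len(ascii_s), len(bmp_s), len(astral_s))
print(sys.getsizeof(ascii_s) < sys.getsizeof(bmp_s) < sys.getsizeof(astral_s))
```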

> Or is there really some fundamental reason why things can't be simpler?
> (Like, REALLY, REALLY simple?)


Well, if there were no legacy systems, it still couldn't be REALLY, REALLY
simple (though UCS-4 is close), but there could be a LOT fewer ways to do
things: programming languages would have their own internal representation
(like Python does), and we would have a small handful of encodings
optimized for various things: UCS-4 for ease of use, utf-8 for small disk
storage (at least for Euro-centered text), and that would be that. But we do
have the legacies to deal with.




Apple, Microsoft, Sun, and a few other vendors jumped on the Unicode
> bandwagon early and committed themselves to the idea that 2 bytes is enough
> for everything. When the world discovered that wasn't true, we were stuck
> with a bunch of APIs that insisted on 2 bytes. Apple was able to partly
> make a break with that era, but Windows and Java are completely stuck with
> "Unicode means 16-bit" forever, which is why the whole world is stuck
> dealing with UTF-16 and surrogates forever.
>

I've read many of the rants about UTF-16, but in fact, it's really not any
worse than UTF-8 -- it's kind of a worst of both worlds -- not a set number
of bytes per char, but a lot of wasted space (particularly for euro
languages) -- but other than a bit of wasted space, it's just like UTF-8.

The problem is not UTF-16 itself, but the fact that a really surprising
number of APIs and programmers still think that it's UCS-2, rather than
UTF-16 -- painful. And the fact that, AFAIK, there really is no C++
Unicode type -- at least not one commonly used. Again -- legacy issues.
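(To make the UCS-2 confusion concrete, here's a quick Python sketch --
anything outside the BMP becomes a surrogate pair in UTF-16, so code that
assumes one 16-bit unit per character miscounts:)

```python
# One code point outside the BMP encodes to TWO 16-bit units in UTF-16
# (a surrogate pair) -- code that assumes UCS-2 sees two "characters".
import struct

ch = "\U0001F600"              # a single code point, U+1F600
utf16 = ch.encode("utf-16-le")

print(len(ch))                 # 1 code point
print(len(utf16) // 2)         # 2 sixteen-bit code units

hi, lo = struct.unpack("<2H", utf16)
print(hex(hi), hex(lo))        # 0xd83d 0xde00 -- the high/low surrogates
```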

> And there are still people creating filenames on Latin-1 filesystems on
> older Linux and Unix boxes,

This is the odd one to me -- reading about people's struggles with py3 and
*nix filenames -- they argue that *nix is not broken -- and the world
should just use char* for filenames and all is well! In fact, maybe it
would be easier to handle filenames as char* in some circumstances, but to
argue that a system is not broken when you can't know the encoding of
filenames, and there may be differently encoded filenames ON THE SAME
filesystem, is insane! Of course that is broken! It may be reality, and
maybe Py3 needs to do a bit more to accommodate it, but it is broken.

In fact, as much as I like to bash Windows, I've had NO problems with
assuming filenames in Windows are UTF-16 (as long as we use the "wide char"
APIs, sigh), and OS-X's specification of filenames as utf-8 works fine. So
Linux really needs to catch up here!
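(For the record, py3 does already accommodate the mixed-encoding mess
somewhat, via PEP 383's surrogateescape error handler -- undecodable bytes
get smuggled into the str as lone surrogates and round-trip losslessly.
A sketch, using an explicit codec rather than the real filesystem
encoding:)

```python
# A Latin-1 filename on a nominally UTF-8 filesystem: \xe9 is not valid
# UTF-8, but surrogateescape maps it to a lone surrogate instead of failing.
raw = b"caf\xe9.txt"

name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))             # 'caf\udce9.txt' -- \xe9 became U+DCE9

back = name.encode("utf-8", errors="surrogateescape")
print(back == raw)             # True -- the round trip is lossless
```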

> UTF-16 is a historical accident,

yeah, but it's not really a killer, either -- the problems come when people
assume UTF-16 is UCS-2, just like assuming that utf-8 is ascii (or any
one-byte encoding...)
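(Same failure mode in both directions -- a quick sketch:)

```python
# Assuming one-byte-encoded bytes are UTF-8 (or vice versa) either blows
# up or silently produces mojibake.
latin1_bytes = "café".encode("latin-1")     # b'caf\xe9'

try:
    latin1_bytes.decode("utf-8")            # a lone \xe9 is not valid UTF-8
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)                               # True: it blows up

# The other direction doesn't error -- it just garbles:
mojibake = "café".encode("utf-8").decode("latin-1")
print(mojibake)                             # cafÃ©
```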

> We really do need at least UTF-8 and UTF-32. But that's it. And I think
> that's simple enough.


Is UTF-32 the same as UCS-4? Always a bit confused by that.

Oh, and endian issues -- *sigh*
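(Even a fixed-width encoding doesn't settle the byte order. A Python
sketch -- the plain "utf-32" codec writes a BOM, while the -le/-be
variants commit to an order up front:)

```python
# Fixed width (4 bytes per code point), but the bytes can go in either
# order -- so UTF-32 comes in -le and -be flavors, plus a BOM'd form.
s = "A"

print(s.encode("utf-32-le"))   # b'A\x00\x00\x00'
print(s.encode("utf-32-be"))   # b'\x00\x00\x00A'
print(s.encode("utf-32"))      # 4-byte BOM first, then 4 bytes for 'A'
```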

> Aaaargh!  Do I really have to learn all this mumbo-jumbo?!  (Forgive me.
> :-) )


Some of it yes, I'm afraid so -- but probably not the surrogate pair stuff,
etc. That stuff is pretty esoteric, and really needs to be understood by
people writing APIs -- but for those of us that USE APIs, not so much.

For instance, Python's handling Unicode file names almost always "just
works" (as long as you stay in Python...)


-Chris


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov

