[Python-ideas] Processing surrogates in

Thu May 7 23:04:55 CEST 2015

On May 7, 2015, at 11:32, Chris Barker <chris.barker at noaa.gov> wrote:
> 
> My not-an-expert thoughts on these issues:
> 
> [NOTE: nested comments, so attribution may be totally confused]
> 
>>>     Why, oh why, do things have to be SO FU*****G COMPLICATED?
> two reasons:
> 
> 1) human languages are complicated, and they all have their idiosyncrasies -- some are inherently better suited to machine interpretation, but the real killer is that we want to use multiple languages with one system -- that IS inherently very complicated.
> 
> 2) legacy decisions and backward compatibility -- this is what makes it impossible to "simply" come up with a single best way to do it (or a few ways, anyway...)
>>> Surely 65536 (2-byte) encodings are enough to express all characters in all the languages in the world, plus all the special characters we need.
> That was once thought true -- but it turns out it's not -- darn!
> 
> Though we do think that 4 bytes is plenty, and to some extent I'm confused as to why there isn't more use of UCS-4 -- sure it wastes a lot of space, but everything in computing (memory, cache, disk space, bandwidth) is orders of magnitude larger/faster than it was when the Unicode discussion got started. But people don't like inefficiency and, in fact, as the newer py3 Unicode object shows, we don't need to compromise on that.
> 
>> Or is there really some fundamental reason why things can't be simpler?  (Like, REALLY, REALLY simple?)
> 
> 
> Well, if there were no legacy systems, it still couldn't be REALLY, REALLY simple (though UCS-4 is close), but there could be a LOT fewer ways to do things: programming languages would have their own internal representation (like Python does), and we would have a small handful of encodings optimized for various things: UCS-4 for ease of use, utf-8 for small disk storage (at least of Euro-centered text), and that would be that. But we do have the legacies to deal with.
>  
> 
>> Apple, Microsoft, Sun, and a few other vendors jumped on the Unicode bandwagon early and committed themselves to the idea that 2 bytes is enough for everything. When the world discovered that wasn't true, we were stuck with a bunch of APIs that insisted on 2 bytes. Apple was able to partly make a break with that era, but Windows and Java are completely stuck with "Unicode means 16-bit" forever, which is why the whole world is stuck dealing with UTF-16 and surrogates forever.
> 
> I've read many of the rants about UTF-16, but in fact, it's really not any worse than UTF-8 -- it's kind of a worst of both worlds -- not a set number of bytes per char, but a lot of wasted space (particularly for euro languages), but other than a bit of wasted space, it's just like UTF-8.
> 
> The problem is not UTF-16 itself, but the fact that a really surprising number of APIs and programmers still think that it's UCS-2, rather than UTF-16 -- painful.

But this makes UTF-16 an attractive nuisance. When people use UTF-16, it's not because it happens to save 12% storage or 3% CPU over UTF-8 for some particular corpus, it's because it lets either them or some API they're dealing with pretend Unicode == UCS-2 so they can write buggy code quickly instead of proper code almost as quickly. If we'd never had UCS-2, and invented UTF-16 only now, I don't think anyone would use it; therefore, it would be better if we didn't have it.
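
To make the "pretend Unicode == UCS-2" trap concrete, here's a minimal Python sketch. The helper name utf16_code_units is made up for illustration; it isn't from any real API. It counts 16-bit code units the way UCS-2-minded code does and compares that with the number of code points.

```python
def utf16_code_units(s):
    """Number of 16-bit code units in the UTF-16 encoding of s."""
    return len(s.encode("utf-16-le")) // 2

bmp = "\u00e9"          # U+00E9, inside the Basic Multilingual Plane
astral = "\U0001F600"   # U+1F600, outside the BMP

# For a BMP character the UCS-2 assumption happens to hold:
# one code unit, one code point.
assert utf16_code_units(bmp) == len(bmp) == 1

# A non-BMP character needs a surrogate pair: two code units, one code point.
assert utf16_code_units(astral) == 2
assert len(astral) == 1

# Cutting the encoded bytes after one "character" counted the UCS-2 way
# splits the pair, and the leftover lone surrogate no longer decodes.
half = astral.encode("utf-16-le")[:2]
try:
    half.decode("utf-16-le")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False
```

Code that only ever sees BMP text passes every test and then breaks on the first emoji or rare CJK character, which is exactly why the bug survives so long.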

> And the fact that, AFAIK, there really is no C++ Unicode type -- at least not one commonly used.

I've got no problem with the fact that they defined UTF-8, UTF-16, and UTF-32 types instead of a Unicode type. In a language where strings are just pointers to arrays of characters, what would a Unicode type even mean?

> Again -- legacy issues.

>> And there are still people creating filenames on Latin-1 filesystems on older Linux and Unix boxes,
> 
> This is the odd one to me -- reading about people's struggles with py3 and *nix filenames -- they argue that *nix is not broken -- and the world should just use char* for filenames and all is well! In fact, maybe it would be easier to handle filenames as char* in some circumstances, but to argue that a system is not broken when you can't know the encoding of filenames, and there may be differently encoded filenames ON THE SAME filesystem is insane! Of course that is broken! It may be reality, and maybe Py3 needs to do a bit more to accommodate it, but it is broken.
> 
> In fact, as much as I like to bash Windows, I've had NO problems with assuming filenames in Windows are UTF-16 (as long as we use the "wide char" APIs, sigh), and OS-X's specification of filenames as utf-8 works fine. So Linux really needs to catch up here!

I _almost_ like OS X's approach here. If you've got files on a filesystem that aren't in UTF-8 (or that a filesystem driver can't transparently represent as UTF-8 because it stores some other static, per-fs, or per-file encoding, like NTFS's static UTF-16-LE), you see those files as UTF-8 anyway. That means some are mojibake. And maybe some either aren't accessible at all, or are accessible through names the filesystem invented that mean nothing. Too bad, here are some tools to repair your broken filesystem if that's a problem for you.

The problem is, those tools are only available at way too high a level. If they just put the real bytes for an undecodable filename right in an extra DIRENTRY slot, anyone could easily write tools to help the user fix it that work at the normal filesystem level. ("rename --transcode-from=Latin-1 broken/*" would require adding 11 lines of trivial code to rename.pl, including the lines for processing the flag and dealing with post-transcoding collisions, if that information were available.)
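
For what it's worth, here is roughly what the transcoding step could look like, sketched in Python rather than Perl and operating on raw byte names so it can be tried without a broken filesystem. The function transcode_name and its from_encoding parameter are hypothetical; this is not how rename.pl or any OS X tool actually works, and collision handling is left out.

```python
def transcode_name(raw, from_encoding="latin-1"):
    """Return a UTF-8 version of a raw filename given as bytes.

    Names that already decode as UTF-8 are returned unchanged; anything
    else is assumed to be in from_encoding and is re-encoded as UTF-8.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(from_encoding).encode("utf-8")

good = "caf\u00e9.txt".encode("utf-8")
bad = "caf\u00e9.txt".encode("latin-1")   # b'caf\xe9.txt', not valid UTF-8

fixed = transcode_name(bad)
```

On a POSIX system, Python 3 will give you the raw bytes if you pass a bytes path to os.listdir(), and os.rename() accepts bytes too, so the rest of such a tool is just a loop. The hard part is getting at the original bytes in the first place, which is the point above.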

But Apple doesn't seem to care about making those tools writable at that level, which means there's no chance in hell of GNU or BSD following Apple's lead. So no one's ever going to solve it; we'll just close our eyes and hope that eventually it's as rare a problem as dealing with Atari or EBCDIC source code is today, so we can declare it solved-enough-I-guess.

>> UTF-16 is a historical accident,
> 
> yeah, but it's not really a killer, either -- the problems come when people assume UTF-16 is UCS-2, just like assuming that utf-8 is ascii (or any one-byte encoding...)
> 
>>  We really do need at least UTF-8 and UTF-32. But that's it. And I think that's simple enough.
> 
> is UTF-32 the same as UCS-4 ? Always a bit confused by that.

Technically, UTF-32 is a subset of UCS-4. UCS-4 is an encoding of 31-bit values in 4 octets by leaving the top bit 0. UTF-32 is an encoding of 21-bit values in 32 bits by leaving the top 11 bits 0. So if you're using them to transmit Unicode code points (the only use they're defined for), they're identical.
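
A quick way to check the bit counts from Python, as a sketch. It only exercises Python's own utf-32-be codec, so it shows the UTF-32 side; the UCS-4 side is just the arithmetic in the comments.

```python
MAX_CODE_POINT = 0x10FFFF   # highest Unicode code point

# The largest code point fits in 21 bits ...
assert MAX_CODE_POINT.bit_length() == 21

# ... so in UTF-32 the top 11 bits of every 32-bit unit are zero.
unit = int.from_bytes(chr(MAX_CODE_POINT).encode("utf-32-be"), "big")
assert unit == MAX_CODE_POINT
assert unit >> 21 == 0

# 0x110000 would still fit in UCS-4's 31 bits, but it is not a Unicode
# code point, and Python's UTF-32 decoder rejects it.
beyond = (0x110000).to_bytes(4, "big")
try:
    beyond.decode("utf-32-be")
    accepted = True
except UnicodeDecodeError:
    accepted = False
```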

> Oh, and endian issues -- *sigh*

Yes, big-endian-only is order #13 on my plans if I ever became supreme dictator. (Unless my advisors want to argue about big vs. little; in that case, I give them 4 hours to debate it, then drop them all in the crocodile pit and flip a coin.)
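
In case the endian problem sounds abstract, a small Python sketch of what goes wrong, using only the standard utf-16 codecs and the BOM constants from the codecs module:

```python
import codecs

s = "A"   # U+0041

big = s.encode("utf-16-be")
little = s.encode("utf-16-le")
assert big == b"\x00A"
assert little == b"A\x00"

# Decoding with the wrong byte order does not fail; it silently
# produces a different character (U+4100 instead of U+0041).
assert big.decode("utf-16-le") == "\u4100"

# A byte order mark (U+FEFF) at the start lets the plain "utf-16"
# codec work out the order and strip the mark.
assert (codecs.BOM_UTF16_BE + big).decode("utf-16") == s
assert (codecs.BOM_UTF16_LE + little).decode("utf-16") == s
```

UTF-8 has no byte order to get wrong, which is one more point in its favor.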

>>  Aaaargh!  Do I really have to learn all this mumbo-jumbo?!  (Forgive me. :-) )
> 
> Some of it yes, I'm afraid so -- but probably not the surrogate pair stuff, etc. That stuff is pretty esoteric, and really needs to be understood by people writing APIs -- but for those of us that USE APIs, not so much.
> 
> For instance, Python's handling of Unicode file names almost always "just works" (as long as you stay in Python...)
> 
> 
> -Chris
> 
> 
> -- 
> 
> Christopher Barker, Ph.D.
> Oceanographer
> 
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
> 
> Chris.Barker at noaa.gov