<div dir="ltr">My not-an-expert thoughts on these issues:<div><br></div><div class="gmail_extra">[NOTE: nested comments, so attribution may be totally confused]</div><div class="gmail_extra"><br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="auto"><span><blockquote type="cite"><div>    Why, oh why, do things have to be SO FU*****G COMPLICATED?</div></blockquote></span></div></blockquote><div>two reasons:</div><div><br></div><div>1) human languages are complicated, and they all have their idiosyncrasies -- some are inherently better suited to machine interpretation, but the real killer is that we want to use multiple languages with one system -- that IS inherently very complicated.</div><div><br></div><div>2) legacy decisions an backward compatibility -- this is what makes it impossible to "simply" come up with a single bets way to to do it (or a few  ways, anyway...)</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="auto"><span><blockquote type="cite"><div>Surely 65536 (2-byte) encodings are enough to express all characters

    in all the languages in the world, plus all the special characters

    we need.<br></div></blockquote></span></div></blockquote><div>That was once thought true -- but it turns out it's not -- darn!</div><div><br></div><div>Though we do think that 4 bytes is plenty, and to some extent I'm confused as to why there isn't more use of UCS-4 -- sure it wastes a lot of space, but everything in computer (memory, cache, disk space, bandwidth) is orders of magnitudes larger/faster than it was when the Unicode discussion got started. But people don't like inefficiency and, in fact, as the newer py3 Unicode objects shows, we don't need to compromise on that.</div><div><br></div></div></div><blockquote style="margin:0 0 0 40px;border:none;padding:0px"><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Or is there really some fundamental reason why things can't be simpler?  (Like, REALLY, REALLY simple?)</blockquote></div></div></blockquote><div><br></div>Well, if there were no legacy systems, it still couldn't be REALLY, REALLY simple (though UCS-4 is close), but there could be a LOT fewer ways to do things: programming languages would have their own internal representation (like Python does), and we would have a small handful of encodings optimized for various things: UCS-4 for easy of use, utf-8 for small disk storage (at least of Euro-centered text), and that would be that. But we do have the legacies to deal with.<br><blockquote style="margin:0 0 0 40px;border:none;padding:0px"><div class="gmail_extra"><div class="gmail_quote"><div> </div></div></div></blockquote><div class="gmail_extra"><div class="gmail_quote"><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="auto"><div>Apple, Microsoft, Sun, and a few other vendors jumped on the Unicode bandwagon early and committed themselves to the idea that 2 bytes is enough for everything. When the world discovered that wasn't true, we were stuck with a bunch of APIs that insisted on 2 bytes. Apple was able to partly make a break with that era, but Windows and Java are completely stuck with "Unicode means 16-bit" forever, which is why the whole world is stuck dealing with UTF-16 and surrogates forever.<br></div></div></blockquote><div><br></div><div>I've read many of the rants about UTF-16, but in fact, it's really not any worse than UTF-8 -- it's kind of a worst of both worlds -- not a set number of bytes per char, but a lot of wasted space (particularly for euro languages), but other than a bi tof wasted sapce, it's jsut like UTF-8.</div><div><br></div><div>The Problem with is it not UTF-16 itself, but the fact that an really surprising number of APIs and programmers still think that it's UCS-2, rather than UTF-16 --painful. And the fact, that AFAIK, ther really is not C++ Unicode type -- at least not one commonly used. Again -- legacy issues.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="auto"><div>And there are still people creating filenames on Latin-1 filesystems on older Linux and Unix boxes,</div></div></blockquote><div><br></div><div>This is the odd one to me -- reading about people's struggles with py3 an *nix filenames -- they argue that *nix is not broken -- and the world should just use char* for filenames and all is well! IN fact, maybe it would be easier to handle filenames as char* in some circumstances, but to argue that a system is not broken when you can't know the encoding of filenames, and there may be differently encoded filenames ON THE SAME Filesystem is insane! of course that is broken! It may be reality, and maybe Py3 needs to do a bit more to accommodate it, but it is broken.</div><div><br></div><div>In fact, as much as I like to bash Windows, I've had NO problems with assuming filenames in Windows are UTF-16 (as long as we use the "wide char" APIs, sigh), and OS-X's specification of filenames as utf-8 works fine. So Linux really needs to catch up here!</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="auto"><div>UTF-16 is a historical accident,</div></div></blockquote><div><br></div><div>yeah, but it's not really a killer, either -- the problems come when people assume UTF-16 is UCS-2, just alike assuming that utf-8 is ascii (or any one-byte encoding...)</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"> We really do need at least UTF-8 and UTF-32. But that's it. And I think that's simple enough.</blockquote><div><br></div><div>is UTF-32 the same as UCS-4 ? Always a bit confused by that.</div><div><br></div><div>Oh, and endian issues -- *sigh*</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"> Aaaargh!  Do I really have to learn all this mumbo-jumbo?!  (Forgive

    me.  :-) )</blockquote><div><br></div><div>Some of it yes, I'm afraid so -- but probably not the surrogate pair stuff, etc. That stuff is pretty esoteric, and really needs to be understood by people writing APIs -- but for those of us that USE APIs, not so much.</div><div><br></div><div>For instance, Python's handling Unicode file names almost always "just works" (as long as you stay in Python...)</div><div><br></div><div><br></div><div>-Chris</div><div><br></div></div><div><br></div>-- <br><div><br>Christopher Barker, Ph.D.<br>Oceanographer<br><br>Emergency Response Division<br>NOAA/NOS/OR&R            <a href="tel:%28206%29%20526-6959" value="+12065266959" target="_blank">(206) 526-6959</a>   voice<br>7600 Sand Point Way NE   <a href="tel:%28206%29%20526-6329" value="+12065266329" target="_blank">(206) 526-6329</a>   fax<br>Seattle, WA  98115       <a href="tel:%28206%29%20526-6317" value="+12065266317" target="_blank">(206) 526-6317</a>   main reception<br><br><a href="mailto:Chris.Barker@noaa.gov" target="_blank">Chris.Barker@noaa.gov</a></div>

</div></div>