Python 3.2 has some deadly infection
rurpy at yahoo.com
Sun Jun 8 00:34:23 EDT 2014
On 06/05/2014 05:02 PM, Steven D'Aprano wrote:
>[...]
> But Linux Unicode support is much better than Windows. Unicode support in
> Windows is crippled by continued reliance on legacy code pages, and by
> the assumption deep inside the Windows APIs that Unicode means "16 bit
> characters". See, for example, the amount of space spent on fixing
> Windows Unicode handling here:
>
> http://www.utf8everywhere.org/
While not disagreeing with the general premise of that page, it
has some problems that raise doubts in my mind about taking everything
the author says at face value.
For example:
"Q: Why would the Asians give up on UTF-16 encoding, which saves
them 50% the memory per character?"
"[...] in fact UTF-8 is used just as often in those [Asian] countries."
That is not my experience, at least for Japan. See my comments in
https://mail.python.org/pipermail/python-ideas/2012-June/015429.html
where I show that utf8 files are a tiny minority of the text files
found by Google.
He then gives a table with the sizes of the utf8 and utf16 encoded contents
(i.e. stripped of html markup) of an unnamed Japanese wikipedia page to
show that even without a lot of (html-mandated) ascii, the space savings
are not very much compared to the theoretical "50%" savings he stated:
"              Dense text   (Δ UTF-8)
 UTF-8  ...    222 KB       (0%)
 UTF-16 ...    176 KB       (−21%)"
Note that he calculates the space saving as (utf8-utf16)/utf8.
Yet by that metric the theoretical saving is *NOT* 50%, it is 33%.
For example 1000 Japanese characters will use 2000 bytes in utf16
and 3000 in utf8.
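The two ways of computing the idealized saving can be checked with a short
Python sketch. The sample text (1000 copies of a single kanji from the Basic
Multilingual Plane) and the helper names are assumptions made for
illustration; "utf-16-le" is used so that no byte order mark is counted.

```python
# Hypothetical sample: 1000 copies of one BMP kanji (U+4FE1).
text = "\u4fe1" * 1000

utf8_size = len(text.encode("utf-8"))       # 3 bytes per character here
utf16_size = len(text.encode("utf-16-le"))  # 2 bytes per character, no BOM


def saving_vs_utf8(utf8, utf16):
    """Saving measured relative to the utf8 size: (utf8-utf16)/utf8."""
    return (utf8 - utf16) / utf8


def saving_vs_utf16(utf8, utf16):
    """Saving measured relative to the utf16 size: (utf8-utf16)/utf16."""
    return (utf8 - utf16) / utf16


print(utf8_size, utf16_size)                                   # 3000 2000
print(round(saving_vs_utf8(utf8_size, utf16_size) * 100))      # 33
print(round(saving_vs_utf16(utf8_size, utf16_size) * 100))     # 50
```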
I did the same test using
http://ja.wikipedia.org/wiki/%E7%B9%94%E7%94%B0%E4%BF%A1%E9%95%B7
I stripped html tags, javascript and redundant ascii whitespace characters.
The stripped utf-8 file was 164946 bytes, the utf-16 encoded version of
the same was 117756. That gives (using the (utf8-utf16)/utf16 metric he used
to claim 50% idealized savings) 40%, which is quite a bit closer to the
idealized 50% than his 21%.
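For reference, here is a small Python sketch that applies both metrics to
both sets of figures quoted above. The function name is hypothetical; the
inputs are just the sizes already given (the page's table in KB, the
measurement above in bytes).

```python
def savings(utf8_size, utf16_size):
    """Return the saving relative to utf8 and relative to utf16."""
    diff = utf8_size - utf16_size
    return diff / utf8_size, diff / utf16_size


page = savings(222, 176)          # sizes from the page's table, in KB
mine = savings(164946, 117756)    # sizes measured above, in bytes

print(f"page: {page[0]:.1%} of utf8, {page[1]:.1%} of utf16")
# page: 20.7% of utf8, 26.1% of utf16
print(f"mine: {mine[0]:.1%} of utf8, {mine[1]:.1%} of utf16")
# mine: 28.6% of utf8, 40.1% of utf16
```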
I would have more faith in his opinions about things I don't know
about (such as unicode programming on Windows) if his other info
were more trustworthy. IOW, just because it's on the internet doesn't
mean it's true.