Unicode questions

Terry Reedy tjreedy at udel.edu
Tue Oct 19 18:17:10 EDT 2010


On 10/19/2010 4:31 PM, Tobiah wrote:
>> There is no such thing as "plain Unicode representation". The closest
>> thing would be an abstract sequence of Unicode codepoints (ala Python's
>> `unicode` type), but this is way too abstract to be used for
>> sharing/interchange, because storing anything in a file or sending it
>> over a network ultimately involves serialization to binary, which is not
>> directly defined for such an abstract representation (Indeed, this is
>> exactly what encodings are: mappings between abstract codepoints and
>> concrete binary; the problem is, there's more than one of them).
>
> Ok, so the encoding is just the binary representation scheme for
> a conceptual list of unicode points.  So why so many?  I get that
> someone might want big-endian, and I see the various virtues of
> the UTF strains, but why isn't a handful of these representations
> enough?  Languages may vary widely but as far as I know, computers
> really don't that much.  big/little endian is the only problem I
> can think of.  A byte is a byte.  So why so many encoding schemes?
> Do some provide advantages to certain human languages?
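
To make the quoted point concrete, here is a minimal sketch (Python 3
syntax; in 2.x the type is unicode and the literal needs a u prefix)
of one short sequence of codepoints serialized three different ways:

s = 'caf\xe9'                 # four codepoints, the last is U+00E9
print(s.encode('utf-8'))      # b'caf\xc3\xa9'   -- two bytes for U+00E9
print(s.encode('utf-16-le'))  # b'c\x00a\x00f\x00\xe9\x00'
print(s.encode('latin-1'))    # b'caf\xe9'       -- one byte, legacy codec

Same codepoints, three different byte sequences, hence the need to know
(or declare) which encoding was used.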

The hundred or so language-specific encodings all pre-date Unicode and 
are *not* Unicode encodings. They are still used because of inertia and 
because each is locally optimized -- compact and convenient for the 
particular language it was designed for.
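
For example (again Python 3 syntax), a language-specific codec such as 
latin-1 can represent at most 256 codepoints, while a UTF encoding can 
encode any codepoint at all:

text = 'abc\u0159'                # U+0159, r with caron, not in latin-1
print(text.encode('utf-8'))       # b'abc\xc5\x99'
try:
    text.encode('latin-1')
except UnicodeEncodeError as err:
    print(err)                    # latin-1 cannot represent U+0159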

There are currently about 100,000 assigned Unicode codepoints, with 
space for about 1,000,000. The Unicode standard specifies exactly 2 
internal representations of codepoints, using either 16- or 32-bit 
words. The latter uses one word per codepoint; the former usually uses 
one word but has to use two (a surrogate pair) for codepoints above 
2**16-1. The standard also specifies about 7 byte-oriented transfer 
formats: UTF-8, UTF-16, and UTF-32, the latter two with big- and 
little-endian variations. As far as I know, these (and a few other 
variations) are the only encodings that encode all Unicode chars 
(codepoints).
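
A quick illustration of those sizes (Python 3 again; the bytes come from 
the UTF definitions themselves, nothing Python-specific): a codepoint 
above 2**16-1 takes two 16-bit units in UTF-16 but only one 32-bit unit 
in UTF-32, and the BE/LE variants differ only in byte order:

ch = '\U0001D11E'                   # MUSICAL SYMBOL G CLEF, above 0xFFFF
print(len(ch.encode('utf-32-be')))  # 4  -- one 32-bit word
print(len(ch.encode('utf-16-be')))  # 4  -- two 16-bit words (surrogate pair)
print(ch.encode('utf-16-be'))       # b'\xd84\xdd\x1e'
print(ch.encode('utf-16-le'))       # b'4\xd8\x1e\xdd'  -- same words, bytes swapped
print(ch.encode('utf-16'))          # prepends a BOM so a reader can tell which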

-- 
Terry Jan Reedy
