[Python-ideas] Proposal for default character representation

Ned Batchelder ned at nedbatchelder.com
Thu Oct 13 13:45:49 EDT 2016


On 10/13/16 2:42 AM, Mikhail V wrote:
> On 13 October 2016 at 08:02, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
>> Mikhail V wrote:
>>> Consider unicode table as an array with glyphs.
>>
>> You mean like this one?
>>
>> http://unicode-table.com/en/
>>
>> Unless I've miscounted, that one has the characters
>> arranged in rows of 16, so it would be *harder* to
>> look up a decimal index in it.
>>
>> --
>> Greg
> Nice point finally, I admit, although quite minor. Where
> the data implies such pagings or alignment, the notation
> should be (probably) more binary-oriented.
> But: you claim to see bit patterns in hex numbers? Then I bet you will
> see them much better if you take binary notation (2 symbols) or quaternary
> notation (4 symbols), I guarantee. And if you take consistent glyph set for them
> also you'll see them twice better, also guarantee 100%.
> So not that the decimal is cool,
> but hex sucks (too big alphabet) and _the character set_ used for hex
> optically sucks.
> That is the point.
> On the other hand why would unicode glyph table which is to the
> biggest part a museum of glyphs would be necessarily
> paged in a binary-friendly manner and not in a decimal friendly
> manner? But I am not saying it should or not, it's quite irrelevant
> for this particular case I think.

You continue to overlook the fact that Unicode code points are
conventionally presented in hexadecimal, including in the page you
linked us to.  That is the convention, and it makes sense to stick to
it.

When I see a numeric representation of a character, there are only two
things I can do with it: look it up in a reference someplace, or glean
some meaning from it directly.  For looking things up, please remember
that all Unicode references use hex numbering.  Looking up a character by
its decimal number is simply more difficult than looking it up by its
hex number.
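
To make the lookup point concrete, here is a minimal Python sketch of
moving between a decimal value and the U+XXXX form that references
index by.  The helper name to_u_notation is hypothetical, chosen only
for this illustration:

```python
def to_u_notation(code_point):
    """Format an integer code point in the conventional U+XXXX form."""
    # At least four uppercase hex digits, zero-padded, as code charts print them.
    return f"U+{code_point:04X}"


print(to_u_notation(1044))   # U+0414
print(int("0414", 16))       # 1044
print(chr(0x0414))           # Д
```

With the decimal form, that conversion has to happen before any chart
or block list can be consulted.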

For gleaning meaning directly, please keep in mind that Unicode is
fundamentally structured around pages of 256 code points, organized into
planes of 256 pages.  The very structure of how code points are
allocated and grouped is based on a hexadecimal-friendly system.  The
blocks of code points are aligned on hexadecimal boundaries:
http://www.fileformat.info/info/unicode/block/index.htm .  When I see
\u0414, I know it is a Cyrillic character because it is in block 04xx.
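
The plane/page layout can be read straight off the hex digits.  A small
Python sketch, assuming a hypothetical helper named split_code_point
(not part of any library), shows the split for \u0414:

```python
import unicodedata


def split_code_point(ch):
    """Split a character's code point into (plane, page, offset)."""
    cp = ord(ch)
    plane = cp >> 16           # each plane holds 0x10000 code points
    page = (cp >> 8) & 0xFF    # each page holds 0x100 code points
    offset = cp & 0xFF         # position within the page
    return plane, page, offset


ch = "\u0414"
plane, page, offset = split_code_point(ch)
print(f"U+{ord(ch):04X} decimal={ord(ch)} "
      f"plane={plane:X} page={page:02X} offset={offset:02X}")
print(unicodedata.name(ch))
```

In hex, the page (04) and the offset (14) are simply the two halves of
0414; in the decimal form, 1044, neither is visible without doing the
arithmetic.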

It simply doesn't make sense to present Unicode code points in anything
other than hex.

--Ned.



