[Python-ideas] Proposal for default character representation

Mikhail V mikhailwas at gmail.com
Sat Oct 15 09:06:48 EDT 2016


On 14 October 2016 at 11:36, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:

>but bash wasn't designed for that.
>(The fact that some people use it that way says more
>about their dogged persistence in the face of
>adversity than it does about bash.)

I cannot judge what bash is good for, since I never
tried to learn it. But it does *look* frightening.
My first feeling is OMG, I must close this and never
see it again.
Also, I can hardly imagine that the special purpose
of a language justifies ignoring readability.
Even assembler or whatever
can be made readable without much effort.
So I just look for some other solution for the same task,
even if it takes 10 times more code.


> So for that
> person, using decimal would make the code *harder*
> to maintain.
> To a maintainer who doesn't have that familiarity,
> it makes no difference either way.

That is because that person has (blindly)
followed the convention from the beginning.
So my intention of course was not
to find out whether the majority does so or not,
but rather which of the two makes
more sense *initially*, just trying to imagine
that we can decide.
To be more precise, if you were to choose
between two options:

1. use hex for the glyph index and use
hex for numbers (e.g. some arbitrary
value like screen coordinates)
2. use decimal for both cases.

I personally choose option 2.
Probably nothing will convince me that option
1 is better, all the more so since I don't
believe that anything more than base-8
makes much sense for readable numbers.
I am just a little bit disappointed that others
again and again speak of convention.
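
To make the two options concrete, here is a minimal sketch in Python.
The glyph "A" and the coordinates 640, 480 are made-up values chosen
only for illustration; nothing in the discussion prescribes them.

```python
# Hypothetical values, chosen only for illustration.
glyph = "A"
x, y = 640, 480

code_point = ord(glyph)  # the integer index of the glyph

# Option 1: hex for both the glyph index and the ordinary numbers.
option_1 = f"glyph 0x{code_point:04X} at (0x{x:X}, 0x{y:X})"

# Option 2: decimal for both.
option_2 = f"glyph {code_point} at ({x}, {y})"

print(option_1)
print(option_2)
```

Both lines describe the same data; only the notation differs.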

>I just
>don't see this as being anywhere near being a
>significant problem.

I didn't mean that; it just slightly
annoys me.

>>    In standard ASCII
>>    there are enough glyphs that would work way better
>>    together,

>Out of curiosity, what glyphs do you have in mind?

If I were to decide, I would look into a few options here:
1. The easy option, which would raise fewer further
questions, is to take the first 16 lowercase letters.
2. A better option would be to choose letters and
possibly other glyphs to build up a more readable
set. E.g. drop the letter "c" and keep "e" because of
their optical collision, and drop some other weak glyphs,
like "l" and "h". That of course would raise
many further questions, like why you drop this
glyph and not that one and so on, so it will surely end up in a quarrel.
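
As a rough sketch of option 1 in Python: the mapping of digit values
0..15 to "a".."p" in alphabetical order is an assumption made here,
and the names LETTER_DIGITS, to_letter_hex and from_letter_hex are
hypothetical helpers, not anything that exists in Python.

```python
import string

# Assumption: digit values 0..15 map to "a".."p" in alphabetical order.
LETTER_DIGITS = string.ascii_lowercase[:16]


def to_letter_hex(value: int) -> str:
    """Render a non-negative integer in base 16 using LETTER_DIGITS."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    if value == 0:
        return LETTER_DIGITS[0]
    digits = []
    while value:
        value, remainder = divmod(value, 16)
        digits.append(LETTER_DIGITS[remainder])
    return "".join(reversed(digits))


def from_letter_hex(text: str) -> int:
    """Inverse of to_letter_hex."""
    value = 0
    for ch in text:
        value = value * 16 + LETTER_DIGITS.index(ch)
    return value


print(to_letter_hex(255))     # same value as 0xFF
print(to_letter_hex(0xD001))
```

Option 2 would only change LETTER_DIGITS to some other string of 16
distinct characters; which characters to pick is exactly the open
question.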

Here lies another problem: the non-constant width of letters.
But this is more a problem of fonts and rendering,
so it is a matter for IDEs and editors.
But as I said, I wouldn't recommend base 16 at all.


>>    ұұ-ұ ---- ---- ---ұ
>>
>>    you can downscale the strings, so a 16-bit
>>    value would be ~60 pixels wide

> Yes, you can make the characters narrow enough that
> you can take 4 of them in at once, almost as though
> they were a single glyph... at which point you've
> effectively just substituted one set of 16 glyphs

No no. I didn't mean to shrink them until they melt together.
The structure is still there; it is only that with such a notation
you don't need to keep the glyphs as big as in many-glyph systems.

>for another. Then you'd have to analyse whether the
>*combined* 4-element glyphs were easier to distinguish
>from each other than the ones they replaced. Since
>the new ones are made up of repetitions of just two
>elements, whereas the old ones contain a much more
>varied set of elements, I'd be skeptical about that.

I get your idea, and this is a very good point.
It seems you have experience in such things?
Currently I don't know for sure whether such an approach
is more or less effective than others, and for which cases.
But I can boldly claim that it is better than *any*
hex notation. It just follows from what I have here
on paper on my desk, namely that it is physically
impossible to make up a highly effective glyph system
of more than 8 symbols. You want more only if you really
*need* more glyphs.
And skepticism should always be present.

One thing, however, especially interests me: here not
only the differentiation of glyphs comes into play,
but also the positional principle, which helps to compare
numbers and can be beneficial for specific cases.
So you can clearly see if one
number is two times bigger than another, for example.
And of course, strictly speaking those bit groups are not glyphs.
You can of course call them so, but this is
just rhetoric. One could also call all English
written words glyphs, but they are not really.
But I get your analogy, this is how the tests
should be made.
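
The "two times bigger" case can be sketched in Python: doubling is a
shift by one position, so the pattern of marks stays the same and only
moves left. The helper name show, the 16-bit width and the sample value
0x0516 are assumptions made for this illustration.

```python
# Translate the usual "01" digits into the two marks used above.
MARKS = str.maketrans("01", "-ұ")


def show(value: int) -> str:
    """Render a value as 16 marks, most significant bit first (sketch)."""
    if not 0 <= value < 1 << 16:
        raise ValueError("this sketch handles 16-bit values only")
    return format(value, "016b").translate(MARKS)


n = 0x0516  # arbitrary sample value
print(show(n))
print(show(2 * n))  # same pattern, moved one position to the left
```

As long as the top bit is not set, show(2 * n) is show(n) without its
first mark and with one "-" appended.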

>BTW, your choice of ұ because of its "peak readability"
>seems to be a case of taking something out of context.
>The readability of a glyph can only be judged in terms
>of how easy it is to distinguish from other glyphs.

True and false. Each glyph taken alone has a specific structure,
and on its own it has optical qualities.
This is quite complicated and hard to
describe in words, but anyway, only tests can
tell what is better. In this case it is still 2
glyphs, or better to say one and a half glyphs.
And indeed you can distinguish them really well
since they have different mass.

> Here, the only thing that matters is distinguishing it
> from the other symbol, so something like "|" would
> perhaps be a better choice.

> ||-| ---- ---- ---|

I get your idea, although it is not really a correct statement,
see above. A vertical bar is hardly a good glyph,
actually quite a bad one. Such a notation will have
quite an uncomfortable effect on the eyes,
and there are many factors at play here.

Less technically, here are some rules:
- a good glyph has a structure, and the
boundary of the glyph is a proportional form (like a bulb)
(not your case)
- vertical gaps/sheers inside these boundaries are
bad (your case). One can't always do without them, but
vertical ones are much worse than horizontal ones.
- a too primitive glyph structure is bad (your case)

So a vertical bar is good only as a punctuation sign.
For this exact reason letters such as "l", "L", "i"
are bad ones, especially their sans-serif variants.
And *not* primarily because they collide
with other glyphs. This is somewhat non-obvious.
One should understand of course that I
just took the standard symbols that only try
to mimic the correct representation.

So if you ever play around with bitstrings,
here are the ASCII-only variants that
work best:

-y-y ---y -yy- -y--
-o-o ---o -oo- -o--
-k-k ---k -kk- -k--
-s-s ---s -ss- -s--

Needless to say, these will be way, way better
than the "01" notation which is used as standard.
If you read a lot of numbers you should have noticed
how unpleasant it is to scan through 010101.
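
For anyone who wants to try these variants, here is a minimal Python
sketch that prints a value in groups of four bits with a chosen pair of
marks. The function name bitstring and its parameters are hypothetical;
0x5164 is simply the value spelled out by the four lines above.

```python
def bitstring(value: int, width: int = 16, one: str = "y", zero: str = "-") -> str:
    """Render value as `width` bits in groups of four, using the given marks."""
    if width % 4 != 0:
        raise ValueError("width must be a multiple of 4 in this sketch")
    if not 0 <= value < 1 << width:
        raise ValueError("value does not fit into the given width")
    bits = format(value, f"0{width}b")
    marks = bits.translate(str.maketrans({"0": zero, "1": one}))
    return " ".join(marks[i:i + 4] for i in range(0, width, 4))


for mark in "yoks":
    print(bitstring(0x5164, one=mark))

# The standard notation, for comparison.
print(bitstring(0x5164, one="1", zero="0"))
```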

> What I'm far from convinced of is that I would gain any
> benefit from making that effort, or that a fresh person
> would be noticeably better off if they learned your new
> system instead of the old one.

"far from convinced" sounds quite positive however :)
It is never too late. I heard from Captain Crunch
https://en.wikipedia.org/wiki/John_Draper
that he was so tired of C syntax that he finally
switched to Python for some projects.
I can imagine how unwelcome such a switch can be at an older age.

It all depends on the tasks that one often does.
Say, imagine you read binaries
for a long time in one of the notations
I proposed above (like "-y-- --y-" for example)
and after that try to
switch back to the "0100 0010" notation.
I bet you will then see the difference better.
Indeed, learning a new notation for numbers is
quite easy, it only takes some practice. And with
base-2 you don't need to learn anything at all, you
can just switch to the other notation and use it straight away.
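
Switching between the two notations is a plain character substitution,
as this small Python sketch shows. The helper names to_01 and parse are
hypothetical and exist only for this illustration.

```python
def to_01(text: str, one: str = "y", zero: str = "-") -> str:
    """Translate a bitstring written with custom marks into "01" notation."""
    return text.translate(str.maketrans(zero + one, "01"))


def parse(text: str, one: str = "y", zero: str = "-") -> int:
    """Read a bitstring with custom marks (spaces allowed) as an integer."""
    return int(to_01(text, one, zero).replace(" ", ""), 2)


print(to_01("-y-- --y-"))
print(parse("-y-- --y-"))
```

Spaces pass through the translation untouched, so the grouping is kept.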

>>    It is not about speed, it is about brain load.
>>    Chinese can read their hieroglyphs fast, but
>>    the cognition load on the brain is 100 times higher
>>    than current latin set.

>Has that been measured? How?

I don't think it is measurable at all. That is
my opinion, and 100 just shows that I think
it is very stressful, also because of the amount of
meaning disambiguation that such a system can
require. I have also heard personal complaints from young
Chinese students; they all had problems with vision already
in their early years, but I cannot back this up officially.
So just imagine: if we take it as true that the max number
of effective glyphs is 8, and hieroglyphs
are *all* printed in a same-sized box, how would this
provide efficient reading? And if you've
seen Chinese books, they are all printed in quite a
small font. I am not a very sentimental person
but somehow I feel sorry for people, one doesn't
deserve it.
You know, I became friends with one Chinese
girl. She loves to read and is eager to learn,
and she always needs to carry a pair of glasses with her everywhere.
Somehow I have become sad writing this now, she is such a sweet
young girl...
And yes, in this sense one can say that this cognitive
load can be measured. You go to universities in China and
count those with vision problems.

>I don't doubt that some sets of glyphs are easier to
>distinguish from each other than others. But the

That sounds good, it is
not so often that one realizes that :)
Most people would say "it's just a matter of habit"

>letters and digits that we currently use have already
>been pretty well optimised by scribes and typographers
>over the last few hundred years, and I'd be surprised
>if there's any *major* room left for improvement.

Here I would slightly disagree.
First, *digits* are not optimised for anything, they
are just a heritage from ancient times.
They have some minimal readability, namely "2" is not
bad, the others are quite poor.
Second, *small latin letters* are indeed well crafted.
However, don't be under the illusion that someone cared much
about their optimisation in the last 1000 years.
If you are skeptical about that, take a look at this

http://daten.digitale-sammlungen.de/~db/bsb00003258/images/index.html?seite=320

If you believe (there are skeptics who do not)
that this dates back to the end of the 10th century,
then we have an interesting picture here.
You see that this is indeed
very similar to what you read now, somewhat optimised
of course, but without much improvement.
Actually in some cases there is even some degradation:
now we have the letters "pbqd", which are just rotations
and reflections of each other, which is no good.
Strictly speaking you can use only one of these 4 glyphs.
And in the last 500 years there were zero modifications.
How much improvement can be made is a hard question.
According to my results, the peak readability
forms are indeed similar to certain small latin letters,
but I would say quite a significant improvement could be made.
But this is not really measurable.


Mikhail

