[Python-ideas] Proposal for default character representation

Chris Angelico rosuav at gmail.com
Fri Oct 14 04:18:27 EDT 2016


On Fri, Oct 14, 2016 at 6:53 PM, Mikhail V <mikhailwas at gmail.com> wrote:
> On 13 October 2016 at 16:50, Chris Angelico <rosuav at gmail.com> wrote:
>> On Fri, Oct 14, 2016 at 1:25 AM, Steven D'Aprano <steve at pearwood.info> wrote:
>>> On Thu, Oct 13, 2016 at 03:56:59AM +0200, Mikhail V wrote:
>>>> and in long perspective when the world's alphabetical garbage will
>>>> disappear, two digits would be ok.
>>> Talking about "alphabetical garbage" like that makes you seem to be an
>>> ASCII bigot: rude, ignorant, arrogant and rather foolish as well. Even
>>> 7-bit ASCII has more than 100 characters (128).
>
> This is sort of rude. Are you from unicode consortium?

No, he's not. He just knows a thing or two.

>> Solution: Abolish most of the control characters. Let's define a brand
>> new character encoding with no "alphabetical garbage". These
>> characters will be sufficient for everyone:
>>
>> * [2] Formatting characters: space, newline. Everything else can go.
>> * [8] Digits: 01234567
>> * [26] Lower case Latin letters a-z
>> * [2] Vital social media characters: # (now officially called "HASHTAG"), @
>> * [2] Can't-type-URLs-without-them: colon, slash (now called both
>> "SLASH" and "BACKSLASH")
>>
>> That's 40 characters that should cover all the important things anyone
>> does - namely, Twitter, Facebook, and email. We don't need punctuation
>> or capitalization, as they're dying arts and just make you look
>> pretentious. I might have missed a few critical characters, but it
>> should be possible to fit it all within 64, which you can then
>> represent using two digits from our newly-restricted set; octal is
>> better than decimal, as it needs fewer symbols. (Oh, sorry, so that's
>> actually "50" characters, of which "32" are the letters. And we can
>> use up to "100" and still fit within two digits.)
>>
>> Is this the wrong approach, Mikhail?
>
> This is sort of correct approach. We do need punctuation however.
> And one does not need of course to make it too tight.
> So 8-bit units for text is excellent and enough space left for experiments.

... okay. I'm done arguing. Go do some translation work some time.
Here, have a read of some stuff I've written before.

http://rosuav.blogspot.com/2016/09/case-sensitivity-matters.html
http://rosuav.blogspot.com/2015/03/file-systems-case-insensitivity-is.html
http://rosuav.blogspot.com/2014/12/unicode-makes-life-easy.html

>> Perhaps we should go the other
>> way, then, and be *inclusive* of people who speak other languages.
>
> What keeps people from using same characters?
> I will tell you what - it is local law. If you go to school you *have* to
> write in what is prescribed by big daddy. If you're in Europe or America, you are
> more lucky. And if you're in China you'll be punished if you
> want some freedom. So like it or not, learn hieroglyphs
> and become visually impaired in age of 18.

Never mind about China and its political problems. All you need to do
is move around Europe for a bit and see how there are more sounds than
can be usefully represented. Turkish has a simple system wherein the
written and spoken forms have direct correspondence, which means they
need to distinguish eight fundamental vowels. How are you going to
spell those? Scandinavian languages make use of letters like "å"
(called "A with ring" in English, but identified by its sound in
Norwegian, same as our letters are - pronounced "Aww" or "Or" or "Au"
or thereabouts). To adequately represent both Turkish and Norwegian in
the same document, you *need* more letters than our 26.
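
A quick sketch of that count, in case it helps. The two letter lists
are typed in by hand for illustration (the eight Turkish vowels, and
the three letters Norwegian adds to a-z); nothing here is a complete
alphabet for either language.

```python
import string

# Hand-typed for illustration: the eight Turkish vowels and the three
# letters Norwegian adds on top of a-z.
turkish_vowels = "aeıioöuü"
norwegian_extra = "æøå"

basic = set(string.ascii_lowercase)  # the 26 letters a-z
needed = basic | set(turkish_vowels) | set(norwegian_extra)
outside = sorted(needed - basic)     # letters that have no slot in a-z

print(len(basic), len(needed), outside)
```

Even this tiny sample needs six letters beyond the basic 26, and that
is before Turkish consonants or anyone else's alphabet gets a look in.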

>> Thanks to Unicode's rich collection of characters, we can represent
>> multiple languages in a single document;
>
> Can do it without unicode in 8-bit boundaries with tagged text,
> just need fonts for your language, of course if your
> local charset is less than 256 letters.

No, you can't. Also, you shouldn't. It makes virtually every text
operation impossible: you can't split and rejoin text without tracking
the encodings. Go try to write a text editor under your scheme and see
how hard it is.
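
Here's a rough sketch of the problem. The "tagged text" layout below
(a list of (encoding, bytes) runs) and the decode_tagged helper are
made up for illustration, since no actual format has been proposed;
the two codecs are ordinary ones from the standard library.

```python
# Hypothetical tagged text: each run carries its own 8-bit encoding.
tagged = [
    ("latin-1", "blå ".encode("latin-1")),
    ("iso8859_5", "мир".encode("iso8859_5")),
]

def decode_tagged(runs):
    """Decode each run with its own tag and join the results."""
    return "".join(raw.decode(enc) for enc, raw in runs)

# Concatenate the raw bytes, as any naive join would, and the tags are
# gone. Decoding with one codec raises no error; it just gives the
# wrong characters for the second run.
raw = b"".join(raw for _, raw in tagged)
wrong = raw.decode("latin-1")

# With a plain Unicode string there is nothing to track: slice it
# anywhere and the pieces still mean what they meant.
text = decode_tagged(tagged)
left, right = text[:5], text[5:]

print(repr(wrong), repr(text), left + right == text)
```

Every split, join, search, and replace in the tagged scheme has to
carry the tags along like this; with Unicode the string is just a
string.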

> This is how it was before unicode I suppose.
> BTW I don't get it still what such revolutionary
> advantages has unicode compared to tagged text.

It's not tagged. That's the huge advantage.

>> script, but have different characters. Alphabetical garbage, or
>> accurate representations of sounds and words in those languages?
>
> Accurate with some 50 characters is more than enough.

Go build a chat room or something. Invite people to enter their names.
Now make sure you're courteous enough to display those names to
people. Try doing that without Unicode.
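
As a small illustration, take a handful of made-up sign-ups (the names
and the fits() helper are just examples) and see which of them survive
a single 8-bit charset:

```python
# Made-up example names, as a chat room might receive them.
names = ["Chris", "Åse", "Михаил", "李小龙"]

def fits(name, encoding):
    """Return True if every character of name exists in the encoding."""
    try:
        name.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

latin1_ok = [name for name in names if fits(name, "latin-1")]
utf8_ok = [name for name in names if fits(name, "utf-8")]

# Round-tripping through UTF-8 gives every name back unchanged.
round_trip = [name.encode("utf-8").decode("utf-8") for name in names]

print(latin1_ok, utf8_ok, round_trip == names)
```

Pick any one 8-bit charset you like and some of your users will not
be able to see their own names.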

I'm done. None of this belongs on python-ideas - it's getting pretty
off-topic even for python-list, and you're talking about modifying
Python 2.7 which is a total non-starter anyway.

ChrisA

