[Python-ideas] Proposal for default character representation
Chris Angelico
rosuav at gmail.com
Thu Oct 13 10:50:36 EDT 2016
On Fri, Oct 14, 2016 at 1:25 AM, Steven D'Aprano <steve at pearwood.info> wrote:
> On Thu, Oct 13, 2016 at 03:56:59AM +0200, Mikhail V wrote:
>> and in long perspective when the world's alphabetical garbage will
>> disappear, two digits would be ok.
> Talking about "alphabetical garbage" like that makes you seem to be an
> ASCII bigot: rude, ignorant, arrogant and rather foolish as well. Even
> 7-bit ASCII has more than 100 characters (128).
Solution: Abolish most of the control characters. Let's define a brand
new character encoding with no "alphabetical garbage". These
characters will be sufficient for everyone:
* [2] Formatting characters: space, newline. Everything else can go.
* [8] Digits: 01234567
* [26] Lower case Latin letters a-z
* [2] Vital social media characters: # (now officially called "HASHTAG"), @
* [2] Can't-type-URLs-without-them: colon, slash (now called both
"SLASH" and "BACKSLASH")
That's 40 characters that should cover all the important things anyone
does - namely, Twitter, Facebook, and email. We don't need punctuation
or capitalization, as they're dying arts and just make you look
pretentious. I might have missed a few critical characters, but it
should be possible to fit it all within 64, which you can then
represent using two digits from our newly-restricted set; octal is
better than decimal, as it needs fewer symbols. (Oh, sorry, so that's
actually "50" characters, of which "32" are the letters. And we can
use up to "100" and still fit within two digits.)
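For anyone who wants to check the arithmetic, here is a minimal Python
sketch of the tally above. The dictionary and its labels are made up
purely for illustration; only the counts come from the list.

```python
# Counts taken from the list above. The dict name and its keys are
# illustrative only, not part of any real encoding.
repertoire = {
    "formatting": 2,  # space, newline
    "digits": 8,      # 0-7
    "letters": 26,    # a-z
    "social": 2,      # '#' and '@'
    "url": 2,         # colon and slash
}

total = sum(repertoire.values())

# Render each count in octal, the only base the restricted digits allow.
total_octal = format(total, "o")
letters_octal = format(repertoire["letters"], "o")
capacity_octal = format(8 ** 2, "o")  # two octal digits give 64 slots

print(total, total_octal)  # 40 50
print(letters_octal)       # 32
print(capacity_octal)      # 100
```

So the quoted figures are just 40, 26 and 64 written in base 8.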
Is this the wrong approach, Mikhail? Perhaps we should go the other
way, then, and be *inclusive* of people who speak other languages.
Thanks to Unicode's rich collection of characters, we can represent
multiple languages in a single document; see, for instance, how this
uses four languages and three entirely distinct scripts:
http://youtu.be/iydlR_ptLmk Turkish and French both use the Latin
script, but have different characters. Alphabetical garbage, or
accurate representations of sounds and words in those languages?
Python 3 gives the world's languages equal footing. This is a feature,
not a bug. It has consequences, including that arbitrary character
entities could involve up to seven decimal digits or six hex (although
for most practical work, six decimal or five hex will suffice). Those
consequences are a trivial price to pay for uniting the whole
internet, as opposed to having pockets of different languages, like we
had up until the 90s.
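Those digit counts are easy to confirm. A rough sketch, assuming
Python 3.6 or later (sys.maxunicode is the highest code point the
interpreter supports; the other names are only for illustration):

```python
import sys

top = sys.maxunicode                 # highest code point, 0x10FFFF
decimal_digits = len(str(top))       # digits in 1114111
hex_digits = len(format(top, "x"))   # digits in 10ffff

# A code point well outside the 16-bit range: U+1F600 (an emoji).
sample = ord("\U0001F600")
sample_decimal = len(str(sample))        # digits in 128512
sample_hex = len(format(sample, "x"))    # digits in 1f600

print(top, decimal_digits, hex_digits)     # 1114111 7 6
print(sample, sample_decimal, sample_hex)  # 128512 6 5
```

The sample shows the "six decimal or five hex" case: it sits far
beyond the old 16-bit range yet still fits the shorter forms.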
ChrisA