Proposal for default character representation

Hello all, I want to share my thoughts about syntax improvements regarding character representation in Python. I am new to the list, so if such a discussion or a PEP already exists, please let me know.

In short: currently Python uses hexadecimal notation for characters on input and output. For example, take the unicode string "абв.txt" (a file named with the first three Cyrillic letters). Printing it, we get:

u'\u0430\u0431\u0432.txt'

So one sees that we have hex numbers here. The same goes for typing in strings, which obviously also uses hex, and for some parts of the Python documentation, especially those about unicode strings.

PROPOSAL:
1. Remove all hex notation from printing functions, typing and documentation. For printing functions, leave hex as an "option", for example for those who feel the need for a hex representation, which is strange IMO.
2. Replace it with decimal notation, so that e.g. u'\u0430\u0431\u0432.txt' becomes u'\u1072\u1073\u1074.txt', and similarly for other cases where raw bytes must be printed or input.

To summarize: make decimal notation the standard for all cases. I am not going to go deeper, such as what digit count (leading zeros) to use, since that is quite a secondary decision.

MOTIVATION:
1. Hex notation is hardly readable. It was not designed with readability in mind, so it is not an appropriate system for reading, at least with the current character set, which is a mix of digits and letters (curious who the wise person was who invented such a set?).
2. Mixing two notations (hex and decimal) is a _very_ bad idea; I hope there is no need to explain why.

So that's it, in short. Feel free to discuss and comment. Regards, Mikhail
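(For reference, a minimal sketch, assuming Python 3, of what the example string looks like today; ord() already gives the decimal ordinals the proposal refers to:)

    s = "\u0430\u0431\u0432.txt"        # the Cyrillic "абв.txt"
    print(ascii(s))                     # '\u0430\u0431\u0432.txt' -- hex escapes
    print([ord(ch) for ch in s])        # [1072, 1073, 1074, 46, 116, 120, 116] -- decimal ordinals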

On 12.10.2016 23:33, Mikhail V wrote:
Hmm, in Python3, I get:
The hex notation for \uXXXX is a standard also used in many other programming languages; it's also easier to parse, so I don't think we should change this default. Take e.g.:
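(A minimal sketch of the parsing point, assuming Python 3; "\d" below is a hypothetical decimal escape that does not exist in Python:)

    # \uXXXX always consumes exactly four hex digits, so the end of the escape
    # is unambiguous:
    s = "\u00411"       # U+0041 ('A') followed by the literal character '1'
    print(s)            # A1
    print(len(s))       # 2
    # A hypothetical decimal escape has no such natural stopping point:
    # would "\d651" mean chr(65) + "1", or chr(651)?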
With decimal notation, it's not clear where to end parsing the digit notation. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Oct 12 2016)

I'm -1 on this. Just type "0431 unicode" on your favorite search engine. U+0431 is the codepoint, not whatever digits 0x431 has in decimal. That's a tradition and something external to Python. As a related concern, I think using decimal/octal on raw data is a terrible idea (e.g. On Linux, I always have to re-format the "cmp -l" to really grasp what's going on, changing it to hexadecimal). Decimal notation is hardly readable when we're dealing with stuff designed in base 2 (e.g. due to the visual separation of distinct bytes). How many people use "hexdump" (or any binary file viewer) with decimal output instead of hexadecimal? I agree that mixing representations for the same abstraction (using decimal in some places, hexadecimal in other ones) can be a bad idea. Actually, that makes me believe "decimal unicode codepoint" shouldn't ever appear in string representations. -- Danilo J. S. Bellini --------------- "*It is not our business to set up prohibitions, but to arrive at conventions.*" (R. Carnap)
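(A tiny sketch of the contrast Danilo describes, assuming Python 3; the byte values are arbitrary:)

    data = bytes([0, 15, 16, 255, 128])
    print(' '.join('%02x' % b for b in data))   # 00 0f 10 ff 80 -- fixed width, byte boundaries obvious
    print(' '.join('%d' % b for b in data))     # 0 15 16 255 128 -- ragged widths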

On 12 October 2016 at 23:58, Danilo J. S. Bellini <danilo.bellini@gmail.com> wrote:
Hmm, what keeps you from separating the logical units, each represented by a decimal number, like 001 023 255 ...? Do you really think this is less readable than its hex equivalent? Then you are probably working with hex numbers only, but I doubt that.
PS: it is rather peculiar, three negative replies already, but with no strong arguments for why it would be bad to stick to decimal only, only some "others do it so" and "tradition" arguments. The "base 2" argument could work to some degree, but if we stick to that criterion, why not speak about octal/quaternary/binary then? Please note, I am talking only about readability _of the character set_ actually. And it does not include your habit issues, but rather is an objective criterion for using this or that character set. And decimal is objectively way more readable than hex standard character set, regardless of how strong your habits are.

On Thu, Oct 13, 2016 at 10:09 AM, Mikhail V <mikhailwas@gmail.com> wrote:
Way WAY less readable, and I'm comfortable working in both hex and decimal.
You're the one who's non-standard here. Most of the world uses hex for Unicode codepoints. http://unicode.org/charts/ HTML entities permit either decimal or hex, but other than that, I can't think of any common system that uses decimal for Unicode codepoints in strings.
"Others do it so" is actually a very strong argument. If all the rest of the world uses + to mean addition, and Python used + to mean subtraction, it doesn't matter how logical that is, it is *wrong*. Most of the world uses U+201C or "\u201C" to represent a curly double quote; if you us 0x93, you are annoyingly wrong, and if you use 8220, everyone has to do the conversion from that to 201C. Yes, these are all differently-valid standards, but that doesn't make it any less annoying.
How many decimal digits would you use to denote a single character? Do you have to pad everything to seven digits (\u0000034 for an ASCII quote)? And if not, how do you mark the end? This is not "objectively more readable" if the only gain is "no A-F" and the loss is "unpredictable length". ChrisA

On 13 October 2016 at 01:50, Chris Angelico <rosuav@gmail.com> wrote:
Please don't mix up readability and personal habit, which previous repliers seem to do as well. Those two things have nothing to do with each other. If you are comfortable with the old Roman numbering system, that does not make it readable. And I am NOT comfortable with hex, and most people would be glad to use a single notation. But some of them think that they are cool because they know several numbering notations ;) But I bet few can actually understand which is more readable.
This actually supports my proposal perfectly: if everyone uses decimal, why suddenly use hex for the same kind of thing - an index into an array? I don't see how your analogy contradicts my proposal; it rather supports it.
quote; if you use 0x93, you are annoyingly wrong,
Please don't make personal assessments here, I can use whatever I want; moreover, I find this notation as silly as using different measurement systems without any reason and within one activity, and in my eyes this is annoyingly wrong and stupid, but I don't call anybody here stupid. But I do want you to abstract yourself from your habit for a while and talk about what would be better for future usage.
everyone has to do the conversion from that to 201C.
Nobody needs to do ANY conversions if we use decimal, and as said, everything is decimal: numbers, array indexes, the ord() function returns decimal; you can imagine more examples. So it is not only more readable but also more traditional.
How many decimal digits would you use to denote a single character?
for text, three decimal digits would be enough for me personally, and in the long perspective, when the world's alphabetical garbage disappears, two digits would be ok.
you have to pad everything to seven digits (\u0000034 for an ASCII quote)?
It depends on the case. For input, some separator or padding is also ok; I don't have a problem with either. For printing, obviously don't show leading zeros, but rather spaces. But as said, I find Unicode only a temporary happening; it will pass into history at some point and be used only to study extinct glyphs. Mikhail

On 2016-10-12 18:56, Mikhail V wrote:
You keep saying this, but it's quite incorrect. The usage of decimal notation is itself just a convention, and the only reason it's easy for you (and for many other people) is because you're used to it. If you had grown up using only hexadecimal or binary, you would find decimal awkward. There is nothing objectively better about base 10 than any other place-value numbering system. Decimal is just a habit. Now, it's true that base-10 is at this point effectively universal across human societies, and that gives it a certain claim to primacy. But base-16 (along with base 2) is also quite common in computing contexts. Saying we should dump hex notation because everyone understands decimal is like saying that all signs in Prague should only be printed in English because there are more English speakers in the entire world than Czech speakers. But that ignores the fact that there are more Czech speakers *in Prague*. Likewise, decimal may be more common as an overall numerical notation, but when it comes to referring to Unicode code points, hexadecimal is far and away more common. Just look at the Wikipedia page for Unicode, which says: "Normally a Unicode code point is referred to by writing "U+" followed by its hexadecimal number." That's it. You'll find the same thing on unicode.org. The unicode code point is hardly even a number in the usual sense; it's just a label that identifies the character. If you have an issue with using hex to represent unicode code points, your issue goes way beyond Python, and you need to take it up with the Unicode consortium. (Good luck with that.) -- Brendan Barnwell "Do not follow where the path may lead. Go, instead, where there is no path, and leave a trail." --author unknown
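(A minimal illustration of the convention Brendan quotes, assuming Python 3.6+ for the f-string:)

    ch = "б"
    print(f"U+{ord(ch):04X}")   # U+0431 -- the standard way to refer to the code point
    print(ord(ch))              # 1073   -- the same number written in decimal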

On 13 October 2016 at 04:18, Brendan Barnwell <brenbarn@brenbarn.net> wrote:
Exactly, but this is not called "readability" but rather "acquired ability to read" or simply habit, which does not reflect the "readability" of the character set itself.
There is nothing objectively better about base 10 than any other place-value numbering system.
Sorry to say, but here you are totally wrong. Not to blame you personally for your fallacy, which is quite common among those who are not familiar with the topic, but you should consider some important points:

1. Each character set has a certain grade of readability, which depends solely on the form of its units (aka glyphs).
2. Linear string representation is superior to anything else (spiral, arc, etc.).
3. There exist glyphs which provide maximal readability; those are particular glyphs with a particular constant form, and this form is absolutely independent of the encoding subject.
4. According to my personal studies (which does not mean they must be accepted or blindly believed, but I have solid experience in this area and am acting quite successfully in it), the number of such glyphs is less than 10; I am at 8 glyphs now.
5. The main measured parameter which reflects readability (somewhat indirectly, however) is the pair-wise optical collision of each character pair of the set. This refers somewhat to legibility, or the ability to differentiate glyphs.

Less technically, you can understand it better if you think of your own words: "There is nothing objectively better about base 10 than any other place-value numbering system." If this could ever be true, then you could read with characters that are very similar to each other, or something messy, as well as with characters which are easily identifiable, collision resistant and optically consistent. But that is absurd, sorry. For numbers you obviously don't need as many characters as for speech encoding, so only those glyphs, or even a subset of them, should be used. This means anything more than 8 characters is quite worthless for reading numbers. Note that I can't provide the works here currently, so don't ask me for that; some of them will probably be available in the near future. Your analogy with speech and signs is not correct, because speech is different but numbers are numbers. But also for different speech the same character set must be used, namely the one with superior optical qualities and readability.
Saying we should dump hex notation because everyone understands decimal is like saying that all signs in Prague should only be printed in English
We should dump hex notation because currently decimal is simply superior to hex, just like a Mercedes is superior to a Lada, and secondly, because it is more common for ALL people, so it is 2:0 against using such notation. With that said, I am not against base-16 itself in the first place, but rather against the character set, which is simply visually inconsistent and not readable. Someone just took Arabic digits and added the first Latin letters to them. It could be forgiven as a schoolboy's exercise in drawing, but I fail to understand how it can be accepted as a working notation for a medium that is supposed to be human readable. Practically all this notation does is reduce the time before you, as a programmer, acquire visual and brain impairments.
Yeah, that's it. It sucks, and having migrated into a coding standard, it sucks twice. If a new syntax/standard were decided on, there would be only positive sides to using decimal vs hex. So nobody will be hurt; this is only a question of remaking the current implementation and is proposed only as a long-term theoretical improvement.
it's just a label that identifies the character.
Ok, but if I write string filtering in Python, for example, then obviously I use decimal everywhere to compare index ranges, etc., so of what use is that label to me? Just redundant conversions back and forth. It makes me sick, actually.

Mikhail V wrote:
I'm not sure what you mean by that. If by "index ranges" you're talking about the numbers you use to index into the string, they have nothing to do with character codes, so you can write them in whatever base is most convenient for you. If you have occasion to write a literal representing a character code, there's nothing to stop you writing it in hex to match the way it's shown in a repr(), or in published Unicode tables, etc. I don't see a need for any conversions back and forth. -- Greg

On 2016-10-12 22:46, Mikhail V wrote:
It's pretty clear to me by this point that your argument has no rational basis, so I'm regarding this thread as a dead end. -- Brendan Barnwell "Do not follow where the path may lead. Go, instead, where there is no path, and leave a trail." --author unknown

On Thu, Oct 13, 2016 at 1:46 AM, Mikhail V <mikhailwas@gmail.com> wrote:
Even if you were right that your approach is somehow inherently easier, it is flat-out wrong that other approaches lead to "brain impairment". On the contrary, it is well-established that challenging the brain prevents or at least delays brain impairment. And it also makes no sense that it would cause visual impairment, either. Comparing glyphs is a higher-level task in the brain, it has little to do with your eyes. All your eyes detect are areas of changing contrast, any set of lines and curves, not even glyphs, is functionally identical at that level (and even at medium-level brain regions). The size of the glyphs can make a difference, but not the number of available ones. On the contrary, having more glyphs increases the information density of text, reducing the amount of reading you have to do to get the same information.

On 16 October 2016 at 17:16, Todd <toddrjen@gmail.com> wrote:
My phrasing "impairment" is of course somewhat exaggeration. It cannot be compared to harm due to smoking for example. However it also known that many people who do big amount of information processing and intensive reading are subject to earlier loss of the vision sharpness. And I feel it myself. How exactly this happens to the eye itself is not clear for me. One my supposition is that during the reading there is very intensive two-directional signalling between eye and brain. So generally you are correct, the eye is technically a camera attached to the brain and simply sends pictures at some frequency to the brain. But I would tend to think that it is not so simple actually. You probably have heard sometimes users who claim something like: "this text hurts my eyes" For example if you read non-antialiased text and with too high contrast, you'll notice that something is indeed going wrong with your eyes. This can happen probably because the brain starts to signal the eye control system "something is wrong, stop doing it" Since your eye cannot do anything with wrong contrast on your screen and you still need to continue reading, this happens again and again. This can cause indeed unwanted processes and overtiredness of muscles inside the eye. So in case of my examle with Chinese students, who wear goggles more frequently, this would probaly mean that they could "recover" if they just stop reading a lot. "challenging the brain prevents or at least delays brain" Yes but I hardly see connection with this case, I would probably recommend to make some creative exercises, like drawing or solving puzzles for this purpose. But if I propose reading books in illegible font than I would be wrong in any case.
You forget about that whith illegible font or wrong contrast for example you *do* need to do more concentrarion, This causes again your eye to try harder to adopt to the information you see, reread, which again affects your lens and eye movements. Anyway, how do you think then this earlier vision loss happens? You'd say I fantasise? Mikhail

On Sun, Oct 16, 2016 at 3:26 PM, Mikhail V <mikhailwas@gmail.com> wrote:
The downward-projecting signals from the brain to the eye are heavily studied. In fact I have friends who specialize in studying those connections specifically. They simply don't behave the way you are describing. You are basing your claims on the superiority of certain sorts of glyphs on conjecture about how the brain works, conjecture that goes against what the evidence says about how the brain actually processes visual information. Yes, the quality of the glyphs can make a big difference. There is no indication, however, that the number of possible glyphs can.
I don't want to imply bad faith on your part, but you cut off an important part of what I said: "The size of the glyphs can make a difference, but not the number of available ones. On the contrary, having more glyphs increases the information density of text, reducing the amount of reading you have to do to get the same information." Badly-antialiased text can be a problem from that standpoint too. But again, none of this has anything whatsoever to do with the number of glyphs, which is your complaint. Again, I don't want to imply bad faith, but the argument you are making now is completely different than the argument I was addressing. I don't disagree that bad text quality or too much reading can hurt your eyes. On the contrary, I said explicitly that it can. The claim of yours that I was addressing is that having too many glyphs can hurt your eyes or brain, which doesn't match with anything we know about how the eyes or brain work.

On Oct 12, 2016 9:25 PM, "Chris Angelico" <rosuav@gmail.com> wrote:
Emoji, of course! What else?
-- Ryan (ライアン) [ERROR]: Your autotools build scripts are 200 lines longer than your program. Something’s wrong. http://kirbyfan64.github.io/

Hello, and welcome to Python-ideas, where only a small portion of ideas go further, and where most newcomers that wish to improve the language get hit by the reality bat! I hope you enjoy your stay :)
I'll turn your argument around: Not being comfortable with hex does not make it unreadable; it's a matter of habit (as Brendan pointed out in his separate reply).
Unicode code points are represented using hex notation virtually everywhere I ever saw it. Your Unicode-code-points-as-decimal website was a new discovery for me (and, I presume, many others on this list). Since it's widely used in the world, going against that effectively makes you non-standard. That doesn't mean it's necessarily a bad thing, but it does mean that your chances (or anyone's chances) of actually changing that are equal to zero (and this isn't some gross exaggeration),
I fail to see your point here. Where is that "everyone uses decimal"? Unless you stopped talking about representation in strings (which seems likely, as you're talking about indexing?), everything is represented as hex.
But I do want you to abstract yourself from your habit for a while and talk about what would be better for future usage.
I'll be that guy and tell you that you need to step back from your own idea for a while and consider your proposal and the current state of things. I'll also take the opportunity to reiterate that there is virtually no chance to change this behaviour. This doesn't, however, prevent you or anyone from talking about the topic, either for fun, or for finding other (related or otherwise) areas of interest that you think might be worth investigating further. A lot of threads actually branch off in different topics that came up when discussing, and that are interesting enough to pursue on their own.
You're mixing up more than just one concept here:
- Integer literals; I assume this is what you meant, and you seem to forget (or maybe you didn't know, in which case here's to learning something new!) that 0xff is perfectly valid syntax, and stores the integer with the value of 255 in base 10.
- Indexing, and that's completely irrelevant to the topic at hand (also see above bullet point).
- ord(), which returns an integer (which can be interpreted in any base!), and that's both an argument for and against this proposal; the "against" side is actually that decimal notation has no defined boundary for when to stop (and before you argue that it does, I'll point out that the separations, e.g. grouping by the thousands, are culture-driven and not an international standard).
There's actually a precedent for this in Python 2 with the \x escape (need I remind anyone why Python 3 was created again? :), but that's exactly a stone in the "don't do that" camp, instead of the other way around.
You seem to have misunderstood the question - in "\u00123456", there is no ambiguity that this is a string consisting of 5 characters; the first one is '\u0012', the second one is '3', the third one is '4', the fourth one is '5', and the last one is '6'. In the string (using \d as a hypothetical escape method; regex gurus can go read #27364 ;) "\d00123456", how many characters does this contain? It's decimal, so should the escape grab the first 5 digits? Or 6 maybe? You tell me.
No leading zeros? That means you don't have a fixed number of digits, and your string is suddenly very ambiguous (also see my point above).
Unicode, a temporary happening? Well, strictly speaking, nobody can know that, but I'd expect that it's going to, someday, be *the* common standard. I'm not bathed in illusion, though.
Mikhail
All in all, that's a pretty interesting idea. However, it has no chance of happening, because a lot of code would break, Python would deviate from the rest of the world, this wouldn't be backwards compatible (and another backwards-incompatible major release isn't happening; the community still hasn't fully caught up with the one 8 years ago), and it would be unintuitive to anyone who's done computer programming before (or after, or during, or anytime). I do see some bits worth pursuing in your idea, though, and I encourage you to keep going! As I said earlier, Python-ideas is a place where a lot of ideas are born and die, and that shouldn't stop you from trying to contribute. Python is 25 years old, and a bunch of stuff is there just for backwards compatibility; these kind of things can't get changed easily. The older (older by contribution period, not actual age) contributors still active don't try to fix what's not broken (to them). Newcomers, such as you, are a breath of fresh air to the language, and what helps make it thrive even more! By bringing new, uncommon ideas, you're challenging the status quo and potentially changing it for the best. But keep in mind that, with no clear consensus, the status quo always wins a stalemate. I hope that makes sense! Cheers, Emanuel

On 13 October 2016 at 04:49, Emanuel Barry <vgr255@live.ca> wrote:
A matter of habit does not reflect readability; see my last reply to Brendan. It is quite precise engineering. And readability is kind of a serious matter, especially if you decide on a programming career. Young people underestimate it, and for the oldies it is too late when they realize it :) And Python is all about readability, and I like that. As for your other points, I'll need to read them with a fresh head tomorrow. Of course I don't believe this would all suddenly happen with Python or any other programming language; it is just an idea anyway. And I do want to learn more, actually. I especially want to see some example where it would be really beneficial to use hex, either technically (some low-level binary-related stuff?) or regarding comprehension, which to my knowledge is hardly possible.
Mikhail

Mikhail V wrote:
Did you see much code written with hex literals?
From /usr/include/sys/fcntl.h:

/*
 * File status flags: these are used by open(2), fcntl(2).
 * They are also used (indirectly) in the kernel file structure f_flags,
 * which is a superset of the open/fcntl flags.  Open flags and f_flags
 * are inter-convertible using OFLAGS(fflags) and FFLAGS(oflags).
 * Open/fcntl flags begin with O_; kernel-internal flags begin with F.
 */
/* open-only flags */
#define O_RDONLY        0x0000          /* open for reading only */
#define O_WRONLY        0x0001          /* open for writing only */
#define O_RDWR          0x0002          /* open for reading and writing */
#define O_ACCMODE       0x0003          /* mask for above modes */
/*
 * Kernel encoding of open mode; separate read and write bits that are
 * independently testable: 1 greater than the above.
 *
 * XXX
 * FREAD and FWRITE are excluded from the #ifdef KERNEL so that TIOCFLUSH,
 * which was documented to use FREAD/FWRITE, continues to work.
 */
#if !defined(_POSIX_C_SOURCE) || defined(_DARWIN_C_SOURCE)
#define FREAD           0x0001
#define FWRITE          0x0002
#endif
#define O_NONBLOCK      0x0004          /* no delay */
#define O_APPEND        0x0008          /* set append mode */
#ifndef O_SYNC          /* allow simultaneous inclusion of <aio.h> */
#define O_SYNC          0x0080          /* synch I/O file integrity */
#endif
#if !defined(_POSIX_C_SOURCE) || defined(_DARWIN_C_SOURCE)
#define O_SHLOCK        0x0010          /* open with shared file lock */
#define O_EXLOCK        0x0020          /* open with exclusive file lock */
#define O_ASYNC         0x0040          /* signal pgrp when data ready */
#define O_FSYNC         O_SYNC          /* source compatibility: do not use */
#define O_NOFOLLOW      0x0100          /* don't follow symlinks */
#endif /* (_POSIX_C_SOURCE && !_DARWIN_C_SOURCE) */
#define O_CREAT         0x0200          /* create if nonexistant */
#define O_TRUNC         0x0400          /* truncate to zero length */
#define O_EXCL          0x0800          /* error if already exists */
#if !defined(_POSIX_C_SOURCE) || defined(_DARWIN_C_SOURCE)
#define O_EVTONLY       0x8000          /* descriptor requested for event notifications only */
#endif
#define O_NOCTTY        0x20000         /* don't assign controlling terminal */
#if !defined(_POSIX_C_SOURCE) || defined(_DARWIN_C_SOURCE)
#define O_DIRECTORY     0x100000
#define O_SYMLINK       0x200000        /* allow open of a symlink */
#endif
#ifndef O_DSYNC         /* allow simultaneous inclusion of <aio.h> */
#define O_DSYNC         0x400000        /* synch I/O data integrity */
#endif

-- Greg

Backing Greg up for a moment, hex literals are extremely common in any code that needs to work with binary data, such as network programming or fine data structure manipulation. For example, consider the frequent requirement to mask out certain bits of a given integer (e.g., keep the low 24 bits of a 32 bit integer). Here are a few ways to represent that:

integer & 0x00FFFFFF                      # Hex
integer & 16777215                        # Decimal
integer & 0o77777777                      # Octal
integer & 0b111111111111111111111111      # Binary

Of those four, hexadecimal has the advantage of being both extremely concise and clear. The octal representation is infuriating because one octal digit refers to *three* bits, which means that there is a non-whole number of octal digits in a byte (that is, one byte with all bits set is represented by 0o377). This causes problems both with reading comprehension and with most other common tasks. For example, moving from 0xFF to 0xFFFF (or 255 to 65535, also known as setting the next most significant byte to all 1) is represented in octal by moving from 0o377 to 0o177777. This is not an obvious transition, and I doubt many programmers could do it from memory in any representation but hex or binary.

Decimal is no clearer. Programmers know how to represent certain bit patterns from memory in decimal simply because they see them a lot: usually they can do the all 1s case, and often the 0 followed by all 1s case (255 and 128 for one byte, 65535 and 32767 for two bytes, and then increasingly few programmers know the next set). But trying to work out what mask to use for setting only bits 15 and 14 is tricky in decimal, while in hex it's fairly easy (in hex it's 0xC000, in decimal it's 49152).

Binary notation seems like the solution, but note the above case: the only way to work out how many bits are being masked out is to count them, and there can be quite a lot. IIRC there's some new syntax coming for binary literals that would let us represent them as 0b1111_1111_1111_1111, which would help the readability case, but it's still substantially less dense and loses clarity for many kinds of unusual bit patterns. Additionally, as the number of bits increases life gets really hard: masking out certain bits of a 64-bit number requires a literal that's at least 66 characters long, not including the underscores that would add another 15 underscores for a literal that is 81 characters long (more than the PEP8 line width recommendation). That starts getting unwieldy fast, while the hex representation is still down at 18 characters.

Hexadecimal has the clear advantage that each character wholly represents 4 bits, and the next 4 bits are independent of the previous bits. That's not true of decimal or octal, and while it's true of binary it costs a fourfold increase in the length of the representation. It's definitely not as intuitive to the average human being, but that's ok: it's a specialised use case, and we aren't requiring that all human beings learn this skill.

This is a very long argument to suggest that your argument against hexadecimal literals (namely, that they use 16 glyphs as opposed to the 10 glyphs used in decimal) is an argument that is too simple to be correct. Different collections of glyphs are clearer in different contexts. For example, decimal numerals can be represented using 10 glyphs, while the english language requires 26 glyphs plus punctuation.
But I don’t think you’re seriously proposing we should swap from writing English using the larger glyph set to writing it in decimal representation of ASCII bytes. Given this, I think the argument that says that the Unicode consortium said “write the number in hex” is good enough for me. Cory
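(To make the mask examples above concrete, a small sketch, assuming Python 3.6+ for the underscore grouping in the last literal:)

    n = 0xDEADBEEF
    print(hex(n & 0x00FFFFFF))      # 0xadbeef -- keep the low 24 bits
    # the same mask written in the four bases listed above:
    assert 0x00FFFFFF == 16777215 == 0o77777777 == 0b111111111111111111111111
    # the "bits 15 and 14 only" mask:
    assert 0xC000 == 49152 == 0b1100_0000_0000_0000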

On Thu, Oct 13, 2016 at 9:05 PM, Cory Benfield <cory@lukasa.co.uk> wrote:
Binary notation seems like the solution, but note the above case: the only way to work out how many bits are being masked out is to count them, and there can be quite a lot. IIRC there’s some new syntax coming for binary literals that would let us represent them as 0b1111_1111_1111_1111, which would help the readability case, but it’s still substantially less dense and loses clarity for many kinds of unusual bit patterns.
And if you were to write them like this, you would start to read them in blocks of four - effectively, treating each underscore-separated unit as a glyph, despite them being represented with four characters. Fortunately, just like with Hangul characters, we have a transformation that combines these multi-character glyphs into single characters. We call it 'hexadecimal'. ChrisA

On 13 October 2016 at 12:05, Cory Benfield <cory@lukasa.co.uk> wrote:
Correct, that makes it not so nice looking and not so 8-bit-paradigm friendly. That does not, however, make it a bad option in general, and according to my personal suppositions and work on glyph development, the optimal set is exactly 8 glyphs.
Decimal is no clearer.
In the same alignment-problem context, yes, correct.
Binary notation seems like the solution, ...
Agree with you; see my last reply to Greg for more thoughts on bitstrings and the quaternary approach.
IIRC there’s some new syntax coming for binary literals that would let us represent them as 0b1111_1111_1111_1111
Very good. Healthy attitude :)
less dense and loses clarity for many kinds of unusual bit patterns.
It is not very clear to me what exactly is meant by patterns here.
Additionally, as the number of bits increases life gets really hard: masking out certain bits of a 64-bit number requires
Editing such a bitmask in hex notation itself makes life hard. Editing it in binary notation makes life easier.
a literal that’s at least 66 characters long,
Length is a feature of binary, though it is not a major issue; see my ideas on it in my reply to Greg.
Hexadecimal has the clear advantage that each character wholly represents 4 bits,
This advantage is brevity, but one needs slightly less brevity to make it more readable. So what do you think about base 4?
I didn't understand this, sorry :))) You're welcome to ask more if you're interested in this.
Different collections of glyphs are clearer in different contexts. How many different collections, and how many different contexts?
while the english language requires 26 glyphs plus punctuation.
It does not *require* them, but of course 8 glyphs would not suffice to effectively read the language, so one finds a way to extend the glyph set. Roughly speaking, 20 letters is enough, but this is not an exact science. And it is quite a hard science.
I didn't understand this sentence :) In general I think we agree on many points, thank you for the input! Cheers, Mikhail

On Thu, Oct 13, 2016 at 03:56:59AM +0200, Mikhail V wrote:
How many decimal digits would you use to denote a single character?
for text, three decimal digits would be enough for me personally,
Well, if it's enough for you, why would anyone need more?
and in the long perspective, when the world's alphabetical garbage disappears, two digits would be ok.
Are you serious? Talking about "alphabetical garbage" like that makes you seem to be an ASCII bigot: rude, ignorant, arrogant and rather foolish as well. Even 7-bit ASCII has more than 100 characters (128). -- Steve

On Fri, Oct 14, 2016 at 1:25 AM, Steven D'Aprano <steve@pearwood.info> wrote:
Solution: Abolish most of the control characters. Let's define a brand new character encoding with no "alphabetical garbage". These characters will be sufficient for everyone:

* [2] Formatting characters: space, newline. Everything else can go.
* [8] Digits: 01234567
* [26] Lower case Latin letters a-z
* [2] Vital social media characters: # (now officially called "HASHTAG"), @
* [2] Can't-type-URLs-without-them: colon, slash (now called both "SLASH" and "BACKSLASH")

That's 40 characters that should cover all the important things anyone does - namely, Twitter, Facebook, and email. We don't need punctuation or capitalization, as they're dying arts and just make you look pretentious. I might have missed a few critical characters, but it should be possible to fit it all within 64, which you can then represent using two digits from our newly-restricted set; octal is better than decimal, as it needs fewer symbols. (Oh, sorry, so that's actually "50" characters, of which "32" are the letters. And we can use up to "100" and still fit within two digits.)

Is this the wrong approach, Mikhail? Perhaps we should go the other way, then, and be *inclusive* of people who speak other languages. Thanks to Unicode's rich collection of characters, we can represent multiple languages in a single document; see, for instance, how this uses four languages and three entirely distinct scripts: http://youtu.be/iydlR_ptLmk

Turkish and French both use the Latin script, but have different characters. Alphabetical garbage, or accurate representations of sounds and words in those languages? Python 3 gives the world's languages equal footing. This is a feature, not a bug. It has consequences, including that arbitrary character entities could involve up to seven decimal digits or six hex (although for most practical work, six decimal or five hex will suffice). Those consequences are a trivial price to pay for uniting the whole internet, as opposed to having pockets of different languages, like we had up until the 90s. ChrisA

On 13 October 2016 at 16:50, Chris Angelico <rosuav@gmail.com> wrote:
This is sort of rude. Are you from the Unicode consortium?
This is sort of the correct approach. We do need punctuation, however. And of course one does not need to make it too tight. So 8-bit units for text are excellent, with enough space left for experiments.
Perhaps we should go the other way, then, and be *inclusive* of people who speak other languages.
What keeps people from using the same characters? I will tell you what - it is local law. If you go to school you *have* to write in what is prescribed by big daddy. If you're in Europe or America, you are luckier. And if you're in China you'll be punished if you want some freedom. So like it or not, learn hieroglyphs and become visually impaired by the age of 18.
Thanks to Unicode's rich collection of characters, we can represent multiple languages in a single document;
You can do it without Unicode, within 8-bit boundaries, with tagged text; you just need fonts for your language, provided of course your local charset has fewer than 256 letters. This is how it was before Unicode, I suppose. BTW, I still don't get what revolutionary advantages Unicode has compared to tagged text.
script, but have different characters. Alphabetical garbage, or accurate representations of sounds and words in those languages?
Accurate representation with some 50 characters is more than enough. Mikhail

On Fri, Oct 14, 2016 at 6:53 PM, Mikhail V <mikhailwas@gmail.com> wrote:
No, he's not. He just knows a thing or two.
... okay. I'm done arguing. Go do some translation work some time. Here, have a read of some stuff I've written before. http://rosuav.blogspot.com/2016/09/case-sensitivity-matters.html http://rosuav.blogspot.com/2015/03/file-systems-case-insensitivity-is.html http://rosuav.blogspot.com/2014/12/unicode-makes-life-easy.html
Never mind about China and its political problems. All you need to do is move around Europe for a bit and see how there are more sounds than can be usefully represented. Turkish has a simple system wherein the written and spoken forms have direct correspondence, which means they need to distinguish eight fundamental vowels. How are you going to spell those? Scandinavian languages make use of letters like "å" (called "A with ring" in English, but identified by its sound in Norwegian, same as our letters are - pronounced "Aww" or "Or" or "Au" or thereabouts). To adequately represent both Turkish and Norwegian in the same document, you *need* more letters than our 26.
No, you can't. Also, you shouldn't. It makes virtually every text operation impossible: you can't split and rejoin text without tracking the encodings. Go try to write a text editor under your scheme and see how hard it is.
It's not tagged. That's the huge advantage.
Go build a chat room or something. Invite people to enter their names. Now make sure you're courteous enough to display those names to people. Try doing that without Unicode. I'm done. None of this belongs on python-ideas - it's getting pretty off-topic even for python-list, and you're talking about modifying Python 2.7 which is a total non-starter anyway. ChrisA

So you know, for the future, I think this comment is going to be the one that causes most of the people who were left to disengage with this discussion. The many glyphs that exist for writing various human languages are not inefficiency to be optimised away. Further, I should note that most places do not legislate about what character sets are acceptable to transcribe their languages. Indeed, plenty of non-romance-language-speakers have found ways to transcribe their languages of choice into the limited 8-bit character sets that the Anglophone world propagated: take a look at Arabish for the best kind of example of this behaviour, where "الجو عامل ايه النهارده فى إسكندرية؟" will get rendered as "el gaw 3amel eh elnaharda f eskendereya?"

But I think you're in a tiny minority of people who believe that all languages should be rendered in the same script. I can think of only two reasons to argue for this:

1. Dealing with lots of scripts is technologically tricky and it would be better if we didn't bother. This is the anti-Unicode argument, and it's a weak argument, though it has the advantage of being internally consistent.
2. There is some genuine harm caused by learning non-ASCII scripts. Your paragraph suggests that you really believe that learning to write in Kanji (logographic system) as opposed to Katakana (alphabetic system with 48 non-punctuation characters) somehow leads to active harm (your phrase was "become visually impaired"). I'm afraid that you're really going to need to provide one hell of a citation for that, because that's quite an extraordinary claim.

Otherwise, I'm afraid I have to say お先に失礼します. Cory

On Fri, Oct 14, 2016 at 7:18 PM, Cory Benfield <cory@lukasa.co.uk> wrote:
The many glyphs that exist for writing various human languages are not inefficiency to be optimised away. Further, I should note that most places do not legislate about what character sets are acceptable to transcribe their languages. Indeed, plenty of non-romance-language-speakers have found ways to transcribe their languages of choice into the limited 8-bit character sets that the Anglophone world propagated: take a look at Arabish for the best kind of example of this behaviour, where "الجو عامل ايه النهارده فى إسكندرية؟" will get rendered as "el gaw 3amel eh elnaharda f eskendereya?"
I've worked with transliterations enough to have built myself a dedicated translit tool. It's pretty straight-forward to come up with something you can type on a US-English keyboard (eg "a\o" for "å", and "d\-" for "đ"), and in some cases, it helps with visual/audio synchronization, but nobody would ever claim that it's the best way to represent that language. https://github.com/Rosuav/LetItTrans/blob/master/25%20languages.srt
#1 does carry a decent bit of weight, but only if you start with the assumption that "characters are bytes". If you once shed that assumption (and the related assumption that "characters are 16-bit numbers"), the only weight it carries is "right-to-left text is hard"... and let's face it, that *is* hard, but there are far, far harder problems in computing. Oh wait. Naming things. In Hebrew. That's hard. ChrisA


On 14.10.2016 10:26, Serhiy Storchaka wrote:
And then we store Python identifiers in a single 64-bit word, allow at most 20 chars per identifier and use the remaining bits for cool things like type information :-) Not a bad idea, really. But then again: even microbits support Unicode these days, so apparently there isn't much need for such memory footprint optimizations anymore. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Oct 15 2016)

Mikhail V wrote:
And decimal is objectively way more readable than hex standard character set, regardless of how strong your habits are.
That depends on what you're trying to read from it. I can look at a hex number and instantly get a mental picture of the bit pattern it represents. I can't do that with decimal numbers. This is the reason hex exists. It's used when the bit pattern represented by a number is more important to know than its numerical value. This is the case with Unicode code points. Their numerical value is irrelevant, but the bit pattern conveys useful information, such as which page and plane it belongs to, whether it fits in 1 or 2 bytes, etc. -- Greg
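(A small sketch of the kind of information Greg means, assuming Python 3; the code point is just an example:)

    cp = ord("\N{GRINNING FACE}")   # U+1F600
    print(hex(cp))                  # 0x1f600 -- plane and page are visible at a glance
    print(cp)                       # 128512  -- the decimal form hides that structure
    print(cp >> 16)                 # 1       -- the plane number, read straight off the top hex digit
    print(cp <= 0xFFFF)             # False   -- i.e. it does not fit in a single UTF-16 code unit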

Forgot to reply to all, duping my message... On 12 October 2016 at 23:48, M.-A. Lemburg <mal@egenix.com> wrote:
I posted output with Python 2 and Windows 7, BTW. In Windows 10, 'print' won't work in the cmd console at all by default with unicode, but that's another story; let us not go into that. I think you got my idea right: it is not only about printing.
In programming literature hex is used often, but let me point out that decimal is THE standard and is a much, much better standard in the sense of readability. And there is no solid reason to use 2 standards at the same time.
How is it not clear, if the digit amount is fixed? It is not very clear to me what you meant.

On 13.10.2016 01:06, Mikhail V wrote:
I guess it's a matter of choosing the right standard for the right purpose. For \uXXXX and \UXXXXXXXX the intention was to be able to represent a Unicode code point using its standard Unicode ordinal representation and since the standard uses hex for this, it's quite natural to use the same here.
Unicode code points have ordinals from the range [0, 1114111], so it's not clear where to stop parsing the decimal representation and continue to interpret the literal as regular string, since I suppose you did not intend everyone to have to write \u0000010 just to get a newline code point to avoid the ambiguity. PS: I'm not even talking about the breakage such a change would cause. This discussion is merely about the pointing out how things got to be how they are now. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Oct 13 2016)

On 13 October 2016 at 10:18, M.-A. Lemburg <mal@egenix.com> wrote:
Ok, there are different usage cases. So in short, without going into detail: for example, if I need to type a unicode string literal into an ASCII editor, I would find such a notation replacement beneficial for me:

u'\u0430\u0431\u0432.txt' --> u"{1072}{1073}{1074}.txt"

Printing could be the same, I suppose. I use Python 2.7, and printing with numbers instead of non-ASCII characters would help me see where I have non-ASCII chars. But I think the print behavior must be easily configurable. Any criticism of it? Besides not following the unicode consortium. Also, I would not even mind fixed-width 7-digit decimals, actually. Ugly, but still better for me than hex. Mikhail
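(Not a built-in; just a sketch of how one could already get the representation described above with a small helper, where dec_escape is a made-up name:)

    def dec_escape(s):
        # show every non-ASCII character as a {decimal ordinal} placeholder
        return ''.join(ch if ord(ch) < 128 else '{%d}' % ord(ch) for ch in s)

    print(dec_escape(u"\u0430\u0431\u0432.txt"))   # {1072}{1073}{1074}.txt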

On Fri, Oct 14, 2016 at 08:05:40AM +0200, Mikhail V wrote:
Any criticism of it? Besides not following the unicode consortium.
Besides the other remarks on "tradition", I think this is where a big problem lies: we should not deviate from a common standard (without very good cause). There are cases where a language does well by deviating from the common standard. There are also cases where it is bad to deviate. Almost all current programming languages understand unicode, for instance:

* C: http://en.cppreference.com/w/c/language/escape
* C++: http://en.cppreference.com/w/cpp/language/escape
* JavaScript: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Grammar_and_ty...

and those were only the first 3 I tried. They all use `\u` followed by 4 hexadecimal digits. You may not like the current standard. You may think/know/... it to be suboptimal for human comprehension. However, what you are suggesting is a very costly change. A change where --- I believe --- Python should not take the lead, but also should not be afraid to follow if other programming languages start to change. I would suggest that this is a change that might be best proposed to the unicode consortium itself, instead of going to (just) a programming language. It'd be interesting to see whether or not you can convince the unicode consortium that 8 symbols will be enough.

FWIW, Python 3.6 should print this in the console just fine. Feel free to upgrade whenever you're ready. Cheers, Steve

On 10/12/2016 05:33 PM, Mikhail V wrote:
Hello all,
Hello! New to this list so not sure if I can reply here... :)
Now printing it we get:
u'\u0430\u0431\u0432.txt'
By "printing it", do you mean "this is the string representation"? I would presume printing it would show characters nicely rendered. Does it not for you?
Since when was decimal notation "standard"? It seems to be quite the opposite. For unicode representations, byte notation seems standard.
This is an opinion. I should clarify that in many cases I personally find byte notation much simpler. In this case, I view it as a toss-up, though for something like utf8-encoded text I would hate it if I saw decimal numbers and not bytes.
2. Mixing of two notations (hex and decimal) is a _very_ bad idea, I hope no need to explain why.
Still not sure which "mixing" you refer to.
Cheers, Thomas

Mikhail V wrote:
Consider unicode table as an array with glyphs.
You mean like this one? http://unicode-table.com/en/ Unless I've miscounted, that one has the characters arranged in rows of 16, so it would be *harder* to look up a decimal index in it. -- Greg

On 13 October 2016 at 08:02, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
A nice point finally, I admit, although quite a minor one. Where the data implies such paging or alignment, the notation should (probably) be more binary-oriented. But: you claim to see bit patterns in hex numbers? Then I bet you will see them much better if you take binary notation (2 symbols) or quaternary notation (4 symbols), I guarantee it. And if you take a consistent glyph set for them as well, you'll see them twice as well, also 100% guaranteed. So it is not that decimal is cool, but that hex sucks (too big an alphabet) and _the character set_ used for hex optically sucks. That is the point. On the other hand, why would the unicode glyph table, which is for the biggest part a museum of glyphs, necessarily be paged in a binary-friendly manner and not in a decimal-friendly manner? But I am not saying whether it should be or not; it's quite irrelevant for this particular case, I think. Mikhail

Mikhail V wrote:
Nope. The meaning of 0xC001 is much clearer to me than 1100000000000001, because I'd have to count the bits very carefully in the latter to distinguish it from, e.g. 6001 or 18001. The bits could be spaced out: 1100 0000 0000 0001 but that just takes up even more room to no good effect. I don't find it any faster to read -- if anything, it's slower, because my eyes have to travel further to see the whole thing. Another point -- a string of hex digits is much easier for me to *remember* if I'm transliterating it from one place to another. Not only because it's shorter, but because I can pronounce it. "Cee zero zero one" is a lot easier to keep in my head than "one one zero zero zero zero zero zero zero zero zero zero zero zero zero one"... by the time I get to the end, I've already forgotten how it started!
And if you take consistent glyph set for them also you'll see them twice better, also guarantee 100%.
When I say "instantly", I really do mean *instantly*. I fail to see how a different glyph set could reduce the recognition time to less than zero. -- Greg

Greg Ewing wrote:
Good example. But it is not average high-level code, of course. The example again works only if we for some reason follow binary segmentation, which is bound to low-level functionality, in this case 8-bit grouping.
c = "\u1235"
if "\u1230" <= c <= "\u123f":
I don't see a need for any conversions back and forth.
I'll explain what I mean with an example. This is an example which I would make to support my proposal. Compare:

if "\u1230" <= c <= "\u123f":

and:

o = ord(c)
if 100 <= o <= 150:

So yours is valid code, but for me it's freaky, and I surely stick to the second variant. You said I can better see which unicode page I am in by looking at the hex ordinal, but I hardly need that; I just need to know one integer, namely where some range begins, that's it. Furthermore, this is the code which an average programmer would better read and maintain. So it is a question of maintainability (+1). Secondly, for me it is a question of being able to type in and correct these decimal values: look, if I make a mistake, a typo, or want to expand the range by some value, I need to do addition and subtraction in my head to progress with my code effectively. Obviously nobody operates well with two notations in their head simultaneously, so with decimal only I would complete my code without extra effort. Is it clear now what I mean by conversions back and forth? This example alone actually explains my whole point very well, I feel however like being misunderstood or so.
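(A runnable version of the two styles being compared, assuming Python 3; the decimal bounds 4656 and 4671 are just 0x1230 and 0x123F written out, since the 100..150 above is only illustrative:)

    c = "\u1235"
    if "\u1230" <= c <= "\u123f":
        print("matched via string comparison")
    if 4656 <= ord(c) <= 4671:
        print("matched via decimal ordinals")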
Yes, ideally one would use other glyphs for base-16; that does not however mean that one must use newly invented glyphs. In standard ASCII there are enough glyphs that would work way better together, but it is too late anyway; it should have been decided at the time the standard was declared. Got to love it.
Greg, I feel somehow that you are an open-minded person, and I value this. You also understand quite well how you read. What you refer to here is the brevity of the word. Indeed there is some degradation of readability if the word is too big, or a font is set to a big size, so you break it; one step towards better. And now I'll explain to you some further magic regarding the binary representation; if you find free time you can experiment a bit.

So what is so peculiar about a bitstring actually? A bitstring, unlike higher bases, can be treated as absence/presence of a signal, which is not possible for higher bases; literally, a binary string can be made almost "analphabetic", if one could say so. Consider such a notation: instead of

1100 0000 0000 0001

you write

ұұ-ұ ---- ---- ---ұ

(NOTE: of course if you read this in a non-monospaced font you will not see it correctly; I should make screenshots, which I will do in a while.) Note that I did not choose this letter accidentally; this letter is similar to one of the glyphs with peak readability. The unset value would simply be a stroke, so I take only one letter.

ұұ-ұ ---- ---- ---ұ
---ұ ---ұ --ұ- -ұ--
--ұ- ---- ---- ---ұ
---- ---- --ұ- ---ұ
-ұұ- ұұ-- ---- ----
---- ---- ---- ----
--ұұ ---- ұ--- ---ұ
-ұ-- --ұұ ---- ---ұ

So the digits need not be equally weighted as in higher bases. What does that bring? Simple: you can downscale the strings, so a 16-bit value would be ~60 pixels wide (for a 96dpi display) without legibility loss, which compensates for the "too wide to scan" issue. And don't forget to use enough line spacing. Other benefits of a binary string, obviously:
- nice editing features like bit shifting
- very interesting cognitive features (they become more noticeable, however, if you train to work with it)
...
So there is a whole bunch of good effects. Understand me right: I have no reason not to believe you when you say you don't see any effect, but you should always remember that this can simply be caused by your habit. So if you are more than 40 years old (sorry for some familiarity) this can be a really strong issue and unfortunately hardly changeable. It is not bad, it is a natural thing; it is like that with everyone.
It is not about speed, it is about brain load. The Chinese can read their hieroglyphs fast, but the cognitive load on the brain is 100 times higher than with the current Latin set. I know people who can read bash scripts fast, but would you claim that bash syntax is any good compared to Python syntax?
As already noted, another good alternative for 8-bit-aligned data would be quaternary notation; it is 2x more compact and can be very legible due to the few glyphs, and it is also possible to emulate it with existing characters. Mikhail

On Fri, Oct 14, 2016 at 07:21:48AM +0200, Mikhail V wrote:
For an English-speaker writing that, I'd recommend: if "\N{ETHIOPIC SYLLABLE SA}" <= c <= "\N{ETHIOPIC SYLLABLE SHWA}": ... which is a bit verbose, but that's the price you pay for programming with text in a language you don't read. If you do read Ethiopian, then you can simply write: if "ሰ" <= c <= "ሿ": ... which to a literate reader of Ethiopian, is no harder to understand than the strange and mysterious rotated and reflected glyphs used by Europeans: if "d" <= c <= "p": ... (Why is "double-u" written as vv (w) instead of uu?)
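(For what it's worth, the named escape, the hex escape and the literal glyph all spell the same one-character string; a quick check, purely as an illustration:)

>>> "\N{ETHIOPIC SYLLABLE SA}" == "\u1230" == "ሰ"
True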
Which is clearly not the same thing, and better written as: if "d" <= c <= "\x96": ...
No, the average programmer is MUCH more skillful than that. Your standard for what you consider "average" seems to me to be more like "lowest 20%". [...]
I feel, however, that I am being misunderstood.
Trust me, we understand you perfectly. You personally aren't familiar or comfortable with hexadecimal, Unicode code points, or programming standards which have been in widespread use for at least 35 years, and probably more like 50, but rather than accepting that this means you have a lot to learn, you think you can convince the rest of the world to dumb-down and de-skill to a level that you are comfortable with. And that eventually the entire world will standardise on just 100 characters, which you think is enough for all communication, maths and science. Good luck with that last one. Even if you could convince the Chinese and Japanese to swap to ASCII, I'd like to see you pry the emoji out of the young folk's phones. [...]
Citation required. -- Steve

On Fri, Oct 14, 2016, at 01:54, Steven D'Aprano wrote:
This is actually probably the one part of this proposal that *is* feasible. While encoding emoji as a single character each makes sense for a culture that already uses thousands of characters, before emoji existed the English-speaking software industry already had several competing "standards" emerging for encoding them as sequences of ASCII characters.

On Fri, Oct 14, 2016 at 07:56:29AM -0400, Random832 wrote:
It really isn't feasible to use emoticons instead of emoji, not if you're serious about it. To put it bluntly, emoticons are amateur hour. Emoji implemented as dedicated code points are what professionals use. Why do you think phone manufacturers are standardising on dedicated code points instead of using emoticons?

Anyone who has ever posted (say) source code on IRC, Usenet, email or many web forums has probably seen unexpected smileys in the middle of their code (false positives). That's because some sequence of characters is being wrongly interpreted as an emoticon by the client software. The more emoticons you support, the greater the chance this will happen. A concrete example: bash code in Pidgin (IRC) will often show unwanted smileys. The quality of applications can vary greatly: once the false emoticon is displayed as a graphic, you may not be able to copy the source code containing the graphic and paste it into a text editor unchanged.

There are false negatives as well as false positives: if your :-) happens to fall on the boundary of a line, and your software breaks the sequence with a soft line break, instead of seeing the smiley face you expected, you might see a line ending with :- and a new line starting with ). It's hard to use punctuation or brackets around emoticons without risking them being misinterpreted as an invalid or different sequence.

If you are serious about offering smileys, snowmen and piles of poo to your users, you are much better off supporting real emoji (dedicated Unicode characters) instead of emoticons. It is much easier to support ☺ than :-) and you don't need any special software apart from fonts that support the emoji you care about. -- Steve

Steven D'Aprano wrote:
That's because some sequence of characters is being wrongly interpreted as an emoticon by the client software.
The only thing wrong here is that the client software is trying to interpret the emoticons. Emoticons are for *humans* to interpret, not software. Subtlety and cleverness is part of their charm. If you blatantly replace them with explicit images, you crush that. And don't even get me started on *animated* emoji... -- Greg

On Sat, Oct 15, 2016 at 01:42:34PM +1300, Greg Ewing wrote:
Heh :-) I agree with you. But so long as people want, or at least phone and software developers think people want, graphical smiley faces and dancing paperclips and piles of poo, then emoticons are a distinctly more troublesome way of dealing with them. -- Steve

Mikhail V wrote:
Note that, if need be, you could also write that as if 0x64 <= o <= 0x96:
So yours is valid code, but for me it's freaky, and I would surely stick to the second variant.
The thing is, where did you get those numbers from in the first place? If you got them in some way that gives them to you in decimal, such as print(ord(c)), there is nothing to stop you from writing them as decimal constants in the code. But if you got them e.g. by looking up a character table that gives them to you in hex, you can equally well put them in as hex constants. So there is no particular advantage either way.
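(For instance, the hex and decimal spellings denote exactly the same values, so either can be typed in directly; a quick illustration:)

>>> 0x1230 == 4656
True
>>> "\u1230" == chr(4656) == chr(0x1230)
True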
To a maintainer who is familiar with the layout of the unicode code space, the hex representation of a character is likely to have some meaning, whereas the decimal representation will not. So for that person, using decimal would make the code *harder* to maintain. To a maintainer who doesn't have that familiarity, it makes no difference either way. So your proposal would result in a *decrease* of maintainability overall.
Yes, but in my experience the number of times I've had to do that kind of arithmetic with character codes is very nearly zero. And when I do, I'm more likely to get the computer to do it for me than work out the numbers and then type them in as literals. I just don't see this as being anywhere near being a significant problem.
Out of curiosity, what glyphs do you have in mind?
Yes, you can make the characters narrow enough that you can take 4 of them in at once, almost as though they were a single glyph... at which point you've effectively just substituted one set of 16 glyphs for another. Then you'd have to analyse whether the *combined* 4-element glyphs were easier to distinguish from each other than the ones they replaced. Since the new ones are made up of repetitions of just two elements, whereas the old ones contain a much more varied set of elements, I'd be skeptical about that. BTW, your choice of ұ because of its "peak readability" seems to be a case of taking something out of context. The readability of a glyph can only be judged in terms of how easy it is to distinguish from other glyphs. Here, the only thing that matters is distinguishing it from the other symbol, so something like "|" would perhaps be a better choice.

||-| ---- ---- ---|
Sure, being familiar with the current system means that it would take me some effort to become proficient with a new one. What I'm far from convinced of is that I would gain any benefit from making that effort, or that a fresh person would be noticeably better off if they learned your new system instead of the old one. At this point you're probably going to say "Greg, it's taken you 40 years to become that proficient in hex. Someone learning my system would do it much faster!" Well, no. When I was about 12 I built a computer whose only I/O devices worked in binary. From the time I first started toggling programs into it to the time I had the whole binary/hex conversion table burned into my neurons was maybe about 1 hour. And I wasn't even *trying* to memorise it, it just happened.
Has that been measured? How? This one sets off my skepticism alarm too, because people that read Latin scripts don't read them a letter at a time -- they recognise whole *words* at once, or at least large chunks of them. The number of English words is about the same order of magnitude as the number of Chinese characters.
For the things that bash was designed to be good for, yes, it can. Python wins for anything beyond very simple programming, but bash wasn't designed for that. (The fact that some people use it that way says more about their dogged persistence in the face of adversity than it does about bash.) I don't doubt that some sets of glyphs are easier to distinguish from each other than others. But the letters and digits that we currently use have already been pretty well optimised by scribes and typographers over the last few hundred years, and I'd be surprised if there's any *major* room left for improvement. Mixing up letters and digits is certainly jarring to many people, but I'm not sure that isn't largely just because we're so used to mentally categorising them into two distinct groups. Maybe there is some objective difference that can be measured, but I'd expect it to be quite small compared to the effect of these prior "habits" as you call them. -- Greg

On Fri, Oct 14, 2016 at 8:36 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
And any time I look at a large and complex bash script and say "this needs to be a Python script" or "this would be better done in Pike" or whatever, I end up missing the convenient syntax of piping one thing into another. Shell scripting languages are the undisputed kings of process management. ChrisA

On 14 October 2016 at 11:36, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
I cannot judge what bash is good for, since I have never tried to learn it. But it does *look* frightening indeed. The first feeling is: OMG, I must close this and never look at it again. Also I can hardly imagine that the special purpose of some language can justify ignoring readability; even if it is assembler or whatever, it can be made readable without much effort. So I just look for some other solution for the same task, even if it takes 10 times more code.
That is because that person has (blindly) followed the convention from the beginning. So my intention of course was not to find out whether the majority does or not, but rather which of the two makes more sense *initially*, just trying to imagine that we could decide. To be more precise, if you were to choose between two options:

1. use hex for the glyph index and hex for numbers (e.g. some arbitrary value like screen coordinates);
2. use decimal for both cases.

I personally choose option 2. Probably nothing will convince me that option 1 would be better; all the more so since I don't believe that anything more than base-8 makes much sense for readable numbers. I am just a little bit disappointed that others again and again speak of convention.
I didn't mean that; it just slightly annoys me.
Out of curiosity, what glyphs do you have in mind?
If I were to decide, I would look into a few options here:

1. The easy option, which would raise fewer further questions, is to take the first 16 lowercase letters.
2. The better option would be to choose letters and possibly other glyphs to build up a more readable set. E.g. drop the letter "c" and keep "e" due to their optical collision, and drop some other weak glyphs, like "l" and "h". That of course would raise many further questions, like why do you drop this glyph and not that one, and so on, so it would surely end up in a quarrel.

Here lies another problem - the non-constant width of letters - but that is more a problem of fonts and rendering, so it concerns IDEs and editors. But as said, I won't recommend base 16 at all.
No, no. I didn't mean to shrink them until they melt together. The structure is still there, only that with such a notation you don't need to keep the glyphs as big as with many-glyph systems.
I get your idea and it is a very good point. It seems you have experience with such things? Currently I don't know for sure whether such an approach is more or less effective than others, and for which cases. But I can bravely claim that it is better than *any* hex notation; it just follows from what I have here on paper on my table, namely that it is physically impossible to make up a highly effective glyph system of more than 8 symbols. You want more only if you really *need* more glyphs. And skepticism should always be present.

One thing, however, especially interests me: here not only the differentiation of glyphs comes into play, but also the positional principle, which helps to compare, and that can be beneficial for specific cases. So you can clearly see if one number is two times bigger than another, for example. And of course, strictly speaking, those bit groups are not glyphs; you can of course call them so, but that is just rhetoric. One could likewise call all written English words glyphs, but they are not really. But I get your analogy; this is how the tests should be made.
True and false. Each glyph taken on its own has a specific structure, and standing alone it has optical qualities. This is quite complicated and hard to describe in words, but anyway, only tests can tell what is better. In this case it is still 2 glyphs, or better said one and a half glyphs. And indeed you can distinguish them really well since they have different mass.
||-| ---- ---- ---|
I get your idea, although it is not a really correct statement, see above. A vertical bar is hardly a good glyph, actually quite a bad one. Such a notation will cause quite an uncomfortable effect on the eyes, and there are many things here. Less technically, here is a rule:

- a good glyph has a structure, and the boundary of the glyph is a proportional form (like a bulb) (not your case);
- vertical gaps/shears inside these boundaries are bad (your case). One can't always do without them, but vertical ones are much worse than horizontal ones;
- too primitive a glyph structure is bad (your case).

So a bar is good only as some punctuation sign. For this exact reason such letters as "l", "L", "i" are bad ones, especially their sans-serif variants. And *not* in the first place because they collide with other glyphs. This is somewhat non-obvious. One should understand of course that I just took standard symbols that only try to mimic the correct representation. So if you play around with bitstrings sometime, here are the ASCII-only variants which work best:

-y-y ---y -yy- -y--
-o-o ---o -oo- -o--
-k-k ---k -kk- -k--
-s-s ---s -ss- -s--

No need to say that these will be way, way better than the "01" notation which is used as standard. If you read a lot of numbers you should have noticed how unpleasant it is to scan through 010101.
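If you want to experiment with this, here is a minimal Python sketch (the helper name render_bits is just for illustration, not an existing function) that re-renders an ordinary '0'/'1' bitstring in the '-'/'y' variant above:

def render_bits(value, width=16, one="y", zero="-", group=4):
    # Format the value as a fixed-width binary string, swap the digit
    # glyphs, and break it into groups of four for easier scanning.
    bits = format(value, "0{}b".format(width))
    marks = bits.replace("1", one).replace("0", zero)
    return " ".join(marks[i:i + group] for i in range(0, len(marks), group))

print(render_bits(0xD001))   # -> yy-y ---- ---- ---y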
"far from convinced" sounds quite positive however :) it is never too late. I heard from Captain Crunch https://en.wikipedia.org/wiki/John_Draper That he was so tired of C syntax that he finally switched to Python for some projects. I can imagine how unwanted this can be in age. All depends on tasks that one often does. If say, imagine you'll read binaries for a long time, in one of notations I proposed above (like "-y-- --y-" for example) and after that try to switch back to "0100 0010" notation, I bet you will realize that better. Indeed learning new notation for numbers is quite easy, it is only some practice. And with base-2 you don't need learn at all, just can switch to other notation and use straight away.
Has that been measured? How?
I don't think it is measurable at all. That is my opinion, and the 100 just shows that I think it is very stressful, also due to the lot of meaning disambiguation that such a system can cause. I have also heard personal complaints from young Chinese students; they all had vision problems already in their early years, but I cannot support that officially. So just imagine: if we take it for truth that the maximum number of effective glyphs is 8, and hieroglyphs are *all* printed in the same sized box - how would this provide efficient reading? And if you've seen Chinese books, they are all printed with quite a small font. I am not a very sentimental person but somehow I feel sorry for the people; one doesn't deserve it. You know, I became friends with a Chinese girl; she loves to read and is eager to learn, and needs to always carry a pair of glasses with her everywhere. Somehow sad I become now, writing it; she is such a sweet young girl... And yes, in this sense one can say that this cognitive load can be measured: go to universities in China and count those with vision problems.
I don't doubt that some sets of glyphs are easier to distinguish from each other than others. But the
That sounds good; it is not so often that one realizes that :) Most people would say "it's just a matter of habit".
Here I would slightly disagree. First, *digits* are not optimised for anything; they are just a heritage from ancient times. They have some minimal readability - namely "2" is not bad, the others are quite poor. Second, *small Latin letters* are indeed well fabricated. However, don't have the illusion that someone cared much about their optimisation in the last 1000 years. If you are skeptical about that, take a look at this http://daten.digitale-sammlungen.de/~db/bsb00003258/images/index.html?seite=... If you believe (there are skeptics who do not believe) that this dates back to the end of the 10th century, then we have an interesting picture here: you see that this is indeed very similar to what you read now, somewhat optimised of course, but without much improvement. Actually in some cases there is even some degradation: now we have the "pbqd" letters, which are just rotations and reflections of each other, which is no good. Strictly speaking you can use only one of these 4 glyphs. And in the last 500 years there have been zero modifications. How much improvement can be made is a hard question. According to my results, the peak-readability forms are indeed similar to certain small Latin letters, but I would say quite a significant improvement could be made. But this is not really measurable. Mikhail

On Sun, Oct 16, 2016 at 12:06 AM, Mikhail V <mikhailwas@gmail.com> wrote:
You should go and hang out with jmf. Both of you have made bold assertions that our current system is physically/mathematically impossible, despite the fact that *it is working*. Neither of you can cite any actual scientific research to back your claims. Bye bye. ChrisA

Mikhail V wrote:
Also I can hardly imagine that the special purpose of some language can justify ignoring readability,
Readability is not something absolute that stands on its own. It depends a great deal on what is being expressed.
even if it is assembler or whatever, it can be made readable without much effort.
You seem to be focused on a very narrow aspect of readability, i.e. fine details of individual character glyphs. That's not what we mean when we talk about readability of programs.
So I just look for some other solution for the same task, even if it takes 10 times more code.
Then it will take you 10 times longer to write, and will on average contain 10 times as many bugs. Is that really worth some small, probably mostly theoretical advantage at the typographical level?
That is because that person has (blindly) followed the convention from the beginning.
What you seem to be missing is that there are *reasons* for those conventions. They were not arbitrary choices. Ultimately they can be traced back to the fact that our computers are built from two-state electronic devices. That's definitely not an arbitrary choice -- there are excellent physical reasons for it. Base 10, on the other hand, *is* an arbitrary choice. Due to an accident of evolution, we ended up with 10 easily accessible appendages for counting on, and that made its way into the counting system that is currently the most widely used by everyday people. So, if anything, *you're* the one who is "blindly following tradition" by wanting to use base 10.
Well, that's the thing. If there were large, objective, easily measurable differences between different possible sets of glyphs, then there would be no room for such arguments. The fact that you anticipate such arguments suggests that any differences are likely to be small, hard to measure and/or subjective.
I think "on paper" is the important thing here. I suspect you are looking at the published results from some study or other and greatly overestimating the size of the effects compared to other considerations. -- Greg

On 16 October 2016 at 02:58, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
In this discussion yes, but layout aspects can also be improved, and I suppose the special purpose of a language does not always dictate the layout of the code; it is up to you, who can define that as well. And glyphs are not a very narrow aspect, they are one of the fundamental aspects. They are also much harder to develop than a good layout, note that.
So, if anything, *you're* the one who is "blindly following tradition" by wanting to use base 10.
Yes, because when I was a child I learned it everywhere for everything, and so did others. As said, I don't defend the usage of base-10, as you can already note from my posts.
Those things cannot be easily measured, if at all; it requires a lot of tests and a huge amount of time, and you cannot plug a measuring device into the brain to precisely measure the load. In this case the only choice is to trust the most experienced people who show the results that worked better for them, and to try to implement and compare them yourself. Not that I'm saying you have any special reason to trust me personally.
If you try to google that particular topic you'll see that there is zero related published material; there are tons of papers on readability, but zero concrete proposals or any attempts to develop something real. That is the thing. I would look into the results if there were any. In my case I am looking at what I've achieved during years of my work on it, and indeed there are some interesting things there. Not that I am overestimating its role, but indeed it can really help in many cases, e.g. like in my example with bitstrings. Last but not least, I am not a "paper ass" in any case; I try to keep to experimental work only, where possible. Mikhail

Mikhail V wrote:
Those things cannot be easily measured, if at all,
If you can't measure something, you can't be sure it exists at all.
Have you *measured* anything, though? Do you have any feel for how *big* the effects you're talking about are?
There must be a *very* solid reason for digits+letters against my variant; I wonder what it is.
The reasons only have to be *very* solid if there are *very* large advantages to the alternative you propose. My conjecture is that the advantages are actually extremely *small* by comparison. To refute that, you would need to provide some evidence to the contrary. Here are some reasons in favour of the current system:

* At the point where most people learn to program, they are already intimately familiar with reading, writing and pronouncing letters and digits.

* It makes sense to use 0-9 to represent the first ten digits, because they have the same numerical value.

* Using letters for the remaining digits, rather than punctuation characters, makes sense because we're already used to thinking of them as a group.

* Using a consecutive sequence of letters makes sense because we're already familiar with their ordering.

* In the absence of any strong reason otherwise, we might as well take them from the beginning of the alphabet.

Yes, those are all based on "habits", but they're habits shared by everyone, just like the base 10 that you have a preference for. You would have to provide some strong evidence that it's worth disregarding them and using your system instead. -- Greg

On 16 October 2016 at 23:23, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Those things cannot be easily measured, if at all,
If you can't measure something, you can't be sure it exists at all.
What do you mean, I can't be sure? I am a fully functional, mentally healthy man :)
It depends on the case, of course. But for the difference between "0010 0011" and "--k- --kk" I can indeed feel a big difference. Literally, I can read the latter clearly even if I close my left eye and *fully* defocus my right eye. That is indeed a big difference and tells a lot. I suppose for people with impaired vision this would be the only chance to see anything there. Currently I experiment on myself, and of course I plan to do it with experimental subjects; I plan one survey session at the end of November. But indeed this is very off-topic, so feel free to mail me, if anything. So back to hex notation, which is still not so off-topic, I suppose.
There must be a *very* solid reason for digits+letters against my variant; I wonder what it is.
First, I am of the opinion that the *initial* decision in such a case must be supported by solid reasons, and not just "hey, John has already written them in such a manner, let's take it!". Second, I totally disagree that there must always be *very* large advantages for new standards; if we followed such a principle, we would still be using cuneiform for writing, or bash-like syntax, since every time someone proposes a slight improvement, there will be somebody who says: "but the new one is not *that much* better than the old!". Actually, in many cases it is better when things keep evolving - then everybody stays aware.
Here are some reasons in favour of the current system:
So you mean they start to learn hex, see numbers, and think: ooh, it looks like a number, not so scary. So less time to learn, yes, +1 (less pain now, more pain later). But if I am an adult, intelligent man, I understand that there are only ten digits, that I will nevertheless need to extend the set, and that they should *all* be optically consistent and well readable. And what is a well readable set with >=16 glyphs? Small letters! That is, from the height of my current knowledge, since I know that digits are anyway not very readable.
I actually proposed consecutive letters too, but that does not make much difference: being familiar with the ordering of the alphabet will have next to zero influence on the reading of numbers encoded with letters. It is just an illusion that it will, since a letter is a symbol - if I see "z" I don't think of 26. More probably, the weight of the glyph could play some role, meaning the less the weight, the smaller the number. Mikhail

On Sun, Oct 16, 2016 at 05:02:49PM +0200, Mikhail V wrote:
This discussion is completely and utterly off-topic for this mailing list. If you want to discuss changing the world to use your own custom character set for all human communication, you should write a blog or a book. It is completely off-topic for Python: we're interested in improving the Python programming language, not yet another constructed language or artificial alphabet: https://en.wikipedia.org/wiki/Shavian_alphabet If you're interested in this, there is plenty of prior art. See for example: Esperanto, Ido, Volapük, Interlingua, Lojban. But don't discuss it here. -- Steve

On 17 October 2016 at 02:23, Steven D'Aprano <steve@pearwood.info> wrote:
You're right; I was just answering the questions, so it drifted to other things somehow. BTW, among other things we have discussed bitstring representation. So if you work with those - for example if you model cryptography algorithms or similar things in Python - this could help you to debug your programs, and generally one could read it as a question of, say, adding an extra notation for this purpose. And if you noticed, this is not really about my glyphs, but stays within ASCII. So actually it is I who tried to turn it back on-topic. Mikhail

On 10/13/16 2:42 AM, Mikhail V wrote:
You continue to overlook the fact that Unicode codepoints are conventionally presented in hexadecimal, including in the page you linked us to. This is the convention. It makes sense to stick to the convention.

When I see a numeric representation of a character, there are only two things I can do with it: look it up in a reference someplace, or glean some meaning from it directly.

For looking things up, please remember that all Unicode references use hex numbering. Looking up a character by decimal numbers is simply more difficult than looking them up by hex numbers.

For gleaning meaning directly, please keep in mind that Unicode is fundamentally structured around pages of 256 code points, organized into planes of 256 pages. The very structure of how code points are allocated and grouped is based on a hexadecimal-friendly system. The blocks of codepoints are aligned on hexadecimal boundaries: http://www.fileformat.info/info/unicode/block/index.htm . When I see \u0414, I know it is a Cyrillic character because it is in block 04xx.

It simply doesn't make sense to present Unicode code points in anything other than hex. --Ned.
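A quick way to see that block structure from Python, using the standard unicodedata module (purely an illustration):

import unicodedata

for ch in "aД∞":
    cp = ord(ch)
    # The top byte of the code point names the 256-character block.
    print("U+{:04X} (decimal {}) is in block 0x{:02X}xx: {}".format(
        cp, cp, cp >> 8, unicodedata.name(ch)))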

On 10/12/2016 07:13 PM, Mikhail V wrote:
If you mean that decimal notation is the standard used for _counting_ by people, then yes of course that is standard. But decimal notation certainly is not standard in this domain.
Hexadecimal notation is hardly "obscure", but yes, I understand that fewer people understand it than decimal notation. Regardless, hex notation seems to be standard for Unicode, and unless you can convince the Unicode community at large to switch, I don't think it makes any sense for Python to switch. Sometimes it's better to go with the flow even if you don't want to.
There is no mixing for Unicode in Python; it displays as hexadecimal. Decimal is used in other places, though. So if by "mixing" you mean that Python should not use the standard notations of subdomains when working with those domains, then I totally disagree. The language used in different disciplines is and has always been variable. Until that's no longer true, it's better to stick with convention than to add an inconsistency which will be much more confusing in the long term than learning the idiosyncrasies of a specific domain (in this case the use of hexadecimal in the Unicode world). Cheers, Thomas

On Oct 12, 2016 4:33 PM, "Mikhail V" <mikhailwas@gmail.com> wrote:
If decimal notation isn't used for parsing, only for printing, it would be confusing as heck, but using it for both would break a lot of code in subtle ways (the worst kind of code breakage).
The Unicode standard. I agree that hex is hard to read, but the standard uses it to refer to the code points. It's great to be able to google code points and find the characters easily, and switching to decimal would screw it up. And I've never seen someone *need* to figure out the decimal version from the hex before. It's far more likely to google the hex #. TL;DR: I think this change would induce a LOT of short-term issues, despite it being up in the air if there's any long-term gain. So -1 from me.
2. Mixing of two notations (hex and decimal) is a _very_ bad idea, I hope no need to explain why.
Indeed, you don't. :)
-- Ryan (ライアン) [ERROR]: Your autotools build scripts are 200 lines longer than your program. Something’s wrong. http://kirbyfan64.github.io/

On 12 October 2016 at 23:58, Danilo J. S. Bellini <danilo.bellini@gmail.com> wrote:
Hmm, what keeps you from separating the logical units so that each is represented by a decimal number? Like 001 023 255 ... Do you really think this is less readable than its hex equivalent? Then you are probably working with hex numbers only, but I doubt that.
PS: it is rather peculiar - three negative replies already, but with no strong arguments as to why it would be bad to stick to decimal only; only some "others do it so" and "tradition" arguments. The "base 2" argument could work to some degree, but if we stick to that criterion, why not speak about octal/quaternary/binary then? Please note, I am actually talking only about the readability _of the character set_. And it does not include your habit issues, but rather is an objective criterion for using this or that character set. And decimal is objectively way more readable than the standard hex character set, regardless of how strong your habits are.

On Thu, Oct 13, 2016 at 10:09 AM, Mikhail V <mikhailwas@gmail.com> wrote:
Way WAY less readable, and I'm comfortable working in both hex and decimal.
You're the one who's non-standard here. Most of the world uses hex for Unicode codepoints. http://unicode.org/charts/ HTML entities permit either decimal or hex, but other than that, I can't think of any common system that uses decimal for Unicode codepoints in strings.
"Others do it so" is actually a very strong argument. If all the rest of the world uses + to mean addition, and Python used + to mean subtraction, it doesn't matter how logical that is, it is *wrong*. Most of the world uses U+201C or "\u201C" to represent a curly double quote; if you us 0x93, you are annoyingly wrong, and if you use 8220, everyone has to do the conversion from that to 201C. Yes, these are all differently-valid standards, but that doesn't make it any less annoying.
How many decimal digits would you use to denote a single character? Do you have to pad everything to seven digits (\u0000034 for an ASCII quote)? And if not, how do you mark the end? This is not "objectively more readable" if the only gain is "no A-F" and the loss is "unpredictable length". ChrisA

On 13 October 2016 at 01:50, Chris Angelico <rosuav@gmail.com> wrote:
Please don't mix up readability and personal habit, which previous repliers seem to do as well. Those two things have nothing to do with each other. If you are comfortable with the old Roman numeral system, that does not make it readable. And I am NOT comfortable with hex, and like most people I would be glad to use a single notation. But some of them think that they are cool because they know several numbering notations ;) But I bet few can actually tell which is more readable.
This actually supports my proposal perfectly: if everyone uses decimal, why suddenly use hex for the same thing - an index into an array? I don't see how your analogy contradicts my proposal; rather, it supports it.
quote; if you use 0x93, you are annoyingly wrong,
Please don't make personal assessments here; I can use whatever I want. Moreover, I find this notation as silly as using different measurement systems without any reason and within one activity, and in my eyes this is annoyingly wrong and stupid, but I don't call anybody here stupid. But I do want you to be able to abstract yourself from your habit for a while and talk about what would be better for future usage.
everyone has to do the conversion from that to 201C.
Nobody needs to do ANY conversions if we use decimal, and as said, everything is decimal: numbers, array indexes, the ord() function returns decimal, and you can imagine more examples. So it is not only more readable but also more traditional.
How many decimal digits would you use to denote a single character?
for text, three decimal digits would be enough for me personally, and in the long term, when the world's alphabetical garbage disappears, two digits would be OK.
you have to pad everything to seven digits (\u0000034 for an ASCII quote)?
It depends on the case. For input, some separator or padding is also OK; I don't have a problem with either. For printing, obviously don't show leading zeros, but rather spaces. But as said, I consider Unicode only a temporary happening; it will pass into history at some point and be used only to study extinct glyphs. Mikhail

On 2016-10-12 18:56, Mikhail V wrote:
You keep saying this, but it's quite incorrect. The usage of decimal notation is itself just a convention, and the only reason it's easy for you (and for many other people) is because you're used to it. If you had grown up using only hexadecimal or binary, you would find decimal awkward. There is nothing objectively better about base 10 than any other place-value numbering system. Decimal is just a habit. Now, it's true that base-10 is at this point effectively universal across human societies, and that gives it a certain claim to primacy. But base-16 (along with base 2) is also quite common in computing contexts. Saying we should dump hex notation because everyone understands decimal is like saying that all signs in Prague should only be printed in English because there are more English speakers in the entire world than Czech speakers. But that ignores the fact that there are more Czech speakers *in Prague*. Likewise, decimal may be more common as an overall numerical notation, but when it comes to referring to Unicode code points, hexadecimal is far and away more common. Just look at the Wikipedia page for Unicode, which says: "Normally a Unicode code point is referred to by writing "U+" followed by its hexadecimal number." That's it. You'll find the same thing on unicode.org. The unicode code point is hardly even a number in the usual sense; it's just a label that identifies the character. If you have an issue with using hex to represent unicode code points, your issue goes way beyond Python, and you need to take it up with the Unicode consortium. (Good luck with that.) -- Brendan Barnwell "Do not follow where the path may lead. Go, instead, where there is no path, and leave a trail." --author unknown

On 13 October 2016 at 04:18, Brendan Barnwell <brenbarn@brenbarn.net> wrote:
Exactly, but this is not called "readability" but rather "acquired ability to read" or simply habit, which does not reflect the "readability" of the character set itself.
There is nothing objectively better about base 10 than any other place-value numbering system.
Sorry to say, but here you are totally wrong. Not to blame you personally for this fallacy - it is quite common among those who are not familiar with the topic - but you should consider some important points:

1. Each character set has a certain grade of readability which depends solely on the form of its units (aka glyphs).
2. Linear string representation is superior to anything else (spiral, arc, etc.).
3. There exist glyphs which provide maximal readability; those are particular glyphs with a particular constant form, and this form is absolutely independent of the encoding subject.
4. According to my personal studies (which does not mean they must be accepted or blindly believed in, but I have solid experience in this area and am acting quite successfully in it), the number of such glyphs is less than 10; namely, I am at 8 glyphs now.
5. The main measured parameter which reflects readability (somewhat indirectly, however) is the pair-wise optical collision of each character pair of the set. This refers somewhat to legibility, or the ability to differentiate glyphs.

Less technically, you can understand it better if you think of your own words: "There is nothing objectively better about base 10 than any other place-value numbering system." If this could ever be true, then you could read characters that are very similar to each other, or something messy, as well as characters which are easily identifiable, collision-resistant and optically consistent. But that is absurd, sorry. For numbers you obviously don't need as many characters as for speech encoding, so only those glyphs, or even a subset of them, should be used. This means anything more than 8 characters is quite worthless for reading numbers. Note that I can't provide the works here currently, so don't ask me for them. Some of them will probably be available in the near future.

Your analogy with speech and signs is not correct, because speech is different, but numbers are numbers. And even for different speech the same character set must be used, namely the one with superior optical qualities, readability.
Saying we should dump hex notation because everyone understands decimal is like saying that all signs in Prague should only be printed in English
We should dump hex notation because currently decimal is simply superior to hex, just like a Mercedes is superior to a Lada; and secondly, because it is more common for ALL people, so it is 2:0 against such a notation. With that said, I am not against base-16 itself in the first place, but rather against the character set, which is simply visually inconsistent and not readable. Someone just took the Arabic digits and added the first Latin letters to them. It could be forgiven as a schoolboy's exercise in drawing, but I fail to understand how it can be accepted as a working notation for a medium supposed to be human readable. Practically all this notation does is reduce the time before you as a programmer develop visual and brain impairments.
Yeah, that's it. And it sucks, and having migrated into a coding standard, it sucks twice. If a new syntax/standard were decided on, there would be only positive sides to using decimal vs hex. So nobody would be hurt; this is only a question of remaking the current implementation, and it is proposed only as a long-term theoretical improvement.
it's just a label that identifies the character.
OK, but if I write string filtering in Python, for example, then obviously I use decimal everywhere to compare index ranges, etc., so what use is that label to me? Just redundant conversions back and forth. It makes me sick, actually.

Mikhail V wrote:
I'm not sure what you mean by that. If by "index ranges" you're talking about the numbers you use to index into the string, they have nothing to do with character codes, so you can write them in whatever base is most convenient for you. If you have occasion to write a literal representing a character code, there's nothing to stop you writing it in hex to match the way it's shown in a repr(), or in published Unicode tables, etc. I don't see a need for any conversions back and forth. -- Greg

On 2016-10-12 22:46, Mikhail V wrote:
It's pretty clear to me by this point that your argument has no rational basis, so I'm regarding this thread as a dead end. -- Brendan Barnwell "Do not follow where the path may lead. Go, instead, where there is no path, and leave a trail." --author unknown

On Thu, Oct 13, 2016 at 1:46 AM, Mikhail V <mikhailwas@gmail.com> wrote:
Even if you were right that your approach is somehow inherently easier, it is flat-out wrong that other approaches lead to "brain impairment". On the contrary, it is well-established that challenging the brain prevents or at least delays brain impairment. And it also makes no sense that it would cause visual impairment, either. Comparing glyphs is a higher-level task in the brain, it has little to do with your eyes. All your eyes detect are areas of changing contrast, any set of lines and curves, not even glyphs, is functionally identical at that level (and even at medium-level brain regions). The size of the glyphs can make a difference, but not the number of available ones. On the contrary, having more glyphs increases the information density of text, reducing the amount of reading you have to do to get the same information.

On 16 October 2016 at 17:16, Todd <toddrjen@gmail.com> wrote:
My phrasing "impairment" is of course somewhat exaggeration. It cannot be compared to harm due to smoking for example. However it also known that many people who do big amount of information processing and intensive reading are subject to earlier loss of the vision sharpness. And I feel it myself. How exactly this happens to the eye itself is not clear for me. One my supposition is that during the reading there is very intensive two-directional signalling between eye and brain. So generally you are correct, the eye is technically a camera attached to the brain and simply sends pictures at some frequency to the brain. But I would tend to think that it is not so simple actually. You probably have heard sometimes users who claim something like: "this text hurts my eyes" For example if you read non-antialiased text and with too high contrast, you'll notice that something is indeed going wrong with your eyes. This can happen probably because the brain starts to signal the eye control system "something is wrong, stop doing it" Since your eye cannot do anything with wrong contrast on your screen and you still need to continue reading, this happens again and again. This can cause indeed unwanted processes and overtiredness of muscles inside the eye. So in case of my examle with Chinese students, who wear goggles more frequently, this would probaly mean that they could "recover" if they just stop reading a lot. "challenging the brain prevents or at least delays brain" Yes but I hardly see connection with this case, I would probably recommend to make some creative exercises, like drawing or solving puzzles for this purpose. But if I propose reading books in illegible font than I would be wrong in any case.
You forget that with an illegible font or wrong contrast, for example, you *do* need to concentrate more. This again causes your eye to try harder to adapt to the information you see, to reread, which again affects your lens and eye movements. Anyway, how do you think this earlier vision loss happens, then? Would you say I am fantasising? Mikhail

On Sun, Oct 16, 2016 at 3:26 PM, Mikhail V <mikhailwas@gmail.com> wrote:
The downwards-projecting signals from the brain to the eye are heavily studied. In fact, I have friends who specialize in studying those connections specifically. They simply don't behave the way you are describing. You are basing your claims about the superiority of certain sorts of glyphs on conjecture about how the brain works, conjecture that goes against what the evidence says about how the brain actually processes visual information. Yes, the quality of the glyphs can make a big difference. There is no indication, however, that the number of possible glyphs can.
I don't want to imply bad faith on your part, but you cut off an important part of what I said: "The size of the glyphs can make a difference, but not the number of available ones. On the contrary, having more glyphs increases the information density of text, reducing the amount of reading you have to do to get the same information." Badly-antialiased text can be a problem from that standpoint too. But again, none of this has anything whatsoever to do with the number of glyphs, which is your complaint. Again, I don't want to imply bad faith, but the argument you are making now is completely different from the argument I was addressing. I don't disagree that bad text quality or too much reading can hurt your eyes. On the contrary, I said explicitly that it can. The claim of yours that I was addressing is that having too many glyphs can hurt your eyes or brain, which doesn't match with anything we know about how the eyes or brain work.

On Oct 12, 2016 9:25 PM, "Chris Angelico" <rosuav@gmail.com> wrote:
Emoji, of course! What else?
-- Ryan (ライアン) [ERROR]: Your autotools build scripts are 200 lines longer than your program. Something’s wrong. http://kirbyfan64.github.io/

Hello, and welcome to Python-ideas, where only a small portion of ideas go further, and where most newcomers that wish to improve the language get hit by the reality bat! I hope you enjoy your stay :)
I'll turn your argument around: Not being comfortable with hex does not make it unreadable; it's a matter of habit (as Brendan pointed out in his separate reply).
Unicode code points are represented using hex notation virtually everywhere I ever saw it. Your Unicode-code-points-as-decimal website was a new discovery for me (and, I presume, many others on this list). Since it's widely used in the world, going against that effectively makes you non-standard. That doesn't mean it's necessarily a bad thing, but it does mean that your chances (or anyone's chances) of actually changing that are equal to zero (and this isn't some gross exaggeration),
I fail to see your point here. Where is that "everyone uses decimal"? Unless you stopped talking about representation in strings (which seems likely, as you're talking about indexing?), everything is represented as hex.
But I do want you to be able to abstract yourself from your habit for a while and talk about what would be better for future usage.
I'll be that guy and tell you that you need to step back from your own idea for a while and consider your proposal and the current state of things. I'll also take the opportunity to reiterate that there is virtually no chance to change this behaviour. This doesn't, however, prevent you or anyone from talking about the topic, either for fun, or for finding other (related or otherwise) areas of interest that you think might be worth investigating further. A lot of threads actually branch off in different topics that came up when discussing, and that are interesting enough to pursue on their own.
You're mixing up more than just one concept here:

- Integer literals; I assume this is what you meant, and you seem to forget (or maybe you didn't know, in which case here's to learning something new!) that 0xff is perfectly valid syntax, and stores the integer with the value of 255 in base 10.

- Indexing, and that's completely irrelevant to the topic at hand (also see above bullet point).

- ord(), which returns an integer (which can be interpreted in any base!), and that's both an argument for and against this proposal; the "against" side is actually that decimal notation has no defined boundary for when to stop (and before you argue that it does, I'll point out that the separations, e.g. grouping by the thousands, are culture-driven and not an international standard). There's actually a precedent for this in Python 2 with the \x escape (need I remind anyone why Python 3 was created again? :), but that's exactly a stone in the "don't do that" camp, instead of the other way around.
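For instance, a quick interpreter session illustrating the first and third bullets (purely an illustration):

>>> 0xff == 255
True
>>> ord("я")                      # just an int; it has no inherent base
1103
>>> hex(ord("я")), str(ord("я"))
('0x44f', '1103')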
You seem to have misunderstood the question - in "\u00123456", there is no ambiguity that this is a string consisting of 5 characters; the first one is '\u0012', the second one is '3', the third one is '4', the fourth one is '5', and the last one is '6'. In the string (using \d as a hypothetical escape method; regex gurus can go read #27364 ;) "\d00123456", how many characters does this contain? It's decimal, so should the escape grab the first 5 digits? Or 6 maybe? You tell me.
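To make the fixed four-hex-digit rule concrete (again, just an illustration):

>>> len("\u00123456")
5
>>> list("\u00123456")
['\x12', '3', '4', '5', '6']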
No leading zeros? That means you don't have a fixed number of digits, and your string is suddenly very ambiguous (also see my point above).
Unicode, a temporary happening? Well, strictly speaking, nobody can know that, but I'd expect that it's going to, someday, be *the* common standard. I'm not bathed in illusion, though.
Mikhail
All in all, that's a pretty interesting idea. However, it has no chance of happening, because a lot of code would break, Python would deviate from the rest of the world, this wouldn't be backwards compatible (and another backwards-incompatible major release isn't happening; the community still hasn't fully caught up with the one 8 years ago), and it would be unintuitive to anyone who's done computer programming before (or after, or during, or anytime). I do see some bits worth pursuing in your idea, though, and I encourage you to keep going! As I said earlier, Python-ideas is a place where a lot of ideas are born and die, and that shouldn't stop you from trying to contribute. Python is 25 years old, and a bunch of stuff is there just for backwards compatibility; these kind of things can't get changed easily. The older (older by contribution period, not actual age) contributors still active don't try to fix what's not broken (to them). Newcomers, such as you, are a breath of fresh air to the language, and what helps make it thrive even more! By bringing new, uncommon ideas, you're challenging the status quo and potentially changing it for the best. But keep in mind that, with no clear consensus, the status quo always wins a stalemate. I hope that makes sense! Cheers, Emanuel

On 13 October 2016 at 04:49, Emanuel Barry <vgr255@live.ca> wrote:
A matter of habit does not reflect readability, see my last reply to Brendan. It is quite precise engineering. And readability is kind of serious stuff, especially if you decide on a programming career. Young people underestimate it, and for the oldies it is too late when they realize it :) And Python is all about readability, and I like it. As for your other points, I'll need to read them with a fresh head tomorrow. Of course I don't believe this would all suddenly happen with Python, or any other programming language; it is just an idea anyway. And I do want to learn more, actually. I especially want to see some example where it would be really beneficial to use hex, either technically (some low-level binary-related stuff?) or regarding comprehension, which to my knowledge is hardly possible.
Mikhail

Mikhail V wrote:
Did you see much code written with hex literals?
From /usr/include/sys/fcntl.h:

/*
 * File status flags: these are used by open(2), fcntl(2).
 * They are also used (indirectly) in the kernel file structure f_flags,
 * which is a superset of the open/fcntl flags. Open flags and f_flags
 * are inter-convertible using OFLAGS(fflags) and FFLAGS(oflags).
 * Open/fcntl flags begin with O_; kernel-internal flags begin with F.
 */
/* open-only flags */
#define O_RDONLY        0x0000      /* open for reading only */
#define O_WRONLY        0x0001      /* open for writing only */
#define O_RDWR          0x0002      /* open for reading and writing */
#define O_ACCMODE       0x0003      /* mask for above modes */

/*
 * Kernel encoding of open mode; separate read and write bits that are
 * independently testable: 1 greater than the above.
 *
 * XXX
 * FREAD and FWRITE are excluded from the #ifdef KERNEL so that TIOCFLUSH,
 * which was documented to use FREAD/FWRITE, continues to work.
 */
#if !defined(_POSIX_C_SOURCE) || defined(_DARWIN_C_SOURCE)
#define FREAD           0x0001
#define FWRITE          0x0002
#endif
#define O_NONBLOCK      0x0004      /* no delay */
#define O_APPEND        0x0008      /* set append mode */
#ifndef O_SYNC          /* allow simultaneous inclusion of <aio.h> */
#define O_SYNC          0x0080      /* synch I/O file integrity */
#endif
#if !defined(_POSIX_C_SOURCE) || defined(_DARWIN_C_SOURCE)
#define O_SHLOCK        0x0010      /* open with shared file lock */
#define O_EXLOCK        0x0020      /* open with exclusive file lock */
#define O_ASYNC         0x0040      /* signal pgrp when data ready */
#define O_FSYNC         O_SYNC      /* source compatibility: do not use */
#define O_NOFOLLOW      0x0100      /* don't follow symlinks */
#endif /* (_POSIX_C_SOURCE && !_DARWIN_C_SOURCE) */
#define O_CREAT         0x0200      /* create if nonexistant */
#define O_TRUNC         0x0400      /* truncate to zero length */
#define O_EXCL          0x0800      /* error if already exists */
#if !defined(_POSIX_C_SOURCE) || defined(_DARWIN_C_SOURCE)
#define O_EVTONLY       0x8000      /* descriptor requested for event notifications only */
#endif
#define O_NOCTTY        0x20000     /* don't assign controlling terminal */
#if !defined(_POSIX_C_SOURCE) || defined(_DARWIN_C_SOURCE)
#define O_DIRECTORY     0x100000
#define O_SYMLINK       0x200000    /* allow open of a symlink */
#endif
#ifndef O_DSYNC         /* allow simultaneous inclusion of <aio.h> */
#define O_DSYNC         0x400000    /* synch I/O data integrity */
#endif

-- Greg

Backing Greg up for a moment, hex literals are extremely common in any code that needs to work with binary data, such as network programming or fine data structure manipulation. For example, consider the frequent requirement to mask out certain bits of a given integer (e.g., keep the low 24 bits of a 32 bit integer). Here are a few ways to represent that:

integer & 0x00FFFFFF # Hex
integer & 16777215 # Decimal
integer & 0o77777777 # Octal
integer & 0b111111111111111111111111 # Binary

Of those four, hexadecimal has the advantage of being both extremely concise and clear. The octal representation is infuriating because one octal digit refers to *three* bits, which means that there is a non-whole number of octal digits in a byte (that is, one byte with all bits set is represented by 0o377). This causes problems both with reading comprehension and with most other common tasks. For example, moving from 0xFF to 0xFFFF (or 255 to 65535, also known as setting the next most significant byte to all 1) is represented in octal by moving from 0o377 to 0o177777. This is not an obvious transition, and I doubt many programmers could do it from memory in any representation but hex or binary.

Decimal is no clearer. Programmers know how to represent certain bit patterns from memory in decimal simply because they see them a lot: usually they can do the all 1s case, and often the 0 followed by all 1s case (255 and 128 for one byte, 65535 and 32767 for two bytes, and then increasingly few programmers know the next set). But trying to work out what mask to use for setting only bits 15 and 14 is tricky in decimal, while in hex it’s fairly easy (in hex it’s 0xC000, in decimal it’s 49152).

Binary notation seems like the solution, but note the above case: the only way to work out how many bits are being masked out is to count them, and there can be quite a lot. IIRC there’s some new syntax coming for binary literals that would let us represent them as 0b1111_1111_1111_1111, which would help the readability case, but it’s still substantially less dense and loses clarity for many kinds of unusual bit patterns. Additionally, as the number of bits increases life gets really hard: masking out certain bits of a 64-bit number requires a literal that’s at least 66 characters long, not including the underscores that would add another 15 underscores for a literal that is 81 characters long (more than the PEP8 line width recommendation). That starts getting unwieldy fast, while the hex representation is still down at 18 characters.

Hexadecimal has the clear advantage that each character wholly represents 4 bits, and the next 4 bits are independent of the previous bits. That’s not true of decimal or octal, and while it’s true of binary it costs a fourfold increase in the length of the representation. It’s definitely not as intuitive to the average human being, but that’s ok: it’s a specialised use case, and we aren’t requiring that all human beings learn this skill.

This is a very long argument to suggest that your argument against hexadecimal literals (namely, that they use 16 glyphs as opposed to the 10 glyphs used in decimal) is an argument that is too simple to be correct. Different collections of glyphs are clearer in different contexts. For example, decimal numerals can be represented using 10 glyphs, while the English language requires 26 glyphs plus punctuation.
But I don’t think you’re seriously proposing we should swap from writing English using the larger glyph set to writing it in decimal representation of ASCII bytes. Given this, I think the argument that says that the Unicode consortium said “write the number in hex” is good enough for me. Cory
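A quick interactive check of the equivalences listed above: Python accepts all four notations, they denote the same integer, and the bits-15-and-14 mask works out as described (a minimal sketch, nothing more).

# The same 24-bit mask written four ways; all compare equal.
mask_hex = 0x00FFFFFF
mask_dec = 16777215
mask_oct = 0o77777777
mask_bin = 0b111111111111111111111111
assert mask_hex == mask_dec == mask_oct == mask_bin

# The "bits 15 and 14 only" mask mentioned above.
assert 0xC000 == 49152 == ((1 << 15) | (1 << 14))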

On Thu, Oct 13, 2016 at 9:05 PM, Cory Benfield <cory@lukasa.co.uk> wrote:
Binary notation seems like the solution, but note the above case: the only way to work out how many bits are being masked out is to count them, and there can be quite a lot. IIRC there’s some new syntax coming for binary literals that would let us represent them as 0b1111_1111_1111_1111, which would help the readability case, but it’s still substantially less dense and loses clarity for many kinds of unusual bit patterns.
And if you were to write them like this, you would start to read them in blocks of four - effectively, treating each underscore-separated unit as a glyph, despite them being represented with four characters. Fortunately, just like with Hangul characters, we have a transformation that combines these multi-character glyphs into single characters. We call it 'hexadecimal'. ChrisA
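For what it's worth, the underscore syntax arrived with Python 3.6 (PEP 515), and the grouping-into-fours point can be checked directly; a small sketch (requires 3.6+):

value = 0b1100_0000_0000_0001      # underscores are legal digit separators in 3.6+
assert value == 0xC001

# format() can regroup for display in either direction.
print(format(value, '#_b'))        # 0b1100_0000_0000_0001
print(format(value, '#x'))         # 0xc001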

On 13 October 2016 at 12:05, Cory Benfield <cory@lukasa.co.uk> wrote:
Correct, it makes it not so nice-looking and not 8-bit-paradigm friendly. That does not however make it a bad option in general; according to my personal suppositions and my work on glyph development, the optimal set is of exactly 8 glyphs.
Decimal is no clearer.
In the same problematic alignment context, yes, correct.
Binary notation seems like the solution, ...
Agree with you; see my last reply to Greg for more thoughts on bitstrings and the quaternary approach.
IIRC there’s some new syntax coming for binary literals that would let us represent them as 0b1111_1111_1111_1111
Very good. Healthy attitude :)
less dense and loses clarity for many kinds of unusual bit patterns.
It is not very clear to me what exactly you mean about the patterns.
Additionally, as the number of bits increases life gets really hard: masking out certain bits of a 64-bit number requires
The editing itself of such a bitmask in hex notation makes life hard. Editing it in binary notation makes life easier.
a literal that’s at least 66 characters long,
Length is a feature of binary, though it is not a major issue; see my ideas on it in my reply to Greg.
Hexadecimal has the clear advantage that each character wholly represents 4 bits,
That advantage is brevity, but one needs slightly less brevity to make it more readable. So what do you think about base 4?
I didn't understand this, sorry :))) You're welcome to ask more if you're interested in this.
Different collections of glyphs are clearer in different contexts.
How many different collections, and in how many different contexts?
while the english language requires 26 glyphs plus punctuation.
It does not *require* them, but of course 8 glyphs would not suffice to read the language effectively, so one finds a way to extend the glyph set. Roughly speaking, 20 letters is enough, but this is not an exact science. And it is quite a hard science.
I didn't understand this sentence :) In general I think we agree on many points, thank you for the input! Cheers, Mikhail

On Thu, Oct 13, 2016 at 03:56:59AM +0200, Mikhail V wrote:
How many decimal digits would you use to denote a single character?
for text, three decimal digits would be enough for me personally,
Well, if it's enough for you, why would anyone need more?
and in the long run, when the world's alphabetical garbage will disappear, two digits would be ok.
Are you serious? Talking about "alphabetical garbage" like that makes you seem to be an ASCII bigot: rude, ignorant, arrogant and rather foolish as well. Even 7-bit ASCII has more than 100 characters (128). -- Steve

On Fri, Oct 14, 2016 at 1:25 AM, Steven D'Aprano <steve@pearwood.info> wrote:
Solution: Abolish most of the control characters. Let's define a brand new character encoding with no "alphabetical garbage". These characters will be sufficient for everyone:

* [2] Formatting characters: space, newline. Everything else can go.
* [8] Digits: 01234567
* [26] Lower case Latin letters a-z
* [2] Vital social media characters: # (now officially called "HASHTAG"), @
* [2] Can't-type-URLs-without-them: colon, slash (now called both "SLASH" and "BACKSLASH")

That's 40 characters that should cover all the important things anyone does - namely, Twitter, Facebook, and email. We don't need punctuation or capitalization, as they're dying arts and just make you look pretentious. I might have missed a few critical characters, but it should be possible to fit it all within 64, which you can then represent using two digits from our newly-restricted set; octal is better than decimal, as it needs fewer symbols. (Oh, sorry, so that's actually "50" characters, of which "32" are the letters. And we can use up to "100" and still fit within two digits.)

Is this the wrong approach, Mikhail? Perhaps we should go the other way, then, and be *inclusive* of people who speak other languages. Thanks to Unicode's rich collection of characters, we can represent multiple languages in a single document; see, for instance, how this uses four languages and three entirely distinct scripts: http://youtu.be/iydlR_ptLmk

Turkish and French both use the Latin script, but have different characters. Alphabetical garbage, or accurate representations of sounds and words in those languages? Python 3 gives the world's languages equal footing. This is a feature, not a bug. It has consequences, including that arbitrary character entities could involve up to seven decimal digits or six hex (although for most practical work, six decimal or five hex will suffice). Those consequences are a trivial price to pay for uniting the whole internet, as opposed to having pockets of different languages, like we had up until the 90s.

ChrisA
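The digit counts mentioned there are easy to verify against the highest code point Unicode defines (a small sketch):

MAX_CODEPOINT = 0x10FFFF                  # the largest Unicode code point
assert MAX_CODEPOINT == 1114111

print(len(str(MAX_CODEPOINT)))            # 7 decimal digits
print(len(format(MAX_CODEPOINT, 'x')))    # 6 hex digits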

On 13 October 2016 at 16:50, Chris Angelico <rosuav@gmail.com> wrote:
This is sort of rude. Are you from the Unicode consortium?
This is sort of the correct approach. We do need punctuation, however. And of course one does not need to make it too tight. So 8-bit units for text are excellent, and there is enough space left for experiments.
Perhaps we should go the other way, then, and be *inclusive* of people who speak other languages.
What keeps people from using the same characters? I will tell you what - it is local law. If you go to school you *have* to write in what is prescribed by big daddy. If you're in Europe or America, you are luckier. And if you're in China you'll be punished if you want some freedom. So like it or not, learn hieroglyphs and become visually impaired by the age of 18.
Thanks to Unicode's rich collection of characters, we can represent multiple languages in a single document;
You can do it without Unicode within 8-bit boundaries with tagged text; you just need fonts for your language, provided of course your local charset has fewer than 256 letters. This is how it was before Unicode, I suppose. BTW, I still don't get what revolutionary advantages Unicode has compared to tagged text.
script, but have different characters. Alphabetical garbage, or accurate representations of sounds and words in those languages?
Accurate representation with some 50 characters is more than enough. Mikhail

On Fri, Oct 14, 2016 at 6:53 PM, Mikhail V <mikhailwas@gmail.com> wrote:
No, he's not. He just knows a thing or two.
... okay. I'm done arguing. Go do some translation work some time. Here, have a read of some stuff I've written before. http://rosuav.blogspot.com/2016/09/case-sensitivity-matters.html http://rosuav.blogspot.com/2015/03/file-systems-case-insensitivity-is.html http://rosuav.blogspot.com/2014/12/unicode-makes-life-easy.html
Never mind about China and its political problems. All you need to do is move around Europe for a bit and see how there are more sounds than can be usefully represented. Turkish has a simple system wherein the written and spoken forms have direct correspondence, which means they need to distinguish eight fundamental vowels. How are you going to spell those? Scandinavian languages make use of letters like "å" (called "A with ring" in English, but identified by its sound in Norwegian, same as our letters are - pronounced "Aww" or "Or" or "Au" or thereabouts). To adequately represent both Turkish and Norwegian in the same document, you *need* more letters than our 26.
No, you can't. Also, you shouldn't. It makes virtually every text operation impossible: you can't split and rejoin text without tracking the encodings. Go try to write a text editor under your scheme and see how hard it is.
It's not tagged. That's the huge advantage.
Go build a chat room or something. Invite people to enter their names. Now make sure you're courteous enough to display those names to people. Try doing that without Unicode. I'm done. None of this belongs on python-ideas - it's getting pretty off-topic even for python-list, and you're talking about modifying Python 2.7 which is a total non-starter anyway. ChrisA

So you know, for the future, I think this comment is going to be the one that causes most of the people who were left to disengage with this discussion. The many glyphs that exist for writing various human languages are not inefficiency to be optimised away. Further, I should note that most places do not legislate about what character sets are acceptable for transcribing their languages. Indeed, plenty of non-romance-language-speakers have found ways to transcribe their languages of choice into the limited 8-bit character sets that the Anglophone world propagated: take a look at Arabish for the best kind of example of this behaviour, where "الجو عامل ايه النهارده فى إسكندرية؟" will get rendered as "el gaw 3amel eh elnaharda f eskendereya?"

But I think you're in a tiny minority of people who believe that all languages should be rendered in the same script. I can think of only two reasons to argue for this:

1. Dealing with lots of scripts is technologically tricky and it would be better if we didn't bother. This is the anti-Unicode argument, and it's a weak argument, though it has the advantage of being internally consistent.

2. There is some genuine harm caused by learning non-ASCII scripts. Your paragraph suggests that you really believe that learning to write in Kanji (a logographic system) as opposed to Katakana (a syllabary with 48 non-punctuation characters) somehow leads to active harm (your phrase was "become visually impaired"). I'm afraid that you're really going to need to provide one hell of a citation for that, because that's quite an extraordinary claim.

Otherwise, I'm afraid I have to say お先に失礼します.

Cory

On Fri, Oct 14, 2016 at 7:18 PM, Cory Benfield <cory@lukasa.co.uk> wrote:
The many glyphs that exist for writing various human languages are not inefficiency to be optimised away. Further, I should note that most places do not legislate about what character sets are acceptable for transcribing their languages. Indeed, plenty of non-romance-language-speakers have found ways to transcribe their languages of choice into the limited 8-bit character sets that the Anglophone world propagated: take a look at Arabish for the best kind of example of this behaviour, where "الجو عامل ايه النهارده فى إسكندرية؟" will get rendered as "el gaw 3amel eh elnaharda f eskendereya?"
I've worked with transliterations enough to have built myself a dedicated translit tool. It's pretty straight-forward to come up with something you can type on a US-English keyboard (eg "a\o" for "å", and "d\-" for "đ"), and in some cases, it helps with visual/audio synchronization, but nobody would ever claim that it's the best way to represent that language. https://github.com/Rosuav/LetItTrans/blob/master/25%20languages.srt
#1 does carry a decent bit of weight, but only if you start with the assumption that "characters are bytes". If you once shed that assumption (and the related assumption that "characters are 16-bit numbers"), the only weight it carries is "right-to-left text is hard"... and let's face it, that *is* hard, but there are far, far harder problems in computing. Oh wait. Naming things. In Hebrew. That's hard. ChrisA


On 14.10.2016 10:26, Serhiy Storchaka wrote:
And then we store Python identifiers in a single 64-bit word, allow at most 20 chars per identifier and use the remaining bits for cool things like type information :-) Not a bad idea, really. But then again: even microbits support Unicode these days, so apparently there isn't much need for such memory footprint optimizations anymore. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Oct 15 2016)
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/

Mikhail V wrote:
And decimal is objectively way more readable than the standard hex character set, regardless of how strong your habits are.
That depends on what you're trying to read from it. I can look at a hex number and instantly get a mental picture of the bit pattern it represents. I can't do that with decimal numbers. This is the reason hex exists. It's used when the bit pattern represented by a number is more important to know than its numerical value. This is the case with Unicode code points. Their numerical value is irrelevant, but the bit pattern conveys useful information, such as which page and plane it belongs to, whether it fits in 1 or 2 bytes, etc. -- Greg
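A sketch of the kind of information meant here: the Unicode plane and the 256-code-point "page" fall straight out of the hex digits (or out of a couple of shifts), while the decimal value hides them.

cp = ord("\U0001F600")            # GRINNING FACE
print(hex(cp))                    # 0x1f600
print(cp >> 16)                   # 1: the Supplementary Multilingual Plane
print(hex((cp >> 8) & 0xFF))      # 0xf6: the 256-code-point page within that plane
print(cp > 0xFFFF)                # True: does not fit in 16 bits
print(cp)                         # 128512: none of this is visible in decimal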

Forgot to reply to all, duping my message... On 12 October 2016 at 23:48, M.-A. Lemburg <mal@egenix.com> wrote:
BTW, I posted the output with Python 2 on Windows 7. In Windows 10, 'print' won't work with Unicode in the cmd console at all by default, but that's another story; let us not go into that. I think you get my idea right: it is not only about printing.
In programming literature it is used often, but let me point out that decimal is THE standard and is a much, much better standard in the sense of readability. And there is no solid reason to use 2 standards at the same time.
How is it not clear, if the digit amount is fixed? It is not very clear to me what you meant.

On 13.10.2016 01:06, Mikhail V wrote:
I guess it's a matter of choosing the right standard for the right purpose. For \uXXXX and \UXXXXXXXX the intention was to be able to represent a Unicode code point using its standard Unicode ordinal representation and since the standard uses hex for this, it's quite natural to use the same here.
Unicode code points have ordinals from the range [0, 1114111], so it's not clear where to stop parsing the decimal representation and continue to interpret the literal as regular string, since I suppose you did not intend everyone to have to write \u0000010 just to get a newline code point to avoid the ambiguity. PS: I'm not even talking about the breakage such a change would cause. This discussion is merely about the pointing out how things got to be how they are now. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Experts (#1, Oct 13 2016)
::: We implement business ideas - efficiently in both time and costs ::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ http://www.malemburg.com/
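To make the parsing point concrete: \uXXXX always consumes exactly four hex digits, so the end of the escape is unambiguous even when more digits follow. A small illustration (Python 3 shown; the same applies to u'...' literals in Python 2):

s = "\u00412"          # exactly four hex digits are taken: U+0041 ('A'), then a literal '2'
assert s == "A2"

t = "\N{LATIN SMALL LETTER A}bc"   # named escapes avoid the issue by using braces
assert t == "abc"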

On 13 October 2016 at 10:18, M.-A. Lemburg <mal@egenix.com> wrote:
OK, there are different usage cases. So in short, without going into detail: for example, if I need to type in a Unicode string literal in an ASCII editor, I would find such a notation replacement beneficial for me:

u'\u0430\u0431\u0432.txt' --> u"{1072}{1073}{1074}.txt"

Printing could be the same, I suppose. I use Python 2.7, and printing with numbers instead of non-ASCII would help me see where I have non-ASCII chars. But I think the print behaviour must be easily configurable. Any criticism of it? Besides not following the Unicode consortium. Also, I would not even mind fixed-width 7-digit decimals, actually. Ugly, but still better for me than hex. Mikhail
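Nothing like that braces-and-decimal notation exists in Python; purely as an illustration of the idea, a decoder for such strings could be sketched like this (the function name and the brace syntax are both hypothetical, Python 3):

import re

def decode_decimal_escapes(s):
    # Hypothetical: turn "{1072}{1073}{1074}.txt" into the corresponding characters.
    return re.sub(r"\{(\d+)\}", lambda m: chr(int(m.group(1))), s)

assert decode_decimal_escapes("{1072}{1073}{1074}.txt") == "\u0430\u0431\u0432.txt"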

On Fri, Oct 14, 2016 at 08:05:40AM +0200, Mikhail V wrote:
Any criticism of it? Besides not following the Unicode consortium.
Besides the other remarks on "tradition", I think this is where a big problem lies: we should not deviate from a common standard (without very good cause). There are cases where a language does well by deviating from the common standard. There are also cases where it is bad to deviate. Almost all current programming languages understand Unicode, for instance:

* C: http://en.cppreference.com/w/c/language/escape
* C++: http://en.cppreference.com/w/cpp/language/escape
* JavaScript: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Grammar_and_ty...

and those were only the first 3 I tried. They all use `\u` followed by 4 hexadecimal digits. You may not like the current standard. You may think/know/... it to be suboptimal for human comprehension. However, what you are suggesting is a very costly change. A change where --- I believe --- Python should not take the lead, but also should not be afraid to follow if other programming languages start to change. I would suggest that this is a change that might be best proposed to the Unicode consortium itself, instead of going to (just) a programming language. It'd be interesting to see whether or not you can convince the Unicode consortium that 8 symbols will be enough.

FWIW, Python 3.6 should print this in the console just fine. Feel free to upgrade whenever you're ready.

Cheers, Steve

On 10/12/2016 05:33 PM, Mikhail V wrote:
Hello all,
Hello! New to this list so not sure if I can reply here... :)
Now printing it we get:
u'\u0430\u0431\u0432.txt'
By "printing it", do you mean "this is the string representation"? I would presume printing it would show characters nicely rendered. Does it not for you?
Since when was decimal notation "standard"? It seems to be quite the opposite. For unicode representations, byte notation seems standard.
This is an opinion. I should clarify that for many cases I personally find byte notation much simpler. In this case, I view it as a toss-up, though for something like UTF-8-encoded text I would hate it if I saw decimal numbers and not bytes.
2. Mixing of two notations (hex and decimal) is a _very_ bad idea, I hope no need to explain why.
Still not sure which "mixing" you refer to.
Cheers, Thomas

Mikhail V wrote:
Consider unicode table as an array with glyphs.
You mean like this one? http://unicode-table.com/en/ Unless I've miscounted, that one has the characters arranged in rows of 16, so it would be *harder* to look up a decimal index in it. -- Greg

On 13 October 2016 at 08:02, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
A nice point finally, I admit, although quite minor. Where the data implies such paging or alignment, the notation should (probably) be more binary-oriented. But: you claim to see bit patterns in hex numbers? Then I bet you will see them much better if you take binary notation (2 symbols) or quaternary notation (4 symbols), I guarantee. And if you also take a consistent glyph set for them, you'll see them twice as well; that I also guarantee 100%. So it is not that decimal is cool, but that hex sucks (too big an alphabet) and _the character set_ used for hex optically sucks. That is the point. On the other hand, why would the Unicode glyph table, which is for the biggest part a museum of glyphs, necessarily be paged in a binary-friendly manner and not in a decimal-friendly manner? But I am not saying it should or should not; it is quite irrelevant for this particular case, I think. Mikhail

Mikhail V wrote:
Nope. The meaning of 0xC001 is much clearer to me than 1100000000000001, because I'd have to count the bits very carefully in the latter to distinguish it from, e.g. 6001 or 18001. The bits could be spaced out: 1100 0000 0000 0001 but that just takes up even more room to no good effect. I don't find it any faster to read -- if anything, it's slower, because my eyes have to travel further to see the whole thing. Another point -- a string of hex digits is much easier for me to *remember* if I'm transliterating it from one place to another. Not only because it's shorter, but because I can pronounce it. "Cee zero zero one" is a lot easier to keep in my head than "one one zero zero zero zero zero zero zero zero zero zero zero zero zero one"... by the time I get to the end, I've already forgotten how it started!
And if you take consistent glyph set for them also you'll see them twice better, also guarantee 100%.
When I say "instantly", I really do mean *instantly*. I fail to see how a different glyph set could reduce the recognition time to less than zero. -- Greg

Greg Ewing wrote:
Good example. But it is not average high-level code, of course. The example again works only if we for some reason follow binary segmentation, which is bound to low-level functionality, in this case 8-bit grouping.
c = "\u1235"
if "\u1230" <= c <= "\u123f":
I don't see a need for any conversions back and forth.
I'll explain what I mean with an example. This is an example which I would make to support my proposal. Compare:

if "\u1230" <= c <= "\u123f":

and:

o = ord(c)
if 100 <= o <= 150:

So yours is valid code, but for me it's freaky, and I surely stick to the second variant. You said I can better see in which Unicode page I am by looking at the hex ordinal, but I hardly need that; I just need to know one integer, namely where some range begins, and that's it. Furthermore, this is the code which an average programmer would better read and maintain. So it is a question of maintainability (+1). Secondly, for me it is a question of being able to type in and correct these decimal values: look, if I make a mistake, a typo, or want to expand the range by some value, I need to do sum and subtract operations in my head to progress with my code effectively. Obviously nobody operates well with two notations in their head simultaneously, so I will complete my code without extra effort. Is it clear now what I mean by conversions back and forth? This example alone actually explains my whole point very well, I feel however like being misunderstood or so.
Yes, ideally one uses other glyphs for base 16; it does not, however, mean that one must use newly invented glyphs. In standard ASCII there are enough glyphs that would work way better together, but it is too late anyway; it should have been decided better at the time the standard was declared. Got to love it.
Greg, I feel somehow that you are an open-minded person, and I value this. You also understand quite well how you read. What you refer to here is the brevity of the word. Indeed there is some degradation of readability if the word is too big, or a font is set to a big size, so you break it; one step towards better. And now I'll explain to you some further magic regarding the binary representation. If you find free time you can experiment a bit. So what is so peculiar about a bitstring, actually? A bitstring, unlike higher bases, can be treated as an absence/presence of the signal, which is not possible for higher bases; literally, a binary string can be made almost "analphabetic", if one could say so. Consider such a notation: instead of

1100 0000 0000 0001

you write

ұұ-ұ ---- ---- ---ұ

(NOTE: of course if you read this in a non-monospaced font you will not see it correctly; I should make screenshots, which I will do in a while.) Note that I chose this letter not accidentally; this letter is similar to one of the glyphs with peak readability. The unset value would simply be a stroke. So I take only one letter.

ұұ-ұ ---- ---- ---ұ
---ұ ---ұ --ұ- -ұ--
--ұ- ---- ---- ---ұ
---- ---- --ұ- ---ұ
-ұұ- ұұ-- ---- ----
---- ---- ---- ----
--ұұ ---- ұ--- ---ұ
-ұ-- --ұұ ---- ---ұ

So the digits need not be equal-weighted as in higher bases. What does it bring? Simple: you can downscale the strings, so a 16-bit value would be ~60 pixels wide (for a 96dpi display) without legibility loss, which compensates for the "too wide to scan" issue. And don't forget to use enough line spacing. Other benefits of a binary string, obviously:

- nice editing features like bit shifting
- very interesting cognitive features (these become more noticeable, however, if you train to work with it)
...

So there is a whole bunch of good effects. Understand me right: I have no reason not to believe you when you say that you don't see any effect, but you should always remember that this can be simply caused by your habit. So if you are more than 40 years old (sorry for some familiarity) this can be a really strong issue and unfortunately hardly changeable. It is not bad, it is a natural thing, it is so with everyone.
It is not about speed, it is about brain load. The Chinese can read their hieroglyphs fast, but the cognitive load on the brain is 100 times higher than with the current Latin set. I know people who can read bash scripts fast, but would you claim that bash syntax can be any good compared to Python syntax?
As already noted, another good alternative for 8-bit-aligned data would be quaternary notation; it is 2x more compact and can be very legible due to the few glyphs, and it is also possible to emulate it with existing chars. Mikhail

On Fri, Oct 14, 2016 at 07:21:48AM +0200, Mikhail V wrote:
For an English-speaker writing that, I'd recommend: if "\N{ETHIOPIC SYLLABLE SA}" <= c <= "\N{ETHIOPIC SYLLABLE SHWA}": ... which is a bit verbose, but that's the price you pay for programming with text in a language you don't read. If you do read Ethiopian, then you can simply write: if "ሰ" <= c <= "ሿ": ... which to a literate reader of Ethiopean, is no harder to understand than the strange and mysterious rotated and reflected glyphs used by Europeans: if "d" <= c <= "p": ... (Why is "double-u" written as vv (w) instead of uu?)
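The names used in the \N{...} escapes come from the Unicode Character Database and can be looked up at runtime with the unicodedata module; a small sketch using the code points from the example above:

import unicodedata

assert unicodedata.name("\u1230") == "ETHIOPIC SYLLABLE SA"
assert unicodedata.name("\u123f") == "ETHIOPIC SYLLABLE SHWA"
assert "\N{ETHIOPIC SYLLABLE SA}" == "\u1230"

# lookup() is the inverse of name().
assert unicodedata.lookup("ETHIOPIC SYLLABLE SHWA") == "\u123f"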
Which is clearly not the same thing, and better written as: if "d" <= c <= "\x96": ...
No, the average programmer is MUCH more skillful than that. Your standard for what you consider "average" seems to me to be more like "lowest 20%". [...]
I feel however like being misunderstood or so.
Trust me, we understand you perfectly. You personally aren't familiar or comfortable with hexadecimal, Unicode code points, or programming standards which have been in widespread use for at least 35 years, and probably more like 50, but rather than accepting that this means you have a lot to learn, you think you can convince the rest of the world to dumb-down and de-skill to a level that you are comfortable with. And that eventually the entire world will standardise on just 100 characters, which you think is enough for all communication, maths and science. Good luck with that last one. Even if you could convince the Chinese and Japanese to swap to ASCII, I'd like to see you pry the emoji out of the young folk's phones. [...]
Citation required. -- Steve

On Fri, Oct 14, 2016, at 01:54, Steven D'Aprano wrote:
This is actually probably the one part of this proposal that *is* feasible. While encoding emoji as a single character each makes sense for a culture that already uses thousands of characters, before they existed the English-speaking software industry already had several competing "standards" emerging for encoding them as sequences of ASCII characters.

On Fri, Oct 14, 2016 at 07:56:29AM -0400, Random832 wrote:
It really isn't feasible to use emoticons instead of emoji, not if you're serious about it. To put it bluntly, emoticons are amateur hour. Emoji implemented as dedicated code points are what professionals use. Why do you think phone manufacturers are standardising on dedicated code points instead of using emoticons? Anyone who has every posted (say) source code on IRC, Usenet, email or many web forums has probably seen unexpected smileys in the middle of their code (false positives). That's because some sequence of characters is being wrongly interpreted as an emoticon by the client software. The more emoticons you support, the greater the chance this will happen. A concrete example: bash code in Pidgin (IRC) will often show unwanted smileys. The quality of applications can vary greatly: once the false emoticon is displayed as a graphic, you may not be able to copy the source code containing the graphic and paste it into a text editor unchanged. There are false negatives as well as false positives: if your :-) happens to fall on the boundary of a line, and your software breaks the sequence with a soft line break, instead of seeing the smiley face you expected, you might see a line ending with :- and a new line starting with ). It's hard to use punctuation or brackets around emoticons without risking them being misinterpreted as an invalid or different sequence. If you are serious about offering smileys, snowmen and piles of poo to your users, you are much better off supporting real emoji (dedicated Unicode characters) instead of emoticons. It is much easier to support ☺ than :-) and you don't need any special software apart from fonts that support the emoji you care about. -- Steve

Steven D'Aprano wrote:
That's because some sequence of characters is being wrongly interpreted as an emoticon by the client software.
The only thing wrong here is that the client software is trying to interpret the emoticons. Emoticons are for *humans* to interpret, not software. Subtlety and cleverness is part of their charm. If you blatantly replace them with explicit images, you crush that. And don't even get me started on *animated* emoji... -- Greg

On Sat, Oct 15, 2016 at 01:42:34PM +1300, Greg Ewing wrote:
Heh :-) I agree with you. But so long as people want, or at least phone and software developers think people want, graphical smiley faces and dancing paperclips and piles of poo, then emoticons are a distinctly more troublesome way of dealing with them. -- Steve

Mikhail V wrote:
Note that, if need be, you could also write that as if 0x64 <= o <= 0x96:
So yours is valid code, but for me it's freaky, and I surely stick to the second variant.
The thing is, where did you get those numbers from in the first place? If you got them in some way that gives them to you in decimal, such as print(ord(c)), there is nothing to stop you from writing them as decimal constants in the code. But if you got them e.g. by looking up a character table that gives them to you in hex, you can equally well put them in as hex constants. So there is no particular advantage either way.
To a maintainer who is familiar with the layout of the unicode code space, the hex representation of a character is likely to have some meaning, whereas the decimal representation will not. So for that person, using decimal would make the code *harder* to maintain. To a maintainer who doesn't have that familiarity, it makes no difference either way. So your proposal would result in a *decrease* of maintainability overall.
Yes, but in my experience the number of times I've had to do that kind of arithmetic with character codes is very nearly zero. And when I do, I'm more likely to get the computer to do it for me than work out the numbers and then type them in as literals. I just don't see this as being anywhere near being a significant problem.
Out of curiosity, what glyphs do you have in mind?
Yes, you can make the characters narrow enough that you can take 4 of them in at once, almost as though they were a single glyph... at which point you've effectively just substituted one set of 16 glyphs for another. Then you'd have to analyse whether the *combined* 4-element glyphs were easier to distinguish from each other than the ones they replaced. Since the new ones are made up of repetitions of just two elements, whereas the old ones contain a much more varied set of elements, I'd be skeptical about that.

BTW, your choice of ұ because of its "peak readability" seems to be a case of taking something out of context. The readability of a glyph can only be judged in terms of how easy it is to distinguish from other glyphs. Here, the only thing that matters is distinguishing it from the other symbol, so something like "|" would perhaps be a better choice.

||-| ---- ---- ---|
Sure, being familiar with the current system means that it would take me some effort to become proficient with a new one. What I'm far from convinced of is that I would gain any benefit from making that effort, or that a fresh person would be noticeably better off if they learned your new system instead of the old one. At this point you're probably going to say "Greg, it's taken you 40 years to become that proficient in hex. Someone learning my system would do it much faster!" Well, no. When I was about 12 I built a computer whose only I/O devices worked in binary. From the time I first started toggling programs into it to the time I had the whole binary/hex conversion table burned into my neurons was maybe about 1 hour. And I wasn't even *trying* to memorise it, it just happened.
Has that been measured? How? This one sets off my skepticism alarm too, because people that read Latin scripts don't read them a letter at a time -- they recognise whole *words* at once, or at least large chunks of them. The number of English words is about the same order of magnitude as the number of Chinese characters.
For the things that bash was designed to be good for, yes, it can. Python wins for anything beyond very simple programming, but bash wasn't designed for that. (The fact that some people use it that way says more about their dogged persistence in the face of adversity than it does about bash.) I don't doubt that some sets of glyphs are easier to distinguish from each other than others. But the letters and digits that we currently use have already been pretty well optimised by scribes and typographers over the last few hundred years, and I'd be surprised if there's any *major* room left for improvement. Mixing up letters and digits is certainly jarring to many people, but I'm not sure that isn't largely just because we're so used to mentally categorising them into two distinct groups. Maybe there is some objective difference that can be measured, but I'd expect it to be quite small compared to the effect of these prior "habits" as you call them. -- Greg

On Fri, Oct 14, 2016 at 8:36 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
And any time I look at a large and complex bash script and say "this needs to be a Python script" or "this would be better done in Pike" or whatever, I end up missing the convenient syntax of piping one thing into another. Shell scripting languages are the undisputed kings of process management. ChrisA

On 14 October 2016 at 11:36, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
I cannot judge what bash is good for, since I never tried to learn it. But it indeed *looks* frightening. The first feeling is: OMG, I must close this and never see it again. Also, I can hardly imagine that the special purpose of some language can excuse ignoring readability; even if it is assembler or whatever, it can be made readable without much effort. So I just look for some other solution for the same task, even if it takes 10 times more code.
That is because that person from the beginning (blindly) follows the convention. So my intention of course was not to find out whether the majority does or not, but rather which one of the two makes more sense *initially*, just trying to imagine that we can decide. To be more precise, if you were to choose between two options:

1. use hex for the glyph index and use hex for numbers (e.g. some arbitrary value like screen coordinates)
2. use decimal for both cases.

I personally choose option 2. Probably nothing will convince me that option 1 would be better; all the more, I don't believe that anything more than base 8 makes much sense for readable numbers. I am just a little bit disappointed that others again and again speak of convention.
I didn't mean that; it just slightly annoys me.
Out of curiosity, what glyphs do you have in mind?
If I were to decide, I would look into a few options here:

1. The easy option, which would raise fewer further questions, is to take the first 16 lowercase letters.
2. A better option would be to choose letters and possibly other glyphs to build up a more readable set. E.g. drop the "c" letter and leave "e" due to their optical collision, and drop some other weak glyphs, like "l" and "h". That of course would raise many further questions, like why do you drop this glyph and not that one, and so on, so it will surely end up in a quarrel. Here lies another problem - the non-constant width of letters - but this is more a problem of fonts and rendering, so it concerns IDE and editor issues.

But as said, I won't recommend base 16 at all.
No no. I didn't mean to shrink them until they melt together. The structure is still there; it's just that with such a notation you don't need to keep the glyphs as big as with many-glyph systems.
I get your idea and this is a very good point. It seems you have experience in such things? Currently I don't know for sure whether such an approach is more or less effective than others, and for which cases. But I can bravely claim that it is better than *any* hex notation; it just follows from what I have here on paper on my table, namely that it is physically impossible to make up a highly effective glyph system of more than 8 symbols. You want more only if you really *need* more glyphs. And skepticism should always be present. One thing however especially interests me: here not only the differentiation of glyphs comes into play, but also the positional principle, which helps to compare, and it can be beneficial for specific cases. So you can clearly see if one number is two times bigger than another, for example. And of course, strictly speaking, those bit groups are not glyphs; you can of course call them so, but this is just rhetoric. So one could call all written English words glyphs too, but they are not really. But I get your analogy; this is how the tests should be made.
True and false. Each glyph taken singly has a specific structure, and put alone it has optical qualities. This is somewhat complicated and hardly describable in words, but anyway, only tests can tell what is better. In this case it is still 2 glyphs, or better said one and a half glyphs. And indeed you can distinguish them really well since they have different mass.
||-| ---- ---- ---|
I can get your idea, although it is not a really correct statement; see above. A vertical bar is hardly a good glyph, actually quite a bad one. Such a notation will cause quite an uncomfortable effect on the eyes, and there are many things here. Less technically, here is a rule:

- a good glyph has a structure, and the boundary of the glyph is a proportional form (like a bulb) (not your case)
- vertical gaps/shears inside these boundaries are bad (your case). One can't always do without them, but vertical ones are much worse than horizontal.
- too primitive a glyph structure is bad (your case)

So a bar is good only as some punctuation sign. For this exact reason such letters as "l", "L", "i" are bad ones, especially their sans-serif variants. And *not* in the first place because they collide with other glyphs. This is somewhat non-obvious. One should understand of course that I just took standard symbols that only try to mimic the correct representation. So if you sometime play around with bitstrings, here are the ASCII-only variants which work best:

-y-y ---y -yy- -y--
-o-o ---o -oo- -o--
-k-k ---k -kk- -k--
-s-s ---s -ss- -s--

No need to say that these will be way, way better than the "01" notation which is used as standard. If you read a lot of numbers you should have noticed how unpleasant it is to scan through 010101.
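Purely as an illustration of that suggested display (the glyph choice and grouping are taken from the message above, nothing more), such a rendering is a few lines with str.translate:

def bitstring(n, width=16, one="y", zero="-"):
    # Render an integer as grouped bits using the glyphs suggested above.
    bits = format(n, "0{}b".format(width)).translate(str.maketrans("01", zero + one))
    return " ".join(bits[i:i + 4] for i in range(0, width, 4))

print(bitstring(0xC001))    # yy-- ---- ---- ---y
print(bitstring(0x1124))    # ---y ---y --y- -y--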
"far from convinced" sounds quite positive however :) it is never too late. I heard from Captain Crunch https://en.wikipedia.org/wiki/John_Draper That he was so tired of C syntax that he finally switched to Python for some projects. I can imagine how unwanted this can be in age. All depends on tasks that one often does. If say, imagine you'll read binaries for a long time, in one of notations I proposed above (like "-y-- --y-" for example) and after that try to switch back to "0100 0010" notation, I bet you will realize that better. Indeed learning new notation for numbers is quite easy, it is only some practice. And with base-2 you don't need learn at all, just can switch to other notation and use straight away.
Has that been measured? How?
I don't think it is measurable at all. That is my opinion, and 100 just shows that I think it is very stressful, also due to the lot of meaning disambiguation that such a system can cause. I also heard personal complaints from young Chinese students; they all had problems with vision already in their early years, but I cannot support that officially. So just imagine: if we take it for truth that the max number of effective glyphs is 8, and hieroglyphs are *all* printed in the same-sized box, how would this provide efficient reading? And if you've seen Chinese books, they are all printed with quite a small font. I am not a very sentimental person but somehow I feel sorry for people; one doesn't deserve it. You know, I became friends with one Chinese girl; she loves to read and is eager to learn, and needs to always carry a pair of glasses with her everywhere. Somehow sad I become now writing it, she is such a sweet young girl... And yes, in this sense one can say that this cognitive load can be measured: you go to universities in China and count those with vision problems.
I don't doubt that some sets of glyphs are easier to distinguish from each other than others. But the
That sounds good; it is not so often that one realizes that :) Most people would say "it's just a matter of habit".
Here I would slightly disagree. First, *digits* are not optimised for anything; they are just a heritage from ancient times. They have some minimal readability: namely, "2" is not bad, the others are quite poor. Second, *small Latin letters* are indeed well fabricated. However, don't have the illusion that someone cared much about their optimisation in the last 1000 years. If you are skeptical about that, take a look at this: http://daten.digitale-sammlungen.de/~db/bsb00003258/images/index.html?seite=... If you believe (there are skeptics who do not believe) that this dates back to the end of the 10th century, then we have an interesting picture here. You see that this is indeed very similar to what you read now - somewhat optimised of course, but without much improvement. Actually in some cases there is even some degradation: now we have the "pbqd" letters, which are just rotations and reflections of each other, which is not good. Strictly speaking you can use only one of these 4 glyphs. And in the last 500 years there have been zero modifications. How much improvement can be made is a hard question. According to my results, indeed the peak-readability forms are similar to certain small Latin letters, but I would say quite a significant improvement could be made. But this is not really measurable. Mikhail

On Sun, Oct 16, 2016 at 12:06 AM, Mikhail V <mikhailwas@gmail.com> wrote:
You should go and hang out with jmf. Both of you have made bold assertions that our current system is physically/mathematically impossible, despite the fact that *it is working*. Neither of you can cite any actual scientific research to back your claims. Bye bye. ChrisA

Mikhail V wrote:
Also I can only hard imagine that special purpose of some language can ignore readability,
Readability is not something absolute that stands on its own. It depends a great deal on what is being expressed.
even if it is assembler or whatever, it can be made readable without much effort.
You seem to be focused on a very narrow aspect of readability, i.e. fine details of individual character glyphs. That's not what we mean when we talk about readability of programs.
So I just look for some other solution for same task, let it be 10 times more code.
Then it will take you 10 times longer to write, and will on average contain 10 times as many bugs. Is that really worth some small, probably mostly theoretical advantage at the typographical level?
That is because that person from beginning (blindly) follows the convention.
What you seem to be missing is that there are *reasons* for those conventions. They were not arbitrary choices. Ultimately they can be traced back to the fact that our computers are built from two-state electronic devices. That's definitely not an arbitrary choice -- there are excellent physical reasons for it. Base 10, on the other hand, *is* an arbitrary choice. Due to an accident of evolution, we ended up with 10 easily accessible appendages for counting on, and that made its way into the counting system that is currently the most widely used by everyday people. So, if anything, *you're* the one who is "blindly following tradition" by wanting to use base 10.
Well, that's the thing. If there were large, objective, easily measurable differences between different possible sets of glyphs, then there would be no room for such arguments. The fact that you anticipate such arguments suggests that any differences are likely to be small, hard to measure and/or subjective.
I think "on paper" is the important thing here. I suspect you are looking at the published results from some study or other and greatly overestimating the size of the effects compared to other considerations. -- Greg

On 16 October 2016 at 02:58, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
In this discussion, yes, but layout aspects can also be improved, and I suppose the special purpose of a language does not always dictate the layout of code; it is up to you, and you can define that as well. And glyphs are not a very narrow aspect; they are one of the fundamental aspects. Also, they are much harder to develop than good layout, note that.
So, if anything, *you're* the one who is "blindly following tradition" by wanting to use base 10.
Yes, because when I was a child I learned it everywhere for everything, and so did others. As said, I don't defend the usage of base 10, as you can already note from my posts.
Those things cannot be easily measured, if at all; it requires a lot of tests and a huge amount of time, and you cannot plug a measuring device into the brain to precisely measure the load. In this case the only choice is to trust the most experienced people who show the results which worked better for them, and to try to implement and compare yourself. Not that I am saying you have any special reason to trust me personally.
If you try to google that particular topic you'll see that there is zero related published material; there are tons of papers on readability, but zero concrete proposals or any attempts to develop something real. That is the thing. I would look at the results if there were something. In my case I am looking at what I've achieved during years of my work on it, and indeed there are some interesting things there. Not that I am overestimating its role, but indeed it can really help in many cases, e.g. like in my example with bitstrings. Last but not least, I am not a "paper ass" in any case; I try to keep to experimental work only, where possible. Mikhail

Mikhail V wrote:
Those things cannot be easily measured, if at all,
If you can't measure something, you can't be sure it exists at all.
Have you *measured* anything, though? Do you have any feel for how *big* the effects you're talking about are?
There must be a *very* solid reason for digits+letters against my variant; I wonder what it is.
The reasons only have to be *very* solid if there are *very* large advantages to the alternative you propose. My conjecture is that the advantages are actually extremely *small* by comparison. To refute that, you would need to provide some evidence to the contrary.

Here are some reasons in favour of the current system:

* At the point where most people learn to program, they are already intimately familiar with reading, writing and pronouncing letters and digits.
* It makes sense to use 0-9 to represent the first ten digits, because they have the same numerical value.
* Using letters for the remaining digits, rather than punctuation characters, makes sense because we're already used to thinking of them as a group.
* Using a consecutive sequence of letters makes sense because we're already familiar with their ordering.
* In the absence of any strong reason otherwise, we might as well take them from the beginning of the alphabet.

Yes, those are all based on "habits", but they're habits shared by everyone, just like the base 10 that you have a preference for. You would have to provide some strong evidence that it's worth disregarding them and using your system instead.

-- Greg

On 16 October 2016 at 23:23, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Those things cannot be easily measured, if at all,
If you can't measure something, you can't be sure it exists at all.
What do you mean, I can't be sure? I am a fully functional, mentally healthy man :)
It depends on the case, of course. So between "0010 0011" and "--k- --kk" I can indeed feel a big difference. Literally, I can read the latter clearly even if I close my left eye and *fully* defocus my right eye. That is indeed a big difference and it tells a lot. I suppose for disabled people this would be the only chance to see anything there. Currently I experiment on myself, and of course I plan to do it with experimental subjects; I plan one survey session at the end of November. But indeed this is very off-topic, so feel free to mail me, if anything. So back to hex notation, which is still not so off-topic, I suppose.
There must be a *very* solid reason for digits+letters against my variant; I wonder what it is.
First, I am of the opinion that the *initial* decision in such a case must be supported by solid reasons, and not just "hey, John has already written them in such a manner, let's take it!". Second, I totally disagree that there must always be *very* large advantages for new standards; if we followed such a principle, we would still use cuneiform for writing, or bash-like syntax, since every time someone proposes a slight improvement, there would be somebody who says: "but the new is not *that much* better than the old!". Actually, in many cases it is better when it keeps evolving and everybody is aware of it.
Here are some reasons in favour of the current system:
So you mean they start to learn hex, see numbers and think: ooh, it looks like a number, not so scary. So less time to learn, yes, +1 (less pain now, more pain later). But if I am an adult, intelligent man, I understand that there are only ten digits, that I will nevertheless need to extend the set, and that they should *all* be optically consistent and well readable. And what is a well readable set with >= 16 glyphs? Small letters! That is, somewhat from the height of my current knowledge, since I know that digits are anyway not very readable.
I actually proposed a consecutive sequence, but that does not make much difference: being familiar with the ordering of the alphabet will have next to zero influence on the reading of numbers encoded with letters; it is just an illusion that it will, since a letter is a symbol - if I see "z" I don't think of 26. More probably, the weight of the glyph could play some role, meaning the less the weight, the smaller the number. Mikhail

On Sun, Oct 16, 2016 at 05:02:49PM +0200, Mikhail V wrote:
This discussion is completely and utterly off-topic for this mailing list. If you want to discuss changing the world to use your own custom character set for all human communication, you should write a blog or a book. It is completely off-topic for Python: we're interested in improving the Python programming language, not yet another constructed language or artifical alphabet: https://en.wikipedia.org/wiki/Shavian_alphabet If you're interested in this, there is plenty of prior art. See for example: Esperanto, Ido, Volapük, Interlingua, Lojban. But don't discuss it here. -- Steve

On 17 October 2016 at 02:23, Steven D'Aprano <steve@pearwood.info> wrote:
You're right; I was just answering the questions, so it drifted to other things somehow. BTW, among other things we have discussed bitstring representation. So if you work with those, for example if you model cryptography algorithms or similar things in Python, this could help you, for example, to debug your programs, and generally one could read it as a question of adding an extra notation for this purpose. And, if you noticed, this is not really about my glyphs but lies within ASCII. So actually it is me who tried to turn it back on-topic. Mikhail

On 10/13/16 2:42 AM, Mikhail V wrote:
You continue to overlook the fact that Unicode codepoints are conventionally presented in hexadecimal, including in the page you linked us to. This is the convention. It makes sense to stick to the convention. When I see a numeric representation of a character, there are only two things I can do with it: look it up in a reference someplace, or glean some meaning from it directly. For looking things up, please remember that all Unicode references use hex numbering. Looking up a character by decimal numbers is simply more difficult than looking them up by hex numbers. For gleaning meaning directly, please keep in mind that Unicode fundamentally structured around pages of 256 code points, organized into planes of 256 pages. The very structure of how code points are allocated and grouped is based on a hexadecimal-friendly system. The blocks of codepoints are aligned on hexadecimal boundaries: http://www.fileformat.info/info/unicode/block/index.htm . When I see \u0414, I know it is a Cyrillic character because it is in block 04xx. It simply doesn't make sense to present Unicode code points in anything other than hex. --Ned.
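A small demonstration of that last point (a sketch, Python 3): the block membership is visible in the top hex digits, unicodedata confirms what the 04xx prefix already told us, and the decimal ordinal gives no such hint.

import unicodedata

cp = 0x0414
print(unicodedata.name(chr(cp)))    # CYRILLIC CAPITAL LETTER DE
print(hex(cp >> 8))                 # 0x4: the 04xx page, i.e. the Cyrillic block
print(cp)                           # 1044: the decimal form reveals nothing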

On 10/12/2016 07:13 PM, Mikhail V wrote:
If you mean that decimal notation is the standard used for _counting_ by people, then yes of course that is standard. But decimal notation certainly is not standard in this domain.
Hexadecimal notation is hardly "obscure", but yes I understand that fewer people understand it than decimal notation. Regardless, byte notation seems standard for unicode and unless you can convince the unicode community at large to switch, I don't think it makes any sense for python to switch. Sometimes it's better to go with the flow even if you don't want to.
There is not mixing for unicode in python; it displays as hexadecimal. Decimal is used in other places though. So if by "mixing" you mean python should not use the standard notations of subdomains when working with those domains, then I would totally disagree. The language used in different disciplines is and has always been variable. Until that's no longer true it's better to stick with convention than add inconsistency which will be much more confusing in the long-term than learning the idiosyncrasies of a specific domain (in this case the use of hexadecimal in the unicode world). Cheers, Thomas

On Oct 12, 2016 4:33 PM, "Mikhail V" <mikhailwas@gmail.com> wrote:
If decimal notation isn't used for parsing, only for printing, it would be confusing as heck, but using it for both would break a lot of code in subtle ways (the worst kind of code breakage).
The Unicode standard. I agree that hex is hard to read, but the standard uses it to refer to the code points. It's great to be able to google code points and find the characters easily, and switching to decimal would screw it up. And I've never seen someone *need* to figure out the decimal version from the hex before. It's far more likely to google the hex #. TL;DR: I think this change would induce a LOT of short-term issues, despite it being up in the air if there's any long-term gain. So -1 from me.
2. Mixing of two notations (hex and decimal) is a _very_ bad idea, I hope no need to explain why.
Indeed, you don't. :)
-- Ryan (ライアン) [ERROR]: Your autotools build scripts are 200 lines longer than your program. Something’s wrong. http://kirbyfan64.github.io/
participants (19)
- Brendan Barnwell
- Chris Angelico
- Cory Benfield
- Danilo J. S. Bellini
- Emanuel Barry
- Greg Ewing
- Jonathan Goble
- M.-A. Lemburg
- Mikhail V
- MRAB
- Ned Batchelder
- Random832
- Ryan Gonzalez
- Serhiy Storchaka
- Sjoerd Job Postmus
- Steve Dower
- Steven D'Aprano
- Thomas Nyberg
- Todd