Input characters in strings by decimals (Was: Proposal for default character representation)

In past discussion about inputting and printing characters, I was proposing decimal notation instead of hex. Since the discussion was lost in off-topic talk, I'll try to summarise my idea better. I use ASCII only for code input (there are good reasons for that). Here I'll use Python 3.6 on Windows 7, so I can use print() with Unicode directly and it now works in the system console.

Suppose I am only starting programming and want to do some character manipulation. The very first thing I would probably start with is a simple output of Latin and Cyrillic capital letters:

    caps_lat = ""
    for o in range(65, 91):
        caps_lat = caps_lat + chr(o)
    print(caps_lat)

    caps_cyr = ""
    for o in range(1040, 1072):
        caps_cyr = caps_cyr + chr(o)
    print(caps_cyr)

Which prints:

    ABCDEFGHIJKLMNOPQRSTUVWXYZ
    АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ

Say I now want to input something directly in code:

    s = "first cyrillic letters: " + chr(1040) + chr(1041) + chr(1042)

This works fine and looks clean. However, it is not very convenient because of the amount of typing, and it also adds a bit of complexity if I generate such strings. But in general it is fine, and it is the method I currently use.

=========
Proposal: I would like the possibility to input characters *by decimals*:

    s = "first cyrillic letters: \{1040}\{1041}\{1042}"

or:

    s = "first cyrillic letters: \(1040)\(1041)\(1042)"
=========

This is more compact and does not seem to contradict the current escape characters in Python string literals, since backslash starts an escape in most cases. Most important for me is that this way I would avoid any hex numbers in strings, which I find very good for readability; it is also very convenient for me, since I use decimals for processing everywhere (and encourage everyone to do so). So this is my proposal; any comments on it are appreciated.
PS: Currently Python 3 supports these in addition to \x (from https://docs.python.org/3/howto/unicode.html):

    """If you can't enter a particular character in your editor or want to keep the source code ASCII-only for some reason, you can also use escape sequences in string literals."""

So I have many possibilities, and all of them strangely contradict my image of intuitive and readable. Well, using a character name is readable, but it is seriously not much of a practical solution for input; it could, however, be very useful for printing a description of a character. Mikhail

Out of curiosity, why do you prefer decimal values to refer to Unicode code points? Most references, http://unicode.org/charts/PDF/U0400.pdf (official) or https://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF , prefer to refer to them by hexadecimal as the planes and ranges are broken up by hex values. On Wed, Dec 7, 2016 at 5:52 PM, Mikhail V <mikhailwas@gmail.com> wrote:

On 8 December 2016 at 01:13, Nick Timkovich <prometheus235@gmail.com> wrote:
Well, there was a huge discussion in October; see the subject name. I just didn't want it to go in that direction again. So, in short: hex notation is not so readable, and anyway decimal is the standard way to represent numbers. I treat a string as a number array when I am processing it, so hex is simply redundant and not needed for me. Mikhail

On 12/7/2016 7:22 PM, Mikhail V wrote:
I sympathize with your preference, but ... Perhaps the hex numbers would bother you less if you thought of them as 'serial numbers'. It is standard for 'serial numbers' to include letters. It is also common for digit-letter serial numbers to have meaningful fields, as do the hex versions of Unicode serial numbers. The decimal versions are meaningless except as strict sequencers. -- Terry Jan Reedy

hex notation not so readable and anyway decimal is kind of standard way to represent numbers
Can you cite some examples of Unicode reference tables I can look up a decimal number in? They seem rare; perhaps in a list as a secondary column, but they're not organized/grouped decimally. Readability counts, and introducing a competing syntax will make it harder for others to read.

On 8 December 2016 at 01:57, Nick Timkovich <prometheus235@gmail.com> wrote:
There were links to such tables in the previous discussion. Google "unicode table decimal" and the first link will be it. I think most online tables include decimals as well, usually as tuples of 8-bit decimals. Earlier, the decimal code was the first column in most tables, but it somehow settled in people's minds that hex references should be preferred, for no solid reason IMO. One reason, I think, is the HTML standards, which started using hex in HTML files long ago and had much influence later; but one should understand that this is just for brevity in most cases. Another reason is that file viewers show hex by default, but that is just misfortune: nothing besides brevity and 4-bit word alignment speaks for hex notation, unfortunately, at least in its current typeface. This was actually discussed in that thread.

Many people also think they are cool hackers if they do everything in hex :) In some cases it is worth it, but not this case IMO. It is mainly for bitwise stuff, but then one should look into binary/ternary/quaternary representation depending on the nature of the operations and the hardware.

Yes, there is a correspondence between Unicode table pagination and hex references, but that hardly plays any positive role for real applications. Most of the time I need to look in my code and perform number operations on *specific* ranges and codes, not on whole pages of the table. It could only play a role if I did low-level filtering of large files and wanted to filter data by a character's page, but that is the only positive thing I can think of, and I don't think it applies directly to Python. Imagine some cryptography exercise: you take 27 units, you just give them numbers (0..26) and do calculations. Yes, you can view the results as hex numbers, but I don't, and most people don't and should not, since why? It is ugly and not readable.

Dear Mikhail, With Python 3.6 you can use format strings to get very close to your desired behaviour:

    f"{48:c}" == "0"
    f"{<normal int literal here>:c}" == chr(<normal int literal here>)

It works with variables too:

    charvalue = 48
    f"{charvalue:c}" == chr(charvalue)  # == "0"

This is only 1 character of overhead + 1 extra character per char formatted compared to your example. And as an extra you can use hex literals (f"{0x30:c}" == "0") and any other integer literal you might want. I don't see the added value of making character escapes, in a non-default way, only (chars escaped + 1) bytes shorter, with the added maintenance and development cost. I think that you can do a lot with f-strings, and using the built-in formatting options you can already get the behaviour you want in Python 3.6, months earlier than the next opportunity (Python 3.7). Check out the formatting options for integers and other built-in types here: https://docs.python.org/3.6/library/string.html#format-specification-mini-la... I hope this helps solve your apparent usability problem. -Matthias On 8 December 2016 at 03:07, Mikhail V <mikhailwas@gmail.com> wrote:
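A minimal runnable demonstration of the `c` presentation type described above (requires Python 3.6 or later):

```python
# The "c" format spec converts an integer to the corresponding
# Unicode character, just like chr().
charvalue = 48
print(f"{charvalue:c}")             # prints: 0
print(f"{1040:c}{1041:c}{1042:c}")  # decimal literals, prints: АБВ
print(f"{0x410:c}")                 # hex literals work too, prints: А
```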

On 8 December 2016 at 03:32, Matthias welp <boekewurm@gmail.com> wrote:
Waaa! This works!
I hope this helps solve your apparent usability problem.
Big big thanks, I didn't know this feature, but I have googled a lot about "input characters as decimals", so was it just added? Another piece of evidence that Python rules! I'll rewrite some code; I hope it'll have no side issues. Mikhail

On Wed, Dec 7, 2016 at 10:45 PM, Mikhail V <mikhailwas@gmail.com> wrote:
Yes, f-strings are a new feature in Python 3.6, which is currently in the release candidate stage. The final release of 3.6.0 (and thus the first stable release with this feature) is scheduled for December 16.

On Wed, Dec 7, 2016 at 9:07 PM, Mikhail V <mikhailwas@gmail.com> wrote:
it somehow settled in peoples' minds that hex reference should be preferred, for no solid
reason IMO. I may be showing my age, but all the facts that I remember about ASCII codes are in hex:

1. SPACE is 0x20, followed by punctuation symbols.
2. Decimal digits start at 0x30, with '0' = 0x30, '1' = 0x31, ...
3. @ is 0x40, followed by the upper-case letters: 'A' = 0x41, 'B' = 0x42, ...
4. Lower-case letters are offset by 0x20 from the upper-case ones: 'a' = 0x61, 'b' = 0x62, ...

Unicode is also organized around hexadecimal codes, with various scripts positioned in sections that start at round hexadecimal numbers. For example, Cyrillic is at 0x0400 through 0x04FF <http://unicode.org/charts/PDF/U0400.pdf>. The only decimal fact I remember about Unicode is that the largest code point is 1114111 - a palindrome!
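The facts listed above are easy to verify interactively; a quick sketch:

```python
# The round hex positions of ASCII landmarks, and the one
# decimal fact: 0x10FFFF == 1114111, a decimal palindrome.
assert ord(" ") == 0x20
assert ord("0") == 0x30 and ord("9") == 0x39
assert ord("@") == 0x40 and ord("A") == 0x41
assert ord("a") - ord("A") == 0x20   # cases differ by a single bit
assert chr(0x10FFFF) == chr(1114111)
print("all ASCII hex facts check out")
```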

On 8 December 2016 at 03:36, Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
As an aside, I've just noticed that in my example:

    s = "first cyrillic letters: \{1040}\{1041}\{1042}"
    s = "first cyrillic letters: \u0410\u0411\u0412"

the hex and decimal codes are made up of the same digits; such a peculiar coincidence... So you were caught by hex from the beginning, as I see ;) I, on the contrary, in the dark times of learning programming (that was C), always oriented myself by decimal codes, and I don't regret it now.

On Wed, Dec 7, 2016, at 22:06, Mikhail V wrote:
C doesn't support decimal in string literals either, only octal and hex (incidentally, octal seems to have been much more common in the environments where C was first invented). Actually, now that I think about it, I can think of one context where decimal is used for characters: ANSI/ISO standards for 8-bit character sets often use a 'split' decimal format (i.e. DEL = 7/15 rather than 0x7F or 127).
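The 'split' notation is column/row in a 16-row code chart, so converting it back to a code point is just column * 16 + row. A sketch (the helper name here is made up for illustration):

```python
def from_split(col, row):
    """Convert ISO 'split' notation, e.g. DEL = 7/15, to a code point."""
    return col * 16 + row

print(from_split(7, 15))   # 127, i.e. 0x7F: DEL
print(from_split(4, 1))    # 65, i.e. 0x41: 'A'
```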

On 8 December 2016 at 05:39, Random832 <random832@fastmail.com> wrote:
That is true, it does not support decimals in string literals, but I don't remember (it was more than 10 years ago) using anything but decimals for text processing in C. So: normally load a file into memory, iterate over the bytes, compare the values, and so on. And, though it is very foggy in my memory, at that time most ASCII tables included decimals, and they normally stood in the first column; but I could be wrong now, I've got to google some original tables. Jeez, how positively this thread turned out: first Ethan said it would never be implemented, and it turns out it has already been implemented. Christmas magic.

On 8 December 2016 at 15:46, Alexandre Brault <abrault@mapgears.com> wrote:
No, I don't need to specify "unicode table *decimal*". Results for "unicode table" in Google: Top result #2: www.utf8-chartable.de/ Top result #4: http://www.tamasoft.co.jp/en/general-info/index.html Some sites do not provide any code conversion, but everybody can do it easily, and I don't have problems generating a table programmatically. And I hope it is clear why most people stick to hex (I never argued that, BTW): it is mostly historical, nothing to do with "logical". There is just a tendency to repeat what the majority does, and that is not always good; this case would be an example.

Except that both of these websites show you hexadecimal notation.
That's not true. Characters are sorted by ranges. For example, I know that everything below 0x20 is a control code, uppercase ASCII letters start at 0x41 (0x40 is '@'), and lowercase ASCII letters start at 0x61 (where 0x60 is '`'): trivial to remember. I also know that ASCII covers half the byte range, going up to 0x7f (half of 0x100). For instance, the first letter of my name is 0xc9, and anyone can know, at a glance and without knowing my name or what the letter is, that it's not ASCII. Also, as far as I know, lowercase letters (ASCII or not) begin some multiple of 0x10 after the beginning of the uppercase letters (0x20 for ASCII or Latin-1). As such, since I know that 'É' is 0xc9, I can know, without even looking, that 0xe9 is 'é'. That would be a lot trickier to remember and get right in decimal. As an aside (and I don't know this by heart), various sets of characters begin at fixed points, and knowing those points (when you need to work with specific sets of characters) can be very useful. If you look at a website (https://unicode-table.com/ seems good), you can even select ranges of characters, which conveniently end up being multiples of 0x10 (or 16 in decimal). If your point is "it's easier to work with numbers ending with 0", then you'll be pleased to know that character sets are actually designed so that, using hexadecimal notation, you're dealing with numbers ending with 0! Doing this using decimal notation is clunky at best. Yours, \xc9manuel
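The case relationship described above is a single-bit difference in ASCII and Latin-1, which is easy to check:

```python
# Uppercase and lowercase differ only in the 0x20 bit
# for ASCII and Latin-1 letters.
assert ord("\xe9") - ord("\xc9") == 0x20     # é vs É
assert chr(ord("\xc9") | 0x20) == "\xe9"     # set the bit: É -> é
assert chr(ord("A") | 0x20) == "a"           # same trick for ASCII
print("case bit verified")
```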

On Fri, Dec 9, 2016 at 3:06 AM, Mikhail V <mikhailwas@gmail.com> wrote:
Both of those show hex first, and decimal as an additional feature.
In the first place, many people have pointed out to you that Unicode *is* laid out best in hexadecimal. (Another example: umop apisdn ?! are ¿¡, which are ?! with one high bit set.) But in the second place, "what the majority does" actually IS a strong argument. It's called consistency. Why is "\r" a carriage return? Wouldn't it be more logical to use "\c" for that? Except that EVERYONE uses \r for it. And the one time in my life that I found "\123" to mean "{" rather than "S", it was a great frustration for me: http://rosuav.blogspot.com.au/2012/12/i-want-my-octal.html And that's the choice between decimal and *octal*, which is a far less well known base than hex is. I would still prefer octal, because it's consistent. So because of consistency, Python needs to support "\u0303" to mean COMBINING TILDE, and any competing notation has to be in addition to that. Can you justify the confusion of sometimes working with hex and sometimes decimal? It's a pretty high bar to attain. You have to show that decimal isn't just marginally better than hex; you have to show that there are situations where the value of decimal character literals is so great that it's worth forcing everyone to learn two systems. And I'm not convinced you've even hit the first point. ChrisA
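Python string literals still accept octal escapes, which is exactly the kind of confusion described above:

```python
# "\123" is octal 123 = decimal 83 = "S", not decimal 123 ("{").
assert "\123" == "S"
assert chr(0o123) == "S"
assert chr(123) == "{"
print("\123")   # prints: S
```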

On 8 December 2016 at 17:52, Chris Angelico <rosuav@gmail.com> wrote:
In the first place, many people have pointed out to you that Unicode *is* laid out best in hexadecimal.
Ok, if it is aligned intentionally on a binary grid, obviously hex numbers will show some patterns; but who argues? And to be fair, from my examples for Cyrillic, the range start points in hex vs decimal are: capitals: U+0410 #1040; lowercase: U+0430 #1072. So I need to remember one number, 1040; then, since I know there are 32 letters (except Ё), I just sum 1040 + 32 and get 1072, and this will be the beginning of the lowercase range. There are of course people who can efficiently add and subtract in their heads in hex, but I am not one of them (guess who is in the minority here), and there is no need to do it in this case. So if I know the distances between ranges, I can do it all much more easily in my head. Not a strong argument? To be more pedantic: if you know that the Russian alphabet has exactly 33 letters, not the 32 one could infer from the Unicode table, you will also have noticed that the letter Ё is U+0401 and ё is U+0451. This means they are torn away from the other letters and do not even lie in the range. In practice, this means that if I want to filter against code ranges, I need to additionally check the values U+0451 and U+0401. Is this not because someone decided to align the alphabet in such a way? Alignment is not a bad idea, but it should not contradict common sense.
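The filtering described above — the contiguous А..я block plus the out-of-range Ё/ё — can be sketched as (the function name is made up here):

```python
def is_russian_letter(ch):
    """True if ch is a Russian letter. The contiguous block is
    А (1040) .. я (1103); Ё (1025) and ё (1105) must be
    special-cased because they lie outside that range."""
    o = ord(ch)
    return 1040 <= o <= 1103 or o in (1025, 1105)

print(is_russian_letter("Д"), is_russian_letter("Ё"), is_russian_letter("Z"))
# prints: True True False
```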
Frankly, I don't fully understand your point here. Everyone knows decimal; the address of an element in a table is a number, and in most cases I don't need to learn it by heart, since it is already known and written in some table on your PC. Also, inputting characters by decimal is a very common thing: Alt key combos (Alt+0192) are very well established, and many people *do* learn decimal code points by heart, including me. So now it is you who wants me to learn two numbering systems for no reason. And even with all that said, it is not the strongest argument. Most important is that hex notation is an ugly circumstance, and in this case there is too little reason to introduce it into an algorithm that just checks ranges and specific values. And for *specific single* values it is absolutely irrelevant which alignment you have: you just choose what is more readable and/or common for abstract numbers. But that is another big question, and current hex notation does not fall into the category "more readable" anyway. Mikhail

On Fri, Dec 9, 2016 at 5:37 AM, Mikhail V <mikhailwas@gmail.com> wrote:
Let me clarify. When you construct a string, you can already use escapes to represent characters: "n\u0303" --> n followed by combining tilde In order to be consistent with other languages, Python *has* to support hexadecimal. Plus, Python has _already_ supported hex for some time. To establish decimal as an alternative, you have to demonstrate that it is worth having ANOTHER way to do this. With completely green-field topics, you can debate the merits of one notation against another, and the overall best one will win. But when there's a well-established existing notation, you have to justify the proliferation of notations. You have to show that your new format is *so much* better than the existing one that it's worth adding it in parallel. That's quite a high bar - not impossible, obviously, but you need some very strong justification. At the moment, you're showing minor advantages to decimal, and other people are showing minor advantages to hex; but IMO nothing yet has been strong enough to justify the implementation of a completely new way to do things - remember, people have to understand *both* in order to read code. ChrisA

On 8 December 2016 at 19:45, Chris Angelico <rosuav@gmail.com> wrote:
If the arguments in my last post are not strong enough, I think it will be too hard to make them stronger. In my eyes the benefits in this case clearly outweigh the downsides. And anyway, since I can use an f-string now to input it, one can probably just relax. And this: f"{65:c}{66:c}{67:c}" actually looks significantly better than: "\d{65}\d{66}\d{67}". And it covers the cases I was addressing with the proposal. I am happy. +1000 to the developers, even if this is an "accidental" feature.

On Thu, Dec 8, 2016, at 11:06, Mikhail V wrote:
The problem is that there's a logic associated with how the character sets are designed. The character table works a lot better with rows of 16 than with rows of 10 or 20. In many blocks you get the uppercase letters lined up above the lowercase letters, for example. And if your rows are 16 (or 32, though that doesn't work as well for Unicode because e.g. the Cyrillic basic set А-Я/а-я starts from 0x410), then your row and column labels work better in hex, because you've lined up 0x40 above 0x50 and 0x60, which share the last digit, unlike 64/80/96, and the whole row (or half the row for 32) shares all but the last digit. And those values are also only off by one bit, too. Even if we were to arrange the characters themselves in rows of 10/20, so that an "alphabet row" holds 30 or 40 characters, you'd have to add or subtract to change the case, whereas many early character sets were designed so that this could be done by changing a bit, for bit-paired keyboards. What looks better?

Hex:

    АБВГДЕЖЗИЙКЛМНОП
    РСТУФХЦЧШЩЪЫЬЭЮЯ
    абвгдежзийклмноп
    рстуфхцчшщъыьэюя

Decimal:

    АБВГДЕЖЗИЙКЛМНОПРСТУ
    ФХЦЧШЩЪЫЬЭЮЯабвгдежз
    ийклмнопрстуфхцчшщъы
    ьэюя

And it's only luck that the uppercase Russian alphabet starts at the beginning of a line. The ASCII section with the English alphabet looks like this in decimal:

    <=>?@ABCDEFGHIJKLMNO
    PQRSTUVWXYZ[\]^_`abc
    defghijklmnopqrstuvw
    xyz

compared to this in hex:

    @ABCDEFGHIJKLMNO
    PQRSTUVWXYZ[\]^_
    `abcdefghijklmno
    pqrstuvwxyz
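The row comparison above can be reproduced with a short snippet, with the row width as a parameter:

```python
def print_block(start, count, width):
    """Print `count` characters from code point `start`,
    `width` per row: 16 shows the hex alignment, 20 the decimal one."""
    chars = "".join(chr(c) for c in range(start, start + count))
    for i in range(0, count, width):
        print(chars[i:i + width])

print_block(0x410, 64, 16)   # Cyrillic А..я in hex-aligned rows
print_block(0x410, 64, 20)   # the same block in rows of 20
```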

The Unicode Consortium reference entirely lacks decimal values in its tables; EVERYTHING is given solely in hex. I'm sure someone somewhere has created a table with decimal values, but it's very rare. We should not change Python syntax because exactly one user prefers decimal representations. At most there can be an external library to cover strings in whatever manner he wants. Why is octal being neglected for us old fogeys?! 😏 On Dec 7, 2016 6:11 PM, "Mikhail V" <mikhailwas@gmail.com> wrote:

On 12/07/2016 03:52 PM, Mikhail V wrote:
While the discussion did range far and wide, one thing that was fairly constant is that the benefit of adding one more way to represent unicode characters is not worth the work involved to make it happen; and that using hexadecimal to reference unicode characters is nearly universal. To sum up: even if you wrote all the code yourself, it would not be accepted. -- ~Ethan~

On 2016-12-07 23:52, Mikhail V wrote:
It's usually the case that escapes are \ followed by an ASCII-range letter or digit; \ followed by anything else makes it a literal, even if it's a metacharacter: e.g. " terminates a string that starts with ", but \" is a literal ", so I don't like \{...}. Perl doesn't have \u... or \U...; it has \x{...} instead, and Python already has \N{...}, so: s = "first cyrillic letters: \d{1040}\d{1041}\d{1042}" might be better, but I'm still -1 because hex is usual when referring to Unicode code points.

On 7 December 2016 at 23:52, Mikhail V <mikhailwas@gmail.com> wrote:
-1. We already have plenty of ways to specify characters in strings[1], we don't need another. If readability is what matters to you, and you (unlike many others) consider hex to be unreadable, use the \N{...} form. Paul [1] Including (ab)using f-strings to hide the use of chr().

Out of curiosity, why do you prefer decimal values to refer to Unicode code points? Most references, http://unicode.org/charts/PDF/U0400.pdf (official) or https://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF , prefer to refer to them by hexadecimal as the planes and ranges are broken up by hex values. On Wed, Dec 7, 2016 at 5:52 PM, Mikhail V <mikhailwas@gmail.com> wrote:

On 8 December 2016 at 01:13, Nick Timkovich <prometheus235@gmail.com> wrote:
Well, there was a huge discussion in October, see the subject name. Just didnt want it to go again in that direction. So in short hex notation not so readable and anyway decimal is kind of standard way to represent numbers and I treat string as a number array when I am processing it, so hex simply is redundant and not needed for me. Mikhail

On 12/7/2016 7:22 PM, Mikhail V wrote:
I sympathize with your preference, but ... Perhap the hex numbers would bother you less if you thought of them as 'serial numbers'. It is standard for 'serial numbers' to include letters. It is also common for digit-letter serial numbers to have meaningful fields, as as do the hex versions of unicode serial numbers. The decimal versions are meaningless except as strict sequencers. -- Terry Jan Reedy

hex notation not so readable and anyway decimal is kind of standard way to represent numbers
Can you cite some examples of Unicode reference tables I can look up a decimal number in? They seem rare; perhaps in a list as a secondary column, but they're not organized/grouped decimally. Readability counts, and introducing a competing syntax will make it harder for others to read.

On 8 December 2016 at 01:57, Nick Timkovich <prometheus235@gmail.com> wrote:
There were links to such table in previos discussion. Googling "unicode table decimal" and first link will it be. I think most online tables include decimals as well, usually as tuples of 8-bit decimals. Also earlier the decimal code was the first column in most tables, but it somehow settled in peoples' minds that hex reference should be preferred, for no solid reason IMO. One reason I think due to HTML standards which started to use it in html files long ago and had much influence later, but one should understand, that is just for brevity in most cases. Other reason is, file viewers show hex by default, but that is just misfortune, nothin besides brevity and 4-bit word alignment gives the hex notation unfortunatly, at least in its current typeface. This was discussed actually in that thread. Many people also think they are cool hackers if they make everything in hex :) In some cases it is worth it, but not this case IMO. Mainly for bitwise stuff, but then one should look into binary/trinary/quaternary representation depending on nature of operations and hardware. Yes there is unicode table pagination correspondence in hex reference, but that hardly plays any positive role for real applications, most of the time I need to look in my code and also perform number operations on *specific* ranges and codes, but not on whole pages of the table. This could only play role if I do low-level filtering of large files and want to filter out data after character's page, but that is the only positive thing I can think of, and I don't think it is directly for Python. Imagine some cryptography exercise - you take 27 units, you just give them numbers (0..26) and you do calculations, yes you can view results as hex numbers, but I don't do it and most people don't and should not, since why? It is ugly and not readable.

Dear Mikhail, With python3.6 you can use format strings to get very close to your desired behaviour: f"{48:c}" == "0" f"{<normal int literal here>:c}" == chr(<normal int literal here>) It works with variables too: charvalue = 48 f"{charcvalue:c}" == chr(charvalue) # == "0" This is only 1 character overhead + 1 character extra per char formatted compared to your example. And as an extra you can use hex strings (f"{0x30:c}" == "0") and any other integer literal you might want. I don't see the added value of making character escapes in a non-default way only (chars escaped + 1) bytes shorter, with the added maintenance and development cost. I think that you can do a lot with f-strings, and using the built-in formatting options you can already get the behaviour you want in Python 3.6, months earlier than the next opportunity (Python 3.7). Check out the formatting options for integers and other built-in types here: https://docs.python.org/3.6/library/string.html#format-specification-mini-la... I hope this helps solve your apparent usability problem. -Matthias On 8 December 2016 at 03:07, Mikhail V <mikhailwas@gmail.com> wrote:

On 8 December 2016 at 03:32, Matthias welp <boekewurm@gmail.com> wrote:
Waaa! This works!
I hope this helps solve your apparent usability problem.
Big big thanks, I didn't now this feature, but I have googled alot about "input characters as decimals" , so it is just added? Another evidence that Python rules! I'll rewrite some code, hope it'll have no side issues. Mikhail

On Wed, Dec 7, 2016 at 10:45 PM, Mikhail V <mikhailwas@gmail.com> wrote:
Yes, f-strings are a new feature in Python 3.6, which is currently in the release candidate stage. The final release of 3.6.0 (and thus the first stable release with this feature) is scheduled for December 16.

On Wed, Dec 7, 2016 at 9:07 PM, Mikhail V <mikhailwas@gmail.com> wrote:
it somehow settled in peoples' minds that hex reference should be preferred, for no solid
reason IMO. I may be showing my age, but all the facts that I remember about ASCII codes are in hex: 1. SPACE is 0x20 followed by punctuation symbols. 2. Decimal digits start at 0x30 with '0' = 0x30, '1' = 0x31, ... 3. @ is 0x40 followed by upper-case letter: 'A' = 0x41, 'B' = 0x42, ... 4. Lower-case letters are offset by 0x20 from the uppercase ones: 'a' = 0x61, 'b' = 0x62, ... Unicode is also organized around hexadecimal codes with various scripts positioned in sections that start at round hexadecimal numbers. For example Cyrillic is at 0x0400 through 0x4FF < http://unicode.org/charts/PDF/U0400.pdf>. The only decimal fact I remember about Unicode is that the largest code-point is 1114111 - a palindrome!

On 8 December 2016 at 03:36, Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
As an aside, I've just noticed that in my example: s = "first cyrillic letters: \{1040}\{1041}\{1042}" s = "first cyrillic letters: \u0410\u0411\u0412" the hex and decimal codes are made up of same digits, such a peculiar coincidence... So you were catched up from the beginning with hex, as I see ;) I on the contrary in dark times of learning programming (that was C) always oriented myself on decimal codes and don't regret it now.

On Wed, Dec 7, 2016, at 22:06, Mikhail V wrote:
C doesn't support decimal in string literals either, only octal and hex (incidentally octal seems to have been much more common in the environments where C was first invented). I can think of one context where decimal is used for characters, actually, now that I think about it. ANSI/ISO standards for 8-bit character sets often use a 'split' decimal format (i.e. DEL = 7/15 rather than 0x7F or 127.)

On 8 December 2016 at 05:39, Random832 <random832@fastmail.com> wrote:
That is true, it does not support decimals in string literals, but I don't remember (it was more than 10 years ago) that I used anything but decimals for text processing in C. So normally load a file in memory, iterate over bytes, compare the value, and so on. And somewhat very foggy in my memory, but at that time most ASCII tables included decimals and they stood normally in the first column, but I can be wrong now, got to google some original tables. Jeez, how positive came this thread out, first Ethan said it will be never implemented, and it turns out it has already been implemented. Christmas magic.

On 8 December 2016 at 15:46, Alexandre Brault <abrault@mapgears.com> wrote:
No I don't need to specify "unicode table *decimal*". Results for "unicode table" in google: Top Result # 2: www.utf8-chartable.de/ Top Result # 4: http://www.tamasoft.co.jp/en/general-info/index.html Some sites does not provide any code conversion, but everybody can do it easily, also I don't have problems generating a table programmatically. And I hope it is clear why most people stick to hex (I never argued that BTW), but it is mostly historical, nothing to do with "logical". There is just tendency to repeat what majority does and not always it is good, this case would be an example.

Except that both of these websites show you hexadecimal notation.
That's not true. Characters are sorted by ranges. For example, I know that everything below 0x20 is control code, uppercase ASCII letters start at 0x41 (0x40 is '@') and lowercase ASCII letters start at 0x61 (where 0x60 is '`') - trivial to remember. I also know that ASCII goes as high as half a byte, or 0x7f (half of 0x100). For instance, the first letter of my name is 0xc9, and anyone can know, at a glance and without knowing my name or what the letter is, that it's not ASCII. Also, as far as I know, lowercase letters (ASCII or not) begin some multiple of 0x10 after the beginning of the uppercase letters (0x20 for ASCII or latin-1). As such, since I know that 'É' is 0xc9, I can know, without even looking, that 0xe9 is 'é'. That would be a lot trickier in decimal to remember and get right. As an aside, and I don't know this by heart, various sets of characters begin at fixed points, and knowing those points (when you need to work with specific sets of characters) can be very useful. If you look at a website (https://unicode-table.com/ seems good), you can even select ranges of characters, which conveniently end up being multiples of 0x10 (or 16 in decimal). If your point is "it's easier to work with numbers ending with 0", then you'll be pleased to know that character sets are actually designed so that, using hexadecimal notation, you're dealing with numbers ending with 0! Doing this using decimal notation is clunky at best. Yours, \xc9manuel

On Fri, Dec 9, 2016 at 3:06 AM, Mikhail V <mikhailwas@gmail.com> wrote:
Both of those show hex first, and decimal as an additional feature.
In the first place, many people have pointed out to you that Unicode *is* laid out best in hexadecimal. (Another example: umop apisdn ?! are ¿¡, which are ?! with one high bit set.) But in the second place, "what the majority does" actually IS a strong argument. It's called consistency. Why is "\r" a carriage return? Wouldn't it be more logical to use "\c" for that? Except that EVERYONE uses \r for it. And the one time in my life that I found "\123" to mean "{" rather than "S", it was a great frustration for me: http://rosuav.blogspot.com.au/2012/12/i-want-my-octal.html And that's the choice between decimal and *octal*, which is a far less well known base than hex is. I would still prefer octal, because it's consistent. So because of consistency, Python needs to support "\u0303" to mean COMBINING TILDE, and any competing notation has to be in addition to that. Can you justify the confusion of sometimes working with hex and sometimes decimal? It's a pretty high bar to attain. You have to show that decimal isn't just marginally better than hex; you have to show that there are situations where the value of decimal character literals is so great that it's worth forcing everyone to learn two systems. And I'm not convinced you've even hit the first point. ChrisA

On 8 December 2016 at 17:52, Chris Angelico <rosuav@gmail.com> wrote:
In the first place, many people have pointed out to you that Unicode *is* laid out best in hexadecimal.
Ok, if it is aligned intentionally on a binary grid, then obviously the hex numbers will show some patterns - who argues with that? But to be fair, take my Cyrillic examples. Range start points in hex vs decimal:

capitals: U+0410 #1040
lowercase: U+0430 #1072

So I need to remember one number, 1040; then, knowing there are 32 letters (except Ё), I just sum 1040 + 32 and get 1072, which is the beginning of the lowercase range. There are of course people who can efficiently add and subtract in hex in their head, but I am not one of them (guess who is in the minority here), and there is no need to do it in this case. So if I know the distances between ranges, I can do it all much more easily in my head. Not a strong argument?

To be more pedantic: if you know that the Russian alphabet actually has 33 letters, not the 32 one might infer from the Unicode table, you will also have noticed that the letter Ё is U+0401 and ё is U+0451. This means they are torn away from the other letters and do not even lie in the ranges. In practice, this means that if I want to filter against code ranges, I need to additionally check the values U+0401 and U+0451. Is that not because someone decided to align the alphabet this way? Alignment is not a bad idea, but it should not contradict common sense.
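The range filter described here can be sketched in a few lines (`is_russian_letter` is a hypothetical helper name, not an API from the thread), using the decimal values the post prefers:

```python
# The contiguous block А..я is 1040..1103 (U+0410..U+044F), while
# Ё (1025, U+0401) and ё (1105, U+0451) fall outside it and need
# explicit extra checks.
def is_russian_letter(ch):
    cp = ord(ch)
    return 1040 <= cp <= 1103 or cp in (1025, 1105)

print(is_russian_letter("Ж"), is_russian_letter("ё"), is_russian_letter("Q"))
```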
Frankly, I don't fully understand your point here. Everyone knows decimal; the address of an element in a table is a number, and in most cases I don't need to learn it by heart, since it is already written in some table on your PC. Also, inputting characters by decimal is a very common thing: alternate key combos (Alt+0192) are well established, and many people *do* learn decimal code points by heart, including me. So now it is you who wants me to learn two numbering systems, for no reason.

And even with all that said, it is not the strongest argument. Most important is that hex notation is an ugly complication, and in this case there is too little reason to introduce it into an algorithm that just checks ranges and specific values. And for *specific single* values it is absolutely irrelevant which alignment you have; you just choose what is better readable and/or common for abstract numbers. But that is another big question, and current hex notation does not fall into the category "better readable" anyway.

Mikhail

On Fri, Dec 9, 2016 at 5:37 AM, Mikhail V <mikhailwas@gmail.com> wrote:
Let me clarify. When you construct a string, you can already use escapes to represent characters:

"n\u0303" --> n followed by combining tilde

In order to be consistent with other languages, Python *has* to support hexadecimal. Plus, Python has _already_ supported hex for some time. To establish decimal as an alternative, you have to demonstrate that it is worth having ANOTHER way to do this.

With completely green-field topics, you can debate the merits of one notation against another, and the overall best one will win. But when there's a well-established existing notation, you have to justify the proliferation of notations. You have to show that your new format is *so much* better than the existing one that it's worth adding it in parallel. That's quite a high bar - not impossible, obviously, but you need some very strong justification. At the moment, you're showing minor advantages to decimal, and other people are showing minor advantages to hex; but IMO nothing yet has been strong enough to justify the implementation of a completely new way to do things - remember, people have to understand *both* in order to read code.

ChrisA
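The combining-tilde example, together with the hex-based escape forms Python already accepts, can be checked directly (a quick illustrative snippet):

```python
import unicodedata

# The escape forms Python already supports; all hex-based except \N{...}:
s1 = "n\u0303"                              # n + U+0303 COMBINING TILDE
s2 = "\x41"                                 # 'A' (two hex digits)
s3 = "\U0001F40D"                           # eight hex digits, beyond the BMP
s4 = "\N{LATIN SMALL LETTER N WITH TILDE}"  # by Unicode character name

# The decomposed form s1 normalizes to the precomposed letter s4:
assert unicodedata.normalize("NFC", s1) == s4
print(s1, s2, s3, s4)
```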

On 8 December 2016 at 19:45, Chris Angelico <rosuav@gmail.com> wrote:
If the arguments in the last post are not strong enough, I think it will be too hard to make them any stronger. In my eyes the benefits in this case clearly outweigh the downsides. And anyway, since I can use an f-string now to input it, one can probably just relax. And this:

f"{65:c}{66:c}{67:c}"

actually looks significantly better than:

"\d{65}\d{66}\d{67}"

And it covers the cases I was addressing with the proposal. I am happy. +1000 to the developers, even if this is an "accidental" feature.
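The f-string form above can be verified directly, and it extends naturally to generated ranges (a quick sketch):

```python
# Decimal code points via the 'c' presentation type in an f-string:
s = f"{65:c}{66:c}{67:c}"
print(s)  # ABC

# The same trick generates whole ranges, e.g. the Cyrillic capitals:
caps_cyr = "".join(f"{o:c}" for o in range(1040, 1072))
print(caps_cyr)
```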

On Thu, Dec 8, 2016, at 11:06, Mikhail V wrote:
The problem is that there's a logic associated with how the character sets are designed. The character table works a lot better with rows of 16 than with rows of 10 or 20. In many blocks you get the uppercase letters lined up above the lowercase letters, for example. And if your rows are 16 (or 32, though that doesn't work as well for Unicode because e.g. the Cyrillic basic set А-Я/а-я starts from 0x410), then your row and column labels work better in hex, because you've lined up 0x40 above 0x50 and 0x60, which share the last digit, unlike 64/80/96, and the whole row (or half the row for 32) shares all but the last digit. And those values are also only off by one bit, too.

Even if we were to arrange the characters themselves in rows of 10/20, so you've got 30 or 40 characters in an "alphabet row", then you'd have to add or subtract to change the case, whereas many early character sets were designed to be able to do this by changing a bit, for bit-paired keyboards.

What looks better?

Hex:
АБВГДЕЖЗИЙКЛМНОП
РСТУФХЦЧШЩЪЫЬЭЮЯ
абвгдежзийклмноп
рстуфхцчшщъыьэюя

Decimal:
АБВГДЕЖЗИЙКЛМНОПРСТУ
ФХЦЧШЩЪЫЬЭЮЯабвгдежз
ийклмнопрстуфхцчшщъы
ьэюя

And it's only luck that the uppercase Russian alphabet starts at the beginning of a line. The ASCII section with the English alphabet looks like this in decimal:

<=>?@ABCDEFGHIJKLMNO
PQRSTUVWXYZ[\]^_`abc
defghijklmnopqrstuvw
xyz

compared to this in hex:

@ABCDEFGHIJKLMNO
PQRSTUVWXYZ[\]^_
`abcdefghijklmno
pqrstuvwxyz
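Tables like these can be regenerated with a short helper (`print_block` is a hypothetical name; the row width is a parameter, so both layouts come from the same code):

```python
# Print `count` consecutive code points starting at `start`, `width` per row.
def print_block(start, count, width):
    for row in range(start, start + count, width):
        end = min(row + width, start + count)
        print("".join(chr(cp) for cp in range(row, end)))

print_block(0x410, 64, 16)  # Cyrillic А..я in hex-aligned rows of 16
print()
print_block(0x410, 64, 20)  # same characters in decimal-friendly rows of 20
```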

The Unicode Consortium reference entirely lacks decimal values in its tables; EVERYTHING is given solely in hex. I'm sure someone somewhere has created a table with decimal values, but it's very rare. We should not change Python syntax because exactly one user prefers decimal representations. At most there can be an external library to cover strings in whatever manner he wants.

Why is octal being neglected for us old fogeys?! 😏

On Dec 7, 2016 6:11 PM, "Mikhail V" <mikhailwas@gmail.com> wrote:

On 12/07/2016 03:52 PM, Mikhail V wrote:
While the discussion did range far and wide, one thing that was fairly constant is that the benefit of adding one more way to represent unicode characters is not worth the work involved to make it happen; and that using hexadecimal to reference unicode characters is nearly universal. To sum up: even if you wrote all the code yourself, it would not be accepted. -- ~Ethan~

On 2016-12-07 23:52, Mikhail V wrote:
It's usually the case that escapes are \ followed by an ASCII-range letter or digit; \ followed by anything else makes it a literal, even if it's a metacharacter (e.g. " terminates a string that starts with ", but \" is a literal "), so I don't like \{...}. Perl doesn't have \u... or \U..., it has \x{...} instead, and Python already has \N{...}, so:

s = "first cyrillic letters: \d{1040}\d{1041}\d{1042}"

might be better, but I'm still -1, because hex is usual when referring to Unicode codepoints.

On 7 December 2016 at 23:52, Mikhail V <mikhailwas@gmail.com> wrote:
-1. We already have plenty of ways to specify characters in strings[1], we don't need another. If readability is what matters to you, and you (unlike many others) consider hex to be unreadable, use the \N{...} form. Paul [1] Including (ab)using f-strings to hide the use of chr().
participants (16)
- Alexander Belopolsky
- Alexandre Brault
- Chris Angelico
- David Mertz
- Emanuel Barry
- Ethan Furman
- Greg Ewing
- Jonathan Goble
- Matthias welp
- Mikhail V
- MRAB
- Nick Timkovich
- Paul Moore
- Random832
- Terry Reedy
- Victor Stinner