Unicode Name Aliases keyword argument abbreviation in unicodedata.name for missing names

unicodedata.name raises ValueError for a few Unicode characters like '\0' or '\n'. Although the documentation is very clear on this behaviour, it is often not what people want, i.e. a string describing the character. In Python 3.3, name aliases became accepted in unicodedata.lookup('NULL'), and '\N{NULL}' == '\N{NUL}'.

One could expect that lookup(name(x)) == x for every Unicode character, but this property doesn't hold, because a few characters (mainly control characters) do not have a name. The use case where the error is raised for an unassigned code point, or one from a newer version of Unicode, is however still useful.

In NameAliases (https://www.unicode.org/Public/6.3.0/ucd/NameAliases.txt) one can see that some characters have multiple aliases, so there are multiple ways to map a character to a name. I propose adding a keyword argument to unicodedata.name that would implement one of several useful behaviours when the name does not exist. One simple behaviour would be to choose the name in the "abbreviation" list. Currently all characters except three have one and only one abbreviation, so that would be a good pick; I'd imagine name('\x00', abbreviation=True) == 'NUL'.

The three characters in NameAliases.txt that have more than one abbreviation are:

    '\n'     with ['LF', 'NL', 'EOL']
    '\t'     with ['HT', 'TAB']
    '\ufeff' with ['BOM', 'ZWNBSP']

In case multiple abbreviations exist, one could take the first introduced to Unicode (for backward compatibility across Python versions). If that is a tie, one could take the first in the list. If a character has no name and no abbreviation, unicodedata.name raises an error or returns the default as usual. lookup(name(x)) == x for all x is natural, isn't it?
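For concreteness, the proposed behaviour can be sketched as a thin wrapper over today's unicodedata.name() (char_name and its hard-coded abbreviation table are illustrative assumptions; a real implementation would read NameAliases.txt):

```python
import unicodedata

# First-listed abbreviations for a few control characters, hand-copied
# from NameAliases.txt for illustration only.
_FIRST_ABBREVIATION = {
    '\x00': 'NUL',
    '\x07': 'BEL',
    '\t': 'HT',   # also TAB
    '\n': 'LF',   # also NL, EOL
}

def char_name(char, abbreviation=False):
    """Sketch of the proposed name(char, abbreviation=True) behaviour."""
    try:
        return unicodedata.name(char)
    except ValueError:
        # No canonical name: optionally fall back to the first abbreviation.
        if abbreviation and char in _FIRST_ABBREVIATION:
            return _FIRST_ABBREVIATION[char]
        raise
```

With this wrapper, lookup(char_name(x, abbreviation=True)) round-trips for the characters in the table, since Python 3.3 resolves those aliases in lookup().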

Replying to a few points out of order... On Thu, Jul 12, 2018 at 02:03:07AM +0000, Robert Vanden Eynde wrote:
lookup(name(x)) == x for all x is natural isn't it ?
The Unicode Consortium doesn't think so, or else they would mandate that all defined code points have a name.
That's a pretty old version -- we're up to version 11 now. https://www.unicode.org/Public/11.0.0/ucd/NameAliases.txt
I propose adding a keyword argument, to unicodedata.name<http://unicodedata.name>
I don't think that's a real URL.
that would implement one of some useful behavior when the value does not exist.
I am cautious about overloading functions with keyword-only arguments to implement special behaviour. Guido has a strong preference for the "no constant flags" rule of thumb (except I think we can extend it beyond just True/False to any N-state value), and I agree with that. The rule of thumb says that if you have a function that takes an optional flag which chooses between two (or more) distinct behaviours, AND the function is usually called with that flag given as a constant, then we should usually prefer to split the function into two separately named functions.

For example, in the statistics module, I have stdev() and pstdev(), rather than stdev(population=False) and stdev(population=True). (It's a rule of thumb, not a hard law of nature. There are exceptions.)

It sounds to me that your proposal would fit those conditions, and so we should prefer a separate function, or a separate API, for doing more complex name look-ups. *Especially* if there's a chance that we'll want to extend this some day to use more flags:

    name(char,
         abbreviation=False,
         correction=True,
         control=True,
         figment=True,
         alternate=False,
         )

which are all alias types defined by NameAliases.txt.
To my mind, that calls out for a separate API to return character alias properties as a separate data type:

    alias('\u0001')
    => UnicodeAlias(control='START OF HEADING', abbreviation='SOH')

    alias('\u000B')
    => UnicodeAlias(control=('LINE TABULATION', 'VERTICAL TABULATION'),
                    abbreviation='VT')
    # alternatively, fields could be a single semicolon-delimited string
    # rather than a tuple in the event of multiple aliases

    alias('\u01A2')
    => UnicodeAlias(correction='LATIN CAPITAL LETTER GHA')

    alias('\u0099')
    => UnicodeAlias(figment='SINGLE GRAPHIC CHARACTER INTRODUCER',
                    abbreviation='SGC')

Fields not shown return the empty string. This avoids overloading the name() function, future-proofs against new alias types, and if UnicodeAlias is a mutable object, easily permits the caller to customise the records to suit their own application's needs:

    def myalias(char):
        alias = unicodedata.alias(char)
        if char == '\U0001f346':
            alias.other = ('eggplant', 'purple vegetable')
            alias.slang = ('phallic', ...)
        return alias

-- Steve
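A rough sketch of how such an alias() could be built (the UnicodeAlias shape and the parsing helper are assumptions, not an existing API; the sample records are copied from NameAliases.txt, which a full implementation would read directly):

```python
from dataclasses import dataclass

# A few real code;alias;type records from NameAliases.txt, inlined so
# the sketch is self-contained.
SAMPLE = """\
0001;START OF HEADING;control
0001;SOH;abbreviation
000B;LINE TABULATION;control
000B;VERTICAL TABULATION;control
000B;VT;abbreviation
01A2;LATIN CAPITAL LETTER GHA;correction
"""

@dataclass
class UnicodeAlias:
    # One field per alias type defined by NameAliases.txt.
    correction: tuple = ()
    control: tuple = ()
    alternate: tuple = ()
    figment: tuple = ()
    abbreviation: tuple = ()

def build_alias_table(text):
    """Parse code;alias;type records into a char -> UnicodeAlias dict."""
    table = {}
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()
        if not line:
            continue
        code, name, kind = line.split(';')
        entry = table.setdefault(chr(int(code, 16)), UnicodeAlias())
        setattr(entry, kind, getattr(entry, kind) + (name,))
    return table

ALIASES = build_alias_table(SAMPLE)

def alias(char):
    """Return the UnicodeAlias record for char (empty record if none)."""
    return ALIASES.get(char, UnicodeAlias())
```

Fields with no aliases stay empty, matching the "fields not shown return the empty string" idea above, except that tuples are used throughout for uniformity.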

On Thu, Jul 12, 2018 at 05:17:26PM +1000, Steven D'Aprano <steve@pearwood.info> wrote:
I'm sure it was a stupid autoreplacement by web mail (hotmail in this case). As '.name' is a valid domain hotmail decided that unicodedata.name is a host name. And "URLified" it, so to say.
Oleg. -- Oleg Broytman https://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

Yes, my gmail client transformed unicodedata.name into a URL. I hope the mobile gmail client won't do it here. Yes, the current version is 11. I noticed it after sending the mail; I've compared with version 6 and all my arguments are still valid (they just added some characters to the "correction" set). While I'm at it, I mentioned the '\ufeff' character, but we don't care about it because it already has a name, so this is mostly a control character issue.

Yes, a new function name is also what I prefer, but I thought it would clutter the unicodedata namespace. I like your alias(...) function; with that one, an application could code my function like try: name(x) except: alias(x).abbreviations[0], if the abbreviation list is sorted by date of addition to Unicode. However, having a standard canonical name for all characters in the stdlib would help people choose the same convention. A new function like "canonical_name", or a shorter name, would be an idea.

Instead of name(char, abbreviation=True, correction=False) I would have imagined a "default_behavior" argument à la csv.Dialect, such that name(char, default_behavior=unicodedata.first_abbreviation) would use my algorithm. first_abbreviation would be an enum, or, like in csv.Dialect, a class like:

    class first_abbreviation:
        abbreviation = True
        correction = False
        ...

But I guess that's too specific; abbreviation=True would mean "take the first abbreviation in the list".

I like your alias(...) function; with that one, an application could code my function like try: name(x) except: alias(x).abbreviations[0], if the abbreviation list is sorted by date of addition to Unicode.

Or:

    try:
        return name(x)
    except ValueError:
        if category(x) == 'Cc':
            return alias(x).abbreviations[0]
        else:
            raise

That would then raise only for unassigned codepoints.

Robert Vanden Eynde writes:
The problem with control characters is that from the point of view of the Unicode Standard, the C0 and C1 registers are basically a space reserved for private use (see ISO 6429 for the huge collection of standardized control functions). That is, unlike the rest of the Unicode repertoire, the "characters" mapped there are neither unique nor context-independent.

It's true that ISO 6429 recommends specific C0 and C1 sets (but the recommended C1 set isn't even complete: U+0080, U+0081, and U+0099 aren't assigned!) However, Unicode only suggests that those should be the default interpretations, because the useful control functions are going to be dependent on context (e.g., input and output devices). This is like the situation with Internet addresses and domain names. The mapping is inherently many-many; round-tripping is not possible.

And in fact there are a number of graphic characters that have multiple code points due to bugs in national character sets. So for graphic characters, it's possible to ensure name(code(x)) = x, but it's not possible to ensure code(name(x)) = x, except under special circumstances (which apply to the vast majority of characters, of course).
I don't understand why that's particularly useful, especially in the Han case (see below).
I don't understand what you're asking for. The Unicode Standard already provides canonical names. Of course, the canonical names of most Han ideographs (near and dear to my heart) are pretty useless (they look like "CJK UNIFIED IDEOGRAPH-4E00"). (You probably don't want to get the Japanese, Chinese---and there are a lot of different kinds of Chinese---and Koreans started on what the "canonical" name should be. One Han Unification controversy is enough for this geological epoch!)

This is closely related to the Unicode standard's generic recommendation (Ch. 4.8):

    On the other hand, an API which returns a name for Unicode code
    points, but which is expected to provide useful, unique labels for
    unassigned, reserved code points and other special code point
    types, should return the value of the Unicode Name property for
    any code point for which it is non-null, but should otherwise
    construct a code point label to stand in for a character name.

(I suppose "should" here is used in the sense of RFC 2119.) So, the standard defines a canonical naming scheme, although many character names are not terribly mnemonic even to native speakers.

On the other hand, if you want useful aliases for Han characters, for many of them there could be scores of aliases, based on pronunciation, semantics, and appearance, the first two of which vary substantially within a single language, let alone across languages. Worse, as far as I know there are no standard equivalent ways to express these things in English, as when writing about these characters in English you often adopt a romanized version of the technical terms in the language you're studying. And, it's a minor point, but there are new Han characters discovered every day (I'm not even sure that's an exaggeration), as scholars examine regional and historical documents.

So for this to be most useful to me, I would want it developed OUTSIDE of the stdlib, with releases even more frequent than pytz (that is an exaggeration).
Not so much because I'll frequently need anything outside of the main CJK block in Plane 0, but because the complexity of character naming in East Asia suggests that improvements in heuristics for assigning priority to aliases, language-specific variations in heuristics, and so on will be rapid for the foreseeable future. It would be a shame to shackle that to the current stdlib release cycle, even if it doesn't need to be as frenetic as pytz. This goes in spades for people who are waiting for their own scripts to be standardized.

For the stdlib, I'm -1 on anything other than the canonical names plus the primary aliases for characters which are well-defined in the code charts of the Unicode Standard, such as those for the C0 and (most of) the C1 control characters. And even there I think a canonical name based on block name + code point in hex is the best way to go.

I think this problem is a lot harder than many of the folk participating in this discussion so far realize.

Steve

I don't understand why that's particularly useful, especially in the Han case (see below).

Since Python 3.3 has NameAliases.txt built into the distribution in order to fulfil the \N{} construct, I think it would be nice to have an API to access this file, to do things like:

    unicodedata.alias('\n').abbreviations[:3] == ['LF', 'NL', 'EOL']

I don't understand what you're asking for. The Unicode Standard already provides canonical names.

Not for control characters. About the Han case, they all have a unicodedata.name, don't they? (Sorry if I misread your message.)

On Thu, Jul 12, 2018 at 03:11:59PM +0000, Robert Vanden Eynde wrote: [Stephen]
That's because the Unicode Consortium considers that control characters have no canonical name. And I think that they are right.
I think that the point Stephen is making is that the canonical name for most Han characters is terribly uninformative, even to native Han users. For English speakers, the analogous situation would be if name("A") returned "LATIN CAPITAL LETTER 0041". There are good reasons for that, but it does mean that if your intention is to report the name of the character to a non-technical end-user, in their own native language, using the Unicode name, or even any of the aliases, is probably not a great solution.

On the other hand, if you are in a lucky enough situation (unlike Stephen) of being able to say "Han characters? We'll fix that in the next version..." using the Unicode name is not a terrible solution. At least, it's The Standard terrible solution *wink*

-- Steve

Robert Vanden Eynde writes:
Not for control characters.
There's a standard convention for "naming" control characters (U+0000, U+0001, etc.), which is recommended by the Unicode Standard (in slightly generalized form) for characters that otherwise don't have names, as "code point labels". This has been suggested by MRAB in the past. Personally I would generalize Steven D'Aprano's function a bit, and provide a "CONTROL-" prefix for these instead of "U+".

I don't see why even the C0 ASCII control function aliases should be particularly privileged, especially since the main alias is the spelled-out name, not the more commonly used 2- or 3-character abbreviation (will people associate "alarm" with "BEL"? I don't). Many are just meaningless (the 4 "device control" codes). And some are actively misleading: U+0018 (^X) "cancel" and U+001A (^Z) "substitute", which are generally interpreted as "exit" (an interactive program) and "end of file" (on Windows), or as "cut" and "revert" in CUA UI. I for one would find it more useful if they aliased to "ctrl-c-prefix" and "zap-up-to-char".[1] And nobody's ever heard of the C1 ISO 6429 control characters (not to mention that three of them are literally figments of somebody's imagination, and never standardized).

So I think using NameAliases.txt for this purpose is silly. If we're going to provide aliases based on the traditional control functions, I would use only the NameAliases.txt aliases for the following: NUL, BEL, BS, HT, LF, VT, FF, CR, ESC, SP, DEL, NEL, NBSP, and SHY. (NEL is included because it's recommended that it be treated as a newline function in the Unicode standard.) For the rest, I would use CONTROL-<code>, which is more likely to make sense in most contexts.[2]
Yes, they have names, constructed algorithmically from the code point: "CJK UNIFIED IDEOGRAPH-4E00". I know what that one is (the character that denotes the number 1), but that's the only one that I know offhand. I think Han (which are named daily, surely millions, if not billions, of times) should be treated as well as controls (which even programmers rarely bother to name, especially for those that don't have standard escape sequences). That's why I strongly advocate that there be provision for extension, and that the databases at least be provided by a module that can be updated far more frequently than the stdlib is.

Footnotes:

[1] Those are the commands they are bound to in Emacs.

[2] There are a few others that I personally would find useful and unambiguous because they're used in multilingual ISO 2022 encodings, but that's rather far into the weeds. They're rarely seen in practice; most of the time 7-bit codes with escape sequences are used, or 8-bit codes without control sequences.
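A minimal sketch of the naming rule suggested above, keeping only the well-known abbreviations and labelling every other control character (control_label and the whitelist spelling are invented for illustration, not an existing API):

```python
import unicodedata

# The fourteen well-known NameAliases.txt abbreviations listed above.
WELL_KNOWN = {
    '\x00': 'NUL', '\x07': 'BEL', '\x08': 'BS', '\t': 'HT',
    '\n': 'LF', '\x0b': 'VT', '\x0c': 'FF', '\r': 'CR',
    '\x1b': 'ESC', ' ': 'SP', '\x7f': 'DEL', '\x85': 'NEL',
    '\xa0': 'NBSP', '\xad': 'SHY',
}

def control_label(char):
    """Well-known abbreviation if any, else CONTROL-<code> for control
    characters (category Cc), else the canonical Unicode name."""
    if char in WELL_KNOWN:
        return WELL_KNOWN[char]
    if unicodedata.category(char) == 'Cc':
        return 'CONTROL-%04X' % ord(char)
    return unicodedata.name(char)
```

Under this rule, every C0 and C1 code point gets a deterministic label without relying on NameAliases.txt's more dubious entries.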

On Fri, Jul 13, 2018 at 12:02:20AM +0900, Stephen J. Turnbull wrote:
Sorry, I'm not sure if you mean my proposed alias() function isn't useful, or Robert's try...except loop around it. My alias() function is just a programmatic interface to information already available in the NameAliases.txt file. Don't you think that's useful enough as it stands? What people do with it will depend on their application, of course. [...]
Indeed. That's also the case for emoji. That's why I suggested making alias() return a mutable record rather than an immutable tuple, so application writers can add their own records to suit their own needs. I'll admit I haven't thought really deeply about what the most useful API would be -- this was only an initial post on Python-Ideas, not a fully-fledged PEP -- but I think the critical point here is that we shouldn't be privileging one alias type over the others. The NameAlias.txt file makes all that information available, but we can't access it (easily, or at all) from unicodedata. [...]
That seems fairly extreme. New Unicode versions don't come out that frequently. Surely we don't expect to track draft aliases, or characters outside of Unicode? Application writers might choose to do so -- if somebody wants to support "Ay" as an alias for LATIN CAPITAL LETTER A they can be my guest, but the stdlib doesn't have to directly support it until it hits the NameAliases.txt file :-) [...]
To clarify, do you mean the aliases defined in NameAliases.txt? Or a subset of them?
And even there I think a canonical name based on block name + code point in hex is the best way to go.
I believe you might be thinking of the Unicode "code point label" concept. I have this implementation in my toolbox:

    import unicodedata

    NONCHARACTERS = tuple(
        [chr(n) for n in range(0xFDD0, 0xFDF0)] +
        [chr(n*0x10000 + 0xFFFE + i) for n in range(17) for i in range(2)]
    )

    assert len(NONCHARACTERS) == 66

    def label(c):
        """Return the Code Point Label or character name of c.

        If c is a code point with a name, the name is used as the label;
        otherwise the Code Point Label is returned.

        >>> label(chr(0x0394))  # 'Δ'
        'GREEK CAPITAL LETTER DELTA'
        >>> label(chr(0x001F))
        '<control-001F>'
        """
        name = unicodedata.name(c, '')
        if name == '':
            # See the section on Code Point Labels:
            # http://www.unicode.org/versions/Unicode10.0.0/ch04.pdf
            number = ord(c)
            category = unicodedata.category(c)
            assert category in ('Cc', 'Cn', 'Co', 'Cs')
            if category == 'Cc':
                kind = 'control'
            elif category == 'Cn':
                if c in NONCHARACTERS:
                    kind = 'noncharacter'
                else:
                    kind = 'reserved'
            elif category == 'Co':
                kind = 'private-use'
            else:
                assert category == 'Cs'
                kind = 'surrogate'
            name = "<%s-%04X>" % (kind, number)
        return name

-- Steve

Steven D'Aprano writes:
Sorry, I'm not sure if you mean my proposed alias() function isn't useful, or Robert's try...except loop around it.
I was questioning the utility of "If the abbreviation list is sorted by AdditionToUnicodeDate." But since you ask, neither function is useful TO ME, as I understand them, because they're based on the UCD NameAliases.txt. That doesn't have any aliases I would actually use. I've never needed aliases for control characters, and for everything else the canonical name is perfectly useful (including for Korean characters and Japanese kana, which have phonetic names, as do Chinese bopomofo AIUI). There's nothing useful for Han characters yet, sadly.
To be perfectly frank, if that's all it is, I don't know when I'd ever use it. Your label function is *much* more useful.

To be specific about the defects of NameAliases.txt: "DEVICE CONTROL 1" tells me a lot less about that control character than "U+0011" does. Other aliases in that file are just wrong: I don't believe I've ever seen U+001A used as "SUBSTITUTE" for an unrepresentable coded character entity. That's the DOS "END OF FILE". Certainly, the aliases of category "correction" are useful, though not to me---I don't read any of the relevant languages. The "figment" category is stupid; almost all the names of control characters are figments, except for the half-dozen well-known whitespace characters, NUL, and maybe DEL. The 256 VSxxx "variation selectors" are somewhat useful, but I would think that it would be even more useful to provide skin color aliases for face emoji and X11 RGB.txt color aliases for hearts and the like, which presumably are standardized across vendors.

If I were designing a feature for the stdlib, I would:

0. Allow the database to consist of multiple alias tables, and be extensible by adding tables via user configuration.
1. Make the priority of the alias tables user-configurable.
2. Provide a default top-priority table more suited to likely Python usage than NameAliases.txt.
3. Provide both a primary alias function, and a list of all aliases function.
4. Provide a reverse lookup function.
5. Perhaps provide a context-sensitive alias function. The only context I can think of offhand is "position in file", i.e., to distinguish between ZWNBSP and BOM, so perhaps that's not worth doing. On the other hand, given that example, it's worth a few minutes' thought to see if there are other context-sensitive naming practices that more than a few people would want to follow.
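Points 0 through 4 might be sketched as follows (AliasRegistry and its methods are invented names; alias tables are assumed to be plain character-to-tuple-of-names dicts):

```python
class AliasRegistry:
    """Consult several alias tables in user-configurable priority order."""

    def __init__(self):
        self.tables = []          # list of (name, dict) in priority order

    def add_table(self, name, table, priority=None):
        """Register a table; priority=0 makes it the top-priority table."""
        entry = (name, table)
        if priority is None:
            self.tables.append(entry)
        else:
            self.tables.insert(priority, entry)

    def aliases(self, char):
        """All aliases for char, highest-priority table first."""
        out = []
        for _, table in self.tables:
            out.extend(table.get(char, ()))
        return out

    def alias(self, char):
        """Primary alias: first hit across tables, or None."""
        found = self.aliases(char)
        return found[0] if found else None

    def lookup(self, name):
        """Reverse lookup: first character carrying this alias, or None."""
        for _, table in self.tables:
            for char, names in table.items():
                if name in names:
                    return char
        return None
```

A user-supplied table registered ahead of the UCD-derived one then shadows entries like U+001A "SUBSTITUTE" with, say, "END OF FILE", without modifying the base data.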
Why should they add them to the tuple returned by the function, rather than to the database the function consults?
fully-fledged PEP -- but I think the critical point here is that we shouldn't be privileging one alias type over the others.
I don't understand. By providing stdlib support for NameAliases.txt only, you are privileging those aliases. If you mean privileging the Name property over the aliases, well, that's what "canonical" means, and yes, I think the Name property should be privileged (eg ZERO WIDTH NO-BREAK SPACE over BYTE ORDER MARK).
Why not track draft aliases in a "draft alias" table? More important, why not track aliases of *Unicode* characters that could use aliases (e.g., translations), in separate tables? For example, there are "shape based names" for Han characters, which are standard enough that users would be able to construct them (Unicode 11 includes one such system; see section 18.2). And Japanese names for Han radicals often vary from the UCD Name property, and are often more precise (many describe the geometric relation of the radical to the rest of the character).

It is not obvious to me that an alias() that only looks at NameAliases.txt is so useful as to belong in the stdlib, but on the other hand providing a module that can include rapidly accumulating databases along the lines I've mentioned above definitely doesn't belong in the stdlib (a la pytz). On the other hand, the *access functions* might belong in the stdlib---in the same way that timezone-sensitive datetime APIs do---but that sort of requires knowing what databases and "schema" are out there, and trying to set things up so that the same APIs can access a number of databases.
To clarify, do you mean the aliases defined in NameAliases.txt? Or a subset of them?
I didn't understand your alias function correctly; I think it is overengineered for the purpose of handling aliases. I was thinking in terms of returning a string, or at most a list of strings. If you are going to define a class to represent metadata about a character, why not make *all* metadata available? Probably most of the attributes would be properties, lazily accessing various databases:

    class Codepoint(object):
        def __init__(self, codepoint):
            self.codepoint = codepoint

        @property
        def name(self):
            # Access name database and cache result.
            ...

        @property
        def category(self):
            # Access category database and cache result.
            ...

        @property
        def alias(self):
            # Populates alias_list, and returns the first one.
            ...

        @property
        def alias_list(self):
            # Access alias database (not limited to NameAliases.txt)
            # and cache result.
            ...

        @property
        def label(self):
            # Populates and returns name, if available, otherwise a
            # code point label.
            ...

and so on. But that's a new thread.
Yes, as MRAB has suggested. I would be a little more precise than he, in that I would label the C0 and C1 control blocks with CONTROL-<code> rather than just U+<code>. Steve

Replying to a few points out of order... On Thu, Jul 12, 2018 at 02:03:07AM +0000, Robert Vanden Eynde wrote:
lookup(name(x)) == x for all x is natural isn't it ?
The Unicode Consortium doesn't think so, or else they would mandate that all defined code points have a name.
That's a pretty old version -- we're up to version 11 now. https://www.unicode.org/Public/11.0.0/ucd/NameAliases.txt
I propose adding a keyword argument, to unicodedata.name<http://unicodedata.name>
I don't think that's a real URL.
that would implement one of some useful behavior when the value does not exist.
I am cautious about overloading functions with keyword-only arguments to implement special behaviour. Guido has a strong preference for the "no constant flags" rule of thumb, (except I think we can extend it beyond just True/False to any N-state value) and I agree with that. The rule of thumb says that if you have a function that takes an optional flag which chooses between two (or more) distinct behaviours, AND the function is usually called with that flag given as a constant, then we should usually prefer to split the function into two separately named functions. For example, in the statistics module, I have stdev() and pstdev(), rather than stdev(population=False) and stdev(population=True). (Its a rule of thumb, not a hard law of nature. There are exceptions.) It sounds to me that your proposal would fit those conditions and so we should prefer a separate function, or a separate API, for doing more complex name look-ups. *Especially* if there's a chance that we'll want to extend this some day to use more flags... name(char, abbreviation=False, correction=True, control=True, figment=True, alternate=False, ) which are all alias types defined by NameAliases.txt.
To my mind, that calls out for a separate API to return character alias properties as a separate data type: alias('\u0001') => UnicodeAlias(control='START OF HEADING', abbreviation='SOH') alias('\u000B') => UnicodeAlias(control=('LINE TABULATION', 'VERTICAL TABULATION'), abbreviation='VT') # alternatively, fields could be a single semi-colon delimited string rather than a tuple in the event of multiple aliases alias('\u01A2') => UnicodeAlias(correction='LATIN CAPITAL LETTER GHA') alias('\u0099') => UnicodeAlias(figment='SINGLE GRAPHIC CHARACTER INTRODUCER', abbreviation='SGC') Fields not shown return the empty string. This avoids overloading the name() function, future-proofs against new alias types, and if UnicodeAlias is a mutable object, easily permits the caller to customise the records to suit their own application's needs: def myalias(char): alias = unicodedata.alias(char) if char == '\U0001f346': alias.other = ('eggplant', 'purple vegetable') alias.slang = ('phallic', ... ) return alias -- Steve

On Thu, Jul 12, 2018 at 05:17:26PM +1000, Steven D'Aprano <steve@pearwood.info> wrote:
I'm sure it was a stupid autoreplacement by web mail (hotmail in this case). As '.name' is a valid domain hotmail decided that unicodedata.name is a host name. And "URLified" it, so to say.
-- Steve
Oleg. -- Oleg Broytman https://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

Yes, my gmail client transformed unicodata . name to a url. I hope the mobile gmail client won't do it here. Yes current version is 11. I noticed it after sending the mail, I've compared to the version 6 and all my arguments are still valid (they just added some characters in the "correction" set). As I'm at, I mentionned the ffef character but we don't care about it because it already has a name, so that's mostly a control character issue. Yes a new function name is also what I prefer but I thought it would clutter the unicodata namespace. I like your alias(...) function, with that one, an application could code my function like try name(x) expect alias(x).abbreviations[0]. If the abbreviation list is sorted by AdditionToUnicodeDate. However, having a standard canonical name for all character in the stdlib would help people choosing the same convention. A new function like "canonical_name" or a shorter name would be an idea. Instead of name(char, abbreviation=True, correction=False) I would have Imagined a "default_behavior" ala csv.dialect such that name(char, default_bevior=unicodata.first_abbreviation) would use my algorithm. first_abbreviation would be a enum, or like in csv.dialect a class like : class first_abbreviation: abbreviation = True; correction = False; ... But I guess that's too specific, abbreviation=True would mean "take the first abbreviation in the list".

I like your alias(...) function, with that one, an application could code my function like try name(x) expect alias(x).abbreviations[0]. If the abbreviation list is sorted by AdditionToUnicodeDate. Or try: return name(x) expect: if category(x) == 'Cc': return alias(x).abbreviations[0] else: raise That would then raise only for unassigned codepoints.

Robert Vanden Eynde writes:
The problem with control characters is that from the point of view of the Unicode Standard, the C0 and C1 registers are basically a space reserved for private use (see ISO 6429 for the huge collection of standardized control functions). That is, unlike the rest of the Unicode repertoire, the "characters" mapped there are neither unique nor context-independent. It's true that ISO 6429 recommends specific C0 and C1 sets (but the recommended C1 set isn't even complete: U+0080, U+0081, and U+0099 aren't assigned!) However, Unicode only suggests that those should be the default interpretations, because the useful control functions are going to be dependent on context (eg, input and output devices). This is like the situation with Internet addresses and domain names. The mapping is inherently many-many; round-tripping is not possible. And in fact there are a number of graphic characters that have multiple code points due to bugs in national character sets. So for graphic characters, it's possible to ensure name(code(x)) = x, but it's not possible to ensure code(name(x)) = x, except under special circumstances (which apply to the vast majority of characters, of course).
I don't understand why that's particularly useful, especially in the Han case (see below).
I don't understand what you're asking for. The Unicode Standard already provides canonical names. Of course, the canonical name of most Han ideographs (near and dear to my heart) are pretty useless (they look like "CJK UNIFIED IDEOGRAPH-4E00"). (You probably don't want to get the Japanese, Chinese---and there are a lot of different kinds of Chinese---and Koreans started on what the "canonical" name should be. One Han Unification controversy is enough for this geological epoch!) This is closely related to the Unicode standard's generic recommendation (Ch. 4.8): On the other hand, an API which returns a name for Unicode code points, but which is expected to provide useful, unique labels for unassigned, reserved code points and other special code point types, should return the value of the Unicode Name property for any code point for which it is non-null, but should otherwise con- struct a code point label to stand in for a character name. (I suppose "should" here is used in the sense of RFC 2119.) So, the standard defines a canonical naming scheme, although many character names are not terribly mnemonic even to native speakers. On the other hand, if you want useful aliases for Han characters, for many of them there could be scores of aliases, based on pronunciation, semantics, and appearance, the first two of which of which vary substantially within a single language, let alone across languages. Worse, as far as I know there are no standard equivalent ways to express these things in English, as when writing about these characters in English you often adopt a romanized version of the technical terms in the language you're studying. And, it's a minor point, but there are new Han characters discovered every day (I'm not even sure that's an exaggeration), as scholars examine regional and historical documents. So for this to be most useful to me, I would want it developed OUTSIDE of the stdlib, with releases even more frequent than pytz (that is an exaggeration). 
Not so much because I'll frequently need anything outside of the main CJK block in Plane 0, but because the complexity of character naming in East Asia suggests that improvements in heuristics for assigning priority to aliases, language-specific variations in heuristics, and so on will be rapid for the forseeable future. It would be a shame to shackle that to the current stdlib release cycle even if it doesn't need to be as frenetic as pytz. This goes in spades for people who are waiting for their own scripts to be standardized. For the stdlib, I'm -1 on anything other than the canonical names plus the primary aliases for characters which are well-defined in the code charts of the Unicode Standard, such as those for the C0 and (most of) the C1 control characters. And even there I think a canonical name based on block name + code point in hex is the best way to go. I think this problem is a lot harder than many of the folk participating in this discussion so far realize. Steve

> I don't understand why that's particularly useful, especially in the
> Han case (see below).

Since Python 3.3 has NameAliases.txt built into the distribution in order to fulfill the \N{} construct, I think it would be nice to have an API to access this file, to do something like:

    unicodedata.alias('\n').abbreviations[:3] == ['LF', 'NL', 'EOL']

> I don't understand what you're asking for. The Unicode Standard
> already provides canonical names.

Not for control characters.

About the Han case, they all have a unicodedata.name, don't they? (Sorry if I misread your message)
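The asymmetry under discussion is easy to demonstrate with the current unicodedata API (Python 3.3+): lookup() accepts the NameAliases.txt aliases, but name() still raises for control characters, so lookup(name(x)) == x does not hold for all x. A quick sketch:

```python
import unicodedata

# lookup() accepts name aliases from NameAliases.txt since Python 3.3:
assert unicodedata.lookup('NULL') == '\x00'
assert unicodedata.lookup('LINE FEED') == '\n'

# ...but name() has no name for control characters and raises ValueError:
try:
    unicodedata.name('\n')
except ValueError as e:
    print(e)

# The optional default avoids the exception, but is not a usable name:
print(unicodedata.name('\n', '<no name>'))
```

(Note that name() raises ValueError, not KeyError, for nameless code points.)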

On Thu, Jul 12, 2018 at 03:11:59PM +0000, Robert Vanden Eynde wrote: [Stephen]
> That's because the Unicode Consortium considers that control
> characters have no canonical name. And I think that they are right.
I think that the point Stephen is making is that the canonical name for most Han characters is terribly uninformative, even to native Han users. For English speakers, the analogous situation would be if name("A") returned "LATIN CAPITAL LETTER 0041".

There are good reasons for that, but it does mean that if your intention is to report the name of the character to a non-technical end-user, in their own native language, using the Unicode name or even any of the aliases is probably not a great solution.

On the other hand, if you are in a lucky enough situation (unlike Stephen) of being able to say "Han characters? We'll fix that in the next version..." using the Unicode name is not a terrible solution. At least, it's The Standard terrible solution *wink*

-- 
Steve

Robert Vanden Eynde writes:
> Not for control characters.
There's a standard convention for "naming" control characters (U+0000, U+0001, etc), which is recommended by the Unicode Standard (in slightly generalized form) for characters that otherwise don't have names, as "code point labels". This has been suggested by MRAB in the past. Personally I would generalize Steven d'Aprano's function a bit, and provide a "CONTROL-" prefix for these instead of "U+".

I don't see why even the C0 ASCII control function aliases should be particularly privileged, especially since the main alias is the spelled-out name, not the more commonly used 2- or 3-character abbreviation (will people associate "alarm" with "BEL"? I don't). Many are just meaningless (the 4 "device control" codes). And some are actively misleading: U+0018 (^X) "cancel" and U+001A (^Z) "substitute", which are generally interpreted as "exit" (an interactive program) and "end of file" (on Windows), or as "cut" and "revert" in CUA UI. I for one would find it more useful if they aliased to "ctrl-c-prefix" and "zap-up-to-char".[1] And nobody's ever heard of the C1 ISO 6429 control characters (not to mention that three of them are literally figments of somebody's imagination, and never standardized).

So I think using NameAliases.txt for this purpose is silly. If we're going to provide aliases based on the traditional control functions, I would use only the NameAliases.txt aliases for the following: NUL, BEL, BS, HT, LF, VT, FF, CR, ESC, SP, DEL, NEL, NBSP, and SHY. (NEL is included because it's recommended that it be treated as a newline function in the Unicode standard.) For the rest, I would use CONTROL-<code>, which is more likely to make sense in most contexts.[2]
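The scheme proposed here (whitelist the well-known abbreviations, fall back to a CONTROL-<code> label for other control characters) could be sketched roughly as follows. The function name and table layout are invented for illustration; the abbreviation list is the one given in the message above.

```python
import unicodedata

# Whitelisted abbreviations for well-known control/format characters,
# per the list above (codepoint -> abbreviation).
_WELL_KNOWN = {
    0x00: 'NUL', 0x07: 'BEL', 0x08: 'BS', 0x09: 'HT', 0x0A: 'LF',
    0x0B: 'VT', 0x0C: 'FF', 0x0D: 'CR', 0x1B: 'ESC', 0x20: 'SP',
    0x7F: 'DEL', 0x85: 'NEL', 0xA0: 'NBSP', 0xAD: 'SHY',
}

def control_name(c):
    """Return a whitelisted abbreviation, else the canonical name,
    else a CONTROL-<code> label for other control characters."""
    cp = ord(c)
    if cp in _WELL_KNOWN:
        return _WELL_KNOWN[cp]
    name = unicodedata.name(c, '')
    if name:
        return name
    if unicodedata.category(c) == 'Cc':
        return 'CONTROL-%04X' % cp
    return 'U+%04X' % cp       # non-control nameless code points
```

For example, control_name('\x11') gives "CONTROL-0011" rather than the arguably meaningless NameAliases.txt alias "DEVICE CONTROL ONE".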
Yes, they have names, constructed algorithmically from the code point: "CJK UNIFIED IDEOGRAPH-4E00". I know what that one is (the character that denotes the number 1). But that's the only one that I know offhand.

I think Han (which are named daily, surely millions, if not billions, of times) should be treated as well as controls (which even programmers rarely bother to name, especially for those that don't have standard escape sequences). That's why I strongly advocate that there be provision for extension, and that the databases at least be provided by a module that can be updated far more frequently than the stdlib is.

Footnotes:
[1] Those are the commands they are bound to in Emacs.
[2] There are a few others that I personally would find useful and unambiguous because they're used in multilingual ISO 2022 encodings, but that's rather far into the weeds. They're rarely seen in practice; most of the time 7-bit codes with escape sequences are used, or 8-bit codes without control sequences.

On Fri, Jul 13, 2018 at 12:02:20AM +0900, Stephen J. Turnbull wrote:
Sorry, I'm not sure if you mean my proposed alias() function isn't useful, or Robert's try...except loop around it. My alias() function is just a programmatic interface to information already available in the NameAliases.txt file. Don't you think that's useful enough as it stands? What people do with it will depend on their application, of course.

[...]
Indeed. That's also the case for emoji. That's why I suggested making alias() return a mutable record rather than an immutable tuple, so application writers can add their own records to suit their own needs.

I'll admit I haven't thought really deeply about what the most useful API would be -- this was only an initial post on Python-Ideas, not a fully-fledged PEP -- but I think the critical point here is that we shouldn't be privileging one alias type over the others. The NameAliases.txt file makes all that information available, but we can't access it (easily, or at all) from unicodedata.

[...]
That seems fairly extreme. New Unicode versions don't come out that frequently. Surely we don't expect to track draft aliases, or characters outside of Unicode? Application writers might choose to do so -- if somebody wants to support "Ay" as an alias for LATIN CAPITAL LETTER A they can be my guest, but the stdlib doesn't have to directly support it until it hits the NameAliases.txt file :-) [...]
To clarify, do you mean the aliases defined in NameAliases.txt? Or a subset of them?
> And even there I think a canonical name based on block name + code
> point in hex is the best way to go.
I believe you might be thinking of the Unicode "code point label" concept. I have this implementation in my toolbox:

    NONCHARACTERS = tuple(
        [unichr(n) for n in range(0xFDD0, 0xFDF0)]
        + [unichr(n*0x10000 + 0xFFFE + i) for n in range(17) for i in range(2)]
        )

    assert len(NONCHARACTERS) == 66

    def label(c):
        """Return the Code Point Label or character name of c.

        If c is a code point with a name, the name is used as the label;
        otherwise the Code Point Label is returned.

        >>> label(unichr(0x0394))  # u'Δ'
        'GREEK CAPITAL LETTER DELTA'
        >>> label(unichr(0x001F))
        '<control-001F>'

        """
        name = unicodedata.name(c, '')
        if name == '':
            # See section on Code Point Labels
            # http://www.unicode.org/versions/Unicode10.0.0/ch04.pdf
            number = ord(c)
            category = unicodedata.category(c)
            assert category in ('Cc', 'Cn', 'Co', 'Cs')
            if category == 'Cc':
                kind = 'control'
            elif category == 'Cn':
                if c in NONCHARACTERS:
                    kind = 'noncharacter'
                else:
                    kind = 'reserved'
            elif category == 'Co':
                kind = 'private-use'
            else:
                assert category == 'Cs'
                kind = 'surrogate'
            name = "<%s-%04X>" % (kind, number)
        return name

-- 
Steve
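For reference, the same idea in Python 3 (chr instead of unichr) might look like the sketch below. This is a simplified version, not Steven's exact code: it folds noncharacters into the 'reserved' kind rather than carrying the NONCHARACTERS table.

```python
import unicodedata

def label(c):
    """Return the character's name, or a Unicode code point label
    (<control-XXXX>, <surrogate-XXXX>, etc.) if it has none."""
    name = unicodedata.name(c, '')
    if not name:
        kinds = {'Cc': 'control', 'Co': 'private-use', 'Cs': 'surrogate'}
        # Simplification: 'Cn' code points are all labelled 'reserved'
        # here, without distinguishing noncharacters.
        kind = kinds.get(unicodedata.category(c), 'reserved')
        name = '<%s-%04X>' % (kind, ord(c))
    return name

print(label('\u0394'))   # GREEK CAPITAL LETTER DELTA
print(label('\x1f'))     # <control-001F>
```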

Steven D'Aprano writes:
> Sorry, I'm not sure if you mean my proposed alias() function isn't
> useful, or Robert's try...except loop around it.
I was questioning the utility of "If the abbreviation list is sorted by AdditionToUnicodeDate." But since you ask, neither function is useful TO ME, as I understand them, because they're based on the UCD NameAliases.txt. That doesn't have any aliases I would actually use. I've never needed aliases for control characters, and for everything else the canonical name is perfectly useful (including for Korean characters and Japanese kana, which have phonetic names, as do Chinese bopomofo AIUI). There's nothing useful for Han characters yet, sadly.
To be perfectly frank, if that's all it is, I don't know when I'd ever use it. Your label function is *much* more useful.

To be specific about the defects of NameAliases.txt: "DEVICE CONTROL 1" tells me a lot less about that control character than "U+0011" does. Other aliases in that file are just wrong: I don't believe I've ever seen U+001A used as "SUBSTITUTE" for an unrepresentable coded character entity. That's the DOS "END OF FILE". Certainly, the aliases of category "correction" are useful, though not to me---I don't read any of the relevant languages. The "figment" category is stupid; almost all the names of control characters are figments, except for the half-dozen well-known whitespace characters, NUL, and maybe DEL. The 256 VSxxx "variation selectors" are somewhat useful, but I would think that it would be even more useful to provide skin color aliases for face emoji and X11 RGB.txt color aliases for hearts and the like, which presumably are standardized across vendors.

If I were designing a feature for the stdlib, I would

0. Allow the database to consist of multiple alias tables, and be
   extensible by adding tables via user configuration.
1. Make the priority of the alias tables user-configurable.
2. Provide a default top-priority table more suited to likely Python
   usage than NameAliases.txt.
3. Provide both a primary alias function, and a list of all aliases
   function.
4. Provide a reverse lookup function.
5. Perhaps provide a context-sensitive alias function. The only
   context I can think of offhand is "position in file", ie, to
   distinguish between ZWNBSP and BOM, so perhaps that's not worth
   doing. On the other hand, given that example, it's worth a few
   minutes thought to see if there are other context-sensitive naming
   practices that more than a few people would want to follow.
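Points 0-4 of this design could be sketched along the following lines. All names here (AliasRegistry, add_table, and so on) are invented for illustration; this is not a proposed stdlib API, just a minimal model of layered, user-extensible alias tables with priorities and reverse lookup.

```python
class AliasRegistry:
    """Layered alias lookup: earlier tables take priority (points 0-2)."""

    def __init__(self):
        self._tables = []  # each table: dict of codepoint -> list of aliases

    def add_table(self, table, priority=None):
        """Register an alias table; priority 0 is consulted first."""
        if priority is None:
            self._tables.append(table)
        else:
            self._tables.insert(priority, table)

    def alias(self, c):
        """Primary alias: first hit in priority order (point 3)."""
        for table in self._tables:
            if ord(c) in table:
                return table[ord(c)][0]
        return None

    def aliases(self, c):
        """All known aliases, in priority order (point 3)."""
        result = []
        for table in self._tables:
            result.extend(table.get(ord(c), []))
        return result

    def lookup(self, name):
        """Reverse lookup: alias -> character (point 4)."""
        for table in self._tables:
            for cp, names in table.items():
                if name in names:
                    return chr(cp)
        raise KeyError(name)

# Example: a user-supplied table layered over a NameAliases-style table.
registry = AliasRegistry()
registry.add_table({0x0A: ['LF', 'NL', 'EOL'], 0x00: ['NUL']})
registry.add_table({0x0A: ['NEWLINE']}, priority=0)  # user override wins
```

With this layering, registry.alias('\n') returns the user's "NEWLINE" while the NameAliases-derived aliases remain reachable via aliases() and lookup().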
Why should they add them to the tuple returned by the function, rather than to the database the function consults?
> fully-fledged PEP -- but I think the critical point here is that we
> shouldn't be privileging one alias type over the others.
I don't understand. By providing stdlib support for NameAliases.txt only, you are privileging those aliases. If you mean privileging the Name property over the aliases, well, that's what "canonical" means, and yes, I think the Name property should be privileged (eg ZERO WIDTH NO-BREAK SPACE over BYTE ORDER MARK).
Why not track draft aliases in a "draft alias" table? More important, why not track aliases of *Unicode* characters that could use aliases (eg, translations), in separate tables? For example, there are "shape based names" for Han characters, which are standard enough so that users would be able to construct them (Unicode 11 includes one such system, see section 18.2). And Japanese names for Han radicals often vary from the UCD Name property, and are often more precise (many describe the geometric relation of the radical to the rest of the character).

It is not obvious to me that an alias() that only looks at NameAliases.txt is so useful as to belong in the stdlib, but on the other hand providing a module that can include rapidly accumulating databases along the lines I've mentioned above definitely doesn't belong in the stdlib (a la pytz). On the other hand, the *access functions* might belong in the stdlib---in the same way that timezone-sensitive datetime APIs do---but that sort of requires knowing what databases and "schema" are out there, and trying to set things up so that the same APIs can access a number of databases.
> To clarify, do you mean the aliases defined in NameAliases.txt? Or a
> subset of them?
I didn't understand your alias function correctly, which I think is overengineered for the purpose of handling aliases. I was thinking in terms of returning a string, or at most a list of strings.

If you are going to define a class to represent metadata about a character, why not make *all* metadata available? Probably most of the attributes would be properties, lazily accessing various databases:

    class Codepoint(object):
        def __init__(self, codepoint):
            self.codepoint = codepoint

        @property
        def name(self):
            # Access name database and cache result.
            ...

        @property
        def category(self):
            # Access category database and cache result.
            ...

        @property
        def alias(self):
            # Populates alias_list, and returns the first one.
            ...

        @property
        def alias_list(self):
            # Access alias database (not limited to NameAliases.txt) and
            # cache result.
            ...

        @property
        def label(self):
            # Populates and returns name, if available, otherwise a code
            # point label.
            ...

and so on. But that's a new thread.
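A minimal working version of such a class, backed only by what the stdlib actually provides, might look like this. Since unicodedata exposes no alias database, the alias machinery below is faked with a tiny hard-coded table; a real implementation would read NameAliases.txt or a user-extensible database. The label logic is likewise simplified to two kinds.

```python
import unicodedata
from functools import cached_property

# Stand-in alias table for illustration only (codepoint -> aliases).
_ALIASES = {0x00: ['NUL'], 0x09: ['HT', 'TAB'], 0x0A: ['LF', 'NL', 'EOL']}

class Codepoint:
    def __init__(self, codepoint):
        self.codepoint = codepoint

    @cached_property
    def name(self):
        # Canonical name, or '' for nameless code points.
        return unicodedata.name(chr(self.codepoint), '')

    @cached_property
    def category(self):
        return unicodedata.category(chr(self.codepoint))

    @cached_property
    def alias_list(self):
        return _ALIASES.get(self.codepoint, [])

    @cached_property
    def alias(self):
        return self.alias_list[0] if self.alias_list else None

    @cached_property
    def label(self):
        # Name if available, otherwise a (simplified) code point label.
        kind = 'control' if self.category == 'Cc' else 'reserved'
        return self.name or '<%s-%04X>' % (kind, self.codepoint)
```

cached_property (Python 3.8+) gives the lazy compute-once behaviour the message describes.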
Yes, as MRAB has suggested. I would be a little more precise than he, in that I would label the C0 and C1 control blocks with CONTROL-<code> rather than just U+<code>. Steve
participants (5):
- MRAB
- Oleg Broytman
- Robert Vanden Eynde
- Stephen J. Turnbull
- Steven D'Aprano