Unicode Name Aliases keyword argument abbreviation in unicodedata.name for missing names

unicodedata.name raises ValueError for a few Unicode characters like '\0' or '\n'. Although the documentation is very clear on this behaviour, it is often not what people want, i.e. a string describing the character. In Python 3.3, name aliases became accepted in unicodedata.lookup('NULL'), and '\N{NULL}' == '\N{NUL}'.

One could expect that lookup(name(x)) == x for every Unicode character, but this property doesn't hold, because a few characters (mainly control characters) do not have a name. The use case where the error is raised for an unassigned code point, or one from a newer version of Unicode, is however still useful.

In NameAliases (https://www.unicode.org/Public/6.3.0/ucd/NameAliases.txt) one can see that some characters have multiple aliases, so there are multiple ways to map a character to a name. I propose adding a keyword argument to unicodedata.name that would implement one of several useful behaviours when the name does not exist. One simple behaviour would be to choose the name in the "abbreviation" list. Currently all characters except three have one and only one abbreviation, so that would be a good pick; I'd imagine name('\x00', abbreviation=True) == 'NUL'.

The three characters in NameAliases.txt that have more than one abbreviation are:

    '\n'     with ['LF', 'NL', 'EOL']
    '\t'     with ['HT', 'TAB']
    '\ufeff' with ['BOM', 'ZWNBSP']

In case multiple abbreviations exist, one could take the first introduced to Unicode (for backward compatibility across Python versions). If that is a tie, one could take the first in the list. If a character has no name and no abbreviation, unicodedata.name raises an error or returns the default as usual. lookup(name(x)) == x for all x is natural, isn't it?
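For concreteness, the proposed behaviour can be sketched as a thin wrapper over today's unicodedata.name() (char_name and its hard-coded abbreviation table are illustrative assumptions; a real implementation would read NameAliases.txt):

```python
import unicodedata

# First-listed abbreviations for a few control characters, hand-copied
# from NameAliases.txt for illustration only.
_FIRST_ABBREVIATION = {
    '\x00': 'NUL',
    '\x07': 'BEL',
    '\t': 'HT',   # also TAB
    '\n': 'LF',   # also NL, EOL
}

def char_name(char, abbreviation=False):
    """Sketch of the proposed name(char, abbreviation=True) behaviour."""
    try:
        return unicodedata.name(char)
    except ValueError:
        # No canonical name: optionally fall back to the first abbreviation.
        if abbreviation and char in _FIRST_ABBREVIATION:
            return _FIRST_ABBREVIATION[char]
        raise
```

With this wrapper, lookup(char_name(x, abbreviation=True)) round-trips for the characters in the table, since Python 3.3 resolves those aliases in lookup().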

Replying to a few points out of order... On Thu, Jul 12, 2018 at 02:03:07AM +0000, Robert Vanden Eynde wrote:
lookup(name(x)) == x for all x is natural isn't it ?
The Unicode Consortium doesn't think so, or else they would mandate that all defined code points have a name.
That's a pretty old version -- we're up to version 11 now. https://www.unicode.org/Public/11.0.0/ucd/NameAliases.txt
I propose adding a keyword argument, to unicodedata.name<http://unicodedata.name>
I don't think that's a real URL.
that would implement one of some useful behavior when the value does not exist.
I am cautious about overloading functions with keyword-only arguments to implement special behaviour. Guido has a strong preference for the "no constant flags" rule of thumb (except I think we can extend it beyond just True/False to any N-state value), and I agree with that. The rule of thumb says that if you have a function that takes an optional flag which chooses between two (or more) distinct behaviours, AND the function is usually called with that flag given as a constant, then we should usually prefer to split the function into two separately named functions.

For example, in the statistics module, I have stdev() and pstdev(), rather than stdev(population=False) and stdev(population=True). (It's a rule of thumb, not a hard law of nature. There are exceptions.)

It sounds to me that your proposal would fit those conditions, and so we should prefer a separate function, or a separate API, for doing more complex name look-ups. *Especially* if there's a chance that we'll want to extend this some day to use more flags:

    name(char,
         abbreviation=False,
         correction=True,
         control=True,
         figment=True,
         alternate=False,
         )

which are all alias types defined by NameAliases.txt.
To my mind, that calls out for a separate API to return character alias properties as a separate data type:

    alias('\u0001')
    => UnicodeAlias(control='START OF HEADING', abbreviation='SOH')

    alias('\u000B')
    => UnicodeAlias(control=('LINE TABULATION', 'VERTICAL TABULATION'),
                    abbreviation='VT')
    # alternatively, fields could be a single semicolon-delimited string
    # rather than a tuple in the event of multiple aliases

    alias('\u01A2')
    => UnicodeAlias(correction='LATIN CAPITAL LETTER GHA')

    alias('\u0099')
    => UnicodeAlias(figment='SINGLE GRAPHIC CHARACTER INTRODUCER',
                    abbreviation='SGC')

Fields not shown return the empty string. This avoids overloading the name() function, future-proofs against new alias types, and if UnicodeAlias is a mutable object, easily permits the caller to customise the records to suit their own application's needs:

    def myalias(char):
        alias = unicodedata.alias(char)
        if char == '\U0001f346':
            alias.other = ('eggplant', 'purple vegetable')
            alias.slang = ('phallic', ...)
        return alias

-- Steve
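A rough sketch of how such an alias() could be built (the UnicodeAlias shape and the parsing helper are assumptions, not an existing API; the sample records are copied from NameAliases.txt, which a full implementation would read directly):

```python
from dataclasses import dataclass

# A few real code;alias;type records from NameAliases.txt, inlined so
# the sketch is self-contained.
SAMPLE = """\
0001;START OF HEADING;control
0001;SOH;abbreviation
000B;LINE TABULATION;control
000B;VERTICAL TABULATION;control
000B;VT;abbreviation
01A2;LATIN CAPITAL LETTER GHA;correction
"""

@dataclass
class UnicodeAlias:
    # One field per alias type defined by NameAliases.txt.
    correction: tuple = ()
    control: tuple = ()
    alternate: tuple = ()
    figment: tuple = ()
    abbreviation: tuple = ()

def build_alias_table(text):
    """Parse code;alias;type records into a char -> UnicodeAlias dict."""
    table = {}
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()
        if not line:
            continue
        code, name, kind = line.split(';')
        entry = table.setdefault(chr(int(code, 16)), UnicodeAlias())
        setattr(entry, kind, getattr(entry, kind) + (name,))
    return table

ALIASES = build_alias_table(SAMPLE)

def alias(char):
    """Return the UnicodeAlias record for char (empty record if none)."""
    return ALIASES.get(char, UnicodeAlias())
```

Fields with no aliases stay empty, matching the "fields not shown return the empty string" idea above, except that tuples are used throughout for uniformity.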

On Thu, Jul 12, 2018 at 05:17:26PM +1000, Steven D'Aprano <steve@pearwood.info> wrote:
I'm sure it was a stupid autoreplacement by web mail (hotmail in this case). As '.name' is a valid domain hotmail decided that unicodedata.name is a host name. And "URLified" it, so to say.
Oleg. -- Oleg Broytman https://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

Yes, my gmail client transformed unicodedata.name into a URL. I hope the mobile gmail client won't do it here. Yes, the current version is 11. I noticed it after sending the mail; I've compared with version 6 and all my arguments are still valid (they just added some characters to the "correction" set). While I'm at it, I mentioned the '\ufeff' character, but we don't care about it because it already has a name, so this is mostly a control character issue.

Yes, a new function name is also what I prefer, but I thought it would clutter the unicodedata namespace. I like your alias(...) function; with that one, an application could code my function like try: name(x) except: alias(x).abbreviations[0], if the abbreviation list is sorted by date of addition to Unicode. However, having a standard canonical name for all characters in the stdlib would help people choose the same convention. A new function like "canonical_name", or a shorter name, would be an idea.

Instead of name(char, abbreviation=True, correction=False) I would have imagined a "default_behavior" argument à la csv.Dialect, such that name(char, default_behavior=unicodedata.first_abbreviation) would use my algorithm. first_abbreviation would be an enum, or, like in csv.Dialect, a class like:

    class first_abbreviation:
        abbreviation = True
        correction = False
        ...

But I guess that's too specific; abbreviation=True would mean "take the first abbreviation in the list".

I like your alias(...) function; with that one, an application could code my function like try: name(x) except: alias(x).abbreviations[0], if the abbreviation list is sorted by date of addition to Unicode.

Or:

    try:
        return name(x)
    except ValueError:
        if category(x) == 'Cc':
            return alias(x).abbreviations[0]
        else:
            raise

That would then raise only for unassigned codepoints.

Robert Vanden Eynde writes:
The problem with control characters is that from the point of view of the Unicode Standard, the C0 and C1 registers are basically a space reserved for private use (see ISO 6429 for the huge collection of standardized control functions). That is, unlike the rest of the Unicode repertoire, the "characters" mapped there are neither unique nor context-independent.

It's true that ISO 6429 recommends specific C0 and C1 sets (but the recommended C1 set isn't even complete: U+0080, U+0081, and U+0099 aren't assigned!) However, Unicode only suggests that those should be the default interpretations, because the useful control functions are going to be dependent on context (e.g., input and output devices). This is like the situation with Internet addresses and domain names. The mapping is inherently many-many; round-tripping is not possible.

And in fact there are a number of graphic characters that have multiple code points due to bugs in national character sets. So for graphic characters, it's possible to ensure name(code(x)) = x, but it's not possible to ensure code(name(x)) = x, except under special circumstances (which apply to the vast majority of characters, of course).
I don't understand why that's particularly useful, especially in the Han case (see below).
I don't understand what you're asking for. The Unicode Standard already provides canonical names. Of course, the canonical names of most Han ideographs (near and dear to my heart) are pretty useless (they look like "CJK UNIFIED IDEOGRAPH-4E00"). (You probably don't want to get the Japanese, Chinese---and there are a lot of different kinds of Chinese---and Koreans started on what the "canonical" name should be. One Han Unification controversy is enough for this geological epoch!)

This is closely related to the Unicode standard's generic recommendation (Ch. 4.8):

    On the other hand, an API which returns a name for Unicode code
    points, but which is expected to provide useful, unique labels for
    unassigned, reserved code points and other special code point
    types, should return the value of the Unicode Name property for
    any code point for which it is non-null, but should otherwise
    construct a code point label to stand in for a character name.

(I suppose "should" here is used in the sense of RFC 2119.) So, the standard defines a canonical naming scheme, although many character names are not terribly mnemonic even to native speakers.

On the other hand, if you want useful aliases for Han characters, for many of them there could be scores of aliases, based on pronunciation, semantics, and appearance, the first two of which vary substantially within a single language, let alone across languages. Worse, as far as I know there are no standard equivalent ways to express these things in English, as when writing about these characters in English you often adopt a romanized version of the technical terms in the language you're studying. And, it's a minor point, but there are new Han characters discovered every day (I'm not even sure that's an exaggeration), as scholars examine regional and historical documents.

So for this to be most useful to me, I would want it developed OUTSIDE of the stdlib, with releases even more frequent than pytz (that is an exaggeration).
Not so much because I'll frequently need anything outside of the main CJK block in Plane 0, but because the complexity of character naming in East Asia suggests that improvements in heuristics for assigning priority to aliases, language-specific variations in heuristics, and so on will be rapid for the foreseeable future. It would be a shame to shackle that to the current stdlib release cycle, even if it doesn't need to be as frenetic as pytz. This goes in spades for people who are waiting for their own scripts to be standardized.

For the stdlib, I'm -1 on anything other than the canonical names plus the primary aliases for characters which are well-defined in the code charts of the Unicode Standard, such as those for the C0 and (most of) the C1 control characters. And even there I think a canonical name based on block name + code point in hex is the best way to go.

I think this problem is a lot harder than many of the folk participating in this discussion so far realize.

Steve

I don't understand why that's particularly useful, especially in the Han case (see below).

Since Python 3.3 has NameAliases.txt built into the distribution in order to fulfil the \N{} construct, I think it would be nice to have an API to access this file, to do things like:

    unicodedata.alias('\n').abbreviations[:3] == ['LF', 'NL', 'EOL']

I don't understand what you're asking for. The Unicode Standard already provides canonical names.

Not for control characters. About the Han case, they all have a unicodedata.name, don't they? (Sorry if I misread your message.)

On Thu, Jul 12, 2018 at 03:11:59PM +0000, Robert Vanden Eynde wrote: [Stephen]
That's because the Unicode Consortium considers that control characters have no canonical name. And I think that they are right.
I think that the point Stephen is making is that the canonical name for most Han characters is terribly uninformative, even to native Han users. For English speakers, the analogous situation would be if name("A") returned "LATIN CAPITAL LETTER 0041". There are good reasons for that, but it does mean that if your intention is to report the name of the character to a non-technical end-user, in their own native language, using the Unicode name, or even any of the aliases, is probably not a great solution.

On the other hand, if you are in a lucky enough situation (unlike Stephen) of being able to say "Han characters? We'll fix that in the next version..." using the Unicode name is not a terrible solution. At least, it's The Standard terrible solution *wink*

-- Steve

Robert Vanden Eynde writes:
Not for control characters.
There's a standard convention for "naming" control characters (U+0000, U+0001, etc.), which is recommended by the Unicode Standard (in slightly generalized form) for characters that otherwise don't have names, as "code point labels". This has been suggested by MRAB in the past. Personally I would generalize Steven D'Aprano's function a bit, and provide a "CONTROL-" prefix for these instead of "U+".

I don't see why even the C0 ASCII control function aliases should be particularly privileged, especially since the main alias is the spelled-out name, not the more commonly used 2- or 3-character abbreviation (will people associate "alarm" with "BEL"? I don't). Many are just meaningless (the 4 "device control" codes). And some are actively misleading: U+0018 (^X) "cancel" and U+001A (^Z) "substitute", which are generally interpreted as "exit" (an interactive program) and "end of file" (on Windows), or as "cut" and "revert" in CUA UI. I for one would find it more useful if they aliased to "ctrl-c-prefix" and "zap-up-to-char".[1] And nobody's ever heard of the C1 ISO 6429 control characters (not to mention that three of them are literally figments of somebody's imagination, and never standardized).

So I think using NameAliases.txt for this purpose is silly. If we're going to provide aliases based on the traditional control functions, I would use only the NameAliases.txt aliases for the following: NUL, BEL, BS, HT, LF, VT, FF, CR, ESC, SP, DEL, NEL, NBSP, and SHY. (NEL is included because it's recommended that it be treated as a newline function in the Unicode standard.) For the rest, I would use CONTROL-<code>, which is more likely to make sense in most contexts.[2]
Yes, they have names, constructed algorithmically from the code point: "CJK UNIFIED IDEOGRAPH-4E00". I know what that one is (the character that denotes the number 1), but that's the only one that I know offhand. I think Han (which are named daily, surely millions, if not billions, of times) should be treated as well as controls (which even programmers rarely bother to name, especially for those that don't have standard escape sequences). That's why I strongly advocate that there be provision for extension, and that the databases at least be provided by a module that can be updated far more frequently than the stdlib is.

Footnotes:

[1] Those are the commands they are bound to in Emacs.

[2] There are a few others that I personally would find useful and unambiguous because they're used in multilingual ISO 2022 encodings, but that's rather far into the weeds. They're rarely seen in practice; most of the time 7-bit codes with escape sequences are used, or 8-bit codes without control sequences.
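A minimal sketch of the naming rule suggested above, keeping only the well-known abbreviations and labelling every other control character (control_label and the whitelist spelling are invented for illustration, not an existing API):

```python
import unicodedata

# The fourteen well-known NameAliases.txt abbreviations listed above.
WELL_KNOWN = {
    '\x00': 'NUL', '\x07': 'BEL', '\x08': 'BS', '\t': 'HT',
    '\n': 'LF', '\x0b': 'VT', '\x0c': 'FF', '\r': 'CR',
    '\x1b': 'ESC', ' ': 'SP', '\x7f': 'DEL', '\x85': 'NEL',
    '\xa0': 'NBSP', '\xad': 'SHY',
}

def control_label(char):
    """Well-known abbreviation if any, else CONTROL-<code> for control
    characters (category Cc), else the canonical Unicode name."""
    if char in WELL_KNOWN:
        return WELL_KNOWN[char]
    if unicodedata.category(char) == 'Cc':
        return 'CONTROL-%04X' % ord(char)
    return unicodedata.name(char)
```

Under this rule, every C0 and C1 code point gets a deterministic label without relying on NameAliases.txt's more dubious entries.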

On Fri, Jul 13, 2018 at 12:02:20AM +0900, Stephen J. Turnbull wrote:
Sorry, I'm not sure if you mean my proposed alias() function isn't useful, or Robert's try...except loop around it. My alias() function is just a programmatic interface to information already available in the NameAliases.txt file. Don't you think that's useful enough as it stands? What people do with it will depend on their application, of course. [...]
Indeed. That's also the case for emoji. That's why I suggested making alias() return a mutable record rather than an immutable tuple, so application writers can add their own records to suit their own needs. I'll admit I haven't thought really deeply about what the most useful API would be -- this was only an initial post on Python-Ideas, not a fully-fledged PEP -- but I think the critical point here is that we shouldn't be privileging one alias type over the others. The NameAlias.txt file makes all that information available, but we can't access it (easily, or at all) from unicodedata. [...]
That seems fairly extreme. New Unicode versions don't come out that frequently. Surely we don't expect to track draft aliases, or characters outside of Unicode? Application writers might choose to do so -- if somebody wants to support "Ay" as an alias for LATIN CAPITAL LETTER A they can be my guest, but the stdlib doesn't have to directly support it until it hits the NameAliases.txt file :-) [...]
To clarify, do you mean the aliases defined in NameAliases.txt? Or a subset of them?
And even there I think a canonical name based on block name + code point in hex is the best way to go.
I believe you might be thinking of the Unicode "code point label" concept. I have this implementation in my toolbox:

    import unicodedata

    NONCHARACTERS = tuple(
        [chr(n) for n in range(0xFDD0, 0xFDF0)] +
        [chr(n*0x10000 + 0xFFFE + i) for n in range(17) for i in range(2)]
    )

    assert len(NONCHARACTERS) == 66

    def label(c):
        """Return the Code Point Label or character name of c.

        If c is a code point with a name, the name is used as the label;
        otherwise the Code Point Label is returned.

        >>> label(chr(0x0394))  # 'Δ'
        'GREEK CAPITAL LETTER DELTA'
        >>> label(chr(0x001F))
        '<control-001F>'
        """
        name = unicodedata.name(c, '')
        if name == '':
            # See the section on Code Point Labels:
            # http://www.unicode.org/versions/Unicode10.0.0/ch04.pdf
            number = ord(c)
            category = unicodedata.category(c)
            assert category in ('Cc', 'Cn', 'Co', 'Cs')
            if category == 'Cc':
                kind = 'control'
            elif category == 'Cn':
                if c in NONCHARACTERS:
                    kind = 'noncharacter'
                else:
                    kind = 'reserved'
            elif category == 'Co':
                kind = 'private-use'
            else:
                assert category == 'Cs'
                kind = 'surrogate'
            name = "<%s-%04X>" % (kind, number)
        return name

-- Steve

Steven D'Aprano writes:
Sorry, I'm not sure if you mean my proposed alias() function isn't useful, or Robert's try...except loop around it.
I was questioning the utility of "If the abbreviation list is sorted by AdditionToUnicodeDate." But since you ask, neither function is useful TO ME, as I understand them, because they're based on the UCD NameAliases.txt. That doesn't have any aliases I would actually use. I've never needed aliases for control characters, and for everything else the canonical name is perfectly useful (including for Korean characters and Japanese kana, which have phonetic names, as do Chinese bopomofo AIUI). There's nothing useful for Han characters yet, sadly.
To be perfectly frank, if that's all it is, I don't know when I'd ever use it. Your label function is *much* more useful.

To be specific about the defects of NameAliases.txt: "DEVICE CONTROL 1" tells me a lot less about that control character than "U+0011" does. Other aliases in that file are just wrong: I don't believe I've ever seen U+001A used as "SUBSTITUTE" for an unrepresentable coded character entity. That's the DOS "END OF FILE". Certainly, the aliases of category "correction" are useful, though not to me---I don't read any of the relevant languages. The "figment" category is stupid; almost all the names of control characters are figments, except for the half-dozen well-known whitespace characters, NUL, and maybe DEL. The 256 VSxxx "variation selectors" are somewhat useful, but I would think that it would be even more useful to provide skin color aliases for face emoji and X11 RGB.txt color aliases for hearts and the like, which presumably are standardized across vendors.

If I were designing a feature for the stdlib, I would:

0. Allow the database to consist of multiple alias tables, and be extensible by adding tables via user configuration.
1. Make the priority of the alias tables user-configurable.
2. Provide a default top-priority table more suited to likely Python usage than NameAliases.txt.
3. Provide both a primary alias function, and a list of all aliases function.
4. Provide a reverse lookup function.
5. Perhaps provide a context-sensitive alias function. The only context I can think of offhand is "position in file", i.e., to distinguish between ZWNBSP and BOM, so perhaps that's not worth doing. On the other hand, given that example, it's worth a few minutes' thought to see if there are other context-sensitive naming practices that more than a few people would want to follow.
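Points 0 through 4 might be sketched as follows (AliasRegistry and its methods are invented names; alias tables are assumed to be plain character-to-tuple-of-names dicts):

```python
class AliasRegistry:
    """Consult several alias tables in user-configurable priority order."""

    def __init__(self):
        self.tables = []          # list of (name, dict) in priority order

    def add_table(self, name, table, priority=None):
        """Register a table; priority=0 makes it the top-priority table."""
        entry = (name, table)
        if priority is None:
            self.tables.append(entry)
        else:
            self.tables.insert(priority, entry)

    def aliases(self, char):
        """All aliases for char, highest-priority table first."""
        out = []
        for _, table in self.tables:
            out.extend(table.get(char, ()))
        return out

    def alias(self, char):
        """Primary alias: first hit across tables, or None."""
        found = self.aliases(char)
        return found[0] if found else None

    def lookup(self, name):
        """Reverse lookup: first character carrying this alias, or None."""
        for _, table in self.tables:
            for char, names in table.items():
                if name in names:
                    return char
        return None
```

A user-supplied table registered ahead of the UCD-derived one then shadows entries like U+001A "SUBSTITUTE" with, say, "END OF FILE", without modifying the base data.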
Why should they add them to the tuple returned by the function, rather than to the database the function consults?
fully-fledged PEP -- but I think the critical point here is that we shouldn't be privileging one alias type over the others.
I don't understand. By providing stdlib support for NameAliases.txt only, you are privileging those aliases. If you mean privileging the Name property over the aliases, well, that's what "canonical" means, and yes, I think the Name property should be privileged (eg ZERO WIDTH NO-BREAK SPACE over BYTE ORDER MARK).
Why not track draft aliases in a "draft alias" table? More important, why not track aliases of *Unicode* characters that could use aliases (e.g., translations), in separate tables? For example, there are "shape based names" for Han characters, which are standard enough that users would be able to construct them (Unicode 11 includes one such system; see section 18.2). And Japanese names for Han radicals often vary from the UCD Name property, and are often more precise (many describe the geometric relation of the radical to the rest of the character).

It is not obvious to me that an alias() that only looks at NameAliases.txt is so useful as to belong in the stdlib, but on the other hand providing a module that can include rapidly accumulating databases along the lines I've mentioned above definitely doesn't belong in the stdlib (a la pytz). On the other hand, the *access functions* might belong in the stdlib---in the same way that timezone-sensitive datetime APIs do---but that sort of requires knowing what databases and "schema" are out there, and trying to set things up so that the same APIs can access a number of databases.
To clarify, do you mean the aliases defined in NameAliases.txt? Or a subset of them?
I didn't understand your alias function correctly; I think it is overengineered for the purpose of handling aliases. I was thinking in terms of returning a string, or at most a list of strings. If you are going to define a class to represent metadata about a character, why not make *all* metadata available? Probably most of the attributes would be properties, lazily accessing various databases:

    class Codepoint(object):
        def __init__(self, codepoint):
            self.codepoint = codepoint

        @property
        def name(self):
            # Access name database and cache result.
            ...

        @property
        def category(self):
            # Access category database and cache result.
            ...

        @property
        def alias(self):
            # Populates alias_list, and returns the first one.
            ...

        @property
        def alias_list(self):
            # Access alias database (not limited to NameAliases.txt)
            # and cache result.
            ...

        @property
        def label(self):
            # Populates and returns name, if available, otherwise a
            # code point label.
            ...

and so on. But that's a new thread.
Yes, as MRAB has suggested. I would be a little more precise than he, in that I would label the C0 and C1 control blocks with CONTROL-<code> rather than just U+<code>. Steve

Replying to a few points out of order... On Thu, Jul 12, 2018 at 02:03:07AM +0000, Robert Vanden Eynde wrote:
lookup(name(x)) == x for all x is natural isn't it ?
The Unicode Consortium doesn't think so, or else they would mandate that all defined code points have a name.
That's a pretty old version -- we're up to version 11 now. https://www.unicode.org/Public/11.0.0/ucd/NameAliases.txt
I propose adding a keyword argument, to unicodedata.name<http://unicodedata.name>
I don't think that's a real URL.
that would implement one of some useful behavior when the value does not exist.
I am cautious about overloading functions with keyword-only arguments to implement special behaviour. Guido has a strong preference for the "no constant flags" rule of thumb, (except I think we can extend it beyond just True/False to any N-state value) and I agree with that. The rule of thumb says that if you have a function that takes an optional flag which chooses between two (or more) distinct behaviours, AND the function is usually called with that flag given as a constant, then we should usually prefer to split the function into two separately named functions. For example, in the statistics module, I have stdev() and pstdev(), rather than stdev(population=False) and stdev(population=True). (Its a rule of thumb, not a hard law of nature. There are exceptions.) It sounds to me that your proposal would fit those conditions and so we should prefer a separate function, or a separate API, for doing more complex name look-ups. *Especially* if there's a chance that we'll want to extend this some day to use more flags... name(char, abbreviation=False, correction=True, control=True, figment=True, alternate=False, ) which are all alias types defined by NameAliases.txt.
To my mind, that calls out for a separate API to return character alias properties as a separate data type: alias('\u0001') => UnicodeAlias(control='START OF HEADING', abbreviation='SOH') alias('\u000B') => UnicodeAlias(control=('LINE TABULATION', 'VERTICAL TABULATION'), abbreviation='VT') # alternatively, fields could be a single semi-colon delimited string rather than a tuple in the event of multiple aliases alias('\u01A2') => UnicodeAlias(correction='LATIN CAPITAL LETTER GHA') alias('\u0099') => UnicodeAlias(figment='SINGLE GRAPHIC CHARACTER INTRODUCER', abbreviation='SGC') Fields not shown return the empty string. This avoids overloading the name() function, future-proofs against new alias types, and if UnicodeAlias is a mutable object, easily permits the caller to customise the records to suit their own application's needs: def myalias(char): alias = unicodedata.alias(char) if char == '\U0001f346': alias.other = ('eggplant', 'purple vegetable') alias.slang = ('phallic', ... ) return alias -- Steve

On Thu, Jul 12, 2018 at 05:17:26PM +1000, Steven D'Aprano <steve@pearwood.info> wrote:
I'm sure it was a stupid autoreplacement by web mail (hotmail in this case). As '.name' is a valid domain hotmail decided that unicodedata.name is a host name. And "URLified" it, so to say.
-- Steve
Oleg. -- Oleg Broytman https://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

Yes, my gmail client transformed unicodata . name to a url. I hope the mobile gmail client won't do it here. Yes current version is 11. I noticed it after sending the mail, I've compared to the version 6 and all my arguments are still valid (they just added some characters in the "correction" set). As I'm at, I mentionned the ffef character but we don't care about it because it already has a name, so that's mostly a control character issue. Yes a new function name is also what I prefer but I thought it would clutter the unicodata namespace. I like your alias(...) function, with that one, an application could code my function like try name(x) expect alias(x).abbreviations[0]. If the abbreviation list is sorted by AdditionToUnicodeDate. However, having a standard canonical name for all character in the stdlib would help people choosing the same convention. A new function like "canonical_name" or a shorter name would be an idea. Instead of name(char, abbreviation=True, correction=False) I would have Imagined a "default_behavior" ala csv.dialect such that name(char, default_bevior=unicodata.first_abbreviation) would use my algorithm. first_abbreviation would be a enum, or like in csv.dialect a class like : class first_abbreviation: abbreviation = True; correction = False; ... But I guess that's too specific, abbreviation=True would mean "take the first abbreviation in the list".

I like your alias(...) function, with that one, an application could code my function like try name(x) expect alias(x).abbreviations[0]. If the abbreviation list is sorted by AdditionToUnicodeDate. Or try: return name(x) expect: if category(x) == 'Cc': return alias(x).abbreviations[0] else: raise That would then raise only for unassigned codepoints.

Robert Vanden Eynde writes:
The problem with control characters is that from the point of view of the Unicode Standard, the C0 and C1 registers are basically a space reserved for private use (see ISO 6429 for the huge collection of standardized control functions). That is, unlike the rest of the Unicode repertoire, the "characters" mapped there are neither unique nor context-independent. It's true that ISO 6429 recommends specific C0 and C1 sets (but the recommended C1 set isn't even complete: U+0080, U+0081, and U+0099 aren't assigned!) However, Unicode only suggests that those should be the default interpretations, because the useful control functions are going to be dependent on context (eg, input and output devices). This is like the situation with Internet addresses and domain names. The mapping is inherently many-many; round-tripping is not possible. And in fact there are a number of graphic characters that have multiple code points due to bugs in national character sets. So for graphic characters, it's possible to ensure name(code(x)) = x, but it's not possible to ensure code(name(x)) = x, except under special circumstances (which apply to the vast majority of characters, of course).
I don't understand why that's particularly useful, especially in the Han case (see below).
I don't understand what you're asking for. The Unicode Standard already provides canonical names. Of course, the canonical name of most Han ideographs (near and dear to my heart) are pretty useless (they look like "CJK UNIFIED IDEOGRAPH-4E00"). (You probably don't want to get the Japanese, Chinese---and there are a lot of different kinds of Chinese---and Koreans started on what the "canonical" name should be. One Han Unification controversy is enough for this geological epoch!) This is closely related to the Unicode standard's generic recommendation (Ch. 4.8): On the other hand, an API which returns a name for Unicode code points, but which is expected to provide useful, unique labels for unassigned, reserved code points and other special code point types, should return the value of the Unicode Name property for any code point for which it is non-null, but should otherwise con- struct a code point label to stand in for a character name. (I suppose "should" here is used in the sense of RFC 2119.) So, the standard defines a canonical naming scheme, although many character names are not terribly mnemonic even to native speakers. On the other hand, if you want useful aliases for Han characters, for many of them there could be scores of aliases, based on pronunciation, semantics, and appearance, the first two of which of which vary substantially within a single language, let alone across languages. Worse, as far as I know there are no standard equivalent ways to express these things in English, as when writing about these characters in English you often adopt a romanized version of the technical terms in the language you're studying. And, it's a minor point, but there are new Han characters discovered every day (I'm not even sure that's an exaggeration), as scholars examine regional and historical documents. So for this to be most useful to me, I would want it developed OUTSIDE of the stdlib, with releases even more frequent than pytz (that is an exaggeration). 
Not so much because I'll frequently need anything outside of the main CJK block in Plane 0, but because the complexity of character naming in East Asia suggests that improvements in heuristics for assigning priority to aliases, language-specific variations in heuristics, and so on will be rapid for the forseeable future. It would be a shame to shackle that to the current stdlib release cycle even if it doesn't need to be as frenetic as pytz. This goes in spades for people who are waiting for their own scripts to be standardized. For the stdlib, I'm -1 on anything other than the canonical names plus the primary aliases for characters which are well-defined in the code charts of the Unicode Standard, such as those for the C0 and (most of) the C1 control characters. And even there I think a canonical name based on block name + code point in hex is the best way to go. I think this problem is a lot harder than many of the folk participating in this discussion so far realize. Steve

> I don't understand why that's particularly useful, especially in the
> Han case (see below).

Since Python 3.3 has NameAliases.txt built into the distribution in order to fulfill the \N{} construct, I think it would be nice to have an API to access this file, to do something like:

    unicodedata.alias('\n').abbreviations[:3] == ['LF', 'NL', 'EOL']

> I don't understand what you're asking for. The Unicode Standard
> already provides canonical names.

Not for control characters.

About the Han case, they all have a unicodedata.name, don't they? (Sorry if I misread your message)
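The asymmetry under discussion is easy to demonstrate with the current unicodedata API (Python 3.3+): lookup() accepts the NameAliases.txt aliases, but name() still raises for control characters, so lookup(name(x)) == x does not hold for all x. A quick sketch:

```python
import unicodedata

# lookup() accepts name aliases from NameAliases.txt since Python 3.3:
assert unicodedata.lookup('NULL') == '\x00'
assert unicodedata.lookup('LINE FEED') == '\n'

# ...but name() has no name for control characters and raises ValueError:
try:
    unicodedata.name('\n')
except ValueError as e:
    print(e)

# The optional default avoids the exception, but is not a usable name:
print(unicodedata.name('\n', '<no name>'))
```

(Note that name() raises ValueError, not KeyError, for nameless code points.)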

On Thu, Jul 12, 2018 at 03:11:59PM +0000, Robert Vanden Eynde wrote: [Stephen]
> That's because the Unicode Consortium considers that control
> characters have no canonical name. And I think that they are right.
I think that the point Stephen is making is that the canonical name for most Han characters is terribly uninformative, even to native Han users. For English speakers, the analogous situation would be if name("A") returned "LATIN CAPITAL LETTER 0041".

There are good reasons for that, but it does mean that if your intention is to report the name of the character to a non-technical end-user, in their own native language, using the Unicode name or even any of the aliases is probably not a great solution.

On the other hand, if you are in a lucky enough situation (unlike Stephen) of being able to say "Han characters? We'll fix that in the next version..." using the Unicode name is not a terrible solution. At least, it's The Standard terrible solution *wink*

-- 
Steve

Robert Vanden Eynde writes:
> Not for control characters.
There's a standard convention for "naming" control characters (U+0000, U+0001, etc), which is recommended by the Unicode Standard (in slightly generalized form) for characters that otherwise don't have names, as "code point labels". This has been suggested by MRAB in the past. Personally I would generalize Steven d'Aprano's function a bit, and provide a "CONTROL-" prefix for these instead of "U+".

I don't see why even the C0 ASCII control function aliases should be particularly privileged, especially since the main alias is the spelled-out name, not the more commonly used 2- or 3-character abbreviation (will people associate "alarm" with "BEL"? I don't). Many are just meaningless (the 4 "device control" codes). And some are actively misleading: U+0018 (^X) "cancel" and U+001A (^Z) "substitute", which are generally interpreted as "exit" (an interactive program) and "end of file" (on Windows), or as "cut" and "revert" in CUA UI. I for one would find it more useful if they aliased to "ctrl-c-prefix" and "zap-up-to-char".[1] And nobody's ever heard of the C1 ISO 6429 control characters (not to mention that three of them are literally figments of somebody's imagination, and never standardized).

So I think using NameAliases.txt for this purpose is silly. If we're going to provide aliases based on the traditional control functions, I would use only the NameAliases.txt aliases for the following: NUL, BEL, BS, HT, LF, VT, FF, CR, ESC, SP, DEL, NEL, NBSP, and SHY. (NEL is included because it's recommended that it be treated as a newline function in the Unicode standard.) For the rest, I would use CONTROL-<code>, which is more likely to make sense in most contexts.[2]
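The scheme proposed here (whitelist the well-known abbreviations, fall back to a CONTROL-<code> label for other control characters) could be sketched roughly as follows. The function name and table layout are invented for illustration; the abbreviation list is the one given in the message above.

```python
import unicodedata

# Whitelisted abbreviations for well-known control/format characters,
# per the list above (codepoint -> abbreviation).
_WELL_KNOWN = {
    0x00: 'NUL', 0x07: 'BEL', 0x08: 'BS', 0x09: 'HT', 0x0A: 'LF',
    0x0B: 'VT', 0x0C: 'FF', 0x0D: 'CR', 0x1B: 'ESC', 0x20: 'SP',
    0x7F: 'DEL', 0x85: 'NEL', 0xA0: 'NBSP', 0xAD: 'SHY',
}

def control_name(c):
    """Return a whitelisted abbreviation, else the canonical name,
    else a CONTROL-<code> label for other control characters."""
    cp = ord(c)
    if cp in _WELL_KNOWN:
        return _WELL_KNOWN[cp]
    name = unicodedata.name(c, '')
    if name:
        return name
    if unicodedata.category(c) == 'Cc':
        return 'CONTROL-%04X' % cp
    return 'U+%04X' % cp       # non-control nameless code points
```

For example, control_name('\x11') gives "CONTROL-0011" rather than the arguably meaningless NameAliases.txt alias "DEVICE CONTROL ONE".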
Yes, they have names, constructed algorithmically from the code point: "CJK UNIFIED IDEOGRAPH-4E00". I know what that one is (the character that denotes the number 1). But that's the only one that I know offhand.

I think Han (which are named daily, surely millions, if not billions, of times) should be treated as well as controls (which even programmers rarely bother to name, especially for those that don't have standard escape sequences). That's why I strongly advocate that there be provision for extension, and that the databases at least be provided by a module that can be updated far more frequently than the stdlib is.

Footnotes:
[1] Those are the commands they are bound to in Emacs.
[2] There are a few others that I personally would find useful and unambiguous because they're used in multilingual ISO 2022 encodings, but that's rather far into the weeds. They're rarely seen in practice; most of the time 7-bit codes with escape sequences are used, or 8-bit codes without control sequences.

On Fri, Jul 13, 2018 at 12:02:20AM +0900, Stephen J. Turnbull wrote:
Sorry, I'm not sure if you mean my proposed alias() function isn't useful, or Robert's try...except loop around it. My alias() function is just a programmatic interface to information already available in the NameAliases.txt file. Don't you think that's useful enough as it stands? What people do with it will depend on their application, of course.

[...]
Indeed. That's also the case for emoji. That's why I suggested making alias() return a mutable record rather than an immutable tuple, so application writers can add their own records to suit their own needs.

I'll admit I haven't thought really deeply about what the most useful API would be -- this was only an initial post on Python-Ideas, not a fully-fledged PEP -- but I think the critical point here is that we shouldn't be privileging one alias type over the others. The NameAliases.txt file makes all that information available, but we can't access it (easily, or at all) from unicodedata.

[...]
That seems fairly extreme. New Unicode versions don't come out that frequently. Surely we don't expect to track draft aliases, or characters outside of Unicode? Application writers might choose to do so -- if somebody wants to support "Ay" as an alias for LATIN CAPITAL LETTER A they can be my guest, but the stdlib doesn't have to directly support it until it hits the NameAliases.txt file :-) [...]
To clarify, do you mean the aliases defined in NameAliases.txt? Or a subset of them?
> And even there I think a canonical name based on block name + code
> point in hex is the best way to go.
I believe you might be thinking of the Unicode "code point label" concept. I have this implementation in my toolbox:

    NONCHARACTERS = tuple(
        [unichr(n) for n in range(0xFDD0, 0xFDF0)]
        + [unichr(n*0x10000 + 0xFFFE + i) for n in range(17) for i in range(2)]
        )

    assert len(NONCHARACTERS) == 66

    def label(c):
        """Return the Code Point Label or character name of c.

        If c is a code point with a name, the name is used as the label;
        otherwise the Code Point Label is returned.

        >>> label(unichr(0x0394))  # u'Δ'
        'GREEK CAPITAL LETTER DELTA'
        >>> label(unichr(0x001F))
        '<control-001F>'

        """
        name = unicodedata.name(c, '')
        if name == '':
            # See section on Code Point Labels
            # http://www.unicode.org/versions/Unicode10.0.0/ch04.pdf
            number = ord(c)
            category = unicodedata.category(c)
            assert category in ('Cc', 'Cn', 'Co', 'Cs')
            if category == 'Cc':
                kind = 'control'
            elif category == 'Cn':
                if c in NONCHARACTERS:
                    kind = 'noncharacter'
                else:
                    kind = 'reserved'
            elif category == 'Co':
                kind = 'private-use'
            else:
                assert category == 'Cs'
                kind = 'surrogate'
            name = "<%s-%04X>" % (kind, number)
        return name

-- 
Steve
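For reference, the same idea in Python 3 (chr instead of unichr) might look like the sketch below. This is a simplified version, not Steven's exact code: it folds noncharacters into the 'reserved' kind rather than carrying the NONCHARACTERS table.

```python
import unicodedata

def label(c):
    """Return the character's name, or a Unicode code point label
    (<control-XXXX>, <surrogate-XXXX>, etc.) if it has none."""
    name = unicodedata.name(c, '')
    if not name:
        kinds = {'Cc': 'control', 'Co': 'private-use', 'Cs': 'surrogate'}
        # Simplification: 'Cn' code points are all labelled 'reserved'
        # here, without distinguishing noncharacters.
        kind = kinds.get(unicodedata.category(c), 'reserved')
        name = '<%s-%04X>' % (kind, ord(c))
    return name

print(label('\u0394'))   # GREEK CAPITAL LETTER DELTA
print(label('\x1f'))     # <control-001F>
```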

Steven D'Aprano writes:
> Sorry, I'm not sure if you mean my proposed alias() function isn't
> useful, or Robert's try...except loop around it.
I was questioning the utility of "If the abbreviation list is sorted by AdditionToUnicodeDate." But since you ask, neither function is useful TO ME, as I understand them, because they're based on the UCD NameAliases.txt. That doesn't have any aliases I would actually use. I've never needed aliases for control characters, and for everything else the canonical name is perfectly useful (including for Korean characters and Japanese kana, which have phonetic names, as do Chinese bopomofo AIUI). There's nothing useful for Han characters yet, sadly.
To be perfectly frank, if that's all it is, I don't know when I'd ever use it. Your label function is *much* more useful.

To be specific about the defects of NameAliases.txt: "DEVICE CONTROL 1" tells me a lot less about that control character than "U+0011" does. Other aliases in that file are just wrong: I don't believe I've ever seen U+001A used as "SUBSTITUTE" for an unrepresentable coded character entity. That's the DOS "END OF FILE". Certainly, the aliases of category "correction" are useful, though not to me---I don't read any of the relevant languages. The "figment" category is stupid; almost all the names of control characters are figments, except for the half-dozen well-known whitespace characters, NUL, and maybe DEL. The 256 VSxxx "variation selectors" are somewhat useful, but I would think that it would be even more useful to provide skin color aliases for face emoji and X11 RGB.txt color aliases for hearts and the like, which presumably are standardized across vendors.

If I were designing a feature for the stdlib, I would

0. Allow the database to consist of multiple alias tables, and be
   extensible by adding tables via user configuration.
1. Make the priority of the alias tables user-configurable.
2. Provide a default top-priority table more suited to likely Python
   usage than NameAliases.txt.
3. Provide both a primary alias function, and a list of all aliases
   function.
4. Provide a reverse lookup function.
5. Perhaps provide a context-sensitive alias function. The only
   context I can think of offhand is "position in file", ie, to
   distinguish between ZWNBSP and BOM, so perhaps that's not worth
   doing. On the other hand, given that example, it's worth a few
   minutes thought to see if there are other context-sensitive naming
   practices that more than a few people would want to follow.
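Points 0-4 of this design could be sketched along the following lines. All names here (AliasRegistry, add_table, and so on) are invented for illustration; this is not a proposed stdlib API, just a minimal model of layered, user-extensible alias tables with priorities and reverse lookup.

```python
class AliasRegistry:
    """Layered alias lookup: earlier tables take priority (points 0-2)."""

    def __init__(self):
        self._tables = []  # each table: dict of codepoint -> list of aliases

    def add_table(self, table, priority=None):
        """Register an alias table; priority 0 is consulted first."""
        if priority is None:
            self._tables.append(table)
        else:
            self._tables.insert(priority, table)

    def alias(self, c):
        """Primary alias: first hit in priority order (point 3)."""
        for table in self._tables:
            if ord(c) in table:
                return table[ord(c)][0]
        return None

    def aliases(self, c):
        """All known aliases, in priority order (point 3)."""
        result = []
        for table in self._tables:
            result.extend(table.get(ord(c), []))
        return result

    def lookup(self, name):
        """Reverse lookup: alias -> character (point 4)."""
        for table in self._tables:
            for cp, names in table.items():
                if name in names:
                    return chr(cp)
        raise KeyError(name)

# Example: a user-supplied table layered over a NameAliases-style table.
registry = AliasRegistry()
registry.add_table({0x0A: ['LF', 'NL', 'EOL'], 0x00: ['NUL']})
registry.add_table({0x0A: ['NEWLINE']}, priority=0)  # user override wins
```

With this layering, registry.alias('\n') returns the user's "NEWLINE" while the NameAliases-derived aliases remain reachable via aliases() and lookup().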
Why should they add them to the tuple returned by the function, rather than to the database the function consults?
> fully-fledged PEP -- but I think the critical point here is that we
> shouldn't be privileging one alias type over the others.
I don't understand. By providing stdlib support for NameAliases.txt only, you are privileging those aliases. If you mean privileging the Name property over the aliases, well, that's what "canonical" means, and yes, I think the Name property should be privileged (eg ZERO WIDTH NO-BREAK SPACE over BYTE ORDER MARK).
Why not track draft aliases in a "draft alias" table? More important, why not track aliases of *Unicode* characters that could use aliases (eg, translations), in separate tables? For example, there are "shape based names" for Han characters, which are standard enough so that users would be able to construct them (Unicode 11 includes one such system, see section 18.2). And Japanese names for Han radicals often vary from the UCD Name property, and are often more precise (many describe the geometric relation of the radical to the rest of the character).

It is not obvious to me that an alias() that only looks at NameAliases.txt is so useful as to belong in the stdlib, but on the other hand providing a module that can include rapidly accumulating databases along the lines I've mentioned above definitely doesn't belong in the stdlib (a la pytz). On the other hand, the *access functions* might belong in the stdlib---in the same way that timezone-sensitive datetime APIs do---but that sort of requires knowing what databases and "schema" are out there, and trying to set things up so that the same APIs can access a number of databases.
> To clarify, do you mean the aliases defined in NameAliases.txt? Or a
> subset of them?
I didn't understand your alias function correctly, which I think is overengineered for the purpose of handling aliases. I was thinking in terms of returning a string, or at most a list of strings.

If you are going to define a class to represent metadata about a character, why not make *all* metadata available? Probably most of the attributes would be properties, lazily accessing various databases:

    class Codepoint(object):
        def __init__(self, codepoint):
            self.codepoint = codepoint

        @property
        def name(self):
            # Access name database and cache result.
            ...

        @property
        def category(self):
            # Access category database and cache result.
            ...

        @property
        def alias(self):
            # Populates alias_list, and returns the first one.
            ...

        @property
        def alias_list(self):
            # Access alias database (not limited to NameAliases.txt) and
            # cache result.
            ...

        @property
        def label(self):
            # Populates and returns name, if available, otherwise a code
            # point label.
            ...

and so on. But that's a new thread.
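A minimal working version of such a class, backed only by what the stdlib actually provides, might look like this. Since unicodedata exposes no alias database, the alias machinery below is faked with a tiny hard-coded table; a real implementation would read NameAliases.txt or a user-extensible database. The label logic is likewise simplified to two kinds.

```python
import unicodedata
from functools import cached_property

# Stand-in alias table for illustration only (codepoint -> aliases).
_ALIASES = {0x00: ['NUL'], 0x09: ['HT', 'TAB'], 0x0A: ['LF', 'NL', 'EOL']}

class Codepoint:
    def __init__(self, codepoint):
        self.codepoint = codepoint

    @cached_property
    def name(self):
        # Canonical name, or '' for nameless code points.
        return unicodedata.name(chr(self.codepoint), '')

    @cached_property
    def category(self):
        return unicodedata.category(chr(self.codepoint))

    @cached_property
    def alias_list(self):
        return _ALIASES.get(self.codepoint, [])

    @cached_property
    def alias(self):
        return self.alias_list[0] if self.alias_list else None

    @cached_property
    def label(self):
        # Name if available, otherwise a (simplified) code point label.
        kind = 'control' if self.category == 'Cc' else 'reserved'
        return self.name or '<%s-%04X>' % (kind, self.codepoint)
```

cached_property (Python 3.8+) gives the lazy compute-once behaviour the message describes.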
Yes, as MRAB has suggested. I would be a little more precise than he, in that I would label the C0 and C1 control blocks with CONTROL-<code> rather than just U+<code>. Steve
participants (5):
- MRAB
- Oleg Broytman
- Robert Vanden Eynde
- Stephen J. Turnbull
- Steven D'Aprano