[Python-ideas] Support Unicode code point notation
Nick Coghlan
ncoghlan at gmail.com
Sun Jul 28 15:06:18 CEST 2013
On 28 July 2013 22:00, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Nick Coghlan writes:
> > Using "modal encoding" to refer to that change isn't really valid
> > though
>
> No, it's quite correct, at least in ISO-land. There, a modal encoding
> is one which must maintain state across *code points*. The single-
> code-point "\N" syntax needs to maintain state across *code units*,
> but when it's done with a code *point*, it's done - there's no state
> to worry about before starting to parse the next one. By your
> definition, UTF-8 is modal, but that doesn't seem a very useful
> categorization to me.
My bytes-oriented comms background is showing ;)
I agree, preserving the property that "one escape sequence = one code
point" is valuable, so the proposal should just be to make this
resolve to the right value:
"\N{U+<code-point>}"
It would also be more consistent if unicodedata.lookup() was updated
to handle numeric code point names. Something like:
>>> import unicodedata
>>> def enhanced_lookup(name):
...     if name.startswith("U+"):
...         return chr(int(name[2:], 16))
...     return unicodedata.lookup(name)
...
>>> enhanced_lookup("GREEK SMALL LETTER ALPHA")
'α'
>>> enhanced_lookup("U+03B1")
'α'
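
To make the "one escape sequence = one code point" property concrete, here is a rough sketch of how the same lookup could drive expansion of literal \N{...} sequences in ordinary text. It is only an illustration under some assumptions: expand_escapes and the two regular expressions are hypothetical names, not part of the standard library, and accepting 4 to 6 hex digits after "U+" is my guess at a sensible rule rather than anything the proposal settles.

```python
import re
import unicodedata

# Hypothetical patterns for illustration; not part of any stdlib API.
# Assumption: a code point is written as "U+" followed by 4 to 6 hex digits.
_CODE_POINT = re.compile(r"U\+([0-9A-Fa-f]{4,6})")
_ESCAPE = re.compile(r"\\N\{([^}]+)\}")


def enhanced_lookup(name):
    """Resolve a character name or a "U+XXXX" string to a single character."""
    match = _CODE_POINT.fullmatch(name)
    if match:
        # chr() raises ValueError for values above 0x10FFFF.
        return chr(int(match.group(1), 16))
    # unicodedata.lookup() raises KeyError for unknown names.
    return unicodedata.lookup(name)


def expand_escapes(text):
    """Replace each literal \\N{...} sequence in text with one code point."""
    return _ESCAPE.sub(lambda m: enhanced_lookup(m.group(1)), text)


if __name__ == "__main__":
    # A raw string keeps the backslashes away from the compiler's own
    # \N{...} handling, so the helper sees them unchanged.
    print(expand_escapes(r"\N{U+03B1} and \N{GREEK SMALL LETTER BETA}"))
```

Each match of the escape pattern yields exactly one character, so there is no state to carry from one escape to the next. Malformed input such as "U+3B1" falls through to unicodedata.lookup() and raises KeyError, which seems a reasonable failure mode for a sketch.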
Cheers,
Nick.
--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia