[Python-ideas] Support Unicode code point notation

Nick Coghlan ncoghlan at gmail.com
Sun Jul 28 15:06:18 CEST 2013


On 28 July 2013 22:00, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Nick Coghlan writes:
>  > Using "modal encoding" to refer to that change isn't really valid
>  > though
>
> No, it's quite correct, at least in ISO-land.  There, a modal encoding
> is one which must maintain state across *code points*.  The single-
> code-point "\N" syntax needs to maintain state across *code units*,
> but when it's done with a code *point*, it's done - there's no state
> to worry about before starting to parse the next one.  By your
> definition, UTF-8 is modal, but that doesn't seem a very useful
> categorization to me.

My bytes-oriented comms background is showing ;)

I agree, preserving the property that "one escape sequence = one code
point" is valuable, so the proposal should just be to make this
resolve to the single code point it names:

    "\N{U+<code-point>}"
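
As a rough illustration, the proposed behaviour can be emulated on raw
text today. This is only a sketch: the helper name
expand_named_escapes and its regular expression are my own assumptions,
not an existing API, and a real implementation would live in the
compiler's handling of string literals rather than in a function.

```python
import re
import unicodedata

# Hypothetical pattern for this sketch: a backslash, "N", then a braced name.
_N_ESCAPE = re.compile(r"\\N\{([^}]+)\}")


def expand_named_escapes(raw):
    """Expand \\N{...} escapes in raw text, one escape to one code point."""

    def _replace(match):
        name = match.group(1)
        if name.startswith("U+"):
            # Proposed form: hexadecimal code point after the "U+" prefix.
            return chr(int(name[2:], 16))
        # Existing form: an official Unicode character name.
        return unicodedata.lookup(name)

    return _N_ESCAPE.sub(_replace, raw)


print(expand_named_escapes(r"\N{U+03B1} and \N{GREEK SMALL LETTER BETA}"))
# prints: α and β
```

Each escape yields exactly one code point, so no state carries over
from one escape to the next.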

It would also be more consistent if unicodedata.lookup() were updated
to handle numeric code point names. Something like:

>>> import unicodedata
>>> def enhanced_lookup(name):
...     if name.startswith("U+"):
...         return chr(int(name[2:], 16))
...     return unicodedata.lookup(name)
...
>>> enhanced_lookup("GREEK SMALL LETTER ALPHA")
'α'
>>> enhanced_lookup("U+03B1")
'α'
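
One caveat with passing the digits straight to int(): it is more
permissive than the U+ notation, so "U+0x41" or "U+4_1" would also be
accepted. A stricter variant might look like the sketch below. The name
strict_lookup, the 4 to 6 hex digit rule, and the choice of KeyError
(to match what unicodedata.lookup() raises for unknown names) are all
assumptions on my part, not settled parts of the proposal.

```python
import re
import unicodedata

# Assumed grammar for this sketch: "U+" followed by 4 to 6 hex digits.
_CODE_POINT = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def strict_lookup(name):
    """Resolve a character name or a strictly formatted U+XXXX code point."""
    match = _CODE_POINT.fullmatch(name)
    if match is None:
        # Not in U+ notation (or malformed): defer to the normal name
        # lookup, which raises KeyError for names it does not know.
        return unicodedata.lookup(name)
    value = int(match.group(1), 16)
    if value > 0x10FFFF:
        # Beyond the Unicode code space; raise the same error type as
        # an unknown name for consistency.
        raise KeyError("code point out of range: " + name)
    return chr(value)
```

With this version strict_lookup("U+03B1") still returns 'α', while
"U+0x41" and "U+110000" raise KeyError instead of being silently
accepted or failing with a different exception type.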

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

