[Python-ideas] Add str.bmp() to only expand non-BMP chars, for tkinter use
Serhiy Storchaka
storchaka at gmail.com
Mon Mar 16 01:03:03 CET 2015
On 16.03.15 01:09, Terry Reedy wrote:
> 3.x comes with builtin ascii(obj) to return a string representation of
> obj that only has ascii characters.
>
> >>> s = 'a\xaa\ua000\U0001a000'
> >>> len(s)
> 4
> >>> sa = ascii(s)
> >>> sa, len(sa)
> ("'a\\xaa\\ua000\\U0001a000'", 23)
>
> This allows any string to be printed on even a minimal ascii terminal.
>
> Python also comes with the tkinter interface to tk. Tk widgets are not
> limited to ascii but support the full BMP subset of unicode. (This is
> better than Windows consoles limited by codepages.) Thus, for use with
> tkinter, ascii() has two faults: it adds a quote at beginning and end of
> the string (like repr); it expands too much.
>
> I looked at repr, which expands less, but it seems to be buggy in that
> it is not consistent in its handling of non-BMP chars.
>
> >>> s1 = 'a\xaa\ua000\U00011000'
> >>> sa = 'a\xaa\ua000\U0001a000'
> >>> s1r = repr(s1); len(s1r)
> 6
> >>> sar = repr(sa); len(sar)
> 15
>
> '\U0001a000' gets expanded, and can be printed.
> '\U00011000' does not, and cannot be consistently printed.
>
> >>> s1r # only works at >>> prompt
> "'a\xaa\ua000\U00011000'"
> >>> print(s1r) # required in programs
> Traceback (most recent call last):
>   File "<pyshell#43>", line 1, in <module>
>     print(s1r)
>   File "C:\Programs\Python34\lib\idlelib\PyShell.py", line 1347, in write
>     return self.shell.write(s, self.tags)
> UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 4-4: Non-BMP character not supported in Tk
>
> Printing s1 or sa directly, by either means, gives the same error.
> (Since '>>> expr' is supposed to be the same as 'print(expr)' the above
> difference puzzles me.)
>
> Even if repr always worked as it does for '\U0001a000', there would
> still be the problem of the added quotes. I therefore proposed the
> addition of a new str method, such as 's.bmp()', that returns s with all
> non-BMP chars, and only such chars, expanded. Since strings (in
> CPython) are internally marked by 'kind', the method would just return s
> when no expansion is needed. I presume it could otherwise re-use the
> expansion code already in repr.
>
> Aside from tkinter programmers in general, this issue bites Idle in at
> least two ways. Internally, filenames can contain non-BMP chars and
> Idle displays them in 3 places.
> See http://bugs.python.org/issue23672.
> Externally, Idle users sometimes want to print strings with non-BMP
> chars. I believe the automatic use of .bmp() with console prints could
> be user selectable. There have been issues about this on both our
> tracker and StackOverflow.
>
> I believe that the use of non-BMP chars is becoming more common and can
> no longer be simply dismissed as too rare to worry about. Telling
> Windows users that they are better off than if they use python directly,
> with the windows console, does not solve the inability to print any
> Python string. This proposal would.
>
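As a rough illustration of the semantics proposed above, the behaviour could be sketched in pure Python. This is only a sketch, not an implementation of the proposed method: the function name bmp is borrowed from the proposal, and the \UXXXXXXXX escape format is an assumption chosen to match what repr produces for '\U0001a000'.

```python
def bmp(s):
    """Return s with non-BMP characters, and only those, expanded.

    Hypothetical pure-Python stand-in for the proposed str.bmp().
    """
    # Fast path: no character above U+FFFF, so return the same object.
    if not s or max(s) <= '\uffff':
        return s
    # Assumed escape format: the 10-character \UXXXXXXXX form used by repr.
    return ''.join(c if c <= '\uffff' else '\\U%08x' % ord(c) for c in s)
```

Unlike ascii() or repr(), this adds no quotes and leaves '\xaa' and '\ua000' alone, so 'a\xaa\ua000\U0001a000' becomes a 13-character string.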
Right now I'm writing a patch that implements a similar idea for issue18814.
>>> codecs.convert_astral('a\u20ac\U000e007f', 'strict')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/serhiy/py/cpython/Lib/codecs.py", line 1159, in convert_astral
    'astral characters'))
UnicodeTranslateError: can't translate character '\U000e007f' in position 2: astral characters
>>> codecs.convert_astral('a\u20ac\U000e007f', 'ignore')
'a€'
>>> codecs.convert_astral('a\u20ac\U000e007f', 'replace')
'a€�'
>>> codecs.convert_astral('a\u20ac\U000e007f', 'backslashreplace')
'a€\\U000e007f'
>>> codecs.convert_astral('a\u20ac\U000e007f', 'namereplace')
'a€\\N{CANCEL TAG}'
>>> codecs.convert_astral('a\u20ac\U000e007f', 'xmlcharrefreplace')
'a€&#917631;'
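The session above can be approximated in pure Python for experimentation. This is a hedged sketch, not the patch for issue18814: the name convert_astral_sketch, the handler table, and the fallback used when a character has no Unicode name are all assumptions made here.

```python
import unicodedata


def _escape(c):
    # \UXXXXXXXX form, as in the 'backslashreplace' output above.
    return '\\U%08x' % ord(c)


def _name(c):
    # Assumption: fall back to the backslash escape for unnamed characters.
    name = unicodedata.name(c, None)
    return '\\N{%s}' % name if name else _escape(c)


# Hypothetical handler table keyed by the usual codec error-handler names.
_HANDLERS = {
    'ignore': lambda c: '',
    'replace': lambda c: '\ufffd',
    'backslashreplace': _escape,
    'namereplace': _name,
    'xmlcharrefreplace': lambda c: '&#%d;' % ord(c),
}


def convert_astral_sketch(s, errors='strict'):
    """Return s with every non-BMP character handled according to errors."""
    if errors != 'strict' and errors not in _HANDLERS:
        raise LookupError('unknown error handler name %r' % errors)
    out = []
    for i, c in enumerate(s):
        if c <= '\uffff':
            out.append(c)  # BMP characters pass through unchanged
        elif errors == 'strict':
            raise UnicodeTranslateError(s, i, i + 1, 'astral characters')
        else:
            out.append(_HANDLERS[errors](c))
    return ''.join(out)
```

With errors='backslashreplace' this coincides with the s.bmp() behaviour proposed in the quoted message; the other handlers only change what is substituted for each astral character.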