[Python-ideas] Add str.bmp() to only expand non-BMP chars, for tkinter use
Terry Reedy
tjreedy at udel.edu
Mon Mar 16 00:09:39 CET 2015
Python 3.x comes with the builtin ascii(obj), which returns a string
representation of obj that contains only ASCII characters.
>>> s = 'a\xaa\ua000\U0001a000'
>>> len(s)
4
>>> sa = ascii(s)
>>> sa, len(sa)
("'a\\xaa\\ua000\\U0001a000'", 23)
This allows any string to be printed on even a minimal ASCII terminal.
Python also comes with the tkinter interface to Tk. Tk widgets are not
limited to ASCII but support the full BMP subset of Unicode. (This is
better than Windows consoles limited by codepages.) Thus, for use with
tkinter, ascii() has two faults: it adds a quote at the beginning and end
of the string (like repr), and it expands too much, escaping BMP chars
that Tk can display as they are.
I looked at repr, which expands less, but it seems to be buggy in that
it is not consistent in its handling of non-BMP chars.
>>> s1 = 'a\xaa\ua000\U00011000'
>>> sa = 'a\xaa\ua000\U0001a000'
>>> s1r = repr(s1); len(s1r)
6
>>> sar = repr(sa); len(sar)
15
'\U0001a000' gets expanded, and can be printed.
'\U00011000' does not, and cannot be consistently printed.
>>> s1r # only works at >>> prompt
"'a\xaa\ua000\U00011000'"
>>> print(s1r) # required in programs
Traceback (most recent call last):
  File "<pyshell#43>", line 1, in <module>
    print(s1r)
  File "C:\Programs\Python34\lib\idlelib\PyShell.py", line 1347, in write
    return self.shell.write(s, self.tags)
UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 4-4: Non-BMP character not supported in Tk
Printing s1 or sa directly, by either means, gives the same error.
(Since '>>> expr' is supposed to be the same as 'print(expr)' the above
difference puzzles me.)
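The difference between the two repr results above appears to follow
str.isprintable(): as far as I can tell from the documentation, repr
escapes chars that the Unicode database classes as non-printable and
leaves printable ones alone, whether or not they are in the BMP. A small
check, as a sketch only (repr_keeps is a made-up helper name, and the two
sample chars are my choice: U+1F600 is an assigned emoji, U+10FFFF is a
permanent noncharacter):

```python
def repr_keeps(c):
    """Return True if repr() leaves the single character c unescaped.

    Hypothetical helper; not meaningful for quote or backslash characters,
    which repr treats specially.
    """
    return repr(c) == "'" + c + "'"


# Compare what isprintable() says with what repr() actually does.
for c in ('\U0001f600', '\U0010ffff'):
    print('U+%06X' % ord(c),
          'printable:', c.isprintable(),
          'kept by repr:', repr_keeps(c))
```

The exact answer for a given char depends on the Unicode database version
bundled with the Python in use.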
Even if repr always worked as it does for '\U0001a000', there would
still be the problem of the added quotes. I therefore propose the
addition of a new str method, such as 's.bmp()', that returns s with all
non-BMP chars, and only such chars, expanded. Since strings (in
CPython) are internally marked by 'kind', the method would just return s
when no expansion is needed. I presume it could otherwise re-use the
expansion code already in repr.
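As a rough illustration of the intended behavior, here is a pure-Python
sketch. The function name bmp() and the use of the \UXXXXXXXX escape form
(the one repr produces) are assumptions for illustration; a real method
would presumably be written in C and check the internal 'kind' instead of
scanning the string.

```python
def bmp(s):
    """Return s with each non-BMP character replaced by a \\UXXXXXXXX escape.

    Hypothetical stand-in for the proposed str.bmp() method.
    """
    # Fast path: no character above U+FFFF, so return s itself unchanged.
    if not s or max(s) <= '\uffff':
        return s
    # Otherwise expand only the characters outside the BMP.
    return ''.join(
        c if c <= '\uffff' else '\\U%08x' % ord(c)
        for c in s
    )
```

With this sketch, bmp('a\xaa\ua000\U0001a000') keeps the first three
chars as they are and expands only the last one, with no added quotes.
Note that existing backslashes are not doubled, so the result is meant for
display, not for round-tripping.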
Aside from tkinter programmers in general, this issue bites Idle in at
least two ways. Internally, filenames can contain non-BMP chars and
Idle displays them in 3 places.
See http://bugs.python.org/issue23672.
Externally, Idle users sometimes want to print strings with non-BMP
chars. I believe the automatic use of .bmp() with console prints could
be user selectable. There have been issues about this on both our
tracker and StackOverflow.
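One way the user-selectable behavior might look, sketched as a wrapper
around an output stream. The class name BmpWriter and the expand flag are
invented for this example and are not part of Idle; the escaping is done
inline because str.bmp() does not exist yet.

```python
import io


class BmpWriter:
    """Hypothetical stream wrapper that escapes non-BMP chars on write."""

    def __init__(self, stream, expand=True):
        self.stream = stream
        self.expand = expand  # the user-selectable switch

    def write(self, s):
        if self.expand:
            # Expand only characters above U+FFFF; leave the rest alone.
            s = ''.join(
                c if c <= '\uffff' else '\\U%08x' % ord(c) for c in s
            )
        return self.stream.write(s)

    def flush(self):
        self.stream.flush()


# Usage: print() accepts any object with a write() method as file=.
buf = io.StringIO()
print('x\U0001f600', file=BmpWriter(buf))
```

Here buf ends up holding the text with the non-BMP char expanded, so
writing it to a Tk widget would not raise the error shown earlier.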
I believe that the use of non-BMP chars is becoming more common and can
no longer be simply dismissed as too rare to worry about. Telling
Windows users that they are better off with Idle than if they ran Python
directly in the Windows console does not solve the inability to print an
arbitrary Python string. This proposal would.
--
Terry Jan Reedy