[Python-3000] How will unicode get used?

Mon Sep 25 06:34:12 CEST 2006

gabor <gabor at nekomancer.net> wrote:
> Martin v. Löwis wrote:
> > Gábor Farkas schrieb:
[snip]
> > Python is not aiming at 100% portability at all costs. Many aspects
> > are platform dependent, and while this has complicated some
> > applications, it has simplified others (which could make use of
> > platform details that otherwise would not have been exposed to the
> > Python programmer).
> 
> hmmm.. i thought that all those 'platform dependent' aspects are in the 
> libraries (win32/sys/posix/os/whatever), and not in the "core" part.
> 
> so, are there any in the "core" (stupid naming i know. i mean 
> not-in-libraries) part?

import sys
sys.setrecursionlimit(10000)

def foo():
    foo()    # recurses until the interpreter gives up

foo()

Run that in Windows, and you get a MemoryError.  Run it in Linux, and
you get a segfault.  Blame linux malloc.

> >> should he write his own slicing/whatever functions to get consistent 
> >> behaviour on linux/windows?
> > 
> > Depends on the application, and the specific slicing operations.
> > If the slicing appears in the processing of .ini files (say),
> > no platform-dependent slicing should be necessary.
[snip]
> let's say in an application i only want to display the first 70 
> characters of a string.
> 
> now, for this to behave correctly on non-bmp characters, i will need to 
> write a custom function, correct?

That depends on what you mean by "now," and on the Python compile option.
If you mean "today ... i would need to write a custom function", then
you would be correct on a utf-16 compiled Python for all characters with
a code point > 65535, but not on a ucs-4 build (though perhaps on both
when there are surrogate pairs). In the future, the plan, I believe, is
to attempt to make utf-16 behave like ucs-4 with regard to all
operations available from Python, at least for all characters
represented with a single code point.
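
For illustration, here is a rough sketch of what such a custom function
might look like. It assumes a current Python 3, where str is indexed by
code point, so the utf-16 code units of a narrow build are modelled
explicitly as a list of ints. The names utf16_units, truncate_units and
units_to_text are made up for this example; they are not part of any
Python API.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def units_to_text(units):
    """Inverse of utf16_units: rebuild a str from UTF-16 code units."""
    data = struct.pack("<%dH" % len(units), *units)
    return data.decode("utf-16-le")

def truncate_units(units, limit):
    """Keep at most `limit` code units without splitting a surrogate pair."""
    cut = min(limit, len(units))
    # If the cut would land between a high surrogate and its low
    # surrogate, back off by one unit so the pair is dropped whole.
    if (0 < cut < len(units)
            and 0xD800 <= units[cut - 1] <= 0xDBFF
            and 0xDC00 <= units[cut] <= 0xDFFF):
        cut -= 1
    return units[:cut]

sample = "ab\U00010400cd"        # U+10400 lies outside the BMP
units = utf16_units(sample)
print(len(sample), len(units))   # code points vs. utf-16 code units
print(units_to_text(truncate_units(units, 3)))
```

Here the sample has 5 code points but 6 code units, and cutting at 3
units yields "ab" rather than "ab" followed by half of a surrogate pair.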

> >> but the same way i could say, that because most of the unix-world is 
> >> utf-8, for those pythons the best way is to handle it internally as 
> >> utf-8, couldn't i?
> > 
> > I think you live in a free country: you can certainly say that.
> > I think you would be wrong. The common on-disk/on-wire representation
> > of text should not influence the design of an in-memory representation.
> 
> sorry, i should have clarified this more.
> 
> i simply reacted to the situation that for example cpython-win32 and 
> IronPython use 16bit unicode-strings, which makes it easy for them to 
> communicate with the (afaik) mostly 16bit-unicode win32 API.
> 
> on the other hand, for example GTK uses utf8-encoded strings...so when 
> on linux the python-GTK bindings want to transfer strings, they will 
> have to do charset-conversion.
> 
> but this was only an example.

The current CPython implementation keeps two representations of unicode
strings in memory: the utf-16 or ucs-4 representation (depending on
compile-time options) and a default system encoding representation.  If
you set your default system encoding to be utf-8, Python doesn't need to
do anything more to hand unicode strings off to GTK, aside from
recognizing that it has what it wants already.

[snip]
> hmmm.. for me having to worry about string-handling differences in the 
> programming language i use qualifies as 'harder'.

With what Martin and Frederik have been saying recently, I don't believe
that you have anything significant to worry about when it comes to
string behavior on CPython vs. IronPython, Jython, or even PyPy.

> > He said it will be implementation-dependent, referring to Jython
> > and IronPython.
> > Whether or not CPython uses a consistent representation
> > or consistent python-level experience across platforms is a different
> > issue. CPython could behave absolutely consistently, and use four-byte
> > Unicode on all systems, and the length of a non-BMP string would
> > still be implementation-defined.
> 
> i understand that difference.
> 
> (i just find it hard to believe, that string-handling does not seem 
> important enough to make it truly cross-platform (or cross-implementation))

It is important, arguably one of the most important pieces.  But there
are three parts: 1) code points not currently defined within the unicode
spec, but which have specific encodings (based on the code point value),
2) in the case of UTF-16 representations, Python's handling of
characters > 65535, and 3) surrogates.

I believe #1 is handled "correctly" today, Martin sounds like he wants
#2 fixed for Py3k (I don't believe anyone *doesn't* want it fixed), and
#3 could be fixed while fixing #2 with a little more work (if desired).
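
For reference, the arithmetic behind #2 and #3 is small. The sketch
below shows how UTF-16 maps a code point above 65535 onto a pair of
surrogates and back. The function names to_surrogates and
from_surrogates are hypothetical, chosen only for this example.

```python
def to_surrogates(code_point):
    """Split a code point above 0xFFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20 significant bits
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x1D11E)         # MUSICAL SYMBOL G CLEF
print(hex(high), hex(low))                 # 0xd834 0xdd1e
print(hex(from_surrogates(high, low)))     # 0x1d11e
```

A lone value in the range 0xD800-0xDFFF that is not part of such a pair
is the surrogate case in #3.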

 - Josiah