
On Friday 26 August 2011 at 02:01:42, Dino Viehland wrote:
The biggest difficulty for IronPython here would be dealing with .NET interop. We can certainly introduce either an IronPython-specific string class which is similar to CPython's PyUnicodeObject, or we could have multiple distinct .NET types (IronPython.Runtime.AsciiString, System.String, and IronPython.Runtime.Ucs4String) which all appear as the same type to Python.
But when Python is calling a .NET API it's always going to return a System.String, which is UTF-16. If we had to check and convert all of those strings when they cross into Python it would be very bad for performance. Presumably we could have a 4th type of "interop" string which lazily computes this, but if we start wrapping .NET strings we could also get into object identity issues.
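To make the "check and convert" cost concrete, here is a minimal sketch, not IronPython code. It scans UTF-16 code units (what a System.String holds) and picks the narrowest storage width, using the 1, 2 or 4 bytes per character that PEP 393 defines. The names `utf16_units` and `narrowest_width` are hypothetical, and the input is assumed to be well-formed UTF-16 (no lone surrogates).

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def narrowest_width(units):
    """Return the bytes per character (1, 2 or 4) needed to store the text.

    Assumes well-formed UTF-16: a high surrogate always starts a pair,
    which means the text contains a non-BMP character.
    """
    width = 1
    for unit in units:
        if 0xD800 <= unit <= 0xDBFF:
            return 4          # high surrogate: non-BMP character present
        if unit > 0xFF:
            width = 2         # outside Latin-1, still in the BMP
    return width


for sample in ("abc", "\u20ac", "\U0001F600"):
    print(repr(sample), narrowest_width(utf16_units(sample)))
```

The scan is linear in the length of the string, which is the per-call cost being discussed: every string returned by a .NET API would have to be walked once before its representation could be chosen.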
Python 3 encodes every Unicode string to the OS encoding (and decodes the result) for all syscalls and library calls: to the locale encoding on UNIX, to UTF-16 on Windows. Currently, on Windows, Py_UNICODE is wchar_t, which is 16 bits, so a Py_UNICODE* is already a UTF-16 string. I don't know whether the overhead that PEP 393 adds to these calls (encoding to UTF-16 on Windows) is significant. On UNIX, however, pure ASCII strings no longer have to be encoded if the locale encoding is UTF-8 or ASCII. IronPython can wait to see how CPython with PEP 393 handles these problems, and how much slower it is.
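A small illustration of the ASCII point, as a sketch only: the ASCII-encoded bytes below stand in for a one-byte-per-character internal buffer, and `needs_reencoding` is a hypothetical helper, not a CPython function. It only accepts pure ASCII text (anything else raises UnicodeEncodeError).

```python
def needs_reencoding(text, os_encoding):
    """Return True if the stored ASCII bytes differ from the OS encoding.

    text must be pure ASCII; the ASCII bytes model the internal buffer.
    """
    stored = text.encode("ascii")
    return stored != text.encode(os_encoding)


name = "hello.txt"
print(needs_reencoding(name, "utf-8"))      # UNIX with a UTF-8 locale
print(needs_reencoding(name, "ascii"))      # UNIX with an ASCII locale
print(needs_reencoding(name, "utf-16-le"))  # Windows wide-character APIs
```

For the two UNIX cases the bytes are identical, so the buffer could be passed through unchanged; for UTF-16 a new, wider buffer has to be built.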
But it's a huge change - it'll almost certainly touch every single source file in IronPython.
With PEP 393, it's transparent: PyUnicode_AS_UNICODE encodes the string to UTF-16 (allocating memory, etc.). The only difference is that applications should now check whether an error occurred (a NULL return value).
I would think we'd get 3.2 done first and then think about what to do here.
I don't think that IronPython needs to support non-BMP characters without using surrogates. Bug reports about non-BMP characters usually don't come with use cases; they just want to make Python perfect. There is no need to hurry. PEP 393 tries to reduce the memory footprint. The effect on non-BMP characters is just a *nice* side effect. Or was the PEP designed to solve narrow build issues?

Victor