
Ronald Oussoren wrote:
On 27 Jun, 2010, at 11:48, Greg Ewing wrote:
Stefan Behnel wrote:
Greg Ewing, 26.06.2010 09:58:
Would there be any sanity in having an option to compile Python with UTF-8 as the internal string representation?
It would break Py_UNICODE, because the internal size of a unicode character would no longer be fixed.
It's not fixed anyway with the 2-char build -- some characters are represented using a pair of surrogates.
For practical purposes it is not even fixed in 4-char builds. In 4-char builds every Unicode code point corresponds to one item in a Python unicode string, but a base character followed by combining characters is still a sequence of characters and should IMHO almost always be treated as a single object. As an example, given s="be\N{COMBINING DIAERESIS}", s[:2] or s[2:] is almost certainly semantically invalid.
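The slicing example above can be tried directly. This is a minimal sketch, assuming Python 3.3 or later, where indexing a str always works on code points; the names head, tail and composed are only illustrative and do not come from the thread.

```python
import unicodedata

# "b", "e", then a combining mark that attaches to the "e".
s = "be\N{COMBINING DIAERESIS}"
print(len(s))  # 3 code points, although a reader sees two characters

# Slicing between the base character and its combining mark
# splits one user-perceived character into two pieces.
head, tail = s[:2], s[2:]
print(head == "be")  # True: the diaeresis has been silently dropped

# A non-zero canonical combining class marks `tail` as a combining
# character, which has nothing left to attach to.
print(unicodedata.combining(tail) != 0)  # True

# NFC normalisation composes the pair into a single code point here,
# but not every base + combining sequence has a precomposed form.
composed = unicodedata.normalize("NFC", s)
print(len(composed))  # 2
```

Normalising before slicing helps in this particular case only; a general solution would need to slice on grapheme cluster boundaries, which the standard library does not provide.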
Just to clarify: Python uses code units for Unicode storage. Whether those code units map to code points or glyphs depends on the Python build in use and on the code points in question. See http://www.egenix.com/library/presentations/#PythonAndUnicode for more background information (esp. page 8).

Note that using UTF-8 as the internal storage format would not work in Python, since Python is a Unicode producer, i.e. it needs to be able to generate and work with code points that are not allowed in UTF-8, e.g. lone surrogates.

Another reason not to use UTF-8 encoded code units is that slicing based on code units could easily create invalid UTF-8, which would then render the data unusable. This is a lot less likely to happen with UCS2 or UCS4.

And finally: RAM is cheap and today's CPUs work better with 16- or 32-bit values than with 8-bit characters.

-- 
Marc-Andre Lemburg
eGenix.com (Jul 07 2010)
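Two of the points in the message above, lone surrogates and slicing on UTF-8 code units, can be checked with a few lines. This is a rough sketch assuming Python 3 and its default strict UTF-8 codec; the variable names are illustrative only.

```python
# A lone surrogate is a legal item in a Python str, but the strict
# UTF-8 codec refuses to encode it.
lone = "\ud800"
try:
    lone.encode("utf-8")
    lone_ok = True
except UnicodeEncodeError:
    lone_ok = False

# U+00E9 takes two UTF-8 code units (bytes). Cutting between them
# leaves a fragment that is no longer valid UTF-8.
data = "\u00e9".encode("utf-8")
chunk = data[:1]
try:
    chunk.decode("utf-8")
    slice_ok = True
except UnicodeDecodeError:
    slice_ok = False

print(lone_ok, slice_ok)  # False False
```

Python 3 does offer the "surrogatepass" error handler for round-tripping lone surrogates through a UTF-8-like encoding, but the resulting bytes are not valid UTF-8 in the strict sense.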