[Python-Dev] thoughts on the bytes/string discussion
M.-A. Lemburg
mal at egenix.com
Wed Jul 7 11:13:09 CEST 2010
Ronald Oussoren wrote:
>
> On 27 Jun, 2010, at 11:48, Greg Ewing wrote:
>
>> Stefan Behnel wrote:
>>> Greg Ewing, 26.06.2010 09:58:
>>>> Would there be any sanity in having an option to compile
>>>> Python with UTF-8 as the internal string representation?
>>> It would break Py_UNICODE, because the internal size of a unicode character would no longer be fixed.
>>
>> It's not fixed anyway with the 2-char build -- some
>> characters are represented using a pair of surrogates.
>
> It is, for practical purposes, not even fixed in 4-char builds. In 4-char builds every Unicode code point corresponds to one item in a Python unicode string, but a base character with combining characters is still a sequence of characters and should IMHO almost always be treated as a single object. As an example, given s="be\N{COMBINING DIAERESIS}", s[:2] or s[2:] is almost certainly semantically invalid.
Just to clarify: Python uses code units for Unicode storage.
Whether those code units map to code points or glyphs depends
on the Python build in use and the code points in question.
See
http://www.egenix.com/library/presentations/#PythonAndUnicode
for more background information (esp. page 8).
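To make the code unit / code point / glyph distinction concrete, here
is a minimal sketch. It assumes Python 3.3 or later, where str is
indexed by code point regardless of build, so the code unit counts are
obtained by encoding rather than from the internal storage. The
variable names are only illustrative.

```python
import unicodedata

# One code point outside the Basic Multilingual Plane.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF

# Number of code units needed for that single code point.
utf16_units = len(clef.encode("utf-16-le")) // 2  # a surrogate pair
utf32_units = len(clef.encode("utf-32-le")) // 4
utf8_units = len(clef.encode("utf-8"))

# One user-perceived character built from two code points.
s = "be\N{COMBINING DIAERESIS}"
tail = s[2:]  # only the combining mark, detached from its base letter
is_detached_mark = unicodedata.combining(tail) != 0

print(utf16_units, utf32_units, utf8_units, len(s), is_detached_mark)
```

The last two values illustrate Ronald's point: even with one item per
code point, a slice can separate a combining mark from its base.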
Note that using UTF-8 as internal storage format would not work
in Python, since Python is a Unicode producer, i.e. it needs to
be able to generate and work with code points that are not allowed
in UTF-8, e.g. lone surrogates.
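A small sketch of the lone surrogate problem, assuming Python 3 and
its standard codecs: the string object can hold the code point, but
the strict UTF-8 codec refuses to encode it, and the bytes produced
with the "surrogatepass" error handler are rejected by a strict
decoder.

```python
lone = "\ud800"  # a high surrogate with no matching low surrogate

# Strict UTF-8 has no representation for a lone surrogate.
try:
    lone.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# "surrogatepass" forces the bytes out anyway ...
raw = lone.encode("utf-8", "surrogatepass")

# ... but the result is not valid UTF-8 for a strict decoder.
try:
    raw.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

print(strict_encode_ok, raw, strict_decode_ok)
```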
Another reason not to use UTF-8 encoded code units is that slicing
based on code units could easily create invalid UTF-8 which would
then render the data unusable. This is a lot less likely to happen
with UCS2 or UCS4.
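The slicing problem can be sketched as follows, assuming Python 3 and
using encoded bytes to stand in for the internal code units. Cutting a
UTF-8 buffer at an arbitrary code unit can split a multi-byte
sequence, while the same cut on 32-bit units always lands on a code
point boundary.

```python
text = "na\u00efve"  # the third character needs two bytes in UTF-8

# Slice after three 8-bit code units: this cuts U+00EF in half.
utf8_head = text.encode("utf-8")[:3]
try:
    utf8_head.decode("utf-8")
    utf8_slice_valid = True
except UnicodeDecodeError:
    utf8_slice_valid = False

# Slice after three 32-bit code units: still well-formed.
utf32_head = text.encode("utf-32-le")[: 3 * 4]
utf32_text = utf32_head.decode("utf-32-le")

print(utf8_slice_valid, utf32_text)
```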
And finally: RAM is cheap and today's CPUs work better with 16- or
32-bit values than 8-bit characters.
--
Marc-Andre Lemburg
eGenix.com