[Python-Dev] Bug in PyLocale_strcoll
M.-A. Lemburg
mal at egenix.com
Mon Nov 22 14:03:23 CET 2004
Andreas Degert wrote:
> "M.-A. Lemburg" <mal at egenix.com> writes:
>
>
>>You're right: they are always 0-terminated just like 8-bit strings,
>>even though that doesn't seem to be necessary, since Python
>>functions always use the size field when working on
>>a Unicode object rather than relying on the 0-termination.
>
>
> OK, should be documented in the code
It is, but I wasn't sure whether it is really such a good
idea to waste the extra memory and wanted to keep the option
of removing the 0-termination.
>>>Ok... I'm still not sure if I should file a bug for PyLocale_strcoll
>>>or PyUnicode_AsWideChar and if the patch for the latter should assume
>>>that the unicode string buffer is 0-terminated...
>>
>>I think it's probably wise to fix both:
>>
>>Looking again, the patch we applied to PyUnicode_AsWideChar()
>>only fixes the 0-termination problem in the case where
>>HAVE_USABLE_WCHAR_T is set. This should be extended to
>>the memcpy() as well.
>
>
> What I read from the code is that now in both cases the string is
> copied without 0 and that is consistent with the size the buffer is
> checked for (PyUnicode_GET_SIZE gives the value of the length field
> and that doesn't include the 0-termination)
>
>
>>Still, if the buffer passed to PyUnicode_AsWideChar()
>>is not big enough, you won't get the 0-termination (due
>>to truncation), so PyLocale_strcoll() must be either very
>>careful to allocate a buffer that is always big enough
>>or apply 0-termination itself.
>
>
> PyLocale_strcoll() is quite careful, but even so it didn't get what
> it expected ;-). This bug is masked by the bug you referred to when
> the copy loop is used (i.e. if wchar sizes don't match) and the output
> buffer is big enough (as in the strcoll case, because the
> buffer size already accounts for the 0-termination).
>
> I appended an (untested) patch for unicodeobject.c.
I've just checked in a patch which should correct the
problem.
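
A minimal sketch of the caller-side approach discussed above, where the
caller allocates room for the terminator and applies the 0-termination
itself instead of relying on the copy routine. It uses plain wchar_t
buffers so that it compiles without the Python headers. copy_wide() and
collate_wide() are hypothetical names for illustration, not functions in
Python's source, and copy_wide() only imitates the behaviour described in
this thread (a terminator is written only if the buffer has room left):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical stand-in for a PyUnicode_AsWideChar()-style copy.
   Copies at most dst_size characters from src (src_len characters, not
   counting any terminator) and returns the number copied. A terminating
   0 is written only if there is room left after the copied characters. */
size_t copy_wide(const wchar_t *src, size_t src_len,
                 wchar_t *dst, size_t dst_size)
{
    size_t n = src_len < dst_size ? src_len : dst_size;
    assert(src != NULL && dst != NULL);
    wmemcpy(dst, src, n);
    if (n < dst_size)
        dst[n] = L'\0';
    return n;
}

/* Hypothetical caller: allocates len + 1 characters per string and
   terminates each buffer explicitly before calling wcscoll().
   Returns 0 on success and -1 if an allocation fails. */
int collate_wide(const wchar_t *a, size_t a_len,
                 const wchar_t *b, size_t b_len, int *result)
{
    wchar_t *wa = malloc((a_len + 1) * sizeof(wchar_t));
    wchar_t *wb = malloc((b_len + 1) * sizeof(wchar_t));
    int status = -1;

    if (wa != NULL && wb != NULL) {
        size_t na = copy_wide(a, a_len, wa, a_len + 1);
        size_t nb = copy_wide(b, b_len, wb, b_len + 1);
        /* na <= a_len and nb <= b_len, so these writes stay in bounds. */
        wa[na] = L'\0';
        wb[nb] = L'\0';
        *result = wcscoll(wa, wb);
        status = 0;
    }
    free(wa);   /* free(NULL) is a no-op */
    free(wb);
    return status;
}
```

The explicit wa[na] = 0 is redundant whenever the copy routine already
terminated the buffer, but it keeps the caller correct regardless of how
the copy routine treats a buffer that is exactly full.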
> The documentation should be clarified too. Would a patch against
> concrete.tex be accepted where I change
>
> - 'Unicode object' to 'Unicode string' when only the string part of
> the Python object is referenced,
Not sure what you mean here.
> - 'size of the object' to 'length of the string'
Ditto.
> - mention the 0-termination of the return-value of
> PyUnicode_AS_UNICODE()
>
> - mention the 0-termination of the return-value of
> PyUnicode_AsWideChar
I don't think we should document this. Programmers should always
use the size of the object rather than rely on the 0-termination.
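
A small sketch of why the size is the safer choice: a buffer may contain
an embedded 0 character (Python Unicode strings allow this), in which case
scanning for the terminator stops too early. count_char() is a
hypothetical helper written for this illustration, and the buffer is a
plain wchar_t array rather than a real Unicode object:

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: counts occurrences of ch in the first len
   characters of buf. It relies on the explicit length only and never
   looks for a terminating 0. */
size_t count_char(const wchar_t *buf, size_t len, wchar_t ch)
{
    size_t i, count = 0;
    for (i = 0; i < len; i++) {
        if (buf[i] == ch)
            count++;
    }
    return count;
}
```

For a three-character buffer holding 'a', 0, 'a' (followed by a
terminator), wcslen() sees only the first character, while
count_char(buf, 3, L'a') sees both occurrences of 'a'.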
> - '... represents a 16-bit...' to something that explains 16 vs. 32
> bits, depending on the internal representation (UCS-2 or UCS-4)
> selected at compile time
+1
--
Marc-Andre Lemburg
eGenix.com