[Python-Dev] Bug in PyLocale_strcoll

Mon Nov 22 09:17:44 CET 2004

Andreas Degert wrote:
> "M.-A. Lemburg" <mal at egenix.com> writes:
> 
> 
>>Aahz wrote:
>>
>>>On Sat, Nov 20, 2004, Andreas Degert wrote:
>>>
>>>
>>>>I think I found a bug in PyLocale_strcoll() (Python 2.3.4). When used
>>>>with 2 unicode strings, it converts them to wchar strings and uses
>>>>wcscoll. The bug is that the wchar strings are not 0-terminated.
>>>
>>>If you're sure this is a bug, please file on SF and report back the
>>>ID.
>>>(If you're not sure, wait until you get confirmation from one of the
>>>Unicode experts and then file the bug. ;-)
>>
>>Please also check that the bug is still present in Python 2.4 and/or
>>CVS. We've corrected a bug in the PyUnicode_*WideChar*() APIs just
>>recently for Python 2.4.
> 
> 
> The off-by-one error fix in unicodeobject.c (2.228 -> 2.229) corrects
> a buffer overflow in the same piece of code.
> 
> I didn't find a clear statement on whether the unicode string should
> be 0-terminated or not.

You're right: they are always 0-terminated, just like 8-bit strings,
even though this doesn't seem to be necessary, since Python
functions always use the size field when working on
a Unicode object rather than relying on the 0-termination.

> In _PyUnicode_New it's 0-terminated, even in the
> case when it had to call unicode_resize (though there is a comment in
> unicode_resize "Ux0000 terminated -- XXX is this needed ?"). If these
> are the only places where unicode objects are created or modified, they
> seem to be always 0-terminated.

Right.

> wchar strings must be 0-terminated if they are to be used with the
> wcs* functions. So it's not a good idea to return a non-terminated
> string from PyUnicode_AsWideChar. If the unicode strings are always
> 0-terminated (the unicode buffer size is length+1), then we could just
> change
> 
>     if (size > PyUnicode_GET_SIZE(unicode))
> 	size = PyUnicode_GET_SIZE(unicode);
> 
> to 
> 
>     if (size > PyUnicode_GET_SIZE(unicode)+1)
> 	size = PyUnicode_GET_SIZE(unicode)+1;
> 
> in PyUnicode_AsWideChar to get 0-terminated wchars.
> 
> Ok... I'm still not sure if I should file a bug for PyLocale_strcoll
> or PyUnicode_AsWideChar and if the patch for the latter should assume
> that the unicode string buffer is 0-terminated...

I think it's probably wise to fix both:

Looking again, the patch we applied to PyUnicode_AsWideChar()
only fixes the 0-termination problem in the case where
HAVE_USABLE_WCHAR_T is set. This should be extended to
the memcpy() as well.
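
To illustrate what extending the fix to the memcpy() branch could
look like, here is a minimal standalone sketch. It is not the actual
unicodeobject.c code: copy_wide() is a hypothetical helper, and it
assumes the source buffer holds len characters followed by a 0
(len + 1 elements), as discussed above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in for the memcpy() branch. src must hold len
   characters followed by a 0 terminator (len + 1 elements in total).
   Copies at most size elements into dst and returns the number of
   characters copied, not counting the terminator. */
static size_t copy_wide(wchar_t *dst, size_t size,
                        const wchar_t *src, size_t len)
{
    size_t n = len + 1;      /* include the trailing 0 when it fits */

    assert(dst != NULL && src != NULL);
    if (size < n)
        n = size;            /* truncated: no terminator is copied */
    memcpy(dst, src, n * sizeof(wchar_t));
    return n > len ? len : n;
}
```

With an 8-element buffer and a 3-character source, the result is
terminated; with a 2-element buffer it is not, which is the
truncation case.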

Still, if the buffer passed to PyUnicode_AsWideChar()
is not big enough, you won't get the 0-termination (due
to truncation), so PyLocale_strcoll() must be either very
careful to allocate a buffer that is always big enough
or apply 0-termination itself.
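
A rough sketch of the caller-side approach, again standalone rather
than the real _localemodule.c code: to_wide() is a hypothetical
stand-in for a conversion that never writes a terminator, and
collate_counted() is an assumed name for a caller that allocates
length + 1 elements and terminates the buffers itself before calling
wcscoll().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical conversion: copies at most size characters and never
   writes a 0 terminator. Returns the number of characters copied. */
static size_t to_wide(wchar_t *dst, size_t size,
                      const wchar_t *src, size_t len)
{
    size_t n = len < size ? len : size;
    memcpy(dst, src, n * sizeof(wchar_t));
    return n;
}

/* Compare two counted (not necessarily terminated) strings with
   wcscoll(). Returns 0 and stores the comparison in *result on
   success, or -1 if memory allocation fails. */
static int collate_counted(const wchar_t *a, size_t alen,
                           const wchar_t *b, size_t blen, int *result)
{
    /* One extra element per buffer leaves room for the terminator. */
    wchar_t *wa = malloc((alen + 1) * sizeof(wchar_t));
    wchar_t *wb = malloc((blen + 1) * sizeof(wchar_t));
    int status = -1;

    assert(result != NULL);
    if (wa != NULL && wb != NULL) {
        size_t na = to_wide(wa, alen, a, alen);
        size_t nb = to_wide(wb, blen, b, blen);
        wa[na] = L'\0';      /* terminate explicitly; do not rely */
        wb[nb] = L'\0';      /* on the conversion to do it        */
        *result = wcscoll(wa, wb);
        status = 0;
    }
    free(wa);
    free(wb);
    return status;
}
```

Terminating in the caller costs one extra element per buffer and
works whether or not the conversion function copies the trailing 0.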

-- 
Marc-Andre Lemburg
eGenix.com
