[Python-Dev] Bug in PyLocale_strcoll

Sun Nov 21 23:22:02 CET 2004

"M.-A. Lemburg" <mal at egenix.com> writes:

> Aahz wrote:
>> On Sat, Nov 20, 2004, Andreas Degert wrote:
>>
>>>I think I found a bug in PyLocale_strcoll() (Python 2.3.4). When used
>>>with 2 unicode strings, it converts them to wchar strings and uses
>>>wcscoll. The bug is that the wchar strings are not 0-terminated.
>> If you're sure this is a bug, please file on SF and report back the
>> ID.
>> (If you're not sure, wait until you get confirmation from one of the
>> Unicode experts and then file the bug. ;-)
>
> Please also check that the bug is still present in Python 2.4 and/or
> CVS. We've corrected a bug in the PyUnicode_*WideChar*() APIs just
> recently for Python 2.4.

The off-by-one fix in unicodeobject.c (2.228 -> 2.229) corrects a
buffer overflow; it just happens to be in the same piece of code.

I didn't find a clear statement on whether the unicode string should be
0-terminated or not. In _PyUnicode_New it's 0-terminated, even in the
case when it had to call unicode_resize (though there is a comment in
unicode_resize "Ux0000 terminated -- XXX is this needed ?"). If these
are the only places where unicode objects are created or modified, they
seem to be always 0-terminated.

wchar strings must be 0-terminated if they are to be used with the
wcs* functions. So it's not a good idea to return a non-terminated
string from PyUnicode_AsWideChar. If the unicode strings are always
0-terminated (the unicode buffer size is length+1), then we could just
change

    if (size > PyUnicode_GET_SIZE(unicode))
        size = PyUnicode_GET_SIZE(unicode);

to 

    if (size > PyUnicode_GET_SIZE(unicode)+1)
        size = PyUnicode_GET_SIZE(unicode)+1;

in PyUnicode_AsWideChar to get 0-terminated wchars.

Ok... I'm still not sure whether I should file a bug for PyLocale_strcoll
or for PyUnicode_AsWideChar, and whether the patch for the latter should
assume that the unicode string buffer is 0-terminated...

cheers
Andreas