Bug in PyLocale_strcoll

Hello, I think I found a bug in PyLocale_strcoll() (Python 2.3.4). When used with 2 unicode strings, it converts them to wchar strings and uses wcscoll. The bug is that the wchar strings are not 0-terminated. Checking with assert(ws1[len1-1] == 0 && ws2[len2-1] == 0); right before the line result = PyInt_FromLong(wcscoll(ws1, ws2)); confirms the bug. I'm not quite sure what the best fix is. PyUnicode_AsWideChar() copies the unicode chars, but not the terminating 0-char of the unicode string (which is not used in python, but its there anyhow, if I understand the implementation correctly). So one fix would be to change PyUnicode_AsWideChar to copy the terminating 0-char if there's enough space in the output buffer. Another fix would be to terminate the strings in PyLocale_strcoll() before using them: ---------------------------------------------------------- --- _localemodule.c~ Sat Nov 20 21:33:17 2004 +++ _localemodule.c Sat Nov 20 21:35:04 2004 @@ -353,15 +353,19 @@ PyErr_NoMemory(); goto done; } - if (PyUnicode_AsWideChar((PyUnicodeObject*)os1, ws1, len1) == -1) + len1 = PyUnicode_AsWideChar((PyUnicodeObject*)os1, ws1, len1); + if (len1 == -1) goto done; + ws1[len1-1] = 0; ws2 = PyMem_MALLOC(len2 * sizeof(wchar_t)); if (!ws2) { PyErr_NoMemory(); goto done; } - if (PyUnicode_AsWideChar((PyUnicodeObject*)os2, ws2, len2) == -1) + len2 = PyUnicode_AsWideChar((PyUnicodeObject*)os2, ws2, len2); + if (len2 == -1) goto done; + ws2[len2-1] = 0; /* Collate the strings. */ result = PyInt_FromLong(wcscoll(ws1, ws2)); done: ---------------------------------------------------------- cheers Andreas

sorry to reply to my own post.. corrected patch following --- _localemodule.c-2.49 Sat Nov 20 22:02:11 2004 +++ _localemodule.c Sat Nov 20 22:01:48 2004 @@ -313,6 +313,7 @@ } if (PyUnicode_AsWideChar((PyUnicodeObject*)os1, ws1, len1) == -1) goto done; + ws1[len1-1] = 0; ws2 = PyMem_MALLOC(len2 * sizeof(wchar_t)); if (!ws2) { PyErr_NoMemory(); @@ -320,6 +321,7 @@ } if (PyUnicode_AsWideChar((PyUnicodeObject*)os2, ws2, len2) == -1) goto done; + ws2[len2-1] = 0; /* Collate the strings. */ result = PyInt_FromLong(wcscoll(ws1, ws2)); done:

On Sat, Nov 20, 2004, Andreas Degert wrote:
If you're sure this is a bug, please file on SF and report back the ID. (If you're not sure, what until you get confirmation from one of the Unicode experts and then file the bug. ;-) -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ WiFi is the SCSI of the 21st Century -- there are fundamental technical reasons for sacrificing a goat. (with no apologies to John Woods)

Aahz wrote:
Please also check that the bug is still present in Python 2.4 and/or CVS. We've corrected a bug in the PyUnicode_*WideChar*() APIs just recently for Python 2.4. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Nov 21 2004)
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

"M.-A. Lemburg" <mal@egenix.com> writes:
The off-by-one error fix in unicodeobject.c (2.228 -> 2.229) is correcting a buffer overflow, is just in the same piece of code. I didn't find a clear statement if the unicode string should be 0-terminated or not. In _PyUnicode_New it's 0-terminated, even in the case when it had to call unicode_resize (though there is a comment in unicode_resize "Ux0000 terminated -- XXX is this needed ?"). If these is the only place where unicode objects are created or modified, they seem to be always 0-terminated. wchar strings must be 0-terminated if they are to be used with the wcs* functions. So it's not a good idea to return a non-terminated string from PyUnicode_AsWideChar. If the unicode strings are always 0-terminated (the unicode buffer size is length+1), then we could just change if (size > PyUnicode_GET_SIZE(unicode)) size = PyUnicode_GET_SIZE(unicode); to if (size > PyUnicode_GET_SIZE(unicode)+1) size = PyUnicode_GET_SIZE(unicode)+1; in PyUnicode_AsWideChar to get 0-terminated wchars. Ok... I'm still not sure if I should file a bug for PyLocale_strcoll or PyUnicode_AsWideChar and if the patch for the latter should assume that the unicode string buffer is 0-terminated... cheers Andreas

Andreas Degert wrote:
You're right: they are always 0-terminated just like 8-bit strings and even though it doesn't seem to be necessary since Python functions will always use the size field when working on a Unicode object rather than rely on the 0-termination.
Right.
I think it's probably wise to fix both: Looking again, the patch we applied to PyUnicode_AsWideChar() only fixes the 0-termination problem in the case where HAVE_USABLE_WCHAR_T is set. This should be extended to the memcpy() as well. Still, if the buffer passed to PyUnicode_AsWideChar() is not big enough, you won't get the 0-termination (due to truncation), so PyLocale_strcoll() must be either very careful to allocate a buffer that is always big enough or apply 0-termination itself. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Nov 22 2004)
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

"M.-A. Lemburg" <mal@egenix.com> writes:
OK, should be documented in the code
What I read from the code is that now in both cases the string is copied without 0 and that is consistent with the size the buffer is checked for (PyUnicode_GET_SIZE gives the value of the length field and that doesn't include the 0-termination)
PyLocale_strcoll() acts quite careful but even so it didn't get what it expected ;-). This bug is masked by the bug you referred to when the copy loop is used (ie. if wchar sizes don't match) and the output buffer string is big enough (like in the strcoll case because the buffer size already accounts for the 0-termination). I appended a (untested) patch for unicodeobject.c. The documentation should be clarified too. Would a patch against concrete.tex be accepted where I change - 'Unicode object' to 'Unicode string' when only the string part of the python object is referenced, - 'size of the object' to 'length of the string' - mention the 0-termination of the return-value of PyUnicode_AS_UNICODE() - mention the 0-termination of the return-value of PyUnicode_AsWideChar - '... represents a 16-bit...' to something that explains 16 vs. 32 but depending on internal representation (UCS-2 or UCS-4) selected at compile time --- unicodeobject.c-2.229 Mon Nov 22 10:58:42 2004 +++ unicodeobject.c Mon Nov 22 11:10:07 2004 @@ -144,8 +144,7 @@ return -1; } - /* We allocate one more byte to make sure the string is - Ux0000 terminated -- XXX is this needed ? */ + /* We allocate one more Py_UNICODE and make the string Ux0000 terminated */ oldstr = unicode->str; PyMem_RESIZE(unicode->str, Py_UNICODE, length + 1); if (!unicode->str) { @@ -167,8 +166,7 @@ return 0; } -/* We allocate one more byte to make sure the string is - Ux0000 terminated -- XXX is this needed ? +/* We allocate length+1 and make the string Ux0000 terminated XXX This allocator could further be enhanced by assuring that the free list never reduces its size below 1. @@ -384,8 +382,10 @@ PyErr_BadInternalCall(); return -1; } - if (size > PyUnicode_GET_SIZE(unicode)) - size = PyUnicode_GET_SIZE(unicode); + /* copy all chars including 0-termination if the output buffer + size is sufficient */ + if (size > PyUnicode_GET_SIZE(unicode)+1) + size = PyUnicode_GET_SIZE(unicode)+1; #ifdef HAVE_USABLE_WCHAR_T memcpy(w, unicode->str, size * sizeof(wchar_t)); #else

Andreas Degert wrote:
It is, but I wasn't sure whether it is really such a good idea to waist the extra memory and wanted to keep the option of removing the 0-termination.
I've just checked in a patch which should correct the problem.
Not sure what you mean here.
- 'size of the object' to 'length of the string'
Dito.
I don't think we should document this. Programmers should always use the size of the object rather than rely on the 0-termination.
+1 -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Nov 22 2004)
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

sorry to reply to my own post.. corrected patch following --- _localemodule.c-2.49 Sat Nov 20 22:02:11 2004 +++ _localemodule.c Sat Nov 20 22:01:48 2004 @@ -313,6 +313,7 @@ } if (PyUnicode_AsWideChar((PyUnicodeObject*)os1, ws1, len1) == -1) goto done; + ws1[len1-1] = 0; ws2 = PyMem_MALLOC(len2 * sizeof(wchar_t)); if (!ws2) { PyErr_NoMemory(); @@ -320,6 +321,7 @@ } if (PyUnicode_AsWideChar((PyUnicodeObject*)os2, ws2, len2) == -1) goto done; + ws2[len2-1] = 0; /* Collate the strings. */ result = PyInt_FromLong(wcscoll(ws1, ws2)); done:

On Sat, Nov 20, 2004, Andreas Degert wrote:
If you're sure this is a bug, please file on SF and report back the ID. (If you're not sure, what until you get confirmation from one of the Unicode experts and then file the bug. ;-) -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ WiFi is the SCSI of the 21st Century -- there are fundamental technical reasons for sacrificing a goat. (with no apologies to John Woods)

Aahz wrote:
Please also check that the bug is still present in Python 2.4 and/or CVS. We've corrected a bug in the PyUnicode_*WideChar*() APIs just recently for Python 2.4. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Nov 21 2004)
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

"M.-A. Lemburg" <mal@egenix.com> writes:
The off-by-one error fix in unicodeobject.c (2.228 -> 2.229) is correcting a buffer overflow, is just in the same piece of code. I didn't find a clear statement if the unicode string should be 0-terminated or not. In _PyUnicode_New it's 0-terminated, even in the case when it had to call unicode_resize (though there is a comment in unicode_resize "Ux0000 terminated -- XXX is this needed ?"). If these is the only place where unicode objects are created or modified, they seem to be always 0-terminated. wchar strings must be 0-terminated if they are to be used with the wcs* functions. So it's not a good idea to return a non-terminated string from PyUnicode_AsWideChar. If the unicode strings are always 0-terminated (the unicode buffer size is length+1), then we could just change if (size > PyUnicode_GET_SIZE(unicode)) size = PyUnicode_GET_SIZE(unicode); to if (size > PyUnicode_GET_SIZE(unicode)+1) size = PyUnicode_GET_SIZE(unicode)+1; in PyUnicode_AsWideChar to get 0-terminated wchars. Ok... I'm still not sure if I should file a bug for PyLocale_strcoll or PyUnicode_AsWideChar and if the patch for the latter should assume that the unicode string buffer is 0-terminated... cheers Andreas

Andreas Degert wrote:
You're right: they are always 0-terminated just like 8-bit strings and even though it doesn't seem to be necessary since Python functions will always use the size field when working on a Unicode object rather than rely on the 0-termination.
Right.
I think it's probably wise to fix both: Looking again, the patch we applied to PyUnicode_AsWideChar() only fixes the 0-termination problem in the case where HAVE_USABLE_WCHAR_T is set. This should be extended to the memcpy() as well. Still, if the buffer passed to PyUnicode_AsWideChar() is not big enough, you won't get the 0-termination (due to truncation), so PyLocale_strcoll() must be either very careful to allocate a buffer that is always big enough or apply 0-termination itself. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Nov 22 2004)
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::

"M.-A. Lemburg" <mal@egenix.com> writes:
OK, should be documented in the code
What I read from the code is that now in both cases the string is copied without 0 and that is consistent with the size the buffer is checked for (PyUnicode_GET_SIZE gives the value of the length field and that doesn't include the 0-termination)
PyLocale_strcoll() acts quite careful but even so it didn't get what it expected ;-). This bug is masked by the bug you referred to when the copy loop is used (ie. if wchar sizes don't match) and the output buffer string is big enough (like in the strcoll case because the buffer size already accounts for the 0-termination). I appended a (untested) patch for unicodeobject.c. The documentation should be clarified too. Would a patch against concrete.tex be accepted where I change - 'Unicode object' to 'Unicode string' when only the string part of the python object is referenced, - 'size of the object' to 'length of the string' - mention the 0-termination of the return-value of PyUnicode_AS_UNICODE() - mention the 0-termination of the return-value of PyUnicode_AsWideChar - '... represents a 16-bit...' to something that explains 16 vs. 32 but depending on internal representation (UCS-2 or UCS-4) selected at compile time --- unicodeobject.c-2.229 Mon Nov 22 10:58:42 2004 +++ unicodeobject.c Mon Nov 22 11:10:07 2004 @@ -144,8 +144,7 @@ return -1; } - /* We allocate one more byte to make sure the string is - Ux0000 terminated -- XXX is this needed ? */ + /* We allocate one more Py_UNICODE and make the string Ux0000 terminated */ oldstr = unicode->str; PyMem_RESIZE(unicode->str, Py_UNICODE, length + 1); if (!unicode->str) { @@ -167,8 +166,7 @@ return 0; } -/* We allocate one more byte to make sure the string is - Ux0000 terminated -- XXX is this needed ? +/* We allocate length+1 and make the string Ux0000 terminated XXX This allocator could further be enhanced by assuring that the free list never reduces its size below 1. @@ -384,8 +382,10 @@ PyErr_BadInternalCall(); return -1; } - if (size > PyUnicode_GET_SIZE(unicode)) - size = PyUnicode_GET_SIZE(unicode); + /* copy all chars including 0-termination if the output buffer + size is sufficient */ + if (size > PyUnicode_GET_SIZE(unicode)+1) + size = PyUnicode_GET_SIZE(unicode)+1; #ifdef HAVE_USABLE_WCHAR_T memcpy(w, unicode->str, size * sizeof(wchar_t)); #else

Andreas Degert wrote:
It is, but I wasn't sure whether it is really such a good idea to waist the extra memory and wanted to keep the option of removing the 0-termination.
I've just checked in a patch which should correct the problem.
Not sure what you mean here.
- 'size of the object' to 'length of the string'
Dito.
I don't think we should document this. Programmers should always use the size of the object rather than rely on the 0-termination.
+1 -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Nov 22 2004)
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::
participants (4)
-
Aahz
-
Andreas Degert
-
M.-A. Lemburg
-
Thomas Heller