Bug in PyLocale_strcoll
Hello, I think I found a bug in PyLocale_strcoll() (Python 2.3.4). When used with 2 unicode strings, it converts them to wchar strings and uses wcscoll. The bug is that the wchar strings are not 0-terminated. Checking with assert(ws1[len1-1] == 0 && ws2[len2-1] == 0); right before the line result = PyInt_FromLong(wcscoll(ws1, ws2)); confirms the bug. I'm not quite sure what the best fix is. PyUnicode_AsWideChar() copies the unicode chars, but not the terminating 0-char of the unicode string (which is not used in python, but its there anyhow, if I understand the implementation correctly). So one fix would be to change PyUnicode_AsWideChar to copy the terminating 0-char if there's enough space in the output buffer. Another fix would be to terminate the strings in PyLocale_strcoll() before using them: ---------------------------------------------------------- --- _localemodule.c~ Sat Nov 20 21:33:17 2004 +++ _localemodule.c Sat Nov 20 21:35:04 2004 @@ -353,15 +353,19 @@ PyErr_NoMemory(); goto done; } - if (PyUnicode_AsWideChar((PyUnicodeObject*)os1, ws1, len1) == -1) + len1 = PyUnicode_AsWideChar((PyUnicodeObject*)os1, ws1, len1); + if (len1 == -1) goto done; + ws1[len1-1] = 0; ws2 = PyMem_MALLOC(len2 * sizeof(wchar_t)); if (!ws2) { PyErr_NoMemory(); goto done; } - if (PyUnicode_AsWideChar((PyUnicodeObject*)os2, ws2, len2) == -1) + len2 = PyUnicode_AsWideChar((PyUnicodeObject*)os2, ws2, len2); + if (len2 == -1) goto done; + ws2[len2-1] = 0; /* Collate the strings. */ result = PyInt_FromLong(wcscoll(ws1, ws2)); done: ---------------------------------------------------------- cheers Andreas
sorry to reply to my own post.. corrected patch following --- _localemodule.c-2.49 Sat Nov 20 22:02:11 2004 +++ _localemodule.c Sat Nov 20 22:01:48 2004 @@ -313,6 +313,7 @@ } if (PyUnicode_AsWideChar((PyUnicodeObject*)os1, ws1, len1) == -1) goto done; + ws1[len1-1] = 0; ws2 = PyMem_MALLOC(len2 * sizeof(wchar_t)); if (!ws2) { PyErr_NoMemory(); @@ -320,6 +321,7 @@ } if (PyUnicode_AsWideChar((PyUnicodeObject*)os2, ws2, len2) == -1) goto done; + ws2[len2-1] = 0; /* Collate the strings. */ result = PyInt_FromLong(wcscoll(ws1, ws2)); done:
On Sat, Nov 20, 2004, Andreas Degert wrote:
I think I found a bug in PyLocale_strcoll() (Python 2.3.4). When used with 2 unicode strings, it converts them to wchar strings and uses wcscoll. The bug is that the wchar strings are not 0-terminated.
If you're sure this is a bug, please file on SF and report back the ID. (If you're not sure, what until you get confirmation from one of the Unicode experts and then file the bug. ;-) -- Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/ WiFi is the SCSI of the 21st Century -- there are fundamental technical reasons for sacrificing a goat. (with no apologies to John Woods)
Aahz wrote:
On Sat, Nov 20, 2004, Andreas Degert wrote:
I think I found a bug in PyLocale_strcoll() (Python 2.3.4). When used with 2 unicode strings, it converts them to wchar strings and uses wcscoll. The bug is that the wchar strings are not 0-terminated.
If you're sure this is a bug, please file on SF and report back the ID. (If you're not sure, what until you get confirmation from one of the Unicode experts and then file the bug. ;-)
Please also check that the bug is still present in Python 2.4 and/or CVS. We've corrected a bug in the PyUnicode_*WideChar*() APIs just recently for Python 2.4. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Nov 21 2004)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::
"M.-A. Lemburg"
Aahz wrote:
On Sat, Nov 20, 2004, Andreas Degert wrote:
I think I found a bug in PyLocale_strcoll() (Python 2.3.4). When used with 2 unicode strings, it converts them to wchar strings and uses wcscoll. The bug is that the wchar strings are not 0-terminated. If you're sure this is a bug, please file on SF and report back the ID. (If you're not sure, what until you get confirmation from one of the Unicode experts and then file the bug. ;-)
Please also check that the bug is still present in Python 2.4 and/or CVS. We've corrected a bug in the PyUnicode_*WideChar*() APIs just recently for Python 2.4.
The off-by-one error fix in unicodeobject.c (2.228 -> 2.229) is correcting a buffer overflow, is just in the same piece of code. I didn't find a clear statement if the unicode string should be 0-terminated or not. In _PyUnicode_New it's 0-terminated, even in the case when it had to call unicode_resize (though there is a comment in unicode_resize "Ux0000 terminated -- XXX is this needed ?"). If these is the only place where unicode objects are created or modified, they seem to be always 0-terminated. wchar strings must be 0-terminated if they are to be used with the wcs* functions. So it's not a good idea to return a non-terminated string from PyUnicode_AsWideChar. If the unicode strings are always 0-terminated (the unicode buffer size is length+1), then we could just change if (size > PyUnicode_GET_SIZE(unicode)) size = PyUnicode_GET_SIZE(unicode); to if (size > PyUnicode_GET_SIZE(unicode)+1) size = PyUnicode_GET_SIZE(unicode)+1; in PyUnicode_AsWideChar to get 0-terminated wchars. Ok... I'm still not sure if I should file a bug for PyLocale_strcoll or PyUnicode_AsWideChar and if the patch for the latter should assume that the unicode string buffer is 0-terminated... cheers Andreas
Andreas Degert wrote:
"M.-A. Lemburg"
writes: Aahz wrote:
On Sat, Nov 20, 2004, Andreas Degert wrote:
I think I found a bug in PyLocale_strcoll() (Python 2.3.4). When used with 2 unicode strings, it converts them to wchar strings and uses wcscoll. The bug is that the wchar strings are not 0-terminated.
If you're sure this is a bug, please file on SF and report back the ID. (If you're not sure, what until you get confirmation from one of the Unicode experts and then file the bug. ;-)
Please also check that the bug is still present in Python 2.4 and/or CVS. We've corrected a bug in the PyUnicode_*WideChar*() APIs just recently for Python 2.4.
The off-by-one error fix in unicodeobject.c (2.228 -> 2.229) is correcting a buffer overflow, is just in the same piece of code.
I didn't find a clear statement if the unicode string should be 0-terminated or not.
You're right: they are always 0-terminated just like 8-bit strings and even though it doesn't seem to be necessary since Python functions will always use the size field when working on a Unicode object rather than rely on the 0-termination.
In _PyUnicode_New it's 0-terminated, even in the case when it had to call unicode_resize (though there is a comment in unicode_resize "Ux0000 terminated -- XXX is this needed ?"). If these is the only place where unicode objects are created or modified, they seem to be always 0-terminated.
Right.
wchar strings must be 0-terminated if they are to be used with the wcs* functions. So it's not a good idea to return a non-terminated string from PyUnicode_AsWideChar. If the unicode strings are always 0-terminated (the unicode buffer size is length+1), then we could just change
if (size > PyUnicode_GET_SIZE(unicode)) size = PyUnicode_GET_SIZE(unicode);
to
if (size > PyUnicode_GET_SIZE(unicode)+1) size = PyUnicode_GET_SIZE(unicode)+1;
in PyUnicode_AsWideChar to get 0-terminated wchars.
Ok... I'm still not sure if I should file a bug for PyLocale_strcoll or PyUnicode_AsWideChar and if the patch for the latter should assume that the unicode string buffer is 0-terminated...
I think it's probably wise to fix both: Looking again, the patch we applied to PyUnicode_AsWideChar() only fixes the 0-termination problem in the case where HAVE_USABLE_WCHAR_T is set. This should be extended to the memcpy() as well. Still, if the buffer passed to PyUnicode_AsWideChar() is not big enough, you won't get the 0-termination (due to truncation), so PyLocale_strcoll() must be either very careful to allocate a buffer that is always big enough or apply 0-termination itself. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Nov 22 2004)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::
"M.-A. Lemburg"
You're right: they are always 0-terminated just like 8-bit strings and even though it doesn't seem to be necessary since Python functions will always use the size field when working on a Unicode object rather than rely on the 0-termination.
OK, should be documented in the code
Ok... I'm still not sure if I should file a bug for PyLocale_strcoll or PyUnicode_AsWideChar and if the patch for the latter should assume that the unicode string buffer is 0-terminated...
I think it's probably wise to fix both:
Looking again, the patch we applied to PyUnicode_AsWideChar() only fixes the 0-termination problem in the case where HAVE_USABLE_WCHAR_T is set. This should be extended to the memcpy() as well.
What I read from the code is that now in both cases the string is copied without 0 and that is consistent with the size the buffer is checked for (PyUnicode_GET_SIZE gives the value of the length field and that doesn't include the 0-termination)
Still, if the buffer passed to PyUnicode_AsWideChar() is not big enough, you won't get the 0-termination (due to truncation), so PyLocale_strcoll() must be either very careful to allocate a buffer that is always big enough or apply 0-termination itself.
PyLocale_strcoll() acts quite careful but even so it didn't get what it expected ;-). This bug is masked by the bug you referred to when the copy loop is used (ie. if wchar sizes don't match) and the output buffer string is big enough (like in the strcoll case because the buffer size already accounts for the 0-termination). I appended a (untested) patch for unicodeobject.c. The documentation should be clarified too. Would a patch against concrete.tex be accepted where I change - 'Unicode object' to 'Unicode string' when only the string part of the python object is referenced, - 'size of the object' to 'length of the string' - mention the 0-termination of the return-value of PyUnicode_AS_UNICODE() - mention the 0-termination of the return-value of PyUnicode_AsWideChar - '... represents a 16-bit...' to something that explains 16 vs. 32 but depending on internal representation (UCS-2 or UCS-4) selected at compile time --- unicodeobject.c-2.229 Mon Nov 22 10:58:42 2004 +++ unicodeobject.c Mon Nov 22 11:10:07 2004 @@ -144,8 +144,7 @@ return -1; } - /* We allocate one more byte to make sure the string is - Ux0000 terminated -- XXX is this needed ? */ + /* We allocate one more Py_UNICODE and make the string Ux0000 terminated */ oldstr = unicode->str; PyMem_RESIZE(unicode->str, Py_UNICODE, length + 1); if (!unicode->str) { @@ -167,8 +166,7 @@ return 0; } -/* We allocate one more byte to make sure the string is - Ux0000 terminated -- XXX is this needed ? +/* We allocate length+1 and make the string Ux0000 terminated XXX This allocator could further be enhanced by assuring that the free list never reduces its size below 1. @@ -384,8 +382,10 @@ PyErr_BadInternalCall(); return -1; } - if (size > PyUnicode_GET_SIZE(unicode)) - size = PyUnicode_GET_SIZE(unicode); + /* copy all chars including 0-termination if the output buffer + size is sufficient */ + if (size > PyUnicode_GET_SIZE(unicode)+1) + size = PyUnicode_GET_SIZE(unicode)+1; #ifdef HAVE_USABLE_WCHAR_T memcpy(w, unicode->str, size * sizeof(wchar_t)); #else
Andreas Degert wrote:
"M.-A. Lemburg"
writes: You're right: they are always 0-terminated just like 8-bit strings and even though it doesn't seem to be necessary since Python functions will always use the size field when working on a Unicode object rather than rely on the 0-termination.
OK, should be documented in the code
It is, but I wasn't sure whether it is really such a good idea to waist the extra memory and wanted to keep the option of removing the 0-termination.
Ok... I'm still not sure if I should file a bug for PyLocale_strcoll or PyUnicode_AsWideChar and if the patch for the latter should assume that the unicode string buffer is 0-terminated...
I think it's probably wise to fix both:
Looking again, the patch we applied to PyUnicode_AsWideChar() only fixes the 0-termination problem in the case where HAVE_USABLE_WCHAR_T is set. This should be extended to the memcpy() as well.
What I read from the code is that now in both cases the string is copied without 0 and that is consistent with the size the buffer is checked for (PyUnicode_GET_SIZE gives the value of the length field and that doesn't include the 0-termination)
Still, if the buffer passed to PyUnicode_AsWideChar() is not big enough, you won't get the 0-termination (due to truncation), so PyLocale_strcoll() must be either very careful to allocate a buffer that is always big enough or apply 0-termination itself.
PyLocale_strcoll() acts quite careful but even so it didn't get what it expected ;-). This bug is masked by the bug you referred to when the copy loop is used (ie. if wchar sizes don't match) and the output buffer string is big enough (like in the strcoll case because the buffer size already accounts for the 0-termination).
I appended a (untested) patch for unicodeobject.c.
I've just checked in a patch which should correct the problem.
The documentation should be clarified too. Would a patch against concrete.tex be accepted where I change
- 'Unicode object' to 'Unicode string' when only the string part of the python object is referenced,
Not sure what you mean here.
- 'size of the object' to 'length of the string'
Dito.
- mention the 0-termination of the return-value of PyUnicode_AS_UNICODE()
- mention the 0-termination of the return-value of PyUnicode_AsWideChar
I don't think we should document this. Programmers should always use the size of the object rather than rely on the 0-termination.
- '... represents a 16-bit...' to something that explains 16 vs. 32 but depending on internal representation (UCS-2 or UCS-4) selected at compile time
+1 -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Nov 22 2004)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::
- mention the 0-termination of the return-value of PyUnicode_AS_UNICODE() - mention the 0-termination of the return-value of PyUnicode_AsWideChar
I don't think we should document this. Programmers should always use the size of the object rather than rely on the 0-termination.
Then, *this* should be clearly mentioned somewhere. imo. Thomas
Thomas Heller wrote:
- mention the 0-termination of the return-value of PyUnicode_AS_UNICODE() - mention the 0-termination of the return-value of PyUnicode_AsWideChar
I don't think we should document this. Programmers should always use the size of the object rather than rely on the 0-termination.
Then, *this* should be clearly mentioned somewhere. imo.
FYI, I added this to the PyUnicode_AsWideChar() documentation. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Nov 22 2004)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::
participants (4)
-
Aahz
-
Andreas Degert
-
M.-A. Lemburg
-
Thomas Heller