[ python-Bugs-1532726 ] incorrect behaviour of PyUnicode_EncodeMBCS?
SourceForge.net
noreply at sourceforge.net
Thu Aug 3 06:09:26 CEST 2006
Bugs item #1532726, was opened at 2006-08-01 14:20
Message generated for change (Comment added) made by nnorwitz
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1532726&group_id=5470
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Interpreter Core
>Group: 3rd Party
>Status: Closed
>Resolution: Invalid
Priority: 5
Submitted By: Jan-Willem (jwnmulder)
Assigned to: Nobody/Anonymous (nobody)
Summary: incorrect behaviour of PyUnicode_EncodeMBCS?
Initial Comment:
Using Python 2.4.3.
This behaviour is not reproducible on a Windows or
Linux machine. I found the bug while tracking down a
problem in Python 2.4.3 ported to the Xbox.
Running the following two commands
test_string = 'encode me'
print repr(test_string.encode('mbcs'))
prints 'encode me' on Windows
and 'encode me\x00' on the Xbox.
The problem is that PyUnicode_EncodeMBCS returns a
PyStringObject that contains the data 'encode me' but
has an object size of 10.
string_repr(test_string) then sees a 0 character in
the string and encodes it as '\x00'.
Looking at the function 'PyUnicode_EncodeMBCS(const
Py_UNICODE *p, int size, const char *errors)', there
are basically two function calls:
{
mbcssize = WideCharToMultiByte(CP_ACP, 0, p, size,
NULL, 0, NULL, NULL);
repr = PyString_FromStringAndSize(NULL, mbcssize);
}
WideCharToMultiByte returns the number of bytes
needed for the buffer; because of the string
terminator, this function returns 10.
PyString_FromStringAndSize assumes its second
argument to be the number of needed characters, not
bytes. So an easy fix would be
to change
repr = PyString_FromStringAndSize(NULL, mbcssize);
to
repr = PyString_FromStringAndSize(NULL, mbcssize - 1);
Just checked the 2.4.3 svn trunk and it contains the
same bug.
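The symptom described above can be illustrated without the Windows API. The following is only a minimal sketch in Python 3: `encode_with_reported_size` is a hypothetical helper standing in for the "ask for the size, then allocate" sequence, and the `ascii` codec is assumed in place of `mbcs`, which exists only on Windows.

```python
def encode_with_reported_size(text, reported_size):
    """Hypothetical stand-in for the encode step: allocate a zero-filled
    result of reported_size bytes and copy the encoded text into it."""
    encoded = text.encode("ascii")  # assumption: ASCII replaces the 'mbcs' codec
    if reported_size < len(encoded):
        raise ValueError("reported size is too small for the encoded text")
    result = bytearray(reported_size)   # zero-filled, like a fresh buffer
    result[:len(encoded)] = encoded     # same-length slice, total size unchanged
    return bytes(result)


correct = encode_with_reported_size("encode me", 9)
too_long = encode_with_reported_size("encode me", 10)

print(repr(correct))    # b'encode me'
print(repr(too_long))   # b'encode me\x00'
```

The object is length-counted, so the one surplus byte is part of the value and shows up in the repr.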
----------------------------------------------------------------------
>Comment By: Neal Norwitz (nnorwitz)
Date: 2006-08-02 21:09
Message:
Logged In: YES
user_id=33168
Thanks for letting us know. Closing as requested.
----------------------------------------------------------------------
Comment By: Jan-Willem (jwnmulder)
Date: 2006-08-02 10:44
Message:
Logged In: YES
user_id=770969
And the result on the Xbox:
10
101 110 99 111 100 101 32 109 101 0
11
101 110 99 111 100 101 32 109 101 0 0
It seems the Xbox counts an extra character for a '\0'.
count(L"encode me", -1);
results in ret = 10 on both platforms.
So I think this bug can be closed and classified as an Xbox
bug. It is not hard for us to fix: almost all API calls for
DLLs are emulated in our application, so it is easy enough
for us to put a fix in.
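A workaround of the kind described here might look like the following. It is a sketch in Python 3 under stated assumptions, not the actual emulation code: `trailing_zeros` and `trim_extra_terminator` are hypothetical names, and the faulty conversion is assumed to append 0 bytes that have no counterpart in the input.

```python
def trailing_zeros(values):
    """Count the 0 values at the end of a sequence of integers."""
    count = 0
    for value in reversed(values):
        if value != 0:
            break
        count += 1
    return count


def trim_extra_terminator(source_units, converted):
    """Drop trailing 0 bytes that the converter added on its own.

    source_units: the code units that were actually passed in.
    converted:    the bytes returned by the (emulated) conversion call.
    """
    extra = trailing_zeros(converted) - trailing_zeros(source_units)
    if extra > 0:
        return converted[:len(converted) - extra]
    return converted


units = [ord(c) for c in "encode me"]

# Size 9 on the Xbox: 10 bytes came back, one terminator too many.
print(trim_extra_terminator(units, b"encode me\x00"))
# Size 10 on the Xbox: 11 bytes came back, again one too many.
print(trim_extra_terminator(units + [0], b"encode me\x00\x00"))
```

Output that already matches the input, as on Windows, passes through unchanged.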
----------------------------------------------------------------------
Comment By: Hirokazu Yamamoto (ocean-city)
Date: 2006-08-01 22:31
Message:
Logged In: YES
user_id=1200846
I think this is not related to that patch.
On my Win2000 SP4, the terminating null character is not passed to
PyUnicode_EncodeMBCS.
//////////////////////////////////////////////
// patch for debug (release24-maint branch)
Index: Objects/unicodeobject.c
===================================================================
--- Objects/unicodeobject.c (revision 51033)
+++ Objects/unicodeobject.c (working copy)
@@ -2782,6 +2782,20 @@
char *s;
DWORD mbcssize;
+{ /* debug */
+
+ int i;
+
+ printf("------------> %d\n", size);
+
+ for (i = 0; i < size; ++i) {
+ printf("%d ", (int)p[i]);
+ }
+
+ printf("\n");
+
+} /* debug */
+
/* If there are no characters, bail now! */
if (size==0)
return PyString_FromString("");
//////////////////////////////////
// a.py
test_string = 'encode me'
print repr(test_string.encode('mbcs'))
//////////////////////////////////
// result
R:\>py a.py
------------> 9
101 110 99 111 100 101 32 109 101
'encode me'
[7660 refs]
And I tried this.
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
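/* Convert `size` wide characters of w to the ANSI code page and dump the bytes. */
void count(LPCWSTR w, int size)
{
    char *buf;
    int i;
    /* First call: no output buffer, so the return value is the required size in bytes. */
    const int ret = ::WideCharToMultiByte(CP_ACP, 0, w, size, NULL, 0, NULL, NULL);
    if (ret == 0)
    {
        printf("error\n");
        return;
    }
    printf("%d\n", ret);
    buf = (char*)malloc(ret);
    if (buf == NULL)
    {
        return;
    }
    /* Second call: perform the conversion into the buffer. */
    ::WideCharToMultiByte(CP_ACP, 0, w, size, buf, ret, NULL, NULL);
    for (i = 0; i < ret; ++i)
    {
        printf("%d ", (int)buf[i]);
    }
    printf("\n");
    free(buf);
}

int main()
{
    count(L"encode me", 9);
    count(L"encode me", 10); /* include null character */
    return 0;
}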
/*
9
101 110 99 111 100 101 32 109 101
10
101 110 99 111 100 101 32 109 101 0
*/
As stated in
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_2bj9.asp
, WideCharToMultiByte never outputs a null character if the source
string doesn't contain one. So I think the usage of
WideCharToMultiByte is correct.
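The size rule can be modelled in a few lines. This is only a sketch in Python 3 of the documented length semantics, not the Windows implementation: `required_size` is a hypothetical name, the input is a list of code units, and every unit is assumed to convert to exactly one byte (true for ASCII text such as 'encode me').

```python
def required_size(units, size):
    """Model of the byte count reported for `size` input units.

    units: list of integer code units, ending with a terminating 0.
    size:  number of units to convert, or -1 to process up to and
           including the terminating 0.
    Assumption: each unit converts to exactly one byte (ASCII-only input).
    """
    if size == -1:
        size = units.index(0) + 1   # the terminator is counted
    return len(units[:size])


wide = [ord(c) for c in "encode me"] + [0]  # L"encode me" with its terminator

print(required_size(wide, 9))    # 9: terminator not passed in, none counted
print(required_size(wide, 10))   # 10: terminator passed in explicitly
print(required_size(wide, -1))   # 10: terminator found and counted
```

Under this model a result of 10 for an input size of 9 can only come from the converter itself, not from the caller.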
I don't know why, but there is probably some difference
in behaviour between Win2000 and the Xbox (i.e. the Xbox calls
PyUnicode_EncodeMBCS with size 10, or WideCharToMultiByte
on the Xbox outputs a null character even if the source string doesn't
contain one?).
Can you try the above C code and debug patch on the Xbox?
----------------------------------------------------------------------
Comment By: Jan-Willem (jwnmulder)
Date: 2006-08-01 14:30
Message:
Logged In: YES
user_id=770969
Related to patch 1455898?
----------------------------------------------------------------------