[ python-Bugs-1532726 ] incorrect behaviour of PyUnicode_EncodeMBCS?

Thu Aug 3 06:09:26 CEST 2006

Bugs item #1532726, was opened at 2006-08-01 14:20
Message generated for change (Comment added) made by nnorwitz
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1532726&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Interpreter Core
>Group: 3rd Party
>Status: Closed
>Resolution: Invalid
Priority: 5
Submitted By: Jan-Willem (jwnmulder)
Assigned to: Nobody/Anonymous (nobody)
Summary: incorrect behaviour of PyUnicode_EncodeMBCS?

Initial Comment:
Using Python 2.4.3.
This behaviour is not reproducible on a Windows or
Linux machine. I found the bug while tracking down a
problem in Python 2.4.3 ported to the Xbox.

Running the following two commands

  test_string = 'encode me'
  print repr(test_string.encode('mbcs'))

results on Windows in: 'encode me'
and on the Xbox in: 'encode me\x00'

The problem is that 'PyUnicode_EncodeMBCS' returns a
PyStringObject that contains the data 'encode me' but
has an object size of 10.
string_repr(test_string) therefore sees a trailing
0 character and encodes it as '\x00'.
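
To make the symptom concrete, here is a minimal sketch in Python 3 (not the 2.4.3 interpreter from this report). It does not call the 'mbcs' codec, which exists only on Windows. It simply assumes the Xbox result is the 9-byte payload plus one trailing NUL byte, as described above, and shows how repr() renders that extra byte. The names expected and observed are illustrative only.

```python
# The 9-byte payload Windows produces, and the same payload with one
# trailing NUL byte, which is what the Xbox port is reported to return.
expected = b"encode me"
observed = expected + b"\x00"   # assumption: object size 10, last byte is 0

print(len(expected), repr(expected))
print(len(observed), repr(observed))
```

The second line of output ends in \x00, which matches the repr seen on the Xbox.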

Looking at the function 'PyUnicode_EncodeMBCS(const
Py_UNICODE *p, int size, const char *errors)', there
are basically two function calls:

{
  mbcssize = WideCharToMultiByte(CP_ACP, 0, p, size,
                                 NULL, 0, NULL, NULL);
  repr = PyString_FromStringAndSize(NULL, mbcssize);
}

WideCharToMultiByte returns the number of bytes
needed for the buffer; because of the string
termination this function returns 10.
PyString_FromStringAndSize assumes its second
argument to be the number of needed characters, not
bytes. So an easy fix would be to change
  repr = PyString_FromStringAndSize(NULL, mbcssize);
into
  repr = PyString_FromStringAndSize(NULL, mbcssize - 1);

Just checked the 2.4.3 svn trunk and it contains the 
same bug.

----------------------------------------------------------------------

>Comment By: Neal Norwitz (nnorwitz)
Date: 2006-08-02 21:09

Message:
Logged In: YES 
user_id=33168

Thanks for letting us know.  Closing as requested.

----------------------------------------------------------------------

Comment By: Jan-Willem (jwnmulder)
Date: 2006-08-02 10:44

Message:
Logged In: YES 
user_id=770969

And the result on the Xbox:
10
101 110 99 111 100 101 32 109 101 0 
11
101 110 99 111 100 101 32 109 101 0 0 

It seems the Xbox counts an extra character for a '\0'.

count(L"encode me", -1);
results in ret = 10 on both platforms.

So I think this bug can be closed and classified as an Xbox
bug. It is not hard for us to fix: almost all API calls for
DLLs are emulated in our application, so it is easy enough
for us to put a fix in.
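
As a rough illustration of what such a fix in an emulation layer could look like, here is a sketch in Python 3. It is a model only, not the actual Xbox port code: emulated_convert and patched_convert are hypothetical names, and the assumption that the emulated call appends exactly one stray NUL byte is inferred from the two outputs listed above (10 bytes for size 9, 11 bytes for size 10).

```python
def emulated_convert(chars):
    """Hypothetical stand-in for the emulated call: appends one stray NUL."""
    return chars.encode("ascii") + b"\x00"


def patched_convert(chars):
    """Drop exactly one trailing byte, the terminator the emulation added."""
    raw = emulated_convert(chars)
    return raw[:-1]


# Same two inputs as the C test program: without and with a terminator.
for chars in ("encode me", "encode me\0"):
    raw = emulated_convert(chars)
    fixed = patched_convert(chars)
    print(len(raw), "->", len(fixed), list(fixed))
```

Dropping one byte unconditionally, rather than stripping all trailing zeros, keeps a terminator that the caller passed in on purpose.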

----------------------------------------------------------------------

Comment By: Hirokazu Yamamoto (ocean-city)
Date: 2006-08-01 22:31

Message:
Logged In: YES 
user_id=1200846

I think this is not related to that patch.

On my Win2000 SP4 machine, the terminating null character is not passed to
PyUnicode_EncodeMBCS.

//////////////////////////////////////////////
// patch for debug (release24-maint branch)

Index: Objects/unicodeobject.c
===================================================================

--- Objects/unicodeobject.c	(revision 51033)
+++ Objects/unicodeobject.c	(working copy)
@@ -2782,6 +2782,20 @@
     char *s;
     DWORD mbcssize;
 
+{ /* debug */
+
+    int i;
+
+    printf("------------> %d\n", size);
+
+    for (i = 0; i < size; ++i) {
+	printf("%d ", (int)p[i]);
+    }
+
+    printf("\n");
+
+} /* debug */
+
     /* If there are no characters, bail now! */
     if (size==0)
 	    return PyString_FromString("");

//////////////////////////////////
// a.py

test_string = 'encode me'
print repr(test_string.encode('mbcs'))

//////////////////////////////////
// result

R:\>py a.py
------------> 9
101 110 99 111 100 101 32 109 101
'encode me'
[7660 refs]


And I tried this.



#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

void count(LPCWSTR w, int size)
{
    char *buf; int i;

    const int ret = ::WideCharToMultiByte(
        CP_ACP,
        0,
        w,
        size,
        NULL,
        0,
        NULL,
        NULL
    );

    if (ret == 0)
    {
        printf("error\n");
    }
    else
    {
        printf("%d\n", ret);
    }

    buf = (char*)malloc(ret);

    ::WideCharToMultiByte(
        CP_ACP,
        0,
        w,
        size,
        buf,
        ret,
        NULL,
        NULL
    );

    for (i = 0; i < ret; ++i)
    {
        printf("%d ", (int)buf[i]);
    }

    printf("\n");

    free(buf);
}

int main()
{
    count(L"encode me", 9);
    count(L"encode me", 10); /* include null character */
}

/*
9
101 110 99 111 100 101 32 109 101
10
101 110 99 111 100 101 32 109 101 0
*/


As stated in
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_2bj9.asp,
WideCharToMultiByte never outputs a null character if the source
string doesn't contain one. So I think the usage of
WideCharToMultiByte is correct.
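
The length rule can be summarised with a small model in Python 3. This is only a sketch of the behaviour seen in the test program above, not a call into the Win32 API. The helper model_required_size is a hypothetical name, and it assumes an ASCII-only string, so that one character maps to one byte. A size of -1 is modelled as "the input is null-terminated", in which case the terminator is converted and counted too.

```python
def model_required_size(text, size, encoding="ascii"):
    """Hypothetical model of the byte count reported for `size` input chars.

    size counts input characters; -1 means "null-terminated input", so the
    terminator itself is converted and included in the count.
    """
    terminated = text + "\0"   # a C wide-string literal carries a terminator
    if size == -1:
        chunk = terminated
    else:
        chunk = terminated[:size]
    return len(chunk.encode(encoding))


print(model_required_size("encode me", 9))    # 9
print(model_required_size("encode me", 10))   # 10
print(model_required_size("encode me", -1))   # 10
```

Under this model a caller that passes size 9, as PyUnicode_EncodeMBCS does in the debug output above, never gets a terminator counted.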

I don't know why, but there is probably some difference in
behavior between Win2000 and the Xbox (i.e. the Xbox calls
PyUnicode_EncodeMBCS with size 10, or WideCharToMultiByte
on the Xbox outputs a null character even if the source string
doesn't contain one?).

Can you try the above C code and the debug patch on the Xbox?


----------------------------------------------------------------------

Comment By: Jan-Willem (jwnmulder)
Date: 2006-08-01 14:30

Message:
Logged In: YES 
user_id=770969

Related to patch 1455898?

----------------------------------------------------------------------
