[ python-Bugs-1532726 ] incorrect behaviour of PyUnicode_EncodeMBCS?

Wed Aug 2 07:31:33 CEST 2006

Bugs item #1532726, was opened at 2006-08-02 06:20
Message generated for change (Comment added) made by ocean-city
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1532726&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Interpreter Core
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: Jan-Willem (jwnmulder)
Assigned to: Nobody/Anonymous (nobody)
Summary: incorrect behaviour of PyUnicode_EncodeMBCS?

Initial Comment:
Using python 2.4.3
This behaviour is not reproducable on a window or 
linux machine. I found the bug when trying to find a 
problem on python 2.4.3 ported to the xbox.

running the next two commands

  test_string = 'encode me'
  print repr(test_string.encode('mbcs'))

results on windows in : 'encode me'
and on the xbox : 'encode me\\x00'

The problem is that 'PyUnicode_EncodeMBCS' returns an 
PyStringObject that contains the data 'encode me' but 
with an object size of 10.
string_repr(test_string) assumes the string contains 
a 0 character and encodes it as '\\x00'

looking at the function 'PyUnicode_EncodeMBCS(const 
Py_UNICODE *p, int size, const char *errors)' there 
are basicly two functions

{
  mbcssize = WideCharToMultiByte(CP_ACP, 0, p, size, 
NULL, 0, NULL, NULL);
  repr = PyString_FromStringAndSize(NULL, mbcssize);
}

WideCharToMultiByte returns the nummer of bytes 
needed for the buffer, because of the string 
termination this functions returns 10.
PyString_FromStringAndSize assumes its second 
argument to be the number of needed characters, not 
bytes. So an easy fix would be
to change
  repr = PyString_FromStringAndSize(NULL, mbcssize);
in
  repr = PyString_FromStringAndSize(NULL, mbcssize - 
1);

Just checked the 2.4.3 svn trunk and it contains the 
same bug.

----------------------------------------------------------------------

Comment By: Hirokazu Yamamoto (ocean-city)
Date: 2006-08-02 14:31

Message:
Logged In: YES 
user_id=1200846

I think this is not related to that patch.

On my win2000sp4, teminating null character is not passed to
PyUnicode_EncodeMBCS.

//////////////////////////////////////////////
// patch for debug (release24-maint branch)

Index: Objects/unicodeobject.c
===================================================================

--- Objects/unicodeobject.c	(revision 51033)
+++ Objects/unicodeobject.c	(working copy)
@@ -2782,6 +2782,20 @@
     char *s;
     DWORD mbcssize;
 
+{ /* debug */
+
+    int i;
+
+    printf("------------> %d\n", size);
+
+    for (i = 0; i < size; ++i) {
+	printf("%d ", (int)p[i]);
+    }
+
+    printf("\n");
+
+} /* debug */
+
     /* If there are no characters, bail now! */
     if (size==0)
 	    return PyString_FromString("");

//////////////////////////////////
// a.py

test_string = 'encode me'
print repr(test_string.encode('mbcs'))

//////////////////////////////////
// result

R:\>py a.py
------------> 9
101 110 99 111 100 101 32 109 101
'encode me'
[7660 refs]


And I tried this.



#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

void count(LPCWSTR w, int size)
{
    char *buf; int i;

    const int ret = ::WideCharToMultiByte(
        CP_ACP,
        0,
        w,
        size,
        NULL,
        0,
        NULL,
        NULL
    );

    if (ret == 0)
    {
        printf("error\n");
    }
    else
    {
        printf("%d\n", ret);
    }

    buf = (char*)malloc(ret);

    ::WideCharToMultiByte(
        CP_ACP,
        0,
        w,
        size,
        buf,
        ret,
        NULL,
        NULL
    );

    for (i = 0; i < ret; ++i)
    {
        printf("%d ", (int)buf[i]);
    }

    printf("\n");

    free(buf);
}

int main()
{
    count(L"encode me", 9);
    count(L"encode me", 10); /* include null charater */
}

/*
9
101 110 99 111 100 101 32 109 101
10
101 110 99 111 100 101 32 109 101 0
*/


As stated in
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_2bj9.asp
, WideCharToMultiByte never output null character if source
string doesn't contain null character. So I think usage of
WideCharToMultiByte is correct.

I don't know why, but probably some behavior difference
should exist between win2000 and xbox. (ie: xbox calls
PyUnicode_EncodeMBCS with size 10 ... or WideCharToMultiByte
on xbox outputs null character even if source string doesn't
contain it?)

Can you try above C code and debug patch on xbox?


----------------------------------------------------------------------

Comment By: Jan-Willem (jwnmulder)
Date: 2006-08-02 06:30

Message:
Logged In: YES 
user_id=770969

related to patch 1455898 ?

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1532726&group_id=5470