[Patches] [ python-Patches-1455898 ] patch for mbcs codecs

Tue Jul 25 18:17:42 CEST 2006

Patches item #1455898, was opened at 2006-03-22 16:31
Message generated for change (Comment added) made by ocean-city
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1455898&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Windows
Group: Python 2.5
>Status: Open
Resolution: Accepted
Priority: 7
Submitted By: Hirokazu Yamamoto (ocean-city)
Assigned to: Nobody/Anonymous (nobody)
Summary: patch for mbcs codecs

Initial Comment:
Hello.

I have noticed mbcs codecs sometimes generates broken
string. I'm using Windows(Japanese) so mbcs is mapped
to cp932 (close to shift_jis)

When I run the attached script "a.zip", the entry
"Error 00007"'s message becomes broken like attached
file "b.txt".

I think this happens because the string passed to
PyUnicode_DecodeMBCS() sometimes terminates with
leading byte, and MultiByteToWideChar() counts it for
size of result string.buffer size.

I hope attached patch "mbcs.patch" may fix the problem.
It would be nice if this bug will be fixed in 2.4.3...
Thank you.






----------------------------------------------------------------------

>Comment By: Hirokazu Yamamoto (ocean-city)
Date: 2006-07-26 01:17

Message:
Logged In: YES 
user_id=1200846

Sorry, I reopened this issue because I found problem.

With attached "mbcs.py" and "mbcs.txt", result file
"hoge.txt" gets corrupted. This happens because Codec.decode
is called incrementally, while default value for final in
mbcs_decode() is True.

>I think the default value for final in mbcs_decode() should
>be true, so that the stateless decoder detects incomplete
>byte sequences too.

Probably this is not true. I think "stateless" means codec
doesn't have internal state for incremental usage, doesn't
mean it is not called incrementally.

# I hope attached "fix.patch" fixes the problem.


----------------------------------------------------------------------

Comment By: Martin v. LÃ¶wis (loewis)
Date: 2006-06-14 14:21

Message:
Logged In: YES 
user_id=21627

Thanks for the patch. Committed as r46945.

----------------------------------------------------------------------

Comment By: Hirokazu Yamamoto (ocean-city)
Date: 2006-05-27 06:40

Message:
Logged In: YES 
user_id=1200846

I reverted PyUnicode_Resize() patch for now, and recreated
the patch as "mbcs.patch".

>You should probably find someone on python-dev with a
>multibyte version of Windows to look at the patch.

OK, I will.

----------------------------------------------------------------------

Comment By: Hirokazu Yamamoto (ocean-city)
Date: 2006-05-27 06:18

Message:
Logged In: YES 
user_id=1200846

>The change to PyUnicode_Resize() should be reverted (or done
>in a way that doesn't lead to bugs).

Sorry, how about this?

Index: Objects/unicodeobject.c
===================================================================

--- Objects/unicodeobject.c	(revision 46417)
+++ Objects/unicodeobject.c	(working copy)
@@ -326,7 +326,7 @@
 	return -1;
     }
     v = (PyUnicodeObject *)*unicode;
-    if (v == NULL || !PyUnicode_Check(v) || v->ob_refcnt !=
1 || length < 0) {
+    if (v == NULL || !PyUnicode_Check(v) || length < 0) {
 	PyErr_BadInternalCall();
 	return -1;
     }
@@ -335,7 +335,7 @@
        possible since these are being shared. We simply
return a fresh
        copy with the same Unicode content. */
     if (v->length != length &&
-	(v == unicode_empty || v->length == 1)) {
+	(v == unicode_empty || v->length == 1 || v->ob_refcnt != 1)) {
 	PyUnicodeObject *w = _PyUnicode_New(length);
 	if (w == NULL)
 	    return -1;


----------------------------------------------------------------------

Comment By: Walter DÃ¶rwald (doerwalter)
Date: 2006-05-27 00:43

Message:
Logged In: YES 
user_id=89016

The change to PyUnicode_Resize() should be reverted (or done
in a way that doesn't lead to bugs).

Unfortunately I don't have a Windows where I can test the
patch, so I'm unassigning the bug.

You should probably find someone on python-dev with a
multibyte version of Windows to look at the patch.

----------------------------------------------------------------------

Comment By: Hirokazu Yamamoto (ocean-city)
Date: 2006-05-25 20:06

Message:
Logged In: YES 
user_id=1200846

>PyUnicode_DecodeMBCS does not support size >= INT_MAX yet,
>but probably I'll fix it too.

Done. Attached as "mbcs_win64_support.patch".

Now, total summary...

    - MBCS decoder and encoder now supports 64bit Py_ssize_t
environment. (I don't have such machine, but I checked
routine by defining NEED_RETRY and redefining INT_MAX as 2,
3, 4)

    - Fixed a bug of MBCS incremental decoder which was
originaly reported by me.




----------------------------------------------------------------------

Comment By: Hirokazu Yamamoto (ocean-city)
Date: 2006-05-25 05:18

Message:
Logged In: YES 
user_id=1200846

I updated the patch.

  - PyUnicode_DecodeMBCS now supports size >= INT_MAX. (I
don't have machine to test such big string, but I have
tested this routine replaced INT_MAX with 2 and 3)

PyUnicode_DecodeMBCS does not support size >= INT_MAX yet,
but probably I'll fix it too.

This patch includes Patch#1494487.


----------------------------------------------------------------------

Comment By: Hirokazu Yamamoto (ocean-city)
Date: 2006-05-02 20:40

Message:
Logged In: YES 
user_id=1200846

I updated the patch. (I copy and pasted "int final = 0" from
above code (ex: utf_16_ex_decode), maybe they also should be
changed for consistency?)

And one more thing, I noticed "errors" is ignored now. We
can detect invalid character if we set MB_ERR_INVALID_CHARS
flag when calling MultiByteToWideChar, but we cannot tell
where is the position of invalid character, and MSDN saids
this flag is available Win2000SP4 or later (I don't know
why)
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_17si.asp
So I didn't make the patch for it.


----------------------------------------------------------------------

Comment By: Walter DÃ¶rwald (doerwalter)
Date: 2006-04-26 02:22

Message:
Logged In: YES 
user_id=89016

I think the default value for final in mbcs_decode() should
be true, so that the stateless decoder detects incomplete
byte sequences too.

----------------------------------------------------------------------

Comment By: Hirokazu Yamamoto (ocean-city)
Date: 2006-04-07 18:10

Message:
Logged In: YES 
user_id=1200846

I have sent contributor form via postal mail. Probably you
can get it after 10 days.

----------------------------------------------------------------------

Comment By: Hirokazu Yamamoto (ocean-city)
Date: 2006-03-28 17:16

Message:
Logged In: YES 
user_id=1200846

You are right. I've updated the patch. (mbcs5.patch)

>>> import codecs
[20198 refs]
>>> d = codecs.getincrementaldecoder("mbcs")()
[20198 refs]
>>> d.decode('\x82\xa0\x82')
u'\u3042'
[20198 refs]
>>> d.decode('')
u''
[20198 refs]
>>> d.decode('', final=True)
u'\x00'
[20198 refs]


----------------------------------------------------------------------

Comment By: Walter DÃ¶rwald (doerwalter)
Date: 2006-03-28 01:06

Message:
Logged In: YES 
user_id=89016

_buffer_decode() in the IncrementalDecoder ignores the final
argument. IncrementalDecoder._buffer_decode() should pass on
its final argument to _codecsmodules.c::mbcs_decode(), which
should be extended to accept the final argument. Also
PyUnicode_DecodeMBCSStateful() must handle consumed == NULL
correctly (with your patch it drops trailing lead bytes even
if consumed == NULL)

----------------------------------------------------------------------

Comment By: Hirokazu Yamamoto (ocean-city)
Date: 2006-03-27 16:41

Message:
Logged In: YES 
user_id=1200846

I replaced tests. Probably this is better instead of
comparing the two string generated by same decoder.

----------------------------------------------------------------------

Comment By: Hirokazu Yamamoto (ocean-city)
Date: 2006-03-27 14:44

Message:
Logged In: YES 
user_id=1200846

My real name is Hirokazu Yamamoto. But sorry, I don't have
FAX. (It's needed to send contributor form, isn't it?)

I'll attach the patch updated for trunk. And I'll attach the
tests.

----------------------------------------------------------------------

Comment By: Martin v. LÃ¶wis (loewis)
Date: 2006-03-27 06:05

Message:
Logged In: YES 
user_id=21627

I have reservations against this patch because of the
quasi-anonymous nature of the submission. ocean-city, can
you please state your real name? Would you also be willing
to fill out a contributor form, as shown on

http://www.python.org/psf/contrib-form.html

----------------------------------------------------------------------

Comment By: Hirokazu Yamamoto (ocean-city)
Date: 2006-03-24 23:02

Message:
Logged In: YES 
user_id=1200846

OK, I'll try.

----------------------------------------------------------------------

Comment By: Walter DÃ¶rwald (doerwalter)
Date: 2006-03-24 06:44

Message:
Logged In: YES 
user_id=89016

This isn't a bugfix in the strictest sense, so IMHO this
patch shouldn't go into 2.4. 

If the patch goes into 2.5, it would need the appropriate
changes to encodings/mbcs.py (i.e. the IncrementalDecoder
would have to be changed (inheriting from
BufferedIncrementalDecoder).

I realize that this patch might be hard to test, as results
are dependent on locale. Nevertheless at least some tests
would be good (even if they are only run or do something
useful on a certain locale and are skipped otherwise).

ocean-city, can you update the patch for the trunk and add
tests?


----------------------------------------------------------------------

Comment By: Hirokazu Yamamoto (ocean-city)
Date: 2006-03-23 11:51

Message:
Logged In: YES 
user_id=1200846

Hello. This is my final patch. (mbcs4.patch)

 - mbcs3a.patch: _mbsbtype depends on locale not system ANSI
code page. so probably it's not good to use it with
MultiByteToWideChar.

 - mbcs3b.patch: CharNext may cause buffer overflow. and
this patch always calls CharPrev but it's not needed if
string is not terminated with "potensial" lead byte.

I hope this is stable enough to commit on repositry. Thank you.


----------------------------------------------------------------------

Comment By: Hirokazu Yamamoto (ocean-city)
Date: 2006-03-22 23:36

Message:
Logged In: YES 
user_id=1200846

Sorry, I was stupid.

MSDN
(http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_0o2t.asp)
saids,

> IsDBCSLeadByte can only indicate a potential lead byte value. 

IsDBCSLeadByte was returning 1 for some trail byte (ex: "æ´"[1])

The patch "mbcs3a.patch" worked for me, but _mbsbtype is
probably compiler specific. Is that OK?

The patch "mbcs3b.patch" also worked for me and it only uses
Win32API, but I don't have enough faith on this
implementation...



----------------------------------------------------------------------

Comment By: Hirokazu Yamamoto (ocean-city)
Date: 2006-03-22 19:31

Message:
Logged In: YES 
user_id=1200846

Sorry, I found problem when tried more long text file...
Please wait. I'll investigate more intensibly.

----------------------------------------------------------------------

Comment By: Hirokazu Yamamoto (ocean-city)
Date: 2006-03-22 19:13

Message:
Logged In: YES 
user_id=1200846

Thank you for reply. How about this? (I'm a newbie, I hope
this is right tex format but... can you confirm this? I
created this patch by copy & paste from
PyUnicode_DecodeUTF16Stateful and some modification)


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2006-03-22 18:12

Message:
Logged In: YES 
user_id=38388

One more nit: the doc patch is missing. Please add a patch
for the API docs.


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2006-03-22 18:11

Message:
Logged In: YES 
user_id=38388

As I understand your comment, the mbcs codec will have a
problem if the input string terminates with a lead byte.

Could you add a comment regarding this to the patch ?!

I can't test the patch, since I don't have a Japanese
Windows to check on, but from looking at the patch, it seems OK.


----------------------------------------------------------------------

Comment By: Hirokazu Yamamoto (ocean-city)
Date: 2006-03-22 16:42

Message:
Logged In: YES 
user_id=1200846

I forgot to mention this. "mbcs.patch" is for
release24-maint branch.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1455898&group_id=5470