[Python-bugs-list] [ python-Bugs-541828 ] Regression in unicodestr.encode()

noreply@sourceforge.net
Wed, 10 Apr 2002 11:53:49 -0700


Bugs item #541828, was opened at 2002-04-10 01:56
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=541828&group_id=5470

Category: Unicode
Group: Python 2.3
Status: Open
Resolution: None
Priority: 7
Submitted By: Barry Warsaw (bwarsaw)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Regression in unicodestr.encode()

Initial Comment:
I'm porting over the latest email package to Python 2.3cvs, and I've
had one of my tests fail.  I've narrowed it down to the following
test case:

a = u'\u6b63\u78ba\u306b\u8a00\u3046\u3068\u7ffb\u8a33\u306f\u3055\u308c\u3066\u3044\u307e\u305b\u3093\u3002\u4e00\u90e8\u306f\u30c9\u30a4\u30c4\u8a9e\u3067\u3059\u304c\u3001\u3042\u3068\u306f\u3067\u305f\u3089\u3081\u3067\u3059\u3002\u5b9f\u969b\u306b\u306f\u300cWenn ist das Nunstuck git und'
print repr(a.encode('utf-8', 'replace'))

In Python 2.2.1 I get

'\xe6\xad\xa3\xe7\xa2\xba\xe3\x81\xab\xe8\xa8\x80\xe3\x81\x86\xe3\x81\xa8\xe7\xbf\xbb\xe8\xa8\xb3\xe3\x81\xaf\xe3\x81\x95\xe3\x82\x8c\xe3\x81\xa6\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x9b\xe3\x82\x93\xe3\x80\x82\xe4\xb8\x80\xe9\x83\xa8\xe3\x81\xaf\xe3\x83\x89\xe3\x82\xa4\xe3\x83\x84\xe8\xaa\x9e\xe3\x81\xa7\xe3\x81\x99\xe3\x81\x8c\xe3\x80\x81\xe3\x81\x82\xe3\x81\xa8\xe3\x81\xaf\xe3\x81\xa7\xe3\x81\x9f\xe3\x82\x89\xe3\x82\x81\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\xe5\xae\x9f\xe9\x9a\x9b\xe3\x81\xab\xe3\x81\xaf\xe3\x80\x8cWenn ist das Nunstuck git und'

but in Python 2.3 cvs I get

'\xe6\xad\xa3\xe7\xa2\xba\xe3\x81\xab\xe8\xa8\x80\xe3\x81\x86\xe3\x81\xa8\xe7\xbf\xbb\xe8\xa8\xb3\xe3\x81\xaf\xe3\x81\x95\xe3\x82\x8c\xe3\x81\xa6\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x9b\xe3\x82\x93\xe3\x80\x82\xe4\xb8\x80\xe9\x83\xa8\xe3\x81\xaf\xe3\x83\x89\xe3\x82\xa4\xe3\x83\x84\xe8\xaa\x9e\xe3\x81\xa7\xe3\x81\x99\xe3\x81\x8c\xe3\x80\x81\xe3\x81\x82\xe3\x81\xa8\xe3\x81\xaf\xe3\x81\xa7\xe3\x81\x9f\xe3\x82\x89\xe3\x82\x81\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\xe5\xae\x9f\xe9\x9a\x9b\xe3\x81\xab\xe3\x81\xaf\xe3\x80\x8cWenn ist das Nunstuck git u\x00\x00'

Note that the last two characters, which should be `n' and `d', are
now NULs.  My very limited Tim-enlightened understanding is that
encoding a string to UTF-8 should never produce a string with NULs.
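
The expectation above can be checked mechanically. The following is
only a sketch, written for a modern Python 3 interpreter rather than
the 2.2/2.3 builds discussed here, and the helper name
has_unexpected_nul is made up for illustration. It relies on the fact
that UTF-8 emits a 0x00 byte only for the code point U+0000, so the
number of NUL bytes in the output must equal the number of U+0000
characters in the input.

```python
# Sketch for Python 3 (assumption: the report itself used Python 2 syntax).
# Same test string as in the report, split over several literals.
text = (
    "\u6b63\u78ba\u306b\u8a00\u3046\u3068\u7ffb\u8a33\u306f\u3055\u308c"
    "\u3066\u3044\u307e\u305b\u3093\u3002\u4e00\u90e8\u306f\u30c9\u30a4"
    "\u30c4\u8a9e\u3067\u3059\u304c\u3001\u3042\u3068\u306f\u3067\u305f"
    "\u3089\u3081\u3067\u3059\u3002\u5b9f\u969b\u306b\u306f\u300c"
    "Wenn ist das Nunstuck git und"
)


def has_unexpected_nul(s):
    """Return True if the UTF-8 encoding of s contains a NUL byte that
    does not correspond to a U+0000 character in s (hypothetical helper)."""
    encoded = s.encode("utf-8", "replace")
    return encoded.count(b"\x00") != s.count("\x00")


encoded = text.encode("utf-8", "replace")
print(has_unexpected_nul(text))
print(encoded.endswith(b"git und"))
```

On a correct codec the first check is false and the encoded bytes end
in "git und"; the 2.3cvs output quoted above would fail both checks.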

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2002-04-10 18:53

Message:
Logged In: YES 
user_id=38388

I'm not in favour of the precomputation. We already had a
discussion about the performance of this.

About the cbWritten thingie: that was your invention, IIRC :-)
I'll try ripping that bit out again and using pointer arithmetic
instead.

Still, I believe the real cause of the problem is in pymalloc,
since a debugging session indicated that the codec did write
the 'n', 'd' characters. It's the final _PyString_Resize() which
causes these to be dropped during the copying of the
memory block.


----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-04-10 18:07

Message:
Logged In: YES 
user_id=21627

It appears that cbWritten can still run above cbAllocated,
namely if a long sequence of 3-byte characters is followed
by a long sequence of 1-byte or 2-byte characters.

I'm still in favour of dropping the resizing of the result
string, and computing the number of bytes in a first run.
The code becomes clearer that way and more performant; see
attached unicode.diff.
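
To illustrate the two-pass idea described above, here is a minimal
sketch in Python 3. It is not the content of unicode.diff, which is
not reproduced in this message; the function names utf8_length and
encode_utf8_two_pass are hypothetical, and lone surrogates are not
given any special treatment. The first pass counts the bytes each code
point needs, the second pass fills a buffer of exactly that size, so
no resize of the result is needed.

```python
def utf8_length(s):
    """First pass: count the bytes UTF-8 needs for every code point in s."""
    total = 0
    for ch in s:
        cp = ord(ch)
        if cp < 0x80:
            total += 1
        elif cp < 0x800:
            total += 2
        elif cp < 0x10000:
            total += 3
        else:
            total += 4
    return total


def encode_utf8_two_pass(s):
    """Second pass: write into a buffer of exactly the precomputed size.

    Illustration only: surrogate code points are encoded like any other
    3-byte code point instead of being rejected or replaced.
    """
    out = bytearray(utf8_length(s))
    i = 0
    for ch in s:
        cp = ord(ch)
        if cp < 0x80:
            out[i] = cp
            i += 1
        elif cp < 0x800:
            out[i] = 0xC0 | (cp >> 6)
            out[i + 1] = 0x80 | (cp & 0x3F)
            i += 2
        elif cp < 0x10000:
            out[i] = 0xE0 | (cp >> 12)
            out[i + 1] = 0x80 | ((cp >> 6) & 0x3F)
            out[i + 2] = 0x80 | (cp & 0x3F)
            i += 3
        else:
            out[i] = 0xF0 | (cp >> 18)
            out[i + 1] = 0x80 | ((cp >> 12) & 0x3F)
            out[i + 2] = 0x80 | ((cp >> 6) & 0x3F)
            out[i + 3] = 0x80 | (cp & 0x3F)
            i += 4
    # The write position must land exactly at the end of the buffer.
    assert i == len(out)
    return bytes(out)


# A long run of 3-byte characters followed by a long run of 1-byte and
# 2-byte characters, the pattern mentioned in the comment above.
sample = "\u6b63" * 50 + "ab" * 50 + "\u00e9" * 10
print(encode_utf8_two_pass(sample) == sample.encode("utf-8"))
```

Because the buffer is sized before any byte is written, the write
position cannot run past the allocation, whatever mix of 1-, 2- and
3-byte characters the input contains.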


----------------------------------------------------------------------
