[Python-bugs-list] [ python-Bugs-541828 ] Regression in unicodestr.encode()

noreply@sourceforge.net noreply@sourceforge.net
Wed, 10 Apr 2002 14:36:46 -0700


Bugs item #541828, was opened at 2002-04-10 03:56
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=541828&group_id=5470

Category: Unicode
Group: Python 2.3
Status: Closed
Resolution: Fixed
Priority: 7
Submitted By: Barry Warsaw (bwarsaw)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Regression in unicodestr.encode()

Initial Comment:
I'm porting over the latest email package to Python 2.3cvs,
and I've had one of my tests fail.  I've narrowed it down to
the following test case:

a = u'\u6b63\u78ba\u306b\u8a00\u3046\u3068\u7ffb\u8a33\u306f\u3055\u308c\u3066\u3044\u307e\u305b\u3093\u3002\u4e00\u90e8\u306f\u30c9\u30a4\u30c4\u8a9e\u3067\u3059\u304c\u3001\u3042\u3068\u306f\u3067\u305f\u3089\u3081\u3067\u3059\u3002\u5b9f\u969b\u306b\u306f\u300cWenn ist das Nunstuck git und'
print repr(a.encode('utf-8', 'replace'))

In Python 2.2.1 I get

'\xe6\xad\xa3\xe7\xa2\xba\xe3\x81\xab\xe8\xa8\x80\xe3\x81\x86\xe3\x81\xa8\xe7\xbf\xbb\xe8\xa8\xb3\xe3\x81\xaf\xe3\x81\x95\xe3\x82\x8c\xe3\x81\xa6\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x9b\xe3\x82\x93\xe3\x80\x82\xe4\xb8\x80\xe9\x83\xa8\xe3\x81\xaf\xe3\x83\x89\xe3\x82\xa4\xe3\x83\x84\xe8\xaa\x9e\xe3\x81\xa7\xe3\x81\x99\xe3\x81\x8c\xe3\x80\x81\xe3\x81\x82\xe3\x81\xa8\xe3\x81\xaf\xe3\x81\xa7\xe3\x81\x9f\xe3\x82\x89\xe3\x82\x81\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\xe5\xae\x9f\xe9\x9a\x9b\xe3\x81\xab\xe3\x81\xaf\xe3\x80\x8cWenn ist das Nunstuck git und'

but in Python 2.3 cvs I get

'\xe6\xad\xa3\xe7\xa2\xba\xe3\x81\xab\xe8\xa8\x80\xe3\x81\x86\xe3\x81\xa8\xe7\xbf\xbb\xe8\xa8\xb3\xe3\x81\xaf\xe3\x81\x95\xe3\x82\x8c\xe3\x81\xa6\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x9b\xe3\x82\x93\xe3\x80\x82\xe4\xb8\x80\xe9\x83\xa8\xe3\x81\xaf\xe3\x83\x89\xe3\x82\xa4\xe3\x83\x84\xe8\xaa\x9e\xe3\x81\xa7\xe3\x81\x99\xe3\x81\x8c\xe3\x80\x81\xe3\x81\x82\xe3\x81\xa8\xe3\x81\xaf\xe3\x81\xa7\xe3\x81\x9f\xe3\x82\x89\xe3\x82\x81\xe3\x81\xa7\xe3\x81\x99\xe3\x80\x82\xe5\xae\x9f\xe9\x9a\x9b\xe3\x81\xab\xe3\x81\xaf\xe3\x80\x8cWenn ist das Nunstuck git u\x00\x00'

Note that the last two characters, which should be `n' and
`d', are now NULs.  My very limited Tim-enlightened
understanding is that encoding a string to UTF-8 should
never produce a string with NULs.
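
The expectation about NULs can be checked directly. The following is a
minimal sketch, written in Python 3 syntax rather than the Python 2 of
the report; the helper name contains_nul is hypothetical and the sample
string is only the tail of the test case above. It relies on the fact
that every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so a
zero byte can only come from a U+0000 in the input.

```python
def contains_nul(data):
    """Return True if the byte string contains a zero byte."""
    return b"\x00" in data


# Tail of the test string from the report (no U+0000 in the input).
text = "\u5b9f\u969b\u306b\u306f\u300cWenn ist das Nunstuck git und"
encoded = text.encode("utf-8", "replace")

# A correct encoder keeps the trailing ASCII and emits no NUL bytes.
print(contains_nul(encoded))
print(encoded.endswith(b"und"))

# A NUL appears in the output only when the input itself contains one.
print(contains_nul("a\x00b".encode("utf-8")))
```

A check like this would fail on the 2.3 cvs output shown above, whose
last two bytes are NULs instead of `n' and `d'.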

----------------------------------------------------------------------

>Comment By: Martin v. Löwis (loewis)
Date: 2002-04-10 23:36

Message:
Logged In: YES 
user_id=21627

There is no bug in pymalloc. The codec wrote beyond the end
of the allocated buffer, which causes undefined behaviour.
The malloc implementation could not possibly know that the
data extends beyond the space it provided to the application.

Python 2.2 suffers from the same problem: If you have a
string of 10 characters, it will allocate 30 bytes. In UCS4
mode, if the first 6 characters each consume 4 bytes, this
will consume 24 bytes, leaving 6 bytes (resizing would only
be triggered if 4 bytes or fewer were left). Now, if the
remaining 4 characters each consume 2 bytes, the total size
written will be 32 bytes, causing a write of 2 bytes into
unallocated memory. So this is the same problem.
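
The arithmetic in the paragraph above can be reproduced with a small
sketch (Python 3 syntax). It only models the byte counts described
there: an initial allocation of 3 bytes per character and a list of
per-character encoded widths. The function name overflow_bytes is
hypothetical, and the codec's resize check is deliberately not
modelled, since in this example it never fires.

```python
def overflow_bytes(widths, bytes_per_char_estimate=3):
    """Bytes written past a buffer sized at a fixed estimate per character."""
    allocated = bytes_per_char_estimate * len(widths)
    written = sum(widths)
    return max(0, written - allocated)


# 6 characters of 4 bytes each, then 4 characters of 2 bytes each.
widths = [4] * 6 + [2] * 4

print(3 * len(widths))         # bytes allocated
print(sum(widths))             # bytes actually written
print(overflow_bytes(widths))  # bytes written past the allocation
```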

About cbWritten: it was introduced in unicodeobject.c 2.41,
where the checkin message says

  New surrogate support in the UTF-8 codec. By Bill Tutt.

So I'd challenge the claim that this is my doing.

As for computing the size in advance: Your arguments on
performance are not convincing, since your measurements were
flawed.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-04-10 22:50

Message:
Logged In: YES 
user_id=38388

Just confirmed: Python 2.2.1 definitely doesn't have
this problem.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-04-10 22:37

Message:
Logged In: YES 
user_id=38388

Fix checked in. Probably does not apply to the 2.2.1 branch
since that branch uses a different technique.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-04-10 20:53

Message:
Logged In: YES 
user_id=38388

I'm not in favour of the precomputation. We already had a
discussion about the performance of this.

About the cbWritten thingie: that was your invention, IIRC :-)
I'll try ripping that bit out again and use pointer arithmetic
instead.

Still, I believe the real cause of the problem is in pymalloc,
since a debugging session indicated that the codec did write
the 'n', 'd' characters. It's the final _PyString_Resize() which
causes these to be dropped during the copying of the
memory block.


----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-04-10 20:07

Message:
Logged In: YES 
user_id=21627

It appears that cbWritten can still run above cbAllocated,
namely if a long sequence of 3-byte characters is followed
by a long sequence of 1-byte or 2-byte characters.

I'm still in favour of dropping the resizing of the result
string, and computing the number of bytes in a first run.
The code becomes clearer that way and more performant; see
attached unicode.diff.
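
The idea of computing the size in a first run can be illustrated as
follows. This is only a sketch in Python 3, not the contents of
unicode.diff (which is not reproduced here); the names utf8_width and
encode_utf8_two_pass are hypothetical, and surrogate handling is left
out. The first pass sums the encoded width of every code point, the
buffer is allocated once at exactly that size, and the second pass
fills it without any resizing.

```python
def utf8_width(cp):
    """Number of bytes UTF-8 needs for the code point cp."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def encode_utf8_two_pass(text):
    # Pass 1: compute the exact output size, so no resize is needed.
    size = sum(utf8_width(ord(ch)) for ch in text)
    out = bytearray(size)

    # Pass 2: write each code point into the preallocated buffer.
    pos = 0
    for ch in text:
        cp = ord(ch)
        n = utf8_width(cp)
        if n == 1:
            out[pos] = cp
        elif n == 2:
            out[pos] = 0xC0 | (cp >> 6)
            out[pos + 1] = 0x80 | (cp & 0x3F)
        elif n == 3:
            out[pos] = 0xE0 | (cp >> 12)
            out[pos + 1] = 0x80 | ((cp >> 6) & 0x3F)
            out[pos + 2] = 0x80 | (cp & 0x3F)
        else:
            out[pos] = 0xF0 | (cp >> 18)
            out[pos + 1] = 0x80 | ((cp >> 12) & 0x3F)
            out[pos + 2] = 0x80 | ((cp >> 6) & 0x3F)
            out[pos + 3] = 0x80 | (cp & 0x3F)
        pos += n

    # The write position must land exactly at the end of the buffer.
    assert pos == size
    return bytes(out)


sample = "\u6b63\u78ba Wenn ist das \u00fc \U0001F600 und"
print(encode_utf8_two_pass(sample) == sample.encode("utf-8"))
```

The trade-off debated in this thread is that the input is scanned
twice, in exchange for never writing past or resizing the result.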


----------------------------------------------------------------------
