[Python-bugs-list] [ python-Bugs-554916 ] test_unicode fails in wide unicode build

Sun, 19 Jan 2003 15:02:05 -0800

Bugs item #554916, was opened at 2002-05-11 18:25
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=554916&group_id=5470

Category: Unicode
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Michael Hudson (mwh)
Assigned to: M.-A. Lemburg (lemburg)
Summary: test_unicode fails in wide unicode build

Initial Comment:
Assigned somewhat arbitrarily.

It's a roundtrip test, I think.

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2003-01-20 00:02

Message:
Logged In: YES 
user_id=38388

Michael, is the test still failing, or can I close this?

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-10-10 17:30

Message:
Logged In: YES 
user_id=38388

I'm not exactly sure why things work again, but I do
know that I looked into this some time ago. Perhaps I
simply forgot to close the bug or one of the UTF-8
codec overhauls remedied the problem.

Here's what I get with Python 2.3 (UCS4 build):

>>> len(u'\U000d0000')
1
>>> len(u"\udb00\udc00")
2
>>> u'\U000d0000' == u"\udb00\udc00"
False
>>> len(unicode(u"\udb00\udc00".encode('utf-8'), 'utf-8'))
1
>>> len(unicode(u'\U000d0000'.encode('utf-8'), 'utf-8'))
1

This is what I get with Python 2.2.1:
>>> len(u'\U000d0000')
2
>>> len(u"\udb00\udc00")
2
>>> u'\U000d0000' == u"\udb00\udc00"
1
>>> len(unicode(u"\udb00\udc00".encode('utf-8'), 'utf-8'))
2
>>> len(unicode(u'\U000d0000'.encode('utf-8'), 'utf-8'))
2

There's still a difference there, but the UTF-8 codec behaves
consistently.
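
For reference, the arithmetic behind these numbers can be sketched as
below. This is a minimal illustration written for a current Python 3
interpreter, not the 2.2/2.3 codec code itself; the helper names
combine_surrogates and split_code_point are hypothetical and exist only
for this example.

```python
# Hypothetical helpers for illustration; not part of CPython's codec code.

def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(code_point):
    """Split a code point above U+FFFF into its surrogate pair."""
    if not (0x10000 <= code_point <= 0x10FFFF):
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# The pair from this report maps to the single code point U+D0000.
cp = combine_surrogates(0xDB00, 0xDC00)
utf8 = chr(cp).encode("utf-8")

print(hex(cp))                    # 0xd0000
print(len(utf8))                  # 4
print(len(utf8.decode("utf-8")))  # 1
```

Under these assumptions, a decoder that produces one code point per
four-byte UTF-8 sequence gives length 1, while a representation that
stores the two surrogates separately gives length 2.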

----------------------------------------------------------------------

Comment By: Michael Hudson (mwh)
Date: 2002-10-09 14:57

Message:
Logged In: YES 
user_id=6656

Hmm.  The test has stopped failing, so maybe we can close this.

I'd be happier if I knew why, though.

----------------------------------------------------------------------

Comment By: Michael Hudson (mwh)
Date: 2002-05-13 16:06

Message:
Logged In: YES 
user_id=6656

Even better: 

$ ./python 
Adding parser accelerators ...
Done.
Python 2.2.1 (#1, May 13 2002, 15:02:01) 
[GCC 2.96 20000731 (Red Hat Linux 7.1 2.96-98)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> unicode(u"\udb00\udc00".encode("utf-8"), "utf-8") == u"\udb00\udc00"
0
[18762 refs]

but the test passes.  And there was me thinking that it
wasn't a problem on the release22-maint branch.

----------------------------------------------------------------------

Comment By: Michael Hudson (mwh)
Date: 2002-05-13 15:58

Message:
Logged In: YES 
user_id=6656

>>> a = u"\udb00\udc00"
[20811 refs]
>>> b = unicode(a.encode("utf-8"), "utf-8")
[21061 refs]
>>> a, b 
(u'\U000d0000', u'\U000d0000')
[21063 refs]
>>> len(a), len(b)
(2, 1)
[21063 refs]

Erm...?
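
One possible reading of the output above is that the two strings print
the same way but are stored differently: one as two surrogate code
units, the other as a single code point. The sketch below shows that
distinction in a current Python 3 interpreter, which is an assumption
and not the build used in this report. The helper merge_surrogates is
hypothetical.

```python
# Two surrogate code units versus the single code point they denote.
pair = "\udb00\udc00"
single = "\U000d0000"


def merge_surrogates(text):
    """Fold surrogate pairs into single code points via UTF-16.

    Hypothetical helper for illustration only.
    """
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


print(len(pair), len(single))            # 2 1
print(pair == single)                    # False
print(merge_surrogates(pair) == single)  # True
```

Comparing after such a normalization step makes the two forms equal,
which is roughly what a roundtrip test needs if both forms can occur.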

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2002-05-13 15:38

Message:
Logged In: YES 
user_id=89016

The minimal failing testcase is:

>>> unicode(u"\udb00\udc00".encode("utf-8"), "utf-8") == u"\udb00\udc00"
False

which is strange, because they *seem* to be the same:

>>> u"\udb00\udc00"
u'\U000d0000'
>>> unicode(u"\udb00\udc00".encode("utf-8"), "utf-8")
u'\U000d0000'

----------------------------------------------------------------------
