[ python-Bugs-1076790 ] test test_codecs failed

Fri Dec 3 15:50:13 CET 2004

Bugs item #1076790, was opened at 2004-12-01 15:41
Message generated for change (Comment added) made by lemburg
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1076790&group_id=5470

Category: Python Library
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: Michal Čihař (nijel)
Assigned to: Nobody/Anonymous (nobody)
Summary: test test_codecs failed

Initial Comment:
test test_codecs failed -- Traceback (most recent call last):
  File "/usr/src/packages/BUILD/Python-2.4/Lib/test/test_codecs.py", line 446, in test_nameprep
    raise test_support.TestFailed("Test 3.%d: %s" % (pos+1, str(e)))
TestFailed: Test 3.5: u'\u0143 \u03b9' != u'\u0144 \u03b9'
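
The failing comparison can be tried in isolation. The following is only a minimal sketch, assuming a modern Python 3 interpreter rather than the Python 2.4 build from this report. The input string (U+0143 followed by U+037A) is an assumption based on the nameprep test vector for case 3.5; only the expected value comes from the traceback above.

```python
from encodings.idna import nameprep

# Assumed input for test 3.5: U+0143 (capital N with acute) + U+037A.
raw = "\u0143\u037a"
# Expected value, as shown on the right-hand side of the traceback.
expected = "\u0144 \u03b9"

result = nameprep(raw)
print(ascii(result), "expected", ascii(expected), "match:", result == expected)
```

If the case mapping used by the interpreter left U+0143 unchanged, the result would start with \u0143, which is the shape of the reported failure.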

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2004-12-03 15:50

Message:
Logged In: YES 
user_id=38388

Thanks for the tests. 

Looks to me as if the trouble of keeping the wctype support
and working around quirks with the locales is not worth it.

I think it's better to remove the support altogether and
stick with the builtin type database.

----------------------------------------------------------------------

Comment By: Michal Čihař (nijel)
Date: 2004-12-03 14:26

Message:
Logged In: YES 
user_id=192186

without wctype: 100x test_codecs: 10.209s, libpython size: 1140098
with wctype: 100x test_codecs: 10.120s (removed one failing test), libpython size: 1140314

----------------------------------------------------------------------

Comment By: Michal Čihař (nijel)
Date: 2004-12-03 14:13

Message:
Logged In: YES 
user_id=192186

After talking to a glibc developer: towlower/towupper will
never work as expected with POSIX/C locales (because anything
besides a-z is not an alpha character for these).

I can give some performance results, but even without tests,
it looks to me like a good idea to drop support for this.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-12-03 13:16

Message:
Logged In: YES 
user_id=38388

Thanks for the patch. I see a few problems with this
approach, though:

* We break binary compatibility depending on the configure
settings used for building Python; if this is really
necessary we should place the changes into the
_PyUnicode_ToLowerCase() et al. APIs defined in unicodectype.c

* I'm not sure whether there is any performance or memory
usage win in using the wctype functions from glibc: the
Unicode type mapping DB table has to be included anyway (due
to the title case mapping), so the only win I could see is a
performance one and given that towlower et al. do seem to be
locale aware I have strong doubts that these functions are
actually faster than the lookup in our own database.

Could you check whether using the wctype functions from
glibc does have any effect on size of the interpreter and
performance of e.g. .lower() and .upper() ?

If not, I'm inclined to remove the wctype function support
altogether.
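
A rough way to measure the .lower() and .upper() cost is sketched below. It assumes a modern Python 3 interpreter, and the sample string and the helper name time_case_methods are hypothetical choices made for illustration; the numbers only mean something when the same script is run against two differently configured builds.

```python
import timeit

# Hypothetical sample text: the two characters from the failing test, repeated.
SAMPLE = "\u0143\u0144" * 1000


def time_case_methods(repeat=5, number=1000):
    """Return the best observed time per call, in seconds, for each method."""
    results = {}
    for method in ("lower", "upper"):
        timer = timeit.Timer(f"s.{method}()", globals={"s": SAMPLE})
        # Take the minimum over several runs to reduce scheduling noise.
        results[method] = min(timer.repeat(repeat=repeat, number=number)) / number
    return results


if __name__ == "__main__":
    for method, seconds in time_case_methods().items():
        print(f"{method}: {seconds * 1e6:.2f} us per call")
```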

----------------------------------------------------------------------

Comment By: Michal Čihař (nijel)
Date: 2004-12-03 12:46

Message:
Logged In: YES 
user_id=192186

Okay, it IS a locale problem. You should trust the man page :-):
calling towupper/towlower without setting a locale (or with
the POSIX locale) gives the wrong result. After applying the
attached patch, all problems in the tests are gone.
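
The locale dependence can be probed from Python as well. This is a sketch under assumptions: a modern Python 3 on a POSIX system where ctypes.CDLL(None) exposes the C library, and wint_t being an unsigned int (as on glibc). The helper name probe is hypothetical. No particular output is claimed for non-ASCII code points, since that depends on the C library and the installed locales.

```python
import ctypes
import locale

# Assumption: CDLL(None) gives access to the C library's symbols (POSIX only).
libc = ctypes.CDLL(None)
for _name in ("towupper", "towlower"):
    _fn = getattr(libc, _name)
    _fn.argtypes = [ctypes.c_uint]  # wint_t, assumed to be unsigned int
    _fn.restype = ctypes.c_uint


def probe(locale_name, code_point=0x144):
    """Return (towupper, towlower) of code_point under the given LC_CTYPE."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
        return libc.towupper(code_point), libc.towlower(code_point)
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # always restore


if __name__ == "__main__":
    # "C" is always available; "" selects the locale from the environment.
    for name in ("C", ""):
        upper, lower = probe(name)
        print(repr(name), hex(upper), hex(lower))
```

Comparing the two printed lines shows whether the C library maps U+0144 differently once a non-POSIX locale is active.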

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-12-03 12:19

Message:
Logged In: YES 
user_id=38388

Could you run this test (comparing lower and upper) for all
code points in the range(sys.maxunicode) ?!

The origin of the problem could be a different code point.

I don't think that it has to do with locale (but you never
know...), since Unicode is all about unifying locales. The C
functions should not be locale aware (even though the man
page says it depends on LC_CTYPE).
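
Such a scan over all code points could look roughly like this. It is a sketch, not the script used in this report: it assumes a modern Python 3 on a POSIX system where ctypes.CDLL(None) exposes towlower/towupper, and the name case_mismatches is hypothetical. Python 3 ships a newer Unicode database than 3.2, so some mismatches may simply reflect a version difference.

```python
import ctypes
import sys

# Assumption: CDLL(None) gives access to the C library's symbols (POSIX only).
libc = ctypes.CDLL(None)
for _name in ("towlower", "towupper"):
    _fn = getattr(libc, _name)
    _fn.argtypes = [ctypes.c_uint]  # wint_t, assumed to be unsigned int
    _fn.restype = ctypes.c_uint


def case_mismatches(start=0, stop=sys.maxunicode + 1):
    """Yield (code point, C lower, C upper) where libc and Python disagree."""
    for cp in range(start, stop):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates have no case mapping
        py_lower, py_upper = chr(cp).lower(), chr(cp).upper()
        # Skip one-to-many mappings, which towlower/towupper cannot express.
        if len(py_lower) != 1 or len(py_upper) != 1:
            continue
        c_lower, c_upper = libc.towlower(cp), libc.towupper(cp)
        if (c_lower, c_upper) != (ord(py_lower), ord(py_upper)):
            yield cp, c_lower, c_upper


if __name__ == "__main__":
    # Latin Extended-A only, to keep the demo short; widen the range as needed.
    for cp, c_lower, c_upper in case_mismatches(0x100, 0x180):
        print(f"U+{cp:04X}: towlower={c_lower:#x} towupper={c_upper:#x}")
```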

----------------------------------------------------------------------

Comment By: Michal Čihař (nijel)
Date: 2004-12-03 12:03

Message:
Logged In: YES 
user_id=192186

However, when I write a simple C program containing:

    s = 0x143;
    printf("%lc %lc %lc\n", s, towupper(s), towlower(s));
    s = 0x144;
    printf("%lc %lc %lc\n", s, towupper(s), towlower(s));

I get the expected results, and they're the same as from this
Python code:

s = u'\u0143'
print '%s %s %s' % (s, s.upper(), s.lower())
s = u'\u0144'
print '%s %s %s' % (s, s.upper(), s.lower())

I'm starting to think that it might be something with
locales; I'll investigate it more.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-12-03 11:43

Message:
Logged In: YES 
user_id=38388

Maybe you should add some hooks to the Py_UNICODE_* macros and
recompile (or run the script in a C debugger).

The difference in output is minimal (\u0143 vs. \u0144) which I
believe hints at a change in the used Unicode DB:

0143;LATIN CAPITAL LETTER N WITH ACUTE;Lu;0;L;004E 0301;;;;N;LATIN CAPITAL LETTER N ACUTE;;;0144;
0144;LATIN SMALL LETTER N WITH ACUTE;Ll;0;L;006E 0301;;;;N;LATIN SMALL LETTER N ACUTE;;0143;;0143

The only difference here is the case.

----------------------------------------------------------------------

Comment By: Michal Čihař (nijel)
Date: 2004-12-02 17:38

Message:
Logged In: YES 
user_id=192186

I tried the towupper and towlower functions for all characters
in the failed test and I can see no difference compared to the
Python ones...

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-12-02 17:23

Message:
Logged In: YES 
user_id=38388

The punycode codec uses the .upper() method on Unicode objects.

Since this method uses Py_UNICODE_TOUPPER(), any difference
in case mapping between the Unicode DB used in Python and the
one used in glibc will be noticeable as a result of
--with-wctype-functions.

----------------------------------------------------------------------

Comment By: Michal Čihař (nijel)
Date: 2004-12-02 17:03

Message:
Logged In: YES 
user_id=192186

Compiling without --with-wctype-functions "fixes" this problem.

I still don't see what the wctype functions have to do with
this. They are used for operations like checking whether a
character is numeric, alphanumeric, upper case, ... I'd like
to trace whether this bug is in Python or in glibc, but I
still don't know which glibc functions influence this test.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-12-02 16:40

Message:
Logged In: YES 
user_id=38388

Do you get the same error when compiling without
--with-wctype-functions ?

If not, then we'll just have to close this report as "won't
fix". The reason is that we as Python developers don't have
control over what glibc does or does not do.

Unfortunately, there's no way to disable the failing tests,
since the configure option is not available to the Python
program.

----------------------------------------------------------------------

Comment By: Michal Čihař (nijel)
Date: 2004-12-02 12:07

Message:
Logged In: YES 
user_id=192186

Well, glibc 2.3.3 is reportedly using Unicode DB 3.2, so
there must be a bug either in it or in Python; I can't tell.
Any idea how to find out?

----------------------------------------------------------------------

Comment By: Pierre (pierre42)
Date: 2004-12-01 22:30

Message:
Logged In: YES 
user_id=512388

I have the same problem

----------------------------------------------------------------------

Comment By: Michal Čihař (nijel)
Date: 2004-12-01 19:37

Message:
Logged In: YES 
user_id=192186

I understand the question, but I have no idea how to find
this information inside glibc.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-12-01 19:33

Message:
Logged In: YES 
user_id=38388

The wctype functions must have been built using tables from 
the Unicode code point database. Python's own APIs for this
were built using the Unicode DB 3.2. My question is whether
you know which version the glibc was built from.

It is not surprising that the two tests fail if the underlying 
Unicode DB versions differ.

----------------------------------------------------------------------

Comment By: Michal Čihař (nijel)
Date: 2004-12-01 18:29

Message:
Logged In: YES 
user_id=192186

I'm not sure what "uses" means, but I found several mentions
of Unicode 3.2 in the code and in the changelogs.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-12-01 18:20

Message:
Logged In: YES 
user_id=38388

Ah, now I understand: it is quite possible that the Unicode
database versions differ. Python uses version 3.2.

Do you know which version glibc 2.3.3 uses ?

Note that for portability it is usually better not to use wctype
functions.

----------------------------------------------------------------------

Comment By: Michal Čihař (nijel)
Date: 2004-12-01 17:32

Message:
Logged In: YES 
user_id=192186

The problem seems to be in glibc: when I remove
--with-wctype-functions, the test passes. Or could it be in
Python's interface to the wctype functions?

----------------------------------------------------------------------

Comment By: Michal Čihař (nijel)
Date: 2004-12-01 17:21

Message:
Logged In: YES 
user_id=192186

gcc (GCC) 3.3.4 (pre 3.3.5 20040809)

Yes, I'm building UCS4 version.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-12-01 17:16

Message:
Logged In: YES 
user_id=38388

Sorry: I misread glibc as gcc. Still, this sounds a lot like
a broken compiler.

BTW, are you building a UCS4 version ?

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-12-01 17:15

Message:
Logged In: YES 
user_id=38388

The tests pass just fine on my machine. 

Is it possible that your compiler is broken ? 
gcc 2.3.3 is *very* old !

----------------------------------------------------------------------

Comment By: Michal Čihař (nijel)
Date: 2004-12-01 16:26

Message:
Logged In: YES 
user_id=192186

System information:
i386
kernel 2.6.8
glibc 2.3.3

----------------------------------------------------------------------

Comment By: Michal Čihař (nijel)
Date: 2004-12-01 15:59

Message:
Logged In: YES 
user_id=192186

It's a clean build root with no other Python, so it has no
chance to pick up bad modules.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-12-01 15:53

Message:
Logged In: YES 
user_id=38388

Please make sure that Python is picking up the correct modules.
You can do so by running Python in verbose mode (python -vv).

----------------------------------------------------------------------
