[ python-Bugs-1528802 ] Turkish Character

Tue Aug 29 19:43:30 CEST 2006

Bugs item #1528802, was opened at 2006-07-26 09:05
Message generated for change (Comment added) made by lemburg
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1528802&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Unicode
Group: Python 2.5
Status: Open
Resolution: None
Priority: 6
Submitted By: Ahmet Bişkinler (ahmetbiskinler)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Turkish Character

Initial Comment:
>>> print "Mayıs".upper()
MAYıS
>>> import locale
>>> locale.setlocale(locale.LC_ALL,'Turkish_Turkey.1254')
>>> print "Mayıs".upper()
MAYıS

>>> print "ğüşiöçı".upper()
ğüşIöçı

MAYıS     should be MAYIS
ğüşIöçı   should be ĞÜŞİÖÇI

but
>>> "Mayıs".upper()
"MAYIS"

is right

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2006-08-29 19:43

Message:
Logged In: YES 
user_id=38388

Could you test this with Unicode strings, i.e. u"...".upper()?

It would also help if you'd provide the repr() version of
the strings; that makes testing on non-Turkish systems easier.
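
A minimal sketch of producing such escaped forms, assuming
Python 3, where ascii() gives the kind of escaped output that
repr() gave for Python 2 unicode objects. The two sample strings
are the reporter's; the variable names are illustrative only.

```python
# Python 3 sketch; ascii() escapes every non-ASCII character, much as
# repr() did for Python 2 unicode objects.
samples = ["Mayıs", "ğüşiöçı"]  # the reporter's two strings

# Pair each escaped input with its escaped uppercase form.
escaped = [(ascii(text), ascii(text.upper())) for text in samples]

for before, after in escaped:
    # Escaped forms survive any console or mail encoding unchanged.
    print(before, "->", after)
```

The escaped text can be pasted into a report and evaluated on a
system whose console cannot display Turkish characters.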

Thanks.

----------------------------------------------------------------------

Comment By: Ahmet Bişkinler (ahmetbiskinler)
Date: 2006-08-28 15:57

Message:
Logged In: YES 
user_id=1481281

As you saw in the picture, IDLE does its work; it is the
one that is working right.
The Python interpreter (C:\Python25\Python.exe) is the one
with the problem. Does the interpreter generate bug reports
if there is no crash or anything like that? And I don't know
how to file an IDLE bug report from the
interpreter (C:\Python25\Python.exe).

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2006-08-21 12:01

Message:
Logged In: YES 
user_id=38388

Could we please get some things straight first:

1. if you're working with IDLE and it doesn't do what you
expect it to, please file an IDLE bug report, not a Python
one; the same is true for any other Python IDE you are using

2. string's .lower() and .upper() methods rely 100% on the
platform's C lib implementation of these functions; there's
nothing Python can do about bugs in these implementations

3. if you want reproducible behavior across platforms,
please always use Unicode, *not* 8-bit strings, for text data.

I see that #1 has already been done, so the IDLE specific
discussion should continue there.

If #2 is the cause of the problem, then all we can do is point
you to #3.

If #3 fails for some reason, then we should investigate
this. However, be aware that the Unicode database has a
fixed set of case mappings and we currently don't support
extended case mapping which is locale and context sensitive.
Again, patches are welcome.

Please provide your examples using the repr() of the string
or Unicode objects in question. This makes it a lot easier
to test your examples on other platforms.

Thanks.

----------------------------------------------------------------------

Comment By: Ahmet Bişkinler (ahmetbiskinler)
Date: 2006-08-21 09:55

Message:
Logged In: YES 
user_id=1481281

There are still some problems with it, as shown in the image:
http://img205.imageshack.us/img205/3998/turkishcharpythonyu5.jpg
upper() works fine in IDLE (except for uppercasing ı and i),
whereas in the interpreter upper() doesn't work at all.

Another problem is the uppercase of ı (dotless) and i (dotted):
ı (dotless) should become I (dotless)
i (dotted)  should become İ (dotted)
ı = I
i = İ

For more information:
http://www.i18nguy.com/unicode/turkish-i18n.html
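
The mapping above can be sketched as a small helper. This is only
an illustration, assuming Python 3 text strings; turkish_upper is
a hypothetical name and not part of Python's standard library.

```python
DOTTED_CAPITAL_I = "\u0130"  # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"   # ı, LATIN SMALL LETTER DOTLESS I


def turkish_upper(text):
    """Uppercase text using the Turkish rules for i and ı (hypothetical helper)."""
    # Handle the two Turkish-specific letters first, then let the
    # default Unicode mapping take care of every other character.
    text = text.replace("i", DOTTED_CAPITAL_I)
    text = text.replace(DOTLESS_SMALL_I, "I")
    return text.upper()


print(turkish_upper("Mayıs"))    # MAYIS
print(turkish_upper("ğüşiöçı"))  # ĞÜŞİÖÇI
```

The two replacements run before upper() because the default
mapping would otherwise turn both i and ı into a plain I.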

----------------------------------------------------------------------

Comment By: Santiago Gala (sgala)
Date: 2006-08-18 16:37

Message:
Logged In: YES 
user_id=178886

Done: Bug #1542677

----------------------------------------------------------------------

Comment By: Georg Brandl (gbrandl)
Date: 2006-08-17 21:08

Message:
Logged In: YES 
user_id=849994

Please submit that as a separate IDLE bug.

----------------------------------------------------------------------

Comment By: Santiago Gala (sgala)
Date: 2006-08-17 20:58

Message:
Logged In: YES 
user_id=178886

IDLE from 2.5rc1 (svn today) produces a different result
than the console (with my default, utf-8, encoding):

IDLE 1.2c1      
>>> print "á"
á
>>> print len("á")
2
>>> print "á".upper()
á
>>> str("á")
'\xc3\xa1'
>>> print u"á"
Ã¡
>>> print len(u"á")
2
>>> print u"á".upper()
Ã¡
>>> str(u"á")

Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    str(u"á")
UnicodeEncodeError: 'ascii' codec can't encode characters in
position 0-1: ordinal not in range(128)

Again, IDLE 1.1.3 (python 2.4.3) produces a different result:

IDLE 1.1.3      
>>> print "á"
á
>>> print len("á")
2
>>> print "á".upper()
á
>>> str("á")
'\xc3\xa1'
>>> print u"á"
Ã¡
>>> print len(u"á")
2
>>> print u"á".upper()
Ã¡
>>> str(u"á")
'\xc3\x83\xc2\xa1'
>>> 

I'd say IDLE is broken, as it is not able to respect utf-8
for print (or even len) of unicode strings.

OTOH, with some tricks I can manage to get an accented a in
a unicode string in IDLE:

>>> import unicodedata
>>> print unicodedata.lookup("LATIN SMALL LETTER A WITH ACUTE")
á
>>> print len(unicodedata.lookup("LATIN SMALL LETTER A WITH ACUTE"))
1
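
One plausible reading of the IDLE output above, not confirmed in
this report, is that the UTF-8 bytes of the literal were taken as
Latin-1 characters. A Python 3 sketch of that assumption
reproduces both the two-character display and the four bytes
shown by IDLE 1.1.3:

```python
text = "\u00e1"                          # á as a single character
utf8_bytes = text.encode("utf-8")        # the two bytes the editor sent
mojibake = utf8_bytes.decode("latin-1")  # each byte read as one character

# Encoding the mis-decoded text as UTF-8 again doubles the damage.
double_encoded = mojibake.encode("utf-8")

print(len(mojibake))   # 2
print(double_encoded)  # b'\xc3\x83\xc2\xa1'
```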

----------------------------------------------------------------------

Comment By: Georg Brandl (gbrandl)
Date: 2006-08-17 17:08

Message:
Logged In: YES 
user_id=849994

Using Unicode strings, the OP's example works.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2006-08-17 17:04

Message:
Logged In: YES 
user_id=38388

String upper and lower conversions are locale dependent and
implemented by the underlying libc, whereas Unicode
upper/lower conversion is not and only depends on the
Unicode character database.

OTOH, there are special cases where the standard Unicode
upper/lower mapping is not what you might expect, since the
database only provides a single mapping and is not context
aware.

There's nothing we can do if the libc is broken in some
respect. As for the extended case mapping support in
Unicode: patches are welcome.
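
A short sketch of what the single, locale-independent mapping
means for the Turkish letters in this report. It assumes Python 3,
where every str is a Unicode string; the variable names are
illustrative only.

```python
import unicodedata

DOTLESS_I = "\u0131"         # ı
DOTTED_CAPITAL_I = "\u0130"  # İ

# Both lowercase letters map to the same uppercase letter by default.
default_upper = {ch: ch.upper() for ch in ("i", DOTLESS_I)}

# Uppercasing and lowercasing again therefore loses the dotless i.
round_trip = DOTLESS_I.upper().lower()

print(default_upper["i"], default_upper[DOTLESS_I])  # I I
print(round_trip == DOTLESS_I)                       # False
print(unicodedata.name(DOTTED_CAPITAL_I))
```

Turkish expects i to become İ, so the default result for i is the
one a Turkish reader would call wrong.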

----------------------------------------------------------------------

Comment By: Georg Brandl (gbrandl)
Date: 2006-08-17 17:03

Message:
Logged In: YES 
user_id=849994

sgala: it looks like your console sends UTF-8 encoded text.

>>> print "á"
á

print is just printing out a byte string consisting of two
bytes, which your console displays as accent-a.

>>> print len("á")
2

A UTF-8-encoded string containing an accented a has two bytes.

>>> print "á".upper()
á

str.upper() doesn't take locale into account, so the
accented a has no uppercase version defined.

>>> str("á")
'\xc3\xa1'

str() applied to a byte string returns that byte string.
Since return values from functions are printed by the
interactive interpreter using repr() first, you get this
representation (which you could also get from "print
repr('á')").

>>> print u"á"
á

That's also okay. Python knows the terminal encoding and
properly translates the byte string to a unicode string of
one character. On printout, it converts it to a UTF-8 string
again, which your terminal displays correctly.

>>> print len(u"á")
1

Since your two-byte-UTF-8 sequence is converted to a unicode
character, the length of this unicode string is 1.
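
The byte-versus-character distinction described so far can be
sketched as follows. This assumes Python 3, where the two types
are spelled bytes and str rather than str and unicode, and where
bytes.upper() only changes ASCII letters.

```python
text = "\u00e1"                 # á, one Unicode character
encoded = text.encode("utf-8")  # what a UTF-8 console sends: b'\xc3\xa1'

byte_length = len(encoded)  # counts bytes
char_length = len(text)     # counts characters

# Neither byte is an ASCII letter, so the byte string is unchanged.
upper_bytes = encoded.upper()

# Decoding with the console's encoding recovers the single character.
decoded = encoded.decode("utf-8")

print(byte_length, char_length)  # 2 1
print(upper_bytes == encoded)    # True
print(decoded == text)           # True
```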

>>> print u"á".upper()
Á

There are comprehensive capitalization tables available for
unicode.

>>> str(u"á")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__builtin__.UnicodeEncodeError: 'ascii' codec can't encode
character u'\xe1' in position 0: ordinal not in range(128)

Applying str() to a unicode string must convert it to a byte
string. If you don't specify an encoding, the default
encoding is "ascii", which can't encode the accented a. Use
u"á".encode("utf-8") instead.

----------------------------------------------------------------------

Comment By: Santiago Gala (sgala)
Date: 2006-08-17 16:59

Message:
Logged In: YES 
user_id=178886

(I tested it in 2.5rc1), 2.4 gives 

>>> str(u"á")
'\xc3\xa1'

instead of the exception

----------------------------------------------------------------------

Comment By: Santiago Gala (sgala)
Date: 2006-08-17 16:53

Message:
Logged In: YES 
user_id=178886

The behaviour of python in this area is confusing. See a
session with my Spanish keyboard:

>>> print "á"
á
>>> print len("á")
2
>>> print "á".upper()
á
>>> str("á")
'\xc3\xa1'
>>> print u"á"
á
>>> print len(u"á")
1
>>> print u"á".upper()
Á
>>> str(u"á")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__builtin__.UnicodeEncodeError: 'ascii' codec can't encode
character u'\xe1' in position 0: ordinal not in range(128)

I guess this is what is happening to the reporter.

This violates the principle of least surprise in so many
different ways that it hurts. Can anybody make sense of it?

----------------------------------------------------------------------

Comment By: Ahmet Bişkinler (ahmetbiskinler)
Date: 2006-08-11 10:10

Message:
Logged In: YES 
user_id=1481281

What happened?
Is it solved?
How is it going?
What is the final step?
...?
...?

Could you please give me some information about the bug?

----------------------------------------------------------------------
