[ python-Bugs-1528802 ] Turkish Character

Fri Aug 18 16:37:46 CEST 2006

Bugs item #1528802, was opened at 2006-07-26 09:05
Message generated for change (Comment added) made by sgala
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1528802&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Unicode
Group: Python 2.4
Status: Open
Resolution: None
Priority: 6
Submitted By: Ahmet Bişkinler (ahmetbiskinler)
Assigned to: M.-A. Lemburg (lemburg)
Summary: Turkish Character

Initial Comment:
>>> print "Mayıs".upper()
>>> MAYıS
>>> import locale
>>> locale.setlocale(locale.LC_ALL,'Turkish_Turkey.1254')
>>> print "Mayıs".upper()
>>> MAYıS

>>> print "ğüşiöçı".upper()
>>> ğüşIöçı

MAYıS     should be MAYIS
ğüşIöçı   should be ĞÜŞİÖÇI

but 
>>> "Mayıs".upper()
>>> "MAYIS"

is right

----------------------------------------------------------------------

Comment By: Santiago Gala (sgala)
Date: 2006-08-18 16:37

Message:
Logged In: YES 
user_id=178886

Done: Bug #1542677

----------------------------------------------------------------------

Comment By: Georg Brandl (gbrandl)
Date: 2006-08-17 21:08

Message:
Logged In: YES 
user_id=849994

Please submit that as a separate IDLE bug.

----------------------------------------------------------------------

Comment By: Santiago Gala (sgala)
Date: 2006-08-17 20:58

Message:
Logged In: YES 
user_id=178886

Idle from 2.5rc1 (svn today) produces a different result
than console (with my default, utf-8, encoding):

IDLE 1.2c1      
>>> print "á"
á
>>> print len("á")
2
>>> print "á".upper()
á
>>> str("á")
'\xc3\xa1'
>>> print u"á"
Ã¡
>>> print len(u"á")
2
>>> print u"á".upper()
Ã¡
>>> str(u"á")

Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    str(u"á")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

Again, IDLE 1.1.3 (python 2.4.3) produces a different result:

IDLE 1.1.3      
>>> print "á"
á
>>> print len("á")
2
>>> print "á".upper()
á
>>> str("á")
'\xc3\xa1'
>>> print u"á"
Ã¡
>>> print len(u"á")
2
>>> print u"á".upper()
Ã¡
>>> str(u"á")
'\xc3\x83\xc2\xa1'
>>> 

I'd say idle is broken, as it is not able to respect utf-8
for print (or even len) of unicode strings.

OTOH, with some tricks I can manage to get an accented a in
a unicode in idle:

>>> import unicodedata
>>> print unicodedata.lookup("LATIN SMALL LETTER A WITH ACUTE")
á
>>> print len(unicodedata.lookup("LATIN SMALL LETTER A WITH ACUTE"))
1
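
One possible reading of the IDLE output above, offered as a guess and
not confirmed anywhere in this thread: the two UTF-8 bytes of the
accented a are being read as two Latin-1 characters. The sketch below
reproduces the same numbers in Python 3, where str is the Unicode type
and bytes is the byte string. The variable names are made up for
illustration.

```python
# Python 3 sketch. Assumption: IDLE treated UTF-8 input bytes as Latin-1.
utf8_bytes = "\u00e1".encode("utf-8")      # the two bytes 0xC3 0xA1
misread = utf8_bytes.decode("latin-1")     # two characters instead of one

print(len(misread))                        # 2
print(misread.upper() == misread)          # True: upper() changes nothing

# Encoding the misread text as UTF-8 again yields four bytes, the same
# value IDLE 1.1.3 showed for str(u"...").
double_encoded = misread.encode("utf-8")
print(double_encoded)                      # b'\xc3\x83\xc2\xa1'

# Reversing the wrong step recovers the single intended character.
repaired = misread.encode("latin-1").decode("utf-8")
print(len(repaired))                       # 1
```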

----------------------------------------------------------------------

Comment By: Georg Brandl (gbrandl)
Date: 2006-08-17 17:08

Message:
Logged In: YES 
user_id=849994

Using Unicode strings, the OP's example works.
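
A minimal sketch of what that looks like, written for Python 3 (an
assumption on my part; there every string literal is already a Unicode
string, so no u prefix is needed). The dotless i comes out right, but
the plain "i" still maps to "I" and not to the dotted capital the
reporter asked for, because the default mapping is not locale-aware.

```python
# Python 3 sketch: str here plays the role of u"..." in Python 2.
word = "Mayıs"                 # contains U+0131, LATIN SMALL LETTER DOTLESS I
upper_word = word.upper()
print(upper_word)              # MAYIS

letters = "ğüşiöçı"
upper_letters = letters.upper()
print(upper_letters)           # ĞÜŞIÖÇI (plain I, not the dotted capital)

# U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE; the default
# mapping never produces it from an ASCII "i".
print("\u0130" in upper_letters)   # False
```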

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2006-08-17 17:04

Message:
Logged In: YES 
user_id=38388

String upper and lower conversion are locale dependent and
implemented by the underlying libc, whereas Unicode
upper/lower conversion is not and only depends on the
Unicode character database.

OTOH, there are special cases where the standard Unicode
upper/lower mapping is not what you might expect, since the
database only provides a single mapping and is not context
aware.

There's nothing we can do if the libc is broken in some
respect. As for the extended case mapping support in
Unicode: patches are welcome.
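
To illustrate the single-mapping limitation for Turkish, here is a
rough workaround sketch in Python 3. The helpers turkish_upper and
turkish_lower are hypothetical names invented for this example; they
are not part of Python or of any patch discussed here. They only
special-case the dotted and dotless i and leave everything else to the
default Unicode mapping.

```python
def turkish_upper(text):
    """Uppercase text using Turkish rules for the dotted and dotless i."""
    # Map the two Turkish lowercase i's by hand first, then let the
    # default (locale-independent) Unicode mapping handle the rest.
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


def turkish_lower(text):
    """Lowercase text using Turkish rules for the dotted and dotless I."""
    # Same idea in reverse: plain I becomes dotless i, dotted capital
    # I becomes plain i, before the default mapping is applied.
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()


print(turkish_upper("ğüşiöçı"))   # ĞÜŞİÖÇI
print(turkish_upper("Mayıs"))     # MAYIS
print(turkish_lower("MAYIS"))     # mayıs
```

Doing the replacements before calling upper() or lower() keeps the
helpers independent of the process locale, at the cost of being wrong
for text that is not Turkish.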

----------------------------------------------------------------------

Comment By: Georg Brandl (gbrandl)
Date: 2006-08-17 17:03

Message:
Logged In: YES 
user_id=849994

sgala: it looks like your console sends UTF-8 encoded text.

>>> print "á"
á

print is just printing out a byte string consisting of two
bytes, which your console displays as accent-a.

>>> print len("á")
2

A UTF-8-encoded string containing an accented a has two bytes.

>>> print "á".upper()
á

str.upper() doesn't take locale into account, so the
accented a has no uppercase version defined.

>>> str("á")
'\xc3\xa1'

str() applied to a byte string returns that byte string.
Since return values from functions are printed by the
interactive interpreter using repr() first, you get this
representation (which you could also get from
"print repr('á')").

>>> print u"á"
á

That's also okay. Python knows the terminal encoding and
properly translates the byte string to a unicode string of
one character. On printout, it converts it to a UTF-8 string
again, which your terminal displays correctly.

>>> print len(u"á")
1

Since your two-byte-UTF-8 sequence is converted to a unicode
character, the length of this unicode string is 1.

>>> print u"á".upper()
Á

There are comprehensive capitalization tables available for
unicode.

>>> str(u"á")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__builtin__.UnicodeEncodeError: 'ascii' codec can't encode
character u'\xe1' in position 0: ordinal not in range(128)

Applying str() to a unicode string must convert it to a byte
string. If you don't specify an encoding, the default
encoding is "ascii", which can't encode the accented a. Use
u"á".encode("utf-8") instead.

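The same distinctions can be sketched in Python 3, where the two types
are spelled str (text) and bytes (encoded data). This is an
illustration under that assumption, not a transcript of the 2.x session
above; the variable names are made up.

```python
text = "\u00e1"                  # one character: a with acute accent
data = text.encode("utf-8")      # its UTF-8 encoding

print(len(text))                 # 1
print(len(data))                 # 2
print(data)                      # b'\xc3\xa1'

# Text has full case tables; bytes.upper() only touches ASCII letters.
print(text.upper() == "\u00c1")  # True
print(data.upper() == data)      # True

# Encoding to ASCII fails, just as str(u"...") did with the default
# "ascii" encoding in the session above.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error_message = str(exc)
    print(error_message)
```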
----------------------------------------------------------------------

Comment By: Santiago Gala (sgala)
Date: 2006-08-17 16:59

Message:
Logged In: YES 
user_id=178886

(I tested that in 2.5rc1.) Python 2.4 gives

>>> str(u"á")
'\xc3\xa1'

instead of the exception

----------------------------------------------------------------------

Comment By: Santiago Gala (sgala)
Date: 2006-08-17 16:53

Message:
Logged In: YES 
user_id=178886

The behaviour of python in this area is confusing. See a
session with my Spanish keyboard:

>>> print "á"
á
>>> print len("á")
2
>>> print "á".upper()
á
>>> str("á")
'\xc3\xa1'
>>> print u"á"
á
>>> print len(u"á")
1
>>> print u"á".upper()
Á
>>> str(u"á")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__builtin__.UnicodeEncodeError: 'ascii' codec can't encode
character u'\xe1' in position 0: ordinal not in range(128)

I guess this is what is happening to the reporter.

This violates the least surprising behavior principle in so
many different ways that it hurts. Can anybody make sense of it?

----------------------------------------------------------------------

Comment By: Ahmet Bişkinler (ahmetbiskinler)
Date: 2006-08-11 10:10

Message:
Logged In: YES 
user_id=1481281

What happened?
Is it solved?
How is it going?
What is the final step?
...?
...?

Could you please give me some information about the bug?

----------------------------------------------------------------------
