[ python-Bugs-1324237 ] ISO8859-9 broken
SourceForge.net
noreply at sourceforge.net
Mon Oct 24 16:51:34 CEST 2005
Bugs item #1324237, was opened at 2005-10-11 23:35
Message generated for change (Settings changed) made by lemburg
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1324237&group_id=5470
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Unicode
Group: Python 2.4
>Status: Closed
>Resolution: Wont Fix
Priority: 5
Submitted By: Eray Ozkural (exa)
Assigned to: M.-A. Lemburg (lemburg)
Summary: ISO8859-9 broken
Initial Comment:
Probably not limited to ISO8859-9.
The problem is that the encoding names returned by getlocale()
and getpreferredencoding() are not guaranteed to work
with, say, the encode method of a string.
I'm on MDK 10.2 and I switch to the Turkish locale:
>>> import locale, sys
>>> locale.setlocale(locale.LC_ALL, '')
'tr_TR'
There is nothing in sys.stdout.encoding!
>>> sys.stdout.encoding
>>>
So I take a look at the encoding:
>>> locale.getlocale()
['tr_TR', 'ISO8859-9']
>>> locale.getpreferredencoding()
'ISO-8859-9'
Too bad I cannot use either encoding name to encode innocent
Unicode strings:
>>> a = unicode('André','latin-1')
>>> print a.encode(locale.getpreferredencoding())
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
LookupError: unknown encoding: ISO-8859-9
>>> print a.encode(locale.getlocale()[1])
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
LookupError: unknown encoding: ISO8859-9
So I take a look at the Python page and I see that all encoding
names are in lowercase. That's no good, because:
>>> locale.getpreferredencoding().lower()
'\xfdso-8859-9'
(see bug 1193061 )
So I have to do this by hand! But of course this is
unacceptable for any locale aware application.
>>> print a.encode('iso-8859-9')
André
Expected:
1. I expect the encoding string returned by
getpreferredencoding and getlocale to be *identical*
2. I expect the encoding string returned to *work* with
encode method and in general *any* function that accepts
locales.
Got:
1. Different, ad hoc strings
2. Not all aliases present, only lowercases present, no
reliable way to find a canonical locale name.
Recommendations:
a. Please consider the Java-like solution to make Locale
into a class or an enum, something reliable, rather than
just a string.
b. Please test the locale functions in locales other than
US (that is not really a locale anyway)
----------------------------------------------------------------------
>Comment By: M.-A. Lemburg (lemburg)
Date: 2005-10-24 16:51
Message:
Logged In: YES
user_id=38388
I can only repeat: Python will not work if you set up the
GLIBC to have it convert ASCII characters from lower to
upper or vice-versa to characters outside the ASCII range.
Please reread my reply.
If you write a locale-aware application that deals with text
data, you should use Unicode to store the text data - not
8-bit strings. And no, writing a locale-aware application
does not mean that you start it up with setlocale(LC_ALL,
'') - this simply doesn't work and is also the reason why
the locale module goes to great lengths to use this C API
only temporarily, in order to apply a few
conversions.
If you think that we should have locale dependent string
conversion functions that work in the same way (temporarily
set a certain locale and then reset it to what it was
previously set to), please provide a patch for the locale
module.
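
A minimal sketch of the "temporarily set a certain locale and then reset it"
pattern described above, written for current Python. The helper name
temporary_locale is hypothetical and is not part of the locale module; the
sketch also assumes a single-threaded program, since setlocale() changes
process-wide state.

```python
import locale
from contextlib import contextmanager


@contextmanager
def temporary_locale(category, name):
    """Switch `category` to locale `name`, then restore the old setting.

    Hypothetical helper, not an API of the locale module.
    """
    saved = locale.setlocale(category)  # no second argument: query only
    locale.setlocale(category, name)
    try:
        yield
    finally:
        # Always put the previous setting back, even on error.
        locale.setlocale(category, saved)


before = locale.setlocale(locale.LC_CTYPE)
with temporary_locale(locale.LC_CTYPE, "C"):
    inside = locale.setlocale(locale.LC_CTYPE)
after = locale.setlocale(locale.LC_CTYPE)
```

The "C" locale is used here only because it exists on every system; a real
caller would pass the locale whose conversions it needs.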
Thanks.
----------------------------------------------------------------------
Comment By: Eray Ozkural (exa)
Date: 2005-10-24 16:01
Message:
Logged In: YES
user_id=1454
First, my system isn't broken. All applications run fine in this particular
locale setting. The system was Mandrake 10.2, and now I have
upgraded to Mandriva 2006, which is the same regarding this matter
(However, I will check once again).
I do not understand your suggestion of not setting the locale to tr_TR. I
am not doing that. I am doing:
locale.setlocale(locale.LC_ALL, '')
which must work for _any_ locale not just one or two. As you know,
that is the standard way of starting up a localized application.
My suggestions stand:
1. Make the locale identifier something other than a string. Make it an
object, just like in the Java standard library.
2. To _all_ text processing functions affected by the locale setting, most
notably the lower() and upper() methods, add an optional locale
argument.
The problem here might be greater than you seem to think it is.
I should be able to use the result of locale.getpreferredencoding()
*without* recourse to any text processing (the frustrating bit here is
that, simply using lower is not sufficient in this case, but that is just a
side matter). The simple answer is that it should return an ID or an
Object that is not text.
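
One way to avoid text processing on the returned name, sketched for current
Python: hand the name straight to the codec registry and work with the
CodecInfo object it returns. The helper name resolve_codec and the UTF-8
fallback are assumptions made for this illustration, not something proposed
in this report.

```python
import codecs
import locale


def resolve_codec(name=None, fallback="utf-8"):
    """Return a CodecInfo for `name`, or for `fallback` if it is unknown.

    Hypothetical helper; no lower()/upper() is applied to the name.
    """
    if name is None:
        # False: do not call setlocale() as a side effect.
        name = locale.getpreferredencoding(False)
    try:
        return codecs.lookup(name)
    except LookupError:
        return codecs.lookup(fallback)


info = resolve_codec("ISO-8859-9")
encoded = "André".encode(info.name)
unknown = resolve_codec("no-such-encoding-xyz")
```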
I suggest you also review how the Java standard library handles these
functions.
At any rate, it is unacceptable for locale-specific functions to not work
in some locales, in a locale-aware application that supports any locale.
Regards,
--
Eray Ozkural, eray at uludag.org.tr
Uludag Developer http://uludag.org.tr
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2005-10-21 16:25
Message:
Logged In: YES
user_id=38388
SF has problems again it seems...
Anyway, I tried to set the tr_TR locale on my system and got
a surprising result:
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'tr_TR')
'tr_TR'
>>> locale.getpreferredencoding().lower()
'ans\xfd_x3.4-1968'
>>> locale.getpreferredencoding()
'ANSI_X3.4-1968'
So I think the problem lies with the fact that
string.lower() is locale dependent and the GLIBC folks chose
a highly incompatible way of dealing with the special
Turkish situation of the capital "I" mapping to lower-case.
While this kind of mapping may make sense for text
processing in applications it certainly does not make sense
when dealing with programming code or things that need to be
specified in plain ASCII.
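
For names that must stay plain ASCII, a case mapping that never consults
the locale can be sketched with a fixed translation table. This is an
illustration in current Python, where str.lower() no longer depends on the
C locale; the helper name ascii_lower is hypothetical.

```python
import string

# Maps A-Z to a-z and nothing else, so the result cannot depend on the locale.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(name):
    """Lower-case only the ASCII letters of `name` (hypothetical helper)."""
    return name.translate(_ASCII_LOWER)


lowered = ascii_lower("ISO-8859-9")
ansi = ascii_lower("ANSI_X3.4-1968")
```

Non-ASCII characters pass through unchanged, so a capital "I" always maps
to "i" regardless of the Turkish rules.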
In short: the encoding used for the tr_TR locale is not
ASCII-compatible and thus not suitable for Python source code.
I'm not sure what to say to this. My only advice is to *not*
set the global locale setting to tr_TR, but only do this
when it comes to actually processing text in an application.
Alternatively, you could write your application text using
Unicode and then use the ISO-8859-9 codec to encode it for I/O.
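
A rough sketch of that alternative in current Python: the text stays
Unicode inside the program and is encoded only at the I/O boundary. The
in-memory BytesIO stands in for a real file or terminal and is an
assumption of this example.

```python
import io

# Stand-in for a file or terminal that expects ISO-8859-9 bytes.
raw = io.BytesIO()
writer = io.TextIOWrapper(raw, encoding="iso-8859-9", newline="\n")

# The application works with Unicode text only.
text = "André ışık\n"
writer.write(text)
writer.flush()

# Bytes as they would reach the outside world.
data = raw.getvalue()
```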
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2005-10-21 16:18
Message:
Logged In: YES
user_id=38388
Something in your installation must be broken: it seems the
system cannot find the ISO-8859-9 codec.
Note that the .encode() method uses the codec registry for
the lookup of the codec. The lookup itself is
case-insensitive and subject to a few other normalizations
(see encodings/__init__.py).
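
As a quick check of that behaviour on a working installation, the following
sketch (current Python) asks the registry for several spellings of the same
name. The particular list of spellings is chosen for this example only.

```python
import codecs

# Different spellings that the registry is expected to normalize.
spellings = ["ISO-8859-9", "iso-8859-9", "ISO8859-9", "iso8859_9"]

# Each lookup returns a CodecInfo; collect the canonical names.
names = {codecs.lookup(s).name for s in spellings}

# Encoding works with any of the spellings.
results = {"André".encode(s) for s in spellings}
```

If every spelling resolves, names and results each hold exactly one entry.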
Please check your system and then report back whether you
still see the reported error.
Thanks.
----------------------------------------------------------------------
Comment By: Eray Ozkural (exa)
Date: 2005-10-11 23:46
Message:
Logged In: YES
user_id=1454
BTW, I put this into Unicode category, because the bugs in it
seemed relevant to localization. Thank you very much for your
consideration.
----------------------------------------------------------------------