[ python-Bugs-1026480 ] iso-latin-1 strings and functions lower & upper

SourceForge.net noreply at sourceforge.net
Tue Sep 14 02:12:59 CEST 2004

Bugs item #1026480, was opened at 2004-09-11 18:28
Message generated for change (Comment added) made by kowaltowski
You can respond by visiting: 

Category: None
Group: Python 2.3
>Status: Closed
Resolution: None
Priority: 5
Submitted By: Tomasz Kowaltowski (kowaltowski)
Assigned to: Nobody/Anonymous (nobody)
Summary: iso-latin-1 strings and functions lower & upper

Initial Comment:
I have no problems in Python in using strings which
contain accented letters (my Emacs has no problems in
producing them using one-byte iso-8859-1 encoding).
However functions 'lower' and 'upper' do not work
properly on these letters as shown below (I hope all
accents appear properly within your browsers):

as = "aáàâãä"      # except for the first 'a', all others have accents
AS = "AÁÀÂÃÄ"      # except for the first 'A', all others have accents
print "direct: %s -- %s" % (as, AS)
print "lower:  %s -- %s" % (as.lower(), AS.lower())
print "upper:  %s -- %s" % (as.upper(), AS.upper())

The output is:
direct: aáàâãä -- AÁÀÂÃÄ
lower:  aáàâãä -- aÁÀÂÃÄ
upper:  Aáàâãä -- AÁÀÂÃÄ

i.e., accented letters (above 128) are not translated.
It did not make any difference to put the line 

# -*- coding: iso-latin-1 -*-

about the encoding as recommended by PEP 0263.

I am not sure whether this is a bug or it is
intentional, i.e. these functions work only for pure
ASCII letters. However it is a major inconvenience for
those who use any language which is not English but
uses the Latin alphabet :-(.

There should be some mechanism to signal these
functions which Latin variant (iso-8859-1, iso-8859-2,
...) is being used, so that they behave properly; e.g.,
an optional second argument?


>Comment By: Tomasz Kowaltowski (kowaltowski)
Date: 2004-09-13 21:12

Logged In: YES 

I guess you are right from a conceptual point of view. It is
just somewhat frustrating because almost every language
which uses the Latin alphabet needs characters above 128 (is
English the only exception?). On the other hand 'lower' and
'upper' work for Unicode (really utf-8) representation in
which many alphabets do not even have the concept of lower
and upper cases!

Your suggestion about 'latinlower' and 'latinupper' is
basically what I asked for, but it is about 10 times slower than
direct 'lower' and 'upper' :-(.
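
One way the decode/encode overhead mentioned above might be avoided is to precompute a 256-entry translation table once per encoding and then use the byte-level translate method. The following is only a sketch under stated assumptions: it is written for Python 3 (where 8-bit strings are bytes objects), and build_case_tables, LATIN1_UPPER and LATIN1_LOWER are hypothetical names that do not appear in this thread. No timing claim is made for it.

```python
# Python 3 sketch; build_case_tables is a hypothetical helper, not a stdlib function.
def build_case_tables(coding):
    """Return (upper_table, lower_table) for a single-byte encoding."""
    upper = bytearray(range(256))  # start from the identity mapping
    lower = bytearray(range(256))
    for byte in range(256):
        try:
            char = bytes([byte]).decode(coding)
        except UnicodeDecodeError:
            continue  # byte is unassigned in this encoding; leave it unchanged
        for table, converted in ((upper, char.upper()), (lower, char.lower())):
            try:
                encoded = converted.encode(coding)
            except UnicodeEncodeError:
                continue  # converted character does not exist in this encoding
            if len(encoded) == 1:
                table[byte] = encoded[0]  # keep only one-byte results
    return bytes(upper), bytes(lower)

# Build the tables once, then reuse them for every string.
LATIN1_UPPER, LATIN1_LOWER = build_case_tables('latin-1')

accented = 'aáàâãä'.encode('latin-1')
print(accented.translate(LATIN1_UPPER).decode('latin-1'))  # AÁÀÂÃÄ
```

Bytes whose converted form has no single-byte representation in the chosen encoding (for example 'ß' or 'ÿ' in latin-1 under Python 3's case rules) are deliberately left unchanged in this sketch.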

Thanks anyway -- I guess the matter may be closed.


Comment By: Scott David Daniels (scott_daniels)
Date: 2004-09-13 17:00

Logged In: YES 

Note: lower and upper are defined as for ASCII on strs, 
but work correctly for unicode:
 uas = u"aáàâãä" # except first 'a', all have accents
 UAS = u"AÁÀÂÃÄ" # except first 'A', all have accents
 print u"direct: %s -- %s" % (uas, UAS)
 print u"lower: %s -- %s" % (uas.lower(), UAS.lower())
 print u"upper: %s -- %s" % (uas.upper(), UAS.upper())

What you are asking is pretty hopeless.  With two 
modules loaded with differing encodings, whose idea of 
"how to uppercase an 8-bit character" should be used?

What you might want to use is:
  def codedupper(coding, string):
      return string.decode(coding).upper().encode(coding)
  def codedlower(coding, string):
      return string.decode(coding).lower().encode(coding)
  def latinupper(string):
      return string.decode('latin-1').upper().encode('latin-1')
  def latinlower(string):
      return string.decode('latin-1').lower().encode('latin-1')

Any of these functions is well-defined even with several 
modules of differing encodings loaded.
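
To illustrate why the encoding has to be named explicitly, here is a small sketch of the same decode/convert/encode idea. It assumes Python 3, where the 8-bit string is a bytes object, and the parameter name data is an assumption of this sketch. The same byte is upper-cased differently depending on which Latin variant is named:

```python
# Python 3 adaptation of the helpers above; operates on bytes objects.
def codedupper(coding, data):
    # Decode to text, upper-case as Unicode, encode back to the same encoding.
    return data.decode(coding).upper().encode(coding)

def codedlower(coding, data):
    return data.decode(coding).lower().encode(coding)

# Byte 0xB1 is a lower-case letter ('ą') in iso-8859-2 but the
# plus-minus sign in latin-1, so only the first call changes it.
print(codedupper('iso-8859-2', b'\xb1'))  # b'\xa1'
print(codedupper('latin-1', b'\xb1'))     # b'\xb1'
```

Under Python 3's case rules, a conversion whose result does not exist in the target encoding (such as upper-casing 'ÿ' in latin-1) raises UnicodeEncodeError in this sketch.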

