[Python-Dev] RE: Ill-defined encoding for CP875?

M.-A. Lemburg mal@lemburg.com
Tue, 15 May 2001 10:32:14 +0200


Tim Peters wrote:
> 
> [M.-A. Lemburg]
> > The problem is: which part would raise the exception -- the
> > encoder or the decoder ?
> 
> Since I don't yet use any of this stuff for real, I have no idea:  seems
> mostly a question of pragmatics, and I don't have any feel for how cp875
> users would view it.

If there are any... that code page dates back to 1996 and is
based in the EBCDIC world.
 
> > Here are some more options:
> >
> > * sort the items before creating the encoding table from the
> >   decoding one (makes the mapping stable)
> 
> If users don't care that round-trip can fail silently, fine.
> 
> > * map keys which have multiple mappings in the encoding table
> >   to None -- this causes their usage to raise an exception
> >   (undefined mapping)
> 
> If users don't care that they'll get an exception when they try something
> that can't be round-tripped, fine.  Or would this depend on the value of the
> "errors" argument too?  Then it's easier to impose.

The errors argument tells the codecs what to do in case a mapping
fails (from codecs.py):

        The .encode()/.decode() methods may implement different error
        handling schemes by providing the errors argument. These
        string values are defined:

         'strict' - raise a ValueError (or a subclass)
         'ignore' - ignore the character and continue with the next
         'replace' - replace with a suitable replacement character;
                    Python will use the official U+FFFD REPLACEMENT
                    CHARACTER for the builtin Unicode codecs.

'strict' is the default for all operations that deal with
auto-conversion. 'ignore' and 'replace' let the codec pass over
the problem silently instead of raising an exception.
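
As a rough illustration of the three schemes, here is a minimal sketch
using the builtin ASCII codec. It assumes a current Python 3
interpreter (where str is Unicode), and the sample strings are made up
for the example:

```python
# Sample text containing U+00E9, which has no mapping in ASCII.
text = "caf\u00e9"

# 'strict': the failed mapping raises an exception.
try:
    text.encode("ascii", "strict")
    strict_failed = False
except ValueError:
    # UnicodeEncodeError is a subclass of ValueError.
    strict_failed = True

# 'ignore': the unmappable character is dropped.
ignored = text.encode("ascii", "ignore")

# 'replace': the encoder substitutes "?" for the character.
replaced = text.encode("ascii", "replace")

# When decoding, 'replace' inserts U+FFFD REPLACEMENT CHARACTER.
decoded = b"caf\xe9".decode("ascii", "replace")

print(strict_failed, ignored, replaced, ascii(decoded))
```

Only 'strict' tells the caller that something went wrong; the other
two lose information without any notice.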
 
> There's a theme here <wink>:  I have no idea how important roundtrip is in
> Unicode Practice, or even that it's a constant across apps and encodings.  If
> I write a codec to map all ASCII consonants to u"k" and vowels to u"a",  I
> wouldn't care that I can't get "love" back from u"kaka" <wink>.

Round-tripping is obviously very important if you use Unicode
as the basis for working on text. I don't know the reasoning
behind making cp875 fail the round-trip -- Unicode certainly
provides the means to make mappings round-trip safe (e.g. by
falling back to code points in the Private Use Areas).
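
To make the two options quoted earlier concrete, here is a small
sketch of how an encoding table could be derived from a decoding
table. The names build_encoding_map, encode_strict and TOY_DECODING_MAP
are hypothetical, and the toy table is invented for illustration; it is
not the real cp875 mapping and this is not necessarily how the codec
machinery does it:

```python
def build_encoding_map(decoding_map, ambiguous="none"):
    """Invert a byte -> code point map into a code point -> byte map.

    ambiguous="none":  a code point reached from several bytes maps to
                       None (undefined mapping).
    ambiguous="first": the lowest byte value wins; sorting the items
                       makes that choice stable across runs.
    """
    encoding_map = {}
    for byte, char in sorted(decoding_map.items()):
        if char not in encoding_map:
            encoding_map[char] = byte
        elif ambiguous == "none":
            encoding_map[char] = None
        # ambiguous == "first": keep the entry stored earlier
    return encoding_map


def encode_strict(text, encoding_map):
    """Encode text, raising ValueError on missing or undefined mappings."""
    out = bytearray()
    for ch in text:
        byte = encoding_map.get(ord(ch))
        if byte is None:
            raise ValueError("undefined mapping for U+%04X" % ord(ch))
        out.append(byte)
    return bytes(out)


# Invented table: bytes 0x3F and 0xFF both decode to U+001A.
TOY_DECODING_MAP = {0x41: 0x0041, 0x42: 0x0042, 0x3F: 0x001A, 0xFF: 0x001A}

stable_map = build_encoding_map(TOY_DECODING_MAP, ambiguous="first")
strict_map = build_encoding_map(TOY_DECODING_MAP, ambiguous="none")

# Sorted option: byte 0xFF decodes to U+001A but encodes back as 0x3F.
silent_result = encode_strict("\x1a", stable_map)

# None option: the ambiguous code point raises instead.
try:
    encode_strict("\x1a", strict_map)
    raised = False
except ValueError:
    raised = True
```

With the sorted table the round-trip of byte 0xFF fails silently; with
the None table the same input raises, which is the trade-off discussed
above.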

-- 
Marc-Andre Lemburg
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/