[Python-Dev] RE: Ill-defined encoding for CP875?
M.-A. Lemburg
mal@lemburg.com
Tue, 15 May 2001 10:32:14 +0200
Tim Peters wrote:
>
> [M.-A. Lemburg]
> > The problem is: which part would raise the exception -- the
> > encoder or the decoder ?
>
> Since I don't yet use any of this stuff for real, I have no idea: seems
> mostly a question of pragmatics, and I don't have any feel for how cp875
> users would view it.
If there are any... that code page dates back to 1996 and comes
from the EBCDIC world.
> > Here are some more options:
> >
> > * sort the items before creating the encoding table from the
> > decoding one (makes the mapping stable)
>
> If users don't care that round-trip can fail silently, fine.
>
> > * map keys which have multiple mappings in the encoding table
> > to None -- this causes their usage to raise an exception
> > (undefined mapping)
>
> If users don't care that they'll get an exception when they try something
> that can't be round-tripped, fine. Or would this depend on the value of the
> "errors" argument too? Then it's easier to impose.
The errors argument tells the codecs what to do in case a mapping
fails (from codecs.py):

    The .encode()/.decode() methods may implement different error
    handling schemes by providing the errors argument. These
    string values are defined:

     'strict'  - raise a ValueError error (or a subclass)
     'ignore'  - ignore the character and continue with the next
     'replace' - replace with a suitable replacement character;
                 Python will use the official U+FFFD REPLACEMENT
                 CHARACTER for the builtin Unicode codecs.

'strict' is the default for all operations that deal with auto-
conversion. 'ignore' and 'replace' allow silently ignoring the
problem.
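
As a rough illustration of the three values, here is a small sketch
using the builtin ASCII codec. The sample string is made up for the
example, and the spelling follows present-day Python 3 rather than
the interpreter this thread was written against:

```python
# Sample text with one character ('é', U+00E9) that ASCII cannot encode.
text = "caf\u00e9"

# 'strict': the codec raises; UnicodeEncodeError is a ValueError subclass.
try:
    text.encode("ascii", "strict")
    strict_error = None
except ValueError as exc:
    strict_error = type(exc).__name__

# 'ignore': the offending character is silently dropped.
ignored = text.encode("ascii", "ignore")

# 'replace' when encoding: the codec substitutes '?'.
replaced = text.encode("ascii", "replace")

# 'replace' when decoding: the builtin codecs substitute U+FFFD.
decoded = b"caf\xe9".decode("ascii", "replace")
```

With 'ignore' and 'replace' the caller never learns that data was
lost, which is why they are of little help for round-trip questions.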
> There's a theme here <wink>: I have no idea how important roundtrip is in
> Unicode Practice, or even that it's a constant across apps and encodings. If
> I write a codec to map all ASCII consonants to u"k" and vowels to u"a", I
> wouldn't care that I can't get "love" back from u"kaka" <wink>.
Round-tripping is obviously very important if you use Unicode
as the basis for working on text. I don't know the reasoning
behind making cp875 fail the round-trip -- Unicode certainly
provides means to make mappings round-trip safe (e.g. by falling
back on the Unicode private use areas).
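
To make the options discussed above a bit more concrete, here is a
minimal sketch of the two ways of inverting a decoding table, plus
one possible reading of the private use area idea. The three-entry
table is a toy, not the real cp875 mapping, and the helper names
(invert_stable, invert_strict, make_round_trip_safe) are hypothetical,
not part of the codecs module:

```python
# Toy decoding map (byte value -> Unicode code point), NOT the cp875 table.
# Bytes 0x02 and 0x03 decode to the same code point, so the inverse
# mapping is ambiguous.
decoding_map = {0x01: 0x0041, 0x02: 0x001A, 0x03: 0x001A}


def invert_stable(dmap):
    """Option 1: walk the items in sorted order so the result is stable.

    For duplicated code points the highest byte value wins; the
    round-trip for the other bytes still fails silently.
    """
    emap = {}
    for byte, code in sorted(dmap.items()):
        emap[code] = byte
    return emap


def invert_strict(dmap):
    """Option 2: map ambiguous code points to None (undefined mapping)."""
    emap = {}
    for byte, code in sorted(dmap.items()):
        emap[code] = None if code in emap else byte
    return emap


def make_round_trip_safe(dmap, pua_start=0xE000):
    """Give every duplicated target after the first its own code point
    from the Basic Multilingual Plane private use area (U+E000 onwards).

    Assumption: the table does not already use those private use code
    points, and there are few enough duplicates to fit in the area.
    """
    seen = set()
    safe = {}
    next_pua = pua_start
    for byte, code in sorted(dmap.items()):
        if code in seen:
            code = next_pua
            next_pua += 1
        seen.add(code)
        safe[byte] = code
    return safe
```

With the toy table, invert_stable() picks byte 0x03 for U+001A,
invert_strict() leaves U+001A undefined, and the table returned by
make_round_trip_safe() inverts without any undefined entries.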
--
Marc-Andre Lemburg
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/