[Python-Dev] RE: Ill-defined encoding for CP875?
Tim Peters
tim.one@home.com
Sat, 12 May 2001 17:48:38 -0400
[Martin v. Loewis, whose encyclopedic knowledge of encoding details
still isn't enough to get a clear answer (it's like somebody asking
me for a simple answer to a floating point question <wink>)]
> ...
> So I think we can take one of two approaches:
>
> 1. admit that CP 875 is not round-trippable, and exclude it from the
> test (although when looking at the first 128 characters only, it
> is round-trippable).
As I noted later, 875 is already excluded from the roundtrip test across
range(128, 256). What it's failing is the roundtrip test across range(128):
after unicode("?", "cp875") produces u'\x1a', the following .encode('cp875')
has no way to know which range the original input came from. So it's not
really round-trippable across range(128) either unless more info is given to
.encode().
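To make the ambiguity concrete, here is a minimal sketch using a toy decoding table rather than the real codec. The three table entries and the helper names find_ambiguous and round_trips are assumptions chosen for illustration; they are not copied from cp875.py. The sketch only shows why a many-to-one decoding map cannot be inverted.

```python
# Toy decoding table: byte value -> Unicode code point.
# ASSUMPTION: these entries are illustrative only, not the real cp875.py table.
TOY_DECODING = {
    0x3F: 0x001A,  # a byte in range(128) decoding to SUBSTITUTE
    0xDC: 0x001A,  # a byte in range(128, 256) also decoding to SUBSTITUTE
    0xC1: 0x0041,  # a byte with a unique target (LATIN CAPITAL LETTER A)
}


def find_ambiguous(decoding):
    """Map each code point reached from more than one byte to those bytes."""
    sources = {}
    for byte, code_point in decoding.items():
        sources.setdefault(code_point, []).append(byte)
    return {cp: sorted(bs) for cp, bs in sources.items() if len(bs) > 1}


def round_trips(decoding, byte_range):
    """True if no defined byte in byte_range decodes to an ambiguous code point."""
    ambiguous = find_ambiguous(decoding)
    return all(decoding[b] not in ambiguous for b in byte_range if b in decoding)


if __name__ == "__main__":
    print(find_ambiguous(TOY_DECODING))
    print(round_trips(TOY_DECODING, range(128)))
```

With this table, any range containing 0x3f fails the check, because the encoder sees only U+001A and cannot tell which byte it came from.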
> 2. remove the SUBSTITUTE mappings from CP875, acknowledging that
> apparently these characters have no meaning in that code page.
> Unfortunately, I could not find any official IBM documentation
> page that lists the characters supported in each of the EBCDIC
> code pages.
>
> The second seems to be more correct to me, although it is a deviation
> from the Unicode consortium publications.
Until you and MAL agree on the best thing to do (I have no opinion: my only
exposure to Unicode in daily programming life remains the Python test suite),
I'm going to opt for #1: as cp875.py stands today, it's simply a fact that
it's not round-trippable across any range including 0x3f.