[Python-Dev] Ill-defined encoding for CP875?
Martin v. Loewis
martin@loewis.home.cs.tu-berlin.de
Sat, 12 May 2001 22:12:39 +0200
> But I don't know whether the ambiguity in cp875 is a bug or an
> undocumented feature
The official (as in "as official as it gets") mapping between CP 875
and Unicode is at
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/EBCDIC/CP875.TXT
This is also the file which served as an input to generate cp875.py.
U+001A, the character that the bytes in question are mapped to, is
indeed named "SUBSTITUTE", apparently following the
definition in
http://www.its.bldrdoc.gov/fs-1037/dir-035/_5170.htm
# substitute character (SUB): A control character that is used in the
# place of a character that is recognized to be invalid or in error or
# that cannot be represented on a given device.
That would suggest that these characters in EBCDIC 875 do not have
equivalents in Unicode. However,
http://www.kostis.net/charsets/ebc875.htm
suggests that the characters in question (3F, DC, E1, EC, ED, FC, and
FD) have no character meaning at all.
It seems that IBM's ICU library also maps U+001A to character 3F, see
http://oss.software.ibm.com/developerworks/opensource/cvs/icu/data/ibm-875_P100-2000.ucm?rev=1.1&content-type=text/x-cvsweb-markup
It appears, from looking at
http://www.natural-innovations.com/boo/asciiebcdic.html
that byte 3F *is* the substitution character in EBCDIC. So it is a bug
in the CP875 codec to encode Unicode SUBSTITUTE as an arbitrary one of
the EBCDIC bytes that decode to SUBSTITUTE; I think cp875 should be
corrected to always map U+001A to 3F. That is not something the
generator can currently do, though.
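
To make the ambiguity concrete, here is a minimal sketch. It does not
use the real cp875.py; the table is a hand-written excerpt holding only
the seven bytes listed earlier plus 0x40 (SPACE) as an ordinary entry,
and the names DECODING, naive_encoding_map, pinned_encoding_map and
failed_round_trips are hypothetical, chosen for illustration. It assumes
the generator inverts the decoding table blindly, so whichever byte comes
last wins for U+001A.

```python
SUB = "\x1a"  # U+001A SUBSTITUTE

# Excerpt of the decode direction: seven bytes all decode to SUBSTITUTE.
# 0x40 -> SPACE is included as an ordinary, unambiguous entry.
DECODING = {0x40: " "}
for byte in (0x3F, 0xDC, 0xE1, 0xEC, 0xED, 0xFC, 0xFD):
    DECODING[byte] = SUB


def naive_encoding_map(decoding):
    """Invert the table blindly; the last byte seen for a character wins."""
    return {char: byte for byte, char in decoding.items()}


def pinned_encoding_map(decoding, pins):
    """Invert the table, then force selected characters to fixed bytes."""
    encoding = naive_encoding_map(decoding)
    encoding.update(pins)
    return encoding


def failed_round_trips(decoding, encoding):
    """Return, sorted, the bytes b for which encode(decode(b)) != b."""
    return sorted(b for b, ch in decoding.items() if encoding[ch] != b)


naive = naive_encoding_map(DECODING)
pinned = pinned_encoding_map(DECODING, {SUB: 0x3F})

print(hex(naive[SUB]))   # arbitrary: depends on table order
print(hex(pinned[SUB]))  # always 0x3f
print([hex(b) for b in failed_round_trips(DECODING, pinned)])
```

Pinning U+001A to 3F makes the encode direction deterministic, but it
does not make the code page round-trippable: in this excerpt six of the
seven bytes still come back as a different byte after decode and encode.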
So I think we can take one of two approaches:
1. admit that CP 875 is not round-trippable, and exclude it from the
test (although when looking at the first 128 characters only, it
is round-trippable).
2. remove the SUBSTITUTE mappings from CP875, acknowledging that
apparently these characters have no meaning in that code page.
Unfortunately, I could not find any official IBM documentation
page that lists the characters supported in each of the EBCDIC
code pages.
The second seems more correct to me, although it deviates from what
the Unicode consortium publishes.
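
The two approaches can be sketched roughly as follows. This is only an
illustration on a hand-written excerpt of the table, not the actual test
suite or codec generator; all function names are hypothetical. Keeping
3F -> U+001A in approach 2 is an assumption on my part (based on 3F being
the EBCDIC substitution character); passing keep=None removes all seven
mappings instead.

```python
SUB = "\x1a"  # U+001A SUBSTITUTE
AMBIGUOUS = (0xDC, 0xE1, 0xEC, 0xED, 0xFC, 0xFD)


def excerpt_table():
    """Hand-written excerpt: 0x3F and six other bytes decode to SUB."""
    table = {0x3F: SUB, 0x40: " "}
    table.update({byte: SUB for byte in AMBIGUOUS})
    return table


def is_round_trippable(decoding):
    """True if no two bytes decode to the same character."""
    return len(set(decoding.values())) == len(decoding)


def round_trip_test(tables, skip=()):
    """Approach 1: check every table except those explicitly skipped."""
    return {name: is_round_trippable(table)
            for name, table in tables.items() if name not in skip}


def drop_substitute_mappings(decoding, keep=0x3F):
    """Approach 2: leave bytes that decode to SUB undefined, except `keep`."""
    return {byte: char for byte, char in decoding.items()
            if char != SUB or byte == keep}


table = excerpt_table()
fixed = drop_substitute_mappings(table)

print(round_trip_test({"cp875": table}))                   # table fails
print(round_trip_test({"cp875": table}, skip=("cp875",)))  # nothing checked
print(round_trip_test({"cp875": fixed}))                   # table passes
```

With approach 1 the table stays as published and the test simply does
not look at it; with approach 2 the table itself changes, and decoding
any of the dropped bytes would become an error.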
Regards,
Martin