[I18n-sig] JapaneseCodecs 1.4.8 released

06 Sep 2002 01:35:33 +0200

Tamito KAJIYAMA <kajiyama@grad.sccs.chukyo-u.ac.jp> writes:

> The only one reason for choosing the Microsoft mapping is that
> it seems better.  The Consortium's mapping has a problem that
> both 0x5c and 0x815f in Shift_JIS are mapped to U+005c, which
> is in turn mapped to 0x5c in Shift_JIS.  In other words, the
> Consortium's mapping is one-to-many.

I can agree on the mapping of 0x815f; it maps to U+FF3C on glibc. I'm
confused about 0x5c; glibc maps it to U+00A5 (YEN SIGN).

Also, where did you get the mapping from the Consortium? I can't find
a current table, but

http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT

maps 0x5C to U+00A5, and 0x815F to U+005C. So this round-trips just
fine.
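
A minimal sketch of that check, assuming toy tables that hold only the
two entries under discussion (the names AS_QUOTED, SHIFTJIS_TXT and
find_collisions are made up for illustration and are not part of
JapaneseCodecs). A decode table can only round-trip if no two
Shift_JIS codes land on the same Unicode code point:

```python
from collections import defaultdict

# Toy decode tables (Shift_JIS code -> Unicode code point). Real tables
# have thousands of entries; these hold only the two codes discussed here.
AS_QUOTED = {0x5C: 0x005C, 0x815F: 0x005C}     # mapping as described in the quoted text
SHIFTJIS_TXT = {0x5C: 0x00A5, 0x815F: 0x005C}  # the two entries read from SHIFTJIS.TXT


def find_collisions(decode_table):
    """Return {code point: [source codes]} for code points reached more than once."""
    sources = defaultdict(list)
    for code, code_point in decode_table.items():
        sources[code_point].append(code)
    return {cp: sorted(codes) for cp, codes in sources.items() if len(codes) > 1}


collisions_quoted = find_collisions(AS_QUOTED)
collisions_txt = find_collisions(SHIFTJIS_TXT)
```

An empty result means every code has its own code point, so an exact
inverse table exists for these entries.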

> On the other hand, the Microsoft's mapping is one-to-one.  There is
> no conversion problem like the one in the Consortium's mapping.
> That's why I think the Microsoft's mapping is better.

There are many ways to achieve this, starting with the question of
whether 0x5c is a reverse solidus or a yen sign. It seems clear that
0x815f is a reverse solidus - the question is whether it is full-width
or not.

This is all unrelated to the other issues that you brought up.

> To tell the truth, I don't care whether a Unicode character that
> corresponds to a character in Shift_JIS is a full-width form or
> not.  What I want to solve by choosing the Microsoft's mapping
> is only the problem just mentioned above.

Ok, then my suggestion would be to make minimal changes to your
current mapping; the candidates to look at seem to be
- Consortium (but where does it have the current Shift JIS mapping?),
- MS
- Linux glibc
- ICU
- Java

Of those, I would pick the one that round-trips, and is closest to
your current mapping.
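
As a rough sketch of that selection rule: the candidate entries below
are the ones quoted in this thread, but CURRENT is a placeholder (I am
not asserting what JapaneseCodecs actually contains), and the helper
names are hypothetical. "Round-trips" is approximated here as "the
decode table is injective", assuming the encoder is its exact inverse.

```python
# Two-entry excerpts (Shift_JIS code -> Unicode code point), as quoted in this thread.
CANDIDATES = {
    "SHIFTJIS.TXT": {0x5C: 0x00A5, 0x815F: 0x005C},
    "glibc": {0x5C: 0x00A5, 0x815F: 0xFF3C},
    "Java (per ICU)": {0x5C: 0x005C, 0x815F: 0xFF3C},
}

# Placeholder for the codec's current table; NOT taken from JapaneseCodecs.
CURRENT = {0x5C: 0x005C, 0x815F: 0x005C}


def round_trips(table):
    """True if no two codes share a code point, so an exact inverse exists."""
    return len(set(table.values())) == len(table)


def distance(a, b):
    """Number of codes whose mapping differs between the two tables."""
    return sum(1 for code in set(a) | set(b) if a.get(code) != b.get(code))


def rank_candidates(current, candidates):
    """Round-tripping candidates, sorted by distance to the current table."""
    return sorted(
        (distance(current, table), name)
        for name, table in candidates.items()
        if round_trips(table)
    )


ranking = rank_candidates(CURRENT, CANDIDATES)
```

With a real table the distance would be computed over all entries, and
ties would have to be settled by other considerations.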

It appears that ICU does not have a SJIS mapping of its own, only the
Linux and the Java one. It appears that Java (according to ICU)
- maps 0x5c to U+005C,
- maps 0x815f to U+FF3C,
- fallback-maps U+00A5 to 0x5c
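
To illustrate what such a fallback entry means, here is a small
sketch using only the three entries listed above (the dictionary and
function names are assumptions for illustration, not ICU or Java
APIs). The fallback is consulted only when encoding, so it does not
disturb the round-trip of the regular entries:

```python
# Java Shift_JIS entries as reported by ICU in the list above.
DECODE = {0x5C: 0x005C, 0x815F: 0xFF3C}             # Shift_JIS code -> Unicode
ENCODE = {cp: code for code, cp in DECODE.items()}  # exact inverse of DECODE
FALLBACK = {0x00A5: 0x5C}                           # one-way: YEN SIGN -> 0x5c


def encode_char(code_point):
    """Encode one code point, trying the exact table before the fallback."""
    if code_point in ENCODE:
        return ENCODE[code_point]
    return FALLBACK[code_point]  # raises KeyError if the character is unmappable


def survives_round_trip(code_point):
    """True if encoding and then decoding gives the same code point back."""
    return DECODE[encode_char(code_point)] == code_point
```

U+005C and U+FF3C survive the round trip, while U+00A5 is encodable but
comes back as U+005C.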

BTW, what does Microsoft map U+00A5 to?

> The interoperability of the MS932 codec and other codecs is a plus.
> I don't think it is necessary.  However, it seems not preferable to
> me that a small package like JapaneseCodecs has an interoperability
> problem due to differences among vendor-specific mappings.

I agree that you should copy mapping data from other sources, instead
of inventing your own. I also agree that it is desirable if the
mapping round-trips (although there might be a good reason to have
one-way mappings, e.g. for the yen sign, if you decide that 0x5c is a
backslash).

I just don't see the point of having shift-jis be a synonym for
cp932. It appears that cp932 is slightly different from shift-jis
(even though there are multiple interpretations of both shift-jis and
cp932 circulating).

It appears that ICU has an exhaustive collection of mappings, which, I
hope, are all correct (e.g. that when they claim they have the glibc
shift_jis, that this really is what glibc does).

> Sorry, I'm not sure I've got the picture of what transliteration
> support would do.  Transliteration support is meant to solve
> interoperability problems due to differences among vendor-
> specific mappings, right?

No. In general, transliteration adds one-way mappings, to allow
mapping a larger subset of Unicode to the target mapping. For example,
"ö" is not supported in ASCII, but a common transliteration (for
German) is to write "oe". So, u"\u00f6".encode("ascii") raises a
UnicodeError, whereas u"\u00f6".encode("ascii//translit-german") might
return "oe" (this is not implemented in Python).

Therefore, a transliteration mapping never round-trips - but it is
still useful as it attempts to map as much of Unicode to the target
encoding as reasonable. In your specific case, you could use
transliteration to map both the default form and the full-width form
from Unicode to the same JIS character - but only one of the forms
will round-trip.
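
The "ascii//translit-german" codec name above is hypothetical. A
similar effect can be sketched with a different mechanism, a custom
error handler registered through codecs.register_error (available in
Pythons that implement PEP 293). The handler name "translit-german"
and the tiny table are assumptions made up for this illustration:

```python
import codecs

# Illustrative German transliteration table; deliberately incomplete.
GERMAN_TRANSLIT = {
    "\u00e4": "ae",
    "\u00f6": "oe",
    "\u00fc": "ue",
    "\u00df": "ss",
}


def translit_german(error):
    """Error handler: replace unencodable characters using GERMAN_TRANSLIT."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    unencodable = error.object[error.start:error.end]
    try:
        replacement = "".join(GERMAN_TRANSLIT[ch] for ch in unencodable)
    except KeyError:
        # No transliteration known: report the original encoding error.
        raise error from None
    # Resume encoding right after the characters that were replaced.
    return replacement, error.end


codecs.register_error("translit-german", translit_german)

original = "sch\u00f6n"
encoded = original.encode("ascii", errors="translit-german")
decoded = encoded.decode("ascii")
```

Decoding yields "schoen" rather than the original string, which is the
one-way behaviour described above.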

I agree that round-trip support is valuable, and should be the
default. I do think there is also a need for a "best effort" mapping.

Regards,
Martin