[I18n-sig] JapaneseCodecs 1.4.8 released
Tamito KAJIYAMA
kajiyama@grad.sccs.chukyo-u.ac.jp
Fri, 6 Sep 2002 10:38:05 +0900
martin@v.loewis.de (Martin v. Loewis) writes:
|
| > The only reason for choosing the Microsoft mapping is that
| > it seems better. The Consortium's mapping has the problem that
| > both 0x5c and 0x815f in Shift_JIS are mapped to U+005c, which
| > is in turn mapped to 0x5c in Shift_JIS. In other words, the
| > Consortium's mapping is one-to-many.
|
| I can agree on the mapping of 0x815f; it maps to U+FF3C on glibc. I'm
| confused about 0x5c; glibc maps it to U+00A5 (YEN SIGN).
|
| Also, where did you get the mapping from the Consortium? I can't find
| a current table, but
|
| http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT
|
| maps 0x5C to U+00A5, and 0x815F to U+005C. So this roundtrips just
| fine.
I've finally understood what was wrong: the mapping in
JapaneseCodecs has a number of bugs! The Unicode Consortium's
mapping is totally okay, but it had not been implemented in
JapaneseCodecs in the right way (I intended to do so, though).
I got the Consortium's mapping from the URL shown above.
However, I carelessly modified the original mapping
as follows:
the Unicode Consortium's original mapping:
  0x5c   -> U+00A5 -> 0x5c
  0x7e   -> U+203e -> 0x7e
  0x815f -> U+005c -> 0x815f
the current (buggy) mapping in JapaneseCodecs:
  0x5c   -> U+005c -> 0x5c
  0x7e   -> U+007e -> 0x7e
  0x815f -> U+005c -> 0x5c
In other words, I had introduced the non-reversibility problem
myself! I'd like to hit my head against the wall thousands of
times...
It seems that there are two solutions: one is to implement
the Consortium's mapping intact, and the other is to fix the
current buggy mapping so that 0x815f maps to U+ff3c (the latter
means that Java's mapping is adopted, I believe).
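The reversibility problem can be checked mechanically. The following is
only a minimal sketch, not the JapaneseCodecs code: the table names and
the helper non_reversible() are hypothetical, and the tables contain
just the three Shift_JIS codes discussed above. The "fixed" table
assumes the second solution changes nothing except 0x815f.

```python
# Decoding tables (Shift_JIS code -> Unicode code point), restricted to the
# three codes discussed in this message. All names here are illustrative.
CONSORTIUM_DECODE = {0x5C: 0x00A5, 0x7E: 0x203E, 0x815F: 0x005C}
BUGGY_DECODE = {0x5C: 0x005C, 0x7E: 0x007E, 0x815F: 0x005C}
# Assumption: the second solution only remaps 0x815f to U+FF3C.
FIXED_DECODE = {0x5C: 0x005C, 0x7E: 0x007E, 0x815F: 0xFF3C}


def non_reversible(decode_table):
    """Return {code point: [Shift_JIS codes]} for code points hit more than once.

    If two Shift_JIS codes decode to the same code point, encoding that code
    point can restore at most one of them, so the mapping cannot round-trip.
    """
    by_unicode = {}
    for sjis, ucs in decode_table.items():
        by_unicode.setdefault(ucs, []).append(sjis)
    return {ucs: sorted(codes) for ucs, codes in by_unicode.items()
            if len(codes) > 1}


for name, table in [("consortium", CONSORTIUM_DECODE),
                    ("buggy", BUGGY_DECODE),
                    ("fixed", FIXED_DECODE)]:
    collisions = non_reversible(table)
    print(name, {hex(u): [hex(c) for c in cs] for u, cs in collisions.items()})
```

Only the buggy table reports a collision (0x5c and 0x815f both decode
to U+005c); either of the two solutions removes it.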
| > Sorry, I'm not sure I've got the picture of what transliteration
| > support would do. Transliteration support is meant to solve
| > interoperability problems due to differences among vendor-
| > specific mappings, right?
|
| No. In general, transliteration adds one-way mappings, to allow
| mapping a larger subset of Unicode to the target mapping. For example,
| "ö" is not supported in ASCII, but a common transliteration (for
| German) is to write "oe". So, u"\u00f6".encode("ascii") raises a
| UnicodeError, where u"\u00f6".encode("ascii//translit-german") might
| return "oe" (this is not implemented in Python).
|
| Therefore, a transliteration mapping never roundtrips - but it is
| still useful as it attempts to map as much of Unicode to the target
| encoding as reasonable. In your specific case, you could use
| transliteration to map both the default form and the full-width form
| from Unicode to the same JIS - but only one of the forms will
| round-trip.
|
| I agree that round-trip support is valuable, and should be the
| default. I do think there is also a need for a "best effort" mapping.
I see. Transliteration, in the context of JapaneseCodecs, can
be used to provide fallback mappings, right? I agree that such
a "best effort" mapping is useful and surely needed in a variety
of applications.
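As a rough illustration of such a fallback, here is a minimal sketch
that uses an encoding error handler registered with
codecs.register_error() instead of the "ascii//translit-german" codec
name mentioned above, which stays hypothetical. The handler name
"translit_german" and the table GERMAN_FALLBACK are assumptions made up
for this example, and the code is written in Python 3 syntax.

```python
import codecs

# Illustrative one-way fallbacks only; a real table would be much larger.
GERMAN_FALLBACK = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue", "\u00df": "ss"}


def translit_german(error):
    """Encoding error handler: substitute a fallback spelling when one exists."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    # The codec reports the unencodable run as object[start:end].
    chunk = error.object[error.start:error.end]
    try:
        replacement = "".join(GERMAN_FALLBACK[ch] for ch in chunk)
    except KeyError:
        # No fallback known for some character: behave like "strict".
        raise error
    # Return the replacement text and the position where encoding resumes.
    return replacement, error.end


codecs.register_error("translit_german", translit_german)

print("sch\u00f6n".encode("ascii", "translit_german"))
```

The mapping is one-way: decoding the result gives "schoen", not the
original string, so it is a "best effort" option that has to be
requested explicitly while the default strict encoding keeps raising
UnicodeError.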
Thanks a lot!
--
KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>