[I18n-sig] Mixed encodings and XML
uche.ogbuji@fourthought.com
Fri, 15 Dec 2000 08:57:18 -0700
> I did exactly this in an internal help page for a company that was
> learning this stuff a year ago. I don't see a problem, because most
> CJKV encodings are 8-bit and ASCII compatible. Declare the document as
> Latin-1 - because that way your parser will not choke on or corrupt
> bytes above 127. Then paste in text in whatever encoding you want.
> Any Kanji text in one of the common ASCII-compatible encodings
> (Shift-JIS, EUC, or even UTF8) will appear as gobbledegook, but the
> underlying bytes will not be corrupted, so they should be able to
> paste them out. You should be able to transform the whole document
> from iso-latin-1 to utf8 and back without loss of data; do a quick
> test from Python to verify it.
>
> Not exactly an industrial solution, but it's not exactly an industrial
> problem.
>
> It would of course go horribly wrong if you used exotic encodings like
> UTF-16 with null bytes :-)
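For anyone who wants to run the quick test suggested above, here is a minimal sketch. It assumes a current Python 3 (not the Python of this thread's era), and the helper name and sample string are made up for illustration.

```python
def latin1_round_trip(raw: bytes) -> bytes:
    """Treat arbitrary bytes as Latin-1 text, convert to UTF-8 and back."""
    as_text = raw.decode("latin-1")    # each byte 0-255 maps to exactly one code point
    as_utf8 = as_text.encode("utf-8")  # the "document" transformed to UTF-8
    return as_utf8.decode("utf-8").encode("latin-1")  # and back to the original bytes

# Hypothetical sample: ASCII text followed by two kanji.
sample = "Kanji: \u6f22\u5b57"

for codec in ("shift_jis", "euc_jp", "utf-8"):
    raw = sample.encode(codec)
    restored = latin1_round_trip(raw)
    assert restored == raw                    # no byte was lost or altered
    assert restored.decode(codec) == sample   # the real text is still recoverable

# UTF-16 is the odd one out: ASCII characters become byte pairs with a null byte.
utf16_has_nulls = b"\x00" in sample.encode("utf-16-le")
```

The bytes survive only because Latin-1 assigns a character to every byte value; viewed as Latin-1 the text is still gobbledegook until it is decoded with the real codec.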
Now I _know_ I need more sleep. I never even tried the simple expedient of
adding the XML declaration with LATIN-1 encoding. Not even when the original
HTML doc gave a strong hint by adding a META tag that did the same thing.
Now my problem is completely solved without needing to resort to multiple
files.
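A rough sketch of the trick, assuming a current Python 3 and the standard library's xml.etree.ElementTree rather than whatever parser was in use here. The element name, helper names, and sample text are all hypothetical, and it assumes the pasted bytes contain nothing that looks like "<" or "&".

```python
import xml.etree.ElementTree as ET

def wrap_as_latin1_xml(payload: bytes) -> bytes:
    """Put raw bytes inside a document that is declared as Latin-1."""
    declaration = b'<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    return declaration + b"<note>" + payload + b"</note>"

def recover_payload(doc: bytes) -> bytes:
    """Parse the document and turn the Latin-1 text back into raw bytes."""
    root = ET.fromstring(doc)
    return root.text.encode("latin-1")

# Hypothetical sample: two kanji stored as Shift-JIS bytes.
sample = "\u6f22\u5b57"
doc = wrap_as_latin1_xml(sample.encode("shift_jis"))

# The parser sees Latin-1 gobbledegook, but the bytes come back out intact.
recovered = recover_payload(doc).decode("shift_jis")
assert recovered == sample
```

Whoever consumes the document still has to know the real encoding of each pasted chunk, since the declaration only stops the parser from rejecting or corrupting the bytes.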
Thanks, Andy.
--
Uche Ogbuji Principal Consultant
uche.ogbuji@fourthought.com +1 303 583 9900 x 101
Fourthought, Inc. http://Fourthought.com
4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
Software-engineering, knowledge-management, XML, CORBA, Linux, Python