[I18n-sig] Mixed encodings and XML

uche.ogbuji@fourthought.com uche.ogbuji@fourthought.com
Fri, 15 Dec 2000 08:57:18 -0700


> I did exactly this in an internal help page for a company that was
> learning this stuff a year ago.  I don't see a problem, because most
> CJKV encodings are 8-bit and ASCII compatible. Declare the document as
> Latin-1 - because that way your parser will not choke on or corrupt
> bytes above 127.  Then paste in text in whatever encoding you want.
> Any Kanji text in one of the common ASCII-compatible encodings
> (Shift-JIS, EUC, or even UTF-8) will appear as gobbledegook, but the
> underlying bytes will not be corrupted, so readers should be able to
> copy them back out.  You should be able to transform the whole document
> from ISO Latin-1 to UTF-8 and back without loss of data; do a quick
> test from Python to verify it.
> 
> Not exactly an industrial solution, but it's not exactly an industrial
> problem.
> 
> It would of course go horribly wrong if you used exotic encodings like
> UTF-16 with null bytes :-)

Now I _know_ I need more sleep.  I never even tried the simple expedient of 
adding the XML declaration with LATIN-1 encoding.  Not even when the original 
HTML doc gave a strong hint by including a META tag that did the same thing.

Now my problem is completely solved without needing to resort to multiple 
files.
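
For the archives, the quick round-trip test Andy suggests could look 
roughly like the sketch below.  It assumes a Python with separate bytes 
and text types (Python 3); the helper name latin1_roundtrip and the sample 
string are made up for illustration, and it only checks the byte-level 
round trip, not what any particular XML parser does with the result.

```python
def latin1_roundtrip(raw: bytes) -> bytes:
    """Treat raw bytes as Latin-1, transcode to UTF-8 and back."""
    # Latin-1 maps every byte 0..255 to the code point with the same
    # number, so this decode cannot fail and loses nothing.
    text = raw.decode("latin-1")
    # Transcode to UTF-8 (as an XML tool might do internally) ...
    utf8 = text.encode("utf-8")
    # ... and back again to Latin-1 bytes.
    return utf8.decode("utf-8").encode("latin-1")


# Illustrative sample only: two kanji plus some ASCII.
sample = "漢字 kanji"

for codec in ("shift_jis", "euc_jp", "utf-8"):
    raw = sample.encode(codec)
    # The bytes survive the Latin-1 <-> UTF-8 round trip unchanged ...
    assert latin1_roundtrip(raw) == raw
    # ... so decoding with the real encoding recovers the text.
    assert latin1_roundtrip(raw).decode(codec) == sample

# The UTF-16 caveat: ASCII characters carry a null byte there, which is
# not a legal character in an XML document.
assert b"\x00" in "kanji".encode("utf-16-le")
```

Every assertion passing means the pasted bytes come back out intact, 
which is all the trick relies on.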

Thanks, Andy.


-- 
Uche Ogbuji                               Principal Consultant
uche.ogbuji@fourthought.com               +1 303 583 9900 x 101
Fourthought, Inc.                         http://Fourthought.com 
4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
Software-engineering, knowledge-management, XML, CORBA, Linux, Python