[I18n-sig] Mixed encodings and XML

uche.ogbuji@fourthought.com uche.ogbuji@fourthought.com
Wed, 13 Dec 2000 17:14:40 -0700

> Uche Ogbuji writes:
> > It contains many sections within HTML PREs with the different encodings
> > I mentioned.  They look like
> > 
> > <PRE LANG="zh-TW">
> > ... BIG5-encoded stuff ...
> > </PRE>
> The LANG attribute does not specify an encoding, it specifies a
> language. You cannot safely imply anything about the encoding based on
> the value of the LANG attribute. For example, "zh-TW" text could be
> encoded in Big 5, Big 5+, GBK, CP950, CP936, EUC-CN (depending on the
> text), ISO-2022-CN, ISO-2022-CN-EXT, and others.
> The LANG attribute can be used by the application to help generate the
> appropriate glyph variants, however, though I don't know of any off
> hand that do this.

Makes sense, but I wasn't clear on this.

> > I need to convert the document to XML Docbook format.  My naive attempts
> > at converting to 
> > 
> > <screen xml:lang="zh-TW">
> > ... BIG5-encoded stuff ...
> > </screen>
> >
> > Of course don't work because the parser takes one look at the BIG5 and
> > throws a well-formedness error.
> Which it is required to do, see Section 4.3.3 of the XML specification.

I'm quite aware of this (I read the XML spec more often than I'd like to).  
That's why I said "of course".
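
To make the failure concrete, here is a minimal sketch using the expat parser
from the standard library of a current Python 3. The sample text and the
is_well_formed helper are my own illustration, not taken from the document in
question. With no encoding declaration the parser has to assume UTF-8, and
Big5 byte sequences are generally not valid UTF-8:

```python
import xml.parsers.expat

# Hypothetical sample: two Chinese characters (U+4E2D, U+6587) encoded as Big5.
big5_bytes = "\u4e2d\u6587".encode("big5")

# No XML declaration, so the parser assumes UTF-8 for this byte stream.
doc = b'<screen xml:lang="zh-TW">' + big5_bytes + b"</screen>"


def is_well_formed(data):
    """Return True if the bytes parse as XML, False on a well-formedness error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
        return True
    except xml.parsers.expat.ExpatError:
        return False


print(is_well_formed(doc))
print(is_well_formed(b'<screen xml:lang="zh-TW">ascii only</screen>'))
```

The first call reports a failure because the Big5 lead byte is not a legal
start of a UTF-8 sequence; the ASCII-only element parses without complaint.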

> > Is there any way to manage this besides using XInclude?  Do any of the
> > Python parsers have any tricks that could help?
> Convert all of those sections into Unicode, using UTF-8 as the
> encoding form. You could write a trivial Python script to do this for
> you.

Not what I need, unfortunately.  The whole point of the exercise is to have 
examples in the actual encodings.
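
For reference, the suggested conversion really is only a few lines. This is a
sketch in current Python 3; transcode_to_utf8 is a hypothetical helper name,
and Big5 is only assumed as the source encoding, since (as noted above) the
language tag alone does not tell you which encoding was used:

```python
def transcode_to_utf8(raw, source_encoding="big5"):
    """Decode bytes from the assumed source encoding and re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Hypothetical sample: two Chinese characters (U+4E2D, U+6587) in Big5.
big5_bytes = "\u4e2d\u6587".encode("big5")
utf8_bytes = transcode_to_utf8(big5_bytes)

# The text is preserved, but the byte sequence is not.
print(utf8_bytes.decode("utf-8") == big5_bytes.decode("big5"))
print(utf8_bytes == big5_bytes)
```

The second comparison is the reason this does not help me: the characters
survive, but the original Big5 bytes, which are what the examples are meant
to show, are gone.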

> The bigger problem (IMHO) will be convincing your DocBook tool chain
> to handle the Asian characters. If you find a good solution to that
> (i.e., allowing Simplified and Traditional Chinese, Korean, and (say)
> Thai in a single document) let me know.

Hmm?  My DocBook tool is simply 4XSLT, which already handles the individual
encodings just fine.

Uche Ogbuji                               Principal Consultant
uche.ogbuji@fourthought.com               +1 303 583 9900 x 101
Fourthought, Inc.                         http://Fourthought.com 
4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
Software-engineering, knowledge-management, XML, CORBA, Linux, Python