[I18n-sig] XML and UTF-16

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Fri, 1 Jun 2001 14:59:37 +0200

> > > Yes, I think this would be a good idea. I would use something along
> > > the lines of:
> > 
> > Please have a look at
> > xml.parsers.xmlproc.EntityParser.autodetect_encoding. This almost
> > follows the procedure in the XML recommendation, except that it does
> > not expect "unusual" byte orders (2134, 3412), and that it does not
> > detect EBCDIC.
> I don't have a file EntityParser in the xmlproc subdir... is
> that in CVS somewhere ?

Oops, missed one level of indirection:


And yes, the function is only in CVS, not in a released version.

> Could we maybe have the function autodetect_encoding at
> some higher level in PyXML ?! This is a very basic API and
> doesn't only apply to xmlproc.

We might (contributions are welcome). However, such a function would
not necessarily be usable for xmlproc: xmlproc deals with reading data
in small chunks, expecting that information may be broken at arbitrary
boundaries. For example, would you expect that the autodetection
function looks for the encoding= attribute? That may not be included
in the first fragment of data.
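
To make the chunking concern concrete, here is a minimal sketch of the
byte-level part of the autodetection, loosely following the appendix on
encoding detection in the XML recommendation. It is not the xmlproc
code: the function name sniff_xml_encoding and its return convention
are hypothetical. Like the function discussed above, it ignores the
unusual byte orders (2134, 3412) and EBCDIC. It looks only at the first
four bytes and returns None when fewer are available, so a chunked
reader would have to buffer and retry.

```python
import codecs

def sniff_xml_encoding(head):
    """Guess an encoding from the first bytes of an XML entity.

    Returns a codec name, or None if fewer than four bytes are
    available (the caller should buffer more data and call again).
    """
    if len(head) < 4:
        return None
    # Byte order marks. UTF-32 is checked before UTF-16 because the
    # UTF-32 LE mark starts with the same two bytes as the UTF-16 LE mark.
    for bom, name in (
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
    ):
        if head.startswith(bom):
            return name
    # No byte order mark: see how the leading '<' (and '?') is laid out.
    patterns = {
        b"\x00\x00\x00<": "utf-32-be",
        b"<\x00\x00\x00": "utf-32-le",
        b"\x00<\x00?": "utf-16-be",
        b"<\x00?\x00": "utf-16-le",
    }
    # Anything else is treated as an ASCII superset, with UTF-8 as the
    # default; the encoding= attribute would still have to be read.
    return patterns.get(head[:4], "utf-8")
```

Note that the last case only narrows things down to "some ASCII
superset". Finding the actual encoding= value needs the complete XML
declaration, which is exactly the part that may not have arrived in the
first fragment.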

> I also think that it would be worthwhile adding a similar
> API to codecs.py which takes the magic ('<?xml' in this case)
> as argument and then tries to determine whether the input
> data is an ASCII superset, UTF-8 or UTF-16/32.

I don't think so. Doing the XML autodetection is not terribly
complicated, and rarely needs to be done - you'd normally pass the
byte stream to an XML parser, so you would not need to care about the

As for XML and encodings, having a convenient mechanism to extend
existing codecs to encode unknown characters as character entities is
much more important, IMO, since that is very difficult to achieve with
the existing API.
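
For readers of the archive: the behaviour asked for here can be
illustrated with the "xmlcharrefreplace" error handler found in current
Python. That handler is not part of the API discussed in this message,
so the sketch below only shows the desired effect, namely that
characters the target codec cannot represent are written as numeric
character references instead of raising an error. The sample string is
made up for illustration.

```python
# Sample text with one Latin-1 character (e-acute) and one character
# outside Latin-1 (the euro sign).
text = "caf\u00e9 \u20ac5"

# Target codec ASCII: both non-ASCII characters become references.
as_ascii = text.encode("ascii", errors="xmlcharrefreplace")

# Target codec Latin-1: only the euro sign has to be escaped.
as_latin1 = text.encode("latin-1", errors="xmlcharrefreplace")

print(as_ascii)   # b'caf&#233; &#8364;5'
print(as_latin1)  # b'caf\xe9 &#8364;5'
```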