[I18n-sig] XML and UTF-16

M.-A. Lemburg mal@lemburg.com
Fri, 01 Jun 2001 10:10:08 +0200


"Martin v. Loewis" wrote:
> 
> > Yes, I think this would be a good idea. I would use something along
> > the lines of:
> 
> Please have a look at
> xml.parsers.xmlproc.EntityParser.autodetect_encoding. This almost
> follows the procedure in the XML recommendation, except that it does
> not expect "unusual" byte orders (2134, 3412), and that it does not
> detect EBCDIC.

I don't have a file named EntityParser in the xmlproc subdir... is
that in CVS somewhere?
 
> > 0) Assume UTF-8.
> >
> > 1) Look for the UTF-16 and UTF-32 uniBOMs. If you find one, assume the
> >    appropriate transmission format and endian nature. Goto 4.
> >
> > 2) Look for the UTF-8 uniBOM, since some editors like putting that in.
> >    Ignore it and goto 4.
> 
> I see this was added to the XML recommendation only in the second
> edition, so it should also be added to xmlproc.
> 
> > 3) Look for the sundry forms of '<?xml ' in ASCII, UTF-16, and UTF-32,
> >    with appropriate endian variants. If found, assume the detected
> >    encoding. Goto 4.
> 
> Please note that ASCII is not detectable this way: If you see '<?xml',
> then you don't know anything about the encoding except that you should
> be able to parse the encoding= attribute successfully if present.

I think that's what Tom had in mind here.
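
As a rough illustration of steps 0-2 from the quoted list, here is a
minimal sketch in present-day Python. The name detect_bom and its
(encoding, BOM length) return value are hypothetical, not part of
xmlproc or codecs.py, and the "unusual" byte orders (2134, 3412) are
not handled.

```python
import codecs

# Hypothetical table; longer BOMs come first because the UTF-32-LE BOM
# (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(data):
    """Return (encoding, bom_length) for the leading bytes of a document.

    Falls back to ("utf-8", 0) when no BOM is present (step 0).
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            # Steps 1 and 2: a BOM was found; the caller should skip it.
            return name, len(bom)
    return "utf-8", 0
```

The caller would strip the first bom_length bytes before parsing. The
UTF-32 check relies on an XML document never starting with U+0000, so
FF FE 00 00 cannot be a UTF-16-LE BOM followed by real content.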

Could we maybe have the function autodetect_encoding at
some higher level in PyXML? This is a very basic API and
applies to more than just xmlproc.

I also think that it would be worthwhile adding a similar
API to codecs.py which takes the magic ('<?xml' in this case)
as argument and then tries to determine whether the input
data is an ASCII superset, UTF-8 or UTF-16/32.
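
To make the idea concrete, here is a minimal sketch of such a helper
in present-day Python. The name detect_by_magic, its signature, and
the "ascii-superset" label are assumptions made for illustration;
nothing like this exists in codecs.py. It assumes the magic string is
pure ASCII and that any BOM has already been stripped.

```python
def detect_by_magic(data, magic="<?xml"):
    """Guess the encoding family from how `magic` appears at the start.

    Returns one of "utf-32-be", "utf-32-le", "utf-16-be", "utf-16-le",
    "ascii-superset", or None if the magic is not found in any form.
    """
    # The explicit -be/-le codecs encode without emitting a BOM, so each
    # call yields just the byte pattern of the magic in that encoding.
    for name in ("utf-32-be", "utf-32-le", "utf-16-be", "utf-16-le"):
        if data.startswith(magic.encode(name)):
            return name
    # A plain-bytes match only shows the data is ASCII-compatible
    # (ASCII, UTF-8, Latin-1, ...); the exact encoding is still unknown.
    if data.startswith(magic.encode("ascii")):
        return "ascii-superset"
    return None
```

For the "ascii-superset" case the caller still has to read the
encoding= attribute of the XML declaration, and fall back to UTF-8 if
it is absent, as Martin points out above.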

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/