[Chicago] xml encodings, umlauts, ouch

Ian Bicking ianb at colorstudy.com
Thu Apr 16 20:54:57 CEST 2009


On Thu, Apr 16, 2009 at 10:45 AM, Kumar McMillan
<kumar.mcmillan at gmail.com> wrote:

> > Seems like this is a common enough problem... a
> > dear_diety_please_fix_this_broken_xml(..) OSS library would have made
> > for a good (though perhaps unsexy) Pycon sprint.  Maybe next year...
> > unless we start sprinting at Chipy. ;-P
>
> eh, I *hope* no one else has to work around this kind of problem.  I
> don't even know how to detect it.  Since I know it's happening, I did
> cobble up a wrapper around the file object that has a custom read
> method; it reads incrementally to make sure it has all the char ref
> byte strings, decodes from utf-8, then char ref encodes *again* using
> the proper code points, and returns the chunk.  Seems to work from
> unit tests so far but will probably be very slow.  The xml file is
> around 300MB.
>
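
For illustration, here is a rough sketch of the kind of wrapper described
above.  This is not the actual code: the class and helper names are invented,
it is written Python 3 style, and it assumes the broken producer wrote numeric
character references for the individual UTF-8 *bytes* (e.g. &#195;&#188; for
u-umlaut) rather than for the code point (&#252;).

import re

CHARREF_RUN = re.compile(r'(?:&#\d+;)+')
TRAILING = re.compile(r'(?:&#\d+;)*(?:&#?\d*)?$')

def _fix_run(match):
    # Treat each reference in the run as a raw byte, decode the byte string
    # as UTF-8, then re-emit one reference per real code point.
    nums = [int(n) for n in re.findall(r'&#(\d+);', match.group(0))]
    try:
        text = bytes(nums).decode('utf-8')
    except UnicodeDecodeError:
        return match.group(0)          # leave runs we can't repair alone
    return ''.join('&#%d;' % ord(ch) for ch in text)

class CharRefFixingReader:
    """File-like wrapper whose read() never splits a char-ref run."""

    def __init__(self, fileobj, chunk_size=64 * 1024):
        self._f = fileobj              # expects a text-mode file object
        self._chunk = chunk_size
        self._buf = ''
        self._eof = False

    def read(self, size=-1):
        out = ''
        while not out and not self._eof:
            chunk = self._f.read(self._chunk if size < 0 else size)
            if not chunk:
                self._eof = True
            data = self._buf + chunk
            self._buf = ''
            if not self._eof:
                # Hold back a trailing (possibly incomplete) run of refs so
                # a multi-byte sequence split across chunks stays together.
                m = TRAILING.search(data)
                self._buf, data = data[m.start():], data[:m.start()]
            out = CHARREF_RUN.sub(_fix_run, data)
        return out

The wrapper buffers just enough to keep a char-ref run intact across chunk
boundaries, so a parser that calls read() repeatedly sees only repaired text.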

Technically it would be possible to create a codec that handles this case.
The codec machinery has all the interface needed to stream the encoding and
decoding process (though it's not actually a terribly easy interface to use).
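
A minimal sketch of what registering such a codec could look like follows.
The codec name and the fix_char_refs helper are placeholders (fix_char_refs
would do the same char-ref repair as the wrapper sketch above), and a real
streaming version would also supply IncrementalDecoder/StreamReader
subclasses, which is the awkward part of the interface.

import codecs

def fix_char_refs(text):
    # Placeholder: apply the byte-valued char-ref repair here.
    return text

def _decode(data, errors='strict'):
    text = bytes(data).decode('utf-8', errors)
    return fix_char_refs(text), len(data)

def _encode(text, errors='strict'):
    return text.encode('utf-8', errors), len(text)

def _search(name):
    if name == 'utf8_charref_repair':
        return codecs.CodecInfo(_encode, _decode, name=name)
    return None

codecs.register(_search)

# One-shot use; a 300MB file would want the streaming classes instead:
# repaired = codecs.decode(open('dump.xml', 'rb').read(), 'utf8_charref_repair')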

Then, if you could get chardet to return your custom encoding (keyed off,
say, some string marker that suggests the problem), you'd really be set.  Set
for something that should never happen, though, so maybe not the most
productive way to handle the problem.
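
Skipping the chardet route, a standalone heuristic along these lines might be
enough to flag the problem (the threshold and names are invented; it just
looks for runs of byte-valued references that happen to decode as UTF-8):

import re

SUSPECT = re.compile(r'(?:&#(?:12[89]|1[3-9]\d|2[0-4]\d|25[0-5]);){2,}')

def looks_byte_encoded(sample):
    # True if any run of refs in the 128..255 range decodes cleanly as UTF-8,
    # a strong hint the producer wrote refs for bytes, not code points.
    for run in SUSPECT.findall(sample):
        nums = [int(n) for n in re.findall(r'&#(\d+);', run)]
        try:
            bytes(nums).decode('utf-8')
            return True
        except UnicodeDecodeError:
            pass
    return False

# e.g. looks_byte_encoded(open('dump.xml', 'r', errors='replace').read(1 << 20))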

There's Unicode Dammit, but I doubt it addresses this particular issue.

-- 
Ian Bicking  |  http://blog.ianbicking.org