[Chicago] xml encodings, umlauts, ouch

Kumar McMillan kumar.mcmillan at gmail.com
Thu Apr 16 17:45:25 CEST 2009


On Thu, Apr 16, 2009 at 9:28 AM, Pete <pfein at pobox.com> wrote:
> On Apr 15, 2009, at 10:41 PM, Aaron Lav wrote:
>
>> On Wed, Apr 15, 2009 at 09:37:25PM -0500, Kumar McMillan wrote:
>>>
>>> Dear gurus,
>
> <more than you ever wanted to know about xml & encoding>

Thanks Aaron!  That's exactly what they did: they char-ref encoded the
UTF-8 byte stream instead of the Unicode code points.
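
To make the failure concrete, here is a small sketch (Python 3; the helper
name charref_encode is made up for illustration, it is not from the vendor's
code or any library) of what happens to a single umlaut when the references
are built from bytes rather than from code points:

```python
def charref_encode(values):
    """Render each integer as a decimal numeric character reference."""
    return "".join("&#%d;" % v for v in values)


text = "\u00fc"  # LATIN SMALL LETTER U WITH DIAERESIS

# Correct: one reference per Unicode code point.
correct = charref_encode(ord(ch) for ch in text)

# Broken: one reference per byte of the UTF-8 encoding (0xC3 0xBC).
broken = charref_encode(text.encode("utf-8"))

print(correct)  # &#252;
print(broken)   # &#195;&#188;
```

A parser reading the broken form sees two characters (U+00C3 and U+00BC)
where one was meant.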

>
>> you'll have to add some hackish preprocessing or postprocessing.
>
> Seems like this is a common enough problem... a
> dear_diety_please_fix_this_broken_xml(..) OSS library would have made for a
> good (though perhaps unsexy) Pycon sprint.  Maybe next  year...  unless we
> start sprinting at Chipy. ;-P

Eh, I *hope* no one else has to work around this kind of problem.  I
don't even know how to detect it in general.  Since I know it's
happening here, I cobbled together a wrapper around the file object
with a custom read method: it reads incrementally until it has a
complete run of char refs, turns that run back into bytes, decodes the
bytes from UTF-8, char-ref encodes the result *again* using the proper
code points, and returns the chunk.  It passes my unit tests so far but
will probably be very slow; the XML file is around 300MB.
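
For anyone who lands here with the same problem, a minimal sketch of that
idea follows.  It is not the actual wrapper; it assumes Python 3, a file
opened in text mode, and that every run of char refs with values below 256
that happens to decode as UTF-8 really is mis-encoded (a legitimate
"&#195;&#188;" would be rewritten too, which is the detection problem
mentioned above).  It is written as a generator rather than a read method,
and the names fix_charrefs and iter_fixed_chunks are made up for illustration:

```python
import io
import re

# One or more adjacent numeric character references, e.g. "&#195;&#188;".
_RUN = re.compile(r"(?:&#(?:\d+|[xX][0-9a-fA-F]+);)+")
_REF = re.compile(r"&#(?:(\d+)|[xX]([0-9a-fA-F]+));")
# A run (possibly ending in an unfinished reference) that touches the end
# of the buffer and so might continue in the next chunk.
_TAIL = re.compile(
    r"(?:&#(?:\d+|[xX][0-9a-fA-F]+);)*"
    r"(?:&(?:#(?:\d*|[xX][0-9a-fA-F]*))?)?\Z"
)


def _fix_run(match):
    """Reinterpret a run of byte-valued refs as UTF-8, if that is possible."""
    run = match.group(0)
    values = [int(d) if d else int(h, 16) for d, h in _REF.findall(run)]
    if any(v > 0xFF for v in values):
        return run  # not byte values, leave untouched
    try:
        decoded = bytes(values).decode("utf-8")
    except UnicodeDecodeError:
        return run  # not valid UTF-8, assume the refs were already correct
    return "".join("&#%d;" % ord(ch) for ch in decoded)


def fix_charrefs(text):
    """Repair every run of char refs in a complete string."""
    return _RUN.sub(_fix_run, text)


def iter_fixed_chunks(fileobj, chunk_size=64 * 1024):
    """Yield repaired text chunks, never splitting a run of char refs."""
    pending = ""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        buf = pending + chunk
        cut = _TAIL.search(buf).start()  # always matches, at worst at the end
        pending = buf[cut:]              # hold back the trailing run
        if cut:
            yield fix_charrefs(buf[:cut])
    if pending:
        yield fix_charrefs(pending)


sample = "<n>M&#195;&#188;ller &#252; &amp; &#60;</n>"
print(fix_charrefs(sample))
print("".join(iter_fixed_chunks(io.StringIO(sample), chunk_size=5)))
```

Both calls should print the same repaired string; the tiny chunk_size is
only there to force references to straddle chunk boundaries.  Hex refs come
back out as decimal, which is equivalent to a parser.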

K

>
> --Pete
> _______________________________________________
> Chicago mailing list
> Chicago at python.org
> http://mail.python.org/mailman/listinfo/chicago
>

