[Chicago] xml encodings, umlauts, ouch
Kumar McMillan
kumar.mcmillan at gmail.com
Thu Apr 16 17:45:25 CEST 2009
On Thu, Apr 16, 2009 at 9:28 AM, Pete <pfein at pobox.com> wrote:
> On Apr 15, 2009, at 10:41 PM, Aaron Lav wrote:
>
>> On Wed, Apr 15, 2009 at 09:37:25PM -0500, Kumar McMillan wrote:
>>>
>>> Dear gurus,
>
> <more than you ever wanted to know about xml & encoding>
Thanks, Aaron! That's exactly what they did: they char-ref-encoded a
UTF-8 byte stream instead of a stream of Unicode code points.
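
To make the failure concrete, here is a minimal Python 3 sketch of the
two encodings side by side. The helper name char_ref_encode is made up
for illustration; I don't know what the producer's code really looks
like, only what its output implies.

```python
def char_ref_encode(values):
    """Emit one decimal character reference per integer value."""
    return "".join("&#%d;" % v for v in values)

text = "\u00fc"  # LATIN SMALL LETTER U WITH DIAERESIS (u-umlaut)

# Correct: one reference per Unicode code point.
correct = char_ref_encode(ord(ch) for ch in text)

# Broken: one reference per UTF-8 byte (iterating bytes yields ints).
broken = char_ref_encode(text.encode("utf-8"))

print(correct)  # &#252;
print(broken)   # &#195;&#188;
```

An XML parser reads the broken form as two separate Latin-1-range
characters, which is where the mangled umlauts come from.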
>
>> you'll have to add some hackish preprocessing or postprocessing.
>
> Seems like this is a common enough problem... a
> dear_deity_please_fix_this_broken_xml(..) OSS library would have made for a
> good (though perhaps unsexy) Pycon sprint. Maybe next year... unless we
> start sprinting at Chipy. ;-P
Eh, I *hope* no one else has to work around this kind of problem. I
don't even know how to detect it. Since I know it's happening, I
cobbled together a wrapper around the file object with a custom read
method: it reads incrementally until it has a complete run of char
refs, turns them back into bytes, decodes those as UTF-8, char-ref-
encodes the result *again* using the proper code points, and returns
the chunk. It passes my unit tests so far but will probably be very
slow; the XML file is around 300MB.
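
Roughly, the idea looks like the Python 3 sketch below. This is not my
actual code: the names (CharRefFixingReader, fix_char_refs) are
invented here, it assumes a text-mode file object and decimal
references only (no &#xNN; form), and it uses a heuristic, treating any
run of two or more adjacent refs in the 128-255 range that decodes
cleanly as UTF-8 as broken. Legitimate adjacent Latin-1 characters that
happen to form valid UTF-8 would be rewritten too.

```python
import io
import re

# Two or more adjacent decimal char refs with values 128-255: candidates
# for "UTF-8 bytes written out as character references".
_BYTE_RUN = re.compile(
    r"(?:&#(?:12[89]|1[3-9][0-9]|2[0-4][0-9]|25[0-5]);){2,}")
_REF = re.compile(r"&#(\d+);")
# Trailing complete refs plus an optional partial ref at the end of a
# chunk; held back so a run is never split across two reads.
_TAIL = re.compile(r"(?:&#\d+;)*(?:&(?:#\d*)?)?\Z")


def _fix_run(match):
    raw = bytes(int(n) for n in _REF.findall(match.group(0)))
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return match.group(0)  # not valid UTF-8, leave untouched
    return "".join("&#%d;" % ord(ch) for ch in text)


def fix_char_refs(chunk):
    """Re-encode byte-wise char ref runs using real code points."""
    return _BYTE_RUN.sub(_fix_run, chunk)


class CharRefFixingReader:
    """Wrap a text-mode file object; read() returns repaired text."""

    def __init__(self, fileobj):
        self._f = fileobj
        self._pending = ""

    def read(self, size=-1):
        if size is None or size < 0:
            data, self._pending = self._pending + self._f.read(), ""
            return fix_char_refs(data)
        while True:
            chunk = self._f.read(size)
            data = self._pending + chunk
            if not chunk:  # EOF: flush whatever was held back
                self._pending = ""
                return fix_char_refs(data)
            cut = _TAIL.search(data).start()
            self._pending = data[cut:]
            if cut:
                # May be shorter or longer than `size`; callers that
                # just loop until an empty read should not mind.
                return fix_char_refs(data[:cut])
```

Usage would be something like passing
CharRefFixingReader(open(path, encoding="ascii")) to whatever consumes
the file through read(); whether a given parser accepts a text-mode
wrapper like this is something to check against that parser.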
K
>
> --Pete
> _______________________________________________
> Chicago mailing list
> Chicago at python.org
> http://mail.python.org/mailman/listinfo/chicago
>