[Chicago] xml encodings, umlauts, ouch

Kumar McMillan kumar.mcmillan at gmail.com
Thu Apr 16 21:36:34 CEST 2009


On Thu, Apr 16, 2009 at 1:54 PM, Ian Bicking <ianb at colorstudy.com> wrote:
> On Thu, Apr 16, 2009 at 10:45 AM, Kumar McMillan <kumar.mcmillan at gmail.com> wrote:
>>
>> > Seems like this is a common enough problem... a
>> > dear_diety_please_fix_this_broken_xml(..) OSS library

For the curious, this is what I wrote on the train to work this
morning (attached).  It doesn't support all types of UTF-8 byte encodings
yet, but that should be an easy addition.  Otherwise, it's a sad amount
of code and will probably make feeding the parser a lot slower.

I haven't looked into building a parser for lxml that overloads its
own char ref decoding.  I'm sure that would be less code since it
wouldn't have to read incrementally.

>> > would have made for a good (though perhaps unsexy) Pycon sprint.
>> > Maybe next year...  unless we start sprinting at Chipy. ;-P
>>
>> eh, I *hope* no one else has to work around this kind of problem.  I
>> don't even know how to detect it.  Since I know it's happening, I did
>> cobble up a wrapper around the file object that has a custom read
>> method; it reads incrementally to make sure it has all the char ref
>> byte strings, decodes from utf-8, then char ref encodes *again* using
>> the proper code points, and returns the chunk.  Seems to work from
>> unit tests so far but will probably be very slow.  The xml file is
>> around 300MB.
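
A minimal sketch of the wrapper idea quoted above, for illustration only.
It is not the attached xmlfixerupper.py.  The names CharRefFixingReader and
fix_run are hypothetical, the code is written for Python 3, and it assumes
the broken refs are decimal and that each one carries a single UTF-8 byte
(e.g. &#195;&#188; where &#252; was meant).  Hex refs, and runs that mix
good and bad refs, are left untouched.

```python
import io
import re

# A run of one or more decimal character references, e.g. b"&#195;&#188;".
CHAR_REF_RUN = re.compile(rb"(?:&#\d+;)+")
# A run of references and/or an unfinished reference at the very end of a
# chunk.  It has to be held back because the next chunk may continue it.
TRAILING_RUN = re.compile(rb"(?:&#\d+;)*(?:&(?:#\d*)?)?\Z")


def fix_run(match):
    """Re-encode one run of refs if its values form valid UTF-8 bytes."""
    run = match.group(0)
    values = [int(v) for v in re.findall(rb"&#(\d+);", run)]
    try:
        # bytes() raises ValueError for values > 255; decode() raises
        # UnicodeDecodeError (a ValueError subclass) for invalid UTF-8.
        text = bytes(values).decode("utf-8")
    except ValueError:
        return run  # not byte-per-ref UTF-8, leave it alone
    return "".join("&#%d;" % ord(ch) for ch in text).encode("ascii")


class CharRefFixingReader:
    """File-like wrapper (binary mode) whose read() repairs char refs."""

    def __init__(self, fileobj):
        self._file = fileobj
        self._pending = b""  # held-back tail of the previous chunk

    def read(self, size=-1):
        if size is None or size < 0:
            data, self._pending = self._pending + self._file.read(), b""
            return CHAR_REF_RUN.sub(fix_run, data)
        while True:
            raw = self._file.read(size)
            data = self._pending + raw
            if not raw:  # end of file: flush whatever was held back
                self._pending = b""
                return CHAR_REF_RUN.sub(fix_run, data)
            cut = TRAILING_RUN.search(data).start()
            self._pending = data[cut:]
            if cut:  # never return b"" before the real end of file
                return CHAR_REF_RUN.sub(fix_run, data[:cut])


broken = io.BytesIO(b"<a>M&#195;&#188;ller</a>")
fixed = CharRefFixingReader(broken).read()  # b'<a>M&#252;ller</a>'
```

Note that size is only a hint here: a call may return fewer or more bytes
than requested, which incremental parsers that just call read() in a loop
generally tolerate.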
>
> Technically it would be possible to create a codec that would handle this
> case.  That has all the interface to stream the encoding process (though
> it's not actually a terribly easy interface to use).
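
For what it's worth, a rough sketch of what such a codec could look like,
using the standard library's codecs.BufferedIncrementalDecoder for the
streaming part.  The codec name "utf8_charref_fix" and the helper names are
made up for this example, it targets Python 3, and it only handles decimal
refs that each carry one UTF-8 byte.  Treat it as an illustration of the
interface, not as a tested solution.

```python
import codecs
import re

CODEC_NAME = "utf8_charref_fix"  # hypothetical name, not a real codec

RUN = re.compile(r"(?:&#\d+;)+")
# Trailing refs (complete or unfinished) that the next chunk may continue.
HELD_BACK = re.compile(rb"(?:&#\d+;)*(?:&(?:#\d*)?)?\Z")


def _fix_run(match):
    """Turn byte-per-ref UTF-8 such as &#195;&#188; into &#252;."""
    values = [int(v) for v in re.findall(r"&#(\d+);", match.group(0))]
    try:
        text = bytes(values).decode("utf-8")
    except ValueError:  # value > 255 or not valid UTF-8: leave as is
        return match.group(0)
    return "".join("&#%d;" % ord(ch) for ch in text)


def _decode(data, errors="strict"):
    """Stateless decode: UTF-8 bytes to str, with char refs repaired."""
    text, consumed = codecs.utf_8_decode(bytes(data), errors, True)
    return RUN.sub(_fix_run, text), consumed


class CharRefFixDecoder(codecs.BufferedIncrementalDecoder):
    def _buffer_decode(self, data, errors, final):
        data = bytes(data)
        # Unless this is the last chunk, stop before any trailing refs;
        # the base class keeps the unconsumed bytes for the next call.
        cut = len(data) if final else HELD_BACK.search(data).start()
        text, consumed = codecs.utf_8_decode(data[:cut], errors, final)
        return RUN.sub(_fix_run, text), consumed


def _search(name):
    if name == CODEC_NAME:
        return codecs.CodecInfo(
            name=CODEC_NAME,
            encode=codecs.utf_8_encode,
            decode=_decode,
            incrementaldecoder=CharRefFixDecoder,
        )
    return None


codecs.register(_search)

decoder = codecs.getincrementaldecoder(CODEC_NAME)()
pieces = [decoder.decode(b"M&#195;"), decoder.decode(b"&#188;ller", final=True)]
```

The buffering is the fiddly part: a run of refs can straddle two chunks, so
the decoder has to hold the tail back until it knows the run is complete.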
>
> Then if you can get chardet to return your encoding (using, say, some string
> marker that would suggest the problem) then you'd really be set.  Set for
> something that should never happen, so maybe not the most productive way to
> handle the problem.
>
> There's Unicode Dammit, but I doubt it addresses this particular issue.
>
> --
> Ian Bicking  |  http://blog.ianbicking.org
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: xmlfixerupper.py
Type: application/octet-stream
Size: 2453 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/chicago/attachments/20090416/b426523b/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_xmlfixerupper.py
Type: application/octet-stream
Size: 1179 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/chicago/attachments/20090416/b426523b/attachment-0003.obj>

