Unicode error in sax parser

Stefan Behnel stefan_ml at behnel.de
Tue Feb 8 12:00:46 EST 2011


Rickard Lindberg, 08.02.2011 16:57:
> Hi,
>
> Here is a bash script to reproduce my error:
>
>      #!/bin/sh
>
>      cat > å.timeline <<EOF
>      <?xml version="1.0" encoding="utf-8"?>
>      <timeline>
>        <version>0.13.0devb38ace0a572b+</version>
>        <categories>
>        </categories>
>        <events>
>          <event>
>            <start>2011-02-01 00:00:00</start>
>            <end>2011-02-03 08:46:00</end>
>            <text>asdsd</text>
>          </event>
>        </events>
>        <view>
>          <displayed_period>
>            <start>2011-01-24 16:38:11</start>
>            <end>2011-02-23 16:38:11</end>
>          </displayed_period>
>          <hidden_categories>
>          </hidden_categories>
>        </view>
>      </timeline>
>      EOF
>
>      python <<EOF
>      # encoding: utf-8
>      from xml.sax import parse
>      from xml.sax.handler import ContentHandler
>      parse(u"å.timeline", ContentHandler())
>      EOF
>
> If I instead do
>
>      parse(u"å.timeline".encode("utf-8"), ContentHandler())
>
> the script runs without errors.
>
> Is this a bug or expected behavior?

Expected behaviour. You cannot parse XML from a unicode string, especially
not when the XML data explicitly declares itself as being encoded in UTF-8.
An encoding declaration describes bytes, whereas a unicode string has
already been decoded.

Parse from a byte string instead, as you do in your fixed code.
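
A minimal sketch of parsing from bytes, assuming the document is already
in memory as UTF-8 encoded bytes. The TagCollector handler and the sample
document are made up for illustration and are not part of your script:

```python
from xml.sax import parseString
from xml.sax.handler import ContentHandler


class TagCollector(ContentHandler):
    """Hypothetical handler that records element names in document order."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)


# Illustrative document, encoded to bytes so that it matches its own
# encoding declaration.
xml_bytes = (
    u'<?xml version="1.0" encoding="utf-8"?>'
    u'<timeline><events><event><text>\xe5</text></event></events></timeline>'
).encode("utf-8")

handler = TagCollector()
parseString(xml_bytes, handler)  # the parser receives bytes, not unicode
print(handler.tags)
```

The parser does the decoding itself here, based on the declaration, so
nothing has to be decoded by hand before the data is handed over.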

Stefan



