windows utf8 & lxml

Stefan Behnel stefan_ml at behnel.de
Mon Dec 26 10:55:52 EST 2016


Hi!

Sayth Renshaw wrote on 20.12.2016 at 12:53:
> I have been trying to get a script to work on windows that works on mint. The key blocker has been utf8 errors, most of which I have solved.
> 
> Now however the last error I am trying to overcome, the solution appears to be to use the .decode('windows-1252') to correct an ascii error.
> 
> I am using lxml to read my content and decode is not supported are there any known ways to read with lxml and fix unicode faults?
> 
> The key part of my script is 
> 
>         for content in roots:
>             utf8_parser = etree.XMLParser(encoding='utf-8')
>             fix_ascii = utf8_parser.decode('windows-1252')

This looks rather broken: an XMLParser object has no decode() method. Are
you sure this is what your code looks like, or did you just type this into
your email while trying to strip down your actual code into a simpler
example?


>             mytree = etree.fromstring(
>                 content.read().encode('utf-8'), parser=fix_ascii)

Note that lxml can parse from Unicode, so once you have decoded your data,
you can just pass it into the parser as is, e.g.

    mytree = etree.fromstring(content.decode('windows-1252'))

This is not something I'd encourage since it requires a bit of back and
forth encoding internally and is rather memory inefficient, but if your
decoding is non-trivial, this might still be a viable approach.
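
A minimal sketch of both variants, assuming the input is a windows-1252
byte string without an XML encoding declaration (the sample data and the
variable names are made up for illustration). The second variant uses the
encoding option of XMLParser to override the encoding, so the bytes go to
the parser directly and no intermediate Unicode string is needed.

```python
from lxml import etree

# Hypothetical sample: windows-1252 bytes with no XML declaration.
# "\u00e9" is an e with acute accent, a single byte (0xE9) in windows-1252.
raw = '<race><horse name="Caf\u00e9 Noir"/></race>'.encode('windows-1252')

# Variant 1: decode first, then parse the Unicode string.
# lxml raises ValueError for Unicode strings that carry an encoding
# declaration, so this only works for input without one.
tree_from_text = etree.fromstring(raw.decode('windows-1252'))

# Variant 2: hand the bytes to the parser and override the encoding.
cp1252_parser = etree.XMLParser(encoding='windows-1252')
tree_from_bytes = etree.fromstring(raw, parser=cp1252_parser)

name_from_text = tree_from_text[0].get('name')
name_from_bytes = tree_from_bytes[0].get('name')
```

Both trees should end up with the same attribute value. Which variant fits
better depends on whether the data really is windows-1252 throughout.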


> Without the added .decode my code looks like
> 
>         for content in roots:
>             utf8_parser = etree.XMLParser(encoding='utf-8')
>             mytree = etree.fromstring(
>                 content.read().encode('utf-8'), parser=utf8_parser)
> 
> However doing it in such a fashion returns this error:
> 
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Same thing as above: I don't see how this error message matches the code
you show here. The exception you get might be a Python 2.x problem in the
first place.
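
One possible reading of the message, which is a guess and not something
the code above confirms: a byte 0xff at position 0 is what a UTF-16
(little endian) byte order mark starts with, and it can never start a
valid UTF-8 sequence. The sketch below reproduces that error with made-up
data, using only the standard library.

```python
import codecs

# Hypothetical file content: an XML document stored as UTF-16 LE with a BOM.
text = '<?xml version="1.0" encoding="UTF-16"?><root/>'
data = codecs.BOM_UTF16_LE + text.encode('utf-16-le')

# Decoding these bytes as UTF-8 fails on the very first byte of the BOM.
try:
    data.decode('utf-8')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# A cheap check for the BOM before deciding how to decode.
starts_with_bom = data.startswith(codecs.BOM_UTF16_LE)
print(error_message)
```

If the input does turn out to be UTF-16, reading the file in binary mode
and passing the bytes to lxml unchanged lets the parser pick up the
encoding from the BOM and the XML declaration.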

Stefan


