windows utf8 & lxml

Sayth Renshaw flebber.crue at gmail.com
Wed Dec 21 04:03:48 EST 2016


On Tuesday, 20 December 2016 22:54:03 UTC+11, Sayth Renshaw  wrote:
> Hi 
> 
> I have been trying to get a script that works on Mint to also work on Windows. The key blocker has been UTF-8 errors, most of which I have solved.
> 
> For the last error I am trying to overcome, the solution appears to be to use .decode('windows-1252') to correct an ASCII error.
> 
> I am using lxml to read my content, and decode is not supported there. Are there any known ways to read with lxml and fix Unicode faults?
> 
> The key part of my script is 
> 
>         for content in roots:
>             utf8_parser = etree.XMLParser(encoding='utf-8')
>             fix_ascii = utf8_parser.decode('windows-1252')
>             mytree = etree.fromstring(
>                 content.read().encode('utf-8'), parser=fix_ascii)
> 
> Without the added .decode my code looks like
> 
>         for content in roots:
>             utf8_parser = etree.XMLParser(encoding='utf-8')
>             mytree = etree.fromstring(
>                 content.read().encode('utf-8'), parser=utf8_parser)
> 
> However doing it in such a fashion returns this error:
> 
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
> I found this SO answer for it, http://stackoverflow.com/a/29217546/461887, but cannot seem to implement it with lxml.
> 
> Ideas?
> 
> Sayth
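
One possible direction, sketched under assumptions rather than confirmed against the real data: byte 0xff at position 0 is what a UTF-16 byte order mark looks like, so the files may not be UTF-8 or windows-1252 at all. If so, handing the raw bytes to the parser and letting it detect the encoding avoids the .decode()/.encode() round trip. The sample document and the helper name parse_xml_bytes are made up for illustration, and the import falls back to the standard library's xml.etree.ElementTree (which offers the same fromstring call) when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree


def parse_xml_bytes(raw):
    """Parse raw XML bytes and let the parser work out the encoding."""
    # No manual decode/encode here: the parser reads the byte order mark
    # and the XML declaration directly from the bytes.
    return etree.fromstring(raw)


# Hypothetical sample: a UTF-16 document. Encoding with "utf-16" prepends
# a byte order mark, so the data starts with 0xff 0xfe or 0xfe 0xff.
sample = (
    '<?xml version="1.0" encoding="UTF-16"?>'
    '<meeting venue="Caf\u00e9"/>'
).encode("utf-16")

root = parse_xml_bytes(sample)
print(root.tag, root.get("venue"))
```

Whether this applies depends on what the first bytes of the real files are, which can be checked by opening one in 'rb' mode and printing the first few bytes.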

Why is Windows so hard? I am running out of ideas; I have tried the methods in the docs, on SO, etc.

Currently I have:

        for xml_data in roots:
            parser_xml = etree.XMLParser()
            mytree = etree.parse(xml_data, parser_xml)

This returns:
C:\Users\Sayth\Anaconda3\envs\race\python.exe C:/Users/Sayth/PycharmProjects/bs4race/race.py data/ -e *.xml
Traceback (most recent call last):
  File "C:/Users/Sayth/PycharmProjects/bs4race/race.py", line 100, in <module>
    data_attr(rootObs)
  File "C:/Users/Sayth/PycharmProjects/bs4race/race.py", line 55, in data_attr
    mytree = etree.parse(xml_data, parser_xml)
  File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src\lxml\lxml.etree.c:81110)
  File "src/lxml/parser.pxi", line 1832, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:118109)
  File "src/lxml/parser.pxi", line 1852, in lxml.etree._parseFilelikeDocument (src\lxml\lxml.etree.c:118392)
  File "src/lxml/parser.pxi", line 1747, in lxml.etree._parseDocFromFilelike (src\lxml\lxml.etree.c:117180)
  File "src/lxml/parser.pxi", line 1162, in lxml.etree._BaseParser._parseDocFromFilelike (src\lxml\lxml.etree.c:111907)
  File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:105102)
  File "src/lxml/parser.pxi", line 702, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:106769)
  File "src/lxml/lxml.etree.pyx", line 324, in lxml.etree._ExceptionContext._raise_if_stored (src\lxml\lxml.etree.c:12074)
  File "src/lxml/parser.pxi", line 373, in lxml.etree._FileReaderContext.copyToBuffer (src\lxml\lxml.etree.c:102431)
io.UnsupportedOperation: read

Process finished with exit code 1
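
For reference, io.UnsupportedOperation: read is what Python raises when read() is called on a file object that was not opened for reading. How roots is built is not shown above, so whether that is the cause here is only a guess. A minimal sketch of parsing from a handle opened in binary read mode follows; the helper name parse_xml_file, the temporary sample file and its contents are all hypothetical, and the import falls back to xml.etree.ElementTree if lxml is missing.

```python
import os
import tempfile

try:
    from lxml import etree
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree


def parse_xml_file(path):
    """Open path for binary reading and parse it into an element tree."""
    # "rb" gives the parser a readable byte stream, so it can detect the
    # encoding itself instead of relying on the platform default codec.
    with open(path, "rb") as handle:
        return etree.parse(handle, etree.XMLParser())


# Hypothetical sample file, written as UTF-16 to mimic a file that
# starts with a byte order mark.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.xml")
    with open(path, "wb") as out:
        out.write(
            '<?xml version="1.0" encoding="UTF-16"?><meeting id="1"/>'
            .encode("utf-16")
        )
    tree = parse_xml_file(path)
    root = tree.getroot()
    print(root.tag, root.get("id"))
```

Opening in binary mode also sidesteps the Windows default text codec, which differs from the one used on Mint and may explain why the script behaves differently on the two systems.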

Thoughts?

Sayth

