windows utf8 & lxml

Peter Otten __peter__ at web.de
Wed Dec 21 04:36:39 EST 2016


Sayth Renshaw wrote:

> On Tuesday, 20 December 2016 22:54:03 UTC+11, Sayth Renshaw  wrote:
>> Hi
>> 
>> I have been trying to get a script that works on Mint to work on Windows.
>> The key blocker has been utf8 errors, most of which I have solved.
>> 
>> Now, however, for the last error I am trying to overcome, the solution
>> appears to be to use .decode('windows-1252') to correct an ascii error.
>> 
>> I am using lxml to read my content, and decode is not supported. Are there
>> any known ways to read with lxml and fix unicode faults?
>> 
>> The key part of my script is
>> 
>>         for content in roots:
>>             utf8_parser = etree.XMLParser(encoding='utf-8')
>>             fix_ascii = utf8_parser.decode('windows-1252')
>>             mytree = etree.fromstring(
>>                 content.read().encode('utf-8'), parser=fix_ascii)
>> 
>> Without the added .decode my code looks like
>> 
>>         for content in roots:
>>             utf8_parser = etree.XMLParser(encoding='utf-8')
>>             mytree = etree.fromstring(
>>                 content.read().encode('utf-8'), parser=utf8_parser)
>> 
>> However, doing it in such a fashion returns this error:
>> 
>> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0:
>> invalid start byte
>> 
>> I found this SO answer for it, http://stackoverflow.com/a/29217546/461887,
>> but cannot seem to implement it with lxml.
>> 
>> Ideas?
>> 
>> Sayth
> 
> Why is Windows so hard?

I don't think this has anything to do with the OS. Your xml_data is 
probably not what you think it is. Compare:

$ python3
Python 3.4.3 (default, Nov 17 2016, 01:08:31) 
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> import lxml.etree
>>> lxml.etree.parse(sys.stdout)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 3239, in lxml.etree.parse (src/lxml/lxml.etree.c:69955)
  File "parser.pxi", line 1769, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:102257)
  File "parser.pxi", line 1789, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:102516)
  File "parser.pxi", line 1684, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:101442)
  File "parser.pxi", line 1134, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:97069)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91275)
  File "parser.pxi", line 679, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92426)
  File "lxml.etree.pyx", line 327, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:10196)
  File "parser.pxi", line 373, in lxml.etree._FileReaderContext.copyToBuffer (src/lxml/lxml.etree.c:89083)
io.UnsupportedOperation: not readable

That looks similar to what you get.
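
To see what each item in roots really is, a quick check along these lines
might help. This is only a sketch using the standard library; check_source
is a hypothetical helper of mine, not part of lxml, and the temporary file
merely stands in for your data. A file opened for writing behaves like
sys.stdout in the session above: it has a read() method, but calling it
fails.

```python
import io
import os
import tempfile


def check_source(source):
    """Classify `source` as a path, a readable file object, or neither.

    Hypothetical helper for debugging, not part of lxml.
    """
    if isinstance(source, (str, bytes, os.PathLike)):
        return "path"
    readable = getattr(source, "readable", None)
    if callable(readable) and readable():
        return "readable file object"
    return "not readable"


error_text = None
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.xml")

    # Opened for writing: read() exists but raises, like sys.stdout.read().
    with open(path, "w") as f:
        status_write = check_source(f)
        try:
            f.read()
        except io.UnsupportedOperation as exc:
            error_text = str(exc)

    # Opened for reading in binary mode: this is what a parser can consume.
    with open(path, "rb") as f:
        status_read = check_source(f)

print(status_write, "|", error_text, "|", status_read)
```

If check_source() reports "not readable" for your xml_data, the problem is
in how the files in roots were opened, not in the parser or the encoding.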

> I'm sort of running out of ideas; I have tried methods in the docs, SO,
> etc.
> 
> Currently
> 
>         for xml_data in roots:
>             parser_xml = etree.XMLParser()
>             mytree = etree.parse(xml_data, parser_xml)
> 
> Returns
> C:\Users\Sayth\Anaconda3\envs\race\python.exe C:/Users/Sayth/PycharmProjects/bs4race/race.py data/ -e *.xml
> Traceback (most recent call last):
>   File "C:/Users/Sayth/PycharmProjects/bs4race/race.py", line 100, in <module>
>     data_attr(rootObs)
>   File "C:/Users/Sayth/PycharmProjects/bs4race/race.py", line 55, in data_attr
>     mytree = etree.parse(xml_data, parser_xml)
>   File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src\lxml\lxml.etree.c:81110)
>   File "src/lxml/parser.pxi", line 1832, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:118109)
>   File "src/lxml/parser.pxi", line 1852, in lxml.etree._parseFilelikeDocument (src\lxml\lxml.etree.c:118392)
>   File "src/lxml/parser.pxi", line 1747, in lxml.etree._parseDocFromFilelike (src\lxml\lxml.etree.c:117180)
>   File "src/lxml/parser.pxi", line 1162, in lxml.etree._BaseParser._parseDocFromFilelike (src\lxml\lxml.etree.c:111907)
>   File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:105102)
>   File "src/lxml/parser.pxi", line 702, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:106769)
>   File "src/lxml/lxml.etree.pyx", line 324, in lxml.etree._ExceptionContext._raise_if_stored (src\lxml\lxml.etree.c:12074)
>   File "src/lxml/parser.pxi", line 373, in lxml.etree._FileReaderContext.copyToBuffer (src\lxml\lxml.etree.c:102431)
> io.UnsupportedOperation: read
> 
> Process finished with exit code 1
> 
> Thoughts?
> 
> Sayth
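
As for the earlier UnicodeDecodeError: a byte 0xff at position 0 is what a
UTF-16 byte order mark starts with, so the file may simply not be UTF-8 at
all. I cannot tell from here whether that is the case for your data. The
following is a minimal standard-library sketch of how you could check;
sniff_bom and the sample bytes are made up for illustration.

```python
import codecs

# Byte order marks to check. UTF-32 comes first because its little-endian
# BOM begins with the same two bytes as the UTF-16 one.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data):
    """Return a codec name suggested by a leading BOM, or None without one."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


# Made-up sample: a tiny document saved as UTF-16 with a little-endian BOM.
sample = codecs.BOM_UTF16_LE + "<race name='demo'/>".encode("utf-16-le")

message = ""
try:
    sample.decode("utf-8")
except UnicodeDecodeError as exc:
    message = str(exc)  # complains about byte 0xff in position 0

encoding = sniff_bom(sample)
text = sample.decode(encoding)  # the BOM is consumed by the codec
print(message)
print(encoding, text)
```

With lxml you normally should not need to decode by hand: if you give the
parser the file name or the raw bytes, it can usually work out the encoding
from the BOM or the XML declaration itself.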



