windows utf8 & lxml
Peter Otten
__peter__ at web.de
Wed Dec 21 04:36:39 EST 2016
Sayth Renshaw wrote:
> On Tuesday, 20 December 2016 22:54:03 UTC+11, Sayth Renshaw wrote:
>> Hi
>>
>> I have been trying to get a script to work on windows that works on mint.
>> The key blocker has been utf8 errors, most of which I have solved.
>>
>> Now however the last error I am trying to overcome, the solution appears
>> to be to use the .decode('windows-1252') to correct an ascii error.
>>
>> I am using lxml to read my content and decode is not supported are there
>> any known ways to read with lxml and fix unicode faults?
>>
>> The key part of my script is
>>
>> for content in roots:
>>     utf8_parser = etree.XMLParser(encoding='utf-8')
>>     fix_ascii = utf8_parser.decode('windows-1252')
>>     mytree = etree.fromstring(
>>         content.read().encode('utf-8'), parser=fix_ascii)
>>
>> Without the added .decode my code looks like
>>
>> for content in roots:
>>     utf8_parser = etree.XMLParser(encoding='utf-8')
>>     mytree = etree.fromstring(
>>         content.read().encode('utf-8'), parser=utf8_parser)
>>
>> However doing it in such a fashion returns this error:
>>
>> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0:
>> invalid start byte
>>
>> I found this SO answer for it, http://stackoverflow.com/a/29217546/461887,
>> but cannot seem to implement it with lxml.
>>
>> Ideas?
>>
>> Sayth
>
> Why is windows so hard.
I don't think this has anything to do with the OS. Your xml_data is
probably not what you think it is. Compare:
$ python3
Python 3.4.3 (default, Nov 17 2016, 01:08:31)
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> import lxml.etree
>>> lxml.etree.parse(sys.stdout)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "lxml.etree.pyx", line 3239, in lxml.etree.parse (src/lxml/lxml.etree.c:69955)
  File "parser.pxi", line 1769, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:102257)
  File "parser.pxi", line 1789, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:102516)
  File "parser.pxi", line 1684, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:101442)
  File "parser.pxi", line 1134, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:97069)
  File "parser.pxi", line 582, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:91275)
  File "parser.pxi", line 679, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:92426)
  File "lxml.etree.pyx", line 327, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/lxml.etree.c:10196)
  File "parser.pxi", line 373, in lxml.etree._FileReaderContext.copyToBuffer (src/lxml/lxml.etree.c:89083)
io.UnsupportedOperation: not readable
That looks similar to what you get.
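
If that is what is going on, a check like the following might help you
find it. This is only a rough sketch, not your actual code: parse_checked,
the race.xml file name and the <meeting>/<race> elements are made up for
illustration, and the import falls back to xml.etree.ElementTree so the
example also runs where lxml is not installed.

```python
import os
import tempfile

try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Fallback so the sketch still runs without lxml installed.
    import xml.etree.ElementTree as etree


def parse_checked(source):
    """Parse source, rejecting file objects that are not open for reading.

    parse_checked is a hypothetical helper, not part of lxml.
    """
    readable = getattr(source, "readable", None)
    if callable(readable) and not readable():
        raise ValueError("not opened for reading: %r" % (source,))
    return etree.parse(source)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "race.xml")  # made-up sample file
        with open(path, "wb") as f:
            f.write(b"<meeting><race id='1'/></meeting>")

        # A handle opened for appending cannot be read, so it is rejected
        # before the parser ever sees it.
        with open(path, "ab") as not_readable:
            try:
                parse_checked(not_readable)
            except ValueError as exc:
                print("rejected:", exc)

        # Opened in binary read mode, the same file parses fine.
        with open(path, "rb") as f:
            return parse_checked(f).getroot().tag


if __name__ == "__main__":
    print(demo())
```

Printing type(xml_data) and repr(xml_data) inside your loop should tell
you just as quickly what you are really passing to etree.parse().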
> Sort of running out of ideas, tried methods in the docs, SO etc.
>
> Currently
>
> for xml_data in roots:
>     parser_xml = etree.XMLParser()
>     mytree = etree.parse(xml_data, parser_xml)
>
> Returns
> C:\Users\Sayth\Anaconda3\envs\race\python.exe C:/Users/Sayth/PycharmProjects/bs4race/race.py data/ -e *.xml
> Traceback (most recent call last):
>   File "C:/Users/Sayth/PycharmProjects/bs4race/race.py", line 100, in <module>
>     data_attr(rootObs)
>   File "C:/Users/Sayth/PycharmProjects/bs4race/race.py", line 55, in data_attr
>     mytree = etree.parse(xml_data, parser_xml)
>   File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src\lxml\lxml.etree.c:81110)
>   File "src/lxml/parser.pxi", line 1832, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:118109)
>   File "src/lxml/parser.pxi", line 1852, in lxml.etree._parseFilelikeDocument (src\lxml\lxml.etree.c:118392)
>   File "src/lxml/parser.pxi", line 1747, in lxml.etree._parseDocFromFilelike (src\lxml\lxml.etree.c:117180)
>   File "src/lxml/parser.pxi", line 1162, in lxml.etree._BaseParser._parseDocFromFilelike (src\lxml\lxml.etree.c:111907)
>   File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:105102)
>   File "src/lxml/parser.pxi", line 702, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:106769)
>   File "src/lxml/lxml.etree.pyx", line 324, in lxml.etree._ExceptionContext._raise_if_stored (src\lxml\lxml.etree.c:12074)
>   File "src/lxml/parser.pxi", line 373, in lxml.etree._FileReaderContext.copyToBuffer (src\lxml\lxml.etree.c:102431)
> io.UnsupportedOperation: read
>
> Process finished with exit code 1
>
> Thoughts?
>
> Sayth
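
As for the earlier UnicodeDecodeError: a byte 0xff at position 0 is how a
UTF-16 little-endian byte order mark begins, so one possibility (this is a
guess, nothing in the thread confirms it) is that the file is UTF-16
rather than UTF-8. If so, it may be simpler to hand the parser the raw
bytes and let it work out the encoding from the BOM and the XML
declaration, instead of decoding and re-encoding yourself. A small sketch
with made-up data; it falls back to xml.etree.ElementTree when lxml is not
installed:

```python
import codecs

try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Fallback so the sketch still runs without lxml installed.
    import xml.etree.ElementTree as etree

# Made-up sample document, stored as UTF-16 LE with a byte order mark.
text = '<?xml version="1.0" encoding="utf-16"?><horse name="caf\u00e9"/>'
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def decodes_as_utf8(raw):
    """Return True if raw is valid UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Decoding these bytes as UTF-8 fails on the very first byte (0xff).
print(hex(data[0]), decodes_as_utf8(data))

# Given the bytes unchanged, the parser detects the encoding on its own.
root = etree.fromstring(data)
print(root.tag, root.get("name"))
```

For a file on disk that would mean opening it with mode "rb" and passing
the file object (or the bytes) to the parser without calling .encode() or
.decode() in between.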