Fwd: character encoding on windows
Hi,

I'm not sure this is a bug, but I am having difficulty parsing an XML file that contains the UTF-8 character superscript minus <https://www.fileformat.info/info/unicode/char/207b/index.htm> on Windows (Windows 10, 64-bit). I can reproduce this with the following script:

```python
import sys
from lxml import etree

print("%-20s: %s" % ('Python', sys.version_info))
print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))

tree = etree.parse(sys.stdin)
```

Using the following XML file:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<root>⁻</root>
```

And I'm getting the following output:

```python-traceback
Python              : sys.version_info(major=3, minor=6, micro=1, releaselevel='final', serial=0)
lxml.etree          : (4, 4, 1, 0)
libxml used         : (2, 9, 5)
libxml compiled     : (2, 9, 5)
libxslt used        : (1, 1, 30)
libxslt compiled    : (1, 1, 30)
Traceback (most recent call last):
  File "test.py", line 14, in <module>
    tree = etree.parse(sys.stdin)
  File "src\lxml\etree.pyx", line 3467, in lxml.etree.parse
  File "src\lxml\parser.pxi", line 1860, in lxml.etree._parseDocument
  File "src\lxml\parser.pxi", line 1880, in lxml.etree._parseFilelikeDocument
  File "src\lxml\parser.pxi", line 1775, in lxml.etree._parseDocFromFilelike
  File "src\lxml\parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
  File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 707, in lxml.etree._handleParseResult
  File "src\lxml\etree.pyx", line 318, in lxml.etree._ExceptionContext._raise_if_stored
  File "src\lxml\parser.pxi", line 374, in lxml.etree._FileReaderContext.copyToBuffer
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc81' in position 46: surrogates not allowed
```

This happens when the script is called with the following command:

```
python script.py < input.xml
```

Note that this works flawlessly on an Ubuntu system with both Python 2.7 and 3.7.

Best regards,
Romain
On Tue, Aug 27, 2019 at 3:57 AM Romain Goffe <romain.goffe@gmail.com> wrote:
Hi, Romain.

That's not caused by lxml, as you would see if you changed

    tree = etree.parse(sys.stdin)

to

    tree = etree.parse("input.xml")

The problem is that with the Windows console, you're subjecting your XML to a pipeline which is not a pure pass-through. Here's a more thorough explanation:

https://devblogs.microsoft.com/commandline/windows-command-line-unicode-and-...

Cheers,
Bob Kline
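
The effect described above can be reproduced without lxml or a Windows console. The sketch below is only an illustration under an assumption that the thread does not confirm: that the redirected `sys.stdin` text layer decoded the file's bytes as cp1252 with the `surrogateescape` error handler. The variable names are hypothetical.

```python
# Bytes of the sample file as stored on disk: UTF-8, where U+207B
# (superscript minus) is encoded as the three bytes E2 81 BB.
raw = '<?xml version="1.0" encoding="UTF-8"?>\n<root>\u207b</root>\n'.encode("utf-8")

# ASSUMPTION (not confirmed in the thread): the text layer on stdin decodes
# with cp1252 and "surrogateescape". Byte 0x81 is undefined in Python's
# cp1252 codec, so it is smuggled through as the lone surrogate U+DC81.
text = raw.decode("cp1252", errors="surrogateescape")
bad_index = text.index("\udc81")

# Re-encoding that string as strict UTF-8 fails, because lone surrogates
# are not valid in UTF-8.
try:
    text.encode("utf-8")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

print(bad_index)
print(error_message)

# Decoding the same bytes as UTF-8 round-trips cleanly.
assert raw.decode("utf-8").encode("utf-8") == raw
```

If that assumption holds, one way to keep the text layer out of the picture is to hand the parser raw bytes, for example by passing the binary stream `sys.stdin.buffer` to `etree.parse` instead of `sys.stdin`, so that the parser applies the encoding named in the XML declaration itself.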
participants (2)
- Bob Kline
- Romain Goffe