Unicode decoding is not supported on this platform with python3.3 debug

I'm trying to debug a Python problem, so I built lxml against the pydebug-compiled version of Python 3.3, but the XMLParser.feed() method isn't working...

    Corvidae:Python-3.3.0 markgrandi$ ./python.exe
    Python 3.3.0 (default, Dec 9 2012, 14:01:13)
    [GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> for line in x:
    ...     print(x)
    [126070 refs]
Any idea what is causing this and how I can fix it?

~mark
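For context, here is a minimal sketch (an assumed reproduction, since the exact session output is not preserved above) of the kind of feed() call that exercises the failing Unicode path:

    from lxml import etree

    parser = etree.XMLParser()
    # Feeding a str (rather than bytes) goes through lxml's Unicode parsing
    # path, which is what raises "Unicode parsing is not supported on this
    # platform" on the affected Python 3.3 builds.
    parser.feed("<root><child/></root>")
    root = parser.close()
    print(root.tag)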

Mark Grandi, 09.12.2012 23:17:
How is wchar_t defined on your platform? That's what it's currently trying to parse (which is more of a missing feature - this could be more efficient in Py3.3 but currently isn't). Also, make sure you linked against libiconv when building. Here's how lxml figures out how to parse Unicode strings:

https://github.com/lxml/lxml/blob/7eca2bb4b704058c0430ded3d1c05ed418ac7223/s...

Stefan
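As a quick way to answer that from Python itself (a small sketch, not part of the original mail), ctypes reports the platform's wchar_t width and sys.maxunicode reflects the interpreter's Unicode range:

    import ctypes
    import sys

    # Size of the platform's wchar_t in bytes: 2 on Windows, usually 4 on
    # Linux and Mac OS X.
    print("sizeof(wchar_t):", ctypes.sizeof(ctypes.c_wchar))

    # 0xFFFF on a narrow (UTF-16) build of Python <= 3.2,
    # 0x10FFFF on a wide (UCS-4) build and always on Python 3.3+ (PEP 393).
    print("sys.maxunicode:", hex(sys.maxunicode))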

Well, the thing is, it seems to be a bug in either lxml or libxml2 with Python 3, as lxml works fine on both the release and debug builds of Python 3.2, but on Python 3.3 neither works. I looked at the source for parser.pxi, and it basically comes down to libxml2 not being able to find a suitable encoding?

    if enchandler is not NULL:
        global _UNICODE_ENCODING
        tree.xmlCharEncCloseFunc(enchandler)
        _UNICODE_ENCODING = enc

and

        py_buffer_len = python.PyBytes_GET_SIZE(data)
    elif python.PyUnicode_Check(data):
        if _UNICODE_ENCODING is NULL:
            raise ParserError, \
                u"Unicode parsing is not supported on this platform"

On Thu, Dec 13, 2012 at 11:00 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
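For anyone hitting the same error, one way to sidestep that code path (a workaround sketch, not an official fix) is to encode the text to UTF-8 yourself and feed bytes, so the _UNICODE_ENCODING check is never reached:

    from lxml import etree

    parser = etree.XMLParser()
    text = "<root>caf\u00e9</root>"

    # Bytes input takes the PyBytes_Check() branch quoted above, so the
    # "Unicode parsing is not supported on this platform" error never triggers.
    parser.feed(text.encode("utf-8"))
    root = parser.close()
    print(root.text)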

Hi, please don't top-post.

Mark Grandi, 14.12.2012 07:29:
That's because the way Unicode works has changed in Py3.3. So, again: how is wchar_t defined on your system? Is it two bytes or four bytes long? And are you using a two-bytes Unicode build of Py3.2 or a four-bytes one? I would guess that both are different on your system.
Correct.

Stefan
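To make the Py3.3 change concrete (an illustrative sketch, not from the original mail): under PEP 393 the per-character storage width of a str depends on its contents rather than on a compile-time narrow/wide setting, which is why code that assumed a fixed Py_UNICODE/wchar_t buffer layout needs updating:

    import sys

    # On Python 3.3+ each string is stored with 1, 2 or 4 bytes per character,
    # chosen by the widest character it contains (PEP 393), so the object
    # sizes below differ even though every string has four characters.
    for s in ("abcd", "abc\u00e9", "abc\U0001F600"):
        print(repr(s), sys.getsizeof(s))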

I was using the http://www.python.org builds both times, so if anything changed, it was whatever build settings python.org uses for their Mac OS X binaries. On my Mac, running cpp -dM and then pressing Ctrl+D says "#define __WCHAR_MAX__ 2147483647", so wchar_t is 4 bytes. I also printed sys.maxunicode on both the python.org build of Python 3.3 and my own build of Python 3.2.3 (default settings):

    Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 01:25:11)
    [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.

    Python 3.2.3 (default, Aug 28 2012, 06:42:49)
    [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.

So it seems something is going wrong when Python is using 4 bytes for Unicode?

~mark
