[lxml-dev] Writing TargetParser in Cython
Hi all!

I'm trying to write a TargetParser in Cython just to compare performance. The problem is with data types. If I define the data method as "def data(self, char *data):", I'm unable to use it as a TargetParser. I get this error:

    def data(self, char *data):
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)

I can instantiate the class and call the data() and close() methods directly and everything works fine, but it refuses to work with lxml. A small test case follows:

----- _target.pyx -----------
cdef class Target:
    cdef list _data

    def __init__(self):
        self._data = []

    def data(self, char *data):
        self._data.append(data)

    def close(self):
        return ''.join(self._data)
---- end of _target.pyx ------

---- test.py -------
# -*- encoding: utf-8 -*-
import lxml.html
from lxml import etree
from _target import Target

res = etree.HTML(u"<span>ABCD</span>", parser=lxml.html.HTMLParser(target=Target()))
------- end of test.py ------
Hi,

Max Ivanov wrote:
> I'm trying to write a TargetParser in Cython just to compare performance. The problem is with data types. If I define the data method as "def data(self, char *data):", I'm unable to use it as a TargetParser. I get the error "UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: ordinal not in range(128)" at "def data(self, char *data):".
That's because you get a unicode string as input, which is not compatible with a char*.
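To illustrate the mismatch in plain Python: a char* can only point at bytes, so a unicode string has to be encoded before it can be passed as one. The sketch below assumes the implicit conversion behaves like an ASCII encode, which is what the error message suggests. The five-character sample string and the to_c_bytes() helper are hypothetical and only serve to reproduce the same kind of failure.

```python
# Hypothetical sample: five non-ASCII characters, chosen so that the
# failing range matches the "position 0-4" in the reported error.
text = u"\u0430\u0431\u0432\u0433\u0434"


def to_c_bytes(value, encoding):
    """Encode a text string into the bytes a char* could point at."""
    return value.encode(encoding)


# Assumption: the implicit unicode -> char* conversion acts like an
# ASCII encode, which cannot represent these characters.
try:
    to_c_bytes(text, "ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# An explicit encoding that covers the characters succeeds.
utf8_bytes = to_c_bytes(text, "utf-8")

print(ascii_error)
print(len(text), len(utf8_bytes))
```

The point is only that the conversion has to pick an encoding; leaving the argument untyped avoids the conversion entirely.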
>     def data(self, char *data):
>         self._data.append(data)
This is actually very inefficient. Cython will generate code here that retrieves the char* from the Python input string and then creates a new Python string from it to pass it into the .append() method.

lxml uses a C interface internally, but AFAIR, it's not exposed at the C API level. Check the sources in parser.pxi and parsertarget.pxi.

Stefan
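A minimal sketch of a target that avoids the char* round trip by taking the data argument as a plain Python object. It is written as ordinary Python here so it can be run without compiling. The assumption is that the same method bodies would work in the cdef class from the test case if the "char *" typing is simply dropped from data(). The chunk values are made up for illustration.

```python
class Target(object):
    """Collects text chunks as they arrive and joins them on close()."""

    def __init__(self):
        self._data = []

    def data(self, data):
        # The argument stays a Python object: no encode step and no
        # second string created from a char* copy.
        self._data.append(data)

    def close(self):
        # Join with a unicode separator, since the chunks are unicode.
        return u"".join(self._data)


# Direct calls with hypothetical chunks, including non-ASCII text.
target = Target()
for chunk in (u"AB", u"\u0430\u0431", u"CD"):
    target.data(chunk)
result = target.close()
print(repr(result))
```

An instance of such a class could then be passed as target= to lxml.html.HTMLParser, as in test.py from the original post.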
participants (2)
- Max Ivanov
- Stefan Behnel