Re: [Cython] Specializing str methods?
Hi, the right place to discuss this is the cython-devel mailing list. John Ehresman, 19.05.2011 18:17:
On 4/13/11 11:23 AM, Stefan Behnel wrote:
You can add BuiltinMethod entries for the "str" type in Builtin.py and map them to a function name like "__Pyx_Str_PyString_WhatEver()". Then, define "utility_code" code blocks for each of the methods that #defines that name to the appropriate PyString_*() or PyUnicode_*() C-API function, depending on the Python version it runs with. There are a lot of examples for that in Builtin.py already.
I've finally got back to this and have the start of an implementation. I've attached a diff, though I'm not sure what the preferred workflow is. Should I push to github and / or create a bug in trac?
It's best to create a pull request on github, potentially accompanied by a request for review on the developer mailing list. We get a notification about pull requests, but it's still better to give us an idea in what state you think your patch is and what kind of review you need. The review can then happen by commenting on the changes directly in github.
I'll probably add more methods, but wanted to get feedback before doing so.
Certainly the right decision.
I also intend to contribute other patches in the future.
Please do. :)
One thing that I didn't expect is that I originally declared the method as BuiltinMethod("startswith", "TT", "O", "_Pyx_StrStartsWith", utility_code=str_startswith_utility_code), and expected the specialization to only be used when both self and arg where a str, but instead cython always used the specialization and added a check for arg being a str before the specialization was called. I ended up writing a function that works for both unicode & bytes self and an object arg.
You don't really have to care about the type of the argument when applying this optimisation. As long as it's clear that the object that provides the method is a "str" (although it's potentially None even then!), it's clear what the method will do when being called, even if we don't know the argument type at compile time. Just provide a fallback, as you did anyway. I'll comment on your patch below. diff --git a/Cython/Compiler/Builtin.py b/Cython/Compiler/Builtin.py +str_startswith_utility_code = UtilityCode( +proto=""" "proto" is for the forward declaration of the function signature. Except for very simple cases (short macros) the implementation goes into "impl". +static PyObject* _Pyx_StrStartsWith(PyObject* self, PyObject* arg) +{ + PyObject* ret; + + if ( PyBytes_CheckExact(self) && PyBytes_CheckExact(arg) ) { + if ( PyBytes_GET_SIZE(self) < PyBytes_GET_SIZE(arg) ) + ret = Py_False; + else if ( memcmp(PyBytes_AS_STRING(self), PyBytes_AS_STRING(arg), + PyBytes_GET_SIZE(arg)) == 0 ) + ret = Py_True; + else + ret = Py_False; + Py_INCREF(ret); + return ret; + } + + if ( PyUnicode_CheckExact(self) && PyUnicode_CheckExact(arg) ) { + if ( PyUnicode_GET_SIZE(self) < PyUnicode_GET_SIZE(arg) ) + ret = Py_False; + else if ( memcmp(PyUnicode_AS_UNICODE(self), PyUnicode_AS_UNICODE(arg), + PyUnicode_GET_DATA_SIZE(arg)) == 0 ) + ret = Py_True; + else + ret = Py_False; + Py_INCREF(ret); + return ret; + } + + return PyObject_CallMethod(self, "startswith", "O", arg); I'd split this up into two functions that can be applied to different types independently, and wrap it in a function that calls either one depending on PY_MAJOR_VERSION. After all, "str" is *known* to be "bytes" in Py2 and "unicode" in Py3. @@ -522,8 +557,12 @@ builtin_types_table = [ ]), ("bytes", "PyBytes_Type", []), - ("str", "PyString_Type", []), + ("str", "PyString_Type", [BuiltinMethod("startswith", "TO", "O", "_Pyx_StrStartsWith", + utility_code=str_startswith_utility_code), + ]), It's better to return a "bint" to avoid returning Python objects for True/False. The type character for that is "b". There's a "__Pyx_PyObject_IsTrue(obj)" macro that you can use to convert the fallback return value. It's usually also ok to copy error handling code from CPython, and to raise the exception directly without going into the fallback case. ("unicode", "PyUnicode_Type", [BuiltinMethod("join", "TO", "T", "PyUnicode_Join"), + BuiltinMethod("startswith", "TO", "O", "_Pyx_StrStartsWith", + utility_code=str_startswith_utility_code), ]), This is already optimised in Cython/Compiler/Optimize.py ("tailmatch"), basically because it allows different numbers of arguments that need to be dealt with. It may be worth going the same route for "str". The decision usually depends on how complex the code transformation is. The method table in Builtins is clearly limited. diff --git a/tests/run/stropts.pyx b/tests/run/stropts.pyx @@ -0,0 +1,84 @@ +cimport cython + +@cython.test_assert_path_exists( + '//AttributeNode[@member="_Pyx_StrStartsWith"]') +def str_startswith_str(): + """ + >>> str_startswith_str() + True + """ + + cdef str a + cdef str b + + a = '' + b = '' + return a.startswith(b) [...] I usually just implement one or a few test functions and use the doctest to pass in different values. Stefan
On 5/19/11 3:35 PM, Stefan Behnel wrote:
This is already optimised in Cython/Compiler/Optimize.py ("tailmatch"), basically because it allows different numbers of arguments that need to be dealt with. It may be worth going the same route for "str". The decision usually depends on how complex the code transformation is. The method table in Builtins is clearly limited.
I'll look at doing this in Optimize.py. The hunk of my patch below fixes a generic bug, though. Without it a utility function for a method is used, but the definition of it is never injected into the C file. Thanks, John diff --git a/Cython/Compiler/ExprNodes.py b/Cython/Compiler/ExprNodes.py index 6ab8be0..ac7ca67 100755 --- a/Cython/Compiler/ExprNodes.py +++ b/Cython/Compiler/ExprNodes.py @@ -3746,6 +3746,8 @@ class AttributeNode(ExprNode): if entry.is_variable or entry.is_cmethod: self.type = entry.type self.member = entry.cname + if entry.utility_code: + env.use_utility_code(entry.utility_code) return else: # If it's not a variable or C method, it must be a Python
participants (2)
-
John Ehresman -
Stefan Behnel