[cpyext] partial fake PEP393 implementation to provide access to single unicode characters in strings
Hi,

PEP393 (the new Unicode type in Py3.3) defines a rather useful C interface towards the characters of a Unicode string. I think it would be cool if cpyext provided that, so that access to single characters won't require copying the unicode buffer into C space anymore.

I attached an untested (and likely non-working) patch that adds the most important parts of it. The implementation does not care about non-BMP characters, which (if I'm not mistaken) are encoded as surrogate pairs in PyPy. Apart from that, the functions behave like their CPython counterparts, which means that the implementation shouldn't get in the way of a future real PEP393 implementation.

What do you think?

I have no idea if the way the index access is done in PyUnicode_READ_CHAR() is in any way efficient - would be good if it was. Specifically, the intention is to avoid creating a 1-character unicode string copy before taking its ord(). Does this happen automatically, or is there a way to make sure it does that?

Stefan
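To make the non-BMP caveat concrete, here is a minimal Python sketch, assuming a build that stores unicode strings as 16-bit code units (UTF-16). The helper names `to_utf16_units`, `read_code_unit` and `is_surrogate` are hypothetical and are not part of cpyext or of the attached patch; they only model what an index-based read sees when surrogate pairs are ignored.

```python
import struct


def to_utf16_units(text):
    """Return the list of 16-bit code units that encode `text` in UTF-16."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def read_code_unit(units, index):
    """Hypothetical model of a per-index read that ignores surrogate pairs."""
    return units[index]


def is_surrogate(unit):
    """True if `unit` is one half of a surrogate pair (U+D800..U+DFFF)."""
    return 0xD800 <= unit <= 0xDFFF


bmp = to_utf16_units(u"\u00e9")          # one character inside the BMP
non_bmp = to_utf16_units(u"\U0001F600")  # one character outside the BMP
```

Under this assumption the BMP character occupies one code unit, while the non-BMP character occupies two, and reading index 0 of the latter yields a surrogate half rather than the full code point.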
Hi Stefan,

On Sat, Apr 14, 2012 at 18:44, Stefan Behnel <stefan_ml@behnel.de> wrote:
> PEP393 (the new Unicode type in Py3.3) defines a rather useful C interface towards the characters of a Unicode string. I think it would be cool if cpyext provided that, so that access to single characters won't require copying the unicode buffer into C space anymore.

FWIW, if it makes sense, you can add PyPy-specific API functions not in the standard CPython C API, too. I'm thinking about accessing *string* characters, for example.

> Specifically, the intention is to avoid creating a 1-character unicode string copy before taking its ord(). Does this happen automatically, or is there a way to make sure it does that?

In RPython, indexing a string returns a single char, which is a different low-level type than a full string (just "char" in C).

A bientôt,
Armin.
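As a rough illustration of the access pattern being discussed, the sketch below shows the Python-level shape of a per-character read. The function name `read_char` is hypothetical and is not taken from the patch. In plain Python, `s[index]` produces a one-character string; Armin's remark suggests that in RPython the same expression is typed as a single char, so `ord()` would not need an intermediate string object.

```python
def read_char(s, index):
    """Return the code point at `index` (hypothetical helper)."""
    if not 0 <= index < len(s):
        raise IndexError("character index out of range")
    # Index first, then convert the single character to its ordinal.
    return ord(s[index])
```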
2012/4/20 Armin Rigo <arigo@tunes.org>
> On Sat, Apr 14, 2012 at 18:44, Stefan Behnel <stefan_ml@behnel.de> wrote:
>> PEP393 (the new Unicode type in Py3.3) defines a rather useful C interface towards the characters of a Unicode string. I think it would be cool if cpyext provided that, so that access to single characters won't require copying the unicode buffer into C space anymore.
>
> FWIW, if it makes sense, you can add PyPy-specific API functions not in the standard CPython C API, too. I'm thinking about accessing *string* characters, for example.

But is it desirable? The first call to PyUnicode_AsUnicode will allocate and copy the unicode buffer, but subsequent calls will quickly return the same address.

--
Amaury Forgeot d'Arc
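The copy-once behaviour described here can be modelled with a small Python sketch. The class `CachedUnicodeBuffer` and its `copies` counter are hypothetical illustrations of the caching idea only; they are not how cpyext implements PyUnicode_AsUnicode.

```python
import array


class CachedUnicodeBuffer(object):
    """Hypothetical model: copy the characters once, then reuse the copy."""

    def __init__(self, text):
        self._text = text
        self._buffer = None   # nothing is allocated until first requested
        self.copies = 0       # counts how often the buffer was built

    def as_unicode(self):
        if self._buffer is None:
            # First call: allocate a buffer and copy every code point into it.
            self._buffer = array.array("L", [ord(ch) for ch in self._text])
            self.copies += 1
        # Later calls: hand back the same buffer without copying again.
        return self._buffer
```

In this model the cost of the copy is paid once per string object, so repeated single-character reads through the buffer only pay it on the first access.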
Hi Amaury,

On Fri, Apr 20, 2012 at 11:02, Amaury Forgeot d'Arc <amauryfa@gmail.com> wrote:
> But is it desirable? The first call to PyUnicode_AsUnicode will allocate and copy the unicode buffer, but subsequent calls will quickly return the same address.

Indeed, it's a bit unclear. If I may repeat myself, I still think that the performance problems of cpyext are really due to the costly double-mapping between PyPy's real objects and PyObjects, together with INCREF/DECREF being function calls. This is the first place I would look at if I were concerned about it. (Stefan: see a previous mail where I described how to start.)

A bientôt,
Armin.
Armin Rigo, 20.04.2012 11:16:
> On Fri, Apr 20, 2012 at 11:02, Amaury Forgeot d'Arc wrote:
>> But is it desirable? The first call to PyUnicode_AsUnicode will allocate and copy the unicode buffer, but subsequent calls will quickly return the same address.
>
> Indeed, it's a bit unclear. If I may repeat myself, I still think that the performance problems of cpyext are really due to the costly double-mapping between PyPy's real objects and PyObjects, together with INCREF/DECREF being function calls. This is the first place I would look at if I were concerned about it.

Well, have you seen my macro changes in issue 1121?

https://bugs.pypy.org/issue1121

At least for the new ref-counting macros, I already presented the usual stupid micro benchmark in a previous mail, giving me almost a factor of 2 in performance for objects with a ref-count > 1 in C space. I'll add the numbers to the ticket.

Stefan
Participants (3): Amaury Forgeot d'Arc, Armin Rigo, Stefan Behnel