[Python-Dev] VM and Language summit info for those not at Pycon (and those that are!)
Stefan Behnel
stefan_ml at behnel.de
Mon Mar 21 11:58:50 CET 2011
[long post ahead, again]
Guido van Rossum, 21.03.2011 03:46:
> Thanks for the clarifications. I now have a much better understanding
> of what Cython is. But I'm not sold. For one, your attitude about
> strict language compatibility worries me when it comes to the stdlib.
Not sure what you mean exactly. Given our large user base, we do worry a
lot about things like backwards compatibility, for example.
If you are referring to compatibility with Python, I don't think anyone in
the project really targets Cython as a drop-in replacement for a Python
runtime. We aim to compile Python code, yes, and there's a hand-wavy idea
in the back of our head that we may want a plain Python compatibility mode
at some point that will disable several important optimisations. But
there's no real drive for that, simply because Cython users usually care a
lot more about speed than about strict Python language compliance in
dangerous areas like overridden builtins (such as 'range'). And Cython
users know that they also have CPython available, which allows them to
easily get 100% compatibility if they need it, be it through an import or
by calling "exec".
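To make the "overridden builtins" case concrete, here is a minimal plain-Python sketch (not Cython code; the names `source` and `namespace` are made up for this illustration). It runs a snippet through exec under the regular interpreter, where a module-level redefinition of 'range' is honoured. A compiler that assumes 'range' always means the builtin could behave differently for this kind of code.

```python
# Hypothetical snippet that shadows the builtin 'range' at module level.
source = """
def range(n):
    return ["overridden"] * n

result = range(2)
"""

namespace = {}
# exec runs the code under the normal interpreter, so the name 'range'
# is looked up dynamically and the shadowing definition is used.
exec(source, namespace)
result = namespace["result"]
print(result)
```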
That being said, we do consider any deviation from Python language
semantics a bug, and try to fix at least those with a user impact.
Compatibility has improved a lot since the early days.
> Also, I don't know how big it is
It's not small. The compiler is approaching 50,000 lines of
Python code.
> but it seems putting the cart before
> the horse to use it to optimize the stdlib. Cython feels much less
> mature than CPython;
It certainly is not completely stable, neither the language nor the
compiler, but it has been used for production code ever since the project
started (from Pyrex' original inheritance).
There are parts of the language that we are still fleshing out, but we try hard
to keep the user impact low and to adhere to the "expected" Python
semantics as closely as possible whenever we design new language features.
Much of what we need to fix these days is actually due to different
language semantics that originally appeared in Pyrex, or to differences
between Python 2 and Python 3 that make it tricky for users to write
portable code.
> but the latter should only have dependencies that
> themselves change even slower than CPython.
I understand that. C is certainly evolving a *lot* slower than Cython.
Personally, I wouldn't consider Cython a dependency even if CPython started
using code written in Cython. It's more like a development tool, as users
won't have to care if the generated C sources ship with the distribution.
Only those who want to build from hg sources and distributors that patch
impacted release sources will have to take care to install the
corresponding Cython version. Shipping tested C sources is certainly the
recommended way of using Cython.
> I also am unclear on how
> exactly you're supporting the different semantics in Python 2 vs. 3
> without recompiling.
We try to make it easy for users to write portable code by keeping the code
semantics fixed as much as possible once it's compiled. However, there are
things that we don't currently fix. For example, we only try to keep
builtins compatible as far as we consider reasonable. If you write
    x = range(5)
in your Cython code, you will get a list in Py2 and a lazy range object in Py3. If
you write "xrange(5)", however, you will get an xrange object in Py2 and a
range object in Py3. Same for "unicode" etc. We also don't change the API
of the bytes type (returning integers on indexing in Python 3), even though
it represents a major portability hassle for our users and also prevents
several optimisations (and language features) that Cython could otherwise
provide.
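For code that has to behave the same on both major versions, a minimal sketch of the usual workarounds might look like this. It is plain Python that also runs uncompiled; the helper names are hypothetical and not part of Cython.

```python
def first_byte_as_int(data):
    """Return the first byte of a bytes object as an integer on Py2 and Py3."""
    first = data[0]
    # Py3 indexing already yields an int; Py2 yields a 1-character string.
    return first if isinstance(first, int) else ord(first)


def first_byte_as_bytes(data):
    """Return the first byte as a bytes object; slicing behaves the same on both."""
    return data[0:1]


# Wrapping in list() forces a list regardless of what range() returns.
numbers = list(range(5))
```

Slicing instead of indexing, and wrapping range() in list() where a list is really needed, sidesteps the differences described above at the cost of an extra object.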
String semantics are actually quite complex inside of the Cython compiler
(as is the cross-Python/C/C++ type system in general) and were subject to
major design/usability discussions in the past. We basically have three
Python string types: bytes (Py2/3 bytes), unicode (Py2 unicode, Py3 str)
and str (Py2/3 str), as well as C types like char/char* or Py_UCS4. The
'str' type is needed because parts of CPython, its stdlib and external
libraries actually require bytes in Python 2 (and it's sort-of the "native"
string type for ASCII text there), but require unicode text in Python 3. To
write portable code, you can use unprefixed string constants in Cython
code, which will become the respective 'str' type in each of the runtime
environments. That's a feature our users appreciate a great deal, and it is
obviously modelled after 2to3.
However, since the API of 'str' isn't portable, you will only get a
performance boost when you use the unicode (and, for portable operations,
bytes) type, especially for looping, 'in' tests, etc. That will basically
allow Cython to 'unbox' the strings into a C array, with the obvious
optimisations like unboxed Unicode characters etc. As I said, quite a
complex type system.
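The three-way split can be mimicked in plain Python, which may help to see why 'str' is the odd one out. This is only a sketch of the idea: the names `bytes_type`, `text_type`, `native_type` and `to_native` are hypothetical and have nothing to do with how the Cython compiler represents these types internally.

```python
import sys

PY2 = sys.version_info[0] == 2

bytes_type = type(b"")    # Py2 'str', Py3 'bytes'
text_type = type(u"")     # Py2 'unicode', Py3 'str'
native_type = type("")    # whatever an unprefixed literal is at runtime


def to_native(value, encoding="ascii"):
    """Convert bytes or text to the runtime's native 'str' type."""
    if PY2 and isinstance(value, text_type):
        return value.encode(encoding)
    if not PY2 and isinstance(value, bytes_type):
        return value.decode(encoding)
    return value
```

On Python 2, `native_type` is the bytes type; on Python 3 it is the text type. That is the reason the API of 'str' cannot be relied on in portable code.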
Cython is actually a pretty cool tool for text processing these days. For
example, this
    for c in some_typed_unicode_string:
        if c == u'X': ...
        elif c in u' \t\r\n': ...
        elif c in u'AB12UV': ...
        else: ...
will turn into a C pointer loop around a C switch statement. And I heard of
some fast bindings to C/C++ regex libs etc. that are getting written.
Shipping those with a PyCapsule based C-API (Cython can generate and import
those) and buffer interface support would provide a really speedy way to
use them from other Cython modules, without sacrificing the Python language
feeling.
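A self-contained version of that loop, written so that it also runs as plain Python, could look like the sketch below. The function name `classify` and the counting logic are made up for illustration. In Cython, the argument would presumably be declared as `unicode text` to get the typed loop described above, which is left out here to keep the block runnable in the interpreter.

```python
def classify(text):
    """Count characters of a unicode string by category (illustrative only)."""
    counts = {"x": 0, "space": 0, "marked": 0, "other": 0}
    for c in text:
        if c == u'X':
            counts["x"] += 1
        elif c in u' \t\r\n':
            counts["space"] += 1
        elif c in u'AB12UV':
            counts["marked"] += 1
        else:
            counts["other"] += 1
    return counts


result = classify(u"X A\t1z")
print(result)
```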
> OTOH I think you've got the perfect audience in the scientific Python
> world.
Partly, but alongside the majority of FFI and "speeding up CPython code"
users. The scientific Python world is (obviously) very focussed on numeric
computation. Cython is much more versatile than that.
> Have you tried replacing selected stdlib modules with their
> Cython-optimized equivalents in some of the NumPy/SciPy distros? (E.g.
> what about Enthought's Python distros?) Depending on how well that
> goes I might warm up to Cython more!
Hmm, I hadn't heard about those before. I'll ask on our mailing list if
anyone's aware of them. I doubt that the stdlib participates in the
critical parts of scientific computation code. Maybe alternative CSV
parsers or something like that, but I'd be surprised if they were
compatible with what's in the stdlib.
Stefan