[Python-Dev] VM and Language summit info for those not at Pycon (and those that are!)

Mon Mar 21 11:58:50 CET 2011

[long post ahead, again]

Guido van Rossum, 21.03.2011 03:46:
> Thanks for the clarifications. I now have a much better understanding
> of what Cython is. But I'm not sold. For one, your attitude about
> strict language compatibility worries me when it comes to the stdlib.

Not sure what you mean exactly. Given our large user base, we do worry a 
lot about things like backwards compatibility, for example.

If you are referring to compatibility with Python, I don't think anyone in 
the project really targets Cython as a drop-in replacement for a Python 
runtime. We aim to compile Python code, yes, and there's a hand-wavy idea 
in the back of our heads that we may want a plain Python compatibility mode 
at some point that will disable several important optimisations. But 
there's no real drive for that, simply because Cython users usually care a 
lot more about speed than about strict Python language compliance in 
dangerous areas like overridden builtins (such as 'range'). And Cython 
users know that they also have CPython available, which allows them to 
easily get 100% compatibility if they need it, be it through an import or 
by calling "exec".
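To illustrate the kind of code I mean by "overridden builtins", here is a minimal sketch that runs such code through "exec" in the plain CPython interpreter. The source string and the names in it are made up for this example, and it says nothing about what a particular Cython version would do with the same code.

```python
# Hypothetical module source that shadows the builtin 'range'.
source = """
def range(n):
    # Replacement for the builtin: returns only the last index.
    return [n - 1]

seen = [i for i in range(3)]
"""

# Running it through exec gives plain CPython semantics, so the
# module-level 'range' wins over the builtin.
namespace = {}
exec(source, namespace)
seen = namespace["seen"]
print(seen)
```

A compiler that turns "for i in range(n)" into a C loop has to either detect the override or ignore it, which is why this is a dangerous area.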

That being said, we do consider any deviation from Python language 
semantics a bug, and try to fix at least those with a user impact. 
Compatibility has improved a lot since the early days.

> Also, I don't know how big it is

It's not small. The compiler is approaching 50,000 lines of Python code.

> but it seems putting the cart before
> the horse to use it to optimize the stdlib. Cython feels much less
> mature than CPython;

It certainly is not completely stable, neither the language nor the 
compiler, but it has been used for production code ever since the project 
started (building on what it inherited from Pyrex).

There are parts of the language that we are still fleshing out, but we try hard 
to keep the user impact low and to adhere to the "expected" Python 
semantics as closely as possible whenever we design new language features. 
Much of what we need to fix these days is actually due to different 
language semantics that originally appeared in Pyrex, or to differences 
between Python 2 and Python 3 that make it tricky for users to write 
portable code.

> but the latter should only have dependencies that
> themselves change even slower than CPython.

I understand that. C is certainly evolving a *lot* slower than Cython.

Personally, I wouldn't consider Cython a dependency even if CPython started 
using code written in Cython. It's more like a development tool, as users 
won't have to care if the generated C sources ship with the distribution. 
Only those who want to build from hg sources and distributors that patch 
impacted release sources will have to take care to install the 
corresponding Cython version. Shipping tested C sources is certainly the 
recommended way of using Cython.

> I also am unclear on how
> exactly you're supporting the different semantics in Python 2 vs. 3
> without recompiling.

We try to make it easy for users to write portable code by keeping the code 
semantics fixed as much as possible once it's compiled. However, there are 
things that we don't currently fix. For example, we only try to keep 
builtins compatible as far as we consider reasonable. If you write

    x = range(5)

in your Cython code, you will get a list in Py2 and a lazy range object in Py3. If 
you write "xrange(5)", however, you will get an xrange object in Py2 and a 
range object in Py3. Same for "unicode" etc. We also don't change the API 
of the bytes type (returning integers on indexing in Python 3), even though 
it represents a major portability hassle for our users and also prevents 
several optimisations (and language features) that Cython could otherwise 
provide.
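The runtime differences mentioned above are easy to see in plain Python. This is only a sketch of the interpreter behaviour that compiled code inherits, with variable names chosen for the example, not a description of what Cython generates.

```python
import sys

PY2 = sys.version_info[0] == 2

# range() gives a list on Python 2 and a range object on Python 3.
numbers = range(5)
is_list = isinstance(numbers, list)

# Wrapping it in list() gives the same result on both.
portable_list = list(range(5))

# Indexing bytes gives a 1-character string on Python 2
# and an integer on Python 3.
data = b"abc"
first = data[0]

# Slicing gives a length-1 bytes object on both.
first_portable = data[0:1]

print(is_list, portable_list, first, first_portable)
```

Code that sticks to the list() and slicing forms behaves the same in both environments, at the cost of an extra object.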

String semantics are actually quite complex inside of the Cython compiler 
(as is the cross-Python/C/C++ type system in general) and were subject to 
major design/usability discussions in the past. We basically have three 
Python string types: bytes (Py2/3 bytes), unicode (Py2 unicode, Py3 str) 
and str (Py2/3 str), as well as C types like char/char* or Py_UCS4. The 
'str' type is needed because parts of CPython, its stdlib and external 
libraries actually require bytes in Python 2 (and it's sort-of the "native" 
string type for ASCII text there), but require unicode text in Python 3. To 
write portable code, you can use unprefixed string constants in Cython 
code, which will become the respective 'str' type in each of the runtime 
environments. That's a remarkably well appreciated feature among our 
users, and obviously modelled after 2to3.
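As a rough illustration of the three string types, here is what the literal forms look like at runtime in plain Python (without "unicode_literals"). The variable names are only for this sketch; inside Cython the types are additionally known at compile time, which this example does not show.

```python
import sys

PY2 = sys.version_info[0] == 2

native = "abc"   # unprefixed: the 'str' type of the running interpreter
text = u"abc"    # always Unicode text
raw = b"abc"     # always bytes

# 'str' is the bytes type on Python 2 and the text type on Python 3.
native_is_bytes = isinstance(native, bytes)
native_is_text = isinstance(native, type(text))

print(native_is_bytes, native_is_text)
```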

However, since the API of 'str' isn't portable, you will only get a 
performance boost when you use the unicode (and, for portable operations, 
bytes) type, especially for looping, 'in' tests, etc. That will basically 
allow Cython to 'unbox' the strings into a C array, with the obvious 
optimisations like unboxed Unicode characters etc. As I said, quite a 
complex type system.

Cython is actually a pretty cool tool for text processing these days. For 
example, this

     for c in some_typed_unicode_string:
         if c == u'X': ...
         elif c in u' \t\r\n': ...
         elif c in u'AB12UV': ...
         else: ...

will turn into a C pointer loop around a C switch statement. And I heard of 
some fast bindings to C/C++ regex libs etc. that are getting written. 
Shipping those with a PyCapsule based C-API (Cython can generate and import 
those) and buffer interface support would provide a really speedy way to 
use them from other Cython modules, without sacrificing the Python language 
feeling.
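For anyone who wants to try the character loop above, here is a self-contained version that runs as plain Python. The function name and the category names are invented for this sketch. To get the C switch described above, the argument would need to be typed as unicode in Cython, which is left out here so that the code stays runnable in CPython.

```python
def classify(text):
    """Count characters using the same branches as the loop above."""
    counts = {"x": 0, "space": 0, "marker": 0, "other": 0}
    for c in text:
        if c == u'X':
            counts["x"] += 1
        elif c in u' \t\r\n':
            counts["space"] += 1
        elif c in u'AB12UV':
            counts["marker"] += 1
        else:
            counts["other"] += 1
    return counts


result = classify(u"X A\tz")
print(result)
```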

> OTOH I think you've got the perfect audience in the scientific Python
> world.

Partly, but joined with the majority of FFI and "speeding up CPython code" 
users. The scientific Python world is (obviously) very focussed on numeric 
computation. Cython is much more versatile than that.

> Have you tried replacing selected stdlib modules with their
> Cython-optimized equivalents in some of the NumPy/SciPy distros? (E.g.
> what about Enthought's Python distros?) Depending on how well that
> goes I might warm up to Cython more!

Hmm, I hadn't heard about that before. I'll ask on our mailing list if 
anyone's aware of them. I doubt that the stdlib participates in the 
critical parts of scientific computation code. Maybe alternative CSV 
parsers or something like that, but I'd be surprised if they were 
compatible with what's in the stdlib.

Stefan