Cython and large generated .c files
I'm speeding up the Matlab file IO routines by using Cython. Generally this has been a pleasant experience, but the size of the generated .c files is beginning to worry me. As y'all know, the accepted approach is to commit both the Cython .pyx files, and the Cython-generated .c files. My problem is that the .c files are getting big - a 22K .pyx file generating 516K of .c file. If I do a few changes in the .pyx file, and commit the .pyx and the .c file, that's going to add up to lot of space in the SVN repository. An obvious way round this is to let distutils run Cython for developer builds, and only generate the .c for releases, or some such, but I can see that could get messy. But, the space the .c file takes up is making me think twice about commits. Have y'all got any thoughts how to deal with this problem? Thanks a lot, Matthew
su, 2009-11-08 kello 01:05 -0800, Matthew Brett kirjoitti: [clip: huge Cython-generated .c files]
An obvious way round this is to let distutils run Cython for developer builds, and only generate the .c for releases, or some such, but I can see that could get messy. But, the space the .c file takes up is making me think twice about commits.
Have y'all got any thoughts how to deal with this problem?
I think Cython has some pre-made distutils hooks to make this less painless. I don't know how nicely they work with numpy.distutils, however... A bigger issue than build setup may be that then we'd require users to have the latest Cython installed to build Scipy. (Perhaps the generated .c files could be shipped with the distributed tarballs, though, so only people interested in SVN would need Cython.) Actually, this might not be so much to ask, given that Cython is a pure-Python package. So, should we make Scipy depend on Cython at build-time? I agree that things may get annoying if we decide to avoid this in the long run, if the number of Cython-based extensions starts to grow. -- Pauli Virtanen
Pauli Virtanen wrote:
I think Cython has some pre-made distutils hooks to make this less painless. I don't know how nicely they work with numpy.distutils, however...
Should not be too hard to support it.
I agree that things may get annoying if we decide to avoid this in the long run, if the number of Cython-based extensions starts to grow.
Is size really a problem ? I am a bit worried about adding cython as a build requirement. We have already enough problems as it is, adding a requirement can only make things worse. David
David Cournapeau wrote:
Is size really a problem ? I am a bit worried about adding cython as a build requirement. We have already enough problems as it is, adding a requirement can only make things worse.
I meant "enough *build* problems" in the above sentence. David
David Cournapeau wrote:
Pauli Virtanen wrote:
I think Cython has some pre-made distutils hooks to make this less painless. I don't know how nicely they work with numpy.distutils, however...
Should not be too hard to support it.
I agree that things may get annoying if we decide to avoid this in the long run, if the number of Cython-based extensions starts to grow.
Is size really a problem ? I am a bit worried about adding cython as a build requirement. We have already enough problems as it is, adding a requirement can only make things worse.
What mpi4py does is have a simple script "cythonize" which runs Cython on all pyx files in the tree. Then: - Generated C files are NOT checked into version control - Generated C files ARE shipped with tarball source downloads (i.e. cythonize is run on Lisandro's computer after exporting from revision control). So endusers building from source do not need Cython. Obviously setup.py etc. can be tailored to run cythonize if the C files are older than the pyx files etc. I think this sounds like the solutions with least drawbacks. Dag Sverre
Hi,
What mpi4py does is have a simple script "cythonize" which runs Cython on all pyx files in the tree. Then:
- Generated C files are NOT checked into version control - Generated C files ARE shipped with tarball source downloads (i.e. cythonize is run on Lisandro's computer after exporting from revision control). So endusers building from source do not need Cython.
Obviously setup.py etc. can be tailored to run cythonize if the C files are older than the pyx files etc. I think this sounds like the solutions with least drawbacks.
Something like that sounds as though it could work very well. I guess it does mean that, if you have an old version of cython on your system, the pyx code might not compile properly to c? If so, will that overwrite the c source in the release, making the result impossible to build? Or do you somehow disable 'cythonize' for building from source code releases? I think that size does matter, as they say, because, if we add a moderate number of .pyx files to scipy, we can easily end up adding megabytes of diff _per_commit_. In the meantime the repository will start growing rapidly. If we ever go to distributed version control, we'll get all of that with a 'git clone' or similar. Best, Matthew
2009/11/8 Matthew Brett <matthew.brett@gmail.com>:
Something like that sounds as though it could work very well. I guess it does mean that, if you have an old version of cython on your system, the pyx code might not compile properly to c? If so, will that overwrite the c source in the release, making the result impossible to build? Or do you somehow disable 'cythonize' for building from source code releases?
In scikits.image we take md5sums of the generated and the distributed .c files (skipping the time-stamp headers). If those don't match, the generated version is used: http://github.com/stefanv/scikits.image/blob/master/scikits/image/_build.py#... Cheers Stéfan
Matthew Brett wrote:
I think that size does matter, as they say, because, if we add a moderate number of .pyx files to scipy, we can easily end up adding megabytes of diff _per_commit_.
The diff problem can be somewhat alleviated by marking the generated files as binary.
In the meantime the repository will start growing rapidly. If we ever go to distributed version control, we'll get all of that with a 'git clone' or similar.
Git file-format compression is good. In numpy's case, a git clone takes the same space as a svn checkout. Now, if everybody else does not mind having a build dependency on a recent cython version, I won't be against it either. David
Den 9. nov. 2009 kl. 06.51 skrev David Cournapeau <david@ar.media.kyoto-u.ac.jp
:
Matthew Brett wrote:
I think that size does matter, as they say, because, if we add a moderate number of .pyx files to scipy, we can easily end up adding megabytes of diff _per_commit_.
The diff problem can be somewhat alleviated by marking the generated files as binary.
There is a security issue here: the genererated C code is difficult to review. Not that I think scipy developers cannot be trusted, but it is easy to hide malicious code in there. At least we need to verify that .pyx and generated .c match. As for the diff issue: generated C must be considered binary. The purpose of a diff is for reviewing code. Those C files are hardly human readable. Should we depend on Cython or include Cython? Or have a verified buildbot that converts SciPy's Cython source to C? Sturla
Den 9. nov. 2009 kl. 11.12 skrev Sturla Molden <sturla@molden.no>:
Den 9. nov. 2009 kl. 06.51 skrev David Cournapeau <david@ar.media.kyoto-u.ac.jp
:
Matthew Brett wrote:
I think that size does matter, as they say, because, if we add a moderate number of .pyx files to scipy, we can easily end up adding megabytes of diff _per_commit_.
The diff problem can be somewhat alleviated by marking the generated files as binary.
There is a security issue here: the genererated C code is difficult to review. Not that I think scipy developers cannot be trusted, but it is easy to hide malicious code in there. At least we need to verify that .pyx and generated .c match.
And there is the issue of sloppy mistakes: The source of a segfault could be in the C, but we would look for it in the Cython source. All it takes is for the developer to forget to generate new C before commiting. Sturla
Dag Sverre Seljebotn wrote:
What mpi4py does is have a simple script "cythonize" which runs Cython on all pyx files in the tree. Then:
- Generated C files are NOT checked into version control - Generated C files ARE shipped with tarball source downloads (i.e. cythonize is run on Lisandro's computer after exporting from revision control). So endusers building from source do not need Cython.
I expect quite a few users to build sources from svn, and I am worried that many distributions do not package a recent enough version of cython. David
On Sun, Nov 8, 2009 at 9:38 PM, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
I expect quite a few users to build sources from svn, and I am worried that many distributions do not package a recent enough version of cython.
Surely anyone taking the trouble to build from sources (checking out, making sure they have a writeable install location, etc.) can just as easily type 'easy_install cython' along the way? (That's what I do...) -- Nathaniel
Sendt fra min iPhone Den 9. nov. 2009 kl. 06.38 skrev David Cournapeau <david@ar.media.kyoto-u.ac.jp
:
I expect quite a few users to build sources from svn, and I am worried that many distributions do not package a recent enough version of Cython
Then include one in SciPy. It's just a small Python package. Sturla
On Nov 8, 2009, at 8:20 AM, Pauli Virtanen wrote:
su, 2009-11-08 kello 01:05 -0800, Matthew Brett kirjoitti: [clip: huge Cython-generated .c files]
An obvious way round this is to let distutils run Cython for developer builds, and only generate the .c for releases, or some such, but I can see that could get messy. But, the space the .c file takes up is making me think twice about commits.
Have y'all got any thoughts how to deal with this problem?
I think Cython has some pre-made distutils hooks to make this less painless. I don't know how nicely they work with numpy.distutils, however...
A bigger issue than build setup may be that then we'd require users to have the latest Cython installed to build Scipy. (Perhaps the generated .c files could be shipped with the distributed tarballs, though, so only people interested in SVN would need Cython.) Actually, this might not be so much to ask, given that Cython is a pure-Python package.
So, should we make Scipy depend on Cython at build-time?
I think this would be fine. In fact, I think it would be a good idea. It would encourage Cython extensions. -Travis
On Tue, Nov 10, 2009 at 8:03 AM, Matthew Brett <matthew.brett@gmail.com> wrote:
So, should we make Scipy depend on Cython at build-time?
I think this would be fine. In fact, I think it would be a good idea. It would encourage Cython extensions.
That seems good to me too.
That's decided then. I will add cython support to numpy.distutils (and will also expand the cython support in numscons). For numpy.distutils, I would prefer add it after 1.4.0 (meaning manual cython for scipy 0.8.0): is this acceptable for you Matthew ? cheers, David
Hi,
I think this would be fine. In fact, I think it would be a good idea. It would encourage Cython extensions.
That seems good to me too.
That's decided then. I will add cython support to numpy.distutils (and will also expand the cython support in numscons). For numpy.distutils, I would prefer add it after 1.4.0 (meaning manual cython for scipy 0.8.0): is this acceptable for you Matthew ?
Fine by me - thank you. Matthew
On Mon, Nov 9, 2009 at 8:03 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
So, should we make Scipy depend on Cython at build-time?
I think this would be fine. In fact, I think it would be a good idea. It would encourage Cython extensions.
That seems good to me too.
I have some naive code implementing automatic download of a zip file checkout from the Cython repos. After a few fixes some time ago, Cython can happily run with zipimport (there is a nasty warning that could be worked for removal), so the build-time dependency could be somewhat alleviated, the scipy buildsystem could just try to automatically download a zip file in case of a missing/outdated Cython installation. -- Lisandro Dalcín --------------- Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC) Instituto de Desarrollo Tecnológico para la Industria Química (INTEC) Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET) PTLC - Güemes 3450, (3000) Santa Fe, Argentina Tel/Fax: +54-(0)342-451.1594
Lisandro Dalcin wrote:
I have some naive code implementing automatic download of a zip file checkout from the Cython repos. After a few fixes some time ago, Cython can happily run with zipimport (there is a nasty warning that could be worked for removal), so the build-time dependency could be somewhat alleviated, the scipy buildsystem could just try to automatically download a zip file in case of a missing/outdated Cython installation.
Please, no. Building should not involve any internet connection. This can fail in so many ways, we will get many reports of people with weird network policies and so on. And OS packagers will hate us even more. Cython becomes a build dependency, it should be checked and handled as such. I will add numpy.distutils support to make this transparent (unless someone beats me to it). cheers, David
On Mon, Nov 9, 2009 at 10:52 PM, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
Lisandro Dalcin wrote:
I have some naive code implementing automatic download of a zip file checkout from the Cython repos. After a few fixes some time ago, Cython can happily run with zipimport (there is a nasty warning that could be worked for removal), so the build-time dependency could be somewhat alleviated, the scipy buildsystem could just try to automatically download a zip file in case of a missing/outdated Cython installation.
Please, no. Building should not involve any internet connection. This can fail in so many ways, we will get many reports of people with weird network policies and so on. And OS packagers will hate us even more.
David, Iwas just talking about building -dev SVN checkouts. Release tarballs should ship the generated C sources.
Cython becomes a build dependency, it should be checked and handled as such. I will add numpy.distutils support to make this transparent (unless someone beats me to it).
What this means? -- Lisandro Dalcín --------------- Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC) Instituto de Desarrollo Tecnológico para la Industria Química (INTEC) Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET) PTLC - Güemes 3450, (3000) Santa Fe, Argentina Tel/Fax: +54-(0)342-451.1594
Hi Lisandro, Lisandro Dalcin wrote:
David, Iwas just talking about building -dev SVN checkouts. Release tarballs should ship the generated C sources.
It does not change my argument: dependencies should be explicit, and build should never trigger network access. Network is inherently fragile, and makes sandboxing builds more difficult. Making a difference between release and dev makes the release process is not great either: the more different they are, the more likely you are to produce a broken release out there in my experience. I think it is a good policy that dev builds are as similar as possible to building releases from sources.
What this means?
It means that if a package depends on cython at a certain version, it should check for this, and fails if it does not, like it does if you don't have a fortran compiler or blas/lapack for scipy. It was argued that installing the right version of cython is not difficult, after all :) cheers, David
participants (10)
-
Dag Sverre Seljebotn
-
David Cournapeau
-
David Cournapeau
-
Lisandro Dalcin
-
Matthew Brett
-
Nathaniel Smith
-
Pauli Virtanen
-
Sturla Molden
-
Stéfan van der Walt
-
Travis Oliphant