On Apr 19, 2014, at 6:51 PM, David Wilson email@example.com wrote:
Thanks for replying,
And honestly, with the new PR I’ve laid out in Travis, adding any older versions to be supported out of the box is simple, and the biggest barrier is likely convincing the Travis folks they are needed (for example, if you’re distributing on PyPI, the number of 2.5 or 3.1 users is *tiny* https://caremad.io/blog/a-look-at-pypi-downloads/). As far as I’m aware the Travis-CI folks are working on Windows support, but there is currently no ETA on it as they have to figure out licensing concerns and the like.
Travis can work well, but the effort involved in maintaining it has, at least for me, been unjustifiably high: in total I've spent close to a man-week in the past 6 months trying various configurations, and fixing things up once a month, before resorting to what I have now. In all, I have probably spent 10x the time maintaining Travis as I have actually writing comprehensive tests.
It cannot reasonably be expected that project maintainers pay a similar cost, or ideally that they even need to be aware their package is undergoing testing someplace.
My biggest gripe with Travis, though, is that they can and do remove things, breaking what seems to me like trivial functionality. This is simply the nature of a one-size-fits-all service. From what I gather, they previously removed old Python releases from their base image to reduce its size.
I’ve not had anywhere near that kind of experience getting things set up on Travis. The one exception has been when I’ve attempted to use their OSX builders, though a big portion of that is because those builders do not have Python configured by default.
That’s not to say you didn’t see that behavior, but generally the worst I’ve had to do is install a Python other than what they’ve provided, which has been trivially easy to do as well (see: https://github.com/pypa/pip/blob/develop/.travis.yml#L18-L23). Looking over the history of the pip .travis.yml, the only times it’s really changed were to add versions or because we were trying out some new thing (like using pytest-xdist).
For the record, they’ve never supported 2.4, and I recommended they drop 2.5 and 3.1 because the incoming traffic on PyPI doesn’t really justify keeping them in the base image: a larger base image takes longer to boot, which slows down the entire build queue. For projects that want to keep them, it’s trivial to add them back in a way similar to how pip has added 3.4 support.
Windows licensing is hard, but given the scope of either Travis or the PSF, I'd be surprised if there wasn't some person or group at Microsoft able to solve it for us, especially considering the new climate under Nadella. I know at least that they offer free Azure time to open source projects; perhaps even something like that could be used.
Sure, licensing is a solvable problem, but it’s still a problem :) There are other problems too, of course, such as handling fundamental differences in how the two platforms interact (it’s easy to SSH into a *nix box, not so much a Windows box).
The sanest way of doing this is to manage a pool of VM workers: pull one off the pool, run tasks on it, then destroy it and let the pool spawn a new one. This bottlenecks how many tests you can run on both the size of your pool and how fast your VMs boot.
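As a sketch of that pull/run/destroy loop, here is roughly what the pool logic amounts to in Python. The spawn_vm/run_tasks/destroy_vm stubs are hypothetical stand-ins for whatever the real provisioning calls would be:

```python
import queue
import threading

POOL_SIZE = 5  # hypothetical pool size

def spawn_vm():
    """Hypothetical: boot (or resume) a fresh worker VM, return a handle."""
    return object()

def run_tasks(vm, job):
    """Hypothetical: push one job into the VM and collect its result."""
    return ("ok", job)

def destroy_vm(vm):
    """Hypothetical: tear the VM down so one job can't contaminate the next."""

def worker(pool, jobs, results):
    while True:
        job = jobs.get()
        if job is None:          # sentinel: shut this worker down
            break
        vm = pool.get()          # pull a ready VM off the pool
        try:
            results.put(run_tasks(vm, job))
        finally:
            destroy_vm(vm)       # always destroy, even on failure
            pool.put(spawn_vm()) # let the pool respawn a replacement

pool, jobs, results = queue.Queue(), queue.Queue(), queue.Queue()
for _ in range(POOL_SIZE):
    pool.put(spawn_vm())
threads = [threading.Thread(target=worker, args=(pool, jobs, results))
           for _ in range(POOL_SIZE)]
for t in threads:
    t.start()
for build in range(40):          # e.g. the 40 configurations of a big project
    jobs.put(build)
for _ in threads:
    jobs.put(None)
for t in threads:
    t.join()
print(results.qsize())           # 40
```

Destroying unconditionally in the finally block is what buys the isolation: no job ever sees another job's leftovers, at the price of one boot (or snapshot resume) per job.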
Look at a project like https://travis-ci.org/pyca/cryptography which builds 40 different configurations for each “build”, each taking about 8 minutes.
Those are curious numbers, perhaps indicative of overcontention on extra-wimpy cloud VMs. On uncontended spinning rust with 2GB RAM, my measly Core 2 Duo gets:
Clone: 4.7s
dev_requirements.txt + test vectors + pip install -e: 24.6s
py.test: 2m10s
For a package on the opposite end of the spectrum, in terms of test comprehensiveness and size, py-lmdb:
Clone: 0.99s
pip install -e: 2.646s
py.test: 0.996s
I believe that Travis is using OpenVZ VMs, which may or may not be on the greatest hardware. A different VM provider would probably give better performance.
So if you have a pool of 5 VMs, you can run 5 tests at a time, which splits those 40 builds into 8 rounds of 5: roughly 64 minutes to process 40 builds. Plus, we’ll say our machines boot in roughly 60s (a little fast for cloud servers, but not unreasonable),
It's easy to beat 60 seconds for a farm: resuming a pre-booted 2GB VM from a cached snapshot with Qemu/KVM takes between 1.5s (465MB dirty) and 3s (2GB dirty), again on a wimpy Core 2 Duo. Actually it's surprising the numbers are that high; Qemu's IO code seems not to be the most efficient.
Dirty numbers are interesting because tooling can be warmed into the VM page cache prior to snapshotting. Since Python has had solid side-by-side version support for over 10 years, only one per-OS base image need exist, with all versions installed and cache-warm prior to testing (about 1GB total if you include standard libraries). Just to be clear, this comes entirely free in the cost of resuming a snapshot.
All of those things require even more ongoing support, since we’d have to maintain the host machines as well rather than just reuse some provider’s VM/cloud images.
Further assuming local mirrors of PyPI and source repositories, it's easy to see how a specialized farm can vastly improve efficiency compared to a general solution.
As far as I’m aware, Travis has been willing to set up a PyPI mirror; it’s mostly that nobody has helped them set one up :)
so there’s an additional 4-5 minutes just in booting. So roughly an hour and 10 minutes for a single test run with 5 VMs, for just a single project. (In reality Cryptography has more than 40 builds, because it uses Travis and Jenkins together to handle things Travis doesn’t.)
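Working that arithmetic through explicitly, assuming each of the 5 VMs is destroyed and re-booted between builds, at 8 minutes per build and 60s per boot (the boot overhead comes out nearer 8 minutes than 4-5, but the ballpark holds):

```python
builds, pool, build_min, boot_min = 40, 5, 8, 1

rounds = -(-builds // pool)      # ceil(40 / 5) = 8 sequential rounds
test_time = rounds * build_min   # 64 minutes of actual building
boot_time = rounds * boot_min    # ~8 more minutes lost to booting
wall = test_time + boot_time

print(rounds, test_time, wall)   # 8 64 72
```

72 minutes of wall time: roughly the hour-and-ten-minutes figure above.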
So the machine cost is high: you’re probably looking at, let’s just say, a worker pool of 20 (though I think to actually replace Travis CI it’d need to be much higher)
40 * 2m40s works out to about 1h45m of wimpy CPU time for a complete cycle, but this assumes results of all 40 workers are needed to gauge project health. In reality there are perhaps 3-6 configurations needing priority for prompt feedback (say CPyMac, CPyLinux, CPyWin, PyPyMac, PyPyLinux, PyPyWin).
Ideally py-lmdb has about 32 configurations to test, including Windows. Using earlier numbers that's (32 * 7.5s), or about 4 minutes of wimpy CPU time; however, 45 seconds is sufficient to test the main configurations. It doesn't matter if Python 2.5 breaks and it's not obvious for 24 hours, since in this case releases are only made at most once per month.
Assuming the average PyPI package lies somewhere between Cryptography and py-lmdb, and there are about 42k packages, roughly averaging this out gives (7.5s*32 + 2m40s*40)/(32+40) = 1m32s, or about 1,088 wimpy CPU-hours to completely run one configuration of each package on PyPI.
Now instead of wimpy CPUs we have an 8-core Xeon; pessimistically assuming the Xeon is only 20% faster, that gives 108.88 beefy 8-core-CPU hours to rebuild PyPI once.
Assuming 3% of packages are updated in a day (which seems high), that's 3h15m of 8-core-CPU time to test each updated package from a fresh checkout in a fresh VM against a single Python version, which, as we've seen, is the worst-case behaviour since side-by-side versions are possible.
In summary, one 8-core machine might suffice (allowing for napkin math) to retest 3% of PyPI in one configuration at least 7 times a day, or, assuming each package has 6 primary configurations, run one "all important configs" cycle once per day.
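The whole napkin calculation can be rechecked in a few lines. All inputs are the figures quoted above; the recomputed total lands at ~1,076 wimpy CPU-hours rather than 1,088, well within napkin tolerance:

```python
# Inputs quoted earlier in the thread; everything below is derived.
lmdb_runs, lmdb_secs = 32, 7.5        # py-lmdb: 32 configs at ~7.5s each
crypto_runs, crypto_secs = 40, 160    # cryptography: 40 configs at ~2m40s each
packages = 42_000

avg_secs = (lmdb_runs * lmdb_secs + crypto_runs * crypto_secs) \
           / (lmdb_runs + crypto_runs)              # ~92s, i.e. ~1m32s
wimpy_hours = packages * avg_secs / 3600            # one config of every package
beefy_hours = wimpy_hours * 0.8 / 8                 # 8 cores, each 20% faster
daily_hours = beefy_hours * 0.03                    # 3% of PyPI updated per day
cycles_per_day = 24 / daily_hours

print(round(avg_secs), round(wimpy_hours), round(daily_hours, 1))
```

So one 8-core box re-testing the daily churn in a single configuration takes a bit over three hours, leaving room for roughly seven such cycles per day.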
Going back to cryptography: those 40 test runs are *just* the ones that can run on Travis. There are also an additional 54 runs for Windows, OSX, FreeBSD, and other Linux installs. The cryptography project runs roughly 15 of those builds almost every day, and upwards of 30 builds a day on other days. So that’s 1,410-2,820 total runs; even with a 2m40s run you’re looking at that single project taking 60-125 hours of CPU build time a day. Now this project is a fairly intensive one, but it’s useful to look at worst-case uses in order to determine scaling.
of roughly 4GB machines (last I recall Travis machines were 3 gigs and change), which comes out to roughly $1-2k in server costs just for that pool. Add in whatever support machines you would need (queues, web nodes, database servers, etc.) and you’re probably looking in the $2-4k range just for the servers once all was said and done.
If one build per 4 hours sufficed for most projects, $1k/month seems like a good cap: a VM with comparable specs to the above scenario, GCE's n1-standard-8, costs around $275/month to run 24/7, and that's assuming the project couldn't find a sponsor of multiple machines, which I suspect would be quite easy.
The above estimates are a little optimistic: in addition to 2-4GB of guest RAM per core, the host would need at least 8-32GB more to keep the hot parts of the base image filesystems cached to achieve the time estimates. However, I've never seen a Python extension that needs 4GB to build.
Regarding supplementary services: a farm produces logs and perhaps assets, for which a static file bucket suffices, and consumes jobs from a queue, which wouldn't require more beef than an SQLite database. I'd be impressed if 40 jobs * 45k packages would fill 2GB.
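For scale, the entire queue could plausibly be a single SQLite table. A minimal sketch, with schema and job states invented purely for illustration:

```python
import sqlite3

# One table is plenty for PyPI-scale volumes; states are hypothetical.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE jobs (
        id      INTEGER PRIMARY KEY,
        package TEXT NOT NULL,
        config  TEXT NOT NULL,
        state   TEXT NOT NULL DEFAULT 'queued'  -- queued / running / done
    )
""")

def enqueue(package, config):
    db.execute("INSERT INTO jobs (package, config) VALUES (?, ?)",
               (package, config))

def claim():
    """Atomically hand the oldest queued job to a worker."""
    with db:  # transaction: select + update commit or roll back together
        row = db.execute(
            "SELECT id, package, config FROM jobs"
            " WHERE state = 'queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row:
            db.execute("UPDATE jobs SET state = 'running' WHERE id = ?",
                       (row[0],))
        return row

enqueue("cryptography", "cpython2.7-linux")
enqueue("py-lmdb", "pypy-linux")
job = claim()
print(job)  # (1, 'cryptography', 'cpython2.7-linux')
```

Even with 40 configurations per package, rows this small across all of PyPI stay comfortably under the 2GB mentioned above.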
I believe the code cost would also be fairly high. There isn’t really an off-the-shelf solution that is going to work for this. There is a good chance some things could be reused from OpenStack, but I’m fairly sure some of it is specific enough to OpenStack that it’d still require a decent amount of coding to make it generic enough to work.
It's easy to overdesign and overspec these things, but the code and infrastructure involved is fairly minimal, especially to produce something basic that just ensures, e.g., recently published or manually-triggered-via-webhook packages get queued and tested.
The most complex aspect is likely maintaining reproducible base images: versioning them and preventing breakage for users during updates. But that is almost a planning problem rather than an implementation problem. That said, I can imagine a useful versioning policy as simple as two paragraphs, and an image/snapshot builder as simple as two shell scripts.
Integrating complex "best practices" off-the-shelf components is an example of where simple projects often explode. Qemu is literally a self-contained command-line tool, no more complex to manage the execution of than, say, wget. With a suitably prepped snapshot, all that's required is to run qemu, allow it 10 minutes to complete, and read job results via a redirected port.
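A sketch of what driving qemu that way might look like; the snapshot name, guest port, image filename, and memory size here are assumptions for illustration, not a real setup:

```python
import subprocess

def qemu_cmd(image, snapshot="prepped", host_port=9999):
    # Standard qemu-system flags; the "prepped" snapshot and guest port
    # 9999 are hypothetical conventions for this farm.
    return [
        "qemu-system-x86_64",
        "-enable-kvm",
        "-m", "2048",
        "-display", "none",
        "-drive", "file=%s,format=qcow2" % image,
        "-loadvm", snapshot,   # resume the pre-booted, cache-warm snapshot
        # redirect a guest port to the host so job results can be read back
        "-net", "nic",
        "-net", "user,hostfwd=tcp:127.0.0.1:%d-:9999" % host_port,
    ]

def run_job(image):
    """Run one job VM, killing it after the 10-minute allowance."""
    try:
        subprocess.run(qemu_cmd(image), timeout=600, check=True)
    except subprocess.TimeoutExpired:
        pass  # whatever results exist are still readable via the host port

cmd = qemu_cmd("base-py-all.qcow2")
print(cmd[0])  # qemu-system-x86_64
```

The point is how little machinery there is: build an argv, run it with a timeout, read a socket. No agent inside the guest beyond whatever the snapshot was prepped with.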
This means it’s likely not going to be something we can just set up and forget about; it will require an active infrastructure team to handle it. Now, Python happens to have one of those, but we’re mostly volunteers with limited part-time availability for on-call work (who also volunteer for other stuff too!), and this would, I believe, be a significant increase in that workload.
That's very true, though looking at the lack of a straightforward solution for Python at the ecosystem level, it also seems quite a small cost.
It’s my personal opinion that making a “test during development” CI system like Travis Python-specific is not very useful in the long run. Travis is open source and they are incredibly open to things that will make Python a more first-class citizen there (disclaimer: I’m friends with them, and I helped add the Python support to start with and then improve it over time). One obvious flaw in this is that Travis only supports GitHub, while others are on Bitbucket or even their own hosting, and for those people this idea might be much more attractive. I do think there is maybe room for a “release testing” system, and I definitely think there is room for a build farm (which is a much smaller scale, since the whole of PyPI pushes releases far less often than people push commits to branches or make PRs or what have you).
It's not clear what kind of scope would work best for a Python-specific system. For my needs, I'm only interested in removing the effort required to ensure my packages get good, reasonably timely coverage, without worrying about things like Windows licenses, or my build exploding monthly due to base image changes sacrificing Python-related functionality for something else.
The size and complexity of the service could creep up massively, especially if attempting to compete with e.g. Travis' 5-minutes-or-less turnaround time, but at least for me those fast turnaround times aren't particularly useful; I just need something low-effort that works.
Regarding the value of a Python-specific system, almost identical infrastructure could be used for your wheel farm, or for running any kind of "ecosystem lint", like detecting setup.py /etc/passwd writes and suchlike. Granted, though, these could also be solved independently.
Finally, it's not obvious to me that this is absolutely a good idea; however, I'm willing to argue for it for as long as I'm procrastinating on setting up my own Jenkins install :-P
-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA