[Python-ideas] Python-specific Travis CI

David Wilson dw+python-ideas at hmmz.org
Sun Apr 20 00:51:10 CEST 2014


Hi Donald!

Thanks for replying,


> And honestly with the new PR I’ve laid out in Travis adding any older
> versions to be supported out of the box is simple and the biggest
> barrier is likely convincing the Travis folks they are needed (For
> example, if you’re distributing on PyPI the number of 2.5 or 3.1 users
> are *tiny* https://caremad.io/blog/a-look-at-pypi-downloads/).
> 
> As far as I’m aware the Travis-CI folks are working on Windows support
> but there is currently no ETA on it as they have to figure out
> licensing concerns and the like.

Travis can work well, but the effort involved in maintaining it has, at 
least for me, been unjustifiably high: in total I've spent close to a 
man-week in the past 6 months trying various configurations, and fixing 
things up once a month, before resorting to what I have now. In all, I 
have probably spent 10x the time maintaining Travis as I have actually 
writing comprehensive tests.

It cannot reasonably be expected that project maintainers pay a similar 
cost, or ideally that they even need to be aware their package is being 
tested somewhere.

My biggest gripe with Travis, though, is that they can and do remove 
things that break what for me seems trivial functionality. This is 
simply the nature of a one-size-fits-all service. From what I gather, 
they previously removed old Python releases from their base image to 
reduce its size.

Windows licensing is hard, but given the scope of either Travis or the 
PSF, I'd be surprised if there wasn't some person or group at Microsoft 
able to solve it for us, especially considering the new climate under 
Nadella. I know at least that they offer free Azure time to open source 
projects; perhaps even something like that could be used here.


> The sanest way of doing this is to manage a pool of VM workers, pull
> one off the pool, run tasks on it, and then destroy it and let the
> pool spawn a new one. This puts a bottleneck on how many tests you can
> run both in how large your pool is, and in how fast your VMs boot.

> Look at a project like https://travis-ci.org/pyca/cryptography which
> builds 40 different configurations for each “build” each taking about
> 8 minutes each.

Those are curious numbers, perhaps indicative of overcontention on 
extra-wimpy cloud VMs. On uncontended spinning rust with 2GB RAM, my 
measly Core 2 Duo gets:

     Clone: 4.7s
     dev_requirements.txt + test vectors + pip install -e: 24.6s
     py.test: 2m10s

For a package on the opposite end of the spectrum, in terms of test 
comprehensiveness and size, py-lmdb:

     Clone: 0.99s
     pip install -e: 2.646s
     py.test: 0.996s


> So if you have a pool of 5 VMs, you can run 5 tests at a time, so that
> splits that 40 into 5 chunks, and you get roughly 64 minutes to
> process 40 builds, plus we’ll say our machines boot in roughly 60s (a
> little fast for cloud servers, but not unreasonable),

It's easy to beat 60 seconds for a farm: resuming a pre-booted 2GB VM 
from a cached snapshot with Qemu/KVM takes between 1.5s (465MB dirty) 
and 3s (2GB dirty), again on a wimpy Core 2 Duo. Actually it's 
surprising the numbers are that high; Qemu's IO code seems not the most 
efficient.

The dirty-memory numbers are interesting because tooling can be warmed 
into the VM page cache prior to taking the snapshot. Since Python has 
had solid side-by-side version support for over 10 years, only one base 
image per OS need exist, with all versions installed and cache-warm 
prior to the test (about 1GB total if you include standard libraries). 
Just to be clear, this warm state comes entirely for free as part of 
resuming the snapshot.
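
For illustration, here's roughly what the snapshot-producing half could 
look like: boot the base image, let the guest warm its caches, then ask 
Qemu for an internal snapshot via the monitor. This is only a sketch; 
the image path, the "warm" tag and the monitor socket are made-up 
examples, and the in-guest warming (installing and importing every 
Python version) is assumed to be handled by the base image itself.

    # Sketch of a snapshot builder: boot the base image, wait for the
    # guest to warm its page cache, then store an internal snapshot
    # named "warm" inside the qcow2 via the human monitor.
    import socket
    import subprocess
    import time

    IMAGE = "base-linux.qcow2"            # hypothetical per-OS base image
    MONITOR = "/tmp/qemu-monitor.sock"    # monitor over a unix socket

    vm = subprocess.Popen([
        "qemu-system-x86_64", "-enable-kvm", "-m", "2048",
        "-drive", "file=%s,if=virtio" % IMAGE,
        "-netdev", "user,id=n0", "-device", "virtio-net-pci,netdev=n0",
        "-monitor", "unix:%s,server,nowait" % MONITOR,
        "-display", "none",
    ])

    time.sleep(180)   # crude: let the guest boot and warm its caches

    mon = socket.socket(socket.AF_UNIX)
    mon.connect(MONITOR)
    mon.sendall(b"savevm warm\n")         # write the snapshot into the qcow2
    time.sleep(30)                        # give savevm time to finish
    mon.sendall(b"quit\n")
    mon.close()
    vm.wait()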

Further assuming local mirrors of PyPI and source repositories, it's 
easy to see how a specialized farm can vastly improve efficiency 
compared to a general solution.
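
The mirror part needs nothing clever; the base images could simply ship 
a pip configuration pointing at the local index (the hostname here is 
made up, of course):

    [global]
    index-url = http://pypi.mirror.internal/simple/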


> so there’s an additional 4-5 minutes just in booting. So roughly an
> hour and 10 minutes for a single test run if with 5 VMs for just a
> single project. (In reality Cryptography has more than 40 builds
> because it uses Travis and jenkins together to handle things travis
> doesn’t).

> So the machine cost is high, you’re probably looking at let’s just say
> a worker pool of 20 (though I think to actually replace Travis CI it’d
> need to be much higher)

40 * 2m40s works out to about 1h45m of wimpy CPU time for a complete 
cycle, but this assumes the results of all 40 workers are needed to 
gauge project health. In reality there are perhaps 3-6 configurations 
needing priority for prompt feedback (say CPyMac, CPyLinux, CPyWin, 
PyPyMac, PyPyLinux, PyPyWin).

Ideally py-lmdb has about 32 configurations to test, including Windows. 
Using the earlier numbers, that's 32 * 7.5s, or about 4 minutes of 
wimpy CPU time; however, 45 seconds is sufficient to test the main 
configurations. It doesn't matter if Python 2.5 breaks and it isn't 
obvious for 24 hours, since in this case releases are only made at most 
once per month.

Assuming the average PyPI package lies somewhere between Cryptography 
and py-lmdb, and there are about 42k packages, roughly averaging this 
out gives (7.5s*32 + 2m40s*40)/(32+40) = 1m32s per configuration, or 
about 1088 wimpy CPU hours to completely run one configuration of each 
package on PyPI.

Now, instead of wimpy CPUs, we have an 8-core Xeon; pessimistically 
assuming the Xeon is only 20% faster, that gives roughly 109 beefy 
8-core-CPU hours to rebuild PyPI once.

Assuming 3% of packages are updated in a day (which seems high), that's 
about 3h15m of 8-core-CPU time to test each updated package from a 
fresh checkout in a fresh VM against a single Python version, which, as 
we've seen, is worst-case behaviour since side-by-side versions are 
possible.

In summary, one 8-core machine might suffice (allowing for napkin math) 
to retest 3% of PyPI in one config at least 7 times a day, or, assuming 
each package has 6 primary configurations, to run one "all important 
configs" cycle per day.
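
For anyone who wants to poke at the napkin math, here it is as a few 
lines of Python. The inputs are just the rough figures quoted earlier 
in this thread, so the outputs land within a few percent of the numbers 
above and deserve exactly as much trust:

    # The napkin math above; all inputs are rough figures from this thread.
    WIMPY_CRYPTOGRAPHY = 4.7 + 24.6 + 130     # clone + install + py.test (s)
    WIMPY_PY_LMDB = 7.5                       # per configuration (s)
    CRYPTO_CONFIGS, LMDB_CONFIGS = 40, 32

    # Average per-configuration cost across the two extremes.
    avg = (WIMPY_PY_LMDB * LMDB_CONFIGS + WIMPY_CRYPTOGRAPHY * CRYPTO_CONFIGS) \
          / float(LMDB_CONFIGS + CRYPTO_CONFIGS)

    PACKAGES = 42500                          # "about 42k"
    wimpy_hours = avg * PACKAGES / 3600.0     # one config of every package

    XEON_FACTOR = 8 * 1.25                    # 8 cores, each a bit faster
    beefy_hours = wimpy_hours / XEON_FACTOR

    daily_churn = 0.03                        # 3% of PyPI updated per day
    hours_per_day = beefy_hours * daily_churn

    print("avg per config: %.0fs" % avg)                          # ~92s
    print("one pass over PyPI: %.0f wimpy CPU hours" % wimpy_hours)
    print("same pass on the Xeon: %.0f hours" % beefy_hours)
    print("daily churn, 1 config: %.1f hours" % hours_per_day)
    print("daily churn, 6 configs: %.1f hours" % (hours_per_day * 6))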


> of roughly 4GB machines (last I recall travis machines were 3gig and
> change) which comes out to roughly 1k to 2k in server costs just for
> that pool. Add into that whatever support machines you would need
> (queues, web nodes, database servers, etc) you’re probably looking in
> the 2-4k range just for the servers once all was said and done.

If one build per 4 hours sufficed for most projects, $1k/month seems 
like a generous cap: a VM with comparable specs to the above scenario, 
GCE's n1-standard-8, costs around $275/month to run 24/7. That assumes 
the project couldn't find a sponsor for multiple machines, which I 
suspect would be quite easy.

The above estimates are a little optimistic: in addition to 2-4GB guest 
RAM per core, the host would need at least 8-32GB more to keep hot parts 
of the base image filesystems cached to achieve the time estimates. 
However, I've never seen any Python extension needing 4GB to build.

Regarding supplementary services, a farm produces logs and perhaps 
assets, for which a static file bucket suffices, and consumes jobs from 
a queue, which wouldn't require more beef than an SQLite database; I'd 
be impressed if 40 jobs * 45k packages would fill 2GB.
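
To make the "no more beef than SQLite" claim concrete, here's roughly 
what the entire job store might look like. The schema and column names 
are invented for illustration:

    # Sketch of the whole "queue", assuming SQLite is good enough and
    # that a single dispatcher process hands jobs to workers.
    import sqlite3

    db = sqlite3.connect("farm.db")
    db.execute("""
        CREATE TABLE IF NOT EXISTS jobs (
            id      INTEGER PRIMARY KEY,
            package TEXT NOT NULL,
            version TEXT NOT NULL,
            config  TEXT NOT NULL,            -- e.g. "cpython2.7-linux64"
            state   TEXT NOT NULL DEFAULT 'queued',
            log_url TEXT                      -- static bucket key, set later
        )""")

    def enqueue(package, version, configs):
        """Add one row per configuration to be tested."""
        with db:
            db.executemany(
                "INSERT INTO jobs (package, version, config) VALUES (?, ?, ?)",
                [(package, version, c) for c in configs])

    def claim():
        """Hand the oldest queued job to a worker, marking it running."""
        with db:
            row = db.execute(
                "SELECT id, package, version, config FROM jobs "
                "WHERE state = 'queued' ORDER BY id LIMIT 1").fetchone()
            if row:
                db.execute("UPDATE jobs SET state = 'running' WHERE id = ?",
                           (row[0],))
            return row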


> I believe the code cost would also be fairly high. There isn’t really
> an off the shelf solution that is going to work for this. .. There is
> a good chance some things could be reused from Openstack but I’m
> fairly sure some of it is specific enough to Openstack that it’d still
> require a decent amount of coding to make it generic enough to work.

It's easy to overdesign and overspec these things, but the code and 
infrastructure involved is fairly minimal, especially to produce 
something basic that just ensures recently published packages, or 
packages manually triggered via webhook, get queued and tested.
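
The webhook entry point, for instance, barely needs to exist. Here's a 
minimal sketch reusing the enqueue() helper from the queue sketch 
above; the URL path and JSON payload shape are invented, and a real 
hook would also verify some kind of signature:

    # Tiny "trigger via webhook" endpoint that feeds the SQLite queue.
    import json
    from wsgiref.simple_server import make_server

    DEFAULT_CONFIGS = ["cpython2.7-linux64", "cpython3.4-linux64",
                       "pypy2.2-linux64"]

    def app(environ, start_response):
        if (environ["PATH_INFO"] == "/hook" and
                environ["REQUEST_METHOD"] == "POST"):
            size = int(environ.get("CONTENT_LENGTH") or 0)
            payload = json.loads(
                environ["wsgi.input"].read(size).decode("utf-8"))
            enqueue(payload["package"], payload["version"], DEFAULT_CONFIGS)
            start_response("202 Accepted", [("Content-Type", "text/plain")])
            return [b"queued\n"]
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"unknown endpoint\n"]

    make_server("", 8000, app).serve_forever()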

The most complex aspect is likely maintaining reproducible base images, 
versioning them, and preventing breakage for users during updates, but 
that is almost a planning problem rather than an implementation problem. 
Having said that, I can imagine a useful versioning policy as simple as 
two paragraphs, and an image/snapshot builder as simple as two shell 
scripts.

Integrating complex "best practices" off-the-shelf components is an 
example of where simple projects often explode. Qemu is literally a 
self-contained command-line tool, no more complex to manage the 
execution of than, say, wget. With a suitably prepped snapshot, all 
that's required is to run qemu, allow it 10 minutes to complete, and 
read the job results via a redirected port.
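
The per-job runner could look roughly like this: copy the base image, 
resume the cache-warm snapshot, poll a redirected port for results, 
then throw the VM away. The image name, the snapshot tag and the 
guest-side result agent on port 9999 are all assumptions, and error 
handling is omitted:

    # Sketch of a per-job runner built around a disposable VM.
    import shutil
    import subprocess
    import time
    import urllib.request

    BASE = "base-linux.qcow2"
    JOB_IMAGE = "job.qcow2"
    RESULT_PORT = 8999            # host port forwarded to the guest agent

    # Full copy for simplicity; a real farm would use a cheaper CoW clone.
    shutil.copyfile(BASE, JOB_IMAGE)

    vm = subprocess.Popen([
        "qemu-system-x86_64", "-enable-kvm", "-m", "2048",
        "-drive", "file=%s,if=virtio" % JOB_IMAGE,
        "-loadvm", "warm",        # resume the snapshot saved by the builder
        "-netdev", "user,id=n0,hostfwd=tcp:127.0.0.1:%d-:9999" % RESULT_PORT,
        "-device", "virtio-net-pci,netdev=n0",
        "-display", "none",
    ])

    # "Allow it 10 mins to complete, and read results via a redirected port."
    deadline = time.time() + 600
    report = None
    while report is None and time.time() < deadline:
        try:
            report = urllib.request.urlopen(
                "http://127.0.0.1:%d/" % RESULT_PORT, timeout=5).read()
        except Exception:
            time.sleep(5)         # agent inside the guest not ready yet

    vm.kill()                     # the VM is disposable either way
    vm.wait()
    print(report)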


> This means that it’s likely not going to be something we can just set
> it up and forget about it and will require an active infrastructure
> team to handle it. Now Python happens to have one of those, but we’re
> mostly volunteers with part times for on call stuff (who also
> volunteer for other stuff too!) and this would be a significant
> increase I believe in that work load.

That's very true, though looking at the lack of a straightforward 
Python solution at the ecosystem level, it also seems quite a small 
cost.


> It’s my personal opinion that a sort of “test during development” CI
> system like Travis is not something that making Python specific is
> very useful in the long run. Travis is open source and they are
> incredibly open to things that will make Python a more first class
> citizen there (disclaimer: I’m friends with them and I’ve helped add
> the Python support to start with, and then improve it over time). One
> obvious flaw in this is that Travis only supports github while there
> are others on Bitbucket or even their own hosting, and for those
> people this idea might be much more attractive.
> 
> I do think there is maybe room for a “release testing” system and I
> definitely think there is room for a build farm (Which is a much
> smaller scale since the whole of PyPI pushes releases far less often
> than push commits to branches or make PRs or what have you).

It's not clear what scope would work best for a Python-specific system. 
For my needs, I'm only interested in removing the effort required to 
ensure my packages get good, reasonably timely coverage, without having 
to worry about things like Windows licenses, or my build exploding 
monthly because a base image change sacrificed Python-related 
functionality for something else.

The size and complexity of the service could creep up massively, 
especially if it attempted to compete with e.g. Travis' 
5-minutes-or-less turnaround time, but at least for me those fast 
turnaround times aren't particularly useful; I just need something 
low-effort that works.

Regarding the value of a Python-specific system, almost identical 
infrastructure could be used for your Wheel farm, or for running any 
kind of "ecosystem lint", like detecting setup.py scripts that write to 
/etc/passwd and suchlike. Granted, these could also be solved 
independently.

Finally, it's not obvious to me that this is absolutely a good idea; 
however, I'm willing to argue for it for as long as I'm procrastinating 
on setting up my own Jenkins install :-P


David


