[pypy-dev] Great experience with PyPy

Marko Tasic mtasic85 at gmail.com
Thu Feb 7 12:55:42 CET 2013


Hi,

I would like to share a short story with you about what we have
accomplished with PyPy and its friends so far.

The company I have worked for over the last 7 months (intentionally
unnamed) gave me complete freedom to pick the technologies on which we
based our solution. What we do is: crawl for PDFs and newspaper
articles, download them, translate them if needed, OCR them if needed,
do extensive analysis of the downloaded PDFs and articles, store them
in more organized structures for faster querying, search over them,
and generate a bunch of complex reports.

From the very beginning I decided to go with PyPy no matter what. What
we picked is the following:
* Flask as the web framework, plus a few of its extensions such as
Flask-Login, Flask-Principal, Flask-WTF, Flask-Mail, etc.
* Cassandra as the database, because of its features and our great
experience with it. PyCassa is used as the client to talk to the
Cassandra server.
* ElasticSearch as a distributed search engine, with its client library pyes.
* Whoosh as a search engine, with some modifications to support
Cassandra as storage and distributed locking.
* Redis, and its client library redis-py, for caching and to speed up
common auto-completion patterns.
* ZooKeeper, and its client library Kazoo, for distributed locking,
which plays an essential role in the system for transaction-like
behavior across many services at once (see the locking sketch after
this list).
* Celery in conjunction with RabbitMQ for task distribution.
* Sentry for error logging.
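
To give a feel for how the ZooKeeper locks are used with Kazoo, here
is a minimal sketch; the ZooKeeper address, lock path and holder
identifier are purely illustrative, not our actual configuration:

    from kazoo.client import KazooClient

    # Assumes a ZooKeeper node reachable on the default port.
    zk = KazooClient(hosts="127.0.0.1:2181")
    zk.start()

    # One lock per document; the second argument identifies the holder.
    lock = zk.Lock("/locks/documents/article-42", "worker-1")

    with lock:
        # Update Cassandra, the Whoosh index and the Redis cache here,
        # so the writes behave as one transaction-like step across
        # several services.
        pass

    zk.stop()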

What we have developed on our own are wrappers and clients (roughly
sketched below) for:
* Moses, which is a language translator
* Tesseract, which is an OCR engine
* a Cassandra store for Whoosh
* wkhtmltopdf and wkhtmltoimage, which are used to convert HTML to
PDF/images
* etc.
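
These wrappers mostly shell out to the command-line tools. A minimal
sketch of what such a wrapper can look like (the function names and
arguments are illustrative, not our actual clients):

    import os
    import subprocess
    import tempfile

    def html_to_pdf(src, pdf_path):
        # Convert an HTML page (URL or local file) to PDF with the
        # wkhtmltopdf binary; raises CalledProcessError on failure.
        subprocess.check_call(["wkhtmltopdf", src, pdf_path])

    def ocr_image(image_path, lang="eng"):
        # Run the tesseract CLI on an image and return the recognized text.
        base = tempfile.mktemp()  # tesseract appends ".txt" to this base
        subprocess.check_call(["tesseract", image_path, base, "-l", lang])
        try:
            with open(base + ".txt") as f:
                return f.read()
        finally:
            os.remove(base + ".txt")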

Now that the product is finished and in its final testing phase, I can
say that we do not regret building on PyPy and the stack around it.
The typical speed improvement is 2x-3x over CPython in our case, but
we are mostly IO and memory bound anyway, except for the Celery
workers, where the analysis consists of many small CPU-intensive tasks
exchanged via RabbitMQ. Another reason we don't see a bigger speedup
is that we depend on external software (servers) written in Erlang and
Java.
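
For readers unfamiliar with Celery, the worker side boils down to
tasks like the following minimal sketch; the broker URL assumes a
local RabbitMQ with default credentials, and the task body is just a
stand-in for our real analysis steps:

    from celery import Celery

    # Broker URL for a local RabbitMQ with the default guest account.
    app = Celery("analysis", broker="amqp://guest:guest@localhost//")

    @app.task
    def extract_keywords(article_text):
        # Stand-in for one of the small CPU-intensive analysis steps;
        # PyPy's JIT helps most in tight loops like this one.
        words = article_text.lower().split()
        return sorted(set(w for w in words if len(w) > 6))

Calling extract_keywords.delay(text) from the crawler enqueues the
work on RabbitMQ, and any available PyPy worker picks it up.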

I'm already planning to write Python ports of Cassandra (as a
distributed key/value-only database without index features),
ZooKeeper, Redis and ElasticSearch for upcoming projects, and
hopefully open-source them.

Regards,
Marko Tasic

