What is mrjob?
-----------------------

mrjob is a Python package that helps you write and run Hadoop Streaming jobs.

mrjob fully supports Amazon's Elastic MapReduce (EMR) service, which allows you to buy time on a Hadoop cluster on an hourly basis. It also works with your own Hadoop cluster.
Some important features:

* Run jobs on EMR, your own Hadoop cluster, or locally (for testing).
* Write multi-step jobs (one map-reduce step feeds into the next)
* Duplicate your production environment inside Hadoop

  * Upload your source tree and put it in your job's $PYTHONPATH
  * Run make and other setup scripts
  * Set environment variables (e.g. $TZ)
  * Easily install python packages from tarballs (EMR only)
  * Setup handled transparently by mrjob.conf config file

* Automatically interpret error logs from EMR
* SSH tunnel to hadoop job tracker on EMR
* Minimal setup

  * To run on EMR, set $AWS_ACCESS_KEY_ID and $AWS_SECRET_ACCESS_KEY
  * To run on your Hadoop cluster, install simplejson and make sure $HADOOP_HOME is set.

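To make the "one map-reduce step feeds into the next" idea concrete, here is a minimal in-memory sketch in plain Python. It deliberately does not use mrjob's API: ``run_step`` and the mapper/reducer function names are hypothetical helpers invented for this illustration, and the input lines are made-up sample data. Step 1 counts words; step 2 takes step 1's output and picks the most frequent word.

```python
from collections import defaultdict


def run_step(records, mapper, reducer):
    """Run one map-reduce step in memory: map, group by key, then reduce.

    Hypothetical helper for illustration only (not part of mrjob).
    """
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    output = []
    for out_key in sorted(grouped):
        output.extend(reducer(out_key, grouped[out_key]))
    return output


# Step 1: count how often each word occurs.
def count_mapper(_, line):
    for word in line.split():
        yield word.lower(), 1


def count_reducer(word, counts):
    yield word, sum(counts)


# Step 2: send every (count, word) pair to a single key and keep the largest.
def max_mapper(word, count):
    yield None, (count, word)


def max_reducer(_, count_word_pairs):
    yield max(count_word_pairs)


# Made-up sample input: (key, line) pairs, with the key unused.
lines = [
    (None, "the quick brown fox"),
    (None, "the lazy dog"),
    (None, "The end"),
]

word_counts = run_step(lines, count_mapper, count_reducer)  # output of step 1
top = run_step(word_counts, max_mapper, max_reducer)        # step 2 consumes it
print(top)
```

The only point of the sketch is the chaining: the list returned by the first step is passed unchanged as the input records of the second.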

* Install mrjob: pip install mrjob -OR- easy_install mrjob
* Documentation: http://packages.python.org/mrjob/
* PyPI: http://pypi.python.org/pypi/mrjob
* Mailing list: http://groups.google.com/group/mrjob
* Development is hosted at github: http://github.com/Yelp/mrjob

What's new?
--------------------

mrjob v0.3.0 is a major new release. Full details are at http://packages.python.org/mrjob/whats-new.html - here are a few highlights:

v0.3.0, 2011-12-07

* Combiners
* *_init() and *_final() for mappers, reducers, and combiners
* Custom option parsers
* Job flow pooling on EMR (saves time and money!)
* SSH log fetching
* New EMR diagnostic tools

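For readers new to combiners, the following plain-Python sketch shows the general idea: a combiner pre-aggregates each map task's output so that fewer records have to be sent on to the reducers. It is not mrjob code, and the function names (``mapper``, ``combine``, ``reduce_all``) and the sample input are assumptions made up for this illustration. In a real Hadoop job the framework decides when, and whether, a combiner runs.

```python
from collections import defaultdict


def mapper(line):
    """Emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word, 1


def combine(pairs):
    """Pre-sum values per key within a single map task's output."""
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())


def reduce_all(pairs):
    """Sum values per key across everything that reached the reducers."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Made-up input: two map tasks, each handling a few lines.
map_task_inputs = [["a b a", "a c"], ["b b", "a"]]

shuffled_without_combiner = []
shuffled_with_combiner = []
for task_lines in map_task_inputs:
    task_output = [pair for line in task_lines for pair in mapper(line)]
    shuffled_without_combiner.extend(task_output)
    shuffled_with_combiner.extend(combine(task_output))

print(len(shuffled_without_combiner), len(shuffled_with_combiner))
print(reduce_all(shuffled_with_combiner))
```

Both paths produce the same totals; the combined path simply hands fewer records to the reduce phase, which is where the savings come from on a real cluster.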
A big thanks to the contributors to this release: Steve Johnson, Dave Marin, Wahbeh Qardaji, Derek Wilson, Jordan Andersen, and Benjamin Goldenberg!