Suggestions for Python MapReduce?

Phillip B Oldham phillip.oldham at gmail.com
Wed Jul 22 11:09:31 EDT 2009


On Jul 22, 2:23 pm, Casey Webster <Casey... at gmail.com> wrote:
> I can't answer your question, but I would like to better understand
> the problem you are trying to solve.  The Apache Hadoop/MapReduce Java
> application isn't really that "large" by modern standards, although it
> is generally run with large heap sizes for performance (-Xmx1024m or
> larger for the mapred.child.java.opts parameter).
>
> MapReduce is designed to do extremely fast parallel data set
> processing on terabytes of data over hundreds of physical nodes; what
> advantage would a pure Python approach have here?

We're always taught that it is a good idea to reduce the number of
dependencies for a project. You could use Disco, or Dumbo, or even
Skynet, but then you've introduced another system you have to manage.
Each new system has a learning curve, which is lessened if the system
is written in a language or environment you already work with and
understand. And since a large part of the cost of environments like
Erlang and Java lies in understanding them, any issue that is out of
the ordinary can be a killer; changing the heap size for Java, as you
mentioned above, is the kind of issue that a non-Java dev trying to
use Hadoop might trip over.

With the advent of cloud computing and the new semi-structured/
document databases that are coming to the fore, sometimes you need to
use MapReduce on smaller or fewer systems to the same effect. Also,
you may only need to ensure that a job is done in a timely fashion
without taking up too many resources, rather than lightning-fast.
Dumbo or Disco may be considered overkill in these situations.

Implementations like BashReduce
<http://blog.last.fm/2009/04/06/mapreduce-bash-script> are perfect for
such scenarios. I'm simply wondering if there's another simpler,
smaller implementation of MapReduce that plays nicely with Python but
doesn't require the setup and knowledge overhead of more "robust"
implementations such as Hadoop and Disco... maybe something similar to
Ruby's Skynet.
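
To make the scale I have in mind concrete, here is a minimal sketch of
a single-machine MapReduce in plain Python. It is only an illustration
under my own assumptions, not how any of the projects above work: the
names map_reduce, count_words and sum_counts are made up for this
example, and the map phase is serial unless you pass in something like
the map method of a multiprocessing.Pool.

```python
from collections import defaultdict
from itertools import chain


def map_reduce(inputs, mapper, reducer, map_fn=map):
    """Run a small, single-machine MapReduce job over `inputs`.

    mapper(item) returns an iterable of (key, value) pairs.
    reducer(key, values) returns the reduced result for that key.
    map_fn decides how the map phase runs: the built-in map is serial;
    multiprocessing.Pool(...).map could be passed in instead (assumption:
    mapper and its results must then be picklable).
    """
    # Map phase: each input item yields zero or more (key, value) pairs.
    mapped = map_fn(mapper, inputs)

    # Shuffle phase: group every emitted value under its key.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped):
        groups[key].append(value)

    # Reduce phase: collapse each key's list of values into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def count_words(line):
    # Hypothetical mapper: emit (word, 1) for every word in a line.
    return [(word.lower(), 1) for word in line.split()]


def sum_counts(word, counts):
    # Hypothetical reducer: total the counts collected for one word.
    return sum(counts)


if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "The end"]
    print(map_reduce(lines, count_words, sum_counts))
```

Something of roughly this size, plus a way to hand the map phase to a
few worker processes or machines, is the kind of thing I'm looking for.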


