[Baypiggies] Fw: pydoop -- Python MapReduce and HDFS API for Hadoop

Jeff Younker jeff at drinktomi.com
Thu Nov 12 15:10:15 CET 2009


On Nov 6, 2009, at 11:59 AM, Joel VanderKwaak wrote:
> we recently released pydoop, a Python MapReduce and HDFS API for  
> Hadoop:
> ...
> It is implemented as a Boost.Python wrapper around the C++ code (pipes
> and libhdfs). It allows you to write complete MapReduce applications in
> CPython, with the same capabilities as the C++ API. Here is a minimal
> wordcount example:
> ...
>   def reduce(self, context):
>     s = 0
>     while context.nextValue():
>       s += int(context.getInputValue())
>     context.emit(context.getInputKey(), str(s))
> ...
> Any feedback would be greatly appreciated.


This is an impressive piece of work, and I'm really glad to have it
around.  I've recently started looking at Hadoop's MapReduce, and this
is going to make my life much easier.  It strikes me as a little
un-Pythonic, though, reading much like Java written in Python, and I
think a few changes would improve it.

1) Use the standard Python lower_case_with_underscores names for
methods and attributes instead of the javaLowerCamelCase convention.

With this change the reduce method would become:

     def reduce(self, context):
         s = 0
         while context.next_value():
             s += int(context.get_input_value())
         context.emit(context.get_input_key(), str(s))

2) Use properties instead of getter functions.

The Context class would define:

     input_value = property(lambda self: self.get_input_value())
     input_key = property(lambda self: self.get_input_key())

And the reduce method becomes:

     def reduce(self, context):
         s = 0
         while context.next_value():
             s += int(context.input_value)
         context.emit(context.input_key, str(s))
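For concreteness, here is a small runnable sketch of suggestion 2 using
the @property decorator form, which is equivalent to the property()
calls above.  The stub getters are placeholders of my own, not pydoop's
real implementation:

```python
# Hypothetical stand-in for pydoop's Context; the stub getters just
# return fixed values so the property access can be tried stand-alone.
class Context(object):

    def get_input_value(self):  # stub; the real class wraps C++ pipes
        return "7"

    def get_input_key(self):  # stub
        return "word"

    @property
    def input_value(self):
        return self.get_input_value()

    @property
    def input_key(self):
        return self.get_input_key()


c = Context()
print("key=%s value=%s" % (c.input_key, c.input_value))  # key=word value=7
```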

3) Use generators for traversing the results.

The Context class would define:

     def values(self):
         while self.next_value():
             yield self.input_value

And the reduce method becomes:

     def reduce(self, context):
         s = 0
         for x in context.values():
             s += int(x)
         context.emit(context.input_key, str(s))


4) If next_value() is the primary traversal, then consider making the
Context itself iterable.

The Context class would then define:

     def __iter__(self):
         return self.values()

     def values(self):
         while self.next_value():
             yield self.input_value

And the reduce function becomes:

     def reduce(self, context):
         s = 0
         for x in context:
             s += int(x)
         context.emit(context.input_key, str(s))
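Pulling the four suggestions together, here is a runnable sketch built
on a mock Context (a hypothetical stand-in of my own, not pydoop's
actual class), so the final reduce style can be tried without a Hadoop
cluster:

```python
# Mock Context combining suggestions 1-4: underscore names, properties,
# a generator, and __iter__.  Only next_value() mimics the wrapped C++
# traversal; everything else is the Pythonic layer proposed above.
class MockContext(object):

    def __init__(self, key, values):
        self._key = key
        self._values = iter(values)
        self._current = None

    def next_value(self):  # mimics the underlying pipes traversal
        try:
            self._current = next(self._values)
            return True
        except StopIteration:
            return False

    input_key = property(lambda self: self._key)
    input_value = property(lambda self: self._current)

    def values(self):
        while self.next_value():
            yield self.input_value

    def __iter__(self):
        return self.values()

    def emit(self, key, value):  # records the pair instead of writing output
        self.emitted = (key, value)


def reduce(context):
    s = 0
    for x in context:
        s += int(x)
    context.emit(context.input_key, str(s))


ctx = MockContext("word", ["1", "2", "3"])
reduce(ctx)
print(ctx.emitted)  # ('word', '6')
```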


Once again, thanks for doing this, and thanks for releasing it.  This  
is a fabulous package to have in my toolchest.

- Jeff Younker - jeff at drinktomi.com -

