[Baypiggies] Fw: pydoop -- Python MapReduce and HDFS API for Hadoop
Jeff Younker
jeff at drinktomi.com
Thu Nov 12 15:10:15 CET 2009
On Nov 6, 2009, at 11:59 AM, Joel VanderKwaak wrote:
> we recently released pydoop, a Python MapReduce and HDFS API for
> Hadoop:
> ...
> It is implemented as a Boost.Python wrapper around the C++ code (pipes
> and libhdfs). It allows you to write complete MapReduce applications in
> CPython, with the same capabilities as the C++ API. Here is a minimal
> wordcount example:
> ...
>     def reduce(self, context):
>         s = 0
>         while context.nextValue():
>             s += int(context.getInputValue())
>         context.emit(context.getInputKey(), str(s))
> ...
> Any feedback would be greatly appreciated.
This is an impressive piece of work, and I'm really glad to have it
around. I've recently started looking at Hadoop's MapReduce, and this
is going to make my life much easier. It strikes me as a little
un-Pythonic, though, reading much like Java written in Python, and I
think a few changes would improve it.
1) Use the python_standard_for_method_and_attribute names instead of
the javaLowerCamelCaseConvention.
With this change the reduce method would become:
    def reduce(self, context):
        s = 0
        while context.next_value():
            s += int(context.get_input_value())
        context.emit(context.get_input_key(), str(s))
2) Use properties instead of getter functions.
The Context class would define:
    input_value = property(lambda self: self.get_input_value())
    input_key = property(lambda self: self.get_input_key())
And the reduce method becomes:
    def reduce(self, context):
        s = 0
        while context.next_value():
            s += int(context.input_value)
        context.emit(context.input_key, str(s))
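For what it's worth, the same properties can also be written with the
property decorator, which reads a little more cleanly than the lambda
form. A sketch — the getters here are stubs standing in for the
wrapped C++ calls, and the stub return values are mine, not pydoop's:

```python
class Context(object):
    # Stubs standing in for the wrapped C++ getters; in pydoop these
    # would delegate to the underlying pipes context.
    def get_input_key(self):
        return "hadoop"

    def get_input_value(self):
        return "42"

    # Decorator form of the same read-only properties.
    @property
    def input_key(self):
        return self.get_input_key()

    @property
    def input_value(self):
        return self.get_input_value()
```

Either spelling gives the same read-only attribute access, e.g.
`context.input_value` instead of `context.get_input_value()`.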
3) Use generators for traversing the results.
The Context class would define:
    def values(self):
        while self.next_value():
            yield self.input_value
And the reduce method becomes:
    def reduce(self, context):
        s = 0
        for x in context.values():
            s += int(x)
        context.emit(context.input_key, str(s))
4) If next_value() is the primary means of traversal, consider making
the Context itself iterable.
The Context class would then define:
    def __iter__(self):
        return self.values()

    def values(self):
        while self.next_value():
            yield self.input_value
And the reduce method becomes:
    def reduce(self, context):
        s = 0
        for x in context:
            s += int(x)
        context.emit(context.input_key, str(s))
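To make the four suggestions concrete, here is a runnable sketch that
combines them, using a stub Context backed by a plain list. The stub
and its test data are mine — the real pydoop Context wraps C++ state —
but the Python-facing interface is exactly the one proposed above:

```python
class Context(object):
    """Stub context: a key plus a list of values standing in for the
    values the Hadoop framework would feed to a reducer."""

    def __init__(self, key, values):
        self._key = key
        self._values = iter(values)
        self._current = None

    def next_value(self):
        # snake_case replacement for nextValue() (suggestion 1).
        try:
            self._current = next(self._values)
            return True
        except StopIteration:
            return False

    # Properties instead of getter functions (suggestion 2).
    input_key = property(lambda self: self._key)
    input_value = property(lambda self: self._current)

    # Generator traversal of the values (suggestion 3).
    def values(self):
        while self.next_value():
            yield self.input_value

    # The context itself is iterable (suggestion 4).
    def __iter__(self):
        return self.values()

    def emit(self, key, value):
        # Record the emitted pair instead of handing it to Hadoop.
        self.emitted = (key, value)


def reduce(context):
    s = 0
    for x in context:
        s += int(x)
    context.emit(context.input_key, str(s))


ctx = Context("hadoop", ["1", "2", "3"])
reduce(ctx)
# ctx.emitted is now ("hadoop", "6")
```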
Once again, thanks for doing this, and thanks for releasing it. This
is a fabulous package to have in my toolchest.
- Jeff Younker - jeff at drinktomi.com -