Hi all,

for one of my projects I am using basically using NLTK for pos tagging, which internally uses a 'english.pickle' file. I managed to package the nltk library with these pickle files to make them available to mapper and reducer for hadoop streaming job using -file option.

However, when nltk library is trying to load that pickle file, it gives error for numpy- since the cluster I am running this job does not have numpy installed. Also, I don't have root access thus, can't install numpy or any other package on cluster. So the only way is to package the python modules to make it available for mapper and reducer. I successfully managed to do that. But now the problem is when numpy is imported, it imports multiarray by default( as seen in init.py) and this is where I am getting the error:

File "/usr/lib64/python2.6/pickle.py", line 1370, in load
        return Unpickler(file).load()
      File "/usr/lib64/python2.6/pickle.py", line 858, in load
      File "/usr/lib64/python2.6/pickle.py", line 1090, in load_global
        klass = self.find_class(module, name)
      File "/usr/lib64/python2.6/pickle.py", line 1124, in find_class
      File "numpy.mod/numpy/__init__.py", line 170, in <module>
      File "numpy.mod/numpy/add_newdocs.py", line 13, in <module>
      File "numpy.mod/numpy/lib/__init__.py", line 8, in <module>
      File "numpy.mod/numpy/lib/type_check.py", line 11, in <module>
      File "numpy.mod/numpy/core/__init__.py", line 6, in <module>
    ImportError: cannot import name multiarray

I tried moving numpy directory on my local machine that contains multiarray.pyd, to the cluster to make it available to mapper and reducer but this didn't help.

Any input on how to resolve this(keeping the constraint that I cannot install anything on cluster machines)? 



Kartik Perisetla