Using numpy on hadoop streaming: ImportError: cannot import name multiarray
Hi all, for one of my projects I am using NLTK for POS tagging, which internally loads an 'english.pickle' file. I managed to package the nltk library together with these pickle files and make them available to the mapper and reducer of a Hadoop streaming job using the -file option. However, when the nltk library tries to load that pickle file, it fails on numpy, since the cluster I am running this job on does not have numpy installed. Also, I don't have root access, so I can't install numpy or any other package on the cluster. So the only way is to package the Python modules and make them available to the mapper and reducer, which I managed to do. But now the problem is that when numpy is imported, it imports multiarray by default (as seen in __init__.py), and this is where I get the error:

    File "/usr/lib64/python2.6/pickle.py", line 1370, in load
      return Unpickler(file).load()
    File "/usr/lib64/python2.6/pickle.py", line 858, in load
      dispatch[key](self)
    File "/usr/lib64/python2.6/pickle.py", line 1090, in load_global
      klass = self.find_class(module, name)
    File "/usr/lib64/python2.6/pickle.py", line 1124, in find_class
      __import__(module)
    File "numpy.mod/numpy/__init__.py", line 170, in <module>
    File "numpy.mod/numpy/add_newdocs.py", line 13, in <module>
    File "numpy.mod/numpy/lib/__init__.py", line 8, in <module>
    File "numpy.mod/numpy/lib/type_check.py", line 11, in <module>
    File "numpy.mod/numpy/core/__init__.py", line 6, in <module>
    ImportError: cannot import name multiarray

I tried moving the numpy directory from my local machine, which contains multiarray.pyd, to the cluster to make it available to the mapper and reducer, but this didn't help. Any input on how to resolve this (keeping the constraint that I cannot install anything on the cluster machines)?

Thanks!

--
Regards,
Kartik Perisetla
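Note for later readers: the failure surfaces inside pickle.py because unpickling imports whatever modules the pickled objects reference, so loading NLTK's 'english.pickle' forces an import of numpy, which in turn needs the compiled multiarray extension. A small sketch (written for a modern Python; the cluster in this thread ran 2.6) that traces which modules an unpickle pulls in:

```python
import collections
import io
import pickle

class TracingUnpickler(pickle.Unpickler):
    """Logs every module.name reference the pickle stream resolves."""
    def find_class(self, module, name):
        print("unpickling needs: %s.%s" % (module, name))
        return super(TracingUnpickler, self).find_class(module, name)

# Stand-in object; a pickled NLTK tagger would reference numpy here.
data = pickle.dumps(collections.OrderedDict(a=1))
TracingUnpickler(io.BytesIO(data)).load()
# prints e.g.: unpickling needs: collections.OrderedDict
```

If any of those modules is missing on the node doing the unpickling, the load fails exactly the way shown in the traceback above.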
On 11 February 2015 at 03:38, Kartik Kumar Perisetla
Also, I don't have root access thus, can't install numpy or any other package on cluster
You can create a virtualenv and install packages into it without needing root access. To minimize trouble, you can make it use the system packages when they are available. Here are instructions on how to install it:

https://stackoverflow.com/questions/9348869/how-to-install-virtualenv-withou...
http://opensourcehacker.com/2012/09/16/recommended-way-for-sudo-free-install...

This does not require root access, but it is probably a good idea to check with the sysadmins to make sure they are fine with it.

/David.
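To make that suggestion concrete, here is a sketch of a sudo-free setup. The ~/nlpenv path is illustrative, and with a modern Python the stdlib venv module does the job virtualenv did at the time of this thread:

```shell
# Create an isolated environment in the home directory (no root needed);
# --system-site-packages lets it fall back to packages the admins installed.
python3 -m venv "$HOME/nlpenv" --system-site-packages
. "$HOME/nlpenv/bin/activate"

# pip now resolves inside the environment, so installs need no root:
command -v pip            # -> $HOME/nlpenv/bin/pip
# pip install numpy nltk  # would land under ~/nlpenv, not system-wide
```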
Thanks David. But do I need to install virtualenv on every node in the Hadoop cluster? Actually, I am not sure whether the same namenodes are assigned to every one of my Hadoop jobs, so how should I proceed in that scenario?

Thanks for your inputs.

Kartik
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
On 11 February 2015 at 08:06, Kartik Kumar Perisetla
Thanks David. But do I need to install virtualenv on every node in hadoop cluster? Actually I am not very sure whether same namenodes are assigned for my every hadoop job. So how shall I proceed on such scenario.
I have never used Hadoop, but on the clusters I have used, you have a home folder on the central node, and every computing node has access to it. You can then install Python in your home folder and make every node run that, or pull a local copy. The cluster support team can probably clarify this further and adapt it to your particular case.

/David.
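Building on that: with Hadoop streaming specifically, a zipped environment built once on the login node can be shipped to the compute nodes at job-submission time, so nothing has to be installed per node. A hypothetical invocation (the jar path, HDFS paths, and the env.zip name are all illustrative, not from this thread):

```shell
# env.zip: a relocatable Python environment built once on the login node.
# The '#env' suffix unpacks it as ./env in each task's working directory.
hadoop jar "$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming.jar" \
    -archives hdfs:///user/kartik/env.zip#env \
    -files mapper.py,reducer.py \
    -mapper "env/bin/python mapper.py" \
    -reducer "env/bin/python reducer.py" \
    -input /data/input -output /data/output
```

The -archives option distributes and unpacks the archive on every task node, which sidesteps the "same nodes are not assigned to every job" concern.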
Hi David,
Thanks for your response.
But I can't install anything on the cluster.
Could anyone please help me understand how the file 'multiarray.so' is used by the tagger? I mean, how is it loaded? (I assume it's some sort of DLL on Windows and a shared library on Unix-based systems.) Is it a module, or what?
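On that question: an extension module like multiarray.so (multiarray.pyd on Windows) is indeed a platform shared library, but the interpreter treats it as an ordinary module, found through sys.path and loaded with the platform's dynamic loader. A sketch using a modern Python's importlib to show where a compiled module would be loaded from (_ctypes is just a stdlib example of such a module):

```python
import importlib.machinery
import importlib.util

# Filename endings the import system accepts for compiled modules,
# e.g. '.so' on Linux, '.pyd' on Windows:
print(importlib.machinery.EXTENSION_SUFFIXES)

# find_spec walks sys.path exactly as it would for a pure-Python module;
# spec.origin is the shared-library file the loader would open.
spec = importlib.util.find_spec("_ctypes")
print(spec.origin)
```

Because the loader maps machine code into the process, the binary must match the importing node's OS, CPU architecture, and Python version.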
Right now what I did is package numpy so that it is present in the current working directory of the mapper and reducer. So now control goes into the numpy packaged along with the mapper.
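One thing worth checking in this setup: the shipped package directory has to be on sys.path ahead of anything else, and the bundled multiarray binary has to be built for the cluster nodes' platform and Python version (the multiarray.pyd mentioned earlier is a Windows binary and will never load on a Linux node). A guarded sketch of what the mapper could do first ("numpy.mod" is the directory name taken from the traceback):

```python
import os
import sys

# Put the shipped package directory first so the bundled copy wins.
pkg_dir = os.path.join(os.getcwd(), "numpy.mod")
if pkg_dir not in sys.path:
    sys.path.insert(0, pkg_dir)

# The import still fails unless the bundled multiarray binary matches
# this node's OS, architecture, and Python version.
try:
    import numpy
    print("numpy loaded from", numpy.__file__)
except ImportError as err:
    print("bundled numpy unusable on this node:", err)
```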
But I still see this error:

    File "glossextractionengine.mod/nltk/tag/__init__.py", line 123, in pos_tag
    File "glossextractionengine.mod/pickle.py", line 1380, in load
      return doctest.testmod()
    File "glossextractionengine.mod/pickle.py", line 860, in load
      return stopinst.value
    File "glossextractionengine.mod/pickle.py", line 1092, in load_global
      dispatch[GLOBAL] = load_global
    File "glossextractionengine.mod/pickle.py", line 1126, in find_class
      klass = getattr(mod, name)
    File "numpy.mod/numpy/__init__.py", line 137, in <module>
    File "numpy.mod/numpy/add_newdocs.py", line 13, in <module>
    File "numpy.mod/numpy/lib/__init__.py", line 4, in <module>
    File "numpy.mod/numpy/lib/type_check.py", line 21, in <module>
    File "numpy.mod/numpy/core/__init__.py", line 9, in <module>
    ImportError: No module named multiarray
In this case the file 'multiarray.so' is present within the core package, but it is still not found. Can anyone throw some light on this?

Thanks!
Kartik
-- Regards, Kartik Perisetla
participants (2)
- Daπid
- Kartik Kumar Perisetla