Predict Method of OneVsRestClassifier Integration with Google Cloud ML
Hi all,

I was hoping to get some guidance on changing the result of the predict method of OneVsRestClassifier so that it returns a dense array rather than a sparse array, given that Google Cloud ML only accepts dense numpy arrays as the result of a given model's predict method. Right now my model architecture looks like:

model = Pipeline([('tfidf', TfidfVectorizer()), ('clf', OneVsRestClassifier(XGBClassifier()))])

which returns a sparse array from the predict method. I saw the Stack Overflow post here: https://stackoverflow.com/questions/52151548/google-cloud-ml-engine-scikit-l...

which recommends overwriting the predict method with the predict_proba method. However, I found that I can't serialize the model after doing so. I also have a Stack Overflow post here: https://stackoverflow.com/questions/55366454/how-to-convert-scikit-learn-one... which details the specific pickling error.

Is this a known issue? Is there an accepted way to convert this into a dense array?

Thanks,
Liam Geron
In reply to Liam Geron <liam@chatdesk.com> (Apr 10, 2019, at 12:25 PM):

Hi Liam,

I'm not sure what your exact error message is, but could it also be that XGBClassifier only accepts dense arrays? I think TfidfVectorizer returns sparse arrays. You could probably fix your issue by inserting a "DenseTransformer" into your pipeline (a simple class that just converts an array from sparse to dense format). I've implemented something like that, which you can import or copy and paste from here:

https://github.com/rasbt/mlxtend/blob/master/mlxtend/preprocessing/dense_tra...

The usage would then basically be

model = Pipeline([('tfidf', TfidfVectorizer()), ('to_dense', DenseTransformer()), ('clf', OneVsRestClassifier(XGBClassifier()))])

Best,
Sebastian
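For readers who cannot follow the truncated link, here is a minimal sketch of the kind of "DenseTransformer" described above. It is an illustration written for this thread, not the mlxtend implementation, and the sample matrix is made up.

```python
import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin


class DenseTransformer(BaseEstimator, TransformerMixin):
    """Illustrative sketch (not mlxtend's code): sparse matrix in, dense array out."""

    def fit(self, X, y=None):
        # Nothing is learned; fit exists so the class can sit inside a Pipeline.
        return self

    def transform(self, X):
        # Densify sparse input; pass dense input through as a numpy array.
        if sparse.issparse(X):
            return X.toarray()
        return np.asarray(X)


# Made-up sample data, standing in for TfidfVectorizer output.
X_sparse = sparse.csr_matrix([[0.0, 1.5], [2.0, 0.0]])
X_dense = DenseTransformer().fit_transform(X_sparse)
```

Note that a step like this only changes the features handed to the classifier. It does not change the type of whatever the final estimator's predict method returns.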
In reply to Sebastian Raschka <mail@sebastianraschka.com> (Wed, Apr 10, 2019 at 2:06 PM):

Hi Sebastian,

Thanks for the advice! Luckily, the model actually works fine on its own in Python, so I don't think that is exactly the issue. I have tried rolling my own estimator that wraps the pipeline and calls the predict_proba method to return a dense array. However, I then ran into the problem that the custom estimator would have to be defined on the Cloud ML end, which I'm unsure how to do.

Thanks,
Liam
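The point about needing the custom estimator "defined on the Cloud ML end" follows from how Python's pickle handles custom classes: it records the class by module and name rather than storing its code. The sketch below shows this with the standard library only. `ToyModel` and `DensePredictWrapper` are hypothetical stand-ins, not Liam's actual classes, and nothing here reproduces Cloud ML's behaviour or the specific error from the Stack Overflow post.

```python
import pickle


class ToyModel:
    """Hypothetical stand-in for the fitted pipeline."""

    def predict_proba(self, X):
        return [[0.25, 0.75] for _ in X]


class DensePredictWrapper:
    """Hypothetical wrapper whose predict() delegates to predict_proba()."""

    def __init__(self, model):
        self.model = model

    def predict(self, X):
        return self.model.predict_proba(X)


wrapped = DensePredictWrapper(ToyModel())
payload = pickle.dumps(wrapped)

# The pickle refers to the class by name; the class body is not stored.
name_in_payload = b"DensePredictWrapper" in payload

# Where the class definition is available, loading works as usual.
restored = pickle.loads(payload)
restored_output = restored.predict(["a", "b"])

# Simulate an environment that lacks the class definition.
del DensePredictWrapper
try:
    pickle.loads(payload)
    load_error = None
except AttributeError as exc:
    load_error = exc
```

After the class is removed, loading the same bytes fails with an AttributeError, which is the situation a serving environment would be in if it only received the pickled model file.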
In reply to Liam Geron <liam@chatdesk.com> (Apr 10, 2019, at 1:10 PM):

Hm, weird that their platform seems to be so picky about it. Have you tried just making the output of the pipeline dense? I.e.,

(model.predict(X)).toarray()

Best,
Sebastian
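As a small illustration of the `.toarray()` suggestion, the sketch below fits a OneVsRestClassifier on a sparse label matrix and then densifies the prediction. The data is made up, and `LogisticRegression` is used as a stand-in for `XGBClassifier` purely to keep the example free of extra dependencies. To the best of my knowledge, a sparse prediction arises when the classifier was fit on sparse labels, but treat that as an observation about this sketch rather than a statement about every scikit-learn version.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Made-up features and a multilabel indicator matrix (two labels).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y_dense = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y_sparse = sparse.csr_matrix(Y_dense)

# LogisticRegression is an assumption standing in for XGBClassifier.
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y_sparse)

pred = clf.predict(X)        # sparse matrix, because Y was sparse at fit time
pred_dense = pred.toarray()  # dense numpy array with the same shape
```

This conversion has to happen in code that runs after predict, so it only helps when the caller controls that step.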
In reply to Sebastian Raschka <mail@sebastianraschka.com> (Wed, Apr 10, 2019 at 3:23 PM):

Unfortunately, I don't believe you get that level of freedom. It's an API call that automatically calls the model's predict method, so I don't think I get to specify something like model.predict(X).toarray(). I could be wrong, however; I don't pretend to be an expert on Cloud ML by any stretch.

Thanks,
Liam
In reply to Liam Geron <liam@chatdesk.com> (Thu, 11 Apr 2019 at 05:32):

I think it's a bit weird if we're returning sparse output from OneVsRestClassifier.predict if it wasn't fit on sparse Y.

Actually, I would be in favour of deprecating multilabel support in OneVsRestClassifier, since it is performing the "binary relevance method" for multilabel, not actually OvR. MultiOutputClassifier duplicates this functionality (more or less), outputs a dense array (indeed, it doesn't support sparse Y, and perhaps it should), and lives closer to functional alternatives to binary relevance, such as ClassifierChain.
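A minimal sketch of the MultiOutputClassifier route, assuming a dense multilabel indicator matrix as the target. The texts and labels are invented for illustration, and `LogisticRegression` again stands in for `XGBClassifier` so the example needs only scikit-learn.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Invented toy data: two binary labels per document.
texts = [
    "refund my order",
    "reset my password",
    "refund and password reset",
    "hello there",
]
Y = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # LogisticRegression is an assumption standing in for XGBClassifier.
    ("clf", MultiOutputClassifier(LogisticRegression())),
])
model.fit(texts, Y)

pred = model.predict(texts)  # dense numpy array, one column per label
```

Because every class in this pipeline ships with scikit-learn, no custom class definition is needed wherever the pickled model is loaded.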
In reply to Joel Nothman <joel.nothman@gmail.com> (Wed, Apr 10, 2019 at 11:03 PM):

That's a great tip, actually; I was unaware of the MultiOutputClassifier option. I'll give it a try!

Thanks,
Liam
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
participants (3)
- Joel Nothman
- Liam Geron
- Sebastian Raschka