[scikit-learn] Help With Text Classification

Thu Aug 3 00:54:18 EDT 2017

One of the key advantages of Pipeline is that it makes sure that equivalent
processing happens at training and prediction time (assuming you do not
write your own transformers that break their contract). This is what
appears to have broken in your current attempts.
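
For what it is worth, here is a minimal sketch of what that could look like for your two columns. It is not taken from your notebook: the column names "part" and "complaint" and the toy rows are made up, and it assumes a scikit-learn release that ships ColumnTransformer (0.20 or later); on older releases the FeatureUnion approach from the linked example plays the same role. TfidfVectorizer stands in for CountVectorizer followed by TfidfTransformer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up toy data; the column names are placeholders, not from the notebook.
train = pd.DataFrame({
    "part": ["A100", "B200", "A100", "C300"],
    "complaint": [
        "engine makes a rattling noise",
        "paint is peeling off the door",
        "engine stalls at idle",
        "door paint is scratched",
    ],
})
y_train = [1, 0, 1, 0]

features = ColumnTransformer([
    # A list of columns hands the encoder a 2-D input; categories that were
    # not seen during fit are encoded as all zeros instead of raising.
    ("part", OneHotEncoder(handle_unknown="ignore"), ["part"]),
    # A single column name hands the vectorizer a 1-D series of strings.
    ("text", TfidfVectorizer(), "complaint"),
])

model = Pipeline([("features", features), ("clf", MultinomialNB())])
model.fit(train, y_train)

test = pd.DataFrame({
    "part": ["B200", "Z999"],  # "Z999" never appeared in training
    "complaint": ["engine noise when starting", "paint chipped on door"],
})

# predict() reuses the vocabulary and categories learned in fit(), so the
# feature matrix has the same number of columns as at training time.
pred = model.predict(test)
n_train_cols = model.named_steps["features"].transform(train).shape[1]
n_test_cols = model.named_steps["features"].transform(test).shape[1]
```

The point is that nothing is re-fitted at prediction time: the pipeline only calls transform() on the new rows, which is what keeps the dimensions consistent.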

On 3 August 2017 at 13:12, pybokeh <pybokeh at gmail.com> wrote:

> Thanks, Joel, for recommending FeatureUnion.  I did run across that, but
> for just 2 features I thought it might be overkill.  I am aware of
> Pipeline, which the scikit-learn example explains very well and which I
> was going to use once I finalize my script.  I did not want to abstract
> away too much early on, since I am in the beginning stages of learning
> machine learning and scikit-learn.
>
> - Daniel
>
> On Wed, Aug 2, 2017 at 10:38 PM, Joel Nothman <joel.nothman at gmail.com>
> wrote:
>
>> Use a Pipeline to help avoid this kind of issue (and others). You might
>> also want to do something like
>> http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html
>>
>> On 3 August 2017 at 12:01, pybokeh <pybokeh at gmail.com> wrote:
>>
>>> Hello,
>>> I am studying this example from scikit-learn's site:
>>> http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
>>>
>>> The problem that I need to solve is very similar to this example, except
>>> I have one additional feature column (part #) that is categorical of
>>> type string. My label or target values consist of just 2 values: 0 or 1.
>>>
>>> With that additional feature column, I am transforming it with a
>>> LabelEncoder and then encoding it with the OneHotEncoder.
>>>
>>> Then I am concatenating that one-hot encoded column (part #) to the
>>> text/document feature column (complaint), to which I had applied the
>>> CountVectorizer and TfidfTransformer transformations.
>>>
>>> Then I chose the MultinomialNB model to fit my concatenated training
>>> data with.
>>>
>>> The problem I run into is that when I invoke the prediction, I get a
>>> dimension mismatch error.
>>>
>>> Here's my Jupyter notebook gist:
>>> http://nbviewer.jupyter.org/gist/anonymous/59ba930a783571c85ef86ba41424b311
>>>
>>> I would greatly appreciate it if someone could point out where I went
>>> wrong.  Thanks!
>>>
>>> - Daniel
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn