[scikit-learn] Help With Text Classification

pybokeh pybokeh at gmail.com
Thu Aug 3 17:48:26 EDT 2017

I found my problem.  When I one-hot encoded the part # for my test sample,
the result was a 1x1 matrix when I needed a 1x153 matrix.  This happened
because I used the default setting ('auto') for n_values when I needed to
set it to 153.  Now when I horizontally stack it with my other feature
matrix, the total number of columns correctly comes to 1294 instead of
1142.  Looking back, I am not sure whether using Pipeline or FeatureUnion
would have prevented this, since the error occurred on the prediction
side, not on the training or modeling side.
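
A minimal sketch of the shape problem described above, under some assumptions: the toy complaint strings and part numbers are hypothetical, and the code uses the OneHotEncoder API of later scikit-learn releases, where the categories are learned during fit rather than passed through n_values. The idea is the same as setting n_values by hand: the encoder has to know every training category when it transforms the test row.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for the complaint text and part # columns.
train_text = [
    "engine stalls when cold",
    "paint peeling on hood",
    "brake noise at low speed",
]
train_part = np.array([["P100"], ["P200"], ["P300"]])

# Fit both transformers on the training data only.
vectorizer = TfidfVectorizer()
encoder = OneHotEncoder(handle_unknown="ignore")
X_train = hstack([
    vectorizer.fit_transform(train_text),
    encoder.fit_transform(train_part),
])

# At prediction time, reuse the fitted objects and call transform only.
test_text = ["brake noise when cold"]
test_part = np.array([["P200"]])
X_test = hstack([
    vectorizer.transform(test_text),
    encoder.transform(test_part),
])

# Fitting a fresh encoder on the single test row sees only one category,
# which reproduces the 1x1 result.
refit = OneHotEncoder().fit_transform(test_part)

print("train columns:", X_train.shape[1])
print("test columns: ", X_test.shape[1])
print("refit shape:  ", refit.shape)
```

With the fitted encoder reused, the training and test matrices have the same number of columns, so the classifier can accept the test row.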

On Wed, Aug 2, 2017 at 10:38 PM, Joel Nothman <joel.nothman at gmail.com>
wrote:

> Use a Pipeline to help avoid this kind of issue (and others). You might
> also want to do something like
> http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html
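
One way the Pipeline suggestion could look, as a rough sketch rather than the approach from the linked example: it uses ColumnTransformer, which was added to scikit-learn after this thread, and TfidfVectorizer in place of the separate CountVectorizer and TfidfTransformer steps. The toy rows, labels, and step names are hypothetical.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: column 0 is the complaint text, column 1 the part #.
X_train = np.array([
    ["engine stalls when cold", "P100"],
    ["paint peeling on hood", "P200"],
    ["brake noise at low speed", "P300"],
    ["engine stalls at low speed", "P100"],
], dtype=object)
y_train = np.array([1, 0, 1, 1])

features = ColumnTransformer([
    # A scalar column index hands the vectorizer a 1-D array of strings.
    ("complaint", TfidfVectorizer(), 0),
    # A list of column indices hands the encoder a 2-D array.
    ("part", OneHotEncoder(handle_unknown="ignore"), [1]),
])

model = Pipeline([
    ("features", features),
    ("clf", MultinomialNB()),
])
model.fit(X_train, y_train)

# predict() runs the already fitted transformers, so the feature matrix
# for a new row has the same width as the training matrix.
X_new = np.array([["brake noise when cold", "P200"]], dtype=object)
prediction = model.predict(X_new)
print(prediction)
```

Because the pipeline owns the fitted transformers, there is no separate encoding step to repeat by hand at prediction time.
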
> On 3 August 2017 at 12:01, pybokeh <pybokeh at gmail.com> wrote:
>> Hello,
>> I am studying this example from scikit-learn's site:
>> http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
>> The problem that I need to solve is very similar to this example, except
>> that I have one additional feature column (part #) that is categorical,
>> of type string. My label or target values consist of just 2 values:
>> 0 or 1.
>> I am transforming that additional feature column with a LabelEncoder and
>> then encoding it with the OneHotEncoder.
>> Then I am concatenating that one-hot encoded column (part #) to the
>> text/document feature column (complaint), to which I had applied the
>> CountVectorizer and TfidfTransformer transformations.
>> Then I chose the MultinomialNB model to fit my concatenated training
>> data.
>> The problem I run into is that when I invoke the prediction, I get a
>> dimension mismatch error.
>> Here's my jupyter notebook gist:
>> http://nbviewer.jupyter.org/gist/anonymous/59ba930a783571c85ef86ba41424b311
>> I would greatly appreciate it if someone could show me where I went wrong.
>> Thanks!
>> - Daniel
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn