Hi,

I try to apply MLPClassifier to a subset (100 data points, 2 classes) of the 20newsgroups dataset. I created (ok, copied) the following pipeline:

    model_MLP = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('model_MLP', MLPClassifier(solver='lbfgs', alpha=1e-5,
                                    hidden_layer_sizes=(5, 2),
                                    random_state=1)),
    ])
    model_MLP.fit(twenty_train.data, twenty_train.target)
    predicted_MLP = model_MLP.predict(twenty_test.data)
    print(metrics.classification_report(twenty_test.target, predicted_MLP,
                                        target_names=twenty_test.target_names))

The numbers I get are hopeless:

                     precision    recall  f1-score   support
        alt.atheism       0.00      0.00      0.00        34
    sci.electronics       0.66      1.00      0.80        66

The only reason I can think of is that the vocabularies of the training set and the test set are not the same (test set: 5204 words, training set: 5402 words). That should not be a problem (if I understand Bayes correctly), but it certainly gives rubbish (see the numbers above).

The same setup with the SVD routine works great, all values are around .95.

thanks,
Andreas
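As a side note on the vocabulary concern: inside a Pipeline, CountVectorizer learns its vocabulary from the training data alone, and test documents are projected onto that same vocabulary (words never seen during training are simply dropped). So the differing word counts are expected and not the source of the problem. A minimal sketch with made-up documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["cats and dogs", "dogs chase cats"]
test_docs = ["parrots chase dogs"]  # 'parrots' never appears in training

vect = CountVectorizer()
X_train = vect.fit_transform(train_docs)  # vocabulary is learned here only
X_test = vect.transform(test_docs)        # unseen words are silently ignored

print(sorted(vect.vocabulary_))           # ['and', 'cats', 'chase', 'dogs']
print(X_train.shape[1], X_test.shape[1])  # same number of feature columns
```

Because transform() reuses the fitted vocabulary, train and test matrices always have the same number of columns, which is exactly what the downstream classifier requires.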
I don't think this is an issue directly related to scikit-learn. Your classifier is learning to always predict the majority class. If you do not have good training performance, then you either need more data or your model is inappropriate. You're trying to learn lots of parameters from 100 examples. Use a simpler model. Use stronger regularisation (higher alpha). Work through some tutorials on machine learning diagnostics and modelling choices.

On 13 Jan 2018 3:42 am, "andreas heiner" <ap.heiner@gmail.com> wrote:
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
participants (2)
- andreas heiner
- Joel Nothman