[scikit-learn] Difference in prediction accuracy using SGDClassifier and Cross validation scores.

Tue Mar 12 19:19:21 EDT 2019

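You are calculating recall, not accuracy: your script divides the correctly predicted 5s by the number of actual 5s, whereas accuracy also counts the correctly predicted non-5s.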
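
To make the distinction concrete, here is a minimal sketch that plugs the counts from the quoted output below into both formulas. It assumes nothing beyond those printed numbers, and the variable names are illustrative only.

```python
# Counts taken from the quoted output (training set, "is it a 5?" task).
total = 60000          # all training samples
actual_pos = 5421      # samples that really are 5s
predicted_pos = 3863   # samples the classifier labelled as 5
true_pos = 3574        # predicted 5 and really 5

# Derive the remaining confusion-matrix cells from the counts above.
false_pos = predicted_pos - true_pos                  # non-5s predicted as 5
false_neg = actual_pos - true_pos                     # 5s predicted as non-5
true_neg = total - true_pos - false_pos - false_neg   # non-5s predicted as non-5

recall = true_pos / actual_pos                # the value the script labels "Accuracy"
precision = true_pos / predicted_pos
accuracy = (true_pos + true_neg) / total      # what scoring='accuracy' measures

print(f"recall:    {recall:.4f}")
print(f"precision: {precision:.4f}")
print(f"accuracy:  {accuracy:.4f}")
```

On these counts the training-set accuracy comes out near 0.96, in line with the cross_val_score values, while recall is the 0.659 figure. Only about 9% of the samples are 5s, so accuracy is dominated by the correctly rejected non-5s.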

On Sun, 10 Mar 2019 at 05:36, Rajnish kamboj <rajnishk7.info at gmail.com> wrote:
>
> Hi
>
> I have recently started learning machine learning, and this is my first question about prediction accuracy.
>
> There is a difference between the prediction accuracy I compute for SGDClassifier and the cross-validation scores.
>
> import numpy as np
> from sklearn.datasets import fetch_openml
> from sklearn.linear_model import SGDClassifier
>
> mnist = fetch_openml('mnist_784', version=1, cache=True)
> X, y = mnist['data'], mnist['target']
> X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
> shuffled_index = np.random.permutation(60000) # shuffle the 0 - 60000 range
> X_train, y_train = X_train[shuffled_index], y_train[shuffled_index]
>
> y_train_5 = (y_train == '5')
> y_test_5 = (y_test == '5')
>
> sgd_clf = SGDClassifier(random_state=42, tol=1e-3, max_iter=1000)
> sgd_clf.fit(X_train, y_train_5)
>
> # Predicting for all 5s
> print("####### PREDICTION STATS ##############")
> y_train_5_pred = sgd_clf.predict(X_train)
>
> print("Total y_train_5 [False|True both]]:", len(y_train_5))
> print("Total y_train_5 [Only 5s]:", sum(y_train_5))
>
> # some other digit may be predicted as 5 and some 5s may be predicted as not 5
> print("Predicted 5s:", sum(y_train_5_pred))
>
> correctly_predicted = sum(np.logical_and(y_train_5_pred, y_train_5))
> print("Correct Predicted", correctly_predicted)
> print("Accuracy:", correctly_predicted/sum(y_train_5) * 100)
>
> from sklearn.model_selection import cross_val_score
> cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring='accuracy')
>
> My output:
>
> ####### PREDICTION STATS ##############
> Total y_train_5 [False|True both]]: 60000
> Total y_train_5 [Only 5s]: 5421
> Predicted 5s: 3863
> Correct Predicted 3574
> Accuracy: 65.9287954251983
> array([0.9323 , 0.96805, 0.9641 ])
> #######################################
>
> So, by my observation, there is a difference. Why?
>
> SGDClassifier is ~65.92% accurate
> cross_val_score reports ~95%
>
> Am I comparing them in the wrong way, or am I missing something?
>
>
> Thanks
>
> Rajnish