Number of informative features vs total number of features
Dear sklearn users,

I ran some supervised classification simulations with the make_classification function from sklearn, increasing the number of informative features from 1 out of 40 to 40 out of 40 (100%). I did not generate any repeated or redundant features, and I fixed the number of classes to two and the number of clusters per class to one.

I split the dataset 100 times with the StratifiedShuffleSplit function into a training set and a test set (80% / 20%). For each split I fitted a logistic regression, computed the training and testing accuracies, and averaged them over the 100 splits, giving a mean training accuracy and a mean testing accuracy.

I was expecting the accuracy to increase with the number of informative features, on both the training and the test sets. On the contrary, I got the best training and test scores with a single informative feature. Why do I get these results?

Thanks for your help,
Best regards,
Ben

Below is the simulation code I wrote:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

RANDOM_SEED = 4
n_inf = np.array([1, 5, 10, 15, 20, 25, 30, 35, 40])

mean_training_score_array = np.array([])
mean_testing_score_array = np.array([])
for n_inf_value in n_inf:
    # Two classes, one Gaussian cluster per class, no redundant or repeated
    # features; only n_inf_value of the 40 features are informative.
    X, y = make_classification(n_samples=2500, n_features=40,
                               n_informative=n_inf_value, n_redundant=0,
                               n_repeated=0, n_classes=2,
                               n_clusters_per_class=1,
                               random_state=RANDOM_SEED, shuffle=False)
    # print('Simulated data - number of informative features = ' + str(n_inf_value))

    sss = StratifiedShuffleSplit(n_splits=100, test_size=0.2,
                                 random_state=RANDOM_SEED)
    training_score_array = np.array([])
    testing_score_array = np.array([])
    for train_index_split, test_index_split in sss.split(X, y):
        X_split_train, X_split_test = X[train_index_split], X[test_index_split]
        y_split_train, y_split_test = y[train_index_split], y[test_index_split]
        # Fit the scaler on the training fold only, then apply it to the
        # test fold, to avoid leaking test statistics into training.
        scaler = StandardScaler()
        X_split_train = scaler.fit_transform(X_split_train)
        X_split_test = scaler.transform(X_split_test)
        # max_iter must be an integer
        lr = LogisticRegression(fit_intercept=True, max_iter=int(1e9),
                                verbose=0, random_state=RANDOM_SEED,
                                solver='lbfgs', tol=1e-6, C=10)
        lr.fit(X_split_train, y_split_train)
        y_pred_train = lr.predict(X_split_train)
        y_pred_test = lr.predict(X_split_test)
        accuracy_train_score = accuracy_score(y_split_train, y_pred_train)
        accuracy_test_score = accuracy_score(y_split_test, y_pred_test)
        training_score_array = np.append(training_score_array, accuracy_train_score)
        testing_score_array = np.append(testing_score_array, accuracy_test_score)
    mean_training_score_array = np.append(mean_training_score_array,
                                          np.average(training_score_array))
    mean_testing_score_array = np.append(mean_testing_score_array,
                                         np.average(testing_score_array))

print('mean_training_score_array=' + str(mean_training_score_array))
print('mean_testing_score_array=' + str(mean_testing_score_array))

plt.plot(n_inf, mean_training_score_array, 'r', label='mean training score')
plt.plot(n_inf, mean_testing_score_array, 'g', label='mean testing score')
plt.xlabel('number of informative features out of 40')
plt.ylabel('accuracy')
plt.legend()
plt.show()
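[A quick diagnostic that could be bolted onto the script above is to check how separated the two classes actually are in feature space as n_informative grows. A minimal sketch, using the same make_classification settings; the "separability ratio" here is just a crude heuristic of my own, not part of the original script:]

import numpy as np
from sklearn.datasets import make_classification

for n_inf_value in [1, 5, 10, 20, 40]:
    X, y = make_classification(n_samples=2500, n_features=40,
                               n_informative=n_inf_value, n_redundant=0,
                               n_repeated=0, n_classes=2,
                               n_clusters_per_class=1, random_state=4,
                               shuffle=False)
    # Distance between the two class means, relative to the average
    # within-class standard deviation: a crude separability measure.
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    spread = 0.5 * (X[y == 0].std(axis=0).mean() + X[y == 1].std(axis=0).mean())
    print(n_inf_value, np.linalg.norm(mu1 - mu0) / spread)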
Dear sklearn users,

I have just checked whether the generated features are independent by computing the covariance and correlation matrices, and it seems they are, so I really do not understand my results. Any idea?

Thanks for your help,
Best regards,
Ben
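[For concreteness, a check along those lines might look like the sketch below. It pools all samples into one covariance/correlation matrix; this is a reconstruction of the kind of check described, not Ben's actual code:]

import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2500, n_features=40, n_informative=10,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1, random_state=4,
                           shuffle=False)

# Pooled covariance and correlation across all samples
# (rowvar=False: columns are the variables/features).
cov = np.cov(X, rowvar=False)
corr = np.corrcoef(X, rowvar=False)

# Largest off-diagonal correlation; values near zero would suggest
# the features are (linearly) uncorrelated.
off_diag = np.abs(corr - np.diag(np.diag(corr)))
print('max |off-diagonal correlation| =', off_diag.max())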
Hi Ben,

I'd recommend you check the code of make_classification to see how the data is actually generated.

Best,
Andy
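[One way to follow that advice, as a sketch: look at the class-conditional correlations rather than the pooled ones. As far as I can tell from the make_classification source, each cluster's informative block is multiplied by a random matrix to introduce within-cluster covariance, so structure can show up within each class even when the pooled matrices look nearly diagonal. Same settings as Ben's script assumed:]

import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2500, n_features=40, n_informative=10,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1, random_state=4,
                           shuffle=False)

# With shuffle=False the informative features are the first columns.
for label in (0, 1):
    corr = np.corrcoef(X[y == label, :10], rowvar=False)
    off_diag = np.abs(corr - np.diag(np.diag(corr)))
    print('class', label,
          '- max |within-class correlation| among informative features:',
          off_diag.max())

[If these within-class correlations are far from zero, the informative features are not independent in the sense that matters for the classifier, even if the pooled correlation matrix looks clean.]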
Participants (2): Andreas Mueller, Benoît Presles