The post How to Hill Climb the Test Set for Machine Learning appeared first on Machine Learning Mastery.

]]>As an approach to machine learning competitions, it is rightfully frowned upon, and most competition platforms impose limitations to prevent it, which is important.

Nevertheless, hill climbing the test set is something that a machine learning practitioner accidentally does as part of participating in a competition. By developing an explicit implementation to hill climb a test set, it helps to better understand how easy it can be to overfit a test dataset by overusing it to evaluate modeling pipelines.

In this tutorial, you will discover how to hill climb the test set for machine learning.

After completing this tutorial, you will know:

- Perfect predictions can be made by hill climbing the test set without even looking at the training dataset.
- How to hill climb the test set for classification and regression tasks.
- We implicitly hill climb the test set when we overuse the test set to evaluate our modeling pipelines.

**Kick-start your project** with my new book Data Preparation for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into five parts; they are:

- Hill Climb the Test Set
- Hill Climbing Algorithm
- How to Implement Hill Climbing
- Hill Climb Diabetes Classification Dataset
- Hill Climb Housing Regression Dataset

Machine learning competitions, like those on Kaggle, provide a complete training dataset as well as just the input for the test set.

The objective for a given competition is to predict target values, such as labels or numerical values for the test set. Solutions are evaluated against the hidden test set target values and scored appropriately. The submission with the best score against the test set wins the competition.

The challenge of a machine learning competition can be framed as an optimization problem. Traditionally, the competition participant acts as the optimization algorithm, exploring different modeling pipelines that result in different sets of predictions, scoring the predictions, then making changes to the pipeline that are expected to result in an improved score.

This process can also be modeled directly with an optimization algorithm where candidate predictions are generated and evaluated without ever looking at the training set.

Generally, this is referred to as hill climbing the test set, as one of the simplest optimization algorithms to implement to solve this problem is the hill climbing algorithm.

Although **hill climbing the test set is rightfully frowned upon** in actual machine learning competitions, it can be an interesting exercise to implement the approach in order to learn about the limitations of the approach and the dangers of overfitting the test set. Additionally, the fact that the test set can be predicted perfectly without ever touching the training dataset often shocks a lot of beginner machine learning practitioners.

Most importantly, we implicitly hill climb the test set when we repeatedly evaluate different modeling pipelines. The risk is that score is improved on the test set at the cost of increased generalization error, i.e. worse performance on the broader problem.

People that run machine learning competitions are well aware of this problem and impose limitations on prediction evaluation to counter it, such as limiting evaluation to one or a few per day and reporting scores on a hidden subset of the test set rather than the entire test set. For more on this, see the papers listed in the further reading section.

Next, let’s look at how we can implement the hill climbing algorithm to optimize predictions for a test set.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

The **hill climbing algorithm** is a very simple optimization algorithm.

It involves generating a candidate solution and evaluating it. This is the starting point that is then incrementally improved until either no further improvement can be achieved or we run out of time, resources, or interest.

New candidate solutions are generated from the existing candidate solution. Typically, this involves making a single change to the candidate solution, evaluating it, and accepting the candidate solution as the new “*current*” solution if it is as good or better than the previous current solution. Otherwise, it is discarded.

We might think that it is a good idea to accept only candidates that have a better score. This is a reasonable approach for many simple problems, although, on more complex problems, it is desirable to accept different candidates with the same score in order to aid the search process to scale flat areas (plateaus) in the feature space.

When hill climbing the test set, a candidate solution is a list of predictions. For a binary classification task, this is a list of 0 and 1 values for the two classes. For a regression task, this is a list of numbers in the range of the target variable.

A modification to a candidate solution for classification would be to select one prediction and flip it from 0 to 1 or 1 to 0. A modification to a candidate solution for regression would be to add Gaussian noise to one value in the list or replace a value in the list with a new value.

Scoring of solutions involves calculating a scoring metric, such as classification accuracy on classification tasks or mean absolute error for a regression task.

Now that we are familiar with the algorithm, let’s implement it.

We will develop our hill climbing algorithm on a synthetic classification task.

First, let’s create a binary classification task with many input variables and 5,000 rows of examples. We can then split the dataset into train and test sets.

The complete example is listed below.

# example of a synthetic dataset. from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split # define dataset X, y = make_classification(n_samples=5000, n_features=20, n_informative=15, n_redundant=5, random_state=1) print(X.shape, y.shape) # split dataset X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Running the example first reports the shape of the created dataset, showing 5,000 rows and 20 input variables.

The dataset is then split into train and test sets with about 3,300 for training and about 1,600 for testing.

(5000, 20) (5000,) (3350, 20) (1650, 20) (3350,) (1650,)

Now we can develop a hill climber.

First, we can create a function that will load, or in this case, define the dataset. We can update this function later when we want to change the dataset.

# load or prepare the classification dataset def load_dataset(): return make_classification(n_samples=5000, n_features=20, n_informative=15, n_redundant=5, random_state=1)

Next, we need a function to evaluate candidate solutions–that is, lists of predictions.

We will use classification accuracy where scores range between 0 for the worst possible solution to 1 for a perfect set of predictions.

# evaluate a set of predictions def evaluate_predictions(y_test, yhat): return accuracy_score(y_test, yhat)

Next, we need a function to create an initial candidate solution.

That is a list of predictions for 0 and 1 class labels, long enough to match the number of examples in the test set, in this case, 1650.

We can use the randint() function to generate random values of 0 and 1.

# create a random set of predictions def random_predictions(n_examples): return [randint(0, 1) for _ in range(n_examples)]

Next, we need a function to create a modified version of a candidate solution.

In this case, this involves selecting one value in the solution and flipping it from 0 to 1 or 1 to 0.

Typically, we make a single change for each new candidate solution during hill climbing, but I have parameterized the function so you can explore making more than one change if you want.

# modify the current set of predictions def modify_predictions(current, n_changes=1): # copy current solution updated = current.copy() for i in range(n_changes): # select a point to change ix = randint(0, len(updated)-1) # flip the class label updated[ix] = 1 - updated[ix] return updated

So far, so good.

Next, we can develop the function that performs the search.

First, an initial solution is created and evaluated by calling the *random_predictions()* function followed by the *evaluate_predictions()* function.

Then we loop for a fixed number of iterations and generate a new candidate by calling *modify_predictions()*, evaluate it, and if the score is as good as or better than the current solution, replace it.

The loop ends when we finish the pre-set number of iterations (chosen arbitrarily) or when a perfect score is achieved, which we know in this case is an accuracy of 1.0 (100 percent).

The function *hill_climb_testset()* below implements this, taking the test set as input and returning the best set of predictions found during the hill climbing.

# run a hill climb for a set of predictions def hill_climb_testset(X_test, y_test, max_iterations): scores = list() # generate the initial solution solution = random_predictions(X_test.shape[0]) # evaluate the initial solution score = evaluate_predictions(y_test, solution) scores.append(score) # hill climb to a solution for i in range(max_iterations): # record scores scores.append(score) # stop once we achieve the best score if score == 1.0: break # generate new candidate candidate = modify_predictions(solution) # evaluate candidate value = evaluate_predictions(y_test, candidate) # check if it is as good or better if value >= score: solution, score = candidate, value print('>%d, score=%.3f' % (i, score)) return solution, scores

That’s all there is to it.

The complete example of hill climbing the test set is listed below.

# example of hill climbing the test set for a classification task from random import randint from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score from matplotlib import pyplot # load or prepare the classification dataset def load_dataset(): return make_classification(n_samples=5000, n_features=20, n_informative=15, n_redundant=5, random_state=1) # evaluate a set of predictions def evaluate_predictions(y_test, yhat): return accuracy_score(y_test, yhat) # create a random set of predictions def random_predictions(n_examples): return [randint(0, 1) for _ in range(n_examples)] # modify the current set of predictions def modify_predictions(current, n_changes=1): # copy current solution updated = current.copy() for i in range(n_changes): # select a point to change ix = randint(0, len(updated)-1) # flip the class label updated[ix] = 1 - updated[ix] return updated # run a hill climb for a set of predictions def hill_climb_testset(X_test, y_test, max_iterations): scores = list() # generate the initial solution solution = random_predictions(X_test.shape[0]) # evaluate the initial solution score = evaluate_predictions(y_test, solution) scores.append(score) # hill climb to a solution for i in range(max_iterations): # record scores scores.append(score) # stop once we achieve the best score if score == 1.0: break # generate new candidate candidate = modify_predictions(solution) # evaluate candidate value = evaluate_predictions(y_test, candidate) # check if it is as good or better if value >= score: solution, score = candidate, value print('>%d, score=%.3f' % (i, score)) return solution, scores # load the dataset X, y = load_dataset() print(X.shape, y.shape) # split dataset into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) print(X_train.shape, X_test.shape, y_train.shape, y_test.shape) # run hill climb yhat, scores = hill_climb_testset(X_test, y_test, 20000) # plot the scores vs iterations pyplot.plot(scores) pyplot.show()

Running the example will run the search for 20,000 iterations or stop if a perfect accuracy is achieved.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we found a perfect set of predictions for the test set in about 12,900 iterations.

Recall that this was achieved without touching the training dataset and without cheating by looking at the test set target values. Instead, we simply optimized a set of numbers.

The lesson here is that repeated evaluation of a modeling pipeline against a test set will do the same thing, using you as the hill climbing optimization algorithm. The solution will be overfit to the test set.

... >8092, score=0.996 >8886, score=0.997 >9202, score=0.998 >9322, score=0.998 >9521, score=0.999 >11046, score=0.999 >12932, score=1.000

A plot is also created of the progress of the optimization.

This can be helpful to see how changes to the optimization algorithm, such as the choice of what to change and how it is changed during the hill climb, impact the convergence of the search.

Now that we are familiar with hill climbing the test set, let’s try the approach on a real dataset.

We will use the diabetes dataset as the basis for exploring hill climbing the test set for a classification problem.

Each record describes the medical details of a female, and the prediction is the onset of diabetes within the next five years.

The dataset has eight input variables and 768 rows of data; the input variables are all numeric and the target has two class labels, e.g. it is a binary classification task.

Below provides a sample of the first five rows of the dataset.

6,148,72,35,0,33.6,0.627,50,1 1,85,66,29,0,26.6,0.351,31,0 8,183,64,0,0,23.3,0.672,32,1 1,89,66,23,94,28.1,0.167,21,0 0,137,40,35,168,43.1,2.288,33,1 ...

We can load the dataset directly using Pandas, as follows.

# load or prepare the classification dataset def load_dataset(): url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv' df = read_csv(url, header=None) data = df.values return data[:, :-1], data[:, -1]

The rest of the code remains unchanged.

This is created so that you can drop in your own binary classification task and try it out.

The complete example is listed below.

# example of hill climbing the test set for the diabetes dataset from random import randint from pandas import read_csv from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score from matplotlib import pyplot # load or prepare the classification dataset def load_dataset(): url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv' df = read_csv(url, header=None) data = df.values return data[:, :-1], data[:, -1] # evaluate a set of predictions def evaluate_predictions(y_test, yhat): return accuracy_score(y_test, yhat) # create a random set of predictions def random_predictions(n_examples): return [randint(0, 1) for _ in range(n_examples)] # modify the current set of predictions def modify_predictions(current, n_changes=1): # copy current solution updated = current.copy() for i in range(n_changes): # select a point to change ix = randint(0, len(updated)-1) # flip the class label updated[ix] = 1 - updated[ix] return updated # run a hill climb for a set of predictions def hill_climb_testset(X_test, y_test, max_iterations): scores = list() # generate the initial solution solution = random_predictions(X_test.shape[0]) # evaluate the initial solution score = evaluate_predictions(y_test, solution) scores.append(score) # hill climb to a solution for i in range(max_iterations): # record scores scores.append(score) # stop once we achieve the best score if score == 1.0: break # generate new candidate candidate = modify_predictions(solution) # evaluate candidate value = evaluate_predictions(y_test, candidate) # check if it is as good or better if value >= score: solution, score = candidate, value print('>%d, score=%.3f' % (i, score)) return solution, scores # load the dataset X, y = load_dataset() print(X.shape, y.shape) # split dataset into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) print(X_train.shape, X_test.shape, y_train.shape, y_test.shape) # run hill climb yhat, scores = hill_climb_testset(X_test, y_test, 5000) # plot the scores vs iterations pyplot.plot(scores) pyplot.show()

Running the example reports the iteration number and accuracy each time an improvement is seen during the search.

We use fewer iterations in this case because it is a simpler problem to optimize as we have fewer predictions to make.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that we achieved perfect accuracy in about 1,500 iterations.

... >617, score=0.961 >627, score=0.965 >650, score=0.969 >683, score=0.972 >743, score=0.976 >803, score=0.980 >817, score=0.984 >945, score=0.988 >1350, score=0.992 >1387, score=0.996 >1565, score=1.000

A line plot of the search progress is also created showing that convergence was rapid.

We will use the housing dataset as the basis for exploring hill climbing the test set regression problem.

The housing dataset involves the prediction of a house price in thousands of dollars given details of the house and its neighborhood.

It is a regression problem, meaning we are predicting a numerical value. There are 506 observations with 13 input variables and one output variable.

A sample of the first five rows is listed below.

0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98,24.00 0.02731,0.00,7.070,0,0.4690,6.4210,78.90,4.9671,2,242.0,17.80,396.90,9.14,21.60 0.02729,0.00,7.070,0,0.4690,7.1850,61.10,4.9671,2,242.0,17.80,392.83,4.03,34.70 0.03237,0.00,2.180,0,0.4580,6.9980,45.80,6.0622,3,222.0,18.70,394.63,2.94,33.40 0.06905,0.00,2.180,0,0.4580,7.1470,54.20,6.0622,3,222.0,18.70,396.90,5.33,36.20 ...

First, we can update the *load_dataset()* function to load the housing dataset.

As part of loading the dataset, we will normalize the target value. This will make hill climbing the predictions simpler as we can limit the floating-point values to range 0 to 1.

This is not required generally, just the approach taken here to simplify the search algorithm.

# load or prepare the classification dataset def load_dataset(): url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv' df = read_csv(url, header=None) data = df.values X, y = data[:, :-1], data[:, -1] # normalize the target scaler = MinMaxScaler() y = y.reshape((len(y), 1)) y = scaler.fit_transform(y) return X, y

Next, we can update the scoring function to use the mean absolute error between the expected and predicted values.

# evaluate a set of predictions def evaluate_predictions(y_test, yhat): return mean_absolute_error(y_test, yhat)

We must also update the representation for a solution from 0 and 1 labels to floating-point values between 0 and 1.

The generation of the initial candidate solution must be changed to create a list of random floats.

# create a random set of predictions def random_predictions(n_examples): return [random() for _ in range(n_examples)]

The single change made to a solution to create a new candidate solution, in this case, involves simply replacing a randomly chosen prediction in the list with a new random float.

I chose this because it was simple.

# modify the current set of predictions def modify_predictions(current, n_changes=1): # copy current solution updated = current.copy() for i in range(n_changes): # select a point to change ix = randint(0, len(updated)-1) # flip the class label updated[ix] = random() return updated

A better approach would be to add Gaussian noise to an existing value, and I leave this to you as an extension. If you try it, let me know in the comments below.

For example:

... # add gaussian noise updated[ix] += gauss(0, 0.1)

Finally, the search must be updated.

The best value is now an error of 0.0, used to stop the search if found.

... # stop once we achieve the best score if score == 0.0: break

We also need to change the search from maximizing the score to now minimize the score.

... # check if it is as good or better if value <= score: solution, score = candidate, value print('>%d, score=%.3f' % (i, score))

The updated search function with both of these changes is listed below.

# run a hill climb for a set of predictions def hill_climb_testset(X_test, y_test, max_iterations): scores = list() # generate the initial solution solution = random_predictions(X_test.shape[0]) # evaluate the initial solution score = evaluate_predictions(y_test, solution) print('>%.3f' % score) # hill climb to a solution for i in range(max_iterations): # record scores scores.append(score) # stop once we achieve the best score if score == 0.0: break # generate new candidate candidate = modify_predictions(solution) # evaluate candidate value = evaluate_predictions(y_test, candidate) # check if it is as good or better if value <= score: solution, score = candidate, value print('>%d, score=%.3f' % (i, score)) return solution, scores

Tying this together, the complete example of hill climbing the test set for a regression task is listed below.

# example of hill climbing the test set for the housing dataset from random import random from random import randint from pandas import read_csv from sklearn.model_selection import train_test_split from sklearn.metrics import mean_absolute_error from sklearn.preprocessing import MinMaxScaler from matplotlib import pyplot # load or prepare the classification dataset def load_dataset(): url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv' df = read_csv(url, header=None) data = df.values X, y = data[:, :-1], data[:, -1] # normalize the target scaler = MinMaxScaler() y = y.reshape((len(y), 1)) y = scaler.fit_transform(y) return X, y # evaluate a set of predictions def evaluate_predictions(y_test, yhat): return mean_absolute_error(y_test, yhat) # create a random set of predictions def random_predictions(n_examples): return [random() for _ in range(n_examples)] # modify the current set of predictions def modify_predictions(current, n_changes=1): # copy current solution updated = current.copy() for i in range(n_changes): # select a point to change ix = randint(0, len(updated)-1) # flip the class label updated[ix] = random() return updated # run a hill climb for a set of predictions def hill_climb_testset(X_test, y_test, max_iterations): scores = list() # generate the initial solution solution = random_predictions(X_test.shape[0]) # evaluate the initial solution score = evaluate_predictions(y_test, solution) print('>%.3f' % score) # hill climb to a solution for i in range(max_iterations): # record scores scores.append(score) # stop once we achieve the best score if score == 0.0: break # generate new candidate candidate = modify_predictions(solution) # evaluate candidate value = evaluate_predictions(y_test, candidate) # check if it is as good or better if value <= score: solution, score = candidate, value print('>%d, score=%.3f' % (i, score)) return solution, scores # load the dataset X, y = load_dataset() print(X.shape, y.shape) # split dataset into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) print(X_train.shape, X_test.shape, y_train.shape, y_test.shape) # run hill climb yhat, scores = hill_climb_testset(X_test, y_test, 100000) # plot the scores vs iterations pyplot.plot(scores) pyplot.show()

Running the example reports the iteration number and MAE each time an improvement is seen during the search.

We use many more iterations in this case because it is a more complex problem to optimize. The chosen method for creating candidate solutions also makes it slower and less likely we will achieve perfect error.

In fact, we would not achieve perfect error; instead, it would be better to stop if error reached a value below a minimum value such as 1e-7 or something meaningful to the target domain. This, too, is left as an exercise for the reader.

For example:

... # stop once we achieve a good enough if score <= 1e-7: break

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that we achieved a good error by the end of the run.

... >95991, score=0.001 >96011, score=0.001 >96295, score=0.001 >96366, score=0.001 >96585, score=0.001 >97575, score=0.001 >98828, score=0.001 >98947, score=0.001 >99712, score=0.001 >99913, score=0.001

A line plot of the search progress is also created showing that convergence was rapid and sits flat for most of the iterations.

This section provides more resources on the topic if you are looking to go deeper.

- Climbing the Kaggle Leaderboard by Exploiting the Log-Loss Oracle, 2018.
- Toward a Better Understanding of Leaderboard, 2017.
- The Ladder: A Reliable Leaderboard for Machine Learning Competitions, 2015.

In this tutorial, you discovered how to hill climb the test set for machine learning.

Specifically, you learned:

- Perfect predictions can be made by hill climbing the test set without even looking at the training dataset.
- How to hill climb the test set for classification and regression tasks.
- We implicitly hill climb the test set when we overuse the test set to evaluate our modeling pipelines.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Hill Climb the Test Set for Machine Learning appeared first on Machine Learning Mastery.

]]>The post How to Train to the Test Set in Machine Learning appeared first on Machine Learning Mastery.

]]>It is a type of overfitting that is common in machine learning competitions where a complete training dataset is provided and where only the input portion of a test set is provided. One approach to **training to the test set** involves constructing a training set that most resembles the test set and then using it as the basis for training a model. The model is expected to have better performance on the test set, but most likely worse performance on the training dataset and on any new data in the future.

Although overfitting the test set is not desirable, it can be interesting to explore as a thought experiment and provide more insight into both machine learning competitions and avoiding overfitting generally.

In this tutorial, you will discover how to intentionally train to the test set for classification and regression problems.

After completing this tutorial, you will know:

- Training to the test set is a type of data leakage that may occur in machine learning competitions.
- One approach to training to the test set involves creating a training dataset that is most similar to a provided test set.
- How to use a KNN model to construct a training dataset and train to the test set with a real dataset.

**Kick-start your project** with my new book Data Preparation for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into three parts; they are:

- Train to the Test Set
- Train to Test Set for Classification
- Train to Test Set for Regression

In applied machine learning, we seek a model that learns the relationship between the input and output variables using the training dataset.

The hope and goal is that we learn a relationship that generalizes to new examples beyond the training dataset. This goal motivates why we use resampling techniques like k-fold cross-validation to estimate the performance of the model when making predictions on data not used during training.

In the case of machine learning competitions, like those on Kaggle, we are given access to the complete training dataset and the inputs of the test dataset and are required to make predictions for the test dataset.

This leads to a possible situation where we may accidentally or choose to train a model to the test set. That is, tune the model behavior to achieve the best performance on the test dataset rather than develop a model that performs well in general, using a technique like k-fold cross-validation.

Another, more overt path to information leakage, can sometimes be seen in machine learning competitions where the training and test set data are given at the same time.

— Page 56, Feature Engineering and Selection: A Practical Approach for Predictive Models, 2019.

**Training to the test set is often a bad idea**.

It is an explicit type of data leakage. Nevertheless, it is an interesting thought experiment.

One approach to training to the test set is to contrive a training dataset that is most similar to the test set. For example, we could discard all rows in the training set that are too different from the test set and only train on those rows in the training set that are maximally similar to rows in the test set.

While the test set data often have the outcome data blinded, it is possible to “train to the test” by only using the training set samples that are most similar to the test set data. This may very well improve the model’s performance scores for this particular test set but might ruin the model for predicting on a broader data set.

— Page 56, Feature Engineering and Selection: A Practical Approach for Predictive Models, 2019.

We would expect the model to overfit the test set, but this is the whole point of this thought experiment.

Let’s explore this approach to training to the test set in this tutorial.

We can use a k-nearest neighbor model to select those instances of the training set that are most similar to the test set. The KNeighborsRegressor and KNeighborsClassifier both provide the kneighbors() function that will return indexes into the training dataset for rows that are most similar to a given data, such as a test set.

... # get the most similar neighbor for each point in the test set neighbor_ix = knn.kneighbors(X_test, 2, return_distance=False) ix = neighbor_ix[:,0]

We might want to try removing duplicates from the selected row indexes.

... # remove duplicate rows ix = unique(ix)

We can then use those row indexes to construct a custom training dataset and fit a model.

... # create a training dataset from selected instances X_train_neigh, y_train_neigh = X_train[ix], y_train[ix]

Given that we are using a KNN model to construct the training set from the test set, we will also use the same type of model to make predictions on the test set. This is not required, but it makes the examples simpler.

Using this approach, we can now experiment with training to the test set for both classification and regression datasets.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

We will use the diabetes dataset as the basis for exploring training for the test set for classification problems.

Each record describes the medical details of a female and the prediction is the onset of diabetes within the next five years.

The dataset has eight input variables and 768 rows of data; the input variables are all numeric and the target has two class labels, e.g. it is a binary classification task.

Below provides a sample of the first five rows of the dataset.

6,148,72,35,0,33.6,0.627,50,1 1,85,66,29,0,26.6,0.351,31,0 8,183,64,0,0,23.3,0.672,32,1 1,89,66,23,94,28.1,0.167,21,0 0,137,40,35,168,43.1,2.288,33,1 ...

First, we can load the dataset directly from the URL, split it into input and output elements, then split the dataset into train and test sets, holding thirty percent back for the test set. We can then evaluate a KNN model with default model hyperparameters by training it on the training set and making predictions on the test set.

The complete example is listed below.

# example of evaluating a knn model on the diabetes classification dataset from pandas import read_csv from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score from sklearn.neighbors import KNeighborsClassifier # load the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv' df = read_csv(url, header=None) data = df.values X, y = data[:, :-1], data[:, -1] print(X.shape, y.shape) # split dataset X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) print(X_train.shape, X_test.shape, y_train.shape, y_test.shape) # define model model = KNeighborsClassifier() # fit model model.fit(X_train, y_train) # make predictions yhat = model.predict(X_test) # evaluate predictions accuracy = accuracy_score(y_test, yhat) print('Accuracy: %.3f' % (accuracy * 100))

Running the example first loads the dataset and summarizes the number of rows and columns, matching our expectations. The shape of the train and test sets are then reported, showing we have about 230 rows in the test set.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Finally, the classification accuracy of the model is reported to be about 77.056 percent.

(768, 8) (768,) (537, 8) (231, 8) (537,) (231,) Accuracy: 77.056

Now, let’s see if we can achieve better performance on the test set by preparing a model that is trained directly for it.

First, we will construct a training dataset using the simpler example in the training set for each row in the test set.

... # select examples that are most similar to the test set knn = KNeighborsClassifier() knn.fit(X_train, y_train) # get the most similar neighbor for each point in the test set neighbor_ix = knn.kneighbors(X_test, 1, return_distance=False) ix = neighbor_ix[:,0] # create a training dataset from selected instances X_train_neigh, y_train_neigh = X_train[ix], y_train[ix] print(X_train_neigh.shape, y_train_neigh.shape)

Next, we will train the model on this new dataset and evaluate it on the test set as we did before.

... # define model model = KNeighborsClassifier() # fit model model.fit(X_train_neigh, y_train_neigh)

The complete example is listed below.

# example of training to the test set for the diabetes dataset from pandas import read_csv from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score from sklearn.neighbors import KNeighborsClassifier # load the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv' df = read_csv(url, header=None) data = df.values X, y = data[:, :-1], data[:, -1] print(X.shape, y.shape) # split dataset X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) print(X_train.shape, X_test.shape, y_train.shape, y_test.shape) # select examples that are most similar to the test set knn = KNeighborsClassifier() knn.fit(X_train, y_train) # get the most similar neighbor for each point in the test set neighbor_ix = knn.kneighbors(X_test, 1, return_distance=False) ix = neighbor_ix[:,0] # create a training dataset from selected instances X_train_neigh, y_train_neigh = X_train[ix], y_train[ix] print(X_train_neigh.shape, y_train_neigh.shape) # define model model = KNeighborsClassifier() # fit model model.fit(X_train_neigh, y_train_neigh) # make predictions yhat = model.predict(X_test) # evaluate predictions accuracy = accuracy_score(y_test, yhat) print('Accuracy: %.3f' % (accuracy * 100))

Running the example, we can see that the reported size of the new training dataset is the same size as the test set, as we expected.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that we have achieved a lift in performance by training to the test set over training the model on the entire training dataset. In this case, we achieved a classification accuracy of about 79.654 percent compared to 77.056 percent when the entire training dataset is used.

(768, 8) (768,) (537, 8) (231, 8) (537,) (231,) (231, 8) (231,) Accuracy: 79.654

You might want to try selecting different numbers of neighbors from the training set for each example in the test set to see if you can achieve better performance.

Also, you might want to try keeping unique row indexes in the training set and see if that makes a difference.

Finally, it might be interesting to hold back a final validation dataset and compare how different “*train-to-the-test-set*” techniques affect performance on the holdout dataset. E.g. see how training to the test set impacts generalization error.

Report your findings in the comments below.

Now that we know how to train to the test set for classification, let’s look at an example for regression.

We will use the housing dataset as the basis for exploring training for the test set for regression problems.

The housing dataset involves the prediction of a house price in thousands of dollars given details of the house and its neighborhood.

It is a regression problem, meaning we are predicting a numerical value. There are 506 observations with 13 input variables and one output variable.

A sample of the first five rows is listed below.

0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98,24.00 0.02731,0.00,7.070,0,0.4690,6.4210,78.90,4.9671,2,242.0,17.80,396.90,9.14,21.60 0.02729,0.00,7.070,0,0.4690,7.1850,61.10,4.9671,2,242.0,17.80,392.83,4.03,34.70 0.03237,0.00,2.180,0,0.4580,6.9980,45.80,6.0622,3,222.0,18.70,394.63,2.94,33.40 0.06905,0.00,2.180,0,0.4580,7.1470,54.20,6.0622,3,222.0,18.70,396.90,5.33,36.20 ...

First, we can load the dataset, split it, and evaluate a KNN model on it directly using the entire training dataset. We will report performance on this regression class using mean absolute error (MAE).

The complete example is listed below.

# example of evaluating a knn model on the housing regression dataset from pandas import read_csv from sklearn.model_selection import train_test_split from sklearn.metrics import mean_absolute_error from sklearn.neighbors import KNeighborsRegressor # load the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv' df = read_csv(url, header=None) data = df.values X, y = data[:, :-1], data[:, -1] print(X.shape, y.shape) # split dataset X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) print(X_train.shape, X_test.shape, y_train.shape, y_test.shape) # define model model = KNeighborsRegressor() # fit model model.fit(X_train, y_train) # make predictions yhat = model.predict(X_test) # evaluate predictions mae = mean_absolute_error(y_test, yhat) print('MAE: %.3f' % mae)

Running the example first loads the dataset and summarizes the number of rows and columns, matching our expectations. The shape of the train and test sets are then reported, showing we have about 150 rows in the test set.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Finally, the MAE of the model is reported to be about 4.488.

(506, 13) (506,) (354, 13) (152, 13) (354,) (152,) MAE: 4.488

Now, let’s see if we can achieve better performance on the test set by preparing a model that is trained to it.

First, we will construct a training dataset using the simpler example in the training set for each row in the test set.

... # select examples that are most similar to the test set knn = KNeighborsClassifier() knn.fit(X_train, y_train) # get the most similar neighbor for each point in the test set neighbor_ix = knn.kneighbors(X_test, 1, return_distance=False) ix = neighbor_ix[:,0] # create a training dataset from selected instances X_train_neigh, y_train_neigh = X_train[ix], y_train[ix] print(X_train_neigh.shape, y_train_neigh.shape)

Next, we will train the model on this new dataset and evaluate it on the test set as we did before.

... # define model model = KNeighborsClassifier() # fit model model.fit(X_train_neigh, y_train_neigh)

The complete example is listed below.

# example of training to the test set for the housing dataset from pandas import read_csv from sklearn.model_selection import train_test_split from sklearn.metrics import mean_absolute_error from sklearn.neighbors import KNeighborsRegressor # load the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv' df = read_csv(url, header=None) data = df.values X, y = data[:, :-1], data[:, -1] print(X.shape, y.shape) # split dataset X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) print(X_train.shape, X_test.shape, y_train.shape, y_test.shape) # select examples that are most similar to the test set knn = KNeighborsRegressor() knn.fit(X_train, y_train) # get the most similar neighbor for each point in the test set neighbor_ix = knn.kneighbors(X_test, 1, return_distance=False) ix = neighbor_ix[:,0] # create a training dataset from selected instances X_train_neigh, y_train_neigh = X_train[ix], y_train[ix] print(X_train_neigh.shape, y_train_neigh.shape) # define model model = KNeighborsRegressor() # fit model model.fit(X_train_neigh, y_train_neigh) # make predictions yhat = model.predict(X_test) # evaluate predictions mae = mean_absolute_error(y_test, yhat) print('MAE: %.3f' % mae)

Running the example, we can see that the reported size of the new training dataset is the same size as the test set, as we expected.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that we have achieved a lift in performance by training to the test set over training the model on the entire training dataset. In this case, we achieved a MAE of about 4.433 compared to 4.488 when the entire training dataset is used.

Again, you might want to explore using a different number of neighbors when constructing the new training set and see if keeping unique rows in the training dataset makes a difference. Report your findings in the comments below.

(506, 13) (506,) (354, 13) (152, 13) (354,) (152,) (152, 13) (152,) MAE: 4.433

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered how to intentionally train to the test set for classification and regression problems.

Specifically, you learned:

- Training to the test set is a type of data leakage that may occur in machine learning competitions.
- One approach to training to the test set involves creating a training dataset that is most similar to a provided test set.
- How to use a KNN model to construct a training dataset and train to the test set with a real dataset.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Train to the Test Set in Machine Learning appeared first on Machine Learning Mastery.

]]>The post How to Selectively Scale Numerical Input Variables for Machine Learning appeared first on Machine Learning Mastery.

]]>It is convenient, and therefore common, to apply the same data transforms, such as standardization and normalization, equally to all input variables. This can achieve good results on many problems. Nevertheless, better results may be achieved by carefully **selecting which data transform to apply to each input variable** prior to modeling.

In this tutorial, you will discover how to apply selective scaling of numerical input variables.

After completing this tutorial, you will know:

- How to load and calculate a baseline predictive performance for the diabetes classification dataset.
- How to evaluate modeling pipelines with data transforms applied blindly to all numerical input variables.
- How to evaluate modeling pipelines with selective normalization and standardization applied to subsets of input variables.

**Kick-start your project** with my new book Data Preparation for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into three parts; they are:

- Diabetes Numerical Dataset
- Non-Selective Scaling of Numerical Inputs
- Normalize All Input Variables
- Standardize All Input Variables

- Selective Scaling of Numerical Inputs
- Normalize Only Non-Gaussian Input Variables
- Standardize Only Gaussian-Like Input Variables
- Selectively Normalize and Standardize Input Variables

As the basis of this tutorial, we will use the so-called “diabetes” dataset that has been widely studied as a machine learning dataset since the 1990s.

The dataset classifies patients’ data as either an onset of diabetes within five years or not. There are 768 examples and eight input variables. It is a binary classification problem.

You can learn more about the dataset here:

- Diabetes Dataset (pima-indians-diabetes.csv)
- Diabetes Dataset Description (pima-indians-diabetes.names)

No need to download the dataset; we will download it automatically as part of the worked examples that follow.

Looking at the data, we can see that all nine input variables are numerical.

6,148,72,35,0,33.6,0.627,50,1 1,85,66,29,0,26.6,0.351,31,0 8,183,64,0,0,23.3,0.672,32,1 1,89,66,23,94,28.1,0.167,21,0 0,137,40,35,168,43.1,2.288,33,1 ...

We can load this dataset into memory using the Pandas library.

The example below downloads and summarizes the diabetes dataset.

# load and summarize the diabetes dataset from pandas import read_csv from pandas.plotting import scatter_matrix from matplotlib import pyplot # Load dataset url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv" dataset = read_csv(url, header=None) # summarize the shape of the dataset print(dataset.shape) # histograms of the variables dataset.hist() pyplot.show()

Running the example first downloads the dataset and loads it as a DataFrame.

The shape of the dataset is printed, confirming the number of rows, and nine variables, eight input, and one target.

(768, 9)

Finally, a plot is created showing a histogram for each variable in the dataset.

This is useful as we can see that some variables have a Gaussian or Gaussian-like distribution (1, 2, 5) and others have an exponential-like distribution (0, 3, 4, 6, 7). This may suggest the need for different numerical data transforms for the different types of input variables.

Now that we are a little familiar with the dataset, let’s try fitting and evaluating a model on the raw dataset.

We will use a logistic regression model as they are a robust and effective linear model for binary classification tasks. We will evaluate the model using repeated stratified k-fold cross-validation, a best practice, and use 10 folds and three repeats.

The complete example is listed below.

# evaluate a logistic regression model on the raw diabetes dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.preprocessing import LabelEncoder from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.linear_model import LogisticRegression # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv' dataframe = read_csv(url, header=None) data = dataframe.values # separate into input and output elements X, y = data[:, :-1], data[:, -1] # minimally prepare dataset X = X.astype('float') y = LabelEncoder().fit_transform(y.astype('str')) # define the model model = LogisticRegression(solver='liblinear') # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model m_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # summarize the result print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example evaluates the model and reports the mean and standard deviation accuracy for fitting a logistic regression model on the raw dataset.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved an accuracy of about 76.8 percent.

Accuracy: 0.768 (0.040)

Now that we have established a baseline in performance on the dataset, let’s see if we can improve the performance using data scaling.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Many algorithms prefer or require that input variables are scaled to a consistent range prior to fitting a model.

This includes the logistic regression model that assumes input variables have a Gaussian probability distribution. It may also provide a more numerically stable model if the input variables are standardized. Nevertheless, even when these expectations are violated, the logistic regression can perform well or best for a given dataset as may be the case for the diabetes dataset.

Two common techniques for scaling numerical input variables are normalization and standardization.

Normalization scales each input variable to the range 0-1 and can be implemented using the MinMaxScaler class in scikit-learn. Standardization scales each input variable to have a mean of 0.0 and a standard deviation of 1.0 and can be implemented using the StandardScaler class in scikit-learn.

To learn more about normalization, standardization, and how to use these methods in scikit-learn, see the tutorial:

A naive approach to data scaling applies a single transform to all input variables, regardless of their scale or probability distribution. And this is often effective.

Let’s try normalizing and standardizing all input variables directly and compare the performance to the baseline logistic regression model fit on the raw data.

We can update the baseline code example to use a modeling pipeline where the first step is to apply a scaler and the final step is to fit the model.

This ensures that the scaling operation is fit or prepared on the training set only and then applied to the train and test sets during the cross-validation process, avoiding data leakage. Data leakage can result in an optimistically biased estimate of model performance.

This can be achieved using the Pipeline class where each step in the pipeline is defined as a tuple with a name and the instance of the transform or model to use.

... # define the modeling pipeline scaler = MinMaxScaler() model = LogisticRegression(solver='liblinear') pipeline = Pipeline([('s',scaler),('m',model)])

Tying this together, the complete example of evaluating a logistic regression on diabetes dataset with all input variables normalized is listed below.

# evaluate a logistic regression model on the normalized diabetes dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.preprocessing import LabelEncoder from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.pipeline import Pipeline from sklearn.linear_model import LogisticRegression from sklearn.preprocessing import MinMaxScaler # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv' dataframe = read_csv(url, header=None) data = dataframe.values # separate into input and output elements X, y = data[:, :-1], data[:, -1] # minimally prepare dataset X = X.astype('float') y = LabelEncoder().fit_transform(y.astype('str')) # define the modeling pipeline model = LogisticRegression(solver='liblinear') scaler = MinMaxScaler() pipeline = Pipeline([('s',scaler),('m',model)]) # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # summarize the result print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy for fitting a logistic regression model on the normalized dataset.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the normalization of the input variables has resulted in a drop in the mean classification accuracy from 76.8 percent with a model fit on the raw data to about 76.4 percent for the pipeline with normalization.

Accuracy: 0.764 (0.045)

Next, let’s try standardizing all input variables.

We can update the modeling pipeline to use standardization instead of normalization for all input variables prior to fitting and evaluating the logistic regression model.

This might be an appropriate transform for those input variables with a Gaussian-like distribution, but perhaps not the other variables.

... # define the modeling pipeline scaler = StandardScaler() model = LogisticRegression(solver='liblinear') pipeline = Pipeline([('s',scaler),('m',model)])

Tying this together, the complete example of evaluating a logistic regression model on diabetes dataset with all input variables standardized is listed below.

# evaluate a logistic regression model on the standardized diabetes dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.preprocessing import LabelEncoder from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.pipeline import Pipeline from sklearn.linear_model import LogisticRegression from sklearn.preprocessing import StandardScaler # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv' dataframe = read_csv(url, header=None) data = dataframe.values # separate into input and output elements X, y = data[:, :-1], data[:, -1] # minimally prepare dataset X = X.astype('float') y = LabelEncoder().fit_transform(y.astype('str')) # define the modeling pipeline scaler = StandardScaler() model = LogisticRegression(solver='liblinear') pipeline = Pipeline([('s',scaler),('m',model)]) # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # summarize the result print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy for fitting a logistic regression model on the standardized dataset.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that standardizing all numerical input variables has resulted in a lift in mean classification accuracy from 76.8 percent with a model evaluated on the raw dataset to about 77.2 percent for a model evaluated on the dataset with standardized input variables.

Accuracy: 0.772 (0.043)

So far, we have learned that normalizing all variables does not help performance, but standardizing all input variables does help performance.

Next, let’s explore if selectively applying scaling to the input variables can offer further improvement.

Data transforms can be applied selectively to input variables using the ColumnTransformer class in scikit-learn.

It allows you to specify the transform (or pipeline of transforms) to apply and the column indexes to apply them to. This can then be used as part of a modeling pipeline and evaluated using cross-validation.

You can learn more about how to use the ColumnTransformer in the tutorial:

We can explore using the ColumnTransformer to selectively apply normalization and standardization to the numerical input variables of the diabetes dataset in order to see if we can achieve further performance improvements.

First, let’s try normalizing just those input variables that do not have a Gaussian-like probability distribution and leave the rest of the input variables alone in the raw state.

We can define two groups of input variables using the column indexes, one for the variables with a Gaussian-like distribution, and one for the input variables with the exponential-like distribution.

... # define column indexes for the variables with "normal" and "exponential" distributions norm_ix = [1, 2, 5] exp_ix = [0, 3, 4, 6, 7]

We can then selectively normalize the “*exp_ix*” group and let the other input variables pass through without any data preparation.

... # define the selective transforms t = [('e', MinMaxScaler(), exp_ix)] selective = ColumnTransformer(transformers=t, remainder='passthrough')

The selective transform can then be used as part of our modeling pipeline.

... # define the modeling pipeline model = LogisticRegression(solver='liblinear') pipeline = Pipeline([('s',selective),('m',model)])

Tying this together, the complete example of evaluating a logistic regression model on data with selective normalization of some input variables is listed below.

# evaluate a logistic regression model on the diabetes dataset with selective normalization from numpy import mean from numpy import std from pandas import read_csv from sklearn.preprocessing import LabelEncoder from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.pipeline import Pipeline from sklearn.linear_model import LogisticRegression from sklearn.preprocessing import MinMaxScaler from sklearn.compose import ColumnTransformer # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv' dataframe = read_csv(url, header=None) data = dataframe.values # separate into input and output elements X, y = data[:, :-1], data[:, -1] # minimally prepare dataset X = X.astype('float') y = LabelEncoder().fit_transform(y.astype('str')) # define column indexes for the variables with "normal" and "exponential" distributions norm_ix = [1, 2, 5] exp_ix = [0, 3, 4, 6, 7] # define the selective transforms t = [('e', MinMaxScaler(), exp_ix)] selective = ColumnTransformer(transformers=t, remainder='passthrough') # define the modeling pipeline model = LogisticRegression(solver='liblinear') pipeline = Pipeline([('s',selective),('m',model)]) # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # summarize the result print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see slightly better performance, increasing mean accuracy with the baseline model fit on the raw dataset with 76.8 percent to about 76.9 with selective normalization of some input variables.

The results are not as good as standardizing all input variables though.

Accuracy: 0.769 (0.043)

We can repeat the experiment from the previous section, although in this case, selectively standardize those input variables that have a Gaussian-like distribution and leave the remaining input variables untouched.

... # define the selective transforms t = [('n', StandardScaler(), norm_ix)] selective = ColumnTransformer(transformers=t, remainder='passthrough')

Tying this together, the complete example of evaluating a logistic regression model on data with selective standardizing of some input variables is listed below.

# evaluate a logistic regression model on the diabetes dataset with selective standardization from numpy import mean from numpy import std from pandas import read_csv from sklearn.preprocessing import LabelEncoder from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.pipeline import Pipeline from sklearn.linear_model import LogisticRegression from sklearn.preprocessing import StandardScaler from sklearn.compose import ColumnTransformer # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv' dataframe = read_csv(url, header=None) data = dataframe.values # separate into input and output elements X, y = data[:, :-1], data[:, -1] # minimally prepare dataset X = X.astype('float') y = LabelEncoder().fit_transform(y.astype('str')) # define column indexes for the variables with "normal" and "exponential" distributions norm_ix = [1, 2, 5] exp_ix = [0, 3, 4, 6, 7] # define the selective transforms t = [('n', StandardScaler(), norm_ix)] selective = ColumnTransformer(transformers=t, remainder='passthrough') # define the modeling pipeline model = LogisticRegression(solver='liblinear') pipeline = Pipeline([('s',selective),('m',model)]) # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # summarize the result print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that we achieved a lift in performance over both the baseline model fit on the raw dataset with 76.8 percent and over the standardization of all input variables that achieved 77.2 percent. With selective standardization, we have achieved a mean accuracy of about 77.3 percent, a modest but measurable bump.

Accuracy: 0.773 (0.041)

The results so far raise the question as to whether we can get a further lift by combining the use of selective normalization and standardization on the dataset at the same time.

This can be achieved by defining both transforms and their respective column indexes for the ColumnTransformer class, with no remaining variables being passed through.

... # define the selective transforms t = [('e', MinMaxScaler(), exp_ix), ('n', StandardScaler(), norm_ix)] selective = ColumnTransformer(transformers=t)

Tying this together, the complete example of evaluating a logistic regression model on data with selective normalization and standardization of the input variables is listed below.

# evaluate a logistic regression model on the diabetes dataset with selective scaling from numpy import mean from numpy import std from pandas import read_csv from sklearn.preprocessing import LabelEncoder from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.pipeline import Pipeline from sklearn.linear_model import LogisticRegression from sklearn.preprocessing import MinMaxScaler from sklearn.preprocessing import StandardScaler from sklearn.compose import ColumnTransformer # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv' dataframe = read_csv(url, header=None) data = dataframe.values # separate into input and output elements X, y = data[:, :-1], data[:, -1] # minimally prepare dataset X = X.astype('float') y = LabelEncoder().fit_transform(y.astype('str')) # define column indexes for the variables with "normal" and "exponential" distributions norm_ix = [1, 2, 5] exp_ix = [0, 3, 4, 6, 7] # define the selective transforms t = [('e', MinMaxScaler(), exp_ix), ('n', StandardScaler(), norm_ix)] selective = ColumnTransformer(transformers=t) # define the modeling pipeline model = LogisticRegression(solver='liblinear') pipeline = Pipeline([('s',selective),('m',model)]) # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # summarize the result print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, interestingly, we can see that we have achieved the same performance as standardizing all input variables with 77.2 percent.

Further, the results suggest that the chosen model performs better when the non-Gaussian like variables are left as-is than being standardized or normalized.

I would not have guessed at this finding, which highlights the importance of careful experimentation.

Accuracy: 0.772 (0.040)

**Can you do better?**

Try other transforms or combinations of transforms and see if you can achieve better results.

Share your findings in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

- Best Results for Standard Machine Learning Datasets
- How to Use the ColumnTransformer for Data Preparation
- How to Use StandardScaler and MinMaxScaler Transforms in Python

In this tutorial, you discovered how to apply selective scaling of numerical input variables.

Specifically, you learned:

- How to load and calculate a baseline predictive performance for the diabetes classification dataset.
- How to evaluate modeling pipelines with data transforms applied blindly to all numerical input variables.
- How to evaluate modeling pipelines with selective normalization and standardization applied to subsets of input variables.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Selectively Scale Numerical Input Variables for Machine Learning appeared first on Machine Learning Mastery.

]]>The post Add Binary Flags for Missing Values for Machine Learning appeared first on Machine Learning Mastery.

]]>A common approach is to replace missing values with a calculated statistic, such as the mean of the column. This allows the dataset to be modeled as per normal but gives no indication to the model that the row original contained missing values.

One approach to address this issue is to include additional binary flag input features that indicate whether a row or a column contained a missing value that was imputed. This additional information may or may not be helpful to the model in predicting the target value.

In this tutorial, you will discover how to **add binary flags for missing values** for modeling.

After completing this tutorial, you will know:

- How to load and evaluate models with statistical imputation on a classification dataset with missing values.
- How to add a flag that indicates if a row has one more missing values and evaluate models with this new feature.
- How to add a flag for each input variable that has missing values and evaluate models with these new features.

**Kick-start your project** with my new book Data Preparation for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Updated Jul/2020**: Fixed bug in the creation of the flag variable.

This tutorial is divided into three parts; they are:

- Imputing the Horse Colic Dataset
- Model With a Binary Flag for Missing Values
- Model With Indicators of All Missing Values

The horse colic dataset describes medical characteristics of horses with colic and whether they lived or died.

There are 300 rows and 26 input variables with one output variable. It is a binary classification prediction task that involves predicting 1 if the horse lived and 2 if the horse died.

There are many fields we could select to predict in this dataset. In this case, we will predict whether the problem was surgical or not (column index 23), making it a binary classification problem.

The dataset has numerous missing values for many of the columns where each missing value is marked with a question mark character (“?”).

Below provides an example of rows from the dataset with marked missing values.

2,1,530101,38.50,66,28,3,3,?,2,5,4,4,?,?,?,3,5,45.00,8.40,?,?,2,2,11300,00000,00000,2 1,1,534817,39.2,88,20,?,?,4,1,3,4,2,?,?,?,4,2,50,85,2,2,3,2,02208,00000,00000,2 2,1,530334,38.30,40,24,1,1,3,1,3,3,1,?,?,?,1,1,33.00,6.70,?,?,1,2,00000,00000,00000,1 1,9,5290409,39.10,164,84,4,1,6,2,2,4,4,1,2,5.00,3,?,48.00,7.20,3,5.30,2,1,02208,00000,00000,1 ...

You can learn more about the dataset here:

No need to download the dataset as we will download it automatically in the worked examples.

Marking missing values with a NaN (not a number) value in a loaded dataset using Python is a best practice.

We can load the dataset using the read_csv() Pandas function and specify the “*na_values*” to load values of ‘?’ as missing, marked with a NaN value.

The example below downloads the dataset, marks “?” values as NaN (missing) and summarizes the shape of the dataset.

# summarize the horse colic dataset from pandas import read_csv # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv' dataframe = read_csv(url, header=None, na_values='?') data = dataframe.values # split into input and output elements ix = [i for i in range(data.shape[1]) if i != 23] X, y = data[:, ix], data[:, 23] print(X.shape, y.shape)

Running the example downloads the dataset and reports the number of rows and columns, matching our expectations.

(300, 27) (300,)

Next, we can evaluate a model on this dataset.

We can use the SimpleImputer class to perform statistical imputation and replace the missing values with the mean of each column. We can then fit a random forest model on the dataset.

For more on how to use the SimpleImputer class, see the tutorial:

To achieve this, we will define a pipeline that first performs imputation, then fits the model and evaluates this modeling pipeline using repeated stratified k-fold cross-validation with three repeats and 10 folds.

The complete example is listed below.

# evaluate mean imputation and random forest for the horse colic dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.ensemble import RandomForestClassifier from sklearn.impute import SimpleImputer from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.pipeline import Pipeline # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv' dataframe = read_csv(url, header=None, na_values='?') # split into input and output elements data = dataframe.values ix = [i for i in range(data.shape[1]) if i != 23] X, y = data[:, ix], data[:, 23] # define modeling pipeline model = RandomForestClassifier() imputer = SimpleImputer() pipeline = Pipeline(steps=[('i', imputer), ('m', model)]) # define model evaluation cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate model scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1) print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the random forest with mean statistical imputation on the horse colic dataset.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the pipeline achieved an estimated classification accuracy of about 86.2 percent.

Mean Accuracy: 0.862 (0.056)

Next, let’s see if we can improve the performance of the model by providing more information about missing values.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

In the previous section, we replaced missing values with a calculated statistic.

The model is unaware that missing values were replaced.

It is possible that knowledge of whether a row contains a missing value or not will be useful to the model when making a prediction.

One approach to exposing the model to this knowledge is by providing an additional column that is a binary flag indicating whether the row had a missing value or not.

- 0: Row does not contain a missing value.
- 1: Row contains a missing value (which was/will be imputed).

This can be achieved directly on the loaded dataset. First, we can sum the values for each row to create a new column where if the row contains at least one NaN, then the sum will be a NaN.

We can then mark all values in the new column as 1 if they contain a NaN, or 0 otherwise.

Finally, we can add this column to the loaded dataset.

Tying this together, the complete example of adding a binary flag to indicate one or more missing values in each row is listed below.

# add a binary flag that indicates if a row contains a missing value from numpy import isnan from numpy import hstack from pandas import read_csv # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv' dataframe = read_csv(url, header=None, na_values='?') # split into input and output elements data = dataframe.values ix = [i for i in range(data.shape[1]) if i != 23] X, y = data[:, ix], data[:, 23] print(X.shape) # sum each row where rows with a nan will sum to nan a = X.sum(axis=1) # mark all non-nan as 0 a[~isnan(a)] = 0 # mark all nan as 1 a[isnan(a)] = 1 a = a.reshape((len(a), 1)) # add to the dataset as another column X = hstack((X, a)) print(X.shape)

Running the example first downloads the dataset and reports the number of rows and columns, as expected.

Then the new binary variable indicating whether a row contains a missing value is created and added to the end of the input variables. The shape of the input data is then reported, confirming the addition of the feature, from 27 to 28 columns.

(300, 27) (300, 28)

We can then evaluate the model as we did in the previous section with the additional binary flag and see if it impacts model performance.

The complete example is listed below.

# evaluate model performance with a binary flag for missing values and imputed missing from numpy import isnan from numpy import hstack from numpy import mean from numpy import std from pandas import read_csv from sklearn.ensemble import RandomForestClassifier from sklearn.impute import SimpleImputer from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.pipeline import Pipeline # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv' dataframe = read_csv(url, header=None, na_values='?') # split into input and output elements data = dataframe.values ix = [i for i in range(data.shape[1]) if i != 23] X, y = data[:, ix], data[:, 23] # sum each row where rows with a nan will sum to nan a = X.sum(axis=1) # mark all non-nan as 0 a[~isnan(a)] = 0 # mark all nan as 1 a[isnan(a)] = 1 a = a.reshape((len(a), 1)) # add to the dataset as another column X = hstack((X, a)) # define modeling pipeline model = RandomForestClassifier() imputer = SimpleImputer() pipeline = Pipeline(steps=[('i', imputer), ('m', model)]) # define model evaluation cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate model scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1) print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example reports the mean and standard deviation classification accuracy on the horse colic dataset with the additional feature and imputation.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we see a modest lift in performance from 86.2 percent to 86.3 percent. The difference is small and may not be statistically significant.

Mean Accuracy: 0.863 (0.055)

Most rows in this dataset have a missing value, and this approach might be more beneficial on datasets with fewer missing values.

Next, let’s see if we can provide even more information about the missing values to the model.

In the previous section, we added one additional column to indicate whether a row contains a missing value or not.

One step further is to indicate whether each input value was missing and imputed or not. This effectively adds one additional column for each input variable that contains missing values and may offer benefit to the model.

This can be achieved by setting the “*add_indicator*” argument to *True* when defining the SimpleImputer instance.

... # impute and mark missing values X = SimpleImputer(add_indicator=True).fit_transform(X)

We can demonstrate this with a worked example.

The example below loads the horse colic dataset as before, then imputes the missing values on the entire dataset and adds indicators variables for each input variable that has missing values

# impute and add indicators for columns with missing values from pandas import read_csv from sklearn.impute import SimpleImputer # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv' dataframe = read_csv(url, header=None, na_values='?') data = dataframe.values # split into input and output elements ix = [i for i in range(data.shape[1]) if i != 23] X, y = data[:, ix], data[:, 23] print(X.shape) # impute and mark missing values X = SimpleImputer(strategy='mean', add_indicator=True).fit_transform(X) print(X.shape)

Running the example first downloads and summarizes the shape of the dataset as expected, then applies the imputation and adds the binary (1 and 0 values) columns indicating whether each row contains a missing value for a given input variable.

We can see that the number of input variables has increased from 27 to 48, indicating the addition of 21 binary input variables, and in turn, that 21 of the 27 input variables must contain at least one missing value.

(300, 27) (300, 48)

Next, we can evaluate the model with this additional information.

The complete example below demonstrates this.

# evaluate imputation with added indicators features on the horse colic dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.ensemble import RandomForestClassifier from sklearn.impute import SimpleImputer from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.pipeline import Pipeline # load dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv' dataframe = read_csv(url, header=None, na_values='?') # split into input and output elements data = dataframe.values ix = [i for i in range(data.shape[1]) if i != 23] X, y = data[:, ix], data[:, 23] # define modeling pipeline model = RandomForestClassifier() imputer = SimpleImputer(add_indicator=True) pipeline = Pipeline(steps=[('i', imputer), ('m', model)]) # define model evaluation cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate model scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1) print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example reports the mean and standard deviation classification accuracy on the horse colic dataset with the additional indicators features and imputation.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we see a nice lift in performance from 86.3 percent in the previous section to 86.7 percent.

This may provide strong evidence that adding one flag per column that was inputted is a better strategy on this dataset and chosen model.

Mean Accuracy: 0.867 (0.055)

This section provides more resources on the topic if you are looking to go deeper.

- Best Results for Standard Machine Learning Datasets
- Statistical Imputation for Missing Values in Machine Learning
- How to Handle Missing Data with Python

In this tutorial, you discovered how to add binary flags for missing values for modeling.

Specifically, you learned:

- How to load and evaluate models with statistical imputation on a classification dataset with missing values.
- How to add a flag that indicates if a row has one more missing values and evaluate models with this new feature.
- How to add a flag for each input variable that has missing values and evaluate models with these new features.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Add Binary Flags for Missing Values for Machine Learning appeared first on Machine Learning Mastery.

]]>The post How to Create Custom Data Transforms for Scikit-Learn appeared first on Machine Learning Mastery.

]]>There are many simple data cleaning operations, such as removing outliers and removing columns with few observations, that are often performed manually to the data, requiring custom code.

The scikit-learn library provides a way to wrap these **custom data transforms** in a standard way so they can be used just like any other transform, either on data directly or as a part of a modeling pipeline.

In this tutorial, you will discover how to define and use custom data transforms for scikit-learn.

After completing this tutorial, you will know:

- That custom data transforms can be created for scikit-learn using the FunctionTransformer class.
- How to develop and apply a custom transform to remove columns with few unique values.
- How to develop and apply a custom transform that replaces outliers for each column.

**Kick-start your project** with my new book Data Preparation for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into four parts; they are:

- Custom Data Transforms in Scikit-Learn
- Oil Spill Dataset
- Custom Transform to Remove Columns
- Custom Transform to Replace Outliers

Data preparation refers to changing the raw data in some way that makes it more appropriate for predictive modeling with machine learning algorithms.

The scikit-learn Python machine learning library offers many different data preparation techniques directly, such as techniques for scaling numerical input variables and changing the probability distribution of variables.

These transforms can be fit and then applied on a dataset or used as part of a predictive modeling pipeline, allowing a sequence of transforms to be applied correctly without data leakage when evaluating model performance with data sampling techniques, such as k-fold cross-validation.

Although the data preparation techniques available in scikit-learn are extensive, there may be additional data preparation steps that are required.

Typically, these additional steps are performed manually prior to modeling and require writing custom code. The risk is that these data preparation steps may be performed inconsistently.

The solution is to create a custom data transform in scikit-learn using the FunctionTransformer class.

This class allows you to specify a function that is called to transform the data. You can define the function and perform any valid change, such as changing values or removing columns of data (not removing rows).

The class can then be used just like any other data transform in scikit-learn, e.g. to transform data directly, or used in a modeling pipeline.

The catch is that **the transform is stateless**, meaning that no state can be kept.

This means that the transform cannot be used to calculate statistics on the training dataset that are then used to transform the train and test datasets.

In addition to custom scaling operations, this can be helpful for standard data cleaning operations, such as identifying and removing columns with few unique values and identifying and removing relative outliers.

We will explore both of these cases, but first, let’s define a dataset that we can use as the basis for exploration.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

The so-called “oil spill” dataset is a standard machine learning dataset.

The task involves predicting whether a patch contains an oil spill or not, e.g. from the illegal or accidental dumping of oil in the ocean, given a vector that describes the contents of a patch of a satellite image.

There are 937 cases. Each case is composed of 48 numerical computer vision derived features, a patch number, and a class label.

The normal case is no oil spill assigned the class label of 0, whereas an oil spill is indicated by a class label of 1. There are 896 cases for no oil spill and 41 cases of an oil spill.

You can access the entire dataset here:

Review the contents of the file.

The first few lines of the file should look as follows:

1,2558,1506.09,456.63,90,6395000,40.88,7.89,29780,0.19,214.7,0.21,0.26,0.49,0.1,0.4,99.59,32.19,1.84,0.16,0.2,87.65,0,0.47,132.78,-0.01,3.78,0.22,3.2,-3.71,-0.18,2.19,0,2.19,310,16110,0,138.68,89,69,2850,1000,763.16,135.46,3.73,0,33243.19,65.74,7.95,1 2,22325,79.11,841.03,180,55812500,51.11,1.21,61900,0.02,901.7,0.02,0.03,0.11,0.01,0.11,6058.23,4061.15,2.3,0.02,0.02,87.65,0,0.58,132.78,-0.01,3.78,0.84,7.09,-2.21,0,0,0,0,704,40140,0,68.65,89,69,5750,11500,9593.48,1648.8,0.6,0,51572.04,65.73,6.26,0 3,115,1449.85,608.43,88,287500,40.42,7.34,3340,0.18,86.1,0.21,0.32,0.5,0.17,0.34,71.2,16.73,1.82,0.19,0.29,87.65,0,0.46,132.78,-0.01,3.78,0.7,4.79,-3.36,-0.23,1.95,0,1.95,29,1530,0.01,38.8,89,69,1400,250,150,45.13,9.33,1,31692.84,65.81,7.84,1 4,1201,1562.53,295.65,66,3002500,42.4,7.97,18030,0.19,166.5,0.21,0.26,0.48,0.1,0.38,120.22,33.47,1.91,0.16,0.21,87.65,0,0.48,132.78,-0.01,3.78,0.84,6.78,-3.54,-0.33,2.2,0,2.2,183,10080,0,108.27,89,69,6041.52,761.58,453.21,144.97,13.33,1,37696.21,65.67,8.07,1 5,312,950.27,440.86,37,780000,41.43,7.03,3350,0.17,232.8,0.15,0.19,0.35,0.09,0.26,289.19,48.68,1.86,0.13,0.16,87.65,0,0.47,132.78,-0.01,3.78,0.02,2.28,-3.44,-0.44,2.19,0,2.19,45,2340,0,14.39,89,69,1320.04,710.63,512.54,109.16,2.58,0,29038.17,65.66,7.35,0 ...

We can see that the first column contains integers for the patch number. We can also see that the computer vision derived features are real-valued with differing scales, such as thousands in the second column and fractions in other columns.

This dataset contains columns with very few unique values and columns with outliers that provide a good basis for data cleaning.

The example below downloads the dataset and loads it as a numPy array and summarizes the number of rows and columns.

# load the oil dataset from pandas import read_csv # define the location of the dataset path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv' # load the dataset df = read_csv(path, header=None) # split data into inputs and outputs data = df.values X = data[:, :-1] y = data[:, -1] print(X.shape, y.shape)

Running the example loads the dataset and confirms the expected number of rows and columns.

(937, 49) (937,)

Now that we have a dataset that we can use as the basis for data transforms, let’s look at how we can define some custom data cleaning transforms using the *FunctionTransformer* class.

Columns that have few unique values are probably not contributing anything useful to predicting the target value.

This is not absolutely true, but it is true enough that you should test the performance of your model fit on a dataset with columns of this type removed.

This is a type of data cleaning, and there is a data transform provided in scikit-learn called the VarianceThreshold that attempts to address this using the variance of each column.

Another approach is to remove columns that have fewer than a specified number of unique values, such as 1.

We can develop a function that applies this transform and use the minimum number of unique values as a configurable default argument. We will also add some debugging to confirm it is working as we expect.

First, the number of unique values for each column can be calculated. Ten columns with equal or fewer than the minimum number of unique values can be identified. Finally, those identified columns can be removed from the dataset.

The *cust_transform()* function below implements this.

# remove columns with few unique values def cust_transform(X, min_values=1, verbose=True): # get number of unique values for each column counts = [len(unique(X[:, i])) for i in range(X.shape[1])] if verbose: print('Unique Values: %s' % counts) # select columns to delete to_del = [i for i,v in enumerate(counts) if v <= min_values] if verbose: print('Deleting: %s' % to_del) if len(to_del) is 0: return X # select all but the columns that are being removed ix = [i for i in range(X.shape[1]) if i not in to_del] result = X[:, ix] return result

We can then use this function in the FunctionTransformer.

A limitation of this transform is that it selects columns to delete based on the provided data. This means if a train and test dataset differ greatly, then it is possible for different columns to be removed from each, making model evaluation challenging (*unstable*!?). As such, it is best to keep the minimum number of unique values small, such as 1.

We can use this transform on the oil spill dataset. The complete example is listed below.

# custom data transform for removing columns with few unique values from numpy import unique from pandas import read_csv from sklearn.preprocessing import FunctionTransformer from sklearn.preprocessing import LabelEncoder # load a dataset def load_dataset(path): # load the dataset df = read_csv(path, header=None) data = df.values # split data into inputs and outputs X, y = data[:, :-1], data[:, -1] # minimally prepare dataset X = X.astype('float') y = LabelEncoder().fit_transform(y.astype('str')) return X, y # remove columns with few unique values def cust_transform(X, min_values=1, verbose=True): # get number of unique values for each column counts = [len(unique(X[:, i])) for i in range(X.shape[1])] if verbose: print('Unique Values: %s' % counts) # select columns to delete to_del = [i for i,v in enumerate(counts) if v <= min_values] if verbose: print('Deleting: %s' % to_del) if len(to_del) is 0: return X # select all but the columns that are being removed ix = [i for i in range(X.shape[1]) if i not in to_del] result = X[:, ix] return result # define the location of the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv' # load the dataset X, y = load_dataset(url) print(X.shape, y.shape) # define the transformer trans = FunctionTransformer(cust_transform) # apply the transform X = trans.fit_transform(X) # summarize new shape print(X.shape)

Running the example first reports the number of rows and columns in the raw dataset.

Next, a list is printed that shows the number of unique values observed for each column in the dataset. We can see that many columns have very few unique values.

The columns with one (or fewer) unique values are then identified and reported. In this case, column index 22. This column is removed from the dataset.

Finally, the shape of the transformed dataset is reported, showing 48 instead of 49 columns, confirming that the column with a single unique value was deleted.

(937, 49) (937,) Unique Values: [238, 297, 927, 933, 179, 375, 820, 618, 561, 57, 577, 59, 73, 107, 53, 91, 893, 810, 170, 53, 68, 9, 1, 92, 9, 8, 9, 308, 447, 392, 107, 42, 4, 45, 141, 110, 3, 758, 9, 9, 388, 220, 644, 649, 499, 2, 937, 169, 286] Deleting: [22] (937, 48)

There are many extensions you could explore to this transform, such as:

- Ensure that it is only applied to numerical input variables.
- Experiment with a different minimum number of unique values.
- Use a percentage rather than an absolute number of unique values.

If you explore any of these extensions, let me know in the comments below.

Next, let’s look at a transform that replaces values in the dataset.

Outliers are observations that are different or unlike the other observations.

If we consider one variable at a time, an outlier would be a value that is far from the center of mass (the rest of the values), meaning it is rare or has a low probability of being observed.

There are standard ways for identifying outliers for common probability distributions. For Gaussian data, we can identify outliers as observations that are three or more standard deviations from the mean.

This may or may not be a desirable way to identify outliers for data that has many input variables, yet can be effective in some cases.

We can identify outliers in this way and replace their value with a correction, such as the mean.

Each column is considered one at a time and mean and standard deviation statistics are calculated. Using these statistics, upper and lower bounds of “*normal*” values are defined, then all values that fall outside these bounds can be identified. If one or more outliers are identified, their values are then replaced with the mean value that was already calculated.

The *cust_transform()* function below implements this as a function applied to the dataset, where we parameterize the number of standard deviations from the mean and whether or not debug information will be displayed.

# replace outliers def cust_transform(X, n_stdev=3, verbose=True): # copy the array result = X.copy() # enumerate each column for i in range(result.shape[1]): # retrieve values for column col = X[:, i] # calculate statistics mu, sigma = mean(col), std(col) # define bounds lower, upper = mu-(sigma*n_stdev), mu+(sigma*n_stdev) # select indexes that are out of bounds ix = where(logical_or(col < lower, col > upper))[0] if verbose and len(ix) > 0: print('>col=%d, outliers=%d' % (i, len(ix))) # replace values result[ix, i] = mu return result

We can then use this function in the FunctionTransformer.

The method of outlier detection assumes a Gaussian probability distribution and applies to each variable independently, both of which are strong assumptions.

An additional limitation of this implementation is that the mean and standard deviation statistics are calculated on the provided dataset, meaning that the definition of an outlier and its replacement value are both relative to the dataset. This means that different definitions of outliers and different replacement values could be used if the transform is used on the train and test sets.

We can use this transform on the oil spill dataset. The complete example is listed below.

# custom data transform for replacing outliers from numpy import mean from numpy import std from numpy import where from numpy import logical_or from pandas import read_csv from sklearn.preprocessing import FunctionTransformer from sklearn.preprocessing import LabelEncoder # load a dataset def load_dataset(path): # load the dataset df = read_csv(path, header=None) data = df.values # split data into inputs and outputs X, y = data[:, :-1], data[:, -1] # minimally prepare dataset X = X.astype('float') y = LabelEncoder().fit_transform(y.astype('str')) return X, y # replace outliers def cust_transform(X, n_stdev=3, verbose=True): # copy the array result = X.copy() # enumerate each column for i in range(result.shape[1]): # retrieve values for column col = X[:, i] # calculate statistics mu, sigma = mean(col), std(col) # define bounds lower, upper = mu-(sigma*n_stdev), mu+(sigma*n_stdev) # select indexes that are out of bounds ix = where(logical_or(col < lower, col > upper))[0] if verbose and len(ix) > 0: print('>col=%d, outliers=%d' % (i, len(ix))) # replace values result[ix, i] = mu return result # define the location of the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv' # load the dataset X, y = load_dataset(url) print(X.shape, y.shape) # define the transformer trans = FunctionTransformer(cust_transform) # apply the transform X = trans.fit_transform(X) # summarize new shape print(X.shape)

Running the example first reports the shape of the dataset prior to any change.

Next, the number of outliers for each column is calculated and only those columns with one or more outliers are reported in the output. We can see that a total of 32 columns in the dataset have one or more outliers.

The outliers are then removed and the shape of the resulting dataset is reported, confirming no change in the number of rows or columns.

(937, 49) (937,) >col=0, outliers=10 >col=1, outliers=8 >col=3, outliers=8 >col=5, outliers=7 >col=6, outliers=1 >col=7, outliers=12 >col=8, outliers=15 >col=9, outliers=14 >col=10, outliers=19 >col=11, outliers=17 >col=12, outliers=22 >col=13, outliers=2 >col=14, outliers=16 >col=15, outliers=8 >col=16, outliers=8 >col=17, outliers=6 >col=19, outliers=12 >col=20, outliers=20 >col=27, outliers=14 >col=28, outliers=18 >col=29, outliers=2 >col=30, outliers=13 >col=32, outliers=3 >col=34, outliers=14 >col=35, outliers=15 >col=37, outliers=13 >col=40, outliers=18 >col=41, outliers=13 >col=42, outliers=12 >col=43, outliers=12 >col=44, outliers=19 >col=46, outliers=21 (937, 49)

There are many extensions you could explore to this transform, such as:

- Ensure that it is only applied to numerical input variables.
- Experiment with a different number of standard deviations from the mean, such as 2 or 4.
- Use a different definition of outlier, such as the IQR or a model.

If you explore any of these extensions, let me know in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

- How to Remove Outliers for Machine Learning
- How to Perform Data Cleaning for Machine Learning with Python

In this tutorial, you discovered how to define and use custom data transforms for scikit-learn.

Specifically, you learned:

- That custom data transforms can be created for scikit-learn using the FunctionTransformer class.
- How to develop and apply a custom transform to remove columns with few unique values.
- How to develop and apply a custom transform that replaces outliers for each column.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Create Custom Data Transforms for Scikit-Learn appeared first on Machine Learning Mastery.

]]>The post How to Grid Search Data Preparation Techniques appeared first on Machine Learning Mastery.

]]>The most common approach to data preparation is to study a dataset and review the expectations of a machine learning algorithms, then carefully choose the most appropriate data preparation techniques to transform the raw data to best meet the expectations of the algorithm. This is slow, expensive, and requires a vast amount of expertise.

An alternative approach to data preparation is to grid search a suite of common and commonly useful data preparation techniques to the raw data. This is an alternative philosophy for data preparation that **treats data transforms as another hyperparameter** of the modeling pipeline to be searched and tuned.

This approach requires less expertise than the traditional manual approach to data preparation, although it is computationally costly. The benefit is that it can aid in the discovery of non-intuitive data preparation solutions that achieve good or best performance for a given predictive modeling problem.

In this tutorial, you will discover how to use the grid search approach for data preparation with tabular data.

After completing this tutorial, you will know:

- Grid search provides an alternative approach to data preparation for tabular data, where transforms are tried as hyperparameters of the modeling pipeline.
- How to use the grid search method for data preparation to improve model performance over a baseline for a standard classification dataset.
- How to grid search sequences of data preparation methods to further improve model performance.

**Kick-start your project** with my new book Data Preparation for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into three parts; they are:

- Grid Search Technique for Data Preparation
- Dataset and Performance Baseline
- Wine Classification Dataset
- Baseline Model Performance

- Grid Search Approach to Data Preparation

Data preparation can be challenging.

The approach that is most often prescribed and followed is to analyze the dataset, review the requirements of the algorithms, and transform the raw data to best meet the expectations of the algorithms.

This can be effective but is also slow and can require deep expertise with data analysis and machine learning algorithms.

An alternative approach is to treat the preparation of input variables as a hyperparameter of the modeling pipeline and to tune it along with the choice of algorithm and algorithm configurations.

This might be a data transform that “*should not work*” or “*should not be appropriate for the algorithm*” yet results in good or great performance. Alternatively, it may be the absence of a data transform for an input variable that is deemed “*absolutely required*” yet results in good or great performance.

This can be achieved by designing a **grid search of data preparation techniques** and/or sequences of data preparation techniques in pipelines. This may involve evaluating each on a single chosen machine learning algorithm, or on a suite of machine learning algorithms.

The benefit of this approach is that it always results in suggestions of modeling pipelines that give good relative results. Most importantly, it can unearth the non-obvious and unintuitive solutions to practitioners without the need for deep expertise.

We can explore this approach to data preparation with a worked example.

Before we dive into a worked example, let’s first select a standard dataset and develop a baseline in performance.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

In this section, we will first select a standard machine learning dataset and establish a baseline in performance on this dataset. This will provide the context for exploring the grid search method of data preparation in the next section.

We will use the wine classification dataset.

This dataset has 13 input variables that describe the chemical composition of samples of wine and requires that the wine be classified as one of three types.

You can learn more about the dataset here:

No need to download the dataset as we will download it automatically as part of our worked examples.

Open the dataset and review the raw data. The first few rows of data are listed below.

We can see that it is a multi-class classification predictive modeling problem with numerical input variables, each of which has different scales.

14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065,1 13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050,1 13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185,1 14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480,1 13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735,1 ...

The example below loads the dataset and splits it into the input and output columns, then summarizes the data arrays.

# example of loading and summarizing the wine dataset from pandas import read_csv # define the location of the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv' # load the dataset as a data frame df = read_csv(url, header=None) # retrieve the numpy array data = df.values # split the columns into input and output variables X, y = data[:, :-1], data[:, -1] # summarize the shape of the loaded data print(X.shape, y.shape)

Running the example, we can see that the dataset was loaded correctly and that there are 179 rows of data with 13 input variables and a single target variable.

(178, 13) (178,)

Next, let’s evaluate a model on this dataset and establish a baseline in performance.

We can establish a baseline in performance on the wine classification task by evaluating a model on the raw input data.

In this case, we will evaluate a logistic regression model.

First, we can define a function to load the dataset and perform some minimal data preparation to ensure the inputs are numeric and the target is label encoded.

# prepare the dataset def load_dataset(): # load the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv' df = read_csv(url, header=None) data = df.values X, y = data[:, :-1], data[:, -1] # minimally prepare dataset X = X.astype('float') y = LabelEncoder().fit_transform(y.astype('str')) return X, y

We will evaluate the model using the gold standard of repeated stratified k-fold cross-validation with 10 folds and three repeats.

# evaluate a model def evaluate_model(X, y, model): # define the cross-validation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate model scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores

We can then call the function to load the dataset, define our model, then evaluate it, reporting the mean and standard deviation accuracy.

... # get the dataset X, y = load_dataset() # define the model model = LogisticRegression(solver='liblinear') # evaluate the model scores = evaluate_model(X, y, model) # report performance print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Tying this together, the complete example of evaluating a logistic regression model on the raw wine classification dataset is listed below.

# baseline model performance on the wine dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.preprocessing import LabelEncoder from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.model_selection import cross_val_score from sklearn.linear_model import LogisticRegression # prepare the dataset def load_dataset(): # load the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv' df = read_csv(url, header=None) data = df.values X, y = data[:, :-1], data[:, -1] # minimally prepare dataset X = X.astype('float') y = LabelEncoder().fit_transform(y.astype('str')) return X, y # evaluate a model def evaluate_model(X, y, model): # define the cross-validation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate model scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores # get the dataset X, y = load_dataset() # define the model model = LogisticRegression(solver='liblinear') # evaluate the model scores = evaluate_model(X, y, model) # report performance print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the model performance and reports the mean and standard deviation classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the logistic regression model fit on the raw input data achieved the average classification accuracy of about 95.3 percent, providing a baseline in performance.

Accuracy: 0.953 (0.048)

Next, let’s explore whether we can improve the performance using the grid-search-based approach to data preparation.

In this section, we can explore whether we can improve performance using the grid search approach to data preparation.

The first step is to define a series of modeling pipelines to evaluate, where each pipeline defines one (or more) data preparation techniques and ends with a model that takes the transformed data as input.

We will define a function to create these pipelines as a list of tuples, where each tuple defines the short name for the pipeline and the pipeline itself. We will evaluate a range of different data scaling methods (e.g. MinMaxScaler and StandardScaler), distribution transforms (QuantileTransformer and KBinsDiscretizer), as well as dimensionality reduction transforms (PCA and SVD).

# get modeling pipelines to evaluate def get_pipelines(model): pipelines = list() # normalize p = Pipeline([('s',MinMaxScaler()), ('m',model)]) pipelines.append(('norm', p)) # standardize p = Pipeline([('s',StandardScaler()), ('m',model)]) pipelines.append(('std', p)) # quantile p = Pipeline([('s',QuantileTransformer(n_quantiles=100, output_distribution='normal')), ('m',model)]) pipelines.append(('quan', p)) # discretize p = Pipeline([('s',KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')), ('m',model)]) pipelines.append(('kbins', p)) # pca p = Pipeline([('s',PCA(n_components=7)), ('m',model)]) pipelines.append(('pca', p)) # svd p = Pipeline([('s',TruncatedSVD(n_components=7)), ('m',model)]) pipelines.append(('svd', p)) return pipelines

We can then call this function to get the list of transforms, then enumerate each, evaluating it and reporting the performance along the way.

... # get the modeling pipelines pipelines = get_pipelines(model) # evaluate each pipeline results, names = list(), list() for name, pipeline in pipelines: # evaluate scores = evaluate_model(X, y, pipeline) # summarize print('>%s: %.3f (%.3f)' % (name, mean(scores), std(scores))) # store results.append(scores) names.append(name)

At the end of the run, we can create a box and whisker plot for each set of scores and compare the distributions of results visually.

... # plot the result pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Tying this together, the complete example of grid searching data preparation techniques on the wine classification dataset is listed below.

# compare data preparation methods for the wine classification dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.model_selection import cross_val_score from sklearn.linear_model import LogisticRegression from sklearn.pipeline import Pipeline from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import MinMaxScaler from sklearn.preprocessing import StandardScaler from sklearn.preprocessing import QuantileTransformer from sklearn.preprocessing import KBinsDiscretizer from sklearn.decomposition import PCA from sklearn.decomposition import TruncatedSVD from matplotlib import pyplot # prepare the dataset def load_dataset(): # load the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv' df = read_csv(url, header=None) data = df.values X, y = data[:, :-1], data[:, -1] # minimally prepare dataset X = X.astype('float') y = LabelEncoder().fit_transform(y.astype('str')) return X, y # evaluate a model def evaluate_model(X, y, model): # define the cross-validation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate model scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores # get modeling pipelines to evaluate def get_pipelines(model): pipelines = list() # normalize p = Pipeline([('s',MinMaxScaler()), ('m',model)]) pipelines.append(('norm', p)) # standardize p = Pipeline([('s',StandardScaler()), ('m',model)]) pipelines.append(('std', p)) # quantile p = Pipeline([('s',QuantileTransformer(n_quantiles=100, output_distribution='normal')), ('m',model)]) pipelines.append(('quan', p)) # discretize p = Pipeline([('s',KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')), ('m',model)]) pipelines.append(('kbins', p)) # pca p = Pipeline([('s',PCA(n_components=7)), ('m',model)]) pipelines.append(('pca', p)) # svd p = Pipeline([('s',TruncatedSVD(n_components=7)), ('m',model)]) pipelines.append(('svd', p)) return pipelines # get the dataset X, y = load_dataset() # define the model model = LogisticRegression(solver='liblinear') # get the modeling pipelines pipelines = get_pipelines(model) # evaluate each pipeline results, names = list(), list() for name, pipeline in pipelines: # evaluate scores = evaluate_model(X, y, pipeline) # summarize print('>%s: %.3f (%.3f)' % (name, mean(scores), std(scores))) # store results.append(scores) names.append(name) # plot the result pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example evaluates the performance of each pipeline and reports the mean and standard deviation classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that standardizing the input variables and using a quantile transform both achieves the best result with a classification accuracy of about 98.7 percent, an improvement over the baseline with no data preparation that achieved a classification accuracy of 95.3 percent.

You can add your own modeling pipelines to the *get_pipelines()* function and compare their result.

**Can you get better results?**

Let me know in the comments below.

>norm: 0.976 (0.031) >std: 0.987 (0.023) >quan: 0.987 (0.023) >kbins: 0.968 (0.045) >pca: 0.963 (0.039) >svd: 0.953 (0.048)

A figure is created showing box and whisker plots that summarize the distribution of classification accuracy scores for each data preparation technique. We can see that the distribution of scores for the standardization and quantile transforms are compact and very similar and have an outlier. We can see that the spread of scores for the other transforms is larger and skewing down.

The results may suggest that standardizing the dataset is probably an important step in the data preparation and related transforms, such as the quantile transform, and perhaps even the power transform may offer benefits if combined with standardization by making one or more input variables more Gaussian.

We can also explore sequences of transforms to see if they can offer a lift in performance.

For example, we might want to apply RFE feature selection after the standardization transform to see if the same or better results can be used with fewer input variables (e.g. less complexity).

We might also want to see if a power transform preceded with a data scaling transform can achieve good performance on the dataset as we believe it could given the success of the quantile transform.

The updated *get_pipelines()* function with sequences of transforms is provided below.

# get modeling pipelines to evaluate def get_pipelines(model): pipelines = list() # standardize p = Pipeline([('s',StandardScaler()), ('r', RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=10)), ('m',model)]) pipelines.append(('std', p)) # scale and power p = Pipeline([('s',MinMaxScaler((1,2))), ('p', PowerTransformer()), ('m',model)]) pipelines.append(('power', p)) return pipelines

Tying this together, the complete example is listed below.

# compare sequences of data preparation methods for the wine classification dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.model_selection import cross_val_score from sklearn.linear_model import LogisticRegression from sklearn.pipeline import Pipeline from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import MinMaxScaler from sklearn.preprocessing import StandardScaler from sklearn.preprocessing import QuantileTransformer from sklearn.preprocessing import PowerTransformer from sklearn.preprocessing import KBinsDiscretizer from sklearn.decomposition import PCA from sklearn.decomposition import TruncatedSVD from sklearn.feature_selection import RFE from matplotlib import pyplot # prepare the dataset def load_dataset(): # load the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv' df = read_csv(url, header=None) data = df.values X, y = data[:, :-1], data[:, -1] # minimally prepare dataset X = X.astype('float') y = LabelEncoder().fit_transform(y.astype('str')) return X, y # evaluate a model def evaluate_model(X, y, model): # define the cross-validation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate model scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores # get modeling pipelines to evaluate def get_pipelines(model): pipelines = list() # standardize p = Pipeline([('s',StandardScaler()), ('r', RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=10)), ('m',model)]) pipelines.append(('std', p)) # scale and power p = Pipeline([('s',MinMaxScaler((1,2))), ('p', PowerTransformer()), ('m',model)]) pipelines.append(('power', p)) return pipelines # get the dataset X, y = load_dataset() # define the model model = LogisticRegression(solver='liblinear') # get the modeling pipelines pipelines = get_pipelines(model) # evaluate each pipeline results, names = list(), list() for name, pipeline in pipelines: # evaluate scores = evaluate_model(X, y, pipeline) # summarize print('>%s: %.3f (%.3f)' % (name, mean(scores), std(scores))) # store results.append(scores) names.append(name) # plot the result pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example evaluates the performance of each pipeline and reports the mean and standard deviation classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the standardization with feature selection offers an additional lift in accuracy from 98.7 percent to 98.9 percent, although the data scaling and power transform do not offer any additional benefit over the quantile transform.

>std: 0.989 (0.022) >power: 0.987 (0.023)

A figure is created showing box and whisker plots that summarize the distribution of classification accuracy scores for each data preparation technique.

We can see that the distribution of results for both pipelines of transforms is compact with very little spread other than outlier.

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered how to use a grid search approach for data preparation with tabular data.

Specifically, you learned:

- Grid search provides an alternative approach to data preparation for tabular data, where transforms are tried as hyperparameters of the modeling pipeline.
- How to use the grid search method for data preparation to improve model performance over a baseline for a standard classification dataset.
- How to grid search sequences of data preparation methods to further improve model performance.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Grid Search Data Preparation Techniques appeared first on Machine Learning Mastery.

]]>The post Framework for Data Preparation Techniques in Machine Learning appeared first on Machine Learning Mastery.

]]>In some cases, the distribution of the data or the requirements of a machine learning model may suggest the data preparation needed, although this is rarely the case given the complexity and high-dimensionality of the data, the ever-increasing parade of new machine learning algorithms and limited, although human, limitations of the practitioner.

Instead, data preparation can be treated as another hyperparameter to tune as part of the modeling pipeline. This raises the question of how to know what data preparation methods to consider in the search, which can feel overwhelming to experts and beginners alike.

The solution is to think about the vast field of data preparation in a structured way and systematically evaluate data preparation techniques based on their effect on the raw data.

In this tutorial, you will discover a framework that provides a structured approach to both thinking about and grouping data preparation techniques for predictive modeling with structured data.

After completing this tutorial, you will know:

- The challenge and overwhelm of framing data preparation as yet an additional hyperparameter to tune in the machine learning modeling pipeline.
- A framework that defines five groups of data preparation techniques to consider.
- Examples of data preparation techniques that belong to each group that can be evaluated on your predictive modeling project.

**Kick-start your project** with my new book Data Preparation for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into three parts; they are:

- Challenge of Data Preparation
- Framework for Data Preparation
- Data Preparation Techniques

Data preparation refers to transforming raw data into a form that is better suited to predictive modeling.

This may be required because the data itself contains mistakes or errors. It may also be because the chosen algorithms have expectations regarding the type and distribution of the data.

To make the task of data preparation even more challenging, it is also common that the data preparation required to get the best performance from a predictive model may not be obvious and may bend or violate the expectations of the model that is being used.

As such, it is common to treat the choice and configuration of data preparation applied to the raw data as yet another hyperparameter of the modeling pipeline to be tuned.

This framing of data preparation is very effective in practice, as it allows you to use automatic search techniques like grid search and random search to discover unintuitive data preparation steps that result in skillful predictive models.

This framing of data preparation can also feel overwhelming to beginners given the large number and variety of data preparation techniques.

The solution to this overwhelm is to think about data preparation techniques in a systematic way.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Effective data preparation requires that the data preparation techniques available are organized and considered in a structured and systematic way.

This allows you to ensure that approach techniques are explored for your dataset and that potentially effective techniques are not skipped or ignored.

This can be achieved using a framework to organize data preparation techniques that consider their effect on the raw dataset.

For example, structured machine learning data, such as data we might store in a CSV file for classification and regression, consists of rows, columns, and values. We might consider data preparation techniques that operate at each of these levels.

- Data Preparation for Rows
- Data Preparation for Columns
- Data Preparation for Values

Data preparation for rows may be techniques that add or remove rows of data from the dataset. Similarly, data preparation for columns may be techniques that add or remove columns (features or variables) from the dataset. Whereas data preparation for values may be techniques that change the values in the dataset, often for a given column.

There is one more type of data preparation that does not neatly fit into this structure, and that is dimensionality reduction techniques. These techniques change the columns and the values at the same time, e.g. projecting the data into a lower-dimensional space.

- Data Preparation for Columns + Values

This raises the question of techniques that might apply to rows and values at the same time. This might include data preparation that consolidates rows of data in some way.

- Data Preparation for Rows + Values

We can summarize this framework and some high-level groups of data preparation methods in the following image.

Now that we have a framework for thinking about data preparation based on their effect on the data, let’s look at examples of techniques that fit into each group.

This section explores the five high-level groups of data preparation techniques defined in the previous section and suggests specific techniques that may fall within each group.

**Did I miss one of your preferred or favorite data preparation techniques?**

Let me know in the comments below.

This group is for data preparation techniques that add or remove rows of data.

In machine learning, rows are often referred to as samples, examples, or instances.

These techniques are often used to augment a limited training dataset or to remove errors or ambiguity from the dataset.

The main class of techniques that come to mind are data preparation techniques that are often used for imbalanced classification.

This includes techniques such as SMOTE that create synthetic rows of training data for under-represented classes and random undersampling that remove examples for over-represented classes.

For more on SMOTE data sampling, see the tutorial:

It also includes more advanced combined over- and undersampling techniques that attempt to identify and remove ambiguous examples along the decision boundary of a classification problem and remove them or change their class label.

For more on these types of data preparation, see the tutorial:

This class of data preparation techniques also includes algorithms for identifying and removing outliers from the data. These are rows of data that may be far from the center of probability mass in the dataset and, in turn, may be unrepresentative of the data from the domain.

For more on outlier detection and removal methods, see the tutorial:

This group is for data preparation techniques that add or remove columns of data.

In machine learning, columns are often referred to as variables or features.

These techniques are often required to either reduce the complexity (dimensionality) of a prediction problem or to unpack compound input variables or complex interactions between features.

The main class of techniques that come to mind are feature selection techniques.

This includes techniques that use statistics to score the relevance of input variables to the target variable based on the data type of each.

For more on these types of data preparation techniques, see the tutorial:

This also includes feature selection techniques that systematically test the impact of different combinations of input variables on the predictive skill of a machine learning model.

For more on these types of methods, see the tutorial:

Related are techniques that use a model to score the importance of input features based on their use by a predictive model, referred to as feature importance methods. These methods are often used for data interpretation, although they can also be used for feature selection.

For more on these types of methods, see the tutorial:

This group of methods also brings to mind techniques for creating or deriving new columns of data, new features. These are often referred to as feature engineering, although sometimes the whole field of data preparation is referred to as feature engineering.

For example, new features that represent values raised to exponents or multiplicative combinations of features can be created and added to the dataset as new columns.

For more on these types of data preparation techniques, see the tutorial:

This might also include data transforms that change a variable type, such as creating dummy variables for a categorical variable, often referred to as a one-hot encoding.

For more on these types of data preparation techniques, see the tutorial:

This group is for data preparation techniques that change the raw values in the data.

These techniques are often required to meet the expectations or requirements of specific machine learning algorithms.

The main class of techniques that come to mind is data transforms that change the scale or distribution of input variables.

For example, data transforms such as standardization and normalization change the scale of numeric input variables. Data transforms like ordinal encoding change the type of categorical input variables.

There are also many data transforms for changing the distribution of input variables.

For example, discretization or binning change the distribution of numerical input variables into categorical variables with an ordinal ranking.

For more on this type of data transform, see the tutorial:

The power transform can be used to change the distribution of data to remove a skew and make the distribution more normal (Gaussian).

For more on this method, see the tutorial:

The quantile transform is a flexible type of data preparation technique that can map a numerical input variable or to different types of distributions such as normal or Gaussian.

You can learn more about this data preparation technique here:

Another type of data preparation technique that belongs to this group are methods that systematically change values in the dataset.

This includes techniques that identify and replace missing values, often referred to as missing value imputation. This can be achieved using statistical methods or more advanced model-based methods.

For more on these methods, see the tutorial:

All of the methods discussed could also be considered feature engineering methods (e.g. fitting into the previously discussed group of data preparation methods) if the results of the transforms are appended to the raw data as new columns.

This group is for data preparation techniques that change both the number of columns and the values in the data.

The main class of techniques that this brings to mind are dimensionality reduction techniques that specifically reduce the number of columns and the scale and distribution of numerical input variables.

This includes matrix factorization methods used in linear algebra as well as manifold learning algorithms used in high-dimensional statistics.

For more information on these techniques, see the tutorial:

Although these techniques are designed to create projections of rows in a lower-dimensional space, perhaps this also leaves the door open to techniques that do the inverse. That is, use all or a subset of the input variables to create a projection into a higher-dimensional space, perhaps decompiling complex non-linear relationships.

Perhaps polynomial transforms where the results replace the raw dataset would fit into this class of data preparation methods.

**Do you know of other methods that fit into this group?**

Let me know in the comments below.

This group is for data preparation techniques that change both the number of rows and the values in the data.

I have not explicitly considered data transforms of this type before, but it falls out of the framework as defined.

A group of methods that come to mind are clustering algorithms where all or subsets of rows of data in the dataset are replaced with data samples at the cluster centers, referred to as cluster centroids.

Related might be replacing rows with exemplars (aggregates of rows) taken from specific machine learning algorithms, such as support vectors from a support vector machine, or the codebook vectors taken from a learning vector quantization.

Naturally, these aggregate rows are simply added to the dataset rather than replacing rows, then they would naturally fit into the “*Data Preparation for Rows*” group described above.

**Do you know of other methods that fit into this group?**

Let me know in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered a framework for systematically grouping data preparation techniques based on their effect on raw data.

Specifically, you learned:

- The challenge and overwhelm of framing data preparation as yet an additional hyperparameter to tune in the machine learning modeling pipeline.
- A framework that defines five groups of data preparation techniques to consider.
- Examples of data preparation techniques that belong to each group that can be evaluated on your predictive modeling project.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Framework for Data Preparation Techniques in Machine Learning appeared first on Machine Learning Mastery.

]]>The post 6 Dimensionality Reduction Algorithms With Python appeared first on Machine Learning Mastery.

]]>Nevertheless, it can be used as a data transform pre-processing step for machine learning algorithms on classification and regression predictive modeling datasets with supervised learning algorithms.

There are many dimensionality reduction algorithms to choose from and no single best algorithm for all cases. Instead, it is a good idea to explore a range of dimensionality reduction algorithms and different configurations for each algorithm.

In this tutorial, you will discover how to fit and evaluate top dimensionality reduction algorithms in Python.

After completing this tutorial, you will know:

- Dimensionality reduction seeks a lower-dimensional representation of numerical input data that preserves the salient relationships in the data.
- There are many different dimensionality reduction algorithms and no single best method for all datasets.
- How to implement, fit, and evaluate top dimensionality reduction in Python with the scikit-learn machine learning library.

**Kick-start your project** with my new book Data Preparation for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into three parts; they are:

- Dimensionality Reduction
- Dimensionality Reduction Algorithms
- Examples of Dimensionality Reduction
- Scikit-Learn Library Installation
- Classification Dataset
- Principal Component Analysis
- Singular Value Decomposition
- Linear Discriminant Analysis
- Isomap Embedding
- Locally Linear Embedding
- Modified Locally Linear Embedding

Dimensionality reduction refers to techniques for reducing the number of input variables in training data.

When dealing with high dimensional data, it is often useful to reduce the dimensionality by projecting the data to a lower dimensional subspace which captures the “essence” of the data. This is called dimensionality reduction.

— Page 11, Machine Learning: A Probabilistic Perspective, 2012.

High-dimensionality might mean hundreds, thousands, or even millions of input variables.

Fewer input dimensions often means correspondingly fewer parameters or a simpler structure in the machine learning model, referred to as degrees of freedom. A model with too many degrees of freedom is likely to overfit the training dataset and may not perform well on new data.

It is desirable to have simple models that generalize well, and in turn, input data with few input variables. This is particularly true for linear models where the number of inputs and the degrees of freedom of the model are often closely related.

Dimensionality reduction is a data preparation technique performed on data prior to modeling. It might be performed after data cleaning and data scaling and before training a predictive model.

… dimensionality reduction yields a more compact, more easily interpretable representation of the target concept, focusing the user’s attention on the most relevant variables.

— Page 289, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

As such, any dimensionality reduction performed on training data must also be performed on new data, such as a test dataset, validation dataset, and data when making a prediction with the final model.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

There are many algorithms that can be used for dimensionality reduction.

Two main classes of methods are those drawn from linear algebra and those drawn from manifold learning.

Matrix factorization methods drawn from the field of linear algebra can be used for dimensionality.

For more on matrix factorization, see the tutorial:

Some of the more popular methods include:

- Principal Components Analysis
- Singular Value Decomposition
- Non-Negative Matrix Factorization

Manifold learning methods seek a lower-dimensional projection of high dimensional input that captures the salient properties of the input data.

Some of the more popular methods include:

- Isomap Embedding
- Locally Linear Embedding
- Multidimensional Scaling
- Spectral Embedding
- t-distributed Stochastic Neighbor Embedding

Each algorithm offers a different approach to the challenge of discovering natural relationships in data at lower dimensions.

There is no best dimensionality reduction algorithm, and no easy way to find the best algorithm for your data without using controlled experiments.

In this tutorial, we will review how to use each subset of these popular dimensionality reduction algorithms from the scikit-learn library.

The examples will provide the basis for you to copy-paste the examples and test the methods on your own data.

We will not dive into the theory behind how the algorithms work or compare them directly. For a good starting point on this topic, see:

Let’s dive in.

In this section, we will review how to use popular dimensionality reduction algorithms in scikit-learn.

This includes an example of using the dimensionality reduction technique as a data transform in a modeling pipeline and evaluating a model fit on the data.

The examples are designed for you to copy-paste into your own project and apply the methods to your own data. There are some algorithms available in the scikit-learn library that are not demonstrated because they cannot be used as a data transform directly given the nature of the algorithm.

As such, we will use a synthetic classification dataset in each example.

First, let’s install the library.

Don’t skip this step as you will need to ensure you have the latest version installed.

You can install the scikit-learn library using the pip Python installer, as follows:

sudo pip install scikit-learn

For additional installation instructions specific to your platform, see:

Next, let’s confirm that the library is installed and you are using a modern version.

Run the following script to print the library version number.

# check scikit-learn version import sklearn print(sklearn.__version__)

Running the example, you should see the following version number or higher.

0.23.0

We will use the make_classification() function to create a test binary classification dataset.

The dataset will have 1,000 examples with 20 input features, 10 of which are informative and 10 of which are redundant. This provides an opportunity for each technique to identify and remove redundant input features.

The fixed random seed for the pseudorandom number generator ensures we generate the same synthetic dataset each time the code runs.

An example of creating and summarizing the synthetic classification dataset is listed below.

# synthetic classification dataset from sklearn.datasets import make_classification # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7) # summarize the dataset print(X.shape, y.shape)

Running the example creates the dataset and reports the number of rows and columns matching our expectations.

(1000, 20) (1000,)

It is a binary classification task and we will evaluate a LogisticRegression model after each dimensionality reduction transform.

The model will be evaluated using the gold standard of repeated stratified 10-fold cross-validation. The mean and standard deviation classification accuracy across all folds and repeats will be reported.

The example below evaluates the model on the raw dataset as a point of comparison.

# evaluate logistic regression model on raw data from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.linear_model import LogisticRegression # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7) # define the model model = LogisticRegression() # evaluate model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # report performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates the logistic regression on the raw dataset with all 20 columns, achieving a classification accuracy of about 82.4 percent.

A successful dimensionality reduction transform on this data should result in a model that has better accuracy than this baseline, although this may not be possible with all techniques.

Note: we are not trying to “*solve*” this dataset, just provide working examples that you can use as a starting point.

Accuracy: 0.824 (0.034)

Next, we can start looking at examples of dimensionality reduction algorithms applied to this dataset.

I have made some minimal attempts to tune each method to the dataset. Each dimensionality reduction method will be configured to reduce the 20 input columns to 10 where possible.

We will use a Pipeline to combine the data transform and model into an atomic unit that can be evaluated using the cross-validation procedure; for example:

... # define the pipeline steps = [('pca', PCA(n_components=10)), ('m', LogisticRegression())] model = Pipeline(steps=steps)

Let’s get started.

**Can you get a better result for one of the algorithms?**

Let me know in the comments below.

Principal Component Analysis, or PCA, might be the most popular technique for dimensionality reduction with dense data (few zero values).

For more on how PCA works, see the tutorial:

The scikit-learn library provides the PCA class implementation of Principal Component Analysis that can be used as a dimensionality reduction data transform. The “*n_components*” argument can be set to configure the number of desired dimensions in the output of the transform.

The complete example of evaluating a model with PCA dimensionality reduction is listed below.

# evaluate pca with logistic regression algorithm for classification from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.pipeline import Pipeline from sklearn.decomposition import PCA from sklearn.linear_model import LogisticRegression # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7) # define the pipeline steps = [('pca', PCA(n_components=10)), ('m', LogisticRegression())] model = Pipeline(steps=steps) # evaluate model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # report performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates the modeling pipeline with dimensionality reduction and a logistic regression predictive model.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we don’t see any lift in model performance in using the PCA transform.

Accuracy: 0.824 (0.034)

Singular Value Decomposition, or SVD, is one of the most popular techniques for dimensionality reduction for sparse data (data with many zero values).

For more on how SVD works, see the tutorial:

The scikit-learn library provides the TruncatedSVD class implementation of Singular Value Decomposition that can be used as a dimensionality reduction data transform. The “*n_components*” argument can be set to configure the number of desired dimensions in the output of the transform.

The complete example of evaluating a model with SVD dimensionality reduction is listed below.

# evaluate svd with logistic regression algorithm for classification from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.pipeline import Pipeline from sklearn.decomposition import TruncatedSVD from sklearn.linear_model import LogisticRegression # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7) # define the pipeline steps = [('svd', TruncatedSVD(n_components=10)), ('m', LogisticRegression())] model = Pipeline(steps=steps) # evaluate model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # report performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates the modeling pipeline with dimensionality reduction and a logistic regression predictive model.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we don’t see any lift in model performance in using the SVD transform.

Accuracy: 0.824 (0.034)

Linear Discriminant Analysis, or LDA, is a multi-class classification algorithm that can be used for dimensionality reduction.

The number of dimensions for the projection is limited to 1 and C-1, where C is the number of classes. In this case, our dataset is a binary classification problem (two classes), limiting the number of dimensions to 1.

For more on LDA for dimensionality reduction, see the tutorial:

The scikit-learn library provides the LinearDiscriminantAnalysis class implementation of Linear Discriminant Analysis that can be used as a dimensionality reduction data transform. The “*n_components*” argument can be set to configure the number of desired dimensions in the output of the transform.

The complete example of evaluating a model with LDA dimensionality reduction is listed below.

# evaluate lda with logistic regression algorithm for classification from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.pipeline import Pipeline from sklearn.discriminant_analysis import LinearDiscriminantAnalysis from sklearn.linear_model import LogisticRegression # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7) # define the pipeline steps = [('lda', LinearDiscriminantAnalysis(n_components=1)), ('m', LogisticRegression())] model = Pipeline(steps=steps) # evaluate model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # report performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates the modeling pipeline with dimensionality reduction and a logistic regression predictive model.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see a slight lift in performance as compared to the baseline fit on the raw data.

Accuracy: 0.825 (0.034)

Isomap Embedding, or Isomap, creates an embedding of the dataset and attempts to preserve the relationships in the dataset.

The scikit-learn library provides the Isomap class implementation of Isomap Embedding that can be used as a dimensionality reduction data transform. The “*n_components*” argument can be set to configure the number of desired dimensions in the output of the transform.

The complete example of evaluating a model with SVD dimensionality reduction is listed below.

# evaluate isomap with logistic regression algorithm for classification from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.pipeline import Pipeline from sklearn.manifold import Isomap from sklearn.linear_model import LogisticRegression # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7) # define the pipeline steps = [('iso', Isomap(n_components=10)), ('m', LogisticRegression())] model = Pipeline(steps=steps) # evaluate model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # report performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see a lift in performance with the Isomap data transform as compared to the baseline fit on the raw data.

Accuracy: 0.888 (0.029)

Locally Linear Embedding, or LLE, creates an embedding of the dataset and attempts to preserve the relationships between neighborhoods in the dataset.

The scikit-learn library provides the LocallyLinearEmbedding class implementation of Locally Linear Embedding that can be used as a dimensionality reduction data transform. The “*n_components*” argument can be set to configure the number of desired dimensions in the output of the transform

The complete example of evaluating a model with LLE dimensionality reduction is listed below.

# evaluate lle and logistic regression for classification from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.pipeline import Pipeline from sklearn.manifold import LocallyLinearEmbedding from sklearn.linear_model import LogisticRegression # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7) # define the pipeline steps = [('lle', LocallyLinearEmbedding(n_components=10)), ('m', LogisticRegression())] model = Pipeline(steps=steps) # evaluate model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # report performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see a lift in performance with the LLE data transform as compared to the baseline fit on the raw data.

Accuracy: 0.886 (0.028)

Modified Locally Linear Embedding, or Modified LLE, is an extension of Locally Linear Embedding that creates multiple weighting vectors for each neighborhood.

The scikit-learn library provides the LocallyLinearEmbedding class implementation of Modified Locally Linear Embedding that can be used as a dimensionality reduction data transform. The “*method*” argument must be set to ‘modified’ and the “*n_components*” argument can be set to configure the number of desired dimensions in the output of the transform which must be less than the “*n_neighbors*” argument.

The complete example of evaluating a model with Modified LLE dimensionality reduction is listed below.

# evaluate modified lle and logistic regression for classification from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.pipeline import Pipeline from sklearn.manifold import LocallyLinearEmbedding from sklearn.linear_model import LogisticRegression # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7) # define the pipeline steps = [('lle', LocallyLinearEmbedding(n_components=5, method='modified', n_neighbors=10)), ('m', LogisticRegression())] model = Pipeline(steps=steps) # evaluate model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # report performance print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see a lift in performance with the modified LLE data transform as compared to the baseline fit on the raw data.

Accuracy: 0.846 (0.036)

This section provides more resources on the topic if you are looking to go deeper.

- Introduction to Dimensionality Reduction for Machine Learning
- Principal Component Analysis for Dimensionality Reduction in Python
- Singular Value Decomposition for Dimensionality Reduction in Python
- Linear Discriminant Analysis for Dimensionality Reduction in Python

In this tutorial, you discovered how to fit and evaluate top dimensionality reduction algorithms in Python.

Specifically, you learned:

- Dimensionality reduction seeks a lower-dimensional representation of numerical input data that preserves the salient relationships in the data.
- There are many different dimensionality reduction algorithms and no single best method for all datasets.
- How to implement, fit, and evaluate top dimensionality reduction in Python with the scikit-learn machine learning library.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post 6 Dimensionality Reduction Algorithms With Python appeared first on Machine Learning Mastery.

]]>The post 4 Automatic Outlier Detection Algorithms in Python appeared first on Machine Learning Mastery.

]]>Identifying and **removing outliers** is challenging with simple statistical methods for most machine learning datasets given the large number of input variables. Instead, automatic outlier detection methods can be used in the modeling pipeline and compared, just like other data preparation transforms that may be applied to the dataset.

In this tutorial, you will discover how to use automatic outlier detection and removal to improve machine learning predictive modeling performance.

After completing this tutorial, you will know:

- Automatic outlier detection models provide an alternative to statistical techniques with a larger number of input variables with complex and unknown inter-relationships.
- How to correctly apply automatic outlier detection and removal to the training dataset only to avoid data leakage.
- How to evaluate and compare predictive modeling pipelines with outliers removed from the training dataset.

**Kick-start your project** with my new book Data Preparation for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into three parts; they are:

- Outlier Detection and Removal
- Dataset and Performance Baseline
- House Price Regression Dataset
- Baseline Model Performance

- Automatic Outlier Detection
- Isolation Forest
- Minimum Covariance Determinant
- Local Outlier Factor
- One-Class SVM

Outliers are observations in a dataset that don’t fit in some way.

Perhaps the most common or familiar type of outlier is the observations that are far from the rest of the observations or the center of mass of observations.

This is easy to understand when we have one or two variables and we can visualize the data as a histogram or scatter plot, although it becomes very challenging when we have many input variables defining a high-dimensional input feature space.

In this case, simple statistical methods for identifying outliers can break down, such as methods that use standard deviations or the interquartile range.

It can be important to identify and remove outliers from data when training machine learning algorithms for predictive modeling.

Outliers can skew statistical measures and data distributions, providing a misleading representation of the underlying data and relationships. Removing outliers from training data prior to modeling can result in a better fit of the data and, in turn, more skillful predictions.

Thankfully, there are a variety of automatic model-based methods for identifying outliers in input data. Importantly, each method approaches the definition of an outlier is slightly different ways, providing alternate approaches to preparing a training dataset that can be evaluated and compared, just like any other data preparation step in a modeling pipeline.

Before we dive into automatic outlier detection methods, let’s first select a standard machine learning dataset that we can use as the basis for our investigation.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

In this section, we will first select a standard machine learning dataset and establish a baseline in performance on this dataset.

This will provide the context for exploring the outlier identification and removal method of data preparation in the next section.

We will use the house price regression dataset.

This dataset has 13 input variables that describe the properties of the house and suburb and requires the prediction of the median value of houses in the suburb in thousands of dollars.

You can learn more about the dataset here:

No need to download the dataset as we will download it automatically as part of our worked examples.

Open the dataset and review the raw data. The first few rows of data are listed below.

We can see that it is a regression predictive modeling problem with numerical input variables, each of which has different scales.

0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98,24.00 0.02731,0.00,7.070,0,0.4690,6.4210,78.90,4.9671,2,242.0,17.80,396.90,9.14,21.60 0.02729,0.00,7.070,0,0.4690,7.1850,61.10,4.9671,2,242.0,17.80,392.83,4.03,34.70 0.03237,0.00,2.180,0,0.4580,6.9980,45.80,6.0622,3,222.0,18.70,394.63,2.94,33.40 0.06905,0.00,2.180,0,0.4580,7.1470,54.20,6.0622,3,222.0,18.70,396.90,5.33,36.20 ...

The dataset has many numerical input variables that have unknown and complex relationships. We don’t know that outliers exist in this dataset, although we may guess that some outliers may be present.

The example below loads the dataset and splits it into the input and output columns, splits it into train and test datasets, then summarizes the shapes of the data arrays.

# load and summarize the dataset from pandas import read_csv from sklearn.model_selection import train_test_split # load the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv' df = read_csv(url, header=None) # retrieve the array data = df.values # split into input and output elements X, y = data[:, :-1], data[:, -1] # summarize the shape of the dataset print(X.shape, y.shape) # split into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) # summarize the shape of the train and test sets print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Running the example, we can see that the dataset was loaded correctly and that there are 506 rows of data with 13 input variables and a single target variable.

The dataset is split into train and test sets with 339 rows used for model training and 167 for model evaluation.

(506, 13) (506,) (339, 13) (167, 13) (339,) (167,)

Next, let’s evaluate a model on this dataset and establish a baseline in performance.

It is a regression predictive modeling problem, meaning that we will be predicting a numeric value. All input variables are also numeric.

In this case, we will fit a linear regression algorithm and evaluate model performance by training the model on the test dataset and making a prediction on the test data and evaluate the predictions using the mean absolute error (MAE).

The complete example of evaluating a linear regression model on the dataset is listed below.

# evaluate model on the raw dataset from pandas import read_csv from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from sklearn.metrics import mean_absolute_error # load the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv' df = read_csv(url, header=None) # retrieve the array data = df.values # split into input and output elements X, y = data[:, :-1], data[:, -1] # split into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) # fit the model model = LinearRegression() model.fit(X_train, y_train) # evaluate the model yhat = model.predict(X_test) # evaluate predictions mae = mean_absolute_error(y_test, yhat) print('MAE: %.3f' % mae)

Running the example fits and evaluates the model, then reports the MAE.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved a MAE of about 3.417. This provides a baseline in performance to which we can compare different outlier identification and removal procedures.

MAE: 3.417

Next, we can try removing outliers from the training dataset.

The scikit-learn library provides a number of built-in automatic methods for identifying outliers in data.

In this section, we will review four methods and compare their performance on the house price dataset.

Each method will be defined, then fit on the training dataset. The fit model will then predict which examples in the training dataset are outliers and which are not (so-called inliers). The outliers will then be removed from the training dataset, then the model will be fit on the remaining examples and evaluated on the entire test dataset.

It would be invalid to fit the outlier detection method on the entire training dataset as this would result in data leakage. That is, the model would have access to data (or information about the data) in the test set not used to train the model. This may result in an optimistic estimate of model performance.

We could attempt to detect outliers on “*new data*” such as the test set prior to making a prediction, but then what do we do if outliers are detected?

One approach might be to return a “*None*” indicating that the model is unable to make a prediction on those outlier cases. This might be an interesting extension to explore that may be appropriate for your project.

Isolation Forest, or iForest for short, is a tree-based anomaly detection algorithm.

It is based on modeling the normal data in such a way as to isolate anomalies that are both few in number and different in the feature space.

… our proposed method takes advantage of two anomalies’ quantitative properties: i) they are the minority consisting of fewer instances and ii) they have attribute-values that are very different from those of normal instances.

— Isolation Forest, 2008.

The scikit-learn library provides an implementation of Isolation Forest in the IsolationForest class.

Perhaps the most important hyperparameter in the model is the “*contamination*” argument, which is used to help estimate the number of outliers in the dataset. This is a value between 0.0 and 0.5 and by default is set to 0.1.

... # identify outliers in the training dataset iso = IsolationForest(contamination=0.1) yhat = iso.fit_predict(X_train)

Once identified, we can remove the outliers from the training dataset.

... # select all rows that are not outliers mask = yhat != -1 X_train, y_train = X_train[mask, :], y_train[mask]

Tying this together, the complete example of evaluating the linear model on the housing dataset with outliers identified and removed with isolation forest is listed below.

# evaluate model performance with outliers removed using isolation forest from pandas import read_csv from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from sklearn.ensemble import IsolationForest from sklearn.metrics import mean_absolute_error # load the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv' df = read_csv(url, header=None) # retrieve the array data = df.values # split into input and output elements X, y = data[:, :-1], data[:, -1] # split into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) # summarize the shape of the training dataset print(X_train.shape, y_train.shape) # identify outliers in the training dataset iso = IsolationForest(contamination=0.1) yhat = iso.fit_predict(X_train) # select all rows that are not outliers mask = yhat != -1 X_train, y_train = X_train[mask, :], y_train[mask] # summarize the shape of the updated training dataset print(X_train.shape, y_train.shape) # fit the model model = LinearRegression() model.fit(X_train, y_train) # evaluate the model yhat = model.predict(X_test) # evaluate predictions mae = mean_absolute_error(y_test, yhat) print('MAE: %.3f' % mae)

Running the example fits and evaluates the model, then reports the MAE.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that that model identified and removed 34 outliers and achieved a MAE of about 3.189, an improvement over the baseline that achieved a score of about 3.417.

(339, 13) (339,) (305, 13) (305,) MAE: 3.189

If the input variables have a Gaussian distribution, then simple statistical methods can be used to detect outliers.

For example, if the dataset has two input variables and both are Gaussian, then the feature space forms a multi-dimensional Gaussian and knowledge of this distribution can be used to identify values far from the distribution.

This approach can be generalized by defining a hypersphere (ellipsoid) that covers the normal data, and data that falls outside this shape is considered an outlier. An efficient implementation of this technique for multivariate data is known as the Minimum Covariance Determinant, or MCD for short.

The Minimum Covariance Determinant (MCD) method is a highly robust estimator of multivariate location and scatter, for which a fast algorithm is available. […] It also serves as a convenient and efficient tool for outlier detection.

— Minimum Covariance Determinant and Extensions, 2017.

The scikit-learn library provides access to this method via the EllipticEnvelope class.

It provides the “*contamination*” argument that defines the expected ratio of outliers to be observed in practice. In this case, we will set it to a value of 0.01, found with a little trial and error.

... # identify outliers in the training dataset ee = EllipticEnvelope(contamination=0.01) yhat = ee.fit_predict(X_train)

Once identified, the outliers can be removed from the training dataset as we did in the prior example.

Tying this together, the complete example of identifying and removing outliers from the housing dataset using the elliptical envelope (minimum covariant determinant) method is listed below.

# evaluate model performance with outliers removed using elliptical envelope from pandas import read_csv from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from sklearn.covariance import EllipticEnvelope from sklearn.metrics import mean_absolute_error # load the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv' df = read_csv(url, header=None) # retrieve the array data = df.values # split into input and output elements X, y = data[:, :-1], data[:, -1] # split into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) # summarize the shape of the training dataset print(X_train.shape, y_train.shape) # identify outliers in the training dataset ee = EllipticEnvelope(contamination=0.01) yhat = ee.fit_predict(X_train) # select all rows that are not outliers mask = yhat != -1 X_train, y_train = X_train[mask, :], y_train[mask] # summarize the shape of the updated training dataset print(X_train.shape, y_train.shape) # fit the model model = LinearRegression() model.fit(X_train, y_train) # evaluate the model yhat = model.predict(X_test) # evaluate predictions mae = mean_absolute_error(y_test, yhat) print('MAE: %.3f' % mae)

Running the example fits and evaluates the model, then reports the MAE.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the elliptical envelope method identified and removed only 4 outliers, resulting in a drop in MAE from 3.417 with the baseline to 3.388.

(339, 13) (339,) (335, 13) (335,) MAE: 3.388

A simple approach to identifying outliers is to locate those examples that are far from the other examples in the feature space.

This can work well for feature spaces with low dimensionality (few features), although it can become less reliable as the number of features is increased, referred to as the curse of dimensionality.

The local outlier factor, or LOF for short, is a technique that attempts to harness the idea of nearest neighbors for outlier detection. Each example is assigned a scoring of how isolated or how likely it is to be outliers based on the size of its local neighborhood. Those examples with the largest score are more likely to be outliers.

We introduce a local outlier (LOF) for each object in the dataset, indicating its degree of outlier-ness.

— LOF: Identifying Density-based Local Outliers, 2000.

The scikit-learn library provides an implementation of this approach in the LocalOutlierFactor class.

The model provides the “*contamination*” argument, that is the expected percentage of outliers in the dataset, be indicated and defaults to 0.1.

... # identify outliers in the training dataset lof = LocalOutlierFactor() yhat = lof.fit_predict(X_train)

Tying this together, the complete example of identifying and removing outliers from the housing dataset using the local outlier factor method is listed below.

# evaluate model performance with outliers removed using local outlier factor from pandas import read_csv from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from sklearn.neighbors import LocalOutlierFactor from sklearn.metrics import mean_absolute_error # load the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv' df = read_csv(url, header=None) # retrieve the array data = df.values # split into input and output elements X, y = data[:, :-1], data[:, -1] # split into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) # summarize the shape of the training dataset print(X_train.shape, y_train.shape) # identify outliers in the training dataset lof = LocalOutlierFactor() yhat = lof.fit_predict(X_train) # select all rows that are not outliers mask = yhat != -1 X_train, y_train = X_train[mask, :], y_train[mask] # summarize the shape of the updated training dataset print(X_train.shape, y_train.shape) # fit the model model = LinearRegression() model.fit(X_train, y_train) # evaluate the model yhat = model.predict(X_test) # evaluate predictions mae = mean_absolute_error(y_test, yhat) print('MAE: %.3f' % mae)

Running the example fits and evaluates the model, then reports the MAE.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the local outlier factor method identified and removed 34 outliers, the same number as isolation forest, resulting in a drop in MAE from 3.417 with the baseline to 3.356. Better, but not as good as isolation forest, suggesting a different set of outliers were identified and removed.

(339, 13) (339,) (305, 13) (305,) MAE: 3.356

The support vector machine, or SVM, algorithm developed initially for binary classification can be used for one-class classification.

When modeling one class, the algorithm captures the density of the majority class and classifies examples on the extremes of the density function as outliers. This modification of SVM is referred to as One-Class SVM.

… an algorithm that computes a binary function that is supposed to capture regions in input space where the probability density lives (its support), that is, a function such that most of the data will live in the region where the function is nonzero.

— Estimating the Support of a High-Dimensional Distribution, 2001.

Although SVM is a classification algorithm and One-Class SVM is also a classification algorithm, it can be used to discover outliers in input data for both regression and classification datasets.

The scikit-learn library provides an implementation of one-class SVM in the OneClassSVM class.

The class provides the “*nu*” argument that specifies the approximate ratio of outliers in the dataset, which defaults to 0.1. In this case, we will set it to 0.01, found with a little trial and error.

... # identify outliers in the training dataset ee = OneClassSVM(nu=0.01) yhat = ee.fit_predict(X_train)

Tying this together, the complete example of identifying and removing outliers from the housing dataset using the one class SVM method is listed below.

# evaluate model performance with outliers removed using one class SVM from pandas import read_csv from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression from sklearn.svm import OneClassSVM from sklearn.metrics import mean_absolute_error # load the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv' df = read_csv(url, header=None) # retrieve the array data = df.values # split into input and output elements X, y = data[:, :-1], data[:, -1] # split into train and test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1) # summarize the shape of the training dataset print(X_train.shape, y_train.shape) # identify outliers in the training dataset ee = OneClassSVM(nu=0.01) yhat = ee.fit_predict(X_train) # select all rows that are not outliers mask = yhat != -1 X_train, y_train = X_train[mask, :], y_train[mask] # summarize the shape of the updated training dataset print(X_train.shape, y_train.shape) # fit the model model = LinearRegression() model.fit(X_train, y_train) # evaluate the model yhat = model.predict(X_test) # evaluate predictions mae = mean_absolute_error(y_test, yhat) print('MAE: %.3f' % mae)

Running the example fits and evaluates the model, then reports the MAE.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that only three outliers were identified and removed and the model achieved a MAE of about 3.431, which is not better than the baseline model that achieved 3.417. Perhaps better performance can be achieved with more tuning.

(339, 13) (339,) (336, 13) (336,) MAE: 3.431

This section provides more resources on the topic if you are looking to go deeper.

- One-Class Classification Algorithms for Imbalanced Datasets
- How to Remove Outliers for Machine Learning

- Isolation Forest, 2008.
- Minimum Covariance Determinant and Extensions, 2017.
- LOF: Identifying Density-based Local Outliers, 2000.
- Estimating the Support of a High-Dimensional Distribution, 2001.

- Novelty and Outlier Detection, scikit-learn user guide.
- sklearn.covariance.EllipticEnvelope API.
- sklearn.svm.OneClassSVM API.
- sklearn.neighbors.LocalOutlierFactor API.
- sklearn.ensemble.IsolationForest API.

In this tutorial, you discovered how to use automatic outlier detection and removal to improve machine learning predictive modeling performance.

Specifically, you learned:

- Automatic outlier detection models provide an alternative to statistical techniques with a larger number of input variables with complex and unknown inter-relationships.
- How to correctly apply automatic outlier detection and removal to the training dataset only to avoid data leakage.
- How to evaluate and compare predictive modeling pipelines with outliers removed from the training dataset.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post 4 Automatic Outlier Detection Algorithms in Python appeared first on Machine Learning Mastery.

]]>The post How to Use Feature Extraction on Tabular Data for Machine Learning appeared first on Machine Learning Mastery.

]]>The most common approach to data preparation is to study a dataset and review the expectations of a machine learning algorithm, then carefully choose the most appropriate data preparation techniques to transform the raw data to best meet the expectations of the algorithm. This is slow, expensive, and requires a vast amount of expertise.

An alternative approach to data preparation is to apply a suite of common and commonly useful data preparation techniques to the raw data in parallel and combine the results of all of the transforms together into a single large dataset from which a model can be fit and evaluated.

This is an alternative philosophy for data preparation that treats data transforms as an approach to extract salient features from raw data to expose the structure of the problem to the learning algorithms. It requires learning algorithms that are scalable of weight input features and using those input features that are most relevant to the target that is being predicted.

This approach requires less expertise, is computationally effective compared to a full grid search of data preparation methods, and can aid in the discovery of unintuitive data preparation solutions that achieve good or best performance for a given predictive modeling problem.

In this tutorial, you will discover how to use feature extraction for data preparation with tabular data.

After completing this tutorial, you will know:

- Feature extraction provides an alternate approach to data preparation for tabular data, where all data transforms are applied in parallel to raw input data and combined together to create one large dataset.
- How to use the feature extraction method for data preparation to improve model performance over a baseline for a standard classification dataset.
- How to add feature selection to the feature extraction modeling pipeline to give a further lift in modeling performance on a standard dataset.

**Kick-start your project** with my new book Data Preparation for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into three parts; they are:

- Feature Extraction Technique for Data Preparation
- Dataset and Performance Baseline
- Wine Classification Dataset
- Baseline Model Performance

- Feature Extraction Approach to Data Preparation

Data preparation can be challenging.

The approach that is most often prescribed and followed is to analyze the dataset, review the requirements of the algorithms, and transform the raw data to best meet the expectations of the algorithms.

This can be effective, but is also slow and can require deep expertise both with data analysis and machine learning algorithms.

An alternative approach is to treat the preparation of input variables as a hyperparameter of the modeling pipeline and to tune it along with the choice of algorithm and algorithm configuration.

This too can be an effective approach exposing unintuitive solutions and requiring very little expertise, although it can be computationally expensive.

An approach that seeks a middle ground between these two approaches to data preparation is to treat the transformation of input data as a **feature engineering** or **feature extraction** procedure. This involves applying a suite of common or commonly useful data preparation techniques to the raw data, then aggregating all features together to create one large dataset, then fit and evaluate a model on this data.

The philosophy of the approach treats each data preparation technique as a transform that extracts salient features from raw data to be presented to the learning algorithm. Ideally, such transforms untangle complex relationships and compound input variables, in turn allowing the use of simpler modeling algorithms, such as linear machine learning techniques.

For lack of a better name, we will refer to this as the “**Feature Engineering Method**” or the “**Feature Extraction Method**” for configuring data preparation for a predictive modeling project.

It allows data analysis and algorithm expertise to be used in the selection of data preparation methods and allows unintuitive solutions to be found but at a much lower computational cost.

The exclusion in the number of input features can also be explicitly addressed through the use of feature selection techniques that attempt to rank order the importance or value of the vast number of extracted features and only select a small subset of the most relevant to predicting the target variable.

We can explore this approach to data preparation with a worked example.

Before we dive into a worked example, let’s first select a standard dataset and develop a baseline in performance.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

In this section, we will first select a standard machine learning dataset and establish a baseline in performance on this dataset. This will provide the context for exploring the feature extraction method of data preparation in the next section.

We will use the wine classification dataset.

This dataset has 13 input variables that describe the chemical composition of samples of wine and requires that the wine be classified as one of three types.

You can learn more about the dataset here:

No need to download the dataset as we will download it automatically as part of our worked examples.

Open the dataset and review the raw data. The first few rows of data are listed below.

We can see that it is a multi-class classification predictive modeling problem with numerical input variables, each of which has different scales.

14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065,1 13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050,1 13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185,1 14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480,1 13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735,1 ...

The example loads the dataset and splits it into the input and output columns, then summarizes the data arrays.

# example of loading and summarizing the wine dataset from pandas import read_csv # define the location of the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv' # load the dataset as a data frame df = read_csv(url, header=None) # retrieve the numpy array data = df.values # split the columns into input and output variables X, y = data[:, :-1], data[:, -1] # summarize the shape of the loaded data print(X.shape, y.shape)

Running the example, we can see that the dataset was loaded correctly and that there are 179 rows of data with 13 input variables and a single target variable.

(178, 13) (178,)

Next, let’s evaluate a model on this dataset and establish a baseline in performance.

We can establish a baseline in performance on the wine classification task by evaluating a model on the raw input data.

In this case, we will evaluate a logistic regression model.

First, we can perform minimum data preparation by ensuring the input variables are numeric and that the target variable is label encoded, as expected by the scikit-learn library.

... # minimally prepare dataset X = X.astype('float') y = LabelEncoder().fit_transform(y.astype('str'))

Next, we can define our predictive model.

... # define the model model = LogisticRegression(solver='liblinear')

We will evaluate the model using the gold standard of repeated stratified k-fold cross-validation with 10 folds and three repeats.

Model performance will be evaluated using classification accuracy.

... model = LogisticRegression(solver='liblinear') # define the cross-validation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate model scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

At the end of the run, we will report the mean and standard deviation of the accuracy scores collected across all repeats and evaluation folds.

... # report performance print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Tying this together, the complete example of evaluating a logistic regression model on the raw wine classification dataset is listed below.

# baseline model performance on the wine dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.preprocessing import LabelEncoder from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.model_selection import cross_val_score from sklearn.linear_model import LogisticRegression # load the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv' df = read_csv(url, header=None) data = df.values X, y = data[:, :-1], data[:, -1] # minimally prepare dataset X = X.astype('float') y = LabelEncoder().fit_transform(y.astype('str')) # define the model model = LogisticRegression(solver='liblinear') # define the cross-validation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate model scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # report performance print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the model performance and reports the mean and standard deviation classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the logistic regression model fit on the raw input data achieved the average classification accuracy of about 95.3 percent, providing a baseline in performance.

Accuracy: 0.953 (0.048)

Next, let’s explore whether we can improve the performance using the feature extraction based approach to data preparation.

In this section, we can explore whether we can improve performance using the feature extraction approach to data preparation.

The first step is to select a suite of common and commonly useful data preparation techniques.

In this case, given that the input variables are numeric, we will use a range of transforms to change the scale of the input variables such as MinMaxScaler, StandardScaler, and RobustScaler, as well as transforms for chaining the distribution of the input variables such as QuantileTransformer and KBinsDiscretizer. Finally, we will also use transforms that remove linear dependencies between the input variables such as PCA and TruncatedSVD.

The FeatureUnion class can be used to define a list of transforms to perform, the results of which will be aggregated together, i.e. unioned. This will create a new dataset that has a vast number of columns.

An estimate of the number of columns would be 13 input variables times five transforms or 65 plus the 14 columns output from the PCA and SVD dimensionality reduction methods, to give a total of about 79 features.

... # transforms for the feature union transforms = list() transforms.append(('mms', MinMaxScaler())) transforms.append(('ss', StandardScaler())) transforms.append(('rs', RobustScaler())) transforms.append(('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal'))) transforms.append(('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform'))) transforms.append(('pca', PCA(n_components=7))) transforms.append(('svd', TruncatedSVD(n_components=7))) # create the feature union fu = FeatureUnion(transforms)

We can then create a modeling Pipeline with the FeatureUnion as the first step and the logistic regression model as the final step.

... # define the model model = LogisticRegression(solver='liblinear') # define the pipeline steps = list() steps.append(('fu', fu)) steps.append(('m', model)) pipeline = Pipeline(steps=steps)

The pipeline can then be evaluated using repeated stratified k-fold cross-validation as before.

Tying this together, the complete example is listed below.

# data preparation as feature engineering for wine dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.model_selection import cross_val_score from sklearn.linear_model import LogisticRegression from sklearn.pipeline import Pipeline from sklearn.pipeline import FeatureUnion from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import MinMaxScaler from sklearn.preprocessing import StandardScaler from sklearn.preprocessing import RobustScaler from sklearn.preprocessing import QuantileTransformer from sklearn.preprocessing import KBinsDiscretizer from sklearn.decomposition import PCA from sklearn.decomposition import TruncatedSVD # load the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv' df = read_csv(url, header=None) data = df.values X, y = data[:, :-1], data[:, -1] # minimally prepare dataset X = X.astype('float') y = LabelEncoder().fit_transform(y.astype('str')) # transforms for the feature union transforms = list() transforms.append(('mms', MinMaxScaler())) transforms.append(('ss', StandardScaler())) transforms.append(('rs', RobustScaler())) transforms.append(('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal'))) transforms.append(('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform'))) transforms.append(('pca', PCA(n_components=7))) transforms.append(('svd', TruncatedSVD(n_components=7))) # create the feature union fu = FeatureUnion(transforms) # define the model model = LogisticRegression(solver='liblinear') # define the pipeline steps = list() steps.append(('fu', fu)) steps.append(('m', model)) pipeline = Pipeline(steps=steps) # define the cross-validation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate model scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # report performance print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the model performance and reports the mean and standard deviation classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see a lift in performance over the baseline performance, achieving a mean classification accuracy of about 96.8 percent as compared to 95.3 percent in the previous section.

Accuracy: 0.968 (0.037)

Try adding more data preparation methods to the FeatureUnion to see if you can improve the performance.

**Can you get better results?**

Let me know what you discover in the comments below.

We can also use feature selection to reduce the approximately 80 extracted features down to a subset of those that are most relevant to the model. In addition to reducing the complexity of the model, it can also result in a lift in performance by removing irrelevant and redundant input features.

In this case, we will use the Recursive Feature Elimination, or RFE, technique for feature selection and configure it to select the 15 most relevant features.

... # define the feature selection rfe = RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=15)

We can then add the RFE feature selection to the modeling pipeline after the *FeatureUnion* and before the *LogisticRegression* algorithm.

... # define the pipeline steps = list() steps.append(('fu', fu)) steps.append(('rfe', rfe)) steps.append(('m', model)) pipeline = Pipeline(steps=steps)

Tying this together, the complete example of the feature selection data preparation method with feature selection is listed below.

# data preparation as feature engineering with feature selection for wine dataset from numpy import mean from numpy import std from pandas import read_csv from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.model_selection import cross_val_score from sklearn.linear_model import LogisticRegression from sklearn.pipeline import Pipeline from sklearn.pipeline import FeatureUnion from sklearn.preprocessing import LabelEncoder from sklearn.preprocessing import MinMaxScaler from sklearn.preprocessing import StandardScaler from sklearn.preprocessing import RobustScaler from sklearn.preprocessing import QuantileTransformer from sklearn.preprocessing import KBinsDiscretizer from sklearn.feature_selection import RFE from sklearn.decomposition import PCA from sklearn.decomposition import TruncatedSVD # load the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv' df = read_csv(url, header=None) data = df.values X, y = data[:, :-1], data[:, -1] # minimally prepare dataset X = X.astype('float') y = LabelEncoder().fit_transform(y.astype('str')) # transforms for the feature union transforms = list() transforms.append(('mms', MinMaxScaler())) transforms.append(('ss', StandardScaler())) transforms.append(('rs', RobustScaler())) transforms.append(('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal'))) transforms.append(('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform'))) transforms.append(('pca', PCA(n_components=7))) transforms.append(('svd', TruncatedSVD(n_components=7))) # create the feature union fu = FeatureUnion(transforms) # define the feature selection rfe = RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=15) # define the model model = LogisticRegression(solver='liblinear') # define the pipeline steps = list() steps.append(('fu', fu)) steps.append(('rfe', rfe)) steps.append(('m', model)) pipeline = Pipeline(steps=steps) # define the cross-validation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate model scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # report performance print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Again, we can see a further lift in performance from 96.8 percent with all extracted features to about 98.9 with feature selection used prior to modeling.

Accuracy: 0.989 (0.022)

**Can you achieve better performance with a different feature selection technique or with more or fewer selected features?**

Let me know what you discover in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered how to use feature extraction for data preparation with tabular data.

Specifically, you learned:

- Feature extraction provides an alternate approach to data preparation for tabular data, where all data transforms are applied in parallel to raw input data and combined together to create one large dataset.
- How to use the feature extraction method for data preparation to improve model performance over a baseline for a standard classification dataset.
- How to add feature selection to the feature extraction modeling pipeline to give a further lift in modeling performance on a standard dataset.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Use Feature Extraction on Tabular Data for Machine Learning appeared first on Machine Learning Mastery.

]]>