[scikit-learn] Problems with running GridSearchCV on a pipeline with a custom transformer

Sam Barnett sambarnett95 at gmail.com
Fri Aug 4 06:29:50 EDT 2017


Hi Andy,
I have since been able to resolve the pickling issue, though I am now
getting a failure reporting that an error message does not include the
expected string 'fit'. In general, I am trying to use the fit() method of
my classifier to instantiate a separate SVC() classifier with a custom
kernel, fit THAT to the data, then return this instance as the fitted
version of the new classifier. Is this possible in theory? If so, what is
the best way to implement it?
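
A minimal sketch of the delegation pattern asked about above, under stated assumptions: the names WrappedKernelSVC and example_kernel are hypothetical, and the RBF kernel is only a stand-in for the custom kernel in seqsvc.py. The usual scikit-learn convention is that fit() returns self rather than a different object, so the inner SVC is kept as an attribute and predict() forwards to it. Calling check_is_fitted before predicting raises a NotFittedError whose message mentions 'fit', which may be what the failing check is looking for.

```python
from functools import partial

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC
from sklearn.utils.validation import check_is_fitted


def example_kernel(X, Y, gamma=1.0):
    """Hypothetical stand-in for the custom kernel: Gram matrix between X and Y.

    Defined at module level (not inside fit) so that it can be pickled.
    """
    return rbf_kernel(X, Y, gamma=gamma)


class WrappedKernelSVC(ClassifierMixin, BaseEstimator):
    """Hypothetical classifier that delegates to an inner SVC."""

    def __init__(self, C=1.0, gamma=1.0):
        # Only store the parameters here; no validation or fitting.
        self.C = C
        self.gamma = gamma

    def fit(self, X, y):
        kernel = partial(example_kernel, gamma=self.gamma)
        # Keep the fitted SVC as an attribute instead of returning it.
        self.svc_ = SVC(C=self.C, kernel=kernel).fit(X, y)
        self.classes_ = self.svc_.classes_
        return self

    def predict(self, X):
        # Raises NotFittedError (its message mentions 'fit') if fit was not called.
        check_is_fitted(self, "svc_")
        return self.svc_.predict(X)
```

Because C and gamma are plain constructor arguments, get_params() and clone() work on this class, which is what GridSearchCV relies on when it copies the estimator for each parameter setting.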

As before, the requisite code and a .ipynb file are attached.

Best,
Sam

On Thu, Aug 3, 2017 at 6:35 PM, Andreas Mueller <t3kcit at gmail.com> wrote:

> Hi Sam.
> You need to put these into a reachable namespace (possibly as private
> functions) so that they can be pickled.
> Please stay on the sklearn mailing list, I might not have time to reply.
>
> Andy
>
>
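The pickling point above can be illustrated with the standard library alone. This is a sketch with hypothetical function names, not the code from the attached files: pickle stores functions by reference to their importable name, so a function defined inside another function (or inside fit()) generally cannot be pickled, while one defined at module level can be, provided the module is importable.

```python
import pickle


def linear_kernel_rows(X, Y):
    """Module-level function: pickle can refer to it by its qualified name."""
    return [[sum(a * b for a, b in zip(x, y)) for y in Y] for x in X]


def make_nested_kernel():
    """Returns a function defined in a local scope, as happens inside fit()."""
    def nested_kernel(X, Y):
        return linear_kernel_rows(X, Y)
    return nested_kernel


def can_pickle(obj):
    """Report whether pickle.dumps succeeds for obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


if __name__ == "__main__":
    print("module-level:", can_pickle(linear_kernel_rows))
    print("nested:", can_pickle(make_nested_kernel()))
```

If the kernel needs extra parameters, functools.partial around a module-level function keeps it picklable without resorting to a closure.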
> On 08/03/2017 01:24 PM, Sam Barnett wrote:
>
> Hi Andy,
>
> I've since tried a different solution: instead of a pipeline, I've simply
> created a classifier that is for the most part like svm.SVC, though it
> takes a few extra inputs for the sequentialisation step. I've used a Python
> function that can compute the Gram matrix between two datasets of any shape
> to pass into SVC(), though I'm now having trouble with pickling on the
> check_estimator test. It appears that SeqSVC.fit() doesn't like to have
> methods defined within it. Can you see how to pass this test? (the .ipynb
> file shows the error).
>
> Best,
> Sam
>
> On Wed, Aug 2, 2017 at 9:44 PM, Sam Barnett <sambarnett95 at gmail.com>
> wrote:
>
>> You're right: it does fail without GridSearchCV when I change the size of
>> seq_test. I will look at the transform tomorrow to see if I can work this
>> out. Thank you for your help so far!
>>
>> On Wed, Aug 2, 2017 at 9:20 PM, Andreas Mueller <t3kcit at gmail.com> wrote:
>>
>>> Change the size of seq_test in your notebook and you'll see the failure
>>> without GridSearchCV.
>>> I haven't looked at your code in detail, but transform is supposed to
>>> work on arbitrary new data with the same number of features.
>>> Your code requires the test data to have the same shape as the training
>>> data.
>>> Cross-validation will lead to training data and test data having
>>> different sizes. But I feel like something is already wrong if your
>>> test data size depends on your training data size.
>>>
>>>
>>>
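One way to meet the requirement described above is for the transformer to remember the training rows in fit() and have transform() return the kernel between the new rows and those stored rows. The sketch below assumes a plain RBF kernel in place of the sequentialised one, and GramTransformer is a hypothetical name, not the class from the notebook. The output of transform() has shape (n_new, n_train), which is the shape SVC(kernel='precomputed') expects at both fit and predict time.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.utils.validation import check_is_fitted


class GramTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: kernel values against the training rows."""

    def __init__(self, gamma=1.0):
        self.gamma = gamma

    def fit(self, X, y=None):
        # Remember the training rows; only the feature count must match later.
        self.X_fit_ = np.asarray(X, dtype=float)
        return self

    def transform(self, X):
        check_is_fitted(self, "X_fit_")
        X = np.asarray(X, dtype=float)
        if X.shape[1] != self.X_fit_.shape[1]:
            raise ValueError("Number of features differs from what was seen in `fit`")
        # Shape (n_new, n_train): works for any number of new rows.
        return rbf_kernel(X, self.X_fit_, gamma=self.gamma)


# Toy data: two well-separated clusters (made up for illustration).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(15, 2) - 2, rng.randn(15, 2) + 2])
y = np.array([0] * 15 + [1] * 15)

pipe = Pipeline([("gram", GramTransformer()), ("svc", SVC(kernel="precomputed"))])
grid = GridSearchCV(pipe, {"gram__gamma": [0.1, 1.0], "svc__C": [1.0, 10.0]}, cv=3)
grid.fit(X, y)
```

With this layout the kernel parameters can be searched alongside C, because the Gram matrix is recomputed inside each cross-validation split.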
>>> On 08/02/2017 03:08 PM, Sam Barnett wrote:
>>>
>>> Hi Andy,
>>>
>>> The purpose of the transformer is to take an ordinary kernel (in this
>>> case I have taken 'rbf' as a default) and return a 'sequentialised' kernel
>>> using a few extra parameters. Hence, the transformer takes an ordinary
>>> data-target pair X, y as its input, and the fit_transform(X, y) method will
>>> output the Gram matrix for X that is associated with this sequentialised
>>> kernel. In the pipeline, this Gram matrix is passed into an SVC classifier
>>> with the kernel parameter set to 'precomputed'.
>>>
>>> Therefore, I do not think your hacky solution would be possible.
>>> However, I am still unsure how to implement your first solution: won't the
>>> Gram matrix from the transformer contain all the necessary kernel values?
>>> Could you elaborate further?
>>>
>>>
>>> Best,
>>> Sam
>>>
>>> On Wed, Aug 2, 2017 at 5:05 PM, Andreas Mueller <t3kcit at gmail.com>
>>> wrote:
>>>
>>>> Hi Sam.
>>>> GridSearchCV will do cross-validation, which requires "transforming"
>>>> the test data.
>>>> The shape of the test-data will be different from the shape of the
>>>> training data.
>>>> You need to have the ability to compute the kernel between the training
>>>> data and new test data.
>>>>
>>>> A more hacky solution would be to compute the full kernel matrix in
>>>> advance and pass that to GridSearchCV.
>>>>
>>>> You probably don't need it here, but you should also check out what the
>>>> _pairwise attribute does in cross-validation,
>>>> because that is likely to come up when playing with kernels.
>>>>
>>>> Hth,
>>>> Andy
>>>>
>>>>
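The "compute the full kernel matrix in advance" option mentioned above could look roughly like this. It is a sketch on made-up toy data with an RBF kernel standing in for the sequentialised one. When the estimator is an SVC with kernel='precomputed', the cross-validation code slices the square matrix along both rows and columns for each split, so only the full Gram matrix needs to be supplied.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy data: two well-separated clusters (made up for illustration).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(15, 2) - 2, rng.randn(15, 2) + 2])
y = np.array([0] * 15 + [1] * 15)

# Full (n_samples, n_samples) Gram matrix, computed once up front.
K = rbf_kernel(X, X, gamma=0.5)

grid = GridSearchCV(SVC(kernel="precomputed"), {"C": [0.1, 1.0, 10.0]}, cv=3)
grid.fit(K, y)

# New samples need their kernel values against all training rows.
X_new = rng.randn(4, 2)
K_new = rbf_kernel(X_new, X, gamma=0.5)
predictions = grid.predict(K_new)
```

The drawback is that kernel parameters such as gamma are fixed before the search starts, so only the SVC's own parameters can be tuned this way.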
>>>> On 08/02/2017 08:38 AM, Sam Barnett wrote:
>>>>
>>>> Dear all,
>>>>
>>>> I have created a 2-step pipeline with a custom transformer followed by
>>>> a simple SVC classifier, and I wish to run a grid-search over it. I am able
>>>> to successfully create the transformer and the pipeline, and each of these
>>>> elements works fine. However, when I try to use the fit() method on my
>>>> GridSearchCV object, I get the following error:
>>>>
>>>>      57         # during fit.
>>>>      58         if X.shape != self.input_shape_:
>>>> ---> 59             raise ValueError('Shape of input is different from what was seen '
>>>>      60                              'in `fit`')
>>>>      61
>>>>
>>>> ValueError: Shape of input is different from what was seen in `fit`
>>>>
>>>> For a full breakdown of the problem, I have written a Jupyter notebook
>>>> showing exactly how the error occurs (this also contains all .py files
>>>> necessary to run the notebook). Can anybody see how to work through this?
>>>>
>>>> Many thanks,
>>>> Sam Barnett
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> scikit-learn mailing list
>>>> scikit-learn at python.org
>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: seqsvc.py
Type: text/x-python-script
Size: 3051 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20170804/0d150d4c/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Sequential Kernel SVC GridSearchCV Test.ipynb
Type: application/octet-stream
Size: 7678 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20170804/0d150d4c/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SeqKernelLucy.py
Type: text/x-python-script
Size: 2628 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20170804/0d150d4c/attachment-0003.bin>

