From pyformulas at gmail.com Sat Jun 2 01:13:52 2018 From: pyformulas at gmail.com (pyformulas) Date: Fri, 1 Jun 2018 23:13:52 -0600 Subject: [scikit-learn] Novel efficient one-shot optimizer for regression Message-ID: Hi, I created an algorithm that may solve linear regression problems with less time complexity than Singular Value Decomposition. It only requires the gradient and the diagonal of the Hessian to calculate the optimal weights. I attached the TensorFlow code below. I haven't been able to get it to work in pure NumPy yet, but I'm sure someone will be able to port it if it really does what it purports to do.

import numpy as np

# Small synthetic problem: Y is the target, X is a bias column
# plus three polynomial features of Y
Y = np.arange(10).reshape(10, 1) ** 0.5
bias_X = np.ones(10).reshape(10, 1)
X_feature1 = Y ** 3
X_feature2 = Y ** 4
X_feature3 = Y ** 5
X = np.concatenate((bias_X, X_feature1, X_feature2, X_feature3), axis=1)
num_features = 4

import tensorflow as tf

X_in = tf.placeholder(tf.float64, [None, num_features])
Y_in = tf.placeholder(tf.float64, [None, 1])
W = tf.placeholder(tf.float64, [num_features, 1])
W_squeezed = tf.squeeze(W)

Y_hat = tf.expand_dims(tf.tensordot(X_in, W_squeezed, ([1], [0])), axis=1)
loss = tf.reduce_mean(Y_in - Y_hat) ** 2  # use the Y_in placeholder, not the NumPy array Y

gradient = tf.gradients(loss, [W_squeezed])[0]
gradient_2nd = tf.diag_part(tf.hessians(loss, [W_squeezed])[0])

# One-shot update: jump each weight toward the vertex of its parabola
vertex_offset = -gradient / gradient_2nd / num_features
W_star = W_squeezed + vertex_offset
W_star = tf.expand_dims(W_star, axis=1)

with tf.Session() as sess:
    random_W = np.random.normal(size=(num_features, 1)).astype(np.float64)
    result1 = sess.run([loss, W_star, gradient, gradient_2nd],
                       feed_dict={X_in: X, Y_in: Y, W: random_W})
    random_loss = result1[0]
    optimal_W = result1[1]
    print('Random loss:', result1[0])
    print('Gradient:', result1[-2])
    print('2nd-order Gradient:', result1[-1])
    print('W:')
    print(random_W)
    print()
    print('W*:')
    print(result1[1])
    print()
    optimal_loss = sess.run(loss, feed_dict={X_in: X, Y_in: Y, W: optimal_W})
    print('Optimal loss:', optimal_loss)

-------------- next part -------------- An HTML attachment
was scrubbed... URL: From amirouche.boubekki at gmail.com Sun Jun 3 17:03:08 2018 From: amirouche.boubekki at gmail.com (Amirouche Boubekki) Date: Sun, 3 Jun 2018 23:03:08 +0200 Subject: [scikit-learn] Supervised prediction of multiple scores for a document Message-ID: Héllo, I started a natural language processing project a few weeks ago called wikimark (the code is all in wikimark.py). Given a text, it wants to return a dictionary scoring the input against vital articles categories, e.g.:

out = wikimark("""Peter Hintjens wrote about the relation between technology and culture. Without using the scientific tone of a state-of-the-art review of Anthropocene anthropology, he gives a fair amount of food for thought. According to Hintjens, technology is doomed to become cheap. As a matter of fact, intelligence tools will become more and more accessible, which will trigger a revolution to rebalance forces in society.""")

for category, score in out:
    print('{} ~ {}'.format(category, score))

The above program would output something like this:

Art ~ 0.1
Science ~ 0.5
Society ~ 0.4

Except not everything went as planned. Mind the fact that in the above example the total is equal to 1, but I could not achieve that at all.

I am using gensim to compute vectors of paragraphs (doc2vec) and then submit those vectors to svm.SVR in a one-vs-all strategy, i.e. a document is scored 1 if it's in that subcategory and zero otherwise. At prediction time, it goes through the same doc2vec pipeline. The computer will score *each paragraph* against the SVR models of Wikipedia vital article subcategories and get a value between 0 and 1 for *each paragraph*. I compute the sum and group by subcategory, and then I have a score per category for the input document.

It somewhat works. I made a web UI online; you can find it at https://sensimark.com where you can test it. You can directly access the full API, e.g.
https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&all=1 The output JSON document is a list of category dictionaries where the prediction key is associated with the average of the "prediction" of the subcategories. If you replace &all=1 by &top=5 you might get something else as top categories, e.g. https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=10 or https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=5 I wrote "prediction" with double quotes because the value you see is the result of some formula. Since the predictions I get are rather small (between 0 and 0.015), I apply the following formula:

value = math.exp(prediction)
magic = ((value * 100) - 110) * 100

in order to have the values spread between -200 and 200. Maybe this is a symptom that my model doesn't work at all.

Still, the top 10 results are almost always near each other (try with BBC articles on https://sensimark.com). It is only when a regression model is disqualified with a score of 0 that the results are simple to understand. Sadly, I don't have an example at hand to support that claim. You have to believe me.

I just figured, looking at the machine learning map, that my problem might be a classification problem, except I don't really want to know what *the* class of new documents is; I want to know what the different subjects dealt with in the document are, based on a hierarchical corpus. I don't want to guess a hierarchy! I want to know how the document content spreads over the different categories or subcategories.

I quickly read about multinomial regression; is it something you recommend I use? Maybe you think about something else?

Also, it seems I should benchmark / evaluate my model against LDA.

I am rather a noob in terms of data science and my math skills are not so fresh.
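The doc2vec plus one-vs-rest SVR pipeline described above can be sketched roughly as follows. This is a hypothetical illustration, not wikimark's actual code: random vectors stand in for the gensim doc2vec output, the category names are invented, and a softmax at the end is one assumed way to make the per-category scores sum to 1 (which the post says it could not achieve):

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical sketch: random vectors stand in for gensim doc2vec output,
# and the category names are invented.
rng = np.random.default_rng(42)
n_docs, dim = 200, 50
vectors = rng.normal(size=(n_docs, dim))
categories = ["Art", "Science", "Society"]
labels = rng.integers(0, len(categories), size=n_docs)

# One-vs-rest: each category gets its own SVR trained on a 0/1 target.
models = {}
for i, cat in enumerate(categories):
    target = (labels == i).astype(float)
    models[cat] = SVR().fit(vectors, target)

# Score a new document against every per-category model ...
new_doc = rng.normal(size=(1, dim))
raw = np.array([models[cat].predict(new_doc)[0] for cat in categories])

# ... then squash with a softmax so the scores are positive and sum to 1
# (plain SVR outputs are unconstrained, hence the totals never adding up).
scores = np.exp(raw) / np.exp(raw).sum()
```

The softmax is only one choice; any monotone mapping onto the simplex would do, but it avoids the ad-hoc rescaling formula above.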
I am more likely looking for ideas on what algorithms, fine-tuning, and data science practices I should follow that don't involve writing my own algorithm.

Thanks in advance! -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Sun Jun 3 17:20:49 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Sun, 3 Jun 2018 17:20:49 -0400 Subject: [scikit-learn] Supervised prediction of multiple scores for a document In-Reply-To: <037411E4-8B6D-4EAB-A9C6-45AA73479364@sebastianraschka.com> References: <037411E4-8B6D-4EAB-A9C6-45AA73479364@sebastianraschka.com> Message-ID: Sorry, I had a copy & paste error; I meant "LogisticRegression(..., multi_class='multinomial')" and not "LogisticRegression(..., multi_class='ovr')"

> On Jun 3, 2018, at 5:19 PM, Sebastian Raschka wrote:
>
> Hi,
>
>> I quickly read about multinomial regression, is it something you recommend I use? Maybe you think about something else?
>
> Multinomial regression (or Softmax Regression) should give you results somewhat similar to a linear SVC (or logistic regression with OvO or OvR). The theoretical difference is that Softmax regression assumes that the classes are mutually exclusive, which is probably not the case in your setting, since e.g. an article could be both "Art" and "Science" to some extent. Here's a quick summary of softmax regression, if useful: https://sebastianraschka.com/faq/docs/softmax_regression.html. In scikit-learn, you can use it via LogisticRegression(..., multi_class='ovr').
>
> However, spontaneously, I would say that Latent Dirichlet Allocation could be a better choice in your case. I.e., fit the model on the corpus for a specified number of topics (e.g. 10, but it depends on your dataset; I would experiment a bit here), look at the top words in each topic, and then assign a topic label to each topic. Then, for a given article, you can assign e.g. the top X labeled topics.
> Best, > Sebastian > > [snip]
From mail at sebastianraschka.com Sun Jun 3 17:19:32 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Sun, 3 Jun 2018 17:19:32 -0400 Subject: [scikit-learn] Supervised prediction of multiple scores for a document In-Reply-To: References: Message-ID: <037411E4-8B6D-4EAB-A9C6-45AA73479364@sebastianraschka.com> Hi,

> I quickly read about multinomial regression, is it something you recommend I use? Maybe you think about something else?

Multinomial regression (or Softmax Regression) should give you results somewhat similar to a linear SVC (or logistic regression with OvO or OvR). The theoretical difference is that Softmax regression assumes that the classes are mutually exclusive, which is probably not the case in your setting, since e.g. an article could be both "Art" and "Science" to some extent. Here's a quick summary of softmax regression, if useful: https://sebastianraschka.com/faq/docs/softmax_regression.html. In scikit-learn, you can use it via LogisticRegression(..., multi_class='ovr').

However, spontaneously, I would say that Latent Dirichlet Allocation could be a better choice in your case. I.e., fit the model on the corpus for a specified number of topics (e.g. 10, but it depends on your dataset; I would experiment a bit here), look at the top words in each topic, and then assign a topic label to each topic. Then, for a given article, you can assign e.g. the top X labeled topics.
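In code, the softmax route looks roughly like this. A hypothetical sketch with invented data, not from the thread; note the correction posted elsewhere in this thread that the softmax variant is multi_class='multinomial', not 'ovr':

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical sketch: random vectors stand in for doc2vec document
# vectors, with three invented, mutually exclusive categories.
# In 2018-era scikit-learn the softmax variant was selected with
# LogisticRegression(multi_class='multinomial'); current versions use
# the multinomial (softmax) formulation by default for multiclass data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
y = rng.integers(0, 3, size=300)

clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X[:2])
# Each row is a proper distribution over the 3 classes: non-negative
# and summing to 1, unlike independent one-vs-rest SVR scores.
```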
Best, Sebastian

> On Jun 3, 2018, at 5:03 PM, Amirouche Boubekki wrote:
> [snip]
From sepand.haghighi at yahoo.com Mon Jun 4 11:06:52 2018 From: sepand.haghighi at yahoo.com (Sepand Haghighi) Date: Mon, 4 Jun 2018 15:06:52 +0000 (UTC) Subject: [scikit-learn] PyCM: Multiclass confusion matrix library in Python References: <21980772.1932951.1528124812254.ref@mail.yahoo.com> Message-ID: <21980772.1932951.1528124812254@mail.yahoo.com> Hi Stuart, Thanks ;-) The activation threshold is in our plan and will be added in the next release (in the next few weeks). Best Regards, Sepand Haghighi On Thursday, May 31, 2018, 9:56:43 PM GMT+4:30, Stuart Reynolds wrote: Hi Sepand, Thanks for this -- looks useful. I had to write something similar (for the binary case) and wish scikit had something like this. I wonder if there's something similar for the binary class case where the prediction is a real value (activation) and from this we can also derive - CMs for all prediction cutoffs (or a set of cutoffs?) - scores over all cutoffs (AUC, AP, ...) For me, in analyzing (binary class) performance, reporting scores for a single cutoff is less useful than seeing how the many scores (tpr, ppv, mcc, relative risk, chi^2, ...) vary at various false positive rates, or prediction quantiles. Does your library provide any tools for the binary case where we add an activation threshold? Thanks again for releasing this and providing pip packaging. - Stuart On Thu, May 31, 2018 at 6:05 AM, Sepand Haghighi via scikit-learn wrote: > PyCM is a multi-class confusion matrix library written in Python that > supports both input data vectors and direct matrix, and a proper tool for > post-classification model evaluation that supports most classes and overall > statistics parameters.
PyCM is the swiss-army knife of confusion matrices, > targeted mainly at data scientists who need a broad array of metrics for > predictive models and an accurate evaluation of a large variety of > classifiers. > > Github Repo : https://github.com/sepandhaghighi/pycm > > Webpage : http://pycm.shaghighi.ir/ > > JOSS Paper : https://doi.org/10.21105/joss.00729 > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Mon Jun 4 11:40:51 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 4 Jun 2018 11:40:51 -0400 Subject: [scikit-learn] PyCM: Multiclass confusion matrix library in Python In-Reply-To: References: <253486979.646953.1527771937416.ref@mail.yahoo.com> <253486979.646953.1527771937416@mail.yahoo.com> Message-ID: <4a3499e7-d5d9-6194-5d63-9ba9e8d36f56@gmail.com> On 5/31/18 1:26 PM, Stuart Reynolds wrote: > Hi Sepand, > > Thanks for this -- looks useful. I had to write something similar (for > the binary case) and wish scikit had something like this. Which part of it? I'm not entirely sure I understand what the core functionality is. > > I wonder if there's something similar for the binary class case where > the prediction is a real value (activation) and from this we can also > derive > - CMs for all prediction cutoff (or set of cutoffs?) > - scores over all cutoffs (AUC, AP, ...) AUC and AP are by definition over all cut-offs. And CMs for all cutoffs don't seem like a good idea, because that'll be n_samples many in the general case. If you want to specify a set of cutoffs, that would be pretty easy to do. How do you find these cut-offs, though? > > For me, in analyzing (binary class) performance, reporting scores for > a single cutoff is less useful than seeing how the many scores (tpr, > ppv, mcc, relative risk, chi^2, ...)
vary at various false positive > rates, or prediction quantiles. You can totally do that with sklearn right now. Granted, it's not as convenient as it could be, but we're working on it. What's really the crucial point for me is how to pick the cut-offs. Cheers, Andy From jbbrown at kuhp.kyoto-u.ac.jp Mon Jun 4 11:56:22 2018 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Tue, 5 Jun 2018 00:56:22 +0900 Subject: [scikit-learn] PyCM: Multiclass confusion matrix library in Python In-Reply-To: <4a3499e7-d5d9-6194-5d63-9ba9e8d36f56@gmail.com> References: <253486979.646953.1527771937416.ref@mail.yahoo.com> <253486979.646953.1527771937416@mail.yahoo.com> <4a3499e7-d5d9-6194-5d63-9ba9e8d36f56@gmail.com> Message-ID: Hello community, I wonder if there's something similar for the binary class case where, >> the prediction is a real value (activation) and from this we can also >> derive >> - CMs for all prediction cutoff (or set of cutoffs?) >> - scores over all cutoffs (AUC, AP, ...) >> > AUC and AP are by definition over all cut-offs. And CMs for all > cutoffs doesn't seem a good idea, because that'll be n_samples many > in the general case. If you want to specify a set of cutoffs, that would > be pretty easy to do. > How do you find these cut-offs, though? > >> >> For me, in analyzing (binary class) performance, reporting scores for >> a single cutoff is less useful than seeing how the many scores (tpr, >> ppv, mcc, relative risk, chi^2, ...) vary at various false positive >> rates, or prediction quantiles. >> > In terms of finding cut-offs, one could use the idea of metric surfaces that I recently proposed https://onlinelibrary.wiley.com/doi/abs/10.1002/minf.201700127 and then plot your per-threshold TPR/TNR pairs on the PPV/MCC/etc surfaces to determine what conditions you are willing to accept against the background of your prediction problem. 
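For the per-threshold sweep discussed above (which, as noted, is already doable with scikit-learn today), a minimal hypothetical sketch with synthetic activations and invented data:

```python
import numpy as np
from sklearn.metrics import roc_curve, confusion_matrix

# Hypothetical sketch: sweep every cutoff of a real-valued activation,
# getting one (TPR, TNR) pair per threshold, plus a confusion matrix
# at any single chosen cutoff.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
scores = y_true * 0.5 + rng.normal(scale=0.5, size=500)  # noisy activations

fpr, tpr, thresholds = roc_curve(y_true, scores)
tnr = 1.0 - fpr  # one (tpr[i], tnr[i]) pair for each thresholds[i]

cutoff = thresholds[len(thresholds) // 2]  # pick any threshold of interest
cm = confusion_matrix(y_true, (scores >= cutoff).astype(int))
```

From these per-threshold pairs, the derived scores (ppv, mcc, etc.) can then be computed at each cutoff.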
I use these surfaces (a) to think about the prediction problem before any attempt at modeling is made, and (b) to deconstruct results such as "Accuracy=85%" into interpretations in the context of my field and the data being predicted. Hope this contributes a bit of food for thought. J.B. -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Mon Jun 4 12:06:40 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 4 Jun 2018 12:06:40 -0400 Subject: [scikit-learn] PyCM: Multiclass confusion matrix library in Python In-Reply-To: References: <253486979.646953.1527771937416.ref@mail.yahoo.com> <253486979.646953.1527771937416@mail.yahoo.com> <4a3499e7-d5d9-6194-5d63-9ba9e8d36f56@gmail.com> Message-ID: Is that Jet?! https://www.youtube.com/watch?v=xAoljeRJ3lU ;) On 6/4/18 11:56 AM, Brown J.B. via scikit-learn wrote: > Hello community, > > I wonder if there's something similar for the binary class > case where, > the prediction is a real value (activation) and from this we > can also > derive > - CMs for all prediction cutoff (or set of cutoffs?) > - scores over all cutoffs (AUC, AP, ...) > > AUC and AP are by definition over all cut-offs. And CMs for all > cutoffs doesn't seem a good idea, because that'll be n_samples many > in the general case. If you want to specify a set of cutoffs, that > would be pretty easy to do. > How do you find these cut-offs, though? > > > For me, in analyzing (binary class) performance, reporting > scores for > a single cutoff is less useful than seeing how the many scores > (tpr, > ppv, mcc, relative risk, chi^2, ...) vary at various false > positive > rates, or prediction quantiles.
> > In terms of finding cut-offs, one could use the idea of metric > surfaces that I recently proposed > https://onlinelibrary.wiley.com/doi/abs/10.1002/minf.201700127 > and then plot your per-threshold TPR/TNR pairs on the PPV/MCC/etc > surfaces to determine what conditions you are willing to accept > against the background of your prediction problem. > > I use these surfaces (a) to think about the prediction problem before > any attempt at modeling is made, and (b) to deconstruct results such > as "Accuracy=85%" into interpretations in the context of my field and > the data being predicted. > > Hope this contributes a bit of food for thought. > J.B. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Jun 4 21:09:57 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 5 Jun 2018 11:09:57 +1000 Subject: [scikit-learn] PyCM: Multiclass confusion matrix library in Python In-Reply-To: References: <253486979.646953.1527771937416.ref@mail.yahoo.com> <253486979.646953.1527771937416@mail.yahoo.com> <4a3499e7-d5d9-6194-5d63-9ba9e8d36f56@gmail.com> Message-ID: > > Thanks for this -- looks useful. I had to write something similar (for >> the binary case) and wish scikit had something like this. > > > Which part of it? I'm not entirely sure I understand what the core > functionality is. > > I think the core is efficiently evaluating the full set of metrics appropriate for the kind of task. We now support multi-metric scoring in things like cross_validation and GridSearchCV (but not in other CV implementations yet), but:

1. it's not efficient (there are PRs in progress to work around this, but they are definitely work-arounds in the sense that we're still repeatedly calling metric functions rather than calculating sufficient statistics once), and
2. we don't have a pre-defined set of scorers appropriate to binary classification; or for multiclass classification with 4 classes, one of which is the majority "no finding" class, etc.

But assuming we could solve or work around the first issue, having an interface, in the core library or elsewhere, which gave us a series of appropriately-named scorers for different task types might be neat and avoid code that a lot of people repeat:

def get_scorers_for_binary(pos_label, neg_label, proba_thresholds=(0.5,)):
    return {'precision:p>0.5': make_scorer(precision_score, pos_label=pos_label),
            'accuracy:p>0.5': 'accuracy',
            'roc_auc': 'roc_auc',
            'log_loss': 'log_loss',
            ...}

def get_scorers_for_multiclass(pos_labels, neg_labels=()):
    out = {'accuracy': 'accuracy',
           'mcc': make_scorer(matthews_corrcoef),
           'cohen_kappa': make_scorer(cohen_kappa_score),
           'precision_macro': make_scorer(precision_score, labels=pos_labels, average='macro'),
           'precision_weighted': make_scorer(precision_score, labels=pos_labels, average='weighted'),
           ...}
    if neg_labels:
        # micro-average precision is != accuracy only if some labels are excluded
        out['precision_micro'] = make_scorer(precision_score, labels=pos_labels, average='micro')
        ...
    return out

I note some risk of encouraging bad practice around multiple hypotheses, etc... but generally I think this would be helpful to users. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbbrown at kuhp.kyoto-u.ac.jp Tue Jun 5 02:48:17 2018 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.)
Date: Tue, 5 Jun 2018 15:48:17 +0900 Subject: [scikit-learn] PyCM: Multiclass confusion matrix library in Python In-Reply-To: References: <253486979.646953.1527771937416.ref@mail.yahoo.com> <253486979.646953.1527771937416@mail.yahoo.com> <4a3499e7-d5d9-6194-5d63-9ba9e8d36f56@gmail.com> Message-ID: 2018-06-05 1:06 GMT+09:00 Andreas Mueller : > Is that Jet?! > > https://www.youtube.com/watch?v=xAoljeRJ3lU > > ;) > Quite an entertaining presentation and informative to the non-expert about color theory, though I'm not sure I'd go so far as to call jet "evil" and that everyone hates it. Actually, I didn't know that the colormap known as Jet actually had a name... I had reverse engineered it to reproduce what I saw elsewhere. I suppose I'm glad I have already built my infrastructure's version of the metric surface plotter to allow complete color customization at runtime from the CLI, and can then tailor results to my audiences. :) I'll keep this video's explanation in mind - thanks for the reference. Cheers, J.B. > On 6/4/18 11:56 AM, Brown J.B. via scikit-learn wrote: > > Hello community, > > I wonder if there's something similar for the binary class case where, >>> the prediction is a real value (activation) and from this we can also >>> derive >>> - CMs for all prediction cutoff (or set of cutoffs?) >>> - scores over all cutoffs (AUC, AP, ...) >>> >> AUC and AP are by definition over all cut-offs. And CMs for all >> cutoffs doesn't seem a good idea, because that'll be n_samples many >> in the general case. If you want to specify a set of cutoffs, that would >> be pretty easy to do. >> How do you find these cut-offs, though? >> >>> >>> For me, in analyzing (binary class) performance, reporting scores for >>> a single cutoff is less useful than seeing how the many scores (tpr, >>> ppv, mcc, relative risk, chi^2, ...) vary at various false positive >>> rates, or prediction quantiles.
>>> >> > In terms of finding cut-offs, one could use the idea of metric surfaces > that I recently proposed > https://onlinelibrary.wiley.com/doi/abs/10.1002/minf.201700127 > and then plot your per-threshold TPR/TNR pairs on the PPV/MCC/etc surfaces > to determine what conditions you are willing to accept against the > background of your prediction problem. > > I use these surfaces (a) to think about the prediction problem before any > attempt at modeling is made, and (b) to deconstruct results such as > "Accuracy=85%" into interpretations in the context of my field and the data > being predicted. > > Hope this contributes a bit of food for thought. > J.B. > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nelle.varoquaux at gmail.com Tue Jun 5 20:06:37 2018 From: nelle.varoquaux at gmail.com (Nelle Varoquaux) Date: Tue, 5 Jun 2018 17:06:37 -0700 Subject: [scikit-learn] 2018 John Hunter Excellence in Plotting Contest Message-ID: Hello everyone, Sorry about the cross-posting. There's a couple more days to submit to the John Hunter Excellence in Plotting Competition! If you have any scientific plot worth sharing, submit an entry before June 8th. For more information, see below. Thanks, Nelle In memory of John Hunter, we are pleased to be reviving the SciPy John Hunter Excellence in Plotting Competition for 2018. This open competition aims to highlight the importance of data visualization to scientific progress and showcase the capabilities of open source software. Participants are invited to submit scientific plots to be judged by a panel. The winning entries will be announced and displayed at the conference. 
John Hunter's family and NumFocus are graciously sponsoring cash prizes for the winners in the following amounts:

- 1st prize: $1000
- 2nd prize: $750
- 3rd prize: $500

- Entries must be submitted by June 8th to the form at https://goo.gl/forms/7q86zgu5OYUOjODH3 .
- Winners will be announced at Scipy 2018 in Austin, TX.
- Participants do not need to attend the Scipy conference.
- Entries may take the definition of "visualization" rather broadly. Entries may be, for example, a traditional printed plot, an interactive visualization for the web, or an animation.
- Source code for the plot must be provided, in the form of Python code and/or a Jupyter notebook, along with a rendering of the plot in a widely used format. This may be, for example, PDF for print, standalone HTML and Javascript for an interactive plot, or MPEG-4 for a video. If the original data can not be shared for reasons of size or licensing, "fake" data may be substituted, along with an image of the plot using real data.
- Each entry must include a 300-500 word abstract describing the plot and its importance for a general scientific audience.
- Entries will be judged on their clarity, innovation and aesthetics, but most importantly for their effectiveness in communicating a real-world problem. Entrants are encouraged to submit plots that were used during the course of research or work, rather than merely being hypothetical.
- SciPy reserves the right to display any and all entries, whether prize-winning or not, at the conference, use in any materials or on its website, with attribution to the original author(s).

SciPy John Hunter Excellence in Plotting Competition Co-Chairs Thomas Caswell Michael Droettboom Nelle Varoquaux -------------- next part -------------- An HTML attachment was scrubbed...
URL: From t3kcit at gmail.com Wed Jun 6 13:33:18 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 6 Jun 2018 13:33:18 -0400 Subject: [scikit-learn] PyCM: Multiclass confusion matrix library in Python In-Reply-To: References: <253486979.646953.1527771937416.ref@mail.yahoo.com> <253486979.646953.1527771937416@mail.yahoo.com> <4a3499e7-d5d9-6194-5d63-9ba9e8d36f56@gmail.com> Message-ID: On 6/5/18 2:48 AM, Brown J.B. via scikit-learn wrote: > > > 2018-06-05 1:06 GMT+09:00 Andreas Mueller >: > > Is that Jet?! > > https://www.youtube.com/watch?v=xAoljeRJ3lU > > > ;) > > > Quite an entertaining presentation and informative to the non-expert > about color theory, though I'm not sure I'd go so far as to call jet > "evil" and that everyone hates it. > Actually, I didn't know that the colormap known as Jet actually had a > name...I had reverse engineered it to reproduce what I saw elsewhere. > I suppose I'm glad I have already built my infrastructure's version of > the metric surface plotter to allow complete color customization at > runtime from the CLI, and can then tailor results to my audiences. :) From what I understood, there is evidence of misdiagnosis because of the use of jet. The main issue is that it creates borders in the image where there are none, and that seems like something that might be an issue in your application as well. -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Thu Jun 7 14:50:16 2018 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 7 Jun 2018 11:50:16 -0700 Subject: [scikit-learn] NEP: Random Number Generator Policy Message-ID: https://mail.python.org/pipermail/numpy-discussion/2018-June/078126.html Hi, sklearners! I have a NEP out for discussion that proposes a change in numpy.random's stream-compatibility policy. As scikit-learn is a well-disciplined consumer of reproducible streams, I would appreciate your input on the numpy-discussion thread linked above.
The very short form is that there is a new PRNG subsystem being developed with better core PRNGs (among other things, providing nice features like independent streams for parallel computations), and we would like to relax our strict stream-compatibility policy for the non-uniform distributions in this new subsystem so that we can improve our algorithms. The core uniform numbers would still be strictly stream-compatible across numpy versions. But we would like to be able to upgrade our non-uniform algorithms, for example, to make normal variates faster to generate. RandomState would be frozen and subject to a long deprecation cycle for a period of strict backwards compatibility. There would be some non-deprecated provision to get strictly-compatible streams for a subset of distributions for the limited purpose of generating test data for unit tests. Please read the NEP and the thread through. I do propose at least one alternative in the thread and would like some feedback on it. I would also appreciate it if we could consolidate the discussion on the numpy-discussion thread and not have a split-off conversation here too. Thank you very much! I appreciate your attention. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco From guettliml at thomas-guettler.de Fri Jun 8 04:48:59 2018 From: guettliml at thomas-guettler.de (=?UTF-8?Q?Thomas_G=c3=bcttler?=) Date: Fri, 8 Jun 2018 10:48:59 +0200 Subject: [scikit-learn] Mapping fulltext OCR to issue type Message-ID: We run an issue tracking application. A lot of issues get generated from scanned letters. I have 70k full text OCR result files. They were created with Tesseract. Every file of these 70k files corresponds to an issue. Each issue has an issue type.
I want to use machine learning and in the future the machine should be able to guess the issue type by looking at the full text OCR. The issue types are not a simple list; they form a tree. Example: electricity / power grid electricity / outages customer support / invoices / complaint customer support / invoices / tax .... If the machine can't guess "customer support / invoices / complaint" it would be nice if it could at least guess roughly the parent issue type: "customer support / invoices" I have never used scikit-learn before, but I have used Python for several years. Could you please guide me in the right direction? Regards, Thomas Güttler -- Thomas Guettler http://www.thomas-guettler.de/ I am looking for feedback: https://github.com/guettli/programming-guidelines From francois.dion at gmail.com Fri Jun 8 07:32:06 2018 From: francois.dion at gmail.com (Francois Dion) Date: Fri, 8 Jun 2018 07:32:06 -0400 Subject: [scikit-learn] Trained model repository? Message-ID: Does anybody know of a repo or site that has scikit-learn pre-trained models / pipelines? There are specific projects that might include a model in their github repo (I've done that for a PyData talk in the past), and I've also seen specific frameworks including some pre-trained neural networks (keras and caffe2 for example), but I don't think there's anything for scikit models. I've asked around and on twitter, but nothing. I figured, if anybody would know, it would have to be on the sklearn list. Francois -------------- next part -------------- An HTML attachment was scrubbed... URL: From randalljellis at gmail.com Fri Jun 8 09:34:11 2018 From: randalljellis at gmail.com (Randy Ellis) Date: Fri, 8 Jun 2018 09:34:11 -0400 Subject: [scikit-learn] Trained model repository?
In-Reply-To: References: Message-ID: Not sure if sklearn has one, but Tensorflow has Tensorhub https://www.tensorflow.org/hub/ On Fri, Jun 8, 2018 at 7:32 AM, Francois Dion wrote: > Does anybody know of a repo or site that has scikit-learn pre-trained > models / pipelines? > > There are specific projects that might include a model in their github > repo (I've done that for a PyData talk in the past), and I've also seen > specific frameworks including some pre-trained neural networks (keras and > caffe2 for example), but I don't think there's anything for scikit models. > > I've asked around and on twitter, but nothing. I figured, if anybody would > know, it would have to be on the sklearn list. > > > Francois > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- *Randall J. Ellis, B.S.* PhD Student, Biomedical Science, Mount Sinai Special Volunteer, http://www.michaelideslab.org/, NIDA IRP Cell: (954)-260-9891 -------------- next part -------------- An HTML attachment was scrubbed... URL: From francois.dion at gmail.com Fri Jun 8 14:13:25 2018 From: francois.dion at gmail.com (Francois Dion) Date: Fri, 8 Jun 2018 14:13:25 -0400 Subject: [scikit-learn] Trained model repository? In-Reply-To: References: Message-ID: Thanks. Speaking tomorrow at SouthEast Linux Fest in Charlotte and am providing examples of pre-trained models. Once more, I will demo a pre-trained model I did, but It would have been nice to point to a hub / repo. Francois On Fri, Jun 8, 2018 at 9:34 AM, Randy Ellis wrote: > Not sure if sklearn has one, but Tensorflow has Tensorhub https://www. > tensorflow.org/hub/ > > On Fri, Jun 8, 2018 at 7:32 AM, Francois Dion > wrote: > >> Does anybody know of a repo or site that has scikit-learn pre-trained >> models / pipelines? 
>> >> There are specific projects that might include a model in their github >> repo (I've done that for a PyData talk in the past), and I've also seen >> specific frameworks including some pre-trained neural networks (keras and >> caffe2 for example), but I don't think there's anything for scikit models. >> >> I've asked around and on twitter, but nothing. I figured, if anybody >> would know, it would have to be on the sklearn list. >> >> >> Francois >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > *Randall J. Ellis, B.S.* > PhD Student, Biomedical Science, Mount Sinai > Special Volunteer, http://www.michaelideslab.org/, NIDA IRP > Cell: (954)-260-9891 > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Fri Jun 8 14:18:01 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 8 Jun 2018 14:18:01 -0400 Subject: [scikit-learn] Trained model repository? In-Reply-To: References: Message-ID: <868e24f8-464d-e351-c5e3-e30a787a1f56@gmail.com> I'm not sure what you mean. Pre-trained on what task? And what kind of models? The only task I can think that would make sense would be text data with BoW representation, and I'm not sure which models we could pretrain for that. Maybe PCA, topic models and MLP? To what end though for PCA and topic models? And if you're serious about MLPs, why not use keras? On 6/8/18 7:32 AM, Francois Dion wrote: > Does anybody know of a repo or site that has scikit-learn pre-trained > models / pipelines? 
> > There are specific projects that might include a model in their github > repo (I've done that for a PyData talk in the past), and I've also > seen specific frameworks including some pre-trained neural networks > (keras and caffe2 for example), but I don't think there's anything for > scikit models. > > I've asked around and on twitter, but nothing. I figured, if anybody > would know, it would have to be on the sklearn list. > > > Francois > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeff1evesque at yahoo.com Sun Jun 10 22:11:18 2018 From: jeff1evesque at yahoo.com (Jeffrey Levesque) Date: Sun, 10 Jun 2018 22:11:18 -0400 Subject: [scikit-learn] Jeff Levesque: profit functionality Message-ID: <8BA313C7-26FE-41ED-93D1-6EDF5BE024EA@yahoo.com> Hi guys, Does sklearn have both probit, and logic functionality? Thank you, Jeff Levesque https://github.com/jeff1evesque From jeff1evesque at yahoo.com Sun Jun 10 23:26:55 2018 From: jeff1evesque at yahoo.com (Jeffrey Levesque) Date: Sun, 10 Jun 2018 23:26:55 -0400 Subject: [scikit-learn] Jeff Levesque: profit functionality In-Reply-To: <8BA313C7-26FE-41ED-93D1-6EDF5BE024EA@yahoo.com> References: <8BA313C7-26FE-41ED-93D1-6EDF5BE024EA@yahoo.com> Message-ID: Sorry typo: meant logit, and probit. Thank you, Jeff Levesque https://github.com/jeff1evesque > On Jun 10, 2018, at 10:11 PM, Jeffrey Levesque via scikit-learn wrote: > > Hi guys, > Does sklearn have both probit, and logic functionality? 
> > Thank you, > > Jeff Levesque > https://github.com/jeff1evesque > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From alexandre.gramfort at inria.fr Mon Jun 11 04:29:19 2018 From: alexandre.gramfort at inria.fr (Alexandre Gramfort) Date: Mon, 11 Jun 2018 10:29:19 +0200 Subject: [scikit-learn] Jeff Levesque: profit functionality In-Reply-To: References: <8BA313C7-26FE-41ED-93D1-6EDF5BE024EA@yahoo.com> Message-ID: no only logit with LogisticRegression estimator. Alex From dylanf123 at gmail.com Mon Jun 11 05:02:45 2018 From: dylanf123 at gmail.com (Dylan Fernando) Date: Mon, 11 Jun 2018 19:02:45 +1000 Subject: [scikit-learn] scikit-learn-contrib: building Cython, cpp files In-Reply-To: References: Message-ID: Hi Joris, Thanks, I'll try that. On Fri, Jun 1, 2018 at 5:20 AM, Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > Hi Dylan, > > In case you are still looking for a solution:I didn't directly find good > templates for packages that depend on cython (there are quite some, but > from quickly looking at them, I didn't find a simple one), but you can > maybe have a look at one of the other scikit-learn-contrib packages that > uses cython: https://github.com/scikit-learn-contrib/hdbscan > And you can check here how to adapt the Extension class to specify c++: > http://cython.readthedocs.io/en/latest/src/userguide/ > wrapping_CPlusPlus.html#specify-c-language-in-setup-py > > Best, > Joris > > > > 2018-05-29 6:56 GMT+02:00 Dylan Fernando : > >> Hi, >> >> I would like to publish this: >> https://github.com/dil12321/scikit-learn/tree/aode >> https://github.com/scikit-learn/scikit-learn/pull/11093 >> >> as a scikit-learn-contrib project. However, I'm not sure how to write the >> setup.py file so that aode_helper.cpp and _aode.pyx get included in the >> package, and run correctly. How should I write setup.py? 
>> Regards, >> Dylan >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeff1evesque at yahoo.com Mon Jun 11 07:23:12 2018 From: jeff1evesque at yahoo.com (Jeffrey Levesque) Date: Mon, 11 Jun 2018 07:23:12 -0400 Subject: [scikit-learn] Jeff Levesque: association rules Message-ID: <348CB71E-45A6-478C-893C-8BB0E5765FEB@yahoo.com> Hi guys, What are some good approaches for association rules? Is there something built in, or do people sometimes use alternate packages, maybe Apache Spark? Thank you, Jeff Levesque https://github.com/jeff1evesque From stuart at stuartreynolds.net Mon Jun 11 12:18:18 2018 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Mon, 11 Jun 2018 09:18:18 -0700 Subject: [scikit-learn] Jeff Levesque: profit functionality In-Reply-To: References: <8BA313C7-26FE-41ED-93D1-6EDF5BE024EA@yahoo.com> Message-ID: Scikit has a section on 'GLMs' 1.1. Generalized Linear Models http://scikit-learn.org/stable/modules/linear_model.html not covered there? (That page doesn't look like GLMs -- mostly it covers different fitting, loss and regularization methods, but not general functional distributions). If not, check out statsmodels' GLM http://www.statsmodels.org/dev/glm.html http://www.statsmodels.org/dev/generated/statsmodels.genmod.generalized_linear_model.GLM.html On Mon, Jun 11, 2018 at 1:29 AM, Alexandre Gramfort wrote: > no only logit with LogisticRegression estimator.
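[Editor's note: as this thread says, scikit-learn ships only the logit link via LogisticRegression, while statsmodels (linked above) provides probit. The sketch below is a minimal didactic illustration, not statsmodels' implementation: it fits a probit model by maximizing its log-likelihood directly with SciPy, on invented toy data, and compares the result to scikit-learn's logit fit.]

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

# Toy binary-classification data, invented for illustration.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) + 0.5 * rng.normal(size=200) > 0).astype(int)

# Logit link: built into scikit-learn (large C ~= nearly unpenalized).
logit = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)

# Probit link: maximize the log-likelihood by hand.
def neg_log_likelihood(w):
    p = norm.cdf(X @ w)               # probit inverse link
    p = np.clip(p, 1e-9, 1 - 1e-9)    # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

w_probit = minimize(neg_log_likelihood, np.zeros(2), method="BFGS").x
probit_pred = (norm.cdf(X @ w_probit) > 0.5).astype(int)
```

On data like this the two links give near-identical classifications; for a maintained probit implementation with standard errors, the statsmodels GLM pointers above are the practical route.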
> > Alex > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From mail at sebastianraschka.com Mon Jun 11 13:05:23 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Mon, 11 Jun 2018 13:05:23 -0400 Subject: [scikit-learn] Jeff Levesque: association rules In-Reply-To: <348CB71E-45A6-478C-893C-8BB0E5765FEB@yahoo.com> References: <348CB71E-45A6-478C-893C-8BB0E5765FEB@yahoo.com> Message-ID: <69BEFE99-D433-4605-A005-3787455607AF@sebastianraschka.com> Hi Jeff, had a similar question 1-2 years ago and ended up using Chris Borgelt's C command line tools but for convenience, i also implemented basic association rule & frequent pattern mining in Python here: http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/ Best, Sebastian > On Jun 11, 2018, at 7:23 AM, Jeffrey Levesque via scikit-learn wrote: > > Hi guys, > What are some good approaches for association rules. Is there something built in, or do people sometimes use alternate packages, maybe apache spark? > > Thank you, > > Jeff Levesque > https://github.com/jeff1evesque > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From dmitrii.ignatov at gmail.com Mon Jun 11 14:17:30 2018 From: dmitrii.ignatov at gmail.com (Dmitry Ignatov) Date: Mon, 11 Jun 2018 20:17:30 +0200 Subject: [scikit-learn] Jeff Levesque: association rules In-Reply-To: <69BEFE99-D433-4605-A005-3787455607AF@sebastianraschka.com> References: <348CB71E-45A6-478C-893C-8BB0E5765FEB@yahoo.com> <69BEFE99-D433-4605-A005-3787455607AF@sebastianraschka.com> Message-ID: Hi All, A good tool. I also use SPMF (Java-based library) and Apache Spark (they do not have closed itemsets there). There is a part of Orange data mining on association rules mining, which can be used as a Python library. 
A couple of years ago I asked Gilles Louppe about frequent itemset mining tools within scikit-learn as well. The answer was something like that: nobody asked us about that... Best regards, Dmitry Mon, 11 June 2018 at 19:30, Sebastian Raschka : > Hi Jeff, > > had a similar question 1-2 years ago and ended up using Chris Borgelt's C > command line tools but for convenience, i also implemented basic > association rule & frequent pattern mining in Python here: > > http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/ > > Best, > Sebastian > > > On Jun 11, 2018, at 7:23 AM, Jeffrey Levesque via scikit-learn < > scikit-learn at python.org> wrote: > > > > Hi guys, > > What are some good approaches for association rules. Is there something > built in, or do people sometimes use alternate packages, maybe apache spark? > > > > Thank you, > > > > Jeff Levesque > > https://github.com/jeff1evesque > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From christian.braune79 at gmail.com Mon Jun 11 14:51:00 2018 From: christian.braune79 at gmail.com (Christian Braune) Date: Mon, 11 Jun 2018 20:51:00 +0200 Subject: [scikit-learn] Jeff Levesque: association rules In-Reply-To: <69BEFE99-D433-4605-A005-3787455607AF@sebastianraschka.com> References: <348CB71E-45A6-478C-893C-8BB0E5765FEB@yahoo.com> <69BEFE99-D433-4605-A005-3787455607AF@sebastianraschka.com> Message-ID: Hey, Christian Borgelt currently has several itemset mining algorithms online with a python interface: http://borgelt.net/pyfim.html . Best regards, Chris Sebastian Raschka schrieb am Mo., 11.
Juni 2018 um 19:30 Uhr: > Hi Jeff, > > had a similar question 1-2 years ago and ended up using Chris Borgelt's C > command line tools but for convenience, i also implemented basic > association rule & frequent pattern mining in Python here: > > http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/ > > Best, > Sebastian > > > On Jun 11, 2018, at 7:23 AM, Jeffrey Levesque via scikit-learn < > scikit-learn at python.org> wrote: > > > > Hi guys, > > What are some good approaches for association rules. Is there something > built in, or do people sometimes use alternate packages, maybe apache spark? > > > > Thank you, > > > > Jeff Levesque > > https://github.com/jeff1evesque > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dmitrii.ignatov at gmail.com Mon Jun 11 14:54:30 2018 From: dmitrii.ignatov at gmail.com (Dmitry Ignatov) Date: Mon, 11 Jun 2018 20:54:30 +0200 Subject: [scikit-learn] Jeff Levesque: association rules In-Reply-To: References: <348CB71E-45A6-478C-893C-8BB0E5765FEB@yahoo.com> <69BEFE99-D433-4605-A005-3787455607AF@sebastianraschka.com> Message-ID: My students use it too :-) ??, 11 ???? 2018 ?. ? 20:53, Christian Braune : > Hey, > > Christian Borgelt currently has several itemset mining algorithms online > with a python interface: http://borgelt.net/pyfim.html . > > Best regards, > Chris > > > Sebastian Raschka schrieb am Mo., 11. 
Juni > 2018 um 19:30 Uhr: > >> Hi Jeff, >> >> had a similar question 1-2 years ago and ended up using Chris Borgelt's C >> command line tools but for convenience, i also implemented basic >> association rule & frequent pattern mining in Python here: >> >> http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/ >> >> Best, >> Sebastian >> >> > On Jun 11, 2018, at 7:23 AM, Jeffrey Levesque via scikit-learn < >> scikit-learn at python.org> wrote: >> > >> > Hi guys, >> > What are some good approaches for association rules. Is there something >> built in, or do people sometimes use alternate packages, maybe apache spark? >> > >> > Thank you, >> > >> > Jeff Levesque >> > https://github.com/jeff1evesque >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathanrocher at gmail.com Mon Jun 11 17:45:42 2018 From: jonathanrocher at gmail.com (Jonathan Rocher) Date: Mon, 11 Jun 2018 16:45:42 -0500 Subject: [scikit-learn] Jeff Levesque: association rules In-Reply-To: References: <348CB71E-45A6-478C-893C-8BB0E5765FEB@yahoo.com> <69BEFE99-D433-4605-A005-3787455607AF@sebastianraschka.com> Message-ID: Yep, pyfim is what I too used for a past project... On Mon, Jun 11, 2018 at 1:55 PM Dmitry Ignatov wrote: > My students use it too :-) > > ??, 11 ???? 2018 ?. ? 
20:53, Christian Braune < > christian.braune79 at gmail.com>: > >> Hey, >> >> Christian Borgelt currently has several itemset mining algorithms online >> with a python interface: http://borgelt.net/pyfim.html . >> >> Best regards, >> Chris >> >> >> Sebastian Raschka schrieb am Mo., 11. Juni >> 2018 um 19:30 Uhr: >> >>> Hi Jeff, >>> >>> had a similar question 1-2 years ago and ended up using Chris Borgelt's >>> C command line tools but for convenience, i also implemented basic >>> association rule & frequent pattern mining in Python here: >>> >>> http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/ >>> >>> Best, >>> Sebastian >>> >>> > On Jun 11, 2018, at 7:23 AM, Jeffrey Levesque via scikit-learn < >>> scikit-learn at python.org> wrote: >>> > >>> > Hi guys, >>> > What are some good approaches for association rules. Is there >>> something built in, or do people sometimes use alternate packages, maybe >>> apache spark? >>> > >>> > Thank you, >>> > >>> > Jeff Levesque >>> > https://github.com/jeff1evesque >>> > _______________________________________________ >>> > scikit-learn mailing list >>> > scikit-learn at python.org >>> > https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Jonathan Rocher Austin TX, USA twitter:@jonrocher , linkedin:jonathanrocher ------------------------------------------------------------- -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From joel.nothman at gmail.com Mon Jun 11 21:09:35 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 12 Jun 2018 11:09:35 +1000 Subject: [scikit-learn] Jeff Levesque: profit functionality In-Reply-To: References: <8BA313C7-26FE-41ED-93D1-6EDF5BE024EA@yahoo.com> Message-ID: There is a PR for more GLM support ( https://github.com/scikit-learn/scikit-learn/pull/9405), but I don't think it will be in the next release. -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Jun 11 21:16:46 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 12 Jun 2018 11:16:46 +1000 Subject: [scikit-learn] Jeff Levesque: association rules In-Reply-To: References: <348CB71E-45A6-478C-893C-8BB0E5765FEB@yahoo.com> <69BEFE99-D433-4605-A005-3787455607AF@sebastianraschka.com> Message-ID: We have definitely discussed association rules in issues before. It's considered out of scope for scikit-learn, except insofar as it is used for learning classification. We haven't yet been convinced that classifiers based on associative learning have enough practical demand to justify their maintenance in the project. Then again, we have not had a pull request implementing any such algorithms; there seems to be demand mostly for the vanilla association rule mining algorithms. They are definitely out of scope for scikit-learn. See: https://github.com/scikit-learn/scikit-learn/issues/801, https://github.com/scikit-learn/scikit-learn/issues/2662, https://github.com/scikit-learn/scikit-learn/issues/2872 -------------- next part -------------- An HTML attachment was scrubbed...
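[Editor's note: since association-rule mining is out of scope for scikit-learn, the core of the Apriori algorithm behind the packages mentioned in this thread (mlxtend, pyfim, SPMF) can be sketched in pure Python. This is a didactic sketch with an invented toy basket dataset, not any of those libraries' implementations; in particular it counts all joined candidates and skips the usual subset-pruning optimization.]

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset whose support >= min_support."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]
    # Level 1: every distinct item is a candidate 1-itemset.
    candidates = list({frozenset([item]) for basket in baskets for item in basket})
    frequent = {}
    k = 1
    while candidates:
        # Count how many baskets contain each candidate, keep the frequent ones.
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into candidate (k+1)-itemsets.
        k += 1
        candidates = list({a | b for a, b in combinations(level, 2) if len(a | b) == k})
    return frequent

baskets = [
    ["milk", "bread"],
    ["milk", "diapers", "beer", "bread"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]
itemsets = apriori(baskets, min_support=0.6)
```

Association rules then follow by splitting each frequent itemset into antecedent/consequent and filtering on confidence, which is what the libraries above automate.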
URL: From guettliml at thomas-guettler.de Wed Jun 13 05:43:55 2018 From: guettliml at thomas-guettler.de (=?UTF-8?Q?Thomas_G=c3=bcttler?=) Date: Wed, 13 Jun 2018 11:43:55 +0200 Subject: [scikit-learn] Mapping fulltext OCR to issue type In-Reply-To: References: Message-ID: <4a8ec9b8-24f0-1b7a-4064-4dbdb648a751@thomas-guettler.de> I am still willing to learn. Does anyone have a recommendation which book or website could help me? Regards, Thomas On 08.06.2018 at 10:48, Thomas Güttler wrote: > We run an issue tracking application. A lot of issues get generated > from scanned letters. > > I have 70k full text OCR result files. They were created with Tesseract. > > Every file of these 70k files corresponds to an issue. Each issue has an issue type. > > I want to use machine learning and in the future the machine > should be able to guess the issue type by looking at the full text OCR. > > The issue types are not a simple list; they form a tree. > > Example: > > electricity / power grid > electricity / outages > customer support / invoices / complaint > customer support / invoices / tax > .... > > > If the machine can't guess > > "customer support / invoices / complaint" > > it would be nice if it could at least guess roughly the parent issue type: > > "customer support / invoices" > > I have never used scikit-learn before, but I have used Python for several years. > > Could you please guide me in the right direction? > > Regards, >
Thomas Güttler > > -- Thomas Guettler http://www.thomas-guettler.de/ I am looking for feedback: https://github.com/guettli/programming-guidelines From davidasfaha at gmail.com Wed Jun 13 06:25:45 2018 From: davidasfaha at gmail.com (David Asfaha) Date: Wed, 13 Jun 2018 11:25:45 +0100 Subject: [scikit-learn] Mapping fulltext OCR to issue type In-Reply-To: <4a8ec9b8-24f0-1b7a-4064-4dbdb648a751@thomas-guettler.de> References: <4a8ec9b8-24f0-1b7a-4064-4dbdb648a751@thomas-guettler.de> Message-ID: Hi, I would recommend starting with Naive Bayes [1] to classify the issues by parent issue type. To check how that works, learn about F1 accuracy scores [2] and use them. If you are happy with the results, and depending on how much data you have, try to modify the Naive Bayes classifier to predict the specific issue type. From here there are many more things to do, like using an ensemble of classifiers, experimenting with SVMs, random forest, TFIDF, n-grams... Natural Language Processing with Python is a good book on NLP, also Andrew Ng's Machine Learning course on coursera if you're new to the subject. Hope this helps. David [1] http://scikit-learn.org/stable/modules/naive_bayes.html [2] http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html On 13 June 2018 at 10:43, Thomas Güttler wrote: > I am still willing to learn. > > Does anyone have a recommendation which book or website could help me? > > Regards, > Thomas > > > On 08.06.2018 at 10:48, Thomas Güttler wrote: > >> We run an issue tracking application. A lot of issues get generated >> from scanned letters. >> >> I have 70k full text OCR result files. They were created with Tesseract. >> >> Every file of these 70k files corresponds to an >> issue. Each issue has an >> issue type. >> >> I want to use machine learning and in the future the machine >> should be able to guess the issue type by looking at the full text OCR. >> >> The issue types are not a simple list; they form a tree.
>> >> Example: >> electricity / power grid >> electricity / outages >> customer support / invoices / complaint >> customer support / invoices / tax >> .... >> >> >> If the machine can't guess >> >> "customer support / invoices / complaint" >> >> it would be nice if it could at least guess roughly the parent issue type: >> >> "customer support / invoices" >> >> I have never used scikit-learn before, but I have used Python for several years. >> >> Could you please guide me in the right direction? >> >> Regards, >> Thomas Güttler >> >> >> > -- > Thomas Guettler http://www.thomas-guettler.de/ > I am looking for feedback: https://github.com/guettli/programming-guidelines > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandra.log at sintef.no Thu Jun 14 06:44:28 2018 From: alexandra.log at sintef.no (Alexandra Metallinou Log) Date: Thu, 14 Jun 2018 10:44:28 +0000 Subject: [scikit-learn] help Message-ID: Dear Sir/Madam, I have stumbled upon a problem while trying to run some old code using scikit-learn: scores = cross_validation.cross_val_score(model, X, Y, cv = 10, score_func = metrics.mean_squared_error) This line will not run in a program I downloaded, and as I am not yet very familiar with scikit-learn I do not know how I should replace "score_func = metrics.mean_squared_error" to produce the same result as intended by the ones who made the program. Any help is greatly appreciated. Best regards, Alexandra Log -------------- next part -------------- An HTML attachment was scrubbed... URL: From gryllosprokopis at gmail.com Thu Jun 14 11:50:21 2018 From: gryllosprokopis at gmail.com (Prokopis Gryllos) Date: Thu, 14 Jun 2018 17:50:21 +0200 Subject: [scikit-learn] help In-Reply-To: References: Message-ID: Hey Alexandra, Can you maybe share the error output?
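[Editor's note: returning to the OCR issue-type thread above, the Naive Bayes baseline recommended there can be sketched as a scikit-learn pipeline. The four documents and labels below are invented stand-ins for the 70k OCR files; the tree structure is handled crudely by predicting only the parent issue type, the first level of the tree.]

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented stand-in documents; the real input would be the OCR'd letters.
texts = [
    "power grid failure reported in the northern sector",
    "scheduled maintenance of power lines and grid",
    "complaint about the amount on last month's invoice",
    "question regarding tax on the enclosed invoice",
]
# Parent issue types, i.e. the first level of the issue-type tree.
parents = ["electricity", "electricity", "customer support", "customer support"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, parents)
prediction = model.predict(["complaint about an invoice"])
```

A second classifier trained per parent class (or a dedicated hierarchical-classification package) can then refine the guess down to the leaf type, and sklearn.metrics.f1_score can compare the variants, as suggested in the thread.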
gr, Prokopis On Thu, Jun 14, 2018 at 5:20 PM Alexandra Metallinou Log < alexandra.log at sintef.no> wrote: > Dear Sir/Madam, > > > I have stumbled upon a problem while trying to run some old code using > skcikit-learn: > > > scores = cross_validation.cross_val_score(model, X, Y, cv = 10, score_func > = metrics.mean_squared_error) > > > This line will not run in a program I downloaded, and as I am not yet very > familliar with scikit-learn I do not know how I should replace "score_func > = metrics.mean_squared_error" to produce the same result as intended by the > ones who made the program. Any help is greatly appreciated. > > > Best regards, > > > Alexandra Log > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ichkoar at gmail.com Thu Jun 14 14:52:22 2018 From: ichkoar at gmail.com (Christos Aridas) Date: Thu, 14 Jun 2018 21:52:22 +0300 Subject: [scikit-learn] help In-Reply-To: References: Message-ID: Hey Alexandra . Could you please post a minimal, complete, and verifiable example? Apart from this could you post the exact error message? Best, Chris On Thu, Jun 14, 2018 at 1:44 PM, Alexandra Metallinou Log < alexandra.log at sintef.no> wrote: > Dear Sir/Madam, > > > I have stumbled upon a problem while trying to run some old code using > skcikit-learn: > > > scores = cross_validation.cross_val_score(model, X, Y, cv = 10, > score_func = metrics.mean_squared_error) > > > This line will not run in a program I downloaded, and as I am not yet very > familliar with scikit-learn I do not know how I should replace "score_func > = metrics.mean_squared_error" to produce the same result as intended by the > ones who made the program. Any help is greatly appreciated. 
> > > Best regards, > > > Alexandra Log > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Thu Jun 14 19:57:31 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Fri, 15 Jun 2018 09:57:31 +1000 Subject: [scikit-learn] help In-Reply-To: References: Message-ID: model_evaluation.cross_val_score(model, X, y, scoring='neg_mse') will produce the same, but negated so that greater is better. -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Sat Jun 16 03:59:26 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sat, 16 Jun 2018 00:59:26 -0700 Subject: [scikit-learn] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: I have made a significant revision. In this version, downstream projects like scikit-learn should experience significantly less forced churn. https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep-0019-rng-policy.rst https://mail.python.org/pipermail/numpy-discussion/2018-June/078252.html tl;dr RandomState lives! But its distributions are forever frozen. So maybe "undead" is more apt. Anyways, RandomState will continue to provide the same stream-compatibility that it always has. But it will be internally refactored to use the same core uniform PRNG objects that the new RandomGenerator distributions class will use underneath (defaulting to the current Mersenne Twister, of course). The distribution methods on RandomGenerator will be allowed to evolve with numpy versions and get better/faster implementations. Your code can mix the usage of RandomState and RandomGenerator as needed, but they can be made to share the same underlying RNG algorithm's state. 
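The RandomState/RandomGenerator mixing described in words above can be illustrated with the interfaces that eventually shipped in NumPy 1.17 as the realization of this NEP; the class names below follow that release rather than the draft text, so treat them as an assumption:

```python
import numpy as np

# One core Mersenne Twister bit generator (NumPy >= 1.17) shared by both
# the frozen legacy interface and the evolving Generator interface.
bg = np.random.MT19937(12345)
legacy = np.random.RandomState(bg)   # stream-compatible, frozen distributions
modern = np.random.Generator(bg)     # distributions free to improve

legacy.standard_normal(2)  # advances the shared MT19937 state
modern.standard_normal(2)  # continues from where the legacy draws left off
```

Because both objects wrap the same bit generator, draws through either interface advance a single underlying state.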
-- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco From alexandra.log at sintef.no Fri Jun 15 03:51:59 2018 From: alexandra.log at sintef.no (Alexandra Metallinou Log) Date: Fri, 15 Jun 2018 07:51:59 +0000 Subject: [scikit-learn] help In-Reply-To: References: , Message-ID: Thank you, this worked. The error message was: undefined keyword: 'score-func' I also changed the line of code from scores = cross_validation.cross_val_score(model, X, Y, cv = 10, score_func = metrics.mean_squared_error) to scores = cross_validation.cross_val_score(model, X, Y, cv = 10, scores = 'mean_squared_error') the code runs with this (I receive negative outputs though, so I took the absolute value of these afterwards). However the following deprecation warning is displayed: C:\Python27\lib\site-packages\sklearn\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20. "This module will be removed in 0.20.", DeprecationWarning) C:\Python27\lib\site-packages\sklearn\grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20. DeprecationWarning) When I changed the code to: model_evaluation.cross_val_score(model, X, y, scoring='neg_mean_squared_error'), the code runs fine ('neg_mse' was not an acceptable keyword). I still get the same deprecation warning, though I don't understand why as I am using model_evaluation now. Regardless, I think the problem is fixed. Once again, thank you for your help!
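A self-contained sketch of the working modern call (scikit-learn >= 0.18; the `model`, `X`, `y` below are toy stand-ins, not the original program's objects):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Toy stand-ins for the original program's data and model.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.randn(100)
model = LinearRegression()

# model_selection replaces the deprecated cross_validation module.
# The scorer is negated MSE (greater is better), so flip the sign
# to recover plain per-fold MSE values.
scores = cross_val_score(model, X, y, cv=10, scoring='neg_mean_squared_error')
mse_per_fold = -scores
```

Importing from `sklearn.model_selection` rather than the old `sklearn.cross_validation` is also what silences the deprecation warnings quoted above.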
Kind regards, Alexandra ________________________________ From: scikit-learn on behalf of Joel Nothman Sent: Friday, 15 June 2018 01.57.31 To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] help model_evaluation.cross_val_score(model, X, y, scoring='neg_mse') will produce the same, but negated so that greater is better. -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sat Jun 16 08:13:20 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Sat, 16 Jun 2018 22:13:20 +1000 Subject: [scikit-learn] help In-Reply-To: References: Message-ID: Sorry, should have been model_selection, not model_evaluation. cross_validation is now deprecated. On Sat, 16 Jun 2018 at 18:28, Alexandra Metallinou Log < alexandra.log at sintef.no> wrote: > Thank you, this worked. The error message was: undefined keyword: > 'score-func' > > > I also changed the line of code from > > > scores = cross_validation.cross_val_score(model, X, Y, cv = 10, score_func > = metrics.mean_squared_error) > > > to > > > scores = cross_validation.cross_val_score(model, X, Y, cv = 10, scores = > 'mean_squared_error') > > > the code runs with this (I receive negative outputs though, so I took the > absolute value of these afterwards). However the following deprecation > warning is displayed: > > > C:\Python27\lib\site-packages\sklearn\cross_validation.py:41: > DeprecationWarning: This module was deprecated in version 0.18 in favor of > the model_selection module into which all the refactored classes and > functions are moved. Also note that the interface of the new CV iterators > are different from that of this module. This module will be removed in 0.20.
> "This module will be removed in 0.20.", DeprecationWarning) > C:\Python27\lib\site-packages\sklearn\grid_search.py:42: > DeprecationWarning: This module was deprecated in version 0.18 in favor of > the model_selection module into which all the refactored classes and > functions are moved. This module will be removed in 0.20. > DeprecationWarning) > > When I changed the code to: > > > model_evaluation.cross_val_score(model, X, y, scoring= > 'neg_mean_squared_error'), > > > the code runs fine ('neg_mse' was not an acceptable keyword). I still get > the same deprecation warning, though I don't understand why as I am using > model_evaluation now. Regardless, I think the problem is fixed. > > > Once again, thank you for your help! > > > Kind regards, > > > Alexandra > ------------------------------ > *Fra:* scikit-learn sintef.no at python.org> p? vegne av Joel Nothman > *Sendt:* fredag 15. juni 2018 01.57.31 > *Til:* Scikit-learn user and developer mailing list > *Emne:* Re: [scikit-learn] help > > model_evaluation.cross_val_score(model, X, y, scoring='neg_mse') will > produce the same, but negated so that greater is better. > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Sat Jun 16 08:54:36 2018 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Sat, 16 Jun 2018 08:54:36 -0400 Subject: [scikit-learn] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sat, Jun 16, 2018 at 3:59 AM, Robert Kern wrote: > I have made a significant revision. In this version, downstream projects > like scikit-learn should experience significantly less forced churn. 
> > https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep-0019-rng-policy.rst > > https://mail.python.org/pipermail/numpy-discussion/2018-June/078252.html > > tl;dr RandomState lives! But its distributions are forever frozen. So maybe > "undead" is more apt. Anyways, RandomState will continue to provide the same > stream-compatibility that it always has. But it will be internally > refactored to use the same core uniform PRNG objects that the new > RandomGenerator distributions class will use underneath (defaulting to the > current Mersenne Twister, of course). The distribution methods on > RandomGenerator will be allowed to evolve with numpy versions and get > better/faster implementations. > > Your code can mix the usage of RandomState and RandomGenerator as needed, > but they can be made to share the same underlying RNG algorithm's state. Sounds good to me, and I think handles all our concerns. I also think that the issues behind the np.random.* section about the global state and seed can be revisited for possible deprecation of convenience features. One clarifying question, mainly to see IIUC in this quote """ Calling numpy.random.seed() thereafter SHOULD just pass the given seed to the current basic RNG object and not attempt to reset the basic RNG to the Mersenne Twister. The global RandomState instance MUST be accessible by the name numpy.random.mtrand._rand """ "the current basic RNG object" refers to the global object. AFAIU, it is possible to change it numpy.random.mtrand._rand. Is it? I never tried that so I didn't know we can change the global RandomState, and thought we will have to replace np.random.seed usage with a specific RandomState(seed) instance In loose analogy: Matplotlib has a "global" current figure and axis, gca, gcf. In statsmodels we avoid any access to and usage of it and only work with individual figure/axis instances that can be provided by the user. 
(except for maybe some documentation examples and maybe some "legacy" code.) ( https://github.com/statsmodels/statsmodels/blob/master/statsmodels/graphics/utils.py#L48 ) AFAICS, essentially, statsmodels will need a similar policy for RandomState/RandomGenerator and give up the usage of the global random instance. Josef > > > -- > Robert Kern > > "I have come to believe that the whole world is an enigma, a harmless enigma > that is made terrible by our own mad attempt to interpret it as though it > had > an underlying truth." > -- Umberto Eco > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From robert.kern at gmail.com Sat Jun 16 20:29:33 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sat, 16 Jun 2018 17:29:33 -0700 Subject: [scikit-learn] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On 6/16/18 05:54, josef.pktd at gmail.com wrote: > On Sat, Jun 16, 2018 at 3:59 AM, Robert Kern wrote: >> I have made a significant revision. In this version, downstream projects >> like scikit-learn should experience significantly less forced churn. >> >> https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep-0019-rng-policy.rst >> >> https://mail.python.org/pipermail/numpy-discussion/2018-June/078252.html >> >> tl;dr RandomState lives! But its distributions are forever frozen. So maybe >> "undead" is more apt. Anyways, RandomState will continue to provide the same >> stream-compatibility that it always has. But it will be internally >> refactored to use the same core uniform PRNG objects that the new >> RandomGenerator distributions class will use underneath (defaulting to the >> current Mersenne Twister, of course). The distribution methods on >> RandomGenerator will be allowed to evolve with numpy versions and get >> better/faster implementations. 
>> >> Your code can mix the usage of RandomState and RandomGenerator as needed, >> but they can be made to share the same underlying RNG algorithm's state. > > > Sounds good to me, and I think handles all our concerns. > > I also think that the issues behind the np.random.* section about the > global state and seed can be revisited for possible deprecation of > convenience features. > > One clarifying question, mainly to see IIUC > > in this quote > """ > Calling numpy.random.seed() thereafter SHOULD just pass the given seed > to the current basic RNG object and not attempt to reset the basic RNG > to the Mersenne Twister. The global RandomState instance MUST be > accessible by the name numpy.random.mtrand._rand > """ > > "the current basic RNG object" refers to the global object. AFAIU, it > is possible to change it numpy.random.mtrand._rand. Is it? numpy.random.mtrand._rand would not be a basic RNG object; it would be (as it is now) a RandomState instance. "the current basic RNG object" would be the basic RNG that that global RandomState instance is currently using. It is not possible (now or in the glorious NEP future) to assign a new instance to numpy.random.mtrand._rand. All of the numpy.random.* functions are actually just simple aliases to the methods on that object when the module is first built. Re-assigning _rand wouldn't reassign those aliases. numpy.random.standard_normal(), for instance, would still be the .standard_normal() method on the RandomState instance that _rand initially pointed to. Currently and under the NEP, the only way to modify numpy.random.mtrand._rand is to call its methods (i.e. the numpy.random.* convenience functions) to modify its internal state. That's not changing. The only thing that will change will be that there will be a new numpy.random.* function to call that will let you give the global RandomState a new basic RNG object that it will swap in internally. Let's call it np.random.swap_global_basic_rng(). 
If you don't use that function, you won't have a problem. I intend to make this new function *very* explicit about what it is doing, and document the crap out of it so it won't be misused like np.random.seed() is. > I never tried that so I didn't know we can change the global > RandomState, and thought we will have to replace np.random.seed usage > with a specific RandomState(seed) instance I did a quick review of np.random.seed() usage in statsmodels, and I think you are mostly fine. It looks like you mostly use it in unit tests and at the top of examples. The only possible problem that I can see that you might have with the swap_global_basic_rng() is if some other package that you rely on calls it in its library code. Then subsequent statsmodels unit tests might fail because when they call np.random.seed(), it won't be reseeding a Mersenne Twister but another basic RNG. However, I intend to make that a weird and unnatural thing to do. It's already unlikely to happen as it's a niche requirement that one mostly would need at the start of a whole program, not buried down inside library code. But we will also document that function to discourage such usage, and probably have unconditional noisy warnings that users would have to explicitly silence. If one of your dependencies did that, you'd be well within your rights to tell them that they are misusing numpy and causing breakage in statsmodels. > In loose analogy: > > Matplotlib has a "global" current figure and axis, gca, gcf. > In statsmodels we avoid any access to and usage of it and only work > with individual figure/axis instances that can be provided by the > user. (except for maybe some documentation examples and maybe some > "legacy" code.) > ( https://github.com/statsmodels/statsmodels/blob/master/statsmodels/graphics/utils.py#L48 > ) > > AFAICS, essentially, statsmodels will need a similar policy for > RandomState/RandomGenerator and give up the usage of the global random > instance. 
I mean, you certainly *should* (outside of unit tests) for very similar reasons why you avoid the global state in matplotlib, but this NEP won't force you to. You should do so anyways under the status quo, too. For any of your functions that call np.random.* functions internally, it's hard to use them in threaded applications, for instance, because it is relying on that global state. scikit-learn's check_random_state() is a good pattern to follow. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco From josef.pktd at gmail.com Sat Jun 16 20:42:12 2018 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Sat, 16 Jun 2018 20:42:12 -0400 Subject: [scikit-learn] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sat, Jun 16, 2018 at 8:29 PM, Robert Kern wrote: > On 6/16/18 05:54, josef.pktd at gmail.com wrote: >> >> On Sat, Jun 16, 2018 at 3:59 AM, Robert Kern >> wrote: >>> >>> I have made a significant revision. In this version, downstream projects >>> like scikit-learn should experience significantly less forced churn. >>> >>> >>> https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep-0019-rng-policy.rst >>> >>> https://mail.python.org/pipermail/numpy-discussion/2018-June/078252.html >>> >>> tl;dr RandomState lives! But its distributions are forever frozen. So >>> maybe >>> "undead" is more apt. Anyways, RandomState will continue to provide the >>> same >>> stream-compatibility that it always has. But it will be internally >>> refactored to use the same core uniform PRNG objects that the new >>> RandomGenerator distributions class will use underneath (defaulting to >>> the >>> current Mersenne Twister, of course). The distribution methods on >>> RandomGenerator will be allowed to evolve with numpy versions and get >>> better/faster implementations. 
>>> >>> Your code can mix the usage of RandomState and RandomGenerator as needed, >>> but they can be made to share the same underlying RNG algorithm's state. >> >> >> >> Sounds good to me, and I think handles all our concerns. >> >> I also think that the issues behind the np.random.* section about the >> global state and seed can be revisited for possible deprecation of >> convenience features. >> >> One clarifying question, mainly to see IIUC >> >> in this quote >> """ >> Calling numpy.random.seed() thereafter SHOULD just pass the given seed >> to the current basic RNG object and not attempt to reset the basic RNG >> to the Mersenne Twister. The global RandomState instance MUST be >> accessible by the name numpy.random.mtrand._rand >> """ >> >> "the current basic RNG object" refers to the global object. AFAIU, it >> is possible to change it numpy.random.mtrand._rand. Is it? > > > numpy.random.mtrand._rand would not be a basic RNG object; it would be (as > it is now) a RandomState instance. "the current basic RNG object" would be > the basic RNG that that global RandomState instance is currently using. > > It is not possible (now or in the glorious NEP future) to assign a new > instance to numpy.random.mtrand._rand. All of the numpy.random.* functions > are actually just simple aliases to the methods on that object when the > module is first built. Re-assigning _rand wouldn't reassign those aliases. > numpy.random.standard_normal(), for instance, would still be the > .standard_normal() method on the RandomState instance that _rand initially > pointed to. > > Currently and under the NEP, the only way to modify > numpy.random.mtrand._rand is to call its methods (i.e. the numpy.random.* > convenience functions) to modify its internal state. That's not changing. > > The only thing that will change will be that there will be a new > numpy.random.* function to call that will let you give the global > RandomState a new basic RNG object that it will swap in internally. 
Let's > call it np.random.swap_global_basic_rng(). If you don't use that function, > you won't have a problem. I intend to make this new function *very* explicit > about what it is doing, and document the crap out of it so it won't be > misused like np.random.seed() is. I didn't catch that part. Now it's clear. > >> I never tried that so I didn't know we can change the global >> RandomState, and thought we will have to replace np.random.seed usage > >> with a specific RandomState(seed) instance > > > I did a quick review of np.random.seed() usage in statsmodels, and I think > you are mostly fine. It looks like you mostly use it in unit tests and at > the top of examples. The only possible problem that I can see that you might > have with the swap_global_basic_rng() is if some other package that you rely > on calls it in its library code. Then subsequent statsmodels unit tests > might fail because when they call np.random.seed(), it won't be reseeding a > Mersenne Twister but another basic RNG. > > However, I intend to make that a weird and unnatural thing to do. It's > already unlikely to happen as it's a niche requirement that one mostly would > need at the start of a whole program, not buried down inside library code. > But we will also document that function to discourage such usage, and > probably have unconditional noisy warnings that users would have to > explicitly silence. > > If one of your dependencies did that, you'd be well within your rights to > tell them that they are misusing numpy and causing breakage in statsmodels. > >> In loose analogy: >> >> Matplotlib has a "global" current figure and axis, gca, gcf. >> In statsmodels we avoid any access to and usage of it and only work >> with individual figure/axis instances that can be provided by the >> user. (except for maybe some documentation examples and maybe some >> "legacy" code.) 
>> ( >> https://github.com/statsmodels/statsmodels/blob/master/statsmodels/graphics/utils.py#L48 >> ) >> >> AFAICS, essentially, statsmodels will need a similar policy for >> RandomState/RandomGenerator and give up the usage of the global random >> instance. > > > I mean, you certainly *should* (outside of unit tests) for very similar > reasons why you avoid the global state in matplotlib, but this NEP won't > force you to. You should do so anyways under the status quo, too. For any of > your functions that call np.random.* functions internally, it's hard to use > them in threaded applications, for instance, because it is relying on that > global state. > > scikit-learn's check_random_state() is a good pattern to follow. Thanks for the clarification. I just realized that I had replied to scikit-learn mailing list. I had thought this was numpy-discussion. sorry about that. Josef > > > -- > Robert Kern > > "I have come to believe that the whole world is an enigma, a harmless enigma > that is made terrible by our own mad attempt to interpret it as though it > had > an underlying truth." > -- Umberto Eco > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From guettliml at thomas-guettler.de Mon Jun 18 06:16:19 2018 From: guettliml at thomas-guettler.de (=?UTF-8?Q?Thomas_G=c3=bcttler?=) Date: Mon, 18 Jun 2018 12:16:19 +0200 Subject: [scikit-learn] Mapping fulltext OCR to issue type In-Reply-To: References: <4a8ec9b8-24f0-1b7a-4064-4dbdb648a751@thomas-guettler.de> Message-ID: Thank you very much David, I ordered the book Regards, Thomas Am 13.06.2018 um 12:25 schrieb David Asfaha: > > Hi, > > I would recommend starting with Naive Bayes [1] to classify the issues by parent issue type. To check how that works > learn about F1 accuracy scores [2] and use them. 
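A minimal sketch of that recommended starting point, with invented toy texts labelled by parent issue type (all data and names here are illustrative, not from the original program):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy OCR snippets, labelled with their *parent* issue type.
texts = ["power grid outage reported downtown",
         "complaint about invoice overcharge",
         "scheduled electricity grid maintenance",
         "question about invoice tax line"] * 25
labels = ["electricity", "customer support",
          "electricity", "customer support"] * 25

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, random_state=0)

# Bag-of-words features feeding a multinomial Naive Bayes classifier.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(X_train, y_train)
macro_f1 = f1_score(y_test, clf.predict(X_test), average="macro")
```

The same pipeline, refit on labels for the specific issue types, is the natural next step once the parent-level scores look acceptable.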
If you are happy with the results, and depending on how much data you > have, try to modify the Naive Bayes classifier to predict the specific issue type. From here there are many more things > to do, like using an ensemble of classifiers, experimenting with SVMs, random forest, TFIDF, n-grams... > > Natural Language Processing with Python is a good book on NLP, also Andrew Ng's Machine Learning course on coursera if > you're new to the subject. > > Hope this helps. > > David > > > [1] http://scikit-learn.org/stable/modules/naive_bayes.html > [2] http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html > > > On 13 June 2018 at 10:43, Thomas Güttler wrote: > > I am still willing to learn. > > Does anyone have a recommendation which book or website could help me? > > Regards, > Thomas > > > On 08.06.2018 at 10:48, Thomas Güttler wrote: > > We run an issue tracking application. A lot of issues get generated > from scanned letters. > > I have 70k full text OCR result files. They were created with tesseract. > > Every file of these 70k files corresponds to an issue. Each issue has an issue type. > > I want to use machine learning and in the future the machine > should be able to guess the issue type by looking at the full text OCR. > > The issue types are not a simple list; they form a tree. > > Example: > > electricity / power grid > electricity / outages > customer support / invoices / complaint > customer support / invoices / tax > .... > > > If the machine can't guess > > "customer support / invoices / complaint" > > it would be nice if it could at least guess roughly the parent issue type: > > "customer support / invoices" > > I never used scikit-learn before, but I have used Python for several years. > > Could you please guide me in the right direction? > > Regards, >
Thomas Güttler > > > > -- > Thomas Guettler http://www.thomas-guettler.de/ > I am looking for feedback: https://github.com/guettli/programming-guidelines > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Thomas Guettler http://www.thomas-guettler.de/ I am looking for feedback: https://github.com/guettli/programming-guidelines From robert.kern at gmail.com Tue Jun 19 02:34:38 2018 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 18 Jun 2018 23:34:38 -0700 Subject: [scikit-learn] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On 6/16/18 00:59, Robert Kern wrote: > I have made a significant revision. In this version, downstream projects like > scikit-learn should experience significantly less forced churn. > > https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep-0019-rng-policy.rst > > https://mail.python.org/pipermail/numpy-discussion/2018-June/078252.html The screaming has died down on numpy-discussion, and it seems like everyone who has participated over there has more or less come to consensus about accepting this NEP. However, I'd really appreciate it if I could get some kind of feedback from a scikit-learn dev, whether it's "I don't care" or "I need a couple of days to get around to reading the NEP" or just "+1" or "-1000; this is awful!" I'm not picky. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth."
-- Umberto Eco From hamidizade.s at gmail.com Tue Jun 19 10:52:28 2018 From: hamidizade.s at gmail.com (S Hamidizade) Date: Tue, 19 Jun 2018 19:22:28 +0430 Subject: [scikit-learn] imbalanced classes: class_weight Message-ID: Hi I would appreciate if you could let me know what is the best way to categorize the approaches which have been developed to deal with imbalance class problem? *This article categorizes them into:* 1. Preprocessing: includes oversampling, undersampling and hybrid methods, 2. Cost-sensitive learning: includes direct methods and meta-learning which the latter further divides into thresholding and sampling, 3. Ensemble techniques: includes cost-sensitive ensembles and data preprocessing in conjunction with ensemble learning. *The second classification:* 1. Data Pre-processing: includes distribution change and weighting the data space. One-class learning is considered as distribution change. 2. Special-purpose Learning Methods 3. Prediction Post-processing: includes threshold method and cost-sensitive post-processing 4. Hybrid Methods: *The third article :* 1. Data-level methods 2. Algorithm-level methods 3. Hybrid methods The last classification also considers output adjustment as an independent approach. Could you please let me know the class-weight in the sklearn's classifiers e.g., logistic regression is classified into which category? Is it true to say: In case of the first categorization, it falls into cost-sensitive learning In case of the second taxonomy, it would be classified into the third category i.e., cost-sensitive post-processing In case of the third classification, it should fall into algorithm level Best regards, -------------- next part -------------- An HTML attachment was scrubbed... 
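For the class_weight part of the question, a short sketch of what the parameter does mechanically: it reweights each class's contribution to the training loss inside the fitting algorithm, with no resampling and no post-processing, which is why it is usually filed under cost-sensitive, algorithm-level methods. The data below are toy stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced toy problem: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)

# Explicit misclassification costs: errors on the minority class
# weigh 9x more in the fitted loss (no resampling, no post-processing).
clf_cost = LogisticRegression(class_weight={0: 1, 1: 9}).fit(X, y)

# 'balanced' derives the same kind of weights from class frequencies:
# n_samples / (n_classes * np.bincount(y)).
clf_bal = LogisticRegression(class_weight="balanced").fit(X, y)
```

The weighting shifts the fitted decision boundary toward the minority class, typically increasing how often class 1 is predicted relative to an unweighted fit.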
URL: From t3kcit at gmail.com Tue Jun 19 11:12:18 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 19 Jun 2018 11:12:18 -0400 Subject: [scikit-learn] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: <2e83ecf0-4f42-6eb3-c372-28bb5baf8583@gmail.com> I don't think I have the bandwidth but I agree :-/ Not sure if any of the other core devs do. I can try to read it next week but that's probably too late? On 06/19/2018 02:34 AM, Robert Kern wrote: > On 6/16/18 00:59, Robert Kern wrote: >> I have made a significant revision. In this version, downstream >> projects like scikit-learn should experience significantly less >> forced churn. >> >> https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep-0019-rng-policy.rst >> >> https://mail.python.org/pipermail/numpy-discussion/2018-June/078252.html > > The screaming has died down on numpy-discussion, and it seems like > everyone who has participated over there has more or less come to > consensus about accepting this NEP. However, I'd really appreciate it > if I could get some kind of feedback from a scikit-learn dev, whether > it's "I don't care" or "I need a couple of days to get around to > reading the NEP" or just "+1" or "-1000; this is awful!" > > I'm not picky. > From gael.varoquaux at normalesup.org Tue Jun 19 11:29:14 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Tue, 19 Jun 2018 17:29:14 +0200 Subject: [scikit-learn] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: <20180619152914.youaux7nkjftujvt@phare.normalesup.org> On Mon, Jun 18, 2018 at 11:34:38PM -0700, Robert Kern wrote: > However, I'd really appreciate it if I could get some > kind of feedback from a scikit-learn dev, I didn't read the NEP, only your summary. That said, it seems quite reasonably aligned with our practice, and hence shouldn't pose a problem. 
Ideally, I believe that in the long run it should enable us to have cleaner / more robust code, but I suspect that it will take a while before we get there. Gaël From ichkoar at gmail.com Tue Jun 19 12:34:50 2018 From: ichkoar at gmail.com (Christos Aridas) Date: Tue, 19 Jun 2018 19:34:50 +0300 Subject: [scikit-learn] imbalanced classes: class_weight In-Reply-To: References: Message-ID: Hi, Have you seen http://imbalanced-learn.org? Best, Chris On Tue, 19 Jun 2018 17:53 S Hamidizade, wrote: > Hi > > I would appreciate if you could let me know what is the best way to > categorize the approaches which have been developed to deal with imbalance > class problem? > > *This article > > categorizes them into:* > > 1. Preprocessing: includes oversampling, undersampling and hybrid > methods, > 2. Cost-sensitive learning: includes direct methods and meta-learning > which the latter further divides into thresholding and sampling, > 3. Ensemble techniques: includes cost-sensitive ensembles and data > preprocessing in conjunction with ensemble learning. > > *The second classification:* > > 1. Data Pre-processing: includes distribution change and weighting the > data space. One-class learning is considered as distribution change. > 2. Special-purpose Learning Methods > 3. Prediction Post-processing: includes threshold method and > cost-sensitive post-processing > 4. Hybrid Methods: > > *The third article > :* > > 1. Data-level methods > 2. Algorithm-level methods > 3. Hybrid methods > > The last classification also considers output adjustment as an independent > approach. > > Could you please let me know the class-weight in the sklearn's classifiers > e.g., logistic regression is classified into which category?
Is it true to > say: > > In case of the first categorization, it falls into cost-sensitive learning > > In case of the second taxonomy, it would be classified into the third > category i.e., cost-sensitive post-processing > > In case of the third classification, it should fall into algorithm level > > Best regards, > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Tue Jun 19 18:19:03 2018 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 19 Jun 2018 15:19:03 -0700 Subject: [scikit-learn] NEP: Random Number Generator Policy In-Reply-To: <2e83ecf0-4f42-6eb3-c372-28bb5baf8583@gmail.com> References: <2e83ecf0-4f42-6eb3-c372-28bb5baf8583@gmail.com> Message-ID: On 6/19/18 08:12, Andreas Mueller wrote: > I don't think I have the bandwidth but I agree :-/ > Not sure if any of the other core devs do. I can try to read it next week but > that's probably too late? We're not on a deadline. If you're interested in reading the NEP and providing feedback/consent, I'm happy to hold off on formally accepting the NEP until then. Thanks! -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco From hamidizade.s at gmail.com Wed Jun 20 23:35:54 2018 From: hamidizade.s at gmail.com (S Hamidizade) Date: Thu, 21 Jun 2018 08:05:54 +0430 Subject: [scikit-learn] imbalanced classes: class_weight In-Reply-To: References: Message-ID: Hi Thanks a lot for your time and consideration. I have seen imblearn but my question is not related to it. Best regards, On Tue, Jun 19, 2018 at 9:04 PM, Christos Aridas wrote: > Hi, > > Have you seen http://imbalanced-learn.org? 
> > Best, > Chris > > On Tue, 19 Jun 2018 17:53 S Hamidizade, wrote: > >> Hi >> >> I would appreciate if you could let me know what is the best way to >> categorize the approaches which have been developed to deal with imbalance >> class problem? >> >> *This article >> >> categorizes them into:* >> >> 1. Preprocessing: includes oversampling, undersampling and hybrid >> methods, >> 2. Cost-sensitive learning: includes direct methods and meta-learning >> which the latter further divides into thresholding and sampling, >> 3. Ensemble techniques: includes cost-sensitive ensembles and data >> preprocessing in conjunction with ensemble learning. >> >> *The second classification:* >> >> 1. Data Pre-processing: includes distribution change and weighting >> the data space. One-class learning is considered as distribution change. >> 2. Special-purpose Learning Methods >> 3. Prediction Post-processing: includes threshold method and >> cost-sensitive post-processing >> 4. Hybrid Methods: >> >> *The third article >> :* >> >> 1. Data-level methods >> 2. Algorithm-level methods >> 3. Hybrid methods >> >> The last classification also considers output adjustment as an >> independent approach. >> >> Could you please let me know the class-weight in the sklearn's >> classifiers e.g., logistic regression is classified into which category? 
Is >> it true to say: >> >> In case of the first categorization, it falls into cost-sensitive learning >> >> In case of the second taxonomy, it would be classified into the third >> category i.e., cost-sensitive post-processing >> >> In case of the third classification, it should fall into algorithm level >> >> Best regards, >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Thu Jun 21 00:19:23 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 21 Jun 2018 14:19:23 +1000 Subject: [scikit-learn] imbalanced classes: class_weight In-Reply-To: References: Message-ID: We don't usually do any postprocessing for class weight (although there is an open issue). In the second taxonomy, I'd say Data Pre-processing ("weighting the data space"), but maybe there are exceptions in some estimators. The classification in the first taxonomy is correct, IMO. In the third, perhaps "Algorithm-level"? -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Thu Jun 21 00:19:56 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 21 Jun 2018 14:19:56 +1000 Subject: [scikit-learn] imbalanced classes: class_weight In-Reply-To: References: Message-ID: the open issue on post-processing / prior adjustment to adjust for class_weight: https://github.com/scikit-learn/scikit-learn/issues/10613 -------------- next part -------------- An HTML attachment was scrubbed...
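[Editor's note: Joel's reading of class_weight as "weighting the data space" can be illustrated with the heuristic scikit-learn documents for class_weight='balanced', namely n_samples / (n_classes * np.bincount(y)). The sketch below is NumPy-only; the toy label vector is invented for illustration, and it assumes contiguous integer labels starting at 0.]

```python
import numpy as np

# Toy imbalanced label vector (8 negatives, 2 positives), invented
# purely for illustration.
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
classes = np.unique(y)

# The heuristic documented for class_weight='balanced':
# each class is weighted inversely to its frequency.
class_weights = len(y) / (len(classes) * np.bincount(y))

# Broadcasting the class weights onto samples reweights the data space:
# after this, both classes contribute the same total weight to the loss.
sample_weight = class_weights[y]
```

Because these weights multiply each sample's contribution to the loss during fitting (rather than adjusting predictions afterwards), this matches the "weighting the data space" category rather than prediction post-processing.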
URL: From olivier.grisel at ensta.org Sat Jun 23 06:42:27 2018 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Sat, 23 Jun 2018 12:42:27 +0200 Subject: [scikit-learn] New core dev: Joris Van den Bossche Message-ID: Hi everyone! Let's welcome Joris Van den Bossche (@jorisvdbossche) officially as a scikit-learn core developer! Joris is one of the maintainers of the pandas project and recently contributed many new great PRs to scikit-learn (notably the ColumnTransformer and a refactoring of the categorical variable preprocessing tools). Cheers! -- Olivier -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Sat Jun 23 11:13:07 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Sat, 23 Jun 2018 11:13:07 -0400 Subject: [scikit-learn] New core dev: Joris Van den Bossche In-Reply-To: References: Message-ID: That's great news! I am glad to hear that you joined the project, Joris Van den Bossche! I am a scikit-learn user (and sometimes contributor) and really appreciate all the time and effort that the core developers and contributors spend on maintaining and extending the library. Best regards, Sebastian > On Jun 23, 2018, at 6:42 AM, Olivier Grisel wrote: > > Hi everyone! > > Let's welcome Joris Van den Bossche (@jorisvdbossche) officially as a scikit-learn core developer! > > Joris is one of the maintainers of the pandas project and recently contributed many new great PRs to scikit-learn (notably the ColumnTransformer and a refactoring of the categorical variable preprocessing tools). > > Cheers! 
> > -- > Olivier > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From zephyr14 at gmail.com Sat Jun 23 11:20:18 2018 From: zephyr14 at gmail.com (Vlad Niculae) Date: Sat, 23 Jun 2018 11:20:18 -0400 Subject: [scikit-learn] New core dev: Joris Van den Bossche In-Reply-To: References: Message-ID: Congratulations Joris, very well deserved! Vlad On Sat, Jun 23, 2018, 11:15 Sebastian Raschka wrote: > That's great news! I am glad to hear that you joined the project, Joris > Van den Bossche! I am a scikit-learn user (and sometimes contributor) and > really appreciate all the time and effort that the core developers and > contributors spend on maintaining and extending the library. > > Best regards, > Sebastian > > > > On Jun 23, 2018, at 6:42 AM, Olivier Grisel > wrote: > > > > Hi everyone! > > > > Let's welcome Joris Van den Bossche (@jorisvdbossche) officially as a > scikit-learn core developer! > > > > Joris is one of the maintainers of the pandas project and recently > contributed many new great PRs to scikit-learn (notably the > ColumnTransformer and a refactoring of the categorical variable > preprocessing tools). > > > > Cheers! > > > > -- > > Olivier > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jason.wolosonovich at infusionsoft.com Mon Jun 25 13:14:03 2018 From: jason.wolosonovich at infusionsoft.com (Jason Wolosonovich) Date: Mon, 25 Jun 2018 17:14:03 +0000 Subject: [scikit-learn] New core dev: Joris Van den Bossche In-Reply-To: References: Message-ID: <4f99a8848387404388328dbc50ac701e@infusionsoft.com> Welcome Joris! - Jason -------------- next part -------------- An HTML attachment was scrubbed... URL: From mlcnworkshop at gmail.com Mon Jun 11 04:27:08 2018 From: mlcnworkshop at gmail.com (MLCN Workshop) Date: Mon, 11 Jun 2018 08:27:08 -0000 Subject: [scikit-learn] Deadline Extension: the first International Workshop on Machine Learning in Clinical Neuroimaging (MLCN 2018) Message-ID: Dear Colleagues, The paper submission deadline for MLCN 2018 workshop has been extended to June 25, 2018. ------------------------------------------------------------ ------------------------------------------------------ CALL FOR PAPERS: Recent advances in neuroimaging and statistical machine learning provide an exceptional opportunity for investigators and physicians to discover complex relationships between brain, behaviors, and mental and neurological disorders.
MLCN 2018 workshop, as a satellite event of MICCAI 2018, aims to bring together researchers in both theory and application from various fields in domains of spatial statistics, pattern recognition in neuroimaging, and predictive clinical neuroscience. Topics of interest include but are not limited to: - Applications of spatio-temporal modeling in predictive clinical neuroscience - Spatial regularization in decoding clinical neuroimaging data - Spatial statistics in neuroimaging - Learning with structured inputs and outputs in clinical neuroscience - Multi-task learning in analyzing structured neuroimaging data - Deep learning in analyzing structured neuroimaging data - Model stability and interpretability in clinical neuroscience ------------------------------------------------------------ --------------------------------------------------------- CONFIRMED SPEAKERS: Christos Davatzikos (University of Pennsylvania) Gaël Varoquaux (Parietal team, INRIA) Jian Kang (University of Michigan) ------------------------------------------------------------ ---------------------------------------------------------- SUBMISSION PROCESS: The workshop seeks high-quality, original, and unpublished work on algorithms, theory, and applications of machine learning in clinical neuroimaging and spatially structured data analysis. Papers of up to 8 pages should be submitted electronically in Springer Lecture Notes in Computer Science (LNCS) style using the CMT system at https://cmt3.research.microsoft.com/MLCN2018. This workshop uses a double-blind review process in the evaluation phase, thus authors must ensure anonymous submissions. Accepted papers will be published in a joint proceedings with the MICCAI conference.
------------------------------------------------------------ ----------------------------------------------------------- IMPORTANT DATES: Paper submission deadline: June 25, 2018 Notification of Acceptance: July 16, 2018 Camera-ready Submission: July 23, 2018 Workshop Date: September 20, 2018 ------------------------------------------------------------ ------------------------------------------------------ Best regards, MLCN 2018 Organizing Committee, Email: mlcnworkshop at gmail.com Website: https://mlcn2018.com/ twitter: @MLCN2018 -------------- next part -------------- An HTML attachment was scrubbed... URL: From zephyr14 at gmail.com Tue Jun 26 11:08:37 2018 From: zephyr14 at gmail.com (Vlad Niculae) Date: Tue, 26 Jun 2018 11:08:37 -0400 Subject: [scikit-learn] Scikit Multi learn error. In-Reply-To: References: Message-ID: Hi Aijaz, You're writing to the wrong mailing list. This is the mailing list for scikit-learn, not scikit-multilearn, which is a different and unrelated project. You're unlikely to get an answer here; I recommend following the contact information on the scikit-multilearn website. Best of luck, Yours Vlad On Tue, Jun 26, 2018, 11:04 aijaz qazi wrote: > Dear developer , > > I am working on web page categorization with http://scikit.ml/ . > > > *Question*: I am not able to execute MLkNN code on the link > http://scikit.ml/api/classify.html. I have installed py 3.6. > > I found scipy versions not compatible with scikit.ml 0.0.5. > > Which version of scipy would work with scikit.ml 0.0.5. > > Kindly let me know. I will be grateful. > > > *Regards,* > *Aijaz A.Qazi * > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From t3kcit at gmail.com Tue Jun 26 11:36:46 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 26 Jun 2018 11:36:46 -0400 Subject: [scikit-learn] New core dev: Joris Van den Bossche In-Reply-To: References: Message-ID: Welcome on board Joris, and thank you for all your work so far! On 06/23/2018 06:42 AM, Olivier Grisel wrote: > Hi everyone! > > Let's welcome Joris Van den Bossche (@jorisvdbossche) officially as a > scikit-learn core developer! > > Joris is one of the maintainers of the pandas project and recently > contributed many new great PRs to scikit-learn (notably the > ColumnTransformer and a refactoring of the categorical variable > preprocessing tools). > > Cheers! > > -- > Olivier > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From fernando.wittmann at gmail.com Tue Jun 26 13:40:16 2018 From: fernando.wittmann at gmail.com (Fernando Marcos Wittmann) Date: Tue, 26 Jun 2018 19:40:16 +0200 Subject: [scikit-learn] Scikit Multi learn error. In-Reply-To: References: Message-ID: Why there's a library based on Sklearn for multi classification? Sklearn itself can handle this ( http://scikit-learn.org/stable/modules/multiclass.html) On Tue, Jun 26, 2018 at 5:08 PM, Vlad Niculae wrote: > Hi Aijaz, > > You're writing to the wrong mailing list. This is the mailing list for > scikit-learn, not scikit-multilearn, which is a different and unrelated > project. You're unlikely to get an answer here; > I recommend following the contact information on the scikit-multilearn > website. > > Best of luck, > > Yours > Vlad > > On Tue, Jun 26, 2018, 11:04 aijaz qazi wrote: > >> Dear developer , >> >> I am working on web page categorization with http://scikit.ml/ . 
>> >> *Question*: I am not able to execute MLkNN code on the link >> http://scikit.ml/api/classify.html. I have installed py 3.6. >> >> I found scipy versions not compatible with scikit.ml 0.0.5. >> >> Which version of scipy would work with scikit.ml 0.0.5. >> >> Kindly let me know. I will be grateful. >> >> >> *Regards,* >> *Aijaz A.Qazi * >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Fernando Marcos Wittmann MS Student - Energy Systems Dept. School of Electrical and Computer Engineering, FEEC University of Campinas, UNICAMP, Brazil +55 (19) 987-211302 -------------- next part -------------- An HTML attachment was scrubbed... URL: From niedakh at gmail.com Tue Jun 26 16:07:13 2018 From: niedakh at gmail.com (Piotr Szymański) Date: Tue, 26 Jun 2018 22:07:13 +0200 Subject: [scikit-learn] Scikit Multi learn error. In-Reply-To: References: Message-ID: Scikit-multilearn features a larger variety of models, many of which are still not above the selectiveness threshold of scikit-learn. In general, scikit-learn implements only three multi-label classifiers - BinaryRelevance, OneVsRest and ClassifierChains. And generally sklearn.multioutput is a very recent addition (2016), added 3 years after the scikit-multilearn library was started. (responding again, after joining list, i'm sorry if anyone got this twice) wt., 26 cze 2018 o 19:59 użytkownik Piotr Szymański napisał: > Scikit-multilearn features a larger variety of models, many of which are > still not above the selectiveness threshold of scikit-learn. In general, > scikit-learn implements only three multi-label classifiers - > BinaryRelevance, OneVsRest and ClassifierChains. And generally > sklearn.multioutput is a very recent addition (2016), added 3 years after > the scikit-multilearn library was started. > > wt., 26 cze 2018 o 19:40 użytkownik Fernando Marcos Wittmann < > fernando.wittmann at gmail.com> napisał: > >> Why there's a library based on Sklearn for multi classification? Sklearn >> itself can handle this ( >> http://scikit-learn.org/stable/modules/multiclass.html) >> >> On Tue, Jun 26, 2018 at 5:08 PM, Vlad Niculae wrote: >> >>> Hi Aijaz, >>> >>> You're writing to the wrong mailing list. This is the mailing list for >>> scikit-learn, not scikit-multilearn, which is a different and unrelated >>> project. You're unlikely to get an answer here; >>> I recommend following the contact information on the scikit-multilearn >>> website. >>> >>> Best of luck, >>> >>> Yours >>> Vlad >>> >>> On Tue, Jun 26, 2018, 11:04 aijaz qazi wrote: >>> >>>> Dear developer , >>>> >>>> I am working on web page categorization with http://scikit.ml/ . >>>> >>>> >>>> *Question*: I am not able to execute MLkNN code on the link >>>> http://scikit.ml/api/classify.html. I have installed py 3.6. >>>> >>>> I found scipy versions not compatible with scikit.ml 0.0.5. >>>> >>>> Which version of scipy would work with scikit.ml 0.0.5. >>>> >>>> Kindly let me know. I will be grateful. >>>> >>>> >>>> *Regards,* >>>> *Aijaz A.Qazi * >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> >> Fernando Marcos Wittmann >> MS Student - Energy Systems Dept. >> School of Electrical and Computer Engineering, FEEC >> University of Campinas, UNICAMP, Brazil >> +55 (19) 987-211302 >> >> -------------- next part -------------- An HTML attachment was scrubbed...
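[Editor's note: binary relevance, the simplest of the multi-label strategies Piotr names, just fits one independent binary model per label column. The NumPy-only sketch below illustrates the idea; the toy data and the least-squares scorer are invented stand-ins for the per-label base estimators that a wrapper like scikit-learn's OneVsRestClassifier would clone, not scikit-learn's actual implementation.]

```python
import numpy as np

def fit_binary_relevance(X, Y):
    """Binary relevance: fit one independent scorer per label column.

    The per-label 'estimator' here is a least-squares fit against the
    labels mapped to {-1, +1} -- a deliberately trivial stand-in for a
    real base classifier.
    """
    # lstsq with a 2D right-hand side solves one problem per column,
    # so column j of W scores label j independently of the others.
    W, *_ = np.linalg.lstsq(X, 2.0 * Y - 1.0, rcond=None)
    return W

def predict_binary_relevance(X, W):
    # A sample receives label j whenever its j-th score is positive.
    return (X @ W > 0).astype(int)

# Tiny invented multi-label problem: 4 samples, 2 features, 2 labels.
X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 2.0]])
Y = np.array([[1, 0], [1, 1], [0, 1], [0, 1]])

W = fit_binary_relevance(X, Y)
pred = predict_binary_relevance(X, W)
```

A classifier chain differs only in that each label's model also receives the previous labels' predictions as extra input features, which lets it capture label correlations that binary relevance ignores.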
URL: From jorisvandenbossche at gmail.com Tue Jun 26 16:32:56 2018 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 26 Jun 2018 22:32:56 +0200 Subject: [scikit-learn] New core dev: Joris Van den Bossche In-Reply-To: References: Message-ID: Thanks all! I have been really enjoying working with the scikit-learn community! Joris 2018-06-26 17:36 GMT+02:00 Andreas Mueller : > Welcome on board Joris, and thank you for all your work so far! > > > On 06/23/2018 06:42 AM, Olivier Grisel wrote: > > Hi everyone! > > Let's welcome Joris Van den Bossche (@jorisvdbossche) officially as a > scikit-learn core developer! > > Joris is one of the maintainers of the pandas project and recently > contributed many new great PRs to scikit-learn (notably the > ColumnTransformer and a refactoring of the categorical variable > preprocessing tools). > > Cheers! > > -- > Olivier > > > _______________________________________________ > scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: