From pyformulas at gmail.com Sat Jun 2 01:13:52 2018 From: pyformulas at gmail.com (pyformulas) Date: Fri, 1 Jun 2018 23:13:52 -0600 Subject: [scikit-learn] Novel efficient one-shot optimizer for regression Message-ID: Hi, I created an algorithm that may solve linear regression problems with less time complexity than Singular Value Decomposition. It only requires the gradient and the diagonal of the Hessian to calculate the optimal weights. I attached the TensorFlow code below. I haven't been able to get it to work in pure NumPy yet, but I'm sure someone will be able to port it if it really does what it purports to do.

import numpy as np

# Small synthetic problem: Y is the target, X is a bias column
# plus three polynomial features of Y
Y = np.arange(10).reshape(10, 1) ** 0.5
bias_X = np.ones(10).reshape(10, 1)
X_feature1 = Y ** 3
X_feature2 = Y ** 4
X_feature3 = Y ** 5
X = np.concatenate((bias_X, X_feature1, X_feature2, X_feature3), axis=1)
num_features = 4

import tensorflow as tf

X_in = tf.placeholder(tf.float64, [None, num_features])
Y_in = tf.placeholder(tf.float64, [None, 1])
W = tf.placeholder(tf.float64, [num_features, 1])
W_squeezed = tf.squeeze(W)

Y_hat = tf.expand_dims(tf.tensordot(X_in, W_squeezed, ([1], [0])), axis=1)
loss = tf.reduce_mean(Y_in - Y_hat) ** 2  # use the Y_in placeholder, not the NumPy array Y

gradient = tf.gradients(loss, [W_squeezed])[0]
gradient_2nd = tf.diag_part(tf.hessians(loss, [W_squeezed])[0])

# One-shot update: jump each weight toward the vertex of its parabola
vertex_offset = -gradient / gradient_2nd / num_features
W_star = W_squeezed + vertex_offset
W_star = tf.expand_dims(W_star, axis=1)

with tf.Session() as sess:
    random_W = np.random.normal(size=(num_features, 1)).astype(np.float64)
    result1 = sess.run([loss, W_star, gradient, gradient_2nd],
                       feed_dict={X_in: X, Y_in: Y, W: random_W})
    random_loss = result1[0]
    optimal_W = result1[1]
    print('Random loss:', result1[0])
    print('Gradient:', result1[-2])
    print('2nd-order Gradient:', result1[-1])
    print('W:')
    print(random_W)
    print()
    print('W*:')
    print(result1[1])
    print()
    optimal_loss = sess.run(loss, feed_dict={X_in: X, Y_in: Y, W: optimal_W})
    print('Optimal loss:', optimal_loss)

-------------- next part -------------- An HTML attachment
was scrubbed... URL: From amirouche.boubekki at gmail.com Sun Jun 3 17:03:08 2018 From: amirouche.boubekki at gmail.com (Amirouche Boubekki) Date: Sun, 3 Jun 2018 23:03:08 +0200 Subject: [scikit-learn] Supervised prediction of multiple scores for a document Message-ID: Héllo, I started a natural language processing project a few weeks ago called wikimark (the code is all in wikimark.py). Given a text, it wants to return a dictionary scoring the input against vital articles categories, e.g.:

out = wikimark("""Peter Hintjens wrote about the relation between technology and culture. Without using the scientific tone of a state-of-the-art review of Anthropocene anthropology, he gives a fair amount of food for thought. According to Hintjens, technology is doomed to become cheap. As a matter of fact, intelligence tools will become more and more accessible, which will trigger a revolution to rebalance forces in society.""")

for category, score in out:
    print('{} ~ {}'.format(category, score))

The above program would output something like this:

Art ~ 0.1
Science ~ 0.5
Society ~ 0.4

Except not everything went as planned. Mind the fact that in the above example the total is equal to 1, but I could not achieve that at all.

I am using gensim to compute vectors of paragraphs (doc2vec) and then submit those vectors to svm.SVR in a one-vs-all strategy, i.e. a document is scored 1 if it's in that subcategory and zero otherwise. At prediction time, it goes through the same doc2vec pipeline. The computer will score *each paragraph* against the SVR models of Wikipedia vital article subcategories and get a value between 0 and 1 for *each paragraph*. I compute the sum and group by subcategory, and then I have a score per category for the input document.

It somewhat works. I made a web UI online; you can find it at https://sensimark.com where you can test it. You can directly access the full API, e.g.
https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&all=1 The output JSON document is a list of category dictionaries where the prediction key is associated with the average of the "prediction" of the subcategories. If you replace &all=1 by &top=5 you might get something else as top categories, e.g. https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=10 or https://sensimark.com/api/v0?url=http://www.multivax.com/last_question.html&top=5 I wrote "prediction" with double quotes because the value you see is the result of some formula. Since the predictions I get are rather small (between 0 and 0.015), I apply the following formula:

value = math.exp(prediction)
magic = ((value * 100) - 110) * 100

in order to have the values spread between -200 and 200. Maybe this is a symptom that my model doesn't work at all.

Still, the top 10 results are almost always near each other (try with BBC articles on https://sensimark.com). It is only when a regression model is disqualified with a score of 0 that the results are simple to understand. Sadly, I don't have an example at hand to support that claim. You have to believe me.

I just figured, looking at the machine learning map, that my problem might be a classification problem, except I don't really want to know what *the* class of new documents is; I want to know what the different subjects dealt with in the document are, based on a hierarchical corpus. I don't want to guess a hierarchy! I want to know how the document content spreads over the different categories or subcategories.

I quickly read about multinomial regression; is it something you recommend I use? Maybe you think about something else?

Also, it seems I should benchmark / evaluate my model against LDA.

I am rather a noob in terms of data science and my math skills are not so fresh.
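The doc2vec plus one-vs-rest SVR pipeline described above can be sketched roughly as follows. This is a hypothetical illustration, not wikimark's actual code: random vectors stand in for the gensim doc2vec output, the category names are invented, and a softmax at the end is one assumed way to make the per-category scores sum to 1 (which the post says it could not achieve):

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical sketch: random vectors stand in for gensim doc2vec output,
# and the category names are invented.
rng = np.random.default_rng(42)
n_docs, dim = 200, 50
vectors = rng.normal(size=(n_docs, dim))
categories = ["Art", "Science", "Society"]
labels = rng.integers(0, len(categories), size=n_docs)

# One-vs-rest: each category gets its own SVR trained on a 0/1 target.
models = {}
for i, cat in enumerate(categories):
    target = (labels == i).astype(float)
    models[cat] = SVR().fit(vectors, target)

# Score a new document against every per-category model ...
new_doc = rng.normal(size=(1, dim))
raw = np.array([models[cat].predict(new_doc)[0] for cat in categories])

# ... then squash with a softmax so the scores are positive and sum to 1
# (plain SVR outputs are unconstrained, hence the totals never adding up).
scores = np.exp(raw) / np.exp(raw).sum()
```

The softmax is only one choice; any monotone mapping onto the simplex would do, but it avoids the ad-hoc rescaling formula above.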
I am more likely looking for ideas on what algorithms, fine-tuning, and data science practices I should follow that don't involve writing my own algorithm.

Thanks in advance! -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Sun Jun 3 17:20:49 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Sun, 3 Jun 2018 17:20:49 -0400 Subject: [scikit-learn] Supervised prediction of multiple scores for a document In-Reply-To: <037411E4-8B6D-4EAB-A9C6-45AA73479364@sebastianraschka.com> References: <037411E4-8B6D-4EAB-A9C6-45AA73479364@sebastianraschka.com> Message-ID: Sorry, I had a copy & paste error; I meant "LogisticRegression(..., multi_class='multinomial')" and not "LogisticRegression(..., multi_class='ovr')"

> On Jun 3, 2018, at 5:19 PM, Sebastian Raschka wrote:
>
> Hi,
>
>> I quickly read about multinomial regression, is it something you recommend I use? Maybe you think about something else?
>
> Multinomial regression (or Softmax Regression) should give you results somewhat similar to a linear SVC (or logistic regression with OvO or OvR). The theoretical difference is that Softmax regression assumes that the classes are mutually exclusive, which is probably not the case in your setting, since e.g. an article could be both "Art" and "Science" to some extent. Here's a quick summary of softmax regression, if useful: https://sebastianraschka.com/faq/docs/softmax_regression.html. In scikit-learn, you can use it via LogisticRegression(..., multi_class='ovr').
>
> However, spontaneously, I would say that Latent Dirichlet Allocation could be a better choice in your case. I.e., fit the model on the corpus for a specified number of topics (e.g. 10, but it depends on your dataset; I would experiment a bit here), look at the top words in each topic, and then assign a topic label to each topic. Then, for a given article, you can assign e.g. the top X labeled topics.
> Best, > Sebastian > > [snip]
From mail at sebastianraschka.com Sun Jun 3 17:19:32 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Sun, 3 Jun 2018 17:19:32 -0400 Subject: [scikit-learn] Supervised prediction of multiple scores for a document In-Reply-To: References: Message-ID: <037411E4-8B6D-4EAB-A9C6-45AA73479364@sebastianraschka.com> Hi,

> I quickly read about multinomial regression, is it something you recommend I use? Maybe you think about something else?

Multinomial regression (or Softmax Regression) should give you results somewhat similar to a linear SVC (or logistic regression with OvO or OvR). The theoretical difference is that Softmax regression assumes that the classes are mutually exclusive, which is probably not the case in your setting, since e.g. an article could be both "Art" and "Science" to some extent. Here's a quick summary of softmax regression, if useful: https://sebastianraschka.com/faq/docs/softmax_regression.html. In scikit-learn, you can use it via LogisticRegression(..., multi_class='ovr').

However, spontaneously, I would say that Latent Dirichlet Allocation could be a better choice in your case. I.e., fit the model on the corpus for a specified number of topics (e.g. 10, but it depends on your dataset; I would experiment a bit here), look at the top words in each topic, and then assign a topic label to each topic. Then, for a given article, you can assign e.g. the top X labeled topics.
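In code, the softmax route looks roughly like this. A hypothetical sketch with invented data, not from the thread; note the correction posted elsewhere in this thread that the softmax variant is multi_class='multinomial', not 'ovr':

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical sketch: random vectors stand in for doc2vec document
# vectors, with three invented, mutually exclusive categories.
# In 2018-era scikit-learn the softmax variant was selected with
# LogisticRegression(multi_class='multinomial'); current versions use
# the multinomial (softmax) formulation by default for multiclass data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
y = rng.integers(0, 3, size=300)

clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X[:2])
# Each row is a proper distribution over the 3 classes: non-negative
# and summing to 1, unlike independent one-vs-rest SVR scores.
```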
Best, Sebastian

> On Jun 3, 2018, at 5:03 PM, Amirouche Boubekki wrote:
> [snip]
From sepand.haghighi at yahoo.com Mon Jun 4 11:06:52 2018 From: sepand.haghighi at yahoo.com (Sepand Haghighi) Date: Mon, 4 Jun 2018 15:06:52 +0000 (UTC) Subject: [scikit-learn] PyCM: Multiclass confusion matrix library in Python References: <21980772.1932951.1528124812254.ref@mail.yahoo.com> Message-ID: <21980772.1932951.1528124812254@mail.yahoo.com> Hi Stuart, Thanks ;-) The activation threshold is in our plan and will be added in the next release (in the next few weeks). Best Regards, Sepand Haghighi On Thursday, May 31, 2018, 9:56:43 PM GMT+4:30, Stuart Reynolds wrote: Hi Sepand, Thanks for this -- looks useful. I had to write something similar (for the binary case) and wish scikit had something like this. I wonder if there's something similar for the binary class case where the prediction is a real value (activation) and from this we can also derive - CMs for all prediction cutoffs (or a set of cutoffs?) - scores over all cutoffs (AUC, AP, ...) For me, in analyzing (binary class) performance, reporting scores for a single cutoff is less useful than seeing how the many scores (tpr, ppv, mcc, relative risk, chi^2, ...) vary at various false positive rates, or prediction quantiles. Does your library provide any tools for the binary case where we add an activation threshold? Thanks again for releasing this and providing pip packaging. - Stuart On Thu, May 31, 2018 at 6:05 AM, Sepand Haghighi via scikit-learn wrote: > PyCM is a multi-class confusion matrix library written in Python that > supports both input data vectors and direct matrix, and a proper tool for > post-classification model evaluation that supports most classes and overall > statistics parameters.
PyCM is the swiss-army knife of confusion matrices, > targeted mainly at data scientists who need a broad array of metrics for > predictive models and an accurate evaluation of a large variety of > classifiers. > > Github Repo : https://github.com/sepandhaghighi/pycm > > Webpage : http://pycm.shaghighi.ir/ > > JOSS Paper : https://doi.org/10.21105/joss.00729 > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Mon Jun 4 11:40:51 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 4 Jun 2018 11:40:51 -0400 Subject: [scikit-learn] PyCM: Multiclass confusion matrix library in Python In-Reply-To: References: <253486979.646953.1527771937416.ref@mail.yahoo.com> <253486979.646953.1527771937416@mail.yahoo.com> Message-ID: <4a3499e7-d5d9-6194-5d63-9ba9e8d36f56@gmail.com> On 5/31/18 1:26 PM, Stuart Reynolds wrote: > Hi Sepand, > > Thanks for this -- looks useful. I had to write something similar (for > the binary case) and wish scikit had something like this. Which part of it? I'm not entirely sure I understand what the core functionality is. > > I wonder if there's something similar for the binary class case where > the prediction is a real value (activation) and from this we can also > derive > - CMs for all prediction cutoff (or set of cutoffs?) > - scores over all cutoffs (AUC, AP, ...) AUC and AP are by definition over all cut-offs. And CMs for all cutoffs don't seem like a good idea, because that'll be n_samples many in the general case. If you want to specify a set of cutoffs, that would be pretty easy to do. How do you find these cut-offs, though? > > For me, in analyzing (binary class) performance, reporting scores for > a single cutoff is less useful than seeing how the many scores (tpr, > ppv, mcc, relative risk, chi^2, ...)
vary at various false positive > rates, or prediction quantiles. You can totally do that with sklearn right now. Granted, it's not as convenient as it could be, but we're working on it. What's really the crucial point for me is how to pick the cut-offs. Cheers, Andy From jbbrown at kuhp.kyoto-u.ac.jp Mon Jun 4 11:56:22 2018 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Tue, 5 Jun 2018 00:56:22 +0900 Subject: [scikit-learn] PyCM: Multiclass confusion matrix library in Python In-Reply-To: <4a3499e7-d5d9-6194-5d63-9ba9e8d36f56@gmail.com> References: <253486979.646953.1527771937416.ref@mail.yahoo.com> <253486979.646953.1527771937416@mail.yahoo.com> <4a3499e7-d5d9-6194-5d63-9ba9e8d36f56@gmail.com> Message-ID: Hello community, I wonder if there's something similar for the binary class case where, >> the prediction is a real value (activation) and from this we can also >> derive >> - CMs for all prediction cutoff (or set of cutoffs?) >> - scores over all cutoffs (AUC, AP, ...) >> > AUC and AP are by definition over all cut-offs. And CMs for all > cutoffs doesn't seem a good idea, because that'll be n_samples many > in the general case. If you want to specify a set of cutoffs, that would > be pretty easy to do. > How do you find these cut-offs, though? > >> >> For me, in analyzing (binary class) performance, reporting scores for >> a single cutoff is less useful than seeing how the many scores (tpr, >> ppv, mcc, relative risk, chi^2, ...) vary at various false positive >> rates, or prediction quantiles. >> > In terms of finding cut-offs, one could use the idea of metric surfaces that I recently proposed https://onlinelibrary.wiley.com/doi/abs/10.1002/minf.201700127 and then plot your per-threshold TPR/TNR pairs on the PPV/MCC/etc surfaces to determine what conditions you are willing to accept against the background of your prediction problem. 
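For the per-threshold sweep discussed above (which, as noted, is already doable with scikit-learn today), a minimal hypothetical sketch with synthetic activations and invented data:

```python
import numpy as np
from sklearn.metrics import roc_curve, confusion_matrix

# Hypothetical sketch: sweep every cutoff of a real-valued activation,
# getting one (TPR, TNR) pair per threshold, plus a confusion matrix
# at any single chosen cutoff.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
scores = y_true * 0.5 + rng.normal(scale=0.5, size=500)  # noisy activations

fpr, tpr, thresholds = roc_curve(y_true, scores)
tnr = 1.0 - fpr  # one (tpr[i], tnr[i]) pair for each thresholds[i]

cutoff = thresholds[len(thresholds) // 2]  # pick any threshold of interest
cm = confusion_matrix(y_true, (scores >= cutoff).astype(int))
```

From these per-threshold pairs, the derived scores (ppv, mcc, etc.) can then be computed at each cutoff.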
I use these surfaces (a) to think about the prediction problem before any attempt at modeling is made, and (b) to deconstruct results such as "Accuracy=85%" into interpretations in the context of my field and the data being predicted. Hope this contributes a bit of food for thought. J.B. -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Mon Jun 4 12:06:40 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 4 Jun 2018 12:06:40 -0400 Subject: [scikit-learn] PyCM: Multiclass confusion matrix library in Python In-Reply-To: References: <253486979.646953.1527771937416.ref@mail.yahoo.com> <253486979.646953.1527771937416@mail.yahoo.com> <4a3499e7-d5d9-6194-5d63-9ba9e8d36f56@gmail.com> Message-ID: Is that Jet?! https://www.youtube.com/watch?v=xAoljeRJ3lU ;) On 6/4/18 11:56 AM, Brown J.B. via scikit-learn wrote: > Hello community, > > I wonder if there's something similar for the binary class > case where, > the prediction is a real value (activation) and from this we > can also > derive > - CMs for all prediction cutoff (or set of cutoffs?) > - scores over all cutoffs (AUC, AP, ...) > > AUC and AP are by definition over all cut-offs. And CMs for all > cutoffs doesn't seem a good idea, because that'll be n_samples many > in the general case. If you want to specify a set of cutoffs, that > would be pretty easy to do. > How do you find these cut-offs, though? > > > For me, in analyzing (binary class) performance, reporting > scores for > a single cutoff is less useful than seeing how the many scores > (tpr, > ppv, mcc, relative risk, chi^2, ...) vary at various false > positive > rates, or prediction quantiles.
> > In terms of finding cut-offs, one could use the idea of metric > surfaces that I recently proposed > https://onlinelibrary.wiley.com/doi/abs/10.1002/minf.201700127 > and then plot your per-threshold TPR/TNR pairs on the PPV/MCC/etc > surfaces to determine what conditions you are willing to accept > against the background of your prediction problem. > > I use these surfaces (a) to think about the prediction problem before > any attempt at modeling is made, and (b) to deconstruct results such > as "Accuracy=85%" into interpretations in the context of my field and > the data being predicted. > > Hope this contributes a bit of food for thought. > J.B. > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Jun 4 21:09:57 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 5 Jun 2018 11:09:57 +1000 Subject: [scikit-learn] PyCM: Multiclass confusion matrix library in Python In-Reply-To: References: <253486979.646953.1527771937416.ref@mail.yahoo.com> <253486979.646953.1527771937416@mail.yahoo.com> <4a3499e7-d5d9-6194-5d63-9ba9e8d36f56@gmail.com> Message-ID: > > Thanks for this -- looks useful. I had to write something similar (for >> the binary case) and wish scikit had something like this. > > > Which part of it? I'm not entirely sure I understand what the core > functionality is. > > I think the core is efficiently evaluating the full set of metrics appropriate for the kind of task. We now support multi-metric scoring in things like cross_validation and GridSearchCV (but not in other CV implementations yet), but:

1. it's not efficient (there are PRs in progress to work around this, but they are definitely work-arounds in the sense that we're still repeatedly calling metric functions rather than calculating sufficient statistics once), and
2. we don't have a pre-defined set of scorers appropriate to binary classification; or for multiclass classification with 4 classes, one of which is the majority "no finding" class, etc.

But assuming we could solve or work around the first issue, having an interface, in the core library or elsewhere, which gave us a series of appropriately-named scorers for different task types might be neat and avoid code that a lot of people repeat:

def get_scorers_for_binary(pos_label, neg_label, proba_thresholds=(0.5,)):
    return {'precision:p>0.5': make_scorer(precision_score, pos_label=pos_label),
            'accuracy:p>0.5': 'accuracy',
            'roc_auc': 'roc_auc',
            'log_loss': 'log_loss',
            ...}

def get_scorers_for_multiclass(pos_labels, neg_labels=()):
    out = {'accuracy': 'accuracy',
           'mcc': make_scorer(matthews_corrcoef),
           'cohen_kappa': make_scorer(cohen_kappa_score),
           'precision_macro': make_scorer(precision_score, labels=pos_labels, average='macro'),
           'precision_weighted': make_scorer(precision_score, labels=pos_labels, average='weighted'),
           ...}
    if neg_labels:
        # micro-average precision is != accuracy only if some labels are excluded
        out['precision_micro'] = make_scorer(precision_score, labels=pos_labels, average='micro')
        ...
    return out

I note some risk of encouraging bad practice around multiple hypotheses, etc... but generally I think this would be helpful to users. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbbrown at kuhp.kyoto-u.ac.jp Tue Jun 5 02:48:17 2018 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.)
Date: Tue, 5 Jun 2018 15:48:17 +0900 Subject: [scikit-learn] PyCM: Multiclass confusion matrix library in Python In-Reply-To: References: <253486979.646953.1527771937416.ref@mail.yahoo.com> <253486979.646953.1527771937416@mail.yahoo.com> <4a3499e7-d5d9-6194-5d63-9ba9e8d36f56@gmail.com> Message-ID: 2018-06-05 1:06 GMT+09:00 Andreas Mueller : > Is that Jet?! > > https://www.youtube.com/watch?v=xAoljeRJ3lU > > ;) > Quite an entertaining presentation and informative to the non-expert about color theory, though I'm not sure I'd go so far as to call jet "evil" and that everyone hates it. Actually, I didn't know that the colormap known as Jet actually had a name... I had reverse engineered it to reproduce what I saw elsewhere. I suppose I'm glad I have already built my infrastructure's version of the metric surface plotter to allow complete color customization at runtime from the CLI, and can then tailor results to my audiences. :) I'll keep this video's explanation in mind - thanks for the reference. Cheers, J.B. > On 6/4/18 11:56 AM, Brown J.B. via scikit-learn wrote: > > Hello community, > > I wonder if there's something similar for the binary class case where, >>> the prediction is a real value (activation) and from this we can also >>> derive >>> - CMs for all prediction cutoff (or set of cutoffs?) >>> - scores over all cutoffs (AUC, AP, ...) >>> >> AUC and AP are by definition over all cut-offs. And CMs for all >> cutoffs doesn't seem a good idea, because that'll be n_samples many >> in the general case. If you want to specify a set of cutoffs, that would >> be pretty easy to do. >> How do you find these cut-offs, though? >> >>> >>> For me, in analyzing (binary class) performance, reporting scores for >>> a single cutoff is less useful than seeing how the many scores (tpr, >>> ppv, mcc, relative risk, chi^2, ...) vary at various false positive >>> rates, or prediction quantiles.
>>> >> > In terms of finding cut-offs, one could use the idea of metric surfaces > that I recently proposed > https://onlinelibrary.wiley.com/doi/abs/10.1002/minf.201700127 > and then plot your per-threshold TPR/TNR pairs on the PPV/MCC/etc surfaces > to determine what conditions you are willing to accept against the > background of your prediction problem. > > I use these surfaces (a) to think about the prediction problem before any > attempt at modeling is made, and (b) to deconstruct results such as > "Accuracy=85%" into interpretations in the context of my field and the data > being predicted. > > Hope this contributes a bit of food for thought. > J.B. > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nelle.varoquaux at gmail.com Tue Jun 5 20:06:37 2018 From: nelle.varoquaux at gmail.com (Nelle Varoquaux) Date: Tue, 5 Jun 2018 17:06:37 -0700 Subject: [scikit-learn] 2018 John Hunter Excellence in Plotting Contest Message-ID: Hello everyone, Sorry about the cross-posting. There's a couple more days to submit to the John Hunter Excellence in Plotting Competition! If you have any scientific plot worth sharing, submit an entry before June 8th. For more information, see below. Thanks, Nelle In memory of John Hunter, we are pleased to be reviving the SciPy John Hunter Excellence in Plotting Competition for 2018. This open competition aims to highlight the importance of data visualization to scientific progress and showcase the capabilities of open source software. Participants are invited to submit scientific plots to be judged by a panel. The winning entries will be announced and displayed at the conference. 
John Hunter's family and NumFocus are graciously sponsoring cash prizes for the winners in the following amounts:

- 1st prize: $1000
- 2nd prize: $750
- 3rd prize: $500

- Entries must be submitted by June 8th to the form at https://goo.gl/forms/7q86zgu5OYUOjODH3 .
- Winners will be announced at Scipy 2018 in Austin, TX.
- Participants do not need to attend the Scipy conference.
- Entries may take the definition of "visualization" rather broadly. Entries may be, for example, a traditional printed plot, an interactive visualization for the web, or an animation.
- Source code for the plot must be provided, in the form of Python code and/or a Jupyter notebook, along with a rendering of the plot in a widely used format. This may be, for example, PDF for print, standalone HTML and Javascript for an interactive plot, or MPEG-4 for a video. If the original data can not be shared for reasons of size or licensing, "fake" data may be substituted, along with an image of the plot using real data.
- Each entry must include a 300-500 word abstract describing the plot and its importance for a general scientific audience.
- Entries will be judged on their clarity, innovation and aesthetics, but most importantly for their effectiveness in communicating a real-world problem. Entrants are encouraged to submit plots that were used during the course of research or work, rather than merely being hypothetical.
- SciPy reserves the right to display any and all entries, whether prize-winning or not, at the conference, use in any materials or on its website, with attribution to the original author(s).

SciPy John Hunter Excellence in Plotting Competition Co-Chairs Thomas Caswell Michael Droettboom Nelle Varoquaux -------------- next part -------------- An HTML attachment was scrubbed...
URL: From t3kcit at gmail.com Wed Jun 6 13:33:18 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 6 Jun 2018 13:33:18 -0400 Subject: [scikit-learn] PyCM: Multiclass confusion matrix library in Python In-Reply-To: References: <253486979.646953.1527771937416.ref@mail.yahoo.com> <253486979.646953.1527771937416@mail.yahoo.com> <4a3499e7-d5d9-6194-5d63-9ba9e8d36f56@gmail.com> Message-ID: On 6/5/18 2:48 AM, Brown J.B. via scikit-learn wrote: > > > 2018-06-05 1:06 GMT+09:00 Andreas Mueller >: > > Is that Jet?! > > https://www.youtube.com/watch?v=xAoljeRJ3lU > > > ;) > > > Quite an entertaining presentation and informative to the non-expert > about color theory, though I'm not sure I'd go so far as to call jet > "evil" and that everyone hates it. > Actually, I didn't know that the colormap known as Jet actually had a > name...I had reverse engineered it to reproduce what I saw elsewhere. > I suppose I'm glad I have already built my infrastructure's version of > the metric surface plotter to allow complete color customization at > runtime from the CLI, and can then tailor results to my audiences. :) From what I understood, there is evidence of misdiagnosis because of the use of jet. The main issue is that it creates borders in the image where there are none, and that seems like something that might be an issue in your application as well. -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Thu Jun 7 14:50:16 2018 From: robert.kern at gmail.com (Robert Kern) Date: Thu, 7 Jun 2018 11:50:16 -0700 Subject: [scikit-learn] NEP: Random Number Generator Policy Message-ID: https://mail.python.org/pipermail/numpy-discussion/2018-June/078126.html Hi, sklearners! I have a NEP out for discussion that proposes a change in numpy.random's stream-compatibility policy. As scikit-learn is a well-disciplined consumer of reproducible streams, I would appreciate your input on the numpy-discussion thread linked above.
The very short form is that there is a new PRNG subsystem being developed with better core PRNGs (among other things, providing nice features like independent streams for parallel computations), and we would like to relax our strict stream-compatibility policy for the non-uniform distributions in this new subsystem so that we can improve our algorithms. The core uniform numbers would still be strictly stream-compatible across numpy versions. But we would like to be able to upgrade our non-uniform algorithms, for example, to make normal variates faster to generate. RandomState would be frozen and subject to a long deprecation cycle for a period of strict backwards compatibility. There would be some non-deprecated provision to get strictly-compatible streams for a subset of distributions for the limited purpose of generating test data for unit tests. Please read the NEP and the thread through. I do propose at least one alternative in the thread and would like some feedback on it. I would also appreciate it if we could consolidate the discussion on the numpy-discussion thread and not have a split-off conversation here too. Thank you very much! I appreciate your attention. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco From guettliml at thomas-guettler.de Fri Jun 8 04:48:59 2018 From: guettliml at thomas-guettler.de (=?UTF-8?Q?Thomas_G=c3=bcttler?=) Date: Fri, 8 Jun 2018 10:48:59 +0200 Subject: [scikit-learn] Mapping fulltext OCR to issue type Message-ID: We run an issue tracking application. A lot of issues get generated from scanned letters. I have 70k full text OCR result files. They were created with Tesseract. Every file of these 70k files corresponds to an issue. Each issue has an issue type.
I want to use machine learning and in the future the machine should be able to guess the issue type by looking at the full text OCR. The issue types are not a simple list; they form a tree. Example: electricity / power grid electricity / outages customer support / invoices / complaint customer support / invoices / tax .... If the machine can't guess "customer support / invoices / complaint" it would be nice if it could at least guess roughly the parent issue type: "customer support / invoices" I have never used scikit-learn before, but I have used Python for several years. Could you please guide me in the right direction? Regards, Thomas Güttler -- Thomas Guettler http://www.thomas-guettler.de/ I am looking for feedback: https://github.com/guettli/programming-guidelines From francois.dion at gmail.com Fri Jun 8 07:32:06 2018 From: francois.dion at gmail.com (Francois Dion) Date: Fri, 8 Jun 2018 07:32:06 -0400 Subject: [scikit-learn] Trained model repository? Message-ID: Does anybody know of a repo or site that has scikit-learn pre-trained models / pipelines? There are specific projects that might include a model in their github repo (I've done that for a PyData talk in the past), and I've also seen specific frameworks including some pre-trained neural networks (keras and caffe2 for example), but I don't think there's anything for scikit models. I've asked around and on twitter, but nothing. I figured, if anybody would know, it would have to be on the sklearn list. Francois -------------- next part -------------- An HTML attachment was scrubbed... URL: From randalljellis at gmail.com Fri Jun 8 09:34:11 2018 From: randalljellis at gmail.com (Randy Ellis) Date: Fri, 8 Jun 2018 09:34:11 -0400 Subject: [scikit-learn] Trained model repository?
In-Reply-To: References: Message-ID: Not sure if sklearn has one, but Tensorflow has Tensorhub https://www.tensorflow.org/hub/ On Fri, Jun 8, 2018 at 7:32 AM, Francois Dion wrote: > Does anybody know of a repo or site that has scikit-learn pre-trained > models / pipelines? > > There are specific projects that might include a model in their github > repo (I've done that for a PyData talk in the past), and I've also seen > specific frameworks including some pre-trained neural networks (keras and > caffe2 for example), but I don't think there's anything for scikit models. > > I've asked around and on twitter, but nothing. I figured, if anybody would > know, it would have to be on the sklearn list. > > > Francois > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- *Randall J. Ellis, B.S.* PhD Student, Biomedical Science, Mount Sinai Special Volunteer, http://www.michaelideslab.org/, NIDA IRP Cell: (954)-260-9891 -------------- next part -------------- An HTML attachment was scrubbed... URL: From francois.dion at gmail.com Fri Jun 8 14:13:25 2018 From: francois.dion at gmail.com (Francois Dion) Date: Fri, 8 Jun 2018 14:13:25 -0400 Subject: [scikit-learn] Trained model repository? In-Reply-To: References: Message-ID: Thanks. Speaking tomorrow at SouthEast Linux Fest in Charlotte and am providing examples of pre-trained models. Once more, I will demo a pre-trained model I did, but It would have been nice to point to a hub / repo. Francois On Fri, Jun 8, 2018 at 9:34 AM, Randy Ellis wrote: > Not sure if sklearn has one, but Tensorflow has Tensorhub https://www. > tensorflow.org/hub/ > > On Fri, Jun 8, 2018 at 7:32 AM, Francois Dion > wrote: > >> Does anybody know of a repo or site that has scikit-learn pre-trained >> models / pipelines? 
>> >> There are specific projects that might include a model in their github >> repo (I've done that for a PyData talk in the past), and I've also seen >> specific frameworks including some pre-trained neural networks (keras and >> caffe2 for example), but I don't think there's anything for scikit models. >> >> I've asked around and on twitter, but nothing. I figured, if anybody >> would know, it would have to be on the sklearn list. >> >> >> Francois >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > > -- > *Randall J. Ellis, B.S.* > PhD Student, Biomedical Science, Mount Sinai > Special Volunteer, http://www.michaelideslab.org/, NIDA IRP > Cell: (954)-260-9891 > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Fri Jun 8 14:18:01 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Fri, 8 Jun 2018 14:18:01 -0400 Subject: [scikit-learn] Trained model repository? In-Reply-To: References: Message-ID: <868e24f8-464d-e351-c5e3-e30a787a1f56@gmail.com> I'm not sure what you mean. Pre-trained on what task? And what kind of models? The only task I can think that would make sense would be text data with BoW representation, and I'm not sure which models we could pretrain for that. Maybe PCA, topic models and MLP? To what end though for PCA and topic models? And if you're serious about MLPs, why not use keras? On 6/8/18 7:32 AM, Francois Dion wrote: > Does anybody know of a repo or site that has scikit-learn pre-trained > models / pipelines? 
> > There are specific projects that might include a model in their github > repo (I've done that for a PyData talk in the past), and I've also > seen specific frameworks including some pre-trained neural networks > (keras and caffe2 for example), but I don't think there's anything for > scikit models. > > I've asked around and on twitter, but nothing. I figured, if anybody > would know, it would have to be on the sklearn list. > > > Francois > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeff1evesque at yahoo.com Sun Jun 10 22:11:18 2018 From: jeff1evesque at yahoo.com (Jeffrey Levesque) Date: Sun, 10 Jun 2018 22:11:18 -0400 Subject: [scikit-learn] Jeff Levesque: profit functionality Message-ID: <8BA313C7-26FE-41ED-93D1-6EDF5BE024EA@yahoo.com> Hi guys, Does sklearn have both probit, and logic functionality? Thank you, Jeff Levesque https://github.com/jeff1evesque From jeff1evesque at yahoo.com Sun Jun 10 23:26:55 2018 From: jeff1evesque at yahoo.com (Jeffrey Levesque) Date: Sun, 10 Jun 2018 23:26:55 -0400 Subject: [scikit-learn] Jeff Levesque: profit functionality In-Reply-To: <8BA313C7-26FE-41ED-93D1-6EDF5BE024EA@yahoo.com> References: <8BA313C7-26FE-41ED-93D1-6EDF5BE024EA@yahoo.com> Message-ID: Sorry typo: meant logit, and probit. Thank you, Jeff Levesque https://github.com/jeff1evesque > On Jun 10, 2018, at 10:11 PM, Jeffrey Levesque via scikit-learn wrote: > > Hi guys, > Does sklearn have both probit, and logic functionality? 
> > Thank you, > > Jeff Levesque > https://github.com/jeff1evesque > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From alexandre.gramfort at inria.fr Mon Jun 11 04:29:19 2018 From: alexandre.gramfort at inria.fr (Alexandre Gramfort) Date: Mon, 11 Jun 2018 10:29:19 +0200 Subject: [scikit-learn] Jeff Levesque: profit functionality In-Reply-To: References: <8BA313C7-26FE-41ED-93D1-6EDF5BE024EA@yahoo.com> Message-ID: no only logit with LogisticRegression estimator. Alex From dylanf123 at gmail.com Mon Jun 11 05:02:45 2018 From: dylanf123 at gmail.com (Dylan Fernando) Date: Mon, 11 Jun 2018 19:02:45 +1000 Subject: [scikit-learn] scikit-learn-contrib: building Cython, cpp files In-Reply-To: References: Message-ID: Hi Joris, Thanks, I'll try that. On Fri, Jun 1, 2018 at 5:20 AM, Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > Hi Dylan, > > In case you are still looking for a solution:I didn't directly find good > templates for packages that depend on cython (there are quite some, but > from quickly looking at them, I didn't find a simple one), but you can > maybe have a look at one of the other scikit-learn-contrib packages that > uses cython: https://github.com/scikit-learn-contrib/hdbscan > And you can check here how to adapt the Extension class to specify c++: > http://cython.readthedocs.io/en/latest/src/userguide/ > wrapping_CPlusPlus.html#specify-c-language-in-setup-py > > Best, > Joris > > > > 2018-05-29 6:56 GMT+02:00 Dylan Fernando : > >> Hi, >> >> I would like to publish this: >> https://github.com/dil12321/scikit-learn/tree/aode >> https://github.com/scikit-learn/scikit-learn/pull/11093 >> >> as a scikit-learn-contrib project. However, I'm not sure how to write the >> setup.py file so that aode_helper.cpp and _aode.pyx get included in the >> package, and run correctly. How should I write setup.py? 
>> Regards, >> Dylan >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeff1evesque at yahoo.com Mon Jun 11 07:23:12 2018 From: jeff1evesque at yahoo.com (Jeffrey Levesque) Date: Mon, 11 Jun 2018 07:23:12 -0400 Subject: [scikit-learn] Jeff Levesque: association rules Message-ID: <348CB71E-45A6-478C-893C-8BB0E5765FEB@yahoo.com> Hi guys, What are some good approaches for association rules? Is there something built in, or do people sometimes use alternate packages, maybe Apache Spark? Thank you, Jeff Levesque https://github.com/jeff1evesque From stuart at stuartreynolds.net Mon Jun 11 12:18:18 2018 From: stuart at stuartreynolds.net (Stuart Reynolds) Date: Mon, 11 Jun 2018 09:18:18 -0700 Subject: [scikit-learn] Jeff Levesque: profit functionality In-Reply-To: References: <8BA313C7-26FE-41ED-93D1-6EDF5BE024EA@yahoo.com> Message-ID: Scikit has a section on 'GLMs' 1.1. Generalized Linear Models http://scikit-learn.org/stable/modules/linear_model.html not covered there? (That page doesn't look like GLMs -- mostly it covers different fitting, loss and regularization methods, but not general functional distributions). If not, check out statsmodels' GLM http://www.statsmodels.org/dev/glm.html http://www.statsmodels.org/dev/generated/statsmodels.genmod.generalized_linear_model.GLM.html On Mon, Jun 11, 2018 at 1:29 AM, Alexandre Gramfort wrote: > no only logit with LogisticRegression estimator.
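[Editor's note: as this thread says, scikit-learn ships only the logit link via LogisticRegression, while statsmodels (linked above) provides probit. The sketch below is a minimal didactic illustration, not statsmodels' implementation: it fits a probit model by maximizing its log-likelihood directly with SciPy, on invented toy data, and compares the result to scikit-learn's logit fit.]

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm
from sklearn.linear_model import LogisticRegression

# Toy binary-classification data, invented for illustration.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) + 0.5 * rng.normal(size=200) > 0).astype(int)

# Logit link: built into scikit-learn (large C ~= nearly unpenalized).
logit = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)

# Probit link: maximize the log-likelihood by hand.
def neg_log_likelihood(w):
    p = norm.cdf(X @ w)               # probit inverse link
    p = np.clip(p, 1e-9, 1 - 1e-9)    # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

w_probit = minimize(neg_log_likelihood, np.zeros(2), method="BFGS").x
probit_pred = (norm.cdf(X @ w_probit) > 0.5).astype(int)
```

On data like this the two links give near-identical classifications; for a maintained probit implementation with standard errors, the statsmodels GLM pointers above are the practical route.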
> > Alex > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From mail at sebastianraschka.com Mon Jun 11 13:05:23 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Mon, 11 Jun 2018 13:05:23 -0400 Subject: [scikit-learn] Jeff Levesque: association rules In-Reply-To: <348CB71E-45A6-478C-893C-8BB0E5765FEB@yahoo.com> References: <348CB71E-45A6-478C-893C-8BB0E5765FEB@yahoo.com> Message-ID: <69BEFE99-D433-4605-A005-3787455607AF@sebastianraschka.com> Hi Jeff, had a similar question 1-2 years ago and ended up using Chris Borgelt's C command line tools but for convenience, i also implemented basic association rule & frequent pattern mining in Python here: http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/ Best, Sebastian > On Jun 11, 2018, at 7:23 AM, Jeffrey Levesque via scikit-learn wrote: > > Hi guys, > What are some good approaches for association rules. Is there something built in, or do people sometimes use alternate packages, maybe apache spark? > > Thank you, > > Jeff Levesque > https://github.com/jeff1evesque > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From dmitrii.ignatov at gmail.com Mon Jun 11 14:17:30 2018 From: dmitrii.ignatov at gmail.com (Dmitry Ignatov) Date: Mon, 11 Jun 2018 20:17:30 +0200 Subject: [scikit-learn] Jeff Levesque: association rules In-Reply-To: <69BEFE99-D433-4605-A005-3787455607AF@sebastianraschka.com> References: <348CB71E-45A6-478C-893C-8BB0E5765FEB@yahoo.com> <69BEFE99-D433-4605-A005-3787455607AF@sebastianraschka.com> Message-ID: Hi All, A good tool. I also use SPMF (Java-based library) and Apache Spark (they do not have closed itemsets there). There is a part of Orange data mining on association rules mining, which can be used as a Python library. 
A couple of years ago I asked Gilles Louppe about frequent itemset mining tools within scikit-learn as well. The answer was something like that: nobody asked us about that... Best regards, Dmitry Mon, 11 June 2018 at 19:30, Sebastian Raschka : > Hi Jeff, > > had a similar question 1-2 years ago and ended up using Chris Borgelt's C > command line tools but for convenience, i also implemented basic > association rule & frequent pattern mining in Python here: > > http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/ > > Best, > Sebastian > > > On Jun 11, 2018, at 7:23 AM, Jeffrey Levesque via scikit-learn < > scikit-learn at python.org> wrote: > > > > Hi guys, > > What are some good approaches for association rules. Is there something > built in, or do people sometimes use alternate packages, maybe apache spark? > > > > Thank you, > > > > Jeff Levesque > > https://github.com/jeff1evesque > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From christian.braune79 at gmail.com Mon Jun 11 14:51:00 2018 From: christian.braune79 at gmail.com (Christian Braune) Date: Mon, 11 Jun 2018 20:51:00 +0200 Subject: [scikit-learn] Jeff Levesque: association rules In-Reply-To: <69BEFE99-D433-4605-A005-3787455607AF@sebastianraschka.com> References: <348CB71E-45A6-478C-893C-8BB0E5765FEB@yahoo.com> <69BEFE99-D433-4605-A005-3787455607AF@sebastianraschka.com> Message-ID: Hey, Christian Borgelt currently has several itemset mining algorithms online with a python interface: http://borgelt.net/pyfim.html . Best regards, Chris Sebastian Raschka schrieb am Mo., 11.
Juni 2018 um 19:30 Uhr: > Hi Jeff, > > had a similar question 1-2 years ago and ended up using Chris Borgelt's C > command line tools but for convenience, i also implemented basic > association rule & frequent pattern mining in Python here: > > http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/ > > Best, > Sebastian > > > On Jun 11, 2018, at 7:23 AM, Jeffrey Levesque via scikit-learn < > scikit-learn at python.org> wrote: > > > > Hi guys, > > What are some good approaches for association rules. Is there something > built in, or do people sometimes use alternate packages, maybe apache spark? > > > > Thank you, > > > > Jeff Levesque > > https://github.com/jeff1evesque > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dmitrii.ignatov at gmail.com Mon Jun 11 14:54:30 2018 From: dmitrii.ignatov at gmail.com (Dmitry Ignatov) Date: Mon, 11 Jun 2018 20:54:30 +0200 Subject: [scikit-learn] Jeff Levesque: association rules In-Reply-To: References: <348CB71E-45A6-478C-893C-8BB0E5765FEB@yahoo.com> <69BEFE99-D433-4605-A005-3787455607AF@sebastianraschka.com> Message-ID: My students use it too :-) ??, 11 ???? 2018 ?. ? 20:53, Christian Braune : > Hey, > > Christian Borgelt currently has several itemset mining algorithms online > with a python interface: http://borgelt.net/pyfim.html . > > Best regards, > Chris > > > Sebastian Raschka schrieb am Mo., 11. 
Juni > 2018 um 19:30 Uhr: > >> Hi Jeff, >> >> had a similar question 1-2 years ago and ended up using Chris Borgelt's C >> command line tools but for convenience, i also implemented basic >> association rule & frequent pattern mining in Python here: >> >> http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/ >> >> Best, >> Sebastian >> >> > On Jun 11, 2018, at 7:23 AM, Jeffrey Levesque via scikit-learn < >> scikit-learn at python.org> wrote: >> > >> > Hi guys, >> > What are some good approaches for association rules. Is there something >> built in, or do people sometimes use alternate packages, maybe apache spark? >> > >> > Thank you, >> > >> > Jeff Levesque >> > https://github.com/jeff1evesque >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathanrocher at gmail.com Mon Jun 11 17:45:42 2018 From: jonathanrocher at gmail.com (Jonathan Rocher) Date: Mon, 11 Jun 2018 16:45:42 -0500 Subject: [scikit-learn] Jeff Levesque: association rules In-Reply-To: References: <348CB71E-45A6-478C-893C-8BB0E5765FEB@yahoo.com> <69BEFE99-D433-4605-A005-3787455607AF@sebastianraschka.com> Message-ID: Yep, pyfim is what I too used for a past project... On Mon, Jun 11, 2018 at 1:55 PM Dmitry Ignatov wrote: > My students use it too :-) > > ??, 11 ???? 2018 ?. ? 
20:53, Christian Braune < > christian.braune79 at gmail.com>: > >> Hey, >> >> Christian Borgelt currently has several itemset mining algorithms online >> with a python interface: http://borgelt.net/pyfim.html . >> >> Best regards, >> Chris >> >> >> Sebastian Raschka schrieb am Mo., 11. Juni >> 2018 um 19:30 Uhr: >> >>> Hi Jeff, >>> >>> had a similar question 1-2 years ago and ended up using Chris Borgelt's >>> C command line tools but for convenience, i also implemented basic >>> association rule & frequent pattern mining in Python here: >>> >>> http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/ >>> >>> Best, >>> Sebastian >>> >>> > On Jun 11, 2018, at 7:23 AM, Jeffrey Levesque via scikit-learn < >>> scikit-learn at python.org> wrote: >>> > >>> > Hi guys, >>> > What are some good approaches for association rules. Is there >>> something built in, or do people sometimes use alternate packages, maybe >>> apache spark? >>> > >>> > Thank you, >>> > >>> > Jeff Levesque >>> > https://github.com/jeff1evesque >>> > _______________________________________________ >>> > scikit-learn mailing list >>> > scikit-learn at python.org >>> > https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Jonathan Rocher Austin TX, USA twitter:@jonrocher , linkedin:jonathanrocher ------------------------------------------------------------- -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From joel.nothman at gmail.com Mon Jun 11 21:09:35 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 12 Jun 2018 11:09:35 +1000 Subject: [scikit-learn] Jeff Levesque: profit functionality In-Reply-To: References: <8BA313C7-26FE-41ED-93D1-6EDF5BE024EA@yahoo.com> Message-ID: There is a PR for more GLM support ( https://github.com/scikit-learn/scikit-learn/pull/9405), but I don't think it will be in the next release. -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Mon Jun 11 21:16:46 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Tue, 12 Jun 2018 11:16:46 +1000 Subject: [scikit-learn] Jeff Levesque: association rules In-Reply-To: References: <348CB71E-45A6-478C-893C-8BB0E5765FEB@yahoo.com> <69BEFE99-D433-4605-A005-3787455607AF@sebastianraschka.com> Message-ID: We have definitely discussed association rules in issues before. It's considered out of scope for scikit-learn, except insofar as it is used for learning classification. We haven't yet been convinced that classifiers based on associative learning have enough practical demand to justify their maintenance in the project. Then again, we have not had a pull request implementing any such algorithms; there seems to be demand mostly for the vanilla association rule mining algorithms. They are definitely out of scope for scikit-learn. See: https://github.com/scikit-learn/scikit-learn/issues/801, https://github.com/scikit-learn/scikit-learn/issues/2662, https://github.com/scikit-learn/scikit-learn/issues/2872 -------------- next part -------------- An HTML attachment was scrubbed...
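[Editor's note: since association-rule mining is out of scope for scikit-learn, the core of the Apriori algorithm behind the packages mentioned in this thread (mlxtend, pyfim, SPMF) can be sketched in pure Python. This is a didactic sketch with an invented toy basket dataset, not any of those libraries' implementations; in particular it counts all joined candidates and skips the usual subset-pruning optimization.]

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset whose support >= min_support."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]
    # Level 1: every distinct item is a candidate 1-itemset.
    candidates = list({frozenset([item]) for basket in baskets for item in basket})
    frequent = {}
    k = 1
    while candidates:
        # Count how many baskets contain each candidate, keep the frequent ones.
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into candidate (k+1)-itemsets.
        k += 1
        candidates = list({a | b for a, b in combinations(level, 2) if len(a | b) == k})
    return frequent

baskets = [
    ["milk", "bread"],
    ["milk", "diapers", "beer", "bread"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]
itemsets = apriori(baskets, min_support=0.6)
```

Association rules then follow by splitting each frequent itemset into antecedent/consequent and filtering on confidence, which is what the libraries above automate.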
URL: From guettliml at thomas-guettler.de Wed Jun 13 05:43:55 2018 From: guettliml at thomas-guettler.de (=?UTF-8?Q?Thomas_G=c3=bcttler?=) Date: Wed, 13 Jun 2018 11:43:55 +0200 Subject: [scikit-learn] Mapping fulltext OCR to issue type In-Reply-To: References: Message-ID: <4a8ec9b8-24f0-1b7a-4064-4dbdb648a751@thomas-guettler.de> I am still willing to learn. Does anyone have a recommendation which book or website could help me? Regards, Thomas On 08.06.2018 at 10:48, Thomas Güttler wrote: > We run an issue tracking application. A lot of issues get generated > from scanned letters. > > I have 70k full text OCR result files. They were created with Tesseract. > > Every file of these 70k files corresponds to an issue. Each issue has an issue type. > > I want to use machine learning and in the future the machine > should be able to guess the issue type by looking at the full text OCR. > > The issue types are not a simple list; they form a tree. > > Example: > > electricity / power grid > electricity / outages > customer support / invoices / complaint > customer support / invoices / tax > .... > > > If the machine can't guess > > "customer support / invoices / complaint" > > it would be nice if it could at least guess roughly the parent issue type: > > "customer support / invoices" > > I have never used scikit-learn before, but I have used Python for several years. > > Could you please guide me in the right direction? > > Regards, >
Thomas Güttler > > -- Thomas Guettler http://www.thomas-guettler.de/ I am looking for feedback: https://github.com/guettli/programming-guidelines From davidasfaha at gmail.com Wed Jun 13 06:25:45 2018 From: davidasfaha at gmail.com (David Asfaha) Date: Wed, 13 Jun 2018 11:25:45 +0100 Subject: [scikit-learn] Mapping fulltext OCR to issue type In-Reply-To: <4a8ec9b8-24f0-1b7a-4064-4dbdb648a751@thomas-guettler.de> References: <4a8ec9b8-24f0-1b7a-4064-4dbdb648a751@thomas-guettler.de> Message-ID: Hi, I would recommend starting with Naive Bayes [1] to classify the issues by parent issue type. To check how that works, learn about F1 accuracy scores [2] and use them. If you are happy with the results, and depending on how much data you have, try to modify the Naive Bayes classifier to predict the specific issue type. From here there are many more things to do, like using an ensemble of classifiers, experimenting with SVMs, random forest, TFIDF, n-grams... Natural Language Processing with Python is a good book on NLP, also Andrew Ng's Machine Learning course on coursera if you're new to the subject. Hope this helps. David [1] http://scikit-learn.org/stable/modules/naive_bayes.html [2] http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html On 13 June 2018 at 10:43, Thomas Güttler wrote: > I am still willing to learn. > > Does anyone have a recommendation which book or website could help me? > > Regards, > Thomas > > > On 08.06.2018 at 10:48, Thomas Güttler wrote: > >> We run an issue tracking application. A lot of issues get generated >> from scanned letters. >> >> I have 70k full text OCR result files. They were created with Tesseract. >> >> Every file of these 70k files corresponds to an >> issue. Each issue has an >> issue type. >> >> I want to use machine learning and in the future the machine >> should be able to guess the issue type by looking at the full text OCR. >> >> The issue types are not a simple list; they form a tree.
>> >> Example: >> electricity / power grid >> electricity / outages >> customer support / invoices / complaint >> customer support / invoices / tax >> .... >> >> >> If the machine can't guess >> >> "customer support / invoices / complaint" >> >> it would be nice if it could at least guess roughly the parent issue type: >> >> "customer support / invoices" >> >> I have never used scikit-learn before, but I have used Python for several years. >> >> Could you please guide me in the right direction? >> >> Regards, >> Thomas Güttler >> >> >> > -- > Thomas Guettler http://www.thomas-guettler.de/ > I am looking for feedback: https://github.com/guettli/programming-guidelines > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alexandra.log at sintef.no Thu Jun 14 06:44:28 2018 From: alexandra.log at sintef.no (Alexandra Metallinou Log) Date: Thu, 14 Jun 2018 10:44:28 +0000 Subject: [scikit-learn] help Message-ID: Dear Sir/Madam, I have stumbled upon a problem while trying to run some old code using scikit-learn: scores = cross_validation.cross_val_score(model, X, Y, cv = 10, score_func = metrics.mean_squared_error) This line will not run in a program I downloaded, and as I am not yet very familiar with scikit-learn I do not know how I should replace "score_func = metrics.mean_squared_error" to produce the same result as intended by the ones who made the program. Any help is greatly appreciated. Best regards, Alexandra Log -------------- next part -------------- An HTML attachment was scrubbed... URL: From gryllosprokopis at gmail.com Thu Jun 14 11:50:21 2018 From: gryllosprokopis at gmail.com (Prokopis Gryllos) Date: Thu, 14 Jun 2018 17:50:21 +0200 Subject: [scikit-learn] help In-Reply-To: References: Message-ID: Hey Alexandra, Can you maybe share the error output?
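[Editor's note: returning to the OCR issue-type thread above, the Naive Bayes baseline recommended there can be sketched as a scikit-learn pipeline. The four documents and labels below are invented stand-ins for the 70k OCR files; the tree structure is handled crudely by predicting only the parent issue type, the first level of the tree.]

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented stand-in documents; the real input would be the OCR'd letters.
texts = [
    "power grid failure reported in the northern sector",
    "scheduled maintenance of power lines and grid",
    "complaint about the amount on last month's invoice",
    "question regarding tax on the enclosed invoice",
]
# Parent issue types, i.e. the first level of the issue-type tree.
parents = ["electricity", "electricity", "customer support", "customer support"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, parents)
prediction = model.predict(["complaint about an invoice"])
```

A second classifier trained per parent class (or a dedicated hierarchical-classification package) can then refine the guess down to the leaf type, and sklearn.metrics.f1_score can compare the variants, as suggested in the thread.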
gr, Prokopis On Thu, Jun 14, 2018 at 5:20 PM Alexandra Metallinou Log < alexandra.log at sintef.no> wrote: > Dear Sir/Madam, > > > I have stumbled upon a problem while trying to run some old code using > skcikit-learn: > > > scores = cross_validation.cross_val_score(model, X, Y, cv = 10, score_func > = metrics.mean_squared_error) > > > This line will not run in a program I downloaded, and as I am not yet very > familliar with scikit-learn I do not know how I should replace "score_func > = metrics.mean_squared_error" to produce the same result as intended by the > ones who made the program. Any help is greatly appreciated. > > > Best regards, > > > Alexandra Log > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ichkoar at gmail.com Thu Jun 14 14:52:22 2018 From: ichkoar at gmail.com (Christos Aridas) Date: Thu, 14 Jun 2018 21:52:22 +0300 Subject: [scikit-learn] help In-Reply-To: References: Message-ID: Hey Alexandra . Could you please post a minimal, complete, and verifiable example? Apart from this could you post the exact error message? Best, Chris On Thu, Jun 14, 2018 at 1:44 PM, Alexandra Metallinou Log < alexandra.log at sintef.no> wrote: > Dear Sir/Madam, > > > I have stumbled upon a problem while trying to run some old code using > skcikit-learn: > > > scores = cross_validation.cross_val_score(model, X, Y, cv = 10, > score_func = metrics.mean_squared_error) > > > This line will not run in a program I downloaded, and as I am not yet very > familliar with scikit-learn I do not know how I should replace "score_func > = metrics.mean_squared_error" to produce the same result as intended by the > ones who made the program. Any help is greatly appreciated. 
> > > Best regards, > > > Alexandra Log > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Thu Jun 14 19:57:31 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Fri, 15 Jun 2018 09:57:31 +1000 Subject: [scikit-learn] help In-Reply-To: References: Message-ID: model_evaluation.cross_val_score(model, X, y, scoring='neg_mse') will produce the same, but negated so that greater is better. -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Sat Jun 16 03:59:26 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sat, 16 Jun 2018 00:59:26 -0700 Subject: [scikit-learn] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: I have made a significant revision. In this version, downstream projects like scikit-learn should experience significantly less forced churn. https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep-0019-rng-policy.rst https://mail.python.org/pipermail/numpy-discussion/2018-June/078252.html tl;dr RandomState lives! But its distributions are forever frozen. So maybe "undead" is more apt. Anyways, RandomState will continue to provide the same stream-compatibility that it always has. But it will be internally refactored to use the same core uniform PRNG objects that the new RandomGenerator distributions class will use underneath (defaulting to the current Mersenne Twister, of course). The distribution methods on RandomGenerator will be allowed to evolve with numpy versions and get better/faster implementations. Your code can mix the usage of RandomState and RandomGenerator as needed, but they can be made to share the same underlying RNG algorithm's state. 
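The RandomState/RandomGenerator mixing described in words above can be illustrated with the interfaces that eventually shipped in NumPy 1.17 as the realization of this NEP; the class names below follow that release rather than the draft text, so treat them as an assumption:

```python
import numpy as np

# One core Mersenne Twister bit generator (NumPy >= 1.17) shared by both
# the frozen legacy interface and the evolving Generator interface.
bg = np.random.MT19937(12345)
legacy = np.random.RandomState(bg)   # stream-compatible, frozen distributions
modern = np.random.Generator(bg)     # distributions free to improve

legacy.standard_normal(2)  # advances the shared MT19937 state
modern.standard_normal(2)  # continues from where the legacy draws left off
```

Because both objects wrap the same bit generator, draws through either interface advance a single underlying state.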
-- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco From alexandra.log at sintef.no Fri Jun 15 03:51:59 2018 From: alexandra.log at sintef.no (Alexandra Metallinou Log) Date: Fri, 15 Jun 2018 07:51:59 +0000 Subject: [scikit-learn] help In-Reply-To: References: , Message-ID: Thank you, this worked. The error message was: undefined keyword: 'score-func' I also changed the line of code from scores = cross_validation.cross_val_score(model, X, Y, cv = 10, score_func = metrics.mean_squared_error) to scores = cross_validation.cross_val_score(model, X, Y, cv = 10, scores = 'mean_squared_error') the code runs with this (I receive negative outputs though, so I took the absolute value of these afterwards). However the following deprecation warning is displayed: C:\Python27\lib\site-packages\sklearn\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20. "This module will be removed in 0.20.", DeprecationWarning) C:\Python27\lib\site-packages\sklearn\grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20. DeprecationWarning) When I changed the code to: model_evaluation.cross_val_score(model, X, y, scoring='neg_mean_squared_error'), the code runs fine ('neg_mse' was not an acceptable keyword). I still get the same deprecation warning, though I don't understand why as I am using model_evaluation now. Regardless, I think the problem is fixed. Once again, thank you for your help!
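A self-contained sketch of the working modern call (scikit-learn >= 0.18; the `model`, `X`, `y` below are toy stand-ins, not the original program's objects):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Toy stand-ins for the original program's data and model.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.randn(100)
model = LinearRegression()

# model_selection replaces the deprecated cross_validation module.
# The scorer is negated MSE (greater is better), so flip the sign
# to recover plain per-fold MSE values.
scores = cross_val_score(model, X, y, cv=10, scoring='neg_mean_squared_error')
mse_per_fold = -scores
```

Importing from `sklearn.model_selection` rather than the old `sklearn.cross_validation` is also what silences the deprecation warnings quoted above.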
Kind regards, Alexandra ________________________________ From: scikit-learn on behalf of Joel Nothman Sent: Friday, 15 June 2018 01.57.31 To: Scikit-learn user and developer mailing list Subject: Re: [scikit-learn] help model_evaluation.cross_val_score(model, X, y, scoring='neg_mse') will produce the same, but negated so that greater is better. -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sat Jun 16 08:13:20 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Sat, 16 Jun 2018 22:13:20 +1000 Subject: [scikit-learn] help In-Reply-To: References: Message-ID: Sorry, should have been model_selection, not model_evaluation. cross_validation is now deprecated. On Sat, 16 Jun 2018 at 18:28, Alexandra Metallinou Log < alexandra.log at sintef.no> wrote: > Thank you, this worked. The error message was: undefined keyword: > 'score-func' > > > I also changed the line of code from > > > scores = cross_validation.cross_val_score(model, X, Y, cv = 10, score_func > = metrics.mean_squared_error) > > > to > > > scores = cross_validation.cross_val_score(model, X, Y, cv = 10, scores = > 'mean_squared_error') > > > the code runs with this (I receive negative outputs though, so I took the > absolute value of these afterwards). However the following deprecation > warning is displayed: > > > C:\Python27\lib\site-packages\sklearn\cross_validation.py:41: > DeprecationWarning: This module was deprecated in version 0.18 in favor of > the model_selection module into which all the refactored classes and > functions are moved. Also note that the interface of the new CV iterators > are different from that of this module. This module will be removed in 0.20.
> "This module will be removed in 0.20.", DeprecationWarning) > C:\Python27\lib\site-packages\sklearn\grid_search.py:42: > DeprecationWarning: This module was deprecated in version 0.18 in favor of > the model_selection module into which all the refactored classes and > functions are moved. This module will be removed in 0.20. > DeprecationWarning) > > When I changed the code to: > > > model_evaluation.cross_val_score(model, X, y, scoring= > 'neg_mean_squared_error'), > > > the code runs fine ('neg_mse' was not an acceptable keyword). I still get > the same deprecation warning, though I don't understand why as I am using > model_evaluation now. Regardless, I think the problem is fixed. > > > Once again, thank you for your help! > > > Kind regards, > > > Alexandra > ------------------------------ > *Fra:* scikit-learn sintef.no at python.org> p? vegne av Joel Nothman > *Sendt:* fredag 15. juni 2018 01.57.31 > *Til:* Scikit-learn user and developer mailing list > *Emne:* Re: [scikit-learn] help > > model_evaluation.cross_val_score(model, X, y, scoring='neg_mse') will > produce the same, but negated so that greater is better. > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From josef.pktd at gmail.com Sat Jun 16 08:54:36 2018 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Sat, 16 Jun 2018 08:54:36 -0400 Subject: [scikit-learn] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sat, Jun 16, 2018 at 3:59 AM, Robert Kern wrote: > I have made a significant revision. In this version, downstream projects > like scikit-learn should experience significantly less forced churn. 
> > https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep-0019-rng-policy.rst > > https://mail.python.org/pipermail/numpy-discussion/2018-June/078252.html > > tl;dr RandomState lives! But its distributions are forever frozen. So maybe > "undead" is more apt. Anyways, RandomState will continue to provide the same > stream-compatibility that it always has. But it will be internally > refactored to use the same core uniform PRNG objects that the new > RandomGenerator distributions class will use underneath (defaulting to the > current Mersenne Twister, of course). The distribution methods on > RandomGenerator will be allowed to evolve with numpy versions and get > better/faster implementations. > > Your code can mix the usage of RandomState and RandomGenerator as needed, > but they can be made to share the same underlying RNG algorithm's state. Sounds good to me, and I think handles all our concerns. I also think that the issues behind the np.random.* section about the global state and seed can be revisited for possible deprecation of convenience features. One clarifying question, mainly to see IIUC in this quote """ Calling numpy.random.seed() thereafter SHOULD just pass the given seed to the current basic RNG object and not attempt to reset the basic RNG to the Mersenne Twister. The global RandomState instance MUST be accessible by the name numpy.random.mtrand._rand """ "the current basic RNG object" refers to the global object. AFAIU, it is possible to change it numpy.random.mtrand._rand. Is it? I never tried that so I didn't know we can change the global RandomState, and thought we will have to replace np.random.seed usage with a specific RandomState(seed) instance In loose analogy: Matplotlib has a "global" current figure and axis, gca, gcf. In statsmodels we avoid any access to and usage of it and only work with individual figure/axis instances that can be provided by the user. 
(except for maybe some documentation examples and maybe some "legacy" code.) ( https://github.com/statsmodels/statsmodels/blob/master/statsmodels/graphics/utils.py#L48 ) AFAICS, essentially, statsmodels will need a similar policy for RandomState/RandomGenerator and give up the usage of the global random instance. Josef > > > -- > Robert Kern > > "I have come to believe that the whole world is an enigma, a harmless enigma > that is made terrible by our own mad attempt to interpret it as though it > had > an underlying truth." > -- Umberto Eco > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From robert.kern at gmail.com Sat Jun 16 20:29:33 2018 From: robert.kern at gmail.com (Robert Kern) Date: Sat, 16 Jun 2018 17:29:33 -0700 Subject: [scikit-learn] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On 6/16/18 05:54, josef.pktd at gmail.com wrote: > On Sat, Jun 16, 2018 at 3:59 AM, Robert Kern wrote: >> I have made a significant revision. In this version, downstream projects >> like scikit-learn should experience significantly less forced churn. >> >> https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep-0019-rng-policy.rst >> >> https://mail.python.org/pipermail/numpy-discussion/2018-June/078252.html >> >> tl;dr RandomState lives! But its distributions are forever frozen. So maybe >> "undead" is more apt. Anyways, RandomState will continue to provide the same >> stream-compatibility that it always has. But it will be internally >> refactored to use the same core uniform PRNG objects that the new >> RandomGenerator distributions class will use underneath (defaulting to the >> current Mersenne Twister, of course). The distribution methods on >> RandomGenerator will be allowed to evolve with numpy versions and get >> better/faster implementations. 
>> >> Your code can mix the usage of RandomState and RandomGenerator as needed, >> but they can be made to share the same underlying RNG algorithm's state. > > > Sounds good to me, and I think handles all our concerns. > > I also think that the issues behind the np.random.* section about the > global state and seed can be revisited for possible deprecation of > convenience features. > > One clarifying question, mainly to see IIUC > > in this quote > """ > Calling numpy.random.seed() thereafter SHOULD just pass the given seed > to the current basic RNG object and not attempt to reset the basic RNG > to the Mersenne Twister. The global RandomState instance MUST be > accessible by the name numpy.random.mtrand._rand > """ > > "the current basic RNG object" refers to the global object. AFAIU, it > is possible to change it numpy.random.mtrand._rand. Is it? numpy.random.mtrand._rand would not be a basic RNG object; it would be (as it is now) a RandomState instance. "the current basic RNG object" would be the basic RNG that that global RandomState instance is currently using. It is not possible (now or in the glorious NEP future) to assign a new instance to numpy.random.mtrand._rand. All of the numpy.random.* functions are actually just simple aliases to the methods on that object when the module is first built. Re-assigning _rand wouldn't reassign those aliases. numpy.random.standard_normal(), for instance, would still be the .standard_normal() method on the RandomState instance that _rand initially pointed to. Currently and under the NEP, the only way to modify numpy.random.mtrand._rand is to call its methods (i.e. the numpy.random.* convenience functions) to modify its internal state. That's not changing. The only thing that will change will be that there will be a new numpy.random.* function to call that will let you give the global RandomState a new basic RNG object that it will swap in internally. Let's call it np.random.swap_global_basic_rng(). 
If you don't use that function, you won't have a problem. I intend to make this new function *very* explicit about what it is doing, and document the crap out of it so it won't be misused like np.random.seed() is. > I never tried that so I didn't know we can change the global > RandomState, and thought we will have to replace np.random.seed usage > with a specific RandomState(seed) instance I did a quick review of np.random.seed() usage in statsmodels, and I think you are mostly fine. It looks like you mostly use it in unit tests and at the top of examples. The only possible problem that I can see that you might have with the swap_global_basic_rng() is if some other package that you rely on calls it in its library code. Then subsequent statsmodels unit tests might fail because when they call np.random.seed(), it won't be reseeding a Mersenne Twister but another basic RNG. However, I intend to make that a weird and unnatural thing to do. It's already unlikely to happen as it's a niche requirement that one mostly would need at the start of a whole program, not buried down inside library code. But we will also document that function to discourage such usage, and probably have unconditional noisy warnings that users would have to explicitly silence. If one of your dependencies did that, you'd be well within your rights to tell them that they are misusing numpy and causing breakage in statsmodels. > In loose analogy: > > Matplotlib has a "global" current figure and axis, gca, gcf. > In statsmodels we avoid any access to and usage of it and only work > with individual figure/axis instances that can be provided by the > user. (except for maybe some documentation examples and maybe some > "legacy" code.) > ( https://github.com/statsmodels/statsmodels/blob/master/statsmodels/graphics/utils.py#L48 > ) > > AFAICS, essentially, statsmodels will need a similar policy for > RandomState/RandomGenerator and give up the usage of the global random > instance. 
I mean, you certainly *should* (outside of unit tests) for very similar reasons why you avoid the global state in matplotlib, but this NEP won't force you to. You should do so anyways under the status quo, too. For any of your functions that call np.random.* functions internally, it's hard to use them in threaded applications, for instance, because it is relying on that global state. scikit-learn's check_random_state() is a good pattern to follow. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco From josef.pktd at gmail.com Sat Jun 16 20:42:12 2018 From: josef.pktd at gmail.com (josef.pktd at gmail.com) Date: Sat, 16 Jun 2018 20:42:12 -0400 Subject: [scikit-learn] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On Sat, Jun 16, 2018 at 8:29 PM, Robert Kern wrote: > On 6/16/18 05:54, josef.pktd at gmail.com wrote: >> >> On Sat, Jun 16, 2018 at 3:59 AM, Robert Kern >> wrote: >>> >>> I have made a significant revision. In this version, downstream projects >>> like scikit-learn should experience significantly less forced churn. >>> >>> >>> https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep-0019-rng-policy.rst >>> >>> https://mail.python.org/pipermail/numpy-discussion/2018-June/078252.html >>> >>> tl;dr RandomState lives! But its distributions are forever frozen. So >>> maybe >>> "undead" is more apt. Anyways, RandomState will continue to provide the >>> same >>> stream-compatibility that it always has. But it will be internally >>> refactored to use the same core uniform PRNG objects that the new >>> RandomGenerator distributions class will use underneath (defaulting to >>> the >>> current Mersenne Twister, of course). The distribution methods on >>> RandomGenerator will be allowed to evolve with numpy versions and get >>> better/faster implementations. 
>>> >>> Your code can mix the usage of RandomState and RandomGenerator as needed, >>> but they can be made to share the same underlying RNG algorithm's state. >> >> >> >> Sounds good to me, and I think handles all our concerns. >> >> I also think that the issues behind the np.random.* section about the >> global state and seed can be revisited for possible deprecation of >> convenience features. >> >> One clarifying question, mainly to see IIUC >> >> in this quote >> """ >> Calling numpy.random.seed() thereafter SHOULD just pass the given seed >> to the current basic RNG object and not attempt to reset the basic RNG >> to the Mersenne Twister. The global RandomState instance MUST be >> accessible by the name numpy.random.mtrand._rand >> """ >> >> "the current basic RNG object" refers to the global object. AFAIU, it >> is possible to change it numpy.random.mtrand._rand. Is it? > > > numpy.random.mtrand._rand would not be a basic RNG object; it would be (as > it is now) a RandomState instance. "the current basic RNG object" would be > the basic RNG that that global RandomState instance is currently using. > > It is not possible (now or in the glorious NEP future) to assign a new > instance to numpy.random.mtrand._rand. All of the numpy.random.* functions > are actually just simple aliases to the methods on that object when the > module is first built. Re-assigning _rand wouldn't reassign those aliases. > numpy.random.standard_normal(), for instance, would still be the > .standard_normal() method on the RandomState instance that _rand initially > pointed to. > > Currently and under the NEP, the only way to modify > numpy.random.mtrand._rand is to call its methods (i.e. the numpy.random.* > convenience functions) to modify its internal state. That's not changing. > > The only thing that will change will be that there will be a new > numpy.random.* function to call that will let you give the global > RandomState a new basic RNG object that it will swap in internally. 
Let's > call it np.random.swap_global_basic_rng(). If you don't use that function, > you won't have a problem. I intend to make this new function *very* explicit > about what it is doing, and document the crap out of it so it won't be > misused like np.random.seed() is. I didn't catch that part. Now it's clear. > >> I never tried that so I didn't know we can change the global >> RandomState, and thought we will have to replace np.random.seed usage > >> with a specific RandomState(seed) instance > > > I did a quick review of np.random.seed() usage in statsmodels, and I think > you are mostly fine. It looks like you mostly use it in unit tests and at > the top of examples. The only possible problem that I can see that you might > have with the swap_global_basic_rng() is if some other package that you rely > on calls it in its library code. Then subsequent statsmodels unit tests > might fail because when they call np.random.seed(), it won't be reseeding a > Mersenne Twister but another basic RNG. > > However, I intend to make that a weird and unnatural thing to do. It's > already unlikely to happen as it's a niche requirement that one mostly would > need at the start of a whole program, not buried down inside library code. > But we will also document that function to discourage such usage, and > probably have unconditional noisy warnings that users would have to > explicitly silence. > > If one of your dependencies did that, you'd be well within your rights to > tell them that they are misusing numpy and causing breakage in statsmodels. > >> In loose analogy: >> >> Matplotlib has a "global" current figure and axis, gca, gcf. >> In statsmodels we avoid any access to and usage of it and only work >> with individual figure/axis instances that can be provided by the >> user. (except for maybe some documentation examples and maybe some >> "legacy" code.) 
>> ( >> https://github.com/statsmodels/statsmodels/blob/master/statsmodels/graphics/utils.py#L48 >> ) >> >> AFAICS, essentially, statsmodels will need a similar policy for >> RandomState/RandomGenerator and give up the usage of the global random >> instance. > > > I mean, you certainly *should* (outside of unit tests) for very similar > reasons why you avoid the global state in matplotlib, but this NEP won't > force you to. You should do so anyways under the status quo, too. For any of > your functions that call np.random.* functions internally, it's hard to use > them in threaded applications, for instance, because it is relying on that > global state. > > scikit-learn's check_random_state() is a good pattern to follow. Thanks for the clarification. I just realized that I had replied to scikit-learn mailing list. I had thought this was numpy-discussion. sorry about that. Josef > > > -- > Robert Kern > > "I have come to believe that the whole world is an enigma, a harmless enigma > that is made terrible by our own mad attempt to interpret it as though it > had > an underlying truth." > -- Umberto Eco > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From guettliml at thomas-guettler.de Mon Jun 18 06:16:19 2018 From: guettliml at thomas-guettler.de (=?UTF-8?Q?Thomas_G=c3=bcttler?=) Date: Mon, 18 Jun 2018 12:16:19 +0200 Subject: [scikit-learn] Mapping fulltext OCR to issue type In-Reply-To: References: <4a8ec9b8-24f0-1b7a-4064-4dbdb648a751@thomas-guettler.de> Message-ID: Thank you very much David, I ordered the book Regards, Thomas Am 13.06.2018 um 12:25 schrieb David Asfaha: > > Hi, > > I would recommend starting with Naive Bayes [1] to classify the issues by parent issue type. To check how that works > learn about F1 accuracy scores [2] and use them. 
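A minimal sketch of that recommended starting point, with invented toy texts labelled by parent issue type (all data and names here are illustrative, not from the original program):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy OCR snippets, labelled with their *parent* issue type.
texts = ["power grid outage reported downtown",
         "complaint about invoice overcharge",
         "scheduled electricity grid maintenance",
         "question about invoice tax line"] * 25
labels = ["electricity", "customer support",
          "electricity", "customer support"] * 25

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, random_state=0)

# Bag-of-words features feeding a multinomial Naive Bayes classifier.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(X_train, y_train)
macro_f1 = f1_score(y_test, clf.predict(X_test), average="macro")
```

The same pipeline, refit on labels for the specific issue types, is the natural next step once the parent-level scores look acceptable.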
If you are happy with the results, and depending on how much data you > have, try to modify the Naive Bayes classifier to predict the specific issue type. From here there are many more things > to do, like using an ensemble of classifiers, experimenting with SVMs, random forest, TFIDF, n-grams... > > Natural Language Processing with Python is a good book on NLP, also Andrew Ng's Machine Learning course on coursera if > you're new to the subject. > > Hope this helps. > > David > > > [1] http://scikit-learn.org/stable/modules/naive_bayes.html > [2] http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html > > > On 13 June 2018 at 10:43, Thomas Güttler wrote: > > I am still willing to learn. > > Does anyone have a recommendation which book or website could help me? > > Regards, > Thomas > > > On 08.06.2018 at 10:48, Thomas Güttler wrote: > > We run an issue tracking application. A lot of issues get generated > from scanned letters. > > I have 70k full text OCR result files. They were created with tesseract. > > Every file of these 70k files corresponds to an issue. Each issue has an issue type. > > I want to use machine learning and in the future the machine > should be able to guess the issue type by looking at the full text OCR. > > The issue types are not a simple list; they form a tree. > > Example: > > electricity / power grid > electricity / outages > customer support / invoices / complaint > customer support / invoices / tax > .... > > > If the machine can't guess > > "customer support / invoices / complaint" > > it would be nice if it could at least guess roughly the parent issue type: > > "customer support / invoices" > > I never used scikit-learn before, but I have used Python for several years. > > Could you please guide me in the right direction? > > Regards, >
Thomas Güttler > > > > -- > Thomas Guettler http://www.thomas-guettler.de/ > I am looking for feedback: https://github.com/guettli/programming-guidelines > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Thomas Guettler http://www.thomas-guettler.de/ I am looking for feedback: https://github.com/guettli/programming-guidelines From robert.kern at gmail.com Tue Jun 19 02:34:38 2018 From: robert.kern at gmail.com (Robert Kern) Date: Mon, 18 Jun 2018 23:34:38 -0700 Subject: [scikit-learn] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: On 6/16/18 00:59, Robert Kern wrote: > I have made a significant revision. In this version, downstream projects like > scikit-learn should experience significantly less forced churn. > > https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep-0019-rng-policy.rst > > https://mail.python.org/pipermail/numpy-discussion/2018-June/078252.html The screaming has died down on numpy-discussion, and it seems like everyone who has participated over there has more or less come to consensus about accepting this NEP. However, I'd really appreciate it if I could get some kind of feedback from a scikit-learn dev, whether it's "I don't care" or "I need a couple of days to get around to reading the NEP" or just "+1" or "-1000; this is awful!" I'm not picky. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth."
-- Umberto Eco From hamidizade.s at gmail.com Tue Jun 19 10:52:28 2018 From: hamidizade.s at gmail.com (S Hamidizade) Date: Tue, 19 Jun 2018 19:22:28 +0430 Subject: [scikit-learn] imbalanced classes: class_weight Message-ID: Hi I would appreciate if you could let me know what is the best way to categorize the approaches which have been developed to deal with imbalance class problem? *This article categorizes them into:* 1. Preprocessing: includes oversampling, undersampling and hybrid methods, 2. Cost-sensitive learning: includes direct methods and meta-learning which the latter further divides into thresholding and sampling, 3. Ensemble techniques: includes cost-sensitive ensembles and data preprocessing in conjunction with ensemble learning. *The second classification:* 1. Data Pre-processing: includes distribution change and weighting the data space. One-class learning is considered as distribution change. 2. Special-purpose Learning Methods 3. Prediction Post-processing: includes threshold method and cost-sensitive post-processing 4. Hybrid Methods: *The third article :* 1. Data-level methods 2. Algorithm-level methods 3. Hybrid methods The last classification also considers output adjustment as an independent approach. Could you please let me know the class-weight in the sklearn's classifiers e.g., logistic regression is classified into which category? Is it true to say: In case of the first categorization, it falls into cost-sensitive learning In case of the second taxonomy, it would be classified into the third category i.e., cost-sensitive post-processing In case of the third classification, it should fall into algorithm level Best regards, -------------- next part -------------- An HTML attachment was scrubbed... 
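For the class_weight part of the question, a short sketch of what the parameter does mechanically: it reweights each class's contribution to the training loss inside the fitting algorithm, with no resampling and no post-processing, which is why it is usually filed under cost-sensitive, algorithm-level methods. The data below are toy stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced toy problem: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)

# Explicit misclassification costs: errors on the minority class
# weigh 9x more in the fitted loss (no resampling, no post-processing).
clf_cost = LogisticRegression(class_weight={0: 1, 1: 9}).fit(X, y)

# 'balanced' derives the same kind of weights from class frequencies:
# n_samples / (n_classes * np.bincount(y)).
clf_bal = LogisticRegression(class_weight="balanced").fit(X, y)
```

The weighting shifts the fitted decision boundary toward the minority class, typically increasing how often class 1 is predicted relative to an unweighted fit.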
URL: From t3kcit at gmail.com Tue Jun 19 11:12:18 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 19 Jun 2018 11:12:18 -0400 Subject: [scikit-learn] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: <2e83ecf0-4f42-6eb3-c372-28bb5baf8583@gmail.com> I don't think I have the bandwidth but I agree :-/ Not sure if any of the other core devs do. I can try to read it next week but that's probably too late? On 06/19/2018 02:34 AM, Robert Kern wrote: > On 6/16/18 00:59, Robert Kern wrote: >> I have made a significant revision. In this version, downstream >> projects like scikit-learn should experience significantly less >> forced churn. >> >> https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep-0019-rng-policy.rst >> >> https://mail.python.org/pipermail/numpy-discussion/2018-June/078252.html > > The screaming has died down on numpy-discussion, and it seems like > everyone who has participated over there has more or less come to > consensus about accepting this NEP. However, I'd really appreciate it > if I could get some kind of feedback from a scikit-learn dev, whether > it's "I don't care" or "I need a couple of days to get around to > reading the NEP" or just "+1" or "-1000; this is awful!" > > I'm not picky. > From gael.varoquaux at normalesup.org Tue Jun 19 11:29:14 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Tue, 19 Jun 2018 17:29:14 +0200 Subject: [scikit-learn] NEP: Random Number Generator Policy In-Reply-To: References: Message-ID: <20180619152914.youaux7nkjftujvt@phare.normalesup.org> On Mon, Jun 18, 2018 at 11:34:38PM -0700, Robert Kern wrote: > However, I'd really appreciate it if I could get some > kind of feedback from a scikit-learn dev, I didn't read the NEP, only your summary. That said, it seems quite reasonably aligned with our practice, and hence shouldn't pose a problem. 
Ideally, I believe that in the long run it should enable us to have cleaner / more robust code, but I suspect that it will take a while before we get there. Gaël From ichkoar at gmail.com Tue Jun 19 12:34:50 2018 From: ichkoar at gmail.com (Christos Aridas) Date: Tue, 19 Jun 2018 19:34:50 +0300 Subject: [scikit-learn] imbalanced classes: class_weight In-Reply-To: References: Message-ID: Hi, Have you seen http://imbalanced-learn.org? Best, Chris On Tue, 19 Jun 2018 17:53 S Hamidizade, wrote: > Hi > > I would appreciate if you could let me know what is the best way to > categorize the approaches which have been developed to deal with imbalance > class problem? > > *This article > > categorizes them into:* > > 1. Preprocessing: includes oversampling, undersampling and hybrid > methods, > 2. Cost-sensitive learning: includes direct methods and meta-learning > which the latter further divides into thresholding and sampling, > 3. Ensemble techniques: includes cost-sensitive ensembles and data > preprocessing in conjunction with ensemble learning. > > *The second classification:* > > 1. Data Pre-processing: includes distribution change and weighting the > data space. One-class learning is considered as distribution change. > 2. Special-purpose Learning Methods > 3. Prediction Post-processing: includes threshold method and > cost-sensitive post-processing > 4. Hybrid Methods: > > *The third article > :* > > 1. Data-level methods > 2. Algorithm-level methods > 3. Hybrid methods > > The last classification also considers output adjustment as an independent > approach. > > Could you please let me know the class-weight in the sklearn's classifiers > e.g., logistic regression is classified into which category?
Is it true to > say: > > In case of the first categorization, it falls into cost-sensitive learning > > In case of the second taxonomy, it would be classified into the third > category i.e., cost-sensitive post-processing > > In case of the third classification, it should fall into algorithm level > > Best regards, > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.kern at gmail.com Tue Jun 19 18:19:03 2018 From: robert.kern at gmail.com (Robert Kern) Date: Tue, 19 Jun 2018 15:19:03 -0700 Subject: [scikit-learn] NEP: Random Number Generator Policy In-Reply-To: <2e83ecf0-4f42-6eb3-c372-28bb5baf8583@gmail.com> References: <2e83ecf0-4f42-6eb3-c372-28bb5baf8583@gmail.com> Message-ID: On 6/19/18 08:12, Andreas Mueller wrote: > I don't think I have the bandwidth but I agree :-/ > Not sure if any of the other core devs do. I can try to read it next week but > that's probably too late? We're not on a deadline. If you're interested in reading the NEP and providing feedback/consent, I'm happy to hold off on formally accepting the NEP until then. Thanks! -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco From hamidizade.s at gmail.com Wed Jun 20 23:35:54 2018 From: hamidizade.s at gmail.com (S Hamidizade) Date: Thu, 21 Jun 2018 08:05:54 +0430 Subject: [scikit-learn] imbalanced classes: class_weight In-Reply-To: References: Message-ID: Hi Thanks a lot for your time and consideration. I have seen imblearn but my question is not related to it. Best regards, On Tue, Jun 19, 2018 at 9:04 PM, Christos Aridas wrote: > Hi, > > Have you seen http://imbalanced-learn.org? 
> > Best, > Chris > > On Tue, 19 Jun 2018 17:53 S Hamidizade, wrote: > >> Hi >> >> I would appreciate if you could let me know what is the best way to >> categorize the approaches which have been developed to deal with imbalance >> class problem? >> >> *This article >> >> categorizes them into:* >> >> 1. Preprocessing: includes oversampling, undersampling and hybrid >> methods, >> 2. Cost-sensitive learning: includes direct methods and meta-learning >> which the latter further divides into thresholding and sampling, >> 3. Ensemble techniques: includes cost-sensitive ensembles and data >> preprocessing in conjunction with ensemble learning. >> >> *The second classification:* >> >> 1. Data Pre-processing: includes distribution change and weighting >> the data space. One-class learning is considered as distribution change. >> 2. Special-purpose Learning Methods >> 3. Prediction Post-processing: includes threshold method and >> cost-sensitive post-processing >> 4. Hybrid Methods: >> >> *The third article >> :* >> >> 1. Data-level methods >> 2. Algorithm-level methods >> 3. Hybrid methods >> >> The last classification also considers output adjustment as an >> independent approach. >> >> Could you please let me know the class-weight in the sklearn's >> classifiers e.g., logistic regression is classified into which category? 
Is >> it true to say: >> >> In case of the first categorization, it falls into cost-sensitive learning >> >> In case of the second taxonomy, it would be classified into the third >> category i.e., cost-sensitive post-processing >> >> In case of the third classification, it should fall into algorithm level >> >> Best regards, >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Thu Jun 21 00:19:23 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 21 Jun 2018 14:19:23 +1000 Subject: [scikit-learn] imbalanced classes: class_weight In-Reply-To: References: Message-ID: We don't usually do any postprocessing for class weight (although there is an open issue). In the second taxonomy, I'd say Data Pre-processing ("weighting the data space"), but maybe there are exceptions in some estimators. The classification in the first taxonomy is correct, IMO. In the third, perhaps "Algorithm-level"? -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Thu Jun 21 00:19:56 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 21 Jun 2018 14:19:56 +1000 Subject: [scikit-learn] imbalanced classes: class_weight In-Reply-To: References: Message-ID: the open issue on post-processing / prior adjustment to adjust for class_weight: https://github.com/scikit-learn/scikit-learn/issues/10613 -------------- next part -------------- An HTML attachment was scrubbed...
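[Editor's note: Joel's reading of class_weight as "weighting the data space" can be illustrated with the heuristic scikit-learn documents for class_weight='balanced', namely n_samples / (n_classes * np.bincount(y)). The sketch below is NumPy-only; the toy label vector is invented for illustration, and it assumes contiguous integer labels starting at 0.]

```python
import numpy as np

# Toy imbalanced label vector (8 negatives, 2 positives), invented
# purely for illustration.
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
classes = np.unique(y)

# The heuristic documented for class_weight='balanced':
# each class is weighted inversely to its frequency.
class_weights = len(y) / (len(classes) * np.bincount(y))

# Broadcasting the class weights onto samples reweights the data space:
# after this, both classes contribute the same total weight to the loss.
sample_weight = class_weights[y]
```

Because these weights multiply each sample's contribution to the loss during fitting (rather than adjusting predictions afterwards), this matches the "weighting the data space" category rather than prediction post-processing.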
URL: From olivier.grisel at ensta.org Sat Jun 23 06:42:27 2018 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Sat, 23 Jun 2018 12:42:27 +0200 Subject: [scikit-learn] New core dev: Joris Van den Bossche Message-ID: Hi everyone! Let's welcome Joris Van den Bossche (@jorisvdbossche) officially as a scikit-learn core developer! Joris is one of the maintainers of the pandas project and recently contributed many new great PRs to scikit-learn (notably the ColumnTransformer and a refactoring of the categorical variable preprocessing tools). Cheers! -- Olivier -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Sat Jun 23 11:13:07 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Sat, 23 Jun 2018 11:13:07 -0400 Subject: [scikit-learn] New core dev: Joris Van den Bossche In-Reply-To: References: Message-ID: That's great news! I am glad to hear that you joined the project, Joris Van den Bossche! I am a scikit-learn user (and sometimes contributor) and really appreciate all the time and effort that the core developers and contributors spend on maintaining and extending the library. Best regards, Sebastian > On Jun 23, 2018, at 6:42 AM, Olivier Grisel wrote: > > Hi everyone! > > Let's welcome Joris Van den Bossche (@jorisvdbossche) officially as a scikit-learn core developer! > > Joris is one of the maintainers of the pandas project and recently contributed many new great PRs to scikit-learn (notably the ColumnTransformer and a refactoring of the categorical variable preprocessing tools). > > Cheers! 
> > -- > Olivier > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From zephyr14 at gmail.com Sat Jun 23 11:20:18 2018 From: zephyr14 at gmail.com (Vlad Niculae) Date: Sat, 23 Jun 2018 11:20:18 -0400 Subject: [scikit-learn] New core dev: Joris Van den Bossche In-Reply-To: References: Message-ID: Congratulations Joris, very well deserved! Vlad On Sat, Jun 23, 2018, 11:15 Sebastian Raschka wrote: > That's great news! I am glad to hear that you joined the project, Joris > Van den Bossche! I am a scikit-learn user (and sometimes contributor) and > really appreciate all the time and effort that the core developers and > contributors spend on maintaining and extending the library. > > Best regards, > Sebastian > > > > On Jun 23, 2018, at 6:42 AM, Olivier Grisel > wrote: > > > > Hi everyone! > > > > Let's welcome Joris Van den Bossche (@jorisvdbossche) officially as a > scikit-learn core developer! > > > > Joris is one of the maintainers of the pandas project and recently > contributed many new great PRs to scikit-learn (notably the > ColumnTransformer and a refactoring of the categorical variable > preprocessing tools). > > > > Cheers! > > > > -- > > Olivier > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jason.wolosonovich at infusionsoft.com Mon Jun 25 13:14:03 2018 From: jason.wolosonovich at infusionsoft.com (Jason Wolosonovich) Date: Mon, 25 Jun 2018 17:14:03 +0000 Subject: [scikit-learn] New core dev: Joris Van den Bossche In-Reply-To: References: Message-ID: <4f99a8848387404388328dbc50ac701e@infusionsoft.com> Welcome Joris! - Jason -------------- next part -------------- An HTML attachment was scrubbed... URL: From mlcnworkshop at gmail.com Mon Jun 11 04:27:08 2018 From: mlcnworkshop at gmail.com (MLCN Workshop) Date: Mon, 11 Jun 2018 08:27:08 -0000 Subject: [scikit-learn] Deadline Extension: the first International Workshop on Machine Learning in Clinical Neuroimaging (MLCN 2018) Message-ID: Dear Colleagues, The paper submission deadline for MLCN 2018 workshop has been extended to June 25, 2018. ------------------------------------------------------------ ------------------------------------------------------ CALL FOR PAPERS: Recent advances in neuroimaging and statistical machine learning provide an exceptional opportunity for investigators and physicians to discover complex relationships between brain, behaviors, and mental and neurological disorders.
MLCN 2018 workshop, as a satellite event of MICCAI 2018, aims to bring together researchers in both theory and application from various fields in domains of spatial statistics, pattern recognition in neuroimaging, and predictive clinical neuroscience. Topics of interest include but are not limited to: - Applications of spatio-temporal modeling in predictive clinical neuroscience - Spatial regularization in decoding clinical neuroimaging data - Spatial statistics in neuroimaging - Learning with structured inputs and outputs in clinical neuroscience - Multi-task learning in analyzing structured neuroimaging data - Deep learning in analyzing structured neuroimaging data - Model stability and interpretability in clinical neuroscience ------------------------------------------------------------ --------------------------------------------------------- CONFIRMED SPEAKERS: Christos Davatzikos (University of Pennsylvania) Gaël Varoquaux (Parietal team, INRIA) Jian Kang (University of Michigan) ------------------------------------------------------------ ---------------------------------------------------------- SUBMISSION PROCESS: The workshop seeks high-quality, original, and unpublished work on algorithms, theory, and applications of machine learning in clinical neuroimaging and spatially structured data analysis. Papers of up to 8 pages should be submitted electronically in Springer Lecture Notes in Computer Science (LNCS) style using the CMT system at https://cmt3.research.microsoft.com/MLCN2018. This workshop uses a double-blind review process in the evaluation phase, thus authors must ensure anonymous submissions. Accepted papers will be published in a joint proceedings with the MICCAI conference.
------------------------------------------------------------ ----------------------------------------------------------- IMPORTANT DATES: Paper submission deadline: June 25, 2018 Notification of Acceptance: July 16, 2018 Camera-ready Submission: July 23, 2018 Workshop Date: September 20, 2018 ------------------------------------------------------------ ------------------------------------------------------ Best regards, MLCN 2018 Organizing Committee, Email: mlcnworkshop at gmail.com Website: https://mlcn2018.com/ twitter: @MLCN2018 -------------- next part -------------- An HTML attachment was scrubbed... URL: From zephyr14 at gmail.com Tue Jun 26 11:08:37 2018 From: zephyr14 at gmail.com (Vlad Niculae) Date: Tue, 26 Jun 2018 11:08:37 -0400 Subject: [scikit-learn] Scikit Multi learn error. In-Reply-To: References: Message-ID: Hi Aijaz, You're writing to the wrong mailing list. This is the mailing list for scikit-learn, not scikit-multilearn, which is a different and unrelated project. You're unlikely to get an answer here; I recommend following the contact information on the scikit-multilearn website. Best of luck, Yours Vlad On Tue, Jun 26, 2018, 11:04 aijaz qazi wrote: > Dear developer , > > I am working on web page categorization with http://scikit.ml/ . > > > *Question*: I am not able to execute MLkNN code on the link > http://scikit.ml/api/classify.html. I have installed py 3.6. > > I found scipy versions not compatible with scikit.ml 0.0.5. > > Which version of scipy would work with scikit.ml 0.0.5. > > Kindly let me know. I will be grateful. > > > *Regards,* > *Aijaz A.Qazi * > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From t3kcit at gmail.com Tue Jun 26 11:36:46 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 26 Jun 2018 11:36:46 -0400 Subject: [scikit-learn] New core dev: Joris Van den Bossche In-Reply-To: References: Message-ID: Welcome on board Joris, and thank you for all your work so far! On 06/23/2018 06:42 AM, Olivier Grisel wrote: > Hi everyone! > > Let's welcome Joris Van den Bossche (@jorisvdbossche) officially as a > scikit-learn core developer! > > Joris is one of the maintainers of the pandas project and recently > contributed many new great PRs to scikit-learn (notably the > ColumnTransformer and a refactoring of the categorical variable > preprocessing tools). > > Cheers! > > -- > Olivier > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From fernando.wittmann at gmail.com Tue Jun 26 13:40:16 2018 From: fernando.wittmann at gmail.com (Fernando Marcos Wittmann) Date: Tue, 26 Jun 2018 19:40:16 +0200 Subject: [scikit-learn] Scikit Multi learn error. In-Reply-To: References: Message-ID: Why there's a library based on Sklearn for multi classification? Sklearn itself can handle this ( http://scikit-learn.org/stable/modules/multiclass.html) On Tue, Jun 26, 2018 at 5:08 PM, Vlad Niculae wrote: > Hi Aijaz, > > You're writing to the wrong mailing list. This is the mailing list for > scikit-learn, not scikit-multilearn, which is a different and unrelated > project. You're unlikely to get an answer here; > I recommend following the contact information on the scikit-multilearn > website. > > Best of luck, > > Yours > Vlad > > On Tue, Jun 26, 2018, 11:04 aijaz qazi wrote: > >> Dear developer , >> >> I am working on web page categorization with http://scikit.ml/ . 
>> >> *Question*: I am not able to execute MLkNN code on the link >> http://scikit.ml/api/classify.html. I have installed py 3.6. >> >> I found scipy versions not compatible with scikit.ml 0.0.5. >> >> Which version of scipy would work with scikit.ml 0.0.5. >> >> Kindly let me know. I will be grateful. >> >> >> *Regards,* >> *Aijaz A.Qazi * >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -- Fernando Marcos Wittmann MS Student - Energy Systems Dept. School of Electrical and Computer Engineering, FEEC University of Campinas, UNICAMP, Brazil +55 (19) 987-211302 -------------- next part -------------- An HTML attachment was scrubbed... URL: From niedakh at gmail.com Tue Jun 26 16:07:13 2018 From: niedakh at gmail.com (Piotr Szymański) Date: Tue, 26 Jun 2018 22:07:13 +0200 Subject: [scikit-learn] Scikit Multi learn error. In-Reply-To: References: Message-ID: Scikit-multilearn features a larger variety of models, many of which are still not above the selectiveness threshold of scikit-learn. In general, scikit-learn implements only three multi-label classifiers - BinaryRelevance, OneVsRest and ClassifierChains. And generally sklearn.multioutput is a very recent addition (2016), added 3 years after the scikit-multilearn library was started. (responding again, after joining list, i'm sorry if anyone got this twice) wt., 26 cze 2018 o 19:59 użytkownik Piotr Szymański napisał: > Scikit-multilearn features a larger variety of models, many of which are > still not above the selectiveness threshold of scikit-learn. In general, > scikit-learn implements only three multi-label classifiers - > BinaryRelevance, OneVsRest and ClassifierChains. And generally > sklearn.multioutput is a very recent addition (2016), added 3 years after > the scikit-multilearn library was started. > > wt., 26 cze 2018 o 19:40 użytkownik Fernando Marcos Wittmann < > fernando.wittmann at gmail.com> napisał: > >> Why there's a library based on Sklearn for multi classification? Sklearn >> itself can handle this ( >> http://scikit-learn.org/stable/modules/multiclass.html) >> >> On Tue, Jun 26, 2018 at 5:08 PM, Vlad Niculae wrote: >> >>> Hi Aijaz, >>> >>> You're writing to the wrong mailing list. This is the mailing list for >>> scikit-learn, not scikit-multilearn, which is a different and unrelated >>> project. You're unlikely to get an answer here; >>> I recommend following the contact information on the scikit-multilearn >>> website. >>> >>> Best of luck, >>> >>> Yours >>> Vlad >>> >>> On Tue, Jun 26, 2018, 11:04 aijaz qazi wrote: >>> >>>> Dear developer , >>>> >>>> I am working on web page categorization with http://scikit.ml/ . >>>> >>>> >>>> *Question*: I am not able to execute MLkNN code on the link >>>> http://scikit.ml/api/classify.html. I have installed py 3.6. >>>> >>>> I found scipy versions not compatible with scikit.ml 0.0.5. >>>> >>>> Which version of scipy would work with scikit.ml 0.0.5. >>>> >>>> Kindly let me know. I will be grateful. >>>> >>>> >>>> *Regards,* >>>> *Aijaz A.Qazi * >>>> _______________________________________________ >>>> scikit-learn mailing list >>>> scikit-learn at python.org >>>> https://mail.python.org/mailman/listinfo/scikit-learn >>>> >>> >>> _______________________________________________ >>> scikit-learn mailing list >>> scikit-learn at python.org >>> https://mail.python.org/mailman/listinfo/scikit-learn >>> >>> >> >> >> -- >> >> Fernando Marcos Wittmann >> MS Student - Energy Systems Dept. >> School of Electrical and Computer Engineering, FEEC >> University of Campinas, UNICAMP, Brazil >> +55 (19) 987-211302 >> >> -------------- next part -------------- An HTML attachment was scrubbed...
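[Editor's note: binary relevance, the simplest of the multi-label strategies Piotr names, just fits one independent binary model per label column. The NumPy-only sketch below illustrates the idea; the toy data and the least-squares scorer are invented stand-ins for the per-label base estimators that a wrapper like scikit-learn's OneVsRestClassifier would clone, not scikit-learn's actual implementation.]

```python
import numpy as np

def fit_binary_relevance(X, Y):
    """Binary relevance: fit one independent scorer per label column.

    The per-label 'estimator' here is a least-squares fit against the
    labels mapped to {-1, +1} -- a deliberately trivial stand-in for a
    real base classifier.
    """
    # lstsq with a 2D right-hand side solves one problem per column,
    # so column j of W scores label j independently of the others.
    W, *_ = np.linalg.lstsq(X, 2.0 * Y - 1.0, rcond=None)
    return W

def predict_binary_relevance(X, W):
    # A sample receives label j whenever its j-th score is positive.
    return (X @ W > 0).astype(int)

# Tiny invented multi-label problem: 4 samples, 2 features, 2 labels.
X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 2.0]])
Y = np.array([[1, 0], [1, 1], [0, 1], [0, 1]])

W = fit_binary_relevance(X, Y)
pred = predict_binary_relevance(X, W)
```

A classifier chain differs only in that each label's model also receives the previous labels' predictions as extra input features, which lets it capture label correlations that binary relevance ignores.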
URL: From jorisvandenbossche at gmail.com Tue Jun 26 16:32:56 2018 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 26 Jun 2018 22:32:56 +0200 Subject: [scikit-learn] New core dev: Joris Van den Bossche In-Reply-To: References: Message-ID: Thanks all! I have been really enjoying working with the scikit-learn community! Joris 2018-06-26 17:36 GMT+02:00 Andreas Mueller : > Welcome on board Joris, and thank you for all your work so far! > > > On 06/23/2018 06:42 AM, Olivier Grisel wrote: > > Hi everyone! > > Let's welcome Joris Van den Bossche (@jorisvdbossche) officially as a > scikit-learn core developer! > > Joris is one of the maintainers of the pandas project and recently > contributed many new great PRs to scikit-learn (notably the > ColumnTransformer and a refactoring of the categorical variable > preprocessing tools). > > Cheers! > > -- > Olivier > > > _______________________________________________ > scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn > > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > -------------- next part -------------- An HTML attachment was scrubbed... URL: