From nils106 at googlemail.com Tue Mar 3 02:49:51 2020
From: nils106 at googlemail.com (Nils Wagner)
Date: Tue, 3 Mar 2020 08:49:51 +0100
Subject: [scikit-learn] tensorflow and scikit-learn
Message-ID:
Hi All,
I am a newbie to scikit-learn. Is it possible to use scikit-learn instead
of tensorflow and keras in the attached script?
Best regards,
Nils
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import random
import math
import numpy as np

np.random.seed(1)
#
# ModuleNotFoundError: No module named 'tensorflow'
#
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

def Amplitude(omega, zeta):
    """Analytic amplitude of the driven damped oscillator"""
    A = 1/math.sqrt((1 - omega**2)**2 + (2*zeta*omega)**2)
    return A

zeta_0 = 0.1    # Damping ratio
w_min = 0.0     # Start frequency
w_max = 10.0    # End frequency
N_omega = 300   # Number of points per interval

w = np.linspace(w_min, w_max, N_omega).reshape(-1, 1)
Amplitude = np.vectorize(Amplitude)
a = Amplitude(w, zeta_0)

# Random 80/20 train/test split
rnd_indices = np.random.rand(len(w)) < 0.80
x_train = w[rnd_indices]
y_train = a[rnd_indices]
x_test = w[~rnd_indices]
y_test = a[~rnd_indices]
print(x_train)
print(x_test)
input('Press enter to continue')

# Create a model
def baseline_model():
    height = 100
    model = Sequential()
    model.add(Dense(height, input_dim=1, activation='tanh', kernel_initializer='uniform'))
    model.add(Dense(height, input_dim=height, activation='tanh', kernel_initializer='uniform'))
    model.add(Dense(height, input_dim=height, activation='tanh', kernel_initializer='uniform'))
    model.add(Dense(1, input_dim=height, activation='linear', kernel_initializer='uniform'))
    sgd = SGD(lr=0.01, momentum=0.9, nesterov=True)
    model.compile(loss='mse', optimizer=sgd)
    return model

# Training the model
model = baseline_model()
model.fit(x_train, y_train, epochs=1000, verbose=0)

plt.figure(figsize=(16, 8))
plt.rcParams["font.family"] = "arial"
plt.rcParams["font.size"] = "18"
plt.semilogy(x_test, model.predict(x_test), 'og')
plt.semilogy(x_train, model.predict(x_train), 'r')
plt.semilogy(w, a, 'b')
plt.xlabel('Driving Angular Frequency [Hz]')
plt.ylabel('Amplitude [m]')
plt.title('Oscillator Amplitude vs Driving Angular Frequency')
plt.legend(['TensorFlow Test', 'TensorFlow Training', 'Analytic Solution'])
plt.show()
From niourf at gmail.com Tue Mar 3 07:36:41 2020
From: niourf at gmail.com (Nicolas Hug)
Date: Tue, 3 Mar 2020 07:36:41 -0500
Subject: [scikit-learn] tensorflow and scikit-learn
In-Reply-To:
References:
Message-ID:
Hi Nils,
From a quick glance it looks like you're building a fully connected
multi-layer perceptron, so yes, this is possible in scikit-learn with the
neural_network module (check out the docs). The script would be quite
different though, it's not just plug and play. Also, for anything more
complex in neural nets, we would not recommend scikit-learn.
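For instance, a minimal sketch with MLPRegressor, reusing x_train/y_train
from the attached script (the layer sizes and SGD settings below are rough
analogues of the Keras model, not a tuned equivalent):

from sklearn.neural_network import MLPRegressor

# Three tanh hidden layers of 100 units, mirroring the Keras architecture
mlp = MLPRegressor(hidden_layer_sizes=(100, 100, 100), activation='tanh',
                   solver='sgd', learning_rate_init=0.01, momentum=0.9,
                   nesterovs_momentum=True, max_iter=1000)
mlp.fit(x_train, y_train.ravel())  # scikit-learn expects a 1-d target here
y_pred = mlp.predict(x_test)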
Nicolas
From adrin.jalali at gmail.com Tue Mar 3 08:19:19 2020
From: adrin.jalali at gmail.com (Adrin)
Date: Tue, 3 Mar 2020 14:19:19 +0100
Subject: [scikit-learn] tensorflow and scikit-learn
In-Reply-To:
References:
Message-ID:
skorch is another nice library for doing DL in sklearn-based
environments/workflows.
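For example, a rough sketch of wrapping a PyTorch module (this assumes torch
is installed; the net is an illustrative analogue of the Keras model in this
thread, not a drop-in):

import torch.nn as nn
from skorch import NeuralNetRegressor

# A small fully connected net, analogous to the Keras model above
module = nn.Sequential(
    nn.Linear(1, 100), nn.Tanh(),
    nn.Linear(100, 100), nn.Tanh(),
    nn.Linear(100, 100), nn.Tanh(),
    nn.Linear(100, 1),
)
net = NeuralNetRegressor(module, max_epochs=1000, lr=0.01)
# net then exposes the usual scikit-learn estimator API; note that skorch
# expects float32 inputs:
# net.fit(x_train.astype('float32'), y_train.astype('float32'))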
From joel.nothman at gmail.com Tue Mar 3 17:47:46 2020
From: joel.nothman at gmail.com (Joel Nothman)
Date: Wed, 4 Mar 2020 09:47:46 +1100
Subject: [scikit-learn] distances
Message-ID:
I noticed a comment by @amueller on Gitter re considering a project on our
distances implementations.
I think there's a lot of work that can be done in unifying distances
implementations... (though I'm not always sure of the benefit.) I thought I
would summarise some of the issues below, as I was unsure what Andy
intended.
As @jeremiedbb said, making n_jobs more effective would be beneficial.
Reducing duplication between metrics.pairwise and neighbors._dist_metrics
and k-means would be noble (especially with regard to parameters, where
scipy.spatial's mahalanobis available through sklearn.metrics does not
accept V but sklearn.neighbors does), and it would perhaps offer higher
consistency of results and efficiencies.
We also have idioms in the code like "if the metric is euclidean, use
squared=True where we only need a ranking, then take the square root", while
the neighbors metrics abstract this with an API by providing rdist and
rdist_to_dist.
There are issues about making sure that
pairwise_distances(metric='minkowski', p=2) is using the same
implementation as pairwise_distances(metric='euclidean'), etc.
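For concreteness, the expectation is something like this (a quick sketch
using the public API):

import numpy as np
from sklearn.metrics import pairwise_distances

X = np.random.RandomState(0).rand(5, 3)
d_mink = pairwise_distances(X, metric='minkowski', p=2)
d_eucl = pairwise_distances(X, metric='euclidean')
# Mathematically identical metrics, but different code paths, so
# agreement is only up to floating-point tolerance:
print(np.allclose(d_mink, d_eucl))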
We have issues with chunking and distributing computations in the case that
metric params are derived from the dataset (ideally a training set).
#16419 is a simple instance where the metric param is sample-aligned and
needs to be chunked up.
In other cases, we precompute some metric param over all the data, then
pass it to each chunk worker, using _precompute_metric_params introduced in
#12672. This is also relevant to #9555.
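For readers unfamiliar with the chunking machinery, pairwise_distances_chunked
is the public entry point (a minimal sketch):

import numpy as np
from sklearn.metrics import pairwise_distances_chunked

X = np.random.RandomState(0).rand(2000, 10)
# With a small working_memory budget (in MiB), the distance matrix is
# generated a slice of rows at a time instead of being built all at once:
for chunk in pairwise_distances_chunked(X, metric='euclidean',
                                        working_memory=1):
    print(chunk.shape)  # (n_rows_in_chunk, 2000)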
While that initial implementation in #12672 is helpful and aims to maintain
backwards compatibility, it makes some dubious choices.
Firstly, in terms of code structure, it is not a very modular approach: each
metric is handled with an if-then. Secondly, it *only* handles the chunking
case, relying on the fact that these metrics are in scipy.spatial, and have
a comparable handling of V=None and VI=None. In the Gower Distances PR
(#9555) when implementing a metric locally, rather than relying on
scipy.spatial, we needed to provide an implementation of these default
parameters both when the data is chunked and when the metric function is
called straight out.
Thirdly, its approach to training vs test data is dubious. We don't
formally label X and Y in pairwise_distances as train/test, and perhaps we
should. Maintaining backwards compat with scipy's seuclidean and
mahalanobis, our implementation stacks X and Y together if both are
provided, and then calculates their variance. This means that users may be
applying a different metric at train and at test time (if the variance of X
as train and Y as test is substantially different), which I consider a
silent error. We can either make the train/test nature of X and Y more
explicit, or we can require that data-based parameters are given explicitly
by the user and not implicitly computed. If I understand correctly,
sklearn.neighbors will not compute V or VI for you, and it must be provided
explicitly. (Requiring that the scaling of each feature be given explicitly
in Gower seems like an unnecessary burden on the user, however.)
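To make that concrete, a sketch of the explicit alternative, with the
data-based parameter computed on the training data only:

import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.RandomState(0)
X_train, X_test = rng.rand(100, 4), rng.rand(20, 4)
# Variance computed once, on the training set, and reused at test time,
# so train and test use the identical metric:
V = np.var(X_train, axis=0, ddof=1)
D = pairwise_distances(X_test, X_train, metric='seuclidean', V=V)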
Then there are issues like whether we should consistently set the diagonal
to zero in all metrics where Y=None.
In short, there are several projects in distances, and I'd support them
being considered for work.... But it's a lot of engineering, if motivated
by ML needs and consistency for users.
J
From jeremie.duboisberranger at inria.fr Wed Mar 4 13:20:43 2020
From: jeremie.duboisberranger at inria.fr (Jeremie du Boisberranger)
Date: Wed, 4 Mar 2020 19:20:43 +0100
Subject: [scikit-learn] ANN: scikit-learn 0.22.2.post1
In-Reply-To:
References:
Message-ID:
This is a minor release including a few bug fixes. Here is the full
changelog:
https://scikit-learn.org/stable/whats_new/v0.22.html#version-0-22-2
The 0.22.2.post1 release includes a packaging fix for the source distribution
but the content of the packages is otherwise identical to the content of the
wheels with the 0.22.2 version (without the .post1 suffix).
Thank you very much to all who contributed to this release!
Regards,
Jérémie, on behalf of the scikit-learn maintainer team.
From rawtevipula25 at gmail.com Thu Mar 5 10:00:44 2020
From: rawtevipula25 at gmail.com (Vipula Rawte)
Date: Thu, 5 Mar 2020 10:00:44 -0500
Subject: [scikit-learn] Getting identical mse, r2, mae for different data
Message-ID:
I am getting identical metric evaluation values for different data; I
printed the matrix shape too.
Below is a screenshot:
[image: image.png]

Regards,
Vipula Rawte
From fhjaime96 at gmail.com Thu Mar 5 10:06:44 2020
From: fhjaime96 at gmail.com (Jaime Ferrando Huertas)
Date: Thu, 5 Mar 2020 16:06:44 +0100
Subject: [scikit-learn] Getting identical mse, r2, mae for different data
In-Reply-To:
References:
Message-ID:
Can you provide the code that produces this output?
From rawtevipula25 at gmail.com Thu Mar 5 12:07:27 2020
From: rawtevipula25 at gmail.com (Vipula Rawte)
Date: Thu, 5 Mar 2020 12:07:27 -0500
Subject: [scikit-learn] Getting identical mse, r2, mae for different data
In-Reply-To:
References:
Message-ID:
import os
import sys
import csv
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import RegexpTokenizer
import re
import numpy as np
from sklearn.svm import SVR
import time
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn import metrics
import copy
from multiscorer import MultiScorer

start = time.time()

#print("metrics: ", metrics.SCORERS.keys())
mae_file = open('mae_scores.txt', 'w')
mse_file = open('mse_scores.txt', 'w')
r2_file = open('r2_scores.txt', 'w')

def tokenizer(text):
    if text:
        result = re.findall('[a-z]{2,}', text.lower())
    else:
        result = []
    return result

def tfidf_vect(X):
    vect = TfidfVectorizer(tokenizer=tokenizer, stop_words='english')
    v = vect.fit(X)
    X_vect = v.transform(X)
    return X_vect

def compute(X_vect, y):
    scorer = MultiScorer({
        'r2': (r2_score, {}),
        'mse': (mean_squared_error, {}),
        'mae': (mean_absolute_error, {})
    })
    # SVR model
    model = SVR(C=1.0, epsilon=0.2, kernel="poly")
    X_train, X_test, y_train, y_test = train_test_split(
        X_vect, y, test_size=0.33, shuffle=False, random_state=42)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # Note: the metric signature is metric(y_true, y_pred); r2_score in
    # particular is not symmetric in its arguments.
    print("mse: ", mean_squared_error(y_test, pred))
    print("mae: ", mean_absolute_error(y_test, pred))
    print("r2_score: ", r2_score(y_test, pred))
'''
# Perform 10-fold cross validation
scores = cross_val_score(model, X_vect, y, cv=10, scoring=scorer)
results = scorer.get_results()
print("len: ", X_vect.shape[0])
final_scores = []
for metric_name in results.keys():
    average_score = np.average(results[metric_name])
    print('%s : %f' % (metric_name, average_score))
    final_scores.append(average_score)
r2_file.write(str(final_scores[0]) + '\n')
mse_file.write(str(final_scores[1]) + '\n')
mae_file.write(str(final_scores[2]) + '\n')
'''
'''
df_header = ['cik_year', 'words', 'sent_words', 'roa', 'eps', 'tobinq',
'tier1_c', 'leverage', 'Z_score_c']
#10K_t+1
df1 = pd.read_csv("list_10K_next.txt", header=None, usecols=[0],
names=['cik_year'])
df21 = pd.read_csv("train_2006_2011_scaled.csv", usecols=['cik_year',
'mda_words', 'mda_sent_words', 'scaled_roa'])
df22 = pd.read_csv("test_2006_2011_scaled.csv", usecols=['cik_year',
'mda_words', 'mda_sent_words', 'scaled_roa'])
df23 = pd.read_csv("train_2007_2012_scaled.csv", usecols=['cik_year',
'mda_words', 'mda_sent_words', 'scaled_roa'])
df24 = pd.read_csv("test_2007_2012_scaled.csv", usecols=['cik_year',
'mda_words', 'mda_sent_words', 'scaled_roa'])
df2 = pd.concat([df21, df22, df23, df24])
df5 = df2.copy()
searchfor1 = df1['cik_year'].values.tolist()
df2 = df2[df2.cik_year.str.contains('|'.join(searchfor1))].reset_index()
del df2['index']
#all_perf_indicators
basepath1 = "/data/ftm/xgb_regr/ch_an_data/bank_all_perf_ind_data/"
dp11 = pd.read_csv(basepath1 + "train_2007_2012.csv")
dp12 = pd.read_csv(basepath1 + "test_2007_2012.csv")
dp1 = pd.concat([dp11, dp12])
searchfor1 = df1['cik_year'].values.tolist()
dp1 = dp1[dp1.cik_year.str.contains('|'.join(searchfor1))].reset_index()
del dp1['index']
dp1 = dp1.drop_duplicates()
df2 = pd.merge(df2, dp1)
df2 = df2.drop_duplicates()
df2['prev_cik_year'] = df2['cik_year'].apply(lambda x: x.split("_")[0] +
    "_" + str(int(x.split("_")[1]) - 1))
#8K_t
df3 = pd.read_csv("list_8K.txt", header=None, usecols=[0],
names=['cik_year'])
df41 = pd.read_csv("train_8K_2006_2011_scaled.csv", usecols=['cik_year',
'mda_words', 'mda_sent_words', 'scaled_roa'])
df42 = pd.read_csv("test_8K_2006_2011_scaled.csv", usecols=['cik_year',
'mda_words', 'mda_sent_words', 'scaled_roa'])
df43 = pd.read_csv("train_8K_2007_2012_scaled.csv", usecols=['cik_year',
'mda_words', 'mda_sent_words', 'scaled_roa'])
df44 = pd.read_csv("test_8K_2007_2012_scaled.csv", usecols=['cik_year',
'mda_words', 'mda_sent_words', 'scaled_roa'])
df4 = pd.concat([df41, df42, df43, df44])
searchfor1 = df3['cik_year'].values.tolist()
df4 = df4[df4.cik_year.str.contains('|'.join(searchfor1))].reset_index()
del df4['index']
df4 = pd.merge(df4, df2, left_on='cik_year', right_on='prev_cik_year')
df4 = df4.drop_duplicates()
df4 = df4.rename({'cik_year_x':'cik_year', 'mda_words_x':'words',
'mda_sent_words_x':'sent_words', 'scaled_roa_y': 'roa', 'eps_scaled':
'eps', 'tobinq_scaled': 'tobinq', 'tier1_c_scaled': 'tier1_c',
'leverage_scaled': 'leverage', 'Z_score_c_scaled': 'Z_score_c'}, axis=1)
df4.to_csv("8K_t.csv", columns=df_header)
#10K_t
searchfor1 = df3['cik_year'].values.tolist()
df5 = df5[df5.cik_year.str.contains('|'.join(searchfor1))].reset_index()
del df5['index']
df5 = pd.merge(df5, df2, left_on='cik_year', right_on='prev_cik_year')
df5 = df5.drop_duplicates()
df5 = df5.rename({'cik_year_x':'cik_year', 'mda_words_x':'words',
'mda_sent_words_x':'sent_words', 'scaled_roa_x': 'roa', 'eps_prev_scaled':
'eps', 'tobinq_prev_scaled': 'tobinq', 'tier1_c_prev_scaled': 'tier1_c',
'leverage_prev_scaled': 'leverage', 'Z_score_c_prev_scaled': 'Z_score_c'},
axis=1)
df5.to_csv("10K_t.csv", columns=df_header)
df2 = df2.rename({'mda_words':'words', 'mda_sent_words':'sent_words',
'scaled_roa': 'roa', 'eps_scaled': 'eps', 'tobinq_scaled': 'tobinq',
'tier1_c_scaled': 'tier1_c', 'leverage_scaled': 'leverage',
'Z_score_c_scaled': 'Z_score_c'}, axis=1)
df2.to_csv("10K_t1.csv", columns=df_header)
'''
#print("after 8K: ", len(df2), len(df4), len(df5), list(df2), list(df4),
list(df5))
'''
df_10K_t1 = pd.read_csv("10K_t1.csv")
df_10K_t = pd.read_csv("10K_t.csv")
word_type = ['words', 'sent_words']
target = ['roa', 'eps', 'tobinq', 'tier1_c', 'leverage', 'Z_score_c']
for t in target:
    print(t)
    print(df_10K_t1[t])
    print(df_10K_t[t])
for w in word_type:
    for t in target:
        print("w: ", w, "t: ", t)
        # 8K
        print("8K")
        df_8K_t = pd.read_csv("8K_t.csv")
        X_8K = df_8K_t[w].values.astype('U')
        y_8K = df_8K_t[t]
        X_vect_8K = tfidf_vect(X_8K)
        compute(X_vect_8K, y_8K)
        # 10K_t+1
        print("10K_t+1")
        df_10K_t1 = pd.read_csv("10K_t1.csv")
        X_10K1 = df_10K_t1[w].values.astype('U')
        y_10K1 = df_10K_t1[t]
        X_vect_10K1 = tfidf_vect(X_10K1)
        compute(X_vect_10K1, y_10K1)
        # 10K_t
        print("10K_t")
        df_10K_t = pd.read_csv("10K_t.csv")
        X_10K = df_10K_t[w].values.astype('U')
        y_10K = df_10K_t[t]
        X_vect_10K = tfidf_vect(X_10K)
        # 8K+10K (concat)
        print("8K+10K (concat)")
        X_vect_concat = csr_matrix(pd.concat([pd.DataFrame(X_vect_8K.todense()),
                                              pd.DataFrame(X_vect_10K1.todense())], axis=1))
        compute(X_vect_concat, y_10K1)
        # 8K+10K (sum)
        print("#8K+10K (sum)")
        X_vect_sum = pd.DataFrame(X_vect_8K.todense()).add(
            pd.DataFrame(X_vect_10K1.todense()), fill_value=0)
        compute(X_vect_sum, y_10K1)
        # changes
        print("#changes")
        X_vect_diff = pd.DataFrame(X_vect_10K1.todense()).subtract(
            pd.DataFrame(X_vect_10K.todense()), fill_value=0)
        compute(X_vect_diff, y_10K1)
mae_file.close()
mse_file.close()
r2_file.close()
'''
#df = pd.read_csv("10K_t.csv")
#v = df[df.duplicated(['words'], keep=False)]
#v = pd.concat(g for _, g in df.groupby("words"))# if len(g) > 1)
#print(v)
#print(df['words'])
w = "words"
t = "leverage"
#8K
print("8K")
df_8K_t = pd.read_csv("8K_t.csv")
X_8K = df_8K_t[w].values.astype('U')
y_8K = df_8K_t[t]
X_vect_8K = tfidf_vect(X_8K)
compute(X_vect_8K, y_8K)
print("8K", type(X_vect_8K), X_vect_8K.shape)
#10K_t+1
print("10K_t+1")
df_10K_t1 = pd.read_csv("10K_t1.csv")
X_10K1 = df_10K_t1[w].values.astype('U')
y_10K1 = df_10K_t1[t]
X_vect_10K1 = tfidf_vect(X_10K1)
compute(X_vect_10K1, y_10K1)
print("10K_t1", type(X_vect_10K1), X_vect_10K1.shape)
#10K_t
print("10K_t")
df_10K_t = pd.read_csv("10K_t.csv")
X_10K = df_10K_t[w].values.astype('U')
y_10K = df_10K_t[t]
X_vect_10K = tfidf_vect(X_10K)
compute(X_vect_10K, y_10K)
print("10K: ", type(X_vect_10K), X_vect_10K.shape)
#8K+10K (concat)
print("8K+10K (concat)")
X_vect_concat = csr_matrix(pd.concat([pd.DataFrame(X_vect_8K.todense()),
pd.DataFrame(X_vect_10K1.todense())], axis=1))
compute(X_vect_concat, y_10K1)
print("8K +10K concat: ", type(X_vect_concat), X_vect_concat.shape)
#8K+10K (sum)
print("#8K+10K (sum)")
X_vect_sum =
pd.DataFrame(X_vect_8K.todense()).add(pd.DataFrame(X_vect_10K1.todense()),
fill_value=0)
compute(X_vect_sum, y_10K1)
print("8K + 10K sum: ", type(X_vect_sum), X_vect_sum.shape)
#changes
print("#changes")
X_vect_diff =
pd.DataFrame(X_vect_10K1.todense()).subtract(pd.DataFrame(X_vect_10K.todense()),
fill_value=0)
compute(X_vect_diff, y_10K1)
print("changes: ", type(X_vect_diff), X_vect_diff.shape)
print((X_vect_10K1.todense()==X_vect_diff.todense()))
print("Total execution time: ", time.time()  start)

Regards,
Vipula Rawte
From rawtevipula25 at gmail.com Thu Mar 5 12:09:55 2020
From: rawtevipula25 at gmail.com (Vipula Rawte)
Date: Thu, 5 Mar 2020 12:09:55 -0500
Subject: [scikit-learn] Getting identical mse, r2, mae for different data
In-Reply-To:
References:
Message-ID:

Regards,
Vipula Rawte
From t3kcit at gmail.com Thu Mar 5 16:12:43 2020
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 5 Mar 2020 16:12:43 -0500
Subject: [scikit-learn] distances
In-Reply-To:
References:
Message-ID:
Thanks for a great summary of issues!
I agree there's lots to do, though I think most of the issues that you
list are quite hard and require thinking about the API pretty hard.
So they might not be super amenable to being solved by a shorter-term
project.
I was hoping there would be some more easy wins that we could get by
exploiting OpenMP better (or at all) in the distances.
Not sure if there is, though.
I wonder if having a multi-core implementation of euclidean_distances
would be useful for us, or if that's going too low-level.
From jeremie.duboisberranger at inria.fr Fri Mar 6 05:00:45 2020
From: jeremie.duboisberranger at inria.fr (Jeremie du Boisberranger)
Date: Fri, 6 Mar 2020 11:00:45 +0100
Subject: [scikit-learn] distances
In-Reply-To:
References:
Message-ID: <30c505cfc178b81c6aa5bf047baeaede@inria.fr>
Although pairwise distances are very good candidates for OpenMP-based
multithreading due to their embarrassingly parallel nature, I think
euclidean distances (from the pairwise module) is the one that will
benefit the least from it. Its implementation, using the dot trick, relies
on a level 3 BLAS routine (matrix-matrix multiplication), which will always
be better optimized, better parallelized, and have runtime CPU detection.
Side note: what really makes KMeans faster is not the fact that
euclidean distances are computed in chunks; it's that the chunked
pairwise distance matrix fits in cache, so it stays there for the
following operations on this matrix (finding labels, partially updating
centers). So it does not apply to only computing euclidean distances.
On the other hand, other metrics don't all have internal
multithreading, and probably none rely on level 3 BLAS routines.
Usually computing pairwise distances does not involve a lot of
computation and is quite fast, so parallelizing them with joblib has no
benefit due to the joblib overhead being bigger than the computations
themselves. Unless the data is big enough, but memory issues will happen
before that :) Those metrics could probably benefit from OpenMP-based
multithreading.
About going too low-level: we already have this DistanceMetric module
implementing all metrics in cython, so I'd say we're already kind of
low-level, and in that case using OpenMP would really just be adding a
'p' before 'range' :) I think a good first step could be to move this
module into metrics, where it really belongs, rework it to make it
fused-typed and sparse-friendly, and add some prange. Obviously it will
keep most of the API flaws that @jnothman exposed, but it might set up a
cleaner ground for future API changes.
In the end, whatever you choose, I'd be happy to help.
Jérémie (@jeremiedbb)
From adityaselfefficient at gmail.com Wed Mar 11 01:10:10 2020
From: adityaselfefficient at gmail.com (aditya aggarwal)
Date: Wed, 11 Mar 2020 10:40:10 +0530
Subject: [scikit-learn] Understanding max_features parameter in
 RandomForestClassifier
Message-ID:
For RandomForestClassifier in sklearn,
the max_features parameter gives the maximum number of features considered
for a split in the random forest, which is sqrt(n_features) by default. If m
is the square root of n, then the number of possible feature combinations
for building a DT is C(n, m). What if C(n, m) is less than n_estimators (the
number of decision trees in the random forest)?
*example:* For n = 7, max_features is 3, so C(n, m) is 35, meaning 35 unique
combinations of features for decision trees. Now for n_estimators = 100,
will the remaining 65 trees have repeated combinations of features? If so,
won't the trees be correlated, introducing bias in the answer?
Thanks
Aditya Aggarwal
From adityaselfefficient at gmail.com Wed Mar 11 01:22:22 2020
From: adityaselfefficient at gmail.com (aditya aggarwal)
Date: Wed, 11 Mar 2020 10:52:22 +0530
Subject: [scikit-learn] Threshold for roc_curve in binary classification
Message-ID:
Hello
I was going through the logic used to calculate the thresholds for plotting a
roc_curve. As far as I could understand, fps, tps and thresholds are
calculated in sklearn.metrics._binary_clf_curve. How are multiple values of
the threshold calculated for binary classification?
Also, what is happening in the following lines?
distinct_value_indices = np.where(np.diff(y_score))[0]
threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]
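For context, a small numeric sketch of what those two lines compute (the
scores below are made up for illustration):

import numpy as np

# Scores sorted in decreasing order, as in _binary_clf_curve
y_score = np.array([0.9, 0.8, 0.8, 0.6, 0.4, 0.4])
y_true = np.array([1, 1, 0, 1, 0, 0])
# np.diff is non-zero where consecutive scores differ, so this picks
# the last index of each run of equal scores:
distinct_value_indices = np.where(np.diff(y_score))[0]  # [0, 2, 3]
# Append the last index so the lowest score also becomes a threshold:
threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]  # [0 2 3 5]
# Each distinct score is then one candidate threshold:
print(y_score[threshold_idxs])  # [0.9 0.8 0.6 0.4]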
Thanks
From jbbrown at kuhp.kyoto-u.ac.jp Wed Mar 11 01:26:50 2020
From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.)
Date: Wed, 11 Mar 2020 14:26:50 +0900
Subject: [scikit-learn] Understanding max_features parameter in
 RandomForestClassifier
In-Reply-To:
References:
Message-ID:
Regardless of the number of features, each DT estimator is given only a
subset of the data.
Each DT estimator then uses the features to derive decision rules for the
samples it was given.
With more trees and few examples, you might get similar or identical trees,
but that is not the norm.
Pardon brevity.
J.B.
From adityaselfefficient at gmail.com Wed Mar 11 01:43:02 2020
From: adityaselfefficient at gmail.com (aditya aggarwal)
Date: Wed, 11 Mar 2020 11:13:02 +0530
Subject: [scikit-learn] Understanding max_features parameter in
 RandomForestClassifier
In-Reply-To:
References:
Message-ID:
With all the parameters set to default (especially bootstrap and
max_samples), the number of samples passed to each estimator is X.shape[0].
Doesn't that account for all the instances in the dataset, with the
calculated number of features? Then how come only a subset is given to the
estimator?
From venky.yuvy at gmail.com Wed Mar 11 04:18:27 2020
From: venky.yuvy at gmail.com (Venkatachalam N)
Date: Wed, 11 Mar 2020 13:48:27 +0530
Subject: [scikit-learn] Understanding max_features parameter in
 RandomForestClassifier
In-Reply-To:
References:
Message-ID:
Hi Aditya,
The sampling is done with replacement under the default settings.
Hence, you will get a different dataset even though you sample the same
number (`X.shape[0]`) of datapoints.
Regards,
Venkatachalam N.
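A small illustration of that point (plain NumPy; the numbers are
illustrative only):

import numpy as np

rng = np.random.RandomState(0)
n = 1000
# One bootstrap sample: n draws with replacement, as used per tree when
# bootstrap=True and max_samples is left at its default
idx = rng.randint(0, n, size=n)
print(np.unique(idx).size / n)  # ~0.63: each tree sees a different ~63% of rows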
From gianmarcofucci94 at gmail.com Mon Mar 16 05:42:44 2020
From: gianmarcofucci94 at gmail.com (Gianmarco Fucci)
Date: Mon, 16 Mar 2020 10:42:44 +0100
Subject: [scikit-learn] Study on annotation of design and implementation
 choices, and of technical debt
Message-ID:
Dear all,
As software engineering research teams at the University of Sannio (Italy)
and Eindhoven University of Technology (The Netherlands) we are interested
in investigating the protocol used by developers while they have to
annotate implementation and design choices during their normal development
activities. More specifically, we are looking at whether, where, and what
kind of annotations developers usually use, focusing on those annotations
mainly aimed at highlighting that the code is not in the right shape (e.g.,
comments annotating delayed or intended work activities such as TODO,
FIXME, hack, workaround, etc.). In the latter case, we are looking at the
content of the above annotations, as well as how developers usually behave
while evolving the code that has been previously annotated.
When answering the survey, in case your annotation practices differ across
the open source projects you contribute to, please refer to how you behave
in the projects for which you have been contacted.
Filling out the survey will take about 5 minutes.
Please note that your identity and personal data will not be disclosed,
while we plan to use the aggregated results and anonymized responses as
part of a scientific publication.
If you have any questions about the questionnaire or our research, please
do not hesitate to contact us.
You can find the survey link here:
https://forms.gle/NxdVXiZQSmQ15U4T8
Thanks and regards,
Gianmarco Fucci (gianmarcofucci94 at gmail.com)
Fiorella Zampetti (fzampetti at unisannio.it)
Alexander Serebrenik (a.serebrenik at tue.nl)
Massimiliano Di Penta (dipenta at unisannio.it)
From nelle.varoquaux at gmail.com Tue Mar 17 11:37:11 2020
From: nelle.varoquaux at gmail.com (Nelle Varoquaux)
Date: Tue, 17 Mar 2020 16:37:11 +0100
Subject: [scikit-learn] Announcing the 2020 John Hunter Excellence in
 Plotting Contest
Message-ID:
Dear all,
I apologize for the cross-posting.
In memory of John Hunter, we are pleased to announce the John Hunter
Excellence in Plotting Contest for 2020. This open competition aims to
highlight the importance of data visualization to scientific progress and
showcase the capabilities of open source software.
Participants are invited to submit scientific plots to be judged by a
panel. The winning entries will be announced and displayed at SciPy 2020 or
announced on the John Hunter Excellence in Plotting Contest website and
YouTube channel.
John Hunter's family are graciously sponsoring cash prizes for the winners
in the following amounts:
- 1st prize: $1000
- 2nd prize: $750
- 3rd prize: $500
- Entries must be submitted by June 1st to the form at
  https://forms.gle/SrexmkDwiAmDc7ej7
- Winners will be announced at SciPy 2020 in Austin, TX or publicly on the
  John Hunter Excellence in Plotting Contest website and YouTube channel
- Participants do not need to attend the SciPy conference.
- Entries may take the definition of "visualization" rather broadly.
  Entries may be, for example, a traditional printed plot, an interactive
  visualization for the web, a dashboard, or an animation.
- Source code for the plot must be provided, in the form of Python code
  and/or a Jupyter notebook, along with a rendering of the plot in a widely
  used format. The rendering may be, for example, PDF for print, standalone
  HTML and JavaScript for an interactive plot, or MPEG-4 for a video. If the
  original data cannot be shared for reasons of size or licensing, "fake"
  data may be substituted, along with an image of the plot using real data.
- Each entry must include a 300-500 word abstract describing the plot and
  its importance for a general scientific audience.
- Entries will be judged on their clarity, innovation and aesthetics, but
  most importantly for their effectiveness in communicating a real-world
  problem. Entrants are encouraged to submit plots that were used during the
  course of research or work, rather than merely being hypothetical.
- SciPy and the John Hunter Excellence in Plotting Contest organizers
  reserve the right to display any and all entries, whether prize-winning or
  not, at the conference, use in any materials or on its website, with
  attribution to the original author(s).
- Past entries can be found at https://jhepc.github.io/
- Questions regarding the contest can be sent to jhepc.organizers at gmail.com
John Hunter Excellence in Plotting Contest Co-Chairs
Madicken Munk
Nelle Varoquaux
From jbc.develop at gmail.com Wed Mar 18 17:42:18 2020
From: jbc.develop at gmail.com (Juan BC)
Date: Wed, 18 Mar 2020 18:42:18 0300
Subject: [scikit-learn] The Coronavirus Tech Handbook
Message-ID:
Sorry for the off-topic post.
https://coronavirustechhandbook.com/ <<<< The Coronavirus Tech Handbook
provides a space for technologists, specialists, civic organisations and
public & private institutions to collaborate on a rapid and sophisticated
response to the coronavirus outbreak. It is a dynamic resource with many
hundreds of contributors that is evolving very quickly.

Juan B Cabral
From gk68118 at gmail.com Thu Mar 19 02:11:49 2020
From: gk68118 at gmail.com (Praneet Singh)
Date: Thu, 19 Mar 2020 11:41:49 +0530
Subject: [scikit-learn] transfer learning doubt
Message-ID:
I am training an SGDClassifier with a training dataset which is
temporary and will be lost after some time. So I am planning to save the
model in a pickle file, reuse it, and train it again with another dataset
when it arrives. But it forgets the previously learned data.
As far as I have researched on Google, a tensorflow model allows transfer
learning without forgetting the previous learning, but is there any other
way to achieve this with an sklearn model?
Any help would be appreciated.
From fad469 at uregina.ca Thu Mar 19 09:19:38 2020
From: fad469 at uregina.ca (Farzana Anowar)
Date: Thu, 19 Mar 2020 07:19:38 -0600
Subject: [scikit-learn] transfer learning doubt
In-Reply-To:
References:
Message-ID:
Did you use an incremental estimator and partial_fit? If not, try to use
them. They should work.
Another option is to use deep learning and store the weights for the
first model, initialize the second model with those weights, and keep
doing that for the rest of the models.

Best Regards,
Farzana Anowar,
PhD Candidate
Department of Computer Science
University of Regina
From rth.yurchak at gmail.com Thu Mar 19 10:06:37 2020
From: rth.yurchak at gmail.com (Roman Yurchak)
Date: Thu, 19 Mar 2020 15:06:37 +0100
Subject: [scikit-learn] transfer learning doubt
In-Reply-To:
References:
Message-ID: <6ac576551ac534ea416b6c65d641ba7b@gmail.com>
On 19/03/2020 14:19, Farzana Anowar wrote:
> Another option is to use deep learning and store the weights for the
> first model, initialize the second model with those weights, and keep
> doing that for the rest of the models.
This can also be done in scikit-learn with models that support the
warm_start=True init parameter (including SGDClassifier).
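A minimal sketch of both options mentioned in this thread (the batches
below are synthetic, for illustration):

import pickle
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X1, y1 = rng.rand(100, 5), rng.randint(0, 2, 100)  # first batch
X2, y2 = rng.rand(100, 5), rng.randint(0, 2, 100)  # arrives later

# Option 1: incremental learning; pass the full set of classes up front
clf = SGDClassifier()
clf.partial_fit(X1, y1, classes=np.array([0, 1]))
blob = pickle.dumps(clf)      # persist between sessions
clf = pickle.loads(blob)      # restore later
clf.partial_fit(X2, y2)       # continues from the previous state

# Option 2: warm_start=True makes repeated fit() calls start from the
# previously learned coefficients instead of re-initializing them
clf2 = SGDClassifier(warm_start=True)
clf2.fit(X1, y1)
clf2.fit(X2, y2)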
Roman
From krallinger.martin at gmail.com Thu Mar 19 12:21:36 2020
From: krallinger.martin at gmail.com (Martin Krallinger)
Date: Thu, 19 Mar 2020 17:21:36 +0100
Subject: [scikit-learn] Final CFP CodiEsp: Clinical Case Coding Task
 (eHealth CLEF 2020)
In-Reply-To:
References:
Message-ID:
*** Call for Participation - CodiEsp: Clinical Case Coding Task (eHealth
CLEF 2020) ***
*CodiEsp (eHealth CLEF - Multilingual Information Extraction) Shared Task on
automatic assignment of ICD10 codes (procedures, diagnosis) track at CLEF
2020*
http://temu.bsc.es/codiesp
Plan TL Award for the CodiEsp Track
The CodiEsp sub-tracks:
*1. CodiEsp Diagnosis Coding* subtask *(CodiEsp-D)*: will require automatic
ICD10-CM [CIE10 Diagnóstico] code assignment to each clinical case document.
*2. CodiEsp Procedure Coding* subtask *(CodiEsp-P)*: will require automatic
ICD10-PCS [CIE10 Procedimiento] code assignment to each clinical case
document.
*3. CodiEsp Explainable AI* exploratory subtask *(CodiEsp-X)*: systems are
required to extract the evidence text supporting the predicted codes (both
ICD10-CM and ICD10-PCS).
*Task description*
Clinical coding essentially requires the transformation (or classification)
of medical texts into a structured or coded format using internationally
recognized class codes.
These codes describe a patient's diagnosis or treatment. Clinical coding is
critical for standardizing electronic clinical records; it enables aetiology
studies, monitoring of health trends, epidemiology studies, clinical and
biomedical research, and assists clinical decision-making and even
reimbursement.
As part of the eHealth CLEF (http://clefehealth.org) Multilingual
Information Extraction Shared Task, we organize *CodiEsp: Clinical Case
Coding Task* (http://temu.bsc.es/codiesp). The CodiEsp task will address the
automatic extraction and assignment of clinical codes (diagnosis and
procedures) to clinical case documents in Spanish.
To enable participation of researchers around the world, in addition to the
basic data in Spanish, we will also publish versions of the training,
development, and test set *automatically translated into English*.
Participating systems will be asked to automatically assign ICD10 codes (or
CIE10, in Spanish) to clinical case documents. Evaluation is done through
comparison to manually assigned ICD10 codes.
*Publications and workshop*
As in previous eHealth CLEF efforts, there will be an *evaluation workshop
allocated at CLEF 2020* where participating teams can present their systems
and results. Moreover, participating teams will be invited to submit their
system description papers for publication in the *CLEF 2020 Working Notes
proceedings*. For previous working notes see: http://ceur-ws.org/Vol-2125/
*CodiEsp awards*
There will be three awards for the top-scoring teams promoted by the
Spanish Plan for the Advancement of Language Technology (Plan TL) and the
Barcelona Supercomputing Center (BSC).
*Participation and useful info*
1. CodiEsp web, info & detailed description: http://temu.bsc.es/codiesp/
2. Registration for CodiEsp (Multilingual Information Extraction eHealth
   track): http://temu.bsc.es/codiesp/index.php/registration/
3. Datasets: https://zenodo.org/record/3693570
4. Additional training resources: https://doi.org/10.5281/zenodo.3606662
*Main CodiEsp Track organizers*
- *Martin Krallinger*, Barcelona Supercomputing Center.
- *Antonio Miranda*, Barcelona Supercomputing Center.
- *Aitor Gonzalez-Agirre*, Barcelona Supercomputing Center.
- *Marta Villegas*, Barcelona Supercomputing Center.
- *Jordi Armengol*, Barcelona Supercomputing Center.
*Important Dates*
- Jan 13: Training and development set release
- March 2: Test and background set release
- May 3: End of evaluation
- May 5: Results notified
- May 24: Paper submission
- Jun 28: Camera-ready paper submission
- Sep 22-25: CLEF 2020 Conference (Thessaloniki, Greece)
From MC_George123 at hotmail.com Wed Mar 25 22:16:03 2020
From: MC_George123 at hotmail.com (MC_George123 at hotmail.com)
Date: Thu, 26 Mar 2020 02:16:03 +0000
Subject: [scikit-learn] A basic question about k-means algorithms elkan and
 lloyd
Message-ID:
Hi admins,
My team is working on optimizing scikit-learn code now. When it comes to k-means, I find there are two algorithms: one is lloyd, and the other is elkan, which optimizes lloyd using the triangle inequality. In older versions of scikit-learn, elkan only supported dense datasets, not sparse ones. In the latest version, elkan supports both types of dataset. So my question is why both algorithms are kept in k-means, since they do almost the same thing and elkan is an optimization of lloyd. Is there any precision difference between the two algorithms, and how can I decide which one to use?
Best regards,
George Fan
From alexandre.gramfort at inria.fr Thu Mar 26 03:40:15 2020
From: alexandre.gramfort at inria.fr (Alexandre Gramfort)
Date: Thu, 26 Mar 2020 08:40:15 +0100
Subject: [scikit-learn] A basic question about k-means algorithms elkan
 and lloyd
In-Reply-To:
References:
Message-ID:
hi,
I suspect Elkan really wins when you have many centroids,
so the conclusion is not systematic
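Both variants are exposed through the algorithm parameter, so they are easy
to benchmark on your own data (a small sketch; in the 0.22.x releases the
Lloyd variant is spelled "full", later releases rename it "lloyd"):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(10000, 10)
for algo in ("full", "elkan"):  # "full" is plain Lloyd here
    km = KMeans(n_clusters=50, algorithm=algo, random_state=0).fit(X)
    # Both optimize the same objective; results agree up to
    # floating-point differences, only the runtime differs:
    print(algo, km.inertia_)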
my 2c
Alex
From niourf at gmail.com Thu Mar 26 15:59:25 2020
From: niourf at gmail.com (Nicolas Hug)
Date: Thu, 26 Mar 2020 15:59:25 -0400
Subject: [scikit-learn] Monthly meetings
Message-ID: <080445a5123026c2b58203c760d1f80e@gmail.com>
Hi all,
The next scikit-learn monthly meeting will take place on Monday
(https://www.timeanddate.com/worldclock/meetingdetails.html?year=2020&month=3&day=30&hour=11&min=0&sec=0&p1=240&p2=33&p3=37&p4=179&p5=195
)
While these meetings are mainly for core devs to discuss the current
topics, we're also happy to welcome non-core devs and other projects'
maintainers! Feel free to join.
*Location:*
Join Zoom Meeting
https://anaconda.zoom.us/j/947129165?pwd=dEFZNHM0ZFBiQWlDYlJlRW1EaHg2QT09
Meeting ID: 947 129 165 Password: 586745
Thanks,
Nicolas
From t3kcit at gmail.com Fri Mar 27 12:32:39 2020
From: t3kcit at gmail.com (Andreas Mueller)
Date: Fri, 27 Mar 2020 12:32:39 -0400
Subject: [scikit-learn] Analysis of sklearn and other python libraries on
 github by MS team
Message-ID: <60bf621118f9740803daa5157c754145@gmail.com>
Hey all.
There's a pretty cool paper by a team at MS that analyses public github
repos for their use of sklearn and related libraries:
https://arxiv.org/abs/1912.09536
Thought it might be of interest.
Cheers,
Andy
From t3kcit at gmail.com Fri Mar 27 12:36:52 2020
From: t3kcit at gmail.com (Andreas Mueller)
Date: Fri, 27 Mar 2020 12:36:52 -0400
Subject: [scikit-learn] A basic question about k-means algorithms elkan
 and lloyd
In-Reply-To:
References:
Message-ID:
There's an interesting analysis in this paper:
Fast K-Means with Accurate Bounds
http://proceedings.mlr.press/v48/newling16.pdf
From rth.yurchak at gmail.com Fri Mar 27 13:10:28 2020
From: rth.yurchak at gmail.com (Roman Yurchak)
Date: Fri, 27 Mar 2020 18:10:28 +0100
Subject: [scikitlearn] Analysis of sklearn and other python libraries
on github by MS team
InReplyTo: <60bf621118f9740803daa5157c754145@gmail.com>
References: <60bf621118f9740803daa5157c754145@gmail.com>
MessageID:
Very interesting! A few comments,
> From GH17, we managed to extract only 10.5k pipelines. The
relatively low frequency (with respect to the number of notebooks using
SCIKIT-LEARN [..]) indicates a non-wide adoption of this specification.
However, the number of pipelines in the GH19 corpus is 132k pipelines
(i.e., an increase of 13x [..] since 2017).
It's nice to see that pipelines are indeed widely used.
> Top-5 transformers [from imports] in GH19 are StandardScaler,
CountVectorizer, TfidfTransformer, PolynomialFeatures, TfidfVectorizer
(in this order). The results for GH17 are the same, except that PCA
appears instead of TfidfVectorizer.
Hmm, I would have expected OneHotEncoder somewhere at the top and much
less text processing. If there is real usage of CountVectorizer and
TfidfTransformer separately, then maybe deprecating TfidfVectorizer
could be done: https://github.com/scikit-learn/scikit-learn/issues/14951
This ranking looks quite unexpected, though. I wonder if they have the
full list and not just the top-5.
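For context, the TfidfVectorizer docstring itself describes the class as
equivalent to CountVectorizer followed by TfidfTransformer, which a small
sketch can confirm on toy data:

import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)
from sklearn.pipeline import make_pipeline

docs = ['the cat sat', 'the cat sat on the mat', 'dogs and cats']
combined = TfidfVectorizer().fit_transform(docs)
expanded = make_pipeline(CountVectorizer(),
                         TfidfTransformer()).fit_transform(docs)
print(np.allclose(combined.toarray(), expanded.toarray()))  # True

So separate imports of the two classes may just be the expanded spelling
of the same computation.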
> Regarding learners, Top-5 in both GH17 and GH19 are
LogisticRegression, MultinomialNB, SVC, LinearRegression, and
RandomForestClassifier (in this order).
Maybe the LinearRegression docstring should more strongly suggest using
Ridge with small regularization in practice.
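Concretely, the suggestion would look something like this; the alpha below
is an arbitrary illustration, not a recommended default:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=200, n_features=10, noise=1.0,
                       random_state=0)

ols = LinearRegression().fit(X, y)
# A small L2 penalty barely moves a well-conditioned problem but keeps
# the solution stable when columns are (nearly) collinear.
ridge = Ridge(alpha=1e-3).fit(X, y)
print(abs(ols.coef_ - ridge.coef_).max())  # tiny difference on clean data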

Roman
On 27/03/2020 17:32, Andreas Mueller wrote:
> Hey all.
> There's a pretty cool paper by a team at MS that analyses public GitHub
> repos for their use of sklearn and related libraries:
> https://arxiv.org/abs/1912.09536
>
> Thought it might be of interest.
>
> Cheers,
> Andy
From gael.varoquaux at normalesup.org Fri Mar 27 18:20:17 2020
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Fri, 27 Mar 2020 23:20:17 +0100
Subject: [scikitlearn] Analysis of sklearn and other python libraries
on github by MS team
InReplyTo:
References: <60bf621118f9740803daa5157c754145@gmail.com>
MessageID: <20200327222017.fv7jgxrbulntgmbm@phare.normalesup.org>
Thanks for the link Andy. This is indeed very interesting!
On Fri, Mar 27, 2020 at 06:10:28PM +0100, Roman Yurchak wrote:
> > Regarding learners, Top-5 in both GH17 and GH19 are LogisticRegression,
> > MultinomialNB, SVC, LinearRegression, and RandomForestClassifier (in this
> > order).
> Maybe the LinearRegression docstring should more strongly suggest using
> Ridge with small regularization in practice.
Yes! I actually wonder if we should not remove LinearRegression. It
frightens me a bit that so many people use it. The only time that I've
seen it used in a scientific paper, it was a mistake and it shouldn't
have been used.
I seldom advocate for deprecating :).
G
From pedro.cardoso.code at gmail.com Sun Mar 29 13:21:21 2020
From: pedro.cardoso.code at gmail.com (Pedro Cardoso)
Date: Sun, 29 Mar 2020 18:21:21 +0100
Subject: [scikitlearn] [GridSearchCV] Reduction of elapsed time at the
second iteration
MessageID:
Hello fellows,
I am new to sklearn and I have a question about GridSearchCV:
I am running the following code in a Jupyter notebook:
*code*
opt_models = dict()
for feature in [features1, features2, features3, features4]:
    cmb = CMB(x_train, y_train, x_test, y_test, feature)
    cmb.fit()
    cmb.predict()
    opt_models[str(feature)] = cmb.get_best_model()

The CMB class is just a class that contains different classification models
(SVC, decision tree, etc.). When cmb.fit() runs, a GridSearchCV is
performed on the SVC model (which is within the cmb instance) in order to
tune the hyperparameters C, gamma, and kernel. The SVC model is implemented
using the sklearn.svm.SVC class. Here is the output of the first and second
iterations of the for loop:
*output*
> 1st iteration
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[Parallel(n_jobs=1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 tasks  elapsed: 6.1s
[Parallel(n_jobs=1)]: Done 2 tasks  elapsed: 6.1s
[Parallel(n_jobs=1)]: Done 3 tasks  elapsed: 6.1s
[Parallel(n_jobs=1)]: Done 4 tasks  elapsed: 6.2s
[Parallel(n_jobs=1)]: Done 5 tasks  elapsed: 6.2s
[Parallel(n_jobs=1)]: Done 6 tasks  elapsed: 6.2s
[Parallel(n_jobs=1)]: Done 7 tasks  elapsed: 6.2s
[Parallel(n_jobs=1)]: Done 8 tasks  elapsed: 6.2s
[Parallel(n_jobs=1)]: Done 9 tasks  elapsed: 6.2s
[Parallel(n_jobs=1)]: Done 10 tasks  elapsed: 6.2s
[Parallel(n_jobs=1)]: Done 11 tasks  elapsed: 6.2s
[Parallel(n_jobs=1)]: Done 12 tasks  elapsed: 6.3s
[Parallel(n_jobs=1)]: Done 13 tasks  elapsed: 6.3s
[Parallel(n_jobs=1)]: Done 14 tasks  elapsed: 6.3s
[Parallel(n_jobs=1)]: Done 15 tasks  elapsed: 6.4s
[Parallel(n_jobs=1)]: Done 16 tasks  elapsed: 6.4s
[Parallel(n_jobs=1)]: Done 17 tasks  elapsed: 6.4s
[Parallel(n_jobs=1)]: Done 18 tasks  elapsed: 6.4s
[Parallel(n_jobs=1)]: Done 19 tasks  elapsed: 6.5s
[Parallel(n_jobs=1)]: Done 20 tasks  elapsed: 6.5s
[Parallel(n_jobs=1)]: Done 21 tasks  elapsed: 6.5s
[Parallel(n_jobs=1)]: Done 22 tasks  elapsed: 6.6s
[Parallel(n_jobs=1)]: Done 23 tasks  elapsed: 6.7s
[Parallel(n_jobs=1)]: Done 24 tasks  elapsed: 6.7s
[Parallel(n_jobs=1)]: Done 25 tasks  elapsed: 6.7s
[Parallel(n_jobs=1)]: Done 26 tasks  elapsed: 6.8s
[Parallel(n_jobs=1)]: Done 27 tasks  elapsed: 6.8s
[Parallel(n_jobs=1)]: Done 28 tasks  elapsed: 6.9s
[Parallel(n_jobs=1)]: Done 29 tasks  elapsed: 6.9s
[Parallel(n_jobs=1)]: Done 30 tasks  elapsed: 6.9s
[Parallel(n_jobs=1)]: Done 31 tasks  elapsed: 7.0s
[Parallel(n_jobs=1)]: Done 32 tasks  elapsed: 7.0s
[Parallel(n_jobs=1)]: Done 33 tasks  elapsed: 7.0s
[Parallel(n_jobs=1)]: Done 34 tasks  elapsed: 7.0s
[Parallel(n_jobs=1)]: Done 35 tasks  elapsed: 7.1s
[Parallel(n_jobs=1)]: Done 36 tasks  elapsed: 7.1s
[Parallel(n_jobs=1)]: Done 37 tasks  elapsed: 7.2s
[Parallel(n_jobs=1)]: Done 38 tasks  elapsed: 7.2s
[Parallel(n_jobs=1)]: Done 39 tasks  elapsed: 7.2s
[Parallel(n_jobs=1)]: Done 40 tasks  elapsed: 7.2s
[Parallel(n_jobs=1)]: Done 41 tasks  elapsed: 7.3s
[Parallel(n_jobs=1)]: Done 42 tasks  elapsed: 7.3s
[Parallel(n_jobs=1)]: Done 43 tasks  elapsed: 7.3s
[Parallel(n_jobs=1)]: Done 44 tasks  elapsed: 7.4s
[Parallel(n_jobs=1)]: Done 45 tasks  elapsed: 7.4s
[Parallel(n_jobs=1)]: Done 46 tasks  elapsed: 7.5s
> 2nd iteration
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[Parallel(n_jobs=1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=1)]: Done 1 tasks  elapsed: 0.0s
[Parallel(n_jobs=1)]: Batch computation too fast (0.0260s.) Setting
batch_size=14.
[Parallel(n_jobs=1)]: Done 2 tasks  elapsed: 0.0s
[Parallel(n_jobs=1)]: Done 3 tasks  elapsed: 0.0s
[Parallel(n_jobs=1)]: Done 4 tasks  elapsed: 0.0s
[Parallel(n_jobs=1)]: Done 5 tasks  elapsed: 0.0s
[Parallel(n_jobs=1)]: Done 60 out of 60  elapsed: 0.7s finished

As you can see, the first iteration has a much larger elapsed time than
the 2nd iteration. Does that make sense? I am afraid that the model is
doing some kind of caching or shortcut from the 1st iteration, which could
degrade the model training/performance. I already read the sklearn
documentation and I didn't see any warning/note about this kind of
behaviour.
Thank you very much for your time :)
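One way to rule out cross-iteration caching is to time two independent
searches built from scratch; the estimator and grid below are illustrative
stand-ins for the SVC search inside CMB, not the actual class:

import time
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.1],
              'kernel': ['rbf', 'linear']}
search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1, verbose=1)

for run in (1, 2):
    fresh = clone(search)  # clone() returns an unfitted copy of the estimator
    tic = time.perf_counter()
    fresh.fit(X, y)
    print('run %d took %.2fs' % (run, time.perf_counter() - tic))

If the second run is still much faster, the difference is overhead rather
than fitted state: the first Parallel call pays the loky worker start-up
cost and joblib keeps those workers alive for later calls, which would
explain the speed-up without affecting the trained models.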
From MC_George123 at hotmail.com Mon Mar 30 03:33:08 2020
From: MC_George123 at hotmail.com (樊 书华)
Date: Mon, 30 Mar 2020 07:33:08 +0000
Subject: [scikitlearn] A basic question about kmeans algorithms elkan
and lloyd
InReplyTo:
References:
MessageID:
Hi,
Thanks for your suggestion of the paper. However, the paper covers many more algorithms and finds that different algorithms perform differently on datasets of various dimensions, with the Lloyd algorithm not included. What I want to know is whether we can remove the Lloyd algorithm from k-means in scikit-learn, since Elkan is an optimized version with better performance.
Best regards,
George
From: scikitlearn On Behalf Of Andreas Mueller
Sent: Saturday, March 28, 2020 12:37 AM
To: scikitlearn at python.org
Subject: Re: [scikitlearn] A basic question about kmeans algorithms elkan and lloyd
There's an interesting analysis in this paper:
Fast KMeans with Accurate Bounds
http://proceedings.mlr.press/v48/newling16.pdf
From adrin.jalali at gmail.com Mon Mar 30 07:02:14 2020
From: adrin.jalali at gmail.com (Adrin)
Date: Mon, 30 Mar 2020 13:02:14 +0200
Subject: [scikitlearn] Monthly meetings
InReplyTo: <080445a5123026c2b58203c760d1f80e@gmail.com>
References: <080445a5123026c2b58203c760d1f80e@gmail.com>
MessageID:
Hi,
The new meeting ID:
https://anaconda.zoom.us/j/324780759?pwd=a1ROSFE2Nnc0cHBaeUtiVS93QnpHQT09
Meeting ID: 324 780 759
Password: 617892
On Thu, Mar 26, 2020 at 9:00 PM Nicolas Hug wrote:
> Hi all,
>
> The next scikitlearn monthly meeting will take place on Monday (
> https://www.timeanddate.com/worldclock/meetingdetails.html?year=2020&month=3&day=30&hour=11&min=0&sec=0&p1=240&p2=33&p3=37&p4=179&p5=195
>
> )
>
> [..]
From olivier.grisel at ensta.org Mon Mar 30 07:03:44 2020
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Mon, 30 Mar 2020 13:03:44 +0200
Subject: [scikitlearn] Monthly meetings
InReplyTo: <080445a5123026c2b58203c760d1f80e@gmail.com>
References: <080445a5123026c2b58203c760d1f80e@gmail.com>
MessageID:
I get an "invalid meeting ID" message.

Olivier
From t3kcit at gmail.com Mon Mar 30 10:30:09 2020
From: t3kcit at gmail.com (Andreas Mueller)
Date: Mon, 30 Mar 2020 10:30:09 0400
Subject: [scikitlearn] Analysis of sklearn and other python libraries
on github by MS team
InReplyTo: <20200327222017.fv7jgxrbulntgmbm@phare.normalesup.org>
References: <60bf621118f9740803daa5157c754145@gmail.com>
<20200327222017.fv7jgxrbulntgmbm@phare.normalesup.org>
MessageID: <272061a70edadd2c8666a4be22a40e92@gmail.com>
On 3/27/20 6:20 PM, Gael Varoquaux wrote:
> Thanks for the link Andy. This is indeed very interesting!
>
> On Fri, Mar 27, 2020 at 06:10:28PM +0100, Roman Yurchak wrote:
>>> Regarding learners, Top-5 in both GH17 and GH19 are LogisticRegression,
>>> MultinomialNB, SVC, LinearRegression, and RandomForestClassifier (in this
>>> order).
>> Maybe the LinearRegression docstring should more strongly suggest using
>> Ridge with small regularization in practice.
> Yes! I actually wonder if we should not remove LinearRegression. It
> frightens me a bit that so many people use it. The only time that I've
> seen it used in a scientific paper, it was a mistake and it shouldn't
> have been used.
>
> I seldom advocate for deprecating :).
>
People use sklearn for inference. I'm not sure we should deprecate this
use case even though it's not our primary motivation.
Also, there's an inconsistency here: LogisticRegression has an L2
penalty by default (to the annoyance of some), while LinearRegression
does not. We have discussed the meaning of the different classes for
linear models several times; they are certainly not consistent (ridge,
lasso and OLS are three separate classes for the squared loss, while all
three penalties live inside LogisticRegression for the log loss).
I think to many people, "use statsmodels" is not a satisfying answer.
I have seen people argue that linear regression or logistic regression
should throw an error on collinear data, and I think that's not in the
spirit of sklearn (even though we had this as a warning in discriminant
analysis until recently). But we should probably signal this more
clearly.
Our documentation doesn't really emphasize the prediction-vs-inference
point enough, I think.
Btw, we could also make our linear regression more stable by using the
minimum-norm solution via the SVD.
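For reference, numpy's SVD-based least-squares solver already behaves this
way; a minimal sketch on a deliberately collinear design:

import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(20, 2)
X = np.hstack([A, A[:, :1]])  # third column duplicates the first -> rank 2
y = X @ np.array([1.0, 2.0, 0.0]) + 0.1 * rng.randn(20)

# np.linalg.lstsq uses an SVD and returns the minimum-norm least-squares
# solution, so it stays well defined even though X.T @ X is singular.
coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(rank)  # 2
print(coef)  # finite; weight is split between the duplicated columns

A normal-equations solve, np.linalg.solve(X.T @ X, X.T @ y), would raise a
singular-matrix error on the same data.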
From t3kcit at gmail.com Mon Mar 30 10:35:43 2020
From: t3kcit at gmail.com (Andreas Mueller)
Date: Mon, 30 Mar 2020 10:35:43 0400
Subject: [scikitlearn] Analysis of sklearn and other python libraries
on github by MS team
InReplyTo: <272061a70edadd2c8666a4be22a40e92@gmail.com>
References: <60bf621118f9740803daa5157c754145@gmail.com>
<20200327222017.fv7jgxrbulntgmbm@phare.normalesup.org>
<272061a70edadd2c8666a4be22a40e92@gmail.com>
MessageID: <71befd2175b663702416a5b01225c492@gmail.com>
Also see https://github.com/scikit-learn/scikit-learn/issues/14268
which discusses how to make things faster *and* more stable!
From t3kcit at gmail.com Mon Mar 30 15:03:58 2020
From: t3kcit at gmail.com (Andreas Mueller)
Date: Mon, 30 Mar 2020 15:03:58 0400
Subject: [scikitlearn] A basic question about kmeans algorithms elkan
and lloyd
InReplyTo:
References:
MessageID: <1982d76e554e07703eb1e970f3a9e983@gmail.com>
Sorry, I thought it also did experiments on what they call "sta", but I
guess those are not included.
The conclusion is the same, though: different algorithms show different
performance on different datasets.
The Yinyang k-means paper has some Elkan vs. Lloyd figures:
http://proceedings.mlr.press/v37/ding15.pdf
In Table 2, in the Elkan row, cases where the speedup is <1 mean that
Elkan's algorithm is slower than Lloyd's.
Elkan's is also more memory-intensive, so you can see some missing values
where the computation couldn't be performed with Elkan's but could with
Lloyd's.
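To get a feel for that memory cost: the dominant extra allocation in
Elkan's scheme is one lower bound per (sample, centroid) pair, so the
footprint grows with n_samples * n_clusters. A back-of-the-envelope
estimate (illustrative, not a measurement of scikit-learn's exact buffers):

n_samples, n_clusters = 1_000_000, 1_000
lower_bounds_gb = n_samples * n_clusters * 8 / 1e9  # float64 entries
print('%.0f GB just for the bound matrix' % lower_bounds_gb)  # 8 GB

Lloyd's algorithm needs nothing comparable, which matches the missing
entries in those tables.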
On 3/30/20 3:33 AM, 樊 书华 wrote:
>
> Hi,
>
> Thanks for your suggestion of the paper. However, the paper covers
> many more algorithms and finds that different algorithms perform
> differently on datasets of various dimensions, with the Lloyd algorithm
> not included. What I want to know is whether we can remove the Lloyd
> algorithm from k-means in scikit-learn, since Elkan is an optimized
> version with better performance.
>
> Best regards,
>
> George
From MC_George123 at hotmail.com Tue Mar 31 03:49:45 2020
From: MC_George123 at hotmail.com (樊 书华)
Date: Tue, 31 Mar 2020 07:49:45 +0000
Subject: [scikitlearn] A basic question about kmeans algorithms elkan
and lloyd
InReplyTo: <1982d76e554e07703eb1e970f3a9e983@gmail.com>
References:
<1982d76e554e07703eb1e970f3a9e983@gmail.com>
MessageID:
Thank you very much for your information.
From benoit.presles at ubourgogne.fr Tue Mar 31 09:48:50 2020
From: benoit.presles at ubourgogne.fr (Benoît Presles)
Date: Tue, 31 Mar 2020 15:48:50 +0200
Subject: [scikitlearn] Number of informative features vs total number of
features
MessageID: <10c2473f50e3c959b9f707c2b903c840@ubourgogne.fr>
Dear sklearn users,
I did some supervised classification simulations with the
make_classification function from sklearn, increasing the number of
informative features from 1 out of 40 to 40 out of 40 (100%). I did not
generate any repeated or redundant features. I fixed the number of
classes to two and the number of clusters per class to one.
I split the dataset 100 times using the StratifiedShuffleSplit function
into two subsets: a training set and a test set (80% / 20%). I performed
a logistic regression, calculated training and testing accuracies, and
averaged the results over the 100 splits, leading to a mean training
accuracy and a mean testing accuracy.
I was expecting the accuracy score to increase as a function of the
number of informative features for both the training and the test sets.
On the contrary, I got the best training and test scores with one
informative feature. Why do I get these results?
Thanks for your help,
Best regards,
Ben
Below is the simulation code I have written:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

RANDOM_SEED = 4

n_inf = np.array([1, 5, 10, 15, 20, 25, 30, 35, 40])
mean_training_score_array = np.array([])
mean_testing_score_array = np.array([])

for n_inf_value in n_inf:
    X, y = make_classification(n_samples=2500,
                               n_features=40,
                               n_informative=n_inf_value,
                               n_redundant=0,
                               n_repeated=0,
                               n_classes=2,
                               n_clusters_per_class=1,
                               random_state=RANDOM_SEED,
                               shuffle=False)
    #
    print('Simulated data - number of informative features = ' +
          str(n_inf_value))
    #
    sss = StratifiedShuffleSplit(n_splits=100, test_size=0.2,
                                 random_state=RANDOM_SEED)
    training_score_array = np.array([])
    testing_score_array = np.array([])
    for train_index_split, test_index_split in sss.split(X, y):
        X_split_train, X_split_test = X[train_index_split], X[test_index_split]
        y_split_train, y_split_test = y[train_index_split], y[test_index_split]
        scaler = StandardScaler()
        X_split_train = scaler.fit_transform(X_split_train)
        X_split_test = scaler.transform(X_split_test)
        lr = LogisticRegression(fit_intercept=True, max_iter=1e9, verbose=0,
                                random_state=RANDOM_SEED,
                                solver='lbfgs', tol=1e-6, C=10)
        lr.fit(X_split_train, y_split_train)
        y_pred_train = lr.predict(X_split_train)
        y_pred_test = lr.predict(X_split_test)
        accuracy_train_score = accuracy_score(y_split_train, y_pred_train)
        accuracy_test_score = accuracy_score(y_split_test, y_pred_test)
        training_score_array = np.append(training_score_array,
                                         accuracy_train_score)
        testing_score_array = np.append(testing_score_array,
                                        accuracy_test_score)
    mean_training_score_array = np.append(mean_training_score_array,
                                          np.average(training_score_array))
    mean_testing_score_array = np.append(mean_testing_score_array,
                                         np.average(testing_score_array))
#
print('mean_training_score_array=' + str(mean_training_score_array))
print('mean_testing_score_array=' + str(mean_testing_score_array))
#
plt.plot(n_inf, mean_training_score_array, 'r', label='mean training score')
plt.plot(n_inf, mean_testing_score_array, 'g', label='mean testing score')
plt.xlabel('number of informative features out of 40')
plt.ylabel('accuracy')
plt.legend()
plt.show()