MinMaxScaler scales all (and only all) features in X?
Hi,

I have a mixture of table data and intermediate vectors from another model, which don't seem to scale productively. The fact that MinMaxScaler seems to do all features in X makes me wonder if/how people train with such mixed data.

The easy approaches seem to be either to scale the db data and then combine it with the vectors, or just scale the db columns in place 'by hand'. Otherwise, I might consider adding a column-list option to the API.

I suspect I'm just missing something important, since I wandered in following this purely-tabular example, which seemed good before adding ML-derived vectors:
https://www.kaggle.com/code/carlmcbrideellis/tabular-classification-with-neu...

Any advice or more-appropriate example to follow would be great.

Thanks,
Bill

--
Phobrain.com
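For reference, the "scale the db columns in place by hand" option mentioned above is roughly the following sketch; the column names are hypothetical ('load_time' and 'mouse_dist' standing in for db columns, 'p1_1' for a model-derived vector component):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical mixed frame: two tabular (db) columns plus one model-derived component.
X = pd.DataFrame({
    "load_time": [0.8, 2.4, 1.1],
    "mouse_dist": [120.0, 45.0, 300.0],
    "p1_1": [1.38, 0.98, 0.78],
})

tab_cols = ["load_time", "mouse_dist"]
# Scale only the tabular columns, leaving p1_1 untouched.
X[tab_cols] = MinMaxScaler().fit_transform(X[tab_cols])
print(X)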
Hi there,

The way to do what you describe in scikit-learn would be via the ColumnTransformer:
https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTran...

Note, however, that scikit-learn is mostly designed for multivariate statistics, and thus does not tend to individualize columns in its transformers.

Some of us are working on a related package, skrub (https://skrub-data.org), which is more focused on heterogeneous dataframes. It does not currently have something that would help you much, but we are heavily brainstorming a variety of APIs to do flexible transformations of dataframes, including easily doing what you want. The challenge is to address the variety of cases.

Hope this helps,

Gaël
--
Gael Varoquaux
Research Director, INRIA
http://gael-varoquaux.info
https://bsky.app/profile/gaelvaroquaux.bsky.social
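A minimal sketch of the ColumnTransformer approach Gaël describes, assuming a toy dataframe where 'age' is a tabular column to rescale and 'vec_0'/'vec_1' are model-derived components to leave alone (all names hypothetical):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical mixed frame: one tabular column plus two embedding components.
X = pd.DataFrame({
    "age": [23, 54, 31],
    "vec_0": [0.12, -0.80, 0.45],
    "vec_1": [1.30, 0.02, -0.70],
})

ct = ColumnTransformer(
    transformers=[("tabular", MinMaxScaler(), ["age"])],
    remainder="passthrough",   # leave the embedding columns untouched
)

Xt = ct.fit_transform(X)       # 'age' is min-max scaled; vec_0 and vec_1 pass through
print(Xt)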
ColumnTransformer

Thanks!

I was also thinking of trying TabPFN, not researched yet, in case you can comment. <peeks/> Their attribution requirement seems overboard for what I want, unless it's flat-out miraculous for the flat-footed. :-)

Some of us are working on a related package, skrub (https://skrub-data.org), which is more focused on heterogeneous dataframes. It does not currently have something that would help you much, but we are heavily brainstorming a variety of APIs to do flexible transformations of dataframes, including easily doing what you want. The challenge is to address the variety of cases.

Those are the storms we want. I'd love to know if/how/which ML tools are helping with that work, if appropriate here.

Regards,
Bill
I applied ColumnTransformer, but the results are unexpected. It could be my lack of python skill, but it seems like the value of p1_1 in the original should persist at 0,0 in the transformed?

------- pre-scale
         p1_1      p1_2      p1_3      p1_4      p2_1  ...   resp1_4   resp2_1   resp2_2   resp2_3   resp2_4
760  1.382658  1.440719  1.555705  1.120171  1.717319  ...  0.598736  0.659797  0.376331  0.403887  0.390283

------- scaled
[[0.17045455 0.04680535 0.04372197 ... 0.37633118 0.40388673 0.39028345]

Thanks,
Bill

Fingers crossed on the formatting.

column_trans = make_column_transformer(
    (MinMaxScaler(),
     ['order_in_session', 'big_stime', 'big_time', 'load_time', 'user_time',
      'user_time2', 'mouse_down_time', 'mouse_time', 'mouse_dist', 'mouse_dist2',
      'dot_count', 'mouse_dx', 'mouse_dy', 'mouse_vecx', 'mouse_vecy',
      'dot_vec_len', 'mouse_maxv', 'mouse_maxa', 'mouse_mina', 'mouse_maxj',
      'dot_max_vel', 'dot_max_acc', 'dot_max_jerk', 'dot_start_scrn',
      'dot_end_scrn', 'dot_vec_ang']),
    remainder='passthrough')

print('------- pre-scale')
print(str(X_train))
X_train = column_trans.fit_transform(X_train)
print('------- scaled')
print(str(X_train))
print('------- /scaled')

split 414 414
------- pre-scale
         p1_1      p1_2      p1_3      p1_4      p2_1  ...   resp1_4   resp2_1   resp2_2   resp2_3   resp2_4
760  1.382658  1.440719  1.555705  1.120171  1.717319  ...  0.598736  0.659797  0.376331  0.403887  0.390283
218  0.985645  0.532462  0.780601  0.687588  0.781293  ...  0.890886  1.072392  0.536962  0.715136  0.792722
603  0.783806  0.437074  0.694766  0.371121  0.995891  ...  1.055465  1.518875  1.129209  1.201864  1.476702
0    0.501352  0.253304  0.427804  0.283380  0.571035  ...  1.035323  1.621431  0.838613  1.031724  1.131344
604  1.442482  1.019641  0.798387  1.055465  1.518875  ...  2.779447  1.636363  1.212313  1.274595  1.723697
...
------- scaled
[[0.17045455 0.04680535 0.04372197 ... 0.37633118 0.40388673 0.39028345]
 [0.27272727 0.04502229 0.04204036 ... 0.53696203 0.7151355  0.7927222 ]
 [0.30681818 0.04517088 0.04456278 ... 1.1292094  1.201864   1.4767016 ]
 ...
 [0.02272727 0.04457652 0.1680213  ... 1.796316   1.939811   2.1776829 ]
 [0.55681818 0.04546805 0.04176009 ... 0.48330075 0.37375322 0.29931256]
 [0.5        0.04457652 0.04091928 ... 0.6759416  0.7517819  0.8801653 ]]

--
Phobrain.com
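The likely explanation for the reordering: ColumnTransformer puts the transformed columns first (in the order the transformers are listed) and appends the remainder='passthrough' columns after them, so p1_1 no longer sits at position 0,0. A small sketch with hypothetical toy data showing how to inspect the output order (assuming a recent scikit-learn):

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frame: p1_1 is a model-derived component, order_in_session gets scaled.
X = pd.DataFrame({
    "p1_1": [1.38, 0.98, 0.78],
    "order_in_session": [3, 7, 1],
})

ct = make_column_transformer(
    (MinMaxScaler(), ["order_in_session"]),
    remainder="passthrough",
)

Xt = ct.fit_transform(X)
# Scaled columns come first, passthrough columns follow:
print(ct.get_feature_names_out())
# -> ['minmaxscaler__order_in_session' 'remainder__p1_1']
print(Xt[0, 0])  # the scaled order_in_session value, not the original p1_1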
It turned out my elderly conda was using sklearn<1.2. After much wasted time on piecemeal-upgrade conda solve attempts, this got sklearn to 1.6.1:

$ conda update -n base -c conda-forge conda
$ conda install conda=25.1.1   # fixes to upgrade compatibility breakages
$ conda install tensorflow-gpu
$ conda install conda-forge::imbalanced-learn

Mantra:

$ python -c "import sklearn; sklearn.show_versions()"

I still don't retain column order, but the end columns match (as in the earlier output), and now I notice it thanks to column labels existing with .set_output(transform="pandas").

Thanks,
Bill

--
Phobrain.com
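A minimal sketch of the .set_output(transform="pandas") usage mentioned above (requires scikit-learn >= 1.2; column names hypothetical):

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler

X = pd.DataFrame({
    "order_in_session": [3, 7, 1],   # tabular column to scale
    "p1_1": [1.38, 0.98, 0.78],      # model-derived component, left alone
})

ct = make_column_transformer(
    (MinMaxScaler(), ["order_in_session"]),
    remainder="passthrough",
    verbose_feature_names_out=False,  # keep plain column names, no 'minmaxscaler__' prefixes
)
ct.set_output(transform="pandas")     # return a labeled DataFrame instead of a bare ndarray

Xt = ct.fit_transform(X)
print(Xt.columns.tolist())            # ['order_in_session', 'p1_1'] -- scaled columns first
print(Xt)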
Hi,

It feels like you want to use a ColumnTransformer, which can apply different preprocessing to different columns; see e.g. this example:
https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_pipeline_di...

You can use 'passthrough' for the columns you don't want to change.

Cheers,
Loïc
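For concreteness, a minimal sketch of that pattern with hypothetical column names; note that columns not listed anywhere are dropped by default, so either list them under 'passthrough' or use remainder='passthrough':

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

scale_cols = ["load_time", "mouse_dist"]   # tabular columns to rescale (hypothetical)
keep_cols = ["p1_1", "p1_2"]               # embedding columns to leave unchanged (hypothetical)

ct = ColumnTransformer(
    transformers=[
        ("scaled", MinMaxScaler(), scale_cols),
        ("kept", "passthrough", keep_cols),
    ]
)
# ct.fit_transform(X) returns scale_cols (min-max scaled) followed by keep_cols unchanged;
# any other columns in X are dropped because remainder defaults to 'drop'.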
Participants (3): Bill Ross, Gael Varoquaux, Loïc Estève